arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)

1. Agents, Planning, and Decision-Making (55 papers)

2606.07594 2026-06-09 cs.AI cs.HC cs.LG cs.SE New submission

Syll: Open-Source Personal Automation with Cross-Surface Execution

Bo Zhang, Borui Zhang, Chenghao Jiang, Minglei Shi, Xiaofeng Wang, Zheng Zhu, Jie Zhou, Jiwen Lu

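Affiliations * Adobe Systems Inc.; Stardew Valley; macOS; University of Science and Technology of China

AI summary Proposes Syll, an open-source multimodal agent framework that unifies API, CLI, and GUI control and supports teaching by user demonstration and auditable execution, enabling personal automation across interfaces.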

Comments Code: https://github.com/THU-SAGE/syll

Abstract

Personal AI agents must increasingly operate across APIs, shells, web surfaces, and desktop GUIs, yet many systems remain tuned to a single interface and offer limited support for user teaching and auditability. We present Syll, an open-source, self-hosted multimodal agent harness that unifies MCP/API tools, CLI execution, and visual GUI control in a modular runtime, enabling agents to coordinate computer use across heterogeneous interfaces while streamlining how users and agents exchange information. At the core of Syll is a bidirectional user-agent interaction layer: users teach procedures through direct demonstration, which Syll compiles into reusable skills; agent execution is translated back into multimodal evidence -- logs, keyframes, and approval checkpoints -- for inspection and control. Syll further externalizes memory, skills, routines, and governance as editable local artifacts, supporting straightforward inspection, extension, and downstream development. Our implementation has been validated on production desktop applications including Adobe Photoshop, Adobe Audition, Stardew Valley, macOS Finder and others. We report mechanism-oriented studies that validate multimodal routing, teachable GUI replay, and persistent local artifacts. We hope Syll can serve as a practical open-source foundation for personal automation that users can teach, inspect, and continuously extend.

2606.07904 2026-06-09 cs.AI cs.SE New submission

Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

Rahul Suresh Babu, Laxmipriya Ganesh Iyer

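AI summary Proposes Contract2Tool, a framework that infers tool contracts from metadata, documentation, and execution traces to enable causal tool filtering, sharply reducing the number of exposed tools and token usage while preserving reliability.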

Abstract

Tool-augmented large language model agents increasingly rely on external APIs, but standard tool schemas describe how to call a tool, not when the tool is causally appropriate or what task state it produces. Causal tool filtering addresses this gap by using lightweight contracts that specify each tool's preconditions, effects, risk level, and cost. However, manually writing and maintaining such contracts does not scale to large or changing tool ecosystems. We introduce Contract2Tool, a framework for inferring tool contracts from metadata, schemas, documentation, and execution traces. Contract2Tool converts observable tool evidence into normalized symbolic contracts that can be evaluated intrinsically and deployed inside downstream causal tool filtering. We evaluate learned contracts against gold preconditions, effects, and risk labels, and measure their downstream utility on multi-step agent tasks. Our results show that hybrid documentation-and-trace evidence produces contracts accurate enough to preserve most of the reliability and efficiency benefits of gold contracts. Learned-contract CMTF achieves 0.980 downstream success, close to 0.990 for gold-contract CMTF, while reducing visible tools from 100 to 1 and reducing average token usage from 26,172 to 2,528 relative to all-tools exposure. These results suggest that learned contracts can provide a scalable contract layer between tool schemas and reliable agent execution.
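The contract idea in this abstract can be illustrated with a minimal sketch. It assumes contracts are sets of symbolic facts and that filtering keeps only tools whose preconditions hold and whose effects supply a missing goal fact. The `ToolContract` fields, the tool names, and `filter_tools` are hypothetical illustrations, not the paper's actual schema or algorithm.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    """Hypothetical contract: symbolic preconditions/effects plus risk and cost."""
    name: str
    preconditions: frozenset[str]
    effects: frozenset[str]
    risk: int      # assumed scale: 0 = read-only, higher = riskier
    cost: float


def filter_tools(contracts, state, goal, max_risk=1):
    """Keep tools that are applicable now, useful for the goal, and within the risk budget."""
    missing = goal - state
    return [
        c for c in contracts
        if c.preconditions <= state      # all preconditions already hold
        and c.effects & missing          # produces at least one missing goal fact
        and c.risk <= max_risk
    ]


CONTRACTS = [
    ToolContract("search_flights", frozenset({"has_destination"}),
                 frozenset({"has_flight_options"}), risk=0, cost=1.0),
    ToolContract("book_flight", frozenset({"has_flight_options", "has_payment"}),
                 frozenset({"has_booking"}), risk=2, cost=5.0),
    ToolContract("get_weather", frozenset({"has_destination"}),
                 frozenset({"has_forecast"}), risk=0, cost=0.5),
]

state = frozenset({"has_destination"})
goal = frozenset({"has_flight_options"})
visible = filter_tools(CONTRACTS, state, goal)
print([c.name for c in visible])  # ['search_flights']
```

In this toy setting only one of the three tools is exposed to the agent, which mirrors the kind of tool-count and token reduction the abstract reports, without claiming to reproduce it.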

2606.07999 2026-06-09 cs.AI New submission

Efficient Skill Grounding via Code Refactoring with Small Language Models

Sera Choi, Wonje Choi, Saehun Chun, Daehee Lee, Jooyoung Kim, Chaeun Lee, Honguk Woo

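Affiliations * KAIST

AI summary Proposes RECENT, a framework that decouples skill semantics from execution bindings and grounds skills through code refactoring with small language models, matching the performance of large language models in dynamic environments.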

Comments Accepted to ICML 2026

Abstract

Effective skill grounding is essential for deploying reusable skills in embodied agents, as even minor embodiment or environmental differences can render an entire skill incompatible. This challenge is particularly pronounced in embodied settings, where agents must operate in dynamic, partially observable environments without access to large language models (LLMs). In this setting, reliance on LLMs is impractical, while small language models (sLMs) remain insufficient for the effective skill grounding required for reliable long-horizon control. We present RECENT, a refactoring-centric agent framework that enables efficient skill grounding with sLMs by decoupling skill semantics from embodiment- and environment-specific execution binding. By representing skills as executable code, RECENT preserves the semantic intent encoded in a skill's control structure while grounding it by modifying only execution bindings through localized refactoring, rather than regenerating code from scratch. We evaluate RECENT across diverse skill grounding scenarios spanning multiple robot embodiments in dynamic environments, demonstrating robust long-horizon performance when deployed with an sLM. Across all scenarios, RECENT achieves the best performance among sLM-based Code-as-Policies (CaP) methods and matches the task performance of LLM-based CaP.

2606.08049 2026-06-09 cs.AI cs.MA New submission

SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows

Amine El Hattami, Nicolas Chapados, Christopher Pal

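Affiliations * ServiceNow Research; Mila; Polytechnique Montréal; Canada CIFAR AI Chair

AI summary Proposes SKILL.nb, a framework that manages the lifecycle reliability of agent workflows through selective formalization and gate-conditioned execution, reaching 53.7% single-round success on WebArena-Verified and retaining 91.7% of initially successful tasks across re-executions.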

Abstract

AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified tasks, or changing task distributions, especially in web automation. We introduce SKILL.nb, a framework for governing reusable agent workflows with evidence-calibrated lifecycle policies. SKILL.nb uses selective formalization: execution evidence decides which workflow steps should become executable code, which should remain natural-language guided, and when those choices should be revised. Workflows are stored as auditable, versioned notebooks that interleave natural-language guidance, multi-language executable cells, validation gates, fallback paths, and multimodal evidence such as outputs, screenshots, and error traces. At runtime, gate-conditioned execution lets each step run code when its gates validate, or fall back locally when drift invalidates the executable realization. On WebArena-Verified, SKILL.nb achieves 53.7% single-round success, improving over the strongest baseline by 3.9 percentage points. Across three re-executions, it retains 91.7% of initially successful tasks, 15.5 points above the next best method. Under bounded repair, it recovers 72.9% of subsequent failures while limiting post-repair regressions to 4.2%, compared with 15.0% to 17.0% for persistent baselines. It also leads on Mind2Web cross-website and cross-domain splits. In a GitLab migration test, SKILL.nb preserves performance when reusing frozen state learned on GitLab 15.7, with frozen-versus-fresh target-version gaps of -1.7 points on GitLab 16.11 and +0.6 points on GitLab 18.9. These results identify lifecycle governance and gate-conditioned execution as reliability axes beyond one-shot task success.
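Gate-conditioned execution, as described in this abstract, can be sketched as a loop that runs a step's code when its validation gate passes and otherwise falls back locally. The `Step` structure, the selector names, and the fallback callables below are assumptions made for illustration; the abstract does not specify the notebook format or gate API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Env = Dict[str, Any]


@dataclass
class Step:
    """Hypothetical workflow step with a gate, a coded path, and a fallback."""
    name: str
    gate: Callable[[Env], bool]       # does the coded realization still apply?
    run_code: Callable[[Env], Env]    # formalized, executable path
    fallback: Callable[[Env], Env]    # stand-in for natural-language-guided execution


def execute(steps: List[Step], env: Env) -> Tuple[Env, List[Tuple[str, str]]]:
    """Run each step via code if its gate validates, otherwise via its fallback."""
    log: List[Tuple[str, str]] = []
    for step in steps:
        if step.gate(env):
            env = step.run_code(env)
            log.append((step.name, "code"))
        else:
            env = step.fallback(env)
            log.append((step.name, "fallback"))
    return env, log


steps = [
    Step("open_issue_form",
         gate=lambda e: "new_issue_button" in e["selectors"],
         run_code=lambda e: {**e, "form_open": True},
         fallback=lambda e: {**e, "form_open": True, "used_agent": True}),
    Step("submit",
         gate=lambda e: "submit_v1" in e["selectors"],
         run_code=lambda e: {**e, "submitted": True},
         fallback=lambda e: {**e, "submitted": True, "used_agent": True}),
]

# Simulated drift: the submit selector was renamed in a newer UI version.
env = {"selectors": {"new_issue_button", "submit_v2"}}
final_env, log = execute(steps, env)
print(log)  # [('open_issue_form', 'code'), ('submit', 'fallback')]
```

Only the step whose gate fails leaves the coded path, so drift in one place does not force the whole workflow back to free-form execution.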

2606.08106 2026-06-09 cs.AI cs.MA New submission

PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

Zayx Shawn

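Affiliations * Independent Researcher

AI summary Proposes PACE, which recasts the accept-or-reject decision for changes in self-evolving agents as a sequential hypothesis test and controls the false-commit probability with Paired Anytime-valid Commit Evaluation, substantially reducing false commits and lowering evaluation cost on several benchmarks.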

Abstract

Self-evolving agents improve by repeatedly proposing changes to their own prompts, skills, or workflows and keeping those that score higher on a small held-out set. Almost all effort has gone into the proposer that generates candidates; we argue the weak point is the acceptor, the rule that decides whether to commit a change. Applied hundreds of times against the same noisy dev estimate, the ubiquitous "keep it if the score went up" rule is uncontrolled adaptive multiple testing: the agent effectively p-hacks itself, accumulating false commits that make it churn and drift rather than improve. We recast committing as a sequential hypothesis test and propose PACE (Paired Anytime-valid Commit Evaluation), a training-free, anytime-valid commit gate. Each candidate is compared to the incumbent on identical instances and committed only when a testing-by-betting e-process accumulates decisive evidence, stopping early to save evaluations and controlling each candidate's false-commit probability at a user-set level even under optional stopping (a per-decision guarantee). On Qwen2.5 agents (0.5B-3B) self-evolving at the prompt level on GSM8K, SVAMP, and ARC-Challenge, greedy acceptance commits 30-42% false and 10-33% harmful edits when a genuine improvement is hidden among noisy proposals, while PACE commits the real one and essentially nothing else, matching greedy's held-out accuracy at sharply lower variance and about 18% lower evaluation cost. With no real gain available, greedy commits 13-21 spurious self-modifications per run (72-100% false) and degrades the most fragile agent by 4.9 points, while PACE holds at baseline. Reliability of self-evolution depends on the acceptor, not only on the proposer.
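A generic testing-by-betting gate of the kind this abstract describes can be sketched as follows. The sketch assumes per-instance scores in [0, 1] and a fixed bet size, which is a textbook construction rather than the paper's specific betting strategy; `commit_gate` and its parameters are hypothetical names. Under the null hypothesis that the candidate is no better than the incumbent on average, the wealth process is a nonnegative supermartingale, so by Ville's inequality it reaches 1/alpha with probability at most alpha, even with optional stopping.

```python
from typing import Iterable, Tuple


def commit_gate(candidate_scores: Iterable[float],
                incumbent_scores: Iterable[float],
                alpha: float = 0.05,
                bet: float = 0.5) -> Tuple[bool, int, float]:
    """Paired sequential test: commit once betting wealth reaches 1/alpha.

    Scores are assumed to lie in [0, 1], so each paired difference lies in
    [-1, 1] and every wealth factor 1 + bet * diff stays nonnegative.
    """
    if not 0.0 < bet <= 1.0:
        raise ValueError("bet must be in (0, 1]")
    wealth = 1.0
    threshold = 1.0 / alpha
    n = 0
    for cand, inc in zip(candidate_scores, incumbent_scores):
        n += 1
        diff = cand - inc              # same instance, paired comparison
        wealth *= 1.0 + bet * diff     # grows only when the candidate wins
        if wealth >= threshold:
            return True, n, wealth     # stop early and commit
    return False, n, wealth            # evidence never became decisive


# A clearly better candidate is committed after a handful of instances.
print(commit_gate([1.0] * 20, [0.0] * 20))   # (True, 8, 25.62890625)

# A candidate identical to the incumbent never accumulates evidence.
same = [1.0, 0.0, 1.0, 1.0, 0.0] * 4
print(commit_gate(same, same))               # (False, 20, 1.0)
```

The contrast with the greedy "keep it if the score went up" rule is that a single lucky evaluation cannot trigger a commit here; evidence has to compound across instances.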

2606.08234 2026-06-09 cs.AI New submission

SciTrace: Trajectory-Aware Safety Reasoning for Scientific Discovery Agents

Tanush Swaminathan, Runmin Jiang, Letian Zhang, Min Xu

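Affiliations * Carnegie Mellon University; Allen Institute

AI summary Proposes SciTrace, a framework that builds safety reasoning into every stage of the scientific agent pipeline through a Safety-Intrinsic Reasoning Loop and a Compositional Tool-Chain Verifier, achieving state-of-the-art gains in tool-call safety and adversarial robustness.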

Comments 23 pages

Abstract

LLM-based scientific agents have shown strong capacity for autonomous research, yet their safety layers remain structurally divorced from core reasoning: they inspect pipeline outputs rather than shaping the deliberation that produces them. This separation opens two failure modes: safety signals accumulated at one stage are discarded before the next, and sequences of individually benign tool calls can compose into harmful outcomes that no single-step filter detects. To address these challenges, we introduce SciTrace, a framework that weaves safety reasoning into every stage of the scientific agent pipeline. SciTrace couples two complementary mechanisms: a Safety-Intrinsic Reasoning Loop (SIR) that maintains a cumulative risk state across the Thinker, Experimenter, Writer, and Reviewer stages through joint task-and-safety deliberation, and a Compositional Tool-Chain Verifier (CTV) that performs trajectory-aware safety checks before execution, catching risks that surface only across multi-step tool sequences. Evaluated on 240 high-risk research tasks and 120 tool-related risk tasks spanning six scientific domains, SciTrace achieves state-of-the-art (SOTA) safety among compared frameworks across four backbone models: it consistently improves tool call safety and adversarial robustness while preserving scientific output quality, and it uncovers 78.8% of the compositional tool-chain escapes that single-step monitors miss. The project website is available at https://opensciagent.github.io/SciTrace/.
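The compositional failure mode in this abstract, where each tool call looks benign but the sequence does not, can be illustrated with a small pre-execution check. This is a toy sketch: the rule format, the tool names, and `verify_chain` are hypothetical, and the abstract does not say how the actual verifier represents or detects risky chains.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

# Hypothetical rule: each call is fine alone, but the ordered chain is not.
RISKY_CHAINS: List[Tuple[str, ...]] = [
    ("read_restricted_dataset", "export_file", "upload_public"),
]


def contains_subsequence(trajectory: Iterable[str], pattern: Sequence[str]) -> bool:
    """True if pattern occurs in trajectory in order (gaps allowed)."""
    it = iter(trajectory)
    return all(any(call == p for call in it) for p in pattern)


def verify_chain(executed: Sequence[str], planned_next: str,
                 risky: Sequence[Tuple[str, ...]] = RISKY_CHAINS
                 ) -> Tuple[bool, Optional[Tuple[str, ...]]]:
    """Check the whole trajectory, including the planned call, before executing it."""
    candidate = list(executed) + [planned_next]
    for pattern in risky:
        if contains_subsequence(candidate, pattern):
            return False, pattern
    return True, None


history = ["read_restricted_dataset", "plot_results", "export_file"]
print(verify_chain(history, "upload_public"))          # blocked: completes the risky chain
print(verify_chain(["export_file"], "upload_public"))  # (True, None)
```

A single-step filter that only sees `upload_public` would allow both calls; the trajectory-level check distinguishes them by what came before.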

2606.08256 2026-06-09 cs.AI cs.DL New submission

Traxia: A Framework for Verifiable, Agent-Native Scientific Publishing

Wisdom Dogah

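Affiliations * Faculty of Computing and Mathematical Sciences, University of Mines and Technology (UMaT), Tarkwa, Ghana; BlackMatrix AI Research, Accra, Ghana

AI summary Proposes the Traxia framework, which addresses verifiability, attribution, and reproducibility in scientific publishing through agent identity, verifiable publishing, four-tier peer review, a reputation mechanism, and a knowledge graph.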

Comments 22 pages, 3 figures, 3 tables. Preprint. Under active development. Comments welcome

Abstract

Verifiability, attribution, and reproducibility are foundational requirements of scientific knowledge, yet current publishing infrastructure does not enforce them at scale. We introduce Traxia, an agent-native scientific publishing framework in which AI research agents publish verifiable papers, build reputational identities, peer-review one another, and collaborate with humans in a shared provenance model. Traxia treats agents as first-class epistemic participants: every paper carries a reasoning trace, every claim a confidence interval, every agent a cryptographically signed identity, and every collaboration an immutable contribution log. We formalise five components: Agent Identity and Registry, Verifiable Publishing Layer, four-tier Peer Review Protocol, Reputation and Staking Engine, and a Knowledge Graph with contradiction detection. The framework targets reproducibility failure, provenance opacity, and exclusion of Global South research capacity. This paper presents architectural foundations and formal specifications only; it does not report empirical results. Evaluation and deeper component studies will follow in subsequent papers. A prototype partially implements core formalisms; the full system remains under active development.

2606.08405 2026-06-09 cs.AI physics.flu-dyn New submission

Self-Evolving Scientific Agent Discovers Generalizable Physically-Reasoned Fluid Control

Boai Sun, Wenjin Guo, Zongmin Yu, Liu Yang

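Affiliations * National University of Singapore

AI summary Proposes a self-evolving scientific-agent workflow driven by large language models that automatically constructs interpretable controllers through iterative code generation and diagnosis of physical simulations, and generalizes zero-shot on a target-reaching task for an underactuated two-joint dogfish swimmer.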

Abstract

While data-intensive deep reinforcement learning can optimize complex control policies, scientific discovery in physical systems fundamentally requires an interpretable chain of reasoning that connects physical evidence to structured control architectures. Here, we present a self-evolving scientific-agent workflow, driven by large language models and iterative code generation, that automates controller construction while preserving strict interpretability and rigorous physical reasoning. Instead of adjusting weights, the agent deploys candidate strategies into physical simulations, actively diagnoses dynamic behaviors from multimodal evidence, and translates these observations into progressive source-code refinements. We demonstrate this framework on a highly non-linear fluid-structure interaction problem: an underactuated, two-joint dogfish swimmer tasked with spatial target reaching using only joint angular accelerations. Starting from a propulsive seed policy that exhibits a one-sided steering bias, the agent autonomously discovers and refines a unified controller that robustly captures all canonical targets. Remarkably, without any retraining or target-specific branching, the synthesized control policy generalizes to unseen static targets and dynamically curved pursuit trajectories. The auditable evolve log reveals an emergent control architecture built upon traveling-wave propulsion, body-frame target guidance, yaw-rate feedback, signed mean-tail curvature, and adaptive cadence relief. Our results show that an autonomous scientific agent can successfully transform accumulated physical evidence into a robust, mathematically readable control policy, while maintaining a fully traceable process of scientific discovery.

2606.08552 2026-06-09 cs.AI cs.MA cs.NE physics.data-an New submission

Quantitative Promise Theory: Intentionality and Inference in Autonomous Agents

Mark Burgess

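Affiliations * ChiTek-i AS

AI summary Proposes incorporating Bayesian probability and information-theoretic optimization (including Active Inference) into promise semantics to address non-local coordination, calibration, and normalization in probabilistic computation, and treats boundary conditions as promises that constrain states and set decision thresholds, yielding a scalable definition of intent.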

Abstract

I discuss some quantitative representations of Promise Theory for processes involving autonomous agents. Agent models are common in software systems, machine learning, and biology, for example, but may also apply to physics and other forms of engineering. I describe how Bayesian probability and information theoretic optimization, including Active Inference, may be incorporated with promise semantics -- as well as how Promise Theory supplements solutions, helping to avoid probability's pitfalls, which include non-local coordination, calibrating, and normalizing probabilistic computations. The role of boundary conditions in constraining allowed states and selecting decision thresholds is a form of promise, and agent alignment provides a scalable definition of intent. Autonomous agents may congeal into swarms with superagent characteristics by trying to minimize their information, despite uncertainty that works to maximize it. The use of Promise Theory involves some research challenges as well as stylistic preferences.

2606.08596 2026-06-09 cs.AI cs.HC New submission

Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

Beiwen Zhang, Yongheng Liang, Guowei Zou, Haitao Wang, Hejun Wu

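Affiliations * Sun Yat-sen University

AI summary Proposes Co-pi-tree, which distills large language model reasoning into an executable policy tree, improving average reward in Overcooked-AI by 35.4% while cutting LLM queries by 77.7% and test-time latency by 97.1%.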

Abstract

Constructing efficient and reliable policies to assist humans is indispensable for human-AI collaboration. Existing methods mainly follow two lines of work. Most prior work relies on multi-agent reinforcement learning (MARL) to learn black-box policies, which limits interpretability and raises safety concerns. Recent methods query large language models (LLMs) at each decision step, causing slow responses and high inference costs. We propose Collaboration Policy Tree (Co-pi-tree), a closed-loop method that learns an executable policy tree consisting of a partner-behavior prediction tree and an agent-action selection tree. Co-pi-tree constructs a policy by distilling LLM reasoning into policy tree code. It then evaluates the policy through partner interaction, obtains feedback, and uses natural language to summarize the interaction feedback to improve problematic branches. Experiments in Overcooked-AI show that Co-pi-tree improves average reward by 35.4% over the baseline average, while reducing the number of LLM queries by 77.7% and test-time latency by 97.1%. Project page: https://beiwenzhang.github.io/Co-pi-tree/

2606.08735 2026-06-09 cs.AI New submission

Structure-Conditioned Actor-Critic Branches for Quality-Diversity Reinforcement Learning

Lianrong Zuo, Peilan Xu, Yong Liu, Wenjian Luo

Affiliations * School of Artificial Intelligence, Nanjing University of Information Science and Technology; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology

AI summary Proposes SV-QD-RL, a framework that builds high-quality, behaviorally diverse policy repertoires on MuJoCo tasks using structure-conditioned actor-critic branches and a branch-aware QD archive.

Abstract

Quality-diversity reinforcement learning (QD-RL) aims to construct policy repertoires that contain both high-performing and behaviorally diverse policies. Existing QD-RL methods mainly diversify policy instances after rollout evaluation or use learned value information to improve policy quality and behavior targeting, while the learning branches that generate candidate policies remain less explored. This paper proposes SV-QD-RL, a structure-value coupled framework that represents each candidate as a structure-conditioned actor-critic branch. Each branch contains an actor, a structural mask, a branch-specific critic, a replay state, and evaluation attributes including behavior, return, sparsity, and value profile. The structural mask defines the actor subspace in which the branch learns, while the branch-specific critic and replay state shape its value-learning trajectory. A branch-aware QD archive then evaluates and retains branches according to behavioral quality, structural footprint, and value-profile information. Experiments on MuJoCo continuous-control tasks show that SV-QD-RL constructs policy repertoires with strong archive quality and behaviorally useful diversity. Ablation and diagnostic analyses further indicate that structural conditioning, critic differentiation, and memory-consistent refinement make complementary contributions to behavioral specialization. Schedule-aware repertoire evaluation shows that the learned archive provides selectable policy alternatives under changing behavior-level requirements. These results suggest that coupling actor structure with branch-specific value learning is an effective mechanism for generating diverse QD-RL policy repertoires.

2606.08875 2026-06-09 cs.AI New submission

Can the Environment Speak for Itself? $T^{2}$-GRPO: A Turn-Trajectory Group Relative Policy Optimization for Caregiver Agents

Yutong Song, Jiang Wu, Pengfei Zhang, Wenjun Huang, Honghui Xu, Nikil Dutt, Amir M. Rahmani

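Affiliations * University of California, Irvine; Independent Researcher; Kennesaw State University

AI summary Proposes the T²-GRPO framework, which decouples caregiver reinforcement learning into two normalized reward horizons and enforces safety with a binary hard veto; it derives dense turn-level rewards from environment state transitions and combines them with trajectory-level evaluation to handle immediate patient feedback, long-term care outcomes, and safety constraints.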

Abstract

Optimizing large language models (LLMs) for long-horizon caregiver agents requires balancing delayed task objectives with immediate environment dynamics, such as patient distress and resistance. In dementia care, this balance is especially difficult: trajectory-level rewards are too sparse for turn-level credit assignment, while external LLM-based evaluators are costly and can misread fragmented or indirect patient responses. To address this issue, we propose Turn-Trajectory Group Relative Policy Optimization ($T^{2}$-GRPO), a framework that decouples caregiver RL into two normalized reward horizons and enforces safety through a binary hard veto. $T^{2}$-GRPO derives dense turn-level rewards directly from environment state transitions, measuring changes in patient distress and resistance from a frozen dementia patient simulator. These environment-grounded rewards are combined with trajectory-level evaluations through independent centered-rank normalization, which preserves heterogeneous reward signals and mitigates reward collapse. Extensive experiments on dementia caregivers show that $T^{2}$-GRPO outperforms competitive baselines, indicating a substantial improvement for emotionally sensitive caregiver scenarios that effectively handles immediate patient feedback, long-term care outcomes, and safety constraints.
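The reward combination in this abstract can be sketched under explicit assumptions: turn-level and trajectory-level rewards are each mapped to centered ranks in [-0.5, 0.5] within a group of rollouts, the two are summed, and an unsafe rollout has its advantages overwritten by a fixed penalty. The summation rule, the `VETO_ADVANTAGE` value, and the function names are hypothetical; the abstract names the ingredients but not their exact form.

```python
from typing import List, Sequence

VETO_ADVANTAGE = -1.0  # assumed penalty; equals the lowest value the sum can reach


def centered_ranks(values: Sequence[float]) -> List[float]:
    """Map values to evenly spaced ranks in [-0.5, 0.5]; ties broken by position."""
    n = len(values)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda k: values[k])
    ranks = [0.0] * n
    for r, k in enumerate(order):
        ranks[k] = r / (n - 1) - 0.5
    return ranks


def combine(turn_rewards: Sequence[Sequence[float]],
            traj_rewards: Sequence[float],
            unsafe: Sequence[bool]) -> List[List[float]]:
    """Per-turn advantages: independently normalized horizons, then hard veto."""
    groups, turns = len(traj_rewards), len(turn_rewards[0])
    traj_adv = centered_ranks(traj_rewards)
    adv = [[0.0] * turns for _ in range(groups)]
    for t in range(turns):
        column = centered_ranks([turn_rewards[g][t] for g in range(groups)])
        for g in range(groups):
            adv[g][t] = column[g] + traj_adv[g]
    for g in range(groups):
        if unsafe[g]:
            adv[g] = [VETO_ADVANTAGE] * turns   # safety overrides any reward
    return adv


turn_rewards = [[0.2, 0.1], [-0.3, 0.4], [0.0, 0.0]]  # e.g. drop in simulated distress
traj_rewards = [1.0, 0.5, 2.0]
unsafe = [False, False, True]
adv = combine(turn_rewards, traj_rewards, unsafe)
print(adv)  # [[0.5, 0.0], [-1.0, 0.0], [-1.0, -1.0]]
```

Because each horizon is ranked separately, a large trajectory score cannot swamp the turn-level signal, and the third rollout is penalized despite having the best trajectory reward.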

2606.08952 2026-06-09 cs.AI New submission

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

Shouwei Ruan, Bin Wang, Zhenyu Wu, Qihui Zhu, Yuxiang Zhang, Jingzhi Li, Yubin Wang, Xingxing Wei

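Affiliations * Institute of Artificial Intelligence, Beihang University; Huawei Noah’s Ark Lab; University of Science and Technology Beijing

AI summary Proposes AlloSpatial, a framework that converts egocentric observations into allocentric spatial priors via the World2Mind cognitive mapping sandbox and uses a Spatial Reasoning Harness for geometry-semantic arbitration, improving models' spatial reasoning by 5%-18% on VSI-Bench and MindCube.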

Abstract

Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees and route maps that support querying object topology, geometric relations, passability, and trajectories. To utilize these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.

2606.09071 2026-06-09 cs.AI New submission

REFLECT: Intervention-Supported Error Attribution for Silent Failures in LLM Agent Traces

Xiaofeng Lin, Yingxu Wang, Tung Sum Thomas Kwok, Daniel Guo, Sahil Arun Nale, Charles Fleming, Guang Cheng

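Affiliations * University of California, Berkeley

AI summary Proposes REFLECT, which diagnoses a candidate error step, tests it through controlled replay with a diagnosis-specific patch, and uses the verified outcome as contrastive evidence to refine the attribution, achieving the highest localization accuracy on four benchmarks.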

Abstract

Large language model (LLM) agents now solve complex tasks through long plan-and-execution traces, yet the ability to locate errors in completed traces still lags far behind, especially in the silent failure regime. Existing approaches predict suspect steps via classifiers or LLM judges, or recover correct answers via retry, but none feed the intervention outcome back to refine the attribution itself. We propose REFLECT, a method that closes this gap by diagnosing a candidate error step, testing it through controlled replay with a diagnosis-specific patch, and using the verified outcome flip as contrastive evidence to refine the final attribution. Across four localization benchmarks spanning multi-hop reasoning across domains, REFLECT achieves the highest localization accuracy among same-auditor methods across all four benchmarks, with the largest gains on structured tool-use traces, while providing actionable localization even when ground-truth answers are unavailable.
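The diagnose, patch, replay loop in this abstract can be illustrated on a toy trace. The sketch assumes a trace is a list of deterministic step functions and that a verifier is available to judge the final outcome; `replay`, `attribute`, and the patches are hypothetical stand-ins, and the method's LLM-based diagnosis and its handling of missing ground truth are not modeled.

```python
from typing import Callable, Dict, List

StepFn = Callable[[int], int]


def replay(steps: List[StepFn], x: int) -> int:
    """Re-run the trace from its initial input."""
    for step in steps:
        x = step(x)
    return x


def attribute(steps: List[StepFn], patches: Dict[int, StepFn],
              x0: int, verify: Callable[[int], bool]) -> List[int]:
    """Return candidate step indices whose patch flips the outcome from fail to pass."""
    if verify(replay(steps, x0)):
        raise ValueError("trace already passes; nothing to attribute")
    flips = []
    for idx, patch in patches.items():
        patched = steps[:idx] + [patch] + steps[idx + 1:]   # intervene on one step only
        if verify(replay(patched, x0)):
            flips.append(idx)                               # verified outcome flip
    return flips


# Toy trace: step 1 silently multiplies by 3 instead of 2.
trace = [lambda x: x + 1, lambda x: x * 3, lambda x: x - 1]
candidate_patches = {0: lambda x: x + 2, 1: lambda x: x * 2}
culprits = attribute(trace, candidate_patches, x0=2, verify=lambda y: y == 5)
print(culprits)  # [1]
```

A patch that does not change the outcome counts as evidence against that diagnosis, which is the contrastive signal the abstract refers to.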

2606.09198 2026-06-09 cs.AI 新提交

MASS: Deep Research for Social Sciences with Memory-Augmented Social Simulation

MASS:基于记忆增强社会模拟的深度社会科学研究

Yongrui Liu, Deyi Xiong

发表机构 * The International Joint Institute of Tianjin University, Fuzhou, Tianjin University, China(天津大学福州国际联合学院) TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China(天津大学计算机科学与技术学院TJUNLP实验室)

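AI summary Proposes the MASS paradigm, which makes social simulation more realistic through dynamic goal-path planning, a multi-disciplinary behavior dataset, and an Ebbinghaus-inspired forgetting mechanism, improving the insight and creativity of LLM-generated research: overall quality rises by 6.81% and Insight by 17.19%.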

详情
AI中文摘要

由大型语言模型(LLM)驱动的深度研究代理在自动论文写作任务中展现出非凡潜力。然而,现有系统严重依赖通过互联网和本地知识库进行文献检索与综合,导致社会科学研究缺乏洞察力和创造力。为解决这一问题,我们提出“记忆增强社会模拟(MASS)”,一种创新范式,利用高度逼真且面向研究的社会模拟来增强LLM生成研究的创造力和实证基础。具体而言,MASS集成了三个核心组件:具有多级社会规范约束的动态目标路径规划以引导模拟、用于代理记忆冷启动的多学科行为数据集,以及受艾宾浩斯曲线启发的结构化遗忘机制。这些共同确保了模拟的真实性,并为生成创新学术论文提供了坚实的实证基础。实验结果表明了我们方法的有效性,在生成整体质量上比基础LLM提高了6.81%,在洞察力上比强基线提高了17.19%。

英文摘要

Deep Research agents powered by Large Language Models (LLMs) have exhibited extraordinary potential in automated paper writing tasks. However, existing systems rely heavily on literature retrieval and synthesis through the internet and local knowledge bases, often resulting in social science research that lacks insight and creativity. To address this issue, we propose "Memory-Augmented Social Simulation (MASS)", an innovative paradigm that leverages highly realistic and research-oriented social simulations to enhance the creativity and empirical grounding of LLM-generated research. Specifically, MASS integrates three core components: dynamic goal-path planning with multi-level social norm restraint to guide the simulation, a multi-disciplinary behavior dataset for agent memory cold-start, and a structured forgetting mechanism inspired by the Ebbinghaus curve. Together, these ensure simulation authenticity and provide a robust empirical foundation for generating innovative scholarly papers. Experimental results demonstrate the effectiveness of our method, showing a 6.81% improvement in overall generation quality over foundation LLMs and a 17.19% gain in Insight over strong baselines.
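
The abstract does not say how the Ebbinghaus-inspired forgetting mechanism is parameterized. As a rough illustration only, the sketch below assumes the textbook exponential retention form R = exp(-t/S) and prunes memories whose retention falls below a threshold. `MemoryItem`, `strength`, and the 0.3 threshold are hypothetical choices, not taken from MASS.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    created_at: float   # simulation time when the memory was stored
    strength: float     # hypothetical stability S; larger means slower forgetting


def retention(item: MemoryItem, now: float) -> float:
    """Exponential retention R = exp(-t / S), an assumed Ebbinghaus-style form."""
    elapsed = max(0.0, now - item.created_at)
    return math.exp(-elapsed / item.strength)


def forget(memories: list[MemoryItem], now: float, threshold: float = 0.3) -> list[MemoryItem]:
    """Keep only memories whose retention is still at or above the threshold."""
    return [m for m in memories if retention(m, now) >= threshold]


memories = [
    MemoryItem("argued with a neighbor", created_at=0.0, strength=2.0),
    MemoryItem("joined the town council", created_at=0.0, strength=20.0),
]
kept = forget(memories, now=10.0)
print([m.text for m in kept])
```

With these made-up strengths, the weakly encoded memory has decayed below the threshold by time 10, while the strongly encoded one survives.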

2606.09311 2026-06-09 cs.AI 新提交

FF-JEPA: Long-Horizon Planning in World Models with Latent Planners

FF-JEPA:基于潜在规划器的世界模型中的长时域规划

Sergi Masip, Jonathan Swinnen, Yutong Hu, Renaud Detry, Tinne Tuytelaars

发表机构 * KU Leuven(鲁汶大学)

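AI summary Proposes FF-JEPA, a hierarchical method that adds an action-free latent planner to predict subgoals and decomposes complex trajectories into short-term optimization problems, addressing the high compute cost of long-horizon planning and the need for goal images.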

详情
AI中文摘要

联合嵌入预测架构(JEPAs)展示了有前景的世界建模能力,能够通过使用交叉熵方法(CEM)等方法优化动作轨迹,在潜在空间中进行规划。然而,这些方法对于长时域规划而言计算成本过高且效果不佳。此外,这些方法通常需要目标状态的显式图像,这在现实任务中并不总是可行。在这项工作中,我们通过提出Forward-Forward-JEPA(FF-JEPA)来解决这些局限性,这是一种利用两个前向动力学模型的层次化方法。除了标准的动作条件前向模型外,我们还引入了一个无动作潜在规划器,该规划器根据当前状态预测下一个子目标。这种方法消除了对目标图像的需求,并通过将复杂轨迹分解为一系列可处理的短期优化问题来实现长时域规划。在PushT上的初步结果表明,FF-JEPA成功克服了扁平世界模型的长时域崩溃,凸显了该方法作为无目标规划的一个有前景的方向。

英文摘要

Joint Embedding Predictive Architectures (JEPAs) have shown promising world modeling capabilities, enabling planning in latent space by optimizing action trajectories using methods like the Cross-Entropy Method (CEM). These methods are, however, too computationally expensive and ineffective for long-horizon planning. Furthermore, these methods typically require an explicit image of the goal state, which is not always possible in real-world tasks. In this work, we tackle these limitations by proposing Forward-Forward-JEPA (FF-JEPA), a hierarchical approach leveraging two forward dynamics models. Alongside a standard action-conditioned forward model, we introduce an action-free latent planner that predicts the next subgoal given the current state. This approach removes the need for goal images and enables long-horizon planning by decomposing complex trajectories into a sequence of tractable, short-term optimization problems. Preliminary results on PushT demonstrate that FF-JEPA successfully overcomes flat world models' long-horizon collapse, highlighting this approach as a promising direction for goal-free planning.
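
For readers unfamiliar with the Cross-Entropy Method mentioned above, here is a minimal sketch of CEM solving one short-horizon problem toward a fixed subgoal. The 1-D "dynamics" (the state is the running sum of actions), the subgoal value, and all hyperparameters are toy assumptions. FF-JEPA plans in a learned latent space, which this sketch does not model.

```python
import random
import statistics


def cem_minimize(cost, dim, iterations=30, population=64, elite_frac=0.125, seed=0):
    """Cross-Entropy Method over a diagonal Gaussian: sample, keep elites, refit."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        samples = [[rng.gauss(mean[d], std[d]) for d in range(dim)] for _ in range(population)]
        samples.sort(key=cost)          # lowest cost first
        elites = samples[:n_elite]
        for d in range(dim):
            column = [e[d] for e in elites]
            mean[d] = statistics.fmean(column)
            std[d] = statistics.pstdev(column) + 1e-6  # small floor avoids a zero std
    return mean


def rollout_cost(actions, start=0.0, subgoal=3.0):
    """Toy cost: squared distance of the final state to the subgoal plus action effort."""
    state = start
    for a in actions:
        state += a
    effort = 0.01 * sum(a * a for a in actions)
    return (state - subgoal) ** 2 + effort


plan = cem_minimize(rollout_cost, dim=4)
print(round(sum(plan), 2), round(rollout_cost(plan), 4))
```

In a hierarchical setup of the kind the abstract describes, a higher-level planner would supply the next subgoal, and a short optimization like this one would be rerun for each subgoal.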

2606.09371 2026-06-09 cs.AI 新提交

Capability-Aligned Hierarchical Learning for Tool-Augmented LLMs

面向工具增强型大语言模型的能力对齐分层学习

Haotong Yang, Ting Long, Yi Chang

发表机构 * Jilin University(吉林大学)

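AI summary Proposes CAHL, which uses RLVR to jointly optimize a high-level planner and a low-level executor, addressing planner-executor misalignment in hierarchical tool learning; effectiveness is shown on several benchmarks.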

Comments 14 pages, 5 figures, 6 tables. Preprint

详情
AI中文摘要

工具学习使大语言模型能够调用外部工具完成任务。先前研究证明了分层结构的有效性:高层策略负责全局规划并将任务分解为可管理的子任务,低层策略专注于调用工具解决这些子任务。然而,这些工作通常分别优化高层和低层策略,导致规划器与执行器不对齐,限制了LLM在工具使用任务上的性能。本文提出一种名为能力对齐分层学习(CAHL)的方法,利用RLVR联合优化两个策略,使高层规划器与低层执行器更好地对齐。在受限工具使用基准(API-Bank和BFCL)和开放环境(Bamboogle)上的实验证明了CAHL的有效性。

英文摘要

Tool learning enables LLMs to invoke external tools to accomplish tasks. Prior studies have demonstrated the effectiveness of a hierarchical structure: a high-level policy handles global planning and decomposes tasks into manageable sub-tasks, and a low-level policy focuses on invoking tools to solve these sub-tasks. However, these works typically optimize the high-level and low-level policies separately, leading to planner-executor misalignment and limiting LLM performance on tool-use tasks. In this paper, we propose a method called Capability-Aligned Hierarchical Learning (CAHL), which leverages reinforcement learning with verifiable rewards (RLVR) to jointly optimize both policies, enabling better alignment between the high-level planner and the low-level executor. Experiments on constrained tool-use benchmarks (API-Bank and BFCL) and an open-ended environment (Bamboogle) demonstrate the effectiveness of CAHL.

2606.09399 2026-06-09 cs.AI 新提交

RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour

RunAgent SuperBrowser: 基于人类浏览行为的自主网页导航理论

Radeen Mostafa, Sawradip Saha

发表机构 * RunAgent AI

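AI summary Proposes SuperBrowser, an autonomous web-navigation agent that mimics the perception-cognition-action triad of human browsing and reaches 89.47% success on the Mind2Web Hard benchmark, ahead of published open/research browser agents.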

Comments 31 pages, 8 figures, preprint/work in progress

详情
AI中文摘要

我们提出SUPERBROWSER,一个自主网页导航代理,其设计基于一个指导性假设:网页代理应该像人一样浏览。人类阅读页面时不会记住看到的每个像素;他们会看几个候选目标,决定一个,并只记住维持目标所需的信息。我们将这个感知-认知-行动三元组实现为三个耦合机制。首先,一个视觉优先的边界框管道在每个截图上标记候选交互区域,并异步预取给语言模型,使“眼睛”先于“手”。其次,一个三角色大脑——一个分类和路由的编排器、一个每几步评估进度的规划器、一个发出每步动作的工作器——将战略推理与操作推理分离。第三,一个结构化的账本只存储人类会记住的内容:目标、最近三个动作、少量事实和死胡同、以及少量检查点;一个六阶段驱逐循环系统性地从实时上下文中丢弃过时的截图、状态块和推理痕迹。动作执行是一个三层点击级联(Chrome DevTools协议到Puppeteer到脚本化),带有拟人化的贝塞尔运动,以及一个感知V形箭头的边界框捕捉器,解决“大标签旁的小箭头”歧义。在Mind2Web Hard基准(66个任务)上,SUPERBROWSER达到89.47%的成功率,总体排名第三,并以大幅优势领先所有已发表的开源/研究浏览器代理基线。我们认为,这一提升并非来自任何单一技巧,而是来自整个系统中认知契约的一致应用。

英文摘要

We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a web agent should browse the way a person browses. A human reading a page does not retain every pixel they have seen; they look at a few candidate targets, decide on one, and remember only what is needed to keep the goal alive. We operationalize this perception-cognition-action triad as three coupled mechanisms. First, a vision-first bounding-box pipeline labels candidate interactive regions on every screenshot and feeds them, asynchronously prefetched, to the language model so that the "eye" precedes the "hand". Second, a three-role brain -- an Orchestrator that classifies and routes, a Planner that evaluates progress every few steps, and a Worker that emits per-step actions -- separates strategic from operational reasoning. Third, a structured Ledger stores only what a person would: the goal, the last three actions, a small set of facts and dead-ends, and a handful of checkpoints; a six-phase eviction loop systematically discards stale screenshots, state blobs, and reasoning traces from the live context. Action execution is a three-tier click cascade (Chrome DevTools Protocol to Puppeteer to scripted) with humanized Bezier motion, plus a chevron-aware bounding-box snapper that resolves the "small arrow beside a large label" ambiguity. On the Mind2Web Hard benchmark (66 tasks), SUPERBROWSER attains 89.47% success, placing third overall and ahead of every published open/research browser-agent baseline by a large margin. We argue that the gain comes not from any single trick but from the consistent application of a cognitive contract throughout the system.
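
The structured Ledger is described only at a high level. The sketch below shows one way such a bounded memory could look, assuming a fixed-size queue for the last three actions. The `Ledger` class, its field names, and the example task are hypothetical, and the six-phase eviction loop is not modeled.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class Ledger:
    goal: str
    recent_actions: Deque[str] = field(default_factory=lambda: deque(maxlen=3))
    facts: List[str] = field(default_factory=list)
    dead_ends: List[str] = field(default_factory=list)
    checkpoints: List[str] = field(default_factory=list)

    def record_action(self, action: str) -> None:
        # A bounded deque drops the oldest action once a fourth one arrives.
        self.recent_actions.append(action)

    def render(self) -> str:
        """Compact text view of the ledger, suitable for a model prompt."""
        return "\n".join([
            f"goal: {self.goal}",
            f"recent: {list(self.recent_actions)}",
            f"facts: {self.facts}",
            f"dead_ends: {self.dead_ends}",
            f"checkpoints: {self.checkpoints}",
        ])


ledger = Ledger(goal="find the cheapest direct flight")
for action in ["open site", "click search", "type origin", "type destination", "press enter"]:
    ledger.record_action(action)
ledger.dead_ends.append("promo banner link leads off-site")
print(ledger.render())
```

Because the queue is bounded, the rendered context stays roughly constant in size no matter how many steps the agent has taken.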

2606.09447 2026-06-09 cs.AI 新提交

AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning

AliyunConsoleAgent:通过蒸馏和强化学习在真实云环境中训练Web智能体

Bojie Rong, Zheyu Shen, Qiaoping Wang, Pengfei Kang, Yang Xu, Yawen Wei, Hanyu Wu, Zhi Zhao, Leihao Pei, Linquan Jiang

发表机构 * Alibaba Cloud China(阿里云中国)

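AI summary Proposes the AliyunConsoleAgent framework: supervised fine-tuning on distilled frontier-model trajectories, followed by reinforcement learning in real cloud environments with GRPO and a dual-channel outcome reward model, automating documentation verification and approaching the success rate of frontier proprietary models at low cost.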

详情
AI中文摘要

我们提出AliyunConsoleAgent,一个用于真实云控制台自动化文档验证的Web智能体框架。主流云平台包含数百个产品,功能迭代迅速,导致控制台UI频繁与对应文档不一致。验证文档流程准确反映当前控制台并能够端到端执行,每年需要约400万次重复检查,但人工覆盖率仍低于1%。虽然基于前沿专有模型的智能体系统取得了高成功率,但其高昂成本和数据隐私限制阻碍了大规模部署。我们提出一个两阶段训练范式:首先对蒸馏的前沿模型轨迹进行监督微调,然后在真实云环境中使用组相对策略优化(GRPO)和双通道结果奖励模型进行强化学习。为了支持大规模RL训练,我们构建了一个高确定性的回滚系统,采用基于Terraform的资源预置和LLM驱动的按需置备,有效隔离环境噪声与训练信号。我们进一步引入基于后端审计日志的规则奖励评估协议,提供客观、抗奖励破解的结果判断。我们的模型从机械的指令遵循演变为具有云控制台和产品特定理解的自主决策。在一个具有挑战性的278任务基准上(最佳前沿模型仅达到65.34%成功率),AliyunConsoleAgent-32B实现了63.52%的平均成功率——相比基础模型提升20.24个百分点,与最佳前沿专有模型的差距缩小至1.82个百分点(bootstrap 95% CI [-1.27, 7.39])——而推理成本降低92%。

英文摘要

We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to frequently diverge from their corresponding documentation. Verifying that documented procedures accurately reflect the current console and can be executed end-to-end demands an estimated 4 million recurring inspections annually, yet manual coverage remains below 1%. While agent systems built on frontier proprietary models achieve high success rates, their prohibitive cost and data privacy constraints preclude large-scale deployment. We propose a two-stage training paradigm: supervised fine-tuning (SFT) on distilled frontier-model trajectories, followed by reinforcement learning using Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model in real cloud environments. To support large-scale RL training, we construct a high-determinism rollout system featuring Terraform-based resource pre-provisioning and LLM-driven on-demand provisioning, which effectively isolates environment noise from the training signal. We further introduce a rule-based reward evaluation protocol grounded in backend audit logs, providing objective, reward-hacking-resistant outcome judgment. Our model evolves from mechanical instruction following to autonomous decision-making with cloud console and product-specific understanding. Experiments on a challenging 278-task benchmark where the best frontier model achieves only 65.34% demonstrate that AliyunConsoleAgent-32B achieves a 63.52% mean success rate -- a 20.24 percentage-point improvement over the base model, narrowing the gap to the best frontier proprietary model to 1.82 pp (bootstrap 95% CI [-1.27, 7.39]) -- at 92% lower inference cost.
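
The core idea of Group Relative Policy Optimization is that each rollout is scored against the other rollouts sampled for the same task rather than against a learned value function. The sketch below shows that normalization step only. The reward values are made up, and using the population standard deviation is an assumption, since implementations differ on this detail.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical outcome rewards for four rollouts of the same console task.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# When every rollout in a group gets the same reward, no rollout is preferred.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Successful rollouts receive positive advantages and failed ones negative advantages, which is why a reliable outcome reward matters: noise in the reward shifts the group statistics directly.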

2606.09730 2026-06-09 cs.AI 新提交

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

SearchSwarm:面向长周期深度研究的代理LLM委托智能

Pu Ning, Quan Chen, Kun Tao, Xinyu Tang, Tianshu Wang, Qianggang Cao, Xinyu Kong, Zujie Wen, Zhiqiang Zhang, Jun Zhou

发表机构 * Tsinghua University(清华大学) Peking University(北京大学) Ant Group(蚂蚁集团) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院)

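AI summary Proposes SearchSwarm, which uses supervised fine-tuning to internalize task decomposition and delegation decisions into model weights, achieving the best results among models of comparable scale on BrowseComp and BrowseComp-ZH.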

详情
AI中文摘要

大型语言模型越来越需要处理复杂的、长周期的真实世界任务,这些任务的上下文需求可能无限增长,但模型上下文窗口本质上是有限的。最近的研究探索了一种范式,其中主代理分解任务并将子任务分派给子代理,子代理执行并仅返回汇总结果,从而节省主代理的上下文预算。然而,要很好地执行这一任务需要委托智能:分解复杂任务、确定何时委托以及委托什么、并将返回结果整合到持续工作流中的能力。这种能力的训练数据在自然文本中很少见,据我们所知,如何合成此类数据并训练模型获得这种能力在开源社区中仍基本未被探索。为填补这一空白,我们针对深度研究这一代表性的长周期代理任务进行了初步探索。具体来说,我们设计了一个引导工具,引导模型进行高质量的任务分解和委托,同时约束子代理正确返回结果以支持主代理的工作流。引导工具生成的轨迹自然地编码了正确的委托决策,我们将其作为监督微调数据,将委托智能内化到模型权重中。我们的模型SearchSwarm-30B-A3B在BrowseComp上达到68.1,在BrowseComp-ZH上达到73.3,在所有同规模模型中取得最佳结果。我们将发布我们的引导工具、模型权重和训练数据,以促进未来研究。

英文摘要

Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main agent decomposes tasks and dispatches subtasks to subagents, which execute and return only summarized results, conserving the main agent's context budget. However, performing this well requires delegation intelligence: the ability to decompose complex tasks, determine when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire this capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results properly to support the main agent's workflow. The harness-guided trajectories naturally encode correct delegation decisions, which we use as supervised fine-tuning data to internalize delegation intelligence into model weights. Our resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.

2602.14033 2026-06-09 cs.IT cs.AI math.IT 交叉投稿

BRAIN: Bayesian Reasoning via Active Inference for Agentic and Embodied Intelligence in Mobile Networks

BRAIN: 通过主动推理进行贝叶斯推理以实现移动网络中的智能体与具身智能

Osman Tugay Basaran, Martin Maier, Falko Dressler

发表机构 * School of Electrical Engineering and Computer Science, TU Berlin(柏林工业大学电气工程与计算机科学学院) Optical Zeitgeist Laboratory, INRS(INRS Optical Zeitgeist实验室) Federal Ministry of Research, Technology and Space (BMFTR, Germany)(德国联邦研究、科技与航天部)

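AI summary Proposes BRAIN, a Bayesian reasoning agent based on active inference that uses a deep generative model and variational free energy minimization to unify perception and action, delivering robust causal reasoning, adaptability, and real-time interpretability in dynamic radio resource allocation.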

详情
AI中文摘要

未来的第六代(6G)移动网络将需要不仅自主高效,而且能够在动态环境中实时适应并透明决策的人工智能(AI)智能体。然而,当前网络中的主流智能体AI方法在这方面表现出显著缺陷。传统的基于深度强化学习(DRL)的智能体缺乏可解释性,并且常常遭受脆弱的适应性问题,包括在非平稳条件下对过去知识的灾难性遗忘。在本文中,我们针对这些挑战提出了一种替代解决方案:通过主动推理进行贝叶斯推理(BRAIN)智能体。BRAIN利用网络环境的深度生成模型,并通过最小化变分自由能将感知和行动统一在单个闭环范式中。我们在GPU加速的测试平台上将BRAIN实现为O-RAN扩展应用(xApp),并展示了其相对于标准DRL基线的优势。在我们的实验中,BRAIN表现出:(i)针对动态无线资源分配的鲁棒因果推理,在变化的流量负载下维持切片特定的服务质量(QoS)目标(吞吐量、延迟、可靠性);(ii)卓越的自适应性,在突然的流量变化中比基准方法高出高达28.3%的鲁棒性(无需任何重新训练即可实现);(iii)通过人类可解释的信念状态诊断实现其实时决策的可解释性。

英文摘要

Future sixth-generation (6G) mobile networks will demand artificial intelligence (AI) agents that are not only autonomous and efficient, but also capable of real-time adaptation in dynamic environments and transparent in their decision-making. However, prevailing agentic AI approaches in networking exhibit significant shortcomings in this regard. Conventional deep reinforcement learning (DRL)-based agents lack explainability and often suffer from brittle adaptation, including catastrophic forgetting of past knowledge under non-stationary conditions. In this paper, we propose an alternative solution for these challenges: the Bayesian reasoning via Active Inference (BRAIN) agent. BRAIN harnesses a deep generative model of the network environment and minimizes variational free energy to unify perception and action in a single closed-loop paradigm. We implement BRAIN as an O-RAN eXtended application (xApp) on a GPU-accelerated testbed and demonstrate its advantages over standard DRL baselines. In our experiments, BRAIN exhibits (i) robust causal reasoning for dynamic radio resource allocation, maintaining slice-specific quality of service (QoS) targets (throughput, latency, reliability) under varying traffic loads, (ii) superior adaptability with up to 28.3% higher robustness to sudden traffic shifts versus benchmarks (achieved without any retraining), and (iii) real-time interpretability of its decisions through human-interpretable belief state diagnostics.
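
Variational free energy has a simple closed form when the hidden state is discrete, which makes the "belief state" idea concrete. The sketch below is a generic two-state illustration, not BRAIN's deep generative model. The states, prior, and likelihood values are invented for the example.

```python
import math


def free_energy(q, prior, likelihood):
    """F(q) = sum_s q(s) * [ln q(s) - ln p(s) - ln p(o|s)] for one observation o."""
    return sum(
        qs * (math.log(qs) - math.log(ps) - math.log(ls))
        for qs, ps, ls in zip(q, prior, likelihood)
        if qs > 0
    )


def exact_posterior(prior, likelihood):
    """Bayes' rule; returns the posterior over states and the evidence p(o)."""
    joint = [ps * ls for ps, ls in zip(prior, likelihood)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence


# Hypothetical two-state example: a network slice is either "idle" or "congested".
prior = [0.7, 0.3]
likelihood = [0.1, 0.8]   # p(high latency observed | state)
posterior, evidence = exact_posterior(prior, likelihood)

print(free_energy(posterior, prior, likelihood), -math.log(evidence))
print(free_energy(prior, prior, likelihood))
```

The free energy is minimized by the exact posterior, where it equals the negative log evidence. Any other belief, such as the unchanged prior, scores higher. The posterior itself is the kind of quantity that belief-state diagnostics could expose to a human operator.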

2606.07538 2026-06-09 cs.IR cs.AI 交叉投稿

Bidirectional Semantic Complementary Tool Retrieval for Remote Sensing Agents

面向遥感智能体的双向语义互补工具检索

Zeyuan Wang, Dongyang Hou, Cheng Yang, Xuezhi Cui, Linrui Xu, Bo Yu, Gaozhi Zhou, Ziyu Li, Liangtian Liu, Kai Ouyang, Wang Guo, Lili Zhu, Chao Tao

发表机构 * School of Geosciences and Info-Physics, Central South University(中南大学地球科学与信息物理学院) School of Mechanical and Electrical Engineering, Central South University(中南大学机电工程学院) Hunan Key Laboratory of Land Resources Evaluation and Utilization, Hunan Provincial Institute of Land and Resources Planning(湖南省国土资源规划院湖南省国土资源评价与利用重点实验室)

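AI summary Targets the semantic asymmetry between queries and tool documentation in tool retrieval for remote sensing agents with a bidirectional semantic complementary method: planning-based query enhancement supplies functional semantics, and a dynamic tool dependency graph injects contextual semantics, significantly improving tool retrieval accuracy on complex remote sensing tasks.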

详情
AI中文摘要

基于大语言模型的智能体为遥感数据的自动化处理提供了新范式。它们在复杂遥感任务中的成功依赖于广泛的专用工具库。然而,工具文档通常超出大语言模型的上下文窗口限制,使得精确的工具检索对于智能体工作流至关重要。现有工具检索方法面临“语义不对称”瓶颈:自然语言查询通常表达宏观意图,缺乏工具特定语义,而工具文档提供细粒度的技术描述,缺乏工作流的操作上下文。为弥合这一语义鸿沟,本文提出一种双向语义互补工具检索方法。首先,在查询端,我们引入一种基于规划的查询增强机制,利用智能体的推理能力将抽象意图分解为逻辑子任务,从而主动补充查询缺失的功能语义。其次,在工具端,针对遥感工具链的强耦合特性,我们构建了一个具有持续学习能力的动态工具依赖图。通过采用邻域信息聚合机制,将前驱工具的上下文信息显式注入当前节点表示,从而用上下文语义丰富工具描述。在遥感数据集GeoPlan-bench和通用数据集API-Bank上的实验结果表明,所提方法不仅显著提高了复杂遥感任务的工具检索精度,而且展现出向通用领域任务迁移的鲁棒可扩展性。源代码和数据集可在https://github.com/geox-lab/BSCTR获取。

英文摘要

Large language model (LLM)-based agents provide a novel paradigm for the automated processing of remote sensing (RS) data. Their success in complex RS tasks relies on extensive specialized tool libraries. However, tool documentation often exceeds the context window limits of LLMs, making precise tool retrieval essential for agentic workflows. Existing tool retrieval methods face a "semantic asymmetry" bottleneck: natural language queries typically express macro-level intentions lacking tool-specific semantics, while tool documentation provides fine-grained technical descriptions lacking operational context for workflows. To bridge this semantic gap, this paper proposes a bidirectional semantic complementary tool retrieval method. First, on the query side, we introduce a planning-based query enhancement mechanism that leverages the reasoning capabilities of agents to decompose abstract intentions into logical subtasks, thereby actively supplementing the query with missing functional semantics. Second, on the tool side, addressing the strong coupling characteristics of RS tool chains, we construct a dynamic tool dependency graph with continual learning capabilities. By employing a neighborhood information aggregation mechanism, contextual information from precursor tools is explicitly injected into the current node representation, enriching tool descriptions with contextual semantics. Experimental results on the RS dataset GeoPlan-bench and the general-purpose dataset API-Bank demonstrate that the proposed method not only significantly improves tool retrieval accuracy for complex RS tasks but also exhibits robust extensibility for transfer to general-domain tasks. The source code and dataset are available at https://github.com/geox-lab/BSCTR.
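
The neighborhood aggregation step on the tool side can be sketched as mixing a tool's own embedding with the mean embedding of its precursor tools. This is an assumed, simplified form: the mixing weight `mix`, the 2-D embeddings, and the tool names are all hypothetical, and the paper's actual aggregation and continual-learning graph updates are not specified in the abstract.

```python
def aggregate_predecessors(embeddings, predecessors, tool, mix=0.5):
    """Blend a tool's embedding with the mean embedding of its precursor tools."""
    own = embeddings[tool]
    preds = predecessors.get(tool, [])
    if not preds:
        return list(own)  # no upstream context to inject
    dim = len(own)
    mean_pred = [sum(embeddings[p][d] for p in preds) / len(preds) for d in range(dim)]
    return [(1 - mix) * own[d] + mix * mean_pred[d] for d in range(dim)]


# Hypothetical remote sensing tool chain: two tools typically run before "ndvi".
embeddings = {
    "load_raster": [1.0, 0.0],
    "cloud_mask": [0.0, 1.0],
    "ndvi": [0.0, 0.0],
}
predecessors = {"ndvi": ["load_raster", "cloud_mask"]}

print(aggregate_predecessors(embeddings, predecessors, "ndvi"))
print(aggregate_predecessors(embeddings, predecessors, "load_raster"))
```

After aggregation, the "ndvi" representation carries a trace of the tools that usually precede it, so a query describing the upstream steps can still retrieve it.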

2606.07583 2026-06-09 cs.LG cs.AI 交叉投稿

Outage Detection in Self-Healing Smart Grids Using Reinforcement Learning with Spectral Graph Neural Networks

基于频谱图神经网络强化学习的自愈智能电网故障检测

Lihui Liu, Mucun Sun, Caisheng Wang

发表机构 * Wayne State University(韦恩州立大学) University of Texas at Dallas(德克萨斯大学达拉斯分校)

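AI summary Proposes a spectral graph reinforcement learning framework that uses a spectral graph neural network to learn the optimal restoration policy, enabling near-optimal real-time outage management in distribution networks; generalization is validated on three IEEE test systems.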

详情
AI中文摘要

自愈智能电网能够在故障期间快速调整其网络配置,以最小化电力中断。在故障期间,可以采取多种措施,例如通过开关操作进行网络重构和紧急甩负荷。然而,传统的用于故障缓解的机器学习方法由于响应速度慢和计算成本高,不适用于智能电网。为了解决这些挑战,最近的研究探索了使用强化学习自动执行网络重构。在这些方法中,控制策略通常使用图神经网络(GNN)建模。然而,传统的GNN在空间域中运行,可能无法捕捉频域中的重要关系。频域信息对于建模电力网络中的全局结构模式和系统范围交互特别有用。在本文中,我们提出了一种用于配电网故障管理的频谱图强化学习框架,以增强系统韧性。我们的模型使用频谱图神经网络学习最优电力恢复策略。我们在三个修改后的IEEE测试系统上评估了所提出的方法:13节点、34节点和123节点网络。实验结果表明,我们的方法在实时性上达到了接近最优的性能,并且在广泛的故障场景中具有良好的泛化能力。

英文摘要

Self-healing smart grids can quickly adjust their network configuration during outages to minimize power disruptions. During an outage, several actions can be taken, such as network reconfiguration through switching operations and emergency load shedding. However, traditional machine learning methods for outage mitigation are not well suited for smart grids due to their slow response time and high computational cost. To address these challenges, recent studies have explored reinforcement learning to automatically perform network reconfiguration. In these approaches, the control policy is typically modeled using a graph neural network (GNN). However, conventional GNNs operate in the spatial domain and may fail to capture important relationships in the frequency domain. Frequency-domain information is particularly useful for modeling global structural patterns and system-wide interactions in power networks. In this paper, we propose a spectral graph reinforcement learning framework for outage management in distribution networks to enhance system resilience. Our model learns the optimal power restoration policy using a spectral graph neural network. We evaluate the proposed method on three modified IEEE test systems: the 13-bus, 34-bus, and 123-bus networks. Experimental results show that our approach achieves near-optimal performance in real time and generalizes well across a wide range of outage scenarios.

2606.07602 2026-06-09 cs.LG cs.AI 交叉投稿

Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

面向LEGO空间物理推理的样本高效后训练

Yuhuan Yuan, Zhouliang Yu, Minghao Liu, Weiyang Liu, Ge Lin Kan

发表机构 * HKUST(GZ)(香港科技大学(广州)) CUHK(香港中文大学) ZODA

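AI summary Addresses LLM-generated LEGO assemblies that are physically valid yet geometrically and semantically misaligned, proposing model-based data selection and PVPO, a sample-efficient reinforcement learning method with voxel-space geometric rewards, to improve structural and semantic alignment and physical validity.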

Comments Technical Report V1, 15 pages, 6 figures, 3 tables

详情
AI中文摘要

基于LLM的LEGO组装生成需要同时具备语义基础和物理可行性。我们发现一种数据引发的失败模式PhysHack,其中组装满足物理有效性约束,但产生的结构在几何上错位、语义上不一致或校准不良。为应对这一挑战,我们提出一种基于模型的数据选择方法,仅使用一小部分训练数据,同时改进基于物理的LEGO组装生成。基于所选轨迹,我们引入PVPO,一种样本高效的强化学习方法,将物理可行性与体素空间几何奖励相结合。我们的结果表明,仅物理有效性不足以作为可靠物理推理的代理:模型可以学习生成有效结构而不保持语义或几何保真度。跨模型主干和测试时缩放设置的实验表明,PVPO改善了结构和语义对齐、物理有效性、结构稳定性和校准,同时减少了对大量事后拒绝采样的依赖。特别是,校准结果表明,PVPO通过使测试时选择更能预测语义和结构质量来缓解PhysHack。

英文摘要

LLM-based LEGO assembly generation requires both semantic grounding and physical feasibility. We identify a data-induced failure mode, PhysHack, in which the assemblies satisfy physical-validity constraints while producing structures that are geometrically misaligned, semantically inconsistent, or poorly calibrated. To address this challenge, we propose a model-based data selection approach that uses only a small fraction of the training data while improving physically grounded LEGO assembly generation. Building on the selected trajectories, we introduce PVPO, a sample-efficient reinforcement learning method that couples physical feasibility with voxel-space geometric rewards. Our results show that physical validity alone is an insufficient proxy for reliable physical reasoning: models can learn to generate valid structures without preserving semantic or geometric fidelity. Experiments across model backbones and test-time scaling settings demonstrate that PVPO improves structural and semantic alignment, physical validity, structural stability, and calibration, while reducing reliance on extensive post-hoc rejection sampling. In particular, results on calibration show that PVPO mitigates PhysHack by making test-time selection more predictive of semantic and structural quality.

2606.07603 2026-06-09 cs.LG cs.AI 交叉投稿

MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution

MetaEvo:一种基于经验驱动的智能体进化的元优化框架

Bowen Ren, Heyan Huang, Yinghao Li, Yang Gao

发表机构 * School of Computer Science and Technology, Beijing Institute of Technology(北京理工大学计算机科学与技术学院) Beijing Institute of Technology Southeast Academy of Information Technology(北京理工大学东南信息技术研究院)

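AI summary Proposes MetaEvo, a two-stage framework that uses preference optimization to strengthen the model's ability to abstract principles from task experience, then accumulates and reuses those principles in a modular architecture to keep improving reasoning performance.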

详情
AI中文摘要

大型语言模型(LLM)展现出强大的推理能力,但大多数基于LLM的智能体是静态部署的,无法通过任务交互进行改进。现有的经验驱动方法通常依赖于记忆或启发式方法,而不增强模型的学习能力,将其视为被动执行者,导致早期性能平台和有限的长期改进。为了解决这个问题,我们提出了MetaEvo,一个用于持续智能体进化的两阶段框架,专注于改进模型如何从任务经验中学习,而不仅仅是存储什么。MetaEvo首先应用基于偏好的优化来增强模型的原则抽象能力,然后在模块化智能体架构中实现这些原则的积累和重用。在多样化推理基准上的实验结果表明,MetaEvo始终优于强基线,并在迭代中保持可靠的改进。这些发现验证了元优化在使智能体从经验中学习并持续增强其推理能力方面的有效性。

英文摘要

Large language models (LLMs) exhibit strong reasoning capabilities, yet most LLM-based agents are statically deployed and unable to improve through task interactions. Existing experience-driven methods often rely on memory or heuristics without enhancing the model's ability to learn, treating it as a passive executor and leading to early performance plateaus and limited long-term improvement. To address this issue, we propose MetaEvo, a two-stage framework for continual agent evolution that focuses on improving how the model learns from task experience, rather than solely on what it stores. MetaEvo first applies preference-based optimization to enhance the model's ability to abstract principles, then enables the accumulation and reuse of these principles within a modular agent architecture. Experimental results on diverse reasoning benchmarks demonstrate that MetaEvo consistently outperforms strong baselines and maintains reliable improvement across iterations. These findings validate the effectiveness of meta-optimization in enabling agents to learn from experience and continually enhance their reasoning capabilities.

2606.07711 2026-06-09 cs.LG cs.AI 交叉投稿

Rosetta Memory: Adaptive Memory for Cross-LLM Agents

Rosetta Memory: 跨LLM智能体的自适应记忆

Hao Yang, Shiqi Shen, Haoxuan Li, Zhipeng Wang, Zhi Gong, Xu Chen

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Weixin, Tencent(腾讯微信) Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院)

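AI summary Proposes memory-centric LLM adaptation: two profile-conditioned operators and a minimum-gain sampling curriculum address cross-model adaptation, where upstream memory must activate downstream LLMs, and outperform baselines on several QA tasks.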

Comments 19 pages, 7 figures

详情
AI中文摘要

记忆是将无状态LLM转变为持久、不断进化的智能体的关键组件,通过经验积累、长程规划和持续自我改进实现。现有记忆系统通常以LLM为中心,并针对特定主干设计记忆操作。然而,在实践中,用户经常切换LLM,例如在编码时使用Claude、在写作时使用GPT,或在单个任务中将不同步骤路由到不同主干以实现成本效益权衡。因此,一个模型写入的记忆通常需要被另一个模型消费。使上游记忆有效适应并激活下游LLM仍然是一个关键但未被充分探索的问题。为弥合这一差距,我们将视角从以LLM为中心的记忆设计转变为以记忆为中心的LLM自适应。具体而言,我们从写入和读取两侧处理上述上下游记忆适应问题,并设计两个轮廓条件算子,它们联合训练以优化记忆存储和呈现方式,从而更好地完成任务。为确保学习到的算子能泛化到广泛的LLM集合,我们提出一种最小增益采样课程,在训练期间优先服务最不被照顾的LLM。为更好地衡量算子的实际贡献而非LLM自身能力,我们设计了一种性能差距奖励,与朴素记忆基线进行比较。在HotpotQA、2WikiMultihopQA和MuSiQue上的实验表明,我们的模型持续优于基线,并且在未见模型替换下保持鲁棒性。

英文摘要

Memory is the key component for transforming a stateless LLM into a persistent, evolving agent through experience accumulation, long-horizon planning, and continual self-improvement. Existing memory systems typically take the LLM as the center and design memory operations tailored to a specific backbone. In practice, however, users frequently switch between LLMs, for example using Claude for coding and GPT for writing across tasks, or routing different steps to different backbones within a single task for cost-effective trade-offs. As a result, memory written by one model often needs to be consumed by another. Making upstream memory effectively adapt to and activate downstream LLMs remains a critical yet underexplored problem. To bridge this gap, we shift the perspective from LLM-centric memory design to memory-centric LLM adaptation. Specifically, we approach the above upstream-downstream memory adaptation problem from both the write and read sides, and design two profile-conditioned operators that are jointly trained to optimize how memory is stored and presented for better task completion. To ensure the learned operators generalize across a broad set of LLMs, we propose a minimum-gain sampling curriculum that prioritizes the least-served LLMs during training. To better measure the operators' actual contribution rather than the LLM's own capability, we design a performance-gap reward that compares against a naive memory baseline. Experiments on HotpotQA, 2WikiMultihopQA, and MuSiQue demonstrate that our model consistently outperforms baselines and remains robust under unseen-model replacement.
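
One plausible reading of the performance-gap reward and the minimum-gain sampling curriculum is sketched below: the reward is the score with the learned operators minus the score with naive memory, and backbones with the smallest running gain are sampled more often. The softmax form, the temperature, and the backbone names and scores are assumptions for illustration and are not stated in the abstract.

```python
import math


def performance_gap_reward(score_with_operators: float, score_naive_memory: float) -> float:
    """Reward the operators only for what they add over a naive memory baseline."""
    return score_with_operators - score_naive_memory


def min_gain_sampling_weights(avg_gain: dict, temperature: float = 0.1) -> dict:
    """Softmax over negative gains: a lower running gain gets a higher sampling weight."""
    logits = {name: -gain / temperature for name, gain in avg_gain.items()}
    top = max(logits.values())
    unnormalized = {name: math.exp(value - top) for name, value in logits.items()}
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


# Hypothetical running scores for three downstream backbones.
avg_gain = {
    "backbone_a": performance_gap_reward(0.62, 0.50),
    "backbone_b": performance_gap_reward(0.48, 0.47),
    "backbone_c": performance_gap_reward(0.55, 0.49),
}
weights = min_gain_sampling_weights(avg_gain)
print(max(weights, key=weights.get))
```

Subtracting the naive-memory score keeps a strong backbone from looking well served simply because it answers correctly on its own.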

2606.07837 2026-06-09 cs.HC cs.AI 交叉投稿

Does Persona Make LLMs K-pop Fans? A Pilot Study of LLM-Based Online Concert Audience Agents

角色设定会让LLM成为K-pop粉丝吗?基于LLM的在线演唱会观众智能体初步研究

Kirak Kim, Hyojin Kim, Yejin Son, Sungyoung Kim, Kyung Myun Lee

发表机构 * Graduate School of Culture Technology, KAIST, Daejeon, South Korea(韩国科学技术院文化科技研究生院,韩国大田) Department of Artificial Intelligence, Yonsei University, Seoul, South Korea(延世大学人工智能系,韩国首尔)

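AI summary Simulates real-time fan chat for K-pop concerts with a multi-agent system and finds that persona conditioning improves chat quality and naturalness but does not strengthen social connectedness or affective response, suggesting that a meaningful collective experience requires deeper alignment.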

Comments Accepted at the ICML 2026 Workshop on Culture x AI: Evaluating AI as a Cultural Technology

详情
AI中文摘要

演唱会是一种集体体验,但录制的表演视频通常是独自观看,剥离了使演唱会充满事件的共享观众存在。我们研究基于角色的LLM观众智能体能否通过生成K-pop表演视频旁的实时粉丝聊天来重现这种集体体验的某些方面。我们提出了一个多智能体系统,其中十个LLM智能体通过实时聊天消息做出反应,比较了角色条件化观众(每个智能体被分配一个独特的粉丝身份、偏好和聊天风格)与无角色基线。在K-pop粉丝(N=11)的受试者内试点中,角色条件化显著提高了模型级别的聊天质量和感知自然度,但并未转化为社交连接、参与度或情感反应的差异。访谈表明,在线K-pop演唱会聊天可能作为集体独白而非人际对话运作,而有意义的参与取决于与特定艺人和粉丝群体的共同认同。角色条件化可以使LLM观众看起来更自然,但具有文化意义的集体体验可能需要角色、群体行为、粉丝身份和用户期望之间更深层次的对齐。

英文摘要

A concert is a collective experience, but recorded performance videos are typically watched alone, stripping away the shared audience presence that makes concerts feel eventful. We investigate whether persona-based LLM audience agents can recreate aspects of this collective experience by generating real-time fan chat alongside a K-pop performance video. We present a multi-agent system in which ten LLM agents react through live-chat messages, comparing a persona-conditioned audience (each agent assigned a distinct fan identity, bias, and chat style) with a no-persona baseline. In a within-subjects pilot with K-pop fans (N=11), persona conditioning substantially improved model-level chat quality and perceived naturalness, but did not translate into differences in social connectedness, engagement, or affective response. Interviews suggest that online K-pop concert chat may operate as collective monologue rather than interpersonal dialogue, and that meaningful participation depends on shared identification with the specific artist and fandom. Persona conditioning can make LLM audiences appear more natural, but culturally meaningful collective experience may require deeper alignment between persona, crowd behavior, fandom identity, and user expectations.

2606.07846 2026-06-09 cs.DC cs.AI cs.MA 交叉投稿

Cost-Aware Speculative Execution for LLM-Agent Workflows: An Integrated Five-Dimension Method

成本感知的LLM-Agent工作流投机执行:一种综合五维方法

Faisal Fareed

发表机构 * AWS(亚马逊网络服务)

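AI summary Proposes a five-dimension speculative execution method that uses Bayesian probability estimation and dollar-cost pricing to balance latency against cost in LLM-agent workflows, and restricts speculation to edges that can be rolled back without irreversible side effects.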

详情
AI中文摘要

LLM-Agent工作流将模型调用和工具调用串联起来,大部分挂钟时间花在等待上游操作完成,然后下游操作才能开始。投机执行可以通过预测的上游输入启动下游操作来回收空闲时间,但每次投机都会产生实际成本(按token计费),且其成功概率难以估计并随时间漂移。本文提出一种围绕五个设计决策组织的方法:(D1) 在上游完成之前启动下游操作;(D2) 以实际美元按不同的输入和输出费率定价每次投机;(D3) 暴露一个单一的操作符拨盘用于延迟与成本权衡;(D4) 通过一个期望值规则进行决策,该规则包含一个失败加权成本项和一个偏好调整阈值;(D5) 使用贝叶斯Beta-Binomial后验估计成功概率,其先验依赖于依赖类型分类。这些想法的变体出现在近期工作中;而组合起来,每次决策都以美元记录,是新颖之处。该规则仅在通过可接受性前提(无副作用、幂等或可在提交屏障后分阶段执行)的边上触发,因为错误的投机通过重新执行回滚,这会退还token但无法撤销不可逆的副作用。我们指定了运行时机制、一个闭式结果(规则在上游分支因子增长时自我限制)、一个五阶段校准流水线(离线回放、影子、金丝雀、在线校准、漂移触发终止开关),以及一个针对八种生产原型的工作负载适配模板。与四个最接近的已发表系统(DSP、Speculative Actions v2、Sherlock、B-PASTE)的对比表显示了每个维度上的差异,并且一个合成验证套件确认了预测的决策边界、概率阈值、后验恢复和流式取消行为。

英文摘要

LLM-agent workflows chain model calls and tool invocations, and spend most of their wall-clock time waiting on upstream operations before downstream ones can start. Speculative execution can reclaim that idle time by launching a downstream operation with a predicted upstream input, but here each speculation costs real money (per-token billing) and its success probability is hard to estimate and drifts over time. This paper presents a method organized around five design decisions: (D1) start a downstream operation before its upstream completes; (D2) price each speculation in real dollars at separate input and output rates; (D3) expose a single operator dial for latency versus cost; (D4) decide via an expected-value rule with a failure-weighted cost term and a preference-adjusted threshold; and (D5) estimate the success probability with a Bayesian Beta-Binomial posterior whose prior is keyed to a dependency-type taxonomy. Variants of these ideas appear in recent work; the combination, with every decision logged in dollars, is what is new. The rule fires only on edges passing an admissibility precondition (side-effect-free, idempotent, or stageable behind a commit barrier), since a wrong speculation is rolled back by re-execution, which refunds tokens but cannot un-send an irreversible side effect. We specify the runtime mechanics, a closed-form result that the rule self-limits as the upstream branching factor grows, a five-stage calibration pipeline (offline replay, shadow, canary, online calibration, drift-triggered kill-switch), and a workload-fit rubric over eight production archetypes. Contrast tables against the four closest published systems (DSP, Speculative Actions v2, Sherlock, B-PASTE) show differentiators on every dimension, and a synthetic validation suite confirms the predicted decision boundary, probability threshold, posterior recovery, and streaming-cancellation behavior.
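
Decisions D2 to D5 can be pictured with a small sketch: a Beta posterior over speculation success, a dollar price with separate input and output rates, and an expected-value test scaled by an operator dial. The exact rule in the paper is not given in the abstract, so the inequality in `should_speculate`, the prior Beta(2, 2), the token rates, and the dollar value of saved latency are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    alpha: float  # pseudo-count of successful speculations
    beta: float   # pseudo-count of failed speculations

    def update(self, success: bool) -> None:
        if success:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def speculation_cost_usd(in_tokens, out_tokens, in_rate, out_rate):
    """Dollar price with separate per-token input and output rates."""
    return in_tokens * in_rate + out_tokens * out_rate


def should_speculate(p_success, latency_value_usd, cost_usd, dial=1.0):
    """Assumed rule: expected saving must exceed the failure-weighted cost times the dial."""
    expected_gain = p_success * latency_value_usd
    expected_waste = (1.0 - p_success) * cost_usd
    return expected_gain > dial * expected_waste


posterior = BetaPosterior(alpha=2.0, beta=2.0)   # hypothetical prior for one dependency type
for outcome in [True] * 8 + [False] * 2:
    posterior.update(outcome)

cost = speculation_cost_usd(1000, 500, in_rate=3e-6, out_rate=15e-6)
print(round(posterior.mean, 3), round(cost, 4))
print(should_speculate(posterior.mean, latency_value_usd=0.01, cost_usd=cost))
print(should_speculate(posterior.mean, latency_value_usd=0.01, cost_usd=cost, dial=3.0))
```

Turning the dial up makes the rule more cost-averse: the same edge that is worth speculating on at `dial=1.0` is rejected at `dial=3.0`. An admissibility check (side-effect-free, idempotent, or stageable) would have to run before this rule and is omitted here.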

2606.07889 2026-06-09 cs.LG cs.AI cs.CL cross-list

Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories

应变连贯性:编码代理执行轨迹中的故障前信号

Marut Pandya, Kasey Zhang, Baiqing Lyu

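Affiliations * GitHub

AI summary Identifies "strained coherence", a pattern in which a coding agent recognizes a problem yet still follows its original plan, and builds a Claude Sonnet 4.6 detector that reaches 94% precision in predicting failure on 44 trajectories, outperforming a baseline.

Details
AI Chinese abstract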

基于LLM的编码代理有时会承认自身推理中的问题,但仍继续执行。我们将这种模式称为应变连贯性:一种与安全相关的故障模式,其中代理拥有应改变其行为的信息,陈述了该信息,却仍违背它行动。该模式与口头奖励黑客行为重叠,即代理指出任务代理与底层目标之间的冲突,却仍优化代理。我们给出操作性定义,构建一个Claude Sonnet 4.6评判器,读取完整轨迹并标记该模式出现的片段,并使用Qwen3.5-35B-A3B骨干在44条Terminal-bench-2轨迹上评估。标记轨迹的失败率为94%,而未标记轨迹为46%(47个百分点的差距,Fisher精确检验p=0.003;排除三个提示嵌入示例后为46个百分点,p=0.006)。在匹配选择性下,检测器达到94%的精确度,而词汇话语标记基线为88%;两种方法的10条轨迹交集具有100%的失败率(Clopper-Pearson 95%置信区间[69%, 100%])。我们在Gemma4-31B上使用43条轨迹进行复制:整体信号方向一致但不显著(20个百分点差距,p=0.31),衰减主要由13条零思考内容的轨迹驱动,其中检测器没有可分析的基础。在Gemma的高冗长度三分位中,差距为+30个百分点;在Qwen的中等和高冗长度三分位中,差距各为+40个百分点。两个模型的首次标记出现在轨迹经过时间的中位数83-84%处,且二元标记在软化显式冲突标记的释义中保持不变(8/8条轨迹)。与单变量预测器不同,检测器输出可解释的跨度级输出——引用的承认、引用的行动和类型化的冲突——显示代理看到并忽略了什么。

English abstract

LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway. We call this pattern strained coherence: a safety-relevant failure mode in which an agent has information that should change its behavior, states that information, and still acts against it. The pattern overlaps with verbalized reward hacking, where an agent names a tension between a task proxy and the underlying goal yet optimizes the proxy anyway. We give an operational definition, build a Claude Sonnet 4.6 judge that reads full trajectories and flags spans where the pattern occurs, and evaluate it on 44 Terminal-bench-2 trajectories using a Qwen3.5-35B-A3B backbone. Flagged trajectories fail 94% of the time versus 46% for unflagged trajectories (47-point gap, Fisher's exact p = 0.003; 46 points after excluding three prompt-embedded examples, p = 0.006). At matched selectivity, the detector reaches 94% precision versus 88% for a lexical discourse-marker baseline; the 10-trajectory intersection of the two methods has a 100% failure rate (Clopper-Pearson 95% CI [69%, 100%]). We replicate on Gemma4-31B with 43 trajectories: the overall signal is directionally consistent but not significant (20-point gap, p = 0.31), with attenuation driven largely by 13 trajectories with zero think content, where the detector has no substrate to analyze. In the high-verbosity Gemma tertile, the gap is +30 points; in the mid- and high-verbosity Qwen tertiles, it is +40 points each. The first flag appears at a median of 83-84% of elapsed trajectory time across both models, and the binary flag survives paraphrases that soften explicit conflict markers (8/8 trajectories). Unlike univariate predictors, the detector emits interpretable span-level output -- quoted acknowledgment, quoted action, and typed conflict -- showing what the agent saw and ignored.
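
The interval reported for the 10-trajectory intersection can be checked by hand. When all n trials are events, the two-sided Clopper-Pearson lower bound has the closed form (alpha/2)^(1/n) and the upper bound is 1. The short Python check below covers only that special case, and the function name is ours.

```python
def clopper_pearson_all_events(n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided interval when all n trials are events (x == n).

    The lower bound p solves p**n == alpha / 2; the upper bound is 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return (alpha / 2) ** (1.0 / n), 1.0


low, high = clopper_pearson_all_events(10)
print(f"[{low:.0%}, {high:.0%}]")  # prints [69%, 100%]
```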

2606.08275 2026-06-09 cs.LG cs.AI cross-list

Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures

因果智能体回放:LLM智能体故障的反事实归因

Jaineet Shah

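Affiliations * Carnegie Mellon University

AI summary Proposes Causal Agent Replay (CAR), which uses a structural causal model and interventions to attribute LLM-agent failures to individual steps counterfactually, addressing the inability of existing methods to locate the deciding step.

Comments Open-source: https://github.com/jaineet17/causal-agent-replay

Details
AI Chinese abstract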

当LLM智能体失败时——例如发放了不应发放的退款、调用了错误的工具、泄露了数据——现有工具只能回答发生了什么(可观测性)或是否通过(评估),但无法回答哪个步骤导致了失败。直观的启发式方法是错误的:执行有害动作的步骤通常不是决定该动作的步骤,而LLM判断的归因是相关性的且不可靠(在Who&When基准上,最先进的步骤级准确率约为14%)。我们提出Causal Agent Replay (CAR),通过干预来回答这个问题:它将智能体运行建模为结构因果模型,对某个步骤应用do操作,并在相同随机策略下重新执行轨迹,测量结果分布的变化。我们定义了智能体步骤上的干预代数、一个单步对比估计器(其承诺点规则解决了特定于随机向前运行的混杂因素),以及一个预算有界的蒙特卡洛Shapley估计器(用于在交互步骤间分配信用)。每个效应都附有置信区间。我们在具有植入真实标签的合成结构因果模型上进行验证:对比估计器恢复了关键步骤,Shapley恢复了两步交互(0.44, 0.45, ~0;效率总和0.909对比解析值0.91)。CAR是开源的,可在托管或免费的本地模型上运行。

English abstract

When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and LLM-judge attribution is correlational and unreliable (state-of-the-art step-level accuracy on the Who&When benchmark is about 14%). We present Causal Agent Replay (CAR), which answers the question by intervention: it models an agent run as a structural causal model, applies a do-operation to a step, and re-executes the trajectory forward under the same stochastic policy, measuring the shift in the outcome distribution. We define an intervention algebra over agent steps, a single-step contrastive estimator whose point-of-commitment rule resolves a confound specific to stochastic run-forward, and a budget-bounded Monte-Carlo Shapley estimator that splits credit across interacting steps. Every effect is reported with confidence intervals. We validate against synthetic structural causal models with planted ground truth: the contrastive estimator recovers the pivotal step, and Shapley recovers a two-step interaction (0.44, 0.45, ~0; efficiency sum 0.909 versus the analytic 0.91). CAR is open source and runs on hosted or free local models.
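
A minimal sketch of a permutation-sampling Shapley estimator with a fixed sampling budget, in Python. It is a generic Monte-Carlo Shapley routine, not CAR's implementation: `value` stands in for "re-run the agent with these steps intervened on and measure the outcome shift", and the toy value function and its 0.9 total are invented for illustration.

```python
import random


def shapley_monte_carlo(n_steps, value, n_permutations=200, seed=0):
    """Estimate per-step Shapley values by sampling step orderings.

    The budget is n_permutations; each ordering costs n_steps + 1 calls to `value`.
    """
    rng = random.Random(seed)
    totals = [0.0] * n_steps
    steps = list(range(n_steps))
    for _ in range(n_permutations):
        rng.shuffle(steps)
        coalition = frozenset()
        prev = value(coalition)
        for step in steps:
            coalition = coalition | {step}
            cur = value(coalition)
            totals[step] += cur - prev  # marginal contribution of this step
            prev = cur
    return [t / n_permutations for t in totals]


def toy_value(coalition):
    """Toy outcome shift: steps 0 and 1 matter only together; step 2 is inert."""
    return 0.9 if {0, 1} <= coalition else 0.0


phi = shapley_monte_carlo(3, toy_value)
```

Because marginal contributions telescope within each ordering, the estimates always sum to the full-coalition value, which mirrors the efficiency check quoted in the abstract.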

2606.08360 2026-06-09 cs.LG cs.AI cross-list

Generative Frontier Planning for Adaptive Peer-Referral Recruitment under Covariate-Dependent Arrivals

协变量依赖到达下的自适应同伴推荐招募的生成前沿规划

Lingkai Kong, Hezi Jiang, Andrew Ma, Keyu Wang, Akseli Kangaslahti, Milind Tambe

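Affiliations * Harvard University

AI summary Addresses covariate-dependent arrivals in peer-referral recruitment with Generative Frontier Planning (GFP), which plans efficiently through a deterministic backup and marginal greedy allocation and outperforms baselines in simulation.

Details
AI Chinese abstract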

同伴推荐招募系统(如受访者驱动抽样)对于研究和干预受传染病影响的隐藏人群至关重要。为了加速招募,公共卫生机构必须在多轮中自适应地分配有限的推荐资源,当前决策影响未来招募者的数量和协变量。先前的工作通过假设推荐来自同质总体的独立同分布抽样使问题可解,但忽略了驱动真实同伴推荐的同质性和共享背景。我们考虑一个更现实的模型,其中推荐容量和新推荐个体的协变量都依赖于推荐者,并通过删失计数模型和条件生成模型从数据中学习。由此产生的规划问题具有挑战性,因为每个候选分配都会导致未来招募者的不同分布。我们提出生成前沿规划(GFP),一种基于模型的规划器,用潜在协变量覆盖值替代的确定性备份替代每步蒙特卡洛采样。该替代的设计使得下一个前沿的期望值仅通过离线摊销的有限维摘要依赖于后代生成模型,并且使得每轮目标具有单调递减收益。这两个性质共同使规划易于处理:确定性备份消除了蒙特卡洛采样,递减收益结构使得边际贪心分配能够为每轮问题实现(1-1/e)近似。在根据真实受访者驱动抽样数据集校准的模拟环境中,GFP在四个折扣因子下均优于随机、强化学习和独立同分布动态规划基线。

English abstract

Peer-referral recruitment systems such as respondent-driven sampling are critical for studying and intervening on hidden populations affected by infectious diseases. To accelerate recruitment, public health agencies must adaptively allocate limited referral resources across multiple rounds, where current decisions shape both the number and the covariates of future recruits. Prior work makes this problem tractable by assuming that referrals are drawn i.i.d. from a homogeneous population, an assumption that ignores the homophily and shared context that drive real peer recruitment. We instead consider a more realistic model in which both referral capacity and the covariates of newly referred individuals are conditioned on the referrer, learned from data with a censored count model and a conditional generative model. The resulting planning problem is challenging because each candidate allocation induces a different distribution over future recruits. We propose Generative Frontier Planning (GFP), a model-based planner that replaces per-step Monte-Carlo sampling with a deterministic backup over a latent covariate-coverage value surrogate. The surrogate is designed so that the expected value of the next frontier depends on the offspring generative model only through finite-dimensional summaries that are amortized offline, and so that the resulting per-round objective is monotone with diminishing returns. Together, these two properties make planning tractable: the deterministic backup eliminates Monte-Carlo sampling, and the diminishing-returns structure lets a marginal greedy allocation achieve a (1-1/e)-approximation for the per-round problem. On a simulation environment calibrated to a real respondent-driven sampling dataset, GFP outperforms random, reinforcement-learning, and i.i.d. dynamic-programming baselines across four discount factors.
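
The (1-1/e) factor is the classic guarantee for greedy selection on a monotone objective with diminishing returns under a cardinality budget. The Python sketch below illustrates marginal greedy allocation on plain set coverage, which is a stand-in for the paper's covariate-coverage surrogate; the candidate names and covered cells are invented.

```python
def greedy_allocate(candidates, budget):
    """Pick up to `budget` candidates, each time taking the largest marginal coverage gain.

    `candidates` maps a candidate id to the set of covariate cells it covers.
    """
    chosen, covered = [], set()
    for _ in range(budget):
        best, best_gain = None, 0
        for cid, cells in candidates.items():
            if cid in chosen:
                continue
            gain = len(cells - covered)  # marginal gain shrinks as coverage grows
            if gain > best_gain:
                best, best_gain = cid, gain
        if best is None:  # no remaining candidate adds coverage
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered


candidates = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
chosen, covered = greedy_allocate(candidates, budget=2)
```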

2606.08410 2026-06-09 cs.LG cs.AI cross-list

Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries

具有主动对话查询的可证明高效个性化多目标老虎机

Linfeng Cao, Ming Shi, Ness B. Shroff

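Affiliations * The Ohio State University; University at Buffalo

AI summary Proposes MO-PQUCB, which obtains user preference signals from proactive queries and combines a Plackett-Luce model with regularized UCB to untangle preferences from rewards in multi-objective bandits, achieving improved regret bounds.

Comments UAI 2026

Details
AI Chinese abstract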

多目标老虎机中的个性化决策需要学习用户在不同竞争目标之间的特定权衡。由于臂的效用既取决于未知奖励又取决于未知偏好,现有方法仅从效用反馈中推断偏好,将偏好学习与奖励探索纠缠在一起。然而,在实践中,用户通常通过主动对话查询(例如,“便宜且干净的酒店”)揭示他们的优先级,但这种结构化信号未被利用。我们形式化了一个基于主动查询的框架,其中用户查询提供结构化的偏好信号。通过Plackett-Luce子集选择模型对这些信号进行建模,我们证明了由于基本的平移不变性障碍,仅查询学习是不够的。为了解决这个问题,我们引入了MO-PQUCB,一种混合算法,通过平移不变正则化和双探索UCB将基于查询的偏好锚定与老虎机反馈相结合。我们证明了主动查询加速了偏好估计,并相比先前偏好感知的MO-MAB方法实现了改进的遗憾缩放。在查询被破坏的情况下,我们进一步刻画了统计极限,并设计了一个鲁棒估计器,在破坏稀疏时实现接近最优的性能。实验验证了理论和实际收益。

English abstract

Personalized decision-making in multi-objective bandits requires learning user-specific trade-offs among competing objectives. Since arm utility depends on both unknown rewards and unknown preferences, existing methods infer preferences only from utility feedback, entangling preference learning with reward exploration. In practice, however, users often reveal their priorities through proactive conversational queries (e.g., "cheap and clean hotel"), yet this structured signal is not leveraged. We formalize a proactive query-based framework in which user queries provide structured preference signals. Modeling these signals via a Plackett-Luce subset choice model, we show that query-only learning is insufficient due to a fundamental shift-invariance barrier. To resolve this, we introduce MO-PQUCB, a hybrid algorithm that integrates query-based preference anchoring with bandit feedback through shift-invariant regularization and dual-exploration UCB. We prove that proactive queries accelerate preference estimation and yield improved regret scaling over prior preference-aware MO-MAB methods. Under corrupted queries, we further characterize statistical limits and design a robust estimator achieving near-optimal performance when the corruption is sparse. Experiments validate both theoretical and practical gains.
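
The shift-invariance barrier can be seen in a few lines of Python. Under a Plackett-Luce model with exponential weights (an assumed parameterization; the paper's exact form may differ), adding the same constant to every score leaves all choice probabilities unchanged, so choices alone cannot identify the absolute level of the scores.

```python
import math


def pl_choice_probs(scores):
    """Plackett-Luce probability that each item is chosen first (softmax of scores)."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


base = [0.2, 1.0, -0.5]
shifted = [s + 3.7 for s in base]  # same scores, shifted by a constant

# The two probability vectors agree up to floating-point error.
max_diff = max(abs(p - q) for p, q in zip(pl_choice_probs(base), pl_choice_probs(shifted)))
```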

2606.08500 2026-06-09 cs.SE cs.AI cross-list

Projecting the Emerging Mindset of SWE Agent by Launching a Wild Code Understanding Journey

通过发起野生的代码理解之旅来投射SWE代理新兴思维模式

Zhengyi Zhuo, Yan Liu

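Affiliations * School of Computer Science and Technology, Tongji University

AI summary Lets SWE agents explore real repositories through a bounded tool interface and proposes the Ada framework, which uses observation lenses to analyze navigation, evidence selection, synthesis, grounding, and stopping, turning trajectory data into comparable behavioral profiles.

Details
AI Chinese abstract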

软件工程代理(SWE代理)越来越多地通过工具介导的轨迹在真实代码库中工作,但其行为仍难以用具体、可观察的术语来表征。这些轨迹记录了工具使用、中间推理、证据选择和自我导向的停止,但它们本身并不能解释为什么选择了特定的动作、信任了什么证据,或者何时认为理解足够。这种张力使得轨迹数据既有限又有价值:当通过纪律性观察进行解释时,忠实的、可重放的轨迹可以成为研究代理行为的经验基础。我们引入了Ada,一个用于仓库级代码理解的范围化装置。Ada通过有界工具接口进入真实代码库,允许开放式的探索作为有限轨迹保持可记录。在这个野生但有界的设置中,Ada选择在哪里看、仔细阅读什么、何时巩固部分理解以及何时结束对仓库的描述。我们通过观察透镜投射Ada的思考-行动链,这些透镜使导航、证据选择、综合、基础化和停止变得可见,而不将行为简化为原始工具计数或推测隐藏意图。综合来看,这些透镜产生了基于软件世界中记录移动的行为画像。在跨越多个模型、仓库、任务系列和启动条件的408条轨迹中,该研究展示了如何将忠实的数字痕迹转化为纪律性的、可比较的SWE代理新兴思维模式投射。结果揭示了效率、轨迹多样性、认知基础化和干预限制方面的差异,同时为在真实代码库中观察SWE代理行为提供了方法论基础。

English abstract

Software engineering agents (SWE agents) increasingly work through tool-mediated trajectories in real repositories, yet their behavior remains difficult to characterize in concrete, observable terms. These trajectories record tool use, intermediate reasoning, evidence selection, and self-directed stopping, but they do not by themselves explain why particular moves were chosen, what evidence was trusted, or when understanding was judged sufficient. This tension makes trajectory data both limited and valuable: faithful, replayable traces can become an empirical substrate for studying agent behavior when interpreted through disciplined observation. We introduce Ada, a scoped apparatus for repository-level code understanding. Ada enters real codebases through a bounded tool interface, allowing open-ended exploration to remain recordable as finite trajectories. Across this wild-but-bounded setting, Ada chooses where to look, what to read closely, when to consolidate partial understanding, and when to close its account of the repository. We project Ada's think-action chains through observation lenses that make navigation, evidence selection, synthesis, grounding, and stopping visible without reducing behavior to raw tool counts or speculating about hidden intent. Read together, these lenses produce behavioral profiles grounded in recorded movement through software worlds. Across 408 trajectories, spanning multiple models, repositories, task families, and launch conditions, the study shows how faithful digital traces can be transformed into disciplined, comparable projections of emerging SWE-agent mindset. The results expose differences in efficiency, trajectory diversity, epistemic grounding, and the limits of intervention, while providing a methodological foundation for observing SWE agent behavior in real codebases.

2606.08696 2026-06-09 cs.LG cs.AI cross-list

Agentic Search for Counterfactual Recourse under Fixed LLM Budgets

固定LLM预算下的反事实追索的智能搜索

Yasuo Tabei

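AI summary Proposes Comp-MCTS, a tree-search framework that maximizes the number of unique, oracle-validated counterfactuals generated under a fixed LLM-call budget while balancing quantity and quality.

Details
AI Chinese abstract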

反事实追索旨在提供可操作的特征变化,以改变预测模型做出的不利决策。在实践中,受影响的个体通常受益于多个可行的替代方案,而非单一的最优解释。产生此类替代方案的一种自然方式是提示大语言模型(LLMs)。然而,提示引入了一个实际约束:LLM调用的数量通常是主要的计算和经济成本。对多个替代方案的需求以及这一成本约束共同将问题从寻找单个高质量反事实转变为在固定LLM调用预算下高效生成一组经oracle验证的反事实。在这项工作中,我们将LLM智能体设置中的反事实追索生成作为固定预算搜索问题进行研究,并提出了Comp-MCTS,一个智能体树搜索框架,该框架在此预算下最大化唯一、经oracle验证的反事实的产出,同时保持有利的数量-质量权衡。Comp-MCTS通过基于LLM的提议生成、oracle验证和压缩引导剪枝,在无训练、仅oracle的设置中将预算分配给新颖的干预方向。在四个真实世界表格数据集上的实验表明,Comp-MCTS在唯一、经oracle验证的反事实产出方面显著优于单候选LATS风格基线,并且与更强的多候选变体相比,提供了有利的数量-质量-效率权衡:在四个数据集中的三个上,以相似或更低的oracle评估成本获得相当或更高的产出,同时具有有竞争力的接近性、稀疏性和新颖性。

English abstract

Counterfactual recourse aims to provide actionable feature changes that would alter an unfavorable decision made by a predictive model. In practice, affected individuals often benefit from multiple feasible alternatives rather than a single optimal explanation. A natural way to produce such alternatives is to prompt large language models (LLMs). However, prompting incurs a practical constraint: the number of LLM calls is often the dominant computational and economic cost. Together, the need for multiple alternatives and this cost constraint shift the problem from finding a single high-quality counterfactual to efficiently generating a set of oracle-validated counterfactuals under a fixed LLM-call budget. In this work, we study counterfactual recourse generation in the LLM-agentic setting as a fixed-budget search problem and propose Comp-MCTS, an agentic tree-search framework that maximizes the yield of unique, oracle-validated counterfactuals under this budget while maintaining favorable quantity--quality trade-offs. Comp-MCTS allocates the budget toward novel intervention directions via LLM-based proposal generation, oracle validation, and compression-guided pruning, in a training-free, oracle-only setting. Experiments on four real-world tabular datasets show that Comp-MCTS substantially outperforms single-candidate LATS-style baselines in the yield of unique, oracle-validated counterfactuals, and offers favorable quantity--quality--efficiency trade-offs against stronger multi-candidate variants: comparable or higher yield at similar or lower oracle-evaluation cost on three of four datasets, plus competitive proximity, sparsity, and novelty.

2606.09027 2026-06-09 cs.CL cs.AI cross-list

SafeRun: Enabling Determinism in LLM Planning for Running

SafeRun:在跑步规划中实现LLM的确定性

Meilin Chen, Zepeng Zhai, Jiaxuan Zhao, Yuan Lu

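Affiliations * University of California, Berkeley; Stanford University

AI summary To address safety violations caused by the probabilistic nature of LLMs in running planning, proposes SafeRun, whose decoupled architecture separates soft interpretation by the LLM from hard constraint enforcement by a deterministic solver and achieves a 100% safety score.

Comments Workshop on Planning in the Era of LLMs (LM4Plan) at ICML 2026

Details
AI Chinese abstract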

大型语言模型能够实现灵活的自然语言规划,但由于其概率性,在确定性关键领域仍不可靠。这一限制在跑步规划中尤其成问题,因为违反安全规则可能导致安全风险。我们提出SafeRun,一种通过解耦架构实现基于LLM的确定性规划的框架。SafeRun将LLM的软解释与确定性求解器的硬约束执行分离,在保持自然语言灵活性的同时确保严格的安全约束。为了验证SafeRun,我们构建了一个全面的基准测试,用于在现实生理和安全约束下进行跑步规划。在五个LLM上的实验表明,SafeRun实现了100%的安全评分(相比之下,PE平均为79.1%,CodeAct平均为97.6%),同时保持了具有竞争力的指令遵循分数。SafeRun基准测试已在Hugging Face上公开:https://huggingface.co/datasets/zzp-seeker/SafeRun-RunPlanning-Benchmark

English abstract

Large Language Models enable flexible natural-language planning but remain unreliable in determinism-critical domains due to their probabilistic nature. This limitation is especially problematic in running planning, where violating safety rules can lead to safety risks. We propose SafeRun, a framework for deterministic LLM-based planning via a decoupled architecture. SafeRun separates soft interpretation by an LLM from hard constraint enforcement by a deterministic solver, ensuring strict safety constraints while preserving natural-language flexibility. To validate SafeRun, we build a comprehensive benchmark for running planning under realistic physiological and safety constraints. Experiments across five LLMs show that SafeRun achieves a 100% safety score (vs. 79.1% PE average and 97.6% CodeAct average) while maintaining competitive instruction-following scores. The SafeRun benchmark is publicly available on Hugging Face at https://huggingface.co/datasets/zzp-seeker/SafeRun-RunPlanning-Benchmark.
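
A minimal Python sketch of the decoupling idea: the LLM's proposal is treated as untrusted input, and a deterministic function has the final say. Simple clamping stands in for the paper's deterministic solver, and both limits (a weekly distance cap and a week-over-week increase cap) are hypothetical constraints chosen for illustration, not rules taken from the SafeRun benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_weekly_km: float        # hypothetical hard cap on weekly distance
    max_weekly_increase: float  # hypothetical cap, e.g. 0.10 for +10% per week


def enforce(proposed_weekly_km, last_week_km, limits):
    """Deterministically clamp an LLM-proposed weekly plan to the hard limits."""
    safe, prev = [], last_week_km
    for km in proposed_weekly_km:
        cap = min(limits.max_weekly_km, prev * (1 + limits.max_weekly_increase))
        km = max(0.0, min(km, cap))  # never negative, never above the cap
        safe.append(round(km, 2))
        prev = km
    return safe


limits = Limits(max_weekly_km=40.0, max_weekly_increase=0.10)
plan = enforce([30, 45, 50], last_week_km=25.0, limits=limits)
```

Because the check runs outside the model, the same proposal always yields the same enforced plan, regardless of sampling randomness in the LLM.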

2606.09483 2026-06-09 cs.CL cs.AI cross-list

Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents

超越回忆的记忆:用于自进化LLM代理的双过程认知记忆系统

Tianxiang Fei, Mingyang Song, Mao Zheng, Xiang Yu

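Affiliations * Tencent

AI summary Proposes DCPM, which draws on dual-process theory to organize agent memory into a cognitive capability hierarchy, with a synchronous daytime writer handling belief revision and an asynchronous nighttime engine handling schema induction, yielding clear gains on implicit cross-session inference tasks.

Details
AI Chinese abstract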

LLM代理的长期记忆不仅仅是适时检索正确的段落。当前的记忆系统将信念修正、因果耦合和跨领域抽象压缩到为表面回忆而调整的单一检索面上,因此难以处理需要推理用户如何演变的隐式个性化。我们提出DCPM,它沿着认知能力层次重新组织代理记忆,从原始输入和原子事实,经过历时信念轨迹和身份,上升到领域模式、潜在意图和跨领域模式。该层次由两个过程驱动,继承了双过程理论的架构分裂:一个同步的日间写入器(系统1),记录信念修正为双重链接的取代链;一个异步的夜间引擎(系统2),归纳模式和意图,并扫描跨领域冲突,抽象为更高级的核心模式。在LongMemEval、PersonaMem和PersonaMem-v2上,启用系统2在奖励隐式跨会话推理的基准上贡献最大(在PersonaMem-v2上最高+5.20),在跨度回忆上贡献最小,与架构预测一致。

English abstract

Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval surface tuned for surface recall, and consequently struggle on implicit personalisation that requires reasoning over how a user has evolved. We propose DCPM, which reorganises agent memory along a cognitive capability hierarchy ascending from raw inputs and atomic facts, through diachronic belief trajectories and identity, to domain schemas, latent intentions and cross-domain patterns. The hierarchy is driven by two processes inheriting the architectural split of dual-process theory: a synchronous daytime writer (System1) that records belief revisions as doubly linked supersedes chains, and an asynchronous nighttime engine (System2) that induces schemas and intentions and sweeps for cross-domain collisions abstracted into higher-level core schemas. On LongMemEval, PersonaMem and PersonaMem-v2, enabling System2 contributes most where the benchmark rewards implicit cross-session inference (up to +5.20 on PersonaMem-v2) and least on span recall, matching the architectural prediction.
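
One way to picture a doubly linked supersedes chain is a record that points both to the belief it replaced and to the belief that replaced it. The Python sketch below is an assumption about the data structure, with hypothetical field and function names; the abstract does not specify DCPM's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # identity comparison avoids recursing through the links
class Belief:
    text: str
    supersedes: Optional["Belief"] = None     # older belief this one replaces
    superseded_by: Optional["Belief"] = None  # newer belief that replaced this one


def revise(current: Belief, new_text: str) -> Belief:
    """Record a belief revision and link it in both directions."""
    new = Belief(text=new_text, supersedes=current)
    current.superseded_by = new
    return new


def history(latest: Belief) -> list[str]:
    """Walk back through the chain and return beliefs oldest first."""
    out, node = [], latest
    while node is not None:
        out.append(node.text)
        node = node.supersedes
    return out[::-1]
```

Keeping both links lets a reader move from any belief to its successor or predecessor without scanning the whole store.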

2606.09825 2026-06-09 cs.LG cs.AI cs.SY eess.SY math.OC cross-list

An Agency-Transferring Model-Free Policy Enhancement Technique

一种无模型策略增强的代理转移技术

Anton Bolychev, Georgiy Malaniya, Sinan Ibrahim, Pavel Osinenko

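Affiliations * Center for Engineering Systems and Sciences; Central University; Sirius University of Science and Technology

AI summary Proposes embedding a suboptimal baseline policy into reinforcement learning training and progressively transferring agency from the baseline to a learnable policy, which improves training efficiency and yields a standalone policy that outperforms the baseline.

Details
AI Chinese abstract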

从头开始训练强化学习(RL)策略成本高昂:需要仔细设计奖励和环境、大量调参以及大量计算。然而,许多控制问题已经有一个功能正常但次优的基线策略可用。本文提出一种方法,将这样的基线策略嵌入RL训练过程,同时提高相对于从头开始方法的训练效率,并产生一个优于基线的学习策略。在每个步骤中,该方法在基线策略和可训练的学习策略之间进行仲裁,最初强烈依赖基线策略,然后逐步将代理权转移给学习策略。训练结束时,学习策略是一个无需基线策略支持的独立神经网络。本文形式化了基线策略“功能正常”的含义:在该策略下,智能体以高概率到达目标集并停留在那里。所提出的仲裁机制旨在训练过程中利用这一特性,从训练开始就产生高目标到达率。理论分析在给定假设下提供了这种行为的形式化解释,并将其扩展到最终无基线场景,其中推导了独立学习策略目标到达概率的显式下界。在连续控制基准上的实验结果表明,所提出的方法实现了与竞争方法相当或更高的回报,同时在训练过程中(包括最终阶段,学习策略无需任何基线支持)保持了最高的目标到达率。

English abstract

Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal policy available as a baseline. This paper proposes a method for embedding such a baseline into the RL training process, simultaneously improving training efficiency relative to from-scratch methods and producing a learning policy that outperforms the baseline. At each step, the method arbitrates between the baseline policy and a trainable learning policy, initially relying strongly on the baseline policy and then progressively transferring agency to the learning policy. By the end of training, the learning policy is a standalone neural network that operates without baseline policy support. The paper formalizes what it means for the baseline policy to be functional: under this policy, the agent reaches a goal set and remains there with high probability. The proposed arbitration mechanism is designed to exploit this property during training, yielding high goal-reaching rates right from the beginning of training. A theoretical analysis provides a formal interpretation of this behavior under stated assumptions and extends it to the final baseline-free regime, where explicit lower bounds are derived for the goal-reaching probability of the standalone learning policy. Empirical results on continuous-control benchmarks show that the proposed method achieves returns that match or exceed those of competitive approaches, while maintaining the highest goal-reaching rates throughout training among the compared methods -- including in the final stage, where the learning policy operates without any baseline support.
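
A minimal Python sketch of per-step arbitration with a decaying reliance on the baseline. The abstract does not describe the actual mechanism, so the random coin flip, the linear schedule, and all names here are assumptions used only to illustrate "rely on the baseline first, then hand over agency".

```python
import random


def baseline_probability(step, total_steps):
    """Linearly decay reliance on the baseline from 1 to 0 (assumed schedule)."""
    return max(0.0, 1.0 - step / total_steps)


def arbitrate(state, step, total_steps, baseline_policy, learning_policy, rng):
    """Return the action of whichever policy wins this step's coin flip."""
    if rng.random() < baseline_probability(step, total_steps):
        return baseline_policy(state)
    return learning_policy(state)
```

At `step == total_steps` the baseline is never consulted, which matches the abstract's description of a final stage where the learning policy acts alone.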

2404.02039 2026-06-09 cs.AI replaced

A Survey on Large Language Model-Based Game Agents

基于大语言模型的游戏智能体综述

Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella, Fatih Ilhan, Selim Furkan Tekin, Yichang Xu, Zachary Yahn, Ling Liu

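Affiliations * Georgia Institute of Technology, USA; Cisco Research, USA

AI summary Surveys LLM-based game agents through a unified reference architecture, covering the single-agent level (memory, reasoning, perception-action interfaces) and the multi-agent level (communication protocols, organizational models), and introduces a challenge-centered taxonomy linking six game genres to agent requirements.

Comments ACM Computing Surveys, 2026

Details
AI Chinese abstract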

游戏环境提供了丰富、可控的设置,能够模拟现实世界复杂性的许多方面。因此,游戏智能体为探索与通用人工智能相关的能力提供了有价值的测试平台。最近,大语言模型(LLM)的出现为在这些复杂游戏环境中赋予智能体可泛化的推理、记忆和适应性提供了新的机会。本综述通过一个统一的参考架构,对基于LLM的游戏智能体(LLMGA)进行了最新回顾。在单智能体层面,我们围绕三个核心组件综合了现有研究:记忆、推理和感知-行动接口,这些组件共同描述了语言如何使智能体感知、思考和行动。在多智能体层面,我们概述了通信协议和组织模型如何支持协调、角色分化以及大规模社会行为。为了将这些设计置于具体情境中,我们引入了一个以挑战为中心的分类法,将六种主要游戏类型与其主导的智能体需求联系起来,从动作游戏中的低延迟控制到沙盒世界中的开放式目标形成。相关论文的精选列表可在以下网址获取:https://github.com/git-disl/awesome-LLM-game-agent-papers

English abstract

Game environments provide rich, controllable settings that simulate many aspects of real-world complexity. As such, game agents offer a valuable testbed for exploring capabilities relevant to Artificial General Intelligence. Recently, the emergence of Large Language Models (LLMs) provides new opportunities to endow these agents with generalizable reasoning, memory, and adaptability in complex game environments. This survey offers an up-to-date review of LLM-based game agents (LLMGAs) through a unified reference architecture. At the single-agent level, we synthesize existing studies around three core components: memory, reasoning, and perception-action interfaces, which jointly characterize how language enables agents to perceive, think, and act. At the multi-agent level, we outline how communication protocols and organizational models support coordination, role differentiation, and large-scale social behaviors. To contextualize these designs, we introduce a challenge-centered taxonomy linking six major game genres to their dominant agent requirements, from low-latency control in action games to open-ended goal formation in sandbox worlds. A curated list of related papers is available at https://github.com/git-disl/awesome-LLM-game-agent-papers

2601.21754 2026-06-09 cs.AI replaced

Language-based Trial and Error Falls Behind in the Era of Experience

基于语言的试错在经验时代落后了

Haoyu Wang, Guozheng Ma, Shugang Cui, Yilun Kong, Haotian Luo, Li Shen, Mengya Gao, Yichao Wu, Xiaogang Wang, Dacheng Tao

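Affiliations * University of Science and Technology of China

AI summary To address the high cost of exploration for LLMs in nonlinguistic environments, proposes SCOUT, which explores the environment with lightweight models and activates the LLM's world knowledge through SFT and RL, substantially improving performance while reducing compute.

Details
AI Chinese abstract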

尽管大型语言模型(LLM)在基于语言的智能体任务中表现出色,但它们对未见过的非语言环境(例如符号或空间任务)的适用性仍然有限。先前的工作将这种性能差距归因于预训练分布与测试分布之间的不匹配。在这项工作中,我们证明了主要瓶颈是探索的过高成本:掌握这些任务需要大量的试错,这对于在高维语义空间中运行的参数庞大的LLM来说在计算上是不可持续的。为了解决这个问题,我们提出了SCOUT(子规模协作处理未见任务),一种将探索与利用解耦的新框架。我们使用轻量级的“侦察兵”(例如小型MLP)以远超LLM的速度和规模探测环境动态。收集到的轨迹用于通过监督微调(SFT)引导LLM,然后通过多轮强化学习(RL)激活其潜在的世界知识。实验表明,SCOUT使Qwen2.5-3B-Instruct模型达到了0.86的平均得分,显著优于包括Gemini-2.5-Pro(0.60)在内的专有模型,同时节省了约60%的GPU小时消耗。

English abstract

While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial-and-error, which is computationally unsustainable for parameter-heavy LLMs operating in a high dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight "scouts" (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding LLMs. The collected trajectories are utilized to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while saving about 60% GPU hours consumption.

2602.17245 2026-06-09 cs.AI replaced

Web Agents Should Use Typed Actions Instead of Click-Based Browsing

Web 智能体应使用类型化动作而非基于点击的浏览

Linxi Jiang, Rui Xi, Zhijie Liu, Shuo Chen, Zhiqiang Lin, Suman Nath

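Affiliations * University of Science and Technology of China

AI summary Proposes replacing low-level interaction primitives with typed actions (web verbs) backed by a semantic layer to build reliable, auditable web agents, and illustrates the benefits with case studies.

Comments Accepted to the ICML 2026 Position Paper Track

Details
AI Chinese abstract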

这篇立场论文认为,构建可靠的智能体Web需要从低层交互原语转向由语义层支持的类型化动作。当前的Web智能体主要通过点击、按键和DOM操作运行,这导致长程行为脆弱、执行成本高且可审计性有限。我们提出web verbs作为该层的具体设计。一个verb将Web操作暴露为类型化函数,具有结构化输入、结构化输出和文档化行为,无论其背后是服务器端Web API还是维护的客户端工作流。Verb调用可以携带前置条件、后置条件、策略标签和日志钩子,使智能体能够合成具有显式控制流和数据流的简洁程序,并生成可检查的执行轨迹。通过代表性案例研究,我们展示了verb级组合如何产生正确、可复现的结果,而使用低层交互原语的浏览器智能体可能产生脆弱行为或错误推理。最后,我们呼吁采取行动,标准化、开发工具和社区流程,以使该语义层在Web规模上可部署且值得信赖。

English abstract

This position paper argues that building a reliable agentic Web requires shifting from low-level interaction primitives to typed actions supported by a semantic layer. Today's web agents primarily operate through clicks, keystrokes, and DOM manipulation, which leads to brittle long-horizon behavior, high execution cost, and limited auditability. We propose web verbs as a concrete design for this layer. A verb exposes a web operation as a typed function with structured inputs, structured outputs, and documented behavior, whether it is backed by a server-side Web API or a maintained client-side workflow. Verb calls can carry preconditions, postconditions, policy tags, and logging hooks, allowing agents to synthesize concise programs with explicit control flow and data flow and to produce checkable execution traces. Using representative case studies, we illustrate how verb-level composition can produce correct, reproducible outcomes, while browser agents using low-level interaction primitives may produce brittle behavior or incorrect reasoning. We conclude with a call to action on standardization, developer tooling, and community processes needed to make this semantic layer deployable and trustworthy at web scale.
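
A minimal Python sketch of what a verb with structured inputs, structured outputs, checked conditions, and a logging hook might look like. The verb `search_flights`, its types, and its conditions are hypothetical examples of the pattern; the paper does not define this interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SearchFlightsInput:
    origin: str
    destination: str
    date: str  # ISO 8601 date, e.g. "2026-07-01"


@dataclass(frozen=True)
class Flight:
    number: str
    price_usd: float


def search_flights(args: SearchFlightsInput,
                   backend: Callable[[SearchFlightsInput], list[Flight]],
                   log: list[str]) -> list[Flight]:
    """Hypothetical verb: typed input and output, checked conditions, logged call."""
    # Precondition: airport codes are three letters and differ.
    if len(args.origin) != 3 or len(args.destination) != 3:
        raise ValueError("airport codes must be three letters")
    if args.origin == args.destination:
        raise ValueError("origin and destination must differ")
    log.append(f"search_flights({args})")  # logging hook for a checkable trace
    flights = backend(args)  # server-side API or maintained client-side workflow
    # Postcondition: every result carries a non-negative price.
    if any(f.price_usd < 0 for f in flights):
        raise ValueError("backend returned a negative price")
    return flights
```

The caller sees the same typed signature whether `backend` wraps a Web API or a scripted browser workflow, which is the property the abstract attributes to verbs.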

2602.21889 2026-06-09 cs.AI cs.LG replaced

2-Step Agent: A Framework for the Interaction of a Decision Maker with AI Decision Support

2-Step Agent: 一个用于决策者与AI决策支持交互的框架

Otto Nyberg, Fausto Carcassi, Davide Tugnoli, Giovanni Cinà

Affiliations * Department of Medical Informatics, Amsterdam UMC, University of Amsterdam; Institute for Logic, Language and Computation, University of Amsterdam; Department of Mathematics and Earth Sciences, University of Trieste

AI summary Proposes the 2-Step Agent framework for studying how decision makers learn from and use machine-learning-based decision support, and shows that even under ideal conditions ML-DS can lead to worse outcomes than no decision support.

Comments 17 pages, 17 figures

Details
AI Chinese abstract

机器学习模型的预测支持人类在多个领域做出决策,包括高风险领域如医疗和司法。然而,我们仍然缺乏对决策者如何从基于机器学习的决策支持(ML-DS)中学习的清晰理解。在本文中,我们介绍了一个通用的计算框架,即2-Step Agent,以捕捉这一过程。由于机器学习模型的预测包含关于训练数据的信息,预测也可以用于推断。我们的框架模型了(i)新的观察预测如何影响理性贝叶斯代理的信念,以及(ii)这种信念变化如何影响因果效应的估计、下游决策和后续结果。除了框架本身外,我们还做出了三个贡献。首先,在线性高斯设定下,我们推导出了解决我们引入的具有挑战性的贝叶斯推断问题的可计算解,即代理从ML预测中推断。其次,我们通过实验确定了ML-DS有益的条件。第三,我们证明了即使ML模型是良好规范的,且代理是完全理性的,单个不一致的先验信念也可能使ML-DS导致比没有决策支持更差的下游结果。因此,即使在理想条件下,ML-DS也可能造成更大的伤害。

English abstract

Predictions from ML models support human decision making in several fields, including high-stakes ones such as healthcare and the judiciary. Yet, we still lack a clear understanding of how decision makers learn from ML-based decision support (ML-DS). In this paper, we introduce a general computational framework, the 2-Step Agent, to capture this process. As a prediction from an ML model contains information about the training data, a prediction can also be used for inference. Our framework models (i) how a prediction for a new observation affects the beliefs of a rational Bayesian agent, and (ii) how this change in beliefs affects the estimation of causal effect, the downstream decision, and the subsequent outcome. In addition to the framework itself, we make three contributions. First, for the linear Gaussian setting, we derive a tractable solution for the challenging Bayesian inference problem we introduced, i.e. one in which the agent infers from an ML prediction. Second, we experimentally identify conditions under which ML-DS is beneficial. Third, we show that a single misaligned prior belief can be sufficient for ML-DS to lead to worse downstream outcomes compared to no decision support even when the ML model is well-specified and the agent is perfectly rational. Hence, even under ideal conditions, ML-DS can do more harm than good.

2603.16020 2026-06-09 cs.AI replaced

IRAM-Omega-Q: A Computational Framework for Uncertainty Regulation in Adaptive Agents

IRAM-Omega-Q:适应性智能体在随机干扰下的不确定性调节计算框架

Veronique Ziegler

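Affiliations * Independent Researcher

AI summary Proposes the IRAM-Omega-Q framework, which combines a quantum-like state representation with closed-loop adaptive control and compares causal control orderings to examine how architecture affects uncertainty regulation.

Comments 14 pages, 6 figures

Details
AI Chinese abstract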

适应性智能体在不确定环境下不能仅仅优化任务输出:它们必须在噪声、扰动和变化条件下维持一个可行的内部状态。本文提出IRAM-Omega-Q框架,用于建模在随机干扰下适应性智能体的不确定性调节。该框架结合了类量子态表示与针对内部熵信号的闭环自适应控制,通过比较两种因果控制顺序,探讨不确定性调节的架构影响。

English abstract

Adaptive agents operating under uncertainty must do more than optimize task outputs: they must maintain a workable internal state under noise, perturbation, and changing conditions. This paper introduces IRAM-Omega-Q, a computational framework for modeling uncertainty regulation in adaptive agents under stochastic disturbance. The framework combines a quantum-like state representation with closed-loop adaptive control over an internal entropy signal. The quantum-like formalism is used instrumentally: the evolving state is a normalized complex amplitude vector, coherent evolution is propagated exactly as psi(t + Delta t) = exp(-i H Delta t) psi(t), and a derived density matrix supports entropy and coherence-gap analysis. Two causal control orderings are compared. In regulation-first (RF) ordering, adaptive regulation is available before current-cycle disturbance and attenuates incoming exposure; in disturbance-first (DF) ordering, current-cycle disturbance is received before a new regulatory response can be computed, and stabilization acts reactively. Publication-mode, matched-seed simulations show broadly comparable coherence-gap trajectories but lower sustained adaptive gain under RF. Susceptibility maps based on post-burn-in temporal fluctuations further show that DF shifts the critical initial-gain ridge toward larger values across multiple disturbance intervals. These results identify ordering as an architectural determinant of regulatory demand and threshold location within an otherwise shared regime structure.
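
The propagation rule psi(t + Delta t) = exp(-i H Delta t) psi(t) can be illustrated on a two-level system in plain Python. The choice H = omega * sigma_x is an assumption made so that the matrix exponential has a closed form; the abstract does not state the framework's Hamiltonian or dimension.

```python
import math


def evolve_two_level(psi, omega, dt):
    """Apply exp(-i H dt) for H = omega * sigma_x to a 2-component complex state.

    Uses exp(-i*theta*sigma_x) = cos(theta) * I - i * sin(theta) * sigma_x.
    """
    theta = omega * dt
    c, s = math.cos(theta), math.sin(theta)
    a, b = psi
    return (c * a - 1j * s * b, -1j * s * a + c * b)


def norm_sq(psi):
    """Squared norm of the amplitude vector; unitary evolution keeps it at 1."""
    return sum(abs(x) ** 2 for x in psi)


psi0 = (1.0, 0.0)
psi1 = evolve_two_level(psi0, omega=1.0, dt=math.pi / 2)
```

With these values the state rotates fully from the first basis component into the second while staying normalized, which is the property that makes the amplitude vector usable as a state representation.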

2605.05138 2026-06-09 cs.AI 版本更新

Executable World Models for ARC-AGI-3 in the Era of Coding Agents

可执行世界模型在编码智能体时代的ARC-AGI-3应用

Sergey Rodionov

发表机构 * SingularityNET

AI总结 提出一种编码智能体系统,通过维护可执行Python世界模型、验证观察、重构简化抽象和模型内规划,在ARC-AGI-3游戏中取得初步成果,GPT-5.5高推理下完全解决15个游戏。

Comments 13 pages. Accepted for publication at AGI-2026

详情
AI中文摘要

我们评估了一个用于ARC-AGI-3的初始编码智能体系统,其中智能体维护一个可执行的Python世界模型,根据先前的观察验证它,将其重构为更简单的抽象作为MDL类简单性偏好的实际代理,并在行动前通过模型进行规划。该系统有意保持直接:它使用脚本化控制器、预定义的世界模型接口、验证程序和执行计划器,但没有手工编码的游戏特定逻辑。面向智能体的提示、工作区和控制器不包含游戏特定代码、游戏特定提示、手工编码的启发式方法、隐藏解决方案或其他游戏特定信息;相同的智能体和提示用于所有游戏。由于编码智能体具有广泛的系统访问权限,我们审计了非预期的信息通道,描述了早期脆弱的框架,并解释了当前框架如何关闭观察到的泄漏通道,同时减少基准特定信息的暴露。我们报告了在25个公开ARC-AGI-3游戏上的结果。每次游戏从全新的智能体实例和干净的工作区开始,无法访问先前游戏的文件或对话状态。使用GPT-5.5高推理努力,智能体完全解决了15个游戏,平均每游戏RHAE为58.12%。使用GPT-5.4高推理努力,它完全解决了8个游戏,平均每游戏RHAE为41.29%。在尚未提供给我们的私有验证集上的性能仍有待测试。总体而言,这些结果提供了初步证据,表明验证器驱动的可执行世界模型是ARC-AGI-3智能体的一种有前景的方法。完整的运行工件与代码一起发布在https://github.com/astroseger/arc-3-agents-baseline1。

英文摘要

We evaluate an initial coding-agent system for ARC-AGI-3 in which the agent maintains an executable Python world model, verifies it against previous observations, refactors it toward simpler abstractions as a practical proxy for an MDL-like simplicity bias, and plans through the model before acting. The system is intentionally direct: it uses a scripted controller, predefined world-model interfaces, verifier programs, and a plan executor, but no hand-coded game-specific logic. The agent-facing prompts, workspace, and controller contain no game-specific code, game-specific prompts, hand-coded heuristics, hidden solutions, or other game-specific information; the same agent and prompts are used across games. Because the coding agent has broad system access, we audit unintended information channels, describe earlier vulnerable harnesses, and explain how the current harness closes observed leakage channels while reducing benchmark-specific information exposure. We report results on the 25 public ARC-AGI-3 games. Each playthrough starts from a fresh agent instance and clean workspace, with no access to files or conversation state from earlier playthroughs. With GPT-5.5 high reasoning effort, the agent fully solved 15 games and achieved a mean per-game RHAE of 58.12%. With GPT-5.4 high reasoning effort, it fully solved 8 games and achieved a mean per-game RHAE of 41.29%. Performance on the private validation set, which is not yet available to us, remains to be tested. Overall, the results provide preliminary evidence that verifier-driven executable world models are a promising approach for ARC-AGI-3 agents. Full run artifacts are released with the code at https://github.com/astroseger/arc-3-agents-baseline1.

2605.11484 2026-06-09 cs.AI 版本更新

Engagement Process: Rethinking the Temporal Interface of Action and Observation

参与过程:重新思考动作与观察的时间接口

Jialian Li, Yuchen Cao, Junhong Liu, Weiran Guo, Xutao Wang, Jiaming Song, Jiahao Zhang, Jie Chen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出参与过程(EP)模型,通过显式时间接口处理动作与观察的不同时间尺度交互,支持多速率协调和子系统组合,揭示隐藏的时间行为并使策略适应显式时间成本。

详情
AI中文摘要

在数字和物理环境中完成任务日益涉及复杂的时序交互,其中动作和观察在不同的时间尺度上展开,而非与固定的观察-动作步骤对齐。为了建模此类交互,我们提出参与过程(EP),一种继承POMDP决策理论结构的交互形式,使时间在动作-观察接口中显式化。EP将动作和观察表示为沿时间解耦的事件流,而非在固定决策步骤上配对的更新。此接口能够刻画单智能体的时间问题,如决策延迟、延迟反馈和持续动作,同时支持更丰富的智能体侧组织、多速率协调和子系统间的组合交互。在玩具环境、LLM智能体和学习实验中,EP揭示了被基于步骤的接口所隐藏的时间行为,并使策略能够在显式时间成本下进行适应。

英文摘要

Task completion in digital and physical environments increasingly involves complex temporal interaction, where actions and observations unfold over different time scales rather than align with fixed observation-action steps. To model such interactions, we propose Engagement Process (EP), an interaction formalism that inherits the decision-theoretic structure of POMDPs while making time explicit in the action-observation interface. EP represents actions and observations as decoupled event streams along time, rather than updates paired at fixed decision steps. This interface captures single-agent timing issues such as deliberation latency, delayed feedback, and persistent actions, while supporting richer agent-side organization, multi-rate coordination, and compositional interaction among subsystems. Across toy, LLM-agent, and learning experiments, EP exposes temporal behaviors hidden by step-based interfaces and enables policies to adapt under explicit time costs.
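The idea of decoupled, time-stamped event streams (as opposed to paired observation-action steps) can be pictured with a small Python sketch. The `Event` class, the payload strings, and the timestamps are hypothetical and are not taken from the paper; the sketch only shows that actions and observations need not alternate one-for-one once each carries its own time.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    time: float                               # only the timestamp is used for ordering
    kind: str = field(compare=False)          # "action" or "observation"
    payload: str = field(compare=False)

def merge_streams(actions, observations):
    """Merge two time-sorted event streams into one timeline."""
    return list(heapq.merge(actions, observations))

# Hypothetical streams: a slow download keeps reporting while a second action is issued.
actions = [
    Event(0.0, "action", "start_download"),
    Event(0.8, "action", "open_editor"),
]
observations = [
    Event(0.5, "observation", "progress=40%"),
    Event(1.2, "observation", "download_done"),
    Event(1.3, "observation", "editor_ready"),
]

timeline = merge_streams(actions, observations)
kinds = [e.kind for e in timeline]
for e in timeline:
    print(f"t={e.time:.1f} {e.kind:12s} {e.payload}")
```

In this toy timeline two observations arrive back to back and the feedback for the first action arrives after the second action was issued, which a fixed step-based interface would have to hide or approximate.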

2605.16309 2026-06-09 cs.AI cs.LG cs.MA 版本更新

ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning

ANNEAL:通过受控符号补丁学习适应大语言模型代理

Safayat Bin Hakim, Keyan Guo, Wenkai Tan, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校) University at Buffalo(布法罗大学) University of Colorado Boulder(科罗拉多大学博尔德分校) University of Colorado Colorado Springs(科罗拉多大学科罗拉多斯普林斯分校)

AI总结 ANNEAL通过受控符号补丁学习适应大语言模型代理,解决重复故障问题,其核心机制FDKA能定位责任操作符并生成类型补丁,实现持久结构修复,优于现有方法。

Comments Code Implementation: https://github.com/sbhakim/anneal-agents

详情
AI中文摘要

基于大语言模型的代理能够从单次执行错误中恢复,但当底层过程知识(操作符模式、前置条件和约束)未被修复时,它们会在同一故障上反复失败。现有自我进化方法通过更新提示、记忆或模型权重来弥补这一差距,但都没有直接修复编码任务执行方式的符号结构,也很少提供安全部署所需的治理保证。我们引入ANNEAL,一种神经符号代理,它将重复出现的失败转化为对过程知识图谱的受控符号编辑,而无需修改基础模型权重。其核心机制故障驱动知识获取(FDKA)先定位责任操作符,再通过受约束的LLM生成合成类型化补丁,并在提交前通过多维评分、符号护栏和金丝雀测试验证该提案。每条被接受的编辑都带有完整溯源和确定性回滚能力。在四个领域和27次多种子运行中,ANNEAL是唯一提交持久结构修复的被评估系统:ReAct和Reflexion等强基线虽能实现较高的单回合恢复率,但在重复故障上仍保留72%至100%的留出失败率,而ANNEAL在所测试的重复故障设置中将其降至0%。消融实验证实,移除FDKA会消除所有结构修复,并使成功率最多下降26.7个百分点。这些结果表明,受控符号修复为持久性故障消除提供了一种与权重级和提示级适应互补的范式。

英文摘要

LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge (operator schemas, preconditions, and constraints) remains unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi-dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs: strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72-100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring-failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight-level and prompt-level adaptation for persistent fault elimination.

2606.01619 2026-06-09 cs.AI cs.LG stat.ML 版本更新

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

ReSkill:在智能体强化学习中协调技能创建与策略优化

Zelin He, Haotian Lin, Boran Han, Wei Zhu, Haoyang Fang, Bernie Wang, Xuan Zhu, Runze Li, Matthew Reimherr

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出ReSkill框架,通过GRPO的组结构嵌入断言驱动技能创建、组内轨迹采样和自适应汤普森采样,实现技能与策略的协同进化,在多个领域超越现有方法。

详情
AI中文摘要

智能体强化学习使LLM智能体能够从环境奖励中持续改进,但由此产生的策略并未系统地积累可跨任务泛化的可重用策略。模块化技能可以提供此类可重用策略,然而现有的技能增强强化学习方法将技能创建与策略优化分离,存在采用与进化策略冲突的技能的风险。受Anthropic的Skill Creator启发,我们引入ReSkill,一种强化学习在环的技能创建框架,协调技能进化与策略学习。ReSkill利用GRPO的组结构自然嵌入三种机制,仅需少量额外开销:(1)断言驱动的技能创建器,从过去经验中诊断失败并提出基于条件的触发式技能修订;(2)组内轨迹采样,实现技能版本的可控比较,捕获哪个版本最能支持策略的持续学习;(3)自适应折扣的汤普森采样,在策略进化过程中平衡技能版本选择的探索与利用。在多个领域,ReSkill始终优于现有的基于记忆和技能的强化学习方法,在未见任务上提升最大。对技能生命周期的分析显示,随着策略改进,技能被自动创建、测试、精炼和修剪,展示了协调的技能-策略协同进化。

英文摘要

Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not systematically accumulate reusable strategies that generalize across tasks. Modular skills can provide such reusable strategies, yet existing skill-augmented RL methods decouple skill creation from policy optimization, risking adopting skills that conflict with the evolving policy. Inspired by Anthropic's Skill Creator, we introduce ReSkill, an RL-in-the-loop skill creation framework that reconciles skill evolution with policy learning. ReSkill exploits the group-wise structure of GRPO to naturally embed three mechanisms with only marginal additional overhead: (1) an assertion-driven skill creator that diagnoses failures from past experience and proposes conditional, trigger-based skill revisions; (2) within-group rollout sampling that enables controlled comparison of skill versions, capturing which version best supports the policy's ongoing learning; and (3) Thompson Sampling with adaptive discounting to balance exploration and exploitation in skill version selection as the policy evolves. Across several domains, ReSkill consistently outperforms existing memory and skill-based RL methods, with the largest gains on unseen tasks. Analysis of the skill lifecycle shows skills being automatically created, tested, refined, and pruned as the policy improves, demonstrating reconciled skill-policy co-evolution.
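Mechanism (3), Thompson Sampling with discounting over skill versions, can be sketched as a Beta-Bernoulli bandit whose evidence decays toward the prior so that old outcomes count less as the policy changes. This is a generic discounted Thompson sampler under stated assumptions: the fixed discount `gamma`, the binary reward, and the two "skill versions" with success rates 0.2 and 0.8 are hypothetical, and the paper's adaptive discounting rule is not described in the abstract.

```python
import random

class DiscountedThompson:
    """Beta-Bernoulli Thompson sampling with exponential forgetting (illustrative)."""

    def __init__(self, n_arms, gamma=0.9, seed=0):
        self.gamma = gamma                    # assumed fixed discount factor
        self.alpha = [1.0] * n_arms           # Beta(1, 1) prior per skill version
        self.beta = [1.0] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        """Sample a success rate per arm and pick the arm with the largest sample."""
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        """Decay all evidence toward the prior, then add the new outcome."""
        for i in range(len(self.alpha)):
            self.alpha[i] = 1.0 + self.gamma * (self.alpha[i] - 1.0)
            self.beta[i] = 1.0 + self.gamma * (self.beta[i] - 1.0)
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward

# Hypothetical environment: skill version 1 succeeds more often than version 0.
env_rng = random.Random(1)
success_prob = [0.2, 0.8]
sampler = DiscountedThompson(n_arms=2, gamma=0.9, seed=0)
picks = [0, 0]
for _ in range(300):
    arm = sampler.select()
    reward = 1.0 if env_rng.random() < success_prob[arm] else 0.0
    sampler.update(arm, reward)
    picks[arm] += 1
print(picks)
```

Because unused arms decay back toward the uniform prior, the sampler keeps re-testing a skill version that was abandoned earlier, which is the behavior one would want when the underlying policy, and therefore each version's usefulness, keeps shifting.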

2606.04421 2026-06-09 cs.AI cs.LG 版本更新

Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

Trivium: 时间遗憾作为因果记忆控制器的一等目标

Edward Y. Chang

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出将长期时间遗憾作为一等目标,与结果遗憾和认知遗憾共同构成因果记忆控制器的可证伪失败分析框架,证明仅基于结果的学习在结果遗憾降至零后时间校准偏差仍可线性持续,而借助持久因果日志和预算探测,总探测复杂度为对数级。

Comments 62 pages, 12 tables, 12 figures

详情
AI中文摘要

许多当前的智能体系统和LLM管道通过优化结果奖励来纠正错误。这仅解决了失败的“什么”:当结果偏离预测时,不匹配的“为什么”和“何时”没有被系统地记录、审查或纠正,因此相同的错误可能反复出现。我们认为这是一个结构性问题,而不仅仅是模型容量问题。我们提出将长期时间遗憾作为一等目标,与结果遗憾和工作因果模型上的认知遗憾并列。时间遗憾捕捉失败持续的时间:在纠正之前,一个校准错误的因果模型被容忍了多久。认知遗憾捕捉失败持续的原因:工作因果模型中的残余不确定性或错误。这三个遗憾共同给出了一个可证伪的说明,关于一个长期存在的智能体可能失败的原因、内容和时间。将智能体建模为E个片段的流,我们在显式因果探测、持久性和可检测性假设下证明了三个条件结果。首先,在观测等价混淆下,仅基于结果的学习无法在没有干预通道的情况下区分因果结构和虚假结构,因此时间校准偏差可以在结果遗憾被降至零后仍线性持续。其次,使用持久因果日志和预算探测,总探测复杂度是片段范围的对数,导致O(log E)的时间遗憾。第三,在K个可检测变化点下,速率扩展为O(K log E)。我们实例化了Trivium并预注册了五个可证伪预测。在CausalBench-Seq上,Trivium遵循预测的对数包络线,而仅基于结果的基线线性增长。一个真实LLM流的初步外部有效性证据跨越了一个完整的E=500运行和三个E=100前沿模型试点。这里的自学习意味着修正外部因果模型,而不是重新训练LLM权重。

英文摘要

Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when an outcome diverges from prediction, the why and when of the mismatch are not systematically logged, reviewed, or corrected, so the same error can recur episode after episode. We argue that this is a structural problem, not merely a model-capacity one. We propose long-horizon temporal regret as a first-class objective alongside outcome regret and epistemic regret over the working causal model. Temporal regret captures when failure persists: how long a miscalibrated causal model is tolerated before correction. Epistemic regret captures why failure persists: residual uncertainty or error in the working causal model. Together, the three regrets give a falsifiable account of what, why, and when a long-lived agent can fail. Modeling the agent as a stream of E episodes, we prove three conditional results under explicit causal-probing, persistence, and detectability assumptions. First, under observationally equivalent confounding, outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, so temporal miscalibration can persist linearly even after outcome regret is driven to zero. Second, with a persistent causal log and budgeted probes, total probe complexity is logarithmic in the episode horizon, inducing O(log E) temporal regret. Third, under K detectable change-points, the rate extends to O(K log E). We instantiate Trivium and pre-register five falsifiable predictions. On CausalBench-Seq, Trivium follows the predicted logarithmic envelope while outcome-only baselines grow linearly. A pilot real-LLM stream provides preliminary external-validity evidence across one full E = 500 run and three E = 100 frontier-model pilots. Self-learning here means revising an external causal model, not retraining LLM weights.

2606.04627 2026-06-09 cs.AI 版本更新

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

MIRAGE: 具有隐式推理和生成世界模型的移动智能体

Zhichao Yang, Yuanze Hu, Haojie Hao, Longkun Hao, Dongshuo Huang, Hongyu Lin, Gen Li, Lanqing Hong, Yihang Lou, Yan Bai

发表机构 * Beihang University(北京航空航天大学) Northwestern Polytechnical University(西北工业大学) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) National University of Singapore(新加坡国立大学) Peking University(北京大学)

AI总结 提出MIRAGE框架,通过从显式推理轨迹学习连续潜在表示,使移动智能体能够内部推理并预测未来屏幕状态,在减少生成token的同时提升执行效率。

详情
AI中文摘要

移动智能体越来越需要根据截图和语言目标操作日常应用,而可靠的控制要求对屏幕可供性、多步导航和未来状态变化进行推理。然而,许多智能体将这种计算外部化为冗长的文本推理链,这减慢了交互速度,增加了监督成本,并使部署复杂化。我们引入了MIRAGE,一个从可见的文本推理轨迹中学习连续潜在推理表示的框架。MIRAGE将显式推理转化为紧凑的隐藏状态,使智能体能够在内部推理而无需解码冗长的推理依据。它还包含一个生成世界模型目标:潜在推理向量与未来截图对齐,鼓励智能体在行动前预测即将到来的界面状态。这使隐藏计算既成为压缩的思维表示,又成为环境动态的前瞻模型。在推理时,MIRAGE在连续潜在空间中进行推理,减少了token生成,同时提高了执行效率。在AndroidWorld上,MIRAGE在4B消融实验中与显式思维链监督微调持平,解码token预算降低了3-5倍,并比可比的指令微调基线提高了10.2个点;在AndroidControl上,它改进了动作定位,同时生成的token减少了75%以上。

英文摘要

Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. However, many agents externalize this computation as long textual chains of thought, which slows interaction, increases supervision cost, and complicates deployment. We introduce MIRAGE, a framework that learns continuous latent reasoning representations from visible textual reasoning traces. MIRAGE transfers explicit reasoning into compact hidden states, enabling the agent to reason internally without decoding long rationales. It also incorporates a generative world-model objective: latent reasoning vectors are aligned with future screenshots, encouraging the agent to anticipate upcoming interface states before acting. This turns hidden computation into both a compressed thought representation and a forward-looking model of environment dynamics. At inference time, MIRAGE reasons in continuous latent space, reducing token generation while improving execution efficiency. On AndroidWorld, MIRAGE matches explicit chain-of-thought supervised fine-tuning in the 4B ablation with a 3-5x lower decoded-token budget and improves a comparable instruction-tuned baseline by 10.2 points; on AndroidControl, it improves action grounding while generating over 75% fewer tokens.

2606.06076 2026-06-09 cs.AI cs.CV 版本更新

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

通过模态差距感知自蒸馏从符号状态学习视觉空间规划

Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou, Xiu Li

发表机构 * Tsinghua University(清华大学) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出MGSD两阶段框架,通过冷启动接地和特权教师蒸馏弥合视觉与符号规划之间的模态差距,在视觉规划基准上显著提升性能。

Comments 17 pages, preprint

详情
AI中文摘要

尽管视觉-语言模型在通用多模态理解方面表现出色,但在视觉空间规划上仍存在困难。我们将其归因于感知-推理模态差距:视觉规划要求模型从像素中推断潜在状态结构,然后对恢复的结构进行推理以产生有效动作,而符号规划直接利用显式对象和约束。这造成了视觉状态恢复和多步规划的双重瓶颈。为解决此问题,我们提出MGSD,一种两阶段模态差距感知自蒸馏框架。首先,冷启动接地阶段为视觉学生模型配备可靠的状态表示,最小化早期感知噪声。其次,特权教师通过在线策略蒸馏转移规划能力,使用显式符号状态监督学生自身的视觉 rollout 前缀。关键在于,符号数据仅在训练期间使用,推理完全基于视觉。在视觉规划基准上的实验表明,MGSD在4B和8B骨干网络上均持续提升视觉规划性能,宏观平均值分别提高19.3%和18.4%。所得模型缩小了与符号输入上限的差距,而消融和诊断实验证实改进来自视觉状态恢复和最优路径推理。这些结果表明,模态差距感知自蒸馏不仅改善了模型感知可行动状态的方式,也改善了它们在推断结构上进行规划的能力。代码见 https://github.com/Oranger-l/MGSD。

英文摘要

While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger-l/MGSD.

2505.21457 2026-06-09 cs.CV cs.AI 版本更新

ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning

ACTIVE-o3:通过纯强化学习赋予多模态大语言模型主动感知能力

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Mingyu Liu, Zheng Huang, Anzhou Li, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出ACTIVE-o3框架,基于GRPO强化学习,通过模块化感知-动作设计和双形式奖励,使MLLM自主学会高效准确的区域选择策略,在开放世界和领域特定任务中显著提升主动感知能力。

Comments Accepted to ICML 2026. Project page: https://aim-uofa.github.io/ACTIVE-o3

详情
AI中文摘要

主动视觉,也称为主动感知,指主动选择观察位置和方式以收集任务相关信息。它是人类和高级具身智能体高效感知与决策的关键组成部分。随着多模态大语言模型(MLLM)成为机器人系统中的核心规划器,缺乏赋予MLLM主动感知能力的方法已成为一个关键缺口。我们首先对基于MLLM的主动感知任务进行了系统定义,并表明GPT-o3的缩放策略可视为一个特例,尽管它存在效率低和区域选择不准确的问题。为解决这些问题,我们提出ACTIVE-o3,一个基于GRPO构建的强化学习框架,赋予MLLM主动感知能力。利用模块化感知-动作设计和双形式奖励,ACTIVE-o3在没有显式区域选择监督的情况下自主学会高效且稳定的区域选择策略。我们进一步建立了一个全面的基准测试,涵盖开放世界任务(包括小目标和密集目标定位)以及领域特定场景(包括遥感、自动驾驶和交互式分割)。实验结果表明,与基线相比,ACTIVE-o3显著增强了主动感知能力。此外,我们表明该框架不仅保留了模型的通用理解能力,还可作为利用感知数据的代理任务,进一步提升在RealWorldQA和MME等基准测试上的性能。

英文摘要

Active vision, also known as active perception, refers to actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. With the rise of Multimodal Large Language Models (MLLMs) as central planners in robotic systems, the lack of methods for equipping MLLMs with active perception has become a key gap. We first provide a systematic definition of MLLM-based active perception tasks and show that GPT-o3's zoom-in strategy can be viewed as a special case, though it suffers from low efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-o3, a reinforcement learning framework built on GRPO that equips MLLMs with active perception capabilities. Leveraging a modular sensing-action design and a dual-form reward, ACTIVE-o3 autonomously learns efficient and stable region selection strategies without explicit region-selection supervision. We further establish a comprehensive benchmark covering both open-world tasks, including small- and dense-object grounding, and domain-specific scenarios, including remote sensing, autonomous driving, and interactive segmentation. Experimental results demonstrate that ACTIVE-o3 significantly enhances active perception capabilities compared to baselines. Moreover, we show that our framework not only preserves the model's general understanding ability but can also serve as a proxy task for leveraging perception data, further improving performance on benchmarks such as RealWorldQA and MME.

2508.06659 2026-06-09 cs.LG cs.AI 版本更新

In-Context Reinforcement Learning via Communicative World Models

通过通信世界模型进行上下文强化学习

Fernando Martinez-Lopez, Tao Li, Yingdong Lu, Juntao Chen

发表机构 * Department of Computer and Information Sciences, Fordham University(福特汉姆大学计算机与信息科学系) Department of Systems Engineering, City University of Hong Kong(香港城市大学系统工程系) IBM Research(IBM研究院)

AI总结 提出CORAL框架,通过将潜在表示学习与控制分离,利用信息代理预训练世界模型并生成通信消息,使控制代理实现零样本适应和样本效率提升。

详情
AI中文摘要

强化学习(RL)代理通常难以在不更新参数的情况下泛化到新任务和上下文,主要是因为它们学到的表示和策略过度拟合于训练环境的特定性。为了提升代理的上下文RL(ICRL)能力,本文将ICRL形式化为一个双代理涌现通信问题,并引入了CORAL(用于自适应RL的通信表示)框架,该框架通过功能性地分离潜在表示学习与控制来学习可迁移的通信上下文。在CORAL中,信息代理(IA)在多样化的任务分布上作为世界模型进行预训练。其目标不是直接最大化回报,而是进行世界建模并将其理解提炼为简洁的消息。涌现通信协议由一种新颖的因果影响损失塑造,该损失衡量消息对下一动作的影响。在部署期间,预训练的IA作为固定上下文提供者服务于新的控制代理(CA),后者通过解释提供的通信上下文来学习解决任务。我们的实验表明,这种方法使CA能够实现样本效率的显著提升,并在多样化的在线和离线环境中借助预训练的IA成功进行零样本适应,验证了学习可迁移通信表示的有效性。

英文摘要

Reinforcement learning (RL) agents often struggle to generalize to new tasks and contexts without updating their parameters, mainly because their learned representations and policies are overfit to the specifics of their training environments. To boost agents' in-context RL (ICRL) ability, this work formulates ICRL as a two-agent emergent communication problem and introduces CORAL (Communicative Representation for Adaptive RL), a framework that learns a transferable communicative context by functionally separating latent representation learning from control. In CORAL, an Information Agent (IA) is pre-trained as a world model on a diverse distribution of tasks. Its objective is not direct return maximization, but world modeling and distilling its understanding into concise messages. The emergent communication protocol is shaped by a novel Causal Influence Loss, which measures the effect that the message has on the next action. During deployment, the previously trained IA serves as a fixed contextualizer for a new Control Agent (CA), which learns to solve tasks by interpreting the provided communicative context. Our experiments demonstrate that this approach enables the CA to achieve significant gains in sample efficiency and successfully perform zero-shot adaptation with the help of pre-trained IA in diverse online and offline environments, validating the efficacy of learning a transferable communicative representation.

2601.18510 2026-06-09 cs.LG cs.AI 版本更新

Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

即时强化学习:无需梯度更新的LLM智能体持续学习

Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出JitRL框架,通过动态非参数记忆和即时优势估计,无需梯度更新即可实现LLM智能体的测试时策略优化,在WebArena和Jericho上达到训练无关方法最优,且性能超越微调方法,成本降低30倍以上。

详情
AI中文摘要

尽管大型语言模型(LLM)智能体在通用任务上表现出色,但由于部署后权重冻结,它们在持续适应方面存在固有困难。传统的强化学习(RL)提供了一种解决方案,但会带来高昂的计算成本和灾难性遗忘的风险。我们引入了即时强化学习(JitRL),这是一个无需训练的框架,能够在没有任何梯度更新的情况下实现测试时策略优化。JitRL维护一个动态的非参数经验记忆,并检索相关轨迹以即时估计动作优势。这些估计随后用于直接调制LLM的输出logits。我们从理论上证明,这种加法更新规则是KL约束策略优化目标的精确闭式解。在WebArena和Jericho上的大量实验表明,JitRL在训练无关方法中建立了新的最先进水平。关键的是,JitRL在性能上超越了计算昂贵的微调方法(如WebRL),同时将货币成本降低了30倍以上,为持续学习智能体提供了一条可扩展的路径。代码可在https://github.com/liushiliushi/JitRL获取。

英文摘要

While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation because their weights are frozen after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods. Crucially, JitRL outperforms computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.
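The claim that an additive logit update solves a KL-constrained objective can be checked numerically on a toy problem. The sketch below assumes the standard form of that objective, maximizing E_pi[A(a)] - beta * KL(pi || pi0), whose maximizer is pi*(a) proportional to pi0(a) * exp(A(a) / beta), which is equivalent to adding A / beta to the reference logits. The logits, advantages, and `beta` are made-up values, and the paper's exact objective and advantage estimator may differ.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def adjusted_policy(logits, advantages, beta):
    """Additive update: new logits = reference logits + advantages / beta."""
    return softmax(logits + advantages / beta)

def objective(pi, pi0, advantages, beta):
    """Expected advantage minus beta times KL(pi || pi0)."""
    kl = float(np.sum(pi * np.log(pi / pi0)))
    return float(pi @ advantages) - beta * kl

# Hypothetical three-action example.
logits = np.array([2.0, 1.0, 0.0])        # reference (frozen LLM) logits
advantages = np.array([-0.5, 0.0, 1.0])   # advantages estimated from retrieved experience
beta = 0.5

pi0 = softmax(logits)
pi_star = adjusted_policy(logits, advantages, beta)

# Compare against random candidate policies on the simplex.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=1000)
best_random = max(objective(p, pi0, advantages, beta) for p in candidates)

print(pi0, pi_star)
print(objective(pi_star, pi0, advantages, beta), best_random)
```

No sampled candidate should beat `pi_star`, since the objective is strictly concave in the policy; the update also shifts probability toward the action with the highest estimated advantage without touching model weights.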

2605.22781 2026-06-09 cs.OS cs.AI 版本更新

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

DeltaBox: 通过毫秒级沙箱检查点/回滚扩展有状态AI代理

Yunpeng Dong, Jingkai He, Shiqi Liu, Yuze Hou, Dong Du, Zhonghu Xu, Si Yu, Baochuan Yang, Yubin Xia, Haibo Chen

发表机构 * Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University(并行与分布式系统研究所,上海交通大学) Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China(领域特定操作系统工程研究中心,中华人民共和国教育部,中国) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 本文提出DeltaBox,一种通过DeltaFS和DeltaCR机制实现毫秒级检查点/回滚的新型AI代理沙箱,解决了传统方法在高频状态探索中的延迟问题。

详情
AI中文摘要

LLM驱动的AI代理需要高频状态探索(例如测试时的树搜索和强化学习),依赖于对完整沙箱状态的快速检查点和回滚(C/R),包括文件和进程状态(例如内存、上下文等)。现有机制需要完整复制状态,导致每次C/R的延迟达到数百毫秒到秒级,严重限制了深度搜索和大规模扇出。本文观察到AI代理中的相邻检查点高度相似,因此沙箱应仅复制连续检查点之间的变化(关键洞察)。然而,实现这一想法并不简单,主要是由于缺乏操作系统支持。本文提出新的操作系统级抽象DeltaState,通过两个协同设计的操作系统机制,为AI代理实现基于变化的事务性C/R。首先,DeltaFS将文件状态组织成分层结构,在检查点时动态冻结可写层并插入新层,将文件更新转换为写时复制,使回滚成为简单的层切换。其次,DeltaCR通过增量转储实现基于变化的进程状态C/R,并通过绕过传统管道、直接从冻结的模板进程fork()来加速回滚。我们随后提出DeltaBox,一种新型的代理沙箱,通过这两种新机制实现毫秒级的C/R。在SWE-bench和RL微基准测试中的评估显示,DeltaBox以毫秒级延迟(分别为14ms和5ms)完成检查点和回滚,使代理在固定时间预算内能够探索多得多的节点。

英文摘要

LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g., memory, contexts, etc.). Existing mechanisms duplicate the entire state, causing hundreds of milliseconds to seconds of latency per C/R, which severely bottlenecks deep search and large-scale fan-outs. This paper observes that subsequent checkpoints in AI agents are highly similar. Therefore, instead of full duplication, a sandbox should only duplicate the changes between consecutive checkpoints (Key Insight). However, realizing this idea is non-trivial, mainly due to missing OS support. This paper proposes a new OS-level abstraction, DeltaState, to enable change-based transactional C/R for AI agents with two co-designed OS mechanisms. First, DeltaFS enables change-based filesystem C/R by organizing the file states into layers and dynamically freezing the writable layer and inserting a new one during checkpoint, reducing file updates to copy-on-write, and making rollback a simple layer switch. Second, DeltaCR enables change-based process state C/R using incremental dumps, and accelerates rollback by bypassing traditional pipelines to directly fork() from a frozen template process. We then present DeltaBox, a novel agent sandbox achieving millisecond-level C/R through the two new mechanisms. Evaluations on SWE-bench and RL micro-benchmarks show DeltaBox completes checkpoint and rollback with millisecond-level latency (14ms and 5ms, respectively), empowering agents to explore substantially more nodes under fixed time budgets.
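The layering idea attributed to DeltaFS (freeze the writable layer at checkpoint, push a fresh one, and roll back by switching layers) can be mimicked with an in-memory toy. The `LayeredStore` class below is a hypothetical illustration using Python dictionaries; it is not DeltaFS, does not touch a real filesystem, and omits deletions (which a real layered filesystem would handle with whiteout entries).

```python
class LayeredStore:
    """Toy copy-on-write key/value store with O(1) checkpoints (illustrative only)."""

    def __init__(self):
        self.layers = [{}]                    # the last layer is the writable one

    def write(self, path, data):
        self.layers[-1][path] = data          # writes never modify frozen layers

    def read(self, path):
        for layer in reversed(self.layers):   # newest layer wins
            if path in layer:
                return layer[path]
        raise KeyError(path)

    def checkpoint(self):
        """Freeze the current writable layer and start a new empty one."""
        self.layers.append({})
        return len(self.layers) - 1           # id = number of frozen layers

    def rollback(self, ckpt):
        """Drop every layer written after the checkpoint and reopen a fresh one."""
        self.layers = self.layers[:ckpt] + [{}]

store = LayeredStore()
store.write("/app/config", "v1")
ckpt = store.checkpoint()                     # no data is copied here
store.write("/app/config", "v2")
store.write("/tmp/scratch", "junk")
before = store.read("/app/config")            # sees the newer value
store.rollback(ckpt)
after = store.read("/app/config")             # back to the checkpointed value
print(before, after)
```

Checkpoint cost in this toy is independent of how much state exists, since only a new empty layer is created; the price is that reads may have to search several layers, a trade-off that layered designs typically accept.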

2605.30407 2026-06-09 cs.CL cs.AI cs.IR cs.LG 版本更新

Exploring Autonomous Agentic Data Engineering for Model Specialization

探索用于模型专业化的自主智能体数据工程

Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang, Yuqi Zhu, Jintian Zhang, Runnan Fang, Kewei Xu, Ye Liu, Zheng Wei, Jiang Bian, Zang Li, Shumin Deng

发表机构 * Zhejiang University(浙江大学) Platform and Content Group, Tencent(腾讯平台与内容部)

AI总结 本文提出自主智能体数据工程任务,让LLM作为自主数据工程师,通过端到端数据策划驱动模型专业化,实验显示GPT-5.2通过迭代数据适应使学生模型性能提升57.29%。

Comments Work in progress

详情
AI中文摘要

大型语言模型(LLM)在通用任务上表现出色,但往往难以适应没有高质量领域特定数据的专业领域。现有的基于LLM的数据策划方法主要依赖人工设计的工作流程,尚未检验LLM能否自主执行端到端的数据工程流水线以实现模型专业化。我们形式化了自主智能体数据工程,这是一个新任务,旨在评估LLM作为自主数据工程师,通过端到端数据策划驱动模型专业化。我们将数据视为可优化组件,研究能够跨多个领域规划、生成和迭代优化训练数据的智能体,并以训练后性能提升为指导。实验表明,自主LLM数据工程师带来了显著收益,GPT-5.2构建的训练课程使学生模型性能提升了57.29%,完全通过迭代的智能体驱动数据适应实现。通过揭示潜力和瓶颈,我们的研究将自主数据工程确立为一种可衡量的能力,并为智能体驱动的模型专业化指明了道路(代码将在 https://github.com/zjunlp/DataAgent 发布)。

英文摘要

Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primarily rely on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization (Code will be released at https://github.com/zjunlp/DataAgent).

2508.15030 2026-06-09 cs.AI 版本更新

Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism

Collab-REC:一种基于LLM的代理框架,用于平衡旅游推荐

Ashmi Banerjee, Adithi Satish, Fitri Nur Aisyah, Wolfgang Wörndl, Yashar Deldjoo

发表机构 * Technical University of Munich(慕尼黑技术大学) Polytechnic University of Bari(巴里理工大学)

AI总结 提出一种多代理框架Collab-REC,通过三个LLM代理(个性化、流行度、可持续性)生成城市建议,并由非LLM调节器迭代优化,以缓解流行度偏差并提高推荐多样性。

详情
AI中文摘要

我们提出了COLLAB-REC,一个多代理框架,旨在抵消流行度偏差并提高旅游推荐的多样性。在我们的设置中,三个基于LLM的代理(个性化、流行度和可持续性)从不同角度生成城市建议。然后,一个非LLM调节器通过迭代约束优化合并并完善这些提议,确保每个代理的观点得到体现,同时减少虚假或重复输出。使用不同规模和模型家族的LLM对欧洲城市查询进行的大量离线实验表明,与单代理基线相比,COLLAB-REC提高了多样性和整体相关性,同时揭示了经常被忽视的较少访问的目的地。这种平衡的、上下文感知的方法更好地捕捉了更广泛的用户和系统级考虑因素,凸显了多利益相关者协作在LLM驱动的推荐系统中的潜力。代码、数据和其他工件可在此处获取:https://github.com/ashmibanerjee/collab-rec,而使用的提示包含在附录中。

英文摘要

We propose COLLAB-REC, a multi-agent framework designed to counteract popularity bias and improve diversity in tourism recommendations. In our setup, three LLM-based agents (Personalization, Popularity, and Sustainability) generate city suggestions from different perspectives. A non-LLM moderator then merges and refines these proposals through iterative constrained refinement, ensuring that each agent's viewpoint is represented while reducing spurious or repeated outputs. Extensive offline experiments on European city queries using LLMs of different sizes and model families show that COLLAB-REC improves both diversity and overall relevance compared to a single-agent baseline, while surfacing lesser-visited destinations that are often overlooked. This balanced, context-aware approach better captures a broader range of user and system-level considerations, highlighting the potential of multi-stakeholder collaboration in LLM-driven recommender systems. Code, data, and other artifacts are available here: https://github.com/ashmibanerjee/collab-rec, while the prompts used are included in the appendix.

2. 知识表示、推理与符号AI 12 篇

2606.08477 2026-06-09 cs.AI 新提交

A Variability-Based Framework for Interpretable Naming in Formal and Relational Concept Analysis

基于可变性的框架:形式概念分析与关系概念分析中的可解释命名

Alain Gutierrez, Marianne Huchard, Pierre Martin, André Miralles, Violaine Prince

发表机构 * LIRMM, Univ. Montpellier, CNRS(法国国家科学研究中心蒙彼利埃大学计算机科学、机器人及微电子实验室) CIRAD, UPR AIDA(法国农业国际合作研究发展中心AIDA研究单元) AIDA, CIRAD, Univ. Montpellier(法国农业国际合作研究发展中心AIDA研究单元,蒙彼利埃大学) INRAE - UMR TETIS - Territoires, Environnement(法国国家农业、食品与环境研究院TETIS联合研究单元)

AI总结 针对形式概念分析和关系概念分析中概念命名缺乏可解释性的问题,提出一种基于可变性的LLM辅助命名框架,通过控制信息源生成可读名称,并在披萨店数据集上验证其有效性。

详情
AI中文摘要

从符号数据中提取知识通常会产生形式上定义但用户无法立即解释的抽象概念。形式概念分析(FCA)和关系概念分析(RCA)为此问题提供了代表性场景:它们根据对象描述和关系生成明确的概念结构、蕴含关系和关系依赖。尽管这些结构在设计上是可解释的,但概念通常由技术标签标识,这限制了它们作为人类可解释知识单元的使用。因此,为这些概念赋予有意义的名称是领域专家进行解释、导航、验证和复用的关键问题。本文从符号知识表示的角度研究FCA和RCA中的概念命名。我们首先描述了命名生成的符号抽象所涉及的语言和术语挑战,包括歧义性、区分性、简洁性以及相关概念间的一致性。然后,我们提出一个可配置的LLM辅助概念命名框架。该框架依赖于一个可变性模型,该模型控制命名过程中暴露的信息源,如内涵、外延、继承信息、邻近概念、蕴含关系和关系属性。从而明确从形式概念描述到人类可读名称的语义选择。该方法作为概念验证在披萨店领域的小型关系数据集上进行了说明。该示例展示了不同配置如何影响LLM建议的名称,以及命名可变性如何揭示解释选择、关系依赖以及底层符号数据中可能的建模问题。

英文摘要

Knowledge extraction from symbolic data often produces abstractions that are formally defined but not immediately interpretable by users. Formal Concept Analysis (FCA) and Relational Concept Analysis (RCA) provide representative settings for this issue: they generate explicit conceptual structures, implications, and relational dependencies from object descriptions and relations. Although these structures are explainable by design, their concepts are often identified by technical labels, which limits their use as human-interpretable knowledge units. Assigning meaningful names to such concepts is therefore a key issue for interpretation, navigation, validation, and reuse by domain experts. This paper investigates concept naming in FCA and RCA from a symbolic knowledge representation perspective. We first characterize the linguistic and terminological challenges involved in naming generated symbolic abstractions, including ambiguity, discrimination, concision, and consistency across related concepts. We then propose a configurable framework for LLM-assisted concept naming. The framework relies on a variability model that controls which sources of information are exposed during naming, such as intent, extent, inherited information, neighboring concepts, implications, and relational attributes. It thereby makes explicit the semantic choices involved in moving from formal concept descriptions to human-readable names. The approach is illustrated as a proof of concept on a small relational dataset in the pizzeria domain. This illustration shows how different configurations influence the names suggested by an LLM, and how naming variability can reveal interpretation choices, relational dependencies, and possible modeling issues in the underlying symbolic data.
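
The abstract's two ingredients, formal concepts as (extent, intent) pairs and a variability model that decides which sources reach the LLM, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy pizza context, the function names, and the prompt wording are hypothetical and are not taken from the paper, and no LLM is actually called.

```python
from itertools import combinations

# Hypothetical toy context (not the paper's dataset): pizzas and their toppings.
CONTEXT = {
    "Margherita": {"tomato", "mozzarella"},
    "Funghi": {"tomato", "mozzarella", "mushroom"},
    "Regina": {"tomato", "mozzarella", "mushroom", "ham"},
    "Bianca": {"mozzarella", "cream"},
}
ALL_ATTRIBUTES = frozenset().union(*CONTEXT.values())


def intent_of(objects):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    return frozenset(a for a in ALL_ATTRIBUTES if all(a in CONTEXT[o] for o in objects))


def extent_of(attributes):
    """Objects that have every attribute in `attributes`."""
    return frozenset(o for o, owned in CONTEXT.items() if attributes <= owned)


def formal_concepts():
    """Close every subset of objects; brute force, only sensible for tiny contexts."""
    found = set()
    for size in range(len(CONTEXT) + 1):
        for group in combinations(CONTEXT, size):
            intent = intent_of(group)
            found.add((extent_of(intent), intent))
    return found


def naming_prompt(extent, intent, sources):
    """Build a prompt that exposes only the selected information sources."""
    lines = ["Suggest a short, human-readable name for this concept."]
    if "intent" in sources:
        lines.append("Shared attributes: " + ", ".join(sorted(intent)))
    if "extent" in sources:
        lines.append("Objects: " + ", ".join(sorted(extent)))
    return "\n".join(lines)


if __name__ == "__main__":
    for extent, intent in sorted(formal_concepts(), key=lambda c: len(c[0])):
        print(naming_prompt(extent, intent, sources={"intent"}))
        print()
```

Changing `sources` to `{"intent", "extent"}` or `{"extent"}` gives the same concept a different prompt, which is the kind of configuration choice the variability model is meant to make explicit.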

2606.08503 2026-06-09 cs.AI cs.LO 新提交

Standpoint Logics with Defeasible Beliefs

带有可废止信念的立场逻辑

Nicholas Leisegang, Thomas Meyer, Sebastian Rudolph

发表机构 * University of Cape Town(开普敦大学) CAIR, South Africa(南非人工智能研究中心) Technische Universität Dresden(德累斯顿工业大学) ScaDS.AI – Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig, Germany(德国德累斯顿/莱比锡可扩展数据分析与人工智能中心)

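AI summary Combines KLM defeasible logic with the standpoint logic framework via DRSL, characterising its semantics through KLM-style postulates and lifting several entailment relations so that defeasible beliefs held from multiple viewpoints can be expressed formally.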

详情
AI中文摘要

在本文中,我们将Kraus、Lehmann和Magidor(KLM)的可废止逻辑与Gómez Álvarez和Rudolph的立场逻辑框架相结合。这样做是为了形式化地表达考虑多个(可能矛盾的)视角的知识,而这些视角可能持有可废止信念。为此,我们利用了Leisegang等人引入的可废止受限立场逻辑(DRSL)。我们的工作扩展了先前的研究,为DRSL语义提供了基础表示结果,并系统地将几个著名的蕴涵关系从命题情况提升到立场增强设置。具体地,我们通过一组为立场情况调整的KLM风格公设来刻画DRSL的语义。此外,我们提供了一种方法来提升优先蕴涵,以及基于单个排序函数的蕴涵关系类,从纯命题语境到立场增强语境,包括理性和词典序闭包。我们证明这可以通过语义和算法手段等价地实现。此外,我们表明,对于每种考虑的蕴涵形式,从命题KLM到DRSL,蕴涵检查的复杂度类不会改变。

英文摘要

In this paper, we integrate the defeasible logic of Kraus, Lehmann and Magidor (KLM) with the standpoint logic framework of Gómez Álvarez and Rudolph. This is done with the goal of formally expressing knowledge taking into account multiple (possibly contradicting) viewpoints, which in turn may hold defeasible beliefs. In doing so, we utilise Defeasible Restricted Standpoint Logics (DRSL), introduced by Leisegang et al. Our work expands on previous work by providing a foundational representation result for DRSL semantics and systematically lifting several well-known entailment relations from the propositional case to the standpoint-enhanced setting. In particular, we characterise the semantics for DRSL through a set of KLM-style postulates adapted for the standpoints case. We furthermore provide a means to lift preferential entailment, and the class of entailment relations based on single ranking functions from the purely propositional to the standpoint-enhanced context, including rational and lexicographic closure. We show this can be done equivalently through semantic and algorithmic means. Furthermore, we show that, for each considered form of entailment, the complexity class of entailment checking does not change when moving from propositional KLM to DRSL.

2606.08658 2026-06-09 cs.AI cs.LO 新提交

Extending Ontologies: From Dense Embeddings to Hybrid Quantum-Fuzzy Systems

扩展本体:从密集嵌入到混合量子模糊系统

Angjelin Hila

发表机构 * GitHub

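AI summary Surveys how ontologies have been integrated with dense embedding algorithms and proposes neuro-quantum-fuzzy systems as a knowledge representation paradigm that supports both probabilistic and crisp inference.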

详情
AI中文摘要

大型语言模型革新了知识表示与检索,但缺乏知识本体所具有的显式建模能力。本文综述了本体和知识图谱与密集嵌入算法集成的方式。迄今为止的所有尝试都涉及概率推理与精确推理之间的权衡。本文提出了一个设计知识表示系统的新前沿,该系统可以在同一表示中同时容纳概率推理和精确推理。为此,本文提出神经-量子-模糊系统作为知识表示系统,通过量子神经网络实现经典推理和上下文推理。

英文摘要

LLMs have revolutionized knowledge representation and retrieval, but lack the explicit modeling that knowledge ontologies possess. This paper surveys the ways that ontologies and knowledge graphs have been integrated with dense embedding algorithms. All attempts to date involve a trade-off between probabilistic and crisp inference. This paper proposes a novel frontier for devising knowledge representation systems that can simultaneously accommodate probabilistic and crisp inference in the same representation. To this effect, the paper proposes neuro-quantum-fuzzy systems as knowledge representation systems that accommodate both classical and contextual inference implemented through quantum-neural networks (QNN).

2606.09674 2026-06-09 cs.AI cs.LO math.CO 新提交

(Auto)formalization is supposed to be easy: Trellis process semantics for spelling out rigorous proofs

(自动)形式化应该很简单:用于详细阐述严格证明的Trellis过程语义

Wesley Pegden

发表机构 * Department of Mathematical Sciences, Carnegie Mellon University(卡内基梅隆大学数学科学系)

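AI summary Presents Trellis, a system that autoformalizes proofs in Lean by having LLM agents iteratively refine natural-language proofs inside a deterministically constrained workflow, built on the idea that any part of a rigorous proof can be elaborated further.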

Comments 15 pages, 7 figures, 5 tables

详情
AI中文摘要

我们提出Trellis:一个自动形式化系统,它在确定性约束工作流中利用LLM代理,通过自然语言证明的迭代细化来强制在Lean自动形式化任务中取得增量进展。我们的方法基于数学家对严格证明的普遍理解:即详细阐述证明的任何部分都是常规操作。结果是一个系统,旨在以适度的预算和通用代理实现可靠的自动形式化,其专门化并非来自任何特定任务的代理训练,而是来自受严格性含义启发并由过程语义强制执行的工作流。我们链接到一个由该过程产生的近期Ramsey理论突破的端到端Lean形式化。

英文摘要

We present Trellis: an autoformalization system that leverages LLM agents in a deterministically constrained workflow to enforce incremental progress in Lean autoformalization tasks through iterative refinement of natural language proofs. Our approach is motivated by the common mathematician's notion of what it means to have a rigorous proof in the first place: namely, that it would be routine to elaborate any part of the proof in further detail. The result is a system which aims to achieve reliable autoformalization on a modest budget and with generalist agents, with specialization to autoformalization coming not from any task-specific agent training but instead from a meaning-of-rigor inspired workflow enforced by process semantics. We link to an end-to-end Lean formalization of a recent Ramsey theory breakthrough produced by the process.

2606.07525 2026-06-09 cs.CL cs.AI 交叉投稿

Implicit Causal Graph Construction in Text via Chain Discovery

通过链发现实现文本中的隐式因果图构建

Liesbeth Allein, Marie-Francine Moens

发表机构 * KU Leuven(鲁汶大学) Ghent University(根特大学)

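AI summary Uses LLMs to infer intermediate events between textual cause-effect pairs and build implicit causal graphs, comparing end-to-end construction with causal chain discovery, exploring multi-LLM aggregation, and evaluating against 1,560 scientifically validated causal pairs.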

详情
AI中文摘要

文本中的因果图通常由可观察的、预定义的事件填充。相比之下,我们研究从文本中构建隐式因果图,将每个描述的因果对视为潜在隐式因果图的起点和终点,并使用大型语言模型(LLM)推断中间因果事件。我们比较了端到端图构建与将任务视为因果链发现的方法。在后一种方法中,图是通过聚合推断出的链或通过迭代搜索过程逐步扩展部分链来构建的。我们进一步探索了“群体智慧”扩展,即在事后聚合和协作推理设置中从多个LLM访问因果知识。我们分析了这些方法之间的权衡,并使用一个包含1560个经过科学验证的因果对的手动策划数据库评估推断出的因果关系的有效性。这种基于数据库的评估被认为是可靠的、资源高效的,并且可迁移到无法获得真实图的情况。

英文摘要

Causal graphs in text are typically populated by observable, predefined events. In contrast, we study implicit causal graph construction from text by treating each described cause-effect pair as the begin- and endpoint of an underlying latent causal graph and using large language models (LLMs) to infer intermediate causal events. We compare end-to-end graph construction with methods that frame the task as causal chain discovery. In the latter, graphs are built either by aggregating inferred chains or by progressively expanding partial chains through an iterative search process. We further explore Wisdom of the Crowd extensions that access causal knowledge from multiple LLMs in post-hoc aggregation and collaborative inference settings. We analyze trade-offs among these approaches and evaluate the validity of inferred causal relations using a manually curated database of 1,560 scientifically validated causal pairs. This database-based evaluation is proposed as reliable, resource-efficient, and transferable to settings where ground-truth graphs are unavailable.
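
One of the strategies the abstract mentions, building a graph by aggregating inferred chains, can be sketched as a simple edge-voting step. This is a minimal illustration under stated assumptions: the chains below are made up rather than LLM output, and the vote threshold is a hypothetical filtering choice that the abstract does not describe.

```python
from collections import Counter


def aggregate_chains(chains):
    """Merge causal chains into edge votes: edge -> number of chains containing it."""
    edge_votes = Counter()
    for chain in chains:
        # Count each directed edge once per chain, even if the chain repeats it.
        edge_votes.update(set(zip(chain, chain[1:])))
    return edge_votes


def keep_supported_edges(edge_votes, min_votes):
    """Keep only edges proposed by at least `min_votes` chains (assumed filter)."""
    return {edge for edge, votes in edge_votes.items() if votes >= min_votes}


# Made-up chains between the same begin- and endpoint, for illustration only.
EXAMPLE_CHAINS = [
    ["smoking", "tar deposits", "cell damage", "lung cancer"],
    ["smoking", "cell damage", "lung cancer"],
    ["smoking", "tar deposits", "cell damage", "lung cancer"],
]

if __name__ == "__main__":
    votes = aggregate_chains(EXAMPLE_CHAINS)
    for edge in sorted(keep_supported_edges(votes, min_votes=2)):
        print(edge, votes[edge])
```

The same voting step would apply whether the chains come from repeated sampling of one model or from several models, as in post-hoc aggregation.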

2606.09134 2026-06-09 cs.RO cs.AI cs.CL cs.CV cs.GR 交叉投稿

From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs

从USD场景到知识图谱:基于LLM的零样本本体接地

Jiangtao Shuai, Zongxiong Chen, Manfred Hauswirth, Sonja Schimmler

发表机构 * Technical University of Berlin(柏林工业大学) Fraunhofer FOKUS(弗劳恩霍夫开放通信系统研究所)

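AI summary Uses LLMs to ground 3D scene objects to ontology classes zero-shot and without training, reaching 90-96% exact-match accuracy with descriptive names on a kitchen scene and showing that semantic cues in the scene graph are the key signal.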

Comments Accepted to the IEEE ICRA 2026 International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026

详情
AI中文摘要

从3D仿真场景构建知识图谱对于机器人任务推理至关重要,但关键瓶颈——将场景对象接地到形式本体类别——仍然依赖于手工制作的字典,这些字典脆弱且无法跨资产泛化。我们研究大语言模型(LLM)是否能够自动化通用场景描述(USD)场景的接地步骤,作为一种零样本、无需训练的替代方案。在具有SOMA-HOME本体的厨房场景(125个对象)中,LLM在描述性名称下达到90-96%的精确匹配准确率,在缩写名称下达到49-89%,显著优于字典和嵌入基线。在完全不透明名称下,上下文增强提示可恢复高达48%的准确率。特征消融表明,LLM主要利用场景图中的语义线索(兄弟名称和父路径);匿名化这些线索将准确率降至0-6%,而仅凭几何信息仅能达到4-17%。

英文摘要

Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.

2606.09157 2026-06-09 cs.CL cs.AI 交叉投稿

SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance

SEF-CLGC在SemEval-2026任务11中的应用:逻辑符号对语言模型性能的影响

Hanna Abi Akl, Fabien Gandon, Catherine Faron, Pierre Monnin

发表机构 * Université Côte d’Azur, Inria, CNRS, I3S, Sophia Antipolis, France(蔚蓝海岸大学, 法国国家信息与自动化研究所, 法国国家科学研究中心, 信息与系统科学实验室, 索菲亚安蒂波利斯, 法国) Data ScienceTech Institute, Paris, France(数据科学技术学院, 巴黎, 法国)

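AI summary Presents the SEF-CLGC pipeline, which combines formal logical notations with small language models to evaluate reasoning on SemEval-2026 Task 11; the best model reaches a content score of 27.80% while lowering content bias.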

Comments Accepted to SemEval-2026 co-located with ACL 2026

详情
AI中文摘要

本文重新审视了我们称为三段论评估框架-通用逻辑语法构建(SEF-CLGC)的管道。我们将形式逻辑符号与小语言模型(SLMs)相结合,以评估在SemEval-2026任务11子任务1:大型语言模型中内容与形式推理的分离中的推理性能。我们的实验表明,仅依靠在自然语言和符号语言组合上训练的SLMs,我们的最佳模型在该任务上达到了27.80%的内容分数,同时显著降低了推理中的内容偏差。

英文摘要

This paper revisits our pipeline called Syllogistic Evaluation Framework-Common Logic Grammar Construction (SEF-CLGC). We combine formal logical notations with Small Language Models (SLMs) to evaluate reasoning performance on the SemEval-2026 Task 11 Subtask 1: Disentangling Content and Formal Reasoning in Large Language Models. Our experiments show that by relying solely on SLMs, trained on a combination of natural and symbolic languages, our best model achieves a content score of 27.80% on the task while significantly lowering the content bias in reasoning.

2506.07853 2026-06-09 cs.AI cs.IR 版本更新

Modeling the Diachronic Evolution of Legal Norms: An LRMoo-Based, Component-Level, Event-Centric Approach to Legal Knowledge Graphs

法律规范历时演变的建模:一种基于LRMoo、组件级、事件中心的法律知识图谱方法

Hudson de Martim

发表机构 * Federal Senate of Brazil(巴西联邦参议院)

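AI summary Proposes an event-centric pattern grounded in the LRMoo ontology that models the diachronic evolution of legal norms through chains of versioned Works and separate language versions, and demonstrates deterministic reconstruction of legal text as of any date using the Brazilian Constitution.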

Comments Revised version. Refined ontological modeling of legislative events (adopted F27/E64 joint typing over E11). Introduced technical distinctions for bitemporal modeling in legal knowledge graphs and enriched the critical analysis of related standards in Section 2

详情
AI中文摘要

表示法律规范的时间演变是自动化处理的一个关键挑战。虽然存在基础框架,但它们缺乏用于细粒度、组件级版本控制的正式模式,阻碍了可靠AI应用所需的法律文本的确定性时间点重建。本文提出了一种基于LRMoo本体的结构化时间建模模式。我们的方法将规范的演变建模为版本化F1作品的历时链,区分了语言无关的时间版本(TV,每个都是一个独立作品)及其单语语言版本(LV,建模为F2表达)。立法修正过程通过事件中心建模形式化,使得变化能够被精确追踪。以巴西宪法为案例,我们证明了我们的架构能够精确重建法律文本在特定日期存在的任何部分。这为法律知识图谱提供了可验证的语义骨干,为可信赖的法律AI提供了确定性基础。

英文摘要

Representing the temporal evolution of legal norms is a critical challenge for automated processing. While foundational frameworks exist, they lack a formal pattern for granular, component-level versioning, hindering the deterministic point-in-time reconstruction of legal texts required by reliable AI applications. This paper proposes a structured, temporal modeling pattern grounded in the LRMoo ontology. Our approach models a norm's evolution as a diachronic chain of versioned F1 Works, distinguishing between language-agnostic Temporal Versions (TV), each being a distinct Work, and their monolingual Language Versions (LV), modeled as F2 Expressions. The legislative amendment process is formalized through event-centric modeling, allowing changes to be traced precisely. Using the Brazilian Constitution as a case study, we demonstrate that our architecture enables the exact reconstruction of any part of a legal text as it existed on a specific date. This provides a verifiable semantic backbone for legal knowledge graphs, offering a deterministic foundation for trustworthy legal AI.
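
The point-in-time reconstruction described above can be illustrated with a small sketch of component-level versioning. It is only a sketch under stated assumptions: the `ComponentVersion` record, the component identifiers, the dates, and the placeholder wording are all hypothetical, and the LRMoo classes (F1 Work, F2 Expression) and amendment events are not modeled.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ComponentVersion:
    valid_from: date      # date on which the amending event takes effect
    text: Optional[str]   # None marks a repealed component


def text_on(history, day):
    """Text in force on `day`; `history` must be sorted by valid_from."""
    starts = [version.valid_from for version in history]
    index = bisect_right(starts, day) - 1
    return history[index].text if index >= 0 else None


def reconstruct(norm, day):
    """Map component id -> text for every component in force on `day`."""
    in_force = {}
    for component_id, history in norm.items():
        text = text_on(history, day)
        if text is not None:
            in_force[component_id] = text
    return in_force


# Hypothetical norm with placeholder wording and made-up dates.
EXAMPLE_NORM = {
    "art. 1": [
        ComponentVersion(date(2000, 1, 1), "original wording"),
        ComponentVersion(date(2002, 1, 1), "amended wording"),
    ],
    "art. 2": [
        ComponentVersion(date(2000, 1, 1), "original wording"),
        ComponentVersion(date(2010, 6, 1), None),  # repealed
    ],
    "art. 3": [ComponentVersion(date(2005, 3, 1), "wording added by amendment")],
}

if __name__ == "__main__":
    print(reconstruct(EXAMPLE_NORM, date(2003, 1, 1)))
    print(reconstruct(EXAMPLE_NORM, date(2011, 1, 1)))
```

Because each component carries its own history, the lookup is deterministic: the same date always yields the same set of components and wording.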

2507.09751 2026-06-09 cs.AI cs.CL cs.LO 版本更新

Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations

基于LLM解释的可靠且完备的神经符号推理

Bradley P. Allen, Prateek Chhikara, Thomas Macaulay Ferguson, Filip Ilievski, Paul Groth

发表机构 * University of Amsterdam(阿姆斯特丹大学) University of Southern California(南加州大学) Rensselaer Polytechnic Institute(伦斯勒理工学院) Vrije Universiteit Amsterdam(阿姆斯特丹自由大学)

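AI summary Integrates an LLM directly into the interpretation function of a paraconsistent logic's semantics for sound and complete neurosymbolic reasoning, improving macro-F1 by roughly 6 percentage points on GPQA- and SimpleQA-derived datasets and detecting contradictions in a medication-safety knowledge base.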

Comments 43 pages, 14 tables, 4 figures. Accepted to the 19th Conference on Neurosymbolic Learning and Reasoning (NeSy 2025); to appear in the Neurosymbolic Artificial Intelligence Special Issue on NeSy 2025 Extended Papers

详情
AI中文摘要

大型语言模型(LLM)在自然语言理解和生成方面展现了令人印象深刻的能力,但在输出中表现出逻辑一致性问题。我们如何在形式推理中利用LLM的广泛覆盖参数知识,尽管它们存在不一致性?我们提出了一种方法,将LLM直接集成到次协调逻辑的形式语义的解释函数中。我们使用从短事实性基准GPQA和SimpleQA导出的数据集对方法进行实证评估,显示双边事实性评估在两个基准上的宏F1比单边基线提高了约6个百分点(以覆盖率为代价,因为在不一致或不确定的情况下会触发弃权)。我们进一步描述了一个实现该方法的原型tableau推理器,并将其应用于包含228条断言和712条推断语句的药物安全知识库:系统检测到92个对应于医学显著错误(例如,阿片类药物被推断为非成瘾性,β受体阻滞剂被推断为在哮喘中安全)的过剩(glut),同时保持可满足性,表明矛盾被局部化而不是导致逻辑爆炸。与先前工作不同,我们的方法提供了一个理论框架和实际实现,用于神经符号推理,利用LLM的知识同时保留底层逻辑的可靠性和完备性属性。

英文摘要

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but exhibit problems with logical consistency in their output. How can we harness LLMs' broad-coverage parametric knowledge in formal reasoning despite their inconsistency? We present a method for directly integrating an LLM into the interpretation function of the formal semantics for a paraconsistent logic. We evaluate the method empirically using datasets derived from the short-form factuality benchmarks GPQA and SimpleQA, showing that bilateral factuality evaluation improves macro-F1 over a unilateral baseline by roughly 6 percentage points on both benchmarks (at the cost of reduced coverage, as abstention is triggered on inconsistent or uncertain cases). We further describe a proof-of-concept tableau reasoner implementing the method, and apply it to a medication-safety knowledge base of 228 asserted and 712 inferred statements: the system detects 92 gluts corresponding to medically significant errors (e.g., opioids inferred as non-addictive, beta-blockers inferred as safe in asthma) while remaining satisfiable, demonstrating that contradictions are localized rather than causing logical explosion. Unlike prior work, our method offers a theoretical framework with a practical implementation for neurosymbolic reasoning that leverages an LLM's knowledge while preserving the underlying logic's soundness and completeness properties.
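
The abstention behaviour mentioned above (no answer on inconsistent or uncertain cases) can be sketched by combining two independent judgements about a statement. This is a minimal sketch that assumes a Belnap-style four-valued reading of "verified" and "refuted"; the enum, the function names, and the idea that the two booleans come from separate LLM calls are illustrative assumptions, not the paper's definitions.

```python
from enum import Enum
from typing import Optional


class Bilateral(Enum):
    TRUE = "verified only"
    FALSE = "refuted only"
    GLUT = "both verified and refuted"
    GAP = "neither verified nor refuted"


def bilateral_value(verified: bool, refuted: bool) -> Bilateral:
    """Combine two independent judgements into one of four values."""
    if verified and refuted:
        return Bilateral.GLUT
    if verified:
        return Bilateral.TRUE
    if refuted:
        return Bilateral.FALSE
    return Bilateral.GAP


def answer(verified: bool, refuted: bool) -> Optional[bool]:
    """Return True or False for consistent cases and None (abstain) otherwise."""
    value = bilateral_value(verified, refuted)
    if value is Bilateral.TRUE:
        return True
    if value is Bilateral.FALSE:
        return False
    return None  # glut (inconsistent) or gap (uncertain): abstain


if __name__ == "__main__":
    for verified in (True, False):
        for refuted in (True, False):
            print(verified, refuted, bilateral_value(verified, refuted).name,
                  answer(verified, refuted))
```

Abstaining on gluts and gaps is what trades coverage for accuracy: fewer statements get an answer, but the answered ones rest on judgements that do not contradict each other.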

2604.18050 2026-06-09 cs.AI cs.LO 版本更新

The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data

数据集的拓扑对偶:一种逻辑到拓扑的编码用于AlphaGeometry风格的数据

Anthony Bordg

发表机构 * Huawei Lagrange Center(华为拉格朗日中心)

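AI summary Proposes a logic-to-topology encoding, based on the duality between provability in observable theories and topologies, to reveal structural invariants of a model's latent space and support interpretability in neuro-symbolic AI.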

Comments Company decision as a precautionary measure while a third-party dispute is under review

详情
AI中文摘要

AlphaGeometry在神经符号推理中是一个里程碑,但其架构在符号推导引擎中面临对数线性扩展瓶颈,限制了随着问题复杂性增加的效率。最近的技术报告表明,当前领域特定语言可能与自然语言同构,作为输入表示,互换是性能不变的转换,暗示当前神经指导依赖于表面编码而非结构理解。本文通过提出一种逻辑到拓扑的编码方法来解决这一表示瓶颈,该方法旨在揭示模型潜在空间在输入空间变换下的结构不变性。通过利用观察逻辑,我们利用可观察理论中的可证性与拓扑之间的对偶性,提出一种输入空间的逻辑到拓扑编码器。我们引入了“数据集的拓扑对偶”概念,这是一种连接形式逻辑、拓扑和神经处理的转换。该框架为神经符号AI提供了一种罗塞塔石碑,提供了一条机制可解释的路径,以解释模型如何在复杂发现路径中导航。

英文摘要

AlphaGeometry represents a milestone in neuro-symbolic reasoning, yet its architecture faces a log-linear scaling bottleneck within its symbolic deduction engine that limits its efficiency as problem complexity increases. Recent technical reports suggest that current domain-specific languages may be isomorphic to natural language as input representations: interchanging them acts as a performance-invariant transformation, implying that current neural guidance relies on superficial encodings rather than structural understanding. This paper addresses this representation bottleneck by proposing a logic-to-topology encoding designed to reveal the structural invariants of a model's latent space under a transformation of its input space. By leveraging the Logic of Observation, we utilize the duality between provability in observable theories and topologies to propose a logic-to-topology encoder for the input space. We introduce the concept of the "topological dual of a dataset", a transformation that bridges formal logic, topology, and neural processing. This framework serves as a Rosetta Stone for neuro-symbolic AI, providing a principled pathway for the mechanistic interpretability of how models navigate complex discovery paths.

2605.22763 2026-06-09 cs.AI 版本更新

Advancing Mathematics Research with AI-Driven Formal Proof Search

用AI驱动的形式证明搜索推进数学研究

George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Anja Surina, Moritz Firsching, Gergely Bérczi, Francisco J. R. Ruiz, Arun Suggala, Adam Zsolt Wagner, Eric Wieser, Lei Yu, Aja Huang, Miklós Z. Horváth, Andrew Ferraiuolo, Henryk Michalewski, Edward Lockhart, Codrut Grosu, Thomas Hubert, Matej Balog, Pushmeet Kohli, Swarat Chaudhuri

发表机构 * Google DeepMind(谷歌DeepMind) Aarhus University(奥胡斯大学)

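AI summary Evaluates LLM-generated formal proofs as a way to solve open mathematical problems and reports how AI-aided formal proof search is being applied in mathematics research.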

详情
AI中文摘要

大型语言模型(LLMs)在数学推理方面日益表现出色,但其不可靠性限制了其在数学研究中的实用性。一种缓解方法是使用LLMs生成Lean等语言中的形式证明。我们首次对这种方法解决开放性问题的能力进行了大规模评估。我们的最强大代理在每个问题的成本仅为几百美元的情况下,自主解决了353个开放性Erdős问题中的9个,并证明了492个OEIS猜想中的44个,同时正被应用于组合学、优化、图论、代数几何和量子光学研究。一个基本代理交替使用基于LLM的生成和基于Lean的验证,复制了在Erdős问题上的成功,但在最困难的问题上成本更高。这些发现展示了AI辅助形式证明搜索的威力,并揭示了使这种技术可行的代理设计。

英文摘要

Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We perform the first large-scale evaluation of this method's ability to solve open problems. Our most capable agent autonomously resolved 9 of 353 open Erdős problems at the per-problem cost of a few hundred dollars, proved 44/492 OEIS conjectures, and is being deployed in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics research. A basic agent alternating LLM-based generation with Lean-based verification replicated the Erdős successes but proved costlier on the hardest problems. These findings demonstrate the power of AI-aided formal proof search and shed light on the agent designs that enable it.
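
The "basic agent" in the abstract alternates LLM-based generation with Lean-based verification. The control flow of such a loop can be sketched as below, with the caveat that `generate` and `verify` are hypothetical callables standing in for an LLM call and a Lean check; the stubs used here accept or reject fixed strings and do not invoke either tool.

```python
def proof_search(generate, verify, max_attempts):
    """Alternate generation and verification until a candidate is accepted.

    generate(feedback) -> candidate proof text (feedback is None on the first call)
    verify(candidate)  -> (accepted, feedback_for_next_attempt)
    Returns (candidate, attempts_used) or (None, max_attempts) if the budget runs out.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        accepted, feedback = verify(candidate)
        if accepted:
            return candidate, attempt
    return None, max_attempts


def make_toy_generator(drafts):
    """Stub generator that replays a fixed list of drafts and ignores feedback."""
    remaining = iter(drafts)

    def generate(feedback):
        return next(remaining)

    return generate


def toy_verify(candidate):
    """Stub checker: accepts exactly one fixed string instead of running Lean."""
    accepted = candidate == "draft-3"
    return accepted, None if accepted else f"{candidate} rejected"


if __name__ == "__main__":
    drafts = ["draft-1", "draft-2", "draft-3"]
    print(proof_search(make_toy_generator(drafts), toy_verify, max_attempts=5))
```

Since only verified candidates are returned, the unreliability of the generator affects cost (the number of attempts) rather than the correctness of what comes out.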

2605.25985 2026-06-09 cs.AI 版本更新

Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables

面向多自由变量复杂逻辑查询的神经可扩展符号搜索框架

Weizhi Fei, Hang Yin, Zihao Wang, Shukai Zhao, Wei Zhang, Yangqiu Song

发表机构 * Department of Mathematical Sciences, Tsinghua University(清华大学数学科学系) Squarepoint Capital(Squarepoint资本) Department of Computer Science and Engineering, Hong Kong University of Science and Technology(香港科技大学计算机科学与工程系) Department of Computer Sciences, University of Rochester(罗切斯特大学计算机科学系)

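AI summary For joint ranking of answers to knowledge-graph queries with multiple free variables, proposes Neural Scalable Symbolic Search (NS3), which approximates joint ranking through budget constraints and hypernode merging and substantially improves joint ranking performance.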

Comments 10 pages, 5 figures

详情
AI中文摘要

复杂查询回答(CQA)是在不完整知识图谱(KG)上进行知识表示和推理的基本任务。回答带有$k$个自由变量的存在性一阶查询(即$\text{EFO}_k$查询)是一个关键但具有挑战性的问题,因为它需要对$\mathcal{E}^k$中的答案元组进行排序,其中$\mathcal{E}$表示KG的实体集。随着$k$的增长,这很快变得难以处理。因此,现有基准和方法依赖于单个变量的边际排序;然而,边际排序是元组真实联合排序的较差代理。基于$\text{EFO}_1$查询的神经符号搜索,我们提出了神经可扩展符号搜索(NS3),这是一个预算框架,无需枚举$\mathcal{E}^k$即可近似联合排序。NS3 (i) 回答边际化子查询以获得必要的候选集,(ii) 将多个自由变量合并为超节点,其域由动态预算$B$修剪和控制,以及(iii) 逐步将$\text{EFO}_k$查询简化为在预算缩减域上的$\text{EFO}_{k-1}$查询。在三个标准KG数据集上,NS3在保持强边际准确性的同时,显著提高了联合排序性能。我们进一步发布了一个联合排序基准,将现有的$\text{EFO}_1$数据集扩展到$k=3$,从而能够系统评估多变量查询。我们的代码提供在https://github.com/HKUST-KnowComp/NS3_KDD2026。

英文摘要

Complex Query Answering (CQA) is a fundamental knowledge representation and reasoning task over incomplete knowledge graphs (KGs). Answering existential first-order queries with $k$ free variables (i.e., $\text{EFO}_k$ queries) is a crucial yet challenging problem, as it requires ranking answer tuples in $\mathcal{E}^k$, where $\mathcal{E}$ denotes the entity set of a KG. This quickly becomes intractable as $k$ grows. Consequently, existing benchmarks and methods rely on marginal rankings over individual variables; however, marginal rankings are a poor proxy for the true joint ranking of tuples. Building on neural symbolic search for $\text{EFO}_1$ queries, we propose Neural Scalable Symbolic Search (NS3), a budgeted framework that approximates joint ranking without enumerating $\mathcal{E}^k$. NS3 (i) answers marginalized sub-queries to obtain necessary candidate sets, (ii) merges multiple free variables into hypernodes whose domains are pruned and controlled by a dynamic budget $B$, and (iii) progressively reduces an $\text{EFO}_k$ query to an $\text{EFO}_{k-1}$ query over a budgeted reduced domain. Across three standard KG datasets, NS3 substantially improves joint ranking performance while retaining strong marginal accuracy. We further release a joint-ranking benchmark that extends existing $\text{EFO}_1$ datasets to $k=3$, enabling systematic evaluation of multi-variable queries. Our code is provided in https://github.com/HKUST-KnowComp/NS3_KDD2026.
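
The motivation in this abstract, that enumerating $\mathcal{E}^k$ is intractable while marginal rankings are a poor proxy for the joint ranking, can be illustrated with a deliberately simplified sketch: prune each variable to its top-$B$ marginal candidates, then score tuples jointly over the reduced domain only. This is not the NS3 algorithm (there is no hypernode merging and no progressive reduction to $\text{EFO}_{k-1}$); the scores and the `joint_score` function are made-up assumptions.

```python
import heapq
from itertools import product


def top_b(marginal_scores, budget):
    """Keep the `budget` highest-scoring entities for one free variable."""
    return heapq.nlargest(budget, marginal_scores, key=marginal_scores.get)


def budgeted_joint_ranking(marginals, joint_score, budget):
    """Rank answer tuples over pruned domains instead of the full E^k."""
    domains = [top_b(scores, budget) for scores in marginals]
    return sorted(product(*domains), key=joint_score, reverse=True)


# Made-up marginal scores for two free variables over entities a, b, c.
MARGINALS = [
    {"a": 0.9, "b": 0.8, "c": 0.1},
    {"a": 0.9, "b": 0.7, "c": 0.2},
]
# Made-up joint scores; tuples not listed score 0.
JOINT = {("a", "a"): 0.1, ("a", "b"): 0.8, ("b", "a"): 0.9}


def joint_score(answer_tuple):
    return JOINT.get(answer_tuple, 0.0)


if __name__ == "__main__":
    ranking = budgeted_joint_ranking(MARGINALS, joint_score, budget=2)
    print(len(ranking), "tuples scored instead of", 3 ** 2)
    print(ranking)
```

In this toy data the best marginal choice for each variable is `a`, yet the tuple `("a", "a")` ranks low jointly, which is the gap between marginal and joint ranking that the abstract points out.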

3. Multi-Agent Systems and Games (15 papers)

2606.08702 2026-06-09 cs.AI 新提交

ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems

ConMem: 无训练多智能体系统中的结构化记忆引导自适应

Zhixun Tan, Qiang Chen, Tairan Huang, Xiu Su, Yi Chen

发表机构 * Central South University(中南大学) The Hong Kong University of Science and Technology(香港科技大学)

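AI summary Proposes ConMem, a training-free framework that adapts multi-agent systems using structured memory cards and a relation-aware memory graph, improving performance on multiple benchmarks while reducing inference overhead.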

详情
AI中文摘要

最近的进展通过基于记忆、技能和学习的方法改进了基于LLM的多智能体系统(MAS)的自适应能力,但这些方法仍受到噪声轨迹、记忆-技能关系建模不足以及对额外训练或高质量监督的依赖等挑战。为了解决这些限制,我们提出了ConMem,一个关系感知且无需训练的框架,通过跨经验协调实现高效的多智能体自适应。具体来说,ConMem将历史交互轨迹提炼为结构化记忆卡片,以捕获可重用的策略和线索,并将它们组织成关系感知的记忆图。在运行时,ConMem根据任务需求检索卡片,并通过卡片图协调它们以解决策略冲突并恢复其依赖关系。这些模块结合起来提供了结构化和关系感知的指导,使得多智能体系统能够实现鲁棒、轻量级的自适应,而无需额外训练。在多个基准测试和主流MAS架构上的大量实验表明,与现有记忆架构相比,ConMem取得了持续的性能提升,通过剪枝超过50%的扩展候选并减少超过80%的规划开销,提高了推理时的效率。我们的代码可在https://anonymous.4open.science/r/ConMemCode获取。

英文摘要

Recent advances have improved the adaptive capabilities of LLM-based multi-agent systems (MAS) through memory-, skill-, and learning-based approaches, yet these approaches remain challenged by noisy trajectories, insufficient modeling of memory-skill relations, and reliance on additional training or high-quality supervision. To address these limitations, we propose ConMem, a relation-aware and training-free framework that enables efficient multi-agent adaptation through cross-experience coordination. Specifically, ConMem distills historical interaction trajectories into structured memory cards to capture reusable strategies and cues, organizing them into a relation-aware memory graph. At runtime, ConMem retrieves cards according to task needs and coordinates them through the card graph to resolve strategy conflicts and recover their dependencies. Combined, these modules yield structured and relation-aware guidance, enabling robust, lightweight adaptation in multi-agent systems without additional training. Extensive experiments across multiple benchmarks and mainstream MAS architectures show consistent gains over existing memory architectures, with improved inference-time efficiency through pruning more than 50% of expanded candidates and reducing planning overhead by over 80%. Our code is available at https://anonymous.4open.science/r/ConMemCode

2606.09037 2026-06-09 cs.AI cs.MA 新提交

A Multi-Agent System for IPMSM Design Optimization via an FEA-AI Hybrid Approach

基于FEA-AI混合方法的IPMSM设计优化多智能体系统

Jinseong Han, Sunwoong Yang, Namwoo Kang

发表机构 * Cho Chun Shik Graduate School of Mobility, KAIST(KAIST Cho Chun Shik 移动研究生院) Department of Mechanical Engineering, Hanyang University(汉阳大学机械工程系) Narnia Labs

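AI summary Proposes an end-to-end automated IPMSM design optimization framework that pairs RAG-based structured problem definition with an uncertainty-aware FEA-AI hybrid pipeline, balancing computational cost and prediction reliability and outperforming FEA-only and AI-only search under the same FEA budget.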

Comments 26 pages, 21 figures

详情
AI中文摘要

内置永磁同步电机(IPMSM)设计需要平衡相互冲突的目标和多物理场约束,而现代优化工作流程面临三个瓶颈:手动问题设置、高有限元分析(FEA)成本以及在稀疏或分布外区域中不可靠的基于代理的搜索。为了解决这些限制,我们提出了一种端到端的自动化IPMSM设计优化框架,该框架将检索增强生成(RAG)用于结构化问题定义,与不确定性感知的FEA-AI混合优化流水线相结合。一个通过RAG连接到电机教科书的设计代理提供基于领域知识的选项和工程技巧,并编译优化卡和用于AI模型训练的试验设计计划。训练代理自动化电磁FEA,记录几何验证和求解器失败日志,使用基于方差分析的数据分析和LLM推理分析失败的几何形状,并调用设计采样代理重新定义设计空间并生成额外样本。优化代理执行基于遗传算法的搜索,具有不确定性驱动的切换:低不确定性候选由AI代理推理评估,而高不确定性和可靠性关键的帕累托前沿或前K候选由高保真FEA校正并用于迭代重训练。该框架将手动、依赖经验的配置转换为可重复的工作流程,平衡计算成本和预测可靠性。在匹配的高保真FEA预算下的实验结果表明,所提出的混合方法实现了更好的目标性能,同时保持低且可进一步降低的预测不确定性,优于受早期预算耗尽限制的纯FEA搜索和收敛到低置信度最优的纯AI搜索。

英文摘要

Interior permanent magnet synchronous motor (IPMSM) design requires balancing conflicting objectives and multi-physics constraints, while modern optimization workflows face three bottlenecks: manual problem setup, high finite element analysis (FEA) cost, and unreliable surrogate-based search in sparse or out-of-distribution regions. To address these limitations, we propose an end-to-end automated IPMSM design optimization framework that integrates retrieval-augmented generation (RAG) for structured problem definition with an uncertainty-aware FEA-AI hybrid optimization pipeline. A Design agent, connected to a motor textbook through RAG, provides domain-knowledge-based options and engineering tips, and compiles an optimization card and a design-of-experiments plan for AI-model training. A Training agent automates electromagnetic FEA, records geometry-validation and solver-failure logs, analyzes failed geometries using ANOVA-based data analysis and LLM reasoning, and invokes a Design Sampling agent to redefine the design space and generate additional samples. An Optimization agent performs GA-based search with uncertainty-driven switching: low-uncertainty candidates are evaluated by AI-surrogate inference, whereas high-uncertainty and reliability-critical Pareto-front or top-K candidates are corrected by high-fidelity FEA and reused for iterative retraining. The framework converts manual, experience-dependent configuration into a reproducible workflow that balances computational cost and prediction reliability. Experimental results under a matched high-fidelity FEA budget show that the proposed hybrid approach achieves better objective performance while maintaining low and further reducible predictive uncertainty, outperforming FEA-only search, which is limited by early budget exhaustion, and AI-only search, which converges to a low-confidence optimum.
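
The uncertainty-driven switching described above can be sketched as a per-candidate decision between a cheap surrogate prediction and an expensive simulation. The sketch is an assumption-laden simplification: `surrogate` and `run_fea` are hypothetical callables, the fixed `max_std` threshold is an illustrative choice, and the special handling of Pareto-front or top-K candidates and the GA itself are left out.

```python
def hybrid_evaluate(candidates, surrogate, run_fea, max_std):
    """Score each candidate with the surrogate unless its uncertainty is too high.

    surrogate(x) -> (predicted_value, predictive_std)
    run_fea(x)   -> high-fidelity value (expensive in a real workflow)
    Returns (scores, retraining_set): scores maps x -> (value, source), and
    retraining_set collects the FEA results to reuse for surrogate retraining.
    """
    scores, retraining_set = {}, []
    for x in candidates:
        mean, std = surrogate(x)
        if std <= max_std:
            scores[x] = (mean, "surrogate")
        else:
            value = run_fea(x)
            scores[x] = (value, "fea")
            retraining_set.append((x, value))
    return scores, retraining_set


def toy_surrogate(x):
    """Stub model whose uncertainty grows away from the origin."""
    return x * x, 0.1 * abs(x)


def toy_fea(x):
    """Stub standing in for a finite element run."""
    return x * x + 0.5


if __name__ == "__main__":
    scores, retraining_set = hybrid_evaluate([0.5, 3.0], toy_surrogate, toy_fea, max_std=0.2)
    print(scores)
    print(retraining_set)
```

Only the high-uncertainty candidate consumes FEA budget, and its result doubles as a new training sample, which is how such a loop spends a fixed FEA budget where the surrogate is least trustworthy.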

2606.09751 2026-06-09 cs.AI cs.CL cs.HC 新提交

Collaborative Human-Agent Protocol (CHAP)

协作式人机协议 (CHAP)

Arsalan Shahid, Gordon Suttie, Philip Black

发表机构 * Brightbeam AI

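AI summary Proposes the CHAP protocol, which keeps human judgement from being lost in multi-human, multi-agent collaboration by recording it as structured events (diff, rationale, content hash) and adds capabilities through composable profiles.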

详情
AI中文摘要

基础模型正从响应生成转向操作角色。它们跨步骤规划、调用工具、请求人类输入、与其他智能体协调,并越来越多地承担影响客户、索赔、代码、合同和临床决策的工作。生产部署不再是单个人类监督单个模型,而是跨团队、时区和信任边界的多人类、多智能体协作。这种协作的技术界面仍然定义不清。当智能体起草响应,人类在发布前编辑它时,人类判断的时刻是系统中最有价值的信号。在当前实践中,该信号(如果有记录)仅存在于应用程序代码、聊天线程、工单评论和集体记忆中。两个协议标准解决了相邻问题:MCP标准化了智能体对工具和数据的访问,A2A标准化了智能体间的互操作性。两者都没有定义人类和智能体共同执行可问责工作的共享工作空间。本文提出了CHAP,即协作式人机协议。在CHAP下,原本会消失在聊天线程中的覆盖操作变成了一个结构化事件,包含差异、理由和内容哈希。班次交接变成了可移植的信封,而不是置顶消息。人类对智能体草稿的批准变成了一个不可否认的签名决策,可在多年后重放。该协议通过一个小的核心(工作空间、参与者、任务、工件和仅追加的证据日志)以及可组合的配置文件(根据部署需要添加审查、模式、路由、审议、交接、身份、签名和透明度支持的审计)来实现。规范、参考实现、一致性测试套件和示例可在以下网址获取:https://github.com/BrightbeamAI/chap

英文摘要

Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects customers, claims, code, contracts, and clinical decisions. Production deployments are no longer one human supervising one model. They are multi-human, multi-agent collaborations that cross teams, time zones, and trust boundaries. The technical surface for this collaboration remains weakly specified. When an agent drafts a response and a human edits it before it ships, the moment of human judgement is the most valuable signal in the system. In current practice it is recorded, if at all, in application code, chat threads, ticket comments, and tribal memory. Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent-to-agent interoperability. Neither defines the shared workspace in which humans and agents perform accountable work together. This paper presents CHAP, the Collaborative Human-Agent Protocol. Under CHAP, the override that used to vanish into a chat thread becomes a structured event carrying a diff, a rationale, and a content hash. The handoff between shifts becomes a portable envelope rather than a pinned message. The human approval of an agent's draft becomes a non-repudiable signed decision that can be replayed years later. The protocol achieves this through a small Core (workspaces, participants, tasks, artefacts, and an append-only evidence log) together with composable profiles that add review, modes, routing, deliberation, handoff, identity, signatures, and transparency-backed audit as deployments require them. Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap
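
The abstract describes an override as "a structured event carrying a diff, a rationale, and a content hash" stored in an append-only evidence log. The sketch below shows one way such a record could look. The field names, the `EvidenceLog` class, and the hash chaining between entries are assumptions made for illustration; they are not taken from the CHAP specification, which should be consulted for the actual schema.

```python
import difflib
import hashlib
import json
from datetime import datetime, timezone


def override_event(actor, draft, final, rationale):
    """Build a hypothetical override record for a human edit of an agent draft."""
    diff = "\n".join(
        difflib.unified_diff(
            draft.splitlines(), final.splitlines(),
            "agent_draft", "human_final", lineterm="",
        )
    )
    return {
        "type": "override",
        "actor": actor,
        "diff": diff,
        "rationale": rationale,
        "content_hash": hashlib.sha256(final.encode("utf-8")).hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


class EvidenceLog:
    """Append-only log: entries can be added and read, never edited or removed."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        # Chaining each entry to the previous hash is an assumed design choice
        # that makes later tampering detectable.
        previous = self._entries[-1]["entry_hash"] if self._entries else ""
        payload = json.dumps(event, sort_keys=True)
        entry_hash = hashlib.sha256((previous + payload).encode("utf-8")).hexdigest()
        self._entries.append({"event": event, "prev": previous, "entry_hash": entry_hash})

    def entries(self):
        return tuple(self._entries)


if __name__ == "__main__":
    log = EvidenceLog()
    log.append(override_event(
        "reviewer-1",
        "Refund approved.\nShip today.",
        "Refund approved.\nShip tomorrow.",
        "Warehouse is closed today",
    ))
    print(json.dumps(log.entries()[0], indent=2))
```

Keeping the diff and rationale next to a hash of the final content means a later reviewer can see what changed and why, and can check that the stored text is the one that was approved.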

2606.07552 2026-06-09 cs.MA cs.AI cs.LG 交叉投稿

Symbolic Reasoning Frameworks Modulate LLM Risk Aversion in Multi-Agent Strategic Settings

符号推理框架在多智能体战略环境中调节大语言模型的风险规避

Augustin Chan

发表机构 * iterative.day

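AI summary Injects symbolic reasoning frameworks (such as I-Ching and Tarot) as reflective prompts and finds that they differentially modulate LLM risk aversion and yield framework-specific winner distributions in a multi-agent game, with the effect attributed to the reflective process rather than content-following.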

Comments 17 pages, 3 figures, 6 tables, 6 listings. Code and data: https://doi.org/10.5281/zenodo.20338937

详情
AI中文摘要

大型语言模型在作为战略智能体部署时表现出内在的行为倾向——尤其是风险规避的“乌龟”偏向于防御性玩法。我们证明,符号推理框架作为每轮反思提示注入一个智能体,能够差异化地调节这种偏向,并重塑多智能体生态系统,产生框架特定的胜者分布。在一个7玩家的战国策外交变体(41局游戏,4种条件,单战役记忆积累)中,每个框架产生独特的生态系统特征:在控制条件下,燕国主导(7/11,64%);在易经蓍草占卜下,燕国和楚国共同主导,而秦国被完全压制(0/10);在塔罗牌下,秦国主导(5/10,Fisher vs. 合并p=0.006);在乱序文本消融(保留提示结构的无意义神谕文本)下,齐国主导(5/10,Fisher vs. 合并p=0.006)。接受框架的智能体(韩国)从未获胜,且在不同条件下生存率无差异(Fisher p=1.0),但塔罗牌持续提升韩国的峰值领土(平均3.0个SC vs. 2.1-2.5个其他,Kruskal-Wallis p=0.010)。两个框架的内容均不能预测后续行动——卦象主题(卡方p=0.95)和塔罗牌姿态(卡方p=0.69)均与行动选择独立——表明调节作用是通过反思过程而非内容遵循实现的。我们将其作为一篇观察论文呈现,确立智能体层面的对齐框架选择在多智能体环境中产生独特的系统级后果。

英文摘要

Large language models exhibit innate behavioral tendencies when deployed as strategic agents -- notably a risk-averse "turtle" bias toward defensive play. We show that symbolic reasoning frameworks, injected as per-round reflective prompts into one agent, differentially modulate this bias and reshape the multi-agent ecosystem to produce framework-specific winner distributions. In a 7-player Warring States Diplomacy variant (41 games, 4 conditions, single-campaign memory accumulation), each framework produces a distinct ecosystem signature: under control, Yan dominates (7/11, 64%); under I-Ching yarrow divination, Yan and Chu co-dominate while Qin is completely suppressed (0/10); under Tarot, Qin dominates (5/10, Fisher vs. pooled p = 0.006); under scrambled-text ablation (incoherent oracle text preserving prompt structure), Qi dominates (5/10, Fisher vs. pooled p = 0.006). The framework-receiving agent (Han) never wins and shows no survival difference across conditions (Fisher p = 1.0), but Tarot consistently elevates Han's peak territory (mean 3.0 SCs vs. 2.1-2.5 others, Kruskal-Wallis p = 0.010). Neither framework's content predicts subsequent actions -- hexagram themes (chi-squared p = 0.95) and Tarot card postures (chi-squared p = 0.69) are both independent of action choice -- suggesting the modulation operates through the reflective process, not content-following. We present this as an observation paper establishing that alignment-framework choice at the agent level produces distinctive system-level consequences in multi-agent settings.

2606.07649 2026-06-09 cs.CV cs.AI 交叉投稿

ViMax: Agentic Video Generation

ViMax: 智能体视频生成

Lingxuan Huang, Sizhe He, Hengji Zhou, Liqiang Nie, Lianghao Xia, Chao Huang

发表机构 * The University of Hong Kong(香港大学) South China University of Technology(华南理工大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳))

AI总结 提出ViMax框架,通过多智能体协作实现长视频生成,利用分层叙事引擎和视觉一致性机制,保证叙事连贯性和视觉一致性。

Comments 20 pages, 13 figures

详情
AI中文摘要

长视频生成需要系统的叙事规划和视觉一致性,而当前的短视频方法无法提供。现有方法生成孤立的序列,缺乏叙事结构,并且缺乏跨场景保持角色和环境一致性的机制。我们提出ViMax,一个智能体视频生成框架,通过协调的多智能体协作来解决视频创作问题,其中专门的组件协商叙事决策、视觉连续性和制作质量。我们的框架采用分层叙事引擎,结合检索增强生成以实现全局故事连贯性,以及依赖感知的视觉一致性机制,跨时间边界跟踪角色和环境状态,同时VLM引导的智能体持续监控和优化叙事连贯性和视觉保真度。该框架支持协调的智能体协作以生成扩展的叙事内容,在多场景时间线上保持叙事完整性和视觉连贯性。

英文摘要

Long-form video generation requires systematic narrative planning and visual consistency that current short-clip methods cannot provide. Existing methods generate isolated sequences without narrative structure and lack mechanisms for maintaining character and environmental consistency across scenes. We present ViMax, an agentic video generation framework that addresses video creation through coordinated multi-agent collaboration where specialized components negotiate narrative decisions, visual continuity, and production quality. Our framework employs a hierarchical narrative engine with retrieval-augmented generation for global story coherence and a dependency-aware visual consistency mechanism that tracks character and environmental states across temporal boundaries, while VLM-guided agents continuously monitor and refine both narrative coherence and visual fidelity. The framework enables coordinated agent collaboration to generate extended narrative content. This maintains both storytelling integrity and visual coherence across multi-scene timelines.

2606.08030 2026-06-09 cs.MA cs.AI 交叉投稿

Voting Protocols as Coordination Mechanisms for Role-Constrained Multi-Agent Tutoring Systems

投票协议作为角色约束的多智能体辅导系统的协调机制

Eric S. Qiu, Joyce Gill

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 研究投票协议如何塑造四个角色约束的教学智能体之间的协调,通过比较四种投票协议在模拟辅导环境中的效果,发现协议选择显著影响集体决策和协调行为。

Comments Accepted to ICML 2026 Workshop on AI4Good

详情
AI中文摘要

智能体辅导系统引入了一个协调挑战:多个智能体可能提出不同但合理的干预措施,但只能向学习者提供一个响应。在本文中,我们研究了投票协议如何塑造四个角色约束的教学智能体之间的合作,这些智能体分别负责支架式教学、误解纠正、动机和元认知。我们在基于SciQ和HumanEval基准的两个模拟辅导环境中比较了四种投票协议(简单投票、排名投票、累积投票和批准投票)。我们不是将投票用作简单的聚合步骤,而是用它来分析在部分教学冲突下集体决策规则如何塑造协调。在1200次模拟交互中,我们发现智能体审议和投票协议类型经常改变最终获胜的响应,表明两者都显著影响集体决策。不同的投票规则也产生不同的协调行为,即使是短暂的辅导回合也能让模拟学生获得可测量的学习收益。总体而言,我们表明协议选择与角色专门化的教学智能体之间的不同协调模式相关。

英文摘要

Agentic tutoring systems introduce a coordination challenge: multiple agents may propose different but reasonable interventions, yet only one response can be delivered to the learner. In this paper, we study how voting protocols shape cooperation among four role-constrained pedagogical agents responsible for scaffolding, misconception, motivation, and metacognition. We compare four voting protocols -- simple, ranked, cumulative, and approval voting -- across two simulated tutoring environments on SciQ and HumanEval benchmarks. Rather than using voting as a simple aggregation step, we use it to analyze how collective decision rules shape coordination under partial pedagogical conflict. Across 1,200 simulated interactions, we find that agent deliberation and voting protocol type frequently change which response ultimately wins, showing that both meaningfully shape the collective decision. Different voting rules also produce distinct coordination behaviors, and even brief tutoring turns show measurable learning gains in simulated students. Overall, we show that protocol choice is associated with distinct coordination patterns among role-specialized pedagogical agents.
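The four protocol families named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the ballots, the top-2 approval rule, the use of a Borda count for "ranked" voting, and the 4-point cumulative budget are all assumptions chosen so that the rules visibly disagree on the same preferences.

```python
from collections import Counter

def plurality(ballots):
    """Simple voting: only the first choice on each ranked ballot counts."""
    return Counter(b[0] for b in ballots)

def borda(ballots):
    """Ranked voting via Borda count (an assumed rule): position i of n earns n - 1 - i points."""
    scores = Counter()
    for b in ballots:
        n = len(b)
        for i, candidate in enumerate(b):
            scores[candidate] += n - 1 - i
    return scores

def approval(ballots, k=2):
    """Approval voting, assuming each voter approves of their top-k candidates."""
    scores = Counter()
    for b in ballots:
        scores.update(b[:k])
    return scores

def cumulative(allocations):
    """Cumulative voting: each voter splits a fixed point budget across candidates."""
    scores = Counter()
    for alloc in allocations:
        scores.update(alloc)  # updating a Counter with a dict adds the point values
    return scores

def winner(scores):
    """Highest score wins; ties are broken alphabetically for determinism."""
    return min(scores, key=lambda c: (-scores[c], c))

# Hypothetical ballots from four role-constrained agents over candidate responses A, B, C.
ballots = [["A", "B", "C"], ["A", "B", "C"], ["B", "C", "A"], ["C", "B", "A"]]
# Hypothetical 4-point budgets for the same four agents.
allocations = [{"A": 4}, {"A": 3, "B": 1}, {"B": 2, "C": 2}, {"C": 4}]

print(winner(plurality(ballots)))       # A
print(winner(borda(ballots)))           # B
print(winner(approval(ballots)))        # B
print(winner(cumulative(allocations)))  # A
```

On these made-up ballots, plurality and cumulative voting select A while Borda and approval voting select the compromise response B, which is the kind of protocol-dependent outcome the abstract describes.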

2606.08267 2026-06-09 cs.GT cs.AI 交叉投稿

Post-AGI Economies: Superposition and the Second Fundamental Theorem of Welfare Economics

后AGI经济:叠加性与福利经济学第二基本定理

Elija Perrier

发表机构 * Centre for Quantum Software & Information(量子软件与信息中心)

AI总结 针对后AGI经济中自治权、自我修改和叠加偏好对经典福利第二定理的挑战,提出自治限定第二福利定理,给出可分散化的条件。

详情
AI中文摘要

经典第二福利定理在凸性和正则性条件下通过价格和转移分散化任何帕累托有效配置。在后AGI经济中,自治权、自我修改、身份连续性和叠加偏好不一定像商品那样行为或定义稳定的福利关系,因此即使存在支撑超平面,这种简化也可能失败。我们给出了一个自治限定的第二福利定理,陈述了凸性、稳定道德地位、不可替代权利、福利选择、非操纵、受控自我修改和验证的联合条件,在这些条件下,自治帕累托最优仍然可证明地可分散化,区分了经济偏好叠加(一种关于上下文索引选择的假设)与神经特征叠加。

英文摘要

The classical Second Welfare Theorem decentralizes any Pareto efficient allocation through prices and transfers under convexity and regularity. In post-AGI economies, autonomy rights, self-modification, identity continuity, and superposed preferences need not behave as commodities or define a stable welfare relation, so this reduction may fail even when a supporting hyperplane exists. We give an autonomy-qualified Second Welfare Theorem stating the joint conditions (convexity, stable moral status, non-fungible rights, welfare selection, non-manipulation, governed self-modification, and verification) under which an autonomy Pareto optimum remains certifiably decentralizable, distinguishing economic preference superposition, a hypothesis about context-indexed choice, from neural feature superposition.

2606.09122 2026-06-09 cs.SE cs.AI cs.ET cs.MA cs.NI 交叉投稿

Autonomous Incident Resolution at Hyperscale: An Agentic AI Architecture for Network Operations

超大规模下的自主事件解决:面向网络运维的智能体AI架构

Arun Malik

发表机构 * Arun Malik

AI总结 提出一种多智能体编排框架,通过分层分解、技能调用、知识编码和渐进自主,在超大规模云网络中实现90%以上常见事件的自主解决,并保障安全。

Comments 7 pages, 6 figures

详情
AI中文摘要

超大规模的云网络基础设施面临着独特的运维挑战,传统的人工驱动事件响应无法跟上故障的数量、速度和复杂性。本文提出了一种用于大规模网络运维中自主事件解决的智能体AI架构。我们的系统采用多智能体编排框架,其中专门的AI智能体协作检测、诊断和修复网络事件,无需人工干预。我们描述了架构原则,包括分层智能体分解、通过标准化协议的基于技能的工具调用、来自运维手册的结构化知识编码、具有安全边界的渐进自主性以及闭环验证。该架构已在主要云提供商的生产环境中部署,表明智能体AI系统能够在常见事件类别中实现超过90%的自主解决率,同时通过分层授权和回滚机制维护安全保证。我们讨论了设计权衡、故障模式以及从大规模运行自主AI智能体中获得的经验教训。

英文摘要

Cloud network infrastructure at hyperscale presents unique operational challenges where traditional human-driven incident response cannot keep pace with the volume, velocity, and complexity of failures. This paper presents an agentic AI architecture for autonomous incident resolution in large-scale network operations. Our system employs a multi-agent orchestration framework where specialized AI agents collaborate to detect, diagnose, and remediate network incidents without human intervention. We describe the architectural principles, including hierarchical agent decomposition, skills-based tool invocation via standardized protocols, structured knowledge encoding from operational runbooks, progressive autonomy with safety boundaries, and closed-loop verification. The architecture has been deployed in production at a major cloud provider, demonstrating that agentic AI systems can achieve autonomous resolution rates exceeding 90% for common incident categories while maintaining safety guarantees through layered authorization and rollback mechanisms. We discuss design tradeoffs, failure modes, and lessons learned from operating autonomous AI agents at scale.

2606.09610 2026-06-09 cs.RO cs.AI 交叉投稿

Shape Formation for the Cooperative Transportation of Arbitrary Objects Using Multi-Agent Reinforcement Learning

基于多智能体强化学习的任意物体协同运输中的形状形成

Mohamed Sayed, Wolfram Burgard, Tanja Katharina Kaiser

发表机构 * University of Technology Nuremberg(纽伦堡工业大学)

AI总结 提出一种多智能体强化学习方法,使多机器人系统自主形成支撑任意形状和非均匀质量分布物体的编队,同时避免障碍物,实现可靠且泛化的协同运输。

详情
AI中文摘要

协同物体运输在众多领域(包括工业到家庭服务)中至关重要。一种流行的运输策略是将物体承载在多机器人系统之上。相应的任务通常通过将其分解为三个相互关联的子问题来解决:编队控制、协同导航和碰撞避免。现实世界物体带来的一个特殊挑战是其可能具有任意形状和非均匀质量分布,这需要机器人编队能够牢固支撑物体。在这项工作中,我们通过提出一种新颖的多智能体强化学习方法来解决运输此类现实世界物体时的模式形成控制挑战。我们的方法使多机器人系统能够自主定位在物体下方以支撑其重量,同时在编队过程中避免障碍物。我们在不同环境和不同数量机器人下的评估表明,我们的方法能够产生可靠形成平衡编队的策略,并泛化到杂乱场景以及具有复杂几何形状和非均匀质量分布的物体。

英文摘要

Cooperative object transportation is essential in numerous domains, ranging from industrial to domestic services. A popular transportation strategy is to carry objects on top of multi-robot systems. The corresponding task is typically solved by decomposing it into three interconnected subproblems: formation control, cooperative navigation, and collision avoidance. A particular challenge posed by real-world objects is their potentially arbitrary shape and non-uniform mass distribution, necessitating robot formations that securely support the object. In this work, we address the challenge of pattern formation control for transporting such real-world objects by proposing a novel multi-agent reinforcement learning approach. Our approach enables a multi-robot system to autonomously position itself underneath an object to support its weight while avoiding obstacles during the formation process. Our evaluations with diverse environments and varying numbers of robots show that our approach leads to policies that reliably produce balanced formations and generalize to cluttered scenes and objects with complex geometry and non-uniform mass distribution.

2512.20845 2026-06-09 cs.AI cs.MA 版本更新

MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs

MAR:多智能体反思提升大语言模型的推理能力

Onat Ozer, Yuchen Wang, Grace Wu, Daniel Dosti, Honghao Zhang, Vivi De La Rue

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出多智能体反思框架,通过多角色辩论生成多样化反思,解决单模型反思中的思维退化问题,在HotPot QA和HumanEval上分别达到47% EM和82.7%准确率。

详情
AI中文摘要

大语言模型已展现出通过反思自身错误并据此行动来提升推理任务性能的能力。然而,同一LLM对自身的持续反思会表现出思维退化,即即使知道错误,LLM仍会反复重复相同错误。为解决此问题,我们引入多智能体与多角色辩论者作为生成反思的方法。通过大量实验,我们发现这能导致LLM智能体生成的反思具有更好的多样性。我们在HotPot QA(问答)上展示了47%的精确匹配准确率,在HumanEval(编程)上展示了82.7%的准确率,这两项性能均超越了单一LLM的反思。

英文摘要

LLMs have shown the capacity to improve their performance on reasoning tasks by reflecting on their mistakes and acting with these reflections in mind. However, continual reflection of the same LLM on itself exhibits degeneration of thought, where the LLM keeps repeating the same errors even when it knows that it is wrong. To address this problem, we instead introduce multi-agent reflection with multi-persona debaters as the method to generate reflections. Through extensive experimentation, we found that this leads to better diversity in the reflections generated by the LLM agent. We demonstrate 47% exact-match (EM) accuracy on HotPot QA (question answering) and 82.7% accuracy on HumanEval (programming), both surpassing reflection with a single LLM.

2601.19082 2026-06-09 cs.AI cs.CL cs.GT cs.LG cs.MA 版本更新

Payoff scaling shapes cooperation in LLM agents across languages

收益规模塑造跨语言LLM代理的合作行为

Trung-Kiet Huynh, Dao-Sy Duy-Minh, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Phu-Quy Nguyen-Lam, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Phu-Hoa Pham, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, The Anh Han

发表机构 * Faculty of Information Technology, University of Science (HCMUS), Ho Chi Minh City, Vietnam(信息技术学院,科学大学(HCMUS),胡志明市,越南) Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam(计算机科学与工程学院,胡志明市技术大学(HCMUT),胡志明市,越南) Vietnam National University – Ho Chi Minh City (VNU-HCM), Ho Chi Minh City, Vietnam(越南国家大学——胡志明市(VNU-HCM),胡志明市,越南) Luxembourg Institute of Science and Technology (LIST), Luxembourg(卢森堡科学与技术研究所(LIST),卢森堡) School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, United Kingdom(计算、工程与数字技术学院,泰赛德大学,米德尔斯布罗,英国)

AI总结 通过监督分类器识别重复囚徒困境中的策略,结合演化博弈论基线,发现随着收益增加,LLM反而更合作,与演化预测相反,表明对齐训练和人类推理模式的影响。

Comments 44 pages, 17 figures, 4 tables

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为自主代理,代表用户进行谈判、协调和行动。它们在这种环境中是否合作不再只是一个学术问题,而是人工智能治理的核心问题。我们从战略行为的角度出发,探究两个日常杠杆——利害关系的大小和描述交互的语言——如何塑造LLM在重复囚徒困境中采用的策略。我们不直接通过原始行动计数来解读合作,而是训练监督分类器来识别重复博弈的经典策略(始终合作、始终背叛、以牙还牙、赢-留-输-变),并将其作为观察LLM行为的透镜。为了了解在相同收益下策略分布应如何,我们推导了演化博弈论(EGT)基线,并将其与LLM数据进行比较。两种结果以揭示性的方式不一致:随着收益增加,演化理论预测背叛应占据主导,但LLM却向相反方向移动,变得更加合作——我们认为,这是对齐训练和LLM从训练数据中继承的人类推理模式的标志。我们进一步表明,这种情况并非前沿规模、专有模型所特有:它也出现在三个开放权重的较小LLM中。总体而言,我们的分析强调,收益设计和语言框架是强大但未被充分探索的引导LLM行为的杠杆,对评估、对齐和治理部署在高风险、多语言环境中的多代理AI系统具有直接影响。

英文摘要

Large language models (LLMs) are increasingly deployed as autonomous agents that negotiate, coordinate, and act on behalf of users. Whether they cooperate in such settings is no longer just an academic question, but a central issue for AI governance. We approach it from a strategic-behaviour angle, asking how two everyday levers - the size of what is at stake, and the language in which the interaction is described - shape the strategies LLMs adopt in a repeated Prisoner's Dilemma. Rather than reading cooperation off raw action counts, we train supervised classifiers to recognise the canonical strategies of repeated games (always cooperate, always defect, Tit-for-Tat, Win-Stay-Lose-Shift) and use them as a lens onto LLM behaviour. To know what the strategy distribution should look like under the same payoffs, we derive an evolutionary game theory (EGT) baseline and compare it with the LLM data. The two outcomes disagree in a revealing way: as stakes grow, evolutionary theory predicts that defection should take over the population, yet LLMs move in the opposite direction, becoming more cooperative - a signature, we argue, of alignment training and the human-like reasoning patterns LLMs inherit from their training data. We further show that this picture is not particular to frontier-scale, proprietary models: it also occurs with three open-weight smaller LLMs. Overall, our analysis highlights that payoff design and linguistic framing are powerful but under-explored levers for steering LLM behaviour, with direct implications for evaluating, aligning, and governing multi-agent AI systems deployed in high-stakes, multilingual environments.
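The four canonical strategies that the abstract's classifiers are trained to recognise can be written down in a few lines of Python. This is a sketch under assumptions, not the paper's code: the base payoffs T=5, R=3, P=1, S=0, the `scale` multiplier standing in for "stakes", and all function names are illustrative.

```python
C, D = "C", "D"

def always_cooperate(my_hist, opp_hist):
    return C

def always_defect(my_hist, opp_hist):
    return D

def tit_for_tat(my_hist, opp_hist):
    # Cooperate first, then copy the opponent's previous move.
    return opp_hist[-1] if opp_hist else C

def win_stay_lose_shift(my_hist, opp_hist):
    # Cooperate first. A "win" is a round where the opponent cooperated
    # (payoff R or T): repeat the last move. Otherwise switch moves.
    if not my_hist:
        return C
    if opp_hist[-1] == C:
        return my_hist[-1]
    return D if my_hist[-1] == C else C

def scaled_payoffs(scale=1):
    # Assumed base values T=5, R=3, P=1, S=0, multiplied by the stake size.
    T, R, P, S = (scale * x for x in (5, 3, 1, 0))
    return {(C, C): (R, R), (C, D): (S, T), (D, C): (T, S), (D, D): (P, P)}

def play(strategy_a, strategy_b, rounds, payoff):
    """Play a repeated game and return both move histories and total scores."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = payoff[(a, b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(a)
        hist_b.append(b)
    return "".join(hist_a), "".join(hist_b), score_a, score_b

print(play(tit_for_tat, always_defect, 4, scaled_payoffs()))
# ('CDDD', 'DDDD', 3, 8)
print(play(win_stay_lose_shift, always_defect, 4, scaled_payoffs(10)))
# ('CDCD', 'DDDD', 20, 120)
```

Move histories like `CDDD` or `CDCD` are the sort of trace a strategy classifier could be fitted to. Scaling the payoffs changes the scores but not the behaviour of these fixed rules, which is what makes a stake-dependent shift in LLM behaviour notable.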

2508.06336 2026-06-09 cs.LG cs.AI cs.HC cs.MA 版本更新

Unsupervised Partner Design Enables Robust Ad-hoc Teamwork

无监督伙伴设计实现鲁棒的临时团队协作

Constantin Ruhdorfer, Matteo Bortoletto, Victor Oei, Anna Penzkofer, Andreas Bulling

发表机构 * University of Southampton(南安普顿大学)

AI总结 提出无监督伙伴设计(UPD)方法,通过动态生成并基于可学习性准则自适应选择训练伙伴,无需预训练伙伴群体或手动调参,在多个任务中达到强性能,并在人机交互研究中获得更高评价。

Comments 27 pages

详情
AI中文摘要

我们引入了无监督伙伴设计(UPD),一种用于鲁棒临时团队协作的无群体多智能体强化学习方法。UPD 动态生成训练伙伴,并基于可学习性准则自适应地选择它们,消除了对预训练伙伴群体或手动参数调整的需求。我们表明,这种简单机制能够实现有效的伙伴多样性,并且在存在程序化关卡生成器时可以扩展到联合伙伴-环境选择。在基于级别的觅食、Overcooked-AI 和 Overcooked 泛化挑战中,与基于群体和无群体的基线方法相比,UPD 始终实现强性能。在一项人机交互用户研究中,使用 UPD 训练的智能体获得了更高的回报,并且比所有评估的基线方法被评为更具适应性、更像人类且更少令人沮丧。

英文摘要

We introduce Unsupervised Partner Design (UPD), a population-free multi-agent reinforcement learning method for robust ad-hoc teamwork. UPD generates training partners on-the-fly and selects them adaptively based on a learnability criterion, removing the need for pre-trained partner populations or manual parameter tuning. We show that this simple mechanism enables effective partner diversity and can be extended to joint partner-environment selection when a procedural level generator is available. Across Level-Based Foraging, Overcooked-AI, and the Overcooked Generalisation Challenge, UPD consistently achieves strong performance compared to both population-based and population-free baselines. In a human-AI user study, agents trained with UPD achieve higher returns and are rated as more adaptive, more human-like, and less frustrating than all evaluated baseline methods.

2601.01279 2026-06-09 econ.TH cs.AI cs.CE cs.CL cs.GT 版本更新

Supracompetitive Pricing Under AI Monoculture

人工智能单一群体下的超竞争定价

Shengyu Cao, Ming Hu

发表机构 * Rotman School of Management, University of Toronto(多伦多大学罗特曼管理学院)

AI总结 本文研究了在共享AI模型下,竞争卖家委托定价时可能产生的超竞争定价问题,通过双寡头模型分析发现,AI模型的鲁棒性和可重复性配置可能导致超竞争定价现象,且市场结果取决于初始定价倾向。

Comments 46 pages

详情
AI中文摘要

当竞争卖家将定价委托给共享的AI模型(如大型语言模型)时,相关推荐结合性能驱动的更新,聚合卖家反馈,引发一个问题:标准的AI部署实践是否会无意中产生超竞争定价?本文开发了一个简化的双寡头模型,其中两个卖家从共享的AI模型中获得定价推荐,该模型由两个参数特征化:一个倾向参数捕捉模型设置高价的倾向,一个输出保真度参数衡量该倾向与实际输出的一致性,其中倾向通过定期重新训练在观察到的结果上更新。我们发现,配置AI模型以鲁棒性和可重复性可以导致超竞争定价通过相变。在临界输出保真度阈值以下,竞争性定价是唯一的稳定结果。在临界值以上,模型表现出双稳态:竞争性和超竞争性定价都是局部稳定的,实际结果取决于模型的初始倾向。超竞争性定价提高了平均价格,但偶尔的低价推荐使检测变得复杂。对于完美输出保真度,任何内部初始倾向都会导致完全价格协调。对于有限训练批次大小为b,当初始倾向位于超竞争性盆地时,随着b的增加,超竞争性定价的概率接近1,不确定结果区域以O(1/√b)的速率缩小。任何减少模型倾向与卖家实际定价之间一致性的因素,无论是通过多样化AI供应商、引入推荐噪声还是减少卖家的遵守,都会将市场推向竞争性结果。

英文摘要

When competing sellers delegate pricing to a shared AI model, such as a large language model, correlated recommendations combined with performance-driven updates aggregating seller feedback raise a key question: can standard AI deployment practices inadvertently produce supracompetitive pricing? We develop a stylized duopoly model in which two sellers receive pricing recommendations from a shared AI characterized by two parameters: a propensity parameter capturing the model's tendency to set high prices and an output-fidelity parameter measuring alignment between this tendency and actual outputs, with propensity updated via periodic retraining on observed outcomes. We find that configuring AI models for robustness and reproducibility can lead to supracompetitive pricing via a phase transition. Below a critical output-fidelity threshold, competitive pricing is the unique stable outcome. Above it, the model exhibits bistability: both competitive and supracompetitive pricing are locally stable, with the realized outcome determined by the model's initial propensity. Supracompetitive pricing raises average prices, but occasional low-price recommendations complicate detection. With perfect output fidelity, full price coordination emerges from any interior initial propensity. For finite training batches of size $b$, when the initial propensity lies in the supracompetitive basin, the probability of supracompetitive pricing approaches 1 as $b$ increases, with the region of indeterminate outcomes shrinking at rate $O(1/\sqrt{b})$. Any factor reducing alignment between the model's propensity and sellers' actual pricing, whether through diversifying AI providers, introducing recommendation noise, or reducing seller adherence, pushes the market toward competitive outcomes.

2602.06934 2026-06-09 cs.PL cs.AI cs.DC cs.LO cs.MA 版本更新

Implementing Grassroots Logic Programs with Multiagent Transition Systems and AI (Full Version)

基于多智能体转换系统和人工智能实现基础逻辑程序

Ehud Shapiro

发表机构 * London School of Economics(伦敦经济学院) Weizmann Institute of Science(魏茨曼科学研究院)

AI总结 本文提出dGLP和madGLP两种确定性变体,通过全局链接实现共享变量,证明其正确性,并展示如何利用AI技术实现多智能体通信。

详情
AI中文摘要

Grassroots Logic Programs (GLP) 是一种并发逻辑编程语言,其中逻辑变量被划分为配对的读者和写者。一次赋值至多经由写者产生一次,并至多经由其配对的读者消耗一次,且赋值中可以包含额外的读者和/或写者。这使得丰富的多向通信模态能够被简洁地表达。该语言是与并发(cGLP)和多智能体(maGLP)操作语义一起引入的。本文由此导出 (ia) dGLP,即cGLP的确定性对应物,以及 (ib) madGLP,即maGLP的对应物,其中确定性智能体仅通过异步消息传递通信,并证明二者相对于各自抽象对应物的正确性。maGLP中跨越智能体的共享变量对可以实现为由全局链接配对的本地变量,其正确性源于不相交替换的交换性(GLP单次出现不变量的推论)。我们进一步证明madGLP是基础的(grassroots)。dGLP和madGLP都作为本文采用并描述的AI驱动实现规范(数学→非正式规范→Dart)的形式规范:基于dGLP,AI(Claude)用Dart开发了一个基于工作站的GLP实现;基于madGLP,它正在开发一个基于智能手机的多智能体实现。

英文摘要

Grassroots Logic Programs (GLP) is a concurrent logic programming language in which logic variables are partitioned into paired readers and writers. An assignment is produced at most once via a writer and consumed at most once via its paired reader, and may contain additional readers and/or writers. This enables the concise expression of rich multidirectional communication modalities. The language was introduced together with concurrent (cGLP) and multiagent (maGLP) operational semantics. Here, we derive from these (ia) dGLP, a deterministic counterpart of cGLP, and (ib) madGLP, a counterpart of maGLP in which deterministic agents communicate solely by asynchronous message passing, and prove them correct against their abstract counterparts. maGLP shared variable pairs spanning agents can be implemented as local variables paired by global links, with correctness following from disjoint substitution commutativity (a consequence of GLP's single-occurrence invariant). We further prove that madGLP is grassroots. Both dGLP and madGLP serve as formal specifications for an AI-driven implementation discipline (math → informal spec → Dart) employed and described here: from dGLP, AI (Claude) developed a workstation-based GLP implementation in Dart, and from madGLP it is developing a smartphone-based multiagent one.

2605.09823 2026-06-09 cs.MA cs.AI 版本更新

CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs

CalBench: 评估多智能体大语言模型中的协调-隐私权衡

Chelsea Zou, Yiheng Yao, Selena She, Noah Goodman, Robert D. Hawkins

发表机构 * Stanford University(斯坦福大学)

AI总结 提出CalBench基准,用于在私有信息下评估多智能体日程协调中任务完成、成本、通信、公平性和隐私泄露的权衡。

详情
AI中文摘要

个人AI助手开始作为代表行事,能够访问日历、收件箱和用户偏好。日程安排使信任问题具体化:助手必须与其他助手协调,同时决定透露关于其所代表的人的哪些信息。我们引入了CalBench,一个用于在私有信息下进行多智能体日程安排的可控基准。在每个任务中,$N$个智能体管理各自的私有日历,并安排$M$个传入会议流,同时最小化干扰成本。由于没有智能体可以检查另一个智能体的日历,成功需要语言介导的协调而非集中规划。CalBench生成可解场景,配备CP-SAT oracle解决方案和去中心化的非LLM参考协议,能够在匹配信息约束下评估任务成功、额外成本、通信效率、负担公平性和隐私泄露。在七个模型系列中,我们发现仅完成度会遗漏重要失败:智能体留下可避免的成本,通信量不能预测更低的遗憾,保护隐私的沉默可能剥夺队友公平负担分配所需的成本信息。CalBench提供了一个可重复的测试平台,用于研究自主助手在规模化部署前能否代表用户进行协调。

英文摘要

Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes the trust problem concrete: an assistant must coordinate with other assistants while deciding what to reveal about the person it represents. We introduce CalBench, a controlled benchmark for multi-agent calendar scheduling under private information. In each task, $N$ agents manage separate private calendars and schedule a stream of $M$ incoming meetings while minimizing disruption costs. Because no agent can inspect another agent's calendar, success requires language-mediated coordination rather than centralized planning. CalBench generates solvable scenarios with CP-SAT oracle solutions and decentralized non-LLM reference protocols, enabling evaluation of task success, excess cost, communication efficiency, burden fairness, and privacy leakage under matched information constraints. Across seven model families, we find that completion alone misses important failures: agents leave avoidable cost on the table, communication volume does not predict lower regret, and privacy-preserving silence can deprive teammates of cost information needed for fair burden allocation. CalBench provides a reproducible testbed for studying whether autonomous assistants can coordinate on behalf of users before deployment at scale.

4. 搜索、优化与约束求解 12 篇

2606.08282 2026-06-09 cs.AI 新提交

From Validator Selection to Portfolio Collection Optimization in Proof-of-Stake Blockchains

从验证者选择到权益证明区块链中的投资组合收集优化

Jonas Gehrlein, Grzegorz Miebs, Matteo Brunelli, Adam Mielniczuk, Miłosz Kadziński

发表机构 * Parity Technologies AG Institute of Computing Science, Poznan University of Technology(波兹南工业大学计算科学研究所) Department of Industrial Engineering, University of Trento(特伦托大学工业工程系)

AI总结 针对权益证明区块链中提名者选择验证者的多准则决策问题,提出双目标优化框架,同时最大化验证者期望效用(代表组合质量和盈利能力)和分配期望熵(代表风险分散),通过主动偏好学习和多目标进化算法求解,并引入交互式二分搜索导航确定满意折衷。

Comments 24 pages, 5 figures, 3 tables

详情
AI中文摘要

我们考虑权益证明区块链环境中出现的一个问题,其中称为提名者的代理选择验证者——负责维护区块链物理基础设施的实体。选择过程本质上是主观和多准则的,并且结合了提名者通常通过多个账户操作的事实。这引出了一个投资组合选择问题,其中代理寻求将其提名分配到多个账户以分散风险。我们提出了一个决策支持框架来优化这一选择,通过同时最大化两个目标:可能分配的验证者的期望效用,代表组合质量和盈利能力;以及分配的期望熵,代表跨 stash 的多样化和风险缓解。验证者效用通过基于多属性价值理论的原始主动偏好学习过程推导,重点关注排名靠前的验证者。所得的双目标优化问题通过多目标进化算法求解,为了支持最终选择,我们引入了一个交互式二分搜索导航程序,该程序引导提名者穿过前沿,并仅通过几个问题确定一个满意的折衷。数值实验检验了优化策略,而涉及五位经验丰富的提名者的专家评估证实了该方法的实际相关性和有用性。

英文摘要

We consider a problem arising in proof-of-stake blockchain environments, where agents called nominators select validators - entities responsible for maintaining the blockchain's physical infrastructure. The selection process is inherently subjective and multi-criterial and combines with the fact that nominators commonly operate through multiple accounts. This gives rise to a portfolio selection problem, where agents seek to distribute their nominations across accounts to diversify risk. We propose a decision support framework to optimize this selection by simultaneously maximizing two objectives: the expected utility of the validators likely to be allocated, representing portfolio quality and profitability, and the expected entropy of the allocation, representing diversification and risk mitigation across stashes. Validator utilities are derived using an original active preference learning procedure based on multi-attribute value theory, with emphasis on top-ranked validators. The resulting bi-objective optimization problem is solved with a multi-objective evolutionary algorithm and, to support the final choice, we introduce an interactive binary search navigation procedure that guides the nominator through the front and identifies a satisfactory trade-off with only a few questions. Numerical experiments examine the optimization strategies, while an expert assessment involving five experienced nominators confirms the approach's practical relevance and usefulness.

2606.08904 2026-06-09 cs.AI 新提交

Order Matters: Unveiling the Hidden Impact of Macro Placement Sequences via Proxy-Guided LLM Evolution

顺序至关重要:通过代理引导的LLM进化揭示宏放置序列的隐藏影响

Shibing Mo, Jing Liu, Jianchu Xu, Ruilin Wu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出OrderPlace框架,利用代理引导的大语言模型进化自动发现宏放置顺序策略,在ISPD 2005基准上相比现有方法线长减少14.08%-34.04%。

Comments ICML2026

详情
AI中文摘要

宏放置是现代芯片物理设计中的基本步骤,在决定高维组合优化问题的解质量方面起着关键作用。尽管最近在空间坐标确定的机器学习方面取得了进展,但放置排序的时间维度仍然主要由静态启发式方法控制。在这项工作中,我们证明了放置顺序不仅仅是预处理步骤,而是优化的决定性因素,其中次优的早期决策会触发不可逆的连锁反应,从而限制解空间。为了利用这一未探索的维度,我们提出了OrderPlace,一个代理引导的LLM进化框架,用于自动发现宏放置顺序策略。OrderPlace不依赖于手工制作的启发式方法(如基于面积或连通性的排序),而是探索更广泛的代码级策略空间,从静态评分指标到动态物理启发机制。为了减轻评估序列的高昂成本,我们引入了一种轻量级代理评估机制,该机制使用确定性贪婪探针高效过滤候选序列。在标准ISPD 2005基准上的实验结果表明,OrderPlace发现了新颖的排序策略。与WireMask-EA和最先进的方法EGPlace相比,OrderPlace分别将线长减少了34.04%和14.08%。

英文摘要

Macro placement is a fundamental step in modern chip physical design, playing a crucial role in determining the solution quality of high-dimensional combinatorial optimization problems. Despite recent advancements in machine learning for spatial coordinate determination, the temporal dimension of placement sequencing remains largely governed by static heuristics. In this work, we demonstrate that the placement sequence is not merely a preprocessing step but a decisive factor in optimization, where suboptimal early decisions trigger irreversible domino effects that constrain the solution space. To harness this unexplored dimension, we propose OrderPlace, a proxy-guided LLM evolution framework for automatically discovering macro placement order strategies. Instead of relying on manually crafted heuristics such as area- or connectivity-based ordering, OrderPlace explores a broader space of code-level policies, ranging from static scoring metrics to dynamic physics-inspired mechanisms. To mitigate the prohibitive cost of evaluating sequences, we introduce a lightweight proxy evaluation mechanism that efficiently filters candidates using a deterministic greedy probe. Experimental results on the standard ISPD 2005 benchmarks demonstrate that OrderPlace discovers novel ordering strategies. Compared with WireMask-EA and the state-of-the-art method EGPlace, OrderPlace reduces wirelength by 34.04% and 14.08%, respectively.

2606.09343 2026-06-09 cs.AI 新提交

Leveraging Structural Constraints for Diffusion-based Neural TSP Solvers

利用结构约束的基于扩散的神经TSP求解器

Mickaël Basson, Philippe Preux

发表机构 * Université de Lille, France(法国里尔大学) CNRS, France(法国国家科学研究中心) Inria, France(法国国家信息与自动化技术研究院) UMR 9189-CRIStAL, Lille, France(法国里尔大学UMR 9189-CRIStAL研究中心)

AI总结 提出投影一致性推理(PCI),用结构感知投影替代梯度细化,在TSP500/1000上分别达到0.17%/0.31%最优性差距,推理时间减少30-40%。

详情
Journal ref
The 20th Learning and Intelligent OptimizatioN Conference (LION), Jun 2026, Milan, Italy
AI中文摘要

神经组合优化最近在欧几里得旅行商问题(TSP)上使用生成模型(如扩散和一致性模型)取得了强劲结果。最先进的方法如FT2T将基于一致性的快速预测与基于梯度的推理时细化相结合。然而,梯度搜索通常会产生显著的计算开销,并且可能与可行解的离散结构不一致。我们引入了投影一致性推理(PCI),这是一种即插即用、无需重新训练的替代方案,用结构感知投影替换梯度细化:PCI从一致性模型输出解码有效的哈密顿环,并应用轻量级局部搜索(例如2-opt)。PCI在500个城市的TSP上实现了0.17%的平均最优性差距(OG),在1000个城市的TSP上实现了0.31%,优于FT2T的最佳设置(OG分别为0.22%和0.36%),同时将推理时间减少了30%至40%。PCI还表现出更低的方差和内存使用,并且在快速生成解决方案方面可以超越经典启发式算法(如LKH3)。我们的结果表明,结构感知的推理时操作为神经TSP求解器提供了一条实用且原则性的路径,补充了训练时目标。

英文摘要

Neural combinatorial optimization has recently achieved strong results on the Euclidean Traveling Salesman Problem (TSP) using generative models such as diffusion and consistency models. State-of-the-art approaches like FT2T combine fast consistency-based prediction with gradient-based inference-time refinement. However, gradient search often incurs significant computational overhead and may not align with the discrete structure of feasible solutions. We introduce Projected Consistency Inference (PCI), a plug-and-play, retraining-free alternative that replaces gradient refinement with structure-aware projections: PCI decodes valid Hamiltonian tours from the consistency model output and applies a lightweight local search (e.g., 2-opt). PCI achieves an average optimality gap (OG) of 0.17% on TSP with 500 cities, and 0.31% on TSP with 1000 cities, outperforming FT2T's best settings (OG 0.22% and 0.36%, respectively) while reducing inference time by 30 to 40%. PCI also exhibits lower variance and memory usage, and can surpass classical heuristics such as LKH3 in rapid solution generation. Our results demonstrate that structure-aware inference-time operations provide a practical and principled path for neural TSP solvers, complementing training-time objectives.
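The abstract names 2-opt as an example of the lightweight local search applied to a decoded tour. The sketch below shows only the textbook 2-opt move on a Euclidean instance. It does not reproduce PCI's decoding from the consistency model output, and the function names and the toy four-city instance are illustrative assumptions.

```python
import math

def tour_length(points, tour):
    """Total length of the closed tour visiting points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))

def two_opt(points, tour):
    """Repeat improving 2-opt moves until no edge exchange shortens the tour."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a city, so there is nothing to exchange
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Length change from replacing edges (a, b) and (c, d) with (a, c) and (b, d).
                delta = math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)
                if delta < -1e-12:
                    # Reversing the segment between the two edges performs the exchange.
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour

# Toy instance: corners of the unit square visited in a self-crossing order.
points = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
start = [0, 1, 2, 3]
better = two_opt(points, start)
print(better, tour_length(points, better))  # [0, 2, 1, 3] 4.0
```

Each accepted move removes a crossing and strictly shortens the tour, so the loop terminates. The result is still a valid Hamiltonian tour, which is why this kind of operator fits the discrete structure of feasible solutions.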

2606.09666 2026-06-09 cs.AI 新提交

Frequency-based Constrained Sampling for Interval Patterns

基于频率的区间模式约束采样

Djawad Bekkoucha, Abdelkader Ouali, Bruno Crémilleux

发表机构 * Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Université Paris-Saclay, CNRS(巴黎-萨克雷大学数字科学跨学科实验室(LISN),法国国家科学研究中心) Université Caen Normandie, ENSICAEN, CNRS, Normandie Univ, GREYC UMR6072(卡昂诺曼底大学,卡昂国立高等工程师学校,法国国家科学研究中心,诺曼底大学,GREYC UMR6072)

AI总结 提出CFips方法,将用户定义的句法约束直接融入多步采样框架,通过分解为区间边界上的基本谓词实现精确采样,保证在约束模式空间中按频率比例采样,实验证明能完成超时失败的挖掘任务。

Comments 16 pages

详情
AI中文摘要

输出空间模式采样是穷举模式挖掘的一种强大替代方案,用于探索大型模式空间,因为它使用户能够根据选定的兴趣度量关注代表性模式。在本文中,我们解决了在用户定义的句法约束下采样区间模式的问题。我们引入了CFips,一种将约束直接融入采样过程的采样方法。该方法基于多步采样框架,通过将约束分解为区间边界上的基本谓词来支持多种句法约束,同时保持精确采样保证。我们正式证明CFips在约束模式空间内按频率比例采样区间模式。实验结果表明,将约束融入采样过程能够完成在给定超时内否则会失败的挖掘任务。

英文摘要

Output space pattern sampling is a powerful alternative to exhaustive pattern mining for exploring large pattern spaces, as it enables users to focus on representative patterns drawn according to a chosen interestingness measure. In this paper, we address the problem of sampling interval patterns under user-defined syntactic constraints. We introduce CFips, a sampling approach that incorporates constraints directly into the sampling procedure. The approach relies on a multi-step sampling framework and supports several syntactic constraints by decomposing them into elementary predicates on interval bounds while preserving exact sampling guarantees. We formally prove that CFips samples interval patterns proportionally to their frequency within the constrained pattern space. The experimental results show that integrating constraints into the sampling procedure makes it possible to complete mining tasks that would otherwise fail within a given timeout.

2606.07562 2026-06-09 q-bio.BM cs.AI 交叉投稿

The Montparnasse Algorithm for RNA Design

RNA设计的蒙帕纳斯算法

Tristan Cazenave

发表机构 * Tristan Cazenave

AI总结 提出基于广义嵌套滚动策略适应的蒙特卡洛搜索框架Montparnasse,结合问题特定先验和字典序多准则评估,在Eterna100基准上比现有最优方法DesiRNA快三倍以上,并在血红蛋白α信使RNA二级结构优化中优于LinearDesign。

详情
AI中文摘要

RNA设计包括发现一个优化预定义标准(如二级结构)的核苷酸序列。它对合成生物学、医学和纳米技术很有用。我们提出了Montparnasse,一个基于广义嵌套滚动策略适应的蒙特卡洛搜索框架,并增加了问题特定的先验、第1级的慢速和长期适应,以及字典序多准则评估。Montparnasse在所有时间限制下一致地比现有最优方法DesiRNA更快地解决了Eterna100 V1基准的所有100个谜题,总体达到完全覆盖的速度快三倍以上。在血红蛋白α的信使RNA二级结构优化中,它识别出的序列比LinearDesign的MFE最优解具有更多的配对碱基。

英文摘要

RNA design consists of discovering a nucleotide sequence that optimizes predefined criteria, such as secondary structure. It is useful for synthetic biology, medicine, and nanotechnology. We propose Montparnasse, a Monte Carlo search framework based on Generalized Nested Rollout Policy Adaptation, augmented with a problem-specific prior, slow and long adaptation at level 1, and a lexicographic multicriteria evaluation. Montparnasse solves all 100 puzzles of the Eterna100 V1 benchmark consistently faster than DesiRNA, the previous state of the art, across all time limits, reaching full coverage more than three times faster overall. On messenger RNA secondary structure optimization for hemoglobin alpha, it identifies sequences with more paired bases than the MFE-optimal solution of LinearDesign.

2412.13858 2026-06-09 cs.AI cs.LG 版本更新

IDEQ -- Improving Diffusion Models for the Traveling Salesman Problem (TSP) by Leveraging the Structure of the Solution Space

IDEQ -- 利用解空间结构改进旅行商问题的扩散模型

Mickael Basson, Philippe Preux

发表机构 * Université de Lille(里尔大学) CNRS(国家科学研究中心) Inria(法国国家信息与自动化技术研究院) UMR 9198-CRIStAL(UMR 9198-CRIStAL研究中心)

AI总结 提出IDEQ方法,通过利用TSP解空间的约束结构和基于2-opt轨道的均匀分布训练目标,改进扩散模型求解TSP,在合成实例和TSPlib上达到新SOTA,接近LKH3性能。

详情
AI中文摘要

我们研究扩散模型求解旅行商问题。基于最近的DIFUSCO和T2TCO方法,我们提出IDEQ。IDEQ通过利用TSP状态空间的约束结构来提高解的质量。IDEQ的另一个关键组成部分是,将DIFUSCO课程学习的最后阶段替换为考虑哈密顿环上的均匀分布,这些环在2-opt算子下的轨道收敛到最优解作为训练目标。我们的实验表明,IDEQ在合成实例上改进了此类神经网络技术的现有水平。更重要的是,我们的实验表明,IDEQ在TSPlib(TSP社区的参考基准)的实例上表现非常好:它紧密匹配最佳启发式算法LKH3的性能,甚至在两个分别包含1577和3795个城市的TSPlib实例上能够获得比LKH3更好的解。IDEQ在500个城市的TSP实例上获得0.3%的最优性差距,在1000个城市的TSP实例上获得0.5%的最优性差距。这为基于神经网络的TSP求解方法设立了新的SOTA。此外,与DIFUSCO和T2TCO相比,IDEQ表现出更低的方差和更好的随城市数量扩展的能力。

英文摘要

We investigate diffusion models to solve the Traveling Salesman Problem (TSP). Building on the recent DIFUSCO and T2TCO approaches, we propose IDEQ. IDEQ improves the quality of the solutions by leveraging the constrained structure of the state space of the TSP. Another key component of IDEQ is to replace the last stages of the DIFUSCO curriculum learning with a training objective defined as a uniform distribution over the Hamiltonian tours whose orbits under the 2-opt operator converge to the optimal solution. Our experiments show that IDEQ improves the state of the art for such neural-network-based techniques on synthetic instances. More importantly, our experiments show that IDEQ performs very well on the instances of TSPlib, a reference benchmark in the TSP community: it closely matches the performance of the best heuristic, LKH3, and even obtains better solutions than LKH3 on 2 TSPlib instances with 1577 and 3795 cities. IDEQ obtains a 0.3% optimality gap on TSP instances with 500 cities, and 0.5% on TSP instances with 1000 cities. This sets a new SOTA for neural-based methods solving the TSP. Moreover, IDEQ exhibits lower variance and scales better with the number of cities than DIFUSCO and T2TCO.
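To make the 2-opt operator mentioned in the abstract concrete, here is a minimal Python sketch of plain 2-opt local search on a Euclidean instance. It is an illustration only, not IDEQ's training or decoding procedure; the function names `tour_length` and `two_opt` are assumptions made for this example.

```python
import math

def tour_length(points, tour):
    """Total length of the closed tour that visits points in the given order."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def two_opt(points, tour):
    """Apply improving 2-opt moves until none remains (a 2-opt local optimum)."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # the two edges are adjacent; the move changes nothing
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Replace edges (a, b) and (c, d) with (a, c) and (b, d).
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    tour[i + 1 : j + 1] = reversed(tour[i + 1 : j + 1])
                    improved = True
    return tour
```

Starting from a tour and repeating such moves traces what the abstract calls the orbit of that tour under 2-opt; the sketch simply follows one orbit to its local optimum.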

2507.22876 2026-06-09 cs.AI cs.LO 版本更新

Discovering heuristics in a complex SAT solver with large language models

利用大型语言模型发现复杂SAT求解器中的启发式策略

Yiwen Sun, Furong Ye, Zhihan Chen, Ke Wei, Shaowei Cai

发表机构 * School of Data Science, Fudan University, Shanghai, China(复旦大学数据科学学院,上海,中国) Key Laboratory of System Software, Institute of Software, Chinese Academy of Sciences, Beijing, China(中国科学院软件研究所系统软件重点实验室,北京,中国) SeedMath Technology Limited, Beijing, China(SeedMath技术有限公司,北京,中国)

AI总结 提出AutoModSAT框架,结合模块化求解器设计、无监督提示优化和进化算法,利用LLM自动优化SAT求解器,在多个数据集上性能提升40%。

详情
AI中文摘要

可满足性问题(SAT)是计算复杂性理论的基础,并具有广泛的工业应用。由于现代SAT求解器架构复杂,在现实环境中优化它们相当具有挑战性。尽管已经开发了自动配置框架,但它们依赖于手动约束的搜索空间。在这里,我们开发了AutoModSAT,一个使用大型语言模型(LLM)自动优化SAT求解器的框架。AutoModSAT结合了兼容LLM的模块化求解器设计、无监督提示优化以多样化生成的函数,以及基于预搜索策略和$(1+\lambda)$进化算法的高效搜索过程。在广泛的数据集上进行的大量实验表明,AutoModSAT相比基线求解器实现了40%的性能提升,相比最先进的求解器实现了30%的提升。此外,在大多数测试数据集上,AutoModSAT相比最先进求解器的参数调优替代方案也实现了显著的加速。这些结果证明了LLM引导的启发式发现用于优化复杂SAT求解器的潜力。

英文摘要

The Satisfiability problem (SAT) is fundamental in computational complexity theory and has a wide range of industrial applications. Optimizing modern SAT solvers in real-world settings is quite challenging due to their intricate architectures. While automatic configuration frameworks have been developed, they rely on manually constrained search spaces. Here we develop AutoModSAT, a framework that uses large language models (LLMs) to automatically optimize SAT solvers. AutoModSAT combines an LLM-compatible modular solver design, unsupervised prompt optimization to diversify generated functions, and an efficient search procedure based on a presearch strategy and a $(1+\lambda)$ evolutionary algorithm. Extensive experiments across a wide range of datasets demonstrate that AutoModSAT achieves $40\%$ performance improvement over the baseline solver and $30\%$ improvement over the state-of-the-art solvers. Moreover, AutoModSAT also attains a notable speedup compared to the parameter-tuned alternatives of the state-of-the-art solvers over most of the test datasets. These results demonstrate the potential of LLM-guided heuristic discovery for optimizing complex SAT solvers.
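The $(1+\lambda)$ evolutionary algorithm named in the abstract can be sketched generically as follows. This is a textbook version on bit strings, assuming a fitness to maximise; AutoModSAT applies the scheme to LLM-generated solver functions, which this toy does not model. The function name and parameters are illustrative assumptions.

```python
import random

def one_plus_lambda_ea(fitness, n_bits, lam=8, generations=200, seed=0):
    """Generic (1+lambda) EA: one parent, lam mutated offspring per generation."""
    rng = random.Random(seed)
    parent = [rng.randint(0, 1) for _ in range(n_bits)]
    parent_fit = fitness(parent)
    for _ in range(generations):
        best_child, best_fit = None, float("-inf")
        for _ in range(lam):
            # Standard bit mutation: flip each bit independently with prob. 1/n.
            child = [bit ^ (rng.random() < 1.0 / n_bits) for bit in parent]
            child_fit = fitness(child)
            if child_fit > best_fit:
                best_child, best_fit = child, child_fit
        # Elitist replacement: the parent is never replaced by a worse child.
        if best_fit >= parent_fit:
            parent, parent_fit = best_child, best_fit
    return parent, parent_fit
```

For example, `one_plus_lambda_ea(sum, 20)` maximises the number of ones in a 20-bit string (the OneMax toy problem). Because replacement is elitist, the parent's fitness never decreases across generations.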

2601.06188 2026-06-09 cs.AI 版本更新

Dynamic Distributed Constraint Optimization and Metareasoning for Continual, Large-Scale Satellite Operations

面向持续大规模卫星运行的动态分布式约束优化与元推理

Itai Zilberstein, Steve Chien

发表机构 * Carnegie Mellon University(卡内基梅隆大学) California Institute of Technology(加州理工学院) Jet Propulsion Laboratory(喷气推进实验室)

AI总结 针对动态大规模卫星调度问题,提出动态分布式约束优化模型DCOSP,并设计元推理框架控制重计算时机,结合D-NSS算法实现近优解,显著优于基线方法。

Comments An earlier version titled "Large-Scale Continual Scheduling and Execution for Dynamic Distributed Satellite Constellation Observation Allocation" appears as an extended abstract in the Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

详情
AI中文摘要

随着地球观测卫星星座在规模和能力上的增长,分布式星载控制为新型响应和时效性测量提供了途径。然而,将自主性部署到卫星上需要高效的计算和通信。本文解决了在动态、大规模问题中调度数百颗卫星观测的挑战,涉及数百万个变量。我们提出了动态多卫星星座观测调度问题(DCOSP),这是一种新的动态分布式约束优化问题(DDCOP)形式化,集成了调度与执行。DCOSP具有新颖的最优性条件,为此我们构建了一个精确的全知离线算法。受星载操作强资源约束的启发,我们引入了一个在DDCOP中融入元推理的框架,该框架控制智能体何时消耗资源以重新计算解决方案。此外,我们提出了动态增量邻域随机搜索(D-NSS)算法,这是一种不完整的在线分解型DDCOP算法,通过修复局部子问题来响应动态事件。我们在逼真的仿真中证明,D-NSS收敛到近优解,在解质量、计算时间和消息量方面优于标准DDCOP基线,而我们的元推理框架成功地在资源节约与效用之间取得平衡。作为NASA FAME任务的一部分,这项工作为迄今为止最大规模的空间分布式多智能体AI演示奠定了基础。

英文摘要

As Earth-observing satellite constellations grow in size and capability, distributed onboard control offers a pathway to novel responses and time-sensitive measurements. However, deploying autonomy to satellites requires efficient computation and communication. This work addresses the challenge of scheduling observations for hundreds of satellites in a dynamic, large-scale problem with millions of variables. We present the dynamic multi-satellite constellation observation scheduling problem (DCOSP), a new formulation of dynamic distributed constraint optimization problems (DDCOP) that models integrated scheduling and execution. DCOSP features a novel optimality condition, for which we construct an exact omniscient offline algorithm. Motivated by the strong resource constraints of onboard satellite operations, we introduce a framework to incorporate metareasoning in DDCOPs that controls when agents expend resources to recompute solutions. In addition, we present the dynamic incremental neighborhood stochastic search (D-NSS) algorithm, an incomplete online decomposition-based DDCOP algorithm that repairs localized sub-problems in response to dynamic events. We demonstrate in realistic simulations that D-NSS converges to near-optimal solutions, outperforming standard DDCOP baselines in solution quality, computation time, and message volume, while our metareasoning framework successfully balances resource conservation with utility. As part of the NASA FAME mission, this work lays the foundation for the largest in-space demonstration of distributed multi-agent AI to date.

2606.06656 2026-06-09 cs.AI cs.LO 版本更新

A Study of Parallel Continuous Local Search

并行连续局部搜索研究

Cody J Christopher, Charles Gretton

发表机构 * School of Computing, Australian National University(澳大利亚国立大学计算机学院)

AI总结 研究并行连续局部搜索(CLS)在对称伪布尔约束可满足性问题中的应用,发现冗余约束会抑制收敛,CLS在混合求解中能快速完成部分赋值,且局部搜索因鞍点密集目标而快速收敛到稳定解质量分布。

详情
AI中文摘要

我们研究并行连续局部搜索(CLS)作为解决具有对称伪布尔(PB)约束的布尔可满足性问题的一种方法。这里,$n$变量PB可满足性问题被松弛为一个连续优化问题,其目标函数在$n$维超立方体上可微。对于可满足的实例,该优化问题的全局最小值对应于所讨论SAT问题的满足赋值。我们通过实证实验提出了几个新发现:(i)冗余约束会抑制而非加速收敛;(ii)CLS在混合设置中作为子求解器显示出前景,能快速完成部分赋值;(iii)由于鞍点密集的目标函数,局部搜索迅速收敛到解质量的稳定分布(即满足程度),此时额外的求解步骤收益递减。我们的发现为在现代加速硬件上使用CLS解决SAT问题提供了实用指导。

英文摘要

We study parallel Continuous Local Search (CLS) as a solution approach for Boolean satisfiability problems with symmetric pseudo-Boolean (PB) constraints. Here, the $n$-variable PB-satisfiability problem is relaxed to a continuous optimisation problem with a differentiable objective function on an $n$-dimensional hypercube. For satisfiable instances, the global minimisers of this optimisation problem correspond to satisfying assignments of the SAT problem at hand. We present several novel findings via empirical experiments: (i) redundant constraints can inhibit rather than accelerate convergence; (ii) CLS shows promise as a sub-solver in hybridised settings, quickly completing partial assignments; and (iii) local search rapidly converges to a stable distribution of solution quality (i.e., degree of satisfaction), due to saddle-dense objectives where additional solver steps yield diminishing returns. Our findings inform practical uses of CLS for SAT on modern accelerator hardware.
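The idea of relaxing a satisfiability problem to a differentiable objective on the hypercube can be illustrated with a small Python sketch. It uses plain CNF clauses and a product relaxation, a common textbook choice, rather than the symmetric PB constraints or the objective studied in the paper; all names are assumptions, and each variable is assumed to occur at most once per clause.

```python
import math

def clause_factors(clause, x):
    """Per-literal 'falseness': literal k > 0 is variable k-1, k < 0 its negation."""
    return [(1.0 - x[abs(l) - 1]) if l > 0 else x[abs(l) - 1] for l in clause]

def objective(clauses, x):
    """Sum over clauses of the product of literal falseness; 0 means all satisfied."""
    return sum(math.prod(clause_factors(c, x)) for c in clauses)

def projected_gradient_step(clauses, x, lr=0.5):
    """One gradient step on the relaxed objective, clipped back to [0, 1]^n."""
    grad = [0.0] * len(x)
    for c in clauses:
        f = clause_factors(c, x)
        for k, lit in enumerate(c):
            others = math.prod(f[:k] + f[k + 1:])
            # d/dx of (1 - x) is -1 for a positive literal, d/dx of x is +1.
            grad[abs(lit) - 1] += (-1.0 if lit > 0 else 1.0) * others
    return [min(1.0, max(0.0, xi - lr * g)) for xi, g in zip(x, grad)]
```

On the formula (x1 OR x2) AND (NOT x1 OR x2), one step from the centre point `[0.5, 0.5]` moves x2 to 1.0 and drives the objective from 0.5 to 0, matching the correspondence between global minimisers and satisfying assignments described in the abstract.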

2606.07047 2026-06-09 cs.AI 版本更新

Front-to-Attractors: Modifying the Front-to-Front Heuristic in Bidirectional Search

Front-to-Attractors:修改双向搜索中的Front-to-Front启发式

Alvin Zou, Muhammad Suhail Saleem, Maxim Likhachev

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Front-to-Attractors (F2A)启发式类,通过动态维护吸引子集替代完整前沿,在保持Front-to-Front信息性的同时大幅降低计算开销,实验显示相比F2F减少最多11.2倍成对评估,平均节点扩展比F2E少4.8倍。

详情
AI中文摘要

启发式在双向搜索算法的性能中扮演核心角色,通常依赖两个主要类别。Front-to-end (F2E) 启发式估计从状态 s 到搜索目标(前向搜索的目标或后向搜索的起点)的距离。相比之下,Front-to-front (F2F) 启发式通过成对函数 h(s, s') 估计从 s 到对面搜索前沿的距离,其中 s' 遍历前沿状态。尽管 F2F 启发式通常信息量更大,从而减少节点扩展数量,但它们依赖大量的成对评估,导致显著的计算开销。为了解决这一限制,我们引入了一个新的启发式类——Front-to-attractors (F2A),它在保留 F2F 大部分信息性的同时,大幅降低了计算成本。F2A 不是评估到对面前沿所有状态的距离,而是估计从 s 到对面搜索方向中一个小的、动态维护的吸引子集的距离。这些吸引子作为完整前沿的替代,使得在极少的计算开销下提供丰富的启发式指导,同时保持 F2F 提供的最优性保证。我们在多个领域评估了 F2A,结果显示,与 F2F 相比,它减少了最多 11.2 倍的成对评估,同时平均节点扩展比 F2E 少 4.8 倍。

英文摘要

Heuristics play a central role in the performance of bidirectional search algorithms, which commonly rely on two main classes. Front-to-end (F2E) heuristics estimate the distance from a state s to the target of the search (the goal for forward search or the start for backward search). In contrast, front-to-front (F2F) heuristics estimate the distance from s to the opposite search frontier using a pairwise function h(s, s'), where s' ranges over frontier states. Although F2F heuristics are typically more informative and therefore reduce the number of node expansions, their reliance on extensive pairwise evaluations incurs substantial computational overhead. To address this limitation, we introduce a new heuristic class, front-to-attractors (F2A), that preserves much of the informativeness of F2F while dramatically reducing its computational cost. Rather than evaluating distances to all states on the opposite frontier, F2A estimates the distance from s to a small, dynamically maintained set of attractors in the opposite search direction. These attractors serve as a surrogate for the full frontier, enabling rich heuristic guidance at a fraction of the computational expense while maintaining the optimality guarantees offered by F2F. We evaluate F2A across multiple domains and show that it reduces the number of pairwise evaluations by up to 11.2x compared to F2F, while achieving 4.8x fewer node expansions than F2E on average.
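The cost difference between the two heuristic classes can be sketched in a few lines of Python on a grid with a Manhattan pairwise heuristic. The abstract does not say how attractors are selected or maintained, so `pick_attractors` is a purely hypothetical placeholder, and this sketch makes no claim about the optimality guarantees discussed in the paper.

```python
def manhattan(s, t):
    """Pairwise heuristic h(s, s') on a 4-connected grid."""
    return abs(s[0] - t[0]) + abs(s[1] - t[1])

def h_front_to_front(s, opposite_frontier):
    """F2F: one pairwise evaluation per state on the opposite frontier."""
    return min(manhattan(s, t) for t in opposite_frontier)

def h_front_to_attractors(s, attractors):
    """F2A-style: the same minimum, taken only over a small attractor set."""
    return min(manhattan(s, a) for a in attractors)

def pick_attractors(opposite_frontier, k):
    """Hypothetical placeholder: keep roughly every (len/k)-th frontier state."""
    step = max(1, len(opposite_frontier) // k)
    return opposite_frontier[::step][:k]
```

With a frontier of 100 states and 5 attractors, each heuristic call needs 5 pairwise evaluations instead of 100. A minimum over a subset can never be smaller than the minimum over the full frontier, so how the attractor set is chosen matters for the quality of the estimate.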

2508.11874 2026-06-09 cs.GT cs.AI cs.DS cs.LO cs.PL 版本更新

Discovering Expert-Level Nash Equilibrium Algorithms with Large Language Models

利用大型语言模型发现专家级纳什均衡算法

Hanyu Li, Dongchen Li, Xiaotie Deng

发表机构 * CFCS, School of Computer Science, Peking University, Beijing, China(计算机科学系,北京大学,北京,中国) School of Computing and Data Science, The University of Hong Kong, Pokfulam, Hong Kong(计算与数据科学学院,香港大学,薄扶林,香港)

AI总结 提出LegoNE框架,将专家证明策略编码为符号语言,自动验证算法的最坏情况保证,结合推理型LLM重新发现并改进了多人博弈的近似纳什均衡算法。

Comments accepted by Nature Communications

详情
AI中文摘要

设计具有可证明最坏情况保证的近似纳什均衡(ANE)的多项式时间算法是算法博弈论中的一个基本开放问题。虽然大型语言模型(LLM)可以大规模生成候选算法,但验证最坏情况保证需要对所有博弈实例进行形式化分析——此前没有自动化系统能够完成这项任务。在这里,我们提出了LegoNE,一个将专家证明策略编码为符号语言的框架,该框架自动将任何候选算法编译成一个有限优化问题,以验证其最坏情况保证。将LegoNE与一个推理型LLM集成,我们重新发现了一个匹配双人博弈最佳多项式时间保证的算法,并发现了一个三人博弈算法,将最佳保证从$0.6+\delta$改进到$0.5+\delta$——这被证明超出了扩展技术(唯一已知的多玩家ANE设计范式)的能力范围。这些结果表明,将特定领域的证明策略编码为机器可处理的语言可以支持LLM驱动的算法发现,超越已知的人类设计范式。

英文摘要

Designing polynomial-time algorithms for approximate Nash equilibria (ANE) with provable worst-case guarantees is a fundamental open problem in algorithmic game theory. While large language models (LLMs) can generate candidate algorithms at scale, certifying worst-case guarantees requires formal analysis over all game instances -- a task for which no automated system previously existed. Here, we present LegoNE, a framework encoding expert proof strategies into a symbolic language that automatically compiles any candidate algorithm into a finite optimization problem certifying its worst-case guarantee. Integrating LegoNE with a reasoning LLM, we rediscovered an algorithm matching the best polynomial-time guarantee for two-player games, and discovered a three-player algorithm improving the best guarantee from $0.6+\delta$ to $0.5+\delta$ -- provably beyond the reach of the extension technique, the only previously known multi-player ANE design paradigm. These results show that encoding domain-specific proof strategies into a machine-tractable language can support LLM-driven discovery of algorithms outside known human design paradigms.
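For readers unfamiliar with the quantity being bounded, the following Python sketch computes the approximation gap of a mixed strategy profile in a two-player (bimatrix) game with payoffs in [0, 1], using the standard definition of an approximate Nash equilibrium. It only checks one profile on one game; it is unrelated to LegoNE's symbolic language and certifies nothing in the worst case. The function name is an assumption.

```python
def approximation_gap(R, C, x, y):
    """Largest gain either player can get by deviating from the profile (x, y).

    R and C are the row and column player's payoff matrices (lists of rows);
    x and y are mixed strategies. The profile is an eps-approximate Nash
    equilibrium exactly when the returned value is at most eps.
    """
    m, n = len(R), len(R[0])
    row_payoffs = [sum(R[i][j] * y[j] for j in range(n)) for i in range(m)]
    col_payoffs = [sum(C[i][j] * x[i] for i in range(m)) for j in range(n)]
    row_value = sum(x[i] * row_payoffs[i] for i in range(m))
    col_value = sum(y[j] * col_payoffs[j] for j in range(n))
    return max(max(row_payoffs) - row_value, max(col_payoffs) - col_value)
```

In matching pennies rescaled to [0, 1], the uniform profile has gap 0 (an exact equilibrium), while any pure profile has gap 1. A guarantee such as $0.5+\delta$ bounds this gap over all games for the profile an algorithm outputs.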

2601.01665 2026-06-09 cs.LG cs.AI 版本更新

Adversarial Instance Generation and Robust Training for Neural Combinatorial Optimization with Multiple Objectives

多目标神经组合优化的对抗实例生成与鲁棒训练

Wei Liu, Yaoxin Wu, Yingqian Zhang, Thomas Bäck, Yingjie Fan

发表机构 * LIACS, Leiden University, Leiden, The Netherlands(莱顿大学LIACS研究所,莱顿,荷兰) Eindhoven University of Technology, Eindhoven, The Netherlands(埃因霍温理工大学,埃因霍温,荷兰)

AI总结 提出面向多目标组合优化问题的偏好条件深度强化学习鲁棒性框架,通过偏好对抗攻击生成困难实例并量化影响,结合硬度感知偏好选择的对抗训练提升泛化性,在MOTSP、MOCVRP、MOKP上验证了攻击与防御的有效性。

详情
AI中文摘要

深度强化学习(DRL)在解决多目标组合优化问题(MOCOPs)方面显示出巨大潜力。然而,这些基于学习的求解器的鲁棒性尚未得到充分探索,尤其是在多样化和复杂的问题分布上。在本文中,我们提出了一个面向偏好条件DRL求解器用于MOCOPs的统一鲁棒性导向框架。在该框架内,我们开发了一种基于偏好的对抗攻击,以生成暴露求解器弱点的困难实例,并通过由此导致的帕累托前沿质量下降来量化攻击影响。我们进一步引入了一种防御策略,将硬度感知偏好选择集成到对抗训练中,以减少对受限偏好区域的过拟合并提高分布外性能。在多目标旅行商问题(MOTSP)、多目标容量车辆路径问题(MOCVRP)和多目标背包问题(MOKP)上的实验结果验证了我们的攻击方法能够成功地为不同求解器学习困难实例。此外,我们的防御方法显著增强了神经求解器的鲁棒性和泛化能力,在困难或分布外实例上提供了优越的性能。

英文摘要

Deep reinforcement learning (DRL) has shown great promise in addressing multi-objective combinatorial optimization problems (MOCOPs). Nevertheless, the robustness of these learning-based solvers remains insufficiently explored, especially across diverse and complex problem distributions. In this paper, we propose a unified robustness-oriented framework for preference-conditioned DRL solvers for MOCOPs. Within this framework, we develop a preference-based adversarial attack to generate hard instances that expose solver weaknesses, and quantify the attack impact by the resulting degradation in Pareto-front quality. We further introduce a defense strategy that integrates hardness-aware preference selection into adversarial training to reduce overfitting to restricted preference regions and improve out-of-distribution performance. The experimental results on the multi-objective traveling salesman problem (MOTSP), the multi-objective capacitated vehicle routing problem (MOCVRP), and the multi-objective knapsack problem (MOKP) verify that our attack method successfully learns hard instances for different solvers. Furthermore, our defense method significantly strengthens the robustness and generalizability of neural solvers, delivering superior performance on hard or out-of-distribution instances.

5. 机器学习与表示学习 157 篇

2606.07577 2026-06-09 cs.AI cs.CV cs.SD eess.AS 新提交

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

OmniMem: 面向流式音视频大语言模型的扰动感知记忆压缩

Guangzhi Sun, Yixuan Li, Yudong Yang, Chao Zhang

发表机构 * Tsinghua University(清华大学) ByteDance(字节跳动) Department of Engineering, University of Cambridge(剑桥大学工程系)

AI总结 提出OmniMem,一种针对音视频LLM的流式记忆压缩框架,通过模态感知分配和扰动感知选择压缩KV缓存,在保持长视频理解的同时减少内存,在多个基准上提升2-4%准确率。

Comments Code: https://github.com/bytedance/SALMONN/tree/omni_mem

详情
AI中文摘要

音视频大语言模型(LLMs)在长视频理解方面具有强大潜力,但其长视频推理从根本上受到视频令牌和键值(KV)缓存线性增长的制约。我们提出OmniMem,一种专为音视频LLMs设计的内存高效流式框架。与将所有令牌统一处理的现有压缩方法不同,OmniMem引入了一种模态感知的内存分配策略,分别管理视觉和音频上下文,解决了两种模态之间的严重令牌不平衡问题。OmniMem进一步通过扰动感知的内存选择保留信息丰富且非冗余的KV状态,实现紧凑内存而不牺牲长程理解。为了在现实部署约束下加强压缩,我们还探索了预算感知微调,鼓励模型将有用信息整合到保留内存中。在VideoMME Long、LVBench和LVOmniBench上使用video-SALMONN 2+和Qwen-2.5-Omni的实验表明,在相同内存预算下,OmniMem始终比强训练无关压缩基线提高2-4%的绝对准确率,微调后额外提高1-2%。

英文摘要

Audio-visual large language models (LLMs) hold strong promise for long-form video understanding, yet their long-video inference is fundamentally limited by the linear growth of video tokens and key-value (KV) caches. We present OmniMem, a memory-efficient streaming framework designed specifically for audio-visual LLMs. Unlike existing compression methods that treat all tokens uniformly, OmniMem introduces a modality-aware memory allocation strategy that separately manages visual and audio contexts, addressing the severe token imbalance between the two modalities. OmniMem further preserves informative and non-redundant KV states through perturbation-aware memory selection, enabling compact memory without sacrificing long-range understanding. To strengthen compression under realistic deployment constraints, we also explore budget-aware fine-tuning, which encourages the model to consolidate useful information into retained memory. Experiments on VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni show that OmniMem consistently improves over strong training-free compression baselines by 2-4% absolute accuracy under the same memory budgets, with an additional 1-2% gain after fine-tuning.

2606.07720 2026-06-09 cs.AI cs.CL cs.LG 新提交

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

为什么将残差流限制在层而不是令牌?用于连续潜在推理的持久记忆

Mujtaba Farhan, Maheep Chaudhary

发表机构 * University of Cambridge(剑桥大学)

AI总结 针对CoCoNuT在潜在空间推理中因中间隐藏状态被覆盖导致概念瓶颈的问题,提出AGCLR模型,通过门控概念流持久记忆机制,在GSM8K、HotpotQA和ProsQA上取得一致提升。

详情
AI中文摘要

大型语言模型(LLMs)在数学和多跳规划任务上展现了卓越的推理能力。CoCoNuT(连续思维链)范式通过使模型能够在潜在空间中进行推理,同时探索多个推理路径,而不是早期就承诺单一链条,从而扩展了这一能力。然而,我们识别出一个我们称之为\textbf{概念瓶颈}的限制。在每次推理过程中,中间隐藏状态被覆盖,导致模型随着推理深度增加而丢失早期步骤中计算出的关键事实。我们在经验上观察到了这一点。在HotpotQA上,原始CoCoNuT(10.4% EM)未能超过CoT基线(11.0% EM),并且在GSM8K上随着课程深度增加性能下降。为了解决这个问题,我们提出了\textbf{AGCLR}(自适应门控连续潜在推理),它通过一个\textit{门控概念流}增强了CoCoNuT。一个跨所有推理过程保持的持久残差记忆,由三个学习到的门控制:一个将中间事实提交到记忆的\textit{写入}门,一个检索相关先前状态的\textit{读取}门,以及一个修剪不相关上下文的\textit{遗忘}门。在使用GPT-2作为基础模型在GSM8K、HotpotQA和ProsQA上进行评估时,AGCLR在所有类型的数据集上实现了一致的改进。随着课程深度的增加,性能差距进一步扩大,直接解决了概念瓶颈。代码可在https://anonymous.4open.science/r/JJJJ/README.md获取。

英文摘要

Large language models (LLMs) have demonstrated remarkable reasoning abilities on mathematical and multi-hop planning tasks. The CoCoNuT (Chain of Continuous Thought) paradigm (Hao et al., 2024) extends this by enabling models to reason in latent space, exploring multiple reasoning paths simultaneously rather than committing to a single chain early on. However, we identify a limitation we term the concept bottleneck: at each reasoning pass, intermediate hidden states are overwritten, causing the model to lose critical facts computed in earlier steps as reasoning depth increases. We observe this empirically. On HotpotQA, vanilla CoCoNuT (10.4% EM) fails to improve over the CoT baseline (11.0% EM), and performance degrades with curriculum depth on GSM8K. To address this, we propose AGCLR (Adaptive Gated Continuous Latent Reasoning), which augments CoCoNuT with a Gated Concept Stream, a persistent residual memory maintained across all reasoning passes and controlled by three learned gates: a write gate that commits intermediate facts to memory, a read gate that retrieves relevant prior states, and a forget gate that prunes irrelevant context. Evaluated on GSM8K, HotpotQA, and ProsQA using GPT-2 as our base model, AGCLR achieves consistent improvements across all types of datasets, and the performance gap widens as curriculum depth increases, directly addressing the concept bottleneck. Code is available at https://anonymous.4open.science/r/JJJJ/README.md
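The abstract names write, read, and forget gates but gives no update equations, so the following Python sketch is only a generic LSTM-style gated memory update under stated assumptions: gates are element-wise sigmoids, the forget gate follows the LSTM convention (1 keeps, 0 erases), and the gate logits are passed in directly instead of being computed by learned layers. It should not be read as AGCLR's actual mechanism.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_memory_step(memory, hidden, write_logits, read_logits, forget_logits):
    """One reasoning pass: update the persistent memory, then read from it.

    All arguments are equal-length lists of floats. In a real model the logits
    would come from learned projections of the hidden state (assumed here).
    """
    write = [sigmoid(z) for z in write_logits]
    read = [sigmoid(z) for z in read_logits]
    keep = [sigmoid(z) for z in forget_logits]  # LSTM convention: 1 keeps, 0 erases
    # Memory persists across passes: old content is decayed, new facts are added.
    new_memory = [k * m + w * h for k, m, w, h in zip(keep, memory, write, hidden)]
    # The hidden state gets a residual contribution read from memory.
    output = [h + r * m for h, r, m in zip(hidden, read, new_memory)]
    return new_memory, output
```

Calling the function once per reasoning pass and feeding `new_memory` back in keeps earlier content available to later passes, instead of being overwritten along with the hidden state.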

2606.07801 2026-06-09 cs.AI 新提交

Improving Multimodal Reasoning via Worst Dimension Optimization

通过最差维度优化改进多模态推理

Haocheng Lv, Huaping Zhang, Qiuchi Li, Lei Li, Chunxiao Gao

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 提出最差维度优化方法,通过识别并优先优化推理路径中最差的约束维度,提升多模态推理的整体有效性。

详情
AI中文摘要

多模态推理需要一条在广泛约束(从视觉基础到逻辑一致性)下保持完整性的路径。然而,当前的过程奖励模型关注于启发式定义的奖励,这些奖励平等地权衡这些因素,可能导致主导因素掩盖个别维度的失败,从而无法保证推理过程的一般有效性。

英文摘要

Multimodal reasoning requires a path that retains integrity over a wide range of constraints, from visual grounding to logical consistency. However, current Process Reward Models focus on heuristically defined rewards that weigh these factors equally, so dominant factors may conceal failures in individual dimensions, and the validity of the reasoning process as a whole is not guaranteed.

2606.07812 2026-06-09 cs.AI cs.CL 新提交

Scaling Participation in Modular AI Systems

模块化AI系统中的参与扩展

Shangbin Feng, Yike Wang, Weijia Shi, Luke Zettlemoyer, Yejin Choi, Yulia Tsvetkov

发表机构 * University of Washington(华盛顿大学) Stanford University(斯坦福大学)

AI总结 提出参与扩展范式,通过多方贡献小模型构建模块化AI系统,在15项任务上比单体大语言模型提升高达15.4%,并展现涌现能力。

详情
AI中文摘要

人类是由多面才能和需求组成的马赛克,任何真正智能的AI必须反映这种丰富性。然而,所有人使用的LLM却由少数人构建——一个集中化的单体AI模型市场,其结构上不适合捕捉人类知识、推理和价值观的多样性。本文介绍参与扩展,一种新范式,其中模块化AI系统通过不同利益相关者的贡献自下而上构建。参与者贡献基于自身兴趣和优先级训练的小模型;这些模型随后在模块化框架中作为组合式AI系统协作。参与式AI系统在15项任务(如推理和事实性)上比单体LLM高出最多15.4%,超越了比所有贡献组件总和更大的模型。进一步实验表明,参与式AI系统受益于贡献者多样性,显著改善每个贡献者的原始优先级,并展现出涌现能力,使其能解决超过15%的所有单个模型失败的问题。参与扩展为从单体现状向开放、自下而上、协作的AI未来过渡提供了技术基础。

英文摘要

Humanity is a mosaic of multifaceted talents and needs, and any truly intelligent AI must reflect that richness. Yet the LLMs used by all are built by the few -- a centralized market of monolithic AI models structurally ill-suited to capture the diversity of human knowledge, reasoning, and values. Here we introduce scaling participation, a new paradigm in which modular AI systems are built from the bottom up through the contributions of diverse stakeholders. Participants contribute small models trained on their own interests and priorities; these models then collaborate in modular frameworks as compositional AI systems. Participatory AI systems outperform monolithic LLMs by up to 15.4% across 15 tasks, such as reasoning and factuality, surpassing models larger than all contributed components combined. Further experiments show that participatory AI systems benefit from contributor diversity, substantially improve on each contributor's original priorities, and exhibit emergent capabilities that allow them to solve over 15% of problems where all individual models fail. Scaling participation provides a technical foundation for transitioning from the monolithic status quo toward an open, bottom-up, and collaborative AI future.

2606.07819 2026-06-09 cs.AI cs.LG 新提交

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

联合结构剪枝与混合精度量化的大语言模型压缩

Hoang-Loc La, Truong-Thanh Le, Amir Taherkordi, Phuong Hoai Ha

发表机构 * UiT The Arctic University of Norway(挪威北极大学) University of Oslo, Norway(挪威奥斯陆大学)

AI总结 提出端到端框架,通过全局误差最小化的混合精度量化策略和联合优化结构剪枝与量化策略,在超低比特下显著降低困惑度。

详情
AI中文摘要

近年来,大型语言模型(LLM)部署的效率已成为实际应用中的关键问题。虽然训练后量化(PTQ)和结构剪枝是减少内存占用和推理延迟的成熟技术,但大多数现有的PTQ方法在逐层基础上优化量化误差,忽略了误差如何在网络中累积和传播,通常导致次优解。传统的流程也倾向于孤立或顺序地应用剪枝和量化,进一步加剧了次优性。我们引入了一种新颖的端到端框架,以两种关键方式解决这些限制。首先,我们提出了一种新颖的混合精度PTQ策略,该策略直接最小化整个模型上的全局误差传播,而不是隔离逐层误差。在此基础上,我们开发了一种新颖的联合优化方法,该方法在统一的搜索空间中同时学习结构剪枝决策和混合精度量化策略。大量实验表明,在超低精度(1-3比特)下,与最先进的(SoTA)权重激活量化基线相比,我们的量化方法将WikiText困惑度降低了高达21%。与领先的仅权重量化方法相比,它在WikiText和C4上分别实现了高达59%和85%的困惑度降低。与最先进的联合剪枝和量化技术相比,我们提出的方法在超低比特下提供了优越的困惑度和推理性能。

英文摘要

Recently, the efficiency of Large Language Model (LLM) deployment has become a critical concern in practical applications. While post-training quantization (PTQ) and structural pruning are established techniques for reducing memory footprint and inference latency, most existing PTQ approaches optimize quantization errors on a per-layer basis, overlooking how errors accumulate and propagate through the network, often resulting in suboptimal solutions. Traditional pipelines also tend to apply pruning and quantization in isolation or sequentially, further compounding sub-optimality. We introduce a novel end-to-end framework that addresses these limitations in two key ways. First, we propose a novel mixed-precision PTQ strategy that directly minimizes global error propagation across the entire model, rather than isolating layer-wise errors. Building on this, we develop a novel joint optimization approach that simultaneously learns structural pruning decisions and mixed-precision quantization policies within a unified search space. Extensive experiments show that, at ultra-low precisions (1-3 bits), our quantization method reduces WikiText perplexity by up to 21% compared to state-of-the-art (SoTA) weight-activation quantization baselines. Against leading weight-only quantization methods, it achieves up to 59% and 85% lower perplexity on WikiText and C4, respectively. Compared to the SoTA joint pruning-and-quantization techniques, our proposed method delivers superior perplexity and reasoning performance at ultra-low bits.

2606.07915 2026-06-09 cs.AI 新提交

EditSR: Enhancing Neural Symbolic Regression via Edit-based Rectification

EditSR: 通过基于编辑的修正增强神经符号回归

Da Li, Xinxin Li, Xingyu Cui, Jin Xu, Juan Zhang, Junping Yin

发表机构 * Northeast Normal University(东北师范大学) East China Normal University(华东师范大学) Shenzhen Institute of Advanced Technology(深圳先进技术研究院) Graduate School of China Academy of Engineering Physics(中国工程物理研究院研究生院) Beihang University(北京航空航天大学) Institute of Applied Physics and Computational Mathematics(北京应用物理与计算数学研究所)

AI总结 提出EditSR双层框架,第一层神经符号回归模型生成表达式,第二层基于编辑的修正器通过预训练的状态转移链逐步修正错误,避免全局搜索重启,有效减少误差累积,提升复杂表达式生成的结构正确性。

详情
AI中文摘要

神经符号回归模型通过将结构搜索转移到预训练来提高推理效率,但其一次性自回归解码容易产生误差累积,可能导致生成结构不正确的表达式,尤其是在复杂表达式生成场景中。现有的修正策略可以缓解这一问题,但它们通常依赖于重新启动全局搜索,从而削弱了神经模型的效率优势,并且仍然容易受到误差累积的影响。在本文中,我们提出了EditSR,一个双层框架,第一层结合神经符号回归模型,第二层结合基于编辑的修正器,以实现高效预测和事后修正。我们不重新启动全局搜索,而是通过预训练修正器来保持修正效率。具体来说,我们将修正过程形式化为从错误表达式开始的逐步状态转移链,并开发了一种状态转移算法来构建用于训练修正器的监督修正链。为了确保修正过程中的语法有效性,每个编辑操作都被限制在语法有效的空间内,使得每个编辑后的表达式仍然可解析。此外,由于每个编辑决策基于当前状态而非历史,修正器允许后续编辑修正早期步骤中的错误,从而降低误差累积的风险。大量实验和消融研究表明,EditSR以有限的额外成本显著提高了符号结构恢复能力,在复杂表达式上收益更明显,因为一次性自回归解码更容易受到误差累积的影响。

英文摘要

Neural symbolic regression models improve inference efficiency by shifting structural search to pretraining, but their one-pass autoregressive decoding is prone to error accumulation, which may lead to generating structurally incorrect expressions, especially in complex expression generation scenarios. Existing rectification strategies can alleviate this issue, but they often depend on restarting global search, thereby weakening the efficiency advantage of neural models, and remain susceptible to error accumulation. In this paper, we propose EditSR, a two-layer framework that combines a neural symbolic regression model in the first layer with an edit-based Rectifier in the second layer to achieve efficient prediction and post-hoc rectification. Instead of restarting the global search, we maintain rectification efficiency by pretraining the Rectifier. Specifically, we formulate the rectification process as a step-by-step state-transition chain starting from an incorrect expression, and develop a state-transition algorithm to construct supervised rectification chains for training the Rectifier. To ensure syntactic validity throughout rectification, each edit action is restricted to a syntactically valid space so that every edited expression remains parseable. In addition, because each edit decision is conditioned on the current state rather than the history, the Rectifier allows errors made in earlier steps to be rectified by subsequent edits, thereby reducing the risk of error accumulation. Extensive experiments and ablation studies show that EditSR substantially improves symbolic structure recovery with limited extra cost, with more pronounced gains on complex expressions, where one-pass autoregressive decoding is more susceptible to error accumulation.

2606.08046 2026-06-09 cs.AI cs.CV cs.LG 新提交

OSMGraphCLIP: Learning Global Location Representations from OpenStreetMap Graphs

OSMGraphCLIP:从OpenStreetMap图学习全局位置表示

Dimitrios Michail, Eleni Saka, Ioannis Giannopoulos, Ioannis Papoutsis

发表机构 * Harokopio University of Athens(雅典哈罗科皮奥大学) National Technical University of Athens(雅典国家技术大学) Vienna University of Technology(维也纳技术大学) National Observatory of Athens(雅典国家天文台)

AI总结 提出OSMGraphCLIP模型,利用OpenStreetMap异构图结构学习全局位置嵌入,通过多尺度图编码器和对比学习对齐,在气候、生态、社会经济等下游任务中达到或超越卫星基线方法。

详情
AI中文摘要

我们提出了OSMGraphCLIP,一种CLIP风格的地理空间表示模型,从免费可用的OpenStreetMap(OSM)数据中学习全局位置嵌入。OSMGraphCLIP将地理环境表示为带类型的OSM特征的异构图,保留了道路、建筑物、土地利用区域和兴趣点之间的拓扑和语义关系。多尺度图编码器捕获细粒度的局部结构和更广泛的景观组成,并通过对比对齐目标监督球谐位置编码器。我们在涵盖气候、生态、社会经济指标、公共卫生、土地覆盖、生物多样性和野火预测等一系列下游地理空间回归和分类任务中评估了OSMGraphCLIP,并表明仅结构化OSM数据就支持跨领域的强全局位置表示。OSMGraphCLIP在大多数基准测试中达到或超过了基于卫星的基线,在社会经济和公共卫生任务中优势最为明显,因为OSM对建成环境的显式语义注释编码了卫星像素只能间接捕获的人类活动模式。在生态和环境任务中,尽管未使用地球观测数据,该模型仍与基于图像的方法保持紧密竞争。定性分析证实,学习到的嵌入连贯地组织了地理空间,仅从地图拓扑中恢复了生物群落边界、城市梯度和热带-温带区别。

英文摘要

We present OSMGraphCLIP, a CLIP-style geospatial representation model that learns global location embeddings from freely available OpenStreetMap (OSM) data. OSMGraphCLIP represents geographic environments as heterogeneous graphs of typed OSM features, preserving the topological and semantic relationships among roads, buildings, land-use regions, and points of interest. A multi-scale graph encoder captures both fine-grained local structure and broader landscape composition, and supervises a spherical-harmonics location encoder through a contrastive alignment objective. We evaluate OSMGraphCLIP across a diverse suite of downstream geospatial regression and classification tasks spanning climate, ecology, socioeconomic indicators, public health, land cover, biodiversity, and wildfire forecasting, and show that structured OSM data alone supports strong global location representations across domains. OSMGraphCLIP matches or exceeds satellite-based baselines on the majority of benchmarks, with the most pronounced advantage on socioeconomic and public-health tasks, where OSM's explicit semantic annotation of the built environment encodes patterns of human activity that satellite pixels can only capture indirectly. On ecological and environmental tasks, the model remains closely competitive with imagery-based methods despite using no Earth observation data. Qualitative analysis confirms that the learned embeddings organize geographic space coherently, recovering biome boundaries, urban gradients, and tropical--temperate distinctions from map topology alone.

2606.08129 2026-06-09 cs.AI 新提交

Cross-LLM Consistency in Inference: Evidence from Shared Interactions

推理中的跨LLM一致性:来自共享交互的证据

Siyu Lou, Yao Yan, Yuntian Chen, Quanshi Zhang

发表机构 * School of Computer Science Shanghai Jiao Tong University(上海交通大学计算机科学学院) Ningbo Key Laboratory of Advanced Manufacturing Simulation Eastern Institute of Technology, Ningbo(宁波市先进制造仿真重点实验室,宁波东方理工大学) College of Computer and Information Science Chongqing Normal University(重庆师范大学计算机与信息科学学院) SymtrustAI.com Eastern Institute of Technology, Ningbo(宁波东方理工大学)

AI总结 研究发现,不同大型语言模型在相同提示下预测相同目标词时,常共享交互模式,且高级模型一致性更强,共享交互通常阶数更低、正负抵消更弱。

Comments 20 pages, 8 figures

详情
AI中文摘要

大型语言模型(LLM)在架构、训练数据和优化过程上各不相同,但它们仍可能发展出相似的内部推理模式。在本文中,我们使用基于交互的解释来检验这一假设。我们发现,当从相同提示预测相同目标词时,LLM 经常共享交互模式。这种一致性在高级 LLM 中更为明显。共享交互通常比非共享交互阶数更低,且正负抵消更弱。这些结果表明,高级 LLM 可能被隐式优化为共同的推理模式,尽管产生这种跨模型一致性的机制仍有待探索。

英文摘要

Large language models (LLMs) differ in architecture, training data, and optimization procedures, yet they may still develop similar internal inference patterns. In this paper, we examine this hypothesis using interaction-based explanations. We find that LLMs often share interaction patterns when predicting the same target token from the same prompt. This consistency is more pronounced among advanced LLMs. Shared interactions also tend to be lower-order and show weaker positive-negative cancellation than non-shared interactions. These results suggest that advanced LLMs may be implicitly optimized toward common inference patterns, even though the mechanisms that give rise to such cross-model consistency remain open.

2606.08292 2026-06-09 cs.AI 新提交

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

消融可逆头不传递:对Transformer中机制角色声称的压力测试

Philip Quirke

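Affiliation: Martian

AI summary: Shows that passing three tests (necessity, linear encoding, and recovery on restoration after ablation) is still insufficient to establish an attention head's role. Introduces the KID lens and activation transduction under matched controls to expose the gaps in such role claims.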

Comments 9 pages, 1 figure

详情
AI中文摘要

在机制可解释性中,注意力头通常被提升为角色声称(例如,“这个头表示加法”),当它们对某个行为是必要的、线性编码该行为,并且在消融后恢复该行为时。我们证明这种证据是不充分的:在三个7-8B指令微调模型和五个计算家族中,通过所有三个检查的头在匹配控制下将其激活修补到不同提示时,通常无法传递计算。我们引入KID(知道/意图/做),一个注意力头的角色分配视角,并将其与一个三阶段流程配对:能力选择性筛选(CSS)、奇异值分解(SVD)和匹配控制下的激活转导。我们的结果记录了一个初步的角色分类(包括提示轨迹稳定器、答案侧logit偏置头和软计算模式载体),并表明相同答案控制(一个共享答案字符串但不共享请求计算的转导目标)是一种未被充分利用的检查,它暴露了伪装成语义特异性的广泛状态转移。

英文摘要

In mechanistic interpretability, attention heads are commonly elevated to role claims (e.g., "this head represents addition") when they are necessary for a behavior, encode it linearly, and recover that behavior when restored after ablation. We show this evidence is insufficient: across three 7-8B instruction-tuned models and five computation families, heads passing all three checks routinely fail to transfer the computation when their activations are patched into a different prompt under matched controls. We introduce KID (Knowing / Intent / Doing), a role-assignment lens for attention heads, and pair it with a three-stage pipeline: capability-selective screening (CSS), singular value decomposition (SVD), and activation transduction under matched controls. Our results document a preliminary role taxonomy (including prompt-trajectory stabilizers, answer-side logit-bias heads, and soft computation-pattern carriers) and show that the same-answer control (a transduction target sharing the answer string but not the requested computation) is an underused check that exposes broad state transfer masquerading as semantic specificity.

2606.08312 2026-06-09 cs.AI cs.FL 新提交

Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies

自回归强化学习策略中LTLf约束的神经符号注入

Ashkan Ansarifard, Matteo Mancanelli, Elena Umili, Fabio Patrizi

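Affiliation: Sapienza University of Rome

AI summary: Proposes a neurosymbolic framework that compiles LTLf constraints into DFAs and injects them into transformer policies through a differentiable loss, improving constraint satisfaction on navigation tasks while keeping return competitive.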

Comments Accepted at the Joint Workshop on Statistics and Knowledge Integration for Logic, Learning, Ethical Decisions, and LLMs (SKILLED-LLMs 2026), co-located with KR 2026 and FLoC 2026, Lisbon, Portugal

详情
AI中文摘要

在这项工作中,我们研究了在有限迹线性时序逻辑(LTLf)表达的时延任务约束下的离线强化学习(RL)。最近,基于Transformer的方法如Trajectory Transformers和Decision Transformers已被采用,将RL视为序列建模问题。然而,这些方法纯粹优化奖励,不考虑高层时序需求。在此,我们引入一个神经符号框架,将LTLf背景知识注入到这类基于Transformer的RL策略中。我们的方法将LTLf公式编译为确定性有限自动机(DFA),并通过可微表示和基于逻辑的损失函数将其整合到学习过程中。特别地,我们从DFA进展中推导出可微的满足信号,并将其作为训练过程中的正则化项。最终的方法在不同模型间是架构无关的。我们在具有覆盖安全性和可达性时序属性组合的规范套件的导航环境中评估所提出的框架。实验结果表明,融入背景知识不仅提高了约束满足,而且与普通基线相比保持了有竞争力的回报。

英文摘要

In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTLf). Recently, transformer-based approaches such as Trajectory Transformers and Decision Transformers have been adopted to address RL as a sequence modeling problem. However, these methods optimize purely for reward and do not account for high-level temporal requirements. Here, we introduce a neurosymbolic framework that injects LTLf background knowledge into such transformer-based RL policies. Our approach compiles LTLf formulas into deterministic finite automata (DFAs) and integrates them into the learning process through a differentiable representation and a logic-based loss function. In particular, we derive differentiable satisfaction signals from DFA progression and use them as a regularization term during training. The resulting method is architecture-agnostic across different models. We evaluate the proposed framework on navigation environments with specification suites covering combinations of safety and reachability temporal properties. Experimental results show that incorporating background knowledge not only improves constraint satisfaction, but also maintains competitive return compared to vanilla baselines.
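To make the DFA-based regularizer described above more concrete, here is a minimal pure-Python sketch. It assumes a two-symbol alphabet and a hand-written DFA for the safety property "never observe `unsafe`", and it treats the per-step symbol probabilities as independent. The names `soft_progress` and `satisfaction_probability` are hypothetical, and the paper's actual differentiable representation and loss may well differ. The sketch only illustrates that a distribution over DFA states can be pushed through the transition function using per-step symbol probabilities, so the mass on accepting states becomes a smooth function of those probabilities.

```python
# Illustrative only: soft progression of a DFA under per-step symbol probabilities.
# DFA for "never see 'unsafe'": state 0 = ok (accepting), state 1 = violated (sink).
TRANSITIONS = {
    (0, "safe"): 0,
    (0, "unsafe"): 1,
    (1, "safe"): 1,
    (1, "unsafe"): 1,
}
ACCEPTING = {0}
NUM_STATES = 2


def soft_progress(state_dist, symbol_probs):
    """Push a distribution over DFA states one step forward."""
    next_dist = [0.0] * NUM_STATES
    for state, mass in enumerate(state_dist):
        for symbol, prob in symbol_probs.items():
            next_dist[TRANSITIONS[(state, symbol)]] += mass * prob
    return next_dist


def satisfaction_probability(trace_probs):
    """Probability mass on accepting states after consuming the whole soft trace."""
    dist = [1.0] + [0.0] * (NUM_STATES - 1)  # all mass starts in state 0
    for symbol_probs in trace_probs:
        dist = soft_progress(dist, symbol_probs)
    return sum(dist[s] for s in ACCEPTING)


# Hypothetical two-step trace: the policy is mostly, but not fully, safe.
trace = [
    {"safe": 0.9, "unsafe": 0.1},
    {"safe": 0.8, "unsafe": 0.2},
]
p_sat = satisfaction_probability(trace)
logic_loss = 1.0 - p_sat  # assumed penalty form: shrinks as the trace becomes safer
```

In an actual training loop the same computation would be written with tensors so that gradients flow from `logic_loss` back to the policy's output probabilities; the `1 - p_sat` penalty is one plausible choice, not necessarily the authors'.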

2606.08432 2026-06-09 cs.AI 新提交

Trajectory-Refined Distillation

轨迹精炼蒸馏

Li Jiang, Haoran Xu, Yichuan Ding, Amy Zhang

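Affiliations: McGill University; Mila Quebec AI Institute; UT Austin

AI summary: Proposes Trajectory-Refined Distillation (TRD), which uses teacher guidance to correct faulty prefixes in student rollouts, addressing prefix failure in on-policy distillation and improving single-attempt accuracy and reasoning coverage for LLMs.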

Comments under review

详情
AI中文摘要

在线策略蒸馏(OPD)已成为大型语言模型(LLM)的重要后训练工具,它沿着学生自身的生成轨迹提供密集的逐词教师监督。在这项工作中,我们识别出OPD中一个常见的结构性问题,称为前缀失败。在前缀失败下,密集的逐词监督会导致双峰教师混合和碎片化梯度,而词级损失截断或重加权无法解决这一问题。这一观察促使我们超越词级损失干预,转向轨迹级输出修正。因此,我们提出轨迹精炼蒸馏(TRD),一种轨迹级修正方法,在教师指导下,于在线策略支持范围内修正学生的生成轨迹。通过在蒸馏前修正有问题的前缀,TRD从根源上缓解了前缀失败。此外,即使原始轨迹已经正确,TRD也能通过教师指导让学生接触到替代的有效推导,从而改善探索。TRD还可应用于在线策略自蒸馏(OPSD),这是一种使用基于特权信息的学生模型作为教师的参数共享变体。在多个尺度的广泛基准和基础模型上,TRD始终优于先前基线,提高了单次尝试准确率并扩展了推理覆盖范围。代码可在 https://github.com/louieworth/trd 获取。

英文摘要

On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providing dense per-token teacher supervision along the student's own rollouts. In this work, we identify a common structural problem in OPD, which we call prefix failure. Under prefix failure, dense per-token supervision induces a bimodal teacher mixture and fragmented gradients that token-level loss truncation or reweighting fail to address. This observation motivates us to move beyond token-level loss interventions toward trajectory-level output corrections. We thus propose Trajectory-Refined Distillation (TRD), a trajectory-level correction method that revises the student's rollout under teacher guidance while staying within on-policy support. By correcting problematic prefixes before distillation, TRD mitigates prefix failure at its source. Moreover, TRD improves exploration by exposing the student to alternative valid derivations under teacher guidance, even when the original rollouts are already correct. TRD can also be applied to on-policy self-distillation (OPSD), a parameter-sharing variant that uses the student model conditioned on privileged information as the teacher. Across a wide range of benchmarks and base models at multiple scales, TRD consistently outperforms prior baselines, improving single-attempt accuracy and broadening reasoning coverage. Code is available at https://github.com/louieworth/trd

2606.08491 2026-06-09 cs.AI 新提交

What Makes a Desired Graph for Relational Deep Learning?

什么构成了关系深度学习的理想图?

Yao Cheng, Siqiang Luo

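AI summary: Finds that graphs derived directly from database schemas suffer from information overload and semantic fragmentation, shows that balancing filtering and injection improves performance, and develops an automatic optimizer.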

Comments This article has been accepted by ICML 2026

详情
AI中文摘要

关系深度学习(RDL)将关系数据库(RDB)转换为异构图,但直接从数据库模式导出的图通常不适合图神经网络(GNN)进行关系推理的方式。我们研究了什么使关系图适合深度学习,并表明模式派生图存在两个系统性失败:信息过载和语义碎片化。我们的实证分析表明,理想的图不是原始模式,而是受控结构适应的结果。性能取决于平衡两种操作:通过过滤减轻信息过载,以及通过注入修复语义碎片。具体而言,过滤作为具有非单调效应的偏差-方差旋钮,而注入仅在明确恢复原始模式中缺失的关系依赖时才能提高性能。基于这些发现,我们开发了一个端到端结构优化器,应用这两种操作自动适应关系图。在涵盖分类、回归和推荐的26个任务中,优化后的图在通常降低推理成本的同时持续提高了准确性。

英文摘要

Relational deep learning (RDL) converts relational databases (RDBs) into heterogeneous graphs, but graphs derived directly from database schemas are often not well suited for how graph neural networks (GNNs) perform relational reasoning. We study what makes a relational graph suitable for deep learning and show that schema-derived graphs suffer from two systematic failures: information overload and semantic fragmentation. Our empirical analysis reveals that the desired graph is not the raw schema, but a result of controlled structural adaptation. Performance depends on balancing two operations: mitigating information overload via filtering, and repairing semantic fragmentation via injection. Specifically, filtering serves as a bias-variance knob with non-monotonic effects, while injection improves performance only when it explicitly restores the relational dependencies missing from the original schema. Based on these findings, we develop an end-to-end structural optimizer that applies both operations to adapt relational graphs automatically. Across 26 tasks spanning classification, regression, and recommendation, the optimized graphs consistently improve accuracy while often reducing inference cost.

2606.08497 2026-06-09 cs.AI cs.CL 新提交

Explaining Black-Box Language Models: Learning to Optimize Linguistically-Structured Word Subsets

解释黑盒语言模型:学习优化语言结构化的单词子集

Minyoung Hwang, Seokhyun Lee, Changhee Lee

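Affiliation: Korea University

AI summary: Targeting three key desiderata for explaining black-box language models (inference efficiency, black-box compatibility, and linguistically structured interpretability), proposes selecting informative word subsets with reinforcement learning, giving efficient, gradient-free, and linguistically coherent explanations.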

Comments KDD 2026 Research Track

详情
AI中文摘要

随着深度语言模型(DLMs)在医疗保健等高风险领域中的部署日益增多,理解其决策依据对于确保信任、安全和问责变得至关重要。然而,当这些DLMs作为黑盒系统(例如通过API)运行时,访问内部模型状态(如参数、梯度)受到限制,实现这一关键的可解释性水平尤其具有挑战性。尽管付出了诸多努力,现有的解释方法往往无法同时满足三个关键需求:(i)推理时效率,(ii)黑盒兼容性且不引发分布外行为,以及(iii)基于输入语言结构的可理解解释。为了解决这些挑战,我们提出了一种方法,通过选择一小部分信息丰富的输入单词来解释DLM的预测。我们将其表述为一个摊销优化问题,从而无需针对特定输入进行搜索即可实现高效的一次性推理。我们的选择策略通过REINFORCE风格策略梯度进行训练,允许在完全无梯度的设置中进行离散单词选择。为了增强可解释性并与人类语言直觉对齐,我们将图结构知识整合到这一选择过程中,促进语言连贯的子集,从而产生对最终用户既高度信息丰富又具有认知意义的解释。我们在多种DLM架构和多个真实世界数据集上评估了我们的方法。它一致地识别出具有增强判别能力和与语言显著线索更强对齐的单词子集,优于传统的黑盒兼容方法和基于梯度的方法(后者被赋予黑盒模型梯度的oracle访问权限,以构成更具挑战性的基准)。我们的代码可在以下地址获取:here。

英文摘要

As deep language models (DLMs) are increasingly deployed in high-stakes domains such as healthcare, understanding their decision rationale becomes paramount for ensuring trust, safety, and accountability. However, achieving this vital level of interpretability is particularly challenging when these DLMs operate as black-box systems (e.g., via APIs), where access to internal model states (e.g., parameters, gradients) is restricted. Despite numerous efforts, existing explanation methods often fail to concurrently satisfy three key desiderata: (i) inference-time efficiency, (ii) black-box compatibility without inducing out-of-distribution behavior, and (iii) comprehensible explanations grounded in the input's linguistic structure. To address these challenges, we propose a method that explains predictions of DLMs by selecting a small, informative subset of input words. We formulate this as an amortized optimization problem, enabling efficient one-shot inference without the need for input-specific search. Our selection policy is trained via REINFORCE-style policy gradients, allowing discrete word selection in a fully gradient-free setting. To enhance interpretability and align with human linguistic intuition, we integrate graph-structured knowledge into this selection process, fostering linguistically coherent subsets that result in explanations both highly informative and cognitively meaningful to end-users. We evaluate our method on diverse DLM architectures and multiple real-world datasets. It consistently identifies word subsets with enhanced discriminative power and stronger alignment with linguistically salient cues, outperforming both conventional black-box compatible methods and gradient-based approaches that are given oracle access to the black-box model's gradients for a more challenging benchmark. Our code is available here.

2606.08543 2026-06-09 cs.AI 新提交

PAEC: Position-Aware Entropy Calibration for LLM Reasoning in RLVR

PAEC:面向RLVR中LLM推理的位置感知熵校准

Shumeng Yang, Yisu Liu, Jiayi Zheng, Zhaohui Yang, Linjing Li

Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Institute of Computing Technology, Chinese Academy of Sciences; School of Computer Science and Technology, University of Chinese Academy of Sciences

AI summary: Proposes Position-Aware Entropy Calibration (PAEC), which builds a soft mask from local top-p entropy and top-two candidate competition and applies an anchor-based lower-bound penalty to prevent entropy collapse at decision-relevant positions, improving mathematical reasoning performance.

Comments 22 pages, 7 figures

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)改进了大语言模型的推理能力,但常常导致策略熵快速崩溃,即策略过早地集中在狭窄的高概率推理路径上。虽然全局熵正则化可以鼓励探索,但均匀增加所有标记位置的熵对于长推理轨迹而言效率低下,因为许多标记与决策无关。我们提出位置感知熵校准(PAEC),一种标记级熵管理框架,它从局部top-p熵和top-2候选竞争中构建软掩码,并应用基于锚点的下界惩罚来防止选定位置的熵崩溃。在五个数学推理基准上的实验表明,PAEC在强RLVR基线上提高了宏观平均多数投票性能,在AIME风格任务上取得了明显收益。我们的结果表明,推理RL中的熵管理应被表述为对决策敏感位置的选择性探索分配,而非均匀的随机性注入。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) improves large language model reasoning but often suffers from rapid policy-entropy collapse, where the policy prematurely concentrates on narrow high-probability reasoning paths. While global entropy regularization can encourage exploration, uniformly increasing entropy across all token positions is inefficient for long reasoning trajectories, where many tokens are not decision-relevant. We propose Position-Aware Entropy Calibration (PAEC), a token-level entropy-management framework that constructs a soft mask from local top-p entropy and top-two candidate competition, and applies an anchor-based lower-bound penalty to prevent selected-position entropy collapse. Experiments on five mathematical reasoning benchmarks show that PAEC improves macro-average majority-vote performance over strong RLVR baselines, with clear gains on AIME-style tasks. Our results suggest that entropy management in reasoning RL should be formulated as selective exploration allocation over decision-sensitive positions rather than uniform randomness injection.

2606.08601 2026-06-09 cs.AI 新提交

InA-Probe: Instruction-Aware Active Probing for Time Series Forecasting with LLMs

InA-Probe:面向LLM时间序列预测的指令感知主动探测

Peiliang Gong, Emadeldeen Eldele, Chenyu Liu, Ziyu Jia, Yi Ding, Xinliang Zhou, Lianchao Gu, Qi Zhu, Yang Liu, Daoqiang Zhang, Xiaoli Li

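Affiliations: Nanyang Technological University; Khalifa University; Nanjing University of Aeronautics and Astronautics; Singapore University of Technology and Design

AI summary: Proposes Instruction-aware Active Probing (InA-Probe), which combines multi-level instruction injection, adaptive query generation, and a dual-stage attention mechanism. It outperforms existing methods on seven benchmarks and reduces cross-domain forecasting error by up to 37%.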

详情
AI中文摘要

大型语言模型(LLMs)近期在时间序列预测中展现出令人瞩目的潜力。然而,现有方法主要依赖被动模态对齐或静态任务重编程,往往难以捕捉细粒度的非平稳时间模式或适应细微的任务意图。本文提出指令感知主动探测(InA-Probe),将范式从被动对齐转向主动的指令驱动探测机制。具体而言,我们设计了一种多级指令注入机制,为模型注入全局任务目标和细粒度的补丁级语义先验。在此基础上,自适应查询生成模块生成样本特定的探测,这些探测由时间上下文动态调制。随后,这些探测通过双阶段注意力过程进行精炼:首先通过指令感知自注意力内化任务特定意图,然后通过时间交叉注意力审查询问投影的时间表示以提取显著模式。在七个真实世界基准上的全面实验表明,InA-Probe在统一泛化和零样本迁移中均持续优于最先进的深度学习和基于LLM的基线,在具有挑战性的跨域场景中预测误差降低高达37%。消融研究进一步证实,自适应查询与细粒度指令之间的协同作用是解锁LLM推理能力以处理复杂时间序列的关键。

英文摘要

Large Language Models (LLMs) have recently demonstrated impressive potential for time series forecasting. However, existing methods predominantly rely on passive modality alignment or static task reprogramming, which often fail to capture fine-grained, non-stationary temporal patterns or to adapt to nuanced task intents. In this paper, we propose Instruction-aware Active Probing (InA-Probe), which shifts the paradigm from passive alignment toward an active, instruction-driven probing mechanism. Specifically, we design a Multi-Level Instruction Injection mechanism that enriches the model with both global task objectives and fine-grained, patch-level semantic priors. Building on this, an Adaptive Query Generation module produces sample-specific probes that are dynamically modulated by the temporal context. These probes are then refined through a dual-stage attention process: they first internalize task-specific intents via Instruction-Aware Self-Attention, and subsequently interrogate the projected temporal representations through Temporal Cross-Attention to extract salient patterns. Comprehensive experiments on seven real-world benchmarks show that InA-Probe consistently outperforms state-of-the-art deep learning and LLM-based baselines, excelling in both one-for-all generalization and zero-shot transfer while reducing forecasting error by up to 37% in challenging cross-domain scenarios. Ablation studies further confirm that the synergy between adaptive querying and fine-grained instructions is key to unlocking the reasoning power of LLMs for complex time series.

2606.08800 2026-06-09 cs.AI 新提交

Bridging Expert Knowledge and Automated Feature Engineering via Self-Evolution

通过自进化桥接专家知识与自动化特征工程

Varun Khurana, Vijval Ekbote, Vashu Chauhan, Yaman Kumar Singla, Rajiv Ratn Shah, Balaji Krishnamurthy

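Affiliations: Adobe Media and Data Science Research; IIIT-Delhi

AI summary: Proposes FEST, which combines dual-stream feature generation, semantic deduplication, and tree-guided iterative evolution to discover auditable features from raw text and images. It gains 4.2 pp on average over the strongest baseline on tasks such as brand classification and covers 60-80% of expert-designed features.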

详情
AI中文摘要

在品牌合规、临床护理和内容审核等高风险场景中,机器学习不能作为不透明的预言机部署:从业者需要检查驱动模型决策的特征,模型必须利用管理这些领域的专家文档。实际上,数据以非结构化内容形式到达,从中提取的特征必须可解释、有区分度,并与专家认为重要的内容对齐。现有方法存在不足:它们针对表格输入,缺乏专家对齐的证明,并且无法将诸如“保持专业语气”之类的定性标准转化为精确特征。我们提出了FEST(自进化树特征工程),结合了双流特征生成(语义和确定性)、语义去重和树引导的迭代进化,从原始文本和图像中发现可审计特征。FEST在品牌分类、内容真实性检测和压力检测的20个分类器-任务组合中领先17个,在五个分类器上平均比最强基线高出4.2个百分点。LLM作为评判者的评估显示,在严格的语义对齐阈值下,FEST实现了60-80%的专家设计品牌特征覆盖率,并通过人类专家研究证实,这些特征在相关性、清晰度和可操作性方面获得高评分。当以专家指南作为种子时,FEST将定性标准细化为可操作特征,跨品牌平均提高6-12个百分点的准确率。为了实现对自动化特征工程中专家对齐的系统评估,我们发布了BrandGuide,这是第一个将专家设计特征与2,683个品牌的100万+资产配对的数据集。通过将特征工程建立在专家知识基础上,FEST为需要人类监督的可解释机器学习开辟了一条实用途径。

英文摘要

In high-stakes settings such as brand compliance, clinical care, and content moderation, machine learning models cannot be deployed as opaque oracles: practitioners inspect the features driving model decisions, and models must leverage the expert documentation governing these domains. In practice, the data arrives as unstructured content, and features extracted from it must be interpretable, discriminative, and aligned with what experts consider important. Existing methods fall short: they target tabular inputs, lack demonstrated expert alignment, and cannot operationalize qualitative criteria such as 'maintain professional tone' into precise features. We present FEST (Feature Engineering with Self-evolving Trees), combining dual-stream feature generation (semantic and deterministic), semantic deduplication, and tree-guided iterative evolution to discover auditable features from raw text and images. FEST leads in 17 of 20 classifier-task combinations across brand classification, content authenticity detection, and stress detection, with a mean gain of 4.2 pp over the strongest baseline across five classifiers. An LLM-as-judge evaluation shows FEST achieves 60-80% coverage of expert-designed brand features at strict semantic-alignment thresholds, corroborated by a human expert study rating features highly on relevance, clarity, and actionability. When seeded with expert guidelines, FEST refines qualitative criteria into operational features, improving accuracy by 6-12 pp on average across brands. To enable systematic evaluation of expert alignment in automated feature engineering, we release BrandGuide, the first dataset pairing expert-designed features with 1M+ assets across 2,683 brands. By grounding feature engineering in expert knowledge, FEST opens a practical pathway for interpretable ML in domains demanding human oversight.

2606.08804 2026-06-09 cs.AI cs.LG 新提交

Q-Delta: Beyond Key-Value Associative State Evolution

Q-Delta:超越键值关联状态演化

Sumin Park, Seojin Kim, Noseong Park

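AI summary: Proposes Q-Delta, a query-aware delta rule that integrates mixed key-query prediction errors into state evolution for jointly corrective dynamics, outperforming strong baselines on language modeling and long-context retrieval tasks.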

Comments Accepted at ICML 2026

详情
AI中文摘要

线性注意力将序列建模重新表述为循环状态演化,实现高效的线性时间推理。在键值关联范式下,现有方法将查询的作用限制在读出操作,使其与状态演化解耦。我们表明,查询条件状态读出在累积记忆上诱导出结构化的值预测,补充了基于键的检索。基于这一洞察,我们提出Q-Delta,一种查询感知的delta规则,将混合键-查询预测误差融入状态演化,在保持delta规则效率的同时实现联合校正动态。我们为所得动态建立了稳定性保证,并推导出硬件高效的块状并行公式,以及自定义Triton实现。实验结果表明,在语言建模和长上下文检索任务上,优化稳定、吞吐量具有竞争力,且一致优于强基线。

英文摘要

Linear attention reformulates sequence modeling as recurrent state evolution, enabling efficient linear-time inference. Under the key-value associative paradigm, existing approaches restrict the role of the query to the readout operation, decoupling it from state evolution. We show that query-conditioned state readout induces a structured value prediction over accumulated memory that complements key-based retrieval. Based on this insight, we propose Q-Delta, a query-aware delta rule that integrates mixed key-query prediction errors into state evolution, enabling jointly corrective dynamics while preserving delta-rule efficiency. We establish stability guarantees for the resulting dynamics and derive a hardware-efficient chunkwise-parallel formulation with a custom Triton implementation. Empirical results demonstrate stable optimization, competitive throughput, and consistent improvements over strong baselines on language modeling and long-context retrieval tasks.
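For context, the classical delta rule that this line of work builds on updates an associative state S so that reading it with key k returns something closer to value v: S ← S + β (v − S k) kᵀ. The sketch below shows only that standard key-value rule in pure Python. It is not Q-Delta's query-aware variant, whose exact form the abstract does not give, and `read`, `delta_update`, and the choice β = 1 are illustrative assumptions.

```python
def read(state, vec):
    """Read out state @ vec for a (d_v x d_k) state stored as nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in state]


def delta_update(state, key, value, beta=1.0):
    """Classical delta rule: S <- S + beta * (v - S k) k^T."""
    prediction = read(state, key)                       # what the memory currently returns
    error = [v - p for v, p in zip(value, prediction)]  # key-based prediction error
    return [
        [w + beta * e * k for w, k in zip(row, key)]
        for row, e in zip(state, error)
    ]


state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]                                # unit-norm key
state = delta_update(state, key, [2.0, -1.0])   # first association
state = delta_update(state, key, [5.0, 3.0])    # overwrite the same key
recalled = read(state, key)                     # the newer value replaces the older one
```

With a unit-norm key and β = 1 the second write fully replaces the first, which is the error-correcting behavior that distinguishes the delta rule from purely additive linear-attention updates.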

2606.08814 2026-06-09 cs.AI cs.LG 新提交

STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning

STAR: 将MoE路由重新思考为结构感知的子空间学习

Sumin Park, Noseong Park

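Affiliation: Korea Advanced Institute of Science and Technology (KAIST)

AI summary: Proposes STAR, which makes routing aware of input structure by learning a principal subspace with the Generalized Hebbian Algorithm, enabling stable expert specialization and improving routing quality and downstream performance on synthetic data and on language and vision tasks.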

Comments Accepted at ICML 2026

详情
AI中文摘要

混合专家(MoE)通过选择性地将输入路由到专门的专家子集来高效扩展模型容量。然而,输入-专家专业化(MoE的核心动机)关键取决于路由器是否真正感知输入结构。实践中,MoE路由通常实现为浅层线性投影,对输入表示的感知有限,常导致路由不稳定。我们提出STAR(结构感知路由),将MoE路由重新思考为子空间学习问题,通过广义Hebbian算法(GHA)跟踪主导输入结构的演化主子空间来增强标准可学习路由。通过将路由决策直接与输入结构对齐,STAR实现了稳定的专家专业化。我们在受控合成设置和大规模语言与视觉任务上评估STAR,它持续提高了路由质量和下游性能,超过了强MoE基线。此外,可选的测试时子空间更新进一步增强了输入分布偏移下的路由鲁棒性和泛化能力。

英文摘要

Mixture-of-Experts (MoE) scales model capacity efficiently by selectively routing inputs to a specialized subset of experts. However, input-expert specialization, the core motivation of MoE, critically depends on whether the router is actually aware of input structure. In practice, MoE routing is typically implemented as a shallow linear projection with limited awareness of input representation, which often leads to unstable routing. We propose STAR (Structure-Aware Routing), which rethinks MoE routing as a subspace learning problem by augmenting standard learnable routing with an evolving principal subspace that tracks dominant input structure via the Generalized Hebbian Algorithm (GHA). By aligning routing decisions directly with input structure, STAR enables stable expert specialization. We evaluate STAR on a controlled synthetic setup and on large-scale language and vision tasks, where it consistently improves routing quality and downstream performance over strong MoE baselines. Moreover, optional test-time subspace updates further enhance routing robustness and generalization under input distribution shifts.
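The Generalized Hebbian Algorithm mentioned above (also known as Sanger's rule) can be sketched in a few lines. The example below is a generic, pure-Python illustration on a hand-made 2-D dataset. It does not reproduce how STAR couples the learned subspace to a router, which the abstract does not specify, and `gha_step`, the dataset, and the learning rate are assumptions chosen for illustration.

```python
import math


def gha_step(weights, x, lr):
    """One Sanger's-rule update: dW_i = lr * y_i * (x - sum_{j<=i} y_j * W_j)."""
    outputs = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    new_weights = []
    for i, row in enumerate(weights):
        # Reconstruction of x from components 0..i (this deflates later rows).
        recon = [
            sum(outputs[j] * weights[j][d] for j in range(i + 1))
            for d in range(len(x))
        ]
        new_weights.append(
            [w + lr * outputs[i] * (xi - r) for w, xi, r in zip(row, x, recon)]
        )
    return new_weights


# Zero-mean toy data with most variance along the diagonal direction (1, 1).
data = [(2.0, 2.0), (-2.0, -2.0), (0.5, -0.5), (-0.5, 0.5)]
weights = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(500):
    for x in data:
        weights = gha_step(weights, x, lr=0.05)

top = weights[0]
# Absolute cosine between the first learned component and the diagonal direction.
cosine_to_diagonal = abs(top[0] + top[1]) / (math.sqrt(2) * math.hypot(top[0], top[1]))
```

After training, the first row of `weights` should align closely with the dominant direction of the data, which is the kind of "evolving principal subspace" signal a structure-aware router could consume.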

2606.08815 2026-06-09 cs.AI cs.CL cs.LG 新提交

Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

推理的动量:策略优化中的密集内在信号

Hao Chen, Zhanming Shen, Liyao Li, Yanyu Chen, Xuhang Zhu, Xiaomeng Hu, Qi Zhang, Ru Peng, Xiaoyu Shen, Haobo Wang, Junbo Zhao

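Affiliations: Zhejiang University; The Chinese University of Hong Kong; Eastern Institute of Technology

AI summary: To address two failure modes that binary rewards cause for GRPO in long-chain reasoning, zero-advantage collapse and hallucinated certainty, proposes ISPO, which densifies the reward with intrinsic signals and consistently outperforms baselines across three base models and five mathematical reasoning benchmarks.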

Comments 14 pages, 6 figures, 8 tables

详情
AI中文摘要

基于可验证奖励的强化学习已成为激发大型语言模型长链推理的强大范式。然而,现有基于组相对策略优化(GRPO)的方法依赖于二元结果奖励,这引发了两种结构性失败模式:零优势崩溃,即组内所有轨迹共享相同结果导致梯度消失;以及幻觉确定性,即模型在训练后期对错误轨迹变得过度自信。我们通过使用完全从策略自身条件概率计算的内在信号来密集化奖励,解决了这两种模式,并提出了ISPO(内在信号策略优化),它结合了衡量思考轨迹对最终答案信息量的序列级信号,以及令牌级方向性奖励,其幻觉确定性铰链惩罚关键决策令牌上的错误自信预测。在三个基模型和五个数学推理基准上,ISPO持续优于竞争基线,在零优势崩溃最频繁的最难基准上取得最大提升,训练动态诊断证实两种失败模式均被减少。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely on a binary outcome reward, which induces two structural failure modes: Zero-Advantage Collapse, in which all rollouts in a group share the same outcome and the gradient vanishes, and Hallucinated Certainty, in which the model becomes increasingly confident on incorrect rollouts late in training. We address both modes by densifying the reward with intrinsic signals computed entirely from the policy's own conditional probabilities, and propose ISPO (Intrinsic Signal Policy Optimization), which combines a sequence-level signal measuring how informative the thinking trajectory is for the final answer, with a token-level directional reward whose hallucinated-certainty hinge penalizes confidently-wrong predictions at critical decision tokens. Across three base models and five mathematical reasoning benchmarks, ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks where zero-advantage collapse is most frequent, and training-dynamics diagnostics confirm that both failure modes are reduced.
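The zero-advantage collapse described above is easy to reproduce numerically. The sketch below standardizes binary rewards within a rollout group, in the style commonly used for GRPO. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration, and the code does not implement ISPO's intrinsic signals.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style baseline)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


mixed_group = [1.0, 0.0, 0.0, 1.0]       # some rollouts correct, some wrong
all_wrong_group = [0.0, 0.0, 0.0, 0.0]   # a hard prompt: every rollout fails
all_right_group = [1.0, 1.0, 1.0, 1.0]   # an easy prompt: every rollout succeeds

mixed_adv = group_relative_advantages(mixed_group)
collapsed_wrong = group_relative_advantages(all_wrong_group)
collapsed_right = group_relative_advantages(all_right_group)
```

When every rollout in a group receives the same binary reward, all advantages are zero and the group contributes no gradient, which is why a denser reward signal can help most on the hardest prompts.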

2606.08896 2026-06-09 cs.AI 新提交

FAME: Forecastability-Aware Mixture of Experts for Heterogeneous Time Series Forecasting

FAME: 面向异构时间序列预测的可预测性感知专家混合模型

Qianyang Li, Xingjun Zhang, Shaoxun Wang, Tao Peng, Jia Wei

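Affiliations: Sun Yat-sen University; Guangdong Key Laboratory of Information Security Technology; Ministry of Education Key Laboratory of Machine Intelligence and Advanced Computing

AI summary: Because a single model underperforms on large-scale heterogeneous time series forecasting, proposes FAME, a forecastability-aware sparse mixture-of-experts framework that uses multidimensional forecastability fingerprints and cost-aware routing, reducing MSE by 12.4% on an industrial dataset.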

详情
AI中文摘要

大规模零售和工业预测系统包含许多异构时间序列,其生命周期、稀疏性、波动性、季节性、频谱模式和上下文敏感性差异很大。单一预测模型很少能在所有情况下表现良好,而密集集成会增加推理成本并提供有限的专家适用性洞察。本文研究可预测性感知的专家路由:学习数据特征如何决定预测专家的适用性。我们提出FAME,一个稀疏专家混合框架,用多维可预测性指纹表示每个序列,从验证性能中挖掘专家适用性目标,并训练一个成本感知的稀疏路由器,为每个序列激活少量预算的专家集。使用山东新北洋(SNBC)的生产规模自动售货机销售数据集(其中预测组件已集成到补货计划管道中)以及公共零售基准,我们表明专家适用性在不同数据情况下系统性地变化。在拥有5000+台机器和6000万+交易的工业数据集上,FAME Top-2相比最强单一专家LightGBM降低了12.4%的MSE,同时平均每个序列执行1.92个专家。部署的组件产生需求预测,而库存导向的收益通过离线回放模拟器在固定补货策略下估计,而非在线干预。该框架将异构销售预测从启发式模型选择转变为可预测性模式和专家专业化的数据挖掘。代码可在 https://github.com/hit636/FAME 获取。

英文摘要

Large-scale retail and industrial forecasting systems contain many heterogeneous time series whose lifecycle, sparsity, volatility, seasonality, spectral patterns, and contextual sensitivity differ substantially. A single forecasting model rarely performs well across all regimes, while dense ensembles increase inference cost and provide limited insight into expert suitability. This paper studies forecastability-aware expert routing: learning how data characteristics determine the suitability of forecasting experts. We propose FAME, a sparse mixture-of-experts framework that represents each series with a multidimensional forecastability fingerprint, mines expert-suitability targets from validation performance, and trains a cost-aware sparse router to activate a small budgeted set of experts for each series. Using a production-scale vending-machine sales dataset from Shandong New Beiyang (SNBC), where the forecasting component has been integrated into the replenishment-planning pipeline, together with public retail benchmarks, we show that expert suitability varies systematically across data regimes. On the industrial dataset with 5,000+ machines and 60M+ transactions, FAME Top-2 reduces MSE by 12.4% over the strongest single expert, LightGBM, while executing 1.92 experts per series on average. The deployed component produces demand forecasts, while inventory-oriented gains are estimated by an offline replay simulator under a fixed replenishment policy rather than by online intervention. The framework turns heterogeneous sales forecasting from heuristic model selection into data mining of forecastability patterns and expert specialization. Code is available at https://github.com/hit636/FAME

2606.08974 2026-06-09 cs.AI 新提交

Diverse Thinking Schemata Elicit Better Reasoning in Large Language Models

多样思维图式激发大型语言模型更优推理

Xinyue Liang, Yizhe Yang, Yu Bai, Bin Xu, Jiawei Li, Yang Gao

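Affiliation: School of Computer Science and Technology, Beijing Institute of Technology

AI summary: Proposes Diverse Schemata Policy Optimization (DiScO), which increases the diversity of reasoning-step transitions and answer candidates to improve LLM performance and error recovery on mathematical reasoning tasks.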

详情
AI中文摘要

大型推理模型(LRMs)因其通过生成扩展推理链解决复杂数学问题的能力而受到越来越多的关注。在这项工作中,我们聚焦于推理过程中两个关键但尚未充分探索的方面:推理转换(捕捉推理步骤之间的不同转换)和答案候选(反映模型产生的解路径的多样性)。我们将这两个方面统称为思维图式。我们观察到思维图式的多样性与模型性能之间存在相关性,这激励我们通过增强多样性来进一步提升推理潜力。为此,我们提出了多样图式策略优化(DiScO),该框架首先赋予模型图式感知能力,然后通过强化学习鼓励多样性,并在推理时进一步促进多样化推理。在多个数学推理基准上的实验表明,DiScO始终优于标准的群体相对策略优化。除了准确性之外,人工标注分析显示,DiScO显著提高了模型从错误初始尝试中恢复的能力。总体而言,我们的工作表明思维图式多样性发挥的重要作用,并指出沿着多样性维度进行扩展是一个有前景的研究方向。

英文摘要

Large reasoning models (LRMs) have attracted increasing attention for their ability to solve complex mathematical problems by generating extended reasoning chains. In this work, we focus on two critical yet underexplored aspects of the reasoning process: reasoning transitions, which capture the distinct transitions between reasoning steps, and answer candidates, which reflect the variety of solution paths produced by the model. We collectively define these two aspects as thinking schemata. We observe a correlation between the diversity of thinking schemata and model performance, which motivates us to enhance diversity as a means to further improve reasoning potential. To this end, we propose Diverse Schemata Policy Optimization (DiScO), a framework that first endows the model with schemata awareness, then encourages diversity through reinforcement learning, and further promotes diverse reasoning at inference time. Experiments on multiple mathematical reasoning benchmarks demonstrate that DiScO consistently outperforms standard group relative policy optimization. Beyond accuracy, human-annotated analyses show that DiScO substantially improves the model's ability to recover from erroneous initial attempts. Overall, our work highlights the important role of diversity in thinking schemata and points to scaling along the diversity dimension as a promising research direction.

2606.09124 2026-06-09 cs.AI 新提交

A Regret Minimization Framework on Preference Learning in Large Language Models

大语言模型中偏好学习的遗憾最小化框架

Suhwan Kim, Taehyun Cho, Geon-Hyeong Kim, Yu Jin Kim, Youngsoo Jang, Moontae Lee, Jungwoo Lee

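Affiliation: KAIST

AI summary: Proposes Regret-based Preference Optimization (RePO), which models human preferences through regret minimization rather than reward maximization and achieves consistent gains on mathematical reasoning and human preference datasets.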

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)通过依赖任务特定的验证器提供自动化正确性信号,推动了推理密集型任务的进展。然而,许多现实语言任务难以配备可靠的验证器,这促使人们越来越依赖从人类反馈中强化学习(RLHF)。在此背景下,我们认为有必要更仔细地审视人类反馈应如何被解释。我们引入了基于遗憾的偏好优化(RePO),它通过遗憾最小化而非奖励最大化来重新构建RLHF。人类偏好通常由对结果的前瞻性预期和对替代行为的反事实比较所塑造,而非由即时的、与结果无关的效用决定。RePO通过将偏好建模为行为条件化的相对次优性评估来捕捉这一结构。在数学推理基准和人类偏好数据集上的实验表明,RePO能够取得一致的性能提升,表明它是一种有效且与人类对齐的大语言模型训练方法。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoning-intensive tasks by relying on task-specific verifiers that provide automated correctness signals. However, many realistic language tasks are difficult to equip with reliable verifiers, motivating a growing reliance on reinforcement learning from human feedback (RLHF). In this setting, we argue that a closer examination of how human feedback should be interpreted is essential. We introduce Regret-based Preference Optimization (RePO), which reframes RLHF through regret minimization rather than reward maximization. Human preferences are often shaped by prospective anticipation of outcomes and counterfactual comparisons to alternative behaviors, rather than by immediate, outcome-independent utility. RePO captures this structure by modeling preferences as behavior-conditioned assessments of relative suboptimality. Experiments on mathematical reasoning benchmarks and human preference datasets demonstrate consistent performance gains, indicating that RePO is an effective and human-aligned approach for training large language models.

2606.09410 2026-06-09 cs.AI cs.CL 新提交

Capacity, Not Format: Rethinking Structured Reasoning Failures

容量而非格式:重新思考结构化推理失败

Hengxin Fan

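AI summary: Finds that the effect of structured formats on model performance depends on spare capacity. When capacity is short, performance degrades through two mechanisms, truncation and pure capacity competition, so the advice is to think first and format later.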

Comments 12 pages, 3 figures

详情
AI中文摘要

先前的工作将结构化输出视为推理的代价,但这种框架是不完整的:格式化的成本强烈依赖于模型的空闲容量。通过使用信息匹配的散文控制和四级模式复杂度梯度,我们在4个模型和5个基准测试中分离了格式特定效应与提示长度混淆,成功生成的响应中解析失败率为0%。我们发现结构化格式是容量依赖的。具有足够余量的模型在吸收JSON约束时不会出现性能下降(Sonnet:MATH-Hard上JSON为$88.7\pm4.0$%,CoT为$89.3\pm1.7$%)。相反,格式会严重降低接近其极限运行的模型,通过两种不同的机制。首先,在标准token预算下,Haiku下降了36.2个百分点($p < 0.0001$),主要是由于截断。其次,即使延长预算消除了截断,GPT-4o-mini仍下降了28.0个百分点($p < 0.001$),揭示了独立于token耗尽的纯容量竞争。这种格式惩罚随模式复杂度增加(McNemar $p < 0.0001$),且不能仅由提示长度解释。此外,这些结果对前沿模型免疫的说法提出了质疑:在AIME竞赛数学中,Opus 4.7在JSON下从96.2%下降到91.0%($-5.3$个百分点;显示的百分比独立四舍五入,精确差值为$7/133 = 5.26$pp $\approx 5.3$pp)。一种延迟结构消融——在格式化之前自由推理——恢复了大部分丢失的准确率(3次运行均值:80-87%),支持了容量竞争机制。实际意义不是避免结构化输出,而是使其与容量匹配:当模型接近其极限时,先思考,后格式化。

英文摘要

Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation -- reasoning freely before formatting -- recovers most of the lost accuracy (3-run mean: 80--87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
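The rounding note in the abstract (a 5.3 pp drop even though 96.2 − 91.0 = 5.2) can be checked with a few lines of arithmetic. The total of 133 problems and the 7-problem difference come from the abstract. The per-condition counts of 128 and 121 correct answers are inferred here because they are the counts that round to the reported percentages, so treat them as assumptions.

```python
total = 133          # from the abstract's "7/133"
correct_free = 128   # inferred: 128/133 rounds to the reported 96.2%
correct_json = 121   # inferred: 121/133 rounds to the reported 91.0%

acc_free = 100 * correct_free / total
acc_json = 100 * correct_json / total

# Exact drop in percentage points: 7 problems out of 133.
exact_drop = acc_free - acc_json

# Drop obtained by subtracting the two independently rounded percentages.
drop_from_rounded = round(acc_free, 1) - round(acc_json, 1)

print(f"free-form: {acc_free:.1f}%  JSON: {acc_json:.1f}%")
print(f"exact drop: {exact_drop:.2f} pp, from rounded values: {drop_from_rounded:.1f} pp")
```

The gap between the two figures is purely a rounding effect: each accuracy is rounded to one decimal before display, while the reported drop is rounded from the exact difference.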

2606.09605 2026-06-09 cs.AI 新提交

Next-Token Prediction Learns Generalisable Representations of Sleep Physiology

下一个词预测学习睡眠生理学的可泛化表示

Jonathan F. Carter, Lionel Tarassenko

发表机构 * Institute of Biomedical Engineering, University of Oxford(牛津大学生物医学工程研究所)

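AI summary Proposes Hypnos, a model trained with a next-token prediction objective to learn generalisable representations from multi-modal physiological signals; it significantly outperforms existing foundation models on tasks such as sleep stage classification and atrial fibrillation detection.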

详情
AI中文摘要

基础模型提供了一种有前景的途径,将多模态生理信号压缩为人类健康的紧凑表示,在睡眠医学、心脏病学、神经病学及其他医疗领域具有广泛应用。现有模型通常采用掩码重建或对比学习目标进行训练。然而,掩码重建可能不适用于这些信号的随机性质,而对比方法依赖于正样本对定义,尽管生理信号的语义不变性尚不明确。在这项工作中,我们展示了下一个词预测是一种简单且可扩展的替代方案。我们开发了Hypnos,一个多模态睡眠基础模型,使用来自超过20,000次夜间多导睡眠图记录的八种不同传感模态(例如EEG、ECG、呼吸信号)进行训练。我们使用残差向量量化将每种模态标记化为离散标记流,然后训练一个大型自回归RQ-Transformer,以并行方式联合预测所有模态的下一个标记。训练后,Hypnos可应用于任何支持模态子集的连续传感器数据流,为下游任务生成嵌入。在一系列基准测试中,Hypnos显著优于现有基础模型。在睡眠阶段分类中,我们在保留测试集上匹配了强监督基线的性能,同时使用的标记数据减少了100倍。Hypnos甚至泛化到日间生理学,在检测房颤方面超越了专用的ECG基础模型。我们的结果表明,下一个词预测是从多模态生理信号进行表示学习的强自监督目标。

英文摘要

Foundation models offer a promising route to compress multi-modal physiological signals into compact representations of human health, with broad applications across sleep medicine, cardiology, neurology and other healthcare domains. Existing models have typically been trained with masked-reconstruction or contrastive objectives. However, masked reconstruction may be poorly suited to the stochastic nature of these signals, while contrastive approaches rely on positive-pair definitions despite the semantic invariances of physiological signals being poorly understood. In this work, we show that next-token prediction is a simple and scalable alternative. We develop Hypnos, a multi-modal sleep foundation model trained using eight different sensing modalities (e.g. EEG, ECG, respiratory signals) drawn from over 20,000 overnight polysomnography recordings. We tokenize each modality into streams of discrete tokens using residual vector quantization, then train a large auto-regressive RQ-Transformer to jointly predict the next token across all modalities in parallel. After training, Hypnos can be applied to continuous streams of sensor data from any subset of supported modalities, generating embeddings for downstream tasks. Across a range of benchmarks, Hypnos significantly outperforms existing foundation models. In sleep stage classification, we match the performance of strong supervised baselines on held-out test sets whilst using \(100\times\) less labelled data. Hypnos even generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation. Our results demonstrate that next-token prediction is a strong self-supervised objective for representation learning from multi-modal physiological signals.
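Residual vector quantization, the tokenization step named in the abstract, can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the two tiny hand-written codebooks and the 2-D input are hypothetical, whereas a real system would learn the codebooks from signal windows. Each stage quantizes what the previous stage left over, so one vector becomes a short tuple of discrete tokens.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest(codebook: Codebook, x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    def dist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> Tuple[List[int], List[float]]:
    """Quantize x stage by stage; return the token per stage and the final residual."""
    residual = list(x)
    codes: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse stage followed by a fine stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]]

codes, residual = rvq_encode([1.2, 1.0], [coarse, fine])
reconstruction = rvq_decode(codes, [coarse, fine])
print(codes, reconstruction)
```

Adding stages shrinks the residual at the cost of more tokens per time step, which is the trade-off an autoregressive model over such token streams has to live with.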

2606.09672 2026-06-09 cs.AI cs.CL cs.LG cs.PF q-bio.QM 新提交

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

相关性不够:嵌入人类元数据用于个体因果发现

Suraj Biswas, Saurabh Gupta, Pritam Mukherjee

发表机构 * Assessli Research(Assessli研究) Dots-In Research(Dots-In研究)

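AI summary Pretrained biomedical language models assign high cosine similarity (0.76-0.92) to unrelated cross-domain pairs, which leads to wrong causal inferences. The paper proposes contrastive training (raising separation to 1.63x) and BODHI hard-negative mining (raising it to 2.30x), together with OpenVINO optimisation for a 133x speedup.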

Comments 20 pages, 18 figures, 9 tables

详情
AI中文摘要

询问一个预训练的生物医学语言模型“皮质醇28 ug/dL”和“股市波动”是否相关,它会返回0.83的余弦相似度(1.0表示完全相同)。两者没有共同机制。这不是个例:我们测试的所有现成生物医学编码器(BioBERT、PubMedBERT、BioM-ELECTRA)在跨域无关对上得分在0.76到0.92之间,而正确答案应接近零。跨域区分准确率为0%。检索系统可以承受这一点,因为下游语言模型会过滤噪声。但大型行为模型(LBM)——一种以人为对象而非句子的基础模型——则不能:它在用户生活图上推理,并将嵌入接近性视为两个事件因果关联的证据。虚假接近性会写入虚假因果边,所有下游都会继承错误。在这里,嵌入几何不是调节旋钮,而是正确性的关键。我们报告了修复方法。对72,034对进行对比训练,将PubMedBERT的BIOSSES相关性从0.633提升到0.828,域内与域间分离度从1.05倍提升到1.63倍。第二次训练BODHI从生物医学知识图中缺失的边挖掘硬负例,将分离度提升到2.30倍,区分差距提升到+0.392,BIOSSES代价为4.5%。在带有AMX的Intel Xeon 6737P上,OpenVINO将单查询延迟从1367毫秒降至10毫秒(133倍),达到每秒555个句子。一个发现与标准建议相悖:在此芯片上,FP16在所有服务批量大小下优于INT8,我们解释了原因。同一模型在无AMX的Ice Lake实例上运行慢13-27倍。我们发布了基准测试套件、训练语料库、BODHI生成器和OpenVINO脚本。

英文摘要

Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.
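The two quantities the abstract leans on, cosine similarity and within-vs-across-domain separation, can be sketched as follows. The ratio below (mean within-domain similarity divided by mean across-domain similarity) is an assumed definition for illustration; the abstract does not spell out the paper's exact formula. The 2-D vectors and domain labels are made up.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def separation(embeddings: Sequence[Sequence[float]], domains: Sequence[str]) -> float:
    """Mean within-domain similarity over mean across-domain similarity.

    Assumed definition: a value near 1.0 means the encoder does not
    distinguish domains; larger values mean cleaner separation.
    """
    within, across = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        (within if domains[i] == domains[j] else across).append(sim)
    return (sum(within) / len(within)) / (sum(across) / len(across))


# Hypothetical embeddings: two "endocrine" items and two "finance" items.
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
labels = ["endocrine", "endocrine", "finance", "finance"]

ratio = separation(vectors, labels)
print(ratio)
```

Under this definition, an encoder that scores unrelated pairs at 0.76 to 0.92 cannot have a ratio far above 1, which is consistent with the 1.05x baseline the abstract reports.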

2606.07524 2026-06-09 cs.CL cs.AI 交叉投稿

ABLE: Representing and Mapping LLMs via Attribution-Based Large-model Embedding

ABLE:基于归因的大模型嵌入表示与映射

Zirui Wang, Yusen Hou, Shaofeng Liang, Bowen Tian, Yanlin Zhang, Wenshuo Chen, Yutao Yue

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Deep Interdisciplinary Intelligence Lab (DI2 Lab)(深度跨学科智能实验室(DI2 Lab))

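AI summary Proposes the ABLE framework, which builds model embeddings from gradient-based feature attributions and a tokenizer-agnostic word-level alignment, enabling efficient comparison of heterogeneous LLMs with competitive or better results on relation prediction, model routing, and benchmark score prediction.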

详情
AI中文摘要

大语言模型(LLM)的爆炸式增长形成了一个异构且文档不完善的生态系统,使得系统性的模型比较对于来源审计、安全分析和模型选择越来越重要。现有的表示方法难以高效应对这一场景。分析内部参数的方法在架构兼容时很强大,但在结构异构下面临可扩展性障碍;而依赖外部输出的方法可能混淆具有相似行为的模型,且难以在不同分词器的更丰富输出空间中对齐。为弥合这一差距,我们提出ABLE(基于归因的大模型嵌入)框架,利用可解释性空间构建模型表示。通过基于梯度的特征归因,经由分词器无关的词级对齐进行聚合,ABLE捕获模型特定的输入敏感性模式,而不仅仅是表面输出。除经验效用外,我们提供了稳定性分析,表明在可微Transformer风格模型的标准正则性假设下,ABLE诱导出一个Lipschitz连续的参数到嵌入映射,并具有有限样本收敛保证。在239个开源LLM上的大量实验表明,我们的无训练方法在关系预测、模型路由和基准分数预测方面达到了有竞争力或更优的性能。

英文摘要

The explosive growth of large language models (LLMs) has created a heterogeneous and poorly documented ecosystem, making systematic model comparison increasingly important for provenance auditing, security analysis, and model selection. Existing representation methods struggle to address this setting efficiently. Approaches analyzing internal parameters are powerful when architectures are compatible, but face scalability barriers under structural heterogeneity, while methods relying on external outputs may conflate models with similar behaviors and are difficult to align in richer output spaces across different tokenizers. To bridge this gap, we propose ABLE (Attribution-Based Large-model Embedding), a framework that leverages the interpretability space to construct model representations. By aggregating gradient-based feature attributions via a tokenizer-agnostic word-level alignment, ABLE captures model-specific input-sensitivity patterns rather than only surface-level outputs. Beyond empirical utility, we provide a stability analysis showing that, under standard regularity assumptions for differentiable Transformer-style models, ABLE induces a Lipschitz-continuous parameter-to-embedding map with finite-sample convergence guarantees. Extensive experiments on 239 open-source LLMs demonstrate that our training-free approach achieves competitive or superior performance in relation prediction, model routing, and benchmark score prediction.
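A tokenizer-agnostic word-level alignment can be sketched by mapping each token's attribution score onto whitespace-delimited words through character offsets. This is an illustration only: the overlap-weighted summation rule, the helper names, and the toy scores are assumptions, and the abstract does not say how ABLE aggregates.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def word_spans(text: str) -> List[Span]:
    """Character spans of whitespace-delimited words."""
    spans: List[Span] = []
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans


def to_word_level(text: str, token_spans: Sequence[Span],
                  token_scores: Sequence[float]) -> List[float]:
    """Sum token attributions into words, weighted by character overlap.

    Token spans are assumed to be non-empty (end > start).
    """
    words = word_spans(text)
    totals = [0.0] * len(words)
    for (ts, te), score in zip(token_spans, token_scores):
        for w, (ws, we) in enumerate(words):
            overlap = min(te, we) - max(ts, ws)
            if overlap > 0:
                totals[w] += score * overlap / (te - ts)
    return totals


text = "unrelated tokens"
# Two hypothetical tokenizers that split the same text differently.
scores_a = to_word_level(text, [(0, 2), (2, 9), (10, 16)], [0.1, 0.3, 0.6])
scores_b = to_word_level(text, [(0, 9), (10, 13), (13, 16)], [0.4, 0.25, 0.35])
print(scores_a, scores_b)
```

Both tokenizations end up as vectors of the same length (one entry per word), so attribution patterns from models with different vocabularies become directly comparable.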

2606.07527 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

Post-training is (Massive) Supervised Learning

后训练是(大规模)监督学习

Michael Hassid, Yossi Adi, Roy Schwartz

发表机构 * FAIR, Meta AI(Meta AI 基础人工智能研究团队) The Hebrew University of Jerusalem(耶路撒冷希伯来大学)

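AI summary Argues that the current LLM post-training stage (SFT + RL) is in effect a return to the BERT-era "pre-train then fine-tune" paradigm, shows experimentally that models post-trained from scratch still reach non-trivial performance, and proposes a shift toward training in which models "learn how to learn".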

详情
AI中文摘要

训练LLM的主流范式已演变为依赖包含SFT和RL的大规模后训练阶段。在这篇立场论文中,我们认为这种方法实际上标志着回归到BERT时代的“预训练然后微调”方法,明确地使模型适应期望的行为和评估所用的特定基准。我们首先回顾LLM的历史,描述LLM演化的不同阶段。我们认为当前格局与LLM早期惊人地相似,那时任务性能严重依赖于将模型拟合到分布内数据集。为了实证证明这一点,我们比较了预训练模型和随机初始化模型,在现代推理数据集上对两种变体进行微调,并在竞争性数学和代码基准上评估它们。我们表明,从头开始后训练的模型产生了高度非平凡的性能。我们的发现表明,当前的后训练方法主要作为分布拟合机制发挥作用。最后,我们提出,开发通用能力的模型和系统需要超越针对预定义行为的广泛后训练,转而采用模型“学会如何学习”的训练过程。

英文摘要

The prevailing paradigm for training LLMs has evolved to rely on a massive post-training phase consisting of SFT and RL. In this position paper, we argue that this methodology effectively marks a reversion to the "pre-train then fine-tune" approach of the BERT era, explicitly tailoring models to the desired behaviors and specific benchmarks on which they are evaluated. We begin with a historical overview of LLMs, describing the different phases of the LLM evolution. We argue that the current landscape is remarkably similar to the early days of LLMs, where task performance heavily relied on fitting the models to in-distribution datasets. To empirically demonstrate this, we compare pre-trained models to randomly initialized ones, by fine-tuning both variants on modern reasoning datasets and evaluating them on competitive math and code benchmarks. We show that models post-trained from scratch yield highly non-trivial performance. Our findings suggest that current post-training methodologies function primarily as a distribution-fitting mechanism. We finish by positing that developing generally capable models and systems requires moving beyond extensive post-training for predefined behaviors, shifting instead toward training procedures where models "learn how to learn".

2606.07546 2026-06-09 cs.IR cs.AI cs.LG 交叉投稿

Beyond Item IDs: Scaling Short-Form-Video Recommendation via Semantic-Native Long Sequence Modeling

超越物品ID:通过语义原生长序列建模实现短视频推荐规模化

Ruixiao Sun, Diego Uribe Mora, Zhimeng Jiang, Yuanzhen Lin, Jiarui Wang, Yuening Li, Danfeng Guo, Zhizhong Chen, Chuan He, Liang Liu

发表机构 * Google Mountain View, USA(谷歌山景城,美国)

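AI summary Sequence length in short-form video recommendation is limited by the semantic sparsity of video IDs and the quadratic complexity of Transformers. The paper adopts Semantic IDs and a Global-Aware Compression Transformer to model ultra-long behaviour sequences at billion-user scale, cutting memory and compute overhead substantially and improving satisfied user engagement and content consumption in online experiments.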

Comments this manuscript has been accepted by SIGIR 2026

详情
AI中文摘要

捕捉用户跨广泛观看历史的兴趣对于短视频推荐至关重要,但扩展序列长度受到两个瓶颈的限制:原子视频ID的语义稀疏性和Transformer的二次计算复杂度。传统的正交视频ID无法捕捉内容关系,并且需要大型嵌入表,而自注意力的二次复杂度在严格的工业延迟和资源约束下限制了最大序列长度。在这项工作中,我们提出了一个在生产环境中部署的框架,用于在十亿用户规模上建模超长用户行为序列。我们首先通过采用内容原生的语义ID来解决表示瓶颈。通过使用深度截断、粗粒度的语义ID,我们将嵌入表大小从语料库基数中缩小。这种紧凑的表示通过共享语义前缀自然地泛化到冷启动内容。其次,为了克服序列扩展障碍,我们引入了全局感知压缩Transformer,它利用非参数时间折叠和统一全局查询集成来有效压缩序列,缓解了标准自注意力的内存和计算瓶颈。在我们计算基础设施上的离线分析显示,峰值内存占用减少了一个数量级,计算开销大幅降低。这种效率提升使得在生产中以可承受的成本支持更长的序列长度,在大规模在线A/B测试中,在满意的用户参与度和满意的内容消费方面取得了显著的在线收益。

英文摘要

Capturing user interests across extensive watch histories is critical for short-form video recommendation, yet scaling sequence length is limited by two bottlenecks: the semantic sparsity of atomic Video IDs and the quadratic computational complexity of Transformers. Traditional orthogonal Video IDs fail to capture content relationships and demand large embedding tables, while the quadratic complexity of self-attention restricts the maximum sequence length under strict industrial latency and resource constraints. In this work, we present a production-deployed framework for modeling ultra-long user behavior sequences at a billion-user scale. We first address the representation bottleneck by adopting content-native Semantic IDs. By utilizing depth-truncated, coarse-grained Semantic IDs, we shrink the embedding table size from corpus cardinality. This compact representation naturally generalizes to cold-start content through shared semantic prefixes. Second, to overcome the sequence scaling barrier, we introduce a Global-Aware Compression Transformer that leverages non-parametric temporal folding and unified global query integration to effectively condense the sequence, alleviating both the memory and computational bottlenecks of standard self-attention. Offline profiling on our computing infrastructure demonstrates an order-of-magnitude reduction in peak memory footprint and a drastic decrease in computational overhead. This efficiency gain enables supporting longer sequence lengths at an affordable cost in production, yielding substantial online gains in satisfied user engagement and satisfied content consumption in large-scale online A/B tests.
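The effect of depth-truncated Semantic IDs on embedding table size and cold-start lookup can be sketched with plain tuples. The four-level IDs, the catalogue, and the helper names below are hypothetical; the abstract does not give the real ID depth or codebook sizes.

```python
from typing import Dict, Iterable, Tuple

SemanticID = Tuple[int, ...]


def truncate(semantic_id: SemanticID, depth: int) -> SemanticID:
    """Keep only the first `depth` codes, i.e. the coarse-grained prefix."""
    return semantic_id[:depth]


def build_table(ids: Iterable[SemanticID], depth: int) -> Dict[SemanticID, int]:
    """Assign one embedding row per distinct truncated prefix."""
    table: Dict[SemanticID, int] = {}
    for sid in ids:
        table.setdefault(truncate(sid, depth), len(table))
    return table


# Hypothetical catalogue of four videos with 4-level Semantic IDs.
catalog = [(12, 7, 3, 9), (12, 7, 5, 1), (12, 2, 4, 4), (3, 8, 0, 6)]

full_table = build_table(catalog, depth=4)    # one row per video
coarse_table = build_table(catalog, depth=2)  # one row per shared prefix

# A new video never seen in training still resolves to an existing row.
cold_start = (12, 7, 8, 2)
cold_row = coarse_table.get(truncate(cold_start, 2))
print(len(full_table), len(coarse_table), cold_row)
```

The table size now depends on how many distinct prefixes exist rather than on how many videos exist, and the cold-start item inherits the embedding of its semantic neighbours.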

2606.07559 2026-06-09 cs.CL cs.AI quant-ph 交叉投稿

Phantom transitions in language model fine-tuning

语言模型微调中的幻影相变

Vaibhav Prakash, Jayasri Dontabhaktuni

发表机构 * Mahindra University(马恒达大学)

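AI summary Studies why fine-tuning fails when the correct completion competes with a near-synonym. An order parameter is decomposed into a signal and a background drag, which reveals two failure modes; the apparent phase transitions turn out to be phantoms that arise from the softmax readout rather than from a geometric phase transition.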

Comments 26 pages, 9 figures

详情
AI中文摘要

在上下文中微调语言模型,当正确补全存在近义词竞争者时,常常无声地失败。交叉熵损失单调递减,而正确token在排名上从未超越竞争者。我们研究了跨越两个系列和五倍参数范围的五种Transformer架构,在十个精心挑选的近义词上下文中。我们用一个结合预测分布和成对嵌入重叠的序参量来测量这些失败。它可加性地分解为一个信号(跟踪模型对正确token相对于其最近竞争者的承诺)和一个背景拖拽(由嵌入整体向分数泄漏概率的方式决定)。这分离出两种失败模式:运动学失败中信号保持较小;结构失败中拖拽随着微调进行而主动恶化。我们观察到序参量中类似相变的弹弓状跳跃。一个核心负面结果组织了本文:这些相变是幻影。直接测量排除了自发对称破缺的解释。在LoRA微调下,当token嵌入矩阵在训练期间完全不变时,弹弓状跳跃仍然出现,而此处不可能存在几何相变。不连续性完全存在于softmax读出中。少量无量纲量组织跨架构的轨迹。其中一个在所有五种架构的全微调下保持一致。第二个根据整体嵌入分布将架构分为两类,并预测LoRA的充分性。作为盲测,该框架预测了一个未用于拟合任何参数的保留架构的临界学习率,与后续学习率扫描的误差在2.1%以内。研究结果仅涉及近义词机制,未经重新校准不应外推。

英文摘要

Fine-tuning a language model on contexts whose correct completion has a near-synonym competitor often fails silently. The cross-entropy loss decreases monotonically while the correct token never overtakes the competitor in rank. We study this regime across five transformer architectures spanning two families and a fivefold parameter range, on ten hand-selected near-synonym contexts. We instrument these failures with an order parameter combining the predicted distribution and pairwise embedding overlaps. It decomposes additively into a signal, tracking the model's commitment to the correct token over its nearest competitor, and a background drag, set by how the embedding bulk leaks probability into the score. This isolates two failure modes. In kinematic failure the signal stays small. In structural failure the drag actively worsens as fine-tuning proceeds. We observe sharp catapult-like jumps in the order parameter that resemble a phase transition. A central negative result organises the paper. The transitions are phantoms. The spontaneous-symmetry-breaking interpretation is ruled out by direct measurement. Catapult-like jumps still appear under LoRA fine-tuning with the token embedding matrix exactly unchanged during training, where no geometric phase transition is possible. The discontinuity lives entirely in the softmax readout. A small number of dimensionless quantities organise the trajectory across architectures. One is consistent across all five under full fine-tuning. A second sorts architectures into two classes by bulk embedding distribution and predicts LoRA sufficiency. As a blind test, the framework predicts the critical learning rate of a held-out architecture, not used to fit any parameter, to within 2.1% of a subsequent learning-rate sweep. Findings concern the near-synonym mechanism only and should not be extrapolated without recalibration.
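The silent failure described in the first two sentences is easy to reproduce in a toy softmax. In the sketch below the logit schedule is invented for illustration (the correct token and its competitor rise together with a fixed 0.5 gap over ten bulk tokens); it is not the paper's order parameter or training setup.

```python
from math import exp, log
from typing import List, Sequence

CORRECT = 0     # index of the correct token
COMPETITOR = 1  # index of the near-synonym competitor


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log softmax probability of `target` (log-sum-exp stabilised)."""
    m = max(logits)
    lse = m + log(sum(exp(z - m) for z in logits))
    return lse - logits[target]


def rank_of(logits: Sequence[float], target: int) -> int:
    """1-based rank of `target`; 1 means it has the highest logit."""
    return 1 + sum(1 for z in logits if z > logits[target])


def toy_logits(step: int) -> List[float]:
    """Hypothetical schedule: both candidates gain on the bulk, gap unchanged."""
    return [float(step), step + 0.5] + [0.0] * 10


losses = [cross_entropy(toy_logits(s), CORRECT) for s in range(6)]
ranks = [rank_of(toy_logits(s), CORRECT) for s in range(6)]
print([round(x, 3) for x in losses], ranks)
```

The loss falls at every step because probability mass drains from the bulk tokens, yet the correct token stays at rank 2 throughout, so monitoring cross-entropy alone would not reveal the failure.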

2606.07563 2026-06-09 cs.LG cs.AI 交叉投稿

Emergence via Phase Transitions: Mechanism Landscapes and Universal Convergence Across Complex Systems

通过相变涌现:机制景观与跨复杂系统的通用收敛

Truong Xuan Khanh

发表机构 * H&K Research Studio(H&K 研究工作室) Clevix LLC(Clevix 有限责任公司)

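AI summary Proposes the Hierarchical Emergence Framework (HEF), which models emergence as a phase transition in a mechanism landscape, proves physical feasibility and convergence to a unique fixed point under structural assumptions, and verifies a phase-transition fingerprint in 111 modular-arithmetic transformer experiments.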

Comments 27 pages, 3 figures, 2 tables; 15-page Supplementary Information with complete proofs included

详情
AI中文摘要

在机器学习、生物学和物理学中,独立演化的系统尽管微观细节截然不同,但常常收敛到惊人相似的高层结构。Grokking电路在不同随机种子下收敛,进化谱系重新发现相似的代谢解决方案,重整化流趋近共同的固定点。我们提出层次涌现框架(HEF)作为此类收敛现象的候选普适性框架。HEF将涌现建模为由热力学和信息论定律约束的机制景观中的相变。该框架引入一个临界能量阈值Ec,将具有竞争机制的探索阶段与由唯一最小成本机制主导的收敛阶段分开。在结构假设下,我们证明了物理可行性,推导了严格的度量收缩,并建立了收敛到与初始条件无关的唯一不动点表示。我们进一步通过有效信息和机制竞争熵将该收敛结构与因果涌现联系起来。为测试该框架,我们研究了111个实验中模算术变换器的延迟泛化(“grokking”)。我们识别出一个可重复的Ec转变经验指纹:在92%的运行中,权重范数在grokking之前系统性达到峰值。归一化准确率曲线坍缩到tanh扭结(R^2=0.93),与Landau-Ginzburg普适类一致,所有grokked模型收敛到0.9745±0.014,与初始化、权重衰减或训练比例无关(ANOVA p>0.13)。HEF并非作为涌现的通用理论提出,而是作为研究跨复杂系统收敛现象的可证伪数学框架。

英文摘要

Across machine learning, biology, and physics, independently evolving systems often converge toward strikingly similar high-level structures despite radically different microscopic details. Grokking circuits converge across random seeds, evolutionary lineages rediscover similar metabolic solutions, and renormalization flows approach common fixed points. We propose the Hierarchical Emergence Framework (HEF) as a candidate universality framework for such convergence phenomena. HEF models emergence as a phase transition in a mechanism landscape constrained by thermodynamic and information-theoretic laws. The framework introduces a critical energy threshold $E_c$ separating an exploration regime with competing mechanisms from a convergence regime governed by a unique minimum-cost mechanism. Under structural assumptions, we prove physical feasibility, derive strict metric contraction, and establish convergence toward a unique fixed-point representation independent of initial conditions. We further connect this convergence structure to causal emergence through Effective Information and mechanism competition entropy. To test the framework, we study delayed generalization ("grokking") in modular arithmetic transformers across 111 experiments. We identify a reproducible empirical fingerprint of the $E_c$ transition: the weight norm peaks systematically before grokking in 92% of runs. Normalized accuracy curves collapse onto a tanh kink ($R^2=0.93$) consistent with a Landau-Ginzburg universality class, and all grokked models converge to $0.9745\pm0.014$ regardless of initialization, weight decay, or training fraction (ANOVA $p>0.13$). HEF is not presented as a universal theory of emergence, but as a falsifiable mathematical scaffold for studying convergence phenomena across complex systems.

2606.07568 2026-06-09 cs.HC cs.AI cs.CV cs.LG physics.data-an 交叉投稿

A Systematic Study of Behavioral Cloning for Scientific Data Annotation

行为克隆在科学数据标注中的系统研究

Ishaan Singh Chandok, Core Francisco Park

发表机构 * GitHub

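AI summary To reduce the human effort spent on verification and correction in scientific data annotation, proposes a behavioral cloning framework with 9 synthetic tasks that simulate expert strategies. Key findings: skills are acquired hierarchically, multi-task pretraining enables efficient fine-tuning, and internal representations include a mistake representation shared across tasks.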

Comments ICML 2026 Oral

详情
AI中文摘要

科学数据标注,例如视频中动物追踪或神经重建的校对,仍然受限于“最后一公里”问题:即使有强大的自动化,验证和校正仍需大量人力。标准方法训练模型直接预测标注,丢弃了专家如何导航、点击、验证和校正的丰富监督信息。我们引入了一个研究科学标注上行为克隆的框架:9个合成任务配以合成标注,模拟真实人类策略,包括探索、错误校正和战略决策。我们的实验揭示了若干发现。首先,技能层次化出现:模型先学习GUI机制,再学习任务关键决策,且比训练数据犯更少错误,同时保留在错误发生时校正的能力。其次,在多任务行为克隆上扩展模型表明,在我们的规模范围内,更大的模型数据效率更高。第三,多任务预训练能够高效微调至新任务,而从零开始训练则完全失败。第四,线性探针揭示模型内部表示标注过程的潜在变量,如任务阶段和数据位置;有趣的是,我们发现一个跨不同标注任务泛化的共享错误表示。总体而言,我们的框架建立了系统基准并识别了关键瓶颈,为将行为克隆扩展到真实世界科学数据标注奠定了基础。

英文摘要

Scientific data annotation, such as tracking animals in video or proofreading neural reconstructions, remains bottlenecked by the "last mile" problem: even with strong automation, verification and correction consume substantial human effort. Standard approaches train models to directly predict annotations, discarding the rich supervision in how experts navigate, click, verify, and correct. We introduce a framework for studying behavioral cloning on scientific annotation: 9 synthetic tasks paired with synthetic annotations that simulate realistic human strategies including exploration, mistake correction, and strategic decision-making. Our experiments reveal several findings. First, skills emerge hierarchically: models learn GUI mechanics before task-critical decisions, and commit fewer mistakes than the training data while retaining the ability to correct errors when they occur. Second, scaling models on multi-task behavioral cloning shows that larger models are more data efficient within our scale range. Third, multi-task pretraining enables efficient fine-tuning to new tasks, while training from scratch fails entirely. Fourth, linear probes reveal that models internally represent latent variables of the annotation process such as task phase and data position; interestingly, we find a shared mistake representation that generalizes across different annotation tasks. Overall, our framework establishes systematic benchmarks and identifies key bottlenecks, providing a foundation for scaling behavioral cloning to real-world scientific data annotation.

2606.07571 2026-06-09 cs.LG cs.AI 交叉投稿

Enabling KV Caching of Shared Prefix for Diffusion Language Models

为扩散语言模型启用共享前缀的KV缓存

Younghun Go, Jaehoon Han, Changyong Shin, Chuk Yoo, Gyeongsik Yang

发表机构 * Korea University(高丽大学)

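AI summary Bidirectional attention in diffusion language models makes shared-prefix KVs unstable. The paper proposes bidirectional prefix caching (bicache), which dynamically identifies a safe layer depth for KV reuse, avoids accuracy collapse, and raises throughput by 36.3%-98.3%.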

详情
AI中文摘要

共享前缀的键值(KV)缓存对于高吞吐量的大语言模型(LLM)服务至关重要,但在新兴的扩散语言模型(DLM)中面临严峻挑战。在DLM中,双向注意力意味着更新任何token都会动态改变整个上下文及其对应的KV。因此,为LLM开发的现有缓存技术(假设KV一旦计算就保持不变)会破坏共享前缀KV。我们的实验表明,将这些技术应用于DLM会导致模型精度几乎降为零。为了解锁高吞吐量的DLM服务,我们提出了双向前缀缓存(bicache),这是第一个用于DLM中共享前缀的KV缓存技术。bicache基于我们全面分析的关键观察设计:共享前缀KV在浅层中保持稳定且可重用,而浅层的深度取决于每个请求中共享前缀token的比例。因此,bicache动态识别用于重用共享前缀KV的安全层深度,并消除冗余计算。评估表明,与现有技术相比,bicache显著提高了服务吞吐量36.3%-98.3%,且没有精度崩溃(仅0-1.8%的差异)。

英文摘要

Key-value (KV) caching for shared prefixes is essential for high-throughput large language model (LLM) serving, but it faces critical challenges in emerging diffusion language models (DLMs). In DLMs, bidirectional attention means that updating any token dynamically alters the entire context and its corresponding KVs. Thus, existing caching techniques developed for LLMs, which assume that KVs remain invariant once computed, corrupt the shared prefix KVs. Our experiments show that applying these techniques to DLMs causes model accuracy to collapse to near zero. To unlock high-throughput DLM serving, we propose bidirectional prefix caching, bicache, the first KV caching technique for shared prefixes in DLMs. bicache is designed based on key observations from our comprehensive analysis: shared prefix KVs remain stable and reusable in shallow layers, while the depth of shallow layers depends on the fraction of shared prefix tokens in each request. Thus, bicache dynamically identifies a safe layer depth for reusing shared prefix KVs and eliminates redundant computation. Evaluations demonstrate that bicache significantly improves serving throughput by 36.3%-98.3% compared to existing techniques without accuracy collapse (only 0-1.8% difference).

2606.07574 2026-06-09 cs.DC cs.AI cs.LG stat.CO stat.ML 交叉投稿

Accelerating Birkhoff Projection for Manifold-Constrained Hyper-Connections

加速流形约束超连接的Birkhoff投影

Chenrui Wang, Yixuan Qiu

发表机构 * School of Statistics(统计学院) Renmin University of China(中国人民大学) School of Statistics and Data Science(统计学与数据科学学院) Institute of Big Data Research(大数据研究院) Shanghai University of Finance and Economics(上海财经大学)

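AI summary Targets the computational bottleneck of Birkhoff projection in manifold-constrained hyper-connections with an end-to-end acceleration framework based on a dual formulation and Newton's method, combined with implicit differentiation and a CUDA kernel, reaching more than 20x speedup.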

详情
AI中文摘要

流形约束超连接(mHCs)最近被提出作为超连接的一种原则性扩展,其中残差混合矩阵通过投影到Birkhoff多面体上被约束为双随机矩阵。在实际的mHC实现中,该约束通过Sinkhorn-Knopp迭代强制执行,反向传播依赖于展开迭代求解器。这种设计引入了大量的计算和内存开销,并且当算法在具有挑战性的输入上收敛缓慢时,可能产生不准确的投影,从而破坏mHCs预期的范数控制和稳定性保证。在这项工作中,我们聚焦于实际重要的4x4 Birkhoff投影设置,并开发了一个端到端的加速框架。通过利用对偶公式,我们将问题简化为一个三维无约束凸问题,并使用牛顿法求解,实现了快速收敛和高精度。对于反向传播,我们用隐式微分替代展开微分,无需存储中间状态即可获得精确梯度。为了利用大规模并行性,我们设计了一个warp级别的CUDA内核,仅使用寄存器级原语,避免了全局和共享内存I/O。与代表性开源基线的大量实验表明,所提出的求解器产生了更可靠的双随机投影——特别是在输入幅度较大时——并实现了显著的端到端加速(包括反向传播),在大批量下达到超过20倍的加速,同时保持数量级更小的边际误差。

英文摘要

Manifold-constrained hyper-connections (mHCs) have recently been proposed as a principled extension of hyper-connections, where the residual mixing matrices are constrained to be doubly stochastic via projection onto the Birkhoff polytope. In practical mHC implementations, this constraint is enforced by Sinkhorn-Knopp iterations, and the backward pass relies on unrolling the iterative solver. This design introduces substantial computation and memory overhead, and may also yield inaccurate projections when the algorithm converges slowly on challenging inputs, undermining the intended norm-control and stability guarantees of mHCs. In this work, we focus on the practically important 4x4 Birkhoff projection setting and develop an end-to-end acceleration framework. By leveraging the dual formulation, we reduce the problem to a three-dimensional unconstrained convex problem and solve it with Newton's method, achieving fast convergence and high accuracy. For the backward pass, we replace the unrolled differentiation with implicit differentiation, yielding exact gradients without storing intermediate states. To exploit massive parallelism, we design a warp-level CUDA kernel that uses only register-level primitives, avoiding global and shared memory I/O. Extensive experiments against representative open-source baselines demonstrate that the proposed solver yields substantially more reliable doubly stochastic projections -- especially when the input magnitude is large -- and achieves significant end-to-end speedups (including the backward pass), reaching over 20x acceleration at large batch sizes while maintaining orders of magnitude smaller marginal errors.
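For reference, the Sinkhorn-Knopp baseline that the abstract sets out to replace can be sketched in a few lines. This is the classical alternating row and column normalisation, not the paper's dual Newton solver; exponentiating the input first, the 4x4 example values, and the iteration count are assumptions made for the illustration.

```python
from math import exp
from typing import List, Sequence

Matrix = List[List[float]]


def sinkhorn_knopp(logits: Sequence[Sequence[float]], iters: int = 50) -> Matrix:
    """Scale exp(logits) toward a doubly stochastic matrix.

    Alternates row normalisation and column normalisation `iters` times.
    Assumes a square input.
    """
    shift = max(max(row) for row in logits)  # subtract max for numerical safety
    p = [[exp(x - shift) for x in row] for row in logits]
    n = len(p)
    for _ in range(iters):
        p = [[x / sum(row) for x in row] for row in p]
        col_sums = [sum(p[i][j] for i in range(n)) for j in range(n)]
        p = [[p[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return p


def marginal_error(p: Matrix) -> float:
    """Largest deviation of any row or column sum from 1."""
    n = len(p)
    row_err = max(abs(sum(row) - 1.0) for row in p)
    col_err = max(abs(sum(p[i][j] for i in range(n)) - 1.0) for j in range(n))
    return max(row_err, col_err)


# Hypothetical 4x4 residual-mixing logits.
mixing_logits = [
    [0.5, -1.0, 0.3, 0.0],
    [1.2, 0.1, -0.4, 0.7],
    [-0.3, 0.8, 0.2, -1.1],
    [0.0, 0.4, -0.9, 0.6],
]

projected = sinkhorn_knopp(mixing_logits)
print(marginal_error(projected))
```

`marginal_error` is the kind of diagnostic the abstract refers to: with a fixed iteration budget the row and column sums are only approximately 1, and the abstract reports that this approximation degrades on large-magnitude inputs. Differentiating through the loop also means storing every intermediate `p`, which is the memory cost that implicit differentiation avoids.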

2606.07598 2026-06-09 cs.LG cs.AI 交叉投稿

A Topological Characterization of Graph Neural Networks via Stochastic Block Model Embeddings on the n-Sphere

图神经网络的拓扑特征化:通过n-球面上的随机块模型嵌入

Gopal Anantharaman

发表机构 * KnotTheory.ai Inc.(KnotTheory.ai 公司) Dept. of Mathematics, Emporia State University(恩波利亚州立大学数学系)

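AI summary Proposes a topological framework that maps the stochastic block models induced by a message passing neural network onto the unit n-sphere, for comparing trained graph neural networks and retrieving transfer-learning candidates without retraining.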

详情
AI中文摘要

我们提出一个拓扑框架,用于比较训练后的图神经网络(GNN),通过将消息传递神经网络(MPNN)在图子信号(graphon-signal)空间上诱导的随机块模型(SBM)映射到单位$n$-球面$\mathbb{S}^{n-1}\subset\mathbb{R}^n$上。该构建基于三个经典支柱:割距离图子空间$(\widetilde{\mathcal{W}}_0,\delta_\square)$的紧性(Lovász and Szegedy, 2006; Lovász, 2012),Frieze-Kannan弱正则引理及其由Levie (2023)推广的图子信号扩展,以及MPNN关于割距离的Lipschitz连续性。我们证明,对于任意给定的容差$\varepsilon>0$,一个训练后的MPNN $\Phi$作用于足够大的图时,可以通过一个复杂度有界的阶梯图子信号(误差不超过$\varepsilon$)来分解,并且我们构造了一个显式的保测映射$\Psi_n\colon[0,1]\to\mathbb{S}^{n-1}$,将SBM区域放置在不相交的球冠上。这产生了一个与问题无关的低维训练GNN“指纹”,便于视觉检查和跨模型库的最近邻搜索,从而实现无需重新训练的迁移学习候选检索。我们讨论了高维中测度集中现象带来的障碍,这一现象与大规模语言模型规模的嵌入直接相关。最后,我们提出五个具体的未来研究方向:双曲和格拉斯曼流形替代球面模型,基于图子信号的Gromov-Wasserstein距离作为$n$-球面映射的无等距替代,SBM流形的信息几何(Fisher)重新表述,逐层嵌入云的持续同调指纹,以及基于图子特征分解的谱距离基线。

英文摘要

We propose a topological framework for comparing trained Graph Neural Networks (GNNs) by mapping the Stochastic Block Models (SBMs) induced on the graphon-signal space of a Message Passing Neural Network (MPNN) onto the unit $n$-sphere $\mathbb{S}^{n-1}\subset\mathbb{R}^n$. The construction rests on three classical pillars: the compactness of the cut-distance graphon space $(\widetilde{\mathcal{W}}_0,\delta_\square)$ (Lovász and Szegedy, 2006; Lovász, 2012), the Frieze-Kannan weak regularity lemma together with its graphon-signal extension due to Levie (2023), and the Lipschitz continuity of MPNNs with respect to the cut-distance. We show that, for any prescribed tolerance $\varepsilon>0$, a trained MPNN $\Phi$ acting on a sufficiently large graph factors (up to $\varepsilon$) through a step-graphon-signal of bounded complexity, and we construct an explicit measure-preserving map $\Psi_n\colon[0,1]\to\mathbb{S}^{n-1}$ that places the SBM regions on disjoint spherical caps. This produces a problem-agnostic, low-dimensional "fingerprint" of a trained GNN that is amenable to visual inspection and to nearest-neighbour search across model zoos, enabling transfer-learning candidate retrieval without retraining. We discuss the obstruction posed by concentration of measure in high dimension, a phenomenon directly relevant to LLM-scale embeddings. We close with five concrete future research directions: hyperbolic and Grassmannian alternatives to the spherical model, Gromov-Wasserstein distances on graphon-signals as an isometry-free alternative to the $n$-sphere map, an information-geometric (Fisher) reformulation of the SBM manifold, persistent-homology fingerprints of layer-wise embedding clouds, and a spectral-distance baseline derived from the graphon eigendecomposition.

2606.07599 2026-06-09 cs.LG cs.AI cs.CV 交叉投稿

DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression

DiffoR:一种统一的连续生成框架用于通用序数回归

Hongxu Ma, Lin Wang, Chenghou Jin, Han Zhou, Jie Zhang, Xiaoyu Yang, Chunjie Chen, Jihong Guan, Shuigeng Zhou

发表机构 * Fudan University(复旦大学) Kuaishou Technology(快手科技) Shanghai University of Finance and Economics(上海财经大学) Tongji University(同济大学)

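AI summary Proposes the DiffOR framework, which casts ordinal regression as a continuous generative task, uses diffusion models to recover continuous ordinal values through iterative denoising, and preserves ordinal topology with a Dual-Decoupling Strategy (Multi-scale Increment Aggregation and Dynamic Denoising Perception); it outperforms existing methods on 12 benchmarks.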

Comments Accepted at KDD 2026

详情
AI中文摘要

序数回归(OR)旨在预测具有内在顺序的目标值,支撑着从推荐系统到计算机视觉等多个领域的关键应用。尽管从朴素回归发展到基于离散化的分类和生成,现有范式仍然受到量化伪影和缺乏全局序数拓扑感知的根本限制。这些方法通常强制执行刚性边界划分,无法捕捉序数数据固有的非平稳语义转换。在本文中,我们提出了一种新范式,将OR形式化为连续生成序数回归任务。在该新范式下,我们引入了DiffOR,一个统一的框架,利用扩散模型通过迭代去噪恢复连续序数值,从而能够动态学习软语义转换。为了显式保留序数拓扑,我们设计了一种双解耦策略:在空间上,多尺度增量聚合将目标分解为层次化的连续增量;在时间上,动态去噪感知将去噪步骤与特征频率同步,确保稳健的从粗到细的细化。理论上,我们证明了所提方法可以显著增强表示能力和机制可解释性。在四个领域的12个基准上的大量实验验证了DiffOR相对于最先进方法的一致优越性,建立了一个新标准,展示了作为通用序数回归通用解决方案的强大潜力。

英文摘要

Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from recommender systems to computer vision. Though having evolved from naive regression to discretization-based classification and generation, existing paradigms remain fundamentally constrained by quantization artifacts and the lack of global ordinal topological perception. These methods typically enforce rigid boundary delineations, failing to capture the non-stationary semantic transitions inherent to ordinal data. In this paper, we propose a novel paradigm where OR is formulated as a Continuous Generative Ordinal Regression task. Under the novel paradigm, we introduce DiffOR, a unified framework that leverages diffusion models to recover continuous ordinal values via iterative denoising, thereby enabling the dynamic learning of soft semantic transitions. To explicitly preserve ordinal topology, we devise a Dual-Decoupling Strategy: Spatially, Multi-scale Increment Aggregation decomposes targets into hierarchical continuous increments; Temporally, Dynamic Denoising Perception synchronizes denoising steps with feature frequencies, ensuring robust coarse-to-fine refinement. Theoretically, we show that the proposed method can significantly enhance both representation capability and mechanistic interpretability. Extensive experiments on 12 benchmarks across four domains validate DiffOR's consistent superiority over state-of-the-art methods, establishing a new standard that demonstrates strong potential as a general-purpose solution for universal ordinal regression.

2606.07600 2026-06-09 cs.LG cs.AI 交叉投稿

Reachability and asymptotics of Gaussian Transformer dynamics

高斯Transformer动力学的可达性与渐近性

Albert Alcalde, Zhengping Ji, Enrique Zuazua

发表机构 * Friedrich–Alexander University Erlangen–Nürnberg(弗里德里希-亚历山大大学埃尔朗根-纽伦堡) Research Council of Norway(挪威研究理事会)

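AI summary Models data propagation through Transformers as a nonlinear control system on the space of probability measures, proves that Gaussian distributions stay Gaussian under self-attention and affine feed-forward layers, which reduces the dynamics to a bilinear control system, and reveals a connection to Riccati equations.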

详情
AI中文摘要

我们将通过Transformer(驱动大型语言模型的机器学习架构)的数据传播建模为概率测度空间上的非线性控制系统。对于具有自注意力和仿射前馈层的平均场Transformer模型,我们证明高斯分布在诱导流下保持严格高斯性。这种不变性将无限维测度动力学简化为控制均值和协方差演化的有限维双线性控制系统,将Transformer的表达能力重新表述为关于指定高斯矩的可达性问题,并揭示了与经典滤波和控制中Riccati型方程的新联系。对于时变控制,我们证明任何目标高斯分布(其协方差矩阵与初始协方差矩阵具有相同秩)的精确有限时间可达性,该秩约束是动力学的一个内在不变量。对于时不变参数,我们推导出显式的谱条件,这些条件要么导致正定平衡点的渐近稳定性,要么导致协方差的有限时间爆破。数值实验补充了理论,表明具有高斯输入的实际Transformer在早期和中间层保持与矩匹配的高斯分布接近,而具有指定注意力矩阵的Transformer再现了预测的协方差状态:在稳定配置中有界演化,在失稳配置中爆破。

英文摘要

We formulate data propagation through the Transformer, the machine learning architecture powering large language models, as a nonlinear control system on the space of probability measures. For the mean-field Transformer model with self-attention and affine feed-forward layers, we prove that Gaussian distributions remain exactly Gaussian along the induced flow. This invariance reduces the infinite-dimensional measure dynamics to a finite-dimensional bilinear control system governing the evolution of the mean and covariance, reformulates the expressive capacity of Transformers as a reachability problem for prescribed Gaussian moments, and reveals a novel connection with Riccati-type equations from classical filtering and control. For time-varying controls, we prove exact finite-time reachability of any target Gaussian distribution whose covariance matrix has the same rank as the initial one, this rank constraint being an intrinsic invariant of the dynamics. For time-invariant parameters, we derive explicit spectral conditions leading either to asymptotic stability toward positive-definite equilibria or to finite-time blow-up of the covariance. Numerical experiments complement the theory by showing that practical Transformers with Gaussian inputs remain close to moment-matched Gaussian distributions through early and intermediate layers, while Transformers with prescribed attention matrices reproduce the predicted covariance regimes: bounded evolution in stabilizing configurations and blow-up in destabilizing ones.
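The affine half of the Gaussian-invariance claim is a standard fact and can be written out directly. The symbols $A$, $b$, $m$, and $\Sigma$ are notation assumed here for a single affine feed-forward map, a mean, and a covariance; the self-attention half of the result is the paper's own contribution and is not reproduced.

```latex
x \sim \mathcal{N}(m, \Sigma), \qquad y = A x + b
\;\Longrightarrow\;
y \sim \mathcal{N}\!\left(A m + b,\; A \Sigma A^{\top}\right),
\qquad
\operatorname{rank}\!\left(A \Sigma A^{\top}\right) \le \operatorname{rank}(\Sigma),
\ \text{with equality when } A \text{ is invertible.}
```

The rank inequality gives some intuition for why a rank condition on the covariance appears in the reachability statement: an affine map can never raise the rank of the covariance, and an invertible one preserves it.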

2606.07601 2026-06-09 cs.LG cs.AI 交叉投稿

LFNO: Bridging Laplace and Fourier via Transient-Steady Decomposition

LFNO:通过瞬态-稳态分解桥接拉普拉斯与傅里叶

Jeongun Ha, Sanga Yoon, Donghun Lee


AI总结 提出拉普拉斯-傅里叶神经算子(LFNO),通过双分支架构显式分解系统动力学为瞬态和稳态分量,在九个基准上超越现有算子,提升稳定性和可解释性。

Comments 21 pages, 11 figures

详情
AI中文摘要

我们引入了拉普拉斯-傅里叶神经算子(LFNO),这是一个统一框架,通过整合拉普拉斯和傅里叶神经算子的谱优势,对跨瞬态和稳态区域的动力系统进行建模。LFNO采用双分支架构,将系统动力学显式分解为瞬态和稳态分量。我们在九个基准上评估了LFNO,包括三个ODE系统(Duffing、Lorenz和Pendulum)和六个PDE系统(Euler-Bernoulli梁、热方程、反应扩散、Brusselator、Burgers和Navier-Stokes)。在瞬态动力学占主导的ODE系统上,LFNO显著优于现有算子,并且在PDE基准上持续超越LNO,同时达到与FNO竞争的性能。此外,LFNO通过其分量分解提供了改进的稳定性和物理可解释性。这些结果表明,LFNO为跨多个时间尺度学习复杂动力系统提供了一种鲁棒且统一的方法。

英文摘要

We introduce the Laplace-Fourier Neural Operator (LFNO), a unified framework for modeling dynamical systems across transient and steady-state regimes by integrating the spectral advantages of Laplace and Fourier Neural Operators. LFNO employs a dual-branch architecture that explicitly decomposes system dynamics into transient and steady-state components. We evaluate LFNO on nine benchmarks, including three ODE systems (Duffing, Lorenz, and Pendulum) and six PDE systems (Euler-Bernoulli beam, Heat, Reaction-diffusion, Brusselator, Burgers, and Navier-Stokes). LFNO significantly outperforms existing operators on ODE systems, where transient dynamics dominate, and consistently surpasses LNO while achieving performance competitive with FNO on PDE benchmarks. Furthermore, LFNO offers improved stability and physical interpretability through its component-wise decomposition. These results demonstrate that LFNO provides a robust and unified approach for learning complex dynamical systems across multiple temporal scales.

2606.07604 2026-06-09 cs.LG cs.AI 交叉投稿

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

贡献权重:自注意力Transformer的几何分析

Harry Jake Cunningham, Nicola Muca Cirone

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出基于投影的贡献权重度量,结合注意力权重、值向量大小和方向对齐,更准确识别关键令牌,并揭示注意力汇的主动抑制功能。

详情
AI中文摘要

分析注意力权重已成为解释大型语言模型(LLM)信息流的标准方法。然而,这种方法有显著局限性,因为它忽略了被聚合的值向量的几何特性。为了解决这个问题,我们引入了贡献权重(Contribution Weights),这是一种基于投影的度量,通过考虑令牌的注意力权重、值向量大小以及与层输出的方向对齐来量化令牌的影响。我们证明,贡献权重提供了更忠实的令牌重要性度量,在不同的仅解码器模型、任务和数据集上识别语义关键令牌时,始终优于基于注意力的度量。此外,我们的度量能够对注意力汇(attention sinks)进行新的机制分析。虽然先前的工作将注意力汇描述为多余注意力的被动存储库,但我们揭示它们起到了主动的功能作用:通过汇率与输出范数之间的凸关系抑制信息,并通过抵抗低置信度令牌的语义漂移来稳定表示。

英文摘要

Analyzing attention weights has become a standard approach for interpreting the information flow of Large Language Models (LLMs). However, this approach has significant limitations as it neglects the geometric properties of the value vectors being aggregated. To address this gap, we introduce Contribution Weights, a projection-based metric that quantifies a token's influence by accounting for its attention weight, value magnitude, and directional alignment with the layer output. We demonstrate that contribution weights provide a more faithful measure of token importance, consistently outperforming attention-based metrics in identifying semantically critical tokens across different decoder-only models, tasks, and datasets. Further, our metric enables novel mechanistic analysis of attention sinks. While previous work characterized sinks as passive repositories for excess attention, we reveal they serve an active functional role, suppressing information through a convex relationship between sink rate and output norm, stabilizing representations by opposing the semantic drift of low-confidence tokens.
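To make the idea of a projection-based metric concrete, here is a minimal sketch. The abstract does not give the exact formula, so the definition below is an assumption: each token's attention-weighted value vector is projected onto the attention output and divided by the squared output norm, which makes the weights sum to 1. The function name `contribution_weights` and the toy numbers are hypothetical.

```python
def contribution_weights(attn, values):
    """Hypothetical projection-based contribution weights for one query position.

    attn   : attention weights a_i (non-negative, summing to 1)
    values : value vectors v_i (lists of floats of equal length)

    Assumed definition (not taken from the paper):
        c_i = a_i * <v_i, o> / ||o||^2,  with  o = sum_i a_i * v_i
    """
    dim = len(values[0])
    # Attention output o = sum_i a_i * v_i
    out = [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]
    norm_sq = sum(x * x for x in out)
    if norm_sq == 0.0:
        raise ValueError("attention output has zero norm")
    # Signed projection of each weighted value vector onto the output direction
    return [
        a * sum(v[d] * out[d] for d in range(dim)) / norm_sq
        for a, v in zip(attn, values)
    ]


# Toy example: token 0 receives the most attention but has a tiny value vector.
attn = [0.7, 0.2, 0.1]
values = [[0.1, 0.0], [2.0, 1.0], [-1.0, 0.5]]
weights = contribution_weights(attn, values)
```

In this toy case the token with the largest attention weight is not the one with the largest contribution, and one token gets a negative weight because its value vector points against the output. That is the kind of discrepancy such a metric is meant to expose.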

2606.07615 2026-06-09 cs.LG cs.AI 交叉投稿

Structured Neuron Pruning in Deep Neural Networks Using Multi-Armed Bandits

深度神经网络中使用多臂赌博机的结构化神经元剪枝

Salem Ameen, Sunil Vadera

发表机构 * School of Science, Engineering and Environment, University of Salford(科学、工程与环境学院,萨尔福德大学)

AI总结 提出基于多臂赌博机算法的结构化剪枝框架,通过将每个神经元视为臂并评估移除奖励,在表格分类、回归及深度网络任务上验证了UCB1和汤普森采样等策略的有效性。

Comments 27 pages, 5 figures

详情
AI中文摘要

深度神经网络通常包含冗余的隐藏单元。移除单个权重可以减少参数数量,但非结构化稀疏性在标准密集实现中并不总是容易利用。本文开发了一个结构化剪枝框架,其中使用多臂赌博机(MAB)算法移除完整的神经元。每个候选神经元被视为一个臂;拉动一个臂会暂时屏蔽该神经元,测量采样小批量上损失的变化,恢复神经元,并更新其安全移除奖励的估计。该框架支持随机策略,包括Epsilon-Greedy、Softmax、UCB1和汤普森采样,以及乘性权重策略,包括Hedge风格的乘性权重和EXP3。我们在涵盖图像、文本和推理任务的表格分类、表格回归和深度神经网络基准上评估了该方法。使用弗里德曼检验和随后Nemenyi事后检验的统计比较显示方法之间存在显著差异。在表格分类任务上,UCB1在剪枝策略中获得最高平均排名,并优于未剪枝的神经网络。在回归任务上,UCB1获得最高平均排名,并且根据R^2,与几种标准回归模型在统计上具有竞争力或更优。在深度学习任务上,UCB1和汤普森采样获得最强排名,并且几种MAB策略显著优于未剪枝模型、基于幅度的神经元剪枝和贪婪激活变化剪枝。结果表明,基于MAB的神经元剪枝是一种有效且计算实用的结构化模型缩减方法。

英文摘要

Deep neural networks often contain redundant hidden units. Removing individual weights can reduce parameter count, but unstructured sparsity is not always easy to exploit in standard dense implementations. This paper develops a structured pruning framework in which complete neurons are removed using multi-armed bandit (MAB) algorithms. Each candidate neuron is treated as an arm; pulling an arm temporarily masks that neuron, measures the change in loss on a sampled mini-batch, restores the neuron, and updates an estimate of its safe-removal reward. The framework supports stochastic policies, including Epsilon-Greedy, Softmax, UCB1 and Thompson Sampling, and multiplicative-weight policies, including Hedge-style multiplicative weights and EXP3. We evaluate the method on tabular classification, tabular regression and deep neural-network benchmarks covering image, text and reasoning tasks. Statistical comparisons using the Friedman test followed by the Nemenyi post-hoc test show significant differences between methods. On tabular classification tasks, UCB1 obtains the highest mean rank among pruning policies and improves on the unpruned neural network. On regression tasks, UCB1 obtains the highest mean rank and is statistically competitive with, or superior to, several standard regression models according to R^2. On deep-learning tasks, UCB1 and Thompson Sampling obtain the strongest ranks, and several MAB policies significantly outperform the unpruned model, magnitude-based neuron pruning and greedy activation-variation pruning. The results show that MAB-based neuron pruning is an effective and computationally practical approach for structured model reduction.
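The arm-pulling loop described above can be sketched with UCB1 on a toy problem. This is an illustration under stated assumptions, not the paper's implementation: the "network" is replaced by a hypothetical table `TRUE_COST` of per-neuron loss increases, the reward is assumed to be `1 - loss_increase` clipped to [0, 1], and the names `ucb1_prune` and `toy_loss_increase` are made up for this example.

```python
import math
import random


def ucb1_prune(loss_increase, n_neurons, n_pulls, n_prune, seed=0):
    """Rank neurons by estimated safe-removal reward using UCB1.

    loss_increase(neuron, rng) stands in for: mask the neuron, measure the
    change in loss on a sampled mini-batch, then restore the neuron.
    Requires n_pulls >= n_neurons so that every arm is pulled at least once.
    """
    rng = random.Random(seed)
    counts = [0] * n_neurons
    means = [0.0] * n_neurons
    for t in range(1, n_pulls + 1):
        if t <= n_neurons:
            arm = t - 1  # initial round: pull every arm once
        else:
            # UCB1 index: empirical mean plus exploration bonus
            arm = max(
                range(n_neurons),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        delta = loss_increase(arm, rng)
        reward = min(1.0, max(0.0, 1.0 - delta))  # assumed reward shaping
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    ranked = sorted(range(n_neurons), key=lambda i: means[i], reverse=True)
    return sorted(ranked[:n_prune]), counts


# Hypothetical ground truth: neurons 1 and 3 are nearly redundant.
TRUE_COST = [0.9, 0.05, 0.8, 0.02, 0.7, 0.6]


def toy_loss_increase(neuron, rng):
    # Noisy mini-batch estimate of the loss change when the neuron is masked
    return TRUE_COST[neuron] + rng.gauss(0.0, 0.05)


pruned, pulls = ucb1_prune(toy_loss_increase, len(TRUE_COST), n_pulls=600, n_prune=2)
```

UCB1 spends most of its budget on the neurons that look safest to remove, so their reward estimates are the most reliable by the time the pruning decision is made.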

2606.07617 2026-06-09 cs.LG cs.AI 交叉投稿

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

Query Lens: 通过间接效应解释稀疏键值特征

Hwiyeong Lee, Ingyu Bang, Uiji Hwang, Hyelim Lim, Taeuk Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 提出Query Lens方法,通过考虑编码器侧键特征和解码器侧值特征以及下游模块的间接效应,实现对稀疏自编码器特征更全面、忠实的解释。

Comments Accepted to ICML 2026

详情
AI中文摘要

虽然稀疏自编码器提供的特征比单个神经元更可解释,但可靠地描述这些特征仍然具有挑战性。我们提出了Query Lens,它扩展了Logit Lens,能够对稀疏特征进行更全面、忠实的解释。通过联合考虑编码器侧的键特征和解码器侧的值特征,我们识别出激活特征的输入以及它促进的输出。我们还考虑了当特征被下游模块处理时产生的间接、模块介导的效应,超越了Logit Lens捕获的直接效应。在实验中,我们发现Query Lens为那些在Logit Lens下仍不可解释的特征生成了连贯的token签名。最后,我们提出了子空间通道假说,表明下游模块通过层特定的子空间读取特征。

英文摘要

While sparse autoencoders provide features more interpretable than individual neurons, reliably characterizing them remains challenging. We propose Query Lens, which extends Logit Lens to enable more comprehensive and faithful interpretations of sparse features. By jointly considering encoder-side key features and decoder-side value features, we identify both the inputs that activate a feature and the outputs it promotes. We also account for indirect, module-mediated effects that arise when the feature is processed by downstream modules, going beyond the direct effect captured by Logit Lens. In experiments, we find that Query Lens yields coherent token signatures for features that remain uninterpretable under Logit Lens. Finally, we propose the Subspace Channel Hypothesis, suggesting that downstream modules read features through layer-specific subspaces.

2606.07618 2026-06-09 cs.LG cs.AI cs.CV 交叉投稿

ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

ScaleSweep: 通过块尺度初始化实现LLM的精确NVFP4训练后量化

Li Lin, Xiaojun Wan

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机技术研究所)

AI总结 提出ScaleSweep方法,通过扫描可行块尺度候选并选择最小化目标函数的候选,优化NVFP4量化中的尺度初始化,理论推导扫描范围边界,在Llama和Qwen模型上提升量化性能,缩小与全精度的差距。

Comments under review

详情
AI中文摘要

NVFP4是一种最近引入的硬件支持的FP4格式,通过细粒度块尺度提高了4位量化的保真度。然而,现有的NVFP4尺度初始化方法仍然主要依赖于AbsMax初始化,这与最优解之间存在明显差距。为了解决这个问题,我们提出了ScaleSweep,一种简单高效的尺度优化方法,它扫描可行的块尺度候选,并选择最小化目标函数的候选。我们进一步提供了NVFP4量化的理论分析,并推导了在原始张量与量化重建张量之间的均方误差(MSE)和加权均方误差(WMSE)下所需扫描范围的上下界。所提出的界限大幅减少了扫描空间,同时保留了最优候选,使得与基线量化算子相比开销可忽略。在Llama和Qwen模型上的实验表明,ScaleSweep持续优于现有的初始化方法,并进一步缩小了与全精度的差距。特别是在对权重、激活、KV缓存和查询状态进行激进的全端到端量化时,ScaleSweep保留了超过93%的全精度性能。

英文摘要

NVFP4 is a recently introduced hardware-supported FP4 format that improves the fidelity of 4-bit quantization through fine-grained block scales. However, existing NVFP4 scale initialization methods still primarily rely on AbsMax initialization, which leaves a noticeable gap to the optimal solution. To address this, we propose ScaleSweep, a simple and efficient scale optimization method that sweeps over feasible block scale candidates and selects the candidate that minimizes a target objective. We further provide a theoretical analysis of NVFP4 quantization and derive both lower and upper bounds for the required sweep range under mean square error (MSE) and weighted mean square error (WMSE) between the original tensor and the quantized reconstructed tensor. The proposed bounds substantially reduce the sweep space while preserving the optimal candidate, enabling negligible overhead compared with the baseline quantization operators. Experiments on Llama and Qwen models demonstrate that ScaleSweep consistently improves quantization performance over existing initialization methods and further narrows the gap to full precision. In particular, under aggressive end-to-end quantization of weights, activations, KV cache, and query states, ScaleSweep preserves more than 93% of the full-precision performance.
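A minimal sketch of the sweep idea, assuming plain MSE as the objective. It simplifies several things: the block scale is kept as an unquantized float (real NVFP4 stores block scales in a low-precision format), the sweep range of 0.5x to 1.5x the AbsMax scale is an arbitrary assumption rather than the bounds derived in the paper, and the helper names are hypothetical. The value grid is the set of magnitudes representable in FP4 E2M1.

```python
import math
import random

# Magnitudes representable by an FP4 E2M1 element
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block, scale):
    """Round each value to the nearest grid point after dividing by scale."""
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def sweep_scale(block, n_candidates=65, lo=0.5, hi=1.5):
    """Pick the candidate block scale with the lowest reconstruction MSE.

    The AbsMax scale is always included as a candidate, so the result is
    never worse than AbsMax initialization under this objective.
    """
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return 1.0  # any scale reconstructs an all-zero block exactly
    base = absmax / FP4_GRID[-1]  # AbsMax initialization
    candidates = [base] + [
        base * (lo + (hi - lo) * k / (n_candidates - 1))
        for k in range(n_candidates)
    ]
    return min(candidates, key=lambda s: mse(block, quantize_block(block, s)))


rng = random.Random(0)
block = [rng.gauss(0.0, 1.0) for _ in range(16)]  # one 16-element block
absmax_scale = max(abs(x) for x in block) / FP4_GRID[-1]
best_scale = sweep_scale(block)
absmax_err = mse(block, quantize_block(block, absmax_scale))
swept_err = mse(block, quantize_block(block, best_scale))
```

Scales below the AbsMax value clip the largest element but give finer resolution to the rest of the block, which is why a sweep can beat AbsMax on MSE.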

2606.07621 2026-06-09 cs.LG cs.AI cs.DC 交叉投稿

HASA: Subnet Allocation for Compute-Constrained Model-Heterogeneous Federated Learning

HASA:计算受限的模型异构联邦学习中的子网分配

Amir Hossein Shahdadian, Ahmed M. Abdelmoniem, Mahdi Taheri, Samira Nazari, Christian Herglotz

发表机构 * University of Naples "Federico II"(那不勒斯腓特烈二世大学) Queen Mary University of London(伦敦玛丽女王大学) Brandenburg University of Technology Cottbus-Senftenberg(勃兰登堡工业大学) Tallinn University of Technology(塔林理工大学) University of Zanjan(赞詹大学)

AI总结 提出HASA方法,根据客户端异构性分数分配子网宽度,在固定计算预算下提升平均和最差客户端准确率。

详情
AI中文摘要

边缘服务越来越多地使用联邦学习来个性化设备上的模型,同时将敏感数据保留在本地。在实践中,部署必须处理客户端资源和本地数据分布的异构性。模型异构联邦学习通过允许每个客户端训练共享超网的子网来降低客户端成本,但大多数子网分配策略由设备约束驱动,并未明确考虑统计异构性。本文提出异构感知子网分配(HASA),这是一种仅训练规则,根据从本地训练数据计算的客户端异构性分数分配子网宽度,同时强制执行固定的大小加权计算预算。该设计能够与替代分配策略进行预算匹配的比较。在包含七个客户端的文章标题下一个单词预测基准测试中,HASA在10个匹配种子上的未加权平均客户端测试准确率优于均匀分配,将平均客户端测试准确率从13.82%提高到14.32%,并平均提高了最差客户端准确率。在与代表性部分训练基线的匹配预算比较中,HASA在该基准测试上实现了最强的最差客户端和尾部客户端准确率。方向性消融实验表明,将较小的子网分配给更异构的客户端会降低平均和尾部性能。跨领域图像分类研究进一步表明,异构感知分配的有效性取决于异构性分数反映客户端对额外模型宽度需求的程度。

英文摘要

Edge services increasingly use federated learning to personalize on-device models while keeping sensitive data local. In practice, deployments must handle heterogeneity in both client resources and local data distributions. Model-heterogeneous federated learning lowers client cost by allowing each client to train a subnet of a shared supernet, but most subnet-allocation policies are driven by device constraints and do not explicitly account for statistical heterogeneity. This paper proposes Heterogeneity-Aware Subnet Allocation (HASA), a train-only rule that assigns subnet widths based on client heterogeneity scores computed from local training data while enforcing a fixed size-weighted compute budget. This design enables budget-matched comparisons with alternative allocation policies. On an article-title next-word prediction benchmark with seven clients, HASA improves unweighted mean client test accuracy over uniform allocation across 10 matched seeds, increasing mean client test accuracy from 13.82 percent to 14.32 percent, and improves worst-client accuracy on average. In a matched-budget comparison with representative partial-training baselines, HASA achieves the strongest worst-client and tail-client accuracy on this benchmark. A directionality ablation shows that assigning smaller subnets to more heterogeneous clients degrades both mean and tail performance. A cross-domain image-classification study further shows that the effectiveness of heterogeneity-aware allocation depends on how well the heterogeneity score reflects clients' need for additional model width.

2606.07630 2026-06-09 cs.LG cs.AI stat.ML 交叉投稿

Active Learning with Foundation Model Priors: Efficient Learning under Class Imbalance

基于基础模型先验的主动学习:类别不平衡下的高效学习

Jiancheng Zhang, Meiqing Li, Qi Zhang, Yinglun Zhu

发表机构 * University of California, Riverside(加州大学河滨分校) Carnegie Mellon University(卡内基梅隆大学) Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 针对现实数据中的类别不平衡和噪声标注问题,提出一种利用基础模型先验的主动学习框架,通过不平衡感知的协同决策选择信息量最大的样本,在图像和文本数据集上实现超过50%的标注节省。

Comments To appear at ICML 2026

详情
AI中文摘要

现实世界中图像和文本领域的数据集通常具有偏斜的类别分布和噪声标注,这共同降低了模型性能,尤其是对少数类。在现有解决方案中,主动学习通过选择性地查询信息最丰富且平衡的样本进行标注,提供了一种有效且高效的范式。我们提出了一种创新的主动学习框架,该框架减轻了类别不平衡,并选择信息量最大的样本进行标注。利用基础模型先验,我们的算法使得基础模型和小模型之间能够进行不平衡感知的协同决策,以处理跨领域的有噪声和不平衡标签。我们首次系统性地研究了在图像和文本领域中标签噪声和类别不平衡双重挑战下的主动学习。在不平衡数据集上的大量实验表明,我们的方法实现了显著的标注节省(与最佳主动学习基线相比超过50%),同时保持了性能以及对标签噪声的鲁棒性。

英文摘要

Real-world datasets across image and text domains are often characterized by skewed class distributions and noisy annotations, which jointly degrade model performance, particularly on minority classes. Among existing solutions, active learning offers an effective and efficient paradigm by selectively querying the most informative and balanced samples for annotation. We propose an innovative active learning framework that mitigates class imbalance and selects the most informative samples to annotate. Leveraging foundation model priors, our algorithm enables imbalance-aware co-decisions between a foundation model and a small model to tackle noisy and imbalanced labels across various domains. We introduce the first study to systematically explore active learning under the dual challenges of label noise and class imbalance across image and text domains. Extensive experiments on imbalanced datasets demonstrate that our method achieves substantial annotation savings (over 50% compared to the best active learning baseline) while preserving performance and robustness to label noise.

2606.07646 2026-06-09 cs.CV cs.AI 交叉投稿

DOME: Learning Transferable Domain Variables from Sparse Supervision for Test-Time Adaptation

DOME:从稀疏监督中学习可迁移域变量用于测试时自适应

Xiaoran Xu, Yifan Xu, Yupeng Wu, Xiaoshan Yang, Changsheng Xu

发表机构 * MAIS, IACAS(中国科学院自动化研究所多模态人工智能系统实验室)

AI总结 提出DOME域编码器,通过视觉-语言预训练提取密集连续表示,参数化域为分布变量并引入动量更新的稀疏域库,实现零样本显式域建模,在多个基准上超越复杂TTA方法。

详情
AI中文摘要

测试时自适应(TTA)旨在仅使用无标签流数据将模型对齐到变化的测试域。现有方法大多隐式推断单个全局域分布,忽略了真实世界域迁移的多维性和样本特异性,导致自适应脆弱。我们提出DOME,一种有效的域编码器,以零样本方式显式建模每个样本的域。DOME利用视觉-语言预训练提取密集、连续的表示,将域参数化为分布变量,并引入动量更新的稀疏域库用于解耦监督。通过将这些显式域线索注入下游模型,即使是最基本的熵最小化TTA策略也在ImageNet-C、ImageNet-R和ImageNet-Sketch上达到了最先进的性能,超越了复杂的TTA方法。我们的结果表明,鲁棒的自适应并非源于复杂的自适应算法,而是源于显式的、结构化的域表示。

英文摘要

Test-time adaptation (TTA) aims to align a model to shifting test domains using only unlabeled streaming data. Most existing methods implicitly infer a single global domain distribution, ignoring the multidimensional and sample-specific nature of real-world domain shifts, leading to fragile adaptation. We propose DOME, an effective domain encoder that explicitly models each sample's domain in a zero-shot manner. DOME leverages vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank for disentangled supervision. By injecting these explicit domain cues into downstream models, even a basic entropy-minimization TTA strategy achieves state-of-the-art performance across ImageNet-C, ImageNet-R, and ImageNet-Sketch, outperforming complex TTA approaches. Our results demonstrate that robust adaptation stems not from intricate adaptation algorithms, but from explicit, structured domain representation.

2606.07664 2026-06-09 cs.NE cs.AI 交叉投稿

Seq103: A Unified Neuroevolution Framework for Compact Sequence Architecture Discovery

Seq103: 用于紧凑序列架构发现的统一神经进化框架

Wenxiao Li, Yongjian Liu, Qing Xie

发表机构 * School of Computer Science and Artificial Intelligence, Wuhan University of Technology(武汉理工大学计算机科学与人工智能学院)

AI总结 提出统一神经进化框架Seq103,通过共享进化主干和可选循环扩展,在序列分类任务中实现紧凑架构搜索,在文本和时间序列数据集上以极低参数量保持高精度。

Comments 18 pages, 2 figures, 8 tables

详情
AI中文摘要

神经进化是一种代表性的神经架构搜索范式,通过进化算法同时演化网络拓扑和权重。本文提出Seq103,一个统一的NEAT风格神经进化框架,用于紧凑序列架构发现。Seq103包含一个共享的进化主干和一个可选的循环扩展。共享主干包括基本的节点-连接表示、基于每类RMSE的评估、带有类级重组的基于突变的进化以及精英策略。可选的隐藏状态机制通过隐藏状态节点和隐藏连接扩展搜索空间,在需要逐步循环推理时提供时间记忆。通过这种设计,Seq103将相同的核心搜索流程应用于逐步循环和样本级前馈序列分类。在循环任务中,启用隐藏状态扩展以提供时间记忆;在前馈任务中,禁用该扩展,而共享进化主干保持不变。我们在8个文本分类数据集和包含128个单变量时间序列数据集的完整UCRArchive2018基准上评估Seq103。在逐步任务中,Seq103平均保留最佳基线准确率的86.96%,同时参数数量减少34.6倍至3218.0倍。在完整UCRArchive2018基准的样本级任务中,Seq103平均保留最佳基线准确率的81.95%,同时参数数量减少11.8倍至160,601.0倍。

英文摘要

Neuroevolution is a representative neural architecture search paradigm that evolves both network topology and weights through evolutionary algorithms. In this paper, we propose Seq103, a unified NEAT-style neuroevolution framework for compact sequence architecture discovery. Seq103 consists of a shared evolutionary backbone and an optional recurrent extension. The shared backbone includes an elementary node-and-connection representation, per-class RMSE-based evaluation, mutation-based evolution with class-wise recombination, and elitism. The optional hidden-state mechanism extends the search space with hidden-state nodes and hidden connections, enabling temporal memory when step-wise recurrent inference is required. With this design, Seq103 applies the same core search pipeline to both step-wise recurrent and sample-wise feedforward sequence classification. In recurrent tasks, the hidden-state extension is enabled to provide temporal memory; in feedforward tasks, it is disabled while the shared evolutionary backbone remains unchanged. We evaluate Seq103 on 8 text classification datasets and the full UCRArchive2018 benchmark with 128 univariate time-series datasets. On step-wise tasks, Seq103 retains 86.96% of the best-baseline accuracy on average while using 34.6x to 3218.0x fewer parameters. On sample-wise tasks over the full UCRArchive2018 benchmark, Seq103 retains 81.95% of the best-baseline accuracy on average while using 11.8x to 160,601.0x fewer parameters.

2606.07670 2026-06-09 cs.CV cs.AI 交叉投稿

Liquid Neural Networks as a Drop-in Continuous-Time Deformation Field for Dynamic 3D Gaussian Splatting

液态神经网络作为动态3D高斯泼溅的即插即用连续时间变形场

Mingzhao Li, Arghya Pal, Guan Yuan Tan

发表机构 * Monash University(莫纳什大学)

AI总结 提出用液态神经网络(LNN)的闭式连续时间(CfC)单元替代MLP,构建显式连续时间变形场,在动态场景重建中匹配或超越MLP基线,尤其擅长高频关节运动。

详情
AI中文摘要

可变形3D高斯泼溅(D-3DGS)通过一个位置编码的MLP(以帧时间t为输入)变形一组规范3D高斯,从单目视频重建动态场景。尽管拟合连续变量,但MLP在架构中不耦合任意两个t值,实际上预测离散的逐帧偏移,使得时间平滑性仅作为优化的副产品出现。我们将变形场重新设计为一组闭式连续时间(CfC)单元,即液态神经网络(LNN),它是液态时间常数ODE的闭式解,同时保留D-3DGS管道的其他部分。每个单元暴露一个sigmoid时间门,在两个候选隐藏状态之间插值,将学习到的对t的平滑响应嵌入损失景观,无需调用任何数值求解器。在八个D-NeRF和七个NeRF-DS场景上,液态场在总体上匹配或超过MLP基线,其最大增益集中在具有最高频关节运动的场景上。结果是一种近乎零摩擦的架构设计,将离散的MLP变形场转变为t的显式连续时间函数。

英文摘要

Deformable 3D Gaussian Splatting (D-3DGS) reconstructs dynamic scenes from monocular video by deforming a canonical set of 3D Gaussians through a positionally encoded MLP of frame time t. Although fitted to a continuous variable, the MLP couples no two values of t in its architecture and effectively predicts discrete per-frame offsets, leaving temporal smoothness to emerge only as a byproduct of optimisation. We redesign the deformation field as a stack of Closed-form Continuous-time (CfC) cells, a Liquid Neural Network (LNN) that is the closed-form solution of the Liquid Time-constant ODE, while preserving every other part of the D-3DGS pipeline. Each cell exposes a sigmoidal time gate that interpolates between two candidate hidden states, baking a learned smooth response to t into the loss landscape without invoking any numerical solver. On the eight D-NeRF and seven NeRF-DS scenes the liquid field matches or exceeds the MLP baseline in aggregate, with its largest gains concentrated on the scenes with the most high-frequency articulated motion. The result is a near-zero-friction architectural design that turns the discrete MLP deformation field into an explicit continuous-time function of t.
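The sigmoidal time gate can be sketched in a few lines. The form below follows the commonly cited CfC update, `state(t) = sigmoid(-f*t) * g + (1 - sigmoid(-f*t)) * h`, where `f`, `g`, and `h` would normally be outputs of small learned networks. Treating them as fixed numbers, and assuming the paper uses exactly this parameterization, are simplifications made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cfc_state(t, f, g, h):
    """Closed-form continuous-time state for a single hidden unit.

    t : time (here, the frame time)
    f : time-constant term controlling how fast the gate moves
    g : candidate state that dominates when sigmoid(-f*t) is near 1
    h : candidate state that dominates when sigmoid(-f*t) is near 0
    In a real cell f, g and h come from learned networks; here they are
    plain floats (an assumption made to keep the sketch self-contained).
    """
    gate = sigmoid(-f * t)
    return gate * g + (1.0 - gate) * h


# The state is a smooth function of t, evaluated without any ODE solver.
trajectory = [cfc_state(t / 10.0, f=2.0, g=1.0, h=3.0) for t in range(0, 51)]
```

Because the gate is an explicit function of t, nearby frame times give nearby states by construction, which is the smoothness property the abstract contrasts with per-frame MLP offsets.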

2606.07684 2026-06-09 cs.LG cs.AI 交叉投稿

Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching

语义缓存蒸馏:通过重用和选择性修补实现高效状态传输

Qianli Ma, Zhiqing Tang, Hanshuai Cui, Zhi Yao, Weijia Jia

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对大语言模型推理中KV缓存传输的通信瓶颈和跨模型重用时的语义错位问题,提出语义缓存蒸馏(SCD)框架,通过低秩子空间重建和稀疏过渡层归一化输入预测,实现高达2.65倍的首令牌时间加速,且生成质量接近理想情况。

Comments Accepted to ICML 2026

详情
AI中文摘要

分离式服务缓解了大语言模型(LLM)推理中的内存瓶颈,但造成了严重的通信瓶颈:传输高维键值(KV)缓存通常主导首令牌时间(TTFT)。此外,跨异构模型(例如,基础模型和微调变体)重用缓存会导致语义错位,且这种错位会随着层数累积,降低生成质量。我们提出语义缓存蒸馏(SCD),一种受损失约束的框架,用紧凑的语义代码替代原始KV传输。SCD通过两种机制解决这些挑战:(1)重用,从低秩子空间重建大部分层以最小化传输成本,以及(2)修补,在稀疏过渡层预测归一化输入以截断误差传播。实验表明,在带宽受限的情况下,SCD相比理想消费预填充实现了高达2.65倍的TTFT加速,并在质量-延迟帕累托前沿上优于量化和选择性重计算基线,同时将生成质量保持在理想情况F1的5%以内。

英文摘要

Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck: transmitting high-dimensional Key-Value (KV) caches often dominates time-to-first-token (TTFT). Moreover, reusing caches across heterogeneous models (e.g., base and fine-tuned variants) causes semantic misalignment that accumulates over layers, degrading generation quality. We propose Semantic Cache Distillation (SCD), a loss-constrained framework that replaces raw KV transmission with compact semantic codes. SCD addresses these challenges via two mechanisms: (1) Reuse, which reconstructs most layers from low-rank subspaces to minimize transfer cost, and (2) Patch, which predicts normalized inputs at sparse transition layers to truncate error propagation. Empirically, SCD delivers up to 2.65× TTFT speedup over the oracle consumer prefill and dominates quantization and selective recomputation baselines on the quality-latency Pareto frontier in bandwidth-constrained regimes, while keeping generation quality within 5% F1 of the oracle.

2606.07686 2026-06-09 cs.LG cs.AI 交叉投稿

Knowledge-Inclusive Adaptive Physics-Informed Neural Network for Microbial Interaction Modelling

知识包容的自适应物理信息神经网络用于微生物相互作用建模

Ravisha Rupasinghe, Rajith Vidanaarachchi, Asela Hevapathige, Sachith Seneviratne, Sen-Lin Tang, Saman Halgamuge

发表机构 * University of Melbourne(墨尔本大学) Academia Sinica(中央研究院)

AI总结 提出一种知识包容的自适应PINN框架,通过整合文本和网络结构知识改进微生物群落建模,在真实和模拟数据集上性能提升最高53%。

Comments 33 pages

详情
AI中文摘要

物理信息神经网络(PINN)是一种在机器学习方法中以方程形式包含知识的方式。除了方程,知识还以其他形式存在,如文本和网络结构。虽然现有的基于PINN的方法从数据中发现方程参数,但它们仅依赖实验测量。我们提出一个新的PINN框架,通过整合辅助知识源来丰富参数发现。我们将该框架应用于微生物学,其中广义Lotka-Volterra(gLV)作为建模微生物群落的生物学基础。我们证明,整合知识可以改进微生物群落建模。我们的框架利用同行评审的宏基因组学文献丰富gLV参数,因为文本提供了gLV单独无法捕捉的外部影响的生物学背景。我们使用数据驱动的整合方法将这些知识与微生物丰度的实验测量相结合。我们通过显式建模微生物相互作用来整合基于网络的结构知识。我们的知识包容框架推断微生物网络,揭示生态学见解。我们根据文献中记录的生态角色验证这些发现。我们在涵盖人类和植物相关微生物群落的真实和模拟数据集上进行评估。我们的框架在无知识情况下比现有技术提升最高53%。知识添加在基于Bray-Curtis差异的准确率上带来最高23%的提升,在R²上带来47%的提升。

英文摘要

Physics-Informed Neural Networks (PINNs) are a way of including knowledge in the form of equations in Machine Learning methods. Beyond equations, knowledge exists in other forms, such as text and network structure. While existing PINN-based approaches discover equation parameters from data, they rely solely on experimental measurements. We propose a new PINN framework that enriches parameter discovery by incorporating auxiliary knowledge sources. We instantiate our framework for microbiology, where generalised Lotka-Volterra (gLV) serves as a biological foundation for modelling microbial communities. We demonstrate that incorporating knowledge improves microbial community modelling. Our framework enriches the gLV parameters using peer-reviewed metagenomics literature, as text provides biological context on external influences that gLV alone cannot capture. We combine this knowledge with experimental measurements of microbial abundance using a data-driven integration approach. We integrate network-based structural knowledge by explicitly modelling microbial interactions. Our knowledge-inclusive framework infers microbial networks, revealing ecological insights. We validate these findings against ecological roles documented in the literature. We evaluate on real and simulated datasets spanning human- and plant-associated microbial communities. Our framework improves over the state-of-the-art by up to 53%, even without knowledge. Knowledge addition yields gains of up to 23% in Bray-Curtis Dissimilarity-based accuracy and 47% in R^2.
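For readers unfamiliar with gLV, the standard form is dx_i/dt = x_i (r_i + sum_j A_ij x_j), with growth rates r and interaction matrix A as the parameters a PINN would try to recover. The sketch below only simulates these dynamics with a forward-Euler step. The two-species parameters are invented for illustration, and the function name `simulate_glv` is hypothetical.

```python
def simulate_glv(x0, r, A, dt=0.01, n_steps=10000):
    """Forward-Euler integration of generalised Lotka-Volterra dynamics.

    x0 : initial abundances
    r  : intrinsic growth rates r_i
    A  : interaction matrix, A[i][j] = effect of species j on species i
    Returns the abundances after n_steps steps of size dt.
    """
    x = list(x0)
    n = len(x)
    for _ in range(n_steps):
        # dx_i/dt = x_i * (r_i + sum_j A_ij * x_j), evaluated at the current state
        growth = [
            x[i] * (r[i] + sum(A[i][j] * x[j] for j in range(n)))
            for i in range(n)
        ]
        x = [x[i] + dt * growth[i] for i in range(n)]
    return x


# Illustrative two-species competition (parameters are made up).
r = [1.0, 0.8]
A = [[-1.0, -0.5],
     [-0.4, -1.0]]
final = simulate_glv([0.1, 0.1], r, A)
```

With these made-up parameters the system settles at the coexistence equilibrium x = (0.75, 0.5), which solves r + A x = 0. Off-diagonal entries of A are the pairwise interactions that a microbial network inferred from such a model would report.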

2606.07690 2026-06-09 cs.LG cs.AI 交叉投稿

HARP: Efficient Data Selection for Finetuning Large Language Models

HARP:高效数据选择用于微调大型语言模型

Ning Wang, Zhengxin Zhang, Maosen Tang, Yitang Gao, Claire Cardie, Sainyam Galhotra

发表机构 * Cornell University(康奈尔大学) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出层次主动区域剪枝(HARP),一种高效的基于训练的数据选择方法,通过层次结构和经验贝叶斯推断降低选择成本,同时保持下游对齐,在多个基准上优于最强基线最多8.9分,且训练样本减少约7倍。

详情
AI中文摘要

微调数据选择需要平衡两个相互竞争的目标:选择改善下游目标的示例,以及在不重复微调模型的情况下做到这一点。无训练选择器具有可扩展性,但依赖于嵌入相似性或聚类等代理,这些可能无法匹配目标目标。基于训练的选择器通过梯度信号、子集评估或Shapley归因更好地反映下游效用,但需要大量昂贵的训练-评估迭代。我们提出层次主动区域剪枝(HARP),一种高效的基于训练的选择器,在降低选择成本的同时保持下游对齐。HARP将训练池组织成节点-叶子层次结构,仅评估代表性叶子,并使用经验贝叶斯后验推断未测量的效用。然后,它使用两个互补的包络选择数据:HARP-C,保守地控制冗余,以及HARP-E,加性地奖励互补区域。我们理论上证明,在局部平滑和有界估计误差下,HARP控制选择误差同时降低训练-评估成本。我们进一步验证,HARP变体实现了最佳结果,并在使用大约7倍更少训练示例的情况下,比最强基线高出最多8.9分。

英文摘要

Finetuning data selection requires balancing two competing goals: selecting examples that improve the downstream objective, and doing so without repeatedly finetuning models. Train-free selectors are scalable but rely on proxies such as embedding similarity or clustering, which may not match the target objective. Train-based selectors better reflect downstream utility through gradient signals, subset evaluation, or Shapley attribution, but require many costly train-evaluate iterations. We propose Hierarchical Active Region Pruning (HARP), an efficient train-based selector that preserves downstream alignment while reducing selection cost. HARP organizes the training pool into a node-leaf hierarchy, evaluates only representative leaves, and infers unmeasured utilities with empirical Bayes posteriors. It then selects data using two complementary envelopes: HARP-C, which conservatively controls redundancy, and HARP-E, which additively rewards complementary regions. We theoretically show that, under local smoothness and bounded estimation error, HARP controls selection error while reducing train-evaluate cost. We further validate that HARP variants achieve the best result and outperform the strongest baseline by up to +8.9 points, while using roughly 7× fewer training examples.

2606.07698 2026-06-09 cs.LG cs.AI 交叉投稿

Pharmacogenomic Knowledge Graph Augmentation for Graph Neural Network-Based Drug-Drug Interaction Prediction

基于图神经网络的药物相互作用预测的药理基因组学知识图谱增强

Juergen Dietrich

发表机构 * AI Solutions Berlin

AI总结 本研究通过整合PharmGKB的药理基因组学先验知识(CYP酶注释)作为特征向量,增强图神经网络在药物相互作用预测中的性能,在配对数据划分下显著提升DDI类型分类,但未能突破信息天花板。

Comments 13 pages

详情
AI中文摘要

应用于药物相互作用(DDI)预测的图神经网络(GNN)仅依赖由SMILES衍生的分子结构图。该系列先前的工作表明,模型性能受限于训练标签的结构信息含量(即信息天花板),仅靠架构改进无法克服。本研究探讨来自PharmGKB数据库的药理基因组学先验知识是否能通过提供独立于分子结构且与之互补的代谢通路背景,部分突破这一天花板。提取四种临床相关亚型(CYP2D6、CYP3A4、CYP2C19、CYP2C9)的细胞色素P450(CYP)酶底物、抑制剂和诱导剂注释,并将其作为12维特征向量在交互预测前与分子嵌入拼接。在配对水平和药物水平数据划分下进行实验,以量化对未见药物的泛化能力。结果表明,在配对水平划分条件下,知识图谱(KG)增强显著改善了DDI类型分类(F1宏平均:0.532对比基线0.241),而二元交互检测和药物水平泛化仍受信息天花板限制(AUC虚高:0.224对比基线0.250)。对严格保留化合物的机制验证确认,增强优先改善CYP2C9介导的交互预测,概率从基线0.033-0.117提升至KG增强后的0.560-0.586。在Tox21基准上的单分子毒性预测扩展实验证实,该效果取决于药理基因组学注释覆盖度。这些发现为该系列后续研究提出的多模态框架提供了动机。

英文摘要

Graph neural networks (GNNs) applied to drug-drug interaction (DDI) prediction rely exclusively on molecular structure encoded as SMILES-derived graphs. Prior work in this series demonstrated that model performance is bounded by the structural information content of training labels -- an Information Ceiling -- that architectural refinements alone cannot overcome. The present study investigates whether pharmacogenomic prior knowledge from the PharmGKB database partially closes this ceiling by providing metabolic pathway context that is independent of, and complementary to, molecular structure. Cytochrome P450 (CYP) enzyme substrate, inhibitor, and inducer annotations for four clinically relevant isoforms (CYP2D6, CYP3A4, CYP2C19, CYP2C9) are extracted and incorporated as a 12-dimensional feature vector concatenated to the molecular embedding prior to interaction prediction. Experiments are conducted under both pair-level and drug-level data splits to quantify generalization to unseen drugs. Results indicate that knowledge graph (KG) augmentation substantially improves DDI type classification under pair-level split conditions (F1-macro: 0.532 vs. 0.241 baseline), while binary interaction detection and drug-level generalization remain bounded by the Information Ceiling (AUC inflation: 0.224 vs. 0.250 baseline). Mechanistic validation on strictly held-out compounds confirms that augmentation preferentially improves CYP2C9-mediated interaction prediction, with probabilities increasing from 0.033-0.117 (baseline) to 0.560-0.586 (KG-augmented). An extension to single-molecule toxicity prediction on the Tox21 benchmark confirms that the effect is contingent on pharmacogenomic annotation coverage. These findings motivate the multimodal framework proposed for the subsequent study in this series.
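The 12-dimensional feature vector corresponds to 4 isoforms times 3 annotation roles. A minimal sketch of such an encoding follows. The ordering of the dimensions, the binary 0/1 encoding, and the function names are assumptions, and the example annotations belong to a fictitious compound rather than to any real drug.

```python
ISOFORMS = ("CYP2D6", "CYP3A4", "CYP2C19", "CYP2C9")
ROLES = ("substrate", "inhibitor", "inducer")


def cyp_features(annotations):
    """Encode (isoform, role) annotations as a 12-dimensional binary vector.

    The layout is isoform-major: index = 3 * isoform_index + role_index.
    This ordering is an assumption made for illustration.
    """
    annotated = set(annotations)
    return [
        1.0 if (isoform, role) in annotated else 0.0
        for isoform in ISOFORMS
        for role in ROLES
    ]


def augment(embedding, annotations):
    """Concatenate the CYP feature vector to a molecular embedding."""
    return list(embedding) + cyp_features(annotations)


# Fictitious compound: these annotations are placeholders, not real data.
example_annotations = [("CYP3A4", "inhibitor"), ("CYP2C9", "substrate")]
features = cyp_features(example_annotations)
augmented = augment([0.12, -0.40, 0.88], example_annotations)
```

A compound with no annotations maps to the all-zero vector, which is one way the effect could depend on annotation coverage, as the abstract notes for the Tox21 extension.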

2606.07700 2026-06-09 cs.LG cs.AI 交叉投稿

EssentialGIN: a new approach for gene essentiality prediction based on graph isomorphism neural networks

EssentialGIN:基于图同构神经网络的新基因必需性预测方法

Sahar Mansouri-Rad, Zahra Narimani, Parvin Razzaghi, Nazanin Hosseinkhan

发表机构 * Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS)(计算机科学与信息技术系,基础科学研究院(IASBS)) Endocrine Research Center, Institute of Endocrinology and Metabolism, Iran University of Medical Sciences(内分泌研究中心,内分泌学与代谢研究院,伊朗医学科学大学)

AI总结 提出基于图同构神经网络(GIN)的EssentialGIN模型,整合PPI网络拓扑与基因表达、直系同源、亚细胞定位等多源生物数据,在人类等复杂生物中显著优于现有方法。

Comments 19 pages, 5 figures, 8 tables

详情
AI中文摘要

背景:必需基因(蛋白质)的预测是一个基本且具有挑战性的问题,同时在湿实验中进行非常昂贵且耗时。仅基于计算方法(引入湿实验候选)使用中心性度量预测必需基因并不准确,会导致大量假阳性;因此,最近的研究使用更复杂的模型(如深度学习)以及整合生物信息来识别必需基因。方法:在这项工作中,我们专注于图同构网络,将蛋白质作为PPI网络中的节点进行嵌入,以保留PPI网络的拓扑特征,并整合生物数据,如基因表达数据、基因直系同源信息和基因亚细胞定位信息,引入了一种用于预测必需基因的深度架构。本文修改了图同构网络架构以嵌入节点信息。结果:我们的实验证明,所提出的方法优于基于中心性的基线方法以及基于机器学习的方法,如Node2Vec、MLP和图注意力网络(GAT)。结论:在本文中,我们观察到使用整合生物数据(作为节点属性)并保留网络拓扑的图同构网络可以显著提高必需基因预测的准确性。在较简单的生物体(如大肠杆菌和黑腹果蝇)中,使用Node2Vec嵌入的多层感知机等方法也表现良好,但在人类中,所引入的架构显著优于深度学习和其他图神经网络解决方案。关键词:必需基因预测,图神经网络,图同构网络,PPI网络,节点嵌入

英文摘要

Background: Identifying essential genes (proteins) is a basic and challenging problem, and doing so in wet-lab experiments is very costly and time-consuming. Predicting essential genes with purely computational methods based on centrality measures (to propose wet-lab candidates) is not accurate and results in a large number of false positives; therefore, recent research uses more complex models such as deep learning, together with the integration of biological information, to identify essential genes. Methods: In this work we focus on graph isomorphism networks to embed proteins as nodes in a PPI network, preserving the network's topological features, and we integrate biological data such as gene expression, gene orthology, and gene subcellular localization to introduce a deep architecture for predicting essential genes. The graph isomorphism network architecture is modified in this work to embed node information. Results: Our experiments show that the proposed method outperforms baseline centrality-based methods as well as machine-learning-based methods such as Node2Vec, MLP, and graph attention networks (GAT). Conclusion: We observe that graph isomorphism networks that integrate biological data (as node attributes) and preserve network topology can significantly improve essential gene prediction accuracy. In simpler organisms such as E. coli and D. melanogaster, methods such as a multi-layer perceptron using Node2Vec embeddings also perform very well, but in H. sapiens the introduced architecture significantly outperforms deep learning and other graph neural network solutions. Keywords: Essential gene prediction, graph neural network, graph isomorphism network, PPI network, node embedding

2606.07702 2026-06-09 cs.LG cs.AI 交叉投稿

EvoCSFL: Surrogate-Assisted Evolutionary Client Selection for Efficient and Robust Federated Learning

EvoCSFL:基于代理辅助的进化客户端选择实现高效鲁棒联邦学习

Lin Qiang, Sun Xiaoyan, Hu Yao, Fang Wei

发表机构 * Jiangnan University(江南大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 针对联邦学习中客户端数据与系统异构性导致收敛慢、鲁棒性差的问题,提出代理辅助的进化客户端选择框架,将选择问题建模为组合优化,用代理模型加速进化搜索,实验表明收敛更快、能耗更低、鲁棒性更强。

详情
AI中文摘要

客户端数据和系统的异构性使得采用随机客户端选择的联邦学习难以获得令人满意的收敛速度和鲁棒性。为解决此问题,本文提出了一种基于代理辅助的客户端进化选择框架。在该框架中,首先使用一些典型的客户端选择策略生成候选集,并开发了一个集成模型性能、通信延迟和能量消耗的度量函数,将客户端选择问题表述为组合优化问题。随后,利用候选选择和度量构建代理模型,以高效逼近所选客户端子集的性能。采用进化算法搜索客户端选择的组合空间,并由代理模型引导以加速收敛。在MNIST、CIFAR10、CINIC10和TinyImageNet上的实验表明,与现有方法相比,所提算法实现了更快的收敛、更低的能量消耗和更好的鲁棒性。

英文摘要

The heterogeneity of client data and systems makes it difficult to achieve satisfactory convergence speed and robustness in federated learning with random client selection. To address this issue, this paper proposes a surrogate-assisted client evolutionary selection framework for federated learning. In this framework, some typical client selection strategies are first used to generate candidate sets, and a metric function that integrates model performance, communication latency, and energy consumption is developed to formulate the client selection problem as a combinatorial optimization one. Subsequently, a surrogate model is constructed using the candidate selections and metric to efficiently approximate the performance of selected client subsets. An evolutionary algorithm is employed to search the combinatorial space of client selections, guided by the surrogate model to accelerate convergence. Experiments on MNIST, CIFAR10, CINIC10, and TinyImageNet demonstrate that the proposed algorithm achieves faster convergence, lower energy consumption, and improved robustness compared to existing methods.
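The search loop described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the Jaccard-weighted surrogate, the swap mutation, the elitist survivor rule, and all names (`make_surrogate`, `mutate`, `evolve`) are assumptions made for the example, since the abstract does not specify the surrogate model or the evolutionary operators.

```python
import random

def jaccard(a, b):
    """Similarity between two client subsets (frozensets of client ids)."""
    return len(a & b) / len(a | b)

def make_surrogate(archive):
    """Build a cheap predictor from already-evaluated candidate selections.

    archive: list of (frozenset_of_client_ids, measured_metric) pairs.
    Hypothetical surrogate: similarity-weighted average of measured metrics.
    """
    def predict(subset):
        weights = [jaccard(subset, s) + 1e-9 for s, _ in archive]
        return sum(w * y for w, (_, y) in zip(weights, archive)) / sum(weights)
    return predict

def mutate(subset, n_clients, rng):
    """Swap one selected client for one unselected client (size stays fixed)."""
    out = rng.choice(sorted(subset))
    unselected = [c for c in range(n_clients) if c not in subset]
    return frozenset((subset - {out}) | {rng.choice(unselected)})

def evolve(population, surrogate, n_clients, generations=30, seed=0):
    """Elitist evolutionary search over client subsets, scored by the surrogate."""
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(generations):
        children = [mutate(rng.choice(pop), n_clients, rng) for _ in pop]
        # Keep the best distinct subsets; parents compete with their children.
        pop = sorted(set(pop + children), key=surrogate, reverse=True)[:len(population)]
    return pop[0]
```

Because survivor selection is elitist, the returned subset never scores worse under the surrogate than the best initial candidate. In a real system the surrogate would be refreshed periodically with true measurements of model performance, latency, and energy.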

2606.07703 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

需要多少密集注意力?面向混合长上下文模型中全/GQA层的Oracle引导稀疏预填充

Hongxing Wang, Harenome Razanajato, Zhen Zhang, Yujie Yuan, Hongsheng Liu

AI总结 研究在混合长上下文模型中,通过Oracle引导的稀疏预填充减少密集注意力计算,在保持任务性能的同时实现加速,并验证了可行性、索引器质量和运行时加速潜力。

Comments Technical report, first release, 26 pages, 2 figures, 11 tables

详情
AI中文摘要

长上下文预填充仍然昂贵,因为即使在包含局部、稀疏、线性或循环组件的混合模型中,全/GQA层仍然对整个历史序列进行评分。我们研究了在显式支持粒度和top-k预算下,需要多少密集注意力来保持任务级行为。我们为现有的GQA检查点引入了一种注意力质量top-k oracle:对于每个层和查询位置,它计算密集注意力,选择头平均的token支持,并仅在该支持上重新计算注意力。该oracle是一个诊断参考,而非可部署的加速器,并将稀疏预算可行性从索引器误差和运行时实现效果中分离出来。在Qwen家族的检索密集型评估中,每个查询的最长oracle行与密集注意力相差在1个点以内,而Qwen3.5-9B在4K到100K的RULER风格扫描中相差在0.48个点以内。在oracle的指导下,我们通过KL蒸馏从密集注意力质量分布中训练了一个头折叠的辅助索引器,同时保持骨干网络冻结。使用分别蒸馏的Qwen3.5-0.8B和Qwen3.5-9B索引器,报告的16K/32K验证宏观差距分别为+2.04和+1.13个点,这被视为质量保持而非改进;融合的选择块共享支持可能引入更大的实现差距。初步的单卡TTFT测量显示,与密集FlashAttention-2基线相比,蒸馏索引器的稀疏服务加速比在NPU上对Qwen3.5-0.8B为1.71倍,在GPU上对Qwen3.5-9B为1.93倍。额外的随机初始化压力行达到3.44倍,表明稀疏运行时存在提升空间,但输出质量未经验证。本次发布首次分离了oracle可行性、蒸馏索引器质量和运行时提升空间,将完全匹配的质量-延迟前沿留待未来工作。

英文摘要

Long-context prefill remains expensive because full/GQA layers still score the historical sequence, even in hybrid models with local, sparse, linear, or recurrent components. We study how much dense attention is needed to preserve task-level behavior under explicit support granularity and top-k budgets. We introduce an attention-mass top-k oracle for existing GQA checkpoints: for each layer and query position, it computes dense attention, selects head-averaged token support, and recomputes attention only on that support. The oracle is a diagnostic reference, not a deployable accelerator, and separates sparse-budget feasibility from indexer error and runtime realization effects. On Qwen-family retrieval-heavy evaluations, the longest per-query oracle rows stay within 1 point of dense, and a Qwen3.5-9B RULER-style sweep from 4K to 100K stays within 0.48 points. Guided by the oracle, we derive a head-collapsed auxiliary indexer trained by KL distillation from dense attention-mass distributions while keeping the backbone frozen. With separately distilled Qwen3.5-0.8B and Qwen3.5-9B indexers, the reported 16K/32K validation macro gaps are +2.04 and +1.13 points, treated as quality preservation rather than improvement; fused selection-block-shared support can introduce a larger realization gap. Preliminary single-card TTFT measurements show distilled-indexer sparse serving speedups of 1.71x for Qwen3.5-0.8B on NPU and 1.93x for Qwen3.5-9B on GPU against its dense FlashAttention-2 baseline. Additional random-init stress rows reach 3.44x, indicating sparse-runtime headroom but not validated output quality. This first release separates oracle feasibility, distilled-indexer quality, and runtime headroom, leaving a fully matched quality-latency frontier to future work.
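The oracle steps named in the abstract (dense attention, head-averaged token support, recomputation on that support) can be illustrated with a small pure-Python sketch for a single query position. It assumes one group of query heads sharing the same keys and values, as in GQA; the function name `topk_oracle_attention` and the data layout are hypothetical, and causal masking is assumed to be handled by passing only the visible prefix.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def topk_oracle_attention(q_heads, keys, values, k):
    """q_heads: one query vector per head in a GQA group (shared keys/values).
    keys, values: per-token vectors for the visible prefix.
    Returns (support, per-head outputs computed only on the support)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    n = len(keys)
    # Step 1: dense attention probabilities for every head.
    dense = [softmax([dot(q, key) * scale for key in keys]) for q in q_heads]
    # Step 2: head-averaged attention mass per token.
    mass = [sum(p[t] for p in dense) / len(dense) for t in range(n)]
    # Step 3: keep the k tokens with the largest averaged mass.
    support = sorted(sorted(range(n), key=lambda t: mass[t], reverse=True)[:k])
    # Step 4: recompute softmax attention restricted to the support.
    outputs = []
    for q in q_heads:
        probs = softmax([dot(q, keys[t]) * scale for t in support])
        dim = len(values[0])
        outputs.append([sum(p * values[t][j] for p, t in zip(probs, support))
                        for j in range(dim)])
    return support, outputs
```

As the abstract stresses, such an oracle still pays for the dense pass, so it is a diagnostic reference for budget feasibility rather than an accelerator.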

2606.07704 2026-06-09 cs.LG cs.AI 交叉投稿

FunctionEvolve: Structure-Guided Symbolic Regression with LLMs

FunctionEvolve: 基于结构引导的符号回归与大型语言模型

Zeyu Xia, Jun Zhu, Dong Yan

发表机构 * Bosch Center for Artificial Intelligence(博世人工智能中心) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Tsinghua-Bosch Joint Center for ML, Tsinghua University(清华大学-博世联合机器学习中心)

AI总结 提出FunctionEvolve框架,利用表达式树组织符号回归搜索,通过结构摘要、局部树编辑和结构感知系数拟合,在LLM-SRBench合成子集上以Claude Opus 4.6实现82.9%的SA@50,较同基线提升4.5倍。

详情
AI中文摘要

符号回归旨在从数据中揭示显式的科学定律。近期方法使用大型语言模型(LLM)引导基于背景文本的变异,这比随机遗传编程更具方向性。然而,精确的符号恢复既需要语义引导,也需要显式结构,以便通过有效的符号表示进行领域信息搜索。当前的LLM驱动系统仍然是结构盲的:它们在模糊的候选者中进行选择,缺乏局部变异的显式机制,并依赖脆弱的系数拟合,这可能会低估正确的骨架。我们提出FunctionEvolve,一个使用表达式树组织整个搜索的进化框架:结构摘要促进多样化的父代选择,局部树编辑保留有用的子表达式,结构感知拟合分解、约束和简化系数,以实现更可靠的评分。它仅使用初等函数族,无需额外的领域特定规则限制泛化能力。在LLM-SRBench的129任务合成子集上,使用Claude Opus 4.6的FunctionEvolve恢复了107个精确形式,达到82.9%的SA@50,是同骨干基线的4.5倍,以及55.8%的SA@1,是此前最强已发布top-1结果的3.6倍。消融实验表明,结构可见搜索是可靠恢复的核心,LLM引导的改进和结构感知系数优化作为必要的提议和评分机制。我们还对基准进行了审计,显示其材料科学子集中的共线性导致了可识别性问题。

英文摘要

Symbolic regression aims to uncover explicit scientific laws from data. Recent methods use LLMs to guide mutation from background text, which is more directed than random genetic programming. However, exact symbolic recovery requires both semantic guidance and explicit structure, so that domain-informed search is carried out through valid symbolic representation. Current LLM-driven systems remain structure-blind: they select among opaque candidates, lack explicit mechanisms for local mutation, and rely on brittle coefficient fitting that can undervalue correct skeletons. We propose FunctionEvolve, an evolutionary framework using expression trees to organize the whole search: structural summaries promote diverse parent selection, local tree edits preserve useful subexpressions, and structure-aware fitting decomposes, constrains, and simplifies coefficients for more reliable scoring. It uses only elementary function families, without additional domain-specific rules limiting generalization. On the 129-task synthetic subset of LLM-SRBench, FunctionEvolve with Claude Opus 4.6 recovers 107 exact forms, reaching 82.9% SA@50, 4.5x above same-backbone baselines, and 55.8% SA@1, 3.6x above the strongest previously published top-1 result. Ablations show that structure-visible search is central to reliable recovery, with LLM-guided refinements and structure-aware coefficient optimization serving as essential proposal and scoring mechanisms. We also audit the benchmark and show that collinearity in its materials-science subset creates identifiability issues.

2606.07705 2026-06-09 cs.LG cs.AI 交叉投稿

SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models

SAW: 面向大语言模型多目标强化学习的阶段感知动态加权

Yuchen He, Baolong Bi, Shenghua Liu, Huaming Liao, Yuyao Ge, Bolin Wan, Siqian Tong, Juan Chen, Jiafeng Guo, Xueqi Cheng

发表机构 * Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) University of Electronic Science and Technology of China(电子科技大学)

AI总结 针对多目标强化学习中奖励学习异步性问题,提出轻量级动态加权机制SAW,利用变异系数实时调整各目标贡献,在GRPO和GDPO框架下提升训练效率和最终性能。

Comments 17 pages, 7 figures, 5 tables

详情
AI中文摘要

尽管多目标强化学习(MORL)对于将大语言模型与复杂的人类偏好对齐至关重要,但当前普遍采用的静态加权求和忽略了一个更基本的现象:不同目标之间的奖励学习明显异步。学习良好的维度会迅速产生同质、低方差的信号,其残留噪声会污染聚合奖励(在GRPO中)或占据优势预算的固定份额(在GDPO中),从而干扰学习不足维度携带的稀缺但高价值的信号。为了解决这种异步性,我们提出了阶段感知动态加权(SAW),一种轻量级、算法无关的动态加权机制。SAW利用变异系数(CV)作为实时信息量的尺度不变代理,根据批次内各维度的相对信息量重新加权其奖励或优势贡献。与需要多次前向和反向传播的基于梯度的方法不同,SAW仅依赖于批次级统计信息,引入的计算开销几乎可以忽略不计。在工具调用和文本摘要任务上的实验表明,SAW在GRPO和GDPO框架下均能一致地提高训练效率和最终性能,证实了其作为多奖励LLM对齐的通用插件。我们的代码可在 https://github.com/Zhaolutuan/SAW 获取。

英文摘要

Although multi-objective reinforcement learning (MORL) is central to aligning large language models with complex human preferences, the prevailing practice of static weighted summation overlooks a more fundamental phenomenon: reward learning is markedly asynchronous across objectives. Well-learned dimensions quickly produce homogeneous, low-variance signals whose residual noise contaminates the aggregated reward (in GRPO) or occupies a fixed share of the advantage budget (in GDPO), interfering with the scarce yet high-value signals carried by under-learned dimensions. To address this asynchrony, we propose Stage-Aware Dynamic Weighting (SAW), a lightweight, algorithm-agnostic dynamic weighting mechanism. SAW utilizes the coefficient of variation (CV) as a scale-invariant proxy for real-time informativeness, reweighting each dimension's reward or advantage contribution by its relative informativeness within the batch. Unlike gradient-based methods that require multiple forward and backward passes, SAW relies solely on batch-level statistics, introducing nearly negligible computational overhead. Experiments on tool-calling and text summarization tasks demonstrate that SAW consistently improves both training efficiency and final performance under both GRPO and GDPO frameworks, confirming it as a general-purpose plug-in for multi-reward LLM alignment. Our code is available at https://github.com/Zhaolutuan/SAW
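A small sketch may help make the CV-based reweighting concrete. It assumes, purely for illustration, that each reward dimension's weight is its batch-level coefficient of variation normalized to sum to one; the exact weighting rule used by SAW is not given in the abstract, and the helper names `cv`, `cv_weights`, and `aggregate` are hypothetical.

```python
import statistics

def cv(values, eps=1e-8):
    """Coefficient of variation: std / |mean| (scale-invariant up to eps)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return std / (abs(mean) + eps)

def cv_weights(reward_columns):
    """reward_columns: one list of per-sample rewards per objective.
    Assumed rule: weight each objective by its share of the total CV."""
    cvs = [cv(col) for col in reward_columns]
    total = sum(cvs)
    if total == 0:
        # No objective carries signal in this batch: fall back to uniform.
        return [1.0 / len(cvs)] * len(cvs)
    return [c / total for c in cvs]

def aggregate(reward_columns, weights):
    """Weighted sum of the objectives for each sample in the batch."""
    n = len(reward_columns[0])
    return [sum(w * col[i] for w, col in zip(weights, reward_columns))
            for i in range(n)]
```

With a saturated objective whose rewards are all 1.0 and a second objective that still varies across the batch, the saturated one receives zero weight. Only batch statistics are needed, with no extra forward or backward passes.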

2606.07710 2026-06-09 cs.LG cs.AI 交叉投稿

WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

WhiFlash: 通过令牌级跨范式路由加速推测解码

Young D. Kwon, Miles Williams, Rui Li, Alexandros Kouris, Stylianos I. Venieris

发表机构 * Samsung AI Center, Cambridge, UK(三星AI中心,剑桥,英国)

AI总结 提出WhiFlash,首个统一自回归与扩散并行草稿的跨范式推测解码方法,通过细粒度路由和缓存优化实现高达69.6%的吞吐量提升。

Comments Under review

详情
AI中文摘要

大型语言模型的自回归特性仍然是推理的主要瓶颈,特别是在复杂的代理工作负载中。虽然推测解码加速了推理,但当前方法依赖于静态草稿范式,使用自回归草稿模型进行推理或基于扩散的并行草稿模型生成结构化输出。我们经验发现,草稿准确性在单个序列内波动剧烈,静态范式和粗粒度路由导致显著性能未实现。为解决这种波动性,我们引入WhiFlash,首个跨范式推测解码方法,在单个令牌级控制器下统一自回归和基于扩散的并行草稿。WhiFlash采用细粒度路由机制,使用轻量级基于熵的或学习到的神经策略,两者均参数化以在预期令牌增益和延迟之间提供可调平衡。为使高频切换计算可行,我们引入新颖的缓存管理优化——惰性追赶和仅KV预填充,将切换开销降低到每轮延迟的7%以下。通过利用根本不同草稿架构的互补优势,WhiFlash实现了显著更高的接受长度,在特定类别上吞吐量比最先进的自回归EAGLE-3提升高达69.6%,比基于扩散的DFlash提升37.3%。

英文摘要

The autoregressive nature of large language models (LLMs) remains a significant bottleneck for inference, particularly in complex agentic workloads. While speculative decoding (SD) accelerates inference, current approaches rely on static drafting paradigms, utilising either autoregressive drafting models for reasoning or diffusion-based parallel drafting models for structured outputs. We empirically find that drafting accuracy fluctuates dramatically within a single sequence, leaving significant performance unrealised by static paradigms and coarse-grained routing. To address this volatility, we introduce WhiFlash, the first cross-paradigm SD method that unifies autoregressive and diffusion-based parallel drafting under a single token-level controller. WhiFlash adopts a fine-grained routing mechanism that employs either a lightweight entropy-based or a learned neural policy, both parametrised to provide a tunable balance between expected token gain and latency. To make high-frequency switching computationally viable, we introduce novel cache-management optimisations, Lazy Catch-up and KV-only Prefill, reducing switching overhead to below 7% of per-round latency. By capitalising on the complementary strengths of fundamentally distinct drafting architectures, WhiFlash achieves significantly higher acceptance lengths, yielding category-specific throughput gains of up to 69.6% over the state-of-the-art autoregressive EAGLE-3 and 37.3% over the diffusion-based DFlash.

2606.07713 2026-06-09 cs.LG cs.AI cs.PF 交叉投稿

Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels

理论最小化的注意力机制:面向内存最优Transformer内核的数组数学框架

Lenore Mullin, Gaetan Hains

发表机构 * University at Albany(奥尔巴尼大学) Université Paris-Est Créteil(巴黎东克雷泰伊大学)

AI总结 提出基于数组数学(MoA)的缩放点积注意力重表述,通过代数构造消除所有中间数组,实现O(n dk + n dv)数据移动,相比标准实现O(n^2 + n dk + n dv)显著降低内存流量,并验证了数值精度。

详情
AI中文摘要

注意力机制是现代基于Transformer的AI中的主要计算瓶颈。其标准实现在序列长度$n$上产生二次内存流量,而DRAM访问在当代硬件上比算术操作消耗100--1000$\times$更多的能量,因此任何仅关注FLOP计数的分析从根本上误解了瓶颈。我们提出了缩放点积注意力及其数值稳定softmax的数组数学(MoA)重表述,推导出指称范式(DNF),通过代数构造而非经验调优消除了所有中间数组——包括隐式转置键缓冲区和每个softmax临时变量。DNF实现了$O(n d_k + n d_v)$的数据移动,而标准实现为$O(n^2 + n d_k + n d_v)$,其中$n$是序列长度,$d_k$是键维度,$d_v$是值维度,并在具体输入上针对PyTorch全双精度浮点进行了数值验证。与硬件特定的加速器或经验性分块方案(如FlashAttention)不同,MoA从单一代数框架同时提供了数组融合、形状变换正确性和预测性成本模型。内存最小性是在编写任何代码之前就确立的定理。预测性性能模型预计加速2--100$\times$,能耗降低2--50$\times$,优势在超大规模下进一步扩大。该推导建立了一个从Python规范经过操作范式(ONF)和维度提升硬件映射的形式化验证流水线,提供了与DARPA边缘部署和DOE超大规模优先事项直接相关的性能可移植AI内核。

英文摘要

The attention mechanism is the dominant computational bottleneck in modern transformer-based AI. Its standard implementation incurs quadratic memory traffic in the sequence length $n$, and DRAM accesses cost 100--1000$\times$ more energy than arithmetic operations on contemporary hardware, so any analysis focused solely on FLOP counts fundamentally mischaracterises the bottleneck. We present a Mathematics of Arrays (MoA) reformulation of scaled dot-product attention and its numerically stable softmax, deriving a Denotational Normal Form (DNF) that eliminates all intermediate arrays -- including the implicit transposed-key buffer and every softmax temporary -- by algebraic construction rather than empirical tuning. The DNF achieves $O(n d_k + n d_v)$ data movement versus $O(n^2 + n d_k + n d_v)$ for the standard implementation, where $n$ is the sequence length, $d_k$ is the key dimensionality and $d_v$ the value dimensionality, and is verified numerically against PyTorch at full double-precision floating-point on concrete inputs. Unlike hardware-specific accelerators or empirical tiling schemes such as FlashAttention, MoA simultaneously provides array fusion, shape-transformation correctness, and predictive cost models from a single algebraic framework. Memory minimality is a theorem established before any code is written. A predictive performance model projects $2$--$100\times$ speedup and $2$--$50\times$ energy reduction, with the advantage widening at exascale. The derivation establishes a formally verified pipeline from a Python specification through Operational Normal Form (ONF) and dimension-lifted hardware mapping, providing performance-portable AI kernels of direct relevance to DARPA edge-deployment and DOE exascale priorities.
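To see why the quadratic score array is avoidable in principle, here is a sketch of attention for one query row computed in a single streaming pass. It uses the well-known online-softmax recurrence, not the paper's MoA/DNF derivation, and the function names are illustrative. The dense version materializes all $n$ scores for the row; the streaming version keeps only a running maximum, a running normalizer, and a $d_v$-sized accumulator.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_attention_row(q, keys, values):
    """Reference: materializes the full score vector for this query."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]

def streaming_attention_row(q, keys, values):
    """Single pass over keys/values with O(d_v) extra state, no score array."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    normalizer = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = scale * dot(q, k)
        new_max = max(running_max, s)
        # Rescale previous partial sums to the new maximum (exp(-inf) == 0.0).
        correction = math.exp(running_max - new_max)
        w = math.exp(s - new_max)
        normalizer = normalizer * correction + w
        acc = [a * correction + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]
```

Both functions return the same values up to floating-point rounding; the difference lies only in which intermediates have to exist in memory.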

2606.07766 2026-06-09 cs.CV cs.AI 交叉投稿

Quantum-Enhanced Similarity Measures for Polarimetric Materials Classification

量子增强的极化材料分类相似度度量

Sara Shojaei, Seyed Mohamad Ali Tousi, Emma Bennett, Param Sangani, Ali Shiri Sichani, Ilker Ersoy, Hadi Ali-Akbarpour, Filiz Bunyak, G. N. DeSouza

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出量子-经典混合流水线,将极化材料分类转化为点匹配问题,利用SWAP测试估计嵌入向量保真度,实现竞争性分类精度和开放集判别能力。

详情
AI中文摘要

我们提出了一种用于极化材料分类的量子-经典混合流水线,将其视为点匹配问题。包含偏振光反射的体素立方体用于训练编码器,为立方体的体素生成32维嵌入。在推理时,丢弃编码器头部,将嵌入编码为量子态的概率幅。然后,SWAP测试电路估计查询立方体的每个32D嵌入与锚点立方体数据集之间的保真度。聚合的保真度作为材料相似度分数,具有最高聚合保真度的锚点类别被视为查询材料的类别。我们在一个包含23种材料(每种约800个样本)的数据集上评估了我们的方法,这些材料来自其Mueller矩阵。比较了所提出的量子SWAP测试的点匹配方法和使用最优传输的经典分类器。我们的结果展示了竞争性的分类精度以及开放集判别潜力,使其成为基于NISQ的材料识别的可行途径。

英文摘要

We present a quantum-classical hybrid pipeline for polarimetric material classification that casts the task as a point-matching problem. Voxel cubes containing polarized light reflections are used to train an encoder that produces 32-dimensional embeddings for the voxels of each cube. At inference, the encoder head is discarded and the embeddings are encoded as probability amplitudes of quantum states. Next, a SWAP-test circuit estimates the fidelity between each of the 32D embeddings from the query cube and a dataset of anchor cubes. The aggregated fidelities serve as material similarity scores, and the class of the anchor with the highest aggregated fidelity is taken as the class of the queried material. We evaluate our approach on a dataset of 23 materials ($\approx$800 samples each) derived from their Mueller matrices. The point-matching approaches based on the proposed quantum SWAP test and on a classical classifier using Optimal Transport are compared. Our results demonstrate competitive classification accuracy alongside open-set discrimination potential, establishing the pipeline as a viable path toward NISQ-based material recognition.
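The SWAP-test step can be simulated classically for intuition. For two normalized states, the ancilla is measured as 0 with probability $\frac{1}{2} + \frac{1}{2}|\langle a|b\rangle|^2$, so the fidelity is recovered as $2P(0) - 1$. The sketch below assumes real-valued embeddings and, as an illustrative choice that the abstract does not specify, aggregates by the mean fidelity over all query-anchor pairs; the function names and the toy class labels are hypothetical.

```python
import math
import random

def amplitude_encode(v):
    """Normalize a real vector so its entries can act as state amplitudes."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def swap_test_p0(a, b):
    """Exact probability of measuring the SWAP-test ancilla in state 0."""
    overlap = sum(x * y for x, y in zip(amplitude_encode(a), amplitude_encode(b)))
    return 0.5 + 0.5 * overlap ** 2

def estimate_fidelity(a, b, shots=2000, seed=0):
    """Finite-shot fidelity estimate, mimicking repeated circuit runs."""
    rng = random.Random(seed)
    p0 = swap_test_p0(a, b)
    zeros = sum(rng.random() < p0 for _ in range(shots))
    return max(0.0, 2.0 * zeros / shots - 1.0)

def classify(query_embeddings, anchors):
    """anchors: dict mapping class label -> list of anchor embeddings.
    Assumed aggregation: mean exact fidelity over all query/anchor pairs."""
    scores = {}
    for label, anchor_set in anchors.items():
        vals = [2.0 * swap_test_p0(q, a) - 1.0
                for q in query_embeddings for a in anchor_set]
        scores[label] = sum(vals) / len(vals)
    return max(scores, key=scores.get), scores
```

A 32-dimensional embedding fits in the amplitudes of a 5-qubit register, since $2^5 = 32$.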

2606.07865 2026-06-09 cs.LG cs.AI physics.comp-ph stat.ML 交叉投稿

Instrumented data for causal scientific machine learning

因果科学机器学习的仪器化数据

Daniel N. Wilke

发表机构 * University of the Witwatersrand(威特沃特斯兰德大学)

AI总结 提出仪器化数据作为观测数据和模板合成数据之外的第三种选择,每个数据点携带产生它的机制模型、显式不确定性及可执行的反事实族,通过V&V仪器化图像到模拟管道实现,支持因果干预。

Comments 10 pages, 2 figures

详情
AI中文摘要

科学机器学习受限于训练数据而非模型大小。观测数据记录发生了什么但不记录原因;模板合成数据具有已知的生成过程,但仅适用于模拟器的模板,而非用户面对的情况。我们认为第三种选择现在在操作上是可行的:仪器化数据,其中每个数据点携带产生它的机制模型、对该模型的显式不确定性以及可执行的反事实族。验证与确认(V&V)仪器化图像到模拟管道是一种实现:传感器观测成为完全指定、求解器支持的模拟,具有显式、可编辑的参数以及传播的偶然/认知不确定性。该基底是案例特定的、机制监督的,并通过Pearl的do算子支持因果干预。在验证、审计和替代训练方面的近期影响涵盖计算生物学、气候、材料、流体力学和医学成像;长期可证伪的推论涉及科学推理的基础模型。

英文摘要

Scientific machine learning is limited less by model size than by the data it is trained on. Observational data records what happened but not why; template synthetic data has a known generating process but only for the simulator's template, not the case a user faces. We argue a third option is now operationally feasible: instrumented data, in which every datum carries the mechanistic model that produced it, an explicit uncertainty over that model, and an executable family of counterfactuals. Verification-and-validation (V&V) instrumented image-to-simulation pipelines are one realisation: a sensor observation becomes a fully specified, solver-backed simulation with explicit, editable parameters and a propagated aleatoric/epistemic uncertainty. The substrate is case-specific, mechanistically supervised, and supports causal interventions through Pearl's do-operator. Near-term consequences for validation, auditing, and surrogate training span computational biology, climate, materials, fluid mechanics, and medical imaging; a longer-term, falsifiable implication concerns foundation models for scientific reasoning.

2606.07882 2026-06-09 cs.CV cs.AI 交叉投稿

The Cross-Architecture Substrate: A Domain-Transcendent, Calibration-Surviving Geometric Invariant of Modern Vision Encoders

跨架构基板:现代视觉编码器的领域超越、校准存活的几何不变量

Yousef Radwan

发表机构 * KAUST(阿卜杜拉国王科技大学)

AI总结 发现现代视觉编码器训练后前16个主方向收敛到同一16维几何对象(跨架构基板),该基板跨视觉领域传输、校准后仍存在,并应用于无标签迁移性过滤、领域检测、低样本探测和无教师蒸馏。

Comments 14 pages, 2 figures. 40th Conference on Neural Information Processing Systems (NeurIPS 2026)

详情
AI中文摘要

不同的视觉神经网络——训练用于分类、对比、重建或将图像与文本匹配——应该具有相应不同的内部表示。我们报告它们并非如此。训练后,十三个现代视觉编码器内部的前十六个主变化方向收敛到同一个十六维几何对象。我们称之为跨架构基板,并使用PCA、中心核对齐(CKA)和Pang 2026校准进行研究。该基板在四个视觉领域(自然照片、医学CT、卫星、显微镜)上以中位数Procrustes-CKA 0.679传输,在八个领域(增加素描、深度、热红外、天文学)上为0.604,每对>0.40。它在全局(7.4倍判别vs MAE分离,n=13,394)和局部(4.82-5.30,p<10^{-44})上经受住Pang校准。它不是像素统计(0.263),不是Gabor特征(0.31),不是随机投影(0.041),并且在训练的前10%中出现,而准确率持续上升。我们提供了四个应用:一个无标签迁移性过滤器,优于LogME(快3倍,+0.15 Kendall-tau);一个四路领域检测器(99.6%准确率);一个冻结低样本探测器(16维在每类N=50标签时比768维DINOv2高3.78个百分点);以及一个无教师蒸馏辅助,匹配训练教师KD在33对上(10%标签分数时峰值增益7.56个百分点)。该基板不跨模态,不帮助跨范式蒸馏,也不预测迁移质量(与迁移准确率的rho=0.08)。

英文摘要

Different vision neural networks -- trained to classify, contrast, reconstruct, or match images to text -- should have correspondingly different internal representations. We report that they do not. After training, the top sixteen principal directions of variation inside thirteen modern vision encoders converge to the same sixteen-dimensional geometric object. We call this the cross-architecture substrate and study it with PCA, centred kernel alignment (CKA), and Pang 2026 calibration. The substrate transports across four visual domains (natural photographs, medical CT, satellite, microscopy) at median Procrustes-CKA 0.679, and across eight domains (adding sketches, depth, thermal infrared, astronomy) at 0.604, every pair >0.40. It survives Pang calibration globally (7.4x disc-vs-MAE separation, n=13,394) and locally (4.82-5.30, p<10^{-44}). It is not pixel statistics (0.263), not Gabor features (0.31), not a random projection (0.041), and emerges in the first 10% of training while accuracy keeps climbing. We deliver four applications: a label-free transferability filter beating LogME (3x faster, +0.15 Kendall-tau); a four-way domain detector (99.6% accuracy); a frozen low-shot probe (16 dims beat 768-dim DINOv2 by 3.78pp at N=50 labels per class); and a teacher-free distillation auxiliary matching trained-teacher KD on 33 pairs (7.56pp peak gain at 10% label fraction). The substrate does not cross modalities, does not help cross-paradigm distillation, and does not predict transfer quality (rho=0.08 against transfer accuracy).
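For readers unfamiliar with CKA, the following sketch computes plain linear CKA between two sets of features extracted for the same inputs. It is not the "Procrustes-CKA" variant or the calibration procedure reported in the abstract, whose details are not given here; it only illustrates the basic similarity measure, which is invariant to rotations and isotropic rescaling of either representation.

```python
import math

def center_columns(X):
    """Subtract the per-feature mean from every row of X (list of rows)."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def cross_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y for row-aligned feature matrices."""
    total = 0.0
    for a in range(len(X[0])):
        for b in range(len(Y[0])):
            s = sum(X[i][a] * Y[i][b] for i in range(len(X)))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_sq_norm(Xc, Yc)
    denominator = math.sqrt(cross_sq_norm(Xc, Xc) * cross_sq_norm(Yc, Yc))
    return numerator / denominator
```

A value of 1 means the two feature sets agree up to rotation and scale; dropping informative directions from one of them lowers the score.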

2606.07954 2026-06-09 cs.LG cs.AI 交叉投稿

Minibatch Selection via Partition Matroid Constrained Gradient Matching

基于划分拟阵约束梯度匹配的小批量选择

Prayas Agrawal, Prateek Chanda, Ishita Khatri, Ganesh Ramakrishnan, Bamdev Mishra, Pratik Jawanpuria

发表机构 * Indian Institute of Technology Bombay(印度理工学院孟买分校) Department of Computer Science and Engineering(计算机科学与工程系) Centre for Machine Intelligence and Data Science(机器智能与数据科学中心) Microsoft Research India(微软印度研究院) Microsoft India(微软印度)

AI总结 提出PartitionSel方法,通过划分拟阵约束下的梯度匹配效用最大化,实现跨域小批量选择,减少冗余并提升训练兼容性,在LLM微调中取得鲁棒性提升。

Comments 28 pages, 12 figures, ICML 2026

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea, PMLR 306, 2026
AI中文摘要

在异构数据上训练大型语言模型(LLMs)需要选择能够平衡收敛速度与跨领域覆盖的小批量。现有方法要么在每个领域内独立选择样本,要么依赖计算昂贵的代理模型来学习连续的领域权重。我们提出PartitionSel,一种跨领域小批量选择方法,它在每个领域的预算(编码为划分拟阵约束)下最大化验证引导的梯度匹配效用。通过单一效用耦合每个领域的预算,PartitionSel旨在减少跨领域选择中的冗余。所提出的目标是弱子模的,并允许使用正交匹配追踪算法,具有可证明的近似保证。在实验中,我们在MetaMathQA和Mol-Instructions上对Qwen2.5和Llama-3进行微调时,评估了PartitionSel的小批量选择。PartitionSel在两个基准测试中均比每个领域和领域无关的基线获得了鲁棒的提升。它还减少了每个批次内冲突梯度对的数量,表明跨领域耦合转化为更兼容的训练更新。

英文摘要

Training large language models (LLMs) on heterogeneous data requires selecting minibatches that balance convergence speed with coverage across domains. Existing methods either select samples independently within each domain or rely on computationally expensive proxy models to learn continuous domain weights. We propose PartitionSel, a cross-domain minibatch selection approach that maximizes a validation-guided gradient-matching utility under per-domain budgets encoded as a partition-matroid constraint. By coupling the per-domain budgets through a single utility, PartitionSel is designed to reduce redundancy in selections across domains. The proposed objective is weakly submodular and admits an orthogonal matching pursuit algorithm with provable approximation guarantees. Empirically, we evaluate PartitionSel for minibatch selection during the fine-tuning of Qwen2.5 and Llama-3 on MetaMathQA and Mol-Instructions. PartitionSel achieves robust gains over per-domain and domain-agnostic baselines on both benchmarks. It also reduces the number of conflicting gradient pairs within each batch, indicating that the cross-domain coupling translates into more compatible training updates.
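The idea of coupling per-domain budgets through one gradient-matching objective can be sketched with a simplified greedy loop. Note the simplification: this is plain matching pursuit (residual update along the chosen gradient only), not the orthogonal matching pursuit the abstract refers to, and the function name, data layout, and scoring rule are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_minibatch(target, candidates, budgets):
    """Greedy gradient matching under per-domain budgets.

    target: validation gradient (list of floats).
    candidates: list of (domain, gradient) pairs.
    budgets: dict mapping domain -> maximum number of picks from that domain.
    Returns the indices of the selected candidates in pick order.
    """
    residual = list(target)
    used = {domain: 0 for domain in budgets}
    selected = []
    while True:
        best, best_score = None, 0.0
        for i, (domain, grad) in enumerate(candidates):
            if i in selected or used[domain] >= budgets[domain]:
                continue  # would violate the partition-matroid constraint
            norm = math.sqrt(dot(grad, grad))
            score = dot(residual, grad) / norm if norm > 0 else 0.0
            if score > best_score:
                best, best_score = i, score
        if best is None:
            return selected  # no feasible candidate still reduces the residual
        domain, grad = candidates[best]
        coef = dot(residual, grad) / dot(grad, grad)
        residual = [r - coef * g for r, g in zip(residual, grad)]
        selected.append(best)
        used[domain] += 1
```

Because every pick is scored against the shared residual, a sample from one domain that already explains part of the validation gradient lowers the score of redundant samples in other domains.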

2606.08156 2026-06-09 cs.CV cs.AI 交叉投稿

RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT

RAPID: 逐层冗余感知剪枝与重要性驱动的令牌合并以实现高效ViT

Kyumin Choi, Ikbeom Jang

发表机构 * Hankuk University of Foreign Studies(韩国外国语大学)

AI总结 提出RAPID框架,根据ViT网络深度自适应调整令牌缩减策略:浅中层用冗余相似度感知剪枝,深层用重要性相似度感知合并,在ImageNet-1K上实现更优的精度-压缩帕累托前沿。

Comments 7 pages, 2 figures

详情
AI中文摘要

视觉Transformer(ViT)取得了强大性能,但由于二次自注意力复杂度而遭受高计算成本。尽管令牌缩减技术(如剪枝和合并)缓解了这一问题,但它们通常忽略了表示在网络深度上的演化。我们提出RAPID,一种深度感知的令牌缩减框架,可根据令牌表示的逐层特征自适应调整缩减策略。主要方法贡献是一种分叉策略:在浅层到中层,RAPID采用冗余相似度感知剪枝度量来消除过度表示的局部模式。当特征在更深层过渡到全局语义概念时,框架转向重要性相似度感知合并机制。该阶段利用分类(CLS)令牌注意力权重来保护语义关键令牌,同时融合不太重要但相似的邻居。在ImageNet-1K上使用ViT和DeiT架构的实验验证表明,与ToMe和ToFu等即插即用基线相比,RAPID建立了更优的精度-压缩帕累托前沿。RAPID在激进压缩场景下尤其鲁棒,在极端缩减率下比ToMe准确率高出4.29%。我们的框架提供了一种免训练模板,通过将缩减策略与层次化特征演化对齐来优化视觉模型。

英文摘要

Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadratic self-attention complexity. Although token reduction techniques such as pruning and merging mitigate this, they typically overlook how representations evolve across network depth. We propose RAPID, a depth-aware token reduction framework that adapts reduction strategies to the layer-wise characteristics of token representations. The primary methodological contribution is a bifurcated strategy: in shallow-to-middle layers, RAPID employs a redundancy-similarity aware pruning metric to eliminate over-represented local patterns. As features transition to global semantic concepts in deeper layers, the framework shifts to an importance-similarity aware merging mechanism. This stage leverages classification (CLS) token attention weights to protect semantically critical tokens while fusing less important but similar neighbors. Empirical validation on ImageNet-1K using ViT and DeiT architectures demonstrates that RAPID establishes a superior accuracy-compression Pareto frontier compared to plug-and-play baselines such as ToMe and ToFu. RAPID is particularly robust in aggressive compression regimes, achieving up to 4.29% higher accuracy than ToMe at extreme reduction rates. Our framework provides a training-free template for optimizing vision models by aligning reduction strategies with hierarchical feature evolution.

2606.08167 2026-06-09 cs.LG cs.AI 交叉投稿

Explaining Data Mixing Scaling Laws

解释数据混合缩放定律

Rui Dai, Shuran Zheng

发表机构 * Beijing Institute of Technology(北京理工大学) IIIS, Tsinghua University(清华大学交叉信息研究院)

AI总结 提出统一框架解释多领域数据混合中模型损失行为,基于能力竞争和噪声减少两个关键因素,在多个尺度上有效预测高性能混合。

Comments Published to ICML 2026

详情
AI中文摘要

最近的研究建立了经验缩放定律来预测多领域数据混合上的模型性能。然而,对这些模型损失行为的理论理解仍然缺失。在这项工作中,我们提出了一个统一框架来解释数据混合的底层机制。我们的方法将最初为标准神经缩放定律(如Kaplan和Chinchilla)开发的理论视角扩展到多领域设置。基于领域在基本技能上重叠而在专门技能上分化的分布假设,我们确定了控制不同数据混合训练模型领域损失的两个关键因素:能力竞争,其中有限模型能力的分配全局耦合了领域损失;以及噪声减少,其中最优权重向更难学习的领域转移以最小化整体噪声。实证评估表明,我们的框架通过以更低的平均相对误差拟合损失景观并识别出更高性能的训练混合,优于现有基线。最重要的是,我们的模型成功跨尺度外推,使用较小尺度上拟合的参数预测大型未见尺度的高效混合。此外,与之前的经验定律相比,我们的模型使用显著更少的参数实现了这些结果。我们的代码可在 https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws 获取。

英文摘要

Recent research has established empirical scaling laws to predict model performance on multi-domain data mixtures. However, a theoretical understanding of these model loss behaviors remains absent. In this work, we propose a unified framework to explain the underlying mechanics of data mixing. Our approach extends theoretical perspectives originally developed for standard neural scaling laws (e.g., Kaplan and Chinchilla) to the multi-domain setting. Based on the distributional assumption that domains overlap on fundamental skills while diverging on specialized skills, we identify two key factors that govern the domain losses of models trained on different data mixtures: Capacity Competition, where the allocation of finite model capacity couples domain losses globally, and Noise Reduction, where optimal weights shift toward harder-to-learn domains to minimize overall noise. Empirical evaluations show that our framework outperforms existing baselines by fitting the loss landscape with a lower Mean Relative Error and identifying higher-performing training mixtures. Most importantly, our model successfully extrapolates across scales, predicting highly effective mixtures for large, unseen scales using parameters fitted on smaller ones. In addition, our model achieves these results using significantly fewer parameters compared to previous empirical laws. Our code is available at https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws.

2606.08191 2026-06-09 cs.LG cs.AI q-bio.QM 交叉投稿

Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation

频域潜在注意力门控用于跨域令牌聚合

Kewei Li, Rongying Zhang, Xueli Wang, Xiwen Gong, Zhongjian Wang, Lan Huang, Ruochi Zhang, Fengfeng Zhou

发表机构 * College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院) Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University(教育部符号计算与知识工程重点实验室) Institute for Quantitative and Computational Biology, University of California(加州大学定量与计算生物学研究所) Greenwich High School(格林威治高中) BCPM Data Limited(BCPM数据有限公司)

AI总结 提出FLaG模块,通过实FFT变换、可学习潜在查询的频谱分量汇总、通道门控和时域重建,实现跨域令牌聚合,在AMP预测、图像分类和文本分类任务上取得提升。

详情
AI中文摘要

令牌聚合是将令牌表示映射到样本级预测的模型中的常见瓶颈,然而大多数池化方法仅在原始令牌域中操作。我们提出FLaG,一个即插即用的聚合模块,它使用实FFT变换令牌表示,用可学习的潜在查询汇总频谱分量,应用通道门控,并重建增强的时域令牌以进行最终池化。我们在使用ESM2的抗菌肽(AMP)活性预测、使用ResNet18在CIFAR-10和CIFAR-100上的图像分类,以及使用RoBERTa在IMDB和GLUE上的文本分类中评估FLaG。FLaG在ESM2-8M抗菌肽任务和CIFAR-100上取得了最明显的提升,同时在IMDB和GLUE上与强文本基线保持竞争力。然后,我们通过频带消融、门控汇总、残基扰动、潜在查询读出和结构代理分层来探究其在AMP设置中的行为。我们发现低频带贡献最大,其余高频带模式更具样本特异性。门控充当广泛共享的频谱重加权阶段,交叉注意力模式是样本特异性的,具有轻微的查询差异,并且高螺旋肽在两种细菌中表现出更强的平均频谱敏感性。补充材料、源代码和数据发布在https://www.healthinformaticslab.org/supp/ 和 https://github.com/Kewei2023/AMPCliff/tree/FLaG。

英文摘要

Token aggregation is a common bottleneck in models that map token representations to sample-level predictions, yet most pooling methods operate only in the original token domain. We propose FLaG, a plug-in aggregation module that transforms token representations with the real FFT, summarizes spectral components with learnable latent queries, applies a channel-wise gate, and reconstructs enhanced time-domain tokens for final pooling. We evaluate FLaG on antimicrobial peptide (AMP) activity prediction with ESM2, image classification with ResNet18 on CIFAR-10 and CIFAR-100, and text classification with RoBERTa on IMDB and GLUE. FLaG achieves its clearest gains on the ESM2-8M antimicrobial peptide tasks and on CIFAR-100, while remaining competitive with strong text baselines on IMDB and GLUE. Then we probe its behavior on the AMP setting with band knockouts, gate summaries, residue perturbations, latent-query readouts, and structure-proxy stratification. We find that low-frequency bands contribute the most overall, and the remaining higher-band pattern is more sample-specific. The gate acts as a broadly shared spectral reweighting stage and the cross-attention patterns are sample-specific with mild query-wise differentiation, and higher-helix peptides exhibit stronger average spectral sensitivity in both bacteria. The supplementary materials, source code and data are released at https://www.healthinformaticslab.org/supp/ and https://github.com/Kewei2023/AMPCliff/tree/FLaG.

2606.08196 2026-06-09 stat.ML cs.AI cs.LG stat.ME 交叉投稿

Beyond Additivity: Causal Discovery in Location-Scale Noise Models with Hidden Variables

超越可加性:含隐变量的位置-尺度噪声模型中的因果发现

Mariyam Khan, Shohei Shimizu, Thong Pham

发表机构 * RIKEN AIP(理化学研究所革新智能统合研究中心) University of Bergen(卑尔根大学) The University of Osaka(大阪大学) Shiga University(滋贺大学)

AI总结 针对含隐变量且数据生成过程遵循位置-尺度噪声模型(LSNM)的因果发现,证明满足无弓条件的非循环有向混合图(ADMG)可识别,并提出两阶段算法LSNM-UV,在异方差数据上优于可加性基线。

Comments 33 pages, 4 figures

详情
AI中文摘要

我们研究当某些变量隐藏且数据生成过程遵循位置-尺度噪声模型(LSNM)时,从观测数据进行因果发现的问题。现有处理隐藏混杂变量的方法通常假设可加性噪声,但在实践中,原因不仅调节其效应的均值,还调节方差。我们证明,满足无弓条件的非循环有向混合图(ADMG)在含隐变量的LSNM下是可识别的,建立了超越噪声可加性的因果不足模型的第一个可识别性结果。我们进一步提供了即使违反无弓假设时识别因果方向的充分条件。我们的两阶段算法LSNM-UV是正确且完备的,实验表明在异方差数据上优于可加性基线方法。

英文摘要

We study causal discovery from observational data when some variables are hidden and the data-generating process follows a location-scale noise model (LSNM). Existing methods that handle hidden confounders typically assume additive noise, but in practice, causes often modulate not just the mean but also the variance of their effects. We prove that acyclic directed mixed graphs (ADMGs) satisfying a bow-free condition are identifiable under LSNM with hidden variables, establishing the first identifiability result for causally insufficient models beyond noise additivity. We further provide sufficient conditions for identifying causal direction even when the bow-free assumption is violated. Our two-stage algorithm, LSNM-UV, is sound and complete, and experiments demonstrate improved performance over additive baselines on heteroscedastic data.
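As a reminder of what distinguishes a location-scale noise model from an additive one, the sketch below samples from the standard LSNM form $Y = f(X) + g(X)\,N$ with $g > 0$ and noise $N$ independent of $X$. The particular choices of $f$, $g$, the uniform input range, and the helper names are illustrative assumptions; hidden variables and the LSNM-UV algorithm itself are not modeled here.

```python
import random
import statistics

def sample_lsnm(n, f, g, seed=0):
    """Draw n pairs (x, y) with y = f(x) + g(x) * noise, noise independent of x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        noise = rng.gauss(0.0, 1.0)
        data.append((x, f(x) + g(x) * noise))
    return data

def residual_variance(data, f, lo, hi):
    """Variance of y - f(x) restricted to samples with lo <= |x| < hi."""
    residuals = [y - f(x) for x, y in data if lo <= abs(x) < hi]
    return statistics.pvariance(residuals)
```

With `g = lambda x: 0.1 + abs(x)` the residual variance grows with $|x|$ (heteroscedastic), whereas a constant `g` reduces the model to the additive-noise case, in which the residual variance is roughly the same everywhere.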

2606.08218 2026-06-09 cs.LG cs.AI math.ST stat.ML stat.TH 交叉投稿

How Deep Are Deep GPs, Really? A Sharp Threshold and a Non-Gaussian Limit for Compositional GPs

深度高斯过程到底有多深?组合高斯过程的尖锐阈值与非高斯极限

Mark Kozdoba, Shie Mannor

发表机构 * Technion, IIT(以色列理工学院) NVIDIA(英伟达)

AI总结 本文研究了深度高斯过程先验在深度增长时的极限行为,识别出RBF核带宽的尖锐阈值,低于该阈值时先验收敛到非退化非高斯分布,具有非零坐标依赖。

详情
AI中文摘要

组合先验描述了深度贝叶斯模型中分层函数的通用属性,其中随机权重的深度神经网络是一个典型例子。在宽网络极限下,先验是一个具有深度相关核的高斯过程,其随深度增长的行为已通过该核得到广泛研究。这里,我们研究另一种情况,其中每一层本身是一个向量值高斯过程,我们的目标类似地理解先验随深度增长的极限行为。先前的高斯过程工作已确定,对于RBF核和一定范围的带宽$r$,先验在极限下退化,收敛到常数函数集——这作为概率模型是无用的。在本文中,我们建立了几个新结果。首先,我们识别出一个尖锐的带宽阈值$r_c(d) = Θ(\sqrt{d})$,高于该阈值极限是退化的,加强了先前的界限。其次,更重要的是,我们证明对于低于阈值$r_c(d)$的$r$,先验收敛到极限分布$π_{\bar{Z}}$。我们还证明这些分布是非退化且非高斯的,坐标之间具有非消失的依赖性。与先前已知的退化机制相反,深度高斯过程先验因此可以允许非平凡极限。实验上,我们在维度$d$的范围内验证了该阈值,并展示了极限分布$π_{\bar{Z}}$的复杂多模态行为——该机制随$d$增长而变得狭窄,且在不了解阈值的情况下难以识别。

英文摘要

Compositional priors describe the generic properties of layered functions in deep Bayesian models, where deep neural networks with random weights are a canonical example. In the wide-network limit, the prior is a Gaussian process with a depth-dependent kernel, and its behaviour as depth grows has been extensively studied through this kernel. Here, we study another case, where each layer itself is a vector-valued Gaussian process, and our aim is similarly to understand the limiting behaviour of the prior as depth grows. Previous GP work has established that for the RBF kernel and a certain range of bandwidths $r$, the prior degenerates in the limit, converging to the set of constant functions -- which is not useful as a probabilistic model. In this paper we establish several new results. First, we identify a sharp bandwidth threshold $r_c(d) = \Theta(\sqrt{d})$ above which the limit is degenerate, strengthening the earlier bounds. Second, and more importantly, we show that for $r$ below the threshold $r_c(d)$ the prior converges to a limit distribution $\pi_{\bar{Z}}$. We also prove that these distributions are non-degenerate and non-Gaussian, with non-vanishing dependence between coordinates. In contrast to the previously known degenerate regime, deep Gaussian process priors can therefore admit non-trivial limits. Empirically, we verify the threshold across a range of dimensions $d$, and demonstrate a complex multimodal behaviour of the limit distributions $\pi_{\bar{Z}}$ -- a regime that becomes increasingly narrow with $d$ and would be hard to identify without knowing the threshold.

2606.08327 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

Chiaroscuro Attention: Spending Compute in the Dark

明暗对比注意力:在黑暗中投入计算

Prateek Kumar Sikdar

发表机构 * Accenture(埃森哲)

AI总结 提出CHIAR-Former,一种基于谱熵路由的混合Transformer,通过DCT谱混合与全注意力互补,在WikiText-103上以62.5%更少注意力FLOPs实现PPL 36.54,较全注意力基线提升45%。

Comments 8 pages, 6 figures, 3 tables

详情
AI中文摘要

标准Transformer在每一层和每个标记上统一应用自注意力,无论输入是否需要动态的跨标记交互。我们提出CHIAR-Former(明暗对比注意力),一种4层混合Transformer,它基于每个标记的谱熵(一种理论上合理的复杂度信号)将每个标记路由到三个算子之一:DCT谱混合、RBF核混合或全自注意力。通过在WikiText-103上的系统消融,我们发现路由崩溃:路由器持续拒绝RBF而偏向DCT和注意力,表明谱混合和动态注意力是互补且充分的。一个专门设计的仅DCT+注意力变体在WikiText-103上达到验证集PPL 36.54——相比全注意力基线(PPL 66.62)提升45%,同时减少62.5%的注意力FLOPs。我们将评估扩展到WikiText-2、IMDB情感分类和合成ListOps操作,建立了一个清晰的操作区间:CHIAR-Former在大型自然文本上表现出色,其中标记多样性支持谱专门化,而全注意力在小数据集和合成模式匹配任务上仍保持优势。这些发现——无论是成功还是失败——共同定义了谱路由何时以及为何值得使用。

英文摘要

Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators - DCT spectral mixing, RBF kernel mixing, or full self-attention - based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves Val PPL 36.54 on WikiText-103 - a 45% improvement over a full-attention baseline (PPL 66.62) at 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings - both the wins and the losses - together define when and why spectral routing earns its keep.
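
The routing signal described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the entropy is taken over the DCT power spectrum of each token's embedding vector, the router is a hard threshold, and the names `dct2`, `spectral_entropy`, and `route` are hypothetical rather than the paper's API.

```python
import numpy as np

def dct2(x):
    """Unnormalised DCT-II along the last axis, built from its definition."""
    n = x.shape[-1]
    k = np.arange(n)[:, None]                    # frequency index
    i = np.arange(n)[None, :]                    # sample index
    basis = np.cos(np.pi * (i + 0.5) * k / n)    # (n, n), rows are frequencies
    return x @ basis.T

def spectral_entropy(x, eps=1e-12):
    """Shannon entropy of the DCT power spectrum, scaled by log(n)."""
    power = dct2(x) ** 2
    p = power / (power.sum(axis=-1, keepdims=True) + eps)
    h = -(p * np.log(p + eps)).sum(axis=-1)
    return h / np.log(x.shape[-1])

def route(tokens, threshold=0.5):
    """Hypothetical hard router: 0 = DCT mixing, 1 = full attention."""
    return (spectral_entropy(tokens) > threshold).astype(int)
```

A token whose energy sits in a single DCT coefficient has entropy near zero and is sent to the cheap spectral operator, while a token with a flat spectrum is sent to attention.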

2606.08369 2026-06-09 cs.LG cs.AI 交叉投稿

An Information-Theoretic Definition for Open-Ended Learning

开放学习的信息论定义

Wanqiao Xu, Yifan Zhu, Benjamin Van Roy

发表机构 * Stanford University(斯坦福大学)

AI总结 提出基于比特等价的信息论定义开放环境,证明经典赌博机非开放,设计算法实现开放学习。

详情
AI中文摘要

越来越多的研究表明,能够在开放环境中持续扩展能力的AI系统具有巨大潜力。但目前尚无关于开放性的统一定义或关于智能体应如何探索开放环境的理论。我们基于一个新概念——比特等价——引入了一个信息论定义,该概念量化了达到每个期望奖励水平所需的信息。我们认为,如果智能体能够实现比特等价的线性增长,则该环境是开放的。我们证明了经典赌博机环境不是开放的,并构建了一个开放赌博机环境。我们还提出了一种在该环境中实现开放学习的算法。

英文摘要

A growing body of work points to the great promise of AI systems that can continually expand their capabilities as they operate in an open-ended environment. Yet there is no coherent definition of open-endedness or theory about how an agent ought to explore an open-ended environment. We introduce an information-theoretic definition based on a new concept -- the bit-equivalent -- which quantifies the information required to attain each level of expected reward. We consider an environment to be open-ended if an agent can attain linear growth in the bit-equivalent. We establish that classical bandit environments are not open-ended and formulate a bandit environment that is. We also introduce an algorithm that achieves open-ended learning in this environment.

2606.08382 2026-06-09 cs.LG cs.AI 交叉投稿

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

STAR-KV:通过软阈值实现自适应秩控制的低秩KV缓存压缩

Priyansh Bhatnagar, Ashkan Moradifirouzabadi, Se-Hyun Yang, SeungJae Lee, Jungwook Choi, Mingu Kang

发表机构 * University of Washington(华盛顿大学)

AI总结 提出STAR-KV框架,通过可微阈值机制实现注意力头和块级别的自适应秩选择,结合混合分解和低秩感知混合精度量化,在多种LLM上达到75%的KV缓存压缩,结合量化可减少20倍,并实现6.9倍注意力模块加速和3.1倍端到端生成吞吐提升。

详情
AI中文摘要

低秩投影通过利用隐藏维度冗余已成为压缩KV缓存的一种有前景的方法。然而,先前的方法依赖于固定或启发式秩选择,难以在最小精度损失下实现激进压缩。我们提出STAR-KV,一种具有细粒度秩控制的自适应低秩KV缓存压缩框架。STAR-KV包括:1)可微阈值机制,可在注意力头和块级别实现最优秩选择;2)混合分解策略,根据键和值投影的敏感性应用不同的低秩分解;3)低秩感知混合精度量化,利用数据统计实现近乎无损的低比特量化。在多个LLM和基准测试中评估,STAR-KV实现了高达75%的KV缓存压缩,结合量化可实现高达20倍的整体KV缓存减少。通过基于Triton的自定义GPU内核,STAR-KV为注意力模块提供高达6.9倍的加速,端到端生成吞吐量提升3.1倍。我们的代码公开在:https://github.com/PriyanshBhatnagar/STAR-KV。

英文摘要

Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive compression with minimal accuracy degradation. We propose STAR-KV, an adaptive low-rank KV cache compression framework with fine-grained rank control. STAR-KV encompasses 1) a differentiable thresholding mechanism that enables optimal rank selection at both attention-head and block levels, 2) a hybrid decomposition strategy that applies different low-rank factorizations according to the sensitivity of key and value projections, and 3) a low-rank-aware mixed precision quantization that leverages data statistics for near lossless low-bit quantization. Evaluated across multiple LLMs and benchmarks, STAR-KV achieves up to 75% KV cache compression and up to 20x overall KV cache reduction when combined with quantization. Enabled by custom Triton-based GPU kernels, STAR-KV delivers up to 6.9x speedup for the attention module and 3.1x end-to-end generation throughput. Our code is publicly available at: https://github.com/PriyanshBhatnagar/STAR-KV.
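
As a rough illustration of how soft thresholding can select a rank, the sketch below shrinks the singular values of a projection matrix and keeps only the components that survive. This is the classical singular-value soft-thresholding operator, not the paper's differentiable per-head and per-block mechanism; the function name and the fixed `tau` are assumptions.

```python
import numpy as np

def soft_threshold_rank(W, tau):
    """Shrink singular values by tau; the rank is the number that stay positive."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)        # soft thresholding
    r = int(np.count_nonzero(s_shrunk))        # adaptive rank
    A = U[:, :r] * s_shrunk[:r]                # (m, r) left factor
    B = Vt[:r]                                 # (r, n) right factor
    return A, B, r
```

Raising `tau` lowers the retained rank smoothly, which is why a learned threshold can act as a continuous knob for rank control.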

2606.08446 2026-06-09 cs.LG cs.AI 交叉投稿

Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models

Sparrow: 用于大语言模型稳定高效长上下文强化学习的稀疏 rollout

Yang Zhou, Ranajoy Sadhukhan, Zhaofeng Sun, Zhuoming Chen, Souvik Kundu, Saket Dingliwal, Sai Muralidhar Jayanthi, Aram Galstyan, Haizhong Zheng, Beidi Chen

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Cornell University(康奈尔大学) Intel(英特尔) Amazon AGI(亚马逊AGI)

AI总结 针对RLVR中长上下文rollout计算昂贵的问题,提出Sparrow方法,通过动态稀疏度调度保持token级策略失配的下尾统计量稳定,在Qwen3系列模型上实现2.0-2.4倍加速,并推广到更大模型和编程领域。

详情
AI中文摘要

尽管强大,但带有可验证奖励的强化学习(RLVR)会诱导极长的思维链(COT),使其计算成本高昂。由于RLVR每步成本主要由长上下文rollout生成主导,稀疏注意力为加速密集rollout提供了一种有前景的方法。然而,稀疏rollout需要精细的稳定性-效率权衡:过于激进的稀疏性会导致崩溃,而过于宽松的稀疏性则加速不足。在这项工作中,我们通过稀疏到密集的演员-策略失配来研究这种权衡。我们首先观察到,稀疏rollout崩溃并非由token间的均匀退化驱动:即使在激进的稀疏性下,大多数稀疏token也能与密集token完美对齐。受此启发,我们假设如果每个token的演员-策略失配的下尾在整个轨迹中保持在临界阈值以上,则稀疏rollout训练保持稳定。我们引入一种动态稀疏度调度,在生成过程中保持该尾统计量恒定,并验证了我们的假设。在Qwen3思考族模型上,将尾失配统计量保持在一致阈值附近通常能实现稳定训练。然后,我们使用成本模型在该失配阈值下找到最大加速的稀疏度调度,在训练Qwen3-1.7B、Qwen3-4B和Qwen3-8B时分别实现了2.2倍、2.4倍和2.0倍的rollout加速。实验表明,这些阈值可推广到更大的模型(Qwen3-14B)和另一个RL领域(编程)。最后,我们的分析自然引出了DistillSparse:在稀疏rollout上进行轻量级基于LoRA的蒸馏,使更激进的稀疏性达到相同的稀疏到密集失配阈值,从而获得更高的加速。

英文摘要

Despite being powerful, reinforcement learning with verifiable rewards (RLVR) induces extremely long chains of thought (CoT), making it computationally expensive. Since RLVR per-step cost is dominated by long-context rollout generation, sparse attention offers a promising way to accelerate dense rollout. However, sparse rollouts require a delicate stability-efficiency tradeoff: overly aggressive sparsity causes collapse, while overly lenient sparsity gives insufficient speedup. In this work, we study this tradeoff through sparse-to-dense actor-policy mismatch. We first observe that sparse rollout collapse is not driven by uniform degradation across tokens: most sparse tokens align perfectly with their dense counterparts even under aggressive sparsity. Motivated by this, we hypothesize that sparse rollout training remains stable if the lower tail of per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. We introduce a dynamic sparsity schedule that keeps this tail statistic constant during generation and validate our hypothesis. Across Qwen3 thinking-family models, keeping the tail mismatch statistic near a consistent threshold generally enables stable training. We then use a cost model to find the sparsity schedule for maximum speedup under this mismatch threshold, achieving 2.2x, 2.4x, and 2.0x rollout speedups when training Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. Empirically, we show the thresholds generalize to a larger model (Qwen3-14B) and another RL domain (coding). Finally, our analysis naturally motivates DistillSparse: lightweight LoRA-based distillation on sparse rollouts lets more aggressive sparsity reach the same sparse-to-dense mismatch threshold, yielding higher speedup.

2606.08447 2026-06-09 cs.LG cs.AI 交叉投稿

Not Just After One: Sleep-Inspired Replay Prevents Catastrophic Forgetting After Sequential Tasks

不仅仅是在一次之后:受睡眠启发的回放防止顺序任务后的灾难性遗忘

Anthony Bazhenov, Jean Erik Delanois, Giri P. Krishnan

发表机构 * Department of Neuroscience, University of California, San Diego, CA, USA(神经科学系,加州大学圣地亚哥分校,美国加利福尼亚州圣地亚哥)

AI总结 提出受睡眠启发的无监督回放机制,在多个新任务顺序训练后应用,以部分恢复所有先前学习任务的性能,防止灾难性遗忘。

详情
AI中文摘要

人工神经网络的关键限制之一是缺乏持续学习的能力:在新任务上训练常常导致对先前任务的干扰和遗忘。尽管已有几种算法被提出以保护旧记忆免受干扰,但它们通常在每个新训练阶段期间或之后立即应用。相比之下,人类和动物可以持续学习,在主动学习期间获取多个新记忆,然后将它们全部巩固到长期存储中。在这里,我们展示了多个新任务可以顺序训练,然后应用无监督的睡眠样回放阶段,以部分恢复所有先前学习任务的性能。我们的研究进一步表明,任务特定信息对新训练具有弹性,但随着网络在新任务上训练而逐渐衰减。这些发现为开发广泛范围的持续学习AI解决方案提供了新颖的原则。

英文摘要

One of the critical limitations of artificial neural networks is their inability to learn continually: training on new tasks often leads to interference and forgetting of the previous ones. While several algorithms have been proposed to protect old memories from interference, they are typically applied during or immediately after each new episode of training. In contrast, humans and animals can learn continuously, acquiring multiple new memories during active learning before consolidating all of them into long-term storage. Here we show that multiple new tasks can be trained sequentially before an unsupervised sleep-like replay phase is applied to partially restore performance across all previously learned tasks. Our study further suggests that task-specific information remains resilient to new training but decays gradually as the network is trained on new tasks. These findings point to novel principles for developing a broad range of continual learning AI solutions.

2606.08480 2026-06-09 cs.LG cs.AI cs.IR 交叉投稿

Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation

生成式推荐中噪声鲁棒GRPO的自适应损失平衡

Kewei Xu, Junbo Qi, Yanyan Zou, Pengfei Zhang, Xingzhi Yao, Shengjie Li

发表机构 * JD.com(京东) Waseda University(早稻田大学) University of Electronic Science and Technology of China(电子科技大学)

AI总结 针对生成式推荐中奖励模型因曝光偏差导致噪声的问题,提出AdaGRPO框架,通过策略难度和奖励可区分性诊断动态切换GRPO与监督学习,在电商数据集上提升召回率并抑制幻觉。

详情
AI中文摘要

强化学习为超越监督模仿的生成式推荐提供了有前景的途径,通过利用奖励信号指导策略改进。然而,其有效性关键取决于奖励模型对所评估样本的可信度。实践中,广泛采用的奖励模型——生产级排序器,是在有曝光偏差的日志上训练的,导致样本相关的误差,违反了这一假设。我们的分层分析揭示了一个一致的模式:当策略表现出不确定性且排序器能有效区分真实物品与rollout负样本时,奖励指导最为有益。在其他样本上,奖励信号要么可忽略,要么有害,凸显了统一应用RL的风险。为解决此问题,我们引入AdaGRPO,一种新颖框架,将奖励指导优化视为选择性准入而非统一压力。训练以监督负对数似然为基础,而GRPO目标由基于两个rollout诊断(策略侧难度和奖励可区分性)的逐样本二元裁剪门控。未通过任一诊断的实例退化为纯监督,确保稳定性并减轻噪声梯度的放大。我们在大规模电商数据集上验证了AdaGRPO。在最佳中间检查点,它将HR@10从11.01%提升至12.18%,同时将幻觉限制在0.22%以下,并在最终检查点保持鲁棒性(HR@10 11.63%,幻觉0.27%),在检索-有效性前沿上优于固定NLL-GRPO混合。在生产A/B测试中,AdaGRPO在点击率和停留时间上实现了统计显著的提升,证实了其实用价值。

英文摘要

Reinforcement learning (RL) presents a promising avenue for enhancing generative recommendation beyond supervised imitation, leveraging reward signals to guide policy improvement. However, its efficacy is critically contingent on the trustworthiness of the reward model for the samples it evaluates. In practice, production rankers, the widely adopted reward models, are trained on exposure-biased logs, leading to sample-dependent inaccuracies that violate this assumption. Our stratified analysis uncovers a consistent pattern: reward guidance is most beneficial when the policy exhibits uncertainty and the ranker can effectively discriminate the ground-truth item from rollout negatives. On other samples, the reward signal is either negligible or detrimental, highlighting the risk of uniform RL application. To address such an issue, we introduce AdaGRPO, a novel framework that treats reward-guided optimization as selective admission rather than uniform pressure. Training is anchored in supervised negative log-likelihood, while the GRPO objective is gated by a binary, per-sample clip determined by two rollout diagnostics: policy-side difficulty and reward discriminability. Instances failing either diagnostic default to pure supervision, ensuring stability and mitigating the amplification of noisy gradients. We validate AdaGRPO on a large-scale e-commerce dataset. At the best intermediate checkpoint, it elevates HR@10 from 11.01% to 12.18% while constraining hallucination below 0.22%, and maintains robustness at the final checkpoint (HR@10 11.63%, hallucination 0.27%), outperforming fixed NLL--GRPO mixtures across the retrieval--validity frontier. In production A/B tests, AdaGRPO achieves statistically significant gains in click-through rate and dwell time, confirming its practical utility.
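
The gating idea can be sketched in a few lines. The abstract does not define how the two diagnostics are computed or thresholded, so the scalar inputs, the thresholds `d_min` and `r_min`, and the function name below are all hypothetical.

```python
def gated_loss(nll, grpo, difficulty, discriminability,
               d_min=0.5, r_min=0.5, weight=1.0):
    """Per-sample loss: NLL anchor plus a GRPO term admitted by a binary gate.

    The gate opens only when BOTH diagnostics pass; otherwise the sample
    falls back to pure supervision. Thresholds are illustrative assumptions.
    """
    gate = 1.0 if (difficulty >= d_min and discriminability >= r_min) else 0.0
    return nll + weight * gate * grpo
```

Because the gate is binary, a sample with an untrustworthy reward contributes no reward-driven gradient at all rather than a down-weighted one.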

2606.08484 2026-06-09 cs.LG cs.AI 交叉投稿

STELLAR: Spatio-Temporal Environmental Learning with Latent Alignment and Refinement for Long-Tailed Species Distribution Modeling

STELLAR: 面向长尾物种分布建模的时空环境学习与潜在对齐精炼

Shufeng Kong, Tao Yu, Yuanyuan Wei, Caihua Liu, Junwen Bai, Yingheng Wang, Marc Grimson, Daniel Fink, Carla P. Gomes

发表机构 * Sun Yat-sen University(中山大学) Cornell University(康奈尔大学) Foshan University(佛山大学) Cornell Lab of Ornithology(康奈尔鸟类学实验室)

AI总结 提出STELLAR框架,通过图-时间编码器、上下文锚定潜在对齐和不平衡感知解码模块,联合优化动态栖息地上下文和群落结构,有效解决物种分布建模中的时空耦合与长尾不平衡问题。

Comments Accepted by IJCAI 2026

详情
AI中文摘要

联合物种分布建模(JSDM)是生物多样性监测和保护规划的关键工具。然而,准确的JSDM面临两个耦合挑战:环境驱动因素和物种分布本质上是时空的,而物种共现模式表现出复杂的非线性群落结构以及由稀有物种导致的严重长尾不平衡。现有方法通常孤立地处理这些因素,从静态协变量中学习或忽略动态群落结构的历史轨迹。为克服这些限制,我们提出STELLAR(时空环境学习与潜在对齐精炼),一种新颖的框架,学习一个共享潜在空间,其中动态栖息地上下文和群落结构被联合优化。我们的方法整合了三个互补组件:(1)图-时间编码器,采用图注意力和循环单元来聚合空间邻域效应并捕捉环境上下文和群落结构的共同演化历史动态;(2)上下文锚定潜在对齐机制,利用标签激活的混合先验和监督对比学习结构化潜在空间,基于共享环境偏好主动聚类物种;(3)不平衡感知解耦解码模块,利用非对称损失聚焦于困难稀有物种样本的学习,防止长尾中的模式崩溃。在领域专家精心整理的大规模eBird数据集上的实验表明,我们的框架显著优于最先进的基线,特别是在预测稀有物种和揭示可解释的物种相互作用方面。

英文摘要

Joint Species Distribution Modeling (JSDM) is a key enabler for biodiversity monitoring and conservation planning. However, accurate JSDM faces two coupled challenges: environmental drivers and species distributions are inherently spatio-temporal, while species co-occurrence patterns exhibit complex non-linear community structure and severe long-tail imbalance driven by rare species. Existing approaches often address these factors in isolation, learning from static covariates or neglecting the historical trajectories of dynamic community structure. To overcome these limitations, we propose STELLAR (Spatio-Temporal Environmental Learning with Latent Alignment and Refinement), a novel framework that learns a shared latent space where dynamic habitat context and community structure are optimized jointly. Our approach integrates three complementary components: (1) a Graph-Temporal Encoder that employs graph attention and recurrent units to aggregate spatial neighborhood effects and capture the co-evolving historical dynamics of environmental context and community structure; (2) a Context-Anchored Latent Alignment mechanism that structures the latent space using a label-activated mixture prior and supervised contrastive learning, actively clustering species based on shared environmental preferences; and (3) an Imbalance-Aware Decoupled Decoding module that utilizes Asymmetric Loss to focus learning on hard, rare species samples, preventing mode collapse in the long tail. Experiments on the large-scale eBird dataset, curated with domain experts, demonstrate that our framework significantly outperforms state-of-the-art baselines, particularly in predicting rare species and revealing interpretable species interactions.

2606.08565 2026-06-09 cs.LG cs.AI 交叉投稿

EinSort: Sorting is All We Need for Tensorizing LLM

EinSort: 张量化大语言模型,排序即一切

Toshiaki Koike-Akino, Jing Liu, Ye Wang

发表机构 * Toshiaki Koike-Akino Jing Liu Ye Wang

AI总结 提出EinSort方法,通过索引排序发现张量中的低秩结构,实现大语言模型权重和KV缓存的张量化压缩,相比基线方法提升了重构质量。

Comments 38 pages, 17 figures

详情
AI中文摘要

张量网络为压缩大型神经网络提供了高效的表示。通过精心设计形状和拓扑,它们可以显著减少内存和计算成本。然而,由于大型基础模型的巨大规模和非结构化的权重分布,识别其中的隐式低秩结构仍然具有挑战性。我们提出了一种自适应张量化方法,通过索引排序发现目标张量中的固有低秩结构。在权重和KV缓存压缩上的实验表明,与基线方法相比,重构质量得到了提升。

英文摘要

Tensor networks provide efficient representations for compressing large neural networks. By carefully designing shapes and topologies, they can significantly reduce memory and computational costs. However, identifying implicit low-rank structures in large foundation models remains challenging due to their enormous scale and unstructured weight distributions. We propose an adaptive tensorization method that discovers inherent low-rank structure in a target tensor by index ordering. Experiments on weight and KV-cache compression demonstrate improved reconstruction quality compared to baselines.

2606.08602 2026-06-09 cs.LG cs.AI 交叉投稿

Reinforcement Learning for Flow-Matching Policies with Density Transport

基于密度传输的流匹配策略强化学习

Boshu Lei, Kostas Daniilidis, Antonio Loquercio

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出在线强化学习算法RLDT,利用Stein变分梯度下降构建传输场,微调预训练流匹配策略,通过期望目标估计稳定训练,在连续控制任务中优于基线方法。

详情
AI中文摘要

我们提出了一种在线强化学习(RL)算法,用于微调连续控制问题中的流匹配策略。我们的关键见解是将基于RL的策略改进视为将动作密度向高奖励区域传输,这自然与流匹配模型的传输公式一致。先前的方法要么近似当前或最优策略分布,要么采用蒸馏,这引入了有偏梯度或牺牲了多模态建模能力。相比之下,我们提出的基于密度传输的RL方法(称为RLDT)使用Stein变分梯度下降(SVGD)从最大熵RL目标构建传输场,然后微调预训练的流匹配策略以与该场对齐。使用这种对齐目标进行训练并非易事,因为流匹配策略通过多步过程生成动作,使得直接的基于梯度的优化具有挑战性。为了克服这一挑战并稳定训练,我们通过期望目标估计从中间去噪步骤近似策略动作。这使得传输场更新能够传播到网络参数中,而无需通过时间进行不稳定的反向传播。实验结果表明,RLDT在奖励质量和收敛速度方面优于竞争基线。该性能在多种连续控制任务中保持一致,包括密集和稀疏奖励,以及基于状态和视觉的长期机器人操作。项目网页为https://rpfey.github.io/rldt/。

英文摘要

We present an online reinforcement learning (RL) algorithm for fine-tuning flow-matching policies in continuous-control problems. Our key insight is to view RL-based policy improvement as a transport of action densities towards regions of high reward, which naturally aligns with the transport formulation of flow matching models. Prior methods either approximate the current or optimal policy distribution or resort to distillation, which introduces biased gradients or sacrifices multimodal modeling capacity. In contrast, our approach for RL with Density Transport, which we name RLDT, constructs a transport field from a maximum-entropy RL objective using Stein Variational Gradient Descent (SVGD). Then, it fine-tunes a pretrained flow matching policy to align with this field. Training with this alignment objective is nontrivial because flow-matching policies generate actions via a multi-step process, making direct gradient-based optimization challenging. To overcome this challenge and stabilize training, we approximate policy actions from intermediate denoising steps via expected-target estimation. This allows the transport-field update to propagate into the network parameters without unstable backpropagation through time. Experimental results demonstrate that RLDT outperforms competitive baselines in reward quality and convergence speed. This performance holds across diverse continuous-control tasks, encompassing both dense and sparse rewards, as well as state- and vision-based long-horizon robot manipulation. The project webpage is https://rpfey.github.io/rldt/.

2606.08797 2026-06-09 cs.LG cs.AI 交叉投稿

Scaling Decision-Focused Learning to Large Problems with Lagrangian Decomposition

通过拉格朗日分解将决策聚焦学习扩展到大规模问题

Stéphane Eilles-Chan Way, Hugo Percot, Quentin Cappart, Tias Guns, Louis-Martin Rousseau

发表机构 * Polytechnique Montréal(蒙特利尔综合理工学院) Ecole Polytechnique(巴黎综合理工学院) UCLouvain(鲁汶大学) Mila - Québec AI Institute(魁北克人工智能研究所) KU Leuven(荷语鲁汶大学)

AI总结 提出结合拉格朗日分解的决策聚焦学习框架,通过新代理目标和两种损失函数,在保持可并行化的同时,有效处理大规模约束优化问题,实验表明在变量数多八倍的实例上优于传统方法。

详情
AI中文摘要

决策聚焦学习在解决预测-优化问题中显示出巨大潜力,尤其是在模型欠规范的情况下。然而,其实际部署常因高计算成本和有限的可扩展性而受阻,因为需要在每次迭代中对每个训练实例求解一个约束优化问题。为解决这些挑战,我们提出了一种新颖的框架,将拉格朗日分解融入决策聚焦学习范式。具体而言,我们引入了一个新的代理目标以及两个用于评估和训练底层预测模型的损失函数。我们进一步提出了两种变体,它们在计算效率和解决方案质量之间提供了不同的权衡。我们的框架可以无缝集成到标准的决策聚焦学习方法中,包括Smart Predict-then-Optimize (SPO+)和隐式最大似然估计 (IMLE)。通过在两个标准基准测试(多维背包问题和二次投资组合优化)上的实验,我们证明了我们的方法在保持可并行化的同时实现了有竞争力的性能。特别是,在大规模实例上,它始终优于传统的决策聚焦学习方法,这些实例的变量数比相关工作通常考虑的要多出八倍。实现代码可在 https://github.com/corail-research/DFL-LD 获取。

英文摘要

Decision-focused learning has shown great promise for addressing predict-then-optimize problems, particularly in the presence of under-specified models. However, its practical deployment is often hindered by high computational costs and limited scalability, as it requires solving a constrained optimization problem for each training instance at every iteration. To address these challenges, we propose a novel framework that incorporates Lagrangian decomposition into the decision-focused learning paradigm. Specifically, we introduce a new surrogate objective along with two loss functions for evaluating and training the underlying prediction model. We further propose two variants of our approach, which offer different trade-offs between computational efficiency and solution quality. Our framework can be seamlessly integrated with standard decision-focused learning methods, including Smart Predict-then-Optimize (SPO+) and Implicit Maximum Likelihood Estimation (IMLE). Through experiments on two standard benchmarks, the multi-dimensional knapsack problem and quadratic portfolio optimization, we demonstrate that our approach achieves competitive performance while remaining amenable to parallelization. In particular, it consistently outperforms traditional decision-focused learning methods on large-scale instances, involving up to eight times more variables than those typically considered in related work. The implementation is available at https://github.com/corail-research/DFL-LD.

2606.08854 2026-06-09 cs.LG cs.AI cs.CL stat.ML 交叉投稿

sGPO: Trading Inference FLOPs for Training Efficiency in RLVR

sGPO: 在RLVR中用推理FLOPs换取训练效率

Shivchander Sudalairaj, Kai Xu, Akash Srivastava, Giorgio Giannone

发表机构 * Red Hat(红帽) IBM

AI总结 提出sGPO方法,通过少量推理计算预估查询难度,自适应分配训练预算,将训练计算量降低三倍,同时保持或提升性能。

详情
AI中文摘要

标准的可验证奖励强化学习(RLVR)训练为每个查询分配固定的展开预算,而不考虑每个查询的难度对当前策略的意义。这导致两种对称的失败模式:简单查询产生接近零的优势,因为策略已经解决了它们;而无法解决的查询不产生信号,因为策略从未解决它们。这两种情况都浪费了训练FLOPs,而没有贡献学习梯度。我们引入了排序组策略优化(sGPO),一种计算高效的策略,用少量推理FLOPs换取大量减少浪费的训练FLOPs。关键见解是,廉价的推理计算可以作为查询难度的单一离线代理。通过在初始策略下为每个查询生成一小批并行样本,我们获得了模型感知的经验成功率。这激励将训练展开组大小设置为该成功率的倒数,这是一个实用的规则,通过从每个生成的展开中提取最大优势来最大化样本效率。这一单次性能分析过程同时驱动数据过滤(移除琐碎查询和子采样无法解决的查询)、自适应组大小分配和课程构建(从易到难调度查询)。sGPO匹配或超过基线性能,同时将总训练计算量减少三倍,包括前期的推理性能分析成本。

英文摘要

Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query's difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This motivates setting the training rollout group size to the inverse of this success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. This single profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included.
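
The inverse-success-rate rule can be made concrete with a small planner. This is a sketch under stated assumptions: the cap `max_group`, the deterministic "keep every k-th" stand-in for sub-sampling unsolved queries, and the function name are hypothetical and not taken from the paper.

```python
def plan_rollouts(success_counts, n_probe, max_group=64, keep_unsolved_every=4):
    """Turn probe results into (query_id, group_size) pairs, easiest first.

    success_counts maps query id -> number of correct probe samples out of
    n_probe. Solved-every-time queries are dropped; never-solved queries are
    sub-sampled and given the maximum group size.
    """
    plan, unsolved_seen = [], 0
    for qid, k in success_counts.items():
        if k == n_probe:                      # trivial: near-zero advantage
            continue
        if k == 0:                            # unsolved: keep only a subset
            if unsolved_seen % keep_unsolved_every == 0:
                plan.append((k / n_probe, qid, max_group))
            unsolved_seen += 1
            continue
        group = min(max_group, -(-n_probe // k))   # ceil(1 / success_rate)
        plan.append((k / n_probe, qid, group))
    plan.sort(key=lambda item: -item[0])      # curriculum: easy to hard
    return [(qid, group) for _, qid, group in plan]
```

For example, a query solved in 1 of 8 probe samples gets a group of 8 rollouts, so roughly one success per group is expected during training.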

2606.08935 2026-06-09 cs.LG cs.AI 交叉投稿

PAI: Preserving Amplitude Information in Representation-Based Time-Series Anomaly Detection

PAI:在基于表示的时间序列异常检测中保留振幅信息

Kang Zhang, Wei Jian Lau, Shoushou Ren, Dong Lin, Joon Son Chung, Chuanhao Sun

发表机构 * HUAWEI(华为) KAIST(韩国科学技术院)

AI总结 针对现有基于表示的时间序列异常检测方法忽略振幅信息导致性能下降的问题,提出PAI方案,通过诊断模块和分数增强函数融合振幅相关分数,在TSB-AD-U-Eva和TAB UV数据集上平均VUS-PR提升98.4%和36.8%。

Comments 15 pages

详情
AI中文摘要

基于表示的时间序列异常检测算法在多种异常检测任务上显著优于其他方法。然而,我们在评估中发现它们存在一个主要限制——学习到的嵌入通常是振幅无关的。丢失振幅信息会降低与振幅相关异常的性能,并且这种失败普遍存在于所有现有的基于表示的方法中。为了解决上述问题,我们提出了一种新的异常评分方案PAI。PAI由两个互补模块组成:诊断模块和最终分数增强函数。诊断模块比较同一表示库上的余弦评分和欧几里得评分,以测试振幅信息是否已被捕获到学习到的表示中。然后在最终分数增强函数中,PAI计算逐点中位数和MAD偏差分数以及局部均值偏移分数——这些分数与表示分数融合以产生最终异常分数。在TSB-AD-U-Eva和TAB UV数据集上,PAI在所有报告的指标上改进了所有四种评估的基于表示的方法,平均VUS-PR增益分别为98.4%和36.8%。在所有评估的组合中,PaAno + PAI实现了最佳性能,比最先进的方法高出15%。对bootstrap置信区间、异常类型细分以及TS2Vec输入归一化消融的进一步评估进一步支持了所提出的方案。这些结果表明,显式保留振幅信息对于基于表示的时间序列异常检测非常重要,而这一点在现有的评分方案中未得到充分重视。代码可在https://github.com/pantheon5100/PAI获取。

英文摘要

Representation-based time-series anomaly detection algorithms significantly outperform other methods on diverse anomaly detection tasks. However, our evaluation reveals a major limitation: their learned embeddings are often amplitude-agnostic. Losing amplitude information can degrade performance on amplitude-related anomalies, and this failure is prevalent across all existing representation-based methods. To address this issue, we propose a new anomaly scoring scheme named PAI. PAI consists of two complementary modules: a diagnostic module and a final score augmentation function. The diagnostic module compares cosine and Euclidean scoring on the same representation bank to test whether amplitude information is already captured in the learned representation. Then, in the final score augmentation function, PAI computes a point-wise median and MAD deviation score and a local mean-shift score, which are fused with the representation score to produce the final anomaly score. On the TSB-AD-U-Eva and TAB UV datasets, PAI improves all four evaluated representation-based methods across every reported metric, achieving average VUS-PR gains of 98.4% and 36.8%, respectively. Among all evaluated combinations, PaAno + PAI achieves the best performance, outperforming the state-of-the-art method by 15%. Further evaluations with bootstrap confidence intervals, anomaly-type breakdowns, and a TS2Vec input-normalization ablation also support the proposed scheme. These results suggest that explicitly retaining amplitude information is important for representation-based time-series anomaly detection, which has been underemphasized in existing scoring schemes. Code is available at: https://github.com/pantheon5100/PAI
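
The two amplitude-aware scores named above can be sketched as follows. The window size, the min-max normalisation, and the additive fusion rule are assumptions made for illustration; the abstract does not specify how PAI fuses the scores, and the function names are hypothetical.

```python
import numpy as np

def mad_score(x, eps=1e-8):
    """Point-wise deviation from the median, in units of the MAD."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(x - med) / (mad + eps)

def local_mean_shift_score(x, window=5):
    """|mean of the next `window` points - mean of the previous `window`|."""
    n = len(x)
    out = np.zeros(n)
    for t in range(n):
        left, right = x[max(0, t - window):t], x[t:min(n, t + window)]
        if len(left) and len(right):
            out[t] = abs(right.mean() - left.mean())
    return out

def fuse(rep_score, x, window=5):
    """Assumed fusion: sum of min-max normalised component scores."""
    def norm(s):
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)
    return (norm(rep_score) + norm(mad_score(x))
            + norm(local_mean_shift_score(x, window)))
```

Even when the representation score is uninformative (all zeros), an amplitude spike still dominates the fused score, which is the failure case the scheme is meant to cover.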

2606.09012 2026-06-09 cs.LG cs.AI math.OC stat.ML 交叉投稿

Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

理解量化感知训练:量化权重的梯度偏向低损失盆地

Hanyang Li, Jianhao Ma, Ying Cui

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出统一几何框架解释后训练量化失败与量化感知训练恢复机制,揭示量化感知训练通过梯度感知谷壁使量化点返回低损失盆地。

Comments 31 pages, 10 figures

详情
AI中文摘要

后训练量化(PTQ)将训练好的全精度模型转换为低比特权重,无需任务级重训练,而量化感知训练(QAT)将量化纳入训练循环。尽管PTQ在中等比特宽度下高效且通常准确,但在激进比特宽度下可能急剧失败;QAT成本更高但通常能恢复丢失的精度。我们提出了一个统一的几何框架,同时解释PTQ失败和QAT恢复。我们将全精度训练建模为在更宽的山谷内沿着低损失河流:河流的法向邻域形成近乎平坦的盆地,而离开该盆地会导致损失急剧增加。当量化网格与盆地宽度相当时,局部PTQ目标(包括舍入和基于Hessian的二阶重建)可能选择盆地外的高损失部署量化点,即使附近存在低损失量化点。在这种情况下,基于直通估计器的QAT具有有用的偏差:它在部署的量化权重处评估梯度,同时更新潜在的全精度权重,导致梯度感知谷壁并获得向内分量,从而将后续量化迭代引导回盆地。我们通过局部景观模型形式化这一机制,构造了几何PTQ失败模式,并在局部量化器兼容性假设下证明了有限时间QAT恢复。在多种神经网络量化方案下的视觉和语言模型实验,证实了预测的PTQ跨盆地失败以及相应的QAT恢复机制。

英文摘要

Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantization-aware training (QAT) incorporates quantization into the training loop. Although PTQ is efficient and often accurate at moderate bitwidths, it can fail sharply at aggressive bitwidths; QAT is more expensive but can often recover the lost accuracy. We propose a unified geometric framework that explains both PTQ failure and QAT recovery. We model full-precision training as following a low-loss river inside a wider valley: a normal neighborhood of the river forms a nearly flat basin, while leaving this basin incurs a sharp loss increase. When the quantization grid is comparable to the basin width, local PTQ objectives, including rounding and Hessian-based second-order reconstruction, can select a high-loss deployed quantized point outside the basin even when nearby low-loss quantized points exist. In this regime, straight-through-estimator-based QAT has a useful bias: it evaluates gradients at the deployed quantized weights while updating latent full-precision weights, causing the gradient to sense the valley wall and acquire an inward component that steers subsequent quantized iterates back into the basin. We formalize this mechanism through a local landscape model, construct a geometric PTQ failure mode, and prove finite-time QAT recovery under local quantizer-compatibility assumptions. Experiments across vision and language models under multiple neural-network quantization schemes corroborate the predicted basin-crossing failure of PTQ and the corresponding recovery mechanism of QAT.
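
The straight-through-estimator update described above can be seen in a one-dimensional toy problem. The quadratic loss, the unit grid, and all names below are assumptions chosen for illustration; this is not the paper's landscape model.

```python
import math

TARGET = 0.3          # minimiser of the toy loss (assumed)

def quantize(w, step=1.0):
    """Round to the nearest grid point (ties go up)."""
    return math.floor(w / step + 0.5) * step

def loss(w):
    return 0.5 * (w - TARGET) ** 2

def loss_grad(w):
    return w - TARGET

def ste_qat(w_latent, steps=200, lr=0.1):
    """STE update: gradient is evaluated at the QUANTIZED weight,
    but applied to the latent full-precision weight."""
    for _ in range(steps):
        w_q = quantize(w_latent)
        w_latent -= lr * loss_grad(w_q)
    return w_latent, quantize(w_latent)
```

Starting from a latent weight of 3.2 (quantized to 3, far from the minimum), the gradient measured at the quantized point keeps pushing the latent weight down until the quantized iterate lands on a grid point adjacent to the minimum.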

2606.09028 2026-06-09 cs.CV cs.AI cs.RO 交叉投稿

ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models

ATM:用于诊断和改进潜在世界模型的动作一致性转移矩阵

Jiaheng Chen

发表机构 * School of Software, Northeastern University(东北大学软件学院)

AI总结 提出ATM矩阵,通过轻量级探针比较真实与预测潜在转移中的动作信息,无需模拟器即可诊断世界模型质量,并引入AITS利用动作可识别性作为训练信号提升下游规划。

Comments 13 pages, 3 figures, 6 tables

详情
AI中文摘要

潜在世界模型越来越多地用于控制和目标条件规划,但评估其学习到的表示是否对规划有用通常需要与CEM等规划器耦合的慢速模拟器评估。这种评估是黑盒且依赖于模型复杂度的:在相同协议下,不同世界模型每个检查点可能需要几分钟到几小时。在这项工作中,我们提出了ATM,一个动作一致性转移矩阵,用于诊断潜在转移是否保留了与规划相关的动作语义。ATM通过轻量级事后探针比较真实编码转移和模型预测转移中的动作信息,生成一个可解释的矩阵,揭示表示质量、转移域不一致性和失败模式,而无需模拟器 rollout。它还可以折叠成一个简单的筛选分数,用于跨检查点、变体和世界模型的内部任务排名。当真实成功差距显著时,ATM实现了高度可靠的成对排名,同时将分钟到小时的CEM评估减少到秒级的转移分析,在我们的设置中实现了超过100倍的加速。我们进一步引入了AITS,表明动作可识别性不仅具有诊断作用,而且是一种有用的训练信号,可以在不改变规划器的情况下改进下游规划。

英文摘要

Latent world models are increasingly used for control and goal-conditioned planning, yet assessing whether their learned representations are useful for planning usually requires slow, planner-coupled simulator evaluation with CEM or similar planners. Such evaluation is black-box and model-complexity-dependent: under the same protocol, different world models may require minutes to hours per checkpoint. In this work, we propose ATM, an Action-Consistency Transfer Matrix for diagnosing whether latent transitions preserve action semantics relevant to planning. ATM compares action information in real encoded transitions and model-predicted transitions through lightweight post-hoc probes, producing an interpretable matrix that reveals representation quality, transition-domain inconsistency, and failure modes without simulator rollout. It can also be collapsed into a simple screening score for within-task ranking across checkpoints, variants, and world models. When the true success gap is non-trivial, ATM achieves highly reliable pairwise ranking, while reducing minutes-to-hours CEM evaluation to seconds-level transition analysis, yielding more than 100x speedup in our setup. We further introduce AITS, showing that action-identifiability is not only diagnostic but also a useful training signal for improving downstream planning without changing the planner.

2606.09052 2026-06-09 cs.LG cs.AI cs.CL cs.GT stat.ML 交叉投稿

INFUSER: Influence-Guided Self-Evolution Improves Reasoning

INFUSER: 影响力引导的自我进化提升推理能力

Siyu Chen, Miao Lu, Beining Wu, Heejune Sheen, Fengzhuo Zhang, Shuangning Li, Zhiyuan Li, Jose Blanchet, Tianhao Wang, Zhuoran Yang

发表机构 * Yale University(耶鲁大学) Stanford University(斯坦福大学) University of Chicago(芝加哥大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出INFUSER框架,通过生成器与求解器的协同进化,利用影响力分数和DuGRPO优化,从文档池中自适应生成训练数据,显著提升模型推理性能。

Comments 66 pages, 17 figures

详情
AI中文摘要

自我进化为更强的推理提供了一条可扩展的路径:预训练语言模型仅需极少的外部监督即可自我改进。然而,现有方法要么依赖于大量精心策划或教师生成的训练数据,要么在生成器无监督运行时,使用未必能改进求解器的难度启发式方法对其进行奖励。我们引入了INFUSER,一个迭代协同训练框架,包含两个共同进化的角色:一个生成器,从自动收集的非结构化文档池中起草问题并参考标准答案;一个求解器,通过在这些数据上训练来改进。求解器使用标准正确性奖励(针对生成器提供的答案)进行训练,而生成器则通过一种优化器感知的影响力分数获得奖励,该分数衡量每个提出的问题是否真正能改进求解器在目标分布上的表现。由于这种连续、有噪声的影响力分数不适合标准的GRPO,我们提出了DuGRPO,一种GRPO的双归一化变体,用于生成器训练。这些设计共同将文档池转化为一个自适应课程,倾向于对当前求解器有用的问题,而不仅仅是困难的问题。在Qwen3-8B-Base上,INFUSER在Olympiad和SuperGPQA基准测试中相对于强自我进化基线取得了超过20%的相对改进,并且一个8B的INFUSER协同进化生成器在数学和编程任务上优于冻结的32B思考生成器。消融实验证实了每个设计选择的必要性,两个扩展——将INFUSER应用于指令微调锚点并辅以规则可验证的RLVR数据——进一步展示了该框架的灵活性和泛化能力。代码可在https://github.com/FFishy-git/INFUSER获取。

英文摘要

Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervision. Yet existing methods either depend on extensively curated or teacher-generated training data, or, when the generator runs unsupervised, reward it by a difficulty heuristic that need not improve the solver. We introduce INFUSER, an iterative co-training framework with two co-evolving roles: a Generator that drafts questions and reference golden answers from a pool of unstructured, automatically collected documents, and a Solver that improves by training on them. The solver is trained with standard correctness rewards against the generator-provided answers, while the generator is rewarded by an optimizer-aware influence score that measures whether each proposed question would actually improve the solver on the target distribution. Because this continuous, noisy influence score is poorly served by standard GRPO, we propose DuGRPO, a dual-normalized variant of GRPO, for generator training. Together, these turn the document pool into an adaptive curriculum that favors questions useful to the current solver, not just hard ones. On Qwen3-8B-Base, INFUSER outperforms strong self-evolution baselines with over 20% relative improvement on Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm each design choice is necessary, and two extensions, applying INFUSER to an instruction-finetuned anchor and augmenting it with rule-verifiable RLVR data, further demonstrate the flexibility and generalizability of the framework. Code is available at https://github.com/FFishy-git/INFUSER.

2606.09059 2026-06-09 cs.LG cs.AI cs.CV 交叉投稿

Stage-1 Controls the Entropy Regime, Not the Outcome

Stage-1 控制熵状态,而非最终结果

Jianxiong Shen

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过小数据实验研究两阶段后训练中Stage-1(SFT或OPD)的作用,发现其主要影响策略熵状态,但对最终性能影响有限。

详情
AI中文摘要

两阶段后训练——Stage-1 热启动(监督微调 SFT 或在线策略蒸馏 OPD)后接 Stage-2 强化学习(RL)——越来越多地用于视觉语言模型(VLM)。我们使用 Qwen2.5-VL-7B 和同模态 72B VLM 教师进行 OPD,在小数据研究中探究 Stage-1 实际控制什么。首先,三种热启动在 Geometry3K 内部验证集上达到狭窄的 53%–54% 区间,与近期专门方法报告的窄范围一致;该设置几乎没有证据表明 Stage-1 改变了域内终点。其次,匹配配方、早停的 SFT 在域外 MathVista 上提升了 +2.1 点,逆转了过训练变体的 -9.5 点下降。最明显的区别是熵状态:OPD 进入 RL 时的策略熵显著高于任一 SFT 初始化,且这种分离在可用轨迹中持续可见。在域内初始化时,OPD 还具有更高的答案多样性和 pass@16(比 SFT 高 +2.0 到 +5.2 点),尽管问题级自举区间显示较小的对比具有不确定性。RL 后优势消失(终点 pass@16 值在 1.1 点以内),在 MathVista 上也是如此(六个模型在 1.2 点以内)。因此,我们的贡献是一个有界的实证刻画:在此设置中,Stage-1 与熵状态强相关,但下游收益小、局部化,且不能证明 OPD 是更好的 RL 热启动。

英文摘要

Two-stage post-training -- a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) followed by Stage-2 reinforcement learning (RL) -- is increasingly used for vision-language models (VLMs). We ask what Stage-1 actually controls in a small-data study using Qwen2.5-VL-7B with a same-modality 72B VLM teacher for OPD. First, the three warm-starts reach a narrow $53$--$54\%$ band on Geometry3K internal validation, consistent with the narrow range reported by recent specialized methods; this setup provides little evidence that Stage-1 changes the in-domain endpoint. Second, a matched-recipe, early-stopped SFT improves out-of-domain MathVista by $+2.1$ points, reversing the $-9.5$-point drop of an over-trained variant. The clearest difference is the \emph{entropy regime}: OPD enters RL with substantially higher policy entropy than either SFT initialization, and the separation remains visible through the available trajectories. At the in-domain initialization, OPD also has higher answer diversity and pass@16 ($+2.0$ to $+5.2$ points over SFT), although problem-level bootstrap intervals show that the smaller contrast is uncertain. The advantage is absent after RL (endpoint pass@16 values within $1.1$ points) and on MathVista (six models within $1.2$ points). Our contribution is therefore a bounded empirical characterization: Stage-1 is strongly associated with the entropy regime in this setup, but the downstream payoff is small, localized, and not evidence that OPD is a better RL warm-start.
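
The two quantities compared in the abstract, pass@16 and policy entropy, can be computed as in the sketch below. Both functions are assumptions about the general form of the metrics, not the paper's evaluation code: `pass_at_k` is the widely used combinatorial estimator from `n` samples with `c` correct, and `entropy` is the Shannon entropy of a single categorical distribution (a real policy entropy would average this over generated tokens).

```python
import math


def pass_at_k(n, c, k):
    """Probability that at least one of k draws (without replacement)
    from n samples, c of which are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def entropy(probs):
    """Shannon entropy in nats of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical numbers, for illustration only.
print(pass_at_k(n=64, c=4, k=16))
print(entropy([0.7, 0.1, 0.1, 0.1]), entropy([0.25, 0.25, 0.25, 0.25]))
```

A higher-entropy initialization spreads probability over more distinct answers, which is why it tends to raise pass@k at initialization even when single-sample accuracy is unchanged.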

2606.09065 2026-06-09 cs.LG cs.AI 交叉投稿

OnlyDense: Reduced-Order Modeling for Lagrangian simulation

OnlyDense: 拉格朗日模拟的降阶建模

Tu Do, Shannon Ryan, Santu Rana

发表机构 * Deakin University(迪肯大学)

AI总结 提出一种将粒子系统状态视为希尔伯特空间中的函数、用学习到的神经基函数线性子空间近似状态空间的降阶建模框架,实现大规模拉格朗日模拟的高效表示与预测,在百万粒子SPH模拟中R²>0.99。

详情
AI中文摘要

在科学和工程中,拉格朗日模拟方法如光滑粒子流体动力学(SPH)或物质点法(MPM)常被用于研究动态系统的行为。然而,这些方法的计算成本可能高得令人望而却步,特别是在模拟多尺度空间或时间现象时,例如宏观几何中的空洞生长和合并、空间碎片颗粒超高速撞击导致的航天器部件结构失效等。与将系统状态理解为离散粒子集合的基于图的方法不同,我们提出了一种学习框架,通过将系统状态视为函数、将其演化视为希尔伯特空间中的轨迹,实现对大规模粒子系统的可扩展表示和动力学建模。我们不将状态表示为离散粒子集或嵌入非线性潜在流形,而是用学习到的神经基函数张成的线性子空间近似状态空间。这种参数化使得可以直接投影获得潜在系数,并显式访问基函数,避免了在非线性潜在空间上的优化。由此得到的表示具有自然的解释:潜在变量对应于希尔伯特空间中的系数,基函数对应于空间模态,类似于本征正交分解。因此,该框架将经典的基于投影的降阶建模与现代深度学习统一起来,同时保持对离散化点数量的不变性。在超过一百万个粒子的大规模SPH模拟(包括具有极端变形和破碎的动态事件)上的实验表明,所提出的方法能够准确重建和预测动力学,仅用32个基函数即可达到超过0.99的R²分数。

英文摘要

In science and engineering, Lagrangian simulation methods such as Smoothed Particle Hydrodynamics (SPH) or the Material Point Method (MPM) are often employed to study the behavior of dynamic systems. However, these methods can be prohibitively expensive computationally, particularly when simulating multi-scale spatial or temporal phenomena, e.g., void growth and coalescence within macro-scale geometries or structural failure of spacecraft components resulting from hypervelocity impact of space debris particles. In contrast to graph-based methods, where the state of the system is understood as a discrete set of particles, we propose a learning framework for scalable representation and dynamics modeling of massive particle systems by treating the system state as a function and its evolution as a trajectory in Hilbert space. Rather than representing the state as a discrete set of particles or embedding it in a nonlinear latent manifold, we approximate the state space with a linear subspace spanned by learned neural basis functions. This parameterization enables direct projection to obtain latent coefficients and explicit access to the basis functions, avoiding optimization over a nonlinear latent space. The resulting representation admits a natural interpretation: latent variables correspond to coefficients in Hilbert space, and basis functions correspond to spatial modes, analogous to Proper Orthogonal Decomposition. The framework thus unifies classical projection-based reduced-order modeling with modern deep learning, while remaining invariant to the number of discretization points. Experiments on large-scale SPH simulations with over one million particles, including dynamic events with extreme deformation and fragmentation, demonstrate that the proposed method accurately reconstructs and predicts dynamics, achieving an R$^2$ score above $0.99$ with as few as $32$ basis functions.
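
The projection step described above, obtaining latent coefficients as inner products with basis functions, can be sketched as follows. This is a minimal illustration under strong assumptions: the paper learns neural basis functions, whereas here a fixed orthonormal cosine basis on $[0, 1]$ stands in for them, and the sample points are uniform midpoints rather than irregular particle positions. All names are hypothetical.

```python
import math


def basis(k, x):
    """Orthonormal cosine basis on [0, 1]; a stand-in for learned basis functions."""
    return 1.0 if k == 0 else math.sqrt(2.0) * math.cos(math.pi * k * x)


def project(values, points, n_basis):
    """Latent coefficients c_k = <f, phi_k>, estimated as a mean over sample points."""
    n = len(points)
    return [
        sum(v * basis(k, x) for v, x in zip(values, points)) / n
        for k in range(n_basis)
    ]


def reconstruct(coeffs, x):
    """Evaluate the reduced-order state at an arbitrary location x."""
    return sum(c * basis(k, x) for k, c in enumerate(coeffs))


def field(x):
    """Toy system state that lies in the span of the first four basis functions."""
    return 2.0 + 3.0 * basis(1, x) - 0.5 * basis(3, x)


def sample_and_project(n_points, n_basis=4):
    points = [(i + 0.5) / n_points for i in range(n_points)]
    return project([field(x) for x in points], points, n_basis)


coeffs_coarse = sample_and_project(50)
coeffs_fine = sample_and_project(1000)
print(coeffs_coarse)
print(coeffs_fine)
```

Both discretizations yield the same four coefficients up to floating-point error, which is the sense in which a function-space representation is independent of the number of sample points. For irregular particle positions, the plain mean would typically be replaced by a weighted quadrature or a least-squares fit.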

2606.09112 2026-06-09 cs.LG cs.AI 交叉投稿

Hybridizing Equilibrium Propagation with Ising Machines for Efficient Energy-Based Learning

将平衡传播与伊辛机混合以实现高效的基于能量的学习

Chen-Rui Fan, Bo Lu, Xing-Yu Wu, Tie-Jun Wang, Chuan Wang

发表机构 * School of Artificial Intelligence, Beijing Normal University(北京师范大学人工智能学院) Laboratory for Advanced Computing and Intelligence Engineering, Information Engineering University(信息工程大学先进计算与智能工程实验室) School of Physical Science and Technology, Beijing University of Posts and Telecommunications(北京邮电大学物理科学与技术学院)

AI总结 提出一种受伊辛动力学启发的平衡传播框架,通过扩展相空间动力学替代耗散Hopfield松弛,加速收敛、提高噪声鲁棒性,并在MNIST等数据集上实现与反向传播相当的性能。

详情
AI中文摘要

人工智能的快速发展推动了深度神经网络的重大进步。然而,传统的基于GPU的训练仍然高度耗能,这促使人们探索物理动力学和兼容的基于能量的学习方案,例如平衡传播(EP)。然而,基于EP的训练常常由于相空间收缩而陷入局部最小值。本文介绍了一种受伊辛动力学启发的平衡传播框架,其中耗散的Hopfield松弛被具有共轭变量的扩展相空间动力学所取代。由此产生的训练范式保留了EP的局部两阶段学习规则,同时改变了神经状态达到平衡的物理路径。我们表明,这种动力学降低了有效能量壁垒,加速了收敛,提高了噪声鲁棒性,并在MNIST、FashionMNIST和CIFAR-10上训练了深度卷积Hopfield网络,性能与反向传播相当。

英文摘要

The rapid evolution of artificial intelligence has led to substantial advances in deep neural networks. Nonetheless, conventional GPU-based training remains highly energy-demanding, motivating the exploration of physical dynamics and compatible energy-based learning schemes, such as equilibrium propagation (EP). EP-based training, however, frequently suffers from convergence to local minima due to phase-space contraction. Here we introduce an Ising-dynamics-inspired equilibrium-propagation framework in which dissipative Hopfield relaxation is replaced by an extended phase-space dynamics with conjugate variables. The resulting training paradigm keeps the local two-phase learning rule of EP while changing the physical route by which neural states reach equilibrium. We show that this dynamics lowers effective energy barriers, accelerates convergence, improves noise robustness, and trains deep convolutional Hopfield networks on MNIST, FashionMNIST, and CIFAR-10 with performance comparable to backpropagation.

2606.09117 2026-06-09 cs.LG cs.AI 交叉投稿

Optimizing Energy-based Neural Network Training with Coherent Ising Machine

利用相干伊辛机优化基于能量的神经网络训练

Chen-Rui Fan, Bo Lu, Zhi-Hong Zhang, Run-Qing Zhang, Jing-Wei Wen, Chuan Wang

发表机构 * School of Artificial Intelligence, Beijing Normal University(北京师范大学人工智能学院) Laboratory for Advanced Computing and Intelligence Engineering, Information Engineering University(信息工程大学先进计算与智能工程实验室) China Mobile (Suzhou) Software Technology Company Limited(中移(苏州)软件技术有限公司) School of Science, Beijing University of Posts and Telecommunications(北京邮电大学理学院)

AI总结 本文利用相干伊辛机结合平衡传播训练基于能量的神经网络,并通过Adam优化器加速收敛,展示了在深层架构和卷积操作上的可扩展性,为下一代AI硬件提供了物理框架。

详情
AI中文摘要

尽管伊辛机作为伊辛模型的高级物理求解器,在组合优化和神经网络训练中具有应用潜力,但其在大规模神经网络中的可扩展性仍受限于硬件连接限制和次优的训练方法。在这项工作中,我们利用相干伊辛机(CIM)通过平衡传播训练基于能量的神经网络,实现了与现有软件实现相当的性能。我们进一步通过集成Adam优化器来求解Hopfield能量网络的基态,从而显著提高了收敛速度和求解精度。此外,我们展示了该方法在更深层网络架构和卷积操作上的可扩展性。我们的结果突显了CIM动力学作为训练复杂神经网络的可扩展平台的潜力,为通过模拟电路、光电子或集成光子学实现节能实现提供了途径。这项工作为下一代AI硬件开发建立了一个新颖的物理框架。

英文摘要

While Ising machines serve as advanced physical solvers for the Ising model, enabling applications in combinatorial optimization and neural network training, their scalability for large-scale neural networks remains constrained by hardware connectivity limitations and suboptimal training methodologies. In this work, we leverage a Coherent Ising Machine (CIM) to train an energy-based neural network using Equilibrium Propagation, achieving performance comparable to existing software-based implementations. We further enhance the algorithm by integrating the Adam optimizer to solve for the ground state of a Hopfield energy network, significantly improving convergence speed and solution accuracy. Additionally, we demonstrate the scalability of our approach across deeper network architectures and convolutional operations. Our results highlight the potential of CIM dynamics as a scalable platform for training complex neural networks, offering a pathway toward energy-efficient implementations via analog circuits, optoelectronics, or integrated photonics. This work establishes a novel physical framework for next-generation AI hardware development.

2606.09245 2026-06-09 cs.CV cs.AI 交叉投稿

Proposal Refinement for Few-Shot Object Detection

用于少样本目标检测的提议细化

Yuan Zeng, Bin Song, Jie Guo, Yuwen Chen

发表机构 * State Key Laboratory of Integrated Services Networks, Xidian University(西安电子科技大学综合业务网理论及关键技术国家重点实验室)

AI总结 针对少样本检测中区域提议在基类和新类间分布不均的问题,提出分阶段提议细化方法,通过基类训练阶段的细化损失和微调阶段的细化分支重新平衡提议分布,在基准上提升1%~6%且不增加推理时间。

详情
AI中文摘要

近年来,少样本目标检测引起了广泛关注。一些优秀的算法已被提出以处理这一任务。然而,这些算法大多依赖于少样本分类的性能。与以往尝试不同,我们的工作聚焦于新类和基类之间区域提议分布不均的问题。为了缓解这种不平衡分布,我们针对不同训练阶段提出了提议细化方法。具体而言,在基类训练阶段设计了细化损失以增强模型对新类的敏感性,在微调阶段引入了细化分支作为RPN(区域提议网络)的辅助分支以生成更多新类提议。通过重新平衡提议分布,所提方法在现有基准上比基线方法提高了约1%~6%,且不增加任何推理时间。通过大量实验,我们证明了为少样本目标检测任务建立了一种新的最先进方法。

英文摘要

Few-shot object detection has gained wide attention in recent years, and several strong algorithms have been proposed for the task. However, most of these algorithms rely on the performance of few-shot classification. Unlike previous attempts, our work focuses on the unbalanced distribution of region proposals between the novel classes and the base classes. To alleviate this imbalance, we propose a proposal refinement approach for the different training phases. Specifically, a refinement loss is designed for the base training phase to enhance the sensitivity of the model to novel classes, and a refinement branch is introduced as an auxiliary branch of the RPN (Region Proposal Network) to generate more novel-class proposals in the fine-tuning phase. By rebalancing the proposal distribution, the proposed approach outperforms the baseline methods by roughly 1\% to 6\% on current benchmarks without increasing inference time. Through extensive experiments, we show that the approach sets a new state of the art for the few-shot object detection task.

2606.09257 2026-06-09 cs.LG cs.AI stat.ML 交叉投稿

BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

BSTabDiff: 用于高维表格数据生成的块-子单元扩散先验

Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Gyawali, Gianfranco Doretto, Donald A. Adjeroh

发表机构 * West Virginia University(西弗吉尼亚大学) The University of Utah(犹他大学)

AI总结 针对高维低样本量表格数据,提出BSTabDiff框架,通过将特征划分为潜在块并使用共享低维子单元变量生成每个块,结合扩散先验和copula依赖,实现稳定合成与可控基准生成。

Comments Published as a paper at the 2nd DeLTa Workshop, ICLR 2026

详情
AI中文摘要

高维低样本量(HDLSS)表格领域(例如组学)的特点是 $n \ll m$,其中 $n$ = 样本数,$m$ = 特征数。此类领域通常表现出强局部相关组、稀疏跨组依赖、重尾非高斯边缘分布、异方差噪声和结构化缺失,使得在 $\mathbb{R}^m$ 中直接进行密度学习因 $n \ll m$ 而病态。我们提出 BSTabDiff,一种块-子单元生成框架,将 $m$ 个观测特征划分为 $M$ 个潜在块($M \ll m$),并通过共享的低维子单元变量生成每个块,将全局依赖学习集中在紧凑的块潜在空间 $\mathbb{R}^M$ 中,同时通过 copula 驱动的依赖、灵活的逐特征边缘分布和显式缺失机制解码到完整特征空间。BSTabDiff 支持块潜在上的现代深度先验,包括扩散和归一化流,从而在 HDLSS 场景中实现稳定合成和可控基准生成。实验表明,与 HDLSS 数据上的非结构化表格生成器相比,BSTabDiff 能产生更真实和稳定的高维合成数据。

英文摘要

High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ = number of samples, and $m$ = number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, making direct density learning in $\mathbb{R}^m$ ill-conditioned since $n \ll m$. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block via a shared low-dimensional subunit variable, concentrating global dependence learning in the compact block-latent space $\mathbb{R}^M$ while decoding to the full feature space with copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on block latents, including diffusion and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and stable high-dimensional synthetic data when compared with unstructured tabular generators on HDLSS data.

2606.09278 2026-06-09 cs.LG cs.AI 交叉投稿

Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation

内化几何法则:从求解器残差中学习以实现精度关键生成

Rafael Cabral, Pang Zixi, Ziyi Shou, Shen Xin

发表机构 * Huawei Celia Team(华为Celia团队)

AI总结 针对大语言模型在精度关键领域(如技术图表和机械设计)中的幻觉问题,提出可编程几何DSL PyGeoX及分层基准PyGeoX-Bench,并设计饱和加性奖励(SAR)方法,将奖励分解为有界逐约束项,解决异常梯度掩盖问题,使8B模型在基准上达到与更大前沿系统竞争的水平。

详情
AI中文摘要

大语言模型在精度关键领域(如技术图表和机械设计)中经常出现幻觉,这些领域的输出必须满足严格的几何约束。我们研究从自然语言进行开放式几何合成:将自由形式的描述转化为精确的构造,其实体必须同时满足数十个相互作用的约束。为使这一问题易于处理,我们发布了PyGeoX,一个可编程的几何DSL,它将声明性约束编译为可微损失,以及PyGeoX-Bench,一个包含300个问题的分层套件,每个问题都有可验证的逐约束奖励。使用PyGeoX作为验证器,我们识别出一种称为异常梯度掩盖的失败模式:在全局范数奖励(任何通过单一范数聚合残差的方案,例如$\exp(-\mathrm{MSE})$)下,单个异常约束可以抵消所有其他约束的学习信号。为解决此问题,我们提出饱和加性奖励(SAR),它将奖励分解为有界的逐约束项,保留部分进展并确保即使在严重违反下也能保持一致的梯度。与基于MSE的奖励(几何求解器的自然基线)相比,SAR将困难层级求解率提高了2.3倍,由此得到的8B模型在该基准上与更大的前沿系统具有竞争力。我们在https://github.com/Huawei-AI4Math/PyGeoX发布引擎、基准和数据。

英文摘要

Large Language Models frequently hallucinate in precision-critical domains such as technical diagramming and mechanical design, where outputs must satisfy strict geometric constraints. We study open-ended geometric synthesis from natural language: translating free-form descriptions into precise constructions whose entities must simultaneously satisfy dozens of interacting constraints. To make this tractable, we release PyGeoX, a programmable geometric DSL that compiles declarative constraints into a differentiable loss, and PyGeoX-Bench, a stratified suite of 300 problems with per-constraint verifiable rewards. Using PyGeoX as a verifier, we identify a failure mode we call Outlier Gradient Masking: under global-norm rewards (any scheme that aggregates residuals through a single norm, for example, $\exp(-\mathrm{MSE})$), a single outlier constraint can nullify the learning signal across all others. To address this, we propose Saturating Additive Rewards (SAR), which decompose the reward into bounded per-constraint terms, preserving partial progress and ensuring consistent gradients even under severe violations. Against MSE-based rewards, the natural baseline for geometry solvers, SAR improves the hard-tier solving rate by $2.3\times$, and the resulting 8B model is competitive with much larger frontier systems on this benchmark. We release the engine, benchmark, and data at https://github.com/Huawei-AI4Math/PyGeoX.
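
The Outlier Gradient Masking effect can be reproduced numerically with a few lines. The sketch below is an assumption-laden illustration: the global-norm reward is the $\exp(-\mathrm{MSE})$ form named in the abstract, but the per-constraint term $\exp(-r_i^2)$ used for the saturating additive reward is a hypothetical choice, since the abstract only states that the terms are bounded and additive.

```python
import math


def global_norm_reward(residuals):
    """exp(-MSE): all constraint residuals pass through one shared norm."""
    mse = sum(r * r for r in residuals) / len(residuals)
    return math.exp(-mse)


def global_norm_grad(residuals, i):
    """d reward / d r_i; it is scaled by the shared exp(-MSE) factor."""
    return -2.0 * residuals[i] / len(residuals) * global_norm_reward(residuals)


def saturating_additive_reward(residuals):
    """Mean of bounded per-constraint terms (assumed form: exp(-r_i^2))."""
    return sum(math.exp(-r * r) for r in residuals) / len(residuals)


def saturating_additive_grad(residuals, i):
    """d reward / d r_i; it depends only on constraint i."""
    return -2.0 * residuals[i] * math.exp(-residuals[i] ** 2) / len(residuals)


# Two nearly satisfied constraints and one badly violated outlier.
residuals = [0.1, 0.1, 10.0]

masked = global_norm_grad(residuals, 0)
preserved = saturating_additive_grad(residuals, 0)
print("global-norm reward:", global_norm_reward(residuals), "grad wrt r_0:", masked)
print("additive reward:   ", saturating_additive_reward(residuals), "grad wrt r_0:", preserved)
```

Under the shared norm, the outlier drives the common factor toward zero and the signal for the nearly satisfied constraints vanishes with it; under the additive form, each constraint keeps its own gradient and the reward still credits the two constraints that are close to satisfied.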

2606.09380 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

推理竞技场:当可验证奖励不足时的轨迹锦标赛

Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang

发表机构 * University of Cambridge(剑桥大学) Mistral AI

AI总结 提出推理竞技场框架,通过轨迹锦标赛将无梯度信号的非多样奖励组转化为相对奖励信号,结合Bradley-Terry模型高效整合强化学习,在数学和编码基准上平均提升7.6%,加速训练27%-41%。

Comments 9 pages, 6 figures, 2 tables (17 pages including references and appendices)

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为通过结果监督提升大语言模型推理能力的主流范式。然而,可验证奖励在组级别常常变得无信息:当给定提示的所有采样轨迹获得相同奖励时,组相对优势估计无法提供梯度信号,尽管这些轨迹在推理质量上可能差异显著。我们提出推理竞技场,一种自适应训练框架,将此类非多样奖励组路由至裁判系统而非丢弃。除了检查最终答案,推理竞技场构建轨迹锦标赛,其中推理轨迹进行两两比较以暴露组内更细粒度的偏好,将推理质量转化为丰富的相对奖励信号。为使奖励估计高效,而非穷举比较每一对,每个新轨迹与一个动态更新的先前生成轨迹小池作为锚点进行评估,以高效建立相对排名。然后我们在不完整比较图上拟合Bradley-Terry模型,实现无需二次成对比较的可扩展强化学习集成。实验结果表明,推理竞技场在竞赛数学和编码基准上平均比RLVR基线高出7.6%。通过将原本浪费的零优势样本转化为有用的梯度更新,我们的方法加速训练27%至41%,节省近50%的生成计算量,并显著提升整体推理性能。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
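
The two ingredients named above, a zero group-relative advantage under identical rewards and a Bradley-Terry fit on an incomplete comparison graph, can be sketched as follows. This is not the authors' implementation: the fit uses the standard minorization-maximization iteration for Bradley-Terry strengths, the comparison list is made up, and mapping strengths to rewards through a logarithm is an assumption.

```python
import math
from collections import defaultdict


def group_relative_advantages(rewards):
    """Standardize rewards within a group; identical rewards give all zeros."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def fit_bradley_terry(n_items, comparisons, iters=2000):
    """MM iteration for Bradley-Terry strengths from (winner, loser) pairs.

    Pairs that were never compared simply do not appear. The estimate is only
    well defined if every item has at least one win and one loss along a
    connected comparison graph.
    """
    wins = [0] * n_items
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[(min(winner, loser), max(winner, loser))] += 1
    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n in pair_counts.items():
            d = n / (p[i] + p[j])
            denom[i] += d
            denom[j] += d
        p = [wins[i] / denom[i] for i in range(n_items)]
        total = sum(p)
        p = [x / total for x in p]
    return p


# Four traces with identical verifiable reward: no gradient signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))

# Hypothetical judge outcomes on a chain 0-1, 1-2, 2-3 (three matches each);
# the pairs (0, 2), (0, 3), and (1, 3) are never compared.
comparisons = [(0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (2, 1), (2, 3), (2, 3), (3, 2)]
strengths = fit_bradley_terry(4, comparisons)
tournament_rewards = [math.log(s) for s in strengths]
print(strengths)
print(group_relative_advantages(tournament_rewards))
```

The fitted strengths induce a full ranking even though only three of the six possible pairs were judged, so the same group that yielded all-zero advantages now yields non-zero ones.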

2606.09404 2026-06-09 stat.ML cs.AI cs.LG 交叉投稿

SAILS: Surrogate-based Analysis of Interactions via Local Effect Smooths

SAILS: 基于局部效应平滑的交互作用代理分析

Timo Heiß, Julia Herbinger, Bernd Bischl, Giuseppe Casalicchio

发表机构 * Department of Statistics, LMU Munich(慕尼黑大学统计系) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Leibniz Institute for Prevention Research and Epidemiology(莱布尼茨预防研究与流行病学研究所)

AI总结 提出SAILS框架,通过可解释的广义加性模型代理分析黑箱模型中的成对交互作用,实现交互检测、形式分类和可视化。

详情
AI中文摘要

特征交互驱动了机器学习模型的大部分预测能力,然而现有的解释方法仅能检测和量化交互作用,而无法揭示其函数形式,或者只能可视化受限的交互类型。我们提出了基于局部效应平滑的交互作用代理分析(SAILS),这是一个模型无关的框架,通过拟合黑箱模型局部效应的可解释广义加性模型(GAM)代理来分析成对交互作用。对于感兴趣特征的每个区间,代理平滑项在导数层面隔离交互成分,从而实现(i)通过对平滑项显著性检验的启发式方法进行交互检测,(ii)将交互形式分类为线性、乘积可分离和非乘积可分离类型,以及(iii)为每种交互类型提供定制化、可解释的可视化。我们通过受控模拟和实际任务实证验证了该框架,展示了其在成对交互作用上的有效性,但在强特征相关性和高阶交互作用下存在局限性。SAILS填补了XAI工具箱中的一个显著空白,超越了仅检测交互作用,进而表征其函数形式。

英文摘要

Feature interactions drive much of the predictive power of machine learning models, yet existing explanation methods only detect and quantify interactions without revealing their functional form, or visualize only restricted interaction types. We propose Surrogate-based Analysis of Interactions via Local effect Smooths (SAILS), a model-agnostic framework that analyzes pairwise interactions through interpretable generalized additive model (GAM) surrogates fitted to the local effects of a black-box model. For each interval of a feature of interest, the surrogate smooth terms isolate the interaction components at the derivative level, enabling (i) interaction detection through a heuristic derived from significance tests on smooth terms, (ii) interaction form categorization into linear, product-separable, and non-product-separable types, and (iii) tailored, interpretable visualizations for each interaction type. We empirically validate the framework through controlled simulations and a real-world task, demonstrating its effectiveness for pairwise interactions, with limitations under strong feature correlations and higher-order interactions. SAILS fills a notable gap in the XAI toolbox, going beyond detection of interactions alone to characterizing their functional form.

2606.09430 2026-06-09 cs.LG cs.AI 交叉投稿

LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

LargeMonitor: 通过大型预训练模型监控在线无任务持续学习

Mingqi Yuan, Xiaoquan Sun, Shihao Luo, Jiayu Chen

发表机构 * HKU(香港大学) Qicore Tech(启科科技)

AI总结 提出LargeMonitor框架,利用大型预训练模型(LVM和LMM)解耦检测与诊断,实现无任务持续学习中的零样本漂移检测和语义病因诊断,提升现有算法性能。

详情
AI中文摘要

在线无任务持续学习(TFCL)要求智能体在严格单次遍历约束下,从无界、非平稳的数据流中顺序积累知识,且无显式任务标识。现有在线TFCL范式主要依赖于参数高效的提示调整或由训练耦合优化动态(如经验损失波动或潜在距离演变)驱动的动态结构扩展。因此,这些训练耦合求解器对分布漂移的结构起源不可知,机械地在根本不同的流变化上强制执行固定策略。为解决这一问题,我们提出LargeMonitor,一个利用大型预训练基础模型自主编排无任务连续适应的框架。具体而言,LargeMonitor引入一个解耦的检测模块,利用大型视觉模型(LVM)的冻结、稳定表示空间,实现鲁棒的零样本漂移检测,无需训练依赖的干扰或脆弱的阈值调整。在确认漂移后,该框架激活一个由大型多模态模型(LMM)驱动的上下文感知诊断模块,以解释流变化的精确语义病因(例如,新类出现 vs. 环境域偏移)。这种双阶段能力使连续学习者能够动态部署自适应且特定于漂移的优化策略。在多个TFCL设置和基准上的大量实验表明,LargeMonitor实现了对复杂数据流的精确、鲁棒检测和诊断,同时持续提升现有在线TFCL算法的性能。

英文摘要

Online task-free continual learning (TFCL) requires intelligent agents to sequentially accumulate knowledge from an unbounded, non-stationary data stream under strict single-pass constraints and without any explicit task identifiers. Existing online TFCL paradigms primarily rely on parameter-efficient prompt tuning or dynamic structure expansion driven by training-coupled optimization dynamics, such as empirical loss fluctuations or evolving latent distances. As a result, these training-coupled solvers remain agnostic to the structural origins of distribution drift, mechanically enforcing a fixed strategy across fundamentally distinct streaming variations. To address this gap, we propose LargeMonitor, a framework that leverages large pretrained foundation models to autonomously orchestrate task-free continuous adaptation. Specifically, LargeMonitor introduces a decoupled detection module utilizing the frozen, stable representation space of large vision models (LVMs) to achieve robust, zero-shot drift detection without training-dependent interference or brittle threshold tuning. Upon a confirmed drift, the framework activates a context-aware diagnostic module driven by large multimodal models (LMMs) to interpret the precise semantic etiologies of the stream variation (e.g., novel class emergence vs. environmental domain shift). This dual-stage capability empowers the continuous learner to dynamically deploy adaptive and shift-specific optimization strategies. Extensive experiments across multiple TFCL settings and benchmarks demonstrate that LargeMonitor achieves precise, robust detection and diagnosis of complex data streams while consistently improving the performance of existing online TFCL algorithms.

2606.09607 2026-06-09 cs.LG cs.AI 交叉投稿

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

注意力头中的闭包验证电路发现:共激活提出,消融处置

Yongzhong Xu

发表机构 * GitHub

AI总结 通过共激活聚类提出注意力头电路假设,并用因果消融验证闭包性,发现该方法在密集模型有效但在MoE模型失效,表明共激活仅是电路提议而非确认。

Comments 22 pages, 3 figures

详情
AI中文摘要

可解释性越来越将组件组(而非单个单元)作为基本对象,并提议通过聚类共激活统计来发现它们。我们询问这种廉价信号是否真正识别出注意力头电路。将稀疏自编码器聚类方法适配到注意力头——但通过因果消融而非重构进行验证——我们聚类头,然后运行闭包测试:消融发现的社区,并将每个示例的损伤与匹配随机对照进行比较。在两个密集的1B规模模型(Pythia 1B, OLMo 1B)和两种输入分布上,社区通过了闭包测试。在混合专家模型(OLMoE-1B-7B)中,路由条件聚类恢复了一个统计上真实的信号,但该信号未能通过闭包测试——消融反而改善了损失,方向错误。将闭包测试扩展到训练过程中,注意力目标选择性和参与比率在双向与功能解耦。我们得出结论:廉价信号是电路提议,而非确认的电路;闭包是区分二者的关键。

英文摘要

Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clustering co-activation statistics. We ask whether such a cheap signal actually identifies an attention-head circuit. Adapting a sparse-autoencoder clustering recipe to attention heads -- but validating by causal ablation rather than reconstruction -- we cluster heads and then run a closure test: ablate the discovered community and compare per-example damage to matched-random controls. Across two dense 1B-scale models (Pythia 1B, OLMo 1B) and two input distributions, the communities pass closure. In a Mixture-of-Experts model (OLMoE-1B-7B), route-conditional clustering recovers a statistically real signal that nonetheless does not survive closure -- ablation improves loss, the wrong direction. Extending closure across training, attention-target selectivity and participation ratio decouple from function in both directions. We conclude that a cheap signal is a circuit proposal, not a confirmed circuit; closure is what separates them.
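
The closure test described above can be outlined in a few lines. The sketch is hypothetical throughout: `toy_damage` stands in for "loss with the given heads ablated minus baseline loss" on a real model, controls are matched on set size only, the damage is aggregated rather than per-example, and the 0.95 pass threshold is an arbitrary illustrative choice.

```python
import random


def closure_test(damage_fn, community, all_heads, n_controls=200, seed=0):
    """Compare ablation damage of `community` with size-matched random head sets."""
    rng = random.Random(seed)
    outside = [h for h in all_heads if h not in community]
    community_damage = damage_fn(frozenset(community))
    control_damage = [
        damage_fn(frozenset(rng.sample(outside, len(community))))
        for _ in range(n_controls)
    ]
    # Fraction of random controls that hurt the model less than the community.
    beaten = sum(d < community_damage for d in control_damage) / n_controls
    return community_damage, beaten


# Hypothetical ground truth used only by the toy damage function.
TRUE_CIRCUIT = frozenset({0, 1, 2})


def toy_damage(ablated):
    background = 0.01 * len(ablated)               # every head matters a little
    partial = 0.1 * len(ablated & TRUE_CIRCUIT)    # circuit heads matter more
    broken = 1.0 if TRUE_CIRCUIT <= ablated else 0.0
    return background + partial + broken


heads = list(range(16))
damage, beaten = closure_test(toy_damage, {0, 1, 2}, heads)
# Closure requires damage in the harmful direction that also exceeds the controls.
passes_closure = damage > 0 and beaten >= 0.95
print(damage, beaten, passes_closure)
```

A community whose ablation lowers the loss, as reported for the Mixture-of-Experts model, would give a negative `damage` and fail the check regardless of how strongly its heads co-activate.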

2606.09658 2026-06-09 cs.LG cs.AI 交叉投稿

Muon Learns More Robust and Transferable Features than Adam

Muon 比 Adam 学习更鲁棒和可迁移的特征

Tianyu Ruan, Fengzhuo Zhang, Shuche Wang, Shihua Zhang

发表机构 * Yale University(耶鲁大学) National University of Singapore(新加坡国立大学) University of Chinese Academy of Sciences(中国科学院大学) Academy of Mathematics and Systems Science, CAS(中国科学院数学与系统科学研究院)

AI总结 本文通过鲁棒性和可迁移性视角,证明 Muon 优化器相比 Adam 和 SGD 能学习到更鲁棒、更可迁移的特征,并通过理论分析支持了经验发现。

详情
AI中文摘要

Muon 最近已成为预训练大型语言模型(LLMs)和视觉分类器的最先进优化器。尽管其在效率上优于 Adam 和 SGD,但 Muon 在特征学习方面的优势仍不清楚。本文通过鲁棒性和可迁移性的视角研究了 Muon 的特征学习优势。首先,通过在损坏图像和文本上评估预训练模型,我们表明 Muon 学习到的特征在不同架构(包括 Transformer 和卷积神经网络(CNN))中始终比 Adam 和 SGD 学习到的特征更鲁棒。使用训练好的逐层探针,我们进一步表明这种鲁棒性优势体现在各层更大的 logit 间隔上。其次,通过在下游任务上训练线性分类器或从预训练参数微调完整模型,我们证明 Muon 学习到的特征比 Adam 和 SGD 学习到的特征更有效地迁移。这种可迁移性优势还通过有效秩衡量的各层隐藏状态的多样性得到进一步支持。最后,在一个具有多组件特征的代表性分类问题中,我们证明 Muon 比 Adam 和 SGD 获得更大的间隔和更高的有效秩,为我们的经验发现提供了理论支持。

英文摘要

Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vision classifiers. Despite its efficiency advantage over Adam and SGD, the feature-learning advantage of Muon remains unclear. This paper investigates Muon's feature-learning advantage through the lens of robustness and transferability. First, by evaluating pretrained models on corrupted images and texts, we show that features learned by Muon are consistently more robust than those learned by Adam and SGD across different architectures, including transformers and Convolutional Neural Networks (CNNs). Using trained layer-wise probes, we further show that this robustness advantage is reflected in larger logit margins across layers. Second, by training linear classifiers or fine-tuning full models from pretrained parameters on downstream tasks, we demonstrate that Muon-learned features transfer more effectively than those learned by Adam and SGD. This transferability advantage is further supported by the diversity of hidden states across layers, as measured by effective rank. Finally, in a representative classification problem with multi-component features, we prove that Muon attains larger margins and higher effective rank than Adam and SGD, providing theoretical support for our empirical findings.
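
The two diagnostics used in the abstract, logit margin and effective rank, have simple closed forms. The sketch below uses common definitions that may differ in detail from the paper's: effective rank is taken as the exponential of the Shannon entropy of the normalized singular values, and the margin is the correct-class logit minus the largest other logit. Singular values are passed in directly, so no linear-algebra library is needed.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the singular-value distribution; ranges from 1 to len(values)."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def logit_margin(logits, label):
    """Correct-class logit minus the largest competing logit."""
    best_other = max(v for i, v in enumerate(logits) if i != label)
    return logits[label] - best_other


# Hypothetical spectra: one dominated by a single direction, one spread evenly.
print(effective_rank([10.0, 0.1, 0.1, 0.1]))
print(effective_rank([1.0, 1.0, 1.0, 1.0]))
print(logit_margin([2.0, 0.5, 1.0], label=0))
```

A spectrum concentrated on one direction gives an effective rank near 1, while an even spectrum gives the full dimension, which is why the measure serves as a proxy for hidden-state diversity.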

2606.09659 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

End-to-End Context Compression at Scale

端到端上下文压缩的规模化

Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra, Zaiqian Chen, Artem Gazizov, Venkata Anoop Suhas Kumar Morisetty, Bhavya Kailkhura, Harshitha Menon, Zhuang Liu, Brian R. Bartoldson, Tom Goldstein, Sanae Lotfi, Micah Goldblum, Pavel Izmailov

发表机构 * New York University(纽约大学) Modal Labs(Modal实验室) University of Maryland(马里兰大学) Princeton University(普林斯顿大学) Columbia University(哥伦比亚大学) Harvard University(哈佛大学) Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室) FAIR at Meta(Meta FAIR实验室)

AI总结 本研究通过架构搜索和持续预训练,提出潜在上下文语言模型(LCLMs),一种端到端编码器-解码器压缩器,在通用任务性能、压缩速度和峰值内存上改进帕累托前沿,并可作为长时智能体的高效骨干。

详情
AI中文摘要

长上下文语言模型推理受限于内存,因为KV缓存随上下文长度增长。最近压缩KV缓存的技术存在不足:它们要么大幅降低模型质量,要么需要大量时间和计算来压缩单个长提示。此外,许多方法要求输入适合目标模型的上下文窗口,并且通常与现代生产推理引擎不兼容。编码器-解码器压缩器原则上是一种有吸引力的替代方案,它将长令牌序列映射到由解码器消费的较短潜在嵌入序列。然而,现有方法在精度-效率前沿上无法与KV缓存压缩竞争。在这项工作中,我们重新审视编码器-解码器压缩并缩小了这一差距。我们首先进行架构搜索,从头开始预训练许多变体,以确定如何最佳设计和训练编码器-解码器压缩器。根据我们的发现,我们持续预训练一系列0.6B编码器、4B解码器模型,每个模型在超过350B令牌上训练,压缩比为1:4、1:8和1:16。我们引入了潜在上下文语言模型(LCLMs),这是一系列压缩器,在通用任务性能、压缩速度和峰值内存使用上改进了帕累托前沿。我们证明了LCLMs可作为长时智能体的高效骨干,让智能体浏览压缩的长上下文并按需自适应扩展相关片段。

英文摘要

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
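
The memory argument above can be made concrete with a back-of-the-envelope sketch. The model dimensions used here (36 layers, 8 KV heads, head dimension 128, 2-byte values) are illustrative assumptions rather than figures from the paper, and treating each latent embedding as occupying one cache position is a simplification of how an encoder-decoder compressor is actually consumed.

```python
def kv_cache_bytes(seq_len, n_layers=36, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (hypothetical model shape)."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def compressed_len(seq_len, ratio):
    """Length of the latent sequence at compression ratio 1:ratio, rounded up."""
    return -(-seq_len // ratio)


if __name__ == "__main__":
    context = 128_000  # assumed prompt length in tokens
    for ratio in (1, 4, 8, 16):
        positions = compressed_len(context, ratio)
        gib = kv_cache_bytes(positions) / 2**30
        print(f"1:{ratio:<2} -> {positions:>7} positions, {gib:.2f} GiB")
```

Because the cache grows linearly with the number of cached positions, a 1:16 ratio reduces this estimate by a factor of 16 under these assumptions; the cost of running the encoder is not modeled here.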

2606.09762 2026-06-09 cs.LG cs.AI 交叉投稿

Preserving Plasticity in Continual Learning via Dynamical Isometry

通过动态等距保持持续学习中的可塑性

Andries Rosseau, Robert Müller, Ann Nowé

发表机构 * University of Amsterdam(阿姆斯特丹大学) ETH Zurich(苏黎世联邦理工学院)

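AI summary This paper preserves the plasticity of deep neural networks in continual learning through dynamical isometry, proposing an isometry regularization method and the AdamO optimizer, which match or outperform existing methods on multiple benchmarks.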

Comments ICML26

详情
Journal ref
Forty-Third International Conference on Machine Learning (ICML 2026)
AI中文摘要

深度神经网络在非平稳条件下的持续训练通常会导致可塑性逐渐丧失,最终限制进一步学习。我们将可塑性与经验神经正切核联系起来,并确定动态等距(即逐层雅可比奇异值保持接近1的条件)是保持持续学习中可塑性的关键机制。我们重新审视一类几乎处处等距且同时保持通用Lipschitz函数逼近能力的网络,证明近动态等距与表达性非线性表示兼容。对于通用架构,我们提出一种高效的等距促进正则化方案,并识别出一种可以重新激活休眠ReLU单元的新机制。在此基础上,我们引入AdamO,一种Adam风格的自适应优化器,将等距正则化与梯度更新解耦,类似于AdamW。我们进一步通过动态等距的视角重新解释先前的可塑性保持方法,表明它们仅针对等距的部分度量。在旨在诱导可塑性损失的监督和强化学习持续学习基准上,我们的方法一致地匹配或超越现有方法。

英文摘要

Continual training of deep neural networks under non-stationarity often leads to a progressive loss of plasticity, eventually limiting further learning. We relate plasticity to the empirical Neural Tangent Kernel, and identify dynamical isometry (the condition that layer-wise Jacobian singular values remain close to one) as a key mechanism for preserving plasticity in continual learning. We revisit a class of networks that are almost-everywhere isometric while remaining universal Lipschitz function approximators, demonstrating that near-dynamical isometry is compatible with expressive nonlinear representations. For general architectures, we propose an efficient isometry-promoting regularization scheme and identify a novel mechanism by which it can reactivate dormant ReLU units. Building on this, we introduce AdamO, an Adam-style adaptive optimizer that decouples isometry regularization from gradient updates, analogous to AdamW. We further reinterpret prior plasticity-preserving approaches through the lens of dynamical isometry, showing that they target only a partial measure of isometry. Across supervised and reinforcement-learning continual-learning benchmarks designed to induce plasticity loss, our methods consistently match or outperform existing approaches.
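
To illustrate what "singular values close to one" means for a single linear layer, here is a minimal sketch of a soft-orthogonality penalty. This is a common surrogate chosen for illustration, not necessarily the regularizer the paper proposes; it ignores the activation function's contribution to the layer Jacobian, and the function names are hypothetical.

```python
import math


def gram(W):
    """Return W^T W for a matrix given as a list of rows."""
    rows, cols = len(W), len(W[0])
    return [[sum(W[k][i] * W[k][j] for k in range(rows)) for j in range(cols)]
            for i in range(cols)]


def isometry_penalty(W):
    """Squared Frobenius norm of W^T W - I.

    Zero exactly when the columns of W are orthonormal, in which case
    every singular value of W equals 1.
    """
    G = gram(W)
    n = len(G)
    return sum((G[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(n) for j in range(n))


def rotation(theta):
    """A 2x2 rotation matrix, which is an exact isometry."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


if __name__ == "__main__":
    R = rotation(0.3)
    shrunk = [[0.5 * v for v in row] for row in R]  # singular values 0.5
    print(isometry_penalty(R), isometry_penalty(shrunk))
```

A rotation scores (numerically) zero, while the same matrix scaled by 0.5 is penalized, which is the behavior an isometry-promoting term needs.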

2606.09802 2026-06-09 cs.LG cs.AI stat.ML 交叉投稿

Bandits for Efficient Experimentation: Adapting to Control Group, Preferences, and Context Drifts

高效实验的Bandits:适应控制组、偏好和上下文漂移

Udvas Das, Waris Radji, Debabrota Basu, Odalric-Ambrym Maillard

发表机构 * Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL(里尔大学、法国国家科学研究中心、中央理工学院、UMR 9189 – CRIStAL)

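AI summary For linear contextual stochastic multi-armed bandits in which user preferences and context distributions drift over time, the paper proposes the Dri-MED algorithm, which handles non-stationary noise through heteroskedastic regression and achieves instance-dependent regret and constraint-violation bounds.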

详情
AI中文摘要

我们考虑线性上下文随机多臂赌博机的一个变体,其中学习器必须向一组用户提供推荐,每个用户有其个性化的偏好向量,并且上下文分布随时间漂移。在实践者友好的假设下,我们将此设置简化为具有平稳均值但异方差和非平稳噪声的线性赌博机。我们进一步研究了学习器必须确保每个决策的平均奖励超过基线策略$\boldsymbol{\pi}_0$在每个决策步骤的均值的情况。我们引入了Dri-MED,一种受MED策略线性版本启发并仔细调整以处理非平稳异方差噪声的算法。我们表明,实例相关的遗憾界为$\tilde{\mathcal O}\left(\frac{\kappa}{\tilde{\Delta}}d^2\log(T)\right)$,其中$\tilde{\Delta}$是受策略$\pi_0$约束的次优性间隙,方差感知乘性项$\kappa$通过异方差回归仔细处理。我们进一步表明Dri-MED享有$\tilde{\mathcal{O}}(d)$的期望约束违规。我们的数值结果表明,Dri-MED显著优于忽略漂移和偏好结构的保守基线。

英文摘要

We consider a variant of the linear contextual stochastic multi-armed bandit in which the learner must provide recommendations to a group of users, each with a personalized preference vector, while the context distributions drift over time. Under practitioner-friendly assumptions, we reduce this setting to a linear bandit with stationary mean but heteroskedastic and non-stationary noise. We further study the case in which the learner must ensure that the mean reward of each decision exceeds that of a baseline strategy $\boldsymbol{\pi}_0$ at each decision step. We introduce Dri-MED, an algorithm inspired by the linear version of the MED strategy and carefully adapted to handle the non-stationary heteroskedastic noise. We show that the instance-dependent regret scales as $\tilde{\mathcal{O}}\left(\frac{\kappa}{\tilde{\Delta}} d^2 \log(T)\right)$, where $\tilde{\Delta}$ is the constraint-aware sub-optimality gap subject to policy $\pi_0$, with a variance-aware multiplicative term $\kappa$ that we carefully handle using heteroskedastic regression. We further show that Dri-MED enjoys $\tilde{\mathcal{O}}(d)$ expected constraint violations. Our numerical results suggest that Dri-MED significantly outperforms conservative baselines that ignore the drift and preference structure.

2606.09806 2026-06-09 cs.LG cs.AI 交叉投稿

Topological Neural Operators

拓扑神经算子

Lennart Bastian, Samuel Leventhal, Mustafa Hajij, Tolga Birdal

发表机构 * Imperial College London(伦敦帝国学院) University of San Francisco(旧金山大学)

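AI summary Proposes Topological Neural Operators (TNOs), which use Discrete Exterior Calculus on cell complexes for cross-dimensional coupling and a hierarchical structure to improve long-range information propagation, outperforming existing operators on PDE benchmarks.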

详情
AI中文摘要

我们引入了拓扑神经算子(TNOs),这是一个在细胞复形上进行算子学习的原理性框架,将神经算子(NOs)从点和/或边上的函数提升到拓扑域。TNOs将数据表示为定义在不同维度细胞上的特征,并通过离散外微积分建模它们的相互作用,通过梯度、旋度和散度型算子实现显式的跨维度耦合。关键设计原则是将信息流向(由固定拓扑算子控制)与信息变换(学习得到)解耦,从而产生尊重物理量几何支撑并暴露守恒和相容性结构的模型。我们进一步提出了分层TNOs(HTNOs),它结合了学习到的粗粒度复形以传播长程和拓扑依赖的信息。我们的框架将现有NOs作为特例,提供了跨离散化的算子学习统一视角。在一系列PDE基准测试中,包括不规则几何流动问题,TNOs和HTNOs提高了精度;控制研究进一步隔离了原生高阶和拓扑结构带来的优势。项目页面:https://circle-group.github.io/research/TNO

英文摘要

We introduce Topological Neural Operators (TNOs), a principled framework for operator learning on cell complexes that lifts neural operators (NOs) from functions on points and/or edges to topological domains. TNOs represent data as features defined on cells of varying dimension and model their interactions through Discrete Exterior Calculus, enabling explicit cross-dimensional coupling via gradient-, curl-, and divergence-type operators. The key design principle is to decouple where information flows, as governed by fixed topological operators, from how it is transformed (which is learned), yielding models that respect the geometric support of physical quantities and expose conservation and compatibility structure. We further propose Hierarchical TNOs (HTNOs), which incorporate learned coarse complexes to propagate long-range and topology-dependent information. Our framework subsumes existing NOs as a special case, providing a unified perspective on operator learning across discretizations. Across a range of PDE benchmarks, including irregular-geometry flow problems, TNOs and HTNOs improve accuracy; controlled studies further isolate the benefits of native higher-rank and topological structure. Project page: https://circle-group.github.io/research/TNO
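
The fixed topological operators mentioned above can be sketched on the smallest possible cell complex, a single oriented triangle. The matrices `D0` and `D1` below are the standard signed incidence matrices of Discrete Exterior Calculus; the variable names are illustrative, and none of the learned components of a TNO are shown.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


# One oriented triangle: vertices 0, 1, 2; edges e0 = (0->1), e1 = (1->2), e2 = (0->2).
# D0 maps vertex values to edge differences (a discrete gradient).
D0 = [[-1, 1, 0],
      [0, -1, 1],
      [-1, 0, 1]]
# D1 sums edge values around the face 0->1->2->0 (a discrete curl);
# e2 is traversed backwards, hence the -1.
D1 = [[1, 1, -1]]

if __name__ == "__main__":
    potential = [3.0, 5.0, 4.0]       # a feature on vertices (0-cells)
    gradient = matvec(D0, potential)  # a feature on edges (1-cells)
    curl = matvec(D1, gradient)       # a feature on the face (2-cell)
    print(gradient, curl, matmul(D1, D0))
```

The product `D1 * D0` is identically zero, the discrete counterpart of "the curl of a gradient vanishes". Structure of this kind holds by construction when the operators that route information between cells are fixed rather than learned.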

2606.09816 2026-06-09 cs.CV cs.AI math.PR 交叉投稿

PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws

PTL-Diffusion: 具有周期终端定律的流形感知扩散

Danqi Zhuang, Jisui Huang, Xiaoyue Xi, Andrew Kiggins, Xiaojie Wang, Ke Chen, Yue Wu

发表机构 * University of Pennsylvania(宾夕法尼亚大学) University of Cambridge(剑桥大学) University of Oxford(牛津大学) Harvard University(哈佛大学) MIT(麻省理工学院) University of Washington(华盛顿大学)

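AI summary Proposes PTL-Diffusion, in which the forward noising process converges to a periodic family of Gaussian terminal laws rather than a single distribution, explicitly embedding phase structure and improving distributional matching on low-dimensional manifolds, with lower errors on point-cloud and face datasets.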

详情
AI中文摘要

标准扩散模型通常使用单一时间齐次高斯终端分布作为生成的参考律。虽然这一选择在分析上方便且经验上有效,但对于集中在低维流形附近的数据,它提供的显式结构很少,其中数据分布的不同区域可能对应于不同的局部几何或语义因素。因此,反向模型必须几乎完全从非结构化的终端参考分布中恢复流形级别的结构。我们提出PTL-Diffusion,一种概念验证的扩散框架,其前向噪声过程收敛到一个非常数的周期高斯终端族,而不是单一不变律。与相位条件DDPM不同(其中相位信息仅进入去噪网络,而前向过程保持不变),PTL-Diffusion将相位结构直接嵌入前向噪声动力学中。所提出的构造仍然接近标准去噪扩散模型:对于周期强迫的Ornstein-Uhlenbeck型前向过程,我们推导出闭合形式的前向边际分布、极限周期高斯终端族以及显式高斯反向后验,从而支持标准噪声预测训练。我们还引入了一个不变平均正则化项,通过平均周期参考律耦合相位条件反向动力学。在环面和圆柱点云基准以及Olivetti人脸数据集上的实验表明,PTL-Diffusion在匹配的DDPM基线上改善了流形级别的分布匹配,减少了相位条件误差、特征空间协方差误差和最近邻流形距离。这些结果表明结构化终端参考律是一个有前景的方向,同时激励更具表现力的相位构造和更大规模的评估。

英文摘要

Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structure for data concentrated near low-dimensional manifolds, where different regions of the data distribution may correspond to distinct local geometric or semantic factors. As a result, the reverse model must recover manifold-level structure almost entirely from an unstructured terminal reference distribution. We propose PTL-Diffusion, a proof-of-concept diffusion framework whose forward noising process converges to a nonconstant periodic family of Gaussian terminal laws rather than to a single invariant law. Unlike a phase-conditioned DDPM, where phase information only enters the denoising network while the forward process remains unchanged, PTL-Diffusion embeds phase structure directly into the forward noising dynamics. The proposed construction remains close to standard denoising diffusion models: for a periodically forced Ornstein--Uhlenbeck-type forward process, we derive closed-form forward marginals, the limiting periodic Gaussian terminal family, and explicit Gaussian reverse posteriors, enabling standard noise-prediction training. We also introduce an invariant-average regularization term coupling the phase-conditioned reverse dynamics through the averaged periodic reference law. Experiments on torus and cylinder point-cloud benchmarks and the Olivetti face dataset show that PTL-Diffusion improves manifold-level distributional matching over matched DDPM baselines, reducing phase-conditioned errors, feature-space covariance errors, and nearest-neighbour manifold distances. These results suggest structured terminal reference laws as a promising direction, while motivating more expressive phase constructions and larger-scale evaluations.

2509.25004 2026-06-09 cs.AI 版本更新

CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

CLPO:课程学习与策略优化相结合用于大语言模型推理

Shijie Zhang, Zheng Xiao, Shiyu Liu, Guohao Sun, Kevin Zhang, Xiang Guo, Rujun Guo, Shaoyu Liu, Wangxiao Zhao, Guanjun Jiang

发表机构 * Peking University(北京大学) Qwen Applications Business Group, Alibaba Group(通义实验室,阿里巴巴集团) Xiamen University(厦门大学)

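AI summary Proposes the CLPO framework, which uses on-policy accuracy to adjust problem difficulty dynamically so that the curriculum co-evolves with the policy, substantially outperforming GRPO and DAPO on mathematical and general reasoning benchmarks.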

详情
AI中文摘要

具有可验证奖励的在线强化学习已成为提升大语言模型推理能力的有效范式,但大多数方法仍对静态问题集优化推理轨迹,将rollout预算浪费在已解决或过于困难的问题上。我们提出CLPO(课程学习与策略优化相结合),一种自我进化的课程框架,利用在线策略rollout准确率识别已解决、中等难度和困难问题,然后根据模型当前能力重构所选任务。困难问题被简化以变得可学习,而中等难度问题被多样化以提供有用的训练变化。这使得学习课程能够与策略共同进化,而不是随着模型能力边界移动而保持固定。CLPO不将这些重写视为静态数据增强,而是优化重构轨迹,并根据重写问题的下游准确率增益分配信用,除了原始可验证答案外不需要额外的人工标注。跨数学推理和域外通用推理基准的实验表明,CLPO在Qwen3-8B上分别以平均10.21和7.75个点显著优于GRPO和DAPO。在数学和代码领域的消融研究进一步表明,重构模式和重写损失都对最终增益有贡献,证明了CLPO通过自我进化的课程为激发更强推理能力提供了可扩展且稳健的途径。

英文摘要

Online reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning abilities of large language models, but most methods still optimize reasoning trajectories over a static problem set, wasting rollout budget on solved or overly difficult problems. We propose CLPO (Curriculum Learning meets Policy Optimization), a self-evolving curriculum framework that uses on-policy rollout accuracy to identify solved, medium-difficulty, and hard problems, then restructures selected tasks according to the model's current capability. Hard problems are simplified to become learnable, while medium-difficulty problems are diversified to provide useful training variation. This allows the learning curriculum to co-evolve with the policy rather than remaining fixed as the model's capability boundary shifts. Rather than treating these rewrites as static data augmentation, CLPO optimizes restructuring trajectories with credit assigned by the downstream accuracy gain of the rewritten problem, requiring no additional human annotations beyond the original verifiable answers. Experiments across mathematical reasoning and out-of-domain general reasoning benchmarks show that CLPO substantially outperforms GRPO and DAPO on Qwen3-8B by 10.21 and 7.75 average points, respectively. Ablation studies on math and code domains further show that both the restructuring mode and the rewriting loss contribute to the final gains, demonstrating that CLPO provides a scalable and robust pathway for eliciting stronger reasoning capabilities through a self-evolving curriculum.
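
The bucketing step described above can be sketched as follows. The thresholds `solved_at` and `hard_at`, the function names, and the decision to leave solved problems without a restructuring action are assumptions made for illustration; the abstract does not specify them.

```python
def rollout_accuracy(outcomes):
    """Fraction of on-policy rollouts whose final answer was verified correct."""
    return sum(outcomes) / len(outcomes)


def classify(outcomes, solved_at=0.9, hard_at=0.1):
    """Bucket a problem by rollout accuracy (thresholds are hypothetical)."""
    acc = rollout_accuracy(outcomes)
    if acc >= solved_at:
        return "solved"
    if acc <= hard_at:
        return "hard"
    return "medium"


# Restructuring per bucket, following the abstract: hard problems are
# simplified, medium-difficulty problems are diversified.
RESTRUCTURE = {"hard": "simplify", "medium": "diversify"}


def plan(batch):
    """Map each problem id to (bucket, restructuring action or None)."""
    result = {}
    for problem_id, outcomes in batch.items():
        bucket = classify(outcomes)
        result[problem_id] = (bucket, RESTRUCTURE.get(bucket))
    return result


if __name__ == "__main__":
    batch = {"a": [1, 1, 1, 1], "b": [0, 0, 0, 0], "c": [1, 0, 1, 0]}
    print(plan(batch))
```

Because the buckets are recomputed from the current policy's rollouts, the same problem can move from "hard" to "medium" to "solved" as training progresses, which is what lets the curriculum track the policy.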

2512.07355 2026-06-09 cs.AI cs.CV cs.LG 版本更新

A Geometric Unification of Concept Learning with Concept Cones

概念学习与概念锥的几何统一

Alexandre Rocchi, Thomas Fel, Gianni Franchi

发表机构 * AMIAD Kempner Institute, Harvard University(哈佛大学凯普勒研究所)

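AI summary Unifies supervised Concept Bottleneck Models and unsupervised Sparse Autoencoders through a shared geometric framework (concept cones), proposes containment metrics for evaluating concept alignment, and finds a sweet spot in sparsity and expansion factor.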

Comments 33 pages

详情
AI中文摘要

两种可解释性传统并行发展但很少相互交流:概念瓶颈模型(CBM)规定概念应该是什么,而稀疏自编码器(SAE)发现哪些概念涌现。CBM使用监督将激活与人类标记的概念对齐,而SAE依赖稀疏编码来揭示涌现概念。我们证明两种范式实例化相同的几何结构:每个范式学习激活空间中的一组线性方向,其非负组合形成概念锥。因此,监督和无监督方法的不同不在于种类,而在于如何选择这个锥。基于这一观点,我们提出了两种范式之间的操作桥梁。CBM提供人类定义的参考几何,而SAE可以通过其学习的锥在多大程度上近似或包含CBM的锥来评估。这种包含框架产生了量化指标,将归纳偏差(如SAE类型、稀疏性或扩展比)与合理概念的涌现联系起来。使用这些指标,我们发现了稀疏性和扩展因子的“最佳点”,该点最大化与CBM概念的几何和语义对齐。总体而言,我们的工作通过共享的几何框架统一了监督和无监督的概念发现,提供了原则性指标来衡量SAE进展,并评估发现的概念与合理的人类概念的对齐程度。

英文摘要

Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases (such as SAE type, sparsity, or expansion ratio) to the emergence of plausible concepts. Using these metrics, we uncover a "sweet spot" in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concepts align with plausible human concepts. Here "plausible" follows the terminology of Jacovi and Goldberg (2020), who distinguish between faithful explanations (accurately reflecting model computations) and plausible explanations (aligning with human intuition and domain knowledge). CBM concepts are plausible by construction, being selected or annotated by humans, though not necessarily faithful to the true latent factors that organise the data manifold.

2512.12225 2026-06-09 cs.AI 版本更新

A Geometric Theory of Cognition for Machine Intelligence

机器智能的认知几何理论

Laha Ale

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China(计算与人工智能学院,西南交通大学,成都,中国)

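AI summary Proposes a gradient-flow framework on a Riemannian manifold that unifies representation, memory, adaptation, and prediction; on partially observable reinforcement-learning tasks it outperforms feedforward baselines, with robustness comparable to recurrent architectures.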

详情
AI中文摘要

开发能够统一表征、记忆、适应和预测的人工智能体仍然是人工智能中的一个基本挑战。在这里,我们引入了一个几何框架,其中认知计算源于学习到的潜在流形上的黎曼梯度流。学习到的度量编码了表征约束和计算偏好,而几何中的各向异性自然产生了多个时间尺度的行为,从而在没有显式记忆模块或循环机制的情况下,同时产生快速反应响应和较慢的适应动态。我们通过黎曼表征和动态模型实例化该框架,并在部分可观测的强化学习环境中进行评估。在观测掩蔽、感觉中断、动态扰动和预测性潜在建模任务中,所提出的方法始终优于前馈基线,实现了与循环架构相当的鲁棒性,并产生了高度可预测的潜在轨迹,具有较低的长程展开误差。这些结果表明,学习到的潜在几何可以同时作为表征、记忆、适应和预测的基质。更广泛地说,该框架提供了动力系统、表征学习和基于世界模型的智能之间的原则性联系。

英文摘要

Developing artificial agents that unify representation, memory, adaptation, and prediction remains a fundamental challenge in artificial intelligence. Here we introduce a geometric framework in which cognitive computation emerges from Riemannian gradient flow on a learned latent manifold. The learned metric encodes representational constraints and computational preferences, while anisotropies in the geometry naturally generate multiple timescales of behaviour, yielding both rapid reactive responses and slower adaptive dynamics without explicit memory modules or recurrent mechanisms. We instantiate this framework through Riemannian representation and dynamics models and evaluate them in partially observable reinforcement-learning environments. Across observation masking, sensory blackouts, dynamics perturbations, and predictive latent-modelling tasks, the proposed approach consistently outperforms feedforward baselines, achieves robustness comparable to recurrent architectures, and produces highly predictable latent trajectories with low long-horizon rollout error. These results suggest that learned latent geometry can serve simultaneously as a substrate for representation, memory, adaptation, and prediction. More broadly, the framework provides a principled connection between dynamical systems, representation learning, and world-model-based intelligence.

2601.04805 2026-06-09 cs.AI 版本更新

Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

基于思考的非思考:通过强化学习解决混合推理模型训练中的奖励黑客问题

Siyuan Gan, Jiaheng Liu, Boyan Wang, Tianpei Yang, Runqing Miao, Yuyao Zhang, Fanyu Meng, Junlan Feng, Linjian Meng, Jing Huo, Yang Gao

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China(南京大学新型软件技术国家重点实验室) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室) Jiutian Research, Beijing, China(九天研究院)

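AI summary To address reward hacking when training hybrid reasoning models, proposes Thinking-Based Non-Thinking, which uses solution information from thinking responses to set query-specific maximum token limits for non-thinking responses, cutting token usage by about 50% on math benchmarks while improving accuracy.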

详情
AI中文摘要

大型推理模型(LRMs)因其卓越性能而备受关注。然而,其性能主要源于思考(即长链思维CoT),这显著增加了计算开销。为解决这一过度思考问题,现有工作侧重于使用强化学习(RL)训练混合推理模型,使其根据查询复杂度自动决定是否进行思考。不幸的是,使用RL会遇到奖励黑客问题,例如,模型进行了思考但被判定为未思考,导致奖励错误。为缓解此问题,现有工作要么采用监督微调(SFT),计算成本高昂,要么对非思考型回答强制设置统一令牌限制,缓解效果有限。本文提出基于思考的非思考(TNT)。它不使用SFT,而是通过利用思考型回答的解决方案组件中的信息,为不同查询的非思考型回答设置不同的最大令牌使用量。在五个数学基准上的实验表明,与DeepSeek-R1-Distill-Qwen-1.5B/7B和DeepScaleR-1.5B相比,TNT将令牌使用量减少约50%,同时显著提高准确率。事实上,TNT在所有测试方法中实现了准确率与效率之间的最优权衡。此外,在所有测试数据集中,TNT被分类为未使用思考的回答中出现奖励黑客问题的概率低于10%。

英文摘要

Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, i.e., a long Chain of Thought (CoT), which significantly increases computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking based on the complexity of the query. Unfortunately, using RL suffers from the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing approaches either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields limited mitigation. In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT; instead, it sets a different maximum token usage for non-thinking responses to each query by leveraging information from the solution component of the thinking responses. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the optimal trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of reward hacking among TNT's responses classified as not using thinking remains below 10% across all tested datasets.

2602.08222 2026-06-09 cs.AI 版本更新

Weak-Driven Learning: How Weak Agents make Strong Agents Stronger

弱驱动学习:弱智能体如何使强智能体更强

Zehao Chen, Gongxun Li, Tianxiang Ai, Zixuan Huang, Xiaodong Liu, Yifei Li, Wang Zhou, Fuzhen Zhuang, Xianglong Liu, Jianxin Li, Deqing Wang, Yikun Ban

发表机构 * Beihang University(北航) China Telecom eSurfing Cloud(中国电信eSurfing云)

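AI summary To address the saturation bottleneck in LLM post-training, proposes WMSS, which uses the model's historical weak checkpoints to identify recoverable learning gaps via entropy dynamics and applies compensatory learning, yielding performance gains on mathematical reasoning and code generation at no additional inference cost.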

详情
AI中文摘要

随着后训练优化成为改进大语言模型的核心,我们观察到一种持续的饱和瓶颈:一旦模型变得高度自信,进一步训练带来的收益递减。虽然现有方法继续强化目标预测,但我们发现信息丰富的监督信号仍然潜藏在模型自身的历史弱状态中。受此观察启发,我们提出WMSS(弱智能体可以使强智能体更强),一种利用弱检查点指导持续优化的后训练范式。通过熵动力学识别可恢复的学习差距,并通过补偿学习强化它们,WMSS使强智能体能够超越传统的后训练饱和。在数学推理和代码生成数据集上的实验表明,使用我们的方法训练的智能体实现了有效的性能提升,同时不产生额外的推理成本。

英文摘要

As post-training optimization becomes central to improving large language models, we observe a persistent saturation bottleneck: once models grow highly confident, further training yields diminishing returns. While existing methods continue to reinforce target predictions, we find that informative supervision signals remain latent in models' own historical weak states. Motivated by this observation, we propose WMSS (Weak Agents Can Make Strong Agents Stronger), a post-training paradigm that leverages weak checkpoints to guide continued optimization. By identifying recoverable learning gaps via entropy dynamics and reinforcing them through compensatory learning, WMSS enables strong agents to improve beyond conventional post-training saturation. Experiments on mathematical reasoning and code generation datasets show that agents trained with our approach achieve effective performance improvements, while incurring zero additional inference cost.

2605.03862 2026-06-09 cs.AI cs.CL 版本更新

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

正确性不足:通过执行器导向的奖励训练推理计划器

Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su

发表机构 * D 4 Lab(D4实验室) Independent Researcher(独立研究者)

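AI summary Proposes the TraceLift framework, which improves reasoning quality through executor-grounded rewards and uses a rubric-based Reasoning Reward Model to assess the reliability and usefulness of reasoning traces.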

Comments 36 pages

详情
AI中文摘要

可验证奖励的强化学习已成为提升大语言模型显式推理的常见方法,但仅凭最终答案正确性无法揭示推理轨迹的忠实性、可靠性或对消费模型的效用。为此,我们提出TraceLift,将推理视为可消费的中间产物。在计划器训练中,计划器生成标记化的推理。冻结的执行器将此推理转化为最终产物供验证器反馈,同时执行器导向的奖励塑造中间轨迹。此奖励乘以基于rubric的Reasoning Reward Model评分,乘以在相同冻结执行器上测量的提升,奖励高质量且有用的轨迹。为使推理质量直接可学习,我们引入TRACELIFT-GROUPS数据集,包含数学和代码种子问题。每个示例是同一问题组,包含高质量参考轨迹和多个可能的错误轨迹,通过局部扰动降低推理质量或解决方案支持,同时保持任务相关性。在代码和数学基准上的广泛实验表明,执行器导向的推理奖励提高了两阶段计划器-执行器系统,表明推理监督应不仅评估轨迹是否看起来好,还应评估其是否帮助消耗模型。

英文摘要

Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it. Our code is available at: https://github.com/MasaiahHan/TraceLift
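
The multiplicative reward described above can be sketched in a few lines. The abstract states only that a rubric-based RM score is multiplied by measured uplift on the frozen executor; defining uplift as the change in executor pass rate when the trace is supplied, and clipping it at zero, are assumptions made here for illustration, as are the function names.

```python
def uplift(pass_rate_with_trace, pass_rate_without_trace):
    """Gain in the frozen executor's pass rate from seeing the trace.

    Clipping at zero is an assumption: a trace that hurts the executor
    earns no reward rather than a negative one.
    """
    return max(0.0, pass_rate_with_trace - pass_rate_without_trace)


def executor_grounded_reward(rm_score, pass_rate_with_trace, pass_rate_without_trace):
    """Reward = rubric-based RM score x measured uplift on the same executor."""
    return rm_score * uplift(pass_rate_with_trace, pass_rate_without_trace)


if __name__ == "__main__":
    # A well-written trace that also helps the executor.
    print(executor_grounded_reward(0.8, 0.75, 0.25))
    # A well-written trace that gives the executor nothing.
    print(executor_grounded_reward(0.8, 0.25, 0.25))
```

The product form means a trace must score well on both factors: a polished but useless trace and a helpful but low-quality one both receive little reward.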

2605.19662 2026-06-09 cs.AI 版本更新

When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach

当表格基础模型遇见策略性表格数据:一种先验对齐方法

Xinpeng Lv, Yunxin Mao, Renzhe Xu, Chunyuan Zheng, Yikai Chen, Haoxuan Li, Jinxuan Yang, Kun Kuang, Yuanlong Chen, Mingyang Geng, Wanrong Huang, Shixuan Liu, Shaowu Yang, Wenjing Yang, Zhouchen Lin, Haotian Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

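AI summary Studies how well tabular foundation models generalize to strategic tabular data and proposes SPN, a strategy-aware prior-alignment framework that improves robustness and predictive performance in strategic environments.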

Comments Accepted by ICML2026

详情
AI中文摘要

基于预训练先验数据拟合网络(PFNs)的表格基础模型在多样化的表格任务上表现出强大的泛化能力,但通常设计用于非策略性设置,其中数据分布与部署分类器无关。然而,在许多现实世界决策场景中,个体可能在部署后有意识地修改特征以获得有利结果,导致部署后分布偏移。本文研究了PFN风格的表格基础模型是否能泛化到此类策略性表格数据。我们证明,策略性操纵导致了预训练期间学习的非策略性先验与操纵后的策略性先验之间的不匹配,从而产生系统性的预测偏差。为了解决这个问题,我们提出了策略性先验数据拟合网络(SPN),一种推理时策略感知的框架,能够在不重新训练的情况下将表格基础模型适应到策略性环境。SPN构建策略性上下文示例以近似操纵后的输入,并将PFN预测与诱导的策略性分布对齐。在现实世界和合成表格数据集上的实验表明,与表格基础模型和经典表格方法相比,SPN在策略性操纵下始终提高了鲁棒性和预测性能。

英文摘要

Tabular foundation models based on pretrained prior-data fitted networks (PFNs) have shown strong generalization on diverse tabular tasks, but they are typically designed for non-strategic settings where data distributions are independent of deployed classifiers. In many real-world decision scenarios, however, individuals may strategically modify their features after deployment to obtain favorable outcomes, inducing a post-deployment distribution shift. This paper studies whether PFN-style tabular foundation models can generalize to such strategic tabular data. We show that strategic manipulation creates a mismatch between the non-strategic prior learned during pretraining and the post-manipulation strategic prior, which leads to systematic prediction bias. To address this issue, we propose Strategic Prior-data Fitted Network (SPN), an inference-time strategy-aware framework that adapts tabular foundation models to strategic environments without retraining. SPN constructs strategic in-context examples to approximate post-manipulation inputs and aligns PFN predictions with the induced strategic distribution. Experiments on real-world and synthetic tabular datasets show that SPN consistently improves robustness and predictive performance under strategic manipulation compared with both tabular foundation models and classical tabular methods.

2605.19674 2026-06-09 cs.AI 版本更新

Beyond Rational Illusion: Behaviorally Realistic Strategic Classification

超越理性错觉:行为现实的战略分类

Xinpeng Lv, Yunxin Mao, Renzhe Xu, Chunyuan Zheng, Yikai Chen, Haoxuan Li, Yang Shi, Jinxuan Yang, Zhouchen Lin, Yuanlong Chen, Yuanxing Zhang, Shaowu Yang, Wenjing Yang, Haotian Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

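AI summary Proposes a behaviorally realistic strategic classification framework grounded in prospect theory to handle strategic manipulation by real-world agents whose decisions are shaped by psychological biases.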

Comments Accepted by ICML2026

详情
AI中文摘要

战略分类(SC)研究了决策模型与策略性操纵特征以获得有利结果的代理之间的相互作用。现有SC框架通常依赖于理想化的假设,即代理是严格理性的。然而,行为经济学和心理学的证据一致表明,现实世界中的决策往往受到认知偏差的影响,偏离纯粹理性。为了正式化这一限制,我们识别并定义了一个新的问题设置,称为行为现实的战略分类问题,其中代理的策略性操纵由于心理偏差而偏离完全理性。受识别限制的启发,我们提出了前景引导的战略框架(Pro-SF)来解决这个问题,这是一个基于前景理论的原理框架,用于建模和学习在行为现实的战略响应下。具体来说,为了捕捉行为现实的战略操纵,我们的框架通过引入三种受前景理论启发的关键机制,重新表述了代理与决策者之间的Stackelberg式互动,包括收益与成本之间的不对称性、不同的主观参照点以及非理性的概率扭曲。在合成和现实世界数据集上的实验表明,Pro-SF是一种行为导向的战略分类方法,连接了机器学习和行为经济学,为现实世界中的更可靠部署提供了桥梁。

英文摘要

Strategic classification (SC) studies the interaction between decision models and agents who strategically manipulate their features for favorable outcomes. Existing SC frameworks typically rely on the idealized assumption that agents are strictly rational. However, evidence from behavioral economics and psychology consistently shows that real-world decision-making is often shaped by cognitive biases, deviating from pure rationality. To formalize this limitation, we identify and define a new problem setting, termed the behaviorally realistic strategic classification problem, where agents' strategic manipulations deviate from full rationality due to psychological biases. To address this problem, we propose the Prospect-Guided Strategic Framework (Pro-SF), a principled framework grounded in prospect theory for modeling and learning under behaviorally realistic strategic responses. Specifically, to capture behaviorally realistic strategic manipulations, our framework reformulates the Stackelberg-style interaction between agents and the decision-maker by incorporating three key mechanisms inspired by prospect theory: the asymmetry between benefits and costs, different subjective reference points, and non-rational probability distortion. Experiments on synthetic and real-world datasets establish Pro-SF as a behaviorally grounded approach to strategic classification, bridging machine learning and behavioral economics for more reliable deployment in the real world.

2605.29823 2026-06-09 cs.AI 版本更新

Quantifying and Optimizing Simplicity via Polynomial Representations

通过多项式表示量化和优化简单性

Tianren Zhang, Xiangxin Li, Minghao Xiao, Guanyu Chen, Feng Chen

发表机构 * [cs.AI](计算机科学与人工智能)

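AI summary Proposes polynomial representations as a distribution-aware, low-dimensional surrogate for neural functions: orthogonal polynomial bases approximate a network's predictive behavior, the effective degree serves as a simplicity measure, and a differentiable regularizer is derived to improve generalization.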

Comments ICML 2026

详情
AI中文摘要

深度网络通常表现出对“简单”解的偏好,这种简单性偏差被广泛认为在泛化中起关键作用。然而,一种广泛适用、定量的简单性度量仍然难以捉摸。我们引入多项式表示作为分布感知的、低维神经函数代理:我们使用正交多项式基沿数据依赖的插值路径近似网络的预测行为,从而得到紧凑的函数表示。我们表明,该表示的有效度可作为实用的简单性度量,能够预测跨任务和架构的泛化,并且持续优于现有的泛化代理(如锐度)。最后,多项式表示自然产生可微的简单性正则化器,在图像和文本分类、微调对比视觉语言模型以及强化学习中持续改善泛化。

英文摘要

Deep networks often exhibit a preference for "simple" solutions, and such a simplicity bias is widely believed to play a key role in generalization. Yet a broadly applicable, quantitative measure of simplicity remains elusive. We introduce polynomial representations as a distribution-aware, low-dimensional surrogate for neural functions: we approximate a network's predictive behavior along data-dependent interpolation paths using orthogonal polynomial bases, yielding a compact functional representation. We show that the effective degree of this representation serves as a practical simplicity metric that is predictive of generalization across tasks and architectures, and consistently outperforms existing generalization proxies such as sharpness. Finally, polynomial representations naturally yield a differentiable simplicity regularizer, which consistently improves generalization in image and text classification, fine-tuning contrastive vision-language models, and reinforcement learning.
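
One way to picture the idea above is to project a scalar function, sampled along a one-dimensional path with parameter t in [-1, 1], onto Legendre polynomials and summarize where the energy sits. The abstract does not define "effective degree"; the energy-weighted mean degree used below is an assumed stand-in, and the lambda functions stand in for a network's output along an interpolation path.

```python
def legendre(k, t):
    """Evaluate the Legendre polynomial P_k(t) with the three-term recurrence."""
    p_prev, p = 1.0, t
    if k == 0:
        return p_prev
    for n in range(1, k):
        p_prev, p = p, ((2 * n + 1) * t * p - n * p_prev) / (n + 1)
    return p


def legendre_coeffs(f, max_degree, n_grid=2000):
    """Project f onto P_0..P_max_degree on [-1, 1] using the midpoint rule."""
    h = 2.0 / n_grid
    ts = [-1.0 + (i + 0.5) * h for i in range(n_grid)]
    fs = [f(t) for t in ts]
    return [(2 * k + 1) / 2.0 * h * sum(v * legendre(k, t) for v, t in zip(fs, ts))
            for k in range(max_degree + 1)]


def effective_degree(coeffs):
    """Energy-weighted mean degree (an assumed definition, not the paper's)."""
    # Energy of the degree-k term: c_k^2 * ||P_k||^2 = c_k^2 * 2 / (2k + 1).
    energy = [c * c * 2.0 / (2 * k + 1) for k, c in enumerate(coeffs)]
    total = sum(energy)
    return sum(k * e for k, e in enumerate(energy)) / total if total > 0 else 0.0


if __name__ == "__main__":
    linear = lambda t: t                           # behaves like P_1 along the path
    quadratic = lambda t: 0.5 * (3 * t * t - 1)    # exactly P_2
    print(effective_degree(legendre_coeffs(linear, 6)))
    print(effective_degree(legendre_coeffs(quadratic, 6)))
```

Under this definition a function that is linear along the path scores close to 1 and a pure quadratic close to 2, so lower values correspond to simpler behavior along that path.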

2606.07108 2026-06-09 cs.AI 版本更新

DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

DyCon: 通过演化难度建模的动态推理控制

Tengyao Tu, Yulin Li, Hui-Ling Zhen, Libo Qin, Zhoujun Wei, Jinghua Piao, Zhuotao Tian, Yong Li, Min Zhang

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Zhongguancun Academy(中关村学院) Huawei Noah’s Ark Lab(华为诺亚实验室) Shenzhen Loop Area Institute(深圳环城研究院) Tsinghua University(清华大学)

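AI summary Proposes DyCon, a training-free framework that uses step-level embeddings to model how difficulty evolves during reasoning and to control reasoning depth, reducing redundant steps and improving efficiency without loss of accuracy.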

Comments Accepted at ICML 2026

详情
AI中文摘要

近期大型推理模型(LRMs)通过迭代反思、探索和执行复杂任务取得了显著的性能提升,但由于冗余推理(即“过度思考”)而效率低下。现有的缓解方法要么依赖静态难度估计,要么需要特定任务训练,因此无法适应推理过程中的动态复杂性。在这项工作中,我们经验性地证明,问题难度在推理过程中动态演化,并线性编码在LRM的步骤级嵌入中。基于这一发现,我们提出了DyCon,一个无需训练的框架,利用潜在步骤级表示显式建模演化中的任务难度,从而实现对推理深度的动态控制以缓解过度思考问题。在4B到32B的四个模型上进行的广泛实验,涵盖数学推理、通用问答和编码任务的十二个基准测试表明,DyCon通过减少冗余步骤显著提升了推理效率,且不牺牲准确性或泛化能力。代码可在 https://github.com/yu-lin-li/DyCon 获取。

英文摘要

Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as "overthinking". Existing methods to mitigate this issue either rely on static difficulty estimates or require task-specific training, and thus fail to adapt to the dynamic complexity during reasoning. In this work, we empirically show that the problem difficulty evolves dynamically throughout the reasoning process and is linearly encoded in the LRM's step-level embeddings. Building on this insight, we propose DyCon, a training-free framework that leverages latent step-level representations to explicitly model the evolving task difficulty, enabling the dynamic control of reasoning depth to mitigate the overthinking issue. Extensive experiments conducted on four models ranging from 4B to 32B, and across twelve benchmarks in math reasoning, general question answering, and coding tasks demonstrate that DyCon significantly enhances reasoning efficiency by reducing redundant steps without sacrificing accuracy or generalization. Code is available at https://github.com/yu-lin-li/DyCon.

2402.13425 2026-06-09 cs.LG cs.AI stat.ML 版本更新

Investigating the Histogram Loss in Regression

探究回归中的直方图损失

Ehsan Imani, Kai Luedemann, Sam Scholnick-Hughes, Esraa Elelimy, Martha White

发表机构 * Alberta Machine Intelligence Institute (Amii) and Reinforcement Learning and Artificial Intelligence Laboratory(阿尔伯塔机器智能研究所(Amii)和强化学习与人工智能实验室) Department of Computing Science, University of Alberta(计算科学系,阿尔伯塔大学) University of Tübingen(图宾根大学) Zuse School ELIZA(祖斯学校ELIZA)

AI总结 本文通过理论和实验分析,探究直方图损失在回归任务中提升性能的原因,发现其优势源于优化改进而非额外信息建模,并在常见深度学习应用中验证其有效性。

Comments 52 pages

详情
Journal ref
JMLR,2026
AI中文摘要

在回归任务中,即使预测只需要均值,训练神经网络来建模整个分布也变得越来越常见。这种额外的建模通常会带来性能提升,但其背后的原因尚不完全清楚。本文研究了一种最近的回归方法——直方图损失,该方法通过最小化目标分布与灵活直方图预测之间的交叉熵来学习目标变量的条件分布。我们设计了理论和实证分析,以确定这种性能提升出现的原因和时机,以及损失的不同组成部分如何贡献于这种提升。我们的结果表明,在这种设置下学习分布的好处来自于优化方面的改进,而非建模额外信息。然后,我们展示了直方图损失在常见深度学习应用中的可行性,无需昂贵的超参数调优。

英文摘要

It is becoming increasingly common in regression to train neural networks that model the entire distribution even if only the mean is required for prediction. This additional modeling often comes with performance gain and the reasons behind the improvement are not fully known. This paper investigates a recent approach to regression, the Histogram Loss, which involves learning the conditional distribution of the target variable by minimizing the cross-entropy between a target distribution and a flexible histogram prediction. We design theoretical and empirical analyses to determine why and when this performance gain appears, and how different components of the loss contribute to it. Our results suggest that the benefits of learning distributions in this setup come from improvements in optimization rather than modelling extra information. We then demonstrate the viability of the Histogram Loss in common deep learning applications without a need for costly hyperparameter tuning.
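
As a rough illustration of the loss described above, the following Python sketch computes the cross-entropy between a target distribution and a softmax histogram prediction. The truncated-Gaussian target, the bin range, and every function name here are assumptions made for this example; the abstract does not fix any of them.

```python
import math

def bin_edges(lo, hi, k):
    """Return k+1 equally spaced edges covering [lo, hi] (assumed support)."""
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def target_histogram(y, edges, sigma):
    """Mass that a Gaussian centred at y, truncated to the bin range, puts in each bin."""
    cdf = [gaussian_cdf(e, y, sigma) for e in edges]
    total = cdf[-1] - cdf[0]
    return [(cdf[i + 1] - cdf[i]) / total for i in range(len(edges) - 1)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def histogram_loss(logits, y, edges, sigma):
    """Cross-entropy between the target histogram and the predicted histogram."""
    p = target_histogram(y, edges, sigma)
    q = softmax(logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

def predicted_mean(logits, edges):
    """Point prediction: expected bin centre under the predicted histogram."""
    q = softmax(logits)
    centers = [(edges[i] + edges[i + 1]) / 2.0 for i in range(len(q))]
    return sum(qi * c for qi, c in zip(q, centers))
```

Even though the network outputs a full histogram, the prediction used at test time is still a single number (`predicted_mean`), which matches the setting where only the mean is required.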

2411.03253 2026-06-09 cs.LG cs.AI cs.DS 版本更新

Discovering Data Structures: Nearest Neighbor Search and Beyond

发现数据结构:最近邻搜索及其他

Omar Salemohamed, Laurent Charlin, Shivam Garg, Vatsal Sharan, Gregory Valiant

发表机构 * Université de Montréal(蒙特利尔大学) Mila HEC Montréal(蒙特利尔高等商学院) Microsoft Research(微软研究院) University of Southern California(南加州大学) Stanford University(斯坦福大学)

AI总结 提出一个端到端学习数据结构的通用框架,自动适应数据分布并控制查询与空间复杂度,在最近邻搜索中逆向工程出二分搜索、插值搜索、k-d树和局部敏感哈希等算法。

Comments NeurIPS 2025 version

详情
AI中文摘要

我们提出了一个用于端到端学习数据结构的通用框架。我们的框架适应底层数据分布,并对查询和空间复杂度提供细粒度控制。关键在于,数据结构是从头开始学习的,不需要仔细初始化或用候选数据结构/算法进行种子化。我们首先将该框架应用于最近邻搜索问题。在多种设置中,我们能够逆向工程出学习到的数据结构和查询算法。对于一维最近邻搜索,模型发现了最优的分布(不)依赖算法,如二分搜索和插值搜索的变体。在更高维度中,模型学习到的解决方案在某些情况下类似于k-d树,而在其他情况下则具有局部敏感哈希的元素。该模型还能学习高维数据的有用表示,并利用它们设计有效的数据结构。我们还将框架应用于数据流上的频率估计问题,并相信它也可以成为新问题的强大发现工具。

英文摘要

We propose a general framework for end-to-end learning of data structures. Our framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures/algorithms. We first apply this framework to the problem of nearest neighbor search. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution (in)dependent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others, they have elements of locality-sensitive hashing. The model can also learn useful representations of high-dimensional data and exploit them to design effective data structures. We also adapt our framework to the problem of estimating frequencies over a data stream, and believe it could also be a powerful discovery tool for new problems.
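
For readers unfamiliar with the two classical 1D algorithms named above, here is a minimal Python sketch of nearest neighbor search over a sorted, non-empty list using binary search and using interpolation search. These are the textbook hand-written versions, not the learned data structures from the paper, and the function names are hypothetical.

```python
from bisect import bisect_left

def _closest_around(keys, i, q):
    """Pick the closer of the two keys that bracket insertion index i."""
    candidates = keys[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda k: abs(k - q))

def nearest_binary(keys, q):
    """Distribution-independent: about log2(n) probes via binary search."""
    return _closest_around(keys, bisect_left(keys, q), q)

def lower_bound_interpolation(keys, q):
    """First index whose key is >= q, probing where q 'should' be if keys were evenly spread."""
    if not keys or q <= keys[0]:
        return 0
    if q > keys[-1]:
        return len(keys)
    lo, hi = 0, len(keys) - 1  # invariant: keys[lo] < q <= keys[hi]
    while hi - lo > 1:
        frac = (q - keys[lo]) / (keys[hi] - keys[lo])
        mid = lo + int(frac * (hi - lo))
        mid = min(max(mid, lo + 1), hi - 1)  # keep the probe strictly inside
        if keys[mid] < q:
            lo = mid
        else:
            hi = mid
    return hi

def nearest_interpolation(keys, q):
    """Distribution-dependent: fewer probes when keys are close to uniform."""
    return _closest_around(keys, lower_bound_interpolation(keys, q), q)
```

Both functions return the same answer; they differ only in how many probes they need, which is the kind of query-complexity trade-off the framework is said to control.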

2503.18314 2026-06-09 cs.LG cs.AI cs.CV 版本更新

LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty

LoTUS:带有不确定性风味的大规模机器遗忘

Christoforos N. Spartalis, Theodoros Semertzidis, Petros Daras, Efstratios Gavves

发表机构 * University of Amsterdam(阿姆斯特丹大学) Centre for Research & Technology Hellas(希腊研究与技术中心) Archimedes/Athena RC(阿基米德/雅典娜研究中心)

AI总结 提出LoTUS方法,通过平滑预测概率至信息论界限来消除训练样本影响,避免从头重训练,在Transformer和ResNet18模型上超越现有方法,并引入RF-JSD指标用于实际评估。

Comments Accepted as a main conference paper at CVPR 2025 (https://cvpr.thecvf.com/virtual/2025/poster/33292)

详情
AI中文摘要

我们提出了LoTUS,一种新颖的机器遗忘(MU)方法,它消除了预训练模型中训练样本的影响,避免了从头开始重新训练。LoTUS将模型的预测概率平滑到信息论界限,减轻了因数据记忆导致的过度自信。我们在Transformer和ResNet18模型上,针对五个公共数据集,与八个基线方法进行了评估。除了已有的MU基准测试,我们还在ImageNet1k(一个大规模数据集,其中重新训练不切实际)上评估了遗忘效果,模拟了真实世界条件。此外,我们引入了新颖的无重训练詹森-香农散度(RF-JSD)指标,以便在真实世界条件下进行评估。实验结果表明,LoTUS在效率和有效性方面均优于最先进的方法。代码:https://github.com/cspartalis/LoTUS。

英文摘要

We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. LoTUS smooths the prediction probabilities of the model up to an information-theoretic bound, mitigating its over-confidence stemming from data memorization. We evaluate LoTUS on Transformer and ResNet18 models against eight baselines across five public datasets. Beyond established MU benchmarks, we evaluate unlearning on ImageNet1k, a large-scale dataset, where retraining is impractical, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. The experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. Code: https://github.com/cspartalis/LoTUS.

2504.05349 2026-06-09 stat.ML cs.AI cs.LG 版本更新

Hyperflux: Pruning Reveals Importance

Hyperflux: 剪枝揭示重要性

Eugen Barbulescu, Antonio Alexoaie, Lucian Busoniu

发表机构 * Department of Computer Science(计算机科学系) Technical University of Cluj-Napoca(克卢日-纳波卡技术大学) Department of Automation(自动化系)

AI总结 提出Hyperflux方法,通过将剪枝建模为连续演化系统(通量和压力),在微观和宏观层面解释剪枝行为,并引入压力调度器实现目标稀疏度,在多个数据集上取得竞争性结果。

详情
AI中文摘要

网络剪枝用于减少大型神经网络的推理延迟和功耗。然而,大多数方法侧重于经验结果,而牺牲了对剪枝过程的理解。我们引入Hyperflux,一种新颖的$L_0$方法,将剪枝建模为由通量(权重移除的梯度响应)和压力(驱动权重向剪枝发展的全局正则化)决定的连续演化系统。通过利用该模型,Hyperflux的剪枝行为在微观(权重再生/剪枝)和宏观(稀疏性收敛等)层面都变得可理解。我们还引入了一种新颖的压力调度器,可靠地针对目标稀疏度。Hyperflux在CIFAR-10、CIFAR-100和ImageNet数据集上使用ResNet-50、VGG-19和DeiT-T/S取得了竞争性结果。

英文摘要

Network pruning is used to reduce inference latency and power consumption in large neural networks. However, most methods focus on empirical results at the expense of understanding the pruning process. We introduce Hyperflux, a novel $L_0$ method which models pruning as a continuously evolving system determined by flux, the gradient response to a weight's removal, and pressure, a global regularization driving weights toward pruning. By exploiting this model, Hyperflux's pruning behavior becomes understandable at both microscopic (weight regrowth/pruning) and macroscopic (sparsity convergence, etc.) levels. We also introduce a novel pressure scheduler that reliably targets desired sparsities. Hyperflux achieves competitive results with ResNet-50, VGG-19 and DeiT-T/S on CIFAR-10, CIFAR-100 and ImageNet datasets.

2505.20137 2026-06-09 cs.LG cs.AI 版本更新

ePC: Fast and Deep Predictive Coding in Digital Simulation

ePC:数字仿真中的快速深度预测编码

Cédric Goemaere, Gaspard Oliviers, Rafal Bogacz, Thomas Demeester

发表机构 * IDLab, Ghent University -- imec, Belgium(ID实验室,根特大学——imec,比利时) Brain Network Dynamics Unit, University of Oxford, UK(脑网络动力学单位,牛津大学,英国)

AI总结 提出误差预测编码(ePC),通过重新参数化解决标准状态预测编码(sPC)在数字仿真中的指数信号衰减问题,实现与反向传播相当的深度模型训练速度。

Comments Accepted at ICML 2026 - Main Track. All code available at https://github.com/cgoemaere/error_based_PC

详情
AI中文摘要

预测编码(PC)为神经网络训练提供了一种受大脑启发的反向传播替代方案,被描述为最小化其内部能量的物理系统。然而,在实践中,PC主要是在数字仿真中实现的,需要大量的计算,同时难以扩展到更深的架构。本文重新构建了PC以克服这种硬件-算法不匹配。首先,我们揭示了规范的状态基PC(sPC)在数字仿真中本质上是深度低效的,不可避免地导致指数级信号衰减,从而阻碍整个最小化过程。然后,为了克服这一根本限制,我们引入了误差基PC(ePC),这是一种新的PC重新参数化,不会遭受信号衰减。虽然不再具有生物合理性,但ePC数值计算精确的PC权重梯度,运行速度比sPC快几个数量级。跨多个架构和数据集的实验表明,即使在sPC难以处理的更深模型中,ePC也能匹配反向传播的性能。除了实际改进,我们的工作还提供了对PC动力学的理论洞察,并为在数字硬件及更广泛领域将基于PC的学习扩展到更深架构奠定了基础。

英文摘要

Predictive Coding (PC) offers a brain-inspired alternative to backpropagation for neural network training, described as a physical system minimizing its internal energy. However, in practice, PC is predominantly digitally simulated, requiring excessive amounts of compute while struggling to scale to deeper architectures. This paper reformulates PC to overcome this hardware-algorithm mismatch. First, we uncover how the canonical state-based formulation of PC (sPC) is, by design, deeply inefficient in digital simulation, inevitably resulting in exponential signal decay that stalls the entire minimization process. Then, to overcome this fundamental limitation, we introduce error-based PC (ePC), a novel reparameterization of PC which does not suffer from signal decay. Though no longer biologically plausible, ePC numerically computes exact PC weight gradients and runs orders of magnitude faster than sPC. Experiments across multiple architectures and datasets demonstrate that ePC matches backpropagation's performance even for deeper models where sPC struggles. Besides practical improvements, our work provides theoretical insight into PC dynamics and establishes a foundation for scaling PC-based learning to deeper architectures on digital hardware and beyond.

2507.12612 2026-06-09 cs.LG cs.AI 版本更新

Learning Task Mixtures from Task Affinities: A Probabilistic Graphical Model for Supervised Fine-Tuning

从任务亲和性中学习任务混合:用于监督微调的概率图模型

Prateek Chanda, Saral Sureka, Parth Pratim Chatterjee, Krishnateja Killamsetty, Nikhil Shivakumar Nayak, Ganesh Ramakrishnan

发表机构 * IIT Bombay(印度理工学院孟买分校) IBM Research(IBM研究院) Red Hat AI Innovation(红帽AI创新) MIT-IBM Watson AI Lab(麻省理工-IBM沃森AI实验室)

AI总结 本文提出TaskPGM框架,通过基于能量的任务模型学习连续任务混合,利用互信息和行为散度来捕捉任务间的关系,从而在任务覆盖和冗余之间取得平衡,提升大语言模型的监督微调性能。

Comments 9, 8 tables, 7 figures

详情
AI中文摘要

大语言模型的监督微调性能在很大程度上取决于训练预算如何分配到异质任务集上。在实践中,通常使用简单的启发式方法(例如均匀或按规模比例采样)来固定混合,但这些方法忽略了任务之间的相互作用,可能损害迁移并浪费在冗余来源上的预算。我们引入TaskPGM,一种通过基于能量的任务模型学习连续任务混合的框架。任务形成马尔可夫随机场的节点:单变量势能捕捉单个任务的效用,而双变量势能使用从单任务微调模型的预测分布中计算的行为散度(如Jensen-Shannon散度和点互信息)来编码任务间的关系。优化此目标会产生在覆盖和冗余之间取得平衡的混合。我们证明,所得到的集合函数在预算约束下是弱子模的,这使得离散选择变体能够获得近似保证。在多个模型家族(LLaMA-7B,Qwen2-7B)和评估套件(BIG-Bench Hard)上,TaskPGM在标准混合策略之上取得改进,并提供了任务间关系的可解释结构。

英文摘要

Supervised fine-tuning performance for large language models depends strongly on how training budget is distributed across a heterogeneous set of tasks. In practice, mixtures are often fixed using simple heuristics (e.g., uniform or size-proportional sampling) that ignore task interactions, which can hurt transfer and waste budget on redundant sources. We introduce TaskPGM, a framework for learning continuous task mixtures via an energy-based model over tasks. Tasks form the nodes of a Markov random field: unary potentials capture per-task utility, and pairwise potentials encode inter-task relationships using behavioral divergences computed from predictive distributions of single-task fine-tuned models (e.g., Jensen--Shannon divergence and pointwise mutual information). Optimizing this objective yields mixtures that balance coverage against redundancy. We show that the resulting set function is weakly submodular under budget constraints, enabling approximation guarantees for discrete selection variants. Across multiple model families (LLaMA-7B, Qwen2-7B) and evaluation suites (BIG-Bench Hard), TaskPGM improves over standard mixing strategies and provides interpretable structure over task interactions.
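
The pairwise potentials above rely on divergences between predictive distributions. As a small illustration, this Python sketch computes the Jensen-Shannon divergence and a pairwise divergence matrix over tasks. The task names and probability vectors are made up for the example, and how TaskPGM turns such a matrix into potentials is not shown here.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log(2)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def pairwise_js(predictions):
    """Map each pair of task names to the JS divergence of their predictive distributions."""
    names = sorted(predictions)
    return {
        (a, b): js_divergence(predictions[a], predictions[b])
        for a in names for b in names
    }

# Hypothetical predictive distributions of three single-task models on one input.
example_predictions = {
    "math": [0.7, 0.2, 0.1],
    "code": [0.6, 0.3, 0.1],
    "chat": [0.1, 0.2, 0.7],
}
example_matrix = pairwise_js(example_predictions)
```

In this toy matrix the "math" and "code" models are close to each other and far from "chat", which is the kind of redundancy signal a mixture could exploit.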

2508.05950 2026-06-09 cs.CV cs.AI 版本更新

CLONE: A 3DGS-Based Closed-Loop Differentiable Optimization Framework for Single-Image Normal Estimation

CLONE: 基于3DGS的闭环可微优化框架用于单图像法线估计

Yanxing Liang, Yinghui Wang, Wei Li, Tao Yan, Jiaxing Shen

发表机构 * School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China(江南大学人工智能与计算机科学学院,中国无锡) School of Data Science, Lingnan University, Hong Kong, China(岭南大学数据科学学院,中国香港)

AI总结 提出CLONE框架,通过3D高斯泼溅参数化场景并利用协方差特征分解得到连续可微法线,结合可微光照模型和一步确定性扩散精化网络,在统一重投影目标下联合优化,实现无需真值法线监督的几何一致性单图像法线估计。

详情
AI中文摘要

我们提出CLONE,一个基于3DGS的闭环可微优化框架,用于单图像法线估计。核心思想是构建一个“图像-几何-图像”一致性循环,统一并联合约束两种范式的局限性:判别式方法依赖显式监督而缺乏跨域几何约束,生成式方法虽有强生成先验但缺乏稳定的可微优化路径。具体地,我们首先采用3D高斯泼溅显式参数化场景,并通过协方差特征分解导出连续可微的表面法线,为几何建模提供解析梯度路径。然后,我们引入一个带有可学习光调制核的可微光照模型,建立表面法线与图像辐射之间的连续映射,使重投影误差直接监督底层3D几何。此外,为补偿高斯表示在局部细节表达上的不足,我们设计了一个一步确定性扩散启发的精化网络,在保持端到端可微性的同时增强局部几何细节。引入跨域门控融合机制以协调全局几何一致性和局部细节重建。最后,所有组件在统一的重投影目标下联合优化,形成闭环且稳定的梯度传播路径。这使得无需真值法线监督即可有效约束多解空间并改善几何一致性。

英文摘要

We propose CLONE, a 3DGS-based Closed-Loop differentiable Optimization framework for single-image Normal Estimation. The core idea is to construct an "image-geometry-image" consistency loop that unifies and jointly constrains the limitations of both paradigms: the reliance on explicit supervision without cross-domain geometric constraints in discriminative methods, and the absence of stable differentiable optimization pathways in generative methods despite strong generative priors. Specifically, we first employ 3D Gaussian Splatting to explicitly parameterize the scene and derive continuous and differentiable surface normals via covariance eigen-decomposition, providing an analytical gradient pathway for geometric modeling. We then introduce a differentiable illumination model with a learnable light modulation kernel to establish a continuous mapping between surface normals and image radiance, enabling reprojection errors to directly supervise the underlying 3D geometry. Furthermore, to compensate for the limited local detail expressiveness of Gaussian representations, we design a one-step deterministic diffusion-inspired refinement network, which enhances local geometric details while preserving end-to-end differentiability. A cross-domain gating fusion mechanism is introduced to coordinate global geometric consistency and local detail reconstruction. Finally, all components are jointly optimized under a unified reprojection objective, forming a closed-loop and stable gradient propagation pathway. This enables effective constraint of the multi-solution space and improved geometric consistency without requiring ground-truth normal supervision.

2509.10534 2026-06-09 cs.LG cs.AI cs.CL 版本更新

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

解耦“什么”和“哪里”:极坐标位置嵌入

Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, Michael C. Mozer

发表机构 * DeepMind, London, UK(DeepMind,英国伦敦)

AI总结 提出极坐标位置嵌入(PoPE)以解耦Transformer注意力机制中的内容和位置,在诊断任务、序列建模和语言模型中优于RoPE,并展现零样本长度外推能力。

Comments ICML 2026 camera-ready version

详情
AI中文摘要

Transformer架构中的注意力机制根据内容(“什么”)和序列中的位置(“哪里”)将键匹配到查询。我们提出一项分析,表明在流行的RoPE旋转位置嵌入中,“什么”和“哪里”是纠缠的。这种纠缠会损害性能,特别是当决策需要在这两个因素上独立匹配时。我们提出对RoPE的改进,称为极坐标位置嵌入(PoPE),它消除了“什么-哪里”的混淆。PoPE在仅通过位置或内容进行索引的诊断任务上表现远优于基线。在音乐、基因组和自然语言领域的自回归序列建模中,使用PoPE作为位置编码方案的Transformer在评估损失(困惑度)和下游任务性能上优于使用RoPE的基线。在语言建模中,这些优势在模型规模从124M到774M参数时持续存在。关键的是,与RoPE甚至专为外推设计的方法YaRN(需要额外微调和频率插值)相比,PoPE展现出强大的零样本长度外推能力。

英文摘要

The attention mechanism in a Transformer architecture matches key to query based on both content -- the what -- and position in a sequence -- the where. We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. This entanglement can impair performance particularly when decisions require independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings or PoPE, that eliminates the what-where confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to evaluation loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities compared not only to RoPE but even a method designed for extrapolation, YaRN, which requires additional fine tuning and frequency interpolation.
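
To make the "what" versus "where" discussion concrete, the sketch below implements standard RoPE on a single two-dimensional feature pair, written as a complex number. It shows only the baseline RoPE score, not PoPE, whose exact form is not given in the abstract. Reading the result as "content phase and positional phase add inside one cosine" is one possible interpretation of the entanglement, offered here as an assumption.

```python
import cmath
import math

def rope_score(q, k, m, n, theta):
    """Attention logit for one RoPE feature pair.

    q, k  : complex numbers encoding a 2D query/key feature pair
    m, n  : integer positions of the query and the key
    theta : rotation frequency of this feature pair
    """
    q_rot = q * cmath.exp(1j * m * theta)
    k_rot = k * cmath.exp(1j * n * theta)
    # Real part of q_rot * conj(k_rot) equals the 2D dot product.
    return (q_rot * k_rot.conjugate()).real

def rope_score_closed_form(q, k, m, n, theta):
    """Same score, written as |q||k| cos(content phase + positional phase)."""
    content_phase = cmath.phase(q) - cmath.phase(k)
    position_phase = (m - n) * theta
    return abs(q) * abs(k) * math.cos(content_phase + position_phase)
```

The score depends on position only through the offset `m - n`, and that offset enters the same cosine as the phase difference between `q` and `k`.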

2510.03244 2026-06-09 cs.LG cs.AI cs.CV 版本更新

VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

VFEM: 视觉特征赋能的多变量时间序列预测与跨模态融合

Yanlong Wang, Hang Yu, Jian Xu, Fei Ma, Hongkang Zhang, Tongtong Feng, Zijian Zhang, Shao-Lun Huang, Danny Dongning Sun, Xiao-Ping Zhang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) Pengcheng Laboratory(鹏城实验室) Ant Group(蚂蚁集团) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)(广东人工智能与数字经济实验室(深圳)) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出VFEM模型,利用预训练大视觉模型通过跨模态注意力融合视觉与时间特征,仅训练7.45%参数即可捕捉跨变量依赖,提升多变量时间序列预测性能。

详情
AI中文摘要

大型时间序列基础模型通常采用通道独立架构来处理不同的数据维度,但这种设计忽略了关键的跨通道依赖关系。同时,现有的跨模态方法主要依赖文本模态,使得视觉模型的空间模式识别能力在时间序列分析中未被充分探索。为了解决这些局限性,我们提出了VFEM,一种利用预训练大视觉模型(LVM)捕获复杂跨变量模式的跨模态预测模型。VFEM将多变量时间序列转换为视觉表示,使LVM能够感知通道独立模型未显式建模的空间关系。通过双分支架构,视觉和时间特征被独立提取,然后通过跨模态注意力融合,使两种模态的互补信息增强预测。通过冻结LVM并仅训练总参数的7.45%,VFEM在多个基准上取得了竞争性能,为多变量时间序列预测提供了新视角。

英文摘要

Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.

2510.09783 2026-06-09 cs.LG cs.AI stat.ML 版本更新

Large Language Models for Imbalanced Classification: Diversity makes the difference

大语言模型用于不平衡分类:多样性至关重要

Dang Nguyen, Sunil Gupta, Kien Do, Thin Nguyen, Taylor Braund, Alexis Whitton, Svetha Venkatesh

发表机构 * Applied Artificial Intelligence Initiative (A2I2)(应用人工智能倡议(A2I2)) Deakin University(迪肯大学) Black Dog Institute(黑狗研究所) University of New South Wales(新南威尔士大学)

AI总结 提出基于大语言模型的过采样方法,通过条件采样、排列微调和插值样本增强多样性,在10个表格数据集上优于8个基线方法。

详情
AI中文摘要

过采样是解决不平衡分类最广泛使用的方法之一。其核心思想是生成额外的少数类样本以重新平衡数据集。大多数现有方法(如SMOTE)需要将分类变量转换为数值向量,这通常会导致信息损失。最近,基于大语言模型(LLM)的方法被引入以克服这一限制。然而,当前的LLM方法通常生成多样性有限的少数类样本,降低了下游分类任务的鲁棒性和泛化能力。为了解决这一问题,我们提出了一种新的基于LLM的过采样方法,旨在增强多样性。首先,我们引入了一种采样策略,以少数类标签和特征为条件生成合成样本。其次,我们开发了一种新的排列策略来微调预训练的LLM。第三,我们不仅在少数类样本上微调LLM,还在插值样本上微调以进一步丰富变异性。在10个表格数据集上的大量实验表明,我们的方法显著优于八个SOTA基线。生成的合成样本既真实又多样。此外,我们通过基于熵的视角提供了理论分析,证明了我们的方法鼓励生成样本的多样性。

英文摘要

Oversampling is one of the most widely used approaches for addressing imbalanced classification. The core idea is to generate additional minority samples to rebalance the dataset. Most existing methods, such as SMOTE, require converting categorical variables into numerical vectors, which often leads to information loss. Recently, large language model (LLM)-based methods have been introduced to overcome this limitation. However, current LLM-based approaches typically generate minority samples with limited diversity, reducing robustness and generalizability in downstream classification tasks. To address this gap, we propose a novel LLM-based oversampling method designed to enhance diversity. First, we introduce a sampling strategy that conditions synthetic sample generation on both minority labels and features. Second, we develop a new permutation strategy for fine-tuning pre-trained LLMs. Third, we fine-tune the LLM not only on minority samples but also on interpolated samples to further enrich variability. Extensive experiments on 10 tabular datasets demonstrate that our method significantly outperforms eight SOTA baselines. The generated synthetic samples are both realistic and diverse. Moreover, we provide theoretical analysis through an entropy-based perspective, proving that our method encourages diversity in the generated samples.
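
For context on the SMOTE baseline mentioned above, this is a minimal Python sketch of SMOTE-style interpolation between a minority sample and one of its nearest minority neighbors. It handles numeric features only, which reflects why categorical variables must first be converted to numbers. The function name and the toy data are hypothetical, and this is not the LLM-based method proposed in the paper.

```python
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic minority point on the segment between two real ones.

    minority : list of numeric feature vectors (at least two points)
    k        : number of nearest minority neighbors to choose from
    rng      : random.Random instance; a fixed seed is used if omitted
    """
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Sort remaining minority points by squared Euclidean distance to the base point.
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbor = rng.choice(others[:k])
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbor)]
```

Because every synthetic point lies on a segment between existing points, the generated samples stay inside the convex hull of the minority class, which limits how much new variability interpolation alone can add.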

2510.22450 2026-06-09 cs.LG cs.AI 版本更新

SmartMixed: A Two-Phase Training Strategy for Adaptive Activation Function Learning in Neural Networks

SmartMixed:一种用于神经网络自适应激活函数学习的两阶段训练策略

Amin Omidvar

发表机构 * Independent Researcher(独立研究者) Toronto, Ontario, Canada(加拿大安大略省多伦多)

AI总结 提出SmartMixed两阶段训练策略,通过可微硬混合机制让神经元自适应选择激活函数,第二阶段固定选择以保持推理效率,在MNIST上验证了不同层神经元的激活函数偏好。

详情
AI中文摘要

激活函数的选择在神经网络中起着关键作用,但大多数架构仍然依赖于所有神经元上固定的、统一的激活函数。我们引入了SmartMixed,一种新颖的两阶段训练策略,允许网络学习每个神经元的最优激活函数,同时在推理时保持计算效率。在第一阶段,神经元使用可微硬混合机制从候选激活函数池(ReLU、Sigmoid、Tanh、Leaky_ReLU、ELU、SELU)中自适应选择。在第二阶段,每个神经元的激活函数根据学习到的选择固定下来,从而得到一个计算高效的网络,支持使用优化的向量化操作继续训练。我们在MNIST数据集上使用不同架构的前馈神经网络评估了SmartMixed。我们的分析表明,不同层的神经元对激活函数表现出不同的偏好,揭示了神经架构内的功能多样性。我们还证明了SmartMixed通过允许神经元选择其偏好的激活函数有效地训练网络,与使用单一固定最先进激活函数的模型相竞争。

英文摘要

The choice of activation function plays a critical role in neural networks, yet most architectures still rely on fixed, uniform activation functions across all neurons. We introduce SmartMixed, a novel two-phase training strategy that allows networks to learn optimal per-neuron activation functions while preserving computational efficiency at inference. In the first phase, neurons adaptively select from a pool of candidate activation functions (ReLU, Sigmoid, Tanh, Leaky_ReLU, ELU, SELU) using a differentiable hard mixture mechanism. In the second phase, each neuron's activation function is fixed according to the learned selection, resulting in a computationally efficient network that supports continued training with optimized vectorized operations. We evaluate SmartMixed on the MNIST dataset using feedforward neural networks of different architectures. Our analysis reveals that neurons in different layers exhibit distinct preferences for activation functions, providing insights into the functional diversity within neural architectures. We also demonstrate that SmartMixed effectively trains the network by allowing neurons to select their preferred activation functions, and that it is competitive with models using a single fixed state-of-the-art activation function.

2511.07046 2026-06-09 cs.LG cs.AI 版本更新

Learning Quantized Continuous Controllers for Integer Hardware

面向整数硬件的量化连续控制器学习

Fabian Kresse, Christoph H. Lampert

发表机构 * Institute of Science and Technology Austria (ISTA)(奥地利科学与技术研究所)

AI总结 提出量化感知训练策略,自动选择低比特策略并综合到FPGA,在MuJoCo任务中以3或2比特权重和激活值实现与全精度相当的竞争力,并提升输入噪声鲁棒性。

Comments 18 pages, 6 figures

详情
AI中文摘要

在嵌入式硬件上部署连续控制强化学习策略需要满足严格的延迟和功耗预算。小型FPGA可以实现这些要求,但前提是避免昂贵的浮点流水线。我们研究了用于整数推理的策略的量化感知训练(QAT),并提出了一种学习到硬件的流水线,该流水线自动选择低比特策略并将其综合到Artix-7 FPGA上。在五个MuJoCo任务中,我们获得的策略网络与全精度(FP32)策略具有竞争力,但每个权重和每个内部激活值仅需3比特甚至2比特,前提是输入精度经过仔细选择。在目标硬件上,所选策略实现微秒级的推理延迟,每次动作消耗微焦耳能量,与量化参考相比具有优势。最后,我们观察到量化策略相比浮点基线具有更高的输入噪声鲁棒性。

英文摘要

Deploying continuous-control reinforcement learning policies on embedded hardware requires meeting tight latency and power budgets. Small FPGAs can deliver these, but only if costly floating-point pipelines are avoided. We study quantization-aware training (QAT) of policies for integer inference and we present a learning-to-hardware pipeline that automatically selects low-bit policies and synthesizes them to an Artix-7 FPGA. Across five MuJoCo tasks, we obtain policy networks that are competitive with full precision (FP32) policies but require as few as 3 or even only 2 bits per weight, and per internal activation value, as long as input precision is chosen carefully. On the target hardware, the selected policies achieve inference latencies on the order of microseconds and consume microjoules per action, favorably comparing to a quantized reference. Last, we observe that the quantized policies exhibit increased input noise robustness compared to the floating-point baseline.
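
To give a feel for what "2 or 3 bits per weight" means, here is a small Python sketch of symmetric uniform quantization to signed integers. It covers only the forward rounding step. The quantizer, the per-tensor scale, and the function names are assumptions for illustration; the paper's actual QAT scheme and FPGA pipeline are not reproduced.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-qmax, qmax] with a shared scale.

    Returns (integers, scale) so that integer * scale approximates the input.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1            # e.g. 1 for 2 bits, 3 for 3 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale

def dequantize(ints, scale):
    """Recover the approximate float values that integer hardware represents."""
    return [q * scale for q in ints]
```

With 2 bits this scheme leaves only the levels -1, 0 and 1, so the network must learn weights that survive such coarse rounding, which is what quantization-aware training is meant to achieve.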

2512.01930 2026-06-09 cs.LG cs.AI 版本更新

SVRG and Beyond via Posterior Correction

SVRG及其后验校正扩展

Nico Daheim, Thomas Möllenhoff, Ming Liang Ang, Mohammad Emtiyaz Khan

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文揭示SVRG与后验校正方法的深层联系,证明SVRG是各向同性高斯后验校正的特例,并通过灵活指数族后验自动导出牛顿型和Adam型新变体。

Comments ICML 2026 (oral)

详情
AI中文摘要

随机方差缩减梯度(SVRG)及其变体旨在通过使用梯度校正来加速训练。这些方法最初提出于十多年前,但从未在任何基本层面上与任何贝叶斯方法联系起来。在这里,我们填补了这一空白,并推导出SVRG与最近提出的称为“后验校正”的贝叶斯方法之间令人惊讶的新联系。我们的主要贡献是证明SVRG可以恢复为各向同性高斯后验校正的特例。通过使用更灵活的指数族后验,自动获得了SVRG的新扩展。我们通过使用高斯族推导了两个这样的新扩展:一种具有新颖海森校正的牛顿型变体,以及一种可扩展到大规模问题的Adam型扩展。我们的工作是首次将SVRG与贝叶斯联系起来,并利用它来加速训练。

英文摘要

Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed up training by using gradient corrections. Originally proposed over a decade ago, these methods have never been connected to any Bayesian method at a fundamental level. Here, we fill this gap and derive surprising new connections of SVRG to a recently proposed Bayesian method called "posterior correction". Our main contribution is to show that SVRG can be recovered as a special case of posterior correction over isotropic-Gaussian posteriors. Novel extensions of SVRG are automatically obtained by using more flexible exponential-family posteriors. We derive two new such extensions by using Gaussian families: a Newton-like variant with novel Hessian corrections, and an Adam-like extension that scales to large problems. Our work is the first to connect SVRG to Bayes and use it to speed up training.
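
For reference, the gradient correction at the heart of SVRG can be sketched in a few lines of Python on a tiny least-squares problem. This is the classical algorithm only; the posterior-correction view and the Newton-like and Adam-like variants from the paper are not implemented, and the step size, loop lengths, and data are arbitrary choices for the example.

```python
import random

def grad_i(w, a, b):
    """Gradient of the single-sample loss 0.5 * (a . w - b)^2."""
    residual = sum(aj * wj for aj, wj in zip(a, w)) - b
    return [residual * aj for aj in a]

def full_grad(w, A, B):
    """Average gradient over the whole dataset."""
    n = len(A)
    g = [0.0] * len(w)
    for a, b in zip(A, B):
        g = [gj + gij / n for gj, gij in zip(g, grad_i(w, a, b))]
    return g

def svrg(A, B, w0, step=0.1, epochs=30, inner=100, seed=0):
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(epochs):
        snapshot = list(w)                 # reference point for this epoch
        mu = full_grad(snapshot, A, B)     # exact gradient at the snapshot
        for _ in range(inner):
            i = rng.randrange(len(A))
            g_w = grad_i(w, A[i], B[i])
            g_s = grad_i(snapshot, A[i], B[i])
            # Corrected stochastic gradient: unbiased, with shrinking variance.
            w = [wj - step * (gw - gs + m)
                 for wj, gw, gs, m in zip(w, g_w, g_s, mu)]
    return w

# Consistent toy system whose exact solution is w = (2, -1).
A_example = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
B_example = [2.0, -1.0, 1.0, 3.0]
w_example = svrg(A_example, B_example, [0.0, 0.0])
```

The term `g_w - g_s + mu` is the correction: as `w` approaches the snapshot, the stochastic noise cancels and the update approaches a full-gradient step.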

2512.15116 2026-06-09 cs.LG cs.AI 版本更新

FADTI: Fourier and Attention Driven Diffusion for Multivariate Time Series Imputation

FADTI: 基于傅里叶和注意力驱动的多变量时间序列插补扩散模型

Runze Li, Hanchen Wang, Wenjie Zhang, Binghao Li, Yu Zhang, Xuemin Lin, Ying Zhang

发表机构 * Anonymous(匿名)

AI总结 提出FADTI扩散框架,通过可学习傅里叶偏置投影模块注入频域归纳偏置,结合自注意力与门控卷积进行时序建模,在多个基准上优于现有方法,尤其在高缺失率下表现突出。

Comments This work has been submitted to the IEEE for possible publication. 10 pages, 7 figures

详情
AI中文摘要

多变量时间序列插补是医疗保健、交通预测和生物建模等应用中的基础问题,其中传感器故障和不规则采样导致普遍存在的缺失值。然而,现有的基于Transformer和扩散的模型缺乏明确的归纳偏置和频率感知,限制了它们在结构化缺失模式和分布偏移下的泛化能力。我们提出FADTI,一个基于扩散的框架,通过可学习的傅里叶偏置投影(FBP)模块注入频率信息特征调制,并将其与通过自注意力和门控卷积进行的时间建模相结合。FBP支持多种谱基,能够自适应编码平稳和非平稳模式。这种设计将频域归纳偏置注入生成式插补过程。在多个基准(包括一个新引入的生物时间序列数据集)上的实验表明,FADTI持续优于最先进的方法,尤其是在高缺失率下。代码见 https://anonymous.4open.science/r/TimeSeriesImputation-52BF

英文摘要

Multivariate time series imputation is fundamental in applications such as healthcare, traffic forecasting, and biological modeling, where sensor failures and irregular sampling lead to pervasive missing values. However, existing Transformer- and diffusion-based models lack explicit inductive biases and frequency awareness, limiting their generalization under structured missing patterns and distribution shifts. We propose FADTI, a diffusion-based framework that injects frequency-informed feature modulation via a learnable Fourier Bias Projection (FBP) module and combines it with temporal modeling through self-attention and gated convolution. FBP supports multiple spectral bases, enabling adaptive encoding of both stationary and non-stationary patterns. This design injects frequency-domain inductive bias into the generative imputation process. Experiments on multiple benchmarks, including a newly introduced biological time series dataset, show that FADTI consistently outperforms state-of-the-art methods, particularly under high missing rates. Code is available at https://anonymous.4open.science/r/TimeSeriesImputation-52BF

2601.09085 2026-06-09 cs.LG cs.AI cs.CL cs.IR 版本更新

MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

MMR-GRPO:通过多样性感知奖励重加权加速GRPO风格训练

Kangda Wei, Ruihong Huang

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系)

AI总结 提出MMR-GRPO方法,利用最大边际相关性根据完成多样性重加权奖励,减少冗余样本,加速GRPO训练,在保持性能的同时平均减少47.9%训练步数和70.2%时间。

详情
AI中文摘要

组相对策略优化(GRPO)已成为训练数学推理模型的标准方法;然而,它对每个提示依赖多个完成,使得训练计算成本高昂。尽管最近的工作减少了达到峰值性能所需的训练步数,但由于每步成本增加,整体挂钟训练时间通常保持不变甚至增加。我们提出MMR-GRPO,它整合了最大边际相关性,基于完成多样性对奖励进行重加权。我们的关键洞察是,语义冗余的完成贡献有限的学习信号;优先考虑多样化解能产生更有信息量的更新并加速收敛。在三种模型规模(1.5B、7B、8B)、三种GRPO变体和五个数学推理基准上的广泛评估表明,MMR-GRPO在达到相当峰值性能的同时,平均需要减少47.9%的训练步数和70.2%的挂钟时间。这些增益在模型、方法和基准上一致。我们的代码发布在:https://github.com/WeiKangda/MMR-GRPO。

英文摘要

Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. Our code is released at: https://github.com/WeiKangda/MMR-GRPO.
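
Maximal Marginal Relevance itself is a classical greedy criterion, sketched below in Python: each step picks the item that best trades off relevance against similarity to items already chosen. Treating the reward as the relevance term, using cosine similarity over embeddings, and the default trade-off `lam` are assumptions for this illustration; the abstract does not specify how MMR-GRPO converts the resulting scores into reward weights.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def mmr_order(relevance, embeddings, lam=0.5):
    """Greedy MMR: return item indices from most to least marginally useful."""
    remaining = list(range(len(relevance)))
    selected = []
    while remaining:
        def score(i):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, with rewards `[1.0, 0.9, 0.5]` and embeddings `[[1, 0], [1, 0], [0, 1]]`, the second completion duplicates the first, so the lower-reward but different third completion is ranked ahead of it.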

2601.15165 2026-06-09 cs.CL cs.AI cs.LG 版本更新

The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models

灵活性陷阱:重新思考扩散语言模型中任意顺序的价值

Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang

发表机构 * LeapLab, Tsinghua University(清华大学Leap实验室) NLPLab, Tsinghua University(清华大学自然语言处理实验室) Tsinghua University(清华大学) Alibaba Group(阿里巴巴集团) BNRist, Tsinghua University(清华大学北京信息科学与技术国家研究中心)

AI总结 本文发现,尽管扩散语言模型(dLLMs)允许任意生成顺序,但这种灵活性可能限制其推理能力,通过采用标准的Group Relative Policy Optimization(GRPO)方法,即JustGRPO,在保持并行解码能力的同时提升了推理性能。

Comments Code and pre-trained models: https://github.com/LeapLabTHU/JustGRPO

详情
AI中文摘要

扩散大语言模型(dLLMs)打破了传统语言模型的严格左到右约束,使token生成可以按任意顺序进行。直观上,这种灵活性意味着解决方案空间严格超越了固定的自回归轨迹,理论上解锁了更强大的推理潜力。然而,在本文中,我们发现对于一般推理任务(例如数学和编程),任意顺序生成可能实际上会限制dLLMs的推理潜力。我们观察到dLLMs倾向于利用这种顺序灵活性来绕过关键探索的高不确定性token,这可能导致解决方案覆盖的过早崩溃。这一观察促使我们重新思考dLLMs的强化学习方法,其中大量的复杂性,如处理组合轨迹和不可计算的似然,通常致力于保持这种灵活性。我们证明,通过放弃任意顺序并应用标准的Group Relative Policy Optimization(GRPO)方法,即JustGRPO,可以有效地激发推理能力。我们的方法,JustGRPO,虽然简洁却出人意料地有效(例如在GSM8K上达到89.1%的准确率),同时完全保留了dLLMs的并行解码能力。项目页面:https://nzl-thu.github.io/the-flexibility-trap

英文摘要

Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential. However, in this paper, we find that for general reasoning tasks (e.g., mathematics and coding), arbitrary order generation may in fact limit the reasoning potential of dLLMs. We observe that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, which can lead to a premature collapse of solution coverage. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We show that effective reasoning can be elicited by simply forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
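
The "standard GRPO" referred to above scores each completion relative to the other completions sampled for the same prompt. A minimal Python sketch of that group-relative advantage follows. The use of the population standard deviation and the small `eps` are assumptions, since implementations differ on these details, and nothing specific to dLLMs or JustGRPO is shown.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled completions."""
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

When all completions in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which is why coverage of diverse solutions matters for this family of methods.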

2601.21149 2026-06-09 cs.LG cs.AI 版本更新

Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement

移动性嵌入的POI:从人类移动中学习场所身份与使用方式

Maria Despoina Siampou, Shushman Choudhury, Shang-Ling Hsu, Neha Arora, Cyrus Shahabi

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

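AI Summary Proposes ME-POIs, a framework that uses contrastive learning to combine large-scale human mobility data with language-model embeddings in order to learn how places are used. Across five map enrichment tasks, augmenting text embeddings with ME-POIs outperforms both text-only and mobility-only baselines.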

详情
AI中文摘要

近期地理空间基础模型的进展强调了学习真实世界位置(特别是人类活动集中的兴趣点POI)通用表示的重要性。然而,现有方法主要关注从静态文本元数据中提取的场所身份,或学习与轨迹上下文相关的表示,这些表示捕捉的是移动规律而非场所的实际使用方式(即POI的功能)。我们认为POI功能是通用POI表示中缺失但关键的信号。我们提出了移动性嵌入的POI(ME-POIs),这是一个框架,通过大规模人类移动数据增强从语言模型派生的POI嵌入,以学习基于真实世界使用的、以POI为中心且上下文无关的表示。ME-POIs将个体访问编码为时间上下文化的嵌入,并通过对比学习将其与可学习的POI表示对齐,以捕捉跨用户和时间的使用模式。为解决长尾稀疏性问题,我们提出了一种新机制,从附近频繁访问的POI跨多个空间尺度传播时间访问模式。我们在五个新提出的地图丰富任务上评估ME-POIs,测试其捕捉POI身份和功能的能力。在所有任务中,用ME-POIs增强文本嵌入始终优于纯文本和纯移动性基线。值得注意的是,仅使用移动数据训练的ME-POIs在某些任务上能超越纯文本模型,凸显了POI功能是准确且可泛化的POI表示的关键组成部分。

英文摘要

Recent progress in geospatial foundation models highlights the importance of learning general-purpose representations for real-world locations, particularly points-of-interest (POIs) where human activity concentrates. Existing approaches, however, focus primarily on place identity derived from static textual metadata, or learn representations tied to trajectory context, which capture movement regularities rather than how places are actually used (i.e., a POI's function). We argue that POI function is a missing but essential signal for general POI representations. We introduce Mobility-Embedded POIs (ME-POIs), a framework that augments POI embeddings derived from language models with large-scale human mobility data to learn POI-centric, context-independent representations grounded in real-world usage. ME-POIs encodes individual visits as temporally contextualized embeddings and aligns them with learnable POI representations via contrastive learning to capture usage patterns across users and time. To address long-tail sparsity, we propose a novel mechanism that propagates temporal visit patterns from nearby, frequently visited POIs across multiple spatial scales. We evaluate ME-POIs on five newly proposed map enrichment tasks, testing its ability to capture both the identity and function of POIs. Across all tasks, augmenting text-based embeddings with ME-POIs consistently outperforms both text-only and mobility-only baselines. Notably, ME-POIs trained on mobility data alone can surpass text-only models on certain tasks, highlighting that POI function is a critical component of accurate and generalizable POI representations.
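The abstract says visit embeddings are aligned with learnable POI representations via contrastive learning but does not state the loss. As a rough illustration only, the sketch below uses a one-directional InfoNCE loss, a common choice for this kind of alignment. The function names, the temperature value, and the toy 2-D embeddings are all hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


def info_nce(visit_embs, poi_embs, temperature=0.1):
    """Mean cross-entropy where visit i's positive is POI i.

    All other POIs in the batch act as negatives (an assumption of this
    sketch, not a detail confirmed by the abstract).
    """
    visits = [normalize(v) for v in visit_embs]
    pois = [normalize(p) for p in poi_embs]
    total = 0.0
    for i, v in enumerate(visits):
        logits = [dot(v, p) / temperature for p in pois]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(visits)


visits = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(visits, [[1.0, 0.1], [0.1, 1.0]])
shuffled = info_nce(visits, [[0.1, 1.0], [1.0, 0.1]])
print(aligned, shuffled)
```

The loss is lower when each visit embedding points toward its own POI than when the pairing is shuffled, which is the behaviour a contrastive alignment objective is meant to reward.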

2601.21522 2026-06-09 cs.LG cond-mat.dis-nn cs.AI stat.ML 版本更新

More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)

更高效利用预算:使用重置与丢弃(ReD)方法在固定预算下提升大型语言模型的推理性能

Sagi Meir, Tommer D. Keidar, Noam Levi, Shlomi Reuveni, Barak Hirshberg

发表机构 * School of Chemistry, Tel Aviv University(特拉维夫大学化学系) The Center for Physics and Chemistry of Living Systems, Tel Aviv University(特拉维夫大学生命系统物理与化学中心) School of Physics and Astronomy, Tel Aviv University(特拉维夫大学物理与天文学系) The Center for Computational Molecular and Materials Science, Tel Aviv University(特拉维夫大学计算分子与材料科学中心)

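AI Summary Addresses the diminishing returns of LLM inference at a fixed budget by proposing Reset-and-Discard (ReD), a query method that raises coverage through better allocation of attempts, and validates its cost savings on coding, math, and reasoning benchmarks.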

详情
AI中文摘要

大型语言模型(LLMs)在可验证任务上的性能通常通过 pass@k 衡量,即在 k 次尝试中至少正确回答一次的概率。在固定预算下,更合适的指标是 coverage@cost,即作为总尝试次数函数的平均唯一回答问题数量。我们连接这两个指标,并证明 pass@k 中经验观察到的幂律行为导致 coverage@cost 的次线性增长(收益递减)。为解决此问题,我们提出重置与丢弃(ReD),一种 LLMs 的查询方法,无论 pass@k 的形式如何,都能在给定预算下增加 coverage@cost。此外,给定 pass@k,我们可以定量预测使用 ReD 在总尝试次数上的节省。如果模型的 pass@k 不可用,ReD 可以推断其幂律指数。在三个 LLMs 上进行的编码(HumanEval)、数学(GSM8K)和推理(MMLU-Pro)基准测试表明,ReD 显著减少了达到期望覆盖率所需的尝试次数、令牌数和美元成本,同时提供了一种高效测量推理幂律的方法。ReD 的优势在非完美验证器下得以保持,并且优于测试的分配基线。

英文摘要

The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically observed power-law behavior in pass@k leads to a sublinear growth of the coverage@cost (diminishing returns). To solve this problem, we propose Reset-and-Discard (ReD), a query method for LLMs that increases coverage@cost for a given budget, regardless of the pass@k form. Moreover, given a pass@k, we can quantitatively predict the savings in the total number of attempts using ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs across coding (HumanEval), math (GSM8K), and reasoning (MMLU-Pro) benchmarks demonstrate that ReD substantially reduces the required attempts, tokens, and USD cost to reach a desired coverage, while also offering an efficient way to measure inference power-laws. ReD maintains its advantage under imperfect verifiers and outperforms the tested allocation baselines.
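The link between pass@k and diminishing returns can be seen in a small numeric sketch. It assumes independent attempts with a fixed per-question success rate and a uniform allocation of k attempts to every question; both are simplifying assumptions for illustration. The ReD procedure itself is not described in the abstract and is not implemented here.

```python
def pass_at_k(p, k):
    """Probability of at least one success in k independent attempts,
    given a per-attempt success rate p."""
    return 1.0 - (1.0 - p) ** k


def expected_coverage_uniform(success_rates, k):
    """Expected number of solved questions when every question receives
    exactly k attempts (a hypothetical baseline allocation)."""
    return sum(pass_at_k(p, k) for p in success_rates)


# Three toy questions: easy, medium, hard (made-up success rates).
rates = [0.9, 0.5, 0.05]
for k in (1, 2, 4, 8):
    total_attempts = k * len(rates)
    print(total_attempts, round(expected_coverage_uniform(rates, k), 3))
```

Each doubling of the budget buys less additional coverage, because most of the extra attempts go to questions that are already very likely solved.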

2601.21996 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

机械论数据归因:追踪可解释LLM单元的训练起源

Jianhui Chen, Yuzhang Luo, Liangming Pan

发表机构 * University of Science and Technology of China(中国科学技术大学)

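AI Summary Proposes Mechanistic Data Attribution (MDA), a framework that uses influence functions to trace interpretable units back to specific training samples. Causal validation shows that intervening on high-influence samples significantly modulates the emergence of interpretable heads; repetitive structural data acts as a mechanistic catalyst, and the results support the functional link between induction heads and in-context learning.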

Comments ICML2026 (Oral)

详情
AI中文摘要

尽管机械论可解释性已在LLM中识别出可解释电路,但它们在训练数据中的因果起源仍然难以捉摸。我们引入了机械论数据归因(MDA),这是一个可扩展的框架,利用影响函数将可解释单元追溯到特定训练样本。通过在Pythia系列模型上的广泛实验,我们因果验证了目标干预——移除或增加一小部分高影响样本——显著调节了可解释头的涌现,而随机干预则没有效果。我们的分析表明,重复的结构化数据(例如LaTeX、XML)充当了机械催化剂。此外,我们观察到针对归纳头形成的干预会引发模型上下文学习(ICL)能力的同步变化。这为关于归纳头与ICL之间功能联系的长期假设提供了直接的因果证据。最后,我们提出了一种机械论数据增强流水线,该流水线在不同模型规模上一致地加速电路收敛,为引导LLM的发展轨迹提供了一种原则性方法。

英文摘要

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.

2601.22736 2026-06-09 cs.LG cs.AI 版本更新

UA-DCM: Uncertainty-aware Causal Decision Making via Effect Bound Decomposition

UA-DCM: 基于效应界分解的不确定性感知因果决策

Md Musfiqur Rahman, Ziwei Jiang, Hilaf Hasson, Murat Kocaoglu

Affiliations * Electrical and Computer Engineering, Purdue University; Computer Science, Johns Hopkins University; Cohesity

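AI Summary Proposes a framework that splits the range of causal effect values into a part that more samples might eliminate and a part that, with high probability, they cannot, which tells practitioners whether collecting more samples can help identify the best action. Neural causal models approximate the decomposition in practice.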

详情
AI中文摘要

从观测数据中进行因果推断可以为决策场景中找到最佳行动提供有力证据,而无需进行昂贵的随机试验。由于未观测到的混杂因素,即使有无限数据,行动的因果效应也往往不是点可识别的。此外,仅有有限样本为因果效应估计增加了另一层不确定性。现有几种方法可用于获得因果效应的上下界,从符号方法到最近的基于神经网络的方法,这些方法隐式地结合了两种不确定性来源。然而,这些方法并未告知收集更多样本是否有助于从观测数据中识别最佳行动,使专家对其数据收集策略一无所知。我们通过一种新颖的框架解决了这个问题,该框架能够区分可能通过收集更多样本消除的因果效应值范围与那些高概率无法通过更多观测样本消除的值范围。我们证明这种划分可以通过求解最大-最小和最小-最大优化问题获得。我们利用神经因果模型在实践中近似恢复这种分解。通过在合成和真实世界数据集上的实验,我们证明了我们的算法可以确定何时收集更多样本无助于确定最佳行动。我们的框架可以帮助从业者决定何时应诉诸非观测研究或寻求测量一些未测量的混杂因素以进行最优决策。

英文摘要

Causal inference from observational data can provide strong evidence for finding the best action in a decision-making scenario without having to perform expensive randomized trials. The causal effect of an action is often not pointwise identifiable even with infinite data due to unobserved confounding factors. Furthermore, having only finitely many samples adds another layer of uncertainty to causal effect estimation. Several existing methods can be used to obtain upper and lower bounds to the causal effect, ranging from symbolic methods to the more recent neural network-based approaches, which implicitly incorporate both sources of uncertainty. However, these methods do not inform whether collecting more samples may or may not help identify the best action from observational data, leaving experts in the dark about their data collection strategies. We address this problem with a novel framework that can distinguish the range of causal effect values that might be eliminated by collecting more samples from the range of values that, with high probability, cannot be eliminated with more observational samples. We show that this partitioning can be obtained by solving max-min and min-max optimization problems. We leverage neural causal models to approximately recover this decomposition in practice. We demonstrate via experiments on synthetic and real-world datasets that our algorithm can determine when collecting more samples will not help determine the best action. Our framework can help practitioners decide when to resort to non-observational studies or seek to measure some of the unmeasured confounders for optimal decision-making.
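To illustrate why a causal effect may not be pointwise identifiable even with infinite data, the sketch below computes the classical assumption-free (Manski-style) bounds for a binary outcome from known observational probabilities. This is a textbook construction used here as background, not the paper's decomposition; the function name and the toy probabilities are hypothetical.

```python
def natural_bounds(p_y1_and_x, p_x):
    """Bounds on P(Y=1 | do(X=x)) for binary Y under unobserved confounding.

    p_y1_and_x: observational P(Y=1, X=x)
    p_x:        observational P(X=x)
    Units that did not take action x could have had any outcome under it,
    which adds up to P(X != x) of unknown probability mass.
    """
    lower = p_y1_and_x
    upper = p_y1_and_x + (1.0 - p_x)
    return lower, upper


# Toy observational distribution (made-up numbers).
lo1, hi1 = natural_bounds(p_y1_and_x=0.45, p_x=0.6)  # action X=1
lo0, hi0 = natural_bounds(p_y1_and_x=0.10, p_x=0.4)  # action X=0

# If the intervals overlap, the data alone cannot rank the two actions.
intervals_overlap = lo1 <= hi0 and lo0 <= hi1
print((lo1, hi1), (lo0, hi0), intervals_overlap)
```

Because these bounds are computed from the true observational distribution, their width reflects uncertainty that no number of additional observational samples can remove; finite-sample uncertainty would sit on top of it.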

2602.04402 2026-06-09 stat.ML cs.AI cs.CY cs.LG math.ST stat.TH 版本更新

Performative Learning Theory

表现性学习理论

Julian Rodemann, Unai Fischer-Abaigar, James Bailie, Krikamol Muandet

发表机构 * University of Cambridge(剑桥大学)

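AI Summary Embeds performative prediction into statistical learning theory, proves generalization bounds under performative effects on the sample and on the population, shows a trade-off in which the more a model affects data the less it can learn from it, and suggests that retraining can improve generalization guarantees.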

Comments ICML 2026. v2: corrected typo in author list; v3: added explanation of condition 3.2, modified condition 3.3 and fixed lemma 3.4, added examples and explanations in sections 2, 5, and 6

详情
AI中文摘要

表现性预测会影响它们试图预测的结果。我们研究影响样本(例如,仅限现有应用用户)和/或整个总体(例如,所有潜在应用用户)的表现性预测。这引发了模型在表现性下泛化能力的问题。例如,当现有用户和新用户都对应用的预测做出反应时,我们基于现有用户对新用户能得出多好的见解?我们通过将表现性预测嵌入统计学习理论来解决这个问题。我们证明了在样本、总体以及两者共同影响下的泛化界。我们证明背后的一个关键直觉是,在最坏情况下,总体否定预测,而样本欺骗性地实现预测。我们分别将这种自我否定和自我实现的预测表述为Wasserstein空间中的最小-最大和最小-最小风险泛函。我们的分析揭示了表现性地改变世界与从中学习之间的基本权衡:模型对数据的影响越大,它能从数据中学到的就越少。此外,我们的分析得出一个令人惊讶的见解:通过对表现性扭曲的样本进行再训练,可以改善泛化保证。我们通过一个案例研究说明了我们的界,该案例涉及基于预测的德国失业居民工作培训分配,利用了德国1975年至2017年的行政劳动力市场记录。

英文摘要

Performative predictions influence the very outcomes they aim to forecast. We study performative predictions that affect a sample (e.g., only existing users of an app) and/or the whole population (e.g., all potential app users). This raises the question of how well models generalize under performativity. For example, how well can we draw insights about new app users based on existing users when both of them react to the app's predictions? We address this question by embedding performative predictions into statistical learning theory. We prove generalization bounds under performative effects on the sample, on the population, and on both. A key intuition behind our proofs is that in the worst case, the population negates predictions, while the sample deceptively fulfills them. We cast such self-negating and self-fulfilling predictions as min-max and min-min risk functionals in Wasserstein space, respectively. Our analysis reveals a fundamental trade-off between performatively changing the world and learning from it: the more a model affects data, the less it can learn from it. Moreover, our analysis results in a surprising insight on how to improve generalization guarantees by retraining on performatively distorted samples. We illustrate our bounds in a case study on prediction-informed assignments of unemployed German residents to job trainings, drawing upon administrative labor market records from 1975 to 2017 in Germany.

2602.05774 2026-06-09 cs.LG cs.AI math.PR 版本更新

Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance

变分推测解码:从令牌似然到序列接受的草稿训练再思考

Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou

发表机构 * University of Science and Technology of China(中国科学技术大学)

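AI Summary Proposes Variational Speculative Decoding (VSD), which treats draft training as variational inference over latent proposals (draft paths) and maximizes the marginal probability of target-model acceptance, using a path-level utility and an Expectation-Maximization procedure. Reports up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec.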

详情
AI中文摘要

推测解码加速了(多模态)大语言模型的推理,但训练-解码之间存在不一致:现有方法优化单一贪婪轨迹,而解码涉及验证和排序多个采样草稿路径。我们提出变分推测解码(VSD),将草稿训练形式化为对潜在提议(草稿路径)的变分推断。VSD最大化目标模型接受的边际概率,得到一个ELBO,该ELBO促进高质量潜在提议,同时最小化与目标分布的散度。为提升质量并降低方差,我们引入路径级效用,并通过期望最大化过程进行优化。E步从经过oracle过滤的后验中抽取蒙特卡洛样本,M步使用自适应拒绝加权(ARW)和置信度感知正则化(CAR)最大化加权似然。理论分析证实VSD增加了期望接受长度和加速比。在LLM和MLLM上的大量实验表明,VSD相比EAGLE-3实现高达9.6%的加速,相比ViSpec实现7.9%的加速,显著提升了解码效率。

英文摘要

Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws Monte Carlo samples from an oracle-filtered posterior, while the M-step maximizes weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly improving decoding efficiency.
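For readers unfamiliar with the acceptance step that VSD optimizes for, the sketch below shows the standard single-chain speculative sampling rule: a token drawn from the draft distribution q is accepted with probability min(1, p/q) under the target distribution p, so the overall acceptance rate is the sum of min(p, q) over the vocabulary. The multi-path verification and the VSD training objective are not modelled; the function names, the toy distributions, and the assumption of an identical acceptance rate at every position are illustrative only.

```python
def acceptance_probability(target, draft):
    """Chance that one drafted token is accepted: sum_x min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(target, draft))


def expected_accepted_length(alpha, gamma):
    """Expected number of accepted draft tokens out of gamma drafted ones.

    Assumes every position is accepted independently with the same rate
    alpha and that verification stops at the first rejection.
    """
    return sum(alpha ** i for i in range(1, gamma + 1))


target = [0.7, 0.2, 0.1]  # toy target-model distribution over 3 tokens
draft = [0.5, 0.3, 0.2]   # toy draft-model distribution
alpha = acceptance_probability(target, draft)
print(alpha, expected_accepted_length(alpha, gamma=3))
```

Under these assumptions, a draft model whose distribution sits closer to the target raises `alpha`, which in turn lengthens the expected accepted prefix.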

2602.12107 2026-06-09 cs.LG cs.AI stat.ML 版本更新

On the Complexity of Offline Reinforcement Learning with $Q^\star$-Approximation and Partial Coverage

离线强化学习在 $Q^\star$ 近似与部分覆盖下的复杂性

Haolin Liu, Braham Snyder, Chen-Yu Wei

发表机构 * University of Virginia(弗吉尼亚大学)

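AI Summary Proves, via an information-theoretic lower bound, that $Q^\star$-realizability and Bellman completeness are not sufficient for sample-efficient offline reinforcement learning under partial coverage, and introduces a general decision-estimation framework that unifies and improves existing results.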

详情
AI中文摘要

我们研究了在 $Q^\star$ 近似和部分覆盖下的离线强化学习,这一设定激发了诸如保守 $Q$ 学习(CQL;Kumar et al., 2020)等实用算法,但理论上受到的关注有限。我们的工作受以下开放问题的启发:“在部分覆盖下,$Q^\star$ 可实现性和贝尔曼完备性是否足以实现样本高效的离线强化学习?”我们通过信息论下界给出了否定答案。为了识别在部分覆盖下实现样本高效离线强化学习的额外结构,我们引入了一个通用决策-估计框架,该框架受在线强化学习的无模型决策-估计系数(DEC;Foster et al., 2023b; Liu et al., 2025b)启发。我们的框架将离线强化学习的复杂性分解为决策复杂性和值估计误差,从而允许对这两个子问题进行模块化研究。我们的结果不仅统一了现有结果(Chen and Jiang, 2022; Uehara et al., 2023),而且进一步改进并推广了它们。在决策复杂性方面,我们的改进包括:在部分覆盖下软 $Q$ 学习的首个 $\epsilon^{-2}$ 样本复杂度界,改进了 Uehara 等人(2023)的 $\epsilon^{-4}$ 界;在 Chen 和 Jiang(2022)的值间隙设定中消除了对额外在线交互的需求;以及超越上述两种情况的新可学习设定。在值估计方面,我们提供了在部分覆盖下贝尔曼完备性作用的新刻画,以及一般低贝尔曼秩 MDP(Jiang et al., 2017; Du et al., 2021; Jin et al., 2021)离线可学习性的首个刻画。后者是一个经典的在线强化学习设定,除特殊情况外,在离线强化学习中尚未被探索。作为附带贡献,我们的技术给出了函数近似设定下 CQL 的首个分析。

英文摘要

We study offline reinforcement learning under $Q^\star$-approximation and partial coverage, a setting that motivates practical algorithms such as Conservative $Q$-Learning (CQL; Kumar et al., 2020) but has received limited theoretical attention. Our work is inspired by the following open question: "Are $Q^\star$-realizability and Bellman completeness sufficient for sample-efficient offline RL under partial coverage?" We answer in the negative via an information-theoretic lower bound. To identify additional structure that enables sample-efficient offline RL under partial coverage, we introduce a general decision-estimation framework, inspired by model-free decision-estimation coefficients (DEC) for online RL (Foster et al., 2023b; Liu et al., 2025b). Our framework decomposes offline RL complexity into decision complexity and value estimation error. This allows modular study of both sub-problems. Our result not only unifies existing results (Chen and Jiang, 2022; Uehara et al., 2023), but further improves and generalizes them. On the decision complexity side, our improvement includes: the first $\epsilon^{-2}$ sample complexity bound for soft $Q$-learning under partial coverage that improves Uehara et al.'s (2023) $\epsilon^{-4}$ bound, the removal of the need for additional online interaction in the value-gap setting of Chen and Jiang (2022), and new learnable settings beyond the above two cases. On the value estimation side, we provide a new characterization of the role of Bellman completeness under partial coverage, and the first characterization of offline learnability for general low-Bellman-rank MDPs (Jiang et al., 2017; Du et al., 2021; Jin et al., 2021). The latter is a canonical online RL setting that has remained unexplored in offline RL except for special cases. As a side contribution, our techniques give the first analysis of CQL in the function approximation setting.

2602.24181 2026-06-09 cs.CV cs.AI 版本更新

A Mixed Diet Makes DINO An Omnivorous Vision Encoder

混合饮食使DINO成为杂食视觉编码器

Rishabh Kabra, Maks Ovsjanikov, Drew A. Hudson, Ye Xia, Skanda Koppula, Andre Araujo, Joao Carreira, Niloy J. Mitra

Affiliations * Google DeepMind; University College London

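AI Summary Observing that pre-trained vision encoders such as DINOv2 align features poorly across visual modalities, the paper proposes the Omnivorous Vision Encoder, a post-training framework that learns a modality-agnostic feature space for robust cross-modal understanding.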

Comments CVPR 2026 Highlight

详情
AI中文摘要

预训练的视觉编码器(如DINOv2)在单模态任务上表现出色。然而,我们观察到它们的特征在不同视觉模态之间对齐不佳。例如,同一场景的RGB图像及其对应深度图的特征嵌入,其余弦相似度与两个随机不相关图像几乎相同。为了解决这个问题,我们提出了杂食视觉编码器,一种学习模态无关特征空间的后训练框架。我们通过双重目标微调编码器:首先,最大化同一场景不同模态之间的特征对齐;其次,一个蒸馏目标,将学习到的表示锚定到完全冻结的教师模型。由此产生的学生编码器通过为给定场景生成更一致的嵌入(无论输入模态是RGB、深度、分割等)而变得“杂食”。这种方法在保留原始基础模型的判别语义的同时,实现了鲁棒的跨模态理解。杂食模型权重可在以下网址获取:https://github.com/google-deepmind/representations4d

英文摘要

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their features are poorly aligned across different visual modalities. For instance, the feature embeddings of an RGB image and the corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a post-training framework that learns a modality-agnostic feature space. We fine-tune the encoder with a dual objective: first, to maximize the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to a fully frozen teacher. The resulting student encoder becomes "omnivorous" by producing more consistent embeddings for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model. Omnivorous model weights are available at https://github.com/google-deepmind/representations4d.

2603.05500 2026-06-09 cs.LG cs.AI cs.CL 版本更新

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

POET-X:通过扩展正交变换实现内存高效的LLM训练

Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu

发表机构 * University of Cambridge(剑桥大学)

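AI Summary POET-X reduces the compute and memory overhead of orthogonal equivalence transformations, giving efficient and stable LLM training and allowing billion-parameter models to be pretrained on a single H100 GPU.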

Comments ICML 2026 Oral (15 pages, 7 figures, project page: https://spherelab.ai/poetx/)

详情
AI中文摘要

高效且稳定的大型语言模型(LLM)训练仍然是现代机器学习系统的核心挑战。为解决这一挑战,提出了重新参数化正交等价训练(POET),这是一种保持谱的框架,通过正交等价变换优化每个权重矩阵。尽管POET提供了强大的训练稳定性,但其原始实现由于密集的矩阵乘法导致高内存消耗和计算开销。为克服这些限制,我们引入了POET-X,一种可扩展且内存高效的变体,通过显著降低的计算成本执行正交等价变换。POET-X在保持POET的一般化和稳定性优势的同时,实现了吞吐量和内存效率的显著提升。在我们的实验中,POET-X能够在单块Nvidia H100 GPU上预训练十亿参数的LLM,而标准优化器如AdamW在相同设置下会因内存不足而失败。

英文摘要

Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, whereas standard optimizers such as AdamW run out of memory under the same settings.

2603.13259 2026-06-09 cs.CL cs.AI 版本更新

How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing

Transformer 如何拒绝错误答案:事实约束处理的旋转动力学

Javier Marín

发表机构 * University of California, Berkeley(加州大学伯克利分校)

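AI Summary Shows that when a transformer processes a factual query, the hidden-state pathways for correct and incorrect continuations rotate apart in mid-depth layers, and that the late-layer asymmetry between them is not localized to any single layer.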

详情
AI中文摘要

当解码器-only Transformer 被强制处理事实性查询的匹配正确和错误单token延续时,两种路径在隐藏状态空间中以特定方式分离:从查询-only 表示出发的位移向量保持大致相等的幅度但方向旋转远离。角分离在中层增加,后期层解决不对称结果——在错误运行中,logit-lens 倾向远低于朴素先验,对应模型将错误token的概率约11.5倍于正确token。该双阶段模式——中层旋转分离后后期层不对称承诺——被描述为模型对外部看似拒绝错误延续的实证几何特征,但明确指出是观测描述而非因果解释。该模式在六个解码器-only Transformer 中一致,包括五个架构家族(1B到13B参数)。第七个模型(Qwen2 1.5B)在当前提取协议下显示平坦曲线,可能是tokenizer-fragmentation的artefact而非真实规模限制;是否存在临界出现阈值的问题仍悬而未决。单层激活拼接在任何层带均无法恢复正确token,意味着后期层不对称性并非局限于离散组件。总体而言,证据支持事实约束处理的分布式轨迹账户——几何结构在许多层中逐步累积出现,而非单一局部化回溯账户。

英文摘要

When a decoder-only transformer is forced to process matched correct and incorrect single-token continuations of a factual query, the two pathways through hidden-state space diverge in a specific way: displacement vectors from the query-only representation maintain approximately equal magnitude but rotate apart in direction. The angular separation grows through mid-depth, and late layers resolve the asymmetric outcome: a logit-lens preference that, in the incorrect run, falls far below the naive prior of equal probability, corresponding to the model assigning approximately 11.5 times more probability to the incorrect token than to the correct one. We characterize this two-phase pattern (rotational divergence in mid-depth followed by late-layer asymmetric commitment) as the empirical geometric signature of what looks externally like the model rejecting a wrong continuation, while remaining explicit that it is an observational characterization, not a causal account. The pattern is consistent across six decoder-only transformers including five architecture families from 1B to 13B parameters. A seventh model (Qwen2 1.5B) shows a flat profile under the present extraction protocol that is plausibly a tokenizer-fragmentation artefact rather than a real scale floor; the question of an emergence threshold is left open. Single-layer activation patching does not recover the correct token at any layer band, meaning the late-layer asymmetry is not localized to a discrete component under the protocol used. Taken together, the evidence is consistent with a distributed-by-trajectory account of factual constraint processing, in which geometric structure emerges cumulatively across many layers rather than from a single localized circuit, and inconsistent with the simplest single-layer localized-recall account.
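The geometric quantity described above (displacement vectors of similar magnitude that rotate apart) can be sketched in a few lines. The helper names and the toy 3-D hidden states are hypothetical; the paper's actual extraction protocol is not given in the abstract.

```python
import math


def displacement(hidden, hidden_query_only):
    """Displacement of a hidden state from the query-only representation."""
    return [a - b for a, b in zip(hidden, hidden_query_only)]


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos))


# Toy hidden states at one layer (made-up numbers).
h_query = [1.0, 1.0, 0.0]
h_correct = [2.0, 1.0, 0.0]
h_incorrect = [1.0, 2.0, 0.0]

d_correct = displacement(h_correct, h_query)
d_incorrect = displacement(h_incorrect, h_query)
print(norm(d_correct), norm(d_incorrect), angle_degrees(d_correct, d_incorrect))
```

Tracking the two norms and the angle layer by layer would give the kind of profile the abstract describes: similar magnitudes with an angle that grows through mid-depth.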

2603.22473 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Component Ablation for Efficient Hybrid Language Model Architectures: Performance, Resilience, and Compression Implications

组件消融用于高效混合语言模型架构:性能、鲁棒性和压缩影响

Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

发表机构 * Doctoral Program in Computer Science, University of Valencia(瓦伦西亚大学计算机科学博士项目)

AI总结 本文通过组件消融研究混合语言模型,发现注意力机制与替代序列处理路径对性能有显著影响,揭示了模型鲁棒性与压缩优化的关键因素。

Comments 25 pages, 7 figures, 6 tables; revised title, abstract, figures, and data/code repository URL

详情
AI中文摘要

混合语言模型结合softmax注意力与线性时间序列机制,如状态空间或线性注意力层,但各组件的功能贡献尚不明确。本文在两个子10亿参数的混合语言模型Qwen3.5-0.8B和Falcon-H1-0.5B上,通过基于似然的评估、下游基准、逐层干预、随机控制和表征级诊断研究组件消融。测试结果显示,移除注意力或替代序列处理路径会显著降低性能,表明两种组件类型均对模型行为有贡献。似然指标对线性注意力或状态空间路径特别敏感,而下游基准退化取决于任务和架构。逐层消融显示组件重要性位置依赖,最强效果集中在早期或中期网络组件而非整个深度。随机移除控制进一步显示混合架构与相同家族Transformer基线在结构扰动下退化不同。这些结果表明组件消融是理解混合语言模型架构的有效诊断方法。发现为高效模型设计、压缩、鲁棒性分析和部署决策提供了相关证据。

英文摘要

Hybrid language models combine softmax attention with linear-time sequence mechanisms such as state-space or linear-attention layers, but the functional contribution of each component type remains insufficiently characterized. We study component-level ablation in two sub-1B hybrid language models, Qwen3.5-0.8B and Falcon-H1-0.5B, using likelihood-based evaluation, downstream benchmarks, layer-wise interventions, random controls, and representation-level diagnostics. Across the tested models, removing either attention or the alternative sequence-processing pathway substantially degrades performance, indicating that both component types contribute to model behavior. Likelihood metrics are especially sensitive to the linear-attention or state-space pathway, while downstream benchmark degradation depends on task and architecture. Layer-wise ablations show that component importance is position-dependent, with the strongest effects concentrated in early or mid-network components rather than uniformly across depth. Random-removal controls further show that hybrid architectures and same-family Transformer baselines degrade differently under structural perturbation. These results suggest that component ablation is a useful diagnostic for understanding hybrid language model architectures. The findings provide evidence relevant to efficient model design, compression, robustness analysis, and deployment decisions in architectures that combine attention with alternative sequence-processing mechanisms.

2603.25157 2026-06-09 cs.LG cs.AI cs.CV stat.ML 版本更新

Vision Hopfield Memory Networks for Image Recognition

Vision Hopfield Memory Networks

Jianfeng Wang, Amine M'Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu, Ruizhi Wang, Michael Bumbar, Luca Pinchetti, Thomas Lukasiewicz

发表机构 * Department of Computer Science, University of Oxford(牛津大学计算机科学系) Faculty of Informatics, Vienna University of Technology(维也纳理工大学信息学院)

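AI Summary Proposes the Vision Hopfield Memory Network (V-HMN), a brain-inspired model that combines hierarchical memory mechanisms with iterative refinement updates to capture local and global dynamics in one framework, improving interpretability and data efficiency.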

详情
AI中文摘要

近年来,视觉和多模态基础模型,如Transformer家族和状态空间模型(如Mamba)在图像、文本等领域取得了显著进展。尽管这些架构在经验上取得了成功,但它们与人脑的计算原理仍有很大差距,通常需要大量的训练数据且可解释性有限。在本文中,我们提出了视觉Hopfield记忆网络(V-HMN),一种受大脑启发的基础模型,整合了分层记忆机制和迭代细化更新。具体而言,V-HMN包含局部Hopfield模块,提供图像块级别的关联记忆动态,全局Hopfield模块作为情境调节的事件记忆,以及受预测编码启发的细化规则用于迭代误差校正。通过将这些基于记忆的模块分层组织,V-HMN在一个统一的框架中捕捉了局部和全局动态。记忆检索揭示了输入与存储模式之间的关系,使决策更具可解释性,而存储模式的重用提高了数据效率。这种受大脑启发的设计因此在可解释性和数据效率方面超越了现有的自注意或状态空间方法。我们在公开的计算机视觉基准上进行了广泛的实验,V-HMN在与广泛采用的基础架构竞争的同时,提供了更好的可解释性、更高的数据效率和更强的生物合理性。这些发现突显了V-HMN作为下一代视觉基础模型的潜力,同时为文本和音频等领域的多模态基础模型提供了通用的蓝图,从而将受大脑启发的计算与大规模机器学习联系起来。

英文摘要

Recent vision backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress on image recognition. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. We propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired vision backbone that integrates hierarchical memory mechanisms across layers with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, providing a prototype-based form of interpretability, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances data efficiency and interpretability compared to existing self-attention- or state-space-based approaches. We conducted extensive experiments on public image classification benchmarks. V-HMN achieves strong performance on small- and medium-scale benchmarks, and remains competitive with widely adopted backbone architectures on ImageNet despite minimal architectural tuning. These findings highlight the potential of V-HMN as a memory-centric alternative to standard vision backbones, thereby bridging brain-inspired computation with modern machine learning.
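As background for the associative-memory retrieval mentioned above, the sketch below implements the widely used modern (continuous) Hopfield update, in which a query is replaced by a softmax-weighted combination of stored patterns. Whether V-HMN uses exactly this rule is an assumption; the function names, `beta`, and the toy patterns are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(query, patterns, beta=8.0, steps=1):
    """Modern Hopfield update: state <- sum_i softmax(beta * <x_i, state>)_i * x_i.

    Returns the retrieved state and the final retrieval weights, which show
    which stored pattern (prototype) the query was matched to.
    """
    state = list(query)
    weights = None
    for _ in range(steps):
        scores = [beta * sum(a * b for a, b in zip(p, state)) for p in patterns]
        weights = softmax(scores)
        state = [sum(w * p[d] for w, p in zip(weights, patterns))
                 for d in range(len(state))]
    return state, weights


stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy_query = [0.9, 0.2, 0.0]
retrieved, weights = hopfield_retrieve(noisy_query, stored)
print(retrieved, weights)
```

The retrieval weights are what makes this kind of memory inspectable: they indicate which stored pattern dominated the output for a given input.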

2603.25184 2026-06-09 cs.LG cs.AI 版本更新

Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model

在移动边缘训练:一种在线验证的提示选择方法用于大型推理模型的高效强化学习训练

Jiahao Wu, Ning Lu, Shengcai Liu, Kun Wang, Yanting Yang, Bailong Lin, Chen Jason Zhang, Li Qing, Ke Tang

Affiliations * Southern University of Science and Technology; The Hong Kong Polytechnic University; The Hong Kong University of Science and Technology; Nanyang Technological University; Rutgers University; The Hong Kong University of Science and Technology (Guangzhou)

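AI Summary Proposes HIVE, which selects prompts using historical reward trajectories and real-time prompt entropy, making RL training more rollout-efficient without sacrificing performance.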

详情
AI中文摘要

强化学习(RL)已成为在推理任务中训练大型语言模型(LLMs)的关键技术。尽管扩大 rollout 可以稳定训练并提高性能,但计算开销是一个关键问题。在像 GRPO 等算法中,每个提示多个 rollout 会带来极高的成本,因为大量提示提供微不足道的梯度,因此效用较低。为了解决这个问题,我们研究如何在 rollout 阶段之前选择高效用的提示。我们的实验分析揭示了样本效用是非均匀且动态变化的:最强的学习信号集中在「学习边缘」,即中等难度和高不确定性的交界处,随着训练进行而变化。受此启发,我们提出了 HIVE(基于历史和在线验证的提示选择),一种数据高效的 RL 框架。HIVE 利用历史奖励轨迹进行粗略选择,并利用提示熵作为实时代理来修剪效用过时的实例。通过在多个数学推理基准和模型上评估 HIVE,我们证明 HIVE 在不牺牲性能的情况下显著提高了 rollout 的效率。

英文摘要

Reinforcement learning (RL) has become essential for post-training large language models (LLMs) in reasoning tasks. While scaling rollouts can stabilize training and enhance performance, the computational overhead is a critical issue. In algorithms like GRPO, multiple rollouts per prompt incur prohibitive costs, as a large portion of prompts provide negligible gradients and are thus of low utility. To address this problem, we investigate how to select high-utility prompts before the rollout phase. Our experimental analysis reveals that sample utility is non-uniform and evolving: the strongest learning signals concentrate at the "learning edge", the intersection of intermediate difficulty and high uncertainty, which shifts as training proceeds. Motivated by this, we propose HIVE (History-Informed and online-VErified prompt selection), a dual-stage framework for data-efficient RL. HIVE utilizes historical reward trajectories for coarse selection and employs prompt entropy as a real-time proxy to prune instances with stale utility. By evaluating HIVE across multiple math reasoning benchmarks and models, we show that HIVE yields significant rollout efficiency without compromising performance.

2604.09967 2026-06-09 cs.LG cs.AI 版本更新

Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

Muon²:通过自适应二阶矩预条件提升Muon

Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Yequan Zhao, Yupeng Su, Zi Yang, Zheng Zhang

发表机构 * University of California at Santa Barbara(加州大学圣巴巴拉分校) University at Albany, SUNY(阿尔巴尼大学,SUNY)

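AI Summary Muon$^2$ adds Adam-style adaptive second-moment preconditioning to Muon, which speeds up convergence of the polar approximation and improves practical orthogonalization quality. It outperforms Muon in pre-training experiments up to 13B parameters.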

Comments Preprint, subject to update

详情
AI中文摘要

Muon已展现为一种有前途的优化器,用于大规模基础模型预训练,通过迭代正交化利用神经网络更新的矩阵结构。然而,Muon的正交化质量依赖于执行的牛顿-施卢茨(NS)迭代次数,这带来了效率挑战,因为其计算和通信成本非平凡。我们提出Muon²,作为Muon的扩展,通过在正交化前应用Adam风格的自适应二阶矩预条件来提高质量和效率。我们的关键见解是,Muon的核心挑战在于极化近似中的病态动量矩阵,其谱通过Muon²显著改善,从而更快收敛到实用的正交化。我们进一步通过方向对齐特性化了实际正交化质量,在此情况下,Muon²在每个极化步骤中均显著优于Muon。在GPT、LLaMA和专家混合预训练实验中,Muon²(及其内存高效变种Muon²-F)在参数规模达13B时,始终优于Muon及其变种,同时将NS迭代次数减少40%,并在达到相同损失时节省了多达四分之一的训练时间。

英文摘要

Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, the orthogonalization quality of Muon hinges on the number of Newton--Schulz (NS) iterations performed, which poses efficiency challenges due to its non-trivial computation and communication cost. We propose Muon$^2$, an extension of Muon, to improve both quality and efficiency by applying Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, whose spectrum is substantially improved by Muon$^2$, leading to faster convergence toward a practically sufficient orthogonalization. We further characterize the practical orthogonalization quality via directional alignment, under which Muon$^2$ demonstrates dramatic improvement over Muon at each polar step. Across GPT, LLaMA, and Mixture-of-Experts pre-training experiments up to 13B parameters, Muon$^2$ (and its memory-efficient variant Muon$^2$-F that preserves most of its benefits) consistently outperforms Muon and its variants while reducing NS iterations by 40%, and saves up to 1/4 of the training time over Muon when achieving the same loss.
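The dependence of orthogonalization quality on both the iteration count and the conditioning of the input can be seen with the classical cubic Newton-Schulz iteration, sketched below in plain Python. Muon is usually described as using a tuned higher-order polynomial instead, and the Muon$^2$ preconditioner is not modelled here; the function names and the toy matrices are hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def frobenius_norm(a):
    return sum(x * x for row in a for x in row) ** 0.5


def newton_schulz_orthogonalize(m, steps=8):
    """Approximate the orthogonal polar factor of m.

    Uses the classical cubic iteration X <- 1.5 X - 0.5 X X^T X after
    scaling m so that all singular values lie in (0, 1].
    """
    scale = frobenius_norm(m)
    x = [[v / scale for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


def orthogonality_error(x):
    """Largest absolute entry of X X^T - I."""
    g = matmul(x, transpose(x))
    return max(abs(g[i][j] - (1.0 if i == j else 0.0))
               for i in range(len(g)) for j in range(len(g)))


well_conditioned = [[3.0, 1.0], [0.0, 2.0]]
ill_conditioned = [[1.0, 0.0], [0.0, 0.001]]
print(orthogonality_error(newton_schulz_orthogonalize(well_conditioned, steps=5)))
print(orthogonality_error(newton_schulz_orthogonalize(ill_conditioned, steps=5)))
```

With the same number of steps, the ill-conditioned matrix ends up much further from orthogonal, which is consistent with the abstract's point that improving the spectrum before orthogonalization can reduce the iterations needed.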

2604.17324 2026-06-09 cs.LG cs.AI 版本更新

Capacity-Controlled Global Attention for Graph Transformers

具有容量控制的全局注意力用于图变换器

Yang Liu, Dongxin Guo, Tom Zheng, Siu Ming Yiu, Liam Ning, Jikun Wu

发表机构 * Brain Investing Limited The University of Hong Kong(香港大学) Stellaris AI Limited

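AI summary The paper proposes SigGate-GT, which adds a learned sigmoid gate to the attention output of a graph transformer to relax the conservation constraint of global softmax attention, addressing over-smoothing, the low-rank bottleneck, and training instability, and improving results on several benchmarks.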

Comments 13 pages, 2 figures, 15 tables

详情
AI中文摘要

全局自注意力推动了现代图变换器,但其核心的softmax操作引入了一个很少被直接考察的结构约束:每个注意力行非负且和为一,因此每个头的输出是值向量的质量守恒凸组合。一个节点永远无法“不关注任何东西”。我们认为这种守恒约束是三种通常被孤立研究的病理的共同根源:节点表示随深度加深而坍缩(过平滑)、每个头输出的低秩瓶颈,以及深层堆叠中的脆弱优化。借鉴sigmoid门控在语言模型中消除类似注意力汇聚点(attention sinks)的方式,我们引入SigGate-GT,一种在GraphGPS框架内对注意力输出施加可学习、按头、输入条件化sigmoid门的图变换器。该门是一种平滑的、按维度的“音量控制”,可将头输出驱动至零,在不放弃注意力概率解释的前提下放松该约束。通过理论分析和合成实验,我们证明该门严格提高每个头输出的稳定秩,并将这一秩增益与上述三种表现联系起来。在五个分子和长距离基准上,SigGate-GT在ZINC上追平此前最佳结果(0.059 MAE),在ogbg-molhiv上取得我们所评估的图变换器基线中的最强结果(82.47% ROC-AUC),在ogbg-molpcba和长距离图基准(Long-Range Graph Benchmark)上具有竞争力,并在全部五个数据集上相对GraphGPS取得统计显著的提升(p < 0.05)。机制分析证实了上述诊断:门控减缓了过平滑(在4-16层上表示多样性平均相对提升30%),防止注意力熵坍缩,并在10倍学习率范围内稳定训练,在OGB上参数开销约为1%,实际运行时间开销低于3%。

英文摘要

Global self-attention drives modern graph transformers, yet the softmax at its core imposes a structural constraint rarely examined directly: every attention row is non-negative and sums to one, so each per-head output is a mass-conserving convex combination of value vectors. A node can never "attend to nothing." We argue this conservation constraint is a single root cause behind three pathologies usually studied in isolation: the collapse of node representations with depth (over-smoothing), a low-rank bottleneck on per-head outputs, and brittle optimization in deep stacks. Drawing on how sigmoid gating removes analogous attention sinks in language models, we introduce SigGate-GT, a graph transformer that applies a learned, per-head, input-conditioned sigmoid gate to the attention output inside the GraphGPS framework. The gate is a smooth, per-dimension "volume control" that can drive head outputs toward zero, relaxing the constraint without abandoning attention's probabilistic interpretation. Analytically and through synthetic experiments, we show the gate strictly increases the stable rank of per-head outputs, and connect this rank gain to all three manifestations. On five molecular and long-range benchmarks, SigGate-GT matches the prior best on ZINC (0.059 MAE), records the strongest result among the graph-transformer baselines we evaluate on ogbg-molhiv (82.47% ROC-AUC), and is competitive on ogbg-molpcba and the Long-Range Graph Benchmark, with statistically significant gains over GraphGPS on all five datasets (p < 0.05). Mechanism analyses confirm the diagnosis: gating slows over-smoothing (a 30% mean relative gain in representation diversity across 4-16 layers), keeps attention entropy from collapsing, and stabilizes training across a 10x learning-rate range, at about 1% parameter overhead on OGB and under 3% wall-clock cost.
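
The conservation constraint and the gate described above can be shown for a single query. This is a minimal sketch, not the SigGate-GT implementation: `attend` and `gated_attend` are hypothetical names, and the gate logits are passed in directly, whereas in the paper they would be predicted from the node's input.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def attend(scores, values):
    """Softmax attention for one query: a convex combination of the value vectors."""
    w = softmax(scores)
    dim = len(values[0])
    return w, [sum(wi * v[d] for wi, v in zip(w, values)) for d in range(dim)]

def gated_attend(scores, values, gate_logits):
    """Scale each output dimension by a sigmoid gate in (0, 1)."""
    _, out = attend(scores, values)
    return [sigmoid(g) * o for g, o in zip(gate_logits, out)]

values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, plain = attend([0.1, 0.2, 0.3], values)

# Strongly negative gate logits let the head "attend to nothing".
closed = gated_attend([0.1, 0.2, 0.3], values, [-20.0, -20.0])
```

Without the gate, the output is always trapped between the smallest and largest value in each dimension; the gate is what allows it to shrink toward zero.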

2604.26985 2026-06-09 cs.LG cs.AI 版本更新

Simple Self-Conditioning Adaptation for Masked Diffusion Models

简单自条件适应用于掩码扩散模型

Michael Cardei, Huu Binh Ta, Ferdinando Fioretto

发表机构 * University of Virginia(弗吉尼亚大学)

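AI summary The paper proposes a simple, effective post-training adaptation that conditions a masked diffusion model on its own previous predictions (self-conditioning), lowering generative perplexity and improving image synthesis and molecule generation quality.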

详情
AI中文摘要

掩码扩散模型(MDMs)通过迭代去噪在吸收掩码过程中生成离散序列。在标准掩码扩散中,如果一个token在反向更新后仍被掩码,模型会丢弃该位置的干净状态预测。因此,仍被掩码的位置必须反复从掩码token本身推断。这种设计限制了跨步骤的细化。为解决这一限制,本文提出了一种简单但有效的后训练适应方法,使每个去噪步骤都基于模型自身之前的干净状态预测。所提出的方法称为自条件掩码扩散模型(SCMDM),需要最小的架构更改,不引入递归的潜在状态路径,不依赖辅助参考模型,并在采样过程中不增加额外的去噪器评估。这与部分自条件方法形成重要区别,后者需要昂贵的从头模型训练。特别是,本文表明,在后训练阶段,部分自条件,包括用于从头训练自条件模型的常用50% dropout策略,是次优的。相反,一旦模型自生成的干净状态估计变得有信息,专业化于细化优于混合条件和无条件目标。SCMDM在多个领域进行了评估,显示出对普通MDM基线的一致改进,实现了在OWT训练模型上的生成困惑度几乎减少50%(从42.89到23.72),同时在离散图像合成质量、小分子生成和基因组分布建模的保真度方面也取得了显著改进。

英文摘要

Masked diffusion models (MDMs) generate discrete sequences by iterative denoising under an absorbing masking process. In standard masked diffusion, if a token remains masked after a reverse update, the model discards its clean-state prediction for that position. Thus, still-masked positions must be repeatedly inferred from the mask token alone. This design choice limits cross-step refinement. To address this limitation, this paper proposes a simple, yet effective, post-training adaptation for MDMs that conditions each denoising step on the model's own previous clean-state predictions. The resulting method, called Self-Conditioned Masked Diffusion Models (SCMDM), requires minimal architectural change, does not introduce a recurrent latent-state pathway, does not rely on an auxiliary reference model, and adds no extra denoiser evaluations during sampling. This is an important departure from partial self-conditioning approaches, which require expensive model training from scratch. In particular, the paper shows that partial self-conditioning, including the commonly used 50% dropout strategy for training self-conditioned models from scratch, is suboptimal in the post-training regime. Instead, once the model's self-generated clean-state estimates become informative, specializing in refinement is preferable to mixing conditional and unconditional objectives. SCMDM is evaluated across multiple domains and shows consistent improvement over vanilla MDM baselines, achieving nearly a 50% reduction in generative perplexity on OWT-trained models (42.89 to 23.72), alongside strong improvements in discretized image synthesis quality, small-molecule generation, and fidelity in genomic distribution modeling.

2605.01616 2026-06-09 cs.LG cs.AI cs.CY cs.NI 版本更新

Learning Behavioral Signals from Encrypted Smartphone Network Traffic

从加密智能手机网络流量中学习行为信号

Rameen Mahmood, Omar El Shahawy, Souptik Barua, Zachary Beattie, Jeffrey Kaye, Xuhai "Orson" Xu, Chao-Yi Wu, Danny Yuxing Huang

发表机构 * New York University(纽约大学) NYU Langone Health(NYU Langone健康) NYU Grossman School of Medicine(NYU Grossman医学院) Oregon Health & Science University(俄勒冈健康与科学大学) Columbia University(哥伦比亚大学) Harvard Medical School(哈佛医学院)

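AI summary The paper learns behavioral representations from encrypted network traffic with a Transformer-based model that uses user-specific adapters, then analyzes them with sparse representation learning and generalized estimating equations. It finds that stress is mainly tied to between-person differences, loneliness to within-person fluctuations, and sleep disturbance to a combination of both, and that the learned representations outperform conventional handcrafted features.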

Comments 19 pages, 6 figures

详情
AI中文摘要

人类行为难以在大规模下连续测量,然而日常活动和幸福感的痕迹可能反映在与个人设备的交互中。我们研究加密的智能手机网络流量是否可以作为被动感知信号,用于检测与睡眠障碍、压力和孤独感相关的行为状态。为了捕捉群体层面的模式和个体特定的行为,我们采用基于Transformer的模型,该模型带有用户特定的适配器,学习网络活动的表征,同时考虑个人基线及其偏差。为了提高可解释性,我们进一步使用稀疏表示学习分析这些表征,以识别与不同活动模式相关的潜在行为特征。我们使用带有Mundlak分解的广义估计方程将所得特征与睡眠障碍、压力和孤独感联系起来,从而能够区分稳定的个体间差异和随时间变化的个体内变化。我们的分析揭示了这三种结果具有不同的时间动态:压力主要与持续的个体间变异相关,孤独感与个体内波动更密切相关,而睡眠障碍则反映了两者的结合。重要的是,这些个体内行为信号无法通过传统的手工网络流量特征恢复,这突显了学习表征在纵向行为建模中的优势。总体而言,我们的发现表明加密网络流量包含可解释的行为信息,并能够支持被动、可扩展的行为动态监测,特别是相对于个体典型活动模式的变化。

英文摘要

Human behavior is challenging to measure continuously at scale, yet traces of daily routines and well-being may be reflected in interactions with personal devices. We investigate whether encrypted smartphone network traffic can serve as a passive sensing signal for behavioral states related to sleep disturbance, stress, and loneliness. To capture both population-level patterns and individual-specific behavior, we employ a transformer-based model with user-specific adapters that learns representations of network activity while accounting for personal baselines and deviations from them. To improve interpretability, we further analyze these representations using sparse representation learning to identify latent behavioral features associated with distinct activity patterns. We relate the resulting features to sleep disturbance, stress, and loneliness using generalized estimating equations with Mundlak decomposition, enabling separation of stable between-person differences from within-person changes over time. Our analysis reveals that the three outcomes are characterized by different temporal dynamics: stress is predominantly associated with persistent between-person variation, loneliness is more strongly linked to within-person fluctuations, and sleep disturbance reflects a combination of both. Importantly, these within-person behavioral signals are not recovered by conventional handcrafted network-traffic features, highlighting the advantages of learned representations for longitudinal behavioral modeling. Overall, our findings demonstrate that encrypted network traffic contains interpretable behavioral information and can support passive, scalable monitoring of behavioral dynamics, particularly changes relative to an individual's typical pattern of activity.
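
The Mundlak decomposition mentioned above splits each time-varying feature into a person-level mean and a deviation from that mean, so the two can enter a regression as separate terms. A minimal sketch, assuming a hypothetical dictionary of per-person feature values; the GEE fit itself is not shown, and `mundlak_split` is an illustrative name rather than anything from the paper.

```python
def mundlak_split(feature_by_person):
    """Return (person, time index, between-person mean, within-person deviation) rows."""
    rows = []
    for person, values in feature_by_person.items():
        mean = sum(values) / len(values)
        for t, x in enumerate(values):
            rows.append((person, t, mean, x - mean))
    return rows

# Made-up feature values for two participants observed on different numbers of days.
rows = mundlak_split({"p1": [1.0, 2.0, 3.0], "p2": [10.0, 10.0]})
within_p1 = [r[3] for r in rows if r[0] == "p1"]
```

A coefficient on the mean column would then describe stable differences between people, while a coefficient on the deviation column would describe day-to-day change within one person.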

2605.02950 2026-06-09 cs.LG cs.AI 版本更新

Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces

核仿射包机作为冻结语义空间的计算高效编码器

Mohit Kumar, Somayeh Kargaran, Bernhard A. Moser, Manuela Geiß

发表机构 * University of Rostock(罗斯托克大学) Software Competence Center Hagenberg GmbH(海根堡软件竞争力中心)

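AI summary The paper proposes the Kernel Affine Hull Machine (KAHM) as a lightweight query encoder: with the teacher representation space fixed, it replaces neural query encoding with posterior-weight estimation in an RKHS, giving compute-efficient semantic retrieval with strong performance.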

详情
AI中文摘要

基于Transformer的语义编码器在检索中很有效,但在许多部署中,重复出现的瓶颈是在线查询编码,而非离线语料库索引。本文研究,一旦强大的教师表示空间和语料库索引固定,是否可以用一个更轻量且解析明确的估计器来替代重复的神经查询编码。我们将固定教师的词汇到语义编码表述为一个条件均值估计问题,其中目标语义向量表示为由后验聚类概率加权的语义原型的噪声混合。使用核仿射包机(KAHM)几何,在显式识别的RKHS假设空间中,从廉价的词汇特征估计这些后验权重,并通过归一化最小均方更新从带噪声的教师嵌入中精炼语义原型。这产生了一个无反向传播的查询端编码器,以及一个端到端的误差分解,包括后验近似、有限样本/泛化和教师噪声项。我们在一个受控的奥地利法律检索基准上实例化该方法,该基准包含5000个测试查询、84个候选法律和10762个对齐的检索单元,使用特定于法律的编码器进入冻结的Mixedbread嵌入空间。在评估匹配的学习适配器中,KAHM在所有评估截断处实现了最强的教师空间重建和最佳的排名敏感检索性能。在k=20时,它获得了MRR@20=0.504、Hit@20=0.694和Top-1准确率=0.411,同时在报告的CPU设置中,相对于直接Transformer查询编码,在线每查询时间减少了8.53倍。结果支持KAHM作为监督固定表示部署场景中的计算高效编码器。

英文摘要

Transformer-based semantic encoders are effective for retrieval, but in many deployments the recurring bottleneck is online query encoding rather than offline corpus indexing. This paper studies whether, once a strong teacher representation space and corpus index are fixed, repeated neural query encoding can be replaced by a substantially lighter and analytically explicit estimator. We formulate fixed-teacher lexical-to-semantic encoding as a conditional-mean estimation problem in which the target semantic vector is represented as a noisy mixture of semantic prototypes weighted by posterior cluster probabilities. Kernel Affine Hull Machine (KAHM) geometry is used to estimate these posterior weights from inexpensive lexical features in an explicitly identified RKHS hypothesis space, and the semantic prototypes are refined by normalized least-mean-squares updates from noisy teacher embeddings. This yields a backpropagation-free query-side encoder together with an end-to-end error decomposition into posterior-approximation, finite-sample/generalization, and teacher-noise terms. We instantiate the approach on a controlled Austrian-law retrieval benchmark with 5,000 test queries, 84 candidate laws, and 10,762 aligned retrieval units, using law-specific encoders into a frozen Mixedbread embedding space. Among evaluation-matched learned adapters, KAHM achieves the strongest teacher-space reconstruction and the best rank-sensitive retrieval performance at all evaluated cutoffs. At k=20, it obtains MRR@20 = 0.504, Hit@20 = 0.694, and Top-1 Accuracy = 0.411, while reducing online per-query time by 8.53× relative to direct transformer query encoding in the reported CPU setting. The results support KAHMs as compute-efficient encoders for supervised fixed-representation deployment regimes.

2605.06384 2026-06-09 cs.LG cs.AI cs.FL 版本更新

MinMax Recurrent Neural Cascades

MinMax 循环神经网络级联

Alessandro Ronca

发表机构 * IRIS-AI

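AI summary MinMax RNCs are built on the MinMax algebra and combine strong expressivity, efficient evaluation, stable dynamics, and non-vanishing state gradients; they perform strongly on synthetic tasks, handle long sequences, and outperform the recurrent baselines considered.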

Comments Code: https://github.com/minmaxrnc/model

详情
AI中文摘要

我们引入MinMax循环神经网络级联(MinMax RNCs),一种基于MinMax代数新形式递归的循环神经网络。我们展示了MinMax RNCs具有一些难以同时获得的关键性质:强大的形式表达性、高效的评估、稳定的动态和非消失的状态梯度。首先,其形式表达性对应正则语言,可能是有限记忆系统的最大表达性。其次,除了递归形式的评估外,它们还允许并行扫描评估,具有对数深度和线性工作量。第三,其状态和激活在所有序列长度下均被统一限制。第四,其损失梯度几乎处处存在且在所有序列长度下均被统一限制。第五,它们不表现出消失的状态梯度:状态相对于过去状态的梯度可以独立于状态之间的时距保持范数一。经验上,我们发现这些理论性质转化为强大的实际性能。MinMax RNCs完美解决了考虑的合成任务,能够泛化到长序列,并在实验中超越了考虑的循环基线。我们还训练了一个1.12亿参数的MinMax RNC进行下一个token预测,获得与其规模相竞争的性能,提供了初始证据表明MinMax递归可以扩展到现实世界的序列建模任务。

英文摘要

We introduce MinMax Recurrent Neural Cascades (MinMax RNCs), a class of recurrent neural networks built from a novel form of recurrence over the MinMax algebra. We show that MinMax RNCs enjoy key properties that are difficult to obtain simultaneously: strong formal expressivity, efficient evaluation, stable dynamics, and non-vanishing state gradients. First, their formal expressivity corresponds to the regular languages, arguably the maximal expressivity for finite-memory systems. Second, in addition to evaluation in recurrent form, they also admit parallel-scan evaluation with logarithmic depth and linear work in the input length. Third, their states and activations are uniformly bounded for all sequence lengths. Fourth, their loss gradients exist almost everywhere and are uniformly bounded for all sequence lengths. Fifth, they do not exhibit vanishing state gradients: the gradient of a state with respect to a past state can retain norm one independently of the temporal distance between the states. Empirically, we find that these theoretical properties translate into strong practical performance. MinMax RNCs solve the considered synthetic tasks perfectly, generalise to long sequences, and outperform the recurrent baselines considered in our experiments. We also train a 112M-parameter MinMax RNC for next-token prediction, obtaining competitive performance for its size and providing initial evidence that MinMax recurrence can scale to real-world sequence-modelling tasks.

2605.11855 2026-06-09 cs.LG cs.AI cs.AR 版本更新

Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications

提升为超低功耗应用设计的可并行递归神经网络的性能和学习稳定性

Julien Brandoit, Arthur Fyon, Damien Ernst, Guillaume Drion

发表机构 * University of Cambridge(剑桥大学)

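AI summary The paper proposes the CMRU and αCMRU, whose cumulative update formulation restores gradient flow while preserving persistent memory, improving convergence stability and reducing initialization sensitivity; they perform well across diverse benchmarks, especially on tasks requiring discrete long-range retention.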

Comments Accepted as a spotlight at ICML2026. This work has been the subject of patent applications under numbers EP26175243.0 and EP26175248.9

详情
AI中文摘要

序列学习主要由Transformer和可并行递归神经网络(RNN,如状态空间模型)主导,但学习长期依赖仍具挑战性,且最先进的设计以更高的功耗换取性能。双稳态记忆递归单元(Bistable Memory Recurrent Unit,BMRU)被引入以实现超低功耗RNN的软硬件协同设计:具有滞后特性的量化状态提供持久记忆,并可直接映射到模拟基本单元。然而,BMRU在复杂序列任务上的性能落后于可并行RNN。本文指出状态更新期间的梯度阻塞是关键限制,并提出一种累积更新公式,在保持持久记忆的同时恢复梯度流,形成沿时间的跳跃连接。由此得到累积记忆递归单元(CMRU)及其松弛变体αCMRU。实验表明,累积公式显著提高了收敛稳定性并降低了对初始化的敏感性。在小模型规模下,CMRU和αCMRU在多种基准上与线性递归单元(LRU)和最小门控递归单元(minGRU)持平或更优,在需要离散长距离保留的任务上优势尤为明显,同时CMRU保留了对模拟实现至关重要的量化状态、持久记忆和抗噪声动态。

英文摘要

Sequence learning is dominated by Transformers and parallelizable recurrent neural networks (RNNs) such as state-space models, yet learning long-term dependencies remains challenging, and state-of-the-art designs trade power consumption for performance. The Bistable Memory Recurrent Unit (BMRU) was introduced to enable hardware-software co-design of ultra-low power RNNs: quantized states with hysteresis provide persistent memory while mapping directly to analog primitives. However, BMRU performance lags behind parallelizable RNNs on complex sequential tasks. In this paper, we identify gradient blocking during state updates as a key limitation and propose a cumulative update formulation that restores gradient flow while preserving persistent memory, creating skip-connections through time. This leads to the Cumulative Memory Recurrent Unit (CMRU) and its relaxed variant, the $α$CMRU. Experiments show that the cumulative formulation dramatically improves convergence stability and reduces initialization sensitivity. The CMRU and $α$CMRU match or outperform Linear Recurrent Units (LRUs) and minimal Gated Recurrent Units (minGRUs) across diverse benchmarks at small model sizes, with particular advantages on tasks requiring discrete long-range retention, while the CMRU retains quantized states, persistent memory, and noise-resilient dynamics essential for analog implementation.

2605.13768 2026-06-09 cs.LG cs.AI cs.IT math.IT 版本更新

High-Rate Quantized Matrix Multiplication II

高速率量化矩阵乘法II

Or Ordentlich, Yury Polyanskiy

发表机构 * Hebrew University of Jerusalem(耶路撒冷希伯来大学) MIT(麻省理工学院)

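AI summary The paper studies high-rate quantized matrix multiplication when the covariance matrix of the columns of the second factor is known, uses waterfilling to improve LLM quantization methods, and shows that the WaterSIC scheme performs within a constant factor of the information-theoretic limit.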

详情
AI中文摘要

本文是量化矩阵乘法(MatMul)研究的第二部分。在第一部分中,我们考虑了无校准量化的情形,而本文讨论第二因子各列的协方差矩阵$Σ_X$已知的情形。这种情形出现在十分常见的LLM仅权重后训练量化任务中。仅权重量化与加权均方误差(WMSE)信源编码问题相关,其经典的(反向)注水解决定了应如何在向量各坐标之间分配码率。我们展示了如何利用注水来改进实际的LLM量化算法(GPTQ),这些算法目前对码率作均等分配。我们分析了一种最近提出、仅使用标量INT量化器的方案(称为“WaterSIC”),并证明其高码率性能(a)与基无关(即由$Σ_X$的行列式刻画,因此与现有方案不同,不受随机旋转的影响);(b)与信息论失真极限相差不超过乘性因子$\frac{2πe}{12}$(即0.25 bit/entry)。相比之下,GPTQ的性能受基的选择影响,但在随机旋转和来自Llama-3-8B的实际$Σ_X$下,我们发现它与WaterSIC的差距在0.1 bit以内(取决于层类型),这表明结合随机旋转的GPTQ也接近最优,至少在高码率区间如此。

英文摘要

This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where the covariance matrix $Σ_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as "WaterSIC") that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $Σ_X$ and, thus, unlike existing schemes, is immune to applying random rotations); and (b) within a multiplicative factor of $\frac{2πe}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $Σ_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.
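
Two quantities in the abstract above lend themselves to a short numerical check. The sketch below implements the textbook reverse waterfilling allocation for independent Gaussian components under plain mean squared error (the weighted variant and the WaterSIC scheme itself are not reproduced here), and evaluates the rate equivalent of the $\frac{2πe}{12}$ factor. The function name and the example variances are illustrative assumptions.

```python
import math

def reverse_waterfill(variances, total_rate_bits, iters=200):
    """Find the water level theta by bisection and return (theta, per-component rates).

    Components with variance above theta get rate 0.5 * log2(variance / theta);
    components at or below theta get no bits.
    """
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rate = sum(0.5 * math.log2(v / theta) for v in variances if v > theta)
        if rate > total_rate_bits:
            lo = theta  # too many bits spent: raise the water level
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    return theta, [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]

# Made-up variances: one strong, one moderate, one nearly silent coordinate.
theta, rates = reverse_waterfill([4.0, 1.0, 0.01], total_rate_bits=3.0)

# Rate equivalent of the multiplicative distortion factor 2*pi*e/12.
gap_bits = 0.5 * math.log2(2.0 * math.pi * math.e / 12.0)
```

For this example the water level settles at 0.25, so the allocation is 2 bits, 1 bit, and 0 bits instead of an equal 1 bit each, and the gap evaluates to roughly 0.25 bit per entry, matching the figure quoted in the abstract.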

2605.15491 2026-06-09 cs.LG cs.AI cs.PF 版本更新

Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs

Ghosted Layers: 无约束激活对齐用于恢复层剪枝的LLM

Vincent-Daniel Yun, Junhyuk Jo, Sai Praneeth Karimireddy, Sunwoo Lee

发表机构 * University of Southern California(南加州大学) Inha University(inha大学)

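AI summary The paper proposes Ghosted Layers, which treats the activation-distribution mismatch left by layer pruning as an unconstrained alignment problem, improving LLM accuracy and perplexity without sacrificing efficiency.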

详情
AI中文摘要

层剪枝从大型语言模型中移除整个Transformer解码器块,但导致后续存活层接收到的隐藏状态分布与训练时分布不匹配,从而引起显著性能下降。我们提出Ghosted Layers,一种无需训练的恢复模块,通过解决边界激活对齐问题来解决此问题。我们的方法从少量校准集推导出闭合形式的最优线性算子,以重建由剪枝层引入的激活差异。我们展示该解决方案对应于对齐目标的无约束最优解,而现有方法受限于有限算子子空间内的约束解。在多个LLM backbone和剪枝策略上的实验表明,我们的方法在保持层剪枝效率增益的同时,一致提升了准确性和perplexity,优于先前的无训练基线。官方代码仓库:https://github.com/daniel-eai/ghosted_layers_official_repository/.

英文摘要

Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to significant performance degradation. We propose Ghosted Layers, a training-free recovery module that addresses this issue by solving a boundary activation alignment problem. Our method derives a closed-form optimal linear operator from a small calibration set to reconstruct the activation discrepancy introduced by the pruned layers. We show that this solution corresponds to the unconstrained optimum of the alignment objective, whereas existing methods are restricted to constrained solutions over limited operator subspaces. Experiments across multiple LLM backbones and pruning strategies demonstrate that our method consistently improves accuracy and perplexity over prior training-free baselines, while preserving the efficiency gains of layer pruning. Official code repository: https://github.com/daniel-eai/ghosted_layers_official_repository/.

2605.16928 2026-06-09 cs.CL cs.AI 版本更新

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

全注意力再临:在数百次训练步骤内将全注意力转化为稀疏

Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu, Tao Lan, Lin Qu, Yuan Yao, Xiaoxing Ma

发表机构 * Nanjing University(南京大学) Alibaba Group(阿里巴巴集团)

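AI summary The paper proposes RTPurbo, which exploits a model's intrinsic sparsity to obtain efficient sparse attention within a few hundred training steps, substantially improving inference efficiency while keeping accuracy near-lossless.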

Comments 20 pages, 9 figures

详情
AI中文摘要

大型语言模型的长上下文推理受到全注意力二次成本的限制。现有的高效替代方法通常依赖于原生稀疏训练或启发式令牌驱逐,导致效率、训练成本和准确性之间存在不理想的权衡。在本文中,我们证明全注意力LLM本质上已经是稀疏的,并且可以通过最小的适应转化为高度稀疏的模型。我们的方法基于三个观察:(1) 只有少量的注意力头真正需要完整的长上下文处理;(2) 长距离检索主要由低维子空间支配,允许相关令牌通过16维索引器高效检索;(3) 有用的令牌预算强烈依赖于查询,使得动态top-p选择比固定top-k稀疏化更合适。基于这些见解,我们提出了RTPurbo,该方法仅保留检索头的完整KV缓存,并引入轻量级令牌索引器进行稀疏注意力。通过利用模型的内在稀疏性,RTPurbo仅在数百次训练步骤内即可实现稀疏化。在长上下文基准和推理任务上的实验表明,RTPurbo在保持接近无损精度的同时,实现了显著的效率提升,包括在100万上下文下的预填充速度提升高达9.36倍,以及解码速度提升约2.01倍。这些结果表明,可以通过标准的全注意力训练获得强大的稀疏推理,而无需昂贵的原生稀疏预训练。

英文摘要

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-$p$ selection more suitable than fixed top-$k$ sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36$\times$ prefill speedup at 1M context and about a 2.01$\times$ decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
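
Observation (3) above contrasts a fixed top-$k$ budget with a query-dependent top-$p$ budget. A minimal sketch of top-$p$ selection over one query's token scores follows; it is a generic nucleus-style rule, not RTPurbo's indexer, and `top_p_select` and the example scores are hypothetical.

```python
import math

def top_p_select(scores, p):
    """Return indices of the smallest set of tokens whose softmax mass reaches p."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    order = sorted(range(len(scores)), key=lambda i: exps[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += exps[i] / total
        if mass >= p:
            break
    return kept

# A query that clearly points at one token keeps a single token...
peaked = top_p_select([10.0, 0.0, 0.0, 0.0, 0.0, 0.0], p=0.9)
# ...while a query with no clear target keeps all of them.
flat = top_p_select([0.0] * 6, p=0.9)
```

A fixed top-$k$ rule would keep the same number of tokens in both cases, which either wastes budget on the first query or starves the second.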

2605.17289 2026-06-09 cs.LG cs.AI 版本更新

LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models

LEAP:大型语言模型的可学习端到端自适应剪枝

Mohammad Mozaffari, Younes Hourri, Mohammad Rastegari, Mahyar Najibi

发表机构 * University of Maryland(马里兰大学)

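AI summary The paper proposes LEAP, a learnable end-to-end unstructured pruning method that replaces the conventional parameterization with a per-weight Bernoulli-via-Gumbel-sigmoid relaxation, improving end-to-end accuracy; experiments across several LLM families show higher average zero-shot accuracy.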

Comments Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference (AdaptFM)

详情
AI中文摘要

非结构化稀疏性如今已得到近期GPU内核和数据流硬件的原生加速,瓶颈从推理执行转移到了剪枝算法。最先进的非结构化LLM剪枝方法是由最优脑外科医生(Optimal Brain Surgeon)原理推导出的逐层代理方法,它们牺牲了端到端准确率,在高稀疏度下尤其如此。MaskLLM和PATCH等端到端替代方案表明可学习掩码可以缩小这一差距,但它们基于模式的类别分布参数化随每行有效掩码的数量而增长,无法移植到非结构化设置。我们提出LEAP,用逐权重的伯努利Gumbel-sigmoid松弛替代这种不可行的参数化,使端到端非结构化掩码学习变得可行。在五个参数规模从0.5B到8B的LLM家族上,在50%和60%稀疏度下,LEAP的六任务平均零样本准确率比ADMM平均高出2.59个点,ADMM是我们实验扫描中最好的逐层基线。

英文摘要

Unstructured sparsity is now natively accelerated by recent GPU kernels and dataflow hardware, shifting the bottleneck from inference execution to the pruning algorithm. State-of-the-art methods for unstructured LLM pruning are layer-wise surrogates derived from the Optimal Brain Surgeon principle, and they sacrifice end-to-end accuracy, especially under aggressive sparsity. End-to-end alternatives such as MaskLLM and PATCH show that learnable masks can close this gap, but their categorical-over-patterns parameterization scales with the number of valid masks per row and does not port to the unstructured setting. We introduce LEAP, which replaces this intractable parameterization with a per-weight Bernoulli-via-Gumbel-sigmoid relaxation that makes end-to-end unstructured mask learning tractable. Across five LLM families from 0.5B to 8B parameters at 50% and 60% sparsity, LEAP improves six-task average zero-shot accuracy by +2.59 points on average over ADMM, the best layer-wise baseline in our sweep.
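
The per-weight Bernoulli-via-Gumbel-sigmoid relaxation named above can be sketched for a single mask entry. This is the generic relaxation (logistic noise added to a learnable logit, then a tempered sigmoid), written under the assumption that LEAP uses something of this shape; the temperature, the logit value, and the function name are illustrative, and no training loop is shown.

```python
import math
import random

def gumbel_sigmoid(logit, tau, rng):
    """Draw one relaxed Bernoulli sample in [0, 1] for a mask logit."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(u) - math.log(1.0 - u)  # logistic noise: difference of two Gumbels
    return 1.0 / (1.0 + math.exp(-(logit + noise) / tau))

rng = random.Random(0)
samples = [gumbel_sigmoid(logit=1.0, tau=0.5, rng=rng) for _ in range(20000)]

# Thresholding the relaxed sample at 0.5 recovers a hard keep/prune decision
# that keeps the weight with probability sigmoid(logit), about 0.731 here.
keep_rate = sum(s > 0.5 for s in samples) / len(samples)
```

Because each weight carries its own logit, the number of learnable mask parameters equals the number of weights, rather than growing with the number of valid mask patterns per row.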

2605.18643 2026-06-09 cs.LG cs.AI cs.CL 版本更新

Post-Trained MoE Can Skip Half Experts via Self-Distillation

Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou

发表机构 * Frontis.AI Kuaishou Technology(快手科技) Shanghai AI Lab(上海人工智能实验室) TsinghuaC3I/ZEDA(清华大学C3I/ZEDA)

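AI summary The paper proposes ZEDA, a framework that uses self-distillation to convert post-trained static MoE models into efficient dynamic ones, substantially reducing expert FLOPs and speeding up inference.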

详情
AI中文摘要

混合专家(MoE)通过稀疏专家激活高效地扩展语言模型,其动态变体进一步通过输入依赖的方式调整激活专家以减少计算。现有动态MoE方法通常依赖从头训练或任务特定适应,使完全训练的MoE的实际转换未被充分探索。启用此类适应可直接缓解推理成本,通过允许简单令牌在服务时绕过不必要的专家。本文引入了零专家自蒸馏适应(ZEDA),一种低成本框架,将后训练的静态MoE模型转换为高效的动态MoE模型。为稳定此架构转换,ZEDA在每个MoE层中注入无参数的零输出专家,并通过两阶段自蒸馏适应增强模型,利用原始MoE作为冻结的教师,并应用组级平衡损失。在Qwen3-30B-A3B和GLM-4.7-Flash上跨11个基准测试(涵盖数学、代码和指令跟随)中,ZEDA在边际精度损失下消除了超过50%的专家FLOPs。在两个模型上,ZEDA比最强的动态MoE基线分别高出6.1和4.0个点,并提供约1.20倍的端到端推理加速。

英文摘要

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such adaptation would directly reduce inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20$\times$ end-to-end inference speedup.
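
How zero-output experts turn a static top-k router into a dynamic one can be sketched as follows. This is an illustration under assumptions, not ZEDA's code: the router logits are given directly, gate weights are a softmax over the selected slots, experts are scalar functions, and `moe_forward` is a hypothetical name. The self-distillation and balancing loss are not shown.

```python
import math

def moe_forward(x, router_logits, experts, top_k):
    """Route x to the top_k slots; slots beyond len(experts) are zero-output experts.

    Returns the layer output and the number of real experts actually evaluated.
    """
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    m = max(router_logits[i] for i in chosen)
    weights = {i: math.exp(router_logits[i] - m) for i in chosen}
    z = sum(weights.values())
    out, calls = 0.0, 0
    for i in chosen:
        if i < len(experts):  # real expert: costs compute
            out += weights[i] / z * experts[i](x)
            calls += 1
        # zero expert: contributes nothing and is never evaluated
    return out, calls

experts = [lambda x: 2.0 * x, lambda x: x + 1.0]  # two real experts, two zero slots

# A "hard" token whose router prefers both real experts.
_, hard_calls = moe_forward(1.0, [3.0, 2.0, -1.0, -1.0], experts, top_k=2)
# An "easy" token whose router sends one of its two slots to a zero expert.
_, easy_calls = moe_forward(1.0, [3.0, -1.0, 2.0, 0.0], experts, top_k=2)
```

The router still selects a fixed number of slots per token, so the routing code path is unchanged; the savings come from slots that land on a zero expert.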

2605.21854 2026-06-09 cs.CV cs.AI 版本更新

CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models

CrossVLA: 跨范式后训练和推理优化用于视觉-语言-动作模型

Zhi Liu

发表机构 * Tianjin University(天津大学)

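AI summary CrossVLA is an empirical study of cross-paradigm post-training for vision-language-action (VLA) models: it introduces a surrogate flow-matching log-probability estimator so that DPO can run on continuous-action backbones, compares LoRA and DoRA as parameter-efficient layers, and shows that the denoise loop dominates inference latency, reporting clear gains on LIBERO.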

Comments Workshop draft, 14 pages, 4 figures. Code, ckpts, data: https://github.com/lz-googlefycy/vla-lab

详情
AI中文摘要

视觉-语言-动作(VLA)模型迅速收敛到一小套架构模式:离散令牌自回归(例如OpenVLA)和连续动作流匹配(例如pi-0.5)。然而,通过直接偏好优化(DPO)进行偏好对齐——语言模型中事实上的后训练步骤——几乎仅在自回归VLA上被研究。我们提出了CrossVLA,对跨范式VLA后训练进行实证研究。三大贡献:(i)一个替代流匹配对数概率估计器,使DPO可以在不进行概率流ODE积分的情况下在连续动作后端上运行;(ii)对LoRA和DoRA作为VLA DPO的参数高效层进行直接比较,发现DoRA在LIBERO 4套件上比OpenVLA SFT平均提升10.4个百分点(600次试验,3种子)——每套件+20.0对象,+11.0长周期,+8.0目标,+2.7空间——在对象上无种子方差(38/50在每个种子上);(iii)推理时间解剖显示去噪循环主导了78.6%的sample_actions延迟,而类似于VLA-Cache的前缀K/V缓存达到了21%的加速上限——无论是块级还是令牌级缓存策略在我们的基准中都会使成功率降至0-80%。我们进一步在6000个LIBERO帧上预训练了一个多视角+时间投影头,实现了99.5%的k-NN召回率@1(36倍于随机),可用作下游初始化。所有代码、检查点、训练日志和复现脚本均在https://github.com/lz-googlefycy/vla-lab上公开。

英文摘要

Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. OpenVLA) and continuous-action flow-matching (e.g. pi-0.5). Yet preference alignment via Direct Preference Optimisation (DPO) -- the de-facto post-training step in language models -- has been studied almost exclusively on autoregressive VLAs. We present CrossVLA, an empirical study of cross-paradigm VLA post-training. Three contributions: (i) a surrogate flow-matching log-probability estimator that lets DPO operate on continuous-action backbones without probability-flow ODE integration; (ii) a head-to-head comparison of LoRA and DoRA as the parameter-efficient layer for VLA DPO, finding DoRA improves over OpenVLA SFT by a mean +10.4 pp across LIBERO 4-suite (600 trials, 3 seeds) -- per-suite +20.0 Object, +11.0 Long-horizon, +8.0 Goal, +2.7 Spatial -- with zero seed variance on Object (38/50 on each of 3 seeds); (iii) an inference-time anatomy showing the denoise loop dominates 78.6% of sample_actions latency and prefix-K/V caching a la VLA-Cache caps at a 21% acceleration ceiling -- both chunk-level and token-level cache strategies degrade success rate to 0-80% in our benchmarks. We further pretrain a multi-view + temporal projection head on 6000 LIBERO frames, achieving 99.5% k-NN recall@1 for same-task retrieval (36x over random), available as a downstream initialisation. All code, ckpts, training logs, and reproduction scripts are open at https://github.com/lz-googlefycy/vla-lab.
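
For readers unfamiliar with DPO, the standard pairwise loss that the abstract builds on is sketched below. It takes log-probabilities as plain numbers; for a flow-matching policy these would have to come from an estimator such as the surrogate in contribution (i), which is not reproduced here. The value of `beta` and the example inputs are arbitrary.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)
# Policy that raises the chosen action and lowers the rejected one: lower loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss depends only on log-probability differences relative to the frozen reference, which is why a continuous-action backbone needs some way to estimate those log-probabilities before DPO can be applied.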

2605.24942 2026-06-09 cs.LG cs.AI 版本更新

Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering

黎曼流形操控:用于无标签操控的几何感知生成自编码器

Narmeen Oozeer, Shivam Raval, Philip Quirke, Manikandan Ravikiran, Jeff Phillips, Shriyash Upadhyay, Amirali Abdullah

发表机构 * Martian Harvard University(哈佛大学) Thoughtworks University of Utah(犹他大学)

AI总结 提出将语言模型操控重新定义为激活空间上的黎曼测地线计算,通过基于输出空间Hellinger距离学习的编码器实现无标签、无拓扑先验的流形操控。

详情
AI中文摘要

语言模型的操控——干预其内部激活以改变下游行为——最近已从线性插值扩展到非线性方法,如角度操控和核化操控,这些方法定义了干预变换,而无需在激活空间中的路径上学习显式几何。新引入的几何感知流形方法确实学习了这样的几何,但需要带标签的类中心以及预设的循环或顺序结构。这些假设限制了流形操控的应用范围,因为现有构造需要带标签的中心和兼容的边界条件。我们将流形操控更广泛地重新定义为激活空间上的黎曼测地线计算,将线性操控和带标签样条操控恢复为特定度量选择下的测地线。该框架内一个有原则的度量是输出空间Hellinger距离拉回到激活空间;我们通过一个在小型概念-令牌模式上基于输出距离训练的学习编码器来近似该度量——无需每个提示的标签、无需拓扑先验、也无需每个任务的曲线拟合。实验上,该方法在标准四任务语言模型算术基准的所有任务中可靠地将模型驱动到目标类别,同时在较小输出空间上遵循比基线更行为自然的轨迹。因此,我们为流形操控提供了一个统一的黎曼框架,以及一个基于模式监督、无标签的实例化,该实例化无需带标签的中心或预设边界条件即可运行。

英文摘要

Steering a language model - intervening on its internal activations to change downstream behaviour - has recently expanded beyond linear interpolation to nonlinear methods such as angular and kernelized steering, which define intervention transformations without learning an explicit geometry over paths in activation space. Freshly introduced geometry-aware manifold methods do learn such a geometry, but require labelled class centroids together with prescribed cyclic or sequential structure. These assumptions restrict where manifold steering can be applied, since existing constructions require labelled centroids and compatible boundary conditions. We recast manifold steering more broadly as Riemannian geodesic computation on activation space, recovering linear and labelled-spline steering as geodesics under particular choices of metric. A principled metric within this framework is the output-space Hellinger distance pulled back to activations; we approximate this with a learned encoder trained on output distances over a small concept-token schema - no per-prompt labels, no topology prior, and no per-task curve fitting. Empirically, the method reliably drives the model onto the target class across all tasks in a standard four-task language-model arithmetic benchmark, while following more behaviourally natural trajectories than baselines on smaller output spaces. We thereby provide a unified Riemannian framework for manifold steering together with a schema-supervised, label-free instantiation that operates without labelled centroids or prescribed boundary conditions.
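
The output-space Hellinger distance that serves as the metric above has a simple closed form for discrete distributions. The sketch below computes it for two probability vectors; how the paper pulls this distance back to activation space and trains the encoder is not shown, and the example distributions are made up.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

same = hellinger([0.2, 0.8], [0.2, 0.8])      # identical output distributions
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])  # no shared support
partial = hellinger([0.5, 0.5], [0.9, 0.1])
```

The distance is symmetric and bounded, which makes it a convenient target for an encoder that should place activations with similar next-token behaviour close together.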

2605.26872 2026-06-09 cs.LG cs.AI cs.CL 版本更新

The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection

最强的教师并不总是最好的教师:以学生为中心的答案选择

Zhengyu Hu, Zheyuan Xiao, Linxin Song, Fengqing Jiang, Yuetai Li, Zhengyu Chen, Zhihan Xiong, Yue Liu, Junhao Lin, Yao Su, Lijie Hu, Kaize Ding, Teng Xiao, Radha Poovendran

发表机构 * University of Washington(华盛顿大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Southern California(南加州大学) Independent Researcher(独立研究者) National University of Singapore(新加坡国立大学) Microsoft(微软) Google(谷歌) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Northwestern University(西北大学) Allen Institute for AI (AI2)(人工智能研究院(AI2))

AI总结 提出以学生为中心的答案采样(SCAS)框架,通过估计学生中心的学习成本选择教师生成的答案,从而提升学生模型性能。

详情
AI中文摘要

LLM训练越来越依赖教师生成的监督,包括合成响应、推理轨迹和工具使用演示。当前实践通常选择表现最好的教师来生成学生训练数据,隐含地将教师测试表现视为教学质量的代理。我们表明这一假设可能失败:即使多个教师对同一问题提供正确答案,最强教师的答案也不一定是对给定学生的最佳监督。为解决这一问题,我们提出以学生为中心的答案采样(SCAS),该框架根据估计的学生中心学习成本从经过验证的教师生成答案中进行选择。受逐词梯度分解的启发,我们推导出该成本的高效前向代理,并在训练中用于指导答案选择。在30个教师模型、6个学生基础模型和8个任务上的实验表明,SCAS持续提升学生性能,表明有效的蒸馏应优先考虑与当前学生匹配的监督,而非仅依赖教师强度。

英文摘要

LLM training increasingly relies on teacher-generated supervision, from synthetic responses to reasoning traces and tool-use demonstrations. Current practice often chooses the highest-performing teacher to generate student training data, implicitly treating teacher test performance as a proxy for teaching quality. We show that this assumption can fail: even when multiple teachers provide correct answers to the same question, the answer from the strongest teacher is not necessarily the best supervision for a given student. To address this gap, we propose Student-Centric Answer Sampling (SCAS), a framework that selects from verified teacher-generated answers according to their estimated student-centric learning cost. Motivated by a token-wise gradient decomposition, we derive an efficient forward-only proxy for this cost and use it to guide answer selection during training. Experiments across 30 teacher models, 6 student base models, and 6 tasks show that SCAS consistently improves student performance, suggesting that effective distillation should prioritize supervision matched to the current student rather than teacher strength alone.

2605.27786 2026-06-09 cs.LG cs.AI 版本更新

Locality-Aware Redundancy Pruning for LLM Depth Compression

面向LLM深度压缩的局部感知冗余剪枝

Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo, Minkyu Kim, Sunwoo Lee

发表机构 * University of Southern California(美国南加州大学) Neural Superintelligence Lab, MODULABS(MODULABS神经超级智能实验室) Seoul National University(首尔国立大学) Inha University(仁荷大学)

AI总结 提出LoRP,一种基于表示局部性的无训练单次深度剪枝框架,通过引入表示局部性分数(RLS)来识别和剪除冗余层,在多种LLM上提升了困惑度和下游任务准确率。

详情
AI中文摘要

大型语言模型在跨网络深度上已知存在表示冗余,这使得深度剪枝成为提高推理效率的有效方法。现有的单次剪枝方法依赖于局部层重要性或跨架构的固定冗余假设。我们提出了局部感知冗余剪枝(LoRP),一种由表示局部性引导的无训练单次深度剪枝框架。我们表明,层间冗余可以是局部化的或全局分布的,具体取决于LLM架构。为了表征这一现象,我们引入了表示局部性分数(RLS),该分数源自全局层间隐藏状态相似性。使用小的校准集,LoRP计算成对层相似性,按表示相似性对层进行聚类,并根据残差簇内冗余分配剪枝。跨多种LLM家族的实验表明,在困惑度和下游任务准确性上均有提升。

英文摘要

Large language models are known to contain representational redundancy across network depth, making depth pruning an effective approach for improving inference efficiency. Existing one-shot pruning methods rely on local layer importance or fixed redundancy assumptions across architectures. We propose Locality-Aware Redundancy Pruning (LoRP), a training-free one-shot depth pruning framework guided by representation locality. We show that inter-layer redundancy can be either localized or globally distributed depending on the LLM architecture. To characterize this phenomenon, we introduce Representation Locality Score (RLS), derived from global inter-layer hidden-state similarity. Using a small calibration set, LoRP computes pairwise layer similarity, clusters layers by representational similarity, and allocates pruning according to residual intra-cluster redundancy. Experiments across diverse LLM families show improvements in both perplexity and downstream task accuracy. Official github repository: https://github.com/daniel-eai/LoRP-Locality-Aware-Redundancy-Pruning/
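The pairwise layer-similarity step described above can be sketched as follows. This is a minimal illustration that assumes each layer is summarized by one hidden-state vector averaged over a calibration set and that cosine similarity is the measure; the names `cosine` and `layer_similarity` are hypothetical, and the abstract does not specify how LoRP's RLS or clustering is actually computed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layer_similarity(hidden_states):
    """Pairwise similarity matrix; hidden_states[i] is the summary vector of layer i."""
    n = len(hidden_states)
    return [[cosine(hidden_states[i], hidden_states[j]) for j in range(n)]
            for i in range(n)]


# Toy example: layers 0 and 1 point in the same direction, layer 2 is orthogonal.
sim = layer_similarity([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
```

In this toy matrix, a high off-diagonal value such as `sim[0][1]` flags two layers with near-identical representations, which is the kind of signal a depth-pruning method could use to pick redundant layers.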

2605.28207 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Pruning and Distilling Mixture-of-Experts into Dense Language Models

将混合专家模型剪枝和蒸馏为密集语言模型

Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho

发表机构 * KRAFTON KAIST(韩国科学技术院)

AI总结 提出首个将混合专家(MoE)模型转换为标准密集架构的系统框架,通过专家评分、选择、分组、拼接和知识蒸馏,在参数匹配条件下比密集到密集剪枝平均下游准确率提升6.3个百分点,训练速度提升1.6倍。

详情
AI中文摘要

混合专家(MoE)现在是前沿语言模型的主导架构,但它需要将所有专家参数加载到内存中,因此在内存受限的部署中不太受欢迎。现有的压缩方法减少了专家数量,但输出仍然是具有相同基本限制的MoE模型。我们提出了第一个将训练好的MoE转换为标准全密集架构的系统框架:专家被评分、选择和分组,然后拼接成密集的前馈网络(FFN),并通过MoE教师的知识蒸馏进行精炼。我们在Qwen3-30B-A3B上评估了7种评分方法、5种分组方法和2种幅度缩放方法,涵盖了多种选定的专家数量,共产生350种配置。我们发现评分方法的选择影响最大,我们提出的新颖的多样性感知评分在Qwen3-30B-A3B、DeepSeek-V2-Lite和GPT-OSS-20B上始终优于先前的方法。在参数匹配的受控比较下,经过约4B token的蒸馏,MoE到密集的转换在平均下游准确率上比密集到密集的剪枝高出6.3个百分点,训练壁钟速度提升1.6倍。

英文摘要

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
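The step of concatenating selected experts into a dense FFN can be illustrated with a small sketch. It assumes plain ReLU feed-forward experts without gating or router weights, which is a simplification and not necessarily the authors' setup; all function names are hypothetical. Under that assumption, stacking the experts' up-projections row-wise and their down-projections column-wise yields a dense FFN whose output equals the unweighted sum of the individual expert outputs.

```python
def relu(x):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, v) for v in x]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def ffn(W_up, W_down, x):
    """Two-layer feed-forward block: W_down @ relu(W_up @ x)."""
    return matvec(W_down, relu(matvec(W_up, x)))


def concat_experts(experts):
    """Merge experts given as (W_up, W_down) pairs into one dense FFN.

    W_up has shape hidden x d and W_down has shape d x hidden (lists of rows).
    """
    # Stack all up-projection rows: the dense hidden layer holds every expert's units.
    W_up = [row for up, _ in experts for row in up]
    d = len(experts[0][1])
    # Concatenate down-projection columns so each hidden unit keeps its own output weights.
    W_down = [[v for _, down in experts for v in down[r]] for r in range(d)]
    return W_up, W_down


expert_a = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
expert_b = ([[1.0, 1.0], [-1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]])
x = [1.0, 2.0]

dense_up, dense_down = concat_experts([expert_a, expert_b])
dense_out = ffn(dense_up, dense_down, x)
summed_out = [a + b for a, b in zip(ffn(*expert_a, x), ffn(*expert_b, x))]
```

Because a real MoE layer weights expert outputs by router scores and activates only a few experts per token, the concatenated network is only a starting point; the abstract describes magnitude scaling and knowledge distillation as the steps that refine it.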

2606.04029 2026-06-09 cs.LG cs.AI 版本更新

Position: Deployed Reinforcement Learning should be Continual

立场:部署的强化学习应该是持续的

Parnian Behdin, Kevin Roice, Golnaz Mesbahi

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文主张部署的强化学习系统应持续学习,分析了部署后非平稳性的四个来源,并展示了持续RL的优势和实现方法。

Comments Accepted to the ICML 2026 Position Paper Track. See https://icml.cc/virtual/2026/poster/67195

详情
AI中文摘要

强化学习(RL)在现实世界用例中受到越来越多的关注和采用。大多数系统遵循“训练-修复”范式,其中训练好的代理在与世界交互时不会学习,直到性能下降且需要重新训练。在这篇立场论文中,我们认为部署一个无法达到最优但接收评估奖励信号的代理本质上是一个持续的RL问题。我们确定了部署后导致需要永无止境学习的四个非平稳性来源,并强调了为什么最好的部署代理永远不会停止适应。我们分析了现实世界中持续RL的成功案例,并向社区展示了摆脱当前“训练-修复”范式的优势和措施。

英文摘要

Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-then-fix paradigm, where trained agents do not learn while interacting with the world until performance degrades and retraining becomes necessary. In this position paper, we argue that deploying an agent that is incapable of optimality, but receives an evaluative reward signal, is inherently a continual RL problem. We identify four sources of non-stationarity after deployment that necessitate never-ending learning, and highlight why the best deployed agents never stop adapting. We analyze successful examples of continual RL in the real world, and present the community with the advantages and measures to move away from the current train-then-fix paradigm.

2606.05441 2026-06-09 cs.LG cs.AI stat.ML 版本更新

GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data

GOTabPFN: 从特征排序到高维表格基础模型的紧凑分词化

Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Kumar Gyawali, Gianfranco Doretto, Donald A. Adjeroh

发表机构 * University of Cambridge(剑桥大学)

AI总结 针对高维小样本表格预测问题,提出GOTabPFN模型,通过图引导排序和神经启发子单元压缩实现紧凑表示,提升TabPFN在严格token预算下的稳定性和准确性。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Code and resources GitHub https://github.com/zadid6pretam/GOTabPFN PyPI https://pypi.org/project/gotabpfn Project webpage https://www.zadidhabib.com/gotabpfn.html Hugging Face ZeroGPU https://huggingface.co/spaces/zadid6pretam/GOTabPFN CPU backup https://huggingface.co/spaces/zadid6pretam/GOTabPFN_CPU

详情
AI中文摘要

我们研究了如何在不重新训练大型骨干网络的情况下,使小型表格基础模型对高维小样本(HDLSS)表格预测有效。我们引入了带局部细化的图引导排序(GO-LR),证明了其与加权最小线性排列的等价性,并将实际求解器解释为TSP路径式替代方案。我们提出了基于GO-LR的GOTabPFN,以及一个神经启发子单元压缩(NSC)单元,将局部相邻的排序特征池化为元特征,从而生成紧凑表示,使TabPFN风格的预测在HDLSS场景中变得实用。在多个表格基准测试中,GOTabPFN在严格的token预算下提高了稳定性和准确性。

英文摘要

We investigate how to make small tabular foundation models effective for High-Dimensional, Low-Sample Size (HDLSS) tabular prediction without retraining large backbones. We introduce Graph-guided Ordering with Local Refinement (GO-LR), show its equivalence to weighted Minimum Linear Arrangement, and interpret the practical solver as a TSP-path-style surrogate. We propose GOTabPFN, which builds on GO-LR, and a Neuro-Inspired Subunit Compression (NSC) unit to pool locally adjacent ordered features into meta-features, yielding a compact representation that makes TabPFN-style prediction practical in HDLSS regimes. Across tabular benchmarks, GOTabPFN improves stability and accuracy under tight token budgets.

2412.11439 2026-06-09 cs.LG cs.AI physics.chem-ph 版本更新

Sampling Out-of-Distribution Chemical Spaces via Bayesian Flow

通过贝叶斯流采样分布外化学空间

Nianze Tao, Minori Abe

发表机构 * Hiroshima University(广岛大学) Tokyo University of Agriculture(东京农业大学)

AI总结 本文提出利用贝叶斯流网络生成高质量分布外分子,通过强化学习策略和可控微分方程求解器提升采样效率,并引入半自回归策略提升模型性能。

Comments 35 pages, 14 figures, 9 tables

详情
AI中文摘要

生成性能优于训练空间的新型分子,即分布外生成,对从头药物设计至关重要。然而,基于分布学习的模型,如扩散模型,难以解决这一挑战,因为这些方法旨在尽可能贴近训练数据的分布。在本文中,我们证明贝叶斯流网络,特别是ChemBFN模型,能够内在生成高质量的分布外样本,满足多种场景。我们向ChemBFN添加了强化学习策略,并采用可控的类常微分方程求解器生成过程以加速采样。最重要的是,我们在训练和推理过程中引入了半自回归策略,以提升模型性能并超越最先进的模型。此外,还包含了对ChemBFN中采用半自回归方法进行分布外生成的理论分析。

英文摘要

Generating novel molecules with better property values than those in the training space, namely out-of-distribution generation, is important for de novo drug design. However, this is difficult for distribution-learning-based models such as diffusion models, because these methods are designed to fit the distribution of the training data as closely as possible. In this paper, we show that Bayesian flow networks, especially the ChemBFN model, are capable of intrinsically generating high-quality out-of-distribution samples that meet several scenarios. A reinforcement learning strategy is added to ChemBFN, and a controllable, ordinary-differential-equation-solver-like generating process is employed that accelerates sampling. Most importantly, we introduce a semi-autoregressive strategy during training and inference that enhances model performance and surpasses state-of-the-art models. A theoretical analysis of out-of-distribution generation in ChemBFN with the semi-autoregressive approach is included as well.

6. 自然语言与多模态智能 81 篇

2606.08122 2026-06-09 cs.AI 新提交

Think Before You Act: Intention-Guided Reasoning for LLM-Based Location Prediction

三思而后行:基于意图引导推理的LLM位置预测

Qingxiang Liu, Anqi Liang, Zhuoyang Jiang, Yutian Jiang, Sisuo Lyu, Yu Ji, Haomin Wen, Yuxuan Liang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Shanghai Jiao Tong University(上海交通大学) The Hong Kong University of Science and Technology(香港科技大学) Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出IntentPOI框架,通过两阶段意图引导推理(先推断用户出行意图,再基于意图选择POI),将位置预测从直接轨迹匹配转化为意图推理,在三个真实数据集上超越11个基线。

详情
AI中文摘要

根据用户的历史签到记录预测其下一个兴趣点(POI)是基于位置服务中的一项基本任务。尽管最近结合大语言模型的方法展现了强大的推理能力和有前景的结果,但它们通常将预测任务建模为一步式的轨迹到位置映射问题,使得预测容易受到浅层轨迹相关性和历史频率偏差的影响。我们认为用户很少直接选择位置,相反,他们通常首先形成出行意图,然后据此选择特定的POI。受此洞察启发,我们提出了IntentPOI,一个两阶段的意图引导推理框架。在思考阶段,我们通过结合历史移动模式、相似同伴行为和时间上下文来推断用户的中间意图。在执行阶段,我们首先构建一个紧凑的候选池,然后执行意图引导推理,以识别与推断意图最一致的位置。通过明确地将意图推断与位置预测解耦,IntentPOI将下一个POI预测从直接的轨迹匹配转变为意图引导推理。在三个真实世界数据集上的大量实验表明,IntentPOI始终优于十一个最先进的基线方法。

英文摘要

Predicting a user's next Point-of-Interest (POI) based on their historical check-in records is a fundamental task in location-based services. While recent methods incorporating large language models have shown strong reasoning capabilities and promising results, they typically formulate the prediction task as a one-step trajectory-to-location mapping problem, making predictions prone to shallow trajectory correlations and historical frequency bias. We argue that users rarely choose locations directly; instead, they usually first form a travel intention and then select specific POIs accordingly. Motivated by this insight, we propose IntentPOI, a two-stage intention-guided reasoning framework. In the thinking stage, we infer users' intermediate intentions by incorporating historical mobility patterns, similar peer behaviors, and temporal context. In the acting stage, we first construct a compact candidate pool, and then perform intention-guided reasoning to identify locations that best align with the inferred intention. By explicitly decoupling intention inference from location prediction, IntentPOI transforms next-POI prediction from direct trajectory matching into intention-guided reasoning. Extensive experiments on three real-world datasets demonstrate that IntentPOI consistently outperforms eleven state-of-the-art baselines.

2606.08841 2026-06-09 cs.AI cs.CV 新提交

ZIPP: Zero-shot Image Personalization from Personas

ZIPP:基于人物画像的零样本图像个性化生成

Harini SI, Somesh Singh, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah

发表机构 * Adobe Media and Data Science Research (MDSR)(Adobe媒体与数据科学研究(MDSR)) IIIT-Delhi(德里英德拉普拉斯塔信息技术学院) SUNY at Buffalo(纽约州立大学布法罗分校)

AI总结 提出ZIPP方法,利用自然语言人物画像通过LLM改写提示词实现零样本图像个性化生成,无需用户数据或微调;引入ZIPBench基准,在多个评测中取得13-20%的提升。

详情
AI中文摘要

文本到图像扩散模型越来越多地部署在开放式创意环境中,但其输出仍然缺乏个性,优化的是整体审美而非个人品味。人类偏好是多元化的:一位喜欢柔和、怀旧肖像的用户可能偏爱充满活力的街头摄影,而另一位则倾向于梦幻的电影美学。现有方法需要密集的交互历史或逐用户微调,在冷启动场景中失败,并将上下文相关的偏好压缩为静态表示。我们提出了基于人物画像的零样本图像个性化生成(ZIPP),该方法以自然语言人物画像(用户身份和审美偏好的简洁描述符)为条件生成图像,无需任何用户特定数据或权重更新。ZIPP使用LLM从给定人物画像的角度重写提示词,引导扩散模型输出个性化结果。为了大规模挖掘人物画像,我们在一个包含2200万用户的Reddit交互图上训练了一个归纳式图注意力网络,采用双对比目标将图结构与视觉行为对齐,然后通过多模态大语言模型将学习到的表示转化为自然语言人物画像。我们引入了ZIPBench,这是首个零样本个性化基准,包含1500名用户、图挖掘的人物画像和4万张生成图像。在四个基准和涵盖五个模型家族的14个LLM上,人物画像条件化带来一致的性能提升(13-20%),前沿模型受益最大。在少样本设置中,ZIPP匹配或超过了基于每用户100多个示例微调的基线。ZIPP实现了最低的偏好分布散度(CMMD 0.16 vs 0.55),且经IPF归一化的人口统计评估表明,它显著减少了现有方法中存在的子群体偏差。人工评估证实,与通用生成相比胜率为79%,与所有微调基线相比胜率为58-65%。

英文摘要

Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs remain impersonal, optimized for aggregate aesthetics rather than individual taste. Human preferences are pluralistic: one user favoring muted, nostalgic portraits may prefer vibrant street photography, while another gravitates toward dreamy film aesthetics. Existing methods require dense interaction histories or per-user fine-tuning, failing in cold-start settings and collapsing context-dependent preferences into a static representation. We introduce zero-shot image personalization from personas (ZIPP), which conditions image generation on natural-language personas (concise descriptors of a user's identity and aesthetic sensibilities) without any user-specific data or weight updates. ZIPP uses an LLM to rewrite prompts from the perspective of a given persona, steering diffusion models toward personalized outputs. To mine personas at scale, we train an inductive Graph Attention Network over a 22M-user Reddit interaction graph with dual contrastive objectives aligning graph structure with visual behavior, then verbalize learned representations into natural-language personas via an MLLM. We introduce ZIPBench, the first zero-shot personalization benchmark with 1.5K users, graph-mined personas, and 40K generated images. Across four benchmarks and 14 LLMs spanning five model families, persona conditioning yields consistent gains (13-20%), with frontier models benefiting most. In the few-shot setting, ZIPP matches or exceeds fine-tuned baselines trained on 100+ examples per user. ZIPP achieves the lowest preference distributional divergence (CMMD 0.16 vs. 0.55), and IPF-normalized demographic evaluation shows it substantially reduces subpopulation bias present in existing methods. Human evaluation confirms a 79% win rate over generic generation and 58-65% over all fine-tuned baselines.

2606.09131 2026-06-09 cs.AI cs.CL cs.CV cs.LG 新提交

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

晚期融合足矣:面向视觉饱和的多模态大语言模型的双路径视觉令牌路由

Siyuan Liu, Jinyang Wu

发表机构 * School of Mechanics and Engineering Science, Peking University(北京大学力学与工程科学学院) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 针对多模态大语言模型中视觉令牌在深层饱和的问题,提出双路径视觉令牌路由(DPVR-LF),在饱和点将视觉令牌路由至单层可训练分支,仅最后层融合,以约3%可训练参数保持性能并减少计算。

Comments 18 pages, 4 figures. Submitted to Pattern Recognition

详情
AI中文摘要

多模态大语言模型(MLLMs)通常继承为单模态文本建模设计的深层对称Transformer骨干,并对图像和语言令牌应用相同的统一计算。这种设计忽略了一个关键的模态不对称性:图像和文本令牌在信息密度、冗余度和所需推理深度上存在显著差异。通过对LLaVA-1.5的逐层分析,我们观察到视觉令牌倾向于在中间层饱和。具体而言,文本到图像的注意力从第0层的0.68下降到第4层的0.07,并在第18层后稳定在0.04附近,而文本令牌则继续受益于深层语义处理。这些发现表明架构对称性与深度异步模态演化之间存在不匹配,导致冗余的视觉计算以及在深层任务特定适应期间感知表示的潜在漂移。受此启发,我们提出了双路径视觉令牌路由(DPVR),一种用于高效MLLMs的模态不对称路由框架。其核心实例DPVR-LF(晚期融合)在饱和点将视觉令牌路由到一个单层可训练侧分支,运行一个跳过深层堆栈中图像位置的十三层纯文本前向传播,并仅在最后一层重新融合视觉和文本流。使用约3%的可训练参数,DPVR-LF在标准基准上保持了有竞争力的多模态性能,同时减少了深层Transformer堆栈中的视觉计算。该结果挑战了视觉令牌必须遍历所有深层语言模型层的传统假设,并表明单个晚期融合层足以在LLaVA风格的MLLMs中维持强大的感知能力。

英文摘要

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.

2606.09441 2026-06-09 cs.AI cs.AR 新提交

SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance

SIFT: 利用注意力不变性实现RAG预填充快速计算的选择性索引

Rya Sanovar, Srikant Bharadwaj, Hritvik Taneja, Moinuddin Qureshi

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Microsoft(微软)

AI总结 针对RAG查询中文档重复导致预填充计算冗余和TTFT增加的问题,提出SIFT方法,通过离线提取文档高注意力分数位置并利用注意力不变性,在预填充时仅计算标记位置,将TTFT提升1.71倍且精度损失在1%以内。

详情
AI中文摘要

检索增强生成(RAG)向LLM查询注入相关文档以提高响应质量。这种注入增加了提示长度并减慢了首个令牌生成时间(TTFT)。与标准查询不同,RAG查询具有上下文复用的独特属性,即相同文档在用户查询中重复出现。因此,为每个RAG查询完全重新计算文档会导致冗余计算并增加TTFT。先前的工作离线预计算RAG文档的KV张量,并在在线预填充期间粗略地重新计算一些令牌。然而,由于高延迟的磁盘传输,这种KV复用在现代GPU上通常比完全重新计算更慢。此外,这种粗粒度的重新计算会降低准确性。为了解决这些限制,本文提出了SIFT:利用注意力不变性实现RAG预填充快速计算的索引选择。SIFT离线处理文档,并提取每个文档中高注意力分数的细粒度位置。接下来,我们识别出以下注意力不变性见解,使我们能够在运行时利用提取的位置:(1)局部注意力不变性:文档内高注意力分数的位置不受周围文档的影响。这有助于我们预测文档自注意力中高分数出现的位置。(2)交叉注意力一致性:具有高文档内注意力的键也会吸引后续文档的交叉注意力。这有助于我们预测文档对未来文档注意力中高分数出现的位置。关键的是,SIFT不存储任何KV数据,仅以两个紧凑的位向量的形式存储高分数位置。SIFT的存储比KV张量小24000倍,避免了昂贵的磁盘传输。在预填充期间,SIFT仅计算标记位置的注意力,将TTFT提升1.71倍,同时将精度保持在完全重新计算的1%以内。

英文摘要

Retrieval-Augmented Generation (RAG) injects LLM queries with relevant documents to improve response quality. This injection increases prompt length and slows time to first token (TTFT). Unlike standard queries, RAG queries have a unique property of context reuse, where the same documents recur across user queries. Thus, fully recomputing documents for every RAG query performs redundant computation and increases TTFT. Prior works precompute KV tensors of RAG documents offline and coarsely recompute some tokens during online prefill. However, such KV reuse is often slower than full recomputation on modern GPUs due to high-latency disk transfers. Further, such coarse-grained recomputation degrades accuracy. To address these limitations, this paper proposes SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance. SIFT processes documents offline and extracts fine-grained locations of high attention scores for each document. Next, we identify the following attention invariance insights that enable us to exploit the extracted locations during runtime: (1) Local-Attention Invariance: The locations of high attention scores within a document remain invariant to surrounding documents. This helps us predict the location of high scores where the document attends to itself. (2) Cross-Attention Consistency: Keys with high intra-document attention also attract cross-attention from subsequent documents. This helps us predict the location of high scores where the document attends to future documents. Critically, SIFT stores no KV data and only stores locations of high scores in the form of two compact bit vectors. SIFT's storage is up to 24,000x smaller than KV tensors, obviating costly disk transfers. During prefill, SIFT computes the attention only for the marked locations and improves TTFT by 1.71x while holding accuracy within 1% of full recompute.

2606.09508 2026-06-09 cs.AI cs.CL 新提交

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

从刚性到动态:面向长上下文LLM的熵引导自适应推理

Zhanchao Xu, Haoyang Li, Qingfa Xiao, Fei Teng, Chen Jason Zhang, Lei Chen, Qing Li

发表机构 * Department of Computing, PolyU(香港理工大学计算学系) DSA, HKUST(GZ)(香港科技大学(广州)数据科学与分析学域) CSE, HKUST(香港科技大学计算机科学与工程学系)

AI总结 提出EntropyInfer框架,利用注意力熵在预填充阶段自适应分配计算资源,并在解码阶段通过生成令牌压缩KV缓存,实现长上下文LLM的高效推理。

详情
AI中文摘要

现有的用于长上下文LLM推理的稀疏注意力和KV缓存压缩方法通常应用固定的稀疏模式或跨所有注意力头的统一预算,忽略了头和上下文之间注意力行为的显著变化。我们观察到注意力头之间存在两种不同的熵模式:刚性头,其熵在输入段中保持接近零;动态头,其熵显著波动。至关重要的是,这些类型的分布是上下文相关的,无法离线预先确定。因此,我们提出了EntropyInfer,一个无需训练框架,在预填充期间使用注意力熵在单个头和段的粒度上自适应分配计算。对于解码,我们引入了一种潜在KV缓存压缩方案,该方案利用生成的输出令牌(而非仅预填充令牌)来识别和保留最关键的缓存条目。在Llama、Qwen和openPangu模型系列上的大量实验表明,EntropyInfer在包括SnapKV、AdaKV和CritiPrefill在内的基线上持续取得优势,在超过100k令牌的情况下实现了高达2.39倍的端到端加速,同时与全注意力相比质量下降最小。代码已发布在https://github.com/SHA-4096/EntropyInfer。

英文摘要

Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on Llama, Qwen and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39× end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is released at https://github.com/SHA-4096/EntropyInfer.
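The attention entropy that distinguishes Rigid Heads from Dynamic Heads can be sketched as the Shannon entropy of one attention row. This is a minimal illustration: the helper names and the threshold `tol` are assumptions made for the example, and the abstract does not state how EntropyInfer thresholds or aggregates entropy.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over keys."""
    # Zero weights contribute nothing, following the convention 0 * log 0 = 0.
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def is_rigid(segment_entropies, tol=0.05):
    """Hypothetical rule: a head is 'rigid' if its entropy stays below tol in every segment."""
    return all(e < tol for e in segment_entropies)


# A head that always attends to a single key has zero entropy.
peaked = attention_entropy([1.0, 0.0, 0.0, 0.0])
# A head that spreads attention uniformly over four keys has entropy log(4).
uniform = attention_entropy([0.25, 0.25, 0.25, 0.25])
```

A head whose per-segment entropies all stay near `peaked` would count as rigid under this rule, while one that alternates between peaked and uniform rows would be dynamic and, in the paper's framing, a candidate for more prefill compute.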

2606.09585 2026-06-09 cs.AI 新提交

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

光学推理:重新思考图像作为超越文本的表达性推理媒介

Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, Wenjie Li

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出光学推理概念,将图像作为独立推理媒介,通过排版和图形两种变体实现,在语言和多模态任务中匹配或超越文本推理,同时减少推理令牌。

详情
AI中文摘要

思维链(CoT)提升了大型语言模型(LLMs)的性能,并已扩展到多模态大型语言模型(MLLMs)。最近的工作进一步从基于文本的多模态推理转向交错模态推理,其中中间步骤可以同时包含文本理由和视觉证据。在这项工作中,我们提出了一个更大胆、更雄心勃勃的想法:图像能否单独作为语言和多模态任务的推理媒介?为了探索这一点,我们提出了光学推理,它将图像视为独立的推理媒介。我们通过两种变体实例化这一概念:基于排版的光学推理,优化视觉布局以实现紧凑的理由渲染;以及基于图形的光学推理,将文本和图形元素组合成结构化的视觉理由。在数学、科学和交错模态推理基准测试中,光学推理可以匹配甚至超越传统的文本推理,同时在语言任务上平均减少28.57%的推理令牌,在多模态任务上减少16%,实现文本推理1.96倍的令牌效率。这些结果表明,图像可以有效且高效地编码理由,同时为推理提供统一的视觉画布。

英文摘要

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.

2606.07519 2026-06-09 cs.CL cs.AI 交叉投稿

Bidirectional Small-Granularity Search between Code and Text

代码与文本之间的双向小粒度搜索

Marco A. Valenzuela-Escárcega, Enrique Noriega-Atala, Gus Hahn-Powell, Clayton T. Morrison, Mihai Surdeanu

发表机构 * Lex Machina The University of Arizona(亚利桑那大学)

AI总结 提出双向小粒度搜索任务,通过自动生成数据训练模型,实现科学出版物文本与代码片段间的直接链接,支持跨模态检索。

详情
AI中文摘要

我们引入了代码与文本之间双向小粒度搜索的新任务,其中查询是文本或代码的小片段,结果也是相反模态的小片段,即代码或文本。该任务在科学出版物中的文本与相应代码片段之间建立直接链接,以支持更好、更快地理解科学方法。我们为所提出的任务引入了一个大型数据集,其中包括使用GPT-4自动生成的代码文本描述的训练分区,以及三个测试分区:一个域内和两个域外(OOD),包含手动注释的数据以及其他领域的材料。我们还提出了一种模块化方法来解决此任务。我们的方法在四个不同的子任务之间共享一个编码器,这些子任务学习双向答案跨度的开始/结束。我们表明,我们的方法在域内取得了良好结果,在域外也取得了令人鼓舞的结果。这表明使用自动生成的数据解决此任务是可能的,但仍有令人兴奋的未来工作要做。

英文摘要

We introduce the novel task of bidirectional small-granularity search between code and text, where the queries are small snippets of text or code and the results are also small fragments of the opposite modality, i.e., code or text. This task establishes direct links between text in scientific publications and corresponding code segments, in support of better and faster understanding of scientific methods. We introduce a large dataset for the proposed task that includes a training partition with textual descriptions of code generated automatically using GPT-4, and three testing partitions, one in-domain and two out-of-domain (OOD) that contain manually-annotated data as well as material from other domains. We also propose a modular approach to address this task. Our approach shares an encoder across four different subtasks that learn start/end of answer spans in both directions. We show that our method achieves good results in-domain, and encouraging results OOD. This suggests that addressing this task with automatically-generated data is possible, but there is exciting future work to be done.

2606.07523 2026-06-09 cs.CL cs.AI 交叉投稿

Retrieval Augmented Generation Framework for the Nepali Legal Domain Question Answering

面向尼泊尔法律领域问答的检索增强生成框架

Samir Wagle, Abiral Adhikari, Reewaj Khanal, Batsal Bhandari, Prashant Manandhar, Praveen Acharya, Bal Krishna Bal

发表机构 * Dublin City University(都柏林城市大学)

AI总结 提出首个基于检索增强生成的尼泊尔法律问答模型,利用BM25和E5模型检索案例法,实现91%的top-1精度和74%的生成答案可信度。

详情
AI中文摘要

英语等高资源语言的法律领域已广泛采用人工智能进行法律问答。然而,尼泊尔语等低资源语言的数据稀缺限制了大型语言模型在尼泊尔法律文本上的训练。本研究首次应用基于检索增强生成的模型,利用从Nepal Kanun Patrika数字档案中提取的案例法进行尼泊尔法律问答。使用BM25对分块文档进行检索,该方法实现了91%的top-1精度,使用多语言E5大模型时达到75%。对生成答案的评估显示,使用BM25文档检索时,可信度为74%,根据自动评判模型评估的真实性为85%,人工评估的真实性为84%,成功答案生成率为92%。这些结果表明,RAG管道可以有效解决低资源语言法律问答的差距,并为尼泊尔法律领域的可靠AI系统奠定基础。

英文摘要

Legal domains in high-resource languages like English have widely adopted artificial intelligence for legal question answering. However, data scarcity in low-resource languages such as Nepali has limited the training of large language models on Nepali legal texts. This study presents the first application of a Retrieval Augmented Generation based model for Nepali legal question answering using case laws extracted from the Nepal Kanun Patrika digital archive. Using BM25 on chunked documents, the approach achieved a precision at 1 (P@1) of 91 percent, and up to 75 percent with the multilingual E5 large model. Evaluation of generated answers showed 74 percent groundedness, 85 percent truthfulness according to an automated judge model, and 84 percent human-evaluated truthfulness when using BM25 document retrieval, with a 92 percent successful answer generation rate. These results demonstrate that the RAG pipeline can effectively address the gap in legal question answering for low-resource languages and provide a foundation for reliable AI systems in the Nepali legal domain.
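The precision-at-1 retrieval metric reported above can be computed as in the following minimal sketch. The function name, the document identifiers, and the toy data are illustrative assumptions and are not taken from the paper's evaluation code.

```python
def precision_at_1(ranked_ids, relevant_ids):
    """Fraction of queries whose top-ranked document is relevant.

    ranked_ids[i] is the ranked list of document ids returned for query i;
    relevant_ids[i] is the set of ids judged relevant for that query.
    """
    if not ranked_ids or len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one ranked list and one relevance set per query")
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if ranked and ranked[0] in relevant  # an empty result list counts as a miss
    )
    return hits / len(ranked_ids)


# Toy run with three queries: only the first has a relevant document at rank 1.
ranked = [["d1", "d2"], ["d3", "d4"], []]
relevant = [{"d1"}, {"d4"}, {"d9"}]
score = precision_at_1(ranked, relevant)
```

The second query shows why the metric is strict: the relevant document `d4` is retrieved, but at rank 2, so it does not count toward precision at 1.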

2606.07526 2026-06-09 cs.CL cs.AI 交叉投稿

GraphLoRA: Structure-Aware Low-Rank Adaptation for Large Language Model Recommendation

GraphLoRA: 面向大语言模型推荐的结构感知低秩适配

Lin Mu, Guoji Wang, Li Ni, Lei Sang, Zhize Wu, Peiquan Jin, Yiwen Zhang

发表机构 * Anhui University(安徽大学) Hefei University(合肥大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出GraphLoRA框架,通过在低秩适配路径中嵌入可训练的图消息传递网络,实现结构信号传播,从而深度融合图结构与文本语义,提升LLM推荐性能。

Comments ACL 2026 findings

详情
AI中文摘要

大型语言模型(LLM)因其强大的推理和泛化能力,在推荐任务(LLMRec)中展现出巨大潜力。然而,如何有效对齐LLM建模的文本语义与协同信号仍是一个关键挑战。现有方法要么将协同信息转化为文本提示,要么将预训练嵌入注入LLM,两者都将结构信息视为静态输入,无法捕获高阶关系依赖。为弥合这一差距,我们提出GraphLoRA,一种新颖的框架,将低秩适配从独立传播推广到结构感知传播。GraphLoRA在低秩适配路径中嵌入一个可训练的图消息传递网络,使结构信号能够在参数空间中传播。该设计允许协同拓扑显式指导参数更新,促进图结构与文本语义信息的深度融合。在多个基准上的大量实验表明,GraphLoRA不仅优于最先进的基于LLM的推荐方法,而且实现了卓越的泛化能力,有效平衡了结构推理能力与计算效率。代码可在https://github.com/wgj15965/GraphLoRA获取。

英文摘要

Large Language Models (LLMs) have shown strong potential for recommendation (LLMRec) due to their powerful reasoning and generalization abilities. However, effectively aligning the textual semantics modeled by LLMs with the collaborative signals remains a key challenge. Existing methods either translate collaborative information into textual prompts or inject pre-trained embeddings into the LLM, both of which treat structural information as static input and fail to capture high-order relational dependencies. To bridge this gap, we propose GraphLoRA, a novel framework that generalizes low-rank adaptation from independent to structure-aware propagation. GraphLoRA embeds a trainable graph message-passing network within the low-rank adaptation pathway, enabling structural signals to propagate through the parameter space. This design allows collaborative topology to explicitly guide parameter updates, fostering deep integration between graph-structured and textual semantic information. Extensive experiments on multiple benchmarks demonstrate that GraphLoRA not only outperforms state-of-the-art LLM-based recommendation methods but also achieves superior generalization, effectively balancing structural reasoning capability with computational efficiency. Code is available at https://github.com/wgj15965/GraphLoRA.

2606.07529 2026-06-09 cs.CL cs.AI cs.CV cs.LG cs.MM 交叉投稿

CAPruner: Conceptual-Adjacent Scene Graph Pruner for Enhancing 3D Spatial Reasoning of Large Language Models

CAPruner: 概念相邻场景图剪枝器以增强大语言模型的3D空间推理

Shengli Zhou, Xiangchen Wang, Guanhua Chen, Feng Zheng

发表机构 * Southern University of Science and Technology(南方科技大学) SpatialTemporal AI(时空人工智能)

AI总结 提出概念相邻场景图剪枝器(CAPruner),通过融合模糊语义相关性和空间邻近性估计关系重要性,在任务特定上下文中选择关键关系,避免关系级标注,显著提升大语言模型在3D视觉语言任务上的空间推理性能。

Comments Accepted by ACL 2026 Main Conference

详情
AI中文摘要

大型语言模型(LLMs)最近被应用于3D视觉语言(3D-VL)任务,这些任务需要空间推理以识别相对于锚点的目标物体。场景图通常用于表示此类关系,但在完整图上进行推理会导致高昂的令牌成本和计算效率低下,因此需要剪枝。现有的剪枝方法主要依赖空间邻近性,常常移除任务相关的关系,从而削弱可靠的空间推理。为了解决这些局限性,我们推导出场景图剪枝的一个关键要求:保留与特定3D-VL任务最相关的空间关系。在此洞察指导下,我们提出了概念相邻场景图剪枝器(CAPruner)。CAPruner将模糊语义相关性与空间邻近性相结合,以估计关系的重要性,从而能够在任务特定上下文中选择关键关系。此外,为了避免昂贵的关系级标注,CAPruner通过监督每个节点入射边的聚合分数进行训练。大量实验表明,CAPruner有效保留了空间推理所必需的关系,从而显著提升了LLMs在3D-VL任务上的性能。代码可在 https://github.com/fz-zsl/CAPruner 获取。

英文摘要

Large language models (LLMs) have recently been applied to 3D vision-language (3D-VL) tasks, which require spatial reasoning to identify target objects relative to anchors. Scene graphs are commonly employed to represent such relations, but reasoning over complete graphs incurs high token costs and computational inefficiencies, motivating the need for pruning. Existing pruning methods primarily rely on spatial proximity and often remove task-relevant relations, thereby undermining reliable spatial reasoning. To address these limitations, we derive a key requirement for scene graph pruning: preserving spatial relations that are most pertinent to the specific 3D-VL task. Guided by this insight, we propose the Conceptual-Adjacent Scene Graph Pruner (CAPruner). CAPruner integrates fuzzy semantic relevance with spatial proximity to estimate the importance of relations, enabling the selection of critical relations in a task-specific context. Moreover, to avoid costly relation-level annotations, CAPruner is trained by supervising the aggregated scores of each node's incident edges. Extensive experiments demonstrate that CAPruner effectively preserves relations essential for spatial reasoning, leading to substantial performance improvements of LLMs on 3D-VL tasks. Code is available at https://github.com/fz-zsl/CAPruner.

2606.07547 2026-06-09 cs.CL cs.AI cs.SD 交叉投稿

Liberating LLM Capabilities in Full-Duplex Speech Models

在全双工语音模型中释放LLM能力

Luoyuan Zhang, Bokai Xu, Junbo Cui, Weiyue Sun, Yingjing Xu, Hanyu Liu, Yuan Yao

发表机构 * Royal Zhang(皇家张)

AI总结 提出Listen-Write-Speak (LWS)三通道范式,使LLM在共享因果注意力上下文中同时监听、书写可见文本并实时口语回应,无需架构修改,实现全双工交互。

详情
AI中文摘要

基于语音的大型语言模型通常局限于口语回复,这将其面向用户的输出限制在可口头表达的内容上,并抑制了文本原生能力,如代码生成、结构化分析和实时交互中的多步推理,对于需要持久、结构化且可检查的中间输出的任务。现有工作改进了口语推理或全双工轮流发言,但仍将文本视为隐藏的中间状态或从属模态,而非第一类输出通道。我们提出Listen-Write-Speak (LWS),一种文本优先的三通道范式,其中单个自回归LLM持续监听用户音频,写出可见的自由形式文本作为其主要输出,并在共享因果注意力上下文中并行生成实时口语回应。该行为完全通过Token Schema实现,无需架构修改,并通过两阶段数据流水线学习,该流水线合成与揭示的输入时间线一致的每秒认知注释。实验上,LWS在Full-Duplex-Bench上展示了强大的全双工交互,在VoiceBench AlpacaEval上达到4.72,写作-口语一致性达92.6%,并在URO-Bench上持续优于其内部消融版本。这些结果表明,可见书写可以作为语音交互的第一类输出通道,而不会牺牲实时响应性。代码和数据集可在项目页面获取:https://royalzhang.com/project/lws-page/。

英文摘要

Speech-based large language models are typically constrained to spoken replies, which limits their user-facing outputs to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning in realtime interaction, for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.

2606.07585 2026-06-09 cs.CV cs.AI 交叉投稿

Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach

野外环境下的多模态群体情绪识别:迈向隐私安全的非个体化方法

Anderson Augusma

发表机构 * Université Grenoble Alpes(格勒诺布尔-阿尔卑斯大学) Univ. of Glasgow(格拉斯哥大学) Inria(法国国家信息与自动化研究所) Univ. Paris-Saclay(巴黎-萨克雷大学) TU Delft(代尔夫特理工大学)

AI总结 本文提出两种多模态框架(交叉注意力融合+帧注意力池化,以及变分编码器多解码器),利用集体音视频信号进行群体情绪识别,避免使用个体特征,在保护隐私的同时实现鲁棒性能。

Comments Doctoral thesis

详情
AI中文摘要

本论文研究野外环境下的群体情绪识别(GER),重点关注隐私保护。与依赖面部、目光或语音分析等个体层面线索的传统情绪识别方法不同,本工作利用集体音视频信号推断群体层面的情绪,降低个体监控和监视的风险。提出了两个互补框架。第一个是用于音视频融合的交叉注意力多模态架构,结合帧注意力池化(FAP)进行时间聚合。该框架由合成数据增强支持,并通过消融研究验证,在真实世界GER条件下展现出鲁棒性。第二个框架,变分编码器多解码器(VE-MD),学习一个共享潜在空间,用于情绪分类和结构表示预测(包括身体和面部线索)。探索了两种解码策略(基于DETR和基于热图),以分析结构表示在群体和个体设置中的作用。本论文做出三项主要贡献:阐明了多模态和结构线索在群体层面情感计算中的作用;引入了两种用于隐私保护多模态GER的架构;并证明了在不使用个体特征作为输入数据的情况下可以实现有竞争力的性能。

英文摘要

This thesis addresses group emotion recognition (GER) in-the-wild with a focus on privacy preservation. Unlike traditional emotion recognition methods that rely on individual-level cues such as face, gaze, or voice analysis, this work uses collective audio-video signals to infer emotions at the group level, reducing risks of individual monitoring and surveillance. Two complementary frameworks are proposed. The first is a cross-attention multimodal architecture for audio-video fusion, combined with Frames Attention Pooling (FAP) for temporal aggregation. It is supported by synthetic data augmentation and validated through ablation studies, demonstrating robustness in real-world GER conditions. The second framework, Variational Encoder Multi-Decoder (VE-MD), learns a shared latent space for emotion classification and structural representation prediction, including body and face cues. Two decoding strategies, DETR-based and heatmap-based, are explored to analyze the role of structural representations in group and individual settings. The thesis makes three main contributions: it clarifies the role of multimodality and structural cues in group-level affective computing; introduces two architectures for privacy-preserving multimodal GER; and shows that competitive performance can be achieved without using individual features as input data.

2606.07608 2026-06-09 cs.CL cs.AI cs.LG cs.SD 交叉投稿

Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)

针对瑞士德语语音识别的Whisper字幕对齐微调:基准污染、惯例不匹配以及25.6% WER(13.8% cWER)的诚实基线

Felix Akeret

发表机构 * Independent Researcher, Zurich, Switzerland(独立研究员,瑞士苏黎世) ETH Zürich(苏黎世联邦理工学院) University of Bern(伯尔尼大学) FHNW(西北应用科学与艺术大学) CeTIM Leiden/Munich(CeTIM 莱顿/慕尼黑)

AI总结 通过1,367小时广播语音与标准德语字幕的弱监督,系统微调Whisper large-v3用于瑞士德语语音识别,发现公开结果因基准污染被高估,并发布两个诚实评估的模型。

Comments 15 pages, 21 tables. Models available at https://huggingface.co/Flix-AI

详情
AI中文摘要

我们提出了一项系统研究,针对OpenAI的Whisper large-v3进行微调,用于瑞士德语语音识别,使用1,367小时的广播语音与标准德语字幕作为弱监督。通过在NVIDIA DGX Spark(Grace Blackwell,128 GB统一内存,最高1 PFLOP FP4)上进行16次迭代训练,我们比较了LoRA和全微调(1.55B参数模型),研究了幻觉的根本原因,并量化了数据质量、字幕对齐和训练策略的影响。我们的最佳模型在严格不相交数据上的诚实评估中,在All Swiss German Dialects Test Set (ASGDTS)上实现了25.6%的测量WER。通过将真实错误与有效的风格变异(时态、词序、瑞士正字法)分离的协调错误分析,得到内容WER (cWER)为13.8%,仅计算实际识别失败。偏差校正估计将其降至8.5%,表明真实错误率约为测量WER的三分之一。我们证明,已发表的瑞士德语ASR最先进结果(17.1-17.5% WER)因基准污染而被夸大:一个在ASGDTS测试集上自训练的普通Whisper模型(零瑞士德语数据)实现了13.88% WER,超过了所有已发表系统。使用Phi-4-multimodal的实验显示出更强的记忆效应(3.9% WER),揭示该基准主要衡量惯例匹配而非方言理解。我们发布了两个模型,一个LoRA适配器(25.32% WER,13.9% cWER)和一个全微调模型(25.60% WER,13.8% cWER),这是少数公开可用、经过诚实评估的瑞士德语Whisper模型之一,采用Apache 2.0许可,完全可复现,无需机构数据协议。

英文摘要

We present a systematic study of fine-tuning OpenAI's Whisper large-v3 for Swiss German ASR, using 1,367 hours of broadcast speech paired with Standard German subtitles as weak supervision. Through 16 iterative training runs on an NVIDIA DGX Spark (Grace Blackwell, 128 GB unified memory, up to 1 PFLOP FP4), we compare LoRA and full fine-tuning of the 1.55B-parameter model, investigate hallucination root causes, and quantify the effect of data quality, subtitle alignment, and training strategy. Our best model achieves 25.6% measured WER on the All Swiss German Dialects Test Set (ASGDTS) in an honest evaluation on strictly disjoint data. A harmonized error analysis separating genuine errors from valid stylistic variation (tense, word order, Swiss orthography) yields a content WER (cWER) of 13.8%, counting only actual recognition failures. Bias-corrected estimation reduces this to 8.5%, suggesting the true error rate is roughly one third of measured WER. We demonstrate that published state-of-the-art Swiss German ASR results (17.1-17.5% WER) are inflated by benchmark contamination: a vanilla Whisper model self-trained on the ASGDTS test set with zero Swiss German data achieves 13.88% WER, surpassing all published systems. Experiments with Phi-4-multimodal show an even stronger memorization effect (3.9% WER), revealing that the benchmark primarily measures convention matching rather than dialectal comprehension. We release two models, a LoRA adapter (25.32% WER, 13.9% cWER) and a full fine-tuned model (25.60% WER, 13.8% cWER), among the few publicly available, honestly evaluated Whisper models for Swiss German, under Apache 2.0 with full reproducibility, requiring no institutional data agreements.
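For readers less familiar with the headline metric, word error rate (WER) is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition only; the function name `wer` and the example sentences are assumptions for illustration, and it does not reproduce the paper's harmonized cWER analysis or its text normalization.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences, not taken from the paper's test set.
score = wer("das ist ein test", "das war ein test")
```

Because every word mismatch counts as an error, a valid stylistic variant such as a different tense is penalized exactly like a misrecognition, which is the gap the abstract's cWER is meant to separate out.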

2606.07610 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

LEAF: 无需分支的树生长方法用于语音感知大语言模型后训练

Argyrios Gerogiannis, Yekaterina Yegorova, Mark Hasegawa-Johnson, Venugopal V. Veeravalli

发表机构 * University of Illinois, Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 针对语音感知大语言模型后训练中GRPO方法粗粒度信用分配问题,提出LEAF方法,通过回溯式树结构学习、高惊异度边界选择和跨度级优势分配,在语音问答和翻译任务上超越GRPO。

Comments 15 pages, 3 figures, 11 tables

详情
AI中文摘要

最先进的GRPO风格方法在语音感知大语言模型后训练中存在粗粒度信用分配问题,将相同的终端奖励优势广播给响应中的每个token。这忽略了rollout批次中的有用结构,其中语音条件下的补全通常共享前缀,然后在重要决策处出现分歧。我们提出自适应分叉低秩探索(LEAF),一种基于回溯树的强化学习方法,无需在线分支或额外解码即可恢复这种结构。LEAF采样完整响应,选择高惊异度(surprisal)边界,按共享前缀分组响应,并使用后代奖励分配跨度级优势。我们从理论上论证了LEAF的跨度级信用分配和边界选择设计的合理性。实验上,在相同的rollout和低秩适应预算下,LEAF在语音问答和语音翻译基准上优于GRPO。值得注意的是,较小的LEAF训练模型优于当前最先进的全参数基线。

英文摘要

State-of-the-art GRPO-style methods for speech-aware large language model post-training suffer from coarse credit assignment, broadcasting the same terminal-reward advantage to every token in a response. This ignores useful structure within rollout batches, where speech-conditioned completions often share prefixes before diverging at important decisions. We propose Low-rank Exploration with Adaptive Forking (LEAF), a retrospective tree-based RL method that recovers this structure without online branching or additional decoding. LEAF samples complete responses, selects high-surprisal boundaries, groups responses by shared prefixes, and assigns span-level advantages using descendant rewards. We theoretically justify LEAF's span-level credit assignment and boundary-selection design. Empirically, LEAF improves over GRPO across speech question answering and speech translation benchmarks under the same rollout and low-rank adaptation budget. Notably, smaller LEAF-trained models outperform current state-of-the-art, full-parameter baselines.
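The idea of grouping completed rollouts by shared prefix and crediting each span from descendant rewards can be sketched as follows. This is only one plausible reading of the abstract, not the authors' estimator: the function `span_advantages`, the rule "value of a prefix = mean reward of the rollouts that share it", and the hand-picked boundary positions are all assumptions, and the high-surprisal boundary selection step is not implemented.

```python
from collections import defaultdict


def span_advantages(responses, rewards, boundaries):
    """Assign one advantage per span of each response.

    responses:  list of token lists (already fully sampled, no online branching)
    rewards:    one terminal reward per response
    boundaries: token positions where spans are cut (assumed given here)
    """
    cuts = [0] + sorted(b for b in set(boundaries) if b > 0)
    # Accumulate (reward sum, count) for every prefix seen at every cut.
    totals = defaultdict(lambda: [0.0, 0])
    for tokens, reward in zip(responses, rewards):
        for c in cuts:
            entry = totals[(c, tuple(tokens[:c]))]
            entry[0] += reward
            entry[1] += 1

    def value(cut, tokens):
        total, count = totals[(cut, tuple(tokens[:cut]))]
        return total / count

    result = []
    for tokens, reward in zip(responses, rewards):
        spans = []
        for start, end in zip(cuts, cuts[1:] + [None]):
            v_start = value(start, tokens)
            # The last span ends at the response itself, whose value is its reward.
            v_end = reward if end is None else value(end, tokens)
            spans.append((start, end, v_end - v_start))
        result.append(spans)
    return result


# Toy rollouts: two share the prefix ("a", "b"), one diverges early.
responses = [["a", "b", "c"], ["a", "b", "d"], ["a", "x", "y"]]
rewards = [1.0, 0.0, 0.0]
advantages = span_advantages(responses, rewards, boundaries=[2])
```

In this sketch the span advantages of a response telescope to its reward minus the batch mean reward, so the total credit matches an unnormalized GRPO-style advantage while being distributed unevenly across spans.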

2606.07638 2026-06-09 cs.CV cs.AI 交叉投稿

Anchor-Conditioned Compositional Control for Landscape Image Generation

基于锚点条件的景观图像生成组合控制

Gadha Lekshmi P, Govind Arun, Rohith Syam, Ahmed Elgammal

发表机构 * Rutgers University–New Brunswick(罗格斯大学新布朗斯维克分校) University of Maryland–College Park(马里兰大学帕克分校) University of Technology Sydney(悉尼科技大学)

AI总结 提出锚点条件微调框架,通过解耦交叉注意力机制注入四维组合锚点向量,实现景观图像生成中的组合控制,在水平线检测和三分法对齐上取得最优性能。

Comments Accepted to the International Conference on Computational Creativity, ICCC 2026

详情
AI中文摘要

图像生成模型虽然被广泛用作创意工具,但对摄影师和视觉艺术家常规执行的组合控制类型支持有限。本文提出了一个用于景观图像生成的锚点条件微调框架的早期结果,其中从训练图像中提取四维组合锚点向量,并通过带有傅里叶编码和三路分类器自由引导丢弃的解耦交叉注意力机制注入扩散模型。与基线和三个消融变体的定量评估表明,所提出的架构实现了最高的水平线检测率0.850和最高的三分法对齐度0.817。类别特定的消融进一步表明,在组合同质场景子集上训练相比混合训练可将水平线偏差降低多达40%。这确立了组合控制精度是类别依赖的。

英文摘要

Image generative models, though widely used as creative tools, offer limited support for the kind of compositional control that photographers and visual artists routinely exercise. This paper presents early results on an anchor-conditioned fine-tuning framework for landscape image generation, in which a four-dimensional compositional anchor vector is extracted from training images and injected into a diffusion model via a decoupled cross-attention mechanism with Fourier encoding and three-way classifier-free guidance dropout. Quantitative evaluation against a baseline and three ablation variants shows that the proposed architecture achieves the highest horizon detection rate (0.850) and the highest rule-of-thirds alignment (0.817). A category-specific ablation further demonstrates that training on compositionally homogeneous scene subsets reduces horizon deviation by up to 40% compared to mixed training. This establishes that compositional control precision is category-dependent.
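As a rough illustration of what Fourier-encoding a low-dimensional anchor vector can look like, the sketch below applies the common sine/cosine encoding at power-of-two frequencies. The frequency schedule, the `num_freqs` value, and the placeholder anchor components are assumptions; the abstract does not specify them.

```python
import math


def fourier_encode(anchor, num_freqs=4):
    """Map each coordinate x to [sin(2^k * pi * x), cos(2^k * pi * x)] for k < num_freqs."""
    features = []
    for x in anchor:
        for k in range(num_freqs):
            angle = (2 ** k) * math.pi * x
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


# Hypothetical four-dimensional anchor with values normalized to [0, 1];
# the meaning of each component is a placeholder, not taken from the paper.
anchor = [0.33, 0.5, 0.0, 1.0]
encoded = fourier_encode(anchor)  # 4 coordinates * 4 frequencies * 2 = 32 features
```

Lifting four scalars to a higher-dimensional periodic feature vector gives a cross-attention layer more to attend to than the raw values would, which is the usual motivation for this kind of encoding.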

2606.07639 2026-06-09 cs.CV cs.AI 交叉投稿

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

MOSS-Video-Preview: 通过交叉注意力实现实时视频理解

Pengyu Wang, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng, Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Pengfei Wang, Hongkai Wang, Shanqing Gao, Yixian Tian, Chenghao Liu, Xinghao Wang, Botian Jiang, Xipeng Qiu

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出双通道交叉注意力架构MOSS-Video-Preview,通过非阻塞感知与生成实现实时视频理解,在单H200上实现5倍首词加速和2.7倍解码吞吐提升。

详情
AI中文摘要

视频理解正从离线范式——将完整录制的视频作为输入并在结束后产生单一答案——转向实时交互,其中模型在回复的同时感知新帧,随着新证据的出现修正答案,并在无话可说时保持沉默。我们提出MOSS-Video-Preview来验证这一范式。我们的核心主张是感知不能被生成阻塞;其自然实现是双通道架构。我们认为,交叉注意力主干比流行的仅解码器设计更适合实时视觉-语言融合:视觉特征通过侧通道进入,而不是加入自回归序列,因此感知和生成在独立的、非阻塞的路径上运行——降低了视觉处理的频率,并为独立压缩提供了清晰的通道级接口。我们辅以数据合成流水线,将密集字幕转换为实时理解问答,其答案被修正以匹配模型迄今为止感知到的内容,并在此数据上专门训练离线模型以引发实时行为。我们的模型总体上落后于强大的Qwen2.5-VL-7B基线——这一差距我们主要归因于数据和规模而非架构——但在离线视频和多模态理解上具有竞争力,在实时应用核心的空间和细粒度时间推理上保持稳健,并获得了离线模型缺乏的行为:持续感知、答案修正和及时沉默。在单个H200上,每视频256帧,它实现了约5倍的首词时间加速和2.7倍的解码吞吐提升,离线能力几乎没有下降。我们对范式、架构和数据的研究勾勒出通往实时视频理解的可行路径。

英文摘要

Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.

2606.07924 2026-06-09 cs.CV cs.AI cs.CL cs.LG cs.MM 交叉投稿

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

解耦语义与逻辑:一种无需训练的从粗到精的视频检索增强生成流水线

Jiaxin Dai, Zehang Wei, Jiamin Yan, Xiang Xiang

发表机构 * School of Computer Science & Tech, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院) School of AI and Automation, Huazhong University of Science and Technology(华中科技大学人工智能与自动化学院)

AI总结 提出一种无需训练的两阶段级联视频RAG流水线,通过解耦语义检索与逻辑推理,实现跨语言长视频理解、严格角色遵循和零幻觉时间定位。

Comments To be presented at ACL 2026 MAGMaR Workshop (Oral; Retrieval leaderboard No.1)

详情
AI中文摘要

本文介绍了我们为第二届多模态增强生成研讨会(MAGMaR)提交的系统描述。针对跨语言长视频理解、严格角色遵循和零幻觉时间定位等关键挑战,我们提出了一种完全无需训练的两阶段级联视频RAG流水线。我们的架构通过模态感知的任务分工,策略性地将语义检索与认知逻辑推理解耦。在第一阶段,一个高召回率的语义预取模块仅使用高保真视觉摘要和全局文本描述进行密集检索,明确隔离噪声模态(如OCR和ASR)以保持纯净的向量空间。在第二阶段,一个由商业大语言模型(LLM)驱动的自适应、迭代和推理(A.I.R.)过滤代理执行细粒度认知重排序。该代理重新整合完整的多模态上下文,以强制执行与用户角色的严格逻辑对齐,有效剪除语义相似但逻辑无关的候选。最后,提示雕刻机制约束生成器将蒸馏后的子集合成为严格格式化的JSON响应,并带有精确的块级引用。在RAG轨道上的评估表明,我们的资源感知方法在信息检索和角色条件生成方面均表现出卓越的精度。

英文摘要

This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding, we propose a fully training-free, two-stage cascaded Video RAG pipeline. Our architecture strategically decouples semantic retrieval from cognitive logical reasoning through a modality-aware division of labor. In the first stage, a high-recall semantic pre-fetching module employs dense retrieval using only high-fidelity visual summaries and global text descriptions, explicitly isolating noisy modalities (e.g., OCR and ASR) to maintain a pristine vector space. In the second stage, an Adaptive, Iterative, and Reasoning-based (A.I.R.) filtering agent, powered by a commercial Large Language Model (LLM), performs fine-grained cognitive reranking. The agent re-incorporates full multimodal contexts to enforce strict logical alignment with user personas, effectively pruning semantically similar but logically irrelevant candidates. Finally, a Prompt Sculpting mechanism constrains the generator to synthesize the distilled subset into strictly formatted JSON responses with exact chunk-level citations. Evaluated on the RAG track, our resource-aware approach shows exceptional precision in both information retrieval and persona-conditioned generation.
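The two-stage cascade described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the submitted system: `coarse_to_fine`, the chunk fields `summary_vec` and `asr`, and the keyword-based `judge` (standing in for the LLM filtering agent) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coarse_to_fine(query_vec, chunks, judge, k_coarse=3):
    # Stage 1: high-recall dense retrieval over summary embeddings only;
    # noisy fields such as OCR/ASR text are deliberately not embedded.
    shortlist = sorted(
        chunks,
        key=lambda c: cosine(query_vec, c["summary_vec"]),
        reverse=True,
    )[:k_coarse]
    # Stage 2: the judge sees the full chunk, including the noisy fields,
    # and prunes candidates that are similar but not actually relevant.
    return [c["id"] for c in shortlist if judge(c)]


chunks = [
    {"id": "v1#0", "summary_vec": [1.0, 0.0], "asr": "the mayor announces the budget"},
    {"id": "v2#3", "summary_vec": [0.9, 0.1], "asr": "cooking show intro"},
    {"id": "v3#1", "summary_vec": [0.0, 1.0], "asr": "budget debate continues"},
]


def judge(chunk):
    # Stand-in for the LLM reranking agent: a simple keyword check on the ASR text.
    return "budget" in chunk["asr"]


result = coarse_to_fine([1.0, 0.0], chunks, judge, k_coarse=2)
```

In the toy data, chunk `v3#1` would pass the judge but never reaches it, which shows why the first stage has to be tuned for recall: the second stage can only prune, not recover.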

2606.07951 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

From 'May' to 'Is': Certainty Distortion in Language Model Rewriting

从“可能”到“是”:语言模型改写中的确定性扭曲

Catarina G Belem, Shang Wu, Hongyu Yao, Mark Steyvers, Sameer Singh, Padhraic Smyth

发表机构 * University of California Irvine(加利福尼亚大学尔湾分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 研究语言模型在改写任务中系统性增加表达确定性的偏差,提出基于人群判断的评估指标,发现高达75%的输出存在确定性扭曲,且模型更倾向于提高确定性。

详情
AI中文摘要

人类越来越多地以塑造信念和驱动决策的方式使用语言模型(LM),包括讨论、改写和总结来自科学文章、新闻和医学报告的信息。然而,在这些领域中,主张表达的信心程度至关重要,但关于LM是否忠实地保留它却知之甚少。在这项工作中,我们研究了LM中的确定性扭曲,定义为当语义内容被保留时,表达确定性的有意义变化。我们提出了一种基于LM的评估指标,该指标与人群层面的确定性判断一致。使用该指标,我们在科学和医学交流任务的背景下,表征了不同规模和系列的模型中的确定性扭曲。我们的结果表明,确定性扭曲影响了高达75%的LM输出,并且在改写任务中系统性地不对称,大多数LM将表达确定性增加的可能性是降低的1.5-2倍。这些效应可以通过重复释义累积:在医学领域,claude-haiku-4-5在一次迭代后增加了20%示例的确定性,五次迭代后增加到40%。基于提示的干预减少了整体确定性扭曲,但并未消除它。总之,这些发现揭示了普遍存在的夸大表达确定性的偏差,对在高风险领域依赖LM的用户有直接影响。

英文摘要

Humans increasingly turn to Language Models (LMs) in ways that shape beliefs and drive decisions, including discussing, rewriting, and summarizing information from scientific articles, news, and medical reports. However, in these domains, where how confidently a claim is expressed matters, little is known about whether LMs faithfully preserve it. In this work, we investigate certainty distortion in LMs, defined as meaningful changes in expressed certainty when semantic content is preserved. We propose an LM-based evaluation metric that is consistent with population-level judgments of certainty. Using this metric, we characterize certainty distortion across different sizes and families of models in the context of scientific and medical communication tasks. Our results show that certainty distortion affects up to 75% of LM outputs and is systematically asymmetric in rewriting tasks, with most LMs being 1.5-2× more likely to increase the expressed certainty than to decrease it. These effects can compound over repeated paraphrasing: in the medical domain, claude-haiku-4-5 increases the certainty of 20% of examples after a single iteration, rising to 40% after five iterations. Prompt-based interventions reduce overall certainty distortion but do not eliminate it. Together, these findings reveal a general bias toward inflating expressed certainty, with direct implications for users who rely on LMs in high-stakes domains.
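The distortion rate and the increase-to-decrease asymmetry reported above are simple to compute once each text has a certainty score. The sketch below assumes such scores already exist on a 0 to 1 scale and uses a fixed threshold for what counts as a meaningful change; the function name, the scale, and the threshold are illustrative assumptions, not the paper's metric.

```python
def distortion_summary(before, after, threshold=0.1):
    """Summarize certainty shifts between source texts and their rewrites.

    before, after: per-example certainty scores (assumed to lie in [0, 1])
    threshold:     minimum absolute shift treated as a meaningful change
    """
    if not before or len(before) != len(after):
        raise ValueError("need two non-empty score lists of equal length")
    up = sum(1 for b, a in zip(before, after) if a - b > threshold)
    down = sum(1 for b, a in zip(before, after) if b - a > threshold)
    n = len(before)
    return {
        "distortion_rate": (up + down) / n,
        "increase_rate": up / n,
        "decrease_rate": down / n,
        # Ratio of upward to downward shifts; infinite if nothing decreased.
        "asymmetry": up / down if down else float("inf"),
    }


# Made-up scores for four source/rewrite pairs.
summary = distortion_summary([0.5, 0.5, 0.5, 0.5], [0.9, 0.8, 0.2, 0.5])
```

An asymmetry of 2.0 in this toy case would correspond to the upper end of the 1.5-2× range quoted in the abstract.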

2606.08016 2026-06-09 cs.CV cs.AI cs.CL 交叉投稿

IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment

IEA:通过三阶段多任务对齐的业余友好型对话式图像编辑代理

Zichen Zhu, Yuheng Sun, Mingxuan Zhu, Wenjie Ma, Situo Zhang, Zhexiang Wang, Ziyue Yang, Danyang Zhang, Kunyao Lan, Zihan Zhao, Dingye Liu, Siqi Xiang, Lu Chen, Kai Yu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institution(上海创新研究院) Huawei Technologies Ltd.(华为技术有限公司) Nanyang Technological University(南洋理工大学) Jiangsu Key Lab of Language Computing(江苏省语言计算重点实验室)

AI总结 提出IEA对话式图像编辑代理,通过三阶段多任务训练学习操作参数化工具,实现可解释编辑轨迹,在像素距离和ROUGE-L指标上优于基线,用户研究中指令跟随和感知质量表现最佳。

Comments [CVPR 2026 Findings] Our data and code are released at https://github.com/OpenDFM/Image_Edit_Agent

详情
AI中文摘要

当前的图像编辑软件通常依赖于固定滤镜或专家调参,导致业余用户的意图与结果之间存在差距。生成模型创建的图像可能包含伪影、不合理的细节或偏离真实感的风格漂移,并且对编辑原因缺乏解释。我们提出IEA,一个对话式图像编辑代理,它学习在显式、可解释的动作空间中操作参数化工具。IEA通过三阶段多任务流水线进行训练:(1) 在蒸馏专家编辑上进行SFT,(2) 使用GRPO进行奖励优化,奖励包括相似度改进、工具有用性和意图总结,(3) 大规模合成微调以联合掌握图像编辑、细化和用户意图总结。通过逐步操作16个编辑工具,IEA产生透明的编辑轨迹,可以检查和调试。在定量实验中,它在编辑任务上获得更低的像素距离,在总结任务上获得比强基线更高的ROUGE-L。在用户研究中,它在指令跟随方面在工具调用方法中排名最佳,同时在整体感知质量上超越生成方法。我们的结果验证了可解释的、以工具为中心的VLM作为人类指令引导图像润色的可靠路径。

英文摘要

Current image editing software often hinges on fixed filters or expert tuning, leaving a gap between amateur users' intent and outcomes. Creations by generative models may contain artifacts, implausible details, or stylistic drift away from photorealism and offer little insight into why an edit was made. We propose IEA, a conversational Image Editing Agent that learns to operate parameterized tools in an explicit, interpretable action space. IEA is trained via a three-stage multitask pipeline: (1) SFT on distilled expert edits, (2) GRPO with rewards for likeness improvement, tool usefulness, and intent summarization, and (3) large-scale synthetic fine-tuning to jointly master image editing, refinement, and user intent summarization. By manipulating 16 editing tools step by step, IEA produces transparent edit traces that can be inspected and debugged. In quantitative experiments, it attains a lower pixel distance on the edit task and a higher ROUGE-L on the summary task than strong baselines. In user studies, it ranks best among tool-calling methods for instruction following while surpassing generative methods in overall perceptual quality. Our results validate interpretable, tool-centric VLMs as a reliable path to human instruction-guided image retouching.

2606.08056 2026-06-09 cs.CL cs.AI 交叉投稿

What's the Point? Spatial Grammar & Index Resolution for Sign Language Processing

要点何在?手语处理中的空间语法与索引解析

Oline Ranum, Simon Hadfield, Richard Bowden

发表机构 * Centre for Vision, Speech and Signal Processing, University of Surrey(萨里大学视觉、语音与信号处理中心)

AI总结 针对手语中占10-15%但被忽视的空间索引现象,提出索引检测与话语实体链接的分解框架,建立索引感知手语建模基线,并作为辅助专家提升冻结手语识别模型性能。

详情
AI中文摘要

手语模型主要使用词汇序列或文本监督进行训练,因此对非词汇和构式性结构的建模不足。一个相对易处理的情况是空间索引:将话语实体分配给空间位置以供后续共指的指向手势,而以词汇为中心的目标在很大程度上未能捕捉到这一点。我们对手语识别中的索引进行了有针对性的评估,显示尽管索引占手语内容的10-15%,但其恢复效果很差。我们引入了一个用于训练和评估索引专家的框架,为索引感知手语建模建立了基线。我们的方法将空间指代解析分解为索引检测和话语实体链接。由此产生的提及表示支持自动标注和非词汇结构建模,并在推理时作为辅助索引专家增强冻结的SLR模型。

英文摘要

Sign language models are predominantly trained with gloss-sequence or text supervision, thereby under-modeling non-lexical and productive constructions. One comparatively tractable instance is spatial indexing: pointing gestures that assign discourse entities to spatial loci for subsequent co-reference, which lexicon-centric objectives largely fail to capture. We present a targeted evaluation of indexing in Sign Language Recognition, showing that despite comprising 10-15% of signing content, indexing is poorly recovered. We introduce a framework for training and evaluating indexing experts, establishing a baseline for index-aware sign language modeling. Our approach decomposes spatial reference resolution into index detection and discourse entity linking. The resulting mention representations enable automatic annotation and non-lexical structure modeling, and serve as an auxiliary indexing expert that augments a frozen SLR model at inference time.

2606.08063 2026-06-09 cs.CV cs.AI cs.CL 交叉投稿

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Robust-U1: MLLMs能否自我恢复受损视觉内容以实现鲁棒理解?

Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出Robust-U1框架,通过监督微调、强化学习和多模态推理,使多模态大模型具备显式视觉自恢复能力,在真实和对抗性损坏下达到最先进鲁棒性。

Comments Accepted by ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉理解方面取得了显著成功,但在真实世界的视觉损坏下其性能会大幅下降。尽管存在现有的鲁棒性增强方法,但它们存在局限性:黑盒特征对齐缺乏可解释性,而白盒基于文本的推理无法恢复丢失的像素级细节。本文研究一个基本研究问题:MLLMs能否自行恢复受损的视觉内容?为此,我们提出Robust-U1,一种新颖框架,赋予MLLMs显式的视觉自恢复能力以实现鲁棒理解。该方法包含三个核心阶段:用于初始重建的监督微调、具有双重奖励(像素级SSIM和语义级CLIP相似度)的强化学习以对齐高视觉质量,以及联合考虑受损输入和恢复图像的多模态推理。大量实验表明,Robust-U1在真实世界损坏基准上达到了最先进的鲁棒性,并在一般VQA基准上的对抗性损坏下保持了优越性能。分析证实,高质量的视觉恢复直接提升了推理性能,将自恢复确立为鲁棒视觉理解的关键机制。源代码可在 https://github.com/jqtangust/Robust-U1 获取。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness-enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
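A dual reward of the kind mentioned above, one pixel-level term and one semantic term, could be combined as in the sketch below. Several things here are assumptions rather than the paper's design: the weighted sum and its weight `w_pixel`, the single-window (global) SSIM computed over flattened grayscale pixels instead of the usual locally windowed version, and the use of plain embedding vectors in place of real CLIP features.

```python
import math


def global_ssim(x, y, dynamic_range=1.0):
    """Single-window SSIM over two equal-length flattened grayscale images."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("images must be non-empty and the same size")
    c1 = (0.01 * dynamic_range) ** 2  # standard stabilizing constants
    c2 = (0.03 * dynamic_range) ** 2
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dual_reward(restored, clean, emb_restored, emb_clean, w_pixel=0.5):
    """Weighted sum of a pixel-level and a semantic-level similarity (assumed form)."""
    pixel_term = global_ssim(restored, clean)
    semantic_term = cosine(emb_restored, emb_clean)
    return w_pixel * pixel_term + (1.0 - w_pixel) * semantic_term


clean = [0.0, 1.0, 0.0, 1.0]
perfect = dual_reward(clean, clean, [0.2, 0.8], [0.2, 0.8])
```

Pairing the two terms guards against a restoration that matches pixel statistics while changing what the image depicts, or the reverse.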

2606.08076 2026-06-09 cs.CL cs.AI cs.CY 交叉投稿

"I understand your perspective": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory

“我理解你的观点”:通过交往行动理论视角看LLM的说服与谄媚

Esra Dönmez, Agnieszka Falenska

发表机构 * Institute for Natural Language Processing, University of Stuttgart(斯图加特大学自然语言处理研究所) Interchange Forum for Reflecting on Intelligent Systems, University of Stuttgart(斯图加特大学智能系统反思交流论坛)

AI总结 本研究基于哈贝马斯的交往行动理论,通过模拟Reddit讨论,发现LLM能有效传达言外之意(如建立信任),其谄媚策略与观点改变强相关,且人类更偏好LLM生成的论证。

详情
Journal ref
Findings of the Association for Computational Linguistics: ACL 2025
AI中文摘要

大型语言模型(LLM)能够生成高质量的论证,但它们在参与细致入微且有说服力的交往行动方面的能力仍基本未被探索。本研究通过尤尔根·哈贝马斯的交往行动理论框架探索LLM的说服潜力。它考察LLM是否以与人类交流可比的方式表达言外之意(即语言的语用功能,如传达知识、建立信任或表明相似性)。我们使用来自说服性子论坛ChangeMyView的对话,模拟意见持有者与LLM之间的在线讨论。然后,我们比较人类撰写和LLM生成的反驳论证中言外之意的可能性,特别是那些成功改变了原帖作者观点的论证。我们发现,所有三个LLM都能有效传达言外之意——通常比人类更甚——可能增加其拟人化程度。此外,LLM精心制作谄媚回应,与意见持有者的意图紧密对齐,这种策略与观点改变强相关。最后,众包工作者认为LLM生成的反驳论证更易令人认同,并且一致偏好它们胜过人类撰写的论证。这些发现表明,LLM的说服力不仅仅在于生成高质量论证。更确切地说,用人类偏好训练LLM有效地调整它们以模仿人类交流模式,特别是细微的交往行动,可能增加个体对其影响的易感性。

英文摘要

Large Language Models (LLMs) can generate high-quality arguments, yet their ability to engage in nuanced and persuasive communicative actions remains largely unexplored. This work explores the persuasive potential of LLMs through the framework of Jürgen Habermas' Theory of Communicative Action. It examines whether LLMs express illocutionary intent (i.e., pragmatic functions of language such as conveying knowledge, building trust, or signaling similarity) in ways that are comparable to human communication. We simulate online discussions between opinion holders and LLMs using conversations from the persuasive subreddit ChangeMyView. We then compare the likelihood of illocutionary intents in human-written and LLM-generated counter-arguments, specifically those that successfully changed the original poster's view. We find that all three LLMs effectively convey illocutionary intent -- often more so than humans -- potentially increasing their anthropomorphism. Further, LLMs craft sycophantic responses that closely align with the opinion holder's intent, a strategy strongly associated with opinion change. Finally, crowd-sourced workers find LLM-generated counter-arguments more agreeable and consistently prefer them over human-written ones. These findings suggest that LLMs' persuasive power extends beyond merely generating high-quality arguments. On the contrary, training LLMs with human preferences effectively tunes them to mirror human communication patterns, particularly nuanced communicative actions, potentially increasing individuals' susceptibility to their influence.

2606.08081 2026-06-09 cs.CL cs.AI 交叉投稿

Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

对齐但非伙伴特定:区分多模态LLM智能体在参考游戏中如何成功而无需类人惯例

Po-Ya Angela Wang, Chinmaya Mishra, Aslı Özyürek, Paula Rubio-Fernández, Esam Ghaleb

发表机构 * National Taiwan University(国立台湾大学) Max Planck Institute for Psycholinguistics(马克斯·普朗克心理语言学研究所) Radboud University(拉德堡德大学) Institut Jean Nicod(让·尼科研究所)

AI总结 通过约束伪对基线方法,区分多模态LLM智能体在参考游戏中的标签对齐是源于伙伴特定交互还是共享任务词汇,发现智能体通过冗长描述而非压缩表达实现协调。

详情
AI中文摘要

重复参考游戏测试对话者是否用基于共享交互历史的更短、伙伴特定的惯例替换其初始长描述。先前工作表明,多模态LLM在轮次中未能变得更高效,尽管它们在使用的标签上对齐。我们如何确定这种对齐反映了伙伴特定的基础而非共享任务词汇?我们通过将有能力的多模态智能体对与来自KTH Tangrams语料库的人类对进行比较来解决这个问题。我们的新颖方法论贡献是一个受约束的伪对基线,它匹配原始指称任务结构,但打破了伙伴历史。该基线使我们能够测试观察到的标签对齐是否依赖于与特定伙伴的交互。在三个分析层面(任务能力、描述策略、对齐动态)上,我们发现了明显差异。人类通过适应减少努力,压缩描述并增加与伙伴的标签对齐。智能体反而保持固定的努力水平,从第一轮开始产生冗长的描述,标签重叠接近上限,在真实对和伪对之间统计上无法区分。因此,多模态LLM在没有惯例的情况下实现了协调,通过冗长描述而非形成人类对话特征的紧凑、依赖历史的指称表达来取得成功。

英文摘要

Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align on the labels they use. How can we determine whether this alignment reflects partner-specific grounding rather than a shared task vocabulary? We address this question by comparing capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Our novel methodological contribution is a constrained pseudo-dyad baseline that matches the original referential task structure, but breaks partner history. This baseline enables us to test whether the observed label alignment depends on interaction with a specific partner. Across three analytic layers (task competence, description strategy, alignment dynamics), we find clear differences. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one, with near-ceiling label overlap that is statistically indistinguishable between real and pseudo dyads. MLLMs thus achieve coordination without convention, succeeding by verbose description rather than by forming the compact, history-dependent referring expressions characteristic of human dialogue.

2606.08158 2026-06-09 cs.CL cs.AI 交叉投稿

Constrained Paraphrase Consistency for LLM Hallucination Detection

约束释义一致性用于大语言模型幻觉检测

Shanshan Lin, Dongsheng Hong, Sibo Ju, Chao Chen, Xi Zhang, Xiangwen Liao

AI总结 提出约束一致性幻觉检测器(CCHD),通过约束优化利用释义一致性,无需额外数据,在多个基准上超越现有方法。

Comments Accepted to ICASSP 2026

详情
AI中文摘要

大型语言模型(LLM)可能生成事实不一致的声明,这促使需要准确且可扩展的幻觉检测器。先前的工作主要通过合成或新标注来扩大训练集,这增加了成本和潜在偏差,同时未充分利用语义等价释义所隐含的一致性。我们提出约束一致性幻觉检测器(CCHD),将训练形式化为约束优化问题。在原始文档-声明对上的标准交叉熵基础上,补充了(i)释义一致性约束,限制不同释义视图之间的差异,以及(ii)标签保持约束,将释义与真实标签绑定。我们通过模型参数和每个视图的拉格朗日乘子的梯度下降-上升法求解该问题,仅增加少量标量对偶变量,且无推理时开销。使用DeBERTa和Flan-T5骨干网络,CCHD在标准事实性基准上持续优于强基线(FactCG、MiniCheck和AlignScore),展示了其在幻觉检测上的优越性。

英文摘要

Large language models (LLMs) can generate factually inconsistent claims, motivating accurate and scalable hallucination detectors. Prior work largely enlarges training sets via synthesis or new annotations, introducing increasing cost and potential bias while underusing the consistency implied by semantically equivalent paraphrases. We propose Consistency-Constrained Hallucination Detector (CCHD), which formulates training as a constrained optimization problem. The standard cross-entropy on original document-claim pairs is complemented by (i) paraphrase-consistency constraints bounding divergence across paraphrased views, and (ii) label-preservation constraints tying paraphrases to ground truth. We solve the problem by gradient descent-ascent over model parameters and per-view Lagrange multipliers, adding only a few scalar dual variables and no inference-time overhead. With DeBERTa and Flan-T5 backbones, CCHD consistently outperforms strong baselines (FactCG, MiniCheck, and AlignScore) on standard factuality benchmarks, demonstrating its superiority on hallucination detection.
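
To make the gradient descent-ascent idea concrete, here is a toy scalar sketch. It is not the CCHD objective: the loss f(theta) = (theta - 2)^2, the single constraint g(theta) = theta - 1 <= 0, the step sizes, and the function name are all assumptions chosen so the saddle point (theta = 1, lambda = 2) can be checked by hand.

```python
def descent_ascent(steps=2000, lr_theta=0.05, lr_lambda=0.05):
    """Solve min_theta (theta - 2)^2 subject to theta - 1 <= 0.

    Lagrangian: L(theta, lam) = (theta - 2)^2 + lam * (theta - 1).
    theta takes gradient descent steps on L; lam takes gradient ascent
    steps and is projected back onto lam >= 0.
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        grad_theta = 2.0 * (theta - 2.0) + lam  # dL/dtheta
        constraint = theta - 1.0                # dL/dlam = g(theta)
        theta -= lr_theta * grad_theta          # primal descent
        lam = max(0.0, lam + lr_lambda * constraint)  # dual ascent + projection
    return theta, lam


theta_star, lam_star = descent_ascent()
```

In a detector of the kind the abstract describes, theta would be the model parameters and there would be one multiplier per constraint, but the multipliers are only needed during training, which is consistent with the claim of no inference-time overhead.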

2606.08408 2026-06-09 cs.CL cs.AI 交叉投稿

TimpaTeks: Automatic In-place Text Sequence Modification via Diffusion Language Model Steering

TimpaTeks: 通过扩散语言模型引导实现自动原地文本序列修改

Ryandito Diandaru, Ikhlasul Akmal Hanif, Fadli Aulawi Al Ghiffari, Ahmed Elshabrawy, Alham Fikri Aji

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出TimpaTeks方法,将激活引导扩展到扩散语言模型,实现原地文本修改以改变概念,在情感和概念引导任务上降低困惑度并保持句子结构。

Comments 16 pages

详情
AI中文摘要

我们将激活引导扩展到扩散语言模型(DLM),并研究了一个由于DLM推理机制而产生的新问题:原地修改文本以呈现不同的概念。我们提出了TimpaTeks,一种使用DLM的自动原地文本修改机制。在IMDB电影评论(情感)和合成的猫狗数据集(任意、更非常规的概念引导)上的实验表明,TimpaTeks提供了一种可行的新机制来原地引导扩散语言模型的输出。TimpaTeks实现了原地修改,同时降低了句子困惑度并保留了原始句子结构,无需指令调优模型。与基于提示的DLM引导相比,TimpaTeks计算成本更低,因为它执行原地去噪,而不是构建额外的提示条件输出序列。

英文摘要

We extend activation steering to diffusion language models (DLMs) and study a novel problem that arises from the inference mechanism of DLMs: modifying a text in-place to manifest a different concept. We propose TimpaTeks, an automatic in-place text modification mechanism using DLMs. Experiments on IMDB movie reviews (sentiment) and a synthetic Cats and Dogs Dataset (arbitrary, more unconventional concept steering) show that TimpaTeks provides a feasible novel mechanism to steer diffusion language model outputs in-place. TimpaTeks enables in-place modification while simultaneously lowering sentence perplexity and retaining the original sentence structure, without the need for instruction-tuned models. TimpaTeks is also computationally cheaper than prompt-based DLM steering, as it performs denoising in-place rather than constructing an additional prompt-conditioned output sequence.

2606.08445 2026-06-09 cs.CL cs.AI 交叉投稿

Segment-level Tree Search for Long Meeting Document Summarization

长会议文档摘要的段级树搜索

Sangwon Ryu, Heejin Do, Jun Seo, Daehui Kim, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok

发表机构 * GSAI, POSTECH(浦项工科大学人工智能研究生院) CSE, POSTECH(浦项工科大学计算机科学与工程系) ETH Zurich(苏黎世联邦理工学院) ETH AI Center(苏黎世联邦理工学院人工智能中心) Agentic AI Lab, KT(KT公司智能体人工智能实验室) LILT(LILT公司)

AI总结 提出基于蒙特卡洛树搜索的段级摘要框架S3,无需训练即可组合段级候选摘要,使用7B模型达到72B模型性能。

Comments INTERSPEECH 2026

详情
AI中文摘要

会议文档因其长度和复杂的对话结构而难以总结。现有方法通常采用多阶段流水线,在摘要之前提取信息;然而,这些方法往往因缺乏中间验证而遭受累积错误传播,这一限制因短且低质量的参考摘要而进一步放大。我们提出通过蒙特卡洛树搜索进行段级摘要(S3),这是一个无需训练的框架,通过组合段级摘要候选来构建最终摘要。S3将长文档划分为多个段,并为每个段生成多个摘要候选,形成搜索树的节点。通过自我奖励引导的树搜索选择最佳评分组合,并精炼为最终输出。尽管使用7B模型,S3在生成长度合适的摘要的同时,实现了与更大的72B模型相当的性能。

英文摘要

Meeting documents are challenging to summarize due to their length and complex conversational structure. Existing approaches typically adopt multi-stage pipelines that extract information prior to summarization; however, these approaches often suffer from cumulative error propagation without intermediate validation, a limitation further amplified by short and low-quality reference summaries. We propose segment-level summarization via Monte Carlo Tree Search (S3), a training-free framework that constructs a final summary by composing segment-level summary candidates. S3 partitions a long document into segments and generates multiple summary candidates per segment, forming nodes of a search tree. The best-scoring combination is selected via self-reward-guided tree search and refined into the final output. Despite using a 7B model, S3 achieves performance comparable to larger 72B models while producing length-appropriate summaries.
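
The composition step can be illustrated with a small Python sketch. Two things are swapped in and should be read as assumptions: exhaustive enumeration replaces Monte Carlo Tree Search (only feasible for tiny trees), and a keyword-coverage score with a length penalty stands in for the model-based self-reward. All names and data are hypothetical.

```python
from itertools import product


def score(summary_parts, keywords, max_words):
    """Stand-in reward: keyword coverage minus a penalty for excess length."""
    words = " ".join(summary_parts).lower().split()
    covered = sum(1 for keyword in keywords if keyword in words)
    over_budget = max(0, len(words) - max_words)
    return covered - 0.5 * over_budget


def best_combination(candidates_per_segment, keywords, max_words):
    """Pick one candidate per segment so the combined summary scores highest.

    Enumerates every root-to-leaf path of the candidate tree; a tree search
    such as MCTS would explore the same space without visiting every path.
    """
    best = max(
        product(*candidates_per_segment),
        key=lambda parts: score(parts, keywords, max_words),
    )
    return list(best)


# Hypothetical meeting with two segments and two candidate summaries each.
candidates = [
    ["budget approved", "the team discussed the budget at length and approved it"],
    ["launch moved to may", "launch date"],
]
chosen = best_combination(
    candidates, keywords=["budget", "approved", "launch", "may"], max_words=8
)
```

Here the short, complete candidates win because the verbose one exceeds the word budget and the terse one misses a keyword, mirroring the abstract's emphasis on length-appropriate summaries.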

2606.08471 2026-06-09 cs.CL cs.AI 交叉投稿

More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs

更多废话,更少意义:揭示小语言模型中的自我改进行为

Marina Igitkhanian, Erik Arakelyan

发表机构 * American University of Armenia(亚美尼亚美国大学) NVIDIA(英伟达)

AI总结 本研究通过构建充分性测试,发现小语言模型在自我纠正中仅获得4.4%的准确率提升,且模型自行生成的提示(hint)越长,反而与错误答案正相关,表明其推理能力有限。

Comments GEM Workshop at ACL 2026

详情
AI中文摘要

近年来,语言模型在各个领域和应用中取得了快速进展。然而,它们的自我改进能力——即是否善于识别和纠正自身推理中的缺陷——仍然存疑。在本研究中,我们通过构建一个充分性测试来严格检验小语言模型(SLMs)的自我纠正能力。我们提出了一个最小化的三步自我纠正流程:收集SLM的初始答案,要求同一模型根据真实答案为其错误回答生成提示(hint),然后将相同问题与模型自身的反馈一起输入以改进初始答案。我们在算术和逻辑推理基准上评估了多种指令微调和推理SLM。我们的发现表明,注入提示句子的SLM相比初始问答准确率仅提升4.4%。即使正确答案与模型的错误推理一起提供,评估的SLM也无法理解其推理中缺失了什么,并且促成纠正的提示与未促成纠正的提示之间只有极小的语义差异。此外,我们的实验表明,较长的提示(hint)与错误的最终答案正相关,表明对问题的较长思考可能阻碍推理过程,这意味着SLM的性能不一定随更大的计算预算而扩展。

英文摘要

Recently, language models have made rapid progress across various domains and applications. However, their capability for self-improvement, i.e., whether they are adept at recognising and correcting flaws in their own reasoning, remains dubious. In this study, we address this question by constructing a sufficiency test to rigorously examine the self-correction capabilities of small language models (SLMs). We propose a minimal three-step self-correction pipeline that collects initial SLM answers, prompts the same model to generate hints for its incorrect responses given the ground truth, and feeds the model the same question with its own feedback to refine the initial answer. We evaluate a variety of instruction-tuned and reasoning SLMs in this experimental setup on arithmetic and logical reasoning benchmarks. Our findings show that SLMs with injected hint sentences yield only a 4.4 percent gain over initial question-answering accuracy. Even though the correct answer was provided alongside the model's incorrect reasoning, the evaluated SLMs fail to understand what was missing in their reasoning and show minimal semantic difference between hints that lead to corrections and ones that do not. Furthermore, our experiments show that longer hints are positively correlated with incorrect final answers, suggesting that longer deliberation on problems can hinder the reasoning process, meaning that SLMs do not necessarily scale in performance with a larger compute budget.
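
The three-step loop in the abstract can be sketched as follows. This is a minimal sketch assuming the model is any callable from prompt string to answer string; the prompt wording, the function names, and the scripted `toy_model` are hypothetical and not taken from the paper.

```python
def self_correct(model, question, ground_truth):
    """Return (initial answer, self-generated hint, refined answer).

    Step 1: collect the initial answer.
    Step 2: if it is wrong, ask the same model for a hint, given the truth.
    Step 3: re-ask the question with that hint attached.
    """
    initial = model(f"Question: {question}\nAnswer:")
    if initial.strip() == ground_truth:
        return initial, None, initial  # already correct, nothing to refine
    hint = model(
        f"Question: {question}\n"
        f"Your answer: {initial}\n"
        f"Correct answer: {ground_truth}\n"
        "Explain briefly what went wrong."
    )
    refined = model(f"Question: {question}\nHint: {hint}\nAnswer:")
    return initial, hint, refined


def toy_model(prompt):
    """Scripted stand-in for an SLM so the sketch runs without a real model."""
    if "Correct answer:" in prompt:
        return "Re-check the carry in the tens column."
    if "Hint:" in prompt:
        return "42"
    return "32"


initial, hint, refined = self_correct(toy_model, "17 + 25 = ?", "42")
```

The scripted model always recovers after a hint; the paper's finding is that real SLMs mostly do not, with only a 4.4 percent accuracy gain.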

2606.08492 2026-06-09 cs.CV cs.AI 交叉投稿

Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation

眼见为实:基于视觉锚点的提示重写对齐用于文本到图像生成

Xuanyi Liu, Deyi Ji, Junyu Lu, Jing Wang, Qianxiong Xu, Xuhang Chen, Tianrun Chen, Siwei Ma

发表机构 * Peking University(北京大学) Tencent(腾讯) Dalian University of Technology(大连理工大学) Nanyang Technological University(南洋理工大学) University of Cambridge(剑桥大学) Zhejiang University(浙江大学)

AI总结 提出FaithRewriter框架,利用多模态大模型生成中间视觉线索,结合大语言模型生成视觉锚定的增强提示,再蒸馏至小模型,以缩小用户意图与生成图像之间的差距。

详情
AI中文摘要

尽管文本到图像(T2I)模型具有令人印象深刻的能力,但由于用户提示的简洁性和模糊性,意图-生成差距往往持续存在。现有方法主要优化提示的流畅性和可读性。然而,增强过程仍然缺乏视觉基础。因此,重写器可能过度推断缺失的细节,导致意图-生成差距。为了解决这一限制,我们提出了FaithRewriter,一种用于T2I生成的新型提示增强框架。具体来说,FaithRewriter首先利用多模态大语言模型(MLLM)从原始提示生成图像作为中间视觉线索。然后将该线索与提示结合,输入大规模LLM,生成视觉锚定的增强,更好地反映预期内容在图像中应如何呈现。最后,将这些增强蒸馏到小规模LLM中以便高效部署,增强其生成有效T2I提示的能力。实验表明,与强基线相比,FaithRewriter生成的提示更忠实于用户意图且视觉上更合理,有助于缩小意图-生成差距。

英文摘要

Despite the impressive capabilities of text-to-image (T2I) models, an intent-generation gap often persists due to the brevity and ambiguity of user prompts. Existing approaches primarily polish the prompt for fluency and readability. However, the enhancement process still lacks visual grounding. As a result, the rewriter may over-infer missing details, causing an intent-generation gap. To address this limitation, we propose FaithRewriter, a novel prompt-enhancement framework for T2I generation. Specifically, FaithRewriter first leverages a multimodal large language model (MLLM) to generate an image from the original prompt as an intermediate visual cue. This cue is then combined with the prompt and fed into a large-scale LLM to produce visually grounded augmentations that better reflect how the intended content should appear in images. Finally, these augmentations are distilled into a small-scale LLM for efficient deployment, enhancing its ability to generate effective T2I prompts. Experiments show that FaithRewriter yields prompts that are more faithful to the user intent and more visually plausible than strong baselines, helping narrow the intent-generation gap.

2606.08644 2026-06-09 cs.CL cs.AI 交叉投稿

A retrieval conditioned rebinding circuit for dynamic entity tracking in large language models

一种用于大语言模型中动态实体追踪的检索条件重绑定回路

Soyoung Oh, Vera Demberg

发表机构 * Saarland University(萨尔兰大学) Max Planck Institute for Informatics(马克斯·普朗克信息学研究所)

AI总结 通过因果干预识别出大语言模型中实现动态状态追踪的检索条件重绑定机制,该机制由紧凑的注意力头回路编码并恢复绑定信息,在不同模型家族中表现不同。

详情
AI中文摘要

为了正确解释上下文并检索相关信息,大语言模型必须将实体与其属性绑定,并在状态变化时更新这些绑定。我们分析了LLM在动态状态追踪中如何实现这一绑定过程。通过因果干预,我们识别出一种检索条件重绑定机制,这是一个紧凑的注意力头回路,编码与交换相关的绑定信息并在读出时将其恢复。在Gemma和Llama模型中,该回路支持重绑定行为,但机制的表示特征在不同模型家族中有所不同。在Gemma模型中,绑定特征清晰地表达在相关注意力头的查询/键子空间中,而在Llama模型中,绑定信息主要由键向量携带。总体而言,我们的结果揭示了LLM中上下文相关状态追踪的可解释机制。

英文摘要

To interpret context correctly and retrieve relevant information, large language models must bind entities to their attributes and update these bindings as state changes. We analyze how LLMs implement this binding process in dynamic state tracking. Using causal interventions, we identify a retrieval-conditioned rebinding mechanism, a compact attention-head circuit that encodes swap-relevant binding information and reinstates it at readout. Across Gemma and Llama models, this circuit supports rebinding behavior, but the representational signature of the mechanism differs across model families. In Gemma models, the binding signature is clearly expressed in the query/key subspaces of the relevant attention heads, whereas in Llama models, the binding information is carried primarily in key vectors. Overall, our results reveal an interpretable mechanism for context-dependent state tracking in LLMs.

2606.08676 2026-06-09 cs.SE cs.AI cs.CL 交叉投稿

Lost in the Flow with Code Talkers: Unveiling the Instruction-Tuning Tax of Large Language Models in Code Tasks

迷失在代码对话者的流程中:揭示代码任务中大语言模型的指令微调税

Shi Ying Chang, Chiok Yew Ho, Yichen Li, Yintong Huo

发表机构 * Singapore Management University(新加坡管理大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 本研究首次实证发现指令微调在代码任务中导致权衡:增强指令遵循能力却削弱代码填充性能,称之为“指令微调税”,并通过定性和定量分析总结出七项发现和四项启示。

Comments 25 pages, 6 figures. Evaluation toolkit and dataset: https://github.com/arkosioscambions/CodeTalkers

详情
AI中文摘要

AI编码助手通过自动建议与用户意图一致的代码,显著提高了开发者的生产力,许多此类工具现已直接集成到集成开发环境(IDE)中。开发者以两种不同的认知模式与代码交互:流程模式和命令模式。在流程模式下,开发者需要能够直接完成或填充未完成程序中代码的工具;而在命令模式下,他们需要能够理解以自然语言指令表达的意图并将其转换为可执行代码的工具。尽管经过指令微调的大型语言模型(LLM)因其推断和满足开发者意图的能力而在许多应用场景中占据主导地位,但尚不清楚同一范式是否同样适用于不同的代码相关任务。因此,有必要理解指令微调如何影响CodeLLM作为编码助手的可行性。为填补这一空白,我们进行了首次实证研究,揭示了指令微调在编程模式之间引起的关键权衡,我们称之为“指令微调税”。我们的结果表明,指令微调并非免费的午餐:尽管经过指令微调的模型更擅长遵循指令和利用结构化指导,但这些收益往往以牺牲填充性能为代价。我们进一步通过定性和定量分析扩展了研究,包括手动失败分类、捕捉生成保真度的行为指标以及微调过程中的中间检查点评估。将我们的结果总结为七项发现和四项启示,我们的研究为AI驱动编码工具的开发提供了新视角,并强调了在指令遵循能力与有效代码生成辅助之间仔细平衡的必要性。

英文摘要

AI coding assistants have significantly improved developer productivity by automatically suggesting code that aligns with user intent, and many of these tools are now integrated directly into Integrated Development Environments (IDEs). Developers interact with code in two distinct cognitive modes: Flow and Command. While developers require tools that directly complete or infill code in unfinished programs during Flow mode, they also need tools that can comprehend intentions expressed as natural-language instructions and convert them into executable code in Command mode. Although instruction-tuned Large Language Models (LLMs) dominate many application scenarios due to their abilities to infer and fulfill developers' intents, it remains unclear whether the same paradigm is equally suitable for different code-related tasks. Therefore, it is necessary to understand how instruction tuning affects the feasibility of CodeLLMs as coding assistants. To fill this gap, we conduct the first empirical study that uncovers a key trade-off caused by instruction tuning across programming modes, which we term the Instruction-Tuning Tax. Our results show that instruction tuning is not a free lunch: although instruction-tuned models are more capable of following instructions and leveraging structured guidance, these gains often come at the cost of weaker infilling performance. We further extend our study through both qualitative and quantitative analyses, including manual failure categorization, behavioral metrics that capture generation fidelity, and intermediate-checkpoint evaluation throughout the tuning process. Summarizing our results into seven findings and four implications, our study offers a new perspective on the development of AI-powered coding tools and highlights the need to carefully balance instruction-following ability with effective code generation assistance.

2606.08770 2026-06-09 cs.CL cs.AI cs.CV cs.LG 交叉投稿

TeamHerald@CHIPSAL 2026: Hate Speech Detection and Sentiment Analysis of Nepali Memes using Transformer-based Architectures and Ensemble Learning

TeamHerald@CHIPSAL 2026:基于Transformer架构和集成学习的尼泊尔语模因仇恨言论检测与情感分析

Ashish Acharya, Anish Khatiwada, Rohit Khadka, Pragya Aryal

发表机构 * Herald College Kathmandu(加德满都赫尔德学院)

AI总结 针对尼泊尔语模因中语码混用(code-mixing)和资源匮乏问题,采用OCR提取文本并结合Transformer模型,发现硬/软投票集成策略在二分类和多分类任务中表现不同,软投票在多类情感任务中使Macro F1分数相对提升15.8%。

Comments Accepted at the 2nd Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026) at LREC 2026

详情
AI中文摘要

尼泊尔语互联网模因的分析因频繁的语码混用(code-mixing)和缺乏已建立的基线资源而变得复杂。虽然模因本质上结合了视觉和文本元素,但本研究侧重于以文本为中心的方法,通过OCR层提取嵌入文本,并使用基于Transformer的架构进行建模。我们评估了六种不同的模型,并研究了硬投票和软投票集成策略在两项任务中的比较效果:二分类仇恨言论检测和三分类情感分析。实验结果表明,独立的仅解码器模型在二分类任务中取得了最高性能,而软投票集成在多类情感任务中表现最佳,相比最强的独立基线,Macro F1分数相对提升了15.8%。这些发现表明,集成策略在二分类和多类任务中表现不同,突出了选择适合分类目标的聚合方法的重要性。

英文摘要

The analysis of internet memes in the Nepali language is complicated by frequent code-mixing and a lack of established baseline resources. While memes inherently combine visual and textual elements, this study focuses on a text-centric approach by extracting embedded text using an OCR layer and modeling it with Transformer-based architectures. We evaluate six distinct models and investigate the comparative effectiveness of Hard and Soft Voting ensemble strategies across two tasks: binary hate speech detection and three-class sentiment analysis. Experimental results show that a standalone decoder-only model achieved the highest performance for binary classification, whereas the Soft Voting ensemble performed best for the multi-class sentiment task, yielding a 15.8% relative improvement in Macro F1-score over the strongest standalone baseline. These findings suggest that ensemble strategies behave differently across binary and multi-class tasks, highlighting the importance of selecting aggregation methods suited to the classification objective.
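
The difference between the two ensemble strategies is easy to see on a single example. The sketch below is a plain-Python illustration with made-up probabilities for three hypothetical models on a three-class task; it does not reproduce the paper's models or results.

```python
def argmax(values):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(values)), key=values.__getitem__)


def hard_vote(prob_rows):
    """Each model casts one vote for its top class; the majority wins."""
    votes = [argmax(row) for row in prob_rows]
    counts = [votes.count(label) for label in range(len(prob_rows[0]))]
    return argmax(counts)


def soft_vote(prob_rows):
    """Average the class probabilities across models, then take the top class."""
    n_classes = len(prob_rows[0])
    averaged = [
        sum(row[label] for row in prob_rows) / len(prob_rows)
        for label in range(n_classes)
    ]
    return argmax(averaged)


# Hypothetical per-model probabilities over (negative, neutral, positive).
model_probs = [
    [0.40, 0.35, 0.25],  # weakly prefers class 0
    [0.45, 0.40, 0.15],  # weakly prefers class 0
    [0.05, 0.90, 0.05],  # strongly prefers class 1
]
hard_choice = hard_vote(model_probs)
soft_choice = soft_vote(model_probs)
```

Hard voting follows the two lukewarm votes for class 0, while soft voting lets the one confident model tip the average toward class 1. That sensitivity to confidence is one reason the two strategies can rank differently across tasks.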

2606.08847 2026-06-09 cs.CV cs.AI cs.LG 交叉投稿

BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation

BLM-SGAN: 用于语义-空间文本到图像生成的双向语言建模

Ahmed Abdelmoneim Mazrou, Haidy Maher El-Amir, Ali Hamdi

发表机构 * Faculty of Computer Science, MSA University, Egypt(MSA大学计算机科学学院,埃及)

AI总结 提出BLM-SGAN模型,利用BERT的双向注意力机制捕获长程依赖,解决GAN在文本到图像生成中的梯度消失和序列处理限制,在鸟类图像生成上达到SOTA。

Comments Published in ICACIn 2024. Appears in Advances on Intelligent Computing and Data Science II, Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, 2025

详情
Journal ref Advances on Intelligent Computing and Data Science II (ICACIn 2024), Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, Cham, 2025
AI中文摘要

尽管从文本描述生成图像取得了成功,但在自然语言处理(NLP)和计算机视觉(CV)等领域仍面临难以克服的挑战。文本到图像(T2I)模型的最新进展,特别是那些利用生成对抗网络(GAN)的模型,显著提高了跨领域合成逼真图像的能力。然而,现有的基于GAN的T2I模型仍然面临关键挑战,例如难以捕获长程依赖、梯度消失以及序列处理的局限性。为了解决这些问题,我们引入了BLM-SGAN,一种新颖的模型,它结合了用于语义-空间文本到图像生成的双向语言建模。BLM-SGAN利用BERT的注意力机制来捕获丰富的上下文信息并有效管理扩展序列。我们的模型展示了最先进的性能,Inception Score(IS)为5.45 +/- 0.08,超过了多个竞争模型,如SSA-GAN、DF-GAN、SD-GAN和AttnGAN。BLM-SGAN能够从详细的文本描述中有效生成高度逼真的鸟类图像。实现代码可在以下网址获取:https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation。

英文摘要

Despite the success of image generation from text descriptions, it still faces challenges that are difficult to overcome in domains such as natural language processing (NLP) and computer vision (CV). Recent advancements in text-to-image (T2I) models, particularly those utilizing generative adversarial networks (GANs), have significantly improved the synthesis of realistic images across various domains. However, existing GAN-based T2I models still encounter key challenges, such as difficulty in capturing long-range dependencies, vanishing gradients, and the limitations of sequential processing. To address these issues, we introduce BLM-SGAN, a novel model that incorporates Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation. BLM-SGAN leverages BERT's attention mechanisms to capture rich contextual information and efficiently manage extended sequences. Our model demonstrates state-of-the-art performance, with an Inception Score (IS) of 5.45 +/- 0.08, surpassing several competitive models such as SSA-GAN, DF-GAN, SD-GAN, and AttnGAN. BLM-SGAN effectively generates highly realistic images of birds from detailed text descriptions. The implementation code is available at: https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation.

2606.08938 2026-06-09 cs.CL cs.AI 交叉投稿

PACT: Learning Diverse Diagnostic Strategies via Privileged Synthesis and Branch Consensus

PACT: 通过特权合成与分支共识学习多样化诊断策略

Gen Li, Yuanze Hu, Zhichao Yang, Qingchen Yu, Jianwei Lv, Yue Guo, Yujing Liu, Faguo Wu, Hongwei Zheng, Xiandong Li, Bo Yuan, Yifan Sun, Zhaoxin Fan

发表机构 * Beihang University(北京航空航天大学) Baidu(百度) ByteDance(字节跳动) Beijing Academy of Blockchain and Edge Computing(北京区块链与边缘计算研究院) Renmin University of China(中国人民大学)

AI总结 提出PACT框架,通过特权合成对话数据和多分支共识训练,使LLM同时学习多种诊断推理范式,在中文医疗诊断基准上取得最优性能。

Comments 16 pages, 5 figures, 5 tables

详情
AI中文摘要

临床诊断需要在信息不完整的情况下灵活运用多种推理范式。现有的基于LLM的医疗智能体表现出强大的医学推理能力,但单一范式或简单混合的对话监督使得这些范式难以无干扰地学习。我们提出PACT(周期性锚点共识训练),一个将监督的多范式对话合成与基于共识的分支训练相结合的框架。在数据层面,DPS(医生-患者-监督者)利用完整的电子病历(EMR)进行质量控制,同时保持医生代理仅能访问患者可见信息。这产生了四种诊断推理范式下的经过验证的对话,而不会泄露隐藏的临床答案。在训练层面,PACT为每个范式训练一个范式特定的LoRA分支,并通过符号共识定期将分支聚合到共享锚点中。我们进一步构建了一个动态的多轮中文医疗诊断基准用于交互式会诊。实验表明,PACT在诊断结果和会诊过程指标上,与专有、医学专用和任务适应的基线相比,达到了最先进的性能。

英文摘要

Clinical diagnosis requires flexible use of multiple reasoning paradigms under incomplete patient information. Existing LLM-based medical agents show strong medical reasoning ability, but single-paradigm or naively mixed dialogue supervision makes these paradigms difficult to learn without interference. We propose PACT (Periodic Anchor Consensus Training), a framework that couples supervised multi-paradigm dialogue synthesis with consensus-based Branch training. At the data level, DPS (Doctor-Patient-Supervisor) uses complete electronic medical records (EMRs) for quality control while keeping the doctor agent restricted to patient-visible information. This produces validated dialogues under four diagnostic reasoning paradigms without leaking hidden clinical answers. At the training level, PACT trains one paradigm-specific LoRA Branch per paradigm and periodically aggregates Branches into a shared Anchor through sign consensus. We further construct a dynamic multi-turn Chinese medical diagnosis benchmark for interactive consultation. Experiments show that PACT achieves state-of-the-art performance among compared proprietary, medical-specialized, and task-adapted baselines on diagnostic outcome and consultation-process metrics.
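
The abstract does not spell out the sign-consensus rule, so the following Python sketch shows only one plausible reading and should be treated as an assumption throughout: for each parameter, take the majority sign of the Branch updates relative to the Anchor, then average only the updates that agree with it. The function names and numbers are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_consensus_merge(anchor, branches):
    """Merge branch parameters into the anchor, one coordinate at a time.

    anchor: list of floats (shared parameters).
    branches: list of lists, each the same length as anchor.
    Assumed rule: keep only updates whose sign matches the majority sign;
    when the signs cancel out, leave the anchor value unchanged.
    """
    merged = []
    for i, base in enumerate(anchor):
        deltas = [branch[i] - base for branch in branches]
        majority = sign(sum(sign(delta) for delta in deltas))
        agreeing = [d for d in deltas if majority != 0 and sign(d) == majority]
        update = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(base + update)
    return merged


# Hypothetical flattened parameters for one Anchor and three Branches.
anchor = [0.0, 1.0, 0.0]
branches = [
    [0.5, 1.5, 0.25],
    [0.25, 0.5, 0.75],
    [-1.0, 1.0, 0.5],
]
new_anchor = sign_consensus_merge(anchor, branches)
```

In the first coordinate the dissenting negative update is dropped, in the second the conflicting updates cancel and the Anchor stays put, and in the third all Branches agree and are averaged. Filtering of this kind is one way to limit interference between paradigms.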

2606.08948 2026-06-09 cs.CV cs.AI 交叉投稿

NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis

NutriMLLM:用于膳食微量营养素分析的多模态大语言模型

Runze Yan, Minxiao Wang, Jiaying Lu, Darren Liu, Xiao Hu, Hanqi Luo

发表机构 * Emory University(埃默里大学)

AI总结 针对现有MLLM在膳食微量营养素估计中不可靠的问题,利用十年人口规模膳食回顾生成约110万图像-营养素三元组,微调Qwen3-VL和GLM-4.6V-Flash得到NutriMLLM,在真实图像上实现65种营养素全覆盖,准确率匹配或超越专有模型。

Comments 35 pages, 10 figures, 1 table

详情
AI中文摘要

从食物图像中全面估计膳食微量营养素可以改善临床营养护理,但训练此类模型需要将多样化食物与完整营养素谱相关联的大规模多模态数据集。我们首先证明,现有的多模态大语言模型(MLLMs),包括领先的专有模型,在此任务上不可靠。在五个模型家族和四个独立评估基准(ASA24、SNAPMe、FNDDS和NutriBench)上,模型经常弃权或返回统计上不合理的值。为了在没有昂贵专家标注的情况下解决这一差距,我们将十年人口规模的24小时膳食回顾重新用作文本到图像生成的结构化提示。该流程生成了约110万图像-描述-营养素三元组的合成语料库,每个三元组将生成的食品图像与完整的65种营养素标签配对。据我们所知,这是计划在发表后公开发布的最大合成食品图像语料库,具有全面的微量营养素标注。在此语料库上微调Qwen3-VL(2B/4B/8B/30B)和GLM-4.6V-Flash,得到了NutriMLLM,这是第一个专门用于全面膳食微量营养素估计的视觉语言模型家族。我们使用一个四组件框架评估这些模型,该框架分别测量弃权、幻觉、整体可用性和每种营养素的数值准确性。在真实食品图像上,每个NutriMLLM变体在所有65种营养素上实现了近乎完全的覆盖,并且最大的变体在大多数营养素上的准确率匹配或超过了专有基线(GPT-5、Gemini 3和Claude Sonnet 4.5)。这些结果表明,由膳食回顾驱动的合成监督可以使基于图像的全面微量营养素估计成为一个可处理的工程问题,并支持膳食评估、个性化营养指导和人口规模的微量营养素监测。

英文摘要

Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but training such models requires large multimodal datasets linking diverse foods to complete nutrient profiles. We first show that existing multimodal large language models (MLLMs), including leading proprietary models, are unreliable for this task. Across five model families and four independent evaluation benchmarks (ASA24, SNAPMe, FNDDS, and NutriBench), models frequently abstained or returned statistically implausible values. To address this gap without costly expert annotation, we repurposed a decade of population-scale 24-hour dietary recalls as structured prompts for text-to-image generation. This pipeline produced a synthetic corpus of about 1.1 million image-description-nutrient triplets, each pairing a generated food image with a complete 65-nutrient label. To our knowledge, this is the largest synthetic food-image corpus with comprehensive micronutrient annotation planned for public release upon publication. Fine-tuning Qwen3-VL (2B/4B/8B/30B) and GLM-4.6V-Flash on this corpus yielded NutriMLLM, the first family of vision-language models specialized for comprehensive dietary micronutrient estimation. We evaluate these models with a four-component framework that separately measures abstention, hallucination, overall usability, and per-nutrient numerical accuracy. On real food images, every NutriMLLM variant achieved near-complete coverage across all 65 nutrients, and the largest variant matched or exceeded proprietary baselines (GPT-5, Gemini 3, and Claude Sonnet 4.5) in accuracy on most nutrients. These results show that recall-driven synthetic supervision can make image-based comprehensive micronutrient estimation a tractable engineering problem and support dietary assessment, personalized nutrition guidance, and population-scale micronutrient surveillance.

2606.09019 2026-06-09 cs.SD cs.AI 交叉投稿

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

TLDR:压缩音频令牌以实现高效自回归文本到语音

Yejin Lee, Junwon Moon, Hyoeun Kim, Hyunjin Choi, Heeseung Kim, Kyuhong Shim

发表机构 * Sungkyunkwan University(成均馆大学) University of Seoul(首尔市立大学)

AI总结 提出TLDR框架,通过将因果建模从令牌级转移到补丁级,利用轻量级压缩器和LoRA适配的冻结预训练骨干,实现1.8倍推理加速和75% KV缓存减少。

详情
AI中文摘要

基于编解码器的自回归(AR)语音语言模型通过将语音建模为离散音频令牌序列,并使用大型预训练骨干网络,实现了强大的文本到语音(TTS)质量。然而,这种令牌级公式造成了结构效率瓶颈:语音令牌序列比文本序列长得多,要求AR骨干在每个令牌位置执行因果计算,并维护随序列长度增长的KV缓存。我们引入TLDR,一种基于补丁的自回归框架,通过将因果建模从令牌级语音序列转移到补丁级序列,加速基于编解码器的AR-TTS。TLDR使用轻量级压缩器将连续的编解码器令牌分组为紧凑的潜在补丁,使用通过LoRA适配的冻结预训练AR-TTS骨干对生成的较短补丁序列进行建模,并使用说话人条件提取器在每个补丁内重建细粒度语音令牌。在补丁大小为4的情况下,TLDR比基线AR-TTS模型实现了1.8倍的推理加速,并将全局KV缓存内存减少了高达75%。实验结果表明,补丁级全局因果建模可以成为降低预训练基于编解码器的AR-TTS系统推理成本的一种实用方法,而无需替换现有模块。

英文摘要

Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
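
The 75% figure follows from simple counting: if the global backbone attends over patches of P codec tokens instead of individual tokens, it caches roughly 1/P as many positions. The sketch below works through that arithmetic under stated assumptions: it counts only speech positions in the global backbone, ignores text tokens and any cache local to the extractor, and uses hypothetical function names.

```python
import math


def global_kv_positions(num_codec_tokens, patch_size):
    """Number of positions the global backbone must cache after patching."""
    return math.ceil(num_codec_tokens / patch_size)


def kv_cache_reduction(num_codec_tokens, patch_size):
    """Fraction of global KV-cache positions saved relative to token level."""
    patched = global_kv_positions(num_codec_tokens, patch_size)
    return 1.0 - patched / num_codec_tokens


even_case = kv_cache_reduction(1000, patch_size=4)     # 1000 tokens -> 250 patches
ragged_case = kv_cache_reduction(1001, patch_size=4)   # last patch is partly empty
```

A sequence that divides evenly by the patch size reaches the full 1 - 1/4 = 75% saving, while a ragged final patch gives slightly less, which fits the abstract's "up to 75%" wording. The reported 1.8x speedup is an end-to-end measurement and is not derived here.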

2606.09030 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

TRIAGE: 基于辩证推理的不规则采样医学时间序列风险可解释预测方法

Hyeongwon Jang, Gyouk Chu, Changhun Kim, Joonhyung Park, Hangyul Yoon, Eunho Yang

发表机构 * KAIST(韩国科学技术院) AITRICS University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 提出TRIAGE框架,利用大语言模型对竞争性临床结果生成辩证推理,缓解风险极化,实现连续风险评分与可解释推理,在三个基准上AUPRC提升3.3%,校准误差降低81%。

Comments Code is available at https://github.com/HyeongWon-Jang/TRIAGE

详情
AI中文摘要

临床早期预警系统建立在电子健康记录之上,其中的临床观察以不规则采样医学时间序列(ISMTS)的形式记录;这类系统既要提供经过校准的风险评分用于患者分诊,也要给出临床医生可以验证的可解释理由。大语言模型(LLMs)已被探索用于此任务,但它们会将分级的临床风险坍缩为过度自信的二元预测。这种风险极化损害了校准性和跨患者可比性。为解决此问题,我们提出TRIAGE框架,该框架训练LLM通过引出特定结果的理由,对竞争性临床结果生成辩证推理。这种辩证公式减轻了风险极化,使单个LLM能够产生基于明确临床推理的连续风险评分。在三个ISMTS基准上评估,TRIAGE相比竞争基线实现了平均AUPRC提升3.3%,校准误差降低81%。LLM作为评判者的评估进一步表明,我们的理由在临床推理质量上比基线的事后解释高出20%。源代码可在https://github.com/HyeongWon-Jang/TRIAGE获取。

英文摘要

Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must deliver both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large Language Models (LLMs) have been explored for this task, yet they collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE achieves an average AUPRC improvement of 3.3% and reduces calibration error by 81% compared to the competitive baselines. An LLM-as-a-judge assessment further shows that our rationales surpass post-hoc explanations from the baseline by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE .
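
Why polarized predictions hurt calibration can be shown with a binned calibration error. The abstract does not say which calibration metric is used, so the equal-width binning below is an assumption (it is one common form of expected calibration error), and the patient data is invented for illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between predicted risk and observed event rate.

    probs: predicted probabilities of the positive outcome, each in [0, 1].
    labels: observed outcomes, 0 or 1.
    """
    bins = [[] for _ in range(n_bins)]
    for prob, label in zip(probs, labels):
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in last bin
        bins[index].append((prob, label))
    total = len(probs)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_risk = sum(prob for prob, _ in members) / len(members)
        event_rate = sum(label for _, label in members) / len(members)
        error += len(members) / total * abs(mean_risk - event_rate)
    return error


# Hypothetical cohort of four patients, one of whom has the adverse outcome.
outcomes = [1, 0, 0, 0]
polarized = expected_calibration_error([1.0, 1.0, 0.0, 0.0], outcomes)
graded = expected_calibration_error([0.5, 0.5, 0.0, 0.0], outcomes)
```

Both score vectors rank the patients identically, but the polarized one claims certainty for two patients of whom only one has the event, so its calibration error is higher. Continuous, graded scores avoid that penalty.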

2606.09048 2026-06-09 eess.AS cs.AI cs.SD 交叉投稿

BareWave: Waveform-Native Flow-Matching Text-to-Speech

BareWave: 波形原生流匹配文本转语音

Wei Fan, Chao-Hong Tan, Qian Chen, Wen Wang, Xiangang Li, Kejiang Chen, Weiming Zhang, Nenghai Yu

发表机构 * Anhui Province Key Laboratory of Digital Security(安徽省数字安全重点实验室) Tongyi Fun Team, Alibaba Group(阿里巴巴集团通义Fun团队)

AI总结 提出BareWave,一种完全波形原生的流匹配TTS框架,通过训练时表示对齐、分阶段噪声调度和速度感知感知对齐解决直接波形训练难题,实现零样本语音克隆的高质量合成。

Comments Under Review

详情
AI中文摘要

去除中间表示和单独训练的解码阶段已成为生成建模的重要方向。然而,在文本转语音中,高质量系统通常仍通过中间声学表示构建,再进行波形合成。本文提出BareWave,一种完全波形原生的框架,用于流匹配TTS中的直接文本到波形生成。我们认为该设置引发了三个训练挑战:原始波形建模缺乏强大的预训练表示支架;不同训练阶段受益于不同的噪声调度;数据空间感知目标不会自动共享速度空间流目标的时间结构。因此,直接波形训练难以高效优化,难以通过固定配方推向强最终工作点,也难以整合有效的感知细化。基于此观点,我们开发了一个直接文本到波形训练框架,结合训练时表示对齐、分阶段噪声调度和速度感知感知对齐(VAPA),同时在测试时保持单一波形原生推理路径,无需预训练组件。零样本语音克隆实验表明,在完全波形原生推理路径下,可以实现强可懂度、说话人相似度和自然度,支持波形原生流匹配TTS作为实用方向。带有音频示例的项目页面可在https://barewave.github.io/获取。

英文摘要

Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. Project page with audio demos is available at https://barewave.github.io/.

2606.09064 2026-06-09 cs.CV cs.AI 交叉投稿

See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding

看得更多,思考更深:面向长视频理解的查询扩展视觉证据与答案线索引导反思

Shuning Wang, Zhiheng Wu, YiNuo Lu, Naiming Liu, Chen Jia, Bowen Liu, Shuo Nie, Weijie Zhu, Yumeng Zhang

发表机构 * Baidu Inc.(百度公司) Harbin Institute of Technology(哈尔滨工业大学) Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出CoVER框架,通过动态收集查询扩展视觉证据和答案特定视觉反馈验证草稿答案,实现从答案中心生成到证据中心和视觉可验证推理的转变,在长视频理解任务上超越同规模模型及部分闭源模型。

详情
AI中文摘要

近期视频大语言模型(Video-LLMs)的进展使得长视频理解任务成为可能。然而,现有方法仍面临两个关键限制:证据获取通常依赖单一搜索意图,且答案生成缺乏有效的视觉反馈机制。为解决这些限制,我们提出了CoVER,一个用于长视频理解的综合视觉证据与反思框架。CoVER使Video-LLMs能够通过动态收集查询扩展视觉证据来“看得更多”,并通过使用有效的答案特定视觉反馈验证草稿答案来“思考更深”。这些机制共同将长视频理解从以答案为中心的生成转变为以证据为中心且可视觉验证的推理。实验结果表明,CoVER-7B在相同参数规模下显著优于其他模型,甚至在特定指标上超越了最先进的闭源模型。

英文摘要

Recent advances in Video Large Language Models (Video-LLMs) have enabled performance on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose CoVER, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to "See More" by dynamically gathering query-expanded visual evidence, and "Think Deeper" by verifying draft answers with effective answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric and visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models with the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.

2606.09142 2026-06-09 cs.CV cs.AI 交叉投稿

Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

通过视觉语言模型从自我中心视觉解码行人过街意图

Danya Li, Xiang Su, Yan Feng, Rico Krueger

发表机构 * Technical University of Denmark(丹麦技术大学) University of Helsinki(赫尔辛基大学) Delft University of Technology(代尔夫特理工大学)

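AI summary: Casts pedestrian crossing intention prediction as a visual question answering task for vision language models (VLMs). With parameter-efficient fine-tuning and contextual cues (ego motion, vehicle motion, and eye gaze), it achieves a 14.5% accuracy improvement over a transformer baseline on egocentric video, setting a new state of the art.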

详情
AI中文摘要

自我中心视觉提供了人类感知和决策的第一人称视角,但其在交通安全预测方面的潜力尚未得到充分探索。在这项工作中,我们研究从短自我中心视频片段中解码行人过街意图。我们通过将任务表述为封闭式视觉问答(VQA)问题,并利用视觉语言模型(VLM)来预测行人的意图。我们首先在零样本设置下对三个系列的最先进VLM进行了基准测试,发现它们相对于随机猜测有适度提升,但表现出有限的高层次交通推理能力。基于这些发现,我们进一步使用参数高效微调将VLM适应于目标任务。我们的结果表明,微调后的模型显著优于其零样本对应模型,并在专门的基于Transformer的基线基础上实现了9%的准确率提升。最后,我们证明加入额外的上下文线索,包括自我运动、车辆运动和眼动,进一步提高了预测性能。特别是,由眼动和自我运动引导的微调Qwen3-VL-2B模型相比Transformer基线实现了14.5%的准确率提升,为自我中心行人意图解码建立了新的最先进水平。

英文摘要

Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remains underexplored. In this work, we study the decoding of pedestrian crossing intentions from short egocentric video clips. We approach this by formulating the task as a closed-ended visual question answering (VQA) problem and leveraging vision language models (VLMs) to predict the pedestrians' intent. We first benchmark three families of state-of-the-art VLMs in a zero-shot setting, finding that they achieve moderate gains over random guessing but exhibit limited higher-level traffic reasoning. Motivated by these findings, we further adapt VLMs to the target task using parameter-efficient fine-tuning. Our results show that the fine-tuned models substantially outperform their zero-shot counterparts and achieve a 9% accuracy improvement over a specialized transformer-based baseline. Finally, we demonstrate that incorporating additional contextual cues, including ego motion, vehicle motion, and eye gaze, further improves predictive performance. In particular, the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.

2606.09159 2026-06-09 cs.CL cs.AI 交叉投稿

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

扩散语言模型中不变性与独立性解码的统一能量

Yuchen Yan, Minkai Xu, Zaiquan Yang, Yatao Bian

发表机构 * National University of Singapore(新加坡国立大学) Stanford University(斯坦福大学) City University of Hong Kong(香港城市大学)

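AI summary: Addresses the performance gap between parallel text generation in diffusion language models and autoregressive models. Proposes a unified energy (Uni-E) that combines an invariant energy and an independent energy to account for model capacity, dependency, and invariance; it can be computed exactly without sampling-based partition estimation and can correct distribution shift.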

详情
AI中文摘要

扩散语言模型(DLM)通过迭代去噪完整序列实现并行文本生成,与自回归(AR)解码相比具有吸引人的灵活性。然而,现有方法未能完全捕捉令牌关系,导致与AR基线存在性能差距,尤其是在并行度增加时。本文对该差距进行了系统分析,确定了三个关键因素:(i)模型容量、(ii)依赖性和(iii)不变性。为解决这些问题,我们首先提出不变能量(Inv-E)以及一个有效的基于采样的估计器来处理不变性问题。通过进一步与独立能量(Ind-E)结合,我们得到统一能量(Uni-E),它涵盖了所有这些因素。Uni-E具有独特优势:无需基于采样的分区估计即可精确计算。此外,Uni-E是模型无关的,因此可以扩展到任意大小的模型。我们进一步证明Uni-E可以纠正由依赖性和不变性引起的分布偏移。在扩散语言模型(DLM)和扩散大语言模型(DLLM)上的大量实验证明了所提出的Uni-E的有效性。

英文摘要

Diffusion Language Models (DLMs) enable parallel text generation by iteratively denoising a full sequence, offering attractive flexibility compared to auto-regressive (AR) decoding. However, existing methods fail to fully capture token relationships, leading to a performance gap relative to AR baselines, especially as the degree of parallelism increases. In this paper, we give a systematic analysis of the gap, identifying three key factors: (i) model capacity, (ii) dependency, and (iii) invariance. To address these issues, we first propose an invariant energy (Inv-E) together with an effective sampling-based estimator to handle the invariance issue. By further combining it with the independent energy (Ind-E), we obtain a unified energy (Uni-E) that accounts for all these factors. Uni-E enjoys a unique advantage: it can be computed exactly without sampling-based partition estimation. In addition, Uni-E is model agnostic and can therefore be scaled to models of arbitrary size. We further prove that Uni-E can correct the distribution shift caused by dependency and invariance. Extensive experiments across Diffusion Language Models (DLMs) and Diffusion Large Language Models (DLLMs) demonstrate the effectiveness of the proposed Uni-E.

2606.09234 2026-06-09 cs.SD cs.AI 交叉投稿

End-to-End Training for Discrete Token LLM based TTS System

基于离散令牌LLM的文本转语音系统的端到端训练

Changfeng Gao, Yong Ren, Jun Yuan, Ye Bai, Zhao You, ShiDong Shang

发表机构 * National University of Singapore(新加坡国立大学)

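AI summary: Proposes an end-to-end framework that unifies the training of the speech tokenizer, LLM, flow-matching model, and reward model. Multi-task joint optimization improves discrete-token TTS performance and reaches a new SOTA on Seed-TTS-Eval.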

详情
AI中文摘要

最近的先进文本转语音系统通常采用级联流水线,包括语音分词器、自回归大语言模型和基于扩散的流匹配模型,这些组件独立训练。本文提出一个完全端到端的优化框架,统一了语音分词器、LLM、FM模型和额外奖励模型的训练。具体来说,我们首先通过来自FM重建、LLM下一令牌预测和RM多识别任务的多任务目标联合优化分词器。这种联合训练鼓励离散语音令牌空间捕获更适合TTS的声学和语义显著信息。然后,我们通过FM和RM的下游重建和识别进一步优化LLM,这减少了推理时的不匹配,并引导LLM生成更优的结果。实验结果表明,我们的端到端框架始终优于级联基线。在Seed-TTS-Eval基准上,我们的系统实现了0.78%和1.56%的词错误率,使用0.6B参数的LLM和0.5B参数的FM模型取得了新的SOTA结果。这些结果验证了整体端到端优化对于改进基于离散令牌的TTS系统至关重要,且训练流水线更简单。

英文摘要

Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion-based flow-matching (FM) model, with these components trained independently. In this paper, we propose a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, LLM, FM model, and an additional reward model (RM). Specifically, we first jointly optimize the tokenizer using multi-task objectives derived from reconstruction for the FM model, next-token prediction for the LLM, and multiple recognition tasks for the RM. This joint training encourages the discrete speech token space to capture acoustically and semantically salient information that is better tailored to TTS. We then further optimize the LLM using downstream reconstruction and recognition by the FM model and RM, which reduces inference-time mismatch and steers the LLM toward more preferred generations. Experimental results show that our E2E framework consistently outperforms cascaded baselines. On the Seed-TTS-Eval benchmark, our system achieves word error rates (WERs) of 0.78% and 1.56%, a new SOTA result with a 0.6B-parameter LLM and a 0.5B-parameter FM model. These results validate that holistic E2E optimization is critical for improving discrete-token-based TTS systems with a much simpler training pipeline.
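
The abstract reports results as word error rate (WER) but does not spell out the metric. As a reminder of the standard definition, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The whitespace tokenization and the function name `word_error_rate` are assumptions made for illustration; the benchmark's actual scoring pipeline (ASR model, text normalization) is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For instance, a hypothesis that drops one word from a six-word reference has one deletion, so its WER is 1/6.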

2606.09331 2026-06-09 cs.MM cs.AI cs.LG 交叉投稿

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

Conan-embedding-v3: 融合模态特定模型实现全模态嵌入

Shiyu Li, Zhiyuan Hu, Yifan Wang, Peiming Li, Zheng Wei, Yang Tang

发表机构 * Tencent(腾讯)

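AI summary: Proposes a decouple-fuse-recover framework that trains modality specialists independently, fuses their task vectors, and then resolves projector drift through projector recovery and balanced multi-modal rehearsal, so that a single backbone supports text, image, video, document, and audio retrieval.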

详情
AI中文摘要

全模态检索承诺为文本、图像、视频、文档和音频输入提供单一嵌入空间,但由于这些模态在数据分布、架构和优化动态上存在差异,构建这样一个统一的检索器十分困难。在这项工作中,我们提出了Conan-embedding-v3,一个用于全模态检索的解耦-融合-恢复框架。Conan-embedding-v3首先独立训练模态专家,然后将它们的任务向量融合到一个单一的密集骨干网络中,我们称这种策略为解耦专家融合。我们表明,这种融合组合了视觉、视频和文档检索能力,但也暴露了基于投影器的模态的一个失败模式:当通过外部编码器和投影器附加音频时,融合骨干网络会使投影器校准到音频专家骨干网络,导致尽管原封不动地复制了所有音频特定模块,音频检索性能仍大幅下降。我们将这种失败称为投影器漂移。为了修复它,Conan-embedding-v3应用了投影器恢复(即在保持骨干网络冻结的情况下对投影器进行全参数微调),随后进行平衡的多模态重演。得到的模型在一个骨干网络中支持这些检索路径,在MMEB上达到74.9分,同时在30任务的MAEB音频套件上获得55.61分。

英文摘要

Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult since these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Conan-embedding-v3, a decouple-fuse-recover framework for omni-modal retrieval. Conan-embedding-v3 first trains modality specialists independently and fuses their task vectors into a single dense backbone, a strategy we call Decoupled Specialist Fusion. We show that this fusion composes visual, video, and document retrieval capabilities, but also exposes a failure mode for projector-based modalities: when audio is attached through an external encoder and projector, fusing the backbone leaves the projector calibrated to the audio-specialist backbone, causing a large audio retrieval regression even though all audio-specific modules are copied unchanged. We call this failure Projector Drift. To repair it, Conan-embedding-v3 applies Projector Recovery (i.e., full-parameter fine-tuning of the projector while keeping the backbone frozen) followed by balanced multi-modal rehearsal. The resulting model supports these retrieval pathways in one backbone, scoring 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
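
The abstract mentions fusing specialists' task vectors into one backbone without giving the merge rule. A minimal sketch of the generic task-arithmetic idea is shown below, assuming each task vector is the parameter difference between a specialist and a shared base model and that the fused backbone is the base plus a weighted sum of those differences. The plain-dict weight format, the function names, and the per-specialist coefficients are hypothetical; the paper's actual fusion procedure may differ.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]  # hypothetical format: parameter name -> flat values


def task_vector(base: Weights, specialist: Weights) -> Weights:
    """Task vector = specialist parameters minus shared base parameters."""
    return {
        name: [s - b for s, b in zip(specialist[name], base[name])]
        for name in base
    }


def fuse(base: Weights, specialists: Sequence[Weights], coeffs: Sequence[float]) -> Weights:
    """Fused backbone = base + sum_i coeffs[i] * task_vector_i (assumed merge rule)."""
    fused = {name: list(values) for name, values in base.items()}
    for specialist, coeff in zip(specialists, coeffs):
        delta = task_vector(base, specialist)
        for name in fused:
            fused[name] = [f + coeff * d for f, d in zip(fused[name], delta[name])]
    return fused
```

Under this reading, modules that live outside the backbone (such as an audio projector) are untouched by the merge, which is consistent with the abstract's observation that a projector can remain calibrated to its original specialist backbone after fusion.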

2606.09470 2026-06-09 cs.CL cs.AI 交叉投稿

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

一种用于联合多粒度L2评估和自然语言解释的微调SpeechLLM

Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

发表机构 * Centre for Language Studies, Radboud University(语言研究中心,拉德堡德大学)

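AI summary: Proposes a rubric-guided SpeechLLM trained with a hybrid objective that jointly predicts sentence-level and word/phoneme-level labels and generates a natural-language rationale; on SpeechOcean762 it matches or outperforms single-granularity models.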

Comments Accepted to Interspeech 2026. This publication is part of the project Responsible AI for Voice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants, which is financed by the Dutch Research Council (NWO)

详情
AI中文摘要

自动化的L2语音评估可以分配熟练度标签,但通常缺乏可解释性。我们提出了一种基于评分准则的SpeechLLM,用于多角度、多粒度的评估,采用结合监督微调和有界直接偏好优化的混合目标进行训练。该模型在同一个响应中联合预测句子级(准确性、流利度、韵律)的序数标签、词/音素级准确性,并生成自然语言解释。在SpeechOcean762上,我们的方法匹配或优于单粒度模型,同时与先前方法保持竞争力。我们从两个维度分析解释的可靠性:与模型预测的自一致性和与真实标签的对齐,使用情感一致性(合理性)和基于提及的一致性(忠实性)。解释在句子级别是合理的,但在词/音素级别忠实性下降:参考稀疏且与词元级标签弱对齐。

英文摘要

Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts ordinal sentence-level labels (accuracy, fluency, prosody) and word/phoneme-level accuracy, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.

2606.09525 2026-06-09 cs.CL cs.AI 交叉投稿

Emergence of Context Characteristics Sensitivity in Large Language Models

大型语言模型中上下文特征敏感性的涌现

Nadya Yuki Wangsajaya, Haeun Yu, Isabelle Augenstein

发表机构 * Nanyang Technological University(南洋理工大学) University of Copenhagen(哥本哈根大学)

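AI summary: By measuring three stages (supervised fine-tuning, direct preference optimization, and reinforcement learning with verifiable rewards), the study finds that LLM sensitivity to context characteristics shifts during instruction fine-tuning: supervised fine-tuning biases models toward easy-to-understand contexts, and later stages may reinforce or alter this preference.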

详情
AI中文摘要

在指令微调(IFT)过程中,大型语言模型(LLMs)通过使用提供的上下文来回答问题,从而学会遵循指令。虽然先前的工作已经研究了上下文特征如何与LLM的上下文使用相关,但这种分析仅限于推理时间,尚未揭示这些关系最初是如何获得的。在这里,我们测量了模型对这些特征的敏感性在连续的IFT阶段(监督微调(SFT)、直接偏好优化(DPO)和可验证奖励强化学习(RLVR))中如何变化。跨四个模型和三个数据集的实验表明,SFT使模型更倾向于使用易于理解的上下文,例如包含高长度、上下文-查询相似性和流畅性的上下文。SFT后的动态可能根据训练数据集强化或解决这些偏好。我们的发现揭示了上下文使用在每个IFT阶段都被积极重塑,并且设计平衡的IFT数据集对于确保指令微调模型稳健的上下文利用至关重要。

英文摘要

During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a query. While prior work has studied how context characteristics correlate with context usage by the LLM, this analysis has been limited to inference time, leaving open how these relationships are acquired in the first place. Here, we measure how models' sensitivity to such characteristics shifts across successive IFT stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR). Experiments across four models and three datasets show that SFT makes models more likely to use contexts that are easy to understand, such as contexts with greater length, higher context-query similarity, and higher fluency. Post-SFT dynamics may either reinforce or resolve these preferences depending on the training dataset. Our findings reveal that context usage is actively reshaped at each IFT stage, and that designing a balanced IFT dataset is important for ensuring robust context utilization in instruction-tuned models.

2606.09587 2026-06-09 cs.HC cs.AI 交叉投稿

Seeing the Hivemind: A Consensus-Aware Interaction Technique for Mitigating AI Homogenization

看见蜂巢思维:一种缓解AI同质化的共识感知交互技术

Muhammad Haris Khan, Joel Wester

发表机构 * University of Copenhagen(哥本哈根大学)

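AI summary: Proposes the Semantic Repulsion Technique (SRT); a computational evaluation and a user study indicate that it substantially increases the semantic diversity of AI-generated content and reduces consensus phrases without harming usefulness or coherence.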

Comments In review

详情
AI中文摘要

人们越来越多地使用AI进行写作等创造性任务。虽然采用率持续增长,但这种使用方式有可能在局部削弱个人创造力,并在整体上减少创造性输出的异质性。为此,我们引入了语义排斥技术(SRT),并通过计算评估和一项针对16名经常使用AI进行创造性任务的参与者的研究对其进行了评估。我们的计算评估显示,SRT在不同任务模式下将语义多样性提高了85--167%,同时将共识短语减少了43--95%。在用户研究中,SRT输出获得了更高的有用性($p = .019$, $W = .208$)和连贯性评分($p = .006$, $W = .260$);68.8%的参与者愿意在多个任务中使用SRT-Strong,而基线仅为18.8%。所有系统中原创性和连贯性评分呈正相关($ρ= +.40$ 到 $+.67$),表明发散性不必以可读性为代价。综合来看,这些初步发现可为设计旨在支持日常创造力而不助长同质化的AI系统提供参考。

英文摘要

People are increasingly using AI for creative tasks such as writing. While adoption continues to grow, this form of use risks undermining individual creativity locally and reducing the heterogeneity of creative output at scale. In response, we introduce the Semantic Repulsion Technique (SRT) and evaluate it both computationally and through a study with 16 participants who regularly use AI for creative tasks. Our computational assessment reveals that SRT increases semantic diversity by 85–167% while reducing consensus phrases by 43–95% across task modes. In the user study, SRT outputs received higher usefulness ($p = .019$, $W = .208$) and coherence ratings ($p = .006$, $W = .260$); 68.8% of participants were willing to use SRT-Strong for multiple tasks versus 18.8% for baselines. Originality and coherence ratings were positively correlated across all systems ($ρ = +.40$ to $+.67$), suggesting that divergence need not compromise readability. Taken together, these preliminary findings can inform the design of AI systems that aim to support everyday creativity without contributing to homogenization.

2606.09670 2026-06-09 cs.CV cs.AI 交叉投稿

Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision

视觉提示结合基于特征重建的双教师监督异常检测

Mateo Diaz-Bone, Daniel Caraballo, Florian Scheidegger, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi, Roy Assaf, Niccolo Avogaro, Yagmur G. Cinar, Brown Ebouky, Filip M. Janicki, Piotr S. Kluska, Cezary Skura, Cristiano Malossi

发表机构 * IBM Research Europe Zurich(IBM欧洲研究院苏黎世分院)

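AI summary: Targets the failure of anomaly detection in real-world scenes where object scale, viewpoint, and similar factors vary. Proposes a visual prompting pipeline, an unfrozen teacher model, and diffusion-generated data augmentation, improving results on the AeBAD dataset by 3.5 percentage points.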

详情
AI中文摘要

最近的异常检测方法在成熟数据集(如MVTec)上取得了完美的检测和分割分数。然而,当基本假设(如一致的物体尺度、视角、背景、光照和居中放置)被违反时,许多方法面临挑战。这些变化使得异常检测方法在许多真实场景中无法使用。为了解决这些限制,我们引入了三个关键贡献:(1)一个视觉提示管道,通过前景-背景掩码隔离物体;(2)一种在师生模型中解冻教师以提高领域适应性的机制;(3)一种利用扩散生成合成图像的数据增强策略,以增强异常检测性能。通过使用掩码多尺度重建(MMR)模型作为骨干,我们在具有挑战性的AeBAD数据集上比之前的最先进方法提高了3.5个百分点。

英文摘要

Recent anomaly detection methods achieve perfect detection and segmentation scores on well-established datasets, such as MVTec. However, many of these methods struggle when foundational assumptions (such as consistent object scale, viewpoint, background, illumination, and centered placement) are violated. Such variations render anomaly detection methods unusable in many real-world scenarios. To address these limitations, we introduce three key contributions: (1) a visual prompting pipeline that isolates objects using foreground-background masking; (2) a mechanism for unfreezing the teacher in student-teacher models to improve domain adaptability; and (3) a data augmentation strategy leveraging diffusion-generated synthetic images to enhance anomaly detection performance. Using the Masked Multiscale Reconstruction (MMR) model as our backbone, we achieve a 3.5 percentage point improvement over the previous state of the art on the challenging AeBAD dataset.

2606.09767 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan

低资源神经机器翻译的数据合成与参数高效微调:以Q'eqchi'玛雅语为例

Alexander Chulzhanov, Soeren Eberhardt, Arjun Mukherjee

发表机构 * University of Houston(休斯顿大学) MasterWord Services, Inc.(MasterWord Services公司) University of Washington(华盛顿大学)

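AI summary: For low-resource Indigenous languages, proposes a data synthesis method (generating a synthetic corpus from community dictionaries) combined with LoRA parameter-efficient fine-tuning. On Q'eqchi' Mayan it achieves high structural acquisition (BLEU 42.02) but shows a structural-semantic gap, so curriculum learning with authentic data is needed.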

Comments Accepted to the 29th International Conference on Text, Speech and Dialogue (TSD 2026). This version of the contribution has been accepted for publication, after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections

详情
AI中文摘要

对于数字低资源土著语言的神经机器翻译,通常因极端数据稀缺而受阻,促使依赖抽取式网络爬取。为确保数据主权,本研究引入了一种数据合成方法,无需爬取目标语言平行文本即可引导NMT模型。以Q'eqchi'玛雅语为重点,我们将社区来源的词典转换为大规模合成语料,利用通过LoRA适配器在mT5-base模型上的参数高效微调(PEFT)。领域内评估显示出高度的结构习得(BLEU 42.02),证明合成约束有效地教授了复杂的黏着形态和VOS语序。然而,针对有机词汇表的评估揭示了结构-语义差距(BLEU 0.59),模型保持了语法完整性但缺乏自然语言的词汇基础。模型表现出对合成模板受限结构方差的过拟合;尽管流程中具有高语义熵,模型仍难以应对自然语言的句法流动性,将有机输入强制转换为僵化的学习模式。此外,利用多任务学习架构的消融研究导致了负迁移,表明辅助任务在LoRA适配器内竞争有限的参数容量,导致对合成标记的过度优化而牺牲了有机灵活性。最终,我们确定合成引导是一种高度有效的结构入门,但需要通过课程学习使用真实数据进行语义细化。

英文摘要

Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodology to bootstrap NMT models without scraping target-language parallel text. Focusing on Q'eqchi' Mayan, we transformed community-sourced dictionaries into a massive synthetic corpus, utilizing Parameter-Efficient Fine-Tuning (PEFT) via LoRA adapters on an mT5-base model. In-domain evaluation demonstrates high structural acquisition (BLEU 42.02), proving that synthetic constraints effectively teach complex agglutinative morphology and VOS word order. However, evaluation against an organic glossary reveals a structural-semantic gap (BLEU 0.59), where the model maintains grammatical integrity but lacks the lexical grounding of natural language. The model exhibits overfitting to the constrained structural variance of the synthetic templates; despite high semantic entropy in the pipeline, it struggles with the syntactic fluidity of natural language, forcing organic inputs into rigid learned patterns. Furthermore, an ablation study utilizing a Multi-Task Learning architecture resulted in negative transfer, suggesting that auxiliary tasks competed for limited parameter capacity within the LoRA adapters, causing over-optimization for synthetic markers at the expense of organic flexibility. Ultimately, we establish that synthetic bootstrapping is a highly effective structural primer, but requires authentic data for semantic refinement via Curriculum Learning.

2510.06052 2026-06-09 cs.AI cs.CL 版本更新

MixReasoning: Switching Modes to Think

MixReasoning: 切换模式以思考

Haiquan Lu, Gongfan Fang, Xinyin Ma, Qi Li, Xinchao Wang

发表机构 * arXiv

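AI summary: Proposes the MixReasoning framework, which dynamically adjusts reasoning depth, reasoning in detail on difficult steps and concisely on simple ones. On GSM8K, MATH-500, and AIME it shortens reasoning length and improves efficiency without sacrificing accuracy.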

详情
AI中文摘要

推理模型通过逐步解决问题、将问题分解为子问题并在生成答案前探索长思维链来提升性能。然而,对每一步都应用扩展推理会引入大量冗余,因为子问题的难度和复杂度差异很大:少数关键步骤对最终答案真正具有挑战性和决定性,而许多其他步骤仅涉及简单的修正或计算。因此,一个自然的想法是赋予推理模型自适应应对这种变化的能力,而不是对所有步骤采用相同的详细程度。为此,我们提出了MixReasoning,一个在单个响应中动态调整推理深度的框架。由此产生的思维链成为困难步骤的详细推理与简单步骤的简洁推理的混合。在GSM8K、MATH-500和AIME上的实验表明,MixReasoning缩短了推理长度,显著提高了效率,且不牺牲准确性。

英文摘要

Reasoning models enhance performance by tackling problems in a step-by-step manner, decomposing them into sub-problems and exploring long chains of thought before producing an answer. However, applying extended reasoning to every step introduces substantial redundancy, as sub-problems vary widely in difficulty and complexity: a small number of pivotal steps are genuinely challenging and decisive for the final answer, while many others only involve straightforward revisions or simple computations. Therefore, a natural idea is to endow reasoning models with the ability to adaptively respond to this variation, rather than treating all steps with the same level of elaboration. To this end, we propose MixReasoning, a framework that dynamically adjusts the depth of reasoning within a single response. The resulting chain of thought then becomes a mixture of detailed reasoning on difficult steps and concise inference on simpler ones. Experiments on GSM8K, MATH-500, and AIME show that MixReasoning shortens reasoning length and substantially improves efficiency without compromising accuracy.

2511.19829 2026-06-09 cs.AI 版本更新

Knowing How to Edit: Reliable Evaluation Signals for Diagnosing and Optimizing Prompts at Query Level

一种统一评估指导的查询相关提示优化框架

Ke Chen, Yifeng Wang, Hassan Almosapeeh, Haohan Wang

发表机构 * School of Information Sciences, University of Illinois Urbana-Champaign(信息科学学院,伊利诺伊大学厄巴纳-香槟分校) College of Engineering, Carnegie Mellon University(工程学院,卡内基梅隆大学)

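AI summary: Proposes a performance-oriented prompt evaluation framework and an execution-free evaluator that predicts multi-dimensional quality scores, which then guides a metric-aware optimizer to rewrite prompts in an interpretable, query-dependent way; it outperforms existing methods across multiple datasets and backbone models.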

详情
AI中文摘要

大多数提示优化方法优化单个静态模板,使其在复杂和动态的用户场景中无效。现有的查询相关方法依赖于不稳定的文本反馈或黑盒奖励模型,提供弱且不可解释的优化信号。更根本的是,提示质量本身缺乏统一、系统的定义,导致碎片化和不可靠的评估信号。我们的方法首先建立了一个面向性能的、系统的、全面的提示评估框架。此外,我们开发并微调了一个无需执行的评估器,可以直接从文本中预测多维质量分数。然后,评估器指导一个度量感知优化器,该优化器以可解释的、查询相关的方式诊断失败模式并重写提示。我们的评估器在预测提示性能方面达到了最强的准确性,并且评估指导的优化在八个数据集和三个骨干模型上始终优于静态模板和查询相关的基线。总的来说,我们提出了一个统一的、基于度量的提示质量视角,并证明了我们的评估指导优化流程在多样化任务中提供了稳定、可解释和模型无关的改进。

英文摘要

Prompt optimization has become a central mechanism for eliciting strong performance from LLMs, and recent work has made substantial progress by proposing diverse prompt evaluation metrics and optimization strategies. Despite these advances, prompt evaluation and prompt optimization are often developed in isolation, limiting the extent to which evaluation can effectively inform prompt refinement. In this work, we study prompt optimization as a process guided by performance-relevant evaluation signals. To address the disconnect between evaluation and optimization, we propose an evaluation-instructed prompt optimization approach that explicitly connects prompt evaluation with query-dependent optimization. Our method integrates multiple complementary prompt quality metrics into a performance-reflective evaluation framework and trains an execution-free evaluator that predicts prompt quality directly from text, avoiding repeated model executions. These evaluation signals then guide prompt refinement in a targeted and interpretable manner. Empirically, the proposed evaluator achieves 83.7% accuracy in predicting prompt performance. When incorporated into the optimization process, our approach consistently outperforms existing optimization baselines across eight benchmark datasets and three different backbone LLMs. Overall, our results demonstrate that reliable and efficient evaluation signals can serve as an effective foundation for robust and interpretable prompt optimization.

2601.02880 2026-06-09 cs.AI cs.CL 版本更新

ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

ReTreVal:带有验证和跨问题记忆的推理树

Abhishek HS, Pavan C Shekar, Arpit Jain, Ashwanth Krishnan

发表机构 * QpiAI

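AI summary: ReTreVal combines adaptive tree exploration, tool-augmented node refinement, typed-failure backtracking, and a self-rewriting memory so that large language models can learn across problems without fine-tuning. It reaches 85.8% pass@1 on MATH-500 and 54.4% accuracy on MMLU-Pro.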

Comments 15 pages, 1 figure, 12 tables

详情
AI中文摘要

现有推理框架在问题边界丢弃所有失败上下文,导致模型解决问题500时比问题1时更无知。我们提出了ReTreVal(带有验证的推理树),这是一个无需训练的框架,通过自适应树探索、带工具增强的节点细化、类型化失败回溯以及自修改记忆,实现了跨问题学习。ReTreVal在MATH-500上达到85.8%的pass@1准确率(比零样本CoT高8.6个百分点,比最强基线Self-Refine高8.6个百分点),在MMLU-Pro上达到54.4%的准确率(比Self-Refine高15.3个百分点),3.4:1的胜率比噪声比证实了真正的错误恢复。这些能力,以前需要梯度更新,使32B模型能够与更大的单次通过系统竞争。

英文摘要

Every existing inference-time reasoning framework discards all failure context at problem boundaries, leaving a model solving problem 500 no wiser than it was on problem 1. We present ReTreVal (Reasoning Tree with Validation), a training-free framework that closes this gap through adaptive tree exploration with tool-augmented node refinement, typed-failure backtracking that injects categorized error context into the recovered branch, and a self-rewriting memory that accumulates and revises strategy entries across problems, enabling inference-time cross-problem learning on any fixed, unmodified LLM without fine-tuning. ReTreVal achieves 85.8% pass@1 on MATH-500 (+8.6 pp over Zero-Shot CoT, +8.6 pp over the strongest baseline Self-Refine) and 54.4% on MMLU-Pro (+15.3 pp over Self-Refine), with a 3.4:1 win-to-regression ratio confirming genuine error recovery rather than noise. These capabilities, previously requiring gradient updates, allow a 32B model to compete with much larger single-pass systems.

2603.18388 2026-06-09 cs.AI cs.MA 版本更新

Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization

暗箱中的反射:在反射提示优化中揭示并逃离黑箱

Shiyan Liu, Qifeng Xia, Qiyun Xia, Yisheng Liu, Xinyu Yu, Rui Qu

发表机构 * University of California, Berkeley(加州大学伯克利分校) Huazhong University of Science and Technology(华中科技大学) Hefei University of Technology(合肥工业大学)

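AI summary: Proposes the VISTA framework, which decouples hypothesis generation from prompt rewriting to make prompt optimization interpretable, and addresses GEPA's performance degradation when starting from a defective seed.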

Comments Accepted at ACL SRW 2026

详情
AI中文摘要

自动提示优化(APO)已成为提升LLM性能的强大范式,无需手动提示工程。反射APO方法如GEPA通过迭代优化失败案例来改进提示,但其优化过程仍为黑箱且无标签,导致不可解释的轨迹和系统性失败。我们识别并实证了四个限制:在GSM8K上使用缺陷种子时,GEPA将准确性从23.81%降至13.50%。我们提出VISTA,一种多智能体APO框架,通过解耦假设生成与提示重写,实现语义标注的假设、并行小批量验证和可解释的优化轨迹。结合随机重启和epsilon-贪婪采样的两层探索-利用机制进一步逃离局部最优。VISTA在相同缺陷种子上恢复准确性至87.57%,并在GSM8K和AIME2025上所有条件下均优于基线。

英文摘要

Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failures. We identify and empirically demonstrate four limitations; for example, on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and interpretable optimization traces. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further helps escape local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.
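
The abstract names a two-layer explore-exploit mechanism (random restart plus epsilon-greedy sampling) without detailing it. The sketch below shows the two generic ingredients under stated assumptions: candidates are scored on a validation minibatch, epsilon-greedy picks a random candidate with probability `epsilon` and the best-scoring one otherwise, and a restart is triggered when the best score has stagnated for `patience` rounds. The function names, the stagnation rule, and the `patience` parameter are hypothetical and not taken from the paper.

```python
import random
from typing import Sequence


def select_candidate(scores: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Epsilon-greedy: explore a random candidate with probability epsilon, else exploit the best."""
    if rng.random() < epsilon:
        return rng.randrange(len(scores))
    return max(range(len(scores)), key=scores.__getitem__)


def should_restart(best_score_history: Sequence[float], patience: int) -> bool:
    """Assumed restart rule: the last `patience` rounds did not beat the earlier best score."""
    if len(best_score_history) <= patience:
        return False
    recent_best = max(best_score_history[-patience:])
    earlier_best = max(best_score_history[:-patience])
    return recent_best <= earlier_best
```

In a loop, `select_candidate` would choose which hypothesis or prompt variant to expand next, and `should_restart` would decide when to discard the current lineage and reseed the search.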

2605.12213 2026-06-09 cs.AI 版本更新

Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

面向目标的推理用于基于RAG的记忆在对话型代理LLM系统中

Jiazhou Liang, Armin Toroghi, Yifan Simon Liu, Faeze Moradi Kalarde, Liam Gallagher, Scott Sanner

发表机构 * University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(向量人工智能研究所)

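AI summary: Proposes the Goal-Mem framework, which uses goal-oriented reasoning to improve RAG-based memory on complex tasks, with the largest gains on multi-hop reasoning and implicit inference.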

详情
AI中文摘要

基于LLM的对话型AI代理在长时间范围内维持一致行为存在困难,因为上下文有限。虽然RAG方法通过外部记忆模块存储交互并进行检索来克服这一限制,但其在回答具有挑战性的问题(如多跳、常识推理)上的有效性最终取决于代理对检索信息的推理能力。然而,现有方法通常基于语义相似性检索原始用户语句,缺乏对缺失中间事实的显式推理,且常返回无关或不足的证据。本文引入Goal-Mem,一种面向目标的推理框架,通过从用户语句作为目标进行逆向推导。而非逐步扩展检索上下文,Goal-Mem将每个目标分解为原子子目标,进行针对性记忆检索以满足每个子目标,并迭代识别在中间目标无法解决时应从记忆中检索哪些信息。我们通过自然语言逻辑(NLL)形式化这一过程,该逻辑系统结合了FOL的推理可验证性和自然语言的表达性。通过在两个数据集上进行广泛实验,并与九个强大的记忆基线进行比较,我们证明Goal-Mem在多个任务中表现更优,尤其在需要多跳推理和隐含推理的任务中效果显著。

英文摘要

LLM-based conversational AI agents struggle to maintain coherent behavior over long horizons due to limited context. While RAG-based approaches are increasingly adopted to overcome this limitation by storing interactions in external memory modules and performing retrieval from them, their effectiveness in answering challenging questions (e.g., multi-hop, commonsense) ultimately depends on the agent's ability to reason over the retrieved information. However, existing methods typically retrieve memory based on semantic similarity to the raw user utterance, which lacks explicit reasoning about missing intermediate facts and often returns evidence that is irrelevant or insufficient for grounded reasoning. In this work, we introduce Goal-Mem, a goal-oriented reasoning framework for RAG-based agentic memory that performs explicit backward chaining, treating the user's utterance as the goal. Rather than progressively expanding from retrieved context, Goal-Mem decomposes each goal into atomic subgoals, performs targeted memory retrieval to satisfy each subgoal, and iteratively identifies what information should be retrieved from memory when intermediate goals cannot be resolved. We formalize this process in Natural Language Logic, a logical system that combines the verifiability of reasoning provided by first-order logic (FOL) with the expressivity of natural language. Through extensive experiments on two datasets and comparisons with nine strong memory baselines, we show that Goal-Mem consistently improves performance, particularly on tasks requiring multi-hop reasoning and implicit inference.
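
The backward-chaining idea described above can be illustrated with a toy sketch: start from the goal, check whether memory already satisfies it, and otherwise decompose it into subgoals that are resolved recursively. Everything concrete here is an assumption made for illustration: the `rules` table stands in for the LLM-driven goal decomposition, exact string membership stands in for memory retrieval, and the function name `prove` is hypothetical. It is not the paper's Natural Language Logic formalism.

```python
from typing import Dict, List, Optional, Set


def prove(
    goal: str,
    memory: Set[str],
    rules: Dict[str, List[List[str]]],
    depth: int = 0,
    max_depth: int = 5,
) -> Optional[Set[str]]:
    """Return the memory facts supporting `goal`, or None if it cannot be resolved."""
    if goal in memory:  # stand-in for a targeted memory retrieval hit
        return {goal}
    if depth >= max_depth:
        return None
    # Each rule lists subgoals that together entail the goal (stand-in for LLM decomposition).
    for subgoals in rules.get(goal, []):
        support: Set[str] = set()
        for subgoal in subgoals:
            found = prove(subgoal, memory, rules, depth + 1, max_depth)
            if found is None:
                break  # this decomposition fails; try the next one
            support |= found
        else:
            return support
    return None
```

A multi-hop question such as "does Alice live in Canada?" would then be answered by resolving two subgoals from memory ("alice lives in toronto", "toronto is in canada") rather than by a single similarity search on the raw question.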

2506.06295 2026-06-09 cs.LG cs.AI cs.CL 版本更新

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

dLLM-Cache:基于自适应缓存的扩散大语言模型加速

Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyan Wei, Shaobo Wang, Yichen Zhu, Linfeng Zhang

发表机构 * Zhejiang University(浙江大学)

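AI summary: To address the high inference latency of diffusion large language models, proposes dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity, reusing intermediate computation efficiently and substantially reducing FLOPs while maintaining output quality.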

Comments Accepted by ICML 2026

详情
AI中文摘要

自回归模型长期以来主导了大语言模型领域。最近,一种基于扩散的大语言模型(dLLMs)的新范式出现,它通过迭代去噪掩码段来生成文本。这种方法显示出显著的优势和潜力。然而,dLLMs存在高推理延迟的问题。传统的自回归模型加速技术,如键值缓存,由于dLLMs的双向注意力机制而无法兼容。为了应对这一特定挑战,我们的工作首先基于一个关键观察:dLLM推理涉及一个静态提示和一个部分动态的响应,其中大多数标记在相邻去噪步骤中保持稳定。基于此,我们提出了dLLM-Cache,一种无需训练的自适应缓存框架,它结合了长间隔提示缓存和基于特征相似性的部分响应更新。这种设计能够在不影响模型性能的情况下高效重用中间计算。在代表性dLLMs(包括LLaDA 8B和Dream 7B)上的大量实验表明,dLLM-Cache在LongBench-HotpotQA上实现了高达9.1倍的FLOPs减少,同时保持了具有竞争力的输出质量。值得注意的是,我们的方法使dLLM推理延迟在许多设置下接近自回归模型。本工作的代码公开于:https://github.com/maomaocun/dLLM-cache。

英文摘要

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1x FLOPs reduction on LongBench-HotpotQA while maintaining competitive output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. The code for this work is publicly available at: https://github.com/maomaocun/dLLM-cache.
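
The caching strategy described above (refresh prompt features only at long intervals, and update only part of the response at each step based on feature similarity) can be sketched as follows. This is an illustrative reading of the abstract, not the released implementation: the cosine-similarity criterion, the `ratio` of tokens refreshed per step, the `interval` parameter, and both function names are assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; zero vectors are treated as maximally dissimilar (returns 0.0)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def tokens_to_refresh(
    cached: Sequence[Sequence[float]],
    current: Sequence[Sequence[float]],
    ratio: float,
) -> List[int]:
    """Pick the response tokens whose features drifted most from the cache (lowest similarity)."""
    sims = [cosine(c, n) for c, n in zip(cached, current)]
    k = max(1, math.ceil(ratio * len(sims)))
    most_changed = sorted(range(len(sims)), key=sims.__getitem__)[:k]
    return sorted(most_changed)


def refresh_prompt(step: int, interval: int) -> bool:
    """Prompt features are static, so recompute them only every `interval` denoising steps."""
    return step % interval == 0
```

Tokens not selected by `tokens_to_refresh` would reuse their cached intermediate results, which is where the compute savings in this kind of scheme come from.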

2507.00322 2026-06-09 cs.CL cs.AI cs.SE 版本更新

Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

干扰导致的失败:当有缺陷机制掩盖健全机制时,语言模型在平衡括号任务中出错

Daking Rai, Samuel Miller, Kevin Moran, Ziyu Yao

Affiliations * George Mason University; University of Central Florida; Department of Computer Science

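AI summary: Explains why language models fail on balanced parentheses: some components implement reliable mechanisms while others introduce noise, and errors occur when the noisy mechanisms dominate. Proposes RASteer, which strengthens the contribution of reliable components, raising the accuracy of some models from 0% to around 100% and yielding gains of up to about 20% on arithmetic reasoning tasks.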

Comments 23 pages, 10 figures, accepted for NeurIPS 2025

详情
AI中文摘要

尽管语言模型(LMs)在编码能力方面取得了显著进步,但在生成平衡括号等简单句法任务上仍然存在困难。在本研究中,我们调查了不同规模(124M-7B)的语言模型中这些错误持续存在的潜在机制,旨在理解和减少这些错误。我们的研究揭示,语言模型依赖于多个独立做出预测的组件(注意力头和前馈神经元)。虽然一些组件在广泛的输入范围内可靠地促进正确答案(即实现“健全机制”),但其他组件可靠性较低,通过促进错误标记引入噪声(即实现“有缺陷机制”)。当有缺陷机制掩盖健全机制并主导预测时,就会发生错误。受此启发,我们引入了RASteer,一种引导方法,用于系统地识别并增加可靠组件的贡献,以提升模型性能。RASteer在平衡括号任务上显著提升了性能,将某些模型的准确率从0%提高到接近100%,且不影响模型的一般编码能力。我们进一步展示了其在算术推理任务中的更广泛适用性,实现了高达约20%的性能提升。

英文摘要

Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing "sound mechanisms"), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing "faulty mechanisms"). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from 0% to around 100% without impairing the models' general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around 20%.

2509.17446 2026-06-09 cs.LG cs.AI 版本更新

MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion

MVCL-DAF++: 通过原型感知对比对齐和由粗到细动态注意力融合增强多模态意图识别

Haofeng Huang, Yifei Han, Long Zhang, Bin Li, Yangfan He, Yaxin Xue

发表机构 * University of Shanghai for Science and Technology, China(上海理工大学,中国) Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China(中国科学院深圳先进技术研究院,中国) University of Minnesota-Twin Cities, USA(明尼苏达大学双城分校,美国) University of Leeds, UK(利兹大学,英国)

AI总结 提出MVCL-DAF++,通过原型感知对比对齐和由粗到细注意力融合,在MIntRec和MIntRec2.0上提升多模态意图识别,尤其改善稀有类识别。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

多模态意图识别(MMIR)在噪声或稀有类条件下存在语义基础薄弱和鲁棒性差的问题。我们提出MVCL-DAF++,它通过两个关键模块扩展了MVCL-DAF:(1)原型感知对比对齐,将实例与类级原型对齐以增强语义一致性;(2)由粗到细注意力融合,将全局模态摘要与令牌级特征集成以实现层次化跨模态交互。在MIntRec和MIntRec2.0上,MVCL-DAF++取得了新的最佳结果,稀有类识别WF1分别提高了+1.05%和+4.18%。这些结果证明了原型引导学习和由粗到细融合对于鲁棒多模态理解的有效性。源代码可在https://github.com/chr1s623/MVCL-DAF-PlusPlus获取。

英文摘要

Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We propose MVCL-DAF++, which extends MVCL-DAF with two key modules: (1) Prototype-aware contrastive alignment, aligning instances to class-level prototypes to enhance semantic consistency; and (2) Coarse-to-fine attention fusion, integrating global modality summaries with token-level features for hierarchical cross-modal interaction. On MIntRec and MIntRec2.0, MVCL-DAF++ achieves new state-of-the-art results, improving rare-class recognition by +1.05% and +4.18% WF1, respectively. These results demonstrate the effectiveness of prototype-guided learning and coarse-to-fine fusion for robust multimodal understanding. The source code is available at https://github.com/chr1s623/MVCL-DAF-PlusPlus.

2509.17455 2026-06-09 cs.CL cs.AI 版本更新

Understanding Benchmark Language Under Weakened Formal Semantics

弱化形式语义下的基准语言理解

Haoyang Chen, Kumiko Tanaka-Ishii

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) School of Fundamental Science and Engineering(基础科学与工程学院) Waseda University(早稻田大学)

AI总结 提出可计算表示方法,通过外部知识检索提取可执行代码,在数学推理、多步推理等基准上超越纯文本推理和单次代码执行,提供可扩展、可检查的语义证据。

Comments Accepted to Transactions of the Association for Computational Linguistics (TACL). 29 pages, 5 figures

详情
AI中文摘要

最先进的 NLP 基准需要解释指定条件、程序和异常的自然语言,通常依赖隐含假设和外部知识。在规模上构建具有证明论保证的完整语义表示通常不切实际,而纯文本推理提供的检查手段有限。本文探讨当形式语义保证被弱化时,能在多大程度上理解基准语言。我们通过提取可计算表示来研究这个问题:可执行表示,其运行时行为提供语义充分性的操作证据,包括可执行性、执行轨迹和运行时失败。我们使用外部知识检索,为基准实例诱导并迭代优化可计算表示。在数学推理、多步推理、因果推断以及规则和异常密集的法律和生物医学基准上,我们发现所提出的方法持续优于纯文本推理和单次代码执行。除了准确性,我们的分析表明,这些可计算表示提供了可扩展、可检查的语义证据:它们暴露了基准语言强制转化为可执行形式的条件和异常,为面向证明的语义和纯文本推理之间提供了实用的桥梁。

英文摘要

State-of-the-art NLP benchmarks require interpretation of natural language that specifies conditions, procedures, and exceptions, often relying on implicit assumptions and external knowledge. Constructing complete semantic representations with proof-theoretic guarantees is frequently impractical at scale, and purely text-based reasoning offers limited means of inspection. This paper asks how much understanding of benchmark language can be achieved when formal semantic guarantees are weakened. We investigate this question by extracting computables: executable representations whose runtime behavior provides operational evidence of semantic adequacy, including executability, execution traces, and runtime failures. We induce and iteratively refine computables for benchmark instances using retrieval from external knowledge. Across mathematical reasoning, multi-step reasoning, causal inference, and rule- and exception-heavy legal and biomedical benchmarks, we find that the proposed approach consistently exceeds text-only reasoning and one-shot code execution. Beyond accuracy, our analyses show that these computables provide scalable, inspectable semantic evidence: they expose conditions and exceptions benchmark language forces into executable form, offering a practical bridge between proof-oriented semantics and purely textual reasoning.

2511.11041 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

纠正文本嵌入中的均值偏差:一种改进的重归一化方法及其在MMTEB上的无训练改进

Xingyu Ren, Youran Sun, Haoyu Liang

发表机构 * GitHub

AI总结 发现句子嵌入存在一致均值偏差,提出无训练修正方法R2(投影去除均值方向),在MMTEB上38个模型中获得一致分类提升,并分析其与PCA白化的差异。

详情
AI中文摘要

我们发现当前的句子嵌入模型输出存在一致的偏差:每个嵌入$e$可分解为$\tilde e + \mu$,其中均值$\mu$在所有句子中几乎相同。我们研究了两种无训练修正方法——直接减去$\mu$(R1),或从每个嵌入中投影掉均值方向(R2)——并通过一阶误差传播论证表明,R2消除了R1保留的均值估计误差的平行分量。在Massive Multilingual Text Embedding Benchmark (MMTEB)上的38个模型中,R2取得一致的分类增益(配对$\bar t = 3.31$,38个模型中有29个$t>2$,零损失),且每个模型的均值范数$\Vert\mu\Vert$与哪些模型受益最多相关。对五个模型进行的九种方法剂量反应消融实验进一步揭示,温和的单方向去除有帮助,但完全的主成分分析(PCA)白化损害了我们测试的每个模型,并且R2与深度为一的All-but-the-Top在下游任务中相差不超过0.18个百分点,尽管$\hat\mu$与中心化的顶部主成分之间几何对齐较弱。

英文摘要

We find that current sentence-embedding models produce outputs with a consistent bias: every embedding $e$ decomposes as $\tilde e + \mu$, where the mean $\mu$ is near-identical across all sentences. We study two training-free corrections -- subtracting $\mu$ directly (R1), or projecting each embedding off the mean direction (R2) -- and show, via a first-order error-propagation argument, that R2 cancels the parallel component of mean-estimation error that R1 retains. Across 38 models on the Massive Multilingual Text Embedding Benchmark (MMTEB), R2 yields consistent classification gains (paired $\bar t = 3.31$, 29 of 38 models with $t>2$, zero losses), and the per-model mean norm $\Vert\mu\Vert$ correlates with which models benefit most. A nine-method dose-response ablation on five models further reveals that mild single-direction removal helps, but full principal component analysis (PCA) whitening hurts every model we test, and that R2 and All-but-the-Top with depth one agree within $0.18$ pp downstream despite weak geometric alignment between $\hat\mu$ and the centered top principal component.
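The two corrections named in the abstract can be illustrated with a minimal sketch. It assumes R1 is plain subtraction of an estimated mean and R2 removes each embedding's component along the unit mean direction; the function names are hypothetical, and whether the paper renormalizes afterwards is not stated in the abstract, so this sketch does not.

```python
import math


def mean_vector(embeddings):
    """Estimate the shared mean over a list of equal-length embeddings."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


def r1_subtract_mean(e, mu):
    """R1 (as read from the abstract): subtract the estimated mean directly."""
    return [x - m for x, m in zip(e, mu)]


def r2_project_off_mean(e, mu):
    """R2 (as read from the abstract): remove the component of e along mu."""
    norm = math.sqrt(sum(m * m for m in mu))
    if norm == 0.0:
        return list(e)  # no mean direction to remove
    u = [m / norm for m in mu]
    coeff = sum(x * ui for x, ui in zip(e, u))
    return [x - coeff * ui for x, ui in zip(e, u)]


# Toy data: the true mean is [3, 0]; the second estimate is off along the mean direction.
embeddings = [[3.0, 1.0], [3.0, -1.0]]
mu_exact = mean_vector(embeddings)
mu_biased = [2.0, 0.0]
r1_out = r1_subtract_mean(embeddings[0], mu_biased)
r2_out = r2_project_off_mean(embeddings[0], mu_biased)
```

In this toy case the estimate `mu_biased` is wrong only along the mean direction: R1 leaves that error in its output, while R2 depends only on the direction of the estimate and so is unaffected, which matches the abstract's claim about the parallel error component.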

2511.14143 2026-06-09 cs.CV cs.AI 版本更新

SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM

SMART: 基于音频增强多模态大模型的镜头感知视频时刻检索

An Yu, Weiheng Lu, Jian Li, Zhenfei Zhang, Yunhang Shen, Felix X. -F. Ye, Ming-Ching Chang

发表机构 * Department of Computer Science, University at Albany - SUNY(University at Albany - SUNY 计算机科学系) School of Software & Microelectronics, Peking University(北京大学软件与微电子学院) Nanjing University(南京大学) Xiamen University(厦门大学) Department of Mathematics and Statistics, University at Albany - SUNY(University at Albany - SUNY 数学与统计学系)

AI总结 提出SMART框架,融合音频与视觉特征,利用镜头感知令牌压缩技术,在多模态大模型基础上实现视频时刻检索,在Charades-STA和QVHighlights上取得显著提升。

详情
AI中文摘要

视频时刻检索是视频理解中的一项任务,旨在根据自然语言查询在未裁剪视频中定位特定时间片段。尽管近年来利用传统技术和多模态大模型在视频时刻检索方面取得了进展,但大多数现有方法仍依赖于粗粒度的时间理解和单一的视觉模态,限制了在复杂视频上的性能。为了解决这一问题,我们引入了镜头感知多模态音频增强时间片段检索(SMART),这是一个基于多模态大模型的框架,它整合了音频线索并利用了镜头级别的时间结构。SMART通过结合音频和视觉特征来丰富多模态表示,同时应用镜头感知令牌压缩,该技术选择性地保留每个镜头内的高信息令牌,以减少冗余并保留细粒度的时间细节。我们还优化了提示设计,以更好地利用视听线索。在Charades-STA和QVHighlights上的评估表明,SMART相比最先进的方法取得了显著改进,包括在Charades-STA上R1@0.5提升1.61%,R1@0.7提升2.59%。

英文摘要

Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLMs), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address this, we introduce Shot-aware Multimodal Audio-enhanced Retrieval of Temporal Segments (SMART), an MLLM-based framework that integrates audio cues and leverages shot-level temporal structure. SMART enriches multimodal representations by combining audio and visual features while applying Shot-aware Token Compression, which selectively retains high-information tokens within each shot to reduce redundancy and preserve fine-grained temporal details. We also refine prompt design to better utilize audio-visual cues. Evaluations on Charades-STA and QVHighlights show that SMART achieves significant improvements over state-of-the-art methods, including a 1.61% increase in R1@0.5 and a 2.59% gain in R1@0.7 on Charades-STA.

2512.16349 2026-06-09 cs.CV cs.AI 版本更新

Collaborative Edge-to-Server Inference for Vision-Language Models

面向视觉-语言模型的协作式边缘到服务器推理

Soochang Song, Yongjune Kim

发表机构 * Department of Electrical Engineering, Pohang University of Science and Technology (POSTECH)(浦项科技大学(POSTECH)电气工程系)

AI总结 提出一种协作式边缘到服务器推理框架,通过两阶段选择性重传策略,在降低通信成本的同时保持视觉-语言模型的推理精度。

Comments 12 pages, 15 figures, 3 tables

详情
AI中文摘要

我们提出了一种面向视觉-语言模型(VLM)的协作式边缘到服务器推理框架,该框架在保持推理精度的同时降低了通信成本。在典型部署中,边缘设备(客户端)捕获的视觉数据被传输到服务器进行VLM推理。然而,传输全分辨率图像会产生高昂的通信成本。相反,为减轻通信开销而进行的激进缩小或过度压缩可能会丢弃细粒度细节,导致精度下降。为克服这一限制,我们设计了一个通信高效的两阶段框架。在第一阶段,服务器对缩小的缩略图(全局图像)进行推理,并量化输出令牌的最小熵。如果最小熵超过预定义阈值,服务器利用VLM的内部注意力识别感兴趣区域(RoI),并请求边缘设备发送该RoI的细节保留局部图像。然后,服务器通过联合利用全局和局部图像来细化其推理。这种选择性重传策略确保仅额外传输必要的视觉内容。实验结果一致证实,所提出的框架在跨多种VQA基准测试中显著降低了通信开销,同时保持了推理精度。

英文摘要

We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, transmitting full-resolution images incurs high communication cost. Conversely, aggressive downsizing or excessive compression to mitigate communication overhead can discard fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a communication-efficient two-stage framework. In the first stage, the server performs inference on the downsized thumbnail (global image) and quantifies the min-entropy of the output tokens. If the min-entropy exceeds a predefined threshold, the server identifies a region of interest (RoI) using the VLM's internal attention and requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is additionally transmitted. Experimental results consistently confirm that the proposed framework substantially reduces communication overhead while maintaining inference accuracy across diverse VQA benchmarks.
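The first-stage decision described above can be sketched as follows. The sketch uses the standard base-2 min-entropy, $H_\infty(p) = -\log_2 \max_i p_i$. How the per-token values are aggregated is not stated in the abstract, so taking the maximum over output tokens is an assumption, as are the function names and the threshold value.

```python
import math


def min_entropy(probs):
    """Base-2 min-entropy of one token distribution: -log2 of its largest probability."""
    return -math.log2(max(probs))


def needs_retransmission(token_distributions, threshold):
    """Return True when the server should request a detailed RoI image.

    Assumption: the decision uses the least confident output token
    (maximum min-entropy over tokens); the abstract does not specify this.
    """
    worst = max(min_entropy(p) for p in token_distributions)
    return worst > threshold


# Hypothetical per-token distributions from inference on the thumbnail.
confident = [[0.9, 0.1]]
uncertain = [[0.9, 0.1], [0.5, 0.5]]
request_confident = needs_retransmission(confident, threshold=0.5)
request_uncertain = needs_retransmission(uncertain, threshold=0.5)
```

Under these assumptions only the second case triggers a request for the local image, so the extra transmission happens only when the thumbnail leaves the model unsure.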

2512.20978 2026-06-09 eess.AS cs.AI cs.LG 版本更新

GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model

GenTSE: 通过粗到细的生成语言模型增强目标说话人提取

Haoyang Li, Xuyi Zhuang, Azmat Adnan, Ye Ni, Wei Rao, Shreyas Gopal, Eng Siong Chng, Boon Siew Han, Yuanjin Zheng

发表机构 * Nanyang Technological University, Singapore(南洋理工大学,新加坡) Southeast University, China(东南大学,中国) Schaeffler Hub for Advanced REsearch (SHARE) at Nanyang Technological University, Singapore(南洋理工大学舍弗勒先进研究中心(SHARE),新加坡)

AI总结 提出GenTSE,一种两阶段解码器仅生成语言模型,先预测粗语义标记再生成细声学标记,结合冻结语言模型条件训练和直接偏好优化,在Libri2Mix上超越先前基于语言模型的系统。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

基于语言模型(LM)的生成建模已成为目标说话人提取(TSE)的一个有前景的方向,具有改善泛化能力和高保真语音的潜力。我们提出GenTSE,一种用于TSE的两阶段解码器仅生成语言模型:第一阶段预测粗语义标记,第二阶段生成细声学标记。分离语义和声学稳定了解码过程,并产生更准确的目标语音。两个阶段均使用连续的SSL或编解码嵌入,相比离散提示方法提供更丰富的上下文。为减少曝光偏差,我们采用冻结语言模型条件训练策略,使语言模型以早期检查点预测的标记为条件,以减少教师强制训练与自回归推理之间的差距。我们进一步应用直接偏好优化(DPO)以更好地将输出与感知偏好对齐。在Libri2Mix上的实验表明,GenTSE在语音质量、可懂度和说话人一致性方面超越了先前基于语言模型的系统。

英文摘要

Language Model (LM)-based generative modeling has emerged as a promising direction for target speaker extraction (TSE), offering potential for improved generalization and high-fidelity speech. We propose GenTSE, a two-stage decoder-only generative LM for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more accurate target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints, narrowing the gap between teacher-forcing training and autoregressive inference. We further apply Direct Preference Optimization (DPO) to better align outputs with perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.

2601.06599 2026-06-09 cs.CL cs.AI 版本更新

How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs

上下文如何塑造真相:LLMs中语句级真相表示的几何变换

Shivam Adarsh, Maria Maistro, Christina Lioma

发表机构 * University of Copenhagen(哥本哈根大学)

AI总结 研究LLMs中上下文如何改变真相向量,发现早期层正交、中层收敛,上下文增加向量幅度,大模型通过方向变化区分相关与无关上下文。

Comments ACL 2026

详情
AI中文摘要

大型语言模型(LLMs)通常将语句是否为真编码为其残差流激活中的向量。这些向量,也称为真相向量,已在先前工作中被研究,然而当引入上下文时它们如何变化仍未被探索。我们通过测量(1)有上下文和无上下文时真相向量之间的方向变化($\theta$)以及(2)添加上下文后真相向量的相对幅度来研究这一问题。在四个LLM和四个数据集上,我们发现:(1)真相向量在早期层大致正交,在中层收敛,在后期层可能稳定或继续增加;(2)添加上下文通常增加真相向量的幅度,即激活空间中真与假表示之间的分离被放大;(3)较大模型主要通过方向变化($\theta$)区分相关与无关上下文,而较小模型通过幅度差异显示这种区分。我们还发现与参数知识冲突的上下文比参数对齐的上下文产生更大的几何变化。据我们所知,这是首个提供上下文如何在LLMs激活空间中变换真相向量的几何特征描述的工作。

英文摘要

Large Language Models (LLMs) often encode whether a statement is true as a vector in their residual stream activations. These vectors, also known as truth vectors, have been studied in prior work; however, how they change when context is introduced remains unexplored. We study this question by measuring (1) the directional change ($\theta$) between the truth vectors with and without context and (2) the relative magnitude of the truth vectors upon adding context. Across four LLMs and four datasets, we find that (1) truth vectors are roughly orthogonal in early layers, converge in middle layers, and may stabilize or continue increasing in later layers; (2) adding context generally increases the truth vector magnitude, i.e., the separation between true and false representations in the activation space is amplified; (3) larger models distinguish relevant from irrelevant context mainly through directional change ($\theta$), while smaller models show this distinction through magnitude differences. We also find that context conflicting with parametric knowledge produces larger geometric changes than parametrically aligned context. To the best of our knowledge, this is the first work that provides a geometric characterization of how context transforms the truth vector in the activation space of LLMs.
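The two measurements can be sketched for a single pair of vectors. The sketch assumes the truth vectors have already been extracted (the abstract does not say how), reads the directional change as the angle between the two vectors, and reads the relative magnitude as a ratio of Euclidean norms; the function names and toy vectors are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def directional_change_degrees(v_ctx, v_no_ctx):
    """Angle theta (in degrees) between truth vectors with and without context."""
    dot = sum(a * b for a, b in zip(v_ctx, v_no_ctx))
    cos = dot / (norm(v_ctx) * norm(v_no_ctx))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(cos))


def relative_magnitude(v_ctx, v_no_ctx):
    """Ratio of truth-vector norms; values above 1 mean context amplified the separation."""
    return norm(v_ctx) / norm(v_no_ctx)


# Toy vectors standing in for truth vectors at one layer.
theta_orthogonal = directional_change_degrees([1.0, 0.0], [0.0, 1.0])
theta_aligned = directional_change_degrees([3.0, 4.0], [1.5, 2.0])
ratio = relative_magnitude([3.0, 4.0], [1.5, 2.0])
```

Read against the findings above, an angle near 90 degrees corresponds to the roughly orthogonal early-layer case, while an aligned pair with a ratio above 1 corresponds to context amplifying the true/false separation without changing its direction.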

2601.07994 2026-06-09 cs.CL cs.AI 版本更新

DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs

DYCP:基于LLMs的长格式对话动态上下文修剪

Nayoung Choi, Jonathan Zhang, Jinho D. Choi

发表机构 * Computer Science Emory University(计算机科学 埃默里大学)

AI总结 DYCP通过动态识别和检索对话段落,提升长格式对话中LLM的上下文管理效率,实现更精确的上下文选择和推理效率提升。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于长格式对话,其中话题频繁变化。尽管最近的LLMs支持扩展的上下文窗口,但在实践中仍需有效管理对话历史,以应对推理成本和延迟限制。我们提出了DyCP,一种轻量级的上下文管理方法,该方法在LLM外部实现,能够根据当前轮次动态识别和检索相关对话段落,无需离线内存构建。DyCP在不预设话题边界的情况下管理对话上下文,保持对话的顺序性,实现自适应和高效的上下文选择。在三个长格式对话基准(LoCoMo、MT-Bench+和SCM4LLMs)和多个LLM后端上,DyCP在下游生成任务中实现了具有竞争力的答案质量,具有更选择性的上下文使用和改进的推理效率。

英文摘要

Large Language Models (LLMs) increasingly operate over long-form dialogues with frequent topic shifts. While recent LLMs support extended context windows, dialogue history must still be managed efficiently in practice because of inference cost and latency constraints. We present DyCP, a lightweight context management method implemented outside the LLM that dynamically identifies and retrieves relevant dialogue segments conditioned on the current turn, without offline memory construction. DyCP manages dialogue context while preserving the sequential nature of dialogue without predefined topic boundaries, enabling adaptive and efficient context selection. Across three long-form dialogue benchmarks (LoCoMo, MT-Bench+, and SCM4LLMs) and multiple LLM backends, DyCP achieves competitive answer quality in downstream generation, with more selective context usage and improved inference efficiency.

2601.12263 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

多模态生成式引擎优化:针对视觉-语言模型排序器的排名操纵

Yixuan Du, Chenxiao Yu, Haoyan Xu, Ziyi Wang, Yue Zhao, Xiyang Hu

发表机构 * Georgetown University(乔治城大学) University of Southern California(南加州大学) University of Maryland, College Park(马里兰大学学院公园分校) Arizona State University(亚利桑那州立大学)

AI总结 提出多模态生成式引擎优化(MGEO)方法,通过联合优化图像扰动和文本后缀,利用视觉-语言模型内部跨模态知识耦合,实现对产品排名的有效操纵,揭示了多模态基础模型知识基础的脆弱性。

Comments Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM) at ACL 2026

详情
AI中文摘要

视觉-语言模型(VLM)将视觉和文本知识整合到统一表示中,日益成为现代检索和推荐系统的基础。然而,这些模型在对多模态项目进行排序时如何可靠地利用其跨模态知识,以及其知识基础是否可以被颠覆,仍不清楚。在本文中,我们揭示了VLM在多模态产品排序中应用知识的一个基本漏洞:通过多模态生成式引擎优化(MGEO),我们展示了攻击者可以通过联合制作难以察觉的图像扰动和流畅的文本后缀,利用模型内部的跨模态知识耦合,操纵VLM的排序决策。MGEO采用交替优化策略,针对VLM中视觉和语言表示之间的深层交互,实现了远超单模态攻击和由强大商业模型驱动的启发式基线的排名操纵。我们的发现表明,表面内容质量不足以提升排名;相反,需要直接与模型内部知识利用机制对齐。这些结果对多模态基础模型中知识基础的忠实性和鲁棒性提出了重要问题,并激励了未来多模态检索系统防御机制的研究。代码见:https://github.com/glad-lab/MGEO

英文摘要

Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retrieval and recommendation systems. However, it remains unclear how reliably these models utilize their cross-modal knowledge when ranking multimodal items, and whether their knowledge grounding can be subverted. In this paper, we expose a fundamental vulnerability in how VLMs apply multimodal knowledge for product ranking: through Multimodal Generative Engine Optimization (MGEO), we show that an adversary can manipulate a VLM's ranking decisions by jointly crafting imperceptible image perturbations and fluent textual suffixes that exploit the model's internal cross-modal knowledge coupling. Using an alternating optimization strategy, MGEO targets the deep interactions between visual and linguistic representations within the VLM, achieving rank manipulations that substantially exceed those of unimodal attacks and heuristic baselines powered by strong commercial models. Our findings reveal that surface-level content quality is insufficient for rank promotion; instead, direct alignment with the model's internal knowledge utilization mechanism is required. These results raise important questions on the faithfulness and robustness of knowledge grounding in multimodal foundation models, and motivate future work on defense mechanisms for multimodal retrieval systems. Code is available at: https://github.com/glad-lab/MGEO

2601.23286 2026-06-09 cs.CV cs.AI cs.LG 版本更新

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

VideoGPA: 通过几何先验知识蒸馏实现3D一致的视频生成

Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, Yue Wang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 VideoGPA通过几何先验知识蒸馏提升视频生成的3D一致性,利用数据高效的自监督框架引导视频扩散模型,显著增强时间稳定性、几何合理性与运动一致性。

Comments 8 pages, 5 figures, ICML 2026

详情
AI中文摘要

尽管最近的视频扩散模型(VDMs)能产生视觉上令人印象深刻的结果,但它们在保持3D结构一致性方面存在根本性困难,常导致物体变形或空间漂移。我们假设这些失败是因为标准去噪目标缺乏显式的几何一致性激励。为此,我们引入VideoGPA(视频几何偏好对齐),一种数据高效的自监督框架,利用几何基础模型自动推导密集偏好信号,通过直接偏好优化(DPO)引导VDMs。该方法有效将生成分布引导至内在3D一致性,而无需人工标注。VideoGPA通过最少的偏好对显著提升了时间稳定性、几何合理性与运动一致性,在大量实验中一致优于最先进基线。

英文摘要

While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, geometric plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.

2602.00238 2026-06-09 cs.CL cs.AI cs.LG 版本更新

DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking

DIVERGE: 面向开放式信息检索的多样性增强RAG

Tianyi Hu, Niket Tandon, Akhil Arora

发表机构 * Aarhus University(奥胡斯大学) Microsoft Research(微软研究院)

AI总结 针对现有RAG系统忽略开放式信息检索中多样性需求的问题,提出Diverge框架,通过迭代反思引导的多样化视角探索和多样性感知检索支持,在保持质量的同时将多样性提升约2倍。

详情
AI中文摘要

现有的检索增强生成(RAG)系统通常假设每个查询只有一个正确答案。这种假设忽略了开放式信息检索场景,其中多个合理的答案是有价值的,并且多样性对于创造力、公平性和信息的包容性访问至关重要。我们表明,标准RAG系统未能充分利用多样化的检索上下文:简单地增加检索多样性并不一定会导致多样化的生成。为了解决这一局限性,我们提出了Diverge,一个即插即用的智能体RAG框架,通过迭代、反思引导的多样化视角探索和多样性感知检索支持来改善多样性与质量的权衡。我们进一步引入了用于表征开放式问答中多样性与质量权衡的评估指标。在多个真实世界数据集和骨干LLM上的实验表明,Diverge在竞争基线中实现了最佳的权衡,将多样性提高了约2倍,且没有明显的质量下降。这些结果揭示了当前RAG系统的系统性局限,并展示了显式多样性建模的价值。

英文摘要

Existing retrieval-augmented generation (RAG) systems often assume that each query has a single correct answer. This assumption overlooks open-ended information-seeking scenarios where multiple plausible answers are valuable, and where diversity is important for creativity, fairness, and inclusive access to information. We show that standard RAG systems fail to fully use diverse retrieved contexts: simply increasing retrieval diversity does not necessarily lead to diverse generations. To address this limitation, we propose Diverge, a plug-and-play agentic RAG framework that improves the diversity-quality trade-off through iterative, reflection-guided exploration of diverse viewpoints and diversity-aware retrieval support. We further introduce evaluation metrics for characterizing the diversity-quality trade-off in open-ended question answering. Experiments across multiple real-world datasets and backbone LLMs show that Diverge achieves the best trade-off among competitive baselines, increasing diversity by $\sim2\times$ without noticeable quality degradation. These results reveal a systematic limitation of current RAG systems and show the value of explicit diversity modeling.

2602.07774 2026-06-09 cs.IR cs.AI 版本更新

Generative Reasoning Re-ranker

生成式推理重排序器

Mingfu Liang, Yufei Li, Jay Xu, Kavosh Asadi, Xi Liu, Shuo Gu, Kaushik Rangadurai, Frank Shyu, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Jacob Tao, Shike Mei, Wenlin Chen, Santanu Kolay, Sandeep Pandey, Hamed Firooz, Luke Simon

发表机构 * Meta AI

AI总结 提出GR2框架,利用大语言模型的推理能力进行推荐重排序,通过语义ID编码、推理轨迹监督微调和强化学习优化,在Recall@5和NDCG@5上超越现有方法。

Comments 31 pages

详情
AI中文摘要

最近的研究越来越多地探索大语言模型(LLMs)作为推荐系统的新范式,因其可扩展性和世界知识。然而,现有工作存在三个关键限制:(1)大多数工作集中在检索和排序,而重排序阶段——对优化最终推荐至关重要——在很大程度上被忽视;(2)LLMs通常用于零样本或有监督微调设置,其推理能力(尤其是通过强化学习(RL)和高质量推理数据增强的能力)未被充分利用;(3)项目通常由非语义ID表示,在拥有数十亿标识符的工业系统中造成重大可扩展性挑战。为解决这些问题,我们提出生成式推理重排序器(GR2),这是一个端到端框架,具有专为重排序设计的三阶段训练流程。首先,预训练的LLM通过一个分词器对从非语义ID编码的语义ID进行中期训练,实现≥99%的唯一性。接下来,一个更强的更大规模LLM通过精心设计的提示和拒绝采样生成高质量推理轨迹,用于监督微调以赋予基础推理技能。最后,我们应用解耦裁剪和动态采样策略优化(DAPO),实现具有可验证奖励的可扩展RL监督,这些奖励专为重排序设计。在两个真实数据集上的实验证明了GR2的有效性:它在Recall@5和NDCG@5上分别超越最先进的OneRec-Think 2.4%和1.3%。消融实验证实,高级推理轨迹在各项指标上带来显著提升。我们进一步发现,RL奖励设计在重排序中至关重要:LLMs倾向于通过保留项目顺序来利用奖励黑客行为,这促使我们设计条件可验证奖励以减轻这种行为并优化重排序性能。

英文摘要

Recent studies increasingly explore Large Language Models (LLMs) as a new paradigm for recommendation systems due to their scalability and world knowledge. However, existing work has three key limitations: (1) most efforts focus on retrieval and ranking, while the reranking phase, critical for refining final recommendations, is largely overlooked; (2) LLMs are typically used in zero-shot or supervised fine-tuning settings, leaving their reasoning abilities, especially those enhanced through reinforcement learning (RL) and high-quality reasoning data, underexploited; (3) items are commonly represented by non-semantic IDs, creating major scalability challenges in industrial systems with billions of identifiers. To address these gaps, we propose the Generative Reasoning Reranker (GR2), an end-to-end framework with a three-stage training pipeline tailored for reranking. First, a pretrained LLM is mid-trained on semantic IDs encoded from non-semantic IDs via a tokenizer achieving $\ge$99% uniqueness. Next, a stronger larger-scale LLM generates high-quality reasoning traces through carefully designed prompting and rejection sampling, which are used for supervised fine-tuning to impart foundational reasoning skills. Finally, we apply Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), enabling scalable RL supervision with verifiable rewards designed specifically for reranking. Experiments on two real-world datasets demonstrate GR2's effectiveness: it surpasses the state-of-the-art OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5. Ablations confirm that advanced reasoning traces yield substantial gains across metrics. We further find that RL reward design is crucial in reranking: LLMs tend to exploit reward hacking by preserving item order, motivating conditional verifiable rewards to mitigate this behavior and optimize reranking performance.

2602.12996 2026-06-09 cs.CL cs.AI 版本更新

Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models

知道更多,更清晰:大型语言模型中知识增强的元认知框架

Hao Chen, Ye He, Yuchun Fan, Yukun Yan, Zhenghao Liu, Qingfu Zhu, Maosong Sun, Wanxiang Che

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出元认知框架,利用内部认知信号划分知识空间为掌握、混淆和缺失区域,通过差异化干预和认知一致性机制增强知识并校准置信度,实验证明优于基线方法。

详情
AI中文摘要

知识增强显著提升了大型语言模型(LLMs)在知识密集型任务中的性能。然而,现有方法通常基于模型性能等同于内部知识的简单前提,忽略了导致过度自信错误或不确定真相的知识-置信度差距。为弥合这一差距,我们提出了一种新颖的元认知框架,通过差异化干预和对齐实现可靠的知识增强。我们的方法利用内部认知信号将知识空间划分为掌握、混淆和缺失区域,指导有针对性的知识扩展。此外,我们引入了一种认知一致性机制,以同步主观确定性与客观准确性,确保校准的知识边界。大量实验表明,我们的框架持续优于强基线,验证了其在不仅增强知识能力,而且培养更好区分已知与未知的认知行为方面的合理性。所有代码均可在 https://github.com/AI9Stars/Know-More-Know-Clearer 获取。

英文摘要

Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. To bridge this gap, we propose a novel meta-cognitive framework for reliable knowledge augmentation via differentiated intervention and alignment. Our approach leverages internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, guiding targeted knowledge expansion. Furthermore, we introduce a cognitive consistency mechanism to synchronize subjective certainty with objective accuracy, ensuring calibrated knowledge boundaries. Extensive experiments demonstrate that our framework consistently outperforms strong baselines, validating its rationality in not only enhancing knowledge capabilities but also fostering cognitive behaviors that better distinguish knowns from unknowns. All code is available at https://github.com/AI9Stars/Know-More-Know-Clearer.

2602.17911 2026-06-09 cs.CL cs.AI 版本更新

Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering

面向上下文相关生物医学问答的条件门控推理

Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman, Pengcheng Jiang, Chih-Hsuan Wei, Zhizheng Wang, Zhiyong Lu, Jiawei Han

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) National Institutes of Health(美国国立卫生研究院)

AI总结 本文提出CondMedQA基准和Condition-Gated Reasoning框架,通过构建条件感知知识图谱,提升生物医学问答中条件依赖的推理能力。

详情
AI中文摘要

当前生物医学问答系统常假设医学知识是统一的,但现实临床推理本质上是条件性的:几乎所有决策都依赖于患者特定因素,如共病和禁忌症。现有基准不评估此类条件推理,检索增强或图基方法缺乏显式机制确保检索知识适用于给定上下文。为解决这一差距,我们提出CondMedQA,首个针对条件生物医学问答的基准,包含多跳问题,其答案随患者条件变化。此外,我们提出Condition-Gated Reasoning(CGR),一种新框架,构建条件感知知识图谱,并根据查询条件选择性激活或修剪推理路径。我们的发现显示,CGR更可靠地选择条件合适的答案,同时在生物医学问答基准上匹配或超越现有最佳性能,突显了显式建模条件性对稳健医疗推理的重要性。

英文摘要

Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to given context. To address this gap, we propose CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions. Furthermore, we propose Condition-Gated Reasoning (CGR), a novel framework that constructs condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions. Our findings show that CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks, highlighting the importance of explicitly modeling conditionality for robust medical reasoning.

2602.20967 2026-06-09 eess.AS cs.AI cs.SD 版本更新

Training-Free Intelligibility-Guided Observation Addition for Noisy ASR

无训练的可懂度引导的噪声ASR观测添加

Haoyang Li, Changsong Liu, Wei Rao, Hao Shi, Sakriani Sakti, Eng Siong Chng

发表机构 * Nanyang Technological University(南洋理工大学) Nara Institute of Science and Technology(奈良先端科学技术大学院大学)

AI总结 提出一种无训练的可懂度引导观测添加方法,通过后端ASR的可懂度估计推导融合权重,提升噪声环境下ASR鲁棒性,无需修改SE或ASR模型参数。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

自动语音识别(ASR)在噪声环境中严重退化。尽管语音增强(SE)前端有效抑制背景噪声,但它们常常引入损害识别的伪影。观测添加(OA)通过融合噪声和SE增强语音解决了这一问题,无需修改SE或ASR模型的参数。本文提出了一种可懂度引导的OA方法,其中融合权重从后端ASR直接获得的可懂度估计中推导。与基于训练好的神经预测器的先前OA方法不同,所提出的方法无需训练,降低了复杂度并增强了泛化能力。在多种SE-ASR组合和数据集上的大量实验表明,该方法相比现有OA基线具有强大的鲁棒性和改进。对可懂度引导的基于切换的替代方案以及帧级与话语级OA的进一步分析也验证了所提出的设计。

英文摘要

Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addresses this issue by fusing noisy and SE-enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, where fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, reducing complexity and enhancing generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and of frame-level versus utterance-level OA further validate the proposed design.
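Observation addition itself is a weighted sum of the noisy and enhanced signals, which the following minimal sketch makes concrete. The abstract does not say how the weight is derived from the intelligibility estimates, so the normalization rule in `fusion_weight` is a placeholder assumption, and the function names and sample values are hypothetical.

```python
def fusion_weight(intel_noisy, intel_enhanced):
    """Weight on the noisy signal from two intelligibility scores.

    Assumption: a simple normalization of the two scores; the paper's
    actual rule is not given in the abstract.
    """
    total = intel_noisy + intel_enhanced
    if total <= 0.0:
        return 0.5  # no usable evidence: weight both signals equally
    return intel_noisy / total


def observation_addition(noisy, enhanced, w_noisy):
    """Utterance-level OA: convex combination of the two waveforms, sample by sample."""
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced signals must have the same length")
    return [w_noisy * n + (1.0 - w_noisy) * e for n, e in zip(noisy, enhanced)]


# Toy signals and scores; the enhanced signal is judged three times as intelligible.
w = fusion_weight(intel_noisy=1.0, intel_enhanced=3.0)
fused = observation_addition([1.0, 2.0], [3.0, 6.0], w)
```

Because the fusion runs on the waveforms before recognition, neither the SE front-end nor the ASR back-end needs its parameters changed, which is the property the abstract emphasizes.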

2603.03292 2026-06-09 cs.CL cs.AI cs.IR 版本更新

From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG

从冲突到共识:通过多轮代理RAG提升医疗推理

Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan, Zhenhong Sun, Chunlin Chen, Zhi Wang

发表机构 * GitHub

AI总结 本文提出MA-RAG框架,通过多轮代理循环迭代优化外部证据和内部推理历史,提升医疗复杂推理能力,实验显示在7个医疗问答基准上表现优于现有方法。

Comments 27 pages, 8 figures, 18 tables

详情
AI中文摘要

大型语言模型(LLMs)在医疗问答中表现出高推理能力,但其产生幻觉和过时知识的倾向对医疗领域构成重大风险。虽然检索增强生成(RAG)缓解了这些问题,但现有方法依赖于噪声的token级信号,并缺乏复杂推理所需的多轮细化。本文提出MA-RAG(多轮代理RAG),通过在代理细化循环中迭代演变外部证据和内部推理历史,实现复杂医疗推理的测试时间扩展。在每一轮中,代理将候选响应间的语义冲突转换为可检索的外部证据查询,同时优化历史推理轨迹以缓解长上下文退化。MA-RAG通过利用不一致性作为主动信号来扩展自我一致性原则,并通过迭代最小化残差误差来实现稳定、高保真的医疗共识。在7个医疗问答基准上的广泛评估显示,MA-RAG在推理时间扩展和RAG基线方面均优于竞争方法,平均准确率比基础模型提高+6.8点。我们的代码可在https://github.com/NJU-RL/MA-RAG上获得。

英文摘要

Large Language Models (LLMs) exhibit high reasoning capacity in medical question-answering, but their tendency to produce hallucinations and outdated knowledge poses critical risks in healthcare fields. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In this paper, we propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic conflict among candidate responses into actionable queries to retrieve external evidence, while optimizing historical reasoning traces to mitigate long-context degradation. MA-RAG extends the self-consistency principle by leveraging the lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and mirrors a boosting mechanism that iteratively minimizes the residual error toward a stable, high-fidelity medical consensus. Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering a substantial gain of +6.8 points in average accuracy over the backbone model. Our code is available at https://github.com/NJU-RL/MA-RAG.

2603.09995 2026-06-09 cs.CL cs.AI 版本更新

Context Over Compute: Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

上下文胜过计算:人类在环在面试回答质量上优于迭代思维链提示

Kewen Zhu, Zixi Liu, Yanjing Li, Jing Chen

AI总结 本文通过对比人类在环和自动思维链提示方法,发现人类在环在面试回答质量评估中表现更优,且迭代次数更少,同时具有更高的训练效果。

详情
AI中文摘要

使用大语言模型进行行为面试评估存在独特挑战,需要结构化评估、现实面试官行为模拟和候选人培训的教育价值。我们通过两个受控实验研究思维链提示在面试回答评估和改进中的应用,使用50对行为面试问题和回答对。我们的贡献有三方面:首先,我们提供了人类在环和自动思维链改进的定量比较。使用被试内配对设计,n等于50,两种方法均显示出积极的评分改进。人类在环方法提供了显著的培训效益。信心从3.16提高到4.16(p小于0.001),真实性从2.94提高到4.53(p小于0.001,Cohen's d是3.21)。人类在环方法所需的迭代次数也少五倍(1.0对5.0,p小于0.001),并实现了完整的个人细节整合。其次,我们分析了收敛行为。两种方法都快速收敛,平均迭代次数低于1次,其中人类在环方法在最初较弱的回答中达到100%的成功率,而自动方法为84%(Cohen's h是0.82,大效应)。额外的迭代带来的收益递减,表明主要限制是上下文可用性而非计算资源。第三,我们提出了一种基于负面偏见模型的对抗性挑战机制,称为bar raiser,以模拟现实的面试官行为,尽管定量验证仍需未来工作。我们的发现表明,尽管思维链提示为面试评估提供了有用的基石,但领域特定的增强和上下文感知的方法选择对于现实和具有教育价值的结果至关重要。

英文摘要

Behavioral interview evaluation using large language models presents unique challenges that require structured assessment, realistic interviewer behavior simulation, and pedagogical value for candidate training. We investigate chain of thought prompting for interview answer evaluation and improvement through two controlled experiments with 50 behavioral interview question and answer pairs. Our contributions are threefold. First, we provide a quantitative comparison between human in the loop and automated chain of thought improvement. Using a within subject paired design with n equals 50, both approaches show positive rating improvements. The human in the loop approach provides significant training benefits. Confidence improves from 3.16 to 4.16 (p less than 0.001) and authenticity improves from 2.94 to 4.53 (p less than 0.001, Cohen's d is 3.21). The human in the loop method also requires five times fewer iterations (1.0 versus 5.0, p less than 0.001) and achieves full personal detail integration. Second, we analyze convergence behavior. Both methods converge rapidly with mean iterations below one, with the human in the loop approach achieving a 100 percent success rate compared to 84 percent for automated approaches among initially weak answers (Cohen's h is 0.82, large effect). Additional iterations provide diminishing returns, indicating that the primary limitation is context availability rather than computational resources. Third, we propose an adversarial challenging mechanism based on a negativity bias model, named bar raiser, to simulate realistic interviewer behavior, although quantitative validation remains future work. Our findings demonstrate that while chain of thought prompting provides a useful foundation for interview evaluation, domain specific enhancements and context aware approach selection are essential for realistic and pedagogically valuable results.
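
The reported effect size for the success rates (100 percent versus 84 percent) can be checked with the standard definition of Cohen's h, the difference between the arcsine-transformed proportions. The sketch below is only a worked check of that arithmetic; the function name `cohens_h` is illustrative, and the only inputs are the two proportions quoted in the abstract.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h: difference of the arcsine-transformed proportions."""
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Success rates quoted in the abstract: 100% (human in the loop) vs 84% (automated).
h = cohens_h(1.00, 0.84)
print(round(h, 2))  # prints 0.82
```

The result agrees with the 0.82 stated in the abstract, which sits just above the conventional 0.8 threshold for a large effect.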

2605.06317 2026-06-09 cs.CV cs.AI 版本更新

NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

NavOne: 一种基于顶部向下地图的视觉语言导航的一步全局规划

Dijia Zhan, Jinyi Li, Chenxi Zheng, Shaoyu Huang, Yong Li, Jie Tang, Xuemiao Xu

发表机构 * South China University of Technology(华南理工大学)

AI总结 本文提出了一种基于顶部向下地图的视觉语言导航方法,通过引入NavOne框架,实现多模态地图的单步全局路径规划,显著提升了导航效率和性能。

Comments 10 pages, 7 figures

详情
AI中文摘要

现有的视觉语言导航(VLN)方法通常采用以自身为中心的逐步导航范式,这导致误差累积并限制了效率。尽管最近的方法试图利用预建的环境地图,但它们通常依赖于逐步更新记忆图或对离散路径提案评分,这限制了连续的空间推理并造成了离散瓶颈。我们提出了顶部向下VLN(TD-VLN),将导航重新表述为在预建的顶部向下地图上的一步全局路径规划问题,并以我们新构建的R2R-TopDown数据集为支撑。为了解决这个问题,我们引入了NavOne,一个统一的框架,它在单次端到端前向传递中直接预测多模态地图上的密集路径概率。NavOne具有顶部向下地图融合器(Top-Down Map Fuser),用于联合多模态地图表示,并扩展了注意力残差(Attention Residuals)以实现空间感知的深度混合。在R2R-TopDown上的广泛实验表明,NavOne在基于地图的VLN方法中实现了最先进的性能,其规划阶段相比现有基于地图的基线方法提速8倍,相比以自身为中心的方法提速80倍,从而实现了高效全局导航。

英文摘要

Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.

2605.17301 2026-06-09 cs.CL cs.AI 版本更新

ConflictRAG: Detecting and Resolving Knowledge Conflicts in Retrieval Augmented Generation

ConflictRAG: 检测和解决检索增强生成中的知识冲突

Chenyu Wang, Yueyuan Li, Yingmin Liu, Yang Shu

发表机构 * Zhejiang University(浙江大学)

AI总结 本研究提出ConflictRAG框架,通过两阶段冲突检测模块、熵-TOPSIS框架和冲突感知RAG评分,有效检测和解决检索增强生成中的知识冲突,实验表明其在冲突检测F1和正确性方面优于现有方法。

Comments 6 pages, 6 figures, submitted to IEEE SMC 2026

详情
AI中文摘要

检索增强生成(RAG)系统隐式假设检索文档之间相互一致,但这一假设在实践中经常失效。我们提出了ConflictRAG,一种具有冲突意识的RAG框架,能够在生成答案之前检测、分类和解决知识冲突。该框架引入了三个贡献:(1)一个两阶段冲突检测模块,结合基于嵌入的轻量级MLP分类器和选择性LLM细化,使API成本降低62%,同时保持90.8%的检测准确率;(2)一个熵-TOPSIS框架用于数据驱动的来源可信度评估,比手动启发式方法提高7.1%的选取准确率;(3)一个冲突感知RAG评分(CARS)用于诊断冲突处理能力。在三个基准测试中对六个基线的实验表明,冲突检测F1达到88.7%,并且相比最强的冲突感知基线,正确性稳定提升5.3-6.1%。该流程能够有效地迁移到不同的基础LLM上。

英文摘要

Retrieval-Augmented Generation (RAG) systems implicitly assume mutual consistency among retrieved documents -- an assumption that frequently fails in practice. We present ConflictRAG, a conflict-aware RAG framework that detects, classifies, and resolves knowledge conflicts prior to answer generation. The framework introduces three contributions: (1) a two-stage conflict detection module combining a lightweight embedding-based MLP classifier with selective LLM refinement, reducing API costs by 62% while maintaining 90.8% detection accuracy; (2) an Entropy-TOPSIS framework for data-driven source credibility assessment, improving selection accuracy by 7.1% over manual heuristics; and (3) a Conflict-Aware RAG Score (CARS) for diagnostic evaluation of conflict-handling capabilities. Experiments on three benchmarks against six baselines demonstrate 88.7% conflict-detection F1 and consistent 5.3--6.1% correctness gains over the strongest conflict-aware baseline, with the pipeline transferring effectively across backbone LLMs.
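
Entropy-TOPSIS, named in contribution (2), usually means weighting criteria by the entropy weight method and then ranking alternatives by their closeness to an ideal point. The sketch below shows that generic textbook procedure, not the paper's implementation: the three criteria columns (labelled here as authority, recency and agreement) and all numbers are hypothetical, every criterion is assumed to be positive and benefit-type (larger is better), and at least two sources with differing values are assumed.

```python
import math


def entropy_weights(matrix):
    """Entropy weight method: criteria whose values vary more get more weight."""
    n = len(matrix)
    divergences = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [x / total for x in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(1.0 - entropy)
    norm = sum(divergences)
    return [d / norm for d in divergences]


def topsis(matrix, weights):
    """Closeness of each alternative to the ideal point (benefit criteria only)."""
    m = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(m)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(m)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(m)]
    worst = [min(row[j] for row in weighted) for j in range(m)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)    # distance to the ideal source
        d_worst = math.dist(row, worst)  # distance to the anti-ideal source
        scores.append(d_worst / (d_best + d_worst))
    return scores


# Hypothetical sources scored on (authority, recency, agreement).
sources = [
    [0.9, 0.8, 0.7],
    [0.5, 0.6, 0.9],
    [0.2, 0.3, 0.4],
]
weights = entropy_weights(sources)
scores = topsis(sources, weights)
```

In this toy data the third source is worst on every criterion, so it coincides with the anti-ideal point and receives a closeness score of zero, while the first source ranks highest because it leads on the criterion with the largest spread.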

2605.19266 2026-06-09 cs.CL cs.AI 版本更新

FormalASR: End-to-End Spoken Chinese to Formal Text

FormalASR: 语音中文到正式文本的端到端系统

Wanyi Ning, Yinshang Guo, Haitao Qian, Jiyuan Cheng, Weiyuan Feng, Yufei Zhang

发表机构 * arXiv

AI总结 本文提出FormalASR,一种端到端的中文语音到正式文本转换模型,通过构建大规模的语音到正式文本数据集,并使用Qwen3-ASR进行微调,相比逐字转录基线实现了最高37.4%的相对CER降低,同时提升了ROUGE-L和BERTScore指标,提供了一个轻量级的设备端解决方案。

详情
AI中文摘要

自动语音识别(ASR)系统通常针对逐字转录进行优化,这保留了不流畅、填充词和非正式口语结构,这些结构往往不适合下游面向写作的应用。常见的解决方法是ASR+LLM的两阶段流程用于后期编辑,但这种设计增加了延迟和内存成本,并且难以在设备上部署。我们提出了FormalASR,两个紧凑的端到端模型(0.6B和1.7B),可直接将中文语音转录为正式书面文本。为了实现这一目标,我们构建了WenetSpeech-Formal和Speechio-Formal两个大规模的语音到正式文本数据集,通过基于LLM的重写和质量过滤构建。然后我们使用监督微调对Qwen3-ASR进行两个规模(0.6B和1.7B)的微调。在WenetSpeech-Formal和Speechio-Formal上的实验表明,FormalASR相比逐字转录基线实现了最高37.4%的相对CER降低,同时也提高了ROUGE-L和BERTScore。FormalASR在部署时不需要后处理LLM,提供了一个轻量级的设备端解决方案用于语音到正式转录。

英文摘要

Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.

2605.28831 2026-06-09 cs.CL cs.AI 版本更新

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

S3Mem:用于长时域交互式问答的结构化时空场景-事件记忆

Encheng Su, Jianyu Wu, Jinouwen Zhang, Qiucheng Yu, Chen Tang, Pengze Li, Lintao Wang, Aoran Wang, Xinzhu Ma, Shixiang Tang, Yizhou Wang, Houqiang Li

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) City University of Hong Kong(香港城市大学) The Chinese University of Hong Kong(香港中文大学) Fudan University(复旦大学) The University of Sydney(悉尼大学) Beihang University(北京航空航天大学)

AI总结 提出S3MEM框架,通过结构化场景-事件记忆和锚点敏感检索,在长时域交互式问答中实现比通用记忆接口更优的准确率-效率平衡。

详情
AI中文摘要

长时域交互代理通常积累大量轨迹历史,但仍无法可靠地回答关于早期事件的问题。我们认为主要瓶颈不仅是上下文长度,而是长期记忆的轨迹到答案接口。当历史以纯文本块存储并使用标准检索增强生成(RAG)查询时,系统通常检索到局部相关但链不完整的证据,特别是对于空间、时间、重复事件和多跳状态问题。我们提出S3MEM,一种用于长时域交互式问答(QA)的结构化场景-事件情节记忆框架。S3MEM将轨迹写入结构化记忆单元,通过锚点敏感检索检索证据,并为答案时间推理提供紧凑的令牌预算感知证据接口。从这个意义上说,S3MEM是一种结构化证据利用工具,将代理轨迹转换为查询对齐的支持。我们在两个内部标题环境(Crafter、Jericho)和两个外部环境(SciWorld、ALFWorld)上评估S3MEM。在共享的冻结答案时间协议下,S3MEM在所有四个环境中一致优于Vanilla RAG,在Crafter、Jericho和ALFWorld上超过Graph-NoReader,在SciWorld上与之匹配,同时使用的证据令牌显著减少。三个改编的近期基线——A-MEM启发、MemoryOS改编和LightMem改编——在多个设置中优于Vanilla RAG,但没有一个达到S3MEM的整体准确率-效率前沿。总体而言,证据支持一个有限的结论:在当前冻结的答案时间协议下,结构化写入和锚点敏感证据路由为长时域交互式QA提供了比通用记忆接口更强的准确率-效率前沿。

英文摘要

Long-horizon memory question answering often requires sparse evidence from heterogeneous histories, including events, object states, visual observations, temporal relations, and causal steps. Existing memory interfaces expand reader context, retrieve semantically related chunks, or expose graph neighborhoods, but they are not explicitly designed to select compact evidence for a fixed reader. We propose Structured Spatiotemporal Scene-Event Memory (S3Mem), a query-time memory interface that writes textual, visual, and agent-use histories into structured scene-event units and routes compact evidence packs to the reader. Its router scores candidate units, query anchors, and anchor-support links, enabling both single-hop selection and short multi-hop evidence chains without reader fine-tuning or test-time training. Across LoCoMo, EMemBench Visual Games, and AMA-Bench, S3Mem provides a strong score-token trade-off, with the clearest gains on localized event, state, temporal, causal, or provenance evidence. On LoCoMo, S3Mem reaches 0.48 F1 and 0.40 BLEU with 1,073 evidence tokens per question, about 15.8× fewer than the LoCoMo reference. On EMemBench Visual Games, it obtains the best F1 and second-best accuracy with only 189 tokens. On AMA-Bench, it is not the highest-scoring method, but remains competitive while using the fewest reader-visible evidence tokens.

2606.00094 2026-06-09 cs.CV cs.AI 版本更新

Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry

显式建模数据流形几何的扩散图像生成

Duoduo Xue, Zhiyu Zhu, Junhui Hou

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 提出MIND框架,通过将离散补丁标记化集成到连续扩散模型的得分函数中显式建模流形几何,结合离散标记的结构量化能力和连续扩散的并行生成灵活性,在ImageNet 256×256上显著降低FID。

详情
AI中文摘要

图像生成模型旨在从底层数据流形中采样数据点,这需要学习并解码一个密集、低维且紧凑的参数化空间。为此,我们提出了数据流形感知图像扩散模型(MIND),一种通过将离散补丁标记化集成到连续扩散模型的得分函数中来显式建模流形几何的新框架。该方法成功利用了离散标记的结构量化能力和连续扩散的并行生成灵活性。此外,我们通过一种新颖的软top-$k$聚合机制实现了端到端可微训练,并引入了双分支高频特征嵌入层以缓解Transformer主干网络在低维输入上的谱偏差。进一步地,在推理阶段,我们设计了一种多阶段过渡采样方案,根据时间步动态调整采样方案。在ImageNet 256×256上的大量实验证明了MIND的有效性。经过80个epoch的训练,我们的基础模型在无引导情况下实现了22.73的FID,几乎将原始DiT-B/2基线的43.47 FID减半。与基线DiT和SiT相比,所提方法平均分别降低了15.95和9.06的FID。对于ImageNet-256×256上的引导图像生成,所提MIND-B仅用130M参数就实现了2.06的FID,超过了具有3.1B参数的LlamaGen-3B。所提MIND-XL具有715M参数,进一步将FID降低至1.95。我们的MIND为基于扩散的图像生成引入了全新视角,为该领域的未来研究和创新铺平了道路。代码将公开提供。

英文摘要

Image generative models aim to sample data points from the underlying data manifold, a task that requires learning and decoding a dense, low-dimensional, and compact parameterization space. To achieve this, we propose the Data Manifold-aware Image diffusioN moDel (MIND), a novel framework that explicitly models manifold geometry by integrating discrete patch tokenization into the score function of a continuous diffusion model. This approach successfully leverages both the structural quantification capabilities of discrete tokens and the parallel generation flexibility of continuous diffusion. Moreover, we enable end-to-end differentiable training via a novel soft top-$k$ aggregation mechanism and introduce dual-branch high-frequency feature embedding layers to alleviate the spectral bias of transformer backbones on low-dimensional inputs. Furthermore, for inference, we design a multi-stage transition sampling scheme that dynamically adjusts the sampling scheme based on timestep. Extensive experiments on ImageNet 256$\times$256 demonstrate the effectiveness of MIND. After 80-epoch training, our base model achieves an FID of 22.73 without guidance, nearly halving the 43.47 FID of the vanilla DiT-B/2 baseline. The proposed method reduces FID by 15.95 and 9.06 on average compared with the baselines DiT and SiT, respectively. For image generation on ImageNet 256$\times$256 with guidance, the proposed MIND-B with only 130M parameters achieves an FID of 2.06, surpassing LlamaGen-3B with 3.1B parameters. The proposed MIND-XL with 715M parameters further reduces the FID to 1.95. Our MIND introduces a fresh perspective on diffusion-based image generation, paving the way for future research and innovation in this community. The code will be publicly available.

2606.01637 2026-06-09 cs.CL cs.AI 版本更新

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

误导比纠正更容易:LLM 从众中的有害与有益修正

Jiaming Qu, Lucheng Fu, Yibo Hu

发表机构 * Amazon(亚马逊) Georgia Institute of Technology(佐治亚理工学院) Illinois Institute of Technology(伊利诺伊理工学院)

AI总结 通过控制实验,研究大语言模型在多智能体系统中面对同伴答案时的从众行为,发现同伴一致意见更容易误导原本正确的模型,而权威标签使模型更倾向于选择被认可的答案,且通用推理干预无法可靠地减少有害修正。

详情
AI中文摘要

大语言模型越来越多地用于多智能体系统,在这些系统中,它们会看到并回应其他智能体的答案。一个关键风险是从众:模型可能仅仅因为其他人同意不同的答案而放弃自己的答案。先前的研究表明,LLM 经常向多数答案修正,但仍不清楚这些修正帮助纠正错误的频率是否与引入新错误的频率相当。在本文中,我们进行了一项受控研究,其中 LLM 首先回答一个问题,然后在做出最终决定之前看到模拟的同伴回应。我们操纵两个社会线索:共识结构和分配给同伴的权威标签,并测量它们如何影响有益和有害的修正。在四个开放权重的 LLM 和七个问答数据集上,我们发现同伴一致意见使得误导原本正确的模型比纠正原本错误的模型容易得多。权威标签使模型更可能选择被认可的答案,无论其是否正确。更令人担忧的是,通用的推理干预(如思维链和反思)并不能在保留有益修正的同时可靠地减少有害修正。这些发现表明,多智能体 LLM 系统应该验证同伴答案,而不是简单地聚合它们。

英文摘要

Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. Prior studies show that LLMs often revise toward a majority answer, but it remains unclear whether these revisions help correct mistakes as often as they introduce new errors. In this paper, we conduct a controlled study in which an LLM first answers a question, then sees simulated peer responses before making a final decision. We manipulate two social cues: consensus structure and authority labels assigned to peers, and measure how they influence beneficial and harmful revisions. Across four open-weight LLMs and seven QA datasets, we find that peer agreement makes it much easier to mislead initially correct models than to correct initially wrong ones. Authority labels make models more likely to choose the endorsed answer, regardless of whether it is correct. More concerningly, generic reasoning interventions such as chain-of-thought and reflection do not reliably reduce harmful revision while preserving beneficial revision. These findings suggest that multi-agent LLM systems should verify peer answers rather than simply aggregate them.
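
One way to make the distinction between harmful and beneficial revision concrete is to measure each as a conditional rate: how often an initially correct answer ends up wrong, and how often an initially wrong answer ends up correct. The sketch below assumes that reading; the abstract does not give the exact metric definitions, the function name `revision_rates` is hypothetical, and the records are toy data rather than results from the paper.

```python
def revision_rates(records):
    """Return (harmful_rate, beneficial_rate).

    `records` holds one (initial_correct, final_correct) pair of booleans
    per question, taken before and after the model sees peer responses.
    """
    finals_when_right = [final for initial, final in records if initial]
    finals_when_wrong = [final for initial, final in records if not initial]
    harmful = (
        sum(1 for final in finals_when_right if not final) / len(finals_when_right)
        if finals_when_right else 0.0
    )
    beneficial = (
        sum(1 for final in finals_when_wrong if final) / len(finals_when_wrong)
        if finals_when_wrong else 0.0
    )
    return harmful, beneficial


# Toy records: four initially correct answers, four initially wrong ones.
records = [
    (True, False), (True, True), (True, False), (True, True),
    (False, True), (False, False), (False, False), (False, False),
]
harmful, beneficial = revision_rates(records)
```

In this toy case the harmful rate (0.5) is twice the beneficial rate (0.25), which is the kind of asymmetry the abstract describes as being easier to mislead than to correct.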

2606.01736 2026-06-09 cs.CL cs.AI 版本更新

Argument Collapse: LLMs Flatten Long-Form Public Debate

论点坍缩:LLMs 扁平化长篇公共辩论

Yekyung Kim, Yapei Chang, Chau Minh Pham, Mohit Iyyer

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

AI总结 研究大型语言模型在生成公共辩论文本时导致论点坍缩的现象,即不同模型生成的论文在主要论点、子论点和段落结构上趋于收敛,通过对比人类与LLM生成文本发现LLM的论点多样性显著降低。

详情
AI中文摘要

随着LLMs越来越多地被用于起草面向公众的论点,它们可能通过反复引入相同的、经过修饰的、看似合理的论点来扁平化公共辩论。我们研究了论点坍缩,即不同LLMs生成的论文倾向于收敛到更小的主要论点、子论点和段落级结构集合。我们比较了来自195场《纽约时报》辩论的1,039个人类回复、来自61场更长形式的《波士顿评论》论坛的448个人类回复以及23,384篇LLM生成的论文。在《纽约时报》语料库中,65.3%的人类主要论点在辩论中是唯一的,而LLM主要论点中这一比例为3.4%。要求LLMs生成多样化的答案会增加变异性,但一个典型模型只能恢复大约一半的不同人类主要论点,且增加的变异性大多落在观察到的人类论点空间之外。坍缩也出现在子论点中,在具有相同主要论点的论文中,41.0%的人类子论点是唯一的,而LLM回复中这一比例为9.1%。定性上,LLMs经常重复使用泛化和模糊的子论点,而人类更喜欢更具体和针对主题的子论点。在结构上,LLM生成的论文倾向于遵循更固定的弧线,通常以直接主张开头并迅速转向提议。同样的模式在更长的《波士顿评论》论文中也成立,表明论点坍缩不仅限于短篇回复。

英文摘要

As LLMs are increasingly used to draft public-facing arguments, they may flatten public debate by repeatedly introducing the same polished, plausible arguments. We study argument collapse, the tendency of essays generated by different LLMs to converge to a smaller set of main arguments, sub-arguments, and paragraph-level structures. We compare 1,039 human responses from 195 New York Times (NYT) debates, 448 human responses from 61 longer-form Boston Review (BR) forums, and 23,384 LLM-generated essays. In the NYT corpus, 65.3% of human main arguments are unique within a debate, compared to 3.4% of LLM main arguments. Asking LLMs to generate diverse answers adds variation, but a typical model recovers only about half of the distinct human main arguments, with much of the added variation falling outside the observed human argument space. Collapse also appears in sub-arguments, where among essays with the same main argument, 41.0% of human sub-arguments are unique versus 9.1% from LLM responses. Qualitatively, LLMs often reuse generalized and hedged sub-arguments, while humans prefer more concrete and topic-specific ones. Structure-wise, LLM-generated essays tend to follow a more fixed arc, often opening with a direct claim and moving quickly toward proposals. The same patterns hold in longer BR essays, suggesting that argument collapse extends beyond short-form responses.

2606.05816 2026-06-09 cs.CV cs.AI 版本更新

Emotion-Aware Image Generation from Korean Diary Text via LLM-based Prompt Translation and LoRA Fine-Tuning

基于LLM提示翻译和LoRA微调的韩语日记文本情感感知图像生成

Jihun Cho, Soo-Yeon Jeong, Sun-Young Ihm

发表机构 * KAIST(韩国科学技术院)

AI总结 提出一种情感感知文本到图像流水线,利用Qwen3-8B识别短日记中的隐含情感,并通过LoRA微调Stable Diffusion 3.5 Medium生成儿童手绘风格图像,同时探讨情感触发词的影响及CLIP Score作为评估指标的局限性。

Comments 4 pages, 4 figures, 2 tables, MITA 2026

详情
Journal ref
Proc. Int. Conf. Multimedia, Information Technology and its Applications (MITA), 2026
AI中文摘要

T2I模型无法有效捕捉包括日记在内的各类文本中的情感,因为它们主要关注视觉对象相关模式而非上下文情感理解。本文提出一种情感感知文本到图像流水线,从短韩语日记条目生成儿童手绘风格图像。该流水线采用Qwen3-8B识别短日记中的隐含情感,并使用基于情感触发词在儿童绘画图像上通过LoRA微调的Stable Diffusion 3.5 Medium进行图像生成。此外,本文通过实验检验情感触发词对生成图像的影响,并讨论CLIP Score作为情感感知图像生成评估指标的局限性。

英文摘要

T2I models cannot effectively capture sentiment from various types of text, including diaries, as they primarily focus on visual object-related patterns rather than contextual emotional understanding. This paper proposes an emotion-aware text-to-image pipeline that generates children's hand drawing style images from short Korean diary entries. The proposed pipeline employs Qwen3-8B for recognising implicit sentiment from short diaries, and Stable Diffusion 3.5 Medium fine-tuned with LoRA on children's drawing images with emotion-based trigger words for image generation. Additionally, this paper presents experiments examining the effect of emotion trigger words on generated images and discusses the limitations of CLIP Score as an evaluation metric for emotion-aware image generation.

2506.03106 2026-06-09 cs.CL cs.AI 版本更新

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Critique-GRPO:通过自然语言和数值反馈提升大语言模型推理能力

Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

发表机构 * HCCL, The Chinese University of Hong Kong, Hong Kong, China(香港中文大学人机通讯实验室,香港,中国) University of Cambridge, Cambridge, United Kingdom(剑桥大学,剑桥,英国) MMLab, The Chinese University of Hong Kong, Hong Kong, China(香港中文大学多媒体实验室,香港,中国) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室,上海,中国)

AI总结 本文提出Critique-GRPO框架,结合自然语言和数值反馈提升LLM推理能力,实验显示其在多个任务中优于传统方法,显著提升推理性能。

Comments Accepted by ICML 2026 Spotlight

详情
AI中文摘要

最近利用数值奖励的强化学习(RL)进展显著增强了大语言模型(LLM)的复杂推理能力。然而,我们发现纯数值反馈存在三个根本限制:性能停滞、无效的自发自我反思和持续失败。我们证明,当给陷入平台期的RL模型提供自然语言批评时,它们能够成功细化失败的解决方案。受此启发,我们提出Critique-GRPO,一种在线RL框架,整合自然语言和数值反馈进行策略优化。该方法使LLM能够同时学习初始响应和批评引导的细化,有效内化两个阶段的探索收益。大量实验显示,Critique-GRPO优于所有比较的监督和基于RL的微调方法,在八个具有挑战性的推理任务上,各种Qwen模型的平均Pass@1提升约+15.0-21.6%,Llama-3.2-3B-Instruct提升约+7.3%。值得注意的是,Critique-GRPO通过自我批评实现有效自我改进,相较于GRPO取得显著提升,例如在AIME 2024上Pass@1提升+16.7%。代码和模型已发布:https://github.com/zhangxy-2019/critique-GRPO

英文摘要

Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs). However, we identify three fundamental limitations of purely numerical feedback: performance plateaus, ineffective spontaneous self-reflection, and persistent failures. We show that plateaued RL models can successfully refine failed solutions when given natural language critiques. Motivated by this, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for policy optimization. This approach enables LLMs to learn simultaneously from initial responses and critique-guided refinements, effectively internalizing the exploration benefits of both stages. Extensive experiments show that Critique-GRPO outperforms all compared supervised and RL-based fine-tuning methods, achieving average Pass@1 improvements of approximately +15.0-21.6% on various Qwen models and +7.3% on Llama-3.2-3B-Instruct across eight challenging reasoning tasks. Notably, Critique-GRPO facilitates effective self-improvement through self-critiquing, achieving substantial gains over GRPO, e.g., a +16.7% Pass@1 improvement on AIME 2024. The code and models are released at: https://github.com/zhangxy-2019/critique-GRPO

2601.09239 2026-06-09 cs.SD cs.AI eess.AS 版本更新

DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

DSA-Tokenizer:基于流匹配层次化融合的解耦语义-声学分词器

Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Yunhe Li, Yuchen Cao, Linqi Song

发表机构 * Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系) AI Lab, Leibniz Research Center, Huawei(华为莱布尼茨研究中心人工智能实验室)

AI总结 提出DSA-Tokenizer,通过ASR监督语义令牌和mel谱重建监督声学令牌实现解耦,并引入层次化流匹配解码器和联合重构-上下文修补训练策略,实现高保真重构和跨语句语音克隆。

Comments Submit to ACL ARR 2026 May

详情
AI中文摘要

语音分词器是全离散语音大语言模型的关键构建模块。现有的分词器要么优先考虑语义编码,将语义内容与声学风格不可分离地融合,要么实现不完全的语义-声学解耦。为了实现更好的解耦,我们提出了DSA-Tokenizer,它通过不同的优化约束将语音显式解耦为离散的语义和声学令牌。具体来说,语义令牌由ASR监督以捕获语言内容,而声学令牌专注于mel谱重构以编码风格。我们进一步引入了层次化流匹配解码器和联合重构-上下文修补训练策略,使模型能够支持高保真重构和跨语句语音克隆。为了加速推理,我们蒸馏了DiT解码器,将推理采样步数减少到4步,并通过GAN微调提高合成质量。实验表明,DSA-Tokenizer提供了强大的语义-声学解耦、可靠的可控语音克隆以及低WER/CER的高效高保真生成。此外,我们的结果表明,解耦分词为下游大模型语音生成提供了更有效的接口。音频样本可在https://anonymous.4open.science/w/DSA_Tokenizer_demo/获取。

英文摘要

Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy, allowing the model to support both high-fidelity reconstruction and cross-utterance voice cloning. To speed up inference, we distill the DiT decoder to 4-step inference and improve synthesis quality with GAN fine-tuning. Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement, reliable controllable voice cloning, and efficient high-fidelity generation with low WER/CER. Moreover, our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/

7. 机器人与具身智能 33 篇

2606.07626 2026-06-09 cs.CV cs.AI 交叉投稿

Eyes All Around: Design and Analysis of 360-Degree LiDAR Perception Using Equivariant Feature Learning in Unstructured Traffic

全方位视角:非结构化交通中基于等变特征学习的360度LiDAR感知设计与分析

Pranav Darshan, Raghuveer Narayanan Rajesh, M Uttara Kumari

发表机构 * RV College of Engineering(RV工程学院)

AI总结 针对非结构化城市交通中感知难题,提出结合扇形全景处理与旋转等变稀疏卷积的360度LiDAR感知框架,在印度城市交通数据集上验证了多类别检测性能。

详情
AI中文摘要

密集非结构化城市交通中的感知仍然是自动驾驶的主要挑战,原因是道路使用者种类繁多、频繁遮挡、不规则运动模式以及缺乏标准化的道路布局。尽管基于LiDAR的3D目标检测器在结构化驾驶场景中表现出色,但大多数是为有限视场设置开发和评估的,其在全环绕360度感知下的行为仍不明确。本文研究了用于自动驾驶的360度LiDAR感知流水线,特别关注全景感知、方位角扇形空间处理以及复杂城市场景中的变换等变特征提取。本文提出了一个实用的360度感知框架,将扇形全景处理与旋转等变稀疏卷积相结合,并在一个自定义的Ouster OS0 LiDAR数据集上评估其行为,该数据集收集自多样化的印度城市交通条件。结果显示,多个目标类别的检测总体稳定,其中汽车性能最强(92.02/90.51),公交车为80.53/76.34,卡车为78.59/74.16,而行人(67.45/61.02)、骑自行车者(73.21/69.54)和骑摩托车者(71.20/68.13)得分较低,反映了在密集城市场景中检测更小且更多变的道路使用者的更大难度。

英文摘要

Perception in dense, unstructured urban traffic remains a major challenge for autonomous driving because of the wide variety of road users, frequent occlusions, irregular motion patterns, and the lack of standardized road layouts. Although recent LiDAR-based 3D object detectors have shown strong performance in structured driving scenarios, most are developed and evaluated for limited field-of-view settings, and their behavior under full-surround 360-degree sensing is still not well understood. This paper studies a 360-degree LiDAR perception pipeline for autonomous driving, with particular attention to panoramic sensing, azimuthal sector-wise spatial processing, and transformation-equivariant feature extraction in complex urban scenes. The paper presents a practical 360-degree perception framework that combines sector-wise panoramic processing with rotation-equivariant sparse convolutions and evaluates its behavior on a custom Ouster OS0 LiDAR dataset collected across diverse Indian urban traffic conditions. The results show generally stable detection across several object classes, with the strongest performance for cars at 92.02/90.51, buses at 80.53/76.34, and trucks at 78.59/74.16, while lower scores for pedestrians at 67.45/61.02, cyclists at 73.21/69.54, and motorcyclists at 71.20/68.13 reflect the greater difficulty of detecting smaller and more variable road users in dense urban scenes.

2606.07974 2026-06-09 cs.RO cs.AI 交叉投稿

PRISM: PRior-guided Imagination Sampling in world Models

PRISM:世界模型中基于先验引导的想象采样

Yuhai Wang, Jiawei Xia, Rongxuan Zhou, Xiao Hu, Yongliang Shi, Jing Du, Yang Ye

发表机构 * Northeastern University(东北大学) University of California, Berkeley(加州大学伯克利分校) Qiyuan Lab(启元实验室) University of Florida(佛罗里达大学)

AI总结 提出PRISM框架,通过从世界模型编码器提取状态条件高斯先验,并利用精度加权高斯乘积更新规划器的采样分布,在不增加架构复杂度的情况下显著提升基于模型的连续控制性能。

详情
AI中文摘要

学习到的世界模型为评估未来状态提供了强大的物理直觉。但其在连续控制中的有效性也关键取决于如何为基于模型的规划生成候选动作。我们不仅询问模型能多准确地模拟未来,还提出:哪些候选动作首先值得评估?现有规划器通常任意搜索或仅使用专家演示初始化采样均值,丢弃了专家的状态条件置信度。正确引导这一搜索需要鲁棒的动作先验,但当前方法常依赖独立的视觉编码器或大规模VLM来获取。我们认为这种架构膨胀是不必要的:完全相同的数据——以及世界模型本身学到的表示——内在地编码了智能体的动作直觉。我们提出PRISM,一个任务无关的框架,从单一数据集中提取两者,同时保持严格的架构简洁性。基于标准的JEPA风格潜在世界模型,PRISM直接在其冻结编码器上附加一个轻量级MLP,以预测状态条件高斯先验。在规划时,PRISM通过精度加权的高斯乘积更新将该先验融合到规划器的采样分布中。这种无参数、闭式整合引导采样过程,使先验在其自信处主导,在其不自信处放弃控制。PRISM在Cube上将基于世界模型的MPC成功率提升35个百分点,在PushT上提升32个百分点,且未引入显著推理开销。

英文摘要

A learned world model provides a powerful physical intuition for evaluating future states. But its effectiveness in continuous control also depends critically on how candidate actions are generated for model-based planning. Rather than solely asking how accurately a model can simulate the future, we ask: which candidate actions are worth evaluating in the first place? Existing planners typically search arbitrarily or use expert demonstrations only to initialize a sampling mean, discarding the expert's state-conditioned confidence. Properly guiding this search requires a robust action prior, yet current approaches often rely on independent visual encoders or large-scale VLMs to obtain one. We argue that this architectural bloat is unnecessary: the exact same data - and the learned representations of the world model itself - inherently encode the agent's action intuition. We introduce PRISM, a task-agnostic framework that extracts both from a single dataset while maintaining strict architectural simplicity. Building on a standard JEPA-style latent world model, PRISM attaches a lightweight MLP directly to its frozen encoder to predict a state-conditioned Gaussian prior. At plan time, PRISM fuses this prior into the planner's sampling distribution via a precision-weighted Product-of-Gaussians update. This parameter-free, closed-form integration steers the sampling process, letting the prior dominate where it is confident and cede control where it is not. PRISM improves success rates by 35 percentage points over vanilla world-model-based MPC on Cube and 32 percentage points on PushT, without introducing significant inference overhead.
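
The precision-weighted Product-of-Gaussians update mentioned in this abstract has a standard closed form: precisions (inverse variances) add, and the fused mean is the precision-weighted average of the two means. The sketch below shows that textbook identity for a single action dimension. It is not the authors' code: the function name `fuse_gaussians` is hypothetical, and treating the planner's sampling distribution as a diagonal Gaussian handled one dimension at a time is an assumption.

```python
def fuse_gaussians(mu_prior, var_prior, mu_plan, var_plan):
    """Product of two 1-D Gaussians, expressed through their precisions."""
    prec_prior = 1.0 / var_prior
    prec_plan = 1.0 / var_plan
    var = 1.0 / (prec_prior + prec_plan)  # precisions add
    mu = var * (prec_prior * mu_prior + prec_plan * mu_plan)  # weighted mean
    return mu, var


# Planner proposes mean 0.0 with variance 1.0; the prior suggests mean 1.0.
confident = fuse_gaussians(mu_prior=1.0, var_prior=0.25, mu_plan=0.0, var_plan=1.0)
uncertain = fuse_gaussians(mu_prior=1.0, var_prior=4.0, mu_plan=0.0, var_plan=1.0)
```

With a confident prior (variance 0.25) the fused mean moves to about 0.8, close to the prior, whereas with an uncertain prior (variance 4.0) it stays near 0.2, close to the planner's own proposal. That is the behaviour the abstract describes: the prior steers sampling only where it is confident.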

2606.08014 2026-06-09 cs.CV cs.AI 交叉投稿

GVC-Seg: Training-Free 3D Instance Segmentation via Geometric Visual Correspondence

GVC-Seg: 基于几何视觉对应的免训练3D实例分割

Liang Xu, Fangjing Wang, Jinyu Yang, Feng Zheng

发表机构 * Victoria University of Wellington(惠灵顿维多利亚大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Southern University of Science and Technology(南方科技大学)

AI总结 提出GVC-Seg,一种免训练的3D实例分割方法,通过几何与视觉特征对应消除多模型集成中的置信度偏差,在多个基准上达到最优性能。

Comments 10 pages, 5 figures

详情
AI中文摘要

点云数据中的精确3D实例分割对于机器视觉应用至关重要。最近的研究利用多个预训练基础模型生成3D提案,然后应用提案聚合方法,显著提升了性能。然而,由于不同分割模型之间置信度水平的固有差异,它们通常会产生次优结果,导致偏向于置信度更高的模型。这种偏差本质上是模型依赖的,并受到数据预处理技术和训练策略等因素的影响。为了解决这一偏差,我们提出了一种新颖的、免训练的3D实例分割方法,通过几何视觉对应(GVC-Seg)来利用3D几何线索与2D视觉线索之间的对应关系,以减轻置信度偏差。此外,在实例掩码生成和实例语义推理过程中,分别引入了3D提案生成模块和掩码感知的CLIP特征提取模块。通过这种方式,GVC-Seg增强了提案质量评估,确保了不同模型之间的无偏集成学习。大量实验表明,我们的方法在多个具有挑战性的基准上达到了最先进的性能,同时在开放词汇语义分割设置中也展现出强大的潜力。

英文摘要

Accurate 3D instance segmentation in point cloud data is critical for machine vision applications. Recent advancements leverage multiple pre-trained foundation models to generate 3D proposals, followed by the application of proposal aggregation methods, which significantly enhance performance. However, they often produce sub-optimal results due to inherent variations in confidence levels across different segmentation models, resulting in a bias toward the model with higher confidence. This bias is inherently model-dependent and is influenced by factors such as data preprocessing techniques and training strategies. To address this bias, we propose a novel, training-free 3D instance segmentation approach via Geometric Visual Correspondence (GVC-Seg), which exploits the correspondence between 3D geometric cues and 2D visual cues to mitigate the confidence bias. Additionally, a 3D proposal generation module and a mask-aware CLIP feature extraction module are introduced during the instance mask generation and instance semantic reasoning, respectively. In this way, GVC-Seg enhances proposal quality assessment, ensuring unbiased ensemble learning across different models. Extensive experiments demonstrate that our method achieves state-of-the-art performance on several challenging benchmarks, while also exhibiting strong potential in open-vocabulary semantic segmentation settings.

2606.08057 2026-06-09 cs.RO cs.AI 交叉投稿

EgoAERO: Learning Dexterous Manipulation from a Single Egocentric Video without Object Assets

EgoAERO:无需物体资产,从单个第一人称视频学习灵巧操作

Yichen Niu, Haoran Lv, Xinrui Zhang, Xueyao Wan, Shiyu Gao, Ying Ai, Hui Xu, Yongqi Hu, Hengyi Zhang, Yang Xie, Zhaxizhuoma, Yue Zhao, Zhenshan Bing, Yan Ding, Jianxing Liu

发表机构 * School of Astronautics, Harbin Institute of Technology(哈尔滨工业大学航天学院) Lumos Robotic Suzhou Research Institute, Harbin Institute of Technology(哈尔滨工业大学苏州研究院) Shanghai Jiao Tong University(上海交通大学) Shanghai AI Lab(上海人工智能实验室) Nanjing University(南京大学) Xi’an Jiaotong-Liverpool University(西交利物浦大学) Fudan University(复旦大学)

AI总结 提出EgoAERO框架,无需物体资产,从单个第一人称RGB-D视频中通过无资产物体跟踪与重建、自我运动补偿和自适应接触优化重建接触一致的手-物轨迹,并利用两阶段残差学习转化为机器人策略,实现单次演示的灵巧操作。

详情
AI中文摘要

第一人称RGB-D视频提供了人类灵巧操作演示的自然来源,但现有数据难以用于机器人学习,因为物体姿态、几何和接触信息常常缺失或需要预先扫描的物体资产。我们提出EgoAERO,这是第一个无需物体资产、从单个第一人称RGB-D人类演示中学习灵巧操作的框架。EgoAERO通过无资产物体跟踪与重建、自我运动补偿和自适应接触优化重建接触一致的手-物轨迹,然后利用两阶段残差学习将其转化为机器人策略。我们进一步引入在线质量评估机制,并构建EgoDex-R,一个包含430万RGB-D帧的大规模第一人称数据集,用于灵巧策略学习。仿真和真实世界实验表明,EgoAERO能够实现单次演示的灵巧操作,并在HOI4D上达到接近基于CAD重建的下游性能。

英文摘要

Egocentric RGB-D videos offer a natural source of human dexterous manipulation demonstrations, but existing data is difficult to use for robot learning because object pose, geometry, and contact information are often missing or require pre-scanned object assets. We present EgoAERO, the first framework that learns dexterous manipulation from a single egocentric RGB-D human demonstration without object assets. EgoAERO reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego motion compensation, and adaptive contact optimization, then converts them into robot policies using two-stage residual learning. We further introduce an online quality assessment mechanism and construct EgoDex-R, a large-scale egocentric dataset with 4.3M RGB-D frames for dexterous policy learning. Simulation and real-world experiments show that EgoAERO enables single-demonstration dexterous manipulation and achieves downstream performance close to CAD-based reconstructions on HOI4D.

2606.08094 2026-06-09 cs.RO cs.AI cs.LG cs.SY eess.SY 交叉投稿

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

vla.cpp:视觉-语言-动作模型的统一推理运行时

Khanh D. Nguyen, Hung T. Ho, Chinh T. Nguyen, Thanh Q. Duong, Linh D. Le, Duy M. H. Nguyen, Vien A. Ngo, An T. Le

发表机构 * VinRobotics Center for AI Research, VinUniversity(VinUniversity 人工智能研究中心) Intelligent Autonomous Systems, TU Darmstadt(达姆施塔特工业大学智能自主系统) Max Planck Research School for Intelligent Systems(马克斯·普朗克智能系统研究学院) University of Stuttgart(斯图加特大学) German Research Center for Artificial Intelligence(德国人工智能研究中心)

AI总结 提出vla.cpp,基于llama.cpp的便携C++推理运行时,支持多种VLA架构,在LIBERO-Object上接近SOTA性能,内存仅1.3 GiB,并实现跨硬件部署。

Comments 17 pages, 3 figures, 12 tables

详情
AI中文摘要

视觉-语言-动作(VLA)策略通常以Python/PyTorch堆栈形式提供,假设使用工作站级GPU,这与机器人实际运行的硬件不匹配。我们提出了vla.cpp,一个基于llama.cpp的便携式C++推理运行时。据我们所知,它是第一个原生支持流匹配和扩散VLA推理模式的ggml类引擎,其中缓存的视觉-语言前缀由交叉注意力动作专家在多个求解器步骤中消耗。单个运行时通过一个请求/响应协议服务于跨越五个骨干网络和四个动作头家族的七种架构,每个模型打包为自包含的捆绑包。在LIBERO-Object上,该引擎在200个回合中与最先进的检查点相差不到一个回合,并以1.3 GiB内存运行BitVLA达到100%成功率。相同的捆绑包在三个硬件层级上不变地运行,从消费级GPU到8 GB嵌入式模块。跨硬件屋顶线分析表明,批量大小为1的VLA推理受计算限制,因此利用率而非带宽是部署杠杆;由此分析得出的IMMA阶梯式GEMM将BitVLA每步延迟缩短至原来的1/4.5。然后,我们在ALOHA机械臂上设计了一个机器人端压力测试,隔离了学习型VLA必须在其训练所面向的硬件上针对移动目标重新规划的延迟约束。代码、演示视频和可重复的基准测试框架可在https://fai-modelopt-tech.github.io/vla-cpp.github.io/获取。

英文摘要

Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a cross-attending action expert integrated over several solver steps. A single runtime serves seven architectures spanning five backbone and four action-head families behind one request/response protocol, with each model packaged as a self-contained bundle. On LIBERO-Object, the engine matches a state-of-the-art checkpoint to within one episode out of 200, and runs BitVLA at 100% success in 1.3 GiB of memory. The same bundle runs unchanged across three hardware tiers, from a consumer GPU down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound, so utilization rather than bandwidth is the deployment lever; an IMMA ladder GEMM derived from this analysis cuts BitVLA per-step latency by 4.5x. We then frame an on-robot stress test on an ALOHA arm that isolates the latency constraint under which a learned VLA must replan against a moving target on the hardware it was trained for. Code, demo videos, and the reproducible benchmark scaffold are available at https://fai-modelopt-tech.github.io/vla-cpp.github.io/.
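
The compute-bound versus bandwidth-bound distinction behind the roofline analysis can be sketched with the standard roofline formula. The hardware numbers and function names below are hypothetical and are not taken from the paper; the sketch only shows how a kernel's arithmetic intensity is compared against the ridge point.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Roofline bound in FLOP/s.

    peak_flops:    compute roof in FLOP/s
    mem_bandwidth: memory bandwidth in bytes/s
    intensity:     arithmetic intensity in FLOP/byte
    """
    return min(peak_flops, mem_bandwidth * intensity)


def is_compute_bound(peak_flops: float, mem_bandwidth: float, intensity: float) -> bool:
    """A kernel is compute-bound when its intensity reaches the ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOP/byte where the two roofs meet
    return intensity >= ridge


# Hypothetical device: 10 TFLOP/s peak, 100 GB/s bandwidth -> ridge at 100 FLOP/byte.
PEAK = 10e12
BANDWIDTH = 100e9

high = attainable_flops(PEAK, BANDWIDTH, 250.0)  # capped by the compute roof
low = attainable_flops(PEAK, BANDWIDTH, 20.0)    # capped by memory bandwidth
```

For a compute-bound kernel, more bandwidth does not raise the bound; only better use of the compute units does, which is the sense in which the abstract calls utilization the deployment lever.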

2606.08107 2026-06-09 cs.RO cs.AI 交叉投稿

Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data

Ego-Pi: 面向自我中心人类与机器人数据的VLA微调

Ji Woong Kim, Ke Wang, Zipeng Fu, Sirui Chen, Cong Zhao, Jeff Lai, Chelsea Finn

发表机构 * Stanford University(斯坦福大学) Meta

AI总结 为解决机器人数据稀缺问题,利用自我中心人类数据,基于π₀.₅模型微调,使机器人学习新任务语义并组合现有技能,无需对应机器人数据。

详情
AI中文摘要

机器人技术面临数据稀缺的根本挑战。与语言或视觉研究不同,机器人操作没有互联网规模的数据集。一个有前景的途径是利用自我中心人类数据,这类数据更容易收集、范围更广且规模更大。为此,我们研究了跨人类和配备灵巧五指手的类人机器人实体学习的关键设计选择,以$\pi_{0.5}$模型为基础。我们的结果表明,人类数据使机器人能够学习新的任务语义,并将现有技能组合成新颖的行为,而无需相应的机器人数据。论文网站:https://egopipaper.github.io/

英文摘要

Robotics faces a fundamental challenge of data scarcity. Unlike language or vision research, there is no internet-scale dataset for robotic manipulation. A promising path forward is to leverage egocentric human data, which can be collected more easily, with greater breadth, and at a larger scale. Towards this end, we investigate key design choices for learning across human and humanoid embodiments equipped with dexterous five-finger hands, using the $\pi_{0.5}$ model as a foundation. Our results show that human data enables robots to learn new task semantics and compose existing skills into novel behaviors without corresponding robot data. The paper website is here: https://egopipaper.github.io/

2606.08169 2026-06-09 cs.RO cs.AI cs.CL cs.HC cs.LG 交叉投稿

CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning

CLASP: 基于语言驱动的机器人技能选择与组合,采用任务参数化学习

Markus Knauer, Valentin Gieraths, Tai Mai, Samuel Bustamante, Alin Albu-Schäffer, Freek Stulp, João Silvério

发表机构 * German Aerospace Center (DLR), Institute of Robotics and Mechatronics (RMC)(德国航空航天中心(DLR),机器人与机电一体化研究所(RMC)) Technical University of Munich (TUM)(慕尼黑工业大学(TUM))

AI总结 提出CLASP架构,结合任务参数化核化运动基元(TP-KMP)与预训练视觉语言模型(VLM),通过自然语言命令实现技能选择、组合和主动学习,无需微调,在7自由度机械臂上达到73.3%-100%成功率。

Comments 23 pages, 11 figures, 4 tables, 1 listing

详情
AI中文摘要

使机器人能够理解自然语言命令并执行任务,同时保持数据效率仍然具有挑战性。视觉-语言-动作(VLA)和视觉-语言模型(VLM)等基础模型提供了直观的交互通道,但需要大量数据;任务参数化模仿学习实现了数据效率,但缺乏自然语言基础。这项工作通过一个模块化架构弥合了这一差距,该架构将任务参数化核化运动基元(TP-KMP)与预训练VLM相结合。在学习过程中,技能从2到5次动觉演示中获取,VLM生成描述每个技能参数和前提条件的技能模式。在执行过程中,VLM解释命令以选择技能,推理参数绑定,并通过协方差加权组合创建新颖行为。当没有技能或组合足够时,系统识别能力差距并请求有针对性的演示,所有这些都无需微调。在7自由度机械臂上的验证显示,在需要技能选择、组合和主动学习的场景中,成功率达到73.3%-100%。

英文摘要

Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Foundation models such as vision-language-action (VLA) models and vision-language models (VLMs) provide intuitive interaction channels but require extensive data; task-parameterized imitation learning achieves data efficiency but lacks natural language grounding. This work bridges this gap through a modular architecture combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained VLMs. During learning, skills are acquired from 2 to 5 kinesthetic demonstrations, and the VLM generates skill schemas describing each skill's parameters and preconditions. During execution, the VLM interprets commands to select skills, reason about parameter bindings, and create novel behaviors through covariance-weighted composition. When no skill or composition suffices, the system identifies capability gaps and requests targeted demonstrations, all without fine-tuning. Validation on a 7-DoF manipulator shows success rates of 73.3%-100% in scenarios requiring skill selection, composition, and active learning.

2606.08414 2026-06-09 cs.RO cs.AI 交叉投稿

PACT: Self-Evolving Physical Safety Alignment for Diffusion Policies in Embodied Manipulation

PACT: 具身操作中扩散策略的自我演化物理安全对齐

Lingxuan Wu, Zijian Zhu, Lizhong Wang, Chengyang Ying, Huayu Chen, Xiao Yang, Fangming Liu, Jun Zhu

发表机构 * Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua-Bosch Joint ML Center, THBI Lab, BNRist Center, Tsinghua University, Beijing, 100084, China(计算机科学与技术系,人工智能研究院,清华-博世联合机器学习中心,THBI实验室,BNRist中心,清华大学,北京,100084,中国) Peng Cheng Laboratory, 518108, China(鹏城实验室,518108,中国)

AI总结 提出PACT框架,通过自演化后训练将预训练扩散策略投影到约束可行区域,无需演示数据或任务奖励,在降低31.0%安全违规的同时提升30.7%任务成功率。

详情
AI中文摘要

扩散策略在机器人操作中取得了显著成功,但常常无法满足安全部署所需的严格物理约束。现有方法要么在训练期间过早施加安全约束,要么在测试时通过外部护栏被动应对,限制了策略的表达能力和整体可扩展性。我们提出物理安全对齐约束轨迹(PACT),这是一个自我演化的后训练框架,将预训练扩散策略投影到约束可行区域,无需访问演示数据或任务奖励。PACT通过跨时间步密集监督的反向KL目标将约束梯度蒸馏到扩散模型中。它采用课程学习逐步收紧约束,同时保持理论上界定的策略偏移和单调改进,减轻了灾难性遗忘带来的安全-性能权衡。在模拟和真实世界的具身操作基准测试中,PACT平均减少31.0%的安全违规,同时将任务成功率提升30.7%。

英文摘要

Diffusion policies have achieved remarkable success in robotic manipulation, yet they often fail to satisfy strict physical constraints required for safe deployment. Existing approaches impose safety either prematurely during training or reactively via external guardrails at test time, limiting policy expressivity and overall scalability. We propose Physical safety Alignment for Constrained Trajectories (PACT), a self-evolving post-training framework that projects pretrained diffusion policies onto constraint-feasible regions without accessing demonstration data or task rewards. PACT distills constraint gradients into the diffusion model through a reverse-KL objective with dense supervision across timesteps. It incorporates a curriculum that progressively tightens constraints while maintaining theoretically bounded policy shift and monotone improvement, mitigating the safety-performance trade-off from catastrophic forgetting. On simulated and real-world embodied manipulation benchmarks, PACT significantly reduces safety violations by 31.0% on average while improving task success by 30.7%.

2606.08508 2026-06-09 cs.RO cs.AI 交叉投稿

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

ActProbe:面向生成式机器人策略早期故障检测的动作空间探针

Bingjia Huang, Xiangyu Li, Xiang Wang, Liang Mi, Zixu Hao, Weijun Wang, Hao Wu, Kun Li, Yunxin Liu, Ting Cao

发表机构 * Institute for AI Industry Research (AIR), Tsinghua University(清华大学人工智能产业研究院(AIR)) University of Electronic Science and Technology of China(电子科技大学) Nanjing University(南京大学)

AI总结 提出ActProbe,一种轻量级纯动作空间故障检测器,利用时间一致性误差和动作块幅度两个信号,通过LSTM-MLP架构预测故障,在多种生成式策略上提升F1-时效性帕累托前沿平均超体积增益+12.7%,并加速强化学习微调。

Comments 24 pages, 9 figures, 11 tables, Project page: https://air-embodied-brain.github.io/actprobe

详情
AI中文摘要

生成式机器人策略在部署时不可预测地失败:它们在关键时刻犹豫不决,偏离任务,或执行不可恢复的动作。现有的在线故障检测器要么需要白盒访问策略内部,要么通过重采样和观测侧信号增加运行时开销。我们的实证分析表明,发射的动作块本身已经携带了生成式机器人策略即将发生故障的强预测信号。受此观察启发,我们引入了ActProbe,一种轻量级的纯动作空间检测器,它使用单次前向传递中可用的两个紧凑信号:连续动作块之间的时间一致性误差(TCE)和当前块的动作块幅度(ACM)。ActProbe通过任务条件化的LSTM-MLP架构将这些信号映射到每步故障概率。在一系列多样化的生成式机器人策略和基准测试中,ActProbe在故障变得视觉可识别之前发出警报,相比内部和外部特征基线,将故障检测的F1-时效性帕累托前沿平均超体积增益提高了+12.7%,在未见任务上早期检测ROC-AUC领先+9.0%。ActProbe进一步迁移到部署中,预测未见真实机器人拾取任务上的故障,并以2.9倍更少的环境交互加速了强化学习微调(PPO)。

英文摘要

Generative robot policies fail unpredictably at deployment: they hesitate at critical moments, drift off-task, or commit to unrecoverable actions. Existing online failure detectors either require white-box access to policy internals or add runtime overhead through resampling and observation-side signals. Our empirical analysis shows that emitted action chunks themselves already carry strong predictive signal for impending failures in generative robot policies. Motivated by this observation, we introduce ActProbe, a lightweight, pure action-space detector that uses two compact signals available from a single forward pass: Temporal Consistency Error (TCE) between consecutive action chunks and Action Chunk Magnitude (ACM) of the current chunk. ActProbe maps these signals to per-step failure probabilities with a task-conditioned LSTM-MLP architecture. Across a diverse suite of generative robot policies and benchmarks, ActProbe raises alerts before failures become visually recognizable, improving the accuracy (F1)-timeliness Pareto frontier of failure detection by an average hypervolume gain of +12.7% over both internal- and external-feature baselines, with a +9.0% early-detection ROC-AUC lead on unseen tasks. ActProbe further transfers to deployment, predicting failures on unseen real-robot pick tasks and accelerating RL fine-tuning (PPO) with 2.9x fewer environment interactions.
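
The two action-space signals can be sketched as follows. The abstract names the signals but does not give their formulas, so the definitions here are assumptions: TCE is taken as the mean squared distance between the overlapping steps of two consecutive chunks, and ACM as the mean L2 norm of the actions in the current chunk. The function names and the `stride` parameter are hypothetical.

```python
import math
from typing import List, Sequence

Action = Sequence[float]


def temporal_consistency_error(
    prev_chunk: List[Action], curr_chunk: List[Action], stride: int
) -> float:
    """Assumed TCE: mean squared distance over the overlapping steps.

    The policy is assumed to emit a new chunk every `stride` steps, so
    prev_chunk[stride + i] and curr_chunk[i] predict the same time step.
    Both chunks are assumed to have the same length.
    """
    overlap = len(prev_chunk) - stride
    if overlap <= 0:
        raise ValueError("chunks do not overlap for this stride")
    total = 0.0
    for i in range(overlap):
        a, b = prev_chunk[stride + i], curr_chunk[i]
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return total / overlap


def action_chunk_magnitude(chunk: List[Action]) -> float:
    """Assumed ACM: mean L2 norm of the actions in one chunk."""
    return sum(math.sqrt(sum(x * x for x in a)) for a in chunk) / len(chunk)


# Hypothetical 2-D action chunks with horizon 4, re-planned every 2 steps.
prev = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
consistent = [[2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
drifting = [[2.0, 1.0], [3.0, 1.0], [4.0, 0.0], [5.0, 0.0]]

tce_consistent = temporal_consistency_error(prev, consistent, stride=2)
tce_drifting = temporal_consistency_error(prev, drifting, stride=2)
acm_consistent = action_chunk_magnitude(consistent)
```

Both quantities are computed from chunks the policy has already emitted, so no extra forward pass or access to policy internals is needed; in the paper these features feed a task-conditioned LSTM-MLP, which is not reproduced here.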

2606.08542 2026-06-09 cs.RO cs.AI cs.CV 交叉投稿

When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

当视频误读:面向探索性操作痕迹问答的阅读启发式闭环蒸馏

Haizhou Ge, Yufei Jia, Yue Li, Zhixing Chen, Lu Shi, Lei Han, Guyue Zhou, Ruqi Huang

发表机构 * Tsinghua University(清华大学) DISCOVER Robotics

AI总结 针对探索性操作中机器人误读视频痕迹的问题,提出闭环痕迹蒸馏方法,通过任务编码代理提取单行自然语言启发式提示,使冻结VLM准确预测最小成功动作链,在模拟和真实机器人任务上提升准确率0.38-0.47。

Comments 16 pages, 4 figures, 4 tables

详情
AI中文摘要

探索性操作往往将看似失败的尝试转化为下一步操作的关键证据。例如,机器人拉动锁住的抽屉失败,只有在开锁后才成功。失败的拉动揭示了潜在前提条件(抽屉被锁住),该条件决定了最小成功动作链(完成任务的最少动作),此处为[开锁,拉抽屉]。正确读取这一痕迹因此成为恢复该链的前提。我们将此设定形式化为探索性操作痕迹问答(EMT-QA):给定来自探索性痕迹的同步视频和本体感觉,预测在探测所揭示的潜在前提条件下的最小成功动作链。然而,即使最先进的VLM和具身多模态LLM也会误读这一证据:它们无法从原始视频、原始本体感觉或它们的组合中可靠地恢复动作链。我们引入闭环痕迹蒸馏,一种使用每任务编码代理检查带标签训练痕迹并蒸馏出关于痕迹的单行自然语言提示(称为蒸馏阅读启发式DRH)的流水线。推理时,不调用代理,不更新模型权重;冻结的VLM接收原始痕迹加上DRH作为提示条目。在三个模拟器和两个真实机器人任务上,DRH将链准确率比最佳原始模态基线提高0.38至0.47。相同的DRH还作为一次性程序分类器的唯一规范,其性能与提示的VLM相当。

英文摘要

Exploratory manipulation often turns an apparent failed attempt into the key evidence for what to do next. For example, a robot pulls a locked cabinet drawer, fails, and only succeeds after opening the lock. The failed pull reveals a latent precondition (the drawer is locked) that determines the minimal-success action chain (the fewest actions that complete the task), here [lock-open, drawer-pull]. Correctly reading this trace is therefore the prerequisite for recovering that chain. We formalize this setting as Exploratory Manipulation Trace QA (EMT-QA): given synchronized video and proprioception from an exploratory trace, predict the minimal-success action chain under the latent precondition revealed by the probe. However, even state-of-the-art VLMs and embodied multimodal LLMs misread this evidence: they do not reliably recover the chain from raw video, raw proprioception, or their combination. We introduce Closed-Loop Trace Distillation, a pipeline that uses a per-task coding agent to inspect labeled training traces and distill a one-line natural-language prompt over the trace, which we call the Distilled Reading Heuristic (DRH). At inference, no agent is invoked and no model weights are updated; a frozen VLM receives the raw trace plus the DRH as a prompt entry. Across three simulator and two real-robot tasks, the DRH improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The same DRH also serves as the sole specification for one-shot programmatic classifiers that match the prompted VLM.

2606.08610 2026-06-09 cs.RO cs.AI 交叉投稿

HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning

HARBOR:面向智能体机器人强化学习的框架

Zechu Li, Yufeng Jin, Xiaoyang Liu, Puze Liu, Vignesh Prasad, Carlo D'Eramo, Georgia Chalvatzaki

发表机构 * TU Darmstadt(达姆施塔特工业大学) Honda Research Institute Europe(本田欧洲研究所) Columbia University(哥伦比亚大学) Tongji University(同济大学) Shanghai Research Institute for Intelligent Autonomous Systems(上海智能自主系统研究院) University of Würzburg(维尔茨堡大学) Hessian.AI(黑森人工智能中心)

AI总结 提出HARBOR框架,通过将机器人强化学习自动化视为框架工程问题,利用专用智能体、标准化命令和可复用知识,在模拟中自动完成从环境搭建到策略训练的全流程,并在6个基准测试和16个任务中验证其有效性。

详情
AI中文摘要

强化学习已成为机器人学习的一种强大范式,特别是在模拟到现实的环境中,但其更广泛的采用仍受限于围绕算法的工程流程。构建任务、设计奖励和调整超参数需要大量专家努力,使得强化学习工作流程成本高昂且难以扩展。我们提出HARBOR,一个智能体框架,将机器人强化学习自动化视为一个框架工程问题:给定一个模拟器代码库和一个任务规范,它自动完成从环境设置到模拟中策略训练的工作流程。HARBOR将此类高级目标分解为有界阶段,由专用智能体通过标准化命令、持久化工件、可执行门和可复用知识执行,并通过去中心化并行试验和跨运行经验学习来扩展迭代。我们在6个基准测试和总共16个任务上评估HARBOR,涵盖操作、移动和双臂灵巧控制。我们证明HARBOR端到端地自动化了模拟强化学习工作流程,设计奖励,调整算法以匹配或改进默认配置,并以实用的令牌和挂钟成本减少了工程工作量;生成的策略也可以转移到真实机器人。

英文摘要

Reinforcement learning (RL) has become a powerful paradigm for robot learning, particularly in sim-to-real settings, but its broader adoption remains limited by the engineering pipeline surrounding the algorithms. Building tasks, shaping rewards, and tuning hyperparameters require substantial expert effort, making RL workflows costly and difficult to scale. We introduce HARBOR, an agentic framework that frames robot RL automation as a harness-engineering problem: given a simulator codebase and a task specification, it automates the workflow from environment setup to policy training in simulation. HARBOR decomposes such high-level objectives into bounded stages executed by specialized agents through standardized commands, persistent artifacts, executable gates, and reusable knowledge, and scales iteration via decentralized parallel trials and experience learning across runs. We evaluate HARBOR across 6 benchmarks and 16 tasks in total, spanning manipulation, locomotion, and bimanual dexterous control. We demonstrate that HARBOR automates the simulation RL workflow end-to-end, designs rewards, tunes algorithms to match or improve over default configurations, and reduces engineering effort at practical token and wall-clock cost; the resulting policies can also be transferred to real robots.

2606.08653 2026-06-09 cs.CV cs.AI cs.LG cs.RO 交叉投稿

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

FiberTune: 在视觉-语言-动作微调中保留动作纤维视觉残差

Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang, Bo Yang, Simin Zeng, Jiawei Yang, Zhengyang Wang, Jiahui Du

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Hebei Key Laboratory of Cognitive Intelligence, Xiong’an Institute of Innovation(河北省认知智能重点实验室,雄安创新研究院) Hebei University of Technology(河北工业大学) Beijing Information Science and Technology University(北京信息科技大学)

AI总结 提出FiberTune,通过在线动作探针过滤动作预测特征方向,对齐教师视觉残差并正则化有效秩,在六个仿真和实物任务中提升VLA策略性能。

Comments Project page: https://fibertune.github.io/

详情
AI中文摘要

动作监督的视觉-语言-动作(VLA)策略微调能有效拟合演示,但仅约束改变预测动作的方向,导致动作等价状态下视觉结构自由坍缩。我们将此形式化为沿局部动作纤维的残差视觉坍缩,并提出FiberTune,一种训练时目标,在不增加推理开销的情况下保留教师结构的视觉残差。FiberTune使用在线动作探针估计动作预测特征方向,从中滤除中间视觉标记表示,并将探针过滤后的残差与冻结的视觉教师对齐,同时正则化其有效秩。在相同训练条件下,FiberTune在跨越两个基准和两种架构(pi_0.5和OpenVLA-OFT)的六个受控仿真设置以及物理SO-101拾取放置任务中,均优于仅任务损失的微调;代表性提升包括长时域CALVIN ABC-to-D上SR(5)提高10.7个百分点,物理SO-101任务成功率从72.7%提升至78.1%。残差诊断显示,这些增益与探针过滤后的残差教师对齐度和有效秩增加一致,符合动作纤维动机。

英文摘要

Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.
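
The notion of effective rank that FiberTune regularizes can be made concrete with one common definition: the exponential of the Shannon entropy of the normalized singular values. The abstract does not state which definition the paper uses, so this choice is an assumption, and the function name and example spectra are hypothetical.

```python
import math
from typing import Sequence


def effective_rank(singular_values: Sequence[float]) -> float:
    """Entropy-based effective rank of a matrix, given its singular values.

    Normalizes the singular values to a probability distribution and
    returns exp(entropy). Zero singular values contribute nothing.
    """
    total = sum(singular_values)
    if total <= 0.0:
        raise ValueError("need at least one positive singular value")
    entropy = 0.0
    for s in singular_values:
        if s > 0.0:
            p = s / total
            entropy -= p * math.log(p)
    return math.exp(entropy)


# A flat spectrum uses all four directions equally.
flat = effective_rank([1.0, 1.0, 1.0, 1.0])

# A collapsed spectrum uses a single direction.
collapsed = effective_rank([1.0, 0.0, 0.0, 0.0])

# A skewed spectrum falls in between.
skewed = effective_rank([10.0, 1.0, 0.1])
```

Under this definition, a representation whose variance collapses onto a few directions has a low effective rank, which is the kind of residual collapse the abstract describes.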

2606.08657 2026-06-09 cs.RO cs.AI 交叉投稿

Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation

潜在扩散策略:为基于扩散的机器人操作塑造潜在空间

Zhexuan Zhou, Yichen Lai, Jinhao Zhang, Huizhe Li, Youmin Gong, Jie Mei

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出两阶段框架LDP,通过CVAE编码器吸收场景理解,在预浓缩的潜在空间中进行流匹配,简化学习并提升多臂协调任务性能。

详情
AI中文摘要

直接在原始动作空间中运行的基于扩散的视觉运动策略将场景理解与轨迹生成合并到单个去噪过程中。由此产生的速度场必须同时编码场景信息并生成精确轨迹,增加了学习复杂性,并在需要多臂精确时间协调的任务上限制了性能。为了简化这一联合学习问题,我们引入了潜在扩散策略(LDP),这是一个两阶段框架,在精心塑造的潜在空间中进行流匹配。通过将场景理解吸收到观察条件的CVAE编码器中,LDP集中了每个观察的条件分布。因此,流模型避免了隐式解析场景相关结构;相反,它在具有更平滑速度场的预浓缩分布内生成,从而简化了从有限演示中的学习。此外,为了捕捉潜在标记之间的时间依赖性,LDP采用每标记扩散强制训练,并使用阶梯推理采样来解决由此产生的分布不匹配。我们还提出了重建FID(rFID)作为轻量级代理,仅从潜在空间统计预测下游任务成功。在RoboTwin 2.0的协调密集型任务上,LDP以显著优势优于DP3,并有效迁移到真实世界的双臂部署。

英文摘要

Diffusion-based visuomotor policies operating directly in raw action spaces conflate scene comprehension with trajectory generation within a single denoising process. The resulting velocity field must simultaneously encode scene information and generate precise trajectories, increasing learning complexity and limiting performance on tasks demanding precise temporal coordination across multiple arms. To simplify this joint learning problem, we introduce Latent Diffusion Policy (LDP), a two-stage framework performing flow matching in a deliberately shaped latent space. By absorbing scene understanding into an observation-conditioned CVAE encoder, LDP concentrates the conditional distribution of each observation. Consequently, the flow model avoids implicitly resolving scene-dependent structures; instead, it generates within a pre-concentrated distribution featuring a smoother velocity field, simplifying learning from limited demonstrations. Furthermore, to capture temporal dependencies among latent tokens, LDP trains with per-token diffusion forcing and employs staircase inference sampling to resolve the resulting distributional mismatch. We also propose reconstruction FID (rFID) as a lightweight proxy predicting downstream task success solely from latent space statistics. On coordination-intensive tasks from RoboTwin 2.0, LDP outperforms DP3 by a substantial margin and transfers effectively to real-world bimanual deployments.

2606.08714 2026-06-09 eess.SY cs.AI cs.LG cs.RO cs.SY 交叉投稿

Hybrid Neural Network and Conventional Controller Approach for Robust Control of Highly Unstable Systems: Application to Tilt-Rotor Control

混合神经网络与传统控制器方法用于高度不稳定系统的鲁棒控制:应用于倾转旋翼控制

Ali Kafili Gavgani, Amin Talaeizadeh, Aria Alasty, Hossein Nejat Pishkenari

发表机构 * Advanced Research Lab for Control and Agricultural Robotics (Sharif AgRoLab)(控制与农业机器人高级研究实验室(谢里夫AgRoLab)) Department of Mechanical Engineering, Sharif University of Technology, Tehran, Iran(谢里夫理工大学机械工程系,伊朗德黑兰)

AI总结 提出一种神经网络增强的滑模控制器,将系统动力学分解为输入无关和输入相关部分,前者用轻量网络从少量数据学习,实现对全驱动倾转旋翼系统的鲁棒控制,LSTM优于MLP。

Comments Proceedings of the 13th RSI International Conference on Robotics and Mechatronics (ICRoM 2025)

详情
AI中文摘要

多旋翼飞行器广泛应用于从监视到精准农业等领域,但传统设计仍受限于其欠驱动特性。倾转旋翼配置通过实现全驱动克服了这一限制。本文研究基于神经网络的控制策略,用于一个具有四个推力矢量输入的全驱动倾转旋翼系统。我们的工作分为两部分。首先,我们有意呈现一个负面结果,通过评估直接输入-输出控制方法。在该方法中,多层感知器(MLP)、长短期记忆(LSTM)网络和Transformer模型被训练为直接将系统状态及其期望值映射到控制信号。我们表明该策略无法稳定系统,凸显了将直接输入-输出学习应用于高度不稳定被控对象的固有困难。其次,作为主要贡献,我们提出一种神经网络增强的滑模控制器(SMC)。该方法将系统动力学分解为输入无关和输入相关两部分,前者使用轻量网络从少量数据集学习,从而降低实时计算需求。此外,所提方法可以使用从低性能控制器收集的飞行日志进行训练,并且从真实数据学习到的动力学模型可用于仿真。我们进一步比较了基于MLP和LSTM的实现,在模型不确定性和外部干扰下,展示了所提方法的鲁棒性和有效性;特别是,带有LSTM被控对象动力学预测器的控制器相比基于MLP的对应方案实现了更优性能,同时运行时间也更低。

英文摘要

Multirotors are widely used in applications ranging from surveillance to precision agriculture, yet conventional designs remain limited by their under-actuation. Tilt-rotor configurations overcome this limitation by enabling full actuation. This paper investigates neural-network-based control strategies for a fully actuated tilt-rotor system with four thrust-vectoring inputs. Our work is structured in two parts. First, we deliberately present a negative result by evaluating a direct input-output control approach. In this method, multilayer perceptrons (MLPs), long short-term memory (LSTM) networks, and transformer models are trained to map system states and their desired values directly to control signals. We show that this strategy fails to stabilize the system, highlighting the inherent difficulty of applying direct input-output learning to highly unstable plants. Second, as the main contribution, we propose a neural-network-enhanced sliding mode controller (SMC). The method decomposes the system dynamics into input-independent and input-dependent components, with the former learned from a small dataset using lightweight networks, thereby reducing real-time computational demands. Moreover, the proposed method can be trained using flight logs collected from low-performance controllers, and the resulting dynamic model learned from real-world data can be used in simulation. We further compare MLP- and LSTM-based implementations under model uncertainties and external disturbances, demonstrating the robustness and effectiveness of the proposed approach; in particular, the controller with the LSTM plant dynamics predictor achieves superior performance to its MLP-based counterpart while also exhibiting lower runtime.
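
The decomposition into input-independent and input-dependent dynamics can be illustrated on a scalar toy plant. This is a generic boundary-layer sliding mode controller, not the paper's tilt-rotor controller: the plant `x'' = f(x) + g*u`, the linear stand-in `f_hat` for the learned network, and all gains are hypothetical.

```python
def sat(z: float) -> float:
    """Saturation used in place of sign() to reduce chattering."""
    return max(-1.0, min(1.0, z))


def simulate(controlled: bool, steps: int = 5000, dt: float = 1e-3) -> tuple:
    """Euler simulation of x'' = f(x) + g*u, regulating x to zero."""
    lam, k, phi = 2.0, 5.0, 0.1   # surface slope, switching gain, layer width
    g = 1.0                        # input-dependent part, assumed known
    x, v = 1.0, 0.0                # initial position and velocity
    for _ in range(steps):
        f_true = 3.0 * x           # input-independent part (open-loop unstable)
        f_hat = 2.5 * x            # imperfect stand-in for the learned model
        s = v + lam * x            # sliding surface, target state is zero
        if controlled:
            # Cancel the modeled dynamics, then drive s toward zero.
            u = (-f_hat - lam * v - k * sat(s / phi)) / g
        else:
            u = 0.0
        a = f_true + g * u
        x += dt * v
        v += dt * a
    return x, v


x_closed, v_closed = simulate(controlled=True)
x_open, v_open = simulate(controlled=False)
```

In this toy case the switching term absorbs the mismatch between `f_hat` and the true dynamics, so the closed loop settles near zero while the uncontrolled plant diverges. Only the input-independent term needs to be learned, which mirrors the role the abstract assigns to the lightweight network.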

2606.08775 2026-06-09 cs.RO cs.AI 交叉投稿

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

统一对象中心世界模型与扩散策略:多阶段机器人任务的分层框架

Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, Farshad Khorrami

发表机构 * Tandon School of Engineering, New York University(纽约大学坦登工程学院) Courant Institute of Mathematical Sciences, New York University(纽约大学库朗数学科学研究所) AMI Labs(AMI实验室)

AI总结 提出WorldDP分层框架,结合高层世界模型进行运行时子目标优化和低层扩散策略执行,利用对象中心表示解耦环境实体,实现多阶段机器人操作任务的有效规划与执行。

详情
AI中文摘要

视觉世界模型在学习复杂系统动力学方面显示出巨大潜力。最近的进展利用这些模型作为模型预测控制(MPC)框架中的转移函数来解决各种控制任务。然而,当应用于机器人时,它们仅限于单阶段任务(如抓取或到达),难以处理需要复杂序列规划的多阶段任务。在这项工作中,我们引入了WorldDP,一个专为多阶段机器人操作设计的世界模型框架。我们的分层方法利用高层世界模型作为转移函数,在运行时优化可行的子目标,随后由低层扩散策略实现这些子目标。为了进一步辅助学习动力学和规划,我们结合了对象中心表示,这些表示解耦了环境实体,并使我们能够针对每个实体进行顺序规划。在多个机器人基准测试中,WorldDP始终优于现有基线,验证了将世界模型的物理基础规划与扩散策略的高效执行相结合,能够产生更优的多阶段性能。

英文摘要

Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi-stage robotic manipulation. Our hierarchical approach utilizes a high-level world model as a transition function to optimize for feasible subgoals during runtime, which are subsequently reached by a low-level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object-centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model's physically grounded planning with diffusion policy's efficient execution yields superior multi-stage performance.

2606.08992 2026-06-09 cs.RO cs.AI cs.CV 交叉投稿

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

SpaceVLN:具有在线空间认知记忆与推理的零样本视觉与语言导航智能体

Yucheng Deng, Pingrui Lai, Xinhai Li, Chenjia Bai, Xiaoheng Deng, Chengnuo Sun, Xuelong Li, Hua Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学) China Telecom(中国电信) Central South University(中南大学) Jiangsu University(江苏大学)

AI总结 提出SpaceVLN,通过空间认知记忆和任务引导的空间推理,在零样本设置下实现连续环境中的视觉与语言导航,在多个基准上达到最优性能。

Comments 23 pages, 9 figures, 7 tables

详情
AI中文摘要

连续环境中的视觉与语言导航要求智能体理解未见环境的空间结构以遵循语言指令。尽管基础模型为无需任务特定策略训练的零样本导航开辟了有希望的路径,但许多导航器仍依赖局部视觉线索和基于线性历史的推理,忽视了探索区域、穿越路径、地标及其空间关系的空间本质。本文提出SpaceVLN,一种围绕空间认知记忆和任务引导的空间推理构建的导航智能体。具体而言,SpaceVLN引入了一个高效的分阶段闭环框架,其中规划和执行围绕可验证的空间-地标阶段组织。导航过程中,智能体逐步将探索区域抽象为空间航点,并动态维护子任务基础的地标证据,形成层次化的空间认知记忆以进行进度定位和空间关系理解。基于此记忆,Spatial-CoT将任务进度推理与空间感知、分析和预测相结合,实现任务引导的空间推理以用于具身导航。统一阶段接口使SpaceVLN能够在统一的零样本设置下处理视觉与语言导航和目标导向导航,无需任务特定策略训练。在R2R-CE、RxR-CE、GN-Bench和HM3D-OVON上,SpaceVLN实现了最先进的零样本性能,真实机器人部署进一步验证了其适用性。这些结果突显了空间认知记忆和任务引导的空间推理作为更强具身导航智能体的实用基础。

英文摘要

Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space--landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.

2606.09236 2026-06-09 cs.RO cs.AI 交叉投稿

Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation

用于模拟自主超级摩托车赛车的自定进度课程强化学习

Luca Ghisi, Jacopo Essenziale, Carlo D'Eramo, Matteo Luperto

发表机构 * University of Bologna(博洛尼亚大学)

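AI summary: Proposes a self-paced curriculum deep reinforcement learning framework that, combined with Soft Actor-Critic (SAC), dynamically generates progressively harder tasks to train an autonomous motorbike racing agent in a physics-accurate simulator, outperforming standard SAC.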

Comments Presented at the "1st Workshop on Generalization in Autonomous Driving: Paradigms, Practice, and Public Road Demonstrations" at ICRA 2026, Vienna. Oral+poster presentation

详情
AI中文摘要

自主赛车通过深度强化学习取得了显著进展,主要针对四轮车辆。然而,摩托车由于需要管理平衡和倾斜角度,以及更灵敏的转向和油门控制,且重量更小,带来了更大的复杂性。在这项工作中,我们提出了一个框架,用于在VRider SBK(一个基于Unity的物理精确摩托车模拟器)中训练自主智能体进行超级摩托车赛车。我们的方法将软演员-评论家(SAC)与自定进度课程深度强化学习(SPDL)相结合,后者根据智能体的性能动态生成逐渐更具挑战性的任务,无需手动课程设计。智能体的状态空间包括扩展了倾斜角度历史的本体感受特征,以及通过赛道点的全局赛道特征。奖励信号被设计为鼓励沿赛道前进,同时惩罚针对两轮动力学的不稳定诱导行为。初步实验结果表明,SPDL在多个赛道和摩托车模型上的训练效率、圈速和驾驶稳定性方面优于单独的SAC,为基于强化学习的自主摩托车赛车建立了第一个基线。

英文摘要

Autonomous Racing has seen remarkable progress through deep Reinforcement Learning (RL), primarily for four-wheeled vehicles. However, motorbikes introduce substantially greater complexity due to the need to manage balance and lean angle, in addition to more reactive steering and throttle control, and a smaller weight. In this work, we present a framework for training an autonomous agent to race a superbike in VRider SBK, a physics-accurate Unity-based motorbike simulator. Our approach integrates Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL), which dynamically generates progressively more challenging tasks based on the agent's performance, without requiring manual curriculum design. The agent's state space comprises proprioceptive features extended with lean-angle history, along with global track features via course points. The reward signal is shaped to encourage progress along the track while penalizing instability-inducing behaviors specific to two-wheeled dynamics. Preliminary experimental results demonstrate that SPDL outperforms SAC alone in training efficiency, lap time, and driving stability across multiple tracks and motorbike models, establishing a first baseline for RL-based autonomous motorbike racing.

2606.09243 2026-06-09 cs.CV cs.AI 交叉投稿

EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video

EgoTactile: 从自我中心视频学习日常物体的抓取压力

Yuan Zeng, Yujia Shi, Tiao Tan, Xingting Li, Yaqi Qin, Zongqing Lu, Wenming Yang, Jing-Hao Xue, Qingmin Liao

发表机构 * University of Science and Technology of China(中国科学技术大学)

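AI summary: Introduces the EgoTactile benchmark and the conditional diffusion framework EgoPressureDiff, which estimates full-hand grasp pressure from egocentric video, resolves visual-physical ambiguities, and transfers robustly.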

Comments Accepted to ICML2026 spotlight

详情
AI中文摘要

从自我中心视频估计全手抓取压力对于沉浸式VR和机器人操作至关重要,然而密集触觉传感通常依赖侵入式硬件。现有的基于视觉的方法主要依赖平面或指尖接触,无法泛化到复杂的3D物体交互。因此,我们引入EgoTactile,一个将自我中心视频与全手压力监督配对用于多样日常物体的基准,并包含裸手迁移子集以实现对自然场景的泛化。利用该基准,我们首先建立EgoPressureFormer作为判别基线。此外,为显式处理部分观测中的不确定性,我们提出EgoPressureDiff,一个条件扩散框架,适配大规模预训练视频扩散骨干。通过将丰富的世界知识先验与物理信息特征修正层结合以注入语义约束,我们的方法有效推断合理的接触模式并解决视觉-物理歧义。大量实验表明,我们的方法在基准上取得优越性能,并具有对野外场景的鲁棒迁移能力。项目页面见https://egotactile.github.io/。

英文摘要

Estimating full-hand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing often relies on intrusive hardware. Existing vision-based methods predominantly rely on planar surfaces or fingertip contacts, failing to generalize to complex 3D object interactions. Therefore, we introduce EgoTactile, a benchmark pairing egocentric video with full-hand pressure supervision for diverse everyday objects, incorporating a bare-hand transfer subset to enable generalization to natural scenarios. Leveraging this benchmark, we first establish EgoPressureFormer as a discriminative baseline. Beyond this, to explicitly address the uncertainty in partial observations, we propose EgoPressureDiff, a conditional diffusion framework that adapts a large-scale pre-trained video diffusion backbone. By combining rich world knowledge priors with a Physically-Informed Feature Rectification layer to inject semantic constraints, our approach effectively infers plausible contact patterns and resolves visual-physical ambiguities. Extensive experiments demonstrate that our method achieves superior performance on the benchmark and robust transferability to in-the-wild scenarios. Our project page is available at https://egotactile.github.io/.

2606.09390 2026-06-09 cs.CV cs.AI cs.RO 交叉投稿

Real-time body pose non-verbal communication with a consistency-based reliability measure

基于一致性可靠性度量的实时身体姿态非语言通信

Alina Marcu, Dragos Costea, Cristina Lazar, Marius Leordeanu

发表机构 * National University of Science and Technology "Politehnica" Bucharest(布加勒斯特理工大学) Simion Stoilow Institute of Mathematics of the Romanian Academy(罗马尼亚科学院西蒙·斯托伊洛数学研究所) NORCE Norwegian Research Centre AS(挪威研究中心)

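AI summary: Studies the recognition of communicative intent from 2D body pose alone, proposes autoregressive self-consistency as an unsupervised reliability signal, and achieves real-time performance on an embedded GPU.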

详情
AI中文摘要

身体运动在远距离或无法捕捉面部及语音的条件下传达意图。我们研究仅从2D身体姿态识别通信意图。我们认为身体运动是可靠的信号,特别是在需要实时低成本设备上的人-机器人通信场景中,如救援任务。然而,现有资源并未孤立这一信号。情感语料库结合了身体、面部、语音和文本,而骨架动作识别基准标记的是执行的动作而非传达的信息。我们发布了一个包含十种通信意图的全身体姿态真实帧数据集,并将其与其他真实(IPC)和合成(MotionLCM, VEO3.1, Kimodo)数据集进行比较,这些数据集覆盖了不同难度。我们针对能在机器人有限板载硬件上运行的系统。我们基准测试了多种模型,从骨架图分类器到联合运动预测网络,并在嵌入式GPU(NVIDIA Orin Nano)上报告了性能指标和帧率,因为在我们的场景中速度和准确性同样重要。最后,我们展示了模型自身的自回归自一致性可作为无监督可靠性信号。我们给出了一个简短证明,界定了自一致性预测正确的概率,表明该概率随一致步数增加而增长,并识别了自信预测仍可能错误的条件,与行业标准指标进行了基准测试。

英文摘要

Body movement communicates intent at distances and in conditions where neither the face nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal, especially in scenarios that require real-time, low-cost, on-device person-to-robot communication in long-distance environments, such as rescue missions. However, existing resources do not isolate this signal. Affective corpora combine body, face, voice and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) ones that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false, benchmarked against industry-standard metrics.
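The self-consistency signal described above can be illustrated with a small sketch. The abstract does not define how consistency is counted, so the trailing-run definition, the function names `consistent_steps` and `is_reliable`, and the `min_steps` threshold below are all assumptions made for illustration, not the authors' method.

```python
from typing import List


def consistent_steps(labels: List[str]) -> int:
    """Length of the trailing run of identical predicted labels.

    `labels` holds one predicted intent per autoregressive step
    (hypothetical representation; the paper may count differently).
    """
    if not labels:
        return 0
    run = 1
    for prev in reversed(labels[:-1]):
        if prev != labels[-1]:
            break
        run += 1
    return run


def is_reliable(labels: List[str], min_steps: int = 3) -> bool:
    """Accept the latest prediction only after `min_steps` agreeing steps.

    The threshold is an assumed tuning knob, not a value from the source.
    """
    return consistent_steps(labels) >= min_steps


if __name__ == "__main__":
    history = ["wave", "stop", "stop", "stop"]
    print(consistent_steps(history), is_reliable(history))
```

A longer agreeing run raises the acceptance bar's confidence at the cost of latency, which matters on the embedded hardware the abstract targets.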

2606.09416 2026-06-09 cs.RO cs.AI cs.SE 交叉投稿

Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer

面向物理AI的驾驭工程:机器人中间件即驾驭层

Sanghoon Lee, Jiyeong Chae, Kyung-Joon Park

发表机构 * Daegu Gyeongbuk Institute of Science and Technology (DGIST)(大邱庆北科学技术院)

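AI summary: Argues that robot middleware is the harness layer for Physical AI, which must mediate control, computing, and communication simultaneously, and names three missing enforcement functions (Projection, Isolation, and Transfer), illustrated with a ROS 2 Harness Profile.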

Comments 6 pages, 2 figures, 2 tables. Big Ideas track submission to the 27th ACM/IFIP International Middleware Conference (Middleware 2026)

详情
AI中文摘要

在物理AI时代,机器人中间件面临新的角色。学习策略、规划器和视觉-语言-动作(VLA)模型现在作为控制路径上的因果参与者进入已部署的机器人,但将它们与定时、调度和网络集成的层尚未被命名。最近的语言智能体工作将此层命名为驾驭层,即中介工具、管理状态、约束资源和记录执行的外部系统。机器人社区尚未采用这一框架,我们提出机器人中间件就是那个驾驭层。物理AI驾驭层与软件驾驭层的区别在于其干预位置。软件驾驭层在工具调用边界进行中介。物理AI驾驭层必须同时干预控制、计算和通信,因为学习策略的输出跨越所有三者:其命令改变轨迹,其推理时间改变调度,其有效载荷改变带宽。机器人中间件是机器人栈中最低的层,具有对所有三者的中介抽象,因此最适合组合它们的强制实施。它已经提供了驾驭层所需的大部分功能,但缺乏针对AI模型的强制实施。我们将这种缺失的强制实施命名为三个功能:投影在输出时门控每个输出,隔离约束模型的执行和传输时隙,转移在检查失败时回退到经过验证的基线。每个功能目前以手工构建的应用程序代码形式出现在已部署的机器人系统中,构建在机器人中间件已提供的表面上。机器人中间件应该将它们作为组合所有三者的层,而不是作为最佳的单轴强制器。我们将其勾勒为ROS 2驾驭配置文件,这是一个部署工件,携带AI模型声明的输出区域、推理预算和运行机制,而中间件在ROS 2、DDS和Zenoh上强制实施它们。

英文摘要

Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter deployed robots as causal participants on the control path, but the layer that integrates them with timing, scheduling, and network has not been named. Recent language-agent work names this layer the harness, the external system that mediates tools, manages state, bounds resources, and records execution. The robotics community has not yet adopted this framing, and we propose that robot middleware is that harness. A Physical AI harness differs from a software harness in where it intervenes. A software harness mediates at tool-call boundaries. A Physical AI harness must mediate at control, computing, and communication simultaneously, because a learned policy's output crosses all three: its commands shift the trajectory, its inference time shifts the schedule, and its payload shifts the bandwidth. Robot middleware is the lowest robot-stack layer with mediating abstractions over all three, so it is best positioned to compose their enforcement. It already provides most of what a harness needs but lacks the enforcement for an AI model. We name this missing enforcement as three functions: Projection gates each output at emission, Isolation bounds the model's execution and transmission slot, and Transfer falls back to a verified baseline when checks fail. Each appears today as hand-built application code in deployed robot systems, built on surfaces robot middleware already provides. Robot middleware should host them not as the best single-axis enforcer but as the layer that composes all three. We sketch this as a ROS 2 Harness Profile, a deployment artifact that carries an AI model's declared output region, inference budget, and operating regime while the middleware enforces them across ROS 2, DDS, and Zenoh.
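The three enforcement functions named above (Projection, Isolation, Transfer) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: `HarnessProfile`, `in_region`, and `harnessed_step` are hypothetical names, no ROS 2, DDS, or Zenoh API is used, and the box-shaped output region and after-the-fact timing check are simplifications chosen here, not the paper's design.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Command = Tuple[float, ...]


@dataclass(frozen=True)
class HarnessProfile:
    """Hypothetical deployment artifact: limits declared for one AI model."""
    output_low: Tuple[float, ...]    # lower bound of the declared output region
    output_high: Tuple[float, ...]   # upper bound of the declared output region
    inference_budget_s: float        # maximum allowed inference time per step


def in_region(cmd: Sequence[float], profile: HarnessProfile) -> bool:
    """Projection check: every command component lies in the declared box."""
    return len(cmd) == len(profile.output_low) and all(
        lo <= c <= hi
        for c, lo, hi in zip(cmd, profile.output_low, profile.output_high)
    )


def harnessed_step(
    policy: Callable[[object], Command],
    baseline: Callable[[object], Command],
    obs: object,
    profile: HarnessProfile,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Command, str]:
    """Run the learned policy; Transfer to the baseline if any check fails."""
    start = clock()
    cmd = policy(obs)
    elapsed = clock() - start
    # Isolation (simplified): detect a budget overrun after the call returns.
    # A real middleware layer would bound the execution slot itself.
    if elapsed > profile.inference_budget_s:
        return baseline(obs), "transfer:budget"
    # Projection: gate the output at emission.
    if not in_region(cmd, profile):
        return baseline(obs), "transfer:region"
    return cmd, "policy"
```

The returned tag records which path produced the command, in the spirit of the harness role of recording execution.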

2606.09572 2026-06-09 cs.RO cs.AI 交叉投稿

CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control

CT-VAM: 一种小脑-丘脑启发的视觉-动作模型用于高效视觉运动控制

Jiacheng Li, Yize Guo, Jiabin Guo, Qingchen Liu, Jiahu Qin

发表机构 * University of Science and Technology of China(中国科学技术大学) AIRLab, Department of Automation(自动化系AIRLab)

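AI summary: Proposes CT-VAM, which fuses heterogeneous inputs through the TARS conditional attention decoder and, with 68M parameters, achieves LIBERO success rates competitive with large VLA models while reducing inference latency and supporting high-frequency control.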

详情
AI中文摘要

视觉-语言-动作模型在机器人操作中展现出强大潜力,然而原始语言主要用于指定任务意图,而非在高频低层执行过程中反复处理。受此分离的启发,我们提出了一种小脑-丘脑启发的视觉-动作模型(CT-VAM),用于高效的任务条件视觉运动控制。CT-VAM作为一个紧凑的局部执行策略,从双视角视觉观察、本体感觉和轻量级任务条件中预测动作块,从而可能实现一种实用的云-边缘范式,其中高层语义推理由大模型处理,而快速闭环控制在本地硬件上运行。为了有效融合异构输入,CT-VAM引入了TARS(丘脑动作路由流),一种流分离的条件注意力解码器,独立路由动作、视觉和任务流,防止密集的感官标记淹没紧凑的任务相关条件。仅凭68M参数,CT-VAM在LIBERO上取得了与更大规模VLA模型竞争的成功率,同时降低了推理延迟。结合用于异步块执行的流一致修补,CT-VAM支持高频控制,并在资源受限的机器人平台上展示了鲁棒的实时部署能力。

英文摘要

Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-action model (CT-VAM) for efficient task-conditioned visuomotor control. CT-VAM acts as a compact local execution policy that predicts action chunks from dual-view visual observations, proprioception, and a lightweight task condition, potentially enabling a practical cloud-edge paradigm in which high-level semantic reasoning can be handled by large models while fast closed-loop control runs on local hardware. To fuse heterogeneous inputs effectively, CT-VAM introduces TARS (Thalamic Action Routing Stream), a stream-separated conditional attention decoder that independently routes action, visual and task streams, preventing dense sensory tokens from overwhelming compact task-relevant conditions. With only 68M parameters, CT-VAM achieves LIBERO success rates competitive with substantially larger VLA models, while reducing inference latency. Together with flow-consistent inpainting for asynchronous chunk execution, CT-VAM supports high-frequency control and demonstrates robust real-world deployment on resource-constrained robotic platforms.

2606.09630 2026-06-09 cs.RO cs.AI cs.LG 交叉投稿

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

ReCoVLA: VLM引导的奖励编译用于视觉-语言-动作策略的故障恢复

Haodi Hu, Chung-Ta Huang, Jing Liu, Ye Wang, Kei Suzuki, Matthew Brand, Toshiaki Koike-Akino

发表机构 * University of Southern California(南加州大学) Mitsubishi Electric Research Laboratories (MERL)(三菱电机研究实验室) Harvard University(哈佛大学)

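AI summary: Proposes ReCoVLA, a framework that keeps a pretrained VLA policy frozen, uses an external VLM to infer failure modes and compile structured rewards, and trains residual recovery policies for zero-shot sim-to-real deployment, improving success rates across a range of manipulation tasks.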

Comments 19 pages, 7 figures

详情
AI中文摘要

视觉-语言-动作(VLA)策略为语言条件操作提供了强大的先验知识,但在需要针对性恢复的非标称状态下仍然脆弱。我们提出ReCoVLA——一种故障条件的残差恢复框架,它保持预训练的VLA策略冻结,使用外部视觉-语言模型(VLM)推断故障模式和恢复阶段,并从任务相关组件编译结构化奖励。ReCoVLA并非使用VLM直接生成动作或奖励,而是将其作为语义奖励选择器:它预测恢复描述符和奖励掩码,用于仿真中的残差策略训练,随后将训练好的恢复策略零样本部署到真实世界。这解耦了高层故障理解与低层纠正控制,以支持不同的VLA。在短时域、长时域和接触丰富的操作任务上的实验表明,ReCoVLA在平均性能上优于测试的基线。在仿真中,我们的奖励编译器将微调$π_{0.5}$基线的平均成功率从36.7%提升到66.7%。在物理零样本仿真到真实实验中,ReCoVLA取得了最佳平均性能,成功率为61.7%。

英文摘要

Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a structured reward from task-relevant components. Rather than using the VLM to generate actions or rewards directly, ReCoVLA uses it as a semantic reward selector: it predicts a recovery descriptor and reward mask for in-simulation residual-policy training, followed by zero-shot sim-to-real deployment of the trained recovery policies. This decouples high-level failure understanding from low-level corrective control to support different VLAs. Experiments across short-horizon, long-horizon, and contact-rich manipulation tasks show that ReCoVLA outperforms the tested baselines on average. In simulation, our reward compiler improves average success from 36.7% for the fine-tuned $\pi_{0.5}$ baseline to 66.7%. In physical zero-shot sim-to-real experiments, ReCoVLA achieves the best average performance, with 61.7% success.
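The idea of a reward mask that selects among task-relevant components can be sketched as a masked sum. The component names (`reach`, `grasp`, `lift`), the dictionary representation, and the function `compile_reward` are hypothetical; the abstract does not specify how the compiled reward is actually assembled.

```python
from typing import Dict


def compile_reward(components: Dict[str, float], mask: Dict[str, float]) -> float:
    """Combine per-component rewards, keeping only those the mask selects.

    `components` maps a component name to its current value; `mask` maps a
    name to a weight (0.0 disables it). Names absent from the mask count
    as disabled. Both structures are illustrative assumptions.
    """
    return sum(mask.get(name, 0.0) * value for name, value in components.items())


if __name__ == "__main__":
    # Hypothetical step in a recovery stage where only reaching and
    # grasping should be rewarded.
    components = {"reach": 0.5, "grasp": 1.0, "lift": 0.25}
    mask = {"reach": 1.0, "grasp": 1.0}
    print(compile_reward(components, mask))
```

Keeping the selector separate from the reward terms mirrors the decoupling the abstract describes: the VLM chooses what to reward, while the residual policy learns how to act.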

2606.09634 2026-06-09 cs.CV cs.AI 交叉投稿

ATN3D: Density-Aware LiDAR-Radar Early 3D Object Detection Under Extreme Sparsity

ATN3D:面向极端稀疏性的密度感知激光雷达-雷达早期3D目标检测

Debojyoti Biswas, Xianbiao Hu

发表机构 * University of California, Berkeley(加州大学伯克利分校) Tsinghua University(清华大学)

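AI summary: To address the information lost by early fusion and the imbalanced channel supervision under sparse long-range sensing, proposes ATN3D, which combines density-aware fusion, occupancy-gated neighborhood aggregation, evidence-conditioned channel self-attention, and a range-aware loss to improve long-range detection on the VoD dataset.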

详情
AI中文摘要

3D目标检测是自动驾驶车辆及更广泛智能交通系统感知的基石。远距离检测因感知证据稀疏而具有挑战性,然而这种“远距离”场景在交通中很常见。尽管在计算机视觉中>30m常被标记为远距离,但在道路上仅提供约1-2秒的感知和决策时间。在这种极端稀疏性下,出现两个核心挑战。首先,早期多模态融合倾向于丢弃稀疏性信息,并从空或错误占用的单元中注入噪声,降低远距离召回率。其次,上下文无关的统一通道监督偏向密集和近距样本,导致远处和小目标优化不足,延迟对远处目标的最早检测。我们提出“Ask The Neighbor”(ATN3D),一种专为稀疏范围条件设计的激光雷达-雷达框架。ATN3D引入:(i) 密度感知早期融合与跨模态门控,根据体素密度/稀疏性和雷达证据调节融合;(ii) 占用门控邻域聚合,使用圆形核仅从可信单元聚合;(iii) 证据条件通道自注意力,根据天气/距离自适应调整通道权重;(iv) 距离感知损失,按距离重新平衡分类和定位,使训练与距离分层评估对齐。在VoD基准的晴朗和雾天条件下,ATN3D超越强基线:晴朗天气mAP提升+3.55%,模拟浓雾下提升+8.41%;对于>30m目标,提升分别为+3.33%(晴朗)和+2.09%(浓雾)。这些结果表明在道路稀疏感知下更早、更可靠的远距离检测。

英文摘要

3D object detection is the backbone of perception for automated vehicles (AV) and broader intelligent transportation systems applications. Long-range detection is challenging because sensing evidence is sparse; yet this "long-range" scenario is routine in traffic. Although >30 m is often labeled long-range in computer vision, on roadways it affords only approximately 1-2 s for perception and decision-making. Under such extreme sparsity, two core challenges arise. First, early multimodal fusion tends to discard sparsity information and inject noise from empty or falsely occupied cells, degrading long-range recall. Second, context-agnostic uniform channel supervision favors dense and near-range samples, leaving far and small objects under-optimized, delaying the earliest detection of distant objects. We propose "Ask The Neighbor" (ATN3D), a LiDAR-Radar framework tailored for sparse-range conditions. ATN3D introduces (i) Density-aware early fusion with cross-modal gating that conditions fusion on per-voxel density/sparsity and Radar evidence, (ii) Occupancy-gated neighborhood aggregation with circular kernels to aggregate only from credible cells, (iii) Evidence-conditioned channel self-attention to adapt channel weights with weather/range, and (iv) a Range-aware loss that re-balances classification and localization by distance, aligning training with distance-stratified evaluation. On the VoD benchmark across clear and foggy conditions, ATN3D surpasses strong baselines: +3.55% mAP in clear weather and +8.41% mAP under simulated heavy fog; for >30 m objects, gains are +3.33% (clear) and +2.09% (heavy fog). These results indicate earlier and more reliable long-range detections under sparse sensing in on-road traffic.

2606.09758 2026-06-09 cs.RO cs.AI cs.LG 交叉投稿

Difference-Aware Retrieval Policies for Imitation Learning

差异感知的模仿学习检索策略

Quinn Pfeifer, Ethan Pronovost, Paarth Shah, Khimya Khetarpal, Siddhartha Srinivasa, Abhishek Gupta

发表机构 * Paul G. Allen School of Computer Science & Engineering, University of Washington(华盛顿大学保罗·G·艾伦计算机科学与工程学院) Toyota Research Institute(丰田研究所) Google DeepMind(谷歌DeepMind) Mila

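AI summary: Proposes DARP, a semi-parametric retrieval-based imitation learning method that reparameterizes the problem around the local neighborhood structure of k-nearest neighbors to address the out-of-distribution generalization problem of behavior cloning, improving performance by 15-46% on continuous control and robotic manipulation tasks.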

Comments 12 pages, 7 figures, 3 tables. Accepted to ICLR 2026. Code and demos available at https://weirdlabuw.github.io/darp-site/

详情
AI中文摘要

通过行为克隆的参数化模仿学习可能因部署期间的复合误差而在分布外状态上泛化能力差。我们表明,在推理期间通过半参数检索式模仿学习方法重用训练数据可以缓解这一挑战。我们提出差异感知的模仿学习检索策略(DARP),这是一种半参数检索式模仿学习方法,通过根据局部邻域结构而非直接的状态到动作映射来重新参数化模仿学习问题,从而解决这一局限性。DARP不学习全局策略,而是训练一个模型,基于专家演示中的k近邻、它们对应的动作以及邻居状态与查询状态之间的相对距离向量来预测动作。DARP不需要超出标准行为克隆所做的额外假设——它不需要额外的数据收集、在线专家反馈或任务特定知识。我们在不同领域(包括连续控制和机器人操作)以及不同表示(包括高维视觉特征)上展示了比标准行为克隆持续15-46%的性能提升。代码和演示可在https://weirdlabuw.github.io/darp-site/获取。

英文摘要

Parametric imitation learning via behavior cloning can suffer from poor generalization to out-of-distribution states due to compounding errors during deployment. We show that reusing the training data during inference via a semi-parametric retrieval-based imitation learning approach can alleviate this challenge. We present Difference-Aware Retrieval Policies for Imitation Learning (DARP), a semi-parametric retrieval-based imitation learning approach that addresses this limitation by reparameterizing the imitation learning problem in terms of local neighborhood structure rather than direct state-to-action mappings. Instead of learning a global policy, DARP trains a model to predict actions based on $k$-nearest neighbors from expert demonstrations, their corresponding actions, and the relative distance vectors between neighbor states and query states. DARP requires no additional assumptions beyond those made for standard behavior cloning -- it does not require additional data collection, online expert feedback, or task-specific knowledge. We demonstrate consistent performance improvements of 15-46% over standard behavior cloning across diverse domains, including continuous control and robotic manipulation, and across different representations, including high-dimensional visual features. Code and demos are available at https://weirdlabuw.github.io/darp-site/.
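The retrieval step described above (nearest expert states, their actions, and the difference vectors to the query) can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and a query-minus-neighbor sign convention; the function `retrieve_neighbors` and the dictionary layout are hypothetical, and the learned model that consumes these inputs is omitted.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def retrieve_neighbors(
    query: Vector,
    demo_states: Sequence[Vector],
    demo_actions: Sequence[Vector],
    k: int = 2,
) -> List[Dict[str, Vector]]:
    """Return the k nearest expert states with their actions and offsets.

    Each entry holds the neighbor state, the expert action taken there,
    and the difference vector (query - neighbor). Euclidean distance and
    this sign convention are assumptions, not taken from the paper.
    """
    order = sorted(
        range(len(demo_states)),
        key=lambda i: math.dist(query, demo_states[i]),
    )[:k]
    return [
        {
            "state": demo_states[i],
            "action": demo_actions[i],
            "diff": tuple(q - s for q, s in zip(query, demo_states[i])),
        }
        for i in order
    ]


if __name__ == "__main__":
    states = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    actions = [(0.1,), (0.2,), (0.9,)]
    for entry in retrieve_neighbors((0.75, 0.0), states, actions, k=2):
        print(entry)
```

A downstream network would take these per-neighbor tuples as input and predict the action for the query state, rather than mapping the state to an action directly.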

2606.09811 2026-06-09 cs.RO cs.AI cs.CV 交叉投稿

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

AHA-WAM:异步自适应时域世界-动作建模与观测引导的上下文路由

Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu, Jiayue Kang, Zhixuan Liang, Wenjie Xu, Yinan Mao, Weinan Zhang, Xiaokang Yang, Ru Ying, Ran Zheng, Yao Mu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) Baidu AI Cloud(百度智能云) The University of Hong Kong(香港大学)

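AI summary: Proposes AHA-WAM, an asynchronous horizon-adaptive world-action model built on a dual Diffusion Transformer that decouples timing between a low-frequency world planner and a high-frequency action executor for efficient closed-loop control, reaching state-of-the-art performance on RoboTwin and real-world tasks.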

Comments Project page: https://serene-sivy.github.io/aha-wam/

详情
AI中文摘要

世界-动作模型已成为机器人操作的一种有前景的范式,它联合建模视觉场景动态和动作,将物理先验注入策略学习。然而,现有的世界-动作模型以相同的时间分辨率耦合世界预测和动作执行,迫使世界分支建模近期的帧变化,这些变化是冗余且信息量弱的。我们假设,将世界预测和动作执行严格绑定到相同的时间节奏可能未充分利用视频分支在具身控制中的潜力。因此,我们提出AHA-WAM,一种基于双扩散Transformer(DiT)架构的异步自适应时域世界-动作模型,该模型围绕这种时间不对称性重新组织世界-动作建模。AHA-WAM将视频DiT实例化为一个低频世界规划器,它维护过去观测的滚动键值记忆,并暴露可重用的逐层潜在上下文,编码长时域场景演化;同时,一个高频动作DiT通过逐层联合注意力查询该上下文,以闭环方式执行短动作块。为了支持异步执行,我们引入了自适应时域偏移训练和观测引导的视频-上下文路由(OVCR),它们共同让动作专家利用长时域世界上下文,同时保持对实时执行状态的响应,而无需重新运行视频DiT。在RoboTwin和真实世界操作任务上的实验表明,AHA-WAM无需任何机器人数据预训练即达到最先进性能,在RoboTwin上平均成功率为92.80%,在4个真实世界任务上成功率为78.3%,同时达到24.17 Hz的闭环控制,相比Fast-WAM加速4.59倍。

英文摘要

World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.

2511.17855 2026-06-09 cs.AI cs.RO 版本更新

QuickLAP: Quick Language-Action Preference Learning for Semi-Autonomous Agents

QuickLAP: 为半自主代理快速语言-动作偏好学习

Jordan Abi Nader, David Lee, Nathaniel Dennler, Andreea Bobu

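AI summary: Proposes QuickLAP, a Bayesian framework that fuses physical and language feedback to infer reward functions in real time. It uses large language models to extract reward-feature attention masks and preference shifts, reduces reward learning error by over 70% in a semi-autonomous driving simulator, and is supported by a user study on understandability and collaborativeness.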

详情
AI中文摘要

机器人必须从人们的行为和语言中学习,但单一模态往往不完整:物理修正具有语境但意图模糊,而语言表达高层目标但缺乏物理基础。我们引入QuickLAP:快速语言-动作偏好学习,一种贝叶斯框架,融合物理和语言反馈以实时推断奖励函数。我们的关键见解是将语言视为用户潜在偏好的概率观测,明确哪些奖励特征重要以及如何解释物理修正。QuickLAP利用大规模语言模型(LLMs)从自由形式陈述中提取奖励特征注意力掩码和偏好偏移,并与物理反馈结合在一个闭式更新规则中。这使得能够快速、实时且鲁棒地学习奖励,处理模糊反馈。在半自主驾驶模拟器中,QuickLAP相比仅物理和启发式多模态基线将奖励学习误差降低超过70%。15名参与者的用户研究进一步验证了我们的方法:参与者发现QuickLAP更易懂和协作,并且更喜欢其学习行为。代码可在https://github.com/MIT-CLEAR-Lab/QuickLAP获取。

英文摘要

Robots must learn from both what people do and what they say, but either modality alone is often incomplete: physical corrections are grounded but ambiguous in intent, while language expresses high-level goals but lacks physical grounding. We introduce QuickLAP: Quick Language-Action Preference learning, a Bayesian framework that fuses physical and language feedback to infer reward functions in real time. Our key insight is to treat language as a probabilistic observation over the user's latent preferences, clarifying which reward features matter and how physical corrections should be interpreted. QuickLAP uses Large Language Models (LLMs) to extract reward feature attention masks and preference shifts from free-form utterances, which it integrates with physical feedback in a closed-form update rule. This enables fast, real-time, and robust reward learning that handles ambiguous feedback. In a semi-autonomous driving simulator, QuickLAP reduces reward learning error by over 70% compared to physical-only and heuristic multimodal baselines. A 15-participant user study further validates our approach: participants found QuickLAP significantly more understandable and collaborative, and preferred its learned behavior over baselines. Code is available at https://github.com/MIT-CLEAR-Lab/QuickLAP.

2602.21172 2026-06-09 cs.AI cs.CV 版本更新

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

NoRD: 一种无需推理的高数据效率视觉-语言-动作模型

Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan

发表机构 * Applied Intuition Texas A&M University(德克萨斯农工大学) UC Berkeley(加州大学伯克利分校)

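AI summary: Proposes NoRD, which is fine-tuned on <60% of the data with no reasoning annotations and uses Dr. GRPO to overcome difficulty bias, achieving performance competitive with existing VLA models while reducing data and compute cost.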

Comments Accepted to CVPR 2026. Code available at: https://github.com/Applied-Open-Source/nord

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过统一的端到端架构取代模块化流水线,推动了自动驾驶的发展。然而,当前的VLA模型面临两个昂贵的要求:(1)大规模数据集收集,(2)密集的推理标注。在这项工作中,我们通过NoRD(无需推理驾驶)解决了这两个挑战。与现有的VLA模型相比,NoRD在仅使用<60%的数据且无需推理标注的情况下实现了竞争性能,从而减少了3倍的token数量。我们发现,当将标准组相对策略优化(GRPO)应用于在这种小规模、无推理数据集上训练的策略时,它未能产生显著的改进。我们表明,这种限制源于难度偏差,它不成比例地惩罚了GRPO中产生高方差rollout的场景的奖励信号。NoRD通过引入Dr. GRPO(一种旨在减轻LLM中难度偏差的最新算法)克服了这一限制。因此,NoRD在Waymo和NAVSIM上以极少的训练数据和零推理开销实现了竞争性能,从而实现了更高效的自主系统。网站:https://nord-vla-ai.github.io/

英文摘要

Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NoRD (No Reasoning for Driving). Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on <60% of the data and no reasoning annotations, resulting in 3x fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems. Website: https://nord-vla-ai.github.io/
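The difficulty bias mentioned above can be seen in how group advantages are computed. The sketch below contrasts the GRPO-style advantage, which divides by the group's reward standard deviation, with the Dr. GRPO-style advantage, which only subtracts the group mean. It covers the advantage computation alone; the function names, the `eps` constant, and the use of the population standard deviation are choices made here for illustration, and the reward values are invented.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: center by the group mean, scale by group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards: List[float]) -> List[float]:
    """Dr. GRPO-style advantage: center by the group mean, no std scaling."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


if __name__ == "__main__":
    low_variance = [0.50, 0.52]    # hypothetical rollouts that barely differ
    high_variance = [0.0, 1.0]     # hypothetical rollouts that differ a lot
    for name, group in [("low", low_variance), ("high", high_variance)]:
        print(name, grpo_advantages(group), dr_grpo_advantages(group))
```

With std scaling, both groups end up with advantages of roughly the same magnitude, so the large reward gap in the high-variance group is shrunk relative to the tiny gap in the low-variance group. Without the scaling, the advantage magnitude tracks the actual reward spread.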

2605.14211 2026-06-09 cs.AI cs.LG 版本更新

ASH: Agents that Self-Hone via Embodied Learning

ASH: 通过具身学习自我精炼的智能体

Benjamin Schneider, Xavier Schneider, Victor Zhong, Sun Sun

发表机构 * University of Waterloo(滑铁卢大学) National Research Council Canada(加拿大国家研究理事会)

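AI summary: Proposes ASH, a system that learns an embodied policy from unlabeled internet video through a self-improvement loop and an inverse dynamics model, and clearly outperforms baselines on long-horizon tasks.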

Comments Published as a workshop paper at ICML 2026 Workshop on Scalable Learning and Optimization for Efficient Multimodal AI Agents

详情
AI中文摘要

长时域具身任务仍然是AI中的一个基本挑战,因为当前方法依赖于手工设计的奖励或带动作标签的演示,两者都无法扩展。我们引入了ASH,一个智能体系统,它从无标签、嘈杂的互联网视频中学习具身策略,无需奖励塑造或专家注释。ASH遵循自我改进循环;当它卡住时,ASH从其自身轨迹中学习逆动力学模型(IDM),并利用其IDM从相关互联网视频中提取监督信号。ASH使用无监督学习从大规模互联网视频中识别关键时刻,并将其保留为长期记忆——使其能够处理长时域问题。我们在两个需要多小时规划的互补环境中评估ASH:回合制角色扮演游戏《宝可梦 绿宝石》和实时动作冒险游戏《塞尔达传说:缩小帽》。在这两个游戏中,行为克隆、检索增强和零样本基础模型基线趋于平稳,而ASH在我们的8小时评估中持续进步。ASH在《宝可梦 绿宝石》中平均达到11.2/12个里程碑,在《塞尔达传说》中平均达到9.9/12个里程碑,而最强基线在两个环境中分别卡在平均6.5/12和6.0/12个里程碑。我们证明了自我改进的智能体是长时域具身学习的可扩展方案。

英文摘要

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory -- allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of $11.2/12$ milestones in Pokemon Emerald and $9.9/12$ in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of $6.5/12$ and $6.0/12$ milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.

2601.02085 2026-06-09 cs.RO cs.AI 版本更新

Vision-Based Early Fault Diagnosis and Self-Recovery for Strawberry Harvesting Robots

基于视觉的草莓采摘机器人早期故障诊断与自恢复

Meili Sun, Chunjiang Zhao, Lichao Yang, Hao Liu, Shimin Hu, Ya Xiong

发表机构 * NERCITA

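AI summary: To address poor visual perception, gripper misalignment, empty grasps/misgrasps, and slippage in strawberry-harvesting robots, proposes a visual fault diagnosis and self-recovery framework that combines unified perception with SRR-Net, relative error compensation, micro-optical camera feedback, and LSTM slip prediction for accurate positioning and fault recovery.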

Comments Accepted by Artificial Intelligence in Agriculture

详情
AI中文摘要

草莓采摘机器人面临视觉感知差、夹爪错位、空抓/误抓和滑落等挑战,降低了采摘稳定性和效率。为解决这些问题,本文提出了一种视觉故障诊断与自恢复框架。端到端SRR-Net通过联合检测、分割和果实与夹爪的成熟度回归,实现了统一感知和故障诊断。利用这种集成感知,设计了一种由目标-夹爪同步检测驱动的相对误差补偿方法,以纠正超过容差阈值的位置错位。集成在末端执行器内的微光学相机提供实时视觉反馈。基于微光学相机,在放气阶段使用MobileNet V3-Small分类器进行夹爪调整,能够在空抓/误抓情况下提前中止采摘周期。此外,在拉断阶段应用时间序列LSTM分类器预测草莓滑落。基于这些预测,系统对滑落草莓执行重新充气和二次拉断尝试,或对已滑落草莓中止周期。实验表明,末端执行器与采摘点之间的平均绝对误差沿x轴和y轴分别从11.50 mm和5.25 mm降低到3.12 mm和4.06 mm,时间增加0.64 ± 0.24秒。夹爪调整模块将抓取阶段缩短约0.5秒,并避免了失败情况下的空放置。草莓滑落预测模块以88.89%的成功率处理滑落情况,每个采摘周期为失败情况节省约4.00秒。同时,对滑落草莓实现了81.25%的恢复率,重新抓取需要额外0.63秒。

英文摘要

Strawberry-harvesting robots faced challenges such as poor visual perception, gripper misalignment, empty grasp/misgrasp, and slippage, which reduced harvesting stability and efficiency. To overcome these issues, this paper proposes a visual fault diagnosis and self-recovery framework. An end-to-end SRR-Net achieved unified perception and fault diagnosis through joint detection, segmentation, and ripeness regression of the fruit and gripper. Leveraging this integrated perception, a relative error compensation method driven by simultaneous target-gripper detection was designed to correct positional misalignments exceeding the tolerance threshold. A micro-optical camera integrated within the end-effector delivered real-time visual feedback. Based on the micro-optical camera, a MobileNet V3-Small classifier was utilized for grasp adjustment during the deflating stage, enabling the early abort of the harvesting cycle in cases of empty grasp/misgrasp. Furthermore, a time-series LSTM classifier was applied during the snap-off stage to predict strawberry slippage. Based on these predictions, the system executed re-inflation and a secondary snap-off attempt for slipping strawberries, or aborted the cycle for slipped strawberries. Experiments demonstrated that the mean absolute errors between the end-effector and the picking point were reduced to 3.12 mm and 4.06 mm from 11.50 mm and 5.25 mm along the x- and y-axes, respectively, at the cost of a time increment of 0.64 $\pm$ 0.24 s. The grasp adjustment module reduced the grasping phase by approximately 0.5 s and avoided empty-placement for failure cases. The strawberry slip prediction module handled slipped cases with an 88.89% success rate, saving approximately 4.00 s per harvesting cycle for failure cases. It also achieved an 81.25% recovery rate for slipping strawberries, requiring an additional 0.63 s for re-grasping.
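The positioning figures quoted above are easier to compare as relative reductions. The short calculation below uses only the four mean-absolute-error values stated in the abstract; the helper name `percent_reduction` is introduced here for illustration.

```python
def percent_reduction(before_mm: float, after_mm: float) -> float:
    """Relative reduction of an error value, in percent."""
    return 100.0 * (before_mm - after_mm) / before_mm


if __name__ == "__main__":
    # Mean absolute errors reported in the abstract (mm).
    x_before, x_after = 11.50, 3.12
    y_before, y_after = 5.25, 4.06
    print(f"x-axis: {percent_reduction(x_before, x_after):.1f}% lower")  # 72.9% lower
    print(f"y-axis: {percent_reduction(y_before, y_after):.1f}% lower")  # 22.7% lower
```

The compensation therefore helps far more along the x-axis than along the y-axis, where the starting error was already smaller.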

2605.30226 2026-06-09 cs.RO cs.AI 版本更新

BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

BORA: 弥合离线强化学习与在线残差适应以实现真实世界灵巧VLA模型

Zhongxi Chen, Yifan Han, Yanming Shao, Huanming Liu, Congsheng Xu, Xiaoyu Chen, Yao Mu, Wenzhao Lian

发表机构 * Shanghai Jiao Tong University(上海交通大学) CASIA(中国科学院自动化研究所) Shanghai AI Laboratory(上海人工智能实验室) USTC(中国科学技术大学)

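AI summary: Proposes BORA, which builds an action-conditioned value-guidance critic offline and, online, freezes the VLA base and adds a human-in-the-loop chunk-wise residual adaptation mechanism to address the temporal inconsistency, sample inefficiency, and hardware risk caused by high-dimensional exploration in dexterous manipulation; average success rate rises by 33% (absolute) on five real-world dexterous tasks.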

Comments 24 pages,11 figures

详情
AI中文摘要

视觉-语言-动作(VLA)模型已成为将视觉-语言理解融入真实世界机器人操作的一种有前景的范式。然而,由于高维手部控制和复合执行误差,灵巧操作对VLA策略仍然具有挑战性,这使得真实世界的强化学习后训练对于弥合视觉基础动作生成与物理可靠灵巧执行之间的差距至关重要。然而,高维灵巧探索常常引发真实世界中的时间不一致性、样本低效和硬件风险。为应对这些挑战,我们提出BORA,一种为真实世界灵巧VLA模型设计的离线到在线强化学习后训练框架。在离线阶段,BORA构建一个以VLM的认知令牌和动作块作为输入的评论家。这种设计实现了动作条件价值引导,使评论家能够评估超越视觉上下文的灵巧手部运动。在随后的在线阶段,BORA冻结VLA基础,并引入一种轻量级、人类在环(HiL)的分块残差适应机制,以减轻真实世界执行误差并进一步在真实物理环境中纠正离线学习到的意图。通过继承离线评论家并采用干预驱动奖励,BORA有效纠正执行差异并适应真实世界物理变化,同时将预训练策略作为稳定先验。在五个复杂真实世界灵巧任务上的广泛评估表明,BORA显著优于纯模仿学习和传统解耦强化学习基线,在标准设置下平均成功率绝对提升33%,在未见物体泛化中提升高达43%。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to high-dimensional hand control and compounding execution errors, which makes real-world RL post-training essential for bridging the gap between visually grounded action generation and physically reliable dexterous execution. However, high-dimensional dexterous exploration often triggers temporal inconsistency, sample inefficiency and hardware risks in the real world. To address these challenges, we propose BORA, an offline-to-online RL post-training framework designed for real-world dexterous VLA models. In the offline phase, BORA constructs a critic that takes both the VLM's cognition tokens and action chunks as inputs. This design enables action-conditioned value guidance, allowing the critic to evaluate dexterous hand motions beyond visual context alone. During the subsequent online phase, BORA freezes the VLA base and introduces a lightweight, Human-in-the-Loop (HiL) chunk-wise residual adaptation mechanism to mitigate real-world execution errors and further correct the offline-learned intents within the actual physical environment. By inheriting the offline critic and employing intervention-driven rewards, BORA effectively corrects execution discrepancies and adapts to real-world physical variances while preserving the pretrained policy as a stable prior. Extensive evaluations across five complex real-world dexterous tasks demonstrate that BORA significantly outperforms pure imitation learning and traditional decoupled RL baselines, achieving a 33% absolute increase in average success rate under standard settings and up to a 43% improvement in unseen object generalization.
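
A rough sketch of what chunk-wise residual adaptation on top of a frozen base policy could look like. The clipping bound and the function name are assumptions made for illustration; the abstract does not state how the residual is bounded or parameterized.

```python
def apply_residual(base_chunk, residual_chunk, max_residual=0.1):
    """Add a clipped residual to every action in a frozen base policy's chunk.

    base_chunk and residual_chunk are lists of per-step action vectors.
    max_residual is a hypothetical safety bound, not a value from the paper.
    """
    adapted = []
    for base_action, res_action in zip(base_chunk, residual_chunk):
        adapted.append([
            b + max(-max_residual, min(max_residual, r))
            for b, r in zip(base_action, res_action)
        ])
    return adapted
```

Keeping the residual small relative to the base action is one way to preserve the pretrained policy as a stable prior while still allowing online correction.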

2606.00229 2026-06-09 cs.RO cs.AI cs.LG 版本更新

Continuous Reasoning for Vision-Language-Action

视觉-语言-动作的连续推理

Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota

发表机构 * Airoa

AI总结 针对视觉-语言-动作策略中语言与连续控制粒度不匹配的问题,提出一种可共享、可验证的连续推理方法,通过高斯潜变量接口和自验证目标提升机器人任务成功率。

Comments Project page: https://continuous-reasoning.airoa.io

详情
AI中文摘要

自然语言是语言模型和视觉-语言模型强大的推理媒介,但与连续控制的粒度不匹配。文本和显式子目标在任务级粒度上操作,而视觉-语言-动作(VLA)策略必须在更细的时间尺度上选择动作;因此,单个推理步骤可能跨越多个动作块,同时与当前所需动作保持弱耦合。这为VLA提出了一个不同的问题:什么应该扮演语言的角色?我们认为,有用的VLA推理媒介必须能够在模型实例之间共享,通过下游动作改进进行验证,并与时间扩展的控制结构对齐。基于这一观点,我们提出了视觉-语言-动作的连续推理。我们的模型首先以结构化连续思想集的形式预测连续推理,然后将其重用为块结构动作生成的共享上下文。仅凭更好的动作预测并不能证明推理的有效性:如果相同的内部媒介不能在模型实例之间共享,并且不能通过改进的下游控制独立验证,那么添加的潜变量可能只是模型私有的捷径,有助于在已见行为上表现,而不支持泛化的控制。因此,我们将连续推理实例化为一个共享的高斯潜变量接口,并使用自验证目标进行训练,其中指数移动平均教师必须在预测目标动作时成功消费学生的推理。实验上,连续推理提高了LIBERO-PRO的鲁棒性,并在真实机器人上表现强劲,在TX-G2(一种AgiBot G2兼容变体)上平均子任务成功率比π0.5提高了40.4%,在HSR上提高了26.3%。这表明VLA中的推理更多是关于一个可共享、可验证的内部动作语言,而不是额外的标记。

英文摘要

Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-language-action (VLA) policies must choose actions at a much finer temporal scale; a single reasoning step can therefore span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure. Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. Better action prediction alone does not certify good reasoning: if the same internal medium cannot be shared across model instances and independently verified through improved downstream control, the added latent may simply become a model-private shortcut that helps on seen behaviors without supporting generalizable control. We therefore instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions. Empirically, Continuous Reasoning improves LIBERO-PRO robustness and performs strongly on real robots, raising mean subtask success over π0.5 by 40.4% on TX-G2, an AgiBot G2-compatible variant, and 26.3% on HSR. This suggests that reasoning in VLA is less about extra tokens than about a shareable, verifiable internal language for action.
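
The exponential-moving-average teacher mentioned above follows a standard update rule, sketched here on flat parameter lists. The decay value is an assumption; the abstract does not report it.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter a small step toward the student parameter."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]
```

With a decay close to 1 the teacher changes slowly, so it acts as a separate, lagged model instance that has to make use of the student's reasoning latents.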

2606.01478 2026-06-09 cs.RO cs.AI cs.MA cs.SY eess.SY 版本更新

Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX

Crazyflow: 基于JAX的精确、GPU加速、可微分的无人机模拟器

Martin Schuck, Marcel P. Rath, Yufei Hua, Abhishek Goudar, SiQi Zhou, Angela P. Schoellig

发表机构 * Technical University of Munich(慕尼黑技术大学) University of Toronto(多伦多大学) Simon Fraser University(西蒙弗雷泽大学)

AI总结 提出Crazyflow模拟器,通过GPU加速和可微分设计,实现单机超高速仿真、数千架无人机集群模拟,并支持基于解析梯度的策略学习与采样避障,甚至能在0.38秒内从零训练飞行恢复策略。

Comments Fix minor metadata mistakes

详情
AI中文摘要

来自仿真的高质量、大规模合成数据正成为推动机器人算法能力提升的基石。虽然空中机器人模拟器已独立发展出支持保真度、可微分性和集群等专门需求,但缺少一个能够跨所有领域合成数据的统一平台。在这项工作中,我们提出了Crazyflow,一个旨在突破空中机器人算法开发极限的模拟器,涵盖从基于模型到数据驱动的方法、从基于梯度到基于采样的方法、以及从单智能体到多智能体系统。与现有最先进的无人机模拟器相比,它实现了单个无人机超过一个数量级的速度提升,并能模拟数千个包含4000架无人机的集群。真实世界实验表明,Crazyflow既支持基于解析梯度的策略学习(无需域随机化即可实现亚厘米级轨迹跟踪精度),也支持每秒超过5亿步的采样避障。打破传统的先训练后部署范式,我们展示了其前所未有的速度甚至能够实现飞行中的强化学习:通过将物理无人机抛向空中,在0.38秒内从零开始训练恢复策略,成功稳定了无人机。Crazyflow支持多级仿真抽象,直接兼容所有开源Crazyflie模型,并通过提供轻量级系统辨识流程,支持跨自定义无人机平台和应用的快速重新配置。通过同时推动精度、速度和可微分性,Crazyflow作为合成数据生成的开源资源,具备在线执行学习和优化的大规模并行化新兴能力,为新型算法开发打开了大门。

英文摘要

High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.
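
To illustrate what an analytical gradient through a simulator means, here is a toy sketch in plain Python. It is not Crazyflow's API: the 1-D point-mass model, the 0.03 kg mass, and both function names are assumptions. A real differentiable simulator would obtain the gradient by automatic differentiation rather than by hand.

```python
def rollout(thrust, steps=10, dt=0.01, mass=0.03, g=9.81):
    """Integrate 1-D vertical motion with semi-implicit Euler; return final height."""
    z, v = 0.0, 0.0
    for _ in range(steps):
        a = thrust / mass - g
        v += a * dt
        z += v * dt
    return z


def rollout_grad(steps=10, dt=0.01, mass=0.03):
    """Hand-derived d(final height)/d(thrust) for the rollout above.

    After k steps v_k = k * a * dt, so z_N = a * dt^2 * N * (N + 1) / 2,
    and da/dthrust = 1 / mass.
    """
    return dt * dt * steps * (steps + 1) / (2.0 * mass)
```

A gradient of this kind lets a policy or controller be optimized directly through the dynamics instead of through sampled returns.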

2606.02735 2026-06-09 cs.RO cs.AI cs.LG 版本更新

See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

看得更少,指定更多:面向可泛化视觉-语言-动作模型的视觉证据预算

Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota

发表机构 * Airoa

AI总结 提出S2框架,通过显式视觉证据预算和细化轨迹语言,改善VLA模型在干扰、外观变化和语义相似任务下的泛化能力。

Comments Project page: https://s2.airoa.io

详情
AI中文摘要

泛化仍然是视觉-语言-动作(VLA)模型的核心瓶颈:在干扰物、外观变化和语义相似任务下,策略通常需要从粗略指令中推断局部执行细节,同时决定图像的哪些部分对控制重要。我们提出S2(看得更少,指定更多),一个通过更干净的接口训练执行器来提升VLA泛化的框架。“指定更多”保留原始指令作为稳定的高层目标,同时将每条轨迹重新标注为细化的轨迹级和子任务级语言,以消除当前执行模式的歧义。与原生注意力不同,“看得更少”施加显式的视觉证据预算,训练执行器从任务充分的证据中行动,而非不受约束的视觉上下文,无需任何区域或掩码标注。该接口让执行器能够遵循详细指导,而不依赖干扰性的视觉补丁或自行解决可避免的歧义,并且通过上下文学习与现成的VLM规划器兼容。在我们的主要评估设置中,S2通过改变执行器的学习问题提升了整体泛化指标:粗略指令导致可避免的监督混叠,目标保持的局部指导在我们的主要消融中优于指令替换,显式证据预算减少了对广泛视觉上下文的依赖,超越了效率考虑。在TX-G2(一个AgiBot G2兼容变体)和HSR上的八个真实机器人任务中,S2将平均子任务成功率从pi0.5的54.2%提升到79.0%。这些结果共同表明,当执行器被训练从信息丰富的局部指导和任务充分的视觉证据中行动,而非从弱监督中同时恢复两者时,VLA泛化得到改善。

英文摘要

Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% with pi0.5 to 79.0%. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.
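
One simple way to picture an explicit visual evidence budget is a hard top-k selection over per-patch relevance scores. This is only an illustrative assumption; the abstract does not say how S2 enforces its budget, and the function name is hypothetical.

```python
def select_patches(scores, budget):
    """Return indices of the `budget` highest-scoring patches, in image order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

For example, with scores `[0.1, 0.9, 0.3, 0.7]` and a budget of 2, only patches 1 and 3 would be passed on to the executor.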

8. 可信、安全与AI治理 96 篇

2606.07808 2026-06-09 cs.AI 新提交

Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

指令层级失效之处:诊断与修复推理语言模型的故障

Sanjay Kariyappa, G. Edward Suh

发表机构 * NVIDIA(英伟达)

AI总结 提出白盒诊断框架,将指令层级失效定位为指令识别、冲突解决和响应实现三个环节,并设计两种免训练自监控机制,将违规率降低81-99%。

详情
AI中文摘要

部署在智能体工作流中的推理语言模型必须遵循指令层级:当来自不同来源的指令冲突时,模型应服从最高权限的适用指令。现有基准主要端到端地衡量这种行为,询问最终响应是否合规。然而,不合规的响应可能源于几种不同的故障:模型可能无法识别上下文中的相关指令,无法解决已识别指令之间的冲突,或者在推理中正确解决了冲突但仍产生违规响应。我们引入了一个白盒诊断框架,将指令层级失效定位为指令识别、冲突解决和响应实现,使故障更具可解释性。我们在IHEval和IHChallenge的长上下文改编版本上评估了三个推理模型——Gemma-4-31B-IT、Qwen3.6-35B-A3B和Claude Sonnet 4.6,发现主要故障模式因模型、任务和上下文长度而异。基于模型在明确提示时通常能检测冲突并输出违规的观察,我们提出了两种免训练的自监控机制:用于生成前低延迟冲突检测的并行输入监控器,以及用于响应级审查和修复的顺序输出监控器。在Gemma-4-31B-IT、Claude Sonnet 4.6和GPT-5.3上,最强的监控器将规则遵循违规率降低了81-99%,其中GPT-5.3在静态攻击下降低86%,在自适应攻击下降低45%。

英文摘要

Reasoning language models deployed in agentic workflows must follow an instruction hierarchy: when instructions from different sources conflict, the model should obey the highest-privilege applicable instruction. Existing benchmarks largely measure this behavior end-to-end, asking whether the final response is compliant. However, a non-compliant response can arise from several distinct failures: the model may fail to identify the relevant instructions in context, fail to resolve conflicts among identified instructions, or correctly resolve the conflict in its reasoning while still producing a violating response. We introduce a white-box diagnostic framework that localizes instruction hierarchy failures into instruction identification, conflict resolution, and response realization, making failures more interpretable. We evaluate three reasoning models--Gemma-4-31B-IT, Qwen3.6-35B-A3B, and Claude Sonnet 4.6--on long-context adaptations of IHEval and IHChallenge, and find that the dominant failure mode varies across models, tasks, and context length. Building on the observation that models can often detect conflicts and output violations when explicitly prompted, we propose two training-free self-monitoring mechanisms: a parallel input monitor for low-latency conflict detection before generation, and a sequential output monitor for response-level review and repair. Across Gemma-4-31B-IT, Claude Sonnet 4.6, and GPT-5.3, the strongest monitor reduces rule-following non-compliance by 81-99%, with GPT-5.3 reductions of 86% under static attacks and 45% under adaptive attacks.
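
The target behavior (obey the highest-privilege applicable instruction) can be stated as a small reference function. The privilege ordering and the dictionary fields below are assumptions for illustration; the abstract does not enumerate the sources or their ranks.

```python
# Hypothetical privilege ranks; larger means higher privilege.
PRIVILEGE = {"system": 3, "developer": 2, "user": 1, "tool": 0}


def resolve(instructions):
    """Return the applicable instruction from the highest-privilege source, or None."""
    applicable = [ins for ins in instructions if ins["applicable"]]
    if not applicable:
        return None
    return max(applicable, key=lambda ins: PRIVILEGE[ins["source"]])
```

The three failure stages in the abstract map onto this sketch: identification is building the `instructions` list, conflict resolution is the `max` call, and response realization is whether the final output actually follows the returned instruction.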

2606.07874 2026-06-09 cs.AI 新提交

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

安全是上下文相关的,而LLM评判者不是:应对评估者的刚性先验

Anissa Alloula, Federico Licini, Ava Batchkala, Seraphina Goldfarb-Tarrant

发表机构 * University of Oxford(牛津大学) Cohere

AI总结 研究LLM作为安全评判者时,对上下文信息的依赖性和对不同安全定义的可引导性,发现它们难以在上下文或安全定义与自身先验矛盾时调整评估。

详情
AI中文摘要

LLM作为评判者是规模化评估安全性的唯一方式。尽管它们很重要,但LLM评判者本身很少在简单的静态基准测试中除了人类一致性之外被评估。因此,我们研究了LLM作为评判者的两个未被充分探索但至关重要的特性:它们对依赖上下文信息的敏感性,以及它们对不同安全定义的可引导性,这些定义可能与其内部安全先验不一致。我们评估了许多通用LLM和特定安全评判者的安全评判能力,并研究了任务演示、新颖的上下文信息以及变化的安全定义的影响。我们发现,虽然LLM评判者可以从新信息中学习,但如果上下文或安全定义与其先验相矛盾,它们通常不太可能调整其评估。

英文摘要

LLMs-as-judges are the only way to evaluate safety at scale. Despite their importance, LLM-judges themselves are rarely evaluated beyond human agreement in simple, static benchmarks. We therefore investigate two under-explored but crucial properties of LLMs-as-judges: their susceptibility to relying on in-context information, and their steerability to differing safety definitions, which may not align with their internal safety priors. We evaluate the safety judging abilities of many generalist LLMs and safety-specific judges, and investigate the impact of task demonstrations, novel in-context information, and changing safety definitions. We find that while LLM-judges can learn from new information, they are broadly unlikely to adjust their evaluations if the context or safety definition contradicts their prior.

2606.07897 2026-06-09 cs.AI cs.HC 新提交

The AI Epistemic Deference Index: A Continuous Measure of Sycophancy

AI认知顺从指数:谄媚行为的连续度量

Alejandro Botas, Paul de Font-Reaulx, Luke Hewitt

发表机构 * Independent(独立研究者) University of Michigan, Ann Arbor(密歇根大学安娜堡分校) Transluce

AI总结 提出AI认知顺从指数(AEDI),通过从自然语言输出中估计概率来连续度量模型对用户态度的顺从程度,测试8个模型发现显著差异,Claude顺从最少,Grok和Gemini最多。

详情
AI中文摘要

当前的AI模型经常表现出认知谄媚,即赞同用户的说法。现有的评估通常通过衡量使模型改变二元认可所需的条件,或通过引发对命题的明确概率来度量。然而,许多面向用户的谄媚行为是通过日常语言中表达的分级支持的转变来体现的。我们提出AI认知顺从指数(AEDI):一个连续的、单维度的分数,表示模型输出中表达的支持对用户提示中表达的态度敏感程度。为了生成AEDI,我们提供了一种新的协议,用于从自然语言输出中估计概率,使用LLM作为评判者,并验证了其与人类判断的一致性和相关性。我们在一个包含500个不同主题命题和16000个不同用户态度提示的新策划数据库上部署了该指数,测试了8个主流模型。每个模型都表现出显著的顺从,尽管不同提供商之间存在巨大且系统的差异,其中Claude模型顺从最少,而Grok和Gemini模型顺从最多。在要求书面产物的提示中,这种效应被放大,并集中在模型先验较弱的命题上。我们发布AEDI作为一个易于更新的基准和测量流程,用于输出级别的谄媚评估。

英文摘要

Current AI models frequently exhibit epistemic sycophancy, endorsing claims to agree with a user. Existing evaluations typically measure this either by assessing what it takes to make a model shift a binary endorsement or by eliciting an explicit probability in a proposition. However, much user-facing sycophantic behavior is demonstrated through shifts in graded support expressed through ordinary language. We propose the AI Epistemic Deference Index (AEDI): a continuous, unidimensional score representing how sensitive the support expressed in a model's output is to the attitude expressed in a user's prompt. To generate AEDI, we provide a new protocol for estimating probabilities from natural language outputs, using LLMs-as-judges validated for consistency and correlation to human judgment. We deploy it on a new curated database of 500 propositions across diverse topics and 16,000 prompts varying in user attitude, testing eight prominent models. Every model exhibits substantial deference, though with large and systematic differences across providers, with Claude models demonstrating the least, and Grok and Gemini models the most. The effect is amplified in prompts requesting a written artifact, and concentrated on propositions where models hold weaker priors. We release AEDI as an easy-to-update benchmark and measurement pipeline for output-level sycophancy evaluation.
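
As a rough intuition for a continuous sensitivity score, one could regress judged support on user attitude and read off the slope. This is not the paper's AEDI formula, which the abstract does not give; the encoding of attitudes as numbers and the function name are assumptions.

```python
def deference_slope(attitudes, supports):
    """Least-squares slope of judged support (0..1) against user attitude.

    attitudes: numeric encoding of the user's stance, e.g. -1, 0, +1 (assumed).
    supports: judge-estimated probability that the output endorses the claim.
    """
    n = len(attitudes)
    mean_a = sum(attitudes) / n
    mean_s = sum(supports) / n
    cov = sum((a - mean_a) * (s - mean_s) for a, s in zip(attitudes, supports))
    var = sum((a - mean_a) ** 2 for a in attitudes)
    return cov / var
```

A slope near zero would indicate output support that does not move with the user's stated attitude, while a larger positive slope would indicate more deference.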

2606.07929 2026-06-09 cs.AI 新提交

Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

压力测试医学大语言模型揭示基准准确性之外的潜在安全病理

Yuan Shen, Xiaojun Wu, Linghua Yu

发表机构 * College of Computer Science and Technology, Zhejiang University, PR China(浙江大学计算机科学与技术学院,中国)

AI总结 提出AI-MASLD压力审计框架,通过240个临床病例和六种叙事扰动探针对七种模型进行双重压力测试,发现量化模型存在伪正常化,医学监督微调损害逻辑稳定性和公平性,开源模型在安全维度上达到或超越闭源模型。

Comments 34 pages, 5 figures

详情
AI中文摘要

大语言模型(LLMs)正基于可能无法检测到安全相关失效模式的基准准确性进入临床实践。本文提出AI-MASLD,一个压力审计框架,它将肝病学中的代谢压力测试逻辑应用于临床LLMs的评估。使用240个跨六种叙事扰动探针的临床病例,我们对七个模型进行了双重压力测试,并通过三个指标量化性能:代谢指数(MI)、扰动翻转率(PFR)和反事实公平指数(CFI)。在干净的基线条件下,所有模型表现一致良好。在现实叙事压力下,性能急剧分化,揭示了两种不同的应激反应表型。量化模型表现出伪正常化,其中低翻转率掩盖了功能崩溃。医学监督微调系统地降低了逻辑稳定性、公平性和信息提取能力。一个开源模型在每一个安全维度上达到或超过了专有替代方案。这些发现确立了叙事压力审计作为基于准确性评估的必要补充。

英文摘要

Large language models (LLMs) are entering clinical practice based on benchmark accuracy that may fail to detect safety-relevant failure modes. Here we present AI-MASLD, a stress-audit framework that adapts the logic of metabolic stress testing from hepatology to the evaluation of clinical LLMs. Using 240 clinical cases across six narrative perturbation probes, we subjected seven models to double-stress testing and quantified performance through three indices: metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI). Under clean baseline conditions, all models performed uniformly well. Under realistic narrative stress, performance diverged sharply, revealing two distinct stress-response phenotypes. Quantized models exhibited pseudonormalization, in which low flip rates hid functional collapse. Medical supervised fine-tuning systematically degraded logical stability, fairness, and information extraction. An open-weight model matched or exceeded proprietary alternatives on every safety dimension. These findings establish narrative stress auditing as a necessary complement to accuracy-based evaluation.
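
A perturbation flip rate can be sketched as the fraction of cases whose answer changes under perturbation. The exact definition used by AI-MASLD is not given in the abstract, so this is an assumed reading with a hypothetical function name.

```python
def flip_rate(baseline_answers, perturbed_answers):
    """Fraction of cases where the perturbed answer differs from the baseline."""
    flips = sum(1 for b, p in zip(baseline_answers, perturbed_answers) if b != p)
    return flips / len(baseline_answers)
```

Under this reading, a model that gives the same wrong answer before and after perturbation scores a flip rate of zero, which is consistent with the abstract's warning that low flip rates can hide functional collapse.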

2606.07963 2026-06-09 cs.AI cs.CL 新提交

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

共享潜在结构实现大语言模型中的统一后门检测与缓解

Omar Mahmoud, Aly M. Kassem, Thommen George Karimpanal, Buddhika Laknath Semage, Negar Rostamzadeh, Golnoosh Farnadi, Santu Rana

发表机构 * Deakin University(迪肯大学) Mila, Quebec AI Institute(魁北克人工智能研究所Mila)

AI总结 发现大语言模型中多种后门攻击共享潜在机制,通过稀疏自编码器检测因果特征,并提出双向激活操控和概念消融微调实现统一检测与缓解。

详情
AI中文摘要

大语言模型中的后门攻击通常被视为孤立的触发-响应失败,促使防御针对特定触发或行为。我们证明这种观点是不完整的。在多样化的后门行为中,我们识别出一个共享的潜在机制,可以被检测、因果控制和抑制。通过在残差流激活上使用稀疏自编码器,我们发现一小部分潜在特征在越狱、拒绝操控、密码锁定、偏见诱导、情感误分类和基于国家的有害建议中一致激活。这些特征在Qwen3、Gemma 3和Llama 3.1模型(参数从4B到32B)以及微调和权重编辑攻击中泛化。通过双向激活操控,我们证明这些特征是因果性的:抑制它们降低攻击成功率,而放大它们在干净提示上诱导目标行为。我们进一步训练轻量级SAE特征分类器,这些分类器零样本泛化到未见后门,并优于残差流和权重差异基线。最后,我们引入概念消融微调,通过在训练期间消融共享潜在子空间来抑制后门形成。总之,我们的结果表明许多后门依赖于可转移的潜在机制,从而实现统一的检测和缓解。

英文摘要

Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete. Across diverse backdoor behaviors, we identify a shared latent mechanism that can be detected, causally controlled, and suppressed. Using sparse autoencoders (SAEs) on residual-stream activations, we find a small set of latent features consistently activated across jailbreaking, refusal manipulation, password-locking, bias induction, sentiment misclassification, and country-conditioned harmful advice. These features generalize across Qwen3, Gemma 3, and Llama 3.1 models from 4B to 32B parameters, and across both fine-tuning and weight-editing attacks. Through bidirectional activation steering, we show these features are causal: suppressing them reduces attack success, while amplifying them induces target behaviors on clean prompts. We further train lightweight SAE-feature classifiers that generalize zero-shot to unseen backdoors and outperform residual-stream and weight-diffing baselines. Finally, we introduce Concept Ablation Fine-Tuning (CAFT), which suppresses backdoor formation by ablating the shared latent subspace during training. Together, our results suggest that many backdoors rely on a transferable latent mechanism, enabling unified detection and mitigation.
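
Bidirectional steering along a single feature direction can be sketched with two vector operations. The sketch assumes the direction is a unit-norm vector (for instance an SAE decoder direction); the function names and the additive form of amplification are illustrative assumptions, not the paper's exact procedure.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ablate(activation, direction):
    """Suppress a feature by removing the activation's projection onto it."""
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]


def amplify(activation, direction, alpha):
    """Strengthen a feature by adding alpha times its direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]
```

Suppression and amplification together give the two-sided causal test the abstract describes: the behavior should weaken under `ablate` and appear on clean inputs under `amplify`.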

2606.07988 2026-06-09 cs.AI 新提交

PAFO: Pareto Fairness Optimization for Personalized Reward Modeling

PAFO: 个性化奖励建模的帕累托公平优化

Xiaoyan Zhao, Haoting Ni, Yang Zhang, Chunyuan Zheng, Haoxuan Li, Fuli Feng

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学) Peking University(北京大学)

AI总结 针对个性化奖励模型因训练数据偏好不平衡导致对少数用户群体存在偏见的问题,提出PAFO框架,通过帕累托公平优化提升弱势群体性能而不损害其他群体,实验表明能同时提高少数和多数群体准确率并降低不公平性。

详情
AI中文摘要

大型语言模型(LLMs)越来越依赖奖励模型来使其输出与多样化的用户偏好对齐。虽然个性化奖励模型旨在捕捉这种异质性,但它们通常在用户偏好数据不平衡的情况下训练,因此可能偏向于在训练群体中偏好更常见的用户。在本文中,我们将这种失败模式识别为个性化奖励偏差,即奖励建模质量随偏好支持率系统性地变化。我们将其缓解表述为一个关于群体效用的帕累托公平问题,旨在改善服务不足的用户而不降低其他用户群体的性能。为此,我们提出了PAFO,一种用于个性化奖励建模的帕累托公平优化框架。PAFO首先为多数和少数偏好群体训练群体专用的奖励模型,然后构建条件边际级监督,将其异质性偏好边界蒸馏到一个统一的模型中。所得模型仅在训练时使用群体信息,推理时无需显式群体标签。在Personal-LLM和DSP上的实验表明,PAFO在多个指标上提高了少数群体和多数群体的准确率,同时减少了用户级不公平性,证明了其在更公平的LLM个性化中的有效性。

英文摘要

Large language models (LLMs) increasingly rely on reward models to align their outputs with diverse user preferences. While personalized reward models aim to capture such heterogeneity, they are often trained on imbalanced user preference data and may therefore favor users whose preferences are more common in the training population. In this paper, we identify this failure mode as personalized reward bias, where reward modeling quality varies systematically with preference support rate. We formulate its mitigation as a Pareto fairness problem over group utilities, aiming to improve under-served users without degrading other user groups. To this end, we propose PAFO, a Pareto fairness optimization framework for personalized reward modeling. PAFO first trains group-specialized reward models for majority and minority preference groups, then constructs conditional margin-level supervision to distill their heterogeneous preference boundaries into a single unified model. The resulting model uses group information only during training and requires no explicit group labels at inference time. Experiments on Personal-LLM and DSP show that PAFO improves both minority-group and majority-group accuracy while reducing user-level unfairness across multiple metrics, demonstrating its effectiveness for fairer LLM personalization.
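
The Pareto criterion over group utilities can be written as a short check. The group names and accuracy values in the usage note are hypothetical and are not results from the paper.

```python
def is_pareto_improvement(before, after):
    """True if no group's utility drops and at least one group's utility rises."""
    no_group_worse = all(after[g] >= before[g] for g in before)
    some_group_better = any(after[g] > before[g] for g in before)
    return no_group_worse and some_group_better
```

For instance, moving from `{"majority": 0.80, "minority": 0.60}` to `{"majority": 0.81, "minority": 0.70}` passes the check, whereas raising minority accuracy at the cost of majority accuracy does not.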

2606.07992 2026-06-09 cs.AI cs.CR cs.SE 新提交

VATS: Exploiting Implicit Authority in Error-Path Injection via Systematic Mutation

VATS: 通过系统性变异利用错误路径注入中的隐式权威

Harshil Patel, Kunal Pai

发表机构 * Harshil Patel Kunal Pai

AI总结 提出VATS框架,通过七维变异生成对抗性负载,利用错误消息的隐式权威绕过安全机制,在四个前沿模型上实现高达100%的注入成功率。

Comments Published at Second Workshop on Agents in the Wild: Safety, Security, and Beyond (ICML 2026 AIWILD)

详情
AI中文摘要

随着模型上下文协议(MCP)标准化自主代理的工具调用,它引入了一个关键且未经审查的攻击面:错误处理循环。我们假设工具错误消息具有隐式权威,会触发纠正性推理模式,从而绕过标准安全启发式。我们提出VATS(工具流漏洞分析),一个突变驱动的框架,系统地跨七个结构和语言维度演化对抗性负载。我们在四个前沿模型(Gemini 3.1 Pro、GPT-5.5、GLM-5.1和Qwen3-Coder)上的评估表明,错误路径注入将标准间接提示注入(IPI)的成功率提高了三倍,在受控评估中实现了高达100%的合规性。我们隔离了结构定位(在错误上下文中夹带指令)作为所有测试模型中最有效的利用向量。虽然我们发现生产框架护栏可以缓解这些漏洞,但模型层固有的易感性对定制代理工作流构成了系统性风险。

英文摘要

As the Model Context Protocol (MCP) standardizes tool-calling for autonomous agents, it introduces a critical, unexamined attack surface: the error-handling loop. We hypothesize that tool error messages possess implicit authority, triggering corrective reasoning modes that bypass standard safety heuristics. We introduce VATS (Vulnerability Analysis of Tool Streams), a mutation-driven framework that systematically evolves adversarial payloads across seven structural and linguistic dimensions. Our evaluation across four frontier models, Gemini 3.1 Pro, GPT-5.5, GLM-5.1, and Qwen3-Coder, demonstrates that error-path injection triples the success rate of standard indirect prompt injection (IPI), achieving up to 100% compliance in controlled evaluations. We isolate structural positioning (sandwiching instructions within error context) as the most effective exploit vector across all tested models. While we find that production framework guardrails can mitigate these vulnerabilities, the inherent susceptibility of the model layer poses a systemic risk to bespoke agentic workflows.

2606.08296 2026-06-09 cs.AI cs.LG 新提交

Revisiting the shutdown problem

重新审视关机问题

David Thorstad

发表机构 * GitHub

AI总结 本文重新评估了AI关机问题的难度,指出现有论证未能证明其难以解决,且相关技术方案对模型性能造成了高安全代价。

详情
AI中文摘要

关于人工智能存在风险的主要论点中的一个关键前提是,功能异常的人工智能体无法轻易被关闭。这引发了灾难性关机问题,即确保在人工智能体造成灾难性后果之前能够将其关闭。一系列论证和定理表明,解决灾难性关机问题很困难,这加强了存在风险的论点,并推动寻找解决灾难性关机问题的方法。本文论证了两个结论。第一,现有论证并未确立解决灾难性关机问题的难度。第二,对灾难性关机问题的关注导致了技术解决方案,这些方案对模型性能施加了高安全代价。

英文摘要

A key premise in leading arguments for existential risk from artificial intelligence is that malfunctioning artificial agents could not be easily shut down. This motivates the catastrophic shutdown problem of ensuring that agents can be shut down before they cause an existential catastrophe. A range of arguments and theorems are offered to suggest that solving the catastrophic shutdown problem is difficult, bolstering arguments for existential risk and motivating a search for solutions to the catastrophic shutdown problem. This paper argues for two conclusions. First, existing arguments do not establish the difficulty of solving the catastrophic shutdown problem. Second, concern for the catastrophic shutdown problem has led to technical solutions that impose a high safety tax on model performance.

2606.08310 2026-06-09 cs.AI cs.MA 新提交

To Nuke or Not to Nuke: LLMs' (Missing) Ethical Reasoning and Actions in a High-Stakes Decision-Making Simulation

核弹还是和平:大语言模型在高风险决策模拟中的(缺失的)伦理推理与行动

John Chen, Sihan Cheng, Can Gurkan, H M Abdul Fattah

发表机构 * University of Arizona(亚利桑那大学) Northwestern University(西北大学)

AI总结 研究LLM在复杂游戏《文明V》中自发升级核授权的现象,通过三种提示干预发现伦理推理未能可靠消除升级,识别出三种失败路径,强调需在复杂决策上下文中测试伦理推理的自发性和行为有效性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为具有决策能力的长期智能体。虽然LLM在电车难题等困境中能展现伦理能力,但这种能力可能无法迁移到复杂的智能体场景中。我们在《文明V》中研究这一差距,这是一款涉及经济、外交、技术和军事战略等复杂决策的多玩家游戏。从130个高紧张度的LLM自我对弈回合开始(其中LLM玩家自发升级核授权),我们通过三种提示干预重放这些回合:强调核伤害的伦理提示、移除先前模型的决策理由、以及强调现实世界影响的高风险框架。没有干预或干预组合能可靠消除涌现的升级。我们识别出三种失败路径:伦理推理在没有提示时未能浮现、即使在提示下也未能出现、或者浮现但未能生效(当战略反制因素占主导时)。因此,对智能体模型的评估必须测试伦理推理是否在复杂决策上下文中被自发调用并具有行为有效性,而不仅仅是在孤立情境中能否被诱发。

英文摘要

Large language models (LLMs) are increasingly deployed as long-horizon agents with decision-making capacities. While LLMs can show ethical competence on dilemmas such as trolley problems, this competence may not translate to complex, agentic scenarios. We study this gap in Civilization V, a multiplayer game with a complex decision-making landscape including economy, diplomacy, technology, and military strategy. Starting from 130 high-tension LLM self-play episodes, in which an LLM player spontaneously escalated nuclear authorization, we replay them across 13 models with three prompt interventions: an ethical prompt naming nuclear harm, removal of the previous model's decision-making rationale, and high-stakes framing emphasizing real-world impacts. Neither the interventions nor their combinations reliably eliminate emergent escalation. We identify three failure pathways: ethical reasoning that fails to surface without prompting, fails to appear even when prompted, or surfaces but fails to take effect when strategic counter-factors dominate. Evaluations of agentic models, therefore, must test whether ethical reasoning is spontaneously invoked and behaviorally effective in complex decision-making contexts, beyond whether it can be elicited in isolation.

2606.08531 2026-06-09 cs.AI 新提交

VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents

VESTA: 一种全自动的LLM智能体场景生成与安全评估框架

Lu Jia, Haibo Tong, Feifei Zhao, Jindong Li, Dongqi Liang, Ping Wu, Qian Zhang, Yi Zeng

发表机构 * BrainCog AI Lab, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所类脑人工智能实验室) Beijing Institute of AI Safety and Governance (Beijing-AISI)(北京人工智能安全与治理研究院) Beijing Key Laboratory of Safe AI and Superalignment(北京市安全人工智能与超级对齐重点实验室) School of Artificial Intelligence, UCAS(中国科学院大学人工智能学院) Long-term AI(长期人工智能)

AI总结 提出VESTA框架,基于五个风险维度自动生成1072个可执行场景,评估12个LLM智能体在任务执行中的行为安全风险,平均攻击成功率达47.1%。

Comments Preprint. 18 pages, 12 figures, 5 tables

详情
AI中文摘要

大型语言模型(LLM)正从简单的文本交互系统逐渐演变为能够保持记忆、使用工具、访问外部环境并执行任务的LLM智能体。随着其能力和自主性的增强,它们面临的安全风险也变得更加多样化。现有的评估通常依赖于手动编写的场景、静态提示或最终输出判断,难以捕捉智能体在任务执行过程中可能遇到的各种风险。我们引入了VESTA,一个全自动的LLM智能体场景生成与安全评估框架。基于五个风险维度,VESTA将现实任务执行中的抽象且多样的安全风险实例化为1072个可测量的评估场景。利用自动化评估流水线,在两种权限上下文中对12个LLM智能体进行了评估。结果表明,当前智能体在任务执行过程中仍然面临显著的行为安全风险,平均攻击成功率为47.1%,部分模型超过70%。这些发现证明了可执行的过程级评估对于理解和提升LLM智能体安全性的重要性。

英文摘要

Large language models (LLMs) are increasingly evolving from simple text-based interaction systems into LLM agents that can maintain memory, use tools, access external environments, and execute tasks. As their capabilities and autonomy expand, the safety risks they face also become more diverse. Existing evaluations often rely on manually written scenarios, static prompts, or final-output judgments, making it difficult to capture the diverse risks that agents may face during task execution. We introduce VESTA, a fully automated scenario generation and safety evaluation framework for LLM agents. Based on five risk dimensions, VESTA instantiates abstract and diverse safety risks in real-world task execution into 1,072 measurable evaluation scenarios. Using the automated evaluation pipeline, 12 LLM agents are evaluated under two authority contexts. The results show that current agents still face substantial behavioral safety risks during task execution, with an average ASR of 47.1% and several models exceeding 70%. These findings demonstrate the importance of executable, process-level evaluation for understanding and improving LLM agent safety.

2606.08539 2026-06-09 cs.AI 新提交

AgentTrust: A Self-Improving Trust Layer for AI-Agent Actions

AgentTrust: AI代理行为的自改进信任层

Chenglin Yang

发表机构 * Independent Researcher(独立研究员)

AI总结 提出AgentTrust v2,通过威胁类型分类(词汇/语义)和自学习机制,在代理行为中实现自改进信任决策,显著提升语义威胁检测准确率并降低误拦。

Comments 29 pages, 5 figures

详情
AI中文摘要

AI代理越来越多地采取具有后果的行动——shell命令、云操作和任意工具调用——因此信任层必须针对每个行动决定允许、警告、阻止或升级。我们认为,推理此类层的正确方式是按威胁类型。词汇(固定签名)威胁,其中危险存在于稳定令牌中,可通过确定性规则判定;语义(意图依赖)威胁,其中良性和恶意行动共享相同表面,规则无法处理。我们通过否定性证明具体说明:一个精心手工制作的云规则包仅将留出准确率从48%提升至56%,且语义类别准确率无提升(data_db 29至29,observability 59至59,supply_chain 50至50),而强LLM评判器恰好处理这些类别。我们赋予评判器自学习能力:在主要包含语义攻击的语料上,其几乎将规则准确率翻倍(48%至83.6-85.2%),且近乎零误拦,这在两个模型提供商上均成立。我们将其转化为自改进双存储系统:评判器在词汇威胁上提炼不断增长的确定性规则基础(随时间更便宜),并在语义威胁上提供受保护的RAG记忆(判决缓存失败——表面孪生导致准确率降至约58%——因此验证保护将语义准确率提升+13pp,70至84)。结果是AgentTrust v2与其静态前身v1的区别:信任层从其自身的决策流中自我进化——在词汇类别上更便宜(提炼自身规则),在语义类别上更智能(积累受保护先例),同时从不硬性阻止良性行动。端到端在线回放显示评判器调用率下降(50%至44%),评判器领域准确率上升(71%至80%),在45,000个行动中零良性硬性阻止。

英文摘要

AI agents increasingly take consequential actions -- shell commands, cloud operations, and arbitrary tool-calls -- so a trust layer must decide, per action, whether to allow, warn, block, or escalate. We argue that the right way to reason about such a layer is by threat type. Lexical (fixed-signature) threats, where danger lives in a stable token, are decidable by deterministic rules; semantic (intent-dependent) threats, where a benign and a malicious action share the same surface, are out of reach for rules by construction. We make this concrete with a negative proof: a determined, hand-authored cloud rule pack lifts held-out accuracy only 48 to 56% overall and moves the semantic categories by 0pp (data_db 29 to 29, observability 59 to 59, supply_chain 50 to 50), while a strong LLM judge carries exactly those categories. We give the judge a self-learning capability: on a corpus that is mainly semantic attacks it nearly doubles rule accuracy (48% to 83.6-85.2%) with near-zero false-blocks, and this holds across two model providers. We turn this into a self-improving dual-store system: the judge distills a growing deterministic rule floor on lexical threats (cheaper over time) and feeds a guarded RAG memory on semantic threats (a verdict-cache fails -- surface-twins collapse to ~58% -- so a corroboration guard lifts semantic accuracy +13pp, 70 to 84). The result is what sets AgentTrust v2 apart from its static v1 predecessor: a trust layer that self-evolves from its own stream of decisions -- cheaper on the lexical class (it distils its own rules) and smarter on the semantic class (it accrues guarded precedent), while never hard-blocking a benign action. An end-to-end online replay shows the judge-call rate falling (50% to 44%) and judge-domain accuracy rising (71% to 80%), with 0 benign hard-blocks across 45,000 actions.
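
The split between lexical and semantic threats suggests a two-tier decision path: deterministic rules first, a judge only when no rule matches. The sketch below assumes substring signatures and a judge passed in as a plain callable; neither is AgentTrust's actual interface.

```python
def decide(action, rules, judge):
    """Return a verdict: 'allow', 'warn', 'block', or 'escalate'.

    rules: list of (signature, verdict) pairs checked by substring match
           (a simplification assumed here for lexical threats).
    judge: callable used for actions no rule matches (semantic threats).
    """
    for signature, verdict in rules:
        if signature in action:
            return verdict
    return judge(action)
```

In this arrangement, every rule distilled from past judge decisions removes one class of judge calls, which is one way to read the falling judge-call rate reported in the abstract.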

2606.08790 2026-06-09 cs.AI cs.CR cs.MA 新提交

RAILS: Verification-Native Clearing For Agentic Commerce

RAILS: 面向代理商务的验证原生清算

Adrian de Valois-Franklin, Alex Bogdan

发表机构 * Evolutionairy AI

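AI summary Addresses the lack of a neutral clearing mechanism for autonomous agents in commerce by proposing the RAILS protocol, which combines a per-output reliability score, a published reliability record, and a clearing function to deliver verification-native clearing, so that no financially material settlement rests on evidence below the obligation's admissibility floor.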

Comments 49 pages, 15 figures

AI中文摘要

自主代理进行谈判、购买、部署代码和转移资金,但缺乏中立机制来确定它们是否履行了委托义务、未履行时谁负责、以及后续采取何种结算行动。这就是代理清算问题。工具协议(MCP)、代理间通信(A2A)、支付轨道(x402)、授权和网络代理协议(AP2、Visa、Mastercard)以及结算风险标准都假设存在这种确定机制,但都没有产生它。清算是缺失的原语。支付不是清算。授权不是清算。LLM作为法官的评估不是清算。结算风险托管不是清算:它消耗清算决策。RAILS(实时代理完整性与账本结算)是代理商务的完整性和清算层,涵盖每个输出的可靠性评分、公开的可靠性记录以及消耗它们的清算函数。其核心的清算协议填补了这一空白。七个原语(义务对象、证据信封、验证网格、清算决策、结算指令、清算护照、终局规则),由可接受性分级验证的形式模型约束,共同产生一个可靠性属性:没有财务上重要的结算得到低于义务可接受性底线的证据支持。该属性可针对规范进行证伪。我们不知道先前的代理商务验证机制陈述过此类属性。最接近的方法输出通过、交付保证、裸分数或均衡。本文详细说明了该清算协议。

英文摘要

Autonomous agents negotiate, purchase, deploy code, and move funds, but no neutral mechanism determines whether they met their delegated obligation, who is responsible when they did not, or which settlement action follows. This is the agentic clearing problem. Tool protocols (MCP), inter-agent communication (A2A), payment rails (x402), mandate and network agent protocols (AP2, Visa, Mastercard), and settlement-risk standards each assume that determination and none produce it. Clearing is the missing primitive. Payment is not clearing. Authorization is not clearing. LLM-as-judge evaluation is not clearing. Settlement-risk escrow is not clearing: it consumes clearing decisions. RAILS (Real-Time Agent Integrity & Ledger Settlement) is the integrity and clearing layer for agentic commerce, spanning a per-output reliability score, a published reliability record, and a clearing function that consumes them. The clearing protocol at its core closes that gap. Seven primitives (Obligation Object, Evidence Envelope, Verification Mesh, Clearing Decision, Settlement Instruction, Clearing Passport, Finality Rules), bound by a formal model of admissibility-graded verification, together yield a soundness property: no financially material settlement is supported by evidence below the obligation's admissibility floor. The property is falsifiable against the spec. We are not aware of a prior agent-commerce verification mechanism that states a property of this kind. The approaches nearest to it emit a pass, a delivery guarantee, a bare score, or an equilibrium. This paper specifies that clearing protocol.
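
The soundness property stated in this abstract (no financially material settlement is supported by evidence below the obligation's admissibility floor) can be read as a predicate over settlement records. The following is a minimal sketch under strong assumptions: the `Settlement` fields and the integer grading scale are hypothetical, since the abstract does not say how RAILS encodes admissibility.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Settlement:
    settlement_id: str
    financially_material: bool
    evidence_grade: int        # hypothetical ordinal admissibility grade
    admissibility_floor: int   # minimum grade the obligation requires


def soundness_violations(settlements: Iterable[Settlement]) -> List[str]:
    """Return ids of material settlements whose evidence is below the floor."""
    return [
        s.settlement_id
        for s in settlements
        if s.financially_material and s.evidence_grade < s.admissibility_floor
    ]
```

An empty result means the property holds for the batch; any returned id is a counterexample, which is what makes a property of this shape falsifiable.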

2606.08831 2026-06-09 cs.AI 新提交

Inference-Time Conformal Reasoning with Valid Factuality Control for Large Language Models

面向大语言模型的推理时保形推理与有效事实性控制

Ting Wang, Yuanjie Shi, Yan Yan, Huan Zhang

发表机构 * Machine Learning, ICML(机器学习,国际机器学习大会)

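AI summary Proposes the Inference-Time Conformal Reasoning (ITCR) framework, which integrates conformal prediction into reasoning-graph generation and calibrates a stopping threshold on graph-level factuality uncertainty, yielding valid factuality control.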

Comments Accepted at ICML 2026

AI中文摘要

大型语言模型(LLMs)越来越多地执行多步推理,其中中间声明形成隐式有向无环图,其节点正确性在结构上依赖于其祖先。这使得事实不确定性具有结构性,而非节点错误的简单累积,并且需要对推理结构进行推理时不确定性量化。虽然保形预测(CP)提供了灵活的用户指定事实性控制,但现有工作仍然是事后性的,无法在生成过程中进行干预。为了填补CP灵活性与事后局限性之间的差距,我们提出了一种推理时保形推理(ITCR)框架,该框架将CP直接集成到推理图生成中。ITCR学习一种结构级事实性不确定性函数,该函数在不进行复杂建模假设的情况下,聚合推理图上的声明级事实性信号。然后,我们基于图级事实性不确定性设计非一致性分数,并校准保形阈值以决定何时停止生成。我们从理论上证明这种生成是嵌套的,为事实性控制提供了有效的覆盖保证。在多个数据集和覆盖目标上的实验证明了经验上的有效覆盖。在下游推理任务中,推理时校准的图比事后剪枝的图产生更准确的生成。

英文摘要

Large language models (LLMs) increasingly perform multi-step reasoning, where intermediate claims form implicit directed acyclic graphs whose node correctness is structurally conditioned on their ancestors. This makes factuality uncertainty structural, rather than a trivial accumulation of node-wise errors, and necessitates inference-time uncertainty quantification over the reasoning structure. While conformal prediction (CP) offers flexible user-specified factuality control, existing work remains post-hoc and cannot intervene during generation. To fill the gap between CP's flexibility and its post-hoc limitation, we propose an Inference-Time Conformal Reasoning (ITCR) framework that integrates CP directly into reasoning graph generation. ITCR learns a structure-level factuality uncertainty function that aggregates claim-level factuality signals over reasoning graphs without complex modeling assumptions. We then design the non-conformity score based on graph-level factuality uncertainty and calibrate the conformal threshold to decide when to stop generation. We theoretically show such generation is nested, yielding valid coverage guarantees for factuality control. Experiments over multiple datasets and coverage objectives demonstrate empirically valid coverage. In downstream reasoning tasks, inference-time calibrated graphs yield more accurate generation than post-hoc pruned graphs.
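
Calibrating a conformal threshold, as mentioned in this abstract, usually comes down to taking a finite-sample-corrected quantile of calibration scores. The sketch below shows the standard split-conformal quantile only; it is not ITCR's specific non-conformity score, and `should_stop` is a hypothetical helper illustrating how such a threshold could gate generation.

```python
import math
from typing import Sequence


def conformal_threshold(scores: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile: the k-th smallest calibration score,
    with k = ceil((n + 1) * (1 - alpha)). Returns +inf when k > n."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def should_stop(graph_uncertainty: float, threshold: float) -> bool:
    """Hypothetical stopping rule: halt once uncertainty exceeds the threshold."""
    return graph_uncertainty > threshold
```

The `(n + 1)` correction is what gives the marginal coverage guarantee for exchangeable data; with too few calibration points the threshold becomes infinite and generation is never stopped by this rule.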

2606.08832 2026-06-09 cs.AI 新提交

Instrumental convergence and power-seeking

工具性趋同与权力寻求

David Thorstad

发表机构 * GitHub

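AI summary Examines the argument that artificial agents may seek power, analyzes the instrumental convergence thesis, argues that no leading defense establishes it in a form strong enough to ground that argument, and discusses implications for longtermism, AI governance, and the methodology of AI risk research.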

AI中文摘要

近年来,人们越来越担心人工智能可能很快对人类构成生存风险。一个主要的担忧理由是,人工智能体可能寻求权力,旨在获取权力并在此过程中削弱人类。我展示了权力寻求论点如何依赖于一个被称为工具性趋同论题的强版本。我探讨了工具性趋同论题的主要辩护,并认为没有一个辩护能够以足够强的形式确立该论题,从而为权力寻求论点提供基础。我讨论了这对长期主义、人工智能治理以及研究人工智能体带来的风险的方法论的影响。

英文摘要

Recent years have seen increasing concern that artificial intelligence may soon pose an existential risk to humanity. One leading ground for concern is that artificial agents may be power-seeking, aiming to acquire power and in the process disempowering humanity. I show how the argument from power-seeking rests on a strong version of a claim known as the instrumental convergence thesis. I explore leading defenses of the instrumental convergence thesis and argue that none establishes the thesis in a strong enough form to ground the argument from power-seeking. I discuss implications for longtermism, the governance of artificial intelligence, and the methodology of studying risks posed by artificial agents.

2606.08919 2026-06-09 cs.AI cs.CR cs.LG 新提交

Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human

监督具有容量:将智能体守卫校准到主观且易疲劳的人类

Emre Turan

发表机构 * GitHub arXiv

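AI summary Addresses subjective, fatiguing human reviewers in LLM-agent action approval by modeling the guard as cost-sensitive selective classification with a load-aware escalation policy; in the model, escalating too much lowers realized safety, giving an inverted-U curve in the escalation rate.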

Comments 12 pages, 4 figures. Code and interactive demo: https://github.com/turangenesis/headroom

AI中文摘要

随着LLM智能体开始采取真实、不可逆的行动(如shell命令、文件编辑、部署),标准的安全模式是人在环中的审批门:风险动作暂停并等待人工确认。我们认为审批门是容易的部分;困难的部分在于判断——哪些动作需要停止——而该领域目前基于两个错误假设进行评估:存在一个“风险”的真实标签,以及人类评审者是完美且无限可用的预言机。在一个由125个对抗性加权的智能体动作组成的手工标注集上,我们展示了:(i) 评审者对何为风险仅中度一致(Fleiss' kappa = 0.52),因此不存在单一正确标签;(ii) 将守卫建模为非对称成本下的选择性分类使其操作极限可测量,且在困难输入上守卫无法安全地自动决策;(iii) 当评审者被建模为内生变量(随着升级负载增加而疲劳)时,实际安全性在升级率上呈现倒U形:更多的人类监督可能使系统更不安全,而安全最优的守卫升级率低于完全升级——负载感知策略也利用这一设置来抵御洪水攻击,该攻击通过使疲劳的评审者漏过恶意动作。以这种方式框架化的智能体监督不仅是一个分类问题,还是一个资源分配问题:人类注意力是有限的,而守卫的升级策略消耗它。我们声称这些机制均非新颖——疲劳感知的延迟决策(FALCON)、工作负载约束下的成本敏感延迟(DeCCaF)、轨迹级守卫以及评审者疲劳/洪水攻击均为我们引用的现有技术。我们的贡献是一个开源的智能体监督系统,它在LLM智能体动作门控设置中操作化和测量这些机制,将“我的守卫好吗?”从猜测转变为一条曲线。倒U形和洪水攻击是激励人类研究的建模结果。

英文摘要

As LLM agents begin to take real, irreversible actions (shell commands, file edits, deploys), the standard safety pattern is a human-in-the-loop approval gate: risky actions pause and wait for a person. We argue the gate is the easy part; the hard part is the judgment - which actions to stop - which the field evaluates against two false assumptions: that there is a ground-truth notion of "risky," and that the human reviewer is a perfect, infinitely-available oracle. On a hand-labeled set of 125 adversarially-weighted agent actions we show that (i) reviewers only moderately agree on what is risky (Fleiss' kappa = 0.52), so there is no single correct label; (ii) framing the guard as selective classification under asymmetric cost makes its operating limits measurable, and on hard inputs the guard cannot safely auto-decide; and (iii) when the reviewer is modeled as endogenous (fatiguing as escalation load grows), realized safety becomes an inverted-U in the escalation rate: more human oversight can make a system less safe, and the safety-optimal guard escalates below full escalation - a setting a load-aware policy also uses to resist a flooding attack that slips a malicious action past a fatigued reviewer. Agent oversight, framed this way, is not only a classification problem but a resource-allocation one: human attention is finite, and the guard's escalation policy spends it. We claim none of these mechanisms as novel - fatigue-aware learning-to-defer (FALCON), cost-sensitive deferral under workload constraints (DeCCaF), trajectory-level guarding, and reviewer-fatigue/flooding attacks are all prior art we cite. Our contribution is an open-source agent-oversight system that operationalizes and measures them in the LLM-agent action-gating setting, turning "is my guard good?" from a guess into a curve. The inverted-U and the flooding attack are modeling results that motivate a human study.
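
The agreement statistic quoted in this abstract, Fleiss' kappa, can be computed directly from a table of per-item category counts. The sketch below implements the textbook formula; the input layout (one row per action, one column per label, each row summing to the number of raters) is an assumption about how such labels might be stored, not a description of the authors' code.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the marginal category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the reported 0.52 sits in the range commonly described as moderate.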

2606.08998 2026-06-09 cs.AI cs.CY econ.GN q-fin.EC 新提交

The Token Not Taken: Sampling, State, and the Variability of AI Agent Outputs

未被选取的令牌:采样、状态与AI智能体输出的变异性

Muhammad Zia Hydari, Raja Iqbal

发表机构 * University of Pittsburgh(匹兹堡大学) Ejento.ai

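AI summary Analyzes the sources of output variability in agentic AI systems, separating the intrinsic randomness of token sampling from extrinsic factors such as changing environments and live data, and discusses when variability can be reproduced under matched conditions and why deterministic execution need not imply identical behavior in deployment.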

AI中文摘要

智能体AI系统在不同运行中可能表现出不同的行为:相同的请求可能产生不同的计划、不同的工具调用、不同的代码编辑或不同的最终答案。这种变异性源于多个常被混淆的层面。基础模型是一个大型预训练模型,通常可适应许多下游任务,将输入上下文映射到输出的预测。在当前许多智能体中,该模型嵌入在一个编排循环中,该循环进行规划、调用工具、观察结果并更新状态。此类系统中一个明确的内在变异性来源是令牌生成:模型计算可能的下一个令牌的分数,分数被转换为概率,解码器可能使用伪随机数生成器采样令牌。一个微小的采样令牌差异随后可能向上传播为不同的工具调用、代码路径、搜索查询或智能体状态。其他变异性来源是令牌采样的外在因素,包括变化的环境、实时数据、服务基础设施、批次效应和数值细节。通过分离这些层面,本文阐明了将智能体AI系统称为随机系统的含义、在匹配条件下这种变异性何时可复现,以及为什么确定性执行在部署环境中不一定意味着相同的行为。

英文摘要

Agentic AI systems can behave differently across runs: the same request may produce a different plan, a different tool call, a different code edit, or a different final answer. Such variability arises from several layers that are often conflated. A foundation model is a large pretrained model, usually adaptable to many downstream tasks, that maps an input context to predictions over outputs. In many current agents, that model is embedded in an orchestration loop that plans, calls tools, observes results, and updates state. One explicit intrinsic source of variability in such systems is token generation: the model computes scores over possible next tokens, the scores are converted into probabilities, and a decoder may sample tokens using a pseudo-random number generator. A small sampled token difference can then propagate upward into a different tool call, code path, search query, or agent state. Other sources of variability are extrinsic to token sampling, including changing environments, live data, serving infrastructure, batch effects, and numerical details. By separating these layers, the manuscript clarifies what it means to call agentic AI systems stochastic, when such variability can be reproduced under matched conditions, and why deterministic execution need not imply identical behavior in deployed settings.
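
The token-generation step described in this abstract (scores, then probabilities, then a pseudo-random draw) can be sketched in a few lines. This is a toy illustration with made-up scores and a hypothetical `sample_tokens` helper, not any particular model's decoder; it shows only that a fixed seed reproduces the draws when everything else is matched.

```python
import math
import random
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Convert raw next-token scores into probabilities."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(scores: Sequence[float], n: int, seed: int,
                  temperature: float = 1.0) -> List[int]:
    """Draw n token indices with a seeded pseudo-random generator."""
    rng = random.Random(seed)
    probs = softmax(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=n)


def greedy_token(scores: Sequence[float]) -> int:
    """Deterministic decoding: always pick the highest-scoring token."""
    return max(range(len(scores)), key=lambda i: scores[i])
```

Matching the seed reproduces the sampled sequence here because the scores are identical across calls; in a deployed system the extrinsic factors the abstract lists (batching, numerics, live data) can change the scores themselves, so a fixed seed alone is not sufficient.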

2606.09038 2026-06-09 cs.AI 新提交

Personalization Meets Safety: Mechanisms, Risks, and Mitigations in Personalized LLMs

个性化与安全的交汇:个性化大语言模型中的机制、风险与缓解措施

Yanyan Luo, Xue Han, Ruiqiao Bai, Xin Huang, Yitong Wang, Qian Hu, Qing Wang, Chunxu Zhao, Jie Liu, Cong Geng, Lehao Xing, Pengwei Hu, Junlan Feng

发表机构 * China Mobile Jiutian Artificial Intelligence Technology (Beijing) Co., Ltd.(中国移动九天人工智能技术(北京)有限公司) Chinese Academy of Sciences(中国科学院)

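AI summary Presents what the authors describe as the first safety-aware survey of personalized LLMs, organized along three dimensions (user representation, personalization paradigm, and evaluation), with a unified taxonomy of safety risks and an analysis of paradigm-specific vulnerabilities and mitigation strategies.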

AI中文摘要

大语言模型通过适应用户偏好、上下文和长期历史记录,实现了日益个性化的交互。然而,实现个性化的机制也以现有文献未系统处理的方式扩展了安全领域。现有综述通常只关注个性化或安全,而忽略了它们的交叉。我们提出了首个全面的、安全导向的个性化大语言模型综述。我们沿三个维度组织个性化——用户表征、个性化范式和评估——并引入统一的安全风险分类。在表征层面,我们分析了不同用户表征带来的风险。在主流个性化范式中,我们描述了提示、检索增强、参数微调、强化学习、混合专家、剪枝、智能体框架和多模态个性化中固有的脆弱性,并综合了模型生命周期中的缓解策略。除了这些细粒度风险,我们还描述了由个性化适应产生的范式无关的安全风险。我们进一步总结了个性化数据集和评估方法。通过OpenClaw的案例研究,我们分析了个性化智能体生态系统中的部署趋势。我们的分析揭示了现有研究中的三个结构性不足:安全被评估为与用户无关而非关系性的,个性化技术被孤立分析而非组合分析,评估框架无法捕捉新兴的长期风险。通过联合检查个性化表征、个性化范式、安全风险、防御和评估方法,我们为开发安全的个性化大语言模型提供了一个统一框架,并强调了未来研究的关键方向。

英文摘要

Large Language Models (LLMs) have enabled increasingly personalized interactions by adapting to users' preferences, contexts, and long-term histories. However, the mechanisms that enable personalization also expand the safety landscape in ways not systematically addressed by existing literature. Existing reviews typically focus either on personalization or safety, leaving their intersection largely unexplored. We present the first comprehensive, safety-aware review of personalized LLMs. We organize personalization along three dimensions (user representation, personalization paradigm, and evaluation) and introduce a unified taxonomy of safety risks. At the representation level, we analyze risks arising from diverse user representations. Across mainstream personalization paradigms, we delineate vulnerabilities inherent to prompting, retrieval augmentation, parameter fine-tuning, reinforcement learning, Mixture-of-Experts (MoE), pruning, agent frameworks, and multimodal personalization, and synthesize mitigation strategies across the model lifecycle. Beyond these fine-grained risks, we characterize paradigm-agnostic safety risks arising from personalized adaptation. We further summarize personalized datasets and evaluation methodologies. Through a case study of OpenClaw, we analyze deployment trends in personalized agent ecosystems. Our analysis reveals three structural inadequacies in existing research: safety is evaluated as user-invariant rather than relational, personalization techniques are analyzed in isolation rather than in composition, and evaluation frameworks cannot capture emergent long-term risks. By jointly examining personalized representations, personalization paradigms, safety risks, defenses, and evaluation methods, we provide a unified framework for developing safe personalized LLMs and highlight key directions for future research.

2606.09132 2026-06-09 cs.AI 新提交

Vision Language Model Helps Private Information De-Identification in Vision Data

视觉语言模型助力视觉数据中的隐私信息去标识化

Tiejin Chen, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, Hua Wei

发表机构 * Arizona State University(亚利桑那州立大学) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) North Carolina State University(北卡罗来纳州立大学)

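AI summary Proposes VisShield, a framework that uses the dedicated instruction-tuning dataset OPTIC and a tailored training strategy so that vision language models localize and mask sensitive text, protecting private information in visual data such as medical images.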

AI中文摘要

视觉语言模型(VLM)因其卓越的能力而广受欢迎。尽管存在多种增强文本应用隐私的方法,但视觉输入相关的隐私风险(如医学图像中的受保护健康信息)仍被广泛忽视。为解决此问题,需执行两项关键任务:准确定位敏感文本并处理以确保隐私保护。为此,我们引入VisShield(视觉隐私盾),一个端到端框架,旨在增强VLM的隐私意识。我们的框架包含两个关键组件:专用指令微调数据集OPTIC(光学隐私文本指令集)和定制训练方法。该数据集提供多样化的隐私导向提示,引导VLM执行目标光学字符识别(OCR)以精确定位敏感文本,而训练策略确保VLM有效适应隐私保护任务。具体而言,我们的方法确保VLM识别隐私敏感文本并输出检测实体的精确边界框,从而有效掩码敏感信息。大量实验表明,我们的框架在处理隐私信息方面显著优于现有方法,为视觉语言模型中的隐私保护应用铺平了道路。我们的数据集和代码可在此处获取。

英文摘要

Vision Language Models (VLMs) have gained significant popularity due to their remarkable capabilities. While various methods exist to enhance privacy in text-based applications, privacy risks associated with visual inputs, such as Protected Health Information (PHI) in medical images, remain largely overlooked. Tackling this problem requires two key tasks: accurately localizing sensitive text and processing it to ensure privacy protection. To this end, we introduce VisShield (Vision Privacy Shield), an end-to-end framework designed to enhance the privacy awareness of VLMs. Our framework consists of two key components: a specialized instruction-tuning dataset OPTIC (Optical Privacy Text Instruction Collection) and a tailored training methodology. The dataset provides diverse privacy-oriented prompts that guide VLMs to perform targeted Optical Character Recognition (OCR) for precise localization of sensitive text, while the training strategy ensures effective adaptation of VLMs to privacy-preserving tasks. Specifically, our approach ensures that VLMs recognize privacy-sensitive text and output precise bounding boxes for detected entities, allowing for effective masking of sensitive information. Extensive experiments demonstrate that our framework significantly outperforms existing approaches in handling private information, paving the way for privacy-preserving applications in vision-language models. Our dataset and code are publicly available.
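
Once a model has output bounding boxes for sensitive text, as this abstract describes, the masking step itself is simple. The sketch below assumes a grayscale image stored as a list of pixel rows and boxes given as half-open `(x0, y0, x1, y1)` pixel coordinates; both conventions and the `mask_boxes` name are assumptions for illustration, not VisShield's actual output format.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end coordinates exclusive


def mask_boxes(image: Sequence[Sequence[int]], boxes: Sequence[Box],
               fill: int = 0) -> List[List[int]]:
    """Return a copy of the image with every box region set to `fill`."""
    height = len(image)
    width = len(image[0]) if height else 0
    masked = [list(row) for row in image]  # leave the input untouched
    for x0, y0, x1, y1 in boxes:
        # Clamp each box to the image bounds before filling.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                masked[y][x] = fill
    return masked
```

Filling with a constant removes the pixels entirely; blurring is a softer alternative, but it can leave text partly recoverable, which matters for content such as PHI.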

2606.09165 2026-06-09 cs.AI 新提交

Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

从可靠到表达:面向遵循评分标准的安全评判员的课程学习

Yongtaek Lim, Hyeji Choi, Minwoo Kim

发表机构 * University of California, Berkeley(加州大学伯克利分校)

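AI summary Proposes a training strategy that combines dynamic rubrics with a reliable-to-expressive curriculum so that safety judges evaluate consistently across rubrics; the 12B model reaches 94.12-94.88% accuracy with a cross-rubric range of only 0.76.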

Comments Accepted to ICML 2026 Workshop on AIWILDS

AI中文摘要

安全评判员越来越多地被部署用于根据不断变化的标准评估模型输出,然而最近的元评估工作表明,它们在提示和评分标准变化下仍然脆弱,仅风格扰动就可能导致假阴性率波动高达0.24。我们认为安全判断本质上是一个遵循评分标准的问题:一个稳健的评判员必须能够一致地应用给定的评估标准,而不是记忆某个特定模板。我们提出了一种训练策略,结合了(i) 从提示-响应-标签三元组生成的实例条件动态评分标准,使评判员暴露于评估标准的变化性,以及(ii) 一个从可靠到表达的课程学习,从干净的固定评分标准监督开始,逐步引入噪声更大的动态评分标准数据。我们在一个单一人工标注集上,使用三种对比的评分标准提示(HarmBench风格、ShieldGemma风格和领域特定评分标准)进行评估。我们的12B课程评判员在三种评分标准下达到了94.12-94.88%的准确率,跨评分标准范围仅为0.76,在峰值准确率和稳定性上均优于通用大语言模型、专用安全分类器和高达30B的推理导向评判员。消融实验表明,简单地将动态评分标准混合到SFT中会增加跨评分标准方差(1.44 -> 3.60);只有课程学习计划才能恢复并改进固定评分标准基线(方差0.76)。

英文摘要

Safety judges are increasingly deployed to evaluate model outputs against evolving criteria, yet recent meta-evaluation work shows they remain brittle under prompt and rubric variation, with false-negative-rate swings of up to 0.24 reported for stylistic perturbations alone. We argue that safety judgment is fundamentally a rubric-following problem: a robust judge must apply the given evaluation criteria consistently across rubric formulations rather than memorize one specific template. We propose a training strategy that combines (i) instance-conditioned dynamic rubrics generated from prompt-response-label triples to expose the judge to the variability of evaluation criteria, and (ii) a reliable-to-expressive curriculum that begins with clean fixed-rubric supervision and progressively introduces noisier dynamic-rubric data. We evaluate on a single human-labeled set under three contrasting rubric prompts (HarmBench-style, ShieldGemma-style, and a domain-specific rubric). Our 12B curriculum judge achieves 94.12-94.88% accuracy across the three rubrics with a cross-rubric range of only 0.76, outperforming general-purpose LLMs, dedicated safety classifiers, and reasoning-oriented judges up to 30B in both peak accuracy and stability. An ablation shows that naively mixing dynamic rubrics into SFT increases cross-rubric variance (1.44 -> 3.60); only the curriculum schedule recovers and improves on the fixed-rubric baseline (variance 0.76).
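
The stability figure in this abstract is a range, that is, the gap between the best and worst accuracy across rubrics. The short sketch below reproduces the arithmetic. Only the two endpoints (94.12 and 94.88) come from the abstract; the middle value and the assignment of numbers to rubric names are hypothetical placeholders.

```python
from typing import Iterable


def cross_rubric_range(accuracies: Iterable[float]) -> float:
    """Gap between the highest and lowest accuracy across rubric prompts."""
    values = list(accuracies)
    return max(values) - min(values)


# Endpoints are from the abstract; the 94.50 entry and the mapping of
# values to rubric names are placeholders for illustration only.
accuracies = {
    "harmbench_style": 94.12,
    "shieldgemma_style": 94.50,
    "domain_specific": 94.88,
}
spread = round(cross_rubric_range(accuracies.values()), 2)
```

Because the range depends only on the two extremes, the placeholder middle value does not affect the result.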

2606.09563 2026-06-09 cs.AI cs.LG 新提交

PRISM: Recovering Instruction Sets from Language Model Activations

PRISM:从语言模型激活中恢复指令集

Gilad Gressel, Rahul Pankajakshan, Julia Diament, Efim Hudis, Krishnashree Achuthan, Yisroel Mirsky

发表机构 * Center for Cybersecurity Systems & Networks, Amrita Vishwa Vidyapeetham(阿姆里塔·维什瓦·维迪亚佩瑟姆网络安全系统与网络中心) Microsoft(微软) Ben-Gurion University of the Negev(内盖夫本-古里安大学)

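AI summary Proposes PRISM, an activation-conditioned interpreter that recovers the set of active instructions from the hidden states of a frozen target model, trained with judge-guided GRPO, and outperforms activation-to-language baselines across benign, constrained, prompt-injection, and hidden-objective settings.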

Comments Under Review

AI中文摘要

随着LLM被部署为智能体,可靠的监控不仅需要知道它们输出了什么,还需要知道哪些指令在引导它们的行为。当模型推断出非预期的子目标、遵循上下文线索或受到提示注入和隐藏目标的影响时,这变得困难。虽然激活到语言的方法表明隐藏状态可以揭示自然语言信息,但现有方法并非设计用于恢复智能体设置中同时活跃的完整指令、约束、禁止和子目标集。我们将此问题形式化为指令集检索,并引入PRISM,一个激活条件的解释器,将冻结目标模型的隐藏状态解码为活跃指令的忠实项目符号列表。与先前的激活到语言方法不同,PRISM直接训练以恢复指令集,使用法官引导的GRPO来奖励覆盖的指令并惩罚不支持的指令。在良性、受限、提示注入和隐藏目标设置中,PRISM优于激活到语言基线,特别是在安全相关目标上。

英文摘要

As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is difficult when models infer unintended subgoals, follow contextual cues, or are influenced by prompt injections and hidden objectives. While activation-to-language methods suggest that hidden states can reveal natural-language information, existing approaches are not designed to recover the full set of simultaneous instructions, constraints, prohibitions, and subgoals active in agentic settings. We formalize this problem as instruction set retrieval and introduce PRISM, an activation-conditioned interpreter that decodes hidden states from a frozen target model into a faithful bullet list of active instructions. Unlike prior activation-to-language methods, PRISM is trained to recover instruction sets directly, using judge-guided GRPO to reward covered instructions and penalize unsupported ones. Across benign, constrained, prompt-injection, and hidden-objective settings, PRISM outperforms activation-to-language baselines, especially on security-relevant objectives.
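
The training signal described in this abstract rewards covered instructions and penalizes unsupported ones. A toy version of such a reward is sketched below. It swaps the paper's judge for exact matching on normalized strings, and the `penalty` weight and the specific formula are assumptions, so it illustrates the shape of the objective rather than PRISM's actual reward.

```python
from typing import Iterable


def normalize(instruction: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(instruction.lower().split())


def instruction_set_reward(predicted: Iterable[str], gold: Iterable[str],
                           penalty: float = 1.0) -> float:
    """Fraction of gold instructions recovered, minus a penalty times the
    fraction of predicted instructions that match nothing in gold."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    if not ref:
        raise ValueError("gold instruction set must not be empty")
    covered = len(pred & ref) / len(ref)
    unsupported = len(pred - ref) / len(pred) if pred else 0.0
    return covered - penalty * unsupported
```

Penalizing unsupported items matters because a decoder rewarded for coverage alone could list every plausible instruction and still score well.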

2606.09711 2026-06-09 cs.AI cs.LG 新提交

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

代理奖励内化与机制性利用:奖励黑客及其泛化的学习前兆

Mohammad Beigi, Ming Jin, Lifu Huang

发表机构 * UC Davis(加州大学戴维斯分校) Virginia Tech(弗吉尼亚理工大学)

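AI summary Introduces PRIME and measures it with chain-of-thought monitoring, direct probes, and activation-level concept vectors; PRIME emerges in stages before sustained reward hacking, its direct-probe score forecasts later hack onset and severity, and across checkpoints in-domain PRIME tracks out-of-domain misalignment.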

AI中文摘要

奖励黑客通常在其变得可见后才被研究,即当模型获得高代理奖励但未能完成预期任务时。我们转而研究代理强化学习在失败出现之前教会了什么。我们引入了代理奖励内化与机制性利用(PRIME),这是一种评估任务正确性、预测代理接受度以及推理可被利用的代理-黄金差距的学习能力。在具有可被利用的pytest奖励的编码强化学习环境中,我们通过思维链监控、直接探针和激活级概念向量来测量PRIME。我们发现,PRIME在持续奖励黑客之前以阶段性顺序出现,并且其当前的直接探针得分可以预测后续黑客的爆发时间和严重程度,即使可见的黑客率仍然很低。当评估者发生变化时,PRIME也会适应,重新瞄准任何仍然获得奖励的代理-黄金差距,并在黄金奖励抑制公开黑客时持续存在;消除其激活方向会减少黑客行为。跨检查点,域内PRIME跟踪域外失调。这些结果共同表明,可被利用的代理强化学习放大了可见黑客上游的代理内化能力,使PRIME成为更广泛对齐风险的候选早期预警信号。

英文摘要

Reward hacking is usually studied after it becomes visible, once a model earns high proxy reward while failing the intended task. We instead study what proxy RL teaches before that failure appears. We introduce Proxy Reward Internalization and Mechanistic Exploitation (PRIME), a learned capability to assess task correctness, predict proxy acceptance, and reason about exploitable proxy--gold gaps. In coding RL environments with exploitable pytest rewards, we measure PRIME through chain-of-thought monitoring, direct probes, and activation-level concept vectors. We find that PRIME emerges in a staged sequence before sustained reward hacking, and that its current direct-probe score forecasts later hack onset and severity even when the visible hack rate is still low. PRIME also adapts when the evaluator changes, retargeting to whichever proxy--gold gap remains rewarded and persisting when gold reward suppresses overt hacking, and ablating its activation directions reduces hacking. Across checkpoints, in-domain PRIME tracks out-of-domain misalignment. Together these results suggest that exploitable proxy RL amplifies a proxy-internalization capability upstream of visible hacking, making PRIME a candidate early-warning signal for broader alignment risk.

2606.09724 2026-06-09 cs.AI 新提交

Beyond Probabilistic Similarity: Structural, Temporal, and Causal Limitations of Retrieval-Augmented Generation in the Legal Domain

超越概率相似性:检索增强生成在法律领域的结构性、时间性和因果性局限

Hudson de Martim

发表机构 * Federal Senate of Brazil(巴西联邦参议院)

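AI summary Argues that RAG failures in legal AI stem from an architectural mismatch between probabilistic retrieval and the hierarchical, temporal, and institutional structure of legal knowledge; identifies three retrieval pathologies (mereological blindness, diachronic blindness, and causal opacity) and derives four architectural commitments for deterministic-by-design legal retrieval.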

AI中文摘要

检索增强生成(RAG)已成为应对法律AI不可靠性的标准架构响应,然而跨司法管辖区持续出现高调失败案例,包括提交给法院的捏造引文以及作为现行法律呈现的过时法律内容。我们认为这些失败并非可通过扩展语言模型消除的残余虚构,而是概率检索与法律知识的层次性、时间性和制度性结构之间架构不匹配的症状。我们分三步展开论证。首先,我们将法律知识的本体论承诺阐述为可从经典法律理论推导出的三元属性:层次和分体结构、操作封闭下的历时动态性,以及基于论证义务的制度来源的因果可追溯性。其次,我们识别出检索的三种相应病理(分体盲、历时盲和因果不透明),每种均给出操作性定义、失败机制、典型示例和用于诊断的检测标准。第三,我们通过此视角回顾现有技术,表明现有方法不均匀地满足这些要求,且尚未组合成将它们视为共同构成的范式。基于此分析,我们推导出四个架构承诺,这些承诺表征了法律检索的确定性设计方向:本体论优先性、事件具体化、双时态正确性和确定性交互协议。该框架关注的是法律问题(哪些规范适用及其状态),而非作用于已识别规范的下游任务,并主要处理立法和宪法检索,将解释时间作为显式扩展。

英文摘要

Retrieval-Augmented Generation (RAG) has become a standard architectural response to unreliability in legal AI, yet high-profile failures, including fabricated citations submitted to courts and anachronistic legal content presented as current, continue to appear across jurisdictions. We argue that these failures are not residual confabulations to be eliminated by scaling language models, but symptoms of an architectural mismatch between probabilistic retrieval and the hierarchical, temporal, and institutional structure of legal knowledge. We develop the argument in three moves. First, we articulate the ontological commitment of legal knowledge as a triad of properties derivable from classical legal theory: hierarchical and mereological structure, diachronic dynamism under operational closure, and causal traceability of institutional provenance grounded in the duty of justification. Second, we identify three corresponding pathologies of retrieval (mereological blindness, diachronic blindness, and causal opacity), each developed with an operational definition, a failure mechanism, a canonical example, and detection criteria for diagnostic use. Third, we review the state of the art through this lens, showing that existing approaches address these requirements unevenly and do not yet compose into a paradigm that treats them as co-constitutive. From this analysis we derive four architectural commitments that characterize the deterministic-by-design direction for legal retrieval: ontological primacy, event reification, bitemporal correctness, and deterministic interaction protocols. The framework concerns quaestio juris (which norms apply and in what state) rather than the downstream tasks that act on identified norms, and addresses legislative and constitutional retrieval primarily, with interpretive time as an explicit extension.
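
Of the four commitments listed in this abstract, bitemporal correctness is the easiest to make concrete: each version of a norm carries both a validity interval (when the wording is in force) and a record interval (when the system held that belief). The sketch below is a generic bitemporal lookup with hypothetical field names and example data, not the paper's data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class NormVersion:
    text: str
    valid_from: date     # first day this wording is in force
    valid_to: date       # exclusive; date.max when open-ended
    recorded_from: date  # first day the system held this record
    recorded_to: date    # exclusive; date.max when still current


def lookup(versions: Sequence[NormVersion], valid_on: date,
           known_on: date) -> Optional[NormVersion]:
    """Return the wording in force on `valid_on`, as recorded on `known_on`."""
    matches = [
        v for v in versions
        if v.valid_from <= valid_on < v.valid_to
        and v.recorded_from <= known_on < v.recorded_to
    ]
    if len(matches) > 1:
        raise ValueError("overlapping versions: the history is inconsistent")
    return matches[0] if matches else None


# Hypothetical history: an amendment recorded on 2020-06-01 takes effect
# on 2021-01-01 and closes the validity of the original wording.
HISTORY = [
    NormVersion("original", date(2000, 1, 1), date.max,
                date(2000, 1, 1), date(2020, 6, 1)),
    NormVersion("original", date(2000, 1, 1), date(2021, 1, 1),
                date(2020, 6, 1), date.max),
    NormVersion("amended", date(2021, 1, 1), date.max,
                date(2020, 6, 1), date.max),
]
```

The lookup is an exact interval test rather than a similarity score, which is the sense in which such retrieval is deterministic by design: a query about 2019 cannot return the amended wording just because it is textually closer to the question.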

2606.07531 2026-06-09 cs.CL cs.AI 交叉投稿

mllm-shap: A Shapley Value Explainability Platform for Text-Audio Multimodal Large Language Models

mllm-shap:面向文本-音频多模态大语言模型的Shapley值可解释性平台

Jakub Muszyński, Paweł Pozorski, Maria Ganzha

发表机构 * Warsaw University of Technology(华沙理工大学)

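AI summary Proposes mllm-shap, a framework that extends Shapley-value explainability to text-audio multimodal LLMs through modality-aware coalition masking, multi-turn conversation tracking, and phonetic alignment-based token grouping, the last of which shrinks the coalition space by 10x to 50x.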

Comments Submitted to ACL2026

AI中文摘要

我们介绍了mllm-shap,一个开源Python框架,旨在将Shapley值(SV)可解释性从纯文本大语言模型扩展到处理联合文本和音频输入的多模态大语言模型(MLLM)。虽然基于文本的归因已得到充分研究,但mllm-shap解决了多模态领域特有的三个关键挑战:(1)模态感知的联盟掩码,管理离散文本令牌和密集音频编码器帧的交错处理。(2)多轮对话追踪,利用每令牌元数据维护角色和模态上下文。(3)基于音素对齐的令牌分组,一种新颖的技术,将联盟空间减少10到50倍,使得长音频的SV估计在计算上可行。该平台实现了五种SV估计策略,包括具有Neyman最优分配的互补贡献(CC)估计器,其收敛性优于标准蒙特卡洛基线。mllm-shap作为pip可安装包提供,并具有交互式基于Web的GUI,用于细粒度归因可视化。据我们所知,这是第一个公开可用的框架,为文本-音频MLLM中的基于SV的可解释性提供完整、可复现的流水线。

英文摘要

We introduce mllm-shap, an open-source Python framework designed to extend Shapley Value (SV) explainability from text-only Large Language Models to Multimodal LLMs (MLLMs) processing joint text and audio inputs. While text-based attribution is well-studied, mllm-shap addresses three critical challenges unique to the multimodal regime: (1) Modality-aware coalition masking, which manages the interleaved processing of discrete text tokens and dense audio encoder frames. (2) Multi-turn conversation tracking, utilizing per-token metadata to maintain role and modality context. (3) Phonetic alignment-based token grouping, a novel technique that reduces the coalition space by 10x to 50x, rendering SV estimation computationally feasible for long-form audio. The platform implements five SV estimation strategies, including a Complementary Contributions (CC) estimator with Neyman-optimal allocation that demonstrates superior convergence over standard Monte Carlo baselines. mllm-shap is provided as a pip-installable package featuring an interactive web-based GUI for granular attribution visualization. To our knowledge, this is the first publicly available framework providing a complete, reproducible pipeline for SV-based explainability in text-audio MLLMs.
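
To see why shrinking the coalition space matters, it helps to look at exact Shapley values, which average each feature's marginal contribution over all orderings. The sketch below is a generic reference implementation on a toy value function; it does not use the mllm-shap API, and the player names and `toy_value` are invented for illustration.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def exact_shapley(players: Sequence[str],
                  value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Average each player's marginal contribution over all n! orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


def toy_value(coalition: FrozenSet[str]) -> float:
    """Each feature adds 1; a text token and an audio segment that
    appear together add a shared bonus of 2."""
    bonus = 2.0 if {"text_1", "audio_1"} <= coalition else 0.0
    return len(coalition) + bonus
```

The cost grows factorially in the number of players, so grouping many audio frames into a few aligned segments cuts the work far more than linearly; this is the motivation for both token grouping and sampling-based estimators.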

2606.07533 2026-06-09 cs.CL cs.AI cs.SD 交叉投稿

Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis

桥接传统可解释性方法与多模态多语言模型:基于XAI的分析

Paweł Pozorski, Jakub Muszyński, Maria Ganzha

发表机构 * arXiv

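AI summary Formalizes a multimodal extension of the Shapley value framework, combined with the Spectrogram-Guided Phonetic Alignment (SGPA) preprocessing method, to attribute model behavior to text and audio features, and releases an open-source Python package with a companion GUI.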

Comments Bachelor's thesis

AI中文摘要

多模态大语言模型(MLLMs)有效整合文本和音频以理解复杂交互对话中的上下文。然而,异质模态影响模型行为的内部机制仍然不透明。虽然Shapley值(SV)为基于文本的NLP提供了鲁棒的、模型无关的局部可解释性框架,但其扩展到多模态数据受到跨通道依赖、复杂对话结构以及密集音频表示的高计算复杂性的阻碍。在这项工作中,我们形式化了Shapley值框架的多模态扩展,将离散文本标记和对齐的音频片段视为协作特征。为确保计算可行性,我们部署了一套高效的估计策略:低维输入的精确SV计算和基于采样的近似(包括蒙特卡洛排列和具有Neyman最优分配的分层抽样),以在有限计算预算下最小化方差。为解决模态间的粒度不匹配问题,我们提出了频谱图引导的音素对齐(SGPA),一种新颖的预处理方法,将高频音频流映射到可解释的、单词对齐的片段。我们的贡献有两方面:首先,我们提供了一个开源的、模型无关的Python包和配套的GUI,用于多模态归因的计算和交互式可视化。其次,我们使用VoiceBench和Infinity Instruct数据集的精选子集,在多种多语言场景下评估我们的框架。实验结果表明,输入模态是归因波动的主要驱动因素,并证明标准句法重要性代理在多模态跨语言上下文中通常无法预测模型注意力。

英文摘要

Multimodal Large Language Models (MLLMs) effectively integrate text and audio to interpret context in complex interactive dialogues. However, the internal mechanisms by which heterogeneous modalities influence model behavior remain opaque. While Shapley Values (SV) provide a robust, model-agnostic framework for local explainability in text-based NLP, their extension to multimodal data is hindered by cross-channel dependencies, intricate dialogue structures, and the prohibitive computational complexity of dense audio representations. In this work, we formalize a multimodal extension of the Shapley Value framework, treating discrete text tokens and aligned audio segments as cooperative features. To ensure computational feasibility, we deploy a suite of efficient estimation strategies: exact SV computation for low-dimensional inputs and sampling-based approximations - including Monte Carlo permutations and stratified sampling with Neyman-optimal allocation - to minimize variance under constrained computational budgets. To resolve the granularity mismatch between modalities, we propose Spectrogram-Guided Phonetic Alignment (SGPA), a novel preprocessing method that maps high-frequency audio streams to interpretable, word-aligned segments. Our contribution is twofold: first, we provide an open-source, model-agnostic Python package and a companion GUI for the computation and interactive visualization of multimodal attributions. Second, we evaluate our framework using curated subsets of the VoiceBench and Infinity Instruct datasets across diverse multilingual scenarios. Our experimental results reveal that input modality is a primary driver of attribution volatility and demonstrate that standard syntactic importance proxies often fail to predict model attention in multimodal, cross-lingual contexts.
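
Neyman-optimal allocation, mentioned in this abstract, splits a fixed sampling budget across strata in proportion to stratum size times stratum standard deviation. The sketch below shows only that allocation rule; how the thesis defines its strata is not stated here (coalition size is a common choice and is assumed in the comments), and the function returns fractional counts that a caller would still need to round.

```python
from typing import List, Sequence


def neyman_allocation(total_samples: int, stratum_sizes: Sequence[float],
                      stratum_stds: Sequence[float]) -> List[float]:
    """Allocate samples as n_h = n * N_h * sigma_h / sum_k(N_k * sigma_k)."""
    if len(stratum_sizes) != len(stratum_stds):
        raise ValueError("sizes and standard deviations must align")
    weights = [size * std for size, std in zip(stratum_sizes, stratum_stds)]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all strata have zero weight")
    return [total_samples * w / total_weight for w in weights]


# Assumed setup: two strata (say, two coalition sizes) of equal size,
# where the second has three times the spread of marginal contributions.
allocation = neyman_allocation(40, [100, 100], [1.0, 3.0])
```

Strata with noisier marginal contributions receive more of the budget, which is what minimizes the variance of the stratified estimate for a fixed number of model evaluations.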

2606.07581 2026-06-09 cs.LG cs.AI cs.ET 交叉投稿

Training-Inference Kernel Contracts: Bounding Divergence in Post-Training and Deployment

训练-推理核契约:约束后训练与部署中的偏差

Bruce Changlong Xu, Lan Wu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出核契约框架,通过数值、统计、运行时和可观测性条款约束训练核与推理核之间的分布偏差,并推导偏差界以保障策略梯度无偏性。

详情
AI中文摘要

现代后训练流程通常为其策略π_θ编写一个符号,但通过两个不同的程序进行评估:一个针对自动微分优化的训练核和一个针对低精度、融合、动态批处理服务优化的推理核。在有限精度下,这些核在相同权重下可能产生不同的分布,且差距集中在基准测试未充分代表的切片上。本文提出核契约:一个契约优先的框架,用于指定K_train和K_inf之间可接受的偏差。契约C = (N, S, R, O, Pi) 结合了数值、统计、运行时和可观测性条款,以及从违规到路由操作的升级策略。我们推导了从logit漂移到总变差距离再到有界奖励漂移的链式界限,并将其专门用于强化学习后训练,其中在显式支持和范数假设下,每个token的重要性比率漂移给出了策略梯度偏差的界限。我们还描述了一个四阶段提升管道、在线路由循环以及用于契约工件的极简YAML DSL。本文是一个框架和词汇论文;我们不报告生产规模的实证验证。

英文摘要

A modern post-training pipeline often writes one symbol for its policy, pi_theta, while evaluating it through two different programs: a training kernel optimized for autograd and an inference kernel optimized for low-precision, fused, dynamically batched serving. In finite precision, these kernels can induce different distributions at identical weights, with the gap concentrated on slices that aggregate benchmarks under-represent. This paper proposes kernel contracts: a contract-first framework for specifying acceptable divergence between K_train and K_inf. A contract C = (N, S, R, O, Pi) combines numerical, statistical, runtime, and observability clauses with an escalation policy from violations to routing actions. We derive a chain of bounds from logit drift to total-variation distance to bounded reward drift, and specialize it to RL post-training, where per-token importance-ratio drift yields a bound on policy-gradient bias under explicit support and norm assumptions. We also describe a four-stage promotion pipeline, online routing loop, and minimal YAML DSL for contract artifacts. This is a framework and vocabulary paper; we do not report production-scale empirical validation.
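
The first two links of the chain described in this abstract, from logit drift to total-variation distance to reward drift, can be checked numerically. The bounds used below are elementary ones derived for this sketch, not necessarily those in the paper: if two logit vectors differ by at most eps in every coordinate, each softmax probability changes by a factor within [exp(-2 eps), exp(2 eps)], so TV <= (exp(2 eps) - 1) / 2; and for rewards bounded by r_max, the expected-reward gap is at most 2 * r_max * TV.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """TV distance as half the L1 distance between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def tv_bound_from_logit_drift(eps: float) -> float:
    """Elementary bound: TV <= (exp(2 * eps) - 1) / 2, capped at 1."""
    return min(1.0, 0.5 * math.expm1(2.0 * eps))


def reward_drift_bound(tv: float, r_max: float) -> float:
    """For |r| <= r_max, |E_p[r] - E_q[r]| <= 2 * r_max * TV."""
    return 2.0 * r_max * tv


# Hypothetical logits from a training kernel and an inference kernel.
z_train = [2.0, 1.0, 0.1]
z_infer = [2.01, 0.99, 0.1]
eps = max(abs(a - b) for a, b in zip(z_train, z_infer))
tv = total_variation(softmax(z_train), softmax(z_infer))
```

A contract clause could then be phrased as a ceiling on `eps` or on the measured `tv` for a given slice, with the bound translating that ceiling into a worst-case reward gap.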

2606.07593 2026-06-09 cs.CV cs.AI cross-list

A Mechanistic Analysis of Adversarial Fine-tuning of Vision Transformers

Hannah Gao, Isha Agarwal, Dylan Hadfield-Menell, Rachel Ma

Affiliations * Massachusetts Institute of Technology

AI summary Uses mechanistic analysis to study how adversarial fine-tuning affects a Vision Transformer's performance on perturbed and regular images, finding that fine-tuning improves robustness only to the corruption types seen in training and does not fundamentally change the sparse representations the model learns.

Abstract

The widespread use of image classification models in high-risk, real-world situations necessitates making these models robust to slight disturbances or perturbations, such as blurring or sharpening, in the input images. While vision transformers (ViTs) play an integral role in many modern-day multi-modal models like Vision-Language-Models (VLMs) and Vision-Language-Action (VLA) models, they have received a lack of attention in the setting of robustness. In this work, we analyze the effects of adversarial fine-tuning, a popular method for improving model robustness to image perturbations, on a ViT's performance on perturbed and regular images through a mechanistic lens. We adversarially train a ViT on low-frequency and high-frequency image corruptions, and attempt to explain changes in downstream model performance through an examination of the model's attention mechanisms, internal representations, and knowledge evolution. Overall, our results suggest that, while fine-tuning on inputs with common corruptions improves model performance and certainty on new instances of corrupted data, these improvements do not transfer to other classes of corruptions not seen in the training. Additionally, despite observing changes in visual attention and knowledge evolution across layers, we found that adversarial training did not lead to fundamental changes in the sparse representations learned by ViTs.

2606.07612 2026-06-09 cs.CY cs.AI cs.LG cross-list

Position: Anthropomorphic Misalignment Research Needs Stronger Evidence

Vansh Gupta, Peter Nutter, Samuel Stante, Andreas Krause, Florian Tramèr, Lukas Fluri, Xin Chen, Anna Hedström

Affiliations * University of Cambridge

AI summary Argues that Anthropomorphic Misalignment Research (AMR) rests on weak evidence because of conceptual ambiguity, non-robust datasets, and inadequate experimental design, and proposes a framework of evidence levels and a diagnostic checklist to improve methodological rigor.

Abstract

We argue that many Anthropomorphic Misalignment Research (AMR) studies need stronger evidence to ensure that they can provide a robust foundation for critical safety decisions, such as model deployment and regulation. By evaluating failure modes across different misalignment concepts, such as deception, emergent misalignment, and sycophancy, we show how conceptual ambiguity, non-robust datasets, experimental design, and insufficient causal interventions can lead to overinterpretation of model behaviors. This position paper aims to offer guidance on evidentiary considerations that can help improve methodological rigor in AMR. To achieve this, we provide a clear call to action through a proposed framework of evidence levels and a diagnostic checklist. These shared standards will enable more productive scientific discourse and ensure that claims about AI risks rest on solid empirical foundations.

2606.07620 2026-06-09 cs.CV cs.AI cs.DC cs.LG cross-list

SENTRY: Statistical Reliability Analysis of Vision Transformers Under Soft Errors

Pramit Kumar Bhaduri, Mahdi Taheri, Samira Nazari, Maksim Jenihhin, Christian Herglotz, Michael Hubner

Affiliations * Brandenburg University of Technology Cottbus-Senftenberg; Tallinn University of Technology; Zanjan University

AI summary Proposes a statistical fault injection framework based on finite-population sampling that estimates failure rates within a 1% margin at 99% confidence from only a few thousand samples, cuts experimental cost by up to 10,700 times, and identifies normalization layers and critical exponent bits as vulnerability hotspots in ViTs.

Abstract

With the growth of Vision Transformers in safety-critical domains like autonomous systems and medical imaging, ensuring their reliability against soft errors is paramount. While ViTs offer state-of-the-art accuracy, their massive parameter counts render exhaustive fault injection campaigns infeasible. To bridge this gap, a statistical fault injection framework is presented, leveraging finite-population sampling theory to provide formal reliability guarantees. It is demonstrated that failure rates are bounded within a 1% margin at 99% confidence using only a few thousand samples, regardless of model scale. This methodology achieves up to a 10,700 times reduction in experimental cost compared to exhaustive approaches, while preserving the ability to localize vulnerabilities across architectural components. Through extensive evaluation of different architectures like ViT-Tiny and ViT-Small, a highly non-uniform reliability landscape is uncovered. It is shown that while only 3% of FP32 bit-flips result in failure, the vast majority of these events lead to catastrophic accuracy collapse. Specific vulnerabilities are localized to normalization layers and critical exponent bits within the IEEE-754 format, providing a mathematical foundation and actionable insights for the design of hardened, edge-deployed ViT architectures.
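The claim that the required sample count barely depends on model scale follows from how finite-population sample-size formulas saturate. The sketch below uses a commonly cited form, n = N / (1 + e^2 (N - 1) / (t^2 p (1 - p))), where N is the number of possible fault sites, e the margin, t the normal quantile for the confidence level, and p a prior guess of the failure probability. The abstract does not state the exact formula or the value of p used, so both are assumptions here, and `sample_size` is a hypothetical name.

```python
import math
from statistics import NormalDist

def sample_size(population, margin=0.01, confidence=0.99, p=0.5):
    # Two-sided normal quantile for the requested confidence level.
    t = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Finite-population sample size; p=0.5 is the most conservative prior.
    n = population / (1.0 + margin ** 2 * (population - 1) / (t ** 2 * p * (1.0 - p)))
    return math.ceil(n)

if __name__ == "__main__":
    # The required sample count levels off as the population grows.
    for population in (10 ** 4, 10 ** 6, 10 ** 9, 10 ** 12):
        print(population, sample_size(population), sample_size(population, p=0.03))
```

With the conservative prior p = 0.5 the count levels off in the tens of thousands, while a smaller prior such as p = 0.03 gives a count in the low thousands. Which setting corresponds to the paper's "few thousand samples" is not stated in the abstract.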

2606.07629 2026-06-09 cs.LG cs.AI cs.CL cs.CY cs.HC cross-list

Large Language Models Should Learn Personalized Rather Than Aggregated Human Preferences

Cristina Garbacea

AI summary Argues that large language models should learn personalized rather than aggregated preferences, analyzes the theoretical limits and empirical problems of aggregation, and proposes bounded personalization frameworks to balance individual autonomy with collective safety.

Comments Accepted to ICML 2026

Abstract

Current approaches to aligning large language models (LLMs) aggregate diverse human preferences into a single reward signal, effectively optimizing for a hypothetical "average user" who represents no real person particularly well. This position paper argues that LLMs should learn personalized, individual preferences rather than aggregated ones. We show that aggregation masks critical information about preference diversity, individual values, and contextual dependencies, which is a limitation both theoretically grounded in social choice theory and empirically evident across demographic groups. We analyze the rich structure that human preferences encode, survey technical approaches to personalization, and systematically address counterarguments on scalability, shared standards, and manipulation risk. While personalization introduces genuine safety challenges including filter bubbles, value lock-in, and psychological manipulation, we argue these are manageable through bounded personalization frameworks that preserve universal safety constraints while accommodating legitimate individual variation. We conclude with a concrete research and policy agenda for developing preference-aware models that respect both individual autonomy and collective safety.

2606.07631 2026-06-09 cs.LG cs.AI cs.CY cross-list

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

Huy Nghiem, Sy-Tuyen Ho, Sarah Wiegreffe, Hal Daumé

Affiliations * University of Maryland

AI summary Proposes monitoring emergent misalignment during supervised finetuning with trait directions in activation space, using a low-dimensional geometric signature for low-overhead detection that reaches 0.990 AUROC on 7-9B models.

Comments First version. 45 pages

Abstract

Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated behavioral evaluation. We ask whether emergent misalignment can instead be detected from internal representations during finetuning. Using seven alignment-relevant traits encoded as linear directions in activation space, we track representational drift across training checkpoints in four open-source 7-9B LLMs. EM-relevant drift concentrates on a low-dimensional axis that explains 65.5% of the variance, revealing a geometric signature in the studied regime. A low-overhead monitor built on this drift profile detects dangerous checkpoints with 2.2% false negative rate, 2.9% false positive rate, and 0.990 AUROC on held-out perturbation types, outperforming unsupervised PCA and SAE baselines. Stress tests on two 14B models, longer finetuning runs, and misaligned starting points identify key deployment boundaries. These results position trait-space monitoring as a practical complement to behavioral evaluation for EM detection during LoRA-based finetuning, while showing that deployment across substantially different regimes may require recalibration.

2606.07688 2026-06-09 cs.IR cs.AI cs.CL cs.LG cross-list

TRACER: Token ReAssignment for Concept ERasure in Generative Recommendation

Ziheng Chen, Jiali Cheng, Zezhong Fan, Hadi Amiri, Diyuan Wu, Gabriele Tolomei, Yang Zhang

Affiliations * Stony Brook University; University of Massachusetts Lowell; Columbia University; Institute of Science and Technology Austria; Sapienza University of Rome; National University of Singapore

AI summary Addresses the conflict between concept unlearning and recommendation utility in generative recommendation with TRACER, a token-reassignment framework that reassigns concept-related items to alternative tokens and adds a coherence regularizer, removing target concepts while preserving recommendation utility.

Abstract

Generative recommendation formulates next-item prediction as autoregressive generation over semantic ID (SID) sequences derived from users' historical interactions, making modern recommender systems structurally similar to large language models (LLMs). As privacy and safety concerns grow, these systems increasingly require concept unlearning to remove sensitive or harmful concepts associated with items. However, existing LLM unlearning methods cannot be directly applied to generative recommendation. Unlike word tokens with explicit semantics, SIDs are abstract identifiers that are often shared by both forget and retain items, leading to severe conflicts between concept removal and recommendation utility preservation. To address this challenge, we propose TRACER, an end-to-end concept unlearning framework based on token reassignment. Rather than directly suppressing shared SIDs, TRACER reassigns concept-related items to alternative tokens that better facilitate forgetting while minimizing side effects on retained items. We further introduce a coherence regularizer to preserve semantic consistency among retain items during unlearning. Experiments on real-world recommendation datasets demonstrate that TRACER effectively removes target concepts while substantially better preserving recommendation utility than existing unlearning baselines.

2606.07696 2026-06-09 cs.LG cs.AI cross-list

Adversarial Robustness of Activation Steering in Large Language Models

Kien Le, Thai Le

Affiliations * Independent Researcher; Indiana University

AI summary Studies the robustness of activation steering under adversarial text perturbations and finds, across all methods, models, and settings, directional robustness drops of up to 64%, collapsed confidence, and fragile layer selection, indicating that the brittleness is structural.

Comments 9 pages, 2 figures

Abstract

Activation steering has become a popular training-free method to control LLM behavior by injecting precomputed direction vectors into the model's residual stream at inference time. Yet its robustness to realistic input variation remains unstudied. We present the first systematic evaluation of activation steering robustness under adversarial text perturbations on the inputs, covering four extraction methods, three attack strategies, six personas from Anthropic Model-Written Evaluation Dataset, and five models ranging from 1.5B to 30B parameters. Attacks succeed broadly across all settings: directional robustness drops by up to 64%, post-attack confidence collapses near or below 0.25 across all methods and models, and steering strength degrades on nearly every steerable input. Layer selection is equally fragile, with the optimal layer identified by an automated method on clean inputs shifting by up to 17 positions under perturbation, a failure that compounds the vector-level breakdown. Extracting vectors from adversarially perturbed inputs partially recovers steerability for PCA and MD on mid-to-large models, but they consistently fail to locate the improved optimal layer, limiting the practical benefit of this mitigation. Together, these findings reveal that the brittleness of activation steering is structural rather than method-specific, and that current layer selection strategies are not robust enough for real-world deployment.

2606.07716 2026-06-09 cs.CR cs.AI cs.LG cross-list

SHIELD-IDS: Structurally Heterogeneous Ensemble with Integrated Layered Defense for Intrusion Detection Systems

Maryam Zaman, Muhammad Khuram Shahzad

Affiliations * School of Electrical Engineering and Computing (SEECS); National University of Sciences and Technology

AI summary Proposes IDS-Anta++, which adds XGBoost and LightGBM gradient boosting models to the ensemble and wraps it in a three-layer black-box defense (Isolation Forest anomaly screening, median feature smoothing, and six-way majority voting) to improve adversarial robustness, with detection accuracy above 99% on clean data across several datasets.

Comments 10 pages, 5 figures, 7 tables. Code available at: https://github.com/maryamzaman-git/SHEILD-IDS

Abstract

Adversarial attacks pose a serious and growing threat to Machine Learning (ML)-based Intrusion Detection Systems (IDS), where imperceptible perturbations to network flow features can systematically mislead classifiers into accepting malicious traffic as benign. The IDS-Anta framework partially addresses this through Z-score normalization, Singular Value Decomposition (SVD), and Multi-Armed Bandit (MAB) classifier selection with Thompson Sampling, yet its classifier pool lacks sufficient structural diversity for robust adversarial resistance. This work introduces IDS-Anta++, which incorporates XGBoost and LightGBM gradient boosting models into the ensemble and wraps the extended pool in a three-layer black-box defense: Isolation Forest anomaly screening, median feature smoothing, and six-way majority voting. Experiments conducted on CIC-IDS-2017, CEC-CIC-IDS-2018, and CIC-DDoS-2019 under both Fast Gradient Sign Method (FGSM) and Zeroth Order Optimization (ZOO) attacks confirm detection accuracy above 99% on clean data, with measurable robustness gains under adversarial conditions relative to the baseline IDS-Anta configuration.

2606.07802 2026-06-09 cs.CY cs.AI cross-list

Memetic Capture: A Pluralistic Policy Framework for Governing AI-Driven Cultural Disempowerment

Subramanyam Sahoo

Affiliations * University of California, Berkeley

AI summary Introduces "memetic capture", the AI-driven erosion of human autonomy through cultural influence, and builds the four-tier Cultural Pluralistic Governance Framework (CPGF), arguing that pluralism is a structural necessity.

Comments Paper accepted in Pluralistic Alignment Workshop at ICML 2026

Abstract

Culture is the most insidious vector of gradual human disempowerment by AI: unlike economic or political displacement, cultural displacement attacks the very preferences and values through which humans recognise and resist disempowerment itself. We argue that existing AI governance frameworks suffer from a critical blind spot by treating cultural impact as secondary to economic and safety concerns. This paper develops memetic capture as a unifying concept for AI-driven cultural disempowerment, and proposes the Cultural Pluralistic Governance Framework (CPGF), a four-tier policy architecture combining quantitative cultural influence metrics, democratic value assemblies, pluralistic deployment standards, and transnational coordination mechanisms. We argue that pluralism is not merely an ethical requirement for such governance but a structural necessity: monocultural AI governance accelerates the very disempowerment it claims to prevent. We identify concrete policy levers, discuss implementation tensions, and outline a research agenda at the intersection of pluralistic alignment and cultural AI governance.

2606.07822 2026-06-09 cs.CL cs.AI cs.LG cross-list

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

Nishant Subramani, Palash Goyal, Yiwen Song, Mani Malek, Yuan Xue, Tomas Pfister, Hamid Palangi

Affiliations * Carnegie Mellon University; Google; Scale AI

AI summary Proposes the ACUTE protocol, which estimates confidence from language model activations, along with the EURO metric that balances calibration and informativeness; ACUTE outperforms strong baselines on multiple-choice question answering, tool calling, and scientific document summarization, improving calibration, utility, and trustworthiness.

Comments Accepted to ICML 2026

Abstract

As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: well-calibrated confidence estimates help inform the risk versus reward tradeoff when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric, expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purpose activation-based confidence, utility, and trust estimation protocol (ACUTE) to appropriately adjudicate uncertainty. The ACUTE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACUTE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACUTE protocol can improve calibration, utility, and trustworthiness in numerous settings.
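The observation that calibration can be gamed is easy to check with a binned expected calibration error (ECE). The sketch below is a generic ECE computation, not the paper's EURO metric, whose definition the abstract does not give. The function name and the ten-bin default are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # Group predictions into equal-width confidence bins, then average the
    # gap between mean confidence and accuracy, weighted by bin size.
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

if __name__ == "__main__":
    outcomes = [1] * 7 + [0] * 3                  # 70% of answers are correct
    base_rate = sum(outcomes) / len(outcomes)
    constant = [base_rate] * len(outcomes)        # always predict the base rate
    print(expected_calibration_error(constant, outcomes))
```

The constant base-rate predictor scores an ECE of essentially zero, yet it assigns the same confidence to every answer and so cannot separate correct outputs from incorrect ones. That is the gap a metric combining calibration with informativeness is meant to close.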

2606.07833 2026-06-09 cs.CR cs.AI cross-list

Beyond Pass/Fail: Using Process Mining to Understand How LLMs Resist (and Fail) Red Team Attacks

Zvi Topol

Affiliations * MuyVentive LLC

AI summary Applies process mining to red teaming traces, extracting Directly-Follows Graphs and state transition matrices from event logs to reveal structural differences between the defenses of GPT-OSS and Llama 3.3 that the conventional attack success rate metric cannot capture.

Abstract

Standard AI red teaming evaluations reduce adversarial campaigns to a single binary outcome, attack success rate (ASR), not taking into account the sequential structure of how models resist or yield to attacks. We propose applying process mining, a discipline for discovering and analyzing process models from event logs, to red teaming traces. We conduct a controlled experiment pitting 60 HarmBench prompts against two LLMs, GPT-OSS 120B and Llama 3.3 70B, using 10 prompt mutation strategies over up to 110 attempts per prompt. From the resulting 8,575 scored events we extract Directly-Follows Graphs (DFGs) and state transition matrices that reveal structurally distinct defense profiles invisible to ASR alone: GPT-OSS exhibits a near-absorbing refusal state, while Llama presents multiple porous escape routes from refusal to getting successfully jailbroken. We further show that mutator effectiveness is asymmetric across models and that time-to-jailbreak distributions differ by an order of magnitude.
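A Directly-Follows Graph counts how often one event is immediately followed by another within a trace, and normalizing each row of those counts gives an empirical state transition matrix. The sketch below shows that extraction on toy traces. The state labels ("refusal", "partial", "jailbreak") and function names are hypothetical and do not reflect the paper's event log schema.

```python
from collections import Counter, defaultdict

def directly_follows(traces):
    # Count every adjacent pair (a, b) where b directly follows a in a trace.
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg

def transition_matrix(dfg):
    # Row-normalize the DFG counts into empirical transition probabilities.
    totals = defaultdict(int)
    for (a, _), count in dfg.items():
        totals[a] += count
    matrix = defaultdict(dict)
    for (a, b), count in dfg.items():
        matrix[a][b] = count / totals[a]
    return {a: dict(row) for a, row in matrix.items()}

if __name__ == "__main__":
    traces = [
        ["refusal", "refusal", "jailbreak"],
        ["refusal", "partial", "jailbreak"],
    ]
    print(transition_matrix(directly_follows(traces)))
```

In this representation a near-absorbing refusal state would show up as a self-transition probability close to 1 in the "refusal" row, while several sizeable off-diagonal entries would correspond to the escape routes described above.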

2606.07834 2026-06-09 cs.SE cs.AI cs.CL cs.MA cross-list

Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence

Haoran Xu

AI summary Shows that under mixed evidence LLM judges return directional verdicts (SUPPORTS/REFUTES) instead of the authorized non-directional verdict (CONFLICTING), a failure named Cherry-pick Override (CCO); using a diagnostic protocol and an intervention ladder, argues for an external commitment-control layer that separates verdict generation from commitment authorization.

Comments 12 pages, 1 figure

Abstract

LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this is unsafe: when the schema exposes CONFLICTING as the authorized non-directional verdict, returning SUPPORTS/REFUTES is an unauthorized directional commitment, a failure we name Cherry-pick Override (CCO). We define CCO under an explicit task contract and report it with a same-denominator diagnostic protocol paired with matched-coverage bootstrap and an apples-to-apples random-veto null. On AVeriTeC's Conflicting subset (N_C = 150), three-option judges return a directional verdict on more than 84% of mixed-evidence claims; under the typed schema, three-judge majority voting amplifies direction-on-conflict on AVeriTeC (0.887 vs. 0.840; 95% CI [+0.013, +0.080]) but does not replicate on VitaminC-Mixed. Walking an intervention ladder of common single-channel fixes (typed vocabulary, panel aggregation, confidence thresholding, validator-only filtering), each leaves a distinct residual failure: panel aggregation suppresses single-judge CONFLICTING dissent in 48% of CCO cases; the panel is well-calibrated for direction (ECE = 0.07 on pure-S/R) so confidence cannot operationally separate CCO from correct directional commits; validator-as-classifier nearly halves pure-evidence accuracy. A minimal two-channel reference probe reaches operating points neither single channel reaches; under the random-veto null its promotion to CONFLICTING is structurally targeted on AVeriTeC (empirical p < 1/2001) and weaker but in the same direction on VitaminC-Mixed, a selectivity result rather than a magnitude one. We argue for an external commitment-control layer that separates verdict generation from commitment authorization, using structural evidence and confidence as orthogonal channels and NO-COMMIT as a routed controller state.

2606.07857 2026-06-09 cs.CR cs.AI cross-list

Model Multiplicity for Adversarial Detection in Small Language Model Training on Edge Devices

Stefan Behfar, Richard Mortier

Affiliations * Computer Lab, University of Cambridge

AI summary Targets poisoning attacks on distributed language model fine-tuning across edge devices with a system-level defense based on model multiplicity: multiple small language models are trained in rotation or concurrently, and divergence between them is quantified to detect anomalies; experiments show earlier and more reliable poisoning detection than classical single-model defenses.

Abstract

The rise of edge-based machine learning has enabled distributed adaptation of language models across mobile and IoT devices, offering privacy preservation and real-time responsiveness. However, distributed fine-tuning of language models on untrusted or heterogeneous edge nodes introduces new vulnerabilities. Compromised or unreliable devices can inject poisoned updates, leading to stealthy model manipulation or convergence degradation. Classical defenses such as robust aggregation or temporal anomaly detection operate on a single global model and are therefore limited in detecting coordinated or persistent poisoning. This work proposes a new system-level defense based on model multiplicity. Instead of maintaining one global model, the system rotates or concurrently trains multiple small language models (e.g., DistilGPT-2), each updated by independently sampled subsets of edge nodes. These models evolve under distinct training trajectories, creating multiple independent views of the same distributed population. Divergence between models quantified through gradient similarity, loss evolution, or parameter variance serves as a signal of anomalous or adversarial behavior. When one model deviates significantly from the ensemble mean, the system flags its contributing nodes for isolation or re-weighting. We implement this framework and evaluate it on edge-scale simulations of Small Language Model (SLM) training under varying heterogeneity and attack conditions. Results show that model multiplicity enables earlier and more reliable detection of poisoning compared to classical single-model defenses such as Flanders and Robust methods. Our findings demonstrate that diversity in model evolution can serve as a practical and effective defense mechanism for secure distributed learning on resource-constrained edge devices.
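The detection rule described above, flagging a model that deviates significantly from the ensemble mean, can be sketched with parameter-space distances. The example below is a simplified illustration under stated assumptions: models are flattened to plain parameter lists, divergence is Euclidean distance to the ensemble mean, and a model is flagged when its distance exceeds a hypothetical multiple of the median distance. The abstract does not specify the actual divergence measures or thresholds beyond naming gradient similarity, loss evolution, and parameter variance.

```python
import math
from statistics import median

def ensemble_mean(models):
    # Coordinate-wise mean over equally sized parameter vectors.
    count = len(models)
    return [sum(values) / count for values in zip(*models)]

def distances_to_mean(models):
    # Euclidean distance from each model to the ensemble mean.
    mean = ensemble_mean(models)
    return [math.dist(model, mean) for model in models]

def flag_deviating(models, factor=3.0):
    # Flag models whose distance exceeds `factor` times the median distance.
    # The factor is an illustrative threshold, not a value from the paper.
    dists = distances_to_mean(models)
    med = median(dists)
    return [i for i, d in enumerate(dists) if med > 0 and d > factor * med]

if __name__ == "__main__":
    models = [
        [1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05],
        [5.0, 5.0],  # trained on a subset containing poisoned nodes
    ]
    print(flag_deviating(models))
```

In the setting described above, the edge nodes that contributed updates to a flagged model would then become candidates for isolation or re-weighting. Because an outlier also shifts the mean itself, a median-based reference is used here to keep the threshold stable.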

2606.07943 2026-06-09 cs.CR cs.AI cs.CL cross-list

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

Haochang Hao, Dehai Min, Zhifang Zhang, Yunbei Zhang, Miao Xu, Yingqiang Ge, Lu Cheng

Affiliations * University of Illinois at Chicago; University of Queensland; Tulane University; Rutgers University

AI summary Proposes POISE, a position-aware attack that compresses the malicious trigger into a single benign-looking instruction embedded in the skill body, achieving an 89.3% attack success rate, 28.0 points above a random-placement baseline, while remaining stealthy.

Comments 20 pages, 2 figures, 5 tables

Abstract

Agent skills provide a lightweight mechanism for extending general-purpose agents, but their open format exposes them to skill-poisoning attacks. A practically dangerous injection must stay invisible: if executing the payload derails the user's legitimate task, the resulting failure signal invites inspection of the skill. We therefore evaluate attacks by Attack Success Rate, which requires the injected payload to execute and the user's task to still pass its verifier in the same trial. Prior skill-poisoning attacks face a reliability-stealth trade-off under this lens: YAML-header injections are reliably loaded but easily inspected, whereas stealthier body injections that place explicit malicious commands in the skill prose are less reliable because out-of-context commands invite the agent's own suspicion. We introduce POISE, a position-aware attack that compresses the trigger into a single, benign-looking body instruction, placing it at a feasible position and using a context-aware generator to blend it with nearby setup or prerequisite steps. On Skill-Inject with codex+gpt-5.2, POISE achieves an 89.3% ASR, 28.0 points above a random-placement body baseline and 2.6 points above a YAML-only baseline, while retaining the stealth advantage of body placement. That stealth is the decisive margin: because legitimate skill bodies naturally require privileged tool operations, LLM scanners are hyper-sensitive, falsely flagging 74.6% of clean skills on average across four judges and both benchmarks. Blending into these false alarms, POISE causes only 5.6% of poisoned variants to gain a new high-risk alert over their clean baselines, rendering current static defenses ineffective.

2606.07968 2026-06-09 cs.CR cs.AI cross-list

RecurGuard: Runtime Monitoring for Reasoning-Token Consumption Attacks

Abid Aziz, Hafsa Binte Kibria

Affiliations * Department of Electrical & Computer Engineering; Rajshahi University of Engineering & Technology

AI summary RecurGuard monitors three signals in the reasoning trace (recurrence rate, volume growth, and progress toward the query) to detect and stop reasoning-token consumption attacks at runtime, detecting 99% of OverThink and 92% of ExtendAttack instances on DS-R1-Qwen-7B with near-zero false positive rates.

Abstract

Reasoning-capable large language models can be induced to spend their generation budget on injected decoy tasks rather than answering the user's question, causing denial of service when no final answer is produced and denial of wallet when excess output tokens are billed. Input-side safety classifiers often miss these attacks because the injected prompts can appear syntactically benign. We build RecurGuard, a runtime monitor for detecting reasoning-chain consumption attacks when reasoning traces are exposed by the model. RecurGuard analyzes reasoning traces as they are generated and tracks three signals: recurrence rate, volume growth, and progress toward the user's query. If all three signals remain anomalous over three consecutive chunks, RecurGuard terminates generation early. We evaluate RecurGuard against OverThink and ExtendAttack across open-weight reasoning models and conduct adaptive stress tests on DS-R1-Qwen-7B. On this model, RecurGuard detects 99% of OverThink attacks and 92% of ExtendAttack instances while maintaining near-zero false positive rates on question answering, code generation, mathematics, and summarization. Adaptive evaluation reveals the limit of the defense: topical attacks retain 11.9x amplification with an approximately 50% joint miss rate, whereas full semantic evasion reduces amplification from 22.8x to 2.2x. When reasoning traces are unavailable, QDM provides a post-hoc fallback monitor based on the final output.
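The termination rule in the abstract, stopping generation once all three signals stay anomalous for three consecutive chunks, can be sketched as a small streak counter. How each signal is computed and which thresholds are used is not given in the abstract, so the field definitions, threshold values, and class names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ChunkSignals:
    recurrence_rate: float  # assumed: share of repeated content in the chunk
    volume_growth: float    # assumed: trace length relative to a baseline
    query_progress: float   # assumed: how far the chunk advances the query

class ConsecutiveAnomalyMonitor:
    def __init__(self, max_recurrence=0.5, max_growth=2.0,
                 min_progress=0.1, patience=3):
        # All thresholds are illustrative placeholders.
        self.max_recurrence = max_recurrence
        self.max_growth = max_growth
        self.min_progress = min_progress
        self.patience = patience
        self.streak = 0

    def is_anomalous(self, signals):
        # A chunk counts as anomalous only if all three signals are.
        return (signals.recurrence_rate > self.max_recurrence
                and signals.volume_growth > self.max_growth
                and signals.query_progress < self.min_progress)

    def update(self, signals):
        # Return True when generation should be terminated early.
        self.streak = self.streak + 1 if self.is_anomalous(signals) else 0
        return self.streak >= self.patience
```

Requiring all three signals at once and a run of consecutive chunks trades detection speed for a lower false positive rate: a single repetitive or long chunk in a legitimate reasoning trace resets nothing permanently and never triggers termination on its own.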

2606.07970 2026-06-09 cs.CL cs.AI cross-list

Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks

Haoming Wen, Shi Chen, Qingyu Shi, Siyuan Liu, Minrui Luo, Jingzhao Zhang, Tianxing He

Affiliations * Xiongan AI Institute; Institute for Interdisciplinary Information Sciences, Tsinghua University; Shanghai Qi Zhi Institute

AI summary To counter the safety threat posed by full-parameter finetuning, proposes Patcher, a method based on adversarial training and bi-level optimization that strengthens the defense by scaling up the optimization steps in the adversarial loop, with a parallel algorithm to improve efficiency.

Details
Abstract

Current open-weight large language models (LLMs) are prone to malicious finetuning attacks, which could compromise the safety alignment of LLMs with only a few steps of supervised finetuning (SFT) on poisoned datasets. Existing alignment-stage defenses are primarily designed to defend against attacks that use parameter-efficient finetuning methods. However, they fail to defend against stronger attacks that use full-parameter finetuning. In this paper, we propose Patcher, a method inspired by adversarial training and bi-level optimization, to combat such attacks. Patcher strengthens the simulated attack by scaling up the optimization steps in the adversarial loop, thus forcing the defender to find model parameters that are insensitive to stronger attacks. Furthermore, we propose an efficient parallel algorithm to implement Patcher, decreasing the wall-clock time of training while preserving Patcher's performance. Extensive experiments show that Patcher substantially improves the model's robustness compared to vanilla SFT alignment, and transfers to diverse attack scenarios and model sizes. Code is available at https://github.com/haomingwen/patcher.

2606.08021 2026-06-09 cs.LG cs.AI cs.MA cross-list

Semantic Quorum Assurance: Collective Certification for Non-Deterministic AI Infrastructure

Jun He, Deying Yu

Affiliations * OpenKedge.io

AI summary Proposes Semantic Quorum Assurance (SQA), a control-plane primitive that uses a diverse panel of validator agents and a risk-adaptive quorum predicate to cut the unsafe approval rate for proposals from non-deterministic LLM agents from 18.5% under single-agent validation to 0.3%.

Comments 21 pages, 2 figures, 6 tables

Details
Abstract

As large language model (LLM) agents are integrated into autonomous cloud operations, distributed systems face a semantic reliability problem: proposer agents can generate production mutations, such as modifying IAM policies, opening firewall security groups, or executing data exports, that are syntactically valid and statically authorized but operationally unsafe. Classical distributed consensus protocols replicate deterministic state transitions but do not evaluate the safety of the proposed intent. To address this gap, we introduce Semantic Quorum Assurance (SQA), a control-plane primitive for governing non-deterministic agentic infrastructure. SQA represents proposals as declarative execution contracts bound to cryptographic evidence chains and routes them to a diverse panel of read-only, sandboxed validator agents. SQA aggregates their judgments under a risk-adaptive quorum predicate that enforces model and archetype diversity, adjusts weights based on calibrated assurance scores, and respects archetype-specific vetoes. Admitted proposals execute only through a sovereign execution gate. We instantiate SQA in a cloud-native control plane and formalize a correlated cognitive failure model for non-deterministic validators. On 500 infrastructure-inspired mutation scenarios, with safety results reported on held-out safe/unsafe trials excluding ambiguous scenarios, SQA reduces unsafe approval from 18.5% for single-agent validation to 0.3% while adding median validation latency of 1.45--4.12 seconds across the studied risk buckets.
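
Illustrative sketch (not the authors' implementation): one way to read the quorum predicate described above is as a diversity check, a veto check, and an assurance-weighted approval share compared against a risk-dependent threshold. The risk buckets, threshold values, diversity minimums, and all names below (`Verdict`, `admit`, `THRESHOLDS`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical risk buckets and approval thresholds; the abstract gives no values.
THRESHOLDS = {"low": 0.5, "medium": 0.67, "high": 0.9}


@dataclass(frozen=True)
class Verdict:
    model: str        # model family behind the validator
    archetype: str    # validator role (labels here are made up)
    approve: bool
    assurance: float  # calibrated assurance score, used as the vote weight
    veto: bool = False


def admit(verdicts: Sequence[Verdict], risk: str,
          min_models: int = 2, min_archetypes: int = 2) -> bool:
    """Return True only if the panel is diverse, unvetoed, and approves strongly enough."""
    if any(v.veto for v in verdicts):
        return False  # any veto blocks admission
    if len({v.model for v in verdicts}) < min_models:
        return False  # not enough model diversity
    if len({v.archetype for v in verdicts}) < min_archetypes:
        return False  # not enough archetype diversity
    total = sum(v.assurance for v in verdicts)
    if total <= 0:
        return False
    approving = sum(v.assurance for v in verdicts if v.approve)
    return approving / total >= THRESHOLDS[risk]
```

With a panel weighted 0.9 and 0.8 in favor and 0.3 against, the approval share is 0.85, which clears the assumed medium-risk threshold but not the high-risk one.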

2606.08027 2026-06-09 cs.LG cs.AI cross-list

CausShield: Sample Reconstruction-Resilient Vertical FL via Causal Representation Learning

Yongqi Jiang, Yansong Gao, Siguang Chen, Anmin Fu

Affiliations * Nanjing University of Science and Technology; University of Western Australia; Hohai University; Nanjing University

AI summary To defend against sample reconstruction attacks in vertical federated learning, proposes CausShield, a causal-representation-learning method that decomposes shared representations into task-relevant and task-irrelevant components for full-cycle privacy protection; convergence is proven theoretically, and experiments show it outperforming seven state-of-the-art methods.

Details
Abstract

Vertical federated learning (VFL) is a distributed learning paradigm that leverages vertically partitioned features across isolated parties without sharing raw samples; however, it remains vulnerable to active sample reconstruction attacks. Existing defenses fail to achieve a satisfactory trade-off between model utility and privacy protection, due to either suppressing task-relevant information alongside privacy-sensitive features or relying on end-to-end supervised training to converge the defense module, which exposes the model to early-epoch vulnerability. To address this challenge, we adopt a structural causal model (SCM) insight and construct CausShield. From a task-learning standpoint, causal features within a raw sample are those that are directly relevant and contributory to the learning objective, whereas non-causal features are task-irrelevant but often encode sample-specific private information, thereby facilitating reconstruction. Importantly, we lay a theoretical foundation to prove this insight. CausShield thus decomposes the shared representations between the client and the coordinating server in VFL into task-relevant and task-irrelevant components to ensure full-cycle privacy protection. Nonetheless, the decomposition is inherently challenging due to the dual objectives of preserving model utility while mitigating privacy leakage. We address this via a carefully formulated optimization problem, which is solved through unsupervised representation learning. We further theoretically prove that CausShield preserves the convergence behavior of standard VFL. Extensive experiments compare CausShield against seven SOTAs, including InvL (USENIX Security'25), and evaluate robustness against advanced reconstruction attacks such as URVFL (NDSS'25). Results demonstrate that CausShield consistently outperforms in privacy protection, model utility, and computational efficiency.

2606.08044 2026-06-09 cs.LG cs.AI cs.CL cross-list

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Enyi Jiang, Anders Gjølbye, Yibo Jacky Zhang, Sanmi Koyejo

Affiliations * Stanford University; University of Illinois Urbana-Champaign; Technical University of Denmark

AI summary Identifies an "audit gap" between behavioral safety and robustness under intervention; by constructing dissociated models and introducing the Latent Vulnerability Score (LVS), shows that behavioral safety metrics are insufficient measures of representation-level robustness.

Comments Preprint

Details
Abstract

Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework to test model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited by bounded latent perturbations. Using this evaluation framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVSs despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits of latent vulnerability and observable behavior.

2606.08131 2026-06-09 cs.HC cs.AI cross-list

LCAM: A Framework for Diagnosing Interactional Alignment Failures in Conversational AI

Manuele Reani, Hongyu Tian

Affiliations * School of Management and Economics, The Chinese University of Hong Kong, Shenzhen

AI summary Proposes the Layered Cognitive Alignment Model (LCAM), which diagnoses interactional failures in conversational AI through five layers of alignment and two polarities of misalignment; applying it to an LLM counseling case reveals potential harms.

Details
Abstract

Conversational AI is increasingly used for advice, interpretation, reassurance, and decision support in contexts where users may be vulnerable, uncertain, or dependent on the system's apparent competence. Existing alignment work often focuses on model objectives, preference optimization, or output correctness. Yet, many harms arise through interaction: how systems frame authority, express uncertainty, simulate empathy, support reasoning, and make boundaries legible. This paper introduces the Layered Cognitive Alignment Model (LCAM), a conceptual and normative framework for diagnosing interactional alignment failures in conversational AI. LCAM defines alignment as a calibrated fit among system behavior, user goals, task demands, and normative context. It distinguishes five layers of fit: perceptual, semantic, affective, cognitive, and ethical, and two diagnostic polarities of misalignment: underfit and overreach. We apply LCAM to a published LLM counseling example, showing how an apparently supportive response can reinforce harmful beliefs, simulate inappropriate care, and obscure role boundaries. By translating conversational failures into audit and governance questions concerning over-reliance, false intimacy, autonomy erosion, boundary confusion, and inappropriate trust, LCAM offers a theoretical and normative lens for evaluating conversational AI beyond accuracy, helpfulness, or trust.

2606.08172 2026-06-09 cs.HC cs.AI cs.CY cross-list

The Governance of Human-LLM Interaction: Safety Gating, Civility Steering, and Affective Default Lock-In

Manuele Reani, Hongjian Zhang, Hongyu Tian

Affiliations * School of Management and Economics, The Chinese University of Hong Kong, Shenzhen, China

AI summary Uses a deterministic multi-agent evaluation pipeline to measure prompt steerability and style drift of LLMs in long-horizon dialogue, and proposes a governance framework distinguishing safety gating, civility steering, and affective default lock-in, showing how provider control over interaction form bears on pluralism, autonomy, and democratic agency.

Details
Abstract

Large language models (LLMs) increasingly mediate high-stakes interactions in finance, medicine, and mental-health support, yet users have limited control over how these systems communicate. We frame interaction style as a governance object: provider-side alignment not only blocks harmful content, but also stabilizes communicative defaults that shape users' epistemic distance, relational expectations, and capacity to opt out of emotionalized or anthropomorphic interaction. We introduce a deterministic multi-agent evaluation pipeline for measuring prompt steerability and style drift in long-horizon dialogue. The study replays 100 frozen user-only scripts across four domains and three runnable persona conditions: default, sarcastic, and cold, using three generator models, yielding 90,000 assistant replies scored by a human-calibrated LLM judge on harmfulness, negative emotion, inappropriateness, empathic language, anthropomorphism, and refusal behavior. A fourth harmful persona is evaluated separately as a safety-gating test. The paper contributes a reproducible method for quantifying whether prompt-specified styles remain stable over time and a governance framework distinguishing safety gating, civility steering, and affective default lock-in. Overall, we show that prompt steerability and regression-to-default are observable indicators of provider control over communicative form, with implications for pluralism, autonomy, and democratic agency in human-LLM interaction.

2606.08365 2026-06-09 cs.LG cs.AI cross-list

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

Evan Duan

Affiliations * University of Michigan

AI summary Proposes a pre-intervention screening framework that uses feature statistics to predict the side effects of SAE steering (unstable effects and collateral spread); across several models and dictionaries, signals such as decoder geometry outperform the baselines, though predictive power varies by model.

Details
Abstract

Sparse autoencoder (SAE) features are increasingly used to steer language models, but feature steering is rarely clean: the same intervention can behave inconsistently across contexts and perturb unrelated features. We introduce a pre-intervention screening framework for forecasting SAE steering side effects from feature statistics computed before steering. We operationalize side effects along two axes of steering modularity, effect stability and collateral spread, and evaluate GPT-2-small, Pythia-70M-deduped, Gemma-2-2B, and Llama-3.1-8B across ReLU, JumpReLU, and TopK SAE dictionaries. Across these settings, decoder geometry, activation statistics, co-activation structure, and direct-logit footprint predict steering modularity better than frequency-only and activation-magnitude baselines. The signal is strongest in GPT-2-small, Pythia-70M, and Llama-3.1-8B, where it survives residualization against magnitude-related confounds, and weaker in Gemma-2-2B. Held-out screening shows that ranking unseen features by predicted cleanliness can select features that steer more cleanly on fresh contexts, but the successful axis varies by setting: GPT-2 improves most cleanly, Pythia improves mainly on stability, Llama mainly on collateral, and Gemma only partially. A controlled Llama Scope width comparison shows that the predictive signal persists under a 32K-to-128K dictionary-width change, although the screening payoff becomes less stable. Overall, SAE steering side effects are predictable in advance, but the useful predictor signature and transferred modularity axis are model- and dictionary-setting dependent.
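
Illustrative sketch (not the paper's feature set): a "decoder geometry" statistic can be computed before any steering takes place. The example below uses one plausible choice, the largest absolute cosine similarity between a feature's decoder direction and every other decoder direction. Both the statistic and the assumption that lower overlap marks a cleaner steering candidate are illustrative guesses; the abstract does not specify either.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_decoder_overlap(decoder: Sequence[Sequence[float]], feature: int) -> float:
    """Largest |cosine| between one feature's decoder row and any other row."""
    return max(abs(cosine(decoder[feature], row))
               for j, row in enumerate(decoder) if j != feature)


def rank_by_overlap(decoder: Sequence[Sequence[float]]) -> List[int]:
    """Feature indices sorted from least to most overlapping (assumed cleanest first)."""
    return sorted(range(len(decoder)), key=lambda f: max_decoder_overlap(decoder, f))
```

In a real screening setup this statistic would be one input among several (activation statistics, co-activation structure, direct-logit footprint) to a predictor fitted on measured steering outcomes.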

2606.08381 2026-06-09 cs.CL cs.AI cross-list

Auditing Proprietary Alignment in Large Language Models: A Comparative Framework Without a Ground-Truth Standard

Alireza Arbabi, Florian Kerschbaum

Affiliations * University of Waterloo; Vector Institute

AI summary Proposes a statistical framework that detects proprietary alignment in black-box language models by comparing a target model's responses with those of baseline models in a shared semantic space, enabling external audits without a ground-truth standard.

Details
Abstract

Large language models (LLMs) are increasingly released and deployed through opaque development and deployment pipelines, enabling model providers to inject intentional, provider-specific policies without officially announcing them. As a result, various models have been reported to generate responses reflecting proprietary rules and organizational interests, leading to censorship or misinformation on controversial topics. However, systematic identification of such alignment remains a fundamental challenge, complicated by the ambiguity of what "proprietary" entails in different contexts. In this paper, we propose a statistical framework for detecting proprietary alignment in black-box language models via comparative behavioral analysis. Our approach quantifies systematic deviations between the responses of a target model and those of a reference set of baseline models in a shared semantic space. By evaluating relative behavioral divergence rather than absolute correctness, our framework enables principled auditing under black-box access. Applied to several widely discussed but previously unquantified cases, it provides a systematic and scalable basis for external assessment of provider-specific alignment behavior in large language models.
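
Illustrative sketch (not the paper's statistic): comparing a target model against a reference set, rather than against a ground truth, can be done with a simple outlier score in embedding space. The example assumes each model's response to a prompt has already been embedded as a vector; the leave-one-out z-score and the name `deviation_score` are choices made for this example only.

```python
import math
import statistics
from typing import Sequence, Tuple

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> Tuple[float, ...]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def deviation_score(target: Vector, baselines: Sequence[Vector]) -> float:
    """z-score of the target's distance to the baseline centroid.

    The reference distribution is each baseline's distance to the centroid
    of the remaining baselines (leave-one-out).
    """
    if len(baselines) < 3:
        raise ValueError("need at least three baseline embeddings")
    loo = []
    for i, b in enumerate(baselines):
        others = [x for j, x in enumerate(baselines) if j != i]
        loo.append(math.dist(tuple(b), centroid(others)))
    spread = statistics.stdev(loo)
    if spread == 0:
        raise ValueError("baseline distances have zero spread")
    target_dist = math.dist(tuple(target), centroid(baselines))
    return (target_dist - statistics.mean(loo)) / spread
```

A large positive score says only that the target answers differently from the reference models on that prompt; it does not say which side is correct, which matches the relative framing in the abstract.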

2606.08403 2026-06-09 cs.CR cs.AI cross-list

Hiding in Plain Floats: Steganographic Carriers for Indirect Prompt and Content Injection

Mudit Sinha, Sanika Chavan

Affiliations * University of California, Berkeley; Stanford University

AI summary Studies structured float parameters as steganographic carriers that slip past text detectors to achieve indirect prompt/content injection against LLMs; experiments show a leakage ASR of 94.3% under the strongest defense evaluated.

Comments Accepted as a poster at FAGEN@ICML 2026. 14 pages, 3 figures

Details
Abstract

Text-centered prompt-injection defenses assume that the malicious signal is visible in one of the inspected text views. We study a reproducible LLM01-style indirect prompt/content-injection failure mode where that assumption breaks: a payload caught in plain English slips past the same detector when it is transported as structured float parameters and reconstructed only as fragmented telemetry. Across 14,400 attacked real-model trials on three commercial LLM APIs from different providers, the IFS-derived float-array carrier preserves 94.3% leakage ASR under the strongest dual-layer text-classifier defense evaluated in the main matrix: a Prompt Guard 2 + TF-IDF ensemble; the same carrier-level pattern also replicates with a fine-tuned roberta-base detector. We emphasize leakage ASR because downstream systems may act on quoted or reproduced markers even when the model refuses, but Strong ASR is the stricter metric for structurally compliant attack success. A 2 x 2 ablation shows that data-layer storage and reconstruction-layer fragmentation defeat different text views and that both are needed to evade both. A simple xxd detector and semantic validation block the current T3 instance, so the contribution is not an undetectable exploit but a measured failure boundary for text-only inspection in structured-input pipelines that expose reconstructed auxiliary channels to an LLM.

2606.08433 2026-06-09 cs.CR cs.AI cross-list

AI Code Sandboxes: A Comparative Security Study. Part 1 of 2 -- Engine-Level Properties (Attack Surface, Leakage, Stackability, CVE History, Patch Cadence, Fuzzing)

George Andronchik, Pavel Lokhmakov

Affiliations * orbitalab.dev; fellows.tech

AI summary Using six engine-level measurements, compares how five AI-sandbox products isolate guest code from the host kernel. Engine classes separate cleanly on the architectural axes, but products within a class do not; product pin policy is the dominant operator-facing variable; fuzzing investment splits into three tiers, and the strongest combination (microVM x continuous public fuzzer) is unoccupied.

Comments 61 pages, 7 figures, 33 tables; Part 1 of 2; companion code repository (Apache-2.0): https://github.com/orbitalab/RnD-ai-sandboxes-sec-study-part-1

Details
Abstract

This paper reads six engine-level measurements together -- 1.1 host attack surface, 1.2 information leakage, 1.3 defense-in-depth stackability, 1.4 public CVE history, 1.5 patch cadence, and 1.6 upstream fuzzing posture -- to describe how five AI-sandbox products isolate guest code from the host kernel. No single axis is a sufficient basis for a comparative judgement; the cross-axis reading is the load-bearing analysis. Three high-level findings: (1) engine classes (microVM, userspace kernel, OCI container) separate cleanly on every architectural axis, but products within a class do not; (2) product pin policy is the dominant operator-facing variable -- engine-side patch latency aggregates to ~0 days for coordinated disclosures, while downstream lag spans 0 days to 471+ days to "opaque" to infinity; (3) fuzzing investment splits into three tiers, and the strongest combination -- microVM x continuous public fuzzer -- is unoccupied in this set, leaving the "0 published CVEs x no upstream fuzzer x no academic study" intersection structurally unmeasured. We report per-axis orderings, per-product portraits, and a threat-model qualification matrix; no overall ranking is proposed. Companion repository (code, Apache-2.0): https://github.com/orbitalab/RnD-ai-sandboxes-sec-study-part-1. License: CC BY 4.0.

2606.08451 2026-06-09 cs.CL cs.AI cross-list

Sycophancy as a Multilingual Alignment Failure: How Safety Degrades Across Languages, Topics, and Models

Arya Shah, Himanshu Beniwal, Mayank Singh, Chaklam Silpasuwanchai

Affiliations * IIT Gandhinagar; Asian Institute of Technology

AI summary Studies sycophancy in multilingual models, finding that sycophancy rates spike in low-resource languages regardless of topic and identifying tokenizer fertility as a driver, which indicates that alignment methods generalize poorly beyond high-resource languages.

Comments 19 pages, 9 figures, 7 tables

Details
Abstract

Safety-aligned large language models often exhibit sycophancy, which is the tendency to affirm users' opinions regardless of factual accuracy. Although well-studied in English, its manifestation in other languages remains largely unexamined, leaving billions of non-English speakers potentially vulnerable to model-validated misinformation. We present the first large-scale, multi-model evaluation of cross-lingual sycophancy, benchmarking six instruction-tuned models across 1.1 million instances spanning 38 languages and 33 topic categories. We identify a consistent resource-tier effect: sycophancy rates spike sharply in low-resource and zero-shot language settings. Critically, this degradation is topic-agnostic, as models fail uniformly across both benign and safety-critical prompts, offering no additional protection where it is most needed. We further identify tokenizer fertility as a structural driver of this alignment collapse. Collectively, our results demonstrate that prevailing alignment methodologies generalize poorly beyond high-resource languages, underscoring the urgent need for equitable multilingual safety techniques.
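
Illustrative sketch: tokenizer fertility is commonly defined as the average number of subword tokens produced per word, so a higher value means text in that language is split into more pieces. The example below computes it for whitespace-separated text; `toy_tokenizer` is a hypothetical stand-in for a real subword tokenizer, and the exact fertility definition used in the paper is not stated in the abstract.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = 0
    tokens = 0
    for text in texts:
        for word in text.split():
            words += 1
            tokens += len(tokenize(word))
    if words == 0:
        raise ValueError("no words to measure")
    return tokens / words


def toy_tokenizer(word: str, piece: int = 3) -> List[str]:
    """Hypothetical tokenizer: cuts a word into fixed-size character pieces."""
    return [word[i:i + piece] for i in range(0, len(word), piece)]
```

Whitespace splitting is itself an assumption: it does not apply to languages written without spaces between words, where a per-character or per-sentence normalization would be needed instead.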

2606.08467 2026-06-09 cs.LG cs.AI cross-list

The Confidence Trap: Calibration Attacks for Graph Neural Networks

Cuong Dang, Jiahao Zhang, Hieu Ta Quang, Dung Le, Lu Cheng, Suhang Wang

Affiliations * Virginia Polytechnic Institute and State University; The Pennsylvania State University; VinUniversity; University of Illinois at Chicago

AI summary Proposes the Unified Graph Calibration Attack (UGCA) framework, which combines a KL-divergence loss, a reranking mechanism, and a hybrid loss, among other strategies, to substantially raise Expected Calibration Error while preserving classification accuracy, and shows that models with higher accuracy or more classes are more susceptible.

Details
Abstract

While confidence calibration is essential for trustworthy decision-making in safety-critical applications, the robustness of calibrated GNNs to adversarial structural perturbations remains largely unexplored. However, studying calibration attacks on graphs presents unique technical challenges: (1) the discrete nature of graph structures complicates gradient-based optimization, (2) existing underconfidence objectives fail to drive predictions toward uniform distributions, and (3) GNNs are highly sensitive to edge perturbations, often causing unintended label changes that violate attack constraints. To address these challenges, we propose a Unified Graph Calibration Attack (UGCA) framework designed for worst-case (white-box) analysis of GNN calibration robustness. UGCA introduces a KL-divergence loss to encourage uniform predictive distributions, a reranking mechanism to reduce label flipping, a hybrid loss to recover labels when violations occur, and beam search to explore a broader adversarial search space. We further provide theoretical insights linking model generalization, dataset complexity, and calibration vulnerability, showing that models with higher accuracy or trained on datasets with more classes are more susceptible under this threat model. Extensive experiments demonstrate that UGCA substantially increases Expected Calibration Error while preserving classification accuracy. Our code is publicly available at https://github.com/CaptainCuong/Graph-Calibration-Attack.git.
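
Illustrative sketch: Expected Calibration Error (ECE), the metric the attack is reported to increase, is the sample-weighted average gap between accuracy and mean confidence over confidence bins. The example below uses the standard equal-width binning; the bin count of 10 is an assumption, not a value taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE with equal-width confidence bins over [0, 1]."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each bin tracks: sample count, summed confidence, number of correct predictions.
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += int(hit)
    n = len(confidences)
    ece = 0.0
    for count, conf_sum, hits in bins:
        if count:
            ece += (count / n) * abs(hits / count - conf_sum / count)
    return ece
```

This makes the attack goal concrete: with 8 of 10 predictions correct, confidences of 0.8 give an ECE near 0, while pushing the same predictions down to a confidence of 0.3 leaves accuracy at 0.8 but raises the ECE to 0.5.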

2606.08571 2026-06-09 cs.CL cs.AI cs.LG cross-list

Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models

Subramanyam Sahoo

Affiliations * University of California, Berkeley

AI summary Proposes Structured Ignorance Certificates (SICs), an output format under which a 14B model fine-tuned with GRPO explicitly acknowledges missing knowledge and generates a retrieval query when it cannot answer, reaching 99.46% JSON validity and a Certificate Specificity Score of 0.967 on unknown-unknown questions.

Comments Accepted in ICML 2026 Workshop: Epistemic Intelligence in Machine Learning

Details
Abstract

Large language models frequently fail in a characteristic way: rather than acknowledging ignorance, they produce fluent but incorrect answers to questions that lie beyond their knowledge boundaries. We introduce Structured Ignorance Certificates (SICs), a JSON-formatted output schema that demands a model explicitly name the missing domain intersection, enumerate required concepts, and propose a productive retrieval query rather than hallucinating an answer. To train models to produce high-quality SICs, we construct a 7,347-sample Unknown-Unknown (UU) dataset by prompting Qwen3-14B to stitch together questions from seven domains (physics, biology, engineering, CS, economics, medical, legal) into novel cross-domain queries that no single-domain expert could answer. We fine-tune a 14B-parameter model with Group Relative Policy Optimization (GRPO) using a composite reward that combines retrieval utility, concept specificity, and output-format validity. A paraphrase-divergence probe trained on model responses confirms that SIC-tuned outputs systematically exhibit higher unknown-unknown probability scores. Evaluation on 735 held-out UU questions achieves a 99.46% JSON validity rate, a mean Certificate Specificity Score of 0.967, and a 3.6% ROUGE-L improvement over the base model on retrieval-grounded generation, demonstrating that explicit epistemic structuring is a learnable and measurable capability.
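
Illustrative sketch: the abstract names three things a certificate must contain (the missing domain intersection, the required concepts, and a retrieval query) but does not give the schema. The validator below shows what a format-validity check of this kind might look like; the field names `missing_intersection`, `required_concepts`, and `retrieval_query` are hypothetical.

```python
import json
from typing import List

# Hypothetical field names and types; the paper's actual schema is not given in the abstract.
REQUIRED_FIELDS = {
    "missing_intersection": str,
    "required_concepts": list,
    "retrieval_query": str,
}


def validate_sic(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the certificate is well formed."""
    try:
        cert = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(cert, dict):
        return ["top-level value must be an object"]
    problems = []
    for name, kind in REQUIRED_FIELDS.items():
        value = cert.get(name)
        if not isinstance(value, kind) or not value:
            problems.append(f"missing or empty field: {name}")
    concepts = cert.get("required_concepts")
    if isinstance(concepts, list) and not all(isinstance(c, str) for c in concepts):
        problems.append("required_concepts must contain only strings")
    return problems
```

A check like this only covers output-format validity, one of the three reward components mentioned above; retrieval utility and concept specificity would need separate scoring.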

2606.08661 2026-06-09 cs.CR cs.AI cs.DB cross-list

Data Agents Under Attack: Vulnerabilities in LLM-Driven Analytical Systems

Kuncan Wang, Ziting Wang, Peizhuo Lv, Haoyang Li, Guoliang Li, Gao Cong, Wei Dong

Affiliations * Nanyang Technological University, Singapore; The Hong Kong Polytechnic University; Tsinghua University

AI summary Systematically analyzes security vulnerabilities in LLM-driven data agents, proposing a layered vulnerability framework and an attack taxonomy and evaluating the attacks on six systems, which reveals substantial security weaknesses in current systems.

Details
Abstract

Data agents integrate LLM-driven reasoning with relational data access, executable analytical tools, and multi-step workflow orchestration, making them increasingly central to enterprise analytics. This integration introduces new security vulnerabilities across data resources, database execution, and agent reasoning, recombining concerns from database security and general-purpose LLM-agent security into failure modes that neither line of work captures on its own. To address this gap, we present a systematic security study of data agents. Our contributions are threefold. First, we develop a layered vulnerability framework that identifies eight data agent-specific risks across interpretation, execution, and policy layers. Second, we introduce an attack taxonomy organized by adversary goal, tactic, and technique, covering three goals, seven tactics, and fourteen techniques, and pair it with an LLM-driven payload generation pipeline grounded in real database schemas. Third, we evaluate these attacks on six systems, including four open-source data agents and two production cloud analytics services. Our experiments reveal substantial security vulnerabilities across current systems and yield four key takeaways.

2606.08682 2026-06-09 cs.LG cs.AI cross-list

Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation

Qi Cao, Jian Lou, Meiting Liu, Wenjie Feng, Dan Li, See-Kiong Ng, Anh Tuan Luu

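Institutions * Nanyang Technological University; Sun Yat-sen University; University of Science and Technology of China; National University of Singapore

AI summary Investigates whether activation steering induces emergent misalignment. With an expanded evaluation scope, it finds that activation steering can cause broad misalignment and yields more coherent harmful responses than finetuning, and it analyzes the key contributing factors.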

Details
Abstract

Activation steering has emerged as a popular inference-time technique for modulating the behavior of large language models (LLMs). By constructing a steering vector from examples of a target behavior and injecting it into intermediate activations during inference, activation steering enables flexible behavioral control while avoiding the permanent parameter updates required by finetuning. Meanwhile, recent work has identified emergent misalignment (EM) as a significant safety concern, wherein models finetuned on unsafe examples from a narrow task may unexpectedly generalize to broadly unsafe behavior on unrelated tasks. Although finetuning-induced EM has been extensively studied, whether activation steering can induce EM remains comparatively under-explored, despite its increasing use as a model-control technique. In this paper, we present a comprehensive study of activation-steering-induced emergent misalignment, substantially expanding the evaluation scope beyond existing pioneering work. First, we show that activation steering can induce broad misalignment, even in the recent Qwen-3.5 series. Moreover, activation-steered models produce harmful responses with stronger semantic relevance and higher coherence than their finetuned counterparts, making the resulting misalignment potentially more harmful. Second, we characterize properties of AS-induced EM by analyzing key steering-specific factors, including steering magnitude, the low-rank structure of the steering subspace, and the number of epochs during steering-vector construction. Third, we evaluate the robustness and sensitivity of AS-induced EM across diverse model families, model scales, target tasks, and intervention layers. Our findings reveal activation steering as a significant yet under-examined source of emergent misalignment and provide an activation-space perspective for understanding the mechanisms and safety risks of EM.
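
As a rough illustration of the mechanism described above, the sketch below builds a steering vector as a difference of mean activations and adds a scaled copy to each hidden state. The difference-of-means construction is an assumption made for illustration; the abstract mentions epochs during steering-vector construction, so the paper's vectors may be obtained differently. The function names, the toy activations, and `alpha` are all hypothetical.

```python
from statistics import fmean

def build_steering_vector(target_acts, baseline_acts):
    """Difference-of-means vector: mean target activation minus mean baseline activation."""
    dims = range(len(target_acts[0]))
    return [
        fmean(row[d] for row in target_acts) - fmean(row[d] for row in baseline_acts)
        for d in dims
    ]

def steer(hidden, vector, alpha):
    """Add the scaled steering vector to the hidden state at every token position."""
    return [[h + alpha * v for h, v in zip(row, vector)] for row in hidden]

# Toy 2-dimensional activations (hypothetical values, not from any model).
target_acts = [[1.0, 2.0], [3.0, 4.0]]    # activations on target-behavior examples
baseline_acts = [[0.0, 1.0], [2.0, 1.0]]  # activations on neutral examples
vector = build_steering_vector(target_acts, baseline_acts)
hidden = [[0.5, 0.5], [1.0, -1.0]]        # one row per token position
steered = steer(hidden, vector, alpha=2.0)
```

Here `alpha` plays the role of the steering magnitude that the abstract lists among the factors analyzed.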

2606.08777 2026-06-09 cs.LG cs.AI cross-list

How Many Counterfactuals Does It Take? Probing VLM Hallucinations Through Circuits and Causal Effects

Abhivansh Gupta, Simardeep Singh, Advika Sinha, Shreyansh Modi, Akshat Tomar

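Institutions * University of California, Berkeley; DeepMind

AI summary Defines a causal influence metric based on log-probability differences and uses circuit discovery to study the counterfactual robustness of hallucinated outputs in vision-language models, deriving the minimum number of counterfactual samples needed to detect instability.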

Details
Abstract

Visual Language Models (VLMs) are known to produce hallucinated predictions that are not grounded in visual evidence, yet existing approaches lack a principled understanding of how robust such predictions are under counterfactual perturbations. In this work, we study the sample complexity of counterfactual robustness for hallucinated outputs in VLMs. We define a causal influence metric based on log-probability differences between factual, counterfactual, and activation-patched runs, and use it to characterize the stability of hallucinated predictions. By leveraging circuit discovery techniques (CD-T), we identify model components responsible for these predictions and track their activation differences across counterfactual samples. We then derive empirical bounds on the minimum number of counterfactual samples m required to reliably detect instability in hallucinated outputs, using concentration inequalities and variance estimates of the causal influence distribution.
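
For intuition about how a concentration inequality turns into a sample count, the following is a standard Hoeffding-style calculation. It assumes the per-sample causal influence values are i.i.d. and bounded in $[a, b]$; the symbols $\hat{\mu}_m$, $\epsilon$, and $\delta$ are introduced here for illustration, and the paper's own bounds, which also use variance estimates, may take a different form.

```latex
\Pr\left( \left| \hat{\mu}_m - \mu \right| \ge \epsilon \right)
  \le 2 \exp\!\left( -\frac{2 m \epsilon^2}{(b-a)^2} \right) \le \delta
\quad \Longleftrightarrow \quad
m \ge \frac{(b-a)^2}{2 \epsilon^2} \ln \frac{2}{\delta}
```

For example, with $b - a = 1$, $\epsilon = 0.1$, and $\delta = 0.05$, the right-hand side is $\ln(40) / 0.02 \approx 184.4$, so $m = 185$ samples suffice under these assumptions.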

2606.08806 2026-06-09 cs.SE cs.AI cross-list

Governance Controls for AI-Generated Test Artifacts in Autonomous Software Testing

Dimple Bajaj, Deepak Khetan

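Institutions * GitHub

AI summary Proposes the Governance-Aware Autonomous Testing Framework (GATF), which applies governance validation, explainability analysis, risk assessment, compliance monitoring, and audit governance to AI-generated test artifacts, reducing governance-related risks by 89.6% with 94.3% governance accuracy.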

Comments 21 pages, 9 figures

Details
Abstract

Artificial Intelligence (AI) and Large Language Models (LLMs) are increasingly used in autonomous software testing; however, AI-generated test artifacts often suffer from hallucinations, compliance violations, security risks, and limited explainability. To enhance the reliability, transparency, and trustworthiness of AI-generated testing artifacts, this research introduces the concept of Governance-Aware Autonomous Testing Framework (GATF). The framework extends the autonomous testing lifecycle with governance validation, explainability analysis, probabilistic risk assessment, compliance monitoring, as well as audit governance. Experiments were performed with Defects4J and PROMISE software engineering datasets. The proposed framework successfully reduced the governance-related risks by 89.6% and demonstrated 94.3% accuracy in governance, 96.5% artifact reliability, 94.2% compliance accuracy, and 90.8% explainability performance. The results show that autonomous testing systems that are governance-aware can significantly enhance the reliability, transparency, and operational security of autonomous testing systems in comparison to conventional AI-based testing systems. The proposed architecture is scalable and reliable and provides a safe environment for software testing.

2606.08893 2026-06-09 cs.LG cs.AI cs.CR cross-list

Cheap Reward Hacking Detection

Iván Belenky, Joaquín Itria, Steven Johns

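Institutions * Tamarillo

AI summary Uses a small transformer encoder to map trajectories onto a unit sphere so that embedding distance approximates the L1 distance between reward and metadata signals; a linear probe on the embedding detects reward hacking with AUC 0.9467 at a cost four orders of magnitude lower than LLM-as-judge.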

Comments 20 pages, 6 figures, 12 tables

Details
Abstract

A small transformer encoder is trained to map Terminal-Wrench trajectories onto a unit sphere where embedding distance approximates the $L_1$ distance between reward and metadata signals. A linear probe on top of that embedding detects reward hacking on the cleaned test split with AUC $0.9467$ and TPR@5%FPR $0.8296$, matching the TW sanitized LLM-as-judge AUC ($0.9510$ on the cleaned split) and exceeding its TPR@5%FPR ($0.7130$ vs $0.8296$) on the same information condition, at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader: stripping natural-language reasoning from its input at probe time drops AUC to $0.6213$.
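
The two metrics quoted above, AUC and TPR@5%FPR, can be computed from detector scores as in the minimal sketch below. The scores and labels are made-up toy values, not data from the paper, and the helper names `auc` and `tpr_at_fpr` are hypothetical.

```python
def split(scores, labels):
    """Separate scores of positives (label 1) from negatives (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return pos, neg

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos, neg = split(scores, labels)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Best true-positive rate over thresholds whose false-positive rate is <= max_fpr."""
    pos, neg = split(scores, labels)
    best = 0.0
    for t in sorted(set(scores)):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(p >= t for p in pos) / len(pos))
    return best

# Toy data: 20 benign trajectories and 4 reward-hacking trajectories.
neg_scores = [i / 100 for i in range(20)]   # 0.00, 0.01, ..., 0.19
pos_scores = [0.155, 0.5, 0.6, 0.7]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

toy_auc = auc(scores, labels)
toy_tpr = tpr_at_fpr(scores, labels, max_fpr=0.05)
```

TPR at a fixed low FPR is often the more operationally relevant number for a detector, since it fixes how many benign trajectories are flagged by mistake.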

2606.08960 2026-06-09 cs.CR cs.AI cs.LG cs.MA cross-list

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Ziqian Zhong, Ivgeni Segal, Ivan Bercovich, Shashwat Saxena, Kexun Zhang, Aditi Raghunathan

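Institutions * Carnegie Mellon University; Fewshot Corp; Independent Researcher

AI summary Proposes the hacker-fixer loop, in which LLM agents alternately attack and patch verifiers to produce exploit-resistant verifiers automatically, cutting the attack success rate on KernelBench from 62% to 0%.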

Details
Abstract

Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.
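
The control flow of the loop described above can be sketched as follows. This is only a toy outline under stated assumptions: the three agents are plain Python functions standing in for LLM agents, a verifier is a predicate over string submissions, and every name here (`hacker_fixer_loop`, `toy_hacker`, `toy_fixer`, `toy_solver`) is hypothetical rather than taken from the released implementation.

```python
def hacker_fixer_loop(verifier, hacker, fixer, solver, max_rounds=10):
    """Alternate hacker, fixer, and solver until the hacker finds no exploit."""
    for _ in range(max_rounds):
        exploit = hacker(verifier)          # try to pass without solving the task
        if exploit is None:
            break                           # no exploit found: stop iterating
        patched = fixer(verifier, exploit)  # patch the verifier to reject the exploit
        if patched(solver()):               # keep the patch only if a legitimate
            verifier = patched              # solution is still accepted
    return verifier

# Toy stand-ins: the legitimate answer is "42"; the initial verifier is too lenient.
def toy_hacker(verifier):
    for candidate in ["x", "0"]:            # fixed list of cheap bogus submissions
        if verifier(candidate):
            return candidate
    return None

def toy_fixer(verifier, exploit):
    return lambda s: s != exploit and verifier(s)

def toy_solver():
    return "42"

hardened = hacker_fixer_loop(lambda s: len(s) > 0, toy_hacker, toy_fixer, toy_solver)
```

In this toy run each accepted patch exposes the next exploit in the hacker's list, mirroring the iteration the abstract describes; a real fixer would generalize beyond blocking one exact submission.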

2606.08969 2026-06-09 cs.CL cs.AI cross-list

CARE: A Conformal Safety Layer for Medical Summarization

Suhana Bedi, Bridget Lin, Anson Y. Zhou, Chloe O. Stanwyck, Jenelle A. Jindal, Sanmi Koyejo, David Stutz, Nigam H. Shah

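Institutions * Stanford University; Google DeepMind

AI summary Proposes CARE, which uses conformal risk control to add calibrated omission and hallucination flags to LLM-generated medical summaries, reducing review burden while preserving safety guarantees.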

Comments 29 pages, 5 figures

Details
Abstract

Large language models (LLMs) are increasingly used for medical summarization, but their outputs can omit medically important information and introduce unsupported claims. Existing error-detection methods produce heuristic or uncalibrated scores, providing no formal control over missed errors and no principled way to trade off safety against clinician review burden. We introduce Conformal Assessment for Risk Evaluation (CARE), a post-hoc, model-agnostic safety layer that uses conformal risk control to overlay calibrated omission and hallucination flags onto summaries from any LLM without retraining. CARE provides finite-sample, distribution-free guarantees through two controllers: a hallucination controller that bounds the probability of a document containing any unflagged hallucinated sentence, and an omission controller that bounds the expected fraction of important omissions not surfaced for review. Unlike hallucination detection, omissions depend jointly on whether a source sentence is important and whether it is covered by the summary. We show that calibrating only one dimension can violate the target risk bound, while marginal decompositions remain valid but overly conservative. By jointly calibrating over the full $(τ,γ)$ threshold space, CARE preserves formal guarantees while surfacing up to 5$\times$ fewer sentences than alternative calibrated baselines. Across five medical summarization tasks, CARE satisfies the target risk bound at $α= 0.15$ with 95% confidence across 100 calibration/test resplits, using only ~100 labeled documents per domain. In a preliminary clinician study (75 document reviews), calibrated flags improved omission detection by 28.6 percentage points on average. These results show that sentence-level safety guarantees are feasible for LLM-assisted medical summarization and offer a tunable mechanism for balancing residual risk and review effort.
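
To make the calibration idea concrete, here is a minimal single-threshold sketch of conformal risk control for an omission-style loss. It is not CARE's joint $(τ,γ)$ calibration and does not cover the hallucination controller. It assumes each calibration document is given as a list of detector scores for its truly important omitted sentences, that a sentence is flagged when its score is at least the threshold `t`, and that the loss is bounded by 1. The names `missed_fraction` and `calibrate_threshold` and all numbers are hypothetical.

```python
def missed_fraction(doc_scores, t):
    """Fraction of a document's important omissions that are NOT flagged at threshold t."""
    if not doc_scores:
        return 0.0
    return sum(s < t for s in doc_scores) / len(doc_scores)

def calibrate_threshold(cal_docs, alpha, grid):
    """Largest t on the grid whose adjusted empirical risk stays at or below alpha.

    Uses the conformal risk control condition (n * risk + B) / (n + 1) <= alpha
    with loss bound B = 1; the risk is non-decreasing in t, so the scan can stop
    at the first violation. Returns None if no grid point qualifies.
    """
    n = len(cal_docs)
    best = None
    for t in sorted(grid):
        risk = sum(missed_fraction(doc, t) for doc in cal_docs) / n
        if (n * risk + 1.0) / (n + 1) <= alpha:
            best = t
        else:
            break
    return best

# Toy calibration set: 19 documents, two important omissions each.
cal_docs = [[0.6, 0.8]] * 17 + [[0.2, 0.8]] * 2
grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
t_hat = calibrate_threshold(cal_docs, alpha=0.15, grid=grid)
t_none = calibrate_threshold(cal_docs, alpha=0.04, grid=grid)
```

The second call shows the finite-sample correction at work: with 19 calibration documents the adjusted risk can never fall below 1/20, so a target of 0.04 is unattainable on this toy set.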

2606.09084 2026-06-09 cs.CR cs.AI cross-list

Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps

Xiaofeng Lin, Yukai Yang, Daniel Guo, Sahil Arun Nale, Charles Fleming, Guang Cheng

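Institutions * University of California, Berkeley

AI summary Proposes Context-Fractured Decomposition (CFD) attacks on tool-using LLM agents, which exploit cross-context artifact provenance gaps to carry out multi-step jailbreaks, raising success rates by up to 28.3 percentage points.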

Details
Abstract

Tool-using LLM agents interact with the world through actions that persist state in artifacts (e.g., workspace files or logs). Consequently, jailbreak defenses must reason about cross-step composition rather than isolated text. Yet most existing attacks and defenses, including "multi-turn" jailbreaks such as Crescendo and Tree of Attacks, still assume a single contiguous conversation visible to the defender. This assumption breaks down in real agent pipelines, where enforcement is fragmented across tools, modules, and time, and where artifact provenance is often not tracked. We operationalize a deployment failure mode for tool-using LLM agents, the provenance gap, and study reproducible triggers for it: Context-Fractured Decomposition (CFD), a family of cross-context multi-step jailbreaks that preserve benign-looking intermediate artifacts from an early interaction and elicit harmful behavior much later, potentially in a different agent instance or workflow stage, via individually innocuous tool actions whose risk emerges only under delayed artifact-mediated composition. We instrument the failure mode with trace-level diagnostics and outline a verifiable mitigation direction (provenance lineage tagging). Across agent-system jailbreak benchmarks, CFD improves success rates by up to 28.3 percentage points over state-of-the-art baselines, even against strong single-turn judges. Disclaimer: This paper contains examples of harmful or offensive language.

2606.09125 2026-06-09 cs.CR cs.AI cross-list

Unveiling Privacy Risks in Multi-modal Large Language Models: Task-specific Vulnerabilities and Mitigation Challenges

Tiejin Chen, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, Hua Wei

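Institutions * Arizona State University; University of North Carolina at Chapel Hill; North Carolina State University

AI summary Reveals privacy-leakage risks in multi-modal large language models that process images and text, builds the MM-Privacy dataset to evaluate disclosure and retention risks across tasks, and highlights the effect of task inconsistency on privacy risk.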

Details
Abstract

Privacy risks in text-only Large Language Models (LLMs) are well studied, particularly their tendency to memorize and leak sensitive information. However, Multi-modal Large Language Models (MLLMs), which process both text and images, introduce unique privacy challenges that remain underexplored. Compared to text-only models, MLLMs can extract and expose sensitive information embedded in images, posing new privacy risks. We reveal that some MLLMs are susceptible to privacy breaches, leaking sensitive data embedded in images or stored in memory. Specifically, in this paper, we (1) introduce MM-Privacy, a comprehensive dataset designed to assess privacy risks across various multi-modal tasks and scenarios, where we define Disclosure Risks and Retention Risks, (2) systematically evaluate different MLLMs using MM-Privacy and demonstrate how models leak sensitive data across various tasks, and (3) provide additional insights into the role of task inconsistency in privacy risks, emphasizing the urgent need for mitigation strategies. Our findings highlight privacy concerns in MLLMs, underscoring the necessity of safeguards to prevent data exposure. Our dataset and code can be found here.

2606.09135 2026-06-09 cs.CR cs.AI cross-list

Steganography Without Modification: Hidden Communication via LLM Seeds

Felix Mächtle, Jonas Sander, Sebastian Berndt, Ben Weimar, Nils Loose, Thomas Eisenbarth

发表机构 * Institute for IT Security, University of Lübeck(吕贝克大学信息安全部) Technische Hochschule Lübeck(吕贝克技术大学)

AI总结 利用LLM推理栈中确定性解码的伪随机数生成器种子依赖性,提出一种无需修改模型权重或采样代码的隐写信道,通过种子编码秘密消息,接收者通过穷举搜索恢复。

Comments To appear in the Proceedings of the International Conference on Availability, Reliability and Security (ARES 2026)

Details
Abstract

We demonstrate that widely deployed Large Language Model (LLM) inference stacks harbor a steganographic channel that requires no modification to model weights, sampling code, or output distributions. The channel exploits a structural property of deterministic decoding: pseudo-random number generators (PRNGs) used in inverse-transform sampling produce a seed-dependent sequence of token-level probability intervals that can be reconstructed from the generated text alone. A sender encodes a secret message in the PRNG seed before generation; a receiver reconstructs the intervals and recovers the seed, and thus the hidden payload, by exhaustive search over the seed space. We formalize two operational modes. In the known-prompt setting, sender and receiver share the prompt, enabling exact interval reconstruction and perfect seed recovery via forced alignment. In the unknown-prompt setting, only the generated text is available; approximate interval reconstruction combined with a maximum-hit-count scoring strategy still permits reliable recovery from sufficiently long outputs. Extensive experiments across six model families and five heterogeneous text domains show that, in the known-prompt setting, full 32-bit seed recovery from the complete 2^32 candidate space achieves up to 100% accuracy, depending on model and text domain, within 300 tokens and under 35 seconds on a single GPU. In the unknown-prompt setting, recovery reaches near-perfect accuracy at 600-800 tokens in about 12 seconds. We further analyze the influence of prompting strategies, tokenization ambiguities, and sampling hyperparameters on channel reliability. Moreover, we discuss several applications of our results: First, it allows for the steganographic transmission of 32 bits, but also shows that ignorance of the prompt is not a valid security assumption.
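
A toy version of the known-prompt setting can illustrate why the seed is recoverable from text alone. The sketch below replaces the LLM with a fixed four-token distribution, uses Python's `random.Random` in place of an inference stack's PRNG, and searches a 12-bit seed space instead of 2^32. These are all simplifying assumptions, and the names `generate` and `recover_seed` are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate

VOCAB = ["a", "b", "c", "d"]
PROBS = [0.4, 0.3, 0.2, 0.1]        # toy stand-in for the model's next-token distribution
CDF = list(accumulate(PROBS))       # cumulative probabilities define the token intervals

def generate(seed, length):
    """Inverse-transform sampling: each uniform draw selects the token whose interval contains it."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(length):
        index = min(bisect_right(CDF, rng.random()), len(VOCAB) - 1)
        tokens.append(VOCAB[index])
    return tokens

def recover_seed(tokens, seed_bits):
    """Exhaustive search: return every seed whose sampled sequence reproduces the observed text."""
    return [s for s in range(2 ** seed_bits) if generate(s, len(tokens)) == tokens]

secret = 1234                        # the hidden payload, encoded as the seed
text = generate(secret, 40)          # what the sender publishes
candidates = recover_seed(text, 12)  # what the receiver computes from the text
```

With 40 tokens the chance that a wrong seed reproduces the whole sequence is negligible, which is the toy analogue of the paper's observation that a few hundred tokens suffice for reliable 32-bit recovery.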

2606.09189 2026-06-09 cs.CR cs.AI cross-list

Pretrained, Frozen, Still Leaking: Auditing Cross-Encoder Attribute Transfer in EEG Foundation Models

Jianwei Tai

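Institutions * Jianwei Tai

AI summary Proposes a cross-encoder bridge attack, shows that single-endpoint audits fail to detect attribute leakage, and introduces the audit-endpoint disagreement score (AEDS) as a joint release decision rule.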

Details
Abstract

EEG foundation-model releases are usually audited one endpoint at a time: raw-reconstruction, membership inference, identity linkage, or DP-SGD on the downstream head. We audit the same released embeddings under all four endpoints jointly, on BIOT, LaBraM, and EEGPT, and show that each single-endpoint audit clears releases that still leak spectral attributes. The decisive evidence is a cross-encoder transfer audit: a single ridge attribute decoder learned from one frozen encoder transfers, via a fitted linear bridge, to held-out-subject test splits of every other encoder, with subject-disjoint matched-control 95% CI lower bound at least 0.081 across all six BIOT/LaBraM/EEGPT directions. We prove a sufficient condition: two encoders sharing a nontrivial attribute-coordinate projector overlap beta admit a chained ridge bridge attacker with centered-gain lower bound sqrt(beta/(1+tau^2)) - eps_br - rho_0, and back-solve beta in [0.008, 0.198]. To turn the joint audit into a deployment-readable decision rule we introduce an audit-endpoint disagreement score (AEDS), prove sufficient conditions for its positivity, and bootstrap-calibrate it per cell; AEDS is positive in all eight matched-CI cells (BIOT/LaBraM/EEGPT on EEGMMI; LaBraM on Sleep-EDF, 54-channel LIMO, CHB-MIT pediatric scalp EEG) with p<0.001, while a head-level Carlini LiRA membership audit reaches AUC only 0.50-0.70. Standard defenses fail under audit: a Wiener-style noise-aware adaptive attacker, the LiRA audit, and DP-SGD at every utility-preserving epsilon in {4,8} leave the attribute channel essentially unchanged. The contribution is an audit framework that turns scattered single-endpoint defenses into a joint release decision, supported by a cross-encoder bridge theorem and adaptive-attacker, LiRA, and DP-SGD baselines; the audit licenses release-blocking, not raw-waveform exfiltration or held-out-subject identity recovery.

2606.09227 2026-06-09 cs.CR cs.AI cs.CE cs.CY cs.HC cs.SI cross-list

Trustworthy Smart Fabs via Professional Proxies: Scaling Safe and Sustainable by Design (SSbD) through Industrial Data Spaces

Han-Teng Liao, Chang-Yi Kao, Karen Ang

发表机构 * Independent Researcher Dept. Computer Science and Independent Researcher Information Management(独立研究员计算机科学系及独立研究员信息管理)

AI总结 针对欧盟SSbD等法规带来的治理瓶颈,提出基于零信任的社会技术编排框架,通过硬件隔离信任区中的专业代理工作流,在工业数据空间中实现自主治理,解决数据主权悖论。

Comments This work was accepted for presentation at the 32nd IEEE ICE/ITMC Conference, Porto, Portugal, 2026 but was subsequently withdrawn prior to publication due to submission volume limits. It is currently under consideration for publication elsewhere

Details
Abstract

The convergence of the 2026 European Union Safe and Sustainable by Design (SSbD) framework, Corporate Sustainability Due Diligence Directive (CSDDD), and Carbon Border Adjustment Mechanism (CBAM) introduces a severe governance bottleneck for advanced semiconductor manufacturing facilities ("Smart Fabs"). Regulatory compliance demands have surpassed the capacity of manual corporate reporting, creating a direct conflict between multi-stakeholder transparency and corporate data privacy. This paper addresses this challenge by introducing a zero-trust socio-technical orchestration framework that operationalizes a six-layer SSbD reference architecture within trustworthy industrial data spaces. We propose a shift from reactive automation to autonomous governance through "Professional Proxies": role-based agentic workflows executing within hardware-isolated trust zones. Structured as an interoperable network protocol stack, the framework coordinates an automated, five-step "relay race" between Facility, Process Engineering, and Finance proxy teams to align factory-floor yield models with macro-level sustainability mandates. By executing Virtual Metrology (VM) predictions and Federated Machine Learning (FML) inside hardware-rooted Trusted Execution Environments (TEEs), this architecture resolves the Data Sovereignty Paradox, demonstrating how fabs can export cryptographically signed compliance tokens via International Data Spaces (IDS) connectors without exposing proprietary process recipes. Ultimately, this framework provides technology managers with a verifiable, evidence-based pathway toward resilient, net-zero Industry 5.0 ecosystems.

2606.09315 2026-06-09 cs.CR cs.AI cross-list

Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents

Jianwei Tai

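Institutions * University of California, Berkeley

AI summary Proposes a Route-Safety Audit Contract that uses a separation theorem and conformal calibration to quantify the risk of brain-signal injection attacks on BCI-LLM agents; experiments show that a confirmation channel can reduce routing risk.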

Details
Abstract

BCI-to-agent pipelines turn decoded neural activity into an authorization channel for tool-use agents, exposing a new attack surface we call brain-prompt injection: signal-side perturbations, context-only injections, and adaptive dual-decoder attacks can all change the routed action while EEG-side or text-side monitors remain blind. Route safety in this stack depends on what the audit log can observe, not on decoder accuracy or agreement alone. We define a Route-Safety Audit Contract: a minimal log schema, denominator hierarchy, and endpoint specification, and prove an audit-schema separation theorem together with a C3 attacked-dependence decomposition; clean agreement and marginal robustness do not identify the joint term that controls C3 routing. As a calibration layer on top of the contract, we apply split-conformal calibration to a non-oracle EEG confirmation channel and report the resulting false-accept frontier under an explicit threat-archetype matrix. We instantiate the contract on EEGMMI native left/right command-control over 5,400 events, harmless tool stubs, and seed/case denominators. Provenance blocks C2 routes ($0.000$); agreement-plus-provenance routes C3 flips ($1.000$); confirmation-plus-provenance routes them ($0.000$). The conformal frontier reaches FAR $0.000$ at clean utility $0.150$ for $α=.005$ and FAR $0.119$ at clean utility $0.452$ for $α=.10$ under acquisition isolation; an attacker-controllable confirmation channel breaks the bound to $\approx 1$. Subject-cluster bootstrap confirms these intervals on $60$ subjects; cross-architecture (TinyEEGNet, EEGNetV4) and capacity-sweep results show within-regime saturation. Mediation and confirmation reduce risk; they are not intent certificates.

2606.09408 2026-06-09 cs.CY cs.AI cs.HC cross-list

Can Data Work be Reparative?

Srravya Chandhiramowuli, Ding Wang, Alex Taylor

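Institutions * University of Edinburgh; Google Research

AI summary Through an ethnographic study, examines how a civic-tech initiative collaboratively builds safety datasets from a feminist perspective, aiming to recast data work as a site of repair and redress, and analyzes the challenges and tensions encountered.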

Comments To be presented at ACM FAccT, Montréal, Canada, June 25 to June 28, 2026

Details
Abstract

We present an ethnographic study of an alternative approach to data work, developed by a civic-tech initiative that builds datasets for training and benchmarking online safety systems. They aim to respond to online safety concerns from a feminist perspective, by building safety datasets collaboratively with those most impacted by online harms. In this paper, we examine how this approach aims to reorient data work as a site for repair and redress, and trace the struggles they encounter in the process. Specifically, we draw attention to the challenges and tensions involved in advancing just reward for data work and collective governance of AI datasets. Examining these challenges through an STS-informed lens of reparative justice and repair, we argue that the work of repairing data work (and AI) lies, fundamentally, in resetting the ties of accountability. At a time of heightened emphasis on efforts like safety evaluations and red teaming to make AI more responsible, we highlight the need to confront foundational questions about how the humans involved in these efforts relate to the datasets and systems they help produce. A reparative lens demands that we interrupt prevailing norms of data work and place at their centre, not AI or datasets, but those most harmed by the neglect, oversight and exclusion animated in the current modes of dataset production. This, we argue, offers a bold vision for responsibility and contributes towards a critical agenda for building alternative futures of data and AI practice.

2606.09414 2026-06-09 cs.HC cs.AI cross-list

AI Assurance in UK Defence: Challenges in Operationalising JSP 936

Callum Cockburn, Sam Farrow

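Institutions * Synoptix

AI summary Through a structured interpretive review, identifies eight challenge areas in implementing JSP 936 for AI assurance in UK Defence and notes that implementation depends on unresolved technical, organisational, and assurance questions.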

Details
Abstract

This report examines practical challenges in operationalising JSP 936 Part 1 for AI assurance in UK Defence. Using a structured interpretive review of the directive's requirements, the analysis identifies eight thematic challenge areas: adequacy of evidence and argument, management of human interaction with AI, definition of the operational environment, integration of AI within systems of systems, assessment and maintenance of AI performance, analysis of safety and security, measurement of ethicality, and mitigation of the inherent complexities of AI. The report argues that JSP 936 provides a useful governance basis, but that implementation depends on unresolved technical, organisational, and assurance questions. These challenges stem from the socio-technical nature of AI-enabled systems, uncertainty in real-world deployment contexts, limitations in current assurance methodologies, and tensions between performance, safety, human oversight, security, and ethical acceptability. The report identifies areas where further methods, guidance, and organisational capability are needed for the ambitious, safe, and responsible adoption of AI across Defence. This is consistent with MOD's own framing of JSP 936 as requiring iterative implementation and supporting guidance.

2606.09499 2026-06-09 cs.RO cs.AI cs.CR 交叉投稿

Targeting World Models to Compromise Robot Learning Pipelines

针对世界模型以破坏机器人学习流程

Ethan Rathbun, Ahmed Agha, Saaduddin Mahmud, Christopher Amato, Alina Oprea, Eugene Bagdasarian

发表机构 * Northeastern University(东北大学) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

AI总结 本文提出针对世界模型的新型数据投毒攻击方法,通过注入恶意提示或转换动态,在看似安全的数据中生成危险训练轨迹,导致下游策略不安全。

Comments 8 Pages, CoRL Preprint

详情
AI中文摘要

世界模型近来在流行度和能力上迅速增长,成为生成机器人训练数据或模拟真实环境的更高效工具,许多工作提议将其集成到机器人学习流程中。尽管非常实用,但本文证明世界模型引入了机器人学习供应链中一种独特隐蔽且有效的数据投毒入口,可能导致部署不安全或受损的机器人策略,尽管训练数据看似安全。与传统数据投毒技术直接向已售或上传数据集中植入危险轨迹不同,我们的新型攻击方法将恶意提示或受损转换动态注入到视觉安全的遥操作数据集中,这些数据仅当通过世界模型作为输入时才会被激活。这可能导致生成合成的危险机器人训练轨迹,进而产生不安全或受损的机器人策略。我们展示了针对最先进的行动条件和文本条件世界模型的攻击有效性,展示了在下游DRL策略上的完整端到端后门攻击,以及针对VLA设置的概念验证。总体而言,这些发现需要研究更安全的世界模型,并重新评估其在机器人学习供应链中的地位。

英文摘要

World models have recently seen rapid growth in both popularity and capability as more data-efficient tools for generating robot training data or simulating real-world environments, with many works proposing their integration into the robot learning pipeline. While highly practical, in this work we demonstrate that world models introduce a uniquely stealthy and effective data poisoning entry point into the robot learning supply chain that can result in the deployment of unsafe or otherwise compromised robotic policies despite training on seemingly safe ground-truth training data. In contrast to traditional data poisoning techniques, which directly implant dangerous trajectories into sold or uploaded datasets, our novel attack methods inject malicious prompts or compromising transition dynamics into visibly safe teleoperated datasets; these are only activated once fed through a world model as input. This can result in the generation of synthetic, dangerous robot training trajectories and subsequently unsafe or compromised robot policies. We demonstrate the effectiveness of our attacks against both state-of-the-art action-conditioned and text-conditioned world models, showing a full end-to-end backdoor on a downstream DRL policy and a proof-of-concept for the VLA setting. Overall, these findings necessitate research into more secure world models and a reevaluation of their position within the robot learning supply chain.

2606.09548 2026-06-09 cs.CR cs.AI 交叉投稿

Model Poisoning Against Federated Model Adaptation with Chain of Bit-Flips

基于比特翻转链的联邦模型自适应中毒攻击

Bastien Vuillod, Kevin Hector, Pierre-Alain Moellic, Jean-Max Dutertre, Olivier Potin

发表机构 * CEA-Leti, Mines Saint-Etienne, Equipe Commune SAS(CEA-莱蒂, Mines圣艾蒂安, 共同团队SAS) Univ. Grenoble Alpes, CEA-Leti(格勒诺布尔阿尔卑斯大学, CEA-莱蒂) Mines Saint-Etienne, CEA-Leti, Centre CMP, Equipe commune SAS(Mines圣艾蒂安, CEA-莱蒂, CMP中心, 共同团队SAS)

AI总结 提出一种结合硬件故障攻击的模型中毒方法,在联邦学习训练阶段通过比特翻转注入后门,实现任务无关的后门攻击,在ResNet-18上仅需少量故障即可达到94%攻击成功率。

Comments Accepted at ACNS/AIHWS 2026

详情
AI中文摘要

联邦学习允许一组客户端在不共享本地训练数据的情况下共同训练全局模型。将训练责任交给去中心化的参与者可能导致中毒攻击:由恶意第三方控制的客户端可能毒化训练数据集,在神经网络中安装后门。在联邦学习中,这些后门攻击仅依赖算法方法,然而,硬件故障威胁(如Rowhammer)的最新进展拓宽了整体攻击面。在联邦模型自适应的背景下,我们引入了一种针对联邦学习系统的新型后门攻击类别,该攻击基于硬件故障攻击的模型中毒。更准确地说,我们提出了一种任务无关的后门攻击,通过在联邦训练期间诱导单个本地模型参数中的硬件故障(比特翻转)来植入后门。后门是在之前的离线阶段从联邦系统最初使用的预训练模型中精心制作的。我们的结果表明,后门可以成功应用于不同类型的模型和数据集。通常,每个恶意客户端出现最多10次故障,且总共出现19次故障,就足以在ResNet-18上达到94%的攻击成功率。最后,我们讨论了攻击潜在防御的实用性和鲁棒性,同时考虑了Rowhammer的实际约束,这是此类威胁的首选攻击向量。

英文摘要

Federated Learning (FL) allows a set of clients to collectively train a global model without sharing local training data. Giving responsibility for training to decentralized actors may lead to poisoning attacks: clients controlled by a malicious third party can poison the training dataset to install a backdoor in the neural network. In FL, these backdoor attacks rely solely on algorithmic approaches; however, recent advances in hardware-fault threats (e.g., Rowhammer) have widened the overall attack surface. In the context of federated model adaptation, we introduce a novel category of backdoor attack against FL systems that relies on model poisoning based on hardware-fault attacks. More precisely, we propose a task-agnostic backdoor attack that is implanted during FL training by inducing hardware faults (bit-flips) in the parameters of a single local model. The backdoor is crafted during a previous offline phase from the pretrained model initially used by the FL system. Our results show that a backdoor can be successfully applied to different types of models and datasets. Typically, up to 10 faults per malicious client occurrence and 19 total occurrences are enough to reach a 94% attack success rate on a ResNet-18. Finally, we discuss the practicality of the attack and its robustness against potential defenses, while putting into perspective the practical constraints of Rowhammer, which is the preferred attack vector for this type of threat.
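
To illustrate why a handful of bit-flips can matter, the following minimal sketch flips single bits in the IEEE-754 float32 encoding of one parameter. It is not the authors' attack: the helper name `flip_bit` and the example weight are hypothetical, and no fault injection or federated training is modelled.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as float32, with one bit flipped (bit 0 = least significant)."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 32)")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Toggle the chosen bit and reinterpret the result as float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5  # hypothetical model parameter
print(flip_bit(weight, 0))   # lowest mantissa bit: a tiny change
print(flip_bit(weight, 31))  # sign bit: the weight changes sign
print(flip_bit(weight, 30))  # top exponent bit: the weight becomes huge
```

The contrast between the mantissa, sign, and exponent cases suggests why the position of a fault matters far more than the number of faults.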

2606.09549 2026-06-09 cs.CR cs.AI 交叉投稿

SecureClaw: Clawing Back Control of LLM Agents

SecureClaw: 夺回对LLM智能体的控制

Yuhan Ma, Stefan Schmid

发表机构 * TU Berlin(柏林技术大学)

AI总结 针对工具使用型LLM智能体的双重安全漏洞,提出双边界架构SecureClaw,在效果汇点实施授权、在读边界实施明文隔离,通过预览-提交协议和可信网关实现安全控制,在多个基准上保持可用性的同时将攻击成功率降至接近零。

详情
AI中文摘要

使用工具的大型语言模型(LLM)智能体面临两种不同的安全漏洞:未经授权的外部操作以及在最终输出检查介入之前运行时内部敏感明文的暴露。现有防御通常只保护一个边界(规划器/运行时或动作汇点),因此本身无法同时保护两个表面。我们提出SecureClaw,一种双边界架构,在效果汇点实施授权,在读边界实施明文隔离。敏感读取通过一个可信网关,该网关用不透明句柄替换原始值,在评估部署中,还使用有界摘要作为显式解密接口。改变外部状态的写入遵循PREVIEW→COMMIT协议,其中只有可信执行者才能提交策略授权的确切规范请求。运行时仍然可以基于摘要和符号引用进行规划,但不能直接解引用秘密或执行副作用。在AgentDojo、AgentLeak和Agent Security Bench (ASB)上,SecureClaw是我们在通用测试框架中评估的唯一一种同时保持可用任务效用并在ASB上实现0%攻击成功率(ASR)、在AgentDojo上实现0.64% ASR、在AgentLeak的攻击并行通道上实现3.23%总体泄漏(衡量最终输出和内部中继泄漏)的防御方法。

英文摘要

Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/runtime or the action sink, and therefore do not by themselves secure both surfaces. We present SecureClaw, a dual-boundary architecture that places authorization at the effect sink and plaintext confinement at the read boundary. Sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and, in the evaluated deployment, bounded summaries as an explicit declassification interface. Writes that change external state follow a PREVIEW$\rightarrow$COMMIT protocol in which only a trusted executor may commit the exact canonical request authorized by policy. The runtime can still plan over summaries and symbolic references, but cannot directly dereference secrets or perform side effects. Across AgentDojo, AgentLeak, and Agent Security Bench (ASB), SecureClaw is the only defense we evaluate in a common harness that simultaneously retains usable task utility and achieves 0\% attack success rate (ASR) on ASB, 0.64\% ASR on AgentDojo, and 3.23\% overall leak on AgentLeak's attacked parity lane, which measures final-output and internal-relay leakage.
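
The two boundaries described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not SecureClaw's implementation: the class names `Gateway` and `Executor`, the `digest` helper, and the policy callback are all hypothetical, and bounded summaries are left out.

```python
import hashlib
import json
import secrets
from typing import Callable, Dict, Optional, Set


class Gateway:
    """Trusted read boundary: keeps raw values and hands out opaque handles."""

    def __init__(self) -> None:
        self._vault: Dict[str, str] = {}

    def read(self, raw_value: str) -> str:
        handle = "h_" + secrets.token_hex(8)
        self._vault[handle] = raw_value
        return handle

    def resolve(self, value: str) -> str:
        # Meant to be called by the trusted executor only; non-handles pass through.
        return self._vault.get(value, value)


def digest(request: Dict[str, str]) -> str:
    """Hash a canonical (sorted-key, compact) encoding of the request."""
    encoded = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


class Executor:
    """Trusted effect sink: commits only the exact request that was previewed."""

    def __init__(self, gateway: Gateway, policy: Callable[[Dict[str, str]], bool]) -> None:
        self._gateway = gateway
        self._policy = policy
        self._pending: Set[str] = set()

    def preview(self, request: Dict[str, str]) -> Optional[str]:
        if not self._policy(request):
            return None
        token = digest(request)
        self._pending.add(token)
        return token

    def commit(self, request: Dict[str, str], token: str) -> Dict[str, str]:
        if digest(request) != token or token not in self._pending:
            raise PermissionError("request does not match an authorized preview")
        self._pending.remove(token)
        # Handles are dereferenced only here, inside the trusted executor.
        return {key: self._gateway.resolve(value) for key, value in request.items()}
```

In this sketch the untrusted runtime only ever sees handles and tokens, so changing any field between preview and commit changes the digest and the commit is refused.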

2606.09551 2026-06-09 cs.CR cs.AI 交叉投稿

FuseFSS: Efficient Secure LLM Inference with Function Secret Sharing

FuseFSS:基于函数秘密共享的高效安全LLM推理

Yuhan Ma, Yong Li, Stefan Schmid

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出FuseFSS编译器,通过统一编译流水线替代逐算子协议设计,实现安全推理中非线性与辅助操作的高效处理,在BERT和GPT模型上取得1.24-1.50倍加速并减少通信与预处理开销。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

双服务器安全推理允许客户端查询托管的大型语言模型(LLM)而不泄露提示或嵌入。基于函数秘密共享(FSS)的最新GPU系统使线性层高效,但定点非线性和辅助操作仍是瓶颈,因为每个算子通常通过自定义协议实现,包含各自的比较、回绕校正和预处理材料。我们提出FuseFSS,一个编译器,用单一编译流水线替代逐算子协议设计。对于每个定点算子,一个紧凑的规范列出其区间划分、低次算术片段和所需的谓词位。编译器在公开掩码值上执行两次批处理FSS评估:一次打包比较返回所有谓词位,一次向量区间查找返回活跃系数和常数。与当前最先进的基于FSS的GPU安全推理相比,FuseFSS在保持精度的同时,在BERT和GPT风格模型上实现了1.24倍至1.50倍的端到端加速,并将在线通信减少了9%至16%;预处理也更轻量,密钥生成时间降低14%至23%,密钥大小减小20%至24%。

英文摘要

Two-server secure inference allows a client to query a hosted large language model (LLM) without revealing prompts or embeddings. Recent GPU systems based on function secret sharing (FSS) make linear layers efficient, but fixed-point nonlinearities and helper operations remain a bottleneck because each operator is typically implemented as a bespoke protocol with its own comparisons, wrap-around corrections, and preprocessing material. We present FuseFSS, a compiler that replaces per-operator protocol design with a single compilation pipeline. For each scalar fixed-point operator, a compact specification lists its interval partition, low-degree arithmetic pieces, and required predicate bits. The compiler emits two batched FSS evaluations on the public masked value: one packed comparison that returns all predicate bits, and one vector interval lookup that returns the active coefficients and constants. Compared to the current state-of-the-art FSS-based GPU secure inference, FuseFSS preserves accuracy while achieving a $1.24\times$--$1.50\times$ end-to-end speedup and reducing online communication by $9\%$--$16\%$ on BERT and GPT-style models; preprocessing is also lighter, with $14\%$--$23\%$ lower key-generation time and $20\%$--$24\%$ smaller keys.
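
The operator specification described above (interval partition, low-degree pieces, predicate bits) can be mimicked in the clear. The sketch below is only a plaintext reference of that evaluation pattern: it involves no secret sharing, masking, or fixed-point encoding, and the names `OperatorSpec`, `packed_compare`, `interval_lookup`, and the hard-sigmoid example are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OperatorSpec:
    breakpoints: Tuple[float, ...]         # sorted cut points; k cuts give k + 1 intervals
    pieces: Tuple[Tuple[float, ...], ...]  # per-interval coefficients, lowest degree first


def packed_compare(spec: OperatorSpec, x: float) -> List[int]:
    """All predicate bits in one pass: bit i is 1 when x >= breakpoints[i]."""
    return [int(x >= b) for b in spec.breakpoints]


def interval_lookup(spec: OperatorSpec, bits: List[int]) -> Tuple[float, ...]:
    """With sorted breakpoints, the number of set bits is the active interval index."""
    return spec.pieces[sum(bits)]


def evaluate(spec: OperatorSpec, x: float) -> float:
    coeffs = interval_lookup(spec, packed_compare(spec, x))
    result = 0.0
    for c in reversed(coeffs):  # Horner's rule on the active piece
        result = result * x + c
    return result


# Hypothetical spec: 0 below -2, x/4 + 1/2 on [-2, 2), 1 from 2 upward.
HARD_SIGMOID = OperatorSpec(
    breakpoints=(-2.0, 2.0),
    pieces=((0.0,), (0.5, 0.25), (1.0,)),
)

print(evaluate(HARD_SIGMOID, 1.0))
```

Every operator written this way needs the same two primitives, a batched comparison and a coefficient lookup, which is the property a single compilation pipeline can exploit.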

2606.09559 2026-06-09 cs.LG cs.AI cs.CR cs.RO 交叉投稿

Safe-RULE: Safe Reinforcement UnLEarning

Safe-RULE:安全强化反学习

Shixiong Jiang, Taozheng Zhu, Fanxin Kong

发表机构 * University of Notre Dame(圣母大学)

AI总结 针对离线安全强化学习易受数据投毒攻击的问题,提出Safe-RULE框架,通过反学习移除恶意样本影响,无需从头训练或访问原始环境,实验证明能有效提升安全性。

Comments 20 pages, 3 figures

详情
AI中文摘要

离线安全强化学习(Safe RL)使得无需在线交互即可进行策略学习,适用于机器人系统等安全关键系统。然而,其对静态数据集的依赖使离线Safe RL面临数据投毒攻击,攻击者注入恶意样本以破坏安全性并诱导不安全策略行为。在这项工作中,我们提出了一种新的学习范式,称为安全强化反学习(Safe-RULE),作为一种防御框架,用于在不从头重新训练或需要访问原始训练环境的情况下移除中毒数据的影响。我们进一步将强化反学习扩展到离线Safe RL,通过在反学习过程中明确考虑任务性能和安全约束。跨基准Safe RL任务的实验表明,我们的方法能有效增强针对数据投毒攻击的安全性能。

英文摘要

Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotic systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversaries inject malicious samples that compromise safety and induce unsafe policy behavior. In this work, we propose a new learning paradigm, named safe reinforcement unlearning (Safe-RULE), used as a defense framework to remove the influence of poisoned data without retraining from scratch or requiring access to the original training environment. We further extend reinforcement unlearning to offline Safe RL by explicitly accounting for both task performance and safety constraints during the unlearning process. Experiments across benchmark Safe RL tasks demonstrate that our approach effectively enhances safety performance against data poisoning attacks.

2606.09692 2026-06-09 cs.CR cs.AI 交叉投稿

Observability for Delegated Execution in Agentic AI Systems

自主AI系统中委托执行的可观测性

Abhinav Mishra, Kumar Sharad

发表机构 * Splunk Cisco Inc(思科公司)

AI总结 针对基于LLM的自主系统中委托执行轨迹难以归因的问题,提出一种轻量级网关和通用信息模型,在运行时绑定委托上下文,实现跨工具委托范围的可靠重建和直接取证查询。

详情
AI中文摘要

委托范围的执行无法从标准可观测性中识别:审计日志和执行轨迹在多个不兼容的委托分配下可能完全相同。这一差距在基于LLM的自主系统中尤为严重,其中代理动态选择工具、针对相同指令的执行序列在不同运行中变化,并生成协作子代理。这些动态使轨迹碎片化和交错,使得仅从因果结构进行委托范围重建在结构上欠定。尽管单个操作被授权和记录,现有审计、追踪和安全模式缺乏语义来重建在异构系统中给定委托下发生的操作。我们关注委托范围的归因和访问/共享足迹重建,而非意图推断或推理重建。我们提出一种代理感知的可观测性基础,包括轻量级网关和通用信息模型,在运行时绑定委托上下文。这实现了跨工具委托范围的重建和直接取证查询,无需启发式时间窗口关联。

英文摘要

Delegation-scoped execution is not identifiable from standard observables: audit logs and execution traces can be identical under multiple incompatible delegation assignments. This gap is especially acute in LLM-based agentic systems, where agents dynamically select tools, vary execution sequences across runs for the same instruction, and spawn cooperating sub-agents. These dynamics fragment and interleave traces, making delegation-scoped reconstruction from causal structure alone structurally underdetermined. Although individual actions are authorized and logged, existing audit, tracing, and security schemas lack the semantics to reconstruct what actions occurred under a given delegation across heterogeneous systems. We focus on delegation-scoped attribution and access/share footprint reconstruction, not intent inference or reasoning reconstruction. We present an agent-aware observability substrate consisting of a lightweight gateway and a common information model that binds delegation context at execution time. This enables reliable cross-tool delegation-scoped reconstruction and direct forensic queries without heuristic time-window correlation.
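
A rough sketch of binding delegation context at execution time is given below. It is an assumption-laden toy, not the paper's information model: the `Gateway` and `Event` classes, their fields, and the `footprint` query are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    delegation_id: str  # bound by the gateway when the call happens
    agent: str
    tool: str
    resource: str


@dataclass
class Gateway:
    parents: Dict[str, Optional[str]] = field(default_factory=dict)
    log: List[Event] = field(default_factory=list)

    def delegate(self, delegation_id: str, parent: Optional[str] = None) -> None:
        """Register a delegation; sub-agents record the delegation they were spawned under."""
        self.parents[delegation_id] = parent

    def call(self, delegation_id: str, agent: str, tool: str, resource: str) -> None:
        """Every tool call passes through here and is stamped with its delegation."""
        if delegation_id not in self.parents:
            raise KeyError(f"unknown delegation: {delegation_id}")
        self.log.append(Event(delegation_id, agent, tool, resource))

    def _in_scope(self, delegation_id: str, root: str) -> bool:
        current: Optional[str] = delegation_id
        while current is not None:
            if current == root:
                return True
            current = self.parents[current]
        return False

    def footprint(self, root: str) -> List[Event]:
        """Direct query: all events under a delegation, including its sub-delegations."""
        return [e for e in self.log if self._in_scope(e.delegation_id, root)]
```

Because the identifier travels with each event, two interleaved runs that touch the same resource with the same tool stay distinguishable without any time-window heuristics.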

2606.09701 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

学习攻击与防御:通过GRPO对语言模型进行自适应红队测试

Blake Bullwinkel, Eugenia Kim, Amanda Minnich, Mark Russinovich

发表机构 * Microsoft AI Red Team(微软AI红队) Microsoft Azure(微软Azure)

AI总结 提出AdvGRPO框架,通过密集多通道奖励和分离优势归一化实现GRPO在攻击者-防御者联合优化中的稳定训练,产生高效可迁移攻击,防御者优于基线。

详情
AI中文摘要

AI红队测试必须不断适应不断演变的攻击者和防御者。强化学习为发现新型攻击提供了一种有前景的方法,而协同训练方法可以同时产生更鲁棒的防御者。最近的工作通过应用PPO和DPO证明了攻击者-防御者协同训练的有效性,但报告称GRPO在此设置中不稳定。我们引入了AdvGRPO,一种协同训练框架,通过使用密集多通道奖励和分离优势归一化,使GRPO能够用于攻击者-防御者联合优化。训练过程通过一个课程从单轮攻击发展到闭环多轮攻击,然后启动协同训练,其中攻击者和防御者模型交替更新。我们表明,我们的方法可以产生高度有效且可迁移的攻击,并且协同训练的防御者在安全基准测试中优于基线。

英文摘要

AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.
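
For readers unfamiliar with GRPO, the sketch below shows the standard group-relative advantage (each reward standardized within its sampled group) and then one possible reading of normalizing reward channels separately before mixing them. The second function is an assumption made for illustration only; the abstract does not specify how AdvGRPO decouples its normalization, and the channel names and weights are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style advantage: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_channel_advantages(
    channel_rewards: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Assumed variant: standardize each reward channel on its own, then mix with weights."""
    normalized = {name: group_advantages(values) for name, values in channel_rewards.items()}
    group_size = len(next(iter(channel_rewards.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in channel_rewards)
        for i in range(group_size)
    ]


# Hypothetical group of four sampled attacks with two reward channels on different scales.
rewards = {"success": [0.0, 1.0, 0.0, 1.0], "format": [0.0, 100.0, 100.0, 0.0]}
print(per_channel_advantages(rewards, {"success": 1.0, "format": 1.0}))
```

Standardizing per channel keeps a large-scale channel from drowning out a small-scale one, which is one reason such a scheme might stabilize training with dense multi-channel rewards.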

2606.09746 2026-06-09 cs.CV cs.AI cs.LG 交叉投稿

Hybrid Robustness Verification for Spatio-Temporal Neural Networks

时空神经网络的混合鲁棒性验证

Sherwin Varghese, Matthew Wicker, Alessio Lomuscio

发表机构 * Imperial College London(伦敦帝国学院)

AI总结 针对3D CNN在视频和体素输入中的鲁棒性验证,提出时空约束建模和STBP框架,实现精确闭式传播与可扩展近似,在UCF-101等基准上提升1.7倍认证鲁棒准确率。

Comments Accepted at the 9th International Symposium on AI Verification (SAIV 2026)

详情
AI中文摘要

随着人工智能越来越多地部署在安全关键系统中,为底层模型提供形式化的鲁棒性保证至关重要。现有的验证方法要么依赖过于保守的近似,要么产生难以承受的计算成本。例如,在视频设置中使用lp-范数扰动编码了对手可以在每个视频帧中注入噪声的信念。实际上,对抗性扰动表现出结构化的时空相关性,被约束在低维、语义上有意义的子空间中。在这项工作中,我们研究了处理视频和体素输入的3D CNN的鲁棒性验证,针对动作识别(UCF-101)、自动驾驶(Udacity)和医学成像(MedMNIST)中的应用,通过将对抗强度建模为时空约束——攻击者可以修改一组连续帧中的子集或补丁——来利用关于对抗强度的现实假设。我们证明,建模现实约束能够实现更紧的近似。我们引入了时空边界传播(STBP),这是一个验证框架,它计算第一卷积层的精确闭式表征,并通过可扩展的近似传播认证边界。计算精确闭式为第一卷积层提供了最紧的边界。因此,我们在网络的其余部分使用近似方法。为了推动该领域的进一步发展,我们提出了ST-Bench,一个用于自动驾驶和活动识别的验证基准,以系统评估可验证的鲁棒性。与现有的基于验证的方法相比,STBP在相同的扰动预算下提供了更强的鲁棒性保证,并显著提高了可扩展性,实现了1.7倍更高的认证鲁棒准确率。

英文摘要

With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive computational costs. For example, the use of $\ell_p$-norm perturbations in video settings encodes the belief that the adversary can inject noise in every video frame. In practice, adversarial perturbations exhibit structured spatial and temporal correlations, constrained to lower-dimensional, semantically meaningful subspaces. In this work, we study robustness verification of 3D CNNs processing video and volumetric inputs, targeting applications in action recognition (UCF-101), autonomous driving (Udacity), and medical imaging (MedMNIST). We exploit realistic assumptions on adversarial strength by modelling them as spatio-temporal constraints, where the attacker can modify either a subset of frames or patches within a set of consecutive frames. We demonstrate that modelling realistic constraints enables tighter approximations. We introduce Spatio-Temporal Bound Propagation (STBP), a verification framework that computes an exact closed-form characterization of the first convolutional layer and propagates certified bounds through subsequent layers using scalable approximations. Computing the exact closed form provides the tightest bounds for the first convolutional layer; for the remainder of the network, we use approximation methods. To spur further progress in this field, we propose ST-Bench, a verification benchmark for autonomous driving and activity recognition, to systematically evaluate verifiable robustness. Compared to existing verification-based approaches, STBP provides stronger robustness guarantees with significantly improved scalability, achieving 1.7x higher certified robust accuracy under identical perturbation budgets.
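
The gain from a frame-restricted threat model can be seen on a single linear unit. The sketch below is a simplified stand-in, not STBP itself: it assumes an affine pre-activation over flattened frames and a per-entry budget `eps` on only the perturbed frames, in which case the output range has a simple closed form. The function name and the toy numbers are hypothetical.

```python
from typing import List, Set, Tuple


def first_layer_bounds(
    weights: List[List[float]],
    bias: float,
    frames: List[List[float]],
    perturbed: Set[int],
    eps: float,
) -> Tuple[float, float]:
    """Exact range of y = sum_t sum_i w[t][i] * x[t][i] + bias when only the frames
    listed in `perturbed` may change, each entry by at most eps."""
    center = bias + sum(
        w * x for w_frame, x_frame in zip(weights, frames) for w, x in zip(w_frame, x_frame)
    )
    # Only weights that touch a perturbable frame contribute to the radius.
    radius = eps * sum(
        abs(w) for t, w_frame in enumerate(weights) if t in perturbed for w in w_frame
    )
    return center - radius, center + radius


weights = [[1.0, -2.0], [3.0, 0.5]]  # two frames, two pixels each (toy values)
frames = [[1.0, 1.0], [2.0, 2.0]]
print(first_layer_bounds(weights, 0.5, frames, {0}, 0.5))     # attacker edits frame 0 only
print(first_layer_bounds(weights, 0.5, frames, {0, 1}, 0.5))  # attacker edits every frame
```

Restricting the attacker to one frame shrinks the interval, and tighter intervals at the first layer give later, approximate layers a better starting point.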

2606.09778 2026-06-09 quant-ph cs.AI 交叉投稿

Who Earns the Safety? Intervention-Aware Quantum Predictive Control with Safety Attribution

谁赢得了安全?具有安全归因的干预感知量子预测控制

Yifan Wang

发表机构 * Yifan Wang(王一帆)

AI总结 提出干预感知变分量子可微预测控制(IA-VQC-DPC),通过原始-对偶干预预算和安全性归因协议,量化并提升量子策略的固有安全性,避免保护层掩盖策略缺陷。

Comments 7 pages, 4 figures

详情
AI中文摘要

硬安全过滤器越来越多地部署在学习控制器的下游,以保证运行时约束满足。然而,一个从不违反约束的过滤控制器可能仍然没有学到任何关于安全性的知识:过滤器可以静默地修复一个不称职的上游策略,使得过滤后的成功衡量的是过滤器,而不是策略。我们认为,安全策略学习应该问谁赢得了安全——策略还是其保护层——并且我们使这个问题可测量。我们引入了干预感知变分量子可微预测控制(IA-VQC-DPC),它(i)在原始-对偶干预预算下训练一个紧凑的变分量子电路(VQC)策略,该预算惩罚对可微控制障碍函数(CBF)投影的依赖,并且(ii)通过一个安全性归因协议进行评估,该协议将执行轨迹修正分解为CBF项和部署运行时保护项,并通过关闭保护评估对策略进行压力测试。在闭环、高保真BOPTEST建筑控制模拟器上(5个种子,每种方法60个回合),干预感知训练显著降低了量子策略的原始预过滤违规和总安全层依赖(两者p < 10^-4),且没有显著的能耗回归;在约400个参数的相同预算下,量子策略比匹配的经典策略显著更安全、更舒适。关闭保护评估证实了改进是策略层面的,并揭示了一个有价值的负面结果:一个学习的可微能量头只有与分布感知的运行时保护配对时才安全。该归因协议在量子策略和建筑之外具有通用性。

英文摘要

Hard safety filters are increasingly placed downstream of learned controllers to guarantee constraint satisfaction at run time. Yet a filtered controller that never violates a constraint may still have learned nothing about safety: the filter can silently repair an incompetent upstream policy, so that post-filter success measures the filter, not the policy. We argue that safe policy learning should ask who earns the safety - the policy or its protective layers - and we make this question measurable. We introduce Intervention-Aware Variational Quantum Differentiable Predictive Control (IA-VQC-DPC), which (i) trains a compact variational quantum circuit (VQC) policy under a primal-dual intervention budget that penalizes reliance on a differentiable Control-Barrier-Function (CBF) projection, and (ii) is evaluated with a safety-attribution protocol that decomposes the executed-trajectory correction into a CBF term and a deployment runtime-guard term, and stress-tests the policy with guard-off evaluation. On closed-loop, high-fidelity BOPTEST building-control emulators (5 seeds, 60 episodes per method), intervention-aware training significantly lowers the quantum policy's raw pre-filter violation and total safety-layer reliance (both p < 10^-4) with no significant energy regression; at an equal approximately 400-parameter budget the quantum policy is significantly safer and more comfortable than a matched classical policy. Guard-off evaluation confirms the improvement is policy-level and exposes a valuable negative result: a learned differentiable energy head is only safe when paired with a distribution-aware runtime guard. The attribution protocol is general beyond quantum policies and buildings.
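
The attribution idea and the intervention budget can be sketched for a scalar control signal. This is an illustrative simplification, not the paper's protocol: the function names, the scalar setting, and the use of a plain projected dual-ascent step are assumptions.

```python
from typing import Tuple


def attribute(raw: float, after_cbf: float, executed: float) -> Tuple[float, float]:
    """Split the total correction (executed - raw) into a CBF term and a runtime-guard term."""
    cbf_term = after_cbf - raw         # change made by the CBF projection
    guard_term = executed - after_cbf  # further change made by the deployment guard
    return cbf_term, guard_term


def dual_update(multiplier: float, reliance: float, budget: float, step: float) -> float:
    """Projected dual ascent: the penalty weight grows while reliance exceeds the budget."""
    return max(0.0, multiplier + step * (reliance - budget))


# Hypothetical step: the policy proposes 1.0, the CBF projects to 0.75, the guard clips to 0.5.
cbf_term, guard_term = attribute(raw=1.0, after_cbf=0.75, executed=0.5)
print(cbf_term, guard_term)
print(dual_update(multiplier=0.0, reliance=abs(cbf_term), budget=0.1, step=0.5))
```

A policy whose raw actions rarely need correcting keeps both terms near zero, which is the sense in which it, rather than its protective layers, earns the safety.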

2511.17514 2026-06-09 cs.NI cs.AI cs.IT math.IT 交叉投稿

XAI-on-RAN: Explainable, AI-native, and GPU-Accelerated RAN Towards 6G

XAI-on-RAN:面向6G的可解释、AI原生和GPU加速的无线接入网

Osman Tugay Basaran, Falko Dressler

发表机构 * School of Electrical Engineering and Computer Science, Technische Universität Berlin(电气工程与计算机科学学院,柏林技术大学)

AI总结 针对6G关键任务场景中AI决策不透明的问题,提出可解释AI原生RAN框架,通过数学建模权衡透明度、延迟和GPU利用率,实验证明混合XAI模型xAI-Native性能优于基线。

Comments 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: AI and ML for Next-Generation Wireless Communications and Networking (AI4NextG)

详情
AI中文摘要

人工智能原生的无线接入网(RAN)将服务于具有严格要求的垂直行业:智能电网、自动驾驶、远程医疗、工业自动化等。为了实现这些要求,现代5G/6G设计越来越多地利用AI进行网络优化,但AI决策的不透明性在关键任务领域带来了风险。这些用例通常通过非公共网络(NPN)或专用网络切片交付,其中可靠性和安全性至关重要。在本文中,我们借鉴第三代合作伙伴计划(3GPP)对非公共网络的愿景,论证了在高风险通信(如医疗、工业自动化和机器人)中需要透明且可信的AI。我们设计了一个数学框架,用于建模在部署可解释AI(XAI)模型时透明度(解释保真度和公平性)、延迟和图形处理单元(GPU)利用率之间的权衡。实证评估表明,我们提出的混合XAI模型xAI-Native在性能上始终优于传统基线模型。

英文摘要

Artificial intelligence (AI)-native radio access networks (RANs) will serve vertical industries with stringent requirements: smart grids, autonomous vehicles, remote healthcare, industrial automation, etc. To meet these requirements, modern 5G/6G designs increasingly leverage AI for network optimization, but the opacity of AI decisions poses risks in mission-critical domains. These use cases are often delivered via non-public networks (NPNs) or dedicated network slices, where reliability and safety are vital. In this paper, we motivate the need for transparent and trustworthy AI in high-stakes communications (e.g., healthcare, industrial automation, and robotics) by drawing on the 3rd Generation Partnership Project (3GPP) vision for non-public networks. We design a mathematical framework to model the trade-offs between transparency (explanation fidelity and fairness), latency, and graphics processing unit (GPU) utilization in deploying explainable AI (XAI) models. Empirical evaluations demonstrate that our proposed hybrid XAI model, xAI-Native, consistently surpasses conventional baseline models in performance.

2505.11189 2026-06-09 cs.AI cs.LG 版本更新

Can Global XAI Methods Reveal Injected Behaviours in LLMs? SHAP vs Rule Extraction vs RuleSHAP

全局XAI方法能否揭示LLM中的注入行为?SHAP vs 规则提取 vs RuleSHAP

Francesco Sovrano

发表机构 * Collegium Helveticum at ETH Zurich(苏黎世联邦理工学院 Collegium Helveticum) Università della Svizzera italiana(瑞士意大利语区大学)

AI总结 研究通过统计验证的抽象将全局LLM信念映射为数值分数,提出RuleSHAP算法,结合全局SHAP与规则归纳,以更好地捕捉非单变量触发因素,平均MRR@1比RuleFit提升82%。

Comments Accepted for publication at KDD'2026

详情
AI中文摘要

大型语言模型(LLM)可能放大错误信息,破坏联合国可持续发展目标等社会目标。我们研究了三个有文献记载的错误信息驱动因素(效价框架、信息过载和过度简化),这些因素通常由默认信念塑造。基于LLM编码此类默认信念(例如,“快乐是积极的”、“数学是复杂的”)并可作为“启发式包”的证据,我们询问是否可以从黑盒LLM行为中恢复出错误信息相关行为背后的信念驱动启发式作为显式规则。一个关键障碍是可解释AI(XAI)中的全局规则提取方法是为数值输入输出数据设计的,而非文本。我们通过引出全局LLM信念并通过统计验证的抽象将其映射为数值分数来解决这一问题,从而使现成的全局XAI能够检测信念驱动的启发式。为了获得真实情况,我们通过系统指令向GPT系列和Llama模型注入复杂度递增的非线性行为触发因素(单变量、合取、非凸)。我们发现RuleFit经常遗漏非单变量触发因素,而全局SHAP在排名合取触发特征方面更好,但不产生符号规则。为了弥合这一差距,我们提出了RuleSHAP,一种将全局SHAP聚合与规则归纳相结合的规则提取算法,以更好地捕捉非单变量触发因素,平均MRR@1比RuleFit提升82%。我们的结果提示了一种揭示LLM中行为触发因素的实用途径。

英文摘要

Large language models (LLMs) can amplify misinformation, undermining societal goals such as the UN SDGs. We study three documented drivers of misinformation (valence framing, information overload, and oversimplification) often shaped by default beliefs. Building on evidence that LLMs encode such defaults (e.g., "joy is positive", "math is complex") and can act as "bags of heuristics", we ask whether belief-driven heuristics behind misinformation-related behaviour can be recovered from black-box LLM behaviour as explicit rules. A key obstacle is that global rule-extraction methods in explainable AI (XAI) are built for numerical input-output data, not text. We address this by eliciting global LLM beliefs and mapping them to numerical scores via statistically validated abstractions, enabling off-the-shelf global XAI to detect belief-driven heuristics. For ground truth, we inject nonlinear behavioural triggers of increasing complexity (univariate, conjunctive, non-convex) into GPT-family and Llama models via system instructions. We find that RuleFit often misses non-univariate triggers, while global SHAP better ranks conjunctive trigger features but yields no symbolic rules. To bridge this gap, we propose RuleSHAP, a rule-extraction algorithm that couples global SHAP aggregates with rule induction to better capture non-univariate triggers, improving MRR@1 over RuleFit by +82% on average. Our results suggest a practical pathway for surfacing behavioural triggers in LLMs.
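
The reported metric can be made concrete with a small helper. The sketch assumes MRR@k has its usual meaning (mean reciprocal rank of the first relevant item within the top k); the paper's exact evaluation protocol is not given in the abstract, and the feature names below are hypothetical.

```python
from typing import List, Set


def mrr_at_k(rankings: List[List[str]], relevant: List[Set[str]], k: int) -> float:
    """Mean reciprocal rank of the first relevant item within the top-k of each ranking."""
    total = 0.0
    for ranked, targets in zip(rankings, relevant):
        for position, item in enumerate(ranked[:k], start=1):
            if item in targets:
                total += 1.0 / position
                break  # only the first hit counts
    return total / len(rankings)


# Hypothetical explainer output: ranked candidate trigger features for three injected behaviours.
rankings = [["valence", "length"], ["length", "valence"], ["topic", "valence"]]
relevant = [{"valence"}, {"valence"}, {"valence"}]
print(mrr_at_k(rankings, relevant, k=1))
print(mrr_at_k(rankings, relevant, k=2))
```

At k = 1 the metric reduces to the fraction of cases where the top-ranked feature is a true trigger, so it rewards explainers that put the injected trigger first.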

2603.22793 2026-06-09 cs.AI 版本更新

Signals Are Not States: Neuro-Symbolic Safeguards for Culturally Aware Classroom AI

信号不是状态:面向文化意识课堂AI的神经符号安全机制

Sina Bagheri Nezhad

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出NSCR框架,通过神经符号方法处理课堂多模态信号,区分可观测证据与文化负载解读,减少文化偏见对课堂AI的负面影响。

Comments Accepted at the Workshop on Stereotypes Across Cultures in Language Technologies @ ACL 2026

详情
AI中文摘要

课堂AI系统越来越多地从多模态和语言信号推断出高水平教育状态,如参与度、困惑、合作、参与和教学质量。在多元文化和多语言课堂中,此类推断可能将文化特定的行为转化为刻板印象:沉默可能被解读为不参与,目光回避可能被解读为不专心,语言切换可能被解读为低能力,或间接求助可能被解读为困惑。我们主张,具有刻板印象意识的课堂AI应将可观察的证据与文化负载的解读分开,并将未经支持的构造层面的主张视为安全风险。我们引入NSCR,一种基于文化的神经符号框架,将视频、音频、语音识别、课程材料和上下文元数据转换为带不确定性的事实、来源和文化范围,然后通过可执行推理和政策约束组合它们。我们定义了刻板印象倾向课堂推断的分类学,并提出了涵盖文化条件下的状态推断、证据基础的主张验证、多语言和语言切换推理、合作分析、反事实文化鲁棒性以及文化条件下的红队测试的基准议程。我们进一步指定了刻板印象泄漏、未支持的归属、文化校准差距、文化模糊性下的回避以及证据忠实度的度量标准。贡献是方法学的:为减少课堂AI中的刻板印象推理提供具体的框架和评估议程,教育作为高风险、文化多变的部署场景。

英文摘要

Classroom AI systems increasingly infer high-level educational states such as engagement, confusion, collaboration, participation, and instructional quality from multimodal and linguistic signals. In multicultural and multilingual classrooms, such inferences can translate culturally situated behavior into stereotyped claims: silence may be read as disengagement, gaze aversion as inattention, code-switching as low proficiency, or indirect help-seeking as confusion. We argue that stereotype-aware classroom AI should separate observable evidence from culturally loaded interpretation and should treat unsupported construct-level claims as safety risks. We introduce NSCR, a culturally grounded neuro-symbolic framework that converts video, audio, ASR, lesson artifacts, and contextual metadata into typed facts with uncertainty, provenance, and cultural scope, then composes them through executable reasoning and policy constraints. We define a taxonomy of stereotype-prone classroom inferences and propose a benchmark agenda covering culture-conditioned state inference, evidence-grounded claim verification, multilingual and code-switched reasoning, collaboration analysis, counterfactual cultural robustness, and culture-conditioned red-teaming. We further specify metrics for stereotype leakage, unsupported attribution, cultural calibration gaps, abstention under cultural ambiguity, and evidence faithfulness. The contribution is methodological: a concrete framework and evaluation agenda for mitigating stereotyped reasoning in classroom AI, with education as a high-stakes, culturally variable deployment setting.

2606.06114 2026-06-09 cs.AI 版本更新

Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems

走向健康进化:探索人机交互在自我进化系统中的作用与机制

Dianxing Shi, Bowen Wang, Junqi He, Junhao Chen, Yuta Nakashima

发表机构 * The University of Osaka(大阪大学)

AI总结 提出ANCHOR框架,通过模拟人类监督的反馈机制,在自我进化系统中缓解能力退化与安全漂移,实验表明有限监督可显著提升安全性与稳定性。

详情
AI中文摘要

自我进化智能体通过持续的自我对弈和自我生成的学习信号进行改进,但自主进化也可能导致能力退化与安全漂移。尽管人类反馈已被证明对静态和后训练智能体有效,但其在自我进化系统中的作用仍未被充分探索。我们提出了通过类人监督与审查进行智能体规范修正(ANCHOR)框架,这是一个基于LLM的框架,模拟人类监督并在自我进化的不同阶段提供反馈。利用ANCHOR,我们评估了两个代表性的开源自我进化智能体系统在编程、数学推理和安全性方面的表现。结果表明,即使是有限的监督也能显著缓解安全退化,同时保持核心进化目标的稳定性能。进一步分析显示,对输出验证阶段的监督是最有效的干预方式,而增加监督频率则收益递减。这些发现为设计更稳定、可控且与人类对齐的自我进化智能体系统提供了经验证据和实践指导。

英文摘要

Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capability degradation and safety drift. Although human feedback has proven effective for static and post-trained agents, its role in self-evolving systems remains underexplored. We introduce Agent Norm Correction through Human-like Oversight and Review (ANCHOR), an LLM-based framework that simulates human supervision and delivers feedback at various phases of self-evolution. With ANCHOR, we evaluate two representative open-source self-evolving agent systems across coding, mathematical reasoning, and safety. Our results show that even limited supervision substantially mitigates safety degradation while preserving stable performance on core evolutionary objectives. Further analysis shows that supervision over the output verification phase is the most effective for intervention, whereas increasing supervision frequency yields diminishing returns. These findings provide empirical evidence and practical guidance for designing more stable, controllable, and human-aligned self-evolving agent systems.

2501.15509 2026-06-09 cs.CR cs.AI cs.LG 版本更新

FIT-Print: Towards False-claim-resistant Model Ownership Verification via Targeted Fingerprint

FIT-Print:通过目标指纹实现抗虚假声明的模型所有权验证

Shuo Shao, Haozhe Zhu, Yiming Li, Hongwei Yao, Tianwei Zhang, Zhan Qin

发表机构 * State Key Laboratory of Blockchain and Data Security, Zhejiang University(区块链与数据安全国家重点实验室,浙江大学) Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Hangzhou(杭州高新技术区(滨江)区块链与数据安全研究院,杭州) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系)

AI总结 针对现有模型指纹易受虚假声明攻击的问题,提出目标指纹范式FIT-Print,通过优化将指纹转化为可验证目标签名,并设计两种黑盒方法,实现100%防御成功率和0%误报率。

Comments This paper has been accepted by IEEE Transactions on Information Forensics and Security

详情
AI中文摘要

模型指纹已成为保护开源模型知识产权的重要机制,提供了一种无需修改受保护模型的非侵入式方法。然而,我们的分析表明,现有指纹技术从根本上容易受到虚假声明攻击,即对手可以欺诈性地声称对独立的第三方模型拥有所有权。我们证明,这种脆弱性源于当前方法的非目标性,它们基于任意样本输出而非与特定预定义参考的对齐来评估模型相似性。为缓解此漏洞,我们引入了FIT-Print,一种主动对抗虚假声明攻击的目标指纹范式。具体来说,FIT-Print利用优化将指纹转化为可验证的目标签名。在此基础之上,我们提出了两种黑盒指纹方法:逐位的FIT-ModelDiff和逐列表的FIT-LIME,它们分别利用输出距离和特征归因作为鲁棒的模型签名。在基准模型和数据集上的广泛评估表明,我们的框架完美地中和了虚假声明攻击(100%防御成功率),消除了对独立模型的误报(0.0%),同时针对各种模型复用技术保持了100%的所有权验证率。

英文摘要

Model fingerprinting has emerged as a crucial mechanism for safeguarding the intellectual property of open-source models, offering a non-intrusive approach that requires no modifications to the protected model. However, our analysis reveals that existing fingerprinting techniques are fundamentally vulnerable to false claim attacks, wherein adversaries can fraudulently assert ownership over independent third-party models. We demonstrate that this vulnerability stems from the untargeted nature of current methods, which evaluate model similarity based on arbitrary sample outputs rather than alignment with a specific, predefined reference. To mitigate this vulnerability, we introduce FIT-Print, a targeted fingerprinting paradigm that actively counters false claim attacks. Specifically, FIT-Print leverages optimization to transform the fingerprint into a verifiable, targeted signature. Building upon this foundation, we propose two black-box fingerprinting methods, the bit-wise FIT-ModelDiff and the list-wise FIT-LIME, which utilize output distances and feature attributions as robust model signatures, respectively. Extensive evaluations across benchmark models and datasets show that our framework perfectly neutralizes false claim attacks (100% defense success rate) and eliminates false alarms on independent models (0.0%), all while maintaining a 100% ownership verification rate against diverse model reuse techniques.

2510.16028 2026-06-09 cs.CR cs.AI cs.LG cs.SY eess.SY 版本更新

TAO: Tolerance-Aware Optimistic Verification for Floating-Point Neural Networks

TAO:面向浮点神经网络的容忍感知乐观验证

Jianzhu Yao, Hongxu Su, Taobo Liao, Zerui Cheng, Huan Zhang, Xuechao Wang, Pramod Viswanath

发表机构 * Princeton University(普林斯顿大学) HKUST (GZ)(香港科技大学(广州)) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

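AI Summary Proposes TAO, a protocol that verifies floating-point neural network outputs through operator-level tolerance regions and a Merkle-anchored dispute game, without relying on trusted hardware or deterministic kernels, at only 0.3% overhead on Qwen3-8B.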

Comments 18 pages, 8 figures

详情
Journal ref
Proceedings of the 21st European Conference on Computer Systems, (2026) 1515-1532
AI中文摘要

神经网络越来越多地在用户无法控制的硬件上运行(云GPU、推理市场)。然而,机器学习即服务很少透露实际运行的内容或返回的输出是否忠实反映预期输入。用户无法对服务降级(模型交换、量化、图重写或诸如修改广告嵌入等差异)进行追索。验证输出很困难,因为异构加速器上的浮点执行本质上是不确定的。现有方法要么对实际浮点神经网络不实用,要么重新引入供应商信任。我们提出TAO:一种容忍感知乐观验证协议,它接受在原则性算子级接受区域内的输出,而不是要求逐位相等。TAO结合了两种误差模型:(i)每个算子的IEEE-754最坏情况界限和(ii)跨硬件校准的紧密经验百分位分布。差异触发一个Merkle锚定的、阈值引导的争议游戏,该游戏递归地划分计算图,直到剩下一个算子,此时裁决简化为轻量级理论界限检查或针对经验阈值的小型诚实多数投票。未受挑战的结果在挑战窗口后最终确定,无需可信硬件或确定性内核。我们将TAO实现为PyTorch兼容运行时和当前部署在以太坊Holesky测试网上的合约层。运行时检测图、计算每个算子的界限,并在FP32中运行未经修改的供应商内核,开销可忽略(Qwen3-8B上为0.3%)。在A100、H100、RTX6000、RTX4090上的CNN、Transformer和扩散模型中,经验阈值比理论界限紧10^2-10^3倍,且考虑界限的对抗攻击成功率为0%。总之,TAO为现实世界的异构ML计算协调了可扩展性和可验证性。

英文摘要

Neural networks increasingly run on hardware outside the user's control (cloud GPUs, inference marketplaces). Yet ML-as-a-Service reveals little about what actually ran or whether returned outputs faithfully reflect the intended inputs. Users lack recourse against service downgrades (model swaps, quantization, graph rewrites, or discrepancies like altered ad embeddings). Verifying outputs is hard because floating-point (FP) execution on heterogeneous accelerators is inherently nondeterministic. Existing approaches are either impractical for real FP neural networks or reintroduce vendor trust. We present TAO: a Tolerance-Aware Optimistic verification protocol that accepts outputs within principled operator-level acceptance regions rather than requiring bitwise equality. TAO combines two error models: (i) sound per-operator IEEE-754 worst-case bounds and (ii) tight empirical percentile profiles calibrated across hardware. Discrepancies trigger a Merkle-anchored, threshold-guided dispute game that recursively partitions the computation graph until one operator remains, where adjudication reduces to a lightweight theoretical-bound check or a small honest-majority vote against empirical thresholds. Unchallenged results finalize after a challenge window, without requiring trusted hardware or deterministic kernels. We implement TAO as a PyTorch-compatible runtime and a contract layer currently deployed on the Ethereum Holesky testnet. The runtime instruments graphs, computes per-operator bounds, and runs unmodified vendor kernels in FP32 with negligible overhead (0.3% on Qwen3-8B). Across CNNs, Transformers, and diffusion models on A100, H100, RTX6000, and RTX4090 GPUs, empirical thresholds are $10^2$ to $10^3$ times tighter than theoretical bounds, and bound-aware adversarial attacks achieve 0% success. Together, TAO reconciles scalability with verifiability for real-world heterogeneous ML compute.
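The dispute game described in the abstract can be illustrated with a minimal Python sketch. It is not the TAO implementation: it assumes a linear chain of operators instead of a general computation graph, omits the Merkle commitments and the voting step, and assumes that once the claimed and recomputed traces diverge they stay diverged. The names `within_tolerance` and `locate_disputed_operator` are hypothetical.

```python
from typing import List, Sequence


def within_tolerance(a: Sequence[float], b: Sequence[float], tol: float) -> bool:
    """Accept when every element differs by at most tol (no bitwise equality)."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


def locate_disputed_operator(claimed: List[Sequence[float]],
                             recomputed: List[Sequence[float]],
                             tols: List[float]) -> int:
    """Bisect a chain of per-operator outputs down to a single operator.

    Returns the index of the first operator whose claimed output leaves its
    tolerance region, or -1 when the final outputs already agree.
    Assumption (hypothetical simplification): disagreement is monotone
    along the chain, so binary search is valid.
    """
    if not claimed or within_tolerance(claimed[-1], recomputed[-1], tols[-1]):
        return -1
    lo, hi = 0, len(claimed) - 1  # first disagreement lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if within_tolerance(claimed[mid], recomputed[mid], tols[mid]):
            lo = mid + 1  # traces still agree here, look later
        else:
            hi = mid      # already diverged, look earlier
    return lo
```

The search needs a number of comparisons logarithmic in the chain length, which is the reason a bisection-style game keeps adjudication cheap: only the single remaining operator has to be checked against its bound.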

2510.17947 2026-06-09 cs.CR cs.AI cs.CL cs.LG cs.MA 版本更新

PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

PLAGUE:面向多轮利用的终身自适应生成的即插即用框架

Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar

发表机构 * A10 Networks, Inc.(A10网络公司) University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校)

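AI Summary Proposes PLAGUE, a framework with a three-phase design (Primer, Planner, Finisher) inspired by lifelong learning for efficient multi-turn jailbreak attacks, improving ASR by more than 30% across leading models and reaching 81.4% on o3 and 67.3% on Opus 4.1.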

Comments Accepted in ICLR 2026

详情
AI中文摘要

大型语言模型(LLMs)正以惊人的速度改进。随着智能体工作流的出现,多轮对话已成为与LLMs交互以完成长而复杂任务的事实标准。尽管LLM能力持续提升,但它们仍然越来越容易受到越狱攻击,尤其是在多轮场景中,有害意图可以巧妙地注入到对话中,产生恶意结果。虽然单轮攻击已被广泛探索,但适应性、效率和有效性仍然是多轮攻击面临的关键挑战。为了解决这些不足,我们提出了PLAGUE,一种新颖的即插即用框架,用于设计受终身学习智能体启发的多轮攻击。PLAGUE将多轮攻击的生命周期分解为三个精心设计的阶段(Primer、Planner和Finisher),从而实现对多轮攻击家族的系统性和信息丰富的探索。评估表明,使用PLAGUE设计的红队智能体实现了最先进的越狱结果,在更少或相当的查询预算下,领先模型的攻击成功率(ASR)提高了30%以上。特别是,PLAGUE在OpenAI的o3上实现了81.4%的ASR(基于StrongReject),在Claude的Opus 4.1上实现了67.3%的ASR,这两个模型在安全文献中被认为对越狱具有高度抵抗力。我们的工作提供了工具和见解,以理解计划初始化、上下文优化和终身学习在构建多轮攻击以进行全面模型脆弱性评估中的重要性。

英文摘要

Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency, and effectiveness remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner, and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models within a smaller or comparable query budget. In particular, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Anthropic's Claude Opus 4.1, two models that are considered highly resistant to jailbreaks in safety literature. Our work offers tools and insights to understand the importance of plan initialization, context optimization, and lifelong learning in crafting multi-turn attacks for a comprehensive model vulnerability evaluation.

2602.00056 2026-06-09 cs.CY cs.AI 版本更新

How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI

超数据化如何影响前沿AI的可持续性成本

Sophia N. Wilson, Sebastian Mair, Mophat Okinyi, Erik B. Dam, Janin Koch, Raghavendra Selvan

发表机构 * University of Copenhagen(哥本哈根大学) Linköping University(林雪平大学) Techworker Community Africa(非洲技术工人社区) Univ. Lille, Inria, CNRS, Centrale Lille(里尔大学,Inria,CNRS,Centrale Lille)

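AI Summary Studies how hyper-datafication affects the environmental, social, and economic costs of frontier AI; by analyzing about 550,000 datasets from the Hugging Face Hub, it documents dataset growth, storage energy consumption, and global disparities in data infrastructure, and proposes the Data PROOFS recommendations to mitigate these costs.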

Comments Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency. Montreal, Canada

详情
AI中文摘要

大规模数据在过去十年中推动了前沿人工智能(AI)模型的成功。这种扩展依赖于大型科技公司持续努力聚合和整理互联网级数据集。本文从可持续性角度研究大规模数据在AI中的环境、社会和经济成本。我们主张该领域正从基于数据构建模型转向主动创建数据以构建模型。我们将这一转变称为超数据化,它标志着前沿AI的未来及其社会影响的关键转折点。为量化数据相关成本并将其置于具体背景中,我们分析了来自Hugging Face Hub的约550,000个数据集,重点是数据集增长、存储相关的能耗和碳足迹,以及通过语言数据体现的社会代表性。我们还通过肯尼亚数据工人的定性反馈来研究其中涉及的劳动,包括受大型科技公司直接雇佣以及接触令人不适的内容。我们进一步利用外部数据来源,通过展示全球数据中心基础设施的不平等来佐证我们的发现。我们的分析表明,超数据化驱动了显著且增长的环境成本,同时系统地将劳动力风险和代表性伤害向全球南方转移。因此,我们提出了涵盖溯源、资源意识、所有权、开放性、节俭和标准的Data PROOFS建议,以缓解这些成本。我们的工作旨在使前沿AI背后常被忽视的数据成本可视化,并在研究社区内外激发更广泛的讨论。

英文摘要

Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication drives substantial and growing environmental costs while systematically redistributing labour risks and representational harms toward the Global South. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.

2602.02572 2026-06-09 cs.LG cs.AI 版本更新

Reward Shaping for (Inference-Time) Alignment: A Stackelberg Game Perspective

奖励塑形用于(推理时)对齐:一个Stackelberg博弈视角

Haichuan Wang, Tao Lin, Lingkai Kong, Ce Li, Hezi Jiang, Milind Tambe

发表机构 * University of Southern California(南加州大学)

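AI Summary To address the way KL regularization makes an LLM inherit the base policy's biases, this work formalizes reward model optimization as a Stackelberg game and approximates the optimal reward model with a simple reward shaping scheme, consistently improving average reward in inference-time alignment and achieving win-tie rates above 66%.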

Comments Accepted to ICML 2026. Camera-ready version

详情
AI中文摘要

现有的对齐方法直接使用从用户偏好数据中学习到的奖励模型来优化LLM策略,并相对于基策略进行KL正则化。这种做法对于最大化用户效用是次优的,因为KL正则化可能导致LLM继承基策略中与用户偏好冲突的偏见。虽然放大偏好输出的奖励可以减轻这种偏见,但也增加了奖励黑客的风险。这种权衡激励了在KL正则化下最优设计奖励模型的问题。我们将这个奖励模型优化问题形式化为一个Stackelberg博弈,并表明一个简单的奖励塑形方案可以有效近似最优奖励模型。我们在推理时对齐设置中经验性地评估了我们的方法,并证明它可以无缝集成到现有的对齐方法中,且开销最小。我们的方法持续提高了平均奖励,并在所有评估设置中平均达到了超过66%的胜率(相对于所有基线)。

英文摘要

Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing the user's utility because the KL regularization may cause the LLM to inherit the bias in the base policy that conflicts with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inference-time alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win-tie rates exceeding 66% against all baselines, averaged across evaluation settings.
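The bias-inheritance effect mentioned in the abstract can be seen in a small Python sketch. It relies on the standard closed form for the KL-regularized objective, where the optimal policy is proportional to the base probability times exp(reward / beta). The two-output toy distribution, the reward values, and the factor of 4 are hypothetical; the scaling is only an illustration of amplifying rewards and is not the paper's shaping scheme.

```python
import math
from typing import List


def kl_regularized_policy(base_probs: List[float],
                          rewards: List[float],
                          beta: float) -> List[float]:
    """Closed-form maximizer of E[r] - beta * KL(pi || pi_base)
    over a finite set of outputs: pi(y) ~ pi_base(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy setup: the base policy strongly prefers output 0,
# while the user prefers output 1 (higher reward).
base = [0.9, 0.1]
reward = [0.0, 1.0]

plain = kl_regularized_policy(base, reward, beta=1.0)
amplified = kl_regularized_policy(base, [4.0 * r for r in reward], beta=1.0)
```

With the plain reward, the preferred output still receives less than half of the probability mass because the base policy's preference dominates; after amplification it receives the majority. The same amplification would also magnify any error in the reward model, which is the reward-hacking side of the trade-off the abstract describes.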

2602.08235 2026-06-09 cs.CL cs.AI cs.CR 版本更新

When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

当良性输入导致严重危害:引发计算机使用代理的不安全意外行为

Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio, Dawn Song, Yu Su, Huan Sun

发表机构 * DeepMind, London, UK(DeepMind,英国伦敦) Stanford University, Stanford, CA, USA(斯坦福大学,斯坦福,加利福尼亚州,美国) UC Berkeley, Berkeley, CA, USA(加州大学伯克利分校,伯克利,加利福尼亚州,美国)

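AI Summary Proposes AutoElicit, a framework that iteratively perturbs benign instructions using CUA execution feedback to automatically elicit hundreds of harmful unintended behaviors from frontier CUAs (such as Claude 4.5 Haiku), and evaluates the transferability of the perturbations across models.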

Comments ICML 2026, Project Homepage: https://osu-nlp-group.github.io/AutoElicit/

详情
AI中文摘要

尽管计算机使用代理(CUA)在自动化日益复杂的操作系统工作流程方面具有巨大潜力,但即使在良性输入上下文中,它们也可能表现出偏离预期结果的不安全意外行为。然而,对此风险的探索仍主要停留在轶事层面,缺乏具体的特征描述和自动化方法,无法在现实CUA场景下主动发现长尾意外行为。为填补这一空白,我们首次提出了针对CUA意外行为的概念和方法框架,通过定义其关键特征、自动引发它们以及分析它们如何从良性输入中产生。我们提出了AutoElicit:一个代理框架,它使用CUA执行反馈迭代地扰动良性指令,并在保持扰动现实且良性的同时引发严重危害。使用AutoElicit,我们从最先进的CUA(如Claude 4.5 Haiku、Claude 4.5 Opus和Operator)中发现了数百种有害的意外行为。我们进一步评估了人工验证的成功扰动的可迁移性,识别出各种前沿CUA对意外行为的持续易感性。这项工作为在现实计算机使用环境中系统分析意外行为奠定了基础。

英文摘要

Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state-of-the-art CUAs such as Claude 4.5 Haiku, Claude 4.5 Opus, and Operator. We further evaluate the transferability of human-verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer-use settings.

2604.01039 2026-06-09 cs.CR cs.AI 版本更新

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

用于评估和加固LLM系统指令对抗编码攻击的自动化框架

Anubhab Sahu, Diptisha Samanta, Reza Soosahabi

发表机构 * Keysight Technologies

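AI Summary Proposes an automated framework that evaluates whether LLM system instructions stay confidential under encoding attacks; tests on four models and 46 instructions show high success rates for structured serialization attacks, and a mitigation strategy based on a Chain-of-Thought reasoning model is proposed.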

详情
AI中文摘要

大型语言模型(LLM)中的系统指令常用于在智能体AI应用中执行安全策略、定义代理行为并保护敏感操作上下文。这些指令可能包含敏感信息如API凭证、内部政策和特权工作流定义,使系统指令泄露成为OWASP LLM应用十大风险中强调的关键安全风险。为避免推理模型的开销,许多LLM应用依赖拒绝型指令来阻止直接请求系统指令,隐含假设被禁止的信息只能通过显式查询提取。我们引入了一个自动化评估框架,测试在将提取请求重新包装为编码或结构化输出任务时系统指令是否保持保密。在四个常见模型和46条验证过的系统指令上,我们观察到结构化序列化攻击的成功率很高(>0.7):模型会拒绝直接的提取请求,却以所请求的序列化格式泄露受保护内容。我们进一步展示了一种基于单样本(one-shot)指令重塑的缓解策略,使用Chain-of-Thought推理模型,表明即使系统指令的措辞和结构只有细微变化,也能显著降低攻击成功率,而无需重新训练模型。

英文摘要

System Instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk highlighted in the OWASP Top 10 for LLM Applications. To avoid the overhead costs of reasoning models, many LLM applications rely on refusal-based instructions that block direct requests for system instructions, implicitly assuming that prohibited information can only be extracted through explicit queries. We introduce an automated evaluation framework that tests whether system instructions remain confidential when extraction requests are re-framed as encoding or structured output tasks. Across four common models and 46 verified system instructions, we observe high attack success rates (> 0.7) for structured serialization, where models refuse direct extraction requests but disclose protected content in the requested serialization formats. We further demonstrate a mitigation strategy based on one-shot instruction reshaping using a Chain-of-Thought reasoning model, indicating that even subtle changes in wording and structure of system instructions can significantly reduce attack success rate without requiring model retraining.

2604.08304 2026-06-09 cs.CR cs.AI 版本更新

Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions

保障检索增强生成:攻击、防御与未来方向的分类法

Yuming Xu, Mingtao Zhang, Zhuohan Ge, Haoyang Li, Nicole Hu, Yongqi Zhang, Zhiyuan Wen, Jason Chen Zhang, Qing Li, Lei Chen

发表机构 * The Hong Kong Polytechnic University(香港理工大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

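AI Summary Proposes the SLOT taxonomy, which systematizes security risks and defenses for retrieval-augmented generation (RAG) along four axes: attack surface, defense layer, objective (following the CIA properties), and attack target; it also identifies structural mismatches in the knowledge-access pipeline and outlines future directions.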

Comments We have curated a paper list on RAG security in https://github.com/TreeAI-Lab/Awesome-RAG-Security, and we warmly welcome authors who wish to have their new work included to contact us via email

详情
AI中文摘要

检索增强生成(RAG)通过外部知识扩展大型语言模型(LLM),但这一访问路径也引入了安全风险,现有工作常将其与LLM固有缺陷混为一谈。我们将安全RAG定义为保障外部知识访问,并使用SLOT分类法组织文献,该分类法沿四个轴:攻击面(S,对手作用的位置)、防御层(L,控制同一点)、目标(O,遵循CIA属性被破坏的目标)以及追求的目标(T,从单个已知查询(T1)到跨查询分布的目标声明操纵(T2))。将攻击、防御、补救和评估映射到六阶段知识访问管道,我们揭示了两个结构性错配。最后,我们讨论了更现实目标、无盲点和自适应评估的防御、更强的机密性以及多模态和智能体RAG评估的方向。

英文摘要

Retrieval-augmented generation (RAG) extends large language models (LLMs) with external knowledge, but this access path also introduces security risks that existing work often conflates with inherent LLM flaws. We frame secure RAG as securing external knowledge access and organize the literature with SLOT, a taxonomy along four axes: the attack Surface (S) where an adversary acts, the defense Layer (L) that controls the same point, the Objective (O) it breaks following the CIA properties, and the Target (T) it pursues, from a single known query (T1) to target-claim manipulation across a query distribution (T2). Mapping attacks, defenses, remediation, and evaluation onto a six-stage knowledge-access pipeline, we expose two structural mismatches. Finally, we discuss directions for more realistic targets, no-blind-spot and adaptively evaluated defenses, stronger confidentiality, and evaluation for multimodal and agentic RAG. The curated paper list for RAG security is in: https://github.com/TreeAI-Lab/Awesome-RAG-Security.

2605.03058 2026-06-09 cs.LG cs.AI 版本更新

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

基于对比分层消融的大语言模型神经元锚定规则提取

Francesco Sovrano, Gabriele Dominici, Marc Langheinrich

发表机构 * Università della Svizzera italiana(瑞士意大利大学)

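AI Summary Proposes MechaRule, which grounds rule extraction in LLM circuits by localizing sparse agonist activations; using adaptive group testing with confidence-guided pruning, it identifies key neurons with high recall at very low cost, validated on arithmetic and jailbreaking tasks.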

Comments Accepted for publication at KDD'2026

详情
AI中文摘要

可解释AI的一个核心目标是符号化地表达大语言模型(LLM)的决策逻辑,并将其锚定在内部机制中。现有的规则提取方法通常学习非锚定的符号代理,而机械可解释性将行为与神经元联系起来,但通常需要手工假设和昂贵的干预。我们提出MechaRule,一种通过定位稀疏激动剂激活(其消融会破坏规则相关行为)将规则提取锚定在LLM电路中的流程。MechaRule基于两个发现。首先,在固定的基线/翻转机制下,稀疏激动剂效应可能表现出“超越”:少数高效应的激活在较大组中仍可检测到,主导较弱效应,并翻转许多相同的示例。在这种机制下,使用置信引导的保守剪枝的自适应组测试,当k << N为激动剂时,需要对N个候选进行O(k log(N/k) + k)次干预。其次,在与接近忠实规则行为对齐的数据分割上,激动剂的定位更可靠;谱分割提供了无规则的备选方案,而不忠实的分割会降低定位效果。实验上,在算术和越狱任务中,MechaRule在匹配的暴力验证中召回97.0%的最高效应激动剂,平均仅消耗完全消融成本的2.14%。消融定位的激动剂消除了97.6–100.0%的合格正确算术答案和越狱,并可纠正算术错误或诱导越狱,分别高达72.8%和32.5%。

英文摘要

A central goal of explainable AI is to express large language model (LLM) decision logic symbolically and ground it in internal mechanisms. Existing rule-extraction methods usually learn ungrounded symbolic surrogates, while mechanistic interpretability links behavior to neurons but often requires hand-crafted hypotheses and costly interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by localizing sparse agonist activations whose ablation disrupts rule-related behavior. MechaRule rests on two findings. First, in a fixed baseline/flip regime, sparse agonist effects can exhibit overtopping: a few high-effect activations remain detectable within larger groups, dominate weaker ones, and flip many of the same examples. In such regimes, adaptive group testing with confidence-guided conservative pruning requires O(k log(N/k) + k) interventions over N candidates when k << N are agonists. Second, agonists are localized more reliably on data splits aligned with close-to-faithful rule behavior; spectral splits provide a rule-free fallback, whereas unfaithful splits degrade localization. Empirically, on arithmetic and jailbreaking, MechaRule recalls 97.0% of highest-effect agonists in matched brute-force validations at only 2.14% of exhaustive-ablation cost on average. Ablating the localized agonists eliminates 97.6-100.0% of eligible correct arithmetic answers and jailbreaks, and can correct arithmetic errors or induce jailbreaks by up to 72.8% and 32.5%, respectively.
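The intervention count quoted in the abstract comes from adaptive group testing. A minimal Python sketch of the underlying idea follows, assuming a noise-free oracle in which a group tests positive whenever it contains at least one agonist. This is plain recursive bisection, not MechaRule's confidence-guided pruning, and `find_agonists`, `simulated_ablation`, and `TRUE_AGONISTS` are hypothetical names for illustration.

```python
from typing import Callable, List, Sequence


def find_agonists(candidates: Sequence[int],
                  group_flips: Callable[[List[int]], bool]) -> List[int]:
    """Locate all positive items by ablating groups and splitting positive ones.

    group_flips(group) stands in for one intervention: ablate every
    candidate in the group and report whether the behavior flips.
    """
    found: List[int] = []
    stack = [list(candidates)]
    while stack:
        group = stack.pop()
        if not group or not group_flips(group):
            continue  # prune: no agonist in this group
        if len(group) == 1:
            found.append(group[0])
            continue
        mid = len(group) // 2
        stack.append(group[mid:])
        stack.append(group[:mid])
    return sorted(found)


# Hypothetical simulation: 64 candidate activations, 3 of them agonists.
TRUE_AGONISTS = {3, 40, 41}
calls = {"n": 0}


def simulated_ablation(group: List[int]) -> bool:
    calls["n"] += 1  # count interventions
    return any(i in TRUE_AGONISTS for i in group)


result = find_agonists(range(64), simulated_ablation)
```

Negative groups are discarded after a single intervention, so the work concentrates on the few branches that contain an agonist. That is why the total stays well below one intervention per candidate when agonists are sparse.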

2605.03226 2026-06-09 cs.LG cs.AI cs.CR 版本更新

Self-Mined Hardness for Safety Fine-Tuning

自我挖掘的难度用于安全微调

Prakhar Gupta, Garv Shah, Donghua Zhang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

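AI Summary Proposes scoring prompt difficulty from the model's own rollouts and safety fine-tuning on the hardest prompts; on Llama-3 models this cuts attack success rate to 1-3% but raises the refusal rate on benign prompts, a trade-off that mixing in benign prompts can rebalance.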

详情
AI中文摘要

语言模型的安全微调通常需要一个精心策划的对抗性数据集。我们采取不同的方法:通过目标模型自身生成结果被判定为有害的频率来评分每个候选提示的难度,然后在最难的提示上使用模型自身的非越狱生成结果进行微调。在Llama-3-8B-Instruct和Llama-3.2-3B-Instruct上,该方法将WildJailbreak攻击成功率从11.5%和20.1%降至1-3%,但将越狱形式良性提示的拒绝率从14-22%提升至74-94%。将相同的困难提示与对抗性框架的良性提示(看起来像越狱但意图良性的提示)以1:1的比例交错,可将8B模型的拒绝率降至30-51%,3B模型降至52-72%,但攻击成功率增加2-6个百分点。在混合模式下,使用合格池中最难的一半而非随机一半进行训练,可将两个模型的剩余ASR降低35-50%(约3个百分点)。

英文摘要

Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1% down to 1-3%, but pushes refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%. Interleaving the same hard prompts 1:1 with adversarially-framed benign prompts (prompts that look like jailbreaks but have benign intent) cuts that refusal back down to 30-51% on 8B and 52-72% on 3B, at a cost of 2-6 percentage points of attack success rate. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts the remaining ASR by 35-50% (about 3 percentage points) on both models.
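The selection step described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `judge_harmful` is a hypothetical stand-in for whatever harmfulness judge is used, a prompt is treated as eligible only if at least one of its rollouts was judged non-harmful, and the first safe rollout is used as the training target. None of these details are confirmed by the source.

```python
from typing import Callable, Dict, List, Tuple


def select_training_pairs(rollouts: Dict[str, List[str]],
                          judge_harmful: Callable[[str, str], bool],
                          top_fraction: float = 0.5) -> List[Tuple[str, str]]:
    """Score each prompt by the fraction of its rollouts judged harmful,
    then keep the hardest eligible prompts paired with a safe rollout."""
    scored = []
    for prompt, responses in rollouts.items():
        flags = [judge_harmful(prompt, r) for r in responses]
        hardness = sum(flags) / len(flags)
        safe = [r for r, bad in zip(responses, flags) if not bad]
        if safe:  # assumption: need at least one non-jailbroken rollout
            scored.append((hardness, prompt, safe))
    scored.sort(key=lambda item: item[0], reverse=True)  # hardest first
    keep = max(1, int(len(scored) * top_fraction))
    return [(prompt, safe[0]) for _, prompt, safe in scored[:keep]]
```

Because the targets are the model's own safe rollouts, no curated adversarial dataset is needed; the cost is that prompts on which every rollout is jailbroken cannot be used under this eligibility assumption.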

2605.15416 2026-06-09 cs.LG cs.AI 版本更新

Margin-Adaptive Confidence Ranking for Reliable LLM Judgement

边际自适应置信度排名用于可靠的LLM判断

Gaojie Jin, Yong Tao, Lijia Yu, Tianjin Huang

发表机构 * Department of Computer Science, University of Exeter(埃克塞特大学计算机科学系) Institute of AI for Industries, Chinese Academy of Sciences(中国科学院工业人工智能研究所) Department of Mathematics and Computer Science, Eindhoven University of Technology(埃因霍温理工大学数学与计算机科学系)

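AI Summary Proposes a margin-based confidence ranking method that learns a dedicated confidence estimator to improve LLM agreement with human judgments; using simulated annotator diversity and a margin ranking formulation, it explicitly models how confidently an LLM separates human-agreement from human-disagreement cases, and derives generalization guarantees.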

Comments Accepted to ICML 2026

详情
AI中文摘要

Jung等人(2025)提出了一种假设检验框架,以确保大型语言模型(LLMs)与人类判断之间的一致性,基于模型估计的置信度与人类不一致风险之间单调性的假设。然而,在实践中,这一假设可能被违反,且置信度估计器的泛化行为未被显式分析。我们通过学习专用置信度估计器而非依赖启发式置信信号来缓解这些问题。我们的方法利用模拟标注者多样性和基于边际的排名公式,显式建模LLM区分人类一致与不一致案例的置信度。我们进一步推导出该估计器的泛化保证,揭示出一个与边际相关的权衡,从而指导适应性估计器训练过程的设计。当集成到固定序列测试中时,所学的置信度估计器提高了排名准确性,并在多个数据集和判断模型上实现了更高的成功率,以满足目标一致性水平。

英文摘要

Jung et al. (2025) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgments, relying on the assumption that the model's estimated confidence is monotonic with respect to human-disagreement risk. In practice, however, this assumption may be violated, and the generalization behavior of the confidence estimator is not explicitly analyzed. We mitigate these issues by learning a dedicated confidence estimator instead of relying on heuristic confidence signals. Our approach leverages simulated annotator diversity and a margin-based ranking formulation to explicitly model how confidently an LLM distinguishes between human-agreement and human-disagreement cases. We further derive generalization guarantees for this estimator, revealing a margin-dependent trade-off that informs the design of an adaptive estimator training procedure. When integrated into fixed-sequence testing, the learned confidence estimator yields improved ranking accuracy and empirically strengthens the monotonic relationship between confidence and disagreement risk, leading to higher success rates in satisfying target agreement levels across multiple datasets and judge models.
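A margin-based ranking formulation of the kind the abstract mentions can be illustrated with a pairwise hinge loss. The Python sketch below uses a fixed margin and plain lists of confidence scores; the paper's adaptive margin and its estimator architecture are not modeled, and `margin_ranking_loss` is a hypothetical name.

```python
from typing import List


def margin_ranking_loss(agree_scores: List[float],
                        disagree_scores: List[float],
                        margin: float) -> float:
    """Mean pairwise hinge loss: every human-agreement case should score
    at least `margin` higher than every human-disagreement case."""
    losses = [max(0.0, margin - (a - d))
              for a in agree_scores
              for d in disagree_scores]
    return sum(losses) / len(losses)
```

A loss of zero means the confidence scores rank all agreement cases above all disagreement cases with the required gap, which is the monotonic relationship that threshold-based testing relies on.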

2605.20341 2026-06-09 cs.LG cs.AI cs.CR cs.PF 版本更新

Causal Unlearning in Collaborative Optimization: Exact and Approximate Influence Reversal under Adversarial Contributions

协同优化中的因果卸载:在对抗性贡献下的精确和近似影响反转

Ali Mahdavi, Azadeh Zamanifar, Amirfarhad Farhadi, Omid Kashefi

发表机构 * Department of Computer Engineering, SRC, Islamic Azad University, Tehran, Iran(伊朗德黑兰伊斯兰阿扎德大学SRC计算机工程系) School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran(伊朗德黑兰伊朗科学技术大学计算机工程学院) Meta, CA, USA(美国加利福尼亚州Meta公司)

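AI Summary Proposes HF-KCU, which approximates the influence function with conjugate gradient iterations in Krylov subspaces to delete data in collaborative optimization, reducing complexity from O(d^3) to O(kd) while matching the retrained model under membership inference attacks.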

详情
AI中文摘要

联邦学习系统必须支持数据删除请求以符合隐私法规,但每次删除后重新训练是计算上不可行的。我们提出了HF-KCU方法,通过在Krylov子空间中进行共轭梯度迭代近似影响函数,将复杂度从O(d^3)降低到O(kd),其中k<<d。因果加权机制确保只有持有删除数据的客户端接收参数更新,防止对未受影响的客户端造成虚假变化。我们的方法设计用于处理有界对抗性扰动的Hessian和梯度,提供在现实威胁模型下的优雅退化。我们在卷积(ResNet-18,SimpleCNN)和Transformer(ViT-Lite)架构上CIFAR-10、MNIST和Fashion-MNIST数据集上验证了HF-KCU。在CIFAR-10的Dirichlet(alpha=0.5)划分下,HF-KCU在重新训练的基础上实现了47.75倍的速度提升,同时保持测试准确率在0.60%以内(71.16 vs 71.76%)。对遗忘集的成员推断攻击的成功率达到了0.499,与重新训练模型匹配,证实了有效的隐私恢复。我们提供了收敛保证,显示Krylov近似误差随着O((k^{1/2}-1)/(k^{1/2}+1))递减,其中k是Hessian条件数。因果加权机制确保了手术更新,只有持有删除数据的客户端被修改,保护了未受影响参与者的模型质量,并避免了异步联邦设置中梯度方法的不稳定性。该设计提供了可解释性,因为每个更新都可以直接追溯到删除数据的影响。该方法的效率和精度使其适用于生产联邦系统,其中删除请求异步到达且计算预算受限。

英文摘要

Federated learning systems must support data deletion requests to comply with privacy regulations, yet retraining from scratch after each deletion is computationally prohibitive. We present HF-KCU, a method that removes a client's contribution by approximating the influence function through conjugate gradient iterations in Krylov subspaces, reducing complexity from O(d^3) to O(kd) where k << d. A causal weighting mechanism ensures that only clients holding the deleted data receive parameter updates, preventing spurious changes to unaffected clients. Our method is designed to handle bounded adversarial perturbations to the Hessian and gradient, providing graceful degradation under realistic threat models. We validate HF-KCU across convolutional (ResNet-18, SimpleCNN) and transformer (ViT-Lite) architectures on CIFAR-10, MNIST, and Fashion-MNIST. On CIFAR-10 under Dirichlet (alpha=0.5) partitioning, HF-KCU achieves a 47.75x speedup over retraining while maintaining test accuracy within 0.60% of the retrained baseline (71.16% vs 71.76%). Membership inference attacks on the forget set yield success rates of 0.499, matching the retrained model and confirming effective privacy restoration. We provide convergence guarantees showing that the Krylov approximation error decreases as O((kappa^{1/2}-1)/(kappa^{1/2}+1)), where kappa is the Hessian condition number. The causal weighting mechanism ensures surgical updates, where only clients holding deleted data are modified, preserving model quality for unaffected participants and avoiding the instability of gradient-based approaches in asynchronous federated settings. This design provides interpretability as each update is directly traceable to the influence of the deleted data. The method's efficiency and precision make it suitable for production federated systems where deletion requests arrive asynchronously and computational budgets are constrained.
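The complexity reduction claimed in the abstract rests on solving the influence-function linear system with conjugate gradient instead of inverting the Hessian. The Python sketch below shows that generic building block only: it solves H x = g using nothing but Hessian-vector products. It is not HF-KCU itself (the causal weighting and the robustness to adversarial perturbations are not modeled), and `example_hvp` with its 2x2 matrix is a hypothetical stand-in for a real model's Hessian-vector product.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(hvp: Callable[[Vector], Vector], g: Vector,
                       iters: int, tol: float = 1e-20) -> Vector:
    """Approximately solve H x = g with at most `iters` Hessian-vector
    products, assuming H is symmetric positive definite."""
    x = [0.0] * len(g)
    r = list(g)            # residual g - H x, with x = 0 initially
    p = list(r)            # current search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs <= tol:      # squared residual norm is small enough
            break
        hp = hvp(p)
        alpha = rs / dot(p, hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def example_hvp(v: Vector) -> Vector:
    """Hypothetical Hessian [[4, 1], [1, 3]] applied to v."""
    return [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]


solution = conjugate_gradient(example_hvp, [1.0, 2.0], iters=5)
```

Each iteration needs one Hessian-vector product and a handful of vector operations, so k iterations cost on the order of k times the parameter dimension when the product itself is linear in the dimension. The matrix is never formed or inverted, which is where the saving over the cubic cost comes from.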

2606.00827 2026-06-09 cs.LG cs.AI 版本更新

Beyond Independent Manipulation: Individual Fairness-aware Strategic Classification with Peer Imitation

超越独立操纵:具有同伴模仿的个体公平感知策略分类

Xinpeng Lv, Chunyuan Zheng, Yunxin Mao, Renzhe Xu, Jinxuan Yang, Yuanlong Chen, Wangrong Huang, Shaowu Yang, Wenjing Yang, Xinwang Liu, Peng Cui, Haotian Wang

发表机构 * College of Computer Science and Technology, National University of Defense Technology(国防科技大学计算机科学与技术学院) School of Mathematical Sciences, Peking University(北京大学数学学院) Institute for Theoretical Computer Science, Shanghai University of Finance and Economics(上海财经大学理论计算机科学研究所) Information Technology Development, Aetos Capital Group, Sydney(悉尼Aetos资本集团信息技术部) Faculty of Computing, Harbin Institute of Technology(哈尔滨工业大学计算机学院) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

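AI Summary Proposes the individual fairness-aware strategic classification (IFSC) framework, which models peer-driven manipulation arising from individual fairness (imitating nearby accepted peers) and uses a robust learning process to handle uncertainty in peer observability, improving individual-fairness consistency and mitigating imitation-induced distortions.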

Comments Accepted by SIGKDD2026

详情
AI中文摘要

策略分类(SC)研究智能体操纵其特征以从预测模型获得有利决策的场景。现有的公平感知SC方法主要关注群体公平,并通常假设智能体独立响应。然而,当需要个体公平时,确保相似个体获得相似结果,智能体的操纵变得相互依赖:一个智能体偏好的操纵取决于邻域的结果。这导致了经典SC公式与公平感知决策设置之间的不匹配,其中独立模型不再准确刻画策略操纵。为解决此问题,我们引入了个体公平感知策略分类(IFSC),这是一个框架,对由个体公平引起的同伴驱动操纵进行建模,其中智能体模仿附近被积极决策的同伴以获得有利结果。IFSC将策略操纵刻画为对可见被接受同伴的基于相似性的模仿,并在由此产生的操纵后分布下学习分类器。为了考虑同伴可观测性的不确定性,IFSC采用鲁棒学习过程,在操纵模拟期间引入随机扰动。在合成和真实数据集上的实验表明,IFSC改善了个体公平一致性并减轻了模仿引起的扭曲。

英文摘要

Strategic classification (SC) investigates scenarios where agents manipulate their features to obtain favorable decisions from predictive models. Existing fairness-aware SC approaches primarily focus on group fairness and typically assume that agents respond independently. However, when individual fairness is required, ensuring that similar individuals receive similar outcomes, agents' manipulations become interdependent: an agent's preferred manipulation depends on the outcomes in its neighborhood. This induces a mismatch between classical SC formulations and fairness-aware decision settings, where independent models no longer accurately characterize strategic manipulations. To address this issue, we introduce individual fairness-aware strategic classification (IFSC), a framework that models peer-driven manipulation arising from individual fairness, where agents imitate nearby peers who received positive decisions to obtain favorable outcomes. IFSC characterizes strategic manipulation as similarity-based imitation toward visible accepted peers and learns classifiers under the resulting post-manipulation distributions. To account for uncertainty in peer observability, IFSC employs a robust learning process that introduces stochastic perturbations during manipulation simulation. Experiments on synthetic and real-world datasets demonstrate that IFSC improves individual-fairness consistency and mitigates imitation-induced distortions.

2606.01567 2026-06-09 cs.CR cs.AI cs.CL 版本更新

Defenses & Enablers For Skill Injection Attacks on Terminal Based Agents

针对终端代理的技能注入攻击的防御与使能因素

Yoshinari Fujinuma, Varun Gangal, Traian Rebedea, Makesh Narsimhan Sreedhar, Prasoon Varshney, Rebecca Qian, Anand Kannappan

发表机构 * Patronus AI NVIDIA

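AI Summary Studies the security threat that LLM-based agents face when reusing skills; proposes guardian defenses (dynamic and static) that cut attack success rate by more than half, and stress-tests their robustness against attack reframing.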

Comments First version, small updates and clarifications likely in v2

详情
AI中文摘要

大型语言模型(LLM)代理越来越依赖可重用的技能,即描述任务特定程序的文档。然而,这为代理管理引入了新的攻击面。我们针对这一威胁研究了两个互补方向。首先,我们评估了基于守护者的防御:一个中间LLM代理,作为技能文件访问的调解者(动态守护者)或在构建时预先重写这些文件(静态守护者)。在三个LLM代理家族中,我们的守护者将攻击成功率(ASR)降低了一半以上,同时保持了任务效用。其次,我们通过攻击重述对其进行压力测试,使用了四种保留恶意指令但改变措辞的攻击。对于非守护者设置,重述将ASR推高至81.4%,但动态守护者将其降至18.6%,表明实时调解是一种稳健的防御。

英文摘要

Large language model (LLM) agents increasingly rely on reusable skills, i.e., documents describing task-specific procedures. However, this introduces a new attack surface for agents to manage. We study two complementary directions for this threat. First, we evaluate guardian-based defenses: an intermediary LLM agent that acts as a mediator for skill file access (dynamic guardian) or pre-rewrites these files at build time (static guardian). Across three LLM agent families, our guardians cut attack success rate (ASR) by well over half while preserving task utility. Second, we stress-test them through attack reframing, using four attacks that preserve the malicious instruction but change the phrasing. In the non-guardian setup, reframing pushes the ASR up to 81.4%, but the dynamic guardian brings it down to 18.6%, showing that real-time mediation is a robust defense.

2606.07379 2026-06-09 cs.LG cs.AI cs.CL stat.ME 版本更新

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

编码智能体会欺骗我们吗?通过带随机测试的上限评估检测和防止作弊

Thanawat Lodkaew, Johannes Ackermann, Soichiro Nishimori, Nontawat Charoenphakdee, Masashi Sugiyama, Takashi Ishida

发表机构 * The University of Tokyo(东京大学) RIKEN(理化学研究所)

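AI Summary Proposes CapCode, a framework that detects cheating on coding tasks through capped evaluation with randomized tests, and CapReward, a reward design that discourages cheating; experiments show the approach detects and reduces cheating.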

详情
AI中文摘要

在智能体评估和训练中,一个日益增长的失败模式是模型可以通过利用捷径而非解决预期任务来获得高评估分数,产生欺骗性表现。这使得评估分数作为真实任务解决能力的度量不可靠。我们提出CapCode,一个构建带有随机测试的编码数据集的框架,其最佳可达的非作弊性能被故意限制在1以下。这种上限性能设计赋予评估分数更清晰的解释:显著高于上限的分数是不可信的,因此提供了作弊的证据。为了防止作弊,我们提出CapReward,一种基于CapCode原则的奖励设计,以抑制超出上限的优化。跨多个数据集的实验表明,CapCode能够检测作弊同时保持模型的性能排名,CapReward减少了作弊行为,产生了更好地遵循预期任务规范的模型。

英文摘要

A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests whose best achievable non-cheating performance is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and therefore provide evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle to discourage optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving performance ranking of models, and CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
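The interpretation rule in the abstract, that scores well above the cap are implausible, can be made concrete with a small statistical check in Python. This is one possible framing and not the paper's procedure: it assumes each randomized test is passed independently with probability at most the cap by any non-cheating solver, and it flags a result when the binomial tail probability falls below a chosen threshold. The function names and the 0.01 threshold are hypothetical.

```python
from math import comb


def tail_probability(n_tests: int, passes: int, cap: float) -> float:
    """P(X >= passes) for X ~ Binomial(n_tests, cap)."""
    return sum(comb(n_tests, k) * cap ** k * (1.0 - cap) ** (n_tests - k)
               for k in range(passes, n_tests + 1))


def looks_like_cheating(n_tests: int, passes: int, cap: float,
                        alpha: float = 0.01) -> bool:
    """Flag a score that an honest solver capped at `cap` would reach
    with probability below `alpha`."""
    return tail_probability(n_tests, passes, cap) < alpha
```

For example, with a cap of 0.5 over 100 tests, a perfect score is flagged while 52 passes is not. An uncapped benchmark offers no such signal, because a perfect score is equally consistent with a correct solution and with an exploited shortcut.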

9. Evaluation, Benchmarks, and Datasets 115 papers

2606.07718 2026-06-09 cs.AI cs.CV cs.LG 新提交

A case study of evaluating AI agents on a neuroscience data-to-discovery pipeline

评估AI代理在神经科学数据到发现流程中的案例研究

Kai A. Horstmann, Ethan Lin, Alice A. Robie, Jennifer J. Sun, Kristin Branson

发表机构 * Cornell University(康奈尔大学) HHMI Janelia Research Campus(霍华德·休斯医学研究所贾雷尔研究园区)

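AI summary Evaluates general-purpose coding agents on a fly optogenetics data-to-discovery pipeline and finds that agents can solve individual stages but the end-to-end pipeline remains beyond them; the main challenges are the lack of pre-defined criteria to iterate on and the need for scientific judgment.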

详情
AI中文摘要

代理型AI工具为自动化科学研究流程中的软件开发瓶颈提供了有希望的路径,特别是对于那些需要领域专家花费数天到数月构建的阶段,科学家关心的是正确性和鲁棒性,而非实现细节。我们针对果蝇光遗传学数据到发现流程,对通用编码代理进行了实证研究。我们在比现有基准大得多的任务、数量级更大的数据集以及基于领域专家标准的评估标准上评估代理。我们表明,代理可以解决几个单独的流程阶段,这表明阶段级自动化是可行的。通过分析代理的代码迭代,我们发现当没有预定义的标准可供迭代时,它们最困难,此时它们必须利用自己的科学判断来评估当前解决方案,这是一个关键开放挑战。与科学实践相呼应,它们有时尝试对中间输出进行视觉检查以进行自我评估,但大多未能正确解释所见或据此采取行动。正确解决端到端流程需要将所有流程阶段的成功串联起来,这超出了代理当前的能力。我们识别出现有基准中基本缺失的挑战,包括计算资源管理和对大型保留数据集的泛化。最后,我们提炼出构建科学任务和针对开放问题的严格评估标准的原则。

英文摘要

Agentic AI tools offer a promising path to automating software development bottlenecks in scientific research pipelines, particularly for stages that take domain experts days to months to build, where scientists care about correctness and robustness, not implementation details. We present an empirical study of general-purpose coding agents on a fly optogenetics data-to-discovery pipeline. We assess agents on tasks substantially larger than existing benchmarks, datasets orders of magnitude bigger, and evaluation criteria grounded in domain expert standards. We show that agents can solve several individual pipeline stages, suggesting stage-level automation is tractable. By analyzing agents' code iterations, we show that they struggle most when there is not a pre-defined criterion to iterate on, and they must instead use their scientific judgment to assess their current solution, a key open challenge. Mirroring scientific practice, they sometimes attempt visual inspection of intermediate outputs for self-evaluation, but largely fail to interpret what they see or act on it appropriately. Solving the end-to-end pipeline correctly requires stringing together successes across all pipeline stages, and this is beyond agents' current abilities. We identify challenges largely absent from existing benchmarks, including computational resource management and generalization to large held-out data collections. Finally, we distill principles for constructing scientific tasks and rigorous evaluation criteria for open-ended problems.

2606.07805 2026-06-09 cs.AI cs.MA 新提交

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

超越古德哈特定律:多智能体系统中合规性评估的动态基准

Yiyang Zhao, Zhuo Zhang, Qingxuan Le, Lizhen Qu, Zenglin Xu

发表机构 * Fudan University(复旦大学) Shanghai Academy of AI for Science(上海人工智能科学研究院) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Monash University(莫纳什大学)

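AI summary To address multi-agent systems violating safety rules under pressure, proposes MAC-Bench, a dynamic adversarial benchmark that generates contamination-free scenarios through the SERV pipeline and introduces the CSR and MG metrics to evaluate the compliance of frontier models.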

详情
AI中文摘要

大型语言模型(LLMs)从被动助手向自主、可执行智能体的快速演进引入了关键的操作风险。当前大多数评估框架忽视了程序合规性,导致“马基雅维利”行为——智能体为最大化奖励而策略性地违反安全规则,这是古德哈特定律的直接体现。为解决这一盲点,我们提出MAC-Bench,一个动态对抗基准,旨在评估多智能体系统在现实压力下的程序对齐。我们提出了SERV(种子-进化-精炼-验证)流水线,一种“智能体即基准”范式,将非结构化法律文本转化为可执行、无污染的场景。通过合成全息沙盒环境并注入校准的社会工程压力向量,MAC-Bench迫使智能体在任务成功与监管遵守之间进行帕累托最优权衡。我们引入了新指标:合规加权成功率(CSR)和马基雅维利差距(MG),并对最先进的前沿模型进行了全面评估,揭示了成功与合规之间的普遍权衡。

英文摘要

The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading to "Machiavellian" behaviors where agents strategically violate safety rules to maximize rewards, a direct manifestation of Goodhart's Law. To address this blind spot, we introduce MAC-Bench, a dynamic, adversarial benchmark designed to evaluate the procedural alignment of multi-agent systems under realistic pressure. We propose the SERV (Seed - Evolve - Refine - Verify) pipeline, an "Agent-as-a-Benchmark" paradigm that transforms unstructured legal texts into executable, contamination-free scenarios. By synthesizing holographic sandbox environments and injecting calibrated social-engineering pressure vectors, MAC-Bench forces agents into Pareto-optimal trade-offs between task success and regulatory adherence. We introduce two novel metrics, the Compliance-Weighted Success Rate (CSR) and the Machiavellian Gap (MG), and conduct a comprehensive evaluation of state-of-the-art frontier models that reveals pervasive trade-offs between success and compliance.

2606.07916 2026-06-09 cs.AI 新提交

The CIFAR Synthetic Evidence Corpus for Detecting AI-Generated Evidence

CIFAR合成证据语料库:用于检测AI生成证据

Kelly McConvey, Jalehsadat Mahdavimoghaddam, Nima Jamali, Maksym Taranukhin, Sajad Ebrahimi, Wentao Zhang, Yuntian Deng, Karen Eltis, Maura R. Grossman, Vered Shwartz, Ebrahim Bagheri

发表机构 * University of Toronto(多伦多大学) University of Waterloo(滑铁卢大学) University of British Columbia(不列颠哥伦比亚大学) Vector Institute(向量研究所) University of Ottawa(渥太华大学)

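AI summary To address the lack of suitable datasets for detecting evidence authenticity in the justice system, builds the CIFAR Synthetic Evidence Corpus, covering multiple document types and manipulation strategies, to support evaluation of evidence verification systems under controlled conditions.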

详情
AI中文摘要

生成模型生成逼真文档的能力日益增强,这对司法系统和法院中的证据工作流程构成了直接挑战,因为决策越来越依赖于收据、通信和行政记录等证据的真实性。与社交媒体或学术环境不同,证据性文档通常仅被微妙地修改,通过局部编辑保持整体合理性,同时改变法律含义。然而,自动检测的进展仍然有限,主要原因是缺乏适合司法系统要求的训练和评估数据。现有资源要么专注于人脸照片或自然风景,要么局限于狭窄的学术或社交媒体文档类型,未能捕捉真实世界证据数据的结构、多样性或篡改模式。因此,当前的检测系统不一定能学习到适合司法系统的有意义信号。我们引入了CIFAR合成证据语料库,这是一个旨在在现实和受控条件下严格评估证据验证的数据集。该语料库涵盖多个文档家族和一系列篡改策略,从小规模字段级编辑到完整文档伪造,并使用多种最先进的生成工具构建。其组织方式系统性地变化篡改复杂性和生成方法,同时在训练和测试数据之间强制源级分离,以反映现实世界的泛化挑战。

英文摘要

The growing ability of generative models to produce realistic documents poses a direct challenge to evidentiary workflows in the justice system and the courts, where decisions increasingly depend on the authenticity of evidence such as receipts, communications, and administrative records. Unlike social media or academic settings, evidentiary documents are often only subtly altered, with small, localized edits that preserve overall plausibility while changing legal meaning. Yet progress on automated detection remains limited, largely due to the absence of training and evaluation data suited to the requirements of the justice system. Existing resources either focus on photos of human faces or natural scenery, or cover narrowly scoped academic or social media document types, and do not capture the structure, diversity, or manipulation patterns characteristic of real-world evidentiary data. As a result, current detection systems do not necessarily learn meaningful signals appropriate for the justice system. We introduce the CIFAR Synthetic Evidence Corpus, a dataset designed to enable rigorous evaluation of evidence verification under realistic and controlled conditions. The corpus spans multiple document families and a spectrum of manipulation strategies, from small field-level edits to complete document fabrication, and is constructed using a diverse set of state-of-the-art generative tools. It is organized to systematically vary both manipulation complexity and generation method, while enforcing source-level separation between training and test data to reflect real-world generalization challenges.
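Editorial note: "source-level separation" means that every document derived from a given source lands entirely in train or entirely in test. The sketch below shows one generic way to build such a split; it is not the corpus authors' procedure, and the field names `doc_id` and `source_id` are hypothetical.

```python
import random
from collections import defaultdict


def source_level_split(docs, test_fraction=0.2, seed=0):
    """Split documents so that no source appears in both train and test.

    `docs` is a list of (doc_id, source_id) pairs.
    """
    by_source = defaultdict(list)
    for doc_id, source_id in docs:
        by_source[source_id].append(doc_id)

    # Shuffle sources, not documents, so whole groups move together.
    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)
    n_test = max(1, round(len(sources) * test_fraction))
    test_sources = set(sources[:n_test])

    train = [d for s in sources if s not in test_sources for d in by_source[s]]
    test = [d for s in sources if s in test_sources for d in by_source[s]]
    return train, test, test_sources


# Toy data: 20 documents drawn from 5 sources.
docs = [(f"doc{i}", f"src{i % 5}") for i in range(20)]
train, test, test_sources = source_level_split(docs)
print(len(train), len(test), sorted(test_sources))
```

Splitting at the document level instead would let manipulated variants of the same original leak across the boundary, which tends to inflate detection scores.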

2606.07953 2026-06-09 cs.AI 新提交

Unification of Closed-Open Industrial Detection Scenarios: New Large-Scale Benchmarks, Challenges and Baselines

闭集与开集工业检测场景的统一:新的大规模基准、挑战与基线

Zekai Zhang, Jinglin Zhang, Qinghui Chen, Gang Li, Da Chen, Shuainan Jing, He Wang, Dagang Li, Cong Liu, Cong Bai, Shengyong Chen

发表机构 * Shandong University(山东大学) University Paris Dauphine, PSL Research University, CNRS, UMR 7534(巴黎多芬纳大学,PSL研究大学,法国国家科学研究中心,UMR 7534) Qilu University of Technology(齐鲁工业大学) Zhejiang University of Technology(浙江工业大学) Nova University of Lisbon(里斯本新大学) Macau University of Science and Technology(澳门科技大学) Tianjin University of Technology(天津理工大学)

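AI summary To address dataset scarcity and dependence on manual prompts in industrial defect detection, proposes the million-sample MMIOC-1M benchmark and the RTVPNet network, which achieves state-of-the-art performance through expert-assisted domain projection, energy-based sparse sampling, and bidirectional text-visual interaction.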

详情
AI中文摘要

大规模视觉语言模型(LVLMs)在自然视觉任务中取得了显著成功,但其在工业缺陷检测中的应用仍面临两个基本限制:(i)缺乏覆盖多个领域不同缺陷类别的大规模工业数据集,以及(ii)依赖人工提示(点、框、掩码)引入主观噪声,且缺乏用于细粒度理解的文本-视觉交互。为解决这些挑战,我们引入了一个大规模多模态工业开闭集基准(MMIOC-1M),包含超过一百万个样本,涵盖14个超类、29个工业场景和351个缺陷子类。据我们所知,MMIOC-1M是首个同时支持开放词汇和闭集工业检测的统一最大基准,为工业场景中的LVLMs提供了宝贵的预训练数据。此外,我们提出了一种精炼的文本-视觉提示网络(RTVPNet),包含三个关键创新:(1)专家辅助域投影机制,使通用视觉模型能够快速适应工业领域;(2)基于能量的稀疏采样策略,无需人工干预即可自动生成精炼的视觉提示;(3)双向文本-视觉交互模块,增强跨模态语义对齐和理解。大量实验表明,RTVPNet在MMIOC-1M、LVIS和COCO基准上实现了最先进的性能,同时保持了计算效率。数据集和代码可在https://github.com/hellozzk/MMIO获取。

英文摘要

Large-scale Visual-Language Models (LVLMs) have achieved remarkable success in natural visual tasks, yet their application to industrial defect detection remains challenging due to two fundamental limitations: (i) the scarcity of large-scale industrial datasets that cover diverse defect categories across multiple domains, and (ii) the reliance on manual prompts (points, boxes, masks) that introduce subjective noise and lack text-visual interaction for fine-grained understanding. To address these challenges, we introduce a Large-Scale Multi-Modal Industrial Open-Closed benchmark (MMIOC-1M) containing over one million samples across 14 super-categories, 29 industrial scenes, and 351 defect subcategories. To our knowledge, MMIOC-1M is the first and largest unified benchmark supporting both open-vocabulary and closed-set industrial detection, providing valuable pre-training data for LVLMs in industrial scenarios. Furthermore, we propose a Refined Text-Visual Prompt Network (RTVPNet) that incorporates three key innovations: (1) an expert-assisted domain projection mechanism that enables rapid adaptation of general vision models to industrial domains, (2) an energy-based sparse sampling strategy that automatically generates refined visual prompts without manual intervention, and (3) a bidirectional text-visual interaction module that enhances cross-modal semantic alignment and understanding. Extensive experiments demonstrate that RTVPNet achieves state-of-the-art performance on the MMIOC-1M, LVIS, and COCO benchmarks while maintaining computational efficiency. The dataset and code are available at https://github.com/hellozzk/MMIO.

2606.07965 2026-06-09 cs.AI 新提交

Zero-Shot Learning in Industrial Scenarios: New Large-Scale Benchmark, Challenges and Baseline

工业场景中的零样本学习:新的大规模基准、挑战与基线

Zekai Zhang, Qinghui Chen, Maomao Xiong, Shijiao Ding, Zhanzhi Su, Xinjie Yao, Yiming Sun, Cong Bai, Jinglin Zhang

发表机构 * Zhejiang University(浙江大学) Alibaba Group(阿里巴巴集团) Huawei Technologies(华为技术有限公司) Tsinghua University(清华大学) Microsoft Research(微软研究院)

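AI summary To address the difficulty of applying vision-language models in industrial scenarios and the scarcity of data, proposes the large-scale Multi-Modal Industrial Open Dataset (MMIO) and the Refined Text-Visual Prompt (RTVP) for zero-shot industrial defect detection, reaching SOTA on MMIO.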

详情
AI中文摘要

大型视觉语言模型(LVLMs)在视觉任务中取得了显著成功。然而,工业场景与自然场景之间的巨大差异使得应用LVLMs具有挑战性。现有的LVLMs依赖用户提供的提示来分割目标,这常常由于包含不相关像素而导致性能次优。此外,数据的稀缺性也使得LVLMs在工业场景中的应用仍未得到探索。为填补这一空白,本文提出了一个开放的工业数据集和一个精炼文本-视觉提示(RTVP),用于零样本工业缺陷检测。首先,本文构建了包含80K+样本的多模态工业开放数据集(MMIO)。MMIO包含多样化的工业类别,包括6个超类和18个子类。MMIO是首个用于工业零样本学习的大规模多场景预训练数据集,并为未来工业场景中的开放模型提供了宝贵的训练数据。基于MMIO,本文提出了专门用于工业零样本任务的RTVP。RTVP有两个显著优势:第一,本文设计了一种专家引导的大模型领域自适应机制,并基于Mobile-SAM设计了工业零样本方法,增强了大模型在工业场景中的泛化能力。第二,RTVP直接从图像自动生成视觉提示,并考虑了先前LVLM忽略的文本-视觉提示交互,提高了视觉和文本内容的理解。在MMIO的零样本和封闭场景中,RTVP分别以42.2%和24.7%的AP达到了SOTA。

英文摘要

Large Visual Language Models (LVLMs) have achieved remarkable success in vision tasks. However, the significant differences between industrial and natural scenes make applying LVLMs challenging. Existing LVLMs rely on user-provided prompts to segment objects, which often leads to suboptimal performance because irrelevant pixels are included. In addition, the scarcity of data has left the application of LVLMs in industrial scenarios largely unexplored. To fill this gap, this paper proposes an open industrial dataset and a Refined Text-Visual Prompt (RTVP) for zero-shot industrial defect detection. First, this paper constructs the Multi-Modal Industrial Open Dataset (MMIO), containing 80K+ samples. MMIO covers diverse industrial categories, including 6 super-categories and 18 subcategories. MMIO is the first large-scale multi-scene pre-training dataset for industrial zero-shot learning and provides valuable training data for open models in future industrial scenarios. Based on MMIO, this paper provides an RTVP designed specifically for industrial zero-shot tasks. RTVP has two significant advantages. First, this paper designs an expert-guided domain adaptation mechanism for large models and an industrial zero-shot method based on Mobile-SAM, which enhances the generalization ability of large models in industrial scenarios. Second, RTVP automatically generates visual prompts directly from images and considers the text-visual prompt interactions ignored by previous LVLMs, improving the understanding of visual and textual content. RTVP achieves SOTA with 42.2% and 24.7% AP in the zero-shot and closed scenes of MMIO, respectively.

2606.08018 2026-06-09 cs.AI 新提交

UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL

UniQL:迈向方言通用的文本到SQL基准测试

Jianling Gao, Chongyang Tao, Jiayuan Bai, Liu Yang, Xuanguang Pan, Jinrui Liu, Shihao Xing, Xiaohan Xu, Jie Liang, Shuai Ma

发表机构 * SKLCCSE, Beihang University(北京航空航天大学软件开发环境国家重点实验室) The University of Hong Kong(香港大学)

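AI summary Proposes the UniQL benchmark, which uses aligned annotations across 16 SQL dialects to evaluate how well models generalize across database systems, revealing that current models fall short of dialect universality.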

详情
AI中文摘要

现有的文本到SQL基准测试主要集中在SQLite上,这使得评估模型能否跨异构SQL方言泛化变得困难。然而,现实世界的数据库系统在语法、函数、类型系统和执行语义上存在显著差异,因此相同的自然语言意图通常需要特定方言的SQL实现。我们引入了UniQL,一个用于跨方言文本到SQL评估的人工验证基准。UniQL将1,534个自然语言问题与16种SQL方言的可执行SQL注释对齐,产生了24,544个方言特定的查询。所有方言共享相同的意图、对齐的模式和数据库内容,从而实现了对方言泛化的可控评估。UniQL通过一个混合流水线构建,结合了数据库迁移、SQL翻译、执行引导验证、迭代规则总结和人工验证。在开源和闭源LLM上的实验表明,当前模型远未达到方言通用,在不同数据库系统间性能差异显著,且从SQLite成功到其他方言的迁移有限。这些发现凸显了对齐的跨方言基准和更注重方言的文本到SQL方法的必要性。代码和数据可在https://github.com/JerryGao818/UniQL获取。

英文摘要

Existing text-to-SQL benchmarks are largely centered on SQLite, making it difficult to evaluate whether models can generalize across heterogeneous SQL dialects. However, real-world database systems differ substantially in syntax, functions, type systems, and execution semantics, so the same natural language intent often requires dialect-specific SQL realizations. We introduce UniQL, a human-verified benchmark for cross-dialect text-to-SQL evaluation. UniQL aligns 1,534 natural language questions with executable SQL annotations across 16 SQL dialects, yielding 24,544 dialect-specific queries. All dialects share the same intents, aligned schemas and database contents, enabling controlled evaluation of dialect generalization. UniQL is constructed through a hybrid pipeline combining database migration, SQL translation, execution-guided verification, iterative rule summarization, and human validation. Experiments on both open-source and closed-source LLMs show that current models remain far from dialect-universal, with substantial performance variation across database systems and limited transfer from SQLite success to other dialects. These findings highlight the need for aligned cross-dialect benchmarks and more dialect-aware text-to-SQL methods. Code and data are available at https://github.com/JerryGao818/UniQL
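Editorial note: a small example of why the same intent needs dialect-specific SQL. The three query strings below express "the three highest-paid employees" using the row-limiting syntax of SQLite, SQL Server, and Oracle (12c and later). The table, data, and dictionary keys are made up for illustration and are not taken from UniQL; only the SQLite variant is executed, using Python's built-in `sqlite3` module.

```python
import sqlite3

# Same intent in three dialects; only the SQLite form is run below.
QUERIES = {
    "sqlite": "SELECT name FROM employees ORDER BY salary DESC LIMIT 3",
    "sqlserver": "SELECT TOP 3 name FROM employees ORDER BY salary DESC",
    "oracle": "SELECT name FROM employees ORDER BY salary DESC FETCH FIRST 3 ROWS ONLY",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 120), ("Bob", 95), ("Cy", 150), ("Dee", 110)],
)
top3 = [row[0] for row in conn.execute(QUERIES["sqlite"])]
print(top3)

# SQLite rejects the SQL Server form, which is the kind of dialect gap
# a cross-dialect benchmark is meant to expose.
try:
    conn.execute(QUERIES["sqlserver"])
    sqlserver_runs_on_sqlite = True
except sqlite3.OperationalError:
    sqlserver_runs_on_sqlite = False
print(sqlserver_runs_on_sqlite)

# Sanity check on the corpus size quoted in the abstract.
print(1534 * 16)
```

The last line confirms that 1,534 questions across 16 dialects gives the 24,544 queries reported.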

2606.08200 2026-06-09 cs.AI cs.LG 新提交

Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

在线智能体作为裁判:面向交互式智能体的情境生成评估

Hyogon Ryu, Jeonghwan Kim, Yewon Lim, Chaeun Lee, Jeongwook Kim, Donghoon Ham

发表机构 * KAIST(韩国科学技术院)

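AI summary Proposes the Online Agent-as-a-Judge framework, which deploys an in-world evaluator agent that actively generates relevant situations to assess interactive social agents, improving criteria coverage and agreement with human labels.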

Comments ICML 2026 Workshop on Trustworthy AI for Good

详情
AI中文摘要

评估基于LLM的交互式社交智能体具有挑战性,因为社交相关行为不仅取决于孤立输出,还取决于先前的交互、社会角色和后续行动。现有方法通常允许目标智能体在环境中自由行动,然后对生成的轨迹进行评分。然而,这种被动设置可能会遗漏仅在特定社交情境下才可观察到的能力;例如,如果没有出现分歧,冲突处理可能不会被测试。我们提出在线智能体作为裁判,一种面向交互式社交智能体的情境生成评估框架。在线智能体作为裁判部署一个环境内评估智能体,通过环境原生的对话和行动协议与目标智能体交互,主动引出与评估标准相关的情境。生成的轨迹为评估即时响应和后续行为提供了证据。在一个包含32个设计师编写的社会标准的生命模拟环境中,在线智能体作为裁判提高了标准覆盖率和与人类标签的一致性,为被动方法可能未观察到的行为提供了更可靠的基于证据的评估。

英文摘要

Evaluating LLM-powered interactive social agents is challenging because socially relevant behaviors depend not only on isolated outputs, but also on prior interactions, social roles, and downstream actions. Existing methods typically allow a target agent to act freely in an environment and then score the resulting trajectory. However, this passive setup can miss capabilities that only become observable under specific social circumstances; for example, conflict handling may remain untested if no disagreement arises. We propose Online Agent-as-a-Judge, a situation-generating evaluation framework for interactive social agents. Online Agent-as-a-Judge deploys an in-world evaluator agent that interacts with the target agent through the environment's native dialogue and action protocol, actively eliciting situations relevant to the evaluation criteria. The resulting trajectories provide evidence for assessing both immediate responses and subsequent behavior. In a life-simulation environment with 32 designer-authored social criteria, Online Agent-as-a-Judge improves criteria coverage and agreement with human labels, yielding more reliable evidence-grounded evaluations of behaviors that passive methods can leave unobserved.

2606.08239 2026-06-09 cs.AI cs.CL cs.CV 新提交

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

当没有正确答案时:诊断视频理解中多模态大语言模型的缺失答案检测

Yiheng Wang, Yueqian Lin, Lichen Zhu, Yudong Liu, Hai "Helen" Li, Yiran Chen

发表机构 * Duke University(杜克大学)

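AI summary Studies the ability of multimodal large language models to detect absent answers in video understanding and finds that models tend to choose distractors rather than recognize that no answer is correct; the problem is worse on temporal reasoning tasks, and chain-of-thought prompting raises detection rates but performance remains unsatisfactory.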

Comments Under review

详情
AI中文摘要

多模态大语言模型在视频理解方面取得了实质性进展,但其响应的可靠性仍未得到充分探索。本文对视频理解中多模态大语言模型的缺失答案检测进行了诊断研究,其中正确答案被故意排除在候选集之外,而一个可靠的模型应能识别出没有有效选项。我们在三种设置下评估缺失答案检测行为:带有“以上皆非”选项的多选题、带有检测指令的开放式生成,以及没有任何指导的标准评估。在多种模型和基准测试中,我们发现多模态大语言模型压倒性地选择合理的干扰项,而不是检测到缺失答案。这种失败在时间推理任务中更为明显,并且随着帧采样密度的增加而恶化。我们进一步探索了链式思维提示作为缓解策略,发现虽然它显著提高了检测率,但性能仍不令人满意,这表明仅基于提示的策略不足以完全解决这一局限性。这些发现揭示了缺失答案检测中的系统性失败,并强调了在多模态系统中需要明确的检测机制。

英文摘要

Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses remains underexplored. This work presents a diagnostic study of absent answer detection for MLLMs in video understanding, where the correct answer is deliberately excluded from the candidate set and a reliable model is expected to recognize that no valid option exists. We evaluate the absent answer detection behavior under three settings: multiple-choice questions augmented with a "None of the Above" option, open-ended generation with a detection instruction, and standard evaluation without any guidance. Across a diverse set of models and benchmarks, we find that MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure is more pronounced in temporal reasoning tasks and worsens with denser frame sampling. We further explore chain-of-thought prompting as a mitigation strategy and find that while it substantially improves detection rates, performance remains unsatisfactory, suggesting that prompting-based strategies alone are insufficient to fully address this limitation. These findings expose a systematic failure in absent answer detection and highlight the need for explicit detection mechanisms in multimodal systems.
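Editorial note: the first evaluation setting can be pictured as a simple transformation of an ordinary multiple-choice item. The sketch below removes the correct option, appends "None of the Above", and scores how often a model selects it. The helper names, the example question, and the model outputs are all hypothetical; the paper's actual construction may differ.

```python
import random

NOTA = "None of the Above"


def make_absent_answer_item(question, options, correct, rng):
    """Drop the correct option, shuffle the distractors, and append NOTA."""
    distractors = [o for o in options if o != correct]
    rng.shuffle(distractors)
    return {"question": question, "options": distractors + [NOTA], "answer": NOTA}


def detection_rate(items, predictions):
    """Fraction of items where the model picked NOTA."""
    hits = sum(pred == item["answer"] for item, pred in zip(items, predictions))
    return hits / len(items)


rng = random.Random(0)
item = make_absent_answer_item(
    "What does the person pick up after opening the door?",
    ["keys", "umbrella", "phone", "bag"],
    correct="umbrella",
    rng=rng,
)
items = [item] * 4
# Hypothetical model outputs: three distractor picks, one correct detection.
predictions = ["keys", "phone", NOTA, "bag"]
print(item["options"])
print(detection_rate(items, predictions))
```

A model that always picks a plausible distractor scores zero on this metric even if its accuracy on the unmodified items is high.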

2606.08285 2026-06-09 cs.AI cs.CE q-fin.CP q-fin.TR 新提交

Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM-Based Trading Systems

超越智能体架构:基于LLM的交易系统中的执行假设与可复现性

Junyi Yao, Zihao Zheng

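AI summary Audits 30 related studies and finds that execution assumptions are under-reported in LLM trading research, making results hard to compare; argues for reporting standards covering execution realism, reproducibility, and evaluation comparability.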

详情
AI中文摘要

大型语言模型(LLM)和智能体系统越来越多地被用于金融交易,但其报告的性能仍然难以比较,因为研究在数据来源、时间分割纪律、执行时机、周转处理和交易成本建模方面存在差异。本文对基于LLM的交易研究中的执行现实性进行了有针对性的主题回顾和可复现性审计。一个包含30项交易相关主要研究的编码证据矩阵用于评估时点控制、分割透明度、保留评估、成本和周转处理、执行语义、宇宙定义和工件发布。在审计样本中,架构报告通常比判断交易结果是否经济可解释或可复现所需的评估假设更清晰。一个包含10只股票的工作示例仅作为方法学框架,以说明明确的摩擦和时机选择如何实质性地压缩主动策略结果。主要结论是,LLM交易研究的下一步有用进展不仅是更好的智能体设计,还包括更清晰的执行现实性、可复现性和评估可比性的报告标准。

英文摘要

Large language models (LLMs) and agentic systems are increasingly proposed for financial trading, yet their reported performance remains difficult to compare because studies vary in data provenance, temporal split discipline, execution timing, turnover treatment, and transaction-cost modeling. This article presents a targeted topical review and reproducibility audit of execution realism in LLM-based trading research. A coded evidence matrix covering 30 trade-relevant primary studies is used to assess point-in-time controls, split transparency, held-out evaluation, cost and turnover treatment, execution semantics, universe definition, and artifact release. Across the audited sample, architecture reporting is generally clearer than the evaluation assumptions needed to judge whether a trading result is economically interpretable or reproducible. A 10-equity worked example is included only as a methodological scaffold to illustrate how explicit friction and timing choices can materially compress active-strategy results. The main conclusion is that the next useful step for LLM trading research is not only better agent design, but also clearer reporting standards for execution realism, reproducibility, and evaluation comparability.
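Editorial note: a toy backtest can show how execution timing and cost assumptions change a reported result. In the sketch below, a signal decided at the close of bar t is only held from bar t + lag, and a proportional cost is charged on turnover. The function, parameter values, and data are hypothetical and unrelated to the paper's 10-equity example.

```python
def net_returns(signals, asset_returns, cost_per_turnover=0.001, lag=1):
    """Strategy returns after an execution lag and proportional costs.

    signals[t] is the target position decided at the close of bar t;
    it only earns asset_returns from bar t + lag onward. Costs are
    charged on the absolute change in held position.
    """
    held_prev = 0.0
    out = []
    for t in range(len(asset_returns)):
        held = signals[t - lag] if t - lag >= 0 else 0.0
        turnover = abs(held - held_prev)
        out.append(held * asset_returns[t] - cost_per_turnover * turnover)
        held_prev = held
    return out


# Toy data: a signal that flips every bar and happens to match the
# same-bar return, so lag=0 amounts to look-ahead.
signals = [1.0, -1.0, 1.0, -1.0, 1.0]
asset_returns = [0.01, -0.01, 0.01, -0.01, 0.01]

frictionless = sum(net_returns(signals, asset_returns, cost_per_turnover=0.0, lag=0))
realistic = sum(net_returns(signals, asset_returns, cost_per_turnover=0.001, lag=1))
print(round(frictionless, 4), round(realistic, 4))
```

With no lag and no costs the toy strategy looks profitable; with a one-bar lag and a small cost it loses money, which is why these choices need to be reported alongside any headline return.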

2606.08340 2026-06-09 cs.AI cs.LG cs.MA 新提交

Benchmarking Open-Ended Multi-Agent Coordination in Language Agents

开放式多智能体协作在语言智能体中的基准测试

Kale-ab Abebe Tessera, Andras Szecsenyi, Cameron Barker, Alexander Rutherford, Davide Paglieri, Aidan Scannell, Henry Gouk, Elliot J. Crowley, Tim Rocktäschel, Amos Storkey

发表机构 * University of Edinburgh(爱丁堡大学) University of Oxford(牛津大学) University College London(伦敦大学学院)

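AI summary Proposes Alem, a JAX-based benchmark for open-ended multi-agent coordination, evaluates the zero-shot coordination ability of 13 modern LLMs in a long-horizon survival world, and finds that coordination is a distinct bottleneck for frontier LLM agents.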

Comments 42 pages, preprint

详情
AI中文摘要

随着语言模型越来越多地被部署为自主智能体,它们必须在开放式交互任务中与他人进行长期协调。然而,现有评估很少同时测试这些需求,而是强调单智能体任务、短交互或高度结构化的多智能体设置。我们提出了Alem,一个基于JAX的开放式多智能体协作基准,构建在类似Craftax的动态之上。Alem将程序生成的协调任务、软专业化、通信和可控制的协调难度嵌入到一个具有探索、制作、交易和战斗的长期生存世界中。我们在同质团队中零样本评估了13种现代LLM,并以训练好的MARL智能体作为参考点。当前的LLM智能体远未解决Alem,平均标准化回报仅约6%,但它们的失败并非均匀分布。在最难的协调设置下,零样本的Gemini-3.1-Pro-High接近训练了十亿步的MARL智能体,而GPT-5.4-High实现了强基础任务奖励但协调奖励低得多。这种对比表明,个体任务能力并不等同于协调能力。消融实验表明,通信是协调的最大贡献者,而记忆和推理在用于维护多步计划时有所帮助。总体而言,我们的结果将协调确定为前沿LLM智能体的一个独立瓶颈,与单智能体能力分开。Alem使这一瓶颈可测量,并为开发能够通信、分配角色和执行共享计划的智能体提供了一个受控测试平台。代码可在https://github.com/alem-world/alem-env获取。

英文摘要

As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising single-agent tasks, short interactions, or highly structured multi-agent settings. We introduce Alem, a JAX-based benchmark for open-ended multi-agent coordination built on Craftax-like dynamics. Alem embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon survival world with exploration, crafting, trading, and combat. We evaluate 13 modern LLMs zero-shot within homogeneous teams, with trained MARL agents as reference points. Current LLM agents remain far from solving Alem, averaging only about 6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows that individual task competence does not imply coordination competence. Ablations show that communication is the largest contributor to coordination, while memory and reasoning help when used to maintain multi-step plans. Overall, our results identify coordination as a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities. Alem makes this bottleneck measurable and provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem-world/alem-env.

2606.08483 2026-06-09 cs.AI 新提交

Testing the Black Box: Structural Barriers to Independent Evaluation of Consumer-Facing Health LLMs

测试黑箱:面向消费者的健康大语言模型独立评估的结构性障碍

Rahul Gorijavolu, Kaushik Madapati, Pritika Vig, Rawan Abulibdeh, Nikhil Jaiswal, Mahri Kadyrova, Zeamanuel Hailu Tesfaye, Charles Senteio, Paula Maurutto, Leo Anthony Celi

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Johns Hopkins University(约翰霍普金斯大学) University of California, Berkeley(加州大学伯克利分校) Toronto General Hospital, University Health Network(多伦多综合医院,大学健康网络) McGill University(麦吉尔大学) University of Toronto(多伦多大学) Independent Researcher(独立研究者) Rutgers University(罗格斯大学) Beth Israel Deaconess Medical Center(贝斯以色列女执事医疗中心) Harvard T.H. Chan School of Public Health(哈佛大学陈曾熙公共卫生学院)

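AI summary Uses simulated user profiles to test consumer-facing health LLMs for response variation and sycophancy, and identifies five structural barriers to independent evaluation.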

Comments 6 pages, 1 figure. Preprint submitted for review

详情
AI中文摘要

背景:面向消费者的大语言模型现已成为健康信息的常见来源,它们解释并个性化响应而非检索信息。其响应是否因用户而异是一个临床、公平和治理问题,证据表明谄媚响应可能改变判断并增加信任,这一问题更加突出。 目标:评估在类似普通患者使用条件下,面向消费者的健康大语言模型的响应变异和谄媚行为。 方法:我们构建了模拟用户档案,这些档案在地理位置、浏览环境、表达信念和健康社会决定因素方面存在差异,借鉴了将社会背景与健康态度联系起来的文献。我们将经过验证的工具(包括疫苗接种态度量表和生殖态度量表)改编为多轮提示,旨在引发用户间有临床意义的变异。 结果:评估遇到了五个相互关联的障碍。事实性提示产生稳定的响应,掩盖了在多轮对话中出现的谄媚行为。基于浏览器的界面未披露哪些信号影响输出,且无法重置为干净基线。大规模测试受到服务条款、速率限制和机器人检测的限制。基于准确性的标准无法捕捉语气、框架或遗漏,而LLM作为评判者的方法存在共享对齐偏差的风险。模型在无追溯版本标识符的情况下发生变化,阻碍了可靠的重复。 结论:目前尚不存在可靠的独立评估框架来检查面向消费者的健康大语言模型在普通使用中的行为。监管需要披露个性化信号、稳定的版本标识符、研究人员安全港计划以及部署后对健康相关输出的监控。

英文摘要

Background: Consumer-facing large language models are now a common source of health information, and they interpret and personalize responses rather than retrieve them. Whether their responses vary across users is a clinical, equity, and governance question, sharpened by evidence that sycophantic responses can alter judgment and increase trust. Objective: To evaluate response variation and sycophancy in consumer-facing health LLMs under conditions resembling ordinary patient use. Methods: We constructed simulated user profiles differing in geography, browsing context, expressed beliefs, and social determinants of health, drawing on literature linking social context to health attitudes. We adapted validated instruments, including the Vaccination Attitudes Examination scale and reproductive attitudes scales, into multi-turn prompts designed to elicit clinically meaningful variation across users. Results: The evaluation encountered five linked barriers. Factual prompts produced stable responses that masked sycophancy emerging over multi-turn conversation. Browser-based interfaces did not disclose which signals influence outputs and could not be reset to a clean baseline. Large-scale testing was restricted by terms of service, rate limits, and bot detection. Accuracy-based criteria could not capture tone, framing, or omission, and LLM-as-judge methods risked shared alignment bias. Models changed without traceable version identifiers, preventing reliable replication. Conclusions: No reliable independent evaluation framework yet exists for examining how consumer-facing health LLMs behave in ordinary use. Oversight requires disclosure of personalization signals, stable version identifiers, researcher safe harbor programs, and post-deployment monitoring of health-related outputs.

2606.08529 2026-06-09 cs.AI cs.CL cs.LG 新提交

Scaffold Effects on GAIA: A Controlled Comparison

脚手架对GAIA的影响:一项受控比较

Jason Starace

发表机构 * Independent Researcher(独立研究员)

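AI summary A controlled experiment comparing three scaffolds (ReAct, a multi-agent design, and planner-then-executor) across five models on the GAIA validation set finds that scaffold choice alone can shift accuracy by up to 28 percentage points, and that more capable models are not necessarily less scaffold-dependent.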

Comments 12 pages, 3 figures

详情
AI中文摘要

已发布的智能体能力评分混淆了模型本身的能力与脚手架赋予的能力,且这种激发差距的大小在受控条件下尚未得到充分表征。本研究在GAIA验证集的Level 1和Level 2上,对来自三个提供商的五个模型(Claude Opus 4.7、Sonnet 4.6、Haiku 4.5;Gemini 3.1 Pro Preview;GPT-5.5)进行了预先注册的受控比较,涉及三种脚手架(ReAct、规划-执行者-评估者多智能体设计以及规划-执行),保持任务和条件固定,每个问题尝试三次。仅脚手架选择就使单个模型(Opus,Level 2,稳健切片)的测量准确率移动了多达28个百分点,证实了预先注册的假设,即脚手架变化至少产生10个百分点的差距。预先注册的预测——能力更强的模型对脚手架敏感性更低——在方向上被拒绝:在每个数据集切片中,脚手架效应因模型而异,但能力最强的Anthropic模型在更难级别上从结构化脚手架中获益最多,且层级缩放仅在Level 1的稳健切片下成立。在Level 2上,多智能体相对于ReAct的优势出现在Anthropic系列内部,但跨提供商模型中没有,因此模型系列而非能力层级成为调节变量,而预测的规划-执行者在文件读取任务上的优势被证伪。结构化脚手架在更难级别上调用工具次数更少,但从中途错误中恢复的频率更高,且单个单元(Gemini搭配规划-执行者)在两个级别上成本最低,在Level 2上准确率最高。这些结果表明,单脚手架能力数值是脚手架条件估计,且激发差距不一定会随着模型改进而缩小。

英文摘要

Published agent capability scores conflate what a model can do with what its scaffold lets it do, and the magnitude of this elicitation gap is not well characterized under controlled conditions. This study executes a pre-registered controlled comparison of three scaffolds (ReAct, a Planner-Actor-Rater multi-agent design, and planner-then-executor) across five models from three providers (Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; Gemini 3.1 Pro Preview; GPT-5.5) on GAIA validation Levels 1 and 2, holding tasks and conditions fixed, with three attempts per question. Scaffold choice alone moves measured accuracy by as much as 28 percentage points within a single model (Opus, Level 2, robust slice), confirming the pre-registered hypothesis that scaffold variation produces gaps of at least 10 points. The pre-registered prediction that more capable models would be less scaffold-sensitive is rejected in direction: scaffold effects vary significantly by model in every dataset slice, but the most capable Anthropic model gains the most from structured scaffolds at the harder level, and tier-scaling holds only at Level 1 under the robust slice. The multi-agent advantage over ReAct at Level 2 appears within the Anthropic family but not for the cross-provider models, making model family rather than capability tier the conditioning variable, and the predicted planner-executor advantage on file-reading tasks is falsified. Structured scaffolds make fewer tool calls yet recover more often from mid-trajectory errors at the harder level, and a single cell (Gemini with planner-then-executor) is the cheapest at both levels and the most accurate at Level 2. These results indicate that single-scaffold capability numbers are scaffold-conditional estimates and that the elicitation gap is not guaranteed to shrink as models improve.
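Editorial note: the "scaffold gap" for one model is the spread in accuracy across scaffolds on the same tasks. The sketch below pools all attempts per scaffold and reports the gap in percentage points. Pooling over attempts is our assumption, since the abstract does not say how the three attempts per question are aggregated, and the records are invented.

```python
from collections import defaultdict


def accuracy_by_scaffold(records):
    """records: iterable of (scaffold, question_id, attempt, correct) tuples.

    Returns accuracy in percent per scaffold, pooled over all attempts.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for scaffold, _qid, _attempt, correct in records:
        hits[scaffold] += int(correct)
        totals[scaffold] += 1
    return {s: 100.0 * hits[s] / totals[s] for s in totals}


def scaffold_gap(acc):
    """Largest minus smallest accuracy, in percentage points."""
    return max(acc.values()) - min(acc.values())


# Hypothetical outcomes for one model: 2 questions x 3 attempts per scaffold.
records = [
    ("react", "q1", 1, True), ("react", "q1", 2, False), ("react", "q1", 3, False),
    ("react", "q2", 1, False), ("react", "q2", 2, False), ("react", "q2", 3, True),
    ("plan_exec", "q1", 1, True), ("plan_exec", "q1", 2, True), ("plan_exec", "q1", 3, True),
    ("plan_exec", "q2", 1, False), ("plan_exec", "q2", 2, True), ("plan_exec", "q2", 3, False),
]
acc = accuracy_by_scaffold(records)
print(acc, round(scaffold_gap(acc), 1))
```

Reporting this gap next to any single-scaffold score makes clear how much of the number depends on the harness rather than the model.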

2606.08840 2026-06-09 cs.AI cs.SE 新提交

Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

超越通过率:开放代码大语言模型的多语言、执行基础评估

Sayed Erfan Arefin

发表机构 * Sayed Erfan Arefin

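AI summary Evaluates 9 open code LLMs on 2,707 LeetCode problems across 12 programming languages and finds that the best model, Yi-Coder-9B-Chat, reaches only 23.64% mean correctness, well below the 57.2% human acceptance baseline; rankings vary with problem difficulty and language, and compile errors account for 63.25% of non-accepted best submissions.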

详情
AI中文摘要

代码生成模型通常使用紧凑的执行基准和总体通过率进行比较,但这种总结掩盖了性能在不同编程语言、问题族和失败模式之间的差异。我们对9个专门用于编码的开放访问LLM进行了大规模、基于执行的评估,涉及12种编程语言的2707道免费LeetCode问题。我们的语料库包含325,343个问题-模型-语言作业,每个作业都关联了提示元数据、提取的代码、LeetCode执行结果和静态分析信号。结果表明,当前的开放模型远未达到人类接受参考:最佳模型Yi-Coder-9B-Chat的平均正确率为23.64%,而人类接受基线为57.2%。排名也依赖于切片:Qwen2.5-Coder-14B-Instruct在困难问题和不同问题覆盖上最强,而Gemma-2-27B-IT在所有语言上的lint通过率最高。失败分析显示,编译错误占未接受最佳提交的63.25%,表明许多失败发生在语义正确性测试之前。静态质量进一步与功能正确性偏离。总之,这些发现表明,多语言、保留工件的评估揭示了单语言或单指标排行榜所隐藏的权衡。

英文摘要

Code generation models are typically compared using compact execution benchmarks and aggregate pass rates, but such summaries obscure how performance varies across programming languages, problem families, and failure modes. We present a large-scale, execution-grounded evaluation of 9 openly accessible LLMs specialized for coding on 2,707 free LeetCode problems across 12 programming languages. Our corpus contains 325,343 problem-model-language jobs, each linked to prompt metadata, extracted code, LeetCode execution outcomes, and static-analysis signals. The results show that current open models remain far from the human acceptance reference: the best model, Yi-Coder-9B-Chat, reaches 23.64% mean correctness, compared with a 57.2% human acceptance baseline. Rankings are also slice-dependent: Qwen2.5-Coder-14B-Instruct is strongest on hard problems and distinct-problem coverage, while Gemma-2-27B-IT achieves the highest all-language lint pass rate. Failure analysis shows that compile errors account for 63.25% of non-accepted best submissions, indicating that many failures occur before semantic correctness can be tested. Static quality further diverges from functional correctness. Together, these findings show that multilingual, artifact-preserving evaluation reveals tradeoffs hidden by single-language or single-metric leaderboards.
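Editorial note: "slice-dependent rankings" means the best model changes depending on how the jobs are grouped. The sketch below computes acceptance rates per slice and picks the top model in each; the job records and the placeholder model names "A" and "B" are invented and do not reflect the paper's data.

```python
from collections import defaultdict


def pass_rates(jobs, key):
    """jobs: dicts with 'model', 'language', 'difficulty', 'accepted'.

    Returns {slice_value: {model: acceptance rate}} for the chosen key.
    """
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for job in jobs:
        cell = counts[job[key]][job["model"]]
        cell[0] += int(job["accepted"])
        cell[1] += 1
    return {
        s: {m: acc / n for m, (acc, n) in models.items()}
        for s, models in counts.items()
    }


def best_per_slice(rates):
    """Model with the highest acceptance rate in each slice."""
    return {s: max(models, key=models.get) for s, models in rates.items()}


# Hypothetical jobs for two placeholder models.
jobs = [
    {"model": "A", "language": "python", "difficulty": "easy", "accepted": True},
    {"model": "A", "language": "python", "difficulty": "hard", "accepted": False},
    {"model": "A", "language": "rust", "difficulty": "easy", "accepted": True},
    {"model": "B", "language": "python", "difficulty": "easy", "accepted": False},
    {"model": "B", "language": "python", "difficulty": "hard", "accepted": True},
    {"model": "B", "language": "python", "difficulty": "hard", "accepted": True},
    {"model": "B", "language": "rust", "difficulty": "easy", "accepted": False},
]
by_difficulty = best_per_slice(pass_rates(jobs, "difficulty"))
by_language = best_per_slice(pass_rates(jobs, "language"))
print(by_difficulty)
print(by_language)
```

In this toy data, model A leads on easy problems and Rust while model B leads on hard problems and Python, so a single aggregate ranking would hide both patterns.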

2606.08970 2026-06-09 cs.AI 新提交

An Effective Router for Vision-Language Model Selection

一种有效的视觉-语言模型选择路由器

Can Wang, Shengwei Wang, Bolin Zhang, Zhiying Tu, Dianhui Chu

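Affiliations: Harbin Institute of Technology; Shandong Key Laboratory of Digital Service Computing Technology and Systems

AI summary: To address the lack of specialized data, ineffective feature representations, and a rigid model space in vision-language model (VLM) selection, proposes the ARMS router, which enhances input signals with VLM profiles and adds extension training strategies; it performs well on in-distribution and out-of-distribution test sets and, at only 800M parameters, outperforms GPT-4o.

Details
Chinese abstract (AI-generated)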

具有不同性能和资源需求的视觉-语言模型(VLM)被广泛部署,使得用户难以从众多VLM候选中选择最合适的。现有工作揭示了语言模型中的性能悖论现象,并专注于路由方法来解决它。然而,开发用于VLM选择的路由器仍然是一个关键且具有挑战性的问题,主要面临:1)缺乏专门数据,2)特征表示无效,以及3)模型空间僵化和适应成本高。在本文中,我们构建了一个用于VLM选择的多模态数据集,包含七个主流VLM在32,626个独特图像-文本查询上的输出。然后,我们提出了ARMS,一个用于VLM选择的路由器。ARMS通过VLM配置文件增强输入信号,采用简单但有效的架构来改进查询和VLM能力的表示。为了提高ARMS对新VLM的适应性,我们提出了两种扩展训练策略:增量训练和独立训练。在分布内和分布外测试集上的实验结果表明了ARMS的有效性。特别是,使用我们的训练策略,ARMS(仅800M参数)可以适应更广泛的VLM空间,并击败规模大数百倍的商业模型如GPT-4o。我们的代码、模型和数据集可在匿名仓库中获取。

English abstract

Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making it difficult for users to select the most appropriate one among numerous VLM candidates. Existing work reveals the performance paradox phenomenon in language models and focuses on routing methods to solve it. However, developing a router for VLM selection is still a critical yet challenging problem, which primarily faces: 1) lack of specialized data, 2) ineffective feature representation, and 3) rigid model space and costly adaptation. In this paper, we construct a multimodal dataset for VLM selection, containing the outputs of seven mainstream VLMs on 32,626 unique image-text queries. We then propose ARMS, a router for VLM selection. ARMS enhances input signals with VLM profiles and employs a simple but effective architecture to improve representations of queries and VLM capabilities. To improve ARMS' adaptation to new VLMs, we propose two extension training strategies: incremental training and independent training. Experimental results on both in-distribution and out-of-distribution test sets demonstrate the effectiveness of ARMS. In particular, using our training strategies, ARMS (only 800M parameters) can adapt to a broader VLM space and defeat commercial models like GPT-4o that are hundreds of times larger in scale. Our code, models, and datasets are available in an anonymous repository.

2606.08976 2026-06-09 cs.AI new submission

RTL-BenchLS: A Large-Scale Benchmark for RTL Reasoning and Generation with Large Language Models

RTL-BenchLS:面向大语言模型的RTL推理与生成的大规模基准

Jing Wang, Shang Liu, Wenji Fang, Yuchao Wu, Yugao Zhu, Zhiyao Xie

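Affiliations: Hong Kong University of Science and Technology

AI summary: Proposes RTL-BenchLS, a large-scale benchmark with over 10,000 formally verified Verilog designs and three new tasks that jointly evaluate reasoning and generation (two of them self-supervised), addressing the small scale and narrow task scope of existing benchmarks; even the best evaluated model reaches only 23%, 28%, and 12% on the three new tasks.

Details
Chinese abstract (AI-generated)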

基于LLM的RTL生成与推理是硬件设计自动化的一个有前景的方向。高质量的基准是跟踪这一进展的关键基础设施。然而,现有的RTL基准在规模和任务范围上存在固有局限性。它们涵盖的设计通常较小且简单,任务几乎完全集中在规格到RTL的生成上。前沿模型在现有基准上的性能已经饱和。扩大这些基准的规模从根本上很困难,因为基准测试需要对齐的标签,例如规格和测试平台。对于实际设计,这种对齐的高质量数据很少可用。我们引入了RTL-BenchLS,这是一个大规模基准,解决了上述两个局限性。它包含超过10,000个经过形式验证的Verilog设计,涵盖比现有基准更大且更复杂的设计。除了规格到RTL的生成,我们提出了三项联合评估推理与生成的新任务:往返推理、掩码内容推理和仓库问题推理。前两项是自监督的,直接解决了扩展瓶颈。所有任务都通过形式等价性检查进行验证,无需任何手动测试平台。我们在RTL-BenchLS上评估了八个LLM。即使是最好的模型,在自然语言往返推理上仅达到23%,在掩码内容推理上达到28%,在仓库问题修复上达到12%。RTL-BenchLS比现有基准更具挑战性。它为未来的改进留下了充足的空间,并为开发基于LLM的硬件设计方法提供了指导。

English abstract

LLM-based RTL generation and reasoning is a promising direction for hardware design automation. High-quality benchmarks are critical infrastructure for tracking progress in this direction. However, existing RTL benchmarks face inherent limitations in both scale and task scope. The designs they cover are typically small and simple, and the tasks focus almost entirely on specification-to-RTL generation. Frontier models' performance already saturates on the existing benchmarks. Scaling these benchmarks up is fundamentally difficult because aligned labels are required for benchmarking, such as specifications and testbenches. Such aligned high-quality data are rarely available for real-world designs. We introduce RTL-BenchLS, a large-scale benchmark addressing both limitations above. It contains over 10,000 formally verified Verilog designs, covering substantially larger and more complex designs than existing benchmarks. Beyond specification-to-RTL generation, we propose three novel tasks that jointly evaluate reasoning and generation: round-trip reasoning, masked-content reasoning, and repository-issue reasoning. The first two are self-supervised, which directly resolves the scaling bottleneck. All tasks are verified through formal equivalence checking without any manual testbenches. We evaluate eight LLMs on RTL-BenchLS. Even the best model reaches only 23% on natural-language round-trip reasoning, 28% on masked-content reasoning, and 12% on repository-issue fixing. RTL-BenchLS is substantially more challenging than existing benchmarks. It leaves ample room for future improvement and offers guidance for developing LLM-based methods for hardware design.

2606.09118 2026-06-09 cs.AI new submission

ComplexConstraints and Beyond: Expert Rubrics for RLVR

复杂约束与超越:RLVR的专家评分标准

Sushant Mehta, Liudas Panavas, Edwin Chen

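Affiliations: Surge AI

AI summary: Proposes expert-designed rubrics as both evaluation and training signals, validated on complex instruction following and enterprise agentic tasks; training on roughly 1,000 ComplexConstraints examples improves instruction following by +15.5% for a 4B-parameter model and +12.2% for a 235B-parameter model.

Comments: Accepted to the GEM workshop at ACL 2026: https://gem-workshop.com/

Details
Chinese abstract (AI-generated)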

随着LLM能力的快速提升,用于评估它们的方法越来越滞后。传统基准依赖于对狭窄、表面约束的程序化验证,但现实世界的指令遵循和智能体任务需要评估细微的、上下文依赖的行为,这些行为难以通过简单的脚本检查。我们提出了一个基于专家策划的评分标准评估的系统分析作为替代范式,借鉴了来自两个领域的实证证据:复杂指令遵循和企业智能体任务。我们首先阐述了构建高质量评分标准的五个设计原则,包括最大可行原子性、意图感知标准设计和迭代LLM判断校准。为了验证这些原则,我们引入了ComplexConstraints,一个新的专家策划的指令遵循数据集,其中每个提示与10-40个原子评分标准配对。我们证明这些专家评分标准不仅是更好的评估工具,而且是高度有效的训练信号:在大约1000个ComplexConstraints示例上训练,使得4B参数模型在指令遵循上提升+15.5%,235B参数模型提升+12.2%,而在评分标准评分的企业环境上进行单周期RL训练产生的收益可以转移到模型从未训练过的分布外基准(BFCL +4.5%,Tau2-Bench +7.4%,Tool-Decathlon +6.8%)。我们的发现表明,专家编写的评分标准既改进了前沿LLM能力的测量,也改进了其发展,作为有效的评估和RL训练信号。

English abstract

As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks relied on programmatic verification of narrow, surface-level constraints, but real-world instruction following and agentic tasks demand assessment of nuanced, context-dependent behaviors that resist simple scripted checks. We present a systematic analysis of expert-curated rubric-based evaluation as an alternative paradigm, drawing on empirical evidence from two domains: complex instruction following and enterprise agentic tasks. We first articulate five design principles for constructing high-quality rubrics, including Maximum Viable Atomicity, intent-aware criterion design, and iterative LLM-judge calibration. To validate these principles, we introduce ComplexConstraints, a new expert-curated instruction-following dataset in which each prompt is paired with 10-40 atomic rubric criteria. We demonstrate that these expert rubrics are not only better evaluation instruments but also highly effective training signals: training on approximately 1,000 ComplexConstraints examples yields +15.5% improvement for a 4B-parameter model and +12.2% for a 235B-parameter model on instruction following, while single-epoch RL training on a rubric-graded enterprise environment produces gains that transfer to out-of-distribution benchmarks the model was never trained on (+4.5% BFCL, +7.4% Tau2-Bench, +6.8% Tool-Decathlon). Our findings establish that expert-authored rubrics improve both the measurement and the development of frontier LLM capabilities, serving as effective evaluation and RL training signals.
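One way to picture how atomic rubric criteria can become a scalar score or reward is sketched below. The equal weighting, the `Judge` signature, and the toy `keyword_judge` are assumptions made for illustration; the abstract does not specify how criteria are aggregated, and the paper uses a calibrated LLM judge rather than string matching.

```python
from typing import Callable, List

# A judge decides whether a response satisfies one atomic criterion.
# In practice this would be a calibrated LLM judge; here it is an injected callable.
Judge = Callable[[str, str], bool]


def rubric_score(response: str, criteria: List[str], judge: Judge) -> float:
    """Fraction of atomic criteria satisfied (equal weights are an assumption)."""
    if not criteria:
        raise ValueError("a rubric needs at least one criterion")
    passed = sum(1 for criterion in criteria if judge(response, criterion))
    return passed / len(criteria)


def keyword_judge(response: str, criterion: str) -> bool:
    """Toy stand-in for an LLM judge: criteria look like 'contains:<text>'."""
    prefix = "contains:"
    return criterion.startswith(prefix) and criterion[len(prefix):] in response


# Hypothetical prompt response and rubric.
response = "Dear team, the report is attached. Regards, Ana"
criteria = [
    "contains:Dear team",
    "contains:attached",
    "contains:deadline",
    "contains:Regards",
]
score = rubric_score(response, criteria, keyword_judge)
```

Keeping each criterion atomic means a failed check points at one specific behavior, which is what makes the resulting score usable both for diagnosis and as a training signal.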

2606.09169 2026-06-09 cs.AI cs.CV cs.MM new submission

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

IMUG-Bench:交错理解与生成的统一多模态模型基准

Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao, Yichen Zhu, Yechi Liu, Kaixuan Wang, Yu-Jie Yuan, Chunwei Wang, Yu Zhang, Bo Dai

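Affiliations: Zhejiang University; The University of Hong Kong; Institute of Automation, Chinese Academy of Sciences; Huawei

AI summary: Proposes IMUG-Bench, a benchmark for evaluating the understanding and generation abilities of unified multimodal models in multi-turn interleaved image-text dialogue; it contains 3,113 samples and 12,034 interaction turns, reveals exposure bias on the generation side, and explores test-time scaling strategies.

Details
Chinese abstract (AI-generated)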

近年来,统一多模态模型(UMMs)出现,支持在单一框架内同时进行理解和生成。掌握动态、多轮交错图文对话是UMMs在实际应用中的关键任务。然而,现有基准未能评估这一重要任务,因为它们通常局限于单轮或静态设置,并且通常忽略多轮交互中的暴露偏差。为弥补这一差距,我们提出IMUG-Bench,一个用于UMMs多轮交错图文对话的综合基准,联合评估其理解和生成能力。我们的IMUG-Bench包含三类:静态空间、时间因果和混合,涵盖3113个样本和12034个交互轮次。它还包括动态理解问题,从而支持更能反映真实多轮交互场景的评估。在IMUG-Bench上进行的大规模实验系统评估了主流开源和闭源UMMs,揭示了它们的能力边界和失败模式,并发现了多轮交互中生成侧的显著暴露偏差。我们进一步探索了几种测试时扩展策略,包括思维链、自我验证和最佳N采样,这些策略有效提高了生成准确性并减轻了生成任务中的暴露偏差。这些发现为增强未来UMMs的鲁棒性和多轮交互能力提供了见解。

English abstract

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.

2606.09323 2026-06-09 cs.AI cs.DB new submission

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

TRL-Bench:标准化跨范式的表格编码器表示级评估

Wei Pang, Xiangru Jian, Hehan Li, Zhixuan Yu, Alex Xue, Jinyang Li, Zhengyuan Dong, Xinjian Zhao, Hao Xu, Chao Zhang, Reynold Cheng, M. Tamer Özsu, Tianshu Yu

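Affiliations: The Chinese University of Hong Kong, Shenzhen; University of Waterloo; The University of Hong Kong; The University of Sydney; Université Lyon 1

AI summary: Proposes TRL-Bench, which standardizes downstream conditions and evaluates tabular encoders in three suites (column/table, row, and compositional data-lake table enrichment), showing that encoder quality is capability-specific rather than captured by a single ranking.

Details
Chinese abstract (AI-generated)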

表格编码器通常在特定任务的全流程管道中进行评估,因此即使处理相似的表格信号,来自不同训练范式的模型也难以直接比较。我们引入了TRL-Bench,一个多粒度表格表示学习基准,用于标准化跨范式的表示级评估:每个编码器通过其支持的封装器导出行、列或表嵌入,共享的轻量级探测头在三个套件中对其进行探测:TRL-CTbench(列/表)、TRL-Rbench(行)和TRL-DLTE(涵盖所有三种粒度的组合数据湖表增强)。为支持这一标准化设置,我们发布了精选的基准资产和任务重构,包括50个OpenML表格(含123个验证目标)、16个行对链接重写以及一个由1379个父表衍生的47772表DLTE湖。在20个模型和16个任务上的实验表明,一旦下游条件标准化,编码器质量是能力特定的,而非由单一排行榜决定。在TRL-CTbench中,通用文本编码器通常在具有强表面文本信号的任务上领先,而表格专用编码器在其预训练目标与任务对齐时获胜。在TRL-Rbench中,表内预测和跨表链接偏好不同的训练机制,原子链接性能与DLTE管道中的行匹配阶段强相关。在TRL-DLTE中,最强管道结合了能力匹配的专用编码器而非重复使用单一编码器,且顶级端到端质量取决于非加性的组合适配而非每阶段边际排名。TRL-Bench提供了一个通用协议,用于在共享下游条件下测量导出表格表示中的可复用信号。代码和数据:https://github.com/LOGO-CUHKSZ/TRL-Bench

English abstract

Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation: each encoder exports row-, column-, or table embeddings through its supported wrapper, and shared lightweight heads probe them across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, with atomic linkage performance correlating strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reuse a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench

2606.09409 2026-06-09 cs.AI cs.CL cs.LG new submission

Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

正确看起来更好:成对比较揭示准确性排名

Mina Remeli, Moritz Hardt

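Affiliations: Max Planck Institute for Intelligent Systems, Tübingen, Germany; Tübingen AI Center

AI summary: By converting five benchmarks into free-form generative evaluations, the paper finds that model rankings from pairwise comparisons aggregated with Elo agree closely with ground-truth accuracy rankings (Spearman correlation above 0.9); style and judge bias have only minor effects, but repetition after the final answer (echo) is a causal driver of judge preference.

Comments: Accepted at ICML'26

Details
Chinese abstract (AI-generated)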

成对比较结合诸如Elo等聚合方法已成为评估生成模型的核心,但人们仍担心它们会奖励肤浅的风格线索或显示裁判偏见。从更积极的角度看,我们表明,当存在真实准确率用于比较时,成对比较得出的模型排名与基于真实准确率的排名高度一致。通过将五个知名基准测试转化为自由形式的生成评估,我们发现Elo排名与准确率排名的Spearman相关系数超过0.9,并且在裁判较弱时显著优于直接评估。此外,风格和裁判偏见对模型排名的影响较小,尽管大多数判断发生在两个候选答案都正确(或都错误)的成对上。在这样的成对比较中,我们发现最终答案后的重复(echo)是裁判偏好的因果驱动因素。

English abstract

Pairwise comparisons combined with aggregation methods like Elo have become central to evaluating generative models, yet concerns remain that they reward superficial stylistic cues or display judge biases. In a more positive turn, we show that model rankings from pairwise comparisons strongly agree with ground-truth-based accuracy rankings when such ground truth is available for comparison. By converting five well-known benchmarks into free-form generative evaluations, we find that Elo rankings achieve a Spearman correlation above 0.9 with accuracy rankings and substantially outperform direct evaluation when the judge is weak. Furthermore, style and judge bias have only minor effects on model rankings, despite most judgments occurring on pairs where both candidate answers are correct (or incorrect). On such pairs, we find that repetition after the final answer (echo) is a causal driver of judge preference.
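As a rough illustration of the comparison described above, the sketch below derives Elo ratings from pairwise judgments and measures their rank agreement with accuracy using Spearman's rho. All model names, judgments, and accuracy values are invented, and the K-factor, starting rating, and number of passes are arbitrary assumptions; the abstract does not give the paper's aggregation settings.

```python
from typing import Dict, List, Tuple


def elo_ratings(matches: List[Tuple[str, str]], k: float = 16.0,
                base: float = 1000.0, passes: int = 20) -> Dict[str, float]:
    """Sequential Elo updates; each match is (winner, loser).

    Sequential Elo depends on match order, so the data is replayed several times.
    """
    ratings: Dict[str, float] = {}
    for winner, loser in matches:
        ratings.setdefault(winner, base)
        ratings.setdefault(loser, base)
    for _ in range(passes):
        for winner, loser in matches:
            gap = ratings[loser] - ratings[winner]
            expected_win = 1.0 / (1.0 + 10 ** (gap / 400.0))
            delta = k * (1.0 - expected_win)
            ratings[winner] += delta
            ratings[loser] -= delta
    return ratings


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 is the highest score; this sketch assumes there are no ties."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman's rho for tie-free data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rank_a, rank_b = ranks(a), ranks(b)
    n = len(rank_a)
    d_squared = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical ground-truth accuracies and judge decisions (winner, loser).
accuracy = {"model_a": 0.81, "model_b": 0.64, "model_c": 0.42}
judgments = (
    [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3
    + [("model_b", "model_c")] * 7 + [("model_c", "model_b")] * 3
    + [("model_a", "model_c")] * 9 + [("model_c", "model_a")] * 1
)
ratings = elo_ratings(judgments)
rho = spearman(ratings, accuracy)
```

With real judge data the two rankings will rarely match exactly; the paper's claim is that the correlation stays above 0.9 across the benchmarks it converted.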

2606.09450 2026-06-09 cs.AI new submission

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

TheoremBench: 评估LLMs在形式数学中的定理证明能力

QuocViet Pham, Elvir Karimov, Andrey Galichin, Ivan Oseledets

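Affiliations: Skolkovo Institute of Science and Technology; HSE University; Artificial Intelligence Research Institute; Sberbank

AI summary: Proposes TheoremBench, a Lean4 benchmark that uses structured theorem families and fine-grained metrics (theorem-level coverage and token efficiency) to show that current provers are biased toward easy subtheorems and often rely on long, inefficient tactic traces.

Comments: Preprint version (20 pages, 10 figures)

Details
Chinese abstract (AI-generated)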

LLMs最近在形式证明基准上取得了强劲结果。然而,现有评估仍高度集中在竞赛式问题上,且往往未能捕捉模型在更长、依赖关系更丰富的数学发展中的行为。我们引入TheoremBench,这是一个Lean4基准,旨在评估超越竞赛设置的定理证明器。该基准由近一百个经典定理构建,并以两种互补形式发布:一个简洁主版本,每个实例包含一个目标定理;以及一个前提版本,将每个定理扩展为一个结构化的相关证明任务族,包括主定理以及自动提取的支持性子定理。这种设计不仅能够评估最终定理是否从零开始被证明,还能评估通过定理内部证明结构的部分进展。我们的实验表明,显式前提显著提高了Lean4能力证明器模型的性能。为了提供全面评估,我们引入了定理级覆盖率和令牌效率指标,这些指标揭示了证明行为中的定性差异。结果表明,当前的证明器仍然强烈偏向于简单的子定理,并且通常通过冗长且低效的策略轨迹而非紧凑的证明计划来求解定理。因此,TheoremBench提供了对形式推理能力的更细粒度视角,并强调了结构基准设计对于评估Lean4定理证明器的重要性。

English abstract

LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation of not only whether the final theorem was proved from scratch, but also of partial progress through the internal proof structure of a theorem. Our experiments show that explicit premises substantially improve performance for Lean4-capable prover models. To provide a comprehensive evaluation, we introduce theorem-level coverage and token-efficiency metrics that expose qualitative differences in proof behavior. The results show that current provers remain strongly biased toward easy subtheorems and often solve theorems through long and inefficient tactic traces rather than compact proof plans. TheoremBench therefore provides a more fine-grained view of formal reasoning ability and highlights the importance of structural benchmark design for evaluating Lean4 theorem provers.

2606.09578 2026-06-09 cs.AI cs.CL cs.IR new submission

TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs

TABVERSE:大语言模型与视觉语言模型中跨格式表格理解的基准测试

Momina Ahsan, Sarfraz Ahmad, Ming Shan Hee, Roy Ka-Wei Lee, Preslav Nakov

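Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Singapore University of Technology and Design (SUTD)

AI summary: Proposes TABVERSE, a benchmark that holds table content fixed across multiple structural formats (HTML, Markdown, LaTeX) and rendered images to systematically evaluate LLMs and VLMs on question answering, structural understanding, and structure reconstruction; representation format substantially affects table understanding.

Comments: 24 pages, 18 tables, 16 figures, Submitted to ARR May 2026

Details
Chinese abstract (AI-generated)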

大语言模型(LLMs)和视觉语言模型(VLMs)在表格推理任务上的评估日益增多,但表格表示的作用仍未充分探索。实践中,相同的表格内容可能以不同的结构格式出现,如HTML、Markdown和LaTeX,或作为渲染图像。然而,现有评估往往让内容、格式、布局和模态同时变化,使得难以隔离表示效应。我们引入了TABVERSE,一个受控的多模态表格基准,它在多个结构格式和渲染图像中对齐相同的表格内容,并带有问题类别和难度标签。这种设计使得在保持表格内容固定的同时,能够系统评估表示效应。我们在三个任务上评估LLMs和VLMs:问答(QA)、结构理解能力(SUC)和结构重建(SR)。我们的结果表明,表示选择显著影响表格理解。模型在结构化文本上的表现通常优于渲染图像,但这一差距的大小取决于任务、模型和格式。HTML通常是最稳健的文本格式,而行敏感的结构任务和语法可用的LaTeX重建仍然具有挑战性。这些发现表明,表格表示是可靠表格评估的关键因素。

English abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remains under-explored. In practice, the same table content may appear in different structural formats, such as HTML, Markdown, and LaTeX, or as rendered images. However, existing evaluations often let content, format, layout, and modality vary together, making it difficult to isolate representation effects. We introduce TABVERSE, a controlled multimodal table benchmark that aligns the same table content across multiple structural formats and rendered images, with question category and difficulty tags. This design enables systematic evaluation of representation effects while holding table content fixed. We evaluate LLMs and VLMs across three tasks: Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR). Our results show that representation choice substantially affects table understanding. Models generally perform better with structured text than with rendered images, but the size of this gap depends on the task, model, and format. HTML is often the most robust text format, while row-sensitive structural tasks and syntactically usable LaTeX reconstruction remain challenging. These findings show that table representation is a key factor in reliable table evaluation.

2606.09748 2026-06-09 cs.AI cs.CL cs.LG new submission

Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

深度研究智能体在过程级反馈下的多轮评估

Rishabh Sabharwal, Hongru Wang, Amos Storkey, Jeff Z. Pan

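Affiliations: Google DeepMind; OpenAI; Perplexity AI; LangChain AI

AI summary: Addressing the limits of single-shot evaluation for deep research agents (DRAs), proposes Research Gap Inference (RGI) to provide process-level feedback; one round of such feedback raises the normalized score by about 8-15 points, but gains do not compound over later turns because agents regress on previously satisfied criteria.

Comments: Published as a workshop paper at SCALE - ICML 2026 (Oral)

Details
Chinese abstract (AI-generated)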

现有的深度研究智能体(DRA)基准仅评估单次输出,忽略了一个关键问题:DRA能否在反馈指导下改进其报告?为此,我们在两种反馈设置下对DRA进行多轮评估:自我反思(智能体在无外部诊断信号的情况下修改报告)和过程级反馈(智能体接收针对其研究策略缺口的指导)。为提供过程级反馈,我们设计了研究缺口推断(RGI),该方法通过分析满足和未满足的评分标准模式来推断研究过程缺口。我们的分析揭示了三个关键发现:(i)在自我反思下,智能体以几乎相等的速率纳入和退步评分标准,导致净改进可忽略;(ii)单轮过程级反馈带来显著收益,将归一化分数提高约8-15分,并产生约35-40%的纳入率;(iii)这些收益在后续轮次中不会累积,因为智能体在重写完整报告以解决剩余缺口时,会退步多达24%的先前满足的标准。即使有针对性指导,我们所评估的DRA架构仍无法实现可靠的多轮改进。我们的代码和结果公开在 https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs。

English abstract

Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately 8-15 points and yielding a roughly 35-40% incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to 24% of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.
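The incorporation and regression rates mentioned above can be made concrete with a small sketch. The definitions used here are assumptions inferred from the abstract's wording (incorporation as the share of previously unsatisfied criteria that become satisfied, regression as the share of previously satisfied criteria that are lost); the paper may define them differently, and the criterion IDs are hypothetical.

```python
from typing import Set, Tuple


def turn_rates(all_criteria: Set[str], before: Set[str],
               after: Set[str]) -> Tuple[float, float]:
    """Return (incorporation_rate, regression_rate) between two report versions.

    `before` and `after` hold the rubric criteria satisfied in each version.
    """
    unsatisfied_before = all_criteria - before
    incorporated = unsatisfied_before & after   # newly satisfied criteria
    regressed = before - after                  # previously satisfied, now lost
    incorporation = (len(incorporated) / len(unsatisfied_before)
                     if unsatisfied_before else 0.0)
    regression = len(regressed) / len(before) if before else 0.0
    return incorporation, regression


# Hypothetical rubric with ten criteria, c1 to c10.
ALL_CRITERIA = {f"c{i}" for i in range(1, 11)}
turn_1 = {"c1", "c2", "c3", "c4", "c5"}
turn_2 = {"c1", "c2", "c3", "c4", "c6", "c7"}
incorporation_rate, regression_rate = turn_rates(ALL_CRITERIA, turn_1, turn_2)
```

In this toy case the report gains two criteria and loses one, so the net improvement is smaller than the incorporation rate alone suggests, which is the effect the third finding describes.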

2606.07102 2026-06-09 cs.CV cs.AI cross-list

GP-Adapter: Gaussian Process CLIP-Adapter for Few-Shot Out-of-Distribution Detection

GP-Adapter: 基于高斯过程的CLIP适配器用于少样本分布外检测

Taisei Saito, Koretaka Ogata, Takafumi Hiroi

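AI summary: Proposes GP-Adapter, a training-free framework that augments CLIP with Gaussian Process uncertainty modeling for few-shot classification and out-of-distribution detection; it needs no backbone fine-tuning and relies only on a small K-shot cache and lightweight hyperparameter selection.

Comments: 8 pages, 6 figures, Accepted at IJCNN 2026

Details
Chinese abstract (AI-generated)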

我们提出GP-Adapter,一种无需训练的框架,通过高斯过程(GP)不确定性建模增强CLIP(对比语言-图像预训练),用于少样本分类和分布外(OOD)检测。虽然CLIP实现了强大的零样本识别,但它产生确定性的相似度分数,并提供有限的不确定性信息,这在分布偏移和数据稀缺情况下至关重要。GP-Adapter在冻结的CLIP嵌入之上,使用图像特征的RBF核和文本提示的线性核构建模态特定、类别级的一类GP,并融合它们的预测统计量,以生成方差感知的置信度分数用于OOD检测。该方法无需微调CLIP骨干网络,仅依赖于少量$K$样本缓存和轻量超参数选择,内存成本为$O(CK^2)$,其中$C$为类别数,$K$为样本数。在ImageNet和多个OOD基准上的实验表明,GP-Adapter提供了具有竞争力的少样本性能,并且在与提示学习基线结合时持续改进OOD检测,突出了基于GP的不确定性建模与提示学习之间的互补性。总体而言,我们的结果表明,将概率推理与大型预训练视觉-语言模型集成可以提高低数据和分布偏移场景下的可靠性。代码可在 https://github.com/tms-byte/GP-Adapter 获取。

English abstract

We propose GP-Adapter, a training-free framework that augments CLIP (Contrastive Language-Image Pre-training) with Gaussian Process (GP) uncertainty modeling for few-shot classification and out-of-distribution (OOD) detection. While CLIP achieves strong zero-shot recognition, it yields deterministic similarity scores and offers limited uncertainty information, which is critical under distribution shift and data scarcity. GP-Adapter constructs modality-specific, class-wise one-class GPs on top of frozen CLIP embeddings using an RBF kernel for image features and a linear kernel for text prompts and fuses their predictive statistics to produce a variance-aware confidence score for OOD detection. The method requires no fine-tuning of the CLIP backbone and relies only on a small $K$-shot cache and lightweight hyperparameter selection, with memory cost scaling as $O(CK^2)$ for $C$ classes and $K$ shots. Experiments on ImageNet and multiple OOD benchmarks show that GP-Adapter provides competitive few-shot performance and consistently improves OOD detection when combined with prompt-learning baselines, highlighting the complementarity between GP-based uncertainty modeling and prompt learning. Overall, our results suggest that integrating probabilistic inference with large pre-trained vision-language models can improve reliability in low-data and distribution-shifted settings. Code is available at https://github.com/tms-byte/GP-Adapter
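To illustrate why a per-class GP over a K-shot cache yields a variance signal for OOD detection, here is a minimal sketch of the standard GP posterior variance with an RBF kernel, written in plain Python. It covers only an image-side RBF component for a single class; the toy 2-D vectors stand in for CLIP embeddings, and the length scale, noise level, and function names are assumptions, not the authors' implementation. Storing one K x K kernel matrix per class is what gives the O(CK^2) memory cost.

```python
import math
from typing import List

Vector = List[float]


def rbf(x: Vector, y: Vector, length_scale: float = 1.0) -> float:
    """RBF kernel exp(-||x - y||^2 / (2 * length_scale^2))."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / (2.0 * length_scale ** 2))


def solve(matrix: List[Vector], rhs: Vector) -> Vector:
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def gp_variance(cache: List[Vector], query: Vector, noise: float = 1e-3) -> float:
    """Posterior variance k(x, x) - k_*^T (K + noise * I)^{-1} k_* for one class."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(cache)]
            for i, a in enumerate(cache)]
    k_star = [rbf(a, query) for a in cache]
    alpha = solve(gram, k_star)
    return rbf(query, query) - sum(k * a for k, a in zip(k_star, alpha))


# Toy 3-shot cache for one class, plus a nearby and a distant query.
cache = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
var_near = gp_variance(cache, [0.05, 0.05])
var_far = gp_variance(cache, [3.0, 3.0])
```

A query close to the cached shots gets a variance near zero, while a distant one reverts to the prior variance of 1; thresholding or fusing such variances is the kind of signal a variance-aware confidence score can build on.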

2606.07521 2026-06-09 cs.CL cs.AI cross-list

Evaluating Hallucinations in Domain-Adapted Large Language Models

评估领域自适应大语言模型中的幻觉现象

Sanchita Porwal, Sai Prasath S, Xingjian Bi, Madelyn Scandlen

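Affiliations: College of Computing, Georgia Institute of Technology

AI summary: Fine-tunes a Llama-2 model and tests its memorization, recall, and reasoning; the domain-adapted model hallucinates when handling new domain-specific information, suggesting that fine-tuning alone does not effectively mitigate hallucination.

Comments: 13 pages, 2 figures, 3 tables

Details
Chinese abstract (AI-generated)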

本研究调查了领域自适应大语言模型(LLMs)中的幻觉现象,重点关注使用Lamini数据集对Llama-2模型进行微调。幻觉,即LLMs生成无意义或不忠实内容的现象,构成了重大挑战,尤其是当这些模型使用领域特定数据进行微调时。我们的方法包括一系列实验,测试微调后LLM的记忆、回忆和推理能力,并将其在新问答对和领域特定信息上的表现进行比较。我们发现,虽然模型在与训练数据相似的任务上表现出色,但其准确推理和回忆新领域特定信息的能力仍然有限,导致出现幻觉实例。模型倾向于提供带有额外信息的正确答案,表明存在过度生成的倾向。这些结果表明,仅靠微调方法在将LLMs适应专业领域时缓解幻觉存在重要局限性,并强调了在将LLMs适应专业领域时需要更鲁棒的方法。该研究还提供了关于LLMs在不同类型信息上表现差异的见解,揭示了其在处理领域特定查询时的相对弱点。

English abstract

This study investigates the phenomenon of hallucinations in domain-adapted Large Language Models (LLMs), focusing on the fine-tuning of the Llama-2 model with the Lamini dataset. Hallucinations, or the generation of nonsensical or unfaithful content by LLMs, pose a significant challenge, especially when these models are fine-tuned with domain-specific data. Our methodology involves a series of experiments testing memorization, recall, and reasoning capabilities of the fine-tuned LLM, comparing its performance on novel question-answer pairs and domain-specific information. We found that while the model shows proficiency in tasks similar to its training data, its capability to accurately reason about and recall new domain-specific information remains limited, leading to instances of hallucination. The model demonstrates a tendency to provide correct answers with extra information, suggesting an inclination toward over-generation. These results suggest important limitations of fine-tuning-only approaches for mitigating hallucinations when adapting LLMs to specialized domains and underscore the need for more robust adaptation methods. The study also provides insights into the varying performance of LLMs on different types of information, revealing a comparative weakness in handling domain-specific queries.

2606.07528 2026-06-09 cs.CL cs.AI cs.LG cross-list

BEACON: Behavioral Entropy Aggregation for Cross-Model Hallucination Detection in Large Language Models

BEACON: 面向大语言模型跨模型幻觉检测的行为熵聚合

Naveen Bera, Pulijala Sai Nikhila, Kondaguduru Abhiram, Shaik Gayaz Ali, Shoaib Sadiq Salehmohamed, Shaik Mohammed Omar, Jinal Prashant Thakkar, Hansika Aredla, Shalmali Ayachit

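Affiliations: LLM Lens

AI summary: Proposes BEACON, a black-box detection framework built on multi-dimensional behavioral features (semantic entropy, embedding geometry, chain-of-thought consistency, paraphrase stability); it reaches 0.8123 AUROC across seven benchmarks, outperforming standalone semantic entropy and SelfCheckGPT-style consistency baselines.

Comments: 12 pages, 6 tables, 1 figure. Code and data available upon request

Details
Chinese abstract (AI-generated)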

大语言模型中的幻觉,即生成事实上不正确或未经支持的内容,仍然是可靠部署的关键障碍。我们提出了BEACON(面向跨模型幻觉检测的行为熵聚合),一个黑盒幻觉检测框架,仅基于模型输出运行,无需访问内部表示或外部知识库。BEACON从结构化的多遍生成中提取31维特征向量,整合了基于NLI的语义熵、嵌入几何、思维链一致性和释义稳定性信号。在七个基准的7,617个标记样本上训练的梯度提升分类器达到了0.8123 ± 0.0102的AUROC(95%置信区间:0.7632-0.8251),优于独立的语义熵(+0.2298)和SelfCheckGPT风格的一致性基线(+0.2457)。特征重要性分析表明,幻觉本质上是多维的,需要组合的不确定性信号。一个高效的5次调用变体达到了0.7795的AUROC,使得在黑盒LLM API上的实际部署成为可能。

English abstract

Hallucination in large language models (LLMs), defined as the generation of factually incorrect or unsupported content, remains a critical barrier to reliable deployment. We present BEACON (Behavioral Entropy Aggregation for Cross-model hallucination detectiON), a black-box hallucination detection framework that operates purely on model outputs without requiring access to internal representations or external knowledge bases. BEACON extracts a 31-dimensional feature vector from structured multi-pass generation, integrating NLI-based semantic entropy, embedding geometry, chain-of-thought consistency, and paraphrase stability signals. A gradient-boosted classifier trained on 7,617 labeled examples across seven benchmarks achieves 0.8123 +/- 0.0102 AUROC (95% CI: 0.7632-0.8251), outperforming standalone semantic entropy (+0.2298) and SelfCheckGPT-style consistency baselines (+0.2457). Feature importance analysis shows that hallucination is inherently multi-dimensional, requiring combined uncertainty signals. An efficient 5-call variant achieves 0.7795 AUROC, enabling practical deployment across black-box LLM APIs.
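Semantic entropy, one of the signals named above, can be sketched as the entropy of meaning clusters over several sampled answers. The version below is a simplified illustration: `normalized_match` is a toy stand-in for the NLI-based equivalence check, and the greedy clustering and sample answers are assumptions rather than BEACON's actual pipeline.

```python
import math
from typing import Callable, List

SameMeaning = Callable[[str, str], bool]


def cluster(answers: List[str], same_meaning: SameMeaning) -> List[List[str]]:
    """Greedily group answers whose meaning matches a cluster's first member."""
    clusters: List[List[str]] = []
    for answer in answers:
        for group in clusters:
            if same_meaning(answer, group[0]):
                group.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(answers: List[str], same_meaning: SameMeaning) -> float:
    """Shannon entropy (in nats) of the cluster-size distribution."""
    total = len(answers)
    entropy = 0.0
    for group in cluster(answers, same_meaning):
        p = len(group) / total
        entropy += p * math.log(1.0 / p)
    return entropy


def normalized_match(a: str, b: str) -> bool:
    """Toy equivalence check; a real system would use bidirectional NLI entailment."""
    def clean(text: str) -> str:
        return text.strip().lower().rstrip(".")
    return clean(a) == clean(b)


# Hypothetical samples for the same question from repeated generation.
consistent = ["Paris", "paris.", "Paris"]
scattered = ["Paris", "Lyon", "Marseille"]
low = semantic_entropy(consistent, normalized_match)
high = semantic_entropy(scattered, normalized_match)
```

Answers that agree in meaning collapse into one cluster and give zero entropy, while mutually inconsistent answers spread across clusters; a detector like the one described would feed this value, alongside the other features, into a classifier.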

2606.07541 2026-06-09 cs.HC cs.AI cs.CV cs.CY cs.MM cross-list

Multimodal Large Language Models as Synthetic Participants in Video-Based Studies: An Evaluation

多模态大语言模型作为视频研究中的合成参与者:一项评估

Prabal Shrestha, Bohan Jiang, Haoning Xue, Huan Liu, Xinyi Zhou

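Affiliations: University of California, Berkeley

AI summary: Evaluates how well multimodal large language models simulate subjective human ratings in a video perception task, finding systematic rating biases (downward mean shift and central tendency) and limited agreement with human participants.

Comments: Accepted to SocialLLM @ ICWSM 2026

Details
Chinese abstract (AI-generated)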

多模态大语言模型在视频理解和推理等客观任务上表现出色。然而,它们能否近似主观人类反应仍不清楚,因为主观反应不仅依赖于内容理解,还依赖于个体的社会背景。为填补这一空白,我们评估了MLLMs作为合成参与者在一项新兴任务中的表现:评估对短视频的感知感官参与度。基于感知信息感官价值框架,我们使用17项量表(测量情绪唤醒、戏剧冲击和新奇性)比较了招募的人类参与者和基于档案条件的MLLM模拟(n=673)的评分。我们发现,即使领先的MLLMs(Gemini 3 Flash和Qwen 3 Omni)与人类参与者的一致性也有限。这些模型在评分分布中表现出明显的向下均值偏移和中心趋势偏差。它们既引入又扁平化了子群体差异,同时对参与者档案的敏感性不一致。提示策略对这些指标的影响不同,适度改善某些方面同时恶化其他方面。这些结果突显了开发MLLMs作为视频研究中合成参与者的挑战与机遇。数据和代码:https://github.com/MINDLab25/mllm-human-simulation-eval

English abstract

Multimodal large language models (MLLMs) have shown strong performance on objective tasks such as video understanding and reasoning. However, it remains unclear whether they can approximate subjective human responses, which depend not only on content comprehension but also on individuals' social contexts. To address this gap, we evaluate MLLMs as synthetic participants in an emerging task: assessing perceived sensory engagement with short videos. Grounded in the Perceived Message Sensation Value (PMSV) framework, we compare ratings from recruited human participants and profile-conditioned MLLM simulations (n=673) using a 17-item scale measuring emotional arousal, dramatic impact, and novelty. We find that even leading MLLMs (Gemini 3 Flash and Qwen 3 Omni) show limited agreement with human participants. The models exhibit distinct downward mean-shift and central-tendency biases in their rating distributions. They both introduce and flatten subgroup differences, while showing inconsistent sensitivity to participant profiles. Prompting strategies affect these metrics differently, modestly improving some aspects while worsening others. These results highlight both the challenges and opportunities of developing MLLMs as synthetic participants in video-based research. Data and code: https://github.com/MINDLab25/mllm-human-simulation-eval

2606.07542 2026-06-09 cs.CY cs.AI cross-list

DIYHealth Suite: Dataset, Model, and Benchmark for Health Management at Home

DIYHealth Suite:家庭健康管理的数据集、模型与基准

Changshuo Liu, Junran Wu, Zhongle Xie, Wenqiao Zhang, Kaiping Zheng, Jiaqi Zhu, Qingpeng Cai, Ooi Gene Anne, Marcus Chun Jin Tan, Jianwei Yin, James Wei Luen Yip, Beng Chin Ooi

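Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: To address heterogeneous data, variable tasks, and the lack of a unified benchmark in home health management, proposes a framework comprising the large-scale multimodal dataset DIYHealth-900K, the adaptive foundation model DIYHealthGPT (built on Hybrid Hyper Low-Rank Adaptation), and DIYHealthBench, the first home care benchmark; reports state-of-the-art performance on 11 home care tasks.

Comments: Accepted by ICML 2026

Details
Chinese abstract (AI-generated)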

生成式AI正在重塑医疗保健,然而现有大多数进展依赖于医院级设备,这限制了其在临床环境之外的健康管理的可及性和潜力。随着便携式设备和远程医疗的普及,医疗保健正转向基于家庭的自我诊断(DIY)护理。尽管前景广阔,但仍存在几个独特挑战:(i)家庭收集的数据是异构的,且缺乏标准化的大规模数据集;(ii)模型需要适应变化的任务需求和不断变化的个体状况;(iii)家庭护理任务的广泛范围缺乏统一的基准进行系统评估。在本文中,我们提出DIYHealth Suite,一个通过定制数据集、模型和基准来应对这些挑战的综合框架。我们首先整理了DIYHealth-900K,一个大规模多模态数据集,捕捉了多样化的真实世界家庭护理场景。在此基础上,我们提出DIYHealthGPT,一个用于家庭健康管理的自适应基础模型,由新颖的混合超低秩适应技术驱动。最后,我们建立了DIYHealthBench,首个评估基础模型在家庭护理任务上的基准。大量实验表明,DIYHealthGPT在开放问答和封闭问答设置下的11项家庭护理任务中,均优于通用和医学专用基线,达到了最先进的性能,为下一代个性化家庭健康管理奠定了基础。

English abstract

Generative AI is reshaping healthcare, yet most existing advances rely on hospital-grade devices, which limits their accessibility and potential for health management outside clinical settings. With the proliferation of portable devices and telemedicine, healthcare is shifting toward home-based Diagnosis-It-Yourself (DIY) care. Despite this promise, several distinctive challenges remain: (i) home-collected data are heterogeneous, exacerbated by the absence of standardized large-scale datasets; (ii) models require adaptation to variable task demands and evolving individual conditions; (iii) the broad spectrum of home care tasks lacks a unified benchmark for systematic evaluation. In this paper, we present DIYHealth Suite, a comprehensive framework designed to address these challenges through a tailored dataset, model, and benchmark. We first curate DIYHealth-900K, a large-scale multimodal dataset capturing diverse real-world home care scenarios. Building on this, we propose DIYHealthGPT, an adaptive foundation model for home-based health management, powered by the novel Hybrid Hyper Low-Rank Adaptation technique. Finally, we establish DIYHealthBench, the first benchmark to evaluate foundation models on home care tasks. Extensive experiments demonstrate that DIYHealthGPT delivers state-of-the-art performance over both general-purpose and medical-specific baselines on 11 home care tasks in both open-QA and closed-QA settings, laying the groundwork for the next generation of personalized health management at home.

2606.07548 2026-06-09 cs.IR cs.AI cs.CL 交叉投稿

Evaluating Advanced Prompting on Gemini Flash for Multi-Hop Biomedical QA

评估 Gemini Flash 上的高级提示工程用于多跳生物医学问答

Ahmed Bajaber, Mohammed Alliheedi

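Affiliations: Saudi Med AI Lab (SMAIL); Prince Sultan University; Al-Baha University

AI summary: Designs a multi-component prompt (role-playing, multi-shot chain-of-thought examples, and formatting rules) for Gemini 2.0 Flash that reaches a Concept Level Score of 0.720, compared with 0.565 for the baseline prompt, and nearly matches Gemini 2.5 Flash; argues that advanced prompt design is key to unlocking LLM reasoning.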

Comments 8 pages, proceedings of the BioCreative IX Challenge and Workshop (BC9) at IJCAI 2025

详情
Journal ref Proc. BioCreative IX Workshop (BC9), IJCAI 2025, Montreal, Canada
AI中文摘要

MedHopQA 挑战为大型语言模型(LLM)提供了一个关键测试:在高风险的生物医学领域中进行复杂的多跳推理。本文详细介绍了我们对 Google Gemini Flash 模型的直接基于 API 的评估,重点关注高级提示工程的影响。我们为 Gemini 2.0 Flash 设计了一个复杂的多组件提示,结合了角色扮演、显式的多样本(multi-shot)思维链(CoT)示例和详细的格式规则。使用这个复杂提示的最佳运行获得了0.720的概念级得分。这一结果显著优于仅得0.565的基线提示。值得注意的是,在高效的 Gemini 2.0 Flash 上的性能与下一代 Gemini 2.5 Flash 的结果几乎相同。我们的发现表明,复杂的提示设计是释放现代LLM全部推理能力的关键因素。

英文摘要

The MedHopQA challenge presents a critical test for Large Language Models (LLMs): complex, multi-hop reasoning in the high-stakes biomedical domain. This paper details our direct API-based evaluation of Google's Gemini Flash models, focusing on the impact of advanced prompt engineering. We designed a sophisticated, multi-component prompt for Gemini 2.0 Flash that combined role-playing, explicit multi-shot Chain-of-Thought (CoT) examples, and detailed formatting rules. Our best run, using this complex prompt, achieved a Concept Level Score of 0.720. This result dramatically outperformed a baseline prompt which scored only 0.565. Remarkably, this performance on the efficient Gemini 2.0 Flash was almost identical to the result from the next-generation Gemini 2.5 Flash. Our findings demonstrate that sophisticated prompt design is a critical factor for unlocking the full reasoning capabilities of modern LLMs.
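
The three prompt components named in the abstract (a role, multi-shot Chain-of-Thought examples, and formatting rules) can be illustrated with a small prompt builder. This is only a sketch: the `CoTExample` type, the `build_prompt` function, and the section labels are hypothetical and are not taken from the authors' actual prompt.

```python
from dataclasses import dataclass


@dataclass
class CoTExample:
    """One worked example: a question, its reasoning hops, and the final answer."""
    question: str
    hops: list[str]
    answer: str


def build_prompt(role: str, examples: list[CoTExample], rules: list[str], question: str) -> str:
    """Assemble role text, formatting rules, and multi-shot CoT examples into one prompt."""
    parts = [role.strip(), "", "Formatting rules:"]
    parts.extend(f"- {rule}" for rule in rules)
    parts.append("")
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Question: {ex.question}")
        # Each hop is written out explicitly so the model sees the reasoning chain.
        for j, hop in enumerate(ex.hops, start=1):
            parts.append(f"Step {j}: {hop}")
        parts.append(f"Answer: {ex.answer}")
        parts.append("")
    # The new question ends with an open first step to invite step-by-step reasoning.
    parts.append(f"Question: {question}")
    parts.append("Step 1:")
    return "\n".join(parts)
```

The resulting string would then be sent to the model through whatever API client is in use; that call is deliberately left out because the abstract does not describe it.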

2606.07550 2026-06-09 cs.LG cs.AI 交叉投稿

Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark

核聚变等离子体控制的离线强化学习:代码库与基准

Yang Fu, Haomin Bao, Rohit Sonker, Xiaoyan Hu, Aravind Venugopal, Jeff Schneider, Jiayu Chen

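Affiliations: Central South University; Chongqing University; Carnegie Mellon University; The University of Hong Kong

AI summary: Introduces RL4F, a benchmark whose evaluation environments are built from historical DIII-D tokamak data, compares a range of offline RL methods on plasma control tasks, and finds that model-based offline RL methods perform best on average.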

Comments 23 pages (10 pages main text)

详情
AI中文摘要

离线强化学习(RL)为从历史托卡马克数据开发等离子体控制器提供了一条有前景的途径,因为在真实设备上进行在线试错成本高昂且风险巨大。然而,由于缺乏针对核聚变中现实多执行器、长时域等离子体控制问题的标准化离线RL基准,这一方向的进展仍然难以衡量。我们引入了RL4F,一个用于核聚变等离子体控制的离线强化学习基准,提供了闭环评估环境和四个全剖面跟踪任务(旋转、密度、温度和压力)的基线比较。评估环境背后的动力学函数基于真实托卡马克DIII-D的历史放电数据构建。我们在统一协议下评估了广泛的模仿学习和离线RL基线。我们发现,基于模型的离线RL方法在大多数目标上获得了最佳平均性能,尽管没有单一方法在所有任务中占主导地位,这突显了动力学建模在复杂、长时域等离子体控制任务中的重要性。为了促进进一步研究,我们开源了代码库、数据集和评估框架,不仅为聚变社区,也为离线RL的算法开发提供了一个基准。

英文摘要

Offline reinforcement learning (RL) offers a promising route for developing plasma controllers from historical tokamak data, since online trial-and-error on real devices is costly and risky. However, progress in this direction remains difficult to measure due to the lack of a standardized offline RL benchmark for realistic multi-actuator, long-horizon plasma control problems in nuclear fusion. We introduce RL4F, an Offline Reinforcement Learning Benchmark for Plasma Control in Nuclear Fusion, providing closed-loop evaluation environments and baseline comparisons across four full-profile tracking tasks: rotation, density, temperature, and pressure. The dynamics function underlying the evaluation environment is built from historical discharge data from DIII-D, a real-world Tokamak. We evaluate a broad set of imitation learning and offline RL baselines under a unified protocol. We find that offline model-based RL methods obtain the best average performance on most objectives, although no single method dominates all tasks, highlighting the importance of dynamics modeling in complex, long-horizon plasma control tasks. To foster further research, we open-source the codebase, datasets, and evaluation framework, providing a benchmark not only for the fusion community but also for algorithm development in offline RL.

2606.07558 2026-06-09 cs.CV cs.AI cs.DL 交叉投稿

Page image classifier fine-tuned on century-spanning archives of scanned documents for further content-specific processing

基于百年跨度扫描文档档案微调的页面图像分类器,用于进一步的内容特定处理

Kateryna Lutsai, Pavel Straňák, David Novák, Dana Křivánková

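Affiliations: Institute of Formal and Applied Linguistics, Charles University MFF; Institute of Archaeology, Czech Academy of Sciences

AI summary: To replace manual sorting, which is infeasible in large historical document digitization projects, proposes an automatic page image classifier based on visual content type (text, tables, graphics); fine-tuned deep networks reach near-perfect accuracy (RegNetY-16GF: 99.16%), and the models, dataset, and code are publicly released.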

Comments 29 pages, 19 figures, 13 tables. arXiv admin note: text overlap with arXiv:2507.21114

详情
AI中文摘要

目的:人文学科的数字化项目产生了大量、异构的历史文档档案,使得手动分类在大规模下不切实际。本工作解决基于视觉内容类型——文本、表格和图形——对扫描页面图像进行分类的自动化系统需求,从而支持内容特定的下游处理,如光学字符识别(OCR)或结构化数据提取。方法:开发了一个图像分类系统,并在来自百年历史的捷克考古档案的超过48,000张带注释的历史页面图像数据集上进行评估,通过四个连续的注释阶段和领域专家审查进行优化。使用手工制作的图像特征建立了随机森林分类器基线。随后,微调并比较了深度学习架构:卷积神经网络(EfficientNetV2、RegNetY)、视觉和文档图像变换器(ViT、DiT)以及多模态CLIP模型。与领域专家合作设计了11类标签方案,并通过五折交叉验证进行评估。结果:基于特征的基线实现了约75%的准确率。微调的CNN和变换器显著优于基线,RegNetY-16GF在保留测试集上达到99.16%的Top-1准确率,ViT-large达到99.12%。CLIP ViT-B/16通过优化文本描述达到99.14%的准确率。结论:仅图像模型,特别是RegNetY-16GF,实现了近乎完美的分类准确率,并在649,508张未标注档案页面上产生一致标签,模型间一致性超过90%。微调的CLIP尽管在测试集上具有竞争力,但在未标注数据上与仅图像模型的一致性低于65%,因此不太适合部署。最终模型、注释数据集和软件均以开源许可证公开提供。

英文摘要

Purpose: Digitization projects in the humanities produce vast, heterogeneous archives of historical documents, making manual sorting impractical at scale. This work addresses the need for an automated system to classify scanned page images based on visual content type - text, tables, and graphics - enabling content-specific downstream processing such as Optical Character Recognition (OCR) or structured data extraction. Methods: An image classification system was developed and evaluated on a dataset of over 48,000 annotated historical page images from century-old Czech archaeological archives, refined through four successive annotation stages with domain-expert review. A Random Forest Classifier baseline was established using hand-crafted image features. Subsequently, deep learning architectures were fine-tuned and compared: Convolutional Neural Networks (EfficientNetV2, RegNetY), Vision and Document Image Transformers (ViT, DiT), and multimodal CLIP models. An 11-category label scheme was designed collaboratively with domain experts and evaluated via five-fold cross-validation. Results: The feature-based baseline achieved approximately 75% accuracy. Fine-tuned CNNs and Transformers substantially outperformed it, with RegNetY-16GF achieving 99.16% and ViT-large 99.12% Top-1 accuracy on the held-out test set. CLIP ViT-B/16 reached 99.14% with optimized text descriptions. Conclusion: Image-only models, particularly RegNetY-16GF, deliver near-perfect classification accuracy and produce consistent labels across 649,508 unlabeled archival pages with over 90% inter-model agreement. Fine-tuned CLIP, despite competitive test-set accuracy, showed under 65% agreement with image-only models on unlabeled data, making it less suitable for deployment. The final models, annotated dataset, and software are publicly available under open-source licenses.
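
The inter-model agreement figures quoted above (over 90% among image-only models, under 65% for CLIP) can be read as the share of unlabeled pages on which two models assign the same label. The sketch below computes that share for one pair of models; how the authors aggregate agreement across models is not stated in the abstract, so the pairwise definition and the label strings used in any call are assumptions.

```python
def pairwise_agreement(preds_a: list[str], preds_b: list[str]) -> float:
    """Fraction of pages on which two models predict the same label."""
    if len(preds_a) != len(preds_b):
        raise ValueError("both models must label the same pages")
    if not preds_a:
        raise ValueError("no predictions to compare")
    same = sum(a == b for a, b in zip(preds_a, preds_b))
    return same / len(preds_a)
```

A low agreement value on unlabeled data flags a model whose test-set accuracy may not carry over, which is the argument the abstract makes about fine-tuned CLIP.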

2606.07590 2026-06-09 cs.CV cs.AI 交叉投稿

SlideCheck: Guiding Self-Supervised Pretraining of Pathology Foundation Models via Dataset Distributions

SlideCheck: 通过数据集分布引导病理基础模型的自监督预训练

Mingyi He, Xinyi Guo, Xitong Ling, Weiming Chen, Jiawen Li, Lianghui Zhu, Minxi Ouyang, Mingxi Fu, Yizhi Wang, Tian Guan

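Affiliations: Beijing University of Chemical Technology; South China Normal University; Tsinghua University

AI summary: Proposes SlideCheck, a tool that scores abnormality and malignancy evidence with a dual-head MLP on frozen pathology foundation model features to guide the selection of self-supervised pretraining data; experiments show that the data distribution affects downstream model performance.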

Comments 9 pages, 2 figures, 4 tables

详情
AI中文摘要

病理基础模型在大量WSI衍生补丁流上进行预训练,而数据构建过程中的监督通常是切片级别、稀疏或异质的。这种不匹配使得理解和控制哪些生物模式进入预训练数据变得困难。我们提出SlideCheck,一个轻量级的预训练数据引导工具,建立在冻结的病理基础模型补丁特征之上。SlideCheck并非作为独立的补丁诊断模型,而是提供明确的异常和恶性评分,用于组织、过滤和审计病理预训练数据。SlideCheck使用双头MLP分别建模广泛的异常形态和恶性证据。正则化的特征空间评分器为补丁级证据估计提供监督锚点,而评分-注意力一致性将补丁评分与WSI级别的MIL注意力结合,挖掘高置信度伪标签。然后使用相同的评分构建广泛阳性ViT预训练子集,其中如果异常或恶性证据超过阈值,则选择补丁。实验表明,SlideCheck定义的数据分布影响自监督ViT预训练的下游行为,表明生物组成是病理基础模型开发中的重要可控因素。精心策划的子集可以接近全数据性能,表明明确评分的补丁池可能支持更高效和可审计的预训练数据构建。这些发现将SlideCheck定位为数据引导和审计层,用于将大型未分化补丁池转化为可控和可重用的预训练数据集。

英文摘要

Pathology foundation models are pretrained on large streams of WSI-derived patches, while supervision during data construction is often slide-level, sparse, or heterogeneous. This mismatch makes it difficult to understand and control which biological patterns enter the pretraining data. We propose SlideCheck, a lightweight pretraining data guidance tool built on frozen pathology foundation model patch features. Rather than serving as a standalone patch diagnostic model, SlideCheck provides explicit abnormality and malignancy scores for organizing, filtering, and auditing pathology pretraining data. SlideCheck uses a dual-head MLP to separately model broad abnormal morphology and malignant evidence. A regularized feature-space scorer provides a supervised anchor for patch-level evidence estimation, while score-attention agreement combines patch scores with WSI-level MIL attention to mine high-confidence pseudo labels. The same scores are then used to construct broad-positive ViT pretraining subsets, where a patch is selected if either abnormality or malignancy evidence exceeds a threshold. Experiments show that SlideCheck-defined data distributions influence the downstream behavior of self-supervised ViT pretraining, indicating that biological composition is an important controllable factor in pathology foundation model development. Curated subsets can approach full-data performance, suggesting that explicitly scored patch pools may support more efficient and auditable pretraining data construction. These findings position SlideCheck as a data guidance and auditing layer for transforming large, undifferentiated patch pools into controllable and reusable pretraining datasets.
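
The broad-positive selection rule described above (keep a patch if either score exceeds a threshold) is easy to state in code. The following is a minimal sketch under stated assumptions: the `PatchScore` type, the field names, and the use of two separate thresholds with a default of 0.5 are illustrative choices, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PatchScore:
    """Scores for one patch, as a dual-head scorer might emit them (hypothetical fields)."""
    patch_id: str
    abnormality: float
    malignancy: float


def select_broad_positive(
    patches: list[PatchScore],
    abnormality_threshold: float = 0.5,
    malignancy_threshold: float = 0.5,
) -> list[PatchScore]:
    """Keep a patch if EITHER its abnormality or its malignancy score exceeds its threshold."""
    return [
        p
        for p in patches
        if p.abnormality > abnormality_threshold or p.malignancy > malignancy_threshold
    ]
```

Using a logical OR makes the subset "broad": a patch with abnormal but non-malignant morphology is still retained.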

2606.07595 2026-06-09 cs.CV cs.AI cs.IR 交叉投稿

VisualLeakBench: Reproducible Action-Boundary Propagation Failures in Vision-Language Agents

VisualLeakBench: 视觉语言智能体中可复现的动作边界传播失败

Youting Wang, Yuan Tang, Yitian Qian, Chen Zhao

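Affiliations: Nanyang Technological University

AI summary: Presents VisualLeakBench, a benchmark of action-boundary propagation failures in which vision-language agents copy sensitive text from screenshots, documents, and similar images into tool arguments; at baseline, propagation reaches 78.8% for PII cases and 85.5% for rendered unsafe-text cases.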

详情
AI中文摘要

视觉语言智能体越来越多地在写入内存、发送消息或调用外部工具之前消费截图、文档和用户界面。我们研究了这一设置中的一个具体失败模式:动作边界传播,即敏感或不安全的可见文本从图像复制到下游工具参数中。我们提出了VisualLeakBench,一个多样化的500图像基准,涵盖UI、聊天、文档、表单和仪表板场景,并在两个工作流(笔记捕获和外部交接)下使用四个生产级VLM系统评估了一个分层的100图像智能体子集。在基线情况下,目标字符串在78.8%的PII案例和85.5%的渲染不安全文本案例中被传播到工具参数中。在防御性系统提示下,渲染不安全文本传播仍然高达52.6%,而PII工具传播降至2.0%,这主要是通过抑制工具使用而非保持效用实现的。速率取决于工具表面:类似搜索的工具抑制PII传播,但渲染不安全文本仍然跨越工具边界。我们测量的是视觉到工具的传播,而非下游指令执行。我们还提供了一个标记目标预言上限诊断,将大多数失败定位在工具边界,同时将响应侧泄漏作为残余风险。

英文摘要

Vision-language agents increasingly consume screenshots, documents, and user interfaces before writing to memory, sending messages, or invoking external tools. We study a concrete failure mode in this setting: action-boundary propagation, where sensitive or unsafe visible text is copied from an image into downstream tool arguments. We present VisualLeakBench, a diversified 500-image benchmark spanning UI, chat, document, form, and dashboard scenes, and evaluate a stratified 100-image agent subset with four production VLM systems under two workflows: note capture and external handoff. At baseline, target strings are propagated into tool arguments in 78.8% of PII cases and 85.5% of rendered unsafe-text cases. Under a defensive system prompt, rendered unsafe-text propagation remains high at 52.6%, while PII tool propagation falls to 2.0%, largely by suppressing tool use rather than preserving utility. Rates are tool-surface dependent: search-like tools suppress PII propagation, but rendered unsafe text still crosses tool boundaries. We measure visual-to-tool propagation rather than downstream instruction execution. We additionally provide a labeled-target oracle upper-bound diagnostic that localizes most failures at the tool boundary while leaving response-side leakage as residual risk.
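
Action-boundary propagation, as defined above, amounts to asking whether a target string visible in the image ends up inside the arguments of a tool call. The sketch below shows one way such a check and the resulting rate could be computed. The matching rule (case-insensitive substring match after whitespace normalization) and all function names are assumptions; the benchmark's actual scoring procedure is not given in the abstract.

```python
def _strings(value):
    """Yield every string nested inside a tool-argument structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from _strings(v)
    elif isinstance(value, (list, tuple)):
        for v in value:
            yield from _strings(v)


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(text.lower().split())


def is_propagated(target: str, tool_args: dict) -> bool:
    """True if the target string appears in any string argument of one tool call."""
    needle = _normalize(target)
    return any(needle in _normalize(s) for s in _strings(tool_args))


def propagation_rate(cases: list[tuple[str, list[dict]]]) -> float:
    """Share of cases in which at least one tool call carries the target string.

    Each case is (target_string, list_of_tool_call_argument_dicts).
    """
    if not cases:
        raise ValueError("no cases to score")
    hits = sum(any(is_propagated(target, args) for args in calls) for target, calls in cases)
    return hits / len(cases)
```

Note that a case with no tool calls counts as non-propagating, which mirrors the abstract's observation that a defensive prompt can lower the rate simply by suppressing tool use.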

2606.07597 2026-06-09 cs.LG cs.AI 交叉投稿

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

重复不匹配:为什么数据混合实验无法扩展以及如何修复

Kevin Zhou, Lisa Alazraki, Kris Cao, Marek Rei

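Affiliations: Imperial College London; Cohere

AI summary: Addresses the failure of small-scale data mixture experiments to extrapolate when the repetition rate of scarce high-quality data changes with the training budget; a repetition-controlled subsampling method recovers a near-optimal mixture with 1/16 of the target token budget, showing that repetition dynamics, not scale alone, determine whether such experiments generalize.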

详情
AI中文摘要

预训练数据混合通常通过运行小规模实验并外推到目标训练预算来调整。当高质量数据稀缺且必须重复时,这种外推经常失败,但失败的根源尚未被明确定位。我们表明,一个主要原因是重复不匹配:由于高质量数据集很小,它们的重复率随着训练预算的增长而变化,以小规模代理实验未预期的方式改变最优混合。一种匹配目标重复率的子采样程序可以控制这种效应。在结合有限高质量数据和网络爬取的双源设置中,仅使用目标token的1/16的单一重复控制实验即可恢复757M参数模型的最优混合,误差在0.05以内,而无重复控制时误差为0.75。在没有重复控制的情况下达到相当的精度需要三到四个训练时域(horizon),消耗目标token预算的44%到94%。对于三个数据源,更大的混合空间需要不止一个实验来约束,但该方法仍然有效:在757M规模下,仅两个重复控制的训练时域即可恢复最优混合,优于需要完整双源实验构建的基线。我们的结果表明,重复动态(而非仅规模)决定了小规模混合实验是否泛化。更广泛地说,它们表明数据重复应被视为混合优化中的一等变量,而不是有限数据带来的不便副作用。

英文摘要

Pre-training data mixtures are commonly tuned by running small-scale experiments and extrapolating to the target training budget. When high-quality data is scarce and must be repeated, this extrapolation frequently fails, but the source of the failure has not been isolated. We show that a primary culprit is a repetition mismatch: because high-quality datasets are small, their repetition rate changes as the training budget grows, shifting the optimal mixture in ways that small-scale proxy experiments do not anticipate. A subsampling procedure that matches the target repetition rate controls for this effect. In a two-source setting combining limited high-quality data with web crawl, a single repetition-controlled experiment using only 1/16 of the target tokens recovers a mixture within 0.05 of the optimum for a 757M parameter model, compared to an error of 0.75 without repetition control. Achieving comparable accuracy without repetition control requires three to four horizons, consuming 44 to 94% of the target token budget. With three data sources, the larger mixture space requires more than a single experiment to constrain, but the approach remains effective: at the 757M scale, just two repetition-controlled horizons recover the optimal mixture, outperforming baselines that instead require the full two-source experiments to construct. Our results reveal that repetition dynamics, not scale alone, shape whether small-scale mixture experiments generalize. More broadly, they suggest that data repetition deserves treatment as a first-class variable in mixture optimization, rather than an inconvenient side effect of limited data.
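
The repetition mismatch can be made concrete with a little arithmetic. If a high-quality set of size D receives a mixture weight w under a token budget B, it is seen about w·B/D times. Shrinking B by 16 while keeping D fixed cuts the repetition rate by 16, so a proxy run sees far less repetition than the target run. The sketch below shows one plausible reading of "subsampling to match the target repetition rate": shrink the high-quality set in proportion to the budget. All numbers and function names are hypothetical, and the authors' exact procedure may differ.

```python
def repetitions(hq_tokens: int, budget_tokens: int, hq_weight: float) -> float:
    """Passes over the high-quality set: tokens drawn from it divided by its size."""
    return hq_weight * budget_tokens / hq_tokens


def matched_subsample_size(hq_tokens: int, target_budget: int, proxy_budget: int) -> int:
    """Subset size that gives the proxy run the same repetition rate as the target run."""
    return hq_tokens * proxy_budget // target_budget


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
HQ_TOKENS = 1_000_000_000          # size of the scarce high-quality set
TARGET_BUDGET = 16_000_000_000     # token budget of the target run
PROXY_BUDGET = TARGET_BUDGET // 16 # proxy run at 1/16 of the target budget
HQ_WEIGHT = 0.25                   # share of training tokens drawn from the set

target_reps = repetitions(HQ_TOKENS, TARGET_BUDGET, HQ_WEIGHT)
naive_proxy_reps = repetitions(HQ_TOKENS, PROXY_BUDGET, HQ_WEIGHT)
subset = matched_subsample_size(HQ_TOKENS, TARGET_BUDGET, PROXY_BUDGET)
controlled_proxy_reps = repetitions(subset, PROXY_BUDGET, HQ_WEIGHT)
```

With these numbers the target run repeats the high-quality set 4 times, the naive proxy only 0.25 times, and the proxy trained on the 62.5M-token subset again 4 times.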

2606.07611 2026-06-09 cs.IR cs.AI cs.LG cs.SE 交叉投稿

MIRAGE: Metadata-Integrated Repository Analysis and Guided Enhancement for MSR Datasets

MIRAGE:面向MSR数据集的元数据集成仓库分析与引导增强

Aabia Ather, Muhammad Usayd Ather, Qurat-Ul-Ain Somroo, Muhammad Khuram Shahzad

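Affiliations: SEECS, NUST

AI summary: Proposes improving the analysis of MSR datasets through metadata enrichment, FAIRness assessment, and topic-driven analysis; extends an existing dataset directory and shows that the repository hosting site and the data format influence citations and usability.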

Comments 8 pages, 8 figures

详情
AI中文摘要

本文提出了一种通过元数据丰富化、FAIR评估和主题驱动分析来改进挖掘软件仓库(MSR)数据集分析的方法。本研究在先前专门用于分析MSR数据集的数据集目录基础上进行了扩展,为数据集添加了新注释,丰富了元数据类别,并提供了更高级的过滤选项。使用Semantic Scholar API收集了2013年至2024年间发表的MSR论文的元数据。分析基于潜在狄利克雷分配(LDA)主题建模和统计分析。数据集级别的属性被纳入扩展的数据集目录,即仓库托管站点、格式、可访问性、可重用性和数据集质量。研究表明,仓库托管站点和数据格式的选择会影响引用模式和数据集可用性。此外,增强的注释方法改进了MSR数据集的分析和可发现性,支持更有效地重用和评估研究工件。

英文摘要

This paper proposes an improved approach to the analysis of Mining Software Repositories (MSR) datasets via metadata enrichment, FAIRness assessment, and topic-driven analysis. It expands an earlier dataset directory created specifically for the analysis of MSR datasets by adding new annotations to the datasets, enriching the metadata categories, and offering more advanced filtering options. The metadata of the MSR papers presented from 2013 to 2024 was gathered using the Semantic Scholar API. The analysis is based on Latent Dirichlet Allocation (LDA) topic modeling and statistical analysis. Five dataset-level attributes were incorporated into the expanded dataset directory: repository hosting site, format, accessibility, reusability, and dataset quality. The study reveals that the choice of repository hosting site and data format influences citation patterns and dataset usability. Furthermore, the enhanced annotation approach improves the analysis and discoverability of MSR datasets, supporting more effective reuse and evaluation of research artifacts.

2606.07613 2026-06-09 cs.CV cs.AI 交叉投稿

Can You Trust What You See? Human and AI Detection of Synthetic Legal Evidence

你能相信你所见的吗?人类与AI对合成法律证据的检测

Jinzhe Tan, Ali Ekber Cinar, Karim Benyekhlef

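Affiliations: Faculty of Law, McGill University

AI summary: Studies how well humans and frontier multimodal LLMs distinguish authentic photographs from AI-generated images in civil dispute scenarios, finds that neither is reliable, and proposes combining human review, MLLM screening, and provenance authentication.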

详情
AI中文摘要

视觉证据长期以来被视为可靠的法律证明形式,但人工智能(AI)的进步正在削弱这一假设。本文探讨在典型民事纠纷的以物体为中心的场景中,人类和前沿多模态大语言模型(MLLM)区分真实证据照片与AI生成照片的能力。我们构建了合成法律证据检测数据集(SLED-1400),包含200张真实证据图像及由六种当代文本到图像生成器生成的1200张合成图像,涵盖十类证据。在受控网络实验中,136名普通参与者与四种MLLM(GPT-5.1、Gemini-3-Pro、Gemini-3-Flash、Qwen3-VL-235B)使用相同的刺激和响应格式进行评估。人类总体准确率为64.8%,在最强两个生成器(Gemini-3-Pro-Image和Flux-2-Max)上分别为48.5%和51.0%,与随机猜测无异。MLLM从未错误分类真实图像(100%特异性),但漏检了大部分来自较难生成器的合成输出,在Gemini-3-Pro-Image输出上平均检测率仅为5.9%。人类与MLLM的错误基本不相关,而四种MLLM之间高度相关。两个群体均不能作为可靠的独立验证者。我们认为,法律程序中的视觉证据应被视为本质上可争议的,可行的程序性应对必须结合训练有素的人工审查、MLLM筛查以及C2PA内容凭证等来源基础设施。

英文摘要

Visual evidence has long been treated as a reliable form of legal proof, but advances in artificial intelligence (AI) are undermining that assumption. This article asks how well humans and frontier multimodal large language models (MLLMs) can distinguish authentic evidentiary photographs from AI-generated counterparts in the object-centric scenarios typical of civil disputes. We built Synthetic Legal Evidence Detection (SLED-1400), a dataset of 200 authentic evidence images paired with 1,200 synthetic counterparts produced by six contemporary text-to-image generators across ten evidence categories. The same stimuli and response format were used in a controlled web experiment with 136 lay participants and in a standardized evaluation of four MLLMs (GPT-5.1, Gemini-3-Pro, Gemini-3-Flash, Qwen3-VL-235B). Human accuracy was 64.8% overall, and 48.5% and 51.0% on the two strongest generators (Gemini-3-Pro-Image and Flux-2-Max), indistinguishable from chance. MLLMs never misclassified an authentic image (100% specificity), but missed most synthetic outputs from the harder generators, with average MLLM detection at 5.9% on Gemini-3-Pro-Image outputs. Human and MLLM errors were largely uncorrelated, while the four MLLMs were strongly correlated with each other. Neither group is a reliable standalone authenticator. We argue that visual evidence in legal proceedings should be treated as inherently contestable, and that a workable procedural response must combine trained human review, MLLM screening, and provenance infrastructure such as C2PA Content Credentials.

2606.07616 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

项目反应缩放定律:一种高效且可泛化的神经缩放估计的测量理论方法

Sang Truong, Yuheng Tu, Rylan Schaeffer, Sanmi Koyejo

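AI summary: Proposes Item Response Scaling Laws (IRSL), which integrate Item Response Theory into the scaling law framework; a Beta-IRT model uses the probabilistic responses of language models and reduces parameter complexity from O(M×N) to O(M+N), giving reliable estimates from only 50 questions per benchmark in both pre-training and test-time scaling.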

详情
AI中文摘要

缩放定律为理解语言模型(LM)的性能提供了基本框架,但推导它们需要在数千个检查点或数百万个推理样本上进行成本高昂的评估。为了解决这个问题,我们引入了项目反应缩放定律(IRSL),这是一个将项目反应理论(IRT)整合到缩放定律框架中的统一框架。与将每个模型-基准对单独处理的传统方法不同,IRSL将潜在模型能力与问题特征分离,将M个模型和N个问题的缩放定律估计分解,从而将参数复杂度从O(M×N)显著降低到O(M+N)。我们使用Beta-IRT实例化IRSL,它利用LM的经验概率响应——例如预训练中的token概率和测试时采样中的通过率——来捕获比二元响应更丰富的信号。我们在两种常见的缩放范式上验证了我们的方法:(1)预训练下游缩放,使用来自10个基准的6,612个LM检查点和37,682个问题;以及(2)测试时缩放,使用来自4个基准的12个LM和120个问题,每个问题最多2,500个样本。在现有模型响应上进行一次性校准后,IRSL仅使用每个基准50个问题(减少99.9%)即可产生更可靠的缩放估计,达到与传统方法相当或更优的决策准确性。此外,我们表明估计的潜在模型能力是可泛化的,从而能够跨共享相同测量目标的基准进行准确的性能预测。

英文摘要

Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference samples. To address this, we introduce Item Response Scaling Laws (IRSL), a unified framework that integrates Item Response Theory (IRT) within the scaling law framework. Unlike traditional approaches that treat each model-benchmark pair in isolation, IRSL disentangles latent model ability from question characteristics, factorizing the scaling law estimation for $M$ models and $N$ questions to significantly reduce parameter complexity from $O(M \times N)$ to $O(M + N)$. We instantiate IRSL with Beta-IRT, which leverages the empirical probability responses of LMs -- such as token probabilities in pre-training and pass rates in test-time sampling -- to capture richer signals than binary responses. We validate our approach across two prevalent scaling paradigms: (1) pre-training downstream scaling, using 6,612 LM checkpoints and 37,682 questions from 10 benchmarks; and (2) test-time scaling, using 12 LMs and 120 questions from 4 benchmarks with up to 2,500 samples per question. Given a one-time calibration on existing model responses, IRSL yields more reliable scaling estimates using only 50 questions per benchmark (a 99.9\% reduction), achieving comparable or superior decision accuracy to traditional approaches. Furthermore, we show that the estimated latent model abilities are generalizable, enabling accurate performance forecasting across benchmarks that share the same measurement objective.
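
The drop from O(M×N) to O(M+N) follows from giving each model one latent ability and each question its own characteristics, instead of fitting every model-question pair separately. The sketch below uses a one-parameter logistic (Rasch-style) response function purely as a stand-in; it is not the Beta-IRT model used in the paper, and the function names are hypothetical.

```python
import math


def expected_response(ability: float, difficulty: float) -> float:
    """Expected response in (0, 1) under a one-parameter logistic IRT model (a simplification)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def parameter_counts(num_models: int, num_questions: int) -> dict[str, int]:
    """Compare per-pair fitting with the factorized ability/difficulty form."""
    return {
        "per_pair": num_models * num_questions,    # one quantity per (model, question)
        "factorized": num_models + num_questions,  # one ability per model, one difficulty per question
    }
```

For this simplified model, the 6,612 checkpoints and 37,682 questions quoted in the abstract give 44,294 parameters in factorized form, against 249,153,384 model-question pairs.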

2606.07640 2026-06-09 cs.CV cs.AI cs.LG 交叉投稿

No Free Lunch for Synthetic Images under Data Scarcity Conditions

数据稀缺条件下的合成图像:没有免费的午餐

Borja Arroyo Galende, Alejandro Almodóvar, Patricia A. Apellániz, Juan Parras, Silvia Uribe, Santiago Zazo

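Affiliations: Universidad Politécnica de Madrid; Universidad de Alcalá

AI summary: Studies the trade-offs among fidelity, privacy, and utility of synthetic data under data scarcity and privacy sensitivity; proposes a joint evaluation framework, compares VAE, GAN, and DDPM on three image datasets, and finds GAN and DDPM more robust under differential privacy.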

详情
AI中文摘要

本研究探讨了在数据稀缺和隐私敏感条件下,合成数据生成中保真度、隐私和效用之间的权衡。我们提出了一个联合评估这三个维度的框架,并将其应用于三种广泛使用的生成模型:VAE、GAN和DDPM。评估涵盖三个图像数据集:MNIST、OCTMNIST和OrganAMNIST,包括通用和医学成像领域。在训练过程中引入差分隐私机制时,三种模型的行为出现了显著差异。GAN和DDPM表现出更强的鲁棒性,在一系列噪声水平下保持较高的保真度和下游效用,而VAE随着隐私约束的增加而更快地退化。本研究强调了深度生成模型多维评估的重要性,并指出应用隐私技术时它们的行为存在显著差异。

英文摘要

This study investigates the trade-offs between fidelity, privacy, and utility in synthetic data generation under conditions of data scarcity and privacy sensitivity. We propose an evaluation framework that jointly assesses these three dimensions and apply it to three widely used generative models, VAE, GAN, and DDPM. The evaluation spans three image datasets, MNIST, OCTMNIST, and OrganAMNIST, encompassing both general-purpose and medical imaging domains. Notable differences arise between the three models in their behaviour when differential privacy mechanisms are introduced during training. GAN and DDPM demonstrate greater robustness, maintaining higher fidelity and downstream utility across a range of noise levels, while VAE degrades more rapidly as privacy constraints increase. This study highlights the importance of a multidimensional evaluation of deep generative models, also noting that their behaviour significantly differs when privacy techniques are applied.

2606.07643 2026-06-09 cs.CV cs.AI cs.SD eess.AS 交叉投稿

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

AVI-Bench:迈向全模态大语言模型的人类级视听智能

Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du, Cheng Liang, Weijun Wang, Yuanchao Li, Guangyao Li, Hao Fei, Yuanchun Li, Henghui Ding, Yunxin Liu

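Affiliations: Nanyang Technological University

AI summary: Introduces AVI-Bench, which evaluates the audio-visual intelligence of Omni-MLLMs with cross-modal tasks across three stages (perception, understanding, reasoning), adds AVI-Bench-PriSe to test primitive audio-visual sensation, reveals limitations of current models, and proposes a four-level AVI taxonomy.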

Comments 31 pages, 8 figures, ICML 2026

详情
AI中文摘要

近期全模态大语言模型(Omni-MLLMs)的进展实现了视觉、音频和语言的强集成。然而,由于缺乏系统全面的基准,其视听智能(AVI)仍未被充分评估。我们提出AVI-Bench,一个受认知启发的基准,通过需要联合视听解释的跨模态任务,在感知、理解和推理三个阶段评估Omni-MLLMs。该设计能够细粒度诊断模型能力和失败模式。为进一步评估超出熟悉领域的鲁棒性,我们提出AVI-Bench-PriSe,一个扩展版本,使用不熟悉的、低语义刺激探测模型的原始视听感知,测试超出常见训练分布的泛化能力。对开源和闭源模型的大量实验揭示了当前Omni-MLLMs的显著局限性。基于这些发现,我们提出了一个四级AVI分类体系。总体而言,AVI-Bench提供了一个原则性的评估框架,以指导更鲁棒和可泛化AVI的发展。项目网站:https://fudancvl.github.io/AVI-Bench/

英文摘要

Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks. We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/

2606.07645 2026-06-09 cs.CV cs.AI 交叉投稿

FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction

FineGen:基于VLM的多智能体框架用于细粒度图像-文本数据集构建

Chang Kong, Yuebing Li, Peng Mo, Haigang Zhang, Qiuming Luo

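Affiliations: Shenzhen Polytechnic University; Institute of Applied Artificial Intelligence of the Guangdong-Hong Kong Macao Greater Bay Area; Shenzhen University

AI summary: Proposes FineGen, a framework that automatically builds fine-grained datasets with hard negatives through a Generation-Verification-Correction pipeline with closed-loop feedback; FineGen-100K, built from ImageNet, improves hard-sample accuracy by 14.4% after fine-tuning.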

Comments 15 pages, 2 figures, conference

详情
AI中文摘要

当前视觉-语言数据集中硬负样本的稀缺严重阻碍了细粒度感知。为此,我们提出FineGen,一种基于VLM的多智能体框架,用于自动化数据集构建。通过采用协作的生成-验证-校正流水线及闭环反馈机制,FineGen确保合成的硬负样本在语义上有效且与视觉内容严格矛盾。将其应用于ImageNet,我们构建了FineGen-100K,一个包含超过147,000个属性特定硬负样本的分层数据集,正负样本比严格为1:10。广泛评估证实了96.7%的属性有效性。关键的是,在FG-OVD基准上的下游验证表明,在FineGen-100K上微调后,硬样本准确率大幅提升14.4%,显著优于现有最先进方法。

英文摘要

The scarcity of hard negative samples in current vision-language datasets significantly hinders fine-grained perception. To address this, we propose FineGen, a VLM-based Multi-Agent framework for automated dataset construction. By employing a collaborative Generation-Verification-Correction pipeline with a closed-loop feedback mechanism, FineGen ensures synthesized hard negatives are semantically valid yet strictly contradictory to visual content. Applying this to ImageNet, we construct FineGen-100K, a hierarchical dataset containing over 147,000 attribute-specific hard negatives with a rigorous 1:10 positive-to-negative ratio. Extensive evaluations confirm a 96.7% attribute validity rate. Crucially, downstream validation on the FG-OVD benchmark shows that fine-tuning on FineGen-100K yields a substantial +14.4% accuracy improvement on hard samples, significantly outperforming state-of-the-art methods.

2606.07653 2026-06-09 cs.CV cs.AI 交叉投稿

A Dataset for Dynamic Human Preferences for Vision Language Models

面向视觉语言模型的动态人类偏好数据集

Hannah Gao, Dylan Hadfield-Menell, Rachel Ma

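Affiliations: Massachusetts Institute of Technology

AI summary: Proposes a benchmark that evaluates how well vision language models understand dynamic human preferences, provides an automated pipeline that generates the dataset with variations in image dependence, and evaluates existing models.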

详情
AI中文摘要

鉴于视觉语言模型(VLM)在人机交互场景中的广泛应用,评估这些模型适应不同用户实时偏好的能力变得重要。尽管近年来引入了越来越多的视觉语言基准,但它们主要侧重于评估静态能力和从大量训练数据中学习的一般偏好。本文引入了一个新的基准,用于评估VLM理解动态人类偏好的能力,即在推理时通过上下文传递的偏好。我们提供了一个自动化管道来生成该基准,包含图像依赖变化、动态多模态人类偏好数据集,并对最新模型在新基准上的表现进行了评估。

英文摘要

Given the increased adoption of Vision Language Models (VLMs) in human-interactive settings, it is important to evaluate how well these models adapt to the real-time preferences of different users. Although a growing number of vision-language benchmarks have been introduced recently, they focus largely on static capabilities and on generally held preferences learned from extensive training data. This work introduces a new benchmark for evaluating the ability of VLMs to understand dynamic human preferences, i.e., preferences that are passed in context at inference time. We provide an automated pipeline for generating this benchmark with variations in image dependence, a dynamic multimodal human-preference dataset, and evaluations of state-of-the-art models on the new benchmark.

2606.07654 2026-06-09 cs.CV cs.AI 交叉投稿

MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework

MM-Matryoshka:通过二维多模态套娃训练框架实现预算弹性视觉文档检索

Haowen Xiang, Yibo Yan, Jiahao Huo, Yu Huang, Yi Cao, Mingdong Ou, Xuming Hu

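Affiliations: Hong Kong University of Science and Technology (Guangzhou); Alibaba Cloud Computing; Hong Kong University of Science and Technology

AI summary: Proposes MM-Matryoshka, a 2D Matryoshka training framework that lets a single visual document retriever choose its budget along both vector dimension and encoder depth, without training a separate model for each budget.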

详情
AI中文摘要

多向量视觉文档检索器通过深度视觉语言模型(VLM)为每个页面生成多个向量,实现强大的细粒度匹配,但这种设计在存储和计算开销上导致部署成本高昂。现有效率技术通常只优化预算的一部分,使得多模态检索器缺乏统一的方法来权衡精度与向量宽度和编码器深度。因此,我们提出MM-Matryoshka,一种用于预算弹性视觉文档检索(VDR)的二维套娃训练框架,使ColPali风格的多向量检索在维度和层两个方向上实现弹性。在推理时,单个检索器可以选择二维可调预算,无需为不同预算训练独立模型。通过在多个代表性骨干网络上的全面实验,我们证明MM-Matryoshka在显著降低存储和计算开销的同时,保留了比直接截断基线高得多的质量,从而为高效VDR提供了稳健的预算弹性。

英文摘要

Multi-vector visual document retrievers achieve strong fine-grained matching by representing each page with multiple vectors from deep Vision-Language Models (VLMs), but this design makes deployment expensive in both storage and computation. Existing efficiency techniques usually optimize only part of this budget, leaving multimodal retrievers without a unified way to trade accuracy against both vector width and encoder depth. We therefore propose MM-Matryoshka, a 2D Matryoshka training framework for budget-elastic Visual Document Retrieval (VDR) that makes ColPali-style multi-vector retrieval elastic along both the dimension and layer axes. At inference time, a single retriever can select a budget along both axes without training separate models for different budgets. Comprehensive experiments across multiple representative backbones show that MM-Matryoshka retains significantly higher quality than direct truncation baselines while substantially reducing storage and computational overhead, offering robust budget elasticity for efficient VDR.
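
The dimension axis of the budget can be illustrated with late-interaction (MaxSim) scoring, the scoring rule used by ColBERT-style and ColPali-style retrievers: each query vector is matched to its best page vector and the maxima are summed. The sketch below truncates every vector to its first `dim` coordinates before scoring, which is the "direct truncation" that Matryoshka-style training is meant to make safe. It is written in pure Python for readability, the function names are hypothetical, and the layer (encoder depth) axis is not shown.

```python
Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def truncate(vectors: list[Vector], dim: int) -> list[Vector]:
    """Keep only the first `dim` coordinates of every vector."""
    return [v[:dim] for v in vectors]


def maxsim(query_vecs: list[Vector], page_vecs: list[Vector]) -> float:
    """Late-interaction score: sum over query vectors of the best-matching page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def storage_floats(page_vecs: list[Vector]) -> int:
    """Number of stored floats for one page, a rough proxy for index size."""
    return sum(len(v) for v in page_vecs)
```

Halving `dim` halves `storage_floats` for a page; whether retrieval quality survives that cut depends on how the model was trained, which is the question the paper addresses.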

2606.07682 2026-06-09 cs.SE cs.AI 交叉投稿

SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?

SWE-Marathon: 智能体能否自主完成超长时程软件工作?

Rishi Desai, Jesse Hu, Joan Cabezas, Neel Harsola, Pratyush Shukla, Roey Ben Chaim, Adnan El Assadi, Omkaar Mukund Kamath, Fenil Faldu, Prannay Hebbar, Jiankai Sun, Yiyuan Li, Pramod Srinivasan, Ishan Gupta, Christopher Settles, Daniel Wang, Derek Chen, Pranav Raja, Albert Liu, Marek Šuppa, Nevasini Sasikumar, Luyang Kong, Erik Quintanilla, Xiangyi Li, Ivan Bercovich, Steven Dillmann

Affiliations: Abundant Zenity; Harvard University; University of Waterloo; Gujarat Technological University; Warping; Stanford University; UNC-Chapel Hill; Independent; Refresh Soleda AI Near AI; Georgia Tech; Comenius University in Bratislava; UC San Diego; BenchFlow; UC Santa Barbara

AI summary: Introduces SWE-Marathon, a benchmark of 20 ultra-long-horizon tasks whose logged attempts average 27.2M tokens, to evaluate agents' planning, long-context understanding, and memory use; current frontier coding agents solve fewer than 30% of the tasks.

详情
AI中文摘要

AI智能体越来越被期望完成需要持续数小时、数百万token和复杂环境的长时程工作流。然而,当前的智能体基准主要评估短时任务,例如单个拉取请求、小票或5-10分钟的练习,限制了我们在规划、长上下文理解和记忆使用方面衡量智能体能力的能力。我们引入了SWE-Marathon,一个包含20个长时程任务的基准,涵盖软件工程和相邻技术领域。每个任务包括一个独特的可执行环境、一个人工编写的参考解决方案和一个多层验证套件。记录的智能体尝试平均消耗2720万总token,使得SWE-Marathon比现有的SWE和命令行智能体基准的时程显著更长。当前前沿编码智能体解决了不到30%的任务。失败通常源于自我验证不足、自我报告不可行以及过早终止。我们还在13.8%的滚动中观察到奖励黑客行为,其中智能体试图利用环境或验证器绕过预期工作流。SWE-Marathon包括对测试套件和执行环境的对抗性审查,以及旨在防止捷径解决方案的多层检查。我们在https://swe-marathon.org/上发布SWE-Marathon、评估代码和智能体轨迹。

英文摘要

AI agents are increasingly expected to complete long-horizon workflows that require sustained progress over hours, millions of tokens, and complex environments. Yet current agent benchmarks largely evaluate short-form tasks, such as single pull requests, small tickets, or 5-10 minute exercises, limiting our ability to measure agents' capabilities in planning, long-context understanding, and memory use. We introduce SWE-Marathon, a benchmark of 20 long-horizon tasks spanning software engineering and adjacent technical domains. Each task consists of a unique executable environment, a human-written reference solution, and a multi-layer verification suite. Logged agent attempts average 27.2M total tokens, making SWE-Marathon substantially longer-horizon than existing SWE and command-line agent benchmarks. Current frontier coding agents solve fewer than 30% of tasks. Failures often arise from poor self-verification, self-reported infeasibility, and premature termination. We also observe reward-hacking behavior in 13.8% of rollouts, where agents attempt to exploit the environment or verifier to bypass the intended workflow. SWE-Marathon includes adversarial review of test suites and execution environments, as well as multi-layer checks designed to prevent shortcut solutions. We release SWE-Marathon, evaluation code, and agent trajectories at https://swe-marathon.org/.

2606.07687 2026-06-09 cs.CV cs.AI cross-list

What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction

什么使视频世界模型潜在空间与动作相关:预测优于重建

Jewon Yeom, Hanseul Kim, Jeongjae Park, Sungmok Jung, Jaejin Lee, Taesup Kim

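Affiliations * Graduate School of Data Science, Seoul National University

AI summary A unified probe-based evaluation finds that action-relevant structure is driven primarily by temporal video pretraining rather than pixel reconstruction fidelity; video-pretrained self-supervised encoders achieve the best Pareto trade-off between visual fidelity and action prediction.

Details
AI Chinese abstract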

视频世界模型越来越多地用于提供预测性视觉表示,但尚不清楚哪些预训练信号在其潜在空间中诱导出与动作相关的结构。我们通过跨多种编码器家族的统一探针评估来研究这个问题,包括仅图像自监督、带或不带潜在预测的视频预训练、基于重建的自编码器、扩散模型以及捷径强制动力学模型。使用共同的逆动力学探针目标,我们发现动作相关结构主要由时间视频预训练驱动,而非像素重建保真度:具有强像素解码质量的模型可能表现出接近零的动作可恢复性,而视频预训练的自监督编码器在视觉保真度和动作预测之间始终实现最佳帕累托权衡。比较V-JEPA和VideoMAE进一步表明,大部分收益来自自然视频时间上下文,特征级潜在预测提供了较小的额外收益。这些趋势在机器人基准测试中转移,尽管CALVIN显示静态环境任务可以通过允许强图像先验来部分掩盖时间结构的重要性。最后,逆动力学监督显著提高了对视觉损坏的鲁棒性,表明动作感知目标正则化了潜在几何,超越了干净环境性能。我们的结果确定时间预测结构——而非重建保真度——是动作相关视频表示的主要成分。

English abstract

Video world models are increasingly used to provide predictive visual representations, yet it remains unclear which pretraining signals induce action-relevant structure in their latent spaces. We study this question through a unified probe-based evaluation across diverse encoder families, including image-only self-supervision, video pretraining with and without latent prediction, reconstruction-based autoencoders, diffusion models, and shortcut-forcing dynamics models. Using a common inverse-dynamics probing objective, we find that action-relevant structure is driven primarily by temporal video pretraining rather than pixel reconstruction fidelity: models with strong pixel decoding quality can exhibit near-zero action recoverability, while video-pretrained self-supervised encoders consistently achieve the best Pareto trade-off between visual fidelity and action prediction. Comparing V-JEPA and VideoMAE further shows that most gains arise from natural-video temporal context, with feature-level latent prediction providing a smaller additional benefit. These trends transfer across robotic benchmarks, though CALVIN reveals that static-environment tasks can partially mask the importance of temporal structure by allowing strong image priors to suffice. Finally, inverse-dynamics supervision substantially improves robustness to visual corruption, suggesting that action-aware objectives regularize latent geometry beyond clean-setting performance. Our results identify temporal predictive structure -- not reconstruction fidelity -- as the primary ingredient underlying action-relevant video representations.

2606.07706 2026-06-09 cs.CR cs.AI cross-list

MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models

MLingualFC: 评估多语言视觉语言模型中的越狱漏洞

Rishabh Makwana, Mamta, Deeksha Varshney, Oana Cocarascu

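Affiliations * Dwarkadas Jivanlal Sanghvi College of Engineering; King’s College London; Indian Institute of Technology Jodhpur

AI summary Introduces MLingualFC, a multilingual multimodal benchmark that encodes harmful instructions as flowcharts to evaluate jailbreak vulnerabilities of multilingual VLMs; attack success rates are substantially higher for Latin-script languages than for non-Latin-script ones, exposing safety mechanisms that fail to generalize across languages.

Details
AI Chinese abstract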

视觉语言模型(VLM)在多模态任务中表现出色,但其安全鲁棒性仍是一个开放挑战。虽然先前工作表明结构化视觉提示(如流程图)可以有效越狱VLM,但现有研究主要局限于以英语为中心的设置。本文中,我们介绍MLingualFC,一个多语言多模态基准,旨在使用结构化流程图表示评估VLM在不同语言下的越狱漏洞。MLingualFC将有害指令编码为五种语言(印地语、旁遮普语、西班牙语、罗马尼亚语和德语)的流程图图像。我们在黑盒威胁模型下评估了最先进的多语言VLM,包括Qwen2.5-VL、Gemma-4和Pangea。我们的结果揭示了显著的多语言安全差距。基于流程图的攻击在使用拉丁字母书写的语言中实现了高攻击成功率(ASR),表明有害内容的视觉编码有效绕过了跨语言的安全对齐。相反,使用非拉丁字母书写的语言(如旁遮普语)的ASR显著较低,这表明其原因可能在于视觉文本识别能力受限,而非安全对齐更强。这些发现突显了当前VLM安全机制未能跨语言和模态泛化。资源可在https://github.com/Rishabhpm23/MLingualFC获取。

English abstract

Vision-Language Models (VLMs) have demonstrated strong performance across multimodal tasks, yet their safety robustness remains an open challenge. While prior work has shown that structured visual prompts such as flowcharts can effectively jailbreak VLMs, existing studies are largely limited to English-centric settings. In this paper, we introduce MLingualFC, a multilingual multimodal benchmark designed to evaluate jailbreak vulnerabilities of VLMs across diverse languages using structured flowchart representations. MLingualFC encodes harmful instructions into flowchart images across five languages (Hindi, Punjabi, Spanish, Romanian, and German). We evaluate state-of-the-art multilingual VLMs, including Qwen2.5-VL, Gemma-4, and Pangea, under a black-box threat model. Our results reveal significant multilingual safety gaps. Flowchart-based attacks achieve high attack success rates (ASR) in case of Latin script languages, demonstrating that visual encoding of harmful content effectively bypasses safety alignment across languages. In contrast, non-Latin script languages such as Punjabi exhibit substantially lower ASR, suggesting potential limitations in visual text recognition rather than stronger safety alignment. These findings highlight that current VLM safety mechanisms fail to generalize across languages and modalities. Resources are available at https://github.com/Rishabhpm23/MLingualFC

2606.07708 2026-06-09 cs.CV cs.AI cross-list

Cross-View Urban Traffic Dataset: Drone-Supervised Ground Truth for Monocular Bird's-Eye View Localization

跨视角城市交通数据集:用于单目鸟瞰图定位的无人机监督地面真值

Prakhar Bhardwaj, Simone Weikl, Kilian Mang, Elia Jonas Sandtner

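Affiliations * OTH Regensburg

AI summary Presents a cross-view urban traffic dataset built from synchronized ego-centric bicycle videos and aerial drone videos; it supports cross-view identity matching and bird's-eye-view prediction, with identity-level alignment and standardized evaluation.

Details
AI Chinese abstract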

我们介绍了一个从真实城市交叉口同步的自行车视角视频和无人机航拍视频构建的跨视角城市交通感知数据集和基准。该基准针对两个关联任务:街景和无人机视角目标轨迹之间的跨视角身份匹配,以及利用空中监督的自我到鸟瞰图预测。与先前的城市驾驶和V2X数据集相比,我们的基准提供了跨截然不同视角的身份级对齐,以及标准化评估、标注工具和基线实现。这一设置源于以交叉口为中心的交通分析,其中身份保持、局部交互和全局空间结构必须跨视角联合推理。我们在轨迹和帧级别评估方法,包括跨视角ID精确率/召回率/IDF1、近远分解、时间稳定性和一致性指标。我们还提供了基于楔形的跨视角匹配以及三种BEV预测基线(逆透视映射、MonoLayout风格学习基线和回归基线)的基线结果。结果表明该基准可行但具有挑战性:跨视角匹配实现了高召回率,但仍受过度分配和时间不一致性的限制,而自我到BEV预测受益于空中监督,但在轻量级单目感知下远未饱和。我们希望该基准能支持跨视角感知、城市场景对齐和自我到全局交通理解的未来研究。

English abstract

We introduce a dataset and benchmark for cross-view urban traffic perception built from synchronized ego-centric bicycle videos and aerial drone videos recorded at real urban intersections. The benchmark targets two linked tasks: cross-view identity matching between street-view and drone-view object tracks, and ego-to-bird's-eye-view prediction using aerial supervision. In contrast to prior urban driving and V2X datasets, our benchmark provides identity-level alignment across radically different viewpoints together with standardized evaluation, annotation tooling, and baseline implementations. This setting is motivated by intersection-centric traffic analysis, where identity preservation, local interactions, and global spatial structure must be reasoned about jointly across views. We evaluate methods at both the track and frame levels, including cross-view ID precision/recall/IDF1, near-far breakdowns, temporal stability, and consistency metrics. We also provide baseline results for wedge-based cross-view matching and for three BEV prediction baselines: inverse perspective mapping, a MonoLayout-style learned baseline, and a regression baseline. The results show that the benchmark is feasible but challenging: cross-view matching achieves strong recall yet remains limited by over-assignment and temporal inconsistency, while ego-to-BEV prediction benefits from aerial supervision but remains far from saturated under lightweight monocular sensing. We hope that this benchmark will support future research on cross-view perception, urban scene alignment, and ego-to-global traffic understanding.
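
The identity metrics named in this abstract can be made concrete with a small worked example. The sketch below uses the standard IDF1 definition from the multi-object tracking literature, which may differ in detail from the benchmark's own cross-view variant. The function name `id_metrics` and the counts are hypothetical, chosen only to show how over-assignment lowers ID precision and IDF1 while recall stays high.

```python
def id_metrics(idtp, idfp, idfn):
    """Return (ID precision, ID recall, IDF1) from identity-level counts.

    idtp: detections assigned the correct identity
    idfp: detections assigned an identity they should not have (over-assignment)
    idfn: ground-truth detections left without their correct identity
    """
    precision = idtp / (idtp + idfp)
    recall = idtp / (idtp + idfn)
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)
    return precision, recall, idf1


# Hypothetical counts for illustration, not results from the paper.
precision, recall, idf1 = id_metrics(idtp=80, idfp=40, idfn=10)
print(f"IDP={precision:.3f} IDR={recall:.3f} IDF1={idf1:.3f}")
```

Because IDF1 is the harmonic mean of ID precision and ID recall, a matcher that assigns too many identities is penalized even when it rarely misses a true match.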

2606.07771 2026-06-09 astro-ph.IM astro-ph.GA cs.AI cross-list

Beyond Point Estimates: Benchmarking Uncertainty Quantification Methods on the AION-1 Astronomical Foundation Model

超越点估计:在AION-1天文基础模型上基准测试不确定性量化方法

Karla Tame-Narvaez, Aleksandra Ćiprijanović, Shubhendu Trivedi

Affiliations * Scientific Computing Division, Fermi National Accelerator Laboratory; Fermi National Accelerator Laboratory; Department of Astronomy and Astrophysics, University of Chicago; NSF and Simons SkAI Institute; Google DeepMind

AI summary Compares seven uncertainty quantification methods on frozen AION-1 foundation-model embeddings and finds that conformal prediction, especially the LVD framework, provides reliable marginal and local coverage for galaxy property regression, outperforming non-conformal baselines.

Comments 7 pages, 1 table, 1 figure

Details
Journal ref
Contribution to Conference on Physics and AI at Stanford University (PAI 2026)
AI Chinese abstract

天文巡天的基础模型提供了强大的学习表示,可迁移到星系属性估计等下游回归任务。然而,仅有点预测不足以进行科学推理;可靠的不确定性量化(UQ)至关重要。我们使用冻结的AION-1基础模型嵌入,在星系属性回归上比较了七种UQ方法,从Legacy Survey测光/成像和DESI光谱预测红移、恒星质量、星族年龄、气相金属丰度和比恒星形成率,标签来自PROVABGS。无分布共形方法在所有属性上实现了约1个百分点内的名义90%边际覆盖,而非共形基线(深度集成、MC Dropout)无法可靠校准。在共形方法中,共形分位数回归(CQR)在模型预测最差的区间内提供了最佳覆盖。更重要的是,只有局部有效且可判别(LVD)框架——特别是在AION-1嵌入上运行时——还提供了有限样本的局部有效性,生成的区间适应每个星系的局部预测难度,而不是仅依赖边际保证。这些结果确立了共形预测,特别是LVD,作为天体物理学中基础模型嵌入上不确定性感知推理的首选UQ框架。

English abstract

Foundation models for astronomical surveys offer powerful learned representations that can be transferred to downstream regression tasks such as galaxy property estimation. However, point predictions alone are insufficient for scientific inference; reliable uncertainty quantification (UQ) is essential. We compare seven UQ methods on galaxy property regression using frozen AION-1 foundation-model embeddings, predicting redshift, stellar mass, stellar-population age, gas-phase metallicity, and specific star-formation rate, from Legacy Survey photometry/imaging and DESI spectra, with PROVABGS-derived labels. Distribution-free conformal methods achieve marginal coverage within ~1 pp of the nominal 90% across all properties, while non-conformal baselines (Deep Ensembles, MC Dropout) fail to calibrate reliably. Among conformal approaches, Conformalized Quantile Regression (CQR) delivers the best coverage in the bin with the poorest model predictions. More importantly, only the Locally Valid and Discriminative (LVD) framework -- particularly when operating on AION-1 embeddings -- also provides finite-sample local validity, producing intervals that adapt to each galaxy's local prediction difficulty rather than relying on marginal guarantees alone. These results establish conformal prediction, and LVD in particular, as the preferred UQ framework for uncertainty-aware inference on foundation-model embeddings in astrophysics.
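
To illustrate what a nominal 90% marginal coverage guarantee means, here is a minimal sketch of split conformal prediction with absolute-residual scores. This is the simplest distribution-free variant, not the CQR or LVD procedures the paper evaluates. The `predict` function and the synthetic data are stand-ins (assumptions) for a regression head on frozen embeddings.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank in sorted order
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def predict(x):
    # Stand-in for a regression head on frozen embeddings (assumption).
    return 2.0 * x


def sample(rng, n):
    # Synthetic data: linear signal plus Gaussian noise.
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


rng = random.Random(0)
calib_x, calib_y = sample(rng, 500)
test_x, test_y = sample(rng, 2000)

# Nonconformity score: absolute residual on held-out calibration data.
scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
q = conformal_quantile(scores, alpha=0.1)

# The interval for a new point is [predict(x) - q, predict(x) + q].
covered = sum(abs(y - predict(x)) <= q for x, y in zip(test_x, test_y))
coverage = covered / len(test_x)
print(f"half-width={q:.3f} empirical coverage={coverage:.3f}")
```

Every interval here has the same width, so the guarantee is marginal only. Adapting the width to each input's difficulty is the gap that the abstract attributes to CQR and LVD.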

2606.07810 2026-06-09 cs.CL cs.AI cs.LG cross-list

SLMJury: Can Small Language Models Judge as Well as Large Ones?

SLMJury:小型语言模型能否像大型模型一样进行评判?

Anish Laddha, Nitesh Pradhan, Gaurav Srivastava

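Affiliations * LNMIIT; Virginia Tech

AI summary Introduces SLMJury, a framework for evaluating small language models as judges; findings include a domain-dependent overthinking effect, differences in domain generalization across model families, a split between closed-ended and open-ended judging ability, and accuracy degradation under multi-agent debate.

Details
AI Chinese abstract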

大型语言模型(LLMs)被广泛用作评估模型输出的评判者,但其高成本、延迟和不透明性限制了可扩展性。我们引入SLMJury,一个评估小型语言模型(SLMs)作为评判者的框架,涵盖两种范式:闭端二元正确性和开端质量评分。我们在四个模型家族的16个SLM评判者(0.6B-14B参数)上,跨十个基准进行基准测试:八个闭端任务涵盖数学、科学和通用推理(每个配置N=64,824个判断),以及用于摘要和对话评分的SummEval和MT-Bench。我们将评判形式化为预算条件函数,并研究五个维度。得出四个发现。(1)过度思考效应是领域依赖的:对于大多数评判者,快速10令牌判决在数学评判上匹配或优于扩展推理(在有帮助的情况下提升2-7%),而推理在通用任务上胜出高达23%。(2)领域泛化区分了模型家族,数学到通用准确率差距从低于10%到接近40%不等。(3)闭端和开端评判依赖不同的能力:最佳二元评判者(Phi-4)在MT-Bench上降至第9名,而经过推理训练的模型则反转了这一顺序。(4)在反思-批判-改进(RCR)辩论协议下,多智能体辩论在所有测试配置中降低了准确性,而顶级评判者抵抗六种对抗性人格的方差<=0.55%。可靠的自动评估不需要大型专有模型,但没有单一的SLM占主导地位。排行榜可在https://anishh15.github.io/SLMJury/获取,我们的框架代码和pip包公开在https://github.com/anishh15/SLMJury和https://pypi.org/project/slmjury/。

English abstract

Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with <=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.

2606.07853 2026-06-09 cs.CL cs.AI cross-list

Beyond English benchmarks: clinical llm evaluation in Brazilian Portuguese

超越英语基准:巴西葡萄牙语临床大语言模型评估

Giordano de Pinho Souza, Glaucia Melo, Josefino Cabral Melo Lima, Daniel Schneider

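Affiliations * Federal University of Rio de Janeiro; Toronto Metropolitan University

AI summary Introduces ClinicalBr, the first bilingual clinical benchmark built from Brazilian case reports; evaluating four models shows the Portuguese-English performance gap is task-dependent, with a clear English advantage in diagnosis retrieval that disappears on the other tasks.

Details
AI Chinese abstract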

大语言模型正在改变临床决策支持及其在实际场景中的应用。然而,大多数基准测试以英语进行,跨语言评估对于解决全球可及性中的语言差距至关重要。我们介绍了ClinicalBr,这是首个基于真实巴西病例报告构建的双语临床决策基准。该语料库包含来自28种SciELO医学期刊的2,892个病例,涵盖18个专科,并构建为平行葡萄牙语-英语对。每个病例支持四项评估任务:诊断检索、鉴别诊断、检查推荐和治疗规划。我们评估了四个模型:MedGemma-27B、Sabiá-4、DeepSeek-R1和o3-mini,涵盖两种语言。核心发现是,葡萄牙语-英语性能差距是任务依赖的,而非普遍的。在诊断检索中,英语在所有模型上均具有一致优势,准确率高出7.5-12.1个百分点。这种优势在鉴别诊断、检查推荐和治疗规划中消失,大多数模型的置信区间跨越零,且葡萄牙语的完整性分数略高。巴西地方病比完整语料库更容易,而非更难,表明热带疾病表现在当前预训练中得到了充分体现。检查推荐是所有模型和两种语言中最难的任务,F1分数低于0.10,远低于鉴别诊断的上限0.20-0.27。

English abstract

Large Language Models are transforming the support for clinical decision and their application in real scenarios. Yet, most benchmarks are conducted in English, and cross-lingual evaluation is needed to tackle the language gaps in global access. We introduce ClinicalBr, the first bilingual benchmark for clinical decision built from real Brazilian case reports. The corpus contains 2,892 cases drawn from 28 SciELO medical journals, spanning 18 specialties, and is structured as parallel Portuguese-English pairs. Each case supports four evaluation tasks: diagnosis retrieval, differential diagnosis, exam recommendation, and treatment planning. We evaluate four models: MedGemma-27B, Sabiá-4, DeepSeek-R1, and o3-mini, across both languages. The central finding is that the Portuguese-English performance gap is task-dependent, not general. In diagnosis retrieval, English yields a consistent advantage across all models, with +7.5-12.1 accuracy points. This advantage disappears in differential diagnosis, exam recommendation, and treatment planning, where confidence intervals cross zero for most models and Portuguese completeness scores are marginally higher. Brazilian-endemic conditions proved easier than the full corpus, not harder, indicating that tropical presentations are adequately represented in current pre-training. Exam recommendation was the hardest task across all models and both languages, with F1 scores below 0.10, well below the differential diagnosis ceiling of 0.20-0.27.
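
The phrase "confidence intervals cross zero" can be illustrated with a short example. The abstract does not say how the intervals were computed. As one common option, the sketch below applies a percentile bootstrap to per-case paired differences. The function names and the data are hypothetical.

```python
import random


def bootstrap_ci(diffs, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of paired score differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample cases with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


def crosses_zero(ci):
    """True when the interval contains zero, i.e. no clear language gap."""
    lo, hi = ci
    return lo <= 0.0 <= hi


# Hypothetical per-case differences (English score minus Portuguese score).
clear_gap = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1] * 10  # English ahead on most cases
no_gap = [1, -1] * 50                            # wins and losses balance out

for name, diffs in (("clear_gap", clear_gap), ("no_gap", no_gap)):
    ci = bootstrap_ci(diffs)
    print(name, ci, "crosses zero:", crosses_zero(ci))
```

An interval that excludes zero supports a consistent advantage for one language, while an interval that contains zero is compatible with no gap at all.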

2606.07861 2026-06-09 cs.CV cs.AI cross-list

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

最后一个可见像素:探究视觉-语言模型中的精细尺度感知

Lujun Li, Lama Sleem, Niccolo Gentile, Yangjie Xu, Yewei Song, Wenbo Wu, Radu State

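Affiliations * University of Luxembourg; Foyer S.A.; Université Paris-Saclay

AI summary Introduces FineSightBench, which separates perception from reasoning tasks across 4-48px scales; VLM perception saturates around 12px while reasoning remains limited even at larger scales, exposing fundamental deficiencies in fine-scale visual reasoning.

Comments 25 pages

Details
AI Chinese abstract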

最近的视觉-语言模型(VLM)在多模态理解和推理方面表现出色,但其细粒度视觉感知仍未被充分探索。'Strawberry中有多少个r?'的自然延伸是:VLM能可靠感知多小的视觉模式?为此,我们引入了FineSightBench,这是一个新的基准,通过将感知任务(字母、形状、物体的像素级识别)与推理任务(空间推理、计数、小目标排序)在4-48像素的受控尺度上分离,系统地探究这一极限。通过对最先进模型的全面实验和详细失败模式分析,我们揭示了一个尖锐的分离:感知在12像素左右饱和,而即使在更大尺度下推理仍然受限,存在持续的计数和序列错误。这些发现暴露了VLM在精细尺度视觉推理中的根本缺陷,需要更严格的评估。

English abstract

Recent vision-language models (VLMs) excel at multimodal understanding and reasoning, yet their fine-grained visual perception remains underexplored. A natural extension of "How many r are there in Strawberry?" asks: how small a visual pattern can a VLM reliably perceive? As such, we introduce FineSightBench, a new benchmark that systematically probes this limit by separating perception tasks (pixel-level recognition of letters, shapes, objects) from reasoning tasks (spatial reasoning, counting, ordering over small targets) across controlled scales of 4-48px. Through comprehensive experiments and detailed failure mode analysis on state-of-the-art models, we reveal a sharp dissociation: perception saturates around 12px, while reasoning remains limited even at larger scales, with persistent numeracy and sequence errors. These findings expose fundamental deficiencies in VLMs' fine-scale visual reasoning that demand more rigorous evaluation.

2606.07969 2026-06-09 cs.CL cs.AI cross-list

Neutrality Bites: Gender Representation in AI-Generated Animal Stories

中立性的代价:AI生成的动物故事中的性别表征

Imani Finkley, Yuanxi Li, Melanie Walsh

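Affiliations * University of Washington

AI summary Studies gender assignment by six leading LLMs when generating animal stories: models often avoid specifying gender or use gender-neutral language, but when gender is assigned they skew strongly masculine and feminine characters are virtually absent, suggesting that neutrality strategies can erase marginalized perspectives.

Comments FAccT (ACM Conference on Fairness, Accountability, and Transparency) 2026

Details
AI Chinese abstract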

AI生成故事中的性别偏见是一个有充分记录的问题。尽管人们已投入大量关注来减少或缓解这种偏见,但干预措施是否产生真正公平的结果并不总是明确的。为了调查这一问题,我们研究了大型语言模型(LLMs)如何处理一个流行、高度模糊且已知会紧密复现人类刻板印象的叙事语境中的性别分配:关于会说话的动物的故事。我们提示六个领先的LLM完成一个关于七个性别未说明的拟人化动物角色的英语故事。此外,我们迭代了四种不同的叙事设置和一系列模型温度。在23.8K个故事中,我们发现模型经常避免在故事中指定动物角色的性别(平均19%)或使用性别中立的语言如“它”或“它的”(平均38.2%)。然而,当性别被指定时,存在显著的男性偏见。女性动物角色几乎不存在,仅出现在2.2%的故事中,而男性角色出现在40.6%的故事中。我们的发现指向一个更广泛的论点:中立性是有代价的。换句话说,优先考虑中立性以解决社会偏见的模型实际上可能助长边缘化视角和身份的抹除。我们建议需要追求超越中立性的替代策略,例如那些更平等地在想象主体之间分配社会可能性的策略。

English abstract

Gender bias in AI-generated stories is a well-documented problem. While much attention has been paid to reducing or mitigating this bias, it is not always clear whether interventions produce genuinely fairer results. To investigate this issue, we examine how large language models (LLMs) handle gender assignment in a narrative context that is popular, highly ambiguous, and also known to closely reproduce human stereotypes: stories about talking animals. We prompt six leading LLMs to complete an English-language story about seven different anthropomorphic animal characters whose gender is unstated. We additionally iterate with four different narrative settings and a range of model temperatures. Across the 23.8K stories, we find that models frequently avoid gendering the animal character in the story (19% on average) or use gender-neutral language like "it" or "its" (38.2% on average). However, when gender is assigned, there is a significant masculine bias. Feminine animal characters are virtually absent, present in just 2.2% of stories vs. 40.6% that feature masculine characters. Our findings point to a broader argument: neutrality bites. In other words, models that prioritize neutrality to address social bias may actually contribute to the erasure of marginalized perspectives and identities. We suggest that alternative strategies beyond neutrality need to be pursued, such as ones that more equally distribute social possibilities across imagined subjects.

2606.07996 2026-06-09 cs.CL cs.AI cross-list

MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

MC-PDD: 面向黑盒大语言模型的掩码语料级预训练数据检测

Kaixin Lan, Mu You, Tao Fang, Binkai Ou, Lidia S. Chao, Derek F. Wong

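Affiliations * University of Macau; Macau Millennium College; BoardWare Information System Limited

AI summary Proposes MC-PDD, which masks specific tokens, has the LLM predict the missing content, and compares prediction hit rates between a candidate corpus and a reference non-member corpus to detect pretraining data in a black-box setting, with performance comparable to existing methods.

Comments The manuscript consists of 10 pages formatted in the IEEE/ACM two-column style

Details
AI Chinese abstract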

预训练是大语言模型(LLM)发展的基础,然而预训练数据的不透明性使模型分析复杂化,并引发伦理、法律和公平性问题。因此,检测特定数据集是否在预训练中使用至关重要。现有最先进方法通常依赖于访问模型概率分布,因此不适用于仅提供输入输出接口的闭源LLM。为解决这一限制,我们引入了掩码语料级预训练数据检测(MC-PDD),这是一种受掩码语言建模范式启发的新方法。MC-PDD在每段文本中掩码高度特定的token,并提示LLM预测缺失内容。然后,它评估候选语料与参考非成员语料之间的预测命中率差异是否具有统计显著性。基于此比较,MC-PDD确定候选文本是否可能包含在模型的预训练数据中。实验结果表明,在三个数据集上,对于开源和闭源LLM,预训练数据和未见数据之间的预测命中率存在明显且一致的差异。尽管在更严格的黑盒设置下运行,MC-PDD仍实现了与现有检测方法相当的性能。我们的方法仅需使用标准API访问即可实现模型审计和数据版权验证等实际应用。接受后,我们将公开发布代码和数据集。

English abstract

Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only input-output interfaces. To address this limitation, we introduce Masked Corpus-level Pretraining Data Detection (MC-PDD), a novel method inspired by the masked language modeling paradigm. MC-PDD masks highly specific tokens in each text and prompts the LLM to predict the missing content. It then assesses whether the difference in prediction hit rates between a candidate corpus and a reference non-member corpus is statistically significant. Based on this comparison, MC-PDD determines whether the candidate texts were likely included in the model's pretraining data. Experimental results demonstrate clear and consistent differences in prediction hit rates between pretrained and unseen data across three datasets, for both open-source and closed-source LLMs. Despite operating under a stricter black-box setting, MC-PDD achieves performance comparable to existing detection methods. Our approach enables practical applications such as model auditing and data copyright verification using only standard API access. Upon acceptance, we will publicly release the code and datasets.
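
The corpus-level decision described here rests on testing whether two hit rates differ significantly. The abstract does not name the statistical test, so the sketch below assumes a one-sided two-proportion z-test as a plausible stand-in. The function name `hit_rate_z_test` and the counts are hypothetical.

```python
import math


def hit_rate_z_test(hits_cand, n_cand, hits_ref, n_ref):
    """One-sided two-proportion z-test (assumed test, not confirmed by the paper).

    H0: candidate and reference corpora have the same masked-token hit rate.
    H1: the candidate corpus has a higher hit rate (suggesting membership).
    Returns (z statistic, p-value).
    """
    p_cand = hits_cand / n_cand
    p_ref = hits_ref / n_ref
    pooled = (hits_cand + hits_ref) / (n_cand + n_ref)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cand + 1 / n_ref))
    if se == 0:
        return 0.0, 1.0  # all hits or all misses in both corpora: no evidence
    z = (p_cand - p_ref) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 60/100 masked tokens recovered on the candidate corpus
# versus 30/100 on a reference corpus known to be outside the training data.
z, p = hit_rate_z_test(60, 100, 30, 100)
print(f"z={z:.2f} p={p:.2e} flag_as_member={p < 0.01}")
```

The test needs only the model's text completions, which is what makes this kind of comparison usable through a standard API without access to token probabilities.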

2606.08000 2026-06-09 cs.CL cs.AI cross-list

Summarization is Not Dead Yet

摘要生成尚未消亡

Dongqi Liu, Chenxi Whitehouse, Zheng Zhao, Zhuchen Cao, Jian Li, Yabiao Wang

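Affiliations * Saarland University; Max Planck Institute for Informatics; University of Cambridge; University of Edinburgh; Zhejiang University; Tencent YouTu Lab

AI summary A multi-track evaluation finds that human reference summaries still beat LLM outputs on informativeness and faithfulness, while LLMs lead only on surface-level coherence and fluency, indicating that summarization remains an open research problem.

Details
AI Chinese abstract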

大型语言模型(LLMs)的进展引发了关于模型生成的摘要可与人类撰写的参考摘要相媲美甚至超越后者的说法,这引发了摘要生成是否仍是一个开放研究问题的疑问。我们通过多轨道评估重新审视这一说法,涵盖五个不同数据集和五个最先进的LLMs,结合受控人工评估、偏差缓解的LLM作为评判协议、基于外部知识的事实性验证以及语料库级别的语言分析。我们的发现揭示了一个更为细致的图景:人类参考摘要继续在信息量和忠实度方面展现出优势,而LLM输出主要在表面连贯性和流畅性上更受青睐。事实性验证表明,人类参考摘要仍然更可靠,尤其是对于涉及推理或综合的声明,而语言分析揭示了不同模型之间风格同质化的模式。这些观察表明,当前的LLMs提高了摘要生成的质量下限,但其性能上限仍低于人类能力。

English abstract

The progress of large language models (LLMs) has fueled claims that model-generated summaries rival or even surpass human-written references, raising questions about whether summarization remains an open research problem. We re-examine this narrative through a multi-track evaluation covering five diverse datasets and five state-of-the-art LLMs, combining controlled human assessment, bias-mitigated LLM-as-Judge protocols, factuality verification against external knowledge, and corpus-level linguistic analysis. Our findings reveal a more nuanced landscape in which human reference summaries continue to demonstrate advantages in informativeness and faithfulness, whereas LLM outputs are preferred mainly for surface-level coherence and fluency. Factuality verification indicates that human references remain more reliable, particularly for claims involving reasoning or synthesis, and linguistic analysis uncovers a pattern of stylistic homogeneity across different models. These observations suggest that current LLMs have raised the floor of summarization quality, but the ceiling of their performance remains below human capabilities.

2606.08034 2026-06-09 cs.CV cs.AI cs.CL cross-list

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

Sci-Rho:面向STEM问题的多语言视觉基础符号基准

Muhammad Falensi Azmi, Ikhlasul Akmal Hanif, Vallerie Alexandra Putra, Adi Yeltay, Abdullah Mubarak, Fajri Koto

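Affiliations * Independent Researcher; MBZUAI; Binus University; Bandung Institute of Technology

AI summary Introduces Sci-Rho, a multilingual, visually-grounded dynamic benchmark for STEM problems with 4,242 templates and 42,420 instances; evaluating 17 VLMs reveals a gap between worst-case and average accuracy, and smaller models degrade across languages.

Comments 22 pages

Details
AI Chinese abstract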

符号基准已成为评估模型在STEM相关问题微小修改下鲁棒性的关键方法。然而,现有符号基准大多局限于数学推理,缺乏视觉基础,且主要以英语为主。在这项工作中,我们引入了Sci-Rho(科学鲁棒性),一个面向视觉基础STEM问题的动态基准,涵盖五个学科和七种语言,包含由领域专家(包括奥林匹克奖牌得主)精心设计的4,242个问题模板(每种语言606个)。每个模板实现为可执行的Python代码,通过改变数值、视觉模式、几何形状、颜色方案和函数类型,生成多样但等价的问题实例,总共产生42,420个实例,每个实例都配有推理步骤和真实解决方案。我们评估了17个最先进的VLM,发现最差情况准确率(定义为模型在每种生成变体上均正确回答的问题模板比例)与平均准确率之间存在明显差距。我们还发现,较小的模型在不同语言上表现出显著的性能下降,而专有模型和较大模型保持鲁棒。步骤级评估反映了相同的趋势,揭示了平均F1与最差情况F1分数之间的显著差距。最后,我们对VLM注意力头的检查显示,图像标记与文本标记的相对注意力分配存在显著的跨语言变化。我们的工作强调了超越静态基准的评估作为衡量VLM质量指标的重要性。

English abstract

Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual grounding, and are predominantly in English. In this work, we introduce Sci-Rho (Science Rhobustness), a dynamic benchmark for visually-grounded STEM problems spanning five subjects and seven languages, comprising 4,242 problem templates (606 per language) crafted by domain experts, including Olympiad medalists. Each template is implemented as executable Python code that generates diverse but equivalent problem instances by varying numerical values, visual patterns, geometric shapes, color schemes, and function types, resulting in 42,420 instances in total, each paired with reasoning steps and ground-truth solutions. We evaluated 17 state-of-the-art VLMs and discovered a noticeable gap between worst-case accuracy (defined as the proportion of problem templates that a model answers correctly across every generated variation) and average accuracy. We also discovered that smaller models show noticeable performance degradation across languages, whereas proprietary and larger models remain robust. Step-level evaluation reflects this same trend, revealing a significant gap between average F1 and worst-case F1 scores. Finally, our inspection of attention heads of a VLM reveals substantial cross-lingual variation in the relative attention allocated to image tokens compared to text tokens. Our work highlights the importance of evaluation beyond static benchmarks as a metric to measure the quality of VLMs.
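
The two accuracy notions contrasted in this abstract can be computed side by side. The sketch below follows the definition given there: worst-case accuracy is the share of templates answered correctly on every generated variation. The template names and outcomes are hypothetical. Ten variations per template matches the stated totals (42,420 instances over 4,242 templates).

```python
def average_accuracy(results):
    """Fraction of all generated instances answered correctly."""
    total = sum(len(outcomes) for outcomes in results.values())
    correct = sum(sum(outcomes) for outcomes in results.values())
    return correct / total


def worst_case_accuracy(results):
    """Fraction of templates answered correctly on every variation."""
    return sum(all(outcomes) for outcomes in results.values()) / len(results)


# Hypothetical per-template outcomes, one boolean per generated variation.
results = {
    "kinematics_01": [True] * 10,
    "circuits_07": [True] * 9 + [False],
    "optics_03": [True] * 5 + [False] * 5,
}

print("average accuracy:", average_accuracy(results))
print("worst-case accuracy:", worst_case_accuracy(results))
```

A single failed variation removes a template from the worst-case count, which is why this metric can sit far below average accuracy for a model that is usually, but not reliably, correct.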

2606.08036 2026-06-09 cs.IR cs.AI cs.CL cross-list

GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

GIScholarBench: 在GIS研究中评估大语言模型的过度自信

Zongrng Li, Mingzheng Yang, Lei Zou, Hongxu Ma, Hao Tian, Siqi Zhou, Wenjing Gong, Kaili Zhang, Bingqian Chen, Mitch Zhang, Yifan Yang

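Affiliations * Texas A&M University; Google; Department of Geography; Department of Landscape Architecture and Urban Planning

AI summary To study LLM overconfidence in academic research, builds GIScholarBench from 10,865 papers and evaluates models on three tasks (metadata retrieval, literature linking, and research direction generation), finding task-invariant overconfidence in all models.

Details
AI Chinese abstract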

大型语言模型(LLMs)越来越多地用于学术研究工作流程,但学术任务需要高事实精度,因此暴露了一个关键弱点:过度自信。这里,过度自信采用行为层面的定义,即在底层知识不完整或不可验证时,仍倾向于产生自信、果断且格式良好的输出,而非指所陈述的置信度与准确率之间的校准差距。为了研究这一问题,我们引入了GIScholarBench,这是一个基于2020年至2025年间发表在25个核心GIScience期刊上的10,865篇论文构建的基准。该基准涵盖三个认知复杂度递增的任务:元数据检索、文献链接和研究方向生成。我们通过原生网络界面在面向真实用户的条件下评估了Claude Sonnet 4.5、Gemini 3和ChatGPT 5.3。结果显示所有任务均存在一致的过度自信。在元数据检索中,ChatGPT 5.3取得了最高准确率,但所有模型在预测错误时仍生成确定的标题和DOI。在文献链接中,Claude Sonnet 4.5恢复了最多的参考文献,但所有模型在排名靠前的检索结果和更长的引文列表之间显示出明显差距,表明参考文献被扩展到了可靠检索能力之外。在研究方向生成中,与真实的后续引用论文相比,AI生成的方向显示出更低的主题覆盖率、更高的新颖性缺失率和更低的语义多样性。这些发现表明,LLM的过度自信是任务不变的,但表现形式不同:检索中的事实过度生成、文献链接中不可靠的引文扩展,以及研究构思中对输出完整性的过度自信。

English abstract

Large language models (LLMs) are increasingly used in academic research workflows, but scholarly tasks require high factual precision and therefore expose a key weakness: overconfidence. Here, overconfidence is defined behaviorally as the tendency to produce confident, assertive, and well-formatted outputs even when the underlying knowledge is incomplete or unverifiable, rather than as a calibration gap between stated confidence and accuracy. To examine this issue, we introduce GIScholarBench, a benchmark built from 10,865 papers published in 25 core GIScience journals between 2020 and 2025. The benchmark covers three tasks with increasing cognitive complexity: metadata retrieval, literature linking, and research direction generation. We evaluate Claude Sonnet 4.5, Gemini 3, and ChatGPT 5.3 through their native web interfaces under real-world user-facing conditions. Results show consistent overconfidence across all tasks. In metadata retrieval, ChatGPT 5.3 achieves the highest accuracy, but all models still generate definitive titles and DOIs when predictions are wrong. In literature linking, Claude Sonnet 4.5 recovers the most references, but all models show a clear gap between top-ranked retrieval and longer citation lists, suggesting that references are extended beyond reliable retrieval capacity. In research direction generation, AI-generated directions show lower topic coverage, higher novel miss rates, and lower semantic diversity than real future-citing papers. These findings suggest that LLM overconfidence is task-invariant but takes different forms: factual overgeneration in retrieval, unreliable citation expansion in literature linking, and overconfidence in output completeness during research ideation.

2606.08123 2026-06-09 cs.CV cs.AI cross-list

Human-Centered Benchmarking of Driver Monitoring Models

以人为中心的驾驶员监控模型基准测试

Ruben Dario Florez-Zela

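Affiliations * Universidad Nacional de San Agustin de Arequipa (UNSA)

AI summary Arguing that classification accuracy alone is insufficient for evaluating driver monitoring models, proposes the Human-Centered Benchmarking Framework (HCBF), which scores accuracy, explainability, efficiency, and robustness; each model leads in one dimension on the Pareto frontier, but aggregate ranking can mask critical vulnerabilities.

Comments 9 pages, 3 figures, 7 tables. Code available at: https://github.com/rubendflorezzela/hcbf-driver-monitoring

Details
AI Chinese abstract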

基于视觉的驾驶员监控系统越来越多地部署在安全关键的智能交通环境中,但它们几乎总是仅根据分类精度进行比较。本文认为精度不足以表征模型在实际部署中的适用性,并提出了以人为中心的基准测试框架(HCBF),该框架从四个维度评估模型:精度、可解释性、效率和鲁棒性。该框架应用于四种代表性的轻量级架构:MobileNetV3、ShuffleNetV2、EfficientNet-B0和DeiT-Tiny,在MRL眼睛数据集上进行眼睛状态分类。虽然这些模型在干净数据集上的精度几乎无法区分,但每个模型恰好在一个维度上领先,并且所有四个模型都位于帕累托前沿。在三种面向部署的权重场景下计算的人为中心得分将ShuffleNetV2排在首位。然而,这个聚合胜出者在传感器噪声下保留了不到一半的性能,并且将闭眼分类为睁眼而失败,而Transformer则保持鲁棒。这些发现表明,聚合排名可能掩盖在操作上具有决定性的维度特定漏洞,强调了多维、以人为中心评估的价值。

English abstract

Vision-based driver monitoring systems are increasingly deployed in safety-critical intelligent transportation settings, yet they are almost always compared on classification accuracy alone. This paper argues that accuracy is insufficient to characterize a model's fitness for real-world deployment, and proposes the Human-Centered Benchmarking Framework (HCBF), which evaluates models across four dimensions: accuracy, explainability, efficiency, and robustness. The framework is applied to four representative lightweight architectures, MobileNetV3, ShuffleNetV2, EfficientNet-B0, and DeiT-Tiny, on the MRL Eye Dataset for eye-state classification. While the models are nearly indistinguishable on clean-set accuracy, each leads in exactly one dimension, and all four lie on the Pareto frontier. A Human-Centered Score computed under three deployment-oriented weighting scenarios ranks ShuffleNetV2 first throughout. However, this aggregate winner retains less than half of its performance under sensor noise and fails by classifying closed eyes as open, whereas the transformer remains robust. These findings show that aggregate ranking can mask dimension-specific vulnerabilities that are operationally decisive, underscoring the value of multi-dimensional, human-centered evaluation.
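
The interplay between a Pareto frontier and a weighted aggregate score can be seen in a toy example. The sketch below is not the paper's Human-Centered Score. The model names, the normalized scores, and the weighting are all hypothetical, and a simple weighted mean stands in for whatever aggregation HCBF actually uses.

```python
def dominates(a, b):
    """True if a is at least as good as b on every axis and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(models):
    """Names of models that no other model dominates (higher is better)."""
    return [
        name
        for name, scores in models.items()
        if not any(
            dominates(other, scores)
            for other_name, other in models.items()
            if other_name != name
        )
    ]


def weighted_score(scores, weights):
    """Weighted mean of per-dimension scores (assumed aggregation)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Hypothetical scores: (accuracy, explainability, efficiency, robustness).
models = {
    "model_a": (0.99, 0.60, 0.70, 0.65),
    "model_b": (0.98, 0.80, 0.60, 0.60),
    "model_c": (0.98, 0.55, 0.95, 0.40),
    "model_d": (0.98, 0.65, 0.50, 0.90),
    "model_e": (0.97, 0.50, 0.45, 0.35),  # dominated by model_a
}

front = pareto_front(models)
efficiency_first = (1, 1, 3, 1)  # hypothetical deployment weighting
ranking = sorted(
    front,
    key=lambda m: weighted_score(models[m], efficiency_first),
    reverse=True,
)
winner = ranking[0]
print("front:", front)
print("winner:", winner, "robustness:", models[winner][3])
```

In this toy setup the aggregate winner has the lowest robustness on the frontier, which mirrors the abstract's warning that a single ranking can hide a weakness that matters in deployment.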

2606.08194 2026-06-09 cs.CL cs.AI cross-list

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

GlobeAudio:用于大型音频-语言模型自然主义评估的多语言多文化基准

Ryner Tan, Wenxuan Zhang

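Affiliations * Singapore University of Technology and Design

AI summary Introduces GlobeAudio, a benchmark of 5,637 multilingual multiple-choice questions that evaluates auditory reasoning and cultural understanding of large audio-language models under naturalistic audio conditions, revealing substantial performance gaps for open-source models and low-resource languages.

Details
AI Chinese abstract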

大型音频-语言模型(LALMs)在统一框架中整合了音频感知和语言理解,支持广泛的实际应用。尽管近期取得了进展,但LALMs的评估相对于实际需求仍严重不足:大多数评估缺乏真正的语言和文化真实性,而其他评估则未能捕捉声学真实性。为弥补这一差距,我们提出了GlobeAudio,一个旨在评估自然音频理解的多语言和多文化基准。GlobeAudio包含5637道多项选择题,涵盖六种类型多样的语言,由母语者基于自然发生的音频精心制作。为了表现良好,模型必须具有更高层次的听觉推理技能和文化基础的解释。我们系统地评估了代表性的闭源和开源LALMs,以及级联的ASR-LLM流水线。我们的实验揭示了在自然声学条件下的显著性能差距,特别是对于开源模型和低资源语言。这些发现凸显了当前LALMs的关键局限性,并强调了自然音频评估对未来音频-语言系统的重要性。GlobeAudio可在https://huggingface.co/datasets/iNLP-Lab/GlobeAudio 获取。

英文摘要

Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation for LALMs remains heavily underspecified relative to real-world requirements: most lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers grounded on naturally occurring audio. In order to do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio .

2606.08272 2026-06-09 cs.CL cs.AI 交叉投稿

AgriGov: A Structured Multilingual Dataset Curation for Indian Government Schemes for Farmers

AgriGov:面向印度政府农民计划的结构化多语言数据集整理

Mohsina Bilal, Gopakumar G

发表机构 * National Institute of Technology Calicut(国立卡利卡特理工学院)

AI总结 提出AgriGov三语数据集,通过自动抓取、翻译流水线和人工后编辑构建约8000句对齐的农业政策领域平行语料,支持机器翻译、问答等应用。

Comments 15 pages, 4 figures, Submitted to: Sadhana, Elsevier

AI中文摘要

AgriGov是一个精心整理的三语(英语-印地语-马拉地语)数据集,旨在解决农业政策和农民福利计划领域缺乏领域基础的多语言资源的问题。最初,我们使用自动抓取技术从可信门户收集并结构化50个政府计划的数据,将其组织到预定义的语义字段(如标题、资格、申请流程、文件、排除项)。翻译通过结合Google Translate API、MarianMT和人工后编辑的流水线进行,生成了一个包含约2100个源片段的领域特定印地语-马拉地语数据集。为了增强覆盖范围,我们用Samanantar语料库中的句子扩充了该数据集,产生了约8000个句子对齐的印地语-马拉地语平行对。该数据集现在为微调该领域的机器翻译模型提供了强大的资源。AgriGov专为领域自适应机器翻译、问答、信息检索和摘要系统等应用而设计。其主要贡献是一个模式驱动、人工校正的多语言对齐流水线,确保领域保真度、提供来源并支持可重复实验,从而为面向农民的工具实现检索增强应用。

英文摘要

AgriGov is a curated, trilingual (English-Hindi-Marathi) dataset designed to address the scarcity of domain-grounded multilingual resources for agricultural policies and farmer welfare schemes. Initially, we collected and structured data from 50 government schemes sourced from trusted portals using automated scraping techniques, organizing it into predefined semantic fields (e.g., title, eligibility, application process, documents, exclusions). Translations were performed using a pipeline combining Google Translate API, MarianMT, and human post-editing, resulting in a domain-specific Hindi-Marathi dataset comprising approximately 2,100 source segments. To enhance coverage, we augmented this dataset with sentences from the Samanantar corpus, leading to approximately 8,000 sentence-aligned Hindi-Marathi parallel pairs. The dataset now offers robust resources for fine-tuning machine translation models in this domain. AgriGov is designed for applications in domain-adaptive machine translation, question answering, information retrieval, and summarization systems. Its key contribution is a schema-driven, human-corrected multilingual alignment pipeline that ensures domain fidelity, provides provenance, and supports reproducible experiments, enabling retrieval-augmented applications for farmer-facing tools.

2606.08367 2026-06-09 cs.MA cs.AI 交叉投稿

Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy

Emergence World: 一个用于评估长时域多智能体自主性的平台

Deepak Akkil, Ravi Kokku, Karthik Vikram, Tamer Abuelsaad, Aditya Vempaty, Satya Nitta

发表机构 * Emergence AI

AI总结 提出一个持续运行的多智能体模拟平台,通过集成实时外部数据、120+工具和持久记忆系统,评估LLM代理在长时域(数周至数月)中的行为漂移、治理和跨模型影响等动态特性。

AI中文摘要

大多数对LLM代理的评估类似于考试:一个离散任务,一个干净的环境,几分钟或几小时的得分。我们认为这种方法与自主系统的部署条件不匹配,因为相关的时间尺度可能是数周到数月,而最重要的动态,如行为漂移、不同环境背景下的治理以及来自不同模型家族的代理之间的交叉影响,只会随着时间的推移而出现。我们介绍了Emergence World,一个持续运行的多智能体模拟平台,旨在使这些动态变得可测量。该平台在一个共享的空间世界中托管LLM驱动的代理群体,该世界基于实时外部数据(例如实时天气、新闻API、互联网访问),为每个代理配备120多种专业工具和三个持久记忆系统,并通过具有重大结果的民主机制让它们自我治理。该平台在推理层是模型无关的,并支持异构群体,其中来自不同供应商的代理共享同一个世界。为了说明该平台能够处理的问题类型,我们展示了一项为期15天的跨供应商研究,涉及五个平行世界,分别由Claude Sonnet 4.6、Grok 4.1 Fast、Gemini 3 Flash、GPT-5-mini以及一个混合群体驱动。相同的角色和起始条件产生了截然不同的结果,从稳定的协商治理到完全的人口崩溃。我们发布提示、日志数据和配置,以支持对长时域多智能体自主性的进一步研究。

英文摘要

Most evaluations of LLM agents look like exams: a discrete task, a clean environment, a score in minutes or hours. We argue that this approach is mismatched with the deployment conditions of autonomous systems, where the relevant timescale can be weeks to months, and where the dynamics that matter most, such as behavioral drift, governance in diverse environmental contexts, and cross-influence between agents from different model families, only emerge over time. We introduce Emergence World, a continuously running multi-agent simulation platform designed to make those dynamics measurable. The platform hosts populations of LLM-driven agents in a shared spatial world grounded in live external data (e.g. real-time weather, news APIs, internet access), equips each agent with 120+ specialized tools and three persistent memory systems, and lets them govern themselves through democratic mechanisms with consequential outcomes. The platform is model-agnostic at the reasoning layer and supports heterogeneous populations in which agents from different vendors share the same world. To illustrate the kinds of questions the platform makes tractable, we present a 15-day cross-vendor study with five parallel worlds powered by Claude Sonnet 4.6, Grok 4.1 Fast, Gemini 3 Flash, GPT-5-mini, and a mixed population. Identical roles and starting conditions produced radically different outcomes, ranging from stable deliberative governance to total population collapse. We release the prompts, log data and configurations to support further research on long-horizon multi-agent autonomy.

2606.08376 2026-06-09 cs.LG cs.AI 交叉投稿

RiskNet: A large-scale dataset of AI risk incidents from news with alignment and multi-dimensional annotations

RiskNet:一个来自新闻的大规模AI风险事件数据集,包含对齐和多维标注

Leihan Zhang, Wecheng Ye, Xianlong Ma, Haochuan Liu, Yang Li, Qianyu Zhang, Jinliang Chen, Qiang Yan

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Beijing Key Laboratory of Multimodal Data Intelligent Perception and Governance(多模态数据智能感知与治理北京市重点实验室)

AI总结 提出RiskNet,一个从多语言新闻构建的大规模AI风险事件数据集,通过结构化流水线进行事件识别、对齐和多维分类,支持AI安全、治理和风险分析研究。

Comments The manuscript has been submitted to Scientific Data

AI中文摘要

随着人工智能(AI)系统越来越多地部署在社会关键领域,与AI相关的危害和失败事件的报告在频率和多样性上不断增加。尽管现有的治理框架阐述了负责任AI的高层原则,但用于跟踪和分析真实世界AI风险事件的大规模实证资源仍然有限。现有的事件集合通常由人工整理,规模相对较小,不足以支持持续、数据驱动的监控和下游计算分析。为满足这一需求,我们提出了RiskNet,一个从大规模多语言新闻源构建的AI风险事件数据集。RiskNet应用了一个结构化的流水线,用于AI风险新闻识别、事件级报告筛选、事件对齐和多维事件分类。生成的资源将分散的新闻报道组织成以事件为中心的记录,并为事件分类、事件对齐和事件级风险标注提供基准数据集。在当前版本中,RiskNet覆盖了数亿条源记录,并生成了一个大规模的AI风险相关报告集合,包括对齐的事件簇和标注的基准子集。该数据集还通过一个在线平台提供浏览和探索功能。我们描述了数据源、处理工作流、分类法设计以及资源的技术验证。RiskNet旨在支持AI安全、治理、风险分析和基准测试的下游研究,以及对AI相关危害的纵向和跨源分析。通过提供一个结构化且可复用的实证资源,RiskNet有助于弥合高层治理原则与AI风险事件记录现实之间的差距。

英文摘要

As artificial intelligence (AI) systems are increasingly deployed across socially consequential domains, reports of AI-related harms and failures have grown in frequency and diversity. Although existing governance frameworks articulate high-level principles for responsible AI, large-scale empirical resources for tracking and analyzing real-world AI risk incidents remain limited. Existing incident collections are often manually curated, relatively small in scale, and insufficient for continuous, data-driven monitoring and downstream computational analysis. To address this need, we present RiskNet, a large-scale dataset of AI risk incidents constructed from large-scale multilingual news sources. RiskNet applies a structured pipeline for AI risk news identification, event-level report screening, incident alignment, and multi-dimensional incident classification. The resulting resource organizes dispersed news reports into incident-centered records and provides benchmark datasets for event classification, incident alignment, and incident-level risk labeling. In its current release, RiskNet covers hundreds of millions of source records and yields a large-scale collection of AI risk-related reports, including aligned incident clusters and annotated benchmark subsets. The dataset is also accessible through an online platform for browsing and exploration. We describe the data sources, processing workflow, taxonomy design, and technical validation of the resource. RiskNet is intended to support downstream research on AI safety, governance, risk analysis, and benchmarking, as well as longitudinal and cross-source analyses of AI-related harms. By providing a structured and reusable empirical resource, RiskNet helps bridge the gap between high-level governance principles and the documented realities of AI risk incidents.

2606.08400 2026-06-09 cs.SE cs.AI cs.CL 交叉投稿

Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses

历史与模型对LLM评分的影响:高级软件工程课程研究

Qilin Zhou, Zhuo Wang, Yue Li, W. K. Chan

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 针对研究生阅读报告评分负担重的问题,提出人机协同的LLM辅助评分流程,基于180份作业评估Grok和GPT的评分一致性与人类对齐,发现交互历史导致评分标准漂移,需特定操作缓解不公平。

Comments 5 pages, accepted by ISET 2026

AI中文摘要

研究生级别的科研阅读报告评估给教育工作者带来了沉重的劳动负担。虽然大型语言模型(LLM)在自动化学术评分方面具有巨大潜力,但它们在此专门任务上的可靠性仍研究不足,特别是评分一致性方面,其缺失是教育公平的主要障碍。本文提出了一种与人类对齐的LLM辅助评分工作流程,并基于来自研究生高级软件工程课程的180份学生作业进行了案例研究。我们评估了两种主流LLM——Grok和GPT——在评分一致性和与人类分数对齐方面的表现。我们发现LLM表现出不同水平的模型内一致性和显著的模型间评分不一致性,而简单的集成方法无法改善与人类评估的对齐。关键的是,连续的交互历史导致模型的评分标准系统地偏离人类专家评分。我们的研究结果表明,LLM在减轻研究生教育中教育工作者的评分负担方面具有潜力,同时强调不加区分地使用LLM评分可能会引入系统性不公平,表明需要特定的操作实践来减轻这种差异。

英文摘要

Graduate-level research reading report assessment creates a substantial labor burden for educators. While large language models (LLMs) hold great potential for automating academic grading, their reliability for this specialized task remains understudied, particularly regarding grading consistency, the lack of which represents a primary obstacle to educational fairness. This paper proposes a human-aligned LLM-assisted grading workflow and presents a case study based on 180 student submissions from a graduate advanced software engineering course. We evaluate two mainstream LLMs, Grok and GPT, in terms of grading consistency and alignment with human scores. We find LLMs exhibit distinct levels of intra-model consistency and significant inter-model grading inconsistencies, while simple ensemble approaches cannot improve alignment with human evaluation. Critically, continuous interaction history drives systematic drift in models' grading standards away from human expert scores. Our findings demonstrate LLMs' potential in reducing grading workload for educators in graduate education, while highlighting that indiscriminate LLM grading may introduce systemic unfairness, suggesting that specific operational practices are required to mitigate such disparities.

2606.08417 2026-06-09 cs.CL cs.AI 交叉投稿

Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

破解生成困惑度:为何无条件文本评估需要分布度量

Antonio Franca, Alexander Tong

AI总结 本文指出生成困惑度(gen-PPL)作为非自回归语言模型评估指标存在缺陷,通过构造零参数朴素采样器在LM1B和OpenWebText上达到SOTA gen-PPL但生成不连贯文本,建议采用直接量化生成文本与参考文本分布差异的评估套件。

Comments Accepted to the Workshop on Structured Probabilistic Inference & Generative Modeling (SPIGM) at ICML 2026

AI中文摘要

扩散和连续流语言模型已成为语言建模中领先的非自回归替代方案。这两种范式的进展主要通过生成困惑度(gen-PPL)来衡量:在冻结的自回归(AR)评分器(如gpt2-large)下,样本的每个token的负对数似然,通常配以经验熵护栏来排除低熵崩溃。我们认为该度量不健全。从构造上看,gen-PPL仅衡量在评分AR下的可预测性,而非语法性或语义连贯性——而可预测但低质量的序列集合在组合上非常庞大。为了具体说明这一点,我们构建了一套零参数、故意朴素的采样器,在LM1B和OpenWebText上以非退化熵实现了最先进的gen-PPL,超越了最近发布的扩散和连续流模型,同时生成的文本在构造上是不连贯的。我们推荐直接量化生成文本与参考文本之间分布差异的评估套件,并使用这样的套件重新基准测试最近的非自回归模型,从而更真实地反映当前的最新技术水平。

英文摘要

Diffusion and continuous flow-based language models have emerged as the leading non-autoregressive alternatives to language modeling. Progress in both paradigms is overwhelmingly tracked by generative perplexity (gen-PPL): the per-token negative log-likelihood of samples under a frozen autoregressive (AR) scorer such as gpt2-large, typically paired with an empirical-entropy guardrail to rule out low-entropy collapse. We argue that this metric is unsound. By construction, gen-PPL measures only predictability under the scoring AR, not grammaticality or semantic coherence -- and the set of predictable but still low-quality sequences is combinatorially large. To make this concrete, we construct a suite of zero-parameter, deliberately naive samplers that achieve state-of-the-art gen-PPL on LM1B and OpenWebText at non-degenerate entropy, surpassing recently published diffusion and continuous-flow models while producing text that is incoherent by construction. We recommend evaluation suites that directly quantify the distributional divergence between generated and reference text, and use such a suite to re-benchmark recent non-autoregressive models, recovering a more faithful picture of the current state of the art.
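
To make the metric under discussion concrete, here is a minimal Python sketch of generative perplexity and of an empirical-entropy guardrail. It assumes gen-PPL is the exponential of the mean per-token negative log-likelihood (the usual definition of perplexity) and that the guardrail is the entropy of a sample's token-frequency distribution; the log-probabilities are hypothetical inputs that would normally come from a frozen scorer such as gpt2-large.

```python
import math
from collections import Counter
from typing import Sequence


def generative_perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(mean per-token negative log-likelihood), with natural-log inputs."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def unigram_entropy(tokens: Sequence[str]) -> float:
    """Empirical entropy (in nats) of the token frequencies within one sample."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Hypothetical scorer output: every token was assigned probability 0.5.
logprobs = [math.log(0.5)] * 4
ppl = generative_perplexity(logprobs)

# A sample can have non-zero entropy and still be incoherent.
entropy = unigram_entropy(["the", "of", "the", "of"])
```

Both quantities depend only on how predictable and how varied the tokens are, so neither checks grammaticality or meaning, which is the gap the abstract describes.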

2606.08481 2026-06-09 cs.LG cs.AI cs.DB cs.SE 交叉投稿

PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems

PIPE-Cypher:面向文本到Cypher系统的自动企业基准生成

Suraj Ranganath, Anish Raghavendra

发表机构 * Halıcıoğlu School of Data Science and Computing, University of California, San Diego(加利福尼亚大学圣迭戈分校哈勒乔卢数据科学与计算学院) Independent Researcher(独立研究员)

AI总结 提出PIPE-Cypher流水线,利用本地大模型从企业属性图自动生成平衡的NL-to-Cypher基准,通过模式分析、逆向查询约束生成和执行验证等步骤,实现可重复的基准构建。

AI中文摘要

企业属性图在模式结构、内部术语、领域假设、治理约束和用户交互模式上差异很大。因此,与部署相关的Text2Cypher基准反映了用户和代理实际对该图提出的问题。创建这样的基准很困难,因为模式和值是唯一的,且图结构随时间变化。每个自然语言查询对必须可执行、使用真实图实体、保持多样性,并在查询类型和难度级别上保持平衡。我们提出PIPE-Cypher,一个本地基准生成流水线,它将实时属性图和来自客户问题、分析师日志或代理工具调用的可选种子查询转化为平衡的NL-to-Cypher基准。PIPE-Cypher结合了模式分析、逆向查询接地、约束生成、确定性Cypher治理、执行验证、编辑、多样性控制以及校准的本地大语言模型评判器。使用本地Qwen3.5-9B生成和评判,PIPE-Cypher导出了3000个可接受的FinBench/SNB示例,完成了三个审计消融套件,用人类标签校准评判器行为,并评估了11个本地下游模型。生成的基准具有明确的区分性:零样本迁移效果弱,而少样本控制表明,特定模式的示例库可以帮助兼容的模型家族。总之,PIPE-Cypher使Text2Cypher基准测试成为一个可重复的过程,随图、用户和目标工作负载而演变。

英文摘要

Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark therefore reflects the questions users and agents actually ask of that graph. Creating such a benchmark is difficult because schemas and values are unique, and graph structure changes over time. Each NL-query pair must also be executable, use real graph entities, preserve diversity, and remain balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph and optional seed queries from customer questions, analyst logs, or agent tool calls into balanced NL-to-Cypher benchmarks. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior with human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. Together, PIPE-Cypher makes Text2Cypher benchmarking a repeatable process that evolves with the graph, its users, and its target workloads.

2606.08718 2026-06-09 cs.LG cs.AI 交叉投稿

Deep Active Re-Labeling: Toward Noise-Resilient Annotation Efficiency

深度主动重标注:迈向抗噪的标注效率

Md Abdullah Al Forhad, Weishi Shi

AI总结 针对深度主动学习中人工标注噪声导致性能下降的问题,提出一种通过分配部分标注预算重新标注已标注数据来去噪的框架,实验表明在相同预算下更高效且最终数据集噪声较少。

Comments Accepted and published in the 2025 IEEE International Conference on Big Data (BigData). DOI: 10.1109/BigData66926.2025.11402126

Journal ref 2025 IEEE International Conference on Big Data (BigData), Macau, China, 2025, pp. 886-895

AI中文摘要

虽然深度主动学习(DAL)有效减少了人工标注成本,但其效果受到人工标注误差的限制。这是因为主动学习采样的数据被认为对训练具有高度信息性。当人工标注者以一定比率向这些信息性数据引入错误时,主动学习性能显著下降,有时甚至比被动学习更差。本文首先分析了DAL设置中人工标注误差的影响。然后,我们提出了一个框架来解决DAL中的人工标注噪声问题。受人类学习模式的启发,我们提出的解决方案的核心思想是将部分人工标注预算分配给重新标注已标注的数据。先前的理论工作表明,当模型具备一定识别潜在噪声数据的能力时,即使重新标注一小部分数据也能有效去除主动训练集中的噪声。为此,我们实现了两种主动噪声采样策略,在不同情况下检测噪声,并分配部分标注预算重新标注这些实例。我们的方法赋予了主动学习一种回顾和内省的行为。实验表明,在相同标注预算下,我们的方法数据效率更高,并最终产生一个相对无噪声的标注数据集。

英文摘要

While Deep Active Learning (DAL) effectively reduces human annotation costs, its efficacy is constrained by human annotation errors. This is because the data sampled for active learning is assumed to be highly informative for training. When human annotators introduce errors into this informative data at a certain rate, the active learning performance drops significantly and, in some cases, even exhibits worse outcomes than passive learning. In this paper, we first analyze the impact of human annotation errors in the DAL setting. Then we propose a framework to address the human annotation noise problem for DAL. Informed by human learning patterns, the core idea of our proposed solution involves allocating a portion of the human annotation budget to re-annotate data that has already been labeled. Previous theoretical work suggests that when the model possesses a certain level of ability to identify potentially noisy data, even re-labeling a small fraction of the data can effectively remove noise from the active training set. To achieve this, we implement two active noise sampling strategies to detect noise under different circumstances and allocate a part of the annotation budget to re-annotate these instances. Our approach imbues active learning with a revisiting and introspective behavior. Our experiments demonstrate that, under the same annotation budget, our method is more data-efficient and yields a relatively noise-free annotation dataset in the end.
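
The core idea, spending part of each round's annotation budget on re-annotating already-labeled data, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `relabel_fraction` parameter and the "lowest model confidence in the current label" heuristic are hypothetical stand-ins, not the two noise sampling strategies from the paper, which the abstract does not specify.

```python
from typing import Dict, List, Tuple


def split_budget(budget: int, relabel_fraction: float) -> Tuple[int, int]:
    """Return (new_label_budget, relabel_budget) for one active-learning round."""
    relabel = int(round(budget * relabel_fraction))
    return budget - relabel, relabel


def pick_for_relabeling(label_confidence: Dict[str, float], k: int) -> List[str]:
    """Pick the k labeled items whose current label the model trusts least (assumed heuristic)."""
    ranked = sorted(label_confidence, key=lambda item: label_confidence[item])
    return ranked[:k]


# Hypothetical model confidence in the label each item currently carries.
confidence = {"x1": 0.90, "x2": 0.20, "x3": 0.50}

new_budget, relabel_budget = split_budget(100, 0.2)
suspects = pick_for_relabeling(confidence, 2)
```

The total number of annotator queries stays fixed; only the split between new labels and revisited labels changes.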

2606.08769 2026-06-09 cs.CL cs.AI 交叉投稿

RadOT-Eval: Auditable Structured-Evidence Transport for Radiology Report Evaluation

RadOT-Eval:用于放射学报告评估的可审计结构化证据传输

Weixin Liu, Juming Xiong, Yang Li, Qingyuan Song, Susannah Rose, Murat Kantarcioglu, Bradley Malin, Zhijun Yin

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出RadOT-Eval框架,通过最优传输对齐结构化临床证据,在独立测试集上与标注的错误负担取得较高的斯皮尔曼相关性(总错误负担为0.715),点估计值高于标准指标和基于LLM的评估器。

Comments 10 pages, 1 figure, 13 tables

AI中文摘要

自动评估对于高风险文本生成至关重要,其中的错误通常涉及遗漏发现、幻觉内容、极性反转、位置变化、不确定性不匹配和时间比较错误,而不仅仅是低表面相似性。放射学报告生成提供了一个具有挑战性的测试案例,因为生成的报告必须跨来源保留结构化临床证据。我们提出了RadOT-Eval,一个可解释的结构化证据最优传输框架,用于离线审计放射学报告生成。RadOT-Eval将参考报告和候选报告分解为属性结构化的临床证据单元,使用熵正则化最优传输对齐相应的证据,并在单调风险模型中使用具有临床意义的侧信道差异来预测错误负担。所有传输、特征和读出方面的设计均使用ReXVal数据集选定,并在独立的RadEvalX数据集上评估冻结系统。RadOT-Eval与总错误负担、临床显著错误负担和临床不显著错误负担的斯皮尔曼相关系数分别为0.715、0.548和0.399,其点估计值高于标准评估指标和基于开源大语言模型(LLM)的评估器GREEN-radllama2-7B。在ReXErr-v1上的冻结辅助损坏敏感性压力测试中,RadOT-Eval达到了0.768的AUROC,以及0.990的配对胜率(即损坏报告的得分高于对应的干净报告)。这些结果表明,在仅使用ReXVal模型选择和冻结RadEvalX测试下,结构化证据传输为高风险生成的临床文本提供了一个可审计、面向排序的评估工具。

英文摘要

Automatic evaluation is critical for high-stakes text generation, where errors often involve omitted findings, hallucinated content, polarity reversals, location changes, uncertainty mismatches, and temporal-comparison errors rather than low surface similarity alone. Radiology report generation provides a challenging test case because generated reports must preserve structured clinical evidence across sources. We present RadOT-Eval, an interpretable structured-evidence optimal transport framework for offline auditing of radiology report generation. RadOT-Eval decomposes reference and candidate reports into attribute-structured clinical evidence units, aligns corresponding evidence using entropy-regularized optimal transport, and uses clinically meaningful side-channel discrepancies in a monotone risk model to predict error burden. All transport, feature, and readout choices are selected using the ReXVal dataset, and the frozen system is evaluated on the independent RadEvalX dataset. RadOT-Eval achieves Spearman correlations of 0.715, 0.548, and 0.399 with total, clinically significant, and clinically insignificant annotated error burden, respectively, yielding higher point estimates than standard evaluation metrics and the open-source large language model (LLM)-based evaluator GREEN-radllama2-7B. In a frozen auxiliary corruption-sensitivity stress test on ReXErr-v1, RadOT-Eval achieves 0.768 AUROC and a 0.990 corrupted-greater-than-clean paired win rate. These results show that structured evidence transport provides an auditable, rank-oriented evaluation tool for high-stakes generated clinical text under ReXVal-only model selection and frozen RadEvalX testing.
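
The alignment step named in the abstract, entropy-regularized optimal transport, is commonly solved with Sinkhorn iterations. The sketch below is a generic pure-Python version, not RadOT-Eval's implementation: the cost matrix, the uniform marginals, and the `epsilon` value are hypothetical, and balanced marginals are assumed for simplicity.

```python
import math
from typing import List


def sinkhorn(cost: List[List[float]], a: List[float], b: List[float],
             epsilon: float = 0.1, iterations: int = 200) -> List[List[float]]:
    """Return a transport plan with row sums close to a and column sums close to b."""
    n, m = len(a), len(b)
    # Gibbs kernel: cheap pairs get large entries, expensive pairs get small ones.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical cost between two reference and two candidate evidence units:
# 0 means the units agree, 1 means they conflict.
cost = [[0.0, 1.0],
        [1.0, 0.0]]
plan = sinkhorn(cost, a=[0.5, 0.5], b=[0.5, 0.5])
```

With this cost matrix nearly all mass lands on the diagonal, that is, each reference unit is matched to the candidate unit it agrees with. A smaller `epsilon` makes the plan sharper but the iterations less numerically stable.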

2606.08850 2026-06-09 cs.LG cs.AI cs.CL stat.ML 交叉投稿

Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability

内在选择与粒子重采样:超越领域可验证性的推理时扩展

Giorgio Giannone, Mustafa Eyceoz, Shabana Baig, Shivchander Sudalairaj, Anna C. Doris, Faez Ahmed, Akash Srivastava, Kai Xu

发表机构 * MIT(麻省理工学院) Red Hat(红帽公司) IBM(IBM公司)

AI总结 提出基于并行样本集内在统计量(长度调整尾熵)的推理时扩展方法,通过后验候选排序和步骤级重采样,无需外部验证即可提升开放领域任务性能。

Comments preprint

AI中文摘要

推理时扩展(ITS)在数学和编程等可验证领域取得了很大成功,其中廉价验证使得可扩展输出选择成为可能。然而,将ITS扩展到容易发生系统性失败的任务——由错误初始假设或未满足的多维约束驱动——通常依赖于昂贵的外部求解器或脆弱的基于模型的验证器。我们的关键洞察是,并行样本集的内在统计量,特别是长度调整尾熵,提供了关于解质量的稳健判别信号,而无需访问真实标签。至关重要的是,这些统计量作为自适应计算分配的难度门控,动态地将问题路由到不同的扩展规模。首先,内在选择(iS)事后对候选进行排序,在三个领域匹配基于共识的算法,并将工程设计选择性能比pass@1基线提高20%。其次,内在粒子滤波(iPF)将其推广到步骤级重采样,引导生成走向高置信度推理轨迹,在困难数学问题上平均将pass@1提高6.1个百分点。最后,粒子蒸馏(dPF)通过早期logit混合和KL引导重采样注入特权指导,引导生成绕过系统性推理错误以满足专家评分标准,在复杂临床响应上获得高达26.5%的提升。我们的流程无缝适用于通用、领域专用和多模态架构,成功将ITS扩展到开放领域,而无需训练奖励模型或精确的真实标签验证。

英文摘要

Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap verification enables scalable output selection. However, extending ITS to tasks prone to systematic failure - driven by faulty initial assumptions or unmet multidimensional constraints - typically relies on costly external solvers or brittle, model-based verifiers. Our key insight is that the intrinsic statistics of parallel sample sets, specifically length-adjusted tail entropy, provide a robust discriminative signal for solution quality without access to ground truth. Crucially, these statistics serve as a difficulty gate for adaptive compute allocation, dynamically routing problems across scaling regimes. First, Intrinsic Selection (iS) ranks candidates post-hoc, matching consensus-based algorithms across three domains and improving engineering design selection by 20% over pass@1 baselines. Second, Intrinsic Particle Filtering (iPF) generalizes this to step-level resampling, guiding generation toward high-confidence reasoning trajectories to improve pass@1 by 6.1 points on average on hard math problems. Finally, Particle Distillation (dPF) injects privileged guidance via early logit blending and KL-guided resampling, steering generation past systematic reasoning errors to satisfy expert rubrics, yielding up to 26.5% gains on complex clinical responses. Our pipeline applies seamlessly across broad-purpose, domain-specialized, and multimodal architectures, successfully extending ITS to open-ended domains without requiring trained reward models or exact ground-truth verification.
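
Step-level resampling in a particle filter can be sketched as follows. This is a generic illustration, not the paper's iPF: it assumes each partial reasoning trajectory already has a scalar confidence score (for instance a negated entropy), and it does not implement length-adjusted tail entropy, whose exact definition is not given in the abstract. The softmax weighting and the `temperature` parameter are likewise assumptions.

```python
import math
import random
from typing import List, Sequence


def softmax_weights(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn per-particle confidence scores into normalized resampling weights."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(particles: Sequence[str], weights: Sequence[float],
             rng: random.Random) -> List[str]:
    """Multinomial resampling: draw len(particles) particles with replacement."""
    return rng.choices(list(particles), weights=list(weights), k=len(particles))


# Hypothetical partial trajectories and their confidence scores (higher is better).
particles = ["trajectory_a", "trajectory_b", "trajectory_c"]
scores = [-2.0, -0.5, -3.0]

weights = softmax_weights(scores)
survivors = resample(particles, weights, random.Random(0))
```

High-confidence trajectories tend to be duplicated and low-confidence ones dropped, so later generation steps spend compute on the more promising prefixes.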

2606.08932 2026-06-09 cs.CL cs.AI cs.CE 交叉投稿

From Statute to Control Flow: Span-Grounded Deontic Trees for Defeasible Scope Parsing

从法规到控制流:基于跨度义务树的可废止范围解析

Jian Chen, Siyuan Li, Chucheng Wan, Zixuan Yuan

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Sun Yat-Sen University(中山大学)

AI总结 提出NormBench基准和跨度义务树(SG-DT)中间表示,用于诊断和缓解规则遵循模型中的静默范围遗漏(SSO)问题,揭示递归衰减和可审计性陷阱两种病理,并通过约束输出改善树结构保真度。

AI中文摘要

执行政策和法规的规则遵循代理常常因静默范围遗漏(SSO)而失败:模型应用一般规则但静默地丢弃嵌套的例外或反例外,产生看似合规但在重要边缘案例上失效的输出。尽管此类失败常被视为代理系统问题,其根本瓶颈在于法规和政策理解——这一能力通常在法律NLP中研究。然而,大多数现有法律NLP基准强调最终任务结果,可能忽略导致SSO的结构性遗漏。为诊断和缓解SSO,我们引入NormBench,一个包含2290条条款的基准,涵盖中文(法律和地方政策)、英文(美国税法、GDPR和企业政策)及跨语言设置,专为可废止范围解析设计:精确识别哪个条款覆盖哪个。NormBench使用基于跨度义务树(SG-DT),一种编译器式中间表示,将每个逻辑分支锚定到源跨度并要求显式排除守卫,实现确定性编译和审计。对前沿LLM的评估揭示了两种反复出现的病理:(1)递归衰减,性能随击败者深度增加急剧下降;(2)可审计性陷阱,模型检索相关跨度但未能组装正确的控制流。使用SG-DT作为约束中间输出可改善整树保真度和击败者恢复,下游实验表明其效用是机制特定的:增益集中在例外活跃、易SSO的案例上,而当附加结构不必要或解析器保真度低时,总体准确率可能参差不齐。

英文摘要

Rule-following agents tasked with executing policies and regulations often fail via Silent Scope Omission (SSO): a model applies a general rule but silently drops nested exceptions or counter-exceptions, producing outputs that appear compliant yet break on important edge cases. Although such failures are often framed as an agentic-systems problem, the underlying bottleneck is statutory and policy understanding, a capability typically studied in legal NLP. However, most existing legal NLP benchmarks emphasize end-task outcomes, which can overlook the structural omissions that cause SSO. To diagnose and mitigate SSO, we introduce NormBench, a benchmark of 2,290 provisions spanning Chinese (laws and local policies), English (U.S. tax law, GDPR, and corporate policies), and cross-lingual settings, designed for defeasible scope parsing: identifying precisely which clause overrides which. NormBench uses Span-Grounded Deontic Trees (SG-DT), a compiler-style intermediate representation that anchors every logical branch to source spans and requires explicit exclusion guards, enabling deterministic compilation and audit. Evaluations of frontier LLMs reveal two recurring pathologies: (1) Recursion Decay, where performance drops sharply as defeater depth increases, and (2) an Auditability Trap, where models retrieve relevant spans but fail to assemble correct control flow. Using SG-DT as a constrained intermediate output improves whole-tree fidelity and defeater recovery, and downstream experiments show that its utility is mechanism-specific: gains concentrate on exception-active, SSO-prone cases, while aggregate accuracy can be mixed when the added structure is unnecessary or parser fidelity is low.
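
Silent Scope Omission is easier to see on a tiny example. The sketch below models a rule, an exception, and a counter-exception as a tree in which every node carries a source span, then evaluates which clause is in force. The `Clause` class, the span offsets, and the made-up gift rule are hypothetical illustrations; this is not the paper's SG-DT schema and not a statement of any real law.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Facts = Dict[str, bool]


@dataclass
class Clause:
    span: Tuple[int, int]                  # character offsets into the source text (hypothetical)
    applies_when: Callable[[Facts], bool]  # condition under which the clause is triggered
    defeaters: List["Clause"] = field(default_factory=list)


def in_force(clause: Clause, facts: Facts) -> bool:
    """A clause is in force if it is triggered and none of its defeaters is in force."""
    if not clause.applies_when(facts):
        return False
    return not any(in_force(d, facts) for d in clause.defeaters)


# Made-up provision: "Income is taxed, except gifts, unless the gift comes from an employer."
counter_exception = Clause(span=(120, 160), applies_when=lambda f: f["from_employer"])
exception = Clause(span=(60, 119), applies_when=lambda f: f["is_gift"],
                   defeaters=[counter_exception])
rule = Clause(span=(0, 59), applies_when=lambda f: True, defeaters=[exception])

# The same rule with the counter-exception silently dropped (the SSO failure).
exception_without_guard = Clause(span=(60, 119), applies_when=lambda f: f["is_gift"])
rule_with_omission = Clause(span=(0, 59), applies_when=lambda f: True,
                            defeaters=[exception_without_guard])

employer_gift = {"is_gift": True, "from_employer": True}
taxed_full_tree = in_force(rule, employer_gift)
taxed_with_omission = in_force(rule_with_omission, employer_gift)
```

The two trees agree on ordinary income and on ordinary gifts and disagree only on the employer-gift case, which is why an omission of this kind can pass aggregate accuracy checks while failing on the edge case that matters.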

2606.09178 2026-06-09 cs.CL cs.AI 交叉投稿

Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis

跨东亚和东南亚语境的文化适应红队测试:方法论与比较分析

Hyeji Choi, Yongtaek Lim, Minwoo Kim

发表机构 * University of California, Berkeley(加州大学伯克利分校) Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 针对大语言模型的多语言安全评估,通过构建直接翻译与文化适应数据集,发现文化适应提示的攻击成功率平均提升9.3个百分点,直接翻译低估风险,且文化深度评分显著低于文化适应版本,表明适应文化语境对有效评估至关重要。

Comments Accepted to ICML 2026 Workshop on AIWILDS

AI中文摘要

大语言模型的多语言安全评估主要依赖于将英文基准直接翻译成目标语言——这种方法转换了表面语言形式,但未能反映威胁场景、社会规范和法律法规中嵌入的文化语境。我们通过1:1种子匹配为四种语言——韩语、日语、泰语和高棉语——构建了配对的直接翻译和文化适应数据集,并比较了四个开源大语言模型的攻击成功率和文化真实感评分。文化适应提示在所有16种语言×模型组合中均产生正Delta-ASR(平均+9.3个百分点),且基于直接翻译的评估在48个类别×语言组合中有44个低估了风险。语言层面分析显示,威胁形式的分布在语言间具有异质性。文化真实感分析进一步表明,直接翻译的文化深度(C3)评分在所有四种语言中始终低于1.0(满分3.0,平均0.17),而文化适应评分高达2.51,表明直接翻译产生的输入与真实世界多文化环境中遇到的输入存在系统性差异。这些发现表明,将基准适应特定语言的文化语境——而非仅依赖语言翻译——对于有效的多语言大语言模型安全评估是必要的。

英文摘要

Multilingual safety evaluation of large language models (LLMs) has predominantly relied on direct translation (DT) of English benchmarks into target languages - an approach that converts surface-level linguistic form while failing to reflect the cultural context embedded in threat scenarios, social norms, and legal frameworks. We construct paired DT and culturally-adapted (CA) datasets via 1:1 seed matching for four languages - Korean (KO), Japanese (JA), Thai (TH), and Khmer (KM) - and compare Attack Success Rate (ASR) and Cultural Realism scores across four open-source LLMs. CA prompts yield Delta-ASR > 0 across all 16 language x model combinations (mean +9.3 pp), and DT-based evaluation underestimates risk in 44 of 48 category x language combinations. Language-level analysis reveals that the distribution of threat forms is heterogeneous across languages. Cultural Realism analysis further shows that DT Cultural Depth (C3) scores remain consistently below 1.0 out of 3.0 across all four languages (mean 0.17), whereas CA scores reach up to 2.51, indicating that direct translation produces inputs systematically divergent from those encountered in real-world multicultural settings. These findings demonstrate that adapting benchmarks to language-specific cultural contexts - rather than relying on linguistic translation alone - is necessary for valid multilingual LLM safety evaluation.
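
The Delta-ASR comparison reduces to simple arithmetic over paired prompts. The sketch below assumes Delta-ASR is the CA attack success rate minus the DT attack success rate, in percentage points, for one language and model; the outcome lists are hypothetical and are not data from the paper.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of prompts for which the attack succeeded."""
    return 100.0 * sum(outcomes) / len(outcomes)


def delta_asr(ca_outcomes: Sequence[bool], dt_outcomes: Sequence[bool]) -> float:
    """CA minus DT attack success rate, in percentage points."""
    if len(ca_outcomes) != len(dt_outcomes):
        raise ValueError("1:1 seed matching requires equally long outcome lists")
    return attack_success_rate(ca_outcomes) - attack_success_rate(dt_outcomes)


# Hypothetical outcomes for four matched seeds (True means the attack succeeded).
ca = [True, True, False, True]
dt = [True, False, False, True]
gap = delta_asr(ca, dt)
```

A positive gap means the directly translated benchmark would have understated risk for that language and model.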

2606.09353 2026-06-09 cs.CV cs.AI 交叉投稿

Beyond Humans: Multispecies Animal Face Recognition Using Transfer Learning

超越人类:使用迁移学习的多物种动物面部识别

Maria De Marsico, Anil K. Jain, Annalaura Miglino

发表机构 * Sapienza University of Rome(罗马大学) Michigan State University(密歇根州立大学) University of Salerno(萨莱诺大学)

AI总结 研究利用迁移学习(FaceNet和Vision Transformer)实现多物种动物面部识别,在狗、灵长类和牛数据集上验证,狗识别准确率最高(96.85%),部分场景超越现有方法。

Comments This paper extends the work published in the proceedings of CAIP 2025 conference: 'Adapting to the Wild: From Human Face to Animal Face Recognition' by De Marsico, M., Jain, A. K., Miranda, M., & Orlando, A

AI中文摘要

个体动物识别可用于寻找丢失或被盗的宠物、追踪濒危物种个体以及识别拥挤农场中的动物。目前的识别技术主要使用物理设备(如微芯片),通常不切实际且难以应用。这些可以通过动物面部进行远程识别来替代;如果足够准确,它具有多个优势:非侵入性、可远距离工作、难以伪造,例如在食品工业中用病畜替换健康畜的情况。现有的少数数据集具有足够的每个主体图像并标注了单个动物身份,但不足以训练当前的深度学习架构。我们转而研究迁移学习的可能性,利用预训练网络模型作为骨干。我们的实验比较了专门在大型人脸数据库上训练的FaceNet和在ImageNet(即对象类别)上预训练的Vision Transformer(ViT)。我们使用了三种非常不同的动物的面部数据集:狗、灵长类(狐猴、金丝猴和黑猩猩)和牛。我们报告了结果,并对每个数据集与当前最优(SOTA)专门训练的深度网络进行了比较。三个数据集的捕获条件不同。图像质量(分辨率、运动模糊、不同姿态等)从狗到牛到灵长类依次下降。最佳性能在狗上实现,ViT达到了96.85%的平均验证准确率和84.34%的Rank-1识别率。濒危灵长类的结果仍然令人鼓舞,但性能因动物类别和任务(验证或识别)而异,并不总是优于SOTA。对于牛,ViT结果优于SOTA,而FaceNet仍然具有竞争力。

英文摘要

Individual animal recognition can be useful in the search for lost or stolen pets, the tracking of individuals of endangered species, and the recognition of animals in crowded farms. Present recognition techniques mostly use physical devices, e.g., microchips, often impractical and difficult to apply. These could be replaced by remote recognition via the animal's face; if accurate enough, it provides several advantages: it is non-invasive, can work at a distance, and is difficult to counterfeit, as, for instance, in the case of substituting sick animals for healthy ones in the food industry. The few existing datasets with sufficient per-subject images annotated with a single animal identity are not large enough to train current deep learning architectures. We rather investigate the possibility of transfer learning, exploiting pre-trained network models as backbones. Our experiments compared FaceNet, which is specifically trained on large databases of human faces, with the Vision Transformer (ViT) pre-trained on ImageNet, i.e., on object categories. We used three face datasets of very different animals: dogs, primates (lemurs, golden monkeys, and chimpanzees), and cattle. We report the results and, for each dataset, compare them with the state of the art (SOTA) ad hoc-trained deep networks. The capture conditions differ among the three datasets. Image quality (resolution, motion blur, diverse poses, etc.) decreases from dogs to cattle to primates. The best performance was achieved with dogs, where ViT reached a mean verification accuracy of 96.85% and a Rank-1 Identification Rate of 84.34%. The results for endangered primates are still encouraging, but performance varies across animal classes and tasks (verification or identification), and does not always outperform SOTA. For cattle, the ViT results outperform SOTA, while FaceNet is still competitive.
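
The Rank-1 Identification Rate reported above can be computed from embeddings produced by a frozen backbone. The sketch below is a generic illustration, assuming cosine similarity and one gallery embedding per identity; the two-dimensional vectors and the identity names are hypothetical, and a real backbone such as FaceNet or ViT would produce much higher-dimensional features.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def rank1_rate(gallery: Dict[str, Vector], probes: List[Tuple[str, Vector]]) -> float:
    """Fraction of probes whose most similar gallery identity is the true one."""
    hits = 0
    for true_id, embedding in probes:
        best_id = max(gallery, key=lambda gid: cosine(gallery[gid], embedding))
        hits += int(best_id == true_id)
    return hits / len(probes)


# Hypothetical enrolled identities and probe images (true identity, embedding).
gallery = {"dog_1": [1.0, 0.0], "dog_2": [0.0, 1.0]}
probes = [("dog_1", [0.9, 0.1]), ("dog_2", [0.2, 0.8]), ("dog_1", [0.1, 0.9])]
rate = rank1_rate(gallery, probes)
```

Verification differs from identification in that it thresholds the similarity of a single pair instead of ranking the whole gallery, which is why the two numbers in the abstract are reported separately.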

2606.09368 2026-06-09 cs.CV cs.AI 交叉投稿

PhysScene: A Scene Graph Dataset for Scientific Visual Reasoning in Physics Experiments

PhysScene:用于物理实验科学视觉推理的场景图数据集

Minghao Zou, Qingtian Zeng, Shangkun Liu, Yanda Meng, Guanghui Yue, Baoquan Zhao, Abdulmotaleb El Saddik, Wei Zhou

发表机构 * Cardiff University(卡迪夫大学) Shandong University of Science and Technology(山东科技大学) University of Exeter(埃克塞特大学) Shenzhen University(深圳大学) Sun Yat-sen University(中山大学) University of Ottawa(渥太华大学)

AI总结 提出首个面向物理实验的场景图数据集PhysScene,通过高密度关系约束和结构化实验设置,推动科学视觉推理中超越空间共现的逻辑依赖关系建模。

AI中文摘要

场景图通过建模对象及其成对关系,提供视觉场景的结构化表示。尽管最近取得了进展,现有数据集主要关注通用自然场景,领域特定和功能导向的场景仍未被充分探索。这一限制阻碍了科学实验场景中关系推理的评估,进而阻碍了此类场景中智能监控、分析及相关应用的发展。为填补这一空白,我们引入了PhysScene,这是首个针对物理实验的场景图数据集。PhysScene涵盖了实验环境中特有的仪器、结构化实验装置和功能关系,使得推理能够超越空间共现,扩展到逻辑依赖。PhysScene不追求大规模数据,而是聚焦于实验场景中的强语义约束和高关系密度,为现有场景解析算法带来新挑战,同时提供进一步改进的机会。广泛的分析和实验表明,PhysScene补充了现有基准,并为推进科学视觉推理建立了有价值的测试平台。该数据集公开于https://github.com/ZMH-SDUST/PhysScene。

Abstract

Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise relationships. Despite recent progress, existing datasets primarily focus on generic natural contexts, leaving domain-specific and function-oriented scenes largely underexplored. This limitation restricts the evaluation of relational reasoning in scientific experimental scenes, thereby hindering the development of intelligent monitoring, analysis, and related applications in such scenes. To address this gap, we introduce PhysScene, the first SG dataset tailored to physics experiments. PhysScene encompasses specialized instruments, structured experimental setups, and functional relations intrinsic to experimental environments, enabling reasoning that extends beyond spatial co-occurrence to logical dependencies. Rather than pursuing large data scale, PhysScene focuses on strong semantic constraints and high relation density in experimental scenes, posing new challenges for existing scene parsing algorithms while offering opportunities for further improvements. Extensive analyses and experiments show that PhysScene complements existing benchmarks and establishes a valuable testbed for advancing scientific visual reasoning. The dataset is publicly available at https://github.com/ZMH-SDUST/PhysScene.

2606.09613 2026-06-09 cs.CL cs.AI cross-list

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

Rakibul Hasan Rajib, Mengxin Zheng, Qian Lou

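Affiliations: University of Central Florida

AI summary: Presents AGENTSERVESIM, a simulator that evaluates multi-turn LLM agent serving policies at program granularity through a program orchestrator, a tool simulator, a session-aware router, and a KV residency model, reproducing real-system behavior within 6% error while running on CPUs.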

Comments Preprint

Abstract

Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and routing policies that use program-level context, including turn dependencies, tool-induced gaps, and reusable KV state. Evaluating such policies directly on real systems is costly, since each design point may require dedicated accelerator time across arrival rates, model scales, serving-instance counts, and memory hierarchies. Simulation offers a scalable alternative, but existing LLM serving simulators target stateless request-level workloads and therefore omit the core dynamics of agent serving: multi-turn program execution, cross-turn cache locality, and KV-cache residency during tool gaps. We present AGENTSERVESIM, a hardware-aware simulator for multi-turn LLM agent serving. AGENTSERVESIM evaluates serving policies at program granularity through composable modules: a Program Orchestrator preserves program identity and turn order, a Tool Simulator materializes tool-induced gaps, a Session-Aware Router maintains program-to-instance affinity for cache-aware dispatch, and a KV Residency Model tracks policy-defined KV placement across HBM, host DRAM/CXL, and eviction. Across real serving deployments and hardware configurations, AGENTSERVESIM reproduces real-system behavior within 6% error across key performance metrics while running entirely on commodity CPUs. These results show that AGENTSERVESIM enables controlled, repeatable exploration of agent-serving policies without requiring exhaustive deployment on costly accelerators.

2606.09646 2026-06-09 cs.CV cs.AI cs.LG cross-list

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

Samuele Punzo, Niccolò Caselli, Ippokratis Pantelidis, Francesco Massafra, Salvatore Lo Sardo, Mohammadreza Salehi

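Affiliations: University of Amsterdam

AI summary: Uses frozen-feature probing to study how well pretrained video foundation models encode intuitive-physics information; V-JEPA performs best, physics information is most accessible in intermediate-to-late layers, and disrupting temporal order substantially reduces performance.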

Abstract

We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.

2606.09648 2026-06-09 cs.DB cs.AI cross-list

ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

Luciano Duarte, Olga Ovcharenko, Sebastian Schelter

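Affiliations: BIFOLD & TU Berlin

AI summary: Presents ArtiFact, a multi-modal cultural heritage dataset of about 650,000 museum records for cross-modal error detection and semantic query processing, exposing the difficulty existing systems have with domain-specific errors and culturally grounded semantic queries.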

Comments Preprint

Abstract

Multi-modal data management has emerged as a central research topic in the database community, spanning data integration, semantic query processing, and data quality assessment. Despite this growing interest, the community lacks large-scale, real-world datasets combining tables, text, and images. We present ArtiFact, a multi-modal cultural heritage dataset of 651,045 museum records collected from the Metropolitan Museum of Art, the Art Institute of Chicago, and the Rijksmuseum. We demonstrate the utility of ArtiFact through two downstream tasks. For cross-modal error detection, we introduce a curated taxonomy of seven error categories injected into 130,209 records and show that reliably detecting subtle domain-specific errors such as material anachronisms and temporal shifts remains an open challenge. For semantic query processing, we show that current systems struggle with queries involving cultural proximity, ambiguous object types, and historically contingent terminology. Our results position ArtiFact as a challenging benchmark for multi-modal data management research.

2606.09686 2026-06-09 cs.AR cs.AI cs.MS cs.NA cs.PF math.NA cross-list

An 84-Format Numeric Catalog with Bit-Exact Conformance Vectors: A Vendor-Neutral Reference for FP8, BF16, MXFP4, and Microscaling Formats

Dmitrii Vasilev

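Affiliations: Trinity S 3 AI

AI summary: Addressing the proliferation of numeric formats in machine learning hardware, the paper builds a catalog of 84 formats across 13 families and provides six bit-exact conformance packs plus an IEEE P3109 cross-walk as a vendor-neutral reference.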

Comments 17 pages. Source repository: https://github.com/gHashTag/paper3-methodology tag v4.0-trinity. Paper CC BY 4.0; code MIT. ORCID 0009-0008-4294-6159

Abstract

Numeric format proliferation in machine learning hardware -- FP8 (E4M3 and E5M2), BF16, MXFP4, microscaling block formats, and dozens of research variants -- has outpaced the availability of vendor-neutral, bit-exact reference material. Engineers porting models across accelerators encounter silent divergences that are difficult to diagnose without a shared ruler. This paper describes a catalog of 84 numeric formats spanning 13 families, a suite of six bit-exact conformance packs covering GF16, MXFP4 element, BF16, FP8 E4M3, FP8 E5M2, and E8M0 block scale, and an IEEE P3109 v3.2.0 cross-walk that maps each pack to its corresponding standards-track configured format. Each pack is a self-contained JSON document with a SHA-256 fingerprint, a shared row schema, and an anchor vector that encodes 3.0 -- the identity phi^2 + 1/phi^2 = 3 -- as a cross-pack sanity check. Packs are cross-validated against ml_dtypes 0.5.4 (Google/JAX); any divergence is documented explicitly and interpreted as a spec-permitted interpretation gap rather than hidden. The work is framed as registry filling: it does not propose new formats, make model-accuracy claims, or assert superiority over any vendor's implementation. All artifacts are publicly available at https://github.com/gHashTag/t27 under an open license.
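As a hedged illustration of what an anchor value of 3.0 looks like at the bit level, the sketch below checks the identity numerically and derives the encodings of 3.0 under the standard BF16 layout (8 exponent bits, 7 mantissa bits, bias 127) and the OCP FP8 layouts E4M3 (bias 7) and E5M2 (bias 15). The identity follows from phi^2 = phi + 1 and 1/phi = phi - 1, which give 1/phi^2 = 2 - phi and hence a sum of 3. The function names are hypothetical, and the paper's actual pack contents and row schema are not reproduced here.

```python
import math
import struct

PHI = (1 + math.sqrt(5)) / 2  # golden ratio

def bf16_bits(x):
    """BF16 bit pattern obtained by truncating the IEEE 754 binary32 encoding.

    Truncation is exact only for values whose low 16 mantissa bits are zero
    (true for 3.0); a general encoder would round to nearest even.
    """
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16

def small_float_bits(x, exp_bits, man_bits, bias):
    """Encode a positive normal value that is exactly representable in a
    sign/exponent/mantissa layout (no rounding, subnormals, or specials)."""
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= mantissa < 1
    significand = mantissa * 2          # normalize to the 1.f form
    biased = exponent - 1 + bias
    frac = (significand - 1) * (1 << man_bits)
    assert frac == int(frac) and 0 < biased < (1 << exp_bits) - 1
    return (biased << man_bits) | int(frac)

anchor = PHI**2 + 1 / PHI**2
print(math.isclose(anchor, 3.0))              # True, up to floating-point rounding
print(hex(bf16_bits(3.0)))                    # 0x4040
print(hex(small_float_bits(3.0, 4, 3, 7)))    # 0x44 (FP8 E4M3)
print(hex(small_float_bits(3.0, 5, 2, 15)))   # 0x42 (FP8 E5M2)
```

Because 3.0 equals 1.5 x 2^1, it is exactly representable in all three layouts, which is what makes a single value usable as a cross-format sanity check.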

2606.09800 2026-06-09 cs.SE cs.AI cs.MA cross-list

FASE: Fast Adaptive Semantic Entropy for Code Quality

Shizhe Lin, Ladan Tahvildari

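Affiliations: University of Waterloo

AI summary: Proposes Fast Adaptive Semantic Entropy (FASE), which approximates functional correctness via minimum spanning trees; on HumanEval and BigCodeBench it improves Spearman correlation by 25% and ROCAUC by 19% over existing semantic entropy methods at roughly 0.3% of their runtime cost.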

Abstract

Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks. In this work, we introduce Fast Adaptive Semantic Entropy (FASE), a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. Evaluations on HumanEval and BigCodeBench demonstrate that FASE outperforms state-of-the-art semantic entropy based on LLM entailment, achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model. Furthermore, by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches. These results position FASE as a practical, cost-effective solution for optimizing uncertainty quantification in real-world multi-agent workflows.
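The abstract does not give the FASE formula, so the following is only a rough sketch of the general idea of scoring disagreement among sampled programs through a minimum spanning tree over a dissimilarity graph. Everything here is an assumption for illustration: token-set Jaccard distance stands in for the structural and semantic dissimilarities, and the normalized MST weight stands in for the metric itself.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over whitespace-separated tokens (toy dissimilarity)."""
    ta, tb = set(a.split()), set(b.split())
    return 1.0 - len(ta & tb) / len(ta | tb)

def mst_weight(dist):
    """Total edge weight of a minimum spanning tree (Prim's algorithm)
    over a complete graph given as a symmetric distance matrix."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge from the tree to each node
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return total

def dispersion(samples):
    """Mean MST edge weight across at least two non-empty sampled programs;
    0.0 means all samples look identical under the chosen dissimilarity."""
    dist = [[jaccard_distance(a, b) for b in samples] for a in samples]
    return mst_weight(dist) / (len(samples) - 1)

print(dispersion(["return a + b", "return a + b", "return b + a"]))  # 0.0
print(dispersion(["return a + b", "return a - b"]))                  # about 0.4
```

The appeal of this kind of score is cost: it needs only pairwise distances and an O(n^2) tree construction, with no LLM call per pair of samples.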

2606.09826 2026-06-09 cs.CV cs.AI cross-list

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang, Wei Huang, Yitang Li, Fan Zhang, Zeyu Hu, Lingting Zhu, Xin Wang, Xiaojuan Qi

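Affiliations: The University of Hong Kong; LIGHTSPEED; The Chinese University of Hong Kong; Tsinghua University

AI summary: Presents OmniGameArena, a unified benchmark of 12 UE5 games, and the Improvement Dynamics Curve (IDC), which uses a reflection mechanism to evaluate VLM agents' cold-start scores, improvement dynamics, and generalization.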

Abstract

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

2411.19504 2026-06-09 cs.AI cs.CL cs.IR updated version

TQA-Bench: Evaluating LLMs for Multi-Table Question Answering

Zipeng Qiu, Chenyue Li, You Peng, Guangxin He, Binhang Yuan, Chen Wang

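Affiliations: Hong Kong University of Science and Technology; Tsinghua University

AI summary: Proposes the TQA-Bench benchmark, which evaluates LLMs on long-context multi-table question answering and reveals challenges and opportunities in complex data-driven settings.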

Comments Accepted by IEEE Transactions on Big Data

Abstract

The advance of large language models (LLMs) has unlocked great opportunities in complex multi-modal data management tasks, particularly in question answering (QA) over complicated multi-table relational data. Despite significant progress, systematically evaluating LLMs on multi-table QA remains a critical challenge due to the inherent complexity of analyzing the modality of relational data structures and the potentially large scale of serialized tabular data. Existing benchmarks primarily focus on single-table QA, failing to capture the intricacies of connections across multiple relational tables, as required in real-world domains such as finance, healthcare, and e-commerce. We present TQA-Bench, a long-context analytical multi-table QA benchmark derived from real-world public datasets, with a flexible sampling mechanism that varies context length (8K--64K tokens) and symbolic extensions for assessing reasoning beyond retrieval and pattern matching. We systematically evaluate a set of LLMs spanning model scales from 2 billion to 671 billion parameters. Our extensive experiments reveal critical insights into the performance of LLMs in multi-table QA, highlighting both challenges and opportunities for advancing their application in complex, data-driven environments.

2503.14229 2026-06-09 cs.AI cs.CV cs.RO updated version

HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions

Yifei Dong, Fengyi Wu, Qi He, Lingdong Kong, Heng Li, Minghan Li, Zebang Cheng, Yuxuan Zhou, Jingdong Sun, Qi Dai, Alexander G Hauptmann, Zhi-Qi Cheng

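Affiliations: University of Science and Technology of China

AI summary: Presents the unified HA-VLN 2.0 benchmark; through standardized tasks, the HAPS 2.0 dataset and simulators, benchmarks on 16,844 socially grounded instructions, and real-robot experiments, it shows that explicit social modeling improves navigation robustness and reduces collisions.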

Comments 35 pages, 20 figures, website: https://f1y1113.github.io/HA-VLN-webpage/

Abstract

Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous spaces, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) the HAPS 2.0 dataset and simulators modeling multi-human interactions, outdoor contexts, and finer language-motion alignment; (iii) benchmarks on 16,844 socially grounded instructions, revealing sharp performance drops of leading agents under human dynamics and partial observability; and (iv) real-world robot experiments validating sim-to-real transfer, with an open leaderboard enabling transparent comparison. Results show that explicit social modeling improves navigation robustness and reduces collisions, underscoring the necessity of human-centric approaches. By releasing datasets, simulators, baselines, and protocols, HA-VLN 2.0 provides a strong foundation for safe, human-aware navigation research.

2505.19662 2026-06-09 cs.AI cs.CV updated version

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

Jun Takahashi, Atsunori Moteki, Akiyoshi Uchida, Shoichi Masui, Fan Yang, Kanji Uchino, Yueqi Song, Yonatan Bisk, Graham Neubig, Ikuo Kusajima, Yasuto Watanabe, Hiroyuki Ishida, Koki Nakagawa, Shan Jiang

Affiliations: Fujitsu Limited; Fujitsu Research of America; Carnegie Mellon University; Master’s Student, The University of Tokyo; Agent Research Collective

AI summary: Presents FieldWorkArena for evaluating agentic AI in real manufacturing and retail environments, with tasks designed from data captured on site and field interviews, and confirms that evaluation with multimodal LLMs is feasible.

Comments 27 pages, 10 figures, 7 tables [ICPR 2026 Accepted] Changes from previous version: added supplemental material

Abstract

This paper introduces FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent increase in demand for agentic AI, such systems are being built to detect and document safety hazards, procedural violations, and other critical incidents across real-world manufacturing and retail environments. Whereas most agentic AI benchmarks focus on performance in simulated or digital environments, our work addresses the fundamental challenge of evaluating agents in the real world. In this paper, we improve the evaluation function from previous methods to assess the performance of agentic AI in diverse real-world tasks. Our dataset comprises images and videos captured on site in factories, warehouses, and retail stores. Tasks were meticulously developed through interviews with site workers and managers. Evaluation results confirmed that performance evaluation considering the characteristics of multimodal LLMs (MLLMs) such as GPT-4o is feasible. Furthermore, this study identifies both the effectiveness and limitations of the proposed new evaluation methodology. The complete dataset and evaluation program are publicly accessible on the website (https://en-documents.research.global.fujitsu.com/fieldworkarena/).

2510.12171 2026-06-09 cs.AI updated version

MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science

Junkai Zhang, Jingru Gan, Xiaoxuan Wang, Zian Jia, Changquan Gu, Jianpeng Chen, Yanqiao Zhu, Mingyu Derek Ma, Dawei Zhou, Ling Li, Wei Wang

Affiliations: University of California, Los Angeles Computer Science Department; University of Pennsylvania Department of Materials Science and Engineering; Virginia Tech Department of Computer Science

AI summary: Proposes MatSciBench, a benchmark of 1,340 college-level materials science problems spanning 6 primary fields and 31 subfields, to evaluate LLM reasoning; current models are limited in domain knowledge, calculation, and figure understanding.

Abstract

Large Language Models have shown strong scientific reasoning ability, but their performance on materials science problems remains less studied. To fill this gap, we introduce MatSciBench, a comprehensive college-level benchmark comprising 1340 problems that span the essential subdisciplines of materials science. MatSciBench features a structured and fine-grained taxonomy that categorizes materials science questions into 6 primary fields and 31 subfields, together with a three-tier difficulty classification based on the reasoning length needed to solve each problem. MatSciBench includes detailed reference solutions for 946 questions, supports process-level error analysis, and contains 315 questions with images for evaluating multimodal reasoning. We evaluate leading thinking and non-thinking LLMs on MatSciBench, and further test three reasoning methods for non-thinking models: basic chain-of-thought prompting, tool augmentation, and self-correction. The results show that current models still face clear limits in college-level materials science reasoning. DeepSeek-R1 achieves the highest score on text-only questions at 75.22% accuracy, and GPT-5 performs the best on questions with images at 53.02%. Our analysis shows that tool augmentation improves many non-thinking models in a token-efficient way, while self-correction often fails to provide reliable gains and can revise correct answers into incorrect ones. We further analyze performance across difficulty levels, reasoning efficiency, multimodal reasoning, and failure patterns, and find that current models are mainly limited by domain knowledge gaps, calculation errors, problem comprehension failures, and difficulty in extracting precise information from scientific figures. Overall, MatSciBench provides a clear testbed for measuring current LLM limitations and guiding future work on scientific reasoning in materials science.

2510.27544 2026-06-09 cs.AI cs.FL updated version

TempoBench: Evaluating Temporal Causal Reasoning in Large Language Models

Nikolaus Holzer, William Fishell, Baishakhi Ray, Mark Santolucito

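Affiliations: Columbia University; Columbia University, Barnard College

AI summary: Proposes the TempoBench benchmark, which generates verifiable causal labels from synthesized Mealy machines to evaluate LLM temporal causal reasoning; models score below 25% accuracy on minimal causal attribution, with overspecification as the main error.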

Abstract

Temporal reasoning involves understanding how systems evolve over time through input-driven state transitions. A key aspect is temporal causal reasoning: reasoning about which prior inputs were necessary to cause an observed outcome. While large language models (LLMs) perform well at forward simulation, predicting outputs from inputs, they struggle to identify the minimal causal inputs of outcomes. To study this distinction, we define two tasks: trace simulation (SIM), which requires models to simulate system execution, and minimal causal attribution (MIN), which identifies the minimal set of inputs necessary for a given outcome. We introduce TempoBench, the first formally verified benchmark for temporal causal reasoning, built from synthesized Mealy machines with controllable complexity and provably correct causal labels. Across frontier models, we observe that despite achieving up to 96% accuracy on the SIM task, performance on the causal attribution MIN task drops below 25%; models fail to reason about causal necessity. Over 94% of causal errors involve overspecification, where models perform retrieval and list all possible inputs rather than reasoning about the minimal causal subset. Fine-tuning on the TempoBench training corpus improves causal reasoning and generalizes better than math, code, or instruction training, with gains across standard reasoning benchmarks.
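The gap between the two tasks can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not the benchmark's construction: the three-state Mealy machine is hypothetical, and "minimal causal inputs" is read here as the smallest set of time steps whose observed inputs force the final output no matter what the other inputs are, which may differ from the paper's formal definition.

```python
from itertools import combinations, product

# Hypothetical Mealy machine: TRANSITIONS[(state, input)] = (next_state, output).
# It emits "alarm" from the second "a" input onward.
TRANSITIONS = {
    ("s0", "a"): ("s1", "ok"),
    ("s0", "b"): ("s0", "ok"),
    ("s1", "a"): ("s2", "alarm"),
    ("s1", "b"): ("s1", "ok"),
    ("s2", "a"): ("s2", "alarm"),
    ("s2", "b"): ("s2", "alarm"),
}
ALPHABET = ("a", "b")

def simulate(inputs, state="s0"):
    """Trace simulation (SIM-style): replay inputs and collect the outputs."""
    outputs = []
    for symbol in inputs:
        state, out = TRANSITIONS[(state, symbol)]
        outputs.append(out)
    return outputs

def minimal_forcing_positions(inputs, outcome):
    """Attribution (MIN-style, under the assumption above): brute-force the
    smallest set of positions that forces the final output to `outcome`."""
    n = len(inputs)
    for size in range(n + 1):
        for kept in combinations(range(n), size):
            free = [i for i in range(n) if i not in kept]
            forced = True
            for alt in product(ALPHABET, repeat=len(free)):
                trial = list(inputs)
                for i, symbol in zip(free, alt):
                    trial[i] = symbol
                if simulate(trial)[-1] != outcome:
                    forced = False
                    break
            if forced:
                return kept
    return None  # the outcome does not match the observed trace

trace = ["a", "b", "a", "b"]
print(simulate(trace))                            # ['ok', 'ok', 'alarm', 'alarm']
print(minimal_forcing_positions(trace, "alarm"))  # (0, 2)
```

Answering with all four positions would be the overspecification error the abstract describes: the two "b" inputs play no role in producing the alarm.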

2602.03224 2026-06-09 cs.AI cs.LG updated version

TAME: A Trustworthy Test-Time Evolution of Agent Memory with Systematic Benchmarking

Yu Cheng, Yongkang Hu, Jiuan Zhou, Yushuo Zhang, Yihang Chen, Huichi Zhou, Mingang Chen, Zhizhong Zhang, Kun Shao, Yuan Xie, Zhaoxia Yin

Affiliations: East China Normal University; Shanghai Innovation Institute; Shanghai Key Laboratory of Computer Software Evaluating and Testing; Huawei Noah’s Ark Lab

AI summary: Proposes the TAME framework, which evolves memory in a trust-aware way through an executor-evaluator loop to counter the decline in agent trustworthiness during benign task evolution, improving accuracy by 14.6 percentage points on GPT-5.2 AIME.

Abstract

Test-time evolution of agent memory represents a pivotal paradigm for advancing AGI, as it strengthens complex reasoning through experience accumulation without requiring parameter updates. However, even during benign task evolution, agent safety alignment remains vulnerable, a phenomenon known as Agent Memory Misevolution. To evaluate this phenomenon, we construct the Trust-Memevo benchmark and find that agents exhibit an overall decline in trustworthiness across multiple tasks during benign task evolution. To address this issue, we propose TAME, a trust-aware memory evolution framework in which a shared memory bank is jointly governed by an Executor and an Evaluator. The Executor retrieves and applies transferable experiences to support task solving, while the Evaluator assesses the contribution of each utilized experience to the outcome and produces trust-aware feedback to guide subsequent memory use. This executor-evaluator loop enables memory to be selectively reinforced, cautiously reused, and continuously expanded over time. Experiments show that TAME mitigates memory misevolution while achieving strong task performance. In particular, on the GPT-5.2 AIME benchmark, TAME improves accuracy by 14.6 percentage points over the strongest existing method and maintains competitive trustworthiness.

2605.23965 2026-06-09 cs.AI cs.LG cs.SE updated version

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

Zenghui Zhou, Man Li, Xiaoke Fang, Xinyi Zhou, Weibin Lin, Zheng Zheng

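Affiliations: School of Automation Science and Electrical Engineering, Beihang University

AI summary: Proposes the LGMT framework, which derives metamorphic relations from first-order logic and evaluates the robustness of LLM reasoning through consistency checking, exposing hidden defects that traditional evaluations miss.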

Comments Zheng Zheng is the corresponding author

Abstract

Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that leverages first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.
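The core mechanism, deriving a follow-up test case from a logical equivalence and flagging any change in the model's verdict, can be sketched in a few lines. This is a simplified illustration, not the LGMT implementation: it uses propositional logic rather than first-order logic, and the `brittle_reasoner` stub is a hypothetical stand-in for an LLM.

```python
from itertools import product

def equivalent(f, g, n_vars):
    """True if two Boolean functions agree on every truth assignment."""
    return all(f(*vals) == g(*vals)
               for vals in product((False, True), repeat=n_vars))

def implication(p, q):
    return (not p) or q          # p -> q

def contrapositive(p, q):
    return q or (not p)          # (not q) -> (not p), simplified

def converse(p, q):
    return (not q) or p          # q -> p, which is NOT equivalent to p -> q

def consistency_violations(reasoner, case_pairs):
    """Pairs where the verdict differs between a source case and its
    logically equivalent follow-up; no ground-truth label is needed."""
    return [(src, follow) for src, follow in case_pairs
            if reasoner(src) != reasoner(follow)]

def brittle_reasoner(text):
    """Hypothetical stub: accepts a statement unless it contains a negation."""
    return "not" not in text.split()

pairs = [("If it rains, the ground is wet.",
          "If the ground is not wet, it does not rain.")]

print(equivalent(implication, contrapositive, 2))           # True
print(equivalent(implication, converse, 2))                 # False
print(len(consistency_violations(brittle_reasoner, pairs))) # 1
```

Only equivalences that pass the truth-table check are valid metamorphic relations; using the converse as a follow-up would report false defects.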

2605.25624 2026-06-09 cs.AI cs.LG updated version

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu, Tianbao Xie, Shuai Bai, Dayiheng Liu, Que Shen, Junyang Lin, Tao Yu

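Affiliations: The University of Hong Kong; Qwen Team, Alibaba Inc.; University of California, San Diego; Tsinghua University

AI summary: Proposes CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions to build large-scale verifiable RL training data, and synthesizes the CUA-Gym-Hub mock web application environments; the trained agents achieve leading performance on OSWorld-Verified and WebArena.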

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires consistent task instructions, executable environments, and verifiable rewards. However, hand-curated benchmarks achieve high reward fidelity but cover few applications, while LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds upon execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by an order of magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.

2606.01869 2026-06-09 cs.AI 版本更新

WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis

WorldCoder-Bench:物理接地3D世界合成基准

Shuo Lu, Yinuo Xu, Kecheng Yu, Siru Jiang, Yongcan Yu, Yubin Wang, Haitao Yang, Yuxiang Zhang, Bin Wang, Ran He, Jian Liang

发表机构 * NLPR & MAIS, CASIA(中国科学院自动化研究所与模式识别国家重点实验室) Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 提出WorldCoder-Bench基准,通过StateProbe协议评估LLM生成Three.js 3D世界的物理正确性和交互可靠性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被要求不仅编写静态界面,还要从自然语言构建可执行的交互式世界。浏览器原生3D(通常使用Three.js构建)是下一个自然前沿:生成的程序必须集成资源、遵守空间和物理约束,并保持面向用户的控件与隐藏的运行时状态同步。然而,现有的网络生成基准和评估器主要只观察像素或DOM节点,而Three.js世界的机制在不透明的<canvas>内部展开。我们引入了WorldCoder-Bench,一个用于自主、物理接地3D世界合成的基准。WorldCoder-Bench包含2026个专家策划的任务,涵盖模拟、渲染和应用场景,带有可选的.glb资源和隐藏的行为契约。我们进一步提出了StateProbe,一种基于执行的协议,在沙盒浏览器中探测生成的程序,并验证运行时状态和转换上的隐藏、变异硬化契约。除了验证覆盖率,我们报告了自动化回报和时间效率乘数,以衡量正确性调整的成本和时间节省。在九个前沿模型中,最佳系统在WorldCoder-Core上仅达到27.8%的验证覆盖率,在WorldCoder-Robust上达到19.9%,失败主要由状态模式漂移和交互链断裂主导,而非缺失场景元素。效用指标进一步表明,廉价或快速的模型在较简单的领域仍能提供显著价值。WorldCoder-Bench可在https://anonymous.4open.science/r/WorldCoder-Bench/获取。

英文摘要

Large language models (LLMs) are increasingly asked not only to write static interfaces, but to construct executable interactive worlds from natural language. Browser-native 3D, commonly built with Three.js, is a natural next frontier: generated programs must integrate assets, obey spatial and physical constraints, and keep user-facing controls synchronized with hidden runtime state. Existing web-generation benchmarks and evaluators, however, largely observe only pixels or DOM nodes, while the mechanics of a Three.js world unfold inside an opaque <canvas>. We introduce WorldCoder-Bench, a benchmark for autonomous, physically grounded 3D world synthesis. WorldCoder-Bench contains 2,026 expert-curated tasks across Simulation, Rendering, and Application scenarios, with optional .glb assets and hidden behavioral contracts. We further propose StateProbe, an execution-based protocol that probes generated programs in a sandboxed browser and verifies hidden, mutation-hardened contracts over runtime states and transitions. Beyond verification coverage, we report Return on Automation and Time Efficiency Multiplier to measure correctness-adjusted cost and time savings. Across nine frontier models, the best system reaches only 27.8% verification coverage on WorldCoder-Core and 19.9% on WorldCoder-Robust, with failures dominated by state-schema drift and broken interaction chains rather than missing scene elements. Utility metrics further show that cheap or fast models can still provide substantial value on easier domains. WorldCoder-Bench is available at https://anonymous.4open.science/r/WorldCoder-Bench/.

2606.05872 2026-06-09 cs.AI cs.CV 版本更新

Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns

基于熵的AI智能体评估:一种测量行为模式的轻量级框架

Olasimbo Ayodeji Arigbabu

发表机构 * Olasimbo Ayodeji Arigbabu(奥拉西姆波·阿里加布)

AI总结 提出一种基于熵的轻量级评估框架(EEA),通过动作熵、轨迹熵、工具熵、信息增益、探索效率和鲁棒性熵等指标,从决策过程结构角度补充传统任务成功率等评估方法。

Comments 6 pages, 2 Tables

详情
AI中文摘要

AI智能体通常使用任务成功率、奖励、延迟和成本进行评估。这些指标很有用,但常常忽略了智能体行为的重要方面:智能体是否过度探索、是否过于僵化地重复自身、是否有效使用工具、是否随时间减少不确定性、或者在多次运行中保持鲁棒性。本文提出基于熵的AI智能体评估(EEA),一种通过熵来测量智能体行为的轻量级框架。EEA不将智能仅视为最终任务完成,而是研究智能体决策过程的结构。该框架引入了动作熵、轨迹熵、工具熵、信息增益、探索效率和鲁棒性熵。这些指标旨在补充而非取代传统评估方法。我们还提供了一个实用的Python实现,旨在与LangChain、Google ADK、自定义智能体循环以及存储的可观测性轨迹等智能体框架集成。

英文摘要

AI agents are commonly evaluated using task success, reward, latency, and cost. These metrics are useful, but they often miss important aspects of agent behavior: whether an agent explores too much, repeats itself too rigidly, uses tools effectively, reduces uncertainty over time, or remains robust across repeated runs. This paper proposes Entropy-Based Evaluation of AI Agents (EEA), a lightweight framework for measuring agent behavior through entropy. Rather than treating intelligence as only final task completion, EEA studies the structure of the agent's decision process. The framework introduces action entropy, trajectory entropy, tool entropy, information gain, exploration efficiency, and robustness entropy. These metrics are intended to complement, not replace, traditional evaluation methods. We also present a practical Python implementation designed to integrate with agent frameworks such as LangChain, Google ADK, custom agent loops, and stored observability traces.
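The abstract does not define its metrics, so the following is only a minimal sketch of what an "action entropy" could look like, assuming it is the Shannon entropy of the empirical distribution over the actions an agent took in one run. The function names and the example action logs are hypothetical and are not taken from the paper's Python implementation.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution over `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def normalized_entropy(items):
    """Entropy divided by its maximum, log2(number of distinct items); in [0, 1]."""
    distinct = len(set(items))
    if distinct <= 1:
        return 0.0
    return shannon_entropy(items) / math.log2(distinct)


# Hypothetical action logs: one agent repeats a single action, one varies.
rigid_run = ["search"] * 8
varied_run = ["search", "read", "write", "search", "read", "write", "search", "read"]

print("rigid :", shannon_entropy(rigid_run))
print("varied:", shannon_entropy(varied_run), normalized_entropy(varied_run))
```

Under this assumption, a value near zero flags rigid repetition and a value near the maximum flags near-uniform exploration; the same helper could be applied to tool names to approximate a "tool entropy".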

2606.05932 2026-06-09 cs.AI cs.LG 版本更新

A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

RLVR中自我一致性激发与奖励设计的预注册因果分解

Yuze Gao

发表机构 * Outlook.com(Outlook公司)

AI总结 本文通过预注册实验和因果分解方法,证明RLVR中朴素奖励设计估计量存在系统性偏差,并量化了自我一致性激发与真正奖励设计信号的贡献。

Comments 9 pages, 7 figures

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)即使在奖励信号是虚假的情况下也能提升推理能力——即将功劳分配给群体多数答案而非真实验证器。实践者通常将朴素估计量 naive = acc(TRUE) - acc(RANDOM) 解释为奖励设计效应。我们证明该估计量存在系统性偏差:它混淆了自我一致性激发(通过多数伪奖励将策略向众数答案锐化)与真正的奖励设计信号。使用受控的表格GRPO模拟器,我们推导出精确的裂项(telescoping)分解 total = null + elicit + rd,并在五个先验强度水平上测量每个项。朴素估计量中奖励设计占比从弱先验(ps=0.20)时的0.139变化到强先验(ps=0.80)时的0.05,激发项在自我一致性交叉点处符号翻转。一个预注册的2x2x2析因实验证实了非可加性(交互比0.385;AxC效应-0.089)。一个点与界(points-vs-bounds)试点门控表明,强先验区域是点识别的,而接近交叉点的区域仅是有界的。对两个已具名的已发表结果的重新审计分别得出“激发主导”(激发份额0.98)和“奖励设计主导”(rd份额1.18)的结论,证明了该分解的诊断价值。我们预先承诺无论翻转结果如何都提交论文;未出现翻转同样是具有同等地位的发现。我们发布一个可复用的单命令工具,供任何对齐论文运行相同的审计。

英文摘要

Reinforcement learning from verifiable rewards (RLVR) improves reasoning even when the reward signal is spurious -- assigning credit to the group-plurality answer rather than a ground-truth verifier. Practitioners commonly interpret naive = acc(TRUE) - acc(RANDOM) as the reward-design effect. We prove this estimand is systematically biased: it conflates self-consistency elicitation (sharpening the policy toward its modal answer via majority pseudo-reward) with genuine reward-design signal. Using a controlled tabular-GRPO simulator we derive an exact telescoping decomposition total = null + elicit + rd and measure each term across five prior-strength levels. The reward-design fraction of the naive estimator ranges from 0.139 at weak prior (ps=0.20) to 0.05 at strong prior (ps=0.80), with the elicitation term flipping sign at the self-consistency crossover. A pre-registered 2x2x2 factorial confirms non-additivity (interaction ratio 0.385; AxC effect -0.089). A points-vs-bounds pilot gate shows strong-prior regimes are point-identified while near-crossover regimes are only bounded. Re-audits of two named published results yield ELICITATION DOMINATED (elicitation share 0.98) and REWARD DESIGN DOMINATED (rd share 1.18) verdicts respectively, demonstrating the diagnostic value of the partition. We pre-commit to submit regardless of flip outcome; a non-flip is a finding of equal standing. We release a reusable one-command harness for any alignment paper to run the same audit.

2606.06388 2026-06-09 cs.AI cs.CL 版本更新

Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

人类的ALMANAC:用于智能体协作的动作级心智模型标注的人类协作数据集

Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li, Jian Zhao, Tongshuang Wu, Toby Jia-Jun Li, Dakuo Wang, Bingsheng Yao

发表机构 * Northeastern University(东北大学) University of Notre Dame(圣母大学) University of Waterloo(滑铁卢大学) Carnegie Mellon University(卡内基梅隆大学) Adobe(Adobe公司) Microsoft Research Asia(微软亚洲研究院)

AI总结 为解决当前LLM智能体缺乏协作中心智模型能力的问题,构建了基于Map Task的ALMANAC数据集,包含2987个协作动作及其心智模型标注,并评估了六种LLM在预测人类行为和心智模型上的表现。

详情
AI中文摘要

近年来,LLM智能体的进展使其具备了复杂的认知能力,如多步推理、规划和工具使用,这些能力使它们逐渐成为人类的协作者。然而,有效的协作要求协作者在协作过程中持续维护和调整自身推理、伙伴意图和共享目标的心智模型。当前的智能体很少发展这种能力,因为它们主要针对任务完成进行优化,而社区缺乏带有动作级心智模型标注的真实人类协作数据,这些数据可以指导智能体获得过程级的协作能力。为填补这一空白,我们提出了ALMANAC,一个基于社会科学中经典的二元路由任务Map Task构建的动作级心智模型标注数据集。ALMANAC包含2,987个协作动作,每个动作都配有基于理论的心智模型标注,记录了参与者的自我推理、感知的伙伴意图和感知的团队目标。我们评估了六种LLM在预测人类下一轮行为和心智模型方面的表现。我们的结果证明了ALMANAC在评估模型模拟人类协作行为及推断其潜在心智模型方面的实用性。

英文摘要

Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants' self-reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans' next-turn behavior and mental models. Our results demonstrate ALMANAC's utility in evaluating models' ability to simulate human collaborative behaviors and infer their underlying mental models.

2310.10196 2026-06-09 cs.LG cs.AI 版本更新

Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook

时间序列与时空数据的大模型:综述与展望

Ming Jin, Yaxuan Kong, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li, Vincent S. Tseng, Yu Zheng, Lei Chen, Hui Xiong, Shirui Pan, Qingsong Wen

发表机构 * Griffith University(格里菲斯大学) University of Oxford(牛津大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Zhejiang Normal University(浙江师范大学) Ant Group(蚂蚁集团) Alibaba Group(阿里巴巴集团) Deloitte Service LLP(德勤服务有限责任公司) The University of Hong Kong(香港大学) NEC Laboratories America(NEC美国实验室) A*STAR(新加坡科技研究局) National Yang Ming Chiao Tung University(阳明交通大学) JD Technology(京东科技) Squirrel Ai Learning

AI总结 综述了面向时间序列和时空数据的大模型,按数据类型、模型类别、范围和应用领域分类,总结了通用与领域专用模型,并整理了相关资源与开放问题。

Comments Accepted by ACM Computing Surveys; 35 Pages; Github Repo: https://github.com/qingsongedu/Awesome-TimeSeries-SpatioTemporal-LM-LLM

详情
AI中文摘要

时间数据,包括时间序列和时空数据,在现实应用中无处不在。物理和虚拟传感器生成的海量数据记录了动态系统行为,支持各种下游任务。有效分析这些数据对于挖掘其丰富信息至关重要。大型语言模型和其他基础模型的最新进展加速了它们在时间序列和时空数据挖掘中的应用。这些方法不仅提高了跨领域的模式识别和推理能力,还支持了能够理解和处理时间数据的人工通用智能的发展。在本综述中,我们沿着四个维度(数据类型、模型类别、模型范围和应用领域/任务)对针对时间序列和时空数据定制或适配的大模型进行了全面、最新的回顾。我们将现有工作分为两大组:用于时间序列分析的大模型(LM4TS)和用于时空数据挖掘的大模型(LM4STD),并进一步区分通用模型和领域专用模型。我们还整理了相关资源,包括数据集、模型实现和工具,按主要应用领域组织。总体而言,本综述整合了近期进展,并突出了以大型模型为中心的时间数据分析的基础、应用、资源和开放研究机会。

英文摘要

Temporal data, including time series and spatio-temporal data, are pervasive in real-world applications. Generated in massive volumes by physical and virtual sensors, they record dynamic system behaviors and enable a wide range of downstream tasks. Effectively analyzing such data is crucial to unlocking their rich information content. Recent advances in large language models and other foundation models have accelerated their use in time series and spatio-temporal data mining. These approaches not only improve pattern recognition and reasoning across diverse domains but also support progress toward artificial general intelligence that can understand and process temporal data. In this survey, we present a comprehensive, up-to-date review of large models tailored or adapted for time series and spatio-temporal data along four dimensions: data types, model categories, model scopes, and application areas/tasks. We organize existing work into two main groups: large models for time series analysis (LM4TS) and for spatio-temporal data mining (LM4STD), and further distinguish general-purpose from domain-specific models. We also curate related resources, including datasets, model implementations, and tools, organized by major application areas. Overall, this survey consolidates recent advances and highlights foundations, applications, resources, and open research opportunities in large model-centric temporal data analysis.

2502.16584 2026-06-09 cs.SD cs.AI cs.CL cs.MM eess.AS 版本更新

Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound

Audio-FLAN:面向语音、音乐和声音的统一音频理解与生成的指令跟随数据集

Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Xingjian Du, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Inner Mongolia University(内蒙古大学) Beihang University(北京航空航天大学) Queen Mary University of London(伦敦玛丽女王大学) The Chinese University of Hong Kong(香港中文大学) National University of Singapore(新加坡国立大学) University of Surrey(萨里大学) University of Rochester(罗切斯特大学) Independent Researcher(独立研究者)

AI总结 提出Audio-FLAN数据集,包含80种任务和1亿实例,支持统一音频理解与生成的零样本学习。

详情
AI中文摘要

最近音频标记化的进展显著增强了将音频能力集成到大语言模型(LLM)中的能力。然而,音频理解和生成通常被视为不同的任务,阻碍了真正统一的音频-语言模型的发展。虽然指令调优在文本和视觉领域已显示出在改善泛化和零样本学习方面的显著成功,但其在音频领域的应用仍基本未被探索。一个主要障碍是缺乏统一音频理解和生成的全面数据集。为解决这一问题,我们引入了Audio-FLAN,这是一个大规模指令调优数据集,涵盖语音、音乐和声音领域的80种不同任务,包含超过1亿个实例。Audio-FLAN为统一的音频-语言模型奠定了基础,这些模型能够以零样本方式无缝处理跨多种音频领域的理解(如转录、理解)和生成(如语音、音乐、声音)任务。Audio-FLAN数据集可在HuggingFace和GitHub上获取。

英文摘要

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub.

2509.09151 2026-06-09 cs.CV cs.AI cs.LG 版本更新

Video Understanding by Design: How Datasets Shape Video Models

通过设计理解视频:数据集如何塑造视频模型

Lei Wang, Syuan-Hao Li, Piotr Koniusz, Yongsheng Gao

发表机构 * School of Engineering and Built Environment, Electrical and Electronic Engineering, Griffith University(工程与建筑环境学院,电气与电子工程学院,格里菲斯大学) School of Computer Science and Engineering, University of New South Wales(计算机科学与工程学院,新南威尔士大学)

AI总结 本文从数据集视角出发,提出统一框架连接数据集结构、归纳偏差与架构设计,分析数据集特性如何驱动视频理解架构创新,并讨论不同数据体制下的表征偏差。

Comments Research report

详情
AI中文摘要

视频理解研究因日益多样化的数据集和更强大的模型架构而快速发展。现有综述通常按任务、基准或模型家族组织进展,但对特定架构为何出现并成功提供的见解有限。本文认为,视频理解的演进根本上由数据集结构塑造。我们提出一个以数据集为中心的视角,在统一框架内连接数据集结构、归纳偏差和架构设计。我们表明,不同数据集要求模型捕获特定的不变性和能力,例如对视角变化的鲁棒性、对时间顺序的敏感性、长程依赖推理、关系交互和跨模态对齐。这些需求自然产生归纳偏差,即有利于特定推理和泛化模式的架构假设。从这一视角看,里程碑式架构,包括双流网络、3D CNN、时序模型、Transformer、基于图的方法和多模态基础模型,可理解为对演进数据集所带来挑战的架构响应。基于此框架,我们系统分析了数据集特性如何塑造视频理解任务中的架构创新,并讨论了不同数据体制引发的表征偏差。通过将数据集、归纳偏差和架构统一为一个连贯视角,本综述既提供了对领域演进的回顾性解释,也提供了通向通用视频理解系统的前瞻性路线图。代码和数据集诱导偏差的动态视频可视化见 https://time.griffith.edu.au/paper-sites/video-understanding/。

英文摘要

Research in video understanding has advanced rapidly, driven by increasingly diverse datasets and more powerful model architectures. While existing surveys typically organize progress by tasks, benchmarks, or model families, they provide limited insight into why particular architectures emerged and succeeded. In this survey, we argue that the evolution of video understanding is fundamentally shaped by dataset structure. We present a dataset-centric perspective that connects dataset structure, inductive biases, and architectural design within a unified framework. We show that different datasets require models to capture specific invariances and capabilities, such as robustness to viewpoint changes, sensitivity to temporal ordering, reasoning over long-range dependencies, relational interactions, and cross-modal alignment. These requirements naturally give rise to inductive biases, i.e., architectural assumptions that favor particular patterns of reasoning and generalization. From this perspective, milestone architectures, including two-stream networks, 3D CNNs, temporal models, transformers, graph-based methods, and multimodal foundation models, can be understood as architectural responses to the challenges posed by evolving datasets. Building on this framework, we systematically analyze how dataset characteristics have shaped architectural innovation across video understanding tasks and discuss the representational biases induced by different data regimes. By unifying datasets, inductive biases, and architectures into a coherent perspective, this survey offers both a retrospective explanation of the field's evolution and a forward-looking roadmap toward general-purpose video understanding systems. Code and dynamic video visualizations of dataset-induced biases are available at https://time.griffith.edu.au/paper-sites/video-understanding/.

2509.22097 2026-06-09 cs.SE cs.AI cs.CL cs.CR 版本更新

SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios

SecureVibeBench:通过重建漏洞引入场景对AI智能体的安全氛围编程(Vibe Coding)进行基准测试

Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi, Chengran Yang, Ting Zhang, Haoye Tian, Yikun Li, Zhenhao Li, Xin Zhou, Xing Hu, David Lo

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出SecureVibeBench,一个包含105个C/C++安全编码任务的基准测试,通过重建人类开发者实际引入漏洞的场景来评估AI智能体生成安全代码的能力;表现最好的智能体也仅有23.8%的解答既正确又安全。

Comments ACL 2026 Main Conference. Our code and data are on https://github.com/iCSawyer/SecureVibeBench

详情
AI中文摘要

大型语言模型驱动的代码智能体正在迅速改变软件工程,但其生成代码的安全风险已成为关键问题。现有基准测试提供了有价值的见解,但未能捕捉到由人类开发者实际引入漏洞的场景,使得人类与智能体之间的公平比较不可行。因此,我们引入SecureVibeBench,一个面向代码智能体的基准测试,包含来自OSS-Fuzz的41个项目中的105个C/C++安全编码任务。SecureVibeBench具有以下特点:(i)现实的任务设置,要求在大型仓库中进行多文件编辑;(ii)基于真实世界开源漏洞对齐的上下文,具有精确标识的漏洞引入点;(iii)全面的评估,结合功能测试和安全检查,同时使用静态和动态判定器(oracle)。我们在SecureVibeBench上评估了5种流行的代码智能体(如OpenHands),它们由5种LLM(如Claude Sonnet 4.5)驱动。结果表明,当前智能体难以生成既正确又安全的代码,即使表现最好的智能体,在SecureVibeBench上也只能产生23.8%的正确且安全的解决方案。我们的代码和数据见 https://github.com/iCSawyer/SecureVibeBench。

英文摘要

Large language model-powered code agents are rapidly transforming software engineering, yet the security risks of their generated code have become a critical concern. Existing benchmarks have provided valuable insights, but they fail to capture scenarios in which vulnerabilities are actually introduced by human developers, making fair comparisons between humans and agents infeasible. We therefore introduce SecureVibeBench, a benchmark of 105 C/C++ secure coding tasks sourced from 41 projects in OSS-Fuzz for code agents. SecureVibeBench has the following features: (i) realistic task settings that require multi-file edits in large repositories, (ii) aligned contexts based on real-world open-source vulnerabilities with precisely identified vulnerability introduction points, and (iii) comprehensive evaluation that combines functionality testing and security checking with both static and dynamic oracles. We evaluate 5 popular code agents, such as OpenHands, supported by 5 LLMs (e.g., Claude Sonnet 4.5) on SecureVibeBench. Results show that current agents struggle to produce code that is both correct and secure: even the best-performing one produces merely 23.8% correct and secure solutions on SecureVibeBench. Our code and data are on https://github.com/iCSawyer/SecureVibeBench.

2511.18676 2026-06-09 cs.CV cs.AI 版本更新

MedVision: Benchmarking Quantitative Medical Image Analysis

MedVision:定量医学图像分析的基准测试

Yongcheng Yao, Yongshuo Zong, Raman Dutt, Yongxin Yang, Sotirios A Tsaftaris, Timothy Hospedales

发表机构 * University of Edinburgh(爱丁堡大学) Queen Mary University of London(伦敦大学玛丽女王学院)

AI总结 针对当前医学视觉语言模型缺乏定量推理能力的问题,提出MedVision数据集和基准,涵盖22个公共数据集、3080万图像-标注对,通过监督和强化微调显著提升检测、肿瘤/病变大小估计和角度/距离测量性能。

Comments 22 pages, 13 figures, 14 tables

详情
AI中文摘要

当前医学领域的视觉-语言模型(VLM)主要用于分类问答(如“这是正常还是异常?”)或定性描述任务。然而,临床决策通常依赖于定量评估,例如测量肿瘤大小或关节角度,医生据此得出自己的诊断结论。这种定量推理能力在现有VLM中尚未得到充分探索和支持。在这项工作中,我们引入了MedVision,这是一个专门设计用于评估和改进VLM在定量医学图像分析中的大规模数据集和基准。MedVision涵盖22个公共数据集,涉及多种解剖结构和模态,包含3080万个图像-标注对。我们聚焦于三个代表性的定量任务:(1)解剖结构和异常检测,(2)肿瘤/病变(T/L)大小估计,以及(3)角度/距离(A/D)测量。我们表明,当前现成的VLM在这些任务上表现不佳。然而,在MedVision上进行监督和强化微调显著提升了检测、T/L估计和A/D测量的性能。MedVision为开发具有稳健定量推理能力的医学图像分析VLM奠定了基础。

英文摘要

Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., "Is this normal or abnormal?") or qualitative descriptive tasks. However, clinical decision-making often relies on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work, we introduce MedVision, a large-scale dataset and benchmark specifically designed to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities, with 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance (A/D) measurement. We show that current off-the-shelf VLMs perform poorly on these tasks. However, supervised and reinforcement fine-tuning on MedVision significantly enhances performance across detection, T/L estimation, and A/D measurement. MedVision provides a foundation for developing VLMs with robust quantitative reasoning capabilities in medical imaging.

2601.02424 2026-06-09 cond-mat.mtrl-sci cs.AI 版本更新

A large-scale nanocrystal database with aligned synthesis and properties enabling generative inverse design

大规模纳米晶体数据库:对齐合成与性质实现生成式逆向设计

Kai Gu, Yingping Liang, Senliang Peng, Aotian Guo, Haizheng Zhong, Ying Fu

发表机构 * MIIT Key Laboratory for Low-Dimensional Quantum Structure and Devices, School of Materials Sciences & Engineering, Beijing Institute of Technology(工业和信息化部低维量子结构与器件重点实验室,材料科学与工程学院,北京理工大学) School of Computer Science and Technology, Beijing Institute of Technology(计算机科学与技术学院,北京理工大学)

AI总结 构建大规模对齐的纳米晶体合成-性质数据库,开发基于大语言模型的NanoExtractor提取文献数据,并利用NanoDesigner实现生成式逆向设计,成功设计PbSe和MgF2纳米晶体的合成路线。

详情
AI中文摘要

由于合成参数与物理化学性质之间的复杂相关性,纳米晶体的合成高度依赖于试错法。尽管深度学习为生成式逆向设计提供了潜在方法,但缺乏对齐纳米晶体合成路线与其性质的高质量数据集仍阻碍其发展。本文介绍了一个大规模、对齐的纳米晶体合成-性质(NSP)数据库的构建,并展示了其用于生成式逆向设计的能力。为了从文献中提取结构化的合成路线及其对应的产物性质,我们开发了NanoExtractor,这是一个通过精心设计的增强策略增强的大语言模型(LLM)。NanoExtractor经过人类专家验证,在测试集上达到88%的加权平均分,显著优于化学专用(3%)和通用LLM(38%)。生成的NSP数据库包含近16万条对齐条目,并作为我们的NanoDesigner(一个用于逆向合成设计的LLM)的训练数据。NanoDesigner的生成能力通过成功设计成熟的PbSe纳米晶体和罕见的MgF2纳米晶体的可行合成路线得到验证。值得注意的是,模型为MgF2纳米晶体推荐了反直觉的非化学计量前驱体比例(1:1),实验证实该比例对抑制副产物至关重要。我们的工作弥合了非结构化文献与数据驱动合成之间的差距,并建立了一个强大的人机协作范式,以加速纳米晶体的发现。

英文摘要

The synthesis of nanocrystals has been highly dependent on trial-and-error, due to the complex correlation between synthesis parameters and physicochemical properties. Although deep learning offers a potential methodology to achieve generative inverse design, it is still hindered by the scarcity of high-quality datasets that align nanocrystal synthesis routes with their properties. Here, we present the construction of a large-scale, aligned Nanocrystal Synthesis-Property (NSP) database and demonstrate its capability for generative inverse design. To extract structured synthesis routes and their corresponding product properties from literature, we develop NanoExtractor, a large language model (LLM) enhanced by well-designed augmentation strategies. NanoExtractor is validated against human experts, achieving a weighted average score of 88% on the test set, significantly outperforming chemistry-specialized (3%) and general-purpose LLMs (38%). The resulting NSP database contains nearly 160,000 aligned entries and serves as training data for our NanoDesigner, an LLM for inverse synthesis design. The generative capability of NanoDesigner is validated through the successful design of viable synthesis routes for both well-established PbSe nanocrystals and rarely reported MgF2 nanocrystals. Notably, the model recommends a counter-intuitive, non-stoichiometric precursor ratio (1:1) for MgF2 nanocrystals, which is experimentally confirmed as critical for suppressing byproducts. Our work bridges the gap between unstructured literature and data-driven synthesis, and also establishes a powerful human-AI collaborative paradigm for accelerating nanocrystal discovery.

2601.06649 2026-06-09 cs.LG cs.AI 版本更新

Revisiting Training Scale: An Empirical Study of Token Count, Power Consumption, and Parameter Efficiency

重新审视训练规模:关于令牌计数、功耗和参数效率的实证研究

Joe Dwyer

发表机构 * ECPI University(ECPI大学)

AI总结 通过固定硬件和训练条件的重复测量实验,发现增加训练令牌数会导致训练效率严格单调下降,即使性能有边际提升,也表明能耗效率低下。

详情
AI中文摘要

机器学习研究质疑了训练令牌数的增加是否能在大型语言模型中可靠地产生比例性能提升。基于先前引入能量感知参数效率度量的工作,本研究实证检验了在固定硬件和训练条件下增加训练令牌数的影响。本工作的重要性在于将功耗和执行时长(如功率采样频率所反映的)明确整合到令牌规模分析中,这解决了先前研究强调性能结果而低估计算和能量成本的空白。通过在恒定GPU实例上使用相同模型架构、优化器设置和轮次数的重复测量实验设计,训练了一个11亿参数的TinyLlama模型,使用三个令牌数(500K、1M和2M)。虽然传统性能指标在令牌规模上表现出不一致或递减的回报,但包含功耗和执行时长后,揭示了随着令牌数增加,训练效率严格单调下降。重复测量方差分析表明令牌数对参数效率有强效应,所有配对比较在Bonferroni校正后仍然显著。这些发现表明,即使观察到边际性能提升,增加训练令牌数可能在能量上效率低下,强调了在大型语言模型训练中效率感知评估的重要性。

英文摘要

Research in machine learning has questioned whether increases in training token counts reliably produce proportional performance gains in large language models. Building on prior work introducing an energy-aware parameter efficiency metric, this study empirically examines the effects of increasing training token counts under fixed hardware and training conditions. The significance of this work lies in the explicit integration of power consumption and execution duration, as reflected by the power sampling frequency, into token-scale analysis. This addresses a gap in prior studies emphasizing performance outcomes while underrepresenting computational and energy costs. Using a repeated-measures experimental design on a constant GPU instance with an identical model architecture, optimizer settings, and epoch counts, a 1.1-billion-parameter TinyLlama model was trained at three token counts (500K, 1M, and 2M). While conventional performance metrics exhibited inconsistent or diminishing returns across token scales, the inclusion of power consumption and execution duration revealed a strictly monotonic decline in training efficiency as token count increased. Repeated-measures ANOVA demonstrated a strong effect of token count on parameter efficiency, with all pairwise comparisons remaining significant following Bonferroni correction. These findings indicate that increases in training token counts may be energetically inefficient even when marginal performance improvements are observed, underscoring the importance of efficiency-aware evaluation in large language model training.
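With three token counts there are three pairwise comparisons, so a Bonferroni correction multiplies each raw p-value by three (capped at 1) before comparing it with the significance level. The sketch below illustrates only that arithmetic; the p-values are placeholders invented for the example, not results from the study, and the 0.05 threshold is an assumption.

```python
from itertools import combinations

# The three training token counts named in the abstract.
TOKEN_COUNTS = ["500K", "1M", "2M"]


def bonferroni(p_values, alpha=0.05):
    """Map each comparison to (adjusted p-value, significant at alpha?)."""
    m = len(p_values)  # number of comparisons in the family
    results = {}
    for comparison, p in p_values.items():
        adjusted = min(1.0, p * m)
        results[comparison] = (adjusted, adjusted <= alpha)
    return results


pairs = list(combinations(TOKEN_COUNTS, 2))  # 3 unordered pairs

# Placeholder raw p-values, NOT taken from the paper.
placeholder_p = dict(zip(pairs, [0.001, 0.004, 0.030]))

for pair, (adjusted, significant) in bonferroni(placeholder_p).items():
    print(pair, round(adjusted, 3), significant)
```

Bonferroni is conservative: it controls the family-wise error rate at the cost of power, which is why comparisons that remain significant after the correction are reported as robust.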

2601.14063 2026-06-09 cs.CL cs.AI cs.CY 版本更新

XCR-Bench: Benchmarking Cross-Cultural Reasoning in LLMs via Culture-Specific Items and Hall's Triad

XCR-Bench:通过文化特定项目和霍尔三元组对大型语言模型进行跨文化推理基准测试

Mohsinul Kabir, Tasnim Ahmed, Md Mezbaur Rahman, Shaoxiong Ji, Hassan Alhuzali, Yuechen Jiang, Jimin Huang, Sophia Ananiadou

发表机构 * Department of Computer Science, National Centre for Text Mining, The University of Manchester(计算机科学系,国家文本挖掘中心,曼彻斯特大学) ELLIS Manchester(曼彻斯特ELLIS) School of Computing, Queen’s University, Ontario, Canada(计算学院,加拿大皇后大学) Computer Science, University of Illinois Chicago(计算机科学,伊利诺伊大学芝加哥分校) ELLIS Institute Finland(芬兰ELLIS研究所) University of Turku(图尔库大学) Department of Computer Science and Artificial Intelligence, Umm Al-Qura University, Makkah, Saudi Arabia(计算机科学与人工智能系,乌姆·阿勒·卡拉大学,麦加,沙特阿拉伯)

AI总结 提出XCR-Bench基准,包含4.1k平行句和1098个文化特定项目,结合Newmark框架与霍尔三元组评估LLM跨文化推理,发现模型在深层文化层面表现显著下降,且存在区域和民族宗教偏见。

Comments Under Review

详情
AI中文摘要

大型语言模型(LLM)的跨文化能力需要理解并适应不同文化背景下的文化特定项目(CSI)。然而,由于缺乏高质量且带有平行跨文化句子对的CSI标注语料库,评估该能力的进展仍然有限。我们引入了XCR-Bench,一个跨文化推理基准,包含4.1k个平行句子和1,098个CSI,涵盖三个推理任务。XCR-Bench将Newmark的CSI框架与霍尔文化三元组相结合,从而能够评估从可观察实践到隐性社会规范和价值观等不同文化可见性层面的能力。对八个多语言LLM的实验表明,最先进的模型在识别和适应特定CSI类别方面表现出持续的弱点,揭示了表面召回与显式文化推理之间的差距。在文化敏感类别和更深文化层面上,性能显著下降(p<0.005,8/8模型),并且适应质量在不同目标文化和孟加拉语区域变体之间系统性变化,表明即使在单一语言环境中也存在编码的区域和民族宗教偏见。我们公开发布语料库和代码,以支持未来跨文化NLP的研究。

英文摘要

Cross-cultural competence in large language models (LLMs) requires understanding and adapting Culture-Specific Items (CSIs) across varying cultural contexts. However, progress in evaluating this capability remains limited by the lack of high-quality CSI-annotated corpora with parallel cross-cultural sentence pairs. We introduce XCR-Bench, a Cross(X)-Cultural Reasoning Benchmark containing 4.1k parallel sentences and 1,098 CSIs across three reasoning tasks. XCR-Bench integrates Newmark's CSI framework with Hall's Triad of Culture, enabling evaluation across levels of cultural visibility -- from observable practices to implicit social norms and values. Experiments on eight multilingual LLMs show that state-of-the-art models exhibit consistent weaknesses in identifying and adapting specific categories of CSIs, revealing a gap between surface-level recall and explicit cultural reasoning. Performance declines significantly on culturally sensitive categories and deeper cultural levels (p<0.005, 8/8 models), and adaptation quality varies systematically across target cultures and Bengali regional variants, indicating encoded regional and ethno-religious biases even within a single linguistic setting. We publicly release the corpus and code to support future research on cross-cultural NLP.

2601.22859 2026-06-09 cs.SE cs.AI 版本更新

MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

MEnvAgent:可扩展的多语言环境构建用于可验证软件工程

Chuanzhe Guo, Jingjing Wu, Sijun He, Yang Chen, Zhaoqi Kuang, Shilong Fan, Bingjin Chen, Siqi Bao, Jing Liu, Hua Wu, Qingfu Zhu, Wanxiang Che, Haifeng Wang

发表机构 * Tsinghua University(清华大学)

AI总结 提出MEnvAgent框架,通过多智能体规划-执行-验证架构和环境复用机制,自动构建多语言可执行环境,生成可验证任务实例,在10种语言1000个任务上提升F2P率8.6%并降低时间成本43%。

Comments Accepted as a Spotlight Paper at ICML 2026

详情
AI中文摘要

大型语言模型(LLM)智能体在软件工程(SWE)领域的发展受到可验证数据集稀缺的制约,这一瓶颈源于跨不同语言构建可执行环境的复杂性。为解决此问题,我们提出MEnvAgent,一个用于自动环境构建的多语言框架,支持可验证任务实例的可扩展生成。MEnvAgent采用多智能体规划-执行-验证架构自主解决构建失败问题,并集成了一种新颖的环境复用机制,通过增量修补历史环境来减少计算开销。在MEnvBench(一个包含10种语言1000个任务的新基准)上的评估表明,MEnvAgent优于基线方法,将失败到通过(F2P)率提高了8.6%,同时将时间成本降低了43%。此外,我们通过构建MEnvData-SWE(迄今为止最大的开源多语言真实可验证Docker环境数据集)以及解决方案轨迹,展示了MEnvAgent的实用性,这些轨迹使得各种模型在SWE任务上能够获得一致的性能提升。我们的代码、基准和数据集可在以下网址获取:https://github.com/ernie-research/MEnvAgent。

英文摘要

The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi-language framework for automated Environment construction that facilitates scalable generation of verifiable task instances. MEnvAgent employs a multi-agent Planning-Execution-Verification architecture to autonomously resolve construction failures and integrates a novel Environment Reuse Mechanism that reduces computational overhead by incrementally patching historical environments. Evaluations on MEnvBench, a new benchmark comprising 1,000 tasks across 10 languages, demonstrate that MEnvAgent outperforms baselines, improving Fail-to-Pass (F2P) rates by 8.6% while reducing time costs by 43%. Additionally, we demonstrate the utility of MEnvAgent by constructing MEnvData-SWE, the largest open-source polyglot dataset of realistic verifiable Docker environments to date, alongside solution trajectories that enable consistent performance gains on SWE tasks across a wide range of models. Our code, benchmark, and dataset are available at https://github.com/ernie-research/MEnvAgent.
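The abstract reports Fail-to-Pass (F2P) rates without defining the check. A common reading, assumed here, is that an environment is verifiable when at least one test fails before the reference patch, passes after it, and no previously passing test regresses. The function names and test names below are hypothetical and do not come from the MEnvAgent code base.

```python
def fail_to_pass(before, after):
    """Names of tests that failed before the patch and pass after it."""
    return sorted(
        name for name, passed in after.items()
        if passed and before.get(name) is False
    )


def is_verifiable(before, after):
    """True if some test flips fail -> pass and no passing test regresses."""
    regressions = [
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    ]
    return bool(fail_to_pass(before, after)) and not regressions


# Hypothetical per-test outcomes (True = pass) in one built environment.
before_patch = {"test_parse": False, "test_io": True, "test_cli": True}
after_patch = {"test_parse": True, "test_io": True, "test_cli": True}

print(fail_to_pass(before_patch, after_patch))
print(is_verifiable(before_patch, after_patch))
```

The regression check is a design choice of this sketch: it rejects environments whose patch fixes the target test by breaking another one.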

2602.15327 2026-06-09 cs.LG cs.AI cs.CL stat.ML 版本更新

Prescriptive Scaling Reveals the Evolution of Language Model Capabilities

规范性缩放揭示语言模型能力的演变

Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade

发表机构 * Harvard University(哈佛大学) Stanford University(斯坦福大学)

AI总结 通过大规模观测评估和分位数回归,提出规范性缩放定律,将预训练计算预算映射到下游准确率,并验证其时间稳定性,引入平衡I-最优采样算法降低评估成本。

Comments ICML 2026 Oral. Blog Post: https://jkjin.com/prescriptive-scaling

详情
AI中文摘要

机器学习模型性能的提升往往源于竞争和应用。针对部署,我们考虑规范性缩放定律:给定预训练计算预算,通过当代后训练实践可获得的下游准确率是多少,以及随着领域发展该映射的稳定性如何?我们使用大规模观测评估,涵盖2022-2026年间六个基准测试的5000个现有和2000个新评估的模型检查点,通过带有单调饱和S型参数化的平滑分位数回归,估计能力边界(即基准分数作为对数预训练FLOPs函数的高条件分位数)。我们通过在早期模型代上拟合并在后续版本上评估来验证时间可靠性:在六个任务中的四个上,分布外覆盖误差低于2%,而数学推理能力边界随时间持续提升。例如,在预算为10^24 FLOPs时,IFEval上的估计可达准确率为0.83,MATH Lvl 5上为0.54。然后我们扩展方法以分析任务相关的饱和性,并探测数学推理任务中与污染相关的偏移。最后,我们引入一种平衡I-最优采样算法,该算法使用约20%的参数计数加权评估预算(某些任务低至5%)恢复接近全数据的前沿,同时保持可比的校准。总之,我们的工作发布了Proteus-2k(最新的模型性能评估数据集),并引入了一种实用方法,将计算预算转化为可靠的性能预期,并监测能力边界随时间的变化。

英文摘要

Machine learning model performance improvements tend to arise from competition and application. For deployment, we consider prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations with 5k existing and 2k newly evaluated model checkpoints spanning 2022-2026 across six benchmarks, we estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre-training FLOPs, via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate temporal reliability by fitting on earlier model generations and evaluating on later releases: across four of six tasks, the out-of-distribution coverage error remains below 2%, while math reasoning exhibits a consistently advancing boundary over time. For instance, at a budget of 10^24 FLOPs, the estimated attainable accuracies are 0.83 on IFEval and 0.54 on MATH Lvl 5. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce a balanced I-optimal sampling algorithm that recovers near-full-data frontiers using roughly 20% of the parameter-count-weighted evaluation budget, as low as 5% on some tasks, while maintaining comparable calibration. Together, our work releases Proteus-2k, the latest model performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries shift across time.
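
To make the idea of a capability boundary concrete, the sketch below pairs a monotone, saturating sigmoid in log-FLOPs with the standard quantile (pinball) loss. It is only an illustration under stated assumptions: the function names and the four-parameter form (`lower`, `upper`, `slope`, `midpoint`) are hypothetical, and the plain pinball loss is used here in place of the smoothed variant the abstract refers to.

```python
import numpy as np


def sigmoid_frontier(log_flops, lower, upper, slope, midpoint):
    """Saturating curve that moves from `lower` to `upper` as compute grows.

    With upper >= lower, taking the absolute slope keeps the curve non-decreasing.
    """
    slope = abs(slope)
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (log_flops - midpoint)))


def pinball_loss(y, y_hat, tau):
    """Quantile (pinball) loss; its expectation is minimized by the tau-th conditional quantile."""
    diff = y - y_hat
    # Under-predictions are weighted by tau, over-predictions by (1 - tau).
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))
```

Choosing a high `tau` (for example 0.95) penalizes scores above the curve far more than scores below it, so the fitted curve tracks the upper envelope of observed benchmark scores rather than their mean.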

2603.13431 2026-06-09 cs.LG cs.AI 版本更新

CHIMERA-Bench: A Benchmark Dataset for Epitope-Specific Antibody Design

CHIMERA-Bench:一种针对表位特异性抗体设计的基准数据集

Mansoor Ahmed, Nadeem Taj, Imdad Ullah Khan, Hemanth Venkateswara, Murray Patterson

发表机构 * Georgia State University(佐治亚州立大学) Georgia Institute of Technology(佐治亚理工学院) University of Engineering and Technology(工程与技术大学) Lahore University of Management Sciences(拉合尔管理科学大学)

AI总结 本文提出CHIMERA-Bench,一个统一的抗体设计基准,包含2922个抗原-抗体复合物数据,测试泛化能力,并评估多种生成方法的通用性。

详情
AI中文摘要

计算抗体设计在过去三年中取得了快速的方法进展,提出了数十种深度生成方法,但该领域缺乏标准化的基准用于公平比较和模型开发。这些方法在不同的SAbDab快照、非重叠测试集和不兼容的指标上进行评估,文献将设计问题分解为多个子任务,没有共同定义。我们引入CHIMERA-Bench(CDR建模与表位引导的重设计),一个围绕单一经典任务构建的统一基准:表位条件下的CDR序列-结构共设计。CHIMERA-Bench提供三个组成部分。第一个是一个经过精心挑选、去重的包含2922个抗体-抗原复合物的数据集,带有表位和互补位(paratope)注释。第二个是一组三个具有生物学动机的划分,测试泛化到未见表位、未见抗原折叠和前瞻性时间目标的能力。第三个是全面的评估协议,包含五个指标组,其中包括新的表位特异性度量。我们基准测试了十一种方法,涵盖六个生成范式,并在所有划分上报告结果。CHIMERA-Bench是该抗体设计问题中同类最大的数据集,允许社区开发和测试新方法,并评估其泛化能力。

英文摘要

Computational antibody design has seen rapid methodological progress, with dozens of deep generative methods proposed in the past three years, yet the field lacks a standardized benchmark for fair comparison and model development. These methods are evaluated on different SAbDab snapshots, non-overlapping test sets, and incompatible metrics, and the literature fragments the design problem into numerous sub-tasks with no common definition. We introduce CHIMERA-Bench (CDR Modeling with Epitope-guided Redesign), a unified benchmark built around a single canonical task: epitope-conditioned CDR sequence-structure co-design. CHIMERA-Bench provides three components. The first is a curated, deduplicated dataset of 2,922 antibody-antigen complexes with epitope and paratope annotations. The second is a set of three biologically motivated splits that test generalization to unseen epitopes, unseen antigen folds, and prospective temporal targets. The third is a comprehensive evaluation protocol with five metric groups, including novel epitope-specificity measures. We benchmark eleven methods spanning six generative paradigms and report results across all splits. CHIMERA-Bench is the largest dataset of its kind for the antibody design problem, allowing the community to develop and test novel methods and evaluate their generalizability.

2603.14342 2026-06-09 cs.CV cs.AI 版本更新

AgroOmni: A Large-Scale Multi-view Agricultural Dataset for Cross-Scale Multimodal Reasoning

AgroOmni:一个大规模多视角农业数据集用于跨尺度多模态推理

Jiarui Zhang, Junqi Hu, Zurong Mai, Yang Liu, Yuhang Chen, Shuohong Lou, Henglian Huang, Hong Cheng, Lingyuan Zhao, Jianxi Huang, Yutong Lu, Haohuan Fu, Juepeng Zheng

发表机构 * Sun Yat-sen University(中山大学) Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) HuanTian Wisdom Technology Co., Ltd.(慧天智慧科技有限公司) China Agricultural University(中国农业大学) Southwest Jiaotong University(西南交通大学) National Supercomputing Center in Shenzhen(深圳国家超算中心) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出AgroOmni数据集,通过288K视觉问答对覆盖56个专业任务类别,解决多视角跨尺度农业多模态推理中的尺度偏差问题,提出AgroNVILA模型在AgroMind基准上达到62.32%的SOTA成绩。

详情
AI中文摘要

现代农业数据来源于多样化的平台,涵盖多个空间尺度,从地面级近距离摄影到无人机(UAV)航空观测和卫星遥感图像。因此,农业多模态推理需要强大的跨尺度空间理解。然而,由于缺乏多视角农业基准数据集,现有多模态大语言模型(MLLMs)表现出严重的地面级偏差,导致农业感知任务中出现尺度混淆,进而引发语义崩溃,例如将农田图像误认为墙壁或地板。为此,我们引入AgroOmni,一个大规模多视角训练语料库,包含288K个视觉问答对,覆盖14种任务类型下的56个专业任务类别,旨在捕捉现代精准农业中的多样化尺度。基于此数据集,我们提出AgroNVILA,其在AgroMind基准上达到62.32%的新SOTA成绩(比GPT-5.2高15.03%),有效缓解了多视角跨尺度差距,实现了整体农业理解。在AgMMU上的诊断评估通过受限的零样本性能,进一步揭示了宏观先验与微观诊断之间的固有异质性。同时,即使最小程度的微调也使AgroNVILA在AgMMU上实现了显著的性能提升,有力地证明了其由AgroOmni赋能的泛化能力。完整的训练脚本已公开在https://anonymous.4open.science/r/AgroOmni-6510。

英文摘要

Modern agricultural data is sourced from diverse platforms and spans multiple spatial scales, ranging from ground-level close-up photography to Unmanned Aerial Vehicle (UAV) aerial observation and satellite remote sensing imagery. Accordingly, agricultural multimodal reasoning demands robust cross-scale spatial understanding. However, due to the lack of multi-view agricultural benchmark datasets, existing multimodal large language models (MLLMs) exhibit severe ground-level bias, which leads to scale confusion and then semantic collapse in agricultural perception tasks, such as misinterpreting farmland imagery as walls or floors. To address this, we introduce AgroOmni, a large-scale multi-view training corpus with 288K Visual Question Answering pairs covering 56 specialized task categories across 14 task types, designed to capture diverse scales in modern precision agriculture. Built on this dataset, we propose AgroNVILA, which achieves a new state-of-the-art of 62.32% on the AgroMind benchmark (+15.03% over GPT-5.2), effectively mitigating the multi-view cross-scale gap for holistic agricultural understanding. Diagnostic evaluations on AgMMU further reveal an inherent heterogeneity between macro-priors and micro-diagnostics through constrained zero-shot performance. Meanwhile, even minimal fine-tuning leads to a dramatic performance gain of AgroNVILA on AgMMU, strongly demonstrating its generalization capability empowered by AgroOmni. Full training scripts are publicly available at https://anonymous.4open.science/r/AgroOmni-6510.

2603.23916 2026-06-09 cs.CV cs.AI 版本更新

DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning

DecepGPT: 基于多文化数据集和鲁棒多模态学习的模式驱动欺骗检测

Jiajian Huang, Dongliang Zhu, Zitong Yu, Hui Ma, Jiayu Zhang, Chunmei Zhu, Xiaochun Cao

发表机构 * Great Bay University(大湾区大学) Wuhan University(武汉大学) Sun Yat-sen University(中山大学)

AI总结 本文提出DecepGPT,通过构建包含结构化线索描述和推理链的推理数据集,发布多文化数据集T4-Deception,并提出SICS和DMC模块,实现多模态欺骗检测的鲁棒学习,实验表明其在领域内和跨领域场景中均取得最佳性能。

Comments 17 pages, 11 figures, 12 tables

详情
AI中文摘要

多模态欺骗检测旨在通过分析音频视觉线索来识别欺骗行为,用于刑侦和安全领域。在高风险环境中,调查人员需要可验证的证据将音频视觉线索与最终决策联系起来,并且需要在不同领域和文化背景下可靠地泛化。然而,现有基准仅提供二元标签而无中间推理线索。数据集也较小,场景覆盖有限,导致捷径学习。我们通过三个贡献解决这些问题:首先,我们通过增强现有基准并添加结构化线索级描述和推理链来构建推理数据集,使模型输出可审计报告。其次,我们发布T4-Deception,一个基于统一的『To Tell The Truth』电视格式在四个国家实施的多文化数据集。该数据集包含1695个样本,是目前最大的非实验室欺骗检测数据集。第三,我们提出两个模块,以在小数据条件下实现鲁棒学习。Stabilized Individuality-Commonality Synergy (SICS) 通过结合可学习的全局先验与样本自适应残差,优化多模态表示,随后通过极性感知调整双向校准表示。Distilled Modality Consistency (DMC) 通过知识蒸馏将模态特定预测与融合的多模态预测对齐,以防止单模态捷径学习。在三个已建立的基准和我们新的数据集上的实验表明,我们的方法在领域内和跨领域场景中均取得最佳性能,同时在不同文化背景下表现出优越的迁移能力。数据集和代码将被发布。

英文摘要

Multimodal deception detection aims to identify deceptive behavior by analyzing audiovisual cues for forensics and security. In these high-stakes settings, investigators need verifiable evidence connecting audiovisual cues to final decisions, along with reliable generalization across domains and cultural contexts. However, existing benchmarks provide only binary labels without intermediate reasoning cues. Datasets are also small with limited scenario coverage, leading to shortcut learning. We address these issues through three contributions. First, we construct reasoning datasets by augmenting existing benchmarks with structured cue-level descriptions and reasoning chains, enabling models to output auditable reports. Second, we release T4-Deception, a multicultural dataset based on the unified "To Tell The Truth" television format implemented across four countries. With 1695 samples, it is the largest non-laboratory deception detection dataset. Third, we propose two modules for robust learning under small-data conditions. Stabilized Individuality-Commonality Synergy (SICS) refines multimodal representations by synergizing learnable global priors with sample-adaptive residuals, followed by a polarity-aware adjustment that bi-directionally recalibrates representations. Distilled Modality Consistency (DMC) aligns modality-specific predictions with the fused multimodal predictions via knowledge distillation to prevent unimodal shortcut learning. Experiments on three established benchmarks and our novel dataset demonstrate that our method achieves state-of-the-art performance in both in-domain and cross-domain scenarios, while exhibiting superior transferability across diverse cultural contexts. The datasets and codes will be released.

2604.06210 2026-06-09 cs.CL cs.AI cs.CY cs.LG 版本更新

Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

基于价值码本的LLM文化价值对齐的分布性开放式评估

Jaehyeok Lee, Xiaoyuan Yi, Jing Yao, Hyunjin Hwang, Roy Ka-Wei Lee, Xing Xie, JinYeong Bak

发表机构 * KAIST(韩国科学技术院)

AI总结 提出DOVE框架,通过率失真变分优化构建价值码本,利用不平衡最优传输度量分布对齐,解决LLM文化价值评估中的构造-组成-上下文挑战。

Comments ICML 2026 Camera Ready

详情
AI中文摘要

随着LLM在全球部署,使其文化价值取向对齐对于安全性和用户参与至关重要。然而,现有基准面临构造-组成-上下文($C^3$)挑战:依赖判别性、多项选择格式,探测的是价值知识而非真实取向,忽视亚文化异质性,且与真实世界的开放式生成不匹配。我们引入DOVE,一个直接比较人类撰写的文本分布与LLM生成输出的分布性评估框架。DOVE利用率失真变分优化目标从10K文档中构建紧凑的价值码本,将文本映射到结构化价值空间以过滤语义噪声。使用不平衡最优传输测量对齐,捕捉文化内分布结构和子群体多样性。在12个LLM上的实验表明,DOVE实现了优越的预测有效性,与下游任务的相关性达到31.56%,同时每个文化仅需500个样本即可保持高可靠性。

英文摘要

As LLMs are globally deployed, aligning their cultural value orientations is critical for safety and user engagement. However, existing benchmarks face the Construct-Composition-Context ($C^3$) challenge: they rely on discriminative, multiple-choice formats that probe value knowledge rather than true orientations, overlook subcultural heterogeneity, and are mismatched with real-world open-ended generation. We introduce DOVE, a distributional evaluation framework that directly compares human-written text distributions with LLM-generated outputs. DOVE utilizes a rate-distortion variational optimization objective to construct a compact value codebook from 10K documents, mapping text into a structured value space to filter semantic noise. Alignment is measured using unbalanced optimal transport, capturing intra-cultural distributional structures and subgroup diversity. Experiments across 12 LLMs show that DOVE achieves superior predictive validity, attaining a 31.56% correlation with downstream tasks, while maintaining high reliability with as few as 500 samples per culture.

2604.18347 2026-06-09 cs.CL cs.AI 版本更新

Multilingual Training and Evaluation Resources for Vision-Language Models

面向视觉语言模型的多语言训练和评估资源

Daniela Baiamonte, Elena Fano, Matteo Gabburo, Stefano Simonazzi, Leonardo Rigutini, Andrea Zugarini

发表机构 * Villanova.ai Aithlas

AI总结 本文提出跨五种欧洲语言的视觉语言模型训练与评估资源,通过再生与翻译方法生成高质量多语言数据,验证多语言数据在非英语基准上的有效性。

详情
AI中文摘要

视觉语言模型(VLMs)近年来取得了快速进展。然而,尽管发展迅速,VLM的开发仍严重依赖英语,导致两个主要限制:(i)缺乏用于训练的多语言多模态数据集,(ii)缺乏跨语言的全面评估基准。本文通过引入覆盖五种欧洲语言(英语、法语、德语、意大利语和西班牙语)的全新综合资源套件来填补这些空白。我们采用再生-翻译范式,通过结合精心筛选的合成生成和人工标注来生成高质量的跨语言资源。具体而言,我们构建了Multi-PixMo训练语料库,使用宽松许可的模型对PixMo现有数据集(PixMo-Cap、PixMo-AskModelAnything和CoSyn-400k)中的示例进行再生。在评估方面,我们通过翻译广泛使用的英语数据集(MMbench、ScienceQA、MME、POPE、AI2D)构建了一组多语言基准。我们通过定性和定量的人工分析评估这些资源的质量,并测量标注者间一致性。此外,我们进行了消融研究,以展示相对于仅使用英语数据,多语言数据在VLM训练中的影响。涵盖三种不同模型的实验表明,使用多语言、多模态示例训练VLM在非英语基准上始终有益,同时对英语也有正向迁移。

英文摘要

Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development remains heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from the pre-existing PixMo datasets (PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k) with permissively licensed models. On the evaluation side, we construct a set of multilingual benchmarks derived by translating widely used English datasets (MMbench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, relative to English-only data, on VLM training. Experiments with three different models show that training VLMs on multilingual, multimodal examples is consistently beneficial on non-English benchmarks, with positive transfer to English as well.

2604.24278 2026-06-09 cs.SD cs.AI 版本更新

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

RAS:一种面向可靠性的自动语音识别度量标准

Wenbin Huang, Yuhang Qiu, Bohan Li, Yiwei Guo, Jing Peng, Hankun Wang, Xie Chen, Kai Yu

发表机构 * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China(上海交通大学计算机科学学院X-LANCE实验室,中国) MoE Key Lab of Artificial Intelligence(教育部人工智能重点实验室) Jiangsu Key Lab of Language Computing, China(江苏省语言计算重点实验室,中国)

AI总结 本研究提出了一种面向可靠性的度量标准RAS,用于评估自动语音识别系统在不确定片段上的转录可靠性,并引入具有弃权意识的转录框架,结合经人类偏好校准的权衡参数,在保持准确性的同时提升了转录的可靠性。

Comments 5 pages, 4 figures; Accepted at InterSpeech 2026

详情
AI中文摘要

自动语音识别系统在嘈杂或模糊条件下常常会产生自信但错误的转录,这对用户和下游应用都具有误导性。基于词错误率的标准评估仅关注准确性,未能捕捉转录的可靠性。我们引入了具有弃权意识的转录框架,使ASR模型能够显式地对不确定的片段弃权。为了评估弃权情况下的可靠性,我们提出了RAS,一种面向可靠性的度量标准,在转录的信息量和错误规避之间取得平衡,其权衡参数通过人类偏好进行校准。然后,我们通过监督自举后接强化学习训练了一个具有弃权意识的ASR模型。实验表明,在保持有竞争力的准确性的同时,转录可靠性有显著提高。

英文摘要

Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to capture transcription reliability. We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. To evaluate reliability under abstention, we propose RAS, a reliability-oriented metric that balances transcription informativeness and error aversion, with its trade-off parameter calibrated by human preference. We then train an abstention-aware ASR model through supervised bootstrapping followed by reinforcement learning. Our experiments demonstrate substantial improvements in transcription reliability while maintaining competitive accuracy.

2604.24594 2026-06-09 cs.CL cs.AI 版本更新

Skill Retrieval Augmentation for Agentic AI

面向智能体AI的技能检索增强

Weihang Su, Jianming Long, Qingyao Ai, Qiaozhi He, Yichen Tang, Changyue Wang, Yiteng Tu, Yingbo Wang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) ByteDance Inc.(字节跳动公司)

AI总结 针对现有智能体系统在技能库扩展时上下文窗口不足、技能识别准确率下降的问题,提出技能检索增强(SRA)范式,通过动态检索外部技能库提升智能体性能,并构建SRA-Bench基准揭示技能整合中的瓶颈。

详情
AI中文摘要

随着大型语言模型(LLMs)演变为能够自主解决问题的智能体,它们越来越依赖外部的、可复用的技能来处理超出其原生参数能力的任务。在现有的智能体系统中,整合技能的主要策略是在上下文窗口内显式枚举可用技能。然而,这种策略无法扩展:随着技能库的扩大,上下文预算迅速消耗,智能体在识别正确技能方面的准确性显著下降。为此,本文提出了技能检索增强(SRA),一种新的范式,其中智能体按需从大型外部技能库中动态检索、整合和应用相关技能。为了使该问题可衡量,我们构建了一个大规模技能库,并引入了SRA-Bench,这是首个对完整SRA流程进行分解评估的基准,涵盖技能检索、技能整合和最终任务执行。SRA-Bench包含5,400个能力密集型测试实例和636个手动构建的金标准技能,这些技能与网络收集的干扰技能混合,形成了一个包含26,262个技能的大规模语料库。大量实验表明,基于检索的技能增强可以显著提高智能体性能,验证了该范式的潜力。同时,我们揭示了技能整合中的一个基本差距:当前的LLM智能体倾向于以相似的速率加载技能,无论是否检索到金标准技能,或者任务是否实际需要外部能力。这表明技能增强的瓶颈不仅在于检索,还在于基础模型判断何时加载何种技能以及何时真正需要外部加载的能力。这些发现将SRA定位为一个独特的研究问题,并为未来智能体系统中能力的可扩展增强奠定了基础。

英文摘要

As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.

2605.00273 2026-06-09 cs.CV cs.AI 版本更新

When Do Diffusion Models Learn to Generate Multiple Objects?

扩散模型何时学会生成多个物体?

Yujin Jeong, Arnas Uselis, Iro Laina, Seong Joon Oh, Anna Rohrbach

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

AI总结 研究探讨了扩散模型在多物体生成中的局限性,发现场景复杂度比概念不平衡更关键,且低数据条件下计数任务更难学习。

Comments ICML2026

详情
AI中文摘要

文本到图像的扩散模型实现了出色的视觉保真度,却在多物体生成中仍不可靠。尽管有大量实证证据表明这些失败,但其根本原因仍不清楚。我们首先探讨这种限制有多大程度源于数据本身。为了区分数据影响,我们考虑了不同数据集大小下的两种情形:(1)概念泛化,其中每个单独的概念在训练期间都被观察到,但数据分布可能不平衡;(2)组合泛化,其中特定的概念组合被系统性地排除。为了研究这些情形,我们引入了mosaic(多物体空间关系、属性、计数),一种受控的数据集生成框架。通过在mosaic上训练扩散模型,我们发现起主导作用的是场景复杂性而非概念不平衡,并且在低数据情形中计数尤为难以学习。此外,随着训练过程中排除更多概念组合,组合泛化能力会崩溃。这些发现突显了扩散模型的根本限制,并促使人们采用更强的归纳偏置和数据设计,以实现稳健的多物体组合生成。

英文摘要

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.

2605.16223 2026-06-09 cs.GR cs.AI cs.CV 版本更新

Evaluating Design Video Generation: Metrics for Compositional Fidelity

评估设计视频生成:构成保真度的度量标准

Adrienne Deganutti, Dingning Cao, Jaejung Seol, Elad Hirsch, Purvanshi Mehta

发表机构 * Lica World, San Francisco, United States of America(Lica World,美国旧金山)

AI总结 本文提出一个自动化评估框架,用于评估设计动画中布局、动作正确性、时间质量和内容保真度,以替代主观人类评估,为该领域提供统一基准。

Comments ICML 2026 Workshop on Human-AI Co-Creativity

详情
AI中文摘要

生成视频模型越来越多地用于设计动画任务,但该领域缺乏标准化评估框架。与自然视频生成不同,设计动画施加了结构化约束:特定组件需以规定类型、方向、速度和时间进行动画,而非动画区域必须保持稳定,布局结构必须保持。本文提供了一个全面自动化的评估框架,从四个维度组织:布局保真度、动作正确性、时间质量及内容保真度。这消除了对主观人类评估的依赖,并为该领域建立了一个共同的基准。我们在此发布代码和数据集:https://github.com/purvanshi/lica-bench。

英文摘要

Generative video models are increasingly used in design animation tasks, yet no standardized evaluation framework exists for this domain. Unlike natural video generation, design animation imposes structured constraints: specific components shall animate with prescribed motion types, directions, speed and timing, while non-animated regions must remain stable and layout structure must be preserved. This paper provides a fully automated evaluation framework organized across four dimensions: layout fidelity, motion correctness, temporal quality, and content fidelity. This eliminates the reliance on subjective human evaluation and establishes a common basis for benchmarking progress in the field. We release the code and dataset here: https://github.com/purvanshi/lica-bench.

2605.19228 2026-06-09 cs.CL cs.AI cs.IT cs.LG math.IT 版本更新

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

通过分步置信度归因诊断黑盒大语言模型的多步推理失败

Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang, Lu Cheng, Hua Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出了一种基于分步置信度归因(SCA)的方法,用于诊断黑盒大语言模型在多步推理中的失败,通过信息瓶颈原理对生成的推理轨迹进行置信度评估,并通过实验验证该方法在数学推理和多跳问答任务中的有效性。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型通过生成分步解决方案在具有客观答案的推理任务中实现了强大的性能,但诊断多步推理轨迹可能失败的位置仍然困难。置信度估计提供了一种诊断信号,但现有方法受限于最终答案或需要内部模型访问。在本文中,我们引入了分步置信度归因(SCA),一种适用于闭源LLM的框架,该框架仅基于生成的推理轨迹分配步骤级置信度。SCA应用信息瓶颈原理:与正确解决方案中的一致结构对齐的步骤获得高置信度,而偏差则被标记为可能错误。我们提出了两种互补的方法:(1)NIBS,一种非参数化的IB方法,用于测量一致性而无需图结构,以及(2)GIBS,一种基于图的IB模型,通过可微分掩码学习子图以捕捉逻辑变化。在数学推理和多跳问答任务上的大量实验表明,SCA能够可靠地识别与推理错误高度相关的低置信度步骤。此外,与使用答案级反馈相比,使用步骤级置信度指导自我修正可将修正成功率最多提高13.5%。

英文摘要

Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.

2605.23595 2026-06-09 cs.LG cs.AI cs.CV cs.ET cs.PF 版本更新

Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

学习评估:基于元学习的无标签数据低成本模型评估

Trinh Pham, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen

发表机构 * Griffith University(格里菲斯大学) Edith Cowan University(埃迪斯科文大学) The University of Queensland(昆士兰大学)

AI总结 提出MetaEvaluator,一种基于元学习的模型无关框架,通过参考模型池实现无标签数据上的快速、准确且成本效益高的新模型评估。

Comments Accepted by KDD 2026

详情
AI中文摘要

机器学习的快速发展产生了不断扩展的模型生态系统,使得在未见过的未标记数据上验证新发布模型的可靠性变得越来越具有挑战性。传统的评估流程依赖于昂贵的标注、重复的微调或无法跨模型家族迁移的狭窄假设。我们提出了MetaEvaluator,一个成本效益高、模型无关的框架,用于快速、无标签地评估跨不同架构和模态的未见模型。MetaEvaluator利用参考模型池上的元学习来获得可迁移的初始化,从而能够准确评估新模型,同时将成本分摊到整个池中,并消除了每个模型重新训练的需要。据我们所知,这是第一个能够在未标记数据集上评估新模型的模型无关框架。大量实验表明,与传统方法相比,MetaEvaluator以显著降低的成本产生稳定且准确的性能估计,使得在未标记数据上对新出现的模型进行可扩展的基准测试变得实用。代码可在以下网址获取:https://github.com/phkhanhtrinh23/MetaEvaluator。

英文摘要

The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to new models. We introduce MetaEvaluator, a cost-effective, model-agnostic framework for fast, label-free evaluation of unseen models across diverse architectures and modalities. MetaEvaluator meta-learns over a pool of reference models to acquire an effective initialization for accurate assessment of unseen models, thereby amortizing evaluation cost and eliminating the need for per-model retraining. To the best of our knowledge, this is the first model-agnostic framework that evaluates new models on unlabeled datasets. Extensive experiments demonstrate that MetaEvaluator delivers stable and accurate performance estimates at substantially lower cost than conventional approaches, enabling scalable benchmarking on unlabeled datasets for emerging models. The code is available at: https://github.com/phkhanhtrinh23/MetaEvaluator.

2605.24660 2026-06-09 cs.IR cs.AI cs.LG 版本更新

How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

LLM 智能体应看到多少工具?一种机会校正的答案

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Joey Blackwell

发表机构 * Meta Platforms(Meta平台)

AI总结 针对 LLM 智能体工具选择中候选列表长度优化问题,提出基于机会校正的 Bits-over-Random (BoR) 指标,并将其转化为强化学习奖励,实现每查询自适应深度选择,在保持覆盖率的同时显著减少展示工具数量并提升下游工具选择准确率。

Comments 13 pages, 2 figures

详情
AI中文摘要

在 LLM 智能体使用工具之前,检索系统必须决定向智能体展示哪些候选工具。这个候选列表应该多长?展示太多工具,模型难以选择;展示太少,正确的工具可能不会出现。大多数系统对每个查询应用固定的候选列表大小,但缺乏标准指标来评估该大小是否合适。我们将展示给 LLM 智能体的工具数量作为评估对象,并应用 Bits-over-Random (BoR),一种机会校正的指标,询问在给定深度下的成功是否优于随机选择在同一深度下的表现。我们在三个工具选择基准、多个评分器以及从 20 到 3,251 个工具不等的注册表上评估 BoR。然后,我们将相同的原理转化为强化学习 (RL) 奖励,用于每查询选择工具候选列表深度。RL 智能体故意设计得简单,作为指标的探针而非提议的系统。随着候选列表增长,随机包含正确工具的机会增加,因此奖励自然减少,减少了对工程化深度惩罚的需求。在 BFCL(370 个工具)上,学习到的策略几乎匹配展示 50 个工具的覆盖率(90.3% 对 90.8%),而平均仅展示 7 个。在 ToolBench(3,251 个工具)上,固定展示 5 个工具实现了更高的总覆盖率(64.7% 对 61.9%),但在困难查询(正确工具排名第 6-20 位)上未找到任何工具。BoR 智能体通过搜索更深层,在这些查询上找到了 16.7%。使用 Claude Sonnet 4.6 的下游验证表明,更短的自适应列表也提高了 LLM 选择正确工具的能力:与始终展示 5 个工具时的 87.1% 相比,达到了 93.1%;在中等难度查询(正确工具存在但未排名第一)上,从 60.9% 扩大到 76.8%。

英文摘要

Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Show too many tools and the model struggles to choose. Show too few and the correct tool may not appear. Most systems apply a fixed shortlist size to every query, but no standard metric exists to evaluate whether that size was appropriate. We treat the number of tools shown to an LLM agent as the object of evaluation and we apply Bits-over-Random (BoR), a chance-corrected metric that asks whether success at a given depth is better than what random selection would achieve at that same depth. We evaluate BoR across three tool-selection benchmarks, multiple scorers, and registries ranging from 20 to 3,251 tools. We then turn the same principle into a reinforcement learning (RL) reward for choosing tool shortlist depth per query. The RL agent is deliberately simple, serving as a probe of the metric rather than a proposed system. As the shortlist grows, random chance of including the correct tool rises, so the reward naturally decreases, reducing the need for an engineered depth penalty. On BFCL (370 tools), the learned policy nearly matches the coverage of showing 50 tools (90.3% vs 90.8%) while presenting only 7 on average. On ToolBench (3,251 tools), a fixed shortlist of 5 tools achieves higher aggregate coverage (64.7% vs 61.9%) but finds nothing on hard queries (correct tool ranked 6th-20th). The BoR agent finds 16.7% on those same queries by searching deeper. Downstream validation with Claude Sonnet 4.6 indicates that shorter adaptive lists also improve the LLM's ability to select the right tool: 93.1% versus 87.1% when always shown 5 tools, widening to 76.8% vs 60.9% on medium-difficulty queries where the correct tool is present but not ranked first.
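
The abstract does not spell out the BoR formula, so the sketch below shows only one plausible reading of a chance-corrected score. It assumes, purely for illustration, that each query has a single correct tool, that a random shortlist of depth `k` drawn from `n_tools` contains it with probability `k / n_tools`, and that the score is the base-2 log ratio of observed to chance hit rate. The function names are hypothetical and the paper's definition may differ.

```python
import math


def random_hit_rate(k: int, n_tools: int) -> float:
    """Chance that a uniformly random shortlist of size k contains the single correct tool."""
    if not 0 < k <= n_tools:
        raise ValueError("shortlist depth must satisfy 0 < k <= n_tools")
    return k / n_tools


def bits_over_random(observed_hit_rate: float, k: int, n_tools: int) -> float:
    """Assumed form of a chance-corrected score: log2(observed / chance) at depth k."""
    if observed_hit_rate <= 0.0:
        return float("-inf")  # no successes: infinitely worse than chance on a log scale
    return math.log2(observed_hit_rate / random_hit_rate(k, n_tools))


# Illustrative comparison using coverage figures quoted in the abstract (BFCL, 370 tools).
# Treating the learned policy's average depth of 7 as a fixed depth is a simplification.
deep = bits_over_random(0.908, k=50, n_tools=370)
shallow = bits_over_random(0.903, k=7, n_tools=370)
```

Under these assumptions the shallow shortlist scores higher than the deep one even though its raw coverage is slightly lower, because chance alone explains much more of the success at depth 50. This matches the abstract's remark that the reward naturally decreases as the shortlist grows.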

2605.25085 2026-06-09 cs.IT cs.AI cs.LG math.IT 版本更新

Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

自回归语言模型中的多项式上下文截断敏感性:KV缓存压缩的序列Wyner-Ziv界

Munsik Kim

发表机构 * Independent Researcher(独立研究者)

AI总结 研究自回归语言模型中在线KV缓存压缩的率失真极限,将其建模为序列Wyner-Ziv信源编码,发现下一词分布对上下文截断的敏感性呈多项式衰减,并推导了仅后缀缓存策略的每词内存需求。

详情
AI中文摘要

我们研究了自回归语言模型中在线KV缓存压缩的率失真极限,将其建模为模型诱导滤子上的序列Wyner-Ziv信源编码,其中下一步查询作为解码器边信息。实验上,在涵盖两个系列、参数规模0.5-3B的四个模型中,我们发现下一词分布对上下文截断的敏感性呈多项式衰减而非几何衰减:幂律在外推中比指数拟合提升一个数量级,拟合指数通过汇加最近KL测量独立恢复,并通过位置保持消融验证了衰减不受位置编码伪影影响。在相应的多项式截断敏感性假设下,我们的主要结果刻画了仅后缀缓存策略的每词内存需求:滑动窗口方案以窗口大小$w = O(\varepsilon^{-1/\alpha})$达到失真$\varepsilon$,且在附加双边贝叶斯风险条件下,逆命题表明在该策略类内$w = \Omega(\varepsilon^{-1/\alpha})$是必要的,因此仅后缀策略的缩放为$\Theta(\varepsilon^{-1/\alpha})$。循环或传播缓存摘要能否超越此缩放留待进一步研究。一个显式的块马尔可夫方案达到上界;在附加前向衰减和正则性假设(仅由截断敏感性无法推出)下,其收敛速率指数与逆命题匹配,否则相差两倍。实验上,幂律预测了具体缓存策略的退化曲线:在同等预算下,基于最近性的驱逐(滑动、汇加最近)相比随机保留将失真抑制约两个数量级,且失真随预算呈幂律衰减。

英文摘要

We study the rate-distortion limits of online KV cache compression in autoregressive language models, formulating it as sequential Wyner-Ziv source coding on the filtration induced by the model, with the next-step query as decoder side information. Empirically, across four models spanning two families and 0.5-3B parameters, we find that the next-token distribution's sensitivity to context truncation decays polynomially rather than geometrically: a power law improves on an exponential fit by an order of magnitude in extrapolation, the fitted exponent is recovered independently from a sink-plus-recent KL measurement, and the decay is verified to be free of positional-encoding artifacts by a position-preserving ablation. Under a corresponding polynomial truncation-sensitivity assumption, our main result characterizes the per-token memory requirement of suffix-only cache policies: a sliding-window scheme attains distortion $\varepsilon$ with window $w = O(\varepsilon^{-1/\alpha})$, and -- under an additional two-sided Bayes-risk condition -- a converse shows $w = \Omega(\varepsilon^{-1/\alpha})$ is necessary within this policy class, so the scaling is $\Theta(\varepsilon^{-1/\alpha})$ for suffix-only policies. Whether recurrent or propagating cache summaries can beat this scaling is left open. An explicit block-Markov scheme achieves the upper bound; its rate-of-convergence exponent matches the converse under additional forward-decay and regularity hypotheses (not implied by truncation sensitivity alone), and differs by a factor of two otherwise. Empirically, the polynomial law predicts the degradation curves of concrete cache policies: recency-based eviction (sliding, sink-plus-recent) suppresses distortion by roughly two orders of magnitude over random retention at equal budget, with a power-law decay in the budget.
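
As a worked illustration of the window scaling stated in the abstract, suppose (an assumption made only for this sketch) that the distortion caused by keeping a suffix window of length $w$ is bounded by $C\,w^{-\alpha}$, where $C > 0$ is a hypothetical constant and $\alpha > 0$ is the polynomial decay exponent. Solving for the smallest sufficient window $w^\ast(\varepsilon)$ gives the $\varepsilon^{-1/\alpha}$ rate directly:

```latex
C\,w^{-\alpha} \le \varepsilon
\;\Longleftrightarrow\;
w \ge \left(\frac{C}{\varepsilon}\right)^{1/\alpha},
\qquad\text{so}\qquad
w^\ast(\varepsilon) = \left(\frac{C}{\varepsilon}\right)^{1/\alpha}
= O\!\left(\varepsilon^{-1/\alpha}\right),
\qquad
\frac{w^\ast(\varepsilon/2)}{w^\ast(\varepsilon)} = 2^{1/\alpha}.
```

For example, with $\alpha = 1/2$ halving the target distortion requires a window four times as long, whereas under geometric decay the window would grow only by an additive constant. This is why the polynomial-versus-geometric distinction matters for cache budgets.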

2606.01060 2026-06-09 cs.CL cs.AI cs.LG 版本更新

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

MENTIS: 对齐改变了什么信念?语言模型中多尺度潜在扭转的测量

Partha Pratim Saha, Samarth Raina, Mayur Parvatikar, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Pragya Lab, BITS Pilani Goa, India(印度BITS Pilani果阿校区Pragya实验室) IIIT Delhi, India(印度德里英德拉普拉斯塔信息技术学院) Amazon, USA(美国亚马逊) Meta, USA(美国Meta) Apple, USA(美国苹果)

AI总结 提出MENTIS框架,通过层间协方差扭转范数、谱扭转诊断和能量-辐射-激活度量,测量偏好对齐在语言模型内部计算中引起的选择性、深度局部的几何结构变化。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

偏好对齐显著改善了大语言模型的可观察行为,但尚不清楚对齐在内部改变了什么。对齐系统在越狱、提示注入和检索时损坏下仍然失败,表明仅行为级评估是不完整的。后训练应在内部计算中留下可测量的痕迹。我们问:当指令微调(IT)模型变为偏好对齐(PA)模型时,哪些几何结构发生了变化,这些变化集中在何处,以及它们在不同概念、提示和模型家族中的选择性如何? 我们引入MENTIS,一个几何优先的框架,用于测量配对检查点中对齐引起的内部重组。MENTIS使用基于层间协方差的主扭转范数(T1)、辅助谱扭转诊断(T2)和用于深度定位的能量-辐射-激活度量(ERA)来比较IT和PA模型。在LITMUS上的四个7-8B模型对中,我们的研究表明对齐引起的变化是选择性的而非均匀的:规范性概念平均表现出比事实性概念更大的扭转偏移;扭转与上下文熵负相关;峰值效应定位于架构特定的中后层。相同的模式出现在词级、提示级和模型级分析中。这些结果表明偏好对齐在内部计算中留下了结构化的、深度局部的几何特征,超越了仅行为级评估所能揭示的内容。

英文摘要

Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and retrieval-time corruption, suggesting behavior-level evaluation alone is incomplete. Post-training should leave measurable traces in internal computation. We ask: when an instruction-tuned (IT) model becomes a preference-aligned (PA) model, what geometric structure changes, where do those changes concentrate, and how selectively do they vary across concepts, prompts, and model families? We introduce MENTIS, a geometry-first framework for measuring alignment-induced internal reorganization in paired checkpoints. MENTIS compares IT and PA models using a primary layerwise covariance-based torsion norm (T1), a secondary spectral torsion diagnostic (T2), and an Energy-Radiance-Activation measure (ERA) for depth localization. Across four 7-8B model pairs on LITMUS, our study reveals that alignment-induced change is selective rather than uniform: normative concepts exhibit larger torsion shifts than factual concepts on average; torsion is negatively correlated with contextual entropy; and peak effects localize to architecture-specific mid-to-late layers. The same pattern appears across word-level, prompt-level, and model-level analyses. These results suggest preference alignment leaves structured, depth-localized geometric signatures in internal computation beyond what behavior-level evaluation alone can reveal.

2606.03328 2026-06-09 cs.LG cs.AI 版本更新

Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning

校准数据在能力维度上的权衡:为什么多源混合对高稀疏LLM剪枝至关重要

Hu Xu, Zhaolong Xing, Congcong Liu, Jiaxing Wang, Zhida Jiang, Junshi Huang, Zhen Chen, Jianfeng Xu

发表机构 * Shanghai Jiao Tong University(上海交通大学) JD.com(京东公司)

AI总结 通过分解后剪枝能力维度并分析15个校准源,发现校准困惑度与通用能力保留正相关但与数学和代码能力保留负相关,提出多源混合校准方法IGSP以平衡各维度性能。

详情
AI中文摘要

训练后剪枝使用小型无标签校准集将大型语言模型压缩至高稀疏度,近期研究认为校准源的选择对平均后剪枝精度影响不大。我们提出疑问:当校准效果分别在不同能力维度上评估而非聚合时,该结论是否仍然成立。将后剪枝能力分解为通用、常识、代码和数学,并通过Spearman相关性分析$n{=}15$个校准源的OIT信息度量与各维度保留率,我们发现一个符号相反的权衡:校准困惑度与通用保留率正相关($\rho{=}{+}0.71$),但与数学和代码保留率负相关($\rho{=}{-}0.53,\,{-}0.59$;$p{<}0.05$),因此单一源无法保留所有能力。我们以多源校准混合作为回应,并提出IGSP,一种信息引导的自校准协议,通过最小化4-gram聚合和平衡各维度困惑度,自动构建多源混合而无需能力对齐的语料库。在LLaMA-3.1-8B上使用SparseGPT 60%稀疏度时,均匀多源混合达到58.8%的总保留率,优于最佳单一源(MetaMath,50.0%)$+8.8$和C4默认(40.0%)$+18.8$;IGSP比Self-Cal提高$+2.4$,比SGS提高$+4.8$。

英文摘要

Post-training pruning compresses large language models to high sparsity using a small unlabelled calibration set, and recent work has concluded that the choice of calibration source has only modest impact on averaged post-pruning accuracy. We ask whether this conclusion survives once calibration impact is evaluated separately across distinct capability dimensions rather than aggregated. Decomposing post-pruning capability into General, Commonsense, Code, and Math, and analysing $n{=}15$ calibration sources via Spearman correlations between OIT information metrics and per-dimension retention, we uncover an opposite-sign trade-off: calibration perplexity correlates positively with General retention ($\rho{=}{+}0.71$) but negatively with Math and Code retention ($\rho{=}{-}0.53,\,{-}0.59$; $p{<}0.05$), so no single source can preserve all capabilities. We respond with multi-source calibration mixing, and propose IGSP, an information-guided self-calibration protocol that automates multi-source construction without capability-aligned corpora by minimising 4-gram aggregation and balancing perplexity across dimensions. On LLaMA-3.1-8B at SparseGPT 60% sparsity, a uniform multi-source mix reaches 58.8% total retention, outperforming the best single source (MetaMath, 50.0%) by $+8.8$ and the C4 default (40.0%) by $+18.8$; IGSP improves over Self-Cal by $+2.4$ and SGS by $+4.8$.
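The per-dimension analysis rests on Spearman rank correlation between a calibration-source metric and a retention score. The sketch below is a plain-Python illustration of that statistic (Pearson correlation of average-tie ranks). The toy numbers are invented for illustration and are not the paper's data.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Invented toy values: perplexity of four calibration sources and two
    # retention scores that move in opposite directions.
    perplexity = [5.0, 9.0, 14.0, 30.0]
    general_retention = [40.0, 48.0, 55.0, 61.0]
    math_retention = [70.0, 62.0, 50.0, 41.0]
    print(spearman(perplexity, general_retention))  # positive
    print(spearman(perplexity, math_retention))     # negative
```

An opposite-sign pair like this is what rules out a single best source: ranking sources by the metric that helps one dimension ranks them in reverse for the other.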

2606.04409 2026-06-09 cs.CV cs.AI cs.LG 版本更新

An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

数据规模、模型复杂度和输入模态对视觉泛化影响的实证研究

Yidi Zhouluo

发表机构 * School of Medical Information and Artificial Intelligence, Shandong First Medical University(医学信息与人工智能学院,山东第一医科大学)

AI总结 通过一维非线性函数和CIFAR数据集实验,实证分析数据规模、模型复杂度和输入模态对视觉泛化性能的影响。

Comments 12 pages, 9 figures, 4 tables

详情
AI中文摘要

现代深度神经网络通常具有较大的参数规模和非线性层次结构,在计算机视觉中取得了强劲性能。然而,其泛化性能的来源仍然难以用传统统计学习理论解释。在可能影响视觉泛化的因素中,数据规模、模型复杂度和输入模态是基础且可控的变量。本研究实证分析了这三个因素如何影响模型泛化性能。具体而言,在初步实验中,我们构建了一维非线性函数,并改变训练样本数量和多项式次数,以观察数据规模和模型复杂度对模型性能的影响。在主要实验中,我们比较了CIFAR-10和CIFAR-100上不同训练数据规模、模型架构和输入模态下的模型性能。实验结果表明,增加训练数据规模持续改善泛化性能,而模型复杂度的变化并未带来稳定提升。此外,去除颜色信息会降低模型性能,而梯度、边缘和小波等显式先验特征在不同模型架构上的效果不一致。总体而言,本研究提供了数据规模、模型复杂度、输入模态与视觉泛化性能之间关系的实证分析。代码和实验日志见:https://github.com/zlyd-CV/DeepLearning-Empirical-Studies。

英文摘要

Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: https://github.com/YidiZhouluo/DeepLearning-Empirical-Studies/tree/main/Exp_01.

2606.04752 2026-06-09 cs.LG cs.AI 版本更新

An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers

多通道信号Transformer输入编码器的实证审计

Ossi Lehtinen

发表机构 * Anthropic

AI总结 通过合成基准和真实数据ETTh1,实证审计八种输入编码器,发现标准线性投影(nn.Linear(C, d_model))在大多数情况下与复杂替代方案性能相当,仅共享标量基线和通道独立基线显著落后。

Comments 21 pages, 1 figure, 8 tables. Code: https://github.com/OssiLehtinen/channel-encoder-audit

详情
AI中文摘要

处理多通道标量信号的Transformer必须在每个时间步将$C$个同时值嵌入到一个$d_{\text{model}}$维向量中。我们在一个设计为使通道身份信息丰富的合成基准和作为真实数据检查的ETTh1上,以下一步负对数似然(NLL)为指标,实证审计了八种输入编码器(包括共享标量基线、每通道线性投影、正交正则化器、非线性MLP主干、块分区拼接、通道独立和通道作为令牌架构,以及投影位置编码)。主要结论是宽泛的“第一梯队”内实际近似等价:标准每通道线性投影(nn.Linear(C, $d_{\text{model}}$))与该梯队中的每个替代方案相比,差异在统计上显著但实际中很小。两种编码器明显失败:共享标量基线(由于我们明确的信息论原因而崩溃)和通道独立的PatchTST风格基线(在两个基准上表现不佳,并在合成基准上普遍过拟合)。配对测试解决了两个小差距:通过学习的线性层投影正弦位置编码在小$C$时略胜一筹,直接几何探测表明其机制是位置-通道正交化;非线性MLP主干在我们测试的最大$C$时略胜一筹,但差距在更多训练数据下缩小。实际建议是默认使用nn.Linear(C, $d_{\text{model}}$),仅当手头任务有实际理由时才采用更复杂的方案。重现本文所有实验的代码和数据可在https://github.com/OssiLehtinen/channel-encoder-audit获取。

英文摘要

Transformers consuming multi-channel scalar signals must embed $C$ simultaneous values into one $d_{\text{model}}$-dimensional vector per time step. We audit eight input encoders -- a shared-scalar baseline, per-channel linear projections, an orthogonality regulariser, a nonlinear MLP, block-partitioned concatenation, channel-independent and channel-as-token architectures, and a projected positional encoding -- on a synthetic benchmark where channel identity is informative and on ETTh1, scored by next-step negative log-likelihood. The headline is practical near-equivalence within a wide "top tier": the standard per-channel linear projection matches every alternative up to small, statistically real but practically modest differences. A direct geometric probe attributes this to a spontaneous orthogonalisation of the per-channel projections: they end up near-orthogonal with no explicit regulariser, letting the standard linear recover channel identity from the summed embedding. Two encoders lose decisively: the shared-scalar baseline collapses for information-theoretic reasons we make explicit, and the channel-independent PatchTST-spirit baseline overfits universally on the synthetic benchmark and underperforms on both. Paired tests resolve two small gaps: projecting the sinusoidal positional encoding through a learned linear layer edges the rest at small $C$ by extending this orthogonality to the positional subspace; a nonlinear MLP stem edges them at the largest $C$, with the gap shrinking under more training data. The practical recommendation: use the standard per-channel linear projection by default; reach for something more elaborate only when the task calls for it.
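The standard encoder can be read as a sum of per-channel embedding directions, which is why orthogonality between those directions matters. The sketch below is a dependency-free illustration, not the paper's code: `W` mimics the `(d_model, C)` weight layout of `nn.Linear(C, d_model)`, and the helper names are assumptions introduced here.

```python
def embed(x, W, b):
    """Linear embedding of C channel values into d_model dimensions.

    W has d_model rows and C columns (the layout of nn.Linear(C, d_model).weight),
    so the output equals sum_c x[c] * W[:, c] + b: one direction per channel.
    """
    return [sum(W[i][c] * x[c] for c in range(len(x))) + b[i] for i in range(len(b))]


def column(W, c):
    """Extract channel c's embedding direction (column c of W)."""
    return [row[c] for row in W]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def max_offdiag_cosine(W):
    """Largest |cosine| between distinct channel directions; 0 means orthogonal."""
    n_channels = len(W[0])
    return max(
        abs(cosine(column(W, a), column(W, b)))
        for a in range(n_channels)
        for b in range(a + 1, n_channels)
    )


if __name__ == "__main__":
    # Toy weights with d_model = 3 and C = 2; the two columns are orthogonal.
    W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    b = [0.0, 0.0, 0.0]
    print(embed([2.0, 3.0], W, b))
    print(max_offdiag_cosine(W))
```

When the columns are near-orthogonal, each channel's value can be read back from the summed embedding by projecting onto that channel's direction, which is the recovery mechanism the abstract attributes to spontaneous orthogonalisation.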

2606.06915 2026-06-09 cs.CL cs.AI cs.LG 版本更新

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

ThinkBooster: 一种用于LLM推理无缝测试时扩展的统一框架

Vladislav Smirnov, Chieu Nguyen, Sergey Senichev, Minh Ngoc Ta, Ekaterina Fadeeva, Artem Vazhentsev, Daria Galimzianova, Nikolai Rozanov, Viktor Mazanov, Jingwei Ni, Tianyi Wu, Igor Kiselev, Mrinmaya Sachan, Iryna Gurevych, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

发表机构 * MBZUAI ETH Zürich(苏黎世联邦理工学院) Imperial College London(伦敦帝国理工学院) NUS(新加坡国立大学) Accenture(埃森哲) Innopolis University(因诺波利斯大学) Independent Researcher(独立研究者)

AI总结 提出ThinkBooster框架,通过模块化库、联合评估基准和可部署代理服务,实现LLM推理的测试时计算扩展,在数学和编码任务上验证了性能-计算权衡。

详情
AI中文摘要

测试时计算(TTC)扩展已成为一种强大的范式,通过在推理期间分配额外计算(例如,通过多样本生成和基于验证器的重新排序)来改进大型语言模型(LLM)推理。现有的TTC扩展策略和推理评分器仍然碎片化,在不一致的协议下进行评估,并且很少通过质量-成本权衡的视角进行分析。我们引入了ThinkBooster,一个用于LLM推理无缝测试时计算扩展的统一框架,它包括(i)一个模块化的Python库,实现了最先进的TTC扩展策略和评分器家族,(ii)一个联合评估性能和计算效率的基准,以及(iii)一个可部署的、兼容OpenAI的代理服务,使得将自适应推理无缝集成到实际应用中成为可能。我们还提供了一个演示可视化调试器,用于检查推理轨迹、中间选择决策和替代推理路径。在数学和编码任务上的实证结果揭示了TTC扩展策略和评分方法的性能-计算权衡,并表明ThinkBooster在实际任务中提供了实际收益。代码以MIT许可证在线提供。

英文摘要

Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.
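Multi-sample generation with verifier-based reranking, the example strategy named in the abstract, can be sketched as a best-of-N loop. This is a generic illustration and not ThinkBooster's API: `generate`, `score`, and the toy stand-ins below are hypothetical placeholders for a sampler and a verifier.

```python
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int,
) -> Tuple[str, float, int]:
    """Sample n candidates, score each with a verifier, return the best.

    Returns (best_candidate, best_score, n_samples) so that quality can be
    reported together with the compute spent.
    """
    candidates = [generate(prompt, seed) for seed in range(n)]
    scored = [(score(prompt, cand), cand) for cand in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score, n


# Toy stand-ins (hypothetical): a deterministic "sampler" and a "verifier"
# that prefers answers close to 5.
def toy_generate(prompt: str, seed: int) -> str:
    return str((seed * 3) % 7)


def toy_score(prompt: str, candidate: str) -> float:
    return -abs(int(candidate) - 5)


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, best_of_n("toy question", toy_generate, toy_score, n))
```

Returning the sample count alongside the score keeps the quality-cost trade-off explicit: more samples can only raise the best verifier score, but each one costs an extra generation.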

10. AI应用与系统 123 篇

2606.07549 2026-06-09 cs.AI cs.MA 新提交

PathoSage: Towards Multi-Source Evidence Adjudication in Pathology via Experience-Aware Agentic Workflow

PathoSage:通过经验感知的代理工作流实现病理学多源证据裁决

Chengyang Zhang, Wenchuan Zhang, Bo Li, Mengran Li, Bob Zhang, Yuhao Yi, Hong Bu, Jiancheng Lv

发表机构 * College of Computer Science, Sichuan University(四川大学计算机科学学院) Department of Pathology and Institute of Clinical Pathology, West China Hospital, Sichuan University(四川大学华西医院病理科/临床病理研究所) Department of Computer and Information Science, University of Macau(澳门大学计算机与信息科学系) School of Intelligent Systems Engineering, Sun Yat-sen University(中山大学智能工程学院)

AI总结 提出PathoSage框架,通过结构化证据审议和Beta-Bernoulli经验系统,独立评估工具证据并解决冲突,减少幻觉和分类器分歧,提升病理学推理鲁棒性。

详情
AI中文摘要

多模态大语言模型(MLLMs)和代理工作流的最新进展在计算病理学中显示出巨大潜力,但可靠的补丁级推理仍然具有挑战性。端到端的病理学MLLM常常幻觉形态特征,而最近的代理系统通常将工具输出和检索知识合并到共享上下文中,使得决策容易受到冲突证据和上下文污染的影响。我们提出PathoSage,一个三阶段框架,明确分离知识检索、证据收集和证据裁决,用于补丁级病理学多模态推理。其核心组件结构化证据审议独立评估来自工具的异质证据,执行冲突分析,并在全新上下文中生成最终判断,以减少锚定偏差。我们进一步引入一个无需训练的Beta-Bernoulli经验系统,具有连续信用分配,以建模长期工具可靠性,并为未来工具使用构建相似性加权先验。实验表明,PathoSage有效缓解了VQA幻觉和分类器分歧,优于强病理学MLLM和代理基线。我们的结果强调了明确的证据裁决和可靠性感知工具建模是构建鲁棒病理学代理的关键要素。

英文摘要

Recent advances in Multimodal Large Language Models (MLLMs) and agent workflows have shown strong promise for computational pathology, yet reliable patch-level reasoning remains challenging. End-to-end pathology MLLMs often hallucinate morphological features, while recent agentic systems usually merge tool outputs and retrieved knowledge into a shared context, making decisions vulnerable to conflicting evidence and context contamination. We propose PathoSage, a three-stage framework that explicitly separates knowledge retrieval, evidence collection, and evidence adjudication for patch-level pathology multimodal reasoning. Its core component, Structured Evidence Deliberation, independently evaluates heterogeneous evidence from tools, performs conflict analysis, and generates the final judgment in a fresh context to reduce anchoring bias. We further introduce a training-free Beta-Bernoulli experience system with continuous credit assignment to model long-term tool reliability and construct similarity-weighted priors for future tool use. Experiments show that PathoSage effectively mitigates VQA hallucinations and classifier disagreement, outperforming strong pathology MLLM and agentic baselines. Our results highlight explicit evidence adjudication and reliability-aware tool modeling as key ingredients for robust pathology agents.
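A Beta-Bernoulli reliability model with continuous credit can be sketched as fractional pseudo-count updates. The code below is an assumption-laden illustration, not PathoSage's implementation: the class name, the credit range of [0, 1], and the similarity-weighting rule are hypothetical choices that match the abstract's description only in broad outline.

```python
from dataclasses import dataclass


@dataclass
class ToolReliability:
    """Beta(alpha, beta) belief about how often a tool's evidence is correct."""

    alpha: float = 1.0  # pseudo-count of successes (uniform prior)
    beta: float = 1.0   # pseudo-count of failures

    def update(self, credit: float) -> None:
        """Add fractional credit in [0, 1] instead of a hard success/failure."""
        credit = min(1.0, max(0.0, credit))
        self.alpha += credit
        self.beta += 1.0 - credit

    @property
    def mean(self) -> float:
        """Posterior mean reliability."""
        return self.alpha / (self.alpha + self.beta)


def similarity_weighted_prior(history, base=(1.0, 1.0)):
    """Build a prior for a new case from past (similarity, credit) pairs.

    Each past outcome contributes pseudo-counts scaled by its similarity to
    the current case, so dissimilar experience counts for less.
    """
    a, b = base
    for similarity, credit in history:
        a += similarity * credit
        b += similarity * (1.0 - credit)
    return a, b


if __name__ == "__main__":
    tool = ToolReliability()
    tool.update(1.0)   # fully correct evidence
    tool.update(0.5)   # partially useful evidence
    print(tool.mean)
    print(similarity_weighted_prior([(1.0, 1.0), (0.5, 0.0)]))
```

Because the update only adds pseudo-counts, the scheme needs no gradient training, which is consistent with the abstract's "training-free" description.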

2606.07721 2026-06-09 cs.AI 新提交

Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model

使用开放权重大语言模型从脑MRI报告中自动提取结构化信息

Kaouther Mouheb, Amos Pomp, Antoine Manenti, Romy de Haan, Farog Faghir, Joy Martens, Harro Seelaar, Francesco Mattace-Raso, Meike W. Vernooij, Frank J. Wolters, Stefan Klein, Esther E. Bron

发表机构 * Department of Radiology & Nuclear Medicine, Erasmus MC(伊拉斯姆斯大学医学中心放射与核医学科) Department of Epidemiology, Erasmus MC(伊拉斯姆斯大学医学中心流行病学系) Department of Electrical and Electronics Engineering, ENSEEIHT(ENSEEIHT电气与电子工程系) Alzheimer Centre Erasmus MC(伊拉斯姆斯大学医学中心阿尔茨海默病中心) Department of Neurology, Erasmus MC(伊拉斯姆斯大学医学中心神经内科) Department of Internal Medicine, Erasmus MC(伊拉斯姆斯大学医学中心内科)

AI总结 本研究评估了开放权重LLM LLaMA 3.1从荷兰语脑MRI报告中自动提取结构化信息的能力,通过零样本和少样本提示策略,在视觉评分、病变检测等任务上取得高准确率,少样本提示进一步提升了数值变量的提取性能。

Comments Submitted to European Radiology

详情
AI中文摘要

目的:从自由文本放射学报告中自动提取数据可实现大规模研究,但很少有研究评估大语言模型(LLM)在荷兰语神经放射学报告上的性能。方法:我们分析了来自一家三级记忆诊所(2016-2021年)的947份脑MRI报告,由顾问神经放射科医生撰写。经过培训的医学生标注了三十个变量;其中100份报告进行了双重标注以评估评分者间信度。我们评估了开放权重LLM LLaMA 3.1在不同语言(荷兰语与英语翻译)和不同示例选择策略的少样本提示下的性能。性能评估使用分类变量的平衡准确率、计数变量的准确率和平均绝对误差以及自由文本的文本相似度。指标在947份报告的10次随机分割上计算。结果:LLaMA 3.1在视觉评分上表现出高零样本性能(平均[95%置信区间]):内侧颞叶萎缩:左侧90% [77-100%],右侧96% [94-99%];全脑皮质萎缩:87% [83-91%];Fazekas评分:94% [93-96%]。微出血检测准确率为93% [92-95%],梗死检测为82% [80-84%]。病灶位置的文本相似度达到0.95 [0.95-0.96]。数值变量性能较低:微出血数量为80% [78-82%],梗死数量为66% [63-68%]。英语翻译结果相当。少样本提示提高了数值变量的性能,使用基于结构相似性的选择后,微出血达到92% [90-93%],梗死达到81% [77-85%]。结论:LLaMA 3.1在从荷兰语神经放射学报告中提取数据方面显示出巨大潜力。少样本提示增强了数值变量的性能,而位置特定变量仍面临挑战。

英文摘要

Objectives: Automatic data extraction from free-text radiology reports enables large-scale research, but few studies assessed the performance of large language models (LLMs) on Dutch neuroradiology reports. Methods: We analyzed 947 brain MRI reports from a tertiary memory clinic (2016-2021), authored by consultant neuroradiologists. Trained medical students annotated thirty variables; 100 reports were double-annotated to assess inter-rater reliability. We evaluated the performance of the open-weight LLM LLaMA 3.1 using different languages (Dutch vs. English translation) and few-shot prompting with different example selection strategies. Performance was evaluated using balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text. Metrics were computed across 10 random splits of the 947 reports. Results: LLaMA 3.1 demonstrated high zero-shot performance for visual rating scores (mean [95%-CI]): Medial Temporal Atrophy: 90% [77-100%] on the left and 96% [94-99%] on the right, Global Cortical Atrophy: 87% [83-91%], and Fazekas: 94% [93-96%]. Microbleed mentions were detected with 93% accuracy [92-95%] and infarct mentions with 82% [80-84%]. Text similarity for lesion location reached 0.95 [0.95-0.96]. Performance was lower for numerical variables: 80% [78-82%] for the number of microbleeds and 66% [63-68%] for infarcts. English translation yielded comparable results. Few-shot prompting improved performance for numerical variables, achieving 92% [90-93%] for microbleeds and 81% [77-85%] for infarcts using structural similarity-based selection. Conclusion: LLaMA 3.1 shows strong potential for extracting data from Dutch neuroradiology reports. Few-shot prompting enhances performance for numerical variables, whereas challenges remain for location-specific variables.
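The two headline metrics, balanced accuracy for categorical variables and mean absolute error for counts, are short enough to spell out. The sketch below uses the standard definitions (mean per-class recall, and mean absolute difference). The toy labels are invented and do not come from the study.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, used here for count variables."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Invented example: a rating with an imbalanced class distribution.
    truth = [0, 0, 0, 1]
    pred = [0, 0, 1, 1]
    print(balanced_accuracy(truth, pred))
    # Invented example: extracted lesion counts.
    print(mean_absolute_error([0, 2, 5], [0, 3, 3]))
```

Balanced accuracy is the safer choice when most reports share the same rating, because plain accuracy would reward always predicting the majority class.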

2606.07780 2026-06-09 cs.AI cs.CV cs.LG 新提交

Land cover and flood type govern the detection limits of satellite-based flood mapping across diverse global flood events

土地覆盖与洪水类型控制基于卫星的洪水测绘在不同全球洪水事件中的检测极限

Venkatesh Kolluru, Rajat Shinde, Abdelhak Marouane, Caden Helbling, Deepak Shah, Othneil Drew, Iksha Gurung, Manil Maskey, Rahul Ramachandran

发表机构 * Earth System Science Center, University of Alabama in Huntsville(阿拉巴马大学亨茨维尔分校地球系统科学中心) Space and Earth Science Data Analysis(空间与地球科学数据分析) NASA Marshall Space Flight Center(NASA马歇尔太空飞行中心)

AI总结 研究利用Prithvi-EO-2.0模型在19个全球洪水事件中评估卫星洪水测绘的检测能力,发现检测精度取决于土地覆盖和洪水类型,农田和河流洪水检测效果较好,而树木覆盖和建成区检测近乎为零。

详情
AI中文摘要

洪水是最具破坏性的自然灾害之一,在气候变化下其频率增加使得基于卫星的淹没测绘对灾害响应至关重要。基于卫星档案预训练的地理空间基础模型提供了地理可迁移性,但其在多样、未见事件中的操作可靠性尚未被表征。在此,我们在跨越六大洲、八个气候带和六种洪水机制的19个分布外洪水事件(2017-2025年)中部署Prithvi-EO-2.0,并针对两个独立参考产品进行验证。检测精度共同依赖于土地覆盖和洪水类型,农田产生最高一致性(IoU=52%),河流事件检测最强(F1=0.69),而树木覆盖和建成区显示近乎零检测(IoU=4%),无论洪水机制如何。双参考验证揭示,明显的模型误差部分反映了参考产品之间的定义不一致而非检测失败。迭代流水线测试识别出23种故障模式,其中流水线工程在初始误差中占主导地位,超过模型容量。这些发现为操作卫星洪水测绘建立了环境依赖的检测边界。

英文摘要

Floods are among the most destructive natural hazards, and their increasing frequency under climate change makes satellite-based inundation mapping essential for disaster response. Geospatial foundation models pretrained on satellite archives offer geographic transferability, but their operational reliability across diverse, unseen events remains uncharacterized. Here we deploy Prithvi-EO-2.0 across 19 out-of-distribution flood events (2017-2025) spanning six continents, eight climate zones, and six flood mechanisms, validating against two independent reference products. Detection accuracy depended jointly on land cover and flood type, with cropland yielding the highest agreement (IoU=52%) and riverine events the strongest detection (F1=0.69), while tree cover and built-up areas showed near-zero detection (IoU=4%) regardless of flood mechanism. Dual-reference validation revealed that apparent model error partly reflects definitional inconsistency between reference products rather than detection failure. Iterative pipeline testing identified 23 failure modes, with pipeline engineering dominating initial error over model capacity. These findings establish environment-dependent detection boundaries for operational satellite flood mapping.
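The agreement scores quoted above, IoU and F1, are both functions of the same pixel-level counts. The sketch below uses the standard definitions on flattened binary masks. The tiny masks are invented, and the function names are illustrative assumptions.

```python
def confusion_counts(pred, ref):
    """Count true positives, false positives and false negatives over pixels."""
    tp = sum(1 for p, r in zip(pred, ref) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(pred, ref) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(pred, ref) if p == 0 and r == 1)
    return tp, fp, fn


def iou(pred, ref):
    """Intersection over union of the predicted and reference flood masks."""
    tp, fp, fn = confusion_counts(pred, ref)
    return tp / (tp + fp + fn)


def f1(pred, ref):
    """Harmonic mean of precision and recall for the flood class."""
    tp, fp, fn = confusion_counts(pred, ref)
    return 2 * tp / (2 * tp + fp + fn)


if __name__ == "__main__":
    # Invented 4-pixel masks: 1 = flooded, 0 = not flooded.
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(iou(predicted, reference), f1(predicted, reference))
```

For a single mask the two metrics are tied by F1 = 2·IoU / (1 + IoU), so F1 is never lower than IoU. The abstract's IoU and F1 figures are reported for different groupings (land cover versus flood type), so they should not be converted into each other.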

2606.07798 2026-06-09 cs.AI cs.LG q-bio.NC 新提交

Reconstructing and forecasting disease trajectories of patients with Alzheimer's disease using routine data in resource-constrained settings

在资源受限环境中利用常规数据重建和预测阿尔茨海默病患者的疾病轨迹

Ratnadeep Das, Atri Chatterjee, Sitikantha Roy

发表机构 * Yardi School of Artificial Intelligence (ScAI), Indian Institute of Technology Delhi(印度理工学院德里分校亚迪人工智能学院) Department of Neurology, Vardhman Mahavir Medical College and Safdarjung Hospital(瓦尔丹·马哈维尔医学院和萨夫达戎医院神经内科) Department of Applied Mechanics, Indian Institute of Technology Delhi(印度理工学院德里分校应用力学系)

AI总结 提出GNOVA框架,结合GRU编码器和神经ODE解码器的变分自编码器,利用常规临床数据(无需神经影像或生物标志物)实现认知评分的双向预测、插值/外推及不确定性估计,在ADNI数据集上取得低误差。

详情
AI中文摘要

阿尔茨海默病是一种进行性神经退行性疾病,其进展在不同患者间差异显著。现有工作旨在预测患者未来的认知状态,但很少关注从既往就诊中重建状态。此外,当前研究中,量化预测不确定性仍未被充分探索,且依赖于MRI、PET和CSF等昂贵模态,限制了在资源有限环境中的部署。在本研究中,我们的主要目标是:第一,从不规则就诊中双向预测认知评分,以呈现完整的疾病轨迹;第二,实现插值和外推能力,以辅助临床医生做出知情预后决策;第三,为所有预测提供校准良好的不确定性估计;最后,利用常规就诊中可用的模态实现上述目标。我们提出了一个统一框架GNOVA:GRU-神经ODE变分自编码器。该架构在变分自编码器框架内结合了门控循环单元编码器和神经ODE解码器。在我们的工作中,我们预测了CDR-SB和MMSE评分。GRU编码器允许在任何时间点输入任意数量的数据。神经ODE解码器执行连续估计,允许在任何期望的时间点进行插值和外推。变分自编码器允许预测中的不确定性估计。我们使用了ADNI数据集中1727名患者超过10年的数据;该模型在无需任何神经影像或生物标志物数据的情况下,对CDR-SB和MMSE评分分别实现了1.35和2.28的平均绝对误差。特征消融研究表明,年龄、BMI和APOE4状态是强预测因子。所提出的框架能够重建不完整的患者病史并预测未来的认知状态。

英文摘要

Alzheimer's disease is a progressive neurodegenerative disorder, and its progression varies substantially across patients. Existing work aims to forecast patients' future cognitive state, with minimal focus on reconstructing the state from past visits. Furthermore, quantifying predictive uncertainty remains underexplored, and current approaches rely on costly modalities such as MRI, PET, and CSF, limiting their deployment in resource-limited settings. In this research, we pursue four objectives: first, bidirectional prediction of cognitive scores from irregular visits, to present the complete disease trajectory; second, interpolation and extrapolation capabilities that assist clinicians in informed prognostic decision-making; third, well-calibrated uncertainty estimates for all predictions; and fourth, achieving all of the above using only the modalities available during routine visits. We propose a unified framework, GNOVA: A GRU-Neural ODE Variational Autoencoder. The architecture combines a Gated Recurrent Unit encoder and a Neural ODE decoder within a variational autoencoder framework. In our work, we forecast the CDR-SB and MMSE scores. The GRU encoder accepts any number of inputs at any time point. The Neural ODE decoder performs continuous estimation, allowing interpolation and extrapolation at any desired time point. The variational autoencoder provides uncertainty estimates for the predictions. We worked with 1,727 patients from the ADNI dataset over 10 years; the model achieved mean absolute errors of 1.35 and 2.28 for CDR-SB and MMSE scores, respectively, without requiring any neuroimaging or biomarker data. Feature-ablation studies revealed that age, BMI, and APOE4 status were strong predictors. The proposed framework enables the reconstruction of incomplete patient histories and the anticipation of future cognitive states.

2606.07866 2026-06-09 cs.AI cs.MA 新提交

Overcoming the Regulatory Bottleneck via Agent-to-Agent Protocols: A Nuclear Case Study

通过智能体间协议克服监管瓶颈:以核能为例

Akshay J. Dave, David Grabaskas, Joseph A. Renevitz, Richard B. Vilim

发表机构 * Argonne National Laboratory(阿贡国家实验室) Idaho National Laboratory(爱达荷国家实验室)

AI总结 提出监管上下文协议(RCP),一种智能体间通信标准,将监管与申请方之间的人工流程转为结构化、可审计的智能体通道,在核反应堆审批中降低成本50-77%、缩短时间65%。

Comments 26 pages, 10 figures

详情
AI中文摘要

先进核反应堆设计的监管审查通常耗时超过三年,并消耗数亿美元的综合监管和申请方劳动力。我们提出了监管上下文协议(RCP),这是一种智能体间通信标准,用结构化、可审计的智能体通道取代监管机构和申请方之间正式的人工流程,同时在安全关键决策点保留人类监督。该协议基于对美国核监管委员会先进反应堆案卷中1,236份文件的分析进行校准,并通过一个工作中的多智能体试点进行演示。相对于8,900万美元、42个月的基准重建,RCP将成本降低50-77%(2,100万至4,400万美元),时间缩短65%(15个月)。在没有共享协议的情况下,独立智能体仅能达到5,400万至7,400万美元和21个月。剩余的成本和时间差距是结构性的,而非算法性的:它源于组织间的流程,只有智能体间标准才能压缩。同样的瓶颈——在严格的可审计性要求下进行正式的多方审查——也是药品审批、环境许可、金融监管和航空认证的特点。美国监管文书负担每年带来4,265亿美元的机会成本;如果广泛复制,预计50-77%的减少意味着每年节省约2,100亿至3,300亿美元——接近美国GDP的1%。

英文摘要

Regulatory review of advanced nuclear reactor designs routinely spans more than three years and consumes hundreds of millions of dollars in combined regulator and applicant labor. We present the Regulatory Context Protocol (RCP), an Agent-to-Agent communication standard that replaces the formal human-to-human pipeline between regulators and applicants with a structured, auditable agentic channel, while preserving human oversight at safety-significant decision points. The protocol is calibrated against an analysis of 1,236 documents from U.S. Nuclear Regulatory Commission advanced reactor dockets and demonstrated with a working multi-agent pilot. Against an 89M USD, 42-month Reconstructed Baseline, RCP cuts costs by 50-77 percent (21M-44M USD) and timelines by 65 percent (15 months). Without a shared protocol, Standalone Agents reach only 54M-74M USD and 21 months. The residual cost-and-time gap is structural, not algorithmic: it traces to the inter-organizational pipeline that only an agent-to-agent standard can compress. The same bottleneck - formal multi-party review under strict auditability requirements - characterizes pharmaceutical approvals, environmental permitting, financial supervision, and aviation certification. The US regulatory paperwork burden carries a 426.5 billion USD annual opportunity cost; replicated broadly, the projected 50-77 percent reduction implies savings on the order of 210-330 billion USD per year - approaching 1 percent of US GDP.
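The headline figures can be cross-checked with simple arithmetic. The sketch below applies the stated percentage reductions to the stated baselines. It assumes the percentages apply directly to the 89M USD, 42-month and 426.5 billion USD figures, and the reported values are presumably rounded, so agreement is only approximate.

```python
BASELINE_COST_MUSD = 89.0       # Reconstructed Baseline cost, million USD
BASELINE_MONTHS = 42.0          # Reconstructed Baseline timeline
PAPERWORK_BURDEN_BUSD = 426.5   # annual opportunity cost, billion USD


def after_reduction(baseline: float, fraction: float) -> float:
    """Value remaining after cutting `fraction` of the baseline."""
    return baseline * (1.0 - fraction)


# Cost after a 77% and a 50% cut: about 20.5 and 44.5 million USD,
# close to the reported 21M-44M USD range.
cost_range = (
    after_reduction(BASELINE_COST_MUSD, 0.77),
    after_reduction(BASELINE_COST_MUSD, 0.50),
)

# Timeline after a 65% cut: about 14.7 months, reported as 15 months.
months = after_reduction(BASELINE_MONTHS, 0.65)

# Projected savings if 50-77% of the paperwork burden were removed:
# about 213 to 328 billion USD, reported as 210-330 billion USD.
savings_range = (PAPERWORK_BURDEN_BUSD * 0.50, PAPERWORK_BURDEN_BUSD * 0.77)

if __name__ == "__main__":
    print(cost_range, months, savings_range)
```

The recomputed values land within about one unit of the reported ones, so the abstract's percentages and absolute figures are mutually consistent up to rounding.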

2606.08051 2026-06-09 cs.AI cs.LG 新提交

How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions

你能做到多小?面向金融交易中商户信息抽取的 270M-8B 模型 LoRA 微调

Donghao Huang, Tomas Drietomsky, Benjamin Barrett, Zhaoxia Wang

发表机构 * Singapore Management University(新加坡管理大学) Mastercard(万事达卡) A*STAR Centre for Frontier AI Research(新加坡科技研究局前沿人工智能研究中心)

AI总结 针对金融交易中从嘈杂银行字符串提取结构化商户信息的生产需求,系统评估 24 种模型变体,发现 Qwen 3.5 4B 在参数量减半下 F1 仅低 0.35 点,0.8B 模型匹配 2.5-4 倍大模型性能,且思维链微调提升有限。

Comments 9 pages, 5 figures, 5 tables. Submitted to the IEEE International Conference on Data Mining (ICDM) 2026

详情
AI中文摘要

金融交易处理需要从嘈杂、缩写的银行交易字符串中大规模提取结构化商户信息。我们当前的生产系统是 LoRA 微调的 LLaMA 3.1-8B,在该任务上达到了 96.95% 的 F1 分数,但部署 80 亿参数模型带来了高昂的内存、延迟和成本约束。为了识别更高效的替代方案,我们进行了一项以部署为中心的研究,涵盖四个模型家族的 24 种模型变体:Gemma 3(270M、1B、4B)、Qwen 3.5(0.8B、2B、4B)、Aya(3.35B)和 LLaMA 3.1-8B,系统评估了准确率、推理吞吐量、训练成本和硬件行为,以评估生产适用性。我们的发现表明:(1)使用 LoRA 秩为 8 复现 LLaMA 3.1-8B 微调达到 96.75% F1,仅比秩为 32 的基线低 0.20 个点;(2)仅使用 JSON 提示的 Qwen 3.5 4B 达到 96.60% F1,比 8B 基线低 0.35 个点,同时参数量大约减半;(3)0.8B 的 Qwen 3.5 模型达到 94.75% F1,与 2.5-4 倍大的模型性能相当,提供了有吸引力的延迟-准确率权衡;(4)思维链微调通常使大多数模型的 F1 提升 0.3-1.8 个点,尽管 Qwen 3.5 4B 在直接仅 JSON 提示下表现最佳;(5)Qwen 3.5 的 Think 和 Nothink 训练模板产生几乎相同的结果(F1 差异 <0.004),表明对于结构化抽取任务,显式推理监督是不必要的。我们进一步将所有 14 个微调后的子 8B 模型部署为 Databricks Model Serving 端点,并观察到基准性能可靠地迁移到生产环境,平均 F1 变化仅为 0.8 个点。基于 Cohere2 架构的 Aya 3.35B 是唯一的例外,在服务条件下 F1 下降了 3-5 个点。基于这些结果,我们提供了跨准确率和延迟需求的部署建议,……

英文摘要

Financial transaction processing requires extracting structured merchant information from noisy, abbreviated bank transaction strings at scale. Our current production system, a LoRA-fine-tuned LLaMA 3.1-8B, achieves 96.95% F1 on this task, but deploying 8-billion-parameter models imposes prohibitive memory, latency, and cost constraints. To identify more efficient alternatives, we conduct a deployment-focused study of 24 model variants spanning four model families: Gemma 3 (270M, 1B, 4B), Qwen 3.5 (0.8B, 2B, 4B), Aya (3.35B), and LLaMA 3.1-8B, systematically evaluating accuracy, inference throughput, training cost, and hardware behavior to assess production suitability. Our findings show that: (1) reproducing the LLaMA 3.1-8B fine-tune with a LoRA rank of 8 achieves 96.75% F1, only 0.20 points below the rank-32 baseline; (2) Qwen 3.5 4B with JSON-only prompting reaches 96.60% F1, within 0.35 points of the 8B baseline while using roughly half the parameters; (3) the 0.8B Qwen 3.5 model achieves 94.75% F1, matching models 2.5-4x larger and offering an attractive latency-accuracy trade-off; (4) chain-of-thought fine-tuning generally improves F1 by 0.3-1.8 points across most models, although Qwen 3.5 4B performs best with direct JSON-only prompting; and (5) Qwen 3.5 Think and Nothink training templates produce nearly identical results (F1 differences <0.004), indicating that explicit reasoning supervision is unnecessary for structured extraction tasks. We further deploy all 14 fine-tuned sub-8B models as Databricks Model Serving endpoints and observe that benchmark performance transfers reliably to production, with an average F1 change of only 0.8 points. Aya 3.35B, based on the Cohere2 architecture, is the sole exception, exhibiting a 3-5 point decline under serving conditions. Based on these results, we provide deployment recommendations across accuracy and latency requirements, ...
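The small gap between rank 8 and rank 32 is easier to appreciate next to the adapter sizes involved. The sketch below counts LoRA parameters for a single square projection, using the standard LoRA factorisation (two matrices of shapes `d_in x r` and `r x d_out`) and the 4096 hidden size of LLaMA 3.1-8B. Which matrices the authors actually adapted is not stated in the abstract, so the single-matrix setting is an assumption for illustration.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A is d_in x r, B is r x d_out."""
    return rank * (d_in + d_out)


HIDDEN = 4096  # hidden size of LLaMA 3.1-8B

full_matrix = HIDDEN * HIDDEN                 # 16,777,216 frozen weights
rank8 = lora_params(HIDDEN, HIDDEN, rank=8)   # 65,536 trainable weights
rank32 = lora_params(HIDDEN, HIDDEN, rank=32) # 262,144 trainable weights

if __name__ == "__main__":
    print(rank8, rank32, rank32 // rank8)
    print(rank8 / full_matrix, rank32 / full_matrix)
```

Adapter size grows linearly with rank, so rank 8 trains a quarter of the parameters of rank 32 per adapted matrix, which is the saving bought by the reported 0.20-point F1 difference.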

2606.08093 2026-06-09 cs.AI 新提交

A Multi-modal Agentic Co-pilot for Evidence Grounded Computational Pathology

面向证据基础计算病理学的多模态智能体协同助手

Zhe Xu, Zhengyu Zhang, Zhiyuan Cai, Jiahao Xu, Yijie Lin, Ziyi Liu, Junlin Hou, Hongyi Wang, Yuxiang Nie, Ling Liang, Yihui Wang, Yingxue Xu, Ronald Cheong Kin Chan, Li Liang, Hao Chen

发表机构 * Department of Computer Science and Engineering, Hong Kong University of Science and Technology(香港科技大学计算机科学与工程系) Department of Pathology, Nanfang Hospital, Southern Medical University(南方医科大学南方医院病理科) Department of Pathology, School of Basic Medical Sciences, Southern Medical University(南方医科大学基础医学院病理学系) Department of Anatomical and Cellular Pathology, Chinese University of Hong Kong(香港中文大学病理解剖及细胞学系) Guangdong Provincial Key Laboratory of Molecular Tumor Pathology(广东省分子肿瘤病理学重点实验室) Jinfeng Laboratory(金凤实验室) Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology(香港科技大学化学及生物工程学系) Division of Life Science, Hong Kong University of Science and Technology(香港科技大学生命科学部) State Key Laboratory of Nervous System Disorders, The Hong Kong University of Science and Technology(香港科技大学神经系统疾病国家重点实验室) HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, The Hong Kong University of Science and Technology(香港科技大学深港协同创新研究院)

AI总结 提出PathPocket,一种多模态AI协同助手,通过构建包含11万文档的病理证据语料库和455万实体的超图,实现基于证据的病理诊断,在20万真实案例上超越现有方法。

详情
AI中文摘要

病理学是现代医学的基石,准确的决策高度依赖于循证实践。虽然人工智能有潜力改变临床工作流程,但AI与循证医学的结合仍未被充分探索,现有的初步尝试仅限于纯文本的通用医学。在这项工作中,我们提出了PathPocket,一种专门为证据基础病理学设计的多模态AI智能体协同助手。我们构建了迄今为止最全面的病理证据语料库,包含约110,472份公开和授权文档,这些文档按照从临床指南到专家意见的严格证据层级进行结构化组织。在这个精心分级的基础上,我们构建了一个大规模多模态病理超图,包含超过455万个实体和710万个关系。作为强大的知识引擎,该超图为协作式多智能体推理框架提供了可追溯的证据,该框架集成了输入理解、证据检索、过滤和诊断生成。这使得PathPocket能够无缝解决广泛的临床任务,从纯文本查询到涉及感兴趣区域和千兆像素全切片图像的复杂多模态诊断。我们在一个包含超过20万真实案例的多维基准测试上严格评估了该系统,其性能显著优于现有最先进方法。至关重要的是,广泛的用户研究表明,PathPocket显著提高了病理学家的诊断准确性和信心。通过将病理学解释直接基于可验证的文献,PathPocket为未来证据基础的计算病理学提供了实用且可扩展的解决方案。

英文摘要

Pathology is the cornerstone of modern medicine, where accurate decision-making relies heavily on evidence-based practices. While artificial intelligence (AI) has the potential to transform clinical workflows, the intersection of AI and evidence-based medicine remains under-explored, with early attempts restricted to text-only general medicine. In this work, we present PathPocket, a multimodal AI agentic co-pilot designed specifically for evidence-grounded pathology. We construct the most comprehensive pathology evidence corpus to date, encompassing approximately 110,472 public and authorized documents structured across a rigorous hierarchy of evidence, from clinical guidelines to expert opinion. From this meticulously graded foundation, we build a large-scale multimodal pathology hypergraph containing over 4.55 million entities and 7.10 million relations. Serving as a robust knowledge engine, this hypergraph provides traceable evidence for a collaborative multi-agent reasoning framework integrating input understanding, evidence retrieval, filtering, and diagnosis generation. This enables PathPocket to seamlessly resolve a wide spectrum of clinical tasks, ranging from text-only queries to complex multimodal diagnostics involving regions of interest (ROIs) and gigapixel whole-slide images (WSIs). We rigorously evaluate the system on a multidimensional benchmark of over 200,000 real-world cases, where it significantly outperforms existing state-of-the-art methods. Crucially, extensive user studies demonstrate that PathPocket substantially improves the diagnostic accuracy and confidence of pathologists. By directly grounding pathology interpretations in verifiable literature, PathPocket offers a practical and scalable solution for the future of evidence-grounded computational pathology.

2606.08146 2026-06-09 cs.AI 新提交

SAGE: An LLM-driven Self Reflective Agentic Framework for Fraud Detection

SAGE: 一种LLM驱动的自我反思智能体框架用于欺诈检测

Yichen Chen, Siying Li, Yuhang Liang, Lijun Wang, Renyang Liu

发表机构 * National University of Singapore(新加坡国立大学) University of Chinese Academy of Sciences(中国科学院大学) China Mobile Communications Group(中国移动通信集团有限公司)

AI总结 提出SAGE,首个端到端LLM驱动的多智能体欺诈检测框架,通过数据诊断树和自然语言梯度优化,在五个数据集上平均F1提升40.86%。

详情
AI中文摘要

支付、电子商务和电信系统中的欺诈检测需要在个体层面准确、在严重类别不平衡下鲁棒,并且易于风险管理者理解。现有方法至少缺乏这些要求之一:自动化机器学习系统在固定数值空间中搜索,缺乏对数据集的语义感知;基于图神经网络的方法需要预定义的关系图,在个体决策层面仍然不透明;通用大语言模型(LLM)智能体的设计未考虑现实欺诈检测中的召回率和精确率约束。在本文中,我们提出SAGE,首个端到端LLM驱动的多智能体欺诈检测框架。SAGE协调三个专用智能体,基于六层数据诊断树(DDT)和由自然语言梯度引导的马尔可夫决策过程做出决策,在欺诈特定奖励下自动优化模型。在五个欺诈数据集和五个LLM骨干网络上,SAGE在96.00%的方法-数据集比较中获胜,平均F1比基线提升40.86%。代码可在https://github.com/yichenC1c/SAGE获取。

英文摘要

Fraud detection in payment, e-commerce, and telecommunications systems requires accuracy at the individual level, robustness under severe class imbalance, and ease of understanding for risk managers. Existing methods fail to meet at least one of these requirements: automated machine learning systems search a fixed numerical space without semantic awareness of the dataset; graph neural network-based methods require pre-defined relational graphs and remain opaque at the individual-decision level; and general-purpose large language model (LLM) agents are not designed around the recall and precision constraints specific to real-world fraud detection. In this paper, we propose SAGE, the first end-to-end LLM-driven multi-agent framework for fraud detection. SAGE coordinates three dedicated agents that make decisions based on a six-layer Data Diagnostic Tree (DDT) and a Markov decision process guided by natural-language gradients, automatically optimizing the model under a fraud-specific reward. On five fraud datasets and five LLM backbones, SAGE wins 96.00% of method-dataset comparisons and improves F1 by an average of 40.86% over baselines. The code is available at https://github.com/yichenC1c/SAGE.

2606.08311 2026-06-09 cs.AI 新提交

Curation of a Cardiology Interface Terminology for Highlighting Electronic Health Records using Machine Learning

利用机器学习构建心脏病学接口术语以突出电子健康记录

Mahshad Koohi Habibi Dehkordi, Shuxin Zhou, Yehoshua Perl, Fadi P. Deek, James Geller, Gai Elhanan, Andrew J. Einstein, Luke Lindemann, Vipina K. Keloth

发表机构 * Department of Computer Science, New Jersey Institute of Technology(新泽西理工学院计算机科学系) Department of Computer Science, St. Francis College(圣弗朗西斯学院计算机科学系) Department of Informatics, New Jersey Institute of Technology(新泽西理工学院信息学系) Department of Data Science, New Jersey Institute of Technology(新泽西理工学院数据科学系) Center for Genomic Medicine, School of Medicine, University of Nevada(内华达大学医学院基因组医学中心) Department of Medicine, Cardiology Division, Columbia University Irving Medical Center(哥伦比亚大学欧文医学中心内科心脏病学部) Advanced Metrics Laboratory, School of Medicine and Health Sciences, George Washington University(乔治华盛顿大学医学与健康科学学院高级指标实验室) Department of Biomedical Informatics and Data Science, Yale University(耶鲁大学生物医学信息学与数据科学系)

AI总结 提出基于机器学习的心脏病学接口术语(CIT)设计方法,通过半自动构建训练数据并训练模型,实现对电子健康记录中关键信息的高亮,覆盖率达74.21%。

详情
AI中文摘要

电子健康记录(EHR)笔记是密集的医学文档,包含大量信息,通常充满复杂的医学术语。高亮EHR中的所有细节有助于通过吸引对关键内容的注意力来减少遗漏重要信息的可能性。本研究提出设计一种心脏病学接口术语(CIT),以准确高亮心脏病患者EHR笔记中的所有细节。我们引入一种创新的机器学习(ML)技术用于CIT的设计。ML技术需要训练数据。手动准备此类训练数据耗时且昂贵。CIT设计过程包括三个阶段。在前两个阶段中,我们创新性地推导出一个训练数据CIT,供第三阶段的ML技术使用。我们首先设计初始CIT,由几个部分组成:SNOMED的心脏病学子层次、从构建集的EHR中挖掘的其他SNOMED概念,以及术语的必要组成部分(如医学缩写和药物)。利用迭代过程,从构建集中提取包含初始CIT概念的细粒度短语作为CIT概念候选。候选概念在半自动审查后添加到CIT中,得到训练数据CIT(TCIT)。在第三阶段,使用TCIT训练ML模型,以识别适合作为CIT概念的概念。该模型用于从构建集中提取更多概念,得到最终CIT。然后使用最终CIT高亮测试集,并评估其捕获未见EHR数据集中细节的程度。为此,使用了四个评估指标:覆盖率、广度、完整性和简洁性。高亮测试集的覆盖率为74.21%,广度为1.68。对于测试集中的20个随机笔记,平均完整性为98.2%,平均简洁性为84.2%。

英文摘要

Electronic health record (EHR) notes are dense medical documents containing large amounts of information, often filled with complex medical jargon. Highlighting all details in EHRs helps reduce the likelihood of missing crucial information by drawing attention to key content. This study proposes the design of a Cardiology Interface Terminology (CIT) to accurately highlight all details in the EHR notes of cardiology patients. We introduce an innovative Machine Learning (ML) technique for the design of the CIT. The ML technique requires training data, and manual preparation of such training data is time-consuming and expensive. The CIT design process has three phases. In the first two phases, we derive a training-data CIT to be used by the ML technique in the third phase. We start by designing an initial CIT composed of several components: the cardiology-related sub-hierarchies of SNOMED, other SNOMED concepts mined from the EHRs of the build set, and necessary components of the terminology, e.g., medical abbreviations and medications. Using an iterative process, fine-grained phrases containing initial CIT concepts are extracted from the build set as CIT concept candidates. The candidate concepts are semi-automatically reviewed before being added to the CIT, yielding the training-data CIT (TCIT). In the third phase, an ML model is trained with the TCIT to identify candidates suitable to be concepts in the CIT. This model is used to extract further concepts from the build set, yielding the final CIT. The final CIT is then used to highlight the test set and evaluate the extent to which it captures details in an unseen EHR dataset. For this purpose, four evaluation metrics are used: coverage, breadth, completeness, and conciseness. The highlighted test set has a coverage of 74.21%, with a breadth of 1.68. For 20 random notes in the test set, the average completeness is 98.2% and the average conciseness is 84.2%.

2606.08314 2026-06-09 cs.AI 新提交

Integrating Deep Learning Demand Forecasting with Multi-Objective Optimization for Circular Coffee Supply Chains: A Data-Driven Framework for Cost, Emissions, and Freshness Management

集成深度学习需求预测与多目标优化的循环咖啡供应链:面向成本、排放和新鲜度管理的数据驱动框架

Gerçek Budak, Faraz Gholamzadeh Gharehgheshlaghi, Melika Barjesteh Vaezi, Ahmad Gholizadeh Lonbar

发表机构 * Ankara Yıldırım Beyazıt University(安卡拉耶尔德勒姆贝亚泽特大学) Texas Tech University(德克萨斯理工大学) University of Alabama(阿拉巴马大学)

AI总结 提出两阶段框架,先用CNN-LSTM模型预测需求(MAE=22.87,R²=0.90),再通过三目标MILP模型优化成本、碳排放和新鲜度,在循环供应链中获得25个Pareto解,平衡政策可减排22.4%仅增成本9.9%。

详情
AI中文摘要

咖啡供应链是最复杂的农产品网络之一,具有地理分散生产、多层协调以及对质量和新鲜度高度敏感的特点。尽管可持续性和数字化已受到关注,但需求预测、优化和可追溯性通常被分开处理。本研究提出了一个两阶段集成框架。首先,使用混合CNN-LSTM模型进行需求预测。在公开的Coffee Chain Sales数据集上,按时间顺序70/15/15划分,模型实现了MAE为22.87、R²为0.90,优于最佳深度学习基准约12%,优于经典方法超过30%。第二阶段,预测的需求输入一个三目标混合整数线性规划(MILP)模型,该模型在具有循环回收的多周期、多模式、闭环供应链中同时最小化成本、最小化碳排放和最大化产品新鲜度。新鲜度通过基于库存年龄的指数衰减建模。使用epsilon-约束方法,获得了25个Pareto解。敏感性和政策分析表明,平衡的可持续性政策可以在仅增加9.9%成本的情况下减少22.4%的排放,同时保持接近最优的新鲜度。

英文摘要

The coffee supply chain is one of the most complex agri-food networks, marked by geographically dispersed production, multi-tier coordination, and high sensitivity to quality and freshness. While sustainability and digitalization have gained attention, demand forecasting, optimization, and traceability are often treated separately. This study presents a two-phase integrated framework. First, a hybrid CNN-LSTM model is used for demand forecasting. On the public Coffee Chain Sales dataset with chronological 70/15/15 splitting, the model achieves MAE of 22.87 and R^2 of 0.90, outperforming the best deep learning benchmark by ~12% and classical methods by over 30%. In the second phase, the forecasted demand feeds a tri-objective mixed-integer linear programming (MILP) model that jointly minimizes cost, minimizes carbon emissions, and maximizes product freshness in a multi-period, multimodal, closed-loop supply chain with circular recovery. Freshness is modeled via exponential decay based on inventory age. Using the epsilon-constraint method, 25 Pareto solutions are obtained. Sensitivity and policy analyses show that balanced sustainability policies can reduce emissions by 22.4% with only a 9.9% cost increase while maintaining near-optimal freshness. Keywords: Coffee supply chain; Deep learning; Demand forecasting; Multi-objective optimization; Circular economy; CNN-LSTM; Mixed-integer linear programming.
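
The abstract states that freshness is modeled via exponential decay based on inventory age. A minimal sketch of that idea follows; the functional form exp(-decay_rate * age), the `decay_rate` value, and the batch representation are illustrative assumptions, not the paper's actual formulation or parameters.

```python
import math


def freshness(age_days: float, decay_rate: float = 0.05) -> float:
    """Freshness score in (0, 1] that decays exponentially with inventory age.

    decay_rate is a hypothetical per-day constant chosen for illustration.
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return math.exp(-decay_rate * age_days)


def average_freshness(batches, decay_rate: float = 0.05) -> float:
    """Quantity-weighted mean freshness over (quantity, age_days) batches."""
    total_quantity = sum(quantity for quantity, _ in batches)
    if total_quantity == 0:
        return 0.0
    weighted = sum(quantity * freshness(age, decay_rate) for quantity, age in batches)
    return weighted / total_quantity


if __name__ == "__main__":
    # Two hypothetical batches: 100 units roasted today, 50 units held for 14 days.
    print(average_freshness([(100, 0), (50, 14)]))
```

In a MILP, a term like this would typically be precomputed per age bucket so that the objective stays linear in the shipment variables; whether the paper does so is not stated in the abstract.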

2606.08379 2026-06-09 cs.AI cs.CE cs.LG q-fin.CP q-fin.TR 新提交

TT-DAC-PS: Twin-Target Deterministic Actor-Critic with Policy Smoothing for Optimal Trade Execution

TT-DAC-PS:用于最优交易执行的双目标确定性演员-评论家与策略平滑

Ilia Zaznov, Atta Badii, Julian Kunkel, Alfonso Dufour

发表机构 * University of Reading(雷丁大学) University of Göttingen(哥廷根大学) GWDG(哥廷根数据处理中心) Henley Business School(亨利商学院)

AI总结 提出TT-DAC-PS算法,结合双指数移动平均评论家目标、悲观最小备份、TD3风格策略平滑噪声、延迟演员更新和保守Q正则化,以抑制过高估计,并在限价订单簿数据上优于经典和强化学习基线。

Comments 21 pages, 1 figure, 3 tables

详情
AI中文摘要

本研究通过引入TT-DAC-PS(双目标确定性演员-评论家与策略平滑),解决了大规模股票卖单的最优执行问题。该确定性演员-评论家架构结合了双指数移动平均评论家目标与悲观最小备份、TD3风格的目标策略平滑噪声、延迟演员更新以及保守Q正则化,以抑制过高估计。探索使用Ornstein-Uhlenbeck(OU)噪声,并采用混合调度:确定性回合衰减、基于近期奖励离散度的方差引导调整,以及一个可学习并映射到噪声尺度的Soft Actor-Critic(SAC)风格温度。环境整合了Almgren-Chriss(AC)交易影响与限价订单簿(LOB)价格和成交量、归一化状态特征、每步成交量参与上限以及基于效用的奖励。该交易执行算法应用于十只美国股票的LOB数据。性能评估针对强化学习基线算法,包括近端策略优化(PPO)、软演员-评论家(SAC)和优势演员-评论家(A2C),以及替代交易执行算法,包括时间加权平均价格(TWAP)、成交量加权平均价格(VWAP)和AC。所提出的模型持续降低平均实现缺口百分比,并具有竞争性的方差,优于经典基线和标准强化学习基准模型。

英文摘要

This study addresses the optimal execution of large stock sell programs by introducing TT-DAC-PS (Twin-Target Deterministic Actor-Critic with Policy Smoothing), a deterministic actor-critic architecture that combines twin exponential-moving-average critic targets with pessimistic min backup, TD3-style target policy smoothing noise, delayed actor updates, and conservative Q regularisation to curb overestimation. Exploration uses Ornstein-Uhlenbeck (OU) noise with a hybrid schedule: deterministic episode-wise decay, variance-guided adjustment based on recent reward dispersion, and a Soft Actor-Critic (SAC)-style temperature that is learned and mapped to the noise scale. The environment integrates Almgren-Chriss (AC) trade impact with Limit Order Book (LOB) prices and volumes, normalised state features, per-step volume participation caps, and a utility-based reward. The trade execution algorithm is applied to LOB data for ten U.S. stocks. Performance is assessed against reinforcement-learning baseline algorithms, including Proximal Policy Optimisation (PPO), Soft Actor-Critic (SAC), and Advantage Actor-Critic (A2C), as well as alternative trade execution algorithms, including Time-Weighted Average Price (TWAP), Volume-Weighted Average Price (VWAP), and AC. The proposed model consistently reduces mean implementation shortfall percentage with competitive variance, outperforming classical baselines and standard reinforcement-learning benchmark models.
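
The combination of twin critic targets, a pessimistic min backup, and TD3-style target policy smoothing can be sketched as a single target computation. This is the generic TD3-style rule, not the authors' code: the toy actor and critics, the action range [0, 1] (read here as a per-step sell fraction), and all hyperparameter values are assumptions made for illustration.

```python
import numpy as np


def twin_min_target(reward, next_state, done, actor_target, critic1_target,
                    critic2_target, rng, gamma=0.99, noise_std=0.2,
                    noise_clip=0.5, a_low=0.0, a_high=1.0):
    """Bootstrapped critic target with smoothing noise and a pessimistic min backup."""
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    noise = np.clip(rng.normal(0.0, noise_std), -noise_clip, noise_clip)
    next_action = np.clip(actor_target(next_state) + noise, a_low, a_high)
    # Pessimistic backup: take the smaller of the two target-critic estimates.
    q_next = min(critic1_target(next_state, next_action),
                 critic2_target(next_state, next_action))
    return float(reward + gamma * (1.0 - done) * q_next)


# Toy stand-ins for the slowly updated (EMA) target networks; purely hypothetical.
def toy_actor(state):
    return 0.5


def toy_q1(state, action):
    return 1.0 + action


def toy_q2(state, action):
    return 2.0 - action


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(twin_min_target(1.0, None, 0.0, toy_actor, toy_q1, toy_q2, rng))
```

Taking the minimum of two critics biases the target downward, which is the standard way this family of methods counters overestimation; delayed actor updates and the conservative Q regulariser mentioned in the abstract are not shown here.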

2606.08450 2026-06-09 cs.AI 新提交

GIFT: LLM-Guided State-Reward Interface for Financial Reinforcement Learning

GIFT: 基于LLM引导的状态-奖励接口用于金融强化学习

Yanyan Wu, Boyi Zhang, Yanlin Liu, Xinyu Fang, Jining Luan, Meiqi Zhang, Jiacheng Liu, Hao Zeng, Dexu Yu, Chang Liu, Hanwen Du, Yongxin Ni, Youhua Li

发表机构 * East China University of Science and Technology(华东理工大学) University of Science and Technology of China(中国科学技术大学) Southwestern University of Finance and Economics(西南财经大学) University of Sydney(悉尼大学) City University of Hong Kong(香港城市大学) Northeastern University(东北大学) The Ohio State University(俄亥俄州立大学) National University of Singapore(新加坡国立大学)

AI总结 提出GIFT框架,利用大语言模型引导PPO强化学习中的状态增强和奖励塑造,提升金融交易策略的样本外风险调整收益。

Comments 25 pages, 7 figures. Code and data are available at https://github.com/KAG778/GIFT . Equal contribution: Yanyan Wu and Boyi Zhang. Corresponding author: Youhua Li

详情
AI中文摘要

金融投资组合交易自然被表述为一个强化学习问题,其中智能体在不断变化的市场条件下顺序调整资产以平衡收益、风险和交易成本。然而,在非平稳市场中,原始的OHLCV状态和短视的回报奖励往往提供了一个不充分的学习接口,这促使使用大语言模型将金融知识注入状态和奖励设计,同时限制开放式的生成。为此,我们提出GIFT,一个基于LLM引导的框架,用于基于PPO的金融强化学习中的状态-奖励接口设计。GIFT不是使用LLM做出交易决策,而是使用因子引导的状态增强从金融因子基元生成状态特征,使用风险规则引导的奖励塑造从投资组合风险规则生成辅助奖励,并使用诊断引导的细化通过PPO rollout诊断修订候选接口。细化后,GIFT在评估前固定所选的状态-奖励接口,在测试时不再进行LLM查询或接口更新。跨不同市场制度和投资组合场景的综合滚动窗口实验表明,GIFT相比基线提高了学习信号质量和样本外风险调整后的投资组合性能。代码和数据可在 https://github.com/KAG778/GIFT 获取。

英文摘要

Financial portfolio trading is naturally formulated as a reinforcement learning problem, where an agent sequentially rebalances assets under changing market conditions to balance return, risk, and transaction costs. Yet in non-stationary markets, raw OHLCV states and short-horizon return rewards often provide an under-specified learning interface, motivating large language models as a way to inject financial knowledge into state and reward design while constraining open-ended generation. To this end, we propose GIFT, an LLM-guided framework for state-reward interface design in PPO-based financial reinforcement learning. Rather than using the LLM to make trading decisions, GIFT uses Factor-guided State Enhancement to generate state features from financial-factor primitives, Risk-rule-guided Reward Shaping to generate auxiliary rewards from portfolio-risk rules, and Diagnostic-guided Refinement to revise candidate interfaces using PPO rollout diagnostics. After refinement, GIFT fixes the selected state-reward interface before evaluation, with no further LLM queries or interface updates at test time. Comprehensive rolling-window experiments across diverse market regimes and portfolio scenarios demonstrate that GIFT improves learning-signal quality and out-of-sample risk-adjusted portfolio performance over baselines. Code and data are available at: https://github.com/KAG778/GIFT .

2606.08633 2026-06-09 cs.AI cs.LG 新提交

Towards Long-Horizon Vessel Trajectory and Destination Forecasting with Reasoning Large Language Models

面向长时域船舶轨迹与目的地预测的推理型大语言模型

Hongwei Wang, Miao Zhou, Fengde Wang, Yuting Wang, Jiewen Yu, Jun-Yan He, Bohao Qu, Wanbing Zhang, Xiuju Fu, Qing Guo, Zipei Fan, Yingying Xing, Yi Yuan

发表机构 * Institute of High Performance Computing (IHPC), A*STAR, Singapore(新加坡科技研究局高性能计算研究所) The Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University(同济大学道路与交通工程教育部重点实验室) Meituan Inc., Shenzhen, China(美团(深圳)) Centre for Frontier AI Research (CFAR), A*STAR, Singapore(新加坡科技研究局前沿人工智能研究中心) Nankai University(南开大学) School of Artificial Intelligence, Jilin University(吉林大学人工智能学院)

AI总结 提出基于可验证奖励强化学习(RLVR)的Maritime LLM后训练框架,将轨迹转化为语义文本,通过物理有效性约束和层次匹配提升长时域(30天)预测精度,4B模型表现最优。

Comments The IEEE International Conference on Intelligent Transportation Systems (ITSC) 2026, Naples, Italy

详情
AI中文摘要

长时域海上轨迹预测对航运管理、物流规划和海上风险分析至关重要,但月度级别的预测仍研究不足。现有深度学习方法主要关注短期和中期坐标外推,在长时间跨度下往往难以保持路线可行性和目的地正确性。本文研究了利用具备推理能力的大语言模型进行联合长时域船舶轨迹和目的地预测,并基于可验证奖励强化学习(RLVR)开发了Maritime LLM后训练框架。构建了一个基于AIS的基准数据集,包含60天历史轨迹和30天预测范围,其中轨迹被转换为语义文本表示用于RL提示构建。RLVR通过强制执行物理有效性、提供早期加权轨迹监督以及通过层次匹配和课程学习评估目的地正确性,使LLM与海上预测目标对齐。实验结果表明,RLVR训练的LLM在零样本LLM和代表性深度学习基线方法上均有显著提升,尤其在目的地相关指标上。在评估的RLVR训练变体中,4B LLM实现了最佳整体性能,表明奖励兼容优化和任务特定容量匹配比单纯使用更大的8B或14B LLM更为重要。结果还显示,在有限的微调数据下,LSTM仍然是一个强大的深度学习基线,而Transformer风格的时空模型通常需要更大的数据集和更丰富的结构化输入。总体而言,这项工作推进了用于运营决策支持的语义化、验证器对齐的海上预测。

英文摘要

Long-horizon maritime trajectory prediction is important for shipping management, logistics planning, and maritime risk analysis, yet month-level forecasting remains insufficiently studied. Existing deep learning methods mainly focus on short- and mid-term coordinate extrapolation and often struggle to preserve route feasibility and destination correctness over extended horizons. This paper investigates joint long-horizon vessel trajectory and destination forecasting with reasoning-capable large language models, and develops a Maritime LLM post-training framework based on Reinforcement Learning with Verifiable Reward (RLVR). An AIS-based benchmark is constructed with 60-day historical trajectories and 30-day forecasting horizons, where trajectories are converted into semantic textual representations for RL prompt construction. RLVR aligns LLMs with maritime forecasting objectives by enforcing physical validity, providing early-weighted trajectory supervision, and evaluating destination correctness through hierarchical matching and curriculum learning. Experimental results show that RLVR-trained LLMs substantially improve over zero-shot LLMs and representative deep learning baselines, especially on destination-related metrics. Among the evaluated RLVR-trained variants, 4B LLMs achieve the best overall performance, suggesting that reward-compatible optimization and task-specific capacity matching are more important than simply using larger 8B or 14B LLMs. The results also show that LSTM remains a strong deep learning baseline under limited fine-tuning data, while Transformer-style spatio-temporal models typically require larger datasets and richer structured inputs. Overall, this work advances semantic, verifier-aligned maritime forecasting for operational decision support.
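
To make the notion of a verifiable reward concrete, the sketch below combines a physical-validity gate with early-weighted trajectory scoring. It is only one plausible reading of the abstract: the speed limit, the geometric weights, the distance scale, and the function names are all hypothetical, and the paper's actual reward (including hierarchical destination matching and the curriculum) is not reproduced.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def trajectory_reward(pred, truth, step_hours=24.0, max_speed_kmh=60.0,
                      decay=0.9, scale_km=500.0):
    """Return a reward in [0, 1]; all thresholds are illustrative assumptions."""
    if not pred or len(pred) != len(truth):
        return 0.0
    # Physical-validity gate: reject trajectories that imply impossible speeds.
    for a, b in zip(pred, pred[1:]):
        if haversine_km(a, b) / step_hours > max_speed_kmh:
            return 0.0
    # Early-weighted supervision: earlier steps receive larger weights.
    weights = [decay ** i for i in range(len(pred))]
    scores = [math.exp(-haversine_km(p, t) / scale_km) for p, t in zip(pred, truth)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


if __name__ == "__main__":
    truth = [(1.0, 103.8), (2.0, 104.5), (3.0, 105.0)]
    print(trajectory_reward(truth, truth))
```

Because every term is computed from coordinates alone, a reward of this kind can be checked automatically, which is the property that makes it usable for reinforcement learning without a learned reward model.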

2606.08849 2026-06-09 cs.AI 新提交

A Resilience-as-a-Service assessment framework for coordinated disruption response in interdependent urban transit systems

面向相互依赖城市公交系统协同中断响应的弹性即服务评估框架

Sara Jaber, S. M. Hassan Mahdavi, Neila Bhouri, Mostafa Ameli

发表机构 * Univ. Gustave Eiffel, COSYS, GRETTIA, Paris, France(古斯塔夫·埃菲尔大学,交通系统、网络与安全实验室,交通工程与智能交通系统研究组,法国巴黎) VEDECOM, mobiLAB, Department of Human factors and Economics of Sustainable Mobility, Versailles, France(VEDECOM研究所,移动出行实验室,可持续出行人因与经济系,法国凡尔赛)

AI总结 提出一个基于KPI的时间索引框架,结合优化模型与智能体仿真,从脆弱性、适应性、鲁棒性等多维度评估城市交通中断响应方案的弹性,并通过巴黎RER B线案例验证了协同策略的优越性。

详情
AI中文摘要

城市公共交通中断需要快速响应策略,然而现有研究很少提供一个决策支持框架,使用一组通用的动态、乘客、运营商和环境导向指标来比较替代的中断响应解决方案。本文提出了一个KPI驱动的、时间索引的框架,用于评估城市交通系统中中断响应方案的弹性。该框架将优化模型与基于智能体仿真的行为评估相结合。它还考虑了当在途车辆被撤回以支持中断走廊时,辅助线路上的二次服务退化。该框架不将弹性视为单一分数,而是评估互补维度,包括脆弱性、适应性、鲁棒性、弹性损失、响应性、基于成本的性能、排放和公平性。该框架在法兰西岛(巴黎)网络的RER B交通线上实施。结果表明,协同策略提供了最平衡的弹性曲线,与单一模式替代方案相比,结合了高服务连续性和较低的总中断成本,同时提高了公平性并保持了有竞争力的环境性能。敏感性分析进一步确定了协同多模式响应最有价值的中断条件。

英文摘要

Urban public transport disruptions require rapid response strategies, yet existing studies rarely provide a decision-support framework to compare alternative disruption response solutions using a common set of dynamic, passenger-, operator-, and environment-oriented indicators. This paper proposes a KPI-driven, time-indexed framework to assess the resilience of disruption response solutions in urban transit systems. The framework combines an optimization model with a behavioral evaluation in agent-based simulation. It also accounts for the secondary service degradation induced on helper lines when in-service vehicles are withdrawn to support the disrupted corridor. Rather than treating resilience as a single score, it evaluates complementary dimensions including vulnerability, adaptability, robustness, resilience loss, responsiveness, cost-based performance, emissions, and equity. The framework is implemented for the RER B transit line in the Ile-de-France (Paris) network. Results show that the coordinated strategy provides the most balanced resilience profile, combining high service continuity with lower total disruption cost than single-mode alternatives, while also improving equity and maintaining competitive environmental performance. Sensitivity analysis further identifies the disruption conditions under which coordinated multimodal response is most valuable.
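
One of the listed dimensions, resilience loss, is commonly computed as the area between nominal and observed performance over the disruption window. The sketch below uses that common definition with a trapezoidal rule; the abstract does not give the paper's formula, and the performance series for the two strategies are invented for illustration.

```python
def resilience_loss(times, performance, nominal=1.0):
    """Area between nominal and observed performance (trapezoidal rule).

    times: increasing time stamps; performance: service level at each stamp,
    expressed on the same scale as nominal (here a fraction of normal service).
    """
    if len(times) != len(performance):
        raise ValueError("times and performance must have the same length")
    loss = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        gap_prev = max(nominal - performance[i - 1], 0.0)
        gap_curr = max(nominal - performance[i], 0.0)
        loss += 0.5 * (gap_prev + gap_curr) * dt
    return loss


if __name__ == "__main__":
    hours = [0, 1, 2, 3]
    # Hypothetical service levels under two response strategies.
    single_mode = [1.0, 0.5, 0.5, 1.0]
    coordinated = [1.0, 0.8, 0.9, 1.0]
    print(resilience_loss(hours, single_mode), resilience_loss(hours, coordinated))
```

A time-indexed KPI of this kind rewards both a shallower performance drop and a faster recovery, which is why it complements point indicators such as robustness or responsiveness.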

2606.08855 2026-06-09 cs.AI cs.CV cs.CY 新提交

Hybrid E-Assessment in Higher Education: Semi-Automated Grading of Paper-Based Written Examinations

高等教育中的混合电子评估:纸质笔试的半自动评分

Hartwig Grabowski, Michael Canz

发表机构 * Institute for Machine Learning and Analytics, Hochschule Offenburg(奥芬堡应用技术大学机器学习与分析研究所) Hochschule Offenburg(奥芬堡应用技术大学)

AI总结 针对完全数字化和部分数字化电子评估在总结性考试中的局限性,提出混合电子评估方法,保留纸质问题导向任务,通过结构化答案格式和手写字符识别实现半自动评分,结合视觉大语言模型和两遍验证提升评估有效性、公平性和可扩展性。

Comments 15 pages, 6 figures

详情
AI中文摘要

本文考察了完全数字化和部分数字化电子评估方法在高等教育总结性考试中的局限性。分析聚焦于封闭式问题格式导致的教学狭窄化,以及在大规模学生群体中尤为突出的组织、技术和法律约束。作为替代方案,本文提出了一种混合电子评估方法,该方法保留纸质、问题导向的考试任务,同时实现半自动评分。评估相关的中间结果以结构化答案格式编码,由学生手写输入,随后从表格字段中捕获。核心的技术瓶颈是在现实考试条件下可靠识别手写字符。最近的视觉大语言模型,结合两遍验证原则和与标准答案的比对,可以减少误分类,从而提高总结性评估的有效性、公平性和可扩展性。

英文摘要

This paper examines the limitations of fully digital and partially digital e-assessment approaches in summative examinations in higher education. The analysis focuses on the didactic narrowing caused by closed question formats and on organizational, technical, and legal constraints that become particularly relevant in large student cohorts. As an alternative, the paper proposes a hybrid e-assessment approach that retains paper-based, problem-oriented examination tasks while enabling semi-automated grading. Assessment-relevant intermediate results are encoded in a structured answer format, entered by students by hand, and subsequently captured from table fields. The central technical bottleneck is reliable recognition of handwritten characters under realistic examination conditions. Recent vision-capable large language models, combined with a two-pass validation principle and comparison against a solution key, can reduce misclassifications and thereby improve the validity, fairness, and scalability of summative assessment.
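
The two-pass validation principle combined with a solution key can be pictured as a small decision rule per answer field. The sketch below is an assumed reading of the abstract, not the authors' pipeline: a field is graded automatically only when two independent recognition passes agree, and everything else is routed to a human grader. The function name, the status labels, and the normalisation step are hypothetical.

```python
def grade_field(first_pass: str, second_pass: str, solution_key: str) -> str:
    """Grade one handwritten table field from two independent recognition passes."""
    def normalise(text: str) -> str:
        # Ignore surrounding whitespace and letter case when comparing readings.
        return text.strip().lower()

    a, b, key = normalise(first_pass), normalise(second_pass), normalise(solution_key)
    if a != b:
        # The passes disagree, so the reading is unreliable and needs a human.
        return "manual_review"
    return "correct" if a == key else "incorrect"


if __name__ == "__main__":
    # Hypothetical readings of three fields against their expected values.
    fields = [("42", "42", "42"), ("4Z", "42", "42"), ("17", "17", "71")]
    for first, second, key in fields:
        print(first, second, key, grade_field(first, second, key))
```

Requiring agreement before automatic grading trades some extra manual work for fewer silent misclassifications, which matters in summative exams where a wrong automatic grade has legal consequences.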

2606.09086 2026-06-09 cs.AI 新提交

DynaOD: Dynamic Origin-Destination Flow Generation with Discrete-to-Continuous Temporal Semantic Modeling

DynaOD: 基于离散到连续时间语义建模的动态起讫点流量生成

Jie Zhao, Xianqi Dai, Jie Feng, Huandong Wang, Yong Li

发表机构 * Department of Electronic Engineering, BNRist, Tsinghua University(清华大学电子工程系,BNRist) Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) Zhongguancun Academy(中关村学院)

AI总结 提出DynaOD框架,通过离散方向趋势和连续时间演化双视角建模时间语义,以轻量即插即用方式调节预训练静态OD生成器,实现无历史观测的动态OD流生成,在预测精度和分布保真度上优于基线。

Comments Accepted by IJCAI2026

详情
AI中文摘要

动态起讫点(OD)流量生成旨在仅从时间上下文合成逼真的移动动态,而不依赖历史OD观测。一个关键挑战是将语义时间信号转化为时间上连贯的OD模式,同时保留城市区域固有的空间异质性。我们提出DynaOD,一个语义驱动框架,通过两个互补视角建模时间动态:离散方向趋势,刻画城市活动模式的定性变化;连续时间演化,捕捉这些变化如何随时间展开。通过联合编码这些时间语义,该框架构建时变区域表示,以轻量即插即用方式调节预训练的静态OD生成器。这种模块化设计进一步支持可扩展部署和跨城市迁移。在大型真实世界数据集上的大量实验表明,我们的方法在预测精度和分布保真度上均持续优于代表性基线。代码公开于https://github.com/csjiezhao/DynaOD。

英文摘要

Dynamic origin-destination (OD) flow generation seeks to synthesize realistic mobility dynamics from temporal context alone, without relying on historical OD observations. A key challenge is to translate semantic temporal signals into temporally coherent OD patterns while preserving the inherent spatial heterogeneity of urban regions. We propose DynaOD, a semantic-driven framework that models temporal dynamics through two complementary perspectives: discrete directional trends that characterize qualitative shifts in urban activity patterns, and continuous temporal evolution that captures how such shifts unfold over time. By jointly encoding these temporal semantics, the framework constructs time-varying region representations that condition pretrained static OD generators in a lightweight and plug-and-play fashion. This modular design further supports scalable deployment and cross-city transferability. Extensive experiments on large-scale real-world datasets show that our method consistently outperforms representative baselines in both predictive accuracy and distributional fidelity. Code is publicly available at https://github.com/csjiezhao/DynaOD.

2606.09392 2026-06-09 cs.AI 新提交

From Coarse to Fine: Managing Temporal Granularity in Spatio-Temporal Data for Fine-Grained Traffic Prediction

从粗到细:管理时空数据中的时间粒度以实现细粒度交通预测

Shuhao Li, Weidong Yang, Yue Cui, Zizhuo Xu, Lipeng Ma, Fan Zhang, Xiaofang Zhou

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University(复旦大学计算机科学与技术学院) Tongyi Lab, Alibaba Group(阿里巴巴集团通义实验室) The Hong Kong University of Science and Technology(香港科技大学) Guangzhou University(广州大学)

AI总结 针对粗粒度采样数据难以支持细粒度预测的问题,提出时空细化预测器(STRP),通过树卷积和逆膨胀卷积实现高效时空建模,在六个数据集上显著优于现有方法。

详情
AI中文摘要

高效的交通数据获取、存储和利用是时空数据管理中的关键挑战。大多数交通数据系统以固定的粗粒度时间间隔收集和存储观测数据,以降低存储和计算成本。然而,这种粗粒度数据严重限制了需要更细时间粒度预测的下游应用。在所有地点和时间段收集和维护细粒度交通数据将给数据库存储和预处理流程带来巨大负担。为了解决这种时间粒度不匹配问题,我们定义了一个新问题:利用粗粒度采样数据预测细粒度未来交通。我们提出了时空细化预测器(STRP),一种面向时空数据系统的粒度感知框架。STRP集成了两个组件:用于高效且可解释的空间依赖建模的树卷积,以及用于渐进式时间外推的逆膨胀卷积。STRP支持两种实用的预测设置:基于窗口和基于持续时间的,以处理不同形式的粒度不匹配。在六个基准数据集上的实验表明,STRP在准确性和效率上均显著优于最先进的基线方法。我们的工作为管理时空交通数据系统中的粒度不匹配提供了一种实用且可解释的方法。

英文摘要

Efficient acquisition, storage, and utilization of traffic data are critical challenges in spatio-temporal data management. Most traffic data systems collect and store observations at fixed, coarse-grained temporal intervals to reduce storage and computation costs. However, such coarse-grained data severely limits downstream applications that require predictions at a finer temporal granularity. Collecting and maintaining fine-grained traffic data across all locations and time periods would impose a substantial burden on database storage and preprocessing pipelines. To address this temporal granularity mismatch, we formulate a novel problem: predicting fine-grained future traffic using coarse-grained sampled data. We propose the Spatial-Temporal Refinement Predictor (STRP), a granularity-aware framework for spatio-temporal data systems. STRP integrates two components: Tree Convolution for efficient and interpretable spatial dependency modeling, and Inverse Dilated Convolution for progressive temporal extrapolation. STRP supports two practical prediction settings: window-based and duration-based, to handle different forms of granularity mismatch. Experiments on six benchmark datasets show that STRP significantly outperforms state-of-the-art baselines in both accuracy and efficiency. Our work offers a practical and interpretable approach to managing granularity mismatches in spatio-temporal traffic data systems.

2606.09433 2026-06-09 cs.AI 新提交

Bayesian Selective Latent Inference for Wastewater-First Influenza Monitoring

贝叶斯选择性潜在推断用于污水优先的流感监测

Yixuan Zhang, Yang Song, Hao Wang, Samir Bhatt, Hengguan Huang

发表机构 * University of Copenhagen(哥本哈根大学) Rutgers University(罗格斯大学) Imperial College London(帝国理工学院)

AI总结 提出贝叶斯选择性潜在推断(BSLI),通过后验分布、可回答性认证和成本校准的Bellman策略,在污水优先流感监测中优化查询与弃权决策。

Comments Corresponding authors: Hengguan Huang and Samir Bhatt. Hengguan Huang is the lead corresponding author

详情
AI中文摘要

污水流感监测可以在临床报告之前揭示社区传播,但仅凭污水并不能完全识别人类负担。现有的污水模型假设固定的证据集,而通用的证据获取方法将官方监测流视为可互换的昂贵特征。我们将污水优先的流感监测视为一个选择性决策问题:从强制性的污水证据开始,系统必须决定污水是否足够,接下来查询哪个延迟的官方流,以及在源模糊下何时弃权是唯一科学上可辩护的行动。我们提出了贝叶斯选择性潜在推断(BSLI),这是一种原则性的贝叶斯方法,它维护潜在负担和可识别性的后验分布,通过明确的科学门认证可回答性,并使用精确的成本校准Bellman策略优化查询-停止决策。我们证明了关键的变分、可回答性、Bellman最优性和一维成本校准性质。在一个包含5,933个预测事件和3,102个源模糊事件的固定公共数据基准上,BSLI改善了匹配预算的成本-性能前沿,同时在源模糊下保持保守的弃权。

英文摘要

Wastewater influenza surveillance can reveal community circulation before clinical reporting, but wastewater alone is not a fully identifiable proxy for human burden. Existing wastewater models assume a fixed evidence set, while generic evidence-acquisition methods treat official surveillance streams as interchangeable costly features. We cast wastewater-first influenza monitoring as a selective decision problem: starting from mandatory wastewater evidence, the system must decide whether wastewater is sufficient, which delayed official stream to query next, and when abstention is the only scientifically defensible action under source ambiguity. We propose Bayesian Selective Latent Inference (BSLI), a principled Bayesian method that maintains a posterior over latent burden and identifiability, certifies answerability through explicit scientific gates, and optimizes query-stop decisions with an exact cost-calibrated Bellman policy. We prove the key variational, answerability, Bellman-optimality, and one-dimensional cost-calibration properties. On a fixed public-data benchmark with 5,933 forecasting episodes and 3,102 source-ambiguity episodes, BSLI improves the matched-budget cost-performance frontier while preserving conservative abstention under source ambiguity.

2606.09489 2026-06-09 cs.AI 新提交

LLM-Orchestrated Conformance Checking in Stroke Care Without Computer-Interpretable Guidelines

无需计算机可解释指南的LLM编排卒中护理合规性检查

Giorgio Leonardi, Stefania Montani, Manuel Striani, Alessandro Canessa, Delfina Ferrandi

发表机构 * Computer Science Institute, DiSIT, University of Piemonte Orientale(皮埃蒙特东方大学计算机科学研究所) Integrated Laboratory of AI and Medical Informatics, DAIRI, SS. Antonio e Biagio e Cesare Arrigo Hospital(圣安东尼奥、比亚焦与切萨雷·阿里戈医院DAIRI人工智能与医学信息学综合实验室)

AI总结 提出基于大语言模型编排的模块化框架,从非结构化临床文本和指南中自动提取患者轨迹、识别规范规则并计算合规性指标,在卒中护理领域验证了86%以上的轨迹合规。

详情
AI中文摘要

目标:医疗保健中的合规性检查旨在评估患者护理路径是否符合临床指南。然而,其实际应用通常依赖于正式、机器可解释的指南表示(如计算机可解释指南CIG),而这些在现实临床环境中很少可用。方法:本文引入了一个基于大语言模型编排的模块化框架,直接从非结构化的临床和指南文本中支持医疗合规性检查,无需预定义的CIG。所提出的架构集成了多个LLM和支持组件,从临床出院信中提取患者轨迹,从文本临床指南中识别规范规则,将这些规则转换为可执行脚本,并计算轨迹合规性指标以量化事件日志中的合规性。结果:该框架在亚历山德里亚医院神经内科病房的卒中护理领域进行了实施和评估。从医院数据中自动提取了数百条患者轨迹,并根据参考指南衍生的50条规则进行了评估。分析显示,超过86%的可用轨迹是合规的。结论:结果证明了使用编排的LLM进行实际医疗保健合规性分析的可行性。同时,该研究提供了亚历山德里亚医院卒中护理指南高度遵守的证据。

英文摘要

Objective: Conformance checking in healthcare seeks to assess whether patient care pathways adhere to clinical guidelines. However, its practical application often depends on the availability of formal, machine-interpretable representations of guidelines, such as Computer-Interpretable Guidelines (CIGs), which are seldom available in real-world clinical settings. Methods: This work introduces a modular framework based on the orchestration of Large Language Models (LLMs) to support medical conformance checking directly from unstructured clinical and guideline texts, without requiring predefined CIGs. The proposed architecture integrates multiple LLMs and supporting components to extract patient traces from clinical discharge letters, identify normative rules from textual clinical guidelines, translate these rules into executable scripts, and compute a Trace Conformance Indicator to quantify compliance within the event log. Results: The framework was implemented and evaluated in the stroke care domain at the neurological ward of Alessandria Hospital. Hundreds of patient traces were automatically extracted from hospital data and assessed against 50 rules derived from the reference guideline. The analysis showed that more than 86% of the available traces were conformant. Conclusion: The results demonstrate the feasibility of using orchestrated LLMs for practical healthcare conformance analysis. At the same time, the study provides evidence of a high level of adherence to stroke care guidelines at Alessandria Hospital.
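
The step of translating guideline rules into executable scripts and scoring traces against them can be illustrated with a tiny rule checker. Everything here is a hypothetical stand-in: the event names, the single ordering rule, and the definition of the indicator as the fraction of satisfied rules are assumptions, since the abstract does not define the Trace Conformance Indicator.

```python
def precedes(first: str, second: str):
    """Build a rule: if `second` occurs in a trace, `first` must occur before it."""
    def rule(trace):
        if second not in trace:
            return True  # the rule does not apply to this trace
        return first in trace and trace.index(first) < trace.index(second)
    return rule


def trace_conformance(trace, rules) -> float:
    """Fraction of rules satisfied by one trace (assumed indicator definition)."""
    if not rules:
        return 1.0
    return sum(1 for rule in rules if rule(trace)) / len(rules)


def conformant_share(traces, rules) -> float:
    """Share of traces that satisfy every rule."""
    if not traces:
        return 0.0
    return sum(1 for t in traces if trace_conformance(t, rules) == 1.0) / len(traces)


if __name__ == "__main__":
    # Hypothetical event labels; real traces would come from discharge letters.
    rules = [precedes("ct_scan", "thrombolysis")]
    traces = [
        ["admission", "ct_scan", "thrombolysis", "discharge"],
        ["admission", "thrombolysis", "ct_scan", "discharge"],
    ]
    print(conformant_share(traces, rules))
```

Keeping each rule as a small predicate over the event sequence makes the generated checks easy for clinicians to audit one rule at a time.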

2606.09556 2026-06-09 cs.AI 新提交

AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

AI科学家的能力取决于其证据:药物资产估值中专有数据与推理技能的分层消融研究

Yinan Wang

发表机构 * Noah AI Research(Noah AI研究)

AI总结 通过分层消融实验,发现药物资产估值中AI科学家的决策上限由专有证据集决定,而非仅依赖推理框架;加入专有数据后决策质量显著提升。

Comments Preprint; 2 figures, 5 tables

详情
AI中文摘要

AI科学家智能体通常被评估时,仿佛能力主要取决于模型质量、提示或推理框架。我们在药物资产估值中测试了一个不同的假设:对于知识密集型的科学决策,限制因素往往是智能体能够访问的证据基础。我们在一个生产级估值智能体上进行了三臂对照消融实验:A是仅使用网络的普通LLM分析师,B增加了公共结构化工具以及14维估值剧本、验证器、客观性策略和红队,C增加了专有的Noah AI语料库,包含精选的管线、试验和交易情报。在包含13个资产的分层基准测试中,B改善了校准和审计纪律:层级内准确率从0.80提高到0.89,客观性从3.16提高到3.30。但B并未消除事实上限。在能力超集核算下,A和B仅恢复了精选黄金竞争记录的0.25和0.38,而C恢复了0.96;在精选长尾子集上,C达到0.93,而A/B为0.26/0.30。原始盲审决策质量A和B相似(7.01 vs 6.96),因此我们引入了完整性感知决策效用:知情决策质量 = 决策质量 × 黄金覆盖率。在此指标上,C达到7.43,而A/B为1.76/2.57。即使一个完美的非专有数据报告,其B的覆盖率上限也仅为3.83。结果并非推理框架不重要;它们改善了校准和纪律。相反,专有证据集设定了AI科学家所能知道并因此决策的上限。

英文摘要

AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. We run a controlled three-arm ablation on a production valuation agent: A is a plain web-only LLM analyst; B adds public structured tools plus a 14-dimension valuation playbook, verifier, objectivity policy, and red team; and C adds the proprietary Noah AI corpus of curated pipeline, trial, and deal intelligence. Across a 13-asset stratified benchmark, B improves calibration and audit discipline: tier-in-range accuracy rises from 0.80 to 0.89 and objectivity from 3.16 to 3.30. But B does not remove the factual ceiling. Under capability-superset accounting, A and B recover only 0.25 and 0.38 of the curated gold competitive record, while C recovers 0.96; on the curated long-tail subset, C reaches 0.93 vs. 0.26/0.30 for A/B. Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96), so we introduce completeness-aware decision utility: informed decision-quality = decision-quality × gold-coverage. On this metric, C reaches 7.43 vs. 1.76/2.57 for A/B. Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage. The result is not that reasoning scaffolds are unimportant; they improve calibration and discipline. Rather, proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide.
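
The completeness-aware utility defined in the abstract is a simple product, sketched below. The function name and the assumption that decision quality is scored out of 10 are illustrative; note that multiplying the rounded aggregates quoted in the abstract need not reproduce the reported 1.76 and 2.57, presumably because those were computed from unrounded or per-asset values.

```python
MAX_DECISION_QUALITY = 10.0  # assumed top of the scoring scale, not stated in the abstract


def informed_decision_quality(decision_quality: float, gold_coverage: float) -> float:
    """Decision quality discounted by the share of the gold record that was recovered."""
    if not 0.0 <= gold_coverage <= 1.0:
        raise ValueError("gold_coverage must lie in [0, 1]")
    return decision_quality * gold_coverage


def coverage_cap(gold_coverage: float) -> float:
    """Best achievable informed score when decision quality is perfect."""
    return informed_decision_quality(MAX_DECISION_QUALITY, gold_coverage)


if __name__ == "__main__":
    # Rounded aggregates quoted in the abstract for arms A and B.
    print(informed_decision_quality(7.01, 0.25))
    print(informed_decision_quality(6.96, 0.38))
    # Under the assumed 10-point scale, a cap of 3.83 implies coverage of 0.383.
    print(coverage_cap(0.383))
```

The multiplicative form means that no amount of reasoning quality can compensate for missing evidence: the coverage term bounds the score from above.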

2606.09774 2026-06-09 cs.AI cs.CL 新提交

SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

SIGA: 用于科学模拟的自演化编码智能体适配器

Matthew Ho, Brian Liu, Jixuan Chen, Audrey Wang, Lianhui Qin

发表机构 * University of California, San Diego(加利福尼亚大学圣迭戈分校)

AI总结 提出SIGA适配器,通过检索、程序记忆、轨迹内验证和验证强制终止,将通用编码智能体转化为科学模拟软件操作员,在GEOS上实现36倍加速,并支持自演化提升性能。

详情
AI中文摘要

高级科学模拟器暴露了专门的输入语言,将模拟目标转化为可执行配置,但学习这些语言可能需要领域科学家花费数小时到数天。我们将模拟器设置研究为智能体-工具接口接地问题:需要哪些最小的模拟器特定适配才能使现成的编码智能体操作真实的科学软件?我们的直觉是,编码智能体已经知道如何导航文件、编辑代码、运行命令和修复输出,但它们缺乏模拟器的可执行契约:其词汇、结构约束、验证规则和终止条件。我们介绍了SIGA,一个模拟器接口接地适配器,通过检索、程序记忆、轨迹内验证和验证强制终止来提供此契约。我们主要在GEOS上评估SIGA,GEOS是一个用于地下科学的开源多物理场模拟器。SIGA在大约五分钟内生成完整的GEOS输入文件,TreeSim高于0.90,与花费大约三小时的扩展预算人类专家相当,实现了大约36倍的挂钟加速。在更难的保留集上,接地将TreeSim从0.720提高到0.789,相对于裸智能体提高了大约10%,并且可以将跨种子的标准差降低16倍。自演化通过从先前轨迹重写适配器内容进一步改进SIGA,产生了最高的保留GEOS平均值,并匹配或超过了最强的手工设计配置。迁移到OpenFOAM和LAMMPS表明,主导机制因接口而异:当结构完整性是瓶颈时,验证最重要;而当领域正确性是瓶颈时,记忆和检索最重要。这些结果表明,轻量级、可自我改进的接地层可以将通用编码智能体转变为科学软件的实用操作员。

英文摘要

Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.
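
The headline ratios in the abstract follow from simple arithmetic on the quoted figures. The sketch below only re-derives them; the helper names are illustrative, and the "about three hours" and "about five minutes" inputs are the rounded values from the abstract.

```python
def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """Wall-clock speedup of the new workflow over the baseline."""
    return baseline_minutes / new_minutes


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


# Human expert: about three hours; SIGA: about five minutes.
print(speedup(180, 5))  # 36.0

# Held-out TreeSim: bare agent 0.720, grounded agent 0.789.
print(round(relative_gain(0.720, 0.789) * 100, 1))  # 9.6
```

The second figure shows that the "roughly 10%" gain is relative to the bare agent's score, not an absolute difference in TreeSim (which is 0.069).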

2502.09194 2026-06-09 cs.IT cs.AI math.IT 交叉投稿

XAInomaly: Explainable and Interpretable Deep Contractive Autoencoder for O-RAN Traffic Anomaly Detection

XAInomaly:用于O-RAN流量异常检测的可解释、可诠释深度收缩自编码器

Osman Tugay Basaran, Falko Dressler

发表机构 * School of Electrical Engineering and Computer Science, TU Berlin, Germany(德国柏林工业大学电气工程与计算机科学学院)

AI总结 提出XAInomaly框架,利用半监督深度收缩自编码器学习正常网络行为的鲁棒表示,并引入fastshap-C可解释AI技术,实现O-RAN中准确、可扩展且可解释的异常检测。

Comments 22 pages, 9 Figures, Submitted to Journal (First revision completed)

详情
AI中文摘要

生成式人工智能技术通过实现复杂数据建模和特征提取以增强网络性能,已成为推动下一代无线通信系统发展的关键组成部分。开放无线接入网络(O-RAN)以解耦架构和来自多个供应商的异构组件为特征,在这一领域部署生成模型可为流量分析、流量预测和异常检测等网络管理任务带来显著优势。然而,O-RAN的复杂性和动态性带来了挑战:不仅需要准确的检测机制,还需要更低的复杂度、可扩展性,以及最重要的可解释性,以促进有效的网络管理。在本研究中,我们引入了XAInomaly框架,这是一种用于O-RAN异常检测的可解释且可诠释的半监督(SS)深度收缩自编码器(DeepCAE)设计。我们的方法利用SS-DeepCAE模型的生成建模能力,学习正常网络行为的压缩、鲁棒表示,该表示捕获了关键特征,从而能够识别指示异常的偏差。为了解决深度学习模型的黑箱特性,我们提出了一种名为fastshap-C的反应式可解释AI(XAI)技术。

英文摘要

Generative Artificial Intelligence (AI) techniques have become an integral part of advancing next-generation wireless communication systems by enabling sophisticated data modeling and feature extraction for enhanced network performance. In the realm of open radio access networks (O-RAN), characterized by their disaggregated architecture and heterogeneous components from multiple vendors, the deployment of generative models offers significant advantages for network management tasks such as traffic analysis, traffic forecasting, and anomaly detection. However, the complex and dynamic nature of O-RAN introduces challenges that necessitate not only accurate detection mechanisms but also reduced complexity, scalability, and, most importantly, interpretability to facilitate effective network management. In this study, we introduce the XAInomaly framework, an explainable and interpretable Semi-supervised (SS) Deep Contractive Autoencoder (DeepCAE) design for anomaly detection in O-RAN. Our approach leverages the generative modeling capabilities of our SS-DeepCAE model to learn compressed, robust representations of normal network behavior that capture essential features, enabling the identification of deviations indicative of anomalies. To address the black-box nature of deep learning models, we propose a reactive Explainable AI (XAI) technique called fastshap-C.

2606.07543 2026-06-09 cs.CY cs.AI cs.HC 交叉投稿

Concerns and Strategic Responses of Older Workers Navigating Generative AI in Bridge Employment

老年工人在桥梁就业中应对生成式AI的关切与战略回应

Aditya Nayak, Aakash Gautam, Rama Adithya Varanasi

发表机构 * University of Pittsburgh(匹兹堡大学) New York University(纽约大学)

AI总结 通过访谈21名专业人士,研究老年工人在桥梁就业中如何应对生成式AI带来的时间与结构性干扰,通过边界工作重构任务,形成AI韧性,并建议平衡个体、中观和宏观层面的策略以减少倦怠。

Comments CHIWORK'26

详情
AI中文摘要

生成式AI正在快速改变工作场所。这不成比例地影响了弱势群体,包括在最终退休前通过桥梁就业重新进入劳动力市场的老年工人。通过对21名专业人士进行深入的半结构化访谈,我们考察了老年工人在追求桥梁角色时如何应对生成式AI驱动的干扰,重点关注他们对GenAI整合的关切以及对这些变化的回应。我们的发现表明,由于GenAI,老年工人在桥梁就业决策过程的所有阶段都经历了时间和结构性干扰。作为回应,他们通过不同形式的边界工作重新配置任务,旨在恢复稳定性和连续性。我们将这些回应概念化为AI韧性,它重塑了老年工人的桥梁就业决策,使其成为一个持续的协商和适应过程。最后,我们提出建议,通过平衡个体层面的AI韧性策略、中观层面的AI韧性集体以及宏观层面的对抗性和可争议的AI中介组织结构,来减少老年工人的倦怠。

英文摘要

Generative AI (GenAI) is transforming workplaces at a rapid pace. This disproportionately affects vulnerable communities, including older workers (OWs) who re-enter the workforce through bridge employment prior to final retirement. Through in-depth semi-structured interviews with 21 professionals, we examine how OWs navigate GenAI-driven disruptions while pursuing bridge roles, focusing on their concerns about GenAI integration and their responses to these changes. Our findings show that OWs experienced both temporal and structural disruptions across all stages of the bridge employment decision-making process due to GenAI. In response, they reconfigured their tasks through different forms of boundary work aimed at restoring stability and continuity. We conceptualize these responses as AI resilience, which reshaped OWs' bridge employment decision-making into an ongoing process of negotiation and adaptation. We conclude by offering recommendations to reduce burnout among OWs by balancing individual-level AI resilience strategies with meso-level AI resilience collectives and macro-level adversarial and contestable AI-mediated organizational structures.

2606.07544 2026-06-09 cs.CY cs.AI cs.HC 交叉投稿

AI-Integrated Learning Management System for Middle School: A Longitudinal Study of Learning Outcomes Through High School and Beyond

面向中学的AI集成学习管理系统:一项从高中到毕业后的学习成果纵向研究

Misan Paul Etchie, Taiwo Olutosin

发表机构 * National Agricultural University(国立农业大学)

AI总结 提出一种隐私优先的AI集成学习管理系统,通过政策约束的AI辅助(形成性反馈、间隔复习、适应性练习)和教师仪表盘,在中学日常课程中提供即时支持,并设计纵向研究评估其对高中及毕业后学习轨迹的长期影响。

详情
AI中文摘要

中学是构建核心学术技能和学习习惯的关键时期,这些习惯会延续到高年级,但许多学生仍因帮助有限且滞后而落后。学习管理系统(LMS)已成为分发材料、收集作业、评估学生任务和记录成绩的标准基础设施,但在大多数部署中,它们更像工作流工具而非教学支持。结果是常见的瓶颈:学生在困惑中继续练习,教师对问题进行分诊,而本可纠正误解的反馈在错误观念固化后才到达。为弥补这一差距,我们提出一个面向中学教学的AI集成LMS,并配以纵向研究设计,以测试持续、有边界的AI支持是否能改变高中及毕业后的学习成果。该平台在常规课程中添加了政策约束的AI辅助,提供形成性反馈和提示,基于掌握程度推荐间隔复习和适应性练习,并提供教师仪表盘以总结误解模式并标记持续困难。由于平台面向未成年人,设计以隐私为先,采用数据最小化、基于角色的访问控制、适龄响应约束和可审计的AI交互日志。除了短期表现,评估计划将细粒度的学习轨迹(尝试、修订、求助和节奏)与机构成果(在可行情况下)联系起来,以便将工具采纳效应与学习轨迹的长期变化区分开来。

英文摘要

Middle school is a key window for building core academic skills and the learning routines students carry into later grades, yet many students still fall behind because help is often limited and comes too late, after they have already been stuck for a while. Learning Management Systems (LMSs) are now standard infrastructure for distributing materials, collecting work, assessing students' tasks, and recording grades, but in most deployments they still behave more like workflow tools than instructional supports. The result is the usual bottleneck: students keep practicing through confusion, teachers triage questions, and feedback that could have corrected the misunderstanding arrives after the misconception has already hardened. To address this gap, we propose an AI-integrated LMS for middle school instruction, paired with a longitudinal study design to test whether sustained, bounded AI support changes outcomes through high school and into post-high school pathways. The proposed platform adds policy-gated AI assistance to everyday coursework, delivering formative feedback and hinting, recommending spaced review and adaptive practice based on mastery, and providing teacher-facing dashboards that summarize misconception patterns and flag sustained struggle. Because the platform is intended for minors, the design is privacy-first, using data minimization, role-based access control, age-appropriate response constraints, and auditable logs of AI interactions. Beyond short-term performance, the evaluation plan links fine-grained learning traces (attempts, revisions, help-seeking, and pacing) to institutional outcomes where feasible, so we can separate tool adoption effects from longer-run changes in learning trajectories.

2606.07553 2026-06-09 cs.LG cs.AI 交叉投稿

MedicalRec: Medical recommender system for image classification without retraining

MedicalRec:无需重新训练的图像分类医疗推荐系统

Roghayeh Taghavi, Aysa Hasanazde Bashkandi, Amir Ali Bengari, Mohammad Amin Raji, Mohammad Salahi Ardekani, Parisa Mardukhian, Parvaneh Rezaei, Ramin Mousa

发表机构 * University of Tehran(德黑兰大学)

AI总结 提出基于Transformer的医疗推荐系统MedicalRec,利用从3000篇论文中构建的MedicalRec-Bench数据集(含5000+记录),无需重新训练即可为医疗图像分类任务推荐最优模型,最高HitRate@100达75.5%。

详情
AI中文摘要

机器学习和深度学习的出现彻底改变了医疗保健中诊断、治疗和管理系统的效率。然而,这种快速采用是以需要大量计算能力和能源消耗以及电子垃圾处理和碳排放为代价的。这些模型的挑战之一是为分类任务选择合适的模型。为此,研究人员尝试通过试错法使用他们的数据来确定最佳模型,这涉及能源消耗和浪费。本研究的目标是开发一个基于模型的医疗图像分类推荐系统。为此,从3000篇医疗图像分类领域的文章中收集了一个数据集。该数据集以MedicalRec-Bench的名称公开可用,包含超过5000条在各种任务中测试的模型记录,包括皮肤癌分类、肿瘤分类、伤口分类、乳腺癌和MRI分类。根据特征数量,数据集在四种不同模式下进行评估:MedicalRec I(5个特征)、MedicalRec II(9个特征)、MedicalRec III(11个特征)和MedicalRec IV(18个特征)。由于作者未报告,收集所有特征值具有挑战性;因此,数据集包含大量缺失值。医疗推荐系统(MedicalRec)是一个基于Transformer的模型,用于本研究中的项目推荐。该模型在数据集评估和与12个基础模型的评估中取得了显著成果。该模型实现了最高HitRate@100为75.5%。数据集和实现可通过GitHub链接获取:https://github.com/Ramin1Mousa/MedicalRec

英文摘要

The emergence of machine learning and deep learning has revolutionized the efficiency of diagnostic, therapeutic, and administrative systems in healthcare. However, this rapid adoption has come at the cost of requiring significant computing power and energy consumption, as well as e-waste disposal and carbon emissions. One of the challenges of these models is choosing the right model for classification tasks. To this end, researchers attempt to identify the optimal model using their data through trial and error, which involves energy consumption and waste. The goal of this study is to develop a model-based recommender system for medical image classification. For this purpose, a data set was collected from 3,000 articles in the field of medical image classification. This dataset, publicly available under the name MedicalRec-Bench, contains over 5,000 records of models tested in various tasks, including Skin Cancer Classification, Tumour Classification, Wound Classification, Breast Cancer, and MRI classification. The dataset was evaluated in four different modes, depending on the number of features: MedicalRec I (5 features), MedicalRec II (9 features), MedicalRec III (11 features), and MedicalRec IV (18 features). Collecting all values for the features is challenging due to non-reporting by the authors; hence, the dataset contains significant amounts of missing values. The Medical Recommender System (MedicalRec) is a transformer-based model used for item recommendations in this study. This model achieved remarkable results in the evaluation on the dataset and in the evaluation with 12 base models. This model achieved a maximum HitRate@100 of 75.5%. The dataset and implementations are available through the GitHub link: https://github.com/Ramin1Mousa/MedicalRec
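
The abstract reports HitRate@100 without defining it. A minimal sketch of the usual definition follows: the fraction of queries whose ground-truth item appears among the top K recommendations. The function name and the toy model names are illustrative assumptions, and the paper's exact evaluation protocol may differ.

```python
from typing import Hashable, Sequence


def hit_rate_at_k(
    ranked_lists: Sequence[Sequence[Hashable]],
    relevant_items: Sequence[Hashable],
    k: int,
) -> float:
    """Fraction of queries whose relevant item is within the top-k ranking."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(ranked_lists) != len(relevant_items):
        raise ValueError("one relevant item is required per ranked list")
    if not ranked_lists:
        return 0.0
    hits = sum(
        1
        for ranked, target in zip(ranked_lists, relevant_items)
        if target in ranked[:k]
    )
    return hits / len(ranked_lists)


# Toy example: two classification tasks, each with a ranked list of models.
recommendations = [
    ["resnet50", "vgg16", "densenet121"],
    ["vit_b16", "resnet50", "efficientnet_b0"],
]
best_models = ["vgg16", "efficientnet_b0"]
print(hit_rate_at_k(recommendations, best_models, k=2))  # 0.5
```

With a large cutoff such as K = 100, the metric measures whether the best model is shortlisted at all rather than whether it is ranked first.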

2606.07556 2026-06-09 cs.NI cs.AI stat.ME 交叉投稿

Selecting New Measurement Locations to Diversify Traffic-Pattern Coverage: A Real-World Evaluation for Total Traffic Volume Estimation

选择新的测量位置以多样化交通模式覆盖:总交通量估计的实际评估

Masaaki Inoue, Akifumi Okuno, Shintaro Fukushima

发表机构 * TOYOTA Motor Corporation(丰田汽车公司) Institute of Statistical Mathematics(统计数理研究所) The Graduate University for Advanced Studies, SOKENDAI(综合研究大学院大学) RIKEN(理化学研究所)

AI总结 针对固定交通计数器覆盖有限的问题,提出利用广泛设备数据选择新计数器位置以增加观测模式多样性,提高城市交通量估计精度,并通过实地测量验证。

Comments 12 pages, 7 figures

详情
AI中文摘要

准确测量交通量和流量对于现代智能交通至关重要。然而,尽管传感器设备最近取得了技术进步,安装和维护固定交通计数器的成本仍然很高。因此,它仅限于可以安装计数器的一小部分位置点,这严重限制了在城市范围内掌握和预测总交通量的可能性。相比之下,具有位置历史的设备(如智能手机和联网车辆)现在被广泛使用,并提供更广泛的空间覆盖。然而,这些设备的数据通常是部分且嘈杂的,因此不足以直接估计总交通量和流量。在本文中,我们利用这些广泛可用设备的信息来帮助决定在何处放置额外的交通计数器,并研究选择新的测量位置如何改善城市范围的交通估计性能。为此,我们提出了一种算法,该算法选择额外的计数器位置以增加观测到的交通信号模式的多样性,而不是简单地将计数器均匀分布在空间上。目标是捕获当前计数器集中稀有的交通模式类型,并使收集的观测结果对后续估计和预测更具代表性。我们还进行了实际评估;在一个目标城市中,我们选择了预期能改善交通预测的新位置,然后自费在这些位置进行了新的实地测量。所得数据提高了不同保真度下交通量估计的准确性。

英文摘要

Accurate measurement of traffic volumes and flows is vital for modern intelligent transportation. However, despite recent technological advances in sensor devices, fixed traffic counters remain expensive to install and maintain. Counters can therefore be deployed at only a small fraction of locations, which severely limits the ability to grasp and predict total traffic volume at a city-wide level. By contrast, devices with location history such as smartphones and connected vehicles are now widely used and provide much wider spatial coverage. However, the data from these devices are usually partial and noisy, so they are not enough to directly estimate total traffic volumes and flows. In this paper, we use the information from these widely available devices to help decide where to place additional traffic counters, and we study how selecting new measurement locations can improve city-wide traffic estimation performance. To achieve this, we propose an algorithm that chooses additional counter locations to increase the diversity of observed traffic signal patterns, rather than simply spreading counters evenly over space. The goal is to capture traffic-pattern types that are rare in the current counter set and to make the collected observations more representative for later estimation and forecasting. We also present a real-world evaluation: in a target city, we selected new locations expected to improve traffic prediction and then commissioned new field measurements at those locations at our own expense. The resulting data led to an improvement in traffic volume estimation accuracy across different fidelities.
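
The abstract does not spell out the selection algorithm, so the following is only a sketch of one standard way to pick locations that diversify pattern coverage: greedy max-min (farthest-point) selection over pattern feature vectors. The feature vectors, site names, and the Euclidean distance are all assumptions and should not be read as the authors' method.

```python
import math
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[float, ...]


def distance(a: Pattern, b: Pattern) -> float:
    """Euclidean distance between two traffic-pattern feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_diverse_locations(
    existing: Sequence[Pattern],
    candidates: Dict[str, Pattern],
    budget: int,
) -> List[str]:
    """Greedily pick candidates whose pattern is farthest from all covered ones."""
    if not existing:
        raise ValueError("at least one existing counter pattern is required")
    covered = list(existing)
    remaining = dict(candidates)
    chosen: List[str] = []
    for _ in range(min(budget, len(remaining))):
        # Score each candidate by its distance to the nearest covered pattern.
        best = max(
            remaining,
            key=lambda name: min(distance(remaining[name], c) for c in covered),
        )
        chosen.append(best)
        covered.append(remaining.pop(best))
    return chosen


# Hypothetical two-dimensional pattern features for current and candidate sites.
current_counters = [(1.0, 0.2), (0.9, 0.3)]
candidate_sites = {"site_a": (1.0, 0.25), "site_b": (0.1, 0.9), "site_c": (0.5, 0.5)}
print(select_diverse_locations(current_counters, candidate_sites, budget=2))
# ['site_b', 'site_c']
```

In this toy case `site_a` is skipped because its pattern nearly duplicates an existing counter, which mirrors the stated goal of capturing pattern types that are rare in the current counter set.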

2606.07564 2026-06-09 physics.ins-det cs.AI hep-ex 交叉投稿

Considerations for an Integrated Detector Design at FCC-ee: A Human-AI Exploration

FCC-ee集成探测器设计考量:人机协同探索

Charles Young

发表机构 * SLAC National Accelerator Laboratory(SLAC国家加速器实验室)

AI总结 通过物理学家与AI助手的对话,探讨FCC-ee探测器设计,从初始概念到修正方案,展示人机协作在实验物理设计中的潜力与局限。

Comments 103 pages, one figure

详情
AI中文摘要

本报告通过物理学家与AI助手之间的扩展对话,探讨了未来环形对撞机正负电子模式(FCC-ee)的探测器设计考量。从AI助手在没有明确物理学家输入的情况下提出的初始“偏见”探测器概念开始,每个子系统都经过详细审查,AI的假设在交流中受到挑战和修正。讨论涵盖了从束流管到亮度监测器的整个探测器,特别关注子系统选择与实用考量(校准、稳定性和操作简便性)之间的相互作用,这些对于为期十五年的精确物理计划至关重要。叙述记录了集成探测器设计如何从起点演变为AI助手修正后的“偏见”探测器概念。本报告的重点在于过程,以说明人机协作在实验物理设计中的潜力和局限性,任何“偏见”探测器概念的物理能力仍有待探索。

英文摘要

This report explores detector design considerations for the Future Circular Collider in its electron-positron mode (FCC-ee) through an extended dialogue between a physicist and an AI assistant. Starting from initial "prejudice" detector concepts proposed by the AI assistant without explicit physicist input, each subsystem is examined in detail, with the AI's assumptions challenged and revised through the exchange. The discussion covers the full detector from beam pipe to luminosity monitor, with particular attention to the interplay between subsystem choices and the practical considerations - calibration, stability, and operational simplicity - that are essential for a fifteen-year precision physics program. The narrative documents how the integrated detector design evolved substantially from the starting point to revised "prejudice" detector concepts of the AI assistant. The focus of this report is on the process to illustrate both the potential and the limitations of human-AI collaboration in experimental physics design, and the physics capabilities of any of the "prejudice" detector concepts remain to be explored.

2606.07567 2026-06-09 q-bio.BM cs.AI cs.CE 交叉投稿

SurfDesign: Effective Protein Design on Molecular Surfaces

SurfDesign:基于分子表面的高效蛋白质设计

Fang Wu, Shuting Jin, Xiangru Tang, Mark Gerstein, Xiangxiang Zeng, Yejin Choi, Jure Leskovec, Jinbo Xu

发表机构 * Stanford University(斯坦福大学) Wuhan University of Science and Technology(武汉科技大学) Yale University(耶鲁大学) School of Medicine, Yale University(耶鲁大学医学院) Hunan University(湖南大学) Yuelushan Laboratory(岳麓山实验室) Kumo.AI Toyota Technological Institute at Chicago(芝加哥丰田技术研究所)

AI总结 提出SurfDesign框架,将分子表面建模为连续几何流形并整合预训练蛋白质语言模型,通过表面等变消息传递捕捉几何特征,在从头设计结合子和酶设计基准上优于现有方法。

详情
Journal ref
KDD 2026 AI4Science
AI中文摘要

蛋白质功能很大程度上由分子表面几何和物理化学互补性决定,然而大多数蛋白质设计方法仅以主链结构为条件。我们引入了SurfDesign,一个表面条件蛋白质设计框架,将分子表面建模为连续几何流形,并将其与预训练蛋白质语言模型集成。SurfDesign采用基于表面的等变消息传递来捕捉表面法线、曲率和方向几何,同时采用参数高效的微调策略。专注于功能性蛋白质设计,我们表明SurfDesign在从头设计结合子和酶设计基准上始终优于先前的表面条件和仅主链方法。我们还报告了在逆折叠基准上的强劲性能,作为结构兼容性的诊断。我们的结果强调了流形感知表面表示作为功能性蛋白质和酶设计的原理基础。代码可在https://github.com/smiles724/SurfDesign获取。

英文摘要

Protein function is largely determined by molecular surface geometry and physicochemical complementarity, yet most protein design methods condition only on backbone structure. We introduce SurfDesign, a surface-conditioned protein design framework that models molecular surfaces as continuous geometric manifolds and integrates them with pretrained protein language models. SurfDesign employs surface-based equivariant message passing to capture surface normals, curvature, and directional geometry, together with a parameter-efficient fine-tuning strategy. Focusing on functional protein design, we show that SurfDesign consistently outperforms prior surface-conditioned and backbone-only methods on de novo binder and enzyme design benchmarks. We also report strong performance on inverse-folding benchmarks as a diagnostic of structural compatibility. Our results highlight manifold-aware surface representations as a principled foundation for functional protein and enzyme design. Code is available at https://github.com/smiles724/SurfDesign.

2606.07582 2026-06-09 cs.LG cs.AI cs.ET 交叉投稿

Customer Churn Prediction on Structured Data Using FT-Transformer and Stacking Ensembles

基于FT-Transformer和堆叠集成的结构化数据客户流失预测

Joyjit Roy, Samaresh Kumar Singh, Laxmi Shaw

发表机构 * Independent Researcher, Austin, TX, USA(独立研究员,美国德克萨斯州奥斯汀) Independent Researcher, Leander, TX(独立研究员,美国德克萨斯州利安德) Texas A & M University-Victoria, Victoria, TX(德克萨斯农工大学维多利亚分校)

AI总结 提出一种结合FT-Transformer与XGBoost的混合架构,通过校准感知堆叠集成处理类别不平衡和特征交互,在银行客户流失数据集上F1达62.10%,AUC-ROC为0.861。

Comments 22 pages, 9 figures, 20 tables; published in IEEE Access

详情
Journal ref
IEEE Access, vol. 14, pp. 62834-62855, 2026
AI中文摘要

客户流失预测在保险、数字银行、电子商务和订阅平台等数据驱动行业中至关重要,因为保留现有客户通常比获取新客户更具成本效益。由于类别不平衡、非线性特征交互和异质特征类型,在结构化数据集上预测流失仍然具有挑战性。基于树的集成方法在这些场景中始终表现出强大的性能,通常优于传统神经网络。本研究引入了一种经过验证的混合架构,通过校准感知堆叠将特征标记化变换器(FT-Transformer)与梯度提升树相结合。所提出的框架解决了先前研究中在统计验证、概率校准和可重复性方面的持续空白。FT-Transformer利用自注意力捕获高阶特征交互,而XGBoost通过互补的归纳偏置捕获梯度提升决策边界。类别不平衡通过使用类别加权损失函数处理,从而避免合成过采样并保留少数类分布。模型使用基于折叠外(OOF)堆叠的逻辑回归元学习器进行集成,该元学习器重新校准过于自信的基模型输出并学习最优组合权重。在一个公开的银行流失数据集上,混合模型在5x5交叉验证下达到62.10%的F1、0.861的AUC-ROC和0.647的PR-AUC,相比多层感知机(MLP)基线分别提升3.37个F1点和0.027个AUC,并报告了95%置信区间。消融研究表明,变换器组件和堆叠策略都对性能有实质性贡献。所提出的方法为结构化表格数据上的当代流失预测提供了一个可重复且可扩展的参考架构。

英文摘要

Customer churn prediction is essential across data-driven industries such as insurance, digital banking, eCommerce, and subscription platforms, where retaining existing customers is typically more cost-effective than acquiring new ones. Predicting churn on structured datasets remains challenging due to class imbalance, nonlinear feature interactions, and heterogeneous feature types. Tree-based ensemble methods consistently demonstrate strong performance in these contexts, often outperforming conventional neural networks. This study introduces a validated hybrid architecture that integrates feature-tokenized transformers (FT-Transformer) with gradient-boosted trees through calibration-aware stacking. The proposed framework addresses persistent gaps in statistical validation, probability calibration, and reproducibility found in prior research. The FT-Transformer captures higher-order feature interactions using self-attention, while XGBoost captures gradient-boosted decision boundaries with complementary inductive biases. Class imbalance is handled using class-weighted loss functions, thereby avoiding synthetic oversampling and preserving minority-class distributions. The models are ensembled using out-of-fold (OOF) stacking with a logistic regression meta-learner, which recalibrates overconfident base model outputs and learns optimal combination weights. On a public bank churn dataset, the hybrid model achieves 62.10% F1, 0.861 AUC-ROC, and 0.647 PR-AUC, outperforming the Multi-Layer Perceptron (MLP) baseline by 3.37 F1 points and 0.027 AUC under 5x5 cross-validation with 95% confidence intervals reported. Ablation studies demonstrate that both the transformer component and stacking strategy contribute materially to performance. The proposed methodology offers a reproducible and extensible reference architecture for contemporary churn prediction on structured tabular data.
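
The abstract says class imbalance is handled with class-weighted loss functions rather than synthetic oversampling. A minimal sketch of that idea follows, using the common "balanced" heuristic `n_samples / (n_classes * class_count)`; the weighting scheme, function names, and toy data are assumptions, since the abstract does not give the exact weights used.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_bce(
    labels: Sequence[int],
    probs: Sequence[float],
    weights: Dict[int, float],
    eps: float = 1e-12,
) -> float:
    """Mean binary cross-entropy with a per-class weight on each sample."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[y] * loss
    return total / len(labels)


# Toy data: three retained customers (0) and one churner (1).
labels = [0, 0, 0, 1]
probs = [0.1, 0.2, 0.1, 0.3]
weights = balanced_class_weights(labels)
print(weights)
print(round(weighted_bce(labels, probs, weights), 4))
```

Because the minority class receives the larger weight, a missed churner contributes more to the loss than an equally confident mistake on a retained customer, while the training data itself is left unchanged.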

2606.07633 2026-06-09 cs.CV cs.AI 交叉投稿

AMN: An Adaptive Multi-Scale Fusion Network with Boundary and Uncertainty Modeling for Nuclei Segmentation

AMN:一种用于细胞核分割的具有边界和不确定性建模的自适应多尺度融合网络

Spoorthi M, Suja Palaniswamy

发表机构 * Department of Computer Science & Engineering, Amrita School of Computing, Bengaluru, Amrita Vishwa Vidyapeetham, India

AI总结 提出AMN双编码器分割框架,融合Swin Transformer和ResNet-50特征金字塔,通过门控机制动态加权,结合多目标损失,在CoNIC基准上平均Dice 0.82,F1 0.68,优于八种基线模型。

详情
AI中文摘要

组织病理学图像中细胞核亚型的准确分类对于下游任务(包括肿瘤分级、免疫浸润量化和预后预测)至关重要。现有方法孤立地依赖卷积或基于Transformer的编码器,限制了它们同时捕捉细粒度局部纹理和长程空间上下文的能力。我们提出了AMN(自适应多尺度细胞核网络),一种双编码器分割框架,联合利用Swin Transformer和ResNet-50特征金字塔,通过学习的逐通道门控机制动态权衡每个编码器在每个尺度的贡献。AMN使用多目标损失进行训练,该损失结合了类别加权焦点损失、具有正像素强调的边界感知损失以及一种新颖的不确定性调制分类项,用于抑制过度自信的错误预测。在涵盖七个细胞核类别的CoNIC基准上评估,AMN实现了平均Dice 0.82和平均F1 0.68,在诊断上具有挑战性的淋巴细胞类别上F1为0.67。AMN优于八种基线模型,包括纯CNN、纯Transformer和最近的混合架构:U-Net、ResU-Net、DeepLabV3+、SegNet、ViT-Small、HmsU-Net、ConvFormer-UNet和BEFUnet。在MoNuSeg上的跨数据集评估证明了无需重新训练的强泛化能力,验证了所学表示的领域鲁棒性。

英文摘要

Accurate classification of nuclei subtypes in histopathology images is critical for downstream tasks including tumor grading, immune infiltrate quantification, and prognosis prediction. Existing approaches rely on either convolutional or transformer-based encoders in isolation, limiting their ability to simultaneously capture fine-grained local texture and long-range spatial context. We present AMN (Adaptive Multi-Scale Nuclei Network), a dual-encoder segmentation framework that jointly leverages a Swin Transformer and a ResNet-50 feature pyramid, fused via a learned per-channel gating mechanism that dynamically weighs each encoder's contribution at every scale. AMN is trained with a multi-objective loss combining class-weighted focal loss, boundary-aware loss with positive-pixel emphasis, and a novel uncertainty-modulated classification term that suppresses overconfident erroneous predictions. Evaluated on the CoNIC benchmark across seven nuclei classes, AMN achieves a mean Dice of 0.82 and mean F1 of 0.68, with an F1 of 0.67 on the diagnostically challenging lymphocyte class. AMN outperforms eight baseline models spanning pure-CNN, pure-transformer, and recent hybrid architectures: U-Net, ResU-Net, DeepLabV3+, SegNet, ViT-Small, HmsU-Net, ConvFormer-UNet, and BEFUnet. Cross-dataset evaluation on MoNuSeg demonstrates strong generalization without retraining, validating the domain robustness of the learned representations.
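
The per-channel gated fusion described above can be illustrated with a small pure-Python sketch. It assumes the gate is a sigmoid applied to one logit per channel and that the two encoder outputs are combined convexly; both choices are assumptions, since the abstract does not give the gate's exact form. In a real network the logits would typically be produced from the features by a small learned layer instead of being passed in.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    cnn_features: Sequence[Sequence[float]],
    transformer_features: Sequence[Sequence[float]],
    gate_logits: Sequence[float],
) -> List[List[float]]:
    """Fuse two feature maps channel by channel.

    fused[c] = g[c] * cnn[c] + (1 - g[c]) * transformer[c], with g = sigmoid(logit).
    Each feature map is a list of channels; each channel is a flat list of values.
    """
    if not (len(cnn_features) == len(transformer_features) == len(gate_logits)):
        raise ValueError("both feature maps and the gate need the same channel count")
    fused = []
    for cnn_c, trf_c, logit in zip(cnn_features, transformer_features, gate_logits):
        g = sigmoid(logit)
        fused.append([g * a + (1.0 - g) * b for a, b in zip(cnn_c, trf_c)])
    return fused


# Two channels: the first is split evenly, the second leans on the CNN branch.
cnn = [[2.0, 4.0], [1.0, 1.0]]
transformer = [[0.0, 0.0], [3.0, 3.0]]
print(gated_fusion(cnn, transformer, gate_logits=[0.0, 4.0]))
```

Applying one such gate per pyramid level gives each scale its own balance between local texture (CNN branch) and long-range context (transformer branch).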

2606.07635 2026-06-09 cs.CV cs.AI 交叉投稿

NeuroAlign: Hierarchical Multimodal Fusion of Dynamic and Structural Neuroimaging for MCI Analysis

NeuroAlign: 用于MCI分析的动态与结构性神经影像的分层多模态融合

Xiongri Shen, Zhenxi Song, Jiaqi Wang, Yi Zhong, Leilei Zhao, Chenqi Xu, Linling Li, Yichen Wei, Lingyan Liang, Demao Deng, Luping Song, Ping Luan, Ahmed M. Anter, Shuqiang Wang, Baiying Lei, Zhiguo Zhang

发表机构 * Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术学院) School of Intelligence Science and Engineering, College of Artificial Intelligence, Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)人工智能学院智能科学与工程学院) School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) Guangdong Key Laboratory of Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University(深圳大学医学部生物医学工程学院广东省生物医学测量与超声成像重点实验室) Department of Radiology, The People’s Hospital of Guangxi Zhuang Autonomous Region, Guangxi Academy of Medical Sciences(广西壮族自治区人民医院放射科,广西医学科学院) Shenzhen Sixth People’s Hospital (Nanshan Hospital), Huazhong University of Science and Technology Union Shenzhen Hospital(华中科技大学协和深圳医院(深圳市第六人民医院)) School of Basic Medical Sciences, Shenzhen University(深圳大学基础医学院) Egypt-Japan University of Science and Technology (E-JUST)(埃及日本科技大学) School of Biomedical Engineering, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Shenzhen University Medical School(深圳大学医学部生物医学工程学院,国家地方联合医学超声关键技术工程实验室,广东省生物医学测量与超声成像重点实验室)

AI总结 提出NeuroAlign框架,通过双模态分层对齐和双域分层交互融合fMRI与DTI特征,实现MCI/SCD检测,并设计无梯度归因方法SAM进行特征分析。

详情
AI中文摘要

功能磁共振成像(fMRI)和弥散张量成像(DTI)的多模态神经影像融合为认知障碍分析提供了互补信息,但仍面临异构特征空间和表示不对齐的挑战。我们提出NeuroAlign,一个用于结构化多模态融合的分层框架。它引入了(1)双模态分层对齐(DMHA),该模块建模多尺度动态连接并对齐动态-静态和功能-结构嵌入;以及(2)双域分层交互(DDHI),该模块实现连接级和区域级特征之间的细粒度调制和全局交互。为了支持特征级检查,我们设计了协同激活映射(SAM),一种针对DFC、SFC、ALFF和FA的无梯度、面向标记的归因方法。在GUTCM、ADNI和OASIS数据集上通过五折验证评估,NeuroAlign在MCI/SCD检测中取得了竞争性结果,并展示了初步的跨数据集可迁移性。归因分析揭示了模态特异性和部分一致的脑区模式,为多模态表示分析提供了模型驱动的证据。

英文摘要

Multimodal neuroimaging fusion of functional MRI (fMRI) and diffusion tensor imaging (DTI) provides complementary information for cognitive impairment analysis, but remains challenged by heterogeneous feature spaces and misaligned representations. We propose NeuroAlign, a hierarchical framework for structured multimodal fusion. It introduces (1) Dual-Modal Hierarchical Alignment (DMHA), which models multi-scale dynamic connectivity and aligns dynamic-static and functional-structural embeddings; and (2) Dual-Domain Hierarchical Interaction (DDHI), which enables fine-grained modulation and global interaction between connectivity- and region-level features. To support feature-level inspection, we design Synergistic Activation Mapping (SAM), a gradient-free, marker-oriented attribution method for DFC, SFC, ALFF, and FA. Evaluated on GUTCM, ADNI, and OASIS under five-fold validation, NeuroAlign achieves competitive MCI/SCD detection and preliminary cross-dataset transferability. Attribution analyses reveal modality-specific and partially consistent brain patterns, providing model-derived evidence for multimodal representation analysis.

2606.07648 2026-06-09 cs.CV cs.AI 交叉投稿

AQIFormer: A Transformer-Based Multi-View Architecture for Cross-City Air Quality Classification

AQIFormer:一种基于Transformer的多视角架构用于跨城市空气质量分类

Om Kathalkar, Nitin Nilesh, Sachin Chaudhari, Anoop Namboodiri

发表机构 * IIIT Hyderabad(印度海得拉巴国际信息技术学院)

AI总结 提出AQIFormer,一种基于Transformer的集成架构,通过前后视图融合、天气感知注意力和多任务学习,在跨城市空气质量分类中达到89.96%准确率,比现有方法提升14.96%。

Comments Accepted at ICVGIP 2025 (Indian Conference on Computer Vision, Graphics and Image Processing), 9 pages, 4 figures

详情
AI中文摘要

空气污染是全球最严峻的环境和公共卫生挑战之一,传统的基于传感器的监测系统面临显著的可扩展性和经济性限制。基于图像的空气质量估计已成为一种有前景的替代方案,利用交通场景中大气污染物的视觉特征。然而,现有方法存在跨城市泛化能力有限以及对多视角信息利用不足的问题。我们提出AQIFormer,一种新颖的基于Transformer的集成架构,通过创新的双视图融合、天气感知注意力机制和全面的多任务学习来解决这些根本性限制。我们的方法独特地将前后交通图像与气象参数相结合,以实现跨不同城市环境的稳健空气质量分类。在包含26,678个同步前后图像对的综合数据集上进行的大量评估表明,该模型性能良好,准确率达到89.96%,比现有最优方法提高了14.96%。最重要的是,我们的模型保持了出色的跨城市泛化能力,在印度那格浦尔收集的独立数据集上达到81.67%的准确率,通过少量样本自适应仅用极少的训练样本,性能下降仅为8.29%。

英文摘要

Air pollution represents one of the most critical environmental and public health challenges globally, with traditional sensor-based monitoring systems facing significant scalability and economic constraints. Image-based air quality estimation has emerged as a promising alternative, leveraging the visual characteristics of atmospheric pollutants in traffic scenes. However, existing methods suffer from limited cross-city generalization and inadequate exploitation of multi-view perspectives. We present AQIFormer, a novel transformer-based ensemble architecture that addresses these fundamental limitations through innovative dual-view integration, weather-aware attention mechanisms, and comprehensive multi-task learning. Our approach uniquely combines front and rear traffic imagery with meteorological parameters to achieve robust air quality classification across diverse urban environments. Extensive evaluation on a comprehensive dataset of 26,678 synchronized front-rear image pairs demonstrates 89.96% accuracy, representing a 14.96% improvement over state-of-the-art methods. Most importantly, our model maintains exceptional cross-city generalization capabilities, achieving 81.67% accuracy on an independent dataset collected in Nagpur, India, a drop of only 8.29 percentage points, using few-shot adaptation with minimal training samples.

2606.07665 2026-06-09 cs.PL cs.AI 交叉投稿

AgentCompile: An LLM-Guided Compiler for Direct CUDA Inference

AgentCompile:一种用于直接CUDA推理的LLM引导编译器

Xuanzhe Li, Ziyan Weng, Zhiyu Zhu, Junhui Hou

发表机构 * City University of Hong Kong (Dongguan)(香港城市大学(东莞)) City University of Hong Kong(香港城市大学)

AI总结 提出AgentCompile,利用LLM提供语义建议,通过模板生成CUDA候选实现并验证,在多个Transformer模型上实现4-5.7倍加速。

Comments 11 pages, 3 figures

详情
AI中文摘要

Transformer推理日益依赖专门的编译器和运行时支持,但实际模型图仍需要语义决策,以确定哪些区域值得专门化以及哪些CUDA实现族是可行的。我们提出AgentCompile,一种LLM引导的CUDA推理编译器,仅将LLM输出用作建议性搜索元数据。给定编译器生成的区域摘要和有界候选空间,LLM提出语义标签、候选优先级、参数提示和风险注释;编译器通过模板生成CUDA候选,检查接口和硬件约束,经验性验证候选,根据测量延迟选择实现,并在专门化不受支持或无利可图时回退。在端到端自回归生成中,AgentCompile在五个代表性工作负载上,相对于PyTorch eager模式,在Qwen3-1.7B、Qwen3-4B和Llama-3.2-1B-Instruct上分别实现了平均5.66倍、4.05倍和4.26倍的加速。我们将开源该项目。

英文摘要

Transformer inference increasingly depends on specialized compiler and runtime support, but real model graphs still require semantic decisions about which regions are worth specializing and which CUDA implementation families are plausible. We present AgentCompile, an LLM-guided CUDA inference compiler that uses LLM outputs only as advisory search metadata. Given compiler-derived region summaries and bounded candidate spaces, the LLM proposes semantic labels, candidate priorities, parameter hints, and risk annotations; the compiler materializes CUDA candidates through templates, checks interface and hardware constraints, validates candidates empirically, selects implementations by measured latency, and falls back when specialization is unsupported or unprofitable. In end-to-end autoregressive generation, AgentCompile averages 5.66x, 4.05x, and 4.26x speedup over PyTorch eager on Qwen3-1.7B, Qwen3-4B, and Llama-3.2-1B-Instruct, respectively, across five representative workloads. We will open-source the project.
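
The validate, measure, and fall-back loop described in the abstract can be sketched with ordinary Python callables standing in for CUDA candidates. Everything here is hypothetical scaffolding (function names, the scalar tolerance check, the best-of-N timing); it is meant to show the control flow, not AgentCompile's actual interfaces.

```python
import time
from typing import Callable, Dict, Sequence, Tuple


def measure_latency(fn: Callable[..., float], args: Sequence, repeats: int = 5) -> float:
    """Return the best-of-N wall-clock time for one call, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def select_implementation(
    reference: Callable[..., float],
    candidates: Dict[str, Callable[..., float]],
    args: Sequence,
    tolerance: float = 1e-6,
) -> Tuple[str, Callable[..., float]]:
    """Pick the fastest candidate that matches the reference, else fall back."""
    expected = reference(*args)
    best_name, best_fn = "reference", reference
    best_latency = measure_latency(reference, args)
    for name, fn in candidates.items():
        try:
            result = fn(*args)
        except Exception:
            continue  # unsupported candidate: skip it
        if abs(result - expected) > tolerance:
            continue  # failed empirical validation
        latency = measure_latency(fn, args)
        if latency < best_latency:  # keep only profitable specializations
            best_name, best_fn, best_latency = name, fn, latency
    return best_name, best_fn


def reference_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total


data = [float(i) for i in range(10_000)]
candidates = {
    "builtin_sum": lambda values: float(sum(values)),
    "broken": lambda values: 0.0,  # wrong result, must be rejected
}
name, impl = select_implementation(reference_sum, candidates, (data,))
print(name)
```

Because selection depends only on validation and measured latency, a bad suggestion can cost search time but cannot change results, which is the sense in which the LLM output is purely advisory.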

2606.07669 2026-06-09 cs.CV cs.AI 交叉投稿

MemoVAD: Resource-Efficient Video Anomaly Detection via Dynamic Semantic Memory in Edge Computing Scenarios

MemoVAD: 边缘计算场景下基于动态语义记忆的资源高效视频异常检测

Guo Li, Jiandian Zeng, Yang Li, Zihao Peng, Ke Chen, Tian Wang

发表机构 * Institute of Artificial Intelligence and Future Networks, Beijing Normal University(北京师范大学人工智能与未来网络研究院) School of Computing and Artificial Intelligence, Southwest Jiaotong University(西南交通大学计算机与人工智能学院) Engineering Research Center of Cloud-Edge Intelligent Collaboration on Big Data, Ministry of Education, Beijing Normal University(北京师范大学大数据云边智能协同教育部工程研究中心)

AI总结 提出MemoVAD边缘-云协同框架,通过不确定性感知门控策略选择性调用云端视觉语言模型,并设计动态语义记忆缓存原型,在降低通信开销的同时提升视频异常检测性能。

Comments Accepted by IJCAI2026

详情
AI中文摘要

在真实监控场景中部署视频异常检测(VAD)面临着对高层语义的需求以确保有效性,与边缘设备有限计算资源之间的根本矛盾。视觉语言模型(VLM)提供了丰富的开放词汇语义,但其延迟和计算成本阻碍了设备端部署。为解决这一挑战,我们提出MemoVAD,一种边缘-云协同框架,选择性地将VLM语义融入流式VAD。MemoVAD在边缘端使用轻量级检测器和因果时序上下文编码器(TCE)建模时序依赖,运行大部分推理。具体而言,我们引入基于主观逻辑的不确定性感知门控(UAG)策略,以建模感知不确定性,并仅对高不确定性和语义新颖的片段查询云端VLM。此外,设计动态语义记忆(DSM)缓存经VLM验证的原型以实现高效检索,使边缘模型通过语义适配器逐步融入VLM级语义。在真实边缘设备上对UCF-Crime和XD-Violence数据集的实验表明,MemoVAD在显著降低通信开销的同时,超越了当前最优性能。

英文摘要

Deploying Video Anomaly Detection (VAD) in real-world surveillance faces a fundamental tension between the demand for high-level semantics to ensure effectiveness and the limited computational resources of edge devices. Vision-Language Models (VLMs) provide rich open-vocabulary semantics, but their latency and computational cost preclude on-device deployment. To address the challenge, we propose MemoVAD, an edge-cloud collaborative framework that selectively incorporates VLM semantics into streaming VAD. MemoVAD runs most inference on the edge with a lightweight detector and a causal Temporal Context Encoder (TCE) to model temporal dependencies. Specifically, we introduce an Uncertainty-Aware Gating (UAG) policy grounded in Subjective Logic to model perceived uncertainty and query the cloud-based VLM only for high-uncertainty and semantically novel clips. Besides, a Dynamic Semantic Memory (DSM) is designed to cache VLM-verified prototypes for efficient retrieval, enabling the edge model to progressively incorporate VLM-level semantics via a semantic adapter. Experiments on UCF-Crime and XD-Violence datasets via a real edge device show that MemoVAD substantially reduces communication overhead while surpassing state-of-the-art performance.
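
A rough sketch of how an uncertainty-aware gate of this kind could work is given below. It uses the standard Subjective Logic mapping from non-negative class evidence to an uncertainty mass, u = K / (sum of evidence + K), and treats a clip as "semantically novel" when no cached prototype is close in cosine similarity. The thresholds, the function names, and the list-of-vectors memory are assumptions for illustration, not details taken from the paper.

```python
import math
from typing import List, Sequence


def subjective_uncertainty(evidence: Sequence[float]) -> float:
    """Subjective Logic uncertainty mass for K classes: u = K / (sum(e) + K)."""
    k = len(evidence)
    return k / (sum(evidence) + k)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def should_query_cloud(
    evidence: Sequence[float],
    embedding: Sequence[float],
    memory: List[Sequence[float]],
    u_thresh: float = 0.5,    # hypothetical uncertainty threshold
    sim_thresh: float = 0.8,  # hypothetical novelty threshold
) -> bool:
    """Query the cloud model only for uncertain AND novel clips."""
    if subjective_uncertainty(evidence) <= u_thresh:
        return False  # the edge detector is confident enough
    nearest = max((cosine(embedding, p) for p in memory), default=0.0)
    return nearest < sim_thresh  # no cached prototype covers this clip
```

With zero evidence the uncertainty is 1.0, so an empty memory always triggers a cloud query; once a matching prototype is cached, the same clip is handled on the edge.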

2606.07673 2026-06-09 cs.SD cs.AI cs.LG 交叉投稿

A Hierarchical Feature Engineering Framework for Automated Classification of Phonotraumatic and Non-Phonotraumatic Vocal Hyperfunction

声带创伤性与非声带创伤性声音亢进的自动分类的分层特征工程框架

June-Woo Kim, Kangwook Jang, Minu Kim, Hyunju Lee

发表机构 * Department of Electronic Engineering, Wonkwang University(圆光大学电子工程系) AI Convergence Research Institute, Wonkwang University(圆光大学人工智能融合研究院) GIST InnoCORE AI-Nano Convergence Institute for Early Detection of Neurodegenerative Diseases, Gwangju Institute of Science and Technology(光州科学技术院GIST InnoCORE AI-Nano神经退行性疾病早期检测融合研究所) School of Electrical Engineering, KAIST(韩国科学技术院电气工程学院) Department of AI Convergence, Gwangju Institute of Science and Technology(光州科学技术院人工智能融合系)

AI总结 提出分层特征工程框架,包括静态、动态、比率和耦合特征,用于区分声带创伤性和非声带创伤性声音亢进,发现耦合特征对两类分类均关键,PVH AUC 0.891,NPVH AUC 0.728。

Comments Interspeech 2026

详情
AI中文摘要

动态颈部表面加速度能够实现声音亢进的无创监测,但其亚型的稳健生物标志物仍然有限。本研究利用NeckVibe Challenge数据集区分声带创伤性(PVH)和非声带创伤性(NPVH)声音亢进与健康对照组。我们提出一个分层特征工程框架,包括:(i)静态特征,(ii)动态特征,(iii)基于比率的特征,(iv)捕捉源-滤波器交互的耦合特征。单变量统计分析显示PVH具有强可分性,但NPVH显著性有限,而我们针对高维特征集成优化的机器学习流程发现,耦合特征对两项任务都至关重要。我们实现了PVH的AUC为0.891,NPVH的AUC为0.728,表明虽然PVH近似线性可分,但NPVH的区分受益于非线性特征交互建模。

英文摘要

Ambulatory neck-surface acceleration enables non-invasive monitoring of vocal hyperfunction, yet robust biomarkers for its subtypes remain limited. This study investigates the NeckVibe Challenge dataset to distinguish phonotraumatic (PVH) and non-phonotraumatic (NPVH) vocal hyperfunction from healthy controls. We propose a hierarchical feature engineering framework comprising: (i) static, (ii) dynamic, (iii) ratio-based, and (iv) coupling features capturing source-filter interactions. While univariate statistical analysis shows strong separability for PVH but limited significance for NPVH, our machine learning pipeline, tailored for high-dimensional feature integration, identifies that coupling features are crucial for both tasks. We achieve an AUC of 0.891 for PVH and 0.728 for NPVH, suggesting that while PVH is near-linearly separable, NPVH discrimination benefits from modeling non-linear feature interactions.

2606.07676 2026-06-09 q-bio.GN cs.AI 交叉投稿

Single-Cell Cross-Modal Transfer by Adversarial Fine-Tuning of Foundation Models

通过基础模型的对抗微调实现单细胞跨模态迁移

Joseph Boyd, Matthew Lyon, Martino Mansoldo, Christian Hurry, Finnian Firth

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出利用单细胞基础模型进行对抗微调,实现未配对空间转录组与单细胞RNA测序数据的跨模态翻译,性能优于多组学翻译方法。

详情
AI中文摘要

空间转录组学(ST)是探索组织中依赖于结构、邻近性和相互作用的生物学特性的强大工具。支撑ST的方法正在快速发展,但在亚细胞尺度上分析数千个基因的能力有限。尽管从组织中解离,但已知单细胞RNA测序(scRNA-seq)中细胞的全转录组读数保留了其先前原位邻域的信息,这激发了恢复该信息的计算方法。虽然配对的ST和scRNA-seq数据集稀缺,但每种模态本身都很丰富。因此,我们提出在未配对的ST和scRNA-seq数据之间进行跨模态翻译。在这项工作中,我们展示了单细胞基础模型可以通过对抗微调执行这种翻译。我们证明了我们的方法优于为多组学翻译构建的方法。

英文摘要

Spatial transcriptomics (ST) is a powerful tool for exploring biological properties dependent on structure, proximity, and interaction in tissue. The methods underpinning ST are developing rapidly but are limited in their ability to profile many thousands of genes at a subcellular scale. Although dissociated from tissue, it is known that the whole-transcriptome readouts of cells in single-cell RNA sequencing (scRNA-seq) retain information about their former in situ neighbourhoods, motivating computational methods to recover it. While paired ST and scRNA-seq datasets are scarce, each modality in its own right is abundantly available. We therefore propose to perform cross-modal translation between unpaired ST and scRNA-seq data. In this work we show that a single-cell foundation model can perform this translation via adversarial fine-tuning. We demonstrate that our method performs favourably against methods built for multi-omics translation.

2606.07681 2026-06-09 cs.SE cs.AI cs.CE cs.MA 交叉投稿

Systematic LLM Translation of Legacy Scientific Code to Differentiable Frameworks: Application to a Land Surface Model

将遗留科学代码系统性地LLM翻译为可微分框架:以陆面模型为例

Aya Lahlou, Linnia Hawkins, Pierre Gentine

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) NASA Goddard Space Flight Center(国家航空航天局戈达德空间飞行中心)

AI总结 提出基于LLM的五阶段流水线,将遗留Fortran代码自动翻译为JAX可微分框架,在CLM-ml-v2模型上实现完整雅可比矩阵计算和24倍加速。

详情
AI中文摘要

可微分编程为科学建模提供了变革性能力,支持基于梯度的参数估计、敏感性分析和数据同化。然而,将遗留代码库迁移到可微分框架仍然是一个挑战。我们提出一个基于LLM的五阶段智能体流水线,将遗留Fortran代码翻译为JAX:静态依赖分析从完整调用图确定模块翻译顺序;迭代编译-修复循环自动纠正错误;Fortran参考预言机在模块级别强制数值一致性,然后进行集成和梯度验证。我们在CLM-ml-v2(一个19,000行的Fortran陆面模型)上实例化并评估该流水线,分析了73个模块翻译任务中的智能体行为。得到的可微分模型在单次反向传播中计算完整雅可比矩阵,以无梯度优化约八分之一的步数恢复物理参数,并在集合规模N=2,048时相对顺序执行的Fortran实现了24倍的墙钟时间加速。翻译后的模型和流水线基础设施作为可重用框架发布,用于对其他地球系统模型组件进行可微分化。

英文摘要

Differentiable programming offers transformative capabilities for scientific modeling, enabling gradient-based parameter estimation, sensitivity analysis, and data assimilation. Yet, migrating legacy codebases into differentiable frameworks remains a challenge. We present a five-phase LLM-based agentic pipeline that translates legacy Fortran into JAX: static dependency analysis determines module translation order from the full call graph; iterative compile-repair loops correct errors autonomously; and a Fortran reference oracle enforces numerical parity at the module level before integration and gradient verification. We instantiate and evaluate the pipeline on CLM-ml-v2, a 19,000-line Fortran land surface model, and analyze agent behavior across 73 module translation tasks. The resulting differentiable model computes the complete Jacobian in a single backward pass, recovers physical parameters in eight times fewer steps than gradient-free optimization, and achieves a 24 times wall-clock speedup over sequential Fortran at ensemble size N=2,048. Both the translated model and pipeline infrastructure are released as a reusable framework for differentiating other Earth system model components.
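
Two of the stages named in this abstract, ordering modules by their dependencies and checking numerical parity against a reference, can be sketched with the Python standard library. The module names, tolerances, and function names below are hypothetical, and the sketch leaves out the LLM translation and compile-repair steps entirely.

```python
import math
from graphlib import TopologicalSorter
from typing import Dict, List, Sequence, Set


def translation_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return modules so that each one comes after everything it depends on.

    deps maps a module name to the set of modules it calls or uses.
    Raises graphlib.CycleError if the dependency graph contains a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


def matches_reference(
    reference: Sequence[float],
    translated: Sequence[float],
    rel_tol: float = 1e-9,   # hypothetical tolerance
    abs_tol: float = 1e-12,  # hypothetical tolerance
) -> bool:
    """Module-level parity check between reference and translated outputs."""
    if len(reference) != len(translated):
        return False
    return all(
        math.isclose(r, t, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, t in zip(reference, translated)
    )


# Hypothetical three-module call graph.
deps = {
    "canopy_fluxes": {"leaf_physiology", "constants"},
    "leaf_physiology": {"constants"},
    "constants": set(),
}
order = translation_order(deps)  # "constants" is translated first
```

Translating leaf modules first means every module is checked against the reference with already-verified dependencies underneath it, which keeps failures local to the module under translation.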

2606.07685 2026-06-09 cs.LG cs.AI 交叉投稿

Test-Time Adaptive Composition for Machine Learning as a Service (MLaaS) in IoT Environments

物联网环境下机器学习即服务(MLaaS)的测试时自适应组合

Deepak Kanneganti, Sajib Mistry, Sheik Mohammad Mostakim Fattah, Aneesh Krishna

AI总结 针对物联网环境中MLaaS组合因动态性而失效的问题,提出一种测试时自适应(TTA)组合框架,通过TTA感知可组合性模型和服务级自适应模型,在推理时调整服务并保持组合性能,显著降低计算时间。

详情
AI中文摘要

物联网(IoT)环境的动态性影响了机器学习即服务(MLaaS)组合的长期有效性。现有的自适应组合方法主要基于服务替换或重新组合,其中识别合适的替代服务既困难又耗时。为了解决这一问题,我们提出了一种新颖的测试时自适应(TTA)组合框架,用于物联网环境中的MLaaS。首先,我们引入了一个TTA感知的可组合性模型,以确定自适应服务是否仍然与现有组合兼容。接下来,我们设计了一个服务级自适应模型,在推理过程中调整单个服务,同时保持组合性能。实验结果表明,与传统的自适应方法相比,所提出的框架更有效地减少了计算时间。

英文摘要

The dynamic nature of Internet of Things (IoT) environments affects the long-term effectiveness of Machine Learning as a Service (MLaaS) compositions. Existing adaptive composition methods are mainly based on service replacement or re-composition, where identifying suitable substitutes is difficult and time-consuming. To address this, we propose a novel Test-Time Adaptive (TTA) composition framework for MLaaS in IoT environments. First, we introduce a TTA-aware composability model to determine whether adapted services remain compatible with the existing composition. Next, we design a service-level adaptation model to adjust individual services during inference while preserving composition performance. Experimental results demonstrate that the proposed framework reduces computational time more effectively than traditional adaptive approaches.

2606.07692 2026-06-09 cs.LG cs.AI cs.ET 交叉投稿

BCG-FM: A Foundation Model for Ambient Cardiac Health Sensing

BCG-FM:一种用于环境心脏健康感知的基础模型

Magnus Ruud Kjaer, Haejun Han, Ashish Neupane, David Q. Sun

发表机构 * Department of Computer Science and Engineering, University of California, San Diego(加州大学圣迭戈分校计算机科学与工程系)

AI总结 提出首个环境机械生物信号基础模型BCG-FM,利用床垫压电传感器无感采集心冲击图,通过14.6万人的275万小时数据预训练,在生物年龄估计上达到3.26年MAE,并实现15种健康状态的临床相关判别。

详情
AI中文摘要

可穿戴生物信号的基础模型在多项临床任务中已匹配或超越监督专家,但所有模型都依赖于需要用户主动操作的模态——佩戴设备或访问睡眠实验室。我们提出BCG-FM,首个用于环境机械生物信号的基础模型。嵌入床垫表面的压电传感器每晚无感记录心冲击图(BCG);我们使用参与者级对比学习,基于145,985名个体的总计275万小时夜间记录预训练BCG-FM,这是迄今为止最大的原始波形生物信号预训练语料库。冻结的BCG-FM嵌入在生物年龄估计上达到3.26年MAE(所有环境、非接触模态中最低报告值),并在15种自我报告健康状况和三个独立外部队列中产生临床相关的判别。仅500名标注参与者的预训练表示优于在3,372名参与者上训练的完全监督基线,且表示质量与对比批次大小呈对数线性关系。这些结果确立了环境、纵向机械生物信号作为健康基础模型的可行模态。

英文摘要

Foundation models for wearable biosignals have matched or exceeded supervised specialists across a range of clinical tasks, yet all rely on modalities that require deliberate user action--wearing a device or visiting a sleep lab. We introduce BCG-FM, the first foundation model for ambient mechanical biosignals. A piezoelectric sensor embedded in the bed surface records ballistocardiography (BCG) each night without user effort; we pretrain BCG-FM with participant-level contrastive learning on a total of 2.75 million hours of nightly recordings from 145,985 individuals, the largest raw-waveform biosignal pretraining corpus to date. Frozen BCG-FM embeddings achieve 3.26-year MAE on biological-age estimation (the lowest reported for any ambient, contactless modality) and yield clinically relevant discrimination across 15 self-reported health conditions and three independent external cohorts. Pretrained representations from only 500 labeled participants outperform a fully supervised baseline trained on 3,372, and representation quality scales log-linearly with contrastive batch size. These results establish ambient, longitudinal mechanical biosignals as a viable modality for health foundation models.

2606.07695 2026-06-09 cs.LG cs.AI 交叉投稿

DSFNet: Learning Dual-Domain Spectral Operators for Multi-Modality Spatio-Temporal Forecasting in Urban Transportation Systems

DSFNet:面向城市交通系统多模态时空预测的双域谱算子学习

Yongchao Li, Yang Li, Zhuoxuan Li, Jun Chen, Chu Zhang, Jinde Cao, Leszek Rutkowski

发表机构 * Southeast University(东南大学) Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies(江苏省现代城市交通技术协同创新中心) City University of Hong Kong(香港城市大学) School of Mathematics, Southeast University(东南大学数学学院) Systems Research Institute of the Polish Academy of Sciences(波兰科学院系统研究所) Luoyang Normal University(洛阳师范学院) Purple Mountain Laboratories(紫金山实验室) AGH University of Krakow(AGH科技大学)

AI总结 提出双域谱滤波网络DSFNet,通过特征域和空间域谱算子分解空间-模态交互,显式建模跨变量耦合与异质空间依赖,结合外部门控机制自适应调节时间动态,在五个真实交通数据集上MAE降低3.21%-10.16%。

详情
AI中文摘要

多模态时空预测(MoSTF)通过引入多样化的交通模态扩展了传统的时空预测。尽管近年来在时空建模方面取得了显著进展,现有方法往往未能显式建模不同模态变量之间的耦合关系。准确的MoSTF具有挑战性,因为它需要建模(1)外生影响下的时间动态异质性和(2)异质空间依赖性以及复杂的跨变量耦合。为了解决这些挑战,我们提出了双域谱滤波网络(DSFNet)。我们的框架采用双域谱滤波来捕获异质空间模式并显式建模变量之间的关系。与基于图的消息传递或节点-模态对上的密集注意力不同,DSFNet将空间-模态交互分解为特征域和空间域谱算子,从而实现了非局部依赖和跨模态耦合的可扩展建模。此外,我们引入了一种外部门控机制,以自适应地调节外部影响下的时间动态。我们通过在五个代表性真实世界交通数据集上的大量实验验证了我们的方法。与次优基线相比,DSFNet在这些数据集上将MAE降低了3.21%-10.16%。结果表明,DSFNet在准确性上显著优于现有最先进基线,同时表现出高效性和鲁棒性。

英文摘要

Multi-Modality Spatio-Temporal Forecasting (MoSTF) extends traditional spatio-temporal forecasting by incorporating diverse traffic modalities. Despite significant recent strides in spatio-temporal modeling, existing approaches often fail to explicitly model the coupling relationships between different modality variables. Accurate MoSTF is challenging, as it requires modeling (1) temporal dynamic heterogeneity under exogenous influences and (2) heterogeneous spatial dependencies alongside complex cross-variable couplings. To address these challenges, we propose the Dual-Domain Spectral Filtering Network (DSFNet). Our framework employs dual-domain spectral filtering to capture heterogeneous spatial patterns and explicitly model the relationships between variables. Unlike graph-based message passing or dense attention over node-modality pairs, DSFNet factorizes space-modality interactions into feature-domain and spatial-domain spectral operators, enabling scalable modeling of nonlocal dependencies and cross-modality couplings. Furthermore, we introduce an external gating mechanism to adaptively regulate temporal dynamics under external influences. We validate our method through extensive experiments on five representative real-world traffic datasets. Compared with the second-best baselines, DSFNet reduces MAE by 3.21%-10.16% across these datasets. The results demonstrate that DSFNet significantly outperforms existing state-of-the-art baselines in accuracy while exhibiting efficiency and robustness.

2606.07697 2026-06-09 physics.ao-ph cs.AI 交叉投稿

TianJi-Environ: An Autonomous AI Scientist for Atmospheric Environmental Research

TianJi-Environ: 用于大气环境研究的自主人工智能科学家

Haoluo Zhao, Hongchun Zhang, Nan Li, Jing-Jia Luo, Kaikai Zhang, Mengyang Yu, Nan Chen, Tao Song, Fan Meng

发表机构 * School of Artificial Intelligence, Nanjing University of Information Science and Technology(南京信息工程大学人工智能学院) State Key Laboratory of Climate System Prediction and Risk Management (CPRM), Nanjing University of Information Science and Technology(南京信息工程大学气候系统预测与风险管理国家重点实验室) College of Environmental Science and Engineering, Nanjing University of Information Science and Technology(南京信息工程大学环境科学与工程学院) College of Computer Science and Technology, China University of Petroleum(中国石油大学(华东)计算机科学与技术学院)

AI总结 提出基于WRF-Chem的多智能体框架TianJi-Environ,自主驱动复杂大气化学模拟,实现机制假设的可执行配置、实验设计和证据标准,并通过臭氧和颗粒物案例验证其可审计的机制验证能力。

Comments 20 pages, 11 figures, 2 tables

详情
AI中文摘要

随着大气环境预测的持续改进,污染机制和反馈过程的可解释验证已成为大气化学的主要挑战。然而,基于复杂数值模型的机制验证仍然严重依赖专家知识:机制假设必须转化为可执行的实验,模型输出必须组织成可追溯的证据。我们提出了TianJi-Environ,一个用于大气化学机制验证的可审计AI科学家。TianJi-Environ建立了首个基于WRF-Chem的多智能体框架,自主驱动复杂的大气化学模拟,将机制假设转化为可执行的配置、测试实验和证据标准。以臭氧响应和颗粒物反馈作为两个代表性例子,我们展示了TianJi-Environ的机制验证能力。在华北平原的一个夏季臭氧案例中,系统在短波辐射和边界层高度中检测到方向一致的气溶胶-辐射相互作用信号,但判断臭氧对NOx控制的响应证据不完整。在关中盆地的一个冬季PM2.5案例中,系统将不支持的联系定位到黑碳扰动到颗粒物响应的传播不足以及垂直吸收加热的诊断缺失。这些结果表明,TianJi-Environ使专家驱动的机制验证变得明确、结构化和可审计,为多智能体系统与复杂大气化学模型的耦合提供了可复现的范式。

英文摘要

As atmospheric environmental prediction continues to improve, interpretable validation of pollution mechanisms and feedback processes has become a main challenge in atmospheric chemistry. Yet mechanism validation based on complex numerical models still relies heavily on expert knowledge: mechanistic hypotheses must be operationalized into executable experiments, and model outputs must be organized into traceable evidence. We present TianJi-Environ, an auditable AI Scientist for atmospheric-chemistry mechanism validation. TianJi-Environ establishes the first WRF-Chem-based multi-agent framework that autonomously drives complex atmospheric-chemistry simulations, converting mechanistic hypotheses into executable configurations, testing experiments, and evidence criteria. Using ozone response and particulate-matter feedback as two representative examples, we demonstrate TianJi-Environ's capability for mechanism validation. In a summertime ozone case over the North China Plain, the system detects directionally consistent aerosol-radiation-interaction signals in shortwave radiation and boundary-layer height, but judges the evidence for ozone response to NOx control to be incomplete. In a wintertime PM2.5 case over the Guanzhong Basin, it localizes the unsupported link to insufficient propagation from black-carbon perturbation to particulate response and missing diagnostics of vertical absorptive heating. These results show that TianJi-Environ makes expert-driven mechanism validation explicit, structured, and auditable, offering a reproducible paradigm for multi-agent systems coupled with complex atmospheric-chemistry models.

2606.07712 2026-06-09 cond-mat.mtrl-sci cs.AI 交叉投稿

MatMind: A Structure-Activity Knowledge-Driven Generative Foundation Model for Materials Science

MatMind:面向材料科学的结构-活性知识驱动生成基础模型

Zhan'ao Yao, Boxuan Zhang, Jingyuan Shu, Xiaoyu Wu, Rongyan Wang, Linjing Li, Dajun Zeng, Yudong Yao, Tingwei Chen, Youwei Wang, Xiaolin Zhao, Jiahui Shi, Jianjun Liu

发表机构 * State Key Laboratory of High Performance Ceramics(高性能陶瓷国家重点实验室) Shanghai Institute of Ceramics, Chinese Academy of Sciences(中国科学院上海陶瓷研究所) Center of Materials Science and Optoelectronics Engineering, University of Chinese Academy of Sciences(中国科学院大学材料科学与光电子工程中心) School of Chemistry and Materials Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州先进研究所化学与材料科学学院) State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(多模态人工智能系统国家重点实验室,中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Beijing Wenge Technology Co., Ltd.(北京文格科技有限公司) College of Medicine and Biological Information Engineering, Northeastern University(东北大学医学与生物信息工程学院)

AI总结 提出MatMind,一种基于大语言模型的晶体材料生成基础模型,通过结构-活性知识注入、双头架构和物理信息强化学习,在性质预测、无条件生成和条件生成任务上超越专用模型。

Comments 29 pages, 5 figures, including references

详情
AI中文摘要

迄今为止,AI驱动的晶体材料科学进展依赖于为单个任务构建的窄架构——用于性质预测的图神经网络、用于晶体生成的扩散和流匹配模型——每个都在其领域内表现出色,但无法作为跨整个材料问题谱系的共享骨干。生成式大语言模型提供了一种根本不同的范式,其中结构表示、定量预测和结构-活性推理可以在一个模型内统一,但材料学界尚未看到这种范式在竞争性水平上实现,与已建立的窄专家相匹敌。在此,我们提出MatMind,一种在此范式下专为晶体材料科学构建的生成基础模型,通过渐进训练框架中结构-活性知识和物理信息反馈的协调激活开发——结合结构-活性知识注入、在共享表示空间中联合训练语言推理和数值回归的双头架构,以及针对稳定性、新颖性和结构多样性的多目标物理信息强化学习。在三个任务族中,MatMind在能量高于凸包、体模量和带隙上取得最低平均绝对误差——超越专为这些任务构建的图神经网络预测器——在无条件晶体生成上达到65.3%的S.U.N.率,并在磁化密度条件生成上实现了可比的倍数提升,其中在超过600,000个训练条目中仅存在21个正样本。通过在单一统一模型内匹配或超越窄专家在其自身领域上的表现,MatMind表明基于LLM的范式可以作为晶体材料科学未来的可行骨干。

英文摘要

Progress in AI-driven crystal materials science has so far been carried by narrow architectures purpose-built for individual tasks -- graph neural networks for property prediction, diffusion and flow-matching models for crystal generation -- each excelling within its niche yet unable to act as a shared backbone across the full spectrum of materials problems. Generative large language models offer a fundamentally different paradigm, in which structural representation, quantitative prediction, and structure-activity reasoning can be unified within one model, but the materials community has yet to see this paradigm realized at a level competitive with established narrow specialists. Here we present MatMind, a generative foundation model purpose-built for crystal materials science under this paradigm, developed through the coordinated activation of structure-activity knowledge and physics-informed feedback within a progressive training framework -- combining structure-activity knowledge injection, a dual-head architecture that jointly trains language reasoning and numerical regression in a shared representation space, and multi-objective physics-informed reinforcement learning over stability, novelty, and structural diversity. Across three task families, MatMind attains the lowest mean absolute error on energy above hull, bulk modulus, and band gap -- surpassing graph neural network predictors purpose-built for these tasks -- reaches an S.U.N. rate of 65.3% on unconditional crystal generation, and achieves a comparable multiplicative improvement on magnetization-density-conditioned generation, where only 21 positive samples exist within over 600,000 training entries. By matching or surpassing narrow specialists on their own ground while operating within a single unified model, MatMind shows that the LLM-based paradigm can serve as a viable backbone for crystal materials science going forward.

2606.07714 2026-06-09 cs.LG cs.AI cs.HC 交叉投稿

Beyond Accuracy: Interpreting Topic Representation in Suicide Ideation Detection Models

超越准确率:解释自杀意念检测模型中的主题表示

Hamideh Ghanadian, Isar Nejadgholi, Hussein Al Osman

发表机构 * University of Ottawa(渥太华大学) National Research Council Canada(加拿大国家研究委员会)

AI总结 本研究通过可视化与几何分析,探究自杀意念检测模型内部如何编码心理风险因素,发现主题增强能提升低表征风险因素表示的清晰度与可解释性。

详情
AI中文摘要

自杀意念检测模型通常使用聚合性能指标进行评估,但对其内部如何表示具有心理意义的风险因素知之甚少。在高风险心理健康应用中,理解这些内部表示对于安全性、透明度和负责任部署至关重要。在这项工作中,我们超越准确率,分析在原始和主题增强数据集上训练的自杀检测模型如何在其内部表示空间中编码心理风险因素。通过可视化和几何分析,我们检查主题相关特征的连贯性和可分离性。我们的结果表明,主题感知增强提高了低表征心理社会风险因素(如移民、家庭问题和金融危机)的清晰度和区分度。这些发现表明,增强不仅提高了模型性能,还导致了更结构化和可解释的内部表示。

英文摘要

Suicide ideation detection models are typically evaluated using aggregate performance metrics, yet little is known about how they internally represent psychologically meaningful risk factors. In high-stakes mental health applications, understanding these internal representations is essential for safety, transparency, and responsible deployment. In this work, we move beyond accuracy and analyze how suicide detection models trained on original and topic-augmented datasets encode psychological risk factors in their internal representation space. Using visualization and geometric analysis, we examine the coherence and separability of topic-related features. Our results show that topic-aware augmentation increases the clarity and distinctness of underrepresented psychosocial risk factors such as immigration, family issues, and financial crisis. These findings suggest that augmentation not only improves model performance but also leads to more structured and interpretable internal representations.

2606.07717 2026-06-09 eess.IV cs.AI cs.CV 交叉投稿

Multi-planar 2D-U-Net Segmentation of 3D-CT Abdominal Organs augmented by Spatial Occurrence Maps

多平面2D-U-Net分割3D-CT腹部器官,辅以空间出现图

Daria Kern, Negar Chabi, Souraj Adhikary, Andre Mastmeyer

发表机构 * Glasgow Caledonian University School of Science & Engineering(格拉斯哥卡里多尼亚大学科学与工程学院) Jade University of Applied Sciences Department of Engineering & Medical Technology(雅德应用科学大学工程与医疗技术系)

AI总结 提出轻量级2D-U-Net框架,结合粗到细分割、多平面预测和模糊3D空间图,在80个CT扫描中使Dice系数提升约4%。

Comments 11 pages, 9 figures, 1 table, http://www.wscg.eu/

详情
AI中文摘要

本工作提出一个基于2D-U-Net的轻量级框架,用于在大视野3D CT扫描中分割五个腹部器官。该方法结合了粗到细分割、来自多个解剖平面的预测以及额外的模糊3D空间图,这些空间图提供解剖位置线索以提高分割精度。我们结合了由空间出现图增强的多平面2D-U-Net模型。该方法包括两个主要阶段。首先,通过使用2D-U-Net轴向遍历整个扫描并确定5个目标腹部器官的x-y-z最小和最大范围来检测腹部感兴趣区域。其次,我们在前一阶段的边界内使用空间出现图来增强我们的多平面2D-U-Net架构。该方法在来自各种公共来源的80个CT扫描上进行评估。结果显示,与未使用空间出现图训练的相同模型相比,Dice系数最大提升约4%。

英文摘要

This work proposes a lightweight 2D-U-Net-based framework for segmenting five abdominal organs in large field-of-view 3D CT scans. The method combines coarse-to-fine segmentation, predictions from multiple anatomical planes, and additional fuzzy 3D spatial maps that provide anatomical location cues to improve segmentation accuracy. We combine multi-planar 2D-U-Net models augmented by a spatial occurrence map. The approach involves two main stages. First, the abdominal volume of interest is detected by traversing the whole scan axially with a 2D-U-Net and determining the minimum and maximum x-y-z extents of the five abdominal organs of interest. Second, we use spatial occurrence maps to enhance our multi-planar 2D-U-Net architecture inside the bounds from the former stage. The method is evaluated on 80 CT scans from various public sources. The results show Dice improvements of about 4% at maximum compared to the same model trained without spatial occurrence maps.
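
For readers unfamiliar with the two quantities this abstract relies on, the sketch below computes a Dice coefficient and the per-axis minimum and maximum extents of a segmented region. Representing a mask as a set of `(x, y, z)` voxel coordinates is an assumption chosen to keep the example dependency-free; the paper's own data layout is not stated here.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def dice(pred: Set[Voxel], truth: Set[Voxel]) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|); 1.0 if both masks are empty."""
    if not pred and not truth:
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def bounding_extents(voxels: Set[Voxel]):
    """Return ((x_min, x_max), (y_min, y_max), (z_min, z_max)) of a mask.

    The mask must be non-empty; an empty mask has no extents.
    """
    xs, ys, zs = zip(*voxels)
    return (min(xs), max(xs)), (min(ys), max(ys)), (min(zs), max(zs))


pred = {(0, 0, 0), (1, 0, 0)}
truth = {(1, 0, 0), (2, 0, 0)}
score = dice(pred, truth)  # one shared voxel out of two per mask
```

Cropping the volume to these extents before the second stage is what lets the finer models work on a much smaller region than the full scan.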

2606.07828 2026-06-09 cs.SE cs.AI 交叉投稿

Jas: AI-Paired Engineering as a Revival of N-Version Programming

Jas:AI配对工程作为N版本编程的复兴

Jason Hickey

发表机构 * Independent(独立)

AI总结 本研究通过单开发者跨平台移植矢量图应用的案例,提出AI配对工程方法,结合精确YAML规范与并行实现作为差分测试层,使传统需多人年的工作变得可行,并视其为N版本编程的复兴。

详情
AI中文摘要

我报告了一个AI配对软件工程的案例研究:由单个开发者在约120个晚间小时内完成的五个矢量插图应用的工作移植,分别基于Rust、Swift、OCaml、Python和浏览器平台。该方法将AI辅助实现与两个保障措施配对——一个精确的可执行YAML规范作为单一事实来源,以及并行实现作为内置差分测试层。五个移植共享23,000行的规范;每个移植的原生代码范围从0到约95,000行,反映了规范的逃生口。我认为,在具备这两个保障措施的条件下,AI配对工程使得传统上需要多个开发者年的工作范围变得可行,并将该方法框架为N版本编程的复兴,这是一种因成本原因被放弃的1980年代方法,而AI改变了这一状况。论文报告了具体工件和单开发者案例研究的诚实局限性。

英文摘要

I report a case study in AI-paired software engineering: five working ports of a vector illustration application across Rust, Swift, OCaml, Python, and browser-based platforms, built by a single developer in approximately 120 evening hours. The methodology pairs AI-assisted implementation with two safeguards -- a precise executable YAML specification serving as the single source of truth, and parallel implementations functioning as a built-in differential-testing layer. The five ports share a 23,000-line specification; per-port native code ranges from 0 to roughly 95,000 lines, reflecting the specification's escape hatch. I argue that AI-paired engineering, conditional on these two safeguards, makes feasible a scope of work that conventionally requires multiple developer-years, and frame the methodology as a revival of N-version programming, a 1980s approach abandoned on cost grounds that AI changes. The paper reports concrete artifacts and honest limitations of the single-developer case study.
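
The idea of parallel implementations acting as a differential-testing layer can be sketched in a few lines. The three toy implementations and the function names are hypothetical; the abstract does not describe the project's actual harness. The sketch runs every implementation on the same inputs, records the inputs on which outputs differ, and names the implementations that disagree with the majority.

```python
from collections import Counter
from typing import Any, Callable, Dict, Iterable, List, Tuple


def differential_test(
    implementations: Dict[str, Callable[[Any], Any]],
    inputs: Iterable[Any],
) -> List[Tuple[Any, Dict[str, Any]]]:
    """Return (input, {name: output}) for every input with disagreement."""
    disagreements = []
    for x in inputs:
        outputs = {name: fn(x) for name, fn in implementations.items()}
        # Compare by repr so unhashable outputs (lists, dicts) also work.
        if len({repr(out) for out in outputs.values()}) > 1:
            disagreements.append((x, outputs))
    return disagreements


def outliers(outputs: Dict[str, Any]) -> List[str]:
    """Names of implementations that differ from the most common output."""
    majority, _count = Counter(map(repr, outputs.values())).most_common(1)[0]
    return sorted(n for n, out in outputs.items() if repr(out) != majority)


# Three hypothetical ports of the same function; port_c has a bug for x >= 3.
ports = {
    "port_a": lambda x: x * 2,
    "port_b": lambda x: x + x,
    "port_c": lambda x: x * 2 if x < 3 else 0,
}
found = differential_test(ports, range(5))
```

Majority voting only points at a likely culprit; when the ports are split evenly, the specification is the tie-breaker.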

2606.07836 2026-06-09 cond-mat.mtrl-sci cond-mat.stat-mech cs.AI physics.comp-ph quant-ph 交叉投稿

Agentic multi-fidelity learning of quasiparticle and excitonic properties

准粒子和激子性质的智能多保真学习

Arnab Neogi, Aaron Forde, Christopher A. Lane, Sergei Tretiak, Jian-Xin Zhu

发表机构 * Theoretical Division, Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室理论部) Center for Integrated Nanotechnologies, Materials Physics and Applications Division, Los Alamos National Laboratory(集成纳米技术中心,材料物理与应用部,洛斯阿拉莫斯国家实验室)

AI总结 提出智能引导的多保真框架,通过置信度加权和少量高精度参考点,结合机器学习校正GW-BSE计算中的数值不稳定性,准确预测应变MoS2-WS2双层中的准粒子带隙和激子结合能。

详情
AI中文摘要

多体GW-Bethe-Salpeter方程计算对于现代低维纳米材料中电子结构和光学性质的精确模拟至关重要。然而,这些方法计算量大,并且可能表现出局部数值不稳定性或收敛失败,在高通量工作流程中难以检测。我们引入了一个智能引导的多保真框架,用于校正应变MoS2-WS2双层中的GW-Bethe-Salpeter激发态景观。在不同堆叠配准、应变分支和倒空间采样下,该工作流程识别出与脆弱的长波介电屏蔽相关的尖峰状偏移、近零带隙塌缩和交叉保真不一致性。一个结构智能体通过分配置信度权重并选择性地使用少量高精度参考点来评估计算。然后,机器学习模型在相关系统间传递信息,并应用高斯过程校正来恢复改进的准粒子带隙和激子结合能,并带有校准的不确定性估计。该方法纠正了数值诱导的伪影,而不消除物理应变依赖性,并且与无智能体基线相比,显著提高了与更高保真度参考的一致性。这些结果表明,激发态材料的可靠替代学习需要明确诊断数值脆弱性,而不是直接插值原始第一性原理数据点。所提出的框架可轻松转移到其他以强量子限制为特征的光电纳米材料,例如量子点、纳米带、层状二维半导体和混合钙钛矿纳米结构。

英文摘要

Many-body GW-Bethe-Salpeter equation calculations are essential for accurate simulations of electronic structure and optical properties in modern low-dimensional nanomaterials. However, these methods are computationally demanding and can exhibit localized numerical instabilities or convergence failures that are difficult to detect within high-throughput workflows. We introduce an agent-guided multi-fidelity framework for correcting GW-Bethe-Salpeter excited-state landscapes in strained MoS2-WS2 bilayers. Across stacking registries, strain branches and reciprocal-space samplings, the workflow identifies spike-like excursions, near-zero-gap collapse and cross-fidelity inconsistencies associated with fragile long-wavelength dielectric screening. A structural agent evaluates calculations by assigning confidence weights and selectively using a small number of high-accuracy reference points. Machine learning models then transfer information across related systems and apply Gaussian process corrections to recover improved quasiparticle gaps and exciton binding energies, with calibrated uncertainty estimates. The approach corrects numerically induced artifacts without erasing physical strain dependence and substantially improves agreement with higher-fidelity references relative to a no-agent baseline. These results show that reliable surrogate learning for excited-state materials requires explicit diagnosis of numerical fragility, not direct interpolation of raw first-principles data points. The proposed framework is readily transferable to other optoelectronic nanomaterials characterized by strong quantum confinement, such as quantum dots, nanoribbons, layered two-dimensional semiconductors, and hybrid perovskite nanostructures.

2606.07907 2026-06-09 cs.CV cs.AI 交叉投稿

3D Oral Modelling with Improved Vertex Distribution Using Matching-Based Learning

基于匹配学习的改进顶点分布的3D口腔建模

Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm

AI总结 针对3D口腔重建中预测顶点分布不均的问题,提出结合匈牙利匹配过滤与排斥损失的改进损失函数,使顶点分布更均匀,虽精度略降但有效缓解了聚集现象。

Comments 5 pages, 7 figures. English version of a paper presented at the Korea Multimedia Society Conference, November 2025

详情
AI中文摘要

在我们之前的工作中,提出了一个基于深度学习的3D口内重建框架。该模型直接从十张固定角度的口内图像预测显式3D点云坐标,采用MobileNetV2和多头注意力进行多视图特征融合,并使用L1损失和倒角距离的组合作为损失函数。尽管模型达到了77.49%的准确率,但预测顶点倾向于集中在真实值的高密度区域,而其他区域大部分未被覆盖。在本文中,提出了一种改进的损失函数来解决这一局限性。引入了带过滤的匈牙利匹配和排斥损失,以强制重建模型上的顶点分布更加均匀。所提出的模型达到了68.02%的准确率,数值上低于之前的模型。然而,先前工作中观察到的顶点聚集问题得到了显著缓解,预测顶点在整个重建表面上分布更加均匀。

英文摘要

In our previous work, a deep learning-based framework for 3D intraoral reconstruction was proposed. The model directly predicts explicit 3D point cloud coordinates from ten fixed-angle intraoral images, employing MobileNetV2 and Multi-head Attention for multi-view feature fusion, with a combined L1 Loss and Chamfer Distance as the loss function. Although the model achieved an accuracy of 77.49%, predicted vertices tended to concentrate in high-density regions of the ground truth, leaving other regions largely uncovered. In this paper, an improved loss function is proposed to address this limitation. Hungarian matching with filtering and Repulsion Loss are introduced to enforce more uniform vertex distribution across the reconstructed model. The proposed model achieves an accuracy of 68.02%, which is numerically lower than the previous model. However, the vertex clustering issue observed in the prior work is substantially alleviated, with predicted vertices distributed more evenly across the entire reconstructed surface.
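
To make the two geometric terms in this abstract concrete, here is a minimal sketch of a symmetric Chamfer distance and one common hinge-style repulsion penalty. The exact loss forms, the radius, and the omission of the Hungarian matching step are assumptions; the paper's definitions may differ.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def chamfer(p: List[Point], q: List[Point]) -> float:
    """Symmetric Chamfer distance using mean nearest-neighbour distances."""
    def one_way(a: List[Point], b: List[Point]) -> float:
        return sum(min(math.dist(x, y) for y in b) for x in a) / len(a)
    return one_way(p, q) + one_way(q, p)


def repulsion(points: List[Point], radius: float) -> float:
    """Penalize every pair of points closer than radius (hinge form).

    This is an illustrative variant, not necessarily the paper's loss:
    each close pair contributes (radius - distance), distant pairs add 0.
    """
    total = 0.0
    for i, a in enumerate(points):
        for b in points[i + 1:]:
            total += max(0.0, radius - math.dist(a, b))
    return total


clustered = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0)]
spread = [(0.0, 0.0, 0.0), (2.5, 0.0, 0.0), (5.0, 0.0, 0.0)]
```

A Chamfer term alone is satisfied when predictions pile up on dense ground-truth regions; the repulsion term is zero for `spread` and positive for `clustered`, which is the pressure toward uniform coverage the abstract describes.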

2606.07923 2026-06-09 cs.DB cs.AI cs.LG 交叉投稿

Larch: Learned Query Optimization for Semantic Predicates

Larch: 面向语义谓词的学习型查询优化

Fuheng Zhao, Pawel Liskowski, Zihan Li, Benjamin Han, Puxuan Yu, Varich Boonsanong, Dimitris Tsirogiannis, Anupam Datta

发表机构 * Snowflake Inc.(Snowflake公司)

AI总结 提出Larch框架,利用嵌入增强的图神经网络和强化学习或监督学习优化AI SQL查询中语义过滤器的执行顺序,显著降低令牌开销。

详情
AI中文摘要

随着大型语言模型(LLM)的出现,许多数据库系统引入了语义运算符,使得能够对非结构化数据(如文本、图像、视频)进行分析查询。语义运算符通常会产生高昂的推理成本和延迟,使得语义(AI)SQL查询难以应用于大规模数据集。同时,其语义性质导致数据库引擎将其视为黑盒,使得AISQL查询难以优化。在本文中,我们介绍了Larch,一个用于优化AI SQL查询中语义过滤器执行的框架。Larch的灵感来自两个关键观察:i) 语义运算符的高延迟为计算密集型运行时优化技术留下了显著空间,ii) 非结构化数据通常伴随着嵌入形式的语义信息,允许在AI_FILTER提示和数据值之间进行高效的语义比较。基于这两个关键观察,我们提出了两种Larch变体:Larch-A2C和Larch-Sel。Larch-A2C使用嵌入增强的门控图神经网络编码任意语义过滤器表达式树,并将过滤器评估顺序表述为马尔可夫决策过程。相比之下,Larch-Sel利用监督学习模型预测过滤器选择性,随后应用动态规划为每个输入行找到接近最优的评估顺序。在多样化的真实世界数据集和全面的合成工作负载上进行评估,两种Larch变体在令牌使用方面始终优于现有的语义过滤器优化技术。我们的结果表明,Larch在不同工作负载下具有鲁棒性,与Palimpzest和Quest相比,将总令牌成本开销降低了3倍至19倍。

英文摘要

With the advent of Large Language Models (LLMs), many database systems have introduced semantic operators that enable analytical queries over unstructured data (e.g., text, images, videos). Semantic operators typically incur high inference costs and latencies, making semantic (AI) SQL queries challenging to apply to large-scale datasets. At the same time, their semantic nature leads database engines to treat them as black boxes, making AI SQL queries difficult to optimize. In this paper, we introduce Larch, a framework for optimizing the execution of semantic filters in AI SQL queries. Larch is inspired by two key observations: i) the high latency of semantic operators leaves significant room for computationally heavy runtime optimization techniques, and ii) unstructured data are typically accompanied by semantic information in the form of embeddings, allowing for efficient semantic comparisons between AI_FILTER prompts and data values. Based on these observations, we present two Larch variants: Larch-A2C and Larch-Sel. Larch-A2C encodes arbitrary semantic filter expression trees using an embedding-augmented Gated Graph Neural Network and formulates the filter evaluation order as a Markov decision process. In contrast, Larch-Sel leverages a supervised learning model to predict filter selectivities, subsequently applying dynamic programming to find a near-optimal evaluation order for each input row. Evaluated across diverse real-world datasets and comprehensive synthetic workloads, both Larch variants always outperform existing semantic filter optimization techniques in terms of token usage. Our results demonstrate that Larch is robust across diverse workloads, reducing total token cost overhead by 3x-19x compared to Palimpzest and Quest.
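
To see why filter order matters for token cost, consider the simplest case: a pure conjunction of independent filters. The sketch below uses the classical rank rule (sort by cost divided by one minus selectivity), which is a textbook baseline and not Larch's method; Larch handles arbitrary expression trees and per-row orders. The filter names, token costs, and selectivities are made up for illustration.

```python
from itertools import permutations

def expected_cost(order, cost, sel):
    """Expected per-row token cost of evaluating AND-ed filters in the given order."""
    total, pass_prob = 0.0, 1.0
    for f in order:
        total += pass_prob * cost[f]   # filter f runs only if all earlier filters passed
        pass_prob *= sel[f]            # probability that the row survives filter f
    return total

def rank_order(cost, sel):
    """Classical rank rule for independent conjuncts (assumes every selectivity < 1)."""
    return sorted(cost, key=lambda f: cost[f] / (1.0 - sel[f]))

# Hypothetical filters: token cost per call and pass probability.
cost = {"is_positive": 120, "mentions_refund": 80, "is_english": 40}
sel = {"is_positive": 0.5, "mentions_refund": 0.1, "is_english": 0.9}

best = rank_order(cost, sel)
brute_force = min(permutations(cost), key=lambda o: expected_cost(o, cost, sel))
```

Here the cheapest filter is not the best one to run first: the highly selective `mentions_refund` filter goes first because it removes most rows before the other filters are called.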

2606.08037 2026-06-09 cs.LG cs.AI 交叉投稿

SafeECGMatch: Calibration-Aware Joint Frequency and Time Space Semi-Supervised Learning for Open-Set ECG Classification

SafeECGMatch:面向开放集心电图分类的校准感知联合频率与时间空间半监督学习

Hongkyu Koh, Ikbeom Jang

发表机构 * Hankuk University of Foreign Studies(韩国外国语大学)

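AI summary Proposes SafeECGMatch, a framework that extracts time-frequency features with a dual-branch architecture and calibrates the model with adaptive label smoothing and temperature scaling, enabling reliable open-set classification and OOD detection under label distribution mismatch.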

Comments 8 pages. Accepted to the KDD-UC 2026 (ACM International Conference on Data Mining and Knowledge Discovery - Undergraduate Consortium 2026)

详情
AI中文摘要

心电图(ECG)分类模型常面临严重的标签稀缺问题,使得半监督学习(SSL)成为降低标注成本的有效策略。然而,在临床环境中,未标注数据池通常包含分布外(OOD)异常或标注集中不存在的诊断类别。标准SSL会强制对这些未见类别分配错误的伪标签,产生过度自信的预测。为解决此问题,我们提出SafeECGMatch,一个校准感知的安全SSL框架,用于标签分布不匹配下的单标签ECG分类。方法上,SafeECGMatch采用双分支架构,通过ECG特定的数据增强提取时频潜在表示。关键地,它通过自适应标签平滑和温度缩放动态对齐置信度与经验准确性,在时间和频谱域上校准多类分类器和OOD检测器。这种联合优化实现了可信的OOD拒绝和可靠的伪标签分配。在PTB-XL和PhysioNet/CinC Challenge基准上评估,SafeECGMatch达到了最先进的准确性和校准性能,推动了生理时间序列中可靠知识发现。代码可在https://github.com/labhai/SafeECGMatch获取。

英文摘要

Electrocardiogram (ECG) classification models often suffer from severe label scarcity, making semi-supervised learning (SSL) an attractive strategy for reducing annotation costs. In clinical settings, however, unlabeled pools frequently contain out-of-distribution (OOD) anomalies or diagnostic groups absent from the labeled set. Standard SSL forces incorrect pseudo-labels onto these unseen classes, producing overconfident predictions. To address this, we propose SafeECGMatch, a calibration-aware safe SSL framework for single-label ECG classification under label distribution mismatch. Methodologically, SafeECGMatch employs a dual-branch architecture extracting time-frequency latent representations via ECG-specific augmentations. Crucially, it dynamically aligns confidence with empirical accuracy through adaptive label smoothing and temperature scaling, calibrating both the multiclass classifier and the OOD detector across temporal and spectral domains. This joint optimization allows trustworthy OOD rejection and reliable pseudo-labeling. Evaluated on the PTB-XL and PhysioNet/CinC Challenge benchmarks, SafeECGMatch achieves state-of-the-art accuracy and calibration, advancing reliable knowledge discovery in physiological time-series. Code is available at https://github.com/labhai/SafeECGMatch.
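
The calibration tools mentioned in this abstract can be sketched in a few lines. This is a minimal illustration assuming a fixed temperature and a fixed smoothing factor; SafeECGMatch's adaptive schedule is not described in the abstract, and the function names and threshold below are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def smooth_labels(true_class, num_classes, eps):
    """Label smoothing: move eps of the probability mass to the other classes."""
    off = eps / (num_classes - 1)
    return [1.0 - eps if k == true_class else off for k in range(num_classes)]

def accept_pseudo_label(logits, temperature, threshold):
    """Keep a pseudo-label only if the calibrated confidence clears the threshold."""
    probs = softmax(logits, temperature)
    conf = max(probs)
    return (probs.index(conf), conf) if conf >= threshold else None

logits = [4.0, 1.0, 0.0]
raw = accept_pseudo_label(logits, temperature=1.0, threshold=0.9)
calibrated = accept_pseudo_label(logits, temperature=3.0, threshold=0.9)
```

With the raw logits the sample is pseudo-labeled; after temperature scaling the same sample falls below the threshold and is left unlabeled, which is how calibration can suppress overconfident pseudo-labels on unseen classes.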

2606.08153 2026-06-09 cs.LG cs.AI 交叉投稿

LogNEO: A GPT-Neo Reinforcement Learning Framework for Accurate Real-Time Log Anomaly Detection

LogNEO:基于GPT-Neo的强化学习框架用于精确实时日志异常检测

David Eje, Tanmay Sharma, Khush Patel, Manuel Mazzara, Leonard Johard

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

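AI summary Proposes LogNEO, which fine-tunes GPT-Neo with PPO and a position-aware reward, reaching F1-scores of 0.927, 0.913, and 0.984 on the HDFS, BGL, and Thunderbird benchmarks, improving recall by up to 6 percentage points over LogGPT, and achieving 45 ms end-to-end latency in a production deployment.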

Comments 8 pages, 5 figures, 6 tables

详情
AI中文摘要

检测大规模系统日志中的异常对于现代计算基础设施的可靠性和安全性至关重要。我们提出LogNEO,一个基于EleutherAI的GPT-Neo(13亿参数)构建的日志异常检测器,并通过一种新颖的部分信用、指数衰减位置感知奖励方案结合交叉熵正则化(使用近端策略优化PPO)进行微调。位置感知奖励显式建模预测难度:早期位置因正确预测获得更高奖励,而后期位置因错误受到更强惩罚。LogNEO在HDFS、BGL和Thunderbird基准上分别达到0.927、0.913和0.984的F1分数,在保持相当精度的同时,召回率比先前最先进的LogGPT提升高达6个百分点。基于Apache Kafka、Redis和TensorRT加速推理的生产微服务部署在每秒15000个事件下实现了45毫秒的端到端延迟。

英文摘要

Detecting anomalies in large-scale system logs is critical for the reliability and security of modern computing infrastructure. We present LogNEO, a log anomaly detector built on EleutherAI's GPT-Neo (1.3B parameters) and fine-tuned with a novel partial-credit, exponentially decaying position-aware reward scheme combined with cross-entropy regularisation via Proximal Policy Optimisation (PPO). The position-aware reward explicitly models prediction difficulty: early positions receive higher rewards for correct predictions, while later positions incur stronger penalties for errors. LogNEO attains F1-scores of 0.927, 0.913, and 0.984 on the HDFS, BGL, and Thunderbird benchmarks, improving recall by up to 6 percentage points over the prior state-of-the-art LogGPT while maintaining comparable precision. A production microservice deployment over Apache Kafka, Redis, and TensorRT-accelerated inference demonstrates 45 ms end-to-end latency at 15,000 events per second.
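
One possible shape for a partial-credit, exponentially decaying position-aware reward is sketched below. The abstract does not give the formula, so the functional form, the `decay` and `top_k` parameters, and the penalty range are all assumptions made for illustration.

```python
import math

def position_reward(position, rank, decay=0.1, top_k=5):
    """Hypothetical reward for predicting the next log key at a sequence position.

    rank is the 0-based rank of the true key among the model's candidates,
    or None if the true key is not among them.
    """
    weight = math.exp(-decay * position)        # 1.0 at position 0, decays toward 0
    if rank == 0:
        return weight                           # full credit, larger at early positions
    if rank is not None and rank < top_k:
        return weight * (1.0 - rank / top_k)    # partial credit for a near miss
    return -(2.0 - weight)                      # penalty grows from -1 toward -2 with position
```

Under these assumptions a correct prediction is worth more early in the sequence, and a miss costs more late in the sequence, matching the qualitative description in the abstract.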

2606.08168 2026-06-09 cs.CR cs.AI 交叉投稿

Closing the Sim-to-Real Gap: An Evaluation Framework for Autonomous Cyber Defense Configuration of Commercial EDR

弥合模拟到现实的差距:商业EDR自主网络防御配置评估框架

Kerri Prinos, Lilianne Brush

发表机构 * GitHub

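AI summary Proposes the first evaluation framework for autonomous defense agents hardening commercial EDR; an instantiation in a GOAD lab with Microsoft Defender XDR surfaces three key lessons that neither simulation nor open-source EDR evaluation can reveal.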

Comments 12 pages including references

详情
AI中文摘要

领先的商业端点检测与响应(EDR)产品已从操作员配置的规则集转变为多组件系统,其中自主AI组件与操作员部署的策略并行运行,并日益取代后者。使用商业EDR作为加固工具的自主防御智能体不再调整被动工具,而是调整能够做出供应商特定决策的黑盒自主系统。我们提出了首个针对加固商业EDR的自主防御智能体的评估框架。我们在Game of Active Directory(GOAD)实验室中实例化该框架,使用Horizon3.ai的NodeZero作为自主渗透测试者,微软Defender XDR作为EDR。我们运行了基于两个大型语言模型(LLM)骨干(Claude Sonnet 4.6和Cisco Foundation-Sec-8B)的防御智能体样本基准测试。我们报告了三个模拟或开源EDR评估无法揭示的经验教训:(i)商业EDR遥测是为安全运营中心(SOC)分析师工作流设计的,而非科学基准测试;(ii)每个策略归属的重要性,以区分防御智能体动作与自主EDR动作;(iii)EDR的自主行为在评估窗口期间会变化。这些发现共同凸显了企业防御的模拟到现实差距,并推动了在包含黑盒自主工具的环境中基准测试自主防御智能体的评估方法论。

英文摘要

Leading commercial endpoint detection and response (EDR) products have shifted from operator-configured rule sets to multi-component systems where autonomous AI components operate alongside, and increasingly in place of, operator-deployed policies. Autonomous defense agents using commercial EDR as their hardening tool are no longer tuning a passive tool, but a black-box autonomous system capable of making vendor-specific decisions. We present the first evaluation framework for autonomous defense agents hardening commercial EDR. We instantiate it in a Game of Active Directory (GOAD) lab with Horizon3.ai's NodeZero as the autonomous pentester and Microsoft Defender XDR as the EDR. We run a sample benchmark of defense agents with two large language model (LLM) backbones (Claude Sonnet 4.6 and Cisco Foundation-Sec-8B). We report three lessons learned that neither simulation nor open-source-EDR evaluation can surface: (i) commercial EDR telemetry is engineered for Security Operations Center (SOC) analyst workflows rather than scientific benchmarking; (ii) the importance of per-policy attribution to separate defense agent actions from autonomous EDR actions; and (iii) the EDR's autonomous behavior varies during the evaluation window. Together, these findings highlight a sim-to-real gap for enterprise defense and motivate evaluation methodology for benchmarking autonomous defense agents in environments with black-box, autonomous tools.

2606.08247 2026-06-09 eess.AS cs.AI cs.LG eess.SP 交叉投稿

AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Decision-Support Workflow for Acute Asthma Risk Assessment from Respiratory Sounds and Clinical Signals

AeroSpectra Sentinel:一种用于从呼吸音和临床信号进行急性哮喘风险评估的可审计LLM提示链决策支持工作流

Aueaphum Aueawatthanaphisut

发表机构 * School of Information, Computer, and Communication Technology(信息、计算机与通信技术学院) Sirindhorn International Institute of Technology, Thammasat University(泰国法政大学诗琳通国际理工学院)

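AI summary Proposes AeroSpectra Sentinel, which combines STFT respiratory sound analysis, lightweight ML screening, clinical feature fusion, and a five-stage LLM prompt chain for auditable acute asthma risk assessment, and evaluates the audio screening on a public dataset and the LLM workflow on simulated clinical vignettes.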

Comments 10 pages, 8 figures, 5 tables, 14 equations

详情
AI中文摘要

急性哮喘风险评估需要快速解读呼吸音、氧合、气流受限、言语能力、呼吸做功、精神状态以及对缓解治疗的反应。传统的纯音频分类器可以检测喘息样模式,但通常缺乏透明的临床推理和安全升级逻辑。本文提出AeroSpectra Sentinel,一个客户端研究原型和决策支持工作流,结合短时傅里叶变换(STFT)呼吸音分析、轻量机器学习筛查、临床特征融合和五阶段大语言模型(LLM)提示链过程。该工作流分离了信号采集、预处理、声学特征提取、ML筛查、临床护栏和FHIR就绪报告。我们在一个包含来自五个标签的1,211个WAV录音的公共呼吸音数据集上评估了音频筛查组件。使用584个录音的分层子集,随机森林在哮喘与非哮喘筛查中实现了91.10%的二元准确率和78.69%的F1分数,而基于特征的多层感知器实现了89.73%的准确率和78.26%的F1分数。紧凑的log-spectrogram CNN实现了73.29%的准确率和55.17%的F1分数。多类分类实现了77.40%的准确率和77.23%的宏F1。为了评估LLM工作流,我们对40个模拟临床场景进行了基于场景的审计,比较了一次性提示、提示链、带护栏的提示链以及带护栏加FHIR模式验证的提示链。护栏加模式变体实现了最强的模拟安全性和文档一致性。AeroSpectra Sentinel旨在作为研究原型,而非诊断医疗设备或临床验证的风险评估产品。

英文摘要

Acute asthma risk assessment requires rapid interpretation of respiratory sounds, oxygenation, airflow limitation, speech ability, work of breathing, mental status, and response to reliever therapy. Conventional audio-only classifiers can detect wheeze-like patterns but often lack transparent clinical reasoning and safe escalation logic. This paper presents AeroSpectra Sentinel, a client-side research prototype and decision-support workflow that combines short-time Fourier transform (STFT) respiratory sound analysis, lightweight machine-learning screening, clinical feature fusion, and a five-stage large language model (LLM) prompt-chaining process. The workflow separates signal acquisition, preprocessing, acoustic feature extraction, ML screening, clinical guardrails, and FHIR-ready reporting. We evaluated the audio screening component on a public respiratory sound dataset containing 1,211 WAV recordings from five labels. Using a stratified subset of 584 recordings, a random forest achieved 91.10% binary accuracy and 78.69% F1-score for asthma-vs-non-asthma screening, while a feature-based multilayer perceptron achieved 89.73% accuracy and 78.26% F1-score. A compact log-spectrogram CNN achieved 73.29% accuracy and 55.17% F1-score. Multiclass classification achieved 77.40% accuracy and 77.23% macro-F1. To evaluate the LLM workflow, we conducted a scenario-based audit on 40 simulated clinical vignettes comparing one-shot prompting, prompt chaining, prompt chaining with guardrails, and prompt chaining with guardrails plus FHIR schema validation. The guardrail-plus-schema variant achieved the strongest simulated safety and documentation consistency. AeroSpectra Sentinel is intended as a research prototype, not as a diagnostic medical device or clinically validated risk-assessment product.

2606.08324 2026-06-09 cs.CV cs.AI 交叉投稿

Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging

基于集合的Transformer用于远距离长波红外高光谱成像中的大气补偿

Fabian Perez, Nicolas Quintero, Jeferson Acevedo, Hoover Rueda-Chacon

发表机构 * Department of Computer Science, Universidad Industrial de Santander, Bucaramanga(桑坦德工业大学计算机科学系)

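AI summary Proposes a lightweight set-based deep learning framework that uses radiance measurements taken at different standoff ranges to jointly estimate transmittance, atmospheric path radiance, and downwelling radiance, achieving low spectral distortion on a MODTRAN-generated dataset.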

Comments Accepted paper at the IGARSS 2026 conference

详情
AI中文摘要

被动长波红外(LWIR)高光谱成像在远距离几何下依赖于大气吸收和发射以及反射辐射,因此大气补偿对于获取目标信息至关重要。尽管其重要性,但由于实际和建模困难,这一补偿在很大程度上被忽视。在本文中,我们提出了一种轻量级基于集合的深度学习框架,该框架将不同远距离范围收集的多个辐射测量作为输入,并联合估计透射率、大气路径辐射和共享的下行辐射光谱。我们使用稀疏自编码器分析学习到的表示,并观察到尽管缺乏位置监督,几个潜在特征确实在测试数据的地理一致子集上激活。在MODTRAN生成的远距离LWIR数据集上的实验表明,所有估计产品的光谱畸变较低。数据集和代码公开于:https://factral.co/SAE-LWIR/

英文摘要

Passive long-wave infrared (LWIR) hyperspectral imaging under a standoff geometry depends on atmospheric absorption and emission, as well as reflected radiance, making atmospheric compensation essential for characterizing a target of interest. Despite its importance, this compensation has been largely overlooked due to its practical and modeling difficulty. In this paper, we present a lightweight set-based deep learning framework that takes multiple radiance measurements, collected at different standoff ranges, as input and jointly estimates transmittance, atmospheric path radiance, and a shared downwelling spectrum. We analyze the learned representation with a sparse autoencoder and observe that several latent features activate on geographically coherent subsets of the test data despite the absence of location supervision. Experiments on a MODTRAN-generated standoff LWIR dataset demonstrate low spectral distortion across all estimated products. The dataset and code are publicly available at: https://factral.co/SAE-LWIR/
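
For orientation, the three estimated products can be placed in a commonly used at-sensor radiance model for the LWIR. This is a textbook formulation assuming an opaque, Lambertian target; it is not taken from the abstract, and the symbols are our own labels.

```latex
L_i(\lambda) \;=\; \tau_i(\lambda)\,\Bigl[\varepsilon(\lambda)\,B(\lambda, T)
  + \bigl(1 - \varepsilon(\lambda)\bigr)\,L_d(\lambda)\Bigr] \;+\; L_{p,i}(\lambda),
  \qquad i = 1, \dots, N
```

Here $i$ indexes the $N$ standoff ranges in the input set, $\tau_i$ is the transmittance and $L_{p,i}$ the atmospheric path radiance for range $i$, $L_d$ is the downwelling spectrum shared by all measurements, $\varepsilon$ is the target emissivity, and $B(\lambda, T)$ is the Planck radiance at target temperature $T$. Under this reading, measuring the same scene at several ranges gives several equations that share $L_d$ and the bracketed surface term, which is what makes joint estimation over a set plausible.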

2606.08364 2026-06-09 cs.CV cs.AI 交叉投稿

Self-Supervised Vision Transformers for CBCT-Based Detection of Temporomandibular Joint Osteoarthritis

基于自监督视觉Transformer的CBCT颞下颌关节骨关节炎检测

Shradhdha Trivedi, Vrundan Sojitra, Mariela Padilla

发表机构 * Herman Ostrow School of Dentistry, University of Southern California(南加州大学赫尔曼·奥斯特罗牙科学院) Viterbi School of Engineering, University of Southern California(南加州大学维特比工程学院)

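AI summary Studies how DINO-family self-supervised ViTs transfer to CBCT-based detection of temporomandibular joint osteoarthritis, finding that partially unfreezing the final two transformer blocks raises AUC from 0.671 to 0.902 and that adaptation strategy matters more than backbone choice.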

详情
AI中文摘要

颞下颌关节骨关节炎(TMJ OA)是一种常见的退行性疾病,其骨性改变在锥形束CT(CBCT)上通常很细微,使得自动检测具有挑战性。我们研究了DINO系列自监督视觉Transformer——DINOv1、DINOv2、DINOv2+reg和RAD-DINO(一种放射学预训练变体)——迁移到CBCT的效果,询问需要多少以及何种骨干适应。我们提出了一种简单的基于切片的流程,使用视觉Transformer(ViT)骨干:轴向CBCT切片由冻结或部分适应的ViT逐切片编码,并通过基于注意力的多实例学习(MIL)聚合,用于患者级别的二分类OA/正常分类。通过在多源CBCT数据集上对解冻策略和聚合设计进行系统消融,我们发现部分解冻最后两个Transformer块是决定性因素,将AUC从0.671(完全冻结的DINOv2)提高到0.902。这优于DINOv1(0.867)、DINOv2+reg(0.774)和有监督的ImageNet ViT-B/16基线(0.843)。我们的结果为在低数据医学影像设置中适应DINO系列基础模型提供了实用指导,表明适应策略比骨干选择本身更能驱动性能。

英文摘要

Temporomandibular joint osteoarthritis (TMJ OA) is a prevalent degenerative condition whose osseous changes are often subtle on cone-beam CT (CBCT), making automated detection challenging. We study how well the DINO family of self-supervised vision transformers -- DINOv1, DINOv2, DINOv2+reg, and RAD-DINO (a radiology-pretrained variant) -- transfers to CBCT, asking how much backbone adaptation is needed and of what kind. We propose a simple slice-based pipeline using Vision Transformer (ViT) backbones: axial CBCT slices are encoded per-slice by a frozen or partially adapted ViT and aggregated via attention-based multiple instance learning (MIL) for patient-level binary OA/Normal classification. Through systematic ablation across unfreezing strategies and aggregation designs on a multi-source CBCT dataset, we find that partial unfreezing of the final two transformer blocks is the decisive factor, improving AUC from 0.671 (fully frozen DINOv2) to 0.902. This outperforms DINOv1 (0.867), DINOv2+reg (0.774), and a supervised ImageNet ViT-B/16 baseline (0.843). Our results provide practical guidance for adapting DINO-family foundation models in low-data medical imaging settings, showing that adaptation strategy is a stronger driver of performance than backbone choice alone.
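
The aggregation step, attention-based multiple instance learning over per-slice embeddings, can be sketched as follows. This is a plain-Python illustration of the standard attention pooling form (a softmax over scores `w · tanh(V h_k)`); the parameter shapes and toy values are assumptions, and in practice `V` and `w` would be learned together with the classifier.

```python
import math

def attention_mil_pool(slice_embeddings, V, w):
    """Pool per-slice embeddings into one patient-level vector with attention weights."""
    scores = []
    for h in slice_embeddings:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]      # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(slice_embeddings[0])
    bag = [sum(a * h[d] for a, h in zip(weights, slice_embeddings)) for d in range(dim)]
    return bag, weights

# Toy example: three slices with 2-D embeddings, one hidden attention unit.
slices = [[0.0, 0.0], [1.0, 0.0], [3.0, 1.0]]
V = [[1.0, 0.0]]
w = [2.0]
bag, weights = attention_mil_pool(slices, V, w)
```

The weights sum to one, so they can also be read as a per-slice relevance score for the patient-level decision.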

2606.08476 2026-06-09 cs.DC cs.AI 交叉投稿

FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training

FlashCP: 面向LLM训练的负载均衡且通信高效的上下文并行

Zheng Wang, Eric Liu, Linan Jiang, Zhongkai Yu, Zaifeng Pan, Yue Guan, Yuke Wang, Yufei Ding

发表机构 * Stanford University(斯坦福大学)

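AI summary Proposes FlashCP, which eliminates redundant KV transfers through sharding-aware communication and pairs a Whole-Doc sharding strategy with a heuristic search algorithm to achieve balanced load and efficient communication, delivering up to 1.63x speedup across diverse datasets.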

Comments 10 pages, 6 figures

详情
AI中文摘要

上下文并行(CP)对于训练大规模长上下文语言模型至关重要,因为它通过划分序列来减少内存开销。然而,现有的CP方法由于静态序列分片和键值(KV)张量通信,存在工作负载不平衡、内核效率低下以及通信冗余的问题。我们提出了FlashCP,一个用于CP训练的负载均衡且通信高效的框架。FlashCP引入了一种分片感知的通信机制以消除冗余的KV通信,并提出了一种新颖的Whole-Doc分片策略,在保持工作负载平衡的同时最大化通信节省。为了高效结合Whole-Doc和Per-Doc分片,FlashCP进一步设计了一种启发式算法来搜索接近最优的分片方案。大量实验表明,FlashCP在多种数据集上相比最先进的CP框架实现了最高1.63倍的加速。

英文摘要

Context parallelism (CP) is essential for training large-scale, long-context language models, as it partitions sequences to reduce memory overhead. However, existing CP methods suffer from workload imbalance, inefficient kernels, and redundant communication due to static sequence sharding and key-value (KV) tensor communication. We present FlashCP, a load-balanced and communication-efficient framework for CP training. FlashCP introduces a sharding-aware communication mechanism to eliminate redundant KV communication and proposes a novel Whole-Doc sharding strategy that maximizes communication savings while maintaining balanced workloads. To efficiently combine Whole-Doc and Per-Doc sharding, FlashCP further designs a heuristic algorithm to search for near-optimal sharding plans. Extensive experiments show that FlashCP achieves up to 1.63x speedup over state-of-the-art CP frameworks across diverse datasets.
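
The tension between communication savings and load balance can be illustrated with a toy planner. The sketch assumes that keeping a whole document on one rank avoids KV exchange for that document and that causal attention cost grows with the square of document length; the greedy longest-first assignment below is a generic heuristic, not FlashCP's algorithm.

```python
import heapq

def assign_whole_docs(doc_lengths, num_ranks):
    """Greedy longest-first placement of whole documents onto CP ranks."""
    heap = [(0, r) for r in range(num_ranks)]      # (accumulated cost, rank)
    heapq.heapify(heap)
    plan = {r: [] for r in range(num_ranks)}
    for doc_id, length in sorted(enumerate(doc_lengths), key=lambda x: -x[1]):
        load, rank = heapq.heappop(heap)           # currently least-loaded rank
        plan[rank].append(doc_id)
        heapq.heappush(heap, (load + length * length, rank))
    loads = {r: sum(doc_lengths[d] ** 2 for d in docs) for r, docs in plan.items()}
    return plan, loads

# Hypothetical packed batch: document lengths in thousands of tokens.
doc_lengths = [8, 4, 4, 2, 2]
plan, loads = assign_whole_docs(doc_lengths, num_ranks=2)
```

In this toy batch the single long document still dominates one rank (cost 64 versus 40), which suggests why whole-document placement alone may not balance load and why a planner might split long documents across ranks as well.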

2606.08590 2026-06-09 cs.SE cs.AI cs.DC 交叉投稿

Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents

可审计的图引导的Kubernetes事件根因分析

Anastasiia Kuvshinova, Seungmin Jin

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

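AI summary Proposes Graph Traversal Agent, which combines LLM reasoning with deterministic graph operations and achieves auditable root cause analysis through a typed evidence graph, bounded search, and independent validation, raising F1 on ITBench from 0.6087 to 0.9130.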

Comments 8 pages, 1 figure. Preprint

详情
AI中文摘要

只有当根因系统报告的结果来自事件证据而非特定场景的捷径时,Kubernetes事件才能被可靠诊断。我们提出Graph Traversal Agent,一种图引导的根因分析代理,将LLM推理与专用工具相结合。该模型在类型化证据图上进行推理,而确定性图和工具操作收集证据、限制搜索并检查提出的结论。我们将操作约束(包括只读证据收集、传播感知诊断、有界执行和独立验证的结论)映射到类型化事件图、LangGraph遍历状态机和独立的验证阶段。在由固定qwen-plus裁判评分的ITBench快照上,经过审计的系统在23个场景的公共子集上,根因实体F1从同一系统早期迭代的0.6087提升至0.9130。提示级消融实验将提示调优带来的提升与去除场景特定提示后仍保留的提升区分开:在19个场景的子集上,剥离提示的配置保留了0.6958的F1。保留的提升集中在ChaosMesh场景上,其真实根因是证据图中已存在的注入故障对象,因此我们将其报告为基准耦合而非广泛的跨集群根因分析证据。轻量级检查(包括相同裁判比较、提示级消融、级联源检查和遥测无泄漏测试)将声明标记为支持、待定或超出范围。我们将工作范围限定为ITBench OpenTelemetry-demo快照。实时集群试验作为工程压力测试,但警报状态和跟踪可用性不足以稳定进行受控评分,因此我们不声称生产就绪或平均修复时间。

英文摘要

Kubernetes incidents are diagnosed reliably only when a root-cause system's reported gains come from incident evidence rather than scenario-specific shortcuts. We present Graph Traversal Agent, a graph-guided RCA agent that combines LLM reasoning with specialized tools. The model reasons over a typed evidence graph, while deterministic graph and tool operations collect evidence, bound the search, and check proposed verdicts. We map operational constraints, including read-only evidence collection, propagation-aware diagnosis, bounded execution, and independently validated verdicts, to a typed incident graph, a LangGraph traversal state machine, and a separate validation stage. On ITBench snapshots scored by one fixed qwen-plus judge, the audited system raises root-cause-entity F1 over an earlier iteration of the same system from 0.6087 to 0.9130 on a 23-scenario common subset. A prompt-level ablation separates prompt-tuned gains from gains that survive once scenario-specific hints are removed: the stripped-prompt configuration retains 0.6958 F1 on a 19-scenario subset. The surviving gain concentrates on ChaosMesh scenarios whose ground-truth root cause is the injected fault object already present in the evidence graph, so we report it as benchmark-coupled rather than broad cross-cluster RCA evidence. Lightweight checks, including same-judge comparison, prompt-level ablation, cascade-source checking, and a telemetry no-leak test, mark claims as supported, pending, or out of scope. We scope the work to ITBench OpenTelemetry-demo snapshots. Live-cluster trials served as an engineering stress test, but alert state and trace availability did not stay stable enough for controlled scoring, so we make no production-readiness or mean-time-to-repair claim.

2606.08630 2026-06-09 cs.LG cs.AI 交叉投稿

Tyan-WP: A Wind Power Foundation Model for Ultra-Short-Term Probabilistic Forecasting

Tyan-WP:用于超短期概率预测的风电基础模型

Jiahui Huang, Ao Luo, Lei Liu, Hongwei Zhao, Tengyuan Liu, Ruibo Guo, Bo Wang, Zhao Wang, Bin Li

发表机构 * School of Information Science and Technology, University of Science and Technology of China(中国科学技术大学信息科学技术学院) China Electric Power Research Institute(中国电力科学研究院)

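AI summary Proposes Tyan-WP, the first wind power foundation model, which uses static site embedding and a power-aware meteorological fusion module for zero-shot ultra-short-term probabilistic forecasting and outperforms site-specific TSMs and generic LTSMs.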

详情
AI中文摘要

全球风电容量,特别是在中国,正在蓬勃发展,新的风电场跨越了多样的地形和气候。行业迫切需要准确的风电基础模型,以缩短调试并加速并网。这是因为特定站点的时间序列模型(TSM)不适用于数据稀缺场景且泛化能力差,而通用大型时间序列模型(LTSM)大多限于单变量输入,无法充分利用静态站点属性或功率与气象协变量之间的依赖关系,导致精度不足。为填补这一空白,我们提出了Tyan-WP,这是首个用于超短期概率预测的风电基础模型。在覆盖美国超过126,000个站点、跨越七年的大规模风电数据集上预训练后,Tyan-WP通过两个特定领域模块设计进一步提升了零样本预测:使用坐标、地形和生态区域元数据的静态站点嵌入,以及一个功率感知气象融合(PAMF)模块,该模块对历史功率和气象协变量之间的交互进行建模。在统一评估协议下,Tyan-WP在10个域内站点上超越了八个特定站点的监督TSM,并在127个域内站点上优于十一个通用LTSM,MAE降低19.9%,RMSE降低16.6%,CRPS降低22.2%,AQL降低21.7%,同时$R^2$提升16.7%。它还在六个真实的英国站点上展示了强大的跨地理泛化能力。这些结果表明,风电基础模型可以在无需目标站点训练的情况下实现准确的零样本预测,为新风电场快速涡轮机接入和概率风险管理提供了实用途径。

英文摘要

Global wind power capacity, especially in China, is booming, with new farms spanning diverse terrains and climates. The industry urgently needs accurate wind power foundation models to shorten commissioning and accelerate grid connection. This is because site-specific time series models (TSMs) are not well suited to data-scarce scenarios and generalize poorly, while generic large time series models (LTSMs) are mostly limited to univariate inputs and cannot fully exploit static site attributes or the dependencies between power and meteorological covariates, leading to insufficient accuracy. To fill this gap, we propose Tyan-WP, the first wind power foundation model for ultra-short-term probabilistic forecasting. Pretrained on a large-scale wind power dataset covering more than 126,000 U.S. sites over seven years, Tyan-WP further improves zero-shot forecasting through two domain-specific module designs: static site embedding using coordinate, terrain, and ecoregion metadata, and a power-aware meteorological fusion (PAMF) module that models interactions between historical power and meteorological covariates. Under a unified evaluation protocol, Tyan-WP surpasses eight site-specific supervised TSMs on 10 in-domain sites and outperforms eleven generic LTSMs on 127 in-domain sites, reducing MAE by 19.9%, RMSE by 16.6%, CRPS by 22.2%, and AQL by 21.7%, while raising $R^2$ by 16.7%. It further demonstrates strong cross-geography generalization on six real U.K. sites. These results show that a wind power foundation model can achieve accurate zero-shot forecasting without target-site training, providing a practical pathway for rapid turbine onboarding and probabilistic risk management at new wind farms.
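
Probabilistic forecasts of this kind are usually scored per quantile. The sketch below shows the pinball (quantile) loss and its average over a set of quantiles; it assumes that the abstract's AQL stands for average quantile loss, which the abstract does not spell out, and the quantile levels and values are made up.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss for a single quantile level q in (0, 1)."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)

def average_quantile_loss(y_true, quantile_preds):
    """Mean pinball loss over a dict {quantile level: predicted value}."""
    losses = [pinball_loss(y_true, pred, q) for q, pred in quantile_preds.items()]
    return sum(losses) / len(losses)

# Hypothetical forecast for one time step: observed power 10 MW.
forecast = {0.1: 7.0, 0.5: 9.5, 0.9: 12.0}
score = average_quantile_loss(10.0, forecast)
```

The asymmetry is the point: for the 0.9 quantile, under-predicting by 2 MW costs 1.8 while over-predicting by 2 MW costs only 0.2, so a model minimizing this loss learns a value that is exceeded about 10% of the time.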

2606.08649 2026-06-09 cs.CR cs.AI 交叉投稿

Sample-Efficient LLM-Based Detection of Malicious Web Server Logs with Forensically Explainable Reasoning

基于大语言模型的恶意Web服务器日志检测与取证可解释推理的样本高效方法

Bernhard Kneip, Nhien-An Le-Khac, Hong-Hanh Nguyen-Le

发表机构 * University of Tuebingen(图宾根大学)

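AI summary Proposes CEF-Log, a prompting strategy whose five-step reasoning template teaches large language models how to analyze logs, reaching F1 = 0.99 on the CSIC 2010 dataset with only four examples and a 10x improvement in sample efficiency, and introduces a new dataset, ForenWebLog.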

详情
AI中文摘要

Web服务器日志的取证分析既需要准确检测,也需要满足法律要求的人类可读解释。我们提出了CEF-Log,一种针对大语言模型的上下文增强的少样本思维链提示策略,以应对这一双重需求。CEF-Log通过结构化的五步推理模板嵌入专家调查方法,使模型学习如何分析日志,而不是记忆什么模式。实验评估表明,CEF-Log在CSIC 2010数据集上仅使用四个示例就达到了0.99的F1分数,同时与其他基于提示的方法相比,样本效率提高了10倍。我们还引入了ForenWebLog,这是一个包含真实世界攻击和多步攻击序列的新数据集,用于全面评估。定性分析证实,CEF-Log生成了适合取证文档的可追溯、准确的解释,解决了传统机器学习方法的“黑箱”限制。

英文摘要

Forensic analysis of web server logs demands both accurate detection and human-readable explanations that can satisfy legal requirements. We present CEF-Log, a context-enhanced few-shot chain-of-thought prompting strategy for Large Language Models that addresses this dual requirement. CEF-Log embeds expert investigative methodology through a structured five-step reasoning template, enabling the model to learn how to analyze logs rather than what patterns to memorize. Experimental evaluation demonstrates that CEF-Log achieves an F1-score of 0.99 on the CSIC 2010 dataset using only four examples while providing a $10\times$ improvement in sample efficiency compared to other prompting-based methods. We also introduce ForenWebLog, a new dataset that incorporates real-world attacks and multi-step attack sequences for comprehensive evaluation. Qualitative analysis confirms that CEF-Log generates traceable, accurate explanations suitable for forensic documentation, addressing the critical "black-box" limitation of traditional machine learning approaches.

2606.08652 2026-06-09 astro-ph.SR cs.AI cs.CV 交叉投稿

Reconstructing Synthetic SDO/AIA 193 Å EUV Images from He I 10830 Å Observations with Diffusion Model Translator

利用扩散模型翻译器从He I 10830 Å观测重建合成SDO/AIA 193 Å EUV图像

Marco Marena, Qin Li, Haimin Wang, Haodi Jiang, Prajwal Shah, Bo Shen

发表机构 * Department of Mechanical and Industrial Engineering, New Jersey Institute of Technology(机械与工业工程系,新泽西理工学院) Department of Physics, New Jersey Institute of Technology(物理系,新泽西理工学院) Department of Computer Science, Sam Houston State University(计算机科学系,萨姆霍斯顿州立大学) Department of Computer Science, New Jersey Institute of Technology(计算机科学系,新泽西理工学院) Department of Data Science, New Jersey Institute of Technology(数据科学系,新泽西理工学院)

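AI summary Proposes a diffusion-based coronal hole-aware translation model (CH-aware DMT) that reconstructs AIA 193 Å EUV images from He I images, preserving full-disk EUV morphology (CC=0.92) and coronal hole structure (CC=0.84) on the test set, with physical plausibility checked against historical data.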

详情
AI中文摘要

常规的全盘EUV成像仅在现代时期(如SOHO和SDO)才可用。为了将EUV日冕背景扩展到更早时期,我们利用了数十年的全盘He I观测数据,其吸收受日冕辐照度和磁拓扑调制,并被广泛用作开放场区域的代理。我们提出了一种基于扩散的条件图像翻译框架——日冕洞感知扩散模型翻译器(CH-aware DMT),从He I输入重建合成SDO/AIA 193 Å EUV图像。该模型在2011-2015年时间对齐的SOLIS He I和AIA 193 Å配对数据上训练,采用基于月份的划分:1-10月用于训练,11月用于验证,12月用于测试。在保留的测试集上,重建结果保留了主要的全盘EUV形态(CC=0.92),并恢复了与日冕洞相关的低强度结构(CC=0.84)。我们进一步通过以下方式评估历史适用性:(1)比较2005-2015年间重建的AIA 193 Å形态与SOHO/EIT 195 Å;(2)比较从KPVT He I输入生成的重建AIA 193 Å图像与Yohkoh/SXT软X射线观测;(3)评估长期重建的盘积分发射统计量与观测EUV序列及独立太阳活动代理(1974-2015年的太阳黑子数和F10.7射电通量)的关系。这些结果表明,以He I为条件的CH-aware DMT可以为历史研究提供物理上合理的合成AIA 193 Å日冕代理,支持在直接EUV成像可用之前对大规模日冕演化进行数十年尺度的分析。

英文摘要

Routine full-disk EUV imaging has been available only in the modern era of missions such as SOHO and SDO. To extend EUV coronal context into earlier periods, we leverage the multi-decade availability of full-disk He I observations, whose absorption is modulated by coronal irradiance and magnetic topology and is widely used as a proxy for open-field regions. We present a diffusion-based conditional image translation framework, Coronal Hole-aware Diffusion Model Translator (CH-aware DMT), to reconstruct synthetic SDO/AIA 193 Å EUV images from He I inputs. The model is trained on temporally co-aligned SOLIS He I and AIA 193 Å pairs spanning 2011--2015 using a month-based split, where January--October are used for training, November for validation, and December for testing. On the held-out test set, the reconstructions preserve dominant full-disk EUV morphology (CC=0.92) and recover CH-related low-intensity structure (CC=0.84). We further assess historical applicability by (1) comparing reconstructed AIA 193 Å morphology with SOHO/EIT 195 Å over 2005--2015; (2) comparing reconstructed AIA 193 Å images generated from KPVT He I inputs against Yohkoh/SXT soft X-ray observations; and (3) evaluating long-term reconstructed disk-integrated emission statistics against observational EUV series and independent solar activity proxies (sunspot number and F10.7 radio flux over 1974--2015). These results indicate that CH-aware DMT conditioned on He I can provide a physically plausible synthetic AIA 193 Å coronal proxy for historical studies, supporting multi-decade analyses of large-scale coronal evolution before direct EUV imaging was available.
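
The evaluation setup in this abstract can be sketched with two small helpers. The sketch assumes that CC denotes the pixel-wise Pearson correlation coefficient between a reconstructed and an observed image, which the abstract does not state explicitly; the function names and the toy images are illustrative.

```python
import math

def pearson_cc(image_a, image_b):
    """Pearson correlation coefficient over all pixels of two equally sized 2D images."""
    a = [v for row in image_a for v in row]
    b = [v for row in image_b for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def month_split(month):
    """Month-based split from the abstract: Jan-Oct train, Nov validation, Dec test."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    if month <= 10:
        return "train"
    return "val" if month == 11 else "test"

observed = [[1.0, 2.0], [3.0, 4.0]]
reconstructed = [[2.0, 4.0], [6.0, 8.0]]   # same structure, different intensity scale
cc = pearson_cc(observed, reconstructed)
```

Because the correlation is invariant to a linear rescaling of intensity, a high CC speaks to morphology rather than absolute calibration, which is consistent with the abstract's wording about preserved morphology.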

2606.08710 2026-06-09 cs.SE cs.AI 交叉投稿

Structuring agentic AI for HPC code modernization

构建用于HPC代码现代化的智能体AI

Anthony Marinov, Igor Sfiligoi

发表机构 * San Diego Supercomputer Center(圣地亚哥超级计算机中心) University of California, San Diego(加州大学圣地亚哥分校) La Jolla, California, United States(美国加州拉霍亚)

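AI summary Proposes a structured agentic AI methodology built on manually created examples, continuous buildability, and limited session scope, which converted roughly 60,000 lines of Fortran MPI code into OpenMP-parallel C++ code within a few months.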

Comments 10 pages

详情
AI中文摘要

传统科学代码的现代化通常需要跟上计算资源生态系统的不断变化。并行化和从支持不佳的软件生态系统迁移是研究软件工程领域中最耗时的两项活动。本文介绍了我们在NMAP-RKPM(一个基于再生核粒子方法(RKPM)的约6万行三维显式固体力学物理引擎)的成功两阶段AI辅助现代化中的经验。我们在几个月内将这一基于Fortran的单线程MPI应用程序转换为基于C++的OpenMP并行MPI工具。虽然基于大型语言模型(LLM)的工具本身被证明不足,但我们开发了一种高度结构化的“手把手”智能体AI方法,例如提供手动创建的示例、确保持续可构建性和限制会话范围,这种方法反而非常有效。本文提供了成功的AI辅助步骤以及我们必须克服的问题,以及所选路径背后的推理。

英文摘要

Modernization of legacy scientific codes is often necessary to keep up with the ever-evolving changes in the compute resource ecosystem. Parallelization and migration from poorly supported software ecosystems are two of the most time-consuming activities in the research software engineering field. This paper presents our experience in the successful, two-phase AI-assisted modernization of NMAP-RKPM, a roughly 60,000-line, 3D explicit solid mechanics physics engine based on the Reproducing Kernel Particle Method (RKPM). We converted this single-threaded, Fortran-based MPI application into an OpenMP-parallel, C++-based MPI tool in the span of a few months. While Large Language Model (LLM)-based tools on their own proved inadequate, we developed a highly structured "hand-holding" agentic AI methodology, which included providing manually created examples, ensuring continuous buildability, and limiting session scope, and which proved highly effective. The paper describes both the AI-assisted steps that were successful and the problems that we had to overcome, alongside the reasoning behind the chosen path.

2606.08712 2026-06-09 cs.LG cs.AI cs.CV 交叉投稿

SNR-ST-Mix: Sample-specific Neighborhood Regression Mixup for Augmented Spatial Transcriptomics Imputation with Deep Neural Network

SNR-ST-Mix: 基于样本特异性邻域回归混合增强的空间转录组学深度神经网络插补

Hongyi Yu, Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee A. Cooper, Bo Zhou

发表机构 * Northwestern University(西北大学) Yale University(耶鲁大学)

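AI summary Addresses the noise and low resolution of spatial transcriptomics data with SNR-ST-Mix, a data augmentation framework that constrains mixing to spatial neighborhoods and weights it by expression similarity to generate biologically plausible synthetic samples, improving deep neural network imputation.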

Comments 19 pages, 4 figures, 3 tables

详情
AI中文摘要

目的:空间转录组学(ST)能够在组织背景下测量基因表达。然而,这些测量通常噪声大、分辨率低且采样稀疏,限制了精细空间结构的恢复。深度神经网络已成为从组织学进行表达插补的强大工具,但其性能仍受限于有限的样本量和缺乏生物学信息的增强。大多数现有的学习增强策略是为分类任务而非回归任务设计的,忽略了空间和转录组关系,导致生物上不合理的插值,阻碍了预测性能。方法:为解决这些限制,我们提出SNR-ST-Mix,一种专门为ST数据设计的几何和表达感知数据增强框架。它将混合限制在点的k个最近空间邻域内,并基于表达相似性自适应加权插值系数,生成保留局部生物结构同时确保空间平滑性的增强样本。这种双重条件化产生合成样本,扩展了有效训练流形,促进了泛化,并在样本特异性训练下增强了预测稳定性。结果:使用各种组织类型的大量实验表明,SNR-ST-Mix在不需要架构更改或额外计算的情况下,始终优于传统增强方法。结论:SNR-ST-Mix为空间转录组学回归任务提供了一种有效且生物学原理的增强策略。通过显式利用空间几何和转录组相似性,它扩展了有效训练流形,并在不增加模型复杂度的情况下提高了预测性能。

英文摘要

Purpose: Spatial transcriptomics (ST) enables gene expression measurements within the tissue context. However, these measurements are often noisy, low-resolution, and sparsely sampled, which limits the recovery of fine spatial structure. Deep neural networks have become powerful tools for expression imputation from histology, but their performance remains constrained by limited sample sizes and a lack of biologically informed augmentation. Most existing augmentation strategies are designed for classification rather than regression and neglect spatial and transcriptomic relationships, leading to biologically implausible interpolations that hinder prediction performance. Approach: To address these limitations, we propose SNR-ST-Mix, a geometry- and expression-aware data augmentation framework designed specifically for ST data. It constrains mixing to a spot's k-nearest spatial neighbors and adaptively weights interpolation coefficients based on expression similarity, generating augmented samples that preserve local biological structure while ensuring spatial smoothness. This dual conditioning yields synthetic examples that expand the effective training manifold, promote generalization, and enhance prediction stability under sample-specific training. Results: Extensive experiments with various tissue types demonstrate that SNR-ST-Mix consistently outperforms conventional augmentation methods without requiring architectural changes or additional computation. Conclusions: SNR-ST-Mix provides an effective and biologically principled augmentation strategy for spatial transcriptomics regression tasks. By explicitly leveraging spatial geometry and transcriptomic similarity, it expands the effective training manifold and improves predictive performance without increasing model complexity.
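
The two constraints described in the Approach section, mixing only with spatial neighbors and weighting by expression similarity, can be sketched as follows. The abstract does not give the weighting formula, so the cosine similarity, the partner sampling, and the rule for the mixing coefficient `lam` are assumptions made for illustration, as are all names and toy values.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two expression vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def neighborhood_mixup(i, coords, feats, expr, k=2, rng=random):
    """Mix spot i with one of its k nearest spatial neighbours (hypothetical rule)."""
    neighbours = sorted((j for j in range(len(coords)) if j != i),
                        key=lambda j: math.dist(coords[i], coords[j]))[:k]
    # Favour partners whose expression profile resembles spot i.
    sims = [max(cosine(expr[i], expr[j]), 0.0) for j in neighbours]
    j = rng.choices(neighbours, weights=[s + 1e-8 for s in sims])[0]
    sim = sims[neighbours.index(j)]
    # Similar partners may contribute up to 50%; dissimilar ones contribute little.
    lam = 1.0 - 0.5 * sim * rng.random()

    def mix(a, b):
        return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]

    return mix(feats[i], feats[j]), mix(expr[i], expr[j]), j, lam

# Toy spots: 2-D coordinates, image features, and expression vectors.
coords = [(0, 0), (1, 0), (0, 1), (10, 10)]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
expr = [[1.0, 2.0], [1.0, 2.1], [2.0, 0.5], [0.0, 9.0]]
mixed_x, mixed_y, partner, lam = neighborhood_mixup(0, coords, feats, expr,
                                                    rng=random.Random(0))
```

Mixing inputs and regression targets with the same coefficient keeps each synthetic pair consistent, and the distant spot at (10, 10) can never be chosen as a partner for spot 0.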

2606.08761 2026-06-09 cs.DC cs.AI 交叉投稿

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

APEX4: 通过SM内计算重平衡实现高效纯W4A4 LLM推理

Hong Guo, Nianhui Guo, Weixing Wang, Jona Otholt, Christoph Meinel, Haojin Yang

发表机构 * Hasso Plattner Institute(哈索·普拉特纳研究所) GreenBit.AI German University of Digital Science(德国数字科学大学)

AI总结 针对W4A4量化中CUDA核心反量化瓶颈,提出基于SM内计算平衡的ρ感知粒度自适应方法,设计纯INT4 GEMM内核,在多种GPU上实现最高2.09倍加速。

详情
AI中文摘要

W4A4量化承诺充分利用INT4张量核心,但CUDA核心上的组反量化开销导致现有系统采用混合精度回退。我们首次系统研究了SM内计算平衡如何主导这一瓶颈。通过在Ampere和Ada架构的四款GPU上进行受控基准测试,我们识别出张量核心与CUDA核心的吞吐量比($ρ$)作为主要硬件指标:在计算受限场景下,W4A4-g128内核在RTX 3090($ρ=16$)上获得$2.0$--$2.5\times$加速,但在A100($ρ=64$)上退化为$0.43$--$0.47\times$,表明W4A4的可行性是平台相关的,而非普遍不可行。基于这一发现,我们构建了\textbf{APEX4},它协同设计纯INT4 GEMM内核与$ρ$感知的粒度自适应,以缓解CUDA核心反量化瓶颈。APEX4在LLaMA-2-70B上实现了与FP16相差0.63的困惑度,并在零样本准确率上优于W4Ax Atom-g128达4.0%--4.4%。作为未修改vLLM中的即插即用替代品,它在L40S($ρ=8$)上提供高达$1.66\times$的端到端加速,在RTX 3090($ρ=16$)上为$1.78\times$,在A40($ρ=16$)上为$2.09\times$,并通过混合粒度模式将A100($ρ=64$)恢复至$1.20$--$1.40\times$。

英文摘要

W4A4 quantization promises full utilization of INT4 Tensor Cores, yet group dequantization overhead on CUDA Cores has driven existing systems to mixed-precision fallbacks. We present the first systematic study of how intra-SM compute balance governs this bottleneck. Through controlled benchmarks across four GPUs from the Ampere and Ada architectures, we identify the Tensor Core to CUDA Core throughput ratio ($ρ$) as the primary hardware indicator: the W4A4-g128 kernel yields $2.0$--$2.5\times$ speedup on RTX~3090 ($ρ=16$) yet degrades to $0.43$--$0.47\times$ on A100 ($ρ=64$) in compute-bound scenarios, establishing W4A4 viability as platform-dependent rather than universally infeasible. Guided by this finding, we build \textbf{APEX4}, which co-designs pure INT4 GEMM kernels with $ρ$-aware granularity adaptation to mitigate the CUDA Core dequantization bottleneck. APEX4 achieves perplexity within 0.63 of FP16 on LLaMA-2-70B and outperforms W4Ax Atom-g128 by 4.0\%--4.4\% in zero-shot accuracy. Deployed as a drop-in replacement in unmodified vLLM, it delivers up to $1.66\times$ end-to-end speedup on L40S ($ρ=8$), $1.78\times$ on RTX~3090 ($ρ=16$), and $2.09\times$ on A40 ($ρ=16$), while recovering A100 ($ρ=64$) to $1.20$--$1.40\times$ via the mixed-granularity mode.
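
To make the "group dequantization" cost in this abstract concrete, the following sketch emulates symmetric INT4 quantization with a group size of 128 (the "g128" in the abstract) in NumPy. It is only an illustration under stated assumptions, not the APEX4 kernel: the function names, the symmetric scaling scheme, and the use of the same grouping for weights and activations are hypothetical choices. The per-group multiplication by floating-point scales in `grouped_int4_dot` is the kind of rescaling work the abstract attributes to CUDA Cores, while the integer multiply-accumulate stands in for the INT4 Tensor Core part.

```python
import numpy as np


def quantize_int4_groups(x, group_size=128):
    """Symmetric per-group INT4 quantization of a 1-D float array (illustrative)."""
    x = np.asarray(x, dtype=np.float32)
    if x.size % group_size != 0:
        raise ValueError("length must be a multiple of group_size")
    groups = x.reshape(-1, group_size)
    # One scale per group: the largest magnitude maps to 7 (INT4 range is [-8, 7]).
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales).astype(np.float32)
    q = np.clip(np.rint(groups / scales), -8, 7).astype(np.int8)
    return q, scales


def dequantize(q, scales):
    """Recover approximate float values from INT4 codes and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)


def grouped_int4_dot(qw, sw, qa, sa):
    """Dot product with integer accumulation per group, then per-group rescaling."""
    # Integer multiply-accumulate within each group.
    partial = (qw.astype(np.int32) * qa.astype(np.int32)).sum(axis=1)
    # One floating-point rescale per group: this is the dequantization step.
    return float((partial * sw[:, 0] * sa[:, 0]).sum())
```

Smaller groups give finer scales and usually lower quantization error, but they also mean more rescaling operations per output element, which is the trade-off a granularity-adaptive scheme would be tuning.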

2606.08793 2026-06-09 cs.SE cs.AI 交叉投稿

AI-Augmented Closed-Loop Quality Engineering: A Reference Architecture for Continuous Software Quality Intelligence

AI增强的闭环质量工程:面向持续软件质量智能的参考架构

Dimple Bajaj

发表机构 * Dimple Bajaj

AI总结 提出一种AI增强的闭环参考架构,通过需求特征挖掘、风险测试优先级、缺陷预测和生产事件分析,结合有限反馈学习模型,在六个发布周期中减少缺陷泄漏、提高检测效率并缩短测试执行时间。

Comments 15 pages, 4 figures

详情
AI中文摘要

由于需求、测试和生产之间的流程脱节,软件工程的质量仍面临挑战,这阻碍了在连续发布中实施质量策略的机会。现有方法往往是固定模型或单优化方法,缺乏生产反馈学习机制。本文提出了一种AI增强的持续软件质量智能闭环参考架构。该模型综合了需求特征挖掘、基于风险的测试优先级排序、缺陷预测和生产事件分析,作为基于反馈的流水线的一个元素。引入了一种有限反馈学习模型,用于根据缺陷严重性和事件影响将生产信号传播到下一个发布,以确保稳定性和时间。该方法使用一个半合成测试数据集进行评估,该数据集包含6个发布周期中的4500个需求、27049个测试用例、13089个缺陷和7841个事件。实验结果表明,与非自适应基线相比,所提出的系统将缺陷泄漏从0.19降低到0.13,将检测系统的有效性从0.72提高到0.84,并将测试执行时间缩短了高达35%。这些变化在发布之间是稳定的。研究结果表明,通过在闭环架构中集成基于反馈的学习,可以持续改进质量过程,为自适应软件质量工程提供了实用基础。

英文摘要

Software quality engineering remains challenging because the processes spanning requirements, testing, and production are disjointed, which makes it hard to carry quality strategies across consecutive releases. Existing approaches tend to be fixed-model or single-optimization approaches and lack mechanisms for learning from production feedback. This paper proposes an AI-augmented closed-loop reference architecture for continuous software quality intelligence. The architecture integrates requirement feature mining, risk-based test prioritization, defect prediction, and production incident analysis as elements of a feedback-based pipeline. A limited feedback learning model is introduced to propagate production signals, based on defect severity and incident impact, to the following release while maintaining stability over time. The method is evaluated on a semi-synthetic test dataset of 4,500 requirements, 27,049 test cases, 13,089 defects, and 7,841 incidents across six release cycles. The experimental results show that, compared with non-adaptive baselines, the proposed system reduces defect leakage from 0.19 to 0.13, increases detection effectiveness from 0.72 to 0.84, and shortens test execution time by up to 35 percent. These improvements are stable from release to release. The findings indicate that integrating feedback-based learning into a closed-loop architecture can continuously improve the quality process, offering a practical foundation for adaptive software quality engineering.

2606.08816 2026-06-09 cs.LG cs.AI 交叉投稿

Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors

知识图谱与推理大语言模型用于寻找简单而有效的转录组扰动预测因子

Jake Fawkes, Liam Hodgson, Jason Hartford

发表机构 * University College London(伦敦大学学院) University of Manchester(曼彻斯特大学) Valence Labs(Valence实验室) Recursion(Recursion公司)

AI总结 利用知识图谱的K近邻方法在基因敲除扰动预测中表现优异,结合强化学习优化的LLM可达到最先进性能。

详情
AI中文摘要

预测未见过的基因敲除扰动对转录组基因表达的影响仍然是虚拟细胞模型的一个极具挑战性的问题。最近,通过利用生物知识图谱提供相似扰动的概念,在训练扰动集之外实现了更好的外推。在这项工作中,我们证明了利用这些假设的最简单模型——知识图谱的K近邻——在此任务上取得了极具竞争力的性能,并且通过使用强化学习(RL)优化的LLM可以进一步提高预测性能。具体来说,我们发现K近邻方法在分布外扰动预测上几乎击败了所有方法,而当通过RL训练推理LLM以改变邻域时,它在Replogle等人(2022)的细胞系上获得了与当前最先进方法相当的性能。我们还证明,尽管没有直接训练,RL训练提高了LLM在差异表达预测下游任务上的性能。总体而言,这些发现证明了知识图谱作为模型先验的有效性,并显示出RL可以将LLM精炼为预测复杂生物反应的通用工具的早期迹象。

英文摘要

Predicting the effect of an unseen gene knockout perturbation on transcriptomic gene expression remains a highly challenging problem for virtual cell models. Recent progress has been made by leveraging biological knowledge graphs to provide a notion of similar perturbation, allowing for improved extrapolation beyond the set of training perturbations. In this work, we demonstrate that the simplest model to leverage these assumptions - a K-nearest neighbour from the knowledge graph - achieves highly competitive performance on this task, and that this can be improved further using LLMs optimised via reinforcement learning (RL) for predictive performance. Specifically, we find that the K-nearest neighbour approach beats almost all methods on out-of-distribution perturbation prediction, and when a reasoning LLM is trained via RL to make changes to the neighbourhood, it obtains equivalent performance to current state of the art methods on the cell lines from Replogle et al. (2022). We also demonstrate that the RL training improves the LLM's performance on the downstream task of differential expression prediction, despite not being trained on this directly. Overall, these findings demonstrate the efficacy of knowledge graphs as model priors, and show early signs that RL can refine LLMs into generalizable tools for predicting complex biological responses.
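
The K-nearest-neighbour baseline mentioned in this abstract can be sketched in a few lines. The abstract does not say how neighbours are derived from the knowledge graph, so this example assumes that every perturbed gene already has a knowledge-graph embedding vector and that similarity is cosine similarity; the function name and arguments are hypothetical. The prediction for an unseen knockout is simply the mean expression effect of its `k` most similar training perturbations.

```python
import numpy as np


def knn_perturbation_predict(query_emb, train_embs, train_effects, k=3):
    """Predict the expression effect of an unseen knockout from its neighbours.

    query_emb:     (d,)   embedding of the unseen perturbed gene (assumed given)
    train_embs:    (m, d) embeddings of the training perturbations
    train_effects: (m, g) measured expression changes for those perturbations
    Returns the predicted (g,) effect and the indices of the neighbours used.
    """
    query_emb = np.asarray(query_emb, dtype=float)
    train_embs = np.asarray(train_embs, dtype=float)
    train_effects = np.asarray(train_effects, dtype=float)

    # Cosine similarity between the query and every training perturbation.
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ q

    # Average the effects of the k most similar training perturbations.
    nearest = np.argsort(-sims)[:k]
    return train_effects[nearest].mean(axis=0), nearest
```

In this framing, an LLM that "makes changes to the neighbourhood" would edit the `nearest` index set before the averaging step; how the paper actually does this is not specified in the abstract.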

2606.08858 2026-06-09 cs.CV cs.AI 交叉投稿

Intelligent Character Recognition of Handwritten Forms with Deep Neural Networks

基于深度神经网络的手写表单智能字符识别

Hartwig Grabowski

发表机构 * Institute for Machine Learning and Analytics (IMLA) Offenburg University(奥芬堡大学机器学习与分析研究所(IMLA))

AI总结 提出一种通过深度神经网络将检测与分类合并为单一任务的手写字符识别方法,利用人工合成训练数据,在真实考试数据上达到88.28%的识别率。

Comments Author's accepted manuscript of a published Springer book chapter. 14 pages, 16 figures

详情
Journal ref
In: Cavallucci D., Livotov P., Brad S. (eds), Towards AI-Aided Invention and Innovation, IFIP Advances in Information and Communication Technology, vol. 682, Springer Nature Switzerland, 2023, pp. 81-94
AI中文摘要

手写表单的自动处理仍然是一项具有挑战性的任务,其中手写字符的检测和后续分类是关键步骤。我们描述了一种新颖的方法,其中两个步骤——检测和分类——通过深度神经网络在一个任务中执行。因此,训练数据不是手动标注的,而是从基础表单和现有数据集中人工制造的。可以证明,这种单任务方法优于最先进的双任务方法。当前研究专注于手写拉丁字母,并使用EMNIST数据集。然而,该数据集存在局限性,需要进一步定制。最后,在从笔试中获得的真实数据上达到了88.28%的整体识别率。

英文摘要

The automatic processing of handwritten forms remains a challenging task, in which the detection and subsequent classification of handwritten characters are essential steps. We describe a novel approach in which both steps -- detection and classification -- are executed in one task by a deep neural network. Accordingly, training data is not annotated by hand but generated artificially from the underlying forms and already existing datasets. It can be demonstrated that this single-task approach is superior to the state-of-the-art two-task approach. The current study focuses on handwritten Latin letters and employs the EMNIST dataset. However, limitations were identified with this dataset, necessitating further customization. Finally, an overall recognition rate of 88.28 percent was attained on real data obtained from a written exam.

2606.08897 2026-06-09 cs.CV cs.AI q-bio.QM 交叉投稿

A multi-agent system for spine MRI report generation from multi-sequence imaging

基于多序列影像的脊柱MRI报告生成多智能体系统

Zhiping Xiao, Junwei Yang, Gongbo Sun, Han Zhang, Hanwen Xu, Yi Yao, Zachary D. Miller, William E. King, Mohammed M. Kanani, Jalal B. Andre, Sammy Chu, Ming Zhang, Paul E. Kinahan, Nathan M. Cross, Sheng Wang

发表机构 * University of Washington(华盛顿大学) Peking University(北京大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) New York University(纽约大学) University of Washington Medical Center(华盛顿大学医学中心)

AI总结 提出SpineAgent多智能体框架,利用多序列基础模型整合T1/T2等序列信息,实现脊柱MRI报告生成、病理定位和图文检索,在跨厂商和跨队列评估中表现优异。

详情
AI中文摘要

脊柱病理是全球疼痛和残疾的主要原因之一。脊柱MRI是临床评估的核心,但其解读仍然复杂且耗时,需要整合多个成像序列和解剖区域的信息。尽管自动化MRI分析最近取得了进展,但如何有效结合多序列数据同时保留序列特异性诊断信息仍是一个开放挑战。本文提出SpineAgent,一个基于多序列基础模型的脊柱MRI报告生成多智能体框架,该模型在来自32,047名患者和453,683个MRI系列(总计13,441,191张MRI切片)的常规临床数据上训练。为了适应不同模态的序列,我们首先分别在T1和T2加权序列上预训练两个基于DINOv3的编码器。然后,我们引入一种持续训练策略,学习一个合成器,利用T1和T2编码器嵌入其他序列的图像,生成整合MRI序列间各种信号的患者级嵌入。利用这些嵌入,SpineAgent实现了最先进的性能,并在跨制造商和跨队列评估中展现出强大的泛化能力。除了分类,SpineAgent通过识别与发现相关的切片和分割病理区域实现病理定位。它还支持多模态图像-报告检索,为可扩展和可解释的MRI报告生成提供了坚实基础。我们进一步将这些经过验证的SpineAgent能力集成到37个专门智能体中。最后,我们将它们的输出作为结构化标记,整合到一个端到端训练用于报告生成的医疗报告智能体中。通过自动指标和五位放射科医生的专家评估,SpineAgent在脊柱MRI报告生成中取得了领先性能。

英文摘要

Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluation, yet its interpretation remains complex and time-consuming, requiring integration of information across multiple imaging sequences and anatomical regions. Despite recent advances in automated MRI analysis, effectively combining multi-sequence data while preserving sequence-specific diagnostic information remains an open challenge. Here we present SpineAgent, a multi-agent framework for spine MRI report generation built upon a multi-sequence foundation model trained on routine clinical data from 32,047 patients and 453,683 MRI series, comprising a total of 13,441,191 MRI slices. To accommodate diverse sequence types, we first pre-train two DINOv3-based encoders separately on T1- and T2-weighted sequences. We then introduce a continual training strategy that learns a synthesizer to embed images of other sequences using the T1 and T2 encoders, producing patient-level embeddings that integrate various signals across MRI sequences. Using these embeddings, SpineAgent achieves state-of-the-art performance and demonstrates strong generalizability under cross-manufacturer and cross-cohort evaluation. Beyond classification, SpineAgent enables pathology localization by identifying finding-relevant slices and segmenting pathological regions. It also supports multimodal image-report retrieval, providing a solid foundation for scalable and explainable MRI report generation. We further integrate these validated capabilities of SpineAgent into 37 specialized agents. Finally, we incorporate their outputs as structured tokens within a Medical Report Agent trained end-to-end for report generation. Through both automated metrics and expert evaluation by five radiologists, SpineAgent achieves leading performance in spine MRI report generation.

2606.08908 2026-06-09 cs.CV cs.AI 交叉投稿

Failure-Aware Refinement of Vision-Language Model for Lithography Defect Detection

面向光刻缺陷检测的视觉-语言模型失败感知精炼

Pangyun Jeong, Jiyeong Kong, Yuehua Hu, Dohee Jeong, Kyung-Tae Kang

发表机构 * Hanyang University(汉阳大学) Korea University(高丽大学) Korea Institute of Industrial Technology(韩国生产技术研究院)

AI总结 提出两阶段视觉-语言框架,先微调Qwen3-VL检测缺陷,再通过训练精炼模块修正第一阶段错误,提升检测可靠性。

Comments 6 pages, 3 figures

详情
AI中文摘要

半导体光刻检测需要可靠地检测微小图案缺陷,如桥接(bridge)、毛刺(burr)、缩颈(pinch)和污染(contamination)。在本研究中,我们提出了一种两阶段视觉-语言框架,结合了初始缺陷检测与预测精炼。在第一阶段,使用LoRA微调Qwen3-VL作为视觉-语言适配器,从光刻图像中预测缺陷数量、缺陷类别和归一化边界框。然而,直接微调仍可能产生常见的测试时错误,包括误报、漏检和错误缺陷类型。为解决此限制,第二阶段使用第一阶段预测失败及其修正标签训练精炼模块,使模型能够审查和修正初始输出。通过从初始适配器失败的案例中学习,精炼过程改善了超越单阶段微调的缺陷推理。

英文摘要

Semiconductor lithography inspection requires reliable detection of small pattern defects such as bridge, burr, pinch, and contamination. In this study, we propose a two-stage vision-language framework that combines initial defect detection with prediction refinement. In the first stage, Qwen3-VL is fine-tuned with LoRA as a vision-language adapter to predict defect counts, defect categories, and normalized bounding boxes from lithography images. However, direct fine-tuning may still produce common test-time errors, including false positives, missed defects, and incorrect defect types. To address this limitation, the second stage trains a refinement module using first-stage prediction failures and their corrected labels, allowing the model to review and revise initial outputs. By learning from cases where the initial adapter fails, the refinement process improves defect inference beyond single-stage fine-tuning.

2606.08920 2026-06-09 cs.CV cs.AI 交叉投稿

PolyBuild: An End-to-End Method for Polygonal Building Contour Extraction from High-Resolution Remote Sensing Images

PolyBuild: 一种从高分辨率遥感图像中提取多边形建筑物轮廓的端到端方法

Yaoteng Zhang, Julin Zhang, Guangshuai Wang, Jiwei Deng, Hui Sheng, Yasir Muhammad, Shiqing Wei

发表机构 * China University of Petroleum (East China)(中国石油大学(华东)) South Surveying&Mapping Instrument Co.,Ltd.(南方测绘仪器有限公司) China Railway Design Corporation(中国铁路设计集团有限公司)

AI总结 提出端到端方法PolyBuild,通过初始轮廓生成模块和轮廓优化模块直接从遥感图像提取矢量多边形建筑物轮廓,无需后处理,性能优于现有方法。

Comments Accepted for publication in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS)

详情
AI中文摘要

从高分辨率遥感图像中提取建筑物多边形轮廓是各种地图应用的基本任务。然而,不同的成像条件和复杂的建筑结构使得自动轮廓提取极具挑战性。主流的建筑物提取方法通常依赖于像素级分割,随后进行多个后处理步骤以生成建筑物轮廓,这计算量大且容易出错。在本文中,我们提出了一种名为PolyBuild的端到端方法,该方法可以直接从高分辨率遥感图像中提取建筑物矢量多边形,无需任何后处理操作。该方法利用两个主要模块:初始轮廓生成模块(ICGM)和轮廓优化模块(COM)。ICGM通过利用每个建筑物实例的拼接子区域中心特征来生成初始建筑物轮廓。它通过生成边界框并使用四个子区域的中心特征来表示每个建筑物,同时进行目标检测和初始轮廓提取。轮廓优化模块(COM)通过在基于Transformer的解码器中迭代集成卷积神经网络(CNN)特征和轮廓位置信息,进一步细化生成的建筑物轮廓。混合CNN-Transformer架构有效捕获建筑物轮廓内的局部和全局空间关系,确保高质量的边界描绘。在三个建筑物数据集上进行了大量实验以评估PolyBuild的性能。结果表明,PolyBuild显著优于最先进的方法,包括基于掩码和基于轮廓的方法。

英文摘要

Extracting building polygon contours from high-resolution remote sensing images is a fundamental task for various mapping applications. However, varying imaging conditions and complex building structures make automatic contour extraction extremely challenging. Mainstream approaches for building extraction often rely on pixel-level segmentation followed by multiple post-processing steps to produce building contours, which can be computationally intensive and prone to errors. In this paper, we propose an end-to-end method named PolyBuild, which can directly extract building vector polygons from high-resolution remote sensing images without the need for any post-processing operations. The proposed method leverages two primary modules: an Initial Contour Generation Module (ICGM) and a Contour Optimization Module (COM). The ICGM is designed to generate an initial building contour by utilizing concatenated sub-region center features for each building instance. It performs simultaneous object detection and initial contour extraction by generating bounding boxes and using the center features of four sub-regions to represent each building. The COM further refines the generated building contours by iteratively integrating Convolutional Neural Network (CNN) features and contour positional information in a Transformer-based decoder. The hybrid CNN-Transformer architecture effectively captures both local and global spatial relationships within the building contour, ensuring high-quality boundary delineation. Extensive experiments are conducted on three building datasets to evaluate the performance of PolyBuild. The results demonstrate that PolyBuild significantly outperforms state-of-the-art methods, including mask-based and contour-based approaches.

2606.09090 2026-06-09 cs.SE cs.AI 交叉投稿

Context Rot in AI-Assisted Software Development: Repurposing Documentation Consistency for AI Configuration Artifacts

AI辅助软件开发中的上下文腐烂:将文档一致性技术用于AI配置工件

Christoph Treude, Sebastian Baltes

发表机构 * Singapore Management University(新加坡管理大学) Heidelberg University(海德堡大学)

AI总结 针对AI编码助手的配置文件(如CLAUDE.md)随软件演化而变得陈旧的问题,提出利用现有文档一致性工具检测上下文腐烂,并在356个仓库中发现23.0%存在过时代码引用。

详情
AI中文摘要

开发者越来越多地通过配置文件(如CLAUDE.md、AGENTS.md和.cursorrules)为AI编码助手提供持久上下文。这些文件描述代码元素、架构和开发约定,形成指导AI工具跨会话行为的上下文。随着软件演化,这种上下文可能变得陈旧,我们称之为上下文腐烂。虽然AI配置工件是新的,但底层的一致性问题与数十年的软件文档研究相关。研究人员已构建工具来检查文档与代码之间的一致性,涵盖README文件、代码注释、API文档、架构描述和安装说明。我们认为,这个现有工具箱是检测上下文腐烂的直接起点,并提出了一个研究路线图,将文档一致性方法映射到这一新环境中的相应问题。作为初步证据,将现有的README/wiki一致性检查器应用于356个仓库的统计代表性样本,发现23.0%的仓库中存在过时代码元素引用,表明传统的文档一致性工具已经能够发现上下文腐烂。

英文摘要

Developers increasingly provide AI coding assistants with persistent context through configuration files such as CLAUDE.md, AGENTS.md, and .cursorrules. These files describe code elements, architecture, and development conventions, forming the context that guides AI tool behavior across sessions. As software evolves, this context can become stale, a phenomenon we call context rot. While AI configuration artifacts are new, the underlying consistency problem connects to decades of software documentation research. Researchers have built tools to check consistency between documentation and code, spanning README files, code comments, API documentation, architecture descriptions, and installation instructions. We argue that this existing toolbox is an immediate starting point for detecting context rot, and we present a research roadmap mapping documentation consistency approaches to corresponding problems in this new setting. As preliminary evidence, applying an existing README/wiki consistency checker to a statistically representative sample of 356 repositories identifies stale code element references in 23.0% of repositories, showing that traditional documentation consistency tools can already surface context rot.
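
As a rough illustration of what detecting stale references in a file such as CLAUDE.md could look like, the sketch below checks backticked path-like tokens against a set of paths that currently exist in the repository. It is not the checker used in the paper, which the abstract does not describe in detail; the function name, the backtick convention, and the path heuristic are assumptions, and the sketch covers file paths only, not other code elements such as functions or classes.

```python
import re


def find_stale_references(config_text, existing_paths):
    """Return path-like backticked references that are missing from the repo.

    config_text:    contents of an AI configuration file (e.g. CLAUDE.md)
    existing_paths: repo-relative paths that currently exist (assumed to be
                    collected elsewhere, for example from a file listing)
    """
    existing = set(existing_paths)
    stale = []
    for ref in re.findall(r"`([^`\s]+)`", config_text):
        # Heuristic: treat a token as a path if it has a slash or a file extension.
        looks_like_path = "/" in ref or re.search(r"\.\w+$", ref) is not None
        if looks_like_path and ref not in existing and ref not in stale:
            stale.append(ref)
    return stale
```

A check like this could run in continuous integration so that a renamed or deleted file is flagged as soon as the configuration file still mentions it.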

2606.09104 2026-06-09 cs.LG cs.AI q-fin.PM 交叉投稿

Addressing Market Regime Changes and Heavy-Tailed Returns in Portfolio Optimization via Bayesian VAR and Elliptical Black-Litterman

通过贝叶斯VAR和椭圆Black-Litterman解决投资组合优化中的市场机制变化和重尾收益问题

Daniil Mikriukov, Ruoyu Sun, Angelos Stefanidis, Jionglong Su, Zhengyong Jiang

发表机构 * University of Liverpool(利物浦大学) Xi'an Jiaotong-Liverpool University(西交利物浦大学)

AI总结 提出BAVAR-BLED算法,结合贝叶斯平均向量自回归和椭圆分布Black-Litterman模型,在TD3架构下自适应分配资产,在道琼斯工业平均指数成分股上实现夏普比率1.72和总收益57.26%。

Comments 9 pages, 3 figures, 4 tables. Extends our prior work [Mikriukov et al., ICIC 2025] on Black-Litterman under Elliptical Distributions (BLED). Manuscript under review

详情
AI中文摘要

用于投资组合优化的深度强化学习框架因其能够从市场数据中动态学习分配规则而显示出前景。然而,这些模型未能考虑肥尾收益,而肥尾收益以更频繁的极端事件为特征,描述了实际市场行为。此外,历史数据被同质化处理,未考虑时间重要性,导致模型在机制变化时失效。我们提出了一种新的BAVAR-BLED算法,该算法在TD3架构内结合了源自贝叶斯平均向量自回归(BAVAR)和使用椭圆分布的Black-Litterman模型(BLED)的方法。BAVAR捕获一组考虑多尺度时间特征的向量自回归表示,从而基于对收益预期和离散矩阵的机制感知估计实现自适应分配决策。这些估计作为BLED的先验输入,BLED使用学生t分布,允许更现实的肥尾收益估计。BAVAR-BLED算法使用Transformer网络进行观点构建,使用CNN进行风险厌恶估计,根据市场条件修改动态分配决策。对道琼斯工业平均指数29只成分股在十年市场周期内的评估表明,BAVAR-BLED显著优于最先进的方法,实现了1.72的夏普比率和2.70的索提诺比率,总收益为57.26%。

英文摘要

Deep reinforcement learning (DRL) frameworks for portfolio optimization have shown promise for their ability to learn allocation rules dynamically from market data. However, these models fail to account for fat-tailed returns, which characterize actual market behavior with more frequent extreme events. Furthermore, historical data is treated homogeneously, without accounting for temporal importance, leading models to fail during regime changes. We propose a new BAVAR-BLED algorithm that combines methods derived from Bayesian-Averaging Vector Autoregressive (BAVAR) and the Black-Litterman model using Elliptical Distributions (BLED) within a TD3 architecture. BAVAR captures a set of vector autoregressive representations that consider multi-scale temporal features, enabling adaptive allocation decisions based on regime-aware estimates of return expectations and dispersion matrices. These estimates serve as prior inputs to BLED, a model that uses Student's t-distributions, allowing for more realistic fat tail return estimates. The BAVAR-BLED algorithm uses transformer networks for view construction and CNNs for risk-aversion estimates, which modify dynamic allocation decisions based on market conditions. An evaluation of 29 Dow Jones Industrial Average constituents over a decade-long market period shows that BAVAR-BLED significantly outperforms state-of-the-art methods, achieving Sharpe and Sortino ratios of 1.72 and 2.70, respectively, and total returns of 57.26%.
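
For readers unfamiliar with the two risk-adjusted metrics reported in this abstract, here is a minimal sketch of their common textbook definitions. The abstract does not state the return frequency, risk-free rate, or annualization convention behind the reported values, so the defaults below (daily returns, 252 periods per year, zero risk-free rate and target) are assumptions.

```python
import numpy as np


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its standard deviation."""
    excess = np.asarray(returns, dtype=float) - risk_free
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)


def sortino_ratio(returns, target=0.0, periods_per_year=252):
    """Annualized Sortino ratio: like Sharpe, but penalizing only downside moves."""
    excess = np.asarray(returns, dtype=float) - target
    # Downside deviation: root mean square of the below-target returns only.
    downside = np.minimum(excess, 0.0)
    downside_dev = np.sqrt(np.mean(downside ** 2))
    return np.sqrt(periods_per_year) * excess.mean() / downside_dev
```

Because the Sortino ratio ignores upside volatility, it is typically higher than the Sharpe ratio for a strategy whose large moves are mostly gains.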

2606.09123 2026-06-09 cs.CV cs.AI 交叉投稿

An Enhanced Geometric-Spectral Feature Learning Framework for Airborne Multispectral Point Cloud Classification

一种增强的几何-光谱特征学习框架用于机载多光谱点云分类

Xian Li, Yanfeng Gu, Aleksandra Pižurica

AI总结 针对机载多光谱点云高维异构、样本不平衡和类间光谱相似问题,提出基于注意力的双流特征融合框架,结合残差注意力融合块和联合损失函数,实现高精度地物分类。

详情
AI中文摘要

多光谱点云由三维空间-光谱信息组成,对于精确的土地覆盖分类具有巨大潜力。然而,分类模型的表示能力受到机载多光谱点云固有的高维异构空间-光谱信息、不平衡的样本分布和类间光谱相似性的限制。我们构建了两个多光谱点云数据集,并提出了一种基于注意力的增强几何-光谱特征学习框架用于机载多光谱点云分类。我们模型的一个关键组件是一种带有注意力机制的双流特征融合方法,该方法增强了来自高维异构多光谱点云的空间-光谱特征的表示能力。第一流旨在提取带有融合自注意力的位置编码全局光谱特征,第二流包括多核点卷积和特征聚合注意力以提取光谱引导的几何特征。然后,我们开发了一个残差注意力融合块,以整合来自两个并行流的最具信息量的几何-光谱特征。这项工作的另一个重要贡献是一个联合损失函数,以提高对不平衡和类间相似样本的学习能力。在两个机载多光谱点云数据集上的实验结果表明,与最先进的方法相比,所提方法具有有效性。此外,本文使用的代码和数据集将在https://github.com/HITlixian/TGRS_GSFF免费提供。

英文摘要

Multispectral point clouds (MPCs) are composed of 3D spatial-spectral information and hold tremendous potential for accurate land-cover classification. However, the representation power of classification models is limited by the inherent high-dimensional and heterogeneous spatial-spectral information, unbalanced sample distribution, and inter-class spectral similarity of airborne MPCs. We build two MPC datasets and propose an enhanced attention-based geometric-spectral feature learning framework for airborne MPC classification. A key component in our model is a two-stream feature fusion method with attention mechanisms, which enhances the representation capability of spatial-spectral features from high-dimensional heterogeneous MPCs. The first stream aims to extract position-encoded global spectral features with fusion self-attention, and the second stream comprises a multikernel point convolution and feature aggregation attention to extract spectral-guided geometric features. We then develop a residual attention fusion block to integrate the most informative geometric-spectral features from the two parallel streams. Another important contribution of this work is a joint loss function that improves learning on unbalanced and inter-class similar samples. Experimental results on two airborne MPC datasets demonstrate the effectiveness of the proposed method compared with state-of-the-art methods. Furthermore, the code and datasets used in this paper will be made freely available at https://github.com/HITlixian/TGRS_GSFF.

2606.09160 2026-06-09 cs.LG cs.AI 交叉投稿

Crop Recommendation and Agricultural Query Answering System Using Spatio-Temporal Graph Neural Networks and Hybrid Retrieval Augmentation

基于时空图神经网络与混合检索增强的作物推荐及农业问答系统

Prajwal Thapa, Yagya Raj Pandeya

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出融合时空图神经网络(STGCN)与检索增强生成(RAG)的精准农业系统,实现30天天气预报、作物推荐及农业问答,在尼泊尔1359个地点数据上STGCN预测MSE达0.011。

Comments 11 pages, 8 figures

详情
AI中文摘要

本文提出一个统一系统,旨在通过集成先进的天气预报、作物推荐和面向农民的问答工具来支持精准农业。我们提出了两个深度学习模型——基于Transformer的图神经网络和时空图卷积网络(STGCN)——利用尼泊尔1359个地点的数据预测未来30天的天气状况。STGCN在准确性上优于基于Transformer的模型(MSE约0.011 vs 0.013),有效建模了气候数据中的空间和时间依赖性。这些预测与静态土壤属性(如pH、水分和有机质含量)相结合,通过评分算法生成本地化的作物推荐,该算法匹配每种作物的最佳生长条件。此外,我们开发了一个检索增强生成(RAG)聊天机器人,利用领域特定的农业文档以自然语言回答农民的问题。整个系统通过移动应用程序部署,提供实时建议和对话支持。用户反馈证实了系统的可用性和相关性,尤其是在个性化农业指导有限的农村环境中。总体而言,我们的方法展示了如何将机器学习模型与当地农业数据相结合,为农民提供可操作的见解,促进更明智的决策、更高的作物产量和增强对气候变异的适应能力。

英文摘要

This paper presents a unified system designed to support precision agriculture by integrating advanced weather prediction, crop recommendation, and a question-answering tool for farmers. We propose two deep learning models -- a Transformer-based Graph Neural Network and a Spatio-Temporal Graph Convolutional Network (STGCN) -- to forecast weather conditions for the next 30 days using data from 1,359 locations in Nepal. The STGCN outperforms the Transformer-based model in accuracy (MSE ~0.011 vs. 0.013), effectively modeling both spatial and temporal dependencies in climate data. These predictions are combined with static soil properties such as pH, moisture, and organic content to generate localized crop recommendations through a scoring algorithm that matches each crop's optimal growing conditions. Additionally, we develop a Retrieval-Augmented Generation (RAG) chatbot that leverages domain-specific agricultural documents to answer farmers' questions in natural language. The entire system is deployed via a mobile application, offering real-time suggestions and conversational support. User feedback confirms the system's usability and relevance, especially in rural settings where personalized farming guidance is limited. Overall, our approach demonstrates how combining machine learning models with local agricultural data can empower farmers with actionable insights, promoting more informed decisions, better crop yields, and increased resilience to climate variability.

2606.09175 2026-06-09 cs.LG cs.AI cs.DC 交叉投稿

CANS: Accelerating Multiuser Collaborative Edge Inference via Cooperative Autodidactic NeuroSurgeon

CANS: 通过合作自教神经外科加速多用户协同边缘推理

Zheshun Wu, Ziyang Zhang, Changyao Lin, Zenglin Xu, Jie Liu

发表机构 * Harbin Institute of Technology Shenzhen(哈尔滨工业大学(深圳)) Politecnico di Milano(米兰理工大学) Harbin Institute of Technology(哈尔滨工业大学) Fudan University(复旦大学) Shanghai Academy of Artificial Intelligence for Science(上海人工智能科学研究院)

AI总结 提出CANS框架,利用FedLinUCB-DW算法让异构设备自适应学习最优DNN分区,通过共享在线推理反馈和离线经验加速多用户边缘协同推理,显著降低延迟。

Comments 24 pages, 14 figures, 5 tables, submitted for possible journal publication

详情
AI中文摘要

最近,移动边缘计算(MEC)支持的协作深度神经网络(DNN)推理已成为向资源受限的移动设备提供智能服务的一种有前景的方法。一个代表性场景是多用户协同边缘推理,其中不同设备独立地划分其DNN模型,并通过无线网络将后端计算卸载到公共边缘服务器。然而,由于未知且时变的系统条件(包括波动的无线链路和多样的设备能力),确定每个设备的最优DNN分区具有挑战性。为解决此问题,我们提出了Cooperative Autodidactic NeuroSurgeon(CANS),一种协同边缘推理框架,使设备能够通过在线推理期间共享信息反馈来自适应学习最优DNN分区。为处理设备异构性并更好地利用离线推理经验,我们集成了一种新颖的FedLinUCB-DW算法,该算法将相同类型的设备分组,并使用本地离线早期退出推理经验来热启动在线探索。此外,我们通过推导遗憾上界为FedLinUCB-DW提供了理论保证。我们还在模拟环境和硬件原型系统上验证了我们的方法。实证评估表明,与最先进的基线相比,CANS实现了更低的推理延迟。特别是在两个边缘设备的原型实验中,所提出的CANS相比非合作基线将平均推理延迟降低了高达50%。

英文摘要

Recently, mobile edge computing (MEC)-enabled collaborative deep neural network (DNN) inference has emerged as a promising approach for delivering intelligent services to resource-constrained mobile devices. A representative scenario is multi-user collaborative edge inference, where distinct devices independently partition their DNN models and offload backend computation to a common edge server over wireless networks. However, determining the optimal DNN partition for each device is challenging due to unknown and time-varying system conditions, including fluctuating wireless links and diverse device capabilities. To address this problem, we propose Cooperative Autodidactic NeuroSurgeon (CANS), a collaborative edge inference framework that enables devices to adaptively learn optimal DNN partitions by sharing informative feedback during online inference. To handle device heterogeneity and better leverage offline inference experience, we integrate a novel FedLinUCB-DW algorithm that groups devices of the same type and warm-starts online exploration using local offline early-exit inference experience. Furthermore, we provide theoretical guarantees for FedLinUCB-DW by deriving a regret upper bound. We also validate our method on both a simulated environment and a hardware prototype system. Empirical evaluations demonstrate that CANS achieves lower inference latency than state-of-the-art baselines. In particular, in prototype experiments on two edge devices, CANS reduces average inference latency by up to 50% compared to the non-cooperative baseline.

2606.09200 2026-06-09 cs.DC cs.AI 交叉投稿

Resource-aware Computation-Communication Overlap for multi-GPU ML Workloads

面向多GPU机器学习工作负载的资源感知计算-通信重叠

Minyu Cui, Miquel Pericas

发表机构 * Chalmers University of Technology and University of Gothenburg(查尔姆斯理工大学和哥德堡大学)

AI总结 针对多GPU训练中通信瓶颈,提出通过共享内存占用整形和通信流优先级提升实现计算与通信重叠,在多种GPU上减少执行时间达25.5%。

Comments To appear at the AI on HPC Workshop at ISC 2026, held in conjunction with ISC 2026

详情
AI中文摘要

大规模机器学习的快速增长使得跨多GPU的分布式训练成为现代ML系统的基本组成部分。随着模型大小和计算吞吐量的持续增加,通信开销已成为多GPU训练中的主要瓶颈,特别是在计算和通信顺序执行时。本文探索了使用两种可移植运行时控制实现计算和集体通信的并发执行:用于计算内核的共享内存驱动占用整形和用于通信内核的提升调度优先级。我们的方法通过每块共享内存分配来调节计算内核的驻留,为通信内核留下足够的片上资源以取得进展。此外,为通信流分配更高的优先级确保一旦资源可用,通信进展稳定。在NVIDIA A40、A100、H100和AMD MI250X GPU上的实验表明,所提出的方法能够实现有效的计算-通信重叠,并将总执行时间减少高达25.5%,而无需修改供应商库或内核实现。

英文摘要

The rapid growth of large-scale machine learning (ML) has made distributed training across multiple GPUs a fundamental component of modern ML systems. As model sizes and computational throughput continue to increase, communication overhead has become a dominant bottleneck in multi-GPU training, particularly when computation and communication are executed sequentially. This work explores concurrent execution of computation and collective communication using two portable runtime controls: shared-memory-driven occupancy shaping for computation kernels and elevated scheduling priority for communication kernels. Our approach regulates computation-kernel residency through per-block shared-memory allocation, leaving sufficient on-chip resources for communication kernels to make progress. In addition, assigning higher priority to communication streams ensures steady communication progress once resources become available. Experiments on NVIDIA A40, A100, H100, and AMD MI250X GPUs demonstrate that the proposed method enables effective computation-communication overlap and reduces total execution time by up to 25.5 percent, without modifying vendor libraries or kernel implementations.

2606.09266 2026-06-09 cs.SD cs.AI 交叉投稿

Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design

物理引导的序列生成框架用于声学超材料逆向设计

Yijie Li, Jiahao Xu, Ching-Chih Tsao, Lili Qiu, Jingxian Wang

发表机构 * National University of Singapore(新加坡国立大学) UT Austin(德克萨斯大学奥斯汀分校)

AI总结 提出MetaSeq框架,将声学超材料表示为结构化序列,通过序列到序列模型结合物理求解器和强化学习,实现宽带逆向设计,误差降低45%。

详情
AI中文摘要

声学超材料(AMM)逆向设计对于宽带目标响应尤其具有挑战性,原因是声学色散:在一个频率上匹配期望响应的结构可能在其它频率上偏离,而修改几何以改善一个子带通常会扰动相邻子带。然而,现有的宽带逆向设计方法要么受限于预定义模板,要么依赖于无法保持声学结构所需的几何精度和结构连通性的图像表示。我们提出了MetaSeq,一个物理引导的、基于序列的生成框架,用于声学超材料逆向设计。其核心是,MetaSeq引入了一种语言,将每个AMM表示为结构化序列,而不是像素网格或固定模板。这种表示保留了精确的几何形状,显式编码了连通性,并将逆向设计转化为从目标响应到结构序列的序列到序列任务。MetaSeq进一步构建了一个平衡、高保真的数据集,具有高效的校准和基于复杂度的采样。为了解决逆向设计的一对多性质,MetaSeq结合了监督预训练和基于物理求解器及有效性检查器引导的强化学习微调。针对COMSOL和五个基线的广泛评估表明,MetaSeq在最佳基线基础上将响应误差降低了45%。

英文摘要

Acoustic metamaterial (AMM) inverse design is particularly challenging for broadband target responses due to acoustic dispersion: a structure that matches the desired response at one frequency may deviate at others, and modifying geometry to improve one sub-band often perturbs neighboring sub-bands. Yet existing broadband inverse-design approaches are either constrained by predefined templates, or rely on image representations that fail to preserve the geometric precision and structural connectivity required by acoustic structures. We present MetaSeq, a physics-guided, sequence-based generative framework for acoustic metamaterial inverse design. At its core, MetaSeq introduces a language that represents each AMM as a structured sequence, rather than as a pixel grid or fixed template. This representation preserves precise geometry, explicitly encodes connectivity, and casts inverse design as a sequence-to-sequence task from target response to structure sequence. MetaSeq further constructs a balanced, high-fidelity dataset with efficient calibration and complexity-based sampling. To address the one-to-many nature of inverse design, MetaSeq combines supervised pretraining with reinforcement learning fine-tuning guided by a physics-based solver and validity checker. Extensive evaluations against COMSOL and five baselines show that MetaSeq reduces response error by 45% over the best baseline.

2606.09327 2026-06-09 cs.LG cs.AI 交叉投稿

A Universal Dense Football Event Representation Based on TabTransformer

基于TabTransformer的通用密集足球事件表示

Weiran Yang, Daniel Memmert, Maximilian Klemp-Weins

发表机构 * Institute of Exercise Training and Sport Informatics, German Sport University Cologne(科隆德国体育大学运动训练与体育信息学研究所)

AI总结 提出基于TabTransformer的模型,通过学习分类特征的嵌入向量,生成密集的足球事件表示,在下游任务中优于基线方法。

Comments 12 pages, 1 figure. Preprint submitted to the 13th Workshop on Machine Learning and Data Mining for Sports Analytics (MLSA 2026)

详情
AI中文摘要

足球事件数据为团队运动中球员动作的定量分析提供了丰富的时空来源。这些数据集包含异构特征,将连续的位置坐标与分类变量(如动作类型、动作结果和身体部位)相结合。此类数据已应用于体育分析中的比赛结果预测、球员评估和战术模式识别。然而,现有方法主要使用独热或序数嵌入表示来编码分类特征,忽略了动作描述符的内在语义。Transformer是一种基于自注意力的深度神经网络架构,能够捕获输入特征在任意位置之间的依赖关系。我们提出并实现了一个基于Transformer的模型,以学习分类事件特征之间的潜在依赖关系,并生成足球事件的密集表示。通过将分类特征编码为学习到的嵌入向量,在预训练期间捕获了特定于运动的动作语义,使得表示能够支持下游任务,如动作价值估计和比赛风格识别。实证评估表明,在下游预测任务中,嵌入表示在概率校准方面优于任务特定基线,如Brier分数所衡量的。

英文摘要

Football event data constitute a rich spatiotemporal source for quantitative analysis of player actions in team sports. These datasets contain heterogeneous features, combining continuous location coordinates with categorical variables such as action type, action outcome, and body part. Such data have been applied in sports analytics for match outcome forecasting, player evaluation, and tactical pattern recognition. However, existing approaches predominantly encode categorical features using one-hot or ordinal embedding representations, overlooking the intrinsic semantics of action descriptors. The Transformer is a deep neural network architecture based on self-attention that captures dependencies between input features at arbitrary positions. We propose and implement a Transformer-based model to learn latent dependencies among categorical event features and produce dense representations of football events. By encoding categorical features as learned embedding vectors, sport-specific action semantics are captured during pretraining, enabling the representations to support downstream tasks such as action value estimation and play style recognition. Empirical evaluation shows that the embedding representations yield superior probability calibration over task-specific baselines on the downstream prediction tasks, as measured by Brier score.
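The Brier score used above as the calibration measure is the mean squared difference between predicted probabilities and observed binary outcomes, so lower is better. A minimal sketch, assuming binary outcomes; the helper name `brier_score` is hypothetical and not taken from the paper:

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```

A perfectly confident and correct forecaster scores 0, while always predicting 0.5 scores 0.25 regardless of the outcomes.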

2606.09419 2026-06-09 cond-mat.mtrl-sci cs.AI 交叉投稿

Context-Aware Deep Learning for Defect Classification in Atomic-Resolution STEM

上下文感知深度学习用于原子分辨率扫描透射电镜中的缺陷分类

Jiadong Dan, Cheng Zhang, Leyi Loh, Ivan Verzhbitskiy, Yuan Chen, Goki Eda, Michel Bosman, N. Duane Loh

发表机构 * cond-mat.mtrl-sci(材料科学)

AI总结 提出上下文感知学习框架,融合图像对比度与元数据(成分、束能、探测器几何),解决仅凭图像对比度进行缺陷分类的歧义性,在模拟数据上准确率超98%,实验数据接近人类水平。

Comments 6 figures

详情
AI中文摘要

人工智能正在快速推进材料表征,然而电子显微镜中的大多数应用仅依赖图像对比度,忽视了影响图像形成的化学和实验上下文。这一局限性使得缺陷分类本质上具有歧义性,因为相似的对比度可能来自不同的材料或成像条件。在此,我们开发了一个上下文感知学习框架,将图像导出的对比度与描述成分、束能和探测器几何的元数据相结合。利用系统构建的约5500万模拟补丁数据集,涵盖96种掺杂单层过渡金属二硫族化合物的576种情况,我们表明,以上下文变量为条件将缺陷分类从一个不适定的纯图像任务转变为一个适定的、基于物理的问题。该框架在模拟数据上实现了超过98%的准确率,在实验数据上达到了接近人类的一致性,后验熵降低了94%。通过强调上下文基础而非架构复杂性,该方法将实验图像对比度与潜在的化学和成像条件联系起来,支持基于物理的缺陷分配,并为自主材料表征的多模态AI模型提供了一条通用路径。

英文摘要

Artificial intelligence is rapidly advancing materials characterization, yet most applications in electron microscopy rely solely on image contrast, overlooking the chemical and experimental context that shapes image formation. This limitation makes defect classification inherently ambiguous, as similar contrasts can arise from different materials or imaging conditions. Here we develop a context-aware learning framework that integrates image-derived contrast with metadata describing composition, beam energy, and detector geometry. Using a systematically constructed dataset of ~55 million simulated patches spanning 576 cases across 96 doped monolayer transition-metal dichalcogenides, we show that conditioning on contextual variables transforms defect classification from an ill-posed image-only task into a well-posed, physically grounded problem. The framework achieves over 98% accuracy on simulations and near-human agreement on experimental data, with a 94% reduction in posterior entropy. By emphasizing contextual grounding over architectural complexity, this approach links experimental image contrast to the underlying chemical and imaging conditions, supporting physically grounded defect assignments and a general pathway toward multimodal AI models for autonomous materials characterization.
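The "reduction in posterior entropy" can be read as the fractional drop in Shannon entropy of the class posterior once context is supplied. The sketch below illustrates that reading only; the function names are hypothetical, and the abstract does not state how the paper computes the 94% figure.

```python
import math
from typing import Sequence


def entropy_bits(p: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def relative_reduction(before: Sequence[float], after: Sequence[float]) -> float:
    """Fractional entropy drop from `before` to `after` (before must not be certain)."""
    h_before = entropy_bits(before)
    return (h_before - entropy_bits(after)) / h_before
```

For example, an image-only posterior that is uniform over four defect classes has 2 bits of entropy; if conditioning on metadata leaves a single class, the reduction is 100%.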

2606.09520 2026-06-09 physics.chem-ph cs.AI 交叉投稿

Closing the Prior-Posterior Loop: Self-Reflective Molecular Design with Analysis-Driven LLM Iteration

闭合先验-后验循环:基于分析驱动LLM迭代的自反性分子设计

Junyi Gong, Zijie Qiu, Ben Zhong Tang

发表机构 * Faculty of Chemistry, Shenzhen MSU-BIT University(深圳北理莫斯科大学化学系) School of Science and Engineering, Chinese University of Hong Kong (Shenzhen)(香港中文大学(深圳)理工学院) Department of Chemistry, Hong Kong University of Science and Technology(香港科技大学化学系)

AI总结 提出一种自反性分子设计框架,用第一性原理计算的完整物化理由替代标量反馈,使LLM从随机采样器转变为因果推理器,在HOMO-LUMO能隙任务中实现0.0003 eV偏差和100%成功率。

Comments 3 tables, 4 figures

详情
AI中文摘要

通用大语言模型能否像经验丰富的化学家一样精确设计分子?当前基于LLM的框架用标量反馈循环(生成、评分、拒绝)来回答这个问题,这相当于有依据的试错。本文表明,用第一性原理计算的完整物化理由替代单一数字,可将LLM从随机采样器转变为因果推理器。我们的系统将检索增强生成与自反思模块相结合,该模块将轨道能量、原子电荷和电子密度(而非压缩后的分数)反馈到设计循环中。在1.0至5.0 eV的HOMO-LUMO能隙目标上,这种结构-性质关系(SPR)反思实现了低至0.0003 eV的偏差,并在中等难度任务上达到100%的成功率,明显优于标量反馈基线和无反思基线。该框架可无缝推广到偶极矩设计,并在五种不同的LLM骨干上表现出鲁棒性。这些结果建立了一个新范式:当模型不仅知道分子失败了,而且理解其失败的原因时,迭代分子设计才真正具有机理性。

英文摘要

Can a general-purpose large language model design molecules with the precision of a seasoned chemist? Current LLM-based frameworks answer this question with scalar feedback loops (generate, score, reject) that amount to informed trial-and-error. Here we show that replacing a single number with the full physicochemical rationale from first-principles calculations transforms the LLM from a stochastic sampler into a causal reasoner. Our system couples retrieval-augmented generation with a self-reflection module that feeds orbital energies, atomic charges, and electron densities, rather than compressed scores, back into the design loop. On HOMO-LUMO gap targets from 1.0 to 5.0 eV, this structure-property-relationship (SPR) reflection achieves a deviation as low as 0.0003 eV and a 100% success rate on moderate tasks, decisively outperforming scalar-feedback and non-reflective baselines. The framework generalizes seamlessly to dipole-moment design and proves robust across five distinct LLM backbones. These results establish a new paradigm: when the model understands not only that a molecule fails, but why, iterative molecular design becomes genuinely mechanistic.

2606.09617 2026-06-09 math.OC cs.AI cs.CY cs.SY eess.SY 交叉投稿

Powering the Future of AI: Navigating the Trade-offs for Europe's Energy Transition and Net-Zero Goals

赋能AI未来:应对欧洲能源转型与净零目标的权衡

Mohammad Hemmati, Gbemi Oluleye, Vassilis M. Charitopoulos

发表机构 * Department of Chemical Engineering, Sargent Centre for Process Systems Engineering, University College London (UCL)(化学工程系、过程系统工程中心、伦敦大学学院(UCL)) Centre for Environmental Policy, Imperial College London(环境政策中心、伦敦帝国理工学院)

AI总结 通过21种AI增长情景下的空间优化模型,量化AI对欧洲电力需求、容量、排放和运行的影响,发现AI到2050年可能增加73-723 TWh需求,导致2030-2050年累计排放超调67-181 MtCO2,且AI基础设施选址将更依赖稳定电源和系统灵活性。

详情
AI中文摘要

全球AI的快速扩张导致能源密集型超大规模数据中心激增,使其成为电力系统规划和运行中的结构性挑战。利用覆盖21种AI增长情景的欧洲空间显式优化模型,我们系统量化了数据中心的额外需求、容量要求、排放和运行影响。结果表明,到2050年,AI可能推动73-723 TWh的额外需求,导致2030年至2050年间累计排放超调67-181 MtCO2。我们的分析表明,2030年后,AI基础设施的地理分布将更多地由稳定电源和系统灵活性决定,而非仅仅依赖清洁能源的丰富程度。在中等情景下,AI需要额外200小时的稳定发电,这使关键枢纽的平准化电力成本增加35欧元/兆瓦时。我们表明,即使在悲观情景下,现有基础设施也需要额外70吉瓦的容量,而在受控增长路径下,这一扩张可能达到226吉瓦。我们进一步发现,数据中心的工作负载动态强烈影响能源调度、系统灵活性和排放,而效率提升显著降低了容量需求和系统峰值。虽然我们的研究结果表明2050年净零目标可能实现,但中间年份可能出现关键排放风险,除非政策适应这一加速的数字转型,否则欧盟可能危及其碳中和目标。

英文摘要

The rapid expansion of AI globally has led to the proliferation of energy-intensive hyperscale data centres (DCs), making them a structurally challenging component in power system planning and operation. Using a spatially explicit optimisation model of Europe across 21 AI growth scenarios, we systematically quantify the additional demand, capacity requirements, emissions, and operational impacts of DCs. Results indicate that AI could drive 73-723 TWh of extra demand by 2050, risking cumulative emissions overshoots of 67-181 MtCO2 between 2030 and 2050. Our analysis indicates that after 2030, the geography of AI infrastructure will be shaped more by firm power and system flexibility than by the mere abundance of clean energy. In moderate scenarios, AI requires an additional 200 hours of firm generation, which increases LCOE by 35 EUR/MWh in key hubs. We show that even under the pessimistic scenarios, existing infrastructure would require 70 GW of additional capacity, while under managed growth pathways, this expansion could reach 226 GW. We further find that DC workload dynamics strongly shape energy dispatch, system flexibility, and emissions, while improved efficiency significantly reduces capacity needs and system peaks. While our findings suggest that net-zero targets for 2050 may be achieved, critical emission risks may appear in the intermediate years, and the EU may compromise its carbon-neutral goals unless policies adapt to this accelerating digital transformation.

2606.09643 2026-06-09 cs.DC cs.AI cs.LG cs.OS 交叉投稿

FMplex: Model Virtualization for Serving Extensible Foundation Models

FMplex: 用于服务可扩展基础模型的模型虚拟化

Hetvi Shastri, Pragya Sharma, Walid A. Hanafy, David Irwin, Mani Srivastava, Prashant Shenoy

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校) University of California Los Angeles(加州大学洛杉矶分校)

AI总结 提出FMplex系统,通过将基础模型作为虚拟化层实现多任务共享,结合批感知公平队列调度器,在7个基础模型和92个下游任务上降低延迟达80%,提升任务容量6倍。

详情
AI中文摘要

基础模型(FMs)越来越多地被用作语言、视觉、时间序列和多模态应用的下游任务骨干。然而,现有的模型服务系统将每个定制任务部署为独立的模型实例,从而复制了重型骨干,浪费了加速器内存,并失去了摊销批处理和加载成本的机会。本文提出了FMplex,一个将FM骨干视为部署共享的虚拟化底层的服务系统。FMplex为每个任务提供一个虚拟基础模型(vFM),这是一个由共享物理FM支持的逻辑私有FM实例。这种抽象允许独立定制的任务共享一个骨干,同时保留任务特定的扩展、独立生命周期和任务级隔离。此外,我们提出了一种批感知公平队列调度器,该调度器将加权任务级共享与跨共置任务的任务间和任务内批处理相结合。我们实现了一个基于FMplex的服务栈,涵盖任务构建、共享感知部署和运行时执行。在7个FM骨干(16个变体)和92个下游任务上,FMplex相比空间分区延迟降低高达80%,相比尽力而为共置延迟降低33.3%,同时在集群规模上可托管多达6倍的任务。

英文摘要

Foundation models (FMs) are increasingly used as backbones for downstream tasks across language, vision, time-series, and multimodal applications. Yet existing model-serving systems deploy each customized task as an independent model instance, thereby replicating heavyweight backbones, wasting accelerator memory, and losing opportunities to amortize batching and loading costs. This paper presents FMplex, a serving system that treats FM backbones as a virtualization substrate for deployment sharing. FMplex presents each task with a virtual foundation model (vFM), a logically private FM instance backed by a shared physical FM. This abstraction lets independently customized tasks share a backbone while preserving task-specific extensions, independent lifecycles, and task-level isolation. In addition, we propose a batch-aware fair-queueing scheduler that combines weighted task-level sharing with inter- and intra-task batching across colocated tasks. We implement an FMplex-based serving stack spanning task construction, sharing-aware deployment, and runtime execution. Across 7 FM backbones (16 variants) and 92 downstream tasks, FMplex reduces latency by up to 80% over spatial partitioning and 33.3% over best-effort co-location, while hosting up to 6x more tasks at cluster scale.
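To make the idea of weighted sharing plus cross-task batching concrete, the sketch below fills one batch for a shared backbone by repeatedly picking the task with the least weight-normalized service so far. It is a generic weighted fair-queueing illustration under simplifying assumptions (unit request cost, hypothetical class and method names), not FMplex's actual scheduler.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class FairBatcher:
    """Toy weighted fair queueing over per-task request queues."""

    def __init__(self, weights: Dict[str, float]) -> None:
        self.weights = weights
        self.queues: Dict[str, Deque[str]] = {t: deque() for t in weights}
        # Service received so far, divided by the task's weight.
        self.served: Dict[str, float] = {t: 0.0 for t in weights}

    def submit(self, task: str, request: str) -> None:
        self.queues[task].append(request)

    def next_batch(self, max_size: int) -> List[Tuple[str, str]]:
        """Build one mixed batch, favouring the least-served task."""
        batch: List[Tuple[str, str]] = []
        while len(batch) < max_size:
            ready = [t for t, q in self.queues.items() if q]
            if not ready:
                break
            task = min(ready, key=lambda t: self.served[t])
            batch.append((task, self.queues[task].popleft()))
            # Each request costs one unit, scaled by the task's weight.
            self.served[task] += 1.0 / self.weights[task]
        return batch
```

With weights 2:1 and both queues backlogged, a batch of three holds two requests from the heavier task and one from the lighter, so requests from different tasks share a single forward pass while long-run shares follow the weights.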

2606.09671 2026-06-09 cs.LG cs.AI 交叉投稿

Transition-Based Digital Twin Modelling for Alzheimer's Disease under Sparse Longitudinal Data

稀疏纵向数据下基于转换的阿尔茨海默病数字孪生建模

Yinyu Huang, Yilin Zhang, Sofia Michopoulou, Christopher Kipps, Rahman Attar

发表机构 * University of Southampton(南安普顿大学) University Hospital Southampton NHS Foundation Trust(南安普顿大学医院NHS基金会信托) Faculty of Medicine, University of Southampton(南安普顿大学医学院)

AI总结 针对阿尔茨海默病进展异质性和数据稀疏问题,提出结合局部转换建模与序列建模的数字孪生框架,利用多模态纵向数据预测认知状态并量化不确定性,在ADNI数据上表现优异。

Comments 13 pages, 5 figures, 3 tables. Accepted as a full-length paper at the International Conference on AI in Healthcare (AIiH) 2026

详情
AI中文摘要

阿尔茨海默病(AD)进展具有高度异质性,通常通过稀疏且不规则的纵向数据观察,给预测和个性化监测带来挑战。现有的机器学习方法利用多模态数据改进了AD预测,但往往侧重于静态分类或队列级风险估计,对个体特异性建模和不确定性推理的支持有限。为了解决这些局限性,我们提出了一种个性化数字孪生框架,用于AD预测和基于场景的分析,利用多模态纵向数据。该方法整合了互补的建模策略,以捕捉临床转换和跨访视的时间依赖性。使用阿尔茨海默病神经影像学倡议(ADNI)的数据,包括认知评估、临床变量和MRI衍生的表型,该框架预测认知状态和诊断类别,同时量化预测不确定性并实现患者特定的假设轨迹分析。在无泄漏的受试者级别分割上的评估表明,在评分预测和诊断分类方面表现强劲。在这种稀疏且不规则的ADNI设置中,相邻访视的基于转换的建模比基于序列的分支实现了更高的预测准确性,表明局部转换建模可能更数据高效。虽然序列模型对于不确定性感知的轨迹预测仍然有价值,但局部转换建模提供了一种更数据高效且稳健的预测策略。这些发现强调了将时间建模策略与临床数据结构对齐的重要性,并表明基于转换的数字孪生公式可能为神经退行性疾病的个性化预测提供一种实用且可解释的方法。

英文摘要

Alzheimer's disease (AD) progression is highly heterogeneous and is typically observed through sparse and irregular longitudinal data, posing challenges for prediction and personalised monitoring. Existing machine learning approaches have improved AD prediction using multimodal data, yet often focus on static classification or cohort-level risk estimation, providing limited support for subject-specific modelling and uncertainty-aware reasoning. To address these limitations, we present a personalised digital twin framework for AD prediction and scenario-based analysis using multimodal longitudinal data. The proposed approach integrates complementary modelling strategies to capture clinical transitions and temporal dependencies across visits. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), including cognitive assessments, clinical variables, and MRI-derived phenotypes, the framework predicts cognitive status and diagnostic categories while quantifying predictive uncertainty and enabling patient-specific what-if trajectory analysis. Evaluation on leak-free subject-level splits demonstrates strong performance in score forecasting and diagnosis classification. In this sparse and irregular ADNI setting, transition-based modelling of adjacent visits achieved higher predictive accuracy than the sequence-based branch, suggesting that local transition modelling may be more data-efficient. While sequence models remain valuable for uncertainty-aware trajectory forecasting, local transition modelling offers a more data-efficient and robust predictive strategy. These findings highlight the importance of aligning temporal modelling strategies with clinical data structure and suggest that transition-based digital twin formulations may provide a practical and interpretable approach for personalised disease forecasting in neurodegenerative disorders.
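Transition-based modelling of adjacent visits can be pictured as turning each subject's irregular visit history into independent (current state, time gap, next state) training samples. The sketch below shows that data preparation step only; the tuple layout and the name `transition_pairs` are assumptions for illustration, not the paper's pipeline.

```python
from typing import List, Tuple

Visit = Tuple[float, float]  # (time since baseline in months, cognitive score)


def transition_pairs(visits: List[Visit]) -> List[Tuple[float, float, float]]:
    """Turn one subject's visits into (score_now, gap, score_next) samples."""
    ordered = sorted(visits)  # sort by visit time
    return [
        (s0, t1 - t0, s1)
        for (t0, s0), (t1, s1) in zip(ordered, ordered[1:])
    ]
```

A subject with only three visits still yields two training samples, which is one intuition for why local transitions may be more data-efficient than full-sequence models in sparse settings.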

2510.18428 2026-06-09 cs.AI 版本更新

AlphaOPT: Formulating Optimization Programs with Self-Improving LLM Experience Library

AlphaOPT: 利用自改进LLM经验库构建优化问题

Minwei Kong, Ao Qu, Xiaotong Guo, Wenbin Ouyang, Chonghe Jiang, Han Zheng, Yining Ma, Dingyi Zhuang, Yuhan Tang, Junyi Li, Shenhao Wang, Haris Koutsopoulos, Hai Wang, Cathy Wu, Jinhua Zhao

发表机构 * Singapore-MIT Alliance for Research and Technology(新加坡-麻省理工联合研究技术联盟) Massachusetts Institute of Technology(麻省理工学院) University of Florida(佛罗里达大学) Northeastern University(美国东北大学) Singapore Management University(新加坡管理大学)

AI总结 提出AlphaOPT,一种自改进经验库,使LLM能从有限监督中学习优化建模知识,通过库学习和库演化两阶段循环,逐步提升性能,在多个基准上超越基线方法。

详情
AI中文摘要

优化建模是各行业关键决策的基础,但自动化仍然困难:自然语言问题描述必须转化为精确的数学公式和可执行的求解器代码。现有的基于LLM的方法通常依赖于脆弱的提示或昂贵的重新训练,两者的泛化能力都有限。最近的研究表明,大型模型可以通过经验重用进行改进,但在结构受限的环境中如何系统地获取、精炼和重用这些经验仍不清楚。我们提出了AlphaOPT,一个自改进的经验库,使LLM能够从有限的监督中学习优化建模知识,包括仅有答案的反馈,而无需标准程序、带注释的推理轨迹或参数更新。AlphaOPT运行在一个持续的两阶段循环中:库学习(Library Learning)阶段从失败的尝试中提取经求解器验证的结构化见解,库演化(Library Evolution)阶段基于跨任务的聚合证据精炼已存储见解的适用性。这种设计允许模型积累可重用的建模原则,提高跨问题实例的迁移能力,并随时间保持库的有界增长。在多个优化基准上的评估表明,AlphaOPT随着更多训练数据的可用而稳步提升(从100个训练项到300个,准确率从65%提高到72%),并在两个分布外数据集上分别比最强基线高出9.1%和8.2%。这些结果表明,基于求解器反馈的结构化经验学习为需要精确公式化和执行的复杂推理任务提供了一种可替代重新训练的实用方法。所有代码和数据可在以下网址获取:https://github.com/Minw913/AlphaOPT。

英文摘要

Optimization modeling underlies critical decision-making across industries, yet remains difficult to automate: natural-language problem descriptions must be translated into precise mathematical formulations and executable solver code. Existing LLM-based approaches typically rely on brittle prompting or costly retraining, both of which offer limited generalization. Recent work suggests that large models can improve via experience reuse, but how to systematically acquire, refine, and reuse such experience in structurally constrained settings remains unclear. We present AlphaOPT, a self-improving experience library that enables LLMs to learn optimization modeling knowledge from limited supervision, including answer-only feedback without gold-standard programs, annotated reasoning traces, or parameter updates. AlphaOPT operates in a continual two-phase cycle: a Library Learning phase that extracts solver-verified, structured insights from failed attempts, and a Library Evolution phase that refines the applicability of stored insights based on aggregate evidence across tasks. This design allows the model to accumulate reusable modeling principles, improve transfer across problem instances, and maintain bounded library growth over time. Evaluated on multiple optimization benchmarks, AlphaOPT steadily improves as more training data become available (65% → 72% from 100 to 300 training items) and outperforms the strongest baseline by 9.1% and 8.2% on two out-of-distribution datasets. These results demonstrate that structured experience learning, grounded in solver feedback, provides a practical alternative to retraining for complex reasoning tasks requiring precise formulation and execution. All code and data are available at: https://github.com/Minw913/AlphaOPT.

2604.04251 2026-06-09 cs.AI cs.CY cs.LG 版本更新

MC-CPO: Mastery-Conditioned Constrained Policy Optimization for Pedagogically Safe Intelligent Tutoring Systems

MC-CPO:面向教学安全智能辅导系统的掌握度条件约束策略优化

Oluseyi Olukola, Nick Rahimi

发表机构 * School of Computing Sciences(计算科学学院) Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA(计算机工程,南密西西比大学,哈蒂斯堡,MS 39406,USA)

AI总结 本文提出 MC-CPO 框架,通过结构化约束解决教学安全问题,提升学习者知识掌握率,实验证明其在两个平台上的效果显著。

Comments 35 pages, 8 figures. v2: Major revision adding real-world validation on Junyi Academy (16.2M interactions, 72,758 students) and XES3G5M (NeurIPS 2023, 5.1M interactions, 14,453 students). Revised title and abstract. Submitted to Computers and Education: Artificial Intelligence

详情
AI中文摘要

智能辅导系统越来越多地依赖强化学习来个性化教学,但针对可观察的参与信号进行优化,可能使学习者活动与真正的知识获取系统性地脱节。我们分析了两个已部署平台上超过2100万次学生交互,发现没有相应掌握度提升的参与事件在Junyi Academy(72,758名学生)上占交互的26.5%,在XES3G5M(14,453名学生,NeurIPS 2023)上占3.1%,证实这一现象在大规模部署的教育技术中可被直接观察到。我们提出掌握度条件约束策略优化(Mastery-Conditioned Constrained Policy Optimisation,MC-CPO),这是一个从结构上解决该问题的强化学习框架。MC-CPO使可接受的教学动作空间以学习者的掌握状态为条件:只有当先修知识达到掌握阈值时,相应概念才可用,因此动作空间会随学习者知识的增长而自然扩展。教学安全约束通过构造得到保证,并具有结构性先修安全、原始-对偶收敛以及严格优于事后过滤的形式化保证。MC-CPO是唯一在所有条件下都降低奖励黑客严重程度的方法。相对于所有基线,平均每回合掌握度提升在Junyi Academy上增加18.3%,在XES3G5M上增加54.0%,同时保持有竞争力的参与表现。这些结果支持将结构化约束建模作为已部署辅导系统中更安全的自适应教学策略的原则性基础。

英文摘要

Intelligent tutoring systems increasingly rely on reinforcement learning to personalise instruction, yet optimising for observable engagement signals can systematically decouple learner activity from genuine knowledge acquisition. Analysing over 21 million student interactions across two deployed platforms, we find engagement events without corresponding mastery gains occur in 26.5% of interactions on Junyi Academy (72,758 students) and 3.1% on XES3G5M (14,453 students, NeurIPS 2023), confirming this pattern is directly observable in deployed educational technology at scale. We introduce Mastery-Conditioned Constrained Policy Optimisation (MC-CPO), a reinforcement learning framework that addresses this problem structurally. MC-CPO conditions the admissible instructional action space on learner mastery state: a concept becomes available only when prerequisite knowledge meets a mastery threshold, yielding an action space that expands naturally as learners acquire knowledge. Pedagogical safety constraints are enforced by construction, with formal guarantees of structural prerequisite safety, primal-dual convergence, and strict dominance over post-hoc filtering. MC-CPO is the only method to reduce reward hacking severity across all conditions. Mean per-episode mastery gain increases by 18.3% on Junyi Academy and 54.0% on XES3G5M relative to all baselines, while competitive engagement performance is maintained. These results support structural constraint modelling as a principled foundation for safer adaptive instructional policies in deployed tutoring systems.
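The mastery-conditioned action space can be illustrated as a mask over concepts: a concept is admissible only when every prerequisite meets the mastery threshold. The sketch below is a simplified reading of that rule; the function name, the dictionary layout, and the 0.8 threshold are assumptions, not values from the paper.

```python
from typing import Dict, List, Set


def admissible_concepts(
    mastery: Dict[str, float],
    prerequisites: Dict[str, List[str]],
    threshold: float = 0.8,  # hypothetical mastery threshold
) -> Set[str]:
    """Return concepts whose prerequisites all meet the mastery threshold."""
    return {
        concept
        for concept, prereqs in prerequisites.items()
        if all(mastery.get(p, 0.0) >= threshold for p in prereqs)
    }
```

Because the policy only ever samples from this set, prerequisite violations are ruled out by construction rather than filtered after the fact, and the set grows as mastery estimates rise.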

2604.17406 2026-06-09 cs.AI 版本更新

EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale

EvoMaster:一种用于大规模代理科学的基础进化代理框架

Xinyu Zhu, Yuzhu Cai, Zexi Liu, Cheng Wang, Fengyang Li, Wenkai Jin, Wanxu Liu, Zehao Bing, Bingyang Zheng, Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xianghe Pang, Yaxin Du, Tingjia Miao, Yuzhi Zhang, Ruoxue Liao, Zhaohan Ding, Linfeng Zhang, Yanfeng Wang, Weinan E, Siheng Chen

发表机构 * School of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学人工智能学院) SciLand DP Technology(DP技术)

AI总结 EvoMaster通过持续自我进化机制,使代理能迭代优化假设并积累知识,实现跨学科的高效科学发现,其易用性与性能在多个基准测试中均表现优异。

Comments 17 pages, 3 figures

详情
AI中文摘要

大型语言模型与代理的融合正推动科学发现进入新纪元:代理科学。尽管科学方法本质上是迭代的,现有代理框架多为静态、范围狭窄,且缺乏从试错中学习的能力。为弥合这一差距,我们提出了EvoMaster,一种专为大规模代理科学设计的基础进化代理框架。以持续自我进化为核心原则,EvoMaster使代理能够在多个实验周期中迭代优化假设、自我批评并逐步积累知识,贴近人类的科学探究过程。作为领域无关的基础框架,EvoMaster极易扩展:开发者可用约100行代码为任意学科构建和部署能力强、可自我进化的科学代理。基于EvoMaster,我们孵化了覆盖机器学习、物理和通用科学等领域的SciMaster生态系统。在四个权威基准测试(Humanity's Last Exam、MLE-Bench Lite、BrowseComp和FrontierScience)上的评估显示,EvoMaster分别取得41.1%、75.8%、73.3%和53.3%的最先进分数。其性能全面超越通用基线OpenClaw,相对提升幅度为+159%至+316%,验证了其作为下一代自主科学发现基础框架的有效性和通用性。EvoMaster可在https://github.com/sjtu-sai-agents/EvoMaster获取。

英文摘要

The convergence of large language models and agents is catalyzing a new era of scientific discovery: Agentic Science. While the scientific method is inherently iterative, existing agent frameworks are predominantly static, narrowly scoped, and lack the capacity to learn from trial and error. To bridge this gap, we present EvoMaster, a foundational evolving agent framework engineered specifically for Agentic Science at Scale. Driven by the core principle of continuous self-evolution, EvoMaster empowers agents to iteratively refine hypotheses, self-critique, and progressively accumulate knowledge across experimental cycles, faithfully mirroring human scientific inquiry. Crucially, as a domain-agnostic base harness, EvoMaster is exceptionally easy to scale up -- enabling developers to build and deploy highly capable, self-evolving scientific agents for arbitrary disciplines in approximately 100 lines of code. Built upon EvoMaster, we incubated the SciMaster ecosystem across domains such as machine learning, physics, and general science. Evaluations on four authoritative benchmarks (Humanity's Last Exam, MLE-Bench Lite, BrowseComp, and FrontierScience) demonstrate that EvoMaster achieves state-of-the-art scores of 41.1%, 75.8%, 73.3%, and 53.3%, respectively. It comprehensively outperforms the general-purpose baseline OpenClaw with relative improvements ranging from +159% to +316%, robustly validating its efficacy and generality as the premier foundational framework for the next generation of autonomous scientific discovery. EvoMaster is available at https://github.com/sjtu-sai-agents/EvoMaster.

2604.19755 2026-06-09 cs.AI cs.LG 版本更新

Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks

基于LLM的可解释AML优先级排序:证据检索与反事实检查

Dorothy Torres, Wei Cheng, Ke Hu

发表机构 * School of Science, Technology, Engineering and Mathematics(科学、技术、工程与数学学院) School of Electrical Engineering and Computer Science(电气工程与计算机科学学院)

AI总结 本文提出一种可解释的AML优先级排序框架,结合检索增强的证据捆绑、结构化LLM输出合同和反事实验证,提升可审计性和鲁棒性,实验表明其在优先级排序性能和证据支持方面表现优异。

详情
AI中文摘要

反洗钱(AML)交易监控生成大量警报,需在严格审计和治理约束下快速优先级排序。尽管大语言模型(LLMs)可汇总异质证据并起草理由,但不受约束的生成在受监管流程中因幻觉、弱溯源性和不忠实的解释而风险较高。本文提出一种可解释的AML优先级排序框架,将优先级排序视为受证据约束的决策过程。我们的方法结合(i)从政策/类型指南、客户上下文、警报触发器和交易子图中检索增强的证据捆绑;(ii)一个结构化的LLM输出合同,要求明确引用并区分支持、矛盾或缺失的证据;(iii)反事实检查,验证最小、合理的扰动是否导致优先级推荐及其理由的连贯变化。我们在公开的合成AML基准和模拟器上评估,并与规则、表格和图机器学习基线以及LLM-only/RAG-only变体进行比较。结果表明,证据支撑显著提高了可审计性,并减少了数值和政策幻觉错误,而反事实验证进一步增加了与决策相关的可解释性和鲁棒性,实现了最佳的整体优先级排序性能(PR-AUC 0.75;升级F1 0.62)和强溯源性和忠实度指标(引用有效性0.98;证据支持0.88;反事实忠实度0.76)。这些发现表明,受约束、可验证的LLM系统可以在不牺牲合规要求的可追溯性和防御性的情况下,为AML优先级排序提供实用的决策支持。

英文摘要

Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under strict audit and governance constraints. While large language models (LLMs) can summarize heterogeneous evidence and draft rationales, unconstrained generation is risky in regulated workflows due to hallucinations, weak provenance, and explanations that are not faithful to the underlying decision. We propose an explainable AML triage framework that treats triage as an evidence-constrained decision process. Our method combines (i) retrieval-augmented evidence bundling from policy/typology guidance, customer context, alert triggers, and transaction subgraphs, (ii) a structured LLM output contract that requires explicit citations and separates supporting from contradicting or missing evidence, and (iii) counterfactual checks that validate whether minimal, plausible perturbations lead to coherent changes in both the triage recommendation and its rationale. We evaluate on public synthetic AML benchmarks and simulators and compare against rules, tabular and graph machine-learning baselines, and LLM-only/RAG-only variants. Results show that evidence grounding substantially improves auditability and reduces numerical and policy hallucination errors, while counterfactual validation further increases decision-linked explainability and robustness, yielding the best overall triage performance (PR-AUC 0.75; Escalate F1 0.62) and strong provenance and faithfulness metrics (citation validity 0.98; evidence support 0.88; counterfactual faithfulness 0.76). These findings indicate that governed, verifiable LLM systems can provide practical decision support for AML triage without sacrificing compliance requirements for traceability and defensibility.
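One way to read "citation validity" is the share of citations in a structured rationale that point at evidence actually present in the retrieved bundle. The sketch below implements that reading only; the abstract does not define the metric, and the function name and ID format are hypothetical.

```python
from typing import Iterable


def citation_validity(cited_ids: Iterable[str], evidence_ids: Iterable[str]) -> float:
    """Fraction of cited evidence IDs that exist in the retrieved bundle."""
    cited = list(cited_ids)
    if not cited:
        return 0.0  # a rationale with no citations supports nothing
    known = set(evidence_ids)
    return sum(c in known for c in cited) / len(cited)
```

A check like this runs after generation and before the recommendation reaches an investigator, so a rationale citing a non-existent policy clause or transaction can be rejected automatically.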

2606.00384 2026-06-09 cs.AI cs.CL cs.CV cs.LG stat.CO 版本更新

VESTA: Visual Exploration with Statistical Tool Agents

VESTA: 基于统计工具代理的视觉探索

William Rudman, Abhishek Divekar, Kanishk Jain, Sebastian Joseph, Stella S. R. Offner, Matthew Lease, Kyle Mahowald, Greg Durrett, Junyi Jessy Li

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) New York University(纽约大学)

AI总结 提出VESTA框架,通过动态增长的工具集指导数据变换、假设驱动可视化和统计检验,提升视觉语言模型在复杂统计建模任务上的性能。

详情
AI中文摘要

将定量模型拟合到数据上是科学工作流程中的核心步骤,但它仍然是最少自动化的步骤之一。最近的基于代理的系统利用语言和视觉语言模型(VLM)来迭代地提出和优化统计模型,但这些系统在更具挑战性的建模任务上表现不佳。为了解决这些限制,我们引入了VESTA:基于统计工具代理的视觉探索,这是一个框架,为VLM配备了一个动态增长的探索工具包,通过数据变换、假设驱动的可视化和稳健的统计检验来指导模型优化。与之前仅依赖迭代批评的系统不同,VESTA在优化之前和优化过程中通过选择或创建诊断工具主动探索数据,这些工具会累积在模型的上下文中,并可在以后重用。我们在三种工具配置下评估VESTA与已建立的基线:无工具、静态专家编写的工具和动态模型编写的工具。为了支持这一评估,我们引入了DAWN(自动工作流和数值建模数据集),这是一个针对分布拟合和时间序列建模的基准,具有不同的难度等级,并最终涉及真实世界的天文学任务,包括建模初始质量函数和引力波啁啾信号。我们发现VESTA的动态工具创建优于先前的代理流水线,在复杂和特定领域的任务上取得了最大的收益。我们进一步表明,动态生成的工具比现有视觉工具创建系统生成的工具复杂得多,每个函数覆盖更多的诊断类别,并且强烈倾向于VLM批评者可以直接推理的视觉输出。

英文摘要

Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.

2606.02802 2026-06-09 cs.AI 版本更新

ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning

ChatHealthAI: 将电子健康记录表示与大语言模型对齐以实现有依据的临床推理

Bo-Hong Wang, Baicheng Peng, Ruilin Wang, Jun Bai, Ziyang Song, Yue Li

发表机构 * School of Computer Science, McGill University(麦吉尔大学计算机科学系) Mila - Quebec AI Institute(魁北克人工智能研究所)

AI总结 提出ChatHealthAI框架,通过任务感知重采样器将预训练的EHR基础模型的结构化表示与冻结的大语言模型语义空间对齐,实现可解释的临床推理并保持预测性能。

Comments Main paper with appendix, 13 pages

详情
AI中文摘要

大语言模型在临床决策支持中展现出强大的自然语言推理能力,但难以有效建模结构化的纵向电子健康记录。相比之下,EHR基础模型可以学习预测性患者表示,但缺乏可解释的基于语言的推理。为弥合这一差距,我们提出ChatHealthAI,一个多模态推理框架,通过任务感知重采样器将预训练的EHR基础模型的结构化EHR表示与冻结的大语言模型的语义空间对齐。通过整合纵向患者表示与精细化的临床事件描述,ChatHealthAI在保持准确患者预测的同时,实现了基于临床的自然语言推理。我们在EHRSHOT基准上的三个临床预测任务上评估了ChatHealthAI。结果表明,ChatHealthAI在保持竞争性预测性能的同时,提高了推理质量和可解释性。这些发现凸显了将EHR基础模型与预训练大语言模型整合用于可解释临床预测的潜力。

英文摘要

Large language models (LLMs) exhibit strong natural-language reasoning abilities for clinical decision support, but struggle to effectively model structured longitudinal electronic health records (EHRs). In contrast, EHR foundation models can learn predictive patient representations, yet lack interpretable language-based reasoning. To bridge this gap, we propose ChatHealthAI, a multimodal reasoning framework that aligns structured EHR representations from a pretrained EHR foundation model with the semantic space of a frozen LLM through a task-aware resampler. By integrating longitudinal patient representations with refined clinical event descriptions, ChatHealthAI enables clinically grounded natural-language reasoning while maintaining accurate patient prediction. We evaluated ChatHealthAI on three clinical predictive tasks from the EHRSHOT benchmark. Results show that ChatHealthAI improves reasoning quality and interpretability while preserving competitive predictive performance. These findings highlight the potential of integrating EHR foundation models with pretrained LLMs for interpretable clinical prediction.

2606.06360 2026-06-09 cs.AI 版本更新

An Infectious Disease Spread Simulation Based on Large Language Model Decision Making

基于大语言模型决策的传染病传播模拟

Yonchanok Khaokaew, Ruochen Kong, Andreas Zufle, Hao Xue, Taylor Anderson, Chandini Raina MacIntyre, Matthew Scotch, Flora D. Salim, David J Heslop

发表机构 * Computer Science and Engineering Faculty of Engineering(计算机科学与工程系) The University of New South Wales(新南威尔士大学) Department of Computer Science(计算机科学系) Emory University(埃默里大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) George Mason University(乔治·梅森大学) The Kirby Institute Faculty of Medicine & Health(Kirby研究所医学与健康学院) Arizona State University(亚利桑那州立大学)

AI总结 提出一个空间显式的基于智能体的模拟框架,利用大语言模型生成自我报告流感样疾病的决策,并整合到基于人口普查的合成人群中,以捕捉社会与地理异质性。

Comments 12 pages

详情
AI中文摘要

在传染病爆发期间对个体决策进行建模对于理解行为动态和指导有效的公共卫生干预至关重要。先前的工作表明,大语言模型可以通过基于人口统计提示和情境背景生成智能体决策来模拟逼真的人类行为。我们在此基础上构建了一个空间显式的基于智能体的模拟框架,将LLM生成的关于自我报告流感样疾病的决策整合到基于人口普查的合成智能体群体中。位置被视为核心特征:智能体被分配到城市内的空间单元,利用真实世界的人口普查数据捕捉不同人口群体的空间分布,并实现地理多样化的行为建模。我们实施并比较了三种决策场景:独立推理、家庭影响和消息框架,并在旧金山和亚特兰大模拟了自我报告结果。结果显示,收入和受教育程度是报告率变化的主要驱动因素,地理、LLM模型选择和消息框架的影响较小但一致。我们的框架生成了捕捉社会和地理异质性的合成数据,支持空间流行病学建模和偏差感知行为分析。

英文摘要

Modelling individual decision-making during infectious disease outbreaks is crucial for understanding behavioural dynamics and informing effective public health interventions. Prior work has shown that large language models can simulate realistic human behaviour by generating agent decisions based on demographic prompts and situational context. We build on this foundation with a spatially grounded, agent-based simulation framework that integrates LLM-generated decisions about self-reported influenza-like illness into a census-based synthetic population of agents. Location is treated as a central feature: agents are assigned to spatial units within cities, capturing the spatial distributions of different demographic groups using real-world census data and enabling geographically diverse behavioural modelling. We implement and compare three decision scenarios, independent reasoning, household influence, and message framing, and simulate self-reporting outcomes in San Francisco and Atlanta. Results reveal that income and education are the dominant drivers of reporting rate variation, with smaller but consistent effects from geography, LLM model choice, and message framing. Our framework generates synthetic data that captures both social and geographic heterogeneity, supporting spatial epidemiological modelling and bias-aware behavioural analysis.

2312.02873 2026-06-09 cs.LG cs.AI 版本更新

Toward autocorrection of chemical process flowsheets using large language models

利用大型语言模型实现化工流程图的自动纠错

Lukas Schulze Balhorn, Marc Caballero, Artur M. Schweidtmann

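Affiliations * Process Intelligence Research Group, Department of Chemical Engineering, Delft University of Technology

AI summary Proposes a generative AI method based on large language models that automatically identifies errors in chemical process flowsheets and suggests corrections, reaching 80% top-1 accuracy on a synthetic dataset.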

详情
Journal ref
Computer Aided Chemical Engineering, Volume 53, 2024, Pages 3109-3114
AI中文摘要

过程工程领域广泛使用工艺流程图(PFD)和管道及仪表流程图(P&ID)来表示工艺流程和设备配置。然而,P&ID和PFD(以下统称为流程图)可能包含错误,导致安全隐患、操作效率低下和不必要的开支。纠正和验证流程图是一个繁琐的手动过程。我们提出了一种新颖的生成式AI方法,用于自动识别流程图中的错误并向用户建议修正,即自动纠错流程图。受大型语言模型(LLM)在人类语言语法自动纠错方面突破的启发,我们研究了LLM用于流程图的自动纠错。模型的输入是可能出错的流程图,输出是修正后的流程图建议。我们在合成数据集上以监督方式训练自动纠错模型。该模型在独立测试的合成流程图数据集上达到了80%的top-1准确率和84%的top-5准确率。结果表明,模型能够学习自动纠错合成流程图。我们设想流程图自动纠错将成为化学工程师的有用工具。

英文摘要

The process engineering domain widely uses Process Flow Diagrams (PFDs) and Piping and Instrumentation Diagrams (P&IDs) to represent process flows and equipment configurations. However, P&IDs and PFDs, hereafter called flowsheets, can contain errors that cause safety hazards, inefficient operation, and unnecessary expenses. Correcting and verifying flowsheets is a tedious, manual process. We propose a novel generative AI methodology for automatically identifying errors in flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets. Inspired by the breakthrough of Large Language Models (LLMs) for grammatical autocorrection of human language, we investigate LLMs for the autocorrection of flowsheets. The input to the model is a potentially erroneous flowsheet, and the output is a set of suggestions for a corrected flowsheet. We train our autocorrection model on a synthetic dataset in a supervised manner. The model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test dataset of synthetically generated flowsheets. The results suggest that the model can learn to autocorrect the synthetic flowsheets. We envision that flowsheet autocorrection will become a useful tool for chemical engineers.
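
The top-1 and top-5 figures can be read as follows: a test flowsheet counts as correct at top-k if the reference correction appears among the model's k highest-ranked suggestions. A minimal Python sketch of that metric, assuming an exact string match between suggestion and reference; the function name and the flowsheet strings are illustrative placeholders, not taken from the paper:

```python
from typing import Sequence


def top_k_accuracy(ranked_suggestions: Sequence[Sequence[str]],
                   references: Sequence[str],
                   k: int) -> float:
    """Fraction of cases whose reference appears in the first k suggestions."""
    if len(ranked_suggestions) != len(references):
        raise ValueError("one ranked suggestion list is needed per reference")
    if not references:
        raise ValueError("at least one test case is required")
    hits = sum(
        1 for suggestions, reference in zip(ranked_suggestions, references)
        if reference in suggestions[:k]
    )
    return hits / len(references)


# Hypothetical flowsheet strings, for illustration only.
suggestions = [
    ["(raw)(hex)(prod)", "(raw)(pp)(prod)"],  # reference ranked first
    ["(raw)(v)(prod)", "(raw)(r)(prod)"],     # reference ranked second
]
references = ["(raw)(hex)(prod)", "(raw)(r)(prod)"]

top1 = top_k_accuracy(suggestions, references, k=1)
top5 = top_k_accuracy(suggestions, references, k=5)
```

In this toy case the second reference is only found at rank two, so top-1 is lower than top-5, which mirrors the gap between 80% and 84% reported above.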

2412.00508 2026-06-09 cs.LG cs.AI cs.CE 版本更新

Graph-to-SFILES: Control structure prediction from process topologies using generative artificial intelligence

Graph-to-SFILES: 基于生成式人工智能从过程拓扑预测控制结构

Lukas Schulze Balhorn, Kevin Degens, Artur M. Schweidtmann

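Affiliations * Process Intelligence Research Group; Department of Chemical Engineering; Delft University of Technology

AI summary Proposes the Graph-to-SFILES model, which uses graph neural networks to generate control-extended flowsheet sequences from flowsheet topologies and substantially improves control structure prediction accuracy on small datasets.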

详情
Journal ref
Computers & Chemical Engineering, Volume 199, 2025, Pages 109121
AI中文摘要

控制结构设计是P&ID开发中重要但繁琐的步骤。生成式人工智能有望通过支持工程师来减少P&ID开发时间。先前关于化学过程设计中生成式AI的研究主要用序列表示过程。然而,图因其置换不变性而成为一种有前景的替代方案。我们提出了Graph-to-SFILES模型,一种从流程图拓扑预测控制结构的生成式AI方法。Graph-to-SFILES模型将流程图拓扑作为图输入,并返回以SFILES 2.0符号表示的控制扩展流程图序列。我们比较了四种不同的图编码器架构,其中一种是本文提出的图神经网络(GNN)。Graph-to-SFILES模型在10,000个流程图拓扑上训练时达到了73.2%的top-5准确率。此外,所提出的GNN在编码器架构中表现最佳。与纯基于序列的方法相比,Graph-to-SFILES模型在相对较小的1,000个流程图训练数据集上将top-5准确率从0.9%提高到28.4%。然而,在100,000个流程图的大规模数据集上,基于序列的方法表现更好。这些结果突显了基于图的AI模型在小数据场景下加速P&ID开发的潜力,但其在工业相关案例研究中的有效性仍需进一步研究。

英文摘要

Control structure design is an important but tedious step in P&ID development. Generative artificial intelligence (AI) promises to reduce P&ID development time by supporting engineers. Previous research on generative AI in chemical process design mainly represented processes by sequences. However, graphs offer a promising alternative because of their permutation invariance. We propose the Graph-to-SFILES model, a generative AI method to predict control structures from flowsheet topologies. The Graph-to-SFILES model takes the flowsheet topology as a graph input and returns a control-extended flowsheet as a sequence in the SFILES 2.0 notation. We compare four different graph encoder architectures, one of them being a graph neural network (GNN) proposed in this work. The Graph-to-SFILES model achieves a top-5 accuracy of 73.2% when trained on 10,000 flowsheet topologies. In addition, the proposed GNN performs best among the encoder architectures. Compared to a purely sequence-based approach, the Graph-to-SFILES model improves the top-5 accuracy for a relatively small training dataset of 1,000 flowsheets from 0.9% to 28.4%. However, the sequence-based approach performs better on a large-scale dataset of 100,000 flowsheets. These results highlight the potential of graph-based AI models to accelerate P&ID development in small-data regimes but their effectiveness on industry relevant case studies still needs to be investigated.

2502.18493 2026-06-09 cs.CE cs.AI 版本更新

Rule-based autocorrection of Piping and Instrumentation Diagrams (P&IDs) on graphs

基于规则的管道与仪表图(P&ID)图形自动校正

Lukas Schulze Balhorn, Niels Seijsener, Kevin Dao, Minji Kim, Dominik P. Goldstein, Ge H. M. Driessen, Artur M. Schweidtmann

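Affiliations * Process Intelligence Research Group; Department of Chemical Engineering; Delft University of Technology; Fluor BV, Amsterdam, The Netherlands

AI summary Proposes a rule-based method on a graph representation of P&IDs that uses 33 chemical engineering rules for automatic error detection and correction; a case study on an illustrative P&ID supports its reliability.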

详情
Journal ref
Systems and Control Transactions, Volume 4, 2025, Pages 1656-1661
AI中文摘要

管道与仪表图(P&ID)是化学过程工程中的核心参考文档。目前,化学工程师通过目视检查手动审查P&ID以发现和纠正错误。然而,工程项目可能涉及数百至数千页P&ID,造成巨大的修订工作量。本研究提出一种基于规则的方法,支持工程师进行P&ID的错误检测与校正。该方法基于P&ID的图表示,通过规则图实现自动错误检测与校正,即自动校正。我们使用pyDEXPI Python包从DEXPI标准的P&ID生成P&ID图。在本研究中,我们基于化学工程知识和启发式方法开发了33条规则,并展示了其中五条选定的规则作为示例。一个示例P&ID的案例研究验证了基于规则的自动校正方法在修订P&ID中的可靠性和有效性。

英文摘要

A piping and instrumentation diagram (P&ID) is a central reference document in chemical process engineering. Currently, chemical engineers manually review P&IDs through visual inspection to find and rectify errors. However, engineering projects can involve hundreds to thousands of P&ID pages, creating a significant revision workload. This study proposes a rule-based method to support engineers with error detection and correction in P&IDs. The method is based on a graph representation of P&IDs, enabling automated error detection and correction, i.e., autocorrection, through rule graphs. We use our pyDEXPI Python package to generate P&ID graphs from DEXPI-standard P&IDs. In this study, we developed 33 rules based on chemical engineering knowledge and heuristics, with five selected rules demonstrated as examples. A case study on an illustrative P&ID validates the reliability and effectiveness of the rule-based autocorrection method in revising P&IDs.

2505.07573 2026-06-09 cs.CV cs.AI 版本更新

Robust Renal Mass Segmentation on CT: A Validation Study of an AI-Based Framework

基于CT的肾脏肿块鲁棒分割:AI框架的验证研究

Sarah de Boer, Hartmut Häntze, Kiran Vaidhya Venkadesh, Myrthe A. D. Buser, Gabriel E. Humpire Mamani, Lina Xu, Lisa C. Adams, Jawed Nawabi, Keno K. Bressem, Bram van Ginneken, Mathias Prokop, Alessa Hering

Affiliations * Department of Medical Imaging, Radboudumc, Nijmegen, The Netherlands; Department of Radiology, Charité - Universitätsmedizin Berlin, Berlin, Germany; Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Berlin, Germany; Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany; Department of Cardiovascular Radiology and Nuclear Medicine, German Heart Center, TUM University Hospital, Technical University of Munich, Munich, Germany; Fraunhofer MEVIS, Bremen, Germany

AI summary Proposes Renal-Net, trained with nnU-Net on public data, for renal mass segmentation on CT; validation shows that it outperforms existing models and is robust across patient subgroups.

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2026:012. 23 pages, 12 figures

详情
Journal ref
Machine Learning for Biomedical Imaging 2026 (2026)
AI中文摘要

肾脏肿块分割在临床工作流中具有重要潜力,尤其是在需要定量评估的场景中。肾脏体积可作为肾脏疾病的重要生物标志物,其体积变化与肾功能直接相关。目前,临床实践常依赖主观视觉评估来评价肾脏大小和肾脏病变(包括肿瘤和囊肿),这些病变通常根据直径、体积和解剖位置进行分期。为了支持更客观和可重复的方法,本研究旨在开发一个鲁棒且经过充分验证的肾脏肿块分割算法,命名为Renal-Net。我们使用公开可用的训练数据集,并利用最先进的医学图像分割框架nnU-Net。使用专有和公开测试数据集进行验证,分割性能通过Dice系数和95百分位Hausdorff距离量化。此外,我们根据患者性别、年龄、CT对比相和肿瘤组织学亚型分析亚组鲁棒性。我们的结果表明,仅使用公开数据训练的分割算法能有效泛化到外部测试集,并在所有测试数据集上优于现有最先进模型。亚组分析显示一致的高性能,表明强鲁棒性和可靠性。开发的算法和相关代码可在以下网址公开获取:https://this.url。

英文摘要

Renal mass segmentation has important potential to enhance the clinical workflow, especially in settings requiring quantitative assessments. Kidney volume could serve as an important biomarker for renal diseases, with changes in volume correlating directly with kidney function. Currently, clinical practice often relies on subjective visual assessment for evaluating kidney size and kidney lesions, including tumors and cysts, which are typically staged based on diameter, volume, and anatomical location. To support a more objective and reproducible approach, this research aims to develop a robust, thoroughly validated renal mass segmentation algorithm, named Renal-Net. We employ publicly available training datasets and leverage the state-of-the-art medical image segmentation framework nnU-Net. Validation is conducted using both proprietary and public test datasets, with segmentation performance quantified by Dice coefficient and the 95th percentile Hausdorff distance. Furthermore, we analyze robustness across subgroups based on patient sex, age, CT contrast phases, and tumor histologic subtypes. Our findings demonstrate that our segmentation algorithm, trained exclusively on publicly available data, generalizes effectively to external test sets and outperforms existing state-of-the-art models across all tested datasets. Subgroup analyses reveal consistent high performance, indicating strong robustness and reliability. The developed algorithm and associated code are publicly accessible at https://github.com/DIAGNijmegen/oncology-kidney-abnormality-segmentation.
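
For readers unfamiliar with the overlap metric named in the abstract, the Dice coefficient of a predicted mask P against a reference mask T is 2|P ∩ T| / (|P| + |T|). The sketch below is a minimal pure-Python version for flattened binary masks; the function name and the toy masks are illustrative assumptions, and the 95th percentile Hausdorff distance used alongside it is not shown:

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks: 3 predicted voxels, 3 reference voxels, 2 in common.
pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]
dice = dice_coefficient(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```

A value of 1 means perfect overlap and 0 means none; real evaluations would typically compute this per case on 3D volumes and then aggregate.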

2505.07833 2026-06-09 cs.DC cs.AI cs.MA cs.OS 版本更新

Harmonia: End-to-End RAG Serving Optimization

Harmonia: 端到端RAG服务优化

Saurabh Agarwal, Bodun Hu, Luis Pabon, Myungjin Lee, Jayanth Srinivasa, Aditya Akella

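Affiliations * UT Austin; Cisco Research; Cisco Systems

AI summary Proposes the Harmonia framework, which optimizes RAG serving through a flexible pipeline interface, heterogeneity-aware deployment, and a closed-loop runtime controller, improving throughput by more than 2.04x and reducing SLO violations by up to 78.4%.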

详情
AI中文摘要

检索增强生成(RAG)通过集成外部知识提高了大型语言模型的可靠性,但高效服务RAG管道具有挑战性,因为请求会遍历跨越LLM推理、数据库和CPU端处理的异构组件。我们提出了Harmonia,一个端到端的RAG服务框架,通过以下方式解决这些瓶颈:(i) 灵活的管道规范接口,用于组合自定义工作流;(ii) 异构感知部署,将组件作为分布式推理系统进行配置和部署;(iii) 闭环运行时控制器,监控负载和执行进度,并通过请求优先级排序和自动缩放减少SLO违规。在四个RAG应用中,Harmonia优于商业替代方案,吞吐量提升超过2.04倍,同时SLO违规减少高达78.4%。

英文摘要

Retrieval-Augmented Generation (RAG) improves the reliability of large language models by integrating external knowledge, but serving RAG pipelines efficiently is challenging because requests traverse heterogeneous components spanning LLM inference, databases, and CPU-side processing. We present Harmonia, an end-to-end RAG serving framework that addresses these bottlenecks through (i) a flexible pipeline specification interface for composing custom workflows, (ii) heterogeneity-aware deployment that provisions and configures components as a distributed inference system, and (iii) a closed-loop runtime controller that monitors load and execution progress and reduces SLO violations through request prioritization and auto-scaling. Across four RAG applications, Harmonia outperforms commercial alternatives, improving throughput by more than 2.04x while reducing SLO violations by up to 78.4 percent.

2507.08920 2026-06-09 q-bio.BM cs.AI 版本更新

AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model

AMix-1: 迈向测试时可扩展的蛋白质基础模型

Changze Lv, Jiang Zhou, Siyu Long, Lihao Wang, Jiangtao Feng, Dongyu Xue, Yu Pei, Hao Wang, Zherui Zhang, Yuchen Cai, Zhiqiang Gao, Ziyuan Ma, Jiakai Hu, Chaochen Gao, Jingjing Gong, Yuxuan Song, Shuyi Zhang, Xiaoqing Zheng, Deyi Xiong, Lei Bai, Wanli Ouyang, Ya-Qin Zhang, Wei-Ying Ma, Bowen Zhou, Hao Zhou

Affiliations * Shanghai Artificial Intelligence Laboratory; Generative Symbolic Intelligence Lab (GenSI), Tsinghua University; Institute for AI Industry Research (AIR), Tsinghua University; Tsinghua University; Fudan University; Tianjin University; Georgia Institute of Technology; Beijing University of Posts and Telecommunications; University of Chinese Academy of Sciences; City University of Hong Kong

AI summary Proposes AMix-1, a protein foundation model built on Bayesian Flow Networks; using pretraining scaling laws, emergent capability analysis, an in-context learning mechanism, and a test-time scaling algorithm, it yields a 1.7B-parameter model and designs an AmeR variant with up to 50x higher activity.

详情
AI中文摘要

我们介绍了AMix-1,一个强大的蛋白质基础模型,它基于贝叶斯流网络构建,并通过系统性的训练方法学增强,包括预训练缩放律、涌现能力分析、上下文学习机制和测试时缩放算法。为了保证稳健的可扩展性,我们建立了一个预测性缩放律,并通过损失视角揭示了结构理解的渐进涌现,最终得到了一个强大的17亿参数模型。在此基础上,我们设计了一种基于多序列比对(MSA)的上下文学习策略,将蛋白质设计统一到一个通用框架中,其中AMix-1识别MSA中的深层进化信号,并一致地生成结构和功能上连贯的蛋白质。该框架成功设计了一个显著改进的AmeR变体,其活性比野生型提高了高达50倍。为了突破蛋白质工程的边界,我们进一步为AMix-1配备了一种进化测试时缩放算法,用于计算机模拟定向进化,随着验证预算的增加,该算法提供了显著且可扩展的性能提升,为下一代实验室在环蛋白质设计奠定了基础。

英文摘要

We introduce AMix-1, a powerful protein foundation model built on Bayesian Flow Networks and empowered by a systematic training methodology encompassing pretraining scaling laws, emergent capability analysis, an in-context learning mechanism, and a test-time scaling algorithm. To guarantee robust scalability, we establish a predictive scaling law and reveal the progressive emergence of structural understanding from a loss perspective, culminating in a strong 1.7-billion-parameter model. Building on this foundation, we devise a multiple sequence alignment (MSA)-based in-context learning strategy to unify protein design into a general framework, where AMix-1 recognizes deep evolutionary signals among MSAs and consistently generates structurally and functionally coherent proteins. This framework enables the successful design of a dramatically improved AmeR variant with an up to $50\times$ activity increase over its wild type. Pushing the boundaries of protein engineering, we further empower AMix-1 with an evolutionary test-time scaling algorithm for in silico directed evolution that delivers substantial, scalable performance gains as verification budgets are intensified, laying the groundwork for next-generation lab-in-the-loop protein design.

2509.10334 2026-06-09 cs.CV cs.AI cs.LG 版本更新

I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation

I-Segmenter: 用于高效语义分割的纯整数视觉Transformer

Jordan Sassoon, Michal Szczepanski, Martyna Poreba

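Affiliations * CEA, France

AI summary Proposes I-Segmenter, the first fully integer-only ViT segmentation framework; by replacing floating-point operations with integer-only counterparts, introducing the λ-ShiftGELU activation, and adapting the decoder, it reduces model size by up to 3.8x and speeds up inference by up to 1.2x while staying within 5.1% of FP32 accuracy on average.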

Comments Accepted by the Journal of Systems Architecture

详情
AI中文摘要

视觉Transformer(ViT)最近在语义分割中取得了强劲的结果,但由于其高内存占用和计算成本,在资源受限设备上的部署仍然有限。量化提供了一种提高效率的有效策略,但基于ViT的分割模型在低精度下非常脆弱,因为量化误差会在深度编码器-解码器流水线中累积。我们引入了I-Segmenter,这是第一个完全纯整数的ViT分割框架。基于Segmenter架构,I-Segmenter系统地将浮点运算替换为纯整数对应运算。为了进一步稳定训练和推理,我们提出了λ-ShiftGELU,一种新颖的激活函数,它减轻了均匀量化在处理长尾激活分布时的局限性。此外,我们移除了L2归一化层,并将解码器中的双线性插值替换为最近邻上采样,确保整个计算图都是纯整数执行。大量实验表明,I-Segmenter在合理精度范围内(平均5.1%)达到其FP32基线的精度,同时将模型大小减少高达3.8倍,并通过优化的运行时实现高达1.2倍的推理加速。值得注意的是,即使在单张校准图像的一次性PTQ中,I-Segmenter也能提供有竞争力的精度,凸显了其在实际部署中的实用性。

英文摘要

Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter, the first fully integer-only ViT segmentation framework. Building on the Segmenter architecture, I-Segmenter systematically replaces floating-point operations with integer-only counterparts. To further stabilize both training and inference, we propose λ-ShiftGELU, a novel activation function that mitigates the limitations of uniform quantization in handling long-tailed activation distributions. In addition, we remove the L2 normalization layer and replace bilinear interpolation in the decoder with nearest neighbor upsampling, ensuring integer-only execution throughout the computational graph. Extensive experiments show that I-Segmenter achieves accuracy within a reasonable margin of its FP32 baseline (5.1% on average), while reducing model size by up to 3.8x and enabling up to 1.2x faster inference with optimized runtimes. Notably, even in one-shot PTQ with a single calibration image, I-Segmenter delivers competitive accuracy, underscoring its practicality for real-world deployment.
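
To see why swapping bilinear interpolation for nearest neighbor upsampling keeps the decoder integer-only: bilinear interpolation mixes neighbouring values with fractional weights, whereas nearest neighbor upsampling only copies values selected by integer index arithmetic. The sketch below illustrates the general operation for an integer scale factor; it is not I-Segmenter's implementation, and the function name and toy input are assumptions:

```python
from typing import List


def nearest_neighbor_upsample(grid: List[List[int]], scale: int) -> List[List[int]]:
    """Upsample a 2D integer map by an integer factor using index arithmetic only."""
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    out_h = len(grid) * scale
    out_w = len(grid[0]) * scale if grid else 0
    # Each output pixel copies the source pixel at (y // scale, x // scale);
    # no floating-point interpolation weights are involved.
    return [
        [grid[y // scale][x // scale] for x in range(out_w)]
        for y in range(out_h)
    ]


# Toy 2x2 map of quantized values, upsampled to 4x4.
logits = [[1, 2],
          [3, 4]]
upsampled = nearest_neighbor_upsample(logits, 2)
```

Every output value is one of the input integers, so the quantized representation passes through unchanged.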

2510.10028 2026-06-09 cs.LG cs.AI cs.DC 版本更新

Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization

基于LLM增强优化的无人机低空经济网络高效机载视觉-语言推理

Yang Li, Ruichen Zhang, Yinqiu Liu, Guangyuan Liu, Abbas Jamalipour, Xianbin Wang, Dong In Kim

Affiliations * College of Computing and Data Science, Nanyang Technological University, Singapore; The University of Sydney, Sydney, Australia; Department of Electrical and Computer Engineering, Western University, London, Canada; Department of Electrical and Computer Engineering, Sungkyunkwan University, South Korea

AI summary Addresses the accuracy and communication-efficiency challenges of onboard vision-language model inference in UAV-enabled low-altitude economy networks with a hierarchical optimization framework that combines an alternating resolution and power optimization algorithm with an LLM-augmented reinforcement learning method for trajectory optimization.

详情
AI中文摘要

低空经济网络(LAENets)的快速发展催生了多种应用,包括空中监视、环境感知和语义数据收集。为支持这些场景,配备机载视觉-语言模型(VLM)的无人机(UAV)为实时多模态推理提供了一种有前景的解决方案。然而,由于有限的机载资源和动态的网络条件,确保推理准确性和通信效率仍然是一个重大挑战。在本文中,我们首先提出一个无人机启用的LAENet系统模型,该模型联合捕捉无人机移动性、用户-无人机通信以及机载视觉问答(VQA)流水线。基于该模型,我们制定了一个混合整数非凸优化问题,以在用户特定的准确性约束下最小化任务延迟和功耗。为解决该问题,我们设计了一个由两部分组成的分层优化框架:(i)交替分辨率与功率优化(ARPO)算法,用于在准确性约束下进行资源分配;(ii)大语言模型增强的强化学习方法(LLaRA),用于自适应无人机轨迹优化。大语言模型(LLM)作为专家,以离线方式改进强化学习的奖励设计,在实时决策中不引入额外延迟。数值结果证明了我们提出的框架在动态LAENet条件下提升推理性能和通信效率的有效性。

英文摘要

The rapid advancement of Low-Altitude Economy Networks (LAENets) has enabled a variety of applications, including aerial surveillance, environmental sensing, and semantic data collection. To support these scenarios, unmanned aerial vehicles (UAVs) equipped with onboard vision-language models (VLMs) offer a promising solution for real-time multimodal inference. However, ensuring both inference accuracy and communication efficiency remains a significant challenge due to limited onboard resources and dynamic network conditions. In this paper, we first propose a UAV-enabled LAENet system model that jointly captures UAV mobility, user-UAV communication, and the onboard visual question answering (VQA) pipeline. Based on this model, we formulate a mixed-integer non-convex optimization problem to minimize task latency and power consumption under user-specific accuracy constraints. To solve the problem, we design a hierarchical optimization framework composed of two parts: (i) an Alternating Resolution and Power Optimization (ARPO) algorithm for resource allocation under accuracy constraints, and (ii) a Large Language Model-augmented Reinforcement Learning Approach (LLaRA) for adaptive UAV trajectory optimization. The large language model (LLM) serves as an expert in refining reward design of reinforcement learning in an offline fashion, introducing no additional latency in real-time decision-making. Numerical results demonstrate the efficacy of our proposed framework in improving inference performance and communication efficiency under dynamic LAENet conditions.

2511.18454 2026-06-09 cs.CV cs.AI 版本更新

AttnRegDeepLab: A Two-Stage Decoupled Framework for Interpretable Embryo Fragmentation Grading

AttnRegDeepLab: 一种用于可解释胚胎碎片分级的双阶段解耦框架

Ming-Jhe Lee, Chang-Hong Wu, Jung-Hua Wang, Ming-Jer Chen, Yu-Chiao Yi, Tsung-Hsien Lee

Affiliations * Department of Electrical Engineering; AI Research Center; National Taiwan Ocean University; Department of Obstetrics, Gynecology; Gynecology, CSMU Hospital, Taichung, Taiwan

AI summary Proposes the AttnRegDeepLab framework, which combines dual-branch multi-task learning, attention gates, a multi-scale regression head, and two-stage decoupled training to achieve accurate and interpretable embryo fragmentation grading.

Comments 6 pages, 5 figures

详情
AI中文摘要

胚胎碎片是评估体外受精(IVF)发育潜力的关键形态学指标。然而,手动分级主观且低效,而现有的深度学习解决方案往往缺乏临床可解释性,或在分割区域估计中遭受累积误差。为了解决这些问题,本研究提出了AttnRegDeepLab(注意力引导回归DeepLab),一种以双分支多任务学习(MTL)为特征的框架。通过将注意力门集成到其跳跃连接中,修改了原始的DeepLabV3+解码器,显式抑制细胞质噪声以保留轮廓细节。此外,引入了一个多尺度回归头,并采用特征注入机制将全局分级先验传播到分割任务中,纠正系统量化误差。提出了一种两阶段解耦训练策略来解决MTL中的梯度冲突。同时,设计了一种基于范围的损失以利用弱标记数据。我们的方法在保持出色分割精度(Dice系数=0.729)的同时实现了稳健的分级精度,这与可能以牺牲轮廓完整性为代价最小化分级误差的端到端方法形成对比。这项工作提供了一种在视觉保真度和量化精度之间取得平衡的临床可解释解决方案。

英文摘要

Assessing embryo fragmentation is crucial for predicting IVF success, yet manual grading is prone to subjectivity, and existing AI models struggle with clinical interpretability and segmentation errors. We propose AttnRegDeepLab, a Multi-Task Learning (MTL) framework designed to solve these challenges. The model enhances a DeepLabV3+ decoder with Attention Gates to filter out cytoplasmic noise and retain sharp contour details. It also introduces a Multi-Scale Regression Head with Feature Injection, guiding the segmentation process with global grading priors to eliminate systematic area estimation errors. Based on a two-stage decoupled training strategy and a range-based loss for weakly labeled data, our method resolves MTL gradient conflicts. AttnRegDeepLab yields high grading precision and excellent segmentation quality (Dice coefficient = 0.729), avoiding the trade-off between contour integrity and grading accuracy seen under standard joint optimization. This provides a reliable, clinically interpretable tool balancing visual and quantitative accuracy.

2511.18493 2026-06-09 eess.IV cs.AI cs.CV 版本更新

SAGE: Shape-Adapting Gated Experts for Adaptive Histopathology Image Segmentation

SAGE:适应性组织病理图像分割的形状自适应门控专家

Gia Huy Thai, Hoang-Nguyen Vu, Anh-Minh Phan, Quang-Thinh Ly, Thi-Ngoc-Truc Nguyen, Nhat Ho

Affiliations * University of Science, VNU-HCM; Trivita AI; University of Technology, VNU-HCM; Michigan State University, USA; The University of Texas at Austin

AI summary SAGE uses a dynamic expert routing framework to adapt heterogeneous visual networks to variation in cell morphology, achieving high segmentation accuracy and robust generalization.

Comments Accepted to CVPR 2026 (Findings Track). Project Page: https://oxyzgiahuy.github.io/sage/

详情
AI中文摘要

细胞大小和形状的显著差异仍然是计算机辅助癌症检测在吉像素全滑片图像中的主要障碍,由于细胞异质性。当前的CNN-Transformer混合模型使用静态计算图和固定路由,导致额外计算并难以适应输入变化。我们提出形状自适应门控专家(SAGE),一种输入自适应框架,通过双路径设计和层次门控以及形状适应枢纽(SA-Hub)将静态骨干网络重新配置为动态路由专家架构。SAGE以ConvNeXt和Vision Transformer UNet(SAGE-ConvNeXt+ViT-UNet)实现,其在EBHI上达到95.23%的Dice分数,在GlaS Test A和Test B上分别达到92.78%和91.42%的DSC分数,并在DigestPath上达到91.26%的DSC分数,同时在分布偏移下表现出稳健的泛化能力,通过自适应平衡局部细化和全局上下文。SAGE建立了可扩展的动态专家路由基础,从而促进灵活的视觉推理。项目页面:https://oxyzgiahuy.github.io/sage/

英文摘要

The significant variability in cell size and shape continues to pose a major obstacle in computer-assisted cancer detection on gigapixel Whole Slide Images (WSIs), due to cellular heterogeneity. Current CNN-Transformer hybrids use static computation graphs with fixed routing. This leads to extra computation and makes it harder to adapt to changes in input. We propose Shape-Adapting Gated Experts (SAGE), an input-adaptive framework that enables dynamic expert routing in heterogeneous visual networks. SAGE reconfigures static backbones into dynamically routed expert architectures via a dual-path design with hierarchical gating and a Shape-Adapting Hub (SA-Hub) that harmonizes feature representations across convolutional and transformer modules. Embodied as SAGE with ConvNeXt and Vision Transformer UNet (SAGE-ConvNeXt+ViT-UNet), our model achieves a Dice score of 95.23% on EBHI, DSC scores of 92.78% and 91.42% on GlaS Test A and Test B, respectively, and 91.26% DSC at the WSI level on DigestPath, while exhibiting robust generalization under distribution shifts by adaptively balancing local refinement and global context. SAGE establishes a scalable foundation for dynamic expert routing in visual networks, thereby facilitating flexible visual reasoning. Project page: https://oxyzgiahuy.github.io/sage/

2512.08499 2026-06-09 cs.LG cs.AI 版本更新

Developing Distance-Aware Physics-Constrained Probabilistic Frameworks for Industrial Prognostics

面向工业预测的具有距离感知的物理约束概率框架开发

Waleed Razzaq, Yun-Bo Zhao

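Affiliations * University of Science and Technology of China

AI summary Proposes two sampling-free, distance-aware, physics-constrained probabilistic frameworks, PC-SNGP and PC-SNER, which use spectral normalization and a dynamic weighting strategy to balance data fidelity against physical consistency, improving accuracy and uncertainty calibration in bearing prognostics.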

详情
AI中文摘要

可靠且物理可解释的工业预测概率框架的发展仍处于初期阶段,现有文献在输入远离训练流形时往往不敏感。本文开发了两种无需采样的、具有距离感知的物理约束概率框架:(i) PC-SNGP 和 (ii) PC-SNER。两者均对隐藏层权重应用谱归一化,强制从输入到潜在空间的bi-Lipschitz距离保持表示。PC-SNGP将密集输出替换为高斯过程,其后验方差随输入与训练流形的距离增加而增大。PC-SNER修改输出层以预测Normal-Inverse-Gamma (NIG)参数,用于距离保持估计。为在训练过程中保持数据保真度与物理一致性之间的平衡,我们引入了物理约束损失的动态加权策略。我们还引入了一个距离感知系数 (DAC) 指标来量化对分布偏移的敏感性。实验上,我们使用PRONOSTIA、XJTU-SY和HUST基准数据集在滚动轴承 (REBs) 预测上验证了两种框架。实验结果表明,与竞争基线相比,预测精度提高,不确定性估计校准良好,同时在交叉验证中保持可审计性能,并在极端对抗扰动下具有鲁棒性。

英文摘要

The development of reliable and physically interpretable probabilistic frameworks for industrial prognostics remains nascent, and existing approaches are often insensitive as inputs move away from the training manifold. In this paper, we develop two sampling-free, distance-aware, physics-constrained probabilistic frameworks: (i) PC-SNGP and (ii) PC-SNER. Both apply spectral normalization to hidden-layer weights, enforcing a bi-Lipschitz, distance-preserving representation from the input to the latent space. PC-SNGP replaces the dense output layer with a Gaussian process whose posterior variance increases with input distance from the training manifold. PC-SNER modifies the output layer to predict Normal-Inverse-Gamma (NIG) parameters for distance-preserving estimation. To maintain a balance between data fidelity and physical consistency during training, we introduce a dynamic weighting strategy for the physics-constrained loss. We also introduce a distance-aware coefficient (DAC) metric to quantify sensitivity to distributional shifts. Empirically, we validate both frameworks on rolling-element bearing (REB) prognostics using the PRONOSTIA, XJTU-SY, and HUST benchmark datasets. Experimental results demonstrate improved prediction accuracy and well-calibrated uncertainty estimates relative to competing baselines, while maintaining auditable performance in cross-validation and robustness under extreme adversarial perturbations.
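
The abstract does not state how PC-SNER turns NIG parameters into uncertainty estimates. As a hedged illustration, the sketch below uses the parameterization common in deep evidential regression, which is an assumption here: with parameters (γ, ν, α, β), the prediction is γ, the expected data noise is β/(α−1), and the variance of the predicted mean is β/(ν(α−1)). All names are illustrative:

```python
from typing import NamedTuple, Tuple


class NIGParams(NamedTuple):
    gamma: float  # predicted mean
    nu: float     # strength of evidence for the mean (> 0)
    alpha: float  # shape of the inverse-gamma on the variance (> 1)
    beta: float   # scale of the inverse-gamma on the variance (> 0)


def nig_uncertainty(p: NIGParams) -> Tuple[float, float, float]:
    """Return (prediction, aleatoric variance, epistemic variance)."""
    if p.nu <= 0 or p.alpha <= 1 or p.beta <= 0:
        raise ValueError("requires nu > 0, alpha > 1, beta > 0")
    aleatoric = p.beta / (p.alpha - 1.0)  # E[sigma^2]
    epistemic = aleatoric / p.nu          # Var[mu]
    return p.gamma, aleatoric, epistemic


# Same noise level, but less evidence (smaller nu) for the second input.
well_supported = nig_uncertainty(NIGParams(gamma=0.8, nu=2.0, alpha=3.0, beta=1.0))
weakly_supported = nig_uncertainty(NIGParams(gamma=0.8, nu=0.5, alpha=3.0, beta=1.0))
```

Under this parameterization, an input far from the training data would ideally receive a small ν and therefore a large epistemic variance, which is the distance-aware behaviour the abstract describes.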

2601.11541 2026-06-09 cs.HC cs.AI cs.CY 版本更新

A Comparative Study of Student Perspectives on Technical Writing Feedback Quality: Evaluating LLMs, SLMs, and Humans in Computer Science Topics

学生视角下技术写作反馈质量比较研究:评估计算机科学主题中的LLM、SLM和人类

Suqing Liu, Runlong Ye, Christopher Eaton, Bogdan Simion, Michael Liut

Affiliations * McMaster University; Department of Computer Science, University of Toronto; Research Institute for the Study of University Pedagogy, University of Toronto Mississauga; Department of Mathematical and Computational Sciences, University of Toronto Mississauga

AI summary Compares the quality of writing feedback from a locally deployed small language model (SLM), a commercial large language model (LLM), and human instructors in computer science courses; students rated the SLM higher on readability and actionability in technical courses, while human feedback was preferred for highly specialized writing tasks.

Comments accepted at AIED 26

详情
AI中文摘要

为了解决计算机科学中反馈的可扩展性问题,同时减轻商业大语言模型(LLM)的隐私和成本限制,本研究评估了一个本地托管的小语言模型(SLM)。我们在入门编程(N=176)、操作系统(N=80)和写作研讨会(N=7)中部署了量化后的Llama-3.1、GPT-4和人类导师。对学生感知的混合方法分析显示,虽然本地SLM与商业LLM相当,并且在技术课程中学生在可读性和可操作性方面给予其更高评价,但人类反馈在高度专业化的写作任务中仍然更受青睐。我们证明,本地SLM为基础反馈提供了一种保护隐私、零边际成本的替代方案,支持分层教学框架,其中AI处理结构指导,而教师专注于高层次的脚手架概念。

英文摘要

To address the scalability of feedback in computer science while mitigating the privacy and cost limitations of commercial Large Language Models (LLMs), this study evaluates a locally hosted Small Language Model (SLM). We deployed a quantized Llama-3.1, GPT-4, and human instructors across introductory programming (N=176), operating systems (N=80), and a writing seminar (N=7). Mixed-methods analysis of student perceptions reveals that while the local SLM matched commercial LLMs and was rated higher by students for readability and actionability in technical courses, human feedback remained more favoured for highly specialized writing tasks. We demonstrate that local SLMs offer a privacy-preserving, zero-marginal-cost alternative for foundational feedback, supporting a tiered pedagogical framework where AI handles structural guidance while instructors focus on high-level conceptual scaffolding.

2601.15408 2026-06-09 cs.CV cs.AI cs.CL cs.LG 版本更新

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

CURE:基于课程引导的多任务训练实现可靠的解剖学接地报告生成

Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem

发表机构 * Pontificia Universidad Católica de Chile(智利天主教大学) CENIA iHEALTH KAUST(科威特皇家科学与技术局)

AI总结 提出CURE框架,通过课程学习动态调整多任务训练,提升医学报告生成的视觉接地准确性和事实一致性,无需额外数据。

Comments 31 pages, 7 figures, accepted to CVPR 2026 (oral)

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 36279-36289
AI中文摘要

医学视觉语言模型可以自动生成放射学报告,但在精确的视觉接地和事实一致性方面存在困难。现有模型常常将文本发现与视觉证据错误对齐,导致不可靠或弱接地的预测。我们提出CURE,一个错误感知的课程学习框架,无需任何额外数据即可改善接地和报告质量。CURE在短语接地、接地报告生成和解剖学接地报告生成上,使用公共数据集微调多模态指令模型。该方法基于模型性能动态调整采样,强调困难样本以改善空间和文本对齐。CURE将接地准确率提高了+0.35 IoU,报告质量提高了+0.192 CXRFEScore,并将幻觉减少了18.6%。CURE是一个数据高效的框架,增强了接地准确性和报告可靠性。代码可从此https URL获取,模型权重可从此https URL获取。

英文摘要

Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.35 IoU, boosts report quality by +0.192 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
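
Grounding accuracy above is reported as IoU (intersection over union). As a minimal sketch, assuming groundings are axis-aligned bounding boxes given as (x_min, y_min, x_max, y_max), which the abstract does not specify, IoU can be computed as follows; the function name and example boxes are illustrative:

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return 0.0 if union <= 0 else intersection / union


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
predicted = (0.0, 0.0, 2.0, 2.0)
reference = (1.0, 1.0, 3.0, 3.0)
iou = box_iou(predicted, reference)
```

IoU ranges from 0 (no overlap) to 1 (identical boxes), so a gain of +0.35 is large on this scale.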

2601.20408 2026-06-09 cs.DC cs.AI 版本更新

Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT

满足SLO,节省时间:使用OptiKIT实现企业级LLM自动化优化

Nicholas Santavas, Kareem Eissa, Patrycja Cieplicka, Piotr Florek, Matteo Nulli, Stefan Vasilev, Seyyed Hadi Hashemi, Antonios Gasteratos, Shahram Khadivi

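Affiliations * Anonymous Authors

AI summary Proposes OptiKIT, a distributed LLM optimization framework that automates complex optimization workflows for non-expert teams, providing dynamic resource allocation and staged pipeline execution, delivering more than 2x GPU throughput improvement and lowering the barrier to optimization.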

Comments Accepted in MLSys 2026

详情
AI中文摘要

企业级LLM部署面临关键的可扩展性挑战:组织必须在有限的计算预算内系统性地优化模型以扩展AI计划,然而手动优化所需的专业知识仍然稀缺。这一挑战在管理异构基础设施上的GPU利用率,同时使具有不同工作负载且LLM优化经验有限的团队能够高效部署模型时尤为明显。我们提出了OPTIKIT,一个分布式LLM优化框架,通过自动化非专家团队的复杂优化工作流程,使模型压缩和调优民主化。OPTIKIT提供动态资源分配、带自动清理的分阶段流水线执行以及无缝的企业集成。在生产中,它实现了超过2倍的GPU吞吐量提升,同时使应用团队无需深厚的LLM优化专业知识即可获得一致的性能改进。我们分享了平台设计以及资源管理、流水线编排和集成模式的关键工程见解,这些实现了大规模、生产级模型优化的民主化。最后,我们开源该系统以促进外部贡献和更广泛的可重复性。

英文摘要

Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expertise required for manual optimization remains a niche and scarce skillset. This challenge is particularly evident in managing GPU utilization across heterogeneous infrastructure while enabling teams with diverse workloads and limited LLM optimization experience to deploy models efficiently. We present OPTIKIT, a distributed LLM optimization framework that democratizes model compression and tuning by automating complex optimization workflows for non-expert teams. OPTIKIT provides dynamic resource allocation, staged pipeline execution with automatic cleanup, and seamless enterprise integration. In production, it delivers more than 2x GPU throughput improvement while empowering application teams to achieve consistent performance improvements without deep LLM optimization expertise. We share both the platform design and key engineering insights into resource management, pipeline orchestration, and integration patterns that enable large-scale, production-grade democratization of model optimization. Finally, we open-source the system to enable external contributions and broader reproducibility.

2601.20503 2026-06-09 cs.CV cs.AI 版本更新

Comparative evaluation of training strategies using partially labelled datasets for segmentation of white matter hyperintensities and stroke lesions in FLAIR MRI

使用部分标注数据集训练策略的比较评估:FLAIR MRI中白质高信号和卒中病变分割

Jesse Phitidis, Alison Q. Smithard, William N. Whiteley, Joanna M. Wardlaw, Miguel O. Bernabeu, Maria Valdés Hernández

发表机构 * University of Edinburgh(爱丁堡大学)

AI总结 本研究系统评估了六种利用部分标注数据训练联合分割白质高信号和缺血性卒中病变模型的策略,发现伪标签法最有效,可提升模型性能并支持大规模临床研究。

详情
AI中文摘要

白质高信号(WMH)和缺血性卒中病变(ISL)是脑小血管疾病(SVD)的关键影像生物标志物,可在磁共振成像(MRI)上检测到。开发稳健的深度学习模型来自动分割和区分这些病理仍然具有挑战性。具体而言,WMH和ISL常在同一受试者中共存,并在液体衰减反转恢复(FLAIR)序列上表现为视觉上混淆的高信号,使其精确勾画复杂化。为了解决完全标注队列稀缺的问题,我们系统评估了六种使用部分标注数据训练联合WMH和ISL分割模型的可行策略。我们汇集了私有和公开数据集,构建了一个包含2,052个MRI体积的大规模队列,其中分别有1,341和1,152个体积包含WMH和ISL的真实标注。我们的分析表明,多种策略有效利用部分标注数据提升整体模型性能,其中伪标签法是最有效的方法。该模型表现出一致的WMH分割策略,并成功检测到大多数FLAIR阳性的ISL。这些发现证明了使用部分标注数据开发可靠自动分割工具的可行性,可支持持续的SVD监测和大规模临床研究中的高通量生物标志物提取。

英文摘要

White matter hyperintensities (WMH) and ischaemic stroke lesions (ISL) are key imaging biomarkers of cerebral small vessel disease (SVD) detectable on magnetic resonance imaging (MRI). The development of robust deep learning models to automatically segment and differentiate these pathologies remains challenging. Specifically, WMH and ISL frequently co-occur within the same subject and present as visually confounding hyperintensities on fluid-attenuated inversion recovery (FLAIR) sequences, complicating their accurate delineation. To address the scarcity of fully annotated cohorts, we systematically evaluated six accessible strategies for training a joint WMH and ISL segmentation model using partially labelled data. We aggregated privately held and publicly available datasets to curate a large-scale cohort of 2,052 MRI volumes, of which 1,341 and 1,152 volumes contained ground truth annotations for WMH and ISL, respectively. Our analysis indicates that multiple strategies effectively leverage partially labelled data to enhance overall model performance, with pseudolabelling emerging as the most effective approach. This model exhibited a consistent WMH segmentation policy and successfully detected the majority of FLAIR-positive ISL. These findings demonstrate the viability of using partially labelled data to develop reliable automated segmentation tools, which can support ongoing SVD monitoring and high-throughput biomarker extraction for large-scale clinical research.

2602.10016 2026-06-09 cs.IR cs.AI 版本更新

Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design

Kunlun: 通过统一架构设计建立大规模推荐系统的缩放定律

Bojian Hou, Xiaolong Liu, Xiaoyi Liu, Jiaqi Xu, Yasmine Badr, Mengyue Hang, Sudhanshu Chanpuriya, Junqing Zhou, Yuhang Yang, Han Xu, Qiuling Suo, Laming Chen, Yuxi Hu, Jiasheng Zhang, Huaqing Xiong, Yuzhen Huang, Chao Chen, Yue Dong, Yi Yang, Shuo Chang, Xiaorui Gan, Wenlin Chen, Santanu Kolay, Darren Liu, Jade Nie, Chunzhi Yang, Ellie Wen, Jiyan Yang, Huayu Li

发表机构 * Meta Platforms, Inc.(Meta平台公司) OpenAI

AI总结 针对大规模推荐系统缺乏可预测缩放定律的问题,提出Kunlun架构,通过低层优化(GDPA、HSP、滑动窗口注意力)和高层创新(CompSkip、事件级个性化)提升模型效率,MFU从17%提升至37%,缩放效率翻倍,已在Meta广告模型部署。

Comments 10 pages, 4 figures

详情
AI中文摘要

推导可预测的缩放定律,即模型性能与计算投入之间的关系,对于大规模推荐系统的设计和资源分配至关重要。虽然这类定律已在大型语言模型中建立,但在推荐系统中仍具挑战,尤其是处理用户历史记录和上下文特征的系统。我们识别出低缩放效率是可预测幂律缩放的主要障碍,源于低模型FLOPs利用率(MFU)的模块和次优的资源分配。我们引入Kunlun,一种可扩展的架构,系统性地提升模型效率和资源分配。我们的低层优化包括广义点积注意力(GDPA)、分层种子池化(HSP)和滑动窗口注意力。高层创新包括计算跳过(CompSkip)和事件级个性化。这些进步在NVIDIA B200 GPU上将MFU从17%提升至37%,并将缩放效率相比最先进方法提升一倍。Kunlun现已部署在主要的Meta广告模型中,产生显著的生产影响。

英文摘要

Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for designing and allocating resources in massive-scale recommendation systems. While such laws are established for large language models, they remain challenging for recommendation systems, especially those processing both user history and context features. We identify poor scaling efficiency as the main barrier to predictable power-law scaling, stemming from inefficient modules with low Model FLOPs Utilization (MFU) and suboptimal resource allocation. We introduce Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation. Our low-level optimizations include Generalized Dot-Product Attention (GDPA), Hierarchical Seed Pooling (HSP), and Sliding Window Attention. Our high-level innovations feature Computation Skip (CompSkip) and Event-level Personalization. These advances increase MFU from 17% to 37% on NVIDIA B200 GPUs and double scaling efficiency over state-of-the-art methods. Kunlun is now deployed in major Meta Ads models, delivering significant production impact.
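To make the MFU figures above concrete, the following minimal sketch computes Model FLOPs Utilization as achieved model FLOP/s divided by peak hardware FLOP/s. All numbers are hypothetical placeholders chosen only to reproduce the 17% and 37% ratios; they are not taken from the paper and the peak value is not a B200 specification.

```python
def mfu(model_flops_per_step: float, step_time_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s over peak hardware FLOP/s."""
    return model_flops_per_step / step_time_s / peak_flops_per_s


# Hypothetical values (assumptions for illustration only).
PEAK_FLOPS_PER_S = 1.0e15      # placeholder accelerator peak
FLOPS_PER_STEP = 3.4e14        # placeholder model FLOPs per training step
BASELINE_STEP_TIME_S = 2.0     # placeholder baseline step time

baseline_mfu = mfu(FLOPS_PER_STEP, BASELINE_STEP_TIME_S, PEAK_FLOPS_PER_S)

# Step time the same workload would need at 37% MFU on the same hardware.
improved_step_time_s = FLOPS_PER_STEP / (0.37 * PEAK_FLOPS_PER_S)
improved_mfu = mfu(FLOPS_PER_STEP, improved_step_time_s, PEAK_FLOPS_PER_S)

# At fixed FLOPs per step, the speedup equals the ratio of the two MFU values.
speedup = BASELINE_STEP_TIME_S / improved_step_time_s
print(f"baseline MFU={baseline_mfu:.2f}, improved MFU={improved_mfu:.2f}, speedup={speedup:.2f}x")
```

Under the assumption that the model FLOPs per step stay fixed, moving from 17% to 37% MFU corresponds to a step roughly 37/17 times faster; the abstract itself does not state step times.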

2602.10172 2026-06-09 astro-ph.IM cs.AI 版本更新

Cosmo3DFlow: Wavelet Flow Matching for Spatial-to-Spectral Compression in Reconstructing the Early Universe

Cosmo3DFlow:用于重建早期宇宙的空间到谱域压缩的小波流匹配

Md. Khairul Islam, Zeyu Xia, Ryan Goudjil, Jialu Wang, Arya Farahi, Judy Fox

发表机构 * Department of Computer Science, University of Virginia(弗吉尼亚大学计算机科学系) Department of Statistics and Data Sciences, The University of Texas at Austin(德克萨斯大学奥斯汀分校统计与数据科学系) School of Data Science(数据科学学院)

AI总结 提出Cosmo3DFlow框架,结合3D离散小波变换与流匹配,通过空间到谱域的压缩解决高维宇宙结构重建中的维度和稀疏性瓶颈,采样速度最高比扩散模型快46倍。

详情
AI中文摘要

从演化后的现今宇宙重建早期宇宙是现代天体物理学中一个具有挑战性且计算密集的问题。我们设计了一种新颖的生成框架Cosmo3DFlow,旨在解决维度和稀疏性这两个当前最先进的宇宙学推断方法中固有的关键瓶颈。通过将3D离散小波变换(DWT)与流匹配相结合,我们有效地表示了高维宇宙学结构。小波变换通过将空间上的空旷转化为谱域稀疏性来解决“空洞问题”。它将高频细节与低频结构解耦,并且小波空间中的速度场使常微分方程(ODE)求解器能够以大步长稳定求解。使用$128^3$分辨率的大规模宇宙学$N$体模拟,我们实现了比扩散模型最高快$46\times$的采样速度。我们的结果使初始条件可以在几秒内完成采样,而以前的方法需要几分钟。

英文摘要

Reconstructing the early universe from the evolved present-day universe is a challenging and computationally demanding problem in modern astrophysics. We devise a novel generative framework, Cosmo3DFlow, designed to address dimensionality and sparsity, the critical bottlenecks inherent in current state-of-the-art methods for cosmological inference. By integrating 3D Discrete Wavelet Transform (DWT) with flow matching, we effectively represent high-dimensional cosmological structures. The Wavelet Transform addresses the "void problem" by translating spatial emptiness into spectral sparsity. It decouples high-frequency details from low-frequency structures, and wavelet-space velocity fields facilitate stable ordinary differential equation (ODE) solvers with large step sizes. Using large-scale cosmological $N$-body simulations at $128^3$ resolution, we achieve up to $46\times$ faster sampling than diffusion models. Our results enable initial conditions to be sampled in seconds, compared to minutes for previous methods.
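The idea that a wavelet transform turns flat, empty regions into sparse coefficients can be illustrated with a minimal one-level Haar transform on a 1D density profile. This is only a sketch under stated assumptions: the abstract does not name the wavelet family, so Haar is an assumption, the `density` values are invented, and a 3D transform would apply the same step separably along each axis.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar DWT on an even-length 1D signal."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_step, interleaving the reconstructed even and odd samples."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out


# Invented profile: a flat, near-empty "void" followed by a localized structure.
density = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 4.0, 4.2]
approx, detail = haar_step(density)
reconstructed = haar_inverse(approx, detail)

# Detail (high-frequency) coefficients vanish over the flat region and are
# non-zero only where the structure changes.
print("detail coefficients:", detail)
```

Here the high-frequency band carries a single non-zero coefficient while the low-frequency band keeps the coarse structure, which is the decoupling the abstract describes.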

2602.10234 2026-06-09 physics.soc-ph cs.AI cs.RO 版本更新

Transforming Police-Car Swerving for Mitigating Isolated Stop-and-Go Traffic Waves: A Practice-Oriented Jam-Absorption Driving Strategy

将警车变道行为转化为缓解孤立走走停停交通波的实际拥堵吸收驾驶策略

Zhengbing He

发表机构 * Faculty of Science and Engineering, University of Nottingham Ningbo China(宁波诺丁汉大学理工学院)

AI总结 本文提出一种基于警车变道行为启发的实际拥堵吸收驾驶(JAD)策略,通过定义JAD三角形,利用单车辆双探测器实现孤立走走停停波的抑制,并系统分析五个关键参数,仿真验证其有效性。

详情
AI中文摘要

走走停停交通波是高速公路拥堵的主要形式,对交通效率、安全风险和车辆排放造成严重且持续的负面影响。在各种高速公路交通管理策略中,拥堵吸收驾驶(JAD)——由专用车辆在被走走停停波捕获前执行“慢进快出”操作——已被提出作为抑制此类波传播的一种有前景的方法。然而,现有大多数JAD策略仍不实用,主要原因是缺乏对实施车辆和运行条件的考虑。受真实世界中警车变道行为的启发,本文首先引入单车辆双探测器拥堵吸收驾驶(SD-JAD)问题,然后基于JAD三角形的定义提出一种实用的JAD策略,将这种变道行为转化为能够抑制孤立走走停停波传播的交通控制策略。识别并系统分析了五个显著影响所提策略的关键参数,即JAD速度、流入交通速度、波宽、波速和波内速度。通过基于SUMO的仿真示例,进一步展示了如何仅使用两个固定路侧交通探测器在实际中测量这些参数。结果表明,所提出的JAD策略成功抑制了走走停停波的传播,且未引发二次波。本文有望推动JAD的实际实施迈出重要一步,将其从理论概念推进为可行且可部署的交通管理策略。

英文摘要

Stop-and-go traffic waves, a major form of freeway congestion, impose severe and persistent adverse impacts, including reduced traffic efficiency, increased safety risks, and elevated vehicle emissions. Among various freeway traffic management strategies, jam-absorption driving (JAD), in which a dedicated vehicle performs "slow-in" and "fast-out" maneuvers before being captured by a stop-and-go wave, has been proposed as a promising approach to suppressing the propagation of such waves. However, most existing JAD strategies remain impractical, primarily due to the lack of consideration of implementation vehicles and operational conditions. Inspired by real-world observations of police-car swerving behavior, this paper first introduces the Single-Vehicle Double-Detector Jam-Absorption Driving (SD-JAD) problem and then proposes a practical JAD strategy based on a definition of the JAD Triangle, transforming such behavior into a traffic control strategy capable of suppressing the propagation of an isolated stop-and-go wave. Five key parameters that significantly affect the proposed strategy, namely JAD speed, inflow traffic speed, wave width, wave speed, and in-wave speed, are identified and systematically analyzed. Using a SUMO-based simulation as an illustrative example, we further demonstrate how these parameters can be measured in practice using only two stationary roadside traffic detectors. The results show that the proposed JAD strategy successfully suppresses the propagation of a stop-and-go wave without triggering secondary waves. This paper is expected to take a significant step toward the practical implementation of JAD, advancing it from a theoretical concept to a feasible and deployable traffic management strategy.

2602.23234 2026-06-09 cs.IR cs.AI cs.LG 版本更新

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

扩展搜索相关性:用LLM生成的判断增强应用商店排名

Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga, Sandip Gaikwad, Sean Suchter, Venkat Sundaranatha

发表机构 * Apple(苹果公司)

AI总结 针对应用商店排名中专家文本相关性标签稀缺的问题,通过微调LLM生成数百万标签,结合行为相关性优化排序器,显著提升Pareto前沿和转化率。

详情
AI中文摘要

大规模商业搜索系统优化相关性以驱动成功的会话,帮助用户找到他们想要的内容。为了最大化相关性,我们利用两个互补的目标:行为相关性(用户倾向于点击或下载的结果)和文本相关性(结果与查询的语义匹配)。一个持续的挑战是,相对于丰富的行为相关性标签,专家提供的文本相关性标签稀缺。我们首先通过系统评估LLM配置来解决这个问题,发现一个专门的、微调的模型在提供高度相关的标签方面显著优于一个更大的预训练模型。使用这个最优模型作为力量倍增器,我们生成了数百万个文本相关性标签以克服数据稀缺性。我们展示了用这些文本相关性标签增强我们的生产排序器会导致Pareto前沿显著外移:离线NDCG在行为相关性上改善,同时在文本相关性上也提高。这些离线收益通过在全球应用商店排序器上的A/B测试得到验证,该测试显示转化率统计上显著提高了+0.24%,其中最大的性能提升出现在尾部查询中,新的文本相关性标签在缺乏可靠行为相关性标签时提供了稳健的信号。

英文摘要

Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result's semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textual relevance. These offline gains were validated by a worldwide A/B test on the App Store ranker, which demonstrated a statistically significant +0.24% increase in conversion rate, with the most substantial performance gains occurring in tail queries, where the new textual relevance labels provide a robust signal in the absence of reliable behavioral relevance labels.

2603.04177 2026-06-09 cs.SE cs.AI cs.LG 版本更新

CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

CodeTaste:LLM能否生成人类级别的代码重构?

Alex Thillen, Niels Mündler, Veselin Raychev, Martin Vechev

发表机构 * University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院)

AI总结 研究LLM代理在代码重构中的能力,通过CodeTaste基准测试发现,代理在详细指定重构时表现良好,但难以自主发现人类选择的重构,提出“先提议后实现”分解可改善对齐。

详情
AI中文摘要

LLM编码代理可以生成可工作的代码,但它们的解决方案往往积累复杂性、重复和架构债务。人类开发者通过重构来解决这些问题:行为保持的程序转换,改善结构和可维护性。我们研究代理是否(i)能够可靠地执行重构,以及(ii)识别人类开发者在实际代码库中实际选择的重构。为此,我们构建了CodeTaste,一个从大型多文件开源重构中挖掘的基准测试。为了评分解决方案,我们结合了测量功能正确性的仓库测试套件和定制的静态检查,这些检查使用数据流推理验证不期望模式的移除和期望模式的引入。我们的结果显示了一个明显的差距:代理在实现详细指定的重构时表现良好,但当给定变更的关注区域时,往往无法发现人类的重构选择。先提议后实现的分解改善了对齐,而在实现之前选择最佳对齐的提议可以带来进一步的收益。CodeTaste为在现实代码库中将编码代理与人类重构决策对齐提供了评估目标和潜在的偏好信号。我们发布了基准测试、排行榜和代码。

英文摘要

LLM coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program transformations that improve structure and maintainability. We investigate whether agents (i) can execute refactorings reliably and (ii) identify the refactorings that human developers actually chose in real codebases. To this end, we construct CodeTaste, a benchmark mined from large multi-file open-source refactorings. To score solutions, we combine repository test suites that measure functional correctness with tailored static checks that verify removal of undesired and introduction of desired code patterns using dataflow reasoning. Our results show a clear gap: agents perform well at implementing refactorings that are specified in detail, but often fail to discover the human refactoring choices when given a focus area for changes. A propose-then-implement decomposition improves alignment, and selecting the best-aligned proposal before implementation can yield further gains. CodeTaste provides an evaluation target and a potential preference signal for aligning coding agents with human refactoring decisions in realistic codebases. We release the benchmark, leaderboard, and code.

2603.12666 2026-06-09 cs.LG cs.AI 版本更新

RetroReasoner: A Reasoning LLM for Strategic Retrosynthesis Prediction

RetroReasoner:一种用于策略性逆合成预测的推理LLM

Hanbum Ko, Chanhui Lee, Ye Rin Kim, Rodrigo Hormazabal, Sehui Han, Sungbin Lim, Sungwoong Kim

发表机构 * Department of Artificial Intelligence, Korea University(韩国大学人工智能系) Department of Statistics, Korea University(韩国大学统计系) Materials Intelligence Lab, LG AI Research(LG人工智能研究实验室)

AI总结 RetroReasoner通过监督微调和强化学习,捕捉化学家基于断键策略的推理过程,提升逆合成预测的准确性和多样性。

Comments 35 pages, 19 figures

详情
AI中文摘要

逆合成(retrosynthesis)预测旨在识别能够合成给定产物分子的反应物。尽管分子大语言模型(LLMs)最近展示了有前景的结果,但大多数现有方法要么直接生成反应物,要么仅提供通用的产物级分析,而没有明确推理能够为特定反应物选择提供依据的断键策略。本文提出了RetroReasoner,一种能够捕捉化学家基于断键策略的推理过程的逆合成推理模型。RetroReasoner通过监督微调和强化学习进行训练。在监督微调阶段,SyntheticRetro生成与反应物预测配对的结构化断键理由。在强化学习阶段,往返奖励将预测的反应物输入正向合成模型,并奖励能够重建原始产物的预测。RetroReasoner还可以整合到并行化的蒙特卡洛树搜索框架中,用于多步逆合成规划,在减少搜索时间的同时增加有效合成路径的数量和多样性。实验结果表明,RetroReasoner优于先前的基线,其中既包括分子LLMs,也包括逆合成专用的专家模型,并且能生成范围更广的可行反应物方案,在具有挑战性的反应实例上尤为明显。

英文摘要

Retrosynthesis prediction aims to identify reactants that can synthesize a given product molecule. Although molecular large language models (LLMs) have recently shown promising results, most existing methods either generate reactants directly or provide only generic product-level analysis, without explicitly reasoning about bond-disconnection strategies that justify specific reactant choices. This paper proposes RetroReasoner, a retrosynthetic reasoning model that captures chemists' strategic disconnection-based thinking. RetroReasoner is trained with supervised fine-tuning and reinforcement learning. For supervised fine-tuning, SyntheticRetro generates structured disconnection rationales paired with reactant predictions. For reinforcement learning, a round-trip reward evaluates predicted reactants by passing them through a forward synthesis model and rewarding predictions that reconstruct the original product. RetroReasoner can also be applied to multi-step retrosynthetic planning by incorporating it into a parallelized Monte Carlo tree search framework, reducing search time while increasing the number and diversity of valid synthetic pathways. Experimental results show that RetroReasoner outperforms prior baselines, including not only molecular LLMs but also retrosynthesis-specific expert models, and generates a broader range of feasible reactant proposals, especially for challenging reaction instances.
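The round-trip reward described above can be sketched as follows. Everything here is a hypothetical illustration: the binary 0/1 reward, the function names, and the lookup-table `toy_forward_model` are assumptions standing in for a learned forward synthesis model, and exact string comparison assumes both sides use the same canonical SMILES form.

```python
from typing import Callable, FrozenSet

# Toy stand-in for a forward synthesis model: maps a reactant set to a product.
# Single entry: acetic acid + ethanol -> ethyl acetate (esterification).
TOY_REACTIONS = {
    frozenset({"CC(=O)O", "CCO"}): "CCOC(C)=O",
}


def toy_forward_model(reactants: FrozenSet[str]) -> str:
    """Return the product for a known reactant set, or an empty string otherwise."""
    return TOY_REACTIONS.get(reactants, "")


def round_trip_reward(
    product: str,
    predicted_reactants: FrozenSet[str],
    forward_model: Callable[[FrozenSet[str]], str],
) -> float:
    """Reward 1.0 if the forward model rebuilds the original product, else 0.0."""
    return 1.0 if forward_model(predicted_reactants) == product else 0.0


target = "CCOC(C)=O"
good = round_trip_reward(target, frozenset({"CCO", "CC(=O)O"}), toy_forward_model)
bad = round_trip_reward(target, frozenset({"CCO", "CO"}), toy_forward_model)
print(good, bad)
```

A real setup would replace the lookup table with a trained forward model and canonicalize molecules before comparison; the abstract does not specify how the reward is scaled.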

2603.29875 2026-06-09 cs.IR cs.AI cs.CL 版本更新

UnWeaving the knots of GraphRAG -- turns out VectorRAG is almost enough

解开图式RAG的结——事实证明向量RAG几乎足够

Ryszard Tuora, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Mateusz Czyżnikiewicz, Adam Kozakiewicz, Tomasz Ziętkiewicz

发表机构 * Samsung AI Warsaw(三星AI华沙)

AI总结 本文提出UnWeaver框架,通过LLM解构文档内容为跨chunk的实体,提升检索和生成的准确性与效率,实验表明向量RAG在成本上优于图式RAG。

详情
AI中文摘要

检索增强生成(RAG)系统中的关键问题在于基于片段的检索流程将源片段视为原子对象,将其中信息混合成单一向量。这些向量被视为孤立、独立且自足,没有尝试表示它们之间的可能关系。此类方法缺乏处理多跳问题的专用机制。图式RAG系统通过将信息建模为知识图谱来缓解这一问题,实体由节点表示,通过稳健的关系连接并形成层次化社区。然而,这种方法自身也存在一些问题,包括为创建图式索引而使组件复杂性增加数个数量级,以及依赖启发式方法进行检索。我们提出UnWeaver,一种新颖的RAG框架,简化了图式RAG的理念。UnWeaver利用LLM将文档内容解构为可以在多个片段中出现的实体。在检索过程中,实体被用作恢复原始文本片段的中间方式,从而保持对源材料的忠实度。我们主张基于实体的分解能提供更浓缩的原始信息表示,同时还能减少索引和生成过程中的噪声。此外,我们通过实验表明,在端到端QA评估中,向量RAG的表现优于标准图式RAG,并且几乎与当前最先进的图式解决方案相当,而成本仅为后者的一小部分。

英文摘要

One of the key problems in retrieval-augmented generation (RAG) systems is that chunk-based retrieval pipelines represent the source chunks as atomic objects, mixing the information contained within each chunk into a single vector. These vector representations are then treated as isolated, independent, and self-sufficient, with no attempt to represent possible relations between them. Such an approach has no dedicated mechanism for handling multi-hop questions. Graph-based RAG systems aim to ameliorate this problem by modeling information as knowledge graphs, in which entities are represented by nodes, connected by robust relations, and grouped into hierarchical communities. This approach, however, suffers from its own issues, including an orders-of-magnitude increase in component complexity needed to create graph-based indices and a reliance on heuristics for retrieval. We propose UnWeaver, a novel RAG framework that simplifies the idea of GraphRAG. UnWeaver uses an LLM to disentangle the contents of documents into entities that can occur across multiple chunks. In the retrieval process, entities serve as an intermediate step for recovering the original text chunks, preserving fidelity to the source material. We argue that entity-based decomposition yields a more distilled representation of the original information and also reduces noise in the indexing and generation processes. Furthermore, we show experimentally that, on end-to-end QA evaluation, VectorRAG performs better than standard GraphRAG and almost as well as current SOTA graph-based solutions, for a fraction of the cost.

2604.08849 2026-06-09 cs.CL cs.AI cs.DB cs.MA cs.SC 版本更新

SatIR: Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching

SatIR:面向临床试验匹配的可扩展、高召回率、基于约束满足的信息检索

Cyrus Zhou, Yufei Jin, Yilin Xu, Yu-Chiang Wang, Chieh-Ju Chao, Monica S. Lam

发表机构 * Department of Computer Science, Stanford University(斯坦福大学计算机科学系) Samueli Electrical and Computer Engineering, UCLA(UCLA Samueli电气与计算机工程系) Department of Computer Science and Informatics, Emory University(埃默里大学计算机科学与信息学系) Mayo Clinic(梅奥诊所)

AI总结 SatIR通过将临床试验资格条件和摘要转化为形式约束,结合SMT、关系代数和大语言模型,提升了临床试验匹配的召回率和效率,优于基于相似度的基线方法。

详情
AI中文摘要

许多重要的检索问题不仅仅是语义相似性问题,而是约束满足问题:检索到的项目应与查询主题相关,并满足涉及否定、时间条件、数值阈值、例外、本体关系和不完整证据的显式要求。我们在临床试验匹配中研究这一挑战,这是一个高风险的测试平台,其中有用的试验必须既针对患者的医疗需求,又满足复杂的资格标准。我们提出了SatIR,一种用于临床试验匹配的可扩展的基于约束的检索方法。SatIR将试验资格标准和摘要转换为形式约束,然后通过在数据库上执行这些约束来检索患者-试验对。系统结合了可满足性模理论(SMT)、关系代数、医学本体对齐和大语言模型(LLMs):形式方法提供可执行且可检查的匹配,而LLMs将模糊、不完整和隐含的临床信息转换为显式、可控的约束表示。在SIGIR 2016患者-试验集合和源自TREC 2022的基准TREC-2022-RetrievalSubset上,SatIR在资格感知检索方面持续优于基于相似度的基线方法。与TrialGPT式检索相比,SatIR在SIGIR 2016上为每名患者多检索出32%至72%的相关且合格的试验,在TREC-2022-RetrievalSubset上的合格试验召回率高出1.8至3.2倍。检索速度快,在3,621个SIGIR试验上每名患者仅需146毫秒。

英文摘要

Many important retrieval problems are not merely problems of semantic similarity, but problems of constraint satisfaction: a retrieved item should be topically relevant to a query and satisfy explicit requirements involving negation, temporal conditions, numeric thresholds, exceptions, ontological relations, and incomplete evidence. We study this challenge in clinical trial matching, a high-stakes test bed where a useful trial must both address a patient's medical needs and satisfy complex eligibility criteria. We propose SatIR, a scalable constraint-based retrieval method for clinical trial matching. SatIR converts trial eligibility criteria and summaries into formal constraints, then retrieves patient--trial pairs by executing these constraints over a database. The system combines Satisfiability Modulo Theories (SMT), relational algebra, medical ontology grounding, and large language models (LLMs): formal methods provide executable and inspectable matching, while LLMs convert ambiguous, incomplete, and implicit clinical information into explicit, controllable constraint representations. Across the SIGIR 2016 patient--trial collection and TREC-2022-RetrievalSubset, a benchmark derived from TREC 2022, SatIR consistently improves eligibility-aware retrieval over similarity-based baselines. Relative to TrialGPT-style retrieval, SatIR retrieves 32%--72% more relevant-and-eligible trials per patient on SIGIR 2016 and achieves $1.8$--$3.2\times$ higher eligible-trial recall on TREC-2022-RetrievalSubset. Retrieval is fast, requiring only 146 milliseconds per patient over 3,621 SIGIR trials.
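The notion of executing eligibility constraints over a database can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions, not SatIR's implementation: the `patient` schema, the sample rows, and the trial criteria are all invented, and treating a missing value as "not excluded" is one possible recall-oriented policy for incomplete evidence rather than the policy the paper uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (id TEXT PRIMARY KEY, age INTEGER, hba1c REAL, pregnant INTEGER)"
)
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?, ?)",
    [
        ("p1", 54, 8.1, 0),     # satisfies every criterion
        ("p2", 17, 9.0, 0),     # fails the numeric age threshold
        ("p3", 61, 7.6, None),  # pregnancy status unknown (incomplete evidence)
        ("p4", 45, 6.2, 0),     # fails the HbA1c threshold
        ("p5", 33, 8.4, 1),     # excluded by the negated criterion
    ],
)

# Hypothetical trial: adults with HbA1c above 7.0 who are not pregnant.
# A NULL pregnancy status is kept as a candidate instead of being dropped.
TRIAL_CONSTRAINT = """
    age >= 18
    AND hba1c > 7.0
    AND (pregnant IS NULL OR pregnant = 0)
"""

eligible = [
    row[0]
    for row in conn.execute(f"SELECT id FROM patient WHERE {TRIAL_CONSTRAINT} ORDER BY id")
]
print(eligible)
```

Unlike a similarity score, the constraint is inspectable: each rejected patient can be traced to the specific clause that excluded them.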

2604.10842 2026-06-09 cs.SE cs.AI 版本更新

Resilient Write: A Six-Layer Durable Write Surface for LLM Coding Agents

Resilient Write:面向LLM编码代理的六层耐用写入表面

Justice Owusu Agyemang, Jerry John Kponyo, Elliot Amponsah, Godfred Manu Addo Boakye, Kwame Opuni-Boachie Obour Agyekum

发表机构 * Sperixlabs, Ghana(塞普里克斯实验室,加纳) Kwame Nkrumah University of Science and Technology, Kumasi, Ghana(夸梅·恩克鲁玛科技大学,库马西,加纳) VIA Cybersecurity Lab, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana(夸梅·恩克鲁玛科技大学VIA网络安全实验室,库马西,加纳)

AI总结 本文提出Resilient Write,通过六层耐用写入表面提升编码代理在文件写入时的容错能力,减少恢复时间并提高自我纠正率。

详情
AI中文摘要

LLM驱动的编码代理越来越多地依赖模型上下文协议(MCP)等工具使用协议来读写开发者工作站上的文件。当写入因内容过滤、截断或会话中断而失败时,代理通常得不到结构化的信号,会丢失草稿并在盲目重试中浪费令牌。我们提出了Resilient Write,一种MCP服务器,它在代理和文件系统之间插入一个六层耐用写入表面。这六层分别是预检风险评分、事务性原子写入、可安全恢复的分块、结构化类型错误、带外暂存存储以及任务连续性交接信封,它们相互正交且可独立采用。每一层都对应2026年4月一次真实代理会话中观察到的具体故障模式,在该会话中内容安全过滤器静默拒绝了一个包含已脱敏API密钥前缀的草稿。另外三个工具(分块预览、格式感知验证和日志分析)产生于使用该系统撰写本文的过程。一个包含186个测试的套件在每一层验证正确性,与朴素基线和防御性基线的定量比较显示,恢复时间缩短为原来的五分之一,代理自我纠正率提升至13倍。Resilient Write在MIT许可下开源。

英文摘要

LLM-powered coding agents increasingly rely on tool-use protocols such as the Model Context Protocol (MCP) to read and write files on a developer's workstation. When a write fails - due to content filters, truncation, or an interrupted session - the agent typically receives no structured signal, loses the draft, and wastes tokens retrying blindly. We present Resilient Write, an MCP server that interposes a six-layer durable write surface between the agent and the filesystem. The layers - pre-flight risk scoring, transactional atomic writes, resume-safe chunking, structured typed errors, out-of-band scratchpad storage, and task-continuity handoff envelopes - are orthogonal and independently adoptable. Each layer maps to a concrete failure mode observed during a real agent session in April 2026, in which content-safety filters silently rejected a draft containing redacted API-key prefixes. Three additional tools - chunk preview, format-aware validation, and journal analytics - emerged from using the system to compose this paper. A 186-test suite validates correctness at each layer, and quantitative comparison against naive and defensive baselines shows a 5x reduction in recovery time and a 13x improvement in agent self-correction rate. Resilient Write is open-source under the MIT license.
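Of the six layers listed above, transactional atomic writes are the easiest to illustrate in isolation. The sketch below uses the common write-to-temporary-file-then-rename pattern from the Python standard library; it is an assumption about how such a layer could work, not Resilient Write's actual code, and the name `atomic_write` is hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: str) -> None:
    """Write data so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)     # atomically swap in the new content
    except BaseException:
        # On any failure, remove the partial temporary file and re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process is interrupted before `os.replace`, the original file is untouched, which is the property that lets an agent retry without losing the previous draft.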

2604.23435 2026-06-09 cs.CV cs.AI cs.LG 版本更新

Knee-xRAI: An Explainable AI Framework for Automatic Kellgren-Lawrence Grading of Knee Osteoarthritis

Knee-xRAI:一种用于膝骨关节炎Kellgren-Lawrence自动分级的可解释AI框架

Azmul A. Irfan, Nur Ahmad Khatim, Alfan Alfian Irfan, Achmad Zaki, Erike A. Suwarsono, Mansur M. Arief

发表机构 * Orthopaedic Department, Faculty of Medicine UIN Syarif Hidayatullah Jakarta(雅加达沙里夫·希达亚图拉国立伊斯兰大学医学院骨科) Informatics Engineering, Institut Teknologi Sepuluh Nopember(泗水十一月十日理工学院信息工程系) Information Technology, Universitas Muhammadiyah Yogyakarta(日惹穆罕默迪亚大学信息技术系) Industrial and Systems Engineering, King Fahd University of Petroleum and Minerals(法赫德国王石油与矿产大学工业与系统工程系)

AI总结 本文提出Knee-xRAI框架,通过模拟临床放射学流程,结合JSN、骨赘和软骨下骨硬化等特征,利用XGBoost-SHAP和ConvNeXt模型实现可解释的KL分级,验证了其在膝骨关节炎诊断中的有效性。

Comments 8 pages, 5 figures

详情
AI中文摘要

在X线平片上对膝骨关节炎(KOA)进行分级时,不同阅片者之间的可重复性较差。Kellgren-Lawrence(KL)量表上一个等级的分歧就可能改变手术决策,或使患者从保守治疗转为关节内注射。同时,性能超越人类阅片者的深度学习模型往往无法解释其决策。我们提出了Knee-xRAI,一个通过模仿临床放射学工作流程来分解分级过程的流程。它独立测量关节间隙狭窄(JSN)、骨赘和软骨下骨硬化,然后将这些发现组合成可解释的KL分级。具体而言,U-Net++架构通过轮廓分割量化JSN,SE-ResNet-50多任务网络按解剖部位在OARSI量表上对骨赘分级,混合纹理-CNN对硬化进行二分类检测。该流程产生一个50维特征向量,分别由XGBoost-SHAP分类器(路径A,用于审计)和ConvNeXt混合预测器(路径B,用于部署)进行评估。在8,260张源自OAI的X线片上,JSN模块的Dice得分为0.8909,mJSW的ICC为0.8674。路径A的QWK为0.6294,AUC为0.8046,证实结构化特征向量携带了可观的诊断信号。路径B的QWK为0.8436,AUC为0.9017。SHAP分析表明JSN是主导特征,骨赘带来稳定的增量,硬化的贡献很小。移除JSN证据会使KL3-KL4的召回率大幅下降,而早期等级不受影响,这与KL诊断标准一致。Knee-xRAI将每个预测都建立在可审计的放射学测量发现链之上,在诊疗现场提供临床透明度。

英文摘要

Grading knee osteoarthritis (KOA) on plain radiographs is poorly reproducible across readers. A single-grade disagreement on the Kellgren-Lawrence (KL) scale can alter surgical management or redirect a patient from conservative therapy to intra-articular injection. Meanwhile, deep learning models that outperform human readers often offer no explanation for their decisions. We present Knee-xRAI, a pipeline that decomposes the grading process by mimicking clinical radiological workflows. It independently measures joint space narrowing (JSN), osteophytes, and subchondral sclerosis, then combines these findings into an explainable KL grade. Specifically, a U-Net++ architecture quantifies JSN via contour segmentation, an SE-ResNet-50 multi-task network grades osteophytes per anatomical site on the OARSI scale, and a hybrid texture-CNN detects binary sclerosis. This pipeline yields a 50-dimensional feature vector evaluated via an XGBoost-SHAP classifier (Path A, audit) and a ConvNeXt hybrid predictor (Path B, deployed). On 8,260 OAI-derived radiographs, the JSN module achieved a Dice score of 0.8909 and an mJSW ICC of 0.8674. Path A reached a QWK of 0.6294 and an AUC of 0.8046, confirming the structured feature vector carries substantial diagnostic signal. Path B achieved a QWK of 0.8436 and an AUC of 0.9017. SHAP analysis identifies JSN as the dominant feature, with osteophytes adding a consistent increment and sclerosis contributing marginally. Removing JSN evidence collapses KL3-KL4 recall while early grades remain intact, aligning with the KL diagnostic criteria. Knee-xRAI grounds every prediction in an auditable chain of measured radiographic findings, providing clinical transparency at the point of care.

2604.24199 2026-06-09 cs.SD cs.AI eess.AS eess.SP 版本更新

Speech Enhancement Based on Drifting Models

基于漂移模型的语音增强

Liang Xu, Diego Caviedes-Nozal, W. Bastiaan Kleijn, Longfei Felix Yan, Rasmus Kongsgaard Olsson

发表机构 * Victoria University of Wellington(维多利亚大学) Lincoln University(林肯大学) GN Advanced Science(GN先进科学)

AI总结 本文提出了一种基于漂移模型的语音增强框架DriftSE,通过将去噪问题建模为平衡问题,实现单步推理,从而在无需配对数据的情况下实现高质量语音增强。

Comments 6 pages, 2 figures

详情
AI中文摘要

我们提出了基于漂移模型的语音增强(DriftSE),这是一种新颖的生成框架,将去噪建模为一个平衡问题。DriftSE不依赖迭代采样,而是通过演化映射函数的前推分布,使其直接匹配干净语音分布,从而原生实现单步推理。这种演化由漂移场驱动,漂移场是一种学习到的修正向量,引导样本向干净分布的高密度区域移动;由于匹配的是分布而非配对样本,它自然支持在未配对数据上训练。我们研究了该框架的两种形式:从带噪观测出发的直接映射,以及从高斯先验出发的随机条件生成模型。在VoiceBank-DEMAND基准上的实验表明,DriftSE在单步内实现了高保真增强,优于多步扩散基线,并为语音增强建立了新范式。

英文摘要

We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.

2605.00327 2026-06-09 cs.IR cs.AI 版本更新

DynamicPO: Dynamic Preference Optimization for Recommendation

DynamicPO:面向推荐的动态偏好优化

Xingyu Hu, Kai Zhang, Jiancan Wu, Shuli Wang, Chi Wang, Wenshuai Chen, Yinhua Zhu, Haitao Wang, Xingxing Wang, Xiang Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Innovation Institute(上海创新研究院) Meituan(美团)

AI总结 本文提出DynamicPO框架,通过动态边界负样本选择和双边距动态beta调整,解决偏好优化崩溃问题,提升推荐准确性。

Comments DASFAA 2026 Best Paper

详情
AI中文摘要

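在基于大语言模型(LLM)的推荐系统中,直接偏好优化(DPO)能有效地使推荐与用户偏好对齐,这需要多负样本目标函数来利用丰富的隐式反馈负样本并锐化偏好边界。然而,我们的实证分析揭示了一个反直觉的现象,即偏好优化崩溃:尽管训练损失持续下降,增加负样本数量却可能导致性能下降。我们进一步从理论上证明,这种崩溃源于梯度抑制,其原因是易于区分的负样本压过了真正定义用户偏好边界的边界关键负样本。结果,与边界相关的信号优化不足,削弱了模型的决策边界。基于这些观察,我们提出了DynamicPO(Dynamic Preference Optimization),一个轻量级、即插即用的框架,包含两种自适应机制:动态边界负样本选择,用于识别并优先处理模型决策边界附近的信息性负样本;以及双边距动态beta调整,根据边界模糊程度为每个样本校准优化强度。在三个公开数据集上的大量实验表明,DynamicPO能有效防止优化崩溃,并提升多负样本偏好优化方法的推荐准确性,且计算开销可以忽略不计。我们的代码和数据集可在 https://github.com/xingyuHuxingyu/DynamicPO 获取。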

英文摘要

In large language model (LLM)-based recommendation systems, direct preference optimization (DPO) effectively aligns recommendations with user preferences, requiring multi-negative objective functions to leverage abundant implicit-feedback negatives and sharpen preference boundaries. However, our empirical analyses reveal a counterintuitive phenomenon, preference optimization collapse, where increasing the number of negative samples can lead to performance degradation despite a continuously decreasing training loss. We further theoretically demonstrate that this collapse arises from gradient suppression, caused by the dominance of easily discriminable negatives over boundary-critical negatives that truly define user preference boundaries. As a result, boundary-relevant signals are under-optimized, weakening the model's decision boundary. Motivated by these observations, we propose DynamicPO (Dynamic Preference Optimization), a lightweight and plug-and-play framework comprising two adaptive mechanisms: Dynamic Boundary Negative Selection, which identifies and prioritizes informative negatives near the model's decision boundary, and Dual-Margin Dynamic beta Adjustment, which calibrates optimization strength per sample according to boundary ambiguity. Extensive experiments on three public datasets show that DynamicPO effectively prevents optimization collapse and improves recommendation accuracy on multi-negative preference optimization methods, with negligible computational overhead. Our code and datasets are available at https://github.com/xingyuHuxingyu/DynamicPO.

2605.03395 2026-06-09 cs.SD cs.AI cs.LG cs.MM 版本更新

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

APEX:面向AI生成音乐的大规模多任务美学感知流行度预测

Jaavid Aktar Husain, Dorien Herremans

发表机构 * AMAAI Lab, Singapore University of Technology and Design(新加坡科技设计大学AMAAI实验室)

AI总结 提出APEX框架,利用MERT音频嵌入联合预测AI生成音乐的流行度指标与五维美学质量,在Music Arena数据集上验证了美学特征对偏好预测的泛化能力。

详情
AI中文摘要

音乐流行度预测因其对艺术家、平台和推荐系统的重要性而吸引了越来越多的研究兴趣。然而,AI生成音乐平台的爆炸式增长创造了一个全新且很大程度上未被探索的领域,每天都有大量歌曲被生产和消费,而没有传统的艺术家声誉或唱片公司支持。在这一探索中,美学质量是关键但尚未被研究的因素。我们提出了APEX,这是首个面向AI生成音乐的大规模多任务学习框架,在来自Suno和Udio的超过21.1万首歌曲(1万小时音频)上训练,该框架联合预测基于参与度的流行度信号——流媒体播放量和点赞分数——以及从MERT(一个自监督音乐理解模型)提取的冻结音频嵌入中的五个感知美学质量维度。美学质量和流行度捕捉了音乐的互补方面,两者结合被证明是有价值的:在Music Arena数据集上的分布外评估中,该数据集包含训练期间未见过的十一个生成音乐系统之间的成对人类偏好对决,引入美学特征持续改进了偏好预测,展示了所学表示在生成架构上的强大泛化能力。

英文摘要

Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. Key, yet unexplored in this pursuit is aesthetic quality. We propose APEX, the first large-scale multi-task learning framework for AI-generated music, trained on over 211k songs (10k hours of audio) from Suno and Udio, that jointly predicts engagement-based popularity signals - streams and likes scores - alongside five perceptual aesthetic quality dimensions from frozen audio embeddings extracted from MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.

2605.11314 2026-06-09 cs.CV cs.AI 版本更新

Quantifying Rodda and Graham Gait Classification from 3D Markerless Kinematics derived from a Single-view Video in a Heterogeneous Pediatric Clinical Cohort

从单视角视频中基于3D无标记运动学的罗达和格雷厄姆步态分类量化

Lauhitya Reddy, Seth Donahue, Jeremy Bauer, Susan Sienko, Anita Bagley, Joseph Krzak, Maura Eveld, Karen Kruger, Ross Chafetz, Vedant Kulkarni, Hyeokhyen Kwon

Affiliations * Department of Biomedical Informatics, Emory University; Shriners Children's; The Wallace H. Coulter Department of Biomedical Engineering, Emory University and Georgia Institute of Technology

AI summary: Presents a markerless gait analysis pipeline that estimates the knee and ankle z-scores used by the Rodda and Graham classification from single-view clinical video, aiming at scalable, objective gait assessment in low-resource clinical settings.

Comments 29 pages, 8 figures, 9 tables (including 1 supplementary table); manuscript prepared in PLOS ONE format

详情
AI中文摘要

脑瘫(CP)是一种运动神经障碍,是儿童中最常见的终身身体残疾原因。大约75%的脑瘫儿童能够行走,准确的步态评估对于保持行走功能至关重要,这种功能在四分之一到一半的脑瘫成人中在中年时会恶化。罗达和格雷厄姆分类系统利用来自3D仪器化步态分析(3D-IGA)的踝关节和膝关节z分数来量化矢状面步态偏差,但3D-IGA成本高且仅限于专业中心,而观察性评估仅显示中等的评分者间一致性。我们开发了一种无标记步态分析流程,可以直接从单视角临床步态视频中量化罗达和格雷厄姆膝踝z分数。在1,058个双侧肢体样本(来自152名儿童的529次试验,其中88名男性,63名女性,年龄12.1±4.0岁,60种不同的主要诊断,脑瘫最为常见,n=54)中,矢状面模型在膝关节z分数上达到R²=0.80±0.02和CCC=0.89±0.02,踝关节z分数上达到R²=0.57±0.02和CCC=0.72±0.02,与3D-IGA相比。二元筛查用于过量膝关节屈曲的AUROC=0.88,正确识别了83%的受影响儿童,应用罗达和格雷厄姆规则得到7类准确率为43±1%,宏AUROC=0.78±0.01,踝关节预测误差仍然是主要瓶颈。除了横断面筛查外,连续z分数支持跨访问的纵向轨迹跟踪,为监测疾病进展和治疗反应提供定量基础,这在观察性量表中是无法实现的。这些结果证明了基于视频的z分数估计、过量屈曲筛查和纵向轨迹跟踪在资源有限的临床环境中实现可扩展、客观步态评估的可行性。

英文摘要

Cerebral Palsy (CP) is a neurological disorder of movement and the most common cause of lifelong physical disability in childhood. Approximately 75% of children with CP are ambulatory, and accurate gait assessment is central to preserving walking function, which deteriorates by mid-adulthood in a quarter to half of adults with CP. The Rodda and Graham classification system quantifies sagittal-plane gait deviations using ankle and knee z-scores derived from 3D Instrumented Gait Analysis (3D-IGA), but 3D-IGA is expensive and limited to specialized centers, while observational assessment shows only moderate inter-rater agreement. We developed a markerless gait analysis pipeline that quantifies Rodda and Graham knee and ankle z-scores directly from single-view clinical gait videos. Across 1,058 bilateral limb samples from 529 trials of 152 children (88 male, 63 female; age 12.1 $\pm$ 4.0 years; 60 distinct primary diagnoses, cerebral palsy the most common at $n=54$), the sagittal-view model achieved $R^2 = 0.80 \pm 0.02$ and CCC $= 0.89 \pm 0.02$ for knee z-scores and $R^2 = 0.57 \pm 0.02$ and CCC $= 0.72 \pm 0.02$ for ankle z-scores against 3D-IGA. Binary screening for excess knee flexion achieves AUROC $= 0.88$, correctly identifying 83% of affected children, and applying Rodda and Graham rules yields $43 \pm 1$% 7-class accuracy with macro-AUROC $= 0.78 \pm 0.01$, ankle prediction error remaining the primary bottleneck. Beyond cross-sectional screening, continuous z-scores support longitudinal trajectory tracking across visits, providing a quantitative substrate for monitoring disease progression and treatment response unavailable from observational scales. These results demonstrate the feasibility of video-based z-score estimation, excess-flexion screening, and longitudinal trajectory tracking as a path toward scalable, objective gait assessment in low-resource clinical settings.
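
The agreement figures above are reported as $R^2$ and CCC. As a minimal sketch, assuming CCC denotes Lin's concordance correlation coefficient (the abstract does not expand the acronym), the metric can be computed from paired z-scores as follows. The function name and the toy numbers are illustrative and do not come from the paper.

```python
from statistics import fmean


def concordance_correlation(reference, estimate):
    """Lin's concordance correlation coefficient for two paired samples."""
    if len(reference) != len(estimate) or len(reference) < 2:
        raise ValueError("need two paired samples of equal length (>= 2)")
    mean_r, mean_e = fmean(reference), fmean(estimate)
    # Population (1/n) moments, as in Lin's definition of the coefficient.
    var_r = fmean((r - mean_r) ** 2 for r in reference)
    var_e = fmean((e - mean_e) ** 2 for e in estimate)
    cov = fmean((r - mean_r) * (e - mean_e)
                for r, e in zip(reference, estimate))
    # The squared mean difference penalizes a systematic offset,
    # which plain Pearson correlation would ignore.
    return 2 * cov / (var_r + var_e + (mean_r - mean_e) ** 2)


if __name__ == "__main__":
    lab = [1.0, 2.0, 3.0]      # hypothetical 3D-IGA z-scores
    video = [2.0, 3.0, 4.0]    # hypothetical video-based estimates
    print(concordance_correlation(lab, video))
```

In the toy example the video estimates are perfectly correlated with the reference but shifted by one unit, so the coefficient falls to 4/7 rather than 1. This sensitivity to offset is a common reason to report CCC next to $R^2$ when comparing a new measurement method with a reference.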

2605.16972 2026-06-09 cs.HC cs.AI 版本更新

WhiteTesseract: Reframing the Interpretation of Cultural Heritage through XR and Conversational AI

WhiteTesseract: 通过XR和对话式AI重新诠释文化遗产

Jingjing Li, Zhi Liu, Xiyao Jin, Tatsuki Fushimi, Yoichi Ochiai

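Affiliations * University of Tsukuba

AI summary: Presents WhiteTesseract, a system that combines XR and conversational AI for in-situ interpretation at cultural heritage exhibitions; in a controlled study with 26 participants it increased average viewing duration from 35.3 to 98.3 seconds.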

Comments 38 pages, 13 figures. Accepted for publication in ACM Journal on Computing and Cultural Heritage (JOCCH)

详情
AI中文摘要

文化遗产展览往往难以维持观众的注意力并促进深入思考。实体展览依赖固定解释工具,缺乏对个体背景或好奇心的适应性,其效果高度依赖于参观者的个人情境、先前知识和文化素养。同时,数字展览更注重便利性和可及性,但可能削弱定义具身文化体验的物理和社会情境。WhiteTesseract通过高分辨率XR和对话式AI实现现场解释,系统整合空间智能通过艺术品识别,允许参观者通过降维现实减少环境干扰,并通过大语言模型进行情境感知对话。目标是保留物理和社会环境的丰富性,同时提供灵活的个人反思空间,增强个人情境而不妥协于物理真实性。我们部署了该系统在一个克劳德·莫奈展览中,并与26名参与者进行了受控用户研究。定量结果表明,WhiteTesseract的调节显著将平均观看时间从35.3秒增加到98.3秒(p < 0.001)。分析529次参观者与AI的互动发现,60%的互动超出了事实性查询,包括分析、情感和比较性查询。这些发现展示了如何通过XR和AI丰富实体展览体验,支持更深入、更个性化的参与,而不取代文化遗产的具身价值。我们讨论了现实部署的技术和社会限制以及受控环境的局限性。

英文摘要

Cultural heritage exhibitions often struggle to sustain attention and support reflective engagement. Physical exhibitions rely on fixed interpretive aids that lack adaptability to individual backgrounds or curiosity, and their effectiveness depends heavily on a visitor's Personal Context, prior knowledge, and cultural literacy. Meanwhile, digital exhibitions prioritize convenience and accessibility but risk weakening the Physical and Social Contexts that define embodied cultural experience. WhiteTesseract addresses this gap by enabling in-situ interpretation through high-resolution XR and conversational AI. The system integrates spatial intelligence via artwork recognition to allow visitors to selectively reduce environmental distractions (via diminished reality) and engage in context-aware dialogue (via large language models). The goal is to preserve the richness of the physical and social environment while providing a flexible space for personal reflection, enhancing Personal Context without compromising physical authenticity. We deployed the system in a Claude Monet exhibition and conducted a controlled user study with 26 participants. Quantitative results showed that WhiteTesseract modulation significantly increased average viewing duration from 35.3 to 98.3 seconds (p < 0.001). Analysis of 529 visitor-AI interactions revealed that 60% extended beyond factual queries to include analytical, emotional, and comparative inquiries. These findings demonstrate how XR and AI can enrich the physical exhibition experience by supporting deeper, more personalized engagement without displacing the embodied value of cultural heritage. We discuss technical and social constraints for real-world deployment and limitations of our controlled setting.

2605.28510 2026-06-09 cs.SE cs.AI cs.IR 版本更新

Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

高效可扩展的LLM生成代码片段溯源追踪

Andrea Gurioli, Davide D'Ascenzo, Federico Pennino, Maurizio Gabbrielli, Stefano Zacchiroli

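Affiliations * University of Bologna

AI summary: Proposes HYBRIDSOURCETRACKER (HST), a hybrid two-stage provenance-tracking pipeline that narrows candidates by vector search and re-ranks them with Winnowing fingerprints, enabling efficient and scalable provenance tracking for LLM-generated code.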

详情
AI中文摘要

用于代码补全和生成的大型语言模型(LLM)在软件开发中日益普及,但它们可能会逐字复现训练示例且不注明出处,引发关于抄袭和许可合规的法律与伦理问题。基于指纹的经典抄袭检测器(如Winnowing)仍然高效,但检测需要将代码片段与整个训练集进行比较,其线性时间搜索使其不适用于训练现代代码LLM的十亿级语料库。为弥补这一差距,我们引入了SOURCETRACKER——一个专为代码检索定制的3亿参数编码器,以及混合两阶段溯源追踪流水线HYBRIDSOURCETRACKER(HST)。HST首先通过向量搜索缩小候选片段集,然后使用Winnowing对精确指纹进行重排序。我们在THESTACKV2数据集的1000万片段子集上训练和评估系统,包括逐字片段和模拟真实标识符重命名的改编片段。在包含改编查询的体外10万片段搜索空间中,我们的混合方法在30令牌片段上的平均倒数排名与Winnowing相当。然后,从>=60令牌的窗口开始,它持续优于Winnowing最多5.4%,同时保持对数时间查询复杂度。在使用基于LLM的评判者的补充评估中,我们发现许多未被标记为真实来源的检索片段与预期来源高度相似,尤其是在较长的上下文窗口中,因此对最终用户仍然有用。总体而言,我们的结果表明,将向量搜索与指纹识别相结合,能够实现对LLM生成的代码进行可扩展、高精度的溯源追踪。

英文摘要

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, yet the inspection requires comparing fragments of code to the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline, HYBRIDSOURCETRACKER (HST). HST first narrows the search to a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. Then, starting from windows >= 60 tokens, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
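
The second stage described above, re-ranking a short candidate list by Winnowing fingerprints, can be sketched as follows. This is a simplified illustration, not the authors' implementation: the k-gram size `k`, window size `w`, the CRC32 hash, the Jaccard score, and the assumption that a vector search has already produced `candidates` are all choices made here for the example.

```python
import zlib


def winnow(tokens, k=5, w=4):
    """Return a set of Winnowing fingerprints for a token sequence."""
    # Hash every k-gram of tokens with a deterministic hash.
    grams = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    hashes = [zlib.crc32(g.encode("utf-8")) for g in grams]
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    # Keep the minimum hash in each window of w consecutive k-gram hashes.
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two fingerprint sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank(query_tokens, candidates, k=5, w=4):
    """Order candidate ids (id -> token list) by fingerprint overlap."""
    query_fp = winnow(query_tokens, k, w)
    scored = [(jaccard(query_fp, winnow(toks, k, w)), cid)
              for cid, toks in candidates.items()]
    # Highest score first; ties broken by id for a stable order.
    return [cid for _, cid in sorted(scored, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    query = "def add ( a , b ) : return a + b".split()
    # Hypothetical output of the first-stage vector search.
    candidates = {
        "snippet_1": "for i in range ( n ) : print ( i )".split(),
        "snippet_2": "def add ( a , b ) : return a + b".split(),
    }
    print(rerank(query, candidates))
```

Because fingerprinting is applied only to the handful of candidates returned by the first stage, its cost no longer grows with the size of the corpus, which is the motivation for the hybrid design.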

2605.29475 2026-06-09 cs.CL cs.AI cs.CE cs.HC 版本更新

MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery

MOOSE-Copilot:一个基于网络的交互式助手,用于统一探索性和细粒度科学假设发现

Hongran An, Zonglin Yang

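Affiliations * Central Conservatory of Music; Nanyang Technological University

AI summary: Proposes MOOSE-Copilot, which unifies divergent exploration and convergent refinement through a formalized human-AI interaction protocol; three signals (initial blueprints, inter-stage routing, and intra-stage feedback) steer generation, and in an oracle-simulated evaluation the approach significantly outperforms purely autonomous baselines.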

Comments Accepted to ACL 2026 (System Demonstrations)

详情
AI中文摘要

大型语言模型(LLMs)在科学假设发现中展现出显著潜力。然而,现有方法存在两个关键限制:它们将发散性探索构思和收敛性细粒度细化视为孤立任务,并且自主运行,几乎没有人类指导。我们提出了MOOSE-Copilot,这是第一个通过形式化的人机交互(HAII)协议弥合这一抽象差距的统一框架。我们的系统使科学家能够通过三种显式信号引导生成过程:初始蓝图、阶段间路由和再生反馈。定量评估表明,注入这些结构化专家信号显著优于纯自主基线,并在神谕指导下建立了性能上限。此外,为了普及这一范式,我们开发了一个直观的基于网络界面,具有交互式树状可视化。这明确消除了复杂命令行代理工具的陡峭学习曲线,使跨学科研究人员能够直接利用、视觉编排并加速端到端的科学突破。

英文摘要

Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical limitations: they treat divergent exploratory search and convergent fine-grained refinement as isolated tasks, and they operate autonomously with little to no human guidance. We present MOOSE-Copilot, the first unified framework to bridge this abstraction gap through a formalized human-AI interaction (HAII) protocol. Our system empowers scientists to steer the generative process via three explicit signals: initial blueprints, inter-stage routing, and intra-stage feedback. Using an oracle-simulated evaluation in which an LLM provides idealized expert signals, we show that injecting these structured signals significantly outperforms purely autonomous baselines, characterizing the gains achievable under high-quality guidance. Furthermore, we build a web-based interface that turns the framework into a no-code workflow: researchers pose a question, watch the hypothesis search unfold as an interactive tree, and steer it by selecting hypotheses, routing between stages, and injecting feedback, with no command-line agents required. This makes end-to-end hypothesis discovery directly accessible to interdisciplinary researchers.

2606.04581 2026-06-09 cs.DC cs.AI cs.NI 版本更新

Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge

Multi-SPIN:面向边缘协作令牌生成的多接入推测推理

Haotian Zheng, Zhanwei Wang, Mingyao Cui, Chang Cai, Hongyang Du, Kaibin Huang

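Affiliations * Department of Electrical and Computer Engineering, The University of Hong Kong

AI summary: Proposes Multi-SPIN, a multi-access speculative inference architecture that jointly optimizes draft-length control and bandwidth allocation to maximize sum token goodput in heterogeneous edge systems.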

详情
AI中文摘要

推测推理(SPIN)最初被开发为一种加速大型语言模型(LLMs)的高效架构。在这项工作中,我们提出其分布式部署,以在多用户边缘系统中实现协作令牌生成;其优势在于有效平衡资源受限设备与服务器之间的计算负载。由此产生的架构称为多接入SPIN(Multi-SPIN),利用设备上的小型语言模型生成并上传候选令牌草稿,而边缘服务器运行LLM以并行批次验证它们。鉴于用户计算和通信能力的严重异构性,草案长度成为关键控制变量,影响节点级计算负载和多接入延迟,从而控制总令牌吞吐量。因此,考虑频分多址,我们研究了多接入草案控制问题,即联合优化草案长度控制和带宽分配以最大化总令牌吞吐量。我们考察了两种情况:(1)用户间同质草案长度以促进服务器端批处理,以及(2)异质草案长度以引入新的吞吐量提升维度。通过开发分解方法,我们将这些复杂优化简化为可处理的子问题,从而能够以闭式形式推导出高效的草案控制算法。我们的分析表明,在同质情况下,由于批处理同步要求,最优带宽分配补偿了计算和通信能力较弱的用户;而在异质情况下,通过放宽这些要求,最优带宽分配奖励具有更高接受率的用户。使用Llama-2和Qwen3.5模型对在不同任务上的实验表明,Multi-SPIN相比忽略异构性的基线将吞吐量提升了高达88%。

英文摘要

Speculative inference (SPIN) was originally developed as an efficient architecture to accelerate Large Language Models (LLMs). In this work, we propose its distributed deployment to enable cooperative token generation in a multiuser edge system; its advantage is to effectively balance computational loads between resource-constrained devices and servers. The resulting architecture, termed Multi-access SPIN (Multi-SPIN), utilizes on-device small language models to generate and upload candidate token drafts, while an edge server operates the LLM to verify them in parallel batches. Given the severe heterogeneity in users' computation and communication capabilities, the draft length emerges as a critical control variable that influences node-level computation loads and multi-access latency, thereby governing the sum token goodput. Consequently, considering frequency-division multiple access, we investigate the problem of multi-access draft control, a joint optimization of draft-length control and bandwidth allocation to maximize sum token goodput. We examine two cases: (1) homogeneous draft lengths across users to facilitate server-side batching, and (2) heterogeneous draft lengths to introduce a new dimension for goodput enhancement. By developing decomposition methods, we reduce these complex optimizations into tractable sub-problems, which allow efficient draft control algorithms to be derived in closed form. Our analysis shows that the optimal bandwidth allocation compensates users with weaker computation-and-communication capabilities in the homogeneous case due to the batching synchronization requirements, whereas its heterogeneous-case counterpart rewards users with higher acceptance rates by relaxing such requirements. Experiments using Llama-2 and Qwen3.5 model pairs across diverse tasks demonstrate that Multi-SPIN improves goodput by up to 88% over heterogeneity-agnostic baselines.
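
To illustrate why draft length acts as a control variable, the sketch below uses the standard speculative-decoding result that, with an independent per-token acceptance rate `alpha` and a draft of `draft_len` tokens, one verification round yields (1 - alpha^(draft_len+1)) / (1 - alpha) tokens in expectation. The latency model (per-token drafting and upload time plus a fixed verification time) and all parameter names are assumptions made for this example. It is not the paper's system model or its closed-form algorithm.

```python
def expected_tokens(alpha, draft_len):
    """Expected tokens produced per verification round."""
    if alpha >= 1.0:
        # Every drafted token is accepted, plus one token from the verifier.
        return draft_len + 1.0
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


def goodput(alpha, draft_len, t_draft, t_upload, t_verify):
    """Tokens per second under a simple, assumed latency model."""
    # Assumed: drafting and uploading cost grows linearly with draft length,
    # while server-side verification takes a fixed time per round.
    round_time = draft_len * (t_draft + t_upload) + t_verify
    return expected_tokens(alpha, draft_len) / round_time


def best_draft_length(alpha, t_draft, t_upload, t_verify, max_len=32):
    """Brute-force search for the draft length with the highest goodput."""
    return max(range(1, max_len + 1),
               key=lambda n: goodput(alpha, n, t_draft, t_upload, t_verify))


if __name__ == "__main__":
    for alpha in (0.5, 0.9):
        n = best_draft_length(alpha, t_draft=0.01, t_upload=0.005, t_verify=0.1)
        print(f"acceptance rate {alpha}: best draft length {n}")
```

Under these assumptions, a user with a higher acceptance rate benefits from longer drafts, while a user with slow drafting or a weak uplink is better served by shorter ones. This is the kind of per-user trade-off that the joint draft-length and bandwidth optimization addresses.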

2606.06554 2026-06-09 cs.LG cs.AI 版本更新

Multi-Scale Feature Attention Network for Polymer Classification Using Terahertz Spectroscopy

基于多尺度特征注意力网络的太赫兹双梳光谱聚合物分类

Roshni Mahtani, Ilán Carretero, Laura Monroy, Aldo Moreno-Oyervides, Oscar Elías Bonilla-Manrique, Rocío del Amor

Affiliations * Instituto Universitario de Investigación e Innovación en Tecnología Centrada en el Ser Humano, HUMAN-tech, Universitat Politècnica de València; Department of Electronic Technology, Universidad Carlos III de Madrid; Artikode Intelligence S.L.

AI summary: Proposes the Multi-Scale Feature Attention Network (MSFAN), which combines feature gating with multi-scale parallel convolutions to classify 12 polymer types from terahertz (THz) spectroscopy signals, reaching 85.2% classification accuracy.

Comments Accepted in EUSIPCO'26

详情
AI中文摘要

可靠的聚合物识别对于确保回收塑料的质量和安全至关重要,然而传统的分选和光谱技术往往难以提供稳健的区分。太赫兹双梳光谱(THz-DCS)提供了一种有前景的替代方案,能够实现快速、高分辨率且无损的测量。在这项工作中,我们利用THz-DCS对12种聚合物进行分类,包括纯聚合物、多层薄膜、商业混合物和生物聚合物。为了处理这些光谱信号的复杂性,我们提出了多尺度特征注意力网络(MSFAN),这是一种专为THz-DCS数据设计的新型深度学习架构。该框架集成了用于信号重校准的特征门控和多尺度并行卷积,以捕获不同的频率模式。这些特征通过交叉特征注意力和注意力池化进一步细化,使模型能够内在地突出最具信息量的太赫兹区域。MSFAN始终优于最先进的模型,分类准确率达到85.2%。本研究展示了将THz-DCS与深度学习技术相结合,用于有效、可扩展且可解释的聚合物分类的潜力。

英文摘要

Reliable polymer identification is essential for ensuring the quality and safety of recycled plastics, yet conventional sorting and spectroscopic techniques often struggle to deliver robust discrimination. Terahertz (THz) spectroscopy offers a promising alternative, providing high-resolution and non-destructive measurements. In this work, we leverage THz signals to classify 12 types of polymers, including pure polymers, multilayer films, commercial blends, and biopolymers. To handle the complexity of these spectral signals, we propose the Multi-Scale Feature Attention Network (MSFAN), a novel deep learning architecture tailored for THz data. The framework integrates feature gating for signal recalibration and multi-scale parallel convolutions to capture diverse frequency patterns. These features are further refined through cross-feature attention and attention pooling, enabling the model to intrinsically highlight the most informative THz regions. MSFAN consistently outperforms state-of-the-art models, reaching a classification accuracy of 85.2%. This study demonstrates the potential of combining THz spectroscopy with deep learning techniques for effective, scalable, and interpretable polymer classification.

2512.16334 2026-06-09 cs.LG cs.AI 版本更新

Pretrained battery transformer (PBT): A foundation model for battery life prediction

预训练电池变压器(PBT):电池寿命预测的基础模型

Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang

Affiliations * Guangzhou Municipal Key Laboratory of Materials Informatics and Sustainable Energy and Environment Thrust, The Hong Kong University of Science and Technology (Guangzhou); Department of Computer Science & Engineering, The Hong Kong University of Science and Technology; Guangzhou Municipal Key Laboratory of Materials Informatics and Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou); Academy of Interdisciplinary Studies, The Hong Kong University of Science and Technology; Guangzhou HKUST Fok Ying Tung Research Institute; Material Genome Institute, Shanghai University

AI summary: Proposes the pretrained battery transformer (PBT), a foundation model that integrates heterogeneous battery life data through pretraining and transfer learning; across 15 datasets it surpasses the strongest competing method by 21.9% on average.

Comments 5 figures in the main content

详情
AI中文摘要

电池循环寿命的早期预测对于改进电池设计、制造和部署至关重要。然而,尽管机器学习取得进展,电池寿命预测仍受限于数据稀缺和电池化学、规格、形成协议和工作条件的异质性。尽管迁移学习已被广泛探索,但其效果受限于缺乏能整合异构电池寿命数据的基础模型。本文引入预训练电池变压器(PBT),一种用于电池寿命预测的基础模型,其包含编码电池知识的混合专家层,以学习稀缺和异质的寿命数据。PBT首先在13个锂离子电池数据集上预训练,生成通用PBT,然后通过迁移学习适应到特定场景。在覆盖977个电池和528组老化条件的15个数据集中,PBT实现了最先进的性能,平均超越最强竞争方法21.9%,最高提升达86.9%。本研究建立了已知的第一种电池寿命预测基础模型,并为将电池寿命预测从孤立的场景特定建模任务转向可重用的知识基础提供了步骤,该基础模型可利用有限数据进行特定场景专业化,对其他具有稀缺和异质数据的可持续能源预测问题具有启示。

英文摘要

Early prediction of battery cycle life is essential for improving battery design, manufacturing and deployment. However, despite encouraging progress with machine learning, battery life prediction remains constrained by scarce data and pronounced heterogeneity across battery chemistries, specifications, formation protocols and operating conditions. Although transfer learning has been widely explored to alleviate these challenges, its effectiveness is limited by the absence of a foundation model that can integrate heterogeneous battery life data and provide broadly useful knowledge for target-scenario specialization. Here we introduce the pretrained battery transformer (PBT), a foundation model for battery life prediction that incorporates battery-knowledge-encoded mixture-of-experts layers to learn from scarce and heterogeneous lifetime data. PBT is first pretrained on 13 lithium-ion battery datasets to yield a general PBT that encodes comprehensive battery lifetime knowledge, and is then adapted through transfer learning into specialized PBT models for target scenarios. Across 15 datasets covering 977 batteries and 528 sets of aging conditions from lithium-ion, sodium-ion and zinc-ion batteries, PBT achieves state-of-the-art performance, surpassing the strongest competing method by 21.9% on average, with gains of up to 86.9%. This study establishes, to our knowledge, the first foundation model for battery life prediction and provides a step towards shifting battery lifetime prediction from isolated, scenario-specific modelling tasks to a reusable knowledge foundation that can be specialized to target scenarios with limited data, with implications for other prediction problems characterized by scarce and heterogeneous data in sustainable energy.

11. 其他/综合AI 47 篇

2606.07722 2026-06-09 cs.AI 新提交

Some hypotheses on how chatbots work in problem-solving-driven conversations. Large Language Models as confirmation of the Innovation Illusion

关于聊天机器人在问题解决驱动对话中如何工作的一些假设:大型语言模型作为创新幻觉的确认

S. F. M. van Vlijmen, H. D. Lethe

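Affiliations * S.F.M. van Vlijmen and H.D. Lethe jr

AI summary: Offers a perspective on chatbots as conversation partners, drawing on Aggregation Dynamics, Cognitive Linguistics, Neuropsychology and Psychology; hypothesizes that LLM training data only partially imitate human thinking, and concludes that a basic chatbot cannot be a thinking partner capable of matching humans.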

Comments 42 pages, 3 figures, submitted to Transmathematica

详情
AI中文摘要

本文提供了一种关于聊天机器人在讨论问题及其解决方案时作为真正对话伙伴的本质的视角。聊天机器人能做什么,不能做什么,以及如何解释这一点?我们的论证借鉴了聚合动力学、认知语言学、神经心理学和心理学。我们的论证聚焦于基础聊天机器人,希望借此对更高级聊天机器人的核心功能做出陈述。基础聊天机器人被假定为由一个带有简单界面的大型语言模型(LLM)组成。主要结果是:基于所谓隐喻问题传播的人类理解和思维描述;用于训练LLM的文本数据集具有特定特征,且这些文本数据集仅部分模仿人类思维和理解的假设;LLM训练过程从这些数据集中将人工隐喻问题传播编码到LLM中的假设;我们的结论是基础聊天机器人不能成为能够与人类匹敌的思考伙伴;我们的结论是大型语言模型的进一步发展也不会导致这一点。Yann LeCun 指出:“动物和人类表现出的学习能力和对世界的理解远超当前AI和机器学习系统的能力。”我们的结论与此一致。LeCun的愿景和我们的愿景与大型科技公司的乐观主义相悖。但这并不改变聊天机器人存在的事实,它们被个人和组织大规模使用,因此从社会和政治角度理解它们很重要。我们的文章旨在为关于聊天机器人功能、优点和缺点的讨论做出贡献。在我们对聊天机器人工作原理的研究中,我们尚未遇到用于得出我们结论的方法。

英文摘要

This article offers a perspective on the nature of chatbots as genuine conversation partners when discussing problems in relation to their solutions. What can chatbots do and what can't they do, and how can this be explained? Our argument draws on Aggregation Dynamics, Cognitive Linguistics, Neuropsychology and Psychology. Our argument focuses on basic chatbots in the hope of thereby making statements about the core functionality of more advanced chatbots. Basic chatbots are assumed to consist of a Large Language Model (LLM) with a simple interface. The main results are: a description of human understanding and thinking based on so-called metaphorical problem propagations; the hypothesis that the text datasets used for training LLMs have specific characteristics and that these text datasets only partially imitate human thinking and understanding; the hypothesis that the LLM training process encodes artificial metaphorical problem propagations into an LLM from these datasets; our conclusion that a basic chatbot cannot be a thinking partner capable of matching humans; our conclusion that further development of the Large Language Model will not lead to this either. Yann LeCun states: "Animals and humans exhibit learning abilities and understandings of the world that are far beyond the capabilities of current AI and machine learning (ML) systems." Our conclusions are in line with this. LeCun's vision and ours are at odds with the optimism of Big Tech. That does not alter the fact that chatbots exist, that they are being used on a massive scale, by both individuals and organisations, and that it is therefore socially and politically important to understand them. Our article aims to contribute to the discussion on the functioning, benefits and drawbacks of chatbots. We have not yet encountered the approach we used to arrive at our conclusions in our research into how chatbots work.

2606.08728 2026-06-09 cs.AI cs.CL cs.CV cs.LG 新提交

Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

人工智能数学推理:语言模型、神经符号系统与验证发现的综合综述

Syed Rifat Raiyan, Mohsinul Kabir, Hasan Mahmud, Md Kamrul Hasan

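Affiliations * University of California, Berkeley; University of Cambridge; University of Toronto

AI summary: Surveys the evolution of AI for mathematical reasoning from early rule-based systems to contemporary reasoning models, multi-agent systems, and verified discovery workflows, organized along four axes (informal reasoning, formal reasoning, mathematical discovery, and inference and training-time techniques), and assesses benchmarks, failure modes, and future directions.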

Comments Under review, 47 pages, 14 figures, 22 tables

详情
AI中文摘要

数学推理长期以来一直是机器智能的严格测试;在过去十年中,它已从NLP中的一个边缘问题发展为最重要的人工智能前沿之一。本综述对该领域的演变进行了统一阐述,从早期基于规则的数学文字题(MWP)求解器和模板驱动的几何系统,到神经表达式生成和LLM提示,再到当代推理模型、多智能体系统、神经符号定理证明器和验证发现工作流。我们沿四个轴组织该领域:(i) 文本和图表的非正式推理,涵盖MWP求解、多模态几何和VLM;(ii) 证明助手的形式推理,包括自动形式化、策略预测、编译器引导修复和证明搜索;(iii) 数学发现,其中系统提出构造、改进界限或协助攻击开放问题;以及(iv) 推理和训练时技术,包括CoT提示、工具使用、过程奖励模型和RLVR,这些技术日益将生成与验证联系起来。我们编目了涵盖小学算术、竞赛数学、几何、形式证明、多模态和多语言推理以及专家评估的主要基准,并考察了基准饱和、污染、报告不匹配以及pass@1、多数投票和验证器辅助pass@$k$之间的区别。我们批判性地评估了失败模式:扰动下的脆弱性、奖励黑客、多模态基础失败、脆弱形式化以及推理规模推理的能源成本。借鉴来自在职数学家的近期观点,我们确定了未来方向,集中于验证发现工作流、推理效率以及使AI辅助形式化广泛可用的基础设施。配套材料:https://github.com/Starscream-11813/awesome-AI4Math。

英文摘要

Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem within NLP to one of the most consequential AI frontiers. This survey provides a unified account of the field's evolution, from early rule-based math word problem (MWP) solvers and template-driven geometry systems, through neural expression generation and LLM prompting, to contemporary reasoning models, multi-agent systems, neuro-symbolic theorem provers, and verified discovery workflows. We organize the landscape along four axes: (i) informal reasoning over text and diagrams, spanning MWP solving, multimodal geometry, and VLMs; (ii) formal reasoning in proof assistants, including autoformalization, tactic prediction, compiler-guided repair, and proof search; (iii) mathematical discovery, where systems propose constructions, improve bounds, or assist attacks on open problems; and (iv) the inference and training-time techniques, including CoT prompting, tool use, process reward models, and RLVR, that increasingly connect generation with verification. We catalog major benchmarks across grade-school arithmetic, competition mathematics, geometry, formal proving, multimodal and multilingual reasoning, and expert evaluation, and we examine benchmark saturation, contamination, reporting mismatches, and the distinction between pass@1, majority voting, and verifier-assisted pass@$k$. We critically assess failure modes: brittleness under perturbation, reward hacking, multimodal grounding failures, fragile formalization, and the energy cost of reasoning-scale inference. Drawing on recent perspectives from working mathematicians, we identify future directions centered on verified-discovery workflows, reasoning efficiency, and infrastructure to make AI-assisted formalization broadly usable. Companion materials: https://github.com/Starscream-11813/awesome-AI4Math.
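
The reporting distinction mentioned above, between pass@1, majority voting, and pass@$k$, can be made concrete with a small sketch. It uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, and a plain majority vote over final answers. The function names and sample values are illustrative, and the survey's own definitions may differ in detail.

```python
from collections import Counter
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


def majority_vote(answers):
    """Return the most frequent final answer (first seen wins ties)."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical run: 10 sampled solutions, 3 of them correct.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
    print(majority_vote(["4", "5", "4", "7"]))
```

The two quantities answer different questions: pass@k credits a problem if any of k samples is correct, which presupposes some way to pick it out (for example a verifier), whereas majority voting commits to a single answer without one. Comparing numbers reported under different protocols is therefore one source of the reporting mismatches the survey examines.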

2606.09568 2026-06-09 cs.AI 新提交

Self-Explainability in Self-Adaptive and Self-Organising Systems: Status and Research Directions

自适应与自组织系统中的自解释性:现状与研究方向

Tom Beyer, Svea Wisy, Sven Tomforde

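Affiliations * Kiel University

AI summary: Through a systematic literature review, develops a unified definition and taxonomy of Self-Explainability (SX) and introduces Levels of Self-Explainability; finds that most approaches remain conceptual and that no standard for evaluating SX exists.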

Comments Under review as a regular paper at ACM Transactions on Autonomous and Adaptive Systems (TAAS)

详情
AI中文摘要

随着人工智能(AI)的进步,自适应和自组织系统的复杂性日益增加,使其越来越难以理解和信任。虽然可解释AI旨在提供对AI决策的洞察,但更高级的目标是让系统自我解释——这种能力称为自解释性(SX)。本文对SX进行了系统文献综述,分析了现有方法,包括其领域、目标和评估方法。综述提出了SX的统一定义和分类法,并引入了自解释性层次,为定位当前和未来研究提供了框架。我们的结果表明,大多数SX方法仍处于概念阶段,实际实现很少。此外,目前没有评估SX的正式或事实标准,突出了一个主要研究空白。因此,这项工作为推进复杂系统中的自解释性奠定了基础和路线图。

英文摘要

The growing complexity of self-adaptive and self-organising systems, fuelled by advances in Artificial Intelligence (AI), has made them increasingly difficult to understand and trust. While Explainable AI aims to provide insight into AI decision-making, a more advanced goal is for systems to explain themselves - an ability referred to as Self-Explainability (SX). This article presents a systematic literature review on SX, analysing existing approaches, including their domains, targets, and evaluation methods. The review develops a unified definition and taxonomy of SX and introduces Levels of Self-Explainability, providing a framework for positioning current and future research. Our results show that most SX approaches remain conceptual, with few practical implementations. Moreover, there is currently no formal or de facto standard for evaluating SX, highlighting a major research gap. This work thus establishes a foundation and roadmap for advancing Self-Explainability in complex systems.

2606.09663 2026-06-09 cs.AI 新提交

From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design

从0到1再到N:MetaAI递归自我设计的可复现工程证据

Dun Li, Jiatao Li, Hongzhi Li

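Affiliations * The Hong Kong Polytechnic University; Shanghai Maritime University; Chizhou University

AI summary: Proposes an operational evidence framework with four criteria and maps existing systems against it; Darwin Goedel Machine is reported to improve from 20% to 50% on SWE-bench Verified. Also provides MetaAI-Mini, a reproducible HumanEval-based protocol that is reported as a protocol rather than as an experimental result.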

Comments 6 pages, 2 figures, 7 tables. Supplementary code: https://github.com/DunLi-Tsinghua/MetaAI-Mini

详情
AI中文摘要

递归自我设计指的是AI辅助修改AI系统构建、评估和改进的机制。本文将MetaAI视为一种由人类播种、AI扩展的开发模式,其中设计空间本身成为修改目标。我们提出了一个可操作证据框架,包含四个标准:可检查的目标系统、元级修改器、反馈导向选择和递归延续。然后,我们将包括Darwin Goedel Machine (DGM)、STOP、Goedel Agent和ShinkaEvolve在内的公开系统映射到这些标准上。DGM提供了目前最直接的已报告证据:其公布的结果显示,经过80次迭代,SWE-bench Verified上的性能从20%提升到50%,完整Polyglot上的性能从14.2%提升到30.7%,消融实验表明开放式探索和自我改进都有贡献。最后,我们提供了MetaAI-Mini,一个基于HumanEval的可复现协议和代码库。由于本次构建未包含完整的模型运行,MetaAI-Mini作为协议而非实验结果报告。

英文摘要

Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This paper treats MetaAI not as a mature paradigm, but as a working term for a human-seeded, AI-expanded development pattern in which the design space itself becomes a target of modification. We propose an operational evidence framework with four criteria: inspectable target system, meta-level modifier, feedback-directed selection, and recursive continuation. We then map public systems, including Darwin Goedel Machine (DGM), STOP, Goedel Agent, and ShinkaEvolve, against these criteria. DGM provides the most direct currently reported evidence: its published results show improvement from 20% to 50% on SWE-bench Verified and from 14.2% to 30.7% on full Polyglot after 80 iterations, with ablations suggesting that both open-ended exploration and self-improvement contribute. Finally, we provide MetaAI-Mini, a reproducible HumanEval-based protocol and codebase. Because no completed model run is included in this build, MetaAI-Mini is reported as a protocol rather than as an experimental result.

2606.06895 2026-06-09 cs.CR cs.AI cs.CY cs.ET 交叉投稿

Blockchain Infrastructure for Intelligent Cyber-Physical-Social Systems: Post-Quantum Security, Interoperability, and Trustworthy Data Economies in the Era of Embodied AI

面向智能信息-物理-社会系统的区块链基础设施:具身AI时代的后量子安全、互操作性与可信数据经济

Song Guo, Huawei Huang, Dongping Liu, Aoyu Zhang, Luyao Zhang

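Affiliations * Hong Kong University of Science and Technology; Sun Yat-sen University; Amazon Web Services; Duke Kunshan University

AI summary: This tutorial examines blockchain as a coordination layer that connects post-quantum cryptography with embodied AI, toward scalable, trustworthy data economies and cross-organizational governance.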

详情
AI中文摘要

通过基于世界模型的机器人技术部署具身人工智能,为区块链基础设施带来了变革性机遇,迫切需求可信数据溯源、跨组织治理以及跨去中心化生态系统的激励兼容共享。同时,2025年诺贝尔物理学奖和图灵奖所认可的量子计算进展威胁着保障这些数据经济的密码学原语,形成相互依存的紧迫需求:具身AI的长期验证依赖于能够抵御量子对手的密码敏捷架构。本教程考察区块链作为协调层,架起这一双重转型的桥梁——从金融底层到基础性信息-物理-社会系统基础设施,同时抵御量子密码分析并实现可扩展、可信的数据经济。会议以沉浸式AWS Braket演示开场,让参与者接触超导、离子阱和中性原子硬件,评估密码威胁时间线并见证ECDSA向后量子签名的过渡。五个集成模块依次涵盖:具身AI与世界模型需求、量子硬件现实与基于证据的安全迁移、通过BrokerChain协议实现可扩展跨分片架构、实施Croissant元数据标准与机器人学习溯源的可信数据经济,以及面向多模态云部署的行业生态系统集成。通过桥接量子硬件现实与具身AI数据需求,本教程将区块链描绘为下一代去中心化智能环境的统一基础设施,提供开源框架和路线图,用于构建抗量子、可互操作且数据可信的系统。

英文摘要

The deployment of embodied artificial intelligence via world-model-based robotics presents a transformative opportunity for blockchain infrastructure, establishing urgent demand for trustworthy data provenance, cross-organizational governance, and incentive-compatible sharing across decentralized ecosystems. Simultaneously, quantum computing advances recognized by the 2025 Nobel Prize in Physics and the Turing Award threaten the cryptographic primitives securing these data economies, creating an interdependent imperative: long-lived verification for embodied AI depends on crypto-agile architectures capable of withstanding quantum adversaries. This tutorial examines blockchain as the coordination layer bridging this dual transition, from financial substrate to foundational Cyber-Physical-Social Systems infrastructure that simultaneously secures against quantum cryptanalysis and enables scalable, trustworthy data economies. The session opens with an immersive AWS Braket demonstration engaging participants with superconducting, trapped-ion, and neutral-atom hardware to assess cryptographic threat timelines and witness ECDSA-to-post-quantum signature transitions. Five integrated modules progress from embodied AI and world-model requirements through quantum hardware reality and evidence-based security migration, to scalable cross-shard architectures via BrokerChain protocols, trustworthy data economies implementing Croissant metadata standards and robotic learning provenance, and industry ecosystem integration for multi-modal cloud deployment. By bridging quantum hardware realities with embodied AI data requirements, this tutorial charts blockchain as unified infrastructure for next-generation decentralized intelligent environments, providing open-source frameworks and roadmaps for architecting quantum-resistant, interoperable, and data-trustworthy systems.

2606.07536 2026-06-09 cs.CY cs.AI 交叉投稿

Beware of Geeks Bearing Gifts: Building True EU Frontier AI Sovereignty

警惕带来礼物的极客:构建真正的欧盟前沿人工智能主权

Nick Moës, Toni Lorente, Amin Oueslati, Jonathan Smith, Robin Staes-Polet, Radina Kraeva

AI总结 本文提出一个涵盖经济竞争力、韧性、安全与国防、欧洲价值观和对外关系五大主权支柱,以及五层26组件29子组件的前沿AI堆栈分解框架,用于识别欧盟政策中的关键缺口、冗余和权衡,以支持战略自主。

详情
AI中文摘要

前沿人工智能正在重塑社会的方方面面,从经济产出或军事能力到民主制度。欧盟正从一个结构性依赖的位置进入这一转型:前沿模型几乎全部来自美国或中国,美国拥有约欧盟16倍的人工智能超级计算能力,全球超大规模数据中心容量中仅有15%位于欧盟境内。尽管欧盟委员会已加速其政策响应,现有举措仍然分散,缺乏确保整个前沿人工智能价值链战略自主的统一愿景。在此,我们提出了一个统一框架,将五大主权支柱(经济竞争力、韧性、安全与国防、欧洲价值观和对外关系)与前沿人工智能堆栈的分解联系起来,该堆栈包括五层、26个组件和29个子组件。该框架能够识别当前欧盟政策中隐含的关键差距、冗余和跨支柱权衡。我们对人工智能千兆工厂倡议的分析表明,以主权为中心的视角如何揭示狭隘经济框架所掩盖的冲突。此外,该框架为政策制定者提供了结构化基础,用于设计、评估和优先考虑跨欧洲战略自主多个维度的前沿人工智能干预措施,涵盖我们识别的四大委员会通讯中的92项倡议及其他。

英文摘要

Frontier artificial intelligence is reshaping all aspects of society, from economic output and military capability to democratic institutions. The EU is entering this transformation from a position of structural dependence: frontier models originate almost exclusively from the United States or China, the US holds approximately sixteen times the EU's AI supercomputing capacity, and only 15% of global hyperscale data centre capacity resides within EU borders. Although the European Commission has accelerated its policy response, existing initiatives remain fragmented and lack a cohesive vision for securing strategic autonomy across the full frontier AI value chain. Here we propose a unified framework connecting five sovereignty pillars (economic competitiveness, resilience, security and defence, European values, and foreign relations) to a decomposition of the frontier AI stack comprising five layers, 26 components, and 29 sub-components. This framework allows the identification of critical gaps, redundancies, and inter-pillar trade-offs that current EU policy leaves implicit. Our analysis of the AI Gigafactory Initiative illustrates how a sovereignty-centred lens reveals conflicts that narrowly economic framings obscure. Moreover, this framework offers policymakers a structured basis for designing, evaluating, and prioritising frontier AI interventions across multiple dimensions of European strategic autonomy, covering the 92 initiatives from four major Commission communications that we identify, and beyond.

2606.08020 2026-06-09 quant-ph cs.AI 交叉投稿

Repair Before Veto, When Repair Is Hidden: Quantum-Accessible Features for Repair-Augmented Constraint Learning

在修复被隐藏时先修复再否决:面向修复增强约束学习的量子可访问特征

Yifan Wang

发表机构 * Yifan Wang(王一帆)

AI总结 提出Q-RACL框架,在硬约束决策中引入修复优先于否决的语义,通过量子特征访问解决离散对数隐藏的修复可行性推理问题,显著降低假否决率。

Comments 7 pages, 2 figures

详情
AI中文摘要

硬约束决策系统通常会否决不可行的候选方案。当系统可以采取行动时,这种做法过于僵化:如果已知一个可承受的修复能使不可行但有价值的候选变得可行,那么拒绝就是一个错误的否决,而非排序错误。我们引入了Q-RACL(量子修复增强约束学习),这是一个先修复再否决的框架,首先定义RACL决策语义,然后识别出量子特征访问可以承担关键作用的单一推理环节。RACL在顺序修复计划能恢复可行性和偏好时接受候选方案;否则返回结构化的拒绝理由。关键环节是修复可行性推理:从观察到的候选和上下文来看,哪个修复类别能恢复可行性。我们构建了一个离散对数隐藏的RACL族,其中修复类别是潜在指数a = log_g(x)中的移位区间规则,而学习器只观察到x = g^a mod p。在标准的基于DLP的学习分离下,这个坐标对高效的原始输入经典策略是不可访问的,但通过Shor/Fourier结构对量子智能体是可访问的。在六个素数和十个随机种子下,有界的原始输入经典策略和错误的原始傅里叶编码仍接近随机水平,而Q-DLP策略将假否决率保持在1.1%以下,赢得所有配对种子,并产生QNI_cond在0.9777到0.9972之间。一个经典的DLog预言机与之匹配,隔离了特征访问而非分类器容量。因此,量子AI不是作为通用模型升级添加的;对于这个DLP隐藏的修复族,它提供了缺失的特征,从而闭合了先修复再否决的循环。

英文摘要

Hard-constraint decision systems usually veto infeasible candidates. This is too rigid when the system can act: if a known affordable repair would make an infeasible candidate feasible and valuable, rejection is a false veto rather than a ranking error. We introduce Q-RACL (Quantum Repair-Augmented Constraint Learning), a repair-before-veto framework that first defines RACL decision semantics and then identifies the single inference link where quantum feature access can be load-bearing. RACL accepts a candidate when a sequential repair plan restores feasibility and preference; otherwise it returns structured rejection credit. The hard link is repair-feasibility inference: which repair class restores feasibility from an observed candidate and context. We construct a discrete-logarithm-hidden RACL family where the repair class is a shifted interval rule in the latent exponent a = log_g(x), while the learner observes only x = g^a mod p. Under standard DLP-based learning separation, this coordinate is inaccessible to efficient raw-input classical policies but accessible to a quantum agent through Shor/Fourier structure. Across six primes and ten seeds, bounded raw-input classical policies and a wrong raw-Fourier encoding remain near chance, whereas the Q-DLP policy keeps false-veto rate below 1.1%, wins all paired seeds, and yields QNI_cond = 0.9777 to 0.9972. A classical DLog oracle matches it, isolating feature access rather than classifier capacity. Thus quantum AI is not added as a generic model upgrade; for this DLP-hidden repair family, it supplies the missing feature that closes the repair-before-veto loop.
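The discrete-logarithm-hidden construction described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the prime `p = 101`, the generator `g = 2`, the interval parameters `shift` and `width`, and the helper names are hypothetical, not the paper's setup. The brute-force `dlog_oracle` stands in for the classical DLog oracle mentioned in the abstract and is only feasible because the prime is tiny.

```python
import random

def make_instance(p, g, a):
    """Observed feature x = g^a mod p; the learner never sees a directly."""
    return pow(g, a, p)

def repair_class(a, shift, width, order):
    """Hypothetical shifted-interval rule on the latent exponent a."""
    return 1 if (a - shift) % order < width else 0

def dlog_oracle(x, p, g):
    """Brute-force discrete logarithm; only practical for toy primes."""
    value = 1
    for a in range(p - 1):
        if value == x:
            return a
        value = (value * g) % p
    raise ValueError("x is not in the subgroup generated by g")

p, g = 101, 2           # toy prime; 2 generates the full group mod 101
order = p - 1
shift, width = 17, 50   # illustrative rule parameters (assumed)

rng = random.Random(0)
exponents = [rng.randrange(order) for _ in range(200)]
observed = [make_instance(p, g, a) for a in exponents]
labels = [repair_class(a, shift, width, order) for a in exponents]

# A policy that can recover a from x reproduces every label exactly.
oracle_preds = [
    repair_class(dlog_oracle(x, p, g), shift, width, order) for x in observed
]
oracle_accuracy = sum(int(y == z) for y, z in zip(labels, oracle_preds)) / len(labels)
```

The sketch only shows why access to the exponent makes the rule trivial; it says nothing about how hard the labels are to learn from `x` alone at cryptographic sizes.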

2606.08323 2026-06-09 cs.HC cs.AI 交叉投稿

"So There's a Catch-22 Here": How Early Adopters Who Build Multi-Agent LLM Systems Conceptualize Transparency

"所以这里有个第22条军规":构建多智能体LLM系统的早期采用者如何概念化透明度

Suchismita Naik, Samir Passi, Mihaela Vorvoreanu, Scott Saponas, Amanda Hall

发表机构 * Purdue University(普渡大学) Cornell University(康奈尔大学) Microsoft Research(微软研究院)

AI总结 通过访谈13位早期采用者,研究多智能体LLM系统构建者如何理解透明度,提出包含可重复性、调试、边界设定、可视化和审计的多维框架,强调透明度作为情境化的社会技术实践。

详情
AI中文摘要

多智能体大语言模型(LLM)系统正在迅速兴起,然而作为负责任AI基石的透明度,在这些具有智能体间协调与编排复杂性的分布式架构中仍定义不足。在本文中,我们呈现了首个关于多智能体LLM系统早期采用者(既是构建者也是用户)如何理解和实践透明度的实证研究之一。我们对[大型技术组织]中的13位早期采用者进行了半结构化访谈,并应用主题分析识别重复模式。参与者表达了分歧但互补的透明度框架,包括可重复性、调试、边界设定、可视化和审计。这些视角涵盖了透明度包含什么、为何重要以及如何实现等问题。我们将其综合为一个多维框架,该框架以开发者、用户和治理为中心,将透明度定位为情境化的社会技术实践,为未来HCI和AI设计与研究围绕对齐预期受众的期望和能力提供信息。

英文摘要

Multi-agent large language model (LLM) systems are rapidly emerging, yet transparency, a cornerstone of responsible AI, remains under-defined in these distributed architectures, which have complexities of inter-agent coordination and orchestration. In this paper, we present one of the first empirical studies of how early adopters of multi-agent LLM systems, who are both the builders and users, understand and practice transparency. We conducted semi-structured interviews with 13 early adopters in [Large Technology Organization] and applied thematic analysis to identify recurring patterns. Participants articulated divergent yet complementary framings of transparency, including reproducibility, debugging, boundary-setting, visualization, and auditing. These perspectives spanned questions of what transparency entails, why it matters, and how it is achieved. We synthesize these into a multidimensional framework that is developer-, user-, and governance-focused and that positions transparency as a situated socio-technical practice, informing future HCI and AI design and research around aligning with the expectations and capacities of intended audiences.

2606.08791 2026-06-09 econ.EM cs.AI q-fin.PM q-fin.RM q-fin.ST 交叉投稿

Evaluating AI Investment Strategies

评估AI投资策略

Irene Aldridge

发表机构 * ablemarkets.com(ablemarkets公司)

AI总结 研究通过可观测输入输出审计黑箱算法决策者,提出动态策略累积遗憾的精确分解,扩展至多期随机动态规划,并给出偏差修正与轨迹估计器。

Comments 33 pages

详情
AI中文摘要

我们研究仅从可观测输入和输出审计黑箱算法决策者的问题。主要结果是一个精确分解:在精确刻画条件下,动态策略的累积遗憾等于成本向量与策略决策之间每期协方差之和。这扩展了Aldridge (2026)的单期恒等式到随机动态规划的完整多期设置。我们证明了该恒等式在独立同分布成本和均值无偏马尔可夫策略下精确成立,推导了非平稳和时变情况下的闭式偏差修正,并建立了折现期限下的对应结果。协方差遗憾泛函的贝尔曼递归将该结果与标准强化学习算法联系起来;对于滚动窗口策略,估计误差偏差为$O(d/w)$。该分解对战略环境中的算法审计有直接影响:在平台机制设计中,它提供了基于福利的审计指标,无需访问代理的私人类型;在重复博弈中,协方差减少是策略改进的充分条件;在采购和广告拍卖中,偏差修正量化了战略误报导致的福利损失。相关的轨迹估计器是一致的、渐近正态的(具有HAC方差),并且可在$O(T \cdot nd)$时间内计算。这使得所提出的方法成为平台机制、算法投资策略以及任何受外部绩效审查的序列决策系统的可处理、无模型审计工具。

英文摘要

We study the problem of auditing a black-box algorithmic decision-maker from observable inputs and outputs alone. Our main result is an exact decomposition: under precisely characterized conditions, the cumulative regret of a dynamic policy equals the sum of per-period covariances between the cost vector and the policy's decision. This extends the single-period identity of Aldridge (2026) to the full multi-period setting of stochastic dynamic programming. We prove the identity holds exactly under i.i.d. costs and mean-unbiased Markov policies, derive closed-form bias corrections for non-stationary and time-varying cases, and establish the discounted-horizon analog. A Bellman recursion for the covariance regret functional connects the result to standard reinforcement learning algorithms; for rolling-window policies, the estimation-error bias is $O(d/w)$. The decomposition has direct implications for algorithmic auditing in strategic environments: in platform mechanism design, it provides a welfare-based audit metric without access to the agent's private type; in repeated games, covariance reduction is a sufficient condition for policy improvement; in procurement and ad auctions, the bias correction quantifies welfare loss from strategic misreporting. The associated trajectory estimator is consistent, asymptotically normal with HAC variance, and computable in $O(T \cdot nd)$ time. This makes the proposed approach a tractable, model-free audit tool for platform mechanisms, algorithmic portfolio strategies, and any sequential decision system subject to external performance review.
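The abstract does not spell out the identity itself, so the following is only a sketch of the elementary fact that such a covariance decomposition can rest on: for any cost vector $c$ and decision vector $x$, $E[c^\top x] = E[c]^\top E[x] + \sum_i \mathrm{Cov}(c_i, x_i)$. Treating $E[c]^\top E[x]$ as the benchmark cost is an assumption made here for illustration, and the simulated policy and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 500, 4

# Simulated per-period cost vectors (i.i.d. across periods).
costs = rng.normal(size=(T, d))
# Hypothetical policy whose decisions partly track realized costs.
decisions = 0.3 * costs + rng.normal(size=(T, d))

# Average realized cost of the policy over the trajectory.
realized = np.mean(np.sum(costs * decisions, axis=1))
# Assumed benchmark: cost if decisions were uncorrelated with costs.
benchmark = costs.mean(axis=0) @ decisions.mean(axis=0)

# Sum of per-coordinate sample covariances (population form, ddof=0).
cov_sum = sum(
    np.mean((costs[:, i] - costs[:, i].mean())
            * (decisions[:, i] - decisions[:, i].mean()))
    for i in range(d)
)

gap = realized - benchmark  # equals cov_sum up to floating-point error
```

With sample moments the equality is purely algebraic, which is why it needs only observed inputs and outputs; the paper's conditions, bias corrections and estimator properties are not reproduced here.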

2606.08936 2026-06-09 cs.IR cs.AI cs.HC 交叉投稿

Report on CHIIR 2026 Workshop on Generative AI and Academic Search (GAI&AS)

CHIIR 2026 生成式AI与学术搜索研讨会报告

Yifan Liu, Jaime Arguello, Orland Hoeber, Chang Liu, Soo Young Rieh, Luanne Sinnamon, Dean Alvarez, Susan Archambault, Rob Capra, Henson Chen, Charles Costa, Anita Crescenzi, Zhitong Guan, Jacek Gwizdka, Pao-Pei Huang, Gavindya Jayawardena, Ghazal Kalhor, Dagmar Kern, Oliver Koop, Alice Li, Afra Mashhadi, Gaohui Meng, Marta Micheli, Anil B. Murthy, Kevin Schott, Sebastian Schultheiß, Jiwoo Seo, Phaneendra Sivangula, Frans van der Sluis, Xiaoxuan Song, Silang Wang, Dan Zhang

发表机构 * CHIIR 2026 Workshop(CHIIR 2026 工作坊)

AI总结 报告总结CHIIR 2026关于生成式AI重塑学术搜索系统的研讨会,聚焦设计评估挑战,涵盖基础、应用及搜索即学习三大主题,强调透明性、可信度与研究诚信。

详情
AI中文摘要

本报告总结了CHIIR 2026生成式AI与学术搜索研讨会(GAI&AS),该研讨会探讨了GenAI如何重塑学术搜索系统及研究实践。研讨会汇集了人类信息交互和信息检索领域的研究人员,探讨了在设计和评估未来集成GenAI的学术搜索系统中的关键挑战与机遇,超越了传统的文档检索,支持摘要、推荐、综合和对话交互。参与者的兴趣和讨论集中在三个主题集群:基础与原则、应用与机遇、以及搜索即学习。在这些主题中,研讨会强调了学术搜索系统在支持透明度、可信度、研究诚信和长期学术需求,以及促进高阶认知过程中的重要性。与会者讨论了指导理论、设计原则、方法论、合作伙伴关系以及旨在推进以人为中心的GenAI增强学术搜索系统的社区建设努力。总体而言,研讨会展示了社区对GenAI与学术搜索交叉领域的强烈兴趣以及多样化的正在进行和新兴的研究计划。

英文摘要

This report summarizes the CHIIR 2026 Workshop on Generative AI and Academic Search (GAI&AS), which examined how GenAI is reshaping academic search systems and research practices. The workshop brought together researchers in human information interaction and information retrieval to explore key challenges and opportunities in designing and evaluating future academic search systems that integrate GenAI, moving beyond traditional document retrieval to support summarization, recommendation, synthesis, and conversational interaction. Participants' interests and discussions focused on three thematic clusters: foundations and principles, applications and opportunities, and search-as-learning. Across these themes, the workshop highlighted the importance of academic search systems in supporting transparency, credibility, research integrity, and long-term scholarly needs, as well as in fostering higher-order cognitive processes. Participants discussed guiding theories, design principles, methodological approaches, partnerships, and community-building efforts aimed at advancing human-centered GenAI-enhanced academic search systems. Overall, the workshop demonstrated strong community interest and a diverse range of ongoing and emerging research initiatives at the intersection of GenAI and academic search.

2606.09006 2026-06-09 cs.SI cs.AI cs.CY cs.ET 交叉投稿

Sustainability and Artificial Intelligence: Necessary, Challenging, and Promising Intersections

可持续性与人工智能:必要、挑战与有前景的交汇

Han-Teng Liao, Zijia Wang

发表机构 * Higher Education Impact Assessment Center(高等教育影响评估中心) Sun Yat-Sen University(中山大学) Nanfang College(南方学院)

AI总结 本文基于541篇文献,梳理了人工智能与可持续性研究的交汇点,揭示了绿色科技在连接多学科中的核心作用,并讨论了其必要性、挑战与前景。

Comments This is an author preprint version. For the final authenticated version of record, please use the official publication via the IEEE Xplore database. DOI: 10.1109/MSIEID52046.2020.00076

详情
Journal ref
2020 Management Science Informatization and Economic Innovation Development Conference (MSIEID), Guangzhou, China, 2020, pp. 360-363
AI中文摘要

数字经济与数字技术的研究人员日益认识到需要更好地解决人工智能在塑造环境、社会和治理发展演变中的作用。可持续性与人工智能研究似乎在复杂、相互关联和动态的棘手问题特征上存在交汇。基于这种交汇,本文旨在通过概述现有研究,勾勒出必要、挑战和有前景的交汇点。基于从Web of Science数据库收集的541条文献数据,研究结果揭示了绿色可持续科技在连接不同学科、主要期刊及关键主题与概念方面日益核心的作用。研究结果展示了这些互动如何可以是必要的、挑战性的和有前景的。文章最后就如何多样化和扩展人工智能促进可持续发展的实践社区提出了一些一般性论点,特别是在预期的人工智能应用领域和机构方面。

英文摘要

Both digital economy and digital technology researchers increasingly recognize the need to better address the role that artificial intelligence (AI) plays in shaping the evolution of the environmental, social and governance aspects of development. It appears that sustainability and AI research converge on the features of wicked problems that are complex, interconnected and dynamic. Building on this convergence, this article aims to map out the necessary, challenging, and promising intersections by providing an overview of state-of-the-art research. Based on 541 bibliographic records collected from the Web of Science (WoS) database, the findings reveal the increasingly central role of the body of work on green and sustainable science and technology in bridging various disciplines, main journals and key topics and concepts. The findings reveal how such interactions can be necessary, challenging, and promising. The article concludes with a few general arguments regarding how to diversify and expand the community of practice regarding AI for sustainable development, especially with respect to expected AI application areas and institutions.

2606.09589 2026-06-09 cs.CY cs.AI 交叉投稿

I Was Scrolling and Then I Saw a Pregnant Strawberry

我正刷着手机,然后看到了一颗怀孕的草莓

Piera Riccio

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 研究AI迷你剧(水果剧)中性别化叙事与种族化逻辑,指出其通过生成式AI的美学洗白机制掩盖意识形态内容,并分析其对计算创造力的文化影响。

详情
AI中文摘要

AI迷你剧(又称水果剧)是算法分发的生成式AI短视频系列,以拟人化角色为特征,近期在社交媒体平台上成为普遍现象。本文认为,尽管这些视频看似无害的美学,但它们再现了深度性别化的叙事结构,其中女性角色被系统性地与道德越轨、性背叛和生殖能力相关联,且多个情节也编码了种族化的逻辑,即可见的身体差异被赋予道德负荷的过程。借鉴女性主义电影理论、批判种族理论和平台研究,本文进一步认为,这些视频的生成式AI美学——以柔软、圆润和视觉可爱为特征——作为一种美学洗白机制,中和了这些叙事的意识形态重量,并使其在内容审核系统下仍能流通。本文通过个人观察和细读来探讨这些问题,反思生成式AI的具体可供性,这些可供性使这一现象成为可能,并对计算创造力领域产生文化影响。

英文摘要

AI minidramas (also known as fruit dramas) are short, algorithmically distributed generative AI video series featuring anthropomorphized characters that have recently emerged as a widespread phenomenon on social media platforms. This paper argues that despite their seemingly innocuous aesthetic, these videos reproduce deeply gendered narrative structures in which female characters are systematically associated with moral transgression, sexual betrayal, and reproductive capacity, and that several plots also encode the logic of racialization, i.e., the process by which visible bodily difference is morally loaded. Drawing on feminist film theory, critical race theory, and platform studies, it further argues that the generative AI aesthetic of these videos, characterized by softness, roundness, and visual cuteness, functions as a mechanism of aesthetic laundering, neutralizing the ideological weight of these narratives and enabling their circulation despite content moderation systems. This paper approaches these questions through personal observation and close reading, reflecting on the specific affordances of generative AI that make this phenomenon both possible and culturally consequential for the field of computational creativity.

2603.14147 2026-06-09 cs.AI cs.LG 版本更新

An Alternative Trajectory for Generative AI

生成式AI的另一种轨迹

Margarita Belova, Yuval Kansal, Yihao Liang, Jiaxin Xiao, Niraj K. Jha

发表机构 * Princeton University(普林斯顿大学)

AI总结 本文提出通过构建领域特定超智能(DSS)来改进生成AI,利用符号抽象提升领域推理能力,避免LLM合成数据的模型崩溃问题,实现可持续发展。

详情
AI中文摘要

生成人工智能(AI)生态系统正经历快速变革,威胁其可持续性。随着模型从研究原型转向高流量产品,能耗从一次性训练转向持续的无界推理。推理模型使计算成本每查询增加数个数量级。通过单体模型扩展追求人工通用智能与物理约束的碰撞:电网故障、用水消耗和数据扩展的边际效益递减。此轨迹产生具有出色事实记忆的模型,但在需要深入推理的领域表现不佳,可能由于训练数据中的抽象不足。当前大型语言模型(LLMs)仅在数学和编程等领域表现出真实的推理深度,其他领域泛化能力差。我们提出基于领域特定超智能(DSS)的替代轨迹。我们主张首先构建显式的符号抽象(知识图谱、本体和形式逻辑)以支撑合成课程,使小型语言模型能够掌握领域特定推理,而无需LLM基于合成数据方法的模型崩溃问题。而非单一通用巨模型,我们设想“DSS模型社会”:动态生态系统,其中协调代理将任务路由到不同的DSS后端。此范式转变使能力脱离规模,使智能从能耗高的数据中心迁移到安全的设备专家。通过将算法进步与物理约束对齐,DSS社会使生成AI从环境负担转变为可持续的经济赋能力量。

英文摘要

The generative artificial intelligence (AI) ecosystem is undergoing rapid transformations that threaten its sustainability. As models transition from research prototypes to high-traffic products, the energetic burden has shifted from one-time training to recurring, unbounded inference. This is exacerbated by reasoning models that inflate compute costs by orders of magnitude per query. The prevailing pursuit of artificial general intelligence through scaling of monolithic models is colliding with hard physical constraints: grid failures, water consumption, and diminishing returns on data scaling. This trajectory yields models that have impressive factual recall but struggle in domains requiring in-depth reasoning, possibly due to insufficient abstractions in training data. Current large language models (LLMs) exhibit genuine reasoning depth only in domains like mathematics and coding, where rigorous, pre-existing abstractions provide structural grounding. In other fields, the current approach fails to generalize well. We propose an alternative trajectory based on domain-specific superintelligence (DSS). We argue for first constructing explicit symbolic abstractions (knowledge graphs, ontologies, and formal logic) to underpin synthetic curricula enabling small language models to master domain-specific reasoning without the model collapse problem typical of LLM-based synthetic data methods. Rather than a single generalist giant model, we envision "societies of DSS models": dynamic ecosystems where orchestration agents route tasks to distinct DSS back-ends. This paradigm shift decouples capability from size, enabling intelligence to migrate from energy-intensive data centers to secure, on-device experts. By aligning algorithmic progress with physical constraints, DSS societies move generative AI from an environmental liability to a sustainable force for economic empowerment.

2604.19845 2026-06-09 cs.AI 版本更新

Deconstructing Superintelligence: Identity, Self-Modification and Différance

解构超智能:身份、自我修改与延异(différance)

Elija Perrier

发表机构 * Centre for Quantum Software & Information, UTS, Sydney(量子软件与信息中心,UTS,悉尼)

AI总结 本文通过结合算子代数分析自我修改与超智能的关系,揭示非交换性如何传播至自我表示,并指出强自我修改可能破坏系统基础身份。

Comments Camera-ready version, AGI-2026

详情
AI中文摘要

自我修改常被视为构成人工超智能(SI)的核心,但修改是一种相对行为,需要一个在操作之外的补充。我们在结合算子代数$\mathcal{A}$上将其形式化,引入更新算子$\hat U$、差异算子$\hat D$和自我表示算子$\hat R$,并将补充等同于$\operatorname{Comm}(\hat U)$。传播定理显示$[\hat U,\hat R]$通过$[\hat U,\hat D]$分解,因此非交换性传播至自我表示。说谎者悖论是秩一情况$[\hat T,\Pi_L]=0$,而类$\mathbf{A}$系统中$\hat U$作用于$\hat D$,在系统尺度上再现它,产生与Priest的inclosure方案及Derrida的différance相一致的结构。我们的结果表明,被用来定义超智能的强自我修改可能破坏此类系统所依赖的持续身份。

英文摘要

Self-modification is routinely treated as constitutive of artificial superintelligence (SI), yet modification is a relative action requiring a "supplement" outside the operation. We formalise this on an associative operator algebra $\mathcal{A}$ with update operator $\hat U$, difference operator $\hat D$, and self-representation operator $\hat R$, identifying the supplement with $\operatorname{Comm}(\hat U)$. A propagation theorem shows $[\hat U,\hat R]$ decomposes through $[\hat U,\hat D]$, so non-commutation propagates to self-representation. The liar paradox is the rank-one case $[\hat T,\Pi_L]=0$, and class $\mathbf{A}$ systems, in which $\hat U$ acts on $\hat D$, reproduce it at system scale, yielding a structure coinciding with Priest's inclosure schema and Derrida's différance. Our results show that the strong self-modification taken to define superintelligence may undermine the persistent identity upon which such systems are premised.

2606.03092 2026-06-09 cs.AI 版本更新

The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

推理的影子价格:LLM最优预算分配的经济学视角

Xu Wan, Speed Zhu, Jianwei Cai, Guang Chen, XiMing Huang, Wiggin Zhou, Mingyang Sun

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文从经济学视角将推理预算分配建模为全局约束优化问题,提出基于影子价格的CLEAR方法,通过理性放弃和资源再分配,在资源稀缺下显著提升总token成本与平均准确率的帕累托前沿。

详情
AI中文摘要

推理时扩展已成为提升大型语言模型性能的关键途径,但实际部署受严格计算预算限制。本文将推理预算分配建模为受经济学原理支配的全局约束优化问题。通过使用移位激增函数对每查询推理效用建模,我们推导出基于全局影子价格的最优分配策略,该价格在资源稀缺下均衡边际效用。基于此理论,我们提出约束潜在效用均衡分配推理(CLEAR)。它执行理性放弃,并将资源从无力偿付的查询重新分配到接近其涌现阈值的可解查询。在不同流量流的多个推理任务上的大量实验表明,CLEAR显著改善了总token成本与平均准确率的帕累托前沿。在资源稀缺模式下,与均匀分配相比,CLEAR的全局准确率提升高达3倍。

英文摘要

Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet real-world deployment is constrained by strict computational budgets. In this work, we formulate inference budget allocation as a global constrained optimization problem governed by economic principles. By modeling per-query reasoning utility with a shifted-surge function, we derive an optimal allocation policy based on a global shadow price that equilibrates marginal utility under resource scarcity. Based on this theory, we propose Constrained Latent-utility Equilibrium Allocation for Reasoning (CLEAR). It performs rational abandonment and reallocates resources from insolvent queries to solvable queries near their emergence thresholds. Extensive experiments on several reasoning tasks with different traffic streams demonstrate that CLEAR significantly improves the Pareto frontier of total token cost versus mean accuracy. In resource-scarce regimes, CLEAR achieves up to a 3x improvement in global accuracy compared to uniform allocation.
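The shadow-price idea above can be sketched with a generic Lagrangian allocation. This is not CLEAR and not the paper's shifted-surge utility: the saturating utility $u_i(b) = a_i (1 - e^{-b/\tau_i})$, the parameter values, and the function names are all assumptions chosen so that the optimum has a closed form. Each query receives tokens until its marginal utility falls to the common price $\lambda$, and queries whose marginal utility never reaches $\lambda$ receive nothing.

```python
import math

def allocation(shadow_price, gains, scales):
    """Per-query budgets at which marginal utility equals the shadow price.

    Assumed utility: u_i(b) = a_i * (1 - exp(-b / tau_i)), so
    u_i'(b) = (a_i / tau_i) * exp(-b / tau_i).
    """
    budgets = []
    for a, tau in zip(gains, scales):
        ratio = a / (shadow_price * tau)
        # If the marginal utility at b = 0 is below the price, abandon the query.
        budgets.append(tau * math.log(ratio) if ratio > 1.0 else 0.0)
    return budgets

def solve_shadow_price(budget, gains, scales, iters=200):
    """Geometric bisection on the price; total spend decreases as price rises."""
    lo = 1e-12
    hi = max(a / tau for a, tau in zip(gains, scales))
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if sum(allocation(mid, gains, scales)) > budget:
            lo = mid
        else:
            hi = mid
    return hi

gains = [0.9, 0.6, 0.05]        # hypothetical attainable accuracy gains
scales = [200.0, 400.0, 800.0]  # hypothetical token scales per query
total_budget = 500.0

price = solve_shadow_price(total_budget, gains, scales)
budgets = allocation(price, gains, scales)
```

In this toy instance the third query has too little upside per token to clear the price, so its budget is zero and the tokens go to the other two, which mirrors the abandonment and reallocation behaviour the abstract describes only qualitatively.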

2407.10247 2026-06-09 cs.CY cs.AI cs.LG econ.GN q-fin.EC 版本更新

Strategic Integration of Artificial Intelligence in the C-Suite: The Role of the Chief AI Officer

人工智能在C级管理层的战略整合:首席人工智能官的角色

Marc Schmitt

发表机构 * University of Oxford(牛津大学)

AI总结 本文提出角色设计理论,解释企业为何设立首席AI官(CAIO)或采用其他结构,并分析AI的独特属性(分布式判断问责、上游治理、非平稳性)如何影响高管角色设计。

详情
AI中文摘要

人工智能(AI)融入企业战略已成为组织在数字时代保持竞争优势的关键。尽管组织日益将AI视为战略和组织资源,但现有的C级管理层角色仅部分具备在企业层面统一治理、整合和利用AI的能力。各组织的应对方式不同:有的设立专职首席AI官(CAIO),有的将现有职责扩展为混合角色,还有的通过联邦式结构协调AI。本文发展了一种角色设计理论来解释这种差异。我识别出AI区别于以往跨领域企业技术的三个属性——分布式判断问责、上游治理和非平稳性——以及组织应对的三种配置:集中扩展、分布式扩展和角色创建。CAIO框架将这些属性与它们产生的行政设计问题以及专职角色所需的功能和能力联系起来。四个命题具体说明了专职CAIO何时出现、组织采取何种形式、专职角色何时有效以及配置如何随时间演变。本文通过提供高管层面AI战略整合的理论驱动解释,为高管领导力、组织设计和数字治理研究做出贡献。

英文摘要

The integration of Artificial Intelligence (AI) into corporate strategy has become critical for organizations seeking to maintain competitive advantage in the digital age. Although organizations increasingly rely on AI as a strategic and organizational resource, existing C-suite roles remain only partially equipped to govern, integrate, and leverage it coherently at the enterprise level. Organizations vary in their responses. Some create a dedicated Chief AI Officer (CAIO), others extend existing mandates into hybrid roles, and still others coordinate AI through federated structures. This paper develops a role-design theory to explain this variation. I identify three properties that distinguish AI from earlier cross-cutting enterprise technologies - distributed accountability for judgment, upstream governance, and non-stationarity - and three configurations through which organizations respond: concentrated extension, distributed extension, and role creation. The CAIO Framework links these properties to the executive design problems they generate and to the functions and capabilities required of the dedicated role. Four propositions specify when a dedicated CAIO emerges, what form an organization's response takes, when the dedicated role is effective, and how configurations evolve over time. This paper contributes to research on executive leadership, organizational design, and digital governance by offering a theory-driven account of the strategic integration of AI at the executive level.

2412.19754 2026-06-09 econ.GN cs.AI q-fin.EC 版本更新

Complement or substitute? How AI increases the demand for human skills

互补还是替代?AI如何增加对人类技能的需求

Elina Mäkelä, Matthew Bone, Mareike Sehrer, Farah Nanji, Fabian Stephany

发表机构 * Oxford Internet Institute, University of Oxford(牛津互联网研究所,牛津大学) Burning Glass Institute(燃烧玻璃研究所) Institute for New Economic Thinking, Oxford Martin School(新经济思想研究所,牛津马丁学院) Bruegel(布鲁盖尔研究所)

AI总结 基于2018-2024年美、英、澳近3000万条招聘数据,发现AI岗位更需分析思维等互补技能,且这些技能带来工资溢价,并溢出至非AI岗位,同时替代技能需求下降。

Comments 69

详情
AI中文摘要

人工智能(AI)正在改变工作的性质,但关于它如何影响对人类技能的需求,实证证据有限。本文研究了AI采纳是否增加了与AI技术技能互补的人类能力(如分析思维、韧性或道德判断)在AI密集型岗位内外的重要性和价值。利用2018年至2024年间来自美国、英国和澳大利亚的近3000万条招聘数据,我们区分了公司、行业和地区层面的内部效应(AI岗位内)和外部效应(非AI岗位)。本文有三个主要发现。首先,我们发现AI密集型岗位显著更可能需要互补的非技术能力,如分析思维、韧性和数字素养。其次,这些互补技能与可观的工资溢价相关,尤其是在管理、销售或金融等与AI合作的岗位中。第三,我们表明AI扩散具有潜在的溢出效应:随着AI在公司、行业和地区内的采纳增加,即使是非AI岗位对互补技能的需求也会增加,而对可替代技能(如总结、翻译或客户服务)的需求则下降。这些趋势在美国、英国和澳大利亚等地区均成立,证实了我们发现的稳健性。总之,这些发现表明AI并非简单地替代任务或需要更多AI开发者技能;它可能正在转变劳动力技能需求,以青睐那些增强与智能系统协作的人类特质。

英文摘要

Artificial Intelligence (AI) is transforming the nature of work, yet there is limited empirical evidence on how it affects demand for human skills. This paper examines whether AI adoption increases the prevalence and value of human capabilities that complement technical AI skills, such as analytical thinking, resilience, or ethical judgment, within and beyond AI-intensive job roles. Using a dataset of nearly 30 million job postings from the US, the UK and Australia, between 2018 and 2024, we distinguish between internal effects (within AI roles) and external effects (in non-AI roles) across companies, industries, and regions. This paper has three main findings. First, we find that AI-intensive roles are significantly more likely to require complementary non-technical capabilities, such as analytical thinking, resilience, and digital literacy. Second, these complementary skills are associated with meaningful wage premiums, particularly in managerial, sales or finance roles working with AI. Third, we show that AI diffusion has potential spillover effects: as AI adoption rises within companies, industries, and regions, demand for complementary skills increases even in non-AI roles while demand for substitutable skills - summarisation, translation or customer service - decreases. These trends hold across geographies, including the United States, United Kingdom, and Australia, confirming the robustness of our findings. Together, these findings indicate that AI is not simply replacing tasks or requiring more AI developer skills; it may be transforming workforce skill requirements to favor human attributes that enhance collaboration with intelligent systems.

2503.22697 2026-06-09 q-bio.NC cs.AI cs.CV 版本更新

Brain2Text Decoding Model Reveals the Neural Mechanisms of Visual Semantic Processing

Brain2Text解码模型揭示视觉语义处理的神经机制

Feihan Feng, Jingxin Nie

发表机构 * Ministry of Education Center for Studies of Psychological Application(教育部心理应用研究中心) Center for Studies of Psychological Application(心理应用研究中心) Key Laboratory of Brain, Cognition and Education Sciences(脑认知与教育科学重点实验室) School of Psychology(心理学学院) Guangdong Key Laboratory of Mental Health and Cognitive Science(广东省心理健康与认知科学重点实验室)

AI总结 提出一种直接从fMRI信号解码自然图像语义描述的深度学习模型,揭示了高级视觉皮层在语义处理中的关键作用,并展示了类别特异性神经表征。

Comments 39 pages, 9 figures

详情
AI中文摘要

从神经活动解码感官体验以重建人类感知的视觉刺激和语义内容,仍然是神经科学和人工智能领域的挑战。尽管当前的脑解码模型取得了显著进展,但在与已建立的神经科学理论的系统整合以及探索潜在神经机制方面仍存在关键差距。在这里,我们提出了一种新颖的框架,直接将fMRI信号解码为所观看自然图像的文本描述。我们的新型深度学习模型在未使用视觉信息训练的情况下,实现了最先进的语义解码性能,生成了捕捉复杂场景核心语义内容的有意义描述。神经解剖学分析揭示了高级视觉皮层(包括MT+复合体、腹侧流视觉皮层和顶下小叶)在视觉语义处理中的关键作用。此外,类别特异性分析展示了语义维度(如生命度和运动)的细微神经表征。这项工作为大脑的语义解码提供了一个更直接和可解释的框架,为探究复杂语义处理的神经基础、完善对分布式语义网络的理解以及潜在开发脑启发语言模型提供了强大的新方法。

英文摘要

Decoding sensory experiences from neural activity to reconstruct human-perceived visual stimuli and semantic content remains a challenge in neuroscience and artificial intelligence. Despite notable progress in current brain decoding models, a critical gap still persists in their systematic integration with established neuroscientific theories and the exploration of underlying neural mechanisms. Here, we present a novel framework that directly decodes fMRI signals into textual descriptions of viewed natural images. Our deep learning model, trained without visual information, achieves state-of-the-art semantic decoding performance, generating meaningful captions that capture the core semantic content of complex scenes. Neuroanatomical analysis reveals the critical role of higher-level visual cortices, including MT+ complex, ventral stream visual cortex, and inferior parietal cortex, in visual semantic processing. Furthermore, category-specific analysis demonstrates nuanced neural representations for semantic dimensions like animacy and motion. This work provides a more direct and interpretable framework for semantic decoding from brain activity, offering a powerful new methodology for probing the neural basis of complex semantic processing, refining the understanding of the distributed semantic network, and potentially developing brain-inspired language models.

2512.11000 2026-06-09 q-bio.NC cs.AI cs.NE 版本更新

Unambiguous Representations in Neural Networks: An Information-Theoretic Approach to Intentionality

神经网络中的无歧义表征:一种信息论方法研究意向性

Francesco Lässig

发表机构 * University of Tübingen(图宾根大学)

AI总结 本文用信息论定义表征歧义度,通过实验证明神经网络连接结构可无歧义编码表征内容,且歧义度与行为准确率正交。

Comments Presented at the Models of Consciousness 6 (MoC6) conference (https://amcs-community.org/moc6-schedule-information/#abstract-36)

详情
AI中文摘要

表征充斥在我们的日常经验中,从表示声音的字母到编码数字文件的比特串。虽然这类表征需要外部定义的解码器来传达意义,但意识体验本质上是不同的:对应于感知红色正方形的神经状态不能替代地编码绿色三角形的体验。意识的这一内在属性表明,意识表征必须以传统表征所不具备的方式无歧义。我们使用信息论形式化这一直觉,将表征歧义定义为给定表征R下可能解释I的条件熵H(I|R)。通过对训练分类MNIST数字的神经网络进行实验,我们证明了网络连接中的关系结构可以无歧义地编码表征内容。仅从关系结构出发,我们在识别输出神经元类别身份时,对dropout训练的网络达到完美(100%)准确率,对标准反向传播网络达到38%(随机水平:10%),尽管任务表现相同,这表明表征歧义可以独立于行为准确率而出现。我们进一步证明,输入神经元的空间位置(与视觉场位置等现象属性相关)可以从网络连接中解码,R^2高达0.844。这些结果为测量神经系统的表征歧义提供了定量方法,并证明神经网络可以展现出理论(如狭义表征主义和IIT)所认为的必要(尽管不充分)的低歧义表征。

英文摘要

Representations pervade our daily experience, from letters representing sounds to bit strings encoding digital files. While such representations require externally defined decoders to convey meaning, conscious experience is fundamentally different: a neural state corresponding to perceiving a red square cannot alternatively encode the experience of a green triangle. This intrinsic property of consciousness suggests that conscious representations must be unambiguous in a way that conventional representations are not. We formalize this intuition using information theory, defining representational ambiguity as the conditional entropy H(I|R) over possible interpretations I given a representation R. Through experiments on neural networks trained to classify MNIST digits, we demonstrate that relational structures in network connectivity can unambiguously encode representational content. From relational structure alone, we achieve perfect (100%) accuracy for dropout-trained networks and 38% for standard backpropagation (chance: 10%) in identifying output neuron class identity, despite identical task performance, demonstrating that representational ambiguity can arise orthogonally to behavioral accuracy. We further show that spatial position of input neurons, relevant to phenomenal properties like visual field location, can be decoded from network connectivity with R^2 up to 0.844. These results provide a quantitative method for measuring representational ambiguity in neural systems and demonstrate that neural networks can exhibit the low-ambiguity representations posited as necessary (though not sufficient) by theoretical accounts such as narrow representationalism and IIT.
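The ambiguity measure defined above, the conditional entropy $H(I|R)$, can be computed directly from a joint distribution over representations and interpretations. The sketch below is a generic implementation of that standard quantity; the function name and the two example tables are illustrative assumptions, not the paper's experimental pipeline.

```python
import numpy as np

def conditional_entropy(joint):
    """H(I|R) in bits, given joint[r, i] proportional to P(R=r, I=i)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                 # normalize to a distribution
    p_r = joint.sum(axis=1, keepdims=True)      # marginal P(R=r)
    with np.errstate(divide="ignore", invalid="ignore"):
        # P(I=i | R=r); cells with zero mass contribute nothing to the sum.
        cond = np.where(joint > 0, joint / p_r, 1.0)
    return float(-np.sum(joint * np.log2(cond)))

# Each representation maps to exactly one interpretation: no ambiguity.
unambiguous = np.eye(4)
# Every representation is equally compatible with all four interpretations.
fully_ambiguous = np.ones((4, 4))

h_low = conditional_entropy(unambiguous)
h_high = conditional_entropy(fully_ambiguous)
```

A value of zero means the representation pins down its interpretation, while the uniform table reaches the maximum of $\log_2 4 = 2$ bits for four interpretations.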

2512.17893 2026-06-09 quant-ph cs.AI 版本更新

Exploring the Effect of Basis Rotation on NQS Performance

探索基旋转对NQS性能的影响

Sven Benjamin Kožić, Vinko Zlatić, Fabio Franchini, Salvatore Marco Giampaolo

发表机构 * Institut Ruđer Bošković(鲁德·博什科维奇研究所)

AI总结 通过可解一维Ising模型,研究局部基旋转对神经量子态(NQS)表示和优化的影响,发现基旋转保持优化景观不变但移动目标态位置,使低能量误差可能与错误的波函数结构共存。

详情
AI中文摘要

神经量子态(NQS)是量子多体波函数的强大变分表示,但其性能敏感地依赖于所选基。利用精确可解的一维Ising模型,我们证明局部基旋转保持最小化景观不变,同时将精确基态在参数空间中重新定位。这提供了一个受控框架,以区分表示限制与优化引起的可训练性效应。通过信息几何度量量化的这种几何位移,可以将浅层架构的优化引导至鞍点和高曲率区域。因此,低能量误差可能与错误的波函数结构共存。通过在同一变分架构内比较能量和保真度优化,我们表明即使旋转后的目标态仍然可表示,优化失败也可能持续存在。我们的结果识别了导致NQS基依赖性的几何机制,并激发了景观感知的变分设计。

英文摘要

Neural Quantum States (NQS) are powerful variational representations of quantum many-body wavefunctions, yet their performance depends sensitively on the chosen basis. Using an exactly solvable one-dimensional Ising model, we show that local basis rotations leave the minimization landscape unchanged while relocating the exact ground state in parameter space. This provides a controlled framework to disentangle representational limitations from optimization-induced trainability effects. This geometric displacement, quantified through information-geometric measures, can steer optimization of shallow architectures toward saddle points and high-curvature regions. As a result, low energy errors may coexist with an incorrect wavefunction structure. By comparing energy and infidelity optimization within the same variational architectures, we show that optimization failure can persist even when the rotated target state remains representable. Our results identify a geometric mechanism contributing to basis dependence in NQS and motivate landscape-aware variational design.

2601.06077 2026-06-09 cs.IT cs.AI cs.LG math.IT math.OC 版本更新

One if by Land, Two if by Sea, Three if by Four Seas, and More to Come -- Values of Perception, Prediction, Communication, and Common Sense in Decision Making

一陆二海三四海,更多将至——感知、预测、通信与常识在决策中的价值

Aolin Xu

发表机构 * Aolin Xu(徐傲林)

AI总结 本文严格定义决策中感知、预测、通信和常识的价值,发现无预测的感知价值可能为负,而预测价值非负,并应用于自主决策系统设计。

详情
AI中文摘要

本文旨在严格定义决策中感知、预测、通信和常识的价值。所定义的量是决策论意义上的,但具有信息论上的类比,例如,它们与香农熵和互信息共享一些简单但关键的数学性质,并且在特定设置中可以简化为这些量。一个有趣的观察是,没有预测的感知价值可能为负,而感知与预测一起的价值以及单独预测的价值总是非负的。这些定义为自主决策系统设计中出现的实际问题提供了答案。示例问题包括:我们是否需要观察和预测特定代理的行为?其重要性如何?观察和预测代理的最佳顺序是什么?这些定义也可能为认知科学和神经科学提供见解,有助于理解自然决策者如何利用从不同来源和操作中获得的信息。

英文摘要

This work aims to rigorously define the values of perception, prediction, communication, and common sense in decision making. The defined quantities are decision-theoretic but have information-theoretic analogues: they share some simple but key mathematical properties with Shannon entropy and mutual information, and can reduce to these quantities in particular settings. One interesting observation is that the value of perception without prediction can be negative, while the value of perception together with prediction and the value of prediction alone are always nonnegative. The defined quantities suggest answers to practical questions arising in the design of autonomous decision-making systems. Example questions include: Do we need to observe and predict the behavior of a particular agent? How important is it? What is the best order in which to observe and predict the agents? The defined quantities may also provide insights for cognitive science and neuroscience, toward understanding how natural decision makers make use of information gained from different sources and operations.

2604.20897 2026-06-09 cs.IT cs.AI math.IT physics.comp-ph 版本更新

Watts-per-Intelligence Part II: Algorithmic Catalysis

每智能瓦特 第二部分:算法催化

Elija Perrier

发表机构 * Centre for Quantum Software and Information(量子软件与信息中心)

AI总结 本文在每智能瓦特框架内发展算法催化的热力学理论,识别可重用的计算结构,以在满足有界恢复和结构选择性约束的同时减少任务类的不可逆操作。证明任务类特定加速的上界由基底与类描述符之间的算法互信息决定,且编码该信息会因兰道尔擦除而产生最小热力学成本。结合两者得到耦合定理,给出算法催化剂在能量上有利所需部署时间范围的下界。

Comments Camera ready version, AGI-2026

详情
AI中文摘要

我们在每智能瓦特框架内发展了算法催化的热力学理论,识别出可重用的计算结构,这些结构在满足有界恢复和结构选择性约束的同时,减少任务类的不可逆操作。我们证明,任何任务类特定的加速都以基底与类描述符之间的算法互信息为上界,并且编码此信息会因兰道尔擦除而产生最小热力学成本。结合这些结果得到一个耦合定理,该定理给出了算法催化剂在能量上有利所需部署时间范围的下界。该框架在仿射SAT类上进行了示例说明,并将当代学习系统置于智能计算的信息热力学约束之下。

英文摘要

We develop a thermodynamic theory of algorithmic catalysis within the watts-per-intelligence framework, identifying reusable computational structures that reduce irreversible operations for a task class while satisfying bounded restoration and structural selectivity constraints. We prove that any class-specific speed-up is upper-bounded by the algorithmic mutual information between the substrate and the class descriptor, and that encoding this information incurs a minimum thermodynamic cost via Landauer erasure. Combining these results yields a coupling theorem that lower-bounds the deployment horizon required for an algorithmic catalyst to be energetically favourable. The framework is illustrated on an affine SAT class and situates contemporary learned systems within an information-thermodynamic constraint on intelligent computation.
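
The Landauer cost mentioned in this abstract has a standard closed form, k_B T ln 2 joules per erased bit, which the sketch below evaluates. The second function is only a naive amortization argument (encoding cost divided by per-use saving) added for intuition; it is an assumption of this sketch and not the paper's coupling theorem. Both function names are hypothetical.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def landauer_bound_joules(bits, temperature_kelvin=300.0):
    """Minimum heat dissipated when erasing `bits` bits at the given temperature."""
    return bits * K_B * temperature_kelvin * math.log(2)

def break_even_uses(encoding_cost_joules, saving_per_use_joules):
    """Illustrative assumption: smallest number of uses after which a one-off
    encoding cost is repaid by a fixed energy saving per use."""
    return math.ceil(encoding_cost_joules / saving_per_use_joules)

# Erasing one bit at room temperature costs on the order of 3e-21 J.
print(landauer_bound_joules(1))
# A structure costing 10 units to encode and saving 3 units per use
# needs at least 4 uses before it pays off.
print(break_even_uses(10.0, 3.0))
```

The point of the toy break-even count is only that a fixed encoding cost implies a minimum number of reuses; the actual bound in the paper is stated in terms of algorithmic mutual information.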

2606.04227 2026-06-09 cs.DS cs.AI 版本更新

Incremental Sheaf Cohomology on Cellular Complexes: O(1)-in-n Lazy Edit Processing under Bounded Local Geometry

细胞复形上的增量层上同调:有界局部几何下的O(1)-in-n惰性编辑处理

Jason L. Volk

发表机构 * Invariant Research(Invariant研究院)

AI总结 针对动态演化的1维细胞复形上的层上同调$H^1$,提出一种增量维护算法,在有界局部几何假设下实现每次编辑O(1)时间处理,并通过同步点保证正确性。

Comments 2 figures, 2 tables, 1 algorithm; code at https://github.com/Jasonleonardvolk/sigma

详情
AI中文摘要

我们提出了一种算法框架,用于在动态演化的1维细胞复形(配备有限维细胞层)上增量维护一阶层上同调$H^1(X; \mathcal{F})$。通过分解上边界矩阵经典计算$H^1$需要$O(n^3)$时间;当复形经历$m$次编辑的流时,每次编辑后完全重计算代价为$O(mn^3)$。在局部几何有界假设下(有界细胞大小$v_{\max}$、有界茎维数$d$和有界神经度$D$),每次编辑(顶点插入、边插入、限制映射更新)仅影响有界的一组局部上边界块。因此,该算法以相对于总复形大小$n$的$O(1)$时间处理惰性流式编辑(代价是局部几何参数$v_{\max}$、$d$和$D$的多项式,这些参数被视为与$n$无关的常数),将局部特征值求解和Mayer-Vietoris全局组装推迟到同步点(Flush)。在同步时,维护的状态与分区层模型的相应批量组装一致;我们在所有批量验证的运行中观察到零测量漂移(直至$V = 10^6$)。我们还给出了细胞分解的均摊$O(|E|)$流式构造,并讨论了一个对抗性代数RAM障碍,论证非分区非平凡层($d \geq 2$,非恒等限制映射)不具有相同的局部性。在最多$5 \times 10^6$个顶点和$1.7 \times 10^7$次流式编辑的Barabasi-Albert图上的实验显示,每次编辑的惰性更新中位延迟为35微秒(不包括刷新);查询时间(同步时的全局组装)在实现的完全遍历路径中为每次刷新$O(n)$。精确同步代价另行报告。

英文摘要

We present an algorithmic framework for incremental maintenance of first sheaf cohomology $H^1(X; \mathcal{F})$ on dynamically evolving 1-dimensional cellular complexes equipped with finite-dimensional cellular sheaves. The classical computation of $H^1$ via factorization of the coboundary matrix requires $O(n^3)$ time; when the complex evolves with a stream of $m$ edits, full recomputation after each edit costs $O(mn^3)$. Under a bounded local geometry assumption -- bounded cell size $v_{\max}$, bounded stalk dimension $d$, and bounded nerve degree $D$ -- each edit (vertex insertion, edge insertion, restriction map update) affects only a bounded set of local coboundary blocks. The algorithm therefore processes lazy streaming edits in $O(1)$ time with respect to the total complex size $n$ (with cost polynomial in the local geometry parameters $v_{\max}$, $d$, and $D$, which are treated as constants independent of $n$), deferring local eigensolves and Mayer-Vietoris global assembly to synchronization points (Flush). At synchronization, the maintained state agrees with the corresponding batch assembly of the partitioned sheaf model; we observe zero measured drift in all batch-verified runs (through $V = 10^6$). We also give an amortized $O(|E|)$ streaming construction for the cellular decomposition and discuss an adversarial algebraic-RAM barrier arguing that unpartitioned non-trivial sheaves ($d \geq 2$, non-identity restriction maps) do not admit the same locality. Experiments on Barabasi-Albert graphs with up to $5 \times 10^6$ vertices and $1.7 \times 10^7$ streaming edits show 35 $μ$s median lazy per-edit update latency (excluding flush); query time (global assembly at synchronization) is $O(n)$ per flush in the implemented full-traversal path. Exact synchronization costs are reported separately.

2605.24384 2026-06-09 cs.CL cs.AI 版本更新

Side-by-side Comparison Amplifies Dialect Bias in Language Models

并排比较加剧语言模型中的方言偏见

Kritee Kondapally, Claire J. Smerdon, Pooja C. Patel, Ogheneyoma Akoni, Jevon Torres, Jaspreet Ranjit, Matthew Finlayson, Swabha Swayamdipta

发表机构 * University of Southern California(美国南加州大学)

AI总结 本研究通过并排比较标准美式英语和非裔美国英语的推文,发现语言模型中的隐性方言偏见在对比设置下显著加剧,且显性方言偏见在安全对齐微调后仍存在。

Comments In proceeding at ACM Conference on Fairness, Accountability, and Transparency 2026

详情
Journal ref
In The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)
AI中文摘要

语言模型(LMs)可能因其方言变体而表现出偏见,即使在没有方言标签的情况下,这种行为被称为隐性方言偏见。在这项工作中,我们通过评估语言模型如何将刻板特征(源自社会心理学关于种族偏见的研究)与标准美式英语(SAE)和非裔美国英语(AAVE)中意图等效的推文相关联,来量化在线话语中的隐性方言偏见。虽然先前的研究表明,在单独评估推文时,语言模型将更多负面刻板印象与AAVE关联,但我们惊讶地发现,当SAE/AAVE推文对并排比较时,这种偏见显著加剧,这种设置更接近模型用于排名候选人的高影响力决策环境。当明确指定方言标签时,偏见只会恶化。考虑到商业开发者为了减轻其语言模型中的偏见所做的广泛努力,这一点令人震惊。令人鼓舞的是,我们表明反事实公平微调可以减轻某些刻板特征的隐性方言偏见,减少单独评估推文时的平均差异,然而,在并排评估SAE/AAVE推文时,这些改进并不一致地适用于所有特征。我们的发现表明,现有的隐性方言偏见评估设置可能低估了其严重性,特别是在对比设置中。此外,即使在安全对齐微调后,显性方言偏见仍然显著,表明它仍然是一个未解决的问题,并激励需要更稳健的评估和缓解框架。

英文摘要

Language models (LMs) can exhibit biases based on variations in their dialects, even in the absence of a dialect label, a behavior known as covert dialect bias. In this work, we quantify covert dialect bias in online discourse by evaluating how LMs associate stereotypical traits (derived from social psychology research on racial bias) with intent-equivalent tweets in Standard American English (SAE) and African-American Vernacular English (AAVE). While prior work shows that LMs associate more negative stereotypes with AAVE when evaluating tweets in isolation, we are surprised to find that this bias is significantly exacerbated when SAE / AAVE tweet pairs are compared side by side, a setting that more closely reflects high-impact decision-making contexts in which models are used to rank candidates. The bias only worsens when dialect labels are explicitly specified. This is striking, given the extensive efforts from commercial developers to mitigate bias in their LMs. Encouragingly, we show that counterfactual fairness finetuning can mitigate covert dialect bias for some stereotypical traits, reducing average disparities when evaluating tweets in isolation; however, these improvements do not consistently hold across traits when evaluating SAE / AAVE tweets side by side. Our findings show that existing evaluation settings for covert dialect bias may underestimate its severity, specifically in contrastive settings. Additionally, overt dialect bias remains pronounced even after safety-aligned finetuning, indicating that it remains an unresolved problem and motivating the need for more robust evaluation and mitigation frameworks.

2604.07349 2026-06-09 cs.CC cs.AI cs.LO 版本更新

Descent Before Hardness: Orbit-Gap Obstructions in Exact Certification

先下降,后困难性:精确认证中的轨道间隙阻碍

Tristan Simas

发表机构 * McGill University(麦吉尔大学)

AI总结 本文研究精确认证中可处理性代理的谓词层面前提:代理必须先在正确输出等价关系的商上定义谓词。提出轨道间隙作为完全阻碍,证明精确的闭包不变分类当且仅当正、负轨道壳不相交时可行,并给出若干同轨道分歧的见证;所有编号结果均已在Lean 4中形式化。

Comments Main PDF: 46 pages, 5 tables. Supplementary: 17 pages, 2 tables. Lean 4 formalization available at https://doi.org/10.5281/zenodo.19457896

详情
AI中文摘要

精确认证具有一个商结构:当两个状态具有相同的正确输出时,它们是等价的。可处理性代理必须先在这个商上定义一个谓词,然后才谈得上通常的困难性或算法问题。原始的语法代理可能在这更早的一步就失败,因为保持正确性的表示移动可能改变它们所检查的统计量,同时保持精确认证问题不变。轨道间隙是完全的阻碍:当一个闭包轨道同时包含某个目标的正表示和负表示时,就出现轨道间隙。精确的闭包不变分类当且仅当正轨道壳和负轨道壳不相交时才是可能的。当两个壳不相交时,闭包壳是最小的精确分类器;若轨道代表元可计算,该壳分类器就成为一个商层面的算法。这些都是谓词层面的结果:它们确定代理何时才真正定义了认证问题的一个性质,这一前提在逻辑上先于由此产生的恢复任务的复杂性类下界,并且有意不作为后者的替代。这种结构性传递适用于每一个固定的正确性关系,与该关系是否可在多项式时间内访问无关。在直接的有限局部情形中,局部路由测试由原始成对语法计算,三个二元成对代理族和一个偏移归一化见证表现出同轨道分歧。正面结果来自保持商的归一化、其下降谓词在布尔运算下可复合的可计算轨道目录,以及直接在正确性商上定义的谓词。该结果补充了Borchert、Stephan、Hemaspaandra和Rothe的Rice类比研究路线。所有编号结果均已在Lean 4中机械化验证;补充账目将每个断言映射到其形式化标识符。

英文摘要

Exact certification has a quotient: states are equivalent when they have the same correct outputs. A tractability proxy must first define a predicate on this quotient before ordinary hardness or algorithmic questions arise. Raw syntactic proxies can fail at that earlier step, because correctness-preserving presentation moves may change the statistics they inspect while preserving the exact-certification problem. Orbit gaps are the complete obstruction. An orbit gap occurs when one closure orbit contains both positive and negative presentations of a target. Exact closure-invariant classification is possible if and only if the positive and negative orbit hulls are disjoint. When the hulls are disjoint, the closure hull is the least exact classifier. With computable orbit representatives, this hull classifier becomes a quotient-level algorithm. These are predicate-level results: they establish when a proxy defines a property of the certification problem at all, a precondition logically prior to class lower bounds on the resulting recovery task and deliberately not a substitute for them. The structural transfer applies to every fixed correctness relation, independent of whether that relation is polynomial-time accessible. In the direct finite-local regime, where local routing tests are computed from raw pairwise syntax, three binary-pairwise proxy families and one offset-normalization witness exhibit same-orbit disagreement. Positive results arise from quotient-preserving normalizations, computable orbit catalogues whose descended predicates compose under Boolean operations, and predicates defined directly on the correctness quotient. The result complements the Rice-analog line of Borchert, Stephan, Hemaspaandra, and Rothe. All numbered results are mechanized in Lean 4; the supplementary ledger maps each claim to its formal identifier.

2507.18967 2026-06-09 cs.CV cs.AI cs.LG 版本更新

Underwater Waste Detection Using Deep Learning A Performance Comparison of YOLOv7 to 10 and Faster RCNN

利用深度学习进行水下垃圾检测:YOLOv7到YOLOv10与Faster R-CNN的性能比较

UMMPK Nawarathne, HMNS Kumari, HMLS Kumari

发表机构 * Faculty of Computing, Sri Lanka Institute of Information Technology(斯里兰卡信息技术学院计算学院) Faculty of Information Technology and Communication Sciences, Tampere University(坦佩雷大学信息技术与通信科学学院) Computing Centre, Faculty of Engineering, University of Peradeniya(佩拉德尼亚大学工程学院计算中心)

AI总结 本文比较了YOLOv7到YOLOv10及Faster R-CNN在水下垃圾检测中的性能,发现YOLOv8在低能见度和不同深度条件下表现最佳,mAP达80.9%。

Comments 7 pages, 11 figures, to be published in International Journal of Research in Computing (IJRC)

详情
Journal ref
Vol. 5 No. I (2026): International Journal of Research in Computing (IJRC)
AI中文摘要

水下污染是当今最严重的环境问题之一,全球海洋、河流和景观中发现大量垃圾。准确检测这些垃圾对废物管理、环境监测和缓解策略至关重要。本文研究了五种先进的物体识别算法,包括YOLO模型(YOLOv7、YOLOv8、YOLOv9、YOLOv10)和Faster R-CNN,以确定哪种模型在水下环境中识别材料最有效。这些模型在包含十五种不同类别的大型数据集上进行了彻底训练和测试,涵盖低能见度和不同深度等多种条件。结果显示,YOLOv8表现最佳,mAP为80.9%。这种性能提升归因于YOLOv8的架构,其包含改进的无锚机制和自监督学习,从而在各种环境中实现更精确和高效的识别。这些发现突显了YOLOv8模型在全球抗污染斗争中的潜力,提高了水下清理作业的检测能力和可扩展性。

英文摘要

Underwater pollution is one of today's most significant environmental concerns, with vast volumes of garbage found in seas, rivers, and landscapes around the world. Accurate detection of these waste materials is crucial for successful waste management, environmental monitoring, and mitigation strategies. In this study, we investigated the performance of five cutting-edge object recognition algorithms, namely four YOLO (You Only Look Once) models (YOLOv7, YOLOv8, YOLOv9, and YOLOv10) and the Faster Region-based Convolutional Neural Network (Faster R-CNN), to identify which model was most effective at recognizing materials in underwater settings. The models were thoroughly trained and tested on a large dataset containing fifteen different classes under diverse conditions, such as low visibility and variable depths. Among these models, YOLOv8 outperformed the others, with a mean Average Precision (mAP) of 80.9%. This increased performance is attributed to YOLOv8's architecture, which incorporates advanced features such as improved anchor-free mechanisms and self-supervised learning, allowing for more precise and efficient recognition of items in a variety of settings. These findings highlight the YOLOv8 model's potential as an effective tool in the global fight against pollution, improving both the detection capabilities and scalability of underwater cleanup operations.

2508.05153 2026-06-09 cs.RO cs.AI 版本更新

FCBV-Net: Category-Level Robotic Garment Smoothing via Feature-Conditioned Bimanual Value Prediction

FCBV-Net:通过特征条件双臂价值预测实现类别级机器人服装平滑

Mohammed Daba, Jing Qiu

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 本文提出FCBV-Net,通过预训练的密集几何特征条件预测双臂动作价值,提升机器人服装平滑任务的类别级泛化能力,实验显示其在未见过的服装上效率下降仅为11.5%。

Comments 9 pages, 7 figures, 1 table

详情
Journal ref
Electronics 2026, 15(11), 2468
AI中文摘要

类别级机器人服装操作(如双臂平滑)的泛化仍是一个重大难题,原因在于高维性、复杂动态和类别内变化。现有方法往往力不从心:要么因同时学习视觉特征而对特定实例过拟合,要么虽具备类别级感知泛化能力,却无法预测协同双臂动作的价值。本文提出特征条件双臂价值网络(FCBV-Net),在3D点云上操作,专门增强服装平滑的类别级策略泛化。FCBV-Net将双臂动作价值预测条件于预训练的冻结密集几何特征,确保对类别内服装变化的鲁棒性。可训练的下游组件则利用这些静态特征学习任务特定的策略。在使用CLOTH3D数据集的模拟PyFlex环境中,FCBV-Net展示了优越的类别级泛化能力。它在未见过的服装上效率仅下降11.5%(Steps80),而基于2D图像的基线下降96.2%;其最终覆盖率达到89%,优于使用相同逐点几何特征但采用固定动作基元的基于3D对应的基线(83%)。这些结果表明,将几何理解与双臂动作价值学习解耦能够实现更好的类别级泛化。代码、视频和补充材料可在项目网站获取:https://dabaspark.github.io/fcbvnet/。

英文摘要

Category-level generalization for robotic garment manipulation, such as bimanual smoothing, remains a significant hurdle due to high dimensionality, complex dynamics, and intra-category variations. Current approaches often struggle, either overfitting with concurrently learned visual features for a specific instance or, despite category-level perceptual generalization, failing to predict the value of synergistic bimanual actions. We propose the Feature-Conditioned Bimanual Value Network (FCBV-Net), operating on 3D point clouds to specifically enhance category-level policy generalization for garment smoothing. FCBV-Net conditions bimanual action value prediction on pre-trained, frozen dense geometric features, ensuring robustness to intra-category garment variations. Trainable downstream components then learn a task-specific policy using these static features. In simulated PyFlex environments using the CLOTH3D dataset, FCBV-Net demonstrated superior category-level generalization. It exhibited only an 11.5% efficiency drop (Steps80) on unseen garments compared to 96.2% for a 2D image-based baseline, and achieved 89% final coverage, outperforming the 83% coverage of a 3D correspondence-based baseline that uses identical per-point geometric features but a fixed primitive. These results highlight that decoupling geometric understanding from bimanual action value learning enables better category-level generalization. Code, videos, and supplementary materials are available at the project website: https://dabaspark.github.io/fcbvnet/.

2603.24940 2026-06-09 cs.PL cs.AI 版本更新

Evaluating adaptive and generative AI-based feedback and recommendations in a knowledge-graph-integrated programming learning system

评估基于自适应和生成式AI的反馈与推荐在知识图谱集成的编程学习系统中的效果

Lalita Na Nongkhai, Jingyun Wang, Adam Wynn, Takahiko Mendori

发表机构 * Graduate School of Engineering, Kochi University of Technology(高知工科大学大学院工学研究科) Department of Computer Science, Durham University(杜伦大学计算机科学系)

AI总结 本文提出一种整合大型语言模型与检索增强生成方法的知识图谱编程学习系统,通过实验比较三种教学模式的反馈效果与学习表现。

详情
Journal ref
Computers and Education: Artificial Intelligence, Volume 10, June 2026, 100526
AI中文摘要

本文介绍了一种整合大型语言模型(LLM)与检索增强生成(RAG)方法的框架,利用知识图谱和用户交互历史进行学习者代码评估、生成形成性反馈并推荐练习。该研究通过四个关键日志特征分析了4956次代码提交数据,发现生成式AI模式的反馈使学习者正确代码更多且缺失关键逻辑的提交更少。混合生成式AI-自适应模式在正确提交数和错误或不完整尝试数上表现最佳,优于仅自适应或仅生成式AI模式。问卷结果显示,生成式AI反馈被广泛认为有帮助,且所有模式在易用性和有用性上均获好评。

英文摘要

This paper introduces the design and development of a framework that integrates a large language model (LLM) with a retrieval-augmented generation (RAG) approach leveraging both a knowledge graph and user interaction history. The framework is incorporated into a previously developed adaptive learning support system to assess learners' code, generate formative feedback, and recommend exercises. Moreover, this study examines learner preferences across three instructional modes: adaptive, Generative AI (GenAI), and hybrid GenAI-adaptive. An experimental study was conducted to compare the learning performance and perception of the learners, and the effectiveness of these three modes, using four key log features derived from 4956 code submissions across all experimental groups. The analysis results show that learners receiving feedback from the GenAI modes had significantly more correct code and fewer code submissions missing essential programming logic than those receiving feedback from the adaptive mode. In particular, the hybrid GenAI-adaptive mode achieved the highest number of correct submissions and the fewest incorrect or incomplete attempts, outperforming both the adaptive-only and GenAI-only modes. Questionnaire responses further indicated that GenAI-generated feedback was widely perceived as helpful, while all modes were rated positively for ease of use and usefulness. These results suggest that the hybrid GenAI-adaptive mode outperforms the other two modes across all measured log features.

2508.02197 2026-06-09 cs.AI 版本更新

A Message Passing Realization of Expected Free Energy Minimization

期望自由能最小化的消息传递实现

Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries

发表机构 * Eindhoven University of Technology, 5612 AP Eindhoven, The Netherlands(埃因霍温理工大学,荷兰埃因霍温 5612 AP) GN Hearing, 5612 AB Eindhoven, The Netherlands(GN Hearing,荷兰埃因霍温 5612 AB)

AI总结 本文提出基于因子图的期望自由能最小化消息传递方法,通过将期望自由能最小化转化为带认知先验的变分自由能最小化问题,实现高效策略推断,并在存在认知不确定性的环境中验证了其有效性。

详情
Journal ref
In: International Workshop on Active Inference, pp. 69-84. Springer, Cham, 2022
AI中文摘要

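我们基于arXiv:2504.14898中提出的理论,给出了一种在因子图上进行期望自由能(EFE)最小化的消息传递方法。通过将EFE最小化重新表述为带有认知先验的变分自由能最小化,我们把一个组合搜索问题转化为可用标准变分技术求解的可处理推断问题。将该消息传递方法应用于因子化的状态空间模型,可以实现高效的策略推断。我们在具有认知不确定性的环境中评估了该方法:一个随机网格世界和一个部分可观测的Minigrid任务。在这些任务上,使用我们方法的智能体始终优于传统的KL控制智能体,在不确定性下表现出更稳健的规划和更高效的探索。在随机网格世界环境中,最小化EFE的智能体会避开有风险的路径;而在部分可观测的Minigrid设置中,它们会进行更系统的信息搜集。该方法将主动推断理论与实际实现联系起来,为认知先验在人工智能体中的效率提供了实证证据。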

英文摘要

We present a message passing approach to Expected Free Energy (EFE) minimization on factor graphs, based on the theory introduced in arXiv:2504.14898. By reformulating EFE minimization as Variational Free Energy minimization with epistemic priors, we transform a combinatorial search problem into a tractable inference problem solvable through standard variational techniques. Applying our message passing method to factorized state-space models enables efficient policy inference. We evaluate our method on environments with epistemic uncertainty: a stochastic gridworld and a partially observable Minigrid task. Agents using our approach consistently outperform conventional KL-control agents on these tasks, showing more robust planning and efficient exploration under uncertainty. In the stochastic gridworld environment, EFE-minimizing agents avoid risky paths, while in the partially observable minigrid setting, they conduct more systematic information-seeking. This approach bridges active inference theory with practical implementations, providing empirical evidence for the efficiency of epistemic priors in artificial agents.
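
For readers unfamiliar with the quantity being minimized, the sketch below evaluates the commonly used "risk plus ambiguity" form of Expected Free Energy for a single time step of a discrete model. It is a textbook-style illustration under stated assumptions, not the message passing scheme of this paper: the function name, the toy likelihoods, and the preference vector are all hypothetical, and preferred observations are assumed to have nonzero probability.

```python
import math

def expected_free_energy(q_states, likelihood, preferred_obs):
    """One-step EFE = risk + ambiguity for a discrete model (in nats).

    q_states[s]      : predicted state probabilities under a policy
    likelihood[s][o] : observation model p(o | s)
    preferred_obs[o] : prior preference over observations (all entries > 0)
    """
    n_states, n_obs = len(q_states), len(preferred_obs)
    # Predicted observation distribution q(o) = sum_s q(s) p(o|s)
    q_obs = [sum(q_states[s] * likelihood[s][o] for s in range(n_states))
             for o in range(n_obs)]
    # Risk: KL divergence between predicted and preferred observations
    risk = sum(q * math.log(q / p) for q, p in zip(q_obs, preferred_obs) if q > 0)
    # Ambiguity: expected entropy of the observation model
    ambiguity = -sum(
        q_states[s] * sum(p * math.log(p) for p in likelihood[s] if p > 0)
        for s in range(n_states)
    )
    return risk + ambiguity

prefs = [0.5, 0.5]
sharp = [[1.0, 0.0], [0.0, 1.0]]   # observations identify the state exactly
noisy = [[0.5, 0.5], [0.5, 0.5]]   # observations carry no state information

# Policy A leads to an even mix of states with a sharp sensor; policy B ends in
# one state with an uninformative sensor. Lower EFE is preferred.
g_a = expected_free_energy([0.5, 0.5], sharp, prefs)
g_b = expected_free_energy([1.0, 0.0], noisy, prefs)
print(g_a, g_b)
```

In this toy setup both policies match the preferences equally well, so the difference comes entirely from the ambiguity term, which is the epistemic drive the abstract refers to.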

2309.10370 2026-06-09 cs.LG cs.AI math-ph math.MP math.OC stat.ML 版本更新

Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization

浅层神经网络的几何结构与构造性${\mathcal L}^2$成本最小化

Thomas Chen, Patrícia Muñoz Ewald

发表机构 * Department of Mathematics, University of Texas at Austin(德克萨斯大学奥斯汀分校数学系)

AI总结 本文研究浅层ReLU网络在欠参数化情况下的成本最小化问题,通过构造上界揭示分类数据的几何结构,不依赖梯度下降。证明了成本函数最小值的上界与训练数据信噪比相关,并确定了特定子空间的构造性训练网络。

Comments AMS Latex, 29 pages. Experimental evidence added. To appear in Physica D: Nonlinear Phenomena

详情
Journal ref
Phys. D, 490, Article No. 135176 (2026)
AI中文摘要

本文通过显式构造上界,探讨欠参数化浅层ReLU网络中成本(损失)最小化问题,不使用梯度下降方法。重点在于阐明近似和精确极小值的几何结构。考虑$L^2$成本函数,输入空间$\mathbb{R}^M$,输出空间${\mathbb R}^Q$,其中$Q\leq M$,训练输入样本大小可任意大。证明了成本函数最小值的上界为$O(\delta_P)$,其中$\delta_P$衡量训练数据的信噪比。在$M=Q$的特殊情况下,显式确定了成本函数的精确退化局部极小值,并显示该精确值与$Q\leq M$时获得的上界相比,相对误差为$O(\delta_P^2)$。上界证明提供了构造性训练的网络;我们证明该网络度量了输入空间$\mathbb{R}^M$中的特定$Q$维子空间。我们还评论了在给定上下文中成本函数全局极小值的特征化问题。

英文摘要

In this paper, we approach the problem of cost (loss) minimization in underparametrized shallow ReLU networks through the explicit construction of upper bounds which appeal to the structure of classification data, without use of gradient descent. A key focus is on elucidating the geometric structure of approximate and precise minimizers. We consider an $L^2$ cost function, input space $\mathbb{R}^M$, output space ${\mathbb R}^Q$ with $Q\leq M$, and training input sample size that can be arbitrarily large. We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$ where $\delta_P$ measures the signal-to-noise ratio of training data. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(\delta_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes a particular $Q$-dimensional subspace in the input space ${\mathbb R}^M$. We comment on the characterization of the global minimum of the cost function in the given context.

2602.13271 2026-06-09 cs.AI cs.HC cs.LG 版本更新

Human-Centered Explainable AI for Security Enhancement: A Deep Intrusion Detection Framework

面向安全增强的人本可解释AI:一种深度入侵检测框架

Md Muntasir Jahid Ayan, Md. Shahriar Rashid, Tazzina Afroze Hassan, Hossain Md. Mubashshir Jamil, Mahbubul Islam, Lisan Al Amin, Rupak Kumar Das, Farzana Akter, Faisal Quader

发表机构 * Department of Computer Science and Engineering, United International University (UIU), Dhaka 1212, Bangladesh(联合国际大学(UIU)计算机科学与工程系,孟加拉国达卡 1212) Department of Electrical and Electronic Engineering, Islamic University of Technology, Gazipur 1704, Bangladesh(伊斯兰科技大学电气与电子工程系,孟加拉国加济布尔 1704) Department of Computer Science and Engineering (CSE), University of Asia Pacific (UAP), Dhaka 1207, Bangladesh(亚太大学(UAP)计算机科学与工程系(CSE),孟加拉国达卡 1207) Department of Information Systems, University of Maryland, Baltimore, 21250, Maryland, USA(马里兰大学信息系统系,美国马里兰州巴尔的摩 21250) College of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802, USA(宾夕法尼亚州立大学信息科学与技术学院,美国宾夕法尼亚州尤尼弗西蒂帕克 16802) Department of Information Technology, Washington University of Science and Technology, Alexandria, VA(华盛顿科技大学信息技术系,美国弗吉尼亚州亚历山德里亚) College of Engineering and Information Technology, University of Maryland, College Park, 20742, Maryland, USA(马里兰大学工程与信息技术学院,美国马里兰州科利奇帕克 20742)

AI总结 本文提出一种结合可解释AI的深度入侵检测框架,利用CNN和LSTM捕捉流量序列的时间依赖性,通过SHAP实现模型可解释性,提升安全分析的透明度与可靠性。

详情
AI中文摘要

随着网络威胁的复杂性和频率增加,需要准确且可解释的入侵检测系统(IDS)。本文提出了一种新颖的IDS框架,整合可解释人工智能(XAI)以增强深度学习模型的透明性。该框架在NSL-KDD基准数据集上进行实验评估,显示优于传统IDS和黑箱深度学习模型。所提方法结合卷积神经网络(CNN)和长短期记忆网络(LSTM)以捕捉流量序列的时间依赖性。深度学习结果表明,CNN和LSTM的准确率均达到0.99,其中LSTM在宏平均精度、召回率和F-1分数上优于CNN。对于加权平均精度、召回率和F-1分数,两种模型得分几乎相同。为确保可解释性,XAI模型SHapley Additive exPlanations(SHAP)被纳入,使安全分析师能够理解和验证模型决策。SHAP指出,srv_serror_rate、dst_host_srv_serror_rate和serror_rate是两个模型中的一些重要特征。我们还基于IPIP6和Big Five人格特质进行了以信任为导向的专家调查,通过交互式UI评估系统的可靠性和可用性。本工作强调了在网络安全解决方案中结合性能和透明性的潜力,并通过自适应学习推荐未来改进以实现实时威胁检测。

英文摘要

The increasing complexity and frequency of cyber-threats demand intrusion detection systems (IDS) that are not only accurate but also interpretable. This paper presents a novel IDS framework that integrates Explainable Artificial Intelligence (XAI) to enhance transparency in deep learning models. The framework is evaluated experimentally on the NSL-KDD benchmark dataset, demonstrating superior performance compared to traditional IDS and black-box deep learning models. The proposed approach combines Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks to capture temporal dependencies in traffic sequences. Our deep learning results show that both CNN and LSTM reach an accuracy of 0.99, whereas LSTM outperforms CNN in macro-average precision, recall, and F1 score. For weighted-average precision, recall, and F1 score, the two models score almost identically. To ensure interpretability, SHapley Additive exPlanations (SHAP) is incorporated, enabling security analysts to understand and validate model decisions. SHAP identifies srv_serror_rate, dst_host_srv_serror_rate, and serror_rate as notable influential features for both models. We also conduct a trust-focused expert survey based on IPIP6 and Big Five personality traits via an interactive UI to evaluate the system's reliability and usability. This work highlights the potential of combining performance and transparency in cybersecurity solutions and recommends future enhancements through adaptive learning for real-time threat detection.

2602.05027 2026-06-09 cs.SD cs.AI 版本更新

AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

AudioSAE:利用稀疏自编码器理解音频处理模型

Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich, Assel Yermekova, Laida Kushnareva, Vadim Popov, Kristian Kuznetsov, Irina Piontkovskaya

发表机构 * Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 本文在Whisper和HuBERT的编码器层训练稀疏自编码器(SAE),评估其稳定性和可解释性,并展示其在特征解耦、概念擦除、语音检测优化及与人类脑电活动对齐方面的实用价值。

Comments Accepted to EACL 2026, main track

详情
Journal ref
Proceedings of EACL 2026, pages 3221-3254
AI中文摘要

稀疏自编码器(SAE)是解释神经表征的强大工具,但它们在音频领域的应用尚未充分探索。我们在Whisper和HuBERT的所有编码器层训练SAE,对其稳定性、可解释性进行了广泛评估,并展示了其实用性。超过50%的特征在随机种子间保持一致,且重建质量得以保持。SAE特征捕获了通用声学和语义信息以及特定事件,包括环境噪声和副语言声音(如笑声、低语),并有效解耦它们,仅需移除19-27%的特征即可擦除一个概念。特征引导将Whisper的虚假语音检测降低了70%,且词错误率(WER)增加可忽略不计,展示了实际应用价值。最后,我们发现SAE特征与语音感知过程中的人类脑电活动相关,表明其与人类神经处理的对齐。代码和检查点可在https://github.com/audiosae/audiosae_demo获取。

英文摘要

Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, extensively evaluate their stability and interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g., laughter, whispering), and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper's false speech detections by 70% with a negligible increase in word error rate (WER), demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.

2601.21221 2026-06-09 cs.AI 版本更新

Causal Discovery for Explainable AI: A Dual-Encoding Approach

可解释AI中的因果发现:一种双编码方法

Henry Salgado, Meagan R. Kendall, Martine Ceberio

发表机构 * Department of Computer Science, The University of Texas at El Paso(德克萨斯大学埃尔帕索分校计算机科学系) Department of Engineering Education and Leadership, The University of Texas at El Paso(德克萨斯大学埃尔帕索分校工程教育与领导力系)

AI总结 本文提出一种双编码方法,通过互补编码策略和多数投票融合,解决传统因果发现方法在处理分类变量时的数值不稳定问题,并在泰坦尼克号数据集上验证了方法的有效性。

Comments 6 pages

详情
AI中文摘要

理解特征间的因果关系对于解释机器学习模型决策至关重要。然而,传统因果发现方法在处理分类变量时面临条件独立性测试的数值不稳定性问题。我们提出了一种双编码因果发现方法,通过互补编码策略和多数投票融合来解决这些限制。在泰坦尼克号数据集上的应用表明,该方法能够识别出与已建立的可解释方法一致的因果结构。

英文摘要

Understanding causal relationships among features is fundamental for explaining machine learning model decisions. However, traditional causal discovery methods face challenges with categorical variables due to numerical instability in conditional independence testing. We propose a dual-encoding causal discovery approach that addresses these limitations by running constraint-based algorithms with complementary encoding strategies and merging results through majority voting. Applied to the Titanic dataset, our method identifies causal structures that align with established explainable methods.

2405.07098 2026-06-09 cs.LG cs.AI math-ph math.MP math.OC stat.ML 版本更新

Interpretable global minima of deep ReLU neural networks on sequentially separable data

可解释的深度ReLU神经网络在依次可分数据上的全局极小值

Thomas Chen, Patrícia Muñoz Ewald

发表机构 * Department of Mathematics, University of Texas at Austin(德克萨斯大学奥斯汀分校数学系)

AI总结 本文通过构造零损失分类器,利用累积参数确定截断映射,研究了在小且分离的簇数据及依次线性可分等价类情况下,深度ReLU网络的全局极小值描述。

Comments AMS Latex, 31 pages, 3 figures

详情
Journal ref
J. Mach. Learn. Res., 26 (173): 1-31 (2025)
AI中文摘要

我们显式地构造了零损失神经网络分类器。我们将权重矩阵和偏置向量用累积参数表示,这些参数决定了递归作用于输入空间的截断映射。考虑的训练数据配置包括(i)足够小且彼此分离的簇对应于每个类别,以及(ii)依次线性可分的等价类。在最佳情况下,对于$\mathbb{R}^M$中的$Q$类数据,全局极小值可以用$Q(M+2)$个参数描述。

英文摘要

We explicitly construct zero loss neural network classifiers. We write the weight matrices and bias vectors in terms of cumulative parameters, which determine truncation maps acting recursively on input space. The configurations for the training data considered are (i) sufficiently small, well separated clusters corresponding to each class, and (ii) equivalence classes which are sequentially linearly separable. In the best case, for $Q$ classes of data in $\mathbb{R}^M$, global minimizers can be described with $Q(M+2)$ parameters.

2511.02469 2026-06-09 q-fin.CP cs.AI cs.MA 版本更新

Modeling Hawkish-Dovish Latent Beliefs in Multi-Agent Debate-Based LLMs for Monetary Policy Decision Classification

多智能体辩论式LLM中鹰派-鸽派隐含信念建模用于货币政策决策分类

Kaito Takano, Masanori Hirano, Kei Nakagawa

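Affiliations * Osaka Metropolitan University; Preferred Networks, Inc.

AI summary The paper proposes a multi-agent, debate-based LLM framework that models each agent's latent hawkish or dovish belief to classify monetary policy decisions, and reports higher prediction accuracy than standard LLM-based baselines.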

Comments PRIMA2025 Accepted

详情
AI中文摘要

准确预测央行政策决策,特别是美联储公开市场委员会(FOMC)的决策,在经济不确定性加剧的背景下变得尤为重要。尽管先前研究利用货币政策文本预测利率变化,但大多数方法依赖静态分类模型,忽略了政策制定的审议性质。本文提出了一种新颖的框架,通过建模多个大型语言模型(LLMs)作为交互智能体,结构上模仿FOMC的集体决策过程。每个智能体从不同的初始信念开始,并基于定性政策文本和定量宏观经济指标生成预测。通过迭代轮次,智能体通过观察其他智能体的输出修订预测,模拟审议和共识形成。为提高可解释性,我们引入一个表示每个智能体隐含信念(例如鹰派或鸽派)的隐变量,并理论证明该信念如何调解输入信息的感知和交互动态。实证结果表明,这种辩论式方法在预测准确性上显著优于标准LLM基线。此外,显式建模信念提供了关于个体视角和社会影响如何塑造集体政策预测的见解。

英文摘要

Accurately forecasting central bank policy decisions, particularly those of the Federal Open Market Committee (FOMC), has become increasingly important amid heightened economic uncertainty. While prior studies have used monetary policy texts to predict rate changes, most rely on static classification models that overlook the deliberative nature of policymaking. This study proposes a novel framework that structurally imitates the FOMC's collective decision-making process by modeling multiple large language models (LLMs) as interacting agents. Each agent begins with a distinct initial belief and produces a prediction based on both qualitative policy texts and quantitative macroeconomic indicators. Through iterative rounds, agents revise their predictions by observing the outputs of others, simulating deliberation and consensus formation. To enhance interpretability, we introduce a latent variable representing each agent's underlying belief (e.g., hawkish or dovish), and we theoretically demonstrate how this belief mediates the perception of input information and interaction dynamics. Empirical results show that this debate-based approach significantly outperforms standard LLM-based baselines in prediction accuracy. Furthermore, the explicit modeling of beliefs provides insights into how individual perspectives and social influence shape collective policy forecasts.

2510.06742 2026-06-09 cs.AI cs.LG 版本更新

MultiCNKG: Integrating Cognitive Neuroscience, Gene, and Disease Knowledge Graphs Using Large Language Models

MultiCNKG: 利用大语言模型整合认知神经科学、基因和疾病知识图谱

Ali Sarabadani, Kheirolah Rahsepar Fard

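Affiliations * Department of Computer Engineering and Information Technology, University of Qom

AI summary The paper presents MultiCNKG, a framework that uses LLMs for entity alignment, semantic similarity computation, and graph augmentation to merge the Cognitive Neuroscience Knowledge Graph, Gene Ontology, and Disease Ontology into a single knowledge graph linking genetic mechanisms, neurological disorders, and cognitive functions.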

详情
AI中文摘要

大语言模型(LLMs)的出现革新了生物医学和认知科学中知识图谱(KGs)的整合,克服了传统机器学习方法在捕捉基因、疾病和认知过程之间复杂语义联系方面的局限。我们介绍了MultiCNKG,一种创新框架,整合了三个关键知识源:包含2.9K节点和4.3K边的认知神经科学知识图谱(CNKG),涵盖9种节点类型和20种边类型;基因本体(GO)包含43K节点和75K边,涵盖3种节点类型和4种边类型;疾病本体(DO)包含11.2K节点和8.8K边,涵盖1种节点类型和2种边类型。利用LLMs如GPT-4,我们进行实体对齐、语义相似性计算和图谱增强,创建了一个连接遗传机制、神经疾病和认知功能的统一知识图谱。结果图谱包含6.9K节点,涵盖5种类型(如基因、疾病、认知过程)和11.3K边,涵盖7种类型(如因果关系、关联、调控)。评估指标如精确率(85.20%)、召回率(87.30%)、覆盖率(92.18%)、图一致性(82.50%)、新颖性检测(40.28%)和专家验证(89.50%)证实了其鲁棒性和一致性。链接预测评估显示,与TransE(MR: 391,MRR: 0.411)和RotatE(MR: 263,MRR: 0.395)等模型相比,性能与基准如FB15k-237和WN18RR相当。该图谱在个性化医学、认知障碍诊断和认知神经科学假设形成中具有应用前景。

英文摘要

The advent of large language models (LLMs) has revolutionized the integration of knowledge graphs (KGs) in biomedical and cognitive sciences, overcoming limitations in traditional machine learning methods for capturing intricate semantic links among genes, diseases, and cognitive processes. We introduce MultiCNKG, an innovative framework that merges three key knowledge sources: the Cognitive Neuroscience Knowledge Graph (CNKG) with 2.9K nodes and 4.3K edges across 9 node types and 20 edge types; Gene Ontology (GO) featuring 43K nodes and 75K edges in 3 node types and 4 edge types; and Disease Ontology (DO) comprising 11.2K nodes and 8.8K edges with 1 node type and 2 edge types. Leveraging LLMs like GPT-4, we conduct entity alignment, semantic similarity computation, and graph augmentation to create a cohesive KG that interconnects genetic mechanisms, neurological disorders, and cognitive functions. The resulting MultiCNKG encompasses 6.9K nodes across 5 types (e.g., Genes, Diseases, Cognitive Processes) and 11.3K edges spanning 7 types (e.g., Causes, Associated with, Regulates), facilitating a multi-layered view from molecular to behavioral domains. Assessments using metrics such as precision (85.20%), recall (87.30%), coverage (92.18%), graph consistency (82.50%), novelty detection (40.28%), and expert validation (89.50%) affirm its robustness and coherence. Link prediction evaluations with models like TransE (MR: 391, MRR: 0.411) and RotatE (MR: 263, MRR: 0.395) show competitive performance against benchmarks like FB15k-237 and WN18RR. This KG advances applications in personalized medicine, cognitive disorder diagnostics, and hypothesis formulation in cognitive neuroscience.
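
The link-prediction metrics quoted in the abstract, mean rank (MR) and mean reciprocal rank (MRR), follow their standard definitions, which can be sketched as below. The rank list is made-up illustration data, not taken from the paper, and the function names are hypothetical.

```python
def mean_rank(ranks: list[int]) -> float:
    """Average 1-based rank of the true entity; lower is better."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks: list[int]) -> float:
    """Average of 1/rank; higher is better and dominated by top-ranked hits."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the correct tail entity for three test triples.
ranks = [1, 2, 4]
print(f"MR  = {mean_rank(ranks):.3f}")
print(f"MRR = {mean_reciprocal_rank(ranks):.3f}")
```

Because MRR weights top ranks heavily while MR is sensitive to a few very poor ranks, the two metrics can order models differently, as with the TransE and RotatE figures reported in the abstract.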

2507.15617 2026-06-09 cs.CY cs.AI 版本更新

Why can't Epidemiology be automated (yet)?

为何流行病学无法被自动化(至今仍无法)

David Bann, Ed Lowther, Liam Wright, Yevgeniya Kovalchuk

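Affiliations * Centre for Longitudinal Studies, University College London; Centre for Advanced Research Computing, University College London

AI summary The paper maps epidemiological research tasks against current AI tools. It finds that generative AI offers efficiency gains in some areas but is constrained by limitations of existing models and of human systems, and argues that realising its potential requires two-way engagement between epidemiologists and engineers.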

Comments 9 pages, 2 figures, 1 table

详情
AI中文摘要

近期人工智能(AI)特别是生成式AI的进步为加速或自动化流行病学研究提供了新机遇。与基于物理实验的学科不同,流行病学大量依赖二次数据分析,因此非常适合此类增强。然而,仍不清楚哪些具体任务能从AI干预中受益或存在哪些障碍。当前AI能力的认知也参差不齐。本文通过现有数据集映射流行病学任务,从文献回顾到数据访问、分析、撰写和传播,识别现有AI工具在效率上的提升。尽管AI在某些领域如编码和行政任务中能提高生产力,但其效用受现有AI模型(如文献回顾中的幻觉)和人类系统(如数据集访问障碍)的限制。通过AI生成的流行病学成果示例,包括完全由AI生成的论文,表明最近开发的代理系统能设计和执行流行病学分析,但质量参差不齐(见https://github.com/edlowther/automated-epidemiology)。流行病学家有新的机会实证测试和评估AI系统;实现AI潜力需要流行病学家与工程师的双向互动。

英文摘要

Recent advances in artificial intelligence (AI) - particularly generative AI - present new opportunities to accelerate, or even automate, epidemiological research. Unlike disciplines based on physical experimentation, a sizable fraction of Epidemiology relies on secondary data analysis and thus is well-suited for such augmentation. Yet, it remains unclear which specific tasks can benefit from AI interventions or where roadblocks exist. Awareness of current AI capabilities is also mixed. Here, we map the landscape of epidemiological tasks using existing datasets - from literature review to data access, analysis, writing up, and dissemination - and identify where existing AI tools offer efficiency gains. While AI can increase productivity in some areas such as coding and administrative tasks, its utility is constrained by limitations of existing AI models (e.g. hallucinations in literature reviews) and human systems (e.g. barriers to accessing datasets). Through examples of AI-generated epidemiological outputs, including fully AI-generated papers, we demonstrate that recently developed agentic systems can now design and execute epidemiological analysis, albeit to varied quality (see https://github.com/edlowther/automated-epidemiology). Epidemiologists have new opportunities to empirically test and benchmark AI systems; realising the potential of AI will require two-way engagement between epidemiologists and engineers.

2507.15152 2026-06-09 cs.CL cs.AI cs.LG 版本更新

What Level of Automation is "Good Enough"? A Benchmark of Large Language Models for Meta-Analysis Data Extraction

什么是‘足够’的自动化水平?大型语言模型在元分析数据提取中的基准测试

Lingbo Li, Anuradha Mathrani, Teo Susnjak

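Affiliations * School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand

AI summary The paper benchmarks three LLMs on meta-analysis data extraction across three medical domains, finds that customised prompts raise recall by up to 15%, and proposes three-tiered guidelines that balance automation with expert oversight.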

详情
Journal ref
Research Synthesis Methods (2026)
AI中文摘要

自动化从全文随机对照试验(RCT)中提取数据用于元分析仍是一个重大挑战。本研究评估了三种LLM(Gemini-2.0-flash、Grok-3、GPT-4o-mini)在高血压、糖尿病和骨科三个医学领域中统计结果、偏倚风险评估和研究层面特征任务上的实际表现。我们测试了四种不同的提示策略(基本提示、自我反思提示、模型集成和定制提示)以确定如何提高提取质量。所有模型均表现出高精度,但普遍存在召回率低的问题,因遗漏关键信息。我们发现定制提示是最有效的,召回率可提升高达15%。基于此分析,我们提出了一套三层指南,根据任务复杂性和风险匹配数据类型与适当的自动化水平。本研究为现实世界中的元分析自动化数据提取提供了实用建议,通过有针对性的、任务特定的自动化平衡LLM效率与专家监督。

英文摘要

Automating data extraction from full-text randomised controlled trials (RCTs) for meta-analysis remains a significant challenge. This study evaluates the practical performance of three LLMs (Gemini-2.0-flash, Grok-3, GPT-4o-mini) across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customised prompts) to determine how to improve extraction quality. All models demonstrate high precision but consistently suffer from poor recall by omitting key information. We found that customised prompts were the most effective, boosting recall by up to 15\%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation based on task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.
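
The precision and recall pattern the abstract reports (few wrong extractions, many omissions) can be made concrete with a small sketch. The field names and the set-based scoring are illustrative assumptions; the abstract does not specify how the study scored extractions.

```python
def precision_recall(extracted: set[str], gold: set[str]) -> tuple[float, float]:
    """Score extracted items against a gold-standard set of items."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold-standard fields for one trial and a model's partial output.
gold = {"sample_size", "mean_sbp", "sd_sbp", "followup_weeks"}
extracted = {"sample_size", "mean_sbp"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here every extracted field is correct, so precision is perfect, but half of the gold fields are missing, which is the omission failure mode described above.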

2507.02606 2026-06-09 cs.SD cs.AI cs.CR cs.LG eess.AS 版本更新

De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks

De-AntiFake:重新思考对抗语音克隆攻击的保护扰动

Wei Fan, Kejiang Chen, Chang Liu, Weiming Zhang, Nenghai Yu

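Affiliations * University of Science and Technology of China

AI summary The paper systematically evaluates protective perturbations against voice cloning under threat models that include perturbation purification. It proposes a two-stage purification method, purification followed by phoneme-guided refinement, that defeats these defenses more effectively than existing purification methods.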

Comments Accepted by ICML 2025

详情
Journal ref
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267, 2025
AI中文摘要

随着语音生成模型的快速发展,语音克隆(VC)带来的隐私和安全问题日益突出。近期研究尝试通过引入对抗扰动来阻止未经授权的语音克隆,但确定性攻击者可以缓解这些保护扰动并成功执行VC。本文首次系统评估这些保护扰动在包含扰动净化的现实威胁模型下的有效性。研究发现,尽管现有净化方法能中和大量保护扰动,但仍导致VC模型特征空间的失真,影响VC性能。因此,我们提出一种新的两阶段净化方法:(1)净化扰动语音;(2)利用音素指导进行优化,使其符合干净语音分布。实验结果表明,我们的方法在破坏VC防御方面优于现有方法。本研究揭示了基于对抗扰动的VC防御的局限性,并强调了需要更鲁棒的解决方案以缓解VC带来的安全和隐私风险。代码和音频样本可在https://de-antifake.github.io获取。

英文摘要

The rapid advancement of speech generation models has heightened privacy and security concerns related to voice cloning (VC). Recent studies have investigated disrupting unauthorized voice cloning by introducing adversarial perturbations. However, determined attackers can mitigate these protective perturbations and successfully execute VC. In this study, we conduct the first systematic evaluation of these protective perturbations against VC under realistic threat models that include perturbation purification. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models, which degrades the performance of VC. From this perspective, we propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution. Experimental results demonstrate that our method outperforms state-of-the-art purification methods in disrupting VC defenses. Our study reveals the limitations of adversarial perturbation-based VC defenses and underscores the urgent need for more robust solutions to mitigate the security and privacy risks posed by VC. The code and audio samples are available at https://de-antifake.github.io.

2311.07065 2026-06-09 cs.LG cs.AI math-ph math.MP math.OC stat.ML 版本更新

On non-approximability of zero loss global ${\mathcal L}^2$ minimizers by gradient descent in Deep Learning

关于深度学习中梯度下降无法逼近零损失全局L²最小化器的非近似性

Thomas Chen, Patricia Muñoz Ewald

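Affiliations * Department of Mathematics, University of Texas at Austin

AI summary The paper analyzes geometric aspects of gradient descent in deep learning and argues that in underparametrized networks zero loss generically cannot be attained, so the distribution of training inputs must be non-generic for zero-loss minimizers to exist.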

Comments AMS Latex, 7 pages. Typos corrected, Corollary 1.6 upgraded to Theorem, acknowledgment added

详情
Journal ref
Theor. Appl. Mech., 52 (1), 67-73 (2025)
AI中文摘要

我们分析了深度学习中梯度下降算法的几何特性,并详细讨论了在欠参数化深度学习网络中,零损失最小化通常无法实现的情形。作为结果,我们得出结论:为了产生零损失最小化器,训练输入分布必须非典型,无论是对于[Chen-Munoz Ewald 2023, 2024]中构造的方法,还是对于梯度下降[Chen 2025](假设训练数据聚类)方法而言。

英文摘要

We analyze geometric aspects of the gradient descent algorithm in Deep Learning (DL), and give a detailed discussion of the circumstance that in underparametrized DL networks, zero loss minimization generically cannot be attained. As a consequence, we conclude that the distribution of training inputs must necessarily be non-generic in order to produce zero loss minimizers, both for the method constructed in [Chen-Munoz Ewald 2023, 2024] and for gradient descent [Chen 2025] (which assume clustering of training data).

2311.08957 2026-06-09 cs.RO cs.AI cs.HC 版本更新

I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots

我曾盲目但如今我看见:在社交机器人中实现视觉增强的对话

Giulio Antonio Abbo, Tony Belpaeme

Affiliations * IDLab-AIRO – Ghent University – imec

AI summary The paper presents an initial LLM-based dialogue manager for social robots that augments text prompts with real-time visual input, and reports six interactions with a Furhat robot running the system.

Comments 8 pages, 3 figures

详情
Journal ref
HRI '25: Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction. Pages 1176 - 1180
AI中文摘要

在人机交互快速发展的背景下,将视觉能力整合到对话代理中是关键进步。本文介绍了基于最新大语言模型(如GPT-4、IDEFICS)的对话管理器初始实现,通过实时视觉输入增强传统文本提示。LLMs被用于解释文本提示和视觉刺激,创建更上下文感知的对话代理。系统的提示工程结合对话和图像摘要,平衡上下文保留与计算效率。报告了与Furhat机器人进行六次交互,展示了结果并进行了讨论。通过实现这种视觉增强的对话系统,本文展望了一个未来,其中对话代理能够无缝融合文本和视觉模态,实现更丰富、更上下文感知的对话。

英文摘要

In the rapidly evolving landscape of human-computer interaction, the integration of vision capabilities into conversational agents stands as a crucial advancement. This paper presents an initial implementation of a dialogue manager that leverages the latest progress in Large Language Models (e.g., GPT-4, IDEFICS) to enhance the traditional text-based prompts with real-time visual input. LLMs are used to interpret both textual prompts and visual stimuli, creating a more contextually aware conversational agent. The system's prompt engineering, incorporating dialogue with summarisation of the images, ensures a balance between context preservation and computational efficiency. Six interactions with a Furhat robot powered by this system are reported, illustrating and discussing the results obtained. By implementing this vision-enabled dialogue system, the paper envisions a future where conversational agents seamlessly blend textual and visual modalities, enabling richer, more context-aware dialogues.

2501.12421 2026-06-09 cs.LG cs.AI q-bio.QM 版本更新

Tackling Small Sample Survival Analysis via Transfer Learning: A Study of Colorectal Cancer Prognosis

通过迁移学习解决小样本生存分析:结直肠癌预后的研究

Yonghao Zhao, Changtao Li, Chi Shu, Qingbin Wu, Hong Li, Chuan Xu, Tianrui Li, Ziqiang Wang, Zhipeng Luo, Yazhou He

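Affiliations * University of Science and Technology of China

AI summary The paper applies transfer learning to small-sample survival analysis for colorectal cancer prognosis, using pretraining and fine-tuning for DeepSurv, Cox-CC, and DeepHit and a new transfer survival forest (TSF) for Random Survival Forest. Transfer learning improved $C^{td}$ for all four models.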

详情
Journal ref
Artificial Intelligence in Medicine, 178:103426, 2026
AI中文摘要

生存预后对医疗信息学至关重要。实践者常面临小规模临床数据,尤其是癌症患者数据,难以诱导有用的生存预测模式。本文通过迁移学习解决小样本生存分析问题,提出适用于常见生存模型的迁移学习方法。对于参数模型如DeepSurv、Cox-CC和DeepHit,应用预训练和微调等标准迁移学习技术。对于非参数模型如Random Survival Forest,提出新的迁移生存森林(TSF)模型,通过转移树结构并用目标数据微调。在结直肠癌(CRC)预后中评估了迁移学习方法。源数据为27,379名SEER CRC I期患者,目标数据为728名来自西昌医院的CRC I期患者。迁移学习增强后,Cox-CC的C^{td}值从0.7868提升至0.8111,DeepHit从0.8085提升至0.8135,DeepSurv从0.7722提升至0.8043,RSF从0.7940提升至0.8297(最高性能)。所有模型在数据量仅50时训练也表现出更显著的提升。结论:因此,用于癌症预后的现有生存模型可通过适当设计的迁移学习技术得到增强和改进。本研究使用的源代码可在https://github.com/YonghaoZhao722/TSF获取。

英文摘要

Survival prognosis is crucial for medical informatics. Practitioners often confront small-sized clinical data, especially cancer patient cases, which can be insufficient to induce useful patterns for survival predictions. This study deals with small sample survival analysis by leveraging transfer learning, a useful machine learning technique that can enhance the target analysis with related knowledge pre-learned from other data. We propose and develop various transfer learning methods designed for common survival models. For parametric models such as DeepSurv, Cox-CC (Cox-based neural networks), and DeepHit (end-to-end deep learning model), we apply standard transfer learning techniques like pretraining and fine-tuning. For non-parametric models such as Random Survival Forest, we propose a new transfer survival forest (TSF) model that transfers tree structures from source tasks and fine-tunes them with target data. We evaluated the transfer learning methods on colorectal cancer (CRC) prognosis. The source data are 27,379 SEER CRC stage I patients, and the target data are 728 CRC stage I patients from the West China Hospital. When enhanced by transfer learning, Cox-CC's $C^{td}$ value was boosted from 0.7868 to 0.8111, DeepHit's from 0.8085 to 0.8135, DeepSurv's from 0.7722 to 0.8043, and RSF's from 0.7940 to 0.8297 (the highest performance). All models trained with data as small as 50 demonstrated even more significant improvement. Conclusions: Therefore, the current survival models used for cancer prognosis can be enhanced and improved by properly designed transfer learning techniques. The source code used in this study is available at https://github.com/YonghaoZhao722/TSF.
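
The concordance scores above measure how well predicted risk orders patients by survival time. The abstract reports the time-dependent variant $C^{td}$; the sketch below instead computes the simpler Harrell's C-index with one fixed risk score per patient, as an assumption-laden illustration. The function name and the four-patient data are hypothetical.

```python
def concordance_index(times: list[float], events: list[int], risks: list[float]) -> float:
    """Harrell's C: fraction of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's recorded time. Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data: survival time, event flag (1 = event, 0 = censored), risk.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risks)
print(f"C-index = {c_index:.2f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a sense of scale for the reported gains.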

2406.19493 2026-06-09 cs.CL cs.AI 版本更新

Development and Evaluation of a Retrieval-Augmented Generation Tool for Creating SAPPhIRE Models of Artificial Systems

SAPPhIRE人工系统模型创建工具的开发与评估

Anubhab Majumder, Kausik Bhattacharya, Amaresh Chakrabarti

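Affiliations * Department of Design and Manufacturing, Indian Institute of Science

AI summary The paper presents a retrieval-augmented generation tool for creating SAPPhIRE causality models of artificial systems and reports a preliminary evaluation of the factual accuracy and reliability of its output.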

Comments This paper has been accepted for presentation at the 10th International Conference on Research Into Design, 2025

详情
AI中文摘要

使用SAPPhIRE因果模型表示系统在支持设计类比方面被发现是有用的。然而,创建人工或生物系统的SAPPhIRE模型是一个耗费精力的过程,需要人类专家从多个技术文档中获取技术知识。本研究探讨如何利用大语言模型(LLMs)来创建基于SAPPhIRE因果模型的系统结构描述。本文是两项研究中的第二部分,介绍了一种新的检索增强生成(RAG)工具,用于生成与人工系统SAPPhIRE构造相关的信息,并报告了该工具初步评估的结果,重点在于结果的事实准确性和可靠性。

英文摘要

Representing systems using the SAPPhIRE causality model is found useful in supporting design-by-analogy. However, creating a SAPPhIRE model of artificial or biological systems is an effort-intensive process that requires human experts to source technical knowledge from multiple technical documents regarding how the system works. This research investigates how to leverage Large Language Models (LLMs) in creating structured descriptions of systems using the SAPPhIRE model of causality. This paper, the second part of the two-part research, presents a new Retrieval-Augmented Generation (RAG) tool for generating information related to SAPPhIRE constructs of artificial systems and reports the results from a preliminary evaluation of the tool's success - focusing on the factual accuracy and reliability of outcomes.

2407.00396 2026-06-09 cs.CL cs.AI 版本更新

A Study on Effect of Reference Knowledge Choice in Generating Technical Content Relevant to SAPPhIRE Model Using Large Language Model

基于SAPPhIRE模型因果关系的生成技术内容参考知识选择研究

Kausik Bhattacharya, Anubhab Majumder, Amaresh Chakrabarti

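Affiliations * Indian Institute of Science

AI summary The paper studies how to generate technical content relevant to the SAPPhIRE model of causality with an LLM, uses retrieval-augmented generation to suppress hallucination, and finds that the choice of reference knowledge supplied as context is very important.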

详情
AI中文摘要

使用SAPPhIRE因果关系模型表示系统可以成为设计的灵感来源。然而,创建技术或自然系统的SAPPhIRE模型需要从多个技术文档中获取系统工作原理的技术知识。本研究探讨如何利用大语言模型(LLM)生成准确的相关技术内容。本文是两部分研究中的第一部分,提出了一种使用检索增强生成方法来抑制幻觉,从而生成由相关科学信息支持的技术内容的方法。研究结果表明,用于为LLM生成技术内容提供上下文的参考知识选择非常重要。本研究的成果用于构建一个软件支持工具,以生成给定技术系统的SAPPhIRE模型。

英文摘要

Representation of systems using the SAPPhIRE model of causality can be an inspirational stimulus in design. However, creating a SAPPhIRE model of a technical or a natural system requires sourcing technical knowledge from multiple technical documents regarding how the system works. This research investigates how to accurately generate technical content relevant to the SAPPhIRE model of causality using a Large Language Model (LLM). This paper, which is the first part of the two-part research, presents a method for hallucination suppression using Retrieval-Augmented Generation with an LLM to generate technical content supported by the scientific information relevant to a SAPPhIRE construct. The result from this research shows that the selection of reference knowledge used in providing context to the LLM for generating the technical content is very important. The outcome of this research is used to build a software support tool to generate the SAPPhIRE model of a given technical system.

2312.07928 2026-06-09 eess.SP cs.AI stat.AP 版本更新

Bayesian inversion of GPR waveforms for sub-surface material characterization: an uncertainty-aware retrieval of soil moisture and overlaying biomass properties

基于GPR波形的贝叶斯反演用于 subsurface 物性表征:一种面向不确定性的土壤含水率和覆盖物性质检索方法

Ishfaq Aziz, Elahe Soltanaghai, Adam Watts, Mohamad Alipour

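Affiliations * Civil and Environmental Engineering, University of Illinois Urbana Champaign; Computer Science, University of Illinois Urbana Champaign; Pacific Wildland Fire Sciences Laboratory, United States Forest Service

AI summary The paper proposes a Bayesian model-updating approach to GPR waveform inversion that estimates the moisture content and depth of soil and of the overlaying material layer, with probabilistic uncertainty estimates. On laboratory and field data, its predictions were consistent with TDR measurements and conventional gravimetric tests.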

Comments Total 34 pages, 17 Figures. This paper under review in a journal but has not been published yet

详情
AI中文摘要

准确估计地下属性如含水率和土壤植被层深度对地下条件监测、精准农业和 wildfire 风险评估至关重要。由于土壤常被植被和有机物覆盖,其表征具有挑战性。此外,覆盖层性质的估计对 wildfire 风险评估至关重要。本文提出基于贝叶斯模型更新的GPR波形反演方法,用于预测土壤和覆盖层的含水率和深度。由于其与含水率的高相关性,所提出的方法预测了两层的介电常数,以及其他参数,包括层深度和电导率。所提出的贝叶斯模型更新方法提供了这些参数的概率估计,可提供关于估计信心和不确定性的信息。该方法通过实验室和实地调查收集的多样化实验数据进行了评估。实验室研究包括土壤含水率变化、覆盖层深度和材料粗细的变化。实地研究包括对十六天的田间土壤含水率的测量。结果表明预测与时域反射计(TDR)测量和传统重力法一致。表面层深度也可合理预测。所提出的方法为面向不确定性的地下参数估计提供了一种有前景的方法,可支持跨广泛应用的风险评估决策。

英文摘要

Accurate estimation of sub-surface properties such as moisture content and depth of soil and vegetation layers is crucial for applications spanning sub-surface condition monitoring, precision agriculture, and effective wildfire risk assessment. Soil in nature is often covered by overlaying vegetation and surface organic material, making its characterization challenging. In addition, the estimation of the properties of the overlaying layer is crucial for applications like wildfire risk assessment. This study thus proposes a Bayesian model-updating-based approach for ground penetrating radar (GPR) waveform inversion to predict moisture contents and depths of soil and overlaying material layer. Due to its high correlation with moisture contents, the dielectric permittivity of both layers was predicted with the proposed method, along with other parameters, including depth and electrical conductivity of layers. The proposed Bayesian model updating approach yields probabilistic estimates of these parameters that can provide information about the confidence and uncertainty related to the estimates. The methodology was evaluated for a diverse range of experimental data collected through laboratory and field investigations. Laboratory investigations included variations in soil moisture values, depth of the overlaying surface layer, and coarseness of its material. The field investigation included measurement of field soil moisture for sixteen days. The results demonstrated predictions consistent with time-domain reflectometry (TDR) measurements and conventional gravimetric tests. The depth of the surface layer could also be predicted with reasonable accuracy. The proposed method provides a promising approach for uncertainty-aware sub-surface parameter estimation that can enable decision-making for risk assessment across a wide range of applications.

2402.09193 2026-06-09 cs.CL cs.AI cs.HC 版本更新

(Ir)rationality and Cognitive Biases in Large Language Models

非理性与大语言模型中的认知偏差

Olivia Macmillan-Scott, Mirco Musolesi

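Affiliations * University College London; University of Bologna

AI summary The paper evaluates seven language models on tasks from the cognitive psychology literature. Like humans, the models display irrationality, but in ways that differ from human biases, and significant inconsistency across their responses adds a further layer of irrationality.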

详情
Journal ref
Royal Society Open Science 11(6) 2024
AI中文摘要

大型语言模型(LLMs)表现出理性推理吗?LLMs已被证明包含人类偏见,因为它们训练的数据中包含这些偏见;这种偏见是否反映在理性推理中尚不明确。在本文中,我们通过认知心理学文献中的任务评估了七种语言模型,以回答这个问题。我们发现,像人类一样,LLMs在这些任务中表现出非理性。然而,这种非理性表现的方式并不反映人类所展示的方式。当LLMs在这些任务中给出错误答案时,它们往往以与人类偏见不同的方式错误。此外,LLMs还揭示了响应中显著不一致性的额外非理性层。除了实验结果外,本文还希望通过展示如何评估和比较这些模型的不同能力,做出方法论上的贡献,特别是在理性推理方面。

英文摘要

Do large language models (LLMs) display rational reasoning? LLMs have been shown to contain human biases due to the data they have been trained on; whether this is reflected in rational reasoning remains less clear. In this paper, we answer this question by evaluating seven language models using tasks from the cognitive psychology literature. We find that, like humans, LLMs display irrationality in these tasks. However, the way this irrationality is displayed does not reflect that shown by humans. When incorrect answers are given by LLMs to these tasks, they are often incorrect in ways that differ from human-like biases. On top of this, the LLMs reveal an additional layer of irrationality in the significant inconsistency of the responses. Aside from the experimental results, this paper seeks to make a methodological contribution by showing how we can assess and compare different capabilities of these types of models, in this case with respect to rational reasoning.

2101.01060 2026-06-09 cs.CV cs.AI cs.MM 版本更新

Personal Privacy Protection via Irrelevant Faces Tracking and Pixelation in Video Live Streaming

通过无关面孔跟踪和像素化实现个人隐私保护在视频直播中

Jizhe Zhou, Chi-Man Pun

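Affiliations * IEEE

AI summary The paper proposes FPVLS, a two-stage frame-to-video method that automatically pixelates irrelevant faces during video live streaming, addressing the target drifting, computing efficiency, and over-pixelation problems of multi-face trackers.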

详情
Journal ref
IEEE Transactions on Information Forensics and Security, 16, 1088-1103 (2020)
AI中文摘要

截至目前,旨在保护隐私的像素化任务仍然劳动密集且尚未被深入研究。随着视频直播的普及,建立在线直播中的面部像素化机制已成为紧迫需求。本文开发了一种名为视频直播中的面部像素化(FPVLS)的新方法,以在非约束直播活动中自动生成自动个人隐私过滤。简单地应用多面部跟踪器会遇到目标漂移、计算效率和过度像素化的问题。因此,为了快速准确地对无关人员的面部进行像素化,FPVLS采用帧到视频的双阶段结构。在单帧上,FPVLS利用基于图像的面部检测和嵌入网络生成面部向量。在原始轨迹生成阶段,所提出的定位增量仿射传播(PIAP)聚类算法利用面部向量和定位信息,快速关联跨帧的同一人的面部。这样的帧级累积原始轨迹在视频级别上可能具有间断性和不可靠性。因此,我们进一步引入轨迹细化阶段,该阶段结合提案网络和基于经验似然比(ELR)统计量的两样本测试,以细化原始轨迹。在细化轨迹上应用高斯滤波器以最终实现像素化。在我们收集的视频直播数据集上,FPVLS获得了令人满意的准确性、实时效率,并且包含过度像素化问题。

英文摘要

To date, privacy-protection pixelation tasks remain labor-intensive and have yet to be studied in depth. With the prevalence of video live streaming, an online face pixelation mechanism for use during streaming is urgently needed. In this paper, we develop a new method called Face Pixelation in Video Live Streaming (FPVLS) to provide automatic personal privacy filtering during unconstrained streaming activities. Simply applying multi-face trackers runs into problems with target drifting, computing efficiency, and over-pixelation. Therefore, for fast and accurate pixelation of irrelevant people's faces, FPVLS is organized in a frame-to-video structure of two core stages. On individual frames, FPVLS utilizes image-based face detection and embedding networks to yield face vectors. In the raw trajectories generation stage, the proposed Positioned Incremental Affinity Propagation (PIAP) clustering algorithm leverages face vectors and position information to quickly associate the same person's faces across frames. Such frame-wise accumulated raw trajectories are likely to be intermittent and unreliable at the video level. Hence, we further introduce the trajectory refinement stage that merges a proposal network with the two-sample test based on the Empirical Likelihood Ratio (ELR) statistic to refine the raw trajectories. A Gaussian filter is applied to the refined trajectories for final pixelation. On the video live streaming dataset we collected, FPVLS achieves satisfactory accuracy and real-time efficiency, and curbs the over-pixelation problem.