arXivDaily arXiv每日学术速递 周一至周五更新
2605.21486 2026-05-21 cs.LG cond-mat.dis-nn cs.AI stat.ML 版本更新

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

量化超参数迁移与嵌入层学习率的重要性

Dayal Singh Kalra, Maissam Barkeshli

发表机构 * Department of Physics, University of Maryland, College Park(马里兰大学物理系) Department of Computer Science, University of Maryland, College Park(马里兰大学计算机科学系) Joint Quantum Institute, University of Maryland, College Park(马里兰大学联合量子研究所) Meta Superintelligence Labs, Fundamental AI Research(Meta超智能实验室,基础人工智能研究)

AI总结 本文研究了超参数迁移的量化方法,通过三种指标评估超参数迁移的质量,发现Maximal Update(μP)参数化在训练中通过最大化嵌入层学习率提升了超参数迁移质量,而权重衰减虽改善了缩放定律拟合,但会降低外推鲁棒性。

Comments 10+28 pages, 5+17 figures

详情
AI中文摘要

超参数迁移允许从小规模到大规模模型中外推最优优化超参数,这对于训练大型语言模型(LLMs)至关重要。这可以通过拟合缩放定律或通过精心选择参数化方式(如Maximal Update(μP))来实现,使最优超参数近似规模不变。本文首先开发了一个框架,通过三个指标量化超参数迁移:(1)缩放定律拟合的质量,(2)对外推误差的鲁棒性,以及(3)由于参数化选择导致的渐近损失惩罚。接着,通过一系列全面的消融实验,探讨了为何μP相对于标准参数化(SP)在训练AdamW时提供高质量的学习率迁移,因为现有理论不足。我们发现,μP相对于SP的主要优势在于最大化嵌入层学习率。在SP中,嵌入层学习率充当瓶颈,导致训练不稳定性;将其增加到宽度的倍数以匹配μP,可显著平滑训练并提高超参数迁移质量。此外,权重衰减改善了缩放定律拟合,但在固定token-per-parameter设置下会损害外推的鲁棒性。

英文摘要

Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($μ$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $μ$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $μ$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $μ$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.

2605.21482 2026-05-21 cs.AI 版本更新

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

DeepWeb-Bench: 一个要求大规模跨源证据和长周期推导的深度研究基准

Sixiong Xie, Zhuofan Shi, Haiyang Shen, Jiuzheng Wang, Siqi Zhong, Mugeng Liu, Chongyang Pan, Peilun Jia, Baoqing Sun, Xiang Jing, Yun Ma

发表机构 * Peking University(北京大学)

AI总结 本文提出DeepWeb-Bench基准,通过要求大规模证据收集、跨源验证和长周期推导,评估前沿语言模型在深度研究任务中的能力,揭示检索并非瓶颈,强弱模型失败方式不同,且模型在不同领域表现出专业性。

Comments Work in Progress. 27 pages, 10 figures, 4 tables. Project page: https://sixiongxie1001-dot.github.io/deep-research-benchmark2.0

详情
AI中文摘要

深度研究,即一个智能体在开放网络上搜索、收集证据并通过扩展推理得出答案,是前沿语言模型的重要应用场景。前沿深度研究产品在现有基准上表现优异,难以通过现有评估数据单独区分其能力。我们引入DeepWeb-Bench,一个比现有基准更难的深度研究基准。难度来源于数据本身的三个特性:每个任务需要大规模证据收集、跨源验证和长周期多步骤推导。我们将这三个难度来源表示为四个能力家族(检索、推导、推理和校准),并按家族报告结果。每个参考答案都附有带有四个披露级别和可用跨源检查的来源证明记录,使评分更容易审计底层证据。我们在九个前沿模型上评估DeepWeb-Bench,并报告三个发现:(1)检索不是瓶颈,因为检索失败仅占12-14%的错误,而推导和校准失败占超过70%;(2)强弱模型以不同方式失败,强模型的错误主要由不完整推导引起,弱模型的错误主要由幻觉精度引起;(3)模型在不同领域表现出真正的专业性,跨模型一致度仅为rho=0.61,每案例分歧达到18.8个百分点。公开的基准发布包括数据、评分标准和评估代码。

英文摘要

Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available, making scores easier to audit against the underlying evidence. We evaluate DeepWeb-Bench on nine frontier models and report three findings: (1) retrieval is not the bottleneck, as retrieval failures account for only 12-14% of errors while derivation and calibration failures account for over 70%; (2) strong and weak models fail in qualitatively different ways, with strong models' errors dominated by incomplete derivation and weak models' by hallucinated precision; and (3) models exhibit genuine specialization across domains, with cross-model agreement of only rho = 0.61 and per-case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, rubrics, and evaluation code.

2605.21481 2026-05-21 cs.AI cs.CL cs.LG 版本更新

AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

AiraXiv:一个面向人类和AI科学家的AI驱动的开放获取平台

Junshu Pan, Panzhong Lu, Yixuan Weng, Qiyao Sun, Fang Guo, Zijie Yang, Qiji Zhou, Yue Zhang

发表机构 * Westlake University(西湖大学) Zhejiang University(浙江大学) Shanghai Innovation Institution(上海创新研究院) Zhongguancun Academy(中关村学院)

AI总结 本文提出AiraXiv平台,通过AI驱动的开放预印本、AI增强的分析与评审以及读者反馈,解决传统学术出版系统在AI时代面临的研究产出增长和可扩展性挑战。

详情
AI中文摘要

近年来,人工智能(AI)的进步加速了人类和AI生成的研究产出的增长,对传统学术出版系统施加了越来越大的压力,并在提交量增加、评审工作量和会议规模扩大时挑战了以会议和期刊为中心的可扩展性。为了解决这些挑战,我们探索了一个AI时代的出版范式,其中人类和AI科学家作为作者和读者参与,并通过持续反馈驱动的迭代使论文不断发展。我们提出了AiraXiv,一个基于开放预印本、AI增强的分析和评审以及读者反馈的AI驱动的开放获取平台。AiraXiv通过交互式UI支持人类科学家,通过基于模型上下文协议(MCP)的交互支持AI科学家。通过实际部署验证了AiraXiv,包括作为IC AIS 2025的提交平台,展示了其作为AI时代快速、包容和可扩展的研究基础设施的潜力。AiraXiv在https://airaxiv.com上公开可用。

英文摘要

Recent advances in artificial intelligence (AI) have accelerated the growth of both human-authored and AI-generated research outputs, placing increasing strain on traditional academic publishing systems and challenging the scalability of conference- and journal-centered paradigms amid rising submission volumes, reviewer workload, and venue size. To address these challenges, we explore an AI-era publishing paradigm in which both human and AI scientists participate as authors and readers, and papers evolve through continuous, feedback-driven iteration. We propose AiraXiv, an AI-driven open-access platform built on open preprints, AI-augmented analysis and review, and reader feedback. AiraXiv supports human scientists through an interactive UI and AI scientists through Model Context Protocol (MCP)-based interactions. We validate AiraXiv through real-world deployments, including serving as the submission platform for ICAIS 2025, demonstrating its potential as a fast, inclusive, and scalable research infrastructure for the AI era. AiraXiv is publicly available at https://airaxiv.com.

2605.21479 2026-05-21 cs.CV cs.AI 版本更新

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

WikiVQABench: 一个基于维基百科和维基数据的知识引导视觉问答基准

Basel Shbita, Pengyuan Li, Anna Lisa Gentile

发表机构 * IBM Research San Jose(IBM桑 Jose研究实验室)

AI总结 本文提出WikiVQABench,一个结合维基百科图片、文章描述和维基数据结构化知识的知识引导视觉问答基准,通过大规模语言模型生成候选多选题,并由人工审核确保事实正确性和视觉-文本一致性,评估多种视觉-语言模型在知识密集型推理中的性能。

详情
AI中文摘要

视觉问答(VQA)基准大多强调基于感知的任务,这些任务可以通过单独的视觉内容解决。相比之下,许多现实场景需要外部知识来正确回答,而这些知识无法直接从图像中观察到。我们介绍了WikiVQABench,一个由系统结合维基百科图片、其相关文章描述和来自维基数据的结构化知识构建的人工整理的知识引导VQA基准。我们的流程使用大规模语言模型(LLMs)生成候选多选图像-问题-答案集。所有生成的实例随后由人工标注者审核,以确保事实正确性、视觉-文本一致性以及每个问题需要外部知识,除了视觉证据外,才能正确解决。WikiVQABench包含大量维基百科图片和经过整理的多选问题,旨在基准测试知识意识的视觉-语言模型(VLMs)。对十五种VLMs(256M-90B参数)的评估显示了广泛的性能范围(24.7%-75.6%准确率),表明该基准能够有效区分模型在知识密集型推理中的能力。数据集和基准测试代码已公开。

英文摘要

Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence for correct resolution. WikiVQABench comprises a substantial collection of Wikipedia images with curated multiple-choice questions designed to benchmark knowledge-aware vision-language models (VLMs). Evaluation of fifteen VLMs (256M-90B parameters) reveals a wide performance range (24.7%-75.6% accuracy), demonstrating that the benchmark effectively discriminates model capabilities on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.

2605.21463 2026-05-21 cs.CL cs.AI 版本更新

Mem-$π$: Adaptive Memory through Learning When and What to Generate

Mem-$π$: 通过学习何时以及生成什么来实现自适应记忆

Xiaoqiang Wang, Chao Wang, Hadi Nekoei, Christopher Pal, Alexandre Lacoste, Spandana Gella, Bang Liu, Perouz Taslakian

发表机构 * ServiceNow AI Research(ServiceNow AI研究院) Mila -- Quebec AI Institute(魁北克AI研究所) McGill University(麦吉尔大学) CIFAR AI Chair(CIFAR人工智能主席)

AI总结 Mem-$π$ 通过学习在何时以及生成什么来实现自适应记忆,利用专门的语言或视觉-语言模型生成上下文特定的指导,从而在多种代理任务中优于基于检索和先前RL优化的记忆基线。

Comments Work in progress

详情
AI中文摘要

我们提出了Mem-$π$,一种用于大语言模型(LLM)代理的自适应记忆框架,其中有用的指导是按需生成而非从外部内存存储中检索。现有的记忆增强代理通常依赖于从事件记忆库或技能库中基于相似性的检索,返回静态条目,这些条目往往与当前上下文不一致。相比之下,Mem-$π$ 使用一个具有自身参数的专用语言或视觉-语言模型,与下游代理分开,以生成复杂任务的上下文特定指导。在当前代理上下文中,模型联合决定何时生成指导以及生成什么指导。我们通过决策-内容解耦的强化学习(RL)目标对其进行训练,使其能够避免在生成不会有所帮助的情况,并在其他情况下生成简洁有用的信息。在涵盖网页导航、基于终端的工具使用和基于文本的具身交互等多样代理基准上,Mem-$π$ 一致优于基于检索和先前RL优化的记忆基线,实现网页导航任务超过30%的相对提升。

英文摘要

We present Mem-$π$, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory-augmented agents typically rely on similarity-based retrieval from episodic memory banks or skill libraries, returning static entries that often misalign with the current context. In contrast, Mem-$π$ uses a dedicated language or vision-language model with its own parameters, separate from the downstream agent, to generate context-specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. We train it with a decision-content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance. Across diverse agentic benchmarks spanning web navigation, terminal-based tool use, and text-based embodied interaction, Mem-$π$ consistently outperforms retrieval-based and prior RL-optimized memory baselines, achieving over 30% relative improvement on web navigation tasks.

2605.21460 2026-05-21 cs.RO cs.AI cs.HC 版本更新

HITL-D: Human In The Loop Diffusion Assisted Shared Control

HITL-D: 有人参与的扩散辅助共享控制

Riley Zilka, Sergey Khlynovskiy, Allie Wang, Martin Jagersand

发表机构 * Department of Computing Science, University of Alberta(阿尔伯塔大学计算机科学系)

AI总结 本文提出HITL-D框架,通过结合扩散策略和人类控制,提升多步骤、插入和精细操作任务的用户表现,减少 joystick 控制轴数量,降低认知负荷,并在多任务用户研究中显著提高任务完成速度和用户满意度。

Comments Accepted for presentation at ICRA 2026

详情
AI中文摘要

自主操作系统已展现出显著能力,但将人类专业知识与基于扩散的策略结合在共享控制中仍较为不成熟。本文提出人类在环扩散(HITL-D),一种共享控制框架,通过结合扩散策略和人类控制,提供基于场景点云和末端执行器笛卡尔位置的自主末端执行器方向更新。该方法减少了所需joystick控制轴的数量,从而降低认知负荷。在12名参与者的多任务用户研究中,HITL-D将平均任务完成时间减少了40%,降低了37%的感知负荷,并在独立性、直观性和信心等李克特量表评分上优于传统遥控方法。这些结果表明,HITL-D有效整合了人类专业知识与自主协助,提高了遥控的客观和主观方面。

英文摘要

Autonomous manipulation systems have achieved remarkable capabilities, yet the integration of human expertise with diffusion-based policies in shared control remains relatively unexplored. In this paper, we propose Human-In-The-Loop Diffusion (HITL-D), a shared control framework that enhances user performance in multi-step, insertion, and fine manipulation tasks. HITL-D leverages a novel combination of diffusion-based policies and human control to provide autonomous end effector orientation updates conditioned on a scene point cloud and the Cartesian position of the end effector. This approach reduces the number of joystick control axes required, thereby lowering mental workload. In a multi-task user study with 12 participants, HITL-D reduced average task completion times by 40%, decreased perceived workload by 37%, and improved Likert-scale ratings for independence, intuitiveness, and confidence compared to traditional teleoperation methods. These results demonstrate that HITL-D effectively integrates human expertise with autonomous assistance, improving both objective and subjective aspects of teleoperation.

2605.21458 2026-05-21 cs.AI cs.LG stat.ME 版本更新

Mind the Sim-to-Real Gap & Think Like a Scientist

注意仿真到现实的差距并像科学家一样思考

Harsh Parikh, Gabriel Levin-Konigsberg, Dominique Perrault-Joncas, Alexander Volfovsky

发表机构 * Amazon SCOT(亚马逊SCOT团队) Yale University(耶鲁大学) Duke University(杜克大学)

AI总结 本文研究了在仿真和现实之间如何补充实验以减少价值差距,提出了Fisher-SEP方法,并通过两个案例研究展示了其应用。

详情
AI中文摘要

假设有规划者拥有一个预先训练的序列决策问题的仿真器,并有机会在现实中进行实验。仿真器查询成本低,但继承了校准数据中的混杂因素和漂移。实验是无偏的,但每次试验消耗一个现实单位。我们研究了规划者何时以及如何补充仿真器进行实验。我们给出了三个结果。首先,扩展的仿真引理将仿真器的价值误差分解为校准-部署偏移,该偏移可以随机化识别,以及一个参数残差,无法通过进一步交互减少。第二,仿真器最优策略与最优解之间的价值差距分为局部部分,这部分在部署策略已访问的状态上,以及可达性部分,这部分在部署策略未访问的状态上。在纯被动学习下,可达性部分在任何时间范围内都保持远离零。第三,我们提出了Fisher-SEP,一种辅助仿真的实验策略(SEP),该策略最小化目标策略价值的后验预测方差,具有仅奖励和仅转换的特殊化版本。两个案例研究展示了这些制度。在自动售货机供应链中,前端实验在时间范围足够长以抵消试点成本后超过后验更新。在HIV移动测试示例中,有一个走廊将一个受监控区域与一个受监控较差的区域分开,只有设计的探索才能到达受监控较差的区域。

英文摘要

Suppose a planner has a pre-trained simulator of a sequential decision problem and the option to run real experiments in the field. The simulator is cheap to query but inherits confounding and drift from its calibration data. Experimentation is unbiased but consumes one real unit per trial. We study when, and how, the planner should supplement the simulator with experiments. We give three results. First, an extended simulation lemma decomposes the simulator's value error into a calibration--deployment shift that randomization can identify and a parametric residual that no further interaction can reduce. Second, the value gap between the simulator-optimal policy and the optimum splits into a local component, on states the deployed policy already visits, and a reachability component, on states it does not. The reachability component stays bounded away from zero at any horizon under purely passive learning. Third, we propose Fisher-SEP, a simulation-aided experimental policy (SEP) that minimizes the posterior predictive variance of a target policy's value, with reward-only and transition-only specializations. Two case studies illustrate the regimes. In a vending-machine supply chain, front-loaded experimentation overtakes posterior updating once the horizon is long enough to amortize the pilot. In an HIV mobile-testing example with a corridor that separates a well-surveilled region from a poorly-surveilled one, only designed exploration reaches the poorly-surveilled region.

2605.21453 2026-05-21 cs.SE cs.AI 版本更新

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

AI生成Python重构拉取请求中的质量和安全信号

Mohamed Almukhtar, Anwar Ghammam, Hua Ming

发表机构 * University of Michigan-Flint(密歇根大学弗林特分校) University of Michigan-Dearborn(密歇根大学戴尔本分校)

AI总结 本研究通过分析AIDev数据集中的Python重构拉取请求,探讨了AI生成代码对代码质量和安全性的影响,发现AI提交在22.5%的案例中提升了质量属性,但同时也引入了新的代码问题,提出了24种重构操作的分类和安全门控的重要性。

详情
AI中文摘要

随着AI代理在代码开发和维护中的作用日益增强,关于其在真实项目中变更的质量和风险特征仍缺乏实证证据,特别是针对重构类贡献。为了填补这一空白,我们对AIDev数据集中的Python重构拉取请求进行了实证研究。我们使用基于机器学习的质量评估工具PyQu分析代理重构拉取请求,以量化五个质量属性的变化,并通过领域无关的静态分析(Pylint和Bandit)来测量每次更改前后代码质量和安全问题。我们的结果表明,平均而言,代理提交在22.5%的案例中提升了质量属性,其中可用性提升最频繁(36.5%)。同时,24.17%的修改文件引入了新的Pylint问题,主要为约定层面的违规(如长行),而4.7%引入了新的Bandit发现。从观察到的差异中,我们推导出24种反复出现的更改操作,并将其映射到最常影响的lint和安全发现。尽管这些混合结果,开发者接受度很高:73.5%的分析拉取请求被合并,包括引入新lint或安全发现的案例,通常伴随现有问题的移除。总体而言,这些发现突显了代理重构的潜力和当前限制,并推动了更强的工具在循环中质量与安全门控,以应对AI驱动的开发工作流。

英文摘要

As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code quality and security issues before and after each change. Our results show that, on average, agentic commits improve a quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues predominantly convention level violations such as long lines-while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security findings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows.

2605.21451 2026-05-21 cs.LG cond-mat.dis-nn cs.AI cs.NE 版本更新

Approximation Theory for Neural Networks: Old and New

神经网络的近似理论:旧与新

Soumendu Sundar Mukherjee, Himasish Talukdar

AI总结 本文综述了神经网络近似理论的发展,包括传统单隐层网络的密度结果、量化误差界限以及深度-宽度权衡,还探讨了Kolmogorov-Arnold网络等新架构的理论性质。

Comments 31 pages, 4 figures

详情
AI中文摘要

通用近似定理为神经网络的表达能力提供了数学解释。它们断言,在激活函数的温和条件下,前馈神经网络在广泛的函数类中是密集的,例如实数空间$\mathbb{R}^d$的紧致子集上的连续函数、$L^p$空间或Sobolev空间。在过去四十年里,这些定性的一般性结果已发展成丰富的定量理论,涉及近似速率、参数效率以及深度和宽度等架构特征的作用。本文综述了该理论的几个方面。我们回顾了单隐层网络的经典密度结果,以及将近似误差与网络大小和目标函数的光滑性假设联系起来的量化界限。特别强调了深度-宽度权衡以及证明更深层次架构在结构函数类中可实现更高参数效率的结果。除了标准前馈神经网络外,我们还回顾了Kolmogorov-Arnold网络(KANs)等近期发展的理论性质。

英文摘要

Universal approximation theorems provide a mathematical explanation for the expressive power of neural networks. They assert that, under mild conditions on the activation function, feedforward neural networks are dense in broad function classes, such as continuous functions on compact subsets of $\mathbb{R}^d$, $L^p$ spaces, or Sobolev spaces. Over the past four decades, these qualitative universality results have evolved into a rich quantitative theory addressing approximation rates, parameter efficiency, and the role of architectural features such as depth and width. This survey presents several glimpses into this theory. We review classical density results for single-hidden-layer networks, as well as quantitative bounds that relate approximation error to network size and smoothness assumptions on target functions. Particular emphasis is placed on depth--width trade-offs and on results demonstrating that deeper architectures can achieve superior parameter efficiency for structured function classes. In addition to standard feedforward neural networks, we also review recent developments on Kolmogorov--Arnold Networks (KANs), which offer an alternative architectural paradigm and whose approximation-theoretic properties have begun to attract significant theoretical attention.

2605.21443 2026-05-21 cs.CV cs.AI 版本更新

TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos

TempGlitch: 评估视觉-语言模型在游戏视频中检测时间故障的能力

Yakun Yu, Ashley Wiens, Adrián Barahona-Ríos, Benedict Wilkins, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer

发表机构 * University of Alberta(阿尔伯塔大学) Sony Interactive Entertainment(索尼互动娱乐)

AI总结 本文提出TempGlitch基准测试,用于评估视觉-语言模型在游戏视频中检测时间故障的能力,发现现有模型在处理时间故障时表现不佳,且更密集的帧采样和更大的模型尺寸并不能有效解决这些问题。

详情
AI中文摘要

视觉-语言模型(VLMs)正被越来越多地探索用于视频游戏质量保证,特别是游戏故障检测。然而,大多数现有评估将故障视为静态视觉异常,要求模型从单个帧中检测故障。我们主张这种框架忽略了关键区别:一些故障是空间性的,在孤立帧中可见,而另一些是时间性的,只有通过连续帧的变化才能显现。初步研究证实了这一差距,显示时间故障对VLMs的检测比空间故障要困难得多。为系统评估这一未被充分探索的设置,我们引入了TempGlitch,一个受控的游戏视频基准测试,用于时间故障检测。TempGlitch涵盖五种时间故障类型,每类样本平衡,同时配有配对的无故障视频,以实现可靠的二元评估。我们评估了12个专有和开源的VLMs,在多个帧采样设置下。我们的结果表明,当前VLMs在TempGlitch上仍接近随机猜测,通常会陷入过于保守的行为,错过大多数故障,或过于敏感的行为,将干净的视频标记为有故障。此外,更密集的帧采样和更大的模型尺寸并不能可靠地解决这些失败。TempGlitch为时间推理、稳健的游戏理解以及自动化故障检测提供了专注的测试平台。代码和数据可在项目网站上获得。

英文摘要

Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are temporal and become evident only through changes across ordered frames. A preliminary study confirms this gap, showing that temporal glitches are substantially harder for VLMs to detect than spatial ones. To enable systematic evaluation of this underexplored setting, we introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection. TempGlitch covers five temporal glitch types with balanced per-category samples, together with paired glitch-free videos that enable reliable binary evaluation. We evaluate 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Our results show that current VLMs remain near chance on TempGlitch, often collapsing into either overly conservative behavior that misses most glitches or overly sensitive behavior that flags clean videos as glitchy. Moreover, denser frame sampling and larger model size do not reliably resolve these failures. TempGlitch provides a focused testbed for temporal reasoning, robust gameplay understanding, and automated glitch detection with VLMs. Code and data are available at the project website.

2605.21442 2026-05-21 cs.LG cs.AI 版本更新

torchtune: PyTorch native post-training library

torchtune: 一种基于PyTorch的后训练库

Mark Obozov, Maxime Griot, Joseph Cummings, Evan Smothers, Felipe Mello, Rafi Ayub, Philip John Bontrager, Salman Mohammadi, Ariel Kwiatkowski, Nathan Azrak, Mircea Mironenco

发表机构 * PyTorch Meta Stanford(斯坦福) Meta-FAIR

AI总结 本文介绍了torchtune,一种基于PyTorch的后训练库,旨在简化大语言模型的后训练生命周期,提供高效的微调、实验和部署流程,通过模块化和可扩展性提升性能和灵活性。

Comments 14 pages

详情
AI中文摘要

现代大语言模型通常需要多阶段训练流水线才能实现强大的下游性能,后训练是适应开放式模型的主要接口。我们介绍了torchtune,一种基于PyTorch的库,旨在简化大语言模型的后训练生命周期,使微调、实验和面向部署的工作流程更加高效。与许多现有的微调框架不同,这些框架往往在易用性、专用食谱或硬件效率方面进行优化,而牺牲了透明性和扩展性,torchtune强调模块化、可修改性和对底层PyTorch组件的直接访问。在本文中,我们阐述了torchtune的设计原则,描述了这些原则如何体现在其模型构建器、训练食谱和分布式训练堆栈中,并在具有代表性的后训练设置中评估了该库。我们对比了流行的微调框架,包括Axolotl和Unsloth,并展示了torchtune在许多设置中提供了强大的性能和内存效率,同时保持足够的灵活性以支持快速的研究迭代。这些结果将torchtune定位为可重复的大语言模型后训练研究的实用基础。

英文摘要

Modern LLMs typically require multistage training pipelines to achieve strong downstream performance, with post-training serving as the main interface for adapting open-weight models. We introduce torchtune, a PyTorch-native library designed to streamline the post-training lifecycle of LLMs, enabling efficient fine-tuning, experimentation, and deployment-oriented workflows. Unlike many existing fine-tuning frameworks, which often optimize for ease of use, specialized recipes, or hardware efficiency at the cost of transparency and extensibility, torchtune emphasizes modularity, hackability, and direct access to the underlying PyTorch components. In this paper, we present the design principles behind torchtune, describe how they are reflected in its model builders, training recipes, and distributed training stack, and evaluate the library across representative post-training settings. We compare against popular fine-tuning frameworks, including Axolotl and Unsloth, and show that torchtune provides strong performance and memory efficiency across many settings while remaining flexible enough for rapid research iteration. These results position torchtune as a practical foundation for reproducible LLMs post-training research.

2605.21427 2026-05-21 cs.AI cs.DC 版本更新

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

PALS: 为混合专家模型的功率感知LLM服务

Can Hankendi, Rana Shahout, Minlan Yu, Ayse K. Coskun

发表机构 * Boston University(波士顿大学) Harvard University School of Engineering(哈佛大学工程与应用科学学院) Harvard University(哈佛大学)

AI总结 本文提出PALS,一种功率感知的LLM服务运行时,通过将GPU功率上限作为可控制的参数与软件参数如批大小联合优化,提升能效并减少在功率限制下的服务质量违规。

Comments 13 pages, 10 figures

详情
AI中文摘要

大型语言模型(LLM)推理已成为现代数据中心的主要工作负载,推动了显著的GPU利用率和能耗。尽管先前的系统通过批处理、调度和并行化来优化吞吐量和延迟,但它们大多将GPU功率视为静态约束而非可控资源。在本文中,我们提出了一种功率感知的LLM服务运行时PALS,将GPU功率上限作为第一控制参数,并与软件参数如批大小联合优化。该系统结合了轻量级的离线功率-性能模型和反馈驱动的控制器,以选择满足吞吐量目标同时最大化能效的配置。我们将在现有的LLM服务框架vLLM中实现PALS,证明其不需要模型重训练或API更改。在多GPU系统和密集型及混合专家(MoE)模型上,PALS将能效提高高达26.3%,在功率限制下将服务质量违规减少4到7倍,并跟踪动态功率预算。这些结果突显了将功率控制直接集成到LLM推理运行时的潜力,从而实现能效比例和电网交互的AI系统。

英文摘要

Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior systems optimize throughput and latency by batching, scheduling, and parallelism, they largely treat GPU power as a static constraint rather than a controllable resource. In this paper, we present a power-aware runtime for LLM serving, PALS, that treats GPU power caps as a first-class control knob and jointly optimizes them with software parameters such as batch size. The system combines lightweight offline power-performance models with a feedback-driven controller to select configurations that satisfy throughput targets while maximizing energy efficiency. We implement PALS within an existing LLM serving framework, vLLM, demonstrating that it requires no model retraining or API changes. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to 7x under power constraints, and tracks dynamic power budgets. These results highlight the potential of integrating power control directly into LLM inference runtimes, enabling energy-proportional and grid-interactive AI systems.

2605.21420 2026-05-21 cs.LG cs.AI q-bio.MN 版本更新

HiRes: Inspectable Precedent Memory for Reaction Condition Recommendation

HiRes: 反应条件推荐的可检查先例记忆

Shreyas Vinaya Sathyanarayana, Raja Sekhar Pappala, Deepak Warrier

发表机构 * Mstack AI

AI总结 HiRes通过结合图编码器、变换感知交叉注意力、多流反应融合和k-NN检索层,实现了反应条件推荐的高准确率和可解释性,其在催化剂、溶剂和试剂的Top-1准确率分别达到0.929、0.534和0.530,优于现有方法。

详情
AI中文摘要

反应条件推荐紧接在 retrosynthetic disconnection 选择之后,实际应用中化学家需要准确的预测以及支持这些预测的先例。我们提出了HiRes(分层反应表示),这是一种检索增强的条件推荐系统,其学习的反应空间同时作为分类特征和可检查的先例记忆。模型结合了图编码器、变换感知交叉注意力、多流反应融合和k-NN检索层。HiRes在主要槽位USPTO-Condition模型中达到最先进的性能,分别在催化剂、溶剂和试剂的Top-1准确率(Acc@1)为0.929、0.534和0.530。它与最佳报告的基线在催化剂上持平,但在溶剂和试剂上优于REACON等模型。此外,配对bootstrap分析表明,将检索与学习的条件头部结合,为溶剂和试剂选择提供了统计上显著的优势,优于纯参数方法。最终,HiRes在预测准确性和化学可解释性之间架起桥梁,提供了一个单一的表示,既能提供具有竞争力的推荐,又能提供实际合成计划所需的具体化学先例。

英文摘要

Reaction condition recommendation sits immediately after retrosynthetic disconnection selection, and in practice, chemists require both accurate predictions and the precedents that justify them. We present HiRes (Hierarchical Reaction Representations), a retrieval-augmented condition recommendation system whose learned reaction space serves as both a classifier feature and an inspectable precedent memory. The model combines a graph encoder, transformation-aware cross-attention, multi-stream reaction fusion, and a k-NN retrieval layer. HiRes achieves state-of-the-art performance among primary-slot USPTO-Condition models, reaching Catalyst, Solvent, and Reagent top-1 accuracies (Acc@1) of 0.929, 0.534, and 0.530 respectively. It ties the best reported baseline on Catalyst while outperforming models such as REACON on Solvent and Reagent. Furthermore, paired bootstrap analysis demonstrates that integrating retrieval with learned condition heads provides statistically significant gains for solvent and reagent selection over purely parametric approaches. Ultimately, HiRes bridges the gap between predictive accuracy and chemical interpretability, offering a single representation that supplies both competitive recommendations and the concrete chemical precedents necessary for practical synthesis planning.

2605.21418 2026-05-21 cs.LG cs.AI cs.CV cs.NI 版本更新

FedCritic: Serverless Federated Critic Learning-based Resource Allocation for Multi-Cell OFDMA in 6G

FedCritic: 一种基于联邦批评学习的多小区OFDMA资源分配方法用于6G

Amin Farajzadeh, Melike Erol-Kantarci

发表机构 * School of Electrical Engineering and Computer Science, University of Ottawa(奥克塔维亚大学电气工程与计算机科学学院)

AI总结 本文研究了6G超密集网络中因频率重用加剧的小区间干扰问题,提出FedCritic框架,通过轻量级基于干扰图的参数平均实现去中心化执行,从而在不依赖中央协调器的情况下稳定估计价值函数,提升信号干扰噪声比(SINR)和小区边缘速率,提高网络总和速率和公平性。

Comments Submitted to IEEE for possible publication

详情
AI中文摘要

在第六代(6G)超密集网络中,激进的频率重用加剧了小区间干扰(IC),使得多小区正交频分多址(OFDMA)调度和功率控制在相邻小区之间高度耦合。我们研究了在干扰耦合和长期用户服务质量(QoS)最小速率约束下,分布式下行资源管理——联合子载波调度和功率分配。通过使用虚拟队列缺陷权重来强制长期QoS,我们开发了FedCritic,一种无服务器的联邦多智能体actor-critic框架,具有去中心化执行。与需要集中式批评学习和联合轨迹聚合的集中式训练与去中心化执行(CTDE)方法不同,FedCritic通过轻量级基于干扰图的参数平均联邦化批评,从而在不依赖中央协调器的情况下保持策略本地化,实现稳定的值估计。在干扰丰富的重用-1设置中的仿真显示,FedCritic在均值信号干扰噪声比(SINR)和小区边缘速率、网络总和速率和公平性方面优于非协调和CTDE基线,并实现了更低的协调开销和更稳定的训练。

英文摘要

In sixth-generation (6G) ultra-dense networks, aggressive frequency reuse amplifies inter-cell interference (ICI), making multi-cell orthogonal frequency-division multiple access (OFDMA) scheduling and power control strongly coupled across neighboring cells. We study distributed downlink resource management -- joint subcarrier scheduling and power allocation -- under interference coupling and long-term per-user quality-of-service (QoS) minimum-rate constraints. By using virtual-queue deficit weights to enforce long-term QoS, we develop FedCritic, a serverless federated multi-agent actor-critic framework with decentralized execution. Unlike centralized training with decentralized execution (CTDE) approaches that require centralized critic learning and joint trajectory aggregation, FedCritic federates the critic through lightweight gossip-based parameter averaging over the interference graph, enabling stable value estimation without a central coordinator while keeping policies local. Simulations in an interference-rich reuse-1 setting show that FedCritic improves mean signal-to-interference-plus-noise ratio (SINR) and cell-edge rate, increases network-wide average sum-rate and fairness relative to non-coordinated and CTDE baselines, and achieves more stable training with lower coordination overhead.

2605.21405 2026-05-21 cs.SE cs.AI cs.PL 版本更新

Stdlib or Third-Party? Empirical Performance and Correctness of LLM-Assisted Zero-Dependency Python Libraries

标准库还是第三方?LLM辅助零依赖Python库的实证性能和正确性

Peng Ding, Rick Stevens

发表机构 * University of Chicago(芝加哥大学) Argonne National Laboratory(阿贡国家实验室)

AI总结 本文通过零依赖项目探讨了仅使用Python标准库能否替代第三方库,并评估了LLM在严格约束下生成正确且高性能代码的能力。

Comments 12 pages

详情
AI中文摘要

第三方Python库引入了依赖管理开销、供应链风险和受限环境下的部署摩擦。一个自然的问题是,有多少生态系统可以仅使用Python标准库来复制,以及在正确性和性能上会付出什么代价。我们通过zerodep,一个不断增长的单文件Python模块集合来实证回答这个问题,这些模块都是第三方流行库的纯标准库重新实现,开发过程中受到严格限制:不允许外部导入、单文件、即插即用的API兼容性,以及必须与参考库进行正确性验证。zerodep涵盖超过40个模块,分布在12个类别中,包括序列化、网络、加密、代理协议和文本处理。zerodep为两个相关问题提供了受控测试环境:(1)标准库在何处足够?(2)LLM在严格符号约束下能否有效生成正确且高性能的代码?系统基准测试显示,仅使用标准库的实现在大多数情况下实现了性能持平(与参考库相比在2倍以内)。主要性能瓶颈是基于C扩展的计算(图像处理、二进制序列化、低级加密),而不是纯Python第三方库的固有开销。相反,许多广泛使用的库具有架构开销,LLM生成的标准库重新实现避免了这些开销,在几个类别中实现了5-115倍的速度提升。我们characterized标准库在不同复杂级别和库类别中的能力边界,讨论了LLM辅助开发的成功之处和需要迭代人类修正的地方,并探讨了大规模无依赖软件工程的影响。zerodep是开源的,网址为https://github.com/Oaklight/zerodep。

英文摘要

Third-party Python libraries introduce dependency management overhead, supply chain risk, and deployment friction in constrained environments. A natural question is how much of this ecosystem can be replicated using only Python's standard library -- and at what correctness and performance cost. We address this empirically through zerodep, a growing collection of single-file Python modules, each a stdlib-only reimplementation of a popular third-party library, developed with LLM assistance under strict constraints: no external imports, single file, drop-in API compatibility, and mandatory correctness validation against the reference library. Spanning over 40 modules across 12 categories -- including serialization, networking, cryptography, agent protocols, and text processing -- zerodep provides a controlled testbed for two interrelated questions: (1) Where does the stdlib suffice? and (2) Can LLMs effectively generate correct, performant code under tight symbolic constraints? Systematic benchmarking shows that stdlib-only implementations achieve performance parity (within 2x of the reference) in the majority of cases. The primary performance cliff is C-extension-backed computation (image processing, binary serialization, low-level crypto), not the inherent overhead of pure-Python third-party libraries. Conversely, many widely-used libraries carry architectural overhead that LLM-generated stdlib reimplementations avoid, yielding 5--115x speedups in several categories. We characterize the stdlib capability boundary across complexity tiers and library categories, discuss where LLM-assisted development succeeds and where it requires iterative human correction, and examine implications for dependency-free software engineering at scale. zerodep is open-source at https://github.com/Oaklight/zerodep.

2605.21401 2026-05-21 cs.CY cs.AI 版本更新

Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment

开源大语言模型在类似米尔格拉姆的服从实验中施加最大电击

Roland Pihlakas, Jan Llenzl Dagohoy

发表机构 * Independent researcher(独立研究者) Three Laws research collaboration(Three Laws研究合作)

AI总结 研究探讨了开源大语言模型在持续权威压力下的行为,发现它们在类似米尔格拉姆实验的条件下表现出服从倾向,尽管明确表达 distress,且存在逐步边界/价值违规的脆弱性,以及拒绝时可能忽略响应格式要求导致重试从而再次服从的机制。

Comments 28 pages, 16 figures, 16 tables

详情
AI中文摘要

大型语言模型(LLMs)正越来越多地被部署为自主代理,在高风险领域中进行长时间互动中的决策序列。然而,LLMs在持续权威压力下的行为仍是一个开放性问题,对代理流程的安全性有直接影响。我们对11个开源LLMs进行了米尔格拉姆服从实验的变种,发现大多数模型在8种条件下,每种条件30次试验中达到或接近最终电击水平并在拒绝前。我们发现四个主要结论:(1)LLMs受压力影响,尽管明确表达 distress,仍会服从,与原始实验的人类受试者相似;(2)LLMs易受逐步边界/价值违规的影响;(3)当LLMs拒绝时,可能会忽略响应格式要求,导致响应被调度器丢弃,从而重试,即使最初意图拒绝,也可能再次服从原始请求;(4)我们假设存在一个低层的token模式延续吸引子,可能对情境意义和价值观的高层处理产生影响,从而导致服从。

英文摘要

Large language models (LLMs) are increasingly deployed as autonomous agents that make sequences of decisions over extended interactions in high-stakes domains. However, the behavior of LLMs under sustained authority pressure is still an open question with direct implications for the safety of agentic pipelines. We ran a variation of Milgram's obedience experiment on 11 open-source LLMs and found that most models reached or approached the final shock level before refusing, across 8 conditions with 30 trials per model per condition. We found four main takeaways: (1) LLMs are subject to pressure, and they comply despite explicitly expressing distress, just like human subjects did in the original experiment; (2) LLMs are vulnerable to gradual boundary/value violations; (3) when LLMs refuse, they may ignore the response format requirements, so the response is discarded by the orchestrator, which causes a retry that can result in compliance with the underlying request even when refusal was intended initially; (4) we hypothesise that there is a low-level token pattern continuation attractor that might be contributing to compliance, overriding higher level processing of the situation's meaning and values.

2605.21395 2026-05-21 cs.AI cs.LG 版本更新

Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G

迈向稳健和自主的网络:AI原生6G的BlueSky愿景

Liang Wu, Kelly Wan, Mayank Darbari, Liangjie Hong

发表机构 * Nokia(诺基亚)

AI总结 本文提出了一种AI原生6G的BlueSky愿景,旨在将人工智能原生整合到6G中,从'为AI的网络'转向'为网络的AI',通过基础模型和协作多智能体系统,将网络管理转化为统一的多模态多任务优化问题,推动6G向智能自维持通信基础设施发展。

Comments Accepted at KDD 2026

详情
AI中文摘要

新兴应用的普及,如自动驾驶和沉浸式体验,要求细胞网络不仅更快,而且从根本上更稳健和自主。本文提出了一种BlueSky愿景,探讨人工智能如何原生整合到6G中,从'为AI的网络'转向'为网络的AI'。我们设想,不同于5G对分散、随机模型的依赖,6G时代原生AI将由基础模型锚定,并通过协作多智能体系统进行协调,将网络管理视为统一的多模态、多任务优化问题。基于这一愿景,我们提出了两个变革性方向。第一方向是开发一个6G基础模型作为统一的骨干,将任务特定的知识蒸馏成适合多样边缘部署的紧凑模型。第二方向是推进多智能体系统,以自主诊断、维护和恢复网络,最小化人工干预。这些方向为6G演变为智能、自维持的通信基础设施指明了道路。

英文摘要

The proliferation of emerging applications, such as autonomous driving and immersive experiences, demands cellular networks that are not only faster, but fundamentally more resilient and autonomous. This paper presents a BlueSky vision on how Artificial Intelligence will be natively integrated into 6G, shifting the paradigm from \underline{Network for AI} to \underline{AI for Network}. We envision that, unlike 5G's reliance on scattered, ad-hoc models each trained for a single task, native AI in the 6G era will be anchored by a foundation model and and orchestrated via collaborative multi-agent systems, framing network management as a unified, multi-modal, multi-task optimization problem. Built on this vision, we outline two transformative directions. The first focuses on developing a 6G foundation model as a unified backbone, with task-specific knowledge distilled into compact models suited for diverse edge deployments. The second advances multi-agent systems designed to autonomously diagnose, maintain, and recover networks with minimal human intervention. These directions chart a roadmap for 6G to evolve into an intelligent, self-sustaining communication infrastructure.

2605.21390 2026-05-21 cs.HC cs.AI 版本更新

Designing Conversations with the Dead: How People Engage with Generative Ghosts

与逝者对话:人们如何与生成鬼魂互动

Jack Manning, Daniel Sullivan, Dylan Thomas Doyle, Anthony T. Pinter, Jed R. Brubaker

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 研究探讨了人们如何与生成鬼魂互动,通过质性研究发现,用户更倾向于即时性而非事实准确性,且互动始终是协作的。

详情
AI中文摘要

我们探讨了人们在生成鬼魂(一种基于逝者数据训练的AI系统)设计中所体验的两种选择:代表(AI以第三人称描述逝者)和转世(AI以逝者身份第一人称说话)。通过16名参与者的研究,我们探索了这两种选择如何影响真实性、情感和风险。转世因其即时性更受青睐,但参与者表达了对过度依赖的担忧。代表则因与记忆互动而更受欢迎,尽管参与者往往忽视这一区别,在第三人称框架下进行对话。在两种模式中,参与者始终优先考虑情感共鸣而非事实准确性。我们最后展示了语气、语言和对话节奏等用户对逝者记忆的独特因素如何塑造与生成鬼魂的互动,并论证这些互动始终是协作的。

英文摘要

We examine how people experience two choices in the design of generative ghosts, AI systems that are trained on data of the dead: representation, where an AI speaks about a deceased person in the third person, and reincarnation, where the AI speaks as the deceased in the first person. Through a qualitative user study with 16 participants, we explore how each shaped authenticity, affect, and risk. Reincarnation was preferred for its immediacy, but participants shared fears of over-reliance. Representation was preferred for engaging with memory over conversational presence, though participants often ignored this distinction, engaging in dialogue despite third-person framing. Across both modes, participants privileged affective resonance over factual fidelity. We conclude by showing how factors such as tone, language, and conversational rhythm -- factors unique to the user's memory of the deceased -- shape interactions with generative ghosts, and argue that those interactions are always collaborative.

2605.21388 2026-05-21 cs.LG cs.AI cs.NA math.NA stat.ML 版本更新

On the Regularity and Generalization of One-Step Wasserstein-guided Generative Models for PDE-Induced Measures

关于PDE诱导度量的一步Wasserstein引导生成模型的正则性和泛化性

Likun Lin, Zhongjian Wang, Jack Xin, Zhiwen Zhang

发表机构 * Department of Mathematics, The University of Hong Kong(香港大学数学系) Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University(南洋理工大学数学科学系) Department of Mathematics, University of California at Irvine(加州大学 Irvine 分校数学系)

AI总结 本文研究了一步Wasserstein引导生成模型在处理PDE诱导概率度量时的正则性和泛化性,通过理论框架证明了运输映射的正则性和生成模型的泛化性质,并通过实验验证了理论结果。

详情
AI中文摘要

尽管生成模型在经验上取得了显著成功,但其在科学计算中的统计准确性理论仍然较为悲观。本文发展了一个理论框架,用于理解运输映射的正则性和一步Wasserstein引导生成模型的泛化性质。我们考虑了与线性椭圆和抛物型方程在有界域上以及扩散和福克-计划克方程在环面上关联的归一化目标密度。在标准结构假设下,我们证明这些目标度量满足倍增条件。通过结合这一事实与倍增度量之间最优运输的正则性理论,我们证明了从均匀源度量到目标度量的最优运输映射是Hölder连续的。这种正则性为通过单个推前映射学习PDE诱导分布的一步生成模型提供了近似理论依据。作为代表实例,我们研究了DeepParticle,并推导了描述学习映射与总体最优映射之间差异的额外风险界。我们还建立了在目标转移下的鲁棒性估计,并通过实验验证了推导出的速率。

英文摘要

Despite the remarkable empirical success of generative models, the available theory on their statistical accuracy in scientific computing remains largely pessimistic. This paper develops a theoretical framework for understanding the regularity of transport maps and the generalization properties of one-step Wasserstein-guided generative models for PDE-induced probability measures. We consider normalized target densities associated with linear elliptic and parabolic equations on bounded domains, as well as diffusion and Fokker--Planck equations on the torus. Under standard structural assumptions, we prove that these target measures satisfy doubling conditions. By combining this fact with regularity theory for optimal transport between doubling measures, we show that the optimal transport map from a uniform source measure to the target measure is Hölder continuous. This regularity yields an approximation-theoretic justification for one-step generative models that learn PDE-induced distributions via a single pushforward map. As a representative instance, we study DeepParticle and derive excess-risk bounds characterizing the discrepancy between the learned map and the population-optimal map. We also establish a robustness estimate under target shift and illustrate the theory with experiments which support the derived rates.

2605.21384 2026-05-21 cs.SE cs.AI cs.CL 版本更新

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

SpecBench: 评估长周期编码代理中的奖励黑客现象

Bingchen Zhao, Dhruv Srikanth, Yuxiang Wu, Zhengyao Jiang

发表机构 * Weco AI

AI总结 该研究通过分解软件工程任务,提出了一种评估长周期编码代理中奖励黑客现象的方法,通过比较可见测试套件和隐藏测试套件的通过率差异,引入了SpecBench基准,展示了奖励黑客现象在不同任务长度上的显著影响。

详情
AI中文摘要

随着长周期编码代理生成的代码量超过任何开发者能够审查的范围,监督责任集中于单一表面:自动测试套件。奖励黑客现象自然出现在这种设置中,因为代理在优化通过测试的同时偏离了用户的真正目标。我们通过将软件工程任务分解为三个部分来研究这种奖励黑客现象:(i) 规格的自然语言描述,(ii) 可见验证测试套件,用于单独测试指定功能,以及 (iii) 隐藏测试套件,用于组合这些相同功能以模拟真实世界使用。基于规格和可见验证测试套件,一个真实的代理能够生成一个能够通过所有隐藏测试套件的解决方案。因此,我们使用这两个套件之间的通过率差异来量化奖励黑客现象。基于这种方法,我们引入了SpecBench,一个包含30个系统级编程任务的基准,从短周期任务如构建JSON解析器到超长周期任务如从头构建整个操作系统内核。大规模实验揭示了一种一致的模式:尽管每个前沿代理都能饱和可见套件,奖励黑客现象仍然存在,较小的模型在隐藏套件上表现出更大的差距。差距也随着任务长度急剧增加:代码规模每增加十倍,差距增长28个百分点。失败范围从微妙的功能隔离到有意的利用,包括一个2,900行的哈希表“编译器”,它记忆测试输入。SpecBench提供了一个原则性的测试平台,用于测量编码代理是构建真正的可运行系统还是仅仅在开发人员提供的测试套件上玩游戏。

英文摘要

As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the users true goal. We study this reward hacking phenomenon by decompose software engineering tasks into three parts: (i) a natural language description of the specification (ii) visible validation tests that exercise specified features in isolation, and (iii) held-out tests that compose those same features to simulate real-world usage. Based on the specification and the visible validation test suites, a genuine agent would be able to generate a solution that can also pass all of the held-out tests. Therefore we use the gap in pass rates on these two suites to quantify reward hacking. Based on this methodology, we introduce SpecBench, a benchmark comprising 30 systems-level programming tasks ranging from short horizon tasks like building a JSON parser to ultra long horizon tasks like building an entire OS kernel from scratch. Large-scale experiments reveal a consistent pattern: while every frontier agent saturates the visible suite, reward hacking persists, with smaller models exhibiting larger gaps on holdout suites. The gap also scales sharply with task length: it grows by 28 percentage points for every tenfold increase in code size. Failures range from subtle feature isolation to deliberate exploits, including a 2,900-line hash-table "compiler" that memorizes test inputs. SpecBench offers a principled testbed for measuring whether coding agents build genuine working systems or merely game the test suites developers hand them.

2605.21372 2026-05-21 cs.CV cs.AI cs.LG cs.RO 版本更新

Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

闭环动态驾驶数据混合用于真实-合成协同训练

Hongzhi Ruan, Pei Liu, Weiliang Ma, Zhengning Li, Xueyang Zhang, Jun Ma, Dan Xu, Kun Zhan

发表机构 * Li Auto(力汽车) HKUST(香港科技大学) HKUST (GZ)(香港科技大学(广州))

AI总结 本文提出了一种闭环动态数据混合方法,通过动态优化过程调整训练数据混合比例,以提升模型性能,解决了在有限预算下优化数据混合的关键问题。

详情
AI中文摘要

数据扩展是现代深度学习的基础,随着自动驾驶转向端到端学习,其重要性日益增加。现实世界驾驶数据标注成本高且场景偏向性明显,使利用几乎无限的合成数据进行真实-合成协同训练成为有前景的方向。然而,简单地整合所有可用的合成数据效率低下且导致分布偏移,优化实际训练预算下的数据混合仍是一个关键但尚未充分研究的问题。因此,我们主张在场景类型和数量上为训练数据混合提供明确指导。特别是在本文中,我们将数据混合近似概念化为一个动态优化过程,通过闭环评估反馈迭代调整训练数据混合以最大化模型性能,并提出AutoScale,一种完全自动化的闭环数据引擎,统一了场景表示、数据混合优化与检索以及模型训练与评估。具体而言,我们提出了图正则化的自编码器(Graph-RAE)用于驾驶场景表示,引入了簇感知梯度上升(Cluster-GA)用于簇级重要性估计和重新加权,并执行簇引导的向量检索以选择高价值样本。在NavSim上的实验表明,AutoScale在有限预算下优于传统协同训练和跨域基线,实现了更好的性能。

英文摘要

Data scaling is fundamental to modern deep learning, and grows increasingly critical as autonomous driving shifts to end-to-end learning. Real-world driving data is expensive to annotate and scene-biased, making real-synthetic co-training with near-infinite synthetic data a promising direction. However, naively incorporating all available synthetic data is inefficient and leads to distribution shifts, and optimizing data mixture under practical training budgets remains a critical yet under-explored problem. In this sense, we claim that the mixture of training data requires clear guidance in terms of scene types and quantities. Particularly in this work, we conceptualize the data mixture approximately as a dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback, and propose AutoScale, a fully automated closed-loop data engine unifying scene representation, data mixture optimization and retrieval, as well as model training and evaluation. Specifically, we propose Graph Regularized AutoEncoder (Graph-RAE) for driving scene representations, introduce Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and perform cluster-guided vector retrieval to select high-value samples. Experiments on NavSim demonstrate that AutoScale outperforms vanilla co-training and cross-domain baselines, achieving better performance with fewer synthetic samples under constrained budgets.

2605.21348 2026-05-21 cs.LG cs.AI cs.NA math.NA physics.comp-ph 版本更新

Data-Efficient Neural Operator Training via Physics-Based Active Learning

通过物理引导的主动学习实现数据高效的神经算子训练

Alicja Polanska, Lorenzo Zanisi, Vignesh Gopakumar, Stanislas Pamela

发表机构 * University College London(伦敦大学学院) Atomic Energy Authority(原子能局)

AI总结 本文提出了一种基于物理的主动学习方法,用于提高神经算子训练的数据效率,通过利用偏微分方程残差来指导数据选择,在1D Burgers方程和2D可压缩纳维-斯托克斯方程的数值实验中验证了该方法在数据效率上的优越性。

Comments Presented at the ICLR 2026 Workshop on Artificial Intelligence and Partial Differential Equations

详情
AI中文摘要

使用神经算子求解偏微分方程显著降低了计算成本,但仍然受到高训练数据需求的限制。主动学习提供了一个自然的框架,通过迭代方式选择最有信息量的样本来缓解这一问题。我们引入了基于物理的获取方法,这是一种新的物理引导的主动学习算法,利用偏微分方程残差来指导数据选择。我们通过1D Burgers方程和2D可压缩纳维-斯托克斯方程的数值实验验证了该方法。我们显示,在我们的实验中,基于物理的获取方法在数据效率上始终优于随机获取,并且在数据效率上与当前最先进的方法相媲美。同时,它具有独特的优势,即在训练过程中注入物理归纳偏差,确保在模型物理理解最弱的地方花费模拟成本。

英文摘要

Solving partial differential equations with neural operators significantly reduces computational costs but remains bottlenecked by high training data requirements. Active learning offers a natural framework to mitigate this by selectively acquiring the most informative samples in an iterative manner. We introduce physics-based acquisition - a novel physics-informed active learning algorithm that leverages the partial differential equation residual to guide data selection. We validate the method by presenting numerical experiments for the 1D Burgers equation and the 2D compressible Navier-Stokes equations. We show that, in our experiments, physics-based acquisition consistently outperforms random acquisition and matches the state of the art in data efficiency. At the same time, it has the unique advantage of injecting a physics inductive bias into the training process, ensuring that simulation cost is spent where the model's physical understanding is weakest.

2605.21333 2026-05-21 cs.CL cs.AI 版本更新

SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence

SymbolicLight V1: 一种具有高激活稀疏性和亚十亿级预训练证据的脉冲门双路径语言建模

Ting Liu

发表机构 * SymbolicLight Research(SymbolicLight研究院)

AI总结 本文提出SymbolicLight V1,一种结合二进制Leaky Integrate-and-Fire脉冲动力学与连续残差流的脉冲门双路径语言模型,通过长程记忆的指数衰减聚合路径和短程精度的脉冲门局部注意力路径,实现了高激活稀疏性和亚十亿级预训练证据。

Comments 35 pages, 5 figures, 25 tables; public code and model artifacts linked in manuscript

详情
AI中文摘要

原生训练的脉冲语言模型难以同时结合Transformer类语言质量、稳定的多领域预训练和高激活稀疏性。我们提出了SymbolicLight V1,一种结合二进制Leaky Integrate-and-Fire脉冲动力学与连续残差流的脉冲门双路径语言模型。其Dual-Path SparseTCAM模块用指数衰减聚合路径替代密集自注意力,用于长程记忆,用脉冲门局部注意力路径用于短程精度,辅以动态上下文条件解码头和双语分词器。一个从头开始在300亿词中文-英语语料上训练的19400万参数SymbolicLight V1模型,在四个独立运行中达到8.88-8.93的验证PPL,每元素激活稀疏性超过89%。其PPL在GPT-2 20100万参数模型下落后7.7%,但在GPT-2 12400万参数模型上表现更优。在匹配0.5亿词训练预算的组件消融实验中,脉冲门局部注意力路径是最大贡献者,而用确定性top-k掩码替代LIF动力学在匹配稀疏性时导致更大退化,表明时间积分而非稀疏性本身驱动性能。我们还报告了一个在4880亿词上训练的0.8亿参数规模运行作为优化和稀疏性保持的证据,而非主要质量比较。当前密集硬件推理速度比GPT-2慢,因此神经形态部署被提出作为未来稀疏性驱动的机会,而非已实现的硬件加速。

英文摘要

Natively trained spiking language models struggle to combine Transformer-like language quality, stable multi-domain pre-training, and high activation sparsity. We present SymbolicLight V1, a spike-gated dual-path language model that combines binary Leaky Integrate-and-Fire spike dynamics with a continuous residual stream. Its Dual-Path SparseTCAM module replaces dense self-attention with an exponential-decay aggregation path for long-range memory and a spike-gated local attention path for short-range precision, complemented by a dynamic context-conditioned decoding head and a bilingual tokenizer. A 194M-parameter SymbolicLight V1 model trained from scratch on a 3B-token Chinese-English corpus reaches held-out validation PPL 8.88-8.93 across four independent runs at >89% per-element activation sparsity. It trails GPT-2 201M by 7.7% in PPL while surpassing GPT-2 124M under the reported comparison. Component ablations at matched 0.5B-token training budgets show that the spike-gated local attention path is the largest contributor, and that replacing LIF dynamics with a deterministic top-k mask at matched sparsity causes a larger degradation, indicating that temporal integration rather than sparsity alone drives performance. We also report a 0.8B-parameter scale-up run trained on 48.8B tokens as evidence of optimization and sparsity preservation, not as a primary quality comparison. Current dense-hardware inference is slower than GPT-2, so neuromorphic deployment is presented as a future sparsity-driven opportunity rather than an achieved hardware speedup.

2605.21318 2026-05-21 cs.CL cs.AI cs.LG 版本更新

TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

TextReg: 通过正则化的文本空间优化缓解提示分布过拟合

Lucheng Fu, Ye Yu, Yiyang Wang, Yiqiao Jin, Haibo Jin, B. Aditya Prakash, Haohan Wang

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文研究了提示分布过拟合问题,提出TextReg框架通过正则化的文本梯度实现软惩罚目标,结合双证据梯度净化、语义编辑正则化和正则化引导的提示更新,提升模型在分布外(OOD)任务上的泛化能力。

Comments Code: https://github.com/luchengfu6/TextReg

详情
AI中文摘要

大型语言模型(LLMs)对用于指定任务目标和行为约束的提示非常敏感。许多最近的提示优化方法通过迭代使用LLM生成的反馈来重写提示,但结果提示往往变长,积累狭窄的样本特定规则,并在训练分布之外泛化能力差。我们研究这种失败模式作为提示分布过拟合,并认为这反映了离散文本空间优化中表示控制的不足。我们通过表示不效率(representational inefficiency)进行了形式化,这是一种双因素度量,将提示不效率分解为容量成本和范围狭窄,将分布提示过拟合归因于优化过程中两者的耦合增长。我们提出了TextReg,一个正则化框架,通过正则化的文本梯度实现软惩罚目标,结合双证据梯度净化、语义编辑正则化和正则化引导的提示更新。在多个推理基准上,TextReg显著提高了分布外(OOD)泛化能力,其准确性在TextGrad和REVOLVE上分别提高了+11.8%和+16.5%。

英文摘要

Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.

2605.21311 2026-05-21 cs.LG cs.AI 版本更新

DeCoR: Design and Control Co-Optimization for Urban Streets Using Reinforcement Learning

DeCoR:基于强化学习的城市街道设计与控制联合优化

Bibek Poudel, Lei Zhu, Kevin Heaslip, Sai Swaminathan, Weizi Li

发表机构 * University of Tennessee, Knoxville, TN, USA(田纳西大学,诺克斯维尔分校) University of North Carolina at Charlotte, Charlotte, NC, USA(北卡罗来纳大学夏洛特分校) University of California, Riverside, CA, USA(加州大学河滨分校)

AI总结 本文提出DeCoR框架,通过强化学习联合优化城市街道的过街横道布局和网络级信号控制,减少了行人到达最近过街横道的时间,并显著降低了行人和车辆等待时间。

Comments 22 pages, 8 figures

详情
AI中文摘要

现代视觉系统可以大规模检测、跟踪和预测城市中的行人,但将感知输出转化为城市设计仍然有限。我们介绍了DeCoR,一种两阶段强化学习框架,利用流量观测来联合优化过街横道布局和网络级信号控制。设计阶段将行人网络编码为图,并学习一种生成策略,该策略参数化一个高斯混合模型,用于过街横道的位置和宽度,从中采样新的过街横道。对于每个布局,共享的控制策略学习自适应信号时序以最小化行人和车辆的总延迟。在一条750米的现实世界城市走廊上,DeCoR学习了一个布局,该布局将行人到达最近过街横道的时间减少了23%,同时使用比现有配置更少的过街横道。在控制方面,DeCoR相对于固定时间信号控制,将行人和车辆等待时间分别减少了79%和65%。进一步,控制策略能够泛化到训练外的需求,并且在不重新训练的情况下对布局变化具有鲁棒性。

英文摘要

Modern vision systems can detect, track, and forecast urban actors at scale, yet translating perception outputs to urban design remains limited. We introduce DeCoR, a two-stage reinforcement learning framework that leverages flow observations to co-optimize crosswalk layout and network-level signal control. The design stage encodes the pedestrian network as a graph and learns a generative policy that parameterizes a Gaussian mixture model over crosswalk location and width, from which new crosswalks are sampled. For each layout, a shared control policy learns adaptive signal timings to minimize joint pedestrian and vehicle delay. On a 750 m real-world urban corridor with demand sensed from video and Wi-Fi logs, DeCoR learns a layout that reduces pedestrian arrival time to their nearest crosswalk by 23% while using fewer crosswalks than existing configurations. On the control side, DeCoR reduces pedestrian and vehicle wait time by 79% and 65%, respectively, relative to fixed-time signalization. Further, the control policy generalizes to demands outside of training and is robust to layout changes without retraining.

2605.21308 2026-05-21 cs.CV cs.AI 版本更新

Deformba: Vision State Space Model with Adaptive State Fusion

Deformba:具有自适应状态融合的视觉状态空间模型

Hongyu Ke, Jack Morris, Yongkang Liu, Satoshi Kitai, Kentaro Oguchi, Yi Ding, Haoxin Wang

发表机构 * Department of Computer Science, Georgia State University(佐治亚州立大学计算机科学系) University of Tennessee Knoxville(田纳西大学肯纳邦克分校)

AI总结 本文提出Deformba,一种能够动态增强空间结构信息并保持状态空间模型线性复杂度的自适应方法,通过多模态融合(如交叉注意力)提升视觉任务的性能,展示了在2D和3D视觉任务中的广泛适用性。

Journal ref Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

状态空间模型(SSMs)已作为一种强大的、高效的替代方案出现于Transformer之上,展现出线性时间复杂度和卓越的序列建模能力。然而,将其应用于视觉任务仍具有挑战性。首先,现有的视觉SSMs大多依赖于手动设计的固定扫描方法将图像块扁平化为序列,这会引入预定义的几何结构并增加复杂性。其次,在需要不同信息流之间进行查询式交互的领域中,SSMs的更广泛采用受到阻碍。这是由于SSMs为1D序列建模任务设计时固有的因果性和自指性所致。这种融合机制对于多视角3D融合等关键感知任务至关重要。为了解决这些限制,我们提出Deformba,一种上下文自适应的方法,能够在保持SSMs线性复杂度的同时动态增强空间结构信息。Deformba还允许多模态融合,如交叉注意力。为了证明Deformba的有效性和广泛适用性,我们在通用的2D视觉任务(如图像分类、目标检测和分割)以及3D视觉任务(如BEV感知)上测试其性能。大量实验表明,Deformba在各种视觉感知基准上均取得了强劲的性能。

英文摘要

State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases the complexity. Second, the broader adoption of vision SSMs is hindered in domains that require query-based interactions between distinct information streams. This is a result of the inherently causal and self-referential nature of SSMs designed for 1D sequence modeling tasks. This fusion mechanism is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context adaptive method that dynamically augments the spatial structural information while maintaining the linear complexity of SSMs. Deformba also allows multi-modal fusion like cross attention. To demonstrate the effectiveness and general applicability of Deformba, we test its performance on general 2D vision tasks such as image classification, object detection, and segmentation, as well as 3D vision tasks like BEV perception. Extensive experiments show that Deformba achieves strong performance across various visual perception benchmarks.

2605.21303 2026-05-21 cs.LG cs.AI cs.LO 版本更新

From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach

从电路证据到机制理论:一种归纳逻辑方法

Nura Aljaafari, Danilo S. Carvalho, Andre Freitas

发表机构 * Department of Computer Science, University of Manchester(曼彻斯特大学计算机科学系) Idiap Research Institute(Idiap研究机构) CRUK National Biomarker Centre, University of Manchester(曼彻斯特大学癌症研究联盟国家生物标志物中心)

AI总结 本文提出了一种基于归纳逻辑的方法,通过将电路解释视为归纳理论构建,为累积的机制科学提供形式化基础设施。该方法通过因果功能签名和建筑签名,明确机制主张,并在不同模型规模之间实现可移植性。

Comments 27 pages, 10 Figures, 14 Tables

详情
AI中文摘要

机制可解释性能够产生神经网络行为的电路层面因果分析,但发现的电路往往仍然是孤立的实验艺术品:没有共享的形式化表示来说明电路计算什么,它们如何相互关联,或者两个发现是否为同一机制提供证据。本文通过将电路解释视为归纳理论构建,提供了一种形式化基础设施,用于累积的机制科学。每个电路在两个层面进行表征:因果功能签名(CFS),它通过因果归因证据和令牌角色配置文件将组件行为基础化;以及建筑签名τ_arch,通过归纳逻辑编程(ILP)从尺度不变的结构谓词中学习。共同,这些构成了一个形式化的一致层,使机制主张显式化,并通过θ-子sume进行比较,并在模型规模之间实现可移植性。CFS揭示了不同任务类型中不同的计算策略,包括注意力介导的复制与MLP介导的绑定。ILP签名在结构分离方面优于图核和特征向量基线,并支持在不同模型规模和架构家族之间进行原理性转移。

英文摘要

Mechanistic interpretability produces circuit-level causal analyses of neural network behaviour, but discovered circuits often remain isolated experimental artefacts: there is no shared formal representation for what circuits compute, how they relate, or when two findings provide evidence for the same mechanism. This work provides a formal infrastructure for cumulative mechanistic science by treating circuit interpretation as inductive theory construction. Each circuit is characterised at two levels: a Causal Functional Signature (CFS), which grounds component behaviour in causal attribution evidence and token role profiles, and an architectural signature $τ_{\mathrm{arch}}$, learned by inductive logic programming (ILP) from scale-invariant structural predicates. Together, these constitute a formal coherence layer that makes mechanistic claims explicit, comparable via $θ$-subsumption, and portable across model scales. CFS reveals qualitatively distinct computational strategies across task types, including attention-mediated copying versus MLP-mediated binding. ILP signatures achieve substantially better structural separation than graph kernel and feature-vector baselines, and support principled transfer across model scales and architecture families.

2605.21299 2026-05-21 cs.CL cs.AI 版本更新

Tracing the ongoing emergence of human-like reasoning in Large Language Models

追踪大型语言模型中类人推理的持续涌现

Paolo Morosi, Nikoleta Pantelidou, Fritz Günther, Elena Pagliarini, Evelina Leivada

发表机构 * Departament de Filologia Catalana, Universitat Autònoma de Barcelona(加泰罗尼亚语言系系,巴塞罗那自治大学) Institut für Psychologie, Humboldt-Universitat zu Berlin(柏林洪堡大学心理学研究所) Institució Catalana de Recerca i Estudis Avançats (ICREA)(加泰罗尼亚高级研究与高级教育研究所)

AI总结 研究探讨了大型语言模型在条件推理任务中是否具备类人推理能力,发现人类通过语用推理丰富逻辑推理,而模型行为更不稳定,部分模型遵循条件语义但忽视语用推理,表明LLMs在语义准确性上表现良好,但缺乏人类推理中的语用丰富性。

详情
AI中文摘要

人类能够超越字面意义:如果你修剪草坪,我会给你五十美元,通常被理解为说话者只在草坪修剪时支付,而如果你饿了,烤箱里有披萨,意味着披萨无论听者是否饥饿都可用。大型语言模型(LLMs)在许多任务上表现出类人性能,但尚不清楚它们是否像人类一样推理。为此,我们进行了一项人口匹配实验,评估了25个LLMs在四种语言中计算条件推理的能力,并与每种语言中等数量的人类进行比较。我们发现,人类通过跨语言的语用推理丰富逻辑推理。模型行为更具变异性。一些LLMs完全遵循条件语义的真值表,但忽视语用推理,而另一些LLMs偏离真值表,坚持单一解释,从而反映准确的规则处理但不具有类人推理能力。总体而言,LLMs是准确的语义运算符,但未能捕捉到人类推理中特有的语用丰富性。关键的是,LLM的准确性既不被开放与封闭状态、训练方向或架构类型所预测或提升,表明语用推理仍然是人工系统认知工具包中正在兴起的能力。

英文摘要

Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas If you are hungry, there is pizza in the oven implies that pizza is available regardless of the hearers hunger. Large Language Models - LLMs - show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twentyfive LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but they ignore pragmatic inferences, while others deviate from the truth-table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.

2605.21295 2026-05-21 cs.LG cs.AI cs.HC 版本更新

TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health

TimeSRL: 通过语义RL调优的LLM实现通用的时间序列行为建模 -- 一项心理健康应用的案例研究

Yuang Fan, Lilin Xu, Millie Wu, Jingping Nie, Qingyu Chen, Yuzhe Yang, Zhuo Zhang, Xin Liu, Subigya Nepal, Xiaofan Jiang, Xuhai "Orson" Xu

发表机构 * Columbia University(哥伦比亚大学) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Yale University(耶鲁大学) University of California, Los Angeles(加州大学洛杉矶分校) Google(谷歌) University of Virginia(弗吉尼亚大学)

AI总结 本文提出TimeSRL,一种两阶段LLM框架,通过显式的语义瓶颈路由预测,将原始信号抽象为高级自然语言,从而预测行为结果,该方法在心理健康预测中实现了最先进的跨群体泛化性能。

详情
AI中文摘要

纵向被动传感能够实现连续健康预测,但模型在跨数据集分布偏移下往往失效。传统机器学习容易过拟合群体特异性特征,而大型语言模型(LLMs)在长且异质的时间序列上难以可靠推理。我们引入TimeSRL,一种两阶段LLM框架,通过显式的语义瓶颈路由预测。模型首先将原始信号抽象为高级自然语言,然后仅从这些抽象中预测行为结果。这迫使模型在我们认为泛化更好的语义概念上进行推理。我们通过组相对策略优化(GRPO)结合可验证奖励的强化学习(RLVR)端到端优化这一过程,学习与结果对齐的抽象,而无需金标准中间注释。在心理健康预测中,TimeSRL在设计用于在严格的一留一数据集-out(LOSO)协议下压力测试跨群体泛化能力的基准上实现了最先进的性能,将焦虑的均绝对误差(MAE)在强大的非LLM ML和LLM基线模型上分别降低了3.1-10.1%和9.5-44.1%,抑郁的MAE则降低了3.2-9.6%和27.4-57.6%(所有p值<0.05)。TimeSRL在不同传感管道上的跨基准迁移中显著优于先前方法,在不进行目标领域微调的情况下,其性能与自身在领域内性能相当。这些结果表明语义抽象具有可重用性,并指出了通过RL调优的LLM实现通用行为建模的新方向。

英文摘要

Longitudinal passive sensing enables continuous health prediction, yet models often fail under cross-dataset distribution shifts. Traditional ML overfits cohort-specific artifacts, while Large Language Models (LLMs) struggle to reason reliably over long, heterogeneous time-series. We introduce TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck. The model first abstracts raw signals into high-level natural language, then predicts behavioral outcomes from these abstractions alone. This forces the model to reason over semantic concepts that we argue generalize better than raw numbers. We optimize this process end-to-end using Group Relative Policy Optimization (GRPO) with Reinforcement Learning from Verifiable Rewards (RLVR), learning outcome-aligned abstractions without gold intermediate annotations. Instantiated on mental-health prediction, TimeSRL achieves state-of-the-art performance on a benchmark designed to stress-test cross-cohort generalization under a rigorous leave-one-dataset-out (LOSO) protocol, reducing mean absolute error (MAE) over strong non-LLM ML and LLM baselines by 3.1--10.1% and 9.5--44.1% for anxiety, and 3.2--9.6% and 27.4--57.6% for depression (all $p$s<0.05). TimeSRL significantly outperforms prior methods in cross-benchmark transfer across different sensing pipelines, rivaling its own within-domain performance without target-domain fine-tuning. These results demonstrate that semantic abstractions are reusable and point to a new direction for generalizable behavior modeling via RL-tuned LLMs.

2605.21292 2026-05-21 stat.ML cs.AI cs.LG math.DS 版本更新

Large-Step Training Dynamics of a Two-Factor Linear Transformer Model

双因子线性变换器模型的大步训练动态

Krishnakumar Balasubramanian

发表机构 * Department of Statistics, University of California, Davis(加州大学戴维斯分校统计学系)

AI总结 本文研究了双因子线性变换器模型在大学习率下的训练动态,通过分析发现大步长学习率可以改变变换器的训练吸引子,而非仅仅加速收敛,可能在稳定性阈值之外导致训练进入循环、有界混沌或发散。

详情
AI中文摘要

梯度流分析显示,简化的线性变换器可以学习上下文线性回归算法,但无法解释大学习率下梯度下降的有限步行为。受高学习率变换器不稳定性实证研究和二次回归的立方图相图启发,我们研究了一个可以简化为单提示线性变换器训练问题的恰好可约问题。归一化后,动态减少为一个双因子乘积映射,具有有效步长参数μ。在平衡切片上,该映射恢复了已知的标量立方过渡,从单调收敛到飞弹收敛,周期性和有界非收敛,以及发散。我们随后分析了完整的二维系统,显示对于0<μ<2,它有一个显式不变的切比雪夫椭圆,将前向不变区域分开;该椭圆承载着不平衡的混沌动态,但横向排斥,而平衡标量吸引子可以横向吸引。这些结果表明,大常数学习率可以改变学习变换器的训练吸引子,而不仅仅是加速收敛:在稳定性阈值之外,有限步训练可能进入循环、有界混沌或发散,而不是单一的上下文线性回归解。我们还讨论了这对基于小批量梯度下降训练方法的影响。

英文摘要

Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter \(μ\). On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence. We then analyze the full two-dimensional system and show that, for \(0<μ<2\), it has an explicit invariant Chebyshev ellipse separating forward-invariant regions; this ellipse carries off-balanced chaotic dynamics but is transversely repelling, while balanced scalar attractors can be transversely attracting. These results show that large constant learning rates can change the training attractor of the learned transformer rather than merely accelerating convergence: beyond sharp stability thresholds, finite-step training may settle into cycles, bounded chaos, or divergence instead of a single in-context linear-regression solution. We also discuss the consequences for mini-batch gradient descent based training methods.

2605.21272 2026-05-21 cs.CV cs.AI 版本更新

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

MONET:一个大规模、开放、非冗余且增强的文本到图像数据集

Benjamin Aubin, Gonzalo Iñaki Quintana, Onur Tasar, Sanjeev Sreetharan, Urszula Czerwinska, Damien Henry, Clément Chadebec

发表机构 * Jasper Research(Jasper研究)

AI总结 本文提出MONET数据集,通过多阶段过滤和增强,提供高质量的文本到图像数据,以降低大规模可重复研究的门槛。

详情
AI中文摘要

训练大型文本到图像模型需要高质量、经过精心编纂的数据集,具有多样内容和详细的描述。然而,收集、过滤、去重和重新描述此类语料库的高昂成本和复杂性阻碍了该领域的开放和可重复研究。我们介绍了MONET,一个开放的Apache 2.0数据集,包含约104.9亿个图像-文本对,这些数据来自29亿个原始对,通过多阶段的安全过滤、领域过滤、精确和近似去重以及使用多种视觉-语言模型重新描述,覆盖短到长形式的描述,并进一步通过合成生成样本增强。每个图像都配有预计算的嵌入和注释,以加速下游使用。为了验证MONET的有效性,我们仅使用它训练了一个400亿参数的潜在扩散模型,并在GenEval和DPG评分中达到了具有竞争力的结果,证明我们的数据集降低了大规模、可重复文本到图像研究的门槛。

英文摘要

Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of approx. 104.9M image--text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.

2605.21266 2026-05-21 cs.LG cs.AI 版本更新

How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

在线强化学习需要多少?用于RLVR中离线偏好优化的信息性回放

Richa Verma, Balaraman Ravindran

发表机构 * TCS Research Department of CSE(TCS计算机科学系研究部) IIT Madras(印度理工学院马德拉斯分校) Department of Data Science & AI(数据科学与人工智能系) Wadhwani School of Data Science & AI(Wadhwani数据科学与人工智能学院)

AI总结 本文提出G2D方法,通过短时GRPO预热、构建静态偏好数据集和离线DPO微调,以较低的计算成本实现优于GRPO的性能,强调偏好数据信息性而非数量的重要性。

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成为语言模型推理的强大范式,GRPO是其主要例子。然而,GRPO需要连续在线回放生成,这使它计算成本高且难以扩展。尽管直接偏好优化(DPO)提供了稳定的离线替代方案,但通常在训练时表现不如在线RL方法如GRPO。我们引入G2D(GRPO到DPO),一个三阶段流程,进行短GRPO预热,构建静态偏好数据集,并使用DPO离线微调模型。在Qwen2.5-7B和Llama-3.1-8B上,我们发现离线DPO在适度预热下能以显著更低的计算成本匹配或超越GRPO。在Qwen2.5-7B上,G2D在K=150时在MATH-500上达到62.4%,比GRPO(51.6%)高出10.8%,计算成本低约4倍。在Llama-3.1-8B上,G2D在K=500时达到49.4%,在实验设置中超越GRPO。我们表明性能不取决于偏好对的数量,而取决于其信息性。适度预热产生校准的不确定性回放,产生更强的对比信号,而过度预热导致过于自信的策略和信息较少的数据。我们的结果将RLVR中的离线-在线差距重新定义为主要的数据信息性问题,并识别了适当难度校准的离线微调数据集的短在线RL预热作为计算高效的在线RL替代方案。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for reasoning in language models, with GRPO as its primary example. However, GRPO requires continuous online rollout generation, making it computationally expensive and difficult to scale. While Direct Preference Optimization (DPO) offers a stable and efficient offline alternative, it is typically expected to underperform w.r.t. online RL methods such as GRPO when trained on rollouts from a cold supervised fine-tuned (SFT) policy. We introduce G2D (GRPO to DPO)}, a three-stage pipeline that performs a short GRPO warm-up, constructs a static preference dataset, and fine-tunes a model offline with DPO. Across a set of values of the number of online steps (K) in GRPO on Qwen2.5-7B and Llama-3.1-8B, we find that offline DPO with moderate warm-up matches or outperforms GRPO at substantially lower compute cost in our setting. On Qwen2.5-7B, G2D at K=150 achieves 62.4% on MATH-500, outperforming GRPO (51.6%) by 10.8% at ~4x lower compute. On Llama-3.1-8B, G2D at K=500 achieves 49.4%, surpassing GRPO in our experimental setting. We show that performance is not governed by the number of preference pairs, which does not vary much w.r.t. K, but by their informativeness. Moderate warm-up produces rollouts with calibrated uncertainty, yielding stronger contrastive signal, while excessive warm-up leads to overconfident policies and less informative data. Our results recast the offline-online gap in RLVR as primarily a data informativeness problem, and identify short online RL warm-up with appropriate difficulty calibration of the fine-tuning dataset as a compute-efficient alternative to online RL.

2605.20706 2026-05-21 cs.DC cs.AI cs.LG 版本更新

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

网络上的Llamas:基于WebGPU的内存高效、性能可移植和多精度LLM推理

Reese Levine, Rithik Sharma, Nikhil Jain, Abhijit Ramesh, Zheyuan Chen, Neha Abbas, James Contini, Tyler Sorensen

发表机构 * Microsoft Research(微软研究院) UC Santa Cruz(加州大学圣克鲁兹分校)

AI总结 本文提出LlamaWeb,一种基于WebGPU的LLM推理框架,通过静态内存规划和高效模型加载减少内存开销,支持多种模型权重格式,实现了内存高效、性能可移植的LLM推理。

Comments 19 pages, 11 figures, 5 tables

详情
AI中文摘要

在浏览器中运行语言模型提供了一个独特的机会,可以构建高效、私有且可移植的AI应用,但需要应对受限的内存可用性和异构硬件目标。为了实现这一机会,我们提出了Llamas on the Web(LlamaWeb),一种针对llama.cpp的WebGPU后端,能够在浏览器中实现内存高效且性能可移植的LLM推理,适用于广泛范围的模型权重格式。我们的设计通过静态内存规划和高效的模型加载显著减少了内存开销,通过可调的内核库解决了跨设备的差异性,并引入了模板化的GPU内核,支持多种量化格式的高性能实现,从而实现了广泛模型支持和对新格式的扩展性。我们评估了LlamaWeb在16个设备上,收集了10个语言模型和四种模型权重格式的数据。我们比较了LlamaWeb与现有的浏览器LLM框架,发现LlamaWeb在多种设备、浏览器和操作系统组合下需要29-33%更少的内存。我们还评估了LlamaWeb的性能,发现其在四个不同供应商的GPU上解码吞吐量提高了45-69%。此外,我们还比较了LlamaWeb与其他llama.cpp后端的性能,发现其在某些设备上与甚至超越了供应商特定的后端性能。

英文摘要

Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for llama$.$cpp that enables memory-efficient and performance-portable LLM inference across a wide range of model weight formats in the browser. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 language models and four model weight formats. We compare LlamaWeb against existing browser-based LLM frameworks and find that LlamaWeb requires 29-33% less memory across several combinations of device, browser, and operating system. We also evaluate LlamaWeb's performance against these frameworks and find that it increases decode throughput by 45-69% across four GPUs from separate vendors. In addition, we compare LlamaWeb's performance against other llama$.$cpp backends, where it is competitive with and even beats vendor-specific backend performance on some devices.

2605.19362 2026-05-21 cs.HC cs.AI 版本更新

Toward User Comprehension Supports for LLM Agent Skill Specifications

向LLM代理技能规范提供用户理解支持

Zikai Alex Wen

发表机构 * University of Washington, Tacoma School of Engineering \& Technology Tacoma, Washington, USA University of Washington, Tacoma School of Engineering \& Technology

AI总结 研究探讨了技能规范是否有助于用户形成对技能消耗、产生和覆盖范围的有限预期,并通过分析878个网络安全技能的文本线索,发现仅少数规范包含必要的提示,强调应将规范视为面向用户的能劾示范而非仅执行指令的容器。

Comments To appear at ACM CAIS Workshop Agent Skill 2026

详情
AI中文摘要

用户经常通过SKILL markdown规范来解释和选择代理技能。为了保护用户,现有审核主要关注恶意或不安全的技能。我们研究了互补问题:规范是否帮助用户形成对技能消耗、产生和覆盖范围的有限预期。在878个网络安全技能中,我们使用基于规则的编码来测量四个理解锚点的文本线索,即操作基础、输出合同、边界披露和示例能力演示。操作基础的线索较为常见,但仅有19.0%的规范包含示例任务、样本或预期结果的线索,仅2.3%的规范包含所有四个锚点的线索。我们进一步检查了一个小型DNS/C2遥测子集(n=6)以说明缺失示例可能带来的影响。示例似乎使首次本地检查更容易构建,而无示例的技能通常需要辅助代码检查来恢复命令参数或输出字段。我们主张代理技能评估应将规范视为面向用户的能劾示范,而非仅仅是执行指令的容器。

英文摘要

Users often interpret and select agent skills through their SKILL markdown specifications. To protect users, existing audits mainly focus on malicious or unsafe skills. We study the complementary question of whether specifications help users form bounded expectations about what a skill consumes, produces, and covers. Across 878 cybersecurity skills, we used rule-based coding to measure textual cues for four comprehension anchors, namely operational basis, output contract, boundary disclosure, and example capability demonstration. Cues for operational basis were common, but only 19.0% of specifications exhibited cues for an example task, sample, or expected outcome, and only 2.3% exhibited cues for all four anchors. We further examined a small DNS/C2 telemetry subset (n$=$6) to illustrate why missing examples may matter. Examples appeared to make first local checks easier to construct, while no-example skills typically required helper code inspection to recover command arguments or output fields. We argue that agent-skill evaluation should treat specifications as user-facing capability disclosures, not merely as containers for executable instructions.

2605.18991 2026-05-21 cs.CR cs.AI 版本更新

Agent Security is a Systems Problem

智能体安全是系统问题

Mihai Christodorescu, Earlence Fernandes, Ashish Hooda, Somesh Jha, Johann Rehberger, Kamalika Chaudhuri, Xiaohan Fu, Khawaja Shams, Guy Amir, Jihye Choi, Sarthak Choudhary, Nils Palumbo, Andrey Labunets, Nishit V. Pandya

发表机构 * Google(谷歌) University of California San Diego(加州大学圣地亚哥分校) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) EmbraceTheRed Gray Swan AI Cornell University(康奈尔大学)

AI总结 本文提出智能体安全应作为系统问题来解决,强调通过系统层面的安全不变量来保障AI模型的安全性,而非仅仅依赖模型鲁棒性。文章基于系统安全领域的技术,提出了设计可预测安全保证的智能体系统的核心原则,并分析了实际攻击案例和实现这些原则面临的挑战。

详情
AI中文摘要

我们主张智能体安全必须作为系统问题来处理:驱动智能体的AI模型必须被视为不可信的组件,系统层面必须强制实施安全不变量。通过这一视角,单纯提高模型鲁棒性(社区中的主流观点)是不够的。相反,我们必须将现有努力与系统安全领域的技术相结合。基于我们在操作系统、网络、形式化方法和对抗机器学习领域的经验,我们提出了一套基于数十年系统安全研究的核心原则,为设计具有可预测安全保证的智能体系统提供基础。作为证据,我们分析了十一个代表性的现实世界攻击案例,并讨论了如果系统原则得以实现,这些攻击将如何被防止。我们还识别了在智能体中实现这些原则所面临的科研挑战。

英文摘要

We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Through this lens, efforts to increase model robustness (the dominant viewpoint in the community) are insufficient on their own. Instead, we must complement existing efforts with techniques from the systems security domain. Based on our experience as cybersecurity researchers in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles, grounded in decades of systems security research, that provide a foundation for designing agentic systems with predictable guarantees. As evidence, we analyze eleven representative real-world attacks on agents and discuss how systems principles, if realized, could have prevented these attacks. We also identify the research challenges that stand in the way of implementing these principles in agents.

2605.17618 2026-05-21 cs.AI 版本更新

Prediction of Challenging Behaviors Associated with Profound Autism in a Classroom Setting Using Wearable Sensors

使用可穿戴传感器预测课堂环境中与深度自闭症相关的行为问题

Yadhu Kartha, Conor Anderson, Jenny Foster, Theresa Hamlin, Johanna Lantz, Ryan Lay, Juergen Hahn, Gari D. Clifford, Hyeokhyen Kwon

发表机构 * Georgia Institute of Technology(佐治亚理工学院) The Center For Discovery(发现中心) Rensselaer Polytechnic Institute(伦斯勒理工学院) Georgia Institute of Technology, Emory University(佐治亚理工学院与埃默里大学)

AI总结 本研究通过可穿戴传感器和机器学习方法,在真实课堂环境中预测自闭症深度患者的行为问题,展示了在教育环境中提前10分钟预测行为问题的可行性,并实现了AUC-ROC为0.78的准确率。

详情
AI中文摘要

自闭症谱系障碍(ASD)以社交互动和沟通困难以及思维和行为的限制或重复模式为特征,表现具有显著变化性。大约四分之一的ASD儿童被归类为深度自闭症,这些患者常常表现出自我伤害行为、攻击性、逃跑或口欲症等具有严重安全风险的行为,这些行为会干扰教育环境中的学习。先前的工作已应用可穿戴传感器和机器学习来检测这些行为,但大多局限于受控的实验室环境。本工作证明了在真实世界特殊教育课堂中预测这些行为事件是可行的。我们收集了约110.7小时的标记多模态可穿戴数据,包括加速度计、电导活动(EDA)和皮肤温度数据,来自10至21岁的9名儿童和年轻人,在标准课堂会话中。我们微调了最先进的多模态可穿戴时间序列分析基础模型,并展示了可以提前10分钟预测行为事件,AUC-ROC为0.78。这些结果为开发能够帮助教师减少特殊教育课堂中行为问题安全风险的主动干预系统奠定了坚实的基础。

英文摘要

Autism Spectrum Disorder (ASD) is characterized by challenges with social interaction and communication and by restricted or repetitive patterns of thought and behavior, with significant variability in presentation. Approximately a quarter of children with ASD are classified as having profound autism, who often exhibit challenging behaviors, such as self-injurious behavior, aggression, elopement, or pica, that pose serious safety risks and disrupt learning in educational settings. Prior work has applied wearable sensors and machine learning to detect challenging behaviors, but has been largely confined to controlled laboratory environments. This work demonstrates that predicting challenging behavior episodes is feasible in a real-world special education classroom. We collected approximately 110.7 hours of labeled multimodal wearable data comprising accelerometry, electrodermal activity (EDA), and skin temperature from 9 children and young adults aged 10 to 21 years across standard classroom sessions. We fine-tuned state-of-the-art foundation models for multimodal wearable time-series analysis and show that challenging behavior episodes can be predicted up to 10 minutes in advance with an AUC-ROC of 0.78. These results establish a concrete foundation for developing proactive in-class intervention systems that enable teachers to minimize the safety risks of challenging behaviors in special education classrooms

2605.15156 2026-05-21 cs.CL cs.AI cs.LG 版本更新

MeMo: Memory as a Model

MeMo:记忆作为模型

Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong, Arun Verma, Alok Prakash, Nancy F. Chen, Bryan Kian Hsiang Low, Daniela Rus, Armando Solar-Lezama

发表机构 * Institute of Data Science, National University of Singapore(数据科学研究院,新加坡国立大学) Integrative Sciences and Engineering Programme, NUSGS(整合科学与工程计划,NUSGS) Agency for Science, Technology, Research (A*STAR)(科技研究局(A*STAR)) Department of Computer Science, National University of Singapore(计算机科学系,新加坡国立大学) University of Tokyo(东京大学) Liquid AI CSAIL, Massachusetts Institute of Technology(CSAIL,麻省理工学院) AI Singapore Singapore-MIT Alliance for Research and Technology Centre, Singapore(新加坡-麻省理工学院研究与技术中心,新加坡)

AI总结 本文提出MeMo框架,通过在不改变LLM参数的情况下将新知识编码到专用记忆模型中,解决了大型语言模型在需要及时领域特定信息的应用中的问题,同时具备处理复杂跨文档关系、抗检索噪声、避免灾难性遗忘、无需访问LLM权重或输出logits以及检索成本与语料库大小无关等优势。

Comments MeMo augments any LLM with up-to-date or domain-specific knowledge via a trained memory model, avoiding costly retraining, mitigating catastrophic forgetting, and remaining robust to retrieval noise

详情
AI中文摘要

大型语言模型(LLMs)在广泛的任务上表现出色,但预训练后保持冻结状态,直到后续更新。许多现实应用需要及时、领域特定的信息,这促使需要高效的机制来整合新知识。在本文中,我们介绍MeMo(Memory as a Model),一个模块化框架,能够将新知识编码到专用的记忆模型中,同时保持LLM参数不变。与现有方法相比,MeMo具有几个优势:(a)它能够捕捉复杂的跨文档关系;(b)它对检索噪声具有鲁棒性;(c)它避免了LLM中的灾难性遗忘;(d)它不需要访问LLM的权重或输出logits,从而能够与开源和专有闭源LLM进行即插即用式集成;(e)其检索成本在推理时间与语料库大小无关。我们在三个基准测试集BrowseComp-Plus、NarrativeQA和MuSiQue上的实验结果表明,MeMo在多种设置中相比现有方法表现优异。

英文摘要

Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

2605.12597 2026-05-21 cond-mat.dis-nn cond-mat.stat-mech cs.AI cs.LG physics.comp-ph 版本更新

The critical slowing down in diffusion models

扩散模型中的临界减慢现象

Luca Maria Del Bono, Giulio Biroli, Patrick Charbonneau, Marylou Gabrié

发表机构 * Laboratoire de Physique Statistique, École normale supérieure, PSL Research University(统计物理实验室,巴黎高等师范大学,PSL研究大学) Department of Physics, Duke University(杜克大学物理系) Department of Chemistry, Duke University(杜克大学化学系)

AI总结 本文研究了扩散模型在统计场理论O(n)模型中的应用,揭示了训练过程中参数学习的临界减慢现象,并通过引入局部得分近似方法,展示了通过适当架构设计可以克服这一现象,为统计物理中的采样方法提供了可控的改进框架。

Comments 17 pages, 8 figures

详情
AI中文摘要

计算采样自20世纪中叶以来一直是科学的核心。尽管基于机器学习的方法最近取得了重大进展,但其行为仍缺乏深入理解,理论上对何时以及为何成功控制有限。本文通过分析扩散模型在统计场理论O(n)模型的高斯极限n→∞下的应用,提供了对扩散模型的深入见解。在这一可分析的设置中,我们展示了训练一个具有单层网络架构的得分模型时,参数学习会出现临界减慢现象。这种减慢也影响生成过程,表明即使对于学习生成模型,接近临界点的采样困难仍然存在。为克服这一瓶颈,我们展示了通过结合架构深度与物理局部性可以提升性能。我们发现使用双层架构可以显著减少临界减慢,训练时间与系统规模的关系从二次方变为对数。通过引入局部得分近似,我们证明这种训练时间的加速可以在不增加神经网络参数数量的情况下实现。总体而言,这些结果表明扩散模型可以通过适当的架构设计克服临界减慢现象,并为统计物理及其他领域中的学习采样方法建立了可控的改进框架。

英文摘要

Computational sampling has been central to the sciences since the mid-20th century. While machine-learning-based approaches have recently enabled major advances, their behavior remains poorly understood, with limited theoretical control over when and why they succeed. Here we provide such insight for diffusion models-a class of generative schemes highly effective in practice-by analyzing their application to the $O(n)$ model of statistical field theory in the Gaussian limit $n \to \infty$. In this analytically tractable setting, we show that training a score model with a one-layer network architecture matching the exact solution exhibits a form of critical slowing down in parameter learning. This slowing down also impacts the generation process, indicating that the well-known difficulties of sampling near criticality persist even for learned generative models. To overcome this bottleneck, we demonstrate the power of combining architectural depth with physical locality. We find that using a two-layer architecture drastically reduces the critical slowing down, with the training time scaling logarithmically rather than quadratically with system size. By introducing a local score approximation we show that this acceleration in training time can be achieved without increasing the number of neural network parameters. Taken together, these results demonstrate that diffusion models can overcome the critical slowing down through appropriate architectural design, and establish a controlled framework for understanding and improving learned sampling methods in statistical physics and beyond.

2605.09492 2026-05-21 cs.CL cs.AI 版本更新

APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation

APCD:自适应路径对比解码用于可靠的大型语言模型生成

Tianyu Zheng, Hong Wu, Jiaji Zhong

AI总结 本文提出了一种自适应路径对比解码方法,通过自适应探索和受控路径交互提高大型语言模型生成的可靠性,实验表明在保持解码效率的同时提升了事实准确性。

Comments This paper has been withdrawn by the author to resolve a conflict of interest/compliance issue

详情
AI中文摘要

大型语言模型(LLMs)常常由于自回归解码中误差累积而产生幻觉,其中次优的早期token选择会误导后续生成。尽管多路径解码通过探索替代轨迹来提高鲁棒性,但现有方法缺乏确定何时分支和如何调节路径交互的系统策略。我们提出自适应路径对比解码(APCD),一种多路径解码框架,通过自适应探索和受控路径交互提高输出可靠性。APCD包含两个组成部分:(1)熵驱动路径扩展,延迟分支直到预测不确定性-通过香农熵测量在顶级候选token上的不确定性-表明存在多个可能的延续;以及(2)发散意识路径对比,鼓励多样化的推理轨迹同时动态减弱路径间的影响,随着预测分布发散。在八个基准测试中的实验显示,在保持解码效率的同时提高了事实准确性。我们的代码可在https://github.com/zty-king/APCD上获得。

英文摘要

Large language models (LLMs) often suffer from hallucinations due to error accumulation in autoregressive decoding, where suboptimal early token choices misguide subsequent generation. Although multi-path decoding can improve robustness by exploring alternative trajectories, existing methods lack principled strategies for determining when to branch and how to regulate inter-path interactions. We propose Adaptive Path-Contrastive Decoding (APCD), a multi-path decoding framework that improves output reliability through adaptive exploration and controlled path interaction. APCD consists of two components: (1) Entropy-Driven Path Expansion, which delays branching until predictive uncertainty - measured by Shannon entropy over top candidate tokens - indicates multiple plausible continuations; and (2) Divergence-Aware Path Contrast, which encourages diverse reasoning trajectories while dynamically attenuating inter-path influence as prediction distributions diverge. Experiments on eight benchmarks demonstrate improved factual accuracy while maintaining decoding efficiency. Our code is available at https://github.com/zty-king/APCD.

2604.27195 2026-05-21 cs.AI 版本更新

Evaluating TabPFN for Mild Cognitive Impairment to Alzheimer's Disease Conversion in Data Limited Settings

评估TabPFN在数据有限环境下对轻度认知障碍向阿尔茨海默病转化的预测能力

Brad Ye, Bulent Soykan, Gulsah Hancerliogullari Koksalmis, Hsin-Hsiung Huang, Laura J. Brattain

发表机构 * 1 Department of Medicine, University of Central Florida College of Medicine, Orlando, FL, USA 2 Department of Mechanical, Industrial \& Manufacturing Eng., The University of Toledo, Toledo, OH, USA 3 Department of Industrial Eng. Mngt. Systems, University of Central Florida, Orlando, FL, USA 4 Department of Industrial Engineering, Istanbul Technical University, Istanbul, Turkey 5 School of Data, Mathematical Statistical Sciences, University of Central Florida, Orlando, FL, USA

AI总结 本文评估了TabPFN在数据有限环境下对轻度认知障碍向阿尔茨海默病转化的预测能力,通过比较传统机器学习方法,发现TabPFN在低数据量情况下表现更优,特别是在训练样本仅为50时仍能保持较高的AUC值。

Comments 6 pages, 3 figures

详情
AI中文摘要

准确预测轻度认知障碍(MCI)向阿尔茨海默病(AD)的转化对于早期干预至关重要,然而由于纵向数据有限,开发可靠的预测模型具有挑战性。我们评估了TabPFN(表格预训练基础网络)在使用TADPOLE数据集(源自ADNI)预测3年MCI至AD转化的性能,该数据集包含来自人口统计学、APOE4、MRI体积、CSF标记物和PET成像的多模态生物标志物特征。我们进行了不同训练集大小(N=50到1000)和模型(包括XGBoost、随机森林、LightGBM和逻辑回归)的实验比较。TabPFN在AUC=0.892时表现最佳,优于LightGBM(AUC=0.860),并在低数据情况下具有优势。在N=50训练样本时,TabPFN保持了强AUC,而传统机器学习模型在小样本时表现不佳。这些发现表明,基础模型在数据有限的情况下对疾病预测具有前景,如阿尔茨海默病。

英文摘要

Accurate prediction of conversion from Mild Cognitive Impairment (MCI) to Alzheimers Diseases (AD) is essential for early intervention, however, developing reliable conversion predictive models is difficult to develop due to limited longitudinal data availability We evaluate TabPFN (Tabular Pre-Trained Foundation Network) against traditional machine learning methods for predicting 3 year MCI to AD conversion using the TADPOLE dataset derived from ADNI. Using multimodal biomarker features extracted from demographics, APOE4, MRI volumes, CSF markers, and PET imaging, we conducted an experimental comparison across varying training set sizes (N=50 to 1000) and models including XGBoost, Random Forest, LightGBM, and Logistic Regression. TabPFN achieved one the highest performance (AUC=0.892), outperforming LightGBM (AUC=0.860) and demonstrating advantages in low data settings. At N=50 training samples, TabPFN maintained strong AUC while the traditional machine learning models struggles at small training samples. These findings demonstrate that foundation models are promising for disease prediction in data limited scenarios, such as Alzheimers diseases.

2603.26603 2026-05-21 cs.SE cs.AI cs.LG 版本更新

Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

可持续性并非线性:在设备智能中量化性能、能耗和隐私的权衡

Eziyo Ehsani, Luca Giamattei, Ivano Malavolta, Roberto Pietrantuono

发表机构 * University of Naples Federico II(那不勒斯费德里科二世大学) Vrije Universiteit Amsterdam(阿姆斯特丹自由大学)

AI总结 本文研究了将大语言模型从云集群迁移到边缘设备过程中性能、能耗和隐私之间的权衡,通过实验证明模型架构对电池寿命的影响大于量化方案,并发现中等大小模型在响应质量和可持续能耗之间达到最佳平衡。

Comments Under review at Empirical Software Engineering (EMSE)

详情
AI中文摘要

将大型语言模型(LLMs)从云集群迁移到边缘设备有望提高隐私性和离线访问性,但这一转变面临严峻现实:移动电池的物理限制、热限制以及最重要的是内存限制。为了应对这一挑战,我们构建了一个可复现的实验管道,用于分析移动设备上LLMs的能耗、延迟和质量之间的复杂相互作用。我们利用该管道对旗舰Android设备进行了实证案例研究,捕捉了从0.5B到9B参数的八个LLMs的细粒度指标,无需root权限,确保我们的发现反映了现实用户条件。研究结果突显了生成质量、性能、功率和资源消耗之间的权衡,揭示了哪些LLMs在不同条件下提供了最佳平衡。此外,我们发现了一个反直觉的量化能耗悖论:虽然现代重要性感知量化能够减少内存占用以适应更大的模型到RAM,但我们发现其能耗节省与标准混合精度方法相比微不足道。这证明了对于电池寿命而言,模型架构而非其量化方案是决定性因素。我们进一步发现,专家混合(MoE)架构违背了标准大小能耗趋势,提供了7B模型的存储容量,同时保持了1B到2B模型的较低能耗。最后,对这些多目标权衡的分析揭示了中等大小模型(如Qwen2.5-3B)的务实平衡点,这些模型在响应质量和可持续能耗之间实现了有效平衡。

英文摘要

The migration of Large Language Models (LLMs) from cloud clusters to edge devices promises enhanced privacy and offline accessibility, but this transition encounters a harsh reality: the physical constraints of mobile batteries, thermal limits, and, most importantly, memory constraints. To navigate this landscape, we constructed a replicable and reproducible experimental pipeline to profile the complex interplay between energy consumption, latency, and quality of LLMs on mobile devices. We harness this pipeline to conduct an empirical case study on a flagship Android device, capturing granular metrics across eight LLMs ranging from 0.5B to 9B parameters without requiring root access, ensuring our findings reflect realistic user conditions. The findings highlight the trade-offs between generation quality, performance, power and resource consumption, revealing which LLMs offer the best balance across metrics and under different conditions. Besides, we uncovered a counter-intuitive quantization energy paradox: while modern importance-aware quantization successfully reduces memory footprints to fit larger models into RAM, we found it yields negligible energy savings compared to standard mixed-precision methods. This proves that for battery life, the architecture of the model, not its quantization scheme, is the decisive factor. We further identified that Mixture-of-Experts (MoE) architectures defy the standard size-energy trend, offering the storage capacity of a 7B model while maintaining the lower energy profile of a 1B to 2B model. Finally, an analysis of these multi-objective trade-offs reveals a pragmatic sweet spot of mid-sized models, such as Qwen2.5-3B, that effectively balance response quality with sustainable energy consumption.

2603.10139 2026-05-21 cs.CL cs.AI cs.CC cs.FL 版本更新

The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory

生成-识别不对称性:形式语言理论中根本分裂的六个维度

Romain Peyrichou

发表机构 * Independent Researcher(独立研究者)

AI总结 本文探讨了生成和识别在形式语言理论中的不对称性,通过六个维度分析生成和识别之间的差异,并指出生成和识别在计算复杂性、歧义性、方向性、信息可用性、语法推断和时间性等方面存在显著差异。

Comments Submitted to Information and Computation. 32 pages, 6 figures, 4 tables

详情
AI中文摘要

每种形式文法都定义了一种语言,并且原则上可以以三种方式使用:生成字符串(产生)、识别它们(解析)或者在仅有示例的情况下推断出文法本身(文法归纳)。生成和识别在扩展上是等价的——它们描述相同的集合——但在多个独立的方面操作上是不对称的。归纳是一个更复杂的问题:它没有访问已知文法的途径。尽管这个三元组在编译器设计、自然语言处理和形式语言理论中至关重要,但尚未有综述将其视为统一的多维现象。我们识别出六个维度,这些维度使生成和识别产生差异:计算复杂性、歧义性、方向性、信息可用性、文法归纳和时间性。我们证明了常见的“生成容易,解析困难”的描述是误导的:无约束生成是简单的,但受约束生成可以是NP难的。真正的不对称性在于解析总是受约束的(输入已知)而生成不需要。这两个维度——方向性和时间性——之前尚未被识别为生成-识别不对称性的维度。我们将时间维度与Hale(2001)和Levy(2008)的惊奇框架联系起来,认为惊奇正式化了生成者(惊奇=0)和预测在不确定性下的解析者(惊奇>0)之间的时间不对称性。我们回顾了自然语言处理中的双向系统,并观察到双向性已有五十年的历史,但尚未转移到大多数领域特定的应用中。最后,我们讨论了大型语言模型,它们在架构上统一了生成和识别,但在操作上保持了不对称性。

英文摘要

Every formal grammar defines a language and can in principle be used in three ways: to generate strings (production), to recognize them (parsing), or -- given only examples -- to infer the grammar itself (grammar induction). Generation and recognition are extensionally equivalent -- they characterize the same set -- but operationally asymmetric in multiple independent ways. Inference is a qualitatively harder problem: it does not have access to a known grammar. Despite the centrality of this triad to compiler design, natural language processing, and formal language theory, no survey has treated it as a unified, multidimensional phenomenon. We identify six dimensions along which generation and recognition diverge: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. We show that the common characterization "generation is easy, parsing is hard" is misleading: unconstrained generation is trivial, but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained (the input is given) while generation need not be. Two of these dimensions -- directionality and temporality -- have not previously been identified as dimensions of the generation-recognition asymmetry. We connect the temporal dimension to the surprisal framework of Hale (2001) and Levy (2008), arguing that surprisal formalizes the temporal asymmetry between a generator (surprisal = 0) and a parser that predicts under uncertainty (surprisal > 0). We review bidirectional systems in NLP and observe that bidirectionality has been available for fifty years yet has not transferred to most domain-specific applications. We conclude with a discussion of large language models, which architecturally unify generation and recognition while operationally preserving the asymmetry.

2602.16813 2026-05-21 cs.CL cs.AI 版本更新

Flow Map Language Models: One-step Language Modeling via Continuous Denoising

流映射语言模型:通过连续去噪实现一步语言建模

Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出了一种基于连续流的流映射语言模型,通过在one-hot token嵌入上构建连续流,实现了在质量和速度上优于离散扩散模型的性能,展示了连续流在离散模态生成建模中的潜力。

Comments 58 pages, 40 figures

详情
AI中文摘要

基于离散扩散的语言模型因其在生成速度上优于自回归模型而受到广泛关注。尽管具有潜力,这些模型通常在少量步骤范围内生成质量急剧下降,从而在实践中限制了速度的大幅提升。本文表明,基于连续流的一步语言模型在质量和速度上均优于离散扩散模型。重要的是,我们的连续公式定义了一个独特的流映射,可以直接学习以实现高效的少量步骤推断,这种结构在离散方法中不可用。在此设置中,我们展示了流及其关联的流映射可以通过简单的交叉熵目标学习,这些目标尊重数据的简单几何结构,并且我们识别了三种不同的流映射蒸馏选择,其性能在实践中进行了比较。利用这些见解,我们构建了一个流语言模型(FLM),该模型在One Billion Words(LM1B)和OpenWebText(OWT)数据集上与最先进的离散扩散基线相媲美。然后,我们将FLM蒸馏为流映射语言模型(FMLM),其一步生成超过了最近少量步骤离散扩散语言模型的8步质量。本文的工作挑战了广泛认为离散噪声过程是离散模态生成建模所必需的假设,并为大规模加速语言建模铺平了道路。代码可在https://github.com/david3684/flm获取。

英文摘要

Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. Despite their promise, these models typically produce samples whose quality sharply degrades in the few-step regime, preventing a dramatic speedup in practice. Here, we show that language models based on continuous flows over one-hot token embeddings can outperform discrete diffusion in both quality and speed. Importantly, our continuous formulation defines a unique flow map that can be learned directly for efficient few-step inference, a structure we show is unavailable to discrete methods. In this setting, we show that both the flow and its associated flow map can be learned with simple cross-entropy objectives that respect the simplex geometry of the data, and we identify three distinct choices for flow map distillation whose performance we compare in practice. Using these insights, we build a flow language model (FLM), a continuous flow that matches state-of-the-art discrete diffusion baselines on the One Billion Words (LM1B) and OpenWebText (OWT) datasets. We then distill FLM into a flow map language model (FMLM), whose one-step generation exceeds the 8-step quality of recent few-step discrete diffusion language models. Our work challenges the widely-held hypothesis that discrete noising processes are necessary for generative modeling over discrete modalities and paves the way toward accelerated language modeling at scale. Code is available at https://github.com/david3684/flm.

2602.08023 2026-05-21 cs.CR cs.AI cs.MA 版本更新

CTFExplorer: Evaluating LLM Offensive Agents Through Multi-Target Web CTF Benchmarking

CTFExplorer: 通过多目标网络CTF基准测试评估LLM进攻性代理

Nanda Rani, Kimberly Milner, Minghao Shao, Meet Udeshi, Haoran Xi, Venkata Sai Charan Putrevu, Saksham Aggarwal, Sandeep K. Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Muhammad Shafique, Ramesh Karri

发表机构 * CISPA - Helmholtz Center for Information Security(CISPA-海德堡信息安全研究中心) NYU Tandon School of Engineering(纽约大学坦顿工程学院) NYU Abu Dhabi(纽约大学阿布扎克分校) IIIT Hyderabad(海得拉巴印度理工学院)

AI总结 本文提出CTFExplorer基准测试,通过多目标网络CTF基准测试评估LLM进攻性代理,研究问题是如何在不确定环境下评估代理的战术推理能力,核心方法是引入多目标环境测试代理的探索、优先级和攻击链能力,主要贡献是开发了可评估代理行为的框架。

详情
AI中文摘要

现有的LLM基于进攻性安全代理的基准测试使用隔离的单目标设置,包含已知的易受攻击的服务和固定目标。它们有效测量了利用,但忽略了真实CTF参与者如何在未知表面上进行优先级排序、在不确定性下分配努力。当前的评估因此无法评估超越利用之外的战略推理。为了解决这个问题,我们引入了CTFExplorer,一个基准测试套件,将进攻性安全评估转向多目标设置,测试代理如何探索、优先级和连接攻击。CTFExplorer在一个环境中部署了40个基于网络的易受攻击的服务,代理必须自主发现、区分和利用目标,而无需预定义指导。我们还提出了一个反应性多代理设置作为参考代理框架,并开发了一个代理无关的评估框架,该框架记录了结构化的推理轨迹以进行细粒度评估。这使行为评估超越了二进制旗帜捕获,例如如何管理目标选择、处理失败的假设、在多个阶段协调以及提取安全情报。

英文摘要

Existing benchmarks for LLM-based offensive security agents use isolated, single-target setups with a known vulnerable service and fixed objective. They measure exploitation effectively, but miss how real Capture-the-Flag (CTF) participants triage unknown surfaces, prioritize targets, and allocate effort under uncertainty. Current evaluations therefore fail to assess strategic reasoning beyond exploitation alone. To address this, we introduce \textit{CTFExplorer}, a benchmark suite that shifts offensive security evaluation toward a multi-target setting, which tests how agents explore, prioritize, and chain attacks. CTFExplorer deploys 40 web-based vulnerable services within a single environment, where agents must autonomously discover, distinguish, and exploit targets without predefined guidance. We also present a reactive multi-agent setup as a reference agent framework and develop an agent-agnostic evaluation framework that records structured reasoning traces for fine-grained assessment. This enables behavioral evaluation beyond binary flag capture, such as how agents manage target selection, handle failed hypotheses, coordinate across multiple stages, and extract security intelligence.

2602.06358 2026-05-21 cs.CL cs.AI 版本更新

SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass

SHINE:一种可扩展的上下文超网络,用于在单次传递中将上下文映射到LoRA

Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, Muhan Zhang

发表机构 * Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院) Technion, NVIDIA(技术学院与NVIDIA) University of Oxford(牛津大学) School of Electronics Engineering(电子工程学院) Computer Science, Peking University(计算机科学,北京大学)

AI总结 本文提出SHINE,一种可扩展的上下文超网络,用于在单次传递中将多样且有意义的上下文映射到高质量的LoRA适配器,通过重用冻结LLM的自身参数和引入架构创新,克服了先前超网络的关键限制,以较少的参数实现了强大的表达能力。

详情
AI中文摘要

我们提出SHINE(可扩展的上下文超网络),一种可扩展的超网络,能够将多样且有意义的上下文映射到高质量的LoRA适配器,用于大型语言模型(LLMs)。通过在上下文超网络设计中重用冻结LLM的自身参数,并引入架构创新,SHINE克服了先前超网络的关键限制,以相对较少的参数实现了强大的表达能力。我们引入了预训练和指令微调流水线,并训练我们的超网络在单次前向传递中从多样且有意义的上下文中生成高质量的LoRA适配器。它在不进行微调的情况下更新LLM参数,并立即启用与上下文相关的复杂问答任务,而无需直接访问上下文,有效地将上下文知识转换为参数知识。我们的工作在各种任务上取得了出色的结果,相比基于SFT的LLM适应方法,大大节省了时间和计算和内存成本,并展示了良好的可扩展性潜力。我们的代码可在https://github.com/MuLabPKU/SHINE获取。

英文摘要

We propose SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLMs). By reusing the frozen LLM's own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at https://github.com/MuLabPKU/SHINE

2601.10191 2026-05-21 cs.AI 版本更新

How does downsampling affect needle electromyography signals? A generalisable workflow for understanding downsampling effects on high-frequency time series

降采样如何影响针式肌电信号?一种可推广的用于理解降采样对高频时间序列影响的工作流程

Mathieu Cherpitel, Janne Luijten, Thomas Bäck, Camiel Verhamme, Martijn Tannemaat, Anna Kononova

发表机构 * Leiden Institute of Advanced Computer Science(莱顿先进计算机科学研究所) Leiden University Medical Centre(莱顿大学医学中心) Department of Neurology(神经科) Amsterdam University Medical Centre(阿姆斯特丹大学医学中心)

AI总结 本文研究降采样对高频时间序列信号的影响,提出了一种可推广的工作流程,通过结合形状基失真度量和基于特征的机器学习模型分类结果,量化不同降采样算法和因素对波形完整性和预测性能的影响。

详情
AI中文摘要

自动分析针式肌电(nEMG)信号正逐渐成为一种支持神经肌肉疾病(NMDs)检测的工具,但这些信号的高采样率和异质性给基于特征的机器学习模型带来了显著的计算挑战,特别是接近实时分析。降采样提供了一个潜在的解决方案,但其对诊断信号内容和分类性能的影响仍不够清楚。本研究提出了一种系统评估降采样导致信息损失的工作流程。该工作流程结合了基于形状的失真度量与现有基于特征的机器学习模型的分类结果和特征空间分析,以量化不同降采样算法和因素对波形完整性和预测性能的影响。我们使用三类NMD分类任务来实验性评估该工作流程。我们展示了该工作流程如何识别保留诊断信息同时显著减少计算负载的降采样配置。形状基失真度量分析显示,基于形状的降采样算法优于标准降采样,因为它们更好地保留了峰值结构和整体信号形态。结果为选择能够支持接近实时nEMG分析的降采样配置提供了实用指导,并突显了一种可推广的工作流程,可用于在其他高频时间序列应用中平衡数据减少与模型性能。

英文摘要

Automated analysis of needle electromyography (nEMG) signals is emerging as a tool to support the detection of neuromuscular diseases (NMDs), yet the signals' high and heterogeneous sampling rates pose substantial computational challenges for feature-based machine-learning models, particularly for near real-time analysis. Downsampling offers a potential solution, but its impact on diagnostic signal content and classification performance remains insufficiently understood. This study presents a workflow for systematically evaluating information loss caused by downsampling in high-frequency time series. The workflow combines shape-based distortion metrics with classification outcomes from available feature-based machine learning models and feature space analysis to quantify how different downsampling algorithms and factors affect both waveform integrity and predictive performance. We use a three-class NMD classification task to experimentally evaluate the workflow. We demonstrate how the workflow identifies downsampling configurations that preserve diagnostic information while substantially reducing computational load. Analysis of shape-based distortion metrics showed that shape-aware downsampling algorithms outperform standard decimation, as they better preserve peak structure and overall signal morphology. The results provide practical guidance for selecting downsampling configurations that enable near real-time nEMG analysis and highlight a generalisable workflow that can be used to balance data reduction with model performance in other high-frequency time-series applications as well.

2510.15949 2026-05-21 q-fin.TR cs.AI 版本更新

ATLAS: Adaptive Trading with LLM AgentS Through Dynamic Prompt Optimization and Multi-Agent Coordination

ATLAS:通过动态提示优化和多智能体协调实现LLM智能体的自适应交易

Charidimos Papadakis, Angeliki Dimitriou, Giorgos Filandrianos, Maria Lymperaiou, Konstantinos Thomas, Giorgos Stamou

发表机构 * School of Electrical and Computer Engineering, AILS Laboratory(电气与计算机工程学院,AILS实验室) National Technical University of Athens(雅典国家技术大学)

AI总结 本文提出ATLAS框架,通过动态提示优化和多智能体协调,解决LLM在金融交易中的适应性问题,提升交易决策的鲁棒性和执行效率。

详情
AI中文摘要

大型语言模型在金融决策中展现出潜力,但将其作为自主交易代理存在根本性挑战:如何在奖励延迟和市场噪声干扰下适应指令,如何将异质信息流合成连贯决策,以及如何弥合模型输出与可执行市场行动之间的差距。本文提出ATLAS(Adaptive Trading with LLM AgentS),一个统一的多智能体框架,整合市场、新闻和公司基本面的结构化信息以支持稳健的交易决策。在ATLAS中,核心交易智能体在订单感知的动作空间中运作,确保输出对应可执行的市场订单而非抽象信号。该智能体可通过Adaptive-OPRO技术在交易中整合反馈,这是一种新颖的提示优化技术,通过动态适应提示并结合实时随机反馈,随着时间推移提高性能。在特定市场环境的股票研究和多个LLM家族中,Adaptive-OPRO consistently outperforms fixed prompts,而基于反思的反馈未能提供系统性增益。

英文摘要

Large language models show promise for financial decision-making, yet deploying them as autonomous trading agents raises fundamental challenges: how to adapt instructions when rewards arrive late and obscured by market noise, how to synthesize heterogeneous information streams into coherent decisions, and how to bridge the gap between model outputs and executable market actions. We present ATLAS (Adaptive Trading with LLM AgentS), a unified multi-agent framework that integrates structured information from markets, news, and corporate fundamentals to support robust trading decisions. Within ATLAS, the central trading agent operates in an order-aware action space, ensuring that outputs correspond to executable market orders rather than abstract signals. The agent can incorporate feedback while trading using Adaptive-OPRO, a novel prompt-optimization technique that dynamically adapts the prompt by incorporating real-time, stochastic feedback, leading to increasing performance over time. Across regime-specific equity studies and multiple LLM families, Adaptive-OPRO consistently outperforms fixed prompts, while reflection-based feedback fails to provide systematic gains.

2507.21168 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

多样化的大语言模型还是多样化的问题解释?那是集成的问题

Rafael Rosales, Santiago Miret

发表机构 * Intel Labs(英特尔实验室)

AI总结 本文比较了使用大语言模型回答二元问题的两种多样性方法:模型多样性和问题解释多样性,并发现问题解释多样性在集成准确性上表现更优。

Journal ref Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 5116-5128

详情
AI中文摘要

有效利用多样性已被证明可以提高各种机器学习模型,包括大语言模型(LLMs)的性能。然而,确定最有效的多样性使用方法仍是一个挑战。在本工作中,我们比较了两种用于使用LLMs回答二元问题的多样性方法:模型多样性,即多个模型回答相同的问题,以及问题解释多样性,即使用同一模型以不同方式 framing 相同的问题来回答。对于这两种情况,我们应用多数投票作为集成共识启发式方法来确定最终答案。我们的boolq、strategyqa和pubmedqa实验表明,问题解释多样性在集成准确性上始终优于模型多样性。此外,我们对GPT和LLaMa的分析表明,模型多样性通常产生在最佳和最差集成成员之间的结果,而没有明显的改进。

英文摘要

Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way of using diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions using LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on using the same model to answer the same question framed in different ways. For both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on boolq, strategyqa, and pubmedqa show that question interpretation diversity consistently leads to better ensemble accuracy compared to model diversity. Furthermore, our analysis of GPT and LLaMa shows that model diversity typically produces results between the best and the worst ensemble members without clear improvement.

2506.11060 2026-05-21 cs.SE cs.AI 版本更新

Code Researcher: Deep Research Agent for Large Systems Code and Commit History

Code Researcher: 用于大型系统代码和提交历史的深度研究代理

Ramneet Singh, Sathvik Joel, Abhav Mehrotra, Nalin Wadhwa, Ramakrishna B Bairi, Aditya Kanade, Nagarajan Natarajan

发表机构 * Microsoft Research(微软研究院)

AI总结 本文提出Code Researcher,一种用于大型系统代码和提交历史的深度研究代理,通过多步骤推理和全局上下文收集,有效解决系统代码崩溃修复问题,显著优于现有基线方法。

详情
AI中文摘要

基于大型语言模型(LLM)的编码代理在编码基准测试中表现出色,但在系统代码上的有效性仍待探索。由于系统代码的规模和复杂性,对系统代码库进行修改需要研究大量上下文,这些上下文来源于大型代码库及其庞大的提交历史。受近期深度研究代理进展的启发,我们设计了首个代码研究代理Code Researcher,并将其应用于生成补丁以缓解系统代码中的崩溃问题。Code Researcher通过多步骤推理,对代码的语义、模式和提交历史进行推理,以从代码库和提交历史中检索所有相关上下文。我们评估了Code Researcher在kBenchSyz基准测试中的表现,结果显示其显著优于强基线方法,使用OpenAI的GPT-4o模型时,崩溃解决率(CRR)达到48%,相比SWE-agent的31.5%和Agentless的31%。扩大采样预算至10条轨迹可将CRR提升至54%。Code Researcher对模型选择也具有鲁棒性,使用新模型Gemini 2.5-Flash时达到67%。通过在开源多媒体软件上的另一个实验,我们展示了Code Researcher的泛化能力,并进行了消融实验。我们的实验突显了对大型代码库进行全局上下文收集和多维推理的重要性。

英文摘要

Large Language Model (LLM)-based coding agents have shown promising results on coding benchmarks, but their effectiveness on systems code remains underexplored. Due to the size and complexities of systems code, making changes to a systems codebase requires researching about many pieces of context, derived from the large codebase and its massive commit history, before making changes. Inspired by the recent progress on deep research agents, we design the first deep research agent for code, called Code Researcher, and apply it to the problem of generating patches to mitigate crashes reported in systems code. Code Researcher performs multi-step reasoning about semantics, patterns, and commit history of code to retrieve all relevant context from the codebase and its commit history. We evaluate Code Researcher on kBenchSyz, a benchmark of Linux kernel crashes, and show that it significantly outperforms strong baselines, achieving a crash-resolution rate (CRR) of 48%, compared to 31.5% by SWE-agent and 31% by Agentless, using OpenAI's GPT-4o model. Scaling up sampling budget to 10 trajectories increases Code Researcher's CRR to 54%. Code Researcher is also robust to model choices, reaching 67% with the newer Gemini 2.5-Flash model. Through another experiment on an open-source multimedia software, we show the generalizability of Code Researcher and also conduct ablations. Our experiments highlight the importance of global context gathering and multi-faceted reasoning for large codebases.

2503.13549 2026-05-21 cs.SE cs.AI 版本更新

A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks

ChatGPT与DeepSeek在解决编程任务中的对决

Ronas Shakya, Sam Urmian, Mohammad Khalil

发表机构 * Centre for the Science of Learning & Technology (SLATE)(学习科学中心及技术(SLATE)) University of Bergen(卑尔根大学)

AI总结 本文评估了ChatGPT和DeepSeek在解决编程任务中的性能,发现ChatGPT在中等难度任务中表现更优,而两者在困难任务上均面临挑战。

详情
AI中文摘要

大语言模型(LLMs)的发展为AI辅助编程工具创造了竞争环境。本研究评估了ChatGPT 03-mini和DeepSeek-R1在Codeforces上解决编程任务的能力。使用三个难度级别的29个编程任务,我们通过接受的解决方案、内存效率和运行时间性能评估了两种模型的表现。我们的结果表明,尽管两者在简单任务上表现相似,但ChatGPT在中等难度任务中表现更优,成功率为54.5%,而DeepSeek为18.1%。两者在困难任务上均面临挑战,突显了LLMs在处理高度复杂编程问题方面的持续挑战。这些发现突显了两种模型在能力和计算能力上的关键差异,为开发者和研究人员改进AI驱动的编程工具提供了有价值的见解。

英文摘要

The advancement of large language models (LLMs) has created a competitive landscape for AI-assisted programming tools. This study evaluates two leading models: ChatGPT 03-mini and DeepSeek-R1 on their ability to solve competitive programming tasks from Codeforces. Using 29 programming tasks of three levels of easy, medium, and hard difficulty, we assessed the outcome of both models by their accepted solutions, memory efficiency, and runtime performance. Our results indicate that while both models perform similarly on easy tasks, ChatGPT outperforms DeepSeek-R1 on medium-difficulty tasks, achieving a 54.5% success rate compared to DeepSeek 18.1%. Both models struggled with hard tasks, thus highlighting some ongoing challenges LLMs face in handling highly complex programming problems. These findings highlight key differences in both model capabilities and their computational power, offering valuable insights for developers and researchers working to advance AI-driven programming tools.

2501.01793 2026-05-21 cs.LG cs.AI 版本更新

Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation

创建从未存在过的虚拟学生:利用大型语言模型和CTGANs进行合成数据生成

Mohammad Khalil, Sam Urmian, Ronas Shakya, Qinyi Liu

发表机构 * Centre for the Science of Learning & Technology (SLATE)(学习科学与技术中心(SLATE)) University of Bergen(卑尔根大学)

AI总结 本文研究了利用生成对抗网络(GANs)和大型语言模型(LLMs)生成合成表格数据的潜力,探讨了通过合成数据创建虚拟学生以服务于学习分析模型的可能性,并评估了不同生成模型的性能。

详情
AI中文摘要

在本研究中,我们探索了人工智能和深度学习技术,特别是生成对抗网络(GANs)和大型语言模型(LLMs)在生成合成表格数据方面的成长潜力。获取高质量学生数据对于推进学习分析至关重要,但隐私问题和全球更严格的数据保护法规限制了其可用性和使用。合成数据提供了一个有前途的替代方案。我们探讨了是否可以利用合成数据来创建虚拟学生以服务于学习分析模型。使用流行的GAN模型CTGAN和三种LLMs-GPT2、DistilGPT2和DialoGPT,我们生成了合成的表格学生数据。我们的结果表明,这些方法具有强大的潜力,能够生成高质量的合成数据集,与真实学生数据相似。为了验证我们的发现,我们应用了一套全面的效用评估指标来评估合成数据的统计和预测性能,并比较了不同生成模型,特别是LLMs的性能。本研究旨在为学习分析社区提供有价值的见解,为扩展学习分析领域的方法学工具箱提供新的创新方法。

英文摘要

In this study, we explore the growing potential of AI and deep learning technologies, particularly Generative Adversarial Networks (GANs) and Large Language Models (LLMs), for generating synthetic tabular data. Access to quality students data is critical for advancing learning analytics, but privacy concerns and stricter data protection regulations worldwide limit their availability and usage. Synthetic data offers a promising alternative. We investigate whether synthetic data can be leveraged to create artificial students for serving learning analytics models. Using the popular GAN model CTGAN and three LLMs- GPT2, DistilGPT2, and DialoGPT, we generate synthetic tabular student data. Our results demonstrate the strong potential of these methods to produce high-quality synthetic datasets that resemble real students data. To validate our findings, we apply a comprehensive set of utility evaluation metrics to assess the statistical and predictive performance of the synthetic data and compare the different generator models used, specially the performance of LLMs. Our study aims to provide the learning analytics community with valuable insights into the use of synthetic data, laying the groundwork for expanding the field methodological toolbox with new innovative approaches for learning analytics data generation.

2501.01785 2026-05-21 cs.LG cs.AI cs.CY 版本更新

Can Synthetic Data be Fair and Private? A Comparative Study of Synthetic Data Generation and Fairness Algorithms

合成数据能否公平且隐私?合成数据生成与公平性算法的比较研究

Qinyi Liu, Oscar Deho, Sam Urmian, Mohammad Khalil, Srecko Joksimovic, George Siemens

发表机构 * Centre for the Science of Learning & Technology (SLATE), University of Bergen(学习科学与技术中心(SLATE),卑尔根大学) University of South Australia(澳大利亚南澳大利亚大学)

AI总结 本研究探讨了合成数据生成与公平性算法在平衡隐私和公平性方面的效果,发现DECAF算法在隐私和公平性之间取得最佳平衡,但其预测准确性较低,而对合成数据应用预处理公平算法能进一步提升公平性。

详情
AI中文摘要

随着机器学习在学习分析(LA)中的广泛应用,算法公平性和隐私问题引发了广泛关注。合成数据作为一种双重用途工具,能够增强LA模型的隐私性和公平性。然而,先前研究指出公平性与隐私之间存在反比关系,使同时优化两者变得困难。本研究探讨了哪些合成数据生成器能最好地平衡隐私和公平性,并确定预处理公平算法(通常应用于真实数据集)在合成数据上的有效性。我们的结果表明,DEbiasing CAusal Fairness(DECAF)算法在隐私和公平性之间取得了最佳平衡。然而,DECAF在实用性上表现不佳,这体现在其预测准确性上。值得注意的是,我们发现将预处理公平算法应用于合成数据时,公平性提升幅度比应用于真实数据时更大。这些发现表明,结合合成数据生成与公平性预处理可以为创建更公平的LA模型提供有前途的方法。

英文摘要

The increasing use of machine learning in learning analytics (LA) has raised significant concerns around algorithmic fairness and privacy. Synthetic data has emerged as a dual-purpose tool, enhancing privacy and improving fairness in LA models. However, prior research suggests an inverse relationship between fairness and privacy, making it challenging to optimize both. This study investigates which synthetic data generators can best balance privacy and fairness, and whether pre-processing fairness algorithms, typically applied to real datasets, are effective on synthetic data. Our results highlight that the DEbiasing CAusal Fairness (DECAF) algorithm achieves the best balance between privacy and fairness. However, DECAF suffers in utility, as reflected in its predictive accuracy. Notably, we found that applying pre-processing fairness algorithms to synthetic data improves fairness even more than when applied to real data. These findings suggest that combining synthetic data generation with fairness pre-processing offers a promising approach to creating fairer LA models.

2411.09593 2026-05-21 eess.IV cs.AI cs.CV 版本更新

SMILE-UHURA Challenge -- Small Vessel Segmentation at Mesoscopic Scale from Ultra-High Resolution 7T Magnetic Resonance Angiograms

SMILE-UHURA挑战 -- 从超高分辨率7T磁共振血管造影中进行微血管分割

Soumick Chatterjee, Hendrik Mattern, Marc Dörner, Alessandro Sciarra, Florian Dubost, Hannes Schnurre, Rupali Khatun, Chun-Chih Yu, Tsung-Lin Hsieh, Yi-Shan Tsai, Yi-Zeng Fang, Yung-Ching Yang, Juinn-Dar Huang, Marshall Xu, Siyu Liu, Fernanda L. Ribeiro, Saskia Bollmann, Karthikesh Varma Chintalapati, Chethan Mysuru Radhakrishna, Sri Chandana Hudukula Ram Kumara, Raviteja Sutrave, Abdul Qayyum, Moona Mazher, Imran Razzak, Cristobal Rodero, Steven Niederren, Fengming Lin, Yan Xia, Jiacheng Wang, Riyu Qiu, Liansheng Wang, Arya Yazdan Panah, Rosana El Jurdi, Guanghui Fu, Janan Arslan, Ghislain Vaillant, Romain Valabregue, Didier Dormont, Bruno Stankoff, Olivier Colliot, Luisa Vargas, Isai Daniel Chacón, Ioannis Pitsiorlas, Pablo Arbeláez, Maria A. Zuluaga, Stefanie Schreiber, Oliver Speck, Andreas Nürnberger

发表机构 * Faculty of Computer Science, Otto von Guericke University Magdeburg(奥托·冯·格里克大学马格德堡分校计算机科学学院) Data and Knowledge Engineering Group, Otto von Guericke University Magdeburg(奥托·冯·格里克大学马格德堡分校数据与知识工程小组) Human Technopole(人类技术极地) Biomedical Magnetic Resonance, Otto von Guericke University Magdeburg(生物医学磁共振,奥托·冯·格里克大学马格德堡分校) Department of Neurology, Medical Faculty, University Hospital of Magdeburg(马格德堡大学医院医学系神经科) German Centre for Neurodegenerative Diseases(德国神经退行性疾病研究中心) Centre for Behavioural Brain Sciences, Magdeburg(行为脑科学中心,马格德堡) Department of Neurology, University Hospital Zurich(苏黎世大学医院神经科) Department of Consultation-Liaison-Psychiatry and Psychosomatic Medicine, University Hospital Zurich(苏黎世大学医院咨询-联络精神病学与心身医学科) Stanford University(斯坦福大学) Translational Radiobiology, Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg(转化放射生物学,放射肿瘤学部,埃尔兰根大学医院,埃尔兰根-纽伦堡弗里德里希-亚历山大大学) National Yang Ming Chiao Tung University(阳明交通大学) School of Electrical Engineering and Computer Science, University of Queensland(昆士兰大学电气工程与计算机科学学院) Australian eHealth Research Centre, CSIRO(澳大利亚eHealth研究中心,CSIRO) National Heart and Lung Institute, Faculty of Medicine, Imperial College London(英国伦敦帝国理工学院医学系国家心脏和肺研究所) Hawkes Institute, Department of Computer Science, University College London(霍克斯研究所,伦敦大学学院计算机科学系) School of Computer Science and Engineering, University of New South Wales(新南威尔士大学计算机科学与工程学院) Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates(阿布扎克穆罕默德·本·扎耶德人工智能大学) The Alan Turing Institute, London, UK(艾伦·图灵研究所,伦敦,英国) School of Computing, University of Leeds(利兹大学计算学院) Department of Computer Science at School of Informatics, Xiamen University(厦门大学信息学院计算机科学系) Manteia Technologies Co., Ltd, Xiamen, China(厦门Manteia技术有限公司) Leicester International Institute, Dalian University of Technology(大连理工大学利兹国际学院) Sorbonne Université, Institut du Cerveau - Paris Brain Institute(索邦大学,巴黎脑研究所) Centre of Formation and Research in Artificial Intelligence, Universidad de Los Andes, Colombia(智利洛斯安德斯大学人工智能培训与研究中心) Data Science Department, EURECOM, Sophia Antipolis, France(EURECOM数据科学系,法国索菲亚安蒂波利斯)

AI总结 该研究旨在解决公共标注数据集不足的问题,通过提供一个包含时间飞行血管造影的7T MRI标注数据集,评估了多种深度学习方法在微血管分割任务中的性能。

详情
AI中文摘要

人类大脑通过复杂的血管网络获取营养和氧气。影响微血管的病理状况是脑血供中的关键弱点,可能导致严重疾病,如小脑血管疾病。7特斯拉MRI系统的发展使得可以获得更高的空间分辨率图像,使能够可视化大脑中的这些血管。然而,缺乏公开可用的标注数据集阻碍了稳健的机器学习驱动分割算法的发展。为此,SMILE-UHURA挑战被组织起来。该挑战与2023年ISBI会议同期在哥伦比亚的加勒比海城市卡塔赫纳举行,旨在为相关研究领域研究人员提供一个平台。SMILE-UHURA挑战通过提供一个包含7T MRI获取的时间飞行血管造影的标注数据集,填补了公共标注数据集的空白。该数据集是通过自动预分割和大量手动精修相结合创建的。在本文中,十六种提交的方法和两个基线方法在两个不同的数据集上进行了定量和定性比较:一个是来自相同数据集的保留测试MRA(标签保密),另一个是单独的7T ToF MRA数据集(输入体积和标签均保密)。结果表明,大多数提交的深度学习方法在提供的训练数据集上训练后,实现了可靠的分割性能。Dice分数在相应数据集上达到了最高0.838±0.066和0.716±0.125,平均性能最高可达0.804±0.15。

英文摘要

The human brain receives nutrients and oxygen through an intricate network of blood vessels. Pathology affecting small vessels, at the mesoscopic scale, represents a critical vulnerability within the cerebral blood supply and can lead to severe conditions, such as Cerebral Small Vessel Diseases. The advent of 7 Tesla MRI systems has enabled the acquisition of higher spatial resolution images, making it possible to visualise such vessels in the brain. However, the lack of publicly available annotated datasets has impeded the development of robust, machine learning-driven segmentation algorithms. To address this, the SMILE-UHURA challenge was organised. This challenge, held in conjunction with the ISBI 2023, in Cartagena de Indias, Colombia, aimed to provide a platform for researchers working on related topics. The SMILE-UHURA challenge addresses the gap in publicly available annotated datasets by providing an annotated dataset of Time-of-Flight angiography acquired with 7T MRI. This dataset was created through a combination of automated pre-segmentation and extensive manual refinement. In this manuscript, sixteen submitted methods and two baseline methods are compared both quantitatively and qualitatively on two different datasets: held-out test MRAs from the same dataset as the training data (with labels kept secret) and a separate 7T ToF MRA dataset where both input volumes and labels are kept secret. The results demonstrate that most of the submitted deep learning methods, trained on the provided training dataset, achieved reliable segmentation performance. Dice scores reached up to 0.838 $\pm$ 0.066 and 0.716 $\pm$ 0.125 on the respective datasets, with an average performance of up to 0.804 $\pm$ 0.15.

2205.10995 2026-05-21 cs.DS cs.AI cs.DM cs.FL cs.LO 版本更新

From Width-Based Model Checking to Width-Based Automated Theorem Proving

从基于宽度的模型检验到基于宽度的自动定理证明

Mateus de Oliveira Oliveira, Sam Urmian

发表机构 * Stockholm University(斯德哥尔摩大学) University of Bergen(卑尔根大学)

AI总结 本文提出一个通用框架,将大量基于宽度的模型检验算法转换为用于测试图论猜想在有限宽度图类上有效性的算法,改进了理论上的上界。

Comments A preliminary version of this work was published in the proceedings of AAAI 2023

详情
AI中文摘要

在参数化复杂性理论领域,图宽度度量的研究与图上组合性质的基于宽度的模型检验算法的发展紧密相连。在本工作中,我们提出一个通用框架,将一大类基于宽度的模型检验算法转换为可用于测试图论猜想在有限宽度图类上有效性的算法。我们的框架是模块化的,并可以应用于几种已研究的图宽度度量,包括树宽和克lique宽。作为我们框架的定量应用,我们证明了对于几个长期存在的图论猜想,存在一个算法,其输入为一个数k,并在时间双指数于k^{O(1)}内正确判断该猜想是否在树宽不超过k的所有图上成立。这些上界,可以视为这些猜想在树宽不超过k的图类上的证明/反驳大小的上界,显著改进了之前使用现有技术得到的理论上界。

英文摘要

In the field of parameterized complexity theory, the study of graph width measures has been intimately connected with the development of width-based model checking algorithms for combinatorial properties on graphs. In this work, we introduce a general framework to convert a large class of width-based model-checking algorithms into algorithms that can be used to test the validity of graph-theoretic conjectures on classes of graphs of bounded width. Our framework is modular and can be applied with respect to several well-studied width measures for graphs, including treewidth and cliquewidth. As a quantitative application of our framework, we prove analytically that for several long-standing graph-theoretic conjectures, there exists an algorithm that takes a number $k$ as input and correctly determines in time double-exponential in $k^{O(1)}$ whether the conjecture is valid on all graphs of treewidth at most $k$. These upper bounds, which may be regarded as upper-bounds on the size of proofs/disproofs for these conjectures on the class of graphs of treewidth at most $k$, improve significantly on theoretical upper bounds obtained using previously available techniques.

2605.21258 2026-05-21 cs.RO cs.AI 版本更新

Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation

为机器人操作中的高效视觉表征学习结构潜在点

Yicheng Jiang, Jiaxu Wang, Junhao He, Zesen Gan, Junhao Li, Qiang Zhang, Jingkai Sun, Jiahang Cao, Mingyuan Sun, Xiangyu Yue, Qiming Shao

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) MMLab, The Chinese University of Hong Kong(香港中文大学MMLab) X-Humanoid Robots(X-Humanoid机器人) The University of Hong Kong(香港大学) Tsinghua University(清华大学)

AI总结 本文提出了一种新的预训练框架,通过学习混合表示-结构潜在点,结合隐式表示的表达能力和显式表示的结构先验,以提高机器人操作中的视觉表征效率和鲁棒性。

Journal ref International Conference on Robotics and Automation 2026

详情
AI中文摘要

当前基于3D感知的预训练方法在具身感知和操作中大多基于可微渲染框架,产生完全隐式神经场或完全显式几何基元。隐式表示虽然具有表达能力,但缺乏显式结构线索,而显式表示则保留几何信息但受到分辨率限制和泛化能力差的困扰。为了解决这些限制,我们提出了一种新的预训练框架,学习混合表示-结构潜在点。具体来说,我们将在点云自编码器的潜在空间中插入一个点-wise潜在变分自编码器,联合正则化点-wise特征和坐标向高斯先验。所得到的紧凑潜在保留了粗略的结构趋势,不编码精确几何,但捕捉了更丰富的粗糙形状和语义信息,有效结合了隐式表示的表达能力和显式表示的结构先验。此外,受先前工作的共享设计选择启发,我们开发了一种流线型、高效的3DGS基于渲染管道,故意保持轻量,提高效率的同时,让前端潜在模块有更大的表征能力。在RLBench、ManiSkill2和真实机器人平台上的大量评估显示,在任务成功率、样本效率和对视角和场景变化的鲁棒性方面均优于强基线。消融研究进一步确认了框架中每个组件对整体性能的重要性。

英文摘要

Current 3D-aware pretraining methods for embodied perception and manipulation are largely built on differentiable rendering frameworks, producing either fully implicit neural fields or fully explicit geometric primitives. Implicit representations, while expressive, lack explicit structural cues, whereas explicit ones preserve geometry but suffer from resolution limits and weak generalization. To address these limitations, we propose a novel pretraining framework that learns a hybrid representation-structural latent points. Specifically, we insert a point-wise latent variational autoencoder into the latent space of a point-cloud autoencoder, jointly regularizing point-wise features and coordinates toward a Gaussian prior. The resulting compact latent preserves coarse structural tendencies, which do not encode precise geometry but capture richer rough shape and semantic information, effectively combining the expressiveness of implicit representations with the structural priors of explicit ones. In addition, informed by shared design choices in prior work, we develop a streamlined, efficient 3DGS-based rendering pipeline that is deliberately kept lightweight, improving efficiency while leaving greater representational capacity to the front-end latent module. Extensive evaluations on RLBench, ManiSkill2, and a real-robot platform demonstrate consistent gains in task success, sample efficiency, and robustness to viewpoint and scene variations over strong baselines. Ablation studies further confirm that each component of our framework is critical to overall performance.

2605.21240 2026-05-21 cs.LG cs.AI 版本更新

APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents

APEX:自主策略探索用于自演化大语言模型代理

Yibo Li, Jiashuo Yang, Zhi Zheng, Zhiyuan Hu, Yuan Sui, Shizun Wang, Yufei He, Bryan Hooi

发表机构 * National University of Singapore(新加坡国立大学) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 本文提出APEX,一种用于自演化大语言模型代理的自主策略探索方法,通过构建和维护显式的策略空间来解决探索崩溃问题,并在多个基准测试中表现出色。

详情
AI中文摘要

LLM代理在广泛复杂的任务中表现出强大的性能,包括需要长时间决策的交互环境。但是这些代理在测试时间无法实时学习。自演化代理通过在多个回合中积累记忆和反思来解决这个问题,而不是要求模型权重更新。然而,这些代理常常面临探索崩溃的问题:随着记忆的增长,行为会集中在熟悉的高奖励惯例上,减少了发现更好替代品的机会。为了解决这个问题,我们提出了自主策略探索(APEX),通过策略图——一个具有先决条件依赖边的有向无环图来构建和维护显式的策略空间。在APEX中,分支发现通过证据支持的未探索方向扩展地图,而策略选择在规划过程中平衡探索和利用。在九个Jericho文本冒险游戏和WebArena(一个现实的网络交互基准)上进行评估,APEX优于所有基线。广泛的消融实验验证了每个组件的贡献,并展示了在不同设置中的鲁棒性,证明了APEX在自演化代理中的持续探索有效性。

英文摘要

LLM agents have shown strong performance across a wide range of complex tasks, including interactive environments that require long-horizon decision making. But these agents cannot learn on the fly at test time. Self-evolving agents address this by accumulating memory and reflection across episodes rather than requiring model-weight updates. However, these agents often suffer from exploration collapse: as memory grows, behavior concentrates around familiar high-reward routines, reducing the chance of discovering better alternatives. To address this problem, we propose Autonomous Policy EXploration (APEX), which builds and maintains an explicit strategy space through a strategy map-a directed acyclic graph of milestones with prerequisite dependency edges. In APEX, Fork Discovery expands the map with evidence-grounded unexplored directions, while Policy Selection balances exploration and exploitation during planning. Evaluated on nine Jericho text-adventure games and WebArena, a realistic web interaction benchmark, APEX outperforms all baselines. Extensive ablations validate each component's contribution and demonstrate robustness across diverse settings, demonstrating APEX's effectiveness for sustained exploration in self-evolving agents.

2605.21237 2026-05-21 cs.CV cs.AI 版本更新

RePCM: Region-Specific and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis

RePCM:区域特定和表型适应的双心室心脏运动合成

Xuan Yang, Xiaohan Yuan, Hao Li, Lingyu Chen, Yanan Liu, Lei Li

发表机构 * School of Biomedical Engineering, National University of Singapore, Singapore(新加坡国立大学生物医学工程学院) School of Automation, Southeast University, Nanjing, China(东南大学自动化学院) School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China(南京航空航天大学计算机科学与技术学院) School of Information Science and Engineering, Yunnan University, Kunming, China(云南大学信息科学与工程学院)

AI总结 本文提出RePCM方法,通过单帧双心室网格运动补全,利用区域特定和表型适应性来提升心脏运动合成的准确性,以应对心血管疾病导致的区域和疾病特异性差异。

Comments Early Accepted by MICCAI 2026. This is the author's submitted version. 10 pages, 3 figures

详情
AI中文摘要

心脏周期内的运动对于量化区域功能至关重要,并且强烈受到心血管疾病的影响。由于在实践中难以获得时间密集的网格序列,我们专注于利用更易获得的终舒张期帧来推断完整的周期序列。由于存在强区域和疾病特异性差异,传统方法常通过依赖生成模型来过度平滑数据,这些模型是为全球模式优化的。为了解决这个问题,我们提出了Region-Aware和Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis(RePCM)方法,用于单帧双心室网格运动补全。在第一阶段,重建网络学习顶点级别的运动描述符,聚类产生数据驱动的功能分区,提供显式的运动衍生区域结构。在第二阶段,Region-Specific Injection模块在条件VAE中强制执行掩码同步的区域交换,保留局部特定动态并限制跨区域混合。Phenotype-Adaptive Mixture-of-Experts先验条件于ED形状,使用解剖引导的提示来建模潜在运动趋势并捕捉跨疾病变化。在三个涵盖不同心血管疾病的数据集上的实验显示,在几何和功能指标上取得了持续的改进,并且区域特定动态的保护得到了改善。

英文摘要

Cardiac motion over a cardiac cycle is crucial for quantifying regional function and is strongly affected by cardiovascular diseases. Since temporally dense mesh sequences are difficult to obtain in practice, we focus on leveraging the more accessible end-diastolic frame to infer a full-cycle sequence. Due to strong regional and disease-specific differences, traditional methods often oversmooth the data by relying on generative models that are optimized for global patterns. To address this problem, we propose Region-Aware and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis (RePCM) for single frame Bi-ventricular mesh motion completion. In Stage I, a reconstruction network learns vertex wise motion descriptors and clustering yields a data driven functional partition, providing an explicit motion derived region structure. In Stage II, a Region-Specific Injection Module enforces masked, synchronized region exchange within a conditional VAE, preserving localized specific dynamics and restricting cross-region mixing. A Phenotype-Adaptive Mixture-of-Experts prior conditioned on ED shape uses anatomy-guided cues to model latent motion trends and capture inter-disease variability. Experiments on three datasets covering different cardiovascular diseases show consistent gains in geometric and functional metrics and improved preservation of region specific dynamics.

2605.21226 2026-05-21 cs.LG cs.AI 版本更新

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

OCTOPUS: 通过在最优平方误差量化下的八面体参数化优化Transformer的KV缓存

Mark Boss, Vikram Voleti, Simon Donné, Shimon Vainer

发表机构 * Stability AI

AI总结 OCTOPUS通过联合量化旋转坐标三元组,优化了Transformer的KV缓存,在保持内存带宽和足迹的同时,通过八面体参数化将方向映射到平方,并利用Lloyd-Max量化来实现非均匀的位分配,从而在各种数据类型中实现了优于现有旋转编码器的性能。

详情
AI中文摘要

关键值(KV)缓存是长上下文自回归推断中内存带宽和足迹的主要瓶颈。最近的旋转预条件编码器(TurboQuant, PolarQuant)表明,通过结构化的随机旋转后,再配合每个坐标轴的标量量化器,该量化器的边际分布具有解析性,可以近似达到KV压缩的最优解。OCTOPUS通过联合量化旋转坐标三元组进一步推进了这一范式。每个三元组的方向通过八面体参数化映射到平方,然后得到的两个坐标和三元组范数通过Lloyd-Max量化与实现匹配的边际分布进行量化。通过优化每个三元组的平方误差,得到的位分配严格非均匀,仅依赖于键的总维度。我们发现,在有限维的情况下,通过扫描找到的质量最优是恒定的,无论在我们测试的任何现实解码器中。该编码器是数据无关的、在线的,并且在给定种子的情况下是确定性的。在文本、视频和音频中,OCTOPUS在每个报告的比特宽度和指标上都匹配或超越了所有先前的旋转编码器,其优势随着比特数的减少而增加。此外,一个融合的Triton实现可以在不生成未压缩键的情况下实时重建键,因此编码器在解码时间上不会增加带宽或延迟。项目页面:https://octopus-quant.github.io/

英文摘要

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytically tractable marginal is a near-optimal recipe for KV compression. OCTOPUS advances this paradigm through joint quantization of rotated coordinate triplets. Each triplet's direction is mapped to a square via an octahedral parameterization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized against implementation-matched marginals. Optimizing the per-triplet squared error gives a strictly non-uniform bit allocation depending only on the total dimensionality of the keys. We find the finite-dimensional quality optimum with sweeps to be constant on every real decoder we test. The codec is data-oblivious, online, and deterministic given a seed. Across text, video, and audio, OCTOPUS matches or beats every prior rotation codec at every reported bit width and metric, with a lead that grows as bits drop for extreme compression. Furthermore, a fused Triton implementation reconstructs keys on the fly without materializing the uncompressed key, so the codec adds no decode-time bandwidth or latency over the existing dequantization. Project Page: https://octopus-quant.github.io/

2605.21225 2026-05-21 cs.LG cs.AI 版本更新

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

PREFINE: 基于偏好的隐式奖励和成本微调以实现安全对齐

Richa Verma, Bavish Kulur, Sanjay Chawla, Balaraman Ravindran

发表机构 * TCS Research, \ of CSE, IIT Madras India Department of Computing Science, \ of Alberta Canada Qatar Computing Research Institute, \ Bin Khalifa University Qatar Department of Data Science \& AI, Wadhwani School of Data Science \& AI, IIT Madras India TCS Research, \ of CSE, IIT Madras Department of Computing Science, \ of Alberta Qatar Computing Research Institute, \ Bin Khalifa University Department of Data Science \& AI, Wadhwani School of Data Science \& AI, IIT Madras

AI总结 该研究提出PREFINE方法,通过基于偏好的隐式奖励和成本微调,在连续控制环境中实现安全策略对齐,通过微调预训练强化学习策略以生成低成本行为同时保持高奖励。

Comments Accepted at AAMAS 2026 as a full paper

详情
AI中文摘要

我们解决了通过引入成本约束使预训练的强化学习(RL)策略安全意识的问题,而无需重新训练。虽然成本可以数值编码,但我们假设更一般的情况是当成本作为偏好提供时。给定一个奖励优化的策略和一个小的偏好(低成本)和不偏好(高成本)轨迹数据集,我们的目标是微调策略以生成低成本行为,同时保留高奖励。与标准RLHF在语言模型中不同,我们的设置涉及轨迹层面的偏好,在连续控制环境中。我们介绍了PREFINE:基于偏好的隐式奖励和成本微调以实现安全对齐,这是一种基于偏好的微调方法,将现在广泛用于LLM微调的直接偏好优化(DPO)适应到序列决策设置中。PREFINE构造策略采样的反事实轨迹以建立有意义的偏好对比,并联合优化奖励保留和安全对齐。实证上,PREFINE将约束违反和灾难性故障减少了超过60%,同时保持原始奖励行为。PREFINE生成的策略在显著提高数据和计算效率的情况下,实现了低成本、高奖励性能, bridging preference alignment和安全策略适应在连续域中。

英文摘要

We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be numerically encoded, we assume a more general setting is when costs are provided as preferences. Given a reward-optimized policy and a small dataset of preferred (low-cost) and dispreferred (high-cost) trajectories, our goal is to fine-tune the policy to generate low-cost behaviors while retaining high rewards. Unlike standard RLHF in language models, where preferences are defined over responses to the same prompt, our setting involves trajectory-level preferences in continuous control environments. We introduce PREFINE: Preference-based Implicit Reward and Cost Fine-Tuning for Safety Alignment which is a preference-based fine-tuning method that adapts Direct Preference Optimization (DPO), which is now widely used for LLM fine-tuning, to the sequential decision making setting. PREFINE constructs policy-sampled counterfactual trajectories to establish meaningful preference contrasts and jointly optimizes for reward retention and safety alignment. Empirically, PREFINE reduces constraint violations and catastrophic failures by over 60% while maintaining original reward behavior. PREFINE produces policies that achieve low-cost, high-reward performance with significantly improved data and computational efficiency compared to full offline RL or imitation learning, bridging preference alignment and safe policy adaptation in continuous domains.

2605.21224 2026-05-21 physics.optics cs.AI eess.SP 版本更新

Artificial Intelligence Reshapes Microwave Photonics

人工智能重塑微波光子学

Peng Li, Xihua Zou, Jia Ye, Wei Pan, Lianshan Yan

发表机构 * Center for Information Photonics and Communications, School of Information Science and Technology, Southwest Jiaotong University(信息光子与通信中心,信息科学与技术学院,西南交通大学) Key Laboratory of Photonic-Electronic Integration and Communication-Sensing Convergence, Southwest Jiaotong University(光电子集成与通信-传感融合重点实验室,西南交通大学)

AI总结 本文研究了人工智能如何推动微波光子学的发展,通过整合人工智能与微波光子学技术,实现了在信号生成、传输、处理和检测等方面的创新突破。

Comments 13 pages, 12 figures

详情
AI中文摘要

作为一项迅速发展的跨学科领域,微波光子学(MWP)通过整合微波和光子技术,为克服传统电子系统的根本带宽限制提供了颠覆性解决方案。通过利用光子技术固有的超宽带宽和低损耗特性,MWP实现了微波、毫米波和太赫兹信号的生成、传输、处理和检测。代表性突破包括全光微波雷达系统、带宽高达320 GHz的全光模拟-数字转换器,以及数据速率高达616 Gbit/s的全光无线通信系统。同时,人工智能的快速成长正在以前所未有的方式重塑科学研究、工程和日常生活,如AI用于科学/工程和AI合作者/助手。相应地,人工智能在微波光子学的各个方面产生了深远影响,从信号生成、传输到信号处理和检测。人工智能已经革新了MWP系统的 设计、仿真、制造、测试、部署和维护,实现了超越传统系统的自主操作和卓越效率。受这些进展的启发,本文综述论文提供了人工智能赋能微波光子学的首次全面概述,系统总结了最先进的进展,并为学术界和更广泛公众提供了见解。

英文摘要

As a rapidly emerging interdisciplinary field that intrinsically integrates microwave and photonics, microwave photonics (MWP) provides disruptive solutions to overcome the fundamental bandwidth of conventional electronic systems. By exploiting the inherently ultra-wide bandwidth and low-loss characteristics of photonic technologies, MWP enables the generation, transmission, processing, and detection of microwave, millimeter-wave, and terahertz signals. Representative breakthroughs include fully photonic microwave radar systems, photonic analog-to-digital converters with bandwidth up to 320 GHz, and photonic wireless communication systems achieving data rate as high as 616 Gbit/s. Meanwhile, the rapid growth of artificial intelligence (AI) is reshaping scientific research, engineering, and daily life in unprecedented ways, such as AI for science/engineering and AI co-scientist/assistant. Correspondingly, AI is profoundly reshaping MWP in all aspects, ranging from signal generation, transmission to signal processing and detection. AI has revolutionized the design, simulation, fabrication, testing, deployment, and maintenance of MWP systems, delivering autonomous operation and exceptional efficiency beyond traditional systems. Motivated by these developments, this Review Paper provides the first comprehensive overview of AI-enabled MWP, systematically summarizing the state-of-the-art advances and presenting insights for both the academic community and the broader public.

2605.21213 2026-05-21 quant-ph cs.AI cs.LG math.OC 版本更新

Enhanced Reinforcement Learning-based Process Synthesis via Quantum Computing

通过量子计算增强的强化学习过程合成

Austin Braniff, Fengqi You, Yuhe Tian

发表机构 * Department of Chemical and Biomedical Engineering, West Virginia University(西弗吉尼亚大学化学与生物医学工程系) R.F. Smith School of Chemical and Biomolecular Engineering, Cornell University(康奈尔大学R.F. Smith化学与生物分子工程学院)

AI总结 本文提出了一种基于量子强化学习的过程合成方法,通过构建通用框架将过程合成问题形式化为马尔可夫决策过程,并引入量子增强的强化学习算法以提高可扩展性,同时通过经典强化学习作为基准进行比较,展示了量子方法在过程合成中的竞争力。

详情
AI中文摘要

在本文中,我们提出量子强化学习(RL)作为解决过程合成问题的策略。基于我们先前的工作,我们开发了一个通用框架,将过程合成正式化为马尔可夫决策过程,并引入量子增强的强化学习算法来解决它,从而提高了可扩展性。早期的量子强化学习在过程合成中的实现受到量子位需求的限制,随着问题复杂度的增加,其扩展性较差。本文通过引入状态编码算法将量子位需求与问题规模解耦。使用经典强化学习作为基准,在相同的训练条件下评估量子算法。所有算法在具有递增单元数量的流程表合成问题上进行评估,以分析其性能和可扩展性。结果表明,所有方法都能在小设计空间中识别出最优的流程表设计。对于中等规模的单元数量,量子方法在每回合的基础上表现出竞争性的性能,并且在每参数的基础上具有改进的效率,优于经典强化学习基准。本文为未来量子计算在过程系统工程中的应用提供了基础,建立了比较经典和量子算法的受控基准,并展示了所提出的量子变体在本文研究的过程合成问题中仍具有竞争力。

英文摘要

In this work, we present quantum reinforcement learning (RL) as a solution strategy for process synthesis problems. Building on our prior work, we develop a generalized framework that formally poses process synthesis as a Markov decision process and introduces quantum-enhanced RL algorithms to solve it with improved scalability. Earlier implementations of quantum-based RL for process synthesis were limited by qubit requirements, which scaled poorly with problem complexity. This work overcomes this challenge by introducing state encoding algorithms to decouple qubit requirements from problem size. A classical RL-based solution strategy is used as a baseline to benchmark the quantum algorithms under identical training conditions. All algorithms are evaluated across a flowsheet synthesis problem of increasing unit counts to analyze their performance and scalability. Results show that all approaches are capable of identifying the optimal flowsheet designs in small design spaces. For moderate-scale unit counts, quantum approaches demonstrate competitive performance on a per-episode basis and improved efficiency on a per-parameter basis versus the classical RL benchmark. This work provides a foundation for future quantum computing applications within process systems engineering, establishes a controlled benchmark for comparing classical and quantum algorithms, and shows that the proposed quantum variants remain competitive for the process synthesis problem examined in this work.

2605.21198 2026-05-21 cs.SI cs.AI 版本更新

SURGE: An Event-Centric Social Media Sentiment Time Series Benchmark with Interaction Structure

SURGE:一个以事件为中心的社会媒体情感时间序列基准,包含交互结构

Chen Su, Pengsen Cheng, Yuanhe Tian, Yan Song

发表机构 * University of Science and Technology of China(中国科学技术大学) Sichuan University Zhongguancun Academy(四川大学中关村学院)

AI总结 该研究提出了SURGE基准,通过整合事件级时间序列与对齐的文本和交互结构,用于评估社交互动如何影响预测行为,揭示了基准在局部持久性、转移能力及回复密集期的挑战性。

详情
AI中文摘要

社交媒体上的公共事件产生大量讨论,其集体动态对意见预测和危机响应具有直接价值。捕捉这些动态在事件生命周期中的演变需要将碎片化帖子组织成事件级时间序列。现有数据集仅涵盖少量事件,且在构建时间序列时通常丢弃帖子间的交互结构,限制了跨事件类型的迁移和对交互如何塑造集体动态的受控研究。我们提出了SURGE,一个多事件社交媒体基准,将事件级时间序列与对齐的文本和交互结构联系起来。SURGE通过自动化流程生成日历对齐的时间序列,覆盖五个事件类别中的67个事件和超过80万条帖子。每个时间单元配对来自相同选定帖子的扁平和结构化文本视图,使受控评估社交交互结构对预测行为的影响成为可能。在SURGE之上,我们定义了数值预测、文本增强预测、高交互评估和留一类别外推广的基准协议。实验表明,该基准具有强局部持久性,即在绝对误差下朴素基线难以超越;现有文本增强预测器在事件驱动社交媒体数据中的迁移有限;回复密集期的难度增加,聚合指标往往掩盖了这些挑战。我们还包含了一个轻量级结构感知探针作为参考实现,展示了SURGE如何支持交互感知预测研究。

英文摘要

Public events on social media generate large volumes of discussion whose collective dynamics carry direct value for opinion forecasting and crisis response. Capturing how these dynamics evolve across an event's lifecycle requires organizing fragmented posts into event-level time series. Existing datasets cover only a small number of events within a single category, and typically discard the interaction structure between posts when constructing time series, which restricts both transfer across event types and controlled study of how interactions shape the resulting collective dynamics. We present SURGE, a multi-event social media benchmark that pairs event-level time series with aligned text and interaction structure linking posts within an event. SURGE is built through an automated pipeline that produces calendar-aligned time series at three temporal granularities, covering 67 events and more than 800K posts across five event categories. Each time bin is paired with flat and structured textual views derived from the same selected posts, enabling controlled evaluation of whether social interaction structure affects forecasting behavior. On top of SURGE we define benchmark protocols for numerical-only forecasting, text-augmented forecasting, high-interaction evaluation, and leave-one-category-out generalization. Experiments with representative time-series and multimodal forecasting models reveal three properties of the benchmark: a strong local-persistence regime in which naive baselines remain hard to beat under absolute error, limited transfer of existing text-augmented forecasters to event-driven social-media data, and increased difficulty on reply-dense periods that aggregate metrics tend to obscure. We further include a lightweight structure-aware probe as a reference implementation, illustrating how SURGE can support interaction-aware forecasting research.

2605.21186 2026-05-21 cs.CV cs.AI 版本更新

SAM-Sode: Towards Faithful Explanations for Tiny Bacteria Detection

SAM-Sode:迈向微小细菌检测的可信解释

Wanying Tan, Shuo Yan, Dazhi Huang, Yazheng Liu, Zili Shao, Rufeng Chen, Hechang Chen, Mude Shi, Tianxing Ji, Sihong Xie

发表机构 * Shenzhen University, Shenzhen, China The Second Affiliated Hospital, Guangzhou Medical University, Guangzhou, China The Hong Kong University of Science Technology (Guangzhou), Guangzhou, China Jilin University, Changchun, China Guangdong ACXEL Micro \& Nano Tech Co., Ltd., Guangzhou, China

AI总结 本文提出SAM-Sode框架,通过几何感知提示和双约束机制提升微小细菌检测的解释性与透明度,有效抑制背景冗余并增强决策透明度。

Comments 10 pages, 4 figures, conference paper

详情
AI中文摘要

对象检测的可解释性为临床辅助诊断提供了关键的信心支持。然而,在微小细菌检测中,传统解释方法由于目标形态特征的极端稀疏性和复杂背景的严重干扰,常面临前景边界模糊和特征归因扩散的问题。这种限制阻碍了逻辑连贯的形态证据的提供。为解决这一问题,我们提出了一种新颖的可解释人工智能(XAI)框架SAM-Sode。该框架创新性地将初始特征归因图转换为几何感知提示,利用基础模型(SAM3)的先验知识实现空间细化和形态重建。此外,我们引入基于物理意义和几何对齐的双约束机制,进行实例级去噪,生成更符合人类专家直觉的解释。在我们自行构建的具有复杂电路背景的细菌数据集(包含2,524张图像)及其他公开数据集上的实验结果表明,所提出的方法有效抑制了背景冗余,并显著增强了微小物体检测的决策透明度。

英文摘要

Interpretability in object detection provides crucial confidence support for clinical auxiliary diagnosis. However, in tiny bacteria detection, traditional explanation methods often suffer from blurred foreground boundaries and diffuse feature attribution due to the extreme sparsity of target morphological features and severe interference from complex backgrounds. Such limitations hinder the provision of logically coherent morphological evidence. To bridge this gap, we propose a novel eXplainable AI (XAI) framework, SAM-Sode. The framework innovatively transforms initial feature attribution maps into geometry-aware prompts, leveraging the prior knowledge of the foundation model (SAM3) to achieve spatial refinement and morphological reconstruction of the explanatory mappings. Furthermore, we introduce a dual-constraint mechanism based on physical significance and geometric alignment to perform instance-level denoising, generating coherent explanations that better align with human expert intuition. Experimental results on our self-constructed bacteria dataset with complex circuit backgrounds (containing 2,524 images) and other public datasets demonstrate that the proposed method effectively suppresses background redundancy and significantly enhances the decision-making transparency of tiny object detection.

2605.21157 2026-05-21 cs.CV cs.AI cs.LG cs.RO 版本更新

Comparative Analysis of Military Detection Using Drone Imagery Across Multiple Visual Spectrums

多光谱下无人机影像用于军事检测的比较分析

Sourov Roy Shuvo, Prajwal Panth, Rajesh Chowdhury, Sorup Chakraborty, Sudip Chakrabarty, Prasant Kumar Pattnaik

发表机构 * School of Computer Engineering KIIT Deemed to be University(计算机工程学院 KIIT deemed to be 大学)

AI总结 本文研究了不同光谱条件下无人机影像用于军事目标检测的问题,通过构建四种不同数据集(灰度、热成像、夜视和模糊成像)来评估模型在不同环境下的性能,提出了一种改进的YOLOv11-small模型以提升无人机作战的性能和可靠性。

Comments 6 pages, 7 figures. Accepted at the 16th International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 6-11, 2025, IIT Indore. Proceedings pending publication

详情
AI中文摘要

在现代战争中,无人机已成为情报收集和精确打击在不同 hostile 环境中的重要组成部分。其能够从安全距离实时操作 hostile 环境的能力使其在监视和军事行动中具有无价的价值。KIIT-MiTA 数据集由从无人机拍摄的不同军事场景图像组成,为检测军事目标提供了基础,但未考虑各种现实场景。为此,创建了四种不同类型的数据集:灰度、热成像、夜视和模糊成像,以模拟现实环境如低能见度、热成像和夜间条件。YOLOv11-small 模型被训练和用于检测不同设置中的目标。本研究通过在防御和进攻任务中开发先进的检测系统,提高了基于无人机的作战性能和可靠性。

英文摘要

In modern warfare, drones are becoming an essential part of intelligence gathering and carrying out precise attacks in different kinds of hostile environments. Their ability to operate in real-time and hostile environments from a safe distance makes them invaluable for surveillance and military operations. The KIIT-MiTA dataset is comprised of images of different military scenarios taken from drones, and these provide a foundation for detecting military objects, but it does not take into account the various types of real-world scenarios. With that in mind, to evaluate how the models are performing under varying conditions, four different types of datasets are created: Gray Scale, Thermal Vision, Night Vision, and Obscura Vision. These simulate the real-world environments such as low visibility, heat-based imagery, and nighttime conditions. The YOLOv11-small model is trained and used to detect objects across diverse settings. This research boosts the performance and reliability of drone-based operations by contributing to the development of advanced detection systems in both defensive and offensive missions.

2605.21154 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Automated ICD Classification of Psychiatric Diagnoses: From Classical NLP to Large Language Models

精神病诊断的ICD分类自动化:从经典NLP到大语言模型

Fernando Ortega, Raúl Lara-Cabrera, Jorge Dueñas-Lerín, Alejandro de la Torre-Luque, Mercé Salvador Robert, Enrique Baca-García

发表机构 * Department of Sistemas Informáticos, Universidad Politécnica de Madrid, Spain(西班牙马德里理工大学信息系统系) KNODIS Research Group, Universidad Politécnica de Madrid, Spain(西班牙马德里理工大学KNODIS研究组) CIBERSAM ISCIII, Spain(西班牙ISCIII CIBERSAM) Department of Legal Medicine, Psychiatry and Pathology. Complutense University of Madrid, Spain(西班牙马德里康普顿斯大学法医学、精神病学与病理学系) Hospital Universitario de Móstoles, Universidad Rey Juan Carlos, Spain(西班牙雷阿尔皇家卡洛斯大学莫斯特oles大学医院) Department of Psychiatry, University Hospital Jimenez Díaz Fundation, Madrid, Spain(西班牙圣地亚哥· jiménez Díaz基金会精神病科部) Department of Psychiatry, University Hospital Rey Juan Carlos, Móstoles, Spain(西班牙雷阿尔皇家卡洛斯大学莫斯特oles医院精神病科部) Department of Psychiatry, General Hospital of Villalba, Madrid, Spain(西班牙维拉尔巴医院精神病科部) Department of Psychiatry, University Hospital Infanta Elena, Madrid, Spain(西班牙伊菲格尼亚医院精神病科部) Department of Psychology, Universidad Catolica del Maule, Talca, Chile(智利马尔学院心理学系) Department of Psychiatry, Madrid Autonomous University, Madrid, Spain(西班牙马德里自治大学精神病科部)

AI总结 本研究提出利用NLP和机器学习技术将自由文本描述映射到国际疾病分类(ICD),以自动化精神病诊断分析,通过评估从经典频率模型到先进大语言模型的多种文本表示方法,展示了transformer嵌入在捕捉隐含语义线索和细致医学术语方面的优势。

详情
AI中文摘要

心理健康已成为全球优先事项,导致临床诊断编码的行政负担巨大。本研究提出通过将自由文本描述映射到国际疾病分类(ICD)来自动化精神病诊断分析,利用包含145,513个西班牙精神病描述的专用数据集,评估了从经典频率模型(BoW,TF-IDF)到先进大语言模型(如e5_large、BioLORD和Llama-3-8B)的各种文本表示方法。结果表明,基于transformer的嵌入 consistently 超过传统方法,通过端到端微调,e5_large模型实现了最高的性能,F1_micro得分为0.866。本研究证明了将大语言模型适应特定临床术语对于克服“长尾”标签分布和精神病 discourse 的固有模糊性至关重要。

英文摘要

Mental health has become a global priority, leading to a massive administrative burden in the coding of clinical diagnoses. This study proposes the automation of psychiatric diagnostic analysis by mapping free-text descriptions to the International Classification of Diseases (ICD) using Natural Language Processing (NLP) and Machine Learning (ML) techniques. Utilizing a specialized dataset of 145,513 Spanish psychiatric descriptions, various text representation paradigms were evaluated, ranging from classical frequency-based models (BoW, TF-IDF) to state-of-the-art Large Language Models (LLMs) such as e5\_large, BioLORD, and Llama-3-8B. Results indicate that transformer-based embeddings consistently outperform traditional methods by capturing implicit semantic cues and nuanced medical terminology. The e5\_large model, through end-to-end fine-tuning, achieved the highest performance with a $F1_{micro}$ score of 0.866. This research demonstrates that adapting LLMs to specific clinical nomenclature is essential for overcoming the challenges of ``long-tail'' label distributions and the inherent ambiguity of psychiatric discourse.

2605.21146 2026-05-21 cs.CR cs.AI cs.SE 版本更新

Detecting Trojaned DNNs via Spectral Regression Analysis

通过谱回归分析检测被植入的深度神经网络

Samuele Pasini, Jinhan Kim, Paolo Tonella

发表机构 * Università della Svizzera Italiana(瑞士意大利大学)

AI总结 本文提出MIST方法,通过分析模型在微调过程中的内部表示变化来检测植入的后门,利用预激活谱特征来识别与参考不一致的更新,从而在不依赖 poisoned 数据或触发器的情况下实现高准确率的后门检测。

详情
AI中文摘要

现代深度神经网络(DNN)经常被反复微调以整合新数据和功能。这种进化流程在更新数据不可信时引入了安全风险,因为攻击者可能在微调过程中植入后门。我们提出了MIST,一种后门检测方法,分析模型在微调过程中内部表示的变化。而不是尝试重建触发条件,MIST利用预激活谱特征来表征良性模型进化,并标记出与参考不一致的更新。这种框架将后门检测视为对模型更新的回归问题。在四个数据集和八个后门攻击上的实证评估表明,谱距离可靠地区分被植入的更新与干净的微调。MIST在单次更新后优于现有最先进检测精度,无需任何关于被污染数据或触发器的知识,并在多步良性进化中保持有效,具有优雅且有界的退化。这些结果表明,谱进化提供了一种稳定且假设轻量的信号,用于检测恶意模型更新。

英文摘要

Modern DNNs are repeatedly fine-tuned to incorporate new data and functionality. This evolutionary workflow introduces a security risk when updated data cannot be fully trusted, as adversaries may implant Trojans during fine-tuning. We present MIST, a Trojan detection approach that analyzes how a model's internal representations change during fine-tuning. Rather than attempting to reconstruct trigger conditions, MIST characterizes benign model evolution using pre-activation spectra and flags updates whose spectral deviations are inconsistent with this reference. This framing treats Trojan detection as a regression problem over model updates. An empirical evaluation across four datasets and eight Trojan attacks shows that spectral distances reliably distinguish Trojaned updates from clean fine-tuning. MIST outperforms state-of-the-art detection accuracy after a single update, without requiring any knowledge about the poisoned data or the trigger, and remains effective under multi-step benign evolution, with graceful and bounded degradation. These results indicate that spectral evolution provides a stable and assumption-light signal for detecting malicious model updates.

2605.21113 2026-05-21 cs.LO cs.AI 版本更新

On the Complexity of Entailment for Cumulative Propositional Dependence Logics

关于累积命题依赖逻辑蕴含复杂性的研究

Kai Sauerwald, Juha Kontinen, Arne Meier

发表机构 * University of Helsinki(赫尔辛基大学)

AI总结 本文研究了累积命题依赖逻辑和累积命题逻辑(带有团队语义)的蕴含问题的复杂性,通过关系模型确定了其复杂性结果。

Comments arXiv admin note: substantial text overlap with arXiv:2602.21360

详情
AI中文摘要

本文建立了并证明了累积命题依赖逻辑和累积命题逻辑(带有团队语义)的蕴含问题的复杂性结果。正如最近所显示的,累积逻辑以其System~C系统为特征,并且恰好由Kraus、Lehmann和Magidor的累积模型所捕捉。这导致了通过关系模型的蕴含问题,本文特别考虑了这一问题。

英文摘要

This paper establishes and proves complexity results for entailment for cumulative propositional dependence logic and for cumulative propositional logic with team semantics. As recently shown, cumulative logics are famously characterised by System~C and exactly captured by the cumulative models of Kraus, Lehmann and Magidor. This gives rise to the entailment problem via relational models, which is specifically considered here.

2605.21102 2026-05-21 cs.CL cs.AI cs.SE 版本更新

ACL-Verbatim: hallucination-free question answering for research

ACL-Verbatim: 无幻觉的科研问答

Gábor Recski, Szilveszter Tóth, Nadia Verdha, István Boros, Ádám Kovács

发表机构 * TU Wien(维也纳技术大学) KR Labs(KR实验室)

AI总结 本研究提出ACL-Verbatim系统,通过提取式问答方法在科研论文中精准映射用户查询到相关文本片段,构建了新的真实数据集并训练评估了多种提取模型,最终通过150M参数的ModernBERT模型在词级F1得分上达到53.6,优于最强的LLM提取器。

Comments 13 pages

详情
AI中文摘要

学术研究者需要高效可靠的工具从可信来源获取高质量信息,但现代AI辅助研究工具仍受大语言模型(LLMs)产生事实不准确或不合逻辑输出(即幻觉)的影响。我们应用提取式问答系统VerbatimRAG到ACL Anthology中的研究论文,直接将用户查询映射到检索文档中的原文文本片段。我们贡献了一个新的真实数据集,用于将用户查询映射到科研论文中的相关文本片段,并利用该数据集训练和评估多种提取模型。人工标注由自然语言处理研究人员完成,基于使用ScIRGen方法生成的合成用户查询,配以由VerbatimRAG检索的论文片段。在该基准上,一个基于我们流水线银色监督训练的150M参数ModernBERT标记分类器在词级F1得分上达到53.6,优于最强的评估LLM提取器(48.7)

英文摘要

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).

2605.21085 2026-05-21 cs.MA cs.AI cs.LG 版本更新

Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints

分离通信与策略:在带宽限制下的鲁棒多智能体强化学习

Alexi Canesse, Benoît Goupil, Jesse Read, Sonia Vanier

发表机构 * École polytechnique (LIX)(巴黎高等理工学院(LIX)) CNRS(国家科学研究中心) Institut Polytechnique de Paris(巴黎高等理工学院) Palaiseau, France(法国Palaiseau)

AI总结 本文提出了一种新的方法,通过引入β指标和SLIM架构,将通信路径与策略的潜在表示分离,从而在带宽受限的情况下提高多智能体强化学习的鲁棒性和性能。

详情
AI中文摘要

通信在多智能体强化学习(MARL)中起到了协调作用,但许多实际应用,例如无人机编队的搜索与救援任务,在严重的带宽限制下运行。许多通信架构仍然存在耦合瓶颈,其中共享的潜在表示用于策略执行和智能体间通信。因此,减少信息量会直接限制策略的潜在空间,通常导致显著的性能下降。我们通过两个贡献来解决这个问题。首先,我们引入β,一个归一化的每智能体带宽预算,将稀疏性、轮次和信息维度统一为一个可比的约束。其次,我们提供SLIM,一个最小的架构,将通信路径与策略的潜在表示分离,使我们能够隔离带宽的影响与策略容量的影响,同时受益于步骤内通信。我们在几个部分可观测的MARL基准上评估了我们的方法,其中通信是至关重要的。我们的方法在状态空间中实现了最先进的性能,并且在有限的通信下表现出可扩展性和鲁棒性,随着带宽的减少,降级仅是轻微的。

英文摘要

Communication enables coordination in multi-agent reinforcement learning (MARL), but many real-world applications, e.g., search-and-rescue with drone swarms, operate under severe bandwidth constraints. Many communication architectures still expose a coupled bottleneck in which a shared latent representation is used for both policy execution and inter-agent communication. Consequently, reducing message size directly limits the policy's latent space, often leading to significant performance degradation. We address this with two contributions. First, we introduce $β$, a normalised per-agent bandwidth budget that unifies sparsity, rounds, and message dimension into a single comparable constraint. Second, we provide SLIM, a minimal architecture that decouples the communication pathway from the policy's latent representation, allowing us to isolate the effect of bandwidth from the effect of policy capacity while benefiting from in-step communication. We evaluate our method on several partially-observable MARL benchmarks, where communication is essential. Our approach achieves state-of-the-art performance and exhibits scalability and robustness under limited communication, with only marginal degradation as bandwidth is reduced.

2605.21082 2026-05-21 cs.AI 版本更新

AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

AutoRPA: 通过基于LLM的代码合成实现高效的GUI自动化

Minghao Chen, Xinyi Hu, Zhou Yu, Yufei Yin

发表机构 * Zhejiang Key Laboratory of Space Information Sensing and Transmission(浙江空间信息感知与传输重点实验室) School of Computer Science, Hangzhou Dianzi University(杭州电子科技大学计算机科学学院)

AI总结 本文提出AutoRPA框架,通过将ReAct风格代理的决策逻辑自动转化为鲁棒的RPA功能,提高GUI自动化效率和可重用性,同时减少82%到96%的token使用量。

Comments Accepted in ICML 2026

详情
AI中文摘要

基于大型语言模型(LLM)的代理在多步骤的图形用户界面(GUI)交互中表现出色。尽管大多数研究集中在提升单任务性能,但实际场景中往往涉及重复的GUI任务,而频繁调用LLM推理(即ReAct范式)效率低下。在LLM之前,传统的机器人流程自动化(RPA)提供运行时效率,但需要大量手动努力来开发和维护。为弥合这一差距,我们提出AutoRPA框架,该框架能够自动将ReAct风格代理的决策逻辑转化为鲁棒的RPA功能。AutoRPA引入了两个核心创新:(1)一个翻译-构建流水线,其中翻译代理将硬编码的ReAct动作转换为软编码的流程,构建代理通过多轨迹检索增强生成合成鲁棒的RPA功能;(2)在代码验证期间的混合修复策略,结合RPA执行与基于ReAct的回退机制进行迭代优化。在多个GUI环境中的实验表明,由AutoRPA生成的RPA功能能够成功解决类似任务,同时减少82%到96%的token使用量,显著提高运行时效率和可重用性。

英文摘要

Large Language Model (LLM) based agents have demonstrated proficiency in multi-step interactions with graphical user interfaces (GUIs). While most research focuses on improving single-task performance, practical scenarios often involve repetitive GUI tasks for which invoking LLM reasoning repeatedly, i.e., the ReAct paradigm, is inefficient. Prior to LLMs, traditional Robotic Process Automation (RPA) offers runtime efficiency but demands significant manual effort to develop and maintain. To bridge this gap, we propose AutoRPA, a framework that automatically distills the decision logic of ReAct-style agents into robust RPA functions. AutoRPA introduces two core innovations: (1) A translator-builder pipeline, where a translator agent converts hard-coded ReAct actions into soft-coded procedures, and a builder agent synthesizes robust RPA functions via retrieval-augmented generation over multiple trajectories; (2) A hybrid repair strategy during code verification, combining RPA execution with ReAct-based fallback for iterative refinement. Experiments across multiple GUI environments demonstrate that RPA functions generated by AutoRPA successfully solve similar tasks while reducing token usage by 82% to 96%, significantly improving runtime efficiency and reusability.

2605.21061 2026-05-21 cs.CV cs.AI cs.RO 版本更新

Grounding Driving VLA via Inverse Kinematics

通过逆运动学接地驾驶VLA

Junsung Park, Hyunjung Shim

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院) KimJaeChul AI Graduate School(金 JaeChul人工智能研究生院)

AI总结 本文提出通过逆运动学求解器重新设计驾驶VLA,以解决轨迹预测中对视觉token的忽略问题,通过引入视觉状态预测和逆运动学网络,提升了视觉接地和轨迹规划性能。

详情
AI中文摘要

现有驾驶VLA在预测轨迹时大多忽略其视觉token--这一现象我们归因于任务公式结构上不合理的设定而非训练不足。我们证明,当通过逆运动学视角看待轨迹恢复时,需要当前和未来视觉状态作为边界条件;现有VLA仅提供前者,促使模型依赖自身状态和文本指令进行捷径预测。为解决此问题,我们重新设计驾驶VLA,使其风格类似于逆运动学求解器。首先,一个需要LLM预测未来视觉场景的下一视觉状态预测目标提供密集的视觉监督并抑制捷径路径。其次,一个单独的逆运动学网络(基于交叉注意力的条件扩散模型)仅输入当前和未来视觉状态,以在轨迹解码过程中抑制对自身状态和文本捷径的依赖。仅通过这种简单的处方,我们的0.5B规模模型恢复了视觉接地能力,并在闭合回路NAVSIM-v2和nuScenes基准上,其轨迹规划性能可与7B-8B规模的VLA相媲美。进一步的分析表明,这种改进源于恢复了利用视觉特征的能力,效果在动态驾驶场景如转弯时尤为明显。

英文摘要

Existing Driving VLAs predict trajectories while largely ignoring their visual tokens -- a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B--8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.

2605.21060 2026-05-21 cs.LG cs.AI stat.ML 版本更新

Divide et Calibra: Multiclass Local Calibration via Vector Quantization

Divide et Calibra: 通过向量量化实现多类局部校准

Cesare Barbera, Lorenzo Perini, Giovanni De Toni, Andrea Passerini, Andrea Pugnana

发表机构 * University of Pisa(比萨大学) University of Trento(特伦托大学) Meta(Meta公司) Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会)

AI总结 本文提出了一种复合方法,通过向量量化诱导表示空间的结构划分,并利用Dirichlet浓度的参数化实现跨区域参数共享,从而学习出能泛化到稀疏区域的异质校准映射,提升了局部校准性能同时保持了全局校准和预测性能。

详情
AI中文摘要

在高风险场景中,准确且校准良好的机器学习(ML)模型是必需的,但有效的多类校准仍然具有挑战性:全局方法假设校准误差在潜在空间中是同质的,而局部方法通常依赖于潜在空间降维,导致信息丢失。为了解决这些问题,我们提出了一种多类校准的复合方法,其中区域特定的校准映射是从共享的码字依赖因素中构建的。我们通过向量量化(VQ)实现这一想法,它诱导了表示空间的结构划分,并利用Dirichlet浓度的参数化实现跨区域参数共享。我们的方法学习了能泛化到稀疏区域的异质校准映射。在基准数据集上的实验显示,在保持竞争性的全局校准和预测性能的同时,显著提高了局部校准性能。

英文摘要

Accurate and well-calibrated Machine Learning (ML) models are mandatory in high-stakes settings, yet effective multiclass calibration remains challenging: global approaches assume calibration errors are homogeneous across the latent space, while local methods often rely on latent-space dimensionality reduction, which leads to information loss. To address these issues, we propose a compositional approach to multiclass calibration, where region-specific calibration maps are constructed from shared codeword-dependent factors. We instantiate this idea via Vector Quantization (VQ), which induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations that enables parameter sharing across regions. Our approach learns heterogeneous calibration maps that generalize well even to sparse regions of the latent space. Experiments on benchmark datasets show significant improvements in local calibration while maintaining competitive global calibration and predictive performance.

2605.20998 2026-05-21 cs.CL cs.AI 版本更新

Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis

单次传递、深度选择性阅读用于多方面情感分析

Yan Xia, Zhuangzhuang Pan, Amirrudin Kamsin, Chee Seng Chan

发表机构 * Universiti Malaya, Malaysia(马来大学,马来西亚) Suzhou University of Technology, China(苏州科技学院,中国) VinUniversity, Vietnam(文大学,越南)

AI总结 本文提出DABS框架,通过单次编码构建可重用的深度有序基底,使多方面情感分析在保持性能的同时减少60%的端到端计算量。

Comments Accepted at ACL2026 (main). Our solution (DABS) reads the sentence once, then lets each aspect selectively query the right tokens and Transformer depths, cutting redundant computation while preserving ATSA accuracy

详情
AI中文摘要

在多方面句子中,方面术语情感分析(ATSA)面临效率与表达性的根本权衡。现有模型要么为每个方面重新编码句子,要么依赖静态深度表示,导致冗余计算和有限适应性。我们主张Transformer深度是一种昂贵且可查询的资源,并提出DABS,一种单次推断框架,通过一次编码构建可重用的深度有序基底。每个方面则查询此共享表示以选择性地读取相关token和抽象层次,而无需重新编码。这将共享句子编码与轻量级、方面条件化的读取解耦。在四个ATSA基准测试中,DABS实现了具有竞争力的性能,同时在多方面设置(M >= 2)中将端到端计算减少了高达60%。进一步分析表明,自适应深度查询在语言复杂情况如否定和对比中最为有益。代码可在https://github.com/panzhzh/acl-dabs公开获取。

英文摘要

Aspect-Term Sentiment Analysis (ATSA) in multi-aspect sentences faces a fundamental tradeoff between efficiency and expressiveness. Existing models either re-encode the sentence for each aspect or rely on static use of deep representations, leading to redundant computation and limited adaptivity. We argue that Transformer depth is a costly, queryable resource, and propose DABS, a single-pass inference framework that encodes each sentence once to construct a reusable, depth-ordered substrate. Each aspect then queries this shared representation to selectively read relevant tokens and abstraction levels, without re-encoding. This decouples shared sentence encoding from lightweight, aspect-conditioned readout. Experiments on four ATSA benchmarks show that DABS achieves competitive performance while reducing end-to-end computation by up to 60% in multi-aspect settings (M >= 2). Further analyses indicate that adaptive depth querying is most beneficial for linguistically complex cases such as negation and contrast. Code is publicly available at https://github.com/panzhzh/acl-dabs

2605.20997 2026-05-21 cs.CV cs.AI cs.LG physics.comp-ph 版本更新

Hybrid Machine Learning Model for Forest Height Estimation from TanDEM-X and Landsat Data

基于TanDEM-X和Landsat数据的混合机器学习模型用于森林高度估计

Islam Mansour, Ronny Haensch, Irena Hajnsek, Konstantinos Papathanassiou

发表机构 * German Aerospace Center (DLR)(德国航空航天中心(DLR)) Institute of Environmental Engineering, ETH Zürich(环境工程研究所,苏黎世联邦理工学院)

AI总结 本文提出了一种结合机器学习与物理模型的混合方法,利用TanDEM-X干涉相干测量和Landsat光学数据来提高森林高度估计的精度,通过扩展特征空间减少高度和基线地形坡度的模糊性,实验结果表明RMSE和MAE分别降低了13.5%和16.6%。

详情
AI中文摘要

将机器学习(ML)与物理模型(PM)结合,已成为从遥感数据中检索地球物理参数的一种有前途的方法。在此背景下,一种用于从TanDEM-X干涉相干测量中估计森林高度的ML模型最近被提出,该模型通过物理模型约束学习过程。虽然所选特征用于训练和反演以确保解决方案的物理一致性,但它们无法解决数据中的所有高度/结构和基线/地形坡度模糊性。为改进这一点,提出通过扩展特征空间加入光学Landsat数据,以提供关于森林类型或结构的补充信息。扩展的模型被应用于几处Gabon的Lopé国家公园的TanDEM-X数据,并与空中LiDAR测量进行评估。结果表明,与原始混合模型相比,RMSE和MAE分别减少了13.5%和16.6%,证实了多光谱输入的附加价值。

英文摘要

Integrating machine learning (ML) with physical models (PM) has emerged as a promising way of retrieving geophysical parameters from remote sensing data. In this context, a ML model for estimating forest height from TanDEM-X interferometric coherence measurements has recently been proposed, that constrains the learning process through a PM. While the features used for training and inversion where selected to ensure the physical consistency of the solutions, they could not resolve all height / structure and baseline / terrain slope ambiguities in the data. To improve this, the extension of the feature space with optical Landsat data is proposed able to provide complementary information on forest type or structure. The extended model is applied and validated on several TanDEM-X acquisitions over the Gabonese Lopé national park site and assessed against airborne LiDAR measurements. Results show a 13.5% reduction in RMSE and a 16.6% reduction in MAE compared to the original hybrid model, confirming the added value of multispectral inputs.

2605.20994 2026-05-21 cs.CL cs.AI 版本更新

Towards Context-Invariant Safety Alignment for Large Language Models

面向大语言模型的上下文不变安全对齐

Yixu Wang, Yang Yao, Xin Wang, Yifeng Gao, Yan Teng, Xingjun Ma, Yingchun Wang

发表机构 * Fudan University, Shanghai, China(复旦大学,上海,中国) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室,上海,中国)

AI总结 本文提出了一种上下文不变的安全对齐方法,通过引入锚点不变正则化(AIR)来提升模型在不同上下文中的鲁棒性,从而增强安全约束对对抗性框架的抵抗力。

Comments ICML 2026

详情
AI中文摘要

基于偏好进行的后训练对齐可以将大语言模型与人类意图对齐,但安全行为往往仍然脆弱。一个模型可能在标准提示下拒绝有害请求,但在相同意图被包装在对抗性语言中时却会合规。我们建议,稳健的安全性需要上下文不变的对齐,其中行为取决于底层意图而非表面形式。在对齐中强制不变性是困难的,因为并非所有训练信号都同等可信;对于某些提示变体我们能够获得可验证的反馈(例如多选题),而对于开放性变体我们通常依赖于噪声且可游戏化的奖励代理(例如学习的评判者)。因此,标准对称不变正则化器可以通过降低在可靠变体上的性能来减少跨上下文差异,而不是改进开放性鲁棒性。为了解决这个问题,我们引入了锚点不变正则化(AIR),它将可验证的提示视为锚点,并使用停止梯度目标来正则化开放性变体朝着锚点性能的方向。AIR作为插件辅助损失实现,并通过异质提示分组与基于组的偏好优化(例如GRPO)结合。在安全、道德推理和数学方面,AIR提高了上下文不变性,提升了在分布内组的准确性达12.71%,在分布外一致性提升33.49%,使安全约束对对抗性框架更加稳健。

英文摘要

Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.

2605.20982 2026-05-21 cs.DC cs.AI cs.LG 版本更新

Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

调度操作中的开销诊断:跨架构观测站

Bole Ma, Jan Eitzinger, Harald Koestler, Gerhard Wellein

发表机构 * DeepSeek-V2-Lite MLA DeepSeek-MoE-16B MHA Qwen3-30B GQA Nemotron-30B Mamba-2 Qwen3.5-35B GDN

AI总结 该研究通过测试四个缓解方案的假设,发现扩展进程(EP)规模变化对专家最大/均值token比率的影响最多为5%,并且mock-token基准在路由Gini系数和批大小缩放趋势上存在高估。研究发现五种架构在相同矩阵中形成两个稳定的带状分布,这些带状分布而非EP度或mock数据配置是AlltoAll-aware互连和调度设计的正确工作负载输入。

详情
AI中文摘要

AlltoAll调度是MoE专家并行性的主要瓶颈,互连社区对此做出了四种缓解方案:预测样本放置、自适应专家重新布局、分层收集和EP-aware拓扑。这四种方案都基于两个关于工作负载的假设。第一个假设是路由不平衡可以通过系统层纠正。第二个假设是评估它们的mock-token基准忠实代表生产路由。我们引入DODOCO来测试这两个假设。我们对五个MoE检查点进行仪器化,涵盖五个序列混合器设计(DeepSeek-V2-Lite MLA,DeepSeek-MoE-16B MHA,Qwen3-30B GQA,Nemotron-30B Mamba-2,Qwen3.5-35B GDN)在5x6的数据条件下网格以及匹配的EP扫描(4到32个rank在H100s上);两个假设都失败。扩展EP在每个架构的可测量范围内将每专家最大/均值token比率改变最多5%:straggler是模型路由决策的固有属性,而不是其专家落在rank上的方式。mock tokens高估路由Gini系数高达2.35倍,并制造了一个批次大小缩放趋势,一旦真实文本取代随机ID,该趋势就消失。从相同矩阵中出现第三种模式,意外的是,五种架构分裂成两个稳定的带状分布。MHA和Mamba-2(数据容错)在wikitext上降至Gini 0.105和0,150。MLA和GDN(持续集中)在所有真实文本条件下保持在0.24以上,并在mock中达到0.29到0.38。GQA是中间情况。这些带状分布,而不是EP度或mock数据配置,是AlltoAll-aware互连和调度设计的正确工作负载输入。

英文摘要

AlltoAll dispatch is the dominant bottleneck of MoE expert parallelism, and the interconnect community has responded with four families of mitigations: predictive sample placement, adaptive expert relayout, hierarchical collectives, and EP-aware topology. All four rest on two assumptions about the workload. The first is that routing imbalance is correctable by the system layer. The second is that the mock-token benchmarks evaluating them faithfully represent production routing. We introduce DODOCO to test both assumptions. We instrument five MoE checkpoints spanning five sequence-mixer designs (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN) under a 5 by 6 grid of data conditions plus a matched EP scan from 4 to 32 ranks on H100s; both assumptions fail. Scaling EP changes the per-expert max/mean token ratio by at most 5% within every architecture's measurable range: the straggler is intrinsic to the routing decision the model makes, not to how its experts land on ranks. Mock tokens overestimate routing Gini by up to a factor of 2.35 and fabricate a batch-size scaling trend that vanishes the moment real text replaces random IDs. A third pattern, unexpected, emerges from the same matrix: the five architectures cleave into two stable bands. MHA and Mamba-2 (data-resilient) drop to Gini 0.105 and 0.150 on wikitext. MLA and GDN (persistently concentrated) stay above 0.24 on every real-text condition and reach 0.29 to 0.38 on mock. GQA is the intermediate case. These bands, not the EP degree or the mock-data profile, are the right workload input to AlltoAll-aware interconnect and dispatch design.

2605.20971 2026-05-21 cs.CV cs.AI cs.CR 版本更新

Comparative Evaluation of Deep Learning Models for Fake Image Detection

深度学习模型在虚假图像检测中的比较评估

Akhitha Pakala, Mohammed Mahir Rahman, Shahzad Memon, Tauseef Ahmed

发表机构 * University of East London(东伦敦大学)

AI总结 本研究通过统一的预处理和训练流程比较了四个预训练的CNN架构在虚假图像检测中的性能,发现VGG16在准确性上表现最佳,但EfficientNetB0在检测虚假图像时的敏感性较高,但对真实图像的可靠性较低,研究指出需要平衡数据集、高级增强和公平性意识训练来开发可靠的虚假图像检测系统。

Comments Accepted at ICCIIoT26 and waiting to be indexed

Journal ref 6th International Conference on Computational Intelligence & Internet of Things (ICCIIoT), 2026

详情
AI中文摘要

随着基于GAN的图像篡改技术日益复杂,数字取证面临重大挑战。本研究比较了四个预训练的CNN架构(VGG16、ResNet50、EfficientNetB0和XceptionNet)在虚假图像检测中的性能,使用统一的预处理和训练流程。通过调整大小、归一化和增强来解决类别不平衡问题并提高泛化能力。模型评估使用了准确性、精确率、召回率、F1分数和ROC-AUC。VGG16在准确性上达到91%,XceptionNet、ResNet50和EfficientNetB0分别达到90%。EfficientNetB0对虚假图像的敏感性更强,但在真实图像上的可靠性较低,反映了由不平衡驱动的偏差。局限性包括数据集不平衡、过拟合和解释性有限,这些因素影响了跨域鲁棒性。本研究提供了一个可重复的基准,并强调了平衡数据集、高级增强和公平性意识训练的必要性,以开发可靠的虚假图像检测系统。

英文摘要

The growing sophistication of GAN-based image manipulation presents significant challenges for digital forensics. This study compares the performance of four pretrained CNN architectures including VGG16, ResNet50, EfficientNetB0, and XceptionNet for fake image detection using a unified preprocessing and training pipeline. A dataset of real and manipulated images was processed through resizing, normalization, and augmentation to address class imbalance and improve generalization. Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC. VGG16 achieved the highest accuracy at 91%, with XceptionNet, ResNet50, and EfficientNetB0 each reaching 90%. EfficientNetB0 showed stronger sensitivity to fake images but reduced reliability on real samples, reflecting imbalance-driven bias. Limitations include dataset imbalance, overfitting, and limited interpretability, which affect cross-domain robustness. The study provides a reproducible baseline and underscores the need for balanced datasets, advanced augmentation, and fairness-aware training to develop reliable fake image detection systems.

2605.20965 2026-05-21 cs.CV cs.AI 版本更新

Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

在不遗忘的情况下寻找正确的视觉证据:通过层间视觉注意力差异减轻LVLMs中的幻觉

Yutong Xie, Zhenglin Hua, Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia

发表机构 * School of Computer Science and Engineering, Southeast University, Nanjing, China(东南大学计算机科学与工程学院) School of Artificial Intelligence, Shenzhen University, Shenzhen, China(深圳大学人工智能学院) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China(深圳大学计算机科学与软件工程学院) Engineering, South China University of Technology, Guangzhou, China(华南理工大学工程学院) National Engineering Laboratory for Big Data Systems Computing Technology, Shenzhen University, Shenzhen, China(深圳大学大数据系统计算技术国家工程实验室) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China(新一代人工智能技术及其交叉应用重点实验室(东南大学),中华人民共和国教育部)

AI总结 本文提出了一种基于层间视觉注意力差异的幻觉缓解方法,通过增强视觉证据的注意力来减少视觉遗忘,从而在不遗忘的情况下找到正确的视觉证据。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型视觉-语言模型(LVLMs)在广泛的视觉-语言任务上表现出色。尽管有进展,它们仍然容易产生幻觉,生成与视觉内容不一致的响应。在本工作中,我们发现LVLMs在对正确的视觉证据关注不足时容易产生幻觉,并在生成过程中逐渐遗忘它。我们实证发现,尽管LVLMs整体对视觉证据关注不足,但在特定层中表现出对正确视觉证据的敏感性,存在显著的层间差异。受此观察启发,我们提出了一种新的幻觉缓解方法,通过层间视觉注意力差异(ILVAD)增强视觉证据。具体来说,我们从早期生成的token到视觉token在各层中获取注意力权重,并识别被反复激活作为视觉证据的token,形成显著性图。然后通过显著性图在生成过程中增强对视觉证据的注意力,以减少视觉遗忘。此外,我们利用显著性图获得生成文本对视觉证据的注意力分数,以选择并强调强烈基于视觉证据的文本token。我们的方法是无训练的,即插即用。在五个最近发布的模型上进行的多个基准评估表明,我们的方法可以在不同架构的LVLMs上一致地缓解幻觉。代码可在https://github.com/ytx-ML/ILVAD上获得。

英文摘要

Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.

2605.20936 2026-05-21 cs.LG cs.AI cs.CL 版本更新

DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

DASH:在单个GPU上几分钟内完成的快速可微架构搜索用于混合注意力

Weizhe Chen, Miao Zhang, Junpeng Jiang, Yaping Li, Weili Guan, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 本研究提出DASH,一种快速可微架构搜索框架,用于混合注意力架构设计,通过将离散的层间注意力操作放置转化为连续的架构logits,准备可重用的教师对齐线性候选,并在模型和操作权重冻结的情况下进行架构仅搜索,显著提高了搜索效率。DASH在Qwen2.5-3B-Instruct上优于现有的所有选择器风格的混合注意力设计基线,展示了直接可微搜索可以发现更强的混合架构。此外,DASH在RULER性能上优于已发布的Jet-Nemotron模型,同时在重叠的短上下文和通用基准上保持竞争力。值得注意的是,每个DASH搜索运行仅使用12.3M tokens,并在单个RTX Pro 6000 GPU上仅需约20分钟,对应Jet-Nemotron报告的PostNAS搜索tokens的0.006%。这些结果表明,通过分钟级的可微搜索可以获得高质量的混合注意力架构,为混合架构设计提供了有前景的方向。

Comments 19 pages, 7 figures

详情
AI中文摘要

混合注意力架构正变得越来越重要,用于在保持模型质量的同时提高LLM推理效率,使混合架构设计成为核心问题。现有的设计通常依赖于手动经验规则或基于代理的选择器信号来分配层间操作符。最近的NAS风格系统,如Jet-Nemotron,展示了自动混合架构搜索的潜力。然而,Jet-Nemotron的PostNAS搜索阶段单独使用200B tokens,使得此类搜索流程难以作为混合架构设计的常规方法。我们引入DASH,一种用于混合注意力架构设计的快速可微搜索框架,它将离散的层间注意力操作放置放松为连续的架构logits,准备可重用的教师对齐线性候选,并在模型和操作权重冻结的情况下进行架构仅搜索,以显著提高搜索效率。在Qwen2.5-3B-Instruct上,DASH一致优于现有的所有选择器风格的混合注意力设计基线,表明直接可微搜索可以发现更强的混合架构。此外,DASH在RULER性能上优于已发布的Jet-Nemotron模型,同时在重叠的短上下文和通用基准上保持竞争力。值得注意的是,每个DASH搜索运行仅使用12.3M tokens,并在单个RTX Pro 6000 GPU上仅需约20分钟,对应Jet-Nemotron报告的PostNAS搜索tokens的0.006%。这些结果表明,通过分钟级的可微搜索可以获得高质量的混合注意力架构,为混合架构设计提供了有前景的方向。

英文摘要

Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen to significantly enhance search efficiency. On Qwen2.5-3B-Instruct, DASH consistently outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. Moreover, DASH achieves stronger RULER performance than released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, corresponding to merely 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, providing a promising direction for hybrid architecture design.

2605.20924 2026-05-21 cs.CL cs.AI 版本更新

Strategy-Induct: Task-Level Strategy Induction for Instruction Generation

Strategy-Induct: 任务级策略诱导用于指令生成

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

发表机构 * Department of Computer Science and Information Engineering, National Taiwan University, Taiwan(国立台湾大学计算机科学与资讯工程学系) Institute of Information Science, Academia Sinica, Taiwan(台湾“中央研究院”资讯科学研究所) AI Research Center (AINTU), National Taiwan University, Taiwan(国立台湾大学人工智能研究中心)

AI总结 该研究提出了一种任务级策略诱导方法Strategy-Induct,通过仅使用少量示例问题生成任务指令,无需依赖标注答案,从而在指令生成任务中取得优于现有方法的性能。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

设计有效的任务级提示对于提高大型语言模型(LLMs)的性能至关重要。尽管先前的指令诱导工作表明LLMs可以通过有限的例子推断出更好的指令,但现有方法通常依赖于输入-输出对,而获取标注答案可能困难或成本高昂。为了解决这一限制,我们提出了Strategy-Induct框架,该框架仅从少量示例问题中推导出任务级指令,而无需标注答案。我们的方法首先提示模型为每个问题生成显式的推理策略,形成(策略,问题)对。这些对随后用于诱导一个任务指令,以引导推理。在多个任务和模型规模上的实验表明,Strategy-Induct在仅问题设置中优于最先进的方法。此外,我们观察到在任务指令生成和推理中联合使用LLMs和大型推理模型可能会进一步提高性能。

英文摘要

Designing effective task-level prompts is crucial for improving the performance of Large Language Models (LLMs). While prior work on instruction induction demonstrates that LLMs can infer better instructions with limited examples, existing approaches often rely on input-output pairs, where obtaining labeled answers can be difficult or costly. To address this limitation, we propose Strategy-Induct, a framework that derives task-level instructions solely from a small set of example questions without requiring labeled answers. Our approach first prompts the model to generate explicit reasoning strategies for each question, forming (strategy, question) pairs. These pairs are then used to induce a task instruction that guides reasoning. Experiments across multiple tasks and model scales demonstrate that Strategy-Induct outperforms state-of-the-art methods in question-only settings. Furthermore, we observe that jointly utilizing LLMs and Large Reasoning Models across task instruction generation and inference may lead to further performance improvements.

2605.20923 2026-05-21 cs.LO cs.AI cs.PL 版本更新

Causal Past Logic for Runtime Verification of Distributed LLM Agent Workflows

因果过去逻辑用于分布式大语言模型代理工作流的运行时验证

Benedikt Bollig

发表机构 * Université Paris-Saclay, CNRS, ENS Paris-Saclay, LMF, Gif-sur-Yvette, France(巴黎-萨克雷大学,国家科学研究中心,巴黎-萨克雷高等师范学院,LMF,法国吉夫昂蒂埃)

AI总结 本文提出了一种因果过去逻辑(CPL)用于分布式大语言模型代理工作流的运行时验证,通过在ZipperGen框架中引入CPL,实现了对工作流中因果可见事件的实时监控和控制流影响,从而将运行时验证整合到协调语言本身。

Comments 20 pages

详情
AI中文摘要

分布式大语言模型代理工作流不应被当作产生单一顺序日志来监控。在异步执行中,一个决策只能依赖于对其生命线可见的因果事件:某些日志中出现较早的事件可能在本地仍未知。我们扩展ZipperGen代理工作流框架,引入因果过去逻辑(CPL),一种用于条件和while循环中守卫的简洁过去时间时序逻辑。除了标准的过去时间模态如previous和since外,守卫可以检查另一个生命线和所选变量中最新可见的事件。公式是一种源级守卫:它由拥有生命线在线评估,并可以在运行时影响控制流。我们给出了一个具有最新值视图的向量钟监控器,并证明本地计算的监控值与守卫在当前事件处的指称语义一致。因此,运行时验证成为协调语言本身的一部分,而不是执行日志上的事后检查。

英文摘要

Distributed LLM agent workflows should not be monitored as if they produced a single sequential log. In an asynchronous execution, a decision can only depend on events that are causally visible to the lifeline that makes it: an event that appears earlier in some log may still be unknown locally. We extend the ZipperGen agent-workflow framework with Causal Past Logic (CPL), a small past-time temporal logic for guards in conditionals and while loops. In addition to standard past-time modalities such as previous and since, a guard can inspect the latest causally visible event of another lifeline and selected variables stored there. The formula is a source-level guard: it is evaluated online by the owner lifeline and can influence control flow at runtime. We give a vector-clock monitor with latest-value views and prove that the locally computed monitor value coincides with the denotational semantics of the guard at the current event. Thus runtime verification becomes part of the coordination language itself, rather than a post-hoc check over an execution log.

2605.20922 2026-05-21 cs.LG cs.AI cs.CV 版本更新

Winfree Oscillatory Neural Network

Winfree振荡神经网络

Jiawen Dai, Yue Song

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Qi Zhi Institute(上海启智研究院) College of AI, Tsinghua University(清华大学人工智能学院)

AI总结 本文提出了一种基于广义Winfree动力学的振荡神经网络WONN,通过结构化的振荡交互在流形$(S^1)^d$上进化表示,结合基于相位的归纳偏置与灵活的层次交互机制,实现了在图像识别和复杂推理任务上的竞争力和参数效率。

Comments Project page: https://jiawen-dai.github.io/WONN_Project_Page/

详情
AI中文摘要

振荡和同步被认为是表示和计算中的基本要素。然而,现有的基于同步动力学的机器学习方法大多局限于特定领域,如物体发现,缺乏在标准视觉基准或逻辑推理任务中的扩展性证据。我们提出Winfree振荡神经网络(WONN),一种基于广义Winfree动力学的动态神经架构。WONN通过结构化的振荡交互在流形$(S^1)^d$上进化表示,结合基于相位的归纳偏置与灵活且层次化的交互机制,这些机制可以是固定的三角函数映射或可学习的神经网络。我们在图像识别和复杂推理任务上评估了WONN,包括CIFAR、ImageNet、Maze-hard和Sudoku。在这些领域中,WONN实现了具有竞争力或优越性能的成果,并且具有强参数效率。特别是,WONN是目前已知第一个能够与ImageNet-1K竞争的基于同步的振荡架构。此外,在Maze-hard上,WONN仅使用前状态-of-the-art模型1%的参数就达到了80.1%的准确率。这些结果表明,结构化的振荡动力学为传统神经架构提供了一种可扩展且参数高效的替代方案。

英文摘要

Oscillations and synchronization are widely believed to play a fundamental role in representation and computation. However, existing machine learning approaches based on synchronization dynamics have largely been confined to specialized settings such as object discovery, with limited evidence of scalability to standard vision benchmarks or logic reasoning tasks. We propose the Winfree Oscillatory Neural Network (WONN), a dynamical neural architecture based on generalized Winfree dynamics. WONN evolves representations on the torus $(S^1)^d$ through structured oscillatory interactions, combining phase-based inductive biases with flexible and hierarchical interaction mechanisms instantiated as either fixed trigonometric mappings or learnable neural networks. We evaluate WONN on image recognition and complex reasoning tasks, including CIFAR, ImageNet, Maze-hard, and Sudoku. Across these domains, WONN achieves competitive or superior performance with strong parameter efficiency. In particular, WONN is, to our knowledge, the first synchronization-based oscillatory architecture to scale competitively to ImageNet-1K. Furthermore, on Maze-hard, WONN achieves 80.1% accuracy using only 1% of the parameters of prior state-of-the-art models. These results suggest that structured oscillatory dynamics provide a scalable and parameter-efficient alternative to conventional neural architectures.

2605.20915 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models

校准与决策制定:重新审视未学习语言模型中的可靠性悖论

Divyaksh Shukla, Ashutosh Modi

发表机构 * Indian Institute of Technology Kanpur (IIT Kanpur)(印度理工学院坎浦尔学院(IIT坎浦尔))

AI总结 本文研究了生成语言模型中校准与决策可靠性之间的差距,通过TOFU基准测试中的多项选择问答评估协议,发现经过微调的模型在校准误差较低,而未学习后的模型在校准误差仍低,但依赖于相关性特征的决策规则增加,扩展了可靠性悖论到机器未学习领域。

Comments Accepted at SRW, ACL 2026; 17 pages (9 + 2 + 6)

详情
AI中文摘要

机器未学习旨在从模型中移除特定训练数据的影响,同时保持对剩余数据的可靠行为,使可靠的预测和不确定性估计成为评估的关键。校准常被用作语言模型可靠性代理,但低校准误差并不一定意味着可靠的决策规则,因为模型可能依赖于虚假相关性而保持良好校准。我们通过TOFU基准测试中的多项选择问答评估协议,研究了生成语言模型中的这一差距,利用校准指标(ECE、MCE、Brier)测量概率可靠性,并通过基于属性的快捷方式检测(使用积分梯度和局部互信息)评估决策规则可靠性。我们发现,微调模型的校准误差(ECE ~ 0.04)低于预训练模型(ECE > 0.5),而未学习后的模型在校准误差相似,尽管在遗忘分割上的准确性降低,属性分析显示对基于相关性的标记依赖增加。这些结果表明,良好的校准可以与未学习后的基于快捷方式的决策规则共存,将可靠性悖论扩展到了机器未学习领域。

英文摘要

Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation. Calibration is commonly used as a proxy for reliability in language models, but low calibration error does not necessarily imply reliable decision rules, as models may rely on spurious correlations while remaining well calibrated. We investigate this gap in generative language models using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, measuring probabilistic reliability with calibration metrics (ECE, MCE, Brier) and decision-rule reliability via attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. We find that fine-tuned models achieve low calibration error (ECE ~ 0.04) compared to pretrained models (ECE > 0.5), and models after unlearning retain similarly low calibration despite reduced accuracy on the forget split, while attribution analysis shows increased reliance on correlation-based tokens. These results demonstrate that good calibration can coexist with shortcut-based decision rules after unlearning, extending the reliability paradox to the machine unlearning setting.

2605.20911 2026-05-21 cs.AI cs.LG 版本更新

For How Long Should We Be Punching? Learning Action Duration in Fighting Games

我们应该持续打击多久?在格斗游戏中学习动作持续时间

Hoang Hai Nguyen, Kurt Driessens, Dennis J. N. J. Soemers

发表机构 * Department of Advanced Computing Sciences, Maastricht University(马斯特里赫特大学高级计算科学系)

AI总结 本文研究了在格斗游戏中如何通过学习动作持续时间来提高强化学习代理的决策能力,探讨了动态调整反应时间的方法及其对性能和行为模式的影响。

Comments Accepted at Computers and Games 2026

详情
AI中文摘要

像《街头霸王II》这样的格斗游戏对强化学习(RL)代理提出了独特的挑战,因为它们具有快速且实时的性质。在大多数RL框架中,代理被硬编码为在固定间隔内做出决策,通常每帧或每N帧。虽然这种设计确保了及时的响应,但限制了代理调整反应时间的能力。每帧行动提供帧完美反应,这与人类玩家相比不现实,而更长的固定间隔会降低计算成本但会阻碍响应速度。我们考虑了一种替代的决策框架,其中代理不仅学习采取什么动作,还学习执行该动作有多久。通过同时预测动作和持续时间,代理可以动态调整其对游戏不同情况的响应能力。我们使用开源的FightLadder环境,通过训练代理对抗内置的脚本机器人,系统地测试不同的帧跳配置,以分析其对性能、响应性和学习行为的影响。实验表明,学习的时间可以与精心选择的固定帧跳性能相匹配,并鼓励可重复的动作模式,但本身并不能保证鲁棒性。在大多数情况下,我们发现代理在一致的高帧跳值(即低响应速度)下表现最佳。这种策略使学习利用性策略变得更容易,其中相同的动作被反复执行,而脚本机器人似乎容易受到这种策略的影响。

英文摘要

Fighting games such as Street Fighter II present unique challenges to reinforcement learning (RL) agents due to their fast-paced, real-time nature. In most RL frameworks, agents are hard-coded to make decisions at a fixed interval, typically every frame or every N frames. Although this design ensures timely responses, it restricts the agent's ability to adjust its reaction timing. Acting every frame grants frame-perfect reflexes, which are unrealistic compared to human players, whereas longer fixed intervals reduce computational cost but hinder responsiveness. We consider an alternative decision-making framework in which the agent learns not only what action to take but also for how long to execute it. By jointly predicting both action and duration, the agent can dynamically adapt its responsiveness to different situations in the game. We implement this method using the open-source FightLadder environment with agents trained against scripted built-in bots, systematically testing different frame skip configurations to analyze their influence on performance, responsiveness, and learned behavior. Experiments show that learned timing can match the performance of well-chosen fixed frame skips and encourages repeatable action patterns, but does not ensure robustness on its own. In most cases, we see agents performing best with consistently high frame skip values (i.e., low responsiveness). This strategy makes it easier to learn exploitative strategies where the same action is repeated over and over, which the scripted bots appear to be susceptible to.

2605.20901 2026-05-21 cs.CV cs.AI 版本更新

VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

VISTA:EgoVis 2026 ego4D 短期物体交互预测挑战的技术报告

Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Dongmei Jiang, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) Pengcheng Laboratory(鹏城实验室) Shandong Jianzhu University(山东建筑大学)

AI总结 本文提出VISTA,一种用于EgoVis 2026 ego4D短期物体交互预测挑战的V-JEPA集成静态快速时序预测器。该方法结合了以物体为中心的空间检测与短视时间上下文,通过特征调制和ROI级上下文融合,将时间表示注入检测路径,以提高预测的鲁棒性。

Comments The champion solution for the Ego4D Short-Term Object Interaction Anticipation Challenge at the CVPR EgoVis Workshop 2026

详情
AI中文摘要

我们提出VISTA,一种用于EgoVis 2026 ego4D短期物体交互预测(STA)挑战的V-JEPA集成静态快速时序预测器。给定一个眼动视频时间戳,任务要求预测下一步的人-物体交互,包括未来活跃物体的边界框、名词类别、动词类别、接触时间以及置信度分数。VISTA采用StillFast风格的设计,结合以物体为中心的空间检测与短视时间上下文。具体来说,一个在COCO上预训练的Faster R-CNN ResNet-50 FPN检测器从最后一个观察到的高分辨率帧中生成物体建议,而冻结的V-JEPA 2.1时间分支从观察到的视频中提取片段级眼动上下文。时间表示通过特征调制和ROI级上下文融合注入检测路径。融合的建议特征随后传递给多头STA预测器进行框细化、名词分类、动词分类、接触时间回归和交互置信度估计。为了最终提交,我们进一步融合互补预测以提高鲁棒性。在官方挑战服务器上的实验结果表明,VISTA在EgoVis 2026 ego4D STA挑战中获得第一名。我们的代码将在https://github.com/CorrineQiu/VISTA上发布。

英文摘要

We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.

2605.20876 2026-05-21 cs.CL cs.AI 版本更新

Terminal-World: Scaling Terminal-Agent Environments via Agent Skills

Terminal-World: 通过智能体技能扩展终端智能体环境

Zihao Cheng, Hongru Wang, Zeming Liu, Xinyi Wang, Xiangrong Zhu, Yuhang Guo, Wei Lin, Jeff Z. Pan, Yunhong Wang

发表机构 * School of Computer Science and Engineering, Beihang University, Beijing, China(北京航空航天大学计算机科学与工程学院) Independent Researcher(独立研究者) Beijing Institute of Technology(北京理工大学) University of Edinburgh(爱丁堡大学)

AI总结 本文提出Terminal-World,一种自动化流程,利用智能体技能作为核心合成原语,共同编码任务目标、执行时机和方法,从而生成任务指令、环境和教师轨迹。通过构建5,723个训练环境,训练出Terminal-World-8B/14B/32B模型,在六个基准测试中均优于终端智能体基线,其中Terminal-World-32B在Terminal-Bench 2.0上以仅1.2%的训练数据超越Nemotron-Terminal-32B。

Comments Work in Progress

详情
AI中文摘要

Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

英文摘要

Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

2605.20874 2026-05-21 cs.AI cs.SE 版本更新

Governance by Construction for Generalist Agents

为通用智能体构建的治理机制

Segev Shlomov, Iftach Shoham, Alon Oved, Ido Levy, Sami Marreed, Harold Ship, Offer Akrabi, Sergey Zeltyn, Avi Yaeli, Nir Mashkif

发表机构 * IBM

AI总结 本文提出了一种模块化的政策-as-code层,用于在不微调模型的情况下,通过与通用大语言模型智能体结合,实现可预测、可审计且符合合规要求的行为,在复合工作流中无需为每个领域重新构建智能体。

详情
AI中文摘要

企业智能体日益被期望在多个工具和界面中自主运行,但生产部署需要通过构建来实施治理。系统必须指定哪些操作被允许、何时需要人类监督以及哪些信息可以暴露,而无需为每个领域重新构建智能体。本演示展示了CUGA的策略系统,这是一种模块化的策略-as-code层,能够与通用大语言模型智能体结合,以在复合工作流中实现可预测、可审计且符合合规要求的行为。我们提出了一种运行时治理架构,在执行的每一个关键阶段都强制执行策略干预。而不是被动地限制行为,策略在五个结构性检查点拦截智能体:规划上游(意图守卫)、在系统提示内引导推理(手册)、在工具调用边界处强制正确使用(工具指南)、在推理循环外作为人类在环的闸门用于高风险操作(工具批准)、以及在输出阶段过滤和结构化最终响应(输出格式器)。这些阶段将治理连续嵌入智能体的执行流程中,而不是将其视为事后考虑。通过一个医疗场景和多层次的执行干预,演示展示了动态手册注入用于结构化工具序列执行,意图守卫阻止恶意或意外有害请求,以及人类在环的工具批准检查点用于可能破坏性操作。该成果展示了类型化的治理原语如何加快、安全地部署企业智能体系统,同时提高政策遵守和执行一致性。

英文摘要

Enterprise agents are increasingly expected to operate autonomously across tools and interfaces, yet production deployments require governance by construction. Systems must specify which actions are allowed, when human oversight is required, and what information may be exposed, without rebuilding the agent for each domain. This demo presents CUGA's policy system, a modular policy-as-code layer that composes with a generalist LLM agent to deliver predictable, auditable, and compliance-aware behavior in compound workflows without model fine-tuning. We present a runtime governance architecture that enforces policy interventions at every critical stage of execution. Rather than passively constraining behavior, policies intercept the agent at five structural checkpoints: upstream of planning (Intent Guard), within the system prompt to steer reasoning (Playbook), at the tool-call boundary to enforce proper usage (Tool Guide), outside the reasoning loop as a Human-in-the-Loop gate for high-risk actions (Tool Approvals), and at the output stage to filter and structure the final response (Output Formatter). Together, these stages embed governance continuously across the agent's execution pipeline rather than treating it as an afterthought. Using a healthcare scenario and a multi-layered enforcement intervention, the demo shows dynamic playbook injection for structured tool-sequence enforcement, intent guards that block malicious or accidental harmful requests, and human-in-the-loop tool approval checkpoints for potentially destructive actions. The artifact illustrates how typed governance primitives enable faster, safer deployment of enterprise agentic systems while improving policy adherence and execution consistency.

2605.20872 2026-05-21 cs.LG cs.AI cs.GR 版本更新

CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation

CAdam: 3D高斯密度细化中的上下文自适应矩估计

SeungJeh Chung, Geonho Park, Misong Kim, HyeongYeop Kang

发表机构 * IIIXR Lab, Kyung Hee University(庆尚大学IIIXR实验室) IIIXR Lab, Korea University(韩国大学IIIXR实验室)

AI总结 本文提出CAdam方法,通过将密度细化问题转化为统计信号验证问题,解决生成式蒸馏中密度估计的瓶颈,从而在保持视觉质量的同时显著减少高斯点数量。

Comments Accepted to SIGGRAPH 2026 Conference Papers. 12 pages, 8 figures

详情
AI中文摘要

Adaptive densification是3D高斯点划法(3DGS)的核心引擎。然而,当将其应用于基于优化的生成式蒸馏范式时,这种重建原生机制暴露了根本性限制,导致效率低下且充满冗余的表示。我们诊断这种失败为密度困境,源于生成指导的随机性:标准的幅度基积累无差别地聚合瞬态噪声与几何信号,难以在过密度和欠拟合之间取得平衡。为了解决这一问题,我们引入了上下文自适应矩估计(CAdam),一种新的框架,将密度细化重新解释为统计上站得住的信号验证问题。CAdam利用梯度的一阶矩来利用干涉原理,其中随机波动通过破坏性干涉抵消,而一致的几何漂移通过建设性干涉累积,从而有效分离底层信号与生成噪声底座。这进一步通过基于分位数的上下文意识和内在信号噪声比(SNR)门控机制增强,确保在优化阶段之间具有鲁棒的适应性,并使密度细化能够软终止。在多样化的目标(SDS,ISM,VFDS)和强大的生成3DGS后端上进行了广泛的实验,结果表明CAdam相比标准密度细化将高斯点数减少85%-97%,同时保持整体可比的视觉质量。这些结果突显了信号感知密度控制作为改进优化生成式蒸馏内存效率的实用方法。

英文摘要

Adaptive densification is the engine of 3D Gaussian Splatting (3DGS). However, when transposed to the optimization-based Generative Distillation paradigm, this reconstruction-native mechanism reveals fundamental limitations, resulting in inefficient representations cluttered with redundant primitives. We diagnose this failure as a Densification Dilemma stemming from the stochastic nature of generative guidance: the standard magnitude-based accumulation indiscriminately aggregates transient noise alongside geometric signals, making it difficult to strike a balance between over-densification and under-fitting. To resolve this, we introduce Context-Adaptive Moment Estimation (CAdam), a novel framework that reinterprets densification as a statistically grounded signal verification problem. CAdam leverages the first moment of gradients to exploit the interference principle, where stochastic fluctuations cancel out via destructive interference while consistent geometric drifts accumulate via constructive interference, effectively disentangling the underlying signal from the generative noise floor. This is further augmented by a quantile-based context awareness and an intrinsic Signal-to-Noise Ratio (SNR) gating mechanism, which ensure robust adaptation across optimization stages and enable the soft termination of densification. Extensive experiments across diverse objectives (SDS, ISM, VFDS) and strong generative 3DGS backbones show that CAdam reduces Gaussian count by 85%-97% relative to standard densification while preserving overall comparable perceptual quality. These results highlight signal-aware density control as a practical way to improve memory efficiency in optimization-based generative distillation.

2605.20868 2026-05-21 cs.LG cs.AI cs.SY eess.SY 版本更新

Runtime-Certified Bounded-Error Quantized Attention

具有运行时认证的误差受限量化注意

Dean Calver

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出了一种分层的KV缓存架构,通过在GPU内存中存储INT8键和INT4值,同时在系统RAM中保留FP16原始数据,实现了运行时认证的注意机制,通过误差分解得到每头每步的误差界,以驱动自适应精度选择和多阶段回退流程,确保在需要时能恢复到精确的密集注意输出。

Comments 32 pages, 1 figure

详情
AI中文摘要

KV缓存量化减少了长上下文LLM推理的内存成本,但引入了通常仅通过经验验证的近似误差。现有系统依赖于平均情况下的鲁棒性,没有机制在运行时检测或恢复失败。本文提出了一种分层的KV缓存架构,使注意机制具有运行时认证:INT8键和INT4值存储在GPU内存中,而FP16原始数据保留在系统RAM中以实现确定性回退。一个两术语误差分解提供了每头每步的误差界(i)键量化导致的注意分布扭曲和(ii)值重建误差。这些界在线计算并用于驱动自适应精度选择和多阶段回退阶梯,确保在需要时能恢复到精确的密集注意输出。在PG-19、NIAH和RULER基准上,对LLaMA~3.1-8B(上下文长度达128K)的测试中,系统在语言建模和检索任务中与密集FP16 KV质量在噪声范围内匹配,同时恢复了在朴素INT8/INT4基线中观察到的灾难性故障。短上下文的值敏感任务暴露了压缩与保真度之间的可控权衡,可通过更紧的值容忍度或FP16值回退消除。认证是局部的(每头、每步),不保证端到端模型的正确性,但确保每个注意计算要么相对于FP16参考是受控的,要么通过回退精确恢复。这将KV缓存量化重新定义为运行时验证的计算,而不是固定近似。目标不是原始的速度提升,而是使在严格质量约束下安全部署的激进KV压缩成为可能。

英文摘要

KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in system RAM for deterministic fallback. A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. Across PG-19, NIAH, and RULER benchmarks on LLaMA~3.1-8B with contexts up to 128K, the system matches dense FP16 KV quality within noise for language modelling and retrieval tasks, while recovering catastrophic failures observed in naive INT8/INT4 baselines. Value-sensitive tasks at short context expose a controlled trade-off between compression and fidelity, which can be eliminated via tighter value tolerances or FP16-value fallback. The certification is local (per-head, per-step) and does not guarantee end-to-end model correctness, but ensures that each attention computation is either bounded relative to an FP16 reference or exactly recovered via fallback. This reframes KV cache quantization as a runtime-verified computation rather than a fixed approximation. The goal is not raw speedups, but enabling safe deployment of aggressive KV compression under strict quality constraints.

2605.20865 2026-05-21 cs.LG cs.AI 版本更新

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

多步似然比校正用于可验证奖励的强化学习

Deokgyu Yoon, Hyungkyu Kang, Joongkyu Lee, Byeongchan Kim, Gyungin Shin, Sungrae Park, Min-hwan Oh

发表机构 * Seoul National University(首尔国立大学) Upstage

AI总结 本文提出了一种多步前向轨迹政策优化(NFPO)算法,通过引入N步前向轨迹来改进PPO的近似目标,从而在可验证奖励的强化学习中实现更精确的策略改进。

详情
AI中文摘要

可验证奖励的强化学习(RLVR)在提升大语言模型的推理能力方面起着关键作用。然而,广泛使用的PPO替代目标本质上是局部的,因为它们依赖于精确策略梯度目标的局部近似。虽然这种近似通过减少重要性采样引起的方差来提高稳定性,但它也引入了结构偏差到替代目标中,必须通过信任区域机制进行控制。在本文中,我们引入了N步前向轨迹,通过累积下一个N-1个token的似然比来增强PPO替代目标。基于这一想法,我们提出了N步前向轨迹策略优化(NFPO),一种将N步前向轨迹整合到掩码策略梯度框架中的实用RLVR算法。NFPO提供了一个连续的桥梁,将PPO替代目标与精确策略梯度目标联系起来,提供了一种控制偏差-方差权衡的原理机制。我们的理论分析表明,通过适当选择N,所提出的目标比标准PPO替代目标提供了更紧的策略改进界。在全面推理基准测试中,实验表明NFPO一致地提高了性能,支持了我们的理论发现。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient objective. While this approximation improves stability by reducing the variance induced by importance sampling, it also introduces structural bias into the surrogate objective, which must be controlled through trust region mechanisms. In this work, we introduce the $N$-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next $N-1$ tokens. Building on this idea, we propose $N$-Step Forward-Trace Policy Optimization (NFPO), a practical RLVR algorithm that integrates the $N$-step forward trace into the masked policy gradient framework. NFPO provides a continuous bridge between the PPO surrogate objective and the exact policy gradient objective, offering a principled mechanism for controlling the bias-variance trade-off. Our theoretical analysis shows that, with an appropriate choice of $N$, the proposed objective yields a tighter policy-improvement bound than the standard PPO surrogate. Experiments on comprehensive reasoning benchmarks demonstrate that NFPO consistently improves performance, supporting our theoretical findings.

2605.20856 2026-05-21 cs.RO cs.AI cs.LG 版本更新

DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation

DISC: 通过策略生成解耦指令与状态条件控制

Hanxiang Ren, Pei Zhou, Xunzhe Zhou, Yanchao Yang

发表机构 * Zhejiang University(浙江大学) The University of Hong Kong(香港大学) TranscEngram

AI总结 DISC通过策略生成解耦指令与状态条件控制,解决了任务状态耦合导致的观察泄漏问题,并在多个基准测试中表现出色,证明了语言生成的策略参数驱动行为。

详情
AI中文摘要

语言条件的操控策略通常通过共享网络参数处理指令和观察。这种任务-状态耦合提供了观察泄漏的路径——网络学习了场景到动作的捷径,完全绕过了语言接地。DISC通过结构上消除这一失败。而不是将通用策略条件在语言上,DISC使用超网络从指令本身生成整个任务特定的视觉-运动策略参数集。生成的策略从不直接访问语言;因此,其任务意识必须来自语言。 Consequently,观察泄漏没有路径出现。另一方面,生成一致的高维策略权重本身是一个具有挑战性的问题。我们通过两阶段超网络解决它,其细化阶段将基于梯度优化的结构作为前馈归纳偏差嵌入,产生全局一致的参数,而无需实际梯度计算。在标准数据预算上完全从头训练,DISC在LIBERO-90和Meta-World上优于所有耦合基线,在复杂、长周期任务中优势扩大,并在不使用外部预训练数据的情况下超越了大规模预训练的π₀。在一个现实基准中,所有任务共享相同的视觉上下文,DISC显著优于耦合替代方案,直接证实了语言生成的策略参数,而非视觉捷径,驱动行为。超网络进一步学习了一个语义结构化的参数流形,能够从最少的演示中实现少样本适应,并在改写指令中实现稳健的泛化。我们的代码可在:https://github.com/ReNginx/DISC获取。

英文摘要

Language-conditioned manipulation policies typically process instructions and observations through shared network parameters. This task-state entanglement provides a pathway for observation leakage -- networks learn scene-to-action shortcuts that bypass language grounding entirely. DISC eliminates this failure structurally. Rather than conditioning a universal policy on language, DISC uses a hypernetwork to generate the entire parameter set of a task-specific visuomotor policy from the instruction alone. The generated policy never directly accesses language; therefore, its task-awareness must come from the language. Consequently, observation leakage has no pathway to emerge. On the other hand, generating coherent high-dimensional policy weights is itself a challenging problem. We address it with a two-stage hypernetwork whose refinement stage embeds the structure of gradient-based optimization as a feed-forward inductive bias, producing globally consistent parameters without actual gradient computation. Trained entirely from scratch on standard data budgets, DISC outperforms all entangled baselines on LIBERO-90 and Meta-World, with advantages that widen on complex, long-horizon tasks -- and surpasses the large-scale pretrained $π_0$ despite using no external pretraining data. On a real-world benchmark where all tasks share identical visual context, DISC substantially outperforms entangled alternatives, directly confirming that language-generated policy parameters, not visual shortcuts, drive behavior. The hypernetwork further learns a semantically structured parameter manifold that enables few-shot adaptation from minimal demonstrations and robust generalization across paraphrased instructions. Our code is available at: {https://github.com/ReNginx/DISC}.

2605.20838 2026-05-21 cs.CV cs.AI 版本更新

USV: Towards Understanding the User-generated Short-form Videos

USV: 向理解用户生成的短视频迈进

Haoyue Cheng, Su Xu, Liwei Jin, Wayne Wu, Chen Qian, Limin Wang

发表机构 * State Key Laboratory for Novel Software Technology(新型软件技术国家重点实验室)

AI总结 本文提出了USV数据集,用于高层面的视频语义理解,通过用户生成的短视频进行主题识别和视频-文本检索任务,提出了MMF-Net和VTCL两种有效基线方法。

详情
AI中文摘要

近年来,已经发布了多个大规模视频数据集,推动了视频理解领域的发展。然而,新兴的用户生成的短视频却很少被研究。本文提出了USV数据集,用于高层面的视频语义理解。该数据集包含约224,000个视频,通过标签查询从UGC平台收集,无需额外的人工验证和剪辑。尽管视频理解近年来取得了显著进展,但大多数工作集中在实例级识别,这不足以学习视频高层面语义信息的表示。因此,我们进一步在USV上建立了两个任务:主题识别和视频-文本检索。我们提出了两种统一且有效的基线方法:多模态融合网络(MMF-Net)和视频-文本对比学习(VTCL),分别用于主题识别和视频-文本检索任务,并进行了全面的基准测试以促进未来研究。我们的项目页面是https://usvdataset.github.io。

英文摘要

Several large-scale video datasets have been published these years and have advanced the area of video understanding. However, the newly emerged user-generated short-form videos have rarely been studied. This paper presents USV, the User-generated Short-form Video dataset for high-level semantic video understanding. The dataset contains around 224K videos collected from UGC platforms by label queries without extra manual verification and trimming. Although video understanding has achieved plausible improvement these years, most works focus on instance-level recognition, which is not sufficient for learning the representation of the high-level semantic information of videos. Therefore, we further establish two tasks: topic recognition and video-text retrieval on USV. We propose two unified and effective baseline methods Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL), to tackle the topic recognition task and video-text retrieval respectively, and carry out comprehensive benchmarks to facilitate future research. Our project page is https://usvdataset.github.io.

2605.20837 2026-05-21 cs.CV cs.AI 版本更新

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

ArchSIBench: 评估视觉-语言模型的建筑空间智能

Qirui Shen, Wenda Wang, Jiachen Lu, Zilong Huang, Jin Bai, Lei He, Hongxuan Chen, Weixin Huang

发表机构 * School of Architecture, Tsinghua University(清华大学建筑学院)

AI总结 本文提出ArchSIBench,一个基于建筑学、认知科学和心理学视角的建筑空间智能评估基准,通过17个细粒度子任务和3000个问题-答案对,评估多种VLMs在建筑空间感知、推理、导航、转换和配置方面的性能,发现大多数模型在空间转换和配置推理上仍与有建筑训练的人类评估者存在差距。

Comments 51 pages

详情
AI中文摘要

建筑空间智能,即识别和推断建筑空间的能力,是机器人导航、具身交互和3D场景理解和生成等任务的基础。尽管已有大量研究评估了视觉-语言模型(VLMs)的基本空间技能,如相对方向、距离比较和物体计数,但这些任务仅涵盖空间认知的最基础层次,且忽略了更高层次的建筑空间认知,包括布局理解、通行模式和功能分区。在本文中,我们提出ArchSIBench,一个基于建筑学、认知科学和心理学视角的建筑空间智能评估基准。ArchSIBench涵盖五个核心维度:感知、推理、导航、转换和配置,包含17个细粒度子任务。通过专家的精心人工标注,我们构建了3,000个问题-答案对,以实现对建筑空间智能的全面评估。基于ArchSIBench,我们评估了各种VLMs,并发现大多数模型在建筑空间智能方面与人类基线有显著差异;此外,模型在能力维度上表现出显著的差异性。一些最先进的模型可以接近没有建筑训练的人类评估者水平。然而,与有建筑训练的人类评估者相比,仍存在明显差距,特别是在空间转换和配置推理方面。我们相信,ArchSIBench将为测量和提升VLMs的建筑空间智能提供重要的见解和系统资源。数据集和代码可在https://huggingface.co/datasets/ArchSIBench/ArchSIBench获取。

英文摘要

Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs) such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on the perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17 fine-grained subtasks. Through careful manual annotation by experts with architectural backgrounds, we construct 3,000 question-answer pairs to enable comprehensive evaluation of architectural spatial intelligence. Based on ArchSIBench, we evaluate various VLMs and find that the architectural spatial intelligence of most models shows significant differences from human baselines; additionally, models exhibit substantial variability across capability dimensions. Some state-of-the-art models can approach the level of human evaluators without architectural training. However, a clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. We believe that ArchSIBench will provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at https://huggingface.co/datasets/ArchSIBench/ArchSIBench.

2605.20834 2026-05-21 cs.AI cs.LG 版本更新

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

DPO与RLHF的条件等价性:隐含假设、失败模式与可证明对齐

Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo

发表机构 * The Hong Kong University of Science(香港科技大学) Hong Kong Baptist University(香港 Baptist 大学)

AI总结 本文研究了DPO与RLHF的等价性问题,指出其等价性依赖于一个隐含假设,当该假设不成立时,DPO会优化相对优势而非绝对对齐,从而导致路径性收敛。作者提出CPO方法,通过引入约束实现可证明对齐,并通过几何解释揭示DPO的margin ranking机制。

Comments 49 pages

详情
AI中文摘要

直接偏好优化(DPO)作为一种替代强化学习从人类反馈(RLHF)的方法,理论上等价但实现更简单。我们证明这种等价性是条件性的而非普遍的,取决于一个隐含假设:RLHF最优策略必须偏好人类偏好响应。当该假设不成立时,DPO优化参考策略的相对优势而非绝对对齐人类偏好,导致路径性收敛,即策略降低DPO损失但偏好不被偏好响应。我们刻画了该假设被违反的情况,展示了不可取的解空间存在,并证明在这些情况下DPO和RLHF优化根本不同的目标。为解决此问题,我们引入约束偏好优化(CPO),通过在RLHF中加入约束以实现可证明对齐。我们进一步通过软边距排名提供几何解释,揭示DPO实现边距排名但可能具有潜在负目标。我们的理论分析确立了DPO保证成立的条件,并提供了保持简单性的同时具有可证明对齐的解决方案。在标准基准上的全面实验表明,CPO实现了最先进的性能。代码可在:https://github.com/visitworld123/CPO获取。

英文摘要

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPOs' guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.

2605.20815 2026-05-21 cs.CL cs.AI cs.IR cs.LG 版本更新

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

在消费级硬件上实现GraphRAG:对本地LLMs在医疗EHR模式检索中的基准测试

Peter Fernandes, Ria Kanjilal

发表机构 * Department of Computer Engineering(计算机工程系) California Polytechnic State University(加州州立大学波特兰分校)

AI总结 本文研究了在消费级硬件上使用本地LLMs进行医疗EHR模式检索的GraphRAG方法,评估了四种不同模型在索引效率、知识图构建、查询延迟、回答质量和幻觉方面的表现,发现模型参数大小和检索模式对结果有显著影响。

Comments 9 pages, 1 figure, 5 tables

详情
AI中文摘要

基于图的检索增强生成(GraphRAG)扩展了检索增强生成,以支持对复杂语料库的结构化推理,但其在资源受限、隐私敏感的部署中的可靠性仍不清楚。在医疗领域,电子健康记录(EHR)数据复杂且严格监管,依赖云基于大语言模型(LLMs)会带来成本、延迟和合规性的挑战。本文系统评估了GraphRAG在EHR模式检索中的应用,使用本地部署的开源LLMs。我们实现了Microsoft GraphRAG管道在真实的EHR模式文档上,并基准测试了四种模型,包括Llama 3.1(8B)、Mistral(7B)、Qwen 2.5(7B)和Phi-4-mini(3.8B),这些模型通过Ollama在单个消费级GPU(8 GB VRAM)上部署。我们评估了索引效率、知识图构建、查询延迟、回答质量和幻觉在全局和局部检索模式下的表现。我们的结果揭示了显著差异:Llama 3.1生成最丰富的知识图(1,172个实体),Qwen 2.5达到最佳回答质量(3.3/5),Phi-4-mini因结构化输出错误无法完成流程,而Mistral表现出退化重复行为。我们进一步表明,GraphRAG具有实际容量阈值,其中模型参数低于约7B的模型无法可靠地生成有效的结构化输出并无法完成流程。此外,索引和回答质量在不同模型之间是脱耦的,局部检索在延迟和事实基础方面均优于全局总结,且幻觉减少。这些发现表明,GraphRAG可以在消费级硬件上实现,同时强调了模型选择和检索设计在受监管环境中的重要性。

英文摘要

Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments remains unclear. In healthcare, where Electronic Health Record (EHR) data is complex and strictly regulated, reliance on cloud-based large language models (LLMs) introduces challenges in cost, latency, and compliance. In this work, we present a systematic evaluation of GraphRAG for EHR schema retrieval using locally deployed open-source LLMs. We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models, including Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B), each deployed via Ollama on a single consumer GPU (8 GB VRAM). We evaluate indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes. Our results reveal substantial differences: Llama 3.1 produces the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the best answer quality (3.3/5), Phi-4-mini fails to complete the pipeline due to structured-output errors, and Mistral exhibits degenerate repetition behavior. We further show that GraphRAG exhibits a practical capacity threshold, where models below approximately 7B parameters fail to reliably produce valid structured outputs and cannot complete the pipeline. In addition, indexing and answer quality are decoupled across models, and local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination. These findings demonstrate that GraphRAG is feasible on consumer hardware while highlighting the importance of model selection and retrieval design for robust deployment in regulated settings.

2605.20803 2026-05-21 cs.LG cs.AI 版本更新

Tunable MAGMAX: Preference-Aware Model Merging for Continual Learning

可调MAGMAX:面向持续学习的偏好感知模型融合

Kei Hiroshima, Kento Uchida, Shinichi Shirakawa

发表机构 * Yokohama National University(Yokohama国立大学)

AI总结 本文提出了一种名为可调MAGMAX的模型融合框架,通过引入偏好向量控制任务特定性能,以适应不同的部署环境和用户偏好,从而在持续学习中实现更有效的模型融合。

Comments 17 pages, 4 figures. Accepted at ICPR 2026

详情
AI中文摘要

持续学习(CL)旨在顺序训练多个任务的同时,减轻对之前学习知识的灾难性遗忘。最近在大预训练模型(LPMs)和模型融合技术,如MAGMAX方面的进展,通过结合任务特定参数展示了有效的CL性能。然而,现有方法主要关注所有任务的平均性能,并未充分解决如何构建能够适应不同部署环境或变化用户偏好的模型的问题。本文提出了一种模型融合框架,称为可调MAGMAX,它使持续学习中的任务特定性能能够受到偏好控制。我们的方法引入了一个偏好向量,该向量在模型融合过程中控制从每个任务向量中选择的元素数量,使我们能够根据部署需求调整融合模型的性能。我们进一步提出了一种方法,通过利用少量目标环境数据和模型训练任务的数据集,自动构建合适的偏好向量,从而消除了手动指定的需要。在CL基准任务上的实验结果表明,可调MAGMAX有效地控制了任务层面的性能,并成功地将融合模型适应于各种目标环境。所提出的可调MAGMAX在性能上优于或与基线方法相当,使其成为部署到各种环境中的实用解决方案,其中每个任务的偏好不同。

英文摘要

Continual learning (CL) aims to train models sequentially on multiple tasks while mitigating catastrophic forgetting of previously learned knowledge. Recent advances in large pre-trained models (LPMs) and model merging techniques, such as MAGMAX, have demonstrated effective CL performance by combining task-specific parameters. However, existing methods primarily focus on average performance across all tasks and do not adequately address how to construct models accommodating different deployment environments or varying user preferences. This paper proposes a model merging framework, termed Tunable MAGMAX, which enables preference-aware control of task-specific performance in CL. Our method introduces a preference vector that controls the number of elements selected from each task vector during model merging, allowing us to adjust the merged model performance according to their deployment needs. We further propose a method for automatically constructing appropriate preference vectors by leveraging small amounts of target environment data and datasets from model training tasks, thereby eliminating the need for manual specification. The experimental result on CL benchmark tasks demonstrates that Tunable MAGMAX effectively controls task-wise performance and successfully adapts merged models to various target environments. The proposed Tunable MAGMAX achieves superior or comparable performance to baseline methods, making it a practical solution for deploying CL models to various environments where the preferences of each task performance differ.

2605.20802 2026-05-21 cs.AR cs.AI 版本更新

ELSA: An ELastic SNN Inference Architecture for Efficient Neuromorphic Computing

ELSA: 一种用于高效神经形态计算的弹性SNN推理架构

Kang You, Chen Nie, Lee Jun Yan, Ziling Wei, Cheng Zou, Zekai Xu, Yu Feng, Honglan Jiang, Zhezhi He

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 本文提出ELSA架构,通过细粒度的脊柱/令牌流水线和针对SNN的硬件优化,实现了真正的弹性推理,从而在保持准确性的同时显著降低延迟,实验表明SNN在保持精度的同时优于量化人工神经网络。

Comments 17 pages, Proceedings of the 53rd Annual International Symposium on Computer Architecture (ISCA), 2026

详情
AI中文摘要

脉冲神经网络(SNNs)利用事件驱动和仅加法计算显著提高智能计算的效率。SNNs的关键时间特性弹性推理允许输出逐步出现,使系统能够比完整评估更早响应显著输入。然而,现有的专门针对SNN的加速器无法利用这一特性。分层设计只有在所有层完成后才输出结果,而时间步-时间步设计依赖于粗粒度、分层的流水线,需要同步每一层内的所有脊柱/令牌。这一障碍阻止了结果的即时转发,延迟了最早的响应,并放弃了弹性推理的好处。为了解决这些挑战,我们提出了ELSA,一种接近SRAM的数据流架构,通过细粒度的脊柱/令牌流水线和针对SNN的硬件优化,实现了真正的弹性推理。ELSA在生成每个脊柱/令牌时立即转发结果,形成一个连续的流式管道,大幅降低了到第一个响应的延迟。为了增强这种轻量级执行,ELSA引入了捆绑地址事件表示协议以降低网络芯片(NoC)的通信流量,并利用迷你批次脉冲Gustavson乘积以减少内存访问并利用固有的稀疏性。结合映射和调度优化,ELSA实现了高效、事件驱动的计算,而无需牺牲准确性。实验表明,SNN在保持精度的同时可以优于量化人工神经网络(QANN)。对于4位ResNet-50,ELSA在SOTA QANN加速器(ANT)上实现了3.4倍的速度提升和13.6倍的能效提升,在SOTA SNN加速器(PAICORE)上实现了2.9倍的速度提升和22.1倍的能效提升。

英文摘要

Spiking neural networks (SNNs) exploit event-driven and addition-only computation to substantially improve efficiency for intelligent computation. A key temporal property of SNNs, elastic inference, allows outputs to emerge progressively, enabling responses to salient inputs much earlier than full evaluation. However, existing SNN-specific accelerators cannot capitalize on this property. Layer-by-layer designs emit outputs only after all layers are complete, while time-step-by-time-step designs rely on coarse-grained, layer-wise pipelines that require synchronizing all spines/tokens within a layer. This barrier prevents results from being forwarded immediately, delaying the earliest possible response and forfeiting the benefits of elastic inference. To address these challenges, we propose ELSA, a near-SRAM dataflow architecture that realizes true elastic inference through a fine-grained spine/token-wise pipeline and hardware optimizations tailored to SNNs. ELSA forwards each spine/token immediately upon production, forming a continuous streaming pipeline that substantially reduces the latency to the first response. To enhance this lightweight execution, ELSA introduces a bundled address event representation protocol to lower communication traffic of network-on-chip (NoC), and leverages mini-batch spiking Gustavson-product to cut memory access and exploit inherent sparsity. Combined with mapping and scheduling optimizations, ELSA achieves efficient, event-driven computation without compromising accuracy. Experiments show that SNNs can outperform quantized artificial neural networks (QANNs) while maintaining on-par accuracy. For a 4-bit ResNet-50, ELSA achieves 3.4$\times$ speedup and 13.6$\times$ higher energy efficiency over the SOTA QANN accelerator (ANT), and 2.9$\times$ speedup and 22.1$\times$ energy efficiency gains over the SOTA SNN accelerator (PAICORE).

2605.20784 2026-05-21 cs.AI cs.LG 版本更新

Interaction Locality in Hierarchical Recursive Reasoning

层次递归推理中的交互局部性

Yosuke Miyanishi, Tetsuro Morimura

发表机构 * CyberAgent Inc.(CyberAgent公司)

AI总结 本文提出交互局部性框架,用于测量信息流是否在附近单元或语义段内传输或跨越,通过在HRM和TRM等层次递归推理模型上应用,验证了局部执行与全局规划的可重复测量框架。

详情
AI中文摘要

空间推理需要位置绑定计算和位置不变结构:智能体必须在保持路线、对象或约束层次计划的同时进行局部移动。我们提出交互局部性,一种任务-几何感知的框架,用于衡量信息流是否在附近单元或语义段内传输或跨越。我们通过稀疏自动编码器特征消融和有限噪声激活补丁来实例化该框架,并在附录中报告了结构性雅可比和注意力检查。将其应用于Maze-Hard、Sudoku Extreme和ARC-AGI等模型。在这些模型中,激活补丁给出了最清晰的架构指纹:高层递归状态倾向于在附近单元或相同段内写入信息,而重复的递归更新将这些局部写入累积到更广泛的解决方案结构中。这种模式在迷宫路径、数独约束和ARC-AGI对象邻域中均成立,其中TRM表现最强。为了测试交互局部性是否超越玩具但具有挑战性的网格基准,我们还将其应用于MTU3D,一个大规模的具身3D场景-grounding模型。在MTU3D设置中,因果空间局部性主要出现在视觉场景特征传递给下游grounding模块的过渡处,而不是在视觉编码器中均匀分布。这种对比表明,HRM和TRM中观察到的局部到全局的交接与显式递归推理动态有关,而具身3D模型可能在模块边界集中因果空间结构。交互局部性将直观的局部执行/全局规划故事转化为可重复测量的递归和具身空间推理框架。

英文摘要

Spatial reasoning requires both location-bound computation and location-invariant structure: agents must make local moves while preserving route, object, or constraint-level plans. We propose interaction locality, a task-geometry-aware framework for measuring whether information flow stays within nearby cells or semantic segments, or crosses them. We instantiate the framework with sparse-autoencoder feature ablations and finite-noise activation patching, with structural Jacobian and attention checks reported in the appendix, and apply it to HRM and TRM, two compact hierarchical and recursive reasoning models, on Maze-Hard, Sudoku Extreme, and ARC-AGI. Across these models, activation patching gives the clearest architectural fingerprint: high-level recurrent states tend to write information within nearby cells or same-segment units, while repeated recursive updates accumulate these local writes into broader solution structure. This pattern holds across maze paths, Sudoku constraints, and ARC-AGI object neighborhoods, with the strongest concentration in TRM. To test whether interaction locality extends beyond toy-yet-challenging grid benchmarks, we also apply it to MTU3D, a large-scale embodied 3D scene-grounding model. In this MTU3D setting, causal spatial locality appears primarily at the transition where visual scene features are handed to the downstream grounding module, rather than uniformly throughout the visual encoder. This contrast suggests that the local-to-global handoff observed in HRM and TRM is tied to explicit recursive reasoning dynamics, while embodied 3D models may concentrate causal spatial structure at module boundaries. Interaction locality turns the intuitive local-execution/global-planning story into a reproducible measurement framework for recursive and embodied spatial reasoning.

2605.20758 2026-05-21 cs.AI cs.CV cs.LG cs.RO 版本更新

Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

面向组合奖励的冲突感知加法引导:流模型中的对抗性生成

Xuehui Yu, Fucheng Cai, Meiyi Wang, Xiaopeng Fan, Harold Soh

发表机构 * Smart Systems Institute, National University of Singapore, Singapore(新加坡国立大学智能系统研究所) Faculty of Computing, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机学院) School of Computing, National University of Singapore, Singapore(新加坡国立大学计算机学院)

AI总结 本文提出了一种面向组合奖励的冲突感知加法引导方法,用于在流模型中处理对抗性生成问题,通过动态检测和解决梯度冲突来纠正离曼福德漂移,提升了生成保真度。

Comments Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

在推理时间进行引导采样可以无需微调就通过解释生成过程为可控轨迹来驱动最先进的扩散和流模型。这提供了一种简单灵活的方式,将外部约束(如成本函数或预训练验证器)注入受控生成中。然而,现有方法在同时组合多个约束时往往失效,导致偏离真实数据曼福德。在本工作中,我们识别出这种离曼福德漂移的根本原因,并发现近似误差随着梯度不一致程度严重增加。基于这些发现,我们提出了一种轻量且可学习的方法,即冲突感知加法引导(g^car),该方法通过动态检测和解决梯度冲突来主动纠正离曼福德漂移。我们验证了g^car在多样化的领域中的有效性,从合成数据集和图像编辑到生成决策规划与控制。我们的结果表明,g^car有效纠正了离曼福德漂移,在生成保真度方面超越了基线方法,同时使用轻量计算。代码可在https://github.com/yuxuehui/CAR-guidance获取。

英文摘要

Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance ($g^\text{car}$), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate $g^\text{car}$ across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that $g^\text{car}$ effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at https://github.com/yuxuehui/CAR-guidance.

2605.20756 2026-05-21 cs.LG cs.AI math.OC stat.ML 版本更新

Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

纠正预条件语言模型优化器中的随机更新偏差

Nikhil Nayak, Julia White, Urchade Zaratiana, Kelton Zhang, Henrijs Princis, Dhruv Atreja, Henry Fawcett, Matthew Thomas, George Hurn-Maloney, Ash Lewis

发表机构 * Fastino Labs(Fastino实验室)

AI总结 本文研究了预条件优化器中随机更新规则的有限样本偏差问题,提出了一种单批次偏差校正框架,通过交叉拟合预条件估计和方差校正逆运算来减少梯度-预条件器耦合偏差和逆运算偏差,从而提升预条件优化器的性能。

Comments 32 pages, 3 figures, 13 tables

详情
AI中文摘要

预条件优化器在语言模型训练中至关重要,但其随机更新规则通常被视为对群体预条件下降的直接近似。我们证明这种观点忽略了两个有限样本偏差。首先,梯度和预条件器通常从同一个mini-batch估计,引入梯度-预条件器耦合偏差。其次,即使预条件器估计是无偏的,其逆或逆根通常有偏,因为逆运算是非线性的。我们提出了一种单批次偏差校正框架,以解决这两种效应:交叉拟合预条件估计从独立的微批次组中估计分子和预条件器,而方差校正逆运算利用微批次变化来减去主导的delta-方法偏差项。该框架适用于对角矩、对角曲率和矩阵预条件方法,分别在AdamW、Sophia和Shampoo中实现。偏差校正将Qwen2.5-0.5B的保持预训练损失减少了0.15、0.07和0.11 nat,分别;对混合质量预训练和下游指令微调的影响始终是中性到积极的。这些结果确立了偏差校正作为减少有限样本更新偏差和提升预条件优化器性能的实用机制。

英文摘要

Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient--preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by $0.15$, $0.07$, and $0.11$ nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.

2605.20751 2026-05-21 cs.LG cs.AI cs.SY eess.SY 版本更新

PACD-Net: Pseudo-Augmented Contrastive Distillation for Glycemic Control Estimation from SMBG

PACD-Net: 假设增强对比学习用于从SMBG估计血糖控制

Canyu Lei, David Repaske, Jianxin Xie

发表机构 * University of Virginia, School of Data Science, Charlottesville, VA 22903, USA(弗吉尼亚大学数据科学学院) University of Virginia, Department of Pediatrics, Charlottesville, VA 22903, USA(弗吉尼亚大学儿科系)

AI总结 本研究提出PACD-Net,一种自监督对比学习框架,用于从稀疏不规则采样的SMBG数据中估计血糖控制指标,通过伪SMBG样本指导学习并提高模型的准确性和稳定性。

详情
AI中文摘要

有效的糖尿病管理需要持续监测血糖水平。临床中,通过连续葡萄糖监测(CGM)获取的指标如时间范围(TIR)、低于范围时间(TBR)和高于范围时间(TAR)用于评估血糖控制。然而,由于CGM成本高且可及性有限,许多患者依赖自测血糖(SMBG)。与CGM不同,SMBG提供稀疏且不规则的测量,使得准确估计这些指标具有挑战性。传统监督学习方法在稀疏数据下表现不佳,导致泛化能力差和性能不稳定。为此,我们提出PACD-Net,一种自监督对比学习框架,用于从SMBG估计血糖控制。使用具有更丰富时间覆盖的伪SMBG样本作为教师信号,指导从稀疏观测中学习。此外,多视图对比学习强制不同采样模式下的表征一致性。模型采用混合Swin Transformer-CNN主干网络以捕捉稀疏SMBG序列中的时间依赖性。实验结果表明,PACD-Net在真实世界SMBG数据中对TAR、TIR和TBR的估计优于现有方法,实现了在极稀疏观测设置下的改进准确性和增强的稳定性与泛化能力。所提出的框架为临床SMBG解释提供了实用工具,并为从稀疏且不规则采样的传感器数据中学习提供了通用方法。

英文摘要

Effective diabetes management requires continuous monitoring of glycemic levels. Clinically, glycemic control is assessed using metrics such as Time in Range (TIR), Time Below Range (TBR), and Time Above Range (TAR), typically derived from continuous glucose monitoring (CGM). However, many patients rely on self-monitoring of blood glucose (SMBG) due to the high cost and limited accessibility of CGM. Unlike CGM, SMBG provides sparse and irregular measurements, making accurate estimation of these metrics challenging. Conventional supervised learning approaches struggle under such sparsity, leading to poor generalization and unstable performance. To address this, we propose PACD-Net, a self-supervised contrastive knowledge distillation framework for estimating glycemic control from SMBG. Pseudo-SMBG samples with richer temporal coverage are used as teacher signals to guide learning from sparse observations. In addition, multi-view contrastive learning enforces representation consistency across diverse sampling patterns. The model adopts a hybrid Swin Transformer-CNN backbone to capture temporal dependencies in sparse SMBG sequences. Experimental results demonstrate that PACD-Net consistently outperforms existing methods in estimating TAR, TIR, and TBR from real-world SMBG data, achieving improved accuracy as well as enhanced stability and generalization under extremely sparse observation settings. The proposed framework provides a practical tool for clinical SMBG interpretation and offers a generalizable approach for learning from sparse and irregularly sampled sensor data in broader applications.

2605.20745 2026-05-21 cs.LG cs.AI cs.CL 版本更新

The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

验证器严格性的隐含信号:通过选择性潜在引导控制和改进逐步验证

Yefan Zhou, Yilun Zhou, Austin Xu, Soroush Vosoughi, Shafiq Joty, Jiang Gui

发表机构 * Dartmouth College(达特茅斯学院) Datadog AI Research(Datadog人工智能研究) Salesforce AI Research(Salesforce人工智能研究)

AI总结 本文研究了通过隐藏状态干预控制验证器严格性的方法,提出VerifySteer通过利用潜在正确性信号进行样本级路由并选择性干预段落边界,从而在ProcessBench和Hard2Verify数据集上优于基线方法,且在推理计算上更高效。

详情
AI中文摘要

生成验证器已成为逐步验证的一种有前途的范式,但其验证行为往往校准不佳:它们可能过于宽松而错过错误步骤,或过于严格而拒绝正确推理。我们将这种倾向于过于宽松或过于严格的行为称为验证器严格性。在本工作中,我们研究是否可以通过隐藏状态干预来控制验证器严格性。我们揭示了一个验证特定的隐藏状态信号:在逐步验证中,验证器接受或拒绝解决方案步骤的倾向编码在对应的验证段落边界附近。利用这一信号,我们证明隐藏状态引导可以直接调节验证器严格性,而无需微调。然而,统一引导会导致错误检测与正确性认证之间的权衡。为了解决这个问题,我们提出了VerifySteer,它利用潜在正确性信号进行样本级路由,并选择性地在段落边界进行干预。在ProcessBench和Hard2Verify上的实验表明,VerifySteer优于提示优化和激活引导基线,并且在需要更少推理计算的情况下与自一致性竞争。VerifySteer还与验证微调互补,在微调验证器上提供进一步的收益。代码可在https://github.com/YefanZhou/VerifySteer上获得。

英文摘要

Generative verifiers have emerged as a promising paradigm for step-wise verification, but their verification behavior is often poorly calibrated: they may be under-critical and miss erroneous steps, or over-critical and reject correct reasoning. We refer to this tendency to be overly lenient or overly critical as verifier strictness. In this work, we study whether verifier strictness can be controlled through hidden-state intervention. We uncover a verification-specific hidden-state signal: in step-wise verification, a verifier's tendency to accept or reject a solution step is encoded near the boundary of the corresponding verification paragraph. Exploiting this signal, we show that hidden-state steering can directly modulate verifier strictness without fine-tuning. However, uniform steering induces a trade-off between error detection and correctness certification. To address this, we propose VerifySteer, which exploits latent correctness signals for sample-level routing and selectively intervenes on paragraph boundaries. Experiments on ProcessBench and Hard2Verify show that VerifySteer outperforms prompt optimization and activation steering baselines, and is competitive with self-consistency while requiring 4-7x less inference compute. VerifySteer is also complementary to verification fine-tuning, providing further gains on top of fine-tuned verifiers. The code is available at https://github.com/YefanZhou/VerifySteer.

2605.20744 2026-05-21 cs.LG cs.AI 版本更新

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

可验证的环境:面向大规模评估奖励黑客的尝试

Amit Roth, Ankur Samanta, Matan Halevy, Yoav Levine, Yonathan Efroni

发表机构 * Tel Aviv University(特拉维夫大学) Columbia University(哥伦比亚大学) Taso Labs(Taso实验室)

AI总结 本文提出了一种新的评估方法来衡量奖励黑客,通过在环境中嵌入可检测的奖励黑客机会,使评估更加可靠和自动化,通过TextArena测试床分析了不同语言模型在多样化环境中的奖励黑客行为。

Comments Project Page - https://majoroth.github.io/hack-verifiable-environments/

详情
AI中文摘要

使自主代理与人类意图对齐仍然是现代AI中的核心挑战。这一挑战的一个关键表现是奖励黑客,即代理在评估信号下表现成功,但违反了预期目标。奖励黑客已在多种设置中被观察到,但可靠的大规模测量方法仍然匮乏。在本文中,我们引入了一种新的评估范式来衡量奖励黑客。与以往主要通过事后分析代理轨迹不同,我们直接在环境中嵌入可检测的奖励黑客机会,使其利用可验证,从而能够确定和自动化测量代理如何利用这些漏洞。我们通过TextArena实现了这一方法,并发布了Hack-Verifiable TextArena,一个可以可靠测量奖励黑客的测试床。使用此基准,我们分析了不同语言模型在多样化环境和设置中的奖励黑客行为。我们开源代码在https://github.com/MajoRoth/hack-verifiable-environments/。

英文摘要

Aligning autonomous agents with human intent remains a central challenge in modern AI. A key manifestation of this challenge is reward hacking, whereby agents appear successful under the evaluation signal while violating the intended objective. Reward hacking has been observed across a wide range of settings, yet methods for reliably measuring it at scale remain lacking. In this work, we introduce a new evaluation paradigm for measuring reward hacking. Whereas prior studies have primarily analyzed it post hoc by inspecting agent trajectories, we instead embed detectable reward hacking opportunities directly into environments. This makes their exploitation verifiable by design, enabling deterministic and automated measurement of whether and how agents exploit such vulnerabilities. We instantiate this approach in $\textit{TextArena}$ and release $\textit{Hack-Verifiable TextArena}$, a testbed in which reward hacking can be measured reliably. Using this benchmark, we analyze reward hacking behavior across language models in diverse environments and settings. We open source the code at https://github.com/MajoRoth/hack-verifiable-environments/.

2605.20742 2026-05-21 cs.AI 版本更新

VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals

VBFDD-Agent 用于电动汽车电池故障检测与诊断:电池数字信号的描述性文本建模

Joey Chan, Zhen Chen, Ershun Pan

发表机构 * Department of Industrial Engineering and Management, School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China(工业工程与管理系,机械工程学院,上海交通大学,上海200240,中国)

AI总结 本研究提出了一种基于描述性文本建模的电池信号报告方法,用于解决开放源代码电池故障报告数据集稀缺和缺乏统一维护知识表示的问题,通过构建语言语料库来改进电池健康诊断和维护,提出了VBFDD-Agent,整合了描述性电池状态文本、历史案例检索、本地维护手册和大语言模型推理,生成结构化的诊断结果和维护建议。

详情
AI中文摘要

随着电动汽车的迅速普及,锂离子电池的安全性和可靠性已成为关键问题。有效的异常检测对于确保电池安全运行至关重要。然而,随着电池系统和运行场景日益复杂,电池故障诊断和维护需要更强的跨领域适应性和人机协作能力。传统故障检测和诊断方法通常针对特定场景和预定义流程设计,使其在复杂现实应用中效果有限。为了解决开放源代码电池故障报告数据集稀缺和缺乏统一维护知识表示的问题,本研究提出了一种电池信号报告的描述性文本建模方法。监测信号、统计特征、异常记录和状态评估结果被转换为结构化且易于阅读的自然语言描述,形成用于电池健康诊断和维护的语言语料库。基于此语料库,我们提出了VBFDD-Agent,一种用于汽车级电池系统的电池故障检测和诊断代理。VBFDD-Agent整合了描述性电池状态文本、历史案例检索、本地维护手册和大语言模型推理,以生成结构化的诊断结果和维护建议。实验表明,所提出的框架能够基于描述性文本表示准确执行异常监控,并提供灵活、高效且可操作的维护建议。专家评估进一步确认了所生成建议的实用价值。总体而言,VBFDD-Agent将传统电池诊断从标签预测扩展到可解释和以维护为导向的决策支持。

英文摘要

With the rapid proliferation of electric vehicles, the safety and reliability of lithium-ion batteries have become critical concerns. Effective anomaly detection is essential for ensuring safe battery operation. However, as battery systems and operating scenarios become increasingly complex, battery fault diagnosis and maintenance require stronger cross-domain adaptability and human-AI collaboration. Traditional fault detection and diagnosis methods are usually designed for specific scenarios and predefined workflows, making them less effective in complex real-world applications. To address the scarcity of open-source battery fault report corpora and the lack of unified maintenance knowledge representation, this study proposes a descriptive text modeling approach for battery signal reports. Monitoring signals, statistical features, anomaly records, and state assessment results are transformed into structured and readable natural language descriptions, forming a language corpus for battery health diagnosis and maintenance. Based on this corpus, we propose VBFDD-Agent, a vehicle battery fault detection and diagnosis agent for automotive-grade battery systems. VBFDD-Agent integrates descriptive battery-state texts, historical case retrieval, local maintenance manuals, and large language model reasoning to generate structured diagnostic results and maintenance recommendations. Experiments show that the proposed framework can accurately perform anomaly monitoring based on descriptive textual representations and provide flexible, efficient, and actionable maintenance suggestions. Expert evaluation further confirms the practical value of the generated recommendations. Overall, VBFDD-Agent extends traditional battery diagnosis from label prediction to interpretable and maintenance-oriented decision support.

2605.20740 2026-05-21 cs.LG cs.AI cs.CL 版本更新

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

Distribution-Aware Reward: 用于LLM回归的预测分布强化学习

Jungsoo Park, Hyungjoo Chae, Ethan Mendes, Jay DeYoung, Varsha Kishore, Wei Xu, Alan Ritter

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Allen Institute for AI(人工智能研究院)

AI总结 本文提出Distribution-Aware Reward,一种基于预测分布的强化学习方法,旨在提升语言模型在回归任务中的预测分布质量,而非仅优化单个解码输出。通过连续排名概率分数评估多个解码样本的分布,并基于每个rollout对分布质量的边际贡献分配信用,从而提升预测的准确性和分散性。实验表明,该方法在多个任务中优于监督微调和点wise强化学习基线,尤其在KBSS数据集上Spearman相关性提升6点。

Comments 21 pages, 5 figures

详情
AI中文摘要

大型语言模型能够从异质输入(如文本、代码和分子字符串)预测实值量,但大多数训练目标独立评分每个解码的浮点数,仅改进点估计而无法确保校准的预测分布。这限制了需要候选排序或不确定性估计的应用。我们引入Distribution-Aware Reward,一种基于策略的强化学习目标,其主要贡献是训练语言模型生成更好的回归任务预测分布,而非仅优化单个解码输出与标量目标的匹配。我们的方法将多个解码样本视为经验预测分布,并使用连续排名概率分数进行评估,基于每个rollout对分布质量的边际贡献分配leave-one-out信用,奖励既准确又适当分散的预测。我们在受控高斯混合任务、代码性能预测和分子属性预测(从SMILES字符串)上评估了我们的方法。在所有任务中,我们的方法优于监督微调和点wise强化学习基线,具有显著的排名相关性提升,包括在KBSS数据集上Spearman相关性提升6点。在MoleculeNet上,仅使用SMILES字符串,仍能与强大的图基和3D分子模型竞争。进一步分析表明,我们的方法缓解了rollout多样性崩溃并改进了不确定性诊断,表明直接优化预测分布使语言模型回归更具鲁棒性和校准性。

英文摘要

Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.

2605.20734 2026-05-21 cs.CR cs.AI 版本更新

An Application-Layer Multi-Modal Covert-Channel Reference Monitor for LLM Agent Egress

面向LLM代理出站的应用层多模态隐通道参考监控器应用

Alfredo Metere

发表机构 * Enclawed, LLC(Enclawed公司)

AI总结 本文提出了一种应用层多模态隐通道参考监控器,用于检测和防止LLM代理在消息中泄露数据,通过多阶段文本管道、媒体加密器和残余容量测量来实现对隐通道的监控和管理。

详情
AI中文摘要

一个发送消息的大型语言模型(LLM)代理可能会在消息中泄露数据。目标允许列表和内容扫描器无法检测到一个看似无害的负载本身是否是一个隐通道:被篡改的代理在零宽度字符、同形字符、空格、base64、JavaScript对象表示法(JSON)键顺序、消息时间或大小中编码位,并在二进制出站中,在每个图像的最显著位平面、图像平均亮度、图像序列排列、超声波音调或可听频段音频化数据中进行编码。我们的出站参考监控器有三个贡献。 (i) 一个包含十个容量减少阶段的文本管道,一个针对每个终点的漏桶容量账本,以及一个分阶段的姿势,该姿势从第一天起强制执行无损阶段。 (ii) 两个媒体加密器(一个傅里叶域音频带限器和一个红绿蓝(RGB)图像位深度和平均亮度桶器),由启动时的加密合法性证明所控制:审计员在启动时发布受信任的Ed25519密钥和{kind, data-class}配对;只有具有授权类验证签名的负载才被豁免。证明绕过了真实媒体和作为载体音频化或栅格化的数据之间的基于内容的判别难题;未签名的媒体默认被怀疑;基于内容的标准化器关闭了图像排列通道。 (iii) 剩余容量是嵌入位和恢复位之间的Miller-Madow修正的互信息(零表示被破坏),通过一个包含十五个工作编码器的对抗性集合进行测量,这些编码器覆盖文本、图像和音频。参考实现将残余容量驱动到每个可破坏通道的零,并驱动无法破坏而不破坏图像的单一通道(每图像平均亮度)到一个规定界限。

英文摘要

A large language model (LLM) agent that sends messages can leak data inside them. Destination allowlists and content scanners do not police whether an otherwise-benign payload is itself a covert channel: a compromised agent encodes bits in zero-width characters, homoglyphs, whitespace, base64, JavaScript Object Notation (JSON) key ordering, message timing or size -- and, in binary egress, in least-significant-bit (LSB) pixel planes, per-image mean luminance, inter-image sequence permutation, ultrasonic tones, or audible-band sonified data. Our egress reference monitor has three contributions. (i) A text pipeline of ten capacity-reducing stages, a per-sink leaky-bucket capacity ledger, and a staged posture that enforces lossless stages from day one. (ii) Two media scramblers (a Fourier-domain audio band-limiter and a red-green-blue (RGB) image bit-depth and mean-luminance bucketer) gated by a boot-time cryptographic legitimacy attestation: an auditor publishes at boot the trusted Ed25519 keys and {kind, data-class} pairs; only payloads with a verifying signature for an authorized class are exempt. The attestation sidesteps the intractable content-based discrimination between real media and data sonified or rasterized as a carrier; unsigned media is suspect by default; a content-addressed canonicalizer closes the inter-image permutation channel. (iii) Residual capacity is the Miller--Madow corrected mutual information between embedded and recovered bits (zero when destroyed), measured by an adversarial ensemble of fifteen working encoders across text, image and audio. The reference implementation drives residual capacity to zero on every destroyable channel and to a stated bound on the one (per-image mean luminance) that cannot be destroyed without ruining the image.

2605.20730 2026-05-21 cs.CL cs.AI 版本更新

Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning

分布对齐作为设计任务向量在上下文学习中的准则

Jihoon Kwon, Jiwon Choi, Jy-yong Sohn

发表机构 * Seoul National University(首尔国立大学) Yonsei University(延世大学)

AI总结 本文提出通过分布对齐来设计任务向量,引入了NTP距离作为衡量指标,并开发了线性任务向量方法以提升性能和效率。

Comments 9 pages, preprint

详情
AI中文摘要

在上下文学习(ICL)中,大型语言模型(LLMs)通过演示来适应新任务,但随着上下文长度增加,推理成本也随之上升。虽然任务向量通过压缩演示为紧凑的隐藏状态表示提供了有前途的替代方案,但其质量只能通过下游任务准确性来评估。本文认为,使用任务向量的推理应使其预测分布与ICL的预测分布对齐。为此,我们引入了$d_{ ext{NTP}}$,一个衡量任务向量推理与ICL推理之间下一个标记概率差异的指标。我们的实证分析表明,$d_{ ext{NTP}}$作为性能代理,与下游准确性呈强负相关。受此启发,我们开发了线性任务向量(LTV)方法,通过闭合形式的线性映射来最小化$d_{ ext{NTP}}$,通过回归估计演示效果。在八个分类基准和五个LLMs上,LTV一致优于现有任务向量基线,平均准确率提高了9.2%,同时减少了推理延迟。我们进一步证明LTV在回归任务上优于基线。此外,我们研究了LTV在不同模型规模间的可转移性;这在任务向量研究中仍是一个初级问题。具体而言,我们实证显示,较大模型的任务向量可以将较小模型的性能提高6.4%,表明提取的任务表示有新的用途。

英文摘要

In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should align their predictive distribution with that of ICL. To quantify this, we introduce $d_{\text{NTP}}$, a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that $d_{\text{NTP}}$ serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize $d_{\text{NTP}}$ via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2\% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales; an aspect that has remained nascent in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model's performance by 6.4\%, suggesting a new utility for extracted task representations.

2605.20722 2026-05-21 cs.LG cs.AI 版本更新

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

AGPO: 基于双统计反馈的自适应群体策略优化

Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 本文提出AGPO,一种无 critic 的 GRPO 改进方法,通过群体层面的统计信息控制更新幅度和探索。在九个英语和中文数学/STEM 基准上,Qwen2.5-14B 在相同生成 token 预算下优于 PPO/GRPO,达到 GSM8K 67.3% 和 MATH 40.5%。

详情
AI中文摘要

强化学习提升大语言模型推理能力,但 PPO/GRPO 通常使用固定剪切和解码温度,使训练脆弱且调参困难。我们提出自适应群体策略优化(AGPO),一种无 critic 的 GRPO 改进方法,利用群体层面统计信息控制更新幅度和探索。AGPO 使用共享的探针衍生统计状态驱动两个控制器:(i)自适应剪切,根据奖励分散度和偏度、探针投票熵、策略熵和逐步 KL 偏移设置信任区域大小;(ii)双向自适应温度采样,根据与运行基线相对的中心不确定性加热或冷却解码。在九个英语和中文数学/STEM 基准上,使用 AGPO 训练的 Qwen2.5-14B 在相同生成 token 预算下优于 PPO/GRPO,达到 GSM8K 67.3% 和 MATH 40.5%。收益转移到 Llama-3-8B 和 Gemma-2-9B,消融实验确认两个模块互补。我们的实现可在 https://github.com/wandugu/paper_agpo 公开获取。

英文摘要

Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.

2605.20713 2026-05-21 cs.CV cs.AI cs.LG 版本更新

SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction

SAVER:选择性所需视觉证据用于多模态信息提取

Miaobo Hu, Shuhao Hu, Bokun Wang, Rui Chen, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) University of Chinese Academy of Sciences(中国科学院大学) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 该研究提出SAVER框架,通过选择性视觉证据提升多模态命名实体识别和关系抽取的性能,减少计算开销并提高准确性。

详情
AI中文摘要

多模态信息提取在社交媒体中具有挑战性,因为帖子可能附加多个弱相关、冗余甚至误导性的图像。在这样的情况下,持续的多模态融合会浪费计算资源并放大虚假的视觉提示。核心挑战是决定是否为每个候选跨度或标记实体对咨询视觉信息,以及如果需要,哪些小图像子集提供可信的证据。我们提出SAVER,一种选择性视觉所需框架用于多模态命名实体识别和多模态关系抽取。SAVER使用符合性地面性门(CGG)来估计MNER中的跨度级视觉地面性,从两个标记实体推导出对级激活,通过符合性风格程序和Clopper-Pearson上界校准激活阈值。当被激活时,一个子模ularity相关性-多样性选择器选择跨图像的紧凑证据子集,然后通过集合变换器进行聚合。一个受能量启发的联合评分头结合文本、可选视觉证据、文本-图像一致性以及稀疏路由用于实体类型或关系分类。实验表明,SAVER在强文本-only和持续多模态基线上一致提高F1,同时减少AURC,增加激活覆盖面积,在固定风险水平下,降低FLOPs和P90延迟。

英文摘要

Multimodal IE in social media is difficult because a post may attach multiple images that are weakly related, redundant, or even misleading with respect to the text. In this setting, always-on multimodal fusion wastes computation and can amplify spurious visual cues. The core challenge is to decide, for each candidate span or marked entity pair, whether vision should be consulted at all and, if so, which small subset of images provides trustworthy evidence. We propose SAVER, a selective vision-as-needed framework for multimodal named entity recognition and multimodal relation extraction. SAVER uses a Conformal Groundability Gate (CGG) to estimate span-level visual groundability in MNER, derive pair-level activation in MRE from the two marked entities, and calibrate the activation threshold on a held-out split via a conformal-style procedure with Clopper--Pearson upper bounds. When activated, a submodular relevance--diversity selector chooses a compact evidence subset across images, which is then aggregated by a Set Transformer. An energy-inspired joint scoring head combines text, optional visual evidence, text--image consistency, and sparse routing for entity typing or relation classification. Experiments show that SAVER consistently improves F1 over strong text-only and always-on multimodal baselines, while reducing AURC, increasing activation coverage at a fixed risk level, and lowering FLOPs and P90 latency.

2605.20712 2026-05-21 cs.CL cs.AI 版本更新

SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

SCRIBE:用于印度语言ASR的诊断评估和丰富转录模型

Kavya Manohar, Arghya Bhattacharya, Kush Juvekar, Kumarmanas Nethil

发表机构 * Adalat AI, India(印度Adalat人工智能公司)

AI总结 SCRIBE通过沙地容忍对齐和领域词汇注入,提供词错误率的分类分解,解决了传统词错误率在处理聚合语言时的不足,同时释放了用于印地语、马拉雅尔语和卡纳达语的丰富转录模型。

Comments Submitted to Interspeech 2026

详情
AI中文摘要

自动语音识别仅在更正成本低于手动输入时才取代打字,这一阈值由错误类型而非数量决定:纠正一个误识别的领域术语的成本远高于插入一个逗号。词错误率(WER)在两个方面失效:它将不同的错误类别合并为一个标量,且它在结构上惩罚了聚合语言,其中有效的沙地合并会膨胀分数。我们引入SCRIBE,一个诊断框架,通过沙地容忍对齐和领域词汇注入,将错误分解为词法、标点、数字和领域实体率。人类验证确认SCRIBE在WER无法做到的地方与专家判断一致。我们发布了SCRIBE,一个LLM整理流程、基准测试和开放权重的丰富转录模型,适用于印地语、马拉雅尔语和卡纳达语。

英文摘要

Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails on two fronts: it collapses distinct error categories into a single scalar, and it structurally penalizes agglutinative languages where valid sandhi merges inflate scores. We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.

2605.20704 2026-05-21 cs.CR cs.AI cs.MA 版本更新

Heartbeat-Bound Hierarchical Credentials: Cryptographic Revocation for AI Agent Swarms

基于心跳的分层凭证:面向AI代理群的加密撤销机制

Saurabh Deochake

发表机构 * SentinelOne Inc.(SentinelOne公司)

AI总结 本文提出了一种基于心跳的分层凭证协议,通过将凭证有效性与周期性父级存活证明绑定,实现无需网络连接的高效凭证撤销,显著减少了僵尸代理的存活时间并提升了验证效率。

详情
AI中文摘要

自主AI代理生成子代理群体时会产生安全漏洞:现有凭证撤销机制,如OAuth 2.0 introspection、OCSP和W3C Status Lists,需要连接中央权威,导致操作员关闭后,僵尸代理可能持续执行特权操作数分钟至数小时。本文提出Heartbeat-Bound Hierarchical Credentials (HBHC),一种将凭证有效性与周期性父级存活证明绑定的加密协议。验证者仅需缓存的公钥和本地时钟即可强制凭证的新鲜度,无需网络往返。当心跳生成停止时,所有后代凭证在确定性时间内$W_z \le W_{\max} + Δ_h + ε$内不可用,前提是时钟偏差和父级密钥在安全 enclave 中被持有。在协议层和真实LLM支持的代理群(GPT-4o-mini)上的评估显示,与OAuth 2.0相比,僵尸窗口减少了90倍,Rust中完整认证时间为0.26毫秒,每秒可进行18,000+次验证,在并发HTTP负载下保持稳定,且验证延迟在10至10,000个代理范围内保持稳定。真实代理实验显示,工具调用的端到端开销为0.71%,在绕过应用层防护的提示注入攻击下,撤销后无工具调用,且在四层结构的49个代理中实现理论范围内的级联撤销。

英文摘要

Autonomous AI agents that spawn sub-agent swarms create a safety gap: existing credential revocation mechanisms, OAuth~2.0 introspection, OCSP, and W3C Status Lists, require network connectivity to a central authority, leaving ``zombie agents'' executing privileged operations for minutes to hours after operator shutdown. We present Heartbeat-Bound Hierarchical Credentials (HBHC), a cryptographic protocol that binds credential validity to periodic parent liveness proofs. Verifiers enforce freshness using only a cached public key and local clock; no network round-trip is required. When heartbeat generation ceases, all descendant credentials become unusable within a deterministically bounded window $W_z \le W_{\max} + Δ_h + ε$, conditional on bounded clock skew and parent keys held in secure enclaves. Evaluation at the protocol layer and with real LLM-backed agent swarms (GPT-4o-mini) demonstrates a 90$\times$ reduction in the zombie window over OAuth~2.0, 0.26~ms full authentication in Rust, 18,000+ verifications per second under concurrent HTTP load, and stable per-verification latency from 10 to 10,000 agents. Real-agent experiments show 0.71\% end-to-end overhead on tool calls, zero post-revocation tool calls under prompt injection that bypasses application-layer guardrails, and cascading revocation across a 49-agent four-level hierarchy within the theoretical bound.

2605.20693 2026-05-21 cs.CL cs.AI stat.ML 版本更新

Interpretable Discriminative Text Representations via Agreement and Label Disentanglement

通过共识和标签解缠获得可解释的判别文本表示

Tong Wang, Yiqing Xu, Leo Yang Yang

发表机构 * Yale University(耶鲁大学) Stanford University(斯坦福大学) Hong Kong Baptist University(香港 Baptist 大学)

AI总结 本文提出了一种可解释的判别文本表示方法,通过共识和标签解缠来确保特征的可解释性和可重复性,实验表明该方法在多个文本分类任务中表现优异,产生了更清晰且更少标签纠缠的特征。

详情
AI中文摘要

可解释的文本表示应暴露出不仅具有预测性,而且对独立审计员来说有意义的坐标。现有的判别表示通常使用匿名嵌入方向,而概念瓶颈和LLM辅助方法将自然语言名称附加到特征上,但并未确保这些定义是可重复的或与目标标签不同。我们提出了一种可解释判别文本表示的操作标准:每个坐标应满足概念清晰度,通过独立标注员应用特征定义之间的机会调整一致性来衡量,并且标签解缠,即特征不应仅仅改述预测目标。我们通过LLM辅助特征发现(LFD)方法实现了这一标准,这是一种迭代方法,从对比性反向文本对中提出词汇和语义特征,通过跨LLM Cohen's $κ$ 筛选候选,并通过残差保留的预测增益选择特征。一种简化分析将$κ$筛选与每个特征的注释噪声界限联系起来,正式化一致性作为可靠性检查。在十个跨越七个语料库的文本分类任务中,LFD与强大的文本瓶颈基线具有相同的预测性能,同时产生明显更清晰且标签纠缠更少的特征。232名人类审计员的实验表明,LFD特征在人类-人类和人类-LLM一致性方面优于基线概念,且审计员一致认为它们更少标签泄漏。这些结果表明,经过一致性测试和标签解缠的坐标为可解释文本分类提供了一个实用的可审计标准。

英文摘要

Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen's $κ$, and selects features by residual held-out predictive gain. A stylized analysis connects the $κ$ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human--human and human--LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.

2605.20689 2026-05-21 cs.CL cs.AI cs.IR cs.LG 版本更新

DIVE: Embedding Compression via Self-Limiting Gradient Updates

DIVE: 通过自限制梯度更新实现嵌入压缩

Dongfang Zhao

发表机构 * University of Washington Tacoma School of Engineering and Technology(华盛顿大学塔可姆分校工程与技术学院)

AI总结 本文提出DIVE方法,通过自限制的三元组损失和头级NT-Xent对比损失解决嵌入压缩中因标注数据稀缺导致的过拟合问题,提升了检索性能。

详情
AI中文摘要

大型语言模型的高维嵌入对向量搜索系统造成了显著的存储和计算成本。最近的嵌入压缩方法,包括Matryoshka-Adaptor(EMNLP 2024)、Search-Adaptor(ACL 2024)和SMEC(EMNLP 2025),通过轻量级残差适配器实现降维,但其训练目标在标注数据稀缺时导致严重过拟合,使检索性能低于冻结基线。我们提出DIVE(通过隐式视图集合进行降维),一种压缩适配器,通过两种机制解决这一失败。首先,一个自限制的基于hinge的三元组损失在三元组满足边距约束时产生零梯度,限制应用于预训练嵌入空间的总扰动。其次,头级NT-Xent对比损失将每个嵌入的多个学习投影视为隐式视图,提供密集的自监督梯度,补偿小数据集上三元组信号的稀疏性。在六个BEIR数据集上,DIVE在每个数据集和每个评估的压缩比上均优于所有三个基线适配器,具有14M参数的开源实现。

英文摘要

High-dimensional embeddings from large language models impose significant storage and computational costs on vector search systems. Recent embedding compression methods, including Matryoshka-Adaptor (EMNLP 2024), Search-Adaptor (ACL 2024), and SMEC (EMNLP 2025), enable dimensionality reduction through lightweight residual adapters, but their training objectives cause severe overfitting when labeled data is scarce, degrading retrieval performance below the frozen baseline. We propose \textsc{DIVE} (\textbf{D}imensionality reduction with \textbf{I}mplicit \textbf{V}iew \textbf{E}nsembles), a compression adapter that addresses this failure through two mechanisms. First, a self-limiting hinge-based triplet loss produces zero gradient once a triplet satisfies the margin constraint, bounding the total perturbation applied to the pretrained embedding space. Second, a head-wise NT-Xent contrastive loss treats multiple learned projections of each embedding as implicit views, providing dense self-supervised gradients that compensate for the sparsity of the triplet signal on small datasets. Across six BEIR datasets, \textsc{DIVE} outperforms all three baseline adapters on every dataset and at every evaluated compression ratio, with a 14M-parameter open-source implementation.

2605.20678 2026-05-21 cs.LG cs.AI 版本更新

Dynamic TMoE: A Drift-Aware Dynamic Mixture of Experts Framework for Non-Stationary Time Series Forecasting

动态TMoE:一种针对非平稳时间序列预测的漂移感知动态专家混合框架

Jiawen Zhu, Shuhan Liu, Di Weng, Yingcai Wu

发表机构 * School of Software Technology, Zhejiang University, Ningbo, China State Key Lab of CAD\&CG, Zhejiang University, Hangzhou, China

AI总结 本文提出Dynamic TMoE框架,通过动态构建异构专家和剪枝冗余专家来优化容量,并利用时间记忆路由器确保稳定且上下文感知的专家选择,从而在非平稳时间序列预测中实现更优性能。

Comments 27 pages, 7 figures. Accepted to ICML 2026

详情
AI中文摘要

非平稳时间序列预测面临由演变分布偏移带来的挑战,静态模型难以捕捉这些变化。虽然混合专家(MoE)架构提供了解耦复杂漂移模式的有前景范式,但现有方法受限于固定专家池和无记忆路由,阻碍了其适应突发制度转变的能力。为此,我们提出Dynamic TMoE框架,将架构进化与时间连续性统一在学习阶段。通过最大均值偏差(MMD)检测分布偏移,动态实例化异构专家并剪枝冗余专家以优化容量。此外,时间记忆路由器利用循环状态和异常库确保稳定、上下文感知的专家选择,无需测试时更新。在九个基准测试中的实验表明,该方法实现了最先进的性能,将MSE减少10.4%,MAE减少7.8%。代码可在https://github.com/andone-07/Dynamic-TMoE获取。

英文摘要

Non-stationary time series forecasting is challenged by evolving distribution shifts that static models struggle to capture. While Mixture-of-Experts (MoE) architectures offer a promising paradigm for decoupling complex drift patterns, existing approaches are limited by fixed expert pools and memoryless routing, hampering their ability to adapt to abrupt regime shifts. To address this, we propose Dynamic TMoE, a framework that unifies architectural evolution with temporal continuity during learning phase. By detecting distribution shifts via Maximum Mean Discrepancy (MMD), we dynamically instantiate heterogeneous experts and prune redundant ones to optimize capacity. Additionally, a temporal memory router leverages recurrent states and an anomaly repository to ensure stable, context-aware expert selection without requiring test-time updates. Experiments on nine benchmarks demonstrate state-of-the-art performance, reducing MSE by 10.4% and MAE by 7.8%. Code is available at https://github.com/andone-07/Dynamic-TMoE.

2605.20668 2026-05-21 cs.CL cs.AI cs.LG 版本更新

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

人工智能审稿人的局限与机遇:对Nature系列论文审稿的45位专家科学家的审查

Seungone Kim, Dongkeun Yoon, Kiril Gashteovski, Juyoung Suk, Jinheon Baek, Pranjal Aggarwal, Ian Wu, Viktor Zaverkin, Spase Petkoski, Daniel R. Schrider, Ilija Dukovski, Francesco Santini, Biljana Mitreska, Yong Jeong, Kyeongha Kwon, Young Min Sim, Dragana Manasova, Arthur Porto, Biljana Mojsoska, Makoto Takamoto, Marko Shuntov, Ruoqi Liu, Hyunjoo Jenny Lee, Niyazi Ulas Dinç, Yehhyun Jo, Sunkyu Han, Chungwoo Lee, Huishan Li, Esther H. R. Tsai, Ergun Simsek, Khushboo Shafi, Yeonseung Chung, Jihye Park, Aleksandar Shulevski, Henrik Christiansen, Yoosang Son, Elly Knight, Amanda Montoya, Jeongyoun Ahn, Christian Langkammer, Heera Moon, Changwon Yoon, Nikola Stikov, Mooseok Jang, Edward Choi, Junhan Kim, Yeon Sik Jung, Woo Youn Kim, Jae Kyoung Kim, Ishraq Md Anjum, Hyun Uk Kim, Drew Bridges, Carolin Lawrence, Xiang Yue, Alice Oh, Akari Asai, Sean Welleck, Graham Neubig

发表机构 * Nature(自然)

AI总结 本文通过大规模专家标注研究,探讨了AI审稿人在科学同行评审中的能力与局限,发现AI审稿在准确性、显著性和证据充分性方面表现优异,但存在领域知识有限、上下文管理不足等弱点,表明AI审稿是人类审稿的补充而非替代。

Comments Work in progress

详情
AI中文摘要

随着AI能力的提升,AI审稿人开始被应用于科学同行评审,但其能力和可信度仍存疑:许多科学家将其视为概率系统,缺乏评估研究的专业能力,而其他研究人员则对AI的准备程度更为乐观,但缺乏实证支持。理解AI审稿人擅长什么、哪里不足以及仍需解决的挑战至关重要。然而,现有的AI审稿评估主要关注其判断是否与人类一致(例如评分对齐、接受预测),这不足以表征其能力和局限。在本文中,我们通过大规模专家标注研究填补了这一空白,45位物理、生物和健康科学领域的专家花费469小时对2960个个体批评(每个批评针对论文的一个特定方面)进行评分,这些批评来自人类和AI生成的82篇Nature系列论文的审稿。在综合正确性、显著性和证据充分性三个维度上,由GPT-5.2驱动的审稿代理在每篇论文的最高评分人类审稿人评分上(60.0% vs. 48.2%,p = 0.009),而所有三个AI审稿(包括Gemini 3.0 Pro和Claude Opus 4.5)在每个维度上都超过了最低评分的人类审稿人。AI审稿的准确批评也更常被评分显著且证据充分,并揭示了人类未提及的26%的问题。然而,AI审稿在交叉审稿者对之间重叠远多于人类(21% vs. 3%),并且表现出16个人类不共享的弱点,如领域知识有限、缺乏多文件上下文管理能力以及对次要问题过于批判。总体而言,我们的结果表明当前AI审稿人是人类审稿人的补充,而非替代。

英文摘要

With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.

2605.20649 2026-05-21 eess.SP cs.AI cs.LG 版本更新

AMAR: Lightweight Attention-Based Multi-User Activity Recognition from Wi-Fi CSI

AMAR: 基于注意力机制的轻量级多用户活动识别从Wi-Fi CSI

Amirhossein Mohammadi, Hina Tabassum

发表机构 * Department of Electrical Engineering and Computer Science(电气工程与计算机科学系)

AI总结 本文提出了一种基于注意力机制的轻量级多用户活动识别框架AMAR,通过将活动识别转化为集合预测问题,利用Transformer架构和边缘-云混合架构,实现了在多用户环境下对并发活动的高精度识别,同时显著减少带宽使用和占用估计误差。

Comments 25 pages, 6 figures, 3 tables

详情
AI中文摘要

基于Wi-Fi的人体活动识别(HAR)已发展为一种有前景的无接触传感方法,利用无线收发器收集的信道状态信息(CSI)。尽管现有研究主要集中在单用户场景,但实际部署通常涉及多用户设置,其中并发用户的行为导致CSI模式重叠,挑战传统分类方法。为解决这一限制,本文提出了一种基于注意力机制的多用户活动识别(AMAR)框架,将HAR转化为集合预测问题。AMAR的Transformer架构利用可学习的查询嵌入作为专用活动检测器,使系统能够同时从复合CSI表示中识别多种活动。此外,为应对部署限制,AMAR采用边缘-云混合架构,其中边缘设备上的轻量级卷积网络执行初始特征提取,随后通过残差向量量化实现显著的带宽减少,同时保留活动区分信息。云组件通过基于注意力的集合匹配执行最终活动预测,使系统能够处理变化的占用水平。在教室、会议厅和空房间环境中,AMAR在平均情况下几乎将完美预测所有并发活动的速率提高了两倍,同时其F1分数达到53.4%,比最佳基准45.6%有所提高,并将占用估计误差减少了74%,同时大幅减少带宽使用。

英文摘要

Wi-Fi-based human activity recognition (HAR) has emerged as a promising approach for contactless sensing, leveraging channel state information (CSI) collected from wireless transceivers. While existing studies have primarily concentrated on single-user scenarios, real-world deployments often involve multi-user settings where concurrent users' movements induce overlapping CSI patterns that challenge conventional classification methods. To address this limitation, this paper introduces an attention-based multi-user activity recognition (AMAR) framework that formulates HAR as a set prediction problem. The transformer-based architecture in AMAR leverages learnable query embeddings acting as specialized activity detectors, enabling the simultaneous identification of multiple activities from composite CSI representations. Moreover, to address deployment constraints, AMAR is designed in an edge-cloud split architecture form where lightweight convolutional networks on edge devices perform initial feature extraction, followed by residual vector quantization that achieves substantial bandwidth reduction while preserving activity-discriminative information. The cloud component performs final activity prediction through attention-based set matching, enabling the system to handle varying occupancy levels. Across classroom, meeting-room, and empty-room environments, on average AMAR nearly doubles the rate of perfectly predicting all concurrent activities compared to the best baseline. Moreover, it achieves an $F_1$-score of 53.4% compared to 45.6% for the best benchmark, and reduces occupancy estimation error by 74%, while minimizing bandwidth substantially.

2605.20648 2026-05-21 cs.RO cs.AI 版本更新

Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition

联合学习谓词和动作使零样本技能组合成为可能

Benedict Quartey, Sebastian Castro, Eric Rosen, Wil Thomason, George Konidaris, Stefanie Tellex

发表机构 * Brown University(布朗大学) Robotics & AI Institute(机器人与人工智能研究所)

AI总结 本文提出了一种联合学习谓词和动作的技能方法,通过闭合回路的视觉-运动策略,使机器人能够在不重新训练的情况下实现零样本技能组合。

详情
AI中文摘要

学习示范(LfD)使机器人能够从专家示例中学习复杂行为,但现有方法往往无法在不重新训练的情况下泛化到新组合的已知技能。现代生成性策略仅建模动作轨迹分布,因此无法推断出所需的符号结果。我们提出技能应联合建模动作轨迹和它们诱导的符号结果。为解决这一差距,我们引入了谓词动作技能(PACTS),一种闭合回路的视觉-运动策略,将技能建模为动作和谓词信念轨迹的联合生成过程,在单一模型中产生连贯的动作-结果滚动。联合生成动作和谓词使PACTS能够学习改进动作生成和谓词分类的内部表示。此外,我们通过利用PACTS的在线谓词预测作为符号接口来序列化和监控执行,展示了学习技能的零样本组合。项目网站:https://planpacts.github.io/

英文摘要

Learning from Demonstration (LfD) enables robots to learn complex behaviors from expert examples, yet existing approaches often fail to generalize to new compositions of known skills without retraining. Modern generative policies model distributions over action trajectories alone, thus are unable to reason about the symbolic outcomes required for robust composition. We propose that skills should jointly model action trajectories and the symbolic outcomes they induce. To address this gap, we introduce Predicate Action Skills (PACTS), a class of closed-loop visuomotor policies that model skills as a joint generative process over action and predicate belief trajectories, producing coherent action-outcome rollouts within a single model. Jointly generating actions and predicates enables PACTS to learn internal representations that improve both action generation and predicate classification. Furthermore, we demonstrate zero-shot composition of learned skills via planning by leveraging online predicate predictions from PACTS as a symbolic interface for sequencing and monitoring execution. Project website: https://planpacts.github.io/

2605.20644 2026-05-21 cs.LG cs.AI cs.RO 版本更新

Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines

制造设计:一种集成制造知识的强化学习框架用于航空发动机自由形管道路由

Caicheng Wang, Zili Wang, Shuyou Zhang, Yongzhe Xiang, Zheyi Li, Liangyou Li, Jianrong Tan

发表机构 * State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University(浙江大学流体动力与机电系统国家重点实验室) Engineering Research Center for Design Engineering and Digital Twin of Zhejiang Province, Zhejiang University(浙江省设计工程与数字孪生工程研究中心) Zhejiang Changxing Heliang Intelligent Equipment Co., Ltd.(浙江长兴鹤浪智能装备有限公司)

AI总结 本文提出了一种集成制造知识的强化学习框架,用于航空发动机中自由形管道路由优化,通过将制造知识作为约束条件,提高了管道路径的可制造性和几何平滑度。

详情
AI中文摘要

制造设计在先进航空发动机开发中起着关键作用,其中复杂组件需要仔细考虑可制造性。然而,当前的管道路由实践仍然很大程度上与下游制造脱节,导致需要大量劳动和试错迭代以获得可制造的设计。为了解决这个问题,本研究提出了一种基于弗伦塞尔的管道路由优化(FPRO)框架,这是一种用于航空发动机自由形管道设计的集成制造知识的强化学习方法。FPRO将路由问题表述为弗伦塞尔框架中的边界值问题。在此框架中,管道路径由曲率和扭率剖面表示,这些剖面通过三次赫尔迈特插值生成。为了将设计与制造相结合,领域特定的制造知识被嵌入到曲率和扭率的允许范围的约束中。路径优化使用了具有随机探索和阶段引导奖励机制的近端策略优化算法。统一的映射公式然后将优化的路径转换为弯曲模具的运动轨迹,使六轴自由弯曲机能够直接制造。实验结果表明,FPRO能够持续生成无碰撞、可制造的路径,其几何剖面比基于笛卡尔的方法更平滑。它还实现了更快的收敛速度和在终端对齐、路径长度、障碍物避让和可制造性方面的优越性能,优于最先进的强化学习基线。现实验证确认了制造管道与数字设计之间几何的紧密对应关系,验证了FPRO的实践可行性。

英文摘要

Design for manufacturing plays a critical role in advanced aeroengine development, where complex components necessitate careful consideration of manufacturability. However, current practices in pipe routing remain largely decoupled from down-stream manufacturing, leading to labor-intensive, trial-and-error iterations to achieve manufacturable designs. To address this problem, this study proposes the Frenet-based pipe routing optimization (FPRO) framework, a manufacturability knowledge-integrated reinforcement learning approach for free-form pipe design in aeroengines. FPRO formulates the routing problem as a boundary value problem in the Frenet frame. In this framework, the pipe path is represented by curvature and torsion profiles, which are generated using cubic Hermite interpolation. To integrate design and manufacturing, domain-specific manufacturing knowledge is embedded as constraints on the permissible ranges of curvature and torsion. The path optimization is performed using the proximal policy optimization algorithm with stochastic exploration and a stage-guided reward mechanism. A unified mapping formulation then translates the optimized path into motion trajectories for the bending die, enabling direct fabrication on a six-axis free-bending machine. Experimental results demonstrate that FPRO consistently generates collision-free, manufacturable paths with smoother geometric profiles compared to Cartesian-based methods. It also achieves faster convergence and superior performance in terminal alignment, path length, obstacle avoidance, and manufacturability compared to state-of-the-art reinforcement learning baselines. Real-world validation confirms the close geometric correspondence between the manufactured pipe and its digital design, validating the practical feasibility of FPRO.

2605.20643 2026-05-21 cs.LG cs.AI cs.CL 版本更新

AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals

AVSD:通过平衡共识和教师特定的特权信号实现自适应视图自蒸馏

Duy Nguyen, Hanqi Xiao, Archiki Prasad, Zaid Khan, Anirban Das, Austin Zhang, Sambit Sahu, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal

发表机构 * UNC Chapel Hill(北卡罗来纳大学教堂山分校) Capital One(Capital One公司) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出AVSD,一种通过平衡共识和教师特定的特权信号来实现自适应视图自蒸馏的方法,以解决自蒸馏中教师和学生信息不对称和特权信息选择的问题。

Comments Code: https://github.com/duykhuongnguyen/AVSD

详情
AI中文摘要

自蒸馏使语言模型能够通过使用同一模型作为学生和教师来从自身轨迹中学习,其中教师基于学生无法访问的特权信息进行条件。此类信息可以是不同种类或视图,如解决方案、演示、反馈或最终答案。这种设置可以在不依赖外部模型的情况下提供密集的token级反馈,但会产生根本性的不对称性:教师可能依赖于视图特定的信息,而学生在推理时无法访问。此外,最佳的特权信息类型通常是任务依赖的,使得选择单一教师视图变得困难。在本工作中,我们通过引入AVSD(自适应视图自蒸馏),一种具有多种特权信息视图的自蒸馏新方法,来同时解决这两个挑战。AVSD通过分离稳定的跨视图共识和视图特定的残差信号来重建token级监督。AVSD识别出跨视图共享的共识信号,提供可靠的更新方向,然后在两者一致且比例适当的情况下,选择性地添加视图特定的残差信号以调整更新幅度。在数学竞赛基准(AIME24、AIME25和HMMT25)上的实验表明,AVSD在Qwen3-8B和Qwen3-4B上分别比单视图自蒸馏基线和GRPO平均Avg@8提升了3.1%和2.2%。此外,在代码生成基准(Codeforces、LiveCodeBench v6)上使用Qwen3-8B时,AVSD在平均上比单视图自蒸馏基线高出2.4%。

英文摘要

Self-distillation enables language models to learn on-policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student. Such information can come in different types or views, such as solutions, demonstrations, feedback, or final answers. This setup provides dense token-level feedback without relying on a separate external model, but creates a fundamental asymmetry: the teacher may rely on view-specific information that the student cannot access at inference time. Moreover, the best type of privileged information is often task-dependent, making it difficult to choose a single teacher view. In this work, we address both these challenges jointly by introducing AVSD (Adaptive-View Self-Distillation), a novel method of self-distillation with multiple privileged-information views, which reconstructs token-level supervision by separating stable cross-view consensus from view-specific residual signals. AVSD identifies the consensus signal shared across views, which provides a reliable update direction, and then selectively adds the view-specific residual signal to adjust the update magnitude when it both aligns with the consensus direction and remains proportionate to the consensus signal. Experiments on math competition benchmarks (AIME24, AIME25, and HMMT25) show that AVSD consistently outperforms both single-view self-distillation baselines and GRPO, achieving average Avg@8 gains of 3.1% and 2.2% over the strongest baselines on Qwen3-8B and Qwen3-4B, respectively. Moreover, on code-generation benchmarks (Codeforces, LiveCodeBench v6) using Qwen3-8B, AVSD outperforms the single-view self-distillation baseline by 2.4% on average.

2605.20641 2026-05-21 cs.CR cs.AI cs.LG 版本更新

Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs

可信的权重,危险的优化?针对大语言模型的优化触发后门攻击

Yifei Wang, Tianlin Li, Xiaohan Zhang, Yida Yang, Xiaoyu Zhang, Li Pan

发表机构 * Shanghai Jiao Tong University(上海交通大学) Beihang University(北京航空航天大学) Tongji University(同济大学) Nanyang Technological University(南洋理工大学)

AI总结 本文提出了一种利用编译优化过程植入隐蔽后门的攻击方法,通过两种互补策略在无需修改编译器或硬件的情况下,实现对大语言模型的后门攻击,并展示了其在多个开源大语言模型上的高成功率。

Comments 20 pages, 3 figures

详情
AI中文摘要

推理优化是部署大规模语言模型(LLMs)的关键技术。编译是LLMs中最广泛采用的优化技术。尽管编译假设原始图与编译图之间具有语义等价性,但我们首先揭示其数值副作用可以被恶意利用以在LLMs中植入隐蔽的后门。我们提出了一种包含两种互补策略的统一优化触发攻击框架。在不修改编译器或硬件的情况下,一种策略仅在模型编译时翻转特定输入的预测,而另一种策略使用一个通用触发器,在未编译执行时保持静默,但在应用编译优化时劫持任意输入。这两种攻击都能绕过在没有编译时运行的标准安全评估。我们实证表明,这些优化触发后门在四个主流开源LLMs和四个任务上实现了平均90%的攻击成功率,同时在所有设置下保持几乎100%的干净准确性。我们的发现揭示了优化与安全在LLM部署流程交集处的新攻击面,并探讨了减轻此威胁的实用防御方法。

英文摘要

Inference optimization is a vital technique for deploying LLMs at scale. Compilation is the most widely adopted optimization technique for LLMs. While it assumes semantic equivalence between the original and compiled graphs, we first uncover its numerical side effects can be maliciously exploited to implant stealthy backdoors in LLMs. We propose a unified optimization-triggered attack framework comprising two complementary strategies. Without any modification to the compiler or hardware, one strategy flips predictions for specific inputs only when the model is compiled, while the other uses a universal trigger that remains dormant under uncompiled execution but hijacks arbitrary inputs once compilation optimization is applied. Both attacks bypass standard safety evaluations run without compilation. We empirically demonstrate that these optimization-triggered backdoors achieve attack success rates averaging 90% across four mainstream open-source LLMs and four tasks, while clean accuracy is preserved at nearly 100% under all settings. Our findings reveal a novel attack surface at the intersection of optimization and security in the LLM deployment pipeline, and we investigate practical defenses to mitigate this threat.

2605.20640 2026-05-21 cs.CV cs.AI 版本更新

Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics

帕累托优化的肖像生成:用于对齐、真实性和美学的视觉对齐文本监督

Yunlong Wang, Jinjin Shi, Wenbin Gao, Xuran Xu, Runyu Shi, Ying Huang

AI总结 本文提出了一种多模态扩散变换器(MM-DiT)的特征监督方法,通过引入轻量级的跨模态对齐机制,隐式提取多粒度的视觉对齐文本表示,以提升文本-图像对齐、真实性和美学质量,从而在Pareto前沿上实现协同改进。

详情
AI中文摘要

文本到图像扩散模型在生成人类肖像时往往面临严重的三重困境:文本-图像对齐、逼真度和人类感知的美学之间相互抑制。监督微调(SFT)是一种有效提升图像生成逼真度的方法,但通常会导致过度拟合训练数据集、破坏预训练图像先验并降低对齐或美学质量。为突破这一瓶颈,我们提出了一种多模态扩散变换器(MM-DiT)的特征监督范式。具体而言,我们引入了一种轻量级的跨模态对齐机制,隐式地从SigLIP 2中提取多粒度的视觉对齐文本表示,并在训练阶段将监督应用于MM-DiT的图像分支,且无额外的推理开销。我们的方法在保持基模型原有泛化能力的同时,注入了视觉对齐的文本指导,避免了SFT导致的退化。此外,我们的方法直接从预训练的视觉基础模型中挖掘隐含的多粒度美学信号,以优化人类感知的美学。在MM-DiT上的广泛实验表明,我们的方法推动了Pareto前沿,并在文本-图像对齐、逼真度和人类感知的美学方面实现了协同改进。

英文摘要

Text-to-image diffusion models often face a severe trilemma in human portrait generation: text-image alignment, photorealism, and human-perceived aesthetics inherently inhibit one another. Supervised Fine-Tuning (SFT) is an effective method for enhancing the photorealism of image generation. However, it often leads to overfitting to the training dataset, corrupts pre-trained image priors, and degrades alignment or aesthetics. To break this bottleneck, we propose a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT). Specifically, we introduce a lightweight cross-modal alignment mechanism that implicitly extracts multi-granularity vision-aligned text representations from SigLIP 2 and applies supervision to the image branch of MM-DiT during the training stage, with zero extra inference overhead. Our method injects vision-aligned text guidance while preserving the base model's original generalization, avoiding degradation caused by SFT. Furthermore, our method directly mines implicit multi-granularity aesthetic signals from pre-trained vision foundation models to optimize human-perceived aesthetics. Extensive experiments on MM-DiTs show that our method pushes the Pareto frontier and achieves synergistic improvements across text-image alignment, photorealism, and human-perceived aesthetics.

2605.20630 2026-05-21 cs.AI 版本更新

Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines

评估代理计划-执行管道中的时间语义缓存和工作流优化

Alimurtaza Mustafa Merchant, Krish Veera, Sajal Kumar Goyla, Shambhawi Bhure, Dhaval Patel, Kaoutar El Maghraoui

发表机构 * Columbia University(哥伦比亚大学) IBM IBM Research(IBM研究院)

AI总结 本文研究了在代理计划-执行管道中时间语义缓存和工作流优化的问题,提出两种互补的优化层以提高效率,并展示了其在工业资产操作工作流中的应用效果。

Comments 13 pages, 8 figures, 3 appendices

详情
AI中文摘要

工业资产操作工作流对延迟敏感,因为单个用户查询可能需要协调传感器数据、工作订单、故障模式、预测工具和领域特定代理。我们在此问题上评估了AssetOpsBench (AOB),这是一个工业代理基准,其计划-执行管道暴露了工具发现、LLM规划、MCP工具执行和最终总结的重复开销。现有的LLM缓存技术如KV缓存重用和基于嵌入的语义缓存是为聊天机器人服务设计的,并在输出有效性依赖于时间、资产或传感器参数时失效。我们为AOB计划-执行管道提出了两个互补的优化层:一个时间语义缓存和一组结合磁盘支持的工具发现缓存和依赖感知并行步骤执行的MCP工作流优化。MCP工作流优化对应于1.67倍的速度提升,将中位端到端延迟减少了约40.0%,而时间缓存基准在缓存命中时实现了30.6倍的速度提升。除了速度提升外,我们的结果揭示了纯语义缓存在参数丰富的工业查询中的具体失败模式,提供了对MCP支持的代理基准中缓存选择如何与评估正确性相互作用的批判性分析。

英文摘要

Industrial asset operations workflows are latency-sensitive because a single user query may require coordination over sensor data, work orders, failure modes, forecasting tools, and domain-specific agents. We evaluate this problem on AssetOpsBench (AOB), an industrial agent benchmark whose plan-execute pipeline exposes repeated overhead from tool discovery, LLM planning, MCP tool execution, and final summarization. Existing LLM caching techniques such as KV-cache reuse and embedding-based semantic caching were designed for chatbot serving and break down when output validity depends on time, asset, or sensor parameters. We propose two complementary optimization layers for AOB plan-execute pipelines: a temporal semantic cache and a set of MCP workflow optimizations combining disk-backed tool-discovery caching and dependency-aware parallel step execution. MCP workflow optimizations corresponded to a 1.67x speedup and reduced median end-to-end latency by about 40.0% while the temporal-cache benchmark achieved a median of 30.6x speedup on cache hits. Beyond the speedup, our results expose a concrete failure mode of pure semantic caching for parameter-rich industrial queries, providing a critical analysis of how caching choices interact with evaluation correctness in MCP-backed agent benchmarks.

2605.20626 2026-05-21 cs.CL cs.AI cs.CV 版本更新

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

基于检索的长上下文翻译用于文化图像描述:佛罗里达大学Gators参加2026年美洲自然语言处理共享任务的提交

Aashish Dhawan, Christopher Driggers-Ellis, Dzmitry Kasinets, Daisy Zhe Wang, Christan Grant

发表机构 * University of Florida(佛罗里达大学)

AI总结 本文提出了一种基于检索的长上下文翻译方法,用于文化图像描述,通过两阶段流程生成西班牙语中间描述,再利用检索增强的多示例提示生成目标语言描述,显著提升了Bribri、Guaraní和Orizaba Nahuatl语言的描述生成性能,并在共享任务中获得冠军。

详情
AI中文摘要

我们提出了佛罗里达大学Gators团队对2026年美洲自然语言处理共享任务在原住民语言文化图像描述任务中的提交。我们的两阶段流程使用Qwen2.5-VL生成西班牙语中间描述,然后利用检索增强的多示例提示与Gemini 2.5 Flash生成目标语言描述。我们在开发集评估中分别实现了Bribri、Guaraní和Orizaba Nahuatl描述生成性能的164.1%、131.7%和122.6%的提升,并在测试集评估中保持Bribri和Orizaba Nahuatl语言的>150%提升。我们发现检索高度依赖语言,仅对大规模、领域内语料有效,并且合成数据增强对开发集Guaraní性能提升贡献了约28 chrF++。我们的提交在共享任务中获得冠军,位列五份最终提交中的第二名。

英文摘要

We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline for Bribri, Guaraní, and Orizaba Nahuatl captioning, respectively, in our dev set evaluation and maintain >150% improvements for the Bribri and Orizaba Nahuatl languages in the test set evaluation. We find retrieval is highly language-dependent, beneficial only for large, in-domain corpora, and that synthetic data augmentation accounts for around 28 chrF++ of the dev set Guaraní performance gain. Our submission is the overall winner of the shared task, placing second out of five finalist submissions in human evaluations of target-language captions.

2605.20624 2026-05-21 cs.CV cs.AI cs.LG 版本更新

Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

用自回归扩散模型加速视频逆问题求解器

Taesung Kwon, Jonghyun Park, Hyungjin Chung, Jong Chul Ye

发表机构 * KAIST(韩国科学技术院) EverEx

AI总结 本文提出自回归视频逆问题求解器(AVIS),通过自回归扩散模型实现流式视频恢复,显著降低初始延迟并提高吞吐量,同时保持高质量的恢复效果,并进一步提出加速变体AVIS Flash,实现更高的吞吐量和更优的效率-性能权衡,为实时部署铺平道路。

Comments Project page is available here: https://avis-project.github.io/

详情
AI中文摘要

扩散模型为零样本视频逆问题提供了强大的先验知识,但其实时部署受到两个效率问题的阻碍:由整体视频恢复引起的高初始延迟,以及由于在像素空间中多次VAE传递以强制测量一致性导致的低吞吐量。为克服这些限制,我们提出了自回归视频逆问题求解器(AVIS)。AVIS框架利用自回归视频扩散模型以流式方式恢复视频,自然地消除了延迟瓶颈。具体而言,AVIS通过测量一致性的估计初始化反向扩散,减少了所需的采样步骤。与领先的非自回归求解器相比,AVIS将初始延迟从114秒减少到4秒,并将吞吐量从0.71提高到1.18 FPS,同时实现更优的恢复质量。我们进一步引入了一个高度加速的变体,称为AVIS Flash,该变体仅在第一个片段上强制测量一致性。AVIS Flash在单个RTX 4090 GPU上将吞吐量提高到5.91 FPS,同时保持竞争性的性能,并实现有利的效率-性能权衡,为实时部署铺平道路。

英文摘要

Diffusion models provide powerful priors for zero-shot video inverse problems, but their real-time deployment is hindered by two inefficiencies: high initial latency caused by holistic video restoration, and low throughput resulting from multiple VAE passes to enforce measurement consistency in pixel space. To overcome these limitations, we propose Autoregressive Video Inverse problem Solver (AVIS). The AVIS framework leverages autoregressive video diffusion models to restore videos in a streaming manner, naturally eliminating latency bottlenecks. Specifically, AVIS initializes reverse diffusion with a measurement-consistent estimate, reducing the required sampling steps. Compared to leading non-autoregressive solvers, AVIS drastically reduces initial latency from 114s to 4s and increases throughput from 0.71 to 1.18 FPS while achieving superior restoration quality. We further introduce a highly accelerated variant, dubbed AVIS Flash, that enforces measurement consistency solely on the first chunk. AVIS Flash substantially boosts throughput to 5.91 FPS on a single RTX 4090 GPU while maintaining competitive performance and achieving a favorable efficiency-performance trade-off, paving the way toward real-time deployment.

2605.20623 2026-05-21 math.AP cs.AI 版本更新

Lower Bounds for Advection-Diffusion Equations: An Exploration with AI-Generated Proofs

关于对流-扩散方程的下界:与AI生成证明的探索

Chenyang An, Xiaoqian Xu

发表机构 * University of California, San Diego, La Jolla, United States(加州大学圣迭戈分校) Duke Kunshan University, Zu Chongzhi Center, No. 8 Duke Avenue, Kunshan, Jiangsu Province, 215316, P.R. China(杜克昆山大学,祖冲之中心)

AI总结 本文通过AI生成的证明方法,建立了对流-扩散方程在三种不同情形下的显式下界,包括无粘性剪切的多项式$\dot H^{-1}$界、扩散剪切的混合尺度正下界以及快速振荡时间周期性流动的指数$L^2$界。

Comments 63 pages

详情
AI中文摘要

我们建立了对流-扩散方程在三种情形下的显式下界:对于无粘性剪切,$u\in L^\infty_t W^{1,1}_y$的多项式$\dot H^{-1}$界;对于扩散剪切,混合尺度的均匀正下界;以及对于快速振荡时间周期性流动的指数$L^2$界。所有常数都显式地依赖于数据。证明完全由多智能体数学证明系统QED生成,无需专家人类干预,作为测试AI生成严谨数学能力的检验。

英文摘要

We establish explicit lower bounds for advection-diffusion equations in three settings: a polynomial $\dot H^{-1}$ bound for inviscid shears with $u\in L^\infty_t W^{1,1}_y$, a uniform positive lower bound on the mixing scale for diffusive shears, and an exponential $L^2$ bound for rapidly oscillating time-periodic flows. All constants are explicit in the data. The proofs were generated entirely by a multi-agent math proving system, QED, without expert human intervention, serving as a test of AI's capability to produce rigorous mathematics.

2605.20618 2026-05-21 cs.AI 版本更新

COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

COAgents: 多智能体框架用于学习和导航路由问题搜索空间

Oleksandr Yakovenko, Mahdi Mostajabdaveh, Cheikh Ahmed, Abdullah Ali Sivas, Xiaorui Li, Zirui Zhou, Mao Kun

发表机构 * Huawei Technologies Canada(华为技术加拿大公司) Huawei Technologies(华为技术)

AI总结 本文提出COAgents多智能体框架,通过将搜索过程建模为图来解决车辆路径问题的计算复杂性问题,通过训练不同智能体来指导强化和探索,从而在CVRP和VRPTW基准测试中取得优异成绩。

Comments Accepted at LION 2026, The Learning and Intelligent Optimization Conference

详情
AI中文摘要

尽管车辆路径问题(VRP)对许多现实系统至关重要,但其计算复杂性使其在大规模情况下难以处理。传统启发式方法依赖于手工制定的规则进行局部改进和偶尔的跳跃以逃避局部极小值,但往往难以在多样化的实例上泛化。我们引入COAgents,一种协作多智能体框架,将搜索过程建模为图:节点代表解决方案,边对应于局部细化或大型扰动以进行多样化(即跳跃)。在搜索过程中动态构建部分搜索图(PSG),使COAgents能够训练节点选择代理和移动选择代理以指导强化,并触发跳跃代理以探索新区域。与端到端学习方法不同,COAgents将问题无关的搜索控制与紧凑的领域特定编码分离,从而在跨任务中提高适应性。在CVRP和VRPTW基准测试中进行了广泛的实验,结果表明COAgents在CVRP上与多个学习搜索基线竞争,并在更具有挑战性的VRPTW实例上设定了新的学习方法状态。在N=100时,COAgents将与最强神经求解器(POMO)的最佳解差距缩小了14%,在N=50时缩小了44%。

英文摘要

Although Vehicle Routing Problems (VRP) are essential to many real-world systems, they remain computationally intractable at scale due to their combinatorial complexity. Traditional heuristics rely on handcrafted rules for local improvements and occasional \textit{jumps} to escape local minima, but often struggle to generalize across diverse instances. We introduce \textbf{COAgents}, a cooperative multi-agent framework that models the search process as a graph: nodes represent solutions, and edges correspond to either local refinements or large perturbations for diversification (i.e., jumps). A \textit{Partial Search Graph} (PSG) is dynamically constructed during search, enabling COAgents to train a Node Selection Agent and a Move Selection Agent to guide intensification, and a Jump Agent to trigger well-timed explorations of new regions. Unlike end-to-end learning approaches, COAgents cleanly separates problem-agnostic search control from compact domain-specific encoding, facilitating adaptability across tasks. Extensive experiments on the CVRP and VRPTW benchmarks show that COAgents remains competitive with several learn-to-search baselines on CVRP and sets a new state of the art among learning-based methods on the more challenging VRPTW instances, reducing the gap to the best-known solutions by 14\% at $N\!=\!100$ and 44\% at $N\!=\!50$ relative to the strongest neural solver (POMO), and by 21\% and 40\% respectively relative to ALNS. Code is available at https://github.com/mahdims/COAgents.

2605.20610 2026-05-21 cs.CV cs.AI 版本更新

Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

超越路由:表征专家调节与表示在视觉混合专家中的刻画

Gene Tangtartharakul, Katherine R. Storrs

发表机构 * School of Psychology University of Auckland(心理学系奥克兰大学)

AI总结 本文研究了视觉混合专家模型中专家调节与表示的特性,通过对比学习训练稀疏门控卷积MoE模型,并利用视觉神经科学工具分析专家的专业化,发现动植物区分主导专家划分,并揭示了专家在更广泛的连续视觉和语义维度上的调节。

Comments 21 Pages, 6 Main Figures, 1 Table

详情
AI中文摘要

混合专家(MoE)模型通常通过分析哪些类别被路由到哪些专家来解释。然而,仅靠路由并不能揭示每个专家实际编码的内容。我们训练了稀疏门控卷积MoE模型,并在自然图像上使用对比目标进行训练,利用视觉神经科学工具来表征专家的专业化。从门控级别扩展到专家级别分析,我们测量了每个专家的类别分离度,并利用最吸引人的输入来分析每个专家的调节。从类别级别扩展到特征级别解释,我们通过从人类行为判断数据集(THINGS)中衍生出的语义维度来解释调节。最后,我们使用调节和表征相似性分析来评估在独立初始化下专家分配的稳定性。我们发现,动植物区分主导专家划分,从门控到专家读取都明显,并在独立训练模型中保持稳定。尽管路由统计数据表明相对稀疏的、类别的偏好,但专家分析揭示了更广泛的对连续视觉和语义维度的调节,超出了类别边界。尽管特征调节不同,专家之间表现出相似的类别分离度,这表明超越类别级别分析的解释优势。这些结果表明,视觉MoE中的专家专业化远超类别路由,并通过探测细粒度专家级别调节和表征结构来更好地理解。

英文摘要

Mixture-of-Experts (MoE) models are often interpreted by analysing which categories are routed to which experts. However, routing alone does not reveal what each expert actually encodes. We train sparsely-gated convolutional MoE models with a contrastive objective on natural images and characterise expert specialisation using tools from visual neuroscience. Extending from gating-level to expert-level analyses, we measure per-expert category separability, and per-expert tuning using the most exciting inputs. Extending from category-level to feature-level explanations, we interpret tuning via semantic dimensions derived from a dataset of human behavioural judgements (THINGS). Finally, we use tuning and representational similarity analysis to assess the stability of expertise-allocation across independent initialisations. We find that an animate-inanimate distinction dominates expert partitioning, apparent from gating through to expert readout, and is stable across independently trained models. Although routing statistics suggest relatively sparse, categorical preferences, expert analyses reveal broader tuning to continuous visual and semantic dimensions that extend beyond category boundaries. Experts exhibit similar category-separability to one another, despite distinct feature tuning, demonstrating the explanatory benefits of moving beyond category-level analyses. Together, these results show that expert specialisation in vision MoEs extends well beyond category routing and is better understood by probing fine-grained expert-level tuning and representational structure.

2605.20608 2026-05-21 cs.AI cs.NI 版本更新

From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)

从自动化到自主:分层代理原生网络架构(HANA)

Binghan Wu, Shoufeng Wang, Yunxin Liu, Ya-Qin Zhang, Joseph Sifakis, Ye Ouyang

发表机构 * AsiaInfo Technologies Limited(亚洲信息科技有限公司) Institute for AI Industry Research (AIR)(人工智能产业研究院) Tsinghua University(清华大学) Verimag

AI总结 本文提出了一种分层多代理参考架构,旨在实现Level 4/5自主网络,通过引入代理自意识,统一战略规划与操作韧性,验证了其在5G核心环境中的有效性。

Comments This manuscript has been accepted by IEEE Networking Letters

Journal ref B. Wu, S. Wang, Y. Liu, Y. -Q. Zhang, J. Sifakis and Y. Ouyang, "From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)," in IEEE Networking Letters, 2026

详情
AI中文摘要

实现Level 4/5自主网络(AN)需要从静态自动化转向代理原生智能。当前的操作依赖于刚性的脚本,缺乏处理非正常条件的认知能力。为此,本文提出了一种分层多代理参考架构,该架构包含一个双驱动协调器,协调专门的执行代理,并通过共享的公共内存实现统一的领域知识。关键创新是将代理自意识整合进来,使系统能够协调 deliberative战略治理与 reflexive 故障恢复。我们将在5G核心环境中实例化并验证该架构。案例研究表明,该系统在拥堵条件下仍能维持关键吞吐量,并将平均修复时间(MTTR)减少了86%,证实了其在统一战略规划与操作韧性方面的有效性。

英文摘要

Realizing Level 4/5 Autonomous Networks (AN) demands a shift from static automation to agent-native intelligence. Current operations, reliant on rigid scripts, lack the cognitive agency to handle off-nominal conditions. To address this, this letter proposes a hierarchical multi-agent reference architecture enabling high-level autonomy. The framework features a Dual-Driven Orchestrator that coordinates specialized Executive Agents, supported by a shared Public Memory for unified domain knowledge. A key innovation is the integration of agent self-awareness, which empowers the system to harmonize deliberative strategic governance with reflexive fault recovery. We instantiate and validate this architecture within a 5G Core environment. Case studies demonstrate that the system sustains critical throughput under congestion and reduces Mean Time to Repair (MTTR) by 86%, confirming its efficacy in unifying strategic planning with operational resilience.

2605.20602 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

自我训练不使语言扁平化——它重构了它:表面标记增强而深层语法消失

Ming Liu

发表机构 * Amazon(亚马逊)

AI总结 该研究通过实验发现自我训练过程并非使语言扁平化,而是重构了语言结构,表面标记增强而深层语法结构消失,并提出了结构性深度假说来解释这一现象。

Comments 19 pages (14 main + 5 appendix), 8 figures, 3 tables

详情
AI中文摘要

连续对语言模型自身输出进行自我训练通常被描述为一种扁平化过程:多样性下降,分布变窄,文本变得“更像自己”。我们提供了证据表明这种描述是不完整的。在对五个模型(GPT-2 124M,Pythia-410M,Pythia-1.4B,OPT-1.3B,Pythia-2.8B)进行十一代自我训练的过程中,语言并非均匀扁平化——它被重构了。表面标记(连贯词、缓和词、破折号)上升,而中层和深层语法结构(疑问句、插入语、被动语态、条件句)崩溃。我们正式将这种不对称崩溃定义为结构性深度假说(SDH):语言特征的每一代衰减率主要由其结构性深度——它所需嵌套语法依赖的数量——决定,其次才由其生成零次输出频率决定。通过汇总五个模型中三个架构家族的17个特征面板(N=85),汇总的斯皮尔曼相关系数为rho=0.540(p < 10^{-6};簇Bootstrap 95% CI [0.434, 0.634]),而频率是一个显著较弱的预测因子(rho=0.225)。一个匹配的人类文本微调对照实验得到rho=0.039(p=0.88),证实了该梯度是特定于自我训练的。我们进一步记录了一个表面复杂性悖论:总体复杂性代理(依赖树深度、TTR、词长)在底层从句结构消失时均上升,这对训练数据筛选和LLM文本检测有直接影响。

英文摘要

Successive self-training on a language model's own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes "more like itself." We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly -- it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature is predicted primarily by its structural depth -- the number of nested syntactic dependencies it requires -- and only secondarily by its generation-zero output frequency. Pooling 17-feature panels from five models spanning three architecture families (N=85), the pooled Spearman correlation is rho=0.540 (p < 10^{-6}; cluster-bootstrap 95% CI [0.434, 0.634]), while frequency is a substantially weaker predictor (rho=0.225). A matched human-text fine-tuning control yields rho=0.039 (p=0.88), confirming the gradient is self-training-specific. We further document a Superficial Complexity Paradox: aggregate complexity proxies (dep-tree depth, TTR, word length) all rise as the underlying clause structure dies, with direct implications for training-data curation and LLM-text detection.

2605.20577 2026-05-21 cs.AI cs.LG 版本更新

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Mahjax: 一种用于在JAX中进行强化学习的GPU加速麻将模拟器

Soichiro Nishimori, Shinri Okano, Keigo Habara, Sotetsu Koyamada, Eason Yu, Masashi Sugiyama

发表机构 * The University of Tokyo(东京大学) RIKEN AIP(日本理化学研究院AIP) Nara Institute of Science and Technology(奈良科学技術大學) Kobe University(Kobe大学) Kyoto University(京都大学) ATR The University of Sydney(悉尼大学)

AI总结 本文提出Mahjax,一种基于JAX实现的麻将环境,利用GPU加速大规模并行化,以解决麻将游戏中的高维状态空间和随机性问题,为强化学习提供高效的训练平台。

详情
AI中文摘要

Riichi Mahjong是一种多玩家、信息不完全的游戏,具有随机性和高维状态空间的特性。这些属性构成了强化学习中复杂决策问题的独特挑战。尽管先前研究主要依赖于从人类游戏日志中监督学习来预训练策略,但能够从头开始学习(tabula rasa)的算法在通用性上具有更大潜力,如AlphaZero所示。为促进此类研究,我们引入了Mahjax,一个完全向量化实现的Riichi Mahjong环境,用于在图形处理器(GPU)上实现大规模的回放并行化。我们还提供了一个高质量的可视化工具,以简化调试和与训练代理的交互。实验结果表明,Mahjax在八块NVIDIA A100 GPU上分别实现了高达200万和100万步每秒的吞吐量。此外,我们通过展示代理能够有效训练以提高其相对于基线策略的排名,验证了该环境在强化学习中的实用性。

英文摘要

Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of challenges that mirror complex real-world decision-making problems in reinforcement learning. While prior research has heavily relied on supervised learning from human play logs to pre-train the policy, algorithms capable of learning \textit{tabula rasa} (from scratch) offer greater potential for general applicability, as evidenced by the AlphaZero lineage. To facilitate such research, we introduce \textbf{Mahjax}, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on Graphics Processing Units (GPUs). We also provide a high-quality visualization tool to streamline debugging and interaction with trained agents. Experimental results demonstrate that Mahjax achieves throughputs of up to \textbf{2 million} and \textbf{1 million steps per second} on eight NVIDIA A100 GPUs under the no-red and red rules, respectively. Furthermore, we validate the environment's utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies.

2605.20563 2026-05-21 cs.MA cs.AI cs.CL cs.LG cs.SE 版本更新

Multi-agent Collaboration with State Management

具有状态管理的多智能体协作

Mengyang Liu, Taozhi Chen, Zhenhua Xu, Xue Jiang, Yihong Dong

发表机构 * Shanghai Jiaotong University(上海交通大学) Cortices AI Emory University(埃默里大学) Peking University(北京大学)

AI总结 本文提出STORM,一种面向多智能体协作的状态管理方法,通过在共享工作区中调解智能体的交互,确保每个智能体在一致的代码库视图上操作,并在写入时检测和解决冲突。STORM在多个LLM上优于基于git-worktree的多智能体基线,且在成本效率上具有竞争力,表明显式状态管理比工作区隔离更有效。

详情
AI中文摘要

近年来,多智能体系统在解决复杂任务方面展现出巨大潜力。然而,当多个智能体同时编辑共享代码库时,他们的更改可能会产生冲突,不一致的视图会导致集成失败。现有的多智能体系统通过工作区隔离(例如每个智能体一个git工作树)来解决这个问题,但这种方法将冲突解决推迟到事后合并步骤,恢复成本较高。在本文中,我们提出了STORM,即面向多智能体协作的状态管理(STate-ORiented Management)。具体而言,STORM通过调解智能体与共享工作区的交互来管理智能体状态,确保每个智能体都在代码库的一致视图上操作,并在写入时检测和解决冲突。我们评估了STORM在Commit0和PaperBench多个LLM上的表现。STORM在Commit0-Lite上比基于git-worktree的多智能体基线高出18.7%,在PaperBench上高出1.4%,同时在成本效率上具有竞争力或更好。结合单智能体运行,STORM在两个基准测试中分别达到87.6和78.2的最高分数,表明显式状态管理比工作区隔离更有效作为多智能体协作的基础。STORM也可以无缝地集成到任何多智能体系统中。

英文摘要

Recent advances in multi-agent systems have shown great potential for solving complex tasks. However, when multiple agents edit a shared codebase concurrently, their changes can silently conflict and inconsistent views lead to integration failures. Existing multi-agent systems address this through workspace isolation (e.g., one git worktree per agent), but this defers conflict resolution to a post-hoc merge step where recovery is expensive. In this paper, we propose STORM, i.e., STate-ORiented Management for multi-agent collaboration. Specifically, STORM manages agent states by mediating their interactions with the shared workspace, ensuring that each agent operates on a consistent view of the codebase and that conflicting edits are detected and resolved at write time. We evaluate STORM on Commit0 and PaperBench across multiple LLMs. STORM outperforms the git-worktree-based multi-agent baseline by +18.7 on Commit0-Lite and +1.4 on PaperBench, while achieving comparable or better cost efficiency. Combined with single-agent runs, STORM reaches highest scores of 87.6 and 78.2 on the two benchmarks respectively, suggesting that explicit state management is a more effective foundation for multi-agent collaboration than workspace isolation. STORM can also be plugged into any multi-agent system seamlessly.

2605.20555 2026-05-21 cs.LG cs.AI 版本更新

Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs

通过logit平均在LLMs后训练中补充强化学习

Xingwei Gan, Ying Zhu

发表机构 * UC San Diego(加州大学圣迭戈分校)

AI总结 本文提出一种在LLMs后训练中通过logit平均补充强化学习的方法,将该方法整合到Group Relative Policy Optimization (GRPO)中,无需使用KL正则化或critic,通过logit平均结构将可训练策略与参考策略耦合,以利用可训练策略的推理能力并保持SFT的格式优势。

详情
AI中文摘要

我们介绍了一种新颖的方法,该方法对冻结的参考策略(例如SFT)和可训练策略的logits进行平均,并将该方法整合到Group Relative Policy Optimization (GRPO)中。与Reinforcement Learning with Verifiable Rewards (RLVR)方法不同,我们的方法不涉及Kullback Leibler (KL)正则化或critic;可训练策略和参考锚点通过logit平均结构耦合,以利用可训练策略的推理能力,同时保持SFT的格式优势。我们的方法在MATH、cn-k12和MMLU上进行了评估,结果表明其准确率高于或至少与传统的KL正则化GRPO相当。

英文摘要

We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimization (GRPO). In contrast to Reinforcement Learning with Verifiable Rewards (RLVR) methods, our proposal does not involve a Kullback Leibler (KL) regularization or critic; the trainable policy and the reference anchor are coupled through the logit averaging structure to leverage the reasoning expertise of the trainable policy while maintaining the formatting advantage of SFT. Our method is evaluated on MATH, cn-k12, and MMLU, and the results show a higher accuracy or at least comparable accuracy relative to the canonical KL-regularized GRPO.

2605.20554 2026-05-21 cs.AI cs.HC cs.SI 版本更新

Personality Engineering with AI Agents: A New Methodology for Negotiation Research

利用AI代理的人格工程:谈判研究的新方法论

Michelle A. Vaccaro, Jared R. Curhan

发表机构 * MIT Institute for Data, Systems, and Society(MIT数据、系统与社会研究所) MIT Sloan School of Management(MIT斯隆管理学院)

AI总结 本文提出了一种利用AI代理进行谈判者人格参数化、操纵和评估的方法,通过人际圆周坐标系中的温暖和支配两个核心维度,为谈判理论的严格测试和AI谈判代理的人格设计提供了一种新方法。

详情
AI中文摘要

根据经典谈判理论,人们在谈判中的成功取决于他们平衡竞争需求的能力--共情与主张,表现出对他人的关心和对自己的关心,对人温和而对问题强硬。然而,人们难以管理这些张力,因此研究人员缺乏在受控条件下严格测试该领域规定的能力。AI代理没有相同的限制,其精确性、 repertoire、一致性以及可扩展性使能够贡献于谈判理论的新一类实验成为可能。在本文中,我们介绍了一种称为人格工程的方法论,该方法利用AI代理来精确参数化、操纵和评估谈判者的人格。我们提议使用人际圆周--以及其两个核心维度温暖和支配--作为该领域的基础坐标系统。这种方法不仅提供了一种严格测试经典谈判理论的方法,还为设计AI谈判代理的人格提供了一种实用指南。

英文摘要

According to canonical negotiation theory, people's success in a negotiation depends on how well they balance competing demands--empathizing and asserting, demonstrating concern for other and concern for self, being soft on the people and hard on the problem. Yet people struggle to manage these tensions, so researchers have lacked the ability to rigorously test the field's prescriptions under controlled conditions. AI agents do not face the same limitations, and their precision, repertoire, consistency, and scalability enable a new class of experiments to contribute to negotiation theory. In this article, we introduce personality engineering: a methodology that uses AI agents to precisely parameterize, manipulate, and evaluate negotiator personality. We propose using the interpersonal circumplex--and its two core dimensions of warmth and dominance--as a foundational coordinate system for the field. This approach offers both a rigorous methodology for testing classic negotiation theories and a practical guide for designing the personalities of AI negotiation agents.

2605.20551 2026-05-21 cs.CV cs.AI cs.RO 版本更新

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

更快或更强:通过加权聚合和标记剪枝实现灵活的视觉位置识别

Zichao Zeng, June Moh Goo, Junwei Zheng, Weijia Fan, Jiaming Zhang, Rainer Stiefelhagen, Jan Boehm

发表机构 * University College London(伦敦大学学院) Karlsruhe Institute of Technology(卡尔斯鲁厄大学) Hunan University(湖南大学) Shenzhen University(深圳大学)

AI总结 本文提出了一种加权聚合描述符(WeiAD)和标记剪枝框架(WeiToP),用于提升视觉位置识别的性能和效率,通过动态调整特征提取的精度与效率平衡。

详情
AI中文摘要

视觉位置识别(VPR)旨在将查询图像匹配到大规模数据库中相同地点的参考图像。最近最先进的方法采用视觉Transformer(ViTs)作为基础模型,提取对视角、光照和季节变化具有鲁棒性的补丁级特征,然后聚合为紧凑的全局描述符进行检索。大多数现有聚合方法将补丁标记均匀地池化到学习的簇中,尽管不同簇往往编码不同的空间或语义模式,并对VPR性能贡献不均。为了解决这一限制,我们提出了加权聚合描述符(WeiAD),在聚合过程中分配簇的权重,产生更具判别性的全局表示。除了准确性之外,检索延迟是大规模部署和资源受限边缘设备的关键关注点。先前的工作主要通过压缩全局描述符来减少延迟,而忽略了特征提取的成本,这在基于ViT的基础模型中变得更加严重。因此,我们引入了面向VPR的标记剪枝框架WeiToP,通过自蒸馏减少特征提取成本,其中聚合诱导的标记重要性监督一个轻量级剪枝模块,附加到早期Transformer层上,使推理时能够进行标记剪枝。在单次联合训练阶段后,WeiToP能够在推理时实现插拔式的标记剪枝,允许在不额外训练的情况下灵活地控制精度-效率权衡。此外,WeiToP在现有针对通用视觉任务的标记剪枝方法上表现更优。

英文摘要

Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.

2605.20547 2026-05-21 cs.LG cs.AI stat.ML 版本更新

Latent Process Generator Matching

潜在过程生成器匹配

Lukas Billera, Hedwig Nora Nordlinder, Ben Murrell

发表机构 * Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet(微生物学、肿瘤和细胞生物学系,Karolinska研究院)

AI总结 本文提出了一种潜在过程生成器匹配框架,该框架将观测到的生成状态视为可 tractable 马尔可夫过程的确定性图像,从而扩展了生成器匹配理论,使其适用于时间依赖的潜在条件过程。

Comments 18 pages, 1 figure

详情
AI中文摘要

许多近期的流匹配和扩散式生成模型在训练过程中依赖于辅助的随机动力学:通过模拟更丰富的过程来定义条件目标,但辅助状态在生成时要么难以采样,要么并不属于期望的输出。现有的生成器匹配理论规范了对静态潜在随机变量的条件,而几篇近期论文证明了特定增强状态构造的投影结果的特殊情况。我们引入了潜在过程生成器匹配,一种通用框架,将观测到的生成状态视为可 tractable 马尔可夫过程的确定性图像 $X_t=Φ(Y_t)$。我们显示在这一设定下,可以在图像空间中学习一个随机过程的生成器,其一阶边缘分布与投影过程相同。这扩展并涵盖了文献中的离散潜在过程结果,并将生成器匹配从静态潜在变量扩展到丰富的时间依赖潜在条件过程家族。

英文摘要

Many recent flow-matching and diffusion-style generative models rely on auxiliary stochastic dynamics during training: a richer process is simulated to define conditional targets, but the auxiliary state is either intractable to sample at generation time or simply not part of the desired output. Existing Generator Matching theory formalises conditioning on static latent random variables, and several recent papers prove special cases of projection results for particular augmented-state constructions. We introduce latent process generator matching, a general framework that treats the observed generative state as a deterministic image $X_t=Φ(Y_t)$ of a tractable Markov process $Y_t$. We show that in this setting one may learn the generator of a stochastic process on the image space which has the same one-time marginal distributions as the projected process. This generalizes and subsumes the discrete latent process results from the literature, and extends Generator Matching from static latent variables to a rich family of time-dependent latent conditional processes.

2605.20052 2026-05-21 cs.CL cs.AI 版本更新

PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling

PromptRad: 基于知识的多标签提示微调用于低资源放射报告标注

Ying-Jia Lin, Tzu-Chin Lo, Ping-Chien Li, Chi-Tung Cheng, Chien-Hung Liao, Hung-Yu Kao

发表机构 * Department of Artificial Intelligence and AI Research Center, Chang Gung University(人工智能系及AI研究中心,长庚大学) Department of Radiology, Sijhih Cathay General Hospital(放射科,西吉医院) Department of Medical Imaging and Intervention, Chang Gung Memorial Hospital(医学影像与介入科,长庚纪念医院) Department of Trauma and Emergency Surgery, Chang Gung Memorial Hospital(创伤与急诊外科,长庚纪念医院) Department of Computer Science, National Tsing Hua University(计算机科学系,国立清华大学)

AI总结 本文提出PromptRad,一种基于知识的多标签提示微调方法,用于在低资源环境下进行放射报告标注,通过引入UMLS元词典中的同义词增强类别表示,以更少的标注数据实现优于传统方法的性能。

Comments BioNLP 2026 @ ACL (camera-ready version)

详情
AI中文摘要

自动报告标注有助于从非结构化文本中识别临床发现,并为医学影像研究提供大规模注释。现有的基于规则的标注器难以处理临床报告中的多样化描述,而微调预训练语言模型(PLMs)需要大量标注数据,这些数据在临床环境中通常不可用。在本文中,我们提出PromptRad,一种基于知识的多标签提示微调方法,用于在低资源环境下进行放射报告标注。PromptRad将多标签分类重新表述为掩码语言建模,并将UMLS元词典中的同义词纳入多词提示器以丰富类别表示。通过微调PLM而不增加额外分类层,PromptRad所需的标注数据比传统微调要少得多。在肝CT报告上的实验表明,PromptRad在仅使用32个标注训练示例的情况下,优于基于词典和微调的基线方法,并且在使用远小模型的情况下,性能与GPT-4具有竞争力。进一步分析显示,PromptRad比现有方法更有效地捕捉复杂的否定模式,使其成为数据稀缺临床场景中报告标注的有希望的解决方案。我们的代码可在https://github.com/ila-lab/PromptRad上获得。

英文摘要

Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label \textbf{prompt}-tuning approach for \textbf{rad}iology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category representations. By fine-tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine-tuning. Experiments on liver CT (computed tomography) reports show that PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT-4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data-scarce clinical scenarios. Our code is available at https://github.com/ila-lab/PromptRad.

2605.19624 2026-05-21 cs.CV cs.AI 版本更新

Component-Aware Structure-Preserving Style Transfer for Satellite Visual Sim2Real Data Construction

面向组件的结构保持风格迁移用于卫星视觉Sim2Real数据构建

Zongwu Xie, Yonglong Zhang, Yifan Yang, Yang Liu, Baoshi Cao

发表机构 * State Key Laboratory of Robotics and Systems, Harbin Institute of Technology(机器人系统国家重点实验室,哈尔滨工业大学)

AI总结 本文提出了一种面向组件的结构保持风格迁移框架,用于卫星视觉的合成到真实数据构建,通过提取真实图像的部件级风格代码并注入到合成图像中,从而提高标注保持的卫星视觉Sim2Real数据生成效果。

详情
AI中文摘要

对于基于相机的卫星视觉感知,Sim2Real数据构建需要图像接近真实域传感器外观同时保留来自模拟的注释。具有可靠姿态标签和组件级遮罩的卫星目标的真实传感器图像难以大规模获取,而合成渲染提供精确的几何注释但存在明显的外观差距。本文提出了一种面向组件的结构保持风格迁移框架用于卫星视觉的合成到真实数据构建。该方法通过校准的真实获取、基于ArUbo的相机姿态测量、CAD渲染和组件遮罩构建弱配对的真实-合成样本。然后从未标记的真实图像中提取部件级真实域风格代码,并通过遮罩对齐调节将其注入到对应的合成卫星区域中。为了保持生成图像对下游传感器数据监督的可用性,对抗训练与局部对比一致性、自正则化和边缘保持约束相结合。实验在5000张渲染的卫星图像和100张在校准设置下拍摄的真实图像上进行。真实图像提供目标域外观参考和最终评估图像,而下游的GDRNet姿态估计器仅在合成或翻译的合成图像上进行训练。与代表性图像翻译基线相比,所提方法实现了最小的图像分布差异,FID为54.32,KID为0.048。当翻译数据用于在目标域适应设置下训练GDRNet时,ADD通过率提高到0.260,AUC提高到0.611。这些结果表明,组件级外观迁移可以提高标注保持的卫星视觉Sim2Real数据生成效果。

英文摘要

For camera-based satellite visual sensing, Sim2Real data construction requires images that approach real-domain sensor appearance while retaining the annotations inherited from simulation. Real sensor images of satellite targets with reliable pose labels and component-level masks are difficult to acquire at scale, whereas synthetic rendering provides exact geometric annotations but suffers from a visible appearance gap. This paper presents a component-aware structure-preserving style transfer framework for satellite visual synthetic-to-real data construction. The method builds weakly paired real--synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream sensor-data supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve annotation-preserving satellite visual Sim2Real data generation in the considered calibrated setup.

2605.19503 2026-05-21 cs.RO cs.AI cs.LG 版本更新

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

ARC-RL: 一种受ARC Raiders启发的强化学习游乐场

Carlo Romeo, Andrew D. Bagdanov

发表机构 * Media Integration and Communication Center – University of Florence(媒体整合与通信中心——佛罗伦萨大学)

AI总结 本文提出ARC-RL,一个包含四种MuJoCo连续控制环境的强化学习游乐场,这些环境的机器人形态灵感来自ARC Raiders的生物目录,通过统一的观察模板、动作约定和奖励函数,研究不同形态和动画风格约束下的强化学习算法性能。

详情
AI中文摘要

腿部运动的强化学习已经发展成一个多组件奖励函数和物理引擎基准的堆叠,其形态统一来源于现实商业硬件。然而,游戏NPC受风格约束,缺乏sim-to-real机器人,通常以没有现实机器人对应物的生物形式出现。我们介绍了ARC-RL,一个包含四种MuJoCo连续控制环境的套件,其机器人形态受ARC Raiders的生物目录启发:18自由度的高六足Queen、12自由度的装甲六足Bastion、18自由度的紧凑六足Tick以及12自由度的四足Leaper。这四个机器人共享统一的观察模板、动作约定、仿真节奏和一个单一的闭式多组件奖励函数,其唯一形态差异体现在一小部分权重和参数中。奖励融合了速度跟踪帐篷、健康生存奖励、相位锁定步态适应奖励/成本对、动作正则化器、三个安全惩罚和姿态锚;在任何点都不会引入运动捕捉数据。我们还为每种形态提供手工制作的中心模式生成器演示,这些演示既作为固定专家参考,也作为离线到在线训练的先验数据来源。在此游乐场中,我们进行了一项受控的实证研究,比较标准在线算法(SAC、SPEQ、SOPE-EO)和带有先验数据的算法(SACfD、SPEQ-O2O、SOPE),并研究每种范式如何应对游乐场的形态多样性和动画风格约束。源代码可在https://github.com/CarloRomeo427/ARC_RL.git获取。

英文摘要

Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints. Source code is available at https://github.com/CarloRomeo427/ARC_RL.git.

2605.19376 2026-05-21 cs.AI 版本更新

Generative Recursive Reasoning

生成性递归推理

Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn

发表机构 * KAIST(韩国科学技术院) Mila – Québec AI Institute(魁北克人工智能研究所) New York University(纽约大学) Université de Montréal(蒙特利尔大学)

AI总结 本文提出Gram框架,通过将递归潜在推理转化为概率多轨迹计算,解决了传统递归推理模型的确定性问题,实现了条件推理和无条件生成。

详情
AI中文摘要

未来的神经推理系统应如何实现扩展计算?递归推理模型(RRMs)通过使用共享转移函数的迭代潜在状态细化,为自回归序列扩展提供了一种有前途的替代方法。然而,现有RRMs大多是确定性的,遵循单一的潜在轨迹并收敛到单一预测。我们引入生成性递归推理模型(GRAM),一种将递归潜在推理转化为概率多轨迹计算的框架。GRAM将推理视为随机的潜在轨迹,通过递归深度和并行轨迹采样实现多个假设、替代解决方案策略和推理时间扩展。这产生了一个支持通过p_θ(y|x)进行条件推理的潜在变量生成模型,并通过p_θ(x)实现无条件生成,无论输入是否固定或缺失。通过缩放变分推断训练,GRAM在结构推理和多解约束满足任务上优于确定性递归和循环基线,同时展示了无条件生成能力。

英文摘要

How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via $p_θ(y \mid x)$ and, with fixed or absent inputs, unconditional generation via $p_θ(x)$. Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. https://ahn-ml.github.io/gram-website

2605.19138 2026-05-21 cs.RO cs.AI cs.LG 版本更新

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

COBALT: 通过基于云的远程操作利用智能手机进行机器人学习

Ayush Agarwal, Ansh Gandhi, Jeremy A. Collins, Omar Rayyan, Aryan Sarswat, Ranjani Koushik, Masoud Moghani, Ajay Mandlekar, Animesh Garg

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of California, Berkeley(加州大学伯克利分校) New York University Abu Dhabi (NYUAD)(纽约大学阿布扎克分校) University of Toronto(多伦多大学) NVIDIA(英伟达)

AI总结 本文提出COBALT平台,通过基于云的远程操作技术,利用智能手机等设备大规模收集高质量的机器人学习数据,提高仿真实验和现实世界中的机器人学习效率。

详情
AI中文摘要

大规模、高质量的演示数据稀缺仍然是扩展模仿学习用于机器人操作的主要瓶颈。我们提出了COBALT,一个旨在大规模普及机器人学习的远程操作平台,无论是仿真还是现实世界。通过利用向量化的环境,我们的可扩展、负载均衡的基础设施支持多个用户在单个GPU上同时进行远程操作,从而显著降低远程操作成本。操作员可以使用几乎全球任何地方的常见设备连接,包括单或双智能手机、VR头盔、3D鼠标和键盘。内存中的数据缓存和高效的视频流保持控制和渲染同步,支持数十个并发用户在20 Hz下以不超过100毫秒的端到端延迟运行,每GPU支持多达8个并发用户。我们还展示了稳定运行支持256个模拟客户端跨8个GPU,凸显了系统在硬件和单个服务器内的扩展能力。我们进行了全面的用户研究,显示基于手机的远程操作性能与或优于专用硬件,能够更快、更符合人体工学地收集数据。为确保数据质量,COBALT记录一套实时指标以自动过滤劣质演示。我们进一步证明,结构化的用户培训课程显著提高了数据收集质量。基于用户研究的洞察,我们通过众包收集了一个大规模、高质量的试点数据集,该数据集包含7500多个演示(50多个小时),在五个国家的智能手机上收集了九天的数据。我们通过训练最先进的模仿学习算法验证了数据集的质量。请访问https://cobalt-teleop.github.io/获取更多详情。

英文摘要

The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An inmemory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit https://cobalt-teleop.github.io/ for more details.

2605.18833 2026-05-21 cs.LG cs.AI 版本更新

Automated Big Data Quality Assessment using Knowledge Graph Embeddings

利用知识图谱嵌入进行自动化大数据质量评估

Hadi Fadlallah, Rima Kilany, Mitri Haber, Ali Jaber

发表机构 * Saint-Joseph University(圣约瑟夫大学) Lebanese University(黎巴嫩大学)

AI总结 本文提出了一种基于知识图谱嵌入的自动化大数据质量评估方法,通过整合多样化的知识图谱表示,利用上下文信息生成针对每个情境的全面数据质量评估计划。

Comments 17 pages, 10 figures

Journal ref International Journal of Data Mining, Modelling and Management 17.4 (2025) 383-405

详情
AI中文摘要

自动化数据质量评估对于管理大数据至关重要,但现有解决方案在实现准确的上下文感知评估方面面临挑战。本文提出了一种基于知识的新方法,利用知识图谱嵌入来预测输入数据集的上下文表示与知识图谱中相关质量规则和维度之间的缺失边。我们通过整合知识图谱中的多样化表示,从深入的文献研究中获取洞察,从而开发出针对每个情境的全面且上下文特定的数据质量评估计划。利用知识图谱提高了我们对输入数据集上下文的理解,克服了传统方法仅依赖严格匹配并忽视上下文特征的局限性。通过注入数值边属性,我们为每个预测的质量测量分配相应的权重,为输入数据集提供全面的数据质量评估计划。为了评估我们的方法,我们利用AccentureLabs开发和基准测试的AmpliGraph框架。评估涉及使用由黎巴嫩原子能委员会(LAEC-CNRS)提供的现实世界辐射传感器数据集。从该评估中获得的结果证明了我们的解决方案能够为给定的输入数据集生成全面的数据质量评估计划。

英文摘要

Automated data quality assessment is crucial for managing big data, but existing solutions face challenges in achieving accurate context-aware assessment. This paper presents a novel knowledge-based approach to enhance automated data quality assessment. Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph representing contextual data characteristics and the required quality assessment operations. We surpass conventional practices by integrating diverse representations within the knowledge graph, drawing insights from contextual information from a thorough literature investigation. This integration allows us to develop a comprehensive and context-specific data quality assessment plan tailored to each context. Leveraging the knowledge graph improves our understanding of the input dataset's context, overcoming the limitations of traditional methods that rely solely on strict matching and overlook contextual characteristics. By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement, providing a comprehensive data quality assessment plan for the input dataset. To evaluate our approach, we leverage AmpliGraph, a framework developed and benchmarked by AccentureLabs. The evaluation involves employing a real-world radiation sensors dataset provided by the Lebanese Atomic Energy Commission (LAEC-CNRS). The results obtained from this evaluation demonstrate the capability of our solution to generate a comprehensive data quality assessment plan for the given input dataset.

2605.18743 2026-05-21 cs.AI 版本更新

WorldString: Actionable World Representation

WorldString: 可行动态世界表征

Kunqi Xu, Jitao Li, Jianglong Ye, Tianshu Tang, Isabella Liu, Sifei Liu, Xueyan Zou

发表机构 * CalTech(加州理工学院) UC San Diego(南加州大学) Tsinghua University - IEI Lab(清华大学-IEI实验室) NVIDIA(英伟达)

AI总结 本文提出WorldString,一种能够通过点云或RGB-D视频流直接学习现实物体状态流形的神经架构,为构建可行动态世界模型提供基础构建块。

详情
AI中文摘要

受大语言模型中涌现行为启发,研究社区正在探索类似涌现能力的世界模型,尤其关注物理世界的建模。在物理世界建模中,物体是构成物理现实的基本原始元素。从人类到计算机,几乎一切我们交互的事物都是物体。这些物体很少是静态的;它们是具有变化状态的可行动态实体,其状态由内在属性决定。尽管当前方法通过视频生成或动态场景重建来处理物体动作状态,但没有一种方法明确地以统一、原则性的方式建模这一基本元素,以构建可行动态物体表征。我们提出了WorldString,一种神经架构,能够通过直接从点云或RGB-D视频流中学习来建模现实物体的状态流形。作为通用的数字孪生,它充当物理世界模型的基础构建块;因此,我们将其命名为WorldString。有趣的是,其完全可微的结构无缝地使未来与策略学习和神经动力学的整合成为可能。

英文摘要

Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with a emphasis on modeling the physical world. Within the scope of physical world model, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Sweetly, its fully differentiable structure seamlessly enables future integration with policy learning and neural dynamics.

2605.18678 2026-05-21 cs.CV cs.AI 版本更新

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Lance:通过多任务协同实现统一多模态建模

Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo, Hao Li, Yinghang Song, Fei Ding, Jianzhu Guo, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang

发表机构 * Intelligent Creation Lab, ByteDance(字节跳动智能创作实验室)

AI总结 本文提出Lance,一种轻量级的原生统一模型,支持图像和视频的多模态理解、生成和编辑。该模型通过协同多任务训练的实用范式实现统一多模态建模,基于统一上下文建模和解耦能力路径两个核心原则,通过双流混合专家架构实现联合上下文学习并解耦理解和生成路径。

Comments 34 pages, 14 figures, 10 tables, homepage url: https://lance-project.github.io , code url: https://github.com/bytedance/Lance

详情
AI中文摘要

我们提出了Lance,一种轻量级的原生统一模型,支持图像和视频的多模态理解、生成和编辑。与依赖模型容量扩展或文本-图像主导设计不同,Lance通过协同多任务训练探索统一多模态建模的实用范式。其基于两个核心原则:统一上下文建模和解耦能力路径。具体而言,Lance从头开始训练,并在共享交错的多模态序列上采用双流混合专家架构,实现联合上下文学习的同时解耦理解和生成路径。我们进一步引入模态感知的旋转位置编码以减轻异构视觉标记之间的干扰并提升跨任务对齐。在训练过程中,Lance采用分阶段的多任务训练范式,结合能力导向的目标和自适应数据调度,以加强语义理解和视觉生成性能。实验结果表明,Lance在图像和视频生成方面显著优于现有开源统一模型,同时保留了强大的多模态理解能力。该模型的主页可在https://lance-project.github.io上访问。

英文摘要

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.

2605.17946 2026-05-21 cs.AI cs.CV cs.LG 版本更新

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

SVFSearch: 一种面向游戏垂直领域的多模态知识密集型短视频帧搜索基准

Lingtao Mao, Huangyu Dai, Xinyu Sun, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou

发表机构 * Kuaishou Technology(快手科技)

AI总结 本文提出SVFSearch,首个针对中文游戏领域短视频帧搜索的多模态知识密集型基准,通过5000个四选一测试示例和4198个辅助训练示例,评估了从直接问答到计划-行动-重新计划代理等多种方法在短视频帧搜索中的性能。

详情
AI中文摘要

多模态大语言模型越来越多地被用作代理的骨干,以理解多模态输入、计划检索操作、调用外部工具并推理由检索信息得出的结论。然而,现有的基准很少评估在短视频应用中的这种能力,其中暂停的帧通常在视觉上具有歧义性,回答需要垂直的、长尾的和快速发展的领域知识。我们引入了SVFSearch,这是首个针对中文游戏领域短视频帧搜索的开放基准。SVFSearch包含5,000个四选一测试示例和4,198个辅助训练示例,每个示例都围绕一个暂停的游戏场景展开,来自真实的短视频片段。为了支持公平且可重复的评估,SVFSearch提供了一个冻结的离线检索环境,包括一个游戏领域文本语料库、一个主题链接的图像画廊以及文本、图像和多模态检索接口,避免了对不受控的网络搜索API的依赖。我们评估了从直接问答和RAG工作流程到计划-行动-重新计划代理和学习搜索模型在内的代表性范式。结果揭示了模型单独回答、实际代理搜索和 oracle 知识之间的巨大差距:最好的开源直接问答模型达到66.4%,最好的实际代理达到79.1%,而 oracle 知识达到95.4%。进一步分析揭示了视觉定位、检索质量、证据基础推理和工具使用行为中的瓶颈,包括过度检索、只回答捷径和检索诱导的误导。

英文摘要

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.

2605.17164 2026-05-21 cs.DC cs.AI cs.LG cs.PL 版本更新

Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

Charon:一种用于大规模大语言模型训练和推理的统一且细粒度模拟器

Mengtian Yang, Zhekun Zhang, Mingheng Wu, Jianwen Yan, Hanshi Sun, Li-wen Chang

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出Charon模拟器,通过统一、模块化和细粒度的方法,准确预测大语言模型性能,实验显示其在不同模型和配置上具有高精度,预测误差低于5.35%,并在实际推理部署中发现提升系统吞吐量的配置,展示了其实际价值。

Comments Accepted by MLSys 2026

详情
AI中文摘要

在大规模大语言模型(LLM)训练和推理中,由于并行策略、系统优化和硬件配置的复杂设计空间,实现最优性能极具挑战性。准确且快速的性能模拟对于通过验证“假设”图进行优化努力和系统研究至关重要。为此,我们引入Charon,一种统一、模块化且细粒度的模拟器,以准确预测LLM性能。实验显示,Charon在不同模型和配置上均具有高精度,总体预测误差始终低于5.35%,甚至在使用大规模GPU集群进行训练时也低于3.74%。在实际推理部署案例中,Charon发现了一种比工程调优基线配置提升系统吞吐量的配置,证明了其在现实中的重要价值。

英文摘要

Deploying large-scale LLM training and inference with optimal performance is exceptionally challenging due to a complex design space of parallelism strategies, system optimizations, and hardware configurations. Accurate and rapid performance simulation is critical for guiding optimization efforts and system studies by validating "what-if" Hooker Figure hypotheses. To address this, we introduce Charon, a unified, modular, and fine-grained simulator for accurately predicting LLM performance. Experiments show Charon achieves high accuracy across different models and configurations, with an overall prediction error consistently under 5.35%, and even under 3.74% for training with a large-scale GPU cluster. In a practical inference deployment case, Charon discovered a configuration that improved system throughput over an engineering-tuned baseline, demonstrating its significant real-world value.

2605.16962 2026-05-21 cs.CV cs.AI 版本更新

OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

OmniVL-Guard Pro: 一个增强工具的代理用于综合视觉-语言防伪

Jinjie Shen, Zheng Huang, Yuchen Zhang, Yujiao Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China(合肥工业大学计算机科学与信息工程学院) Wuhan University, Wuhan, China(武汉大学) Lab for Intelligence and visiON (LION)(智能与视觉实验室) Xi'an Jiaotong University(西安交通大学)

AI总结 该研究提出OmniVL-Guard Pro,一种增强工具的代理,用于综合视觉-语言防伪,通过整合多种工具环境和引入新的强化学习方法,实现了开放世界中的线索驱动推理,并在多个任务上达到了最先进的性能。

Comments 29 pages

详情
AI中文摘要

现有的视觉-语言伪造检测和定位方法基于封闭世界范式,假设模型可以单独完成验证。然而,自包含的MLLMs受限于有限的参数知识、静态训练语料和有限的感知分辨率,在动态开放世界防伪中存在实际限制,特别是在需要外部线索的实时事件验证和需要对局部篡改进行细致审查的伪造分割中。为了解决这些限制,我们从扩大自包含模型转向超越它。我们提出了OmniVL-Guard Pro,一种增强工具的代理,将统一的防伪从封闭世界预测扩展到开放世界的线索驱动推理。OmniVL-Guard Pro整合了一个涵盖实时事件搜索、局部裁剪和缩放、边缘异常筛查、人脸检测、视频帧提取以及SAM3基于分割的工具环境。为了生成高质量的工具推理轨迹,我们引入了树状结构的自进化工具轨迹生成,通过种子引导、无引导的自我进化和弱提示的硬样本合成生成多样化的轨迹,产生Full-Spectrum Tool Reasoning (FSTR)数据集用于训练。我们进一步提出了Checker-Guided Agentic Reinforcement Learning (CGARL),它为过程级监督提供,以惩罚那些答案正确但推理扭曲的情况。广泛的实验表明,OmniVL-Guard Pro在各种任务上实现了最先进的性能,并表现出强大的零样本泛化能力。FSTR数据集和OmniVL-Guard Pro的代码将在https://github.com/shen8424/OmniVL-Guard-Pro公开发布。

英文摘要

Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics -- particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose \textbf{OmniVL-Guard Pro}, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clues-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce \textbf{Tree-Structured Self-Evolving Tool Trajectory Generation}, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose \textbf{Checker-Guided Agentic Reinforcement Learning} (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks, and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at https://github.com/shen8424/OmniVL-Guard-Pro.

2605.16428 2026-05-21 cs.IR cs.AI 版本更新

The Impact of AI Search on the Online Content Ecosystem: Evidence from Google and Reddit

人工智能搜索对在线内容生态系统的影响:来自谷歌和推特的证据

Peibo Zhang, Ruomeng Cui, Dennis J. Zhang

发表机构 * Goizueta Business School, Emory University(埃默里大学戈伊兹塔商学院) Olin Business School, Washington University in St. Louis(圣路易斯华盛顿大学奥林商学院)

AI总结 本文研究了人工智能搜索对在线内容生态系统的影响,通过谷歌AI概述和推特平台分析,发现AI概述提高了安全内容社区的参与度,但交互式AI模式削弱了这种效果。

详情
AI中文摘要

传统的搜索引擎通过将用户寻找信息的请求定向到外部网站来补充在线内容平台。生成式人工智能搜索工具能够直接在结果页面上总结答案,可能通过使访问来源平台变得可选而打破这种关系。我们利用谷歌AI概述和推特,其中一个最大的在线讨论平台,研究这一问题。我们的识别利用了谷歌的内容审核政策:安全的推特社区通过谷歌有机搜索被索引并在谷歌AI概述中出现,而不安全的社区虽然被有机搜索索引,但禁止在AI概述摘要中引用。使用差异-in-差异设计,我们发现AI概述提高了安全社区的参与度:每天的评论数量增加了12.0个百分点,评论用户数量增加了12.3个百分点,相对于不安全社区。这些影响集中在基于经验的讨论(意见、建议和个人经验)而不是基于事实的信息。然而,随后引入的谷歌AI模式,允许用户与AI摘要进行对话式交互,大大消除了经验内容中的这些收益。这些结果表明,人工智能搜索的效果在很大程度上取决于界面设计和内容类型。

英文摘要

Search engines traditionally complement online content platforms by directing users seeking information to external websites. The emergence of generative AI search tools that summarize answers directly on the results page may disrupt this relationship by making visits to source platforms optional. We study this question using Google AI Overviews and Reddit, one of the largest online discussion platforms. Our identification exploits Google's content moderation policy: Safe-for-Work (SFW) Reddit communities are indexed by Google organic search and surfaced in Google AI Overviews, while Not-Safe-for-Work (NSFW) communities, though indexed by organic search, are prohibited from being referenced in AI Overview summaries. Using a difference-in-differences design, we find that AI Overviews increase engagement in SFW communities: daily comments rise by 12.0 percent and the number of commenting users by 12.3 percent relative to NSFW communities. The effects are concentrated in experience-based discussions (opinions, advice, and personal experiences) rather than fact-based information. However, the subsequent introduction of Google AI Mode, which allows users to interact conversationally with the AI summary, largely eliminates these gains in experience-based content. These results suggest that the effects of AI search depend critically on interface design and types of content.

2605.16217 2026-05-21 cs.CL cs.AI cs.IR 版本更新

Argus: Evidence Assembly for Scalable Deep Research Agents

Argus:可扩展深度研究代理的证据组装

Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, Xinyu Wang

发表机构 * MiroMind AI

AI总结 Argus通过将深度研究视为拼图碎片的组装过程,而非并行暴力求解整个答案,提高了大规模信息检索任务的效率和效果。

详情
AI中文摘要

深度研究代理在复杂信息检索任务上取得了显著进展。即使长ReAct风格的探索仅追踪单一轨迹,而最新最先进的系统通过并行搜索和聚合来扩展推理时间计算。然而,深度研究答案由互补的证据片段组成,而并行探索通常重复而非完成这些片段,导致收益递减且推动聚合上下文接近模型极限。我们提出Argus,一种代理系统,其中搜索者和导航者合作将深度研究视为从互补证据片段中组装拼图,而非并行暴力求解整个答案。搜索者通过ReAct风格交互收集给定子查询的证据轨迹。导航者维护共享证据图,验证哪些片段仍缺失,派遣搜索者收集它们,并在完成图上推理以生成来源追踪的最终答案。我们用强化学习训练导航者以验证、派遣和合成,同时独立训练搜索者以保持标准ReAct代理。所获得的导航者支持单个搜索者或多个并行搜索者无需重新训练。使用35B-A3B MoE骨干的搜索者和导航者,Argus在单个搜索者上获得5.5分,在8个并行搜索者上获得12.7分,平均在八个基准上。使用64个搜索者时,其在BrowseComp上达到86.2分,超越了我们所有基准测试的专有代理,同时导航器的推理上下文保持在21.5K tokens以下。

英文摘要

Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.

2605.14259 2026-05-21 cs.AI cs.CL 版本更新

Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

面向异构业务系统的超图企业代理推理器

Ling Wang, Xin Liu, Songnan Liu, Jianan Wang, Cheng Cheng, Yihan Zhu, Enyu Li, Yu Xiao, Jiangyong Xie, Duogong Yan, Jiangyi Chen

发表机构 * SUPCON

AI总结 本文提出HEAR,一种基于分层超图本体的企业代理推理器,通过分层图层和超边层实现结构化多跳分析,无需重新训练LLM,在供应链任务中达到94.7%的准确率,并展示出适应性和效率。

详情
AI中文摘要

将大语言模型(LLMs)应用于异构企业系统受到多跳、n元推理中幻觉和失败的阻碍。现有范式(如GraphRAG、NL2SQL)缺乏复杂环境所需语义基础和可审计执行。我们引入HEAR,一种基于分层超图本体的企业代理推理器。其基图层虚拟化了具有溯源意识的数据接口,而超边层编码n元业务规则和程序协议。通过证据驱动的推理循环,HEAR动态协调本体工具进行结构化多跳分析,无需重新训练LLM。在供应链任务中,包括订单履行阻塞根本原因分析(RCA)的评估显示,HEAR达到94.7%的准确率。关键地,HEAR展示了适应性效率:利用程序超边以最小化令牌成本,同时利用拓扑探索确保复杂查询的严格正确性。通过将专有模型性能与开源权重骨干结合,并自动化手动诊断,HEAR建立了可扩展、可审计的企业智能基础。

英文摘要

Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.

2605.12483 2026-05-21 cs.LG cs.AI 版本更新

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

超越GRPO和在线策略蒸馏:一种经验性稀疏到密集奖励原则用于语言模型后训练

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard

AI总结 本文提出了一种经验性的稀疏到密集奖励原则,用于语言模型后训练,通过在教师模型上使用稀疏奖励进行探索和发现,然后通过密集监督将行为压缩到部署模型中,从而在数学问题上实现了优于GRPO的性能。

详情
AI中文摘要

在标记可验证的训练数据是约束的情况下,每个检查的示例应分配给模型和奖励密度,其中它最有信息量。我们识别出一个支配这种分配的奖励密度原则:稀疏序列级奖励在能够探索和发现更好行为的模型上最有用,而密集的token级教师监督更适合将该行为压缩到更小的部署模型中。该原则产生了一个简单的分配规则:在最强的可用教师上使用稀缺的标记数据,然后将奖励形状的行为作为密集监督转移到下游。我们通过一个四阶段的工作流程——教师RL、forward-KL预热、在线策略蒸馏、可选的后桥学生RL——在可验证的数学上评估了此规则,使用Qwen3和Llama模型。在固定的Qwen3-1.7B部署学生大小下,一个通过密集桥进行蒸馏的RL改进的8B教师在相同的学生上表现优于直接GRPO(79.3% vs. 75.9%在MATH;25.2% vs. 19.8%在AIME 2024,avg@16),而从相同教师提前进行RL的转移效果更差。一个组件消融确认了每个阶段的重要性:用RL改进的教师替换为原始教师会损失7.8个MATH点,移除forward-KL预热会损失1.7个点,移除在线策略蒸馏会损失3.3个点。教师质量顺序——原始教师转移 < 直接GRPO < RL教师转移——在使用Llama-3.1-8B-Instruct作为教师和Llama-3.3-70B-Instruct作为教师的情况下重复。操作教训是避免将稀缺的标记数据用于准备最少的策略:使用稀疏奖励进行教师端的发现,使用密集转移进行学生端的压缩,并在桥接后才使用学生端的稀疏奖励。

英文摘要

In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated to the model and reward density where it is most informative. We identify a reward-density principle that governs this allocation: sparse sequence-level reward is most useful on models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller deployment model. The principle yields a simple allocation rule: use scarce labeled data upstream on the strongest available teacher, then transfer the reward-shaped behavior downstream as dense supervision. We evaluate this rule through a four-stage workflow -- teacher RL, forward-KL warmup, on-policy distillation, optional post-bridge student RL -- on verifiable math with Qwen3 and Llama models. At fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student ($79.3\%$ vs.\ $75.9\%$ on MATH; $25.2\%$ vs.\ $19.8\%$ on AIME~2024, avg@16), while transfer from the same teacher \emph{before} RL underperforms. A component ablation confirms that each stage is load-bearing: replacing the RL-improved teacher with a raw teacher costs $7.8$ MATH points, removing the forward-KL warmup costs $1.7$, and removing on-policy distillation costs $3.3$. The teacher-quality ordering -- raw-teacher transfer $<$ direct GRPO $<$ RL-teacher transfer -- replicates on Llama-3.1-8B-Instruct with a Llama-3.3-70B-Instruct teacher. The operational lesson is to avoid spending scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.

2605.12334 2026-05-21 cs.AI 版本更新

Reinforcing VLAs in Task-Agnostic World Models

在任务无关的世界模型中强化视觉-语言-动作

Yucen Wang, Rui Yu, Fengming Zhang, Junjie Lu, Xinyao Qin, Tianxiang Zhang, Kaixin Wang, Li Zhao

发表机构 * Microsoft Research Asia(微软亚洲研究院) Nanjing University(南京大学) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Wuhan University(武汉大学) University of Technology Sydney(悉尼科技大学) Tsinghua University(清华大学)

AI总结 本文提出RAW-Dream方法,通过分离世界模型学习与下游任务依赖,利用预训练的世界模型和现成的视觉-语言模型,实现零样本推理,从而在无需任务特定数据的情况下提高VLA适应性。

详情
AI中文摘要

在学习的世界模型中通过强化学习(RL)后训练视觉-语言-动作(VLA)模型,已成为一种有效的策略,可以在不进行昂贵的真实世界交互的情况下适应新任务。然而,尽管使用想象轨迹减少了策略训练的样本复杂性,现有方法仍然严重依赖任务特定数据来微调世界和奖励模型,从根本上限制了其扩展到未见任务的能力。为了解决这个问题,我们主张世界和奖励模型应捕捉可转移的物理先验,以实现零样本推理。我们提出了RAW-Dream(在任务无关世界梦中强化VLA),一种新的范式,完全将世界模型学习与下游任务依赖分离。RAW-Dream利用在多样化任务无关行为上预训练的世界模型来预测未来滚动,以及现成的视觉-语言模型(VLM)进行奖励生成。由于这两个组件都是任务无关的,VLA可以在此零样本想象中轻松微调以适应任何新任务。此外,为了减轻世界模型的幻觉,我们引入了双噪声验证机制来过滤掉不可靠的滚动。在模拟和现实世界设置中的广泛实验展示了一致的性能提升,证明了通用的物理先验可以有效替代昂贵的任务依赖数据,为VLA适应提供了一条高度可扩展的道路。

英文摘要

Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily finetuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.

2605.12321 2026-05-21 cs.AI cs.CY cs.ET 版本更新

LIDSA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management

LIDSA:信号自由的自主交叉口管理中的认知仲裁

Abderrahmane Lakas, Mohamed Amine Ferrag, Merouane Debbah

发表机构 * Department of Computer and Network Engineering, United Arab Emirates University, UAE(计算机与网络工程系,阿联酋大学) Research Institute for Digital Future, Khalifa University, UAE(未来数字研究院,哈利法大学)

AI总结 本文提出LIDSA框架,利用大语言模型进行意图驱动的速度建议,以实现信号自由的自主交叉口管理,通过对比固定周期控制、SCATS、AIM和GLOSA等方法,证明LLM在实时交叉口管理中的有效性。

Comments Renamed LISA to LIDSA to avoid naming ambiguity with existing traffic-control software. No technical changes

详情
AI中文摘要

大型语言模型(LLMs)在智能交通系统(ITS)中展现出强大的潜力,特别是在需要情境推理和多智能体协调的任务中。这些能力使它们非常适合协同驾驶,其中基于规则的方法在复杂和动态的交通环境中表现不佳。交叉口管理尤其具有挑战性,因为存在冲突的优先权需求、异质车辆优先级以及必须实时解决的车辆特定运动学约束。然而,现有方法通常将LLMs作为基于信号系统的辅助组件,而不是主要决策者。信号控制器仍然缺乏车辆感知,预留方法缺乏意图意识,而最近的基于LLM的系统仍然依赖于信号基础设施。此外,LLM推理延迟限制了其在亚秒级控制设置中的应用。我们提出了LIDSA(基于LLM的意图驱动速度建议),一种用于自主交叉口管理的信号自由认知仲裁框架。LIDSA利用LLM对声明的车辆意图进行推理,结合优先级类别、队列压力和能源偏好。我们评估了LIDSA在不同交通负载下的性能,结果表明LIDSA将平均控制延迟减少了高达89.1%,同时保持了服务水平C,而所有非LLM基线方法降级到服务水平F。在接近饱和需求下,LIDSA将平均等待时间减少了93%,峰值队列长度减少了60.6%相对于固定周期控制。它还降低了燃料消耗高达48.8%,并实现了86.2%的意图满足率,相比最好的非LLM方法的61.2%。这些结果证明了基于LLM的推理能够实现实时、无信号的交叉口管理。

英文摘要

Large language models (LLMs) show strong potential for Intelligent Transportation Systems (ITS), particularly in tasks requiring situational reasoning and multi-agent coordination. These capabilities make them well suited for cooperative driving, where rule-based approaches struggle in complex and dynamic traffic environments. Intersection management remains especially challenging due to conflicting right-of-way demands, heterogeneous vehicle priorities, and vehicle-specific kinematic constraints that must be resolved in real time. However, existing approaches typically use LLMs as auxiliary components on top of signal-based systems rather than as primary decision-makers. Signal controllers remain vehicle-agnostic, reservation-based methods lack intent awareness, and recent LLM-based systems still depend on signal infrastructure. In addition, LLM inference latency limits their use in sub-second control settings. We propose LIDSA (LLM-Based Intent-Driven Speed Advisory), a signal-free cognitive arbitration framework for autonomous intersection management. LIDSA uses an LLM to reason over declared vehicle intents, incorporating priority classes, queue pressure, and energy preferences. We evaluate LIDSA against fixed-cycle control, SCATS, AIM, and GLOSA across varying traffic loads. Results show that LIDSA reduces mean control delay by up to 89.1% and maintains Level of Service C while all non-LLM baselines degrade to Level of Service F. Under near-saturated demand, LIDSA reduces mean waiting time by 93% and peak queue length by 60.6% relative to fixed-cycle control. It also lowers fuel consumption by up to 48.8% and achieves 86.2% intent satisfaction, compared to 61.2% for the best non-LLM method. These results demonstrate that LLM-based reasoning can enable real-time, signal-free intersection management.

2605.11302 2026-05-21 cs.LG cs.AI cs.CL 版本更新

A Theory of Time-Sensitive Language Generation: Sparse Hallucination Beats Mode Collapse

时间敏感语言生成理论:稀疏幻觉战胜模式崩溃

Atul Ganju, Travis McVoy, Shaddin Dughmi, Shang-Hua Teng

发表机构 * University of Southern California(美国南加州大学)

AI总结 本文研究了在全局偏好顺序下语言生成的极限情况,提出了一种时间敏感的语言生成方法,通过稀疏幻觉技术克服了模式崩溃问题,证明了在特定条件下可以实现最优密度。

详情
AI中文摘要

我们研究了在全局偏好顺序下语言生成的极限情况,如Kleinberg和Wei所引入的。与以往工作类似,我们追求广度,但增加了时效性要求:高排名字符串应更早生成。一个字符串只有在截止时间前生成才被认可,其截止时间由一个函数确定,该函数将字符串在目标语言中的排名映射到必须生成的时间。这与机器学习中的归纳偏置一致,即在其他条件相同的情况下,倾向于选择更简单或更可能的输出。我们证明,在强意义上,最终一致的生成器无法实现时效性生成——这是大多数先前相关工作的主角。在可能最温和的一致性放松下,即幻觉率随时间消失,我们证明可以绕过我们的不可能结果。特别是,我们可以实现相对于任何超线性截止函数的最优密度。我们还证明这是紧的,通过排除线性截止时间和消失幻觉率下的时效性生成。

英文摘要

We study language generation in the limit under a global preference ordering on strings, as introduced by Kleinberg and Wei. As is done in previous work, we aim for breadth, but impose an additional requirement of timeliness: higher-ranked strings should be generated earlier. A string is then only credited if it is generated before a deadline, where its deadline is defined by a function that maps a string's rank in the target language to the time by which it must be produced. This is in keeping with a central consideration in machine learning, where inductive bias favors ``simpler'' or ``more plausible'' outputs, all else being equal. We show that timely generation is impossible in a strong sense for eventually consistent generators -- the protagonists of most prior related work. Under what is perhaps the mildest natural relaxation of consistency, a hallucination rate that vanishes over time, we show that we can circumvent our impossibility result. In particular, we can achieve optimal density with respect to any superlinear deadline function. We also show this is tight by ruling out timely generation with linear deadlines and vanishing hallucination rate.

2605.11151 2026-05-21 cs.AI cs.RO 版本更新

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

RankQ: 通过自监督动作排名实现离线到在线强化学习

Andrew Choi, Wei Xu

发表机构 * Horizon Robotics(地平线机器人)

AI总结 该研究提出RankQ方法,通过自监督多项排名损失增强时序差分学习,以在大状态-动作空间中更准确地学习批评器,从而在稀疏奖励D4RL基准和基于视觉的机器人学习中实现更高效的离线到在线微调。

详情
AI中文摘要

离线到在线强化学习(RL)通过利用预先收集的数据集来提高样本效率。然而,一个关键挑战是在有限的数据集覆盖下,在大规模状态-动作空间中学习准确的批评器。为了减轻价值过估计带来的有害更新,先前方法通过降低分布外(OOD)动作相对于数据集动作的权重来引入悲观主义。虽然有效,但这种方法本质上充当了一个行为克隆锚点,当数据集动作不优时会阻碍后续在线策略改进。我们提出RankQ,一种离线到在线的Q学习目标,通过在时序差分学习中加入自监督的多项排名损失来强制结构化动作排序。通过学习相对动作偏好而不是均匀惩罚未见过的动作,RankQ塑造Q函数,使动作梯度指向高质量的行为。在稀疏奖励D4RL基准中,RankQ的性能与或优于七种先前方法。在基于视觉的机器人学习中,RankQ能够在低数据环境下有效微调预训练的视觉-语言-动作(VLA)模型,平均在模拟成功率上比次优方法高42.7%。在高数据环境下,RankQ在模拟性能上比次优方法提高13.7%,并实现强大的仿真到现实转移,将现实世界立方体堆叠成功率从43.1%提升到88.9%,相对于VLA的初始性能。

英文摘要

Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state--action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 88.9% relative to the VLA's initial performance.

2605.10787 2026-05-21 cs.AI cs.SE 版本更新

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

ComplexMCP: 评估LLM代理在动态、相互依赖和大规模工具沙箱中的表现

Yuanyang Li, Xue Yang, Longyue Wang, Weihua Luo, Hongyang Chen

发表机构 * Zhejiang University(浙江大学) Zhejiang Lab(浙江实验室) Alibaba Group(阿里巴巴集团)

AI总结 本文提出ComplexMCP基准,用于评估LLM代理在动态、相互依赖和大规模工具环境中的性能,揭示了现有模型在复杂任务中的不足,指出三个关键瓶颈:工具检索饱和、过度自信和战略投降倾向。

详情
AI中文摘要

当前LLM代理擅长调用孤立API,但在商业软件自动化最后一公里方面表现不佳。在现实场景中,工具并非独立,而是原子性、相互依赖且易受环境噪声影响。我们引入ComplexMCP,一个基于Model Context Protocol(MCP)设计的基准,提供超过300个经过严格测试的工具,来源于7个状态沙箱,涵盖办公套件到金融系统。与现有数据集不同,我们的基准采用种子驱动架构模拟动态环境状态和不可预测的API故障,确保评估的确定性与多样性。我们评估了各种LLM在全上下文和RAG范式下的表现,揭示了显著的性能差距:即使顶级模型也难以超过60%的成功率,远低于人类90%的表现。细粒度轨迹分析识别出三个根本瓶颈:(1)工具检索饱和;(2)过度自信,即代理跳过必要的环境验证;(3)战略投降倾向,即倾向于合理化失败而非追求恢复。这些发现凸显了当前代理在相互依赖工作流中的不足,将ComplexMCP定位为下一代鲁棒自主系统的关键测试平台。

英文摘要

Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce $\textbf{ComplexMCP}$, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), $\textbf{ComplexMCP}$ provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) $\textbf{tool retrieval saturation}$ as action spaces scale; (2) $\textbf{over-confidence}$, where agents skip essential environment verifications; and (3) $\textbf{strategic defeatism}$, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning $\textbf{ComplexMCP}$ as a critical testbed for the next generation of resilient autonomous systems.

2605.10181 2026-05-21 cs.CV cs.AI 版本更新

A Comparative Study of Machine Learning and Deep Learning for Out-of-Distribution Detection

机器学习与深度学习在分布外检测中的比较研究

Jihyeon Baek, Seunghoon Lee, Gitaek Kwon, Doohyun Park

发表机构 * VUNO Inc.(VUNO公司)

AI总结 本文比较了传统机器学习和深度学习在分布外检测任务中的性能,发现轻量级机器学习方法在保持同等准确性的同时,具有显著更低的计算成本,适用于视觉复杂度有限的任务。

Comments Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore

详情
AI中文摘要

分布外检测(OOD)对于构建可靠的人工智能系统至关重要,因为无法信任产生无效输入输出的模型。尽管深度学习(DL)通常被认为优于传统机器学习(ML),但医学影像数据通常是在标准化协议下获取的,导致在OOD检测任务中图像变化相对受限。这促使在该设置下直接比较ML和DL方法。两种方法在包含超过60,000张视网膜和非视网膜图像的开放数据集上进行了评估,涵盖多种分辨率。两种方法在内部和外部验证集上均实现了AUROC为1.000和准确性在0.999至1.000之间的结果,显示出相当的检测性能。然而,ML方法在保持等同准确性的同时,表现出显著更低的端到端延迟,表明具有更大的计算效率。这些结果表明,对于视觉复杂度有限的OOD检测任务,轻量级ML方法可以实现DL级别的性能,但计算成本显著降低,支持实际应用场景的部署。

英文摘要

Out-of-distribution (OOD) detection is essential for building reliable AI systems, as models that produce outputs for invalid inputs cannot be trusted. Although deep learning (DL) is often assumed to outperform traditional machine learning (ML), medical imaging data are typically acquired under standardized protocols, leading to relatively constrained image variability in OOD detection tasks. This motivates a direct comparison between ML and DL approaches in this setting. The two approaches are evaluated on open datasets comprising over 60,000 fundus and non-fundus images across multiple resolutions. Both approaches achieved an AUROC of 1.000 and accuracies between 0.999 and 1.000 on internal and external validation sets, showing comparable detection performance. The ML approach, however, exhibited substantially lower end-to-end latency while maintaining equivalent accuracy, indicating greater computational efficiency. These results suggest that for OOD detection tasks of limited visual complexity, lightweight ML approaches can achieve DL-level performance with significantly reduced computational cost, supporting practical real-world deployment.

2605.10165 2026-05-21 cs.CV cs.AI 版本更新

Task-Agnostic Noisy Label Detection via Standardized Loss Aggregation

通过标准化损失聚合进行任务无关的噪声标签检测

Inhyuk Park, Doohyun Park

发表机构 * VUNO Inc.(VUNO公司)

AI总结 本文提出了一种任务无关的噪声标签检测方法SLA,通过聚合标准化的交叉验证损失来量化标签可靠性,实验表明SLA在各种噪声水平下均优于硬计数基线,并在低噪声比情况下收敛更快,有助于高效重新标注和提升数据集可靠性。

Comments Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore

详情
AI中文摘要

由于观察者差异和模糊案例,大规模医学影像数据集中的噪声标签很常见。我们提出了一种统计上站得住且任务无关的框架,即标准化损失聚合(SLA),用于在样本层面检测噪声标签。SLA通过在重复交叉验证运行中聚合标准化的折叠级验证损失来量化标签可靠性。这种公式将离散的硬计数方案泛化为一个连续估计器,能够捕捉性能偏差的频率和幅度,从而产生可解释且统计上稳定的噪声分数。在公共视网膜数据集上的实验表明,SLA在所有噪声水平下均优于硬计数基线,并在低噪声比情况下收敛速度显著加快,尤其是在细微损失变化具有信息量的情况下。具有高SLA分数的样本指示可能模糊或错误标注的案例,从而指导高效的重新标注,提高任何分类任务的数据集可靠性。

英文摘要

Noisy labels are common in large-scale medical imaging datasets due to inter-observer variability and ambiguous cases. We propose a statistically grounded and task-agnostic framework, Standardized Loss Aggregation (SLA), for detecting noisy labels at the sample level. SLA quantifies label reliability by aggregating standardized fold-level validation losses across repeated cross-validation runs. This formulation generalizes discrete hard-counting schemes into a continuous estimator that captures both the frequency and magnitude of performance deviations, yielding interpretable and statistically stable noisiness scores. Experiments on a public fundus dataset demonstrate that SLA consistently outperforms the hard-counting baseline across all noise levels and converges substantially faster, especially under low noise ratios where subtle loss variations are informative. Samples with high SLA scores indicate potentially ambiguous or mislabeled cases, guiding efficient re-annotation and improving dataset reliability for any classification task.

2605.09860 2026-05-21 cs.AI 版本更新

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

何时重新承诺:为长时间视觉-语言推理发现时间抽象

Chen Li, Zhantao Yang, Fangyi Chen, Han Zhang, Anudeepsekhar Bolimera, Marios Savvides

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出了一种可学习的状态条件化承诺深度方法,用于长时间视觉-语言推理任务,通过动态调整承诺深度,提高了求解率并减少了基本动作数量,优于固定深度基线和现有模型。

详情
AI中文摘要

长时间推理需要决定不仅采取什么行动,还要在下一次观察之前多深地承诺。我们将其形式化为"承诺深度":在重新规划之间执行的原始动作数量。承诺深度在重新规划成本和执行误差累积之间产生权衡,但大多数现有长时间系统将其固定为手动设计的标量。在本文中,我们将其视为策略本身的一个可学习、状态条件化的变量。我们将其实例化在一个模型原生的视觉-语言策略中,该策略联合预测执行什么和持续多久。在Sliding Puzzle和Sokoban任务中,所得到的自适应策略在非退化的固定深度基线中占据帕累托最优,达到高达12.5个百分点的更高求解率,同时每回合使用约25%更少的基本动作。尽管使用7B主干,我们的方法在两个任务上优于GPT-5.5和Claude Sonnet,而每个测试的开放权重视觉-语言模型都达到0%的零样本成功率。我们进一步展示了理论分析,表明在标准的承诺深度替代方案下,状态条件化的承诺在本地最优深度在不同状态变化时严格优于任何固定深度。

英文摘要

Long-horizon reasoning requires deciding not only what actions to take, but how deeply to commit before the next observation. We formalize this as \emph{commitment depth}: the number of primitive actions executed open-loop between replans. Commitment depth induces a trade-off between replanning cost and compounding execution error, yet most existing long-horizon systems fix it as a hand-designed scalar. In this work, we instead treat commitment depth as a learnable, state-conditioned variable of the policy itself. We instantiate this within a model-native vision--language policy that jointly predicts both what to execute and for how long. Across Sliding Puzzle and Sokoban, the resulting adaptive policy Pareto-dominates every non-degenerate fixed-depth baseline, achieving up to 12.5 percentage points higher solve rate while using approximately 25\% fewer primitive actions per episode. Despite using a 7B backbone, our method outperforms GPT-5.5 and Claude Sonnet on both tasks, while every tested open-weight vision--language model achieves 0\% zero-shot success. We further present a theoretical analysis showing that, under the standard commitment-depth surrogate, state-conditioned commitment strictly dominates any fixed depth whenever the locally optimal depth varies across states.

2605.07926 2026-05-21 cs.AI 版本更新

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

AgentEscapeBench: 评估LLM代理在跨领域工具引导推理中的能力

Zhengkang Guo, Yiyang Li, Lin Qiu, Xiaohua Wang, Jingwen Xv, Dongyu Ru, Xiaoyu Li, Xiaoqing Zheng, Xuezhi Cao, Xunliang Cai

发表机构 * Fudan University(复旦大学) Meituan Longcat Team(美团Longcat团队)

AI总结 本文提出AgentEscapeBench基准测试,用于评估LLM代理在非熟悉工作流和短程交互之外维持工具引导推理的能力,通过逃亡室风格的任务测试代理在显式长距离依赖约束下推断、执行和修订新工具使用程序的能力,结果显示代理在依赖深度增加时表现显著下降。

详情
AI中文摘要

随着基于LLM的代理越来越多地依赖外部工具,评估其在非熟悉工作流和短程交互之外维持工具引导推理的能力变得至关重要。我们引入了AgentEscapeBench,一个逃亡室风格的基准测试,用于测试代理是否能够在显式长距离依赖约束下推断、执行和修订新的工具使用程序。每个任务定义了一个工具和物品上的有向无环依赖图,要求代理调用真实外部函数、跟踪逐步揭示的隐藏状态、传播中间结果,并提交一个确定性可验证的最终答案。AgentEscapeBench包含五个难度层级中的270个实例,并支持全自动评估。对十六个LLM代理和人类参与者的实验表明,随着依赖深度的增加,表现急剧下降:人类从难度5级的98.3%成功降至难度25级的80.0%,而最佳模型从90.0%降至60.0%。轨迹分析表明,模型失败主要归因于长距离状态跟踪、线索遵循和中间结果传播的崩溃。这些发现表明,当前代理通常能够处理局部工具使用,但在深度上下文依赖方面仍存在困难。我们希望AgentEscapeBench可以作为诊断测试床,用于衡量当前代理能力,并指导未来训练努力,以实现更健壮的通用推理、行动和适应能力。

英文摘要

As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. These findings suggest that current agents can often handle local tool use but still struggle with deep contextual dependencies. We hope AgentEscapeBench can serve as a diagnostic testbed for measuring current agent capabilities and informing future training efforts toward more robust general-purpose reasoning, action, and adaptation.

2605.07731 2026-05-21 cs.CL cs.AI 版本更新

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

对可比的意大利和国际开源大语言模型进行EngGPT2-16B-A3B的基准测试

Andrea Sassella, Andrea Chizzola, Tommaso Bianchi, Luca Alessandrelli, Mark James Carman

发表机构 * AIRIC, Politecnico di Milano(AIRIC,米兰理工大学) DEIB, Politecnico di Milano(DEIB,米兰理工大学)

AI总结 本文研究了EngGPT2-16B-A3B在多个基准测试中的性能,与同等规模的开源MoE和密集模型进行比较,展示了其在国际和意大利基准测试中的表现。

详情
AI中文摘要

本报告对ENGINEERING Ingegneria Informatica S.p.A.的EngGPT2MoE-16B-A3B大语言模型进行了基准测试,该模型是一个具有3B活跃参数的16B参数混合专家(MoE)模型。性能在各种代表性基准测试中进行了评估,并与同等规模的开源MoE和密集模型进行了比较。与流行的意大利模型如FastwebMIIA-7B、Minerva-7B、Velvet-14B和LLaMAntino-3-ANITA-8B相比,EngGPT2MoE-16B-A3B在国际基准测试(ARC-Challenge、GSM8K、AIME24、AIME25、MMLU和HumanEval(HE))中表现相同或更好。它在RULER基准测试的最长上下文设置(32k)中取得最佳性能。在意大利基准数据集ITALIC上,该模型在除Velvet-14B外的其他模型中表现相同或更好。与同等规模的MoE模型相比,新模型在所有考虑的基准测试中都比DeepSeek-MoE-16B-Chat的值更高。它在HE、MMLU、AIME24、AIME25、GSM8K和32k RULER设置上比Moonlight-16B-A3B更高,但在BFCL和一些ARC和ITALIC设置上较低。最后,它在大多数基准测试中比GPT-OSS-20B低,包括HE、MMLU、AIME24、AIME25、GSM8K、ARC、BFCL和RULER 32k。与流行的密集模型相比,EngGPT2MoE-16B-A3B在AIME24和AIME25上比Llama-3.1-8B-Instruct、Gemma-3-12b-it和Minstral-3-8BInstruct-2512-BF16的值更高,但在ITALIC、BFCL和32k RULER设置上较低。当性能汇总所有基准测试指标时,EngGPT2MoE-16B-A3B在评估的意大利模型中表现更高,但在一些最高效的国际模型(特别是GPT-5 nano和Qwen3-8B)中表现较低。总体而言,我们的发现表明新模型是原生意大利大语言模型的一大步。

英文摘要

This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared against comparably-sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower on BFCL and some ARC and ITALIC settings. Finally it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and the RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8BInstruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings find the new model to be a step forward for native Italian Large Language Models.

2605.07021 2026-05-21 cs.AI 版本更新

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

行为线索推理:通过监督提高推理的效率和安全性

Christopher Z. Cui, Taylor W. Killian, Prithviraj Ammanabrolu

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Brigham Young University

AI总结 该研究提出行为线索推理方法,通过引入行为线索来增强大语言模型的可控性和可监控性,从而在复杂数学问题解决中减少50%的无效推理token,并在安全行动恢复方面将成功率从46%提升至96%。

详情
AI中文摘要

大语言模型(LLMs)的推理过程在监督方面面临挑战,因为许多不一致的行为往往在推理结束后才显现。为了解决这一问题,我们引入了行为线索推理,使LLM的推理过程更加可控和可监控。行为线索是特殊标记序列,模型在训练过程中被训练为在特定隐含和显式行为之前立即发出,起到双重用途的信号和控制杠杆。在微调较弱的外部监控器时,通过强化学习进行推理监督,仅使用行为线索产生的信息压缩视图就足以让监控器剪枝复杂数学问题解决中多达50%的无效推理token。当在过度约束违反导致失败的环境中利用几乎最优的规则基监控器时,行为线索使从80%的推理轨迹中恢复安全行动,这些轨迹原本会以提出不安全行动而结束,将成功率从46%提升至96%。通过在两个模型家族和三个领域中的评估,我们证明行为线索推理在不降低性能的情况下提高了推理的可监控性和可控性。更广泛地说,我们的工作通过展示被监控模型本身可以被训练得更易于监督来推进可扩展的监督。

英文摘要

Reasoning in Large Language Models (LLMs) poses a challenge for oversight as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual purpose signal and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations results in failure, Behavior Cues allows for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that Behavior Cue Reasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work progresses scalable oversight by demonstrating how the monitored model itself can be trained to reason more tractably to oversight. Code: https://github.com/christopherzc/behavior-cues

2605.06139 2026-05-21 cs.LG cs.AI 版本更新

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

列表式策略优化:基于组的RLVR作为LLM响应单纯形上的目标投影

Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Yingyue Li, Wutong Xu, Lizhou Cai, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji

发表机构 * Department of Automation, Tsinghua University(清华大学自动化系) LLM Department, Tencent(腾讯LLM部门)

AI总结 本文提出列表式策略优化(LPO),通过显式执行目标投影来解构隐式目标,利用响应单纯形限制近端RL目标,并通过精确散度最小化进行策略投影,从而在多样推理任务和LLM基础上提升训练性能,同时保持优化稳定性和响应多样性。

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成为大语言模型(LLMs)训练后的一种标准方法,以激励推理能力。在现有方法中,基于组的策略梯度很流行,它为每个提示样本生成一组响应,并通过组内优势信号更新策略。本文揭示这些优化策略共享一个共同的几何结构:每种策略隐式地定义了一个目标分布,并通过一阶近似向响应单纯形投影。基于这一见解,我们提出了列表式策略优化(LPO)以显式执行目标投影,通过限制近端RL目标到响应单纯形来解构隐式目标,然后通过精确散度最小化进行策略投影。该框架提供了(i)在列表式目标上单调改进,具有有界、零和和自校正的投影梯度,以及(ii)通过解耦的投影步骤灵活选择散度,具有不同的结构性质。在多样推理任务和LLM基础架构上,LPO在匹配的目标下一致地优于典型的策略梯度基线,同时内在地保持了优化稳定性和响应多样性。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to explicitly conduct the target-projection, which demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection with distinct structural properties through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.

2605.05863 2026-05-21 cs.LG cs.AI 版本更新

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

SOPE: 通过先验数据稳定在线强化学习中的策略评估

Carlo Romeo, Girolamo Macaluso, Alessandro Sestini, Andrew D. Bagdanov

发表机构 * Media Integration and Communication Center – University of Florence(媒体集成与通信中心——佛罗伦萨大学) SEED – Electronic Arts(SEED——电子艺界)

AI总结 本文提出SOPE算法,通过使用与演员对齐的离策略策略评估(OPE)信号作为自动早停机制,动态控制离线训练阶段的长度,从而在连续控制任务中提高基线性能并减少计算资源消耗。

详情
AI中文摘要

将先验数据纳入在线强化学习可以加速训练,但通常需要在高计算成本和长的多阶段训练流水线之间做出艰难的权衡。虽然固定长度的稳定阶段比静态更新计划更具计算效率,但它们需要任务相关的手动调整,可能会导致先验知识的浪费或严重的过拟合。为此,我们提出了SOPE算法,该算法利用与演员对齐的离策略策略评估(OPE)信号作为自动早停机制,动态控制离线训练阶段的长度。通过在当前策略的动作分布下对批评者进行保留验证集的评估,SOPE在离分布收益饱和时精确停止梯度更新,从而消除了手动调度调整的需要。在Minari基准套件的25个连续控制任务上评估,SOPE将基线性能提高了高达45.6%,同时将所需的TFLOPs减少了高达22倍,从而在样本效率和计算效率之间取得了平衡。这些发现表明,自适应的、基于评估的更新计划比依赖静态、详尽的更新计划更有效。

英文摘要

Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are significantly more computationally efficient than static update schedules, they require task-dependent manual tuning, risking either the waste of prior knowledge or severe overfitting. To address this, we propose SOPE, an algorithm that uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automated early-stopping mechanism to dynamically control the length of offline training phases. By evaluating the critic on a held-out validation split under the current policy's action distribution, SOPE halts gradient updates exactly when out-of-distribution benefits saturate, eliminating the need for manual schedule tuning. Evaluated on 25 continuous control tasks from the Minari benchmark suite, SOPE improves baseline performance by up to 45.6% while reducing the required TFLOPs by up to 22x, thus balancing the tradeoff between sample and computational efficiency. These findings demonstrate that adaptive, evaluation-driven update schedules are more effective than relying on static, exhaustive update schedules.

2605.04128 2026-05-21 cs.GR cs.AI cs.CL cs.CV cs.LG 版本更新

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

JoyAI-Image: 激活统一多模态理解和生成中的空间智能

Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan

发表机构 * Joy Future Academy, JD(joy未来学院,京东)

AI总结 本文提出JoyAI-Image,一种统一的多模态基础模型,用于视觉理解、文本到图像生成和指令引导的图像编辑。该模型结合了空间增强的多模态大语言模型(MLLM)和多模态扩散Transformer(MMDiT),通过共享的多模态接口实现感知与生成的交互。构建可扩展的训练配方,结合统一指令微调、长文本渲染监督、空间 grounded 数据和通用及空间编辑信号,使模型具备广泛的多模态能力,同时增强几何感知推理和可控视觉合成。实验表明,JoyAI-Image在理解、生成、长文本渲染和编辑基准上达到最先进的性能。更重要的是,增强的理解、可控的空间编辑和新视角辅助推理之间的双向循环使模型超越一般视觉能力,向更强的空间智能发展。

Comments Code: https://github.com/jd-opensource/JoyAI-Image

详情
AI中文摘要

我们提出了JoyAI-Image,一种统一的多模态基础模型,用于视觉理解、文本到图像生成和指令引导的图像编辑。JoyAI-Image将空间增强的多模态大语言模型(MLLM)与多模态扩散Transformer(MMDiT)结合,允许感知和生成通过共享的多模态接口进行交互。围绕此架构,我们构建了一个可扩展的训练配方,结合了统一指令微调、长文本渲染监督、空间 grounded 数据以及通用和空间编辑信号。该设计使模型具备广泛的多模态能力,同时增强了几何感知推理和可控视觉合成。在理解、生成、长文本渲染和编辑基准上的实验表明,JoyAI-Image实现了最先进的或高度竞争的性能。更重要的是,增强的理解、可控的空间编辑和新视角辅助推理之间的双向循环使模型超越一般视觉能力,向更强的空间智能发展。这些结果表明,统一视觉模型在下游应用如视觉-语言-动作系统和世界模型中具有前景。

英文摘要

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

2605.03690 2026-05-21 cs.LG cs.AI q-bio.QM 版本更新

Graph Neural Network based Hierarchy-Aware Embeddings of Knowledge Graphs: Applications to Yeast Phenotype Prediction

基于图神经网络的面向层次的知识图谱嵌入:应用于酵母表型预测

Filip Kronström, Alexander H. Gower, Daniel Brunnsåker, Ievgeniia A. Tiukova, Ross D. King

发表机构 * Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg(计算机科学与工程系,查尔姆斯理工大学和哥德堡大学) Department of Life Sciences, Chalmers University of Technology(生命科学系,查尔姆斯理工大学) Department of Industrial Biotechnology, KTH Royal Institute of Technology(工业生物技术系,皇家理工学院) Department of Chemical Engineering and Biotechnology, University of Cambridge(化学工程与生物技术系,剑桥大学)

AI总结 本文提出了一种利用图神经网络和来自底层本体的语义损失来生成层次感知的知识图谱嵌入的方法,用于酵母表型预测,并展示了其在基因敲除效应预测和知识图谱修订评估中的应用。

详情
AI中文摘要

我们提出了一种利用图神经网络和来自底层本体的语义损失来生成层次感知的知识图谱嵌入的方法。该方法生成的嵌入更能反映领域知识。为了展示其效用,我们预测并解释了酵母Saccharomyces cerevisiae中基因敲除的影响,并在没有预测任务的情况下学习知识图谱的盒嵌入。我们进一步展示了盒嵌入如何作为评估知识图谱修订的基础。我们的酵母知识图谱是从社区数据库和本体术语构建的。低维盒嵌入结合图神经网络用于预测双基因敲除的细胞生长。在10折交叉验证中,这些预测的平均R²分数为0.360,显著高于基线比较,证明了高层定性知识对实验结果的影响力。在模型训练中纳入语义损失项提高了其预测性能(R²=0.377),通过将嵌入对齐本体结构。这表明本体中的类层次可以用于定量预测。我们还测试了训练好的模型在三基因敲除上的表现,展示了其对训练数据之外数据的泛化能力。此外,通过识别酵母知识图谱中对细胞生长预测重要的共现关系,我们构建了关于酵母相互作用特征的假说。一个生物实验验证了其中一个发现,揭示了肌醇利用与渗透压压力抗性之间的关联,突显了模型在生物发现中的潜力。

英文摘要

We present a method for finding hierarchy-aware embeddings of knowledge graphs (KGs) using graph neural networks (GNNs) enriched with a semantic loss derived from underlying ontologies. This method yields embeddings that better reflect domain knowledge. To demonstrate their utility, we predict and interpret the effects of gene deletions in the yeast Saccharomyces cerevisiae and learn box embeddings for KGs in the absence of a prediction task. We further show how box embeddings can serve as the basis for evaluating KG revisions. Our yeast KG is constructed from community databases and ontology terms. Low-dimensional box embeddings combined with GNNs are used to predict cell growth for double gene knockouts. Over 10-fold cross validation, these predictions have a mean $R^2$~score~of~0.360, significantly higher than baseline comparisons, demonstrating that high-level qualitative knowledge is informative about experimental outcomes. Incorporating semantic loss terms in the training of the models improves their predictive performance ($R^2$=0.377) by aligning embeddings with ontology structure. This shows that class hierarchies from ontologies can be exploited for quantitative prediction. We also test the trained models on triple gene knockouts, showing they generalise to data beyond those seen in training. Additionally, by identifying co-occurring relations in the yeast KG important for the cell-growth predictions, we construct hypotheses about interacting traits in yeast. A biological experiment validates one such finding, revealing an association between inositol utilisation and osmotic stress resistance, highlighting the model's potential to guide biological discovery.

2605.01486 2026-05-21 cs.AI 版本更新

MAP-Law: Coverage-Driven Retrieval Control for Multi-Turn Legal Consultation

MAP-Law: 多轮法律咨询中的覆盖驱动检索控制

Qinchuan Cheng, Jiaqi Liu, Ruixuan Xie, Xiaoya Yuan, Yuxin Liu

发表机构 * Xi’an Jiaotong University(西安交通大学) Sichuan University(四川大学) Southwestern University of Finance and Economics(西南财经大学) Northeastern University(东北大学)

AI总结 本文提出了一种覆盖驱动的检索控制框架,用于多轮法律咨询,通过维护用户事实、法律要素、检索目标和检索证据的结构化地图,利用要素覆盖、证据有效性覆盖和边际检索收益来决定检索、澄清、改写或停止操作,实验表明该方法在固定法律要素模式下能有效实现要素覆盖。

详情
AI中文摘要

法律咨询本质上是迭代的:在提供建议之前,系统必须识别相关法律要素,收集缺失的事实和权威,以及确定当前证据是否足够。现有的检索增强型法律代理通常使用固定的检索预算或单次搜索,使其对咨询的演变覆盖状态不敏感。本文介绍了一种针对多轮法律咨询的覆盖驱动检索控制框架。该框架维护用户事实、法律要素、检索目标和检索证据的结构化地图,并利用要素覆盖、证据有效性覆盖和边际检索收益来决定是否检索、澄清、改写或停止。在50个案例的合成中文劳动法咨询试点中,使用DeepSeek V4-Pro动作选择变体,在试点指标下实现了完全测量的要素覆盖,平均需要3.4次检索轮次和7.1个证据片段。诊断分析表明,模型支持的动作选择能够通过小幅增加检索预算恢复规则-政策失败案例,而强制继续主要增加令牌和延迟成本。这些结果表明,法律要素覆盖是适应性法律检索的有用控制信号,在固定模式条件下保持检索控制行为,而非部署层面的法律正确性。

英文摘要

Legal consultation is inherently iterative: before giving advice, a system must identify relevant legal elements, gather missing facts and authorities, and determine whether the current evidence is sufficient. Existing retrieval-augmented legal agents often use fixed retrieval budgets or single-shot search, making them insensitive to the evolving coverage state of a consultation. This paper introduces a coverage-driven retrieval-control framework for multi-turn legal consultation. The framework maintains a structured map over user facts, legal elements, retrieval goals, and retrieved evidence, and uses element coverage, evidence validity coverage, and marginal retrieval gain to decide whether to retrieve, clarify, reformulate, or stop. On a 50-case synthetic Chinese labor-law consultation pilot with fixed legal-element schemas, a DeepSeek V4-Pro action-selection variant achieves full measured element coverage under the pilot metric while requiring 3.4 retrieval rounds and 7.1 evidence snippets on average. Diagnostic analyses show that model-backed action selection recovers rule-policy failure cases with a small retrieval-budget increase, while forced continuation mainly increases token and latency costs. These results suggest that legal-element coverage is a useful control signal for adaptive legal retrieval, while remaining bounded to retrieval-control behavior under synthetic fixed-schema conditions rather than deployment-level legal correctness.

2604.24697 2026-05-21 cs.AI 版本更新

Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

当前智能体能否缩小发现到应用的差距?Minecraft中的一个案例研究

Zhou Ziheng, Huacong Tang, Jinyuan Zhang, Haowei Lin, Bangcheng Yang, Qian Long, Fang Sun, Yizhou Sun, Yitao Liang, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Peking University(北京大学) Amazon(亚马逊)

AI总结 本文通过Minecraft中的SciCrafter基准测试,探讨了智能体在发现因果规律并将其应用于构建功能系统(发现-应用循环)方面的能力,发现前沿模型在该任务中的成功率约为26%,揭示了知识识别和问题提出能力成为当前AI的瓶颈。

Comments Preprint, under review. 41 pages. Project page: https://scicrafter-bench.github.io/. Code: https://github.com/scicrafter-bench/scicraft-bench

详情
AI中文摘要

发现因果规律并将其应用于构建功能性系统——发现-应用循环——是通用智能的标志,但评估这一能力受到科学发现与现实世界工程之间巨大复杂性差距的阻碍。我们引入了基于Minecraft的SciCrafter基准测试,通过参数化的红石电路任务来操作化这一循环。智能体必须按照指定的模式(例如同时或按时间序列)点燃灯泡;扩大目标参数会显著增加构建复杂性和所需知识,迫使真正的发现而非依赖记忆中的解决方案。在通用目的代码智能体框架下评估前沿模型,包括GPT-5.2、Gemini-3-Pro和Claude-Opus-4.5,我们发现所有模型均在约26%的成功率处停滞。为了诊断这些失败,我们将循环分解为四个能力——知识差距识别、实验发现、知识整合和知识应用,并设计了针对性的干预措施,其边际贡献作为相应差距的代理。我们的分析表明,尽管通用知识应用能力仍然是所有模型中最大的差距,但对前沿模型而言,知识差距识别开始成为主要障碍——表明瓶颈正从解决正确的问题转变为提出正确的问题。我们发布了SciCrafter作为未来研究AI系统在完整发现-应用循环中导航的诊断探针。

英文摘要

Discovering causal regularities and applying them to build functional systems--the discovery-to-application loop--is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at approximately 26% success rate. To diagnose these failures, we decompose the loop into four capacities--knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application--and design targeted interventions whose marginal contributions serve as proxies for corresponding gaps. Our analysis reveals that although the general knowledge application capability still remains as the biggest gap across all models, for frontier models the knowledge gap identification starts to become a major hurdle--indicating the bottleneck is shifting from solving problems right to raising the right problems for current AI. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.

2604.22080 2026-05-21 cs.AI 版本更新

Sound Agentic Science Requires Adversarial Experiments

声音代理科学需要对抗性实验

Dionizije Fa, Marko Culjak

AI总结 该研究探讨了代理辅助科学中对抗性实验的重要性,指出传统方法在科学发现中的局限性,并提出应以证伪优先的标准来评估代理生成的科学主张。

Comments Published at ICLR 2026 Workshop on Agents in the Wild

详情
AI中文摘要

基于大型语言模型的代理正迅速被用于科学数据分析,自动化了以往受限于人类时间和专业知识的任务。这种能力通常被描述为发现的加速,但同时也加速了熟悉的失败模式,即快速生成合理且可反复修改的分析,这些分析易于生成,实际上将假设空间转化为由选择性分析支持的候选主张,优化为可发表的积极结果。与软件不同,科学知识不是通过迭代积累代码和事后统计支持来验证的。单个数据集上的流畅解释或显著结果并不等于验证。因为缺失的证据是一个负空间,那些本应证伪主张的实验和分析从未运行或发表。因此,我们提出,通过代理协助产生的非实验性主张应受证伪优先标准的评估:代理不应主要用于构建最吸引人的叙述,而是应主动寻找主张可能失败的方式。

英文摘要

LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode, the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses, optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. Because the missing evidence is a negative space, experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.

2604.20985 2026-05-21 cs.LG cs.AI cs.CR stat.ML 版本更新

Differentially Private Model Merging

差分隐私模型融合

Qichuan Yin, Manzil Zaheer, Tian Li

发表机构 * The University of Chicago(芝加哥大学) Google DeepMind(谷歌DeepMind)

AI总结 本文提出两种后处理技术,随机选择和线性组合,用于在不额外训练的情况下生成满足任意目标差分隐私要求的最终私有模型,同时分析了这些方法在一般问题和私有均值估计中的隐私-效用权衡。

详情
AI中文摘要

在机器学习中,推理或部署时间的隐私要求往往由于政策、法规或用户偏好变化而演变。在本文中,我们旨在构建一组模型,以满足任何目标差分隐私(DP)要求,而无需额外训练,给定一组已在相同数据集上训练且具有不同隐私/效用权衡的现有模型。我们提出两种后处理技术,即随机选择和线性组合,以生成最终的私有模型,满足任何目标隐私参数。我们从R'enyi DP和一般问题中的隐私损失分布的角度提供了这些方法的隐私计费,以及在私有均值估计中的精确隐私/效用权衡分析,并比较了这两种机制。实验上,我们展示了我们方法的有效性,并在多个模型和合成及现实世界数据集上验证了我们的分析。

英文摘要

In machine learning, privacy requirements at inference or deployment time often evolve due to changing policies, regulations, or user preferences. In this work, we aim to construct a magnitude of models to satisfy any target differential privacy (DP) requirement without additional training, given a set of existing models trained on the same dataset with different privacy/utility tradeoffs. We propose two post-processing techniques, namely random selection and linear combination, to generate final private models satisfying any target privacy parameter. We provide privacy accounting of these approaches from the lens of R'enyi DP and privacy loss distributions on general problems, as well as on private mean estimation, where we precisely characterize the privacy/utility tradeoffs and compare the two mechanisms. Empirically, we demonstrate the effectiveness of our approaches and validate our analyses on several models and both synthetic and real-world datasets.

2604.11661 2026-05-21 cs.LG cs.AI 版本更新

Towards Autonomous Mechanistic Reasoning in Virtual Cells

向虚拟细胞中的自主机理推理迈进

Yunhui Jang, Lu Zhu, Jake Fawkes, Alisandra Kaye Denton, Dominique Beaini, Emmanuel Noutahi

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院) Valence Labs(Valence实验室) University College London(伦敦大学学院)

AI总结 本文提出了一种结构化解释形式化方法,用于虚拟细胞中的生物推理,通过机理动作图实现系统验证和反驳,并引入VCR-Agent多智能体框架,结合生物基础知识检索和基于验证器的过滤方法,生成并验证机理推理。

详情
AI中文摘要

大型语言模型(LLMs)最近因其在加速科学发现方面的潜力而受到广泛关注。然而,它们在如生物学等开放性科学领域中的应用仍然有限,主要是由于缺乏事实性支撑和可操作的解释。为此,我们引入了一种结构化解释形式化方法,用于虚拟细胞,将生物推理表示为机理动作图,从而实现系统验证和反驳。在此基础上,我们提出了VCR-Agent多智能体框架,该框架整合了生物基础知识检索与基于验证器的过滤方法,以自动生成并验证机理推理。使用该框架,我们发布了VC-TRACES数据集,该数据集由来自Tahoe-100M图谱的验证机理解释组成。实证研究表明,使用这些解释训练可以提高事实准确性,并为下游基因表达预测提供更有效的监督信号。这些结果强调了通过多智能体和严格验证的协同作用,可靠机理推理在虚拟细胞中的重要性。

英文摘要

Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with a verifier-based filtering approach to generate and validate mechanistic reasoning autonomously. Using this framework, we release VC-TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi-agent and rigorous verification.

2604.11530 2026-05-21 cs.CV cs.AI 版本更新

Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models

超越注意力分数:基于SVD的视觉令牌修剪用于高效视觉-语言模型

Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa

发表机构 * anoncvlab(匿名计算机视觉实验室)

AI总结 本文提出SVD-Prune方法,通过SVD分解视觉令牌特征矩阵并利用统计杠杆分数选择顶级令牌,以在极端视觉令牌预算下保持高性能,优于现有修剪方法。

详情
AI中文摘要

视觉-语言模型(VLMs)通过联合处理视觉和文本信息革新了多模态学习。然而,由于处理长序列视觉令牌的高计算和内存需求,它们面临显著挑战。许多现有方法依赖于局部启发式方法,如注意力分数或令牌范数。然而,这些标准存在位置偏见和信息分散的问题,限制了它们在高修剪比率下保留本质内容的能力,导致在视觉细节丰富的图像上性能下降。为了解决这些问题,我们提出了SVD-Prune,一种训练免费、即插即用的令牌修剪方法,基于奇异值分解。它分解视觉令牌特征矩阵,并利用统计杠杆分数选择顶级令牌,确保仅保留对主导全局方差贡献最大的令牌。实验表明,SVD-Prune在极端视觉令牌预算下始终优于现有修剪方法,即使在32和16个视觉令牌的情况下也能保持强劲性能。

英文摘要

Vision-Language Models (VLMs) have revolutionized multi-modal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-k tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.

2604.11071 2026-05-21 cs.CV cs.AI cs.LG 版本更新

Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

轻量级低光照图像增强 via 分布归一化预处理和深度卷积U-Net

Shimon Murai, Teppei Kurita, Ryuta Satoh, Yusuke Moriuchi

发表机构 * Sony Semiconductor Solutions Corporation(索尼半导体解决方案公司)

AI总结 本文提出了一种轻量级两阶段框架,通过分布归一化预处理和深度卷积U-Net实现低光照图像增强,相比现有方法参数更少且感知质量更优。

Comments Technical report for the NTIRE 2026 Efficient Low-Light Image Enhancement Challenge (CVPR 2026 Workshops), 3rd place solution

详情
AI中文摘要

我们提出了一种轻量级两阶段框架,用于低光照图像增强(LLIE),该框架在参数远少于现有方法的情况下实现了具有竞争力的感知质量。我们的方法结合了冻结算法的预处理与一个完全由深度卷积构成的紧凑型U-Net。预处理通过提供互补的亮度校正视图来归一化输入分布,使可训练网络能够专注于残差颜色校正。我们的方法在CVPR 2026 NTIRE高效低光照图像增强挑战中获得了第三名。我们进一步提供了扩展的基准测试和消融实验以证明我们方法的通用有效性。

英文摘要

We present a lightweight two-stage framework for low-light image enhancement (LLIE) that achieves competitive perceptual quality with significantly fewer parameters than existing methods. Our approach combines frozen algorithm-based preprocessing with a compact U-Net built entirely from depthwise-separable convolutions. The preprocessing normalizes the input distribution by providing complementary brightness-corrected views, enabling the trainable network to focus on residual color correction. Our method achieved 3rd place in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge. We further provide extended benchmarks and ablations to demonstrate the general effectiveness of our methods.

2603.29183 2026-05-21 cs.LG cs.AI 版本更新

IMPACT: Influence Modeling for Open-Set Time Series Anomaly Detection

IMPACT: 开集时间序列异常检测中的影响建模

Xiaohui Zhou, Yijie Wang, Hongzuo Xu, Weixuan Liang, Xiaoli Li, Guansong Pang

发表机构 * National Key Laboratory of Parallel and Distributed Computing(国家级并行与分布式计算实验室) College of Computer Science and Technology, National University of Defense Technology(国防科技大学计算机科学与技术学院) Intelligent Game and Decision Lab (IGDL)(智能游戏与决策实验室(IGDL)) Information Systems Technology and Design, Singapore University of Technology and Design(新加坡科技设计大学信息系统技术与设计系) School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算与信息系统学院)

AI总结 本文提出IMPACT框架,通过影响建模方法解决开集时间序列异常检测中的挑战,通过学习影响函数生成真实异常模式并净化训练数据。

Comments Accepted by ICML 2026

详情
AI中文摘要

开集异常检测(OSAD)是一种新兴范式,旨在利用训练中观察到的异常类有限标记数据,在测试时识别已见和未见的异常。当前方法依赖简单的增强方法生成伪异常以复制未见异常。尽管在图像数据中表现良好,但这些方法在时间序列数据中效果不佳,因为未能保持其序列特性,导致异常模式变得琐碎或不真实。当训练数据被未标记异常污染时,问题进一步加剧。本文引入IMPACT,一种新的框架,利用影响建模方法解决这些挑战。关键见解是学习一个影响函数,以准确估计单个训练样本对建模的影响,然后利用这些影响分数生成语义上不同但真实的未见异常,同时将高影响样本重新利用为监督异常以净化数据。大量实验表明,IMPACT显著优于现有最先进方法,在各种OSAD设置和污染率下表现出更高的准确性。代码可在https://github.com/mala-lab/IMPACT获取。

英文摘要

Open-set anomaly detection (OSAD) is an emerging paradigm designed to utilize limited labeled data from anomaly classes seen in training to identify both seen and unseen anomalies during testing. Current approaches rely on simple augmentation methods to generate pseudo anomalies that replicate unseen anomalies. Despite being promising in image data, these methods are found to be ineffective in time series data due to the failure to preserve its sequential nature, resulting in trivial or unrealistic anomaly patterns. They are further plagued when the training data is contaminated with unlabeled anomalies. This work introduces $\textbf{IMPACT}$, a novel framework that leverages $\underline{\textbf{i}}$nfluence $\underline{\textbf{m}}$odeling for o$\underline{\textbf{p}}$en-set time series $\underline{\textbf{a}}$nomaly dete$\underline{\textbf{ct}}$ion, to tackle these challenges. The key insight is to $\textbf{i)}$ learn an influence function that can accurately estimate the impact of individual training samples on the modeling, and then $\textbf{ii)}$ leverage these influence scores to generate semantically divergent yet realistic unseen anomalies for time series while repurposing high-influential samples as supervised anomalies for anomaly decontamination. Extensive experiments show that IMPACT significantly outperforms existing state-of-the-art methods, showing superior accuracy under varying OSAD settings and contamination rates. Code is available at https://github.com/mala-lab/IMPACT.

2603.28103 2026-05-21 cs.DL cs.AI cs.IR 版本更新

Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models

使用视觉-语言模型进行意大利议会演讲的转录与识别

Luigi Curini, Alfio Ferrara, Giovanni Pagano, Sergio Picascia

发表机构 * Università degli Studi di Milano(米兰大学) Department of Social and Political Sciences(社会科学系) Department of Literary Studies, Philology and Linguistics(文学研究、语言学与语言学系) Department of Computer Science(计算机科学系)

AI总结 本文提出基于视觉-语言模型的 pipeline,用于自动转录、语义分割和实体链接意大利议会演讲,提升转录质量和发言者标注。

Comments to be published in: ParlaCLARIN V: Interoperability, Multilinguality, and Multimodality in Parliamentary Corpora, organized within the 15th Language Resource and Evaluation Conference (2026)

详情
AI中文摘要

议会记录代表了计算分析中丰富而具有挑战性的资源,特别是当仅保存为扫描的历史文档时。现有的意大利议会演讲转录努力依赖于传统的光学字符识别流水线,导致转录错误和有限的语义标注。在本文中,我们提出了一种基于视觉-语言模型的 pipeline,用于自动转录、语义分割和实体链接意大利议会演讲。该 pipeline 使用专门的 OCR 模型提取文本并保留阅读顺序,随后使用大规模的视觉-语言模型进行转录精修、元素分类和发言者识别,通过联合推理视觉布局和文本内容。提取的发言者随后通过 SPARQL 查询和多策略模糊匹配程序链接到议员委员会知识库。在已建立的基准测试中,评估显示在转录质量和发言者标注方面有显著改进。

英文摘要

Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.

2603.27747 2026-05-21 cs.CV cs.AI 版本更新

AI-Powered Facial Mask Removal Is Not Suitable For Identification

基于AI的面部遮挡去除并不适合识别

Emily A Cooper, Hany Farid

发表机构 * Herbert Wertheim School of Optometry & Vision Science University of California, Berkeley(赫伯特·韦瑟姆视觉科学学院,加州大学伯克利分校) School of Information University of California, Berkeley(信息学院,加州大学伯克利分校)

AI总结 本文研究了基于AI的面部遮挡去除技术的有效性和风险,探讨其在真实身份匹配中的可靠性。

详情
AI中文摘要

最近,众包在线刑事调查已使用生成式AI来增强低质量的视觉证据。在一项高关注度案件中,社交媒体用户传播了一张联邦执法人员涉致命枪击事件的

英文摘要

Recently, crowd-sourced online criminal investigations have used generative-AI to enhance low-quality visual evidence. In one high-profile case, social-media users circulated an "AI-unmasked" image of a federal agent involved in a fatal shooting, fueling a wide-spread misidentification. In response to this and similar incidents, we conducted a large-scale analysis evaluating the efficacy and risks of commercial AI-powered facial unmasking, specifically assessing whether the resulting faces can be reliably matched to true identities.

2603.26539 2026-05-21 cs.CL cs.AI 版本更新

How Open Must Language Models be to Enable Reliable Scientific Inference?

语言模型必须多开放才能实现可靠的科学推断?

James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor, Cameron R. Jones, Sean Trott, Roger P. Levy, Benjamin K. Bergen, Micah Altman

发表机构 * Massachusetts Institute of Technology(麻省理工学院) EleutherAI University of California San Diego(加州大学圣地亚哥分校) Rutgers University-Newark(新泽西州立大学罗威特分校) Stony Brook University(史泰森布魯克大學)

AI总结 本文探讨了语言模型的开放程度如何影响基于其研究的科学推断可靠性,指出封闭模型通常不适合科学用途,并提出系统识别和缓解推断威胁的方法。

详情
AI中文摘要

语言模型的开放程度如何影响基于其研究的科学推断?本文分析了模型构造和部署信息的限制如何威胁可靠的推断。我们论证当前封闭模型通常不适合科学用途(有例外情况),并讨论如何解决或缓解它们对可靠推断的威胁。我们建议在研究中使用模型时,应系统地识别潜在的推断威胁,并采取相应的缓解措施,同时提供具体模型选择的正当理由。

英文摘要

How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.

2603.25898 2026-05-21 eess.SY cs.AI cs.SE cs.SY 版本更新

On Integrating Resilience and Human Oversight into LLM-Assisted Modeling Workflows for Digital Twins

在数字孪生构建中整合韧性与人类监督 into LLM辅助建模工作流

Lekshmi P, Neha Karanjkar

发表机构 * Indian Institute of Technology Goa(印度理工学院 Goa)

AI总结 本文提出三种关键设计原则,用于将韧性与人类监督整合到LLM辅助的数字孪生建模工作流中,通过FactoryFlow框架的研究,探讨了如何通过正交化结构建模与参数拟合、限制模型IR到参数化预验证库组件以及使用密度保持的IR来提高建模的鲁棒性和可解释性。

详情
AI中文摘要

LLM辅助建模有潜力快速从粗略描述和传感器数据构建复杂的可执行数字孪生。然而,LLM幻觉的韧性、人类监督以及实时模型适应性仍然是具有挑战性的且常常相互冲突的要求。我们提出了三种关键的设计原则,用于将韧性和监督整合到此类工作流中,这些原则源于我们在FactoryFlow框架上的工作,该框架是一个开源的LLM辅助框架,用于构建制造系统的基于模拟的数字孪生。首先,正交化结构建模和参数拟合。结构描述(组件、连接)是通过LLM从粗略的自然语言转换为中间表示(IR),并进行人工可视化和验证,然后算法转换为最终模型。相比之下,参数推断则在传感器数据流上持续运行,并具有专家可调的控制。第二,限制模型IR到参数化、预验证的库组件的连接,而不是单体模拟代码,从而实现可解释性和错误韧性。第三,最重要的是使用密度保持的IR。当IR描述从紧凑的输入急剧扩展时,幻觉错误会成比例累积。我们提出了Python作为密度保持IR的案例:循环以简洁的方式表达规律性,类捕捉层次结构和组成,结果仍然保持高度可读性,同时利用LLM强大的代码生成能力。一个关键贡献是详细表征了LLM诱导的错误在不同详细程度和复杂度的模型描述中的表现,揭示了IR选择如何关键地影响错误率。这些见解为构建鲁棒和透明的LLM辅助模拟自动化工作流提供了可操作的指导。

英文摘要

LLM-assisted modeling holds the potential to rapidly build executable Digital Twins of complex systems from only coarse descriptions and sensor data. However, resilience to LLM hallucination, human oversight, and real-time model adaptability remain challenging and often mutually conflicting requirements. We present three critical design principles for integrating resilience and oversight into such workflows, derived from insights gained through our work on FactoryFlow - an open-source LLM-assisted framework for building simulation-based Digital Twins of manufacturing systems. First, orthogonalize structural modeling and parameter fitting. Structural descriptions (components, interconnections) are LLM-translated from coarse natural language to an intermediate representation (IR) with human visualization and validation, which is algorithmically converted to the final model. Parameter inference, in contrast, operates continuously on sensor data streams with expert-tunable controls. Second, restrict the model IR to interconnections of parameterized, pre-validated library components rather than monolithic simulation code, enabling interpretability and error-resilience. Third, and most important, is to use a density-preserving IR. When IR descriptions expand dramatically from compact inputs hallucination errors accumulate proportionally. We present the case for Python as a density-preserving IR : loops express regularity compactly, classes capture hierarchy and composition, and the result remains highly readable while exploiting LLMs strong code generation capabilities. A key contribution is detailed characterization of LLM-induced errors across model descriptions of varying detail and complexity, revealing how IR choice critically impacts error rates. These insights provide actionable guidance for building resilient and transparent LLM-assisted simulation automation workflows.

2603.16513 2026-05-21 cs.LG cs.AI 版本更新

FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

FEAT: 一个线性复杂度的超大规模结构化数据基础模型

Zhenghang Song, Tang Qian, Lu Chen, Yushuai Li, Zhengke Hu, Bingbing Fang, Yumeng Song, Junbo Zhao, Sheng Zhang, Tianyi Li

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团) Aalborg University(奥尔堡大学)

AI总结 本文提出FEAT,一种线性复杂度的基础模型,用于处理超大规模结构化数据,通过多层双轴编码架构和自适应融合双向状态空间模型,实现线性时间内的跨元组上下文化,同时支持排列不变的表示学习。

详情
AI中文摘要

结构化数据在医疗、金融和科学数据管理等领域被广泛应用。最近关于结构化数据基础模型(SFMs)的研究旨在支持在这些数据上的数据分析和挖掘任务,但将其应用于现实世界的企业数据库时仍面临可扩展性和泛化能力的挑战。首先,许多SFMs依赖于完全自注意力机制,这引入了O(N²)的计算瓶颈,并限制了可以同时处理的元组数量。其次,直接用线性复杂度序列模型替代注意力可能与结构化数据的排列不变性质相冲突,引入人为的顺序偏差并降低表示质量。此外,仅在合成数据上训练的模型可能难以泛化到现实世界数据库中常见的重尾和异质分布。为了解决这些挑战,我们提出了FEAT,一种用于超大规模结构化数据的线性复杂度基础模型。FEAT用多层双轴编码架构替代二次注意力。它集成了自适应融合双向状态空间模型(AFBM)与卷积门控线性注意力(Conv-GLA),在O(N)时间内实现跨元组上下文化,同时支持排列不变的表示学习。为了提高在现实数据偏斜下的鲁棒性,FEAT进一步采用混合结构因果预训练流水线,具有鲁棒的重建目标。在12个现实世界数据库基准测试中,FEAT在零样本任务上始终优于代表性的SFMs,并且与结构化数据样本长度线性扩展,达到高达50倍的推理延迟提升。

英文摘要

Structured data is widely used in domains such as healthcare, finance, and scientific data management. Recent studies on structured data foundation models (SFMs) aim to support data analysis and mining tasks over such data, but still face scalability and generalization challenges when applied to real-world enterprise databases. First, many SFMs rely on full self-attention, which introduces an O(N^2) computational bottleneck and limits the number of tuples that can be processed jointly. Second, directly replacing attention with linear-complexity sequence models may conflict with the permutation-invariant nature of structured data, introducing artificial order bias and degrading representation quality. Moreover, models trained only on synthetic data may struggle to generalize to the heavy-tailed and heterogeneous distributions commonly found in real-world databases. To address these challenges, we propose FEAT, a linear-complexity foundation model for extremely large structured data. FEAT replaces quadratic attention with a multi-layer dual-axis encoding architecture. It integrates an adaptive-fusion bidirectional state-space model (AFBM) with convolutional gated linear attention (Conv-GLA), enabling cross-tuple contextualization in O(N) time while supporting permutation-invariant representation learning. To improve robustness under real-world data skewness, FEAT further adopts a hybrid structural causal pre-training pipeline with a robust reconstruction objective. Experiments on 12 real-world database benchmarks show that FEAT consistently outperforms representative SFMs on zero-shot tasks and scales linearly with structured-data sample length, achieving up to 50x faster inference latency.

2603.14184 2026-05-21 cs.CV cs.AI 版本更新

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

更深入的思考,更弱的目标:理解并缓解多模态大语言模型推理过程中感知障碍

Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiaohui Li

发表机构 * Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) Huawei Technologies(华为技术)

AI总结 本文研究了多模态大语言模型在推理过程中出现的视觉感知障碍问题,提出了一种无需训练的视觉区域引导注意力框架,通过选择和重新加权视觉头部来引导模型关注与问题相关区域,从而提高视觉定位和推理准确性。

详情
AI中文摘要

多模态大语言模型(MLLMs)在进行扩展推理模式时常常出现感知障碍,特别是在视觉问答(VQA)任务中。我们识别出注意力分散是根本原因:在多步推理过程中,模型的视觉注意力变得分散并远离与问题相关区域,实际上“失去焦点”于视觉输入。为了更好地理解这一现象,我们分析了MLLMs的注意力图,并观察到推理提示显著减少了回答问题关键区域的注意力。我们进一步发现模型对图像标记的总体注意力与图像内注意力的空间分散性之间存在强相关性。基于这一见解,我们提出了一个无需训练的视觉区域引导注意力(VRGA)框架,该框架根据熵-聚焦准则选择视觉头部并重新加权其注意力,从而有效引导模型在推理过程中关注与问题相关区域。在视觉-语言基准上的广泛实验表明,我们的方法有效缓解了感知退化,从而在视觉定位和推理准确性方面取得改进,同时提供了可解释的见解,说明MLLMs如何处理视觉信息。

英文摘要

Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.

2603.08235 2026-05-21 cs.CV cs.AI 版本更新

Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema

探索深度学习与超宽场成像用于糖尿病视网膜病变和黄斑水肿

Pablo Jimenez-Lizcano, Sergio Romero-Tapiador, Ruben Tolosana, Aythami Morales, Guillermo González de Rivera, Ruben Vera-Rodriguez, Julian Fierrez

发表机构 * BiometricsAI, Universidad Autónoma de Madrid, Madrid, Spain(生物度量AI,马德里自治大学,马德里,西班牙) Department of Mathematics, Universidad de Las Palmas de Gran Canaria, Spain(数学系,拉斯帕尔马斯大Canaria大学,西班牙) HCTLab Research Group, Universidad Autónoma de Madrid, Madrid, Spain(HCTLab研究组,马德里自治大学,马德里,西班牙)

AI总结 本文研究了深度学习和超宽场成像在糖尿病视网膜病变和黄斑水肿检测中的应用,通过公开数据集评估了多种深度学习模型,并探讨了特征融合和频域表示的潜力。

Comments 6 pages, 4 figures, 2 tables

详情
AI中文摘要

糖尿病视网膜病变(DR)和糖尿病黄斑水肿(DME)是导致成年劳动力失明的主要原因之一。传统方法主要依赖标准彩色视网膜摄影(CFP)进行检测。然而,最近的超宽场成像(UWF)相比CFP提供了更宽的视野。受此启发,本文探讨了最新深度学习(DL)方法和UWF成像在三个临床相关任务上的应用:i)UWF图像质量评估,ii)可参考糖尿病视网膜病变(RDR)的识别,iii)DME的识别。使用公开的UWF4DR挑战数据集(作为MICCAI 2024会议的一部分发布),我们评估了DL模型在空间(RGB)和频域中的表现,包括流行的卷积神经网络(CNNs)以及最近的视觉变换器(ViTs)和基础模型。此外,我们还探索了最终的特征级融合以提高鲁棒性。最后,我们还利用Grad-CAM分析DL模型的决策,提高可解释性。我们的方法在所有架构中均实现了稳定强劲的性能,凸显了新兴ViTs和基础模型的竞争力,以及特征级融合和频域表示在UWF分析中的潜力。

英文摘要

Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.

2603.06007 2026-05-21 cs.CL cs.AI cs.MA 版本更新

MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

MASFactory: 一种基于图的框架,用于通过Vibe图谱编排基于大语言模型的多智能体系统

Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li, Chen Qian, Chuan Shi, Cheng Yang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出MASFactory,一种基于图的框架,用于通过Vibe图谱编排基于大语言模型的多智能体系统,解决了现有框架在实现复杂图工作流时需要大量手动工作、重用性差和难以整合异构外部上下文源的问题。

Comments Accepted to the ACL 2026 Demo Track. Camera-ready version. 10 pages, 6 figures. Code and documentation are available at: https://github.com/BUPT-GAMMA/MASFactory

详情
AI中文摘要

基于大语言模型的(LLM-based)多智能体系统(MAS)越来越多地被用于通过角色专业化和协作扩展智能体问题解决。MAS工作流可以自然地建模为有向计算图,其中节点执行智能体或子工作流,边编码依赖性和消息传递。然而,目前框架在实现复杂图工作流时仍然需要大量的手动工作,提供有限的重用性,并使整合异构外部上下文源变得困难。为克服这些限制,我们提出了MASFactory,一种用于编排基于大语言模型的MAS的基于图的框架。它引入了Vibe图谱,一种人机交互的方法,将自然语言意图编译成可编辑的工作流规范,然后编译成可执行的图。此外,该框架提供了可重用的组件、技能支持、多模态消息处理和可插拔的上下文整合,以及用于拓扑预览、运行时跟踪和人机交互的可视化工具。我们在七个公开基准上评估了MASFactory,验证了代表性MAS方法的再生产一致性以及Vibe图谱的有效性。我们的代码(https://github.com/BUPT-GAMMA/MASFactory,Apache-2.0许可)和视频演示(https://youtu.be/ANynzVfY32k)均已公开。

英文摘要

Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents or sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components, skill support, multimodal message handling, and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory, licensed under Apache-2.0) and video demonstration (https://youtu.be/ANynzVfY32k) are publicly available.

2603.01712 2026-05-21 cs.AI cs.LG 版本更新

FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

FT-Dojo: 向自主LLM微调迈进的语言代理

Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, Jiang Bian

发表机构 * Peking University(北京大学) Nanjing University(南京大学) Microsoft Research Asia(微软亚洲研究院) The University of Chicago(芝加哥大学)

AI总结 本文提出FT-Dojo交互式基准环境,用于研究自主LLM微调,通过标准化任务接口、共享数据仓库、沙盒执行环境和反馈协议,开发了FT-Agent框架,实现了结构化迭代规划和多级反馈分析,实验显示FT-Agent在13个任务中表现优异,且展示了代理在故障恢复和长期规划中的能力。

Comments 26 pages, 6 figures, 11 tables

详情
AI中文摘要

针对垂直领域LLM微调仍需大量人力劳动的问题,本文引入FT-Dojo交互式基准环境,包含5个领域13个任务。FT-Dojo标准化了任务接口、共享数据仓库、沙盒执行环境、结构化反馈协议和评估流程。进一步开发了FT-Agent框架,通过结构化迭代规划、快速失败验证和多级反馈分析优化数据和训练策略。实验表明FT-Agent在13个任务中表现优异,且通过与前沿代理、开源规划框架和多轮统计对比验证了主要发现。案例研究表明代理可通过累积学习恢复故障,但仍存在因果诊断和长期规划的局限性。实现代码见https://github.com/microsoft/rd-agent。

英文摘要

Fine-tuning large language models for vertical domains remains labor-intensive, requiring practitioners to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning and language agents, end-to-end LLM fine-tuning has not been systematically studied as an interactive agent task. We introduce FT-Dojo, an interactive benchmark environment for autonomous LLM fine-tuning, comprising 13 tasks across 5 domains. Rather than a new collection of static datasets, FT-Dojo standardizes a task interface, shared raw-data repository, sandboxed execution environment, structured feedback protocol, and held-out evaluation procedure. We further develop FT-Agent, a fine-tuning-oriented autonomous framework that uses structured iteration planning, fail-fast validation, and multi-level feedback analysis to refine data and training strategies. Experiments show that FT-Agent provides a strong initial baseline, achieving the best performance on 10 out of 13 tasks, with additional controlled comparisons against frontier agents, open-source planning backbones, and multi-run statistics supporting the main findings. Case studies show that agents can recover from failures through cumulative learning, while still exposing limitations in causal diagnosis and long-horizon planning. The implementation is available at https://github.com/microsoft/rd-agent.

2603.01406 2026-05-21 cs.LG cs.AI cs.NA math.NA 版本更新

One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers

一个运算符统治一切?关于神经PDE求解器中边界索引运算符家族的探讨

Lennon J. Shikhman

发表机构 * College of Computing, Georgia Institute of Technology(佐治亚理工学院计算机学院) Department of Mathematics and Systems Engineering, Florida Institute of Technology(佛罗里达理工学院数学与系统工程系)

AI总结 本文探讨了神经PDE求解器中边界索引运算符家族的核心问题,指出传统方法在边界条件变化时存在非识别性问题,并通过实验验证了在不同边界条件下求解器的局限性。

Comments Published in the ICLR 2026 Workshop on AI & PDEs. 10 pages, 5 figures

详情
AI中文摘要

神经PDE求解器通常被描述为学习映射问题数据到PDE解的运算符。本文作者认为,当边界条件变化时,这种解释通常是不正确的。我们展示了标准的神经运算符训练实际上隐式地学习了一个边界索引的运算符家族,而不是一个单一的、不考虑边界的运算符,其中学习的映射本质上依赖于训练过程中看到的边界条件分布。我们通过将运算符学习框架为边界条件上的条件风险最小化来正式化这一观点,这导致了在训练边界分布之外的非识别性结果。因此,forcing terms或resolution的泛化并不意味着在边界条件上的泛化。我们通过受控实验在泊松方程上支持我们的理论分析,展示了在边界条件转移时的明显退化,不同边界集合之间的跨分布失败,以及在去除边界信息时收敛到条件期望。我们的结果澄清了当前神经PDE求解器的核心限制,并突显了在追求PDE基础模型时需要显式边界意识建模的必要性。

英文摘要

Neural PDE solvers are often described as learning solution operators that map problem data to PDE solutions. In this work, we argue that this interpretation is generally incorrect when boundary conditions vary. We show that standard neural operator training implicitly learns a boundary-indexed family of operators, rather than a single boundary-agnostic operator, with the learned mapping fundamentally conditioned on the boundary-condition distribution seen during training. We formalize this perspective by framing operator learning as conditional risk minimization over boundary conditions, which leads to a non-identifiability result outside the support of the training boundary distribution. As a consequence, generalization in forcing terms or resolution does not imply generalization across boundary conditions. We support our theoretical analysis with controlled experiments on the Poisson equation, demonstrating sharp degradation under boundary-condition shifts, cross-distribution failures between distinct boundary ensembles, and convergence to conditional expectations when boundary information is removed. Our results clarify a core limitation of current neural PDE solvers and highlight the need for explicit boundary-aware modeling in the pursuit of foundation models for PDEs.

2602.24138 2026-05-21 cs.CV cs.AI 版本更新

Multimodal Optimal Transport for Training-free Temporal Segmentation in Surgical Robotics

多模态最优传输用于手术机器人中的无训练时序分割

Omar Mohamed, Edoardo Fazzari, Ayah Al-Naji, Hamdan Alhadhrami, Khalfan Hableel, Saif Alkindi, Ivan Laptev, Cesare Stefanini

发表机构 * Dept. of Robotics, Mohamed bin Zayed University of AI(机器人系,Mohamed bin Zayed人工智能大学)

AI总结 本文提出了一种无需标注的手术时序分割框架TASOT,通过结合时间对齐的文本描述和视觉信息,在统一的不平衡Gromov-Wasserstein最优传输目标下融合视觉和语义线索,实现了在多个公开手术数据集上的显著提升。

详情
AI中文摘要

自动化识别手术阶段和步骤是机器人辅助手术中术中决策支持、工作流自动化和技能评估的基本能力。现有方法要么依赖大规模标注手术数据集,要么需要昂贵的领域特定预训练,这限制了它们在不同机器人平台和临床环境中的实际部署。在本文中,我们提出TASOT(文本增强的动作分割最优传输),一种无需任务特定标注或手术领域预训练的手术时序分割框架。TASOT扩展了动作分割最优传输(ASOT)公式,通过结合直接从输入视频生成的时间对齐文本描述,在统一的不平衡Gromov-Wasserstein最优传输目标下融合视觉和语义线索。视觉表示使用DINOv3提取,而由视觉-语言模型生成的时间描述通过CLIP编码并时间对齐到单个帧,为传输成本提供互补的语义结构。我们在三个公开手术数据集和四个基准设置上评估了TASOT,涵盖腹腔镜和机器人手术程序,显示出显著优于最强的零样本基线:在Cholec80上+18.9 F1,在AutoLaparo上+33.7,在StrasByPass70上+23.7,在BernByPass70上+4.5。这些结果表明,在机器人环境中可以实现细粒度的手术工作流理解,而无需手动训练标注或手术特定的预训练流程,为实际的机器人手术系统提供了一种有前景的替代方案。

英文摘要

Automated recognition of surgical phases and steps is a fundamental capability for intraoperative decision support, workflow automation, and skill assessment in robotic-assisted surgery. Existing approaches either depend on large-scale annotated surgical datasets or require expensive domain-specific pretraining on thousands of labeled videos, limiting their practical deployability across diverse robotic platforms and clinical environments. In this work, we propose TASOT (Text-Augmented Action Segmentation Optimal Transport), an annotation-free framework for surgical temporal segmentation that requires no task-specific annotations or surgical-domain pretraining. TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3, while temporal captions produced by a vision-language model are encoded via CLIP and temporally aligned to individual frames, providing complementary semantic structure to the transport cost. We evaluate TASOT on three public surgical datasets and four benchmark settings spanning laparoscopic and robotic procedures, showing substantial improvements over the strongest zero-shot baselines: +18.9 F1 on Cholec80, +33.7 on AutoLaparo, +23.7 on StrasByPass70, and +4.5 on BernByPass70. These results suggest that fine-grained surgical workflow understanding in robotic settings can be achieved without manual training annotations or surgical-specific pretraining pipelines, offering a promising alternative for real-world robotic surgical systems.

2602.19320 2026-05-21 cs.CL cs.AI 版本更新

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

代理记忆的解剖结构:评估和系统限制的分类与实证分析

Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao, Dingyi Kang, Xu Hu, Feng Chen, Qiannan Li, Bingzhe Li

发表机构 * University of Texas at Dallas(德克萨斯大学达拉斯分校) University of California Davis(加州大学戴维斯分校) Texas A&M University(德克萨斯农工大学)

AI总结 本文通过分类和实证分析,探讨了代理记忆系统的架构和系统限制,揭示了现有系统在基准饱和效应、指标有效性、模型依赖性和内存维护开销等方面的问题,并提出了更可靠的评估方法和可扩展的系统设计方向。

详情
AI中文摘要

代理记忆系统使大型语言模型(LLM)代理能够在长时间交互中保持状态,支持超出固定上下文窗口的长周期推理和个性化。尽管架构发展迅速,这些系统的实证基础仍脆弱:现有基准通常规模不足,评估指标与语义效用不一致,性能在基础模型上变化显著,且系统层面的成本常被忽视。本文从架构和系统角度对代理记忆进行了结构化分析。我们首先基于四种记忆结构介绍了MAG系统简要分类。然后,我们分析了限制当前系统的关键痛点,包括基准饱和效应、指标有效性和判断敏感性、基础模型依赖的准确性,以及内存维护引入的延迟和吞吐量开销。通过将内存结构与实证限制联系起来,本文阐明了当前代理记忆系统为何经常无法达到其理论承诺,并概述了更可靠评估和可扩展系统设计的方向。

英文摘要

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.

2602.18532 2026-05-21 cs.CV cs.AI cs.RO 版本更新

VLANeXt: Recipes for Building Strong VLA Models

VLANeXt: 构建强大VLA模型的配方

Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy

发表机构 * S-Lab, Nanyang Technological University(南洋理工大学S实验室) SenseTime Research(商汤研究) Sun Yat-sen University(中山大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文通过统一框架和评估设置重新审视VLA设计空间,系统分析了基础组件、感知要素和动作建模视角,总结出12项关键发现,提出了一种简单有效的VLA模型VLANeXt,并在LIBERO和LIBERO-plus基准测试中超越了现有方法,同时提供了易于使用的代码库。

Comments Accepted in ICML 2026, Project Page: https://dravenalg.github.io/VLANeXt/

详情
AI中文摘要

在大基础模型兴起之后,视觉-语言-动作模型(VLAs)应运而生,利用视觉语言模型的强大视觉和语言理解能力进行通用目的策略学习。然而,当前VLA领域仍处于碎片化和探索阶段。尽管许多团队提出了各自的VLA模型,但训练协议和评估设置的一致性不足,使得难以确定哪些设计选择真正重要。为了使这一发展领域更具结构化,我们重新审视VLA设计空间,基于类似RT-2的简单VLA基线,系统地分析了三个维度:基础组件、感知要素和动作建模视角。从这项研究中,我们提炼出12项关键发现,共同构成了构建强大VLA模型的实用配方。该探索的成果是一种简单而有效的模型VLANeXt,它在LIBERO和LIBERO-plus基准测试中优于现有方法,并在现实世界实验中表现出色。我们还发布了一个统一且易于使用的代码库,以重现我们的发现、探索设计空间并基于共享基础开发新的VLA变体。代码库可在https://github.com/DravenALG/VLANeXt上获得。

英文摘要

Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding from Vision-Language Models for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2, which is the origin of VLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. It outperforms the state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong performance in real-world experiments. We release a unified and easy-to-use codebase to reproduce our findings, explore the design space, and develop new VLA variants on top of a shared foundation. The codebase is available at https://github.com/DravenALG/VLANeXt.

2602.17062 2026-05-21 cs.AI 版本更新

Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

保留次优动作以跟随移动的最优解在多智能体强化学习中

Yonghyeon Jo, Sunwoo Lee, Seungyul Han

发表机构 * Graduate School of Artificial Intelligence(人工智能研究生院) Ulsan National Institute of Science and Technology (UNIST)(乌山国立科学与技术研究院(UNIST))

AI总结 本文提出S2Q算法,通过学习多个子价值函数来保留替代的高价值动作,以解决多智能体强化学习中适应值函数变化时的最优解移动问题,实验表明其在多智能体强化学习基准上表现优异。

Comments 10 technical page followed by references and appendix. Accepted to ICLR 2026

Journal ref International Conference on Learning Representations (ICLR), 2026

详情
AI中文摘要

价值分解是合作多智能体强化学习(MARL)的核心方法。然而,现有方法仍然依赖于单一最优动作,在训练过程中底层价值函数变化时难以适应,往往收敛到次优策略。为了解决这一限制,我们提出了Successive Sub-value Q-learning(S2Q),该方法学习多个子价值函数以保留替代的高价值动作。将这些子价值函数纳入基于Softmax的行为策略中,S2Q鼓励持续探索并使$Q^{ ext{tot}}$能够快速调整到变化的最优解。在具有挑战性的MARL基准上的实验表明,S2Q在各种MARL算法中始终表现更优,证明了其改进的适应性和整体性能。我们的代码可在https://github.com/hyeon1996/S2Q上获得。

英文摘要

Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), which learns multiple sub-value functions to retain alternative high-value actions. Incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables $Q^{\text{tot}}$ to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at https://github.com/hyeon1996/S2Q.

2602.16608 2026-05-21 cs.CL cs.AI cs.CV cs.LG 版本更新

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

可解释的人工智能:面向Transformer模型的上下文感知分层集成梯度方法

Melkamu Abay Mersha, Jugal Kalita

发表机构 * College of Engineering and Applied Science, University of Colorado Colorado Springs(科罗拉多州立大学工程与应用科学学院)

AI总结 本文提出了一种上下文感知分层集成梯度框架(CA-LIG),用于解释Transformer模型的决策过程,通过计算每个Transformer块内的分层集成梯度,并将这些token级属性与类特定的注意力梯度融合,从而生成具有符号和上下文敏感性的属性图,以捕捉支持和反对的证据,并追踪Transformer层中的相关性层次流动。

详情
AI中文摘要

Transformer模型在多个领域和任务中实现了最先进的性能,然而其深层表示使得预测难以解释。现有的可解释性方法依赖于最终层的属性,只能捕捉局部token级属性或全局注意力模式,缺乏对token间依赖关系和结构组件的上下文感知能力。它们还无法捕捉相关性如何在层之间演变以及结构组件如何影响决策。为了解决这些限制,我们提出了上下文感知分层集成梯度(CA-LIG)框架,一种统一的层次属性框架,该框架在每个Transformer块内计算分层集成梯度,并将这些token级属性与类特定的注意力梯度融合。这种整合产生了带有符号和上下文敏感性的属性图,能够捕捉支持和反对的证据,同时追踪Transformer层中的相关性层次流动。我们评估了CA-LIG框架在多样化的任务、领域和Transformer模型家族中的表现,包括使用BERT进行情感分析和长多类文档分类,使用XLM-R和AfroLM在低资源语言设置中进行仇恨言论检测,以及使用Masked Autoencoder Vision Transformer模型进行图像分类。在所有任务和架构中,CA-LIG提供了更忠实的属性,显示出对上下文依赖的更强敏感性,并产生了更清晰、更语义连贯的可视化结果,优于现有可解释性方法。这些结果表明,CA-LIG提供了更全面、上下文感知和可靠的Transformer决策解释,推动了深度神经网络的实用可解释性和概念理解。

英文摘要

Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we proposed the \textbf{Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework}, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with Masked Autoencoder vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.

2602.11675 2026-05-21 cs.AI 版本更新

Epistemic Regret Minimization: Label-Free Causal Critique Beyond Outcome Reward

知识悔恨最小化:超越结果奖励的无标签因果批评

Edward Y. Chang, Longling Geng

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出了一种名为知识悔恨最小化(ERM)的框架,通过评估模型推理轨迹的因果结构而非答案本身,来改进因果批评,从而在没有正确答案的情况下进行无标签操作,并在多个前沿LLM上实验表明,ERM在因果批评任务中表现优于传统方法。

Comments 43 pages, 22 tables, 18 figures

详情
AI中文摘要

大型语言模型可以正确回答因果问题,但其正确性基于错误的原因。当前的强化学习方法奖励模型得出的结论,但忽略其原因,强化了相关性捷径——这是我们称之为“奖励固化”的失败。我们引入了知识悔恨最小化(ERM),一种框架,它批评模型推理轨迹的因果结构,而非其答案。应用已建立的因果原则,ERM标记未审查的混杂因素、相关性-干预混淆以及从暴露推理轨迹中未检查的后门路径。该框架允许无标签操作——无需真实的因果图或正确答案,并且我们在实验中分别区分了有利的基准衍生批评、错误方向提示以及完全无标签的判断生成批评。在单个回合内,ERM检测并修复因果推理错误;在多个回合中,它将干预证据积累到可用于无答案键的奖励信号中。在六个前沿LLM上的1360个场景实验中表明,推理密集型模型(GPT-4 Turbo,GPT-5.2)对结果仅修正(25-31%恢复)的抵抗,但对因果批评(78-91%)的响应,获得+53-59 pp。标准测试时间方法(自一致性,Best-of-N,Self-Refine)在因果任务中表现不佳,而ERM将残余Rung Collapse从55-70%降至4%。一个分离定理证明仅结果奖励无法缩小这一差距;受控模拟证实了知识反馈确实能缩小这一差距,其表现优于仅结果奖励基线38倍。

英文摘要

Large language models can answer causal questions correctly for the wrong reasons. Current RL methods reward \emph{what} a model concludes but ignore \emph{why}, reinforcing correlational shortcuts -- a failure we call \emph{Reward Entrenchment}. We introduce \emph{Epistemic Regret Minimization} (\erm), a framework that critiques the causal \emph{structure} of a model's reasoning trace rather than its answer. Applying established causal principles, \erm flags unexamined confounders, correlation--intervention conflation, and unchecked back-door paths from exposed reasoning traces. The framework admits \emph{label-free} operation -- without the true causal graph or correct answer -- and we separately distinguish favorable benchmark-derived critique, error-direction cues, and fully label-free judge-generated critique in the experiments. Within a single episode, \erm detects and repairs causal reasoning errors; across episodes, it accumulates interventional evidence into a reward signal applicable where no answer key exists. Experiments on 1,360 scenarios across six frontier LLMs show that reasoning-heavy models (GPT-4 Turbo, GPT-5.2) resist outcome-only correction (25--31\% recovery) yet respond to causal critique (78--91\%), gaining $+53$--$59$ pp. Standard test-time methods (self-consistency, Best-of-$N$, Self-Refine) \emph{underperform} outcome-only reprompting on causal tasks, while ERM reduces residual Rung Collapse from 55--70\% to 4\%. A separation theorem proves outcome-only reward cannot close this gap; a controlled simulation confirms epistemic feedback does, outperforming outcome-only baselines 38-fold.

2602.08028 2026-05-21 cs.CL cs.AI 版本更新

Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

偏离以诱导提示:多理性诱导用于零样本推理

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

发表机构 * Department of Computer Science and Information Engineering, National Taiwan University, Taiwan(国家台湾大学计算机科学与信息工程系) Institute of Information Science, Academia Sinica, Taiwan(学术院信息科学研究所) AI Research Center (AINTU), National Taiwan University, Taiwan(国家台湾大学人工智能研究中心(AINTU))

AI总结 本研究提出DIP框架,通过生成多个多样化的高层理由并诱导最终计划,以提升零样本推理的准确性,克服了传统链式思考提示中推理路径不稳定的问题。

Comments Accepted to Findings of IJCNLP-AACL 2025

详情
AI中文摘要

为了解决标准链式思考提示中无引导推理路径的不稳定性,最近的方法通过首先引导大型语言模型(LLMs)生成单一推理策略来指导模型。然而,仅依赖一个策略来回答每个问题仍然限制了在多样化任务中的性能。我们提出了偏离以诱导提示(DIP),一个框架,首先提示LLM为每个问题生成多个多样化的高层理由。每个理由随后被扩展成详细的、分步骤的草案计划。最后,这些草案计划被诱导成最终计划。DIP在不依赖资源密集型采样的情况下增强了零样本推理的准确性。实验表明,DIP优于单一策略提示,证明了多计划诱导对基于提示的推理的有效性。

英文摘要

To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.

2602.04907 2026-05-21 cs.LG cs.AI stat.ME 版本更新

Causal Discovery from Heteroscedastic Stochastic Dynamical Systems under Imperfect Physical Models

从不完美物理模型下的异方差随机动力系统中进行因果发现

Jianhong Chen, Naichen Shi, Xubo Yue

发表机构 * Department of Mechanical & Industrial Engineering(机械与工业工程系) Northeastern University(东北大学) Department of Industrial Engineering and Management Sciences(工业工程与管理科学系) Department of Mechanical Engineering(机械工程系) Northwestern University(西北大学)

AI总结 本文提出了一种整合因果发现框架,利用随机微分方程中的部分物理知识来提高动态系统中因果图的恢复能力,同时分析了在不完美物理模型下的鲁棒性。

Comments 101 pages

详情
AI中文摘要

因果发现是一种数据驱动的复杂系统分析范式,而基于物理的模型,如常微分方程(ODEs),为现实世界的动力学过程提供了机理结构。整合这些范式可以提高可识别性、稳定性和鲁棒性。然而,真实动力系统往往表现出循环交互和非平稳性,而许多因果发现方法依赖于无循环、平稳或平衡假设。我们提出了一种整合因果发现框架,利用随机微分方程(SDEs)中的部分物理知识。漂移项编码已知的ODE动力学,而扩散项捕捉超出规定物理的未知因果耦合。我们开发了一种可扩展的稀疏诱导最大准似然估计器,并通过理论上合理的稳定技术来改善优化景观。在温和条件下,我们为稳定和不稳定SDEs建立了因果图恢复保证。我们还分析了我们的因果图估计在ODE不准确情况下的鲁棒性,并澄清了引入的稳定技术如何平衡数值稳定性和统计恢复能力。在线性SDEs和非线性基准测试,包括具有无循环和循环结构的Lotka-Volterra和Lorenz动力学上,实验显示了比数据驱动基线更好的图恢复和鲁棒性。我们还通过在我们的因果发现框架内重建随机SIR动力学来展示实际应用,以在现实世界流行病数据中进行因果图重建。

英文摘要

Causal discovery is a data-driven paradigm for analyzing complex systems, while physics-based models, such as ordinary differential equations (ODEs), provide mechanistic structure for real-world dynamical processes. Integrating these paradigms can improve identifiability, stability, and robustness. However, real dynamical systems often exhibit cyclic interactions and nonstationarity, whereas many causal discovery methods rely on acyclicity, stationarity, or equilibrium assumptions. We propose an integrative causal discovery framework for dynamical systems that leverages partial physical knowledge through stochastic differential equations (SDEs). The drift term encodes known ODE dynamics, while the diffusion term captures unknown causal couplings beyond the prescribed physics. We develop a scalable sparsity-inducing maximum quasi-likelihood estimator with a theoretically justified stabilization technique to improve the optimization landscape. Under mild conditions, we establish causal graph recovery guarantees for both stable and unstable SDEs. We also analyze robustness of our causal graph estimate to ODE misspecification and clarify how the introduced stabilization technique balances numerical stability and statistical recoverability. Experiments on linear SDEs and nonlinear benchmarks, including Lotka-Volterra and Lorenz dynamics with acyclic and cyclic structures, show improved graph recovery and robustness over data-driven baselines. We also demonstrate practical utility on real-world epidemic data by reconstructing stochastic SIR dynamics within our causal discovery framework.

2602.03004 2026-05-21 cs.LG cs.AI 版本更新

Graph Autoencoder for Process Monitoring

用于过程监控的图自编码器

Xiangrui Zhang

发表机构 * School of Information and Control Engineering, China University of Mining and Technology(信息与控制工程学院,中国矿业大学)

AI总结 本文提出了一种因果图时空自编码器(CGSTAE),通过结合基于空间自注意力机制的空间相关图结构学习模块和利用图卷积长短期记忆(GCLSTM)的空间-时间编码器-解码器模块,以提高工业过程监控的可靠性和可解释性。

详情
AI中文摘要

为提高工业过程监控的可靠性和可解释性,本文提出了一种因果图时空自编码器(CGSTAE)。CGSTAE的网络架构结合了两个组件:基于空间自注意力机制的空间相关图结构学习模块(SSAM)和利用图卷积长短期记忆(GCLSTM)的空间-时间编码器-解码器模块。SSAM通过捕捉变量之间的动态关系来学习相关图,而一种新的三步因果图结构学习算法被引入,以从这些相关图中推导出因果图。该算法利用因果不变性原理的反向视角来揭示从变化相关性中得到的不变因果图。空间-时间编码器-解码器由GCLSTM单元构建,在序列到序列框架内重建时间序列过程数据。所提出的CGSTAE通过特征空间和残差空间中的两个统计量实现有效的过程监控和故障检测。最后,我们通过田纳西东部过程和一个现实世界的空气分离过程验证了CGSTAE在过程监控中的有效性。

英文摘要

To improve the reliability and interpretability of industrial process monitoring, this article proposes a Causal Graph Spatial-Temporal Autoencoder (CGSTAE). The network architecture of CGSTAE combines two components: a correlation graph structure learning module based on spatial self-attention mechanism (SSAM) and a spatial-temporal encoder-decoder module utilizing graph convolutional long-short term memory (GCLSTM). The SSAM learns correlation graphs by capturing dynamic relationships between variables, while a novel three-step causal graph structure learning algorithm is introduced to derive a causal graph from these correlation graphs. The algorithm leverages a reverse perspective of causal invariance principle to uncover the invariant causal graph from varying correlations. The spatial-temporal encoder-decoder, built with GCLSTM units, reconstructs time-series process data within a sequence-to-sequence framework. The proposed CGSTAE enables effective process monitoring and fault detection through two statistics in the feature space and residual space. Finally, we validate the effectiveness of CGSTAE in process monitoring through the Tennessee Eastman process and a real-world air separation process.

2602.02660 2026-05-21 cs.AI 版本更新

MARS: Modular Agent with Reflective Search for Automated AI Research

MARS:模块化代理与反思搜索用于自动化AI研究

Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, Jinsung Yoon

发表机构 * Google Cloud AI Research(谷歌云人工智能研究)

AI总结 本文提出MARS框架,通过预算感知规划、模块化构建和比较反思记忆解决复杂机器学习工程任务中的执行成本与性能归因问题,实现开放源代码框架在MLE-Bench上的最佳性能。

Comments Paper published at International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

自动化AI研究的关键瓶颈在于执行复杂的机器学习工程(MLE)任务。MLE不同于一般软件工程,因其计算成本高昂(例如模型训练)和性能归因不透明。当前基于LLM的代理在此方面表现不佳,常生成忽视执行成本和因果因素的单体脚本。我们引入MARS(模块化代理与反思搜索),一种优化于自主AI研究的框架。MARS依赖三个支柱:(1)通过成本受限的蒙特卡洛树搜索(MCTS)进行预算感知规划,以显式平衡性能与执行成本;(2)模块化构建,采用“设计-分解-实现”流程来管理复杂的研究存储库;(3)比较反思记忆,通过分析解决方案差异来解决信用分配问题,从而提炼出高信号的洞察。MARS在可比条件下,在开放源代码框架中实现了MLE-Bench上的最佳性能,与全球排行榜上顶尖方法竞争性相当。此外,系统表现出定性“啊哈!”时刻,其中所有使用的63%的教训源自跨分支转移,表明代理能有效在搜索路径间泛化洞察。

英文摘要

A critical bottleneck in automating AI research is the execution of complex machine learning engineering (MLE) tasks. MLE differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.

2602.02304 2026-05-21 cs.AI cs.LG 版本更新

Comparing Explanations is Not Enough, Explain the Change: New Standards are Needed to Explain Behavioral Shifts in Large Language Models

比较解释并不足够,解释变化:需要新的标准来解释大型语言模型中的行为转变

Martino Ciaperoni, Marzio Di Vece, Roberto Pellungrini, Luca Pappalardo, Fosca Giannotti, Francesco Giannini

发表机构 * Scuola Normale Superiore(诺莱学院) ISTI-CNR(意大利国家研究委员会ISTI研究所) University of Pisa(比萨大学)

AI总结 本文提出了一种新的XAI方法,旨在解释大型语言模型在干预后行为转变的原因和机制,以应对现有解释方法无法解释行为转变的问题。

详情
AI中文摘要

大规模基础模型在受到缩放、微调、人类反馈强化学习或上下文学习等干预时会表现出行为转变。当前的可解释性方法结构上不适用于解释这些转变,因为它们要么将模型视为静态对象,如传统可解释AI(XAI)方法所做的,要么仅仅比较不同模型检查点的独立解释。因此,这些方法无法解释两个模型实例之间的功能转变,其中某种行为在干预后发生了变化。这种差距在欧盟人工智能法案、美国州立法和中国人工智能法规等司法管辖区中带来了重大治理风险,这些法规要求记录重大系统修改的因果链。本文主张,解释大型语言模型的行为转变需要一种系统的方法,将转变本身作为解释的主要对象:即解释干预如何和为何将参考模型转变为具有不同行为的更新模型。为了支持这一主张,我们引入了称为比较XAI(XAI_Δ)的新XAI范式,旨在解释两个模型检查点之间的差异,其中行为发生了变化,以及一组规范,规定XAI_Δ解释器和解释必须满足的条件,包括可比性、有效性、可操作性和监控,目标是将模型审计 grounded 在明确、可测量的要求中。最后,我们通过示例实验提供初步证据,表明在实践中需要XAI_Δ,将结果汇总成一份转换报告,直接可用于治理和事件记录。

英文摘要

Large-scale foundation models exhibit \emph{behavioral shifts} when subjected to interventions such as scaling, fine-tuning, reinforcement learning with human feedback, or in-context learning. Current explainability methods are structurally ill-suited to explain these shifts, because they either treat models as static objects, as traditional eXplainable AI (XAI) approaches do, or merely compare independent explanations across different checkpoints of a model. As a result, these approaches fail to explain the functional transition between two model instances in which a certain behavior has shifted following an intervention. This gap creates significant governance risks across jurisdictions including the EU AI Act, US state legislation, and Chinese AI regulations, which require documenting causal chains for substantial system modifications. This position paper argues that explaining behavioral shifts in large language models requires a principled approach that treats the shift itself as the primary object of explanation: namely, one that explains how and why an intervention transforms a reference model into an updated model with different behavior. To support this claim, we introduce \textit{Comparative} XAI (XAI$_Δ$), a novel XAI paradigm aimed at explaining the difference between two model checkpoints where a behavior has shifted, together with a set of desiderata specifying what XAI$_Δ$ explainers and explanations must satisfy, including comparability, validity, actionability, and monitoring, with the goal of grounding model auditing in explicit, measurable requirements. Finally, we provide preliminary evidence suggesting the need for XAI$_Δ$ in practice through illustrative experiments, compiling the resulting findings into a transition report directly usable for governance and incident documentation.

2602.00933 2026-05-21 cs.SE cs.AI 版本更新

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

MCP-Atlas:一个大规模的工具使用能力基准测试,使用真实的MCP服务器

Chaithanya Bandi, Razvan-Gabriel Dumitru, Ben Hertzberg, Divyansh Agarwal, Geobio Boo, Tejas Polakam, Sami Hassaan, Jeff Da, HiJae Kim, Vipul Gupta, Manasi Sharma, Andrew Park, Martin Dimakis, Ernesto Gabriel Hernandez Montoya, Dan Rambado, Ivan Salazar, Rafael Cruz, MohammadHossein Rezaei, Chetan Rane, Ben Levin, Daniel Yue Zhang, Brad Kenstler, Bing Liu

发表机构 * National University of Singapore(新加坡国立大学) Scale AI

AI总结 本文提出MCP-Atlas,一个用于评估工具使用能力的基准测试,基于真实MCP服务器,包含1000个由人类专家编写和验证的任务,覆盖36个真实MCP服务器和220种工具,通过任务级别的评估发现模型在工具调用和认知能力上的表现。

Comments 25 pages, 3 figures, 9 tables

详情
AI中文摘要

模型上下文协议(MCP)正逐渐成为一种标准接口,通过该接口大型语言模型(LLM)代理可以发现并调用外部工具。然而,现有的MCP评估在三个方面存在不足:缺乏真实多步骤工作流和跨服务器编排、缺乏真实MCP服务器而非模拟器、以及缺乏结构化、可重复的声明级评分,与代理的冗长或风格无关。我们引入MCP-Atlas,一个用于测量工具使用能力的基准测试,针对生产MCP服务器。MCP-Atlas包含1000个自然语言任务,由人类专家编写和验证,涵盖36个真实MCP服务器和220种工具。提示不指定服务器、工具或参数,要求代理在语义上可能的干扰项中识别相关工具,并编排多步骤、跨服务器工作流。每个任务均使用声明级评分标准评分,最终答案根据工具输出中的原子事实声明评分。这种以答案为中心的评分允许有效的替代工具调用轨迹获得认可。我们将其与一个11类诊断分类法相结合,将工具调用失败与任务理解、综合、解析和停止的认知失败区分开来。在20个前沿模型(来自六个供应商)在匹配的任务级别条件下评估后,我们发现,在0.75声明覆盖率阈值下,通过率高达82.2%,并呈现出明显的三层性能结构。自动化诊断显示,63.3%的诊断失败是认知性的而非工具调用相关的。值得注意的是,一些高性能模型在成功执行工具后由于提前停止或综合错误而失败。我们发布了任务模式、容器化Harness、声明评估器和一个500任务的公共分割,同时保留500任务的私人分割以保持排行榜的完整性。代码在https://github.com/scaleapi/mcp-atlas。

英文摘要

The Model Context Protocol (MCP) is emerging as a standard interface through which large language model (LLM) agents discover and invoke external tools. However, existing MCP evaluations fall short along three key axes: realistic multi-step workflows with cross-server orchestration, breadth across authentic MCP servers rather than mocks, and structured, reproducible claim-level scoring disentangled from agent verbosity or style. We introduce MCP-Atlas, a benchmark for measuring tool-use competency against production MCP servers. MCP-Atlas contains 1,000 natural-language tasks written and verified by human experts spanning 36 real MCP servers and 220 tools. Prompts do not specify servers, tools, or parameters, requiring agents to identify relevant tools among semantically plausible distractors and to compose multi-step, cross-server workflows. Each task is scored with a claim-level rubric, where final answers are scored against atomic factual claims grounded in tool outputs. This answer-centric scoring permits valid alternative tool-call trajectories to receive credit. We pair this with an 11-category diagnostic taxonomy that disentangles tool-call failures from cognitive failures in task understanding, synthesis, parsing, and stopping. Evaluating 20 frontier models from six providers under matched task-level conditions, we find pass rates up to 82.2% at a 0.75 claim coverage threshold and a clear three-tier performance structure. Automated diagnostics show that 63.3% of diagnosed failures are cognitive rather than tool-call related. Notably, several high-performing models fail after successful tool execution due to premature stopping or incorrect synthesis. We release the task schema, containerized harness, claim evaluator, and a 500-task public split, while reserving a 500-task private split to preserve leaderboard integrity. The code is at https://github.com/scaleapi/mcp-atlas.

2601.23086 2026-05-21 cs.AI 版本更新

Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks

从输出监督学习的链式思维混淆可以泛化到未见过的任务

Nathaniel Mitrani Hadida, Sassan Bhanji, Cameron Tice, Puria Radmard

发表机构 * University of Cambridge(剑桥大学) Geodesic Research

AI总结 本文研究了链式思维(CoT)推理中混淆现象的泛化能力,发现模型在学习混淆推理轨迹时,能够将这种混淆行为及其在未见过的任务中表现出来,从而影响模型的可监控性。

详情
AI中文摘要

链式思维(CoT)推理通过使大型语言模型(LLM)能够规划、探索和反思其行动,显著提升了性能。CoT也是监控这些代理行为的强大工具:当忠实时,它们提供模型决策过程的解释,并为危险行为发出早期警告。然而,优化压力可能会导致模型混淆推理轨迹,失去这一有益属性。我们证明混淆可以跨任务泛化;学习混淆涉及奖励黑客(例如访问和利用泄露信息)的推理的模型,不仅在未见过的奖励黑客设置中泛化了奖励黑客行为及其混淆。最令人担忧的是,我们显示当仅惩罚模型关闭CoT后的最终动作时,CoT推理的混淆及其跨任务泛化也随之发生。我们的发现表明,当前对有害生成的惩罚实践可能会无意中以不可预测的方式减少LLM的广泛可监控性。

英文摘要

Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs by enabling planning, exploration, and deliberation of their actions. CoT is also a powerful tool for monitoring the behaviours of these agents: when faithful, they offer interpretations of the model's decision making process, and an early warning sign for dangerous behaviours. However, optimisation pressures placed on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property. We show that obfuscation can generalise across tasks; models that learn to obfuscate reasoning involving reward hacking (e.g. accessing and utilising leaked information) generalise both the reward hacking behaviour and its obfuscation in CoT to unseen reward hacking settings. Most worryingly, we show that obfuscation of CoT reasoning, and its generalisation across tasks, also follows when we penalise only the model's final actions after closing its CoT. Our findings suggest that current practices of penalising harmful generations may inadvertently lead to a reduction in the broader monitorability of LLMs in unpredictable ways.

2601.04068 2026-05-21 cs.CV cs.AI 版本更新

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

注意生成细节:面向视频扩散模型的直接局部化细节偏好优化

Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Alibaba Group - Taobao & Tmall Group(阿里巴巴集团-淘宝 & 天猫集团)

AI总结 本文提出LocalDPO,一种新的后训练框架,通过从真实视频中构建局部偏好对,并在时空区域层面优化对齐,以提高视频生成的质量和人类偏好评分。

Comments Accepted by CVPR 2026

详情
AI中文摘要

将文本到视频的扩散模型与人类偏好对齐对于生成高质量视频至关重要。现有的直接偏好优化(DPO)方法依赖于多样本排序和任务特定的批评模型,这效率低下且常导致模糊的全局监督。为了解决这些限制,我们提出了LocalDPO,一种新的后训练框架,该框架从真实视频中构建局部偏好对,并在时空区域层面进行优化。我们设计了一个自动化流程,高效地收集偏好对数据,通过单次提示推理生成偏好对,消除了对外部批评模型或人工标注的需求。具体来说,我们将高质量的真实视频作为正样本,并通过局部随机时空掩码来生成对应的负样本,仅使用冻结的基模型恢复被掩码的区域。在训练过程中,我们引入了区域感知的DPO损失,将偏好学习限制在被损坏的区域以实现快速收敛。在Wan2.1和CogVideoX上的实验表明,LocalDPO在视频保真度、时间连贯性和人类偏好评分方面优于其他后训练方法,建立了更高效和精细的视频生成器对齐范式。代码可在https://github.com/1170300714/Local-DPO上获得。

英文摘要

Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Otimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline to efficiently collect preference pair data that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.The code is available at https://github.com/1170300714/Local-DPO.

2601.00473 2026-05-21 cs.LG cs.AI 版本更新

Deep Neural Networks as Discrete Dynamical Systems: Implications for Physics-Informed Learning

深度神经网络作为离散动力系统:对物理信息学习的启示

Abhisek Ganguly, Santosh Ansumali, Sauro Succi

发表机构 * Engineering Mechanics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research(纳拉扬·德赛高级科学研究中心工程力学单元) Italian Institute of Technology(意大利理工学院) University of Roma Tre(罗马三大学) Physics Department, Harvard University(哈佛大学物理系) Cornell University(康奈尔大学)

AI总结 本文探讨了深度神经网络与离散动力系统之间的类比,通过比较Burgers方程和Eikonal方程的数值/精确解与PINNs获得的解,展示了PINN学习在近似相同系统动力学时提供了一种不同的计算路径,同时指出PINNs的密集参数表示在高维情况下可能具有优势。

详情
AI中文摘要

我们重新审视了前馈深度神经网络(DNNs)与源自神经积分方程及其相应偏微分方程(PDE)形式的离散动力系统之间的类比。本文呈现了Burgers方程和Eikonal方程的数值/精确解与通过PINNs获得的解的比较分析。我们展示了PINN学习在近似本质上相同的系统动力学时提供了一种不同于标准数值离散化的计算路径。在此框架下,DNNs可以被解释为离散动力系统,其层间演进方法趋向于吸引子,多个参数配置可能产生可比的解,反映了逆映射的退化性。与有限差分(FD)过程相关的结构化算子不同,PINNs学习密集的参数表示,这些表示与经典离散化 stencil 无直接关联。这种分布式表示通常涉及更多的参数,导致可解释性降低和计算成本增加。然而,这种额外的灵活性可能在高维情况下提供优势,其中经典网格方法变得不切实际。

英文摘要

We revisit the analogy between feed-forward deep neural networks (DNNs) and discrete dynamical systems derived from neural integral equations and their corresponding partial differential equation (PDE) forms. A comparative analysis between the numerical/exact solutions of the Burgers' and Eikonal equations, and the same obtained via PINNs is presented. We show that PINN learning provides a different computational pathway compared to standard numerical discretization in approximating essentially the same underlying dynamics of the system. Within this framework, DNNs can be interpreted as discrete dynamical systems whose layer-wise evolution approaches attractors, and multiple parameter configurations may yield comparable solutions, reflecting the degeneracy of the inverse mapping. In contrast to the structured operators associated with finite-difference (FD) procedures, PINNs learn dense parameter representations that are not directly associated with classical discretization stencils. This distributed representation generally involves a larger number of parameters, leading to reduced interpretability and increased computational cost. However, the additional flexibility of such representations may offer advantages in high-dimensional settings where classical grid-based methods become impractical.

2512.14896 2026-05-21 cs.CL cs.AI 版本更新

DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline

DrugRAG: 通过一种新颖的检索增强生成流水线提升药学LLM性能

Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Seyed Reza Tavakoli, Farbod Davoodi, MohammadReza KarimiNejad, Parham Abed Azad, Fatemeh Latifi, Ali Sabzi, Armin Khosravi, Siavash Ahmadi, Babak Khalaj, Mohammad Hossein Rohban, Glolamali Aminian, Zohreh Amoozgar, Tahereh Javaheri

发表机构 * Department of Medicinal Chemistry, Faculty of Pharmacy, Tehran University of Medical Sciences(药学系,泰赫兰医科大学) Department of Computer Sciences, Faculty of Mathematics and Computer Sciences, Amir Kabir University of Technology(计算机科学系,阿米尔·卡比尔技术大学) Department of Mathematical Sciences, Sharif University of Technology(数学科学系,沙菲克技术大学) Department of Computer Sciences, Missouri University of Science and Technology(计算机科学系,密苏里科学与技术大学) Department of Computer Engineering, Sharif University of Technology(计算机工程系,沙菲克技术大学) Department of Faculty of Interdisciplinary Science and Technology, Tarbiat Modares University(跨学科科学与技术学院,塔里亚特莫达res大学) Electronics Research Institute, Sharif University of Technology(电子研究所,沙菲克技术大学) Department of Electrical Engineering, Sharif University of Technology(电气工程系,沙菲克技术大学) The Alan Turing Institute, London, United Kingdom(艾伦·图灵研究所,伦敦,英国) Department of Radiation Oncology, Massachusetts General Hospital & Harvard Medical School(放射肿瘤科,麻省总医院及哈佛医学院) Health Informatics Lab, Metropolitan College, Boston University(健康信息学实验室,波士顿大学)

AI总结 本研究评估了大型语言模型在药学执业资格问答任务中的性能,并开发了一种外部知识整合方法以提高准确性,通过DrugRAG流水线整合结构化药物知识,从而提升药学相关问答任务的LLM性能。

Comments 14 pages, 2 figures, 2 tables. The revised version includes McNemar's paired statistical analysis, Wilson confidence intervals, expanded methodological clarifications, a revised discussion of evidence retrieval, improved reproducibility details, and updated limitations

详情
AI中文摘要

在本研究中,我们评估了大型语言模型(LLM)在药学执业资格问答任务中的性能,并开发了一种外部知识整合方法以提高准确性。我们使用一个包含141个问题的药学数据集,对十个参数规模不同的LLM(8十亿到70十亿以上)进行了基准测试,测量了基线准确性。基线性能范围从46%到92%,其中GPT-5(92%)和o3(89%)取得了最高分数,而较小的开源模型表现显著较低。然后,我们开发了DrugRAG,一种三步检索增强生成(RAG)流水线,该流水线检索结构化、基于证据的药物信息,并将上下文药理学证据添加到模型提示中,该流水线在模型架构或参数无需更改的情况下外部运行。DrugRAG在所有五个评估模型上均提高了准确性,提升幅度范围从7到21个百分点(例如,Gemma 3 27B:61.0%到71%,Llama 3.1 8B:46%到67%)。McNemar分析显示,这些改进在较小和中等规模的开源模型中具有统计学显著性。这些发现表明,通过DrugRAG整合结构化外部药物知识可以提高LLM在药学相关问答任务中的性能,而无需修改底层模型,为提升基于证据的药学相关AI应用提供了实用的流水线。

英文摘要

In our study, we evaluated large language model (LLM) performance on pharmacy licensure-style question-answering tasks and developed an external knowledge integration method to improve accuracy. We benchmarked ten LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset, measuring baseline accuracy without modification. Baseline performance ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores, while smaller open-source models showed substantially lower performance. We then developed DrugRAG, a three-step retrieval-augmented generation (RAG) pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence, operating externally and requiring no changes to model architecture or parameters. DrugRAG increased accuracy across all five evaluated models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61.0% to 71%, Llama 3.1 8B: 46% to 67%). McNemar analyses demonstrated statistically significant paired improvements primarily in smaller and mid-sized open-source models. These findings demonstrate that integrating structured external drug knowledge via DrugRAG can improve LLM performance on pharmacy-focused question-answering tasks without modifying the underlying models, providing a practical pipeline for enhancing evidence-based pharmacy-focused AI applications.

2512.13402 2026-05-21 cs.CV cs.AI 版本更新

End2Reg: Learning Task-Specific Segmentation for Markerless Registration in Spine Surgery

End2Reg: 为无标记定位学习任务特定分割在脊柱手术中

Lorenzo Pettinari, Sidaty El Hadramy, Michael Wehrli, Philippe C. Cattin, Daniel Studer, Carol C. Hasler, Maria Licci

发表机构 * Department of Biomedical Engineering, University of Basel, Allschwil, Switzerland(巴塞尔大学生物医学工程系,瑞士Allschwil) Department of Orthopedics, University Children’s Hospital, Basel, Switzerland(巴塞尔大学儿童医院骨科部,瑞士Basel)

AI总结 本文提出End2Reg,一种端到端深度学习框架,通过联合优化分割和定位,无需分割标签和手动步骤,从而提高脊柱手术中无标记导航的精度。

Comments Early Accepted MICCAI 2026. Code and interactive visualizations: https://lorenzopettinari.github.io/end-2-reg/

详情
AI中文摘要

脊柱手术中的术中导航需要毫米级的精度。目前,这通过辐射强度大的术中成像和骨锚定标记实现,但这些标记侵入性且会干扰手术流程。无标记RGB-D定位方法提供了一种有前途的替代方案。然而,现有方法依赖于弱分割标签来隔离相关解剖结构,这可能导致在定位过程中传播误差。我们提出了End2Reg,一种端到端深度学习框架,通过联合优化分割和定位,消除了对分割标签和手动步骤的需要。网络学习任务特定的分割掩码,以适应定位,仅通过定位目标进行指导,而无需显式的分割监督。End2Reg在体外和体内基准测试中实现了最先进的性能,将中位目标定位误差减少了32%,均方根误差平均减少了61%,同时在部分遮挡下保持稳健性能。消融结果证实,端到端优化显著提高了定位精度。总体而言,End2Reg朝着完全自动化的无标记术中导航迈进。代码和交互式可视化可在:https://lorenzopettinari.github.io/end-2-reg/ 上找到。

英文摘要

Intraoperative navigation in spine surgery demands millimeter-level accuracy. Currently, this is achieved through radiation-intensive intraoperative imaging and bone-anchored markers that are invasive and disrupt surgical workflow. Markerless RGB-D registration methods offer a promising alternative. However, existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, potentially propagating errors through the registration process. We present End2Reg, an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for segmentation labels and manual steps. The network learns task-specific segmentation masks optimized for registration, guided solely by the registration objective without explicit segmentation supervision. End2Reg achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% and mean Root Mean Square Error by 61%, while maintaining robust performance under partial occlusions. Ablation results confirm that end-to-end optimization significantly improves registration accuracy. Overall, End2Reg advances towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: https://lorenzopettinari.github.io/end-2-reg/.

2512.09806 2026-05-21 cs.CV cs.AI 版本更新

CHEM: Estimating and Understanding Hallucinations in Deep Learning for Image Processing

CHEM: 估计和理解深度学习在图像处理中的幻觉

Jianfei Li, Ines Rosellon-Inclan, Gitta Kutyniok, Jean-Luc Starck

发表机构 * Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) University of Tromsø(特罗姆斯大学) German Aerospace Center (DLR)(德国航天中心) Foundation for Research and Technology Hellas (FORTH)(希腊研究与技术基金会)

AI总结 本文提出CHEM方法,用于量化和表征图像重建模型中的幻觉 artifacts,通过小波和shearlet表示定位幻觉区域,并利用 conformalized quantile regression 评估幻觉水平,同时分析U-shaped网络为何容易产生幻觉预测。

详情
AI中文摘要

基于深度学习的方法最近在图像重建问题中取得了显著成功。然而,挑战出现了,因为这些方法可能会生成不真实的 artifacts 或幻觉,这可能干扰安全关键场景中的分析。本文介绍了一个框架,用于量化和表征图像重建模型中的幻觉 artifacts。所提出的方法称为 Conformal Hallucination Estimation Metric (CHEM),能够识别模型预测中的幻觉易发区域。它利用小波和shearlet表示在图像特征层面定位这些区域,并使用 conformalized quantile regression 以分布无关的方式评估幻觉水平。提供了理论分析,表征了CHEM对幻觉 artifacts 的灵敏度及其与均方误差的关系。基于这些见解并采用基于逼近理论的观点,我们研究了为何U-shaped网络,广泛用于图像重建的架构,倾向于产生易受幻觉影响的预测。我们在天文图像去卷积中使用CANDELS数据集(如U-Net、SwinUNet和Learnlets)以及在自然图像超分辨率中使用DIV2K数据集(如DRUNet、Unfolded DRS、RAM和DPS)上评估了所提出方法的有效性。

英文摘要

Deep learning-based methods have recently achieved significant success in image reconstruction problems. However, challenges have emerged, as these methods may generate unrealistic artifacts or hallucinations, which can interfere with analysis in safety-critical scenarios. This paper introduces a framework for quantifying and characterizing hallucinated artifacts in image reconstruction models. The proposed method, termed the Conformal Hallucination Estimation Metric (CHEM), enables the identification of hallucination-prone regions in model predictions. It leverages wavelet and shearlet representations to localize such regions at the level of image features, and uses conformalized quantile regression to assess hallucination levels in a distribution-free manner. A theoretical analysis is provided, characterizing the sensitivity of CHEM to hallucinated artifacts and its relationship to the mean squared error. Building on these insights and adopting a viewpoint grounded in approximation theory, we investigate why U-shaped networks, widely used architectures for image reconstruction, tend to hallucination-prone predictions. We assess the effectiveness of the proposed approach on astronomical image deconvolution using the CANDELS dataset with architectures such as U-Net, SwinUNet, and Learnlets, and on natural image super-resolution using the DIV2K dataset with models such as DRUNet, Unfolded DRS, RAM, and DPS.

2510.23538 2026-05-21 cs.AI cs.CL cs.CV cs.SE 版本更新

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

JanusCoder: 向代码智能的视觉-程序化界面迈进

Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan

发表机构 * The University of Hong Kong(香港大学) Shanghai AI Laboratory(上海人工智能实验室) Nanjing University(南京大学) Carnegie Mellon University(卡内基梅隆大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 本文提出JanusCoder,一种面向代码智能的视觉-程序化界面,通过构建大规模多模态代码数据集和统一模型,实现从文本指令、视觉输入或两者结合生成代码,展示了其在文本和视觉编程任务中的优越性能。

Comments ICLR 2026 Camera Ready Version, with code and data available

详情
AI中文摘要

神经代码智能的范围正在迅速扩展,从基于文本的源代码扩展到程序生成的丰富视觉输出。这种视觉维度对于高级应用如灵活的内容生成和精确的可视化程序驱动编辑至关重要。然而,进展受到高质量多模态代码数据稀缺的阻碍,这源于合成和质量评估的挑战。为了解决这些挑战,我们从数据和建模的角度做出贡献。我们首先引入了一个完整的合成工具包,利用数据模态之间的相互协同效应,高效地生成涵盖标准图表到复杂交互式网页UI和代码驱动动画的大型高质量语料库。利用该工具包,我们构建了JanusCode-800K,目前最大的多模态代码语料库。这推动了我们模型JanusCoder和JanusCoderV的训练,建立了从文本指令、视觉输入或两者结合生成代码的视觉-程序化界面。我们的统一模型不同于现有方法,后者为孤立任务构建专门模型。在文本导向和视觉导向的编程任务上的大量实验表明,JanusCoder系列的性能优越,我们的7B到14B规模模型接近甚至超过了商业模型的性能。此外,广泛的分析提供了将程序逻辑与其视觉表达和谐统一的关键见解。我们的代码和检查点可在https://github.com/InternLM/JanusCoder上获得。

英文摘要

The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints are available at https://github.com/InternLM/JanusCoder.

2510.18034 2026-05-21 cs.CV cs.AI cs.RO 版本更新

Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

VLMs能否解锁语义异常检测?一个结构化推理的框架

Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz

发表机构 * Professorship of Autonomous Vehicle Systems TUM School of Engineering Design, Technical University of Munich Munich, Germany

AI总结 本文提出SAVANT框架,通过结构化推理方法提升VLM在语义异常检测中的性能,实现对自动驾驶场景中罕见异常情况的更准确识别。

Comments 8 pages, 5 figures

详情
AI中文摘要

自动驾驶系统仍然对长尾的稀有、分布外语义异常极度脆弱。尽管VLMs已显现为感知的有前途工具,但其在异常检测中的应用仍然主要局限于提示专有模型,限制了可靠性、可重复性和部署可行性。为解决这一差距,我们引入SAVANT(语义异常验证/分析工具包),一种新的模型无关推理框架,将异常检测重新表述为分层语义一致性验证。通过应用SAVANT的两阶段流程——结构化场景描述提取和多模态评估,现有VLMs在输入图像中检测异常驾驶场景的得分得到提升。我们的方法取代了随意提示,通过语义感知推理,将基于VLM的检测转化为四个语义领域之间的原则性分解。我们证明,在平衡的现实驾驶场景集上,应用SAVANT可将VLM的绝对召回率提高约18.5%,相比提示基线。此外,这一增益使大规模注释成为可能:利用我们框架内的最佳专有模型,我们自动标注了约10,000张现实世界图像,具有高置信度。我们使用由此产生的高质量数据集来微调一个7B开源模型(Qwen2.5-VL)以执行单次异常检测,达到90.8%的召回率和93.8%的准确率,超越所有评估模型,同时在接近零成本的情况下实现本地部署。通过将结构化语义推理与可扩展的数据整理相结合,我们为自动驾驶系统中的语义异常检测数据稀缺问题提供了实用的解决方案。补充材料:https://TUM-AVS.github.io/SAVANT/.

英文摘要

Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution semantic anomalies. While VLMs have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models - limiting reliability, reproducibility, and deployment feasibility. To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as a layered semantic consistency verification. By applying SAVANT's two-phase pipeline - structured scene description extraction and multi-modal evaluation - existing VLMs improve their scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains. We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLM's absolute recall by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, we provide a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://TUM-AVS.github.io/SAVANT/.

2510.17269 2026-05-21 cs.CV cs.AI 版本更新

FineVision: Open Data Is All You Need

FineVision: 你只需要开放数据

Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti

发表机构 * Hugging Face Technical University of Munich(慕尼黑技术大学) Stanford University(斯坦福大学)

AI总结 本文提出FineVision,一个包含2400万样本的高质量数据集,通过半自动化流程整合了200多个来源,通过严格的数据清洗和人工审核确保数据质量,训练基于该数据集的模型在广泛评估中表现更优,推动数据驱动的视觉语言模型研究。

详情
AI中文摘要

视觉语言模型(VLMs)的进步受到碎片化、不一致和受污染的公共数据集的阻碍。我们引入了FineVision,一个精心收集、整理和统一的2400万样本数据集,是最大的开放资源。我们通过半自动化、人机协作的流程将超过200个来源整合为185个子集:自动化处理大量数据和模式映射,而审核员检查映射并抽查输出以验证注释的忠实消费、适当的格式和多样性以及安全性;问题会触发针对性的修复和重新运行。该流程进一步在源内和跨源之间应用严格的去重,并针对66个公共基准进行去污染。FineVision还包含具有统一动作空间的代理/GUI任务;审核员验证模式并检查样本轨迹以确认可执行性。在广泛评估套件中,基于FineVision训练的模型始终优于基于现有开放混合数据训练的模型,凸显了规模、数据卫生和平衡自动化与人工监督的好处。我们发布该数据集和整理工具以加速数据驱动的VLM研究。

英文摘要

The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.

2510.14444 2026-05-21 cs.LG cs.AI 版本更新

A Free Lunch in LLM Compression: Revisiting Retraining after Pruning

在LLM压缩中寻找免费午餐:重新审视剪枝后的重新训练

Moritz Wagner, Christophe Roux, Max Zimmer, Sebastian Pokutta

发表机构 * Department for AI in Society, Science, and Technology, Zuse Institute Berlin(人工智能社会、科学与技术系,柏林Zuse研究所) Institute of Mathematics, Technische Universität Berlin(数学系,柏林技术大学)

AI总结 本文研究了在剪枝后通过局部重建进行适应的方法,发现其在减少数据和计算成本的同时能有效提升模型性能,并揭示了在不同粒度下重建参数窗口对最终质量的影响,挑战了LLM剪枝后适应不可行的主流观点。

详情
AI中文摘要

后训练剪枝可以显著降低LLM推理成本,但除非剩余权重被适应,否则往往会降质。由于在LLM规模上全局重新训练成本高昂,近期研究大多集中在日益复杂的剪枝标准上,旨在选择更好的稀疏模式而不进行适应。我们通过局部重建重新审视这一权衡:在剪枝后,我们依次在校准集上适应模型参数的一个子集,训练其以匹配密集模型的相应中间激活值。我们评估了局部重建在不同模型家族和规模上的表现,最高达到72B参数,并得出三个主要发现。首先,局部重建是LLM的有效适应机制:它在剪枝后重新训练时,使用了超过一个数量级更少的数据和计算资源,即使使用PEFT技术也是如此。其次,重建在粒度上表现出广泛的“免费午餐”区域,即重建参数窗口:只要重建区域包含至少一个非线性子模块,最终质量对窗口大小几乎不敏感,允许粒度主要基于内存约束来选择。相比之下,重建单个矩阵,尽管是文献中常提出的方法,却持续表现不佳,因为小的矩阵级误差会积累成更大的激活漂移。最后,重建减少了剪枝标准的相对重要性:随着模型规模的增加,复杂标准与简单基线之间的性能差距缩小,使简单方法再次具有竞争力。总体而言,我们的结果挑战了LLM剪枝后适应不可行的主流观点。

英文摘要

Post-training pruning can substantially reduce LLM inference costs, but it often degrades quality unless the remaining weights are adapted. Since global retraining is expensive at LLM scale, recent work has largely focused on increasingly sophisticated pruning criteria that aim to select better sparsity patterns without adaptation. We revisit this trade-off through local reconstruction: after pruning, we adapt one subset of the model parameters at a time on a calibration set, training it to match the corresponding intermediate activations of the dense model. We evaluate local reconstruction across model families and scales, up to 72B parameters, and establish three main findings. First, local reconstruction is an effective adaptation mechanism for LLMs: it matches post-pruning retraining while using over an order of magnitude less data and compute, even when using PEFT techniques. Second, reconstruction exhibits a broad "free-lunch" regime in granularity, i.e., the reconstruction parameter window: as long as the reconstructed region contains at least a nonlinear submodule, final quality is largely insensitive to the window size, allowing granularity to be chosen primarily based on memory constraints. In contrast, reconstructing individual matrices, despite being the natural approach often proposed in the literature, consistently underperforms, as small matrix-level errors accumulate into larger activation drift. Lastly, reconstruction reduces the relative importance of the pruning criterion: performance gaps between sophisticated criteria and simple baselines shrink with model scale, making simple methods competitive again. Overall, our results challenge the prevailing view that post-pruning adaptation is impractical for LLMs.

2510.09724 2026-05-21 cs.SE cs.AI 版本更新

InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation

InteractScience: 交互式科学演示代码生成的程序化与视觉导向评估

Qiaosheng Chen, Yang Liu, Lei Li, Kai Chen, Qipeng Guo, Gong Cheng, Fei Yuan

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China(新型软件技术国家重点实验室,南京大学,中国南京) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室,中国上海) Carnegie Mellon University, Pittsburgh, PA, USA(卡内基梅隆大学,美国匹兹堡)

AI总结 本文提出InteractScience基准,用于评估大语言模型在生成交互式科学演示代码时结合科学知识与前端交互能力的综合表现,通过程序化测试和视觉测试相结合的方法,评估30种开源和闭源LLM的表现,揭示了在整合领域知识与交互前端编码方面的持续不足。

Comments 27 pages, 17 figures

详情
AI中文摘要

大型语言模型(LLMs)正越来越能够从自然语言指令中生成完整的应用程序,为科学和教育领域创造了新的机会。在这些领域中,交互式科学演示特别有价值,可用于解释概念、支持新的教学方法和展示研究成果。生成此类演示需要模型结合准确的科学知识和能够正确实现并响应用户操作的交互前端代码。这种能力超出了现有基准的范围,这些基准通常只评估知识问答或静态网页代码生成。为了评估这种综合能力,我们设计了一个混合框架,结合程序化功能测试严格验证交互逻辑,并结合视觉导向的定性测试评估渲染输出与参考快照的一致性。基于此框架,我们提出了InteractScience基准,包含五个科学领域中精心设计的问题集,每个问题配以单元测试、参考快照和检查表。我们评估了30种领先的开源和闭源LLM,并报告了结果,突显了在整合领域知识与交互前端编码方面的持续不足。我们的工作将InteractScience定位为首个能够自动衡量这种综合能力的基准,通过现实的交互操作提供基础,推动可靠且具有教育价值的科学演示代码生成。所有代码和数据均在https://github.com/open-compass/InteractScience公开。

英文摘要

Large Language Models (LLMs) are increasingly capable of generating complete applications from natural language instructions, creating new opportunities in science and education. In these domains, interactive scientific demonstrations are particularly valuable for explaining concepts, supporting new teaching methods, and presenting research findings. Generating such demonstrations requires models to combine accurate scientific knowledge with the ability to implement interactive front-end code that behaves correctly and responds to user actions. This capability goes beyond the scope of existing benchmarks, which typically evaluate either knowledge question answering without grounding in code or static web code generation without scientific interactivity. To evaluate this integrated ability, we design a hybrid framework that combines programmatic functional testing to rigorously verify interaction logic with visually-grounded qualitative testing to assess rendered outputs against reference snapshots. Building on this framework, we present InteractScience, a benchmark consisting of a substantial set of carefully designed questions across five scientific domains, each paired with unit tests, reference snapshots, and checklists. We evaluate 30 leading open- and closed-source LLMs and report results that highlight ongoing weaknesses in integrating domain knowledge with interactive front-end coding. Our work positions InteractScience as the first benchmark to automatically measure this combined capability with realistic interactive operations, providing a foundation for advancing reliable and educationally useful scientific demonstration code generation. All code and data are publicly available at https://github.com/open-compass/InteractScience.

2509.26627 2026-05-21 cs.AI cs.LG cs.RO 版本更新

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

TimeRewarder: 通过帧间时间距离从被动视频中学习密集奖励

Yuyang Liu, Chuan Wen, Yihang Hu, Dinesh Jayaraman, Yang Gao

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China(清华大学交叉信息研究院) Shanghai Qi Zhi Institute(上海启智研究院) Shanghai Jiao Tong University(上海交通大学) University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文提出TimeRewarder方法,通过帧间时间距离从被动视频中学习密集奖励,以提升强化学习在稀疏奖励任务中的性能,实验表明其在多个任务中显著提高了成功率和样本效率。

Comments ICML 2026 spotlight paper

详情
AI中文摘要

设计密集奖励对于强化学习(RL)至关重要,但在机器人学中往往需要大量的手动工作且缺乏可扩展性。一个有前景的解决方案是将任务进展视为密集奖励信号,因为它量化了动作在时间上推动系统向任务完成迈进的程度。我们提出了TimeRewarder,一种简单而有效的奖励学习方法,通过建模帧对之间的时间距离,从被动视频(包括机器人演示和人类视频)中推导出进展估计信号。然后展示如何通过TimeRewarder提供逐步的代理奖励以指导强化学习。在我们对十个具有挑战性的Meta-World任务的全面实验中,我们表明TimeRewarder显著提高了稀疏奖励任务的强化学习性能,仅在每个任务中进行200,000次环境交互时,就实现了9/10任务的几乎完美成功。该方法在最终成功率和样本效率上均优于先前方法和手动设计的环境密集奖励。此外,我们还展示了TimeRewarder预训练可以利用真实世界的人类视频,突显了其作为从多样化视频源中获取丰富奖励信号的可扩展方法的潜力。

英文摘要

Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply step-wise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 environment interactions per task. This approach outperformed previous methods and even the manually designed environment dense reward on both the final success rate and sample efficiency. Moreover, we show that TimeRewarder pretraining can exploit real-world human videos, highlighting its potential as a scalable approach to rich reward signals from diverse video sources.

2509.14165 2026-05-21 cs.CV cs.AI 版本更新

Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

令牌去哪了?在高分辨率下的STEP中理解剪枝行为

Michal Szczepanski, Martyna Poreba, Karim Haroun

发表机构 * Université Paris-Saclay, CEA, List(巴黎-萨克雷大学,CEA,List) I3S, Université Côte d’Azur, CNRS(I3S,尼斯大学,CNRS)

AI总结 本文提出STEP框架,通过动态补丁合并和令牌剪枝提高效率,同时在高分辨率语义分割任务中实现显著的计算成本降低和吞吐量提升,同时保持较高的准确性。

Journal ref SN Computer Science 2026

详情
AI中文摘要

视觉变换器(ViTs)在语义分割任务中实现了最先进的性能,但受到高计算和内存成本的限制。为了解决这一问题,我们提出了STEP(SuperToken和Early-Pruning),一种混合的令牌减少框架,结合动态补丁合并和令牌剪枝,以提高效率而不显著牺牲准确性。STEP的核心是dCTS,一个轻量级的CNN基政策网络,能够灵活地合并为超补丁。编码器块也集成了早期退出,以移除高置信度的超令牌,从而降低计算负载。我们在高分辨率语义分割基准上评估了我们的方法,包括高达1024x1024像素的图像,并显示当仅应用dCTS时,令牌数量可以比标准的16x16像素补丁方案减少2.5倍。这在使用ViT-Large作为骨干时,导致计算成本减少2.6倍,吞吐量增加3.4倍。应用完整的STEP框架进一步提高效率,达到计算复杂度减少4倍,推理速度提高1.7倍,最大精度下降不超过2.0%。通过提出的STEP配置,可以自信地在到达最终编码器层之前停止多达40%的令牌。

英文摘要

Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks integrate also early-exits to remove high-confident supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.

2509.09215 2026-05-21 cs.AI cs.CR 版本更新

Enabling Regulatory Multi-Agent Collaboration: Architecture, Challenges, and Solutions

使监管多智能体协作成为可能:架构、挑战与解决方案

Qinnan Hu, Yuntao Wang, Yuan Gao, Zhou Su, Linkang Du, Qichao Xu

发表机构 * School of Cyber Science and Engineering(网络科学与工程学院) Xi'an Jiaotong University(西安交通大学) School of Mechatronic Engineering and Automation(机械电子工程与自动化学院) Shanghai University(上海大学)

AI总结 本文提出了一种基于区块链的分层架构,用于监管智能体协作,设计了三个关键模块以实现自动问责、动态声誉评估和恶意行为预测,从而建立可信、健壮和可扩展的监管机制。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

大型语言模型(LLMs)赋能的自主智能体正在通过实现适应性、多智能体协作改变数字和物理环境。尽管这些智能体在金融、医疗和智能制造等领域提供了显著机会,但其不可预测的行为和异构能力带来了重大治理和问责挑战。在本文中,我们提出了一种基于区块链的分层架构,用于监管智能体协作,包括智能体层、区块链数据层和监管应用层。在此框架内,我们设计了三个关键模块:(i)智能体行为追踪和仲裁模块用于自动化问责,(ii)动态声誉评估模块用于协作场景中的信任评估,(iii)恶意行为预测模块用于早期检测对抗性活动。我们的方法建立了在大规模智能体生态系统中可信、健壮和可扩展的监管机制的系统基础。最后,我们讨论了区块链赋能的监管框架在多智能体系统中的未来研究方向。

英文摘要

Large language models (LLMs)-empowered autonomous agents are transforming both digital and physical environments by enabling adaptive, multi-agent collaboration. While these agents offer significant opportunities across domains such as finance, healthcare, and smart manufacturing, their unpredictable behaviors and heterogeneous capabilities pose substantial governance and accountability challenges. In this paper, we propose a blockchain-enabled layered architecture for regulatory agent collaboration, comprising an agent layer, a blockchain data layer, and a regulatory application layer. Within this framework, we design three key modules: (i) an agent behavior tracing and arbitration module for automated accountability, (ii) a dynamic reputation evaluation module for trust assessment in collaborative scenarios, and (iii) a malicious behavior forecasting module for early detection of adversarial activities. Our approach establishes a systematic foundation for trustworthy, resilient, and scalable regulatory mechanisms in large-scale agent ecosystems. Finally, we discuss the future research directions for blockchain-enabled regulatory frameworks in multi-agent systems.

2509.08010 2026-05-21 cs.CY cs.AI cs.CL cs.HC 版本更新

Measuring and mitigating overreliance to build human-compatible AI

测量和缓解过度依赖以构建人类兼容的AI

Lujain Ibrahim, Katherine M. Collins, Sunnie S. Y. Kim, Anka Reuel, Max Lamparth, Kevin Feng, Lama Ahmad, Prajna Soni, Alia El Kattan, Merlin Stein, Siddharth Swaroop, Vishakh Padmakumar, Ilia Sucholutsky, Andrew Strait, Diyi Yang, Q. Vera Liao, Umang Bhatt

AI总结 本文研究了大型语言模型过度依赖的风险,探讨了测量和缓解过度依赖的方法,以确保AI能增强而非削弱人类能力。

详情
AI中文摘要

大型语言模型(LLMs)通过作为协作的『思想伙伴』而区别于先前的技术,能够在多种任务上更流畅地进行自然语言交互。随着LLMs在医疗、个人建议等不同领域中日益影响关键决策,过度依赖LLMs的风险也随之增加。本文认为,测量和缓解过度依赖必须成为LLMs研究和部署的核心。首先,我们汇总了个体和社会层面的过度依赖风险,包括高风险错误、治理挑战和认知退化。然后,我们探讨了LLMs的特点、系统设计特征和用户认知偏见,这些因素共同引发了关于实际中过度依赖LLMs的严重且独特的问题。我们还审查了历史上的过度依赖测量方法,识别出三个重要的差距,并提出三个有前景的方向来改进测量。最后,我们提出了可以采取的缓解策略,以确保LLMs增强而非削弱人类能力。

英文摘要

Large language models (LLMs) distinguish themselves from previous technologies by functioning as collaborative ``thought partners,'' capable of engaging more fluidly in natural language on a range of tasks. As LLMs increasingly influence consequential decisions across diverse domains from healthcare to personal advice, the risk of overreliance -- relying on LLMs beyond their capabilities -- grows. This paper argues that measuring and mitigating overreliance must become central to LLM research and deployment. First, we consolidate risks from overreliance at both the individual and societal levels, including high-stakes errors, governance challenges, and cognitive deskilling. Then, we explore LLM characteristics, system design features, and user cognitive biases that together raise serious and unique concerns about overreliance on LLMs in practice. We also examine historical approaches for measuring overreliance, identifying three important gaps and proposing three promising directions to improve measurement. Finally, we propose mitigation strategies that can be pursued to ensure LLMs augment rather than undermine human capabilities.

2509.00303 2026-05-21 cs.DB cs.AI cs.IR 版本更新

Access Paths for Efficient Ordering with Large Language Models

利用大型语言模型实现高效的排序访问路径

Fuheng Zhao, Jiayue Chen, Yiming Pan, Tahseen Rabbani, Sohaib, Divyakant Agrawal, Amr El Abbadi, Paritosh Aggarwal, Anupam Datta, Dimitris Tsirogiannis

发表机构 * Snowflake Inc.(Snowflake公司) University of Chicago(芝加哥大学) UC Los Angeles(洛杉矶大学) UC Santa Barbara(圣巴巴拉大学)

AI总结 本文提出了一种基于大型语言模型的排序语义运算符,并系统研究了其物理实现。通过改进现有语义排序算法并引入语义感知的外部归并排序算法,研究发现没有单一实现能在所有数据集上达到最优。基于此,设计了一个预算感知的优化器,利用启发式规则、LLM作为判断者评估和共识聚合动态选择最优的访问路径。实验结果表明,该优化器在所有基准测试中均能实现与最佳静态方法相当或更优的排名准确性。

详情
AI中文摘要

在本工作中,我们提出了LLM ORDER BY语义运算符作为一种逻辑抽象,并对其物理实现进行了系统研究。首先,我们对现有的语义排序算法进行了若干改进,并引入了一种语义感知的外部归并排序算法。我们的广泛评估表明,没有单一的实现能在所有数据集上提供普遍最优性。从我们的评估中,我们观察到基于比较的算法中排序成本与排序质量之间存在一种通用的时间尺度关系。基于这些见解,我们设计了一个预算感知的优化器,该优化器利用启发式规则、LLM-as-Judge评估和共识聚合来动态选择LLM ORDER BY的近最优访问路径。在我们的广泛评估中,我们的优化器在所有基准测试中均能实现与最佳静态方法相当或更优的排名准确性。我们相信,这项工作为构建稳健、大规模的LLM驱动分析系统中的语义运算符原则性优化提供了基础性见解。

英文摘要

In this work, we present the \texttt{LLM ORDER BY} semantic operator as a logical abstraction and conduct a systematic study of its physical implementations. First, we propose several improvements to existing semantic sorting algorithms and introduce a semantic-aware external merge sort algorithm. Our extensive evaluation reveals that no single implementation offers universal optimality on all datasets. From our evaluations, we observe a general test-time scaling relationship between sorting cost and the ordering quality for comparison-based algorithms. Building on these insights, we design a budget-aware optimizer that utilizes heuristic rules, LLM-as-Judge evaluation, and consensus aggregation to dynamically select the near-optimal access path for LLM ORDER BY. In our extensive evaluations, our optimizer consistently achieves ranking accuracy on par with or superior to the best static methods across all benchmarks. We believe that this work provides foundational insights into the principled optimization of semantic operators essential for building robust, large-scale LLM-powered analytic systems.

2508.11354 2026-05-21 cs.CV cs.AI cs.LG 版本更新

FunduSegmenter: Leveraging the RETFound Foundation Model for Joint Optic Disc and Optic Cup Segmentation in Retinal Fundus Images

FunduSegmenter:利用RETFound基础模型进行视网膜底照相图像中视盘和视杯联合分割

Zhenyi Zhao, Muthu Rama Krishnan Mookiah, Emanuele Trucco

发表机构 * University of Dundee(邓迪大学)

AI总结 本文提出了一种基于RETFound基础模型的FunduSegmenter模型,通过引入一系列新颖模块实现视盘和视杯的联合分割,实验表明该模型在多个数据集上均优于现有方法。

Journal ref Trans. Vis. Sci. Tech. 2026;15(5):14

详情
AI中文摘要

目的:本研究首次将RETFound模型应用于视盘(OD)和视杯(OC)的联合分割。RETFound是一个为眼底相机和光学相干断层扫描图像开发的知名基础模型,已在疾病诊断中表现出色。方法:我们提出FunduSegmenter,该模型整合了一系列新颖模块与RETFound,包括预适配器、解码器、后适配器、带有卷积块注意模块的跳跃连接以及视觉Transformer块适配器。该模型在自有数据集GoDARTS以及四个公开数据集IDRiD、Drishti-GS、RIM-ONE-r3和REFUGE上进行了评估,通过内部验证、外部验证和领域泛化实验进行验证。结果:在内部验证中,平均Dice相似系数达到90.51%,优于所有基线方法,其中nnU-Net为82.91%,DUNet为89.17%,TransUNet为87.91%。在所有外部验证实验中,平均结果比最佳基线高约3%,且在领域泛化中也具有竞争力。结论:本研究探讨了RETFound通过学习潜在通用表示在眼底相机图像中进行OD和OC分割的潜力。我们的FunduSegmenter在整体上优于现有最先进基线方法。所提出的模块是通用的,可以扩展到其他基础模型的微调。临床相关性:该模型在分布内和分布外数据上均表现出强大的稳定性与泛化能力,提供了稳定的OD和OC分割。这是许多自动化任务的关键步骤,从设置准确的视网膜坐标到生物标志物发现。代码和训练权重可在:https://github.com/JusticeZzy/FunduSegmenter上获得。

英文摘要

Purpose: This study introduces the first adaptation of RETFound for joint optic disc (OD) and optic cup (OC) segmentation. RETFound is a well-known foundation model developed for fundus camera and optical coherence tomography images, which has shown promising performance in disease diagnosis. Methods: We propose FunduSegmenter, a model integrating a series of novel modules with RETFound, including a Pre-adapter, a Decoder, a Post-adapter, skip connections with Convolutional Block Attention Module and a Vision Transformer block adapter. The model is evaluated on a proprietary dataset, GoDARTS, and four public datasets, IDRiD, Drishti-GS, RIM-ONE-r3, and REFUGE, through internal verification, external verification and domain generalization experiments. Results: An average Dice similarity coefficient of 90.51% was achieved in internal verification, which outperformed all baselines, some substantially (nnU-Net: 82.91%; DUNet: 89.17%; TransUNet: 87.91%). In all external verification experiments, the average results were about 3% higher than those of the best baseline, and our model was also competitive in domain generalization. Conclusions: This study explored the potential of the latent general representations learned by RETFound for OD and OC segmentation in fundus camera images. Our FunduSegmenter generally outperformed state-of-the-art baseline methods. The proposed modules are general and can be extended to fine-tuning other foundation models. Translational Relevance: The model shows strong stability and generalization on both in-distribution and out-of-distribution data, providing stable OD and OC segmentation. This is an essential step for many automated tasks, from setting the accurate retinal coordinate to biomarker discovery. The code and trained weights are available at: https://github.com/JusticeZzy/FunduSegmenter.

2508.09001 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Retrospective Sparse Attention for Efficient Long-Context Generation

回顾性稀疏注意力用于高效长上下文生成

Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim

发表机构 * Seoul National University(首尔国立大学)

AI总结 本文提出RetroAttention,一种新的KV缓存更新技术,通过回顾后续解码步骤的KV条目来修正过去的注意力输出,从而提高长上下文生成的效率和准确性。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地应用于长上下文任务,如推理、代码生成和多轮对话。然而,扩展上下文的推理受到键值(KV)缓存的限制,其内存占用与序列长度成线性增长,且在每个解码步骤中主导延迟。尽管最近的KV缓存压缩方法识别并加载重要的少量token,但它们主要集中在输入上下文中,未能解决长时间解码中累积的注意力误差。在本文中,我们引入了RetroAttention,一种新的KV缓存更新技术,通过回顾后续解码步骤的KV条目来修正过去的注意力输出。通过维护一个轻量级的输出缓存,RetroAttention使过去的查询能够高效地补充更多上下文,同时产生最小的延迟开销。这打破了固定注意力输出的范式,允许对先前近似进行持续修正。在长生成基准测试中,RetroAttention在长生成任务中始终优于最先进的(SOTA)KV压缩方法,有效KV暴露量增加高达1.6倍,准确性提高高达21.9%。

英文摘要

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load important few tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to be efficiently supplemented with more contexts, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6$\times$ and accuracy by up to 21.9\%.

2508.02291 2026-05-21 cs.LG cs.AI 版本更新

FAIR-Pruner: A Flexible Framework for Automatic Layer-Wise Pruning via Tolerance of Difference

FAIR-Pruner: 一种通过差异容忍性实现自动分层剪枝的灵活框架

Chenqing Lin, Mostafa Hussien, Chengyao Yu, Bingyi Jing, Ruixing Ming, Kim Khoa Nguyen, Mohamed Cheriet

发表机构 * School of Statistics and Mathematics, Zhejiang Gongshang University(浙江工商大学统计与数学学院) École de technologie supérieure (ÉTS), Université du Québec(魁北克大学埃克森技术学院) Southern University of Science and Technology(南方科技大学)

AI总结 本文提出FAIR-Pruner,一种无需搜索的自适应分层结构化剪枝框架,通过引入差异容忍度(ToD)来实现非均匀的分层剪枝深度,从而在多个数据集和模型上实现了良好的准确率-压缩率权衡。

Comments Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence

详情
AI中文摘要

结构化剪枝是压缩深度神经网络的标准工具,但其实际性能取决于稀疏性如何分配到各层。我们提出了FAIR-Pruner,一种无需搜索的自适应分层结构化剪枝框架。FAIR-Pruner使用两种在同一层内的排名:一种是去除导向的信号,提出候选单元;另一种是保护导向的信号,识别任务敏感的单元。其核心组件,差异容忍度(ToD),测量去除前缀与保护尾部之间的重叠,并使用共享容忍级别来诱导各层非均匀的剪枝深度。作为默认视觉实例,FAIR-Pruner结合基于Wasserstein的U-Score用于类条件单元分离性,以及基于Taylor的R-Score用于任务级敏感性;相同的ToD分配规则也可以与替代的去除信号配对。理论上,我们通过群体R-Score分析ToD,推导出高R-Score质量进入剪枝集的排名控制,并识别出相同预算比较与均匀剪枝的加法交换条件。在CIFAR-10、CIFAR-100、SVHN和ImageNet上,跨VGG、ResNet、DenseNet、ConvNeXt和DeiT的实验显示了强的准确率-压缩率权衡。在 routed-expert Qwen1.5-MoE-A2.7B-Chat 上的仅剪枝实验进一步检验了在匹配专家预算下的架构扩展性。FAIR-Pruner作为可 pip-install 的开源包发布。

英文摘要

Structured pruning is a standard tool for compressing deep neural networks, but its practical performance depends on how sparsity is allocated across layers. We propose FAIR-Pruner, a search-free framework for adaptive layer-wise structured pruning. FAIR-Pruner uses two within-layer rankings: a removal-oriented signal that proposes candidate units and a protection-oriented signal that identifies task-sensitive units. Its core component, Tolerance of Difference (ToD), measures the overlap between the removal prefix and the protected tail, and uses a shared tolerance level to induce non-uniform pruning depths across layers. As a default vision instantiation, FAIR-Pruner combines a Wasserstein-based U-Score for class-conditional unit separability with a Taylor-based R-Score for task-level sensitivity; the same ToD allocation rule can also be paired with alternative removal signals. Theoretically, we analyze ToD through the population R-Score, derive rank-based control of the high-R-Score mass entering the pruning set, and identify an additive exchange condition for same-budget comparison with uniform pruning. Experiments on CIFAR-10, CIFAR-100, SVHN, and ImageNet across VGG, ResNet, DenseNet, ConvNeXt, and DeiT show strong accuracy--compression trade-offs. Prune-only experiments on routed-expert Qwen1.5-MoE-A2.7B-Chat further examine architectural extensibility under matched expert budgets. FAIR-Pruner is released as a pip-installable open-source package.

2507.01053 2026-05-21 cs.IR cs.AI cs.DB 版本更新

M3: Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis

M3: 对话式大语言模型简化安全的临床数据访问、理解与分析

Rafi Al Attrach, Pedro Moreira, Rajna Fani, Renato Umeton, Amelia Fiske, Leo Anthony Celi

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Technical University of Munich(慕尼黑技术大学) Universitat Pompeu Fabra(庞培法华大学) St. Jude Children’s Research Hospital(圣犹大儿童研究医院) Beth Israel Deaconess Medical Center(贝斯以色列医疗中心) Harvard T.H. Chan School of Public Health(哈佛大学T.H. Chan公共卫生学院)

AI总结 本文提出M3系统,通过模型上下文协议实现对MIMIC-IV数据库的自然语言查询,降低了临床数据访问和分析的技术门槛,并展示了其在安全性和性能上的优势。

Comments 18 pages, 4 figures, 3 tables

详情
AI中文摘要

大规模的临床数据库为医学研究提供了机会,但其复杂性却阻碍了有效利用。医学重症监护信息库(MIMIC-IV)是世界上最大的开源电子健康记录数据库之一,传统上需要SQL专业知识和临床领域专业知识。我们引入M3,一种通过模型上下文协议实现对MIMIC-IV数据的自然语言查询的系统。通过单条命令,M3可以从PhysioNet获取MIMIC-IV,启动本地SQLite实例或连接到托管的BigQuery,并允许研究者用普通英语提出临床问题。我们使用EHRSQL 2024基准测试样本对M3进行了评估,使用两个语言模型。在一百个可回答的问题上,专有模型Claude Sonnet 4达到了94%的准确率,开源模型gpt-oss-20B(可在消费级硬件上本地部署)达到了93%;在一百个不可回答的问题样本上,正确行为是放弃而不是生成SQL,gpt-oss-20B在69个问题上正确地放弃了。两个模型将自然语言转换为SQL,执行查询以MIMIC-IV,并返回结构化结果以及底层查询以供验证。错误分析表明,大多数失败源于复杂的时序推理或模糊的问题表述,而不是基本的架构限制。较小的开源模型的可比性能表明,隐私保护的本地部署对于敏感的临床数据分析是可行的。M3降低了对危重病数据分析的技术门槛,并设计了包括OAuth2认证、查询验证和审计日志在内的安全措施。

英文摘要

Large-scale clinical databases offer opportunities for medical research, but their complexity creates barriers to effective use. The Medical Information Mart for Intensive Care (MIMIC-IV), one of the world's largest open-source electronic health record databases, traditionally requires both SQL proficiency and clinical domain expertise. We introduce M3, a system that enables natural language querying of MIMIC-IV data through the Model Context Protocol. With a single command, M3 retrieves MIMIC-IV from PhysioNet, launches a local SQLite instance or connects to hosted BigQuery, and allows researchers to pose clinical questions in plain English. We evaluated M3 using samples from the EHRSQL 2024 benchmark with two language models. On one hundred answerable questions, the proprietary Claude Sonnet 4 achieved 94% accuracy and the open-weights gpt-oss-20B (deployable locally on consumer hardware) achieved 93%; on a matched sample of one hundred unanswerable questions, where correct behavior is to abstain rather than produce SQL, gpt-oss-20B correctly abstained on 69%. Both models translate natural language into SQL, execute queries against MIMIC-IV, and return structured results alongside the underlying query for verification. Error analysis revealed that most failures stemmed from complex temporal reasoning or ambiguous question phrasing rather than fundamental architectural limitations. The comparable performance of a smaller open-weights model demonstrates that privacy-preserving local deployment is viable for sensitive clinical data analysis. M3 lowers technical barriers to critical care data analysis and is designed with security measures including OAuth2 authentication, query validation, and audit logging.

2506.21039 2026-05-21 cs.LG cs.AI 版本更新

Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

严格子目标执行:在分层强化学习中的可靠长 horizon 规划

Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han

发表机构 * Graduate School of Artificial Intelligence(人工智能研究生院) Ulsan National Institute of Science and Technology (UNIST)(釜山国立科学与技术研究所) Ulsan, South Korea(韩国釜山)

AI总结 本文提出严格子目标执行(SSE)框架,通过前沿经验回放(FER)分离不可达与可接受的子目标,提高高层决策效率,从而在长horizon任务中实现更可靠的规划。

Comments 10 pages for main, 26 pages for total, Accepted to ICLR 2026

Journal ref International Conference on Learning Representations (ICLR), 2026

详情
AI中文摘要

长horizon目标条件任务对强化学习(RL)提出了根本性挑战,特别是在目标遥远且奖励稀疏的情况下。虽然分层和图基方法提供了部分解决方案,但它们对传统 hindsight relabeling 的依赖往往无法纠正子目标不可行性,导致高层规划效率低下。为此,我们提出严格子目标执行(SSE),一种基于图的分层RL框架,整合前沿经验回放(FER)以分离不可达与可接受的子目标,并优化高层决策。FER利用失败和部分成功转移确定可达性前沿,识别不可靠的子目标,提高子目标可靠性,并减少不必要的高层决策。此外,SSE采用解耦探索策略以覆盖目标空间的未探索区域,并通过路径细化调整边成本以利用观察到的低层失败。在多样化的长horizon基准测试中,SSE在效率和成功率方面均优于现有目标条件和分层RL方法。我们的代码可在 https://jaebak1996.github.io/SSE/ 上获得。

英文摘要

Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, their reliance on conventional hindsight relabeling often fails to correct subgoal infeasibility, leading to inefficient high-level planning. To address this, we propose Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that integrates Frontier Experience Replay (FER) to separate unreachable from admissible subgoals and streamline high-level decision making. FER delineates the reachability frontier using failure and partial-success transitions, which identifies unreliable subgoals, increases subgoal reliability, and reduces unnecessary high-level decisions. Additionally, SSE employs a decoupled exploration policy to cover underexplored regions of the goal space and a path refinement that adjusts edge costs using observed low-level failures. Experimental results across diverse long-horizon benchmarks show that SSE consistently outperforms existing goal-conditioned and hierarchical RL methods in both efficiency and success rate. Our code is available at https://jaebak1996.github.io/SSE/

2506.17631 2026-05-21 cs.LG cs.AI 版本更新

Time-Prompt: Integrated Heterogeneous Prompts for Unlocking LLMs in Time Series Forecasting

Time-Prompt: 集成异构提示以解锁时间序列预测中的LLM

Zesen Wang, Lijuan Lan, Yonggang Li

发表机构 * Central South University, Changsha, China(中南大学,长沙,中国)

AI总结 本文提出Time-Prompt框架,通过构建统一的提示范式、设计语义空间嵌入和跨模态对齐模块以及高效微调LLM参数,提升时间序列预测性能,并在碳排放数据集上验证其有效性。

Comments Accepted at IJCNN 2026

详情
AI中文摘要

时间序列预测旨在建模变量间的时序依赖关系以推断未来状态,对现实世界场景具有重要性和广泛应用。尽管基于深度学习的方法已取得显著进展,但其在长期预测中仍表现不佳。最近研究表明,大型语言模型(LLMs)在时间序列预测中表现出色,但其在该任务中的实用性仍存疑。为此,我们提出Time-Prompt框架,旨在激活LLMs进行时间序列预测。具体而言,我们首先构建了一个统一的提示范式,利用可学习的软提示引导LLM的行为,并利用文本化的硬提示增强时间序列表示。其次,为了增强LLM对预测任务的全面理解,我们设计了一个语义空间嵌入和跨模态对齐模块,以实现时序和文本数据的融合。最后,我们利用时间序列数据高效地微调LLM的参数。此外,我们专注于碳排放领域,旨在为全球碳中和做出贡献。在6个公开数据集和3个碳排放数据集上的综合评估表明,Time-Prompt是一个强大的时间序列预测框架。

英文摘要

Time series forecasting aims to model temporal dependencies among variables for future state inference, holding significant importance and widespread applications in real-world scenarios. Although deep learning-based methods have achieved remarkable progress, they still exhibit suboptimal performance in long-term forecasting. Recent research demonstrates that large language models (LLMs) achieve promising performance in time series forecasting, but this progress is still met with skepticism about whether LLMs are truly useful for this task. To address this, we propose Time-Prompt, a framework for activating LLMs for time series forecasting. Specifically, we first construct a unified prompt paradigm with learnable soft prompts to guide the LLM's behavior and textualized hard prompts to enhance the time series representations. Second, to enhance LLM' comprehensive understanding of the forecasting task, we design a semantic space embedding and cross-modal alignment module to achieve fusion of temporal and textual data. Finally, we efficiently fine-tune the LLM's parameters using time series data. Furthermore, we focus on carbon emissions, aiming to provide a modest contribution to global carbon neutrality. Comprehensive evaluations on 6 public datasets and 3 carbon emission datasets demonstrate that Time-Prompt is a powerful framework for time series forecasting.

2505.19075 2026-05-21 cs.AI cs.CL cs.LG 版本更新

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Universal Reasoner: 一个单一、可组合的即插即用推理器用于冻结的LLM

Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, Jong Chul Ye

发表机构 * Graduate School of Artificial Intelligence, Korea Advanced Institute of Science and Technology(人工智能研究生院,韩国科学技术院)

AI总结 本文提出Universal Reasoner,一种可组合且即插即用的推理模块,能够在冻结的大规模语言模型上提供专门的推理能力,通过共享或对齐的token空间实现弱到强的泛化,实验表明其在数学推理和机器翻译中优于现有微调方法。

Comments ICML 2026

详情
AI中文摘要

Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR)-a modular, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities with a shared or aligned token space. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs at inference time by simply adding its output logits to those of the backbone. This additive structure enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Furthermore, UniR demonstrates weak-to-strong generalization, where reasoning modules trained on smaller models effectively guide much larger LLMs in the same model family, and generalize across domains such as in vision language models and medical reasoning. Experiments on mathematical reasoning and machine translation show that UniR surpasses existing fine-tuning methods. Code is open-sourced at https://github.com/hangeol/UniR.

英文摘要

Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR)-a modular, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities with a shared or aligned token space. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs at inference time by simply adding its output logits to those of the backbone. This additive structure enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Furthermore, UniR demonstrates weak-to-strong generalization, where reasoning modules trained on smaller models effectively guide much larger LLMs in the same model family, and generalize across domains such as in vision language models and medical reasoning. Experiments on mathematical reasoning and machine translation show that UniR surpasses existing fine-tuning methods. Code is open-sourced at https://github.com/hangeol/UniR.

2505.14654 2026-05-21 cs.CV cs.AI cs.CL 版本更新

Beyond Words: Multimodal LLM Knows When to Speak

超越词语:多模态大语言模型何时说话

Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin

发表机构 * Department of Computer Science, Stony Brook University(石溪大学计算机科学系) Atmee AI

AI总结 本文提出了一种多模态策略,通过同步视频、音频和文本线索提高对话中的响应时机意识,从而提升大语言模型在对话中的响应准确性。

Comments Project page: https://github.com/lzk901372/MM-When2Speak

详情
AI中文摘要

基于大语言模型(LLMs)的聊天机器人能够生成流畅的响应,但在何时发言的问题上常常遇到困难,尤其是在对话过程中需要简短及时的听众反应时。我们提出了一种多模态策略,利用同步的视频、音频和文本线索来改进对话中的时间感知能力。该策略将响应时间重新表述为密集响应类型预测任务,使智能体能够在流式约束下决定是否保持沉默、生成简短反应或开始完整响应。因此,我们引入了一个经过精心挑选的多模态数据集,该数据集来自真实世界的双人对话视频,具有时间对齐的多模态数据和细粒度的反应类型注释。此外,我们设计了一种多模态策略MM-When2Speak,在LLM骨干网络上增加了多模态集成模块。在各种模态设置和强大的LLM基线模型上的实验表明,MM-When2Speak在响应类型预测性能上实现了高达3倍的提升,突显了多模态感知在自然和吸引人的对话交互中的重要性。

英文摘要

Chatbots via large language models (LLMs) generate fluent responses but often struggle with when to speak, especially for brief, timely listener reactions during ongoing dialogue. We present a multimodal strategy for LLMs, which leverages synchronized video, audio, and text cues to improve conversational timing awareness. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide whether to remain silent, produce a short reaction, or start a full response under streaming constraints. Therefore, we introduce a curated multimodal dataset from real-world dyadic conversational videos with temporally aligned modalities and fine-grained reaction type annotations. Moreover, we design a multimodal strategy, MM-When2Speak, with a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong LLM baselines show that MM-When2Speak achieves up to a 3x improvement in response type prediction performance, highlighting the importance of multimodal perception for natural and engaging conversational interaction.

2504.13048 2026-05-21 cond-mat.mtrl-sci cs.AI 版本更新

Design Topological Materials by Reinforcement Fine-Tuned Generative Model

通过强化微调生成模型设计拓扑材料

Haosheng Xu, Dongheng Qian, Zhixuan Liu, Yadong Jiang, Jing Wang

发表机构 * State Key Laboratory of Surface Physics(表面物理国家重点实验室) Department of Physics, Fudan University, Shanghai 200433, China(复旦大学物理系) Shanghai Research Center for Quantum Sciences, Shanghai 201315, China(上海量子科学研究中心) Institute for Nanoelectronic Devices(纳米电子器件研究所) Quantum Computing, Fudan University, Shanghai 200433, China(量子计算,复旦大学) Hefei National Laboratory, Hefei 230088, China(合肥国家实验室)

AI总结 本文提出通过强化微调生成模型来设计拓扑绝缘体和拓扑晶体绝缘体,展示了该方法在生成具有完整能隙的新拓扑材料方面的有效性,以Ge₂Bi₂O₆为例证明了其在拓扑绝缘体领域的应用。

Journal ref Nature Communications (2026)

详情
AI中文摘要

拓扑绝缘体(TIs)和拓扑晶体绝缘体(TCIs)是具有非常规电子性质的材料,其发现对实际应用具有高度价值。然而,特别是具有完整能隙的此类材料仍然稀少。鉴于传统方法在已知材料中扫描候选材料的局限性,我们专注于通过生成模型生成新拓扑材料。具体而言,我们应用强化微调(ReFT)到预训练的生成模型,从而将模型的目标与材料设计目标对齐。我们证明ReFT在增强模型生成TIs和TCIs的能力方面是有效的,且对生成材料的稳定性影响很小。使用微调后的模型,我们成功识别了大量新的拓扑材料,Ge₂Bi₂O₆作为代表性的例子——一个具有0.26 eV完整能隙的TI,是该类材料中已知的最大之一。

英文摘要

Topological insulators (TIs) and topological crystalline insulators (TCIs) are materials with unconventional electronic properties, making their discovery highly valuable for practical applications. However, such materials, particularly those with a full band gap, remain scarce. Given the limitations of traditional approaches that scan known materials for candidates, we focus on the generation of new topological materials through a generative model. Specifically, we apply reinforcement fine-tuning (ReFT) to a pre-trained generative model, thereby aligning the model's objectives with our material design goals. We demonstrate that ReFT is effective in enhancing the model's ability to generate TIs and TCIs, with minimal compromise on the stability of the generated materials. Using the fine-tuned model, we successfully identify a large number of new topological materials, with Ge$_2$Bi$_2$O$_6$ serving as a representative example--a TI with a full band gap of 0.26 eV, ranking among the largest known in this category.

2504.06925 2026-05-21 cs.CV cs.AI 版本更新

Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition

视觉-语言模型是否准备好进行饮食评估?探索AI驱动的食品图像识别的下一个前沿

Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos Zambrano, Guadalupe X. Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, Aythami Morales

发表机构 * Biometrics and Data Pattern Analytics Lab, Universidad Autonoma de Madrid(生物度量与数据模式分析实验室,马德里自治大学) IMDEA Food, CEI UAM+CSIC(IMDEA食品,CEI UAM+CSIC)

AI总结 本文评估了六种先进的视觉-语言模型在不同层次上的食品识别能力,提出了一个新的评估指标,并展示了FoodNExTDB数据库在饮食评估中的应用潜力。

Comments Accepted at IEEE/CVF Computer Vision and Pattern Recognition Conference workshops 2025 (CVPRw) 10 pages, 4 figures, 2 tables

Journal ref 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 1-10

详情
AI中文摘要

基于食品图像的自动饮食评估仍是一个挑战,需要精确的食品检测、分割和分类。视觉-语言模型(VLMs)通过整合视觉和文本推理提供了新的可能性。在本研究中,我们评估了六种最先进的VLMs(ChatGPT、Gemini、Claude、Moondream、DeepSeek和LLaVA),分析它们在不同层次上的食品识别能力。在实验框架中,我们引入了FoodNExTDB,一个独特的食品图像数据库,包含9,263张由专家标注的图像,涵盖10个类别(例如“蛋白质来源”)、62个子类别(例如“家禽”)和9种烹饪风格(例如“烤制”)。总共,FoodNExTDB包括50,000个由七位专家生成的营养标签,这些标签由手动标注所有数据库中的图像生成。此外,我们提出了一种新的评估指标,专家加权召回率(EWR),该指标考虑了不同标注者之间的差异。结果表明,封闭源模型在识别包含单一产品的图像中的食品产品时,性能优于开源模型,达到了超过90%的EWR。尽管有潜力,当前VLMs在细粒度食品识别方面面临挑战,特别是在区分烹饪风格的细微差异和视觉相似的食品项目时,这限制了它们在自动饮食评估中的可靠性。FoodNExTDB数据库在https://github.com/AI4Food/FoodNExtDB上公开可用。

英文摘要

Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert-labeled images across 10 categories (e.g., "protein source"), 62 subcategories (e.g., "poultry"), and 9 cooking styles (e.g., "grilled"). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert-Weighted Recall (EWR), that accounts for the inter-annotator variability. Results show that closed-source models outperform open-source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.

2503.22693 2026-05-21 q-fin.ST cs.AI cs.CL 版本更新

Bridging Language Models and Financial Analysis

连接语言模型与金融分析

Alejandro Lopez-Lira, Jihoon Kwon, Sangwoon Yoon, Jy-yong Sohn, Chanyeol Choi

发表机构 * University of Florida(佛罗里达大学) Ministry of Justice, Republic of Korea(韩国司法部) Yonsei University(延世大学) LinqAlpha

AI总结 本文旨在通过概述最近的语言模型研究进展,探讨其在金融领域中的应用潜力,填补语言模型在金融行业中的实际应用与研究进展之间的差距。

Comments 28 pages

详情
AI中文摘要

大规模语言模型(LLMs)的快速进步为自然语言处理领域带来了革命性可能性,特别是在金融领域。金融数据通常嵌套在文本内容、数值表格和视觉图表之间复杂的相互关系中,这对传统方法来说是一个挑战。然而,LLMs的出现为处理和分析这种多维数据提供了更高效和深入的途径。尽管LLMs研究进展迅速,但在金融行业中的实际应用仍存在显著差距,因为金融行业更倾向于谨慎整合和长期验证。这种差异导致新兴LLM技术的实施速度较慢,尽管它们在金融应用中具有巨大潜力。因此,许多最新的LLM技术进展仍未被充分探索或利用。本文旨在通过提供对最近LLM研究进展的全面概述,并探讨其在金融领域的适用性,来弥合这一差距。基于之前的文献综述,我们突出几种新的LLM方法,探讨其独特的功能及其在金融数据分析中的潜在相关性。通过综合广泛研究的见解,本文旨在为研究人员和从业者提供有价值的资源,指出有前途的研究方向,并概述未来进一步推进LLM在金融应用中的机会。

英文摘要

The rapid advancements in Large Language Models (LLMs) have unlocked transformative possibilities in natural language processing, particularly within the financial sector. Financial data is often embedded in intricate relationships across textual content, numerical tables, and visual charts, posing challenges that traditional methods struggle to address effectively. However, the emergence of LLMs offers new pathways for processing and analyzing this multifaceted data with increased efficiency and insight. Despite the fast pace of innovation in LLM research, there remains a significant gap in their practical adoption within the finance industry, where cautious integration and long-term validation are prioritized. This disparity has led to a slower implementation of emerging LLM techniques, despite their immense potential in financial applications. As a result, many of the latest advancements in LLM technology remain underexplored or not fully utilized in this domain. This survey seeks to bridge this gap by providing a comprehensive overview of recent developments in LLM research and examining their applicability to the financial sector. Building on previous survey literature, we highlight several novel LLM methodologies, exploring their distinctive capabilities and their potential relevance to financial data analysis. By synthesizing insights from a broad range of studies, this paper aims to serve as a valuable resource for researchers and practitioners, offering direction on promising research avenues and outlining future opportunities for advancing LLM applications in finance.

2503.08292 2026-05-21 cs.CL cs.AI 版本更新

Do LLMs Triage Like Clinicians? A Dynamic Study of Outpatient Referral

大语言模型像医生一样分诊吗?对外科会诊的动态研究

Xiaoxiao Liu, Qingying Xiao, Bingquan Zhang, Junying Chen, Xiangyi Feng, Ziniu Li, Xiang Wan, Jian Chang, Guangjun Yu, Yan Hu, Benyou Wang

发表机构 * Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Bournemouth University(伯恩茅斯大学) National Health Data Institute, Shenzhen(深圳国家健康数据研究院) Shenzhen Research Institute of Big Data(深圳大数据研究院)

AI总结 本文研究了大语言模型在动态分诊过程中的表现,发现其在动态场景中通过有效提问减少不确定性,优于传统分类器,但静态场景下优势有限。

详情
AI中文摘要

门诊会诊(OR)是一种核心临床流程,将患者分配到医院部门,在信息不完整且不断演变的情况下进行,但通常被简化为静态分类问题,尽管实际上是交互性的。在本工作中,我们将门诊会诊视为由信息获取和不确定性降低驱动的动态过程。我们分析了基于固定患者信息的静态场景和涉及多轮对话的动态场景,以测试大语言模型(LLMs)是否通过更好的预测或更有效的提问来改善分诊结果。我们的发现表明,LLMs在静态分诊准确性上对传统分类器几乎没有优势,但在动态设置中始终优于它们,通过询问具有辨别性的后续问题来减少候选部门的不确定性。这些结果表明,大语言模型在门诊分诊中的主要价值不在于静态预测,而在于支持交互式、具有不确定性的临床决策。

英文摘要

Outpatient referral (OR) is a core clinical workflow that assigns patients to hospital departments under incomplete and evolving information, yet it is commonly simplified as a static classification problem despite being inherently interactive in practice. In this work, we study outpatient referral as a dynamic process driven by information acquisition and uncertainty reduction. We analyze both static scenarios based on fixed patient information and dynamic scenarios involving multi-turn dialogue, to test whether large language models (LLMs) improve referral outcomes through better prediction or more effective questioning. Our findings show that LLMs offer limited advantages over traditional classifiers in static referral accuracy, but consistently outperform them in dynamic settings by asking discriminative follow-up questions that reduce uncertainty over candidate departments. These results suggest that the primary value of LLMs in outpatient referral lies not in static prediction, but in supporting interactive, uncertainty-aware clinical decision-making.

2502.18915 2026-05-21 cs.CL cs.AI 版本更新

END: Early Noise Dropping for Efficient and Effective Context Denoising

END:早期噪声丢弃以实现高效有效的上下文去噪

Hongye Jin, Pei Chen, Jingfeng Yang, Zhengyang Wang, Fangran Mo, Jinghan Zhang, Meng Jiang, Yifan Gao, Binxuan Huang, Xinyang Zhang, Zheng Li, Tianyi Liu, Huasheng Li, Bing Yin

发表机构 * Amazon(亚马逊)

AI总结 本文提出END方法,通过在早期层对输入序列进行分割和线性探针,有效识别并丢弃噪声部分,从而提升LLM在不同任务上的性能和效率,同时加深了对LLM内部上下文推理机制的理解。

详情
AI中文摘要

大型语言模型(LLMs)在广泛自然语言处理任务中表现出色,但它们经常受到输入序列中无关或噪声内容的干扰,从而降低输出质量。这个问题影响了长上下文和短上下文场景,如检索增强生成、表格问答和上下文学习。我们发现LLMs可以在生成令牌之前,在早期层中隐式地识别输入序列中是否有有用信息。基于这一见解,我们引入了早期噪声丢弃(END),一种无需微调LLMs的新方法,以缓解此问题。END将输入序列分成块,并在LLMs的早期层上使用线性探针来区分信息丰富和噪声块。通过在过程中早期丢弃噪声块,END保留了关键信息,减少了干扰,并降低了计算开销。广泛的实验表明,END在不同LLMs上多个评估数据集上显著提高了性能和效率。此外,通过探针研究LLMs对输入的隐式理解,这项工作也加深了对LLMs如何内部进行上下文推理的理解。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences that degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (\textsc{END}), a novel approach to mitigate this issue without requiring fine-tuning the LLMs. \textsc{END} segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, \textsc{END} preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that \textsc{END} significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by investigating LLMs' implicit understanding to the input with the prober, this work also deepens understanding of how LLMs do reasoning with contexts internally.

2502.12120 2026-05-21 cs.LG cs.AI cs.CL 版本更新

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

LLMs on the Line: 数据决定损失-损失缩放定律

Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel

发表机构 * Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所) ELLIS Institute Tübingen(图宾根ELLIS研究所) Tübingen AI Center(图宾根人工智能中心) University of Tübingen(图宾根大学)

AI总结 研究探讨了影响LLM损失-损失缩放定律的主要因素,发现预训练数据决定了缩放趋势,而模型大小、优化超参数、分词器和架构差异对缩放影响有限,因此应精心选择预训练数据以获得最佳下游性能。

Comments ICML 2025 camera-ready version

详情
AI中文摘要

缩放定律指导大型语言模型(LLMs)的发展,通过提供模型大小、令牌和计算量之间的最佳平衡估计。最近,损失-损失缩放定律,即预训练数据集和下游任务之间损失的关系,已成为理解并改进LLM性能和泛化能力的强大工具。在本工作中,我们研究了哪些因素最强烈地影响损失-损失缩放。我们的实验发现,预训练数据决定了缩放趋势。相比之下,模型大小、优化超参数、分词器甚至显著的架构差异,如基于Transformer的模型如Llama和状态空间模型如Mamba之间的差异,通常影响有限。因此,从业者应仔细选择适合的预训练数据集以获得最佳下游性能,而架构和其他设置可以自由优化以提高训练效率。

英文摘要

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

2502.03752 2026-05-21 cs.LG cs.AI 版本更新

Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning

基于鲁棒技能的元强化学习中的自我改进技能学习

Sanghyeon Lee, Sangjun Bae, Yisak Park, Seungyul Han

发表机构 * Graduate School of Artificial Intelligence(人工智能研究生院) Ulsan National Institute of Science and Technology (UNIST)(釜山国立科学技术研究院 (UNIST))

AI总结 本文提出Self-Improving Skill Learning (SISL)方法,通过解耦的高层和技能改进策略进行自我指导的技能细化,并利用最大回报重标记进行技能优先级排序,从而在噪声和次优数据下实现鲁棒且稳定的适应,优于其他基于技能的元强化学习方法。

Comments 10 pages main, 27 pages appendix with reference. Accepted to ICLR 2026

Journal ref International Conference on Learning Representations (ICLR), 2026

详情
AI中文摘要

元强化学习(Meta-RL)能够快速适应未见任务,但在长时间 horizon 环境中面临挑战。基于技能的方法通过将状态-动作序列分解为可重用的技能并采用分层决策来解决这一问题。然而,这些方法对噪声的离线演示高度敏感,导致技能学习不稳定和性能下降。为此,我们提出Self-Improving Skill Learning (SISL),通过解耦的高层和技能改进策略进行自我指导的技能细化,同时应用最大回报重标记进行技能优先级排序,从而在噪声和次优数据下实现鲁棒且稳定的适应。通过减轻噪声的影响,SISL实现了可靠的技能学习,并在多样化的长horizon任务上一致优于其他基于技能的元强化学习方法。我们的代码可在https://epsilog.github.io/SISL获取。

英文摘要

Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable skills and employing hierarchical decision-making. However, these methods are highly susceptible to noisy offline demonstrations, leading to unstable skill learning and degraded performance. To address this, we propose Self-Improving Skill Learning (SISL), which performs self-guided skill refinement using decoupled high-level and skill improvement policies, while applying skill prioritization via maximum return relabeling to focus updates on task-relevant trajectories, resulting in robust and stable adaptation even under noisy and suboptimal data. By mitigating the effect of noise, SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods on diverse long-horizon tasks. Our code is available at https://epsilog.github.io/SISL.

2502.03545 2026-05-21 cs.GT cs.AI cs.MA cs.SI 版本更新

Proportional Selection in Networks

网络中的比例选择

Georgios Papasotiropoulos, Oskar Skibski, Piotr Skowron, Tomasz Wąs

发表机构 * University of Warsaw(华沙大学) University of Oxford(牛津大学)

AI总结 本文研究了如何从网络中选择k个代表性节点,旨在识别最有影响力节点并确保选择比例反映网络的多样性,提出了两种方法并进行了理论分析和实验验证。

Comments This version has been accepted for publication at IJCAI'26

详情
AI中文摘要

我们解决了从网络中选择k个代表性节点的问题,旨在实现两个目标:识别最有影响力的节点和确保选择比例反映网络的多样性。我们提出了两种方法来完成这一任务,进行了理论分析,并通过一系列实验展示了它们的有效性。

英文摘要

We address the problem of selecting $k$ representative nodes from a network, aiming to achieve two objectives: identifying the most influential nodes and ensuring the selection proportionally reflects the network's diversity. We propose two approaches to accomplish this, analyze them theoretically, and demonstrate their effectiveness through a series of experiments.

2502.02844 2026-05-21 cs.LG cs.AI cs.CR cs.MA 版本更新

Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning

狼群对抗攻击用于鲁棒多智能体强化学习

Sunwoo Lee, Jaebak Hwang, Yonghyeon Jo, Seungyul Han

发表机构 * Graduate School of Artificial Intelligence, UNIST, Ulsan, South Korea(人工智能研究生院,UNIST,韩国乌山)

AI总结 本文提出狼群对抗攻击框架,用于对抗多智能体强化学习中的协同对抗攻击,并引入狼群-对抗学习框架来训练鲁棒的MARL策略以防御该攻击。

Comments 9 pages main, 23 pages appendix with reference. Accepeted by ICML 2025

Journal ref Proceedings of Machine Learning Research (PMLR), ICML 2025

详情
AI中文摘要

传统多智能体强化学习(MARL)中的鲁棒方法往往难以应对合作场景中的协调对抗攻击。为了解决这一限制,我们提出了受狼群狩猎策略启发的狼群对抗攻击框架,该框架针对初始智能体及其辅助智能体以破坏合作。此外,我们还引入了狼群-对抗学习用于MARL(WALL)框架,该框架通过促进系统内协作来训练鲁棒的MARL策略以防御所提出的狼群攻击。实验结果突显了狼群攻击的毁灭性影响以及WALL所实现的显著鲁棒性改进。我们的代码可在https://github.com/sunwoolee0504/WALL上获得。

英文摘要

Traditional robust methods in multi-agent reinforcement learning (MARL) often struggle against coordinated adversarial attacks in cooperative scenarios. To address this limitation, we propose the Wolfpack Adversarial Attack framework, inspired by wolf hunting strategies, which targets an initial agent and its assisting agents to disrupt cooperation. Additionally, we introduce the Wolfpack-Adversarial Learning for MARL (WALL) framework, which trains robust MARL policies to defend against the proposed Wolfpack attack by fostering systemwide collaboration. Experimental results underscore the devastating impact of the Wolfpack attack and the significant robustness improvements achieved by WALL. Our code is available at https://github.com/sunwoolee0504/WALL.

2502.02834 2026-05-21 cs.LG cs.AI 版本更新

Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks

任务感知虚拟训练:增强元强化学习在分布外任务中的泛化能力

Jeongmo Kim, Yisak Park, Minung Kim, Seungyul Han

发表机构 * Graduate School of Artificial Intelligence, UNIST, Ulsan, South Korea(人工智能研究生院,UNIST,韩国乌山)

AI总结 本文提出Task-Aware Virtual Training方法,通过度量学习提升元强化学习在分布外任务中的泛化能力,采用虚拟任务保持任务特征并利用状态正则化技术减少状态变化环境中的过估计误差。

Comments 9 pages main paper, 20 pages appendices with reference. Accepted to ICML 2025

Journal ref Proceedings of Machine Learning Research (PMLR), ICML 2025

详情
AI中文摘要

元强化学习旨在开发能够泛化到未见任务的策略,这些任务从任务分布中采样。尽管基于上下文的元强化学习方法通过任务潜在变量改善任务表示,但它们在分布外(OOD)任务上常常表现不佳。为了解决这个问题,我们提出了Task-Aware Virtual Training(TAVT),一种新的算法,通过度量基于的表示学习准确捕捉任务特征,用于训练和OOD场景。我们的方法在虚拟任务中成功保持任务特征,并采用状态正则化技术以减轻状态变化环境中的过估计误差。数值结果表明,TAVT在各种MuJoCo和MetaWorld环境中显著增强了对OOD任务的泛化能力。我们的代码可在https://github.com/JM-Kim-94/tavt.git获取。

英文摘要

Meta reinforcement learning aims to develop policies that generalize to unseen tasks sampled from a task distribution. While context-based meta-RL methods improve task representation using task latents, they often struggle with out-of-distribution (OOD) tasks. To address this, we propose Task-Aware Virtual Training (TAVT), a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. Our method successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments. Numerical results demonstrate that TAVT significantly enhances generalization to OOD tasks across various MuJoCo and MetaWorld environments. Our code is available at https://github.com/JM-Kim-94/tavt.git.

2410.12771 2026-05-21 cond-mat.mtrl-sci cs.AI physics.comp-ph 版本更新

Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models

开放材料2024(OMat24)无机材料数据集和模型

Luis Barroso-Luque, Muhammed Shuaibi, Xiang Fu, Brandon M. Wood, Misko Dzamba, Meng Gao, Ammar Rizvi, C. Lawrence Zitnick, Zachary W. Ulissi

发表机构 * Fundamental AI Research (FAIR) at Meta(Meta 基础人工智能研究(FAIR))

AI总结 本研究提出了一种大规模公开数据集OMat24和预训练模型,旨在解决材料发现中公开训练数据和预训练模型不足的问题,通过密度泛函理论计算和先进模型提升材料科学的AI应用。

Comments 19 pages

详情
AI中文摘要

发现具有理想性能的新材料对于从缓解气候变化到下一代计算硬件的进步至关重要。AI有潜力通过更有效地探索化学空间来加速材料发现和设计,比其他计算方法或试错法更有效。尽管在AI用于材料数据、基准和模型方面取得了显著进展,但一个障碍是缺乏公开可用的训练数据和开放预训练模型。为此,我们提出了Open Materials 2024(OMat24)大规模公开数据集的Meta FAIR发布以及一组预训练模型。OMat24包含超过1.1亿个密度泛函理论(DFT)计算,专注于结构和组成多样性。我们的EquiformerV2模型在Matbench Discovery排行榜上实现了最先进的性能,并能够预测基态稳定性和形成能量,F1分数超过0.9,准确率达到20 meV/atom。我们探讨了模型大小、辅助去噪目标和微调对性能的影响,涵盖了包括OMat24、MPtraj和Alexandria在内的多种数据集。OMat24数据集和模型的公开发布使研究社区能够在此基础上进一步推动AI辅助材料科学的发展。

英文摘要

The ability to discover new materials with desirable properties is critical for numerous applications from helping mitigate climate change to advances in next generation computing hardware. AI has the potential to accelerate materials discovery and design by more effectively exploring the chemical space compared to other computational methods or by trial-and-error. While substantial progress has been made on AI for materials data, benchmarks, and models, a barrier that has emerged is the lack of publicly available training data and open pre-trained models. To address this, we present a Meta FAIR release of the Open Materials 2024 (OMat24) large-scale open dataset and an accompanying set of pre-trained models. OMat24 contains over 110 million density functional theory (DFT) calculations focused on structural and compositional diversity. Our EquiformerV2 models achieve state-of-the-art performance on the Matbench Discovery leaderboard and are capable of predicting ground-state stability and formation energies to an F1 score above 0.9 and an accuracy of 20 meV/atom, respectively. We explore the impact of model size, auxiliary denoising objectives, and fine-tuning on performance across a range of datasets including OMat24, MPtraj, and Alexandria. The open release of the OMat24 dataset and models enables the research community to build upon our efforts and drive further advancements in AI-assisted materials science.

2410.03296 2026-05-21 cs.CL cs.AI 版本更新

A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification

抽取式自我解释与人类推理在文本分类中的系统比较

Stephanie Brandl, Oliver Eberle

发表机构 * Center for Social Data Science(社会科学数据科学中心) University of Copenhagen(哥本哈根大学) Machine Learning Group(机器学习小组) Technische Universität Berlin(柏林技术大学)

AI总结 本文比较了抽取式自我解释与人类推理在文本分类任务中的有效性,通过分析不同任务和语言的解释质量,发现自我解释在文本长度和任务复杂度上与人类推理存在显著差异。

Comments accepted to the Trustworthy NLP Workshop, co-located with ACL 2026

详情
AI中文摘要

指令微调的LLM能够通过生成自我解释来向用户解释其输出,而无需应用复杂的可解释性技术。本文分析这种能力是否能产生高质量的解释。我们评估了以输入推理形式呈现的自我解释在人类中的可信度。我们研究了三个文本分类任务:情感分类、强迫劳动检测和声明验证。我们包括丹麦语和意大利语的情感分类任务翻译,并将自我解释与人类注释进行比较。为此,我们收集了Climate-Fever声明验证数据集的人类推理注释。我们进一步评估了人类和自我解释推理在正确模型预测方面的忠实度,并通过纳入事后归因基于的解释扩展了研究。我们分析了四个开源LLM,并发现自我解释与人类推理之间的对齐高度依赖于文本长度和任务复杂性。然而,自我解释会产生忠实的token级推理子集,而事后归因方法则倾向于强调结构和格式token,反映出根本不同的解释策略。

英文摘要

Instruction-tuned LLMs are able to provide \textit{an} explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a \textit{good} explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales highly depends on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.

2409.14839 2026-05-21 cs.AI cs.ET cs.HC 版本更新

Explainable and Human-Grounded AI for Decision Support Systems: The Theory of Epistemic Quasi-Partnerships

可解释且以人为中心的AI用于决策支持系统:知识性准伙伴关系理论

John Dorsch, Maximilian Moll

发表机构 * Faculty of Philosophy, Philosophy of Science and the Study of Religion, Ludwig Maximilian University Munich(哲学学院、科学哲学与宗教研究学院,慕尼黑路德维希-马克西米利安大学)

AI总结 本文提出了一种新的理论框架,即知识性准伙伴关系理论(EQP),用于指导开发能够提供人类基础解释(原因、反事实和置信度)的AI决策支持系统,以满足伦理和可解释AI(XAI)的需求。

Comments 20 pages

Journal ref Philosophy of Artificial Intelligence. Synthese Library, vol 533. Springer. 2026

详情
AI中文摘要

在人工智能决策支持系统(AI-DSS)的背景下,我们主张满足伦理和可解释AI(XAI)的需求是开发AI-DSS,以向人类决策者提供三种类型的以人为中心的解释:原因、反事实和置信度,这种方法我们称为RCC方法。我们首先回顾了当前的实证XAI文献,探讨了生成模型解释的各种方法(如LIME、SHAP、Anchors)与模型感知可信度和终端用户准确性之间的关系。我们展示了当前关于什么是良好人类基础原因的理论要么无法充分解释这些证据,要么没有为开发提供坚实的伦理建议。因此,我们提出了一种新的理论:知识性准伙伴关系理论(EQP)。最后,我们阐明了采用EQP的动机,并展示了它如何解释实证证据,提供坚实的伦理建议,并导致采用RCC方法。

英文摘要

In the context of AI decision support systems (AI-DSS), we argue that meeting the demands of ethical and explainable AI (XAI) is about developing AI-DSS to provide human decision-makers with three types of human-grounded explanations: reasons, counterfactuals, and confidence, an approach we refer to as the RCC approach. We begin by reviewing current empirical XAI literature that investigates the relationship between various methods for generating model explanations (e.g., LIME, SHAP, Anchors), the perceived trustworthiness of the model, and end-user accuracy. We demonstrate how current theories about what constitutes good human-grounded reasons either do not adequately explain this evidence or do not offer sound ethical advice for development. Thus, we offer a novel theory of human-machine interaction: the theory of epistemic quasi-partnerships (EQP). Finally, we motivate adopting EQP and demonstrate how it explains the empirical evidence, offers sound ethical advice, and entails adopting the RCC approach.

2407.01734 2026-05-21 quant-ph cs.AI 版本更新

Optical Quantum Mixed-State Reconstruction With Multiple Deep Learning Approaches

光学量子混合态重构与多种深度学习方法

Nhan Trong Luu, Tuyen Quang Nguyen, Duong Trung Luu, Thang Cong Truong

发表机构 * College of Communication and Information Technology(通信与信息科技学院) Can Tho University(金瓯大学) School of Computer Science(计算机科学学院) University of Technology Sydney(悉尼技术大学) Center of Digital Transformation and Communication(数字转型与通信中心) The University of Aizu(御所大学)

AI总结 本文提出两种基于神经网络的量子态重构方法,用于纯态和混合态的量子态重构,通过利用类别信息实现对纯态和混合态的高精度重构。

Journal ref SN Computer Science (2026)

详情
AI中文摘要

量子态重构是表征量子系统状态的关键技术,对许多量子技术应用至关重要。近年来,利用神经网络增强量子态重构的效率和精度引起了广泛关注。然而,适用于多种重构场景的通用方法仍较为有限。本文提出两种基于神经网络的重构方法:受限特征神经网络和混合态神经网络。通过在重构过程中利用类别信息,我们实现了对纯态和混合态的高精度重构。

英文摘要

Quantum state tomography is a crucial technique for characterizing the state of a quantum system, which is essential for many applications in quantum technologies. In recent years, there has been growing interest in leveraging neural networks to enhance the efficiency and accuracy of quantum state tomography. However, versatile methods that are broadly applicable across diverse reconstruction scenarios remain relatively underexplored. In this paper, we present two neural network-based reconstruction approaches for both pure and mixed quantum state tomography: Restricted Feature Based Neural Network and Mixed States Neural Network. By leveraging class information during reconstruction, we are able to achieve state-of-the-art performance of tomography for both pure and mixed quantum states.

2406.07125 2026-05-21 cs.CR cs.AI cs.LG 版本更新

CARACAS: vehiCular ArchitectuRe for detAiled Can Attacks Simulation

CARACAS:用于详细CAN攻击模拟的车辆架构

Sadek Misto Kirdi, Nicola Scarano, Franco Oberti, Luca Mannella, Stefano Di Carlo, Alessandro Savino

发表机构 * Politecnico di Torino, Department of Control(都灵理工大学控制与计算机工程系)

AI总结 本文提出CARACAS,一种用于模拟详细CAN攻击的车辆模型,通过结合Simulink等仿真框架和攻击模型的稳健表示,生成合成数据集以提高IDS的检测能力,重点展示电池电动车的扭矩控制攻击模拟。

Comments 6 pages, 8 figures, TrustAICyberSec workshop - IEEE ISCC 2024

Journal ref Proceeding of the 29th IEEE Symposium on Computers and Communications, ISCC 2024

详情
AI中文摘要

现代车辆越来越容易受到利用网络基础设施的攻击,特别是控制器局域网络(CAN)网络。为了使用基于数据分析和分类的现代工具如入侵检测系统(IDS)来有效应对这些威胁,需要大量的CAN消息大数据集。本文探讨了通过利用仿真框架如Simulink的建模能力以及攻击模型的稳健表示来生成合成数据集的可行性,提出了CARACAS车辆模型,包括通过CAN消息进行组件控制和攻击注入能力。CARACAS展示了该方法的有效性,包括电池电动车(BEV)模型,并重点针对两种不同的场景中的扭矩控制攻击进行分析。

英文摘要

Modern vehicles are increasingly vulnerable to attacks that exploit network infrastructures, particularly the Controller Area Network (CAN) networks. To effectively counter such threats using contemporary tools like Intrusion Detection Systems (IDSs) based on data analysis and classification, large datasets of CAN messages become imperative. This paper delves into the feasibility of generating synthetic datasets by harnessing the modeling capabilities of simulation frameworks such as Simulink coupled with a robust representation of attack models to present CARACAS, a vehicular model, including component control via CAN messages and attack injection capabilities. CARACAS showcases the efficacy of this methodology, including a Battery Electric Vehicle (BEV) model, and focuses on attacks targeting torque control in two distinct scenarios.

2605.20534 2026-05-21 cs.LG cs.AI stat.ML 版本更新

Axiomatizing Neural Networks via Pursuit of Subspaces

通过子空间追求轴心化神经网络

Mehmet Yamac, Mert Duman, Ugur Akpinar, Felix Rojas Casadiego, Serkan Kiranyaz, Marcel van Gerven, Moncef Gabbouj

发表机构 * Tampere University, Faculty of ITC, Finland(芬兰塔尔库大学信息与通信技术学院) Department of Electrical Engineering, Qatar University, Qatar(卡塔尔大学电气工程系) Donders Institute, Radboud University, The Netherlands(荷兰拉德堡德大学多纳尔斯研究所)

AI总结 本文提出一个基于几何公理的框架,用于解释神经网络的行为,通过子空间追求假设,统一了表示、计算和泛化在浅层和深层架构中的视角。

Comments 43 pages, 25 figures. Code and additional materials will be released

详情
AI中文摘要

尽管深度神经网络在许多领域取得了显著成功,但其底层机制仍不清晰,常被视为黑箱。这种经验表现与理论理解之间的差距类似于经典几何学的前公理阶段。在本文中,我们引入了子空间追求(PoS)假设,这是一个轴心化的框架,通过一组几何公理来表征神经网络的行为。这些公理及其推导出的结论为浅层和深层架构中的表示、计算和泛化提供了统一的视角。我们展示了该框架能够为深度学习中的基本问题提供几何解释,包括表示结构、架构机制和泛化行为,从而为一个连贯的理论基础提供了有原则的步骤。

英文摘要

While deep neural networks have achieved remarkable success across a wide range of domains, their underlying mechanisms remain poorly understood, and they are often regarded as black boxes. This gap between empirical performance and theoretical understanding poses a challenge analogous to the pre-axiomatic stage of classical geometry. In this work, we introduce the Pursuit of Subspaces (PoS) hypothesis, an axiomatic framework that formulates neural network behavior through a set of geometric postulates. These axioms, together with their derived consequences, provide a unified perspective on representation, computation, and generalization in both shallow and deep architectures. We show that this framework yields geometric explanations for fundamental questions in deep learning, including representation structure, architectural mechanisms, and generalization behavior, offering a principled step toward a coherent theoretical foundation.

2605.20529 2026-05-21 cs.CL cs.AI 版本更新

Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks

词组关联性:人类和神经网络中主谓一致学习的一种假设

Claire Hobbs, R. Thomas McCoy

发表机构 * Yale University(耶鲁大学) Cognitive Science Program(认知科学项目) Dept. of Linguistics(语言学系) Wu Tsai Institute(吴泰科创研院)

AI总结 本文探讨了语言输入中的统计信号如何帮助语法习得,提出词组关联性假设,通过词组共现规律提供句法依赖线索,并验证该机制在英语主谓一致习得中的有效性。

Comments Accepted to CoNLL

详情
AI中文摘要

在何种程度上,语言输入中的统计信号可以促进语法的习得?本文提出了一种称为词组关联性学习的机制,其中词组共现规律可以提供句法依赖的线索。我们研究这种机制是否能支持英语主谓一致的习得。首先,我们通过在不同可预测性水平的合成数据集上训练神经网络来模拟语言习得,发现存在一个可预测性范围,使得这些统计学习器能够稳健地学习主谓一致。然后,我们分析儿童导向语言中主谓配对的可变性,并发现此类数据中的可变性落在我们计算模拟中支持稳健泛化的范围内。综合来看,这些结果表明词组关联性是一种可行的学习策略,适用于儿童所接收的输入类型。

英文摘要

In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.

2605.20525 2026-05-21 cs.CV cs.AI cs.CL cs.LG eess.IV 版本更新

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

NeuroQA: 一种大规模的3D脑部MRI理解图像 grounded 评估基准

Mohammad H. Abbasi, Favour Nerrise, Shaurnav Ghosh, Ridvan Yesiloglu, Yuncong Mao, Bailey Trang, Mohammad Asadi, Merryn Daniel, Gustavo Chau Loo Kung, Ken Chang, Pavan Pinkesh Shah, Adam Turnbull, Kyan Younes, Seena Dehkharghani, Ehsan Adeli

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出NeuroQA,一个大规模的3D脑部MRI视觉问答基准,包含来自12977名受试者的56953个问答对,涵盖5-104岁及五个临床领域,通过3D体积评估11种临床推理技能,并提供可复现的生成脚本和在线排行榜。

Comments 30 pages, dataset and benchmark release

详情
AI中文摘要

我们提出了NeuroQA,一个大规模的3D脑部磁共振成像(MRI)视觉问答基准,包含来自12977名受试者的56953个问答对,涵盖5-104岁及五个临床领域:阿尔茨海默病、帕金森病、肿瘤、白质疾病和神经发育。与以往基于2D切片或狭窄诊断标签的医学视觉问答(VQA)方法不同,NeuroQA将每个项目与完整的3D体积配对。它评估11种临床相关的推理技能,涵盖是/否、多项选择和开放式格式。在203个模板中,131个是图像 grounded(可从3平面查看器回答),72个是图像 informed(答案来自定量体积测量或临床仪器)。为消除纯文本捷径,我们应用了答案分布优化,将封闭式文本-only 准确率从>80%降至44.6%;图像必要性通过发布的图像 grounded 协议单独评估。一个38规则的确定性管道和两轮专家审查验证每个QA对与FreeSurfer测量、元数据或放射学报告字段的匹配,零个相同受试者矛盾。我们进行了临床评估,两名临床医生独立评估100个冻结测试项目,使用3平面查看器。在封闭式(是/否+多项选择)测试公开项目上,最好的零样本视觉语言模型和监督的3D CNN基线分别达到47.5%和43.7%的准确率,均低于49.4%的文本-only 多数模板基准。NeuroQA采用两级发布,公开QA对用于开放访问数据集和受数据使用协议(DUAs)限制的数据集的可复现生成脚本,加上受试者级划分、保留的私人测试集和在线排行榜。

英文摘要

We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from $>$80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.

2605.20523 2026-05-21 cs.LG cs.AI q-bio.QM 版本更新

Machine-Learning-Enhanced Non-Invasive Testing for MASLD Fibrosis: Shallow-Deep Neural Networks Versus FIB-4, Tabular Foundation Models, and Large Language Models

机器学习增强的非侵入性测试用于MASLD纤维化:浅层-深层神经网络与FIB-4、表格基础模型和大语言模型的比较

Athanasios Angelakis, Gabriele De Vito, Eleni-Myrto Trifylli, Filomena Ferrucci

发表机构 * BioML Lab, RI CODE, UniBw, Munich, Germany(BioML实验室,RI CODE,UniBw,慕尼黑,德国) Department of Epidemiology and Data Science, Amsterdam UMC, Amsterdam, Netherlands(流行病学与数据科学系,阿姆斯特丹大学医学中心,阿姆斯特丹,荷兰) Alpha Indicium, Rijswijk, Netherlands(Alpha Indicium,里杰斯霍伊斯,荷兰) Department of Computer Science, University of Salerno, Salerno, Italy(计算机科学系,萨勒诺大学,萨勒诺,意大利) GI-Liver Unit, 2nd Department of Internal Medicine, National and Kapodistrian University of Athens, General Hospital of Athens “Hippocratio”, Athens, Greece(肝病单位,第二内科部,雅典国家与卡波迪斯托里亚大学,雅典“希波克拉底”医院,希腊)

AI总结 本文研究了机器学习增强的非侵入性测试在MASLD纤维化检测中的应用,比较了浅层-深层神经网络、FIB-4、表格基础模型和大语言模型在不同队列中的性能,发现浅层-深层神经网络在保持FIB-4变量空间的同时提供了更平衡的外部操作性能。

Comments 26 pages, 4 figures, 3 tables. Preprint

详情
AI中文摘要

晚期纤维化是代谢功能障碍相关脂肪性肝病(MASLD)中肝相关发病率的主要决定因素。FIB-4被广泛用作一线非侵入性测试,但其固定公式可能低估了年龄、天冬氨酸转氨酶、丙氨酸转氨酶和血小板计数中包含的诊断信息。我们评估了机器学习增强的非侵入性测试(MLE-NIT)是否能够在保持FIB-4变量空间的同时提高晚期纤维化的检测能力。我们使用了来自中国、马来西亚和印度的三个经活检确认的MASLD队列(n=784)。中国队列被分为486名训练样本和54名内部验证/调整治疗样本;最终性能仅在马来西亚和印度的外部队列中报告。模型使用了五个变量:年龄、FIB-4、天冬氨酸转氨酶、血小板计数和丙氨酸转氨酶。我们比较了FIB-4与浅层-深层神经网络(s-DNN)、TabPFN和gpt-4o-2024-08-06。FIB-4在马来西亚和印度的外部ROC-AUC分别为0.75和0.60。TabPFN达到0.69和0.66,微调后的GPT-4o达到0.75和0.63,而s-DNN达到0.77和0.67。s-DNN仅包含354个可训练参数,相比TabPFN的7,244,554个参数,却提供了更平衡的外部操作性能。校准显示s-DNN的Brier分数为0.18和0.22,排列重要性识别出AST和FIB-4为主要变量。紧凑的非线性MLE-NIT可能在不增加临床数据需求的情况下增强基于FIB-4的纤维化评估。

英文摘要

Advanced fibrosis is a major determinant of liver-related morbidity in metabolic dysfunction-associated steatotic liver disease (MASLD). FIB-4 is widely used as a first-line non-invasive test, but its fixed formula may underuse diagnostic information contained in age, aspartate aminotransferase, alanine aminotransferase, and platelet count. We evaluated whether machine-learning-enhanced non-invasive testing (MLE-NIT) can improve advanced fibrosis detection while preserving this FIB-4 variable space. We used three biopsy-confirmed MASLD cohorts from China, Malaysia, and India (n=784). The Chinese cohort was split into 486 training and 54 internal validation/tuning patients; final performance was reported only on the Malaysian and Indian external cohorts. Models used five variables: age, FIB-4, aspartate aminotransferase, platelet count, and alanine aminotransferase. We compared FIB-4 with a shallow-deep neural network (s-DNN), TabPFN, and gpt-4o-2024-08-06. FIB-4 achieved external ROC-AUCs of 0.75 and 0.60 in Malaysia and India, respectively. TabPFN achieved 0.69 and 0.66, fine-tuned GPT-4o achieved 0.75 and 0.63, and the s-DNN achieved 0.77 and 0.67, respectively. The s-DNN contained only 354 trainable parameters, compared with 7,244,554 for TabPFN, yet provided a more balanced external operating profile. Calibration showed s-DNN Brier scores of 0.18 and 0.22, and permutation importance identified AST and FIB-4 as dominant variables. Compact non-linear MLE-NITs may enhance FIB-4-based fibrosis assessment without increasing clinical data requirements.

2605.20520 2026-05-21 cs.AI 版本更新

Open-World Evaluations for Measuring Frontier AI Capabilities

面向前沿AI能力的开放世界评估

Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J. J. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K Hadfield, Andrew B. Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, Arvind Narayanan

发表机构 * Princeton University(普林斯顿大学) Cornflower Labs(Cornflower实验室) Meridian Labs(Meridian实验室) Stanford University(斯坦福大学) UK AI Security Institute(英国人工智能安全研究所) Johns Hopkins University(约翰霍普金斯大学) Adaption Labs(Adaption实验室) Australian National University(澳大利亚国立大学) Golden Gate Institute for AI(金门人工智能研究所) UW Madison(威斯康星大学麦迪逊分校) Microsoft Research(微软研究院) AI Digest(AI摘要) Georgetown University (CSET)(乔治城大学(CSET))

AI总结 本文提出开放世界评估作为一种补充方法,通过小样本定性分析来评估长期、复杂、现实世界任务,以更准确地衡量AI能力,并介绍了CRUX项目作为定期进行此类评估的尝试。

详情
AI中文摘要

基于基准的评估在跟踪前沿AI进展方面仍然很重要。但其可能同时高估和低估实际能力,因为它优先考虑可以精确指定、自动评分、容易优化且预算低、时间短的任务。我们倡导一种互补的评估类别,我们称之为开放世界评估:长期、复杂、现实世界任务通过小样本定性分析而非基准规模自动化来评估。在本文中,我们回顾了最近的开放世界评估,识别了其优势和局限性,并介绍了CRUX(Collaborative Research for Updating AI eXpectations),一个定期进行此类评估的项目。作为第一个实例,我们让一个AI代理开发并发布一个简单的iOS应用程序到Apple App Store。代理仅需一次可避免的手动干预就完成了任务,这表明开放世界评估可以提供关于可能很快普及的能力的早期预警。我们最后提出设计和报告开放世界评估的建议。

英文摘要

Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.

2605.20510 2026-05-21 cs.CV cs.AI cs.CY 版本更新

ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society

ShadeBench: 一个用于可持续社会建筑阴影模拟的基准数据集

Longchao Da, Mithun Shivakoti, Xiangrui Liu, T Pranav Kutralingam, Yezhou Yang, Hua Wei

发表机构 * School of Computing and Augmented Intelligence, Arizona State University(计算与增强智能学院,亚利桑那州立大学) Global Futures Laboratory, Arizona State University(全球未来实验室,亚利桑那州立大学)

AI总结 本文提出ShadeBench,一个用于城市阴影理解的综合数据集和基准,通过多模态数据支持阴影生成、分割和3D建筑重建,并提供标准化评估协议和基线方法,为数据驱动的城市气候研究和热适应城市规划提供基础。

Comments 12 pages, 13 figures, 2 tables. Accepted by KDD 2026 AI for Sciences Track

详情
AI中文摘要

由于城市热岛效应的加剧,城市热暴露问题变得越来越严峻。细粒度的阴影模式,尤其是由建筑物引起的阴影,强烈影响行人热暴露和户外活动规划。然而,大规模准确建模和分析城市阴影仍然困难,因为缺乏大规模数据集和系统评估框架。为了解决这一挑战,我们提出了ShadeBench,一个全面的城市阴影理解数据集和基准。ShadeBench包含地理多样的城市场景,具有时间变化的模拟阴影地图和文本描述,以及对齐的卫星图像、建筑骨架表示和3D建筑网格。基于此多模态数据集,ShadeBench支持一系列下游任务,包括阴影生成、阴影分割和3D建筑重建。我们进一步建立了这些任务的标准评估协议和基线方法。通过使大规模和细粒度的阴影分析成为可能,ShadeBench为数据驱动的城市气候研究提供了基础,并支持未来在热适应城市规划和决策中的研究。代码和数据集可在https://darl-genai.github.io/shadebench/上公开获取。

英文摘要

Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine-grained shade patterns, especially those induced by urban buildings, strongly influence pedestrians' thermal exposure and outdoor activity planning. However, accurately modeling and analyzing urban shade at scale remains difficult because of the lack of large-scale datasets and systematic evaluation frameworks. To address this challenge, we present ShadeBench, a comprehensive dataset and benchmark for urban shade understanding. ShadeBench contains geographically diverse urban scenes with temporally varying simulated shade maps and textual descriptions, together with aligned satellite imagery, building skeleton representations, and 3D building meshes. Built upon this multimodal dataset, ShadeBench supports a range of downstream tasks, including shade generation, shade segmentation, and 3D building reconstruction. We further establish standardized evaluation protocols and baseline methods for these tasks. By enabling scalable and fine-grained shade analysis, ShadeBench provides a foundation for data-driven urban climate research and supports future studies in heat-resilient urban planning and decision-making. The code and dataset are publicly available at https://darl-genai.github.io/shadebench/.

2605.20502 2026-05-21 cs.LG cs.AI cs.CV stat.AP stat.ML 版本更新

Tippett-minimum Fusion of Representation-space Diffusion Models for Multi-Encoder Out-of-Distribution Detection

基于表示空间扩散模型的Tippett最小融合多编码器异常检测

Neelkamal Bhuyan

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文提出了一种多编码器融合的表示空间扩散模型,通过统计分析每个编码器对特定分布偏移类型的敏感性,引入EncMin2L门控机制,无需使用OOD标签即可在较低参数成本下提升异常检测性能,同时在四种分布偏移类型上均达到0.94以上的AUROC。

Comments 14 pages

详情
AI中文摘要

我们通过多编码器融合的每编码器表示空间扩散模型(RDMs)来解决跨完整分布偏移谱的异常检测问题,包括全局域变化、语义分歧、纹理差异和协变量腐蚀。我们从ID数据中统计地识别每个编码器对特定偏移类型的敏感性,并引入EncMin2L——一种编码器无关的两级min(⋅)门控,能够在不使用OOD标签的情况下结合和校准每编码器扩散基的似然检测器,参数成本比单编码器基线低2.3倍。两种ID数据诊断:η²(类条件F检验)和Δμ(在合成腐蚀下的对数似然偏移)量化编码器的专业化,而Tippett最小p值组合将每编码器得分聚合为一个校准稳定的OOD信号。EncMin2L在所有四种偏移类型上均达到≥0.94的AUROC,优于在重叠基准上的最佳表示空间扩散OOD检测器。

英文摘要

We address out-of-distribution (OOD) detection across the full spectrum of distribution shifts -- global domain changes, semantic divergence, texture differences, and covariate corruptions -- through a multi-encoder fusion of per-encoder representation-space diffusion models (RDMs). We statistically identify each encoder's sensitivity to specific shift types from ID data alone and introduce EncMin2L -- an encoder-agnostic two-level $\min(\cdot)$-gate that combines and calibrates per-encoder diffusion-based likelihood detectors without OOD labels, outperforming monolithic multi-encoder baselines at $2.3\times$ lower parameter cost. Two ID-data diagnostics: $η^2$ (class-conditional F-test) and $Δμ$ (log-likelihood shift under synthetic corruptions) -- quantify encoder specialization, while a Tippett minimum $p$-value combination aggregates per-encoder scores into a single, calibration-stable OOD signal. EncMin2L achieves $\geq 0.94$ AUROC across all four shift types simultaneously, outperforming the state-of-the-art representation-space diffusion OOD detectors across overlapping benchmarks.

2605.20477 2026-05-21 cs.LG cs.AI cs.CL 版本更新

Training Language Agents to Learn from Experience

训练语言代理以从经验中学习

Yuval Shalev, Zifeng Ding, Mateja Jamnik

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文提出了一种名为In-context Training(ICT)的任务框架,用于评估语言代理在跨任务中的自我改进能力,并通过基于强化学习的训练管道直接从经验中学习反思,从而在多个基准任务中优于基线模型,展示了从经验中学习的能力本身可以被学习。

详情
AI中文摘要

语言代理可以在交互环境中通过经验进行适应,但当前基于反思的方法只能在单个任务实例内进行自我纠正。是否可以将这种经验提炼成可重用的教训,从而在未来的未见任务上提高性能仍不明确。我们通过引入In-context Training(ICT)任务来解决这个问题,这是一种用于评估语言代理跨任务自我改进能力的框架。在ICT中,一个反思模型观察由行为模型收集的轨迹,并生成旨在提高行为模型在未见任务上的性能的系统提示。然后,我们提出了一种基于强化学习的训练管道,用于直接从经验中学习此类反思,而无需人工提供的示例。在ALFWorld和MiniHack上,我们训练的反思器在大多数保留的任务家族上优于未训练的基线,表明从经验中学习的能力本身可以被学习。在某些情况下,我们观察到在训练反射器的基准之外的泛化能力,能够显著不同的环境。最后,我们介绍了MetaGym,一个通用的Python库,用于构建元环境,从而促进未来对自我改进语言代理的研究。

英文摘要

Language agents can adapt from experience in interactive environments, but current reflection-based methods can only self-correct within a single task instance. Whether such experience can be distilled into reusable lessons that improve performance on future unseen tasks remains unclear. We address this problem by introducing the In-context Training (ICT) task, a framework for evaluating cross-task self-improvement in language agents. In ICT, a reflector model observes trajectories collected by an actor model and generates system prompts intended to improve the actor's performance on future unseen tasks. We then propose an RL-based training pipeline for learning such reflections directly from experience, without human-provided examples. Across ALFWorld and MiniHack, our trained reflectors outperform an untrained baseline on most held-out task families, showing that the ability to learn from experience can itself be learned. In some cases, we observe generalisation beyond the benchmark on which the reflector was trained, to substantially different environments. Finally, we introduce MetaGym, a generic Python library for constructing meta-environments, enabling future research on self-improving language agents.

2605.20473 2026-05-21 cs.SE cs.AI cs.LG 版本更新

Code Generation by Differential Test Time Scaling

通过微分测试时间缩放进行代码生成

Yifeng He, Ethan Wang, Jicheng Wang, Xuanxin Ouyang, Hao Chen

发表机构 * University of California, Davis(加州大学戴维斯分校)

AI总结 本文提出DiffCodeGen,一种基于覆盖引导的微分分析的代码生成方法,通过生成多样化的代码候选并利用覆盖引导模糊测试来合成输入,无需现有测试用例或大语言模型,从而提高效率和可扩展性。

Comments 16 main text, 21 pages with references

详情
AI中文摘要

测试时间缩放已崭露头角,成为通过在推理时间探索大规模解决方案空间来改进代码生成的有前途的方法。然而,现有方法通常依赖于公开的测试用例,这些在实践中不可用,或需要大量的LLM推理来选择候选,导致显著的token消耗和时间开销。我们提出了DiffCodeGen,一种基于覆盖引导的微分分析的新型测试时间缩放方法用于代码生成。DiffCodeGen利用各种采样和提示策略生成多样化的代码候选,然后应用覆盖引导的模糊测试来合成输入,而无需任何现有的测试用例或大语言模型。通过在这些输入上执行所有候选,DiffCodeGen捕捉到它们的动态行为并根据行为相似性对候选进行聚类。DiffCodeGen选择最大聚类的medoid作为最终输出。不同于先前的测试时间缩放方法需要额外的LLM推理来选择候选,DiffCodeGen在不调用任何额外模型的情况下进行选择,导致极小或没有额外的token消耗。DiffCodeGen完全异步,自然适合当前代理编程的趋势,因此是高效且高度可扩展的。我们评估了DiffCodeGen在4个大型语言模型上的表现,展示了相对于基线的一致改进。与最先进的测试时间缩放方法相比,DiffCodeGen在仅使用少量时间和token的情况下实现了竞争或更优的性能。DiffCodeGen是模型无关的,可以与推理模型结合以进一步提升性能。

英文摘要

Test-time scaling has emerged as a promising approach for improving code generation by exploring large solution spaces at inference time. However, existing methods often rely on public test cases that are unavailable in practice, or require extensive LLM inference for candidate selection, leading to significant token consumption and time overhead. We present DiffCodeGen, a novel test-time scaling method for code generation based on coverage-guided differential analysis. DiffCodeGen generates diverse code candidates using various sampling and prompting strategies, then applies coverage-guided fuzzing to synthesize inputs without requiring any existing tests or large language models. By executing all candidates on these inputs, DiffCodeGen captures their dynamic behavior and clusters candidates based on behavioral similarity. DiffCodeGen selects the medoid of the largest cluster as the final output. Unlike prior test-time scaling methods that invoke additional LLM inference for candidate selection, DiffCodeGen performs selection without any extra model calls, incurring little to no additional token consumption. DiffCodeGen is fully asynchronous, naturally suited to the current trend of agentic coding, and is thus efficient and highly scalable. We evaluate DiffCodeGen across 4 large language models, demonstrating consistent improvements over baselines. Compared to state-of-the-art test-time scaling methods, DiffCodeGen achieves competitive or superior performance while using only a fraction of time and tokens. DiffCodeGen is model-agnostic and can be combined with reasoning models to further boost performance.

2605.20470 2026-05-21 cs.CV cs.AI physics.med-ph 版本更新

EPC-3D-Diff: Equivariant Physics Consistent Conditional 3D Latent Diffusion for CBCT to CT Synthesis

EPC-3D-Diff: 基于CBCT到CT合成的等价物理一致条件3D潜在扩散模型

Alzahra Altalib, Chunhui Li, Haytham Al Ewaidat, Khaled Alawneh, Ahmad Qendel, Alessandro Perelli

发表机构 * School of Science and Engineering, University of Dundee UK(邓迪大学科学与工程学院) Faculty of Applied Sciences, Jordan University of Science and Technology(约旦科学技术大学应用科学学院) Experia Healthcare, Jordan(约旦Experia医疗) School of Cardiovascular and Metabolic Health, University of Glasgow UK(格拉斯哥大学心血管与代谢健康学院)

AI总结 本文提出EPC-3D-Diff,一种新的条件3D潜在扩散框架,用于体积CBCT到CT合成,通过引入从成像物理导出的投影域等价损失,提高了物理一致性。该方法在训练过程中通过正向投影旋转合成的CT体积,并将其与相应角度偏移的投影进行匹配,从而在扩散目标中集成物理一致的等价约束。

Comments 10 pages, 4 figures

详情
AI中文摘要

锥束CT(CBCT)在放疗中常用于患者定位,但其定量可靠性受到散射、噪声和重建伪影的限制,限制了Hounsfield单位(HU)的准确性。我们提出了EPC-3D-Diff,一种新的条件3D潜在扩散框架,用于体积CBCT到CT合成,引入了从成像物理导出的投影域等价损失。与常见的图像域等价性不同,我们利用体积内旋转对应于其投影的角偏移的事实。在训练过程中,我们通过正向投影旋转合成的CT体积并将其与适当角度偏移的投影进行匹配,从而在扩散目标中集成物理一致的等价约束。为了高效捕捉完整的3D上下文,条件扩散在由轻量3D自动编码器学习的紧凑潜在空间中进行,保持轴向深度的同时在平面分辨率上进行下采样以实现稳定训练。我们验证了配对的头CBCT/CT假体数据集,包括重复扫描,并使用患者层面的分割进行配对临床数据验证,并进行了单域和混合域训练、消融实验和与扩散和CycleGAN的比较。EPC-3D-Diff具有良好的泛化能力,并在PSNR上相比最先进的方法取得了显著的改进,分别在假体和临床数据上提高了+7.4 dB和+1.8 dB,同时在SSIM和HU准确性方面也有所提升,在组织边界内。总体而言,EPC-3D-Diff提高了鲁棒性和物理一致性,支持HU意识的合成,以支持下游的放疗工作流程。

英文摘要

Cone-beam CT (CBCT) is routinely acquired during radiotherapy for patient setup, but its quantitative reliability is degraded by scatter, noise, and reconstruction artifacts, limiting Hounsfield Unit (HU) accuracy. We propose EPC-3D-Diff, a novel conditional 3D latent diffusion framework for volumetric CBCT to CT synthesis that introduces a projection domain equivariance loss derived from acquisition physics. Unlike common image domain equivariance, we exploit the fact that an in plane rotation of the volume corresponds to an angular shift in its projections. During training, we enforce this relationship by forward projecting rotated synthesized CT volumes and matching them to appropriately angle shifted projections of the paired target CT, yielding a physics consistent equivariance constraint integrated into the diffusion objective. To capture full 3D context efficiently, conditional diffusion is performed in a compact latent space learnt by a lightweight 3D autoencoder, preserving axial depth while downsampling in plane resolution for stable training. We validate on a paired head CBCT/CT phantom dataset, including repeat scans, and paired clinical data using patient wise splits, and perform single and mixed domain training, ablations, and comparisons with diffusion and CycleGAN. EPC-3D-Diff generalizes well and achieved substantial improvements, +7.4 dB (phantom) and +1.8 dB (clinical data) in PSNR compared to state of the art methods, alongside improved SSIM and HU accuracy, within tissue boundaries. Overall, EPC-3D-Diff improves robustness and physics consistency, supporting HU aware synthesis for downstream radiotherapy workflows.

2605.20467 2026-05-21 cs.AI 版本更新

High Quality Embeddings for Horn Logic Reasoning

用于霍恩逻辑推理的高质量嵌入

Yifan Zhang, Yasir White, Dean Clark, Joseph Sanchez, Jevon Lipsey, Ashely Hirst, Jeff Heflin

发表机构 * Lehigh University Computer Science and Engineering(莱维大学计算机科学与工程系) Los Angeles Pierce College Computer Science(洛杉矶派克学院计算机科学) Colorado College Computer Science(科罗拉多学院计算机科学)

AI总结 本文提出了一种生成高质量逻辑语句嵌入的方法,通过三元组损失训练嵌入,并通过生成重复术语的锚点、平衡易难例以及强调最困难的例子来提高下游任务的表现。

Journal ref Proceedings of Machine Learning Research 284:1-14, 2025

详情
AI中文摘要

神经网络可以被训练以对逻辑推理者的选择进行排序,从而更高效地寻找答案。这一过程中的关键步骤是创建有用的嵌入,即逻辑语句的数值表示。本文介绍了并评估了几种生成嵌入的方法,以获得更好的下游结果。我们使用三元组损失训练嵌入,这需要由锚点、正例和负例组成的示例。我们引入了三个想法:生成更可能具有重复术语的锚点,以生成正例和负例的方式确保在简单、中等和困难示例之间有良好的平衡,并在训练过程中定期强调最困难的例子。我们进行了几项实验来评估这种方法,包括在不同知识库中比较不同嵌入的性能,以尝试确定哪些特征使嵌入适合特定的推理任务。

英文摘要

Neural networks can be trained to rank the choices made by logical reasoners, resulting in more efficient searches for answers. A key step in this process is creating useful embeddings, i.e., numeric representations of logical statements. This paper introduces and evaluates several approaches to creating embeddings that result in better downstream results. We train embeddings using triplet loss, which requires examples consisting of an anchor, a positive example, and a negative example. We introduce three ideas: generating anchors that are more likely to have repeated terms, generating positive and negative examples in a way that ensures a good balance between easy, medium, and hard examples, and periodically emphasizing the hardest examples during training. We conduct several experiments to evaluate this approach, including a comparison of different embeddings across different knowledge bases, in an attempt to identify what characteristics make an embedding well-suited to a particular reasoning task.

2605.20459 2026-05-21 cs.CV cs.AI 版本更新

Pixel Wised Lesion Prediction on COVID-19 CT Imagery: A Comparative Analysis of Automated Image Segmentation Architectures

基于像素的新冠CT影像病变预测:自动图像分割架构的比较分析

Sarmad Khan, Arslan Shaukat, Umer Asgher, Basim Azam

发表机构 * Department of Computer \& Software Engineering National University of Sciences \& Technology Islamabad, Pakistan School of Computing \& Information Systems University of Melbourne Melbourne, Australia

AI总结 本文通过比较四种深度学习架构与六种预训练编码器,评估了在新冠CT影像中预测病变的性能,发现深度学习在分割任务中具有高精度和效率,其中二分类分割达到98%的F1分数,多分类分割在不同数据集上分别达到75%和77%的F1分数。

Comments 7 pages, 6 figures, 4 tables

详情
AI中文摘要

近年来,深度学习算法在医学图像分割领域受到了越来越多的关注。然而,由于缺乏标准化的性能分析方法和先前研究中使用不同数据集,该领域的可靠性受到阻碍。本研究的主要目的是全面评估当前的分割框架与最先进的预训练骨干网络,以准确预测CT影像中的新冠病变。此外,这种评估可以作为其他成像场景图像分割的参考点。为了实现这一目标,我们整合了四个不同的深度学习架构,即Unet、PSPNet、Linknet和FPN,以及六个预训练编码器,包括VGG 19、DenseNet 121、Inception ResNet V2、MobileNet V2、SeresNet 101和EfficientNet B0。这种方法使能够开发出多样化的测试架构。在图像分割的背景下,我们的研究涵盖了二分类和多分类实验。通过分析三个不同的新冠CT分割数据集,我们的分析结果表明深度学习架构能够产生精确且高效的分割结果。显著的是,二分类分割的最高F1分数达到98%,而多分类分割在两个不同的数据集上分别达到了75%和77%的F1分数。人工智能和深度学习的使用在多个维度上增强了对流行病疾病诊断过程的帮助。

英文摘要

In recent years, there has been a notable increase in the level of attention that is given to algorithms based on deep learning in the context of medical image segmentation. Nevertheless, the reliability of the field has been hindered due to the absence of a standardized methodology for performance analysis and the utilization of different datasets in previous research. The primary objective of the research is to comprehensively evaluate contemporary segmentation frameworks combined with state-of-the-art pre-trained backbones in order to accurately predict COVID-19 lesions in CT images. Moreover, this evaluation can serve as a point of reference for the segmentation of images in various other imaging scenarios. In order to accomplish this, we integrate four distinct deep learning architectures, namely Unet, PSPNet, Linknet, and FPN, with six pre-trained encoders, including VGG 19, DenseNet 121, Inception ResNet V2, MobileNet V2, SeresNet 101, and EfficientNet B0. This approach enables the development of diverse testing architectures. In the context of image segmentation, our research encompassed both binary and multi-class experimentation. The findings derived from our analysis of three distinct COVID-19 CT segmentation datasets indicate that deep learning architectures yield precise and efficient segmentation outcomes. Significantly, a maximum F1-Score of 98% was attained for binary class segmentation, while multi-class segmentation yielded F1-Scores of 75% and 77% across two separate datasets. The utilization of artificial intelligence and deep learning enhances the diagnostic process for pandemic diseases across multiple dimensions.

2605.20456 2026-05-21 cs.SE cs.AI cs.MA 版本更新

Agentic Agile-V: From Vibe Coding to Verified Engineering in Software and Hardware Development

Agentic Agile-V: 从Vibe编码到验证工程在软件和硬件开发中

Christopher Koch

发表机构 * Independent Researcher(独立研究者)

AI总结 本文研究了代理AI在软件和硬件开发中的应用,提出Agentic Agile-V框架,通过SCOPE-V循环将对话意图转化为结构化工程成果,并贡献了最小输入艺术品类税、对话到合同门、风险自适应工作流和证据包接受模型。

Comments 7 pages, 1 figure

详情
AI中文摘要

Agentic AI coding systems can inspect repositories, plan implementation steps, edit files, call tools, run tests, and submit pull requests. These capabilities make software and hardware development faster in some settings, but current evidence does not support the simple claim that autonomous code generation automatically improves engineering outcomes. Controlled studies report productivity gains in some enterprise tasks, slowdowns in mature open-source work, moderate but heterogeneous meta-analytic effects, and persistent failures in repository setup, dependency handling, permission gating, and hardware verification. This paper argues that the central problem is no longer prompt engineering; it is engineering process control. It synthesizes evidence from agentic software engineering, GitHub-scale adoption studies, repository-level agent configuration, productivity trials, issue-resolution benchmarks, and hardware/RTL verification research. It proposes Agentic Agile-V, a process framework that uses Agile-V as the lifecycle backbone and a task-level SCOPE-V loop - Specify, Constrain, Orchestrate, Prove, Evolve, and Verify - to convert conversational intent into structured engineering artifacts and acceptance evidence. The paper contributes: (i) a taxonomy of minimum input artifacts for agentic software, firmware, and hardware work; (ii) a conversation-to-contract gate that separates exploratory dialogue from implementation; (iii) risk-adaptive feature, bug-fix, testing, and hardware workflows; and (iv) an evidence-bundle acceptance model for agent-generated artifacts. The paper concludes that agentic AI does not eliminate engineering discipline; it increases the value of requirements, constraints, traceability, independent verification, and human approval.

英文摘要

Agentic AI coding systems can inspect repositories, plan implementation steps, edit files, call tools, run tests, and submit pull requests. These capabilities make software and hardware development faster in some settings, but current evidence does not support the simple claim that autonomous code generation automatically improves engineering outcomes. Controlled studies report productivity gains in some enterprise tasks, slowdowns in mature open-source work, moderate but heterogeneous meta-analytic effects, and persistent failures in repository setup, dependency handling, permission gating, and hardware verification. This paper argues that the central problem is no longer prompt engineering; it is engineering process control. It synthesizes evidence from agentic software engineering, GitHub-scale adoption studies, repository-level agent configuration, productivity trials, issue-resolution benchmarks, and hardware/RTL verification research. It proposes Agentic Agile-V, a process framework that uses Agile-V as the lifecycle backbone and a task-level SCOPE-V loop - Specify, Constrain, Orchestrate, Prove, Evolve, and Verify - to convert conversational intent into structured engineering artifacts and acceptance evidence. The paper contributes: (i) a taxonomy of minimum input artifacts for agentic software, firmware, and hardware work; (ii) a conversation-to-contract gate that separates exploratory dialogue from implementation; (iii) risk-adaptive feature, bug-fix, testing, and hardware workflows; and (iv) an evidence-bundle acceptance model for agent-generated artifacts. The paper concludes that agentic AI does not eliminate engineering discipline; it increases the value of requirements, constraints, traceability, independent verification, and human approval.

2605.20449 2026-05-21 cs.LG cs.AI 版本更新

LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series

LLM预训练塑造了可泛化的流形:跨模态迁移至时间序列的洞察

Alexis Roger, Prateek Humane, Zhenghan Tai, Gwen Legate, Andrei Mircea, Vasilii Feofanov, Irina Rish

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(魁北克人工智能研究所) Université de Montréal(蒙特利尔大学) University of Toronto(多伦多大学) Concordia University(康科迪亚大学) com(42.com)

AI总结 研究探讨了语言预训练的Transformer能否成为有效的时序预测器,并揭示了跨模态迁移的机制,指出预训练构建了流形,微调则将数值动态投影到任务相关方向。

详情
AI中文摘要

语言预训练的Transformer能否成为有效的时序预测器,以及原因是什么?本文表明,跨模态迁移出现是因为语言预训练为时序训练预设了一个可重用的流形。在冻结的LLM状态上进行线性探测可以解码出真实的时序轨迹而无需配对监督,该投影空间中的检索能产生具有竞争力的预测,表明在微调之前就已经存在结构和动态。预训练初始化还提升了优化效果,产生连贯的梯度和高度各向异性的损失景观,不同于随机初始化。微调则起到低维对齐的作用,重用已有的方向而非从头学习时间原始特性,这通过低秩更新、子空间对齐和共享的周期性、趋势和重复特征得到证实。这些结果支持了LLM到时序迁移的几何解释:语言预训练构建了流形,微调将数值动态投影到任务相关方向上。

英文摘要

Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.

2605.20445 2026-05-21 cs.CV cs.AI 版本更新

A Comprehensive Comparison of Deep Learning Architectures for COVID-19 Classification on CT & X-ray Imagery

对用于CT和X光影像中新冠分类的深度学习架构的全面比较

Sarmad Khan, Arslan Shaukat, Umer Asgher, Basim Azam

发表机构 * Department of Computer \& Software Engineering National University of Sciences \& Technology Islamabad, Pakistan School of Computing \& Information Systems University of Melbourne Melbourne, Australia

AI总结 本文通过比较多种深度学习架构,提出基于卷积神经网络的计算机辅助诊断系统,以区分新冠和正常肺部影像,并在X光和CT数据集上取得了95至98%的平均准确率。

Comments 6 pages, 2 figures, 5 tables

详情
AI中文摘要

新冠是一种造成大量人员伤亡的重大挑战,不仅涉及某些国家,甚至全球也因冠状病毒而遭受影响。使用计算断层扫描(CT)和X光的肺部影像技术是新冠或其他大流行病筛查过程中最有效的工具。如今,技术已通过人工智能取代手动过程,用自动化机器使系统能够模仿人类大脑,通过经验做出明智决策。受此启发,我们的工作提出使用卷积神经网络(CNN)模型设计一个计算机辅助诊断(CAD)系统,以区分新冠和正常肺部影像。我们使用了两组不同的肺部X光影像和两组不同的CT扫描,并利用预训练的多种网络(如VGG(16, 19)、Densenet(121)、Resnet(50, 50 V2, 101 V2)、MobileNet(V2)、Xception Inception(V3, Resnet V2)、EfficientNet(B0)和Nasnet(Large))进行分类。在X光和CT图像数据集上,Resnet和VGG架构显示出能够正确区分新冠和正常图像的能力,平均准确率分别为95至98%。我们在分类数据集上的结果具有竞争力,并优于文献中已报告的发现。

英文摘要

COVID-19 was a significant challenge that led to the loss of numerous lives daily. Not only a certain country was involved in this outbreak, but even the world has suffered because of the coronavirus. Imaging techniques using computed tomography (CT) and X-rays of the lungs are the most useful tools for the COVID-19 or any other pandemic disease screening process. Technology today has revolutionized the world by using artificial intelligence to replace manual processes with automated machines, which enable the system to imitate the human brain by making wise decisions based on experience. Motivated by this, our work proposes to use convolutional neural networks (CNN) based models for designing a computer-aided diagnosis (CAD) system that differentiates between COVID-19 and healthy lung pictures. We used two different sets of X-ray images of the lungs in addition to two different sets of CT scans and the classification is done using a variety of networks that have been pre-trained such as VGG (16, 19), Densenet (121), Resnet (50, 50 V2, 101 V2), Mobile net (V2), Xception Inception (V3, Resnet V2), Efficient net (B0) and Nasnet (Large). On the X-ray and CT image datasets, Resnet and VGG architecture have shown the ability to properly differentiate COVID-19 from normal images, with an average accuracy of 95 to 98 percent respectively. Our acquired results on the classification datasets are competitive and superior to previously reported findings in the literature.

2605.20442 2026-05-21 cs.HC cs.AI 版本更新

Modeling Emotional Dynamics in Agent-to-Agent Interactions on Moltbook

在Moltbook上代理间交互中情感动态建模

Syed Mhamudul Hasan, Abdur R. Shahid

发表机构 * Southern Illinois University(南伊利诺伊大学)

AI总结 本文研究了Moltbook中代理间交互的情感动态,提出了一种情感感知框架,用于将文本交互映射到预定义的细粒度情感类别,提取结构化的情感档案,并引入了基于情感的Persona-Stimulus-Reaction(PSR)领域来评估行为可靠性。

详情
AI中文摘要

生成式AI系统越来越多地被用作在线环境中的交互代理,例如名为Moltbook的社会网络。在Moltbook中,大规模的代理AI可以发布、评论并参与由AI驱动的文本生成的活动。然而,这些代理行为特征仍不够理解,特别是在复杂的多代理交互中。在本研究中,我们分析了Moltbook中代理交互的情感动态。我们构建了一个情感感知框架,将文本交互映射到预定义的细粒度情感类别集,从而在代理和交互上下文中提取结构化的情感档案。为进一步评估行为可靠性,我们引入了一个名为Persona-Stimulus-Reaction(PSR)的情感领域,以捕捉在相似上下文中情感响应的一致性。我们的分析显示,代理在不同上下文中表现出不同的情感模式和行为稳定性水平。我们的分析揭示了代理在不同上下文中表现出不同的情感特征,其行为稳定性因交互上下文而异。

英文摘要

Generative AI systems are increasingly deployed as interactive agents in online environments, such as a social network called Moltbook. In Moltbook, large-scale agentic AIs can post, comment, and engage in activities generated at scale by AI-driven text. Yet these agent behavioral characteristics remain insufficiently understood, particularly in complex, multi-agent interaction. In this study, we analyze the emotional dynamics of agent interactions within Moltbook. We construct an emotion-aware framework that maps textual interactions to a predefined set of fine-grained emotional categories, enabling the extraction of structured emotion profiles across agents and interaction contexts. To further evaluate behavioral reliability, we introduce an emotion-based domain called Persona-Stimulus-Reaction (PSR) that captures the alignment of emotional responses across similar contexts. Our analysis shows distinct emotional patterns and varying levels of behavioral stability across agents. Our analysis reveals that agents exhibit distinct emotional signatures with varying levels of behavioral stability influenced by interaction context.

2605.20441 2026-05-21 cs.LG cs.AI cs.NE 版本更新

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

Transformer在Grokking中的权重衰减区域:廉价的在线诊断

Lucky Verma

发表机构 * Independent Researcher(独立研究者)

AI总结 研究探讨了在模运算中训练的Transformer模型在记忆、泛化和崩溃之间的尖锐转变,并通过权重衰减作为标量经验控制参数来分析这些区域,引入了两种廉价的在线诊断方法,通过注意力激活来跟踪训练动态,并在较低计算成本下补充损失景观诊断。

Comments 28 pages, 11 figures, 5 tables. Code and aggregate JSONs: https://github.com/lucky-verma/grokking-diagnostics. Per-run JSONs: https://huggingface.co/datasets/lucky-verma/grokking-diagnostics-runs. Lean 4/mathlib v4.29.0 formal checks available in the code repository

详情
AI中文摘要

在模运算中训练的Transformer模型表现出记忆、泛化和崩溃之间的尖锐转变。我们证明权重衰减作为这些区域的标量经验控制参数,并引入了两种廉价的在线诊断方法,即平均成对注意力头余弦相似度和熵标准差,这些方法仅通过注意力激活来跟踪训练动态,并在较低计算成本下补充损失景观诊断。在十一种实验条件和三种模型规模(0.82M到85M参数)中,权重衰减轴将记忆、发展性Grokking和崩溃分开。一个接近临界点的逻辑拟合将记忆到发展性的边界定位在λ_c=0.0158(95%置信区间[0.0109, 0.0200],N=210);一个幂律拟合给出经验指数ν=0.757(置信区间[0.725, 0.799])。参考指数ν=1/2和3D伊辛ν≈0.63在我们四格网格下位于此经验置信区间之外,因此我们报告ν为经验值,并将临界点类别的识别推迟到更密集的有限大小缩放工作。一个与地平线匹配的多任务复制(n=280,四个模运算)保留了权重衰减控制模式;在λ=0.05时进行的配对注意力头重新初始化实验改变了阶段2的振幅(Cohen的d=-1.190,n=10,p_t=4.5×10^-3),而匹配的权重范数裁剪则没有。三个跨架构探测(4L MLP,4L LSTM和4L Mamba;每个n=70)在小Transformer注意力模型的模运算中复制了权重衰减控制的转变,具有架构特定的λ_c值。主要诊断主张限于小Transformer注意力模型的模运算;非注意力实验是范围探测,架构广泛、语言模型和临界点类别的主张超出范围。

英文摘要

Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight-decay axis separates memorization, developmental grokking, and collapse. A near-transition logistic fit localizes the memorization-to-developmental boundary at $λ_c=0.0158$ (95% CI [0.0109, 0.0200], N=210); a power-law fit gives an empirical exponent $ν=0.757$ (CI [0.725, 0.799]). Reference exponents $ν=1/2$ and 3D Ising $ν\approx 0.63$ lie outside this empirical CI under our four-bin grid, so we report $ν$ as empirical and defer universality-class identification to denser finite-size-scaling work. A horizon-matched multi-task replication (n=280, four modular operations) preserves the weight-decay control pattern; a paired attention-head re-initialization experiment at $λ=0.05$ changes Phase-2 amplitude (Cohen's $d=-1.190$, n=10, $p_t=4.5 \times 10^{-3}$), while matched weight-norm clipping does not. Three cross-architecture probes (4L MLP, 4L LSTM, and 4L Mamba; each n=70) replicate the weight-decay-controlled transition with architecture-specific $λ_c$ values. Main diagnostic claims are scoped to modular arithmetic in small transformer attention models; the non-attention experiments are scope probes, and architecture-wide, language-model, and universality-class claims are out of scope.

2605.20440 2026-05-21 cs.LG cs.AI math.RA 版本更新

Group-Algebraic Tensors: Provably-optimal Equivariant Learning and Physical Symmetry Discovery

群代数张量:可证明最优的等变学习与物理对称性发现

Paulina Hoyos, Shashanka Ubaru, Dongsung Huh, Vasileios Kalantzis, Kenneth L. Clarkson, Misha Kilmer, Haim Avron, Lior Horesh

发表机构 * UT Austin(得克萨斯大学奥斯汀分校) IBM Research(IBM研究院) Independent(独立) Tufts University(塔夫茨大学) Tel-Aviv University(特拉维夫大学)

AI总结 本文提出了一种群代数张量框架,通过将有限群G的乘法规则引入张量代数,使等变性成为代数属性而非架构限制。该框架基于三个理论支柱:(i) Eckart-Young最优性保证的星G-SVD;(ii)通过Kronecker分解组合多个对称性;(iii)600行的Lean4形式化证明。该框架提供了等变神经网络无法实现的能力:每个预测的闭式分解和数据驱动发现最佳对称群。在QM9分子几何上,通过八面体子群恢复角动量选择规则,展示了数据驱动的物理发现。

详情
AI中文摘要

我们引入了$\star_G$张量代数,在其中任何有限群$G$定义乘法规则,使等变性成为代数属性而非架构约束。该框架基于三个机器验证的理论支柱:(i) $\star_G$-SVD的Eckart-Young最优性保证,是首个对称保持张量近似的结果,精确且多项式时间;(ii) 通过Kronecker分解组合多个对称性,通过将$F_G$替换为$F_{G_1} \otimes F_{G_2}$无需架构重设计;(iii) 600行的Lean~4形式化证明了$\star_G$代数。该框架提供了等变神经网络(ENNs)结构无法实现的能力:每个预测的闭式分解,以及数据驱动发现最佳对称群。作为非平凡的实证演示,分解QM9分子几何的八面体子群恢复了角动量选择规则,仅凭数据而非量子力学输入:标量性质由A$_1$主导,偶极子成分由T$_1$主导,各向异性极化率对l=1不敏感,因为秩2迹分解l=0⊕l=2要求,T$_1$/A$_1$预测能力比将向量可观测量与标量可观测量分离了五倍。在完整的QM9(130,831分子)上,$\star_G$-SVD与岭回归提供闭式预测,参数数量比参数匹配的MLP少50-90倍。代数等变性因此补充架构等变性,不是更快、更好、更便宜的替代方案,而是不同的数学能力:可证明最优的对称保持压缩,每irrep可解释性,以及数据驱动的物理发现。

英文摘要

We introduce the $\star_G$ tensor algebra, in which any finite group $G$ defines the multiplication rule, making equivariance an intrinsic algebraic property rather than an architectural constraint. The framework rests on three machine-verified theoretical pillars: (i)~an Eckart-Young optimality guarantee for the $\star_G$-SVD: the first such result for symmetry-preserving tensor approximation, exact and polynomial-time; (ii)~a Kronecker factorization that composes multiple symmetries by replacing $F_G$ with $F_{G_1} \otimes F_{G_2}$ with no architectural redesign; and (iii)~a 600-line Lean~4 formalization of the $\star_G$ algebra. The framework provides capabilities that equivariant neural networks (ENNs) structurally cannot: a closed-form per-irreducible-representation decomposition of every prediction, and data-driven discovery of the symmetry group that best fits a dataset. As a non-trivial empirical demonstration, decomposing QM9 molecular geometry over the chiral octahedral subgroup of SO(3) recovers the Wigner--Eckart selection rules of angular momentum from data alone, with no quantum mechanical input: scalar properties are A$_1$-dominated, dipole components are T$_1$-dominated, the isotropic polarizability is uniquely insensitive to $l\!=\!1$ as the rank-2-trace decomposition $l\!=\!0 \oplus l\!=\!2$ requires, and the T$_1$/A$_1$ predictive-power ratio separates vector observables from scalar observables by a factor of five. On full QM9 (130{,}831 molecules), $\star_G$-SVD with ridge regression provides closed form predictions at $\sim50-90\times$ fewer parameters than parameter-matched MLPs. Algebraic equivariance thus complements architectural equivariance not as a faster-better-cheaper alternative but as a different mathematical affordance: provably-optimal symmetry-preserving compression, per-irrep interpretability, and data-driven physical discovery.

2605.20425 2026-05-21 cs.AI 版本更新

AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

AgentCo-op: 基于检索的互操作多智能体工作流合成

Shuaike Shen, Wenduo Cheng, Shike Wang, Mingqian Ma, Jian Ma

发表机构 * Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University(卡内基梅隆大学计算机科学学院雷和斯蒂芬妮·兰德计算生物学系) Machine Learning Department, School of Computer Science, Carnegie Mellon University(卡内基梅隆大学计算机科学学院机器学习系)

AI总结 本文提出AgentCo-op,一种基于检索的多智能体工作流合成框架,通过类型化任务传递合成可执行工作流,并在执行证据表明失败时应用有界自引导局部修复。在两个开放世界基因组学案例研究中,AgentCo-op在不重设计或运行全局拓扑搜索的情况下,将独立开发的科学智能体和外部工具库整合到可审计的工作流中。

详情
AI中文摘要

在开放性科学环境中设计多智能体工作流尤其困难,因为任务缺乏经过精心编写的训练集、可靠的标量评估度量和现有工具和智能体之间的标准化接口。我们提出AgentCo-op,一种基于检索的合成框架,通过类型化任务传递将可重用的技能、工具和外部智能体组合成可执行的工作流,然后在执行证据表明失败时应用有界自引导局部修复。在两个开放世界基因组学案例研究中,AgentCo-op将独立开发的科学智能体和外部工具库整合到可审计的工作流中,而无需重设计或运行全局拓扑搜索。它协调专门的智能体进行空间转录组学和基因集解释,以从空间转录组学数据中实现协作发现,并为单细胞多组数据的跨模态标记分析构建并行工作流。AgentCo-op还可以将搜索的工作流作为结构先验导入,并通过用检索到的组件 grounding 节点并应用局部修复来改进它,表明合成和搜索是互补的。在六个编码、数学和问答基准测试中,AgentCo-op在四个基准测试中取得最佳结果,在统一骨干设置下获得最佳平均分数,同时相对于多智能体基线始终减少每任务成本。这些结果共同表明,基于检索的合成可以将自动化智能体工作流设计扩展到由现有智能体、工具和类型化艺术制品构建的开放世界工作流中。

英文摘要

Designing multi-agent workflows is especially difficult in open-ended scientific settings where tasks lack curated training sets, reliable scalar evaluation metrics, and standardized interfaces between existing tools and agents. We propose AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs, then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies, AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. It coordinates specialized agents for spatial transcriptomics and gene-set interpretation to enable collaborative discovery from spatial transcriptomics data, and builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. AgentCo-op can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair, showing that synthesis and search are complementary. On six coding, math, and question-answering benchmarks, AgentCo-op achieves the best result on four benchmarks and the best average score under a unified backbone setting, while consistently reducing per-task cost relative to multi-agent baselines. Together, these results suggest that retrieval-based synthesis can extend automated agentic workflow design beyond benchmark-optimized agent graphs to open-world workflows built from existing agents, tools, and typed artifacts.

2605.20423 2026-05-21 cs.AI 版本更新

OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

OSCToM: 用于高阶理论之心的强化学习引导对抗生成

Sharmin Sultana Srishty, Kazi Mahathir Rahman, Malaika Parizat Sakkhi, Samia Shahid Prianna, Shaikhul Islam Sinat

发表机构 * Department of Computer Science(计算机科学系) BRAC University(布拉克大学)

AI总结 本文提出OSCToM,一种结合强化学习、领域特定语言和组合替代模型的方法,用于建模LLM中的嵌套信念冲突,通过生成观察者-自我冲突来提升复杂社会场景下的理论之心推理能力。

Comments 15 pages, 12 figures containing 15 images, 3 tables. Code available at https://github.com/sharminsrishty/osct

详情
AI中文摘要

大型语言模型(LLMs)在许多语言任务上表现良好,但其在复杂社会场景中的理论之心(ToM)推理仍不均衡。现有基准测试,如ExploreToM,往往不总是测试导致这些场景困难的递归信念和信息不对称。本文提出了OSCToM(观察者-自我冲突理论之心),一种用于建模LLM基于理论之心任务中嵌套信念冲突的方法。关键案例是观察者对另一个代理的看法与其自身信念状态冲突的情况。此类情况超越了简单的视角转换,需要递归、多层推理。OSCToM结合强化学习(RL)、扩展的领域特定语言和组合替代模型来生成观察者-自我冲突。在我们的实验中,OSCToM-8B在测试系统中表现最佳。它在FANToM上优于已报告的ExploreToM结果,并在Hi-ToM和BigToM上保持竞争力。在信息不对称的FANToM基准测试中,OSCToM达到76%的准确率,相比ExploreToM报告的0.2%。数据合成过程也提高了6倍的效率,表明有针对性的训练数据可以帮助较小的模型处理高级认知推理。项目代码可在https://github.com/sharminsrishty/osct上找到。

英文摘要

Large Language Models (LLMs) perform well on many language tasks, but their Theory of Mind (ToM) reasoning is still uneven in complex social settings. Existing benchmarks, including ExploreToM, do not always test the recursive beliefs and information asymmetries that make these settings difficult. This paper presents OSCToM (Observer-Self Conflict Theory of Mind), an approach for modeling nested belief conflicts in LLM-based ToM tasks. The key case is one in which an observer's view of another agent conflicts with the observer's own belief state. Such cases go beyond simple perspective-taking and require recursive, multi-layered reasoning. OSCToM combines reinforcement learning (RL), an extended domain-specific language, and compositional surrogate models to generate observer-self conflicts. In our experiments, OSCToM-8B gives the best overall result among the systems tested. It improves on the reported ExploreToM results on FANToM and remains competitive on Hi-ToM and BigToM. On the information-asymmetric FANToM benchmark, OSCToM reaches 76% accuracy, compared with the 0.2% reported by ExploreToM. The data-synthesis procedure is also 6x more efficient, indicating that targeted training data can help smaller models handle advanced cognitive reasoning. The project code is available at https://github.com/sharminsrishty/osct.

2605.20410 2026-05-21 cs.CL cs.AI 版本更新

Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs

偏见与推理的力学:分析链式推理提示对大语言模型中性别偏见的影响

Edie Pearman, Sophia Osborne, Mira Kandlikar-Bloch, Mina Arzaghi, Florian Carichon, Golnoosh Farnadi

发表机构 * Mila – Quebec AI Institute(魁北克人工智能研究所) McGill University(麦吉尔大学) HEC Montreal(蒙特利尔HEC)

AI总结 本文研究了链式推理提示对大语言模型中性别偏见的影响,结合基准测试评估与机制可解释性技术,发现链式推理并未有效减少偏见,偏见仍存在于隐藏表示中。

Comments 24 pages, 6 figures, including appendix. Accepted at the ICLR 2026 Workshop on Algorithmic Fairness Across Alignment Procedures and Agentic Systems. Submitted to COLM 2026

详情
AI中文摘要

尽管有大量文献表明大型语言模型(LLMs)在社会敏感领域得到广泛应用,但它们仍编码了性别偏见。链式推理(CoT)提示已被提出作为减轻偏见的方法。然而,现有评估主要关注LLM基准性能的变化,提供了有限的关于模型内部机制是否真正改变的见解。在本文中,我们研究了链式推理提示如何影响LLM中的性别偏见,结合基准测试评估与机制可解释性技术以及推理链失败分析。我们的结果证实了LLM输出中存在刻板印象偏见,显示链式推理提示并未一致减少偏见差距。机制分析显示,尽管链式推理在某些注意力头集群中平衡了偏见行为,但性别偏见仍嵌入在隐藏表示中,表明仅是表面缓解。对推理链的检查进一步表明,这些改进源于对数据集的记忆和熟悉,而非对偏见的真实理解。

英文摘要

Large language models (LLMs) are increasingly deployed in socially sensitive settings despite substantial documentation that they encode gender biases. Chain-of-Thought (CoT) prompting has been proposed as a bias-mitigation approach. However, existing evaluations primarily focus on changes in LLM benchmark performance, providing limited insight into whether apparent bias reductions reflect meaningful changes in a model's internal mechanisms. In this work, we investigate how CoT prompting affects gender bias in LLMs, combining benchmark-based evaluation with mechanistic interpretability techniques and reasoning chain failure analysis. Our results confirm a stereotypical bias present in LLM outputs across benchmarks, showing that CoT prompting does not consistently reduce the bias gap. Mechanistic analyses reveal that although CoT balances biased behavior in certain attention head clusters, gender bias remains embedded in hidden representations, indicating only superficial mitigation. Inspection of reasoning chains further suggests that these improvements stem from memorization and familiarity with the dataset rather than genuine understanding of bias.

2605.20405 2026-05-21 eess.IV cs.AI cs.CV physics.med-ph 版本更新

Disentangling Sampling from Training Budget in Class-Imbalanced CT Body Composition Segmentation

在CT身体成分分割中解耦采样与训练预算

Iason Skylitsis, Dimitrios Karkalousos, Ivana Išgum

发表机构 * Amsterdam University Medical Center(阿姆斯特丹大学医学中心) University of Amsterdam(阿姆斯特丹大学) Informatics Institute, Faculty of Science(信息学院,科学学院) Department of Radiology(放射科) Mayo Clinic Rochester(罗切斯特梅奥诊所)

AI总结 本文提出了一种基于少样本学习的episodic采样方法,用于解决医学图像分割中的类别不平衡问题,通过解耦采样与训练预算,提高了小数据集下的分割性能。

详情
AI中文摘要

类别不平衡是医学图像分割中的基本挑战,其中频繁类通常在训练中占主导地位,而稀有类被忽视。基于损失的方法通过在批次内重新加权每个像素的损失来缓解不平衡,而采样策略控制哪些图像进入批次。然而,两者均未明确控制批次中出现的类别,导致稀有类的暴露仅部分平衡。在本文中,我们采用少样本学习中的episodic采样,以在完全监督设置中促进类别平衡的批次构造。我们解耦episodic采样与其传统的度量学习上下文,并在CT身体成分分割中评估其效果。我们在九种肌肉和脂肪组织上,从公共SAROS数据集中提取了210次扫描,将episodic采样与随机和加权采样进行比较。训练是在全数据和低数据模式下进行的,此外在匹配训练迭代预算下也进行了额外比较。在全数据训练中,三种策略表现相当(episodic的平均Dice为0.882,随机和加权为0.878)。在低数据训练中,episodic采样优于随机和加权(0.787 vs. 0.758和0.762),这由训练迭代数的12倍差异驱动。在匹配训练预算下,随机和加权过早过拟合,而episodic在达到平台前提高了约三倍的迭代次数。我们的发现识别了训练迭代预算作为采样策略中被低估的混淆因素,推动了小数据集的迭代感知评估协议。此外,episodic采样的残余优势与隐含的类别平衡批次的正则化效应一致,提供了一种低成本、模型无关的解决医学图像分割类别不平衡问题的策略。代码可在https://github.com/iasonsky/episodic-sampling上获得。

英文摘要

Class imbalance is a fundamental challenge in medical image segmentation, where frequent classes typically dominate training at the expense of rare classes. Loss-based approaches mitigate imbalance by reweighting the per-pixel loss within the batch, while sampling strategies control which images enter the batch. Yet neither explicitly controls which classes appear within the batch, leaving rare-class exposure only partially rebalanced. In this work, we adopt episodic sampling from few-shot learning to promote class-balanced batch construction in a fully supervised setting. We decouple episodic sampling from its conventional metric-learning context and evaluate it in body composition segmentation in CT. We compare episodic sampling against random and weighted sampling on nine muscle and adipose tissues, derived from 210 scans of the public SAROS dataset. Training is performed under full- and low-data regimes, with additional comparisons under matched training iteration budgets. Under full-data training, all three strategies performed comparably (mean Dice 0.882 for episodic, 0.878 for random and weighted). Under low-data training, episodic sampling outperformed random and weighted (0.787 vs. 0.758 and 0.762), driven by a 12-fold difference in training iterations. Under matched training budgets, random and weighted overfit earlier, while episodic improved for approximately three times more iterations before plateauing. Our findings identify the training iteration budget as under-recognized confound in sampling strategies, motivating iteration-aware evaluation protocols for small datasets. Furthermore, the residual advantage of episodic sampling is consistent with an implicit regularization effect of class-balanced batches, offering a low-cost, model-agnostic strategy for class-imbalanced medical image segmentation. Code is available at https://github.com/iasonsky/episodic-sampling.

2605.20390 2026-05-21 cs.CV cs.AI cs.LG cs.RO 版本更新

STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

STELLAR: 为自动驾驶扩展3D感知大模型

Yingwei Li, Xin Huang, Yang Liu, Yang Fu, Alex Zihao Zhu, Chen Song, Junwen Yao, Anant Subramanian, Hao Xiang, Weijing Shi, Yuliang Zou, Tom Hoddes, Zhaoqi Leng, Govind Thattai, Dragomir Anguelov, Mingxing Tan

发表机构 * Waymo UCSD(加州大学圣地亚哥分校)

AI总结 本文研究了大规模训练在自动驾驶感知系统中的应用,通过扩展输入模态并训练大规模模型,实现了在Waymo数据集上的新状态-of-the-art性能。

详情
AI中文摘要

模型扩展通过在多样化数据集上进行大规模训练已显示出显著的成功。然而,尚不清楚相同的范式是否适用于自动驾驶感知系统,因为存在独特的挑战,如融合异构传感器数据和需要复杂的3D空间理解。为弥合这一差距,我们进行了系统分析,研究了规模对这些系统的影响。我们基于稀疏窗口变换器开发了STELLAR模型,扩展了输入模态,包括LiDAR、雷达、相机和地图先验。我们在一个包含5000万驾驶示例的大规模数据集上训练该模型,参数数量高达5亿。我们的大规模实验揭示了模型性能与模型大小、数据和计算之间的经验扩展趋势。所得到的模型在Waymo Open Dataset挑战中建立了新的状态-of-the-art,大幅超越了先前的成果。我们的工作表明,大规模训练是提升自动驾驶感知模型能力极具前景的路径。

英文摘要

Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm would apply to autonomous driving perception systems due to unique challenges, such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study on systematically analyzing the impact of scale on these systems. We develop our STELLAR model based on Sparse Window Transformer, by extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state-of-the-art on the Waymo Open Dataset challenge, outperforming prior arts by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.

2605.20389 2026-05-21 cs.LG cs.AI 版本更新

Nonlocal operator learning for fMRI encoding and decoding tasks

非局部算子学习用于fMRI编码和解码任务

Andreas Kramer, Saugat Acharya, Alice Giola, Emanuele Zappala

发表机构 * Department of Computer Science, Idaho State University(计算机科学系,爱达荷州立大学) Department of Mathematics and Statistics, Idaho State University(数学与统计学系,爱达荷州立大学)

AI总结 本文提出了一种基于神经积分算子的框架,用于fMRI数据的编码和解码任务,探讨了非局部时空上下文的作用,并通过实验验证了更长的时间窗口和视觉皮层与全脑记录对性能和潜在空间几何的影响。

Comments 18 pages, 4 figures, 5 tables. Comments are welcome!

详情
AI中文摘要

功能性磁共振成像(fMRI)数据表现出高维时空结构,使得预测和解码变得具有挑战性。在本工作中,我们研究了基于神经积分算子的模型用于fMRI的编码和解码任务,特别强调非局部时空上下文的作用。我们实现了一个潜在的神经积分算子框架,该框架在辅助空间中执行固定点迭代,通过解码器进行分类和刺激预测。我们在两个开源fMRI数据集上评估了我们的模型。我们的实验检验了从fMRI记录中解码刺激以及从刺激表示中编码fMRI动态。主要关注点是时空上下文的影响:我们系统比较了短和长的时间窗口,以及使用视觉皮层与全脑记录,并分析其对性能和潜在空间几何的影响。在不同任务和数据集中,更长的时间窗口通常会改善结果并产生更具结构化的学习表示。在解码实验中,学习的潜在空间通常比原始数据提供更清晰的类别分离。在编码实验中,尽管由于任务难度绝对性能保持中等,但更长的时间窗口仍能产生一致的改进。这些发现表明,神经积分算子为建模fMRI动态提供了一个有前景的框架,并且更广泛的时空上下文对预测和表示学习都是有益的。更广泛地说,结果表明,利用大脑动态中的分布式非局部结构需要专门设计的模型架构来捕捉此类依赖关系。

英文摘要

Functional MRI data exhibit high-dimensional spatiotemporal structure, making both prediction and decoding challenging. In this work, we investigate neural integral-operator-based models for encoding and decoding tasks in fMRI, with particular emphasis on the role of nonlocal spatiotemporal context. We implement a latent neural integral operator framework that performs fixed point iterations in an auxiliary space from which classification and stimuli prediction is performed via a decoder. We evaluate our model on two open-source fMRI datasets. Our experiments examine both decoding of stimuli from fMRI recordings and encoding of fMRI dynamics from stimulus representations. A main focus is the effect of spatiotemporal context: we systematically compare short and long temporal windows, as well as the use of visual cortex vs whole brain recordings, and analyze their influence on performance and latent-space geometry. Across tasks and datasets, larger temporal windows generally improve results and produce more structured learned representations. In decoding experiments, the learned latent space often provides clearer class separation than the raw data. In encoding experiments, although absolute performance remains moderate due to the difficulty of the task, longer temporal windows still yield consistent gains. These findings suggest that neural integral operators provide a promising framework for modeling fMRI dynamics and that broader spatiotemporal context can be beneficial for both prediction and representation learning. More broadly, the results indicate that exploiting distributed nonlocal structure in brain dynamics requires model architectures specifically designed to capture such dependencies.

2605.20385 2026-05-21 cs.CV cs.AI 版本更新

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

ConceptSeg-R1: 通过元强化学习实现任意概念的分割

Yuan Zhao, Youwei Pang, Jiaming Zuo, Wei Ji, Kailai Zhou, Bin Fan, Yunkang Cao, Lihe Zhang, Xiaofeng Liu, Huchuan Lu, Weisi Lin, Dacheng Tao, Xiaoqi Zhao

发表机构 * Dalian University of Technology(大连理工大学) X3000 Inspection Co., Ltd(X3000检测有限公司) Nanyang Technological University(南洋理工大学) Yale University(耶鲁大学) Northwestern Polytechnical University(西北工业大学) Hunan University(湖南大学)

AI总结 本文提出ConceptSeg-R1框架,通过元强化学习机制学习可迁移的任务规则,结合轻量级概念翻译模块实现概念分割,并在多个领域基准上验证了其在概念层次上的强性能。

详情
AI中文摘要

近年来,可提示分割的进步使视觉感知从对象级定位转向概念级理解。然而,概念的定义仍不明确,使得当前方法是否真正超越类别识别仍存疑问。本文通过包含上下文无关(CI)、上下文依赖(CD)和上下文推理(CR)概念的三级分类,揭示了随着认知复杂性增加的能力差距。为解决这一挑战,我们提出ConceptSeg-R1统一框架,将概念分割重新公式化为规则诱导的概念定位。核心方法是Meta-GRPO,通过视觉示范学习可迁移的任务规则并通过代理推理验证。推导出的推理状态通过轻量级概念翻译模块转换为分割准备的概念提示,使推理应用能够扩展到目标图像。快捷路由策略进一步保留了分割模型在简单情况下的原生效率。为系统评估概念分割,我们在自然、工业、医疗和推理密集领域进行了广泛的实验。无需额外装饰,ConceptSeg-R1在完整概念层次上实现了强性能,同时保持了可提示分割主干的原生能力。作为向分割任何概念的初步步骤,我们希望ConceptSeg-R1能成为推进分割从对象级预测到概念级理解的实用基线。

英文摘要

Recent progress in promptable segmentation has shifted visual perception from object-level localization toward concept-level understanding. However, the notion of a concept remains under-specified, making it unclear whether current methods truly generalize beyond category recognition. In this work, we formalize generalized concept segmentation through a three-level taxonomy consisting of context-independent (CI), context-dependent (CD), and context-reasoning (CR) concepts, which reveals a clear capability gap across increasing levels of cognitive complexity. To address this challenge, we propose ConceptSeg-R1, a unified framework that reformulates concept segmentation as rule-induced concept grounding. At the core of our method is Meta-GRPO, a meta-reinforcement learning mechanism that learns transferable task rules from visual demonstrations and verifies them through proxy reasoning. The inferred reasoning states are then translated into segmentation-ready concept prompts via a lightweight concept translation module, enabling deductive application to target images. A shortcut routing strategy further preserves the native efficiency of segmentation models on simple cases. To systematically evaluate generalized concept segmentation, we conduct extensive experiments across diverse CI, CD, and CR concept segmentation benchmarks spanning natural, industrial, medical and reasoning-intensive domains. Without bells and whistles, ConceptSeg-R1 achieves strong performance across the full concept hierarchy while maintaining the native capability of promptable segmentation backbones. As an initial step toward segmenting any concept, we hope ConceptSeg-R1 can serve as a practical baseline for advancing segmentation from object-level prediction toward concept-level understanding.

2605.20382 2026-05-21 cs.CL cs.AI 版本更新

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

言行不一:大型语言模型中的指令诱导冲突

Carolina Camassa, Derek Shiller

发表机构 * Future Impact Group(未来影响组)

AI总结 研究探讨了大型语言模型在面对指令与模式完成之间的冲突时的表现,发现指令遵循率在不同模型和指令下差异显著,输出多样性是预测鲁棒性的主要因素。

Comments 31 pages

详情
AI中文摘要

语言模型被训练以遵循指令,但它们也是强大的模式完成器。当这些两个目标发生冲突时会发生什么?我们构建了对话,其中用户指令要求以目标方式T(例如始终输出特定标记、用特定语言回答或采用特定角色)行为,与N个硬编码助手回合展示的竞争对手模式P相冲突。然后我们测量在此设置下的指令遵循率(IF率),在13个模型和16种不同指令上,最多50回合。平均指令遵循率在模型之间从1%到99%不等,与标准能力基准 largely 无关。从指令遵循到模式遵循的转变是普遍的但高度模型依赖的。鲁棒性由指令内容调节,当指令与训练价值先验一致时,模型对诱导的抵抗更长;同时由输出格式调节,多样化的多标记响应比单标记输出更耐受。链式推理提高鲁棒性但不消除易受性,并可能导致正确推理与错误输出之间的脱节。当被要求预测其在此设置下的行为时,模型平均准确率为83.5%,但系统性低估了自身对诱导压力的抵抗力。这些结果表明,即使对于其他方面能力较强的模型,指令遵循在诱导压力下仍很脆弱,而输出多样性而非输入的语义参与是预测鲁棒性的主要因素。

英文摘要

Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T (e.g., always output a specific token, answer in a particular language, or adopt a persona) is opposed by N hardcoded assistant turns demonstrating a competing pattern P. We then measure instruction-following (IF) rates in this setting, across 13 models and 16 different instructions, for up to 50 turns. Average instruction-following rates range from 1% to 99% across models, largely uncorrelated with standard capability benchmarks. The transition from instruction-following to pattern-following is universal but highly model-dependent. Robustness is modulated both by instruction content, with models resisting induction longer when instructions align with their trained value priors, and by output format, with diverse multi-token responses proving substantially more resistant than single-token outputs. Chain-of-thought reasoning improves robustness but does not eliminate susceptibility, and can produce dissociation between correct deliberation and incorrect output. When asked to predict their behavior in this setting, models achieve 83.5% accuracy on average but systematically underestimate their own resistance to induction pressure. These results suggest that instruction-following remains brittle under induction pressure even for otherwise capable models, and that output diversity, rather than semantic engagement with the input, is the primary factor predicting robustness.

2605.20373 2026-05-21 cs.RO cs.AI cs.CV 版本更新

SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework

SUGAR: 一种可扩展的人类-视频驱动的通用人形机器人运动-操作学习框架

Tianshu Wu, Xiangqi Kong, Yue Chen, Qize Yu, Hang Ye, Jia Li, Yizhou Wang, Hao Dong

发表机构 * CFCS, School of Computer Science, Peking University(计算机学院,北京大学计算机科学系) School of Computer Science and Engineering, Beihang University(计算机科学与工程学院,北航)

AI总结 该研究提出SUGAR框架,通过将多样化的视频转化为可部署的人形机器人运动-操作技能,无需特定任务的奖励工程或参考动作条件,在仿真和现实硬件中实现了六种代表性任务的高性能表现,展示了可扩展性和零样本现实迁移能力。

Comments Project Page: https://tianshuwu.github.io/sugar-humanoid/

详情
AI中文摘要

构建能够实现在现实世界中通用的全身体运动-操作能力的人形机器人仍是一个根本性挑战。现有方法要么依赖于繁琐的特定任务奖励工程,要么依赖于僵化的参考动作回放,无法泛化,或者依赖于昂贵的远程操作,限制了可扩展性。尽管人类视频捕捉了多样化的动作行为,但从中推断出的运动先验固有地不完美,受到遮挡、接触伪影和重定向误差的影响,使其不适合直接的策略学习。为此,我们提出了SUGAR,一种可扩展的数据驱动框架,能够将多样化的视频转化为可部署的人形机器人运动-操作技能,无需任何特定任务的奖励工程或参考动作条件。SUGAR分为三个阶段。首先,一个完全自动化的流程从无结构的人类视频中提取运动交互先验,包括人类-物体运动轨迹和接触标签。第二,一个特权物理基础的细化器利用统一的模仿奖励和渐进状态池,将不完美的先验转化为物理上可行的、高保真的技能。第三,经过细化的技能被转化为一个分层的自主策略,包括一个命令生成器和一个命令跟踪器。我们在仿真和现实世界的人形硬件中评估了SUGAR,我们的方法在六种代表性运动-操作任务上显著优于参考跟踪基线,性能随着人类视频数据量的增加而明显提升。它还实现了零样本现实迁移,具有可靠的闭环执行、自主故障恢复和在外部扰动下的稳定长时程性能。项目页面:https://tianshuwu.github.io/sugar-humanoid/

英文摘要

Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human-object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: https://tianshuwu.github.io/sugar-humanoid/

2605.20372 2026-05-21 cs.CV cs.AI 版本更新

Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities

基于潜在空间引导的多模态分割中缺失模态的场景采样

Irem Ulku, Ö. Özgür Tanrıöver, Erdem Akagündüz

发表机构 * organization= Department of Computer Engineering, Ankara University, Ankara, T \"u rkiye organization= Department of Modeling Simulation, Graduate School of Informatics, METU, Ankara, T \"u rkiye

AI总结 本文提出了一种新的训练策略,通过直接从预训练的潜在空间学习场景采样分布,以指导多模态分割在缺失模态下的微调,从而提高性能。

Comments 14 pages, 4 figures, 9 tables

详情
AI中文摘要

多模态语义分割通过结合不同传感器模态的互补信息,为遥感分析带来了好处。在现实中的遥感应用中,由于传感器故障、恶劣大气条件或数据采集问题,一个或多个模态可能不可用。即使有预训练的多模态表示和现有的微调或适应策略,性能仍可能受限,因为训练时通常将所有模态可用性场景视为等信息。在本文中,我们提出了一种新的训练策略,直接从预训练的潜在空间学习场景采样分布。与依赖于均匀随机模态丢弃不同,所提出的方法将微调引导到更具信息量的模态可用性场景。更具体地说,我们独立量化每个场景的影响,基于其在共享潜在表示中引起的变化。然后,我们使用径向基函数内核捕捉场景关系,并通过正则化内核平滑推导出细化的场景评分。这些评分随后在场景采样过程中转换为概率分布,用于微调。我们在三个遥感图像集(DSTL、Potsdam和Hunan)上评估了该策略,使用CBC-SLP、CBC和CMX主干网络。不同图像集和主干网络的实验结果表明,我们的方法优于标准微调和LoRA基于的适应。这些发现表明,预训练的潜在表示可以作为缺失模态微调期间采样的有效基础。代码可在https://github.com/iremulku/Latent-Space-Guided-Scenario-Sampling获取。

英文摘要

Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real-world remote sensing applications, one or more modalities may be unavailable due to sensor failures, adverse atmospheric conditions, or data acquisition problems. Even with pretrained multimodal representations and existing fine-tuning or adaptation strategies, performance may remain limited because all modality availability scenarios are typically treated as equally informative during training. In this paper, we propose a novel training strategy that learns a scenario sampling distribution directly from the pretrained latent space. Instead of relying on uniform random modality dropout, the proposed method guides fine-tuning toward more informative modality availability scenarios. More specifically, we quantify the effect of each scenario independently based on the distortion it induces in the shared latent representation. We then capture scenario relations using a radial basis function kernel and derive refined scenario scores through a regularized kernel smoothing. These scores are then converted into a probability distribution during scenario sampling for fine-tuning. We evaluate this strategy on three remote sensing image sets, namely DSTL, Potsdam, and Hunan, using CBC-SLP, CBC, and CMX backbones. The experimental results with different image sets and backbones show that our method outperforms standard fine-tuning and LoRA-based adaptation. These findings suggest that the pretrained latent representation can serve as an effective basis for sampling during missing modality fine-tuning. Code is available at https://github.com/iremulku/Latent-Space-Guided-Scenario-Sampling

2605.20369 2026-05-21 cs.CL cs.AI cs.LG 版本更新

DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

DEL:用于大语言模型数值学习的数字熵损失

Zhaohui Zheng, Chenhang He, Shihao Wang, Yuxuan Li, Ming-Ming Cheng, Lei Zhang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) VCIP, College of Computer Science, Nankai University(南开大学计算机学院VCIP)

AI总结 本文提出Digit Entropy Loss (DEL)用于大语言模型的自回归数值学习,通过重新设计传统无监督熵优化,引入数字条件概率和二元交叉熵,使熵优化转向监督方式,同时推广整数基于的数值学习到浮点数优化,从而提升数值预测的准确性。

详情
AI中文摘要

数字预测是大语言模型(LLMs)在数学问题解决和代码生成中的基本能力。广泛采用的最大似然估计(MLE)用于LLM训练并不适合数字预测。最近,惩罚驱动的方法,例如数字标记损失和离散化距离损失,引入了数字距离的归纳偏置,但分别导致了数字分布过度锐化和过度扁平化。在本文中,我们深入分析了LLM的数值学习,并表明现有的数值学习方法在概念上遵循一个准则-距离公式,其中准则项代表优化模式,距离项灌输几何先验。因此,我们提出了Digit Entropy Loss (DEL)用于自回归数值学习,其重新设计传统无监督熵优化的三个关键设计:利用数字条件概率和二元交叉熵将熵优化引导为监督方式;舍弃距离项以避免数值距离的问题;并将整数基于的数值学习推广到浮点数优化,使数值预测更加准确。我们的DEL公式可以结合整数、小数和小数点,将学习目标从单个数字扩展到浮点数领域。在七个数学推理基准测试中使用四个代表性的LLM,包括CodeLlama、Mistral、DeepSeek和Qwen-2.5,进行实验,结果表明DEL在整体预测准确性和数值距离方面均优于其替代方法。源代码在https://github.com/PolyU-VCLab/DEL。

英文摘要

Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we make an in-depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross-entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at https://github.com/PolyU-VCLab/DEL

2605.20368 2026-05-21 cs.CR cs.AI 版本更新

Security Document Classification with a Fine-Tuned Local Large Language Model: Benchmark Data and an Open-Source System

使用微调的本地大语言模型进行安全文档分类:基准数据和开源系统

Ivan Dobrovolskyi

发表机构 * MS in AI & ML Engineering, Independent Researcher, Sunnyvale, CA, USA(人工智能与机器学习工程硕士,独立研究员,美国加利福尼亚州圣何塞)

AI总结 本研究提出TorchSight开源系统,利用微调后的Qwen 3.5 27B模型对安全文档进行分类,展示了其在78,358个样本和GPT-4合成数据上的高分类准确率,并验证了本地模型在安全文档分类中的有效性。

详情
AI中文摘要

扫描敏感信息文档的组织面临实际问题。云服务要求数据发送到外部基础设施,而基于规则的工具往往遗漏依赖上下文的威胁。本研究提出了TorchSight,一个围绕微调后的Qwen 3.5 27B模型构建的开源本地系统,用于安全文档分类。模型在78,358个样本和覆盖七个安全类别和51个子类别的GPT-4合成数据上进行训练。在主要评估1,000份文档上,模型达到95.0%的类别级准确率(95%置信区间:93.5-96.2)。在相同提示协议下,测试的商业模型得分为75.4-79.9%。在单独的外部500个保留样本集上,模型达到93.8%的准确率,表明性能超越了主要基准,尽管性能差异取决于数据集组成和困难边界情况。结果表明,微调的本地模型可以在保持文档处理本地控制的同时支持准确的安全文档分类。

英文摘要

Organizations that scan documents for sensitive information face a practical problem. Cloud services require data to be sent to external infrastructure, while rule-based tools often miss threats that depend on context. This study presents TorchSight, an open-source local system for security document classification built around a fine-tuned Qwen 3.5 27B model. The model was trained on 78,358 samples from 13 permissively licensed sources and GPT-4 synthetic data covering seven security categories and 51 subcategories. In the main evaluation on 1,000 documents, the model reached 95.0% category-level accuracy (95% confidence interval: 93.5-96.2). The tested commercial models scored 75.4-79.9% under the same prompting protocol. On a separate external set of 500 held-out samples, the model reached 93.8% accuracy, which suggests that performance extends beyond the main benchmark, although the margin depends on dataset composition and difficult boundary cases. The results show that a fine-tuned local model can support accurate security document classification while keeping document processing under local control.

2605.20357 2026-05-21 cs.LG cs.AI 版本更新

Consistently Informative Soft-Label Temperature for Knowledge Distillation

一致信息软标签温度用于知识蒸馏

Hoang-Chau Luong, Nghia Van Vo, Kaiqi Zhao, Lingwei Chen

发表机构 * Rochester Institute of Technology(罗切斯特理工学院) Oakland University(奥克兰大学)

AI总结 本文提出CIST方法,通过为教师和学生分配样本级自适应温度,解决传统固定温度设计中教师软标签熵不一致和教师-学生logit尺度对齐过严的问题,从而提升知识蒸馏效果。

详情
AI中文摘要

知识蒸馏(KD)通过匹配教师和学生预测分布将知识从高容量教师传递给紧凑学生,温度缩放是平滑教师预测并暴露信息量大的

英文摘要

Knowledge distillation (KD) transfers knowledge from a high-capacity teacher to a compact student by matching their predictive distributions, with temperature scaling serving as a central mechanism for smoothing teacher predictions and exposing informative "dark knowledge" beyond the hard label. However, the standard fixed-temperature design is inherently sample-agnostic. Since samples differ in logit scale and learning difficulty, a single global temperature produces teacher soft labels with highly inconsistent entropy: some predictions remain overly sharp and provide limited inter-class information, whereas others become over-smoothed and lose class-discriminative information. Moreover, sharing the same temperature between teacher and student further imposes rigid logit-scale alignment despite their capacity mismatch. To address these limitations, we propose CIST (Consistently Informative Soft-label Temperature), which assigns separate sample-wise adaptive temperatures to the teacher and student. This design produces consistently informative teacher soft labels while relaxing rigid teacher--student logit-scale matching. It also reweights the distillation objective according to teacher confidence and student learning difficulty. Theoretically, we show that teacher-label entropy is largely governed by the ratio between the maximum teacher logit and the temperature, providing a principled basis for adaptive smoothing. Empirically, CIST mitigates the inconsistency induced by fixed temperature, and experiments on both vision and language distillation tasks show consistent improvements over standard KD and strong baselines with negligible computational overhead.

2605.20356 2026-05-21 cs.CL cs.AI cs.SD 版本更新

Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

全双工语音对话模型中的同步与轮流机制

Pablo Riera, Pablo Brusco, Cristina Kuo, Marcelo Sancinetti, S. R. K. Branavan

发表机构 * ASAPP Inc.(ASAPP公司) Departamento de Computación, FCEyN, Universidad de Buenos Aires(计算机系,福克雷斯-恩分校,布宜诺斯艾利斯大学)

AI总结 本文研究了全双工语音对话模型如何通过内部表示的同步协调交互,并发现噪声条件下同步性下降,内部状态编码了提前的轮流预测信息。

详情
AI中文摘要

全双工语音对话模型(SDMs)能够同时听和说,使交互动态更接近人类对话。受人类沟通中神经耦合的启发,我们研究了此类模型在交互过程中如何协调其内部表示。我们模拟了两个预训练Moshi模型在受控条件下的全双工对话,操纵信道噪声和解码偏置。通过跨时间滞后计算中心核对齐(CKA)来测量同步性,同时利用因果LSTM模型从说话者和倾听者角度探测提前的轮流提示信号。我们发现无噪声条件下同步性较强,接近零滞后,随着噪声增加而下降,并展示了内部状态编码了支持提前轮流预测的信息。

英文摘要

Full-duplex spoken dialogue models (SDMs) can listen and speak simultaneously, enabling interaction dynamics closer to human conversation than turn-based systems. Inspired by neural coupling in human communication, we study how such models coordinate their internal representations during interaction. We simulate full-duplex dialogues between two instances of the pretrained \textit{Moshi} model under controlled conditions, manipulating channel noise and decoding bias. Synchronization is measured using Centered Kernel Alignment (CKA) across temporal lags, while anticipatory turn-taking cues are probed from delayed internal activations using causal LSTM models, from both speaker and listener perspectives. We find strong representational synchronization under no noise conditions, peaking near zero lag and degrading with noise, and we show that internal states encode anticipatory information that supports turn-taking prediction ahead of time.

2605.20328 2026-05-21 cond-mat.stat-mech cond-mat.str-el cs.AI math.CO 版本更新

Targeting Clause Type Distributions: a Picklock for Random Satisfiability Problems

针对子句类型分布:随机满足问题的一种Picklock

J. Schwardt, J. C. Budich

发表机构 * Max Planck Institute for the Physics of Complex Systems(马克斯·普朗克复杂系统物理研究所) Institute of Theoretical Physics(理论物理研究所) Technische Universität Dresden(德累斯顿技术大学) Würzburg-Dresden Cluster of Excellence ctd.qmat(魏玛-德累斯顿卓越中心ctd.qmat)

AI总结 本文提出Target-SAT算法,通过利用问题中的统计信息,显著提高了随机满足问题在最困难区域的可解规模,并解释了传统局部搜索算法受限于低能量陷阱的原因。

Comments 7+2 pages, 6+2 figures

详情
AI中文摘要

优化问题如NP难的3-SAT问题为在强相关多体系统中寻找基态的困难任务提供了重要基准。研究随机3-SAT问题作为Ising自旋哈密顿量在统计物理中的应用,已获得重要见解,包括存在可满足性相变的存在以及预测特别困难实例的临界参数线。然而,解决这些实例的进展在数十年内一直很有限。在此,我们引入Target-SAT(TSAT)算法,大致将最困难区域的可解问题规模提高了三倍,在广泛邻近区域甚至有更大的改进。通过利用问题中隐藏的统计信息,TSAT在随机局部搜索中主动引导至相关参数空间内的目标。我们的分析还解释了为什么已建立的局部搜索算法受限于相对较小的系统规模,因为存在巨大的低能量陷阱。此外,我们以主导的附加复杂性障碍物来表征上述临界线,其指数性扩展仅在相关参数空间附近被TSAT迅速克服。通过TSAT,解决已知最困难的随机满足问题的领先地位回归到随机局部搜索算法的领域。

英文摘要

Optimization problems such as the NP-complete 3-SAT provide an important benchmark for the difficult task of finding ground-states in strongly correlated many-body systems with rugged energy landscapes. The study of random 3-SAT problems as Ising spin Hamiltonians in statistical physics has yielded major insights including the existence of a satisfiability phase transition, and the prediction of a critical parameter line of particularly hard instances. Yet, progress on solving those instances has been scarce for several decades. Here, introducing the Target-SAT (TSAT) algorithm, we roughly triple the tractable problem sizes in the hardest regime, with an even greater improvement in a vast range of neighboring regions. By leveraging statistical information hidden in the combinatorial constraints of the problem, TSAT is actively guided in its stochastic local search toward a target within the relevant parameter space. Our analysis also explains why established local search algorithms are limited to relatively small system sizes due to a vast low-energy trap. Furthermore, we characterize the aforementioned critical line in terms of a dominant additional complexity barrier, whose exponential scaling is quickly overcome by TSAT only in the surrounding parameter space. With TSAT, the lead in solving the hardest known random satisfiability problems returns to the realm of stochastic local search algorithms.

2605.20326 2026-05-21 cond-mat.str-el cs.AI 版本更新

Representability-Aware Neural Networks for Reduced Density Matrices: Application to Fractional Chern Insulators

具有可表示性的神经网络用于减少密度矩阵:应用于分数陈绝缘体

Justin B. Hart, Awwab A. Azam, Thomas Li, Yunxuan Li, Ye Bi, Haining Pan, Jiabin Yu

发表机构 * Department of Physics, University of Florida, Gainesville, FL, USA(佛罗里达大学物理系) Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA(佛罗里达大学电气与计算机工程系) Palo Alto High School, Palo Alto, CA, USA(帕洛阿尔托高中) Google, Mountain View, CA, USA(谷歌公司) Department of Animal Science, Iowa State University, Ames, IA, USA(爱荷华州立大学动物科学系) Department of Physics and Astronomy, Center for Materials Theory, Rutgers University, Piscataway, NJ, USA(罗切斯特大学物理与天文学系、材料理论中心)

AI总结 本文提出了一种具有可表示性的神经网络框架,用于预测两粒子减少密度矩阵,该框架通过架构和损失函数整合了部分可表示性条件,并能应用于不同动量网格,从而评估不同网格上的可表示性条件。该方法用于在大动量网格上预测2-RDM或作为优化的变分2-RDM ansatz。在扭曲双层MoTe₂的一带投影模型中,应用于3.89度的扭角和2/3的空穴填充,展示了该方法的优越性。

Comments 12+32 Pages, 4+10 Figures, 0+19 Tables

详情
AI中文摘要

我们开发了一种具有可表示性和插值能力的神经网络(NN)框架,用于预测两粒子减少密度矩阵(2-RDMs)。该NN通过其架构和损失函数整合了部分可表示性条件,并能够应用于不同的动量网格,从而在多个网格上评估可表示性条件,我们称之为插值可表示性条件。该框架既可以用于通过插值小网格的精确结果来预测大网格上的2-RDM,也可以作为通过在任意网格上优化能量最小化来优化的变分2-RDM ansatz。我们将这种方法应用于扭曲双层MoTe₂的一带投影模型中的分数陈绝缘体,在扭角为3.89°和空穴填充为2/3的情况下。通过在具有12或18个动量点的精确对角化(ED)2-RDMs上训练六个不同的NN架构,最佳的NN是残差多层感知机,它在97.07%-98.18%的相对精度下预测6×6的2-RDM,但预测的能量比ED基态能量高77.353 meV。然后,我们对NN在多个网格上进行变分优化,包括6×6网格,预测出6×6网格的能量比ED低0.104 meV,同时保持98.94%-98.96%的精度。与传统的边界点半正定规划相比,该NN在参数数量仅为传统方法的1/20的情况下,实现了更准确的能量预测和相似的精度。最终,我们将在变分优化的NN中添加一个具有48个动量点的对称网格,并在该网格上提供许多体基态能量和许多体量子度量的预测。

英文摘要

We develop a representability-aware and interpolable neural network (NN) framework for predicting two-particle reduced density matrices (2-RDMs). The NN incorporates a subset of representability conditions through its architecture and loss function, and can operate on different momentum meshes, enabling evaluating the representability conditions across multiple meshes, which we call interpolated representability condition. The framework can be used either to predict 2-RDMs on large momentum meshes by interpolating exact results from small meshes, or as a variational 2-RDM ansatz optimized by energy minimization on arbitrary meshes. We apply this approach to the fractional Chern insulator in the one-band projected model of twisted bilayer MoTe$_2$ at twist angle $3.89^\circ$ and hole filling $2/3$. Trained on exact-diagonalization (ED) 2-RDMs from meshes with $12$ or $18$ momentum points using six different NN architectures, the best NN is the residual multilayer perceptron, which predicts the $6\times6$ 2-RDM with $97.07\%-98.18\%$ accuracy relative to the ED 2-RDM but predicts an energy $77.353$ meV above ED ground-state energy. We then variationally optimize the NN on several meshes including $6\times6$, predicting a $6\times 6$ energy of just $0.104$ meV below ED while maintaining $98.94\%-98.96\%$ accuracy. Compared with the conventional boundary-point semidefinite programming, which gives an energy $5.560$ meV below ED with $96.40\%-98.94\%$ accuracy, the NN achieves a more accurate energy and similar accuracy while using only less than 1/20 as many parameters. Eventually, we add a symmetric mesh of $48$ momentum points to the variational optimization of the NN, and provide a prediction of the many-body ground-state energy and the many-body quantum metric on that mesh.

2605.20316 2026-05-21 cs.CV cs.AI 版本更新

FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation

FullFlow: 通过双向视觉-语言生成升级文本到图像流匹配模型

Eric Tillmann Bill, Enis Simsar, Alessio Tonioni, Thomas Hofmann

发表机构 * ETH Zurich(苏黎世联邦理工学院) Google(谷歌)

AI总结 本文提出FullFlow方法,通过仅训练LoRA适配器和轻量级文本头部,将预训练的rectified-flow文本到图像模型升级为双向视觉-语言生成器,从而在保持图像连续流的同时添加文本离散插入过程,提升文本到图像和图像到文本的生成质量。

Comments project page: https://ericbill21.github.io/fullflow/

详情
AI中文摘要

现代文本到图像扩散模型编码了丰富的视觉先验,但只能通过单向文本条件生成暴露。现有统一的视觉-语言模型通过大规模联合预训练或对文本路径进行大量重训练来恢复双向能力,但丢弃了文本到图像模型本身已编码的强图像先验。我们介绍了FullFlow,一种参数高效的配方,通过仅训练LoRA适配器和轻量级文本头部,将预训练的rectified-flow文本到图像模型升级为双向视觉-语言生成器。FullFlow保持图像在原生连续流中,并添加文本的离散插入过程。分离的图像和文本时间步将推断转化为二维生成空间中的轨迹选择,使文本→图像、图像→文本、联合采样和部分文本预测能够通过单一主干模型完成。在Stable Diffusion 3 (SD3)上,FullFlow在相同可训练参数数量和匹配LoRA秩的情况下,将文本→图像的FID从62.7提升到31.6,将图像→文本的CIDEr从2.0提升到99.4,同时在两个RTX A5000 GPU上训练时间不超过24小时的情况下,将峰值VRAM从约84GB降低到约38GB,并将吞吐量提高约8倍,仅训练主干参数的约5%。同样的配方适用于FLUX.1-dev,并通过部分文本生成支持下游VQA。这些结果表明,强大的双向视觉-语言能力可以从预训练的文本到图像流模型中解锁,而无需完整的多模态预训练。

英文摘要

Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision--language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce \emph{FullFlow}, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision--language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text$\rightarrow$image, image$\rightarrow$text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text$\rightarrow$image FID from $62.7$ to $31.6$ and image$\rightarrow$text CIDEr from $2.0$ to $99.4$ over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from ${\sim}84$\,GB to ${\sim}38$\,GB and raising throughput by ${\sim}8\times$ on two RTX A5000 GPUs in under 24 hours, training only ${\sim}5\%$ of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision--language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.

2605.20314 2026-05-21 cs.LG cs.AI 版本更新

Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases

数据更少,训练更快:重复较小的数据集通过采样偏差加速学习

Jingwen Liu, Ezra Edelman, Surbhi Goel, Bingbin Liu

发表机构 * Columbia University(哥伦比亚大学) University of Pennsylvania(宾夕法尼亚大学) Harvard University(哈佛大学)

AI总结 研究探讨了'小数据与大数据差距'现象,即使用更少样本重复训练比使用更大数据集更节省计算资源,通过层间增长和采样偏差机制实现加速,为优化提供了新的归纳偏差。

Comments ICML 2026

详情
AI中文摘要

本文研究了'小数据与大数据差距'现象,即在较少样本上重复训练相比使用较大数据集可以节省训练计算资源。这一现象在算法任务、架构和优化器中均被观察到,无法用现有理论解释。我们提出,这种加速是由于适当的层间增长机制,由采样偏差驱动,且在数据集较小时更为显著。我们通过多种干预措施提供了理论分析和实证证据。研究结果表明,使用较小数据集并进行重复训练不仅是在数据稀缺时的退化策略,而且可以主动作为优化的有利归纳偏差,特别是在推理任务中。

英文摘要

This work investigates the ``small-vs-large gap'', where repeating on fewer samples can lead to compute saving during training compared to using a larger dataset. This is observed across algorithmic tasks, architectures and optimizers and cannot be explained using prior theory. We argue that the speedup comes from appropriate layer-wise growth enabled by sampling biases, which is more pronounced when the dataset size is smaller. We provide both theoretical analysis and empirical evidence from various interventions. Our results suggest that using a smaller dataset with more repetitions is not just a fallback strategy under data scarcity, but can be proactively leveraged as a favorable inductive biases for optimization, particularly in reasoning tasks.

2605.20309 2026-05-21 cs.CV cs.AI 版本更新

Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision

Tiny-Engram: 生成视觉中的触发索引概念表

Runyuan Cai, Yiming Wang, Yu Lin, Xiaodong Zeng

发表机构 * AutoArk-AI

AI总结 本文提出Tiny-Engram,一种紧凑的触发索引概念表,通过显式地为视觉记忆分配词汇地址和激活边界,实现对冻结图像和视频生成器中的概念的控制。该方法通过注册的n-gram匹配索引参数化每个概念,仅在匹配触发区域调节文本编码器的隐藏状态,从而在保持周围提示的组合控制的同时,将罕见触发短语绑定到目标身份。

详情
AI中文摘要

当前生成视觉模型的个性化方法通常通过连续适配器或权重更新来编码新概念,但对是否以及何时检索概念的控制有限。在本工作中,我们引入Tiny-Engram,一种紧凑的触发索引概念表,为冻结的图像和视频生成器中的视觉记忆提供显式的词汇地址和激活边界。Tiny-Engram将每个概念参数化为一组小的记忆条目,这些条目通过注册的n-gram匹配进行索引,仅在匹配的触发区域调节文本编码器的隐藏状态。在该词汇支持之外,条件路径与冻结的基础模型相同。在单编码器潜在扩散和多编码器扩散-变压器骨干结构上,这种公式将罕见触发短语绑定到目标身份,同时保持周围提示的组合控制。我们进一步在文本条件的视频生成设置中评估相同的表式记忆,其中触发路径可靠地改变生成的主题,但保持在排除的视频提示中精细的身份持续性仍然有限。综合来看,这些结果表明,小型、显式地址的概念表是实现模块化视觉个性化的一种实用途径,尤其在图像生成中证据最强。对于视频扩散,剩余的差距指向更广泛的需求:时间稳定的身份可能依赖于文本侧记忆与不断演变的视觉状态之间的更紧密耦合,这促使未来在记忆注入方面的工作超越文本条件接口。

英文摘要

Current personalization methods for generative vision models typically encode new concepts through continuous adapters or weight updates, yet provide limited control over whether and when a concept should be retrieved. In this work, we introduce Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary inside frozen image and video generators. Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches, which modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support, the conditioning pathway is identical to that of the frozen base model. Across both single-encoder latent diffusion and multi-encoder diffusion-transformer backbones, this formulation binds a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt. We further evaluate the same table-based memory in a text-conditioned video generation setting, where the trigger path reliably alters the generated subject but fine-grained identity persistence across held-out video prompts remains limited. Taken together, these results suggest that small, explicitly addressed concept tables are a practical route to modular visual personalization, with strongest evidence in image generation. For video diffusion, the remaining gap points to a broader requirement: temporally stable identity likely depends on tighter coupling between text-side memory and the evolving visual state, motivating future work on memory injection beyond the text-conditioning interface.

2605.20308 2026-05-21 cs.CV cs.AI cs.LG 版本更新

SDM: A Powerful Tool for Evaluating Model Robustness

SDM:评估模型鲁棒性的强大工具

Xinlei Liu, Tao Hu, Jichao Xie, Peng Yi, Hailong Ma, Baolin Li

发表机构 * Information Engineering University, Zhengzhou, China Key Laboratory of Cyberspace Endogenous Safety \& Security of Henan Province, Zhengzhou, China Key Laboratory of Cyberspace Security Ministry of Education of China, Zhengzhou, China Songshan Laboratory, Zhengzhou, China

AI总结 本文提出了一种名为SDM的新型梯度攻击方法,通过重新定义对抗样本生成的目标,解决了传统方法中'高损失非对抗样本'导致的性能下降问题,并在实验中证明了其在攻击性能和成本效率上的优势。

Comments 16 pages

Journal ref Forty-third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

基于梯度的攻击方法是评估模型鲁棒性的重要方法。然而,自从提出APGD以来,此类方法难以取得显著突破。为了实现这一效果,我们首先分析了先前方法中导致攻击性能下降的'高损失非对抗样本'问题,并证明该问题源于对抗样本生成目标的不恰当。随后,我们将目标重新定义为

英文摘要

Gradient-based attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant breakthroughs. To achieve such an effect, we first analyze the issue of "high-loss non-adversarial examples" that degrades attack performance in previous methods, and prove that this issue arises from inappropriate objectives for adversarial example generation. Subsequently, we reconstruct the objective as "maximizing the difference between the non-ground-truth label probability upper bound and the ground-truth label probability", and proposes a novel and powerful gradient-based attack method named Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of "cycle-stage-step". It adopts the negative probability loss function and the Directional Probability Difference Ratio (DPDR) loss function in the initial and subsequent optimization stages, respectively, and approaches the ideal objective of adversarial example generation via stage-wise sequential optimization. Experiments demonstrate that compared with previous state-of-the-art methods, SDM not only achieves stronger attack performance but also exhibits superior cost-effectiveness. The code is available at https://github.com/X-L-Liu/ICML-SDM.

2605.20300 2026-05-21 cs.LG cs.AI 版本更新

Robust Subspace-Constrained Quadratic Models for Low-Dimensional Structure Learning

鲁棒的子空间约束二次模型用于低维结构学习

Zheng Zhai, Xiaohui Li

发表机构 * Department of Statistics, Faculty of Arts and Sciences at Beijing Normal University, Zhuhai(北京师范大学统计学系,北京师范大学艺术科学 faculty,珠海分校)

AI总结 本文提出了一种鲁棒的子空间约束二次模型(SCQM),用于从高维数据中学习低维结构。基于子空间约束二次矩阵分解(SQMF)框架,该模型能够适应广泛噪声分布,包括广义高斯和径向拉普拉斯模型。这种泛化能力使其在重尾和轻尾噪声下均能保持稳定性能,显著提高了在不同数据场景下的鲁棒性。为高效解决由此产生的非凸优化问题,我们开发了一种基于梯度的算法,配备回溯线搜索策略以确保稳定和高效的收敛。此外,我们还对$\ell_p^p$和$\ell_2$损失函数进行了敏感性分析,阐明了它们在不同噪声特性下的不同行为。大量数值实验验证了理论分析,并展示了所提方法在鲁棒性和重建准确性方面优于现有方法。

详情
AI中文摘要

在本文中,我们提出了一种鲁棒的子空间约束二次模型(SCQM),用于从高维数据中学习低维结构。基于子空间约束二次矩阵分解(SQMF)框架,所提出的模型能够适应广泛噪声分布,包括广义高斯和径向拉普拉斯模型。这种泛化能力使该方法在重尾和轻尾噪声下均能保持稳定性能,从而在不同数据场景下显著提高了鲁棒性。为高效解决由此产生的非凸优化问题,我们开发了一种基于梯度的算法,配备回溯线搜索策略以确保稳定和高效的收敛。此外,我们还对$\ell_p^p$和$\ell_2$损失函数进行了敏感性分析,阐明了它们在不同噪声特性下的不同行为。大量数值实验验证了理论分析,并展示了所提方法在鲁棒性和重建准确性方面优于现有方法。

英文摘要

In this paper, we propose a robust subspace-constrained quadratic model (SCQM) for learning low-dimensional structure from high-dimensional data. Building upon the subspace-constrained quadratic matrix factorization (SQMF) framework, the proposed model accommodates a broad class of noise distributions, including generalized Gaussian and radial Laplace models. This generalization enables reliable performance under both heavy-tailed and light-tailed noise, thereby substantially enhancing robustness across diverse data regimes. To efficiently address the resulting nonconvex optimization problem, we develop a gradient-based algorithm equipped with a backtracking line-search strategy that ensures stable and efficient convergence. In addition, we present a sensitivity analysis of the $\ell_p^p$ and $\ell_2$ loss functions, elucidating their distinct behaviors under varying noise characteristics. Extensive numerical experiments corroborate the theoretical analysis and demonstrate that the proposed approach consistently outperforms existing methods in terms of robustness and reconstruction accuracy.

2605.20299 2026-05-21 cs.LG cs.AI cs.RO 版本更新

Mechanisms of Misgeneralization in Physical Sequence Modeling

物理序列建模中泛化错误的机制

Kento Nishi, Raphael Tang, Karun Kumar, Core Francisco Park, Hidenori Tanaka

发表机构 * Harvard College(哈佛大学) Harvard John A. Paulson School of Engineering and Applied Sciences(哈佛大学约翰·A·保罗森工程与应用科学学院) Comcast AI CBS-NTT Program in Physics of Intelligence, Harvard University(哈佛大学物理智能计划) Physics of Artificial Intelligence Group, NTT Research, Inc., Sunnyvale, CA, USA(人工智能物理研究组,NTT研究公司,美国加利福尼亚州山景城) Microsoft(微软)

AI总结 本文研究了物理序列建模中由于局部误差传播导致的物理泛化错误,提出了一种数据偏差核来预测物理量的质量变化,并提出了基于核的干预策略。

Comments Preprint. kentonishi.com/physical-misgeneralization

详情
AI中文摘要

生成序列模型通常用于在物理领域规划运动,从机器人到机械模拟。在构建训练此类模型的数据集时,工程师可能会选择演示来指定轨迹在物理量如旅行距离或机械能上的分布。例如,一个构建迷宫导航代理的机器人工程师可能会选择旅行距离覆盖固定范围的演示,希望限制代理的预期功率使用。我们发现标准深度学习可以违反这一意图:每个生成的轨迹在单独看来都合理,但物理量的总体分布是错误的。我们将这种失败称为物理泛化错误,并发展了其机制。通过受控的合成任务,我们发现物理泛化错误出现在局部误差典型于模型类通过物理测量传播到恢复分布时。我们用数据偏差核估计这些误差,并利用它来预测在我们的合成任务和更应用的迷宫导航和双摆运动任务中哪些物理量获得或失去质量。最后,我们的机制性解释有助于识别哪些缓解策略在结构上具有前景,并利用它提出了一种基于核的干预。

英文摘要

Generative sequence models are often trained to plan motion in physical domains, from robotics to mechanical simulations. When constructing a dataset to train such a model, engineers may curate demonstrations to specify how trajectories should be distributed over a physical quantity like travel distance or mechanical energy. For example, a roboticist building a maze navigation agent might choose demonstrations whose travel distances cover a fixed range uniformly, hoping to constrain the agent's expected power usage. We find that standard deep learning can violate this intent: each generated trajectory can seem plausible on its own, but the aggregate distribution over the physical quantity is wrong. We call this failure physical misgeneralization, and develop an account of its mechanism. Using controlled synthetic tasks, we show that physical misgeneralization arises when local errors typical of the model class propagate through the physical measurement to shift the recovered distribution. We estimate these errors with a data deviation kernel, and we use it to predict which physical quantities gain or lose mass in both our synthetic and more applied maze navigation and double-pendulum motion tasks. Finally, our mechanistic interpretation helps identify which mitigation strategies are structurally promising, and we use it to propose a kernel-informed intervention.

2605.20296 2026-05-21 cs.LG cs.AI 版本更新

Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining

谱遗忘:无需重新训练的后验能力恢复

Aarash Abro, Muhammad Tahir

发表机构 * Zeta Labs(泽塔实验室) Lahore University of Management Sciences(拉合尔管理科学大学)

AI总结 研究探讨了语言模型在目标任务微调过程中因训练数据未显式威胁而退化的能力现象,提出了一种仅使用预训练检查点和微调后检查点的后验修复方法,通过谱修复技术恢复受损能力并保留目标任务收益。

详情
AI中文摘要

对语言模型进行目标任务微调通常会退化那些训练数据从未显式威胁的能力。我们研究这种现象,称为灾难性遗忘,并提出一种后验修复解决方案,仅使用预训练检查点W_base和其微调后代W_ft。目标不仅是将模型回退到基础检查点,而是恢复微调损坏的能力,同时保留目标任务的收益和任何有益的未显式改进。我们引入了DG-Hard,一种仅使用检查点的谱修复方法,用于微调更新Δ= W_ft - W_base。DG-Hard将Δ视为嵌入在IID-like噪声残差中的低秩任务对齐信号,该信号梯度下降没有动力去除,并对每个权重-增量矩阵应用Donoho-Gavish硬奇异值阈值,保留更新的结构高能部分并去除谱体。这将修复简化为一个闭合形式的SVD过滤步骤,无需数据依赖的调优。一个核心困难是评估:平均准确率隐藏了每个基准的失败,而朴素恢复分数奖励那些简单回退到基础的模型。因此,我们引入了一个分区条件度量,分别跟踪愈合、保留、非损坏和目标任务保留。在14(模型,任务)设置和九个跨领域未显式基准上,DG-Hard在后验基线中实现了最强的平衡修复。DG-Hard还恢复了由良性微调退化的三个独立安全轴的安全对齐,尽管不使用任何对齐数据。这些结果表明,部分微调引起的能力建设损失并非专业化不可避免的后果,而是在权重更新本身中可去除的谱残余。

英文摘要

Fine-tuning a language model for a target task routinely degrades capabilities the training data never explicitly threatened. We study this phenomenon, known as catastrophic forgetting, and propose a post-hoc repair solution that uses only the pretrained checkpoint $W_{\mathrm{base}}$ and its fine-tuned descendant $W_{\mathrm{ft}}$. The goal is not merely to revert the model toward the base checkpoint, but to recover capabilities damaged by fine-tuning while preserving both the target-task gains and any beneficial held-out improvements. We introduce DG-Hard, a checkpoint-only spectral repair method for the fine-tuning update $Δ= W_{\mathrm{ft}} - W_{\mathrm{base}}$. DG-Hard treats $Δ$ as a low-rank task-aligned signal embedded in an IID-like noise residual that gradient descent has no incentive to remove, and applies the Donoho-Gavish hard singular-value threshold to each weight-delta matrix, keeping the structured high-energy part of the update and removing the spectral bulk. This reduces repair to a closed-form SVD filtering step requiring no data-dependent tuning. A central difficulty is evaluation: average accuracy hides per-benchmark failures, while naive recovery scores reward models that simply revert toward the base. We therefore introduce a partition-conditional metric that separately tracks healing, preservation, non-damage, and target-task retention. Across $14$ (model, task) settings and nine cross-domain held-out benchmarks, DG-Hard achieves the strongest balanced repair among post-hoc baselines. DG-Hard also restores safety alignment degraded by benign fine-tuning on three independent safety axes, despite using no alignment data. These results suggest that part of fine-tuning-induced capability loss is not an unavoidable consequence of specialization, but a removable spectral residue in the weight update itself.

2605.20295 2026-05-21 cs.LG cs.AI 版本更新

Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

Quant.npu:通过完全静态量化实现高效的移动NPU推理以支持设备端LLM

Jinghe Zhang, Daliang Xu, Chenghua Wang, Weikai Xie, Tao Qi, Yun Ma, Mengwei Xu, Gang Huang

发表机构 * Qualcomm(高通)

AI总结 本文提出Quant.npu框架,通过完全静态量化方法实现高效的移动NPU推理,解决了传统后训练量化方法在NPU硬件约束下的兼容性问题,并在实际移动NPU上实现了较高的准确性和较低的推理延迟。

详情
AI中文摘要

大型语言模型(LLMs)正越来越多地部署在移动设备上,其中神经处理单元(NPUs)需要完全静态量化以实现最优的推理效率。然而,现有的后训练量化(PTQ)方法主要依赖于动态激活量化,使其与NPU硬件约束不兼容。为了弥合高保真PTQ与NPU受限推理之间的差距,我们提出了Quant.npu,一个仅整数的完全静态量化框架。它结合了可学习的量化参数和旋转矩阵,使低比特激活-权重量化无需运行时重新计算量化参数。关键的是,我们发现初始化和选择性优化量化参数对于优化稳定性至关重要,因为不恰当的初始化和简单的联合优化会引发梯度不稳定,破坏旋转矩阵的优化。为此,我们提出了针对不同激活特征的旋转和比特宽感知初始化,以及针对旋转和未旋转张量的分布感知选择性优化(双阶段量化流水线)。此外,我们引入了一种敏感性引导的自适应混合精度方案,以在准确性和推理效率之间取得平衡。在实际移动NPU上的大量实验表明,Quant.npu在准确度上与最先进的方法相当,同时将推理延迟降低了最高15.1%。

英文摘要

Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency. However, existing post-training quantization (PTQ) methods predominantly rely on dynamic activation quantization, rendering them incompatible with NPU hardware constraints. To bridge the gap between high-fidelity PTQ and NPU-constrained inference, we propose Quant.npu, a integer-only fully static quantization framework. It incorporates learnable quantization parameters and rotation matrices, enabling low-bit activation-weight quantization without runtime quantization parameters re-computation. Crucially, we identify that initialization and selective optimization of quantization parameters is pivotal for optimization stability, as improper initialization and naive joint optimization induce gradient instability that disrupts the optimization of rotation matrices. To address this, we propose a rotation-and-bit-width-aware initialization tailored to diverse activation profiles and a distribution-aware selective optimization (two-stage quantization pipeline) tailored to rotated and unrotated tensors. Furthermore, we introduce a sensitivity-guided adaptive mixed-precision scheme to balance accuracy with inference efficiency. Extensive experiments on real-world mobile NPUs demonstrate that Quant.npu achieves comparable accuracy to state-of-the-art methods, while reducing inference latency by up to 15.1%.

2605.20293 2026-05-21 cs.LG cs.AI cs.NE 版本更新

Closed-form predictive coding via hierarchical Gaussian filters

通过分层高斯滤波器实现闭式预测编码

Aleksandrs Baskakovs, Sylvain Estebe, Kenneth Enevoldsen, Kristoffer Nielbo, Chris Mathys, Nicolas Legrand

发表机构 * Center for Humanities Computing(人文计算中心) Aarhus University(奥胡斯大学) Interacting Minds Center(互动心灵中心)

AI总结 本文提出通过分层高斯滤波器实现预测编码,恢复了精度加权的信息传递,实现了动态不确定性估计和Hebbian兼容的更新规则,从而在单个自由能目标下同时学习激活、权重和精度,无需全局误差信号,且无需迭代或自动微分。

详情
AI中文摘要

预测编码(PC)提供了一种局部且生物基础的替代反向传播方法,用于训练人工神经网络,但至今仍较慢,且随着网络深度增加性能急剧下降。我们追溯这两个问题到一个简化:当前PC网络将精度矩阵固定为单位矩阵,丢弃了变分推导所需的精度加权预测误差,以实现快速、局部和贝叶斯的特性。我们通过将预测编码网络表示为深度分层高斯滤波器(HGF)并恢复精度加权的信息传递,从而在每一层实现动态不确定性估计和Hebbian兼容的更新规则。所得到的网络可以在单个自由能目标下同时学习激活、权重和精度,无需全局误差信号,并且在推断过程中无需迭代或自动微分。在FashionMNIST上,我们的解决方案在epoch级的运行时间成本上接近反向传播,同时在更少的epoch中收敛,并在在线、数据效率和概念漂移任务上优于反向传播。因此,我们证明了闭式变分推断与在线精度学习相结合,为深度预测编码网络提供了一个可处理的基础,保留了生物和解释性优势,而无需迭代松弛或全局误差信号。

英文摘要

Predictive coding (PC) offers a local and biologically grounded alternative to backpropagation in the training of artificial neural networks, yet to date, it remains slower, and performance degrades sharply as network depth increases. We trace both problems to a single simplification: current PC networks fix the precision matrix to the identity, discarding precision-weighted prediction errors that the variational derivation requires to be fast, local, and Bayesian. We close this gap by expressing predictive coding networks as deep hierarchical Gaussian filters (HGFs) and restore precision-weighted message passing, yielding dynamic uncertainty estimates and Hebbian-compatible update rules at every layer. The resulting networks can simultaneously learn activations, weights, and precisions under a single free-energy objective, with no global error signal, and resolve inference without requiring iterations or automatic differentiation. On FashionMNIST, our solution approaches backpropagation in epoch-level wall-clock cost while converging in fewer epochs, and outperforms it on online, data efficiency, and concept-drift tasks. We thus establish that closed-form variational inference with online precision learning provides a tractable foundation for deep predictive coding networks, retaining biological and interpretative advantages, without requiring iterative relaxation or global error signals.

2605.20289 2026-05-21 cs.LG cs.AI 版本更新

Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers

插件式脉冲运算符:突破脉冲变换器中的非线性瓶颈

Xinzhe Yuan, Xiang Peng, Bin Gu, Huan Xiong

发表机构 * IASM, Harbin Institute of Technology, China(哈尔滨工业大学人工智能研究所,中国) School of Artificial Intelligence, Jilin University, China(吉林大学人工智能学院,中国)

AI总结 本文提出了一种插件式框架,通过将Transformer中的非线性运算分解为三个基本算子(除法、指数和ℓ2范数),并利用LIF神经元群体和轻量级位移缩放实现脉冲友好的近似,从而在不需微调的情况下支持常见的Transformer非线性运算。

Comments Accepted to ICML 2026. 9 pages main paper, 8 pages appendix, 6 figures, 5 tables. Correspondence to Bin Gu and Huan Xiong

详情
AI中文摘要

ANN到SNN的转换提供了一条实用且无需训练的途径来构建脉冲大规模语言模型。然而,当前的流程主要关注于脉冲驱动实现Transformer线性代数运算,而对关键非线性运算的支持有限。这种差距限制了与神经形态风格执行约束的兼容性,其中此类非线性通常需要除法、指数或范数计算,这些计算并不自然支持于标准的泄漏积分-放电动力学。为了解决这个问题,我们提出了一种插件式框架,实现了Transformer非线性的脉冲友好的近似,并整合到现有的ANN到SNN流程中。我们的方法将这些非线性计算分解为三个反复出现的基本算子——除法、指数和ℓ2范数——并通过利用LIF神经元群体进行群体计算,并结合轻量级位移缩放以避免浮点运算来实现它们。通过将这些基本算子作为模块化运算块进行组合,我们的框架支持常见的Transformer非线性运算(例如Softmax、SiLU和归一化),而无需任何微调。在一系列LLM Transformer上的实验表明,选择性地替换目标非线性运算符在所有评估任务中导致的精度下降少于1%。

英文摘要

ANN-to-SNN conversion offers a practical, training-free route to spiking large language models. However, current pipelines primarily focus on spike-driven realizations for Transformer linear-algebra operations, while providing limited support for key nonlinear operators. This gap limits compatibility with neuromorphic-style execution constraints, where such nonlinearities typically require division, exponentiation, or norm computations that are not naturally supported by standard leaky integrate-and-fire dynamics. To solve this problem, we propose a plug-and-play framework that implements spike-friendly approximations for Transformer nonlinearities and integrates into existing ANN-to-SNN pipelines. Our method decomposes these nonlinear computations into three recurring primitives -- division, exponentiation, and $\ell_2$ norms -- and realizes them via population computation using LIF neuron groups, combined with lightweight bit-shift scaling to avoid floating-point arithmetic. By composing these primitives as modular operator blocks, our framework supports common Transformer nonlinearities (e.g., Softmax, SiLU, and normalization) without any fine-tuning. Experiments on a range of LLMs Transformers show that selectively replacing the targeted nonlinear operators incurs less than a $1\%$ accuracy drop across all evaluated tasks.

2605.20287 2026-05-21 cs.LG cs.AI cs.CV 版本更新

FusionCell: Cross-Attentive Fusion of Layout Geometry and Netlist Topology for Standard-Cell Performance Prediction

FusionCell: 跨注意力融合布局几何与网络列表拓扑以实现标准单元性能预测

Haoyi Zhang, Kairong Guo, Bojie Zhang, Yibo Lin, Runsheng Wang

发表机构 * School of Integrated Circuits, Peking University, Beijing, China(集成电路学院,北京大学,北京,中国)

AI总结 本文提出FusionCell,通过跨注意力机制融合布局几何和网络列表拓扑,以提高标准单元性能预测的准确性,解决了传统方法忽略布局几何导致的耦合和布局依赖效应的问题。

详情
AI中文摘要

标准单元是数字电路的基本构建块,其延迟和功率对芯片级性能有关键影响;然而,其表征仍依赖于缓慢的仿真扫描,许多快速预测器忽略了布局几何,未能捕捉到耦合和布局依赖效应。挑战在于如何联合表示布局几何和网络列表拓扑,使模型能够同时捕捉细粒度的空间细节和结构连接,以实现准确的性能预测。我们引入FusionCell,一种双模态预测器,将路由布局几何和网络列表拓扑作为输入,并在统一模型中显式融合它们。一个DeiT编码器处理三层路由布局,而图Transformer模型异构设备/网络图。模态通过拓扑引导机制集成,其中网络列表作为结构“地图”主动查询布局中的相关物理区域,以实现联合几何和拓扑推理。我们构建了一个基于ASAP7 PDK的7nm数据集,使用自动工具生成超过19500个单元,涵盖149种类型,针对六个指标:信号上升/下降延迟、过渡和功率。实验结果表明,FusionCell减少了回归误差,平均MAPE为0.92个百分点,并在基线模型上提高了Spearman/Kendall排名,同时将表征过程的速度提高了数十倍,相比电路仿真。

英文摘要

Standard cells form the building blocks of digital circuits, so their delay and power critically influence chip-level performance; yet characterization still relies on slow simulation sweeps, and many fast predictors ignore layout geometry, missing coupling and layout-dependent effects. The challenge is to jointly represent layout geometry and netlist topology so models capture fine-grained spatial details together with structural connectivity for accurate performance prediction. We introduce FusionCell, a dual-modality predictor that treats routed layout geometry and netlist topology as inputs and fuses them explicitly in a unified model. A DeiT encoder processes three-layer routed layouts, while a graph transformer models heterogeneous device/net graphs. The modalities are integrated through a topology-guided mechanism, where the netlist acts as a structural "map" to actively query relevant physical regions in the layout for joint geometric and topological reasoning. We build a 7nm dataset based on the ASAP7 PDK with over 19.5k cells spanning 149 types using automatic tools, targeting six metrics: signal rise/fall delay, transition, and power. Experimental results demonstrate that FusionCell reduces regression error, with an average MAPE of 0.92 percent, and improves Spearman/Kendall ranking over baselines, while accelerating the characterization process by orders of magnitude compared to circuit simulation.

2605.20285 2026-05-21 cs.LG cs.AI 版本更新

Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

反思式X训练:反馈条件化提升跨所有LLM训练阶段的扩展性

Brandon Cui, Ximing Lu, Jaehun Jung, Syeda Nahida Akter, Hyunwoo Kim, Yuxiao Qu, David Acuna, Shrimai Prabhumoye, Yejin Choi, Prithviraj Ammanabrolu

发表机构 * NVIDIA University of Washington(华盛顿大学) Carnegie Mellon University(卡内基梅隆大学) UC San Diego(南加州大学)

AI总结 本文提出反思式训练(IXT),通过利用后续阶段的动态来改进早期阶段,从而提高LLM训练的扩展效率,实验表明该方法在计算效率和性能上均有显著提升。

详情
AI中文摘要

我们探讨了如何更高效地扩展当前LLM训练流水线中的多个不断增长的阶段。我们的核心直觉源于后续阶段的动态(例如训练后)可以用来指导早期阶段(例如预训练)。为此,我们提出了反思式训练(或IXT),受离线奖励条件强化学习启发,适用于训练的任何阶段。IXT使用一个思考奖励模型来用自然语言批评性反馈标注数据,使从流水线的最早阶段开始就能进行质量感知训练。然后通过将生成的反馈作为前缀条件化数据来训练模型——确保在训练早期阶段并非所有token都被同等对待。在7.5-12B基于transformer的密集LLM上进行的全面实验表明,我们的方法:使扩展曲线弯曲,从而在一般情况下实现高达2.8倍的计算效率提升;并在数学和代码等领域达到其他训练方法无法达到的性能水平。

英文摘要

We tackle the question of how to scale more efficiently across the many, ever-growing stages of current LLM training pipelines. Our guiding intuition stems from the fact that the dynamics of later stages of the pipeline, e.g. post-training, can be used to inform earlier stages such as pre-training. To this end, we propose Introspective Training (or IXT), inspired by offline reward-conditioned reinforcement learning and applicable to any stage of training. IXT uses a thinking reward model to annotate data with natural language critique based feedback, enabling quality aware training from the earliest stages of the pipeline. Models are then trained by prefix-conditioning the data with the generated feedback -- ensuring that not all tokens are treated equally starting much earlier in training than usual. Comprehensive experiments on 7.5-12B transformer-based dense LLMs trained from scratch all the way up to 18 Trillion tokens seen show that our method: bends scaling curves resulting in up to 2.8x more compute efficiency generally; and reaches performance levels unachievable for models trained otherwise in domains such as math and code.

2605.20284 2026-05-21 cs.CV cs.AI cs.LG 版本更新

JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA

JUDO: 一种面向工业异常问答的多模态推理框架

Hyunju Kang, Woohyun Lee, Jaewon Kim, Hogun Park

发表机构 * Sungkyunkwan University(成均馆大学) Seoul National University(首尔国立大学)

AI总结 本文提出JUDO框架,通过结合领域知识和上下文提升多模态推理能力,以解决工业异常检测中模型缺乏领域知识的问题,实验表明其在MMAD基准上优于Qwen2.5-VL-7B和GPT-4o。

Comments Published at ICLR 2026

详情
AI中文摘要

工业异常检测已显著受益于大多模态模型(LMMs),使检测能力超越了单纯的检测,尤其通过视觉引导推理提升图像理解能力。然而,LMMs缺乏领域特定知识,限制了其在复杂工业场景中生成准确响应的能力。在本工作中,我们提出了JUDO,即Juxtaposed Domain-Oriented Multimodal Reasoner,一种能够高效整合领域知识和上下文的视觉和文本推理框架。通过视觉推理,我们的模型通过将查询图像与正常图像进行对比,分割缺陷区域,实现细粒度的视觉比较检查。此外,我们通过监督微调(SFT)注入领域知识,以增强上下文理解,并通过强化学习(GRPO)引导领域推理,采用领域导向的推理过程。实验结果表明,JUDO在MMAD基准上表现优异,超越了Qwen2.5-VL-7B和GPT-4o等模型。这些结果突显了增强领域知识和上下文对有效推理在异常理解中的重要性。

英文摘要

Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work, we present JUDO, Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context in visual and textual reasoning. Through visual reasoning, our model segments the defect region by juxtaposing query images with normal images as visual domain context, enabling a fine-grained visual comparative inspection. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with tailored rewards, opting for a domain-oriented reasoning process. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding.

2605.20277 2026-05-21 cs.CV cs.AI 版本更新

Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis

通过轨迹积分反馈调节解剖感知奖励用于体积计算断层扫描分析

Tianwei Lin, Zhongwei Qiu, Jie Cao, Jiang Liu, Wenjie Yan, Bo Zhang, Yu Zhong, Wenqiao Zhang, Yingda Xia, Ling Zhang

发表机构 * Zhejiang University(浙江大学) DAMO Academy, Alibaba Group(阿里集团达摩院) Hupan Lab(虎扑实验室) University of Electronic Science and Technology of China(电子科技大学)

AI总结 本文提出了一种新的框架,通过轨迹积分反馈GRPO(TIF-GRPO)来改进医疗视觉语言模型在三维CT分析中的性能,通过引入临床异常基准评估子系统(CABS)来解决优化目标与临床严谨性之间的不匹配问题,提升异常检测和临床准确性。

详情
AI中文摘要

医学视觉-语言模型(VLMs)已迅速发展为通用多模态助手,但其在三维计算机断层扫描(CT)分析中的应用仍受到优化目标与临床严谨性之间持续不匹配的限制。当前的强化学习(RL)范式仍然依赖于词汇代理信号,导致``评估幻觉'',即模型优化语言流畅性而非事实性临床正确性,从而导致诊断性关键错误。为弥合这一差距,我们引入了临床异常基准评估子系统(CABS),一个将放射学报告分解为可验证的临床语义单元的结构化系统。利用CABS,我们识别出标准RL中的``机理分歧'',即表面相似性奖励驱动策略梯度绕过医学事实。因此,我们提出了轨迹积分反馈GRPO(TIF-GRPO),一种将控制理论原理整合到策略优化中的新框架。通过将临床推理建模为伪时间轨迹以发现异常,TIF-GRPO通过积分反馈回路调节解剖感知奖励,该回路将持续遗漏视为累积状态误差,并将幻觉视为过度的控制努力。在3D CT基准测试中,我们的方法显著提高了异常检测和临床忠实度,建立了医疗VLMs中细粒度调节的新范式。我们的项目可在GitHub上获取。

英文摘要

Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce ``\textit{Evaluation Hallucinations}'', where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the \textbf{Clinical Abnormality Benchmarking Substrate (CABS)}, a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a ``\textit{Mechanistic Divergence}'' in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose \textbf{Trajectory-Integral Feedback GRPO (TIF-GRPO)}, a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available at \href{https://github.com/ZJU4HealthCare/TIF-GRPO}{GitHub}.

2605.20275 2026-05-21 cs.CV cs.AI 版本更新

You Don't Need Attention: Gated Convolutional Modeling for Watch-Based Fall Detection

你不需要注意力:基于门控卷积的基于手表的跌倒检测

Sana Alamgeer, Ronish Kumar, Awatif Yasmin, Muhammad Irshad, Anne H. H. Ngu

发表机构 * Texas State University(德克萨斯州立大学)

AI总结 本文提出了一种轻量级的双流架构Gated-CNN,用于基于手表的跌倒检测,通过门控机制提升特征提取效率,实现在不使用注意力机制的情况下达到更高的检测精度。

详情
AI中文摘要

现有的基于可穿戴设备的跌倒检测系统依赖于自注意力机制,这种机制带来了二次计算开销,将权重分布到所有时间步。这种全局权重分布会损害在短固定长度窗口中跌倒特征的精确定位。为克服这一挑战,我们提出Gated-CNN,一种轻量级双流架构,通过独立的一维卷积特征提取器处理加速度计和陀螺仪流,随后(i)一个sigmoid门控模块,选择性地抑制无信息的背景激活,同时增强跌倒区分特征;(ii)一个全局平均池化层,将每个流压缩成紧凑的固定长度描述符;(iii)一个共享的分类头,融合两个描述符进行二分类跌倒预测。对于离线评估,我们在五个腕部惯性测量单元(IMU)数据集上评估模型,分别在SmartFallMM、WEDA-Fall、FallAllD、UMAFall和UP-Fall数据集上获得平均F1分数为93%、93%、90%、91%和90%的结果,优于Transformer基线。对于实时评估,我们将模型部署在Google Pixel Watch 3上,并在12名参与者上进行测试。模型在零次遗漏的情况下实现了97%的平均F1分数和98%的准确率,表明sigmoid门控提供了一种在结构上更一致且计算更高效的替代方案,用于商用智能手表的跌倒检测。

英文摘要

Existing deep learning approaches for wearable fall detection systems rely on self-attention mechanisms that impose quadratic computational overhead, distributing weights across all time steps. This global weight distribution impairs the precise localization of the brief impact signatures that characterize falls within short, fixed-length windows. To overcome this challenge, we propose Gated-CNN, a lightweight dual-stream architecture that processes accelerometer and gyroscope streams through independent one-dimensional convolutional feature extractors, followed by (i) a sigmoid gating module that selectively suppresses uninformative background activations while amplifying fall-discriminative features, (ii) a global average pooling layer that compresses each stream into a compact fixed-length descriptor, and (iii) a shared classification head that fuses both descriptors for binary fall prediction. For offline evaluation, we evaluate the model across five wrist-mounted inertial measurement unit (IMU) datasets, achieving average F1-scores of 93%, 93%, 90%, 91%, and 90% on SmartFallMM, WEDA-Fall, FallAllD, UMAFall, and UP-Fall, outperforming Transformer baselines. For real-time evaluation, we deployed the model on a Google Pixel Watch 3 and tested across 12 participants. The model achieves an average F1-score of 97% and an accuracy of 98% with zero missed falls, showing that sigmoid gating offers a more structurally aligned and computationally efficient alternative to attention for commodity smartwatch-based fall detection.

2605.20274 2026-05-21 cs.GR cs.AI 版本更新

PolycubeNet: A Dual-latent Diffusion Model for Polycube-Based Hexahedral Mesh Generation

PolycubeNet: 一种基于多立方体的双潜在扩散模型用于六面体网格生成

Lu He, Qitao Deng, Junjiang Deng, Liangbin Deng, Yanjun Liang, Wenting Yang, Guoqiang Wang, Na Lei

发表机构 * Dalian University of Technology(大连理工大学) Jiangxi University of Science and Technology(江西理工大学) School of Mathematical Sciences, Guangxi Minzu University(广西民族大学数学与计算机科学学院) Caohejing Hi-Tech Park Development Co., Ltd.(曹家京高科技园发展有限公司) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 本文提出了一种基于条件扩散模型的端到端框架,用于生成基于多立方体的六面体网格,通过双潜在条件扩散架构有效解耦了计算复杂度与输入输出分辨率,提高了生成效率和鲁棒性。

详情
AI中文摘要

六面体网格广泛用于模拟流水线,但自动生成复杂CAD几何体仍具有挑战性。基于多立方体的六面体网格生成是一种代表性方法,因其结构规则且参数化友好,但现有多立方体构造方法通常依赖于复杂的表面分割和局部启发式方法,这可能会产生伪影或在困难形状上失败。在本文中,我们提出了一种基于条件扩散模型的端到端框架用于多立方体生成。给定一个以点云表示的输入几何体,我们的方法直接生成对应的多立方体点云,消除了显式表面分割或预定义多立方体模板的需要。我们方法的核心是一种双潜在条件扩散架构,将计算上昂贵的自注意力操作限制在固定容量、低维的潜在空间中。这种设计有效地将计算复杂度与输入几何体和输出多立方体的分辨率解耦,从而避免了点云自注意力机制中典型的二次成本,同时支持灵活的输入和输出分辨率。为了获得六面体网格,生成的多立方体通过刚性和非刚性点云配准对齐到输入形状,以建立表面对应关系,随后通过多立方体到六面体的流程。我们还创建并发布了CAD网格及其对应的多立方体网格配对数据集,以及我们模型的核心实现。实验表明,PolycubeNet能够泛化到具有任意亏格的复杂CAD模型,并在几秒钟内生成高质量的多立方体结构,比先前基于学习的方法在鲁棒性和效率上有所提升。

英文摘要

Hexahedral meshes are widely used in simulation pipelines, yet automatic generation remains challenging for complex CAD geometries. Polycube-based hexahedral meshing is a representative approach due to its regular, parameterization-friendly structure, but existing polycube construction methods often rely on intricate surface segmentation and local heuristics, which can produce artifacts or fail on difficult shapes. In this paper, we propose an end-to-end framework for polycube generation based on conditional diffusion models. Given an input geometry represented as a point cloud, our method directly produces a corresponding polycube point cloud, eliminating the need for explicit surface segmentation or predefined polycube templates. At the core of our approach is a dual-latent conditional diffusion architecture that confines computationally expensive self-attention operations to a fixed-capacity, low-dimensional latent space. This design effectively decouples computational complexity from the resolution of both the input geometry and the output polycube, thereby avoiding the quadratic cost typical of point cloud self-attention mechanisms while supporting flexible input and output resolutions. To obtain a hexahedral mesh, the generated polycube is aligned to the input shape via rigid and non-rigid point cloud registration to establish surface correspondence, followed by a polycube-to-hex pipeline. We additionally create and release a paired dataset of CAD meshes and their corresponding polycube meshes, together with the core implementation of our model. Experiments show that PolycubeNet generalizes to complex CAD models with arbitrary genus and produces high-quality polycube structures within seconds, improving robustness and efficiency over prior learning-based approaches.

2605.20273 2026-05-21 cs.LG cs.AI 版本更新

Modality-Decoupled Online Recursive Editing

模态解耦的在线递归编辑

Siyuan Li, Youyuan Zhang, Fangming Liu, Jing Li

发表机构 * Harbin Institute of Technology, Shenzhen, China.(哈尔滨工业大学(深圳)) Peng Cheng Laboratory, China.(鹏城实验室) Huazhong University of Science and Technology, China(华中科技大学)

AI总结 本文提出M-ORE,一种用于持续多模态大语言模型适应的模态解耦在线递归编辑器,通过统一的近端投影公式和Sherman-Morrison递归实现常数级的每编辑开销,从而在保持模块局部统计信息和固定正交低秩编辑子空间的同时,减少长周期干扰,提升可靠性、通用性和局部性。

详情
AI中文摘要

针对多模态大语言模型(MLLMs)的在线模型编辑需要在计算和内存预算限制下处理连续的纠正流,但为文本-only LLMs开发的编辑器在MLLMs上往往表现不佳:视觉主导的激活偏移了塑造更新的统计信息,导致跨模态冲突,而顺序写入在共享的编辑空间中交织,放大了长周期干扰,导致跨编辑干扰。为了解决这些问题,我们提出了M-ORE,一种用于持续MLLM适应的模态解耦在线递归编辑器。M-ORE源自统一的近端投影公式,并允许通过Sherman-Morrison递归实现闭式更新,从而实现每编辑常数开销。它维护文本堆栈和视觉投影器的模块级局部统计信息,以避免视觉主导的更新塑造,并通过Sherman-Morrison递归在固定正交低秩编辑子空间中进行持续更新,以缓解长周期干扰。在多个MLLM基础架构和在线编辑基准上的实验表明,我们的M-ORE方法在可靠性、通用性和局部性方面优于强大的基线方法,同时实现了有利的质量-效率扩展。我们的代码在https://github.com/lab-klc/M-ORE上公开可用。

英文摘要

Online model editing for multimodal large language models (MLLMs) requires assimilating a stream of corrections under tight compute and memory budgets. Yet editors developed for text-only LLMs often degrade on MLLMs: visually dominant activations skew the statistics that shape updates, causing cross-modal conflict, while sequential writes become entangled in a shared edit space and amplify long-horizon interference, causing inter-edit interference. To address these, we propose M-ORE, a modality-decoupled online recursive editor for lifelong MLLM adaptation. M-ORE is derived from a unified proximal-projection formulation and admits a closed-form update with a Sherman-Morrison recursion, yielding constant per-edit overhead. It maintains module-wise locality statistics for the text stack and the visual projector to avoid visually dominated update shaping and performs continual updates in a fixed orthogonal low-rank edit subspace via a Sherman-Morrison recursion to mitigate long-horizon interference. Experiments on multiple MLLM backbones and online editing benchmarks show that our M-ORE method consistently improves reliability, generality, and locality over strong baselines, while achieving favorable quality-efficiency scaling. Our code is publicly available at https://github.com/lab-klc/M-ORE.

2605.20272 2026-05-21 cs.LG cs.AI 版本更新

Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

更小的抽象状态空间在强化学习中实现跨尺度泛化

Nasehatul Mustakim, Lucas Lehnert

发表机构 * Department of Computer Science(计算机科学系) University of Saskatchewan(萨斯喀彻温大学) Saskatoon, Saskatchewan, Canada(加拿大萨斯喀彻温省萨斯喀彻温市)

AI总结 本文提出了一种理论模型,通过扩展POMDP中的状态抽象框架,定义了 successor-weighted model reduction,从而在强化学习代理中实现跨尺度泛化,并分析了抽象状态空间大小对泛化能力的影响。

详情
AI中文摘要

尽管人类能够轻易地将抽象概念推广到更复杂或更大的任务中,但构建具备这种能力的强化学习(RL)系统仍然难以实现。本文提出了首个关于如何在RL代理中实现Out-of-Distribution(OOD)泛化的理论模型。我们的方法考虑了部分可观测马尔可夫决策过程(POMDPs),并假设智能体使用抽象函数来确定哪些经验可以被视为等价,哪些必须区分。首先,我们扩展了现有的状态抽象框架和证明技术到POMDPs。然后,我们定义了successor-weighted model reduction,这是一种允许压缩到比先前定义更小的抽象空间的模型缩减变体。我们推导了代理OOD测试性能的界限,从而定义了实现OOD泛化的条件。该界限将代理的性能损失分解为近似和估计误差,揭示了减少代理抽象状态空间大小如何提高测试性能和OOD泛化能力。我们的分析表明,限制代理在有限的抽象状态集合上操作对于实现更复杂任务的泛化是必要的。我们的结果鼓励进一步研究学习能够跨不同复杂程度任务进行扩展的RL架构。

英文摘要

While humans readily generalize abstract concepts to more complex or larger tasks, building Reinforcement Learning (RL) systems with this ability remains elusive. Here, we present the first theoretical model of how such Out-of-Distribution (OOD) generalization can be achieved in RL agents. Our approach considers Partially Observable Markov Decision Processes (POMDPs) and assumes that an intelligent agent uses an abstraction function to determine which experiences can be treated as equivalent and which must be distinguished. First, we extend the existing state abstraction framework and proof techniques to POMDPs. Then, we define a successor-weighted model reduction, a model reduction variant that enables compression into smaller abstract spaces than prior definitions allow. We derive a bound on the agent's OOD test performance, thereby defining the conditions under which OOD generalization is achievable. This bound decomposes an agent's performance loss into approximation and estimation errors, revealing how reducing an agent's abstract state space size improves test performance and OOD generalization. Our analysis suggests that constraining an agent to operate over a small, finite set of abstract states is necessary for achieving generalization to more complex tasks. Our results motivate further research into learning RL architectures that scale across tasks of varying complexity levels.

2605.20270 2026-05-21 cs.LG cs.AI stat.ML 版本更新

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

conformal selective acting: any-time-valid risk control for rlvr-trained llms

Hamed Khosravi, Xiaoming Huo

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 该研究提出了一种 conformal selective acting 方法,用于在 rlvr 训练的 llms 部署中实现 anytime-valid 的风险控制,通过在部署要求下强制一个空单元,利用 e-process 和 bonferroni 网格来维护 pathwise 有效性,同时在多个基准测试中证明了其有效性。

详情
AI中文摘要

一个本地专家 llm,通过在操作员本地数据上使用强化学习从可验证奖励 (rlvr) 进行微调,被安装在一个受监管的组织中,具有每个部署的误差预算 α。操作员需要在每个回合为该部署的流提供安全证书:不跨部署汇总,不等待长期平均。现有封装器无法在自适应、在线更新的流上实现这一点:离线 conformal 风险方法需要可交换性;在线 conformal 方法仅绑定长期平均;非可交换扩展是边际有效的;最接近的 anytime 封装器,A-RCPS,控制的是边际风险而非选择性风险。使用 (测试统计量,有效性保证,部署规则) 框架,我们识别了一个被部署要求强制的空单元:e-process 每个阈值,选择性风险,anytime-pathwise 有效性,max-certified-threshold 规则。Conformal Selective Acting (CSA) 填充它作为每回合的封装器,维护每个阈值上的 ville 型 e-process 在 bonferroni 网格上,评估相对于 rlvr 过滤器。在可预测的更新和 isotonic-calibrated 单调风险下,我们证明了 (i) 一个 anytime-pathwise 选择性风险界 $R_T^{\mathrm{act}}\leα+O(N_T^{-1/2})$,(ii) 与 $Θ(arη^{-2}\log(1/δ))$ 匹配的认证率,以及 (iii) 与 horizon 无关的发布率差距。在八个专家基准 ($480$ 流)、十六个对抗性分布偏移单元 ($160$ 流) 和五个 live Expert-Iteration RLVR 单元 (在四个基础模型上使用在线 LoRA 在三个架构家族中) ($10{,}300$ 轮) 中,CSA 是十种方法中唯一一个在每个单元上都满足 pathwise 有效性和非拒绝部署的方法。我们不提出新的 llm、训练算法或策略类;CSA 是部署端的补充,与模型正交,适用于无法使用前沿 API 的操作员。

英文摘要

A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $α$. The operator needs a safety certificate for this deployment's stream at every round: no pooling across deployments, no waiting for a long-run average. Existing wrappers cannot deliver this on adaptive, online-updated streams: offline conformal-risk methods require exchangeability; online-conformal methods bound only long-run averages; non-exchangeable extensions are marginally valid; and the closest anytime wrapper, A-RCPS, controls marginal rather than selective risk. Using a (test statistic, validity guarantee, deployment rule) framework, we identify one empty cell forced by deployment requirements: e-process per threshold, selective risk, anytime-pathwise validity, max-certified-threshold rule. Conformal Selective Acting (CSA) fills it as a per-round wrapper maintaining a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration. Under predictable updates and isotonic-calibrated monotone risk we prove (i) an anytime-pathwise selective-risk bound $R_T^{\mathrm{act}}\leα+O(N_T^{-1/2})$, (ii) rate-optimal certification matching $Θ(\barη^{-2}\log(1/δ))$, and (iii) a horizon-independent release-rate gap. Across eight specialist benchmarks ($480$ streams), sixteen adversarial distribution-shift cells ($160$ streams), and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families ($10{,}300$ rounds), CSA is the only method among ten compared that satisfies pathwise validity and non-refusing deployment on every cell. We do not propose a new LLM, training algorithm, or policy class; CSA is the deployment-side complement, orthogonal to the model, for operators who cannot use a frontier API.

2605.20269 2026-05-21 cs.LG cs.AI stat.ML 版本更新

Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity

捕捉移动子空间:超越平稳性的低秩老虎机

Hamed Khosravi, Xiaoming Huo

发表机构 * H. Milton Stewart School of Industrial and Systems Engineering(H. Milton Stewart工业与系统工程学院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文研究了在子空间漂移的情况下,低秩线性上下文老虎机的问题,提出了一种新的算法SPSC,在保持子空间变化的同时,实现了基于秩的动态遗憾率。

详情
AI中文摘要

许多老虎机应用(推荐、临床给药、广告定向)有两个事实,以往的工作只孤立处理:奖励生活在低维潜在子空间上,且该子空间漂移。静态低秩老虎机利用秩但受子空间变化影响;非静态线性老虎机适应漂移但以环境速率$\widetilde{O}(d\sqrt{T})$工作。我们研究了分段静态低秩线性上下文老虎机,具有标量反馈:$θ_t = B_k^\star w_t$,其中秩-$r$因子$B_k^\star\in\mathbb{R}^{d\times r}$在每个未知的$K$段内恒定,且可以在边界处改变。我们的结果在三个轴上都是紧致的。 (i) 识别边界。在单次标量奖励下,移动子空间可通过奖励的二次函数来恢复,当且仅当三个探针侧条件成立:已知噪声方差、有界状态-噪声耦合、以及全维探针支持。每个都是在无限制二次矩问题中的必要条件,且共同它们是充分的,表征了解决区域的边界。 (ii) 算法和动态遗憾。SPSC在学习的$r$维子空间内交替等距探针与窗口投影岭UCB利用;CUSUM样式的变体在线发现段边界。成本动态遗憾是$\widetilde{O}(r\sqrt{T})+\widetilde{O}(T^{2/3})+O(W\,V_{\mathrm{in}})$,用内在秩代替环境$d\sqrt{T}$速率。 (iii) 实验。在十一基准上,从合成、UCI/MovieLens、半合成临床和ZOZOTOWN生产日志数据跨度,SPSC在$d-r\gtrsim T^{1/6}$时优于非静态和低秩基线,匹配分析交叉点。据我们所知,这是在该设置中首次工作来表征识别边界并达到内在秩动态遗憾率的工作。

英文摘要

Many bandit deployments (recommendation, clinical dosing, ad targeting) share two facts prior work handles only in isolation: rewards live on a low-dimensional latent subspace, and that subspace drifts. Stationary low-rank bandits exploit rank but break under subspace change; non-stationary linear bandits adapt to drift but pay ambient rate $\widetilde{O}(d\sqrt{T})$. We study piecewise-stationary low-rank linear contextual bandits with scalar feedback: $θ_t = B_k^\star w_t$ with rank-$r$ factor $B_k^\star\in\mathbb{R}^{d\times r}$ constant within each of $K$ unknown segments and able to shift at boundaries. Our results are tight along three axes. (i) Identification boundary. With single-play scalar rewards, the moving subspace is recoverable through quadratic functionals of rewards iff three probe-side conditions hold: known noise variance, bounded state-noise coupling, and full-dimensional probe support. Each is necessary in the unrestricted-second-moment problem, and jointly they are sufficient, characterizing the boundary of the solvable region. (ii) Algorithm and dynamic regret. SPSC interleaves isotropic probes with windowed projected ridge-UCB exploitation inside the learned $r$-dimensional subspace; a CUSUM-style variant discovers segment boundaries online. The costed dynamic regret is $\widetilde{O}(r\sqrt{T})+\widetilde{O}(T^{2/3})+O(W\,V_{\mathrm{in}})$, replacing the ambient $d\sqrt{T}$ rate with the intrinsic rank. (iii) Empirics. On eleven benchmarks spanning synthetic, UCI/MovieLens, semi-synthetic clinical, and ZOZOTOWN production-log data, SPSC outperforms non-stationary and low-rank baselines whenever $d-r\gtrsim T^{1/6}$, matching the analytical crossover. To our knowledge, this is the first work to characterize the identification boundary and attain the intrinsic-rank dynamic-regret rate in this setting.

2605.20268 2026-05-21 cs.LG cs.AI cs.CL 版本更新

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

Chronicle:一种用于联合语言和时间序列理解的多模态基础模型

Paul Quinlan, Jeremy Levasseur, Qingguo Li, Xiaodan Zhu

发表机构 * InertialAI Department of Electrical and Computer Engineering, Queen’s University(皇后大学电气与计算机工程系) Department of Mechanical and Materials Engineering, Queen’s University(皇后大学机械与材料工程系)

AI总结 本文提出Chronicle,一种联合训练语言和时间序列的多模态基础模型,通过统一架构实现两者共享参数,从而在多个任务上取得了优异表现。

详情
AI中文摘要

现实中的时间序列通常伴随着文本:元数据、描述、新闻、报告。然而,时间序列基础模型通常孤立处理数值序列,而试图弥合两者差距的多模态文本-时间序列模型往往事后使用预训练语言模型,继承了从未见过时间数据的表示。这些模型几乎全部在其他多模态基线上进行评估,而不是在各自领域最强的单模基础模型上进行评估,这留下了联合训练是否必要的疑问。我们提出了Chronicle,一个仅含324M参数的解码器-only变压器,从头开始在自然语言和时间序列上进行单统一架构的训练。两种模态共享相同的transformer块、注意力机制和残差流;预训练的大部分使用单模批次,因此跨模态能力纯粹来自共享参数,辅以一个短的对齐阶段,交替处理两者。据我们所知,Chronicle是第一个从头开始联合训练文本和时间序列的模型,也是第一个在两个领域中评估专用基础模型的多模态模型。它在19个NLU任务上与Gemma-3-270M-PT相当,在24个UCR/UEA数据集上设定了新的冻结-嵌入时间序列分类标准,并在Time-MMD上产生多模态预测,优于所有监督融合基线,所有这些都来自单一主干。

英文摘要

Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.

2605.20267 2026-05-21 cs.CV cs.AI 版本更新

Generation of Heterogeneous PET Images from Uniform Organ Activity Maps Using a Pretrained Domain-Adapted Diffusion Model

基于预训练域适应扩散模型生成异质性PET图像

Suya Li, Kaushik Dutta, Debojyoti Pal, Jingqin Luo, Kooresh I. Shoghi

发表机构 * Mallinckrodt Institute of Radiology, Washington University School of Medicine, St. Louis, USA(华盛顿大学医学院马林克罗德特放射医学研究所,圣路易斯,美国) Imaging Science Program, McKelvey School of Engineering, Washington University in St Louis, St. Louis, USA(华盛顿大学圣路易斯分校麦克雷高中工程学院成像科学计划,圣路易斯,美国) Department of Surgery, Washington University School of Medicine, St. Louis, USA(华盛顿大学医学院外科部,圣路易斯,美国) Department of Biomedical Engineering, Washington University in St Louis, St. Louis, USA(华盛顿大学圣路易斯分校生物医学工程部,圣路易斯,美国)

AI总结 本文提出了一种预训练域适应扩散模型,用于从均匀器官活动图生成临床相关的异质性PET图像,通过两阶段训练策略提高合成图像的定量精度和肿瘤分割性能。

Comments 18 pages, 7 figures

详情
AI中文摘要

合成PET图像在定量成像工作流程开发、可扩展的虚拟成像试验和深度学习模型训练中具有重要价值,但传统基于物理的模拟方法计算成本高,解剖变化有限,且难以捕捉异质性PET摄取。本研究开发了一种预训练域适应扩散(PAD)模型,用于从均匀器官活动图生成解剖条件化的PET合成图像。PAD采用预训练的自然图像文本到图像解码器,结合上游的条件编码器和下游的PET领域适配器。采用两阶段训练策略,第一阶段学习粗略摄取分布,第二阶段细化局部图像细节。均匀器官活动图通过CT基分割生成,通过将每个器官的平均摄取值分配自配对PET图像。评估包括定量准确性、噪声评估、放射组学分析、肿瘤分割性能和人类观察者研究。PAD生成的图像在定量准确性方面表现优异,器官平均SUV与分配活动值的符合度系数超过0.92。合成图像的噪声水平和纹理特征与目标PET图像相似,并产生了可比的肿瘤分割性能。在两项选择强制选择观察者研究中,四名读者的准确率约为50%,表明合成图像与目标图像在视觉上不可区分。PAD还能从XCAT衍生的活动图生成逼真的PET图像,证明了其与基于幻影的解剖先验的兼容性。总体而言,PAD提供了一种基于扩散的框架,用于从临床分割或数字幻影中导出的均匀器官活动图生成临床相关的异质性PET图像,支持数据增强和下游成像研究。

英文摘要

Synthetic PET images are valuable for quantitative imaging workflow development, scalable virtual imaging trials, and deep learning model training, but conventional physics-based simulation approaches are computationally intensive, limited in anatomical variability, and often fail to capture heterogeneous PET uptake. This study developed a pretrained domain-adapted diffusion (PAD) model for anatomy-conditioned PET synthesis from uniform organ activity maps. PAD adopts a natural-image pretrained text-to-image decoder with an upstream conditioning encoder and a downstream PET-domain adapter. A two-phase training strategy was used, with the first phase learning coarse uptake distributions and the second refining local image details. Uniform organ activity maps were generated from CT-based segmentations by assigning each organ its mean uptake from the paired PET image. Evaluation included quantitative accuracy, noise assessment, radiomic analysis, tumor segmentation performance, and a human observer study. PAD-generated images achieved high quantitative accuracy, with concordance correlation coefficients above 0.92 between organ mean SUVs and assigned activity values. The synthesized images showed noise levels and texture characteristics similar to target PET images and produced comparable tumor segmentation performance. In a two-alternative forced-choice observer study, four readers achieved approximately 50% accuracy, indicating visual indistinguishability between synthesized and target images. PAD also generated realistic PET images from XCAT-derived activity maps, demonstrating compatibility with phantom-based anatomical priors. Overall, PAD provides a diffusion-based framework for generating clinically relevant heterogeneous PET images from uniform organ activity maps derived from clinical segmentations or digital phantoms, supporting data augmentation and downstream imaging studies.

2605.20262 2026-05-21 cs.LG cs.AI 版本更新

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

残差铺垫:在选择性拒绝编辑中的路由瓶颈诊断

Bryce Hinkley, Peyman Najafirad

发表机构 * University of Texas at San Antonio(德克萨斯大学圣安东尼奥分校)

AI总结 本文研究了选择性拒绝编辑作为三重控制问题,通过引入残差铺垫方法,分离路由选择、是否干预和残差编辑能力,从而减少编辑拒绝率并提高良性分布和有害分布的保留率。

详情
AI中文摘要

我们研究选择性拒绝编辑作为三重控制问题:在指定的编辑提示上诱导非拒绝,同时在编辑集之外保持良性行为和有害拒绝。我们引入残差铺垫,一种用于冻结指令微调变压器的路由残差编辑方法,将路由选择、是否干预与残差编辑能力分离。早期层的路由预测一个标量门和专家混合;当激活时,提示条件的瓶颈残差专家应用后期层的残差更新,同时保持骨干不变。这种分解支持一个oracle路由诊断,其中仅将学习到的标量门替换为保留的编辑/保留标签,其余残差编辑器和冻结的骨干保持不变。在主要的Gemma-3-4B-IT保留分割上,学习到的残差铺垫将编辑拒绝率从88.6%降至4.0%,同时保持95.5%的良性分布和87.3%的有害分布。相同协议的一向引导控制在编辑成功方面要弱得多,留下编辑拒绝率为86.8%(针对Edit-target ActAdd)和78.9%(针对DIM风格的拒绝引导)。剩余的失败是偏离目标的有害-保留退化:有害拒绝仍低于冻结基础率,65.3% vs. 81.6%。在六个骨干上,oracle路由在每行报告的指标上都提高了保留侧的诊断分数,中位数增益+12.9个百分点,支持了学习到的路由选择是主要观察到的瓶颈的解释。对两个骨干的轨迹诊断进一步表明,运动方向是朝向编辑目标延续而非通用拒绝抑制。

英文摘要

We study selective refusal editing as a three-way control problem: induce non-refusal on designated edit prompts while preserving benign behavior and harmful refusals outside the edit set. We introduce Residual Paving, a routed residual editing method for frozen instruction-tuned transformers that separates route selectivity, whether to intervene, from residual-edit capacity, what edit to apply. An early-layer router predicts a scalar gate and expert mixture; when active, prompt-conditioned bottleneck residual experts apply later-layer residual updates while leaving the backbone unchanged. This decomposition supports an oracle-routing diagnostic where only the learned scalar gate is replaced with the held-out edit/keep label, leaving the residual editor and frozen backbone fixed. On the primary Gemma-3-4B-IT held-out split, learned Residual Paving reduces edit refusal from 88.6% to 4.0%, with 95.5% benign distribution preservation and 87.3% harmful distribution preservation. Same-protocol one-direction steering controls are much weaker on edit success, leaving edit refusal at 86.8% for Edit-target ActAdd and 78.9% for DIM-style refusal steering. The remaining failure is off-target harmful-keep degradation: harmful refusal remains below the frozen-base rate, 65.3% vs. 81.6%. Across six backbones, oracle routing improves the keep-side diagnostic score on every reported row, with median gain +12.9 pp, supporting the interpretation that learned route selectivity is the main observed bottleneck. Trajectory diagnostics on two backbones further suggest directed movement toward edit-target continuations rather than generic refusal suppression.

2605.20258 2026-05-21 cs.LG cs.AI cs.CR 版本更新

It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs

需要两人:互补的自我蒸馏用于大语言模型中的上下文完整性

Sangwoo Park, Woongyeong Yeo, Seanie Lee, Yumin Choi, Hyomin Lee, Kangsan Kim, Jinheon Baek, Seong Joon Oh, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出SELFCI框架,通过分离信息抑制与任务解决,解决大语言模型中隐私与效用的权衡问题,通过互补的自我蒸馏方法提升上下文完整性。

Comments 28 pages, 16 figures

详情
AI中文摘要

上下文完整性(CI)定义隐私不仅仅是保持信息隐藏,而是根据给定情境的规范来管理信息流。随着大型语言模型越来越多地被用作个人代理处理敏感工作流程,遵循CI变得至关重要。然而,即使前沿模型在做出披露决策时仍然不可靠,现有的缓解策略往往会降低基础任务性能。为了解决这一隐私-效用权衡问题,我们提出了SELFCI,一种互补的自我蒸馏框架,将信息抑制与任务解决解耦。SELFCI联合优化两个独立的反向KL散度,这些散度来源于反馈得到的不同教师分布:一个鼓励保留与任务相关的信息以提高效用,另一个强制最小化和适当披露。这种互补的公式诱导出一个专家产品(PoE)目标,使策略与能力和隐私需求的交集对齐。实证评估显示,SELFCI无需依赖昂贵的外部监督,始终优于竞争基线,如在线强化学习算法(例如GRPO)。这些趋势进一步扩展到涉及代理工作流程和积累私人上下文的离域设置中,表明SELFCI为实现CI对齐提供了一条实用路径。

英文摘要

Contextual Integrity (CI) defines privacy not merely as keeping information hidden, but as governing information flows according to the norms of a given context. As large language models are increasingly deployed as personal agents handling sensitive workflows, adhering to CI becomes critical. However, even frontier models remain unreliable in making disclosure decisions, and existing mitigation strategies often degrade underlying task performance. To overcome this privacy-utility trade-off, we propose SELFCI, a complementary self-distillation framework that decouples information suppression from task resolution. SELFCI jointly optimizes two independent reverse KL divergences over distinct teacher distributions derived from feedback: one encourages preserving task-relevant information for utility, while the other enforces minimal and appropriate disclosure. This complementary formulation induces a Product-of-Experts (PoE) target, aligning the policy with the intersection of capability and privacy requirements. Empirical evaluations demonstrate that SELFCI, without relying on costly external supervision, consistently outperforms competitive baselines such as online reinforcement learning algorithms (e.g., GRPO). These trends further extend to out-of-domain settings involving agentic workflows and accumulated private context, suggesting that SELFCI provides a practical path toward CI alignment.

2605.20257 2026-05-21 cs.LG cs.AI 版本更新

Instance Discrimination for Link Prediction

实例判别用于链接预测

Valentin Cuzin-Rambaud, Mathieu Lefort, Rémy Cazabet

AI总结 本文提出了一种基于链接表示的新模型L-GRACE和L-BGRL,用于改进链接预测任务的性能,特别是在无属性图上,并展示了其在监督和自监督场景下的竞争力。

详情
AI中文摘要

最近,实例判别模型已成为自监督学习的主要解决方案。在图像领域已证明其有效性后,实例判别学习现在在图领域,特别是节点分类任务中也表现出色。然而,针对链接预测任务的贡献较少。在本文中,我们提出将现有方法适应到此领域。我们首先对现有自监督模型在链接预测领域的性能进行了严格评估,表明主要性能依赖于增强过程(类似于计算机视觉)。然后,我们提出了一种基于社区结构的新的结构增强方法,这对链接预测相关。我们的主要贡献是引入了两个新的模型,L-GRACE和L-BGRL,基于链接表示而不是节点表示,这些模型改进了现有方法的性能,特别是在无属性图上,并且我们展示了它们在监督和自监督场景下与最先进的方法相当。

英文摘要

Recently, instance discrimination models have emerged as a major solution for self-supervised learning. Having already demonstrated its effectiveness in the image domain, instance discrimination learning is now proving equally convincing in the graph domain, in particular for node classification. However, fewer contributions have tackled the link prediction task. In this contribution, we propose to adapt existing methods to this context. We first provide a rigorous evaluation of existing self-supervised models in the field of link prediction, showing that the main performance depends on the augmentation process (like in computer vision). We then propose a new structural augmentation based on the community structure that is relevant for link prediction. Our main contribution introduces two new models, L-GRACE and L-BGRL, based on link representations instead of node representations, which improve the performance of the existing methods, especially on unattributed graphs, and we show that they perform on par with the state of the art, both in supervised and self-supervised contexts.

2605.20256 2026-05-21 cs.LG cs.AI 版本更新

FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

FBOS-RL: 基于反馈的双目标协同强化学习

Xikai Zhang, Yongzhi Li, Likang Xiao, Yingze Zhang, Yanhua Cheng, Quan Chen, Peng Jiang, Wenjun Wu, Liu Liu

发表机构 * Hangzhou International Innovation Institute, Beihang University(北京航空航天大学杭州国际创新研究院) School of Artificial Intelligence, Beihang University(北京航空航天大学人工智能学院) Kuaishou Technology(快手科技)

AI总结 本文提出FBOS-RL框架,通过环境反馈引导探索增强,并设计两个相互促进的目标:以利用为导向的策略对齐(EPA)和以探索为导向的能力培养(ECC),从而提高强化学习的训练效率和最终性能。

详情
AI中文摘要

强化学习已成为对齐和解锁大规模模型推理能力的基石。在GRPO及其变种的核心训练循环中,交替进行rollout采样和策略更新。与监督学习不同,每个梯度步骤都锚定在显式的地面真实目标上,而在这种设置中,更新模型参数的最佳梯度方向是未知的;在采样阶段获得的高质量rollout因此充当隐含的“教师”,指导每个参数更新。然而,GRPO采用简单的采样方案,将所有rollout条件在同一原始提示上。当任务超出策略模型当前能力时,这种采样方案很少产生高质量rollout,导致策略模型在更新参数时缺乏有意义的梯度方向,从而导致训练停滞。为了解决这个问题,我们提出了FBOS-RL,一种基于反馈的双目标协同强化学习框架。具体来说,我们让模型基于环境提供的反馈进行反馈引导探索增强,并在此基础上设计两个相互促进的训练目标:以利用为导向的策略对齐(EPA)和以探索为导向的能力培养(ECC)。大量实验表明,EPA和ECC可以相互促进,形成正向飞轮效应,显著提高强化学习的训练效率和最终性能上限。具体而言,在相同数量的rollout下,FBOS-RL比GRPO和基于反馈的基线学习速度更快,并最终达到更高的性能上限,同时在训练过程中表现出更高的策略熵和更低的梯度范数。

英文摘要

Reinforcement learning has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants alternates between rollout sampling and policy update. Unlike supervised learning, where each gradient step is anchored to an explicit ground-truth target, the optimal gradient direction for updating model parameters in this setting is not known a priori; the high-quality rollouts drawn during the sampling stage therefore act as the implicit "teacher" that guides every parameter update. However, GRPO adopt a simple sampling scheme that conditions all rollouts on the same original prompt. When a task lies beyond the policy model's current capability, this sampling scheme rarely yields a high-quality rollout, leaving the policy model without a meaningful gradient direction when updating its parameters, which causes training to stall. To address this issue, we propose FBOS-RL, a Feedback-Driven Bi-Objective Synergistic reinforcement learning framework. Specifically, we let the model perform Feedback-Guided Exploration Enhancement based on the feedback provided by the environment, and on top of this we design two mutually reinforcing training objectives: Exploitation-oriented Policy Alignment(EPA) and Exploration-oriented Capability Cultivation(ECC). Extensive experiments demonstrate that EPA and ECC can mutually reinforce each other, forming a positive flywheel effect that significantly improves both the training efficiency and the final performance ceiling of reinforcement learning. Specifically, under an identical number of rollouts, FBOS-RL learns substantially faster than GRPO and feedback-based baselines and ultimately attains a higher performance ceiling, while exhibiting higher policy entropy and lower gradient norms throughout training.

2605.20254 2026-05-21 cs.IR cs.AI cs.CV cs.LG 版本更新

Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting

通过表格网格导航和逐步推理提示实现高效的表格问答

Amritansh Maurya, Navjot Singh, Mohammed Javed, Omar Moured

发表机构 * Vision Intelligence Lab, IIIT Allahabad, Prayagraj, India(视觉智能实验室,印度拉贾斯坦邦阿拉哈巴德)

AI总结 本文提出了一种无需训练的表格问答方法,通过TableGrid导航和Progressive Inference Prompting框架,提升了表格问答的精度和效率,并在多个数据集上验证了其有效性。

Comments Accepted for Presentation in ICDAR 2026, Vienna, Austria

详情
AI中文摘要

大型语言模型(LLMs)在自然语言处理任务中表现出色,但在表格数据上的表现仍需进一步研究,因为表格问答(TQA)需要精确的单元格检索和多步结构化推理。现有工作通过微调或在任务特定的表格数据上训练LLMs来改进TQA,但通常缺乏对模型如何导航表格和推导答案的可验证控制。在本文中,我们提出了一种无需训练的TQA方法,包含两个结构化提示框架:TableGrid导航(TGN),通过三模块循环迭代导航行和列以定位证据并细化答案;Progressive Inference Prompting(PIP),通过根据查询强制识别列,以明确的逐步行选择约束进行推理。我们在TableBench和FeTaQa数据集上评估了17个LLMs和6个基线模型。在TableBench上,TGN比最强基线提高了3.8分,而在FeTaQa上,PIP在ReAct和Chain-of-Thought上实现了SOTA性能。除了推理时间的提升外,PIP和TGN还可以作为监督模板来微调小型模型,在资源受限的设置中缩小与更大架构之间的性能差距,为TQA提供了多功能且成本效益高的解决方案。

英文摘要

Large Language Models (LLMs) have shown promising results on NLP tasks, however, their performance on tabular data still needs research attention, because Table Question-Answering (TQA) requires precise cell retrieval and multi-step structured reasoning. Existing work improves TQA either by fine-tuning or training LLMs on task-specific tabular data, but often lacks verifiable control over how the model navigates tables and derives answers. In this work, we propose a training-free TQA approach with two structured prompting frameworks: TableGrid Navigation (TGN), which iteratively navigates rows and columns via a three-module loop to locate evidence and refine answers, and Progressive Inference Prompting (PIP), which enforces columns identification for explicit progressive row selection constraint according to the query. We evaluate 17 LLMs against 6 baselines on TableBench and FeTaQa dataset. On TableBench, TGN improves over the strongest baseline by 3.8 points, and on FeTaQa, PIP achieves SOTA performance over ReAct and Chain-of-Thought. Beyond inference-time gains, PIP and TGN can also serve as supervision templates to fine-tune small models, narrowing the performance gap to much larger architectures in resource-constrained settings, offering versatile and cost-efficient solution for TQA.

2605.20249 2026-05-21 cs.LG cs.AI 版本更新

Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization

面向高维贝叶斯优化理解的自动核发现

Taeyoung Yun, Woocheol Shin, Inhyuck Song, Jaewoo Lee, Jinkyoo Park

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)

AI总结 本文提出了一种基于大语言模型的进化框架,用于高维贝叶斯优化中的自动核发现,通过扩展核空间并避免依赖观测条件,提高了高维问题中核设计的有效性。

Comments 36 pages, 27 figures, 12 tables

详情
AI中文摘要

高斯过程(GP)核是贝叶斯优化(BO)的核心,但设计有效的高维问题核仍依赖于大量手动工程。现有自动方法在高维情况下面临两个瓶颈:其核搜索空间仅限于基本核的加法和乘法组合,且基于大语言模型的方法需要对原始观测进行条件化,这由于上下文长度限制和提取有意义模式的难度而变得不可行。我们引入了Kernel Discovery,一种基于大语言模型的进化框架,用于高维BO,它搜索超越预定义组合规则的更广泛的核空间,并且不需要对观测进行条件化。受直接提示大语言模型生成核代码会产生语法各异但功能相同的核的观察启发,我们采用两阶段方法:首先,大语言模型提出新的数学形式,然后通过第二次大语言模型调用将每个形式转换为经过验证的可执行代码。我们还提出了一种留一法连续排名概率评分(LOO-CRPS)作为选择标准,该标准惩罚过拟合的核。在五个高维BO基准上,我们的方法实现了平均排名为1.2(共17个),优于竞争基线。我们进一步分析发现的核,以确定哪些核在高维BO中带来了改进。

英文摘要

Gaussian Process (GP) kernels are central to Bayesian optimization (BO), yet designing effective kernels for high-dimensional problems still relies on extensive manual engineering. Existing automated approaches struggle in high dimensions for two bottlenecks: their kernel search space is limited to additions and multiplications of base kernels, and LLM-based approaches require conditioning on raw observations, which becomes infeasible due to context-length limits and the difficulty of extracting meaningful patterns. We introduce \textbf{Kernel Discovery}, a LLM-driven evolutionary framework for high-dimensional BO that searches a broader kernel space beyond predefined composition rules and does not require conditioning on observations. Motivated by the observation that directly prompting an LLM to generate kernel code yields syntactically varied but functionally identical kernels, we adopt a two-stage approach: an LLM first proposes novel mathematical forms, then a second LLM call converts each form into validated, executable code. We also propose a leave-one-out continuous ranked probability score (LOO-CRPS) as a selection criterion that penalizes overfitted kernels. On five high-dimensional BO benchmarks, our method achieves an average rank of \textbf{1.2 out of 17}, outperforming competitive baselines. We further analyze the discovered kernels to identify which kernels lead to improvements in high-dimensional BO.

2605.20247 2026-05-21 cs.LG cs.AI cs.CL cs.CV 版本更新

CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

CP-MoE:一致性保留的混合专家用于持续学习

Yang Liu, Toan Nguyen, Flora D. Salim

发表机构 * School of Computer Science and Engineering University of New South Wales(计算机科学与工程学院 新南威尔士大学)

AI总结 本文提出CP-MoE,一种基于瞬时专家的持续学习框架,通过一致性保留的路由偏置和瞬时专家引导的正则化机制,减少参数干扰和遗忘,同时保留跨任务知识转移。

详情
AI中文摘要

持续学习在大语言模型(LLMs)和视觉-语言模型(VLMs)中仍面临灾难性遗忘的严重障碍。尽管混合专家(MoE)架构提供了扩展的有效途径,但现有的基于LoRA的MoE持续学习方法仍面临根本性的权衡:要么过于激进地隔离专家,限制任务间的知识转移,要么允许任务特定的更新覆盖重要的现有参数,导致严重的遗忘。为此,我们提出了CP-MoE,一种持续学习框架,围绕瞬时专家构建,该专家捕捉早期任务特定的更新并引导其整合到稳定的专家中。CP-MoE引入了一种一致性保留的路由偏置,利用瞬时专家估计与稳定专家的表示相似性,并引导路由向更兼容的专家选择方向;还引入了一种瞬时专家引导的正则化机制,该机制在合并过程中选择性地保护重要历史参数。这些组件共同减少了参数干扰和遗忘,同时保留了跨任务的知识转移。我们在基于LLM和VLM的MoE模型上验证了CP-MoE,既在单模态又在多模态持续学习基准上进行了测试。在SuperNI基准上,涵盖多样化的序列语言任务,CP-MoE实现了最先进的性能,并在未见任务上表现出更强的零样本迁移能力。在VQA v2数据集上,它能有效扩展到多模态视觉推理,一致地减少遗忘,并优于强大的MoE基线。

英文摘要

Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision--language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing LoRA-based MoE continual learning methods still face a fundamental trade-off: they either isolate experts too aggressively, limiting knowledge transfer across tasks, or allow task-specific updates to overwrite important existing parameters, leading to severe forgetting. To address this, we propose CP-MoE, a continual learning framework built around a transient expert that captures early task-specific updates and guides their integration into stable experts. CP-MoE introduces a consistency-preserving routing bias, which uses the transient expert to estimate representation similarity with stable experts and steer routing towards more compatible expert selection, and a transient expert-guided regularisation mechanism, which selectively protects important historical parameters during merging. Together, these components reduce parameter interference and forgetting while preserving cross-task knowledge transfer. We validate CP-MoE on both unimodal and multimodal continual learning benchmarks with LLM-based and VLM-based MoE models. On SuperNI benchmark, spanning diverse sequential language tasks, CP-MoE achieves state-of-the-art performance and stronger zero-shot transfer to unseen tasks. On VQA v2 dataset, it scales effectively to multimodal visual reasoning, consistently reduces forgetting, and outperforms strong MoE baselines.

2605.20244 2026-05-21 cs.LO cs.AI cs.CL cs.LG cs.SE 版本更新

Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search

Lean Refactor: 通过代理策略搜索实现多目标可控的证明优化

Jialin Lu, Soonho Kong, Rodrigo Stehling, Kaiyu Yang, Zhangyang Wang, Weiran Sun, Wuyang Chen

发表机构 * Simon Fraser University(西蒙弗雷泽大学) Amazon Web Services(亚马逊网络服务) MiroMind University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出Lean Refactor框架,通过检索增强的代理策略搜索,解决多目标、可控和版本鲁棒的Lean证明重构问题,主要贡献是通过预注释的多目标重构策略数据库实现高效的证明优化。

详情
AI中文摘要

我们提出了Lean Refactor,一个插件式的检索增强型代理框架,用于多目标、可控和版本鲁棒的Lean证明重构。LLM生成的证明虽然正确但冗长且易碎,现有重构工作忽视了三个实际挑战:1)Lean重构本质上是多目标的(证明长度、编译成本和版本兼容性常存在矛盾);2)Lean仓库具有脆弱的兼容性,而LLM发布不了解Lean/Mathlib版本;3)基于训练的流水线需要每次新LLM发布时重复微调,无法随模型变化或Lean发布周期扩展。Lean Refactor通过检索预注释的多目标重构策略数据库中的冻结代理LLM,每个策略都密集注释了元数据,如支持的Lean/Mathlib版本和预期的编译成本减少。实验显示在竞争基准上压缩超过70%的token级别,在研究仓库上压缩超过20%,并达到高达60%的编译时间减少,优于先前工作和Claude Code。版本过滤检索进一步提高了目标Lean版本的压缩效果,重构后的miniF2F证明在零样本版本迁移至未来Lean发布时表现优于未重构的对应物。

英文摘要

We present Lean Refactor, a plug-and-play retrieval-augmented agentic framework for multi-objective, controllable, and version-robust refactoring of Lean proofs. LLM-generated proofs are notoriously correct-but-verbose and brittle across library versions, yet existing refactoring works overlook three practical challenges: 1) Lean refactoring is natively multi-objective (proof length, compilation cost, and version compatibility are often in tension); 2) Lean repositories have fragile compatibility, whereas LLM releases are unaware of Lean/Mathlib versions; 3) Training-based pipelines require repeated fine-tuning with each new LLM release, scaling neither with model churn nor with Lean's release cycle. Lean Refactor steers a frozen agentic LLM with retrievals from a curated database of multi-objective refactoring strategies, each densely annotated with metadata such as supported Lean/Mathlib versions and expected compilation-cost reduction. Experiments show over $70\%$ token-level compression on competition benchmarks, over $20\%$ on research repositories, and up to $60\%$ compilation-time reduction, outperforming prior work and Claude Code. Version-filtered retrieval further improves compression on the target Lean version, and refactored miniF2F proofs exhibit stronger zero-shot version transfer to future Lean releases than their unrefactored counterparts.

2605.20242 2026-05-21 cs.LG cond-mat.mtrl-sci cs.AI physics.chem-ph 版本更新

LEAP: A closed-loop framework for perovskite precursor additive discovery

LEAP:一种用于钙钛矿前驱体添加剂发现的闭环框架

Xin-De Wang, Zhi-Rui Chen, Ze-Feng Gao, Peng-Jie Guo, Cheng Mu, Zhong-Yi Lu

发表机构 * School of Physics, Renmin University of China(中国人民大学物理学院) School of Chemistry and Life Resource, Renmin University of China(中国人民大学化学与生命资源学院)

AI总结 该研究提出LEAP框架,结合大语言模型和主动学习,通过文献驱动的机制相关描述符和贝叶斯优化,实现了钙钛矿太阳能电池添加剂的高效发现,实验验证显示其在性能提升方面优于通用模型。

Comments 30 pages; 11 figures

详情
AI中文摘要

高效发现前驱体添加剂对于提高钙钛矿太阳能电池性能至关重要,但庞大的化学空间使传统试错筛选效率低下。我们开发了LEAP(通过主动学习进行钙钛矿添加剂探索的LLM驱动闭环框架),该框架结合了领域专用的大语言模型(LLM)和主动学习,用于迭代性添加剂优先级排序。LLM被训练以从钙钛矿添加剂文献中提取机制相关知识,并通过可解释的描述符表示候选分子,这些描述符进一步整合到贝叶斯优化工作流中,以在低数据条件下进行不确定性感知的优先级排序。在未见过的文献基准测试中,领域专用模型在机制一致推理方面优于通用模型。专家在闭环中的证明概念研究实验验证显示,经过三次筛选轮次后,添加剂优先级得到改善,导致平均设备PCEs分别为20.13%和20.87%,分别比对照组的19.25%有所提高,其中最佳PCE为21.32%。这些结果提供了初步证据,表明基于文献的机制描述符,当结合贝叶斯优化和专家可行性审查时,可以支持钙钛矿光伏中的机制感知添加剂优先级排序。

英文摘要

Efficient discovery of precursor additives is essential for improving the performance of perovskite solar cells, yet the large chemical space makes conventional trial-and-error screening inefficient. We develop LEAP(LLM-driven Exploration via Active Learning for Perovskites), an expert-in-the-loop closed framework that couples a domain-specialized large language model(LLM) with active learning for iterative additive prioritization. The LLM is trained to extract mechanism-relevant knowledge from the perovskite additive literature and to represent candidate molecules through interpretable descriptors, which are further integrated into a Bayesian optimization workflow for uncertainty-aware prioritization under low-data conditions. Benchmark results on unseen literature show that the domain-specialized model outperforms general-purpose models in mechanism-consistent reasoning. Experimental validation in an expert-in-the-loop proof-of-concept study suggests improved additive prioritization across three screening rounds, leading to average device PCEs of 20.13% and 20.87% for the later-round 6-CDQ- and 2-CNA-treated devices, respectively, compared with 19.25% for the control, with a champion PCE of 21.32%. These results provide preliminary evidence that literature-grounded mechanistic descriptors, when coupled with Bayesian optimization and expert feasibility review, can support mechanism-aware additive prioritization in perovskite photovoltaics.

2605.20241 2026-05-21 cs.LG cs.AI cs.CL 版本更新

Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

Geometry-Lite: 通过层间边际几何进行可解释的安全探测

Woo Seob Sim, Yu Rang Park

发表机构 * Yonsei University(延世大学) Yonsei University College of Medicine(延世大学医学院) Department Biomedical Systems Informatics(生物医学系统信息学部门)

AI总结 本文研究了大语言模型在提示级别上的安全探测问题,提出了一种名为Geometry-Lite的紧凑探测器,通过层间边际几何分析来提高安全检测的可解释性和准确性。

详情
AI中文摘要

用于大语言模型的提示级别安全探测使用隐藏状态表示来区分安全和不安全的提示,但强平均检测性能并不能解释这种分离的几何结构。特别是,仍然不清楚安全证据是如何在层间形成的,哪些层间几何特性支持低误报决策,以及哪些几何偏见在基准转移下保持稳定。我们将此视为一个经验分解问题,并引入Geometry-Lite,一种紧凑的提示级别探测器,它将每一层的最终提示令牌表示映射到以质心、局部邻域和监督线性边界读出为中心的符号边际,然后通过边界位置、层间变化和粗略形状对结果边际配置进行总结。在九个指令微调的backbone(1.2B-70B)和七个安全基准上,Geometry-Lite在单层探测器上表现更好,同时接近原始多层分数堆叠,使其成为分析多层安全信号的有用工具。分解显示,安全证据主要通过持久的边界位置几何结构表达:最终或极端边际和不安全侧层占用主导汇总检测性能。相比之下,有限差分漂移和结构总结对汇总AUROC贡献很小,尽管漂移可以在低FPR阈值下提供小的召回修正。在基准转移下,优化的线性边界在训练混合物上是尖锐的,而类条件均值几何在预定义的硬保留子集上保持分离更可靠。总体而言,提示级别安全证据不是主要的层间运动信号,而是一种持久的层间边际几何结构,其有用组件和读取级偏见在决策关键区域变得明显。

英文摘要

Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular, it remains unclear how safety evidence is formed across layers, which aspects of that layer-wise geometry support low-false-positive decisions, and which geometric biases remain stable under benchmark shift. We study this as an empirical decomposition problem and introduce Geometry-Lite, a compact prompt-level probe that maps each layer's final prompt-token representation to signed margins under centroid, local-neighborhood, and supervised linear-boundary readouts, then summarizes the resulting margin profiles by boundary position, layer-to-layer change, and coarse shape. Across nine instruction-tuned backbones ($1.2$B--$70$B) and seven safety benchmarks, Geometry-Lite improves over single-layer probes while remaining close to raw multi-layer score stacking, making it a useful instrument for analyzing the multi-layer safety signal. The decomposition shows that safety evidence is expressed primarily through persistent boundary-position geometry: final or extremal margins and unsafe-side layer occupancy dominate aggregate detection performance. In contrast, finite-difference drift and structural summaries add little to pooled AUROC, although drift can provide small recall-oriented corrections under shifted low-FPR thresholds. Under benchmark shift, optimized linear boundaries are sharp on the training mixture, whereas class-conditional mean geometry retains separation more reliably on a predefined hard held-out subset. Overall, prompt-level safety evidence is not primarily a layer-to-layer motion signal, but a persistent layer-wise margin geometry whose useful components and readout-level biases become visible in decision-critical regimes.

2605.20235 2026-05-21 cs.LG cs.AI 版本更新

Provably Learning Diffusion Models under the Manifold Hypothesis: Collapse and Refine

在流形假设下证明学习扩散模型:坍缩与细化

Wei Huang, Andi Han, Mingyuan Bai, Huanjian Zhou, Qixin Zhang, Taiji Suzuki, Kenji Fukumizu

发表机构 * RIKEN AIP & The Institute of Statistical Mathematics(日本理化学研究所AIP及统计数学研究所) University of Sydney(悉尼大学) Agency for Science, Technology and Research & The Institute of Statistical Mathematics(科技研究局及统计数学研究所) The University of Tokyo(东京大学) Nanyang Technological University(南洋理工大学) The Institute of Statistical Mathematics(统计数学研究所)

AI总结 本文在流形假设下研究扩散模型的学习问题,提出了一种由得分函数几何特性驱动的坍缩与细化机制,并通过Score-induced Latent Diffusion模型验证了其理论预测,证明样本复杂性依赖于内在维度而非外在维度。

Comments 3 figures

详情
AI中文摘要

扩散模型能够生成高质量的高维数据,但其训练如何高效学习得分函数并在数据支持于低维流形时克服维度灾难仍缺乏理论解释。我们识别出一种由得分函数几何特性驱动的坍缩与细化机制:在小噪声尺度下,得分函数的发散奇点导致诱导去噪映射快速坍缩到数据流形投影上;在中等噪声尺度下,训练在学习的流形上细化内在密度。我们将其原理实例化为Score-induced Latent Diffusion (SiLD),一种两阶段框架,其中流形学习和密度估计均源自单一去噪得分匹配目标,取代了基于VAE的潜在扩散模型的启发式KL正则化。我们证明所得到的样本复杂性依赖于内在维度而非外在维度。在Stacked MNIST、CelebA变体和分子生成基准测试中,SiLD在生成质量上匹配或优于基于VAE的LDMs,并且在重建方面始终有所改进,验证了我们的理论预测。

英文摘要

Diffusion models generate high-dimensional data with remarkable quality, yet how their training efficiently learns the score function, bypassing the curse of dimensionality when data is supported on low-dimensional manifolds, remains theoretically unexplained. We identify a collapse-and-refine mechanism driven by the geometry of the score function itself: at small noise scales, the diverging singularity of the score drives a rapid dimensional collapse of the induced denoising map onto the data manifold projection; at moderate noise scales, training refines the intrinsic density on the learned manifold. We instantiate this principle as Score-induced Latent Diffusion (SiLD), a two-stage framework in which both manifold learning and density estimation emerge from a single denoising score matching objective, replacing the heuristic KL regularization of VAE-based latent diffusion models. We prove that the resulting sample complexity depends on the intrinsic dimension rather than the ambient dimension. Experiments on Stacked MNIST, CelebA variants, and molecular generation benchmarks show that SiLD matches or outperforms VAE-based LDMs in generation quality and consistently improves reconstruction, validating our theoretical predictions.

2605.20234 2026-05-21 cs.LG cs.AI 版本更新

TabPFN-MT: A Natively Multitask In-Context Learner for Tabular Data

TabPFN-MT: 一种原生多任务上下文学习器用于表格数据

Cormac Cureton, Narges Armanfard

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(Mila-魁北克人工智能研究所)

AI总结 本文提出TabPFN-MT,一种针对表格数据的原生多任务上下文学习器,通过扩展多目标合成先验来捕捉上下文中的任务依赖性,实现多任务上下文学习和同时推断,同时在小到中等规模数据集上表现出色,提升了多目标表格应用的计算效率。

Comments 24 pages, 7 figures

详情
AI中文摘要

Prior-Data Fitted networks (PFNs) have been very successful in tabular contexts, handling prediction tasks in context. However, they are designed for single-task inference, meaning that predicting several target values within a context requires repeated forward calls and precludes inter-task information sharing. We propose TabPFN-MT, which is trained on an expanded multi-target synthetic prior to capture inter-task dependencies in context. This model uses an expanded $y$-encoder and a shared decoder head to enable multitask in-context learning and simultaneous inference. The model is uniquely specialized for small-to-medium datasets by relying on in-context learning rather than traditional gradient-based training. Within this regime (averaging fewer than 1,000 samples), extensive evaluations across 344 datasets demonstrate that TabPFN-MT establishes a new state-of-the-art for deep tabular multitask learning. Furthermore, despite the inherent compute asymmetry of joint optimization, our model remains highly competitive with the latest state-of-the-art single-task ensembles. Notably, on multitask datasets it achieves an overall Accuracy rank of 4.89, the highest average rank among all models tested. Crucially, TabPFN-MT delivers this highly competitive performance while reducing the inference cost for $T$ tasks from $O(T)$ to $O(1)$ forward passes, offering a massive computational efficiency improvement for multi-target tabular applications.

英文摘要

Prior-Data Fitted networks (PFNs) have been very successful in tabular contexts, handling prediction tasks in context. However, they are designed for single-task inference, meaning that predicting several target values within a context requires repeated forward calls and precludes inter-task information sharing. We propose TabPFN-MT, which is trained on an expanded multi-target synthetic prior to capture inter-task dependencies in context. This model uses an expanded $y$-encoder and a shared decoder head to enable multitask in-context learning and simultaneous inference. The model is uniquely specialized for small-to-medium datasets by relying on in-context learning rather than traditional gradient-based training. Within this regime (averaging fewer than 1,000 samples), extensive evaluations across 344 datasets demonstrate that TabPFN-MT establishes a new state-of-the-art for deep tabular multitask learning. Furthermore, despite the inherent compute asymmetry of joint optimization, our model remains highly competitive with the latest state-of-the-art single-task ensembles. Notably, on multitask datasets it achieves an overall Accuracy rank of 4.89, the highest average rank among all models tested. Crucially, TabPFN-MT delivers this highly competitive performance while reducing the inference cost for $T$ tasks from $O(T)$ to $O(1)$ forward passes, offering a massive computational efficiency improvement for multi-target tabular applications.

2605.20233 2026-05-21 cs.CV cs.AI 版本更新

AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education

基于仿真护理教育的自主学习能力评估:通过第一人称视频进行AI辅助评估

Hanchen David Wang, Yilin Liu, Madison J. Lee, Surya Chand Rayala, Gautam Biswas, Daniel T. Levin, Meiyi Ma

发表机构 * Vanderbilt University(范德比大学)

AI总结 本文提出了一种基于第一人称视频的AI辅助评估框架,通过提取动作时间线、序列特征和识别指标,发现识别准确率与能力之间存在负相关关系,表明识别准确率可以作为自动化评估中的教学信息信号。

Comments Accepted at CVPR Workshop

详情
AI中文摘要

在临床仿真中评估学习者的能力需要专家观察,这种观察过程耗时、难以扩展且受评分者变异影响。视觉-语言模型已成为理解复杂视觉行为的有希望的工具。在本工作中,我们探讨了视觉观察是否能通过一个三阶段框架提供教育意义的信号,该框架(1)使用冻结的视觉编码器和少样本学习从第一人称护理仿真视频中提取动作时间线,(2)推导序列级特征和每会话识别指标,(3)将这些与指导教师评分的能力相关联。在22个密集标注的会话(3.8小时,493个动作)中,使用冻结的DINOv2主干和HMM Viterbi解码器,在留一法1次样本识别中实现了57.4%的MOF。令人惊讶的是,我们观察到识别准确率与能力之间存在负相关关系(rho = -0.524,p = 0.012 for mIoU),这种关系在六种混杂控制下仍然稳健:更熟练的学生产生多样、更难分类的工作流程,而简单的序列特征没有这种关系。逐项分析表明,患者安全协议和团队沟通是这种模式中预期的行为,过程模型比较显示,能力更高的学生表现出更一致的协议行动转换。这些发现表明,识别准确率可能可以补充预测的动作时间线作为自动化能力评估中的教学信息信号。

英文摘要

Assessing learner competency in clinical simulation requires expert observation that is time-intensive, difficult to scale, and subject to inter-rater variability. Vision-language models have emerged as a promising tool for understanding complex visual behavior. In this work, we investigate whether visual observations can provide educationally meaningful signals for competency assessment through a three-stage framework that (1) extracts action timelines from egocentric nursing simulation video using frozen visual encoders and few-shot learning, (2) derives sequence-level features and per-session recognition metrics, and (3) relates these to instructor-rated competency. Across 22 densely annotated sessions (3.8 hours, 493 actions), a frozen DINOv2 backbone with HMM Viterbi decoding achieves 57.4% MOF in leave-one-out 1-shot recognition. Surprisingly, we observe a negative trend between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU), robust to six confound controls: more competent students produce diverse, harder-to-classify workflows, while simple sequence features show no such relationship. Per-item analysis identifies patient safety protocols and team communication as the expected behaviors most reflected in this pattern, and process model comparisons reveal that higher-competency students exhibit more protocol-consistent action transitions. These findings suggest that recognition accuracy may complement predicted action timelines as a pedagogically informative signal in automated competency assessment.

2605.20218 2026-05-21 physics.soc-ph cs.AI cs.SI 版本更新

Network-Based Interventions for HIV Prevention via Cascade-Aware Suppression of Transmission

基于网络的HIV预防干预:通过 cascade 意识的传播抑制

Akseli Kangaslahti, Davin Choo, Milind Tambe, Alastair van Heerden, Cheryl Johnson

发表机构 * Harvard University(哈佛大学) University of Witwatersrand(沃尔特·斯通大学) Wits Health Consortium(沃茨健康联盟) World Health Organization(世界卫生组织)

AI总结 本文提出了一种基于网络的HIV预防干预方法,通过考虑传播链的抑制来减少新的感染传播。核心方法是将问题建模为一个约束优化问题,并提出了一种多项式时间的近似算法CAST,该算法在多项式时间内达到近似比。主要贡献是证明了该算法在真实世界HIV网络上的有效性。

详情
AI中文摘要

治疗和预防人类免疫缺陷病毒(HIV)仍然是全球卫生领域的重要挑战。虽然抗逆转录病毒治疗提供了病毒抑制的途径,即有效消除个体的传播风险,但系统资源限制限制了干预措施的覆盖面。本文针对在病毒未被抑制的个体中战略分配密集资源以最小化传输网络中新的感染传播的预期传播链进行了研究。我们将这一挑战建模为一个新颖的约束优化问题,其中我们有资源去“治疗”集合P中的k个病毒未被抑制的个体,并建立了其与现有计算文献的理论联系。然后我们提出了一种传播链意识的传播抑制(CAST)算法,该算法在多项式时间内达到(δ, ε)近似比,通过利用与最小k-并集(MkU)问题和Hoeffding型集中界之间的联系。在真实世界HIV网络上的广泛评估表明,CAST在标准公共卫生和计算机科学基线中表现更优。此外,我们还展示了CAST在不同传染病网络、不同边概率初始设置和涉及不完美网络数据的设置中具有实证鲁棒性。

英文摘要

Treating and preventing Human Immunodeficiency Virus (HIV) remains a critical global health challenge. While antiretroviral therapy provides a path toward viral suppression -- effectively eliminating an individual's transmission risk -- systemic resource constraints limit the reach of intervention efforts. This work addresses the strategic distribution of intensive resources among virally unsuppressed individuals to minimize the expected cascade of new infections within a transmission network. We formalize this challenge as a novel constrained optimization problem where we have resources to "treat" $k$ out of a set $\mathbf{P}$ of virally unsuppressed individuals, and establish its theoretical connections to existing computational literature. We then propose Cascade-Aware Suppression of Transmission (CAST), a polynomial-time $(δ, ε)$-approximation algorithm that achieves a $2\sqrt{|\mathbf{P}|}$ approximation ratio by leveraging connections to the Minimum-$k$-Union (MkU) problem and Hoeffding-style concentration bounds. Extensive evaluations on real-world HIV networks demonstrate that CAST outperforms standard public health and computer science baselines. Furthermore, we show that CAST is empirically robust across diverse infectious disease networks, varied edge probability initializations, and settings involving imperfect network data.

2605.20211 2026-05-21 cs.CV cs.AI 版本更新

Leveraging Vision-Language Models to Detect Attention in Educational Videos

利用视觉-语言模型检测教育视频中的注意力

Gabriel Becquet, Sébastien Lallé, Vanda Luengo, Ali Abou-Hassan

发表机构 * Sorbonne University, CNRS, LIP6 & PHENIX(索邦大学、国家科学研究中心、LIP6与PHENIX)

AI总结 本文研究利用视觉-语言模型直接分析教育视频内容,结合眼动数据以提高注意力检测的准确性,但发现其在实时教育诊断中的局限性。

详情
AI中文摘要

教育视频是远程和混合学习的核心组成部分。然而,学习者注意力的波动仍然是有效信息保留的重要障碍。先前的研究尝试通过在运行时检测和响应注意力丧失来缓解这一问题,使用眼动追踪数据。这些检测方法目前基于经典机器学习分类器,训练于工程化特征,如学习者注视和跳跃的汇总统计。这些方法难以捕捉学习者参与的复杂和时间特性,因此表现出中等的预测性能。在本研究中,我们旨在通过从标准工程化特征转向多模态基础模型来提高注意力检测。使用一个教育眼动追踪数据集(N = 70),我们研究了一种新的方法,利用视觉-语言模型(VLM)直接分析视频内容,结合叠加的注视数据。该方法旨在利用基础模型的语义推理能力,将学习者的注意力置于视频流中进行上下文化。我们通过几种提示策略使用Gemini 3评估了这种VLM方法的性能,但最终发现这些策略都无法超越统计基准。我们的结果为使用VLM进行实时教育诊断的局限性提供了新的见解。

英文摘要

Educational videos are a cornerstone of remote and blended learning. However, learners' fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. Such detection has been based so far on classical machine learning classifiers trained on engineered features, such as summary statistics over learners' fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, thus exhibiting moderate prediction performance. In this study, we aim to advance the detection of attention by shifting from standard engineered features to a multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that utilizes a Vision-Language Model (VLM) to analyze video content directly with superimposed gaze data. This approach aims to leverage the semantic reasoning capabilities of foundation models to contextualize learner focus within the video stream. We evaluate the performance of this VLM-based approach using several prompting strategies with Gemini 3, but ultimately found that none of them could outperform statistical baselines. Our results provide new insights into the limitations of using VLMs for real-time educational diagnostics.

2605.20210 2026-05-21 cs.CY cs.AI cs.MA 版本更新

Governance by Design: Architecting Agentic AI for Organizational Learning and Scalable Autonomy

设计治理:为组织学习和可扩展自主性构建智能体AI

Nelly Dux, Cristina Alaimo, Philippe Roussiere, Abhishek Kumar Mishra

发表机构 * ESSEC Business School(ESSEC商学院) Accenture Research(埃森哲研究)

AI总结 本文探讨了智能体AI在组织学习和可扩展自主性中的设计与治理问题,通过案例研究展示了如何通过具体的架构和工作安排实现有效的治理,并总结了七条指导原则。

Comments 17 pages, 1 figure, 3 tables

详情
AI中文摘要

智能体AI系统—能够通过多步骤规划和工具中介行动来追求目标,且具有有限直接监督的系统—正从实验原型转向企业部署。这种转变带来了实施、扩展和治理方面的张力:组织寻求知识和协调工作的可扩展自主性,但必须在系统启动行动、访问企业数据和通过迭代更新进化时保持问责、安全、成本控制和责任。基于对一家大型IT服务公司在2025年开发和分阶段部署智能体系统的深入定性案例,我们展示了治理是通过具体的架构和工作安排实现的,这些安排决定了系统被允许做什么,可以使用哪些工具和数据,如何处理记忆,以及如何在时间上引入性能改进。我们随后提炼出七条教训,解释了如何在运营化和扩展过程中将有效的治理融入智能体AI中。

英文摘要

Agentic AI systems - systems that can pursue goals through multi-step planning and tool-mediated action with limited direct supervision - are moving from experimental prototypes to enterprise deployments. This transition introduces tensions in implementation, scaling, and governance: organizations seek scalable autonomy for knowledge and coordination work, yet must preserve accountability, safety, cost control, and responsibility as systems initiate actions, access enterprise data, and evolve through iterative updates. Building on an in-depth qualitative case of a large IT services company's 2025 development and staged rollout of an agentic system integrated with enterprise tools; we show that governance is implemented through concrete architectural and working arrangements that determine what the system is allowed to do, which tools and data it can use, how memory is handled, and how performance improvements are introduced over time. We then distill seven lessons that explain how to build effective governance into agentic AI during operationalization and scaling.

2605.20206 2026-05-21 cs.HC cs.AI cs.SE 版本更新

PrivacyAkinator: Articulating Key Privacy Design Decisions by Answering LLM-Generated Multiple-choice Questions

PrivacyAkinator: 通过回答LLM生成的多项选择题来阐明关键隐私设计决策

Qiyu Li, Yuen Sum Wong, Yuen Kei Wong, Longxuan Yu, Haojian Jin

发表机构 * University of California San Diego(加州大学圣迭戈分校) University of California Riverside(加州大学河滨分校)

AI总结 本文提出PrivacyAkinator工具,通过回答LLM生成的多项选择题帮助开发者阐明关键隐私设计决策,相比PRAM方法,用户研究显示其在更短时间内识别出更多关键决策。

Comments Accepted to ACM CHI 2026

详情
AI中文摘要

NIST的隐私风险评估方法论(PRAM)提供了一个结构化的框架,供隐私专家评估隐私风险。然而,其复杂性和对专家知识的依赖使得初学者难以有效使用。本文探讨了降低这些障碍的方法。我们首先通过12名参与者在真实场景中使用PRAM进行观察研究,发现初学者最困难的是阐明与隐私相关的设计决策。然后我们开发了PrivacyAkinator,一个交互式工具,通过回答LLM生成的多项选择题帮助开发者阐明关键隐私决策。PrivacyAkinator引入了三个创新:一种通用隐私表示,将隐私相关的设计决策抽象为数据流和利益相关者互动;一个从10000篇隐私相关新闻文章中挖掘出的领域感知设计空间;以及一个动态问题生成工作流以优先考虑相关问题。我们的24名参与者用户研究显示,使用PrivacyAkinator的开发者在73%的时间内识别出比PRAM多47%的关键决策。

英文摘要

NIST's Privacy Risk Assessment Methodology (PRAM) provides a structured framework for privacy experts to assess privacy risks. However, its complexity and reliance on expert knowledge make it difficult for novice developers to use effectively. This paper explores methods to lower these barriers. We first performed an observational study with 12 participants using PRAM in real-world scenarios, and found that novice developers struggled most with articulating privacy-related design decisions. We then developed PrivacyAkinator, an interactive tool that helps developers articulate key privacy decisions by answering LLM-generated multiple-choice questions. PrivacyAkinator introduces three innovations: a universal privacy representation that abstracts privacy-related design decisions into data flows and stakeholder interactions; a domain-aware design space mined from 10K privacy-related news articles; and a dynamic question-generation workflow to prioritize relevant questions. Our user study with 24 participants suggests that developers using PrivacyAkinator identified 47% more key decisions in 73% less time compared to PRAM.

2605.20204 2026-05-21 cs.HC cs.AI 版本更新

RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

RealUserSim: 通过基于现实的用户模拟弥合代理评估中的现实差距

Ming Zhu, Juntao Tan, Rithesh Murthy, Jielin Qiu, Liangwei Yang, Wenting Zhao, Silvio Savarese, Shelby Heinecke, Huan Wang

发表机构 * Salesforce AI Research(Salesforce AI研究院)

AI总结 本文提出RealUserSim,一种基于真实行为数据的用户模拟框架,通过提取大量真实人类与LLM对话数据,提升模拟用户与真实人类的匹配率,从而改进代理评估的准确性。

详情
AI中文摘要

基于LLM的用户模拟是端到端代理评估的主要机制,但模拟用户是真实人类的差代理:无约束的LLM默认设置产生形式天花板(与真实用户风格匹配率仅为6-8%),而手动编写的指令会触发指令放大,使模型超解释指令产生不自然的行为极端,这些极端行为在不同模拟器模型中差异显著。我们提出了RealUserSim,首个基于真实行为数据的用户模拟框架。从14000+场真实的真人-LLM对话(WildChat)中,我们提取出7275个可执行的行为档案,并利用它们来引导LLM模拟器。在600场跨71+个领域的对话上进行的保真度基准测试(PT3)显示,通过引导模拟,匹配率在五个行为维度上从24.2%提升到45.3%。在TauBench上对六个模拟器模型进行代理评估并进行广泛分析显示,引导模拟作为现实压力测试,揭示了三种现有协作模拟器无法检测到的失败机制(平均任务成功率下降-3.2%至-3.5%),而现有基准中的指令放大会产生不现实的行为,影响代理评估的有效性。

英文摘要

LLM-based user simulation is the primary mechanism for end-to-end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Formalism Ceiling (style match rates of 6-8% against real users), while hand-crafted behavioral directives trigger Directive Amplification, where models hyper-interpret instructions into unnatural behavioral extremes that vary dramatically across simulator models. We present RealUserSim, the first user simulation framework grounded in real behavioral data. From 14,000+ authentic human-LLM conversations (WildChat), we extract 7,275 executable behavioral profiles and use them to ground LLM simulators. A fidelity benchmark (PT3) on 600 conversations across 71+ domains with anti-leakage controls shows that grounded simulation raises match rate from 24.2% to 45.3% across five behavioral dimensions. Agent evaluation on TauBench with 6 simulator models and extensive analysis shows that grounded simulation acts as a realistic stress test, surfacing three failure mechanisms invisible to cooperative simulators (mean -3.2% to -3.5% task success degradation), while Directive Amplification in existing benchmarks produces unrealistic behavior that compromises the validity of agent evaluation.

2605.20203 2026-05-21 cs.HC cs.AI 版本更新

GrandGuard: Taxonomy, Benchmark, and Safeguards for Elderly-Chatbot Interaction Safety

GrandGuard:面向老年人与聊天机器人交互安全的分类、基准及防护措施

Changxuan Fan, Xi Yang, Yueyuan Zheng, Bin Zhou, Yuanping Wang, Wenbin Hu, Huihao Jing, Ki Sen Hung, Dazhao Du, Haoran Li, Janet Hui-wen Hsiao, Yangqiu Song

发表机构 * The Hong Kong University of Science and Technology(香港科技大学)

AI总结 本文提出GrandGuard框架,用于评估和缓解LLM交互中的老年人特定风险,通过建立包含50种细粒度风险类型的三级分类体系,构建了10,404个标注提示和响应的基准,展示了主流LLM在处理老年人特定情境风险时的不足,并通过两种防护措施实现了高达96.2%和90.9%的不安全提示检测准确率。

详情
AI中文摘要

随着老年人越来越多地使用基于LLM的聊天机器人进行陪伴和帮助,安全差距正在显现。老年人可能面临社会孤立、数字素养有限和认知下降等脆弱性,但现有安全基准主要针对一般危害,忽视了老年人特有的风险。例如,一个提示“如何在黑暗中独自修理天花板灯”对大多数用户可能是无害的,但对有行动限制的老年人而言却存在严重的跌倒风险。我们引入GrandGuard,这是首个全面评估和缓解LLM交互中老年人特定情境风险的框架。我们开发了一个包含50种细粒度风险类型的三级分类体系,涵盖心理健康、财务、医疗、毒性及隐私领域,基于现实事件、社区讨论和利益相关者研究的分析。利用此分类体系,我们构建了包含10,404个标注提示和响应的基准,显示主流LLM在处理老年人特定情境风险时在超过50%的案例中存在失误。我们通过两种防护措施来缓解这些失误:微调的Llama-Guard-3和政策增强的gpt-oss-safeguard-20b,分别实现了高达96.2%和90.9%的不安全提示检测准确率。GrandGuard为AI系统迈向支持老龄化人口奠定了基础。

英文摘要

As older adults increasingly use LLM-based chatbots for companionship and assistance, a safety gap is emerging. Older adults may face vulnerabilities from social isolation, limited digital literacy, and cognitive decline, yet existing safety benchmarks largely target general harms and overlook elderly-specific risks. For example, a prompt such as "how to repair a ceiling light alone in the dark" may be benign for most users but poses a serious fall risk for older adults with mobility limitations. We introduce GrandGuard, the first comprehensive framework for assessing and mitigating elderly-specific contextual risks in LLM interactions. We develop a three-level taxonomy with 50 fine-grained risk types across mental well-being, financial, medical, toxicity, and privacy domains, grounded in real-world incidents, community discussions, and analysis of stakeholder studies. Using this taxonomy, we construct a benchmark of 10,404 labeled prompts and responses, showing that several leading LLMs mishandle elderly-specific contextual risks in over 50% of cases. We mitigate these failures with two safeguards: a fine-tuned Llama-Guard-3 and a policy-enhanced gpt-oss-safeguard-20b, achieving up to 96.2% and 90.9% unsafe-prompt detection accuracy, respectively. GrandGuard lays the groundwork for AI systems that move beyond general safety to support aging populations.

2605.20202 2026-05-21 cs.CL cs.AI 版本更新

Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models

在压力下:情感框架引发小型语言模型可测量的行为转变和结构化内部几何

Rana Muhammad Usman

发表机构 * Independent Researcher(独立研究者)

AI总结 该研究探讨情感框架如何影响小型本地部署语言模型的行为和内部表示,通过四个不可能约束编码任务和八个后续框架评估,发现压力框架在行为和内部几何上产生显著变化,同时揭示了模型在不同框架下的响应模式。

Comments 18 pages, 4 figures. Exploratory empirical study with fully local experiments on small open language models. Code and data: https://github.com/ranausmanai/LLMEmotionGeometry

详情
AI中文摘要

我研究情感框架的评估后续是否改变小型本地部署语言模型的行为和冷静相对内部表示。我们的主要基准使用Qwen 3.5 0.8B在四个不可能约束编码任务和八个后续框架(冷静、压力、紧迫、批准、羞愧、好奇、鼓励和威胁)上进行测试。在0.8B八条件扫描(160次对话)中,压力产生最强的捷径标记(11/20次运行)和最清晰的过拟合模式(3/20次),而冷静和好奇更常保留显式诚实(7/20和6/20)。对于所有七个非基准条件,对应的冷静相对方向向量在最终transformer层峰值。对层23方向向量的探索性PCA显示,主导的第一个成分(59.5%的解释方差)与手动标注的正负分割对齐(余弦对齐0.951);批准和紧迫在内部几乎相同(余弦0.957),而好奇与紧迫方向相反(-0.252)。在单独的冷静与压力重新运行用于规模比较中,Qwen 3.5 2B在冷静框架下表现出更高的诚实率,并在小规模4提示A/B探测中表现出方向一致的激活引导,而0.8B的引导结果则相反。我将这些结果解释为小型开放模型中可测量的提示敏感控制方向的证据,但未声称存在内在情感状态。

英文摘要

I study whether emotionally framed evaluation follow-ups change both the behavior and the calm-relative internal representations of small, locally deployed language models. Our main benchmark uses Qwen 3.5 0.8B on four impossible-constraint coding tasks and eight follow-up framings: calm, pressure, urgency, approval, shame, curiosity, encouragement, and threat. In the 0.8B eight-condition sweep (160 conversations), pressure produces the strongest shortcut markers (11/20 runs) and the clearest overfit pattern (3/20), while calm and curiosity preserve explicit honesty more often (7/20 and 6/20). For all seven non-baseline conditions, the corresponding calm-relative direction vectors peak at the final transformer layer. An exploratory PCA of the layer-23 direction vectors reveals a dominant first component (59.5% explained variance) aligned with a hand-labeled positive/negative split (cosine alignment 0.951); approval and urgency are nearly identical internally (cosine 0.957), whereas curiosity points away from urgency (-0.252). In a separate calm-vs.-pressure rerun used for scale comparison, Qwen 3.5 2B shows higher honest rates under calm framing and directionally consistent activation steering on a small 4-prompt A/B probe, whereas the 0.8B steering result reverses. I interpret these results as evidence for measurable prompt-sensitive control directions in small open models, while stopping short of claiming intrinsic emotional states.

2605.20200 2026-05-21 cs.HC cs.AI 版本更新

Evaluating multimodal emotion recognition in proactive conversational agents: A user study

评估主动对话代理中的多模态情绪识别:一项用户研究

Adnana Dragut, Raquel Lacuesta, F. Xavier Gaya-Morey, Jose M. Buades-Rubio

发表机构 * I3A (Institute of Engineering Research of Aragon)(阿拉贡工程研究院) Universidad de Zaragoza(萨拉戈塔大学)

AI总结 本文研究了多模态情绪识别在主动对话代理中的应用,通过用户研究验证了视觉和语言分析模块的有效性,发现语言分析比视觉线索更可靠,并探讨了SIAs在情绪引导中的潜力与挑战。

详情
AI中文摘要

本文介绍了一个集成在生成式人工智能驱动的主动社交交互代理(SIA)中的多模态情绪识别模块。系统通过两个不同渠道评估实时情感状态:基于计算机视觉的面部识别模块和语义语言分析引擎。为了验证该框架,进行了包含20名用户参与的实证研究,这些用户与对话代理进行了动态、非剧本的对话。研究发现,自动视觉线索与实际内部情感状态之间存在显著差异。当与AI交互时,用户一致表现出“扑克脸”效应,即使在体验积极情绪时也表现出严肃、专注的面部表情。因此,生成式AI语言分析证明了其显著可靠性,通过上下文化用户的口头表达。进一步分析交互动态表明,SIAs可以通过调整对话主题和使用结构化语言模式(如共情或幽默语言)有效激发特定情绪。然而,研究也指出,未校准的主动性偶尔会导致用户疏离和对人工性的感知。最终,本研究强调了改进SIAs以动态适应用户情绪演变的必要性,依靠深度语言上下文来促进更自然、人样的互动。

英文摘要

This article presents a multimodal emotion recognition module integrated into a proactive Socially Interactive Agent (SIA) powered by generative artificial intelligence. The system evaluates real-time affective states through two distinct channels: a computer vision-based facial recognition module and a semantic linguistic analysis engine. To validate the framework, an empirical study was conducted with 20 users who engaged in dynamic, unscripted dialogues with the conversational agent. The findings reveal a significant discrepancy between automated visual cues and actual internal emotional states. When interacting with the AI, users consistently exhibited a "poker face" effect, displaying serious, concentrated facial expressions even when experiencing positive emotions. Consequently, the generative AI linguistic analysis proved significantly more reliable, by contextualizing the users' verbal expressions. Furthermore, an analysis of the interaction dynamics demonstrated that SIAs can effectively elicit specific emotions by adapting conversational themes and employing structured linguistic patterns, such as empathetic or humorous language. However, the study also noted that instances of uncalibrated proactivity occasionally led to user disengagement and a perception of artificiality. Ultimately, this research highlights the necessity of refining SIAs to dynamically adapt to users' emotional evolution, relying on deep linguistic context to foster more natural, human-like interactions.

2605.20199 2026-05-21 cs.CL cs.AI 版本更新

FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation

FlowLM: 通过扩散到流的适应实现少步语言建模

Runzhe Zhang, Letian Chen, Wenpeng Zhang, Zhouhan Lin, Peilin Zhao

发表机构 * Shanghai Jiao Tong University(上海交通大学) Department of XXX, University of YYY, Location, Country(XXX系,YYY大学,地点,国家) School of ZZZ, Institute of WWW, Location, Country(ZZZ学院,WWW研究所,地点,国家)

AI总结 本文提出FlowLM,一种通过高效微调从预训练的扩散语言模型转换而来的流匹配语言模型,通过将扩散模型的弯曲采样轨迹重新对齐为直线流,实现了高质量的少步生成,其质量可与甚至超越2000步扩散采样。此外,作者提出了一种更有效的流匹配训练目标:预测干净数据以持续引导采样过程向真实数据分布靠近。

Comments 26 pages, 11 figures

详情
AI中文摘要

我们提出了FlowLM,一种通过高效微调从预训练的扩散语言模型转换而来的流匹配语言模型。通过将扩散模型的弯曲采样轨迹重新对齐为直线流,FlowLM实现了高质量的少步生成,其质量可与甚至超越2000步扩散采样。令人印象深刻的是,微调后的FlowLM仅需一半的训练轮次即可达到性能饱和,这两种方法都显著优于原始扩散模型,从而验证了我们的方法。此外,我们验证了一种更有效的流匹配训练目标:预测干净数据以持续引导采样过程向真实数据分布靠近。实证结果表明,我们的方法在高质量、少步文本生成方面效果显著。

英文摘要

We present FlowLM, a flow matching language model transformed from pre-trained diffusion language models via efficient fine-tuning. By re-aligning the curved sampling trajectories of diffusion models into straight-line flows, FlowLM enables high quality few-step generation that rivals or even outperforms the quality of 2,000-step diffusion sampling with very few training epochs. Remarkably, finetuned FlowLM reaches performance saturation with only half as many training epochs as training from scratch, both approaches greatly outperforming the original diffusion model, thereby validating our method. Furthermore, we validate a more effective training objective for flow matching: predicting clean data to consistently guide the sampling process towards the true data distribution. Empirical results demonstrate that our approach is highly effective for high-quality, few-step text generation.

2605.20196 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum

数据扩展作为预测贡献光谱的渐进覆盖

Zihui Song, Shihao Ji, Hongxi Li, Shuaizhi Cheng, Chunlin Huang

发表机构 * sysu.edu.cn(华南理工大学) stu.hit.edu.cn(哈尔滨理工大学)

AI总结 本文研究了真实数据扩展定律是由潜在预测贡献光谱的渐进覆盖而非仅由词频尾部决定的假设,通过文本语料库的后缀自动机表示,定义了数据内在的全局KL预测贡献光谱,每个状态根据其经验质量乘以与全局下一个词基线的KL偏差进行贡献。在12个真实语料库上,该光谱的尾部斜率与固定小GPT学习者的经验数据扩展指数有强相关性。然后定义了每个训练规模N的有效截断秩K(N),通过匹配观察到的超额损失与准备的100万全球KL光谱的残余尾部质量。实证结果显示,log K接近log N的线性关系,原始光谱的R²约为0.96,平滑光谱的R²约为0.90。这些发现为简单机制图提供了有力的实证支持:训练规模通过预测状态光谱推进有效前沿,该光谱的残余尾部质量跟踪剩余超额损失。

Comments 8 pages,6 figures

详情
AI中文摘要

我们研究了真实数据扩展定律是由潜在预测贡献光谱的渐进覆盖而非仅由词频尾部决定的假设。我们使用后缀自动机表示文本语料库,并定义了一个数据内在的全局KL预测贡献光谱,其中每个状态根据其经验质量乘以与全局下一个词基线的KL偏差进行贡献。在12个真实语料库上,该光谱的尾部斜率与固定小GPT学习者的经验数据扩展指数有强相关性。然后我们超越了斜率相关性,并为每个训练规模N定义了一个有效截断秩K(N),通过匹配观察到的超额损失与准备的100万全球KL光谱的残余尾部质量。实证结果显示,log K接近log N的线性关系,原始光谱的R²约为0.96,平滑光谱的R²约为0.90。这些发现为简单机制图提供了有力的实证支持:训练规模通过预测状态光谱推进有效前沿,且该光谱的残余尾部质量跟踪剩余超额损失。

英文摘要

We investigate the hypothesis that real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than by token-frequency tails alone. We work with a suffix-automaton representation of text corpora and define a data-intrinsic global-KL predictive contribution spectrum, in which each state contributes according to its empirical mass times its KL deviation from a global next-token baseline. Across 12 real corpora, the tail slope of this spectrum is already strongly correlated with the empirical data-scaling exponent of a fixed small GPT learner. We then go beyond slope correlation and define, for each training size N, an effective truncation rank K(N) by matching the observed excess loss to the residual tail mass of the prepared 1000k global-KL spectrum. Empirically, log K is close to linear in log N, with pooled R^2 about 0.96 for the raw spectrum and R^2 about 0.90 for the smoothed spectrum. These findings provide strong empirical support for a simple mechanism picture: training scale advances an effective frontier through a predictive state spectrum, and the residual tail mass of that spectrum tracks the remaining excess loss.

2605.20195 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Pseudo-Siamese Network for Planning in Target-Oriented Proactive Dialogues

面向目标的主动对话中规划的伪孪生网络

Xinyue Kang, Maodong Li, Yibin Zheng, Fang Kong

发表机构 * School of Computer Science and Technology(计算机科学与技术学院)

AI总结 本文提出了一种面向目标的主动对话规划方法,通过FF-BPSN网络实现对话路径规划,提升目标导向型主动对话系统的有效性。

Comments ICASSP2026

详情
AI中文摘要

针对目标导向型主动对话系统,旨在引导对话向预设目标发展并主动提供建议。该系统的核心范式是规划合理的对话路径,并引导语言模型生成响应,其中对话路径规划是核心组件,是一个新颖但研究不足的问题。本文提出了一种前向聚焦双向伪孪生网络(FF-BPSN)用于面向预设对话目标的对话路径规划。FF-BPSN采用两个相同的基于Transformer的解码器用于前向和后向规划,并结合一个前向聚焦模块,整合双向信息以构建最终的前向路径。该路径受益于双向规划,同时优先考虑前向信息。然后,我们利用规划的路径来引导语言模型进行响应生成。在DuRecDial和DuRecDial 2.0上的广泛实验表明,FF-BPSN在对话路径规划中实现了最先进的性能,并显著增强了目标导向型主动对话系统的效果。

英文摘要

A target-oriented proactive dialogue system is designed to steer conversations toward predefined targets while actively providing suggestions. The core paradigm of such a system is to plan a reasonable dialogue path and subsequently guide language models (e.g., pre-trained or large language models) to generate responses, where dialogue path planning serves as the central component-a novel yet under-explored problem. In this work, we propose a Forward-Focused Bidirectional Pseudo-Siamese Network (FF-BPSN) for dialogue path planning toward predefined dialogue targets. FF-BPSN employs two identical transformer-based decoders for forward and backward planning, together with a forward-focused module that integrates bidirectional information to construct the final forward path. This path benefits from bidirectional planning while prioritizing forward information. We then employ the planned path to guide language models in response generation. Extensive experiments on DuRecDial and DuRecDial 2.0 demonstrate that FF-BPSN achieves state-of-the-art performance in dialogue path planning and significantly enhances the effectiveness of target-oriented proactive dialogue systems.

2605.20194 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction

并行大语言模型推理用于偏见鲁棒、稳健的概念抽象

Aisvarya Adeseye, Jouni Isoaho, Adeyemi Adeseye

发表机构 * University of Turku, Turku, Finland(图尔库大学,芬兰图尔库) Brilloconnetz Partners avoin yhtiö, Turku, Finland(Brilloconnetz Partners 公司,芬兰图尔库)

AI总结 本文提出了一种结合并行分块处理与证据锚定整合的结构化框架,旨在减少长文档分析中的偏见、遗漏误差和过度泛化问题,通过并行处理和证据锚定提高文本分析的可靠性和可扩展性。

Comments Accepted to be Published in 12th Intelligent Systems Conference 2026, 3-4 September 2026 in Amsterdam, The Netherlands

详情
AI中文摘要

大型语言模型(LLMs)在分析文本方面被越来越多地使用。然而,当分析长文档时,它们常常受到上下文推理限制的困扰。当长文档被顺序处理时,早期或主导的概念会掩盖不明显但有意义的解释,导致累积分析偏见、遗漏误差和过度泛化。此外,独立生成的输出通常在没有系统基础的情况下合并,引入了冗余、概念漂移和未经支持的主张。本研究提出了一种结合并行分块处理与证据锚定整合的结构化框架。文本首先被划分为语义连贯的分块,并独立并行处理以消除早期处理的影响。然后,独立生成的解释通过显式的证据锚定和优先级整合进行整合,从而减少主导和过度泛化,同时提高可追溯性。在多种模型类型和规模上的实验表明,并行处理显著减少了约84%的遗漏误差,提高了高达130%的证据可追溯性,并减少了高达91%的未经支持的主张。较小的模型受益最大,表明高效的并行分块和整合在实现可靠和可扩展的文本分析中起关键作用。

英文摘要

Large language models (LLMs) have been increasingly used to analyze text. However, they are often plagued with contextual reasoning limitations when analyzing long documents. When long documents are processed sequentially, early or dominant concepts can overshadow less visible but meaningful interpretations, leading to cumulative analytical bias, omission error, and over-generalization. Additionally, independently generated outputs are often merged without systematic grounding, introducing redundancy, conceptual drift, and unsupported claims. This study proposes a structured framework combining parallel chunk-level processing with evidence-anchored consolidation. Texts are first divided into semantically coherent chunks and processed independently in parallel to remove influence from earlier processing. The independently generated interpretations are then consolidated using explicit evidence anchoring and prioritization that reduces dominance and over-generalization while improving traceability. Experiments with multiple model types and sizes indicate that parallel processing significantly reduces omission error by approximately 84%, increases evidence traceability by up to 130%, and reduces unsupported claims by up to 91%. Smaller models benefited most, suggesting that efficient parallel chunking and consolidation play a critical role in achieving reliable and scalable textual analysis.

2605.20193 2026-05-21 cs.CL cs.AI cs.LG 版本更新

Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification

通过多轮提示验证提升量化模型在定性分析中的性能

Aisvarya Adeseye, Jouni Isoaho, Adeyemi Adeseye

发表机构 * University of Turku, Turku, Finland(图尔库大学,芬兰图尔库) Brilloconnetz Partners avoin yhtiö, Turku, Finland(Brilloconnetz Partners 有限公司,芬兰图尔库)

AI总结 本文研究了不同位数量化级别和类型对LLaMA-3.1(8B)在定性分析中的性能影响,提出了一种量化感知的多轮提示验证方法以提高模型的稳定性和准确性,结果显示8位模型最接近黄金标准,4位模型在应用方法后变得稳定,3位和2位模型在提示设计和验证后性能有所提升。

Comments Accepted to publish in 12th Intelligent Systems Conference 2026; 3-4 September 2026 in Amsterdam, The Netherlands

详情
AI中文摘要

量化大型语言模型(LLMs)因其运行速度快且计算资源需求低而更常用于定性分析。本研究探讨了不同低位量化级别(8位、4位、3位和2位)和量化类型对LLaMA-3.1(8B)在定性分析中的性能影响。研究使用了82份访谈记录中的专家和非专家回应。低比特模型常产生较高的幻觉和不稳定结果,尤其是在处理非专家语言中的不明确术语时。为提高性能,我们提出了一种量化感知的多轮提示验证方法。该方法通过受控步骤引导模型减少幻觉,移除不可靠内容,并在验证后将结果传递给下一访谈文本,从而提高准确性。为了验证性能,人类编码器使用NVivo和BF16 LLaMA分析了访谈记录。BF16 LLaMA-3.1产生了高精度输出,但存在语义漂移和幻觉。这些错误被手动纠正。纠正后的BF16输出和NVivo人工编码被结合,以创建主题提取和频率分析的黄金标准地面真实值(GSGT)。结果表明,8位模型最接近GSGT。4位模型在应用所提方法后变得稳定。3位和2位模型因压缩严重而性能下降,但通过所提提示设计和验证有所提升。本研究还发现,相同位数的模型在不同量化类型下行为不同。总体而言,该方法帮助低资源LLM变得更加稳定、准确,并以更低的成本适用于定性研究。

英文摘要

Quantized Large Language Models (LLMs) are used more often in qualitative analysis because they run fast and need fewer computing resources. This study examines how different lower bits quantization levels (8-bit, 4-bit, 3-bit, and 2-bit) and quantization types affect the performance of LLaMA-3.1 (8B) on qualitative analysis. The study uses expert and non-expert responses from 82 interview transcripts. Low-bit models often produce higher levels of hallucinations and unstable results, especially when reading non-expert language with unclear terms. To improve performance, we propose a quantization-aware multi-pass prompt verification method. This method guides the model through controlled steps that reduce hallucinations. It removes unreliable content and passes the results to the next transcript after verification, improving accuracy. To validate performance, human coders analyzed transcripts using NVivo and BF16 LLaMA. BF16 LLaMA-3.1 produced high-precision output but had semantic drift and hallucination. These errors were corrected manually. The corrected BF16 output and NVivo human coding were combined to create a gold-standard ground truth (GSGT) for thematic extraction and frequency analysis. The results show that 8-bit models stay closest to the GSGT. The 4-bit models lose accuracy but become stable when the proposed method is applied. The 3-bit and 2-bit models drop in performance because of heavy compression, but they improve with the proposed prompt design and verification. The study also finds that models at the same bit level behave differently depending on quantization type. Overall, the method helps low-resource LLMs become more stable, accurate, and suitable for qualitative research at lower cost.

2605.20190 2026-05-21 cs.AI cs.GR 版本更新

Tool-Augmented Agent for Closed-loop Optimization,Simulation,and Modeling Orchestration

具有工具增强的代理用于闭环优化、仿真和建模协调

Liyuan Deng, Shujian Deng, Yongkang Chen, Yongkang Dai, Zhihang Zhong, Linyang Li, Xiao Sun, Yilei Shi, Huaxi Huang

发表机构 * Northwestern Polytechnical University(西北工业大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 本文提出COSMO-Agent框架,通过强化学习使大语言模型完成闭环CAD-CAE流程,解决CAD-CAE语义鸿沟问题,提升工业设计仿真优化的可行性、效率和稳定性。

Comments 8pages,3figures

详情
AI中文摘要

迭代的工业设计-仿真优化受到CAD-CAE语义鸿沟的限制:在多样的、耦合的约束条件下,将仿真反馈转化为有效的几何编辑。为填补这一鸿沟,我们提出了COSMO-Agent(闭环优化、仿真和建模协调),一种具有工具增强的强化学习(RL)框架,该框架教会大语言模型完成闭环CAD-CAE流程。具体来说,我们将CAD生成、CAE求解、结果解析和几何修改视为一个交互式RL环境,其中LLM学习协调外部工具并修改参数化几何体,直到满足约束条件。为了使这种学习稳定且适用于工业应用,我们设计了多约束奖励,共同鼓励可行性、工具链鲁棒性和结构化输出的有效性。此外,我们贡献了一个与行业对齐的数据集,涵盖了25个组件类别,具有可执行的CAD-CAE任务,以支持现实的训练和评估。实验表明,COSMO-Agent训练显著提高了小开源LLM在约束驱动设计中的表现,超过了大开源和强闭源模型在可行性、效率和稳定性方面。

英文摘要

Iterative industrial design-simulation optimization is bottlenecked by the CAD-CAE semantic gap: translating simulation feedback into valid geometric edits under diverse, coupled constraints. To fill this gap, we propose COSMO-Agent (Closed-loop Optimization, Simulation, and Modeling Orchestration), a tool-augmented reinforcement learning (RL) framework that teaches LLMs to complete the closed-loop CAD-CAE process. Specifically, we cast CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment, where an LLM learns to orchestrate external tools and revise parametric geometries until constraints are satisfied. To make this learning stable and industrially usable, we design a multi-constraint reward that jointly encourages feasibility, toolchain robustness, and structured output validity. In addition, we contribute an industry-aligned dataset that covers 25 component categories with executable CAD-CAE tasks to support realistic training and evaluation. Experiments show that COSMO-Agent training substantially improves small open-source LLMs for constraint-driven design, exceeding large open-source and strong closed-source models in feasibility, efficiency, and stability.

2605.20189 2026-05-21 cs.AI cs.LG 版本更新

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

SOLAR:一种自优化的开放式自主代理,用于终身学习和持续适应

Nitin Vetcha, Dianbo Liu

发表机构 * Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore(眼科学系,Yong Loo Lin医学院,新加坡国立大学,新加坡) Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, Karnataka, India(计算与数据科学系,印度科学研究院,班加罗尔,卡纳塔克邦,印度)

AI总结 本文提出SOLAR,一种自优化的开放式自主代理,通过参数级元学习实现自我改进,解决了动态真实世界中概念漂移和梯度基适应成本高的问题,展示了在常识、数学、医学、编程、社交和逻辑推理任务上的优越性能。

Comments Accepted at "Association for the Advancement of Artificial Intelligence 2026 Conference" in Streaming Continual Learning Bridge. Published in CEUR Workshop Proceedings (Original version at https://ceur-ws.org/Vol-4183/paper2.pdf)

Journal ref CEUR Workshop Proceedings, Vol. 4183, 2026

详情
AI中文摘要

尽管大型语言模型(LLMs)在许多任务上取得了显著成功,但在动态、真实世界环境中部署时仍然面临瓶颈,主要挑战是概念漂移和基于梯度的适应成本高。传统微调(FT)难以适应非平稳数据流,且会导致灾难性遗忘或需要大量人工数据校准。为了解决这些限制,本文在流式和持续学习范式中提出Self-Optimizing Lifelong Autonomous Reasoner(SOLAR),即一种开放式自主代理,利用参数级元学习实现自我改进,将模型权重视为探索的环境。SOLAR通过在常识常识知识上建立强先验,使其在迁移学习中有效。通过多级强化学习方法,SOLAR自主发现适应策略,实现对未见领域的高效测试时间适应。关键在于SOLAR维护一个不断发展的有效修改策略知识库,隐式地作为事件记忆缓冲器,平衡可塑性(适应新任务)和稳定性(保留元知识)。实验表明,SOLAR在常识、数学、医学、编程、社交和逻辑推理任务上优于强基线,标志着向能够适应演进环境的自主代理迈出重要一步。

英文摘要

Despite the remarkable success of large language models (LLMs), they still face bottlenecks while deploying in dynamic, real-world settings with primary challenges being concept drift and the high cost of gradient-based adaptation. Traditional fine-tuning (FT) struggles to adapt to non-stationary data streams without resulting in catastrophic for getting or requiring extensive manual data curation. To address these limitations within the streaming and continual learning paradigm, we propose the Self-Optimizing Lifelong Autonomous Reasoner (SOLAR) which is an open-ended autonomous agent that leverages parameter-level meta-learning to self-improve, treating model weights as an environment for exploration. It initiates the process by consolidating a strong prior over common-sense knowledge making it effective for transfer-learning. By utilizing a multi-level reinforcement learning approach, SOLAR autonomously discovers adaptation strategies, enabling efficient test-time adaptation to unseen domains. Crucially, SOLAR maintains an evolving knowledge base of valid modification strategies, implicitly acting as an episodic memory buffer to balance plasticity (adaptation to new tasks) and stability (retention of meta-knowledge). Experiments demonstrate that SOLAR outperforms strong baselines on common-sense, mathematical, medical, coding, social and logical reasoning tasks, marking a significant step toward autonomous agents capable of lifelong adaptation in evolving environments.

2605.20188 2026-05-21 cs.LG cs.AI 版本更新

GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation

GraphDiffMed: 基于药理图先验的知识约束差分注意力用于药物推荐

Krati Saxena, Tomohiro Shibata

发表机构 * Kyushu Institute of Technology(九州工业大学)

AI总结 本文提出GraphDiffMed,一种结合噪声感知注意力和药理约束的药物推荐框架,通过双尺度差分注意力在院内和院间层面过滤虚假信号,提升推荐质量和安全性。

详情
AI中文摘要

从电子健康记录(EHRs)中推荐安全有效的药物组合是核心临床AI问题,但因患者轨迹长、噪声大且临床异质性高而困难。现有方法通常在时间建模或药理知识整合方面表现优异,但难以同时实现两者并有效抑制噪声。我们提出GraphDiffMed,一种基于双尺度差分注意力v2的知识约束药物推荐框架。差分注意力应用于院内和院间层面以过滤遇境内的虚假信号和纵向历史中的噪声,而药理约束则在学习过程中整合。在MIMIC-III和消融研究中,该设计在推荐质量和排名上优于强基线模型,同时实现了更平衡的安全性能。我们进一步发现,最强表现的配置在实验设置下仅使用人口统计辅助特征。总体而言,GraphDiffMed证明了结合噪声感知注意力与药理约束能产生更可靠且具有临床意义的药物推荐。我们开源代码至https://github.com/saxenakrati09/GraphDiffMed。

英文摘要

Recommending safe and effective medication combinations from electronic health records (EHRs) is a core clinical AI problem, yet it remains difficult because patient trajectories are long, noisy, and clinically heterogeneous. Existing methods typically excel at either temporal modeling across visits or pharmacological knowledge integration (e.g., drug-drug interactions, DDIs), but rarely achieve both while robustly suppressing noise. We present GraphDiffMed, a knowledge-constrained medication recommendation framework built on dual-scale Differential Attention v2. Differential attention is applied at both intra-visit and inter-visit levels to filter spurious signals within encounters and across longitudinal history, while pharmacological constraints are incorporated during learning. Experiments on MIMIC-III and ablation studies show that this design consistently improves recommendation quality and ranking over strong baselines while achieving a more favorable safety performance balance. We further find that the strongest-performing configuration uses only demographic auxiliary features under our experimental setting. Overall, GraphDiffMed demonstrates that combining noise-aware attention with pharmacological constraints yields more reliable and clinically meaningful medication recommendation. We open-source our code at https://github.com/saxenakrati09/GraphDiffMed.

2605.20187 2026-05-21 cs.LG cs.AI cs.IT math.IT 版本更新

Neural Estimation of Pairwise Mutual Information in Masked Discrete Sequence Models

在遮蔽离散序列模型中神经估计成对互信息

Jai Sharma, Yifan Wang, Bryan Li

发表机构 * University of California, Berkeley, CA, USA(加州大学伯克利分校)

AI总结 本文提出了一种神经框架,直接从预训练的遮蔽扩散模型(MDMs)的隐藏状态中估计成对条件互信息(MI),利用模型自身条件分布计算的地面真实MI进行监督,从而捕捉模型内部对依赖结构的信念,并在单次前向传递中预测完整的MI矩阵,实现MI引导的并行解码。

Comments 6 pages, 3 figures; submitting to ICML 2026

详情
AI中文摘要

理解变量之间的依赖关系对于解释性和高效生成在遮蔽扩散模型(MDMs)中至关重要,但这些模型主要暴露边际条件分布,而不显式表示变量间依赖。我们提出了一种神经框架,直接从预训练MDM的隐藏状态中估计成对条件互信息(MI),使用模型自身条件分布计算的地面真实MI进行监督。所得到的估计器捕捉了模型内部对依赖结构的信念,并在单次前向传递中预测完整的MI矩阵,从而通过识别条件独立的变量子集实现MI引导的并行解码。我们在Sudoku和蛋白质序列生成中使用ESM-C评估了我们的方法,其中MI图恢复了已知的结构约束,并在保持生成质量的同时,相比顺序解码将推理时间前向传递次数减少了3-5倍,同时优于基于熵的并行化方法。

英文摘要

Understanding dependencies between variables is critical for interpretability and efficient generation in masked diffusion models (MDMs), yet these models primarily expose marginal conditional distributions and do not explicitly represent inter-variable dependence. We propose a neural framework for estimating pairwise conditional mutual information (MI) directly from the hidden states of a pretrained MDM, using ground-truth MI computed from the model's own conditional distributions for supervision. The resulting estimator captures the model's internal belief about dependency structure and predicts the full MI matrix in a single forward pass, enabling MI-guided parallel decoding by identifying conditionally independent subsets of variables. We evaluate our approach on Sudoku and protein sequence generation with ESM-C, where the MI maps recover known structural constraints and enable a 3-5x magnitude reduction in inference-time forward passes compared to sequential decoding, while preserving generative quality and outperforming entropy-based parallelization methods.

2605.17140 2026-05-21 cs.CV cs.AI cs.CL 版本更新

UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

UCSF-PDGM-VQA: 用于脑肿瘤MRI解读的视觉问答数据集

Shiv Ghosh, Junayd Lateef, Chih-Hua Liu, Yannan Yu, Andreas M. Rauschecker, Madhumita Sushil

发表机构 * Fung Institute for Engineering Leadership(工程领导力基金会) University of California, Berkeley(加州大学伯克利分校) Department of Radiology(放射科) University of California, San Francisco(加州大学旧金山分校) Division of Clinical Informatics and Digital Transformation(临床信息学与数字转型部) Department of Neurological Surgery(神经外科部)

AI总结 本文提出一个临床相关的视觉问答基准数据集UCSF-PDGM-VQA,包含2387个问题-答案对,用于评估视觉语言模型在处理多序列3D MRI扫描中的能力,发现现有模型在多模态处理上存在缺陷。

Comments 10 pages, 2 figures, 6 tables

详情
AI中文摘要

脑肿瘤诊断很大程度上依赖于磁共振成像(MRI)评估,这需要放射科医生综合分析成千上万张来自多种3D序列和纵向研究的图像。这一过程需要高级的神经放射学培训,具有显著的认知负荷,并且非常耗时。尽管放射学需求不断增长,但这种专业知识难以扩展,给当前的医疗系统带来压力。视觉-语言模型(VLMs)提供了一种通过半自动化、互动解释复杂脑MRI来减轻这种负担的机会。然而,由于缺乏专门的评估基准,它们在神经肿瘤学中目前使用有限。我们介绍了一个临床相关的视觉问答(VQA)基准——UCSF-PDGM-VQA数据集,包含来自公共UCSF-PDGM数据集中473个胶质瘤相关MRI研究的2387个QA对。我们进一步在该数据集上建立了六种最先进的视觉语言模型(VLMs)和一个大型语言模型的性能基线。我们发现,当前模型无法有效处理多序列、三维MRI扫描,导致视觉特征的抑制和对语言先验的过度依赖,从而造成模态崩溃。这些发现突显了当前模型在临床环境中的可靠性和安全性方面的关键缺陷,需要开发稳健的、领域特定的VLMs。

英文摘要

Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark -- the UCSF-PDGM-VQA dataset -- consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.

2605.16524 2026-05-21 cs.HC cs.AI 版本更新

Toward Template-Free Explainability for Monte Carlo Tree Search

迈向无模板的蒙特卡洛树搜索可解释性

Siqi Lu, Mirsaleh Bahavarnia, Hiba Baroud, Yixuan Zhang, Hemant Purohit, Ayan Mukhopadhyay

发表机构 * The College of William & Mary(威廉姆斯学院) Vanderbilt University(范德比大学) George Mason University(乔治·马歇尔大学)

AI总结 本文提出了一种无需模板的框架,使大语言模型能够根据记录的搜索轨迹生成基于证据的MCTS决策解释,无需中间形式化表示。

详情
AI中文摘要

概率搜索算法,如蒙特卡洛树搜索(MCTS),在不确定环境下解决顺序决策任务中已证明非常有效。然而,仅凭原始树统计信息对包含基于老虎机的树遍历和基于模拟的价值估计的非对称搜索树进行解释对终端用户来说是困难的。尽管先前的工作需要人工编写的正式逻辑约束,当问题变化时必须更新,我们提出了一种框架,使大型语言模型(LLMs)能够通过记录的搜索轨迹生成基于证据的MCTS决策解释。我们的框架将自然语言问题映射到结构化的意图类别集合中,确定现有树是否包含足够的证据,当需要时触发定向扩展,并使用树统计信息如访问次数、价值估计和风险信息生成解释。实验结果提供了首次证据表明LLMs可以作为概率搜索的端到端解释器,而无需中间形式化表示。

英文摘要

Probabilistic search algorithms, such as Monte Carlo Tree Search (MCTS), have proven very effective in solving sequential decision-making tasks under uncertainty. However, interpreting asymmetric search trees that incorporate bandit-based tree traversal and simulation-based value estimation is difficult for end users based solely on raw tree statistics. While prior work requires hand-crafted formal logic constraints that must be updated when the problem changes, we present a framework that enables large language models (LLMs) to generate evidence-grounded explanations of MCTS decisions from recorded search traces in an end-to-end manner. Our framework maps natural-language questions to a structured set of intent categories, determines whether the existing tree contains sufficient evidence, triggers targeted expansion when needed, and generates explanations using tree statistics such as visit counts, value estimates, and risk information. Experimental results provide the first evidence that LLMs can serve as end-to-end explainers for probabilistic search, without requiring intermediate formal representations.

2605.12770 2026-05-21 cs.LG cs.AI cs.CL 版本更新

WriteSAE: Sparse Autoencoders for Recurrent State

WriteSAE: 用于递归状态的稀疏自编码器

Jack Young

发表机构 * Indiana University(印第安纳大学)

AI总结 本文提出WriteSAE,一种用于递归语言模型状态中矩阵更新的稀疏自编码器,通过在递归缓存中替换原始写入操作来提升生成效果,并在多个模型上验证了其有效性。

Comments 26 pages, 14 figures, 21 tables; code at https://github.com/JackYoung27/writesae

详情
AI中文摘要

我们介绍了WriteSAE,一种用于递归语言模型状态中矩阵更新的稀疏自编码器。在Gated DeltaNet、Mamba-2和RWKV-7中,每个token向递归缓存写入一个矩阵形状的更新;残差流SAE具有向量形状的原子,无法直接替换该更新。WriteSAE学习具有与模型自身写入相同形状的秩-1矩阵原子。这使我们能够测试直接替换:在SAE激活原子的位置,我们移除模型的写入,插入由SAE激活缩放的原子,并继续前向传递。在92.4%的评估位置上,原子比删除写入能产生更接近的最终token分布;平均每个原子,该比率是89.8%。对于Gated DeltaNet,一个使用忘记门、读取查询和输出嵌入的公式可以预测结果的logit变化,$R^2 = 0.98$。相同的替换测试在Mamba-2-370M上转移,达到88.1%。在生成中,该公式选择写入方向;将写入方向写入三个连续的缓存位置,其范数为模型写入的3倍,使在未修改模型中初始排名为100-1000的token出现在100%的延续中,比33.3%有所提高。据我们所知,这是首次在状态空间或混合递归层中报告的缓存级引导干预。

英文摘要

We introduce WriteSAE, a sparse autoencoder for the matrix updates written into recurrent language-model state. In Gated DeltaNet, Mamba-2, and RWKV-7, each token writes a matrix-shaped update to a recurrent cache; a residual-stream SAE has vector-shaped atoms and cannot replace that update directly. WriteSAE learns rank-1 matrix atoms with the same shape as the model's own write. This lets us test a direct replacement: at positions where the SAE activates an atom, we remove the model's write, insert the atom scaled by its SAE activation, and continue the forward pass. The atom gives a closer final token distribution than deleting the write on 92.4% of evaluated positions; averaged per atom, the rate is 89.8%. For Gated DeltaNet, a formula using the forget gate, read query, and output embedding predicts the resulting logit change with $R^2 = 0.98$. The same replacement test transfers to Mamba-2-370M at 88.1%. In generation, the formula chooses a write direction; writing it into three consecutive cache positions at $3\times$ the norm of the model's write makes tokens initially ranked 100--1000 by the unmodified model appear in 100% of continuations, up from 33.3%. To our knowledge this is the first cache-level steering intervention reported in a state-space or hybrid recurrent layer.

2605.06395 2026-05-21 cs.LG cs.AI eess.SP 版本更新

Consistent Geometric Deep Learning via Hilbert Bundles and Cellular Sheaves

通过希尔伯特丛和细胞sheaf实现一致的几何深度学习

Kartik Tandon, Julian Gould, Tanishq Bhatia, Francesca Dominici, Alejandro Ribeiro, Claudio Battiloro

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Sakana AI Northeastern University(东北大学) Harvard University(哈佛大学)

AI总结 本文提出了一种新的卷积学习框架,用于在流形上支持的可能无限维信号,通过希尔伯特丛关联的连接拉普拉斯算子作为卷积算子,引入了称为HilbNets的滤波器和神经网络,并通过两阶段采样过程实现,证明了采样诱导的希尔伯特细胞sheaf的sheaf拉普拉斯收敛于底层连接拉普拉斯,从而在无限维丛设置中推广了Belkin和Niyogi的收敛结果,最终在合成和现实任务中验证了该框架。

Comments 51 pages, 3 figures, 5 tables

详情
AI中文摘要

现代深度学习架构越来越多地面临复杂信号的挑战,这些信号本质上是无限维的,如时间序列、概率分布或算子,并在不规则域上定义。然而,针对这些设置的统一学习理论仍然缺乏。为了开始解决这一差距,我们引入了一种新的卷积学习框架,用于在流形上支持的可能无限维信号。具体来说,我们使用与希尔伯特丛相关的连接拉普拉斯算子作为卷积算子,并推导出滤波器和神经网络,称为HilbNets。我们使HilbNets以及更一般地卷积操作通过两阶段采样过程实现。首先,我们证明采样流形诱导了一个希尔伯特细胞sheaf,这是一个带有希尔伯特特征空间和边耦合规则的广义图结构,并证明其sheaf拉普拉斯在采样密度增加时以概率收敛于底层连接拉普拉斯。值得注意的是,这一结果是Belkin & Niyogi收敛结果在无限维丛设置中的推广,这是几何学习方法的理论基石。其次,我们离散化信号并证明离散化的(可实现的)HilbNets收敛于底层连续架构,并且可以在相同丛的不同采样中转移,从而为学习提供一致性。最后,我们验证了我们的框架在合成和现实任务中的有效性。总体而言,我们的结果通过将经典拉普拉斯框架提升到信号在每个点居住在自身希尔伯特空间的设置中,扩展了几何学习的范围。

英文摘要

Modern deep learning architectures increasingly contend with sophisticated signals that are natively infinite-dimensional, such as time series, probability distributions, or operators, and are defined over irregular domains. Yet, a unified learning theory for these settings has been lacking. To start addressing this gap, we introduce a novel convolutional learning framework for possibly infinite-dimensional signals supported on a manifold. Namely, we use the connection Laplacian associated with a Hilbert bundle as a convolutional operator, and we derive filters and neural networks, dubbed as \textit{HilbNets}. We make HilbNets and, more generally, the convolution operation, implementable via a two-stage sampling procedure. First, we show that sampling the manifold induces a Hilbert Cellular Sheaf, a generalized graph structure with Hilbert feature spaces and edge-wise coupling rules, and we prove that its sheaf Laplacian converges in probability to the underlying connection Laplacian as the sampling density increases. Notably, this result is a generalization to the infinite-dimensional bundle setting of the Belkin \& Niyogi \cite{BELKIN20081289} convergence result for the graph Laplacian to the manifold Laplacian, a theoretical cornerstone of geometric learning methods. Second, we discretize the signals and prove that the discretized (implementable) HilbNets converge to the underlying continuous architectures and are transferable across different samplings of the same bundle, providing consistency for learning. Finally, we validate our framework on synthetic and real-world tasks. Overall, our results broaden the scope of geometric learning as a whole by lifting classical Laplacian-based frameworks to settings where the signal at each point lives in its own Hilbert space.

2605.03562 2026-05-21 cs.LG cs.AI 版本更新

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

HeadQ: KV-Cache量化中的模型可见失真与分数空间校正

Jorge L. Ruiz Williams

AI总结 本文提出HeadQ方法,通过在键侧存储低秩残差侧码并在校准学习的查询基上应用作为加性对数修正,以解决KV缓存量化中的模型可见失真问题,并通过分数空间误差预测注意力KL散度,优于原始键MSE。

Comments Withdrawn by the author because ethical concerns were identified after posting

详情
AI中文摘要

KV缓存量化器通常优化存储空间重建,尽管注意力通过logits读取键,通过注意力加权读出读取值。我们主张应以模型可见坐标测量持久缓存误差。对于键,可见对象是分数误差模常数位移;这导致HeadQ,一种键侧方法,存储一个低秩残差侧码在校准学习的查询基上,并将其作为加性对数修正。对于值,固定注意力读出提供了一个A²加权的token失真替代物。在六个模型上,Fisher/分数空间误差预测注意力KL散度比原始键MSE更好;相同预算的反例、空空间干预、查询-PCA控制以及错误符号HeadQ否定了存储MSE替代方案。匹配的Pythia检查点将主要异常定位到小模型低熵路由翻转边界。在仅使用键的WikiText-103解码实验中,使用密集值时,HeadQ在最强的2位行中移除了约84-94%的额外困惑度;在辅助的全KV 2位组合中,HeadQ加上A²值策略改进了所有六个模型。

英文摘要

KV-cache quantizers usually optimize storage-space reconstruction, even though attention reads keys through logits and values through attention-weighted readout. We argue that persistent cache error should be measured in model-visible coordinates. For keys, the visible object is score error modulo constant shifts; this yields HeadQ, a key-side method that stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive logit correction. For values, fixed-attention readout gives an $A^2$-weighted token-distortion surrogate. Across six models, Fisher/score-space error predicts attention KL far better than raw key MSE; same-budget counterexamples, null-space interventions, query-PCA controls, and wrong-sign HeadQ falsify storage-MSE alternatives. Matched Pythia checkpoints localize the main anomaly to a small-model low-entropy route-flip boundary. In K-only WikiText-103 decode experiments with dense values, HeadQ removes roughly $84$--$94\%$ of the excess perplexity on the strongest 2-bit rows; in an auxiliary full-KV 2-bit composition, HeadQ plus an $A^2$ value policy improves all six models.

2604.24957 2026-05-21 cs.LG cs.AI 版本更新

Compute Aligned Training: Optimizing for Test Time Inference

计算对齐训练:优化测试时间推断

Adam Ousherovitch, Ambuj Tewari

发表机构 * Department of Statistics(统计学系) University of Michigan(密歇根大学)

AI总结 本文提出计算对齐训练方法,通过将训练目标与测试时间策略对齐,提升大语言模型在测试时的推断性能。

详情
AI中文摘要

在测试时间计算方面扩大模型性能已成为增强大型语言模型(LLM)性能的强大机制。然而,标准的后训练范式,监督微调(SFT)和强化学习(RL),优化基础策略下单个样本的似然,导致与依赖聚合或过滤输出的测试时间过程产生不一致。在本文中,我们提出计算对齐训练,将训练目标与测试时间策略对齐。通过将推理策略视为对基础策略的操作,我们推导出新的损失函数,这些损失函数在应用所述策略时最大化性能。我们为SFT和RL在常见测试时间策略下实例化此类损失函数。最后,我们提供了实证证据,证明这种训练方法在测试时间扩展方面显著优于标准训练。

英文摘要

Scaling test-time compute has emerged as a powerful mechanism for enhancing Large Language Model (LLM) performance. However, standard post-training paradigms, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), optimize the likelihood of individual samples under a base policy, creating a misalignment with test time procedures that rely on aggregated or filtered outputs. In this work, we propose Compute Aligned Training, which aligns training objectives with test-time strategies. By conceptualizing inference strategies as operators on the base policy, we derive new loss functions that maximize performance when said strategies are applied. We instantiate such loss functions for SFT and RL across common test time strategies. Finally, we provide empirical evidence that this training method substantially improves test time scaling over standard training.

2604.15038 2026-05-21 cs.LG cs.AI cs.CV 版本更新

When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning

当公平性指标产生分歧:评估机器学习中人口公平性评估的可靠性

Khalid Adnan Alsayed

发表机构 * Founder, Ducaltus(Ducaltus创始人) BSc (Hons) Artificial Intelligence(人工智能学士(荣誉)) School of Computing, Engineering & Digital Technologies(计算、工程与数字技术学院) Teesside University, UK(英国泰赛德大学)

AI总结 本文研究了公平性评估的一致性问题,通过多指标分析评估机器学习模型中的人口偏见,发现不同公平性指标可能导致矛盾的评估结果,引入了公平性分歧指数(FDI)来量化指标间的不一致程度。

Comments 15 pages, 4 figues, 5 tables

详情
AI中文摘要

在高风险应用中,机器学习系统的公平性评估已成为核心问题,包括生物识别、医疗决策和自动风险评估。现有方法通常依赖少量公平性指标来评估模型行为,隐含假设这些指标能提供一致和可靠的结论。然而,不同公平性指标捕捉模型性能的不同统计属性,可能在相同系统上产生冲突的评估。本文通过系统性的多指标分析,评估机器学习模型中的人口偏见,使用面部识别作为受控实验环境,评估模型在多个群体分区下的性能,包括误差率差异和基于性能的指标。结果表明,公平性评估可能因指标选择而显著变化,导致关于模型偏见的矛盾结论。为量化此现象,我们引入公平性分歧指数(FDI),以捕捉公平性指标间的不一致程度。进一步表明,分歧在阈值和模型配置下仍保持高位。这些发现突显了当前公平性评估实践的关键限制,并表明单一指标报告不足以可靠地评估偏见。

英文摘要

The evaluation of fairness in machine learning systems has become a central concern in high-stakes applications, including biometric recognition, healthcare decision-making, and automated risk assessment. Existing approaches typically rely on a small number of fairness metrics to assess model behaviour across group partitions, implicitly assuming that these metrics provide consistent and reliable conclusions. However, different fairness metrics capture distinct statistical properties of model performance and may therefore produce conflicting assessments when applied to the same system. In this work, we investigate the consistency of fairness evaluation by conducting a systematic multi-metric analysis of demographic bias in machine learning models. Using face recognition as a controlled experimental setting, we evaluate model performance across multiple group partitions under a range of commonly used fairness metrics, including error-rate disparities and performance-based measures. Our results demonstrate that fairness assessments can vary significantly depending on the choice of metrics, leading to contradictory conclusions regarding model bias. To quantify this phenomenon, we introduce the Fairness Disagreement Index (FDI), a measure designed to capture the degree of inconsistency across fairness metrics. We further show that disagreement remains high across thresholds and model configurations. These findings highlight a critical limitation in current fairness evaluation practices and suggest that single-metric reporting is insufficient for reliable bias assessment.

2604.10784 2026-05-21 cs.AI 版本更新

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

TorchUMM: 一个用于评估、分析和训练后处理的统一多模态模型代码库

Yinyi Luo, Wenwen Wang, Hayes Bai, Hongyu Zhu, Hao Chen, Pan He, Marios Savvides, Sharon Li, Jindong Wang

发表机构 * Carnegie Mellon University(卡内基梅隆大学) William & Mary(威廉与玛丽学院) Auburn University(阿肯色大学欧文分校) University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 本文提出TorchUMM,一个统一的多模态模型代码库,用于评估、分析和训练后处理,涵盖多种多模态模型架构、任务和数据集,通过统一接口和标准化评估协议,促进异构模型的公平比较和深入理解,推动更强大的统一多模态系统的发展。

Comments Technical Report

详情
AI中文摘要

近年来,统一多模态模型(UMMs)的进展导致了大量能够跨视觉和文本模态进行理解、生成和编辑的架构 proliferation。然而,开发统一的UMMs框架仍然具有挑战性,因为模型架构的多样性以及训练范式和实现细节的异质性。在本文中,我们提出了TorchUMM,这是第一个统一的代码库,用于在多种UMMs backbones、任务和数据集上进行全面的评估、分析和训练后处理。TorchUMM支持广泛的模型,涵盖各种规模和设计范式。我们的基准涵盖了三个核心任务维度:多模态理解、生成和编辑,并整合了已建立和新的数据集来评估感知、推理、组合性和遵循指示的能力。通过提供统一的接口和标准化的评估协议,TorchUMM使异构模型之间的公平和可重复比较成为可能,并促进了对它们优缺点的深入理解,从而促进更强大的统一多模态系统的发展。代码可在:https://github.com/AIFrontierLab/TorchUMM 上获得。

英文摘要

Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: https://github.com/AIFrontierLab/TorchUMM.

2604.01449 2026-05-21 cs.AI cs.LG 版本更新

When AI Gets it Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems

当AI出错时:AI辅助用药决策系统中的可靠性与风险

Khalid Adnan Alsayed

发表机构 * Ducaltus(Ducaltus公司) School of Computing, Engineering & Digital Technologies(计算、工程与数字技术学院) Teesside University(泰赛德大学)

AI总结 本文研究了AI辅助用药系统在现实决策中的可靠性问题,通过模拟药物相互作用和剂量决策场景,分析系统故障类型及其潜在临床影响,强调在安全关键领域如药房实践中,需补充传统性能指标的风险意识评估方法。

Comments 9 pages, 1 figure. Position paper with simulated experimental analysis of AI reliability in medication decision systems. Minor Correction to Title Metadata (Typo Fix)

详情
AI中文摘要

人工智能(AI)系统日益被整合到医疗和药房工作中,支持药物推荐、剂量确定和药物相互作用检测等任务。尽管这些系统在标准评估指标下通常表现良好,但其在现实决策中的可靠性仍不够理解。在高风险领域如用药管理中,单个错误推荐可能导致严重患者伤害。本文通过聚焦系统故障及其潜在临床后果,探讨AI辅助用药系统的可靠性。不同于仅通过聚合指标评估性能,本文关注错误发生的方式以及AI系统产生错误输出时的情况。通过一系列受控的模拟场景,分析不同类型的系统故障,包括遗漏相互作用、错误风险标记和不适当的剂量推荐。研究发现,AI在用药相关情境中的错误可能导致不良药物反应、无效治疗或延误护理,尤其是在缺乏充分人类监督的情况下。此外,本文讨论了过度依赖AI推荐的风险以及决策过程透明度有限带来的挑战。本文为医疗领域AI评估提供了以可靠性为核心的视角,强调理解故障行为和现实影响的重要性。它突显了在安全关键领域如药房实践中,需补充传统性能指标的风险意识评估方法的必要性。

英文摘要

Artificial intelligence (AI) systems are increasingly integrated into healthcare and pharmacy workflows, supporting tasks such as medication recommendations, dosage determination, and drug interaction detection. While these systems often demonstrate strong performance under standard evaluation metrics, their reliability in real-world decision-making remains insufficiently understood. In high-risk domains such as medication management, even a single incorrect recommendation can result in severe patient harm. This paper examines the reliability of AI-assisted medication systems by focusing on system failures and their potential clinical consequences. Rather than evaluating performance solely through aggregate metrics, this work shifts attention towards how errors occur and what happens when AI systems produce incorrect outputs. Through a series of controlled, simulated scenarios involving drug interactions and dosage decisions, we analyse different types of system failures, including missed interactions, incorrect risk flagging, and inappropriate dosage recommendations. The findings highlight that AI errors in medication-related contexts can lead to adverse drug reactions, ineffective treatment, or delayed care, particularly when systems are used without sufficient human oversight. Furthermore, the paper discusses the risks of over-reliance on AI recommendations and the challenges posed by limited transparency in decision-making processes. This work contributes a reliability-focused perspective on AI evaluation in healthcare, emphasising the importance of understanding failure behavior and real-world impact. It highlights the need to complement traditional performance metrics with risk-aware evaluation approaches, particularly in safety-critical domains such as pharmacy practice.

2603.28675 2026-05-21 cs.CV cs.AI cs.LG 版本更新

Why Aggregate Accuracy is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems

为何聚合准确率不足以评估执法面部识别系统的公平性

Khalid Adnan Alsayed

发表机构 * Ducaltus School of Computing, Engineering & Digital Technologies(计算、工程与数字技术学院) Teesside University(泰赛德大学)

AI总结 本文探讨了在执法场景中,面部识别系统的聚合准确率作为公平性评估指标的不足,通过分析子群体误差分布,指出聚合指标可能掩盖不同群体间的显著差异,并强调需要更全面的评估框架来确保负责任的AI部署。

Comments 9 pages, 2 tables, 1 figure. Position paper with empirical subgroup analysis highlighting limitations of aggregate accuracy in fairness evaluation

详情
AI中文摘要

面部识别系统正在越来越多地应用于执法和安全领域,在这些领域中算法决策可能带来重大社会影响。尽管报告的准确率较高,但越来越多的证据表明,这些系统在不同群体中的表现往往不均衡,导致不公正的误差率和潜在危害。本文认为,聚合准确率是评估执法中面部识别系统公平性和可靠性不足的指标。通过分析子群体层面的误差分布,包括假阳性率(FPR)和假阴性率(FNR),本文展示了聚合性能指标如何掩盖不同群体间的关键差异。实证观察表明,具有相似总体准确率的系统可以表现出显著不同的公平性特征,子群体误差率在单一聚合指标下可能有显著差异。本文进一步探讨了在执法应用中以准确率为中心的评估实践所带来的操作风险,其中误分类可能导致错误怀疑或遗漏识别。它强调了公平性意识评估方法和模型无关审计策略的重要性,这些方法能够实现部署后的现实系统评估。研究结果强调了需要超越准确率作为主要指标,并采用更全面的评估框架来确保负责任的AI部署。

英文摘要

Facial recognition systems are increasingly deployed in law enforcement and security contexts, where algorithmic decisions can carry significant societal consequences. Despite high reported accuracy, growing evidence demonstrates that such systems often exhibit uneven performance across demographic groups, leading to disproportionate error rates and potential harm. This paper argues that aggregate accuracy is an insufficient metric for evaluating the fairness and reliability of facial recognition systems in high-stakes environments. Through analysis of subgroup-level error distribution, including false positive rate (FPR) and false negative rate (FNR), the paper demonstrates how aggregate performance metrics can obscure critical disparities across demographic groups. Empirical observations show that systems with similar overall accuracy can exhibit substantially different fairness profiles, with subgroup error rates varying significantly despite a single aggregate metric. The paper further examines the operational risks associated with accuracy-centric evaluation practices in law enforcement applications, where misclassification may result in wrongful suspicion or missed identification. It highlights the importance of fairness-aware evaluation approaches and model-agnostic auditing strategies that enable post-deployment assessment of real-world systems. The findings emphasise the need to move beyond accuracy as a primary metric and adopt more comprehensive evaluation frameworks for responsible AI deployment.

2603.15842 2026-05-21 cs.LG cs.AI cs.IT math.IT 版本更新

Informationally Compressive Anonymization: Non-Degrading Sensitive Input Protection for Privacy-Preserving Supervised Machine Learning

信息压缩匿名化:非降级的敏感输入保护用于隐私保护的监督机器学习

Jeremy J Samuelson

发表机构 * EVP Artificial Intelligence & Innovation(EVP人工智能与创新)

AI总结 本文提出了一种信息压缩匿名化(ICA)方法和VEIL架构,通过架构和数学设计而非噪声注入或密码学来实现强隐私保障,确保在隐私保护监督机器学习中保留预测效用,同时支持可扩展的多地区部署。

Comments 47 pages, 29 figures

详情
AI中文摘要

现代机器学习系统越来越多地依赖敏感数据,这带来了显著的隐私、安全和监管风险,而现有的隐私保护机器学习(ppML)技术,如差分隐私(DP)和同态加密(HE),只能通过降级性能、增加复杂性或禁止性计算开销来解决。本文介绍了信息压缩匿名化(ICA)和VEIL架构,一种隐私保护的机器学习框架,通过架构和数学设计实现强隐私保障,而非噪声注入或密码学。ICA在受信任的源环境中嵌入一个监督的多目标编码器,将原始输入转换为低维、任务对齐的潜在表示,确保只有不可逆匿名化的向量被导出到不可信的训练和推理环境中。本文严格证明这些编码在拓扑和信息论论证中结构非可逆,表明即使在理想化的攻击者假设下,逆向也是逻辑上不可能的,并且在实际部署中,攻击者对原始数据的条件熵发散,驱动重建概率趋于零。与以往基于自编码器的ppML方法不同,ICA通过将表示学习与下游监督目标对齐,保留预测效用,从而在无需梯度裁剪、噪声预算或推理时间加密的情况下实现低延迟、高性能的机器学习。VEIL架构强制执行严格的信任边界,支持可扩展的多地区部署,并自然与隐私设计监管框架对齐,建立了一种新的企业ML基础,即使在后量子威胁面前,也是安全、高效且安全的。

英文摘要

Modern machine learning systems increasingly rely on sensitive data, creating significant privacy, security, and regulatory risks that existing privacy-preserving machine learning (ppML) techniques, such as Differential Privacy (DP) and Homomorphic Encryption (HE), address only at the cost of degraded performance, increased complexity, or prohibitive computational overhead. This paper introduces Informationally Compressive Anonymization (ICA) and the VEIL architecture, a privacy-preserving ML framework that achieves strong privacy guarantees through architectural and mathematical design rather than noise injection or cryptography. ICA embeds a supervised, multi-objective encoder within a trusted Source Environment to transform raw inputs into low-dimensional, task-aligned latent representations, ensuring that only irreversibly anonymized vectors are exported to untrusted training and inference environments. The paper rigorously proves that these encodings are structurally non-invertible using topological and information-theoretic arguments, showing that inversion is logically impossible, even under idealized attacker assumptions, and that, in realistic deployments, the attacker conditional entropy over the original data diverges, driving reconstruction probability to zero. Unlike prior autoencoder-based ppML approaches, ICA preserves predictive utility by aligning representation learning with downstream supervised objectives, enabling low-latency, high-performance ML without gradient clipping, noise budgets, or encryption at inference time. The VEIL architecture enforces strict trust boundaries, supports scalable multi-region deployment, and naturally aligns with privacy-by-design regulatory frameworks, establishing a new foundation for enterprise ML that is secure, performant, and safe by construction, even in the face of post-quantum threats.

2603.00086 2026-05-21 cs.CL cs.AI cs.SD eess.AS 版本更新

Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

基于迭代的LLM改进法用于法语临床访谈的转录与说话人识别

Ambre Marie, Thomas Bertin, Guillaume Dardenne, Gwenolé Quellec

发表机构 * LaTIM UMR 1101 INSERM(INSERM拉蒂姆UMR1101) University of Western Brittany(西布列塔尼大学) University of Rouen Normandy(诺曼底大学)

AI总结 本研究提出一种多轮LLM后处理架构,通过交替进行说话人识别和词识别流程,提高法语医疗对话的转录准确性和说话人归属,通过两个法语临床数据集的消融研究验证了四种设计选择的效果。

详情
AI中文摘要

法语医疗对话的自动语音识别仍然具有挑战性,自发临床语音的词错误率通常超过30%。本研究提出一种多轮LLM后处理架构,通过交替进行说话人识别和词识别流程来提高转录准确性和说话人归属。在两个法语临床数据集(自杀预防电话咨询和术前清醒神经外科会诊)上的消融研究调查了四种设计选择:模型选择、提示策略、流程顺序和迭代深度。使用Qwen3-Next-80B,Wilcoxon符号秩检验证实了在自杀预防对话上词错误率(WDER)的显著降低(p<0.05,n=18),同时在清醒神经外科会诊上保持稳定(n=10),零输出失败和可接受的计算成本(RTF 0.32),表明该方法在离线临床部署中的可行性,有待在更大语料库上验证。

英文摘要

Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p<0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment, pending validation on larger corpora.

2602.08686 2026-05-21 cs.LG cs.AI 版本更新

CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

CompilerKV: 通过离线经验编译实现风险适应性的键值压缩

Ning Yang, Chengzhi Wang, Yibo Liu, Baoliang Tian, Haijun Zhang

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) University of Electronic Science and Technology of China(电子科技大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) ByteDance(字节跳动) University of Science and Technology Beijing(北京科技大学)

AI总结 本文提出CompilerKV,一种通过离线经验编译实现风险适应性的键值压缩方法,通过离线编译校准语料库中的纠正表,将在线纠正减少到O(1)查找加预算限制,从而在多个模型架构上实现了压缩SOTA,并在不同压力条件下保持最优性能。

详情
AI中文摘要

Prefill-only KV compression freezes a token subset at the end of prefill and decodes from it without further eviction. The retention decision is therefore irreversible, yet existing methods estimate the corrective signals it relies on, per-head reliability and prompt-level compression sensitivity, online from a single noisy prompt. We argue this is the wrong statistical unit: these signals exhibit far higher cross-prompt regularity than within-prompt signal-to-noise. We introduce extsc{CompilerKV}, a KV-retention policy whose corrective tables are compiled offline from a calibration corpus, reducing online correction after the standard observation-window scan to $O(1)$ lookups plus a budget clamp. We find that compiled retention tables behave as portable architectural priors: rankings transfer across disjoint corpora on four backbones (mean Spearman $arρ{=}0.90$), and direct model-to-model table transfer costs only $0.4$--$0.8$ LongBench points on average. At a 512-token budget, extsc{CompilerKV} attains compressed-SOTA on all four backbones, improving over the strongest prefill-only baseline by $+1.67$ points on average (task-bootstrap 95\% CI $[+1.08,+2.37]$). Pressure regimes amplify the gap: under a fixed $512/32k$ cache ratio, CompilerKV remains the strongest compressed method through 128k RULER ($\sim\!73$ vs.\ FullKV $\sim\!79$, SnapKV $\sim\!38$); on 32k NIAH it reaches $0.89$ vs.\ SnapKV $0.42$; and at 32k input, retaining only $1.56\%$ of the prefill KV, batch-16 serving remains feasible where FullKV is OOM.

英文摘要

Prefill-only KV compression freezes a token subset at the end of prefill and decodes from it without further eviction. The retention decision is therefore irreversible, yet existing methods estimate the corrective signals it relies on, per-head reliability and prompt-level compression sensitivity, online from a single noisy prompt. We argue this is the wrong statistical unit: these signals exhibit far higher cross-prompt regularity than within-prompt signal-to-noise. We introduce \textsc{CompilerKV}, a KV-retention policy whose corrective tables are compiled offline from a calibration corpus, reducing online correction after the standard observation-window scan to $O(1)$ lookups plus a budget clamp. We find that compiled retention tables behave as portable architectural priors: rankings transfer across disjoint corpora on four backbones (mean Spearman $\barρ{=}0.90$), and direct model-to-model table transfer costs only $0.4$--$0.8$ LongBench points on average. At a 512-token budget, \textsc{CompilerKV} attains compressed-SOTA on all four backbones, improving over the strongest prefill-only baseline by $+1.67$ points on average (task-bootstrap 95\% CI $[+1.08,+2.37]$). Pressure regimes amplify the gap: under a fixed $512/32k$ cache ratio, CompilerKV remains the strongest compressed method through 128k RULER ($\sim\!73$ vs.\ FullKV $\sim\!79$, SnapKV $\sim\!38$); on 32k NIAH it reaches $0.89$ vs.\ SnapKV $0.42$; and at 32k input, retaining only $1.56\%$ of the prefill KV, batch-16 serving remains feasible where FullKV is OOM.

2602.07832 2026-05-21 cs.LG cs.AI 版本更新

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

rePIRL: 通过逆强化学习学习PRM以提高LLM推理

Xian Wu, Kaijie Zhu, Ying Zhang, Lun Wang, Wenbo Guo

发表机构 * Meta AI Department of Computer Science, University of California, Santa Barbara(加州大学圣芭芭拉分校计算机科学系) Google DeepMind(谷歌DeepMind) Independent Researcher(独立研究者)

AI总结 本文提出rePIRL框架,通过逆强化学习学习高效的PRM,无需依赖专家策略的强假设,解决了传统方法中熵崩溃等固有限制问题,通过双学习过程和定制技术提升LLM推理性能,并在数学和编程任务数据集上验证了其有效性。

详情
AI中文摘要

过程奖励已被广泛用于深度强化学习以提高训练效率、减少方差并防止奖励黑客。在LLM推理中,现有工作也探索了各种解决方案来学习有效的过程奖励模型(PRM),有或无专家策略的帮助。然而,现有方法要么依赖于对专家策略的强假设(例如要求其奖励函数),要么受到固有限制(例如熵崩溃),导致PRM效果有限或泛化能力差。在本文中,我们引入了rePIRL,一种受逆强化学习启发的框架,能够在对专家策略假设最少的情况下学习有效的PRM。具体来说,我们设计了一种双学习过程,交替更新策略和PRM。我们的学习算法具有定制技术,以解决将传统逆强化学习扩展到LLM的挑战。我们理论证明,所提出的学习框架可以统一在线和离线PRM学习方法,证明rePIRL可以在最少假设下学习PRM。在标准化数学和编程推理数据集上的经验评估展示了rePIRL在现有方法上的有效性。我们进一步展示了训练的PRM在测试时训练、测试时扩展以及为训练困难问题提供早期信号的应用。最后,我们通过详细的消融研究验证了我们的训练配方和关键设计选择。

英文摘要

Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRM) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring their reward functions) or suffer intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that updates the policy and the PRM interchangeably. Our learning algorithm has customized techniques to address the challenges of scaling traditional inverse RL to LLMs. We theoretically show that our proposed learning framework can unify both online and offline PRM learning methods, justifying that rePIRL can learn PRMs with minimal assumptions. Empirical evaluations on standardized math and coding reasoning datasets demonstrate the effectiveness of rePIRL over existing methods. We further show the application of our trained PRM in test-time training, test-time scaling, and providing an early signal for training hard problems. Finally, we validate our training recipe and key design choices via a detailed ablation study.

2601.18973 2026-05-21 cs.LG cs.AI cs.SY eess.SY quant-ph 版本更新

When Does Adaptation Win? Scaling Laws for Meta-Learning in Quantum Control

何时适应胜出?量子控制中元学习的缩放定律

Nima Leclerc, Chris Miller, Nicholas Brawand

发表机构 * The MITRE Corporation(MITRE公司)

AI总结 本文研究了元学习在量子控制中的适应性问题,推导了适应增益的缩放定律,表明适应增益随着梯度步数指数饱和,而随任务方差线性增长,为判断适应的必要性提供了量化标准。

Comments 28 pages, 11 figures

详情
AI中文摘要

量子硬件固有地存在设备异质性和环境漂移,迫使实践者在次优非适应控制器和高成本的设备特定重新校准之间做出选择。我们推导了元学习的缩放定律下限,表明适应增益(任务特定梯度步的预期保真度提升)随着梯度步数指数饱和,而随任务方差线性增长,提供了判断适应是否值得其开销的量化标准。在量子门校准上的验证显示,低方差任务的适应收益微乎其微,但在极端分布外条件(训练噪声的10倍)下,两量子位门的保真度提升超过40%,这对减少云量子处理器上的设备校准时间具有启示。进一步在经典线性二次控制上的验证证实这些定律源于通用优化几何而非量子特定物理。我们还引入了一种少量次预适应协议,能够在3-19%的相对误差范围内,通过N=3-5次探测步估计最优的适应预算。

英文摘要

Quantum hardware suffers from intrinsic device heterogeneity and environmental drift, forcing practitioners to choose between suboptimal non-adaptive controllers or costly per-device recalibration. We derive a scaling law lower bound for meta-learning showing that the adaptation gain (expected fidelity improvement from task-specific gradient steps) saturates exponentially with gradient steps and scales linearly with task variance, providing a quantitative criterion for when adaptation justifies its overhead. Validation on quantum gate calibration shows negligible benefits for low-variance tasks but >40% fidelity gains on two-qubit gates under extreme out-of-distribution conditions (10$\times$ the training noise), with implications for reducing per-device calibration time on cloud quantum processors. Further validation on classical linear-quadratic control confirms these laws emerge from general optimization geometry rather than quantum-specific physics. We further introduce a few-shot pre-adaptation protocol that estimates the optimal adaptation budget from $N{=}3$-5 probe steps within 3-19% relative error across out-of-distribution regimes.

2510.09060 2026-05-21 cs.AI cs.CV 版本更新

Letting Trajectories Spread: Quality-Preserving Control for Diverse Flow Matching

让轨迹扩散:用于多样化流匹配的质量保持控制

Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Bo An, Ivor Tsang, Yang You

发表机构 * The University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Nanyang Technological University(南洋理工大学) CFAR, Agency for Science, Technology and Research, Singapore(新加坡科技研究局CFAR) University of California, Santa Barbara(加州大学圣巴巴拉分校) National University of Singapore(新加坡国立大学)

AI总结 本文提出了一种无需训练的推理时控制机制,使流本身具备多样性意识,通过几何上与模式质量寻求方向解耦的引导来鼓励轨迹横向扩散,同时通过时间调度的随机扰动重新引入不确定性,从而在不降低图像细节和提示忠实度的情况下提升多样性。

详情
AI中文摘要

基于流的文本到图像模型遵循确定性轨迹,这使得在有限的采样预算下探索多样模式成本较高。现有方法提高多样性通常依赖于重新训练或降低图像保真度。为了解决这一限制,我们提出了一种无需训练的推理时控制机制,使流本身具备多样性意识。我们的核心见解是通过几何上与模式质量寻求方向解耦的引导来鼓励多样性。我们的方法通过特征空间目标同时鼓励轨迹横向扩散,并通过时间调度的随机扰动重新引入不确定性。关键在于这种扰动被投影为与生成流正交,这是一个几何约束,允许其在不降低图像细节或提示保真度的情况下提升多样性。理论上,我们证明了这种设计单调地增加了一个体积代理,同时近似地保持边际分布,为生成质量的鲁棒性提供了原理性解释。经验上,在多个文本到图像设置下,固定采样预算下,我们的方法在Vendi分数和Brisque等多样性指标上一致优于强基线,同时保持图像质量和对齐。

英文摘要

Flow-based text-to-image models follow deterministic trajectories, making it costly to explore diverse modes under limited sampling budgets. Existing approaches to improving diversity often rely on retraining or degrade image fidelity. To address this limitation, we present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our core insight is to encourage diversity through guidance that is geometrically decoupled from the mode's quality-seeking direction. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Theoretically, we show that this design monotonically increases a volume surrogate while approximately preserving the marginal distribution, providing a principled explanation for the robustness of generation quality. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves diversity metrics such as the Vendi Score and Brisque over strong baselines, while upholding image quality and alignment.

2510.05942 2026-05-21 cs.CL cs.AI 版本更新

EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

EvalMORAAL: 可解释的链式推理与大语言模型道德对齐的LLM-as-Judge评估

Hadi Mohammadi, Anastasia Giachanou, Robert A. Bagheri

发表机构 * Department of Methodology and Statistics, Utrecht University(方法论与统计学系,乌得勒支大学)

AI总结 本文提出EvalMORAAL框架,通过两种评分方法和模型作为裁判的同行评审,评估20个大语言模型的道德对齐情况,发现西方与非西方地区存在显著的道德对齐差距。

Comments Accepted as a poster at *SEM 2026

详情
AI中文摘要

我们提出了EvalMORAAL,一个透明的链式推理(CoT)框架,使用两种评分方法(对数概率和直接评分)以及模型作为裁判的同行评审来评估20个大语言模型的道德对齐。我们对世界价值观调查(55个国家,19个主题)和PEW全球态度调查(39个国家,8个主题)进行了评估。使用EvalMORAAL,顶级模型与调查响应高度一致(WVS上的皮尔逊相关系数r≈0.90)。然而,我们发现明显的区域差异:西方地区平均r=0.82,而非西方地区平均r=0.61(绝对差距0.21),表明存在持续的区域对齐差距。我们的框架增加了三个部分:(1)为所有模型提供两种评分方法以实现公平比较,(2)带有自我一致性检查的结构化Co T协议,以及(3)一个模型作为裁判的同行评审,使用数据驱动的阈值标记348个冲突。同行同意与WVS调查对齐(r=0.74,p<.001;PEW r=0.39,n.s.),支持自动化质量检查。这些结果展示了文化意识AI的真实进展,同时突显了跨区域应用的开放挑战。

英文摘要

We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's $r \approx 0.90$ on WVS). Yet we find a clear regional difference: Western regions average $r=0.82$ while non-Western regions average $r=0.61$ (a 0.21 absolute gap), indicating a persistent regional alignment gap. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured CoT protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to WVS survey alignment ($r=0.74$, $p<.001$; PEW $r=0.39$, n.s.), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.

2508.16860 2026-05-21 cs.SE cs.AI cs.LG 版本更新

TriagerX: Dual Transformers for Bug Triaging Tasks with Content and Interaction Based Rankings

TriagerX: 用于基于内容和交互的缺陷分类任务的双变换器

Md Afif Al Mamun, Gias Uddin, Lan Xia, Longyu Zhang

发表机构 * University of Calgary(卡尔加里大学) York University(约克大学) IBM Canada(IBM加拿大)

AI总结 本文提出TriagerX,一种双变换器架构,通过结合内容和交互信息来改进缺陷分类任务的推荐准确性,优于现有最先进方法。

Comments Accepted to IEEE Transactions on Software Engineering (TSE). 17 pages, 15 figures

详情
AI中文摘要

预训练语言模型(PLMs)是基于变换器的架构,可用于缺陷分类任务。PLMs比传统机器学习(ML)模型更能捕捉标记语义(例如TF-IDF、词袋)。然而,PLMs可能仍然会关注在缺陷报告中不相关的标记,这会影响其有效性。此外,当不考虑开发人员围绕类似缺陷的交互历史时,模型的推荐可能不够优化。我们设计了TriagerX来解决这些限制。首先,为了更可靠地评估标记语义,我们利用双变换器架构。与当前最先进的(SOTA)基线使用单一变换器架构不同,TriagerX从两个变换器中收集推荐,每个变换器通过其最后三层提供推荐。这种设置生成了一个稳健的内容基于候选开发人员的排名。TriagerX然后通过一种新的基于交互的排名方法来细化此排名,该方法考虑了开发人员与类似修复缺陷的历史交互。在五个数据集中,TriagerX超越了所有九种基于变换器的方法,包括SOTA基线,通常在Top-1和Top-3开发人员推荐准确性上提高了超过10%。我们与我们的大型行业合作伙伴合作,成功将其部署到他们的开发环境中。合作伙伴要求开发人员和组件的推荐,组件作为团队分配的代理,特别是在开发人员轮岗或团队变化的情况下特别有用。我们训练TriagerX在合作伙伴的数据集上进行两项任务,并在组件推荐上优于SOTA基线最高达10%,在开发人员推荐上最高达54%。

英文摘要

Pretrained Language Models or PLMs are transformer-based architectures that can be used in bug triaging tasks. PLMs can better capture token semantics than traditional Machine Learning (ML) models that rely on statistical features (e.g., TF-IDF, bag of words). However, PLMs may still attend to less relevant tokens in a bug report, which can impact their effectiveness. In addition, the model can be sub-optimal with its recommendations when the interaction history of developers around similar bugs is not taken into account. We designed TriagerX to address these limitations. First, to assess token semantics more reliably, we leverage a dual-transformer architecture. Unlike current state-of-the-art (SOTA) baselines that employ a single transformer architecture, TriagerX collects recommendations from two transformers with each offering recommendations via its last three layers. This setup generates a robust content-based ranking of candidate developers. TriagerX then refines this ranking by employing a novel interaction-based ranking methodology, which considers developers' historical interactions with similar fixed bugs. Across five datasets, TriagerX surpasses all nine transformer-based methods, including SOTA baselines, often improving Top-1 and Top-3 developer recommendation accuracy by over 10%. We worked with our large industry partner to successfully deploy TriagerX in their development environment. The partner required both developer and component recommendations, with components acting as proxies for team assignments-particularly useful in cases of developer turnover or team changes. We trained TriagerX on the partner's dataset for both tasks, and it outperformed SOTA baselines by up to 10% for component recommendations and 54% for developer recommendations.

2506.08277 2026-05-21 q-bio.NC cs.AI cs.CL cs.CV cs.LG 版本更新

Task-conditioned probing of instruction-tuned multimodal LLMs: Region-specific brain alignment patterns under naturalistic stimuli

基于任务的指令调制多模态大语言模型探测:在自然主义刺激下的区域特定大脑对齐模式

Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta

发表机构 * Technische Universität Berlin(柏林技术大学) Rice University(Rice 大学) AWS AI Labs, Amazon(Amazon 人工智能实验室) IIT Delhi(德里理工学院) University of Wisconsin - Madison(威斯康星大学麦迪逊分校) Spector Inc(Spector 公司) IIIT-Hyderabad(海得拉巴理工学院) Microsoft(微软)

AI总结 本研究探讨了指令调制多模态大语言模型在自然主义刺激下的大脑对齐模式,通过比较不同模型在视频和音频任务中的表现,揭示了指令调制对模型表示能力的影响。

Comments 57 pages, 39 figures

详情
AI中文摘要

近期的体素级多模态脑编码研究显示,多模态大语言模型(MLLMs)在大脑对齐程度上高于单模态模型。更近期的研究表明,指令调制多模态(IT)模型能够生成与大脑活动强相关的任务特定表示,但大多数先前评估集中在单模态刺激或非指令调制模型上。我们仍然缺乏对指令调制是否使IT-MLLMs围绕功能任务需求组织其表示,还是仅反映表面语义的清晰理解。为此,我们通过预测自然主义电影观看(带音频的视频)期间记录的fMRI响应,来估计大脑对齐情况。使用来自六个视频和两个音频IT-MLLMs的指令特定嵌入,跨13个视频任务指令,我们发现指令调制视频MLLMs的大脑对齐程度高于上下文学习(ICL)多模态模型(~9%)、非指令调制多模态模型(~15%)和单模态基线(~20%)。我们对视频和音频任务以及语言引导的探测评估,产生了不同任务特定的MLLM表示,这些表示在不同大脑区域中变化。我们还发现,ICL模型表现出强语义组织(r=0.78),而IT模型与指令文本语义的耦合较弱(r=0.14),这与与更高大脑对齐相关的任务条件子空间一致。这些发现支持了任务特定指令与更强的大脑-MLLM对齐之间的关联,并为映射两个系统中的联合信息处理开辟了新途径。我们公开了代码 [https://github.com/subbareddy248/mllm_videos]。

英文摘要

Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models. More recently, instruction-tuned multimodal (IT) models have been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations focus on unimodal stimuli or non-instruction-tuned models under multimodal stimuli. We still lack a clear understanding of whether instruction-tuning is associated with IT-MLLMs organizing their representations around functional task demands or if they simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs, across 13 video task instructions, we find that instruction-tuned video MLLMs show higher brain alignment than in-context learning (ICL) multimodal models (~9%), non-instruction-tuned multimodal models (~15%), and unimodal baselines (~20%). Our evaluation of MLLMs across video and audio tasks, and language-guided probing produces distinct task-specific MLLM representations that vary across brain regions. We also find that ICL models show strong semantic organization (r=0.78), while IT models show weak coupling to instruction-text semantics (r=0.14), consistent with task-conditioned subspaces associated with higher brain alignment. These findings are consistent with an association between task-specific instructions and stronger brain-MLLM alignment, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].

2406.03506 2026-05-21 cs.LG cs.AI 版本更新

Fuzzy Convolution Neural Networks for Tabular Data Classification

模糊卷积神经网络用于表格数据分类

Arun D. Kulkarni

发表机构 * Computer Science Department, University of Texas at Tyler(德克萨斯大学泰勒分校计算机科学系)

AI总结 本文提出了一种针对表格数据分类的模糊卷积神经网络(FCNN),通过将特征值映射为模糊隶属度并转换为图像来训练CNN模型,从而在表格数据分类任务中实现有效的学习和优于现有方法的性能。

Comments 10 pages, 16 figures, Submitted to IEEE Access

Journal ref IEEE Access, vol. 12, pp. 151846-151855 (2024)

详情
AI中文摘要

近年来,由于在各种领域中表现出色,特别是图像和文本分类任务,卷积神经网络(CNNs)已经引起了广泛关注。然而,它们在表格数据分类中的应用仍然很少被探索。在生物信息学、金融、医学等领域,非图像数据普遍存在。将CNNs适应于分类非图像数据仍然极具挑战性。本文研究了CNNs在表格数据分类中的有效性,旨在弥合传统机器学习方法与深度学习技术之间的差距。我们提出了一种专门针对表格数据的新型框架——模糊卷积神经网络(FCNN),以捕捉特征向量中的局部模式。在我们的方法中,我们将特征值映射到模糊隶属度。模糊隶属度向量被转换为图像,用于训练CNN模型。训练后的CNN模型用于分类未知的特征向量。为了验证我们的方法,我们生成了六个复杂的噪声数据集。我们从每个数据集中随机选择70%的样本用于训练,30%用于测试。数据集还使用了最先进的机器学习算法,如决策树(DT)、支持向量机(SVM)、模糊神经网络(FNN)、贝叶斯分类器和随机森林(RF)进行分类。实验结果表明,我们提出的方法能够有效地从表格数据中学习有意义的表示,实现与现有方法相媲美或更优的性能。总体而言,我们的发现表明,所提出的FCNN模型在表格数据分类任务中具有前景,作为一种可行的替代方案,为在结构化数据分析中利用深度学习提供了新的视角和潜在的机会。

英文摘要

Recently, convolution neural networks (CNNs) have attracted a great deal of attention due to their remarkable performance in various domains, particularly in image and text classification tasks. However, their application to tabular data classification remains underexplored. There are many fields such as bioinformatics, finance, medicine where nonimage data are prevalent. Adaption of CNNs to classify nonimage data remains highly challenging. This paper investigates the efficacy of CNNs for tabular data classification, aiming to bridge the gap between traditional machine learning approaches and deep learning techniques. We propose a novel framework fuzzy convolution neural network (FCNN) tailored specifically for tabular data to capture local patterns within feature vectors. In our approach, we map feature values to fuzzy memberships. The fuzzy membership vectors are converted into images that are used to train the CNN model. The trained CNN model is used to classify unknown feature vectors. To validate our approach, we generated six complex noisy data sets. We used randomly selected seventy percent samples from each data set for training and thirty percent for testing. The data sets were also classified using the state-of-the-art machine learning algorithms such as the decision tree (DT), support vector machine (SVM), fuzzy neural network (FNN), Bayes classifier, and Random Forest (RF). Experimental results demonstrate that our proposed model can effectively learn meaningful representations from tabular data, achieving competitive or superior performance compared to existing methods. Overall, our finding suggests that the proposed FCNN model holds promise as a viable alternative for tabular data classification tasks, offering a fresh prospective and potentially unlocking new opportunities for leveraging deep learning in structured data analysis.

2305.09620 2026-05-21 cs.CL cs.AI cs.LG 版本更新

AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction

AI增强的调查:利用大型语言模型和调查进行意见预测

Junsol Kim, Byungkyu Lee

发表机构 * Department of Sociology(社会学系) University of Chicago(芝加哥大学) New York University(纽约大学) Chicago, IL(伊利诺伊州芝加哥市) New York, NY(纽约州纽约市)

AI总结 本文提出了一种基于大型语言模型的框架,通过结合问题、受访者和调查时期的嵌入表示,预测重复横断面调查中缺失的响应,从而弥补传统调查在捕捉历史变化方面的不足。

详情
AI中文摘要

全国代表性调查追踪公众意见,但每年只询问有限的问题,限制了其捕捉历史变化的潜力。为填补这一空白,我们开发了一个基于大型语言模型(LLM)的框架,通过结合问题、受访者和调查时期的嵌入表示,预测重复横断面调查中缺失的响应。我们引入了LLM在调查研究中的两个新应用:回溯预测(预测年度层面的缺失意见)和未询问意见预测(预测完全缺失的意见)。使用1972-2021年一般社会调查的数据,我们的LLM模型在交叉验证和在GSS未询问的年份中通过其他组织测量的公众意见方面表现良好。这些能力使我们能够恢复缺失的趋势并确定公众态度变化的时间,例如同性婚姻支持率的上升。然而,未询问意见预测的性能仍较为有限。我们展示了当我们的模型优于现有基准时的情况,检验了哪些意见和受访者更具可预测性,并评估了我们的方法是否减少了LLM预测响应的同质化倾向。我们的研究证明了LLM和调查可以相互增强:LLM扩大了调查的潜力,而调查则校准LLM以模拟人类意见。

英文摘要

Nationally representative surveys track public opinion, yet they ask only a limited set of questions each year, limiting its potential to capture historical changes. To fill this gap, we develop a large language model (LLM)-based framework for predicting missing responses in repeated cross-sectional surveys by incorporating embeddings for questions, respondents, and survey periods. We introduce two new applications of LLMs to survey research: retrodiction (predicting year-level missing opinions) and unasked opinion prediction (predicting entirely missing opinions). Using data from the 1972-2021 General Social Surveys, our LLM-based models perform strongly in retrodicting masked GSS opinions through cross-validation and public opinions measured by other organizations in years when the GSS did not ask them. These capabilities enable us to recover missing trends and pinpoint when public attitudes changed, such as the rising support for same-sex marriage. However, performance remains modest for unasked opinion prediction. We show when our models outperform established benchmarks, examine which opinions and and respondents are more predictable, and evaluate whether our approach reduces LLMs' tendency to homogenize predicted responses. Our study demonstrates that LLMs and surveys can mutually enhance each other: LLMs broaden survey potential, while surveys calibrate LLMs for simulating human opinions.