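arXivDaily daily academic digest: synchronized with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.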

2603.05917 2026-06-02 cs.LG cs.AI q-fin.ST

Stock Market Prediction Using Node Transformer Architecture Integrated with BERT Sentiment Analysis

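Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

Affiliations: University of Technology, Baghdad, Iraq

AI summary: Proposes a framework that combines a node transformer with BERT sentiment analysis, modeling inter-stock dependencies through a graph structure and fusing social media sentiment; it achieves a MAPE of 0.80% on S&P 500 stocks, significantly outperforming traditional methods.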

Comments 18 pages, 5 figures, 12 tables. Accepted for publication in IEEE Access

DOI: 10.1109/ACCESS.2026.3691980
Journal ref: IEEE Access, vol. 14, pp. 72613-72631, 2026

Abstract

Stock market prediction presents considerable challenges for investors, financial institutions, and policymakers operating in complex market environments characterized by noise, non-stationarity, and behavioral dynamics. Traditional forecasting methods, including fundamental analysis and technical indicators, often fail to capture the intricate patterns and cross-sectional dependencies inherent in financial markets. This paper presents an integrated framework combining a node transformer architecture with BERT-based sentiment analysis for stock price forecasting. The proposed model represents the stock market as a graph structure where individual stocks form nodes and edges capture relationships including sectoral affiliations, correlated price movements, and supply chain connections. A fine-tuned BERT model extracts sentiment information from social media posts and combines it with quantitative market features through attention-based fusion mechanisms. The node transformer processes historical market data while capturing both temporal evolution and cross-sectional dependencies among stocks. Experiments conducted on 20 S&P 500 stocks spanning January 1982 to March 2025 demonstrate that the integrated model achieves a mean absolute percentage error (MAPE) of 0.80% for one-day-ahead predictions, compared to 1.20% for ARIMA and 1.00% for LSTM. The inclusion of sentiment analysis reduces prediction error by 10% overall and 25% during earnings announcements, while the graph-based architecture contributes an additional 15% improvement by capturing inter-stock dependencies. Directional accuracy reaches 65% for one-day forecasts. Statistical validation through paired t-tests confirms the significance of these improvements (p < 0.05 for all comparisons). The model maintains lower error during high-volatility periods, achieving MAPE of 1.50% while baseline models range from 1.60% to 2.10%.
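
The two headline metrics in this abstract, MAPE and directional accuracy, can be sketched as follows. This is a minimal illustration using their standard definitions, not the authors' evaluation code; the price series and the function names `mape` and `directional_accuracy` are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def directional_accuracy(prev_close, actual, predicted):
    """Percent of days where the predicted move has the same sign as the real move."""
    hits = sum(
        1 for c, a, p in zip(prev_close, actual, predicted)
        if (a - c) * (p - c) > 0
    )
    return 100.0 * hits / len(actual)


# Hypothetical closing prices for illustration only, not data from the paper.
prev_close = [100.0, 102.0, 101.0, 103.0]
actual = [102.0, 101.0, 103.0, 104.0]
predicted = [101.0, 102.5, 102.0, 103.5]

print(f"MAPE: {mape(actual, predicted):.2f}%")
print(f"Directional accuracy: {directional_accuracy(prev_close, actual, predicted):.0f}%")
```

For scale, going from a MAPE of 1.20% (ARIMA) to 0.80% is a relative reduction of (1.20 - 0.80) / 1.20, about 33%, and going from 1.00% (LSTM) to 0.80% is a 20% relative reduction.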

2605.15141 2026-06-02 cs.CV

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, Jun Zhu

Affiliations: Tsinghua University; ShengShu; Renmin University of China

AI summary: Proposes the Causal Forcing++ framework, which uses causal consistency distillation (causal CD) to achieve frame-wise 1-2 step autoregressive diffusion distillation, improving video generation quality while reducing latency and training cost.

Abstract

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1-2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by $\sim 4\times$. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM .

2605.14709 2026-06-02 cs.CV

Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners

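Qingyang Liu, Bingjie Gao, Canmiao Fu, Zhipeng Huang, Chen Li, Feng Wang, Shuochen Chang, Shaobo Wang, Yali Wang, Keming Ye, Jiangtong Li, Li Niu

Affiliations: Tsinghua University

AI summary: Addresses the attention entanglement and visual refinement bottlenecks caused by the gap between understanding and generation in unified multimodal models, proposing a framework that adaptively switches generation strategies and improves X2I performance through a hierarchical data pipeline and two-stage training (SFT + RL).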

Comments Accepted by ICML 2026

Abstract

Recent unified models integrate multimodal understanding and generation within a single framework. However, an "understanding-generation gap" persists, where models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in anything-to-image (X2I) tasks: the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a novel framework that empowers unified models to autonomously switch between generation strategies based on instruction complexity and model capability. To achieve this, we build a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases, self-reflection for quality refinement, and multi-step planning for decomposing complex scenarios. Building on this pipeline, we contribute a high-quality dataset with over 50,000 samples and implement a two-stage training strategy comprising SFT and RL. Specifically, we design step-wise reasoning rewards to ensure logical consistency and an intra-group complexity penalty to prevent redundant computational overhead. Extensive experiments demonstrate that our method outperforms existing baselines on X2I, achieving superior generation fidelity across simple-to-complex instructions. The code is released at https://github.com/WeChatCV/Interleaved_Visual_Reasoner.

2605.14398 2026-06-02 cs.AI

Coding Agent Is Good As World Simulator

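Hongyu Wang, Jingquan Wang, Bocheng Zou, Radu Serban, Dan Negrut

Affiliations: Department of Mechanical & Aerospace Engineering, University of Wisconsin-Madison; School of Computer, Data, and Information Sciences, University of Wisconsin-Madison

AI summary: Proposes an agentic framework that builds physics-based world models through executable simulation code, coordinating planning, code generation, visual review, and physics analysis agents and iteratively revising the code to satisfy physical constraints; it surpasses video models in physical accuracy, instruction fidelity, and visual quality.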

Abstract

World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are not physically plausible, exhibiting unstable contacts, distorted shapes, or inconsistent motion. In this paper, we present an agentic framework constructing physics-based world models through executable simulation code. The framework coordinates planning, code generation, visual review, and physics analysis agents. The planning agent converts the natural language prompt into a structured scene plan, the code agent implements it as executable simulation code, and the visual review agent provides visual feedback while the physics analysis agent checks physical consistency. The code is iteratively revised based on the feedback until the simulation matches the prompt requirements and physical constraints. Experimental results show that our framework outperforms advanced video-based models in physical accuracy, instruction fidelity and visual quality, and could be applied to various scenarios including driving simulation and embodied robot tasks.

2605.13527 2026-06-02 cs.AI

MMSkills: Towards Multimodal Skills for General Visual Agents

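Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu

Affiliations: Shanghai Jiao Tong University; Xiaohongshu Inc.; Southeast University

AI summary: Proposes the MMSkills framework, which encodes multimodal procedural knowledge as compact, state-conditioned packages (a textual procedure, runtime state cards, and multi-view keyframes) and uses a trajectory-to-skill generator and a branch-loaded multimodal skill agent to significantly improve visual-agent decision making in GUI and game settings.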

Comments 25 pages, 8 figures, 8 tables. Project page: https://zkangning.github.io/MMSkills_for_Visual_Agents/

Abstract

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.

2602.19612 2026-06-02 cs.CL

Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

Anna Borisiuk, Andrey Savchenko, Alexander Panchenko, Elena Tutubalina

Affiliations: AIRI; Sber AI Lab; HSE University; Skoltech; ISP RAS Research Center for Trusted Artificial Intelligence

AI summary: By constructing the DUET benchmark, this study analyzes how fact salience in the pre-training and supervised fine-tuning stages affects machine unlearning, finding that fine-tuned models forget more smoothly and have higher retention.

2605.09692 2026-06-02 cs.AI

Causal state binding predicts action control in language agents

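Xiao Jia

Affiliations: School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes the causal state binding framework, which uses intervention experiments to measure whether an agent's actions change with the event-specific decisive state, and verifies that structured agents outperform random baselines in action control.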

Comments 85 pages, 5 main figures; supplementary information included

Abstract

Autonomous language agents increasingly expose traces, memories, plans and constraints, but existing evaluations rarely test whether these state variables are bound to final actions. We introduce causal state binding, an intervention-coupled evaluation framework that measures whether actions change with the event-specific decisive state while remaining invariant to irrelevant cues. The primary readout is a hidden-target finite-action benchmark in which scorer-side intervention targets are assigned before generation and withheld from the model-visible prompt. Across 57,816 scored records in seven corpus-level units, structured-agent conditions exceeded high-randomness controls and targeted component removals on reason, memory, veto and self-continuity responsiveness. Open-weight validation across Qwen2.5 7B, 14B and 32B plus Mistral-7B showed that action priors, no-field prompts and scrambled decisive context did not recover the structured-control signature. In diagnostic finite-action probes, the minimal decisive-field readout recovered the prescribed action pattern whereas surface-only, action-prior-only and scrambled-field controls did not. Across 300 SWE-bench Lite issue records and six API models, adding an oracle-free causal state-binding composite to a full non-CSB baseline increased constraint-clean issue-to-file hit@3 AUC from 0.873 to 0.935. This validation concerns issue-to-file localization, not patch application or SWE-bench issue resolution. These results support a measurement principle for agent evaluation: action control is predicted by event-specific state-action binding, not by output entropy, action-prior matching or rationale format alone.

2605.08193 2026-06-02 cs.CV cs.AI

Normalization Equivariance for Arbitrary Backbones, with Application to Image Denoising

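Youssef Saied, François Fleuret

Affiliations: University of Cambridge; DeepMind

AI summary: Proposes WNE, a parameter-free wrapper that achieves normalization equivariance by normalizing the input, applying an arbitrary backbone, and denormalizing the output; in blind denoising it improves the robustness of CNNs and transformers to noise-level mismatch with no GPU overhead.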

Abstract

Normalization Equivariance (NE) is a structural prior that improves robustness to distribution shift in image-to-image tasks. A function $f$ is normalization equivariant iff $f(a y + b\mathbf{1}) = a f(y) + b\mathbf{1}$ for all $a>0$ and $b\in\mathbb{R}$. Existing NE methods constrain every internal layer to NE-compatible operations. These constraints add runtime cost and exclude standard transformer components such as softmax attention and LayerNorm. We introduce Wrapped Normalization Equivariance (WNE), a parameter-free wrapper that normalizes the input, applies any backbone, and denormalizes the output. We prove every NE function admits this factorization, so the wrapper exactly parameterizes the class of NE functions. On blind denoising, wrapping CNN and transformer architectures improves robustness under noise-level mismatch with no measurable GPU overhead, while architectural NE baselines are up to $1.6\times$ slower.
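
The normalize, apply, denormalize pattern described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not say which statistics WNE uses, so mean and standard deviation are assumed here, and `wrap_normalization_equivariant` and `toy_backbone` are hypothetical names.

```python
import math
from statistics import fmean, pstdev


def wrap_normalization_equivariant(backbone):
    """Return a function that normalizes, applies `backbone`, then denormalizes."""
    def wrapped(y):
        mu = fmean(y)
        sigma = pstdev(y)
        if sigma == 0.0:
            raise ValueError("constant input: scale is undefined in this sketch")
        z = [(v - mu) / sigma for v in y]     # unchanged under y -> a*y + b with a > 0
        out = backbone(z)                     # arbitrary, unconstrained map
        return [sigma * v + mu for v in out]  # restore the input's scale and offset
    return wrapped


def toy_backbone(z):
    """Hypothetical nonlinear backbone that is not normalization equivariant by itself."""
    return [math.tanh(v) + 0.1 * v * v for v in z]


f = wrap_normalization_equivariant(toy_backbone)
y = [0.2, 0.5, 0.9, 0.4]
a, b = 3.0, -1.5

lhs = f([a * v + b for v in y])       # f(a*y + b*1)
rhs = [a * v + b for v in f(y)]       # a*f(y) + b*1
print(max(abs(l - r) for l, r in zip(lhs, rhs)))  # gap is at floating-point level
```

The check works because the affine map shifts the mean to a*mu + b and scales the standard deviation to a*sigma, so the normalized input z is identical in both cases and only the denormalization step differs.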

2605.13834 2026-06-02 cs.LG cs.AI cs.CG

Topology-Preserving Neural Operator Learning via Hodge Decomposition

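Dongzhe Zheng, Tao Zhong, Christine Allen-Blanchette

Affiliations: University of California, Berkeley

AI summary: Studies solution operators of physical field equations on geometric meshes from a function-space perspective, using Hodge orthogonality to separate unlearnable topological degrees of freedom from learnable geometric dynamics, and proposes a hybrid Eulerian-Lagrangian architecture based on Hodge Spectral Duality that improves accuracy and efficiency on geometric graphs while preserving physical invariants.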

Comments Accepted at ICML 2026. Code available at https://github.com/ContinuumCoder/Hodge-Spectral-Duality

Abstract

In this paper, we study solution operators of physical field equations on geometric meshes from a function-space perspective. We reveal that Hodge orthogonality fundamentally resolves spectral interference by isolating unlearnable topological degrees of freedom from learnable geometric dynamics, enabling an additive approximation confined to structure-preserving subspaces. Building on Hodge theory and operator splitting, we derive a principled operator-level decomposition. The result is a Hybrid Eulerian-Lagrangian architecture with an algebraic-level inductive bias we call Hodge Spectral Duality (HSD). In our framework, we use discrete differential forms to capture topology-dominated components and an orthogonal auxiliary ambient space to represent complex local dynamics. Our method achieves superior accuracy and efficiency on geometric graphs with enhanced fidelity to physical invariants. Our code is available at https://github.com/ContinuumCoder/Hodge-Spectral-Duality

2509.24627 2026-06-02 cs.LG

Learning Hamiltonian Dynamics at Scale: A Differential-Geometric Approach

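Katharina Friedl, Noémie Jaquier, Alyx Liao, Danica Kragic

AI summary: Proposes the reduced-order Hamiltonian neural network (RO-HNN), which combines the conservation laws of Hamiltonian mechanics with the scalability of model order reduction, using a geometrically constrained symplectic autoencoder and a geometric Hamiltonian neural network to make physically consistent predictions for high-dimensional dynamical systems.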

Comments 32 pages, 21 figures, Intl. Conference on Machine Learning (ICML), 2026

Abstract

Embedding physical intuition into network architectures allows the learning of dynamics that enforce fundamental properties, such as energy conservation laws, thereby leading to physically-plausible predictions. Yet, scaling these models to high-dimensional dynamical systems remains a significant challenge. This paper introduces Reduced-order Hamiltonian Neural Network (RO-HNN), a novel physics-inspired neural network that combines the conservation laws of Hamiltonian mechanics with the scalability of model order reduction. RO-HNN is built on two core components: a novel geometrically-constrained symplectic autoencoder that learns a low-dimensional, structure-preserving symplectic submanifold, and a geometric Hamiltonian neural network that models the dynamics on the submanifold. Our experiments demonstrate that RO-HNN provides physically-consistent, stable, and generalizable predictions of complex high-dimensional dynamics, thereby effectively extending the scope of Hamiltonian neural networks to high-dimensional physical systems.
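
To make the energy-conservation point concrete, the sketch below integrates Hamilton's equations for a one-dimensional oscillator with a structure-preserving (symplectic) scheme and with plain explicit Euler. It is a generic textbook illustration with a hand-written Hamiltonian, not RO-HNN and not a learned model; all names and parameter values are illustrative.

```python
def hamiltonian(q, p):
    """Energy of a unit-mass, unit-stiffness oscillator: H = (p^2 + q^2) / 2."""
    return 0.5 * (p * p + q * q)


def symplectic_euler(q, p, dt, steps):
    """Integrate dq/dt = dH/dp = p and dp/dt = -dH/dq = -q with a symplectic scheme."""
    for _ in range(steps):
        p = p - dt * q   # momentum update uses the current position
        q = q + dt * p   # position update uses the new momentum
    return q, p


def explicit_euler(q, p, dt, steps):
    """Non-symplectic baseline: both updates use the old state, so energy grows."""
    for _ in range(steps):
        q, p = q + dt * p, p - dt * q
    return q, p


q0, p0, dt, steps = 1.0, 0.0, 0.01, 10_000
e0 = hamiltonian(q0, p0)
drift_symplectic = abs(hamiltonian(*symplectic_euler(q0, p0, dt, steps)) - e0)
drift_explicit = abs(hamiltonian(*explicit_euler(q0, p0, dt, steps)) - e0)
print(drift_symplectic, drift_explicit)
```

The symplectic scheme keeps the energy error bounded over the whole rollout, while explicit Euler multiplies the energy by (1 + dt^2) at every step, which is the kind of unphysical drift that structure-preserving models aim to avoid.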

2505.12741 2026-06-02 cs.AI

Language Model Networks: Supervision-Efficient Learning through Dense Communication

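Shiguang Wu, Yaqing Wang, Quanming Yao

Affiliations: University of Science and Technology of China

AI summary: Proposes LMNet, a dense, differentiable language model network in which trainable sequence-to-sequence modules act as communication edges, enabling dense vector exchange between nodes, end-to-end gradient optimization, and efficient information transfer, with effective adaptation under limited supervision.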

Abstract

Language models are increasingly used not only as standalone predictors but also as components in larger inference systems, from test-time scaling to multi-agent collaboration. We study language model networks, where pre-trained language models serve as reusable nodes and intelligence emerges from their topology, communication, and optimization. Existing systems mostly communicate through natural language: easy to deploy, but discrete, inefficient, and hard to optimize from end-task supervision. We propose LMNet, a dense and differentiable realization of this paradigm. LMNet uses stripped LLMs as vertex modules and trainable seq2seq modules as communication edges, enabling intermediate nodes to exchange dense vectors while preserving natural-language input and output at the system boundary. By bypassing intermediate embedding and de-embedding, LMNet enables efficient information transfer, end-to-end gradient optimization, and learned communication beyond hand-designed protocols. Experiments show good performance at a small additional training cost and effective adaptation under limited supervision.

2605.13178 2026-06-02 cs.CV cs.AI

CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models

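Sangin Lee, Yukyung Choi

Affiliations: KAIST

AI summary: Proposes LiteLVLM, a training-free, text-guided token pruning strategy that reverses the ranking of CLIP visual-text similarity to retain referent-region tokens and recover context tokens for efficient pixel grounding inference; it improves performance by over 5% across token budgets and maintains 90% of the original performance with a 22% speedup and a 2.3× memory reduction.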

Comments Accepted by ICML 2026

Abstract

In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in-depth analysis of CLIP, we observe that visual tokens within referent regions often exhibit low similarity to their textual representation. Motivated by this insight, we introduce LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP's visual-text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear foreground-background separation. Extensive experiments demonstrate that LiteLVLM significantly outperforms existing methods by over 5% across diverse token budgets. Without any training or fine-tuning, LiteLVLM maintains 90% of the original performance with a 22% speedup and a 2.3X memory reduction. Our code is available at https://github.com/sejong-rcv/LiteLVLM.
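
The core selection rule in this abstract, keeping the visual tokens that rank lowest in similarity to the text, can be sketched as below. This is a toy illustration only: the embeddings are hypothetical 2-D vectors rather than CLIP features, the context-token recovery step is omitted, and `prune_tokens_reversed` is an illustrative name, not a function from the LiteLVLM repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def prune_tokens_reversed(visual_tokens, text_embedding, budget):
    """Keep the `budget` visual tokens with the lowest similarity to the text.

    Returns the kept token indices in their original order.
    """
    sims = [cosine(tok, text_embedding) for tok in visual_tokens]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i])  # ascending = reversed ranking
    return sorted(ranked[:budget])


# Hypothetical embeddings for five visual tokens and one text query.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2], [0.5, 0.5]]
text = [1.0, 0.0]
print(prune_tokens_reversed(tokens, text, budget=2))
```

A conventional pruner would keep the highest-similarity tokens instead; flipping the sort order is the only change needed to express the reversed ranking.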

2605.13175 2026-06-02 cs.LG

Do Heavy Tails Help Diffusion? On the Subtle Trade-off Between Initialization and Training

Hamza Cherkaoui, Hélène Halconruy, Antonio Ocello

Affiliations: SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, Palaiseau, France; Modal’X, Université Paris Nanterre, Nanterre, France; CREST, ENSAE Paris, Institut Polytechnique de Paris, Palaiseau, France

AI summary: Through a combined theoretical and experimental study, compares heavy-tailed noise with light-tailed Gaussian noise in diffusion models, finding that heavy-tailed noise may better match the data tails but makes statistical estimation harder, leading to worse sampling error.

Abstract

Recent works have proposed incorporating heavy-tailed (HT) noise into diffusion- and flow-based generative models, with the goals of better recovering the tails of target distributions and improving generative diversity. This motivation is intuitive: if the data are heavy-tailed, HT noise may appear better matched than light-tailed (LT) Gaussian noise. However, replacing Gaussian noise by HT noise also changes the underlying estimation problem. In this paper, we revisit this paradigm through a combined theoretical and empirical study, establishing sampling-error bounds for two representative diffusion models driven by HT and LT noise. We show that HT noise makes the statistical estimation problem harder, leading to less favorable sampling-error bounds. We support these findings with experiments on synthetic and real-world datasets, empirically recovering the predicted error trade-off. Our results call into question a growing design trend in generative modeling and challenge the use of HT noise to improve rare-region exploration.

2605.13136 2026-06-02 cs.CL

GateKD: Confidence-Gated Closed-Loop Distillation for Robust Reasoning

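Kasidit Sermsri, Teerapong Panboonyuen

Affiliations: Chulalongkorn University; MARSAIL; PBYAIL (Panboonyuen AI Lab)

AI summary: Proposes the GateKD framework, which uses confidence-gated closed-loop distillation (soft supervision, hidden-state evolution, and attention distillation) to reduce noise propagation and improve the robustness of student models on commonsense, logical, and symbolic reasoning.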

Comments 16 pages

Abstract

Distilling multi-step reasoning abilities from large language models (LLMs) into compact student models remains challenging due to noisy rationales, hallucinated supervision, and static teacher-student interactions. Existing reasoning distillation methods, including mentor-based approaches, predominantly operate in an open-loop manner, implicitly assuming uniform teacher reliability and consequently propagating erroneous intermediate reasoning. We propose GateKD, a confidence-gated closed-loop distillation framework that enables robust reasoning transfer by treating the teacher as a dynamic gatekeeper rather than a static oracle. GateKD introduces three complementary mechanisms: (i) confidence-gated soft supervision that selectively distills reliable predictive signals, (ii) gated hidden-state evolution that aligns intermediate representations only when teacher confidence is high, and (iii) reliability-filtered attention distillation that preserves stable reasoning structures while suppressing noisy patterns. These components jointly form a closed feedback loop in which teacher confidence continuously modulates the distillation process, reducing hallucination transfer and stabilizing student reasoning. Extensive experiments across commonsense, logical, and symbolic reasoning benchmarks, using T5 and Flan-T5 backbones of varying sizes, demonstrate that GateKD consistently outperforms strong open-loop distillation baselines. Notably, GateKD yields substantial gains in logical and symbolic reasoning, remains robust under low-resource distillation settings, and shows clear performance degradation when any gating component is removed. Our results highlight that confidence-gated closed-loop supervision is critical for building reliable and scalable small reasoning models.
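
Mechanism (i), confidence-gated soft supervision, might look roughly like the sketch below. The abstract does not define the confidence measure or the gate, so this assumes a hard threshold on the teacher's maximum softmax probability and a temperature-softened KL loss; the threshold, temperature, logits, and function names are all hypothetical and are not taken from GateKD.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def gated_distillation_loss(teacher_logits, student_logits, threshold=0.7, temperature=2.0):
    """Average KL(teacher || student) over examples whose teacher confidence passes the gate.

    Returns (loss, number_of_examples_kept).
    """
    total, kept = 0.0, 0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        confidence = max(softmax(t_logits))   # assumed gate: top teacher probability
        if confidence < threshold:
            continue                          # low-confidence teacher signal is skipped
        total += kl_divergence(softmax(t_logits, temperature), softmax(s_logits, temperature))
        kept += 1
    return (total / kept if kept else 0.0), kept


# Hypothetical logits: the first teacher row is confident, the second is nearly uniform.
teacher = [[4.0, 0.5, 0.1], [1.0, 0.9, 1.1]]
student = [[2.0, 1.0, 0.5], [0.2, 2.0, 0.1]]
loss, kept = gated_distillation_loss(teacher, student)
print(loss, kept)
```

With these inputs only the first example contributes to the loss; the near-uniform teacher row is filtered out, which is the intended effect of gating on reliability.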

2605.12895 2026-06-02 cs.LG cs.AI cs.CY stat.AP

RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare

Rohith Reddy Bellibatlu, Manpreet Singh, Yash Jajoo, Shyamal Lakhanpal, Abhishek Israni

Affiliations: Florida International University; Boston University; New York University; University of Maryland; Boston University School of Public Health

AI summary: Proposes the RISED framework, which evaluates high-stakes AI decision-support systems along five dimensions using BCa bootstrap confidence intervals, literature-grounded thresholds, and Holm-Bonferroni-corrected PASS/FAIL/INCONCLUSIVE verdicts, and uncovers failure modes on healthcare and other datasets that AUROC does not reveal.

Comments 39 pages, 7 figures, 15 tables. Code at https://github.com/rohithreddybc/rised-healthcare-eval and dataset at https://doi.org/10.57967/hf/8734 (Hugging Face). To be submitted to Expert Systems with Applications (Elsevier)

Abstract

Clinical decision-support systems are expert systems whose recommendations clinicians act on directly, yet they are usually cleared on one aggregate accuracy number from a held-out test set. That number says nothing about input reliability under encoding shifts, subgroup gaps, threshold sensitivity, or operational feasibility. We present RISED, a pre-deployment evaluation framework operationalising five dimensions (Reliability, Inclusivity, Sensitivity, Equity, Deployability) through BCa bootstrap 95% confidence intervals, literature-grounded thresholds, and Holm-Bonferroni-corrected PASS / FAIL / INCONCLUSIVE verdicts; Equity is a proxy-dependence diagnostic rather than a gating test. Applied to seven cohorts spanning 35 years (n from 303 to 99,492), RISED surfaces failures invisible to AUROC: on Diabetes 130, Reliability passes by three orders of magnitude (PSS = 0.0004) while Inclusivity (AUC parity gap = 0.262) and Sensitivity (max threshold-flip rate 49.1%) fail decisively; both NHIS cohorts reproduce this. NHANES 2021-2023, with a complete feature profile, achieves INCONCLUSIVE verdicts; BRFSS 2024 produces the suite's most severe Sensitivity failure (max threshold-flip rate 64.2%) after instrument rotation removed hypertension and cholesterol. The pattern recurs on credit- and income-prediction cohorts, confirming domain-agnosticity; a multi-model check shows the failures are data-driven, not model-specific. RISED ships as an open-source Python package complementing TRIPOD+AI, FUTURE-AI, and Fairlearn with the structured numerical evidence those standards require but do not prescribe.
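The Holm-Bonferroni correction named above is a standard step-down procedure, sketched here in plain Python. This is not the RISED package's API; the function name is hypothetical, and how RISED maps corrected tests onto PASS / FAIL / INCONCLUSIVE verdicts is not stated in the abstract and is not shown.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    The p-values are visited from smallest to largest. The (rank+1)-th
    smallest is compared with alpha / (m - rank); the procedure stops at
    the first p-value that fails its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: every larger p-value is also retained
    return reject
```

With four tests at `alpha=0.05`, the thresholds are 0.0125, 0.0167, 0.025, and 0.05 in order of increasing p-value, so a p-value of 0.03 in third place is not rejected even though it is below 0.05.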

2605.12813 2026-06-02 cs.CL cs.AI cs.CR cs.LG

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

Buyun Liang, Jinqi Luo, Liangzu Peng, Kwan Ho Ryan Chan, Darshan Thaker, Kaleab A. Kinfu, Fengrui Tian, Hamed Hassani, René Vidal

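Affiliations: University of Waterloo

AI summary: Proposes REALISTA, a framework that optimizes semantically equivalent adversarial prompts in latent space to elicit hallucinations from large language models, matching or outperforming existing realistic attacks.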

Comments Accepted at ICML 2026. Code is available at https://github.com/Buyun-Liang/REALISTA

Abstract

Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, making it important to systematically evaluate their reliability under realistic adversarial inputs. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing attack methods remain limited: discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent-space attack framework. REALISTA constructs an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing-based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail. Code is available at https://github.com/Buyun-Liang/REALISTA.

2605.12689 2026-06-02 cs.RO

3D RL-DWA: A Hybrid Reinforcement Learning and Dynamic Window Approach for Goal-Directed Local Navigation in Multi-DoF Robots

Chiara Castellani, Enrico Turco, Domenico Prattichizzo

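Affiliations: European Union’s Horizon Europe Research and Innovation Programme

AI summary: Proposes a hybrid framework combining reinforcement learning with the Dynamic Window Approach. It uses sparse point cloud data to adjust the motion and shape of a deformable microrobot, navigating to a goal in complex, constrained environments while maximizing occupied volume; experiments show better deformation and navigation than pure reinforcement learning and model-based methods.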

Comments Accepted for publication in the Proceedings of the IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM 2026)

Abstract

In this paper, we present a novel hybrid approach that combines Reinforcement Learning (RL) with the Dynamic Window Approach (DWA) for adaptive 3D local navigation of high-degree-of-freedom robotic systems. Our method leverages sparse point cloud data to dynamically adjust both the motion and the shape of a deformable microrobot, enabling the system to navigate toward a goal in complex, constrained environments while maximizing the occupied volume. We evaluate our framework in a simulated vascular network. Experimental results, based on 1080 trials, indicate that integrating RL with a DWA-based local planner significantly enhances both deformation and navigation capabilities compared to pure RL and model-based methods. In particular, the proposed autonomous controller consistently achieves high deformation and near-perfect path completion during training and maintains robust performance in unseen scenarios. These findings highlight the potential of hybrid planning strategies for efficient and adaptive 3D navigation under sparse sensory conditions.

2605.12652 2026-06-02 cs.LG cs.AI

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

Weichen Yu, Xiaomin Li, Yizhou Zhao, Xiaoze Liu, Ruowang Zhang, Haixin Wang, Yinyi Luo, Chen Henry Wu, Gaurav Mittal, Matt Fredrikson, Yu Hu

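Affiliations: Carnegie Mellon University; Microsoft; Purdue University

AI summary: Proposes Multi-Rollout On-Policy Distillation (MOPD), which uses the student's local rollout group to build more informative teacher signals, improving distillation by conditioning the teacher on peer successes and failures.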

Comments 23 pages

Abstract

Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (OPD) offers denser token-level supervision by training on student-generated trajectories, yet existing methods typically distill each rollout independently and ignore the other attempts sampled for the same prompt. We introduce Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes provide positive evidence for valid reasoning patterns, while failures provide structured negative evidence about plausible mistakes to avoid. We study two peer-context constructions: positive peer imitation and contrastive success-failure conditioning. Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that MOPD consistently improves over standard on-policy baselines. Further teacher-signal analysis shows that mixed success-failure contexts better align teacher scores with verifier rewards, indicating that the gains arise from more faithful, instance-adaptive supervision. These results indicate that effective on-policy distillation should exploit the student's multi-rollout trial-and-error behavior rather than treating rollouts as isolated samples.

2605.11374 2026-06-02 cs.LG cs.CL cs.IR

Test-Time Compute for Frozen Embedding Models through Agentic Program Search

Han Xiao

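Affiliations: Jina AI by Elastic

AI summary: Proposes an agentic program search in which a large language model writes programs over a frozen encoder API, improving the retrieval quality of small embedding models at inference time with no trained parameters and with transfer across tasks.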

Comments 15 pages, 7 figures, 4 tables

Abstract

Test-time compute is widely believed to benefit only large reasoning models, leaving small models with nothing to gain. We argue the opposite for dense retrieval, since modern small embedding models are distilled or adapted from large language model backbones and can inherit their latent test-time-compute potential. We ask how much retrieval quality a frozen embedding model gains at inference alone, with no auxiliary model and no parameters trained at deployment. An agentic loop in which a large language model writes programs over a frozen encoder API explores 144 candidates and yields twelve Pareto-optimal programs that trade inference compute for quality across cost ratios from $c{=}1.2$ to $14.7$, every one improving nDCG@10 on all 14 discovery tasks. The programs use no trainable parameters and recover classical retrieval primitives, among them reciprocal rank fusion, the Fisher linear discriminant, Rocchio pseudo-relevance feedback, and sentence-level MaxSim. Applied unmodified to nineteen held-out tasks and three unseen encoder families, a single fixed program improves the majority of tasks, with a positive median $Δ$nDCG@10 and a 54 to 57% win-rate at $c{\ge}4$, and the gains are largest on encoder families never seen during discovery. A matched-budget learned projection head trained on the same tasks does not transfer this way, improving in-domain retrieval by $+0.20$ to $+0.25$ nDCG@10 yet falling below baseline on every held-out encoder. Small embedding models therefore inherit usable test-time-compute potential, and a frozen encoder converts inference compute into retrieval gains that transfer to new corpora and encoders with no per-domain labels.
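Of the classical primitives the programs recover, reciprocal rank fusion is the simplest to show. The sketch below is the textbook form, in which each document scores the sum of `1 / (k + rank)` over the input rankings; the constant `k=60` is the conventional default and an assumption here. How the discovered programs actually combine it with the frozen encoder is not described in the abstract.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: list of lists, each ordered from best to worst.
    Returns document ids sorted by fused score (ties broken by id).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))
```

In a test-time-compute setting, the input rankings could come from several encoder passes over reformulated queries, which is one way extra inference compute can be traded for quality without training any parameters.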

2604.24919 2026-06-02 cs.CV

Agentic AI for Remote Sensing: Technical Challenges and Research Directions

Muhammad Akhtar Munir, Muhammad Umer Sheikh, Akashah Shabbir, Muhammad Haris Khan, Fahad Khan, Xiao Xiang Zhu, Begüm Demir, Salman Khan

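Affiliations: Mohamed bin Zayed University of Artificial Intelligence

AI summary: Argues that multi-step analytical workflows in remote sensing face structural geospatial constraints, and proposes design principles for Earth-observation-native agents: structured geospatial state, tool-aware reasoning, verifier-guided execution, and validity-aware learning and evaluation.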

Comments 31 pages. Position Paper

Abstract

Earth Observation (EO) is moving beyond static prediction toward multi-step analytical workflows that require coordinated reasoning over data, tools, and geospatial state. While foundation models and vision-language models have advanced representation learning and language-grounded interaction in remote sensing, and agentic AI has shown strong potential for long-horizon reasoning and tool use, EO is not a straightforward extension of generic agentic AI. EO workflows operate on georeferenced, multi-modal, and temporally structured data, where operations such as reprojection, resampling, compositing, and aggregation transform the underlying state and can constrain later analysis. As a result, errors may propagate silently across steps, and correctness depends not only on internal coherence but also on geospatial consistency, temporally valid comparisons, and physical validity. This position paper argues that these challenges are structural rather than incidental. We examine the assumptions commonly made in generic agentic systems, analyze how they break in geospatial workflows, and characterize failure modes in multi-step EO pipelines. We then outline design principles for EO-native agents centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and validity-aware learning and evaluation. Building reliable geospatial agents, therefore, requires rethinking agent design around the physical, geospatial, and workflow constraints that govern EO analysis.

2512.12634 2026-06-02 cs.AI

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

Youngmin Im, Byeongung Jo, Jaeyoung Wi, Seungwoo Baek, Tae Hoon Min, Joo Hyung Lee, Sangeun Oh, Insik Shin, Sunjae Lee

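Affiliations: KAIST; Sungkyunkwan University; Korea University; Fluiz

AI summary: Proposes MobiBench, the first modular, multi-path-aware offline benchmarking framework for high-fidelity, scalable, and reproducible evaluation of mobile GUI agents, and uses it to reveal component-level performance bottlenecks.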

Abstract

Mobile GUI agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human-computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, they rely either on single-path offline benchmarks or on online live benchmarks. Offline benchmarks using static, single-path annotated datasets unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility due to the dynamic and unpredictable nature of live evaluation. Second, existing benchmarks treat agents as monolithic black boxes, overlooking the contributions of individual components, which often leads to unfair comparisons or obscures key performance bottlenecks. To address these limitations, we present MobiBench, the first modular and multi-path-aware offline benchmarking framework for mobile GUI agents that enables high-fidelity, scalable, and reproducible evaluation entirely in offline settings. Our experiments demonstrate that MobiBench achieves 94.72% agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks. Furthermore, our comprehensive module-level analysis uncovers several key insights, including a systematic evaluation of diverse techniques used in mobile GUI agents, optimal module configurations across model scales, the inherent limitations of current LFMs, and actionable guidelines for designing more capable and cost-efficient mobile agents.

2605.12400 2026-06-02 cs.LG cs.AI

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

Yuxiao Yang, Xiaoyun Wang, Weitong Zhang

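Affiliations: UNC Chapel Hill

AI summary: Proposes OGLS-SD, a framework that calibrates teacher logits with outcome rewards to address the training instability caused by mismatched teacher and student response patterns in on-policy self-distillation, improving mathematical reasoning performance.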

Comments 17 pages, 10 figures, 5 tables

Abstract

We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite its promise, OPSD can suffer from training instability due to a pattern mismatch between teacher and student responses. Self-reflected teacher responses may introduce reflection-induced biases and response templates that miscalibrate token-level supervision, ultimately harming the student's reasoning ability. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to calibrate privileged teacher logits. Specifically, OGLS-SD contrasts teacher logits induced by successful and failed on-policy trajectories, constructing an outcome-discriminative steering direction for token-level guidance. Experiments on mathematical reasoning benchmarks show that OGLS-SD stabilizes self-distillation and improves performance over standard OPSD and other variants.
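One way to read "contrasting teacher logits induced by successful and failed trajectories" is as a difference of means over the vocabulary dimension, which is then added to the teacher logits with a strength `alpha`. The sketch below illustrates only that reading; the difference-of-means form, the additive steering, and every name in it are assumptions, not the paper's stated formulation.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized logit vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_direction(success_logits, failure_logits):
    """Hypothetical outcome-discriminative direction: mean(success) - mean(failure)."""
    mu_pos = mean_vector(success_logits)
    mu_neg = mean_vector(failure_logits)
    return [p - n for p, n in zip(mu_pos, mu_neg)]

def steer_logits(teacher_logits, direction, alpha=1.0):
    """Shift the teacher logits along the direction before distillation."""
    return [t + alpha * d for t, d in zip(teacher_logits, direction)]
```

Tokens favored under successful trajectories gain logit mass and tokens favored under failed ones lose it, so the verifiable outcome reward shapes the token-level target rather than only a sequence-level signal.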

2605.12377 2026-06-02 cs.CV

Fast Image Super-Resolution via Consistency Rectified Flow

Jiaqi Xu, Wenbo Li, Haoze Sun, Fan Li, Zhixin Wang, Long Peng, Jingjing Ren, Haoran Yang, Xiaowei Hu, Renjing Pei, Pheng-Ann Heng

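Affiliations: The Chinese University of Hong Kong; Huawei Noah’s Ark Lab; HKUST (GZ); South China University of Technology

AI summary: Proposes FlowSR, which reformulates super-resolution as a rectified flow from low-resolution to high-resolution images and uses an improved consistency learning strategy to achieve high-quality super-resolution in a single step.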

Comments Accepted by ICCV 2025; Code: https://github.com/jiaqixuac/FlowSR

Abstract

Diffusion models (DMs) have demonstrated remarkable success in real-world image super-resolution (SR), yet their reliance on time-consuming multi-step sampling largely hinders their practical applications. While recent efforts have introduced few- or single-step solutions, existing methods either inefficiently model the process from noisy input or fail to fully exploit iterative generative priors, compromising the fidelity and quality of the reconstructed images. To address this issue, we propose FlowSR, a novel approach that reformulates the SR problem as a rectified flow from low-resolution (LR) to high-resolution (HR) images. Our method leverages an improved consistency learning strategy to enable high-quality SR in a single step. Specifically, we refine the original consistency distillation process by incorporating HR regularization, ensuring that the learned SR flow not only enforces self-consistency but also converges precisely to the ground-truth HR target. Furthermore, we introduce a fast-slow scheduling strategy, where adjacent timesteps for consistency learning are sampled from two distinct schedulers: a fast scheduler with fewer timesteps to improve efficiency, and a slow scheduler with more timesteps to capture fine-grained texture details. Extensive experiments demonstrate that FlowSR achieves outstanding performance in both efficiency and image quality. Code: https://github.com/jiaqixuac/FlowSR.
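The rectified-flow view of super-resolution can be sketched with the standard straight-line construction: intermediate states interpolate linearly between the LR and HR images, and the regression target for the velocity field is their difference. The sketch assumes the LR image has already been upsampled to the HR size and flattened to a list of pixel values; the consistency distillation, HR regularization, and fast-slow scheduling described in the abstract are not shown.

```python
def interpolate(x_lr, x_hr, t):
    """Point on the straight path: x_t = (1 - t) * x_lr + t * x_hr."""
    return [(1.0 - t) * a + t * b for a, b in zip(x_lr, x_hr)]

def velocity_target(x_lr, x_hr):
    """Constant velocity of the straight path, used as the training target."""
    return [b - a for a, b in zip(x_lr, x_hr)]

def one_step_sample(x_lr, velocity_fn):
    """Single Euler step from t = 0 to t = 1 using a velocity model.

    velocity_fn(x, t) stands in for a trained network (hypothetical).
    """
    v = velocity_fn(x_lr, 0.0)
    return [a + vi for a, vi in zip(x_lr, v)]
```

If the learned velocity were exact and the path perfectly straight, one Euler step would land on the HR image, which is why straightening the flow is what makes single-step sampling viable.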

2605.12369 2026-06-02 cs.RO

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

Xiaosong Jia, Bowen Yang, Zuhao Ge, Xian Nie, Yuchen Zhou, Cunxin Fan, Yufeng Li, Yilin Chai, Chao Jing, Zijian Liang, Qingwen Bu, Haidong Cao, Chao Wu, Qifeng Li, Zhenjie Yang, Chenhe Zhang, Hongyang Li, Zuxuan Wu, Junchi Yan, Yu-Gang Jiang

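Affiliations: Institute of Trustworthy Embodied AI (TEAI); Shanghai Key Laboratory of Multimodal Embodied AI; Shanghai Jiao Tong University; OpenDriveLab, The University of Hong Kong

AI summary: Proposes GuidedVLA, a framework that supervises individual attention heads in the action decoder with manually defined auxiliary signals (object grounding, spatial geometry, and temporal skill logic), explicitly guiding the model toward task-relevant factors and raising VLA success rates in both in-domain and out-of-domain settings.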

Comments Accepted to RSS 2026. Project page: https://guidedvla.github.io/project_page/

Abstract

Vision-Language-Action (VLA) models aim for general robot learning by aligning action as a modality within powerful Vision-Language Models (VLMs). Existing VLAs rely on end-to-end supervision to implicitly enable the action decoding process to learn task-relevant features. However, without explicit guidance, these models often overfit to spurious correlations, such as visual shortcuts or environmental noise, limiting their generalization. In this paper, we introduce GuidedVLA, a framework designed to manually guide the action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components. Individual attention heads are supervised by manually defined auxiliary signals to capture distinct factors. As an initial study, we instantiate this paradigm with three specialized heads: object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA improves success rates in both in-domain and out-of-domain settings compared to strong VLA baselines. Finally, we show that the quality of these specialized factors correlates positively with task performance and that our mechanism yields decoupled, high-quality features. Our results suggest that explicitly guiding action-decoder learning is a promising direction for building more robust and general VLA models.

2512.17738 2026-06-02 cs.CL

When the Gold Standard Isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content

Lydia Nishimwe, Benoît Sagot, Rachel Bawden

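Affiliations: Inria Paris

AI summary: For non-standard language in user-generated content (UGC), proposes a taxonomy of twelve non-standard phenomena and five translation actions, argues that translation evaluation must account for the standardness level that guidelines prescribe, and calls for controllable, guideline-aware evaluation frameworks.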

Comments 10 pages (23 with references and appendices). Accepted at EAMT 2026

Abstract

User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation challenging: what counts as a "good" translation depends on the desired standardness level of the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. We show that translation scores of large language models are highly sensitive to prompts with explicit UGC translation instructions, and that they improve when they align with the dataset guidelines. We argue that fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.

2605.05057 2026-06-02 cs.CV

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le, SuiYang Guang, Tuan Kiet Pham, Linh Chi Vo

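Affiliations: Phenikaa University

AI summary: Proposes ScriptHOI, a framework that decomposes interaction phrases into soft scripted state transitions, calibrates HOI logits with a visual state tokenizer and a slot-wise matcher, and adds interval partial-label learning and a counterfactual script contrast loss, improving recognition of rare and unseen interactions in open-vocabulary HOI detection while reducing affordance-conflict false positives.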

Abstract

Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict "cut cake" from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose ScriptHOI, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.
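The interval partial-label idea can be made concrete with a small sketch: an unannotated candidate is penalized only when its predicted probability leaves the script-derived interval, rather than being pushed toward zero as a closed-world negative. The squared-hinge form below and the function name are assumptions for illustration; the abstract does not give the actual loss.

```python
def interval_penalty(prob, lower, upper):
    """Squared-hinge penalty for a probability outside [lower, upper].

    Returns 0.0 anywhere inside the interval, so a plausible but
    unannotated interaction is not suppressed.
    """
    below = max(0.0, lower - prob)  # distance under the lower bound
    above = max(0.0, prob - upper)  # distance over the upper bound
    return below * below + above * above
```

By contrast, a closed-world negative label would add a loss of `-log(1 - prob)`, which is positive for every `prob > 0` and therefore penalizes even well-supported predictions.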

2605.09907 2026-06-02 cs.AI cs.MA

RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation

Zhen Zhang, Wanjing Zhou, Juncheng Li, Hao Fei, Jun Wen, Wei Ji

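Affiliations: University of Science and Technology of China

AI summary: Proposes RADAR, a redundancy-aware generative framework built on a conditional discrete graph diffusion model. It generates the communication topology step by step, guided by the graph's effective size, and achieves higher accuracy, lower token consumption, and greater robustness on six benchmarks.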

Comments Accepted by ICML 2026 (fix typos)

Abstract

Compared with individual agents, large language model-based multi-agent systems have consistently shown strong capabilities across diverse tasks, including code generation, mathematical reasoning, and planning. Despite their impressive performance, the effectiveness and robustness of these systems heavily rely on their communication topology, which is often fixed or generated in a single step. This restricts fine-grained structural exploration and flexible composition, resulting in excessive token utilization on simple tasks while limiting capability on complicated tasks. To mitigate this challenge, we introduce RADAR, a redundancy-aware and query-adaptive generative framework that actively reduces communication overhead. Motivated by recent progress in conditional discrete graph diffusion models, we formulate communication topology design as a step-by-step generation process, guided by the effective size of the graph. Comprehensive experiments on six benchmarks demonstrate that RADAR consistently outperforms recent baselines, achieving higher accuracy, lower token consumption, and greater robustness across diverse scenarios. Our code and data are available at https://github.com/cszhangzhen/RADAR.

2605.09883 2026-06-02 cs.CV cs.AI

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

Xia Hu, Zhenrui Yue, Brian Potetz, Howard Zhou, Leonidas Guibas, Chun-Ta Lu, Zhicheng Wang

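Affiliations: Stanford University; Google Research

AI summary: To address multimodal large language models exploiting Cartesian-coordinate shortcuts in visual reasoning, proposes Polaris-Bench, a benchmark that recasts tasks in polar coordinate space and shows that current models lack topology-invariant visual reasoning.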

Abstract

As current Multimodal Large Language Models (MLLMs) rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics, thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across 14 state-of-the-art MLLMs reveals that frontier models achieving 70-83% on Cartesian layouts collapse to 31-39% on Polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on Polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.
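To make the Cartesian-versus-Polar pairing concrete, here is a minimal sketch of one possible mapping from a grid cell to a polar layout, with rows becoming concentric rings and columns becoming angular sectors. The mapping, its parameters, and the function names are assumptions for illustration; the abstract does not describe how Polaris-Bench constructs its layouts.

```python
import math

def grid_to_polar(row, col, n_cols, inner_radius=1.0, ring_width=1.0):
    """Map a grid cell to (radius, angle): rows -> rings, columns -> sectors."""
    radius = inner_radius + row * ring_width
    theta = 2.0 * math.pi * col / n_cols
    return radius, theta

def polar_to_xy(radius, theta):
    """Pixel-plane position at which the cell would be drawn."""
    return radius * math.cos(theta), radius * math.sin(theta)
```

Under this mapping a cell's drawn position is no longer a linear function of its row and column indices, so reading the layout off as a text grid is less direct. Note also that the first and last columns become angular neighbours, which a flat grid does not have; a benchmark that claims logical equivalence would have to account for that.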

2605.09503 2026-06-02 cs.CV

PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models

Yongsen Cheng, Kai Liu, Kaiwen Tao, Junxian Li, Zhixin Wang, Zhikai Chen, Renjing Pei, Yulun Zhang

Affiliations: Shanghai Jiao Tong University; Huawei Noah’s Ark Lab

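AI summary: Proposes PermuQuant, a framework that reorders channels by a joint second-moment criterion and applies a calibration-based acceptance rule, reducing per-group quantization error in low-bit diffusion models and delivering substantial speedup and memory compression.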

English abstract

Large-scale visual generative models have achieved remarkable performance. However, their high computational and memory costs make deployment challenging in resource-constrained scenarios, such as interactive applications and personal single-GPU usage. Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. However, existing PTQ methods still suffer from severe quality degradation under extremely low-bit settings. In this paper, we identify channel ordering as an important but underexplored factor in per-group quantization. In this setting, each contiguous group shares one quantization scale. When channels with very different statistics are placed in the same group, the scale can be dominated by outliers and cause large quantization errors. Based on this observation, we propose PermuQuant, a simple and effective PTQ framework for low-bit diffusion models. PermuQuant sorts channels by a joint second-moment criterion before per-group quantization, placing channels with similar activation and weight statistics into the same group. It further uses a calibration-based acceptance rule to apply reordering only when the selected permutation reduces quantization error on calibration data. The selected permutations are absorbed into adjacent modules or applied to weights offline, avoiding explicit runtime permutation operations. Extensive experiments on multiple large diffusion models show that PermuQuant consistently reduces quantization error and outperforms existing PTQ baselines. On FLUX.1-dev with an RTX 5090, PermuQuant achieves up to a 1.7x single-step speedup and reduces the DiT memory footprint by 3.5x under W4A4 NVFP4 quantization. Code will be available at https://github.com/yscheng04/PermuQuant.
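
As a rough illustration of why channel order matters when each contiguous group shares one scale, the sketch below quantizes a toy weight matrix per group before and after sorting its channels, then keeps the permutation only if the error drops. The scoring formula (activation second moment times mean squared weight), the activation-weighted error proxy, the int4-style `qmax=7`, the toy data, and every function name are assumptions made for this example; they are not PermuQuant's actual criterion or code.

```python
def quantize_group(values, qmax=7):
    """Symmetric quantization of one group using a single shared scale."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    return [round(v / scale) * scale for v in values]

def quantize_row(row, group_size, qmax=7):
    """Quantize a row in contiguous groups of `group_size` channels."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(quantize_group(row[start:start + group_size], qmax))
    return out

def weighted_error(weights, act_moment, group_size):
    """Squared weight error, weighted per channel by the activation second moment."""
    total = 0.0
    for row in weights:
        deq = quantize_row(row, group_size)
        total += sum(m * (w - q) ** 2 for w, q, m in zip(row, deq, act_moment))
    return total

def joint_score(weights, act_moment):
    """Assumed per-channel score: E[x^2] times the mean squared weight."""
    n_rows = len(weights)
    return [act_moment[c] * sum(row[c] ** 2 for row in weights) / n_rows
            for c in range(len(act_moment))]

def choose_permutation(weights, act_moment, group_size):
    """Sort channels by score; accept the order only if calibration error decreases."""
    scores = joint_score(weights, act_moment)
    perm = sorted(range(len(scores)), key=lambda c: scores[c])
    permuted_w = [[row[c] for c in perm] for row in weights]
    permuted_m = [act_moment[c] for c in perm]
    err_before = weighted_error(weights, act_moment, group_size)
    err_after = weighted_error(permuted_w, permuted_m, group_size)
    if err_after < err_before:
        return perm, err_before, err_after
    return list(range(len(scores))), err_before, err_before

# Toy data: large-magnitude and small-magnitude channels are interleaved,
# so every group of 4 contains outliers that dominate the shared scale.
weights = [
    [8.0, 0.5, -6.0, 0.3, 7.0, -0.4, 5.0, 0.2],
    [-7.0, 0.4, 8.0, -0.5, 6.0, 0.3, -8.0, 0.1],
]
act_moment = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
perm, err_before, err_after = choose_permutation(weights, act_moment, group_size=4)
```

In this toy case the sort places the four small channels in one group and the four large ones in the other, so the small channels get a much finer scale. Applying `perm` to the weight columns offline, and the matching order to the incoming activations, is the step the abstract describes as absorbing the permutation into adjacent modules.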

2605.09382 2026-06-02 cs.LG cs.CV cs.DS math.OC

Learning-Augmented Scalable Linear Assignment Problem Optimization via Neural Dual Warm-Starts

Ilay Yavlovich, Jad Agbaria, Muhamed Mhamed, Nir Weinberger, Jose Yallouz

Affiliations: arXiv.org

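AI summary: Proposes a learning-augmented framework that warm-starts exact solvers with predicted dual variables, and designs RowDualNet, a lightweight row-independent architecture that avoids the O(N^2) memory bottleneck, enabling scalable neural warm-starts with speedups of over 2x while preserving optimality.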

Comments Accepted to ICML 2026. 23 pages, 18 figures

English abstract

The Linear Assignment Problem is a fundamental combinatorial optimization task where classical exact solvers ensure optimality but suffer from an $\mathcal{O}(N^{3})$ bottleneck, while recent neural approximations struggle with scalability and exactness. We propose a learning-augmented framework that accelerates exact solvers by predicting dual variables to warm-start the search, backed by a fallback mechanism to preserve worst-case guarantees. Central to our approach is RowDualNet, a lightweight, row-independent architecture that avoids the $\mathcal{O}(N^{2})$ memory bottleneck of graph models, enabling scalable neural warm-starting up to $N=16{,}384$. Feasibility is guaranteed by construction via the Min-Trick mechanism, completely eliminating the need for costly iterative projections. Empirically, our method drastically reduces the search effort of the Jonker-Volgenant (LAPJV) algorithm, yielding robust zero-shot generalization with strict optimality and end-to-end speedups of over 2x on complex synthetic data, 1.25x on real-world tracking, and 1.5x on transportation networks.
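
To show how predicted duals can be made feasible by construction, the sketch below uses a standard completion: given any row duals `u`, setting each column dual to `v[j] = min_i (cost[i][j] - u[i])` guarantees `u[i] + v[j] <= cost[i][j]` for every pair. Whether this matches the paper's Min-Trick is an assumption, since the abstract does not define it; the `u` values stand in for a network's prediction, and the function names and toy cost matrix are hypothetical.

```python
from itertools import permutations

def complete_duals(cost, u):
    """Column duals that make (u, v) dual-feasible for any choice of row duals u."""
    n = len(cost)
    return [min(cost[i][j] - u[i] for i in range(n)) for j in range(n)]

def is_dual_feasible(cost, u, v, tol=1e-9):
    """Check u[i] + v[j] <= cost[i][j] for every row i and column j."""
    return all(u[i] + v[j] <= cost[i][j] + tol
               for i in range(len(u)) for j in range(len(v)))

def dual_bound(u, v):
    """Dual objective; by weak duality it never exceeds the optimal assignment cost."""
    return sum(u) + sum(v)

def brute_force_optimum(cost):
    """Exact optimum by enumeration; only practical for tiny instances."""
    n = len(cost)
    return min(sum(cost[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

cost = [
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
]
u_predicted = [1.0, 0.0, 1.5]  # stand-in for a learned prediction
v = complete_duals(cost, u_predicted)
bound = dual_bound(u_predicted, v)
optimum = brute_force_optimum(cost)
```

Because the completed duals are always feasible, a poor prediction only loosens the lower bound and cannot invalidate it, which is consistent with the abstract's claim that optimality is preserved when an exact solver starts from them.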
