arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.18662 2026-05-19 cs.LG

Efficient and Noise-Tolerant PAC Learning of Multiclass Linear Classifiers

Rita Adhikari, Shiwei Zeng

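AI summary: This paper studies how to efficiently learn multiclass linear classifiers in the presence of nasty noise, and proposes a PAC learning algorithm, under a mixture-distribution assumption and a margin condition, that needs only O(k²·(d log d + log k)) samples at a constant noise rate.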

English abstract

Noise-tolerant PAC learning of linear models has been of central interest in the machine learning community since the last century. In recent years, many computationally efficient algorithms have been proposed for the problem of learning linear threshold functions under multiple noise models. Yet, when the problem is considered in the multiclass setting, i.e., when the number of classes $k$ is at least $3$, it is unknown whether there exist computationally efficient PAC learning algorithms when the data sets are maliciously corrupted. In this paper, we assume that the marginal distribution is a mixture of bounded-variance distributions and that the data sets also satisfy a margin condition. We show that there exists a computationally efficient algorithm that PAC learns multiclass linear classifiers $\{h_w:x\mapsto \arg\max_{y\in[k]}w_y\cdot x, x\in \mathbb{R}^d, w\in\mathbb{R}^{kd}\}$ using at most $O(k^2\cdot (d\log d+\log k))$ samples even under a constant rate of nasty noise. Our algorithm consists of two main ingredients: a cluster-based pruning scheme and a standard multiclass hinge loss minimization program. Even in the special case of the binary setting, i.e., $k=2$, our result is strictly stronger than all prior work.
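
The hypothesis class and loss named in the abstract can be illustrated with a small sketch. It assumes the Crammer-Singer form of the multiclass hinge loss (the abstract only says "standard"), uses 0-indexed labels, and stores w as k rows of length d. The helper names `dot`, `predict`, and `multiclass_hinge` are illustrative and not taken from the paper, and the cluster-based pruning step is not shown.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict(W: List[Vector], x: Vector) -> int:
    """h_w(x) = argmax_y w_y . x, with ties broken toward the smallest label."""
    scores = [dot(w_y, x) for w_y in W]
    return max(range(len(W)), key=lambda y: scores[y])


def multiclass_hinge(W: List[Vector], x: Vector, y: int) -> float:
    """Assumed Crammer-Singer form: max(0, 1 + max_{y' != y} w_{y'}.x - w_y.x)."""
    scores = [dot(w_y, x) for w_y in W]
    best_other = max(s for j, s in enumerate(scores) if j != y)
    return max(0.0, 1.0 + best_other - scores[y])


# Toy instance with k = 3 classes in d = 2 dimensions.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 0.5]
label = predict(W, x)
loss_correct = multiclass_hinge(W, x, 0)
loss_wrong = multiclass_hinge(W, x, 1)
```

The loss is zero only when the true class beats every other class by a margin of at least 1, which is why a margin condition on the data pairs naturally with this objective.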

2605.18661 2026-05-19 cs.AI

AI for Auto-Research: Roadmap & User Guide

Lingdong Kong, Xian Sun, Wei Chow, Linfeng Li, Kevin Qinghong Lin, Xuan Billy Zhang, Song Wang, Rong Li, Qing Wu, Wei Gao, Yingshuo Wang, Shaoyuan Xie, Jiachen Liu, Leigang Qu, Shijie Li, Lai Xing Ng, Benoit R. Cottereau, Ziwei Liu, Tat-Seng Chua, Wei Tsang Ooi

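AI summary: This paper examines the use of AI for automated research, analyzes how AI performs at each stage of the research lifecycle, finds that AI does well on structured tasks but still falls short on novel ideas, experiments, and scientific judgment, and proposes a collaboration framework and a tool inventory.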

Comments
Project Page at https://worldbench.github.io/awesome-ai-auto-research GitHub Repo at https://github.com/worldbench/awesome-ai-auto-research

English abstract

AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity problem: under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably. Studying developments through April 2026, we present an end-to-end analysis of AI across the complete research lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding & experiments, tables & figures), Writing (paper writing), Validation (peer review, rebuttal & revision), and Dissemination (posters, slides, videos, social media, project pages, and interactive agents). We identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy: AI excels at structured, retrieval-grounded, and tool-mediated tasks, but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment. Generated ideas often degrade after implementation, research code lags far behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. We further show that greater automation can obscure rather than eliminate failure modes, making human-governed collaboration the most credible deployment paradigm. Finally, we provide a structured taxonomy, benchmark suite, and tool inventory, cross-stage design principles, and a practitioner-oriented playbook, with resources maintained at our project page.

2605.18654 2026-05-19 cs.LG cs.AI

Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees

Aditya Tanna, Nassim Bouarour, Mohamed Bouadi, Vinay Kumar Sankarapu, Pratinav Seth

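AI summary: This paper proposes distilling high-performing tabular foundation models (TFMs) into CPU-native gradient-boosted trees, to close the gap between real-time fraud-scoring requirements and what existing models deliver, and validates the method on multiple datasets.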

English abstract

A fraud scorer needs to answer in under 2 ms. The best tabular foundation models (TFMs) take 151-1,275 ms on GPU. We close this gap by distilling the TFM offline into an XGBoost or CatBoost student that runs natively on CPU. The central obstacle is specific to in-context learning (ICL) teachers: they leak labels when scoring their own training set, so the soft targets collapse to near-one-hot vectors with no inter-class structure left to distill. Stratified out-of-fold (OOF) teacher labeling prevents this. Across 153 classification datasets drawn from TALENT, OpenML-CC18, TabZilla, and TabArena, distilling TabICLv2 into XGBoost gives 0.882 macro-mean AUC (96.5% of teacher AUC) at 1.9 ms on CPU, a 38x to 860x speedup across teacher-student pairs with a statistically significant edge over a tuned CatBoost baseline (Wilcoxon p = 0.0008; 51% win rate). Four further findings: teacher rank transfers exactly to student rank; gains concentrate on low-dimensional data (< 21 features: +0.011 over CatBoost vs. >21 features: +0.001); multi-teacher averaging helps MLP students (+0.006, p = 0.003) but adds less than 0.001 for tree students; and on high-dimensional tasks where the teacher itself trails CatBoost, distillation makes things worse rather than better. The full pipeline is open-sourced as part of the TabTune library.
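
The label-leak problem and the out-of-fold remedy described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the TabTune pipeline: `memorizing_teacher` is a hypothetical stand-in for an in-context teacher that returns a one-hot vector for any row it has already seen and the class prior otherwise, and `stratified_folds` is a simple round-robin split written for this example.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign each row to a fold so every class is spread evenly across folds."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [0] * len(labels)
    for idxs in by_class.values():
        for pos, i in enumerate(idxs):
            fold_of[i] = pos % n_folds
    return fold_of


def memorizing_teacher(train_X, train_y, test_X):
    """Hypothetical teacher: one-hot on rows it has seen, class prior otherwise."""
    classes = sorted(set(train_y))
    prior = [train_y.count(c) / len(train_y) for c in classes]
    seen = {tuple(x): label for x, label in zip(train_X, train_y)}
    out = []
    for x in test_X:
        if tuple(x) in seen:
            out.append([1.0 if c == seen[tuple(x)] else 0.0 for c in classes])
        else:
            out.append(list(prior))
    return out


def oof_soft_labels(X, y, n_folds, teacher):
    """Score every row with a teacher whose context excludes that row's fold."""
    fold_of = stratified_folds(y, n_folds)
    soft = [None] * len(y)
    for k in range(n_folds):
        train = [i for i in range(len(y)) if fold_of[i] != k]
        test = [i for i in range(len(y)) if fold_of[i] == k]
        probs = teacher([X[i] for i in train], [y[i] for i in train],
                        [X[i] for i in test])
        for i, p in zip(test, probs):
            soft[i] = p
    return soft


X = [[float(i)] for i in range(8)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
in_sample = memorizing_teacher(X, y, X)           # leaks: every target is one-hot
out_of_fold = oof_soft_labels(X, y, 2, memorizing_teacher)  # no row sees itself
```

In a real pipeline the soft targets from the out-of-fold pass would then be used to fit the tree student; that step is omitted here.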

2605.18652 2026-05-19 cs.CV

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, Jiebo Luo

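AI summary: This paper proposes MementoGUI, a learned agentic multimodal memory-control framework that helps long-horizon GUI agents maintain task state, using modular memory control and a scalable data pipeline to improve memory retrieval and decision-making.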

Comments
Preprint, 15 pages, 4 figures, 5 tables

English abstract

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.

2605.18648 2026-05-19 cs.LG cs.AI cs.CL

An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

Maja Pavlovic, Silviu Paun, Massimo Poesio

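AI summary: By comparing human and model labels in soft-label learning, this paper finds that human labels not only improve model accuracy but also act as a regularizer that improves calibration on difficult samples and training stability.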

English abstract

Central to human-aligned AI is understanding the benefits of human-elicited labels over synthetic alternatives. While human soft-labels improve calibration by capturing uncertainty, prior studies conflate these benefits with the implicit correction of mislabeled data (mode shifts), obscuring true effects of soft-labels. We present a controlled audit of soft-label learning across MNIST and a synthetic variant, re-annotating subsets to extract human uncertainty. By decoupling soft-label supervision from underlying label mode shifts, we show that while human soft-labels do provide accuracy gains, their larger value lies in acting as a regularizer that improves model calibration on difficult samples and promotes stable convergence across training runs. Dataset cartography reveals models trained on human soft-labels mirror human uncertainty, whereas those trained on synthetic labels fail to align with humans. Broadly, this work provides a diagnostic testbed for human-AI uncertainty alignment.

2605.18645 2026-05-19 cs.CV

Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video

Arslan Artykov, Tom Ravaud, Nicolás Violante-Grezzi, Vincent Lepetit

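AI summary: This paper proposes a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives avoid the pitfalls of unstable point tracking, and a new mechanism organizes them into coherent parts constrained by revolute and prismatic joints, recovering complex kinematics from a single casually captured video.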

English abstract

Retrieving the 3D kinematics of articulated objects from monocular video is a fundamental challenge in computer vision. Existing methods rely on complex video setups or cues such as long-term point tracking or wide-baseline matching, but are frequently brittle under severe occlusions, rapid camera ego-motion, or weak local features. Learning-based methods, meanwhile, struggle to generalize beyond their training categories. We propose a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives serve as a proxy representation that avoids the pitfalls of unstable point tracks; a novel mechanism organizes them into coherent parts constrained by revolute and prismatic joints. Our formulation jointly optimizes part segmentation and joint parameters, recovering complex kinematics from a single casually captured video. A visibility-aware procedure handles partial observations and occlusions inherent to real-world data. We also propose the AiP-synth and AiP-real benchmarks, featuring significant camera motion and heavy occlusions, and outperform existing methods. Project page: https://aartykov.github.io/Articulation-in-Prime/

2605.18641 2026-05-19 cs.CV

Leveraging Latent Visual Reasoning in Silence

Dongyao Zhu, Zhen Wang, Xi Xiao, Han Jiang, Saeed Vahidian, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su, Raju Vatsavai, Jianyang Gu

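AI summary: This paper asks whether latent tokens need to persist at inference time, finds that removing them or replacing them with random noise has little effect on performance, and proposes an attention-based reward that encourages latent tokens to interact with later text tokens, improving performance on visual perception and visual reasoning tasks.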

English abstract

Latent visual reasoning involves visual evidence more directly in multimodal reasoning by inserting continuous latent tokens before textual generation. However, the necessity of these latent tokens at inference remains ambiguous. We show that replacing latent tokens with random noise or removing them completely causes little performance degradation across spatial reasoning benchmarks. Reinforcement learning further diminishes the latent generation behavior after post-training. These observations raise a central question: Is latent visual reasoning still meaningful? We argue that its value should be measured by how effectively latent tokens guide learning, rather than whether they persist as an inference-time format. Our analysis shows that latent reasoning is unevenly favorable across question types, yet hard task-level routing for applying latent generation is brittle. Motivated by these findings, we propose an attention-based reward that encourages generated latent tokens to interact with later text tokens during RL. This reward promotes latent utilization when the latent mode is activated while preserving the flexibility to use pure-text reasoning. Experiments show that our method improves performance across perception and visual reasoning benchmarks, even when latent tokens are rarely generated after post-training. Our results highlight that, without explicit expression at inference, latent visual reasoning can shape better visual grounding and more accurate textual reasoning in silence. Our code and trained models are publicly available on GitHub (https://github.com/ddydyd32/silent-lvr/tree/master) and Hugging Face (https://huggingface.co/collections/cornuHGF/silent-lvr).
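
One way to picture an attention-based reward of this kind is as the average attention mass that later text tokens place on latent-token positions. The sketch below is an assumption about the general shape of such a signal, not the paper's definition: the function name, the use of a single row-normalized attention matrix, and the zero reward when no latent tokens are generated are all illustrative choices.

```python
def latent_attention_reward(attn, latent_positions, text_positions):
    """Mean attention mass that text tokens place on earlier latent tokens.

    attn[q][k] is the (row-normalized) attention weight from query q to key k.
    Returns 0.0 when no latent tokens were generated, so pure-text rollouts
    are neither rewarded nor penalized by this term (an assumed design choice).
    """
    latent = set(latent_positions)
    if not latent or not text_positions:
        return 0.0
    masses = []
    for q in text_positions:
        row = attn[q]
        masses.append(sum(row[k] for k in latent if k < q))
    return sum(masses) / len(masses)


# Toy causal attention over 4 positions: 0 = prompt, 1 = latent, 2 and 3 = text.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.5, 0.25, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]
reward = latent_attention_reward(attn, latent_positions=[1], text_positions=[2, 3])
no_latent_reward = latent_attention_reward(attn, latent_positions=[], text_positions=[2, 3])
```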

2605.18636 2026-05-19 cs.CV

SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

Wencan Jiang, Jiangning Zhang, Jianbiao Mei, Jinzhuo Liu, Yu Yang, Xiaobin Hu, Zhucun Xue, Yong Liu, Dacheng Tao

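AI summary: This paper proposes SPIKE, a dual-controller framework for cost-efficient long-horizon game agents that uses an event trigger and hierarchical memory to keep agents goal-directed and raise task completion; experiments on StarDojo show a markedly higher success rate with lower resource consumption.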

Comments
https://wencanjiang.github.io/projects/SPIKE/

English abstract

Long-horizon multimodal agents in open-world games must stay goal-directed across many low-level interactions under tight token and latency budgets. Existing approaches often trade off costly per-step reasoning against reactive execution that can drift, repeat failures, and recover poorly. Our key idea is to reuse strategic reasoning across locally stable segments and reinvoke it at event boundaries. We present SPIKE, an adaptive dual controller framework for cost-efficient long-horizon game control. Its Strategic Controller performs low-frequency global planning, failure analysis, and recovery, while its Reactive Controller handles fast local execution under a strict token budget. An Event Trigger monitors visual change, task progress, repeated actions, and failure signals to decide when control should stay reactive or escalate to strategic reasoning. Hierarchical Memory separates short-term experience reuse in the State-Action Memory Bank (SA-MB) from structured evidence in the State Action Knowledge Graph (SA-KG), allowing each controller to retrieve the context it needs. This design reuses strategic proposals over multiple reactive steps, supports local override when plans become stale, and reserves expensive reasoning for moments where extra deliberation is useful. On the Lite-100 split of StarDojo, SPIKE improves Lite-100 success rate (SR) by 5.0 percentage points (38.5% relative) over the strongest Lite-100 baseline and Budgeted SR by 9.3 points (75.6% relative) over the strongest budgeted baseline. It also reduces token consumption by 54.9% and latency by 40.8%. Ablations show that event triggering, reactive override, and heterogeneous memory each contribute to success and recovery, supporting selective reasoning rather than reasoning at every step.
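
The escalation logic of an event trigger like the one described can be sketched as a simple predicate over per-step signals. Everything concrete here is hypothetical: the `StepSignals` fields, the threshold values, and the rule that any single signal is enough to escalate are assumptions made for illustration, since the abstract names the monitored signals but not how they are combined.

```python
from dataclasses import dataclass


@dataclass
class StepSignals:
    visual_change: float       # fraction of the frame that changed, in [0, 1]
    steps_since_progress: int  # steps since task progress last advanced
    repeated_actions: int      # length of the current run of identical actions
    failure: bool              # explicit failure signal from the environment


def should_escalate(s, visual_threshold=0.6, patience=5, max_repeats=3):
    """Return True when control should move from reactive to strategic reasoning."""
    return (
        s.failure
        or s.visual_change >= visual_threshold
        or s.steps_since_progress >= patience
        or s.repeated_actions >= max_repeats
    )


def count_strategic_calls(signals):
    """One initial plan, plus one strategic call per step that triggers escalation."""
    calls = 1
    for s in signals:
        if should_escalate(s):
            calls += 1
    return calls


episode = [
    StepSignals(0.1, 0, 0, False),  # stable segment: stay reactive
    StepSignals(0.9, 0, 0, False),  # large visual change
    StepSignals(0.1, 5, 0, False),  # stalled progress
    StepSignals(0.1, 1, 3, False),  # repeated action
    StepSignals(0.0, 0, 0, True),   # failure signal
    StepSignals(0.2, 2, 1, False),  # stable again
]
strategic_calls = count_strategic_calls(episode)
```

In this toy episode the expensive controller runs on 5 of 7 decision points; on longer, locally stable trajectories the share of strategic calls would be much smaller, which is the source of the token savings.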

2605.18635 2026-05-19 cs.LG cs.AI

Data Presentation Over Architecture: Resampling Strategies for Credit Risk Prediction with Tabular Foundation Models

Aditya Tanna, Mitul Solanki, Mohamed Bouadi, Nassim Bouarour, Pratinav Seth, Vinay Kumar Sankarapu

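AI summary: This paper studies how different context-construction strategies affect the performance of tabular foundation models in credit risk prediction, and finds that the context-construction strategy contributes more to AUC-ROC than the model architecture.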

English abstract

Credit default prediction is a tabular learning problem with severe class imbalance, heterogeneous features, and tight latency budgets. Tabular Foundation Models (TFMs) approach this problem through in-context learning, which makes their predictions sensitive to how the context window is built. We benchmark four classical models and five TFMs on the Home Credit and Lending Club datasets, varying the context-construction strategy (seven options) and the context size (1K to 50K). On both datasets, the choice of context strategy explains more variance in AUC-ROC than the choice of TFM family: balanced and hybrid sampling add 3 to 4 AUC points over uniform sampling, and the gap exceeds the spread between TFMs. With a balanced context of 5K to 10K examples, the strongest TFMs reach the AUC of classical baselines trained on the full data, while also recovering meaningful default-class recall that default-threshold GBDTs do not. We frame this as evidence that context construction, rather than architecture choice, is the primary deployment lever for TFMs in imbalanced credit-risk settings.
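
The difference between uniform and balanced context construction can be made concrete with a short sketch. It assumes that "balanced" means an equal number of rows per class, drawn without replacement; the paper's seven strategies are not specified in the abstract, and the function names and the 5% default rate below are illustrative.

```python
import random
from collections import defaultdict


def uniform_context(labels, context_size, seed=0):
    """Sample context rows uniformly, so class ratios mirror the imbalanced data."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(labels)), context_size))


def balanced_context(labels, context_size, seed=0):
    """Sample an equal number of rows per class, without replacement.

    If a class has fewer rows than its share, all of its rows are taken and
    the context ends up smaller than context_size (no oversampling here).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    per_class = context_size // len(by_class)
    chosen = []
    for y in sorted(by_class):
        idxs = by_class[y]
        chosen.extend(rng.sample(idxs, min(per_class, len(idxs))))
    return sorted(chosen)


# Toy portfolio: 50 defaults (label 1) among 1,000 loans.
labels = [1] * 50 + [0] * 950
balanced = balanced_context(labels, 40)
uniform = uniform_context(labels, 40)
defaults_in_balanced = sum(labels[i] for i in balanced)
```

A uniform draw of 40 rows from this portfolio would contain about two defaults on average, which gives an in-context learner very little evidence about the minority class.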

2605.18632 2026-05-19 cs.LG cs.AI

Position: Weight Space Should Be a First-Class Generative AI Modality

Zhangyang Wang, Peihao Wang, Kai Wang

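AI summary: This paper proposes treating model checkpoints as a first-class data modality and argues that generative modeling in weight space should become a core machine learning primitive. Recent advances show that neural network weights can be synthesized on demand, often matching fine-tuning performance while reducing adaptation cost by orders of magnitude. The paper argues that these results reflect a structural fact: high-performing models occupy low-dimensional, highly structured regions of weight space. Building on this view, it organizes existing methods into a five-stage pipeline, surveys applications where the approach is already practical, and clarifies current limits: adapter-scale and conditional generation are advancing rapidly, while unrestricted frontier-scale checkpoint synthesis remains open.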

Comments
AI systems routinely improve or create other AI systems

English abstract

Neural network checkpoints have quietly become a large-scale data resource: millions of trained weight vectors now exist, each encoding task-, domain-, and architecture-specific knowledge. This position paper argues that model checkpoints should be treated as a first-class data modality, and that generative modeling in weight space should be standardized as a core machine learning primitive. Recent advances demonstrate that neural weights can be synthesized on demand, often matching fine-tuning performance while reducing adaptation cost by orders of magnitude. We contend that these results reflect an underlying structural fact: high-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces. Building on this view, we organize existing methods into a five-stage pipeline, survey applications where the approach is already practical, and clarify current limits: adapter-scale and conditional generation are advancing rapidly, while unrestricted frontier-scale checkpoint synthesis remains open. Our goal is to shift the community's default mindset from optimizing models per task to sampling models from learned weight distributions, accelerating toward an era in which AI systems routinely improve or create other AI systems.

2605.18630 2026-05-19 cs.AI physics.comp-ph

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

Nithin Somasekharan, Youssef Hassan, Shiyao Lin, Gihan Panapitiya, Patrick Emami, Anurag Acharya, Sameera Horawalavithana, Shaowu Pan

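AI summary: This study introduces SCICONVBENCH, a benchmark for evaluating LLMs on multi-turn clarification in computational-science task formulation, focused on eliciting missing information and resolving contradictions in requests. It pairs a structured task ontology with a rubric-based evaluation framework to systematically measure LLM performance along three dimensions: clarification behavior, conversational grounding, and final-specification fidelity.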

English abstract

Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.

2605.18627 2026-05-19 cs.AI

Learning Lifted Action Models from Traces with Minimal Information About Actions and States

Jonas Gösgens, Niklas Jansen, Hector Geffner

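AI summary: This paper studies learning STRIPS+ action domains from traces under partial information, and presents algorithms and completeness results for three general cases, all assuming full observability of selected action arguments, thereby characterizing when an equivalent domain can be learned under different observability assumptions.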

Comments
accepted at KR2026

English abstract

It has been recently shown that lifted STRIPS models can be learned correctly and efficiently from action traces alone; i.e., applicable action sequences from a hidden STRIPS model. The result is remarkable because the states are not assumed to be observable at all, and yet it is not practical enough as STRIPS actions include arguments that are not needed for selecting the actions. This shortcoming has been addressed by assuming that the action traces come instead from a hidden STRIPS+ model where some action arguments are implicit in the hidden action preconditions. A limitation of this approach, however, is that it assumes that the states are fully observable. In this work, we relax these restrictions and consider the problem of learning STRIPS+ action domains from traces in a more general context where the traces carry partial information about both actions and states. In particular, we formulate algorithms and completeness results for three general cases, all of which assume full observability of selected action arguments. In the first case, no observability of the state is assumed; in the second case, full observability of some state predicates is assumed, and in the third case, local observability of some state predicates is assumed instead. Given a STRIPS+ domain, these results characterize the conditions under which an equivalent domain can be learned from traces. Experimental results are reported.

2605.18621 2026-05-19 cs.CV cs.AI

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

Wei Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Siliang Tang, Jun Xiao, Yueting Zhuang

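AI summary: This study proposes CrossView Suite, developing three components (CrossViewSet, CrossViewBench, and CrossViewer) to address data scarcity, insufficient evaluation, and missing alignment mechanisms in cross-view reasoning, and to improve multi-view spatial understanding.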

English abstract

Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we develop CrossView Suite, which comprises three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. First, we introduce a multi-agent data engine to curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we create a scene-disjoint CrossViewBench to comprehensively assess the cross-view spatial understanding capability of an MLLM across various aspects. Finally, we propose CrossViewer, a progressive three-stage framework for cross-view spatial reasoning in MLLMs, following a Perception -> Alignment -> Reasoning paradigm. Our method uses an adaptive spatial region tokenizer to capture fine-grained object representations, explicitly aligns objects across views, and then fuses the aligned features to strengthen the cross-view inference capacity of MLLMs. Extensive experiments and analyses show that large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for advancing MLLMs from single-view perception toward real-world spatial intelligence. The project page is available at https://github.com/Thinkirin/Crossview-Suite.

2605.18617 2026-05-19 cs.RO cs.AI cs.CV

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

Ziyu Wei, Luting Wang, Chen Gao, Li Wen, Si Liu

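AI summary: This paper proposes ManiSoft, a benchmark for studying vision-language manipulation with soft continuum robots. A tailored simulator couples realistic soft-body dynamics with contact-rich interactions, four tasks highlight different aspects of deformable control, an automated pipeline generates 6,300 diverse scenes with expert trajectories, and three representative policy models are evaluated.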

Comments
Accepted in ICML 2026

English abstract

Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track the waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but a substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoidance. We anticipate ManiSoft to serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Our code and datasets are released at https://buaa-colalab.github.io/ManiSoft.

2605.18613 2026-05-19 cs.SD cs.AI

SAME: A Semantically-Aligned Music Autoencoder

SAME:一种语义对齐的音乐自编码器

Julian D. Parker, Zach Evans, CJ Carr, Zachary Zukowski, Josiah Taylor, Matthew Rice, Jordi Pons

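AI summary: This work proposes the SAME autoencoder, which combines a Transformer architecture with semantic regularization to reach a 4096× temporal compression ratio while maintaining reconstruction quality and generative performance.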

详情
AI中文摘要

潜在表示是现代生成模型的核心。在音频领域,它们通常由神经音频编解码器自编码器生成。在本工作中,我们介绍了SAME(Semantically-Aligned Music autoEncoder),一种用于立体音乐和通用音频的自编码器,实现了4096×的时间压缩比,同时保持重建质量和下游生成性能。我们通过结合基于Transformer的主干结构和一组语义正则化方法、相位感知的重建损失以及改进的判别器设计来实现这一点。该架构通过其高压缩比和对优化良好的Transformer原语的依赖,提供了显著的计算成本优势。两种变体(大型SAME-L和可部署在CPU上的SAME-S)以开放权重形式发布。

英文摘要

Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and general audio that reaches a 4096$\times$ temporal compression ratio while maintaining reconstruction quality and downstream generative performance. We achieve this by combining a transformer-based backbone with a set of semantic regularisation approaches, phase-aware reconstruction losses and improved discriminator designs. The architecture delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.

2605.18611 2026-05-19 cs.RO

Unified Walking, Running, and Recovery for Humanoids via State-Dependent Adversarial Motion Priors

通过状态依赖对抗运动先验实现人形机器人的统一行走、跑步和恢复

Yidan Lu, Yichao Zhong, Liu Zhao, Wanyue Li, Peng Lu

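AI summary: This paper proposes a unified reinforcement learning framework that lets a single policy perform walking, running, and fall recovery on the Unitree G1 humanoid robot without explicit mode switching at deployment. The framework replaces the conventional global reference distribution with a state-dependent gate that routes each training transition to one of two discriminators: a dedicated recovery discriminator and a velocity-conditioned locomotion discriminator that jointly covers walking and running.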

详情
AI中文摘要

我们提出了一种统一的强化学习框架,使单个策略能够在Unitree G1人形机器人上实现行走、跑步和跌倒恢复,且在部署时无需任何显式模式切换命令进行验证。该框架通过将传统全局参考分布替换为状态依赖的门控机制,将每个训练过渡路由到两个判别器中:一个专用的恢复判别器和一个基于速度的运动判别器,共同覆盖行走和跑步。门控机制由一个固定的投影重力阈值定义:当身体倾斜超过约37度(|g_z+1|>0.6)时,激活恢复判别器;否则使用运动判别器,以归一化的命令速度作为条件,选择在行走和跑步片段之间的适当参考轨迹。仅需三个LAFAN1参考片段即可正则化完整的行为集。在部署时,一个冻结的ONNX策略以50Hz的速度执行,无需运行时模式逻辑;硬件实验展示了在相同控制器下成功从仰卧和俯卧跌倒恢复以及平滑的行走转跑步过渡。

英文摘要

We propose a unified reinforcement learning framework that enables a single policy to perform walking, running, and fall recovery on the Unitree G1 humanoid robot, validated on physical hardware without any explicit mode-switching command at deployment. The framework extends Adversarial Motion Priors (AMP) by replacing the conventional global reference distribution with a state-dependent gate that routes each training transition to one of two discriminators: a dedicated recovery discriminator and a velocity-conditioned locomotion discriminator that jointly covers walking and running. The gate is defined by a single fixed threshold on projected gravity: the recovery discriminator is activated when body tilt exceeds approximately $37^\circ$ from vertical ($|g_z+1|>0.6$); otherwise the locomotion discriminator is used, with the normalized commanded velocity serving as a condition that selects the appropriate reference trajectory between walk and run clips. Only three LAFAN1 reference clips are required to regularize the complete behavior set. At deployment, a single frozen ONNX policy executes at 50 Hz with no runtime mode logic; hardware experiments demonstrate successful recovery from both prone and supine falls and smooth walk-to-run transitions under the same controller.
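The gating rule in this abstract can be sketched as a small routing function. This is a minimal illustration, assuming the projected gravity vector is normalized so that g_z = -1 when the robot stands upright; the function name, constant name, and return labels are hypothetical and not taken from the paper.

```python
RECOVERY_THRESHOLD = 0.6  # threshold on |g_z + 1| quoted in the abstract


def select_discriminator(g_z: float, threshold: float = RECOVERY_THRESHOLD) -> str:
    """Route a training transition to one of the two discriminators.

    g_z is the z-component of gravity projected into the body frame
    (assumed here to equal -1 when the robot is upright).
    """
    if abs(g_z + 1.0) > threshold:
        return "recovery"    # large tilt: use the recovery discriminator
    return "locomotion"      # otherwise: velocity-conditioned locomotion discriminator
```

Because the rule depends only on the current state, it is applied per training transition; per the abstract, the deployed policy itself carries no runtime mode logic.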

2605.18610 2026-05-19 cs.CV cs.AI cs.LG

CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic

CATA: 通过冲突厌恶任务算术实现持续机器去学习

Shen Lin, Junhao Dong, Rongjie Chen, Xiaoyu Zhang, Li Xu, Xiaofeng Chen

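AI summary: This paper presents the first study of continual unlearning for vision-language models and proposes CATA, a conflict-averse task arithmetic method that addresses the challenges of effectiveness, model fidelity, and persistence in unlearning.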

详情
AI中文摘要

视觉语言模型(VLMs)在对齐视觉和文本表示方面表现出色,能够支持多种多模态应用。然而,其大规模训练数据不可避免地引发了隐私、版权和不良内容的担忧,这使得机器去学习变得必要。尽管现有研究主要关注单次去学习,但实际VLM部署往往涉及随时间推移的连续删除请求,从而产生持续机器去学习。在本文中,我们首次研究了VLMs的持续去学习,并识别出该设置中的三个关键挑战:去除目标知识的有效性、保留模型效用的保真度以及在连续更新下防止知识重新出现的持续性。为了解决这些挑战,我们提出了CATA,一种冲突厌恶任务算术方法,将每个遗忘请求表示为一个去学习任务向量。通过维护历史任务向量并执行符号感知的冲突厌恶聚合,CATA抑制可能削弱先前遗忘效果的冲突更新组件。在单次和持续设置下的大量实验表明,CATA在遗忘有效性、模型保真度和遗忘持续性方面均优于基线方法。

英文摘要

Vision-language models (VLMs) have shown remarkable ability in aligning visual and textual representations, enabling a wide range of multimodal applications. However, their large-scale training data inevitably raises concerns about privacy, copyright, and undesirable content, creating a strong need for machine unlearning. While existing studies mainly focus on single-shot unlearning, practical VLM deployment often involves sequential removal requests over time, giving rise to continual machine unlearning. In this work, we make the first attempt to study continual unlearning for VLMs and identify three key challenges in this setting: effectiveness in removing target knowledge, fidelity in preserving retained model utility, and persistence in preventing knowledge re-emergence under sequential updates. To address these challenges, we propose CATA, a conflict-averse task arithmetic method that represents each forget request as an unlearning task vector. By maintaining historical task vectors and performing sign-aware conflict-averse aggregation, CATA suppresses conflicting update components that may weaken previous forgetting effects. Extensive experiments under both single-shot and continual settings show that CATA outperforms baselines in terms of forgetting effectiveness, model fidelity, and forgetting persistence.
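A minimal sketch of task arithmetic with a sign-aware merge follows. It assumes a task vector is the parameter difference between an unlearned model and the base model, and it assumes "conflict-averse" means dropping components of a new vector whose sign opposes the accumulated history. Both assumptions and all function names are illustrative; the abstract does not give the paper's actual aggregation rule.

```python
from typing import List


def task_vector(base: List[float], unlearned: List[float]) -> List[float]:
    """Parameter difference produced by one forget request."""
    return [u - b for b, u in zip(base, unlearned)]


def conflict_averse_merge(history: List[List[float]], new: List[float]) -> List[float]:
    """Add `new` to the sum of historical task vectors, skipping any
    component of `new` whose sign opposes the accumulated history."""
    acc = [sum(col) for col in zip(*history)] if history else [0.0] * len(new)
    merged = []
    for a, n in zip(acc, new):
        conflicts = a * n < 0  # opposite signs: would partly undo earlier forgetting
        merged.append(a if conflicts else a + n)
    return merged
```

For example, merging `[0.5, 1.0, 3.0]` into a history of `[[1.0, -2.0, 0.0]]` keeps the second component at `-2.0`, because the new update points the opposite way there.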

2605.18609 2026-05-19 cs.LG

Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration

在经典动量加速下实现小批量SGD的完美并行化

Sachin Garg, Michał Dereziński

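AI summary: This paper develops a general theory of mini-batch optimization showing that the acceleration from classical momentum scales in proportion to the gradient mini-batch size, which enables perfect parallelization of mini-batch computations.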

详情
AI中文摘要

利用经典动量方案(如Polyak的重球方案)加速随机梯度方法,在训练大规模机器学习模型中证明了其高度成功,特别是在结合大规模小批量计算的硬件加速时。然而,经典动量对随机小批量优化的影响在理论上理解甚微,先前工作需要强噪声假设和极大的小批量。在本文中,我们开发了一种通用的随机动量加速理论,用于在插值域中优化二次函数,这是一门研究深度学习动态的流行抽象,也包括随机Kaczmarz和坐标下降等经典方法。我们的框架涵盖了重球和Nesterov式动量,允许任意小批量大小,并对随机噪声做出最小假设。特别地,我们证明了经典动量的加速与梯度小批量大小成正比(除了自然饱和点),从而实现小批量计算的完美并行化。我们的理论还提供了一个简单的动量参数选择,该选择在经验上被证明是有效的。

英文摘要

Accelerating stochastic gradient methods with classical momentum schemes, such as Polyak's heavy ball, has proven highly successful in training large-scale machine learning models, particularly when combined with the hardware acceleration of large mini-batch computations. Yet, the effect of classical momentum on stochastic mini-batch optimization has been poorly understood theoretically, with prior works requiring strong noise assumptions and extremely large mini-batches. In this work, we develop a general theory of stochastic momentum acceleration for optimizing over quadratics in the interpolation regime, a popular abstraction for studying deep learning dynamics which also includes classical methods such as randomized Kaczmarz and coordinate descent. Our framework encompasses both heavy ball and Nesterov-style momentum, allows for arbitrary mini-batch sizes, and makes minimal assumptions on the stochastic noise. In particular, we show that acceleration from classical momentum is directly proportional to the gradient mini-batch size (up to a natural saturation point), thereby enabling perfect parallelization of mini-batch computations. Our theory also provides a simple choice for the momentum parameter, which is shown to be effective empirically.
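The setting in this abstract can be illustrated with mini-batch SGD plus Polyak heavy-ball momentum on a consistent least-squares problem, where one point fits every sample exactly (the interpolation regime). This is a sketch only: the step size, momentum value, and helper names are arbitrary illustrative choices, not the parameter selection derived in the paper.

```python
import random


def make_consistent_system(n, dim, seed=0):
    """Random unit-norm rows a_i and targets b_i = a_i . x_star (exactly solvable)."""
    rng = random.Random(seed)
    x_star = [rng.gauss(0, 1) for _ in range(dim)]
    rows = []
    for _ in range(n):
        a = [rng.gauss(0, 1) for _ in range(dim)]
        norm = sum(v * v for v in a) ** 0.5
        rows.append([v / norm for v in a])
    targets = [sum(a * s for a, s in zip(row, x_star)) for row in rows]
    return rows, targets, x_star


def heavy_ball_sgd(rows, targets, batch_size, lr, beta, steps, seed=0):
    """Mini-batch SGD with heavy-ball momentum on (1/2n) * sum_i (a_i . x - b_i)^2."""
    rng = random.Random(seed)
    dim = len(rows[0])
    x = [0.0] * dim
    x_prev = list(x)
    for _ in range(steps):
        batch = rng.sample(range(len(rows)), batch_size)
        grad = [0.0] * dim
        for i in batch:
            residual = sum(a * xi for a, xi in zip(rows[i], x)) - targets[i]
            for j in range(dim):
                grad[j] += residual * rows[i][j] / batch_size
        # Heavy ball: gradient step plus momentum on the previous displacement.
        x_next = [xi - lr * g + beta * (xi - xp) for xi, g, xp in zip(x, grad, x_prev)]
        x_prev, x = x, x_next
    return x
```

The per-sample gradients inside a batch are independent of one another, which is what makes the batch computation parallelizable in practice.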

2605.18608 2026-05-19 cs.CV

Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging

跨越迁移:通过动态风格桥接实现向前促进的持续测试时间适应

Zhilin Zhu, Yabin Wang, Zhiheng Ma, Yaguang Song, Yaowei Wang, Xiaopeng Hong

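AI summary: This paper proposes a new forward-facilitation approach to continual test-time adaptation. Its Dynamic Style Bridging mechanism builds a compact knowledge base before deployment and dynamically injects incoming data styles at test time to provide reliable supervisory signals, enabling stable adaptation under continual shifts.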

详情
Comments
Accepted by CVPR 2026
AI中文摘要

持续测试时间适应(CTTA)旨在使感知系统能够处理部署后遇到的动态分布偏移。现有方法主要采用后向对齐范式,这种范式将输入数据与源域衍生的监督代理进行刚性对齐,因此在面对不可靠的监督和不断变化的分布偏移时表现不佳。为克服这些限制,我们引入了一种新的向前促进范式,通过一种称为动态风格桥接的方法。在部署前,我们构建了一个生成类示例的紧凑知识库。在测试时间,为了减轻固有的生成偏移并使这些代理适应输入数据,我们提出了一个多级桥接机制。该机制在输入、统计和表示层动态地将代理与输入数据风格注入,同时保留代理的原始语义。这些高保真的代理随后被用来提供可靠且按需的监督信号,从而在持续偏移下实现稳定的适应。在标准CTTA基准上的广泛实验表明,我们的方法在最近的最先进方法上实现了一致且显著的改进。代码可在https://github.com/z1358/DAS上获得。

英文摘要

Continual Test-Time Adaptation (CTTA) aims to empower perception systems to handle dynamic distribution shifts encountered after deployment. Existing methods predominantly follow a backward-alignment paradigm, which rigidly aligns incoming data with supervisory surrogates derived from the source domain. Consequently, they struggle with unreliable supervision and evolving distribution shifts. To overcome these limitations, we introduce a novel forward-facilitation paradigm through a method termed Dynamic Style Bridging. Prior to deployment, we construct a compact knowledge base of generated class exemplars. During test time, to mitigate inherent generative bias and adapt these proxies to incoming data, we propose a multi-level bridging mechanism. This mechanism dynamically injects the proxies with incoming data styles at the input, statistical, and representation levels, while preserving the original semantics of the proxies. These high-fidelity proxies are then used to provide reliable, on-demand supervisory signals, enabling stable adaptation under continual shifts. Extensive experiments across standard CTTA benchmarks demonstrate that our method achieves consistent and substantial improvements over recent state-of-the-art approaches. Code is available at https://github.com/z1358/DAS.

2605.18607 2026-05-19 cs.CL cs.LG

Forecasting Downstream Performance of LLMs With Proxy Metrics

通过代理指标预测大语言模型的下游性能

Arkil Patel, Siva Reddy, Marius Mosbach, Dzmitry Bahdanau

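AI summary: This paper proposes building proxy metrics by aggregating token-level statistics (such as entropy, top-k accuracy, and expert token rank) from a candidate model's next-token distribution, forecasting the downstream performance of large language models more accurately than conventional loss- and compute-based baselines.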

详情
Comments
Preprint. 31 pages
AI中文摘要

语言模型的发展进步往往由比较决策驱动:选择哪种架构、哪种预训练语料库或哪种训练配方。做出这些决策需要可靠的性能预测,但常用的两个信号从根本上受到限制。交叉熵损失与下游能力不匹配,而直接下游评估成本高、稀疏且在早期训练阶段信息有限。相反,我们提出通过聚合候选模型的下一个token分布中的token级统计信息(如熵、top-k准确率和专家token排名)来构建代理指标。在三个设置中,我们的代理指标始终优于基于损失和计算量的基线方法:1)在跨家族模型选择中,它们对异质推理模型的排名平均Spearman Rho为0.81(与交叉熵损失的Rho为0.36相比);2)在预训练数据选择中,它们能以大约10,000倍更低的计算成本可靠地对25个候选语料库进行排名,推动帕累托前沿超越现有方法;3)在训练时间预测中,它们在18倍计算范围内预测下游准确性时,误差大约是现有方法的一半。这些结果表明,专家轨迹是评估模型能力广泛有用的信息源,使整个模型开发生命周期中的性能预测变得可靠。

英文摘要

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) For cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman Rho = 0.81 (vs. Rho = 0.36 for cross-entropy loss); 2) For pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly $10{,}000\times$ less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) For training-time forecasting, they extrapolate downstream accuracy across an $18\times$ compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.
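The three token-level statistics named in this abstract can be computed from per-position probability vectors as sketched below. The input layout, the function name, and the use of a plain mean as the aggregation are assumptions made for illustration; the abstract does not specify how the paper aggregates these statistics.

```python
import math
from typing import Dict, Sequence


def token_stats(dists: Sequence[Sequence[float]],
                expert_tokens: Sequence[int],
                k: int = 5) -> Dict[str, float]:
    """Aggregate per-position statistics of next-token distributions.

    dists[t] is the model's probability vector at position t, and
    expert_tokens[t] is the token id the expert solution uses there.
    """
    entropies, ranks, hits = [], [], []
    for probs, tok in zip(dists, expert_tokens):
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
        # Rank 1 means the expert token is the model's most likely token.
        rank = 1 + sum(1 for p in probs if p > probs[tok])
        ranks.append(rank)
        hits.append(1.0 if rank <= k else 0.0)
    n = len(ranks)
    return {
        "mean_entropy": sum(entropies) / n,
        "top_k_accuracy": sum(hits) / n,
        "mean_expert_rank": sum(ranks) / n,
    }
```

All three quantities need only forward passes over expert-written text, with no sampling or answer grading.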

2605.18603 2026-05-19 cs.CV

Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

Starve to Perceive: 通过受限视觉带宽驯服VLMs中的懒惰感知

Yuhuan Wu, Cong Wei, Fangzhen Lin, Wenhu Chen, Haozhe Wang

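AI summary: This paper proposes a new training method that constrains visual bandwidth to force vision-language models to perceive actively, improving their performance in high-resolution visual environments.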

详情
AI中文摘要

视觉语言模型(VLMs)作为处于环境中的智能体,在高分辨率视觉环境中需要主动感知——即通过缩放、裁剪和平移等操作动态决定观察方向的能力。然而,当前的训练范式产生的是模仿这些操作表面形式而没有功能性依赖的模型,我们称之为懒惰感知。我们将其归因于一个根本的学习不对称性:当粗略的全局视图结合语言先验足以达到中等准确度时,模型没有动力学习更复杂的多步骤视觉搜索。如果模型可以不主动观察就成功,它将永远学不会主动观察。这促使我们提出Starve to Perceive,一种限制视觉带宽的训练范式——限制每个观察到紧贴令牌预算,使得单个视角不足以完成任务,使主动感知成为唯一可行的策略。尽管不需要辅助损失、奖励塑造或架构变化——作为标准后训练流程的最小、即插即用修改——在感知饥饿下训练的模型在多种基准上实现了显著的提升,平均相对改进达5%。

英文摘要

Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception -- the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models that mimic the surface form of such operations without functionally depending on their outputs, a phenomenon we term lazy perception. We trace this to a fundamental learning asymmetry: when coarse global views combined with language priors suffice for moderate accuracy, the model has no incentive to learn harder multi-step visual search. If a model can succeed without actively looking, it will never learn to look. This motivates Starve to Perceive, a training paradigm that constrains visual bandwidth -- restricting each observation to a tight token budget so that no single view suffices for task completion, making active perception the only viable strategy. Despite requiring no auxiliary losses, reward shaping, or architectural changes -- serving as a minimal, plug-in modification to standard post-training pipelines -- models trained under perceptual starvation achieve substantial gains of 5% average relative improvement across diverse benchmarks.

2605.18601 2026-05-19 cs.CV

Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

Incantation: 自然语言作为多实体视频世界模型的动作接口

Shangwen Zhu, Qianyu Peng, Zhao Pu, Zhilei Shu, Xiangrui Ke, Zhaohu Xing, Zizhao Tong, Zeqing Wang, Xinyu Cui, Huangji Wang, Jian Zhao, Yeying Jin, Fan Cheng, Ruili Feng

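AI summary: This work proposes a natural-language action interface for multi-entity video world models. It addresses the shortcomings of conventional interfaces in fine-grained multi-entity control and in cross-entity, cross-world generalization, and gains expressiveness through natural-language conditioning.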

详情
AI中文摘要

现代交互式视频世界模型已实现了令人印象深刻的视觉保真度,但缺乏细粒度的多实体控制和跨实体、跨世界的泛化能力。我们追溯这一差距到动作接口:标准控制协议(例如动画ID、设备输入、场景级标题)在设计时将动作语义绑定到特定实体或引擎。我们提出自然语言作为接口,以解锁任何先前接口都无法实现的表达能力,并展示了Incantation,第一个具有每潜在帧(0.25秒)自然语言条件化的交互式视频世界模型,支持同时多实体控制和概念级跨实体转移,超越任何固定的渲染管道。我们配对了一个预训练的双向视频主干与帧本地文本交叉注意力,并通过ODE初始化的Self-Forcing蒸馏与RoPE解耦的滑动KV缓存实现实时长时间跨度流媒体。我们在跨实体转移(89% vs. 43%)和out-of-vocabulary提示(90% vs. 0%)上超越了Action-Index基线,并且我们的两步学生在480p下以19.7 FPS稳定运行,FVD在2小时滚动中保持稳定。我们进一步将相同的架构和训练配方应用于《国王之剑》,仅更改每个实体的动作词汇槽。我们已发布Incantation数据集的预览子集,包含手动收集的《艾尔登法环》玩家-Boss战斗片段,带有结构化的动作导向元数据。更大规模的《艾尔登法环》和KOF数据将在完整项目中发布。

英文摘要

Modern interactive video world models have achieved impressive visual fidelity, yet lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. We surpass the Action-Index baseline on cross-entity transfer (89% vs. 43%) and out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters, changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes, containing manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.

2605.18599 2026-05-19 cs.CV

Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling

通过语义-空间解耦解决前馈新视角合成变换器中的表示歧义

Yihang Wu, Yihang Sun, Shaofeng Zhang, Zuxuan Wu, Junchi Yan, Xiaosong Jia, Yu-gang Jiang

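AI summary: This paper proposes semantic-spatial decoupling to resolve representation ambiguity in feedforward novel view synthesis transformers. It separates semantic and spatial tokens, keeps both representations explicit, and preserves cross-branch interaction through shared attention routing; optional categorized supervision and bidirectional modulation further improve the interaction, giving consistent improvements in both decoder-only and encoder-decoder feedforward NVS models.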

详情
Comments
24 pages, 11 figures, 4 tables. Project page: https://hangzay.github.io/ssd_lvsm/
AI中文摘要

基于变换器的模型已推动前馈新视角合成(NVS)的发展。当前架构如GS-LRM和LVSM将语义信息(例如RGB)和空间信息(例如Plücker射线)混合到共享特征空间中。由于Plücker射线自然携带格状空间结构,这些设计会使空间偏差干扰外观表示并降低渲染保真度。为此,我们提出将前馈NVS变换器的表示解耦为单独的语义和空间令牌。解耦设计在各自的分支中保持语义和空间信息的显式性,同时通过共享注意力路由保持跨分支交互。基于此设计,我们引入可选分类监督和双向调制:前者提供分支特定的训练信号,后者提高两个分支之间的交互。值得注意的是,基础解耦设计由于其架构设计几乎不增加推理延迟。所提出的设计实现了持续的改进,证明了其在解码器-only和编码器-解码器前馈NVS模型中的有效性。

英文摘要

Transformer-based models have advanced feedforward novel view synthesis (NVS). Current architectures such as GS-LRM and LVSM mix semantic information (e.g., RGB) and spatial information (e.g., Plücker rays) into a shared feature space. Since Plücker rays naturally carry lattice-like spatial structure, these designs can make the spatial bias interfere with appearance representation and degrade rendering fidelity. To this end, we propose to decouple the representation of feedforward NVS transformers into separate semantic and spatial tokens. The decoupled design keeps semantic and spatial information explicit in their branches while preserving cross-branch interaction through shared attention routing. Built on this design, we introduce optional categorized supervision and bidirectional modulation: the former provides branch-specific training signals, while the latter improves interaction between the two branches. Notably, the base decoupled design introduces virtually zero additional inference latency due to its architectural design. The proposed designs achieve consistent improvements, demonstrating effectiveness across decoder-only and encoder-decoder feedforward NVS models.

2605.18598 2026-05-19 cs.LG cond-mat.stat-mech math.FA math.PR math.ST stat.TH

Pointwise Generalization in Deep Neural Networks

深度神经网络中的逐点泛化

Shaojie Li, Yunbei Xu

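AI summary: This paper proposes a theoretical framework for pointwise generalization in deep neural networks. By analyzing the pointwise Riemannian Dimension of fully connected networks, it builds a new statistical foundation for representation learning and provides tighter generalization bounds.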

详情
AI中文摘要

我们通过建立全连接网络的点wise泛化理论,探讨了深度神经网络为何能够泛化的根本问题。该框架解决了长期以来在刻画丰富非线性特征学习领域中的障碍,并为表示学习建立了新的统计基础。对于每个训练好的模型,我们通过从各层学习的特征表示的本征值推导出点wise Riemannian 维度来表征假设。这建立了一个有原则的框架,用于推导依赖假设的、具有表示意识的泛化界限。这些界限在理论和实验上都比基于模型大小、范数乘积和无限宽度线性化的方法有数量级更紧的保证。在分析上,我们识别了深度网络可 tractable 的结构属性和数学原理。在经验上,点wise Riemannian 维度表现出显著的特征压缩,随着过度参数化程度的增加而减小,并捕捉了优化器的隐含偏置。综合来看,我们的结果表明,深度网络在实际情况下是数学上可 tractable 的,并且其泛化性可以通过点wise、特征谱意识的复杂性得到清晰解释。

英文摘要

We address the fundamental question of why deep neural networks generalize by establishing a pointwise generalization theory for fully connected networks. This framework resolves long-standing barriers to characterizing the rich nonlinear feature-learning regime and builds a new statistical foundation for representation learning. For each trained model, we characterize the hypothesis via a pointwise Riemannian Dimension, derived from the eigenvalues of the learned feature representations across layers. This establishes a principled framework for deriving hypothesis-dependent, representation-aware generalization bounds. These bounds offer a systematic upgrade over approaches based on model size, products of norms, and infinite-width linearizations, yielding guarantees that are orders of magnitude tighter in both theory and experiment. Analytically, we identify the structural properties and mathematical principles that explain the tractability of deep networks. Empirically, the pointwise Riemannian Dimension exhibits substantial feature compression, decreases with increased over-parameterization, and captures the implicit bias of optimizers. Taken together, our results indicate that deep networks are mathematically tractable in practical regimes and that their generalization is sharply explained by pointwise, feature-spectrum-aware complexity.

2605.18591 2026-05-19 cs.LG cs.AI

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

随机优势变换(RAT):通过直接反向传播计算自然策略梯度

Mingfei Sun

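AI summary: This paper proposes RAT, which estimates regularized natural policy gradients via direct backpropagation and avoids the high cost of estimating and inverting the Fisher matrix in conventional methods. Experiments show strong performance on continuous and visual control benchmarks, and the method is simple to implement.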

详情
Comments
Accepted to ICML 2026
AI中文摘要

自然策略梯度通过考虑分布空间的几何特性来提高优化效果,但其实际应用受限于估计和求逆Fisher矩阵的成本。我们提出了随机优势变换(RAT),一种通过直接反向传播估计Tikhonov正则化自然策略梯度的方法。通过应用Woodbury公式,我们将正则化自然策略梯度重新表述为带有变换优势的普通策略梯度。RAT通过在在线小批量上应用随机块Kaczmarz迭代高效计算这种变换,避免了显式Fisher构造、共轭梯度求解器和架构特定的近似。我们为RAT提供了收敛保证,并实验证明其在连续和视觉控制基准上与现有自然梯度方法相媲美或更优,同时保持简单易用且兼容各种架构。

英文摘要

Natural policy gradients improve optimization by accounting for the geometry of distribution space, but their practical use is limited by the cost of estimating and inverting the Fisher matrix. We present Randomized Advantage Transformation (RAT), a method for estimating Tikhonov-regularized natural policy gradients via direct backpropagation. By applying the Woodbury formula, we reformulate the regularized natural policy gradients as vanilla policy gradients with a transformed advantage. RAT computes this transformation efficiently via randomized block Kaczmarz iterations on on-policy mini-batches, avoiding explicit Fisher construction, conjugate-gradient solvers, and architecture-specific approximations. We provide convergence guarantees for RAT and demonstrate empirically that it matches or exceeds established natural-gradient methods across continuous and visual control benchmarks, while remaining simple to implement and compatible with various architectures.
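One way to see how a Woodbury-type identity can turn a regularized natural gradient into a vanilla gradient with transformed advantages is sketched below. It assumes the Fisher matrix is estimated from $n$ samples as $F = \frac{1}{n} J^\top J$, where row $i$ of $J$ is the score $\nabla_\theta \log \pi_\theta(a_i \mid s_i)^\top$, and that the vanilla gradient is $g = \frac{1}{n} J^\top A$ for an advantage vector $A$. The symbols $J$, $A$, $\lambda$, and $\tilde{A}$ are notation introduced here for illustration and may differ from the paper's.

```latex
\begin{align}
(F + \lambda I)^{-1} g
  &= \Big(\tfrac{1}{n} J^\top J + \lambda I\Big)^{-1} \tfrac{1}{n} J^\top A \\
  &= \tfrac{1}{n} J^\top \Big(\tfrac{1}{n} J J^\top + \lambda I\Big)^{-1} A
   = \tfrac{1}{n} J^\top \tilde{A},
\qquad
\tilde{A} = \Big(\tfrac{1}{n} J J^\top + \lambda I\Big)^{-1} A .
\end{align}
```

The second line uses the push-through identity $(J^\top J/n + \lambda I)^{-1} J^\top = J^\top (J J^\top/n + \lambda I)^{-1}$. Under these assumptions, $\tilde{A}$ solves an $n \times n$ linear system in sample space, and the final expression has the form of an ordinary policy gradient evaluated with $\tilde{A}$ in place of $A$, which is why it can be obtained by backpropagation.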

2605.18580 2026-05-19 cs.AI cs.LG

When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State

当结果看似正确但纪律却失败:基于轨迹的评估在隐藏对手状态下的应用

Peiying Zhu, Sidi Chang

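AI summary: This paper proposes a trace-based evaluation method for assessing the stability of behavioral discipline under hidden competitor state. It uses trace diagnostics, mechanism separation, and transfer tests to improve reinforcement learning policies, particularly in hotel pricing and hidden-budget bidding tasks.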

详情
AI中文摘要

仅结果的评估可能无法保证经济安全的智能体:一种策略可能在达到业务KPI的同时,违反可部署的行为纪律。在酒店定价中,当存在隐藏的对手状态时,学习者可能在看似合理的每间房收入上取得成绩,却无法保持规则基于的收益管理对手的定价纪律。我们引入了纪律稳定性,一种基于轨迹的评估范式:定义基准行为,限制观察到部署阶段,从失败中诱导轨迹诊断,通过消融分离机制,并测试转移和部署。在两个酒店基准和一个紧凑的隐藏预算竞标任务中,仅奖励的PPO变体无法实现轨迹对齐;揭示隐藏状态可减少标签不确定性;确定性复制可压缩不确定性;而轨迹先验或修正历史策略能更好地保持价格或投标分布。纯粹的行为克隆在对称模仿中几乎足够,而轨迹先验强化学习在容量不对称情况下增加有限的适应性。本文的贡献是一种评估和基准范式,而不是新的优化器或关于多智能体强化学习的普遍声明。

英文摘要

Outcome-only evaluation can certify economically unsafe agents: a policy can hit a business KPI while violating deployable behavioral discipline. In hotel pricing with hidden competitor state, a learner can achieve plausible revenue per available room while failing to preserve the rate discipline of a rule-based revenue-management competitor. We introduce discipline stability, a trace-based evaluation paradigm: define the benchmark behavior, restrict observations to the deployment regime, induce trace diagnostics from failure, separate mechanisms with ablations, and test transfer and deployment. Across a two-hotel benchmark and a compact hidden-budget bidding task, reward-only PPO variants miss trace alignment; revealing hidden state reduces label uncertainty; deterministic copy collapses uncertainty; and trace-prior or corrected history policies better preserve price or bid distributions. Pure behavior cloning is nearly enough for symmetric imitation, while Trace-Prior RL adds bounded adaptation under capacity asymmetry. The contribution is an evaluation and benchmark paradigm, not a new optimizer or a universal claim about MARL.

2605.18577 2026-05-19 cs.CV

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

OmniPro: 一个全面的多模态流视频理解基准测试

Ruixiang Zhao, Jie Yang, Zijie Xin, Tianyi Wang, Fengyun Rao, Jing LYU, Xirong Li

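AI summary: This paper presents the OmniPro benchmark for evaluating omni-modal perception, proactive responding, and diverse video understanding tasks. Its 2,700 human-verified samples span 9 sub-tasks and 3 cognitive levels, and the results reveal the key role of audio in streaming video understanding and the robustness problems models face on long-horizon tasks.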

详情
Comments
Project page: https://ruixiangzhao.github.io/OmniPro
AI中文摘要

多模态流视频理解,即从连续音频视频流中自主决定何时说话和说什么,是多模态大语言模型的一种新兴能力。现有基准测试在三个方面存在不足:它们主要依赖视觉信号,采用轮询或固定时间戳协议而不是真正的主动评估,并且只涵盖有限的任务范围,从而无法可靠地评估和区分多模态流模型。我们提出了OmniPro,这是第一个联合评估多模态感知、主动响应和多样化视频理解任务的基准测试。它包含2700个经人类验证的样本,涵盖9个子任务和3个认知层次,覆盖6种基本视频理解能力。值得注意的是,84%的样本需要音频信号(语音或非语音),并且每个样本都带有模态隔离标签,以实现细粒度的多模态分析。我们进一步引入了一种双模式评估协议:探测模式通过在每个真实触发前后查询模型来评估内容理解,而在线模式通过要求模型在流输入中自主决定何时响应来评估全面的主动能力。评估11个代表性模型后发现三个关键发现:(1)音频提供了一致的收益,但不同模型的利用情况差异很大;(2)性能随时间显著下降,表明长期鲁棒性有限;(3)非语音音频感知仍然是最弱的维度。

英文摘要

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.

2605.18576 2026-05-19 cs.LG

scHelix: Asymmetric Dual-Stream Integration via Explicit Gene-Level Disentanglement

scHelix: 通过显式基因层面解缠实现非对称双流整合

Xichen Yan, Zelin Zang, Changxi Chi, Jingbo Zhou, Chang Yu, Jinlin Wu, Shenghui Cheng, Fuji Yang, Jiebo Luo, Zhen Lei, Stan Z. Li

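AI summary: scHelix performs asymmetric dual-stream integration through explicit gene-level disentanglement, addressing the tension in single-cell RNA sequencing data integration between removing batch effects and preserving biological fidelity. A dual-stream sparse diffusion encoder and an asymmetric Align-Refine-Fuse protocol improve integration quality.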

详情
Comments
17 pages, 8 figures, accepted by KDD 26
AI中文摘要

单细胞RNA测序(scRNA-seq)数据整合中一个关键挑战是解决消除批次效应与保持生物学忠实性之间的张力。尽管近期证据表明批次效应在基因层面异质性表现,但大多数现有方法对转录组进行统一处理,常导致过度校正和细微生物学信号的丢失。为此,我们提出了scHelix,一个数据自适应框架,通过在输入层面显式将基因划分为领域不变的Anchors和领域敏感的Variants。scHelix利用配备停止梯度图缓存的双流稀疏扩散编码器,高效学习多尺度结构表示。我们的核心方法是一种新的非对称Align-Refine-Fuse协议:首先将不稳定的Variant流对齐到稳定的Anchor流拓扑结构,随后进行保守细化阶段,其中Anchor流通过有界残差门吸收去噪细节。这种分而治之的架构防止了捷径学习,确保在不损害生物簇完整性的情况下实现稳健的批次去除。广泛基准测试表明,scHelix在性能上优于现有最先进方法。

英文摘要

A critical challenge in single-cell RNA sequencing (scRNA-seq) integration is resolving the tension between eliminating batch effects and maintaining biological fidelity. While recent evidence indicates that batch effects manifest heterogeneously across genes, most existing methods process the transcriptome uniformly, frequently resulting in over-correction and loss of subtle biological signals. To address this, we present scHelix, a dataset-adaptive framework that fundamentally changes how features are processed by explicitly partitioning genes into domain-invariant Anchors and domain-sensitive Variants at the input level. scHelix utilizes a dual-stream sparse diffusion encoder equipped with stop-gradient graph caching to efficiently learn multi-scale structural representations. The core of our approach is a novel asymmetric Align-Refine-Fuse protocol: the unstable Variant stream is first aligned to the robust topology of the Anchor stream, followed by a conservative refinement phase where the Anchor stream absorbs denoised details via bounded residual gating. This divide-and-conquer architecture prevents shortcut learning and ensures robust batch removal without compromising the integrity of biological clusters. Extensive benchmarking demonstrates that scHelix outperforms state-of-the-art methods.

2605.18572 2026-05-19 cs.CL

MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion

MA$^{2}$P: 一种用于复杂说服的元认知自主智能体框架

Dingyi Zhang, Ziqing Zhuang, Linhai Zhang, Ziyang Gao, Deyu Zhou

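AI summary: This paper proposes the MA$^{2}$P framework, which uses an autonomous multi-agent architecture and a meta-cognitive configurator to address the challenge in complex persuasion that the persuader must interpret internal states, infer latent mental states, and generate strategy-consistent actions, raising the persuasion success rate.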

详情
Comments
22 pages, 8 figures. Accepted to Findings of ACL 2026

英文摘要

Persuasive dialogue generation plays a vital role in decision-making, negotiation, counseling, and behavior change, yet it remains a challenging problem. In complex persuasion where the persuadee's internal states are not expressed clearly, the persuader must interpret responses, infer the persuadee's latent mental states (e.g., beliefs and desires), and translate them into targeted, strategy-consistent actions; however, current approaches often produce generic or weakly grounded responses even when such cues are identified. Moreover, although large language models (LLMs) can generate persuasive content, their performance varies substantially across domains due to uneven knowledge coverage and limited reasoning generalization. To address these challenges, we propose MA$^{2}$P, a meta-cognitive autonomous intelligent agent framework for complex persuasion. Specifically, we develop an autonomous multi-agent architecture that coordinates perception management, mental-state inference, strategy execution, memory maintenance, and performance evaluation. To mitigate cross-domain performance variation, we further design a meta-cognitive configurator that selects an appropriate meta-strategy from a structured knowledge base at the outset, thereby guiding subsequent reasoning and planning. Experimental results show that our approach achieves a higher persuasion success rate than baselines.

2605.18570 2026-05-19 cs.AI

Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning

基于查询的知识对齐用于可靠的跨系统医学推理

Yan Jiao, Jingran Xu, Pin-Han Ho, Limei Peng

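AI summary: This paper proposes QCEA, a query-conditioned knowledge alignment method that recasts entity alignment as a query-conditioned correspondence problem to improve the reliability of cross-system medical reasoning. Its main contribution is a direction-aware transformation module that captures asymmetric and many-to-many correspondences across heterogeneous knowledge systems.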

详情
AI中文摘要

跨领域知识对齐对于整合异构医学系统至关重要,但现有方法通常将实体对齐视为静态匹配问题,忽略了查询上下文和跨系统不对称性。本论文提出查询条件实体对齐(QCEA),将实体对齐重新表述为查询条件下的对应关系问题。不同于学习实体表示之间的固定映射,QCEA将源实体的文本描述视为查询,并在目标图中对候选实体进行排序,从而实现依赖上下文的对齐。该框架整合了语义编码、基于图的表示学习以及方向感知变换模块,以捕捉异构知识系统中的非对称和多对多对应关系。我们评估了QCEA在TCM--WM知识图谱上的表现,涵盖了症状对齐和草药-分子对齐任务。实验结果表明,QCEA在代表性基线方法上表现一致改进,特别是在对排名敏感的指标如Hit@K和MRR上。此外,下游检索增强生成(RAG)实验表明,改进的对齐导致更好的证据检索、更强的支撑性和更高的答案准确性。这些发现强调,对齐不仅仅是数据整合步骤,而是影响跨系统医学推理中知识可访问性和可靠性的关键因素。

英文摘要

Cross-domain knowledge alignment is essential for integrating heterogeneous medical systems, yet existing approaches typically treat entity alignment as a static matching problem, ignoring query context and cross-system asymmetry. This limitation is particularly critical in integrative medical settings, where correspondence between concepts is inherently context-dependent, non-bijective, and direction-sensitive. In this paper, we propose Query-Conditioned Entity Alignment (QCEA), which reformulates entity alignment as a query-conditioned correspondence problem. Instead of learning a fixed mapping between entity representations, QCEA treats the textual description of a source entity as a query and ranks candidate entities in the target graph, enabling context-dependent alignment. The framework integrates semantic encoding, graph-based representation learning, and a direction-aware transformation module to capture asymmetric and many-to-many correspondence across heterogeneous knowledge systems. We evaluate QCEA on TCM--WM knowledge graphs derived from SymMap, covering both symptom alignment and herb--molecule alignment tasks. Experimental results show consistent improvements over representative baselines, particularly on rank-sensitive metrics such as Hit@K and MRR. Furthermore, downstream retrieval-augmented generation (RAG) experiments demonstrate that improved alignment leads to better evidence retrieval, stronger grounding, and higher answer accuracy. These findings highlight that alignment is not merely a data integration step, but a key factor that shapes knowledge accessibility and reliability in cross-system medical reasoning.
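The rank-sensitive metrics mentioned in this abstract, Hit@K and MRR, can be computed as in the sketch below. It assumes each query has a single gold target entity and that the model returns an ordered candidate list; the function names and data layout are illustrative, and a many-to-many evaluation would need a set of gold entities per query.

```python
from typing import Sequence


def hit_at_k(rankings: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold entity appears among the top-k candidates."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in list(ranked)[:k])
    return hits / len(gold)


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Mean of 1/rank of the gold entity; a query scores 0 if the entity is absent."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        ranked = list(ranked)
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)
```

Hit@K rewards any placement inside the top K equally, while MRR also credits how high the gold entity sits, so the two are usually reported together.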