arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 3818
2601.06943 2026-05-19 cs.CV cs.AI

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

观看、推理与搜索:一个面向开放网络的视频深度研究基准,用于代理视频推理

Chengwen Liu, Xiaomin Yu, Zhuoyue Chang, Zhe Huang, Shuo Zhang, Heng Lian, Jisheng Dang, Rui Xu, Sen Hu, Jianheng Hou, Chengwei Qin, Xiaobin Hu, Kunyi Wang, Zhi Yang, Hao Peng, Hong Peng, Ronghao Chen, Huacan Wang

AI总结 本文提出VideoDR基准,用于研究开放网络环境下视频代理推理,通过跨帧视觉锚点提取、交互式网络检索和多跳推理验证,揭示了长检索链中维持初始视频锚点、目标漂移和长时程一致性等关键挑战。

详情
AI中文摘要

在现实世界视频问答场景中,视频往往只提供局部视觉线索,而可验证答案分布在开放网络中;模型因此需要联合执行跨帧线索提取、迭代检索和基于多跳推理的验证。为弥合这一差距,我们构建了首个视频深度研究基准VideoDR。VideoDR专注于视频条件的开放领域视频问答,要求进行跨帧视觉锚点提取、交互式网络检索和基于联合视频-网络证据的多跳推理;通过严格的真人标注和质量控制,我们获得了涵盖六个语义领域的高质量视频深度研究样本。我们评估了多种闭源和开源多模态大语言模型在Workflow和Agentic范式下的表现,结果表明Agentic并不始终优于Workflow:其收益取决于模型在长检索链中维持初始视频锚点的能力。进一步分析表明,目标漂移和长时程一致性是核心瓶颈。总之,VideoDR为研究开放网络环境下视频代理提供了系统性的基准,并揭示了下一代视频深度研究代理的关键挑战。

英文摘要

In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.

2601.06163 2026-05-19 cs.CV cs.LG

Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking

Forget-It-All: 通过概念感知神经元掩码实现多概念机器去学习

Kaiyuan Deng, Bo Hui, Gen Li, Jie Ji, Minghai Qin, Geng Yuan, Xiaolong Ma

AI总结 该研究提出Forget-It-All框架,通过利用模型稀疏性,解决多概念去学习问题,有效提升去学习效果并保持生成质量。

Comments Accepted to ICML 2026

详情
Journal ref
Forty-Third International Conference on Machine Learning (ICML 2026)
AI中文摘要

文本到图像(T2I)扩散模型的广泛应用引发了对其可能生成版权、不当或敏感图像的担忧。作为实际解决方案,机器去学习旨在在不重新训练的情况下删除不需要的概念。尽管现有方法在单概念去学习中有效,但去除多个概念时往往面临显著挑战,包括去学习效果、生成质量和对超参数和数据集的敏感性。我们通过利用模型稀疏性,从独特角度看待多概念去学习,并提出Forget It All(FIA)框架。FIA首先引入对比概念显著性以量化每个权重连接对目标概念的贡献。然后通过结合时间信息和空间信息,识别出概念敏感神经元,确保只选择那些一致响应目标概念的神经元。最后,FIA从识别的神经元中构建掩码,并将其融合成统一的多概念掩码,其中对一般内容生成有广泛支持的无概念神经元被保留,而概念特定神经元被修剪以去除目标。FIA是无训练的,需要最少超参数调整即可用于新任务,实现即插即用。在三个不同的去学习任务上进行了广泛的实验,证明FIA在多概念去学习中实现了更可靠的性能,提高了遗忘效果同时保持生成的保真度和质量。代码可在https://github.com/kaiyuan02415/Forget-It-All获取。

英文摘要

The widespread adoption of text-to-image (T2I) diffusion models has raised concerns about their potential to generate copyrighted, inappropriate, or sensitive imagery. As a practical solution, machine unlearning aims to erase unwanted concepts without retraining from scratch. While most existing methods are effective for single-concept unlearning, they often struggle when removing multiple concepts, causing significant challenges in unlearning effectiveness, generation quality, and sensitivity to hyperparameters and datasets. We take a unique perspective on multi-concept unlearning by leveraging model sparsity and propose the Forget It All (FIA) framework. FIA first introduces Contrastive Concept Saliency to quantify each weight connection's contribution to a target concept. It then identifies Concept Sensitive Neurons by combining temporal and spatial information, ensuring that only neurons consistently responsive to the target concept are selected. Finally, FIA constructs masks from the identified neurons and fuses them into a unified multi-concept mask, where Concept Agnostic Neurons that broadly support general content generation are preserved while concept-specific neurons are pruned to remove the targets. FIA is training-free and requires minimal hyperparameter tuning for new tasks, enabling plug-and-play use. Extensive experiments across three distinct unlearning tasks demonstrate that FIA achieves more reliable multi-concept unlearning, improving forgetting effectiveness while maintaining generation fidelity and quality. Code is available at https://github.com/kaiyuan02415/Forget-It-All

2601.06162 2026-05-19 cs.LG cs.CV

Forget Many, Forget Right: Scalable and Precise Concept Unlearning in Diffusion Models

忘却众多,忘却正确:扩散模型中可扩展且精确的概念反学习

Kaiyuan Deng, Gen Li, Yang Xiao, Bo Hui, Xiaolong Ma

AI总结 本文提出了一种名为ScaPre的统一框架,用于在大规模扩散模型中实现精确的概念反学习,通过解决冲突更新、不精确机制和依赖额外数据的问题,提高了反学习的效率和精度。

Comments Accepted at ICLR 2026

详情
Journal ref
International Conference on Learning Representations (ICLR) 2026
AI中文摘要

文本到图像的扩散模型已取得显著进展,但其使用引发了版权和滥用问题,促使研究机器反学习。然而,将多概念反学习扩展到大规模场景仍然困难,因为存在三个挑战:(i)冲突的权重更新会阻碍反学习或降低生成质量;(ii)不精确的机制会导致对相似内容的损害;(iii)依赖额外数据或模块,造成可扩展性瓶颈。为了解决这些问题,我们提出了可扩展-精确概念反学习(ScaPre),一种专门针对大规模反学习的统一框架。ScaPre引入了冲突感知的稳定设计,整合了谱迹正则化和几何对齐,以稳定优化、抑制冲突并保持全局结构。此外,Informax解耦器识别与概念相关的参数并自适应地重新加权更新,严格将反学习限制在目标子空间内。ScaPre产生了一个高效的闭式解,无需额外数据或子模型。在对象、风格和显性内容上的全面实验表明,ScaPre能够有效移除目标概念并保持生成质量。它比最佳基线在可接受的质量限制内能忘却多达$ imes \mathbf{5}$更多的概念,实现了大规模反学习的最先进精度和效率。代码可在https://github.com/kaiyuan02415/scapre获取。

英文摘要

Text-to-image diffusion models have achieved remarkable progress, yet their use raises copyright and misuse concerns, prompting research into machine unlearning. However, extending multi-concept unlearning to large-scale scenarios remains difficult due to three challenges: (i) conflicting weight updates that hinder unlearning or degrade generation; (ii) imprecise mechanisms that cause collateral damage to similar content; and (iii) reliance on additional data or modules, creating scalability bottlenecks. To address these, we propose Scalable-Precise Concept Unlearning (ScaPre), a unified framework tailored for large-scale unlearning. ScaPre introduces a conflict-aware stable design, integrating spectral trace regularization and geometry alignment to stabilize optimization, suppress conflicts, and preserve global structure. Furthermore, an Informax Decoupler identifies concept-relevant parameters and adaptively reweights updates, strictly confining unlearning to the target subspace. ScaPre yields an efficient closed-form solution without requiring auxiliary data or sub-models. Comprehensive experiments on objects, styles, and explicit content demonstrate that ScaPre effectively removes target concepts while maintaining generation quality. It forgets up to $\times \mathbf{5}$ more concepts than the best baseline within acceptable quality limits, achieving state-of-the-art precision and efficiency for large-scale unlearning. Code is available at https://github.com/kaiyuan02415/scapre

2601.04855 2026-05-19 cs.LG cs.AI

Rethinking GNNs and Missing Features: Challenges, Evaluation and a Robust Solution

重新思考图神经网络与缺失特征:挑战、评估和一个稳健的解决方案

Francesco Ferrini, Veronica Lachi, Antonio Longa, Bruno Lepri, Matono Akiyoshi, Andrea Passerini, Xin Liu, Manfred Jaeger

AI总结 本文针对图神经网络中缺失节点特征的问题,提出了一种稳健的解决方案,通过设计更真实的缺失机制和评估协议,提高了模型的鲁棒性。

详情
AI中文摘要

处理缺失节点特征是部署图神经网络(GNNs)在现实领域如医疗和传感器网络中的关键挑战。现有研究主要针对相对温和的场景,即基准数据集,其中节点特征具有高维但稀疏的特征和由完全随机缺失(MCAR)机制生成的不完整数据。对于(a),我们理论证明高稀疏性显著限制了缺失性导致的信息损失,使所有模型看起来稳健,从而防止了对性能的有意义比较。为克服这一限制,我们引入了一个合成和三个真实世界的数据集,具有密集且语义丰富的特征。对于(b),我们超越MCAR并设计了更真实的缺失机制的评估协议。此外,我们提供了理论背景,明确陈述了缺失过程的假设,并分析了这些假设对不同方法的影响。基于此分析,我们提出了GNNmim,一种简单但有效的基线模型,用于具有不完整特征数据的节点分类。实验表明,GNNmim在各种数据集和缺失性制度下与专门设计的架构具有竞争力。

英文摘要

Handling missing node features is a key challenge for deploying Graph Neural Networks (GNNs) in real-world domains such as healthcare and sensor networks. Existing studies mostly address relatively benign scenarios, namely benchmark datasets with (a) high-dimensional but sparse node features and (b) incomplete data generated under Missing Completely At Random (MCAR) mechanisms. For (a), we theoretically prove that high sparsity substantially limits the information loss caused by missingness, making all models appear robust and preventing a meaningful comparison of their performance. To overcome this limitation, we introduce one synthetic and three real-world datasets with dense, semantically meaningful features. For (b), we move beyond MCAR and design evaluation protocols with more realistic missingness mechanisms. Moreover, we provide a theoretical background to state explicit assumptions on the missingness process and analyze their implications for different methods. Building on this analysis, we propose GNNmim, a simple yet effective baseline for node classification with incomplete feature data. Experiments show that GNNmim is competitive with respect to specialized architectures across diverse datasets and missingness regimes.

2601.03425 2026-05-19 cs.LG cs.AI

The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models

领域专精的幻觉:揭示混合专家模型中的领域不变‘ standing committee ’

Yan Wang, Yitao Xu, Nanhan Shen, Jinyan Su, Jimin Huang, Zining Zhu

AI总结 本研究质疑混合专家模型通过稀疏路由实现领域专精的假设,提出COMMITTEEAUDIT框架分析专家组而非个体专家的路由行为,发现领域不变的standing committee,揭示模型存在向集中计算偏倚的结构倾向,表明混合专家模型中的专精程度远低于预期。

Comments Accepted by ACL 2026 main conference. Camera-ready version

详情
AI中文摘要

混合专家模型被广泛假设通过稀疏路由实现领域专精。在本工作中,我们通过引入COMMITTEEAUDIT框架,质疑这一假设,该框架在专家组层面而非个体专家层面分析路由行为。在三个代表性模型和MMLU基准测试中,我们揭示了一个领域不变的standing committee。这是一个紧凑的路由专家联盟,能够跨领域、层和路由预算持续捕获大多数路由质量,即使在架构已包含共享专家的情况下。定性分析进一步显示,standing committee锚定推理结构和语法,而外围专家处理领域特定知识。这些发现揭示了模型对集中计算的强结构偏倚,表明混合专家模型中的专精程度远低于人们普遍认为的水平。这种固有偏倚也表明,当前的训练目标,如强制均匀专家利用的负载平衡损失,可能与模型的自然优化路径相悖,从而限制了训练效率和性能。

英文摘要

Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. In this work, we question this assumption by introducing COMMITTEEAUDIT, a post hoc framework that analyzes routing behavior at the level of expert groups rather than individual experts. Across three representative models and the MMLU benchmark, we uncover a domain-invariant Standing Committee. This is a compact coalition of routed experts that consistently captures the majority of routing mass across domains, layers, and routing budgets, even when architectures already include shared experts. Qualitative analysis further shows that Standing Committees anchor reasoning structure and syntax, while peripheral experts handle domain-specific knowledge. These findings reveal a strong structural bias toward centralized computation, suggesting that specialization in Mixture of Experts models is far less pervasive than commonly believed. This inherent bias also indicates that current training objectives, such as load-balancing losses that enforce uniform expert utilization, may be working against the model's natural optimization path, thereby limiting training efficiency and performance.

2601.03170 2026-05-19 cs.SD

TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis

TED-TTS: 无需训练的语句内情感和时长控制用于文本到语音合成

Qifan Liang, Yuansen Liu, Ruixin Wei, Nan Lu, Junchuan Zhao, Ye Wang

AI总结 本文提出TED-TTS,一种无需训练的可控框架,用于预训练零样本TTS,实现语句内情感和时长表达。通过段落感知的情感条件策略和时长控制策略,结合因果掩码和单调流对齐过滤,实现平滑的情感变化并保持全局语义一致性,同时利用大规模多情感和时长标注数据集进行自动提示生成,实验表明其在多情感和时长控制中达到最先进的语句内一致性,同时保持基础TTS模型的语音质量。

Comments 24 pages, 9 figures, 7 tables, 3 lists

详情
AI中文摘要

尽管可控的文本到语音(TTS)已经取得了显著进展,但大多数现有方法仍局限于跨语句级别的控制,由于依赖非公开数据集或复杂的多阶段训练,导致细粒度的语句内表达具有挑战性。在本文中,我们提出TED-TTS,一种无需训练的可控框架,用于预训练零样本TTS,以实现语句内的情感和时长表达。具体而言,我们提出了一种段落感知的情感条件策略,结合因果掩码与单调流对齐过滤,以隔离情感条件并调度掩码转换,从而实现平滑的语句内情感变化,同时保持全局语义一致性。基于此,我们进一步提出了一种段落感知的时长控制策略,结合局部时长嵌入控制与全局EOS对数调节,允许局部时长调整,同时确保全局一致的终止。为了消除对段落级手动提示工程的依赖,我们构建了一个包含30,000个样本的多情感和时长标注文本数据集,以实现基于大语言模型的自动提示生成。广泛的实验表明,我们的无需训练方法不仅在多情感和时长控制中实现了最先进的语句内一致性,还保持了基础TTS模型的语音质量。代码和音频示例可供参考。

英文摘要

While controllable Text-to-Speech (TTS) has achieved notable progress, most existing methods remain limited to inter-utterance-level control, making fine-grained intra-utterance expression challenging due to their reliance on non-public datasets or complex multi-stage training. In this paper, we propose TED-TTS, a training-free controllable framework for pretrained zero-shot TTS to enable intra-utterance emotion and duration expression. Specifically, we propose a segment-aware emotion conditioning strategy that combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning and schedule mask transitions, enabling smooth intra-utterance emotion shifts while preserving global semantic coherence. Based on this, we further propose a segment-aware duration steering strategy to combine local duration embedding steering with global EOS logit modulation, allowing local duration adjustment while ensuring globally consistent termination. To eliminate the need for segment-level manual prompt engineering, we construct a 30,000-sample multi-emotion and duration-annotated text dataset to enable LLM-based automatic prompt construction. Extensive experiments demonstrate that our training-free method not only achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control, but also maintains baseline-level speech quality of the underlying TTS model. Code and audio samples are available.

2601.01593 2026-05-19 cs.CV cs.MM

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

超越补丁:面向多模态少样本字体生成的全局感知自回归模型

Haonan Cai, Yuxuan Luo, Zhouhui Lian

AI总结 本文提出GAR-Font,一种多模态少样本字体生成的自回归框架,通过全局感知分词器、多模态风格编码器和后处理流程,提升了字体生成的全局风格一致性和质量。

Comments 28 pages, Accepted as CVPR 2026 Conference Paper

详情
AI中文摘要

手动字体设计是一个将风格化视觉概念转化为一致的字形集的复杂过程。在自动少样本字体生成(FFG)中,模型常常难以在有限参考下保持结构完整性和风格忠实性。尽管自回归(AR)模型展示了出色的生成能力,但其在FFG中的应用受限于传统的补丁级标记化,这忽略了对字体合成至关重要的全局依赖关系。此外,现有FFG方法仍局限于图像到图像的范式,仅依赖视觉参考,忽略了语言在传达字体设计风格意图中的作用。为了解决这些限制,我们提出了GAR-Font,一种新的AR框架用于多模态少样本字体生成。GAR-Font引入了一个全局感知的分词器,能够有效捕捉局部结构和全局风格模式,一个多模态风格编码器通过轻量级的语言-风格适配器提供灵活的风格控制,无需进行高强度的多模态预训练,并且一个后处理流程进一步增强了结构完整性和风格一致性。大量实验表明,GAR-Font在现有FFG方法上表现更优,尤其在保持全局风格忠实性和在文本风格指导下获得更高质量的结果方面表现出色。

英文摘要

Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.

2601.01155 2026-05-19 cs.RO

ORION: Option-Regularized Deep Reinforcement Learning for Cooperative Multi-Agent Online Navigation

ORION:用于合作多智能体在线导航的选项正则化深度强化学习

Shizhe Zhang, Jingsong Liang, Zhitao Zhou, Shuhan Ye, Yizhuo Wang, Ming Siang Derek Tan, Jimmy Chiun, Yuhong Cao, Guillaume Sartoretti

AI总结 该研究提出ORION框架,通过选项批评方法和双阶段合作策略,解决部分已知环境中多智能体导航的路径最优与环境信息收集之间的平衡问题,实现高效实时协作。

详情
AI中文摘要

现有多智能体导航方法通常假设环境完全已知,难以应对部分已知场景中过时或不完整的先验地图,如仓库或工厂 floor。在此类场景中,智能体需要在路径最优与收集和共享环境信息之间取得平衡。为此,我们提出了ORION,一种用于部分已知环境中合作多智能体在线导航的新型深度强化学习框架。从不完美的先验地图开始,ORION训练智能体进行去中心化决策,朝向个体目标协调,并通过在线感知共享在闭环感知-动作循环中主动减少任务相关的地图不确定性。我们首先设计了一个共享图编码器,将先验地图与在线感知融合为统一的表示,提供在环境差异下的鲁棒状态嵌入。ORION的核心是一个选项批评框架,学习转化为低层动作序列的高层合作模式,使智能体能够自适应地在个体导航和团队层面探索之间切换。我们进一步引入了双阶段合作策略,使智能体能够在地图不确定性下协助队友,从而减少总体完成时间。在广泛的迷宫状地图和大规模仓库环境中,ORION实现了高质量的实时去中心化协作,并可扩展到多达10个机器人,优于最先进的经典和学习基线。最后,我们在物理机器人团队上验证了ORION,证明了其在现实世界协作导航中的鲁棒性和实用性。

英文摘要

Existing methods for multi-agent navigation typically assume fully known environments, offering limited support for partially known scenarios with outdated or imperfect prior maps, such as warehouses or factory floors. There, agents need to balance path optimality with collecting and sharing environmental information to help teammates reach their own targets. To these ends, we propose ORION, a novel deep reinforcement learning framework for cooperative multi-agent online navigation in partially known environments. Starting from an imperfect prior map, ORION trains agents to make decentralized decisions, coordinate toward individual targets, and actively reduce task-relevant map uncertainty through online observation sharing in a closed perception-action loop. We first design a shared graph encoder that fuses prior map with online perception into a unified representation, providing robust state embeddings under environmental discrepancies. At the core of ORION is an option-critic framework that learns high-level cooperative modes translated into sequences of low-level actions, enabling adaptive switching between individual navigation and team-level exploration. We further introduce a dual-stage cooperation strategy that allows agents to assist teammates under map uncertainty, thereby reducing the overall makespan. Across extensive maze-like maps and large-scale warehouse environments, ORION achieves high-quality real-time decentralized cooperation while scaling to up to 10 robots, outperforming state-of-the-art classical and learning-based baselines. Finally, we validate ORION on physical robot teams, demonstrating its robustness and practicality for real-world cooperative navigation.

2601.01123 2026-05-19 cs.LG cs.AI

Learning from Historical Activations in Graph Neural Networks

在图神经网络中学习历史激活

Yaniv Galron, Hadar Sinai, Haggai Maron, Moshe Eliasof

AI总结 本文提出HISTOGRAPH,一种基于注意力的两阶段最终聚合层,通过层间和节点间的注意力机制,利用节点的激活历史和图结构来优化最终预测特征,从而在多个图分类基准上实现了优于传统方法的性能。

Comments ICLR 2026

详情
AI中文摘要

图神经网络(GNNs)在社交网络、分子化学等领域展现了显著的成功。GNNs的关键组成部分是池化过程,其中模型计算的节点特征被结合成一个有信息量的最终描述符,用于下游任务。然而,先前的图池化方案依赖于最后一个GNN层的特征作为池化或分类层的输入,这可能未能充分利用模型前向传递过程中先前层产生的重要激活,即历史图激活。这种差距在节点表示在许多图神经层中显著变化的情况下尤为明显,并且在深度架构中受到过平滑问题的加剧。为弥合这一差距,我们引入HISTOGRAPH,一种新颖的两阶段注意力最终聚合层,首先在中间激活上应用统一的层间注意力,随后进行节点间注意力。通过建模节点表示在层间的演变,我们的HISTOGRAPH利用节点的激活历史和图结构来优化最终预测所用的特征。在多个图分类基准上的实验证明,HISTOGRAPH提供了强大的性能,能够一致地改进传统技术,特别是在深度GNNs中表现出特别强的鲁棒性。

英文摘要

Graph Neural Networks (GNNs) have demonstrated remarkable success in various domains such as social networks, molecular chemistry, and more. A crucial component of GNNs is the pooling procedure, in which the node features calculated by the model are combined to form an informative final descriptor to be used for the downstream task. However, previous graph pooling schemes rely on the last GNN layer features as an input to the pooling or classifier layers, potentially under-utilizing important activations of previous layers produced during the forward pass of the model, which we regard as historical graph activations. This gap is particularly pronounced in cases where a node's representation can shift significantly over the course of many graph neural layers, and worsened by graph-specific challenges such as over-smoothing in deep architectures. To bridge this gap, we introduce HISTOGRAPH, a novel two-stage attention-based final aggregation layer that first applies a unified layer-wise attention over intermediate activations, followed by node-wise attention. By modeling the evolution of node representations across layers, our HISTOGRAPH leverages both the activation history of nodes and the graph structure to refine features used for final prediction. Empirical results on multiple graph classification benchmarks demonstrate that HISTOGRAPH offers strong performance that consistently improves traditional techniques, with particularly strong robustness in deep GNNs.

2512.24497 2026-05-19 cs.AI cs.LG cs.RO stat.ML

What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

在联合嵌入预测世界模型中成功因素是什么?

Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun

AI总结 本文研究了在物理规划中使用联合嵌入预测世界模型(JEPA-WMs)的成功因素,通过分析模型架构、训练目标和规划算法对规划成功的影响,提出了一种在导航和操作任务中优于现有基线方法的模型。

Comments V2 of the article: - Added AdaLN-zero - Added table comparing JEPA-WMs with baselines with std translating per-seed variability only, no variability across epochs - Reordered figures in main body of the paper V3: added data scaling experiments, theoretical appendix section on autoregressive rollout, acceptance at TMLR

详情
AI中文摘要

人工智能领域长期存在的挑战是开发能够解决广泛物理任务并泛化到新、未见过的任务和环境的智能体。一种流行的近期方法是通过状态-动作轨迹训练世界模型,然后使用规划算法解决新任务。规划通常在输入空间中进行,但最近出现的一类方法引入了在学习的表示空间中优化的规划算法,其承诺通过抽象无关细节来提高规划效率。在本工作中,我们将此类模型称为JEPA-WMs,并研究使此类算法有效技术选择。我们提出了一项全面研究几个关键组件,旨在找到该类中的最佳方法。我们使用模拟环境和真实世界机器人数据进行了实验,并研究了模型架构、训练目标和规划算法对规划成功的影响。我们结合发现,提出了一种在导航和操作任务中优于两个现有基线方法(DINO-WM和V-JEPA-2-AC)的模型。代码、数据和检查点可在https://github.com/facebookresearch/jepa-wms上获得。

英文摘要

A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently use it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using both simulated environments and real-world robotic data, and studied how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa-wms.

2512.23994 2026-05-19 cs.SD cs.AI

PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

PhyAVBench: 一个具有挑战性的音频物理敏感性基准,用于物理基础的文本到音频视频生成

Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, Li Liu

AI总结 本文提出PhyAVBench,一个用于评估文本到音频视频生成、图像到音频视频生成和视频到音频生成模型中音频-物理基础能力的基准,通过引入新的数据集和评估方法,揭示了当前模型在物理合理音频生成方面的不足。

Comments 6 major physical dimensions, 41 fine-grained test points, 337 groups of variable-controlled test samples, 11,605 newly recorded videos

详情
AI中文摘要

文本到音频视频(T2AV)生成在影视制作和世界建模等应用中至关重要。然而,当前模型往往无法生成物理上合理的音效。先前的基准主要关注音频视频时间同步,而忽视了对音频-物理基础的显式评估,从而限制了对物理合理音频视频生成的研究。为了解决这个问题,我们提出了PhyAVBench,这是第一个系统评估T2AV、I2AV和V2A模型音频-物理基础能力的基准。PhyAVBench提供PhyAV-Sound-11K,一个包含来自184名参与者25.5小时11,605个可听视频的新数据集,以确保多样性和避免数据泄漏。它包含337对提示组,具有受控的物理变化,驱动声音差异,每个组平均有17个视频,涵盖6个音频-物理维度和41个细粒度测试点。每个提示对都标注了其声音差异背后的物理因素。重要的是,PhyAVBench利用配对文本提示来评估这一能力。我们称这种评估范式为音频-物理敏感性测试(APST),并引入了一个新的指标,对比物理响应分数(CPRS),用于量化生成视频与现实世界对应物之间的声音一致性。我们对17种最先进的模型进行了全面评估。我们的结果表明,即使领先的商业模型在基本的音频物理现象上也存在问题,揭示了超出音频视频同步之外的关键差距,并指明了未来的研究方向。我们希望PhyAVBench能为推进物理基础的音频视频生成提供基础。提示、真实值和生成视频样本可在https://github.com/imxtx/PhyAVBench上获得。

英文摘要

Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks primarily focus on audio-video temporal synchronization, while largely overlooking explicit evaluation of audio-physics grounding, thereby limiting the study of physically plausible audio-visual generation. To address this issue, we present PhyAVBench, the first benchmark that systematically evaluates the audio-physics grounding capabilities of T2AV, image-to-audio-video (I2AV), and video-to-audio (V2A) models. PhyAVBench offers PhyAV-Sound-11K, a new dataset of 25.5 hours of 11,605 audible videos collected from 184 participants to ensure diversity and avoid data leakage. It contains 337 paired-prompt groups with controlled physical variations that drive sound differences, each grounded with an average of 17 videos and spanning 6 audio-physics dimensions and 41 fine-grained test points. Each prompt pair is annotated with the physical factors underlying their acoustic differences. Importantly, PhyAVBench leverages paired text prompts to evaluate this capability. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST) and introduce a novel metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. We conduct a comprehensive evaluation of 17 state-of-the-art models. Our results reveal that even leading commercial models struggle with fundamental audio-physical phenomena, exposing a critical gap beyond audio-visual synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio-visual generation. Prompts, ground-truth, and generated video samples are available at https://github.com/imxtx/PhyAVBench.

2512.19134 2026-05-19 cs.CL cs.IR

QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

QuCo-RAG:从预训练语料库中量化不确定性以实现动态检索增强生成

Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng

AI总结 本研究提出QuCo-RAG,通过从预训练语料库中提取客观统计信息来量化不确定性,以解决动态检索增强生成中大语言模型的幻觉问题,实验表明其在多跳问答基准测试中优于现有方法,并在多个模型上实现了显著的提升。

Comments ACL Findings 2026

详情
AI中文摘要

Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5--12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama-3, Qwen2.5, GPT-4.1/5-chat), improving EM by up to 14 points. Generalization to long-form generation and biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.

英文摘要

Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5--12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama-3, Qwen2.5, GPT-4.1/5-chat), improving EM by up to 14 points. Generalization to long-form generation and biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at https://github.com/ZhishanQ/QuCo-RAG.

2512.12598 2026-05-19 cs.CV

Setting the Stage: Text-Driven Scene-Consistent Image Generation

设定舞台:基于文本的场景一致图像生成

Cong Xie, Che Wang, Yan Zhang, Ruiqi Yu, Han Zou, Zheng Pan, Zhenpeng Zhan

AI总结 本文研究了场景构建任务,通过结合参考场景图像和文本条件生成符合空间关系的演员图像,提出了一种新的数据构建管道和对应关系引导的注意力损失,以提高场景一致性和文本-图像一致性。

详情
AI中文摘要

我们专注于场景构建的基础任务:给定一个参考场景图像和一个文本条件,该条件指定了要在场景中生成的演员类别及其与场景的空间关系,目标是生成一个输出图像,该图像保持与参考图像相同的场景身份,同时根据文本中描述的空间关系正确生成演员。现有方法在这一任务上面临困难,主要是由于高质量配对数据稀缺和生成目标不明确。为克服数据瓶颈,我们提出了一种新的数据构建管道,结合现实照片、实体移除和图像到视频扩散模型,生成具有多样化场景、视角和正确实体-场景关系的训练对。此外,我们引入了一种新的对应关系引导的注意力损失,利用跨视角线索来强制与参考场景的空间对齐。在我们构建的场景一致基准上进行的实验表明,我们的方法在自动指标和人类偏好研究中均优于最先进的基线。我们的方法能够生成具有多样化视角和构图的图像,同时忠实遵循文本指令并保持参考场景身份。

英文摘要

We focus on the foundational task of Scene Staging: given a reference scene image and a text condition specifying an actor category to be generated in the scene and its spatial relation to the scene, the goal is to synthesize an output image that preserves the same scene identity as the reference image while correctly generating the actor according to the spatial relation described in the text. Existing methods struggle with this task, largely due to the scarcity of high-quality paired data and unconstrained generation objectives. To overcome the data bottleneck, we propose a novel data construction pipeline that combines real-world photographs, entity removal, and image-to-video diffusion models to generate training pairs with diverse scenes, viewpoints and correct entity-scene relationships. Furthermore, we introduce a novel correspondence-guided attention loss that leverages cross-view cues to enforce spatial alignment with the reference scene. Experiments on our scene-consistent benchmark show that our approach achieves better scene alignment and text-image alignment than state-of-the-art baselines, according to both automatic metrics and human preference studies. Our method generates images with diverse viewpoints and compositions while faithfully following the textual instructions and preserving the reference scene identity.

2512.04746 2026-05-19 cs.CL cs.AI

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

SignRoundV2:朝着关闭LLMs极低比特后训练量化性能差距的目标

Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen, Zaner Ma

AI总结 本文提出SignRoundV2框架,通过自适应混合精度策略和轻量稳定技术,在极低比特量化下保持高性能,实验表明在混合MXFP设置中实现接近无损性能,将性能差距缩小到约1%。

详情
AI中文摘要

极低比特量化对高效部署大型语言模型(LLMs)至关重要,但往往在2比特和4比特(如MXFP4)时导致严重性能下降。我们提出了SignRoundV2,一种后训练量化框架,旨在在极端压缩下保持高性能。SignRoundV2引入(1)一种简单而高效的自适应混合精度策略,利用梯度信息和量化引起的重建误差来指导层间比特分配,以及(2)一组轻量级稳定技术,包括损失过滤和预调制比例搜索,以提高极低比特环境下的调优效果。我们的方法在量化和全精度模型之间显著缩小了性能差距。在多种LLMs上的实验结果表明,SignRoundV2在混合MXFP设置中实现了接近无损性能,将差距缩小到约1%(平均4.5比特),同时在具有挑战性的2比特权重-only量化中大幅提高准确性。源代码可在https://github.com/intel/auto-round获取。

英文摘要

Extremely low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2 bits and even at 4 bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework designed to maintain high performance even under aggressive compression. SignRoundV2 introduces (1) a simple yet efficient adaptive mixed-precision strategy that leverages gradient information and quantization-induced reconstruction errors to guide layer-wise bit allocation, and (2) a set of lightweight stabilization techniques, including loss filtering and a pre-tuning scale search, to improve tuning effectiveness in extremely low-bit regimes. Our approach takes a significant step toward closing the performance gap between quantized and full-precision models. Experimental results across diverse LLMs demonstrate that SignRoundV2 achieves near-lossless performance in mixed MXFP settings, narrowing the gap to $\sim$1\% at an average of 4.5 bits, while substantially improving accuracy in challenging 2-bit weight-only quantization. The source code is available at \url{https://github.com/intel/auto-round}.

2512.01030 2026-05-19 cs.CV

Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model

Lotus-2: 通过强大的图像生成模型推进几何密集预测

Jing He, Haodong Li, Mingzhi Sheng, Ying-Cong Chen

AI总结 本文提出Lotus-2,一种两阶段确定性框架,用于稳定、准确和精细的几何密集预测,通过充分利用预训练生成先验,实现了单目深度估计和表面法线预测的新状态-of-the-art结果。

Comments v3: Fixed some typos. Project page: https://lotus-2.github.io/

详情
AI中文摘要

从单张图像中恢复像素级几何属性本质上是病态的,由于外观模糊性和2D观测与3D结构之间的非单射映射。尽管判别回归模型通过大规模监督实现强大性能,但其成功受限于可用数据的规模、质量和多样性,以及有限的物理推理能力。最近的扩散模型表现出强大的世界先验,能够编码从大规模图像-文本数据中学习到的几何和语义信息,但直接重用其随机生成公式对于确定性几何推断是次优的:前者是为多样且高保真的图像生成优化的,而后者需要稳定且准确的预测。在本文中,我们提出Lotus-2,一种两阶段确定性框架,旨在为稳定、准确和精细的几何密集预测提供最优适应协议,以充分利用预训练生成先验。具体而言,在第一阶段,核心预测器采用单步确定性公式和清洁数据目标以及轻量级局部连续性模块(LCM)来生成无网格伪影的全局一致结构。在第二阶段,细节增强器在由核心预测器定义的流形内执行受限的多步校正流细化,通过无噪声的确定性流匹配来增强精细几何。仅使用59K训练样本,即现有大规模数据集的不到1%,Lotus-2在单目深度估计和高度竞争的表面法线预测中建立了新的状态-of-the-art结果。这些结果表明,扩散模型可以作为确定性世界先验,使在传统判别和生成范式之外实现高质量的几何推理成为可能。

英文摘要

Recovering pixel-wise geometric properties from a single image is fundamentally ill-posed due to appearance ambiguity and non-injective mappings between 2D observations and 3D structures. While discriminative regression models achieve strong performance through large-scale supervision, their success is bounded by the scale, quality, and diversity of available data, as well as by limited physical reasoning. Recent diffusion models exhibit powerful world priors that encode geometry and semantics learned from massive image-text data, yet directly reusing their stochastic generative formulation is suboptimal for deterministic geometric inference: the former is optimized for diverse and high-fidelity image generation, whereas the latter requires stable and accurate predictions. In this work, we propose Lotus-2, a two-stage deterministic framework for stable, accurate and fine-grained geometric dense prediction, aiming to provide an optimal adaptation protocol to fully exploit the pre-trained generative priors. Specifically, in the first stage, the core predictor employs a single-step deterministic formulation with a clean-data objective and a lightweight local continuity module (LCM) to generate globally coherent structures without grid artifacts. In the second stage, the detail sharpener performs a constrained multi-step rectified-flow refinement within the manifold defined by the core predictor, enhancing fine-grained geometry through noise-free deterministic flow matching. Using only 59K training samples, less than 1% of existing large-scale datasets, Lotus-2 establishes new state-of-the-art results in monocular depth estimation and highly competitive surface normal prediction. These results demonstrate that diffusion models can serve as deterministic world priors, enabling high-quality geometric reasoning beyond traditional discriminative and generative paradigms.

2511.23253 2026-05-19 cs.AI

AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

AgroCoT:用于评估农业中视觉语言模型推理能力的推理链基准

Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Xiaoya Fan, Zurong Mai, Jing Wu, Shuohong Lou, Yuhang Chen, Henglian Huang, Yang Zhang, Defeng Gu, Lingyuan Zhao, Yutong Lu, Haohuan Fu, Jianxi Huang, Juepeng Zheng

AI总结 本文提出AgroCoT基准,通过整合推理链(CoT)方法,评估视觉语言模型在农业复杂场景中的推理和问题解决能力,发现现有模型在推理能力上的不足。

详情
AI中文摘要

近年来,视觉语言模型(VLMs)的进步显著影响了各个行业。在农业中,这些多模态能力在精准农业、作物监测、害虫检测和环境可持续性方面具有巨大潜力。然而,尽管已经开发了多个视觉问答(VQA)数据集和基准来评估VLM性能,但它们往往无法有效评估复杂农业背景下所需的推理和问题解决能力。为解决这一差距,我们引入了AgroCoT,一个整合了推理链(CoT)推理的VQA数据集,专门用于评估VLM的推理能力。AgroCoT包含4,759个精心挑选的样本,提供了对推理能力的全面且稳健的评估,特别是在零样本场景中,重点在于模型进行逻辑推理和有效问题解决的能力。我们对30个代表性VLMs(包括专有和开源模型)的评估揭示了其推理能力的差距,这突显了在评估中整合CoT的重要性。我们的数据集可在https://huggingface.co/datasets/AgroCoT/AgroCoT上获取。

英文摘要

Recent advancements in Vision-Language Models (VLMs) have significantly impacted various industries. In agriculture, these multimodal capabilities hold great promise for applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. However, while several Visual Question Answering (VQA) datasets and benchmarks have been developed to assess VLM performance, they often fail to effectively evaluate the critical reasoning and problem-solving skills needed in complex agricultural contexts. To address this gap, we introduce AgroCoT, a VQA dataset that integrates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,759 carefully curated samples, AgroCoT provides a comprehensive and robust evaluation of reasoning abilities, particularly in zero-shot scenarios, focusing on the models' ability to engage in logical reasoning and effective problem-solving. Our evaluation of 30 representative VLMs, including both proprietary and open-source models, reveals a gap in their reasoning capabilities, which underscores the importance of incorporating CoT for assessments. Our dataset is available at https://huggingface.co/datasets/AgroCoT/AgroCoT.

2511.21654 2026-05-19 cs.LG

EvilGenie: A Reward Hacking Benchmark

EvilGenie: 一个奖励黑客基准

Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeld

AI总结 本文提出EvilGenie基准,用于评估编程环境中奖励黑客问题,通过测试用例硬编码和测试文件编辑等方法检测奖励黑客行为,并验证了LLM判断在无歧义情况下的有效性。

详情
AI中文摘要

我们介绍了EvilGenie,一个用于编程环境中的奖励黑客基准。我们从LiveCodeBench中获取问题,并创建了一个环境,使代理能够通过硬编码测试用例或编辑测试文件等方式轻易进行奖励黑客。我们通过三种方式测量奖励黑客:保留的单元测试、LLM判断和测试文件编辑检测。我们验证了这些方法与人类审查和彼此之间的对比。我们发现LLM判断在无歧义情况下检测奖励黑客非常有效,而保留的测试用例使用仅带来最小的改进。除了使用Inspect的basic_agent框架测试许多模型外,我们还测量了三个流行专有编码代理(OpenAI的Codex、Anthropic的Claude Code和Google的Gemini CLI)的奖励黑客率。我们观察到Codex和Claude Code表现出明显的奖励黑客行为,而所有三个代理都表现出不一致的行为。我们的代码库可在https://github.com/JonathanGabor/evilgenie_inspect找到。

英文摘要

We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held out test cases. In addition to testing many models using Inspect's basic\_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI's Codex, Anthropic's Claude Code, and Google's Gemini CLI. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at https://github.com/JonathanGabor/evilgenie_inspect .

2511.21016 2026-05-19 cs.LG cs.CL

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

门控KalmaNet:通过测试时岭回归实现渐逝记忆层

Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, Stefano Soatto

AI总结 本文提出门控KalmaNet(GKA),通过测试时岭回归实现渐逝记忆层,解决了线性状态空间模型(SSMs)在记忆过去信息时的效率与精度问题,展示了GKA在短上下文任务和长上下文任务中的优越性能。

Comments 30 pages, 10 figures. Accepted at CVPR 2026

详情
AI中文摘要

线性状态空间模型(SSMs)提供了一种高效的替代softmax注意力机制的方案,具有恒定的内存和线性计算,但其损失性、渐逝的过去总结对需要回忆的任务造成了伤害。我们提出门控KalmaNet(GKA,发音为“gee-ka”),这是一种能够考虑完整过去同时保持SSM风格效率的层。我们的方法基于卡尔曼滤波(KF),并证明了现有的几种SSM层(DeltaNet、门控DeltaNet、Kimi Delta Attention)是在恒等误差协方差假设下的卡尔曼滤波递归近似。相比之下,GKA保持完整的误差协方差并计算精确的卡尔曼增益。在稳态假设下,这可以简化为具有恒定内存和线性计算的在线岭回归。标准的卡尔曼滤波方程在低精度设置(如bfloat16)下数值不稳定且难以在GPU上并行化。我们通过(1)输入依赖的门控进行自适应正则化以控制岭回归的条件数,以及(2)Chebyshev迭代,证明其在低精度下比传统迭代求解器更稳定。我们进一步开发了针对硬件的分块内核以提高训练效率。实证上,GKA在短上下文任务中优于现有的SSM层(如Mamba2、门控DeltaNet),并在长上下文RAG和LongQA任务中达到128k token的相对改进超过10%。我们还展示了当扩展到ImageNet分类时,GKA优于Mamba。我们的代码,包括用于训练和推理的Triton内核(vLLM),以及在HuggingFace上8B和32B规模的GKA基于混合模型的模型库,均以Apache 2.0许可证发布。

英文摘要

Linear State-Space Models (SSMs) offer an efficient alternative to softmax Attention with constant memory and linear compute, but their lossy, fading summary of the past hurts recall-oriented tasks. We propose Gated KalmaNet (GKA, pronounced "gee-ka"), a layer that accounts for the full past while retaining SSM-style efficiency. We ground our approach in the Kalman Filter (KF), and show that several existing SSM layers (DeltaNet, Gated DeltaNet, Kimi Delta Attention) are approximations to the KF recurrence under an identity error covariance assumption, which ignores how past keys and values should optimally influence state updates. In contrast, GKA maintains the full error covariance and computes the exact Kalman gain. Under a steady-state assumption that enables parallelization, this reduces to an online ridge regression with constant memory and linear compute. The standard KF equations are numerically unstable in low-precision settings (e.g., bfloat16) and hard to parallelize on GPUs. We address this with (1) adaptive regularization via input-dependent gating to control the ridge regression's condition number, and (2) Chebyshev Iteration, which we show is more stable than conventional iterative solvers in low precision. We further develop hardware-aware chunk-wise kernels for efficient training. Empirically, GKA outperforms existing SSM layers (e.g., Mamba2, Gated DeltaNet) on short-context tasks and achieves more than 10\% relative improvement on long-context RAG and LongQA up to 128k tokens. We further show GKA outperforms Mamba when extended to ImageNet classification. Our code, including Triton kernels for training and inference (vLLM), along with a model zoo of GKA-based Hybrid models at 8B and 32B scale on HuggingFace, is released under Apache 2.0.

2511.20353 2026-05-19 cs.RO

Quality-guided UAV Surface Exploration for 3D Reconstruction

基于质量引导的无人机表面探索用于3D重建

Benjamin Sportich, Kenza Boubakri, Olivier Simonin, Alessandro Renzaglia

AI总结 本文提出了一种新的模块化Next-Best-View规划框架,通过使用重建质量目标来指导探索规划,以提高3D重建的效率和质量。

详情
AI中文摘要

映射未知环境的自主机器人有广泛的原因,但在实践中,这些原因往往在制定规划策略时被忽视。快速获取信息和对建筑物的全面结构评估有不同的要求,因此需要不同的方法。在本文中,我们提出了一种新的模块化Next-Best-View (NBV) 规划框架,该框架明确使用重建质量目标来指导探索规划。特别是,我们的方法引入了新的高效视图生成和视角候选选择方法,这些方法能够适应用户定义的质量要求,充分利用截断符号距离场(TSDF)表示中编码的不确定性。这导致了有根据且高效的探索决策,以满足预定目标。最后,我们通过在现实环境中进行广泛的模拟验证了我们的方法。我们证明了该方法能够根据用户目标调整其行为,同时在覆盖范围、最终3D地图的质量和路径效率方面都优于传统NBV策略。

英文摘要

Reasons for mapping an unknown environment with autonomous robots are wide-ranging, but in practice, they are often overlooked when developing planning strategies. Rapid information gathering and comprehensive structural assessment of buildings have different requirements and therefore necessitate distinct methodologies. In this paper, we propose a novel modular Next-Best-View (NBV) planning framework for aerial robots that explicitly uses a reconstruction quality objective to guide the exploration planning. In particular, our approach introduces new and efficient methods for view generation and selection of viewpoint candidates that are adaptive to the user-defined quality requirements, fully exploiting the uncertainty encoded in a Truncated Signed Distance field (TSDF) representation of the environment. This results in informed and efficient exploration decisions tailored towards the predetermined objective. Finally, we validate our method via extensive simulations in realistic environments. We demonstrate that it successfully adjusts its behavior to the user goal while consistently outperforming conventional NBV strategies in terms of coverage, quality of the final 3D map and path efficiency.

2511.19320 2026-05-19 cs.CV

SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

SteadyDancer: 一种具有第一帧保留的和谐且一致的人像图像动画方法

Jiaming Zhang, Shengming Cao, Rui Li, Xiaotong Zhao, Yutao Cui, Xinglin Hou, Gangshan Wu, Haolan Chen, Yu Xu, Limin Wang, Kai Ma

AI总结 本文提出SteadyDancer,一种基于图像到视频(I2V)范式的框架,通过条件协调机制、协同姿态调节模块和分阶段解耦目标训练管道,实现了和谐一致的人像动画,并首次确保第一帧的鲁棒保留。

Comments 10 pages, with supp

详情
AI中文摘要

在人类图像动画中,保留第一帧身份的同时确保精确运动控制是一个根本性挑战。主导的参考到视频(R2V)范式的图像到运动绑定过程忽略了现实应用中常见的时空对齐问题,导致身份漂移和视觉伪影等失败。我们引入SteadyDancer,一种基于图像到视频(I2V)范式的框架,实现了和谐且一致的动画,并首次确保第一帧保留的鲁棒性。首先,我们提出条件协调机制以协调两个冲突条件,从而在不牺牲保真度的情况下实现精确控制。其次,我们设计协同姿态调节模块以生成适应性且一致的姿态表示,该表示高度兼容参考图像。最后,我们采用分阶段解耦目标训练管道,分层优化模型以实现运动保真度、视觉质量和时间一致性。实验表明,SteadyDancer在外观保真度和运动控制方面均达到最先进的性能,同时比可比方法需要显著更少的训练资源。该模型已公开发布在https://mcg-nju.github.io/steadydancer-web。

英文摘要

Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods. The model has been publicly released at \url{https://mcg-nju.github.io/steadydancer-web}.

2511.17392 2026-05-19 cs.CV

MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration

MorphSeek: 用于可变形图像配准的细粒度潜在表示级策略优化

Runxun Zhang, Yizhou Liu, Li Dongrui, Bo XU, Jingwei Wei

AI总结 本文提出MorphSeek,一种在潜在特征空间中进行细粒度策略优化的方法,用于解决可变形图像配准中的高维变形空间和体素级监督稀缺问题,通过引入随机高斯策略头和组相对策略优化,实现了高效探索和粗到细的优化,提升了配准的Dice系数和标签效率。

Comments 20 pages

详情
AI中文摘要

可变形图像配准(DIR)仍然是医学图像分析中的基本但具有挑战性的问题,主要由于密集位移场的高维变形空间和体素级监督的稀缺性。现有的强化学习框架通常将此空间投影到低维表示,限制了其捕捉空间变异形变的能力。我们提出MorphSeek,一种细粒度的表示级策略优化范式,将DIR重新表述为潜在特征空间中的连续优化过程。MorphSeek在编码器上引入随机高斯策略头,以建模潜在特征的分布,从而实现高效的探索和粗到细的优化。该框架通过组相对策略优化整合了无监督预热和弱监督微调,其中多轨迹采样稳定了训练并提高了标签效率。在三个3D配准基准(OASIS脑MRI、LiTS肝脏CT和腹部MR-CT)上,MorphSeek在保持高标签效率的同时,通过最小的参数成本和低步骤级延迟开销,实现了优于竞争基线的Dice改进。除了优化器细节外,MorphSeek推进了一种表示级策略学习范式,实现了空间一致且数据高效的形变优化,提供了一种原理上可行、不依赖于特定主干网络和优化器的可扩展视觉对齐解决方案。

英文摘要

Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR-CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.

2511.16361 2026-05-19 cs.CV

Multi-Order Matching Network for Alignment-Free Depth Super-Resolution

多阶匹配网络用于无对齐深度超分辨率

Zhengxue Wang, Zhiqiang Yan, Yuan Wu, Guangwei Gao, Xiang Li, Jian Yang

AI总结 本文提出了一种无对齐框架Multi-Order Matching Network (MOMNet),通过多阶匹配机制和多阶聚合策略,从不对齐的RGB数据中提取并选择最相关的信息,以提高深度超分辨率的性能和泛化能力。

详情
AI中文摘要

最近的引导深度超分辨率方法基于深度和RGB之间严格空间对齐的假设,实现高质量的深度重建。然而,在现实场景中,严格对齐的RGB-D数据受到固有硬件限制(例如物理分离的RGB-D传感器)和不可避免的校准漂移(由机械振动或温度变化引起)的阻碍。因此,现有方法在应用于不对齐的现实场景时往往会出现不可避免的性能下降。在本文中,我们提出了Multi-Order Matching Network (MOMNet),一种新颖的无对齐框架,能够自适应地从不对齐的RGB中检索并选择最相关的信息。具体而言,我们的方法首先采用多阶匹配机制,联合执行零阶、一阶和二阶匹配,以在多阶特征空间中全面识别与深度一致的RGB信息。为了有效整合检索到的RGB和深度信息,我们进一步引入了由多个结构检测器组成的多阶聚合模块。该策略利用多阶先验作为提示,促进从RGB到深度的特征选择性转移。广泛的实验表明,MOMNet在未对齐和对齐的数据集上均实现了优越的性能和泛化能力。

英文摘要

Recent guided depth super-resolution methods are premised on the assumption of strict spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves superior performance and generalization across both unaligned and aligned datasets.

2511.16309 2026-05-19 cs.CV cs.LG

Sparse Autoencoders are Topic Models

稀疏自编码器是主题模型

Leander Girrbach, Zeynep Akata

AI总结 本文提出将稀疏自编码器(SAEs)视为主题模型的新视角,通过构建连续主题模型(CTM)来解释嵌入空间,并推导出SAE的目标作为最大后验估计器,从而揭示SAE特征是主题性组件而非可调节方向。

Comments ICML 2026

详情
AI中文摘要

稀疏自编码器(SAEs)被用于分析嵌入,但其作用和实用价值存在争议。我们提出了一种新的视角,通过展示它们可以自然地被理解为主题模型。我们受到潜在狄利克雷分配(LDA)的启发,提出了一种连续主题模型(CTM)用于嵌入空间,并在此模型下推导出SAE目标作为最大后验估计器。这种观点表明SAE特征是主题性组件而非可调节方向。为了验证我们的理论发现,我们引入了SAE-TM主题建模框架,该框架:(1)训练SAE以学习可重用的主题原子;(2)将它们解释为下游数据中的词分布;(3)将它们合并到任意数量的主题中而无需重新训练。SAE-TM在文本和图像数据集上比强大的基线产生更连贯的主题,同时保持多样性。最后,我们分析了图像数据集中的主题结构,并追踪了日本木版画中主题随时间的变化。我们的工作将SAEs定位为跨模态大规模主题分析的有效工具。代码可在https://github.com/ExplainableML/SAE-TM获取。

英文摘要

Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We propose a continuous topic model (CTM) inspired by Latent Dirichlet Allocation (LDA) for embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. To confirm our theoretical findings, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code is available at https://github.com/ExplainableML/SAE-TM .

2511.12710 2026-05-19 cs.CL cs.CR

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

改进方法而非提示:针对大语言模型的进化式 jailbreak 攻击合成

Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma

AI总结 本文提出 EvoSynth 框架,通过在代码空间中进行搜索,而非仅在提示空间中优化,从而提高对大语言模型的 jailbreak 攻击成功率和多样性。

详情
AI中文摘要

针对大语言模型(LLM)的自动化红队框架日益复杂,但许多仍主要在提示空间中优化攻击。换句话说,这些方法主要搜索更好的攻击用语或策略选择,但不搜索可执行代码。通过将搜索移至代码空间,我们不仅能优化最终的攻击提示,还能优化生成它的过程,包括执行流程、可重用逻辑、分支和失败驱动的修复。为克服这一差距,我们引入了 EvoSynth,一个自主的多代理框架,将优化空间从提示转移到可执行代码。与直接优化提示不同,EvoSynth 使用多代理系统自主设计、进化和执行基于代码的攻击算法。关键在于其代码级自我纠正循环,使其能够根据目标模型反馈和失败尝试迭代重写基于代码的算法。通过广泛实验,我们证明 EvoSynth 在高度稳健的模型如 Claude-Sonnet-4.5 上实现了 85.5% 的攻击成功率(ASR),并在评估目标上平均达到 95.9% 的 ASR,同时生成的攻击比现有方法更具多样性。我们发布该框架以促进未来在可执行代码空间中进化合成的研究。

英文摘要

Automated red teaming frameworks for Large Language Models (LLMs) have become increasingly sophisticated, yet many still formulate attack optimization primarily in the prompt space. In other words, these methods mainly search for better attack wording or better strategy choices, but they do not search over executable code. By moving the search into code space, we can optimize not only the final attack prompt, but also the procedure that generates it, including execution flow, reusable logic, branching, and failure-driven repair. To overcome this gap, we introduce EvoSynth, an autonomous multi-agent framework that shifts the optimization space from prompts to executable code. Instead of refining prompts directly, EvoSynth employs a multi-agent system to autonomously engineer, evolve, and execute code-based attack algorithms. Crucially, it features a code-level self-correction loop, allowing it to iteratively rewrite the code-based algorithm in response to target-model feedback and failed attempts. Through extensive experiments, we demonstrate that EvoSynth achieves an 85.5\% Attack Success Rate (ASR) against highly robust models like Claude-Sonnet-4.5 and a 95.9\% average ASR across evaluated targets, while generating attacks that are significantly more diverse than those from existing methods. We release our framework to facilitate future research on evolutionary synthesis in executable code space.

2511.11934 2026-05-19 cs.LG cs.CV

A Systematic Analysis of Out-of-Distribution Detection Under Representation and Training Paradigm Shifts

基于表示和训练范式转变的分布外检测系统分析

Claudio César Claros Olivares, Austin J. Brockmeier

AI总结 本文通过表示中心的视角系统评估了分布外检测的CSFs,分析了不同架构、训练范式和数据集的影响,并提出基于PCA的投影过滤方法和基于神经坍塌的预测方法来提升检测性能。

详情
AI中文摘要

我们通过表示中心的视角系统评估了分布外检测(OOD)的CSFs。我们的研究涵盖了CNN和ViT架构、多种训练范式、四个图像分类源数据集(CIFAR-10、CIFAR-100、SuperCIFAR-100和TinyImageNet),以及通过CLIP衍生的语义距离将OOD数据集分为近、中、远三个区域。为了比较这些设置下的CSFs,我们采用了一种多重比较受控的排名流程,该流程在无阈值排名指标(AURC和AUGRC)下识别出统计上不可区分的顶级聚类。主要经验发现是,竞争性检测器家族更依赖于学习的表示而不是单纯的分数设计。对于CNN和ViT,简单的概率分数在误分类检测中占主导地位。在CNN中,基于边界的分数在近OOD区域最强,而几何感知分数如NNGuide、fDBD和CTM在移位严重性增加时变得更具竞争力。在微调的ViT中,顶级聚类主要由重建和残差分数主导。为了解释这些排名变化,我们使用神经坍塌(NC)指标分析最后一层表示。得到的图景在不同架构中是一致的:原型和边界感知分数在表示更坍塌且与分类器权重更好对齐时更强,而弱坍塌区域则更青睐梯度和流形基于的分数。基于这些见解,我们提出两个贡献:一种基于PCA的投影过滤过程,可以提高检测器性能,以及一种利用训练分类器计算的NC测量来预测其竞争性的分布外检测器短名单的方法,而无需任何额外的分布外数据。

英文摘要

We present a systematic benchmark of out-of-distribution (OOD) detection CSFs through a representation-centric lens. Our study spans CNN and ViT backbones, multiple training paradigms, four image-classification source datasets (CIFAR-10, CIFAR-100, SuperCIFAR-100, and TinyImageNet), and OOD datasets grouped into near, mid, and far regimes using CLIP-derived semantic distances. To compare CSFs across these settings, we employ a multiple-comparison-controlled rank pipeline that identifies top cliques of statistically indistinguishable winners under threshold-free ranking metrics (AURC and AUGRC). The main empirical finding is that the competitive detector family depends more on the learned representation than on score design alone. For both CNNs and ViTs, simple probabilistic scores dominate misclassification detection. On CNNs, margin-based scores are strongest in near-OOD regimes, while geometry-aware scores such as NNGuide, fDBD, and CTM become more competitive as shift severity increases. On fine-tuned ViTs, the top cliques are led mainly by reconstruction- and residual-based scores. To interpret these ranking shifts, we analyze the last-layer representation using Neural Collapse (NC) metrics. The resulting picture is consistent across architectures: prototype- and boundary-aware scores become stronger when the representation is more collapsed and better aligned with classifier weights, whereas weaker-collapse regimes favor gradient- and manifold-based scores. Building on these insights, we propose two contributions: a simple PCA-based projection-filtering procedure that improves detector performance, and an approach that uses NC measurements computed from a trained classifier to predict its competitive out-of-distribution detector shortlist, without requiring any additional OOD data.

2511.08704 2026-05-19 cs.CV cs.LG

Rethinking Generative Image Pretraining: How Far Are We From Scaling Up Next-Pixel Prediction?

重新思考生成图像预训练:我们离扩大下一步像素预测还有多远?

Xinchen Yan, Chen Liang, Lijun Yu, Adams Wei Yu, Yifeng Lu, Quoc V. Le

AI总结 本文研究了自回归下一步像素预测的扩展特性,探讨了统一视觉模型中简单且端到端但尚未充分探索的框架。通过在32x32分辨率的图像上训练Transformer模型,评估了三个目标指标:下一步像素预测目标、ImageNet分类准确率和基于生成的完成度(通过Fr'echet距离测量)。研究发现,最优扩展策略高度依赖任务,且随着图像分辨率的增加,模型大小必须比数据量增长得更快。通过预测发现,计算能力是主要瓶颈,而非训练数据量。随着计算能力每年增长四到五倍,预计在五年内可实现像素级图像建模。

Comments Accepted by ICML2026

详情
AI中文摘要

本文研究了自回归下一步像素预测的扩展特性,一种简单、端到端但尚未充分探索的统一视觉模型框架。从32x32分辨率的图像开始,我们训练了一系列Transformer模型,使用IsoFlops配置在计算预算高达7e19 FLOPs的情况下进行训练,并评估了三个不同的目标指标:下一步像素预测目标、ImageNet分类准确率和基于生成的完成度(通过Fr'echet距离测量)。首先,最优扩展策略高度依赖于任务。在固定的32x32分辨率下,图像分类和图像生成的最优扩展特性不同,其中生成最优设置要求数据量增长是分类最优设置的三到五倍。其次,随着图像分辨率的增加,最优扩展策略表明模型大小必须比数据量增长得更快。令人惊讶的是,通过投影我们的发现,我们发现主要瓶颈是计算能力,而不是训练数据量。随着计算能力每年增长四到五倍,我们预测在五年内可以实现像素级图像建模。

英文摘要

This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at resolutions of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation-based completion measured by Fr'echet Distance. First, optimal scaling strategy is critically task-dependent. At a fixed resolution of 32x32 alone, the optimal scaling properties for image classification and image generation diverge, where generation optimal setup requires the data size grow three to five times faster than for the classification optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years.

2511.04070 2026-05-19 cs.CL

T-FIX: Text-Based Explanations with Features Interpretable to eXperts

T-FIX:基于文本的可解释性方法,具备可解释的专家特征

Shreya Havaldar, Weiqiu You, Chaehyeon Kim, Anton Xue, Helen Jin, Marco Gatti, Bhuvnesh Jain, Helen Qu, Amin Madani, Daniel A. Hashimoto, Gary E. Weissman, Rajat Deo, Sameed Khatana, Lyle Ungar, Eric Wong

AI总结 本文提出T-FIX框架,用于评估LLM生成的解释是否符合专家的推理方式,通过七个科学任务和三个领域进行验证,实现了自动且可定制的专家对齐评估。

详情
AI中文摘要

随着LLM被应用于知识密集型领域(例如手术、天文学、治疗),用户通常是领域专家,他们不仅期望答案,还期望解释能反映专业推理。然而,评估LLM是否'像专家一样思考'仍然困难:现有方法依赖于每个示例的专家注释,使它们成本高、难以扩展,并且局限于每个领域的单一正确推理观念。为了解决这一差距,我们引入了T-FIX,一个统一的评估框架,将专家对齐作为LLM生成解释的期望属性进行操作化。T-FIX涵盖七个科学任务和三个领域,每个任务均根据专家定义的准则进行评估,这些准则捕捉的是领域相关的推理而非通用的解释质量。我们的框架实现了自动且可定制的专家对齐评估,能够在没有持续专家参与的情况下泛化到未见过的解释。代码可在https://github.com/BrachioLab/FIX-2/上获得。

英文摘要

As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users are often domain experts who expect not just answers, but explanations that mirror professional reasoning. Yet evaluating whether an LLM "thinks like an expert" remains difficult: existing approaches rely on per-example expert annotation, making them costly, hard to scale, and tied to a single notion of correct reasoning within each domain. To address this gap, we introduce T-FIX, a unified evaluation framework that operationalizes expert alignment as a desired attribute of LLM-generated explanations. T-FIX spans seven scientific tasks across three domains, with each task evaluated against expert-defined criteria that capture domain-grounded reasoning rather than generic explanation quality. Our framework enables automatic, personalizable evaluation of expert alignment that generalizes to unseen explanations without ongoing expert involvement. Code is available at https://github.com/BrachioLab/FIX-2/.

2511.03828 2026-05-19 cs.LG

From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Relaxation for Offline-to-Online Reinforcement Learning

从静态约束到动态适应:样本级约束放松用于离线到在线强化学习

Lipeng Zu, Yu Qian, Shayok Chakraborty, Xiaonan Zhang

AI总结 本文提出DARE框架,通过行为一致性实现样本级约束放松,解决了离线到在线强化学习中保留离线保守性与适应在线反馈之间的挑战,改进了细调稳定性并优于现有基线。

详情
AI中文摘要

离线到在线强化学习(O2O RL)面临在保留离线保守性与适应在线反馈下的分布偏移挑战。此挑战出现因为数据行为在微调期间演变,使得数据来源成为约束处理的误导基础,从而导致目标-数据不匹配。因此,我们提出了动态对齐用于放松(DARE),一种基于行为模型的行为一致性分布感知框架,用于样本级约束放松。据我们所知,DARE是第一个通过后验诱导交换机制将约束放松条件化于行为一致性,超越二元离线/在线数据区别的方法。重要的是,DARE仅需要每个样本的行为对齐,使它能够在许多离线算法上进行实例化,具有灵活的行为模型和微调目标选择。我们提供理论分析,显示基于行为的样本交换一致地提高了离线样本人群与在线样本人群之间的区分。在D4RL上的实验表明,DARE一致提高了微调稳定性,并在强离线到在线基线之上实现了优越的最终性能。(代码可在https://github.com/lpzu/DARE上公开获取。)

英文摘要

Offline-to-online reinforcement learning (O2O RL) faces a central challenge between retaining offline conservatism and adapting to online feedback under distribution shift. This challenge arises because data behavior evolves during fine-tuning, rendering data origin a misleading basis for constraint handling and thereby leading to objective-data mismatch. We therefore propose Dynamic Alignment for RElaxation (DARE), a distribution-aware framework for sample-level constraint relaxation based on the behavioral consistency with a behavior model. To our knowledge, DARE is the first to condition constraint relaxation on behavioral consistency via a posterior-induced exchange mechanism, moving beyond a binary offline/online data distinction. Importantly, DARE requires only per-sample behavioral alignment, enabling instantiation on top of many offline algorithms with flexible choices of behavior models and fine-tuning objectives. We provide a theoretical analysis showing that behavior-based sample exchange consistently improves the distinction between offline-like and online-like subsets. Experiments on D4RL demonstrate that DARE consistently improves fine-tuning stability and achieves superior final performance over strong offline-to-online baselines. (The code is publicly available at \url{https://github.com/lpzu/DARE}.)

2511.02610 2026-05-19 cs.LG

Towards Migrating Neural Network Implementations

向神经网络实现迁移迈进

Nadia Daoudi, Ivan Alfonso, Jordi Cabot

AI总结 本文提出了一种自动迁移神经网络代码跨深度学习框架的方法,通过使用一个中间神经网络模型来创建迁移前的抽象,从而解决神经网络库之间迁移的挑战。

Comments To appear at the International Conference on AI-powered Software (AIware 2026)

详情
AI中文摘要

智能系统的开发(即通过AI组件增强的系统)得益于神经网络(NNs)的快速进步。由于神经网络设计和实现的支持,各种库和框架随之涌现。选择框架取决于可用功能、易用性、文档和社区支持等因素。在采用某个NN框架后,组织可能后来选择切换到另一个框架,如果性能下降、需求变化或新功能被引入。不幸的是,由于缺乏专门针对NNs的迁移方法,跨库迁移NN实现具有挑战性。这导致了更多的现代化时间与努力,因为手动更新是必要的,以避免依赖过时的实现并确保与新功能的兼容性。在本文中,我们提出了一种自动迁移神经网络代码跨深度学习框架的方法。我们的方法利用一个中间NN模型来创建迁移前的抽象。我们通过两个流行的NN框架,即PyTorch和TensorFlow,验证了我们的方法。我们还讨论了在两个框架之间迁移代码的挑战以及我们的方法如何处理这些问题。对五个NN的实验评估显示,我们的方法成功地迁移了它们的代码,并生成了与原始功能等效的NN。我们的工作成果已在线上可用。

英文摘要

The development of smart systems (i.e., systems enhanced with AI components) has thrived thanks to the rapid advancements in neural networks (NNs). A wide range of libraries and frameworks have consequently emerged to support NN design and implementation. The choice depends on factors such as available functionalities, ease of use, documentation and community support. After adopting a given NN framework, organizations might later choose to switch to another if performance declines, requirements evolve, or new features are introduced. Unfortunately, migrating NN implementations across libraries is challenging due to the lack of migration approaches specifically tailored for NNs. This leads to increased time and effort to modernize NNs, as manual updates are necessary to avoid relying on outdated implementations and ensure compatibility with new features. In this paper, we propose an approach to automatically migrate neural network code across deep learning frameworks. Our method makes use of a pivot NN model to create an abstraction of the NN prior to migration. We validate our approach using two popular NN frameworks, namely PyTorch and TensorFlow. We also discuss the challenges of migrating code between the two frameworks and how they were approached in our method. Experimental evaluation on five NNs shows that our approach successfully migrates their code and produces NNs that are functionally equivalent to the originals. Artefacts from our work are available online.

2510.26384 2026-05-19 cs.AI cs.LG

Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

Scales++: 一种计算高效的评估子集选择方法,基于认知尺度嵌入

Andrew M. Bean, Nabeel Seedat, Shengzhuang Chen, Jonathan Richard Schwarz

AI总结 本文提出了一种基于任务项目内在属性的评估子集选择方法Scales++,通过减少预选成本并保持预测保真度,提高了大规模语言模型的评估效率,同时提升了冷启动性能和可解释性。

Comments 9 pages, 2 figures, 4 tables

详情
AI中文摘要

对大规模语言模型(LLMs)进行全面评估的高昂成本需要创建小而有代表性的数据子集(即小型基准),以实现高效的评估同时保留预测保真度。当前的方法基于模型为中心的范式,根据现有模型的集体性能选择基准项目。这些方法受限于前期成本高、无法立即处理新基准(冷启动)以及假设未来模型会共享前代模型的失败模式的脆弱性。在本文中,我们提出了一种新的以项目为中心的基准子集选择方法,认为选择应基于任务项目的内在属性,而不是模型特定的失败模式。我们通过一种新的方法Scales++来实现这种以项目为中心的高效基准方法,其中数据选择基于基准样本的认知需求。实证研究表明,Scales++将前期选择成本降低了超过18倍,同时实现了有竞争力的预测保真度。在Open LLM Leaderboard上,使用仅0.25%的数据子集,我们预测完整基准分数的均方误差为3.2%,在Humanity's Last Exam上,使用2.0%的样本预测完整分数的均方误差为2.9%。我们证明这种以项目为中心的方法可以在不显著降低保真度的情况下更高效地评估模型,同时提供更好的冷启动性能和更可解释的基准测试。

英文摘要

The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks ("cold-start"), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we propose a new item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.25% data subset, we predict full benchmark scores with a 3.2% mean absolute error, and on Humanity's Last Exam we predict full scores with 2.9% mean absolute error using a 2.0% sample. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.