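arXivDaily: a daily academic digest synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.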

2605.27864 2026-06-02 cs.AI

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

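Di Zhu, Lei Nico Zheng, Zihan Chen

Affiliations: Stevens Institute of Technology; UMass Boston

AI summary: Proposes the FundaPod platform, which supports human portfolio managers in making transparent, verifiable fundamental investment decisions through independent multi-persona research, knowledge-graph memory, and post hoc adjudication.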

Comments: 32 pages; 12 figures

Abstract

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

2605.16415 2026-06-02 cs.CV cs.LG

Diffusion Models, Denoiser Architecture and Creativity

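Itamar Levine, Yair Weiss

Affiliations: The Hebrew University of Jerusalem

AI summary: Shows through theory and experiments that the creativity of diffusion models arises from the interaction between the denoiser architecture and the target distribution, and that the architecture's inductive bias must align closely with the true target distribution for the model to succeed.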

Abstract

The creativity of diffusion models refers to their ability to generate highly realistic images that are different from their training data. Creativity is somewhat surprising since it is known that if the denoiser used in the diffusion model is the Bayes optimal denoiser for a given training set, then the model will simply copy the training samples. In this paper we present empirical and theoretical results that suggest that creativity in diffusion models is due to an interaction between the denoiser architecture and the target distribution. Theoretically, we give explicit forms for the distribution of generated samples as a function of the target distribution and the denoiser architecture for three different denoiser architectures (linear, polynomial, bottleneck). Empirically, we show that small changes in the popular UNET denoiser architecture lead to very different forms of creativity, and these small changes often yield samples that are highly nonrealistic. Taken together, our results show that diffusion models will only be successful if the inductive bias of the denoiser architecture is in strong alignment with the true target distribution.

2605.13548 2026-06-02 cs.RO cs.AI

AttenA+: Rectifying Action Inequality in Robotic Foundation Models

Daojie Peng, Fulong Ma, Jiahang Cao, Qiang Zhang, Xupeng Xie, Jian Guo, Ping Luo, Andrew F. Luo, Boyu Zhou, Jun Ma

Affiliations: HKUST(GZ); HKU; USTC; IDEA Research; SUSTech; X-Humanoid

AI summary: To address robotic foundation models' neglect of the physical importance of actions, proposes the AttenA+ framework, which reweights the training objective through velocity-driven action attention to improve performance on complex long-horizon tasks.

Abstract

Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via velocity-driven action attention. By reweighting the training objective based on the inverse velocity field, AttenA+ naturally aligns the model's learning capacity with the physical demands of manipulation. As a plug-and-play enhancement, AttenA+ can be integrated into existing backbones without structural modifications or additional parameters. Extensive experiments demonstrate that AttenA+ significantly elevates the ceilings of current state-of-the-art models. Specifically, it improves OpenVLA-OFT to 98.6% (+1.5%) on the Libero benchmark and pushes FastWAM to 92.4% (+0.6%) on RoboTwin 2.0. Real-world validation on a Franka manipulator further showcases its robustness and cross-task generalization. Our work suggests that mining the intrinsic structural priors of action sequences offers a highly efficient, physics-aware complement to standard scaling laws, paving a new path for general-purpose robotic control.
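
The reweighting idea in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `eps` smoothing term, the use of finite differences as a speed estimate, and the normalization to a mean weight of 1 are all assumptions made for illustration.

```python
import math


def inverse_velocity_weights(actions, eps=1e-3):
    """Per-step weights proportional to 1 / (speed + eps), normalized to mean 1.

    `actions` is a sequence of equal-length vectors (hypothetical trajectory
    samples). Speed at step t is the distance to the previous step; step 0
    reuses the first measured speed. `eps` is an assumed smoothing constant.
    """
    if not actions:
        return []
    speeds = [0.0]
    for prev, cur in zip(actions, actions[1:]):
        speeds.append(math.dist(prev, cur))
    if len(speeds) > 1:
        speeds[0] = speeds[1]
    raw = [1.0 / (s + eps) for s in speeds]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def weighted_mse(pred, target, weights):
    """Mean squared error over steps, with each step scaled by its weight."""
    total = 0.0
    for p, t, w in zip(pred, target, weights):
        total += w * sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)
    return total / len(weights)
```

With weights like these, slow segments of a trajectory contribute more to the loss than fast transitions, and uniform weights recover the ordinary mean squared error.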

2605.07804 2026-06-02 cs.LG cs.AI

Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Zhicheng Yang, Zhijiang Guo, Yifan Song, Minrui Xu, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; MBZUAI; University of California, Merced; Sun Yat-sen University

AI summary: Proposes the Prune-OPD framework, which detects prefix drift between student and teacher in real time and dynamically truncates unreliable rollouts, reducing wasted computation while maintaining or improving performance on long-horizon reasoning tasks.

Comments: 17 pages, 8 figures

Abstract

On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these "drifted" trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce Prune-OPD, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-k overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6% to 68.0% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.
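
The drift-detection loop described in this abstract might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's algorithm: the thresholds, the multiplicative decay, the cut rule, and both function names are hypothetical, and the window-expansion behavior mentioned in the abstract is not modeled.

```python
def top_k_overlap(student_probs, teacher_probs, k=5):
    """Fraction of shared token ids among each model's top-k predictions.

    Both arguments are dicts mapping token id to probability.
    """
    def top(probs):
        return set(sorted(probs, key=probs.get, reverse=True)[:k])

    return len(top(student_probs) & top(teacher_probs)) / k


def prune_rollout(overlaps, drift_threshold=0.4, decay=0.5, min_weight=0.1):
    """Return (weights, cut): per-token reward weights and a truncation index.

    Weights start at 1.0 and never increase: each token whose overlap falls
    below `drift_threshold` multiplies the running weight by `decay`. The
    rollout is cut at the first token whose weight drops below `min_weight`.
    All three parameters are assumed values chosen for illustration.
    """
    weights, w = [], 1.0
    for i, overlap in enumerate(overlaps):
        if overlap < drift_threshold:
            w *= decay
        if w < min_weight:
            return weights, i
        weights.append(w)
    return weights, len(overlaps)
```

For example, `prune_rollout([1.0, 0.8, 0.2, 0.2, 0.2, 0.2, 1.0])` keeps the first five tokens with weights `[1.0, 1.0, 0.5, 0.25, 0.125]` and cuts the rollout at index 5, so the last two tokens are never scored.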

2605.04583 2026-06-02 cs.CL

TajikNLP: An Open-Source Toolkit for Comprehensive Text Processing of Tajik (Cyrillic Script)

Mullosharif K. Arabov, Karomatullo Habibullozoda, Nurali Shirinov

Affiliations: Institute of Computational Mathematics and Information Technologies; Kazan Federal University; Bokhtar State University named after Nosiri Khusrav

AI summary: Introduces TajikNLP, an open-source Python library providing the first complete text-processing pipeline for Tajik, including cleaning, tokenization, part-of-speech tagging, stemming, and lemmatization, and releases four linguistic datasets to advance NLP research on low-resource Cyrillic-script languages.

Comments: Accepted to CLIB 2026

Abstract

The Tajik language, written in Cyrillic script, remains severely under-resourced in terms of publicly available natural language processing (NLP) toolkits, hindering both linguistic research and applied development. This paper introduces TajikNLP, an open-source Python library that provides the first comprehensive pipeline for processing authentic Tajik text while preserving the original Cyrillic orthography. The library implements a modular architecture centered around a unified Doc object, enabling sequential application of components for cleaning, normalization, tokenization (including subword BPE), morphemic segmentation, part-of-speech tagging, stemming, lemmatization, and sentence splitting. A novel unified morphology engine is introduced, offering controlled and deep analysis modes that significantly improve handling of Tajik's agglutinative nominal and verbal inflections. The release further incorporates a lexicon-based sentiment analyser and pre-trained Word2Vec/FastText embeddings loaded directly from the Hugging Face Hub. To ensure reproducibility and facilitate future research, four accompanying linguistic datasets -- a POS-tagged corpus (52.5k entries), a sentiment lexicon (3.5k entries), a toponym gazetteer (5.6k entries), and a personal names dataset (3.8k entries) -- have been openly published under permissive licenses. The library's reliability is validated by an extensive test suite of 616 automated tests achieving 93% source code coverage. TajikNLP thus establishes a foundational technological infrastructure for Tajik language processing, lowering the barrier to entry for both academic and industrial applications in low-resource Cyrillic-script environments.

2602.14307 2026-06-02 cs.AI cs.LG

Benchmarking at the Edge of Comprehension

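Samuele Marro, Jialin Yu, Emanuele La Malfa, Oishi Deb, Jiawei Li, Yibo Yang, Ebey Abraham, Sunando Sengupta, Eric Sommerlade, Michael Wooldridge, Philip Torr

Affiliations: University of Cambridge

AI summary: Proposes the Critique-Resilient Benchmarking framework, which compares models when human comprehension is limited through an adversarial generation-evaluation game, ranking LLMs using the notion of critique-resilient correctness and an itemized Bradley-Terry model.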

Abstract

As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of critique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which preserves evaluation integrity beyond full comprehension of the task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve challenging tasks and to generate difficult yet solvable questions. We showcase the effectiveness of our method in the mathematical domain across eight frontier LLMs, showing that the resulting scores are stable and correlate with external capability measures. Our framework reformulates benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.
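
For readers unfamiliar with Bradley-Terry ranking, here is a minimal sketch of the standard (non-itemized, non-bipartite) model, fitted with the classic minorization-maximization update. It is not the variant the paper uses; the function name and the win-count matrix layout are assumptions for illustration.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of times item i beat item j. Strengths are
    normalized to sum to 1, and the modeled probability that i beats j is
    p[i] / (p[i] + p[j]).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            # Keep the old strength for items that were never compared.
            new_p.append(total_wins / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]
    return p
```

With two items and a 3-to-1 record, `fit_bradley_terry([[0, 3], [1, 0]])` converges to strengths of 0.75 and 0.25, which matches the observed win rate.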

2512.00837 2026-06-02 cs.CL

WaterSearch: Exploring Seed Pooling for Improving the Quality-Detectability Trade-off in LLM Watermarking

Yukang Lin, Jiahao Shao, Shuoran Jiang, Wentao Zhu, Bingjie Lu, Xiangping Wu, Joanna Siebert, Qingcai Chen

Affiliations: Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory

AI summary: Proposes the WaterSearch framework, which performs sentence-level search by controlling seed pools and jointly optimizes distribution fidelity and watermark signal characteristics, substantially improving text quality while maintaining high detectability.

Abstract

Watermarking acts as a critical safeguard in text generated by Large Language Models (LLMs). By embedding identifiable signals into model outputs, watermarking enables reliable attribution and enhances the security of machine-generated content. Existing approaches typically embed signals by manipulating token generation probabilities. Despite their effectiveness, these methods inherently face a trade-off between detectability and text quality: the signal strength and randomness required for robust watermarking tend to degrade the performance of downstream tasks. In this paper, we design a novel embedding scheme that controls seed pools to facilitate diverse parallel generation of watermarked text. Based on that scheme, we propose WaterSearch, a sentence-level, search-based watermarking framework adaptable to a wide range of existing methods. WaterSearch enhances text quality by jointly optimizing two key aspects: 1) distribution fidelity and 2) watermark signal characteristics. Furthermore, WaterSearch is complemented by a sentence-level detection method with strong attack robustness. We evaluate our method on three popular LLMs across ten diverse tasks. Extensive experiments demonstrate that our method achieves an average performance improvement of 51.01% over state-of-the-art baselines at a watermark detectability strength of 95%. In challenging scenarios such as short text generation and low-entropy output generation, our method yields performance gains of 47.78% and 36.47%, respectively. Moreover, under different attack scenarios including insertion, synonym substitution, and paraphrase attacks, WaterSearch maintains high detectability, further validating its robust anti-attack capabilities. Our code is available at https://github.com/Yukang-Lin/WaterSearch.

2605.29548 2026-06-02 cs.LG

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Jing Huang, Daniel Wurgaft, Rachit Bansal, Laura Ruis, Naomi Saphra, David Alvarez-Melis, Andrew Kyle Lampinen, Christopher Potts, Ekdeep Singh Lubana

Affiliations: Stanford University; Kempner Institute at Harvard University; MIT; Anthropic

AI summary: Studies the effect of model scale on learning ability through theoretical analysis and synthetic experiments, finding that larger models learn rare and complex tasks by reducing gradient interference, and validates this on OLMo models.

Abstract

Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests that a larger model will be able to learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling on a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high frequency or low complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck, finding that it traces to a reduced interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, which means that they do not overwrite rare-task features as they slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to. This helps explain why larger models are better in practice, and it can inform practical questions concerning model sizing and training data mixtures.

2605.29539 2026-06-02 cs.CV cs.AI

GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection

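Jiacong Liu, Shu Luo, Yikai Qin, Yaze Zhao, Yongwei Jiang, Yixiong Zou

Affiliations: Huazhong University of Science and Technology

AI summary: Proposes GiPL, a two-branch training framework that addresses insufficient support-set utilization and overfitting in cross-domain few-shot object detection through iterative pseudo-label self-training and generative data augmentation.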

Comments: CVPR 2026 Workshop

Abstract

Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). However, they face two critical challenges in fine-tuning: insufficient support set utilization due to sparse single-instance annotations, and severe overfitting under extremely limited target-domain samples. To address these issues, this paper proposes GiPL, an efficient two-branch training framework. In the first branch, we design an iterative pseudo-label self-training paradigm, which performs zero-shot inference on the support set to generate reliable pseudo-annotations, fuses them with ground-truth labels, and iteratively optimizes the model to fully exploit support set data. In the second branch, we introduce a generative data augmentation pipeline using large vision-language models, which synthesizes domain-aligned, multi-object annotated images to enrich training samples and suppress overfitting. Extensive experiments on three challenging CD-FSOD datasets (RUOD, CARPK, CarDD) under 1/5/10-shot settings demonstrate that GiPL consistently outperforms state-of-the-art methods with significant performance gains. Code is available at https://github.com/z-yaz/CDiscover.

2605.29488 2026-06-02 cs.CV cs.AI

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Yiheng Li, Zhuo Li, Ruibing Hou, Yingjie Chen, Hong Chang, Hao Liu, Shiguang Shan

Affiliations: Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China

AI summary: Proposes the AnyMo framework, which combines a residual FSQ motion tokenizer with a scalable masked-modeling transformer and uses the large-scale, multimodally aligned OmniHuMo dataset to generate high-quality human motion under arbitrary modality combinations.

Abstract

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.

2605.29463 2026-06-02 cs.LG cs.AI

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

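Prakhar Dixit, Sadia Kamal, Tim Oates

Affiliations: University of Cambridge

AI summary: Studies memory confabulation arising during self-reflection in reflexive agents, proposes the Reflection Repetition Rate (RRR) metric to detect it, and mitigates the problem through programmatic extraction of failure signals.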

Comments: Accepted to ICML 2026 Workshop "Failure Modes in Agentic AI"

Abstract

Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own failures. We show that this assumption can fail systematically: across ALFWorld and HumanEval, agents store confident but incorrect interpretations of the task and continue acting on them across trials, even though the environment resets to the correct task each time. We call this failure mode memory confabulation and introduce the Reflection Repetition Rate (RRR), a log-based metric that detects repeated reliance on incorrect reflective content. Using RRR, we identify 16 frozen environments in ALFWorld, where 0 of 121 reflections mention the correct target object, and 4 analogous cases in HumanEval. Our mitigation replaces open-ended self-diagnosis with programmatic extraction of trajectory-level failure signals, increasing correct object mention from 0% to 86%, reducing RRR from 0.64 to 0.10, and solving 3 of 16 frozen ALFWorld environments, suggesting that reflective memory can reinforce false beliefs rather than correct them.

2605.29365 2026-06-02 cs.CL

Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset

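Hyojeong Yu, Hyukhun Koh, Minsung Kim, Kyomin Jung

Affiliations: Seoul National University

AI summary: Reconceptualizes formality transfer as a graded dimension, introduces a three-level spectrum (informal, casual, formal), and builds the 3LF dataset to address the informal-to-formal transfer failures caused by misaligned supervision signals in existing benchmarks.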

Comments: HEAL@CHI 2026 Workshop Paper

Abstract

Formality transfer is commonly framed as a symmetric bidirectional task between informal and formal registers. We argue that this framing conceals a supervision design flaw in existing benchmarks such as GYAFC: binary human rewrites encode relative stylistic shifts rather than absolute human notions of formality. Consequently, models learn to generate pseudo-formal outputs that satisfy benchmark labels while failing to produce genuinely formal language. We quantify this misalignment by re-evaluating benchmark formal labels under a human-aligned definition of formality, revealing substantial discrepancies that propagate to consistent informal-to-formal failures across model families. To address this issue, we reconceptualize formality transfer as a graded dimension rather than a binary attribute. We introduce a three-level spectrum: informal, casual, and formal, where casual serves as an explicit intermediate state that clarifies supervision signals. Based on this framework, we introduce 3LF, a dataset providing parallel supervision across all three levels. Training on 3LF substantially reduces informal-to-formal failures and improves alignment with human perception. For example, GPT-4.1-nano improves from 0.06 to 0.88 F1 in the informal-to-formal direction despite 3LF being significantly smaller than GYAFC. We further demonstrate that these gains cannot be reproduced through in-context learning alone and provide qualitative analyses of ambiguity-driven errors and meaning distortions. Overall, our findings demonstrate how supervision design shapes stylistic alignment and highlight the importance of alignment-aware benchmark construction in controllable text generation.

2605.29341 2026-06-02 cs.CV cs.CL

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Chengzhi Liu, Yuzhe Yang, Sophia Xiao Pu, Yepeng Liu, Lin Long, Yichen Guo, Nuo Chen, Zhaotian Weng, Elena Kochkina, Simerjot Kaur, Charese Smiley, Xiaomo Liu, James Zou, Sheng Liu, Yuheng Bu, Songyou Peng, Xin Eric Wang

Affiliations: arXiv.org

AI summary: Proposes the WorldMemArena benchmark, which evaluates multimodal agent memory through the four-stage lifecycle of an action-world interaction loop and reveals where existing methods fail in writing, maintenance, retrieval, and use.

Comments: 25 pages, 8 figures

Abstract

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.

2605.29260 2026-06-02 cs.CV

Deep Psychovisual Image Representations

Wendi Ma, Aryaman Sharma, Wei Dai, Shekhar S. Chandra

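Affiliations: School of EECS, The University of Queensland

AI summary: Inspired by psychovisual models, the paper proposes Deep Visual Coding, which combines a learned frequency-domain representation with complex-valued image representations to produce psychovisual-style abstractions. It builds the first psychovisual-based deep learning framework, learning task-relevant semantic structure through data-driven spectral filters. Experiments indicate that the model extracts highly interpretable object parts and depends less on depth.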

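AI abstract (translated from Chinese)

Psychovisual models suggest that human vision decouples low-level feature extraction from high-level cognition by first forming intermediate abstractions. By contrast, deep-learning vision models usually extract and aggregate features with homogeneous stacks of spatial layers, which makes their decision process opaque. In this paper we propose Deep Visual Coding, a learned frequency-domain representation inspired by 1990s image codes that quantised perceptually salient frequencies; combined with complex-valued image representations, it yields psychovisual-style abstractions. The approach gives the first psychovisual-based deep learning framework, using data-driven spectral filters that learn to encode task-relevant semantic structure within distinct frequency sub-bands. Salience analysis shows that our psychovisual models extract highly interpretable object parts, in contrast to the amorphous regions produced by regular convolutional neural networks (CNNs). We also find that, for model scaling, our models depend less on depth than CNNs, because the complex-valued representations and learned abstractions take over the role of deep spatial layers. Together, these findings suggest that psychovisual coding offers a promising path toward more efficient and transparent vision models.

English abstract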

Psychovisual models suggest human vision decouples low-level feature extraction from higher cognition by first forming intermediate abstractions. In contrast, deep learning-based vision models routinely extract and aggregate features using homogeneous stacks of spatial layers, rendering their decision-making processes opaque. In this paper, we propose Deep Visual Coding, a learned frequency-domain representation inspired by 1990s image codes that quantised perceptually salient frequencies, which together with complex-valued image representations produces psychovisual-style abstractions. This approach enables the first psychovisual-based deep learning framework, utilizing data-driven spectral filters that learn to encode task-relevant semantic structures within distinct frequency sub-bands. Salience analyses reveal that our psychovisual models extract highly interpretable object parts compared to the amorphous regions produced by regular Convolutional Neural Networks (CNNs). Furthermore, we find that our models are less depth dependent than CNNs for model scaling, since our complex-valued representations and learned abstractions subsume the role of the deep spatial layers. Together, these findings demonstrate that psychovisual coding provides a promising path toward more efficient and transparent vision models.

2605.29233 2026-06-02 cs.LG cs.AI

BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu, Yingyan Celine Lin

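Affiliations: Georgia Institute of Technology

AI summary: Proposes BlockBatch, a training-free framework that accelerates diffusion language model inference through multi-branch parallel decoding and confidence-gated merging, reducing denoising steps by 26.6% on average and achieving a 1.33x end-to-end speedup.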

Comments: 23 pages, including references and appendices

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung

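Affiliations: The University of Hong Kong

AI summary: Studies the properties of many-shot chain-of-thought in-context learning on reasoning tasks and proposes Curvilinear Demonstration Selection, an ordering method that improves a math task by up to 5.42 percentage points.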

Comments: Accepted by ICML 2026

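AI abstract (translated from Chinese)

Although many-shot ICL achieves remarkable performance, earlier studies of its scaling behavior focused mainly on non-reasoning tasks. In this work we study many-shot ICL on reasoning tasks, with particular attention to many-shot chain-of-thought in-context learning (CoT-ICL). By analyzing non-reasoning and reasoning tasks, as well as non-reasoning and reasoning-oriented LLMs, we identify several distinctive properties of many-shot CoT-ICL. We interpret these findings as evidence that many-shot CoT-ICL is in-context test-time learning rather than scaled pattern matching, and we propose two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by these principles, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that gains up to 5.42 percentage points on a math task with 64 demonstrations. Overall, our results recast the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.

English abstract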

While many-shot ICL achieves remarkable performance, prior studies of its scaling behavior have mainly focused on non-reasoning tasks. In this work, we study many-shot ICL on reasoning tasks, with a particular focus on many-shot chain-of-thought in-context learning (CoT-ICL). Analyzing across non-reasoning and reasoning tasks and across non-reasoning and reasoning-oriented LLMs, we identify several distinctive properties of many-shot CoT-ICL. We further interpret these findings by viewing many-shot CoT-ICL as in-context test-time learning rather than scaled pattern matching, and suggest two principles: (i) demonstrations should be easy for the target model to understand, and (ii) they should be ordered to support a smooth conceptual progression. Guided by the principle, we propose Curvilinear Demonstration Selection (CDS), a simple ordering method that yields up to a 5.42 percentage-point gain on a math task with 64 demonstrations. Overall, our results reframe the long context window from a retrieval buffer into a structured curriculum for in-context test-time learning.

2604.19532 2026-06-02 cs.SD cs.AI

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

Lekai Qian, Haoyu Gu, Jingwei Zhao, Ziyu Wang

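Affiliations: arXiv.org

AI summary: Proposes a tokenization that uses a uniform beat as the basic unit and encodes all events at the same pitch within one time step as a single token. On music continuation and accompaniment generation, it improves musical quality and structural coherence over conventional event-based methods.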

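AI abstract (translated from Chinese)

Tokenizing music to fit the general framework of language models is a challenging problem, particularly given the variety of symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as a sequence of musical events such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it handles the regularity of musical time only implicitly: individual tokens may span different durations, so time advances non-uniformly. In this paper we ask whether an alternative tokenization is possible in which a uniform-length musical step (e.g., one beat) is the basic unit. Specifically, we encode all events at the same pitch within a single time step as one token and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation and compare it with mainstream event-based methods. The results show improved musical quality and structural coherence, and additional analyses confirm higher efficiency and more effective capture of long-range patterns.

English abstract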

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.
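
The step-wise grouping described in this abstract can be illustrated with a minimal sketch. It is not the authors' tokenizer: the note format `(onset, pitch, duration)` in beats, the function name `tokenize_by_step`, and the token layout `(pitch, events)` are all assumptions made for illustration.

```python
from collections import defaultdict


def tokenize_by_step(notes, step=1.0):
    """Group notes into uniform time steps (hypothetical format).

    notes: list of (onset, pitch, duration) tuples, all measured in beats.
    Returns one list per time step; each entry is a (pitch, events) token,
    where events holds every (offset_in_step, duration) pair for that pitch.
    """
    if not notes:
        return []
    grid = defaultdict(lambda: defaultdict(list))
    for onset, pitch, duration in notes:
        idx = int(onset // step)  # index of the time step containing the onset
        grid[idx][pitch].append((onset - idx * step, duration))
    steps = []
    for idx in range(max(grid) + 1):
        if idx in grid:
            # One token per pitch: all events at that pitch within this step.
            tokens = [(pitch, tuple(sorted(events)))
                      for pitch, events in sorted(grid[idx].items())]
        else:
            tokens = []  # an empty step still advances time by one unit
        steps.append(tokens)
    return steps


notes = [(0.0, 60, 1.0), (0.5, 60, 0.5), (0.0, 64, 1.0), (2.0, 67, 1.0)]
for index, tokens in enumerate(tokenize_by_step(notes)):
    print(index, tokens)
```

Every step, including an empty one, occupies exactly one slot in the output, so time advances uniformly regardless of how many events a step contains.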

2602.08646 2026-06-02 cs.LG

Gradient Preconditioning for Efficient and Reliable Reward-Guided Generation

Jisung Hwang, Minhyuk Sung

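Affiliations: KAIST

AI summary: Proposes a gradient preconditioning method that projects reward gradients onto a white Gaussian noise feasible set, making reward-guided generation with one-step generative models efficient and reliable while preventing reward hacking and speeding up optimization.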

Comments: ICML 2026

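AI abstract (translated from Chinese)

We propose a gradient preconditioning method that makes reward-guided generation with one-step generative models both efficient and reliable. Test-time noise optimization can unlock substantially better reward-guided generation from pretrained generative models, but it is prone to reward hacking, which degrades quality, and it is often too slow for practical use. We precondition the reward gradient by projecting it onto a carefully designed feasible set for white Gaussian noise: a compact spectral set with blockwise norm constraints that closely captures the statistics and spatial uncorrelatedness of white Gaussian noise. The preconditioning reshapes each gradient update into a noise-aligned direction, which drives faster and more effective reward ascent while preventing reward hacking. The projection has a closed-form solution and matches the $O(N \log N)$ complexity of the FFT, so it adds negligible overhead in practice. In experiments on FLUX with four reward models, our method reaches a comparable Aesthetic Score using only 30% of the wall-clock time required by the state-of-the-art regularization-based method.

English abstract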

We propose a gradient preconditioning method that makes reward-guided generation with one-step generative models both efficient and reliable. Test-time noise optimization can unlock substantially better reward-guided generations from pretrained generative models, but it is prone to reward hacking that degrades quality and is often too slow for practical use. We precondition reward gradients by projecting them onto a carefully designed white Gaussian noise feasible set, a compact spectral set with blockwise norm constraints that tightly captures the statistics and spatial uncorrelatedness of white Gaussian noise. This preconditioning reshapes each gradient update into a noise-aligned direction, driving faster and more effective reward ascent while preventing reward hacking. The projection is closed-form and matches the $O(N \log N)$ complexity of FFT, adding negligible overhead in practice. In experiments on FLUX with four reward models, our approach reaches a comparable Aesthetic Score using only 30% of the wall-clock time required by the state-of-the-art regularization-based method.

2510.01711 2026-06-02 cs.RO cs.LG

Contrastive Representation Regularization for Vision-Language-Action Models

Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin

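Affiliations: KAIST

AI summary: Proposes Robot State-aware Contrastive Loss (RS-CL), which uses contrastive learning to align VLM representations with the robot's proprioceptive states and improves VLA models on robot manipulation tasks.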

Comments: ICML 2026

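AI abstract (translated from Chinese)

Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by exploiting the rich representations of pretrained Vision-Language Models (VLMs). Their representations, however, arguably remain suboptimal and lack sensitivity to robotic signals such as control actions and proprioceptive information. To address this, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularizer for VLA models that aims to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using the relative distances between states as soft supervision. As a complement to the original action-prediction objective, RS-CL strengthens control-relevant representation learning while remaining lightweight and fully compatible with standard VLA training pipelines. Our experiments show that RS-CL substantially improves state-of-the-art VLA models: it raises prior results on the RoboCasa-Kitchen benchmark to 69.7%, a new state of the art, and increases success rates on challenging real-robot manipulation tasks from 45.0% to 58.3%.

English abstract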

Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive information. To address the issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL enhances control-relevant representation learning, while being lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the performance of state-of-the-art VLA models; it pushes the prior art to 69.7% achieving the state-of-the-art performance on the RoboCasa-Kitchen benchmark, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.
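
The idea of using relative distances between proprioceptive states as soft supervision can be sketched as a soft-label contrastive loss. The abstract does not give the exact RS-CL formulation, so everything below is an assumption: the Euclidean state distance, the dot-product similarity, the two temperatures, and the function name `soft_contrastive_loss` are illustrative choices only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_contrastive_loss(reps, states, tau_rep=0.1, tau_state=1.0):
    """Cross-entropy between state-derived soft targets and representation
    similarities (assumed form, requires at least two samples)."""
    n = len(reps)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Soft targets: samples with closer robot states receive more weight.
        targets = softmax([-euclidean(states[i], states[j]) / tau_state
                           for j in others])
        # Predicted distribution from representation similarity.
        preds = softmax([dot(reps[i], reps[j]) / tau_rep for j in others])
        total -= sum(t * math.log(p) for t, p in zip(targets, preds))
    return total / n


states = [[0.0], [0.1], [5.0]]
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]     # mirrors state proximity
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # contradicts it
print(soft_contrastive_loss(aligned, states))
print(soft_contrastive_loss(misaligned, states))
```

In this toy setup the loss is lower when representations of nearby states are similar, which is the behavior a state-aware regularizer would reward alongside the action-prediction objective.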

2605.21247 2026-06-02 cs.LG

Graph Navier Stokes Networks

Zexing Zhao, Guangsi Shi, Yu Gong, Tianyu Wang, Shirui Pan, Hongye Cheng, Yuxiao Li

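Affiliations: Northwest A&F University; Corporate Research Center, Midea Group; Peking University; Fudan University; Griffith University; Bosch

AI summary: To address oversmoothing in graph neural networks, proposes Graph Navier Stokes Networks (GNSN), which draw on the Navier Stokes equations and add a convection mechanism for more efficient message passing; reports the best classification performance on 12 real-world datasets.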

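AI abstract (translated from Chinese)

Graph Neural Networks (GNNs) have become a cornerstone of deep learning, and most existing methods model message passing on the basis of graph signal processing and diffusion equations. These methods inherently suffer from oversmoothing: node features become indistinguishable as network depth increases. Inspired by the Navier Stokes equations, we propose Graph Navier Stokes Networks (GNSN), a novel architecture that goes beyond conventional diffusion-based message passing by introducing convection into graph structures. GNSN defines a dynamic velocity field on the graph to govern convection, which enables more efficient and more direct message propagation. By adaptively balancing convection and diffusion, GNSN handles datasets with different levels of homophily effectively. Extensive evaluation on twelve real-world datasets shows that GNSN consistently outperforms state-of-the-art baselines in classification accuracy, and the experiments further highlight its effectiveness in mitigating oversmoothing.

English abstract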

Graph Neural Networks (GNNs) have emerged as a cornerstone of deep learning, with most existing methods rooted in graph signal processing and diffusion equations to model message passing. However, these approaches inherently suffer from the oversmoothing problem, where node features become indistinguishable as the network depth increases. Inspired by the Navier Stokes equations, we introduce Graph Navier Stokes Networks (GNSN), a novel architecture that transcends conventional diffusion-based message passing by incorporating convection into graph structures. GNSN defines a dynamic velocity field on the graph to govern convection, enabling more efficient and direct message propagation. By adaptively balancing convection and diffusion, GNSN is able to efficiently handle datasets with varying levels of homophily. Extensive evaluations across twelve real-world datasets demonstrate that GNSN consistently outperforms state-of-the-art baselines in classification accuracy. Moreover, experimental results further emphasize its effectiveness in alleviating the oversmoothing problem.
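
To make the contrast between diffusion and convection on a graph concrete, here is a toy explicit update step. It is not the GNSN layer: the fixed edge velocities, the upwind rule, scalar node features, and the function name `convection_diffusion_step` are assumptions for illustration, whereas the paper describes a dynamic velocity field.

```python
def convection_diffusion_step(x, edges, velocity, dt=0.1, diffusivity=1.0):
    """One explicit step of a toy convection-diffusion update on a graph.

    x: list of scalar node features.
    edges: list of undirected edges (i, j).
    velocity: dict mapping a directed pair (i, j) to a flow speed from i to j
              (hypothetical fixed field).
    """
    dx = [0.0] * len(x)
    for i, j in edges:
        # Diffusion: symmetric exchange that pulls neighbors together.
        dx[i] += diffusivity * (x[j] - x[i])
        dx[j] += diffusivity * (x[i] - x[j])
        # Convection (upwind): only the downstream node is updated.
        v_ij = velocity.get((i, j), 0.0) - velocity.get((j, i), 0.0)
        if v_ij > 0:
            dx[j] += v_ij * (x[i] - x[j])
        elif v_ij < 0:
            dx[i] += -v_ij * (x[j] - x[i])
    return [xi + dt * d for xi, d in zip(x, dx)]


path = [(0, 1), (1, 2)]
print(convection_diffusion_step([1.0, 0.0, 0.0], path, {}))
print(convection_diffusion_step([1.0, 0.0, 0.0], path, {(0, 1): 1.0},
                                dt=1.0, diffusivity=0.0))
```

Pure diffusion averages features in both directions along every edge, which is the mechanism behind oversmoothing; the convection term instead carries a feature along the direction of flow without pulling the source node toward its neighbor.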

2605.25195 2026-06-02 cs.CV

Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation

Shuyuan Tu, Qi Tian, Zihan Yang, Yue Wu, Xintong Han, Weijie Kong, Jiangfeng Xiong, Jian-Wei Zhang, Zhao Zhong, Liefeng Bo, Zuxuan Wu, Yu-Gang Jiang

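Affiliations: Fudan University; Tencent Hunyuan

AI summary: Proposes Baton, in which a VA-Planner generates semantically aligned, modality-aware planned tokens as a blueprint that is injected into the diffusion backbone to coordinate video and audio denoising. This addresses the fragile cross-modal alignment that existing methods suffer from because they lack a shared long-horizon plan.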

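AI abstract (translated from Chinese)

Current open-source diffusion models struggle to generate stable, synchronized audio-visual content, especially in scenarios that require complex semantic reasoning. The root cause is that existing methods guide audio-video denoising with coarse text embeddings from off-the-shelf encoders, which discards fine-grained semantics and, critically, provides no shared long-horizon plan, leading to uncoordinated denoising trajectories and fragile cross-modal alignment. We propose Baton, the first framework to bring explicit semantic planning into joint video-audio generation. Our key insight is that supplementing coarse text guidance with semantically rich, modality-aware planned tokens, which are jointly reasoned and mutually aligned before denoising, can both restore fine-grained semantic detail and establish a shared blueprint that coordinates the audio and video denoising trajectories. Concretely, Baton first introduces the VA-Planner, a multimodal language model with dual semantic alignment towers in which learnable queries cross-attend to video and audio features to produce a pair of semantically aligned video and audio planned tokens that serve as keyframe-level blueprints. The planned tokens are injected into the diffusion backbone through cross-attention layers and provide temporally grounded guidance that complements the coarse text embeddings. Because planned tokens have no one-to-one spatial-temporal correspondence with the diffusion latents, we further propose Relative Semantic RoPE, a relative positional encoding that maps planned tokens and latents into a shared spatial-temporal coordinate frame so that each latent can attend accurately to the semantic cues at its corresponding position. Benchmark experiments demonstrate Baton's effectiveness both qualitatively and quantitatively.

English abstract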

Current open-source diffusion models struggle to generate stable and synchronized audio-visual content, particularly in scenarios demanding complex semantic reasoning. The root cause is that existing methods rely on coarse text embeddings from off-the-shelf encoders to guide audio-video denoising, which discards fine-grained semantics and, critically, lacks a shared long-horizon plan, leading to uncoordinated denoising trajectories and fragile cross-modal alignment. We propose Baton, the first framework that introduces explicit semantic planning into joint video-audio generation. Our key insight is that complementing coarse text guidance with semantically rich, modality-aware planned tokens, jointly reasoned and mutually aligned before denoising, can simultaneously restore fine-grained semantic detail and establish a shared blueprint that coordinates both audio and video denoising trajectories. Concretely, Baton first introduces the VA-Planner, a multimodal language model equipped with dual semantic alignment towers, where learnable queries cross-attend to both video and audio features to produce a pair of semantically aligned video and audio planned tokens as keyframe-level blueprints. These planned tokens are injected into the diffusion backbone via cross-attention layers, providing temporally grounded guidance complementary to coarse text embeddings. Since planned tokens do not share one-to-one spatial-temporal correspondence with diffusion latents, we further propose Relative Semantic RoPE, a relative positional encoding that maps planned tokens and latents into a shared spatial-temporal coordinate frame, enabling each latent to accurately attend to its positionally corresponding semantic cues. Experiments on benchmarks show the effectiveness of Baton both qualitatively and quantitatively.

2605.25144 2026-06-02 cs.CV

SpikeReg: Energy-Efficient 3D Deformable Medical Image Registration with Spiking Neural Networks

Ali Mikaeili Barzili, Behzad Moshiri, Hamid Azadegan, Mohammad-Reza A. Dehaqani

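Affiliations: School of Electrical and Computer Engineering, College of Engineering, University of Tehran; Max Planck Institute for Brain Research; School of Computer Engineering, Iran University of Science and Technology (IUST); Department of Electrical and Computer Engineering, University of Waterloo

AI summary: Proposes SpikeReg, a spiking U-Net initialized from an analog ANN teacher through layer-wise weight transfer and activation-percentile threshold calibration, then fine-tuned with a surrogate-gradient objective that combines local cross-correlation, diffusion regularization, and spike-rate sparsity. On the OASIS Learn2Reg validation split it reaches Dice 0.7474, not significantly different from the ANN teacher, with a 12.8% mean spike rate and a projected 55.5x reduction in arithmetic energy.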

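AI abstract (translated from Chinese)

Deformable medical image registration aligns anatomical structures across images but is computationally intensive at 3D resolution. Spiking neural networks (SNNs) offer sparse, event-driven computation, yet they have not been studied systematically for deformable medical image registration. We propose SpikeReg, a spiking U-Net for 3D brain MRI registration. SpikeReg is initialized from an analog ANN registration teacher, converted through layer-wise weight transfer and activation-percentile threshold calibration, and fine-tuned with a surrogate-gradient objective that combines local cross-correlation, diffusion regularization, and spike-rate sparsity. On the OASIS Learn2Reg validation split (19 image pairs), SpikeReg reaches Dice 0.7474 ± 0.032, with no significant paired Dice difference from the ANN teacher (0.7480 ± 0.037, p = 0.67), at a mean spike rate of 12.8% and a projected 55.5x reduction in arithmetic energy under an event-sparse SynOps/MAC proxy relative to the dense ANN baseline. We also report two negative findings: displacement distillation from the ANN teacher hurts performance, and ANN teachers trained with a label-Dice loss fail to transfer through rate-code conversion. Together, these results show that dense geometric prediction can be carried out with sparse event-driven computation, opening a path toward neuromorphic medical image registration.

English abstract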

Deformable medical image registration aligns anatomical structures across images but remains computationally dense at 3D resolution. Spiking neural networks (SNNs) offer sparse event-driven computation, yet have not been systematically studied for deformable medical image registration. We introduce SpikeReg, a spiking U-Net for 3D brain MRI registration. SpikeReg is initialized from an analog ANN registration teacher, converted by layer-wise weight transfer and activation-percentile threshold calibration, and fine-tuned with a surrogate-gradient objective combining local cross-correlation, diffusion regularization, and spike-rate sparsity. On the OASIS Learn2Reg validation split ($19$ image pairs), SpikeReg reaches Dice $0.7474 \pm 0.032$, with no significant paired Dice difference from the ANN teacher ($0.7480 \pm 0.037$, $p = 0.67$), at a $12.8\%$ mean spike rate and a $55.5\times$ projected arithmetic-energy reduction under an event-sparse SynOps/MAC proxy relative to the dense-ANN baseline. We additionally report two negative findings: displacement distillation from the ANN teacher hurts performance, and ANN teachers trained with a label-Dice loss fail to transfer through rate-code conversion. Together these results show that dense geometric prediction can be performed under sparse event-driven computation, opening a path toward neuromorphic medical image registration.
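
Activation-percentile threshold calibration and rate coding, both mentioned in this abstract, can be sketched in a few lines. This is a generic ANN-to-SNN conversion illustration, not SpikeReg's procedure: the 99th percentile, the integrate-and-fire neuron with reset by subtraction, and the helper names `percentile` and `spike_rate` are assumptions.

```python
def percentile(values, q):
    """Linearly interpolated percentile for q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def spike_rate(activation, threshold, timesteps):
    """Firing rate of an integrate-and-fire neuron driven by a constant input.

    The membrane potential accumulates the input each step; crossing the
    threshold emits one spike and subtracts the threshold (soft reset).
    """
    potential, spikes = 0.0, 0
    for _ in range(timesteps):
        potential += activation
        if potential >= threshold:
            potential -= threshold
            spikes += 1
    return spikes / timesteps


# Hypothetical ANN activations recorded for one layer on calibration data.
activations = [float(v) for v in range(101)]
threshold = percentile(activations, 99)  # assumed calibration percentile
print(threshold)
print(spike_rate(0.5 * threshold, threshold, 10))
```

With a threshold set near the top of the activation distribution, most inputs map to a firing rate roughly proportional to the original activation, while the few values above the threshold saturate at one spike per step.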

2605.25143 2026-06-02 cs.AI cs.LG

Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

Dao Tran, Duc Anh Le, Ngoc Luu, Quan Pham, Tung Pham, Hung Bui

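Affiliations: Qualcomm AI Research

AI summary: Proposes stochastic backtracking, which maintains a pool of historical prefixes and uses Subpool Selection and Power Backtrack Sequential Monte Carlo to obtain a better accuracy-token trade-off in test-time scaling.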

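AI abstract (translated from Chinese)

Test-time scaling improves language model reasoning by spending extra compute to explore multiple solution trajectories. The key challenge is to maximize accuracy while minimizing the total number of tokens generated during reasoning. Recent PRM-guided methods score intermediate prefixes to steer the search, but most are frontier-only: they keep only the currently active prefixes and use noisy PRM scores to prune or resample away the rest irreversibly. This can lead to premature commitment, diversity collapse, and the loss of prefixes that could still yield correct continuations. We introduce stochastic backtracking over a persistent pool of historical prefixes, which lets test-time compute revisit previously generated states instead of only extending the current frontier. For efficiency, we propose two complementary mechanisms. Subpool Selection strengthens greedy PRM-guided search by applying Top-N selection within random subpools, giving historical prefixes a chance to bypass over-scored frontier candidates. Power Backtrack Sequential Monte Carlo extends SMC-style resampling to the persistent pool using powered PRM scores and mixture-corrected weights. Across mathematical reasoning benchmarks and model scales, our methods consistently achieve higher accuracy for a given token count and reach the same accuracy as strong PRM-guided baselines with only a fraction of the tokens, which indicates that persistent-pool stochastic backtracking is a simple and effective way to improve the accuracy-token trade-off in test-time scaling.

English abstract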

Test-time scaling improves language model reasoning by spending additional compute to explore multiple solution trajectories. The key challenge is to maximize accuracy while minimizing the total number of generated tokens during reasoning. Recent PRM-guided methods score intermediate prefixes to steer this search, but most are frontier-only: they keep only the current active prefixes and irreversibly prune or resample away the rest using noisy PRM scores. This can cause premature commitment, diversity collapse, and the loss of prefixes that still admit correct continuations. We introduce stochastic backtracking over a persistent pool of historical prefixes, allowing test-time compute to revisit previously generated states instead of only expanding the current frontier. To make this efficient, we propose two complementary mechanisms. Subpool Selection strengthens greedy PRM-guided search by applying Top-N selection within random subpools, giving historical prefixes a chance to bypass over-scored frontier candidates. Power Backtrack Sequential Monte Carlo extends SMC-style resampling to the persistent pool using powered PRM scores and mixture-corrected weights. Across mathematical reasoning benchmarks and model scales, our methods consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count in comparison to strong PRM-guided baselines, demonstrating that persistent-pool stochastic backtracking provides a simple and effective way to improve the accuracy-token trade-off in test-time scaling.
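
Top-N selection within a random subpool, as described for Subpool Selection, might look like the following sketch. The abstract does not specify how subpools are drawn or sized, so uniform sampling without replacement, the `(prefix, score)` pool format, and the function name `subpool_top_n` are assumptions.

```python
import random


def subpool_top_n(pool, subpool_size, n, rng):
    """Pick the n best-scored prefixes from a random subpool.

    pool: list of (prefix, prm_score) pairs, including historical prefixes.
    subpool_size: number of pool entries sampled uniformly without replacement.
    rng: a random.Random instance, passed in for reproducibility.
    """
    k = min(subpool_size, len(pool))
    subpool = rng.sample(pool, k)
    # Greedy Top-N, but only among the sampled candidates.
    return sorted(subpool, key=lambda item: item[1], reverse=True)[:n]


scores = [0.1, 0.9, 0.4, 0.7, 0.3, 0.8]
pool = [("p%d" % i, score) for i, score in enumerate(scores)]
rng = random.Random(0)
print(subpool_top_n(pool, subpool_size=3, n=2, rng=rng))
```

Because the highest-scored prefix is not always in the sampled subpool, lower-scored historical prefixes are occasionally expanded, which is how the search can recover from an over-scored frontier candidate. When the subpool covers the whole pool, the procedure reduces to ordinary greedy Top-N.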

2605.24881 2026-06-02 cs.RO

Learning Transferable Motor Skills for Geometry-Aware Robotic Surface Tasks

Miroslav David, Karla Stepanova, Robert Babuska

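Affiliations: Czech Institute of Informatics, Robotics, and Cybernetics; Czech Technical University in Prague; Delft University of Technology

AI summary: Proposes a modular framework that decouples geometric motion planning from execution-level expert behavior, using interpretable atomic motor rules whose parameters are inferred by a neural network to transfer skills across geometries.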

Comments: In: Workshop on Geometry in the Age of Data-Driven Robotics, ICRA 2026, Vienna, 2026

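AI abstract (translated from Chinese)

Robotic surface-interaction tasks such as spray painting or welding require both accurate geometric planning and precise motion execution. Modern motion planners can generate valid geometric paths, but they usually lack the expert motor patterns of human operators. Conversely, learning from demonstration often couples task execution tightly to the specific training geometry, which limits transferability. We propose a modular framework that decouples geometric motion planning from execution-level expertise. Expert behavior is represented as a vocabulary of interpretable, atomic motor rules, such as velocity scaling and orientation offsets, that systematically modify a geometrically planned reference path. We train a multimodal neural network to infer the rule parameters jointly from kinematic trajectory data and CAD model geometry. We evaluate the approach in dynamic simulation on L-shaped and window-shaped objects and show, on simulated data, that the model successfully extracts velocity and orientation rules for both topologies.

English abstract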

Robotic surface-interaction tasks, such as spray painting or welding, require both accurate geometric planning and precise motion execution. While modern motion planners generate valid geometric paths, they often lack the expert motor patterns observed in human operators. Conversely, learning from demonstration often tightly couples task execution to the specific training geometry, limiting transferability. We propose a modular framework that decouples geometric motion planning from execution-level expertise. Expert behavior is represented as a vocabulary of interpretable, atomic motor rules, such as velocity scaling and orientation offsets, that systematically modify a geometrically planned reference path. We train a multimodal neural network to infer rule parameters jointly from kinematic trajectory data and CAD model geometry. We evaluate our approach through dynamic simulation on L-shaped and window-shaped objects, demonstrating on simulated data that the model successfully extracts velocity and orientation rules across both topologies.

2605.24828 2026-06-02 cs.AI

Test-Time Deep Thinking to Explore Implicit Rules

Wentong Chen, Xin Cong, Zhong Zhang, Yaxi Lu, Siyuan Zhao, Yesai Wu, Qinyu Luo, Haotian Chen, Yankai Lin, Zhiyuan Liu, Maosong Sun

Affiliations: Renmin University of China; Department of Statistics and Data Science, Tsinghua University; School of Computer Science and Engineering, UESTC; Department of Computer Science and Technology, Tsinghua University; School of Mathematical Sciences, Nankai University; Whiting School, Johns Hopkins University; School of Artificial Intelligence, Shanghai Jiaotong University

AI summary: To address agent failures in environments with implicit rules, proposes the TTExplore framework and trains a dedicated model, Exp-Thinker, for test-time reasoning, improving baseline performance by 14-19 points on average.

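AI abstract (translated from Chinese)

As large language models (LLMs) continue to advance, intelligent agents are becoming increasingly important. These agents, however, often fail in environments governed by implicit rules, that is, hidden constraints that cannot be observed directly and must be inferred through interaction. As a result, agents fall into repetitive trial-and-error loops that end in task failure. To address this challenge, we propose Test-Time Exploration (TTExplore), a framework in which a thinker component analyzes the interaction history to infer the implicit rules and guide an actor. Effective exploration in this setting depends critically on the thinker's reasoning ability, yet evaluating deep reasoning trajectories is inherently unstable and difficult, which is a major obstacle to effective training. To overcome it, we introduce a novel and stable reinforcement learning pipeline. The core idea is to use accurate task-level scores as indirect rewards, bypassing the difficulty of evaluating intermediate reasoning, and to keep only a single thinking node per trajectory to alleviate reward sparsity. With this pipeline we train a specialized 7B model, Exp-Thinker. Experiments on five text-based embodied tasks show that TTExplore equipped with Exp-Thinker improves baseline agent performance by 14-19 points on average, demonstrating the effectiveness of reasoning explicitly about implicit rules.

English abstract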

With the continuous advancement of Large Language Models (LLMs), intelligent agents are becoming increasingly vital. However, these agents often fail in environments governed by implicit rules--hidden constraints that cannot be observed directly and must be inferred through interaction. This causes agents to fall into repetitive trial-and-error loops, ultimately leading to task failure. To address this challenge, we propose Test-Time Exploration (TTExplore), a framework where a thinker component analyzes interaction history to infer these implicit rules and guide an actor. Effective exploration in this setting critically depends on the reasoning ability of the thinker. However, evaluating deep reasoning trajectories is inherently unstable and difficult, which poses a major obstacle to effective training. To overcome this issue, we introduce a novel and stable reinforcement learning pipeline. The core idea is to use accurate task-level scores as indirect rewards to bypass the difficulty of evaluating intermediate reasoning, and to retain only a single thinking node per trajectory to alleviate reward sparsity. Using this pipeline, we train a specialized 7B model, Exp-Thinker. Experiments on five text-based embodied tasks show that TTExplore equipped with Exp-Thinker improves baseline agent performance by an average of $14$-$19$ points, demonstrating the effectiveness of explicitly reasoning about implicit rules.
