arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10525 2026-05-20 cs.CV

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

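Yuecheng Liu, Junda Cheng, Longliang Liu, Wenjing Liao, Hanrui Cheng, Yuzhou Wang, Xin Yang

Institutions: Huazhong University of Science & Technology; Optics Valley Laboratory

AI summary: This paper proposes GemDepth, a framework that introduces a Geometry-Embedding Module and an Alternating Spatio-Temporal Transformer to address spatial blurring and temporal inconsistency in video depth estimation, achieving accurate and robust 3D consistency.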

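AI-generated abstract (translated from Chinese)

Video depth estimation extends monocular prediction into the temporal domain to ensure consistency. However, existing methods often show spatial blurring in fine-detail regions and temporal inconsistency. We propose GemDepth, a framework built on the idea that explicit knowledge of camera motion and global 3D structure is a prerequisite for maintaining 3D consistency. GemDepth introduces a Geometry-Embedding Module (GEM) that generates implicit geometric embeddings by predicting inter-frame camera poses. Injecting this motion prior gives the network intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences, improving spatial precision for sharper details while enforcing strict temporal consistency. In addition, GemDepth adopts a data-efficient training strategy that bridges the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations show that GemDepth achieves state-of-the-art performance on multiple datasets, particularly in complex dynamic scenes. The code is publicly available at https://github.com/Yuecheng919/GemDepth.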

English abstract

Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth.

2605.10344 2026-05-20 cs.AI

TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

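George Wu, Nan Jing, Qing Yi, Chuan Hao, Ming Yang, Feng Chang, Yuan Wei, Jian Yang, Ran Tao, Bryan Dai

Institutions: IQuest Research; Beihang University

AI summary: This paper proposes TMAS, a framework that scales test-time compute through multi-agent synergy, using hierarchical memories and hybrid-reward reinforcement learning to improve reasoning ability and exploration efficiency.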

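AI-generated abstract (translated from Chinese)

Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have advanced this paradigm further by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either coordinate parallel reasoning trajectories only weakly or rely on noisy historical information without explicitly deciding what should be retained and reused, which limits their ability to balance exploration and exploitation. In this paper, we propose TMAS, a framework for scaling test-time compute through multi-agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross-trajectory collaboration, TMAS introduces hierarchical memories: an experience bank reuses reliable low-level intermediate conclusions and local feedback, while a guideline bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns. In addition, we design a hybrid-reward reinforcement learning scheme tailored to TMAS that jointly preserves basic reasoning capability, strengthens experience utilization, and encourages exploration of solution strategies beyond earlier attempts. Extensive experiments on challenging reasoning benchmarks show that TMAS scales better across iterations than existing test-time scaling baselines, and that hybrid-reward training further improves scaling effectiveness and stability across iterations. Code and data are available at https://github.com/IQuestLab/tmas.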

English abstract

Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation. In this work, we propose TMAS, a framework for scaling test-time compute via multi-agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross-trajectory collaboration, TMAS introduces hierarchical memories: the experience bank reuses low-level reliable intermediate conclusions and local feedback, while the guideline bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns. Furthermore, we design a hybrid reward reinforcement learning scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. Extensive experiments on challenging reasoning benchmarks show that TMAS achieves stronger iterative scaling than existing test-time scaling baselines, with hybrid reward training further improving scaling effectiveness and stability across iterations. Code and data are available at https://github.com/IQuestLab/tmas.

2605.08879 2026-05-20 cs.RO

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

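Tianyi Zhang, Shaopeng Zhai, Haoran Zhang, Fuxian Huang, Qi Zhang

Institutions: Shanghai Artificial Intelligence Laboratory

AI summary: This paper proposes Conservative Supervised Fine-Tuning (ConSFT), which dynamically scales the learning signal to reduce the damage that fine-tuning does to the pre-trained capabilities of flow-matching vision-language-action models, improving adaptation to the target distribution and capability retention without relying on prior data or architectural overhead.

Comments: 20 pages, 9 figures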

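AI-generated abstract (translated from Chinese)

Unconstrained fine-tuning of flow-matching vision-language-action (VLA) models causes dense parameter overwrites, which degrade pre-trained capabilities. We propose Conservative Supervised Fine-Tuning (ConSFT), an optimization objective that adapts to the target distribution while mitigating catastrophic forgetting, with no need for prior data or architectural overhead. By dynamically scaling the learning signal according to model confidence, ConSFT suppresses excessive gradients from low-confidence samples, preventing disproportionate parameter updates and thereby bounding the intrinsic risk of parameter disruption. Inspired by trust-region clipping in reinforcement learning, this formulation establishes a progressive learning dynamic that secures both convergence on the target and retention of prior capabilities, keeping parameter updates sparse without the parallel reference network that explicit regularization requires. We evaluate ConSFT on the LIBERO and RoboTwin benchmarks with state-of-the-art flow-matching VLAs (π₀, π₀.₅, and GR00T-N1.6-3B). The method outperforms vanilla SFT in capability retention by an average absolute margin of more than 20% and, in a setting without prior data, matches the efficacy of data-heavy experience replay. Real-world robot deployments confirm that ConSFT prevents spatial overfitting during downstream adaptation, preserving pre-trained physical skills while acquiring sequential target tasks.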

English abstract

Unconstrained fine-tuning of flow-matching Vision-Language-Action (VLA) models drives dense parameter overwrites, degrading pre-trained capabilities. We present Conservative Supervised Fine-Tuning (ConSFT), an optimization objective that adapts to target distributions while mitigating catastrophic forgetting, requiring zero prior data or architectural overhead. By dynamically scaling learning signals based on model confidence, ConSFT suppresses excessive gradients from low-confidence samples to prevent disproportionate parameter updates, thereby bounding the intrinsic parameter disruption risk. Inspired by reinforcement learning's trust-region clipping, this formulation establishes a progressive learning dynamic to secure target convergence and prior capability retention, maintaining sparse parameter updates without relying on the parallel reference networks required by explicit regularization. We evaluate ConSFT on the LIBERO and RoboTwin benchmarks across state-of-the-art flow-matching VLAs (π₀, π₀.₅, and GR00T-N1.6-3B). The method outperforms vanilla SFT in capability retention by an average absolute margin of over 20%, matching the efficacy of data-heavy Experience Replay in a prior-data-free regime. Real-world robotic deployments confirm that ConSFT precludes spatial overfitting during downstream adaptation, preserving pre-trained physical skills while acquiring sequential target tasks.

2605.08830 2026-05-20 cs.CV cs.AI cs.RO

VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

Rui Zhao, Jianlin Yu, Zhenhai Gao, Jiaqiao Liu, Fei Gao

Institutions: College of Automotive Engineering, Jilin University; The National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University; ReeFocus AI Technology

AI summary: This paper proposes VECTOR-DRIVE, a framework that uses tightly coupled vision-language and trajectory expert routing to address the coupling trade-off between vision-language understanding and trajectory prediction in end-to-end autonomous driving, achieving higher task performance.

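AI-generated abstract (translated from Chinese)

End-to-end autonomous driving requires a model to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining but still face a coupling trade-off: a fully shared backbone preserves multimodal interaction but may entangle language reasoning with trajectory prediction, while a decoupled reasoning-action pipeline reduces task conflict but weakens semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves a Driving Score of 88.91 and outperforms representative end-to-end and VLA baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action decoding.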

English abstract

End-to-end autonomous driving requires models to understand traffic scenes, infer driving intent, and generate executable motion plans. Recent vision-language-action (VLA) models inherit semantic priors from large-scale vision-language pretraining, yet still face a coupling trade-off: fully shared backbones preserve multimodal interaction but may entangle language reasoning and trajectory prediction, whereas decoupled reasoning-action pipelines reduce task conflict but weaken semantic-motion coupling. We propose VECTOR-DRIVE, a tightly coupled VLA framework built on Qwen2.5-VL-3B. VECTOR-DRIVE keeps all tokens coupled through shared self-attention and routes feed-forward computation according to token semantics. Vision and language tokens are processed by a Vision-Language Expert to preserve semantic priors, while target-point, ego-state, and noisy action tokens are routed to a Trajectory Expert for motion-specific computation. On the action-token pathway, a flow-matching planner refines noisy action tokens into future waypoints and speed profiles. This design couples semantic reasoning and motion planning within a single multimodal Transformer while separating task-specific FFN computation. On Bench2Drive, VECTOR-DRIVE achieves 88.91 Driving Score and outperforms representative end-to-end and VLA-based baselines. Qualitative results and ablations further validate the benefits of shared attention, semantic-aware expert routing, progressive training, and flow-based action decoding.

2605.08696 2026-05-20 cs.CL cs.LG

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

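Benjamin L. Badger

Institutions: IBM

AI summary: This paper proposes the Structured Recurrent Mixer, an architecture that converts algebraically between a sequence-parallel representation at training time and a recurrent representation at inference time, improving training efficiency, input information capacity, and inference throughput without relying on specialized kernels or device-specific memory management.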

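AI-generated abstract (translated from Chinese)

Over the past two decades, language modeling has shifted from predominantly recurrent architectures, which process tokens sequentially during both training and inference, to non-recurrent models, which process sequence elements in parallel during training. The latter improve training efficiency and stability at the cost of lower inference throughput. This paper introduces the Structured Recurrent Mixer (SRM), an architecture that converts algebraically between a sequence-parallel representation at training time and a recurrent representation at inference time, notably without specialized kernels or device-specific memory management. We show experimentally that this dual representation offers advantages over other linear-complexity models in training efficiency, input information capacity, and inference throughput and concurrency. We postulate that recurrent models are poorly suited to scaling sequence length for information-rich inputs such as language, but are well suited to scaling along the sample (batch) dimension because their memory requirement per sample is constant. We provide Mojo/MAX inference implementations of SRMs with 12x the throughput and 170x the concurrency of similarly powerful Transformers served on vLLM, increases characteristic of PyTorch implementations resulting in a 30% increase in compute-constant GSM8k Pass@k. Finally, we demonstrate that SRMs are effective candidates for reinforcement learning training.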

English abstract

Over the last two decades, language modeling has experienced a shift from the use of predominantly recurrent architectures that process tokens sequentially during training and inference to non-recurrent models that process sequence elements in parallel during training, which results in greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer, an architecture that allows for algebraic conversion between a sequence-parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear-complexity models. We postulate that recurrent models are poorly suited to extended sequence length scaling for information-rich inputs typical of language, but are well suited to scaling in the sample (batch) dimension due to their constant memory per sample. We provide Mojo/MAX inference implementations of SRMs exhibiting 12x the throughput and 170x the concurrency of similarly powerful Transformers inferenced on vLLM, increases characteristic of PyTorch implementations resulting in a 30% increase in compute-constant GSM8k Pass@k. We conclude by demonstrating that SRMs are effective reinforcement learning training candidates.

2605.08391 2026-05-20 cs.LG

SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning

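Nikunj Gupta, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna

Institutions: University of Southern California; DEVCOM Army Research Laboratory; DEVCOM Army Research Office

AI summary: This paper proposes SACHI, which achieves structured agent coordination in multi-agent reinforcement learning through holistic information integration. It addresses the information bottleneck that agents face when coordinating actions from partial local observations by enriching each agent's representation with graph transformer convolutions over an inter-agent coordination graph, and performs strongly across multiple tasks.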

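AI-generated abstract (translated from Chinese)

In cooperative multi-agent reinforcement learning, agents that act on partial local observations face a fundamental information bottleneck: the knowledge needed to select jointly optimal actions is scattered across the team, yet each agent must decide without access to its teammates' observations, intentions, or chosen actions. Existing methods ignore this bottleneck, compress it into a scalar mixing signal, or route around it with learned communication channels. Treating action coordination as a problem of structured information integration among agents, we propose structured agent coordination via holistic information integration (SACHI), in which graph transformer convolutions over an inter-agent coordination graph enrich each agent's representation with receiver-sensitive, content-dependent signals from teammates before action selection. We evaluate SACHI on five cooperative tasks covering spatial, communicative, and adversarial coordination challenges, against twelve baselines. SACHI matches or outperforms the best baseline on every task, and rigorous aggregate statistical analyses, including normalized metrics with bootstrap confidence intervals, Friedman ranking, and performance profiling, confirm that this advantage is statistically significant, robust, and not attributable to increased model capacity. Parameter-matched ablations further trace the source of the gains to a single architectural property: the degree of content dependence in the message-passing operator.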

English abstract

Cooperative multi-agent reinforcement learning agents that act on partial local observations face a fundamental information bottleneck: the knowledge needed to select jointly optimal actions is scattered across the team, yet each agent must commit to a decision without access to its teammates' observations, intentions, or chosen actions. Existing methods either ignore this bottleneck, compress it into a scalar mixing signal, or route around it with learned communication channels. Framing action coordination as a problem of structured information integration among agents, we propose structured agent coordination via holistic information integration, or SACHI, in which graph transformer convolutions over an inter-agent coordination graph enrich each agent's representation with receiver-sensitive, content-dependent signals from teammates prior to action selection. We evaluate SACHI across five cooperative tasks spanning spatial, communicative, and adversarial coordination challenges against twelve baselines. SACHI consistently matches or outperforms the best baseline on every task, and rigorous aggregate statistical analyses, including normalized metrics with bootstrap confidence intervals, Friedman ranking, and performance profiling, confirm that this advantage is statistically significant, robust across environments, and not attributable to increased model capacity. Parameter-matched ablations further trace the source of the gains to a single architectural property: the degree of content-dependence in the message-passing operator.

2605.08143 2026-05-20 cs.LG cs.AI

HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing

Yuan Fang, Yi Xie, Xuming Ran

Institutions: IXL Learning, Inc.; Technical University of Munich; National University of Singapore

AI summary: This paper proposes HoReN, a codebook-based parameter-preserving editor that wraps a single MLP layer with a discrete key-value memory, enabling efficient retrieval and updating in large-scale sequential model editing and performing strongly on a range of benchmarks.

Comments: 30 pages, 10 figures

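AI-generated abstract (translated from Chinese)

Large language models encode a large amount of factual knowledge, but that knowledge can become outdated or incorrect after deployment, and retraining is prohibitively expensive. This motivates lifelong model editing, which updates targeted behavior while preserving the rest of the model. Existing editors, whether parameter-modifying or parameter-preserving, degrade severely as edits accumulate and struggle to generalize across paraphrases. We propose HoReN, a codebook-based parameter-preserving editor that wraps a single MLP layer with a discrete key-value memory. HoReN treats each codebook entry as both a knowledge key and a stored Hopfield pattern, retrieves edits by angular similarity on the unit hypersphere, and refines queries through damped Hopfield dynamics so that paraphrases converge to the correct memory basin while unrelated inputs remain stable. HoReN shows strong editing performance across diverse benchmarks, including standard ZsRE, structured WikiBigEdit, and unstructured UnKE evaluations. Moreover, HoReN scales to 50,000 sequential edits on ZsRE with overall performance consistently above 0.93, whereas prior editors collapse or degrade severely before reaching 10,000 edits. Our code is available at https://github.com/ha11ucin8/HoReN.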

English abstract

Large language models encode vast factual knowledge that can become outdated or incorrect after deployment, yet retraining is prohibitively costly. This motivates lifelong model editing, which updates targeted behavior while preserving the rest of the model. Existing editors, both parameter-modifying and parameter-preserving, degrade severely as edits accumulate and struggle to generalize across paraphrases. We propose HoReN, a codebook-based parameter-preserving editor that wraps a single MLP layer with a discrete key-value memory. HoReN treats each codebook entry as both a knowledge key and a Hopfield stored pattern, retrieves edits by angular similarity on the unit hypersphere, and refines queries through damped Hopfield dynamics so paraphrases converge to the correct memory basin while unrelated inputs remain stable. HoReN achieves strong editing performance with consistent gains across diverse benchmarks spanning standard ZsRE, structured WikiBigEdit, and unstructured UnKE evaluations. Moreover, HoReN scales to 50K sequential edits on ZsRE with stable overall performance above 0.93, while prior editors collapse or degrade severely before reaching 10K. Our code is available at https://github.com/ha11ucin8/HoReN.
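The retrieval step described above (angular similarity on the unit hypersphere, followed by damped Hopfield refinement of the query) can be pictured with a minimal sketch. Everything here is an assumption for illustration: the softmax-weighted update, the `beta`, `damping`, and `steps` parameters, and the function names are hypothetical and are not taken from the HoReN code.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hopfield_refine(query, keys, beta=8.0, damping=0.5, steps=3):
    """Hypothetical damped Hopfield update; returns (index of nearest key, refined query)."""
    q = normalize(query)
    keys = [normalize(k) for k in keys]
    for _ in range(steps):
        sims = [dot(q, k) for k in keys]            # cosine similarity to each stored key
        m = max(sims)
        w = [math.exp(beta * (s - m)) for s in sims]  # numerically stable softmax weights
        z = sum(w)
        target = [sum(wi / z * k[d] for wi, k in zip(w, keys)) for d in range(len(q))]
        # Damping: move only part of the way toward the Hopfield target, then renormalize.
        q = normalize([(1 - damping) * a + damping * b for a, b in zip(q, target)])
    best = max(range(len(keys)), key=lambda i: dot(q, keys[i]))
    return best, q

best, refined = hopfield_refine([0.9, 0.3], [[1.0, 0.0], [0.0, 1.0]])
print(best)
```

In this toy setting a query that starts near the first key is pulled further toward it with each step, which is the "paraphrases converge to the correct memory basin" behavior in miniature.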

2605.07721 2026-05-20 cs.CL cs.AI cs.LG

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

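Victor Conchello Vendrell, Arnau Padres Masdemont, Niccolò Grillo, Jordi Ros-Giralt, Arash Behboodi, Fabio Valerio Massoli

Institutions: Qualcomm AI Research

AI summary: This paper proposes the Memory-Efficient Looped Transformer (MELT), which decouples reasoning depth from memory consumption to achieve constant-memory iterative reasoning while preserving LoopLM performance, using only a lightweight post-training procedure.

Comments: 22 pages, 5 figures, 11 tables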

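AI-generated abstract (translated from Chinese)

Recurrent large language model (LLM) architectures have emerged as a promising way to improve reasoning, because they can perform multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro reason by iteratively updating internal representations while keeping a standard key-value (KV) cache at every iteration, so memory consumption grows linearly with reasoning depth. Increasing the number of reasoning iterations can therefore make memory usage prohibitive, limiting the practical scalability of such architectures. In this work, we propose the Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of a standard KV cache per layer and per loop, MELT maintains a single KV cache in each layer that is shared across reasoning loops. This cache is updated over time through a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose training with chunk-wise training in a two-phase procedure: interpolated transition followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while keeping a memory footprint comparable to those models and far smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.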

English abstract

Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two-phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.
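The idea of one cache per layer, shared across loops and updated through a gate, can be pictured with a minimal sketch. It assumes a convex-combination update, cache ← g · cache + (1 − g) · fresh, with a fixed scalar gate logit. The abstract only says the gate is learnable, so the update rule, `gated_update`, and `gate_logit` are hypothetical illustrations rather than MELT's actual formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_update(cache, fresh, gate_logit):
    """Blend the shared cache with this loop's values (assumed update rule)."""
    g = sigmoid(gate_logit)  # in a trained model this gate would be learned, not fixed
    return [g * c + (1.0 - g) * f for c, f in zip(cache, fresh)]

# One cache buffer for the layer, reused by every reasoning loop.
cache = [0.0, 0.0, 0.0, 0.0]
for loop in range(3):
    fresh = [float(loop + 1)] * 4  # stand-in for the keys/values computed in this loop
    cache = gated_update(cache, fresh, gate_logit=0.0)

print(len(cache), cache[0])
```

The buffer keeps the same size however many loops run, which is the constant-memory property the abstract describes; a per-loop cache would instead grow by one buffer per iteration.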

2605.07379 2026-05-20 cs.CV cs.AI

RELO: Reinforcement Learning to Localize for Visual Object Tracking

Xin Chen, Chuanyu Sun, Jiao Xu, Houwen Peng, Dong Wang, Huchuan Lu, Kede Ma

Institutions: City University of Hong Kong; Hunyuan Team, Tencent; Dalian University of Technology

AI summary: This paper proposes RELO, which models target localization as a Markov decision process and uses reinforcement learning in place of conventional handcrafted spatial priors to improve tracking performance and consistency.

Comments: ICML 2026 paper

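AI-generated abstract (translated from Chinese)

Conventional visual object trackers usually localize targets with handcrafted spatial priors such as heatmaps, but these priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics such as intersection over union (IoU) and area under the success curve (AUC). This paper introduces RELO, a reinforcement-learning-to-localize method for visual object tracking that models target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a policy over spatial positions learned by reinforcement learning, with rewards that combine frame-level IoU and sequence-level AUC. We also introduce layer-aligned temporal token propagation to improve semantic consistency across frames at negligible computational cost. Across multiple benchmarks, RELO achieves superior performance, reaching 57.5% AUC on LaSOT_ext without template updates. This confirms that reward-driven localization is an effective alternative for visual object tracking.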

English abstract

Conventional visual object trackers localize targets using handcrafted spatial priors, often in the form of heatmaps. Such priors provide only surrogate supervision and are poorly aligned with tracking optimization and evaluation metrics, such as intersection over union (IoU) and area under the success curve (AUC). Here, we introduce RELO, a REinforcement-learning-to-LOcalize method for visual object tracking that formulates target localization as a Markov decision process. Specifically, RELO replaces handcrafted spatial priors with a localization policy learned over spatial positions via reinforcement learning, with rewards combining frame-level IoU and sequence-level AUC. We additionally introduce layer-aligned temporal token propagation to improve semantic consistency across frames, with negligible computational overhead. Across multiple benchmarks, RELO achieves superior results, attaining 57.5% AUC on LaSOT_ext without template updates. This confirms that reward-driven localization provides an effective alternative to prior-driven localization for visual object tracking.

2605.07066 2026-05-20 cs.AI

2.5-D Decomposition for LLM-Based Spatial Construction

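Paul Whitten, Li-Jen Chen, Sharath Baddam

Institutions: GitHub

AI summary: This paper proposes a neuro-symbolic pipeline based on 2.5-D decomposition, in which the LLM plans in the two-dimensional horizontal plane while a deterministic executor computes vertical placement, eliminating an entire class of errors and improving the accuracy of spatial construction.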

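AI-generated abstract (translated from Chinese)

Autonomous systems need reliable spatial reasoning to build structures from natural-language instructions, yet large language models (LLMs) make systematic coordinate errors when generating three-dimensional block placements. This paper presents a neuro-symbolic pipeline based on 2.5-D decomposition: the LLM plans in the two-dimensional horizontal plane while a deterministic executor computes all vertical placement from column occupancy, eliminating an entire class of errors. On the Build What I Mean benchmark (160 rounds), GPT-4o-mini achieves 94.6% mean structural accuracy across 12 independent runs, close to the 97.6% ceiling set by architect-agent errors, and outperforms both GPT-4o (90.3%) and the best competing system (76.3%). A controlled ablation confirms that 2.5-D decomposition is the dominant contributor, accounting for 50.7 percentage points of accuracy. The pipeline transfers directly to edge hardware: Nemotron-3 120B running locally on an NVIDIA Jetson Thor AGX matches the cloud result at 94.5% with no prompt modifications. The underlying principle, removing deterministic dimensions from the LLM's output space, applies to any autonomous construction or assembly task in which gravity or other physical constraints fix one or more degrees of freedom. A transfer experiment on 500 IGLU collaborative building tasks confirms that the effect extends beyond the primary benchmark.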

English abstract

Autonomous systems that build structures from natural-language instructions need reliable spatial reasoning, yet large language models (LLMs) make systematic coordinate errors when generating three-dimensional block placements. We present a neuro-symbolic pipeline based on 2.5-D decomposition: the LLM plans in the two-dimensional horizontal plane while a deterministic executor computes all vertical placement from column occupancy, eliminating an entire class of errors. On the Build What I Mean benchmark (160 rounds), GPT-4o-mini with this pipeline achieves 94.6% mean structural accuracy across 12 independent runs, within 3.0 percentage points of the 97.6% ceiling imposed by architect-agent errors that no builder-side improvement can address. This outperforms both GPT-4o at 90.3% and the best competing system at 76.3%. A controlled ablation confirms that 2.5-D decomposition is the dominant contributor, accounting for 50.7 percentage points of accuracy. The pipeline transfers directly to edge hardware: Nemotron-3 120B running locally on an NVIDIA Jetson Thor AGX matches the cloud result at 94.5% with no prompt modifications. The underlying principle, removing deterministic dimensions from the LLM's output space, applies to any autonomous construction or assembly task where gravity or other physical constraints fix one or more degrees of freedom. A transfer experiment on 500 IGLU collaborative building tasks confirms the effect generalizes beyond the primary benchmark.
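The split between 2-D planning and deterministic vertical placement can be pictured with a minimal sketch. It assumes the planner emits an ordered list of horizontal cells and that each new block rests on top of whatever is already in its column. The coordinate convention (y as height), the `color` field, and the name `place_blocks` are hypothetical and are not taken from the paper's pipeline.

```python
from collections import defaultdict

def place_blocks(plan):
    """Turn an ordered 2-D plan of (x, z, color) into 3-D placements (x, y, z, color)."""
    column_height = defaultdict(int)  # (x, z) -> number of blocks already stacked there
    placed = []
    for x, z, color in plan:
        y = column_height[(x, z)]     # next free level in this column
        placed.append((x, y, z, color))
        column_height[(x, z)] += 1
    return placed

# The planner never states a height; stacking order alone determines it.
plan = [(0, 0, "red"), (0, 0, "blue"), (1, 0, "red")]
print(place_blocks(plan))
```

Because the height is derived from column occupancy, the planner cannot place a block in mid-air or inside another block, which is the class of error the abstract says the decomposition removes.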

2605.06546 2026-05-20 cs.CL

Efficient Pre-Training with Token Superposition

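Bowen Peng, Théo Gigant, Jeffrey Quesnelle

Institutions: Nous Research

AI summary: This paper proposes Token-Superposition Training (TST), a method that raises data throughput in pre-training without modifying the model architecture, yielding higher efficiency and performance in large-scale pre-training.

Comments: 25 pages, 11 figures, 28 tables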

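AI-generated abstract (translated from Chinese)

Pre-training large language models is often prohibitively expensive and inefficient at scale, and achieving high data throughput requires complex, invasive modifications. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves data throughput per FLOP during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST has two phases: (i) a highly efficient superposition phase, in which we combine many contiguous tokens into one bag and train with a multi-hot cross-entropy (MCE) objective; and (ii) a recovery phase, in which we return to standard training. We evaluate TST extensively at the 270M and 600M parameter scales and validate it on a 3B model and a 10B A1B mixture-of-experts model, showing that it is highly robust across settings. Ultimately, TST outperforms the baseline in both loss and downstream evaluations, and under equal-loss settings it yields a 2.5x reduction in pre-training time at the 10B A1B scale.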

English abstract

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOP during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) a highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on a 3B model and a 10B A1B mixture-of-experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms the baseline in loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
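A multi-hot cross-entropy over a bag of tokens can be pictured with a minimal sketch. The abstract does not define the MCE objective, so this assumes one plausible form: the average negative log-probability of every token in the target bag under a single softmax. The normalization by bag size and the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def multi_hot_cross_entropy(logits, bag):
    """Assumed MCE: mean negative log-likelihood of the token ids in `bag`."""
    logp = log_softmax(logits)
    return -sum(logp[t] for t in bag) / len(bag)

# Vocabulary of 4 tokens, uniform prediction, target bag holding tokens 0 and 2.
loss = multi_hot_cross_entropy([0.0, 0.0, 0.0, 0.0], bag=[0, 2])
print(loss)
```

With a bag of size one this reduces to ordinary next-token cross-entropy, which is consistent with the recovery phase returning to standard training.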

2605.06270 2026-05-20 cs.CV

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

Zecheng Tang, Jiaye Fu, Qiankun Gao, Haijie Li, Yanmin Wu, Jiaqi Zhang, Siwei Ma, Jian Zhang

Institutions: School of Electronic and Computer Engineering, Peking University; School of Computer Science, Peking University

AI summary: This paper proposes Spark3R, a framework that uses asymmetric token reduction to accelerate feed-forward 3D reconstruction models without retraining, achieving up to a 28x speedup while maintaining high-quality reconstruction.

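AI-generated abstract (translated from Chinese)

Feed-forward 3D reconstruction models based on Vision Transformers can estimate scene geometry and camera poses directly from a small number of input images, but extending them to video inputs with hundreds or thousands of frames remains challenging because of the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence inside the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this paper, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples compression by assigning distinct reduction factors to query tokens and key-value tokens, applying intra-group token merging to query tokens and lightweight token pruning to key-value tokens. In addition, Spark3R adapts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework that requires no retraining, Spark3R integrates directly into several pretrained feed-forward 3D reconstruction models, including VGGT, π³, Depth-Anything-3, and VGGT-Ω, and achieves up to a 28x speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.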

English abstract

Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, π³, Depth-Anything-3, and VGGT-Ω, and achieves up to 28× speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.

2605.05480 2026-05-20 cs.LG cs.AI stat.ML

GRALIS: A Unified Canonical Framework for Linear Attribution Methods via Riesz Representation

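Raimondo Fanale

Institutions: Universitas Mercatorum

AI summary: This paper proposes GRALIS, a framework that unifies linear attribution methods through Riesz representation theory. It provides seven formal theorems covering the canonical form, completeness, convergence, Shapley interaction values, Hoeffding ANOVA decomposition, Sobol sensitivity generalization, and a multi-scale extension, and reports preliminary validation results on medical images.

Comments: 25 pages, 6 tables, 2 figures. Theoretical framework with preliminary experimental validation on BreaKHis (1,187 images, DenseNet-121). Extended empirical comparison in preparation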

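AI-generated abstract (translated from Chinese)

The main XAI attribution methods for deep neural networks (GradCAM, SHAP, LIME, Integrated Gradients) rest on different theoretical foundations and cannot be formally compared. We present GRALIS (Gradient-Riesz Averaged Locally-Integrated Shapley), a mathematical framework that establishes a representation theory for attributions: every additive, linear, and continuous attribution functional on L^2(Q, μ) has a unique canonical representation (Q, w, Δ), whose necessity is proved by the Riesz Representation Theorem. This class includes SHAP, IG, LIME, and linearized GradCAM, but excludes nonlinear functionals such as standard GradCAM or attention maps. Seven formal theorems provide simultaneous guarantees that no single method offers: (T1) necessary canonical form; (T2) exact completeness; (T3) Monte Carlo convergence O(1/√m) + O(1/k); (T4) exact Shapley Interaction Values; (T5) Hoeffding ANOVA decomposition; (T6) Sobol sensitivity generalization; (T7) a multi-scale extension (MS-GRALIS) with minimum-variance weights. An algebraic appendix proves the GRALIS-SIV correspondence via the Möbius transform without circular reasoning. GRALIS satisfies 13.5/14 axiomatic properties, compared with 2.5-6/14 for individual methods, including completeness, sensitivity, locality, order-k interactions, and optimal multi-scale aggregation. Preliminary validation on BreaKHis (1,187 histology images, DenseNet-121) reports deletion faithfulness AUC +0.015 (malignant), 96% class-conditional consistency, SAL = 0.762 ± 0.109, and a sparsity index of 0.39. An extended comparison with baseline XAI methods is planned for a companion paper.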

English abstract

The main XAI attribution methods for deep neural networks (GradCAM, SHAP, LIME, Integrated Gradients) operate on separate theoretical foundations and are not formally comparable. We present GRALIS (Gradient-Riesz Averaged Locally-Integrated Shapley), a mathematical framework establishing a representation theory for attributions: every additive, linear, and continuous attribution functional on L^2(Q, μ) admits a unique canonical representation (Q, w, Δ), proved necessary by the Riesz Representation Theorem. This class encompasses SHAP, IG, LIME and linearized GradCAM, but excludes nonlinear functionals such as standard GradCAM or attention maps. Seven formal theorems provide simultaneous guarantees absent in any individual method: (T1) necessary canonical form; (T2) exact completeness; (T3) Monte Carlo convergence O(1/√m) + O(1/k); (T4) exact Shapley Interaction Values; (T5) Hoeffding ANOVA decomposition; (T6) Sobol sensitivity generalization; (T7) multi-scale extension (MS-GRALIS) with minimum-variance weights. An algebraic appendix justifies the GRALIS-SIV correspondence via the Möbius transform without circularity. GRALIS satisfies 13.5/14 axiomatic properties vs. 2.5-6/14 for individual methods, including completeness, sensitivity, locality, order-k interactions and optimal multi-scale aggregation simultaneously. Preliminary validation on BreaKHis (1,187 histology images, DenseNet-121) reports deletion faithfulness AUC +0.015 (malignant), 96% class-conditional consistency, SAL = 0.762 ± 0.109 and sparsity index 0.39. Extended comparison with baseline XAI methods is planned for a companion paper.

2605.04525 2026-05-20 cs.RO

HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Tasks

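Nandiraju Gireesh, Yuanliang Ju, Chaoyi Xu, Weiheng Liu, Yuxuan Wan, He Wang

Institutions: Peking University; University of Toronto

AI summary: This paper proposes HDFlow, a new hierarchical planning framework that combines the strengths of diffusion and rectified flow models to overcome the limitations of single-paradigm generative planners, and validates it on furniture assembly tasks in simulation and in the real world.

Comments: ICML 2026 (Spotlight)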

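AI-generated abstract (translated from Chinese)

Recent advances in generative models have shown promise for generating behavior plans in long-horizon, sparse-reward tasks. Although these approaches have achieved promising results, they usually lack a principled framework for hierarchical decomposition and, because of their iterative denoising process, struggle with the computational demands of real-time execution. In this paper, we introduce Hierarchical Diffusion-Flow (HDFlow), a novel hierarchical planning framework that makes optimal use of the strengths of diffusion and rectified flow models to overcome the limitations of single-paradigm generative planners. HDFlow uses a high-level diffusion planner to generate sequences of strategic subgoals in a learned latent space, drawing on diffusion's strong exploratory capability. These subgoals then guide a low-level rectified flow planner that generates smooth, dense trajectories, exploiting the speed and efficiency of trajectory generation based on ordinary differential equations (ODEs). We evaluate HDFlow on four challenging furniture assembly tasks, in both simulation and the real world, where it significantly outperforms state-of-the-art methods. We also demonstrate the method's generalizability on two long-horizon benchmarks that include diverse locomotion and manipulation tasks. Project website: https://hdflow-page.github.io/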

English abstract

Recent advances in generative models have shown promise in generating behavior plans for long-horizon, sparse-reward tasks. While these approaches have achieved promising results, they often lack a principled framework for hierarchical decomposition and struggle with the computational demands of real-time execution, due to their iterative denoising process. In this work, we introduce Hierarchical Diffusion-Flow (HDFlow), a novel hierarchical planning framework that optimally leverages the strengths of diffusion and rectified flow models to overcome the limitations of single-paradigm generative planners. HDFlow employs a high-level diffusion planner to generate sequences of strategic subgoals in a learned latent space, capitalizing on diffusion's powerful exploratory capabilities. These subgoals then guide a low-level rectified flow planner that generates smooth and dense trajectories, exploiting the speed and efficiency of ordinary differential equation (ODE)-based trajectory generation. We evaluate HDFlow on four challenging furniture assembly tasks in both simulation and the real world, where it significantly outperforms state-of-the-art methods. Furthermore, we showcase our method's generalizability on two long-horizon benchmarks comprising diverse locomotion and manipulation tasks. Project website: https://hdflow-page.github.io/

2605.02223 2026-05-20 cs.SD cs.CV

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

迈向细粒度语音修补取证:一个多区域篡改定位的数据集、方法和度量标准

Tung Vu, Yen Nguyen, Hai Nguyen, Cuong Pham, Cong Tran

Institutions Posts and Telecommunications Institute of Technology

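AI summary This paper proposes the MIST dataset, the ISA method, and the SF1@tau metric for detecting multi-region speech inpainting, and shows that existing deepfake detectors fall short on fine-grained speech inpainting.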

详情
AI中文摘要

近年来,语音克隆和文本到语音合成技术的进步使部分语音操纵——即攻击者在语音中替换几个词以改变其含义同时保持说话者身份——成为一种日益现实的威胁。现有音频深度伪造检测基准主要集中在句级二元分类或单区域篡改,无法检测和定位未知数量的多区域修补内容。我们通过三个贡献填补这一空白:首先,我们引入MIST(多区域修补语音篡改),一个覆盖6种语言、每句包含1-3个独立修补词级段的大型多语言数据集,通过LLM引导的语义替换和神经语音克隆生成,其中虚假内容仅占每句的2-7%。其次,我们提出了ISA(迭代段分析),一种与backbone无关的框架,通过粗到细的滑动窗口分类,结合容差区域提议和边界细化,无需先验知识即可恢复所有篡改区域。第三,我们定义了SF1@tau,一个基于时间IoU匹配的段级F1度量标准,联合评估区域计数准确性和定位精度。零样本评估显示,细粒度语音修补仍无法被现有深度伪造检测器解决:句级分类器在完全合成语音上对MIST句的伪造概率接近零,而ISA在这一具有挑战性的设置中始终优于非迭代基线,且数据集、代码和评估工具包已公开发布。

English abstract

Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation, in which an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity, an increasingly realistic threat. Existing audio deepfake detection benchmarks focus on utterance-level binary classification or single-region tampering, leaving a critical gap in detecting and localizing multiple inpainted segments whose count is unknown a priori. We address this gap with three contributions. First, we introduce MIST (Multiregion Inpainting Speech Tampering), a large-scale multilingual dataset spanning 6 languages with 1-3 independently inpainted word-level segments per utterance, generated via LLM-guided semantic replacement and neural voice cloning, with fake content constituting only 2-7% of each utterance. Second, we propose ISA (Iterative Segment Analysis), a backbone-agnostic framework that performs coarse-to-fine sliding-window classification with gap-tolerant region proposal and boundary refinement to recover all tampered regions without prior knowledge of their count. Third, we define SF1@tau, a segment-level F1 metric based on temporal IoU matching that jointly evaluates region count accuracy and localization precision. Zero-shot evaluation reveals that partial inpainting at word granularity remains unsolved by existing deepfake detectors: utterance-level classifiers trained on fully synthesized speech assign near-zero fake probability to MIST utterances where only 2-7% of content is manipulated. ISA consistently outperforms non-iterative baselines in this challenging setting, and the dataset, code, and evaluation toolkit are publicly released.
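To make the SF1@tau description concrete, here is a minimal sketch of a segment-level F1 computed from temporal IoU matching. The abstract does not specify the matching procedure, so the greedy one-to-one matching below, the handling of empty inputs, and the names `temporal_iou` and `segment_f1` are assumptions for illustration rather than the paper's definition.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def segment_f1(pred: List[Segment], gold: List[Segment], tau: float) -> float:
    """Segment-level F1: a prediction counts as a hit if it is matched
    one-to-one to a gold segment with IoU >= tau (greedy, best IoU first)."""
    if not pred and not gold:
        return 1.0  # assumed convention: nothing to find, nothing predicted
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gold)),
        reverse=True,
    )
    used_pred, used_gold = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < tau:
            break  # remaining pairs are below the threshold
        if i in used_pred or j in used_gold:
            continue
        used_pred.add(i)
        used_gold.add(j)
        true_pos += 1
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(1.0, 1.5), (3.0, 3.4)]
pred = [(1.0, 1.4), (3.2, 3.4), (5.0, 5.2)]
loose = segment_f1(pred, gold, tau=0.4)   # two hits, one spurious region
strict = segment_f1(pred, gold, tau=0.7)  # only the first segment is tight enough
```

A spurious extra region lowers precision and a missed region lowers recall, so a single score of this kind penalizes both a wrong region count and loose boundaries.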

2605.00578 2026-05-20 cs.CV

Federated Distillation for Whole Slide Image via Gaussian-Mixture Feature Alignment and Curriculum Integration

通过高斯混合特征对齐和课程整合实现全切片图像的联邦蒸馏

Luru Jing, Cong Cong, Yanyuan Chen, Yongzhi Cao

Institutions School of Computer Science, Peking University, Beijing, China; Center for Health Informatics, Australian Institute of Health Innovation, Macquarie University, Sydney, NSW 2113, Australia; School of Data Science, University of Virginia, Charlottesville, VA, USA

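AI summary This paper proposes FedHD, a federated learning framework for whole slide image analysis that combines Gaussian-mixture feature alignment with curriculum-based integration; clients locally distill semantically rich synthetic feature representations, improving performance while preserving diagnostic diversity.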

Comments Accepted by ICML 2026, Camera-Ready version updated

详情
AI中文摘要

联邦学习(FL)提供了一个有前景的框架,用于通过跨机构进行模型训练来实现协作数字病理学。然而,现实部署面临异质性问题,源于不同机构中多样化的多实例学习(MIL)架构和异构特征提取器。我们提出FedHD,一种新的FL框架,通过针对WSI分析进行本地高斯混合特征对齐。不同于交换模型参数,每个客户端独立地蒸馏语义丰富的合成特征表示,这些表示与真实WSI的分布对齐。为保持诊断多样性,FedHD采用一对一蒸馏策略,为每个真实切片生成一个合成对应物,以避免过度压缩。在联邦过程中,采用基于课程的整合策略,一旦性能达到平台期,逐步将跨站点的合成特征整合到本地训练中。此外,一个可选的解释模块从合成嵌入中重建伪块,提高透明度。FedHD是架构无关的、隐私保护的,并支持在不同机构之间进行个性化但协作的训练。在TCGA-IDH、CAMELYON16和CAMELYON17上的实验表明,FedHD在联邦和蒸馏基线中表现一致优于最先进的方法。

English abstract

Federated learning (FL) offers a promising framework for collaborative digital pathology by enabling model training across institutions. However, real-world deployments face heterogeneity arising from diverse multiple instance learning (MIL) architectures and heterogeneous feature extractors across institutions. We propose FedHD, a novel FL framework that performs local Gaussian-mixture feature alignment tailored for WSI analysis. Instead of exchanging model parameters, each client independently distills semantically rich synthetic feature representations aligned with the distribution of real WSIs. To preserve diagnostic diversity, FedHD adopts a one-to-one distillation strategy, generating a synthetic counterpart for each real slide to avoid over-compression. During federation, a curriculum-based integration strategy progressively incorporates cross-site synthetic features into local training once performance plateaus. Furthermore, an optional interpretation module reconstructs pseudo-patches from synthetic embeddings, enhancing transparency. FedHD is architecture-agnostic, privacy-preserving, and supports personalized yet collaborative training across diverse institutions. Experiments on TCGA-IDH, CAMELYON16, and CAMELYON17 show that FedHD consistently outperforms state-of-the-art federated and distillation baselines.

2605.00333 2026-05-20 cs.LG cs.CL

Borrowed Geometry: Cross-Distribution Head-Importance Fingerprints of Frozen Pretrained Gemma 4 31B

借来的几何:冻结预训练Gemma 4 31B的跨分布注意力头重要性指纹

Abay Bektursun

Institutions Independent research

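AI summary This paper studies cross-distribution head-importance fingerprints of a frozen, text-pretrained Gemma 4 31B. By measuring per-head impact across several tasks, it identifies specific attention heads that rank as important on multiple tasks and reports head-level causal evidence on one cross-modality target.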

Comments v2: Added head-level causal ablation on OGBench cube-task1 (n=30, 3.2x specificity; n=5 paired-t p=0.039) and full L26 sweep. New sections on honest negatives (activation patching null, sufficiency null, within-layer Spearman wrong-direction). Multiplicity-aware permutation null V4 P=0.013. Title and framing updated. 25 pages (13 main), 10 figures

详情
AI中文摘要

仅在文本上预训练的Gemma 4 31B权重在冻结且未经修改的情况下,通过一个薄的可训练接口迁移到其从未处理过的非文本模态。在L24-L29切片(192个注意力头)上,一个英语文本TxtCopy注意力探针(95个句子)与各注意力头在四个非语言词元模式任务(二进制复制、联想回忆、一维元胞自动机规则90、二进制加法)上的消融影响,共同将四个头(L26.28、L27.28、L27.2、L27.3)判定为在两种信号上均处于顶级。切片级别的联合重合在超几何零假设下显著(P=0.0013,N=192,K=38,n=4),并通过了考虑多重性的置换检验(P_V4=0.013)。预训练的Gemma L26在OGBench cube-double-play-task1上达到60.22%,而随机初始化的Gemma约为1%(n=3时提升59个百分点);采用正确1/√d_k缩放的FrozenRandom-GPT2对照同样失败。头级因果验证:在训练好的cube-task1 IQL智能体中将L26.28置零,成功率从63.3%降至10.0%,而层匹配的低TxtCopy负对照降至46.7%(n=30时特异性为3.2倍;n=5配对t检验p=0.039)。完整的L26扫描将L26.28排在32个头中的第4位。如实报告的阴性结果:L26内Spearman ρ(TxtCopy, drop)=+0.37(与层内因果解读的预期方向相反);单头激活修补不能迁移匹配变量;仅凭四个被点名的头不足以完成任何任务;Walker2d-DT和scene-task1调用了上述命名切片之外的L24,且头消融特异性为零。我们将本文贡献定位为切片级别的跨分布重要性指纹,外加一个跨模态目标上的头级因果证据。

English abstract

Frozen Gemma 4 31B weights pretrained exclusively on text, unmodified, transfer through a thin trainable interface to non-text modalities the substrate has never processed. On the L24--L29 slice (192 attention heads), an English-text TxtCopy attention probe (95 sentences) and per-head ablation impact on four non-language token-pattern tasks (binary copy, associative recall, 1D cellular automaton Rule 90, binary addition) jointly classify four heads -- L26.28, L27.28, L27.2, L27.3 -- as top-tier on both signals. The slice-level joint coincidence is significant under hypergeometric null ($P = 0.0013$, $N=192$, $K=38$, $n=4$) and survives multiplicity-aware permutation tests ($P_{V4} = 0.013$). Pretrained Gemma L26 reaches 60.22% on OGBench cube-double-play-task1 vs ~1% for random-init Gemma ($+59$pt at $n=3$); a FrozenRandom-GPT2 control with correct $1/\sqrt{d_k}$ scaling also fails. Head-level causal validation: zeroing L26.28 in the trained cube-task1 IQL agent drops success $63.3\% \to 10.0\%$ vs $46.7\%$ for a layer-matched low-TxtCopy negative control ($3.2\times$ specificity at $n=30$; $n=5$ paired-$t$ $p=0.039$). A full L26 sweep places L26.28 at rank 4 of 32. Honest negatives: within-L26 Spearman $ρ(\text{TxtCopy, drop}) = +0.37$ (opposite of within-layer causal reading); single-head activation patching does not transfer the matching variable; the 4 named heads alone do not suffice on any task; Walker2d-DT and scene-task1 recruit L24 outside the named slice and show null head-ablation specificity. We frame the contribution as a cross-distribution importance fingerprint at the slice level plus head-level causal evidence on one cross-modality target.
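Two of the numbers quoted above can be reproduced with simple arithmetic. The sketch below assumes one particular reading of the hypergeometric null, namely that N=192 heads contain K=38 "top-tier" heads and that all n=4 named heads land in that set by chance; the abstract does not spell this out, so treat the interpretation and the helper name `hypergeom_tail` as assumptions. The specificity ratio is likewise assumed to be the success drop for L26.28 divided by the drop for the negative control.

```python
from math import comb


def hypergeom_tail(N: int, K: int, n: int, k: int) -> float:
    """P(X >= k) when n items are drawn without replacement from N items,
    K of which are marked."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)


# Assumed reading: all 4 named heads fall inside the 38 top-tier heads.
p_value = hypergeom_tail(N=192, K=38, n=4, k=4)
print(f"P = {p_value:.4f}")  # prints "P = 0.0013"

# Assumed reading of "3.2x specificity": ratio of success-rate drops.
drop_target = 63.3 - 10.0    # zeroing L26.28
drop_control = 63.3 - 46.7   # zeroing the low-TxtCopy control head
specificity = drop_target / drop_control
print(f"specificity = {specificity:.1f}x")  # prints "specificity = 3.2x"
```

Under these readings both values agree with the figures in the abstract, which suggests (but does not prove) that this is how they were computed.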

2604.25646 2026-05-20 cs.CV cs.RO

SAMe: A Semantic Anatomy Mapping Engine for Robotic Ultrasound

SAMe:一种用于机器人超声的语义解剖映射引擎

Jing Zhang, Duojie Chen, Wentao Jiang, Zihan Lou, Jianxin Liu, Xinwu Cui, Qinghong Zhao, Bo Du, Christoph F. Dietrich, Dacheng Tao

Institutions School of Computer Science, Wuhan University; Hubei Center for Applied Mathematics, Wuhan University; Department of Ultrasound, The Central Hospital of Wuhan; Department of Medical Ultrasound, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology; Department of Ultrasound in Medicine, Renmin Hospital of Wuhan University; University Hospital, Johann-Wolfgang-Goethe University Frankfurt am Main; College of Computing and Data Science, Nanyang Technological University

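AI summary This work proposes SAMe, a semantic anatomy mapping engine that supplies an explicit anatomical prior layer for robotic ultrasound scan initialization: it grounds clinical complaints into target organs and produces control-facing probe initialization states, and it outperforms a body-keypoint heuristic baseline in real-robot experiments.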

Comments Supplementary information included. Code will be released at https://github.com/MiliLab/Echo-SAMe

详情
AI中文摘要

机器人超声已经实现了局部图像驱动控制、接触调节和视图优化,但当前系统缺乏必要的解剖学理解,无法确定应扫描什么、从哪里开始以及如何适应个体患者解剖结构。这些差距使得系统仍依赖专家干预来启动扫描。本文提出SAMe,一种语义解剖映射引擎,为机器人超声提供显式的解剖先验层。SAMe将扫描初始化视为目标到解剖到动作的过程:它将不明确的临床症状转化为结构化的目标器官,从单张外部身体图像中为这些目标生成患者特定的解剖表示,并将这种表示转换为面向控制的6自由度探头初始化状态,无需使用术前CT或MRI进行额外的配准。SAMe维护的解剖表示是显式的、轻量的(单器官推断在0.08秒内完成),并且设计上与下游控制兼容。在语义接地、解剖生成和真实机器人评估中,SAMe在完整的初始化流程中表现出色。在真实机器人实验中,基于质心的SAMe初始化在单目标设置下,对于肝脏(86.7% vs 46.7%)和肾脏(80.0% vs 73.3%)初始化均优于基于身体关键点的启发式基线。此外,当多个候选目标可用时,试验级别的器官命中率达到了肝脏97.3%和肾脏83.3%。这些结果建立了一个显式的解剖先验层,解决了扫描初始化问题,并为更广泛的下游自主扫描流程提供了解剖基础,为基于症状驱动和解剖信息的机器人超声提供了基础。

English abstract

Robotic ultrasound has advanced local image-driven control, contact regulation, and view optimization, yet current systems lack the anatomical understanding needed to determine what to scan, where to begin, and how to adapt to individual patient anatomy. These gaps leave systems reliant on expert intervention to initiate scanning. Here we present SAMe, a semantic anatomy mapping engine that provides robotic ultrasound with an explicit anatomical prior layer. SAMe addresses scan initiation as a target-to-anatomy-to-action process: it grounds under-specified clinical complaints into structured target organs, instantiates a patient-specific anatomical representation for the grounded targets from a single external body image, and translates this representation into control-facing 6-DoF probe initialization states without any additional registration using preoperative CT or MRI. The anatomical representation maintained by SAMe is explicit, lightweight (single-organ inference in 0.08s), and compatible with downstream control by design. Across semantic grounding, anatomical instantiation, and real-robot evaluation, SAMe shows strong performance across the full initialization pipeline. In real-robot experiments, centroid-based SAMe initialization outperformed the body-keypoint-based heuristic baseline under a budget-matched single-target setting for both liver (86.7% versus 46.7%) and kidney (80.0% versus 73.3%) initialization. Furthermore, the trial-level organ-hit rate reached 97.3% for liver and 83.3% for kidney when multiple candidate targets were available. These results establish an explicit anatomical prior layer that addresses scan initialization and is designed to support broader downstream autonomous scanning pipelines, providing the anatomical foundation for complaint-driven, anatomically informed robotic ultrasonography.

2604.18739 2026-05-20 cs.LG stat.ML

Discrete Tilt Matching

离散倾斜匹配

Yuyuan Chen, Shiyi Wang, Peter Potaptchik, Jaeyeon Kim, Michael S. Albergo

Institutions Harvard University; University of Oxford; Kempner Institute

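AI summary This paper proposes Discrete Tilt Matching (DTM), a likelihood-free method for fine-tuning diffusion large language models that performs state-level matching of local unmasking posteriors, improving training stability and preventing mode collapse.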

详情
AI中文摘要

掩码扩散大语言模型(dLLMs)是自回归生成的一种有前景的替代方案。尽管强化学习(RL)方法最近已被适配到dLLM微调中,但其目标通常依赖于序列级边际似然,而这对掩码扩散模型来说难以计算。为了解决这个问题,我们推导出离散倾斜匹配(DTM),一种无需似然的方法,将dLLM微调重新表述为在奖励倾斜下对局部解掩码后验的状态级匹配。DTM表现为一个具有显式最小化解的加权交叉熵目标,并可引入控制变量(control variates)以提高训练稳定性。在合成迷宫规划任务中,我们分析了DTM的退火计划和控制变量如何影响训练稳定性并防止模式崩溃。在大规模实验中,使用DTM微调LLaDA-8B-Instruct在Sudoku和Countdown上取得了显著提升,同时在MATH500和GSM8K上保持竞争力。

English abstract

Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) methods have recently been adapted to dLLM fine-tuning, their objectives typically depend on sequence-level marginal likelihoods, which are intractable for masked diffusion models. To address this, we derive Discrete Tilt Matching (DTM), a likelihood-free method that recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting. DTM takes the form of a weighted cross-entropy objective with explicit minimizer, and admits control variates that improve training stability. On a synthetic maze-planning task, we analyze how DTM's annealing schedule and control variates affect training stability and prevent mode collapse. At scale, fine-tuning LLaDA-8B-Instruct with DTM yields strong gains on Sudoku and Countdown while remaining competitive on MATH500 and GSM8K.

2604.18225 2026-05-20 cs.CV cs.AI

Is SAM3 ready for pathology segmentation?

SAM3是否准备好进行病理分割?

Qiuyu Kong, Shakiba Sharifi, Yiming Wang, Marco Cristani, Zanxi Ruan

Institutions Sapienza University of Rome; University of Verona; Fondazione Bruno Kessler

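AI summary This paper evaluates SAM3 on pathology image segmentation and finds that text prompts are of limited use, that performance depends strongly on visual prompt type and budget, that few-shot learning helps but robustness to prompt noise is lacking, and that a significant gap remains between prompt-based use and task-trained adapter-based methods.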

Comments Accepted to ICIP 2026

详情

English abstract

Is Segment Anything Model 3 (SAM3) capable of segmenting any pathology image? Digital pathology segmentation spans tissue-level and nuclei-level scales, where traditional methods often suffer from high annotation costs and poor generalization. SAM3 introduces Promptable Concept Segmentation, offering a potential automated interface via text prompts. In this work, we propose a systematic evaluation protocol to explore the capability space of SAM3 in a structured manner. Specifically, we evaluate SAM3 under different supervision settings, including zero-shot, few-shot, and supervised, with varying prompting strategies. Our extensive evaluation on pathology datasets including NuInsSeg, PanNuke, and GlaS reveals that: (1) text-only prompts poorly activate nuclear concepts; (2) performance is highly sensitive to visual prompt types and budgets; (3) few-shot learning offers gains, but SAM3 lacks robustness against visual prompt noise; and (4) a significant gap persists between prompt-based usage and task-trained adapter-based reference. Our study delineates SAM3's boundaries in pathology image segmentation and provides practical guidance on the necessity of pathology domain adaptation.

2604.16593 2026-05-20 cs.CL

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

重新审视一个令人头疼的问题:一种用于语言模型的语义推理基准

Yang Liu, Hongming Li, Melissa Xiaohui Qin, Qiankun Liu, Chao Huang

Institutions University of Science and Technology Beijing; State Key Laboratory of General Artificial Intelligence, BIGAI

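AI summary This paper presents SemanticQA, a benchmark for evaluating language models on semantic phrase processing. It consolidates existing multiword expression resources into a unified testbed covering general lexical phenomena and three fine-grained categories, evaluates models of diverse architectures and scales on extraction, classification, interpretation, and sequential task compositions, and reveals substantial performance variation on tasks that require semantic reasoning.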

Comments ACL 2026 (Oral), 24 pages, 22 figures, 14 tables

详情
AI中文摘要

我们提出了SemanticQA,一种评估套件,旨在评估语言模型(LMs)在语义短语处理任务中的能力。该基准整合了现有的多词表达(MwE)资源,并将它们重新组织为一个统一的测试平台。它涵盖了通用词汇现象,如词组搭配,以及三个细粒度类别:习语表达、名词复合词和动词结构。通过SemanticQA,我们评估了不同架构和规模的语言模型在提取、分类、解释任务以及顺序任务组合中的表现。我们揭示了显著的性能差异,特别是在需要语义推理的任务中,突显了不同模型在推理效率和语义理解方面的差异,为推动语言模型在非平凡语义短语上具备更强理解能力提供了见解。SemanticQA的评估套件和数据可在https://github.com/jacklanda/SemanticQA上获取。

English abstract

We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales in extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in the reasoning efficacy and semantic understanding of LMs and providing insights for pushing LMs toward stronger comprehension of non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.

2604.16503 2026-05-20 cs.CV cs.AI

Motif-Video 2B: Technical Report

Motif-Video 2B:技术报告

Junghwan Lim, Wai Ting Cheung, Minsu Ha, Beomgyu Kim, Taewhan Kim, Haesol Lee, Dongpin Oh, Jeesoo Lee, Taehyun Kim, Minjae Kim, Sungmin Lee, Hyeyeon Cho, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Dongseok Kim, Jangwoong Kim, Youngrok Kim, Hyukjin Kweon, Hongjoo Lee, Jeongdoo Lee, Junhyeok Lee, Eunhwan Park, Yeongjae Park, Bokki Ryu, Dongjoo Weon

Institutions Motif Technologies

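AI summary This work asks whether a high-quality text-to-video model can be trained on a limited budget, and argues for improving performance through architectural design rather than scale alone; combining Shared Cross-Attention with a three-part backbone, it achieves strong video generation with fewer parameters and less data.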

详情
AI中文摘要

训练强大的视频生成模型通常需要大规模数据集、大量参数和大量计算资源。在本工作中,我们探讨在更小的预算下(少于1000万片段和少于10万H200 GPU小时)是否能够实现高质量的文本到视频生成。我们的核心观点是,模型容量的组织方式,而不仅仅是其规模,是关键因素。在视频生成中,提示对齐、时间一致性以及细节恢复在通过相同路径处理时可能会相互干扰。Motif-Video 2B通过在架构上分离这些角色,而不是仅依赖规模来解决这一问题。该模型结合了两个关键思想:首先,共享交叉注意力在视频令牌序列变长时增强了文本控制;其次,三部分主干网络分离了早期融合、联合表征学习和细节细化。为了使这种设计在有限计算预算下有效,我们将其与基于动态令牌路由和早期阶段特征对齐到冻结预训练视频编码器的高效训练方案相结合。我们的分析显示,后期块比标准单流基线发展出更清晰的跨帧注意力结构。在VBench上,Motif-Video 2B达到了83.76%的性能,超越了Wan2.1 14B模型,使用7倍更少的参数和显著更少的训练数据。这些结果表明,通过精心的架构专门化和以效率为导向的训练方案,可以缩小或超越通常与更大视频模型相关联的质量差距。

English abstract

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.

2604.16491 2026-05-20 cs.CV cs.AI

A Lightweight Transformer for Pain Recognition from Brain Activity

一种用于从脑活动识别疼痛的轻量级Transformer

Stefanos Gkikas, Christian Arzate Cruz, Yu Fang, Lu Cao, Muhammad Umar Khan, Thomas Kassiotis, Giorgos Giannakakis, Raul Fernandez Rojas, Randy Gomez

Institutions Honda Research Institute Japan, Wako City, Japan; BioSIS (Biosensing & Intelligent Systems) Lab; Centre for Intelligent Computing Systems; University of Canberra, Canberra, Australia; Department of Electronic Engineering, Hellenic Mediterranean University, Chania, Greece

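AI summary This paper proposes a lightweight transformer that fuses multiple fNIRS representations through a unified tokenization mechanism, jointly modeling complementary signal views without modality-specific adaptations or added architectural complexity, and achieves competitive pain recognition performance while remaining computationally compact.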

详情
AI中文摘要

疼痛是一种复杂且广泛的现象,具有显著的临床和社会负担,使其可靠的自动化评估成为关键目标。本文提出了一种轻量级Transformer架构,通过统一的词元化(tokenization)机制融合多种fNIRS表示,实现了互补信号视图的联合建模,而无需特定模态的适配或增加架构复杂性。所提出的词元混合(token-mixing)策略通过将异构输入投影到共享的潜在表示中,保留了空间、时间和时频特性,并使用结构化的分段方案来控制局部聚合和全局交互的粒度。该模型在AI4Pain数据集上使用堆叠的原始波形和功率谱密度表示进行评估。实验结果表明,该方法在保持计算紧凑性的同时实现了具有竞争力的疼痛识别性能,使其适用于GPU和CPU硬件上的实时推断。

English abstract

Pain is a multifaceted and widespread phenomenon with substantial clinical and societal burden, making reliable automated assessment a critical objective. This paper presents a lightweight transformer architecture that fuses multiple fNIRS representations through a unified tokenization mechanism, enabling joint modeling of complementary signal views without requiring modality-specific adaptations or increasing architectural complexity. The proposed token-mixing strategy preserves spatial, temporal, and time-frequency characteristics by projecting heterogeneous inputs onto a shared latent representation, using a structured segmentation scheme to control the granularity of local aggregation and global interaction. The model is evaluated on the AI4Pain dataset using stacked raw waveform and power spectral density representations of fNIRS inputs. Experimental results demonstrate competitive pain recognition performance while remaining computationally compact, making the approach suitable for real-time inference on both GPU and CPU hardware.

2604.15034 2026-05-20 cs.AI

Autogenesis: A Self-Evolving Agent Protocol

自生成:一种自我进化代理协议

Wentao Zhang, Zhe Zhao, Haibin Wen, Yingcheng Wu, Cankun Guo, Ming Yin, Bo An, Mengdi Wang

Institutions Nanyang Technological University; Stanford University; Princeton University; City University of Hong Kong; University of Science and Technology of China

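AI summary This paper proposes the Autogenesis Protocol (AGP), which decouples what evolves from how evolution occurs to address gaps in existing agent protocols around cross-entity lifecycle management, version tracking, and safe update interfaces. Building on AGP, the authors present the Autogenesis System (AGS), which dynamically instantiates, retrieves, and refines protocol-registered resources, and evaluate it on multiple challenging benchmarks involving long-horizon planning and tool use.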

详情
AI中文摘要

近年来,基于大语言模型(LLM)的代理系统在处理复杂、长视界任务方面展现出了巨大潜力。然而,现有的代理协议(如A2A和MCP)对跨实体生命周期与上下文管理、版本追踪以及进化安全的更新接口规定不足,这助长了单体式组合和脆弱的粘合代码。我们引入了自生成协议(AGP),这是一种自我进化协议,它通过分离进化内容与进化过程来解决这些问题。其资源基底协议层(RSPL)将提示、代理、工具、环境和记忆建模为具有明确状态、生命周期和版本化接口的协议注册资源。其自我进化协议层(SEPL)指定了一个闭环操作接口,用于提出、评估和提交改进,具有可审计的血统和回滚功能。基于AGP,我们提出了自生成系统(AGS),这是一个能够在执行过程中动态实例化、检索和优化协议注册资源的自我进化多代理系统。我们评估了AGS在多个需要长视界规划和跨异构资源工具使用的挑战性基准测试上的表现。结果表明,与强基线相比,AGS表现出一致的改进,支持了代理资源管理和闭环自我进化的有效性。代码可在https://github.com/DVampire/Autogenesis上获取。

English abstract

Recent advances in LLM-based agent systems have shown promise in tackling complex, long-horizon tasks. However, existing agent protocols (e.g., A2A and MCP) underspecify cross-entity lifecycle and context management, version tracking, and evolution-safe update interfaces, which encourages monolithic compositions and brittle glue code. We introduce Autogenesis Protocol (AGP), a self-evolution protocol that decouples what evolves from how evolution occurs. Its Resource Substrate Protocol Layer (RSPL) models prompts, agents, tools, environments, and memory as protocol-registered resources with explicit state, lifecycle, and versioned interfaces. Its Self Evolution Protocol Layer (SEPL) specifies a closed-loop operator interface for proposing, assessing, and committing improvements with auditable lineage and rollback. Building on AGP, we present Autogenesis System (AGS), a self-evolving multi-agent system that dynamically instantiates, retrieves, and refines protocol-registered resources during execution. We evaluate AGS on multiple challenging benchmarks that require long-horizon planning and tool use across heterogeneous resources. The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed-loop self-evolution. The code is available at https://github.com/DVampire/Autogenesis.

2604.13392 2026-05-20 cs.AI

ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

ReSS: 通过符号支架学习表格数据预测的推理模型

Chenlang Yi, Gang Li, Zizhan Xiong, Tue Minh Cao, Yanmin Gong, My T. Thai, Tianbao Yang

Institutions Department of Computer Science & Engineering, Texas A&M University; Department of Computer Science, University of Florida

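AI summary This paper proposes ReSS, a framework that uses symbolic scaffolds to train neural reasoning models for tabular data prediction, improving accuracy and interpretability; experiments on medical and financial benchmarks show gains over traditional decision trees and standard fine-tuning.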

详情
AI中文摘要

表格数据在医疗和金融等高风险领域仍然广泛存在,预测模型需要提供高准确性和可信的、可被人类理解的推理。虽然符号模型提供可验证的逻辑,但缺乏语义表达能力。同时,通用大语言模型通常需要专门的微调才能掌握领域特定的表格推理。为解决可扩展的数据整理和推理一致性挑战,我们提出了ReSS,一种系统框架,连接符号和神经推理模型。ReSS利用决策树模型提取实例级别的决策路径作为符号支架。这些支架,加上输入特征和标签,指导LLM生成基于现实的自然语言推理,严格遵循底层决策逻辑。由此产生的高质量数据集用于微调预训练LLM为专门的表格推理模型,进一步通过支架不变的数据增强策略提高泛化能力和可解释性。为了严格评估可信度,我们引入了包括幻觉率、解释必要性和解释充分性的定量指标。在医疗和金融基准上的实验结果表明,ReSS训练的模型在传统决策树和标准微调方法上提高了高达10%,同时产生可信且一致的推理。

English abstract

Tabular data remains prevalent in high-stakes domains such as healthcare and finance, where predictive models are expected to provide both high accuracy and faithful, human-understandable reasoning. While symbolic models offer verifiable logic, they lack semantic expressiveness. Meanwhile, general-purpose LLMs often require specialized fine-tuning to master domain-specific tabular reasoning. To address the dual challenges of scalable data curation and reasoning consistency, we propose ReSS, a systematic framework that bridges symbolic and neural reasoning models. ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic. The resulting high-quality dataset is used to fine-tune a pretrained LLM into a specialized tabular reasoning model, further enhanced by a scaffold-invariant data augmentation strategy to improve generalization and explainability. To rigorously assess faithfulness, we introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency. Experimental results on medical and financial benchmarks demonstrate that ReSS-trained models improve on traditional decision trees and standard fine-tuning approaches by up to 10% while producing faithful and consistent reasoning.
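The "instance-level decision path as symbolic scaffold" step can be pictured with a tiny hand-built tree. This is a hedged sketch, not the ReSS pipeline: the `Node` class, the feature names, the thresholds, the labels, and the prompt wording are all invented for illustration, and a real system would extract paths from a trained decision-tree model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    feature: Optional[str] = None      # split feature (internal nodes)
    threshold: Optional[float] = None  # go left when value <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None        # set on leaves only


def decision_path(root: Node, row: Dict[str, float]) -> Tuple[List[str], str]:
    """Walk the tree for one instance, recording each test that was applied."""
    steps: List[str] = []
    node = root
    while node.label is None:
        value = row[node.feature]
        if value <= node.threshold:
            steps.append(f"{node.feature} = {value} <= {node.threshold}")
            node = node.left
        else:
            steps.append(f"{node.feature} = {value} > {node.threshold}")
            node = node.right
    return steps, node.label


def scaffold_prompt(row: Dict[str, float], steps: List[str], label: str) -> str:
    """Assemble a hypothetical prompt asking an LLM to explain the path."""
    lines = [f"Features: {row}", "Decision path:"]
    lines += [f"  {i}. {s}" for i, s in enumerate(steps, start=1)]
    lines.append(f"Prediction: {label}")
    lines.append("Explain this prediction using only the steps above.")
    return "\n".join(lines)


# Invented toy tree: glucose split first, then BMI.
tree = Node(
    feature="glucose", threshold=125.0,
    left=Node(label="low risk"),
    right=Node(
        feature="bmi", threshold=30.0,
        left=Node(label="moderate risk"),
        right=Node(label="high risk"),
    ),
)
row = {"glucose": 140.0, "bmi": 33.5}
steps, label = decision_path(tree, row)
prompt = scaffold_prompt(row, steps, label)
```

Constraining the generated explanation to the recorded steps is what would let a faithfulness metric check whether the text mentions conditions that were never on the path.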

2604.11796 2026-05-20 cs.CL cs.AI

C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

C-ReD:一个源自真实世界提示的综合性中文AI生成文本检测基准

Chenxi Qing, Junxi Wu, Zheng Liu, Yixiang Qiu, Hongyao Yu, Bin Chen, Hao Wu, Shu-Tao Xia

Institutions Tsinghua University; Nankai University; Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory; Shannon InfoTech

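AI summary This paper proposes C-ReD, a benchmark for detecting AI-generated Chinese text that addresses gaps in model diversity, domain coverage, and prompt realism, improving detection performance and generalization.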

Comments ACL 2026 Findings

详情
AI中文摘要

近年来,大型语言模型(LLMs)能够生成高度流畅的文本内容。尽管它们为人类提供了显著的便利,但也引入了诸如钓鱼和学术不端等风险。大量研究致力于开发检测AI生成文本的算法并构建相关数据集。然而,在中文语料领域仍存在挑战,包括模型多样性有限和数据同质性。为了解决这些问题,我们提出了C-ReD:一个综合性的中文真实提示AI生成检测基准。实验表明,C-ReD不仅能够实现可靠的领域内检测,还支持对未见LLMs和外部中文数据集的强大泛化能力,从而弥补了先前中文检测基准在模型多样性、领域覆盖和提示真实性方面的关键缺口。我们已在https://github.com/HeraldofLight/C-ReD上发布了相关资源。

English abstract

Large language models (LLMs) are now capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, such as phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated Detection benchmark. Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets, addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C-ReD.

2604.11417 2026-05-20 cs.RO cs.AI

Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

面向机器人伴随语音的高效情绪感知象似性手势预测

Edwin C. Montiel-Vazquez, Christian Arzate Cruz, Stefanos Gkikas, Thomas Kassiotis, Giorgos Giannakakis, Randy Gomez

Institutions School of Engineering; Honda Research Institute Japan; Department of Electronic Engineering

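AI summary This paper proposes a lightweight transformer that predicts iconic gesture placement and intensity from text and emotion alone, with no audio input. On the BEAT2 dataset it outperforms GPT-4o in semantic gesture placement classification and intensity regression, and it is compact enough for real-time deployment.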

详情
AI中文摘要

伴随语音的手势(co-speech gestures)可以提高参与度并改善语音理解。大多数数据驱动的机器人系统生成节拍式的节奏性动作,但很少融入语义强调。为此,我们提出了一种轻量级的transformer,仅凭文本和情绪推断象似性手势(iconic gesture)的位置和强度,推理时无需音频输入。该模型在BEAT2数据集上的语义手势位置分类和强度回归方面均优于GPT-4o,同时保持计算紧凑,适合在具身智能体上实时部署。

English abstract

Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.

2604.11089 2026-05-20 cs.CV

Structured State-Space Regularization for Generation-Friendly Image Tokenization

结构化状态空间正则化用于生成友好的图像标记化

Jinsung Lee, Jaemin Oh, Namhun Kim, Dongwon Kim, Byung-Jun Yoon, Suha Kwak

Institutions POSTECH; Brown University; KAIST; Texas A&M University; Brookhaven National Laboratory

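AI summary This paper proposes structured state-space regularization, which induces spectral structure in the latent space to improve the generative performance of image tokenizers while largely preserving reconstruction fidelity.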

Comments Related blog posts in https://jinsingsangsung.github.io/collections/blog/ : Towards 2-Dimensional State-Space Models series

详情
AI中文摘要

图像标记器在现代生成模型中起着核心作用,其中潜在空间的结构关键决定了下游生成性能。有效潜在表示的一个关键但未被充分探索的特性是频谱组织,即能够跨频率组件编码信息。在本文中,我们引入了结构化状态空间正则化,一种系统诱导潜在空间频谱结构的方法。我们通过重新审视状态空间模型(SSMs)作为模仿基函数行为的系统,推导出一个正则化目标。这种视角揭示了SSMs的隐藏状态被诱导以捕捉频率组件,从而产生一种新的正则器,强制潜在空间捕捉图像的频谱结构。实验表明,我们的正则器在提升图像标记器生成性能的同时,仅导致微小的重建保真度损失。

English abstract

Image tokenizers play a central role in modern generative models, where the structure of the latent space critically determines the downstream generation performance. A key but underexplored property of effective latent representations is spectral organization, the ability to encode information across frequency components. In this work, we introduce structured state-space regularization, a principled approach to inducing spectral structure in latent spaces. We derive a regularization objective by revisiting state-space models (SSMs) as systems mimicking a basis function's behavior. This perspective reveals that hidden states of SSMs are induced to capture the frequency components, resulting in a novel regularizer that enforces the latent space to capture spectral structure of images. Experiments demonstrate that our regularizer improves the generative performance of image tokenizers while incurring only minimal loss in their reconstruction fidelity.

2604.09323 2026-05-20 cs.RO

Robust Adaptive Backstepping Impedance Control of Robots in Unknown Environments

未知环境中机器人的鲁棒自适应反步阻抗控制

Reza Nazmara, Alap Kshirsagar, Jan Peters, A. Pedro Aguiar

发表机构 * Research Center for Systems and Technologies (SYSTEC), ARISE, Faculty of Engineering, University of Porto, 4200-465 Porto, Portugal(系统与技术研究中心(SYSTEC),ARISE,工程学院,波尔图大学,葡萄牙4200-465波尔图) Intelligent Autonomous Systems Lab, Department of Computer Science, TU Darmstadt, Germany(智能自主系统实验室,计算机科学系,达姆施塔特技术大学,德国)

AI总结 本文提出了一种鲁棒自适应反步阻抗控制(RABIC)策略,面向在接触丰富且不确定的环境中操作的机器人。该策略考虑了系统的完整耦合动力学,并显式处理外部扰动和未建模动力学等关键不确定性来源,实现时无需机器人的动力学参数。内环采用反步方法跟踪参考阻抗模型,利用基于泰勒级数的估计器估计系统动力学,并采用自适应估计器确定外力的上界。稳定性分析证明了整体系统的半全局实用有限时间稳定性。移动机械臂仿真以及在真实Franka Emika Panda机器人上的实验表明,所提方法在保证轨迹跟踪和力监测的同时,比PD控制更安全。

Comments 8

Journal ref Mechatronics, Vol. 118, 103552 (2026)

详情
AI中文摘要

本文提出了一种鲁棒自适应反步阻抗控制(RABIC)策略,用于在接触丰富且不确定的环境中操作的机器人。所提出的控制策略考虑了系统的完整耦合动力学,并显式处理外部扰动和未建模动力学等关键不确定性来源,同时在实现时无需机器人的动力学参数。我们为内环提出了一种基于反步的自适应阻抗控制方案,以跟踪参考阻抗模型。为了处理不确定性,我们采用基于泰勒级数的估计器来估计系统动力学,并采用自适应估计器来确定外力的上界。稳定性分析证明了整体系统的半全局实用有限时间稳定性。为了验证所提方法的有效性,我们进行了移动机械臂仿真场景实验,并在真实的Franka Emika Panda机器人上进行了实验评估。与PD控制相比,所提出的方法表现更安全,同时保证了轨迹跟踪和力监测。总体而言,RABIC框架为未来关于耦合移动和固定串联机械臂的自适应和基于学习的阻抗控制研究提供了坚实的基础。

英文摘要

This paper presents a Robust Adaptive Backstepping Impedance Control (RABIC) strategy for robots operating in contact-rich and uncertain environments. The proposed control strategy considers the complete coupled dynamics of the system and explicitly accounts for key sources of uncertainty, including external disturbances and unmodeled dynamics, while not requiring the robot's dynamic parameters in implementation. We propose a backstepping-based adaptive impedance control scheme for the inner loop to track the reference impedance model. To handle uncertainties, we employ a Taylor series-based estimator for system dynamics and an adaptive estimator for determining the upper bound of external forces. Stability analysis demonstrates the semi-global practical finite-time stability of the overall system. To demonstrate the effectiveness of the proposed method, a simulated mobile manipulator scenario and experimental evaluations on a real Franka Emika Panda robot were conducted. The proposed approach exhibits safer performance compared to PD control while ensuring trajectory tracking and force monitoring. Overall, the RABIC framework provides a solid basis for future research on adaptive and learning-based impedance control for coupled mobile and fixed serially linked manipulators.

2604.08503 2026-05-20 cs.CV

Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Phantom:通过联合建模视觉与潜在物理动态实现融入物理的视频生成

Ying Shen, Jerry Xiong, Tianjiao Yu, Ismini Lourentzou

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出Phantom模型,通过联合建模视觉内容和潜在物理动态,使视频生成过程具备物理一致性,从而生成既视觉真实又物理合理的视频。

Comments 15 pages, 6 figures, CVPR 2026

详情
AI中文摘要

近期生成式视频建模的进展,受到大规模数据集和强大架构的推动,已经取得了显著的视觉真实效果。然而,越来越多的证据表明,仅仅扩大数据和模型规模并不能使这些系统理解支配现实世界动态的底层物理定律。现有方法往往无法捕捉或强制执行这种物理一致性,导致不真实的运动和动态。在本文中,我们探讨将潜在物理属性的推断直接整合到视频生成过程中,是否可以赋予模型生成物理合理视频的能力。为此,我们提出了Phantom,一个融入物理的视频生成模型(Physics-Infused Video Generation),该模型联合建模视觉内容和潜在物理动态。在观察到的视频帧和推断出的物理状态条件下,Phantom联合预测潜在物理动态并生成未来的视频帧。Phantom利用一种物理感知的视频表示,作为底层物理的抽象但信息丰富的嵌入,从而在不需显式指定复杂物理动态和属性集的情况下,联合预测物理动态和视频内容。通过将物理感知视频表示的推断直接整合到视频生成过程中,Phantom生成的视频序列既具有视觉真实性又具有物理一致性。在标准视频生成和物理感知基准上的定量和定性结果表明,Phantom不仅在遵守物理动态方面优于现有方法,还提供了具有竞争力的感知保真度。

英文摘要

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of the physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.