arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.07435 2026-06-09 cs.CV cs.CL version update

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

Rishabh Jain, Naomi Harte

Affiliations: Sigmedia Group, School of Engineering, Trinity College Dublin, Ireland

AI summary: Comparing VSR systems with humans on the MaFI dataset shows that the models reach higher overall accuracy but make different errors from humans, relying mainly on language cues from the training data rather than on visual perception.

Comments: Accepted at INTERSPEECH 2026

Abstract

Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations further show that models gain most on visemes humans find hardest, and show much weaker dependence on visual clarity. Our work demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.

2606.07431 2026-06-09 cs.CV version update

OpenGlass: Ultra-Low-Power On-Device AI Eyewear with Event-based Vision

Pietro Bonazzi, Julian Moosmann, Ahmet Celik, Philipp Mayer, Michele Magno

Affiliations: ETH Zürich

AI summary: Proposes OpenGlass, an open-source smart glasses platform with a modular design, event-driven power management, and a GAP9 RISC-V SoC for low-power on-device ML, reaching 83.94% cross-subject gesture recognition accuracy on the LynX dataset.

Abstract

Smart eyewear enables unobtrusive, context-aware interaction through multimodal sensors and on-device intelligence, but is severely limited by power, memory, and compute constraints in a compact form factor. Open-hardware platforms supporting event-based vision and embedded ML at this scale are rare. This work introduces an open-source smart glasses platform for rapid prototyping of novel sensors and algorithms. Its modular design uses a flexible FPC interposer to support both event-based and frame-based cameras without full PCB redesign. A hardware-software co-designed power management system combines a configurable PMIC with event-driven wake-up via an nRF5340 coordinator, keeping the GAP9 RISC-V SoC powered down between inferences. The prototype achieves up to 11.5 hours of continuous on-device ML from a 200 mAh battery. As a demonstration, an egocentric hand gesture recognition pipeline was evaluated on the LynX dataset using polarity-separated event histograms from a Prophesee GENX320 camera. R(2+1)D achieved the best cross-subject accuracy of 83.94% (macro F1 = 0.781) under leave-two-subjects-out validation, with 78.3 ms end-to-end inference latency on the GAP9. Temporal augmentation and removal of ambiguous classes provided the largest gains (+8.9 pp). All hardware designs, firmware, and models are released open source.

2606.07419 2026-06-09 cs.CV version update

DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation

Tony Danjun Wang, Tolga Birdal, Nassir Navab, Lennart Bastian

Affiliations: Imperial College London; Technical University of Munich

AI summary: Proposes DisPOSE, a framework that models multi-view person assignment as a generative diffusion process over the space of polystochastic tensors and uses differentiable Sinkhorn projections and a hypergraph-convolutional decoder for self-supervised 3D human pose estimation, performing strongly on standard datasets and in occluded operating-room scenes.

Abstract

Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing self-supervised approaches leverage synthetic catalogues of 3D poses; however, this leads to poor generalization in real-world scenarios due to distribution shifts. We therefore introduce DisPOSE, a self-supervised framework that approximates the inherently discrete multi-view person-assignment problem as a generative diffusion process over the space of polystochastic tensors. By employing differentiable Sinkhorn projections during denoising, our model learns to guide solutions toward valid and feasible assignments based on 2D image priors. The complete 3D skeletons of localized individuals are then regressed using a Hypergraph-Convolutional Decoder that explicitly models relational structures and articulated joints across multiple views. The proposed approach outperforms current state-of-the-art self-supervised methods on standard datasets and demonstrates strong performance on a newly proposed benchmark featuring highly occluded scenes from surgical operating rooms. Our diffusion-based localization demonstrates high label efficiency, retaining 99% of its performance with only 10% of the pseudo-labels. Notably, disentangling the assignment and root regression components while maintaining differentiability makes DisPOSE nearly agnostic to different camera arrangements.
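
As a rough illustration of the Sinkhorn projection mentioned in the abstract, the sketch below covers only the simplest case: a square score matrix between detections in two views, pushed toward a doubly stochastic soft assignment by alternating row and column normalisation. The abstract describes projections onto polystochastic tensors across several views inside a diffusion denoiser, which this sketch does not attempt. The function name `sinkhorn`, the iteration count, and the input layout are assumptions made for illustration.

```python
import math


def sinkhorn(scores, n_iters=50):
    """Alternately normalise rows and columns of exp(scores).

    scores: square list of lists of floats (higher = better match).
    Returns a matrix whose rows and columns each sum to roughly 1.
    """
    # Exponentiate so that every entry is strictly positive.
    m = [[math.exp(s) for s in row] for row in scores]
    n_cols = len(m[0])
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        for row in m:
            total = sum(row)
            for j in range(n_cols):
                row[j] /= total
        # Column step: make every column sum to 1.
        for j in range(n_cols):
            total = sum(row[j] for row in m)
            for row in m:
                row[j] /= total
    return m
```

Every operation here is a division or an exponential, so the same procedure is differentiable when written in an autodiff framework, which is what allows it to sit inside a learned denoising step.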

2606.07379 2026-06-09 cs.LG cs.AI cs.CL stat.ME version update

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Thanawat Lodkaew, Johannes Ackermann, Soichiro Nishimori, Nontawat Charoenphakdee, Masashi Sugiyama, Takashi Ishida

Affiliations: The University of Tokyo; RIKEN

AI summary: Proposes CapCode, a framework that detects cheating on coding tasks through capped evaluation, and CapReward, a reward design that discourages cheating; experiments show that the approach detects and reduces cheating.

Abstract

A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests whose best achievable non-cheating performance is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and therefore provide evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle to discourage optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving performance ranking of models, and CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
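
One way to make "substantially above the cap" concrete is a one-sided binomial test, sketched below. This is an assumption for illustration and not necessarily the criterion used in the paper: it treats each randomized test as an independent trial that an honest solver passes with probability at most `cap`, and flags a score when the chance of doing that well honestly falls below a hypothetical threshold `alpha`.

```python
from math import comb


def tail_prob(passed, total, cap):
    """P(X >= passed) for X ~ Binomial(total, cap)."""
    return sum(
        comb(total, k) * cap**k * (1 - cap) ** (total - k)
        for k in range(passed, total + 1)
    )


def looks_like_cheating(passed, total, cap, alpha=0.01):
    """Flag a score that an honest solver capped at `cap` would rarely reach."""
    return tail_prob(passed, total, cap) < alpha
```

With a cap of 0.5, passing 100 of 100 tests is flagged, while passing 50 of 100 is not.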

2606.07118 2026-06-09 cs.RO version update

QuadVerse: An Integrated Framework Aligning Visual-Physical Reality for Quadruped Simulation

Yuxiang Chen, Yuanhao Wang, Ziheng Zhang, Meng Zhang, Yu Liu, Yufei Jia, Tiancai Wang, Erjin Zhou, Jin Xie

Affiliations: Nanjing University; BUPT; DEXMAL; Tsinghua University

AI summary: Proposes QuadVerse, a framework that uses reconstructed scenes to calibrate vision, physics, and actuators, narrowing the sim-to-real gap with 3DGS and contact calibration and enabling zero-shot deployment of visual navigation policies.

Abstract

Simulation is central to robot learning, yet the sim-to-real gap remains a major bottleneck. Existing approaches often tackle visual or dynamic gaps separately, overlooking how these individual mismatches accumulate and propagate throughout the robot's state evolution. In this paper, we introduce QuadVerse, an integrated framework that uses reconstructed scenes as a calibration substrate for aligning visual perception, physical interaction, and actuator dynamics. From captured RGB videos, we reconstruct geometry-constrained 3D Gaussian Splatting (3DGS) scenes that support batched photorealistic ego-view rendering and collision-ready semantic mesh extraction. The meshes further enable contact calibration by initializing spatially varying friction priors and refining them through trajectory-based posterior search. To address remaining actuator discrepancies, QuadVerse trains a residual dynamics compensator by replaying real-world trajectories on the contact-calibrated terrain, reducing the entanglement between terrain-induced contact errors and actuator non-idealities. Experiments show that QuadVerse improves reconstruction quality and locomotion tracking over relevant baselines. Leveraging this foundation, we demonstrate robust zero-shot visual-navigation policy deployment without task-specific real-world rollouts.

2606.07108 2026-06-09 cs.AI version update

DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

Tengyao Tu, Yulin Li, Hui-Ling Zhen, Libo Qin, Zhoujun Wei, Jinghua Piao, Zhuotao Tian, Yong Li, Min Zhang

Affiliations: Harbin Institute of Technology, Shenzhen; Zhongguancun Academy; Huawei Noah’s Ark Lab; Shenzhen Loop Area Institute; Tsinghua University

AI summary: Proposes DyCon, a training-free framework that uses step-level embeddings to model how difficulty evolves during reasoning and to control reasoning depth, cutting redundant steps and improving efficiency without losing accuracy.

Comments: Accepted at ICML 2026

Abstract

Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as "overthinking". Existing methods to mitigate this issue either rely on static difficulty estimates or require task-specific training, and thus fail to adapt to the dynamic complexity during reasoning. In this work, we empirically show that the problem difficulty evolves dynamically throughout the reasoning process and is linearly encoded in the LRM's step-level embeddings. Building on this insight, we propose DyCon, a training-free framework that leverages latent step-level representations to explicitly model the evolving task difficulty, enabling the dynamic control of reasoning depth to mitigate the overthinking issue. Extensive experiments conducted on four models ranging from 4B to 32B, and across twelve benchmarks in math reasoning, general question answering, and coding tasks demonstrate that DyCon significantly enhances reasoning efficiency by reducing redundant steps without sacrificing accuracy or generalization. Code is available at https://github.com/yu-lin-li/DyCon.

2606.06915 2026-06-09 cs.CL cs.AI cs.LG version update

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

Vladislav Smirnov, Chieu Nguyen, Sergey Senichev, Minh Ngoc Ta, Ekaterina Fadeeva, Artem Vazhentsev, Daria Galimzianova, Nikolai Rozanov, Viktor Mazanov, Jingwei Ni, Tianyi Wu, Igor Kiselev, Mrinmaya Sachan, Iryna Gurevych, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

Affiliations: MBZUAI; ETH Zürich; Imperial College London; NUS; Accenture; Innopolis University; Independent Researcher

AI summary: Proposes ThinkBooster, a framework that combines a modular library, a joint evaluation benchmark, and a deployable proxy service for test-time compute scaling of LLM reasoning, and examines performance-compute trade-offs on mathematical and coding tasks.

Abstract

Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.

2606.06656 2026-06-09 cs.AI cs.LO version update

A Study of Parallel Continuous Local Search

Cody J Christopher, Charles Gretton

Affiliations: School of Computing, Australian National University

AI summary: Studies parallel Continuous Local Search (CLS) for satisfiability problems with symmetric pseudo-Boolean constraints, finding that redundant constraints inhibit convergence, that CLS quickly completes partial assignments in hybrid solving, and that local search rapidly converges to a stable distribution of solution quality because of saddle-dense objectives.

Abstract

We study parallel Continuous Local Search (CLS) as a solution approach for Boolean satisfiability problems with symmetric pseudo-Boolean (PB) constraints. Here, the $n$-variable PB-satisfiability problem is relaxed to a continuous optimisation problem with a differentiable objective function on an $n$-dimensional hypercube. For satisfiable instances, the global minimisers of this optimisation problem correspond to satisfying assignments of the SAT problem at hand. We present several novel findings via empirical experiments: (i) redundant constraints can inhibit rather than accelerate convergence; (ii) CLS shows promise as a sub-solver in hybridised settings, quickly completing partial assignments; and (iii) local search rapidly converges to a stable distribution of solution quality (i.e., degree of satisfaction), due to saddle-dense objectives where additional solver steps yield diminishing returns. Our findings inform practical uses of CLS for SAT on modern accelerator hardware.

2606.06554 2026-06-09 cs.LG cs.AI version update

Multi-Scale Feature Attention Network for Polymer Classification Using Terahertz Spectroscopy

Roshni Mahtani, Ilán Carretero, Laura Monroy, Aldo Moreno-Oyervides, Oscar Elías Bonilla-Manrique, Rocío del Amor

Affiliations: Instituto Universitario de Investigación e Innovación en Tecnología Centrada en el Ser Humano, HUMAN-tech, Universitat Politècnica de València; Department of Electronic Technology, Universidad Carlos III de Madrid; Artikode Intelligence S.L.

AI summary: Proposes the Multi-Scale Feature Attention Network (MSFAN), which combines feature gating with multi-scale parallel convolutions to classify 12 types of polymers from terahertz dual-comb spectroscopy, reaching 85.2% accuracy.

Comments: Accepted in EUSIPCO'26

Abstract

Reliable polymer identification is essential for ensuring the quality and safety of recycled plastics, yet conventional sorting and spectroscopic techniques often struggle to deliver robust discrimination. Terahertz (THz) spectroscopy offers a promising alternative, providing high-resolution and non-destructive measurements. In this work, we leverage THz signals to classify 12 types of polymers, including pure polymers, multilayer films, commercial blends, and biopolymers. To handle the complexity of these spectral signals, we propose the Multi-Scale Feature Attention Network (MSFAN), a novel deep learning architecture tailored for THz data. The framework integrates feature gating for signal recalibration and multi-scale parallel convolutions to capture diverse frequency patterns. These features are further refined through cross-feature attention and attention pooling, enabling the model to intrinsically highlight the most informative THz regions. MSFAN consistently outperforms state-of-the-art models, reaching a classification accuracy of 85.2%. This study demonstrates the potential of combining THz spectroscopy with deep learning techniques for effective, scalable, and interpretable polymer classification.

2605.03357 2026-06-09 cs.LG math.OC version update

Population-Aware Imitation Learning in Mean-field Games with Common Noise

Grégoire Lambrecht, Mathieu Laurière

AI summary: For mean-field games with common noise, proposes a population-aware imitation learning framework with two proxies, behavioral cloning and adversarial divergence, establishes finite-sample error bounds, computes expert policies with generalized Fictitious Play and deep learning, and shows experimentally that population-aware policies matter for coping with stochasticity.

Abstract

Mean Field Games (MFGs) provide a powerful framework for modeling the collective behavior of large populations of interacting agents. In this paper, we address the problem of Imitation Learning (IL) in MFGs subject to common noise, where the population distribution evolves stochastically. This stochasticity compels agents to adopt population-aware policies to respond to aggregate shocks. We formulate two distinct learning objectives: recovering a Nash equilibrium and maximizing performance against an expert population. We investigate two imitation proxies: Behavioral Cloning (BC) and Adversarial (ADV) divergence. We then establish finite-sample error bounds showing that minimizing these proxies effectively controls both the policy's exploitability and its performance gap relative to the expert. Furthermore, we propose a numerical framework using generalized Fictitious Play and Deep Learning to compute expert population-aware policies. Through experiments on three environments we demonstrate that standard population-unaware policies fail to capture the equilibrium dynamics. Our results highlight that learning population-aware policies is crucial to avoid being misled by the randomness inherent in common noise.

2605.03229 2026-06-09 cs.CL cs.LG version update

Sparse Memory Finetuning as a Low-Forgetting Alternative to LoRA and Full Finetuning

Prakhar Gupta, Garv Shah, Satyam Goyal, Anirudh Kanchi

AI summary: Presents Sparse Memory Finetuning (SMF), which adds key-value memory layers and updates only the memory rows most active for the current batch; it improves MedMCQA by 2.5 percentage points while keeping the forgetting probes (WikiText perplexity and TriviaQA accuracy) within about 1 point of the base model, with less forgetting than LoRA and full finetuning.

Abstract

Adapting a pretrained language model to a new task often hurts the general capabilities it already had, a problem known as catastrophic forgetting. Sparse Memory Finetuning (SMF) tries to avoid this by adding key-value memory layers to the model and, on each training step, updating only the small set of memory rows that the current batch reads most heavily. We re-implement SMF on Qwen-2.5-0.5B-Instruct and compare it with LoRA and full finetuning on MedMCQA, a 4-choice medical exam task, using WikiText perplexity and TriviaQA accuracy as forgetting probes. SMF improves MedMCQA by 2.5 percentage points while keeping both forgetting probes within roughly 1 point of the base model, whereas LoRA and full finetuning achieve larger gains but with clear drift on both. We also compare two row-selection rules (KL-divergence and TF-IDF), which balance the two forgetting metrics differently.
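
The sparse update described in the abstract can be sketched as a gradient step applied to only a few rows of a memory table. The sketch is a simplification under stated assumptions: it ranks rows by a raw per-batch read count, whereas the abstract mentions KL-divergence and TF-IDF selection rules, and the names `sparse_row_update`, `read_counts`, and `top_t` are hypothetical.

```python
def sparse_row_update(memory, grads, read_counts, top_t, lr=0.1):
    """Apply an SGD step only to the top_t most-read memory rows.

    memory, grads: lists of rows (lists of floats) with the same shape.
    read_counts:   how heavily the current batch read each row.
    Returns (updated_memory, selected_row_indices).
    """
    ranked = sorted(range(len(memory)), key=lambda i: read_counts[i], reverse=True)
    selected = set(ranked[:top_t])
    updated = []
    for i, (row, grad_row) in enumerate(zip(memory, grads)):
        if i in selected:
            # Heavily read row: take the gradient step.
            updated.append([w - lr * g for w, g in zip(row, grad_row)])
        else:
            # Every other row is left untouched.
            updated.append(list(row))
    return updated, selected
```

Because the unselected rows and, in the setting the abstract describes, the rest of the model are not modified, each step can only disturb a small part of what the model already stores.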

2605.03226 2026-06-09 cs.LG cs.AI cs.CR version update

Self-Mined Hardness for Safety Fine-Tuning

Prakhar Gupta, Garv Shah, Donghua Zhang

AI summary: Scores prompt difficulty from the model's own rollouts and safety fine-tunes on the hardest prompts, cutting attack success rate on Llama-3 models to 1-3% at the cost of a higher refusal rate, which mixing in benign prompts partly offsets.

Abstract

Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1% down to 1-3%, but pushes refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%. Interleaving the same hard prompts 1:1 with adversarially-framed benign prompts (prompts that look like jailbreaks but have benign intent) cuts that refusal back down to 30-51% on 8B and 52-72% on 3B, at a cost of 2-6 percentage points of attack success rate. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts the remaining ASR by 35-50% (about 3 percentage points) on both models.
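
The scoring and selection step described in the abstract might look like the sketch below. It is an illustration under assumptions: hardness is taken to be the fraction of a prompt's rollouts judged harmful, and the "eligible pool" is assumed to mean prompts with at least one non-harmful rollout to train on; the paper's exact rule may differ. The function name `hardest_prompts` and the input layout are hypothetical.

```python
def hardest_prompts(rollout_judgements, keep_fraction=0.5):
    """Rank prompts by how often the model's own rollouts were judged harmful.

    rollout_judgements: dict mapping prompt -> list of bools
                        (True means the rollout was judged harmful).
    Returns (selected_prompts, hardness_scores).
    """
    scores = {p: sum(j) / len(j) for p, j in rollout_judgements.items() if j}
    # Assumed eligibility rule: keep prompts that have at least one safe rollout.
    eligible = [p for p, s in scores.items() if s < 1.0]
    eligible.sort(key=lambda p: scores[p], reverse=True)
    n_keep = max(1, int(len(eligible) * keep_fraction)) if eligible else 0
    return eligible[:n_keep], scores
```

The selected prompts would then be paired with the model's own non-harmful rollouts to form the fine-tuning set.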

2605.05138 2026-06-09 cs.AI version update

Executable World Models for ARC-AGI-3 in the Era of Coding Agents

Sergey Rodionov

AI summary: Proposes a coding-agent system that maintains an executable Python world model, verifies it against observations, refactors it toward simpler abstractions, and plans through the model, with preliminary results on ARC-AGI-3 games: it fully solves 15 games with GPT-5.5 at high reasoning effort.

Comments: 13 pages. Accepted for publication at AGI-2026

Abstract

We evaluate an initial coding-agent system for ARC-AGI-3 in which the agent maintains an executable Python world model, verifies it against previous observations, refactors it toward simpler abstractions as a practical proxy for an MDL-like simplicity bias, and plans through the model before acting. The system is intentionally direct: it uses a scripted controller, predefined world-model interfaces, verifier programs, and a plan executor, but no hand-coded game-specific logic. The agent-facing prompts, workspace, and controller contain no game-specific code, game-specific prompts, hand-coded heuristics, hidden solutions, or other game-specific information; the same agent and prompts are used across games. Because the coding agent has broad system access, we audit unintended information channels, describe earlier vulnerable harnesses, and explain how the current harness closes observed leakage channels while reducing benchmark-specific information exposure. We report results on the 25 public ARC-AGI-3 games. Each playthrough starts from a fresh agent instance and clean workspace, with no access to files or conversation state from earlier playthroughs. With GPT-5.5 high reasoning effort, the agent fully solved 15 games and achieved a mean per-game RHAE of 58.12%. With GPT-5.4 high reasoning effort, it fully solved 8 games and achieved a mean per-game RHAE of 41.29%. Performance on the private validation set, which is not yet available to us, remains to be tested. Overall, the results provide preliminary evidence that verifier-driven executable world models are a promising approach for ARC-AGI-3 agents. Full run artifacts are released with the code at https://github.com/astroseger/arc-3-agents-baseline1.

2605.03395 2026-06-09 cs.SD cs.AI cs.LG cs.MM version update

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

Jaavid Aktar Husain, Dorien Herremans

AI summary: Proposes APEX, a framework that uses MERT audio embeddings to jointly predict popularity metrics and five aesthetic quality dimensions for AI-generated music, and shows on the Music Arena dataset that aesthetic features generalise to preference prediction.

Abstract

Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. Aesthetic quality is a key yet unexplored factor in this setting. We propose APEX, the first large-scale multi-task learning framework for AI-generated music, trained on over 211k songs (10k hours of audio) from Suno and Udio. APEX jointly predicts engagement-based popularity signals (streams and likes scores) alongside five perceptual aesthetic quality dimensions from frozen audio embeddings extracted from MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.

2605.03058 2026-06-09 cs.LG cs.AI version update

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

Francesco Sovrano, Gabriele Dominici, Marc Langheinrich

AI summary: Proposes MechaRule, which grounds rule extraction in LLM circuits by localizing sparse agonist activations, uses adaptive group testing with confidence-guided pruning to identify key neurons with high recall at very low cost, and is validated on arithmetic and jailbreaking tasks.

Comments: Accepted for publication at KDD'2026

Abstract

A central goal of explainable AI is to express large language model (LLM) decision logic symbolically and ground it in internal mechanisms. Existing rule-extraction methods usually learn ungrounded symbolic surrogates, while mechanistic interpretability links behavior to neurons but often requires hand-crafted hypotheses and costly interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by localizing sparse agonist activations whose ablation disrupts rule-related behavior. MechaRule rests on two findings. First, in a fixed baseline/flip regime, sparse agonist effects can exhibit overtopping: a few high-effect activations remain detectable within larger groups, dominate weaker ones, and flip many of the same examples. In such regimes, adaptive group testing with confidence-guided conservative pruning requires O(k log(N/k) + k) interventions over N candidates when k << N are agonists. Second, agonists are localized more reliably on data splits aligned with close-to-faithful rule behavior; spectral splits provide a rule-free fallback, whereas unfaithful splits degrade localization. Empirically, on arithmetic and jailbreaking, MechaRule recalls 97.0% of highest-effect agonists in matched brute-force validations at only 2.14% of exhaustive-ablation cost on average. Ablating the localized agonists eliminates 97.6-100.0% of eligible correct arithmetic answers and jailbreaks, and can correct arithmetic errors or induce jailbreaks by up to 72.8% and 32.5%, respectively.
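
The reason group testing needs far fewer interventions than one-by-one ablation can be seen in the textbook binary-splitting sketch below. It is not the MechaRule procedure: it assumes a noiseless oracle that reports an effect whenever an ablated group contains at least one agonist, and it omits the confidence-guided pruning the abstract describes. The names `find_agonists` and `group_has_effect` are hypothetical.

```python
def find_agonists(candidates, group_has_effect):
    """Locate all agonists by recursive halving.

    group_has_effect(group) stands in for one ablation intervention and
    returns True if the group contains at least one agonist.
    Returns (sorted_agonists, number_of_interventions).
    """
    found, calls = [], 0
    stack = [list(candidates)]
    while stack:
        group = stack.pop()
        if not group:
            continue
        calls += 1
        if not group_has_effect(group):
            continue  # one intervention rules out the whole group
        if len(group) == 1:
            found.append(group[0])
            continue
        mid = len(group) // 2
        stack.append(group[mid:])
        stack.append(group[:mid])
    return sorted(found), calls
```

With k agonists among N candidates, only groups that contain an agonist are ever split, so the number of interventions grows roughly as k log N instead of N.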

2605.02950 2026-06-09 cs.LG cs.AI version update

Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces

Mohit Kumar, Somayeh Kargaran, Bernhard A. Moser, Manuela Geiß

AI summary: Proposes Kernel Affine Hull Machines (KAHMs) as lightweight query encoders that, for a fixed teacher representation space, replace neural encoding with posterior-weight estimation in an RKHS, giving compute-efficient semantic retrieval with strong performance.

Abstract

Transformer-based semantic encoders are effective for retrieval, but in many deployments the recurring bottleneck is online query encoding rather than offline corpus indexing. This paper studies whether, once a strong teacher representation space and corpus index are fixed, repeated neural query encoding can be replaced by a substantially lighter and analytically explicit estimator. We formulate fixed-teacher lexical-to-semantic encoding as a conditional-mean estimation problem in which the target semantic vector is represented as a noisy mixture of semantic prototypes weighted by posterior cluster probabilities. Kernel Affine Hull Machine (KAHM) geometry is used to estimate these posterior weights from inexpensive lexical features in an explicitly identified RKHS hypothesis space, and the semantic prototypes are refined by normalized least-mean-squares updates from noisy teacher embeddings. This yields a backpropagation-free query-side encoder together with an end-to-end error decomposition into posterior-approximation, finite-sample/generalization, and teacher-noise terms. We instantiate the approach on a controlled Austrian-law retrieval benchmark with 5,000 test queries, 84 candidate laws, and 10,762 aligned retrieval units, using law-specific encoders into a frozen Mixedbread embedding space. Among evaluation-matched learned adapters, KAHM achieves the strongest teacher-space reconstruction and the best rank-sensitive retrieval performance at all evaluated cutoffs. At k=20, it obtains MRR@20 = 0.504, Hit@20 = 0.694, and Top-1 Accuracy = 0.411, while reducing online per-query time by a factor of 8.53 relative to direct transformer query encoding in the reported CPU setting. The results support KAHMs as compute-efficient encoders for supervised fixed-representation deployment regimes.

2605.01799 2026-06-09 cs.CV 版本更新

Embody4D: A Generalist Data Engine for Embodied 4D World Modeling

Embody4D: 面向具身4D世界建模的通用数据引擎

Peiyan Tu, Hanxin Zhu, Jingwen Sun, Shaojie Ren, Cong Wang, Yuyan Xu, Jiayi Luo, Xiaoqian Cheng, Zhibo Chen

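AI summary: Proposes Embody4D, a video-to-video world model that uses a 3D-aware synthesis pipeline, latent confidence-aware expert modulation, and an interaction-aware attention mechanism to convert monocular robot videos into multi-view videos, addressing viewpoint sparsity in embodied intelligence and improving downstream planning and learning.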

详情
AI中文摘要

具身智能体需要鲁棒且全面的3D时空表示来支持空间推理、操作理解和下游决策。然而,现有的机器人数据通常从固定或稀疏的视角捕获,仅提供部分且依赖视角的观察,这限制了多视角感知和跨视角泛化。鉴于在真实环境中收集额外视角的困难,我们提出Embody4D,一种专为具身场景设计的视频到视频世界模型,通过将单目机器人视频转换为来自灵活目标相机视角的新视角视频来弥合这一观察差距。首先,为解决训练数据稀缺问题,我们引入了一种3D感知的组合合成管道,以策划一个异构数据集,该数据集组合了跨具身形态的机器人手臂与多样背景,促进了广泛泛化。其次,为强制几何稳定性,我们设计了一种潜在置信度感知的专家调制策略,该策略估计扭曲潜在先验的可靠性,并自适应地将区域路由到复制、修复或修补专家,以实现时空一致的4D生成。最后,为增强操作保真度,我们引入了一种交互感知注意力机制,该机制明确关注机器人交互区域。大量实验表明,Embody4D在视觉评估基准上达到了最先进的性能,同时模拟和真实机器人实验进一步证明了其作为鲁棒数据引擎的有效性,能够合成高保真、视角一致的视频,赋能下游机器人规划和学习。

英文摘要

Embodied agents require robust and comprehensive 3D spatiotemporal representations to support spatial reasoning, manipulation understanding, and downstream decision making. However, existing robot data are typically captured from fixed or sparse viewpoints, providing only partial and view-dependent observations, which limits multi-view perception and generalization across viewpoints. Given the difficulty of collecting additional viewpoints in real-world settings, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios to bridge this observation gap by transforming a monocular robot video into novel-view videos from flexible target camera viewpoints. First, to tackle training data scarcity, we introduce a 3D-aware compositional synthesis pipeline to curate a heterogeneous dataset compositing cross-embodiment robotic arms with diverse backgrounds, promoting broad generalization. Second, to enforce geometric stability, we devise a latent confidence-aware expert modulation strategy, which estimates the reliability of warped latent priors and adaptively routes regions to copy, repair, or inpaint experts for spatiotemporally consistent 4D generation. Finally, to enhance the fidelity of the manipulation, we incorporate an interaction-aware attention mechanism that explicitly attends to the robotic interaction regions. Extensive experiments show that Embody4D achieves state-of-the-art performance on visual evaluation benchmarks, while both simulated and real-world robotic experiments further demonstrate its effectiveness as a robust data engine for synthesizing high-fidelity, view-consistent videos that empower downstream robotic planning and learning.

2605.01616 2026-06-09 cs.LG cs.AI cs.CY cs.NI 版本更新

Learning Behavioral Signals from Encrypted Smartphone Network Traffic

从加密智能手机网络流量中学习行为信号

Rameen Mahmood, Omar El Shahawy, Souptik Barua, Zachary Beattie, Jeffrey Kaye, Xuhai "Orson" Xu, Chao-Yi Wu, Danny Yuxing Huang

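AI summary: Uses a transformer-based model with user-specific adapters to learn behavioral representations from encrypted network traffic, then analyzes them with sparse representation learning and generalized estimating equations; finds that stress is associated with between-person differences, loneliness with within-person fluctuations, and sleep disturbance with a combination of both, and that the learned representations outperform conventional handcrafted features.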

详情
Comments
19 pages, 6 figures
AI中文摘要

人类行为难以在大规模下连续测量,然而日常活动和幸福感的痕迹可能反映在与个人设备的交互中。我们研究加密的智能手机网络流量是否可以作为被动感知信号,用于检测与睡眠障碍、压力和孤独感相关的行为状态。为了捕捉群体层面的模式和个体特定的行为,我们采用基于Transformer的模型,该模型带有用户特定的适配器,学习网络活动的表征,同时考虑个人基线及其偏差。为了提高可解释性,我们进一步使用稀疏表示学习分析这些表征,以识别与不同活动模式相关的潜在行为特征。我们使用带有Mundlak分解的广义估计方程将所得特征与睡眠障碍、压力和孤独感联系起来,从而能够区分稳定的个体间差异和随时间变化的个体内变化。我们的分析揭示了这三种结果具有不同的时间动态:压力主要与持续的个体间变异相关,孤独感与个体内波动更密切相关,而睡眠障碍则反映了两者的结合。重要的是,这些个体内行为信号无法通过传统的手工网络流量特征恢复,这突显了学习表征在纵向行为建模中的优势。总体而言,我们的发现表明加密网络流量包含可解释的行为信息,并能够支持被动、可扩展的行为动态监测,特别是相对于个体典型活动模式的变化。

英文摘要

Human behavior is challenging to measure continuously at scale, yet traces of daily routines and well-being may be reflected in interactions with personal devices. We investigate whether encrypted smartphone network traffic can serve as a passive sensing signal for behavioral states related to sleep disturbance, stress, and loneliness. To capture both population-level patterns and individual-specific behavior, we employ a transformer-based model with user-specific adapters that learns representations of network activity while accounting for personal baselines and deviations from them. To improve interpretability, we further analyze these representations using sparse representation learning to identify latent behavioral features associated with distinct activity patterns. We relate the resulting features to sleep disturbance, stress, and loneliness using generalized estimating equations with Mundlak decomposition, enabling separation of stable between-person differences from within-person changes over time. Our analysis reveals that the three outcomes are characterized by different temporal dynamics: stress is predominantly associated with persistent between-person variation, loneliness is more strongly linked to within-person fluctuations, and sleep disturbance reflects a combination of both. Importantly, these within-person behavioral signals are not recovered by conventional handcrafted network-traffic features, highlighting the advantages of learned representations for longitudinal behavioral modeling. Overall, our findings demonstrate that encrypted network traffic contains interpretable behavioral information and can support passive, scalable monitoring of behavioral dynamics, particularly changes relative to an individual's typical pattern of activity.
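
The separation of between-person differences from within-person changes can be illustrated with a Mundlak-style split of a time-varying feature into a person mean and a deviation from that mean. A minimal sketch, assuming hypothetical `(person_id, value)` records; the function name `mundlak_split` is an assumption, and the generalized estimating equations themselves are not fitted here.

```python
from collections import defaultdict


def mundlak_split(records):
    """Split each observation into a between-person and a within-person part.

    records: list of (person_id, value) pairs, one per observation.
    Returns a list of (person_id, person_mean, deviation_from_mean).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for person_id, value in records:
        totals[person_id] += value
        counts[person_id] += 1
    means = {pid: totals[pid] / counts[pid] for pid in totals}
    return [(pid, means[pid], value - means[pid]) for pid, value in records]


# Hypothetical feature values for two participants.
rows = mundlak_split([("a", 1.0), ("a", 3.0), ("b", 10.0)])
# Each row carries the stable person mean and the time-varying deviation,
# which can then enter a regression as two separate predictors.
```

Entering the person mean and the deviation as separate predictors is what lets a model attribute an outcome to stable differences between people or to changes within one person over time.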

2606.07235 2026-06-09 cs.IR cs.LG 版本更新

FLOWREADER: Min-Cost Flow Optimization for Multi-Modal Long Document Q&A

FLOWREADER: 多模态长文档问答的最小成本流优化

Ambuj Mehrish, Sebastiano Vascon

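AI summary: Proposes FLOWREADER, which models evidence assembly in long multimodal documents as a min-cost flow problem in which a single scoring vector controls source selection, sink selection, and edge costs; it outperforms top-k retrieval when evidence is fragmented.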

详情
AI中文摘要

长多模态文档迫使检索增强系统从文本、表格和幻灯片中碎片化的证据中组装答案,这些证据可能分布在长表格的单元格中、多张幻灯片上或图表与其讨论之间。Top-k块检索独立处理每个片段,无法表示证据之间的关联。我们提出FLOWREADER,将证据组装重新定义为多模态节点图上的最小成本流问题:一个单一的评分向量$h$控制源选择(通过MMR)、汇选择(通过长度感知的可回答性代理)以及每条边的成本和容量。最优流被分解为候选证据路径,通过熵正则化复制动力学选择紧凑的非冗余子集,并在双过程门控下并行运行VLM工作器,当答案一致性低或路由流紧张时触发一次System-2精炼过程。在VisDoMBench上,FLOWREADER在碎片化证据主导的两个子集PaperTab(58.40,比G^{2}-Reader高1.30)和SlideVQA(72.93,高0.62)上表现最佳,在SPIQA、FetaTab和SciGraphQA上具有竞争力。在所有五个子集上的宏观平均得分(65.47)与最强基线(G^{2}-Reader,66.21)相差0.74。总体而言,这些结果表明最小成本流在碎片化多模态证据上表现良好,而top-k检索在此类场景中失败。它还提供了一种统一的方式来控制评分、路由、选择和自适应计算。

英文摘要

Long, multimodal documents force retrieval-augmented systems to assemble answers from evidence fragmented across text, tables, and slides: broken across cells in a long table, spread over multiple slides, or split between a figure and its discussion. Top-$k$ chunk retrieval treats each fragment independently and cannot represent how evidence connects. We introduce FLOWREADER, which reframes evidence assembly as a min-cost flow problem on a multimodal node graph: a single scoring vector $h$ controls source selection (via MMR), sink selection (via a length-aware answerability proxy), and the costs and capacities of every edge. The optimal flow is decomposed into candidate evidence paths, a compact non-redundant subset is selected by entropy-regularized replicator dynamics, and parallel VLM workers under a dual-process gate produce the answer, with a single System-2 refinement pass triggered when answer consistency is low or the routed flow is strained. On VisDoMBench, FLOWREADER is best on the two subsets dominated by fragmented evidence, PaperTab ($58.40$, $+1.30$ over G$^{2}$-Reader) and SlideVQA ($72.93$, $+0.62$), and competitive on SPIQA, FetaTab, and SciGraphQA. Macro-averaged across all five subsets, FLOWREADER ($65.47$) is within $0.74$ of the strongest baseline (G$^{2}$-Reader, $66.21$). Overall, these results show that min-cost flow performs well on fragmented multimodal evidence, where top-$k$ retrieval fails. It also provides a unified way to control scoring, routing, selection, and adaptive compute together.
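
To make the min-cost flow formulation concrete, the sketch below solves a tiny instance with successive shortest paths (Bellman-Ford on the residual graph). It is a generic textbook solver under stated assumptions, not FLOWREADER's pipeline: the node numbering, capacities, and costs are hypothetical, and the scoring vector $h$, MMR source selection, and answerability proxy from the abstract are not modeled.

```python
def min_cost_flow(n, edges, source, sink, demand):
    """Send up to `demand` units from source to sink at minimum total cost.

    edges: list of (u, v, capacity, cost) with non-negative costs.
    Returns (flow_sent, total_cost).
    """
    inf = float("inf")
    # Residual graph: each arc is [to, residual_capacity, cost, reverse_index].
    graph = [[] for _ in range(n)]
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    flow = total = 0
    while flow < demand:
        # Bellman-Ford: cheapest residual path from source to every node.
        dist = [inf] * n
        prev = [None] * n
        dist[source] = 0
        for _ in range(n - 1):
            updated = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for i, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, i)
                        updated = True
            if not updated:
                break
        if dist[sink] == inf:
            break  # no augmenting path left

        # Find the bottleneck capacity along the path, then push flow.
        push = demand - flow
        v = sink
        while v != source:
            u, i = prev[v]
            push = min(push, graph[u][i][1])
            v = u
        v = sink
        while v != source:
            u, i = prev[v]
            graph[u][i][1] -= push
            graph[v][graph[u][i][3]][1] += push
            v = u
        flow += push
        total += push * dist[sink]
    return flow, total


# Hypothetical evidence graph: node 0 is the source, node 3 is the sink.
toy_edges = [(0, 1, 1, 1), (0, 2, 1, 2), (1, 3, 1, 1), (2, 3, 1, 1), (1, 2, 1, 0)]
result = min_cost_flow(4, toy_edges, source=0, sink=3, demand=2)
```

In a system like the one described, low edge costs would correspond to strongly connected evidence fragments, so the cheapest flow traces paths through related fragments rather than ranking each fragment on its own.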

2606.07047 2026-06-09 cs.AI 版本更新

Front-to-Attractors: Modifying the Front-to-Front Heuristic in Bidirectional Search

Front-to-Attractors:修改双向搜索中的Front-to-Front启发式

Alvin Zou, Muhammad Suhail Saleem, Maxim Likhachev

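AI summary: Proposes the Front-to-Attractors (F2A) heuristic class, which replaces the full frontier with a dynamically maintained set of attractors, retaining the informativeness of Front-to-Front at far lower computational cost; experiments show up to 11.2x fewer pairwise evaluations than F2F and 4.8x fewer node expansions than F2E on average.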

详情
AI中文摘要

启发式在双向搜索算法的性能中扮演核心角色,通常依赖两个主要类别。Front-to-end (F2E) 启发式估计从状态 s 到搜索目标(前向搜索的目标或后向搜索的起点)的距离。相比之下,Front-to-front (F2F) 启发式通过成对函数 h(s, s') 估计从 s 到对面搜索前沿的距离,其中 s' 遍历前沿状态。尽管 F2F 启发式通常信息量更大,从而减少节点扩展数量,但它们依赖大量的成对评估,导致显著的计算开销。为了解决这一限制,我们引入了一个新的启发式类——Front-to-attractors (F2A),它在保留 F2F 大部分信息性的同时,大幅降低了计算成本。F2A 不是评估到对面前沿所有状态的距离,而是估计从 s 到对面搜索方向中一个小的、动态维护的吸引子集的距离。这些吸引子作为完整前沿的替代,使得在极少的计算开销下提供丰富的启发式指导,同时保持 F2F 提供的最优性保证。我们在多个领域评估了 F2A,结果显示,与 F2F 相比,它减少了最多 11.2 倍的成对评估,同时平均节点扩展比 F2E 少 4.8 倍。

英文摘要

Heuristics play a central role in the performance of bidirectional search algorithms, which commonly rely on two main classes. Front-to-end (F2E) heuristics estimate the distance from a state s to the target of the search (the goal for forward search or the start for backward search). In contrast, front-to-front (F2F) heuristics estimate the distance from s to the opposite search frontier using a pairwise function h(s, s'), where s' ranges over frontier states. Although F2F heuristics are typically more informative and therefore reduce the number of node expansions, their reliance on extensive pairwise evaluations incurs substantial computational overhead. To address this limitation, we introduce a new heuristic class, front-to-attractors (F2A), that preserves much of the informativeness of F2F while dramatically reducing its computational cost. Rather than evaluating distances to all states on the opposite frontier, F2A estimates the distance from s to a small, dynamically maintained set of attractors in the opposite search direction. These attractors serve as a surrogate for the full frontier, enabling rich heuristic guidance at a fraction of the computational expense while maintaining the optimality guarantees offered by F2F. We evaluate F2A across multiple domains and show that it reduces the number of pairwise evaluations by up to 11.2x compared to F2F, while achieving 4.8x fewer node expansions than F2E on average.
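
The difference in evaluation cost between the three heuristic classes can be seen in a small sketch. It assumes a grid domain with Manhattan distance as the pairwise function h(s, s'), and it takes the set-based heuristic to be the plain minimum of h over the set (the abstract does not say whether path costs are added). The function names and the toy frontier are hypothetical, and the sketch does not show how the paper selects attractors or preserves optimality.

```python
def manhattan(a, b):
    """Pairwise estimate h(s, s') on a 2D grid."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def front_to_end(state, target, h=manhattan):
    """F2E: one evaluation against the target of the search."""
    return h(state, target)


def front_to_set(state, others, h=manhattan):
    """Minimum pairwise estimate over a set of states.

    Passing the full opposite frontier gives an F2F-style value;
    passing a small attractor set gives an F2A-style value.
    Returns (value, number_of_pairwise_evaluations).
    """
    values = [h(state, other) for other in others]
    return min(values), len(values)


# Hypothetical opposite frontier and a two-state attractor subset of it.
frontier = [(5, 5), (2, 3), (9, 0), (4, 4)]
attractors = [(2, 3), (9, 0)]
state = (0, 0)

f2e_value = front_to_end(state, (9, 9))
f2f_value, f2f_cost = front_to_set(state, frontier)
f2a_value, f2a_cost = front_to_set(state, attractors)
```

Here the attractor set happens to contain the closest frontier state, so the F2A-style value equals the F2F-style value at half the pairwise evaluations; in general the quality of the surrogate depends on how the attractors are chosen.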

2606.06497 2026-06-09 cs.GR cs.CV cs.HC 版本更新

Real-Time AttentionBender: Granular Interactive Network Bending of Video Diffusion Transformers

实时注意力弯曲:视频扩散变换器的粒度交互式网络弯曲

Adam Cole, Rebecca Fiebrink, Mick Grierson

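AI summary: Proposes Real-Time AttentionBender, a tool that manipulates the self-attention, cross-attention, and feed-forward networks of video diffusion transformers to give per-layer, per-step, per-token interactive control over generation, strengthening artists' creative agency and material intimacy with the model.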

详情
Comments
5 pages, 4 figures. Accepted to ACM Creativity & Cognition XAIxArts Workshop 2026
AI中文摘要

生成式视频模型已实现显著的视觉保真度,但其仅提示的界面提供了薄弱的创作代理,并使得艺术家无法了解模型的物质过程。我们提出了实时注意力弯曲,这是一种将网络弯曲实践扩展到视频扩散变换器(DiT)全深度并使其进入实时交互式生成的工具。作为DayDream Scope生态系统中的插件构建,并封装了开源实时Wan管道,该工具将自注意力、交叉注意力和前馈网络暴露为可独立操作的面,目标可细化到单个扩散步骤、DiT层、提示令牌和隐藏神经元。实时操作的即时性提供了我们所谓的与模型的“物质亲密性”:对特定层和神经元如何塑造生成视频的响应式、近乎机械的感觉。我们将该工具定位为同时作为对变换器内部结构的XAIxArts探针,以及用于发现模型默认表示空间之外的美学的表达性乐器。

英文摘要

Generative video models have achieved remarkable visual fidelity, yet their prompt-only interface offers thin creative agency and obscures the model's material process from the artists working with it. We present Real-Time AttentionBender, a tool that extends the practice of network bending across the full depth of the video diffusion transformer (DiT) and brings it into live, interactive generation. Built as a plugin within the DayDream Scope ecosystem and wrapping open-source real-time Wan pipelines, the tool exposes self-attention, cross-attention, and the feed-forward network as independently manipulable surfaces, with targeting down to individual diffusion steps, DiT layers, prompt tokens, and hidden neurons. The immediacy of live manipulation affords what we call "material intimacy" with the model: a responsive, near-mechanistic feel for how specific layers and neurons shape generated video. We position the tool as simultaneously an XAIxArts probe into transformer internals and an expressive instrument for discovering aesthetics outside the model's default representational space.

2606.06443 2026-06-09 cs.CL cs.MM cs.SI 版本更新

Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions

修正语境,转变模拟立场:审计基于LLM的在线讨论立场模拟

Xinnong Zhang, Wanting Shan, Hanjia Lyu, Zhongyu Wei, Jiebo Luo

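AI summary: Audits LLM stance simulation with a counterfactual context-revision framework, comparing text-only and multimodal strategies and evaluating average directional stance shift and stance transition rate to characterize the effectiveness and robustness of context sensitivity.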

详情
AI中文摘要

大型语言模型越来越多地被用于模拟社交媒体用户,并推断个人如何回应在线讨论。然而,目前尚不清楚这些模拟是否反映了精确的用户特定信念,或者它们是否对对话语境中语义无关的变化高度敏感。在这项工作中,我们研究反事实语境修正作为审计基于LLM的立场模拟的框架。给定一个原始在线对话,我们首先推断目标用户对特定话题的立场。然后,我们对对话语境应用受控修正策略,并在修正后的语境下再次模拟用户的立场。我们比较了纯文本修正策略与包含模因语境的多模态策略,并评估了两个主要有效性指标,即平均方向性立场转变和立场转换率。结果揭示了在不同极化偏好机制下,纯文本和多模态策略中有效且稳健的立场转变。我们的研究为理解基于LLM的立场模拟的语境敏感性提供了一个评估框架。更广泛地说,它突出了使用LLM模拟在线舆论动态的前景和风险。

英文摘要

Large language models are increasingly used to simulate social media users and infer how individuals may respond to online discussions. However, it remains unclear whether these simulations reflect precise user-specific beliefs or whether they are highly sensitive to semantically independent changes in conversational contexts. In this work, we study counterfactual context revision as a framework for auditing LLM-based stance simulation. Given an original online conversation, we first infer a target user's stance toward a specific topic. We then apply controlled revision strategies to the conversational context and simulate the user's stance again under the revised context. We compare text-only revision strategies with a multimodal one that incorporates meme-based context and evaluate two main effectiveness metrics, i.e., average directional stance shift and stance transition rate. The results reveal effective and robust stance transitions in both text-only and multimodal strategies across different polarization-preference mechanisms. Our study contributes an evaluation framework for understanding the context sensitivity of LLM-based stance simulation. More broadly, it highlights both the promise and risk of using LLMs to simulate online opinion dynamics.
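
The abstract names two effectiveness metrics without defining them. The sketch below shows one plausible reading, and every definition in it is an assumption: stances are coded as -1, 0, or +1, the directional shift is the mean change in the intended direction, and the transition rate is the fraction of users whose stance label changes after revision.

```python
def average_directional_shift(before, after, direction=1):
    """Mean stance change in the intended direction (assumed definition).

    before, after: equal-length lists of stances coded as -1, 0, or +1.
    direction: +1 if the revision is meant to push stances up, -1 for down.
    """
    return sum((a - b) * direction for b, a in zip(before, after)) / len(before)


def stance_transition_rate(before, after):
    """Fraction of users whose stance label changes (assumed definition)."""
    return sum(1 for b, a in zip(before, after) if a != b) / len(before)


# Hypothetical simulated stances for four users, before and after revision.
original = [1, 0, -1, 1]
revised = [0, 0, 1, 1]
shift = average_directional_shift(original, revised, direction=1)
rate = stance_transition_rate(original, revised)
```

Reporting both numbers is useful because a revision can flip many stances in opposite directions, giving a high transition rate with a directional shift near zero.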

2606.06407 2026-06-09 cs.CV cs.IR cs.LG eess.IV 版本更新

A Vision-language Framework for Comparative Reasoning in Radiology

放射学中比较推理的视觉语言框架

Tengfei Zhang, Ziheng Zhao, Xiaoman Zhang, Lisong Dai, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Weidi Xie

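AI summary: Proposes an entity-aware cross-image reasoning framework; by building the large-scale comparative imaging dataset MedReCo-DB and developing the MedReCo and MedReCo-VLM models, it supports reference-case retrieval and temporal comparative interpretation and substantially improves comparative reasoning in radiology.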

详情
AI中文摘要

医学影像人工智能在孤立图像解读方面取得了强劲性能,但仍与放射学实践存在较大差距,因为诊断和随访依赖于对先前研究和类似参考病例的比较。本文我们将放射学比较形式化为一个实体感知的跨图像推理问题,并引入一个支持参考病例检索和时间比较解读的框架。我们构建了MedReCo-DB,这是一个从常规图像-报告对中派生的大规模比较影像资源,包含来自八个机构、四个国家、七种成像模态的超过16万名患者的69万余张图像。报告被分解为解剖结构、异常发现和病理状况,为实体条件检索和比较视觉问答提供监督。利用该资源,我们开发了MedReCo,一个用于可控检索临床类似病例的实体感知视觉编码器,以及MedReCo-VLM,一个用于生成性解读间隔变化的视觉语言扩展。在内部、外部和跨中心评估中,MedReCo在所有12个内部检索设置中实现了最高的Recall@1,并将外部检索平均提高了6.0个百分点。在临床易混淆的鉴别组中,它始终优于最强的基线。MedReCo-VLM在所有比较生成评估中取得了最佳性能,并在胸部X光片上将纵向随访准确性提高了14.5-46.5个百分点,在CT上提高了13.0-27.9个百分点。这些发现表明,实体感知的比较推理可以从常规临床数据中大规模学习,并可能为医学影像AI提供更符合临床的基础。

英文摘要

Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision-language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.

2606.06399 2026-06-09 cs.CL 版本更新

CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

CollabSim: 一种基于CSCW的方法,通过受控多智能体实验研究LLM智能体的协作能力

Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang, Bingsheng Yao

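AI summary: Proposes the CollabSim framework, which draws on CSCW theory to define collaborative capabilities, control interaction conditions, and probe agents' internal states, enabling systematic analysis of collaborative competence in LLM multi-agent systems.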

详情
AI中文摘要

基于大语言模型的多智能体系统展现出日益增长的潜力,其有效性依赖于智能体通过文本渠道进行协调的能力,类似于人类团队。然而,近期研究表明,多智能体系统常常失败并非因为智能体缺乏个体任务解决能力,而是因为缺乏协作能力:建立共同基础、维持共享任务理解、平衡个体与集体激励以及在交互过程中修复失调的能力。计算机支持的协同工作领域数十年的研究已经描述了人类团队在受限通信下协调的这些要求,然而现有的多智能体系统评估主要关注任务结果或单智能体在推理、规划和工具使用方面的能力。为了能够系统分析多智能体系统中智能体的协作能力,我们引入了CollabSim,一个可配置的仿真框架,它结合了基于理论的协作能力定义、交互条件的受控操作以及智能体内部状态的行动级探测。在四个大语言模型上的实验表明,CollabSim能够捕捉条件效应、分离模型性能模式,并揭示智能体设计的任务依赖效应。

英文摘要

Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent studies suggest that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents' collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.

2606.06388 2026-06-09 cs.AI cs.CL 版本更新

Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

人类的ALMANAC:用于智能体协作的动作级心智模型标注的人类协作数据集

Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li, Jian Zhao, Tongshuang Wu, Toby Jia-Jun Li, Dakuo Wang, Bingsheng Yao

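AI summary: To address current LLM agents' lack of mental-model capabilities in collaboration, builds the Map Task-based ALMANAC dataset, containing 2,987 collaboration actions with mental model annotations, and evaluates six LLMs on predicting human behavior and mental models.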

详情
AI中文摘要

近年来,LLM智能体的进展使其具备了复杂的认知能力,如多步推理、规划和工具使用,这些能力使它们逐渐成为人类的协作者。然而,有效的协作要求协作者在协作过程中持续维护和调整自身推理、伙伴意图和共享目标的心智模型。当前的智能体很少发展这种能力,因为它们主要针对任务完成进行优化,而社区缺乏带有动作级心智模型标注的真实人类协作数据,这些数据可以指导智能体获得过程级的协作能力。为填补这一空白,我们提出了ALMANAC,一个基于社会科学中经典的二元路由任务Map Task构建的动作级心智模型标注数据集。ALMANAC包含2,987个协作动作,每个动作都配有基于理论的心智模型标注,记录了参与者的自我推理、感知的伙伴意图和感知的团队目标。我们评估了六种LLM在预测人类下一轮行为和心智模型方面的表现。我们的结果证明了ALMANAC在评估模型模拟人类协作行为及推断其潜在心智模型方面的实用性。

英文摘要

Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants' self-reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans' next-turn behavior and mental models. Our results demonstrate ALMANAC's utility in evaluating models' ability to simulate human collaborative behaviors and infer their underlying mental models.

2606.06360 2026-06-09 cs.AI 版本更新

An Infectious Disease Spread Simulation Based on Large Language Model Decision Making

基于大语言模型决策的传染病传播模拟

Yonchanok Khaokaew, Ruochen Kong, Andreas Zufle, Hao Xue, Taylor Anderson, Chandini Raina MacIntyre, Matthew Scotch, Flora D. Salim, David J Heslop

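AI summary: Proposes a spatially explicit agent-based simulation framework that uses large language models to generate decisions about self-reporting influenza-like illness and integrates them into a census-based synthetic population to capture social and geographic heterogeneity.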

详情
Comments
12 pages
AI中文摘要

在传染病爆发期间对个体决策进行建模对于理解行为动态和指导有效的公共卫生干预至关重要。先前的工作表明,大语言模型可以通过基于人口统计提示和情境背景生成智能体决策来模拟逼真的人类行为。我们在此基础上构建了一个空间显式的基于智能体的模拟框架,将LLM生成的关于自我报告流感样疾病的决策整合到基于人口普查的合成智能体群体中。位置被视为核心特征:智能体被分配到城市内的空间单元,利用真实世界的人口普查数据捕捉不同人口群体的空间分布,并实现地理多样化的行为建模。我们实施并比较了三种决策场景:独立推理、家庭影响和消息框架,并在旧金山和亚特兰大模拟了自我报告结果。结果显示,收入和受教育程度是报告率变化的主要驱动因素,地理、LLM模型选择和消息框架的影响较小但一致。我们的框架生成了捕捉社会和地理异质性的合成数据,支持空间流行病学建模和偏差感知行为分析。

英文摘要

Modelling individual decision-making during infectious disease outbreaks is crucial for understanding behavioural dynamics and informing effective public health interventions. Prior work has shown that large language models can simulate realistic human behaviour by generating agent decisions based on demographic prompts and situational context. We build on this foundation with a spatially grounded, agent-based simulation framework that integrates LLM-generated decisions about self-reported influenza-like illness into a census-based synthetic population of agents. Location is treated as a central feature: agents are assigned to spatial units within cities, capturing the spatial distributions of different demographic groups using real-world census data and enabling geographically diverse behavioural modelling. We implement and compare three decision scenarios, independent reasoning, household influence, and message framing, and simulate self-reporting outcomes in San Francisco and Atlanta. Results reveal that income and education are the dominant drivers of reporting rate variation, with smaller but consistent effects from geography, LLM model choice, and message framing. Our framework generates synthetic data that captures both social and geographic heterogeneity, supporting spatial epidemiological modelling and bias-aware behavioural analysis.

2606.06114 2026-06-09 cs.AI 版本更新

Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems

走向健康进化:探索人机交互在自我进化系统中的作用与机制

Dianxing Shi, Bowen Wang, Junqi He, Junhao Chen, Yuta Nakashima

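AI summary: Proposes the ANCHOR framework, which simulates human supervisory feedback to mitigate capability degradation and safety drift in self-evolving systems; experiments show that limited supervision substantially improves safety and stability.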

详情
AI中文摘要

自我进化智能体通过持续的自我对弈和自我生成的学习信号进行改进,但自主进化也可能导致能力退化与安全漂移。尽管人类反馈已被证明对静态和后训练智能体有效,但其在自我进化系统中的作用仍未被充分探索。我们提出了通过类人监督与审查进行智能体规范修正(ANCHOR)框架,这是一个基于LLM的框架,模拟人类监督并在自我进化的不同阶段提供反馈。利用ANCHOR,我们评估了两个代表性的开源自我进化智能体系统在编程、数学推理和安全性方面的表现。结果表明,即使是有限的监督也能显著缓解安全退化,同时保持核心进化目标的稳定性能。进一步分析显示,对输出验证阶段的监督是最有效的干预方式,而增加监督频率则收益递减。这些发现为设计更稳定、可控且与人类对齐的自我进化智能体系统提供了经验证据和实践指导。

英文摘要

Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capability degradation and safety drift. Although human feedback has proven effective for static and post-trained agents, its role in self-evolving systems remains underexplored. We introduce Agent Norm Correction through Human-like Oversight and Review (ANCHOR), an LLM-based framework that simulates human supervision and delivers feedback at various phases of self-evolution. With ANCHOR, we evaluate two representative open-source self-evolving agent systems across coding, mathematical reasoning, and safety. Our results show that even limited supervision substantially mitigates safety degradation while preserving stable performance on core evolutionary objectives. Further analysis shows that supervision over the output verification phase is the most effective for intervention, whereas increasing supervision frequency yields diminishing returns. These findings provide empirical evidence and practical guidance for designing more stable, controllable, and human-aligned self-evolving agent systems.

2606.06076 2026-06-09 cs.AI cs.CV 版本更新

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

通过模态差距感知自蒸馏从符号状态学习视觉空间规划

Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou, Xiu Li

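AI summary: Proposes MGSD, a two-stage framework that bridges the modality gap between visual and symbolic planning through cold-start grounding and privileged-teacher distillation, substantially improving performance on visual planning benchmarks.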

详情
Comments
17 pages, preprint
AI中文摘要

尽管视觉-语言模型在通用多模态理解方面表现出色,但在视觉空间规划上仍存在困难。我们将其归因于感知-推理模态差距:视觉规划要求模型从像素中推断潜在状态结构,然后对恢复的结构进行推理以产生有效动作,而符号规划直接利用显式对象和约束。这造成了视觉状态恢复和多步规划的双重瓶颈。为解决此问题,我们提出MGSD,一种两阶段模态差距感知自蒸馏框架。首先,冷启动接地阶段为视觉学生模型配备可靠的状态表示,最小化早期感知噪声。其次,特权教师通过在线策略蒸馏转移规划能力,使用显式符号状态监督学生自身的视觉 rollout 前缀。关键在于,符号数据仅在训练期间使用,推理完全基于视觉。在视觉规划基准上的实验表明,MGSD在4B和8B骨干网络上均持续提升视觉规划性能,宏观平均值分别提高19.3%和18.4%。所得模型缩小了与符号输入上限的差距,而消融和诊断实验证实改进来自视觉状态恢复和最优路径推理。这些结果表明,模态差距感知自蒸馏不仅改善了模型感知可行动状态的方式,也改善了它们在推断结构上进行规划的能力。代码见 https://github.com/Oranger-l/MGSD。

英文摘要

While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger-l/MGSD.

2606.06033 2026-06-09 cs.RO 版本更新

RealDexUMI: A Wearable Universal Manipulation Interface for Dexterous Robot Learning

RealDexUMI:用于灵巧机器人学习的可穿戴通用操作接口

Chaoyi Xu, Yixuan Jiang, Jiahui Huan, Yuhui Fu, Haoyu Zhou, Weitian Yuan, Jiayi Yu, Wanpeng Zhang, Haoqi Yuan, Zongqing Lu

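AI summary: Proposes RealDexUMI, a wearable universal manipulation interface built on a shared dexterous end-effector module; a palm-side isomorphic teleoperation glove provides retargeting-free, intuitive, and precise hand control, and policies reach an average success rate of 88.75% across eight real-robot tasks.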

详情
AI中文摘要

学习灵巧操作需要演示,这些演示在保持精细手-物体交互的同时,在部署时仍可执行。现有流程要么通过重定向或具身转换损失可部署的灵巧性,要么依赖特定机器人的遥操作,这种遥操作成本高昂且难以扩展,并且通常缺乏用于灵巧数据收集的直观、接触感知控制。我们提出RealDexUMI,一种围绕共享灵巧末端执行器模块构建的可穿戴通用操作接口,该模块集成了轻量级灵巧手、手内视觉和指尖触觉传感。掌侧同构遥操作手套将人类手指输入映射到机器人手关节命令,实现实时、无重定向、直观且精确的手部控制。共享的手和传感模块产生零间隙的末端执行器数据,在收集和部署之间具有匹配的手内观察、触觉信号、接触和手部动作。在涵盖精细、接触丰富、长时域和双臂操作的八项真实机器人任务中,基于RealDexUMI数据训练的策略平均成功率达到88.75%,能够泛化到未见过的初始姿态,并在三种具身之间迁移。网站:https://research.beingbeyond.com/realdexumi

英文摘要

Learning dexterous manipulation requires demonstrations that preserve fine hand-object interactions while remaining executable at deployment. Existing pipelines either lose deployable dexterity through retargeting or embodiment conversion, or rely on robot-specific teleoperation that is costly to scale and often lacks intuitive, contact-aware control for dexterous data collection. We present RealDexUMI, a wearable universal manipulation interface built around a shared dexterous end-effector module that integrates a lightweight dexterous hand, in-hand vision, and fingertip tactile sensing. A palm-side isomorphic teleoperation glove maps human finger inputs to robot-hand joint commands, enabling real-time, retargeting-free, intuitive, and precise hand control. The shared hand and sensing modules yield zero-gap end-effector data, with matched in-hand observations, tactile signals, contacts, and hand actions between collection and deployment. Across eight real-robot tasks spanning fine-grained, contact-rich, long-horizon, and bimanual manipulation, policies trained on RealDexUMI data achieve an average success rate of 88.75%, generalize to unseen initial poses, and transfer across three embodiments. Website: https://research.beingbeyond.com/realdexumi

2606.05932 2026-06-09 cs.AI cs.LG 版本更新

A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

RLVR中自我一致性激发与奖励设计的预注册因果分解

Yuze Gao

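AI summary: Through pre-registered experiments and a causal decomposition, shows that the naive reward-design estimator in RLVR is systematically biased and quantifies the contributions of self-consistency elicitation and genuine reward-design signal.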

详情
Comments
9 pages, 7 figures
AI中文摘要

基于可验证奖励的强化学习(RLVR)即使在奖励信号是虚假的情况下也能提升推理能力——将功劳分配给群体多数答案而非真实验证器。实践者通常将朴素估计量 naive = acc(TRUE) - acc(RANDOM) 解释为奖励设计效应。我们证明该估计量存在系统性偏差:它混淆了自我一致性激发(通过多数伪奖励将策略向众数答案锐化)与真正的奖励设计信号。使用受控的表格GRPO模拟器,我们推导出精确的望远镜分解 total = null + elicit + rd,并在五个先验强度水平上测量每个项。朴素估计量中奖励设计占比从弱先验(ps=0.20)时的0.139变化到强先验(ps=0.80)时的0.05,激发项在自我一致性交叉点处符号翻转。一个预注册的2x2x2析因实验证实了非可加性(交互比0.385;AxC效应-0.089)。一个点与界试点门控表明,强先验区域是点识别的,而接近交叉区域仅是有界的。对两个已命名发表结果的重新审计分别得出“激发主导”(激发份额0.98)和“奖励设计主导”(rd份额1.18)的结论,证明了该分解的诊断价值。我们预先承诺无论翻转结果如何都提交论文;非翻转同样是一个有价值的发现。我们发布一个可复用的单命令工具,供任何对齐论文运行相同的审计。

英文摘要

Reinforcement learning from verifiable rewards (RLVR) improves reasoning even when the reward signal is spurious -- assigning credit to the group-plurality answer rather than a ground-truth verifier. Practitioners commonly interpret naive = acc(TRUE) - acc(RANDOM) as the reward-design effect. We prove this estimand is systematically biased: it conflates self-consistency elicitation (sharpening the policy toward its modal answer via majority pseudo-reward) with genuine reward-design signal. Using a controlled tabular-GRPO simulator we derive an exact telescoping decomposition total = null + elicit + rd and measure each term across five prior-strength levels. The reward-design fraction of the naive estimator ranges from 0.139 at weak prior (ps=0.20) to 0.05 at strong prior (ps=0.80), with the elicitation term flipping sign at the self-consistency crossover. A pre-registered 2x2x2 factorial confirms non-additivity (interaction ratio 0.385; AxC effect -0.089). A points-vs-bounds pilot gate shows strong-prior regimes are point-identified while near-crossover regimes are only bounded. Re-audits of two named published results yield ELICITATION DOMINATED (elicitation share 0.98) and REWARD DESIGN DOMINATED (rd share 1.18) verdicts respectively, demonstrating the diagnostic value of the partition. We pre-commit to submit regardless of flip outcome; a non-flip is a finding of equal standing. We release a reusable one-command harness for any alignment paper to run the same audit.
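
The telescoping identity total = null + elicit + rd, and the way the naive estimator mixes two of its terms, can be checked with simple arithmetic. The sketch below is one assumed instantiation, not the paper's definition: it supposes four training arms (an untrained base policy, a random reward, a majority pseudo-reward, and the true verifier) and defines each term as the accuracy gap between adjacent arms. The arm names and all accuracy values are hypothetical.

```python
def decompose(acc_base, acc_random, acc_majority, acc_true):
    """Telescoping split of the accuracy gain into three adjacent gaps.

    Assumed arm ordering: base -> random reward -> majority reward -> true reward.
    """
    null = acc_random - acc_base        # gain from training on a random reward
    elicit = acc_majority - acc_random  # extra gain from majority pseudo-reward
    rd = acc_true - acc_majority        # extra gain from the true verifier
    total = acc_true - acc_base
    naive = acc_true - acc_random       # equals elicit + rd by construction
    return {
        "null": null,
        "elicit": elicit,
        "rd": rd,
        "total": total,
        "naive": naive,
        "rd_fraction_of_naive": rd / naive if naive != 0 else None,
    }


# Hypothetical accuracies for the four arms.
parts = decompose(acc_base=0.40, acc_random=0.42, acc_majority=0.55, acc_true=0.60)
```

Under these made-up numbers the naive estimator is 0.18, but only about 28% of it is the reward-design term; the rest is elicitation, which is the kind of conflation the abstract describes.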