arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2023
热门方向导航
2606.18997 2026-06-18 cs.LG 新提交

DIPHINE: Diffusion-based $Φ$-ID Neural Estimator

DIPHINE: 基于扩散的 $\Phi$ID 神经估计器

Simon Pedro Galeano Munoz, Mustapha Bounoua, Giulio Franzese, Pietro Michiardi, Maurizio Filippone

发表机构 * KAUST(卡塔尔科学与技术部) EURECOM(欧雷康)

AI总结 提出首个基于扩散模型的神经估计器 DIPHINE,用于计算连续非高斯动力系统的集成信息分解($\Phi$ID),通过单个摊销网络联合估计所有互信息项,并利用 Möbius 逆变换恢复十六个原子。

详情
AI中文摘要

揭示真实世界复杂系统的真实信息架构需要厘清其组件如何随时间独特存储、冗余共享和协同整合信息。集成信息分解($\Phi$ID)是一个框架,用于将多变量系统的信息动态分解为十六个非重叠原子,这些原子表征冗余、独特和协同的信息存储、传输和整合模式。现有的计算 $\Phi$ID 的方法仅限于高斯或离散系统,阻碍了其在连续非高斯动力系统中的应用。我们通过提出 DIPHINE(基于扩散的 $\Phi$ID 神经估计器)来解决这一限制,这是首个利用基于分数的扩散模型从单个摊销网络中联合估计 $\Phi$ID 所需的所有互信息项的神经估计器,并通过 Möbius 逆变换恢复十六个原子。我们提供了通过逆变换的误差传播的理论分析,表明从互信息到原子的映射的雅可比矩阵是整数值的,并且协同到协同原子被证明是最难估计的。我们在合成基准上展示了准确恢复真实原子,与已建立的互信息估计器相比具有优越性能,并在涉及真实数据的应用中无需任何分布假设即可提取生理上可解释的信息动态结构。

英文摘要

Uncovering the true informational architecture of real-world complex systems requires disentangling how their components uniquely store, redundantly share, and synergistically integrate information over time. Integrated Information Decomposition ($Φ$ID) is a framework for decomposing the information dynamics of multivariate systems into sixteen non-overlapping atoms that characterize redundant, unique, and synergistic modes of information storage, transfer, and integration. Existing methods to compute $Φ$ID are restricted to Gaussian or discrete systems, preventing its application to continuous non-Gaussian dynamical systems. We address this limitation by proposing DIPHINE (Diffusion-based $Φ$-ID Neural Estimator), the first neural estimator that leverages score-based diffusion models to jointly estimate all the mutual information terms required by $Φ$ID from a single amortized network, recovering the sixteen atoms through Möbius inversion. We provide a theoretical analysis of error propagation through the inversion, showing that the Jacobian of the mapping from mutual informations to atoms is integer-valued and that the synergy-to-synergy atom is provably the hardest to estimate. We demonstrate accurate recovery of ground-truth atoms on synthetic benchmarks, superior performance compared to established mutual information estimators, and the ability to extract physiologically interpretable information-dynamic structure on an application involving real data without any distributional assumptions.

2606.18992 2026-06-18 cs.CV 新提交

Show, Don't Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

展示,而非询问:基于轮次有效覆盖的生成式视觉消歧用于组合图像检索

Amsisan Tran, Baogh Le, Tuan Kiet Pham, Sui Yang Guang

发表机构 * Amsisan Tran Baogh Le Tuan Kiet Pham Sui Yang Guang

AI总结 提出CLARA框架,通过展示视觉备选面板让用户选择,结合似然比重校准实现多轮覆盖保证,在组合图像检索中有效消歧,优于文本提问基线。

详情
AI中文摘要

组合图像检索(CIR)使用参考图像和文本修改来搜索目标图像。然而,此类查询通常描述多个可能的图像而非一个确切目标,使得用户意图模糊。近期方法通过使用共形预测估计模糊性并向用户提问澄清文本来解决此问题。但这些方法有两个局限:其覆盖保证仅在第一轮交互中成立,且文本问题通常不足以解决细粒度视觉差异,如外观、属性或视角。我们提出CLARA,一种通过向用户展示小型视觉备选面板来消歧的澄清框架。用户无需回答文本问题,只需选择最接近预期目标的原型图像。这提供了直接的视觉信号,并避免依赖模型预测用户答案。为在多轮交互中维持有效的共形保证,CLARA使用用户选择引起的似然比对校准进行重加权。显示的原型也被约束为代表当前候选集,并映射到真实语料库图像,确保生成的图像不能人为提高覆盖。在开放域和时尚基准上的实验表明,CLARA匹配单轮最先进的检索性能,在多轮交互中维持名义覆盖,并在比强文本问题基线更少的轮次中找到预期目标。其优势在模糊性涉及视角或细粒度属性时尤为明显,此时视觉消歧比文本提问更有效。

英文摘要

Composed image retrieval (CIR) uses a reference image and a text modification to search for a target image. However, such queries often describe several possible images rather than one exact target, making the user's intent ambiguous. Recent methods address this by using conformal prediction to estimate ambiguity and by asking users clarifying text questions. However, these methods have two limitations: their coverage guarantee only holds at the first interaction, and text questions are often insufficient for resolving fine-grained visual differences such as appearance, attributes, or viewpoint. We propose CLARA, a clarification framework that resolves ambiguity by showing users a small panel of visual alternatives. Instead of answering text questions, the user simply selects the prototype image closest to the intended target. This provides a direct visual signal and avoids relying on a model to predict the user's answer. To maintain valid conformal guarantees across multiple interaction rounds, CLARA reweights calibration using the likelihood ratio induced by the user's selection. The displayed prototypes are also constrained to represent the current candidate set and are snapped to real corpus images, ensuring that generated images cannot artificially improve coverage. Experiments on open-domain and fashion benchmarks show that CLARA matches single-turn state-of-the-art retrieval performance, maintains nominal coverage across interaction rounds, and finds the intended target in fewer rounds than strong text-question baselines. Its advantage is especially clear when ambiguity involves viewpoint or fine-grained attributes, where visual clarification is more effective than textual questioning.

2606.18989 2026-06-18 cs.CL cs.AI 新提交

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

G-IdiomAlign:基于释义的跨语言习语对齐基准

Fengying Ye, Yanming Sun, Runzhe Zhan, Zheqi Zhang, Lidia S. Chao, Derek F. Wong

发表机构 * NLP 2 CT Lab, Department of Computer and Information Science, University of Macau(NLP 2 CT实验室,计算机与信息科学系,澳门大学) Faculty of Arts and Humanities, University of Macau(人文学院,澳门大学)

AI总结 提出G-IdiomAlign基准,通过维基词典释义锚定习语,构建高置信度对齐集,并设计多项选择等价测试和释义对比生成协议,揭示大语言模型在习语翻译中的字面翻译偏差。

Comments Accepted to ACL 2026

详情
AI中文摘要

习语由于其非组合性和弱表层形式基础,难以跨语言转换,使得字面映射不可靠。我们提出G-IdiomAlign,一个基于释义的基准,其中每个习语通过维基词典的英语释义进行锚定。我们进一步构建了一个高置信度的参考对齐集,用于可重复评估。G-IdiomAlign支持两种协议:(1)受控的多项选择习语等价测试,带有类型化干扰项用于错误归因;(2)释义对比生成,对比无释义和有释义输入,以隔离显式语义枢轴的影响。在不同的大语言模型中,字面翻译偏差是主要的失败模式,尤其是当目标语言是低资源语言时。在基于嵌入的语义代理下,释义一致地改善了释义对比生成,但性能仍然有限,表明在开放输出空间中存在显著提升空间。随后对Qwen3-8B的分析进一步表明,跨条件差异更多集中在注意力头而非层中,而有释义生成更好的情况与更强的释义锚定相关。

英文摘要

Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled Multiple-Choice Idiom Equivalence with typed distractors for error attribution; and (2) a Gloss-Contrastive Generation contrasting No-gloss and With-gloss inputs to isolate the effect of an explicit semantic pivot. Across diverse LLMs, a bias to literal translation is a dominant failure mode, especially when the target is a low-resource language. Glosses consistently improve Gloss-Contrastive Generation under an embedding-based semantic proxy, but performance remains modest, indicating substantial headroom in the open output space. Subsequent analysis on Qwen3-8B further suggests that cross-condition differences are concentrated more in attention heads than in layers, while better With-gloss generations coincide with stronger gloss anchoring.

2606.18988 2026-06-18 cs.AI 新提交

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

ThinkDeception: 一种用于可解释多模态欺骗检测的渐进式强化学习框架

Jinhao Song, Shan Liang, Yiqun Yue, Zhuhuayang Zhang, Tianqi Gao

发表机构 * Xi'an Jiaotong-Liverpool University(西安交通大学利物浦大学)

AI总结 提出ThinkDeception框架,将多模态大语言模型引入欺骗检测,通过逐步推理和视觉-音频一致性组相对策略优化(VAC-GRPO)实现可解释的认知推理,在主流基准上达到新SOTA。

Comments 10pages,4figures

详情
AI中文摘要

多模态欺骗检测对于识别欺诈意图至关重要,然而现有方法主要依赖于端到端的黑箱范式。这些方法严重缺乏可解释性,无法提供透明的推理轨迹,也难以明确捕捉欺骗行为中固有的细微跨模态不一致性。为了超越这些限制,我们提出了ThinkDeception,一个新颖且可解释的多模态欺骗检测框架。作为开创性工作,它将多模态大语言模型(MLLMs)引入该领域,将欺骗检测从传统的二分类任务转变为显式的认知推理过程。借助首个精心标注的逐步多模态思维链(CoT)数据集,我们开发了基础模型ThinkDeception Base,实证验证了模态不一致性在解码欺骗中的关键作用。在此基础之上,我们的核心创新在于提出了配备渐进式训练策略的视觉-音频一致性组相对策略优化(VAC-GRPO)。与标准GRPO不同,我们将训练数据分为四个渐进难度等级,引导模型经历基于心理学的从易到难的认知转变。通过创新地将这一动态课程调度器与多维度的过程感知奖励机制及反思学习范式相结合,我们显著提升了模型的整体推理质量。在主流基准上的大量实验表明,ThinkDeception建立了新的SOTA,在检测准确性和推理质量上均显著优于现有方法。最终,这项工作成功地将欺骗检测领域推向可解释的多模态认知推理。

英文摘要

Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end to end black--box paradigms. These methods suffer from a severe lack of interpretability failing to provide transparent reasoning trajectories and struggling to explicitly capture the subtle, cross modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step--by--step multimodal Chain of Thought (CoT) dataset, we develop a foundational model, ThinkDeception Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization(VAC--GRPO) equipped with a progressive training strategy. Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded easy--to--hard cognitive transition. By innovatively coupling this dynamic curriculum scheduler with a multi dimensional, process aware reward mechanism and a reflective learning paradigm, we significantly elevate the model's overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new SOTA, significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work successfully drives the field of deception detection toward interpretable, multimodal cognitive reasoning.

2606.18986 2026-06-18 cs.CL cs.AI 新提交

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

超越分词:面向时间序列问答的直接时间步嵌入与对比对齐

Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le

发表机构 * Deakin University(德肯大学)

AI总结 提出CADE框架,通过逐点线性编码器直接嵌入每个时间步,避免分词瓶颈,并利用单向监督对比损失对齐时间序列与文本锚点,在Time-MQA基准上提升六项TSQA任务性能。

详情
AI中文摘要

大型语言模型的最新进展催生了时间序列问答(TSQA),它将时间序列分析表述为自然语言问答。然而,直接将原始数值序列输入LLM会遇到分词瓶颈:字节对编码将连续值分割成不稳定的词元,其嵌入缺乏有意义的度量结构,导致幅度、尺度和趋势信息的丢失。先前的方法使用基于分块的编码器将序列分割成固定窗口,锁定单一粒度,这会破坏模式并隐藏确切的时间步,且通过一个在不同长度或采样率的数据集上很少迁移的独立模块实现。为了解决这一挑战,我们提出了CADE(对比对齐与直接嵌入),一个基于两个关键组件构建的TSQA新框架:直接时间步嵌入和语义对齐。该框架通过逐点线性编码器和MLP投影器将每个时间步直接映射到LLM嵌入空间,保留了精确的索引级访问,同时消除了分块和填充的需要。为了进一步弥合时间序列与语言表示之间的语义差距,我们引入了一种新颖的单向监督对比损失,将时间序列嵌入与冻结的类名文本锚点对齐。在公开的Time-MQA基准上的实验结果表明,我们的框架在六项TSQA任务上持续提升了性能,优于开源和专有的LLM基线。

英文摘要

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in one granularity that breaks patterns and hides exact timesteps, through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.

2606.18974 2026-06-18 cs.CV 新提交

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Visual-OPSD:用于高效统一多模态推理的跨模态在策略自蒸馏

Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li, Fangzhi Xu, Jun Liu

发表机构 * Xi’an Jiaotong University(西安交通大学) MOE KLINNS Lab(MOE KLINNS实验室) Shaanxi Province Key Laboratory of Big Data Knowledge Engineering(陕西省大数据知识工程重点实验室) Sun Yat-sen University(中山大学)

AI总结 提出Visual-OPSD方法,通过跨模态在策略自蒸馏,将多步扩散生成的可视化思维推理能力转移到纯文本学生模型,实现14.3倍加速且性能提升3.40个百分点。

详情
AI中文摘要

统一多模态模型(UMMs)将生成的“可视化思维”(VTs)与文本推理交错以改进空间任务。这导致多步扩散带来大约一个数量级的推理成本。我们发现这种成本带来的直接收益有限。在ThinkMorph上,移除或噪声化VTs在九个基准上几乎不改变准确率。一旦渲染,注意力集中在VT上,无论其内容如何。然而,KL诊断表明,以特权VT轨迹为条件会改变模型的完成分布。这表明生成路径编码了超出渲染像素的有用推理。受此差距启发,我们提出了Visual On-Policy Self-Distillation(Visual-OPSD)。教师和学生共享相同权重,但上下文不同:教师看到特权VTs,而学生只看到问题。在策略学生轨迹上的token级JSD蒸馏将教师的推理转移到纯文本学生。在九个基准上,Visual-OPSD相比其生成教师提高了$+3.40$个百分点,加速$14.3\times$(每个样本10.0秒 vs. 142.8秒),并在VSP上比同规模VLM提高了$+63.83$个百分点。高斯噪声控制(真实VT为$+0.40$pp vs. $+10.28$pp)和$58.4\%$的KL差距闭合证实,收益来自生成路径的语义内容。

英文摘要

Unified multimodal models (UMMs) interleave generated ''visual thoughts'' (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation(Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by $+3.40$pp with $14.3\times$ speedup (10.0s vs. 142.8s per sample) and outperforms same-scale VLMs by $+63.83$pp on VSP. A Gaussian-noise control ($+0.40$pp vs. $+10.28$pp for real VTs) and $58.4\%$ closure of the KL gap confirm that gains come from the semantic content of the generation pathway.

2606.18970 2026-06-18 cs.LG cs.AI cs.CV 新提交

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

脑MRI的量子潜GAN增强的受控基准测试

Syed Mujtaba Haider, Silvia Figini

发表机构 * Department of Mathematics(数学系) Department of Political and Social Sciences(政治与社会科学系)

AI总结 通过受控基准测试,比较量子与经典生成器在脑MRI数据增强中的性能,发现两者均未显著优于仅用真实数据训练,且量子生成器无额外优势。

Comments This work has been submitted to the IEEE for possible publication. This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

医学图像分类常受限于有限的标注数据,因此生成式增强被提出;最近,量子生成模型被用于此目的,并经常报告准确率提升。然而,这些声称通常基于单次训练运行,未匹配量子与经典生成器的参数预算,也未表征任何收益出现的数据范围。我们提出了一个受控基准测试,隔离量子生成器对脑MRI增强的贡献。图像被编码到KL正则化的潜在空间中,在该空间中,使用变分量子生成器或参数数量几乎相同的经典生成器(1648 vs. 1632)训练带有梯度惩罚的条件Wasserstein GAN。合成样本被解码并用于增强预训练分类器,覆盖从5%到100%的标注数据比例,通过八个随机种子进行配对显著性检验(多重比较校正)以及集内多样性和潜在分布分析。在所有比例下,没有增强变体显著优于仅用真实数据训练,且量子与经典生成器在统计上无法区分。任何低数据优势表现为正则化而非忠实的数据扩展:合成样本分布外移,并且在数据稀缺时严重模式崩溃,而量子生成器并不比经典生成器更多样化。我们发布该协议作为医学成像中量子生成增强严格评估的测试平台。

英文摘要

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion:synthetic samples are off distribution and severely mode collapsed precisely where data is scarce, and the quantum generator is no more diverse thanits classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.

2606.18967 2026-06-18 cs.LG 新提交

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

EfficientRollout: 面向强化学习推演的感知系统的自推测解码

Minseo Kim, Minjae Lee, Seunghyuk Oh, Kevin Galim, Donghoon Kim, Coleman Hooper, Harman Singh, Amir Gholami, Hyung Il Koo, Wonjun Kang

发表机构 * FuriosaAI University of California, Berkeley(加州大学伯克利分校)

AI总结 针对强化学习推演中自回归解码延迟瓶颈,提出感知系统的自推测解码框架,通过量化自推测解码器与感知系统的推测开关策略,在保持模型质量前提下降低推演和端到端延迟。

Comments Project Page: https://github.com/furiosa-ai/EfficientRollout

详情
AI中文摘要

强化学习(RL)已成为LLMs代表性后训练范式,赋予其强大的推理和智能体能力。然而,推演生成仍是主要的延迟瓶颈,因为自回归采样顺序解码响应,且少量长尾生成往往决定完成时间。推测解码(SD)为缓解此瓶颈提供了自然途径,它是一种用于服务固定LLMs的成熟技术,通过快速草拟令牌并通过并行验证接受它们来降低延迟,同时保持目标模型分布。但其实际加速效果无法直接迁移到RL推演:(i)不断变化的目标策略使得任何固定草拟者与策略输出分布日益不匹配;(ii)推演解码过程中活跃批次大小缩小,解码从计算受限转向内存受限,此时并行验证可利用未充分利用的计算资源。因此,加速RL推演需要草拟者在长序列、高温生成下对演化策略保持有效,以及感知系统的SD使用以避免计算受限状态。我们提出EfficientRollout,一个感知系统的自推测SD框架,旨在解决RL推演中的这一差距。EfficientRollout从目标模型诱导量化草拟者(即自推测解码),使其与演化策略保持耦合,无需单独草拟者预训练或在线适应。它进一步协调感知系统的SD切换策略与接受感知的草稿长度自适应,仅在有益状态下进行推测,同时使草拟预算与演化草拟者质量匹配。EfficientRollout在加速自回归推演基线上分别将推演和端到端延迟降低高达19.6%和12.7%,同时保持最终模型质量。

英文摘要

Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck, as it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from compute-bound to memory-bound regimes where parallel verification can exploit underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-SD framework designed to address this gap for RL rollouts. EfficientRollout induces a quantized drafter from the target model (i.e. self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, over an accelerated AR rollout baseline, while preserving final model quality.

2606.18963 2026-06-18 cs.LG 新提交

Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards

无环境奖励的固定通道感知事件流在线奖惩学习

Zirong Li

发表机构 * Zirong Li(李 Cirong)

AI总结 提出OHIRL框架,在无标量奖励下通过固定通道感知流进行在线奖惩学习,利用内部轨迹评估器推断感知维度的效价,在XOR任务和CartPole等控制任务中达到高准确率。

Comments 9 pages, 5 figures, 6 tables; 13-page technical supplement

详情
AI中文摘要

我们研究当环境不提供标量奖励或评估标签时的在线奖惩学习。在每一步,智能体仅接收一个固定通道的感知数据包,诸如疼痛、能量、接触、损伤或认知错误等量被视为感知维度,其效价必须从转移后果中推断。OHIRL分离了四个角色:M_psi学习下一数据包预测,D_omega建模残差动力学,C_eta是一个固定的内部转移后轨迹评估器,B_xi学习使用由此产生的价值证据进行后续策略更新和动作评分。C_eta采用恢复正性、持久/增长负性的残差调节取向;系数来源审计显示,等单元、原始等值和随机单调变体保留了超过92%的已发布顶级动作排名,而符号反转保留了0%。无奖励协议暴露观察转移,同时隐藏环境奖励、延迟外部评估器、成功标签和动作好坏标签。条件误差分解将B_xi的证据估计误差与残差策略优化误差分离。在2x2-XOR数据包任务中,药物和辣椒在视觉XOR上下文中获得相反的价值,并且相同的疼痛或辣度增加可能根据后果结构为正或负;B_xi达到0.952的平衡奖励符号准确率。在完整的在线交错审计中,M_psi达到留出R2=0.907,B_xi达到0.940的符号准确率,策略达到0.979的最优动作准确率,而即时数据包分数、预测误差奖励、打乱目标、零奖励和误差减少控制均崩溃。隐藏奖励的CartPole和Taxi控制、公共上下文无泄漏审计以及模块角色消融进一步测试了信息边界和组件必要性。

英文摘要

We study online reward-punishment learning when the environment provides no scalar reward or evaluative label. At each step the agent receives only a fixed-channel perceptual packet, and quantities such as pain, energy, contact, damage, or cognitive error are treated as perceptual dimensions whose valence must be inferred from transition consequences. OHIRL separates four roles: M_psi learns next-packet prediction, D_omega models residual dynamics, C_eta is a fixed internal post-transition trajectory evaluator, and B_xi learns to use the resulting value evidence for later policy updates and action scoring. C_eta uses a recovery-positive and persistence/growth-negative residual-regulation orientation; a coefficient-origin audit shows that equal-unit, raw-equal, and random monotone variants preserve more than 92% of the released top-action rankings, while sign inversion preserves 0%. The reward-free protocol exposes observation transitions while withholding environment rewards, delayed external evaluators, success labels, and action-goodness labels. A conditional error decomposition separates B_xi evidence-estimation error from residual policy-optimization error. In a 2x2-XOR packet task, medicine and chili acquire opposite value under visual XOR contexts, and the same pain or spice increase can be positive or negative depending on consequence structure; B_xi reaches 0.952 balanced reward-sign accuracy. In a full online-interleaved audit, M_psi reaches holdout R2=0.907, B_xi reaches 0.940 sign accuracy, and the policy reaches 0.979 optimal-action accuracy, while immediate packet scores, prediction-error rewards, shuffled targets, zero reward, and error-reduction controls collapse. Hidden-reward CartPole and Taxi controls, public-context no-leakage audits, and module-role ablations further test information boundaries and component necessity.

2606.18961 2026-06-18 cs.LG 新提交

Be Your Own Teacher: Steering Protein Language Models via Unsupervised Reward Optimization

做自己的老师:通过无监督奖励优化引导蛋白质语言模型

Lanqing Li, Shentong Mo, Yang Yu, Pheng-Ann Heng

发表机构 * The Chinese University of Hong Kong(香港中文大学) MBZUAI Hong Kong University of Science and Technology(香港科学理工大学)

AI总结 提出无监督奖励优化框架,结合模型不确定性和语义一致性作为代理奖励,通过SRO和BRO算法优化PLMs,在无标签数据下实现可控蛋白质生成,性能接近有监督方法。

Comments 24 pages, 2 figures, 13 tables

详情
AI中文摘要

蛋白质语言模型(PLMs)已成为可控生物分子设计的有力工具,但其后训练适应通常依赖于昂贵的湿实验验证或精心策划的偏好数据集。为了克服这一监督瓶颈,我们引入了PLMs的无监督奖励优化,这是一个无需真实标签即可实现可引导蛋白质生成的综合框架。我们的关键见解是,任务无关的奖励(将内在模型不确定性与由蛋白质表示模型指导的外在语义一致性相结合)在基础模型和温度设置中与可控性度量表现出强相关性。基于这一发现,我们提出了两种离线算法:软奖励优化(SRO)和二值化奖励优化(BRO),它们有效地最大化由这些代理奖励诱导的经典RLHF目标。在组合性分布外提示上的大量实验表明,两种方法均显著优于竞争基线(DPO、KTO),同时在多个采样温度、模型规模和蛋白质家族中接近理想性能。此外,使用无监督奖励微调的PLMs在pass@k评估中相比其基础模型能够实现持续更高的覆盖率。通过使PLMs能够利用自身生成的体验进行自我改进,我们的框架为在标签偏好或实验反馈稀缺或不可用的环境中实现可控生物分子设计提供了一条可扩展的途径。

英文摘要

Protein language models (PLMs) have emerged as powerful tools for controllable biomolecular design, yet their post-training adaptation typically relies on costly wet-lab validation or curated preference datasets. To overcome this supervision bottleneck, we introduce unsupervised reward optimization of PLMs, a comprehensive framework for steerable protein generation without ground-truth labels. Our key insight is that task-agnostic rewards, which combine intrinsic model uncertainty with extrinsic semantic consistency informed by protein representation models, exhibit strong correlation with controllability measures across base models and temperature regimes. Building upon this discovery, we propose two offline algorithms: Soft Reward Optimization (SRO) and Binarized Reward Optimization (BRO), which effectively maximize the classical RLHF objective induced by these proxy rewards. Extensive experiments on compositional out-of-distribution prompts demonstrate that both methods significantly outperform competitive baselines (DPO, KTO), while approaching oracle performance across multiple sampling temperatures, model scales and protein families. Moreover, PLMs fine-tuned with unsupervised rewards can achieve consistently higher coverage compared to their base model in pass@k evaluations. By enabling self-improvement of PLMs through their own generated experience, our framework provides a scalable pathway toward controllable biomolecular design in settings where labeled preferences or experimental feedback are scarce or unavailable.

2606.18960 2026-06-18 cs.CV cs.RO 新提交

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Mem-World:用于持久机器人操作的内存增强动作条件世界模型

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, jun shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

发表机构 * Dalian University of Technology(大连理工大学) Samsung R&D Institute China-Beijing (SRCB)(三星中国北京研究院)

AI总结 提出Mem-World,通过4D腕部视角曲面元索引内存W-VMem,解决操作中因遮挡和运动导致的场景遗忘问题,实现持久世界建模,提升策略评估与改进效果。

详情
AI中文摘要

动作条件世界模型已成为机器人学习的一种有前景的范式,通过生成动作一致的视频推演,为昂贵的真实世界实验提供了可扩展的替代方案。然而,在操作中持久世界建模仍然具有挑战性:频繁的末端执行器遮挡和快速的腕部相机运动使得当前观测不足以预测未来视图,导致模型遗忘或幻觉先前帧中看到的场景细节。现有的内存检索策略在动态操作场景中往往无法识别信息丰富的历史。为解决这一限制,我们提出了Mem-World,一种内存增强的多视图动作条件世界模型。其核心是W-VMem,一种4D腕部视图为中心的曲面元索引内存,将历史观测锚定到随时间演变的表面元素上。通过显式建模场景元素被观测的时间和位置,W-VMem能够根据未来动作实现几何感知的相关历史帧检索。在生成过程中,通过基于曲面元的渲染和评分选择相关历史帧,为预测提供信息丰富且非冗余的上下文。大量实验表明,Mem-World在复杂操作场景中生成持久推演,比Ctrl-World实现更可靠的策略评估,将皮尔逊相关系数提高14.5%,并通过合成数据生成支持有效的策略改进,在长时域任务中将成功率从58%提升到72%。

英文摘要

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5\%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58\% to 72\% on long-horizon tasks.

2606.18959 2026-06-18 cs.RO 新提交

TactSpace: Learning a Physics-enriched Shared Latent Space for Tactile Sim-to-Real Transfer

TactSpace: 学习富含物理信息的共享潜在空间以实现触觉模拟到现实的迁移

Arunim Joarder, Arjun Bhardwaj, René Zurbrügg, Mayank Mittal, Florin Püntener, Sira Bielefeldt, Cosmin Roman, Vaishakh Patil, Marco Hutter

发表机构 * Robotic Systems Lab, ETH Zürich(瑞士苏黎世联邦理工学院机器人系统实验室) Micro- and Nanosystems Lab, ETH Zürich(瑞士苏黎世联邦理工学院微纳系统实验室) ETH AI Center(苏黎世联邦理工学院人工智能中心) NVIDIA(NVIDIA公司)

AI总结 提出多模态表示学习框架TactSpace,通过共享潜在空间对齐异构触觉模态,实现零样本模拟到现实迁移,在力预测和形状重建任务中分别降低误差16.7%和45.8%。

Comments 9 pages, 6 figures, 4 tables, accepted into IROS 2026

详情
AI中文摘要

触觉传感提供了对机器人操作至关重要的接触相互作用的直接测量。然而,当前的模拟器缺乏足够保真度来忠实模拟触觉传感器的复杂变形和换能机制,严重阻碍了机器人学习流程中的模拟到现实迁移。为了解决这一挑战,我们提出了一种多模态表示学习框架,该框架在共享潜在空间内对齐异构触觉模态,消除了对精确原始信号模拟的需求,同时保留了相关的接触信息。我们的方法采用模态特定编码器将不同的触觉观测(例如模拟穿透深度和真实电容)投影到公共嵌入空间中。该模型使用自重建和交叉重建目标以及对比对齐进行训练,鼓励模态不变且信息丰富的表示。我们在压头形状识别、力预测和几何重建任务上评估学习到的嵌入,仅在模拟中训练并直接在真实传感器测量上测试。我们的结果展示了跨物理不同表示的零样本模拟到现实迁移。此外,结合多物理模拟模态产生了更信息丰富的嵌入,这些嵌入可跨不同下游任务迁移,力预测误差降低16.7%,形状重建误差降低45.8%。最后,我们为Isaac Lab发布了一个基于Warp的高效罚函数触觉模拟模型实现,支持可扩展的触觉数据生成。

英文摘要

Tactile sensing provides direct measurements of contact interactions that are essential for robotic manipulation. However, current simulators lack the fidelity to faithfully model the complex deformation and transduction mechanics of tactile sensors, severely hindering sim-to-real transfer in robot learning pipelines. To address this challenge, we propose a multi-modal representation learning framework that aligns heterogeneous tactile modalities within a shared latent space, eliminating the need for accurate raw-signal simulation while preserving relevant contact information. Our approach employs modality-specific encoders to project diverse tactile observations, such as simulated penetration depth and real-world capacitance, into a common embedding space. The model is trained using self- and cross-reconstruction objectives alongside contrastive alignment, encouraging modality-invariant yet information-rich representations. We evaluate the learned embeddings on indenter shape identification, force prediction, and geometric reconstruction tasks, training exclusively in simulation and testing directly on real sensor measurements. Our results demonstrate zero-shot sim-to-real transfer across physically dissimilar representations. Furthermore, incorporating multi-physics simulation modalities yields more informative embeddings that transfer across diverse downstream tasks, demonstrating a 16.7% reduction in force prediction error and a 45.8% reduction in shape reconstruction error. Finally, we release an efficient Warp-based implementation of a penalty-based tactile simulation model for Isaac Lab, enabling scalable tactile data generation.

2606.18955 2026-06-18 cs.CV cs.RO 新提交

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

运动聚焦的潜在动作使跨实体VLA训练能从人类自我中心视频中学习

Runze Xu, Yiluo Zhang, Jian Wang, Yu Wang, Jincheng Yu

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) Tianfu Jiangxi Laboratory(天府江西实验室)

AI总结 提出基于潜在动作的框架,利用混合解耦VQ-VAE从无标签人类视频中提取通用动作先验,通过意图-感知解耦策略减少动作幻觉,仅需50条轨迹即可适配下游任务。

Comments Accepted to IROS 2026

详情
AI中文摘要

训练通用视觉-语言-动作(VLA)模型通常需要大量、多样化的机器人数据集,并带有高保真动作标注。尽管自我中心的人类操作视频丰富且捕捉了显著的环境多样性,但缺乏动作标签使其难以在传统训练范式下使用。为解决这一问题,我们提出了一种基于潜在动作的框架,旨在从无标签人类视频中提取通用动作先验。该架构采用混合解耦VQ-VAE,通过物理掩码将运动动态与环境背景解耦,从而构建跨实体动作码本。通过在人类视频上使用码本进行预训练,VLM骨干网络学习到动作意图的深层表示。为了适应特定实体,我们引入了一种意图-感知解耦策略,其中VLM预测动作意图,而一个独立的冻结视觉编码器为动作专家提供状态特定特征,从而减少动作幻觉。在仿真和真实环境中的结果表明,我们的方法仅在无标签人类视频上预训练,与在大量标注数据集上训练的最先进VLA模型相比具有竞争力,且仅需50条轨迹进行下游适配。

英文摘要

Training generalist Vision-Language-Action(VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.

2606.18954 2026-06-18 cs.CL 新提交

GraphPO: Graph-based Policy Optimization for Reasoning Models

GraphPO:基于图的推理模型策略优化

Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学北京校区人工智能学院) Ant Group(蚂蚁集团)

AI总结 提出GraphPO框架,将推理轨迹建模为有向无环图,通过合并语义等价路径减少冗余探索,并利用边级优势函数提高推理效率,在多个基准上优于链式和树式方法。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为增强大型推理模型能力的标准范式。RLVR通常独立采样响应并根据最终答案优化策略。该范式有两个局限性:首先,独立响应常包含相似的中间推理步骤,导致冗余探索和计算浪费;其次,稀疏的最终答案奖励难以识别有用步骤。基于树的方法通过共享前缀并比较同一前缀下的分支来提供细粒度信号,部分解决了这一问题。然而,树分支仍然是独立扩展的。当不同分支达到相似的推理状态时,它们无法共享信息并重复类似的探索。此外,基于树的方法忽略了这种分散性,仅在不同分支内进行局部比较,这可能导致优势估计的方差更高。为了解决这一挑战,我们提出了GraphPO(基于图的策略优化),一种新颖的RL框架,将轨迹表示为有向无环图,其中推理步骤作为边,从推理路径中总结的语义状态作为节点。GraphPO将语义等价的推理路径合并为等价类,允许它们共享后缀,并将预算从冗余扩展重新分配到多样化探索。此外,我们为入边分配效率优势,为出边分配正确性优势,从而在从结果中推导过程监督的同时提高推理效率。理论表明,GraphPO降低了优势估计方差并提高了推理效率。在三个LLM上的推理和智能体搜索基准实验表明,在相同的token预算或响应预算下,GraphPO始终优于基于链和基于树的基线方法。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using from final answers. This paradigm has two limitations. First, independently responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.

2606.18953 2026-06-18 cs.RO 新提交

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

面向零样本仿真到现实VLA增强的以对象为中心的残差强化学习

Kinam Kim, Namiko Saito, Heecheol Kim, Katsushi Ikeuchi, Jaegul Choo, Yasuyuki Matsushita

发表机构 * KAIST(韩国科学技术院) Microsoft Research Asia - Tokyo(微软亚洲研究院-东京) The University of Tokyo(东京大学)

AI总结 提出以对象为中心的残差强化学习框架,在仿真中训练策略,零样本迁移到真实机器人,将VLA模型成功率从42%提升至76%。

Comments 8 pages, 7 figures, 2 tables; 8-page appendix

详情
AI中文摘要

视觉-语言-动作(VLA)模型能够泛化到多种操作任务,但其基于模仿学习的策略在精确物理交互中因执行误差累积而脆弱;能否仅在仿真中训练的强化学习策略零样本提升真实世界VLA的鲁棒性?残差强化学习在冻结的VLA之上学习修正策略,提供了一个自然框架,但现有方法面临根本的仿真到现实困境:特权状态方法需要有损蒸馏才能部署;基于图像的方法存在视觉域差距;而真实世界强化学习成本高且不安全。我们提出一种以对象为中心的残差强化学习框架,利用对象姿态优化VLA动作,从而构建一个在仿真和现实之间一致迁移的紧凑观测空间。为对齐两个域,我们额外在仿真中重放相同的遥操作演示,以训练真实世界VLA的仿真对应物。残差强化学习策略仅在仿真中通过姿态噪声注入和丢弃进行训练,并零样本迁移到真实机器人。在真实Franka Research 3(FR3)机器人的五个操作任务上,我们的方法将成功率从42%零样本提升至76%,且改进后的轨迹可进一步用于重新训练基础VLA以实现自我改进,无需额外遥操作。项目页面:此https URL

英文摘要

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors; Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment; image-based methods suffer from the visual domain gap; and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, enabling a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot, and the improved rollouts can be further reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/

2606.18952 2026-06-18 cs.CV 新提交

SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

SP-TransientBench: 一个真实捕获的单光子感知基准

Hongzhou Dong, Zili Zhang, Ziting Wen, Yiheng Qiang, Runrong Deng, Wenle Dong, Ziwen Jiang, Xinyang Li, Rui Lu, Shuoyao Sun, Wenyu Wang, Ziyi Xia, Haitao Zheng, Guodong Shi, Xiaoqiang Ren

发表机构 * Shanghai University(上海大学) Southern University of Science and Technology(南方科技大学) The University of Sydney(悉尼大学)

AI总结 针对单光子LiDAR在真实场景中因噪声和多回波瞬态现象导致的感知挑战,提出包含10个场景、10297个视角的真实捕获多任务基准STB,支持深度估计、多视图重建和3D语义理解评估。

详情
AI中文摘要

基于单光子雪崩二极管(SPAD)传感的单光子LiDAR(SPL)能够以极高灵敏度进行时间分辨光子测量,为光子匮乏环境下的主动3D感知提供了独特潜力。然而,由于独特的测量噪声和复杂的多回波瞬态现象,真实世界的单光子感知仍然面临根本性挑战,这些因素共同使几何重建和语义场景理解变得复杂。尽管对基于SPAD的传感兴趣日益增长,现有研究大多局限于模拟数据或小规模受控捕获。因此,在深度估计、多视图重建和3D语义理解方面,对真实世界单光子感知的系统评估仍未得到充分探索。为弥补这一空白,我们引入了SP-TransientBench(STB),一个真实捕获的多任务单光子感知基准。STB包含10个多样化场景和10297个视图,使用固态单光子LiDAR以256×192分辨率捕获。每个视图提供具有多回波行为的完整飞行时间直方图、标准化元数据和用于多视图评估的校准相机位姿。我们还为选定场景提供了13类3D语义标注。通过为每个任务提供专用数据划分和评估协议,STB能够在多个3D视觉问题上实现真实世界单光子感知的一致且可重复的基准测试。数据集和代码将在接收后发布。

英文摘要

Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved scenarios.However, real-world single photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single photon perception. SP-TransientBenc comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at $256\times192$ resolution. Each view provides full time-of-flight histograms with multi-return behavior,standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.

2606.18950 2026-06-18 cs.AI 新提交

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench: 视觉语言模型战略推理的RTS基准

San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi

发表机构 * Seoul National University(首尔国立大学)

AI总结 提出RTSGameBench,基于Beyond All Reason游戏,通过多样化对战、迷你游戏诊断和自进化生成框架,评估视觉语言模型在实时策略游戏中的战略推理能力。

Comments First two authors contributed equally

详情
AI中文摘要

现代视觉语言模型(VLM)在竞争和合作环境中的不确定性下,往往难以进行战略推理,即预测和影响其他智能体的行为。实时策略(RTS)游戏可以作为诊断这一局限性的自然测试平台,因为它们要求与盟友协调、适应对手策略,并在部分可观测性下进行长期规划。然而,现有的RTS基准评估范围有限,缺乏系统的能力诊断,并且局限于预设计的场景覆盖。为了解决这些限制,我们提出了RTSGameBench,它建立在Beyond All Reason之上,这是一款大规模RTS游戏,其扩展战场要求比现有测试平台更广泛的策略多样性。该基准通过多种对战结构提供评估,通过迷你游戏进行诊断性评估,每个迷你游戏针对单个战略能力,并通过自进化生成框架实现可扩展的覆盖,该框架将自由形式的查询转化为新的迷你游戏,并在连续循环中改进。此外,为了让VLM在大规模RTS游戏中运行,我们提供了RTSGameAgent,它通过具有智能体记忆的有限状态机(FSM)管理单位。我们通过实验验证,多个最先进的VLM在对战需要更紧密协调、多智能体协调以及任务规模增加时表现不佳。

英文摘要

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategy, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain fixed in the pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than the existing testbeds. The proposed benchmark provides evaluations through diverse gameplay across various matchup structures, diagnostic assessment via mini-games, each targeting an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games, improving over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent that manages units by an FSM with agentic memory. We empirically validate that multiple state-of-the-art VLMs do not perform well when matchups demand tighter coordination, multiagent coordination and when task scale increases.

2606.18948 2026-06-18 cs.RO 新提交

C-ARC: Continuous-Adaptive Range Clustering for Non-Repetitive LiDAR Sensors

C-ARC: 面向非重复式LiDAR传感器的连续自适应范围聚类

Nick B. Schroeder, Jonathan Lichtenfeld, Oskar von Stryk

发表机构 * Technical University of Darmstadt(德累斯顿技术大学) Simulation, Systems Optimization and Robotics Group(仿真、系统优化与机器人组)

AI总结 提出C-ARC框架,通过滑动窗口上的持久双图结构解耦高频点插入与按需聚类检索,并利用指数控制环自适应校准网格分辨率,实现非重复式LiDAR点云的实时聚类。

Comments Submitted to IEEE Robotics and Automation Letters. This work has been submitted to the IEEE for possible publication. 8 pages, 7 figures

详情
AI中文摘要

实时LiDAR聚类识别点云中的结构,是许多移动机器人算法的重要前提。当前方法主要针对重复式机械LiDAR传感器开发。近年来,由于成本和外形尺寸小,非重复式LiDAR传感器的使用显著增加。这类基于Risley棱镜的非重复传感器违反了重复式机械传感器的两个关键假设:结构化的扫描线和明确的帧边界。其Rhodonea曲线轨迹产生非均匀点分布,且缺乏旋转周期使得传统扫描线索引无法适用。为满足这些新需求,我们开发了C-ARC,一个连续自适应范围聚类框架,它在滑动窗口上维护一个持久双图,将高频点插入与按需聚类检索解耦。这对于SLAM或跟踪等关键功能至关重要。自适应范围网格分辨率机制在初始化时使用指数控制环校准网格尺寸,无需预先了解扫描模式即可平衡稀疏-碰撞权衡。作为开源的单线程C++17库实现,C-ARC在商用硬件上对Livox Mid-360以20 Hz产生实时聚类输出。在Livox Avia上的评估表明,对于扫描模式高度集中的传感器,无界单元占用是主要限制。自适应分辨率机制还提高了现有基于网格的方法在非重复数据上的聚类质量。

英文摘要

Real-time LiDAR clustering identifies structures in point clouds, which is an essential prerequisite for many mobile robotics algorithms. Current methods are mostly developed for repetitive mechanical LiDAR sensors. Recently, the use of non-repetitive LiDAR sensors is strongly increasing due to their small cost and form factor. Such non-repetitive Risley prism-based sensors violate two key assumptions of repetitive mechanical sensors: structured scan lines and well-defined frame boundaries. Their Rhodonea-curve trajectories produce non-uniform point distributions, and the absence of a rotation cycle renders conventional scan line indexing inapplicable. To meet such new requirements, we developed C-ARC, a Continuous-Adaptive Range Clustering framework that maintains a persistent dual-graph over a sliding window, decoupling high-frequency point insertion from on-demand cluster retrieval. This is crucial for key functionalities like SLAM or tracking. An adaptive range grid resolution mechanism calibrates grid dimensions at initialization using an exponential control loop, balancing the sparsity-collision trade-off without prior knowledge of the scanning pattern. Implemented as an open-sourced single-threaded C++17 library, C-ARC produces real-time cluster output at 20 Hz on commodity hardware for the Livox Mid-360. Evaluation on the Livox Avia identifies unbounded cell occupancy as the primary limitation for sensors with strongly concentrated scan patterns. The adaptive resolution mechanism additionally improves clustering quality for existing grid-based methods on non-repetitive data.

2606.18947 2026-06-18 cs.AI cs.CL cs.IR cs.MA 新提交

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

将搜索与推理解耦:面向LLM Agent的供应商无关的接地架构

Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das

发表机构 * DoorDash, Inc.(DoorDash公司)

AI总结 提出解耦搜索接地(DSG)架构,将搜索接地从推理模型中分离,通过MCP兼容网关实现供应商路由、缓存等控制,在降低成本和延迟的同时保持或提升准确性。

Comments 15 pages, Figure 8

详情
AI中文摘要

生产级LLM Agent越来越依赖实时搜索,但原生搜索接地将检索策略、供应商选择、证据注入、成本、延迟和生成行为捆绑在单一模型-供应商边界内。这种耦合使得接地难以检查、调优、重用或移植,并可能触发搜索诱导的冗长,破坏严格的输出合约。我们提出解耦搜索接地(DSG),一种供应商无关的边界,通过MCP兼容网关将接地移出推理模型,将供应商路由、源感知上下文渲染、配置的回退、检索深度控制以及精确和语义缓存作为一级控制暴露。在SimpleQA、FreshQA和HotpotQA上的五个前沿模型上,原生搜索在时效性敏感的FreshQA上领先,但DSG在控制重要时展现出更强的前沿:在SimpleQA上,它以91%更低的搜索成本接近原生准确率(86.1%对87.7%),保持简洁答案合约,并以68%更低的延迟达到99.4%的热缓存命中率。作为大规模Agent工作负载的共享生产接地层部署,DSG在电商查询理解(QIU)工作负载上匹配或略超原生搜索准确率,同时将搜索成本降低超过98%。实时接地最好被视为可优化的接口边界,而非固定的模型特性。

英文摘要

Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.

2606.18946 2026-06-18 cs.CL 新提交

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

SenFlow: 面向混合文档中AI生成文本检测的句间流建模

Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

发表机构 * Northwestern Polytechnical University(西北工业大学) Zhejiang Lab(浙江实验室)

AI总结 针对人机混合文档的句子级AI文本检测,提出SenFlow模型,通过图传播和CRF解码建模句间依赖,在MOSAIC基准上跨域F1提升4.15个百分点。

Comments 16 pages, 4 figures, 9 tables

详情
AI中文摘要

针对混合文档(人类与LLM共同撰写同一文本)的句子级AI生成文本检测(S-AGTD)面临两个空白:现有方法孤立地对每个句子进行分类,忽略了句间依赖;现有基准遗漏了最新一代生成器。我们构建了MOSAIC基准,包含来自PubMed和XSum的16,000个混合文档,由DeepSeek-V3.2和Kimi K2生成,并经过严格质量控制,包括先前基准中缺失的困惑度一致性过滤器。我们将S-AGTD重新定义为文档句子序列上的结构化预测,并实例化为SenFlow,在句子图的单次文档级传递中,将基于图的句间传播与线性链CRF解码相结合。SenFlow在MOSAIC上达到了最先进的性能,在跨域迁移(三种难度递增协议中最难的一种)上平均Macro-F1提高了4.15个百分点。我们进一步发现,即使困惑度过滤器平衡了显式线索,AI插入仍然保留了一个依赖于生成器的句子长度差距,句子级检测器仍可利用这一点。代码和数据:此 https URL

英文摘要

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow

2606.18943 2026-06-18 cs.CV 新提交

Physics-IQ Verified

物理智力验证

Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth

发表机构 * Anates Labs(Anates实验室) Technical University of Munich(慕尼黑技术大学) University of Technology Nuremberg(纽伦堡技术大学) Tuebingen AI Center, University of Tuebingen(图宾根大学人工智能中心) Helmholtz AI, Munich(慕尼黑海德堡人工智能研究所) Google DeepMind research(谷歌DeepMind研究)

AI总结 本文提出Physics-IQ Verified基准,通过改进提示和地面真实质量及引入样本级评分系统,提升视频生成模型对物理现实的理解评估,验证结果表明基准提升了57.6%的样本和34.8%的提示。

详情
AI中文摘要

视频生成模型(VGMs)已成为新的前沿,不仅用于视频生成,还用于多种下游任务,包括世界建模。为推进这些任务,一个良好的视频模型必须理解世界的物理现实。评估这种理解成为新兴领域,催生了Physics-IQ基准,通过将模型生成的视频与真实物理实验视频进行比较来量化。本文系统审计了Physics-IQ基准,揭示不足并提出三种解决方案,改进如何衡量VGMs的物理理解。具体而言,我们提高了提示和地面真实质量以减少混淆因素影响,并进一步引入样本级评分系统,使每个样本和指标权重相等。我们的基准Physics-IQ Verified优化了57.6%的所有样本并改进了超过34.8%的提示。在使用六个图像到视频生成模型的比较研究中,我们观察到中等但有意义的排名变化(Kendall's τ=0.46)。我们希望Physics-IQ Verified通过提供更可靠的信号推动社区发展,向物理准确的VGMs迈进。该基准的代码可通过此https URL访问。

英文摘要

Video generative models ( VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6\% of all samples and improves over 34.8\% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's $τ= 0.46$). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark

2606.18936 2026-06-18 cs.AI cs.CY 新提交

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

SciRisk-Bench:面向AI4Science安全的风险维度感知基准

Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng

发表机构 * Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China(脑启发认知智能实验室,自动化研究所,中国科学院,北京,中国) School of Future Technology, University of Chinese Academy of Sciences, China(未来技术学院,中国科学院大学,中国) School of Artificial Intelligence, University of Chinese Academy of Sciences, China(人工智能学院,中国科学院大学,中国) Zhongguancun Academy, China(中关村学院,中国) Beijing Key Laboratory of Safe AI and Superalignment(北京安全人工智能与超对齐重点实验室) Gaoling School of AI, Renmin University of China(甘露人工智能学院,中国人民大学) Beijing Institute of AI Safety and Governance (Beijing-AISI)(北京人工智能安全与治理研究院(北京-AISI)) School of Humanities, University of Chinese Academy of Sciences, China(人文学院,中国科学院大学,中国)

AI总结 提出SciRisk-Bench基准,从显式风险维度和科学学科两个角度评估AI4Science安全,覆盖7个学科、31个子学科和10个风险维度,实验揭示主流及科学大模型的安全薄弱环节。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地嵌入到人工智能驱动的科学(AI4Science)工作流程中,从科学问答和文献分析到实验室规划和自主发现。这一进展迫切需要对安全基准进行评估,不仅要评估科学能力,还要评估模型是否能在高风险的科学背景下识别和避免风险。现有的AI4Science安全数据集涵盖多个学科和任务格式,但潜在的风险维度未得到充分说明。我们引入了\textbf{SciRisk-Bench},这是一个旨在从两个互补视角评估AI4Science安全的基准:显式风险维度和科学学科。SciRisk-Bench涵盖7个学科、31个子学科和10个风险维度。在实验部分,我们评估了主流LLMs和面向科学的LLMs在风险维度、学科和子学科上的表现,从而能够细粒度地诊断科学模型在哪些方面仍然不安全。

英文摘要

Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats, leaving the underlying risk dimensions underspecified. We introduce \textbf{SciRisk-Bench}, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines and 10 risk dimensions. In the experimental section, we evaluate both mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and sub-disciplines, enabling fine-grained diagnosis of where scientific models remain unsafe.

2606.18933 2026-06-18 cs.LG cs.IR stat.ME 新提交

Zero-Shot Active Feature Acquisition via LLM-Elicitation

基于LLM启发式的零样本主动特征获取

Binyamin Perets, Natalie Mendelson, Shiran Vainberg, Yehuda Chowers, Shai Shen-Orr, Shie Mannor

发表机构 * Faculty of EE, Technion(技术学院电子工程系) Faculty of Medicine, Technion(技术学院医学院) CytoReason NVIDIA

AI总结 提出通过LLM启发式获取马尔可夫随机场充分统计量的零样本主动特征获取框架,解决数据标注不足问题,在IBD患者诊断中优于现有方法。

详情
AI中文摘要

主动特征获取(AFA)顺序选择要观察的特征以达成分类或排序决策。其主要局限性在于依赖大量标注数据来拟合指导获取的概率模型。大型语言模型(LLM)提供无监督的领域知识,但作为序列规划者表现不佳。要求其同时知晓和决策会混淆最好分开的能力。这里,我们通过严格的启发式方法开发了一个零样本AFA框架:仅要求LLM返回其可被信任返回的内容,即马尔可夫随机场(MRF)的充分统计量——一元偏差和成对协变。我们将该框架应用于两个场景:二分类和top-$k$识别。实践中,LLM可靠地仅返回判别性统计量,即区分类别而非孤立每个类别的统计量,这阻碍了经典AFA。我们应用最大熵闭包来解决这种规范模糊性。我们在炎症性肠病(IBD)患者队列上进行评估,这是一个活跃的临床环境,其中诊断模糊性和患者异质性阻碍了稳定的治疗策略。我们的框架在真实标签和其自身提取的信念上均优于LLM。在最关键的地方,即最困难的患者上,我们的top-$k$获取策略显著优于所有现有方法。

英文摘要

Active feature acquisition (AFA) sequentially selects which features to observe to reach a classification or ranking decision. Its central limitation is reliance on large amount of labeled data to fit probabilistic models guiding acquisition. Large language models (LLMs) supply unsupervised domain knowledge, but are poor sequential planners. Asking one to both know and decide conflates capabilities best kept separate. Here, we develop a framework for zero-shot AFA through disciplined elicitation: asking the LLM only for what it can be trusted to return, the unary deviations and pairwise co-variations that are the sufficient statistics of a Markov random field (MRF). We apply our framework to two settings: binary classification and top-$k$ identification. In practice, the LLM reliably returns only discriminative statistics, what distinguishes the classes rather than each class in isolation, which precludes classical AFA. We apply a maximum-entropy closure that resolves this gauge ambiguity. We evaluate on a cohort of Inflammatory Bowel Disease (IBD) patients, an active clinical setting where diagnostic ambiguity and patient heterogeneity obstruct stable treatment strategies. Our framework outperforms the LLM both on real labels and on its own extracted beliefs. Where it matters most, on the hardest patients, our top-$k$ acquisition policy markedly outperforms all existing methods.

2606.18924 2026-06-18 cs.SD 新提交

Who Wins the Conflict? Mechanistic Interpretability of Text Bias in Audio LLMs

谁赢得冲突?音频大模型中文本偏差的机制可解释性

Hyebin Cho, Suho Yoo, Jaehyuk Jang, Changick Kim, Joon Son Chung

发表机构 * School of Electrical Engineering, KAIST(韩国科学技术院电子工程学院)

AI总结 本文通过机制分析揭示音频大模型中的文本主导偏差,发现文本路径主动抑制完整音频表征,并提出无训练干预方法back-patching以增强音频表征,缓解文本主导。

Comments Preprint

详情
AI中文摘要

虽然音频大模型在多模态理解方面表现出色,但它们存在文本主导偏差,即模型盲目偏向文本而忽视声学证据,导致幻觉。然而,当音频和文本输入相互矛盾时,这些模型内部行为的底层机制尚未被探索。在这项工作中,我们通过追踪内部表征在层间的传播,首次对这一现象进行了机制分析。我们的研究揭示了三个关键发现:(i)文本主导在模型中系统性地且经验性地存在;(ii)虽然文本和音频依赖功能不同的路径,但它们最终在后期层中汇聚到一个共享语义空间;(iii)文本路径不会擦除音频信息,而是主动抑制完整的音频表征。基于这些见解,我们利用back-patching,一种无训练干预方法,将后期层的音频激活路由回早期层。这放大了音频表征,使其能够克服文本抑制。我们的评估表明,back-patching持续减少文本主导,为冲突下的机制性多模态对齐铺平了道路。

英文摘要

While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms underlying how these models behave when audio and textual inputs contradict each other remain unexplored. In this work, we present the first mechanistic analysis of this phenomenon by tracing the propagation of internal representations across layers. Our investigation reveals three key findings: (i) text dominance is systematically and empirically across models; (ii) while text and audio rely on functionally distinct pathways, they ultimately converge into a shared semantic space in late layers; and (iii) the text pathway does not erase audio information, but rather actively suppresses intact audio representations. Building on these insights, we leverage back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers. This amplifies the audio representations, enabling them to overcome textual suppression. Our evaluation shows that back-patching consistently reduces text dominance, paving the way for mechanistic multimodal alignment under conflict.

2606.18923 2026-06-18 cs.LG 新提交

GrapNet: A Programmable Dynamic-Architecture Neural Graph Substrate

GrapNet: 一种可编程的动态架构神经图基板

Zirong Li

发表机构 * Zirong Li(李子荣)

AI总结 提出GrapNet,一种将图作为可执行架构的神经基板,通过可编程接口支持结构编辑、冻结子图、局部审计等操作,在Split Fashion-MNIST和Split CIFAR-10上分别提升12.08和3.81个百分点的准确率。

Comments 8 pages, 1 figure, preprint

详情
AI中文摘要

可编程性是固定张量神经网络中缺失的一流接口:编辑关系、冻结子图、审计局部函数或更改执行后端应是对神经程序的操作,而非临时参数手术。GrapNet研究这种图即网络的设置。图是架构和可执行程序,而非输入数据图。每个计算节点拥有其下一层子节点引用和与这些引用对齐的可训练分配向量;删除关系会物理移除子节点引用和相应的分配坐标。结构规则和执行策略位于节点核心之外,因此同一子节点拥有的图可以被增长、冻结、结构编辑、分组为可训练族块、通过注意力在活动关系上路由,或在拓扑稳定后降级为密集快照。GrapNet通过向量值父接口与常规模块组合:密集层、CNN编码器、ResNet特征提取器、注意力块和Transformer表示都可以为每个坐标提供一个感知GrapNode。评估组织为可编程性压力测试套件,而非新的重放基准。在匹配的十种子Split Fashion-MNIST研究中,可塑GrapNet+ER头在相同已见类损失和重放记忆下达到63.16%的已见类准确率,而参数更大的密集MLP+ER为51.08%,配对差值为12.08点,p=1.3e-5。在Split CIFAR-10上使用冻结的ImageNet ResNet-18编码器时,相同基板将在线头比MLP-256提高3.81点,p=0.0026。这些结果支持GrapNet作为可编辑的神经图基板,其核心价值在于具有忠实执行视图的结构可编程性。

英文摘要

Programmability is a missing first-class interface in fixed-tensor neural networks: editing a relation, freezing a subgraph, auditing a local function, or changing the execution backend should be an operation on the neural program rather than ad-hoc parameter surgery. GrapNet studies this graph-as-network setting. The graph is the architecture and executable program, not an input data graph. Each compute node owns its next-layer child references and a trainable allocation vector aligned with those references; deleting a relation physically removes both the child reference and the corresponding allocation coordinate. Structural rules and execution policies live outside the node core, so the same child-owned graph can be grown, frozen, structurally edited, grouped into trainable family blocks, routed by attention over active relations, or lowered to dense snapshots after topology stabilizes. GrapNet composes with conventional modules through a vector-valued parent interface: dense layers, CNN encoders, ResNet feature extractors, attention blocks, and transformer representations can all feed one sensory GrapNode per coordinate. The evaluation is organized as a programmability stress suite rather than as a new replay benchmark. In a matched ten-seed Split Fashion-MNIST study, a plastic GrapNet+ER head reaches 63.16 percent seen-class accuracy versus 51.08 percent for a parameter-larger dense MLP+ER under the same seen-class loss and replay memory, with paired delta 12.08 points and p=1.3e-5. On Split CIFAR-10 with a frozen ImageNet ResNet-18 encoder, the same substrate improves the online head over MLP-256 by 3.81 points, with p=0.0026. These results support GrapNet as an editable neural graph substrate whose core value is structural programmability with faithful execution views.

2606.18922 2026-06-18 cs.CL cs.AI 新提交

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

像火箭科学一样简单:评估大型语言模型解释比喻语言中否定能力的研究

Jasmine Owers, Edwin Simpson, Martha Lewis

发表机构 * Intelligent Systems Lab University of Bristol(智能系统实验室 英国布里斯托尔大学) ILLC University of Amsterdam(阿姆斯特丹大学语言学研究所)

AI总结 本研究通过开发新的注释数据集,测试多种大型语言模型在比喻语言中理解否定的能力,发现否定与比喻的组合对模型构成挑战,且性能高度依赖提示风格。

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

详情
AI中文摘要

比喻语言和否定是当前语言模型面临挑战的两个领域,然而,两者在书面和口语中广泛使用。大型语言模型(LLMs)也广泛应用于日常场景,在这些场景中它们不一定能针对特定数据集进行调整。因此,理解LLMs正确解释包含否定和比喻语言的文本的能力至关重要。为了研究这一点,我们为现有的比喻语言数据集开发了一套新的注释,并在该数据集上测试了一系列语言模型。我们发现,否定和比喻性的结合可能带来特殊挑战,并且整体性能以及不同否定类型上的性能特别依赖于所使用的提示风格。

英文摘要

Figurative language and negation are two areas that challenge current language models, however, both are widely used throughout written and spoken language. Large language models (LLMs) are also widely used in everyday contexts where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand the ability of LLMs to correctly interpret text that includes both negation and figurative language. To investigate this, we develop a set of new annotations to an existing dataset of figurative language, and test a range of language models on the dataset. We find that the combination of negation and figurativeness can present a particular challenge, and that performance overall and across different negation types is particularly dependent on the prompt style used.

2606.18918 2026-06-18 cs.LG cs.CC 新提交

Some Complexity Results for Robustness Verification for Binarized Neural Networks

二值化神经网络鲁棒性验证的一些复杂性结果

Harshit Goyal, Sudakshina Dutta

发表机构 * Indian Institute of Technology Goa(印度理工学院Goa)

AI总结 本文通过从布尔可满足性问题归约证明二值化神经网络的可满足性是NP完全的,并利用均匀遮挡导致的网络输出分段常数结构,提出多项式时间鲁棒性检查算法。

详情
AI中文摘要

本文研究了二值化神经网络(BNNs)验证问题的计算复杂性,其中激活函数(有时权重)是二值的。我们分析了两个问题:可满足性和均匀图像遮挡下的鲁棒性。我们通过从布尔可满足性问题(SAT)归约证明BNN可满足性是NP完全的,并且均匀遮挡在网络输出中诱导出分段常数结构,从而实现了多项式时间的鲁棒性检查算法。

英文摘要

This paper studies the computational complexity of verification problems for Binarized Neural Networks (BNNs), where activations (and sometimes weights) are binary. We analyze two problems: satisfiability and robustness under uniform image occlusion. We show that BNN satisfiability is NP-complete via a reduction from Boolean satisfiability problem (SAT), and that uniform occlusion induces a piecewise-constant structure in the network output, enabling a polynomial-time robustness-checking algorithm.

2606.18910 2026-06-18 cs.LG cs.CL 新提交

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

REVES:通过修订与验证增强的测试时扩展训练

Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

发表机构 * Northwestern University(西北大学) Amazon AGI(亚马逊人工智能实验室) Qualcomm AI Research(高通人工智能研究) University of Minnesota(明尼苏达大学)

AI总结 提出REVES框架,通过将中间步骤的“接近正确”答案转化为解耦的修订和验证提示,实现高效的离策略数据生成,提升大语言模型的多步推理能力,在LiveCodeBench上比强化学习基线高6.5分。

详情
AI中文摘要

通过顺序修订进行测试时扩展已成为增强大语言模型(LLM)推理能力的强大范式。然而,标准的后训练方法主要优化单次目标,与多步推理动态存在根本性不匹配。虽然最近的工作将其视为多轮强化学习(RL),但传统方法直接优化多步轨迹,未能进一步利用模型可以从纠正中学习的中间步骤中的高质量错误。我们提出了一个两阶段迭代框架,交替进行在线数据/提示增强和策略优化。通过将成功恢复轨迹中的中间步骤(“接近正确”答案)转化为解耦的修订和验证提示,我们的方法将训练集中在有效的答案转换和错误识别上。与标准的多轮RL相比,这种方法实现了高效的离策略数据生成,并减少了长程采样的计算开销。在LiveCodeBench上,使用公开可用的测试用例作为反馈,我们观察到比RL基线高6.5分,比标准多轮训练高4.0分。除了编码,我们的方法在圆填充问题上达到了先前报告的SOTA结果,同时使用了最小的基础模型(4B)和远少于更大进化搜索系统的采样次数。在真实验证下的数学结果进一步证实了改进的纠正能力。该方法还泛化到分布外的约束满足谜题,如n皇后和迷你数独,其中正确性完全由问题约束定义。代码可在该https URL获取。

英文摘要

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that model can learn from correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps (``near-miss'' answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n\_queens and mini\_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.

2606.18906 2026-06-18 cs.CV 新提交

BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

BindEdit: 驯服注意力泄漏以实现精确的多目标图像编辑

Chaewon Park, Soyoon Lee, Naeun Lee, Minjung Shin, Seogkyu Jeon, Kibeom Hong

发表机构 * Sookmyung Women’s University(成均女性大学) Yonsei University(延世大学) Samsung Research(三星研究院)

AI总结 针对多目标图像编辑中的语义混合和对象重复问题,提出BindEdit方法,通过联合正则化交叉注意力和自注意力、交叉注意力重平衡机制及区域保真项,在单次扩散轨迹内抑制注意力泄漏,实现精确编辑。

Comments Preprint

详情
AI中文摘要

真实图像编辑能够精确操作视觉内容,但现有方法在复杂的多目标场景中常常失败,导致语义混合、对象重复或编辑不完整。我们将这些失败归因于注意力泄漏,即在去噪过程中,跨空间区域和文本标记的信号变得纠缠。具体来说,我们识别出两种不同形式的泄漏:编辑-标记泄漏,其中模糊的标记-区域对齐导致对象混合;以及源主导泄漏,其中未改变的源对象的标记压倒了目标实体应有的注意力。为了解决这些泄漏,我们提出了\textbf{BindEdit},它在单次扩散轨迹内强制执行注意力级别的约束。为了抑制编辑-标记泄漏,BindEdit联合正则化交叉注意力和自注意力,使得每个目标标记组绑定到其对应的空间区域,同时保持实例级别的分离。为了抑制源主导泄漏,一种交叉注意力重平衡机制放大目标标记的影响,并减弱可编辑区域内残留的源语义。此外,区域保真项确保每个目标概念在整个编辑掩码中连贯表达。另外,我们提出了一个全面的多目标基准,涵盖不同的对象数量和类别。大量实验表明,BindEdit在单次扩散轨迹内始终优于现有方法,在单目标和多目标编辑场景中均保持稳健性能。

英文摘要

Real image editing enables precise manipulation of visual content, yet existing methods often fail in complex multi-object scenarios, causing semantic blending, object duplication, or incomplete edits. We attribute these failures to attention leakage, where signals across spatial regions and text tokens become entangled during the denoising process. Specifically, we identify two distinct forms of leakage: Edit-Token Leakage, where ambiguous token-region alignment leads to object blending, and Source Dominance Leakage, where tokens of unchanged source objects overwhelm the attention intended for target entities. To resolve these leakages, we propose \textbf{BindEdit}, which enforces attention-level constraints within a single diffusion trajectory. To suppress Edit-Token Leakage, BindEdit jointly regularizes cross- and self-attention so that each target token group is bound to its corresponding spatial region while maintaining instance-level separation. To suppress Source Dominance Leakage, a cross-attention re-balancing mechanism amplifies target token influence and attenuates residual source semantics within editable regions. Moreover, a region fidelity term ensures that each target concept is expressed coherently across the entire editing mask. Additionally, we propose a comprehensive multi-object benchmark encompassing diverse object counts and categories. Extensive experiments demonstrate that BindEdit consistently outperforms existing methods within a single diffusion trajectory, maintaining robust performance across both single- and multi-object editing scenarios.

2606.18902 2026-06-18 cs.CL 新提交

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

SAGE: 基于智能体引导探索的随机提示优化

Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen

发表机构 * Slingshot AI Department of Engineering, University of Cambridge(剑桥大学工程系)

AI总结 提出随机提示优化框架SPO,其中SAGE方法通过多智能体诊断代码执行实现黑盒搜索,在多个基准测试中表现依赖于错误类型,并在心理健康聊天机器人中通过连续优化显著提升次日留存率。

详情
AI中文摘要

上下文工程已成为无需参数更新即可改进AI系统的主要手段。最近研究表明文本梯度并非真实梯度,这促使我们将自动提示优化(APO)视为黑盒搜索。我们引入了SPO(随机提示优化),一个在提示空间上进行随机搜索的框架,并比较了三种复杂度递增的策略:基于错误信息的随机搜索、带有进化算子的遗传算法以及SAGE(基于智能体引导探索的SPO),后者是一个具有诊断代码执行的多智能体流水线。在三个基准测试中,没有单一策略占主导地位;有效性取决于景观结构与错误类型的相互作用。我们进一步在连续优化范式下将SAGE部署到一个心理健康聊天机器人上,它将八个个体噪声A/B测试周期累积为次日留存率的统计显著提升。我们认为,将定性诊断与定量验证相结合是使智能体优化对开放式任务导向对话有效的关键。

英文摘要

Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.