arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.00508 2026-06-02 cs.CV cs.AI

V-LynX: Token Interface Alignment for Video+X LLMs

V-LynX: 视频+X 大语言模型的令牌接口对齐

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

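Affiliations * Yonsei University, Seoul, South Korea; Ewha Womans University, Seoul, South Korea

AI summary This paper identifies a continuous manifold, the token interface, inside Video LLMs and proposes V-LynX, a framework that uses a lightweight auxiliary pathway to align attention responses and statistical distributions, integrating new modalities without paired supervision. It reports state-of-the-art results and efficiency on audio-visual QA, 3D reasoning, and other video understanding tasks.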

Comments ICML 2026 Camera-ready

详情
AI中文摘要

本研究揭示了视频大语言模型中的一个有趣现象:视频大语言模型不仅仅是简单地将帧转换为文本嵌入,而是建立了一个连续流形——令牌接口,使得视觉令牌能够在架构内作为独立实体运行。利用这一发现,我们提出了V-LynX,这是一个可扩展的框架,通过重新利用内部化接口,将新模态集成到视频大语言模型中。与需要大量模态特定编码器或配对监督的传统范式不同,V-LynX采用轻量辅助路径与冻结的视觉编码器并行运行。我们的方法通过使用非配对单模态数据集对齐注意力响应和统计分布,将新的感官输入与内在视频先验相结合。这确保了流形兼容性,同时保持了视频大语言模型的完整性。大量基准测试表明,V-LynX在音视频问答、3D推理、高帧率和多视角视频理解方面达到了最先进水平和高效性。代码可在https://github.com/park-jungin/lynx获取。

英文摘要

This study identifies an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs establish a continuous manifold, the token interface, that allows visual tokens to operate as standalone entities within the architecture. Exploiting this discovery, we propose V-LynX, a scalable framework that integrates novel modalities into Video LLMs by repurposing the internalized interface. Departing from conventional paradigms that necessitate heavy modality-specific encoders or paired supervision, V-LynX employs a lightweight auxiliary pathway in parallel with the frozen vision encoder. Our method integrates new sensory inputs with intrinsic video priors by aligning both attention responses and statistical distributions using unpaired unimodal datasets. This ensures manifold compatibility while preserving the integrity of the Video LLM. Extensive benchmarks demonstrate that V-LynX achieves state-of-the-art performance and efficiency across audio-visual QA, 3D reasoning, high-frame-rate, and multi-view video understanding. The code is available at https://github.com/park-jungin/lynx.

2606.00507 2026-06-02 cs.CL

LaSR: Context-Aware Speech Recognition via Latent Reasoning

LaSR:通过潜在推理实现上下文感知的语音识别

Heyang Liu, Ziyang Cheng, Jiayi Huang, Wenyang Xiao, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

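Affiliations * Shanghai Jiao Tong University; Ant Group

AI summary Proposes LaSR, a training paradigm that uses latent reasoning trajectories to improve the contextual awareness of Speech LLMs, significantly improving recognition of academic terminology without adding latency.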

详情
AI中文摘要

近期语音大语言模型(Speech LLMs)的进展显著增强了口语理解与推理能力。然而,其上下文感知能力有限,难以有效反映说话者意图和主题上下文的语音识别。本文提出LaSR(潜在语音推理),一种新颖的训练范式,具有利用潜在推理过程的上下文感知推理轨迹。LaSR不生成显式中间令牌,而是将思维链(CoT)监督对齐到目标单词的声学特征区域,并引入潜在推理阶段用于上下文信息锚定和转录转换。此外,为有效基准测试专业词汇的上下文识别,我们提出Spoken Darwin-Science,一个专注于学术术语的大规模语料库。在Fun-Audio-Chat上的初步实验表明,LaSR显著提升了术语识别,且不引入额外延迟,并持续优于标准监督微调基线。我们的发现凸显了潜在推理在构建高效、上下文感知语音助手方面的潜力。

英文摘要

Recent advances in Speech Large Language Models (Speech LLMs) have significantly enhanced spoken language understanding and reasoning. However, their contextual awareness is limited, and they struggle to perform speech recognition that effectively reflects the speaker's intent and topical context. In this paper, we propose LaSR (Latent Speech Reasoning), a novel training paradigm featuring a context-aware reasoning trajectory that leverages the latent reasoning process. Instead of generating explicit intermediate tokens, LaSR aligns chain-of-thought (CoT) supervision around the acoustic feature region of the targeted word, and introduces latent reasoning periods for context information grounding and transcriptional transition. Furthermore, to effectively benchmark contextual recognition on specialized vocabulary, we propose Spoken Darwin-Science, a large-scale corpus focusing on academic terminology. Preliminary experiments on Fun-Audio-Chat demonstrate that LaSR significantly improves terminology recognition without introducing additional latency and consistently outperforms standard supervised fine-tuning baselines. Our findings highlight the potential of latent reasoning in building efficient, context-aware speech assistants.

2606.00506 2026-06-02 cs.AI cs.LG

EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction

EnergyMamba:一种用于能耗预测的具有不确定性感知的图增强选择性状态空间模型

Dahai Yu, Rongchao Xu, Lin Jiang, Guang Wang

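Affiliations * Florida State University

AI summary Proposes EnergyMamba, which combines a Graph-Enhanced Selective State Space Model (GE-Mamba) with an Adaptive Sequential Conformalized Quantile Regression (AS-CQR) module for joint spatiotemporal modeling and uncertainty quantification, reporting improvements of about 5% in prediction accuracy and about 6% in uncertainty quantification for energy consumption prediction.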

Comments Accepted by KDD 2026 AI4S

详情
AI中文摘要

能耗预测对于高效的电网管理、需求侧优化和可持续能源规划至关重要。尽管先进的机器学习方法已被用于提高预测性能,但现有工作存在两个关键局限:(1)通常将任务视为纯时间序列预测问题,未显式建模不同区域间的空间依赖关系;(2)在极端天气等异常情况下无法提供带有不确定性估计的可靠预测。为推进现有研究,我们提出EnergyMamba,一种具有不确定性感知的时空学习框架,用于准确可靠的能耗预测,包含两个关键组件:(i)一种新颖的图增强选择性状态空间模型(GE-Mamba),将从电网拓扑中学到的空间上下文注入时间动态,实现耦合的时空建模;(ii)自适应序列分位数回归(AS-CQR)模块,包括局部自适应归一化和在线反馈机制,以在潜在分布偏移下动态校准预测区间。我们在来自佛罗里达、纽约和加利福尼亚的四个大规模真实数据集上评估EnergyMamba。结果表明,与15个最先进的基线相比,EnergyMamba在预测准确率上提升约5%,在不确定性量化上提升约6%。

英文摘要

Energy consumption prediction is essential for efficient grid management, demand-side optimization, and sustainable energy planning. Although advanced machine learning methods have been employed for better prediction performance, existing works have two key limitations: (1) they usually formulate this task as a purely time-series prediction problem without explicitly modeling the spatial dependencies among different regions, and (2) they fail to provide reliable predictions with uncertainty estimates under abnormal situations such as extreme weather events. To advance existing research, we propose EnergyMamba, an uncertainty-aware spatiotemporal learning framework for accurate and reliable energy consumption prediction, which comprises two key components: (i) a novel Graph-Enhanced Selective State Space Model (GE-Mamba) that injects spatial context learned from the grid topology into the temporal dynamics, enabling coupled spatiotemporal modeling, and (ii) an Adaptive Sequential Conformalized Quantile Regression (AS-CQR) module, which includes locally adaptive normalization and an online feedback mechanism to dynamically calibrate prediction intervals under potential distribution shifts. We evaluate EnergyMamba on four large-scale real-world datasets from Florida, New York, and California. Results show EnergyMamba achieves around 5% improvement in prediction accuracy and 6% improvement in uncertainty quantification over 15 state-of-the-art baselines.
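As background for the interval-calibration idea in the abstract above, the following is a minimal sketch of plain split conformalized quantile regression, the standard technique that AS-CQR builds on by name. It is not the paper's AS-CQR module: it omits the locally adaptive normalization and the online feedback mechanism, and the function names `cqr_calibrate` and `cqr_interval` are hypothetical.

```python
import math
import numpy as np

def cqr_calibrate(q_lo, q_hi, y, alpha=0.1):
    """Compute the conformal margin from held-out calibration data.

    q_lo, q_hi: predicted lower/upper quantiles on the calibration set.
    y: observed targets on the calibration set.
    alpha: target miscoverage rate (0.1 aims for roughly 90% coverage).
    """
    q_lo, q_hi, y = map(np.asarray, (q_lo, q_hi, y))
    # Conformity score: how far y falls outside [q_lo, q_hi] (negative if inside).
    scores = np.maximum(q_lo - y, y - q_hi)
    n = len(scores)
    # Rank of the finite-sample conformal quantile, clipped to n (an
    # illustrative simplification; the exact rule would give an infinite margin).
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[k - 1]

def cqr_interval(q_lo_new, q_hi_new, margin):
    """Widen (or shrink, if margin < 0) new quantile predictions by the margin."""
    return np.asarray(q_lo_new) - margin, np.asarray(q_hi_new) + margin
```

A typical use would be to fit any quantile regressor on training data, call `cqr_calibrate` on a separate calibration split, and then apply `cqr_interval` to test-time predictions.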

2606.00503 2026-06-02 cs.LG cs.AI

TabChange: Precise Attribute Changes in Tabular Data

TabChange: 表格数据中的精确属性变化

Arjun Dahal, Yu Lei, Raghu N. Kacker, Richard Kuhn

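Affiliations * The University of Texas at Arlington; National Institute of Standards and Technology; Information Technology Laboratory

AI summary Addresses the problem that modifying an attribute in tabular data can break naturalness. Proposes TabChange, which analyzes relationships between attributes and uses an adversarial framework to remove attribute information from the latent space, enabling precise and natural attribute modification.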

详情
AI中文摘要

修改表格数据中的属性通常会破坏其与其他属性的关系,从而产生不自然的实例。修改后的实例必须既自然又与原始实例变化最小。本文解决了生成这种修改实例的挑战。我们识别了现有方法的关键局限性:生成模型要么不支持实例级属性编辑,要么像CVAE这样的方法在潜在空间中保留属性信息,导致不必要的修改。为了解决这个问题,我们提出了TabChange,一种分析数据集中目标属性与其他属性关系的方法。如果关系较弱,它直接翻转属性;如果关系较强,它使用对抗框架去除潜在空间表示中的属性信息。这种去除使得能够进行精确修改,只进行必要的调整以保持自然性。我们在七个数据集上的实验表明,TabChange生成的属性反事实在自然性方面与基线相当,并且更接近原始实例。与基线相比,这导致了更多有效的反事实和更少的无效反事实。

英文摘要

Modifying an attribute in tabular data often introduces an unnatural instance by breaking its relationships with other attributes. The modified instance must be both natural and minimally changed from the original instance. This paper addresses the challenge of generating such a modified instance. We identify key limitations in existing approaches: generative models either do not support instance-level attribute editing or, in the case of methods like CVAE, retain attribute information in the latent space, leading to unnecessary modifications. To solve this, we propose TabChange, an approach that analyzes the relationship between the attribute of interest and other attributes in the dataset. If the relationship is weak, it simply flips the attribute; if it is strong, it uses an adversarial framework that removes information about the attribute in the latent space representation. This removal enables precise modifications, making only the necessary adjustments to maintain naturalness. Our experiments across seven datasets show that TabChange generates attribute counterfactuals that are comparable in naturalness to those of the baselines and are more proximal to their original instances. This leads to a higher number of valid counterfactuals and a lower number of invalid counterfactuals compared to the baselines.

2606.00499 2026-06-02 cs.CV

OptiWorld: Optimal Control for Video World Generation under Physical Constraints

OptiWorld: 物理约束下的视频世界生成最优控制

Yu Yuan, Jianhao Yuan, Xijun Wang, Daiqing Li, Liu He, Lu Ling, Stanley H. Chan

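Affiliations * Purdue University; University of Oxford; SixteenMiles Labs

AI summary Proposes OptiWorld, a framework that combines classical optimal control with video generation at inference time: it extracts a compact world state, plans an optimal trajectory, and generates video conditioned on that trajectory, yielding dynamics optimized under physical constraints.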

Comments Project Page: https://yuyuanspace.com/OptiWorld/

详情
AI中文摘要

视频生成模型正成为一种可扩展的世界模型形式,但它们主要生成合理的运动,而非主动控制或优化底层动态。因此,生成视频中的物体可能遵循不安全、不光滑、低效或物理不一致的轨迹。在这项工作中,我们提出了OptiWorld,一个在推理时将经典最优控制引入视频生成的框架。OptiWorld首先提取紧凑的、与任务相关的世界状态,然后在物理约束下规划最优轨迹,最后基于该轨迹渲染视频。我们将规划表述为连续流形上的几何问题,将3D几何和任务相关的物理约束转化为统一的规划几何。通过添加这一最优控制层,OptiWorld生成具有更优动态的视频,在多个任务中展现出强大潜力,包括目标条件的图像到视频生成、视频动态编辑和反事实生成。

英文摘要

Video generation models are becoming a scalable form of world models, but they mainly generate plausible motion rather than proactively control or optimize the underlying dynamics. As a result, an object in the generated video may follow trajectories that are unsafe, not smooth, inefficient, or physically inconsistent. In this work, we propose OptiWorld, a framework that brings classical optimal control into video generation at inference time. OptiWorld first extracts a compact, task-relevant world state, then plans an optimal trajectory under physical constraints, and finally renders the video conditioned on this trajectory. We formulate planning as a geometric problem on a continuous manifold, which converts 3D geometry and task-dependent physical constraints into a unified planning geometry. By adding this optimal-control layer, OptiWorld generates videos with preferable dynamics, demonstrating strong potential in multiple tasks including goal-conditioned image-to-video generation, video dynamics editing, and counterfactual generation.

2606.00496 2026-06-02 cs.LG

Torus Graphs for Large Scale Neural Phase Analysis

大规模神经相位分析的环面图模型

Jack Goffinet, Casey Hanks, David E. Carlson

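Affiliations * University of California, Berkeley

AI summary Proposes a stochastic score matching method that reduces the cost of torus graph inference from O(d^6) to O(d^2) per iteration, enabling models with thousands of variables, and extends torus graphs to hidden Markov and autoregressive models for analyzing brain-state-dependent phase coupling and directional interactions.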

Comments 23 pages, 15 figures; to be published in ICML 2026

详情
AI中文摘要

振荡神经信号(如脑电图和局部场电位)表现出协调跨脑区通信的相位关系。现代记录在多个频率区间捕获数百个通道,但标准相位分析仅限于少数变量。环面图模型是一种相位上的指数族分布,其单变量和成对势函数推广了冯·米塞斯分布,推断振荡之间的结构化关系,但仅建模静态无向依赖,且由于得分匹配推断复杂度为O(d^6),仅限于约100个变量。我们引入一种随机得分匹配过程,将每次迭代成本降至O(d^2),使得能够对数千变量的数据集进行推断。这一可扩展基础支持对来自多电极LFPs的1,860个频率-相位特征进行分析,并实现了之前环面图或经典圆形统计无法实现的两种扩展:(i) 捕获状态依赖的相位耦合变化(例如睡眠期间纺锤波相关状态)的环面图隐马尔可夫模型,以及(ii) 通过传递熵估计推断方向性交互的自回归环面图。应用于LFP记录,这些模型揭示了清醒和NREM睡眠之间状态依赖的相位交互模式。它们共同实现了对大脑和认知状态中动态和方向性相位关系的系统性大规模映射。

英文摘要

Oscillatory neural signals such as electroencephalography (EEG) and local field potentials (LFPs) show phase relationships that coordinate communication across brain regions. Modern recordings capture hundreds of channels across many frequency bins, yet standard phase analyses are restricted to only a few variables. The Torus Graph (TG) model, an exponential-family distribution over phases whose univariate and pairwise potentials generalize von Mises distributions, infers principled structure among oscillations but models only static, undirected dependencies and is limited to $\sim \! 100$ variables because its score matching inference scales as $\mathcal{O}(d^{6})$. We introduce a stochastic score matching procedure that reduces the per-iteration cost to $\mathcal{O}(d^{2})$, enabling inference on datasets with thousands of variables. This scalable foundation supports analyses of 1,860 frequency-phase features from multi-electrode LFPs and enables two extensions previously inaccessible to TGs or classical circular statistics: (i) a TG Hidden Markov Model capturing state-dependent phase-coupling changes (e.g., spindle-related states during sleep) and (ii) an autoregressive TG inferring directional interactions via transfer-entropy estimation. Applied to LFP recordings, these models reveal state-dependent phase-interaction patterns between wakefulness and NREM sleep. Together, they enable systematic, large-scale mapping of dynamic and directional phase relationships across brain and cognitive states.
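For orientation, the exponential-family form described in the abstract above (univariate and pairwise potentials over phases) is commonly written as follows in earlier torus graph literature. The symbols $\boldsymbol{\phi}_j$ and $\boldsymbol{\phi}_{jk}$ are notational assumptions here and may not match this paper's notation.

```latex
p(\mathbf{x}; \boldsymbol{\phi}) \propto \exp\Bigg\{
  \sum_{j=1}^{d} \boldsymbol{\phi}_j^{\top}
  \begin{bmatrix} \cos x_j \\ \sin x_j \end{bmatrix}
  + \sum_{j<k} \boldsymbol{\phi}_{jk}^{\top}
  \begin{bmatrix}
    \cos(x_j - x_k) \\ \sin(x_j - x_k) \\ \cos(x_j + x_k) \\ \sin(x_j + x_k)
  \end{bmatrix}
\Bigg\},
\qquad \mathbf{x} \in [0, 2\pi)^{d}
```

Under this parameterization there are $2d$ univariate and $4 \cdot d(d-1)/2$ pairwise parameters, $2d^{2}$ in total, so $d = 1{,}860$ features would correspond to $6{,}919{,}200$ parameters.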

2606.00487 2026-06-02 cs.AI

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

TAPS: 面向扩散草稿推测解码的目标感知前缀树选择

Zhuoyu Wang, Junnan Huang, Xinyu Chen

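Affiliations * The Hong Kong University of Science and Technology (Guangzhou)

AI summary Proposes TAPS, which uses target-aware prefix tree selection to make verification of diffusion-model drafts more efficient, achieving up to 7.9x lossless speedup.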

详情
AI中文摘要

使用扩散模型进行并行草稿是推测解码的一种有前景的方法。通过在单次前向传播中预测多个未来位置的token,扩散草稿器显著降低了草稿延迟。然而,这会将瓶颈转移到验证上:验证单个序列限制了接受长度,而验证大型草稿树会导致过度的目标模型延迟。我们发现了现有草稿树方法中的一个关键不匹配:现有的扩散树方法按边际概率对节点排序,忽略了验证是前缀条件化的。因此,它们可能会验证被拒绝前缀的不可达后代,从而增加延迟而接受增益有限。为了解决这个问题,我们提出了TAPS,一种目标感知的前缀选择方法,将扩散边际转化为路径条件化的接受估计。然后,TAPS在固定的验证预算下选择一个紧凑的前缀封闭子树,改善接受-成本权衡,而不是简单地扩展草稿树。跨不同数据集和模型族的实验表明,TAPS在无损端到端速度上比普通自回归解码最高提升7.9倍,分别比最先进的DFlash和DDTree提升1.36倍和1.74倍。我们的工作可在https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD获取。

英文摘要

Using a diffusion model for parallel drafting is a promising approach for speculative decoding. By predicting tokens at multiple future positions in a single forward pass, diffusion drafters substantially reduce drafting latency. However, this shifts the bottleneck to verification: verifying a single sequence limits acceptance length, while verifying large draft trees incurs excessive target-model latency. We identify a key mismatch in existing draft-tree methods: existing diffusion-tree methods rank nodes by the marginal probability, ignoring that verification is prefix-conditioned. As a result, they may verify unreachable descendants of rejected prefixes, increasing latency with limited acceptance gains. To address this, we propose TAPS, a target-aware prefix selection method that turns diffusion marginals into path-conditioned acceptance estimates. TAPS then selects a compact prefix-closed subtree under a fixed verification budget, improving the acceptance-cost tradeoff rather than simply expanding the draft tree. Experiments across diverse datasets and model families demonstrate that TAPS achieves up to 7.9x lossless end-to-end speedup over vanilla autoregressive decoding, outperforming state-of-the-art DFlash and DDTree by 1.36x and 1.74x respectively. Our work is available at https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD
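The idea of choosing a prefix-closed subtree under a fixed verification budget can be illustrated with a small best-first search. This is a sketch under strong assumptions, not the TAPS algorithm: it scores each path by the product of per-position draft marginals as a crude stand-in for a path-conditioned acceptance estimate, and the function name and input layout are hypothetical.

```python
import heapq

def select_prefix_closed(candidates, budget):
    """Pick up to `budget` draft-tree nodes, best path score first.

    candidates[d] is a list of (token, prob) pairs for draft position d.
    A node is identified by its token path from the root. Because each
    prob is at most 1, a child's path score never exceeds its parent's,
    and a child only enters the heap after its parent has been selected,
    so the selected set is prefix-closed by construction.
    """
    selected = []
    heap = []  # entries are (-path_score, path); heapq is a min-heap
    if candidates:
        for tok, p in candidates[0]:
            heapq.heappush(heap, (-p, (tok,)))
    while heap and len(selected) < budget:
        neg_score, path = heapq.heappop(heap)
        selected.append((path, -neg_score))
        depth = len(path)
        if depth < len(candidates):
            # Expand children; neg_score * p equals -(path_score * p).
            for tok, p in candidates[depth]:
                heapq.heappush(heap, (neg_score * p, path + (tok,)))
    return selected
```

Ranking every node by its own marginal instead could select a deep node whose parent was never chosen, which is the kind of unreachable descendant the abstract describes.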

2606.00477 2026-06-02 cs.CL cs.CV

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

文本编辑能否泛化到视觉生成?统一多模态模型中的跨模态知识编辑基准

Xin Gao, Cheng Yang, Chufan Shi, Taylor Berg-Kirkpatrick

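Affiliations * University of California, Berkeley; Stanford University; University of Toronto; University of Washington

AI summary Introduces UniKE, a benchmark for cross-modal knowledge editing, finds that text edits transfer poorly to image generation (best overall VQA accuracy of only 18.5%), and proposes Reasoning-augmented Parameter Editing to improve cross-modal transfer.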

Comments Published at ICML 2026; Code and data available at https://github.com/gxx27/UniKE

详情
AI中文摘要

统一多模态模型(UMMs)已成为通用多模态智能的有前途的范式。随着它们在现实世界应用中的部署,有效更新内部知识变得至关重要。虽然知识编辑在纯文本模型中已经成熟,但成功修改文本输出的编辑是否也能迁移到UMMs中的图像生成仍不清楚。为了研究这个问题,我们引入了UniKE,这是第一个用于UMMs中跨模态知识编辑的基准,包含2,971个编辑主题,涵盖属性和关系编辑。使用基于VQA的视觉验证,我们揭示了一个显著的模态差距:文本侧的有效性可以达到约92%,而直接图像生成下的最佳整体VQA准确率仅为18.5%。我们进一步提出了推理增强参数编辑,它在生成前显式激活编辑后的知识,并提高了所有评估模型-编辑器对的整体VQA准确率,提升高达18.6个百分点。机制分析表明,这种差距与编辑后的文本表示与视觉生成的条件路径之间的部分对齐有关,其中足以用于文本输出的编辑可能仍然太弱或未对齐,无法引导图像合成。这些发现表明,文本知识编辑不能保证可靠的跨模态迁移,并激励了模态感知的编辑方法。我们的代码和数据可在https://github.com/gxx27/UniKE获取。

英文摘要

Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While knowledge editing has matured for text-only models, it remains unclear whether edits that successfully modify textual outputs also transfer to image generation in UMMs. To study this question, we introduce UniKE, the first benchmark for cross-modality knowledge editing in UMMs, comprising 2,971 edit subjects spanning attribute and relation edits. Using VQA-based visual verification, we reveal a striking modality gap: text-side efficacy can reach approximately 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%. We further propose Reasoning-augmented Parameter Editing, which explicitly activates edited knowledge before generation and improves overall VQA accuracy for all evaluated model-editor pairs, with gains up to 18.6 percentage points. Mechanistic analysis shows that this gap is associated with partial alignment between edited textual representations and the conditioning pathways for visual generation, where edits sufficient for text outputs may remain too weak or misaligned to steer image synthesis. These findings show that textual knowledge edits do not guarantee reliable cross-modality transfer and motivate modality-aware editing methods. Our code and data are available at https://github.com/gxx27/UniKE.

2606.00476 2026-06-02 cs.AI

Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

做他们所说的,而不是他们所推理的:定位LLM智能体中的忠实性差距

Yufeng Wang

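Affiliations * University of California, Berkeley

AI summary Studies whether LLM agents act on their stated reasoning by decomposing the faithfulness gap into two steps, reasoning-conclusion and conclusion-action, in a controlled Texas Poker simulator.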

Comments submitted to COLM social simulation with LLM workshop

详情
AI中文摘要

LLM智能体是否按照其陈述的推理行动?这个问题关于过程忠实性对于在社交模拟中使用LLM至关重要,但在没有正确行为参考的情况下很难衡量。我们在一个受控环境中研究这个问题,即一个德克萨斯扑克模拟器,其中每个决策都有一个可验证的参考行动,通过将忠实性差距分解为两个步骤:推理-结论和结论-行动。这两个步骤的行为相反。

英文摘要

Do LLM agents act on the reasoning they state? This question of process fidelity is central to using LLMs in social simulation, yet it is hard to measure where no reference for correct behavior exists. We study it in a controlled setting, a Texas Poker simulator with a verifiable reference action for every decision, by decomposing the faithfulness gap into two steps: reasoning-conclusion and conclusion-action. The two steps behave oppositely.

2606.00472 2026-06-02 cs.CV cs.AI cs.HC cs.LG

CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space

CodeCytos: 通过代码增强的智能体动作空间实现AI辅助空间分子成像分析

Hung Q. Vo, Huy Q. Vo, Son T. Ly, Zhihao Wan, Anh-Vu Nguyen, Hong Zhao, Jianting Sheng, Stephen T. C. Wong, Hien V. Nguyen

Affiliations * University of Houston, Department of Electrical and Computer Engineering; Houston Methodist Hospital, Department of Systems Medicine and Biomedical Engineering

AI summary Proposes CodeCytos, a framework in which a code-driven reasoning agent performs dynamic, programmable analysis of spatial molecular imaging data, improving automation and customization. It outperforms baseline approaches on datasets from several tissue types.

详情
AI中文摘要

传统的组织图像分析软件为细胞分析提供了基础功能,包括分割、基本形态特征提取和空间组织分析。然而,这些工具通常需要手动干预,且与代码驱动的自动化集成不佳,限制了复杂空间组织研究的效率和可扩展性。此外,它们对自定义分析的灵活性有限,通常只支持一组固定的预实现空间细胞特征。为了解决这些限制,我们提出了CodeCytos,一个基于编码的推理智能体框架,能够实现与空间分子成像数据的动态、可编程交互,以提高自动化和定制化。CodeCytos旨在简化自定义空间细胞特征的探索,并适应多样化的研究需求。我们通过四个来自不同组织类型(额叶皮层、非小细胞肺癌、胰腺和扁桃体)的专家精选数据集案例研究展示了其实用性。我们在现实的最小提示设置下评估CodeCytos,其中生物科学家提出简单问题,没有任务特定指令或关于空间细胞分析的上下文信息,并基准测试了多个具有强大编码能力的LLM骨干。我们进一步表明,结合定制的、领域无关的少样本上下文编码推理示例(空间分析领域外随机采样的演示)可以显著提高性能,而无需昂贵的、专家制作的领域内演示。总体而言,CodeCytos优于基线方法,突显了代码动作智能体在空间分子成像中辅助自定义特征探索和加速生物标志物发现的潜力。

英文摘要

Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, basic morphological feature extraction, and spatial organization analysis. However, these tools often require manual intervention and are not well integrated with code-driven automation, limiting efficiency and scalability for complex spatial tissue studies. In addition, they offer limited flexibility for custom analyses, as they typically support only a fixed set of pre-implemented spatial cellular features. To address these limitations, we propose CodeCytos, a coding-based reasoning agent framework that enables dynamic, programmable interaction with spatial molecular imaging data to improve automation and customization. CodeCytos is designed to streamline the exploration of custom spatial cellular features and adapt to diverse research needs. We demonstrate its utility through case studies on four expert-curated datasets from distinct tissue types: frontal cortex, non-small-cell lung cancer, pancreas, and tonsil. We evaluate CodeCytos under a realistic minimal prompt setting, where bioscientists pose simple questions without task-specific instructions or contextual information about spatial cellular analysis, and benchmark multiple LLM backbones with strong coding capabilities. We further show that incorporating tailored, domain-agnostic few-shot in-context coding-reasoning examples (randomly sampled demonstrations outside the spatial analysis domain) can substantially improve performance without requiring costly, expert-crafted in-domain demonstrations. Overall, CodeCytos outperforms baseline approaches, highlighting the potential of code-action agents to assist with custom feature exploration in spatial molecular imaging and to accelerate biomarker discovery.

2606.00471 2026-06-02 cs.CV

MUSCLE-NET: Predicted-Multiscale-Aware Network for Pedestrian Trajectory Forecasting

MUSCLE-NET:面向行人轨迹预测的预测多尺度感知网络

Yu Liu, Ming Huang, Xiao Ren, Zhijie Liu, Youfu Li, He Kong

Affiliations * Guangdong Provincial Key Laboratory of Fully Actuated System Control Theory and Technology, School of Automation and Intelligent Manufacturing, Southern University of Science and Technology (SUSTech), Shenzhen; Department of Mechanical Engineering, City University of Hong Kong, Hong Kong SAR, China

AI summary Proposes MUSCLE-NET, which uses multiscale multimodal feature extraction and a scale-adaptive prediction mechanism to address the under-use of observations and the neglect of scale dependency in future motion, achieving competitive performance on the JAAD and PIE datasets.

Comments This manuscript has been accepted to the IEEE Transactions on Intelligent Transportation Systems as a regular paper

详情
AI中文摘要

准确的行人轨迹预测对于自动驾驶和智能交通系统中的安全导航至关重要。尽管近期方法取得了显著进展,但大多数现有方法在充分利用多样化观测方面存在局限,且往往忽视未来运动的尺度依赖性,无论底层运动动态如何,都统一处理多尺度特征。这限制了它们在多样化行人行为中的鲁棒性。为解决这些挑战,我们提出了一种用于行人轨迹预测的预测多尺度感知网络(MUSCLE-NET),该网络将互补的多模态线索与尺度自适应预测机制相结合。所提出的框架基于多尺度多模态特征提取(MMFE)模块,该模块结合了多尺度表示、模态感知重校准和方向性跨模态融合,从边界框、速度和姿态信息中构建语义对齐的表示。基于这些特征,多尺度增强层次预测(MEHP)模块通过概率粗预测器、尺度对齐融合和渐进细化,执行预测感知的未来运动细化,自适应地选择尺度相关线索以减轻空间漂移。在JAAD和PIE基准上的大量实验表明,所提出的MUSCLE-Net与最先进的轨迹预测方法相比,取得了竞争性能并持续改进。

英文摘要

Accurate pedestrian trajectory prediction is essential for safe navigation in autonomous driving and intelligent transportation systems. Despite substantial progress made by recent methods, most existing approaches are limited in fully exploiting diverse observations and often overlook the scale dependency of future motion, treating multiscale features uniformly regardless of underlying motion dynamics. This limits their robustness across diverse pedestrian behaviors. To address these challenges, we propose a Predicted-MUltiSCale-Aware Network (MUSCLE-NET) for Pedestrian Trajectory Forecasting that integrates complementary multimodal cues with scale-adaptive prediction mechanisms. The proposed framework is built upon a Multiscale Multimodal Feature Extraction (MMFE) module, which combines multiscale representation, modality-aware recalibration, and directional cross-modal fusion to construct semantically aligned representations from bounding boxes, velocities, and pose information. Building on these features, a Multiscale Enhanced Hierarchical Prediction (MEHP) module performs prediction-aware future-motion refinement via a probabilistic coarse predictor, scale-aligned fusion, and progressive refinement, adaptively selecting scale-relevant cues to mitigate spatial drift. Extensive experiments on the JAAD and PIE benchmarks demonstrate that the proposed MUSCLE-Net achieves competitive performance and consistent gains compared with state-of-the-art trajectory prediction methods.

2606.00470 2026-06-02 cs.RO cond-mat.soft

A passive universal grasping mechanism based on an everting shell

基于外翻壳体的被动通用抓取机构

Mythra V. S. Balakuntala, Safvan Palathingal, G. K. Ananthasuresh

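Affiliations * Indian Institute of Science

AI summary Proposes a passive monolithic compliant grasping mechanism based on the eversion of an elastically deformable bistable shell. Grasping arms made of beam segments work with the everting shell to envelop and grasp stiff objects of any shape, up to a maximum size and weight.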

详情
AI中文摘要

概念化了一种基于弹性可变形双稳态壳体外翻的被动单片柔性抓取机构。它由梁段构成的抓取臂与外翻壳体协同工作。该抓取器能够抓取任意形状的刚性物体,最大尺寸和重量受限于机构设计。双稳态壳体在接触物体时外翻,使抓取臂包裹物体形成封闭空间。机构保持该构型直到再次被驱动,使壳体恢复原始构型,从而打开封闭空间释放物体。臂的刚度决定机构的有效载荷,臂的尺寸决定可抓取的最大物体。臂具有分布式柔性,可适应物体形状而不施加过大压力。

英文摘要

A passive monolithic compliant grasping mechanism that works based on the eversion of an elastically deformable bistable shell is conceptualized. It comprises grasping arms made of beam segments that work in conjunction with the everting shell. The grasper is capable of picking up a stiff object of any shape up to a maximum size and weight. The bistable shell everts upon contact with the object so that the grasping arms envelop the object, forming an enclosure. The mechanism then stays in that configuration until it is actuated again to turn the shell back to its original configuration, thereby opening the enclosure to release the object. The stiffness of the arms decides the payload of the mechanism. The size of the arms decides the largest object that can be grasped and held. The arms have distributed compliance so that they can conform to the shape of the object without applying undue force on it.

2606.00467 2026-06-02 cs.CL cs.AI cs.LG stat.ML

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

论大语言模型适应性的局限:模型内化先验对标注任务性能的影响

Etienne Casanova, Rafal Kocielnik, R. Michael Alvarez

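Affiliations * University of Washington

AI summary Uses toxicity detection experiments to study three dimensions of the interaction between LLM-internalized priors and instructions, finds that nearly two-thirds of zero-shot errors resist correction by prompting, and introduces the Definition-Specific Familiarity (DSF) metric, which is positively associated with performance while text memorization metrics are not.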

Comments Accepted at ICML 2026 (Oral & Spotlight); PMLR vol. 306. 9 pages, 4 figures

详情
AI中文摘要

大语言模型(LLMs)越来越多地用于零样本标注和LLM-as-a-judge任务,但其可靠性取决于模型内化先验与用户提供指令的交互方式。我们研究了这种交互的三个维度:(1)LLM对数据和任务定义的熟悉程度如何影响性能;(2)提示中的额外信息能在多大程度上纠正零样本错误(“决策粘性”);(3)模型对错误任务定义的敏感性。通过在多种数据集(涵盖社交媒体、游戏、新闻和论坛)上进行毒性检测实验,使用密集模型和混合专家模型,我们发现近三分之二的零样本错误难以纠正,提示纠正的总体挽救率(初始错误中被纠正的比例)仅为34.8%。高置信度错误尤其难以纠正。当给出错误定义时,LLM会遵循这些定义,同时保持与正确定义条件下相同的置信水平。关键的是,我们引入了定义特定熟悉度(DSF),它衡量模型内部概念与任务定义之间的一致性。在控制数据集层面的混杂因素后,DSF与模型性能呈正相关(偏相关系数r=+0.41),而三种不同的记忆指标(ROUGE-L、BERTScore和嵌入余弦相似度)均未显示正相关。这些发现揭示了基于提示的纠正在标注任务中的局限性,强调了定义对齐比文本级记忆更重要。

英文摘要

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.
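The rescue rate defined in the abstract above (the fraction of initial zero-shot errors that prompting corrects) can be computed as in this small sketch. The function name and the boolean-list input format are assumptions made for illustration.

```python
def rescue_rate(zero_shot_correct, prompted_correct):
    """Fraction of zero-shot errors that become correct after prompting.

    Both arguments are equal-length sequences of booleans, one per item:
    whether the zero-shot label was correct, and whether the label given
    with additional prompt information was correct.
    """
    # Keep only items the model got wrong in the zero-shot condition.
    outcomes_on_errors = [
        prompted for zero_shot, prompted in zip(zero_shot_correct, prompted_correct)
        if not zero_shot
    ]
    if not outcomes_on_errors:
        return float("nan")  # undefined when there were no initial errors
    return sum(outcomes_on_errors) / len(outcomes_on_errors)
```

Items that were already correct in the zero-shot condition do not enter the denominator, so a correct answer that prompting later breaks does not lower this metric.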

2606.00462 2026-06-02 cs.CL cs.AI cs.LG

Short-form Text Rewriting with Phi Silica

短文本改写与 Phi Silica

Divya Tadimeti, Shawn Pan, Sameera Lanka, Chenghui Zhou, Sadid Hasan

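Affiliations * IEEE ICAD

AI summary Adapts the small language model Phi Silica to short-form text rewriting through dataset curation, prompt distillation, parameter-efficient fine-tuning, and evaluation. Fine-tuning improves semantic fidelity, reduces hallucinations, and increases the preference win rate against GPT-5-chat rewrites.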

Comments 6 pages

详情
AI中文摘要

短文本改写是释义的一种受限变体,其中有限的上下文和高语义密度几乎没有留下变化空间。虽然大型语言模型在一般释义任务上表现良好,但小语言模型(SLM)在短文本场景中常常在语义保真度和幻觉鲁棒性方面遇到困难。在这项工作中,我们提出了一项实证研究,通过数据集整理、提示蒸馏、参数高效微调和评估,将小语言模型 Phi Silica 适配于短文本改写。我们从公开的幻灯片中整理了一个简短的演示风格文本数据集,并使用 GPT-5-chat 来生成改写监督以及进行 LLM 作为评判者的评估。我们的结果表明,微调提高了语义保真度,减少了幻觉,并提高了与 GPT-5-chat 改写的偏好胜率。这些发现表明,针对 SLM 的定向适配可以显著缩小与云模型的差距,并为将 SLM 适配于精度关键的改写任务提供实用指导。

英文摘要

Short-form text rewriting is a constrained variant of paraphrasing in which limited context and high semantic density leave little room for variation. While large language models perform well on general paraphrasing, small language models (SLMs) often struggle with semantic fidelity and hallucination robustness in short-form settings. In this work, we present an empirical study of adapting an SLM, Phi Silica, for short-form rewrite through dataset curation, prompt distillation, parameter-efficient fine-tuning, and evaluation. We curate a dataset of short presentation-style text from public slide decks and use GPT-5-chat both to generate rewrite supervision and to conduct LLM-as-a-judge evaluation. Our results show that finetuning improves semantic fidelity, reduces hallucinations, and increases preference win rate against GPT-5-chat rewrites. The findings suggest that targeted adaptation for SLMs can substantially narrow the gap to cloud models and provide practical guidance for adapting SLMs to precision-critical rewrite tasks.

2606.00461 2026-06-02 cs.CV eess.SP

An explainable hierarchical self attention-based approach for tremor detection in the time domain

一种可解释的基于层次自注意力的时域震颤检测方法

Timothy Odonga, Jeanne M. Powell, Mark Saad, Richa Tripathi, Christine D. Esper, Stewart A. Factor, Hyeokhyen Kwon, J. Lucas Mckay

Affiliations * Department of Biomedical Informatics, School of Medicine, Emory University; Jean and Paul Amos Parkinson's Disease and Movement Disorders Program, Department of Neurology, School of Medicine, Emory University; Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology

AI summary Proposes an explainable two-stage hierarchical framework that learns tremor patterns directly from 3D kinematic time-series data for time-domain tremor detection, with attention weights and Grad-CAM providing post hoc interpretability.

Comments Submitted to PLOS Digital Health

详情
AI中文摘要

震颤是一种常见的运动障碍,与帕金森病和特发性震颤等疾病相关,传统上通过临床专家评估诊断。当前的自动检测方法依赖于基于临床知识的频域特征。在这项工作中,我们提出了一种可解释的两阶段层次框架,用于时域震颤检测,该框架直接从整个震颤诱发试验的3D运动学标记时间序列数据中学习震颤模式。我们的框架结合了深度卷积和长短期记忆网络,从试验中短时间、离散、非重叠的运动学时间序列数据段学习震颤表示,然后由视觉变换器处理,该变换器对时间段特征的长期时间动态进行建模,以实现试验(会话)级别的分类。在九个身体部位上评估,该框架的F1分数根据身体部位在0.594-0.947之间(平均0.765),虽然低于频域最先进性能(0.909),但所需预处理最少。注意力权重和基于梯度的类激活图(Grad-CAM)识别了不同身体部位的时域震颤特征。这一概念验证证明了数据驱动的时域建模在解剖学上不同身体部位震颤检测中的可行性,同时减少了对专家设计的频谱特征的依赖,并提供了震颤时间和解剖模式的后验可解释性。

英文摘要

Tremor is a common movement disorder associated with conditions such as Parkinson's disease and essential tremor, traditionally diagnosed through expert clinician assessment. Current automated detection methods rely on frequency-domain features informed by clinical expertise. In this work, we present an explainable, two-stage hierarchical framework for tremor detection in the time domain that learns tremor patterns directly from 3D kinematic marker time-series data across entire tremor-provoking trials. Our framework combines a deep convolutional and long short-term memory network, which learns tremor representations from short, discrete, non-overlapping time segments of kinematic time-series data from trials, with a vision transformer that models the long-term temporal dynamics of the segment features for trial (session) level classification. Evaluated across nine body parts, the framework achieved F1-scores of 0.594 to 0.947 depending on the body part (average: 0.765), falling short of the frequency-domain state-of-the-art performance (0.909) while requiring minimal preprocessing. Attention weights and gradient-based class activation maps (Grad-CAM) identified time-domain features of tremor across body parts. This proof of concept demonstrates the feasibility of data-driven time-domain modeling for tremor detection across anatomically diverse body parts, while reducing reliance on expert-engineered spectral features and providing post hoc interpretability of temporal and anatomical patterns of tremor.
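The first stage described above operates on short, non-overlapping segments of each trial. A minimal sketch of that segmentation step follows. The window length, the array layout (time on the first axis), and the choice to drop a trailing partial window are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def segment_trial(series, window):
    """Split a trial into non-overlapping segments of `window` samples.

    series: array of shape (T, ...) with time on the first axis, for
            example (T, 3) for one marker's x/y/z coordinates.
    Returns an array of shape (T // window, window, ...). Samples left
    over at the end of the trial are dropped (an illustrative choice).
    """
    series = np.asarray(series)
    n_segments = series.shape[0] // window
    trimmed = series[: n_segments * window]
    return trimmed.reshape(n_segments, window, *series.shape[1:])
```

Each row of the result could then be passed to a segment-level encoder, with the sequence of segment features forming the input to a trial-level classifier.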

2606.00460 2026-06-02 cs.CL eess.AS

SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors

SALSA: 通过学习的引导激活向量实现语音感知的LLM适配

Yekaterina Yegorova, Argyrios Gerogiannis, Haolong Zheng, Julia Hockenmaier, Chang D. Yoo, Mark A. Hasegawa-Johnson

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 提出SALSA方法,通过监督学习优化逐层引导向量,在儿童语音、多语种语音和普通话-英语代码切换基准上显著提升零样本推理和语音上下文学习性能,最高相对提升46.8%。

详情
AI中文摘要

语音感知的大语言模型通常在域外场景中泛化能力较差。我们提出SALSA(通过学习的引导激活实现语音感知的LLM适配),一种轻量级适配方法,学习逐层引导向量。与通常依赖对比激活差异的引导方法不同,SALSA直接使用监督目标优化引导向量。在儿童语音、多语种语音和普通话-英语代码切换基准上,SALSA相比零样本推理和语音上下文学习基线显著提升性能,相对于零样本最高实现46.8%的相对改进。进一步分析表明,引导编码器(尤其是后层)比引导LLM主干更有效。这些发现表明,引导通过调整高层声学和语音表示以更好地与预训练语言模型表示空间对齐,而不是通过修改解码器本身,从而提升下游ASR性能。

英文摘要

Speech-aware large language models often generalize poorly to out-of-domain settings. We propose SALSA (Speech-Aware LLM Adaptation via Learned Steering Activations), a lightweight adaptation method that learns layer-wise steering vectors. Unlike commonly used steering approaches that rely on contrastive activation differences, SALSA directly optimizes steering vectors using a supervised objective. Across children's speech, multilingual speech, and Mandarin-English code-switching benchmarks, SALSA substantially improves performance over zero-shot inference and speech in-context learning baselines, achieving up to 46.8% relative improvements over zero-shot. Analysis further demonstrates that steering the encoder, particularly the later layers, is more effective than steering the LLM backbone. These findings suggest that steering improves downstream ASR performance by adapting higher-level acoustic and phonetic representations to better align with the pretrained language model representation space, rather than by modifying the decoder itself.
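
To make the idea of a learned steering vector concrete, here is a minimal sketch under strong simplifying assumptions: a single hidden state, a frozen linear head standing in for the rest of the model, and a squared-error loss as the supervised objective. The names `steer`, `frozen_head`, `learn_steering_vector` and all numeric values are hypothetical and do not come from the SALSA paper.

```python
from typing import List

# Stand-in for the frozen downstream computation (assumption: a linear head).
FROZEN_WEIGHTS = [1.0, -2.0, 0.5]


def steer(hidden: List[float], vector: List[float]) -> List[float]:
    """Add a steering vector to one layer's hidden activation."""
    return [h + v for h, v in zip(hidden, vector)]


def frozen_head(hidden: List[float]) -> float:
    """Frozen model part: its weights are never updated."""
    return sum(w * h for w, h in zip(FROZEN_WEIGHTS, hidden))


def learn_steering_vector(hidden: List[float], target: float,
                          lr: float = 0.05, steps: int = 50) -> List[float]:
    """Optimize only the steering vector by gradient descent on squared error."""
    vector = [0.0] * len(hidden)
    for _ in range(steps):
        error = frozen_head(steer(hidden, vector)) - target
        # d/dv (error^2) = 2 * error * w for the linear head above.
        vector = [v - lr * 2.0 * error * w for v, w in zip(vector, FROZEN_WEIGHTS)]
    return vector


hidden = [0.2, 0.1, -0.4]
vector = learn_steering_vector(hidden, target=1.0)
steered_output = frozen_head(steer(hidden, vector))
```

The point of the sketch is that the vector is fitted directly against a supervised target, rather than computed as a difference of contrastive activations.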

2606.00459 2026-06-02 cs.RO cs.SY eess.SY

Adaptive PD Gains for Energy-Conscious Control in Physical Human-Robot Interaction

物理人机交互中节能控制的自适应PD增益

Danyal Saqib, Francisco Andrade Chavez, Marie Charbonneau

发表机构 * University of Calgary(卡尔加里大学) University of Waterloo RoboHub(滑铁卢大学RoboHub)

AI总结 提出一种自适应PD控制器,通过限制机器人动能和势能实现安全物理人机交互,并给出稳定性证明与实验验证。

Journal ref Proceedings of the 23rd Conference on Robots and Vision, 2026

详情
AI中文摘要

柔顺力或力矩控制是常被研究以实现安全物理人机交互(pHRI)的方法。然而,这些方法存在局限性。力控制要求机器人配备外部力传感器以跟踪施加力的幅度和方向。力矩控制需要在每个关节进行力矩感知或估计。由于并非所有机器人都具备这些条件,基于能量的方法提供了一种有前景的替代方案。此类方法旨在通过限制机器人的机械能来实现安全的pHRI。当前利用基于能量方法的方案往往实现复杂,且部分可能需要进一步稳定性验证。因此,我们提出一种自适应比例-微分(PD)控制器,能够在任意给定限制下限制机器人的能量,以实现安全的pHRI。所提出的控制器可以同时限制机器人的动能和势能,并且控制器增益的行为可通过多种参数进行塑造,精确界定截止限制和锐度。我们为控制器构建了稳定性证明,并定义了确保控制器稳定性的条件。所提出控制器的行为和柔顺性在PAL Robotics的TALOS机器人上进行了仿真和硬件测试,验证了控制器预期的柔顺和能量限制行为。

英文摘要

Compliant force or torque control are approaches often investigated to achieve safe physical human-robot interaction (pHRI). However, these approaches have limitations. Force control requires a robot to be equipped with external force sensors to track the amplitude and direction of applied forces. Torque control requires torque sensing or estimation in each joint. As this is not available on every robot, energy-based approaches offer a promising alternative. Such approaches aim to achieve safe pHRI by limiting the mechanical energy of the robot. Current schemes leveraging an energy-based approach tend to have a complex implementation, and some may require further stability verification. We hence propose an adaptive proportional-derivative (PD) controller that can limit a robot's energy under any given limit to achieve safe pHRI. The proposed controller can limit both the kinetic and potential energy of a robot, and the behaviour of the controller gains can be shaped using various parameters, defining precisely the cutoff limit and sharpness. We construct a stability proof for the controller and define a condition to ensure the controller's stability. The proposed controller's behaviour and compliance are tested on the TALOS robot from PAL Robotics both in simulation and on hardware, verifying the expected compliant and energy-limiting behaviour of the controller.
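
The following sketch shows one way a PD gain could be made to depend on the robot's energy, with a cutoff limit and a sharpness parameter. It is not the controller from the paper: the sigmoid schedule `gain_scale`, the single-joint model, and every parameter value are assumptions chosen only to illustrate the idea of an energy-limited gain.

```python
import math


def gain_scale(energy: float, energy_limit: float, sharpness: float) -> float:
    """Fall smoothly from ~1 to ~0 as energy crosses energy_limit.

    Hypothetical schedule: a logistic function centred on the limit.
    """
    return 1.0 / (1.0 + math.exp(sharpness * (energy - energy_limit)))


def adaptive_pd_torque(q: float, q_des: float, q_dot: float, mass: float,
                       kp: float, kd: float,
                       energy_limit: float, sharpness: float) -> float:
    """Single-joint PD law whose proportional gain shrinks at high kinetic energy."""
    kinetic_energy = 0.5 * mass * q_dot ** 2
    kp_effective = kp * gain_scale(kinetic_energy, energy_limit, sharpness)
    return kp_effective * (q_des - q) - kd * q_dot


slow = adaptive_pd_torque(q=0.0, q_des=1.0, q_dot=0.1, mass=1.0,
                          kp=10.0, kd=1.0, energy_limit=2.0, sharpness=5.0)
fast = adaptive_pd_torque(q=0.0, q_des=1.0, q_dot=3.0, mass=1.0,
                          kp=10.0, kd=1.0, energy_limit=2.0, sharpness=5.0)
```

At low speed the law behaves like an ordinary PD controller; once the kinetic energy exceeds the limit, the proportional term fades out and only damping remains.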

2606.00452 2026-06-02 cs.CV cs.GR

Beyond Static Gaussians: An Empirical Investigation of Architectural Paradigms for Dynamic 3D Scene Reconstruction

超越静态高斯:动态3D场景重建架构范式的实证研究

Adrian Ramlal, John S. Zelek

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 本文通过实证比较结构引导与高斯中心两种动态3D高斯溅射范式,揭示重建质量/紧凑性与渲染速度之间的根本权衡。

Comments Accepted in Journal of Computational Vision and Imaging Systems (JCVIS)

Journal ref Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, 2025, p. 99

详情
AI中文摘要

通过3D高斯溅射(3DGS)进行动态场景重建已成为表示演化环境的一种引人注目的方法,但理解不同方法之间的权衡仍然至关重要。本文对动态3DGS方法进行了全面分析,将其分为两种范式:结构引导方法,利用辅助表示(变形场、规范空间、网格)来建模时间变化;以及高斯中心方法,通过连续函数或4D表示将动态直接编码到基元中。我们在D-NeRF基准上评估了两种范式的代表性方法。我们的发现表明,结构引导方法实现了优越的重建保真度和紧凑的模型大小,而高斯中心方法则表现出显著更高的渲染速度,能够实现实时性能,但质量变异性更大且可能产生大量存储开销。该分析突出了重建质量/紧凑性与渲染速度之间的根本权衡,为动态场景重建的未来研究和应用开发提供了见解。

英文摘要

Dynamic scene reconstruction via 3D Gaussian Splatting (3DGS) has emerged as a compelling approach for representing evolving environments, yet understanding trade-offs between methodologies remains crucial. This paper presents a comprehensive analysis of dynamic 3DGS methods, categorizing them into two paradigms: structure-guided methods employing auxiliary representations (deformation fields, canonical spaces, grids) to model temporal changes, and Gaussian-centric methods encoding dynamics directly into primitives via continuous functions or 4D representations. We evaluate representative methods from both paradigms on the D-NeRF benchmark. Our findings reveal that structure-guided methods achieve superior reconstruction fidelity and compact model sizes, while Gaussian-centric approaches demonstrate significantly higher rendering speeds enabling real-time performance, though with greater quality variability and potentially substantial storage overhead. This analysis highlights a fundamental trade-off between reconstruction quality/compactness and rendering speed, providing insights to guide future research and application development in dynamic scene reconstruction.

2606.00451 2026-06-02 cs.CL

ProtStructQA: A Denotation Threshold in Protein Structural Reasoning

ProtStructQA: 蛋白质结构推理中的指称阈值

Aravind Mandiga, Guoming Li, Jin Lu, Ismailcem Budak Arpinar, Khaled Rasheed, Samuel E. Aggrey

发表机构 * University of Georgia(佐治亚大学)

AI总结 提出可执行基准ProtStructQA,通过将自然语言问题编译为DSL程序并在AlphaFold结构上执行来评估蛋白质语言模型,发现模型在1.7B到4B参数之间存在指称阈值,低于该阈值时工具辅助推理占优,高于该阈值时思维链成为最强策略。

详情
AI中文摘要

蛋白质语言系统通常通过是否生成合理的生物学文本来评估,但结构问题具有更清晰的语义:它表示3D坐标系中的测量值。我们引入ProtStructQA,一个可执行的蛋白质结构问答基准,其中每个自然语言问题由隐藏的类型化领域特定语言(DSL)程序生成,答案通过在该程序上对AlphaFold预测的结构执行获得。ProtStructQA发布了382.2K个问题,涵盖置信度、距离、预测对齐误差(PAE)、溶剂暴露、二级结构、拓扑和接触,以及保留的组合:一个包含来自四个物种的10K个蛋白质的330K活跃基准,加上一个52.2K的硬负例鲁棒性池。无需微调,我们在直接提示、思维链、语法约束可执行投票、带思维链的可执行投票以及多轮ReAct风格工具使用下评估了Qwen3模型(0.6B至8B),并在Gemma-3-1B和Gemma-3-12B上复现了主要发现。我们发现Qwen3-1.7B和Qwen3-4B之间存在一个能力依赖的指称阈值:低于该阈值时,工具中介的ReAct占主导,因为模型常常无法生成可执行的指称;高于该阈值时,思维链从大多有害转变为强烈有益,并成为大多数分割上的最强策略。解析失败和家族级分析表明,该阈值是从不可解析语言到可执行结构指称的转变,而语法和执行对PAE和二级结构查询仍然具有选择性价值。ProtStructQA将科学问答重新定义为从语言到测量的编译,并为语言模型何时能将单词映射到可执行的3D结构测量提供了诊断测试平台。

英文摘要

Protein-language systems are often evaluated by whether they generate plausible biological text, but a structural question has a sharper semantics: it denotes a measurement in a 3D coordinate system. We introduce ProtStructQA, an executable benchmark for protein structural question answering in which each natural-language question is generated from a hidden typed domain-specific language (DSL) program and the answer is obtained by executing that program on an AlphaFold-predicted structure. ProtStructQA releases 382.2K questions covering confidence, distances, predicted aligned error (PAE), solvent exposure, secondary structure, topology and contacts, and held-out compositions: a 330K active benchmark over 10K proteins from four species, plus a 52.2K hard-negative robustness pool. Without fine-tuning, we evaluate Qwen3 models from 0.6B to 8B under direct prompting, chain-of-thought, grammar-constrained executable voting, executable voting with chain-of-thought, and multi-turn ReAct-style tool use, and replicate the headline finding on Gemma-3-1B and Gemma-3-12B. We find a capability-dependent denotation threshold between Qwen3-1.7B and Qwen3-4B: below it, tool-mediated ReAct dominates because models often fail to produce executable denotations; above it, chain-of-thought flips from mostly harmful to strongly beneficial and becomes the strongest strategy on most splits. Parse-failure and family-level analyses show that the threshold is a transition from unparseable language to executable structural denotation, while grammar and execution remain selectively valuable for PAE and secondary-structure queries. ProtStructQA reframes scientific QA as compilation from language to measurement and provides a diagnostic testbed for when language models can map words to executable 3D structural measurements.
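
A toy version of "answer by executing a program on a structure" might look like the sketch below. The three operations (`ca`, `distance`, `less_than`), the tuple encoding of programs, and the coordinate dictionary are all hypothetical; the actual ProtStructQA DSL and its types are not described in the abstract.

```python
import math
from typing import Dict, Tuple

Coord = Tuple[float, float, float]
# Toy stand-in for a predicted structure: residue index -> C-alpha coordinate.
Structure = Dict[int, Coord]


def execute(program, structure: Structure):
    """Recursively evaluate a tiny tuple-encoded program against a structure."""
    op = program[0]
    if op == "ca":  # ("ca", residue_index) -> coordinate
        return structure[program[1]]
    if op == "distance":  # ("distance", expr_a, expr_b) -> float
        return math.dist(execute(program[1], structure),
                         execute(program[2], structure))
    if op == "less_than":  # ("less_than", expr, threshold) -> bool
        return execute(program[1], structure) < program[2]
    raise ValueError(f"unknown operation: {op!r}")


structure = {1: (0.0, 0.0, 0.0), 2: (3.0, 4.0, 0.0)}
distance_program = ("distance", ("ca", 1), ("ca", 2))
contact_program = ("less_than", distance_program, 8.0)

distance_answer = execute(distance_program, structure)
contact_answer = execute(contact_program, structure)
```

Because the answer is whatever the program returns, a model's output can be checked against a measurement rather than judged for plausibility.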

2606.00450 2026-06-02 cs.CV cs.GR

Optimizing 3D Gaussian Splatting via Point Cloud Upsampling

通过点云上采样优化3D高斯泼溅

Adrian Ramlal, Yan Song Hu, John S. Zelek

发表机构 * Vision and Image Processing Group, Systems Design Engineering, University of Waterloo(滑铁卢大学视觉与图像处理组,系统设计工程)

AI总结 提出多种点云上采样方法及深度引导点提升技术,改善3D高斯泼溅的初始化质量,实验表明不同场景适用不同策略。

Comments Accepted in Journal of Computational Vision and Imaging Systems (JCVIS)

Journal ref Journal of Computational Vision and Imaging Systems, Vol. 10, No. 1, p. 47, 2024

详情
AI中文摘要

3D高斯泼溅(3DGS)是一种用于创建和渲染3D场景的技术,但其性能严重依赖于初始种子点的质量。为了改进3DGS初始化,本研究提出并评估了几种点云上采样方法:线性插值、三角插值、基于样条的曲面重建、移动最小二乘曲面拟合和基于Voronoi的点生成。此外,本研究引入了一种深度引导的点提升方法,利用深度图保持与运动恢复结构(SfM)重建的几何一致性。通过在Mip-NeRF360和Replica数据集上的大量实验,所提出的方法在多种场景类型中展示了重建质量的提升。结果表明,不同的上采样策略在不同场景中表现优异:曲面重建方法在处理有机、细节丰富的场景时表现更好,而简单的插值方法更适合以分段平滑几何为主的场景。相比之下,深度引导方法在添加整个场景中的几何感知点方面显示出潜力,尤其是在纹理缺失区域。这些发现为根据场景特征和计算约束选择合适的上采样方法提供了初步实用指南,增进了对点云初始化如何影响3DGS质量的理解。

英文摘要

3D Gaussian Splatting (3DGS) is a technique for creating and rendering 3D scenes; however, its performance depends heavily on the quality of initial seed points. To improve 3DGS initialization, this study presents and evaluates several point cloud upsampling approaches: linear interpolation, triangular interpolation, spline-based surface reconstruction, moving least squares surface fitting, and Voronoi-based point generation. Additionally, this research introduces a depth-guided point lifting method that leverages depth maps to maintain geometric consistency with Structure-from-Motion (SfM) reconstructions. Through extensive experiments on the Mip-NeRF360 and Replica datasets, the proposed methods demonstrate improvements in reconstruction quality across diverse scene types. Results indicate that different upsampling strategies excel in different scenarios: surface reconstruction methods perform better with organic, detailed scenes, while simpler interpolation approaches are more suited for scenes dominated by piecewise-smooth geometries. In comparison, the depth-guided approach shows promise for adding geometry-aware points across the entire scene, particularly in texture-less regions. These findings provide preliminary practical guidelines for selecting appropriate upsampling methods based on scene characteristics and computational constraints, and advance the understanding of how point cloud initialization affects 3DGS quality.
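
The simplest of the listed strategies, linear interpolation, can be sketched as follows. This is one possible reading, assuming new points are placed at the midpoint between each seed point and its nearest neighbour; the function name `upsample_midpoints` and the brute-force neighbour search are illustrative choices, not the paper's implementation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def upsample_midpoints(points: List[Point]) -> List[Point]:
    """Append the midpoint between each point and its nearest neighbour.

    This is linear interpolation at t = 0.5; each neighbour pair is used once.
    """
    if len(points) < 2:
        return list(points)
    pairs = set()
    for i, p in enumerate(points):
        # Brute-force nearest neighbour (fine for a sketch, O(n^2) overall).
        j = min((k for k in range(len(points)) if k != i),
                key=lambda k: math.dist(p, points[k]))
        pairs.add((min(i, j), max(i, j)))
    midpoints = [
        tuple((a + b) / 2.0 for a, b in zip(points[i], points[j]))
        for i, j in sorted(pairs)
    ]
    return list(points) + midpoints


seeds = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
dense = upsample_midpoints(seeds)
```

For real SfM point clouds a spatial index such as a k-d tree would replace the quadratic neighbour search.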

2606.00449 2026-06-02 cs.RO

ROG-Grasp: Root-Oriented Geometry for Robotic Grasping and Placement

ROG-Grasp:面向根部的几何方法用于机器人抓取与放置

Zijian An, Augustus Sroka, Ran Yang, Bill Cai, Satoru Eto, Brian Poon, Kelvin Cai, Shijie Geng, Feng Liu, Yiming Feng, Lifeng Zhou

发表机构 * Department of Electrical and Computer Engineering, Drexel University(德雷塞尔大学电气与计算机工程系) Virginia Seafood Agricultural Research and Extension Center, and Department of Biological Systems Engineering, Virginia Tech(弗吉尼亚理工学院生物系统工程系和弗吉尼亚海鲜农业研究与推广中心) Amazon Store Foundation AI (SFAI)(亚马逊商店基金会人工智能(SFAI))

AI总结 提出基于根部表面几何的ROG-Grasp框架,通过RGB-D感知估计农产品朝向,结合YOLO检测器和点云平面拟合生成稳定抓取姿态,在番茄和洋葱实验中实现高成功率与快速执行。

Comments 7 pages, 6 figures. Video: https://youtu.be/Ir2UtGODdMo

详情
AI中文摘要

朝向感知操作在采后农业加工中至关重要,其中农产品必须以一致的配置被抓取和放置。本文提出ROG-Grasp,一种基于几何的机器人抓取和放置框架,通过RGB-D感知从根部表面几何估计农产品朝向。使用基于YOLO的根部检测器和点云平面拟合来推断根部法线,从而生成稳定的抓取姿态和朝向约束的笛卡尔运动规划。在番茄和洋葱上的实验表明,在孤立和杂乱场景中均具有高成功率和稳定的执行时间。与视觉-语言-动作(VLA)策略相比,所提出的方法实现了更可靠、更准确的抓取完成,且执行速度更快。这些结果突显了几何驱动感知对于实际朝向控制操作任务的有效性。我们的论文视频可在网上获取:https://youtu.be/Ir2UtGODdMo。

英文摘要

Orientation-aware manipulation is essential in post-harvest agricultural processing, where produce must be grasped and placed in consistent configurations. This paper presents ROG-Grasp, a geometry-based robotic grasping and placement framework that estimates the produce orientation from root surface geometry using RGB-D perception. A YOLO-based root detector and point cloud plane fitting are used to infer the root normal, enabling stable grasp pose generation and orientation-constrained Cartesian motion planning. Experiments on tomatoes and onions demonstrate high success rates and stable execution time in both isolated and cluttered scenarios. Compared with vision-language-action (VLA) policies, the proposed method achieves more reliable and accurate grasp completion with faster execution. These results highlight the effectiveness of geometry-driven perception for practical orientation-controlled manipulation tasks. A video of our paper is available online at https://youtu.be/Ir2UtGODdMo.
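
Plane fitting on a point cloud patch is commonly done with a least-squares fit via singular value decomposition, and the sketch below shows that approach. Whether ROG-Grasp uses SVD, RANSAC, or another fitting method is not stated in the abstract, so treat `fit_plane_normal` and the toy points as assumptions.

```python
import numpy as np


def fit_plane_normal(points: np.ndarray) -> np.ndarray:
    """Return the unit normal of the least-squares plane through an (N, 3) cloud.

    The normal is the right singular vector with the smallest singular value
    of the centred points. Its sign is arbitrary.
    """
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    return normal / np.linalg.norm(normal)


# Toy root-surface patch lying in the z = 0 plane.
patch = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 1.0, 0.0]])
normal = fit_plane_normal(patch)
```

In practice the sign ambiguity would be resolved with extra information, for example by flipping the normal so that it points toward the camera.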

2606.00447 2026-06-02 cs.CV cs.AI

GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video

GeoSAM-3D: 用于从单目视频进行开放词汇3D场景分割的测地线提示传播

Arun Sharma

发表机构 * University of Minnesota, Twin Cities(明尼苏达大学,双城分校)

AI总结 提出GeoSAM-3D方法,利用冻结的视觉基础模型和单目3D高斯泼溅重建,通过可微分的图-测地线传播核在场景图上传播用户提示,实现从单目视频的开放词汇3D场景分割。

详情
AI中文摘要

开放词汇的3D场景分割通常假设有RGB-D视频、校准的多视角图像或重建的网格。GeoSAM-3D研究了一种更轻的设置:用户上传一段短的单目视频,在一帧中点击或命名一个物体,并在高斯场景上接收传播的3D掩码。该实现结合了冻结的图像和视频基础模型、单目3D高斯泼溅重建以及在高斯质心上可微分的图-测地线传播核。核心设计选择是通过重建场景图上的热核距离传播提示,而不是通过3D中的欧几里得最近邻。这保持了曲面周围的连续性,并减少了附近但不相连物体之间的泄漏。本文描述了仓库状态、在geosam3d.propagate中实现的数学核、从Segment Anything掩码训练的特征头以及代码库中已有的验证。评估协议将实现验证、图传播质量、泄漏控制和交互延迟分开。

英文摘要

Open-vocabulary 3D scene segmentation usually assumes RGB-D video, calibrated multi-view imagery, or a reconstructed mesh. GeoSAM-3D studies a lighter setting: a user uploads a short monocular video, clicks or names an object in one frame, and receives a propagated 3D mask over a Gaussian scene. The implementation combines frozen image and video foundation models with a monocular 3D Gaussian Splatting reconstruction and a differentiable graph-geodesic propagation kernel over Gaussian centroids. The central design choice is to propagate prompts by heat-kernel distance on the reconstructed scene graph, rather than by Euclidean nearest neighbors in 3D. This preserves continuity around curved surfaces and reduces leakage across nearby but disconnected objects. This paper describes the repository state, the mathematical kernel implemented in geosam3d.propagate, the feature head trained from Segment Anything masks, and the validation already present in the codebase. The evaluation protocol separates implementation validation, graph propagation quality, leakage control, and interactive latency.
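
To illustrate why graph-based heat propagation limits leakage between disconnected objects, the sketch below diffuses a one-hot prompt over a small graph by applying exp(-tL), where L is the combinatorial graph Laplacian. This is a generic heat-kernel computation, not the kernel in `geosam3d.propagate`; the function `heat_propagate`, the toy graph, and the diffusion time are assumptions.

```python
import numpy as np


def heat_propagate(adjacency: np.ndarray, prompt: np.ndarray, t: float) -> np.ndarray:
    """Return exp(-t L) @ prompt for the graph Laplacian L = D - A."""
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # L is symmetric for an undirected graph, so eigh applies.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    return eigvecs @ (np.exp(-t * eigvals) * (eigvecs.T @ prompt))


# Two objects: nodes 0-1-2 form one component, nodes 3-4 another.
edges = [(0, 1), (1, 2), (3, 4)]
adjacency = np.zeros((5, 5))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

prompt = np.zeros(5)
prompt[0] = 1.0  # the user clicks on node 0
heat = heat_propagate(adjacency, prompt, t=1.0)
```

Heat spreads along edges within the clicked object but never reaches nodes 3 and 4, however close they might be in Euclidean space.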

2606.00445 2026-06-02 cs.CV cs.AI cs.LG

DarkVesselNet: Multi-Modal Remote Sensing and Trajectory Reasoning for Dark Vessel Detection

DarkVesselNet: 用于暗船检测的多模态遥感和轨迹推理

Arun Sharma

发表机构 * University of Minnesota, Twin Cities(明尼苏达大学,双城分校)

AI总结 提出DarkVesselNet,融合Sentinel-1 SAR、Sentinel-2光学影像、地理空间基础模型、AIS轨迹推理、TGARD间隙检测和Pi-DPM异常头,实现多模态遥感暗船检测。

详情
AI中文摘要

暗船检测需要融合船只通过AIS报告的信息与卫星通过雷达和光学传感器观测到的信息。DarkVesselNet是一个多模态遥感堆栈,结合了Sentinel-1 SAR、Sentinel-2光学影像、地理空间基础模型骨干、AIS轨迹推理、TGARD风格的间隙检测以及受Pi-DPM启发的异常头。该仓库将系统呈现为经过测试的Python包和公开的Hugging Face Space。本文介绍了传感器堆栈、骨干抽象、融合路径、异常头和当前的验证。目前可用的证据是基于软件的:针对SAR散斑滤波、光学波段比、Haversine距离、TGARD间隙发射、传感器配准、骨干token形状和可微分异常评分的测试。

英文摘要

Dark vessel detection requires fusing what vessels report through AIS with what satellites observe through radar and optical sensors. DarkVesselNet is a multi-modal remote sensing stack that combines Sentinel-1 SAR, Sentinel-2 optical imagery, geospatial foundation model backbones, AIS trajectory reasoning, TGARD-style gap detection, and a Pi-DPM-inspired anomaly head. The repository exposes the system as a tested Python package and a public Hugging Face Space. The paper presents the sensor stack, backbone abstraction, fusion path, anomaly head, and current validation. The evidence currently available is software-grounded: tests for SAR speckle filtering, optical band ratios, Haversine distance, TGARD gap emission, sensor coregistration, backbone token shapes, and differentiable anomaly scoring.
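
Two of the tested components named above, Haversine distance and gap detection over AIS reports, can be sketched in a few lines. The Haversine formula is standard; `find_gaps` is only a simple silence-threshold check and should not be read as the TGARD method. Function names, the Earth radius constant, and the threshold are assumptions.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def find_gaps(timestamps: List[float], max_silence_s: float) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where consecutive AIS reports are too far apart."""
    return [(t0, t1) for t0, t1 in zip(timestamps, timestamps[1:])
            if t1 - t0 > max_silence_s]


one_degree_km = haversine_km(0.0, 0.0, 0.0, 1.0)
gaps = find_gaps([0, 60, 120, 4000, 4060], max_silence_s=600)
```

A gap interval could then be cross-checked against satellite detections that fall inside the region the vessel could have reached during the silence.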

2606.00444 2026-06-02 cs.CV cs.GR

Real-Time Physics Simulation with Dynamic Mesh-Gaussian Reconstructions

基于动态网格-高斯重建的实时物理仿真

Adrian Ramlal, John S. Zelek

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 针对动态重建与物理仿真拓扑不兼容的问题,提出固定拓扑网格与高斯泼溅的双表示框架,实现实时物理仿真,并揭示高质量重建与物理兼容拓扑存在本质冲突。

Journal ref Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, 2025

详情
AI中文摘要

将动态3D重建集成到物理仿真中需要固定的网格拓扑以实现高效的碰撞检测,但像DG-Mesh这样的先进方法会生成针对几何质量优化的可变拓扑。我们研究了拓扑转换是否能在保持重建保真度的同时实现物理集成。我们提出了一种双表示框架,将用于物理的固定拓扑网格与用于渲染的高斯泼溅相结合,通过运行时顶点缓冲区更新实现了比可变拓扑基线快4.65倍的加速。我们在DG-Mesh数据集上评估了两种转换策略(时间对应跟踪和基于模板的投影)与原生固定拓扑方法(MaGS)的性能。我们的评估表明,两种转换方法都会导致65-80%的几何退化,尽管DG-Mesh具有优越的初始质量,但产生的结果不如MaGS。这表明高质量重建和物理兼容拓扑代表了根本不同的目标,无法通过后处理来调和。我们的发现为未来物理感知重建方法的发展提供了信息,并且我们的框架能够与任何固定拓扑方法实现实时仿真。

英文摘要

Integrating dynamic 3D reconstructions into physics simulation requires fixed mesh topology for efficient collision detection, but state-of-the-art methods like DG-Mesh produce varying topology optimized for geometric quality. We investigate whether topology conversion can enable physics integration while preserving reconstruction fidelity. We propose a dual-representation framework combining fixed-topology meshes for physics with Gaussian splatting for rendering, achieving 4.65× speedup over varying-topology baselines through runtime vertex buffer updates. We evaluate two conversion strategies, temporal correspondence tracking and template-based projection, against native fixed-topology methods (MaGS) on the DG-Mesh dataset. Our evaluation reveals that both conversion approaches incur 65-80% geometric degradation, producing results inferior to MaGS despite DG-Mesh's superior initial quality. This demonstrates that high-quality reconstruction and physics-compatible topology represent fundamentally distinct objectives that cannot be reconciled through post-processing. Our findings inform future development of physics-aware reconstruction methods and our framework enables real-time simulation with any fixed-topology approach.

2606.00442 2026-06-02 cs.LG math.OC stat.ML

Exploiting weight-space symmetries for approximating curvature

利用权重空间对称性近似曲率

Artem Artemev, Rui Xia, Benjamin M. Boyd, Youjing Yu, Felix Dangel, Guillaume Hennequin, Alberto Bernacchia

发表机构 * DeepMind, London, UK(伦敦DeepMind)

AI总结 本文通过解析平均化保持损失不变的群作用,从单个梯度构建结构化的Hessian近似,从而利用权重空间对称性来近似损失函数的曲率。

Comments Published at ICML 2026. 35 pages, 11 figures. Code: https://github.com/mtkresearch/symm_opt

详情
AI中文摘要

许多机器学习技术依赖于近似损失函数的曲率,但在现代深度网络的规模下,这通常很难做到。令人惊讶的是,之前没有工作利用损失景观中众所周知的权重空间对称性所产生的曲率约束。通过解析平均化保持损失不变的群作用,我们从单个梯度构建了结构化的Hessian近似,这些近似可以易于估计、存储和求逆。用户指定的对称群直接控制近似精度与计算成本之间的权衡。此外,我们的框架为审视现有方法提供了统一的理论视角;特别地,特定的对称群选择可以恢复Shampoo/Muon类的曲率估计。我们在多种网络架构上验证了我们的方法,并将其应用于二阶优化基准测试,包括一个小型语言模型。我们的曲率估计框架可能在机器学习其他问题中找到应用,如不确定性估计、持续学习、压缩/剪枝、训练数据归因等。

英文摘要

Many machine learning techniques rely on approximating a loss function's curvature, but this is notoriously hard to do at the scale of modern deep networks. Surprisingly, no previous work has exploited the curvature constraints that arise from well known weight-space symmetries in loss landscapes. By analytically averaging over group actions that leave the loss invariant, we construct structured Hessian approximations from single gradients that can be tractably estimated, stored, and inverted. The choice of user-specified symmetry group directly governs the trade-off between approximation accuracy and computational cost. Moreover, our framework provides a unifying theoretical lens for viewing existing methods; in particular, a specific choice of symmetry group recovers Shampoo/Muon-like curvature estimates. We validate our method on a range of network architectures, and deploy it to second-order optimization benchmarks, including a small language model. Our curvature estimation framework might find applications in other machine learning problems such as uncertainty estimation, continual learning, compression/pruning, training data attribution, and more.

2606.00440 2026-06-02 cs.AI

SDR: Set-Distance Rewards for Radiology Report Generation

SDR:用于放射学报告生成的集合距离奖励

Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert

发表机构 * Stanford University(斯坦福大学) Stanford University School of Medicine(斯坦福大学医学院) Ghent University(根特大学)

AI总结 针对胸部X光报告生成中标准奖励不兼容的问题,提出基于集合距离的连续置换不变奖励,通过GRPO后训练和测试时缩放显著提升性能。

详情
AI中文摘要

具有可验证奖励的强化学习已迅速推进了视觉-语言模型中的推理能力。然而,对于胸部X光报告生成,标准奖励(即精确匹配准确率和逐步过程)并不兼容,因为报告由无序且正交的发现组成,而非因果推理链。我们通过基于集合的视角来解决这一差距:每个报告被分割成句子,并由冻结的句子变换器嵌入,生成无序的嵌入集合。我们提出使用生成嵌入与参考嵌入之间的集合到集合距离作为连续的、置换不变的奖励。在两个数据集和三个视觉-语言模型(Qwen3-VL-2B/4B, Gemma3-4B)上,通过GRPO使用基于集合到集合距离的奖励进行后训练,在所有主要指标(BERTScore、RadGraph F1和CheXbert F1,分别相对提升平均6.80%、7.82%和4.45%)上一致优于监督微调和精确匹配GRPO。相同的集合距离还实现了测试时的最佳N选:通过候选与训练报告嵌入的距离进行评分,在我们训练的模型以及三个闭源LLM(Mistral-Small、Gemini-2.5 Flash-Lite、GPT-4o-mini)上,平均相对提升BERTScore 16.4%,优于随机选择。作为流式信号使用时,它们支持更高效的测试时缩放形式:在生成过程中修剪低分候选,可将生成的令牌减少50%以上,同时保持完整最佳N选的结果质量。这些结果共同确立了集合距离奖励作为胸部X光报告生成中后训练和测试时缩放的统一信号。我们的代码已公开。

英文摘要

Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision-language models. However, for chest X-ray report generation, the standard rewards (i.e., exact-match accuracy and step-level processes) are incompatible because the reports consist of unordered and orthogonal findings, rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose the use of set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision-language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance-based rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (average relative improvements of 6.80%, 7.82% and 4.45% on BERTScore, RadGraph F1 and CheXbert F1, respectively). The same set distances also enable test-time best-of-N selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini), with an average relative improvement of 16.4% on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50% while preserving the Findings quality of full best-of-N selection. Together these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly available at https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA.
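
A permutation-invariant reward over sentence-embedding sets could be built from a symmetric Chamfer distance, as sketched below. The abstract does not say which set-to-set distance is used, so Chamfer is an assumption here, as are the function names and the two-dimensional toy embeddings standing in for sentence-transformer outputs.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def chamfer_distance(a: List[Vector], b: List[Vector]) -> float:
    """Symmetric Chamfer distance between two non-empty sets of embeddings."""
    def one_way(src: List[Vector], dst: List[Vector]) -> float:
        # Average distance from each source vector to its closest target vector.
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def set_distance_reward(generated: List[Vector], reference: List[Vector]) -> float:
    """Continuous reward: the closer the two sets, the higher (less negative) the reward."""
    return -chamfer_distance(generated, reference)


reference = [(0.0, 1.0), (1.0, 0.0)]
same_findings_reordered = [(1.0, 0.0), (0.0, 1.0)]
missing_one_finding = [(1.0, 0.0)]

reward_reordered = set_distance_reward(same_findings_reordered, reference)
reward_missing = set_distance_reward(missing_one_finding, reference)
```

Reordering the sentences leaves the reward unchanged, while omitting a finding lowers it, which is the behaviour an order-sensitive exact-match reward cannot provide.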

2606.00439 2026-06-02 cs.CV

Physical Object Understanding with a Physically Controllable World Model

基于物理可控世界模型的物理对象理解

Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Wanhee Lee, Gia Ancone, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel LK Yamins

发表机构 * Stanford University(斯坦福大学) OpenAI(开放人工智能公司) Noetik Inc.(Noetik公司) Google(谷歌)

AI总结 提出一类概率世界模型,通过自回归序列建模高效训练,从视频中推断对象及其物理交互,实现对象发现、3D操控和物理关系计算。

Comments CVPR 2026 Highlight. Project page at: https://neuroailab.github.io/psi-website/blog.html

详情
AI中文摘要

视觉智能的一个核心挑战是从原始视频中学习场景的物理结构:区域如何形成对象以及支配它们交互的规律。解决这些任务需要能够从部分观测中推断世界分布状态的世界模型——当前架构无法提供这种能力。我们引入了一类新的概率世界模型,支持估计任何视觉变量(如外观和动态)在给定其他变量条件下的概率。在这里,我们发现这些模型可以通过自回归序列建模高效训练,从而产生能够涌现丰富对象理解的世界模型。首先,我们展示了我们的模型通过顺序推理生成多个合理的未来世界状态,捕捉了支配对象如何运动的物理规律。然后,通过分析这些未来状态中的运动相关性,我们提取出对象及其关节子部分。在发现这些对象后,我们展示了我们的世界模型可以在3D中操控它们。最后,我们演示了如何从世界模型计算对象之间的物理关系,从而实现了诸如视觉叠叠乐等应用。

英文摘要

A central challenge in visual intelligence is learning the physical structure of scenes from raw videos: how regions form objects and the laws that govern their interactions. Solving these tasks requires world models capable of inferring distributional states of the world from partial observations - capabilities that current architectures do not provide. We introduce a new class of probabilistic world models that support estimation of the probability of any visual variable, such as appearance and dynamics, conditioned on any other variables. Here, we identify that these models can be trained efficiently with autoregressive sequence modeling, yielding world models from which rich object understanding emerges. First, we demonstrate that our model captures the physical laws governing how objects move by generating multiple plausible future states of the world through sequential inference. Then, by analyzing motion correlations across these futures, we extract objects and articulated object subparts. Having discovered these objects, we show that our world model can manipulate them in 3D. Finally, we demonstrate how physical relationships between objects can be computed from the world model, enabling applications such as Visual Jenga.

2606.00437 2026-06-02 cs.LG

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

EST-PRM:在过程奖励模型成为关键依赖之前对其进行压力测试

Ibne Farabi Shihab, Fariya Afrin, Sanjeda Akter, Anuj Sharma

发表机构 * Department of Computer Science, Iowa State University(艾奥瓦州立大学计算机科学系) Department of Computer Science, Kalinga Institute of Industrial Technology(卡林加工业技术学院计算机科学系) Department of Civil, Construction & Environmental Engineering, Iowa State University(艾奥瓦州立大学土木、建设与环境工程系)

AI总结 提出EST-PRM框架,通过步骤膨胀、依赖感知重排序和置信度标记三种变换对过程奖励模型进行压力测试,发现不同模型在奖励膨胀和正确性敏感性损失方面存在显著差异。

详情
AI中文摘要

过程奖励模型(PRM)在具有密集步骤级监督的语言模型训练中被广泛使用。它们假设在标签保持变换下,PRM分数是步骤正确性的稳定代理。这些变换改变推理结构但保留最终答案。我们认为这一假设未得到充分验证。此类变换可能改变PRM分数与正确性信号之间的关系,导致不同模型出现不同的故障模式。为弥补这一空白,我们引入了EST-PRM,一个用于密集过程奖励的压力测试框架。它应用三种变换:(1)步骤膨胀,(2)依赖感知步骤重排序,以及(3)置信度标记。定义了一个脆弱性分解,将奖励膨胀与正确性敏感性损失分开。在来自MATH-500、GSM8K和PRMBench的4,687条推理链上评估了五种PRM风格模型。结果表明不同模型的脆弱性模式存在明显差异。Math-Shepherd对位置扰动表现出最强的敏感性,Pearson相关系数下降0.152 ± 0.038,分数膨胀率为32.8 ± 4.9%。Qwen2.5-Math-PRM受步骤膨胀影响最大,膨胀率达到47.6 ± 4.3%。基于置信度的扰动也会扭曲奖励校准,揭示正确性估计中的不一致性。评估了三种缓解策略,突出了鲁棒性覆盖率和假阳性率之间的权衡。

英文摘要

Process reward models (PRMs) are widely used in language-model training with dense step-level supervision. They assume PRM scores are stable proxies for step correctness under label-preserving transformations. These transformations change reasoning structure but preserve final answers. We argue this assumption is not well validated. Such transformations can change how PRM scores relate to correctness signals, leading to different failure modes across models. To address this gap, we introduce EST-PRM, a stress-testing framework for dense process rewards. It applies three transformations: (1) step inflation, (2) dependency-aware step reordering, and (3) confidence markers. A vulnerability decomposition is defined that separates reward inflation from loss of correctness sensitivity. Five PRM-style models are evaluated on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench. The results indicate clear differences in vulnerability patterns across models. Math-Shepherd shows the strongest sensitivity to position perturbations, with a Pearson correlation drop of 0.152 ± 0.038 and a 32.8 ± 4.9% score inflation rate. Qwen2.5-Math-PRM is most affected by step inflation, reaching a 47.6 ± 4.3% inflation rate. Confidence-based perturbations also distort reward calibration, revealing inconsistencies in correctness estimation. Three mitigation strategies are evaluated, highlighting trade-offs between robustness coverage and false-positive rates.

2606.00432 2026-06-02 cs.LG

Grounded Decoding: Retrieval-Anchored Probability Fusion for Faithful RAG

Grounded Decoding: 面向忠实RAG的检索锚定概率融合

Ibne Farabi Shihab, Fariya Afrin, Sanjeda Akter, Anuj Sharma

发表机构 * Department of Computer Science, Iowa State University(爱荷华州立大学计算机科学系) Department of Computer Science, Kalinga Institute of Industrial Technology(卡林加工业技术学院计算机科学系) Department of Civil, Construction & Environmental Engineering, Iowa State University(爱荷华州立大学土木、建设与环境工程系)

AI总结 提出Grounded Decoding,一种无需训练的推理时解码框架,通过KL-重心目标融合全RAG分布和仅检索分布,并引入冲突感知自适应加权,以提升RAG的事实一致性。

详情
AI中文摘要

随着检索增强生成(RAG)系统的扩展,确保忠实基于外部证据变得越来越具有挑战性。当冲突出现时,大型语言模型仍可能优先考虑参数化知识而非检索信息。我们提出了一种新颖的无训练解码框架Grounded Decoding,旨在不修改模型参数的情况下提高RAG的事实一致性。与依赖单一条件分布的标准方法不同,我们的方法在每个生成步骤构建两个匹配提示分布:(1)以查询、检索文档和生成前缀为条件的完整RAG分布,以及(2)仅以检索证据和相同前缀为条件的仅检索分布。最终的下一词分布被推导为概率单纯形上KL-重心目标的唯一解,产生两个分布的归一化几何融合。当接地权重为零时,该公式自然恢复标准RAG,并随着接地强度增加平滑地将概率质量移向检索证据。我们进一步引入了一种冲突感知自适应加权方案,该方案基于分布分歧和检索器置信度动态调整接地。在ALCE、Natural Questions和FActScore上的实验表明,与标准RAG和有竞争力的解码时基线相比,在事实准确性和引用质量上取得了一致改进,同时保持了流畅性。我们的结果表明,概率级融合为忠实RAG解码提供了一种强大且高效的替代对数级干预方法。

英文摘要

As retrieval-augmented generation (RAG) systems scale, it becomes increasingly challenging to ensure faithful grounding in external evidence. Large language models may still prioritize parametric knowledge over retrieved information when conflicts arise. We propose a novel training-free decoding framework, Grounded Decoding, designed to improve factual consistency in RAG without modifying model parameters. Unlike standard approaches that rely on a single conditional distribution, our method constructs two matched-prompt distributions at every generation step: (1) a full RAG distribution conditioned on the query, retrieved documents, and generated prefix, and (2) a retrieval-only distribution conditioned solely on retrieved evidence and the same prefix. The final next-token distribution is derived as the unique solution to a KL-barycenter objective over the probability simplex, yielding a normalized geometric fusion of the two distributions. This formulation naturally recovers standard RAG when the grounding weight is zero and smoothly shifts probability mass toward retrieved evidence as grounding strength increases. We further introduce a conflict-aware adaptive weighting scheme that dynamically adjusts grounding based on distributional disagreement and retriever confidence. Experiments on ALCE, Natural Questions, and FActScore demonstrate consistent improvements in factual accuracy and citation quality over standard RAG and competitive decoding-time baselines, while maintaining fluency. Our results indicate that probability-level fusion provides a strong and efficient alternative to logit-level intervention methods for faithful RAG decoding.
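
A normalized geometric fusion of two next-token distributions can be written in a few lines. The sketch assumes a single grounding weight in [0, 1] applied as exponents (1 - weight) and weight; the paper's exact parameterization and its adaptive weighting scheme are not given in the abstract, and the name `geometric_fusion` and the toy distributions are hypothetical.

```python
from typing import List


def geometric_fusion(p_rag: List[float], p_ret: List[float], weight: float) -> List[float]:
    """Fuse two distributions as q(x) proportional to p_rag(x)^(1-w) * p_ret(x)^w.

    This normalized geometric mean minimizes
    (1-w) * KL(q || p_rag) + w * KL(q || p_ret) over the simplex.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight is assumed to lie in [0, 1]")
    unnormalized = [(a ** (1.0 - weight)) * (b ** weight) for a, b in zip(p_rag, p_ret)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


p_rag = [0.7, 0.2, 0.1]  # full RAG distribution over a 3-token vocabulary
p_ret = [0.2, 0.7, 0.1]  # retrieval-only distribution

no_grounding = geometric_fusion(p_rag, p_ret, weight=0.0)
half_grounding = geometric_fusion(p_rag, p_ret, weight=0.5)
```

With weight zero the fused distribution equals the full RAG distribution, and increasing the weight moves probability mass toward tokens favoured by the retrieval-only distribution.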

2606.00431 2026-06-02 cs.LG

Variance-sensitive Thompson sampling for generalised linear bandits, revisited

广义线性bandits的方差敏感Thompson采样,再探讨

Tom Perneczky, Marc Abeille, David Janz

发表机构 * University of Oxford(牛津大学) Criteo AI Lab(Criteo人工智能实验室)

AI总结 本文通过高斯庞加莱不等式证明Thompson采样在随机广义线性bandits中的方差敏感遗憾界,并指出移除预热阶段保持相同方差敏感尺度是开放且非平凡的问题。

详情
AI中文摘要

我们证明了在随机广义线性bandits中,Thompson采样具有方差敏感的遗憾界。该论证假设了一个预热阶段,之后通过使用高斯庞加莱不等式来控制遗憾。这绕过了先前基于乐观的分析失效的点。在保留相同方差敏感尺度的同时移除预热阶段仍然是开放问题,并且似乎是非平凡的。

英文摘要

We prove a variance-sensitive regret bound for Thompson sampling in stochastic generalised linear bandits. The argument assumes a warm-up, after which the regret is controlled using the Gaussian Poincaré inequality. This bypasses the point at which previous optimism-based analyses break down. Removing the warm-up while retaining the same variance-sensitive scaling remains open, and appears nontrivial.
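
For reference, the Gaussian Poincaré inequality in its standard form is stated below. How the paper applies it to the Thompson sampling posterior is not described in the abstract; the symbols $f$, $X$, $d$ and $\Sigma$ are generic notation introduced here, not the paper's.

```latex
% Standard Gaussian case: X ~ N(0, I_d), f continuously differentiable
% with square-integrable gradient.
\operatorname{Var}\bigl(f(X)\bigr)
  \le \mathbb{E}\bigl[\lVert \nabla f(X) \rVert_2^2\bigr],
  \qquad X \sim \mathcal{N}(0, I_d).

% General covariance: X ~ N(\mu, \Sigma).
\operatorname{Var}\bigl(f(X)\bigr)
  \le \mathbb{E}\bigl[\nabla f(X)^{\top} \Sigma \, \nabla f(X)\bigr],
  \qquad X \sim \mathcal{N}(\mu, \Sigma).
```

The second form shows why the inequality suits variance-sensitive arguments: the bound scales with the covariance of the Gaussian perturbation rather than with a worst-case range of $f$.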