arXivDaily: arXiv daily academic digest, updated Monday through Friday
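2606.08780 2026-06-09 cs.CV New submission

Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing

Deyin Liu, Yisheng Ding, Zhe Jin, Xiatian Zhu, Anjan Dutta, Lin Wu

Affiliations: Anhui University; University of Surrey; University of Warwick

AI summary: Proposes a zero-shot video editing method that uses adaptive clip partitioning, anchor-frame selection, and a token merging strategy to explicitly preserve the source video's temporal structure for the first time, balancing editing fidelity against computational efficiency.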

Abstract

Existing zero-shot video editing methods rely on pre-trained diffusion models and successfully achieve spatial control and basic temporal consistency, but they fundamentally fail to preserve the video's original temporal structure. This distinction is critical: temporal consistency ensures visual smoothness, but temporal structure dictates the video's high-level narrative, rhythm, and semantic flow. Without this preservation, the edited output, especially for long videos with complex semantic variations, becomes narratively incoherent and semantically ambiguous. To address this limitation, we introduce a novel zero-shot editing approach that, for the first time, explicitly focuses on preserving the source video's temporal structure. We achieve this by adaptively partitioning the video into semantically distinct clips based on feature similarity and selecting a representative anchor frame for each clip. To enhance both intra-clip fidelity and computational efficiency, we design a clip-adaptive token merging strategy that leverages the anchor's semantic dominance to stabilize the editing. Furthermore, we employ an alternating combination strategy that ensures seamless inter-clip transitions while maintaining semantic distinction. Extensive experiments demonstrate that our method achieves state-of-the-art results, successfully balancing the preservation of original temporal structure with computational efficiency, and setting a new benchmark for zero-shot video editing fidelity.
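
Illustrative sketch (not from the paper): the abstract describes partitioning a video into clips by feature similarity and picking one anchor frame per clip, but gives no formulas. The Python below shows one plausible reading under stated assumptions: frames are represented by small feature vectors, a clip boundary is placed where cosine similarity between consecutive frames falls below a hypothetical threshold, and the anchor is the frame most similar on average to the rest of its clip. The function names, the threshold value, and the anchor rule are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def partition_clips(features, threshold=0.8):
    """Start a new clip when similarity to the previous frame drops below threshold (assumed rule)."""
    clips = [[0]]
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) < threshold:
            clips.append([i])
        else:
            clips[-1].append(i)
    return clips

def select_anchor(features, clip):
    """Pick the frame with the highest mean similarity to the frames of its clip (assumed rule)."""
    def mean_similarity(i):
        return sum(cosine(features[i], features[j]) for j in clip) / len(clip)
    return max(clip, key=mean_similarity)

# Toy per-frame features: three frames of one scene, then two frames of another.
features = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
clips = partition_clips(features)
anchors = [select_anchor(features, clip) for clip in clips]
print(clips, anchors)
```

On the toy input the scene change between frames 2 and 3 becomes the clip boundary, and the anchor of the first clip is the frame that sits between the other two in feature space.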

2606.08775 2026-06-09 cs.RO cs.AI New submission

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, Farshad Khorrami

Affiliations: Tandon School of Engineering, New York University; Courant Institute of Mathematical Sciences, New York University; AMI Labs

AI summary: Proposes WorldDP, a hierarchical framework that pairs a high-level world model for runtime subgoal optimization with a low-level diffusion policy for execution, and uses object-centric representations to decouple environmental entities, enabling effective planning and execution of multi-stage robotic manipulation tasks.

Abstract

Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi-stage robotic manipulation. Our hierarchical approach utilizes a high-level world model as a transition function to optimize for feasible subgoals during runtime, which are subsequently reached by a low-level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object-centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model's physically grounded planning with diffusion policy's efficient execution yields superior multi-stage performance.

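2606.08768 2026-06-09 cs.LG New submission

Understanding the Parameter Space Geometry of Transformers Encoding Boolean Functions

Blanka Köver, Alexandra Butoi, Anej Svete, Michael Hahn, Ryan Cotterell

Affiliations: Machine Learning, ICML

AI summary: Addresses why Transformers fail to learn certain simple Boolean functions such as PARITY. By analyzing the geometry of parameter space, the paper proves that sensitive functions occupy a vanishingly small region that random initialization almost surely misses, which explains why a function can be expressible yet unlearnable.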

Comments: ICML 2026

Abstract

Transformers consistently fail to learn certain simple functions that are provably expressible with specific parameter settings. This gap between learnability and expressivity is particularly prominent for sensitive functions -- functions whose output is likely to change if a single bit of the input is flipped -- for example, PARITY. While prior work has established that transformers exhibit a bias toward functions with low average sensitivity, the precise mechanism underlying this bias remains poorly understood. To shed light on this phenomenon, we study the geometry of transformers' parameter space. We show that sensitive functions -- even when representable -- occupy a vanishingly small region that random initialization is very likely to miss. Specifically, we shift the focus from average sensitivity to the full sensitivity profile -- the distribution of sensitivity values across all inputs -- and prove that randomly initialized transformers almost surely compute functions which have low-sensitivity strings. Consequently, any function that lacks such strings is provably unlearnable.
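
Illustrative sketch: the sensitivity profile mentioned in the abstract can be computed by brute force for small input lengths. The Python below uses the standard definition of sensitivity (the number of single-bit flips that change the output) and compares PARITY with AND on 4-bit inputs. The helper names are hypothetical, and the code only illustrates the definition, not the paper's proof technique.

```python
from itertools import product

def sensitivity(f, x):
    """Number of single-bit flips of x that change f(x)."""
    count = 0
    for i in range(len(x)):
        flipped = x[:i] + (1 - x[i],) + x[i + 1:]
        if f(flipped) != f(x):
            count += 1
    return count

def sensitivity_profile(f, n):
    """Map each sensitivity value to the number of length-n inputs that have it."""
    profile = {}
    for x in product((0, 1), repeat=n):
        s = sensitivity(f, x)
        profile[s] = profile.get(s, 0) + 1
    return profile

def parity(x):
    return sum(x) % 2

def conjunction(x):
    return int(all(x))

n = 4
parity_profile = sensitivity_profile(parity, n)
and_profile = sensitivity_profile(conjunction, n)
print(parity_profile)  # {4: 16}
print(and_profile)     # 11 inputs with sensitivity 0, 4 with sensitivity 1, 1 with sensitivity 4
```

Every input of PARITY has maximal sensitivity, so PARITY has no low-sensitivity strings at all, whereas AND is insensitive on most inputs. That is the distinction the abstract draws between average sensitivity and the full profile.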

2606.08755 2026-06-09 cs.CL New submission

Co-Evolving Skill Generation and Policy Optimization

Zhiwei Zhang, Yudi Lin, Nikki Lijing Kuang, Linlin Wu, Xiaomin Li, Songtao Liu, Fenglong Ma

Affiliations: The Pennsylvania State University; Nanyang Technological University; University of California, San Diego; University of Utah; Harvard University

AI summary: Proposes an online reinforcement learning framework that estimates a skill's marginal utility from the reward gap between base and skill-augmented rollouts, validating skills before storage, and uses that signal to train the policy as a skill generator, reducing reliance on proprietary models.

Abstract

Skill-augmented reinforcement learning improves language agents by storing reusable procedural knowledge acquired from past experience. Existing methods typically use strong language models to analyze trajectories, generate skills, and update a retrievable skill bank during online training. However, they rarely assess whether a newly generated skill is useful before it is stored and reused. We find that this assumption is unreliable: even skills generated by proprietary frontier LLMs exhibit highly mixed utility, with many providing little benefit or even degrading performance. Once such skills enter the bank, their effects are difficult to identify, because subsequent rollout feedback is delayed and usually reflects the combined effect of multiple retrieved skills rather than the marginal contribution of any individual skill. We propose an online reinforcement learning framework for pre-storage skill validation. The framework estimates whether a candidate skill contributes useful information beyond the skills already retrieved for the current task. It uses the standard rollout budget to form two matched groups under the same task and retrieval context: base rollouts conditioned on the currently retrieved skills, and skill-augmented rollouts conditioned on the same skills plus one candidate skill induced from the base trajectories. The reward gap between these two groups estimates the candidate skill's context-dependent marginal utility, enabling the framework to promote useful skills while filtering ineffective or harmful ones without additional rollout overhead. The framework further uses this marginal-utility signal to train the policy itself as a skill generator, reducing reliance on repeated calls to proprietary models. The learned skill-generation likelihood serves as a context-dependent score for retrieval-time reranking and outdated-skill pruning as the policy evolves.
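
Illustrative sketch: the reward-gap estimate described in the abstract can be written as a difference of group means. The Python below assumes scalar rewards per rollout and a hypothetical acceptance threshold `min_gap`. The function names, the threshold, and the toy rewards are assumptions for illustration, not the authors' implementation.

```python
from statistics import mean

def marginal_utility(base_rewards, augmented_rewards):
    """Reward gap between skill-augmented rollouts and base rollouts on the same task."""
    return mean(augmented_rewards) - mean(base_rewards)

def should_store(base_rewards, augmented_rewards, min_gap=0.0):
    """Admit a candidate skill to the bank only if its estimated gap exceeds min_gap (hypothetical rule)."""
    return marginal_utility(base_rewards, augmented_rewards) > min_gap

# Toy rewards: both groups share the same task and the same retrieved skills.
base = [0.0, 1.0, 0.0, 1.0]
with_helpful_skill = [1.0, 1.0, 0.0, 1.0]
with_harmful_skill = [0.0, 0.0, 0.0, 1.0]

print(marginal_utility(base, with_helpful_skill))  # 0.25
print(marginal_utility(base, with_harmful_skill))  # -0.25
```

Because both groups are drawn from one rollout budget under the same retrieval context, the gap isolates the contribution of the single added skill rather than the combined effect of everything retrieved.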

2606.08751 2026-06-09 cs.CV New submission

Less Is More: Training-Free Acceleration Framework of 3D Diffusion Models for Low-Count PET Denoising via Global-Local Trajectory Reduction

Yuhan Liu, Scott M. Leonard, Marlee Crews, Muhannad Fadhel, Jinkui Hao, Tianqi Chen, Ryan J. Avery, Bo Zhou

Affiliations: Northwestern University; Hefei University of Technology

AI summary: Proposes a training-free global-local skipping strategy that initializes sampling at an intermediate step via a noise-consistent transformation and reuses U-Net features, accelerating 3D diffusion-model denoising while improving reconstruction quality.

Comments: 19 pages, 10 figures, 5 tables

Abstract

Accurate quantification and uptake measurement in PET are critical for assessing disease progression and supporting clinical decision-making. While high-count PET provides reliable image quality, the associated radiation dose and prolonged acquisition remain significant clinical concerns, motivating the adoption of low-count protocols. Diffusion-model-based methods have demonstrated strong potential for restoring low-count PET to near high-count quality, but their iterative sampling procedure becomes prohibitively expensive when applied to high-resolution 3D PET volumes, introducing substantial inference latency that limits practical clinical deployment. To address these challenges, we propose a training-free Global-Local Skipping Strategy that accelerates diffusion model-based 3D PET denoising while simultaneously improving reconstruction quality. The proposed method is plug-and-play and directly applicable to pre-trained diffusion models without retraining or architectural modification. Specifically, we introduce: (i) a global denoising step skipping strategy that initializes the reverse diffusion process from an intermediate denoising step using a noise-consistent transformation of the low-count input, substantially reducing the number of required denoising steps; and (ii) a local feature reuse shortcut that reuses slowly-varying high-level U-Net features across neighboring denoising steps, further reducing per-step computation while preserving image fidelity. We evaluate the proposed approach on multiple PET tracers from in-house and public datasets, including 18F-FDG PET, 68Ga-DOTATATE PET, and 18F-PSMA PET, demonstrating consistent acceleration of over an order of magnitude alongside improved or comparable reconstruction performance relative to the full-step baseline. Blinded reader studies further confirm enhanced clinical confidence and perceived diagnostic quality.
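
Illustrative sketch: the abstract does not specify the noise-consistent transformation or the feature-reuse schedule, so the Python below substitutes well-known stand-ins and should be read as an assumption throughout. It uses the standard DDPM forward-noising formula with a linear beta schedule to place a low-count input at an intermediate step `t0`, and a fixed refresh interval to mark which reverse steps would recompute deep U-Net features. The names, schedule values, and interval are all hypothetical.

```python
import math
import random

def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars

def noise_to_step(voxels, t0, bars, rng):
    """Standard DDPM forward noising to step t0 (assumed stand-in for the paper's transform)."""
    a = bars[t0]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in voxels]

def reuse_schedule(t0, refresh_every):
    """Reverse steps t0..0, each flagged True if deep features are recomputed at that step."""
    return [(t, (t0 - t) % refresh_every == 0) for t in range(t0, -1, -1)]

total_steps = 1000
bars = alpha_bars(total_steps)
rng = random.Random(0)
low_count = [0.2, 0.5, 0.9]  # toy stand-in for a flattened PET volume
x_t0 = noise_to_step(low_count, t0=99, bars=bars, rng=rng)
schedule = reuse_schedule(t0=99, refresh_every=4)
full_recomputes = sum(1 for _, refresh in schedule if refresh)
print(len(schedule), full_recomputes)  # 100 25
```

In this toy setting the reverse process runs 100 steps instead of 1000, and only 25 of those recompute the deep features. The global skip and the local reuse therefore reduce cost along two independent axes.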

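2606.08748 2026-06-09 cs.CL New submission

HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task

Kevin Krahn, Eric Fosler-Lussier

Affiliations: The Ohio State University

AI summary: Presents HydraQE, an end-to-end, reference-free quality estimation system for speech translation built on Qwen3-ASR. It achieves cross-modal interaction through a learnable sparsemax scalar mix and a lightweight bidirectional Transformer, trains three prediction heads on human annotations, MetricX-24 pseudo-labels, and xCOMET pseudo-labels, and outperforms cascaded text-based baselines.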

Comments: Accepted to IWSLT 2026; 9 pages, 3 figures, 4 tables

Abstract

We present HydraQE, our contribution to the IWSLT 2026 Speech Translation Metrics shared task. HydraQE is an end-to-end, reference-free quality estimation (QE) system for speech translation built on a Qwen3-ASR backbone, which accepts source audio and a translation hypothesis as joint input. Hidden states from all backbone layers are combined via a learnable sparsemax scalar mix, then re-encoded by a lightweight bidirectional Transformer to enable full cross-modal interaction prior to pooling into a shared embedding. Three independent prediction heads are trained on complementary supervision signals: human direct assessment (DA) annotations, MetricX-24 pseudo-labels, and xCOMET pseudo-labels. To address the scarcity of human-annotated data, we train on a combination of synthetically corrupted examples and silver pseudo-labeled machine translation outputs, using a curriculum that begins on synthetic and silver data and gradually shifts toward human-annotated examples. HydraQE outperforms cascaded text-based baselines and prior direct speech QE systems, demonstrating that end-to-end speech translation QE is competitive with cascaded approaches.
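
Illustrative sketch: a sparsemax scalar mix weights per-layer hidden states with a sparse probability vector, so some layers receive exactly zero weight. The Python below implements the standard sort-based sparsemax projection and applies it to toy layer states. The layer scores would be learned in practice but are fixed here, and the function names, scores, and two-dimensional states are assumptions, not HydraQE's code.

```python
def sparsemax(scores):
    """Euclidean projection of scores onto the probability simplex (sort-based algorithm)."""
    z_sorted = sorted(scores, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, z in enumerate(z_sorted, start=1):
        cumulative += z
        if 1.0 + j * z > cumulative:  # index j is still inside the support
            tau = (cumulative - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]

def scalar_mix(layer_states, layer_scores):
    """Weighted sum of per-layer hidden vectors using sparsemax weights."""
    weights = sparsemax(layer_scores)
    dim = len(layer_states[0])
    return [sum(w * h[d] for w, h in zip(weights, layer_states)) for d in range(dim)]

# Toy example: four layers, two-dimensional hidden states, fixed scores.
layer_scores = [0.1, 1.2, 1.0, -0.5]
layer_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
weights = sparsemax(layer_scores)
mixed = scalar_mix(layer_states, layer_scores)
print(weights, mixed)
```

With these scores only the second and third layers keep non-zero weight (about 0.6 and 0.4). A softmax mix, by contrast, would assign every layer some positive weight.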

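2606.08745 2026-06-09 cs.CV New submission

Stain-Aware Wavelet Regularization for Instant Adversarial Purification in Histopathology

Zhe Li, Bernhard Kainz

Affiliations: FAU Erlangen-Nürnberg

AI summary: Proposes Stain-Aware Wavelet Regularization (SAWR), which uses multi-level wavelet-domain regularization based on the Haar transform to hierarchically separate adversarial perturbations from diagnostic structural information, extends it to individual histological channels for stain-specific frequency regulation, and improves adversarial robustness by up to 10.69% within an instant purification framework.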

Comments: 14 pages, 4 figures

Abstract

Deep learning has become prevalent in computational pathology pipelines that support tasks such as cancer screening and digital pathology analysis. However, the susceptibility of neural networks to adversarial perturbations raises safety concerns for reliable deployment in clinical practice. In histopathological images, this challenge is exacerbated by the difficulty of distinguishing high-frequency adversarial noise from subtle and diagnostically relevant tissue structures. To address this issue, we propose Stain-Aware Wavelet Regularization (SAWR), an adversarial purification framework that leverages multi-level wavelet-domain regularization based on the Haar transform to hierarchically disentangle adversarial perturbations from diagnostic structural information. This spectral constraint is further extended to individual histological channels, enabling stain-specific frequency regulation consistent with the biological properties of Hematoxylin and Eosin. When integrated into an instant purification framework, SAWR improves adversarial robustness by up to 10.69% over the baseline approach, while maintaining texture and spectral fidelity under adversarial perturbations.
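
Illustrative sketch: a multi-level Haar decomposition separates a signal into detail bands of increasing scale, which is the property a wavelet-domain regularizer can exploit. The Python below is a simplified one-dimensional version (images would need the two-dimensional transform applied per channel) and shows that a small alternating perturbation lands entirely in the finest detail band while the coarse structure is untouched. It does not reproduce SAWR's regularizer, and the function names and toy signal are assumptions.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length signal."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal, levels):
    """Return the coarsest approximation and the detail bands, finest band first."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details

def energy(coeffs):
    return sum(c * c for c in coeffs)

clean = [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]  # one coarse edge
noise = [0.1 * (-1) ** i for i in range(8)]        # high-frequency perturbation
perturbed = [c + n for c, n in zip(clean, noise)]

_, clean_details = haar_decompose(clean, levels=3)
_, perturbed_details = haar_decompose(perturbed, levels=3)
print([round(energy(d), 4) for d in clean_details])
print([round(energy(d), 4) for d in perturbed_details])
```

The coarse edge carries its energy in the last (coarsest) band in both cases, while the perturbation adds energy only to the first (finest) band. A per-level penalty could therefore suppress the latter without flattening the former.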

2606.08743 2026-06-09 cs.RO New submission

Guided Discovery of New Behaviors using Diffusion Policies

Dian Yu, Sebastian Sanokowski, Majid Khadiv

Affiliations: Munich Institute of Robotics and Machine Intelligence, Technical University of Munich

AI summary: Proposes a framework that combines Feynman-Kac correctors with a guiding potential to mine and refine rare but feasible trajectories from a diffusion policy, then retrains the policy to systematically discover diverse, executable behaviors.

Comments: Preprint. Supplementary video: https://youtu.be/T7MUvMA67VM

Abstract

Diffusion models have become a powerful tool for generative modeling in robotics, with diffusion policies excelling at modeling multimodal action-trajectory distributions. However, when demonstrations are limited, standard sampling often reproduces dominant behaviors while neglecting valid but rare modes, limiting the discovery of novel solutions. Existing approaches, such as guidance methods or combining reinforcement learning with diffusion, either push samples into infeasible regions or struggle to escape local minima, failing to systematically uncover diverse behaviors. To address these challenges, we propose a framework that combines Feynman-Kac correctors with a novel guiding potential that systematically guides diffusion policy samples towards promising yet underrepresented samples. These trajectories are refined using sampling-based trajectory optimization and reincorporated into the training set to retrain the diffusion policy. Our method effectively mines and repairs novel trajectories, enabling the systematic discovery of diverse and executable behaviors. We demonstrate the effectiveness of our framework across a range of manipulation environments, consistently discovering new behaviors.

2606.08741 2026-06-09 cs.RO New submission

Safe, Fluent and Acceptable Motion Generation and Execution for Human--Robot Interaction in Manufacturing Environments

Thibaut Lopez, Olivier Aycard, Pierre-Brice Wieber, Mohamed Boua, Christine Jeoffrion

Affiliations: GIPSA Lab; Grenoble Institute of Technology; Inria; LIP/PC2S; Univ. Grenoble Alpes; Univ. Savoie Mont Blanc

AI summary: For shared human-robot environments, proposes motion generation strategies that combine safety with social awareness, generates four social behaviors with an MPC framework, and shows in a user study that robot behavior significantly affects social acceptability.

Abstract

Robots operating in human environments must not only ensure physical safety but also exhibit behaviors that are understandable, fluent, and acceptable to human partners. This paper investigates motion generation strategies that combine safety guarantees with interaction quality considerations, such as motion smoothness and human comfort. While the design of robots capable of ensuring safety in shared human-robot environments has enabled closer and more advanced forms of interaction, these new proximity-based tasks require moving beyond purely technical considerations. In particular, robot behavior must also be addressed from psycho-cognitive and social perspectives. In this context, we argue for the relevance of integrating social-aware motion control into robotic systems. First, we identify the motion parameters that influence human perception and operator experience. Then, we implement a Model Predictive Control (MPC) framework that generates four distinct socially-informed robot behaviors. Finally, we conduct a user study to evaluate and validate these behaviors and assess their social impact on non-expert participants. The results demonstrate that variations in robot behavior significantly affect the perceived social acceptability of the system. These findings highlight the importance of incorporating human-centered considerations into motion generation strategies for robots operating in shared environments.

2606.08737 2026-06-09 cs.RO New submission

Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation

Yunfan Lou, Yifan Ye, Yankai Fu, Jun Cen, Xiaowei Chi, Yaoxu Lyu, Peidong Jia, Sirui Han, Zhihe Lu, Shanghang Zhang

Affiliations: Peking University; The Hong Kong University of Science and Technology; Nanjing University; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

AI summary: Proposes Dream-Tac, a unified tactile world action model that jointly models actions, future visual observations, and tactile dynamics through contact-gated visuotactile fusion and a contact-aware attention bias, improving action accuracy by 31.7% on average across six contact-rich manipulation tasks.

Comments: 16 pages, 13 figures

Abstract

World action models inherit the predictive capability of world models, enabling action generation to be guided by anticipated future observations. However, they rely primarily on vision and often fail in contact-rich manipulation, where critical cues arise from physical interaction. In this paper, we propose Dream-Tac, a unified Tactile-World Action Model that jointly models actions, future visual observations, and tactile dynamics. Specifically, Dream-Tac introduces (i) contact-gated visuotactile fusion to selectively integrate tactile signals and (ii) a contact-aware attention bias to better regulate cross-modal interactions during manipulation. To support real-time deployment, we further design a dual-level acceleration strategy, reformulating the contact-aware bias to preserve the fused attention path during training and introducing cache-based diffusion acceleration at inference, achieving up to 2.9× faster training and 1.8× faster inference. Across six contact-rich manipulation tasks, Dream-Tac improves action accuracy by 31.7% on average, demonstrating the effectiveness of unified visuotactile world modeling. Code is available at https://github.com/LYFCLOUDFAN/Dream-Tac.

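2606.08736 2026-06-09 cs.LG cs.DB New submission

Declarative Outcome-Conformant Synthesis: Exact, Closed-Form Specification Satisfaction and a Conformance Benchmark

Muhammed Rasin

Affiliations: Independent Researcher

AI summary: Targets the need to exactly satisfy declared analytical outcomes with no source data. Introduces the outcome-conformant synthesis task, achieves exact aggregates through closed-form conditional Gamma sampling, builds the SpecBench benchmark, and shows that conformance and fidelity are orthogonal.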

Comments: 22 pages, 1 figure. Benchmark and reference implementation (MIT): https://github.com/rasinmuhammed/misata

Abstract

We study a capability the dominant paradigm in synthetic tabular data does not provide: exact satisfaction of a declared analytical outcome with no source data. Imitation methods (copulas, GANs, diffusion) learn a real distribution and sample from it, and are judged on fidelity to real data. A large, practical class of needs is different: generating data with no source data ("cold start") that reproduces a declared outcome (a revenue curve, a churn rate, a group share) across a relational schema. Off-the-shelf imitation tools offer no interface for such targets, and no sampler can hit an exact aggregate, because sampling has variance. On a real public dataset, off-the-shelf learned synthesizers trained on that very data miss the declared monthly aggregate by 74 to 86 percent; a per-period steelman cuts the miss to about 19 percent and still cannot reach 0; a closed-form generator reaches exactly 0. We name this task outcome-conformant synthesis, argue its evaluation axis is conformance rather than fidelity, and show the two axes are orthogonal. We contribute: (1) a formal account showing a widely-used family of exact-aggregate generators is exactly conditional-sum sampling of a Gamma population (via Lukacs' characterization), with closed-form exactness, a closed-form marginal CV, and scale-invariance; a controlled experiment maps the boundary, enforcing the exact aggregate costs at most 0.006 in 1-Wasserstein distance to an arbitrary external marginal, the rest being shape-family mismatch; (2) SpecBench, to our knowledge the first benchmark to measure conformance to analytical outcomes for cold-start relational synthesis; and (3) a closed-form, deterministic reference system. Exact aggregation alone is trivial; the contribution is conformance jointly with closed-form marginals, integrity, determinism, and zero source data. We concede fidelity to imitation where real data exists.
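
Illustrative sketch: the conditional-sum view in the abstract can be demonstrated with independent Gamma draws that share one scale. For such draws the proportions are independent of the sum, so rescaling the draws to a declared total is a draw from the population conditioned on that total. The Python below is a minimal sketch under that reading; it does not use the linked reference implementation, and the function name, shape parameter, and revenue figure are assumptions.

```python
import random

def exact_total_sample(n, total, shape, rng):
    """Draw n positive values that sum to total (up to floating-point rounding)."""
    draws = [rng.gammavariate(shape, 1.0) for _ in range(n)]
    scale = total / sum(draws)
    return [d * scale for d in draws]

rng = random.Random(42)
declared_monthly_revenue = 120000.0  # hypothetical declared outcome
orders = exact_total_sample(n=500, total=declared_monthly_revenue, shape=2.0, rng=rng)
miss = abs(sum(orders) - declared_monthly_revenue)
print(miss)
```

The declared aggregate is hit up to floating-point rounding, with no sampling variance in the total, while individual values still vary. Multiplying the declared total by a constant multiplies every value by the same constant, which illustrates the scale-invariance the abstract mentions.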

2606.08735 2026-06-09 cs.AI New submission

Structure-Conditioned Actor-Critic Branches for Quality-Diversity Reinforcement Learning

Lianrong Zuo, Peilan Xu, Yong Liu, Wenjian Luo

Affiliations: School of Artificial Intelligence, Nanjing University of Information Science and Technology; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology

AI summary: Proposes SV-QD-RL, a framework that uses structure-conditioned actor-critic branches and a branch-aware QD archive to build high-quality, behaviorally diverse policy repertoires on MuJoCo tasks.

Abstract

Quality-diversity reinforcement learning (QD-RL) aims to construct policy repertoires that contain both high-performing and behaviorally diverse policies. Existing QD-RL methods mainly diversify policy instances after rollout evaluation or use learned value information to improve policy quality and behavior targeting, while the learning branches that generate candidate policies remain less explored. This paper proposes SV-QD-RL, a structure-value coupled framework that represents each candidate as a structure-conditioned actor-critic branch. Each branch contains an actor, a structural mask, a branch-specific critic, a replay state, and evaluation attributes including behavior, return, sparsity, and value profile. The structural mask defines the actor subspace in which the branch learns, while the branch-specific critic and replay state shape its value-learning trajectory. A branch-aware QD archive then evaluates and retains branches according to behavioral quality, structural footprint, and value-profile information. Experiments on MuJoCo continuous-control tasks show that SV-QD-RL constructs policy repertoires with strong archive quality and behaviorally useful diversity. Ablation and diagnostic analyses further indicate that structural conditioning, critic differentiation, and memory-consistent refinement make complementary contributions to behavioral specialization. Schedule-aware repertoire evaluation shows that the learned archive provides selectable policy alternatives under changing behavior-level requirements. These results suggest that coupling actor structure with branch-specific value learning is an effective mechanism for generating diverse QD-RL policy repertoires.

2606.08729 2026-06-09 cs.RO cs.LG New submission

IR-SIM: A Lightweight Skill-Native Simulator for Navigation, Learning, and Benchmarking

Ruihua Han, Shuai Wang, Chengyang Li, Rui Gao, Xinyi Wang, Zhe Liu, Guoliang Li, Yupu Lu, Qi Hao, Jia Pan, Hengshuang Zhao

Affiliations: The University of Hong Kong; Shenzhen Institutes of Advanced Technology; Southern University of Science and Technology; University of Michigan; University of Macau

AI summary: Proposes IR-SIM, a lightweight skill-native navigation simulator in which scenarios are defined entirely by YAML configuration and can be generated and modified from text prompts. It supports benchmarking of navigation algorithms and automated generation of training data, and bridges to high-fidelity simulators and real-world deployment.

Comments: 12 pages, 6 figures, project website: https://github.com/hanruihua/ir-sim

Abstract

Simulation plays a key role in automated robotics research supported by large language models (LLMs). However, existing simulators often require custom code or complex interfaces, creating a barrier to rapid prototyping and automated algorithm development. To this end, we propose the Intelligent Robot Simulator (IR-SIM), a lightweight skill-native navigation simulator designed for rapid scenario construction, benchmarking, and robot learning. In IR-SIM, scenarios are entirely defined by YAML configuration files that specify mobile robot kinematics, geometric collision checking, LiDAR sensing, visualization, and behavior modules. This design makes robotic simulation fully describable and reproducible, allowing scenarios to be generated and modified from text prompts through the proposed IR-SIM agent skills. The resulting scenarios can be used for automated benchmarking of navigation algorithms and for automated generation of training data for learning methods. Furthermore, IR-SIM provides bridges to high-fidelity simulators and real-world deployment, allowing users to validate their algorithms in more realistic settings after prototyping without extra coding. The experiments showcase the convenience and versatility of IR-SIM in multiple tasks: constructing navigation scenarios from natural language, training a collision avoidance policy, benchmarking social navigation policies, and bridging to high-fidelity simulators and real-world deployment. The project website is available at https://github.com/hanruihua/ir-sim.

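2606.08725 2026-06-09 cs.RO cs.SY eess.SY New submission

Real-Time and Accurate Collision-Free Teleoperation via Differentiable Constraint-Based Trajectory Planning

Max Grobbel, Tristan Schneider, Daniel Flögel, Sören Hohmann

Affiliations: FZI - Forschungszentrum Informatik; Karlsruhe Institute of Technology

AI summary: Addresses self-collisions and environment collisions in teleoperation with a trajectory planning method built on duality-based differentiable collision constraints. Modeling the robot with capsules and the environment with polytopes yields lower computation times and more accurate obstacle modeling, for smooth, collision-free teleoperation.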

Comments: 8 pages, 4 figures, accepted at ICRA 2026

Abstract

In teleoperation, the human operator typically controls only the end-effector pose, which often leads to self-collisions of the manipulator and collisions with environmental obstacles, since joints and links are not controlled individually. A common strategy to mitigate this issue is to enhance the operator's input using optimal-control-based trajectory planning. As derivative-based solvers require differentiable constraints, existing approaches either approximate robots and obstacles with spheres, reducing geometric accuracy, or approximate derivatives, degrading convergence and increasing computation times. We address these limitations by adapting a recent formulation of differentiable collision-avoidance constraints, based on duality in convex optimization, to the teleoperation setting. The robot is approximated with capsules and the environment with polytopes. We compare the resulting trajectory planning method against state-of-the-art techniques in simulation with varying numbers of obstacles and evaluate it on a UR5e manipulator in a real-world teleoperation test. Results show that our approach achieves lower computation times while enabling more accurate obstacle modeling, leading to smoother and collision-free end-effector teleoperation.

2606.08722 2026-06-09 cs.SD cs.CL 新提交

Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding

LLM 能否理解 LilyPond?一个用于符号音乐生成与理解的基准

Matteo Spanio, Mohammad Torabi, Andrea Poltronieri, Antonio Rodà

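Affiliations: University of Padova; Universitat Pompeu Fabra

AI summary: Introduces LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and understanding in open-weight LLMs. Experiments show that executable LilyPond can be generated zero-shot, while structural understanding tasks remain challenging and the metrics disagree systematically.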

详情
Comments
Accepted at Ital-IA 2026
AI中文摘要

大型语言模型的符号音乐评估在表示、数据集和指标上仍然碎片化。我们引入了 LilyBench,一个基于 LilyPond 的基准,用于在同一系列开源权重 LLM 上联合评估符号音乐生成和音乐理解。该基准包括一个 200 个提示的生成套件和十个从 ABC-Eval 改编的理解任务,涵盖语法、元数据预测、结构排序和音乐识别。生成质量通过编译率、基于 Jensen-Shannon 相似度的 MusPy 描述符分布以及基于 LilyBERT 的 Fréchet 音乐距离 (FMD) 进行评估。在四个开源模型上的实验表明,在零样本设置下可以实现可执行的 LilyPond 生成,而结构理解任务尽管在作曲家和流派识别上表现强劲,但仍然具有挑战性。我们的实验还揭示了基于描述符和基于嵌入的指标之间的系统性分歧,表明符号音乐评估受益于指标三角测量而非单一分数排名。我们发布了基准、提示库和评估代码,以支持未来在符号音乐生成和理解方面的研究,地址为 https://github.com/CSCPadova/lilybench。

英文摘要

Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fréchet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench
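As a rough illustration of comparing descriptor distributions with Jensen-Shannon similarity, the sketch below turns two histograms into a score in [0, 1]. Defining similarity as one minus the base-2 Jensen-Shannon divergence is an assumption made here for illustration; the benchmark's own definition and binning may differ, and the function names are hypothetical.

```python
import math

def normalize(counts):
    """Turn raw histogram counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram must have positive mass")
    return [c / total for c in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (base-2 logs), bounded in [0, 1]."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0.0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_similarity(counts_a, counts_b):
    """Assumed similarity: 1 minus the JS divergence of the two histograms."""
    return 1.0 - js_divergence(normalize(counts_a), normalize(counts_b))

# Example: pitch-class style histograms for generated vs. reference music.
print(js_similarity([3, 3], [5, 5]))
print(js_similarity([4, 0], [0, 7]))
```

Identical distributions score 1 and distributions with disjoint support score 0, so the value can be read as a bounded agreement measure for one descriptor at a time.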

2606.08721 2026-06-09 cs.LG 新提交

A Geometric Measure of Linear Separability for Neural Representations

神经表征的线性可分性几何度量

Yi Wei, Xuan Qi, Furao Shen

Affiliations: State Key Laboratory of Novel Software Technology, School of Intelligence Science and Technology, Nanjing University; AI for Good (AIGO), Istituto Italiano di Tecnologia; DITEN, University of Genoa; State Key Laboratory of Novel Software Technology, School of Artificial Intelligence, Nanjing University

AI summary: Proposes the directional linear separability measure (LSM), which searches over affine halfspaces containing every sample of a target class and measures the smallest intrusion of competing samples, giving an asymmetric, class-wise, target-normalized diagnostic for the class-wise geometry of neural representations.

详情
AI中文摘要

现代神经分类器通常依赖线性读出,但仅预测指标无法刻画此类读出所操作的表征的类间几何。我们引入方向线性可分性度量(LSM),一种用于单侧仿射可分性的有限样本诊断工具。对于目标类A和竞争集B,LSM搜索包含A中所有样本的仿射半空间,并测量必须留在目标侧的最小竞争样本入侵量,按|A|归一化。所得量是不对称的、类级的、目标归一化的,适用于从神经网络提取的有限表征。我们建立了其支撑超平面刻画,将其与最优仿射分类精度关联,并证明了在全秩线性嵌入下的不变性。这些结果将线性重参数化引起的变化与信息丢失或非线性几何变换引起的变化区分开来。我们还给出了一种基于惩罚的仿射搜索,用于在高维特征中估计类级LSM,报告的值根据原始离散保持和违反准则计算。最后,我们将坐标门控非线性作为有限样本几何算子进行分析,并经验性地使用LSM诊断常见深度学习组件和架构中的类级入侵。

英文摘要

Modern neural classifiers commonly rely on linear readouts, yet predictive metrics alone do not characterize the class-wise geometry of the representations on which such readouts operate. We introduce the directional linear separability measure (LSM), a finite-sample diagnostic for one-sided affine separability. For a target class A and a competing set B, LSM searches over affine halfspaces that contain all samples in A and measures the smallest competing-sample intrusion that must remain on the target side, normalized by |A|. The resulting quantity is asymmetric, class-wise, target-normalized, and applicable to finite representations extracted from neural networks. We establish its supporting-hyperplane characterization, relate it to optimal affine classification accuracy, and prove invariance under full-rank linear embeddings. These results separate changes caused by linear reparameterization from those caused by information loss or nonlinear geometric transformations. We also give a penalty-based affine search for estimating class-wise LSM in high-dimensional features, with reported values computed from the original discrete preservation and violation criterion. Finally, we analyze coordinatewise gated nonlinearities as finite-sample geometric operators and empirically use LSM to diagnose class-wise intrusion across common deep-learning components and architectures.
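The definition above can be made concrete for a fixed direction. In this sketch, for a direction `w` the tightest halfspace {x : w·x >= t} that contains all of A uses t = min over A of w·a; the competing samples that also satisfy the inequality are counted and divided by |A|. Minimizing over a finite list of candidate directions only gives an upper bound on the measure, and the closed boundary (>=) is an assumption; the paper's penalty-based search is not reproduced here.

```python
def intrusion_for_direction(A, B, w):
    """Fraction (relative to |A|) of B samples inside the tightest halfspace
    along direction w that still contains every sample of A."""
    def dot(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    threshold = min(dot(a) for a in A)           # supporting hyperplane of A
    intruders = sum(1 for b in B if dot(b) >= threshold)
    return intruders / len(A)

def lsm_upper_bound(A, B, directions):
    """Best value over a finite set of candidate directions (an upper bound)."""
    return min(intrusion_for_direction(A, B, w) for w in directions)

# Hypothetical 2-D features for a target class and its competitors.
target = [(2.0, 0.0), (3.0, 1.0), (2.5, -1.0)]
competitors = [(0.0, 0.0), (1.0, 0.0), (2.2, 0.5), (-1.0, 2.0)]
print(lsm_upper_bound(target, competitors, [(1.0, 0.0), (0.0, 1.0)]))
```

Swapping the roles of `target` and `competitors` generally changes the value, which is the asymmetry the abstract refers to.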

2606.08719 2026-06-09 cs.CV 新提交

Thinking Without Images: Internalizing Visual Manipulation with On-Policy Self-Distillation

无图像思考:通过在线自我蒸馏内化视觉操作

Yishuo Cai, Jiahui Liu, Yuanxin Liu, Haobo Deng, Linli Yao, Yuhao Zheng, Kun Ouyang, Zhimo Li, Ziyue Wang, Xu Sun, Haoli Bai, Xiaohui Li

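Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Central South University; University of Science and Technology of China; Peking University; Huawei Technologies

AI summary: Proposes Imagine-OPD, an on-policy self-distillation framework that internalizes the visual reasoning ability of "Thinking with Images" as "Thinking with Imagination": the model generates internal visual cues without calling external tools, maintaining performance while significantly reducing inference overhead.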

详情
AI中文摘要

“用图像思考”已成为细粒度视觉推理的有效范式:通过显式放大相关区域并推理裁剪区域,模型可以访问从单个全局图像中难以恢复的局部证据。然而,这种优势伴随着冗余的工具调用和更长的推理轨迹。此外,当这种行为主要从结果奖励中学习时,产生的中间裁剪或视觉线索可能带有噪声,或者无法忠实地捕获任务相关的视觉证据。在这项工作中,我们探讨是否可以通过“用想象思考”来内化“用图像思考”的推理优势:这是一个内部过程,决定看哪里并想象更仔细检查会揭示什么视觉线索,而无需实际调用工具。我们提出Imagine-OPD,一种在线自我蒸馏框架,其中教师模型在训练期间扮演“用图像思考”推理者的角色:它接收来自标注区域的特权缩放证据视图,并监督模型自身的想象推理轨迹。Imagine-OPD不需要外部教师或高质量的想象演示。在视觉中心基准上的实验表明,Imagine-OPD在比较模型中实现了最佳平均性能,同时与“用图像思考”方法相比显著降低了推理开销。

英文摘要

"Thinking with Images" has emerged as an effective paradigm for fine-grained visual reasoning: by explicitly zooming into relevant regions and reasoning over crops, models can access local evidence that is difficult to recover from a single global image. However, this benefit comes with redundant tool invocations and longer inference traces. Moreover, when such behaviors are learned mainly from outcome reward, the resulting intermediate crops or visual cues can be noisy or fail to faithfully capture task-relevant visual evidence. In this work, we ask whether the reasoning benefits of "Thinking with Images" can be internalized through "Thinking with Imagination": an internal process that decides where to look and imagines what visual cues closer inspection would reveal without actually invoking tools. We propose Imagine-OPD, an on-policy self-distillation framework in which a teacher plays the role of a "Thinking with Images" reasoner during training: it receives privileged zoomed evidence views derived from annotated regions, and supervises the model's own imagination reasoning trajectories. Imagine-OPD does not require an external teacher or high-quality imagination demonstrations. Experiments on vision-centric benchmarks show that Imagine-OPD achieves the best average performance among compared models while significantly reducing inference overhead compared with "Thinking with Images" methods.

2606.08715 2026-06-09 cs.CL 新提交

Operationalizing Linguistic Methods through Prompt-Engineering Skills: An Automatic Chinese Web Neologism Detection Pipeline

通过提示工程技能操作化语言学方法:一种自动中文网络新词检测流水线

Yufeng Wu, Meichun Liu

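Affiliations: City University of Hong Kong

AI summary: Proposes an automatic method for detecting Chinese web neologisms that turns traditional linguistic identification principles into prompt-engineering skills. A four-stage pipeline detects 4,853 neologisms in 267M documents, and the evaluation identifies candidate coverage and LLM semantic judgment as the bottlenecks.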

详情
AI中文摘要

我们提出了一种自动中文网络新词检测方法,该方法将传统语言学识别原则操作化为提示工程技能。该方法包括四个阶段:基于字符n-gram的与分词器无关的候选生成;基于点互信息预过滤的词典锚定;基于中文构词原则的构词合法性技能;以及结合规则和三元分类技能来区分新词、实体和无。将该方法应用于BAAI CCI 3.0语料库(2.67亿文档),产生了226,959个分类候选,其中包括4,853个标注新词。为了评估该方法,我们开发了逐阶段条件召回分解,其中流水线的严格召回在数学上分解为各阶段条件召回的乘积。应用于Hou(2023)(4,199个条目),该分解揭示了阶段1候选覆盖和阶段4B LLM语义判断是两个瓶颈(召回率分别为41.5%和60.0%),而中间阶段接近无损。进一步的长度分层分析表明,结构构词合法性技能与长度无关(>= 96.9%),而语义新颖性分类技能与长度相关(2/3/4字符候选分别为65.6%/59.0%/44.1%),描绘了基于技能的语言学操作化的当前边界。我们将该方法、流水线输出和评估协议作为公共资源发布。

英文摘要

We present a method for automatic Chinese web neologism detection that operationalizes traditional linguistic identification principles as prompt-engineering skills. The method has four stages: tokenizer-independent character n-gram candidate generation; dictionary anchoring with a Pointwise Mutual Information pre-filter; a well-formedness skill based on Chinese word-formation principles; and a combined rule and three-way classification skill that distinguishes neologism, entity, and none. Applied to the BAAI CCI 3.0 corpus (267M documents), the method produces 226,959 classified candidates including 4,853 labeled neologisms. To evaluate the method, we develop a per-stage conditional recall decomposition in which the pipeline's strict recall factors mathematically into the product of stage conditional recalls. Applied to Hou (2023) (4,199 entries), the decomposition exposes Stage 1 candidate coverage and Stage 4B LLM semantic judgment as the two bottlenecks (R=41.5% and 60.0% respectively), while intermediate stages are near-lossless. A length-stratified analysis further reveals that the structural well-formedness skill is length-invariant (>= 96.9%) whereas the semantic novelty-classification skill is length-dependent (65.6%/59.0%/44.1% across 2/3/4-character candidates), mapping a current boundary of skill-based linguistic operationalization. We release the method, pipeline outputs, and evaluation protocol as public resources.
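The per-stage decomposition described above follows from a telescoping product: if each stage's conditional recall is the share of gold items surviving that stage among those that survived the previous one, the product equals the end-to-end strict recall. The sketch below shows this with made-up counts; the numbers and the function name are hypothetical and are not taken from the paper.

```python
import math

def conditional_recalls(gold_total, survivors_per_stage):
    """Recall of each stage, conditioned on surviving all earlier stages."""
    recalls = []
    previous = gold_total
    for kept in survivors_per_stage:
        if previous == 0:
            raise ValueError("no gold items left to condition on")
        recalls.append(kept / previous)
        previous = kept
    return recalls

# Hypothetical counts: 1000 gold neologisms, survivors after four stages.
gold_total = 1000
survivors = [500, 490, 480, 240]
stage_recalls = conditional_recalls(gold_total, survivors)
strict_recall = math.prod(stage_recalls)
print(stage_recalls, strict_recall)
```

Reading the factors side by side makes bottleneck stages visible: here the first and last hypothetical stages each lose half of what they receive, while the middle stages are close to lossless.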

2606.08712 2026-06-09 cs.LG cs.AI cs.CV 新提交

SNR-ST-Mix: Sample-specific Neighborhood Regression Mixup for Augmented Spatial Transcriptomics Imputation with Deep Neural Network

SNR-ST-Mix: 基于样本特异性邻域回归混合增强的空间转录组学深度神经网络插补

Hongyi Yu, Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee A. Cooper, Bo Zhou

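Affiliations: Northwestern University; Yale University

AI summary: To address noisy, low-resolution spatial transcriptomics data, proposes SNR-ST-Mix, a data augmentation framework that constrains mixing to spatial neighborhoods and weights it by expression similarity, generating biologically plausible synthetic samples that improve deep-neural-network imputation.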

详情
Comments
19 pages, 4 figures, 3 tables
AI中文摘要

目的:空间转录组学(ST)能够在组织背景下测量基因表达。然而,这些测量通常噪声大、分辨率低且采样稀疏,限制了精细空间结构的恢复。深度神经网络已成为从组织学进行表达插补的强大工具,但其性能仍受限于有限的样本量和缺乏生物学信息的增强。大多数现有的学习增强策略是为分类任务而非回归任务设计的,忽略了空间和转录组关系,导致生物上不合理的插值,阻碍了预测性能。方法:为解决这些限制,我们提出SNR-ST-Mix,一种专门为ST数据设计的几何和表达感知数据增强框架。它将混合限制在点的k个最近空间邻域内,并基于表达相似性自适应加权插值系数,生成保留局部生物结构同时确保空间平滑性的增强样本。这种双重条件化产生合成样本,扩展了有效训练流形,促进了泛化,并在样本特异性训练下增强了预测稳定性。结果:使用各种组织类型的大量实验表明,SNR-ST-Mix在不需要架构更改或额外计算的情况下,始终优于传统增强方法。结论:SNR-ST-Mix为空间转录组学回归任务提供了一种有效且生物学原理的增强策略。通过显式利用空间几何和转录组相似性,它扩展了有效训练流形,并在不增加模型复杂度的情况下提高了预测性能。

英文摘要

Purpose: Spatial transcriptomics (ST) enables gene expression measurements within the tissue context. However, these measurements are often noisy, low-resolution, and sparsely sampled, which limits the recovery of fine spatial structure. Deep neural networks have become powerful tools for expression imputation from histology, but their performance remains constrained by limited sample sizes and a lack of biologically informed augmentation. Most of the existing augmentation strategies for learning are designed for classification tasks rather than regression, which neglect spatial and transcriptomic relationships, leading to biologically implausible interpolations that hinder prediction performance. Approach: To address these limitations, we propose SNR-ST-Mix, a geometry- and expression-aware data augmentation framework designed specifically for ST data. It constrains mixing to a spot's k-nearest spatial neighbors and adaptively weights interpolation coefficients based on expression similarity, generating augmented samples that preserve local biological structure while ensuring spatial smoothness. This dual conditioning yields synthetic examples that expand the effective training manifold, promote generalization, and enhance prediction stability under sample-specific training. Results: Extensive experiments with various tissue types demonstrate that SNR-ST-Mix consistently outperforms conventional augmentation methods without requiring architectural changes or additional computation. Conclusions: SNR-ST-Mix provides an effective and biologically principled augmentation strategy for spatial transcriptomics regression tasks. By explicitly leveraging spatial geometry and transcriptomic similarity, it expands the effective training manifold and improves predictive performance without increasing model complexity.
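A minimal sketch of neighborhood-constrained mixup is given below. It assumes Euclidean distance in tissue coordinates for the k-nearest-neighbor search and uses clipped cosine similarity, scaled by a hypothetical `max_mix`, as the interpolation coefficient. The abstract does not specify the actual weighting function, so this form is an assumption for illustration only.

```python
import math

def k_nearest(coords, i, k):
    """Indices of the k spots closest to spot i in tissue coordinates."""
    others = [j for j in range(len(coords)) if j != i]
    others.sort(key=lambda j: math.dist(coords[i], coords[j]))
    return others[:k]

def cosine(u, v):
    """Cosine similarity between two expression vectors (0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def mix_with_neighbor(expr, i, j, max_mix=0.5):
    """Interpolate spot i toward neighbor j; more similar neighbors mix more."""
    lam = max_mix * max(0.0, cosine(expr[i], expr[j]))  # assumed weighting
    return [(1.0 - lam) * a + lam * b for a, b in zip(expr[i], expr[j])]

# Four hypothetical spots: three close together, one far away.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
expr = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = k_nearest(coords, 0, 2)
augmented = [mix_with_neighbor(expr, 0, j) for j in neighbors]
print(neighbors, augmented)
```

In a full pipeline the same coefficient would presumably be applied to the paired input (for example the histology patch or its features) so that inputs and regression targets stay consistent; the distant spot is never used because it falls outside the neighborhood.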

2606.08708 2026-06-09 cs.CV 新提交

PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

PRPO: 通过令牌级动态优势重塑的感知强化策略优化

Qiming Li, Tianlun Li, Xiaolong Cheng, Hangyu Li, Ruiyan Gong, Kangning Niu, Kaitao Jiang, Mu Xu

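Affiliations: Amap CV Lab, Alibaba Group; Peking University

AI summary: Proposes PRPO, a token-level reinforcement learning framework that identifies pivotal perceptual tokens with the Robust Visual Dependency (RVD) metric and strengthens their learning signal with Perceptual Advantage Reshaping (PAR), with average gains of 23.3% (3B model) and 21.1% (7B model) across seven multimodal reasoning benchmarks.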

详情
AI中文摘要

可验证奖励强化学习(RLVR)已成为提升大型视觉语言模型(LVLMs)推理能力的有效范式。然而,现有的RLVR方法主要依赖于轨迹级结果奖励,为所有生成的令牌分配相同的学习信号。这种粗粒度的信用分配从根本上与多模态推理不匹配,因为只有稀疏的子集令牌在因果上基于视觉证据。因此,这些关键的感知令牌受到弱监督,并且常常被语言先验或推理模板令牌淹没。为解决这一局限,我们提出感知强化策略优化(PRPO),一种令牌级强化学习框架,明确识别并强化长程多模态推理轨迹中的关键感知令牌。PRPO引入了鲁棒视觉依赖(RVD),一种原则性度量,用于识别预测既基于视觉又对扰动稳定的令牌,过滤掉脆弱或噪声视觉令牌。基于RVD,我们进一步提出感知优势重塑(PAR),一种令牌级信用分配技术,放大感知信息丰富的令牌,同时为非感知令牌保留稳定梯度。在七个多模态推理基准上的大量实验表明,PRPO在3B和7B模型规模上均持续优于强LVLM基线,分别实现了23.3%和21.1%的平均增益。PRPO以更高的训练效率和更强的跨任务泛化能力达到了最先进的性能。我们的发现强调了细粒度信用分配对于可扩展多模态强化学习的重要性。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reasoning capability of Large Vision-Language Models (LVLMs). However, existing RLVR methods primarily rely on trajectory-level outcome rewards, which assign identical learning signals across all generated tokens. This coarse-grained credit assignment is fundamentally mismatched to multimodal reasoning, where only a sparse subset of tokens is causally grounded in visual evidence. Consequently, these pivotal perceptual tokens receive weak supervision and are often overwhelmed by language priors or reasoning-template tokens. To address this limitation, we propose Perception-Reinforced Policy Optimization (PRPO), a token-level reinforcement learning framework that explicitly identifies and reinforces pivotal perceptual tokens within long-horizon multimodal reasoning trajectories. PRPO introduces Robust Visual Dependency (RVD), a principled metric that identifies tokens whose predictions are both visually grounded and perturbation-stable, filtering out brittle or noisy visual tokens. Based on RVD, we further propose Perceptual Advantage Reshaping (PAR), a token-level credit assignment technique that amplifies perceptually informative tokens while preserving stable gradients for non-perceptual tokens. Extensive experiments on seven multimodal reasoning benchmarks demonstrate that PRPO consistently outperforms strong LVLM baselines across both 3B and 7B model scales, achieving average gains of 23.3% and 21.1%, respectively. PRPO achieves state-of-the-art performance with improved training efficiency and stronger cross-task generalization. Our findings highlight the importance of fine-grained credit assignment for scalable multimodal reinforcement learning.
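To make the contrast between trajectory-level and token-level credit concrete, here is a purely illustrative sketch. It assumes each token already has a score in [0, 1] standing in for RVD, and it scales the shared trajectory advantage by `1 + alpha * score` for tokens above a threshold while leaving other tokens unchanged. The scaling rule, `alpha`, `threshold`, and the function name are all hypothetical; the abstract does not give the actual PAR formula.

```python
def reshape_advantages(trajectory_advantage, token_scores, alpha=0.5, threshold=0.5):
    """Assign per-token advantages from one trajectory-level advantage.

    Tokens whose (assumed) visual-dependency score reaches the threshold get
    an amplified advantage; all other tokens keep the original value.
    """
    reshaped = []
    for score in token_scores:
        if score >= threshold:
            reshaped.append(trajectory_advantage * (1.0 + alpha * score))
        else:
            reshaped.append(trajectory_advantage)
    return reshaped

# One trajectory with advantage 2.0 and three tokens with hypothetical scores.
per_token = reshape_advantages(2.0, [0.0, 0.9, 0.3])
print(per_token)
```

With uniform credit all three tokens would receive 2.0; here only the middle token, which has the high score, receives a larger signal.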

2606.08705 2026-06-09 cs.CL 新提交

Analyzing the Correlation Between Hallucinations and Knowledge Conflicts in Large Language Models

分析大型语言模型中幻觉与知识冲突之间的相关性

Lucrezia Laraspata, Giovanna Castellano, Gennaro Vessio

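Affiliations: University of Bari Aldo Moro

AI summary: Uses probing to analyze internal LLM representations and finds that hallucination activation patterns cannot be fully attributed to knowledge conflicts, although probing itself helps improve model interpretability.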

详情
AI中文摘要

幻觉——事实不正确或无法验证的输出——仍然是大型语言模型(LLM)最具挑战性的限制之一,尤其是在知识密集型任务中。一种提出的解释是,由固定的、过时的训练数据引起的内部知识冲突。本文研究了与知识冲突相关的内部表示是否与LLM中的幻觉行为相关。使用受两项先前工作启发的探针技术,我们分析了预定义任务中隐藏层、注意力层和MLP层的激活以及输出logits。我们在幻觉检测基准上探测了LLaMA-3-8B,并在知识冲突数据集上探测了Falcon-7B。我们的发现表明,尽管概念上相关,但幻觉激活模式不能完全简化为或由知识冲突表示解释。尽管如此,探针在多种语言和激活类型中被证明是一个稳健的工具,支持其在提高LLM可解释性方面的作用。这项工作推进了对LLM中幻觉的更广泛理解,并强调了对其内部行为进行细粒度分析的价值。

英文摘要

Hallucinations -- factually incorrect or unverifiable outputs -- remain one of the most challenging limitations of Large Language Models (LLMs), especially in knowledge-intensive tasks. One proposed explanation is internal knowledge conflicts arising from fixed, outdated training data. This paper investigates whether internal representations linked to knowledge conflicts correlate with hallucination behaviors in LLMs. Using probing techniques inspired by two prior works, we analyzed activations from hidden, attention, and MLP layers, as well as output logits, across predefined tasks. We probed LLaMA-3-8B on hallucination detection benchmarks and Falcon-7B on a knowledge conflict dataset. Our findings show that, although conceptually related, hallucination activation patterns cannot be fully reduced to or explained by knowledge conflict representations. Nonetheless, probing proves a robust tool across multiple languages and activation types, supporting its role in improving LLM interpretability. This work advances the broader understanding of hallucinations in LLMs and underscores the value of fine-grained analysis of their internal behavior.

2606.08702 2026-06-09 cs.AI 新提交

ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems

ConMem: 无训练多智能体系统中的结构化记忆引导自适应

Zhixun Tan, Qiang Chen, Tairan Huang, Xiu Su, Yi Chen

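Affiliations: Central South University; The Hong Kong University of Science and Technology

AI summary: Proposes ConMem, a framework that uses structured memory cards and a relation-aware memory graph for efficient adaptation of multi-agent systems without additional training, improving performance on multiple benchmarks while reducing inference overhead.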

详情
AI中文摘要

最近的进展通过基于记忆、技能和学习的方法改进了基于LLM的多智能体系统(MAS)的自适应能力,但这些方法仍受到噪声轨迹、记忆-技能关系建模不足以及对额外训练或高质量监督的依赖等挑战。为了解决这些限制,我们提出了ConMem,一个关系感知且无需训练的框架,通过跨经验协调实现高效的多智能体自适应。具体来说,ConMem将历史交互轨迹提炼为结构化记忆卡片,以捕获可重用的策略和线索,并将它们组织成关系感知的记忆图。在运行时,ConMem根据任务需求检索卡片,并通过卡片图协调它们以解决策略冲突并恢复其依赖关系。这些模块结合起来提供了结构化和关系感知的指导,使得多智能体系统能够实现鲁棒、轻量级的自适应,而无需额外训练。在多个基准测试和主流MAS架构上的大量实验表明,与现有记忆架构相比,ConMem取得了持续的性能提升,通过剪枝超过50%的扩展候选并减少超过80%的规划开销,提高了推理时的效率。我们的代码可在https://anonymous.4open.science/r/ConMemCode获取。

英文摘要

Recent advances have improved the adaptive capabilities of LLM-based multi-agent systems (MAS) through memory-, skill-, and learning-based approaches, yet these approaches remain challenged by noisy trajectories, insufficient modeling of memory-skill relations, and reliance on additional training or high-quality supervision. To address these limitations, we propose ConMem, a relation-aware and training-free framework that enables efficient multi-agent adaptation through cross-experience coordination. Specifically, ConMem distills historical interaction trajectories into structured memory cards to capture reusable strategies and cues, organizing them into a relation-aware memory graph. At runtime, ConMem retrieves cards according to task needs and coordinates them through the card graph to resolve strategy conflicts and recover their dependencies. Combined, these modules yield structured and relation-aware guidance, enabling robust, lightweight adaptation in multi-agent systems without additional training. Extensive experiments across multiple benchmarks and mainstream MAS architectures show consistent gains over existing memory architectures, with improved inference-time efficiency through pruning more than 50% of expanded candidates and reducing planning overhead by over 80%. Our codes are available at https://anonymous.4open.science/r/ConMemCode

2606.08691 2026-06-09 cs.LG stat.ME 新提交

Hierarchical Projection for Adaptive Knowledge Transfer

自适应知识迁移的分层投影

Samhita Pal, Tian Gu

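Affiliations: Vanderbilt University Medical Center; Columbia University

AI summary: Proposes ProjectionTL, a framework that combines hierarchical Bayesian modeling with adaptive projection to perform source selection and feature selection, mitigating negative transfer and improving the accuracy, stability, and interpretability of cross-domain learning.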

详情
AI中文摘要

现代数据驱动应用越来越多地涉及从多个异质源中学习,其中目标数据集有限,但跨域可获得相关信息。当相关性变化或存在虚假信号时,简单组合这些源会降低性能,这对可信的跨域学习构成了根本性挑战。我们提出了投影迁移学习(ProjectionTL),这是一个统一框架,将分层贝叶斯建模与自适应投影相结合,用于选择性知识迁移。关键思想是在两个层次上解耦迁移:首先,我们构建一个源引导的分层先验,通过数据驱动的权重聚合跨源信息,捕捉每个源与目标之间的全局对齐;其次,我们通过后验投影步骤在特征层面细化这种借用,选择性地保留与目标信号局部一致的坐标。这种两阶段设计使该方法能够同时进行源选择和特征选择,从而减轻负迁移,同时保持可解释性。ProjectionTL提供了一种跨域整合异质数据的原则性方法,桥接了统计建模和现代机器学习范式,以实现鲁棒且可解释的迁移。通过模拟和真实世界的生物医学应用,我们证明了与现有方法相比,准确性、稳定性和可解释性的提升。我们的框架为高维设置下的可信跨域学习提供了一种可扩展且通用的策略。

英文摘要

Modern data-driven applications increasingly involve learning from multiple heterogeneous sources, where a target dataset is limited but related information is available across domains. Naively combining these sources can degrade performance when relevance varies or spurious signals are present, posing a fundamental challenge for trustworthy cross-domain learning. We propose Projection Transfer Learning (ProjectionTL), a unified framework that integrates hierarchical Bayesian modeling with adaptive projection for selective knowledge transfer. The key idea is to decouple transfer at two levels: first, we construct a source-guided hierarchical prior that aggregates information across sources using data-driven weights, capturing global alignment between each source and the target; second, we refine this borrowing through a posterior-projection step that operates at the feature level, selectively retaining coordinates that exhibit local agreement with the target signal. This two-stage design enables the method to simultaneously perform source selection and feature selection, thereby mitigating negative transfer while preserving interpretability. ProjectionTL provides a principled approach to integrating heterogeneous data across domains, bridging statistical modeling and modern machine learning paradigms for robust and interpretable transfer. Through simulations and real-world biomedical applications, we demonstrate improved accuracy, stability, and interpretability compared to existing methods. Our framework offers a scalable and generalizable strategy for trustworthy cross-domain learning in high-dimensional settings.

2606.08688 2026-06-09 cs.RO cs.CV 新提交

PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback

PhysAgent: 通过轨迹驱动的多智能体反馈实现基于物理的4D合成自动化

Chunji Lv, Jiaxi Ye, Yuchen Jiang, Rexar Lin, Changsheng Li

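Affiliations: Beijing Institute of Technology

AI summary: Proposes PhysAgent, described as the first simulator-in-the-loop multi-agent framework. It decouples materials from external forces, extracts trajectories with vision foundation models, and uses LLM commonsense reasoning to automate physically plausible 4D motion synthesis, significantly improving generation diversity and physical accuracy.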

详情
AI中文摘要

实现完全自动化、物理合理的3D运动合成是图形学和生成式AI的核心目标。然而,配置复杂的环境力场仍然完全依赖人工专家干预,成为大规模模拟数据生成的严重瓶颈。现有自动化方法主要关注材料优化,在应用于更复杂的力场优化空间时表现出严重的模态差距和技术缺陷:朴素的大语言模型缺乏底层模拟反馈,导致严重的物理不准确性,而传统的分数蒸馏采样存在梯度缓慢、陷入局部最优以及数学上无法动态切换离散力场的问题。为此,我们提出PhysAgent,首个模拟器在环的多智能体框架,利用多模态输入实现自动化、基于物理的4D合成。通过将内在材料与外在动力学解耦,PhysAgent利用配备外化力场技能模块的语义智能体掌握模拟规则并生成有效初始化。随后,由轨迹驱动的多智能体反馈驱动的精炼智能体,借助视觉基础模型从渲染帧中提取密集点轨迹。通过将这些显式运动轨迹转换为结构化文本描述符,智能体利用LLM常识推理执行零样本宏观跳跃,有效逃离局部最优并动态切换离散力场。大量实验表明,PhysAgent能够从任意多模态提示快速生成稳定、多样的物理场景,在生成多样性和物理准确性上显著优于现有基线。

英文摘要

Achieving fully automated, physically plausible 3D motion synthesis is a core objective in graphics and generative AI. However, configuring complex environmental force fields still relies entirely on manual expert intervention, creating a severe bottleneck for large-scale simulation data generation. Existing automated methods primarily focus on material optimization and exhibit severe modality gaps and technical flaws when applied to the vastly more complex force field optimization space: naive Large Language Models (LLMs) lack underlying simulation feedback, causing severe physical inaccuracies, while traditional Score Distillation Sampling (SDS) suffers from sluggish gradients, local optima entrapment, and a mathematical inability to dynamically switch discrete force fields. To address this, we propose PhysAgent, the first simulator-in-the-loop multi-agent framework that leverages multimodal inputs for automated, physically grounded 4D synthesis. By decoupling intrinsic materials from extrinsic dynamics, PhysAgent utilizes a Semantic Agent equipped with an externalized Force Field Skill module to master simulation rules and generate valid initializations. Subsequently, the Refine Agents, driven by Trajectory-Grounded Multi-Agent Feedback, leverage vision foundation models to extract dense point trajectories from rendered frames. By converting these explicit motion trajectories into structured textual descriptors, the agent harnesses LLM commonsense reasoning to execute zero-shot macroscopic leaps, effectively escaping local optima and dynamically switching discrete force fields. Extensive experiments demonstrate that PhysAgent rapidly generates stable, diverse physical scenes from arbitrary multimodal prompts, significantly outperforming existing baselines in both generation diversity and physical accuracy.

2606.08684 2026-06-09 cs.CV 新提交

BLUE: Toward Better Language Use in Efficient Vision-Language-Action Models for Autonomous Driving

BLUE:迈向自动驾驶高效视觉-语言-动作模型中更好的语言使用

George Ling, Lijin Yang, Hao Yang, Zhongzhan Huang

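Affiliations: Bosch Research

AI summary: Proposes BLUE, which uses a lightweight gate to decide per frame whether a vision-language-action model should activate language generation, improving performance and delivering a 2.54x inference speedup.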

详情
Comments
preprint
AI中文摘要

我们提出BLUE,一种在自动驾驶(AD)的视觉-语言-动作(VLA)模型中实现更好语言使用的极简方法。通过广泛分析,我们发现语言仅在一小部分路线上重要,但在这些路线上,语言可以大幅提升或降低性能。因此,在每一帧生成语言是低效的,因为大部分计算花费在无法从语言中受益的帧上。我们进一步表明,预训练的VLA隐藏状态可能已经编码了语言是否会对给定帧有益,尽管场景复杂度和运动特征本身难以预测这一点。基于这一发现,BLUE在冻结的VLA隐藏状态上训练一个轻量级门控,以决定每帧是激活语言生成还是直接预测动作,无需修改主干网络或额外的人工标注。仅用0.11M参数的门控,BLUE在两个基准测试上均达到新的最优水平,在Bench2Drive上实现76.2%的成功率,在Longest6 v2上获得36的驾驶分数,同时相比主干网络实现2.54倍的推理加速和8.9%的成功率提升。BLUE为高效的语言增强自动驾驶提供了一条实用路径,表明VLA模型可以以极低的成本保留语言的优势。我们的代码、数据、日志和检查点完全公开在https://github.com/George-Ling3/BLUE。

英文摘要

We present BLUE, a minimal method for better language use in vision-language-action (VLA) models for autonomous driving (AD). Through extensive analysis, we reveal that language matters on only a small fraction of routes, but on those routes it can greatly improve or degrade performance. Generating language at every frame is therefore inefficient, since most computation is spent on frames that do not benefit from language. We further show that pretrained VLA hidden states potentially already encode whether language will benefit a given frame, even though scene complexity and kinematic features alone struggle to predict this. Based on this finding, BLUE trains a lightweight gate on frozen VLA hidden states to decide per frame whether to activate language generation or predict actions directly, without modifying the backbone or requiring additional human annotation. With just a 0.11M-parameter gate, BLUE sets a new state of the art on both benchmarks, achieving 76.2% success rate on Bench2Drive and 36 driving score on Longest6 v2, while delivering 2.54x inference speedup and 8.9% success rate improvement over the backbone. BLUE provides a practical path toward efficient language-augmented AD, showing that VLA models can retain the benefits of language at a fraction of the cost. Our code, data, logs and checkpoints are fully available on https://github.com/George-Ling3/BLUE.

2606.08682 2026-06-09 cs.LG cs.AI 新提交

Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation

激活引导引发突现失调:一项更全面的评估

Qi Cao, Jian Lou, Meiting Liu, Wenjie Feng, Dan Li, See-Kiong Ng, Anh Tuan Luu

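Affiliations: Nanyang Technological University; Sun Yat-sen University; University of Science and Technology of China; National University of Singapore

AI summary: Investigates whether activation steering induces emergent misalignment. With a broader evaluation scope, the study finds that activation steering can cause broad misalignment, produces more coherent harmful responses than finetuning, and analyzes the key contributing factors.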

详情
AI中文摘要

激活引导已成为一种流行的推理时技术,用于调节大型语言模型(LLMs)的行为。通过从目标行为的示例构建引导向量,并在推理期间将其注入中间激活,激活引导能够实现灵活的行为控制,同时避免微调所需的永久参数更新。与此同时,最近的研究将突现失调(EM)识别为一个重要的安全问题,其中在狭窄任务的不安全示例上微调的模型可能意外地泛化到无关任务上的广泛不安全行为。尽管微调引发的EM已被广泛研究,但激活引导是否能引发EM仍然相对未被探索,尽管它作为一种模型控制技术的使用日益增加。在本文中,我们对激活引导引发的突现失调进行了全面研究,大幅扩展了现有开创性工作的评估范围。首先,我们表明激活引导可以引发广泛的失调,即使在最近的Qwen-3.5系列中也是如此。此外,激活引导的模型产生的有害响应比微调模型具有更强的语义相关性和更高的连贯性,使得由此产生的失调可能更具危害性。其次,我们通过分析关键的引导特定因素来表征AS引发的EM的特性,包括引导幅度、引导子空间的低秩结构以及引导向量构建期间的周期数。第三,我们评估了AS引发的EM在不同模型家族、模型规模、目标任务和干预层上的鲁棒性和敏感性。我们的发现揭示了激活引导是突现失调的一个重要但未被充分研究的来源,并为理解EM的机制和安全风险提供了激活空间视角。

英文摘要

Activation steering has emerged as a popular inference-time technique for modulating the behavior of large language models (LLMs). By constructing a steering vector from examples of a target behavior and injecting it into intermediate activations during inference, activation steering enables flexible behavioral control while avoiding the permanent parameter updates required by finetuning. Meanwhile, recent work has identified emergent misalignment (EM) as a significant safety concern, wherein models finetuned on unsafe examples from a narrow task may unexpectedly generalize to broadly unsafe behavior on unrelated tasks. Although finetuning-induced EM has been extensively studied, whether activation steering can induce EM remains comparatively under-explored, despite its increasing use as a model-control technique. In this paper, we present a comprehensive study of activation-steering-induced emergent misalignment, substantially expanding the evaluation scope beyond existing pioneering work. First, we show that activation steering can induce broad misalignment, even in the recent Qwen-3.5 series. Moreover, activation-steered models produce harmful responses with stronger semantic relevance and higher coherence than their finetuned counterparts, making the resulting misalignment potentially more harmful. Second, we characterize properties of AS-induced EM by analyzing key steering-specific factors, including steering magnitude, the low-rank structure of the steering subspace, and the number of epochs during steering-vector construction. Third, we evaluate the robustness and sensitivity of AS-induced EM across diverse model families, model scales, target tasks, and intervention layers. Our findings reveal activation steering as a significant yet under-examined source of emergent misalignment and provide an activation-space perspective for understanding the mechanisms and safety risks of EM.
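The basic mechanics of activation steering can be sketched in a few lines. The example below uses the common difference-of-means construction: the steering vector is the mean activation on examples of the target behavior minus the mean on contrasting examples, and it is added to a hidden state scaled by a magnitude. This is a stand-in for illustration. The abstract mentions epochs during steering-vector construction, which suggests the paper's vectors may be trained rather than averaged, and all names and values here are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equally sized activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]

def steering_vector(target_activations, baseline_activations):
    """Difference of means between target-behavior and baseline activations."""
    target_mean = mean_vector(target_activations)
    baseline_mean = mean_vector(baseline_activations)
    return [t - b for t, b in zip(target_mean, baseline_mean)]

def steer(hidden_state, vector, magnitude):
    """Inject the steering vector into one intermediate activation."""
    return [h + magnitude * v for h, v in zip(hidden_state, vector)]

# Toy 2-D activations collected at a single (hypothetical) intervention layer.
vector = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
steered = steer([0.0, 0.0], vector, magnitude=2.0)
print(vector, steered)
```

The model's weights are never modified, and `magnitude` corresponds to one of the steering-specific factors the study varies.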

2606.08680 2026-06-09 cs.CV cs.RO 新提交

Distortion-Aware PETR for BEV Object Detection with Mixed Pinhole-Fisheye Cameras

畸变感知的PETR用于混合针孔-鱼眼相机的BEV目标检测

Xiangzhong Liu

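Affiliations: fortiss GmbH

AI summary: Fisheye radial distortion breaks the uniform-sampling assumption of BEV detectors. The paper proposes DAPETR, which combines a distortion-aware positional embedding with a bidirectional feature-geometry co-modulation module, outperforms the baseline on a KITTI-360 benchmark, and reveals a conflict between learned adaptation and explicit geometric reparameterization.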

详情
Comments
8 pages, 5 figures, accepted at ICRA 2026
AI中文摘要

鱼眼相机因其低成本和高覆盖视野(FOV)而被广泛部署于自动驾驶感知套件中,但其在3D目标检测中的潜力仍未得到充分利用。严重的径向畸变通过违反均匀采样的基本假设,对大多数BEV检测器构成挑战。为弥补这一差距,我们提出了畸变感知PETR(DAPETR),一种专为混合针孔-鱼眼相机设置设计的无投影检测器。DAPETR包含两个关键的学习自适应模块:一个统一的畸变感知位置编码,将图像表示的位置编码与鱼眼几何协调一致;以及一个双向特征-几何协同调制模块,使图像特征和3D位置编码相互适应。在我们转换的KITTI-360基准上的实验中,我们系统地将我们的学习自适应方法与极坐标下的PETR(PolarPETR)进行了比较。我们发现,尽管两种方法都优于基线,但我们的学习模块实现了更优的性能。关键的是,我们发现了两种策略结合时的负面交互,表明学习适应和显式几何重参数化可能冲突。我们的最终DAPETR模型显著推进了鱼眼BEV检测的研究和基准,为除图像校正外的有效畸变感知3D感知设计提供了关键见解。

英文摘要

Fisheye cameras are widely deployed in autonomous driving perception suites for their low cost and full-coverage field of view (FOV), yet their potential remains underleveraged in 3D object detection. Severe radial distortion challenges most BEV detectors by violating the fundamental assumption of uniform sampling. To bridge this gap, we propose Distortion-Aware PETR (DAPETR), a projection-free detector tailored for mixed pinhole-fisheye camera setups. DAPETR incorporates two key learned-adaptive modules: a unified distortion-aware positional embedding that harmonizes positional encodings for image representations with fisheye geometry, and a bidirectional feature-geometry co-modulation module that mutually adapts image features and 3D positional embeddings. In our experiments on a converted KITTI-360 benchmark, we systematically compare our learned adaptive approach against PETR in polar coordinates (PolarPETR). We find that while both methods improve over the baseline, our learned modules achieve superior performance. Crucially, we uncover a negative interaction when combining both strategies, revealing that learned adaptation and explicit geometric reparameterization can conflict. Our final DAPETR model significantly advances the research and benchmark for fisheye BEV detection, providing critical insights into effective distortion-aware 3D perception design other than image rectification.

2606.08678 2026-06-09 cs.SD cs.LG 新提交

Speaker-Invariant Representation Learning for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

基于梯度反转和变分信息瓶颈的说话人不变表示学习用于欺骗检测

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans

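Affiliations: Avignon Universite; EURECOM

AI summary: To address poor generalization caused by speaker bias in spoofing detection, proposes a teacher-student framework that disentangles identity information with a gradient reversal layer and a variational information bottleneck, achieving a 25.7% relative EER reduction across nine datasets.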

详情
AI中文摘要

先进的生成语音技术可能破坏语音生物识别的可靠性。虽然欺骗检测系统在域内条件下评估时表现出色,但对域外设置的泛化能力通常较差。在本文中,我们表明此类问题可能由说话人偏差引起,即模型学习个体声音特征而非操作或生成的标记。我们提出了一种用于说话人不变欺骗检测的教师-学生框架,该框架无需说话人标签即可解耦身份。我们利用预训练的说话人识别教师通过梯度反转层指导学生模型。为了控制抑制与语音身份相关线索和保留与欺骗检测相关线索之间的平衡,我们集成了变分信息瓶颈。在九个数据集上的评估表明,与MHFA基线相比,我们的模型实现了EER相对降低25.7%。

英文摘要

Sophisticated generative speech technology can undermine the reliability of voice biometrics. While spoofing detection systems excel when assessed under in-domain conditions, generalisation to out-of-domain settings is often poor. In this paper, we show that such issues could be caused by speaker bias, where models learn individual voice traits rather than markers of manipulation or generation. We propose a teacher-student framework for speaker-invariant spoofing detection that disentangles identity without requiring speaker labels. We leverage a pre-trained speaker recognition teacher to guide a student model via a gradient reversal layer. To control the balance between suppressing cues related to voice identity and preserving those related to spoofing detection, we integrate a Variational Information Bottleneck. Evaluations across nine datasets show our model achieves a 25.7% relative reduction in EER compared to the MHFA baseline.
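The two building blocks named in the abstract can be sketched in a few lines. This is a minimal, framework-free illustration, not the authors' implementation: the class `GradientReversal`, the function `vib_kl`, and the weight `lam` are hypothetical names, and the abstract does not specify how the teacher signal or the loss terms are actually combined. A gradient reversal layer is the identity in the forward pass and flips (and scales) the gradient in the backward pass; a Variational Information Bottleneck with a Gaussian encoder and a standard normal prior typically adds the closed-form KL term shown here.

```python
import math

class GradientReversal:
    """Identity in the forward pass; multiplies incoming gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical reversal strength

    def forward(self, x):
        return list(x)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]

def vib_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the usual VIB compression penalty."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

if __name__ == "__main__":
    grl = GradientReversal(lam=0.5)
    print(grl.forward([0.3, -1.2]))   # features pass through unchanged
    print(grl.backward([1.0, -2.0]))  # gradients come back sign-flipped and scaled
    print(vib_kl([1.0], [0.0]))       # penalty grows as the code drifts from the prior
```

In a sketch like this, a larger `lam` would push the student harder to discard speaker-identity cues, while the weight on the KL term would control how much information the bottleneck lets through overall.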

2606.08673 2026-06-09 cs.CL 新提交

ClinicalAligner26AM: A Cross-Lingual Aligner for Dataset Translation; Evidences from the MultiClinCorpus Shared Task

ClinicalAligner26AM: 用于数据集翻译的跨语言对齐器;来自MultiClinCorpus共享任务的证据

François Remy

发表机构 * Parallia Healthcare AI(Parallia医疗人工智能)

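AI summary: The paper proposes ClinicalAligner26AM, a multilingual alignment model for biomedical and clinical text initialized from ClinicalEncoder26AM. It builds soft alignment targets by fusing multi-level signals and sharpening them with Sinkhorn-Knopp optimal transport, and it projects entity annotations across languages in the MultiClinCorpus task with character-weighted F1 above 0.95 in nearly all settings.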

详情
AI中文摘要

词级跨语言对齐对于标注投影、翻译审计和跨语言忠实度估计至关重要,然而现有的神经对齐器很少适应专业领域。在本文中,我们介绍了ClinicalAligner26AM,这是一个从ClinicalEncoder26AM初始化的大上下文多语言对齐模型,用于生物医学和临床文本。我们的训练方法受AWESoME Align启发。我们通过使用Sinkhorn-Knop最优传输对为平行临床文本和对话建立的成本矩阵进行锐化,该矩阵融合了句子级、短语级和词元级信号,从而构建软对齐目标。我们通过鼓励学生对齐器的朴素余弦词元相似度分数匹配该目标,直接将锐化后的对齐矩阵蒸馏到学生对齐器中。在推理时,我们通过学习的词元对齐矩阵投影源跨度分数,并解码目标文本中最长有效的高分跨度,可选地由附录B中总结的MultiClinNER预测支持。我们在MultiClinCorpus共享任务上评估CA26AM,该任务将西班牙语临床实体标注投影到六种目标语言中。我们提交的两个系统在所有语言和实体类型中分别排名第一和第二,几乎所有设置下的字符加权F1分数均高于0.95。

英文摘要

Word-level cross-lingual alignment is central to annotation projection, translation auditing, and cross-lingual faithfulness estimation, yet existing neural aligners are rarely adapted to specialized domains. In this paper, we introduce ClinicalAligner26AM, a large-context multilingual aligner model for biomedical and clinical text initialized from ClinicalEncoder26AM. Our training recipe is inspired by AWESoME Align. We build our soft alignment target by using Sinkhorn-Knopp optimal transport to sharpen a cost matrix, which is established for parallel clinical texts and conversations by fusing sentence-level, phrase-level, and token-level signals. We distill this sharpened alignment matrix directly into our student aligner by encouraging its naive cosine-based token similarity scores to match this target. At inference time, we project source-span scores through the learned token alignment matrix and decode the longest valid high-scoring span in the target text, optionally supported by MultiClinNER predictions summarized in Appendix B. We evaluate CA26AM on the MultiClinCorpus shared task, which projects Spanish clinical entity annotations into six target languages. Our two submitted systems ranked first and second, respectively, across all languages and entity types, with character-weighted F1 scores above 0.95 in nearly all settings.

2606.08672 2026-06-09 cs.CV cs.LG 新提交

Learning to Solve Generative ODEs Beyond the Linear Span

学习求解生成式常微分方程:超越线性跨度

Sihyeon Kim, Seunghun Lee, Vikas Singh, Hyunwoo J. Kim

发表机构 * Korea University(高丽大学) KAIST(韩国科学技术院) University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

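AI summary: To reduce the many steps that ODE solvers need in diffusion and flow generative models, the paper proposes SpanLift, a lightweight neural solver that augments scalar-coefficient updates with a spatial residual operator, enabling few-step sampling without adding model NFEs and reaching state-of-the-art performance on multiple tasks.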

详情
Comments
12 pages, 7 figures
AI中文摘要

扩散和流生成模型通过积分学习到的ODE进行采样,但高质量采样仍需要大量连续的模型评估。求解器学习通过调整标量系数、时间步长或两者来降低这一成本,同时保持骨干模型固定。在这项工作中,我们识别出该更新族中的一个结构瓶颈:每一步仍然受限于跨度。由于标量系数更新位于缓冲速度评估的跨度内,它只能拟合跨度内的分量,而任何跨度外的残差无法通过标量重组单独达到。我们提出SpanLift,一种轻量神经求解器,它用空间残差算子增强标量系数更新。SpanLift将固定的基础求解器作为跨度内先验,并在状态和速度缓冲上学习一个空间残差算子。该算子通过端点教师匹配训练,保留预训练的骨干,且不增加模型NFE。实验表明,学习到的校正跨基础求解器迁移,且主要位于跨度外。在像素空间扩散、潜流匹配和降水临近预报中,SpanLift实现了最先进的少步采样。仅用3个NFE,它将CIFAR-10的FID从8.16提升到5.69,ImageNet的FID从17.37提升到11.83。

英文摘要

Diffusion and flow generative models sample by integrating a learned ODE, but high-quality sampling still requires many sequential model evaluations. Solver learning reduces this cost by adapting scalar coefficients, timesteps, or both, while keeping the backbone model fixed. In this work, we identify a structural bottleneck in this update family: each step remains span-limited. Since the scalar-coefficient update lies in the span of buffered velocity evaluations, it can fit only the in-span component while leaving any out-of-span residual unreachable by scalar recombination alone. We propose SpanLift, a lightweight neural solver that augments scalar-coefficient updates with a spatial residual operator. SpanLift keeps a fixed base solver as an in-span prior and learns a spatial residual operator over the state and velocity buffer. The operator is trained by endpoint teacher matching, preserves the pretrained backbone, and adds no model NFEs. Empirically, the learned correction transfers across base solvers and is predominantly out-of-span. Across pixel-space diffusion, latent flow matching, and precipitation nowcasting, SpanLift achieves state-of-the-art few-step sampling. With only 3 NFEs, it improves CIFAR-10 FID from 8.16 to 5.69 and ImageNet FID from 17.37 to 11.83.
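The span-limited bottleneck can be seen on a toy example: any update formed as a scalar-weighted sum of buffered velocities lies in their span, so the best such update is the orthogonal projection of the ideal update onto that span, and whatever is left over cannot be reached by re-weighting. The sketch below uses made-up 3-dimensional vectors and hypothetical helper names; it illustrates the limitation only and does not implement SpanLift's spatial residual operator.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_onto_span(buffer, target):
    """Orthogonal projection of target onto span(buffer), via Gram-Schmidt."""
    basis = []
    for v in buffer:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > 1e-12:  # skip vectors that add no new direction
            basis.append([wi / norm for wi in w])
    proj = [0.0] * len(target)
    for b in basis:
        c = dot(target, b)
        proj = [p + c * bi for p, bi in zip(proj, b)]
    return proj

if __name__ == "__main__":
    velocity_buffer = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # toy buffered velocities
    ideal_update = [2.0, 3.0, 4.0]                        # toy "teacher" update
    in_span = project_onto_span(velocity_buffer, ideal_update)
    out_of_span = [t - p for t, p in zip(ideal_update, in_span)]
    print("best scalar-coefficient update:", in_span)
    print("unreachable residual:", out_of_span)
```

In this toy setting no choice of scalar coefficients recovers the third component of the ideal update, which is the kind of out-of-span residual the abstract says a learned spatial operator is meant to capture.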