2606.18988 2026-06-18 cs.AI New submission

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

Jinhao Song, Shan Liang, Yiqun Yue, Zhuhuayang Zhang, Tianqi Gao

Affiliation: Xi'an Jiaotong-Liverpool University

AI summary: Proposes the ThinkDeception framework, which brings multimodal large language models into deception detection and achieves interpretable cognitive reasoning through step-by-step reasoning and Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO), reaching a new SOTA on mainstream benchmarks.

Comments: 10 pages, 4 figures

Abstract

Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end-to-end black-box paradigms. These methods suffer from a severe lack of interpretability, failing to provide transparent reasoning trajectories and struggling to explicitly capture the subtle cross-modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step-by-step multimodal Chain-of-Thought (CoT) dataset, we develop a foundational model, ThinkDeception Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO) equipped with a progressive training strategy. Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded easy-to-hard cognitive transition. By innovatively coupling this dynamic curriculum scheduler with a multi-dimensional, process-aware reward mechanism and a reflective learning paradigm, we significantly elevate the model's overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new SOTA, significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work successfully drives the field of deception detection toward interpretable, multimodal cognitive reasoning.
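As a rough illustration of the group-relative idea in standard GRPO, on which VAC-GRPO builds, the sketch below normalizes each sampled response's reward against its own group and orders training samples from easy to hard. The reward values, the `tier` field, and both function names are hypothetical; the abstract does not specify the VAC-GRPO reward terms or the curriculum scheduler.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def curriculum_order(samples):
    """Sort samples easy-to-hard by a hypothetical 'tier' field (1 to 4)."""
    return sorted(samples, key=lambda s: s["tier"])


# Hypothetical rewards for four responses sampled for the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])

# Hypothetical samples tagged with one of four difficulty tiers.
ordered = curriculum_order([
    {"id": "c", "tier": 3},
    {"id": "a", "tier": 1},
    {"id": "d", "tier": 4},
    {"id": "b", "tier": 2},
])
```

Responses that score above their group's mean get a positive advantage and those below get a negative one, so no separate value network is needed to form the baseline.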

2606.18986 2026-06-18 cs.CL cs.AI New submission

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le

Affiliation: Deakin University

AI summary: Proposes the CADE framework, which embeds each timestep directly through a point-wise linear encoder to avoid the tokenization bottleneck and aligns time series with text anchors using a one-directional supervised contrastive loss, improving performance on six TSQA tasks on the Time-MQA benchmark.

Abstract

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in one granularity that breaks patterns and hides exact timesteps, through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.
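The two components named in the abstract can be sketched in a few lines, under loud assumptions: the encoder here is a single affine map per timestep, and the loss is an InfoNCE-style cross-entropy against fixed anchor vectors. The abstract does not give the exact form of either, so the weights, anchors, temperature, and function names below are all hypothetical.

```python
import math


def pointwise_embed(series, weight, bias):
    """Map each scalar timestep x_t to the vector weight * x_t + bias.

    One embedding is produced per timestep, so output index t corresponds
    exactly to input index t (no patching or padding).
    """
    return [[w * x + b for w, b in zip(weight, bias)] for x in series]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def one_directional_contrastive_loss(ts_embedding, anchors, label, temperature=0.1):
    """Cross-entropy of a series embedding against fixed text anchors.

    Only the series-to-anchor direction is scored; the anchors are
    treated as constants, standing in for frozen class-name embeddings.
    """
    logits = [dot(ts_embedding, a) / temperature for a in anchors]
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]


# Hypothetical 2-dimensional embedding space and two class anchors.
tokens = pointwise_embed([0.1, 0.5, 0.9], weight=[1.0, -1.0], bias=[0.0, 1.0])
anchors = [[1.0, 0.0], [0.0, 1.0]]
loss_match = one_directional_contrastive_loss([1.0, 0.0], anchors, label=0)
loss_mismatch = one_directional_contrastive_loss([1.0, 0.0], anchors, label=1)
```

The loss is small when the series embedding points toward its own class anchor and large otherwise, which is the alignment pressure the abstract describes.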

2606.18974 2026-06-18 cs.CV New submission

Mem-World: A Memory-Augmented Action-Conditioned World Model for Persistent Robot Manipulation

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, Jun Shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

Affiliations: Dalian University of Technology; Samsung R&D Institute China-Beijing (SRCB)

AI summary: Proposes Mem-World, which uses W-VMem, a 4D wrist-view surfel-indexed memory, to address scene forgetting caused by occlusion and motion during manipulation, achieving persistent world modeling and improving policy evaluation and policy improvement.

Abstract

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
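The policy-evaluation claim rests on the Pearson correlation between success rates estimated inside the world model and success rates measured on the real robot. A minimal sketch of that statistic follows; the success rates are made-up placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical per-policy success rates: world-model estimate vs. real robot.
predicted = [0.30, 0.55, 0.70, 0.90]
measured = [0.25, 0.50, 0.80, 0.85]
r = pearson(predicted, measured)
```

A value of `r` near 1 means ranking policies inside the world model agrees with ranking them on hardware.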

2606.18959 2026-06-18 cs.RO New submission

TactSpace: Learning a Physics-enriched Shared Latent Space for Tactile Sim-to-Real Transfer

Arunim Joarder, Arjun Bhardwaj, René Zurbrügg, Mayank Mittal, Florin Püntener, Sira Bielefeldt, Cosmin Roman, Vaishakh Patil, Marco Hutter

Affiliations: Robotic Systems Lab, ETH Zürich; Micro- and Nanosystems Lab, ETH Zürich; ETH AI Center; NVIDIA

AI summary: Proposes TactSpace, a multimodal representation learning framework that aligns heterogeneous tactile modalities in a shared latent space to achieve zero-shot sim-to-real transfer, reducing force prediction error by 16.7% and shape reconstruction error by 45.8%.

Comments: 9 pages, 6 figures, 4 tables, accepted to IROS 2026

Abstract

Tactile sensing provides direct measurements of contact interactions that are essential for robotic manipulation. However, current simulators lack the fidelity to faithfully model the complex deformation and transduction mechanics of tactile sensors, severely hindering sim-to-real transfer in robot learning pipelines. To address this challenge, we propose a multi-modal representation learning framework that aligns heterogeneous tactile modalities within a shared latent space, eliminating the need for accurate raw-signal simulation while preserving relevant contact information. Our approach employs modality-specific encoders to project diverse tactile observations, such as simulated penetration depth and real-world capacitance, into a common embedding space. The model is trained using self- and cross-reconstruction objectives alongside contrastive alignment, encouraging modality-invariant yet information-rich representations. We evaluate the learned embeddings on indenter shape identification, force prediction, and geometric reconstruction tasks, training exclusively in simulation and testing directly on real sensor measurements. Our results demonstrate zero-shot sim-to-real transfer across physically dissimilar representations. Furthermore, incorporating multi-physics simulation modalities yields more informative embeddings that transfer across diverse downstream tasks, demonstrating a 16.7% reduction in force prediction error and a 45.8% reduction in shape reconstruction error. Finally, we release an efficient Warp-based implementation of a penalty-based tactile simulation model for Isaac Lab, enabling scalable tactile data generation.

2606.18955 2026-06-18 cs.CV cs.RO New submission

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Runze Xu, Yiluo Zhang, Jian Wang, Yu Wang, Jincheng Yu

Affiliations: Department of Electronic Engineering, Tsinghua University; Tianfu Jiangxi Laboratory

AI summary: Proposes a latent-action-based framework that uses a Hybrid Disentangled VQ-VAE to extract general action priors from unlabeled human videos and reduces action hallucinations through an intent-perception decoupling strategy, requiring only 50 trajectories to adapt to downstream tasks.

Comments: Accepted to IROS 2026

Abstract

Training generalist Vision-Language-Action (VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.

2606.18954 2026-06-18 cs.CL New submission

GraphPO: Graph-based Policy Optimization for Reasoning Models

Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Ant Group

AI summary: Proposes the GraphPO framework, which models reasoning rollouts as a directed acyclic graph, reduces redundant exploration by merging semantically equivalent paths, and uses edge-level advantages to improve reasoning efficiency, outperforming chain-based and tree-based methods on multiple benchmarks.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcomes. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.
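The merging step can be pictured as grouping partial reasoning paths by the state they reach. The sketch below is only a toy under stated assumptions: `state_key` is a hypothetical stand-in for the semantic summarizer (here it just strips whitespace from the last step), whereas the abstract does not say how semantic equivalence is decided.

```python
from collections import defaultdict


def merge_equivalent_paths(paths, state_key):
    """Group reasoning paths into equivalence classes by reached state.

    Paths in the same class arrive at the same graph node, so only one
    of them needs to be expanded further and the rest can share its suffix.
    """
    classes = defaultdict(list)
    for path in paths:
        classes[state_key(path)].append(path)
    return dict(classes)


# Hypothetical partial rollouts for the problem x + 2 = 5.
paths = [
    ("x + 2 = 5", "x = 3"),
    ("2 + x = 5", "x =3"),
    ("x + 2 = 5", "x = 5 - 2"),
]
classes = merge_equivalent_paths(paths, state_key=lambda p: p[-1].replace(" ", ""))
```

Here two of the three paths land on the same node, so the sampling budget that would have extended the duplicate can go to a different branch.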

2606.18953 2026-06-18 cs.RO New submission

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

Kinam Kim, Namiko Saito, Heecheol Kim, Katsushi Ikeuchi, Jaegul Choo, Yasuyuki Matsushita

Affiliations: KAIST; Microsoft Research Asia - Tokyo; The University of Tokyo

AI summary: Proposes an object-centric residual reinforcement learning framework that trains the policy in simulation and transfers zero-shot to a real robot, raising the VLA model's success rate from 42% to 76%.

Comments: 8 pages, 7 figures, 2 tables; 8-page appendix

Abstract

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors. Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment; image-based methods suffer from the visual domain gap; and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, enabling a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot, and the improved rollouts can be further reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/
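A minimal sketch of the residual structure described here: a frozen base policy proposes an action, a small correction is added, and the pose observations seen during training are deliberately degraded. The clipping range, the `scale` factor, and the choice to reuse the last pose when a reading is dropped are all assumptions for illustration; the abstract does not give these details.

```python
import random


def noisy_pose(pose, last_pose, noise_std, dropout_prob, rng):
    """Degrade a simulated object pose to mimic real pose estimation.

    With probability dropout_prob the reading is dropped and the previous
    pose is reused (an assumed handling); otherwise Gaussian noise is added.
    """
    if rng.random() < dropout_prob:
        return list(last_pose)
    return [p + rng.gauss(0.0, noise_std) for p in pose]


def refine_action(base_action, residual, scale=0.1):
    """Add a bounded correction to the frozen base policy's action."""
    clipped = [max(-1.0, min(1.0, r)) for r in residual]
    return [a + scale * c for a, c in zip(base_action, clipped)]


rng = random.Random(0)
observed = noisy_pose([0.40, 0.10, 0.02], [0.39, 0.10, 0.02], 0.005, 0.1, rng)
action = refine_action([0.5, 0.0], [2.0, -0.5], scale=0.1)
```

Keeping the residual small and bounded means the corrected action stays close to what the base VLA would have done, which limits how far a mistaken correction can push the robot.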

2606.18952 2026-06-18 cs.CV New submission

Physics-IQ Verified

Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth

Affiliations: Anates Labs; Technical University of Munich; University of Technology Nuremberg; Tuebingen AI Center, University of Tuebingen; Helmholtz AI, Munich; Google DeepMind research

AI summary: Presents the Physics-IQ Verified benchmark, which improves prompt and ground-truth quality and introduces a sample-level scoring system to better evaluate how well video generative models understand physical reality; the verification refines 57.6% of samples and improves 34.8% of prompts.

Abstract

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's $\tau = 0.46$). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
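The ranking change is reported as Kendall's tau between model rankings under the original and the verified benchmark. The sketch below computes the simple tau-a variant over model scores; the scores are invented placeholders, and the abstract does not say which tau variant the authors use.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four models under two versions of a benchmark.
original = [0.61, 0.55, 0.48, 0.40]
verified = [0.58, 0.57, 0.39, 0.44]
tau = kendall_tau(original, verified)
```

A tau of 1 would mean the two benchmark versions rank the models identically; lower values mean more pairs of models swap places.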

2606.18936 2026-06-18 cs.AI cs.CY New submission

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng

Affiliations: Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Future Technology, University of Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China; Zhongguancun Academy, China; Beijing Key Laboratory of Safe AI and Superalignment; Gaoling School of AI, Renmin University of China; Beijing Institute of AI Safety and Governance (Beijing-AISI); School of Humanities, University of Chinese Academy of Sciences, China

AI summary: Proposes the SciRisk-Bench benchmark, which evaluates AI4Science safety from two perspectives, explicit risk dimensions and scientific disciplines, covering 7 disciplines, 31 subdisciplines, and 10 risk dimensions; experiments reveal safety weaknesses in mainstream and science-oriented large models.

Abstract

Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats, but leave the underlying risk dimensions underspecified. We introduce SciRisk-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines, and 10 risk dimensions. In the experimental section, we evaluate both mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and subdisciplines, enabling fine-grained diagnosis of where scientific models remain unsafe.

2606.18933 2026-06-18 cs.LG cs.IR stat.ME New submission

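As Easy as Rocket Science: A Study Evaluating the Ability of Large Language Models to Interpret Negation in Figurative Language

Jasmine Owers, Edwin Simpson, Martha Lewis

Affiliations: Intelligent Systems Lab, University of Bristol; ILLC, University of Amsterdam

AI summary: This study develops a new annotated dataset to test how well a range of large language models understand negation in figurative language, finding that the combination of negation and figurative language challenges the models and that performance depends heavily on prompt style.

Comments: 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; to be published in Transactions of the Association for Computational Linguistics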

2606.18918 2026-06-18 cs.LG cs.CC New submission

Some Complexity Results for Robustness Verification for Binarized Neural Networks

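Harshit Goyal, Sudakshina Dutta

Affiliation: Indian Institute of Technology Goa

AI summary: Proves that satisfiability for binarized neural networks is NP-complete via a reduction from the Boolean satisfiability problem, and exploits the piecewise-constant structure of the network output under uniform occlusion to give a polynomial-time robustness-checking algorithm.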

2606.18910 2026-06-18 cs.LG cs.CL New submission

REVES: REvision and VErification-Augmented Training for Test-Time Scaling

Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

Affiliations: Northwestern University; Amazon AGI; Qualcomm AI Research; University of Minnesota

AI summary: Proposes the REVES framework, which converts "near-miss" answers from intermediate steps into decoupled revision and verification prompts, enabling efficient off-policy data generation and improving the multi-step reasoning of large language models; it scores 6.5 points above the reinforcement learning baseline on LiveCodeBench.

Abstract

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.
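The data-augmentation step can be sketched as follows: take a multi-turn rollout that eventually reaches a correct answer, and turn each earlier wrong attempt into two standalone training prompts. The prompt wording, the dictionary layout, and the function name are hypothetical; the released REVES code may structure this differently.

```python
def build_training_prompts(problem, trajectory):
    """Turn near-miss attempts into revision and verification prompts.

    trajectory is a list of (attempt, is_correct) pairs from one rollout.
    Only rollouts that end with a correct attempt (successful recoveries)
    are used; every earlier incorrect attempt yields two prompts.
    """
    if not trajectory or not trajectory[-1][1]:
        return []
    prompts = []
    for attempt, is_correct in trajectory[:-1]:
        if is_correct:
            continue
        prompts.append({
            "kind": "revision",
            "prompt": f"Problem: {problem}\nPrevious attempt: {attempt}\nRevise the attempt.",
        })
        prompts.append({
            "kind": "verification",
            "prompt": f"Problem: {problem}\nAttempt: {attempt}\nIs this attempt correct?",
            "target": "incorrect",
        })
    return prompts


# Hypothetical rollout: two wrong attempts followed by a correct one.
prompts = build_training_prompts(
    "Solve x + 2 = 5",
    [("x = 7", False), ("x = 2", False), ("x = 3", True)],
)
```

Because each prompt is self-contained, it can be trained on as a single-turn example instead of replaying the whole multi-turn rollout.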

2606.18906 2026-06-18 cs.CV New submission

BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

Chaewon Park, Soyoon Lee, Naeun Lee, Minjung Shin, Seogkyu Jeon, Kibeom Hong

Affiliations: Sookmyung Women's University; Yonsei University; Samsung Research

AI summary: To address semantic blending and object duplication in multi-object image editing, proposes BindEdit, which suppresses attention leakage within a single diffusion trajectory through joint regularization of cross- and self-attention, a cross-attention re-balancing mechanism, and a region fidelity term, enabling precise editing.

Comments: Preprint

Abstract

Real image editing enables precise manipulation of visual content, yet existing methods often fail in complex multi-object scenarios, causing semantic blending, object duplication, or incomplete edits. We attribute these failures to attention leakage, where signals across spatial regions and text tokens become entangled during the denoising process. Specifically, we identify two distinct forms of leakage: Edit-Token Leakage, where ambiguous token-region alignment leads to object blending, and Source Dominance Leakage, where tokens of unchanged source objects overwhelm the attention intended for target entities. To resolve these leakages, we propose BindEdit, which enforces attention-level constraints within a single diffusion trajectory. To suppress Edit-Token Leakage, BindEdit jointly regularizes cross- and self-attention so that each target token group is bound to its corresponding spatial region while maintaining instance-level separation. To suppress Source Dominance Leakage, a cross-attention re-balancing mechanism amplifies target token influence and attenuates residual source semantics within editable regions. Moreover, a region fidelity term ensures that each target concept is expressed coherently across the entire editing mask. Additionally, we propose a comprehensive multi-object benchmark encompassing diverse object counts and categories. Extensive experiments demonstrate that BindEdit consistently outperforms existing methods within a single diffusion trajectory, maintaining robust performance across both single- and multi-object editing scenarios.

2606.18902 2026-06-18 cs.CL New submission

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen

Affiliations: Slingshot AI; Department of Engineering, University of Cambridge

AI summary: Proposes the stochastic prompt optimization framework SPO, in which the SAGE method performs black-box search through multi-agent diagnostic code execution; across multiple benchmarks its effectiveness depends on error type, and in a mental-health chatbot continuous optimization yields a significant gain in next-day retention.

Abstract

Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.
