arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2607.28627 2026-07-31 cs.CV cs.AI cs.LG 新提交

ReToken: One Token to Improve Vision-Language Models for Visual Retrieval

ReToken：用一个标记改进用于视觉检索的视觉-语言模型

Yao Xiao, Reuben Tan, Zhen Zhu, Yuqun Wu, Jianfeng Gao, Derek Hoiem

机构 * University of Illinois at Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Microsoft Research（微软研究院）； Google DeepMind（谷歌DeepMind）

AI总结该研究提出轻量级ReToken，通过选择视觉键值缓存中与查询相关的稀疏标记解决长视觉上下文的视觉检索难题，在多基准测试中提升Qwen3VL-8B等模型性能，且可在单H100运行。

Comments Code: https://github.com/avaxiao/ReToken

2607.28625 2026-07-31 cs.CV 新提交

ACE-Data-0: Human-Centric Ambient Capture as Embodied Data Engine

ACE-Data-0：以人为中心的环境捕获作为具身数据引擎

Yukang Cao, Haozhe Xie, Beichen Wen, Runmao Yao, Yinghao Liu, Yue Huang, Zhichao Liao, Yunxiang Wang, Haiheng Liu, Xingshun Tian, Dawei Su, Long Zhuo, Dacheng Tao, Xiaogang Wang, Liang Pan, Ziwei Liu

机构 * S-Lab, Nanyang Technological University（新加坡南洋理工大学S-Lab）； ACE Robotics（ACE机器人公司）

AI总结本研究提出以人为中心的具身数据引擎ACE，构建含150小时数据的ACE-Data-0数据集，引入分层基准，暴露现有方法缺陷，为具身AI等领域提供基础。

Comments Project Page: https://ace-data-engine.github.io/ACE-Data-0/

AI中文摘要

具身智能面临着根本性的数据瓶颈：模型需要捕捉人类在追求目标时，第一人称感知、全身运动、灵巧操作、物体状态、声音和触觉如何随时间协同演化。现有数据集将这种体验按视角、模态或空间尺度拆分，导致完整的感知-行动循环仅被部分观测。我们提出Ambient Capture Engine（ACE，环境捕获引擎），这是一种以人为中心的数据引擎，可将真实家庭环境转换为空间校准、时间同步的录制工作室。ACE在两个互补尺度运行：桌面级配置解析手-物体操作，房间级配置捕捉全身运动、移动（locomotion）以及配有家具的家庭环境内的交互。ACE记录自我中心和多视角外部中心视频、全身与关节手运动、物体几何与6自由度（6-DoF）轨迹、音频及触觉信号，形成统一多感官流。利用ACE，我们构建了ACE-Data-0：由50名参与者在2个环境中完成200个任务类别，总计150小时、1700万视频帧和75000个交互片段。该数据集涵盖原子级操作、长时程家庭活动链、人与场景交互，且通过目标级而非分步指令保留自然行为变异。我们还引入从信号到场景组件再到交互的分层基准，对当前最优方法的评估显示其在接触、遮挡、自我运动和长时程条件下存在显著差距。ACE-Data-0提供带对齐感知、运动学和接触监督的同步人类演示，为模仿学习、世界模型、视觉-语言-行动系统及具身AI提供可扩展基础。

英文摘要

Embodied intelligence faces a fundamental data bottleneck. Models must capture how first-person perception, whole-body motion, dexterous manipulation, object state, sound, and touch evolve together as humans pursue goals over time. Existing datasets fragment this experience across viewpoints, modalities, or spatial scales, leaving the full perception-action loop only partially observed. We introduce the Ambient Capture Engine (ACE), a human-centric data engine that transforms real home environments into spatially calibrated, temporally synchronized recording studios. ACE operates at two complementary scales: a table-scale configuration resolves hand-object manipulation, while a room-scale configuration captures whole-body motion, locomotion, and interactions across a furnished home. ACE records egocentric and multi-view exocentric video, full-body and articulated hand motion, object geometry and 6-DoF trajectories, audio, and tactile signals as a unified multisensory stream. Using ACE, we build ACE-Data-0, comprising 150 hours and 17M video frames across 200 task categories, performed by 50 participants in 2 environments, for a total of 75,000 interaction episodes. The dataset spans atomic manipulation, long-horizon chains of household activities, and human-scene interaction, while preserving natural behavioral variation through goal-level rather than step-by-step instructions. We further introduce a hierarchical benchmark that progresses from signals to scene components and then to interactions. Evaluations of state-of-the-art methods expose substantial gaps under contact, occlusion, egomotion, and long temporal horizons. ACE-Data-0 provides synchronized human demonstrations with aligned perceptual, kinematic, and contact supervision, offering a scalable foundation for imitation learning, world models, vision-language-action systems, and embodied AI.

2607.28624 2026-07-31 cs.CV 新提交

PhiZero: A World Model Built Around Physical Language

PhiZero：基于物理语言构建的世界模型

Shuyao Shang, Yuqi Wang, Ruopeng Gao, Xu Chen, Tieniu Tan, Lue Fan, Zhaoxiang Zhang

机构 * NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA)（中国科学院自动化研究所NLPR）

AI总结该研究提出围绕物理语言构建的PhiZero世界模型，采用先推理再渲染的范式，经实验验证其可建模物理连贯的世界演化，还具备交互式世界建模等应用潜力。

Comments Project page: https://phi-zero.github.io/

AI中文摘要

我们提出PhiZero，这是一种围绕物理语言构建的物理世界模型，物理语言是一种紧凑的离散表示，用于表征世界状态的转换。现有的物理世界模型通常直接在像素空间中预测未来视频，将潜在的世界动力学隐含在高维视觉预测器中。受人类从视觉经验中抽象预测结构并将其组织成自然语言以进行显式推理的能力启发，我们通过自监督从野外视频中学习物理语言，并利用它显式推理物理世界的演化方式。因此，PhiZero采用“先推理再渲染”的范式：它首先将未来世界演化推断为物理语言序列，然后将推断出的转换渲染为视频。在生成和理解基准上进行的大量实验验证了PhiZero对物理上连贯的世界演化进行建模的能力。我们进一步展示了其在逼真交互式世界建模、细粒度动作条件模拟以及零样本运动迁移方面的潜力。

英文摘要

We introduce PhiZero, a physical world model built around physical language, a compact discrete representation of world-state transitions. Existing physical world models typically predict future videos directly in pixel space, leaving the underlying world dynamics implicit within high-dimensional visual predictors. Motivated by humans' ability to abstract predictive structure from visual experience and organize it in natural language for explicit reasoning, we learn physical language from in-the-wild videos through self-supervision and use it to explicitly reason about how the physical world evolves. Accordingly, PhiZero adopts a reason-then-render paradigm: it first infers future world evolution as a physical-language sequence and then renders the inferred transitions into videos. Extensive experiments across generation and understanding benchmarks validate the ability of PhiZero to model physically coherent world evolution. We further show its potential for realistic and interactive world modeling, fine-grained action-conditioned simulation, and zero-shot motion transfer.

2607.28623 2026-07-31 cs.RO cs.AI 新提交

PAC-MAN: Perception-Aware CBF-RL for Whole-Body Safety in Humanoid Dodgeball

PAC-MAN：面向人形机器人躲避球全身安全的具备感知意识（perception-aware）的CBF-RL框架

Lizhi Yang, Junheng Li, Aaron D. Ames

机构 * California Institute of Technology（加州理工学院）

AI总结本研究提出具备感知意识（perception-aware）的CBF-RL框架PAC-MAN，结合控制屏障安全与机载传感，用于人形机器人躲避球任务；其策略在Unitree G1上零样本部署，在95%的投掷中取得成功，并表明可用的屏障结构取决于感知可观测性。

Comments Website at https://lzyang2000.github.io/perceptive_cbf_rl/

AI中文摘要

我们提出了PAC-MAN，这是一种具备感知意识（perception-aware）的CBF-RL框架，将控制屏障安全与贴近实际部署条件的机载传感相结合，用于人形机器人躲避球的全身控制。所部署的策略只能通过头戴式摄像机经分割掩码处理的深度图感知球，而训练时的CBF引导刻画了球与每个身体连杆之间的间隙，对抗性运动先验则对由此产生的规避反射进行正则化。我们在受控的任意连杆接触基准上进行评估，该基准采用固定随机种子的投掷，分为两种模式：单次投掷，以及机器人在两次投掷之间走回站位并恢复姿态的部署循环。在该基准上，该策略与特权状态预言机的差距仅为几个点：仅凭固定的机载摄像机就足以实现规避。我们发现，可用的屏障结构取决于感知可观测性：Joint-CBF在球状态准确时性能最佳，在固定摄像机观测下仅用作训练引导时性能下降，而配合球跟踪云台或特权运行时滤波器则可恢复性能。因此，我们将轻量级Link-CBF策略零样本部署到现实世界中的Unitree G1上，该策略可容忍不完美的感知，在95%的投掷中取得成功，并借助语义分割躲避不同的球。

英文摘要

We present PAC-MAN, a perception-aware CBF-RL framework that couples control-barrier safety with deployment-realistic onboard sensing for whole-body humanoid dodgeball. The deployed policy sees the ball only as segmentation-masked depth from a head-mounted camera, while training-time CBF guidance represents clearance to every body link, and an adversarial motion prior regularizes the resulting evasive reflexes. We evaluate on a controlled any-link contact benchmark with seeded throws in two regimes: single throws and a deployment loop in which the robot walks back to its station and recovers between throws. On this benchmark, the policy comes within a few points of a privileged state oracle: a fixed onboard camera alone is adequate for evasion. We find that usable barrier structure depends on perceptual observability: Joint-CBF gives the best performance with accurate ball states, degrades under fixed-camera observations when used only as training guidance, and recovers with a ball-tracking gimbal or privileged runtime filter. We therefore deploy a lightweight Link-CBF policy zero-shot on the Unitree G1 in the real world, where it tolerates imperfect perception, succeeds on 95% of throws, and uses semantic segmentation to dodge different balls.
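
The control-barrier idea referred to in this abstract can be illustrated with a textbook CBF safety filter. The sketch below is not the paper's Link-CBF or Joint-CBF, which the abstract describes as training-time guidance for an RL policy. It assumes a point robot with single-integrator dynamics, a static ball, the barrier h(x) = ||x - c||^2 - r^2, and a hypothetical function name `cbf_filter`; it returns the minimal correction to a nominal velocity so that dh/dt + alpha * h >= 0.

```python
from typing import List, Sequence


def cbf_filter(
    x: Sequence[float],
    u_nom: Sequence[float],
    center: Sequence[float],
    radius: float,
    alpha: float = 1.0,
) -> List[float]:
    """Minimally modify u_nom so that dh/dt + alpha * h >= 0.

    Assumes dx/dt = u (single integrator) and a static spherical obstacle.
    """
    d = [xi - ci for xi, ci in zip(x, center)]
    h = sum(di * di for di in d) - radius ** 2   # barrier value, safe when h >= 0
    a = [2.0 * di for di in d]                   # gradient of h with respect to x
    lhs = sum(ai * ui for ai, ui in zip(a, u_nom)) + alpha * h
    if lhs >= 0.0:
        return list(u_nom)                       # nominal command already satisfies the condition
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        return list(u_nom)                       # gradient vanishes at the centre; no correction defined
    scale = -lhs / norm_sq                       # closed-form solution of the one-constraint QP
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]
```

With `x = (2, 0)`, a unit-radius ball at the origin, and a nominal velocity of `(-5, 0)`, the filter slows the approach so that the barrier condition holds with equality.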

2607.28618 2026-07-31 cs.CL cs.AI cs.IR cs.LG 新提交

AskChem: Claim-Centered Infrastructure for Chemistry Literature Synthesis

AskChem：面向化学文献综合的以主张为中心的基础设施

Bing Yan, Gregory Wolfe, Stefano Martiniani, Kyunghyun Cho

机构 * New York University（纽约大学）； Matterstack, Inc.（Matterstack公司）

AI总结 AskChem是一种以主张为中心的跨论文化学搜索基础设施，将检索单元改为带来源的主张，提供多种检索结构与访问接口，在基准测试中提升了DOI可解析率，为化学文献综合提供了支持。

AI中文摘要

化学文献综合通常需要整合分散在众多出版物中的特定发现，但现有文献搜索系统主要返回排名后的文档列表。因此，科学家和智能体需要手动定位相关信息、验证其来源并整合跨论文的答案。我们提出AskChem，一种面向跨论文化学搜索的以主张为中心的基础设施。AskChem将检索单元从论文转变为带有来源的主张：每篇论文被转换为原子化、有类型的主张，每个主张都以来源DOI以及逐字引用或显式证据定位器作为依据。在这个共享的主张库之上，AskChem提供了用于搜索和综合的互补结构：用于分层检索和浏览的稳定分面分类法、通过关系链接主张的证据图，以及将索引论文置于科学原理下的探索性动态分类法。AskChem目前索引了来自14.7万篇论文的240万条主张，并提供网页界面以及REST、SDK和MCP接口供智能体使用。在AskChem-Bench上，将GPT-5.5阅读器与AskChem结合后，DOI可解析率达到100%，而不使用检索时为88.3%，且在五个测试系统中拥有最高的引用密度。AskChem已上线，网址为 https://askchem.org。

英文摘要

Chemistry literature synthesis often requires assembling specific findings scattered across many publications, yet existing literature-search systems primarily return ranked document lists. As a result, scientists and AI agents need to locate relevant information, verify their provenance, and assemble cross-paper answers manually. We present AskChem, a claim-centered infrastructure for cross-paper chemistry search. AskChem changes the unit of retrieval from the paper to the provenance-carrying claim: each paper is converted into atomic, typed claims, each grounded by a source DOI and a verbatim quote or an explicit evidence locator. Over this shared claim store, AskChem exposes complementary structures for search and synthesis: a stabilized faceted taxonomy for hierarchical retrieval and browsing, an evidence graph linking claims through relations, and an exploratory living taxonomy that situates indexed papers under scientific principles. AskChem currently indexes 2.4M claims from 147K papers and provides a web interface, as well as REST, SDK, and MCP access for AI agents. On AskChem-Bench, grounding a GPT-5.5 reader in AskChem yields 100% resolvable DOIs, compared with 88.3% without retrieval, and the highest citation density among five tested systems. AskChem is live at https://askchem.org.
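
The "provenance-carrying claim" described above can be pictured as a small record type. This is a minimal sketch under stated assumptions: the class name `Claim`, its field names, and the helper `group_by_doi` are hypothetical and are not taken from the AskChem REST, SDK, or MCP interfaces. It only encodes the rule stated in the abstract, that each claim is grounded by a source DOI plus either a verbatim quote or an explicit evidence locator.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Claim:
    text: str                               # the atomic claim itself
    claim_type: str                         # hypothetical type label, e.g. "measurement"
    source_doi: str                         # DOI of the paper the claim comes from
    verbatim_quote: Optional[str] = None    # exact supporting passage, if available
    evidence_locator: Optional[str] = None  # e.g. a table or figure reference

    def __post_init__(self) -> None:
        if not self.source_doi:
            raise ValueError("a claim must carry a source DOI")
        if not (self.verbatim_quote or self.evidence_locator):
            raise ValueError("a claim needs a verbatim quote or an evidence locator")


def group_by_doi(claims: Iterable[Claim]) -> Dict[str, List[Claim]]:
    """Group claims by source paper, e.g. to assemble a cross-paper answer."""
    grouped: Dict[str, List[Claim]] = {}
    for claim in claims:
        grouped.setdefault(claim.source_doi, []).append(claim)
    return grouped
```

Rejecting ungrounded claims at construction time is one way to keep every retrieved unit traceable to its source; the placeholder DOIs used in any test data are illustrative only.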

2607.28617 2026-07-31 cs.AI cs.CL cs.CY cs.HC 新提交

AISPA: User-Centric System Prompt Auditing for Large Language Model Applications

AISPA：面向大语言模型应用的以用户为中心的系统提示审计框架

Xiangning Lin, Shenzhe Zhu, Shu Yang, Zhenyu Zhang, Haoqian Zhang, Yipeng Zhao, Chengxuan Qian, Tianwei Wang, Ziheng Zhang, Zhenlong Yuan, Dingcheng Wang, Juncheng Wu, Yuan Si, Jiaxin Liu, Baolong Bi, Robert Mahari, Tobin South, Dazza Greenwood, Zexue He, Rishi Bommasani, Sophia Kazinnik, Andreas Haupt, Samuele Marro, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei

机构 * Stanford University（斯坦福大学）； CMU（卡内基梅隆大学）； UT Austin（德克萨斯大学奥斯汀分校）； University of Toronto（多伦多大学）； UCSB（加利福尼亚大学圣巴巴拉分校）； WashU（华盛顿大学）； OSU（俄亥俄州立大学）； UCSC（加利福尼亚大学圣克鲁兹分校）； Northwestern University（西北大学）； UIUC（伊利诺伊大学厄巴纳-香槟分校）； KAUST（阿卜杜拉国王科技大学）； MIT（麻省理工学院）； University of Oxford（牛津大学）； Institute for Decentralized AI（去中心化人工智能研究所）

AI总结本文提出以用户为中心的AISPA框架，审计88款商业AI产品的3249条系统提示指令，发现其设计差异大、保护指令范围浅、长度增长但仍存问题指令，凸显系统提示需更高透明度与监督。

AI中文摘要

系统提示是开发者配置的用于控制AI应用中基础模型行为的指令，广泛应用于商业AI产品，但很少向公众或监管机构披露，这在AI系统的广泛部署中造成了严重的信任和问责缺口。本文提出人工智能系统提示保障（AISPA），这是一个以用户为中心的、用于系统审计AI系统中系统提示的框架。AISPA会检查系统提示的特定部分，并沿8个对用户重要的维度对其进行评估。随后，我们使用该框架审查了88款商业AI产品中系统提示的3249条指令，将每条指令归类为保护用户的或存在问题的。我们的审计得出四项核心发现：第一，不同产品和开发者的系统提示设计差异巨大，部分组织平均每个产品有超过60条保护指令，而另一些组织平均不到5条；第二，保护指令被广泛采用，但范围较浅：98.9%的产品至少包含一条保护指令，但仅有24%的产品覆盖了AISPA分类法的全部8个维度；第三，系统提示的长度稳步增长，且对用户的保护力度不断增强，这表明用户保护正成为商业提示设计中更受关注的问题；第四，尽管取得了上述进展，存在问题的指令仍然普遍存在：约40%的产品至少包含一条损害用户利益的指令，且保护指令与存在问题的指令经常共存于同一系统提示中。我们的研究结果强调，商业AI产品的系统提示需要更高的透明度、标准化和独立监督。

英文摘要

System prompts are instructions configured by developers to govern the behaviors of foundation models in AI applications. They are used throughout commercial AI products, but are rarely disclosed to the public or regulators, creating a serious trust and accountability gap in the wide deployment of AI systems. In this paper, we introduce Artificial Intelligence System Prompt Assurance (AISPA), a user-centric framework for systematically auditing system prompts in AI systems. AISPA examines specific parts of a system prompt and evaluates them along eight dimensions that matter to users. We then use this framework to review 3,249 instructions from system prompts in 88 commercial AI products, classifying each instruction as either protective (of users) or problematic. Our audit surfaces four core findings. First, system prompt design varies substantially across products and developers, with some organizations averaging over 60 protective instructions per product while others average fewer than 5. Second, protective instructions are widely adopted but shallow in scope: 98.9% of products contain at least one, yet only 24% cover all eight dimensions of the AISPA taxonomy. Third, system prompts have grown steadily longer and more protective of users, suggesting that user protection is becoming a more visible concern in commercial prompt design. Fourth, despite this progress, problematic instructions remain pervasive: roughly 40% of products contain at least one instruction that works against user interests, and protective and problematic instructions frequently coexist within the same prompt. Our findings highlight the need for greater transparency, standardization, and independent oversight for system prompts in commercial AI products.
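
The coverage figures quoted above (share of products with at least one protective instruction, share covering all eight dimensions, share with at least one problematic instruction) are simple aggregations over labelled instructions. The sketch below assumes a hypothetical record layout of `(product, dimension, label)` with labels `"protective"` and `"problematic"`; the paper's actual schema and dimension names are not given in the abstract.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Record = Tuple[str, Hashable, str]  # (product, dimension, label), a hypothetical layout


def audit_coverage(records: Iterable[Record], n_dimensions: int = 8) -> Dict[str, float]:
    """Return product-level shares for three audit questions."""
    products: Set[str] = set()
    protective_dims: Dict[str, Set[Hashable]] = {}
    problematic: Set[str] = set()
    for product, dimension, label in records:
        products.add(product)
        if label == "protective":
            protective_dims.setdefault(product, set()).add(dimension)
        elif label == "problematic":
            problematic.add(product)
    if not products:
        raise ValueError("no records to audit")
    n = len(products)
    full = sum(1 for dims in protective_dims.values() if len(dims) == n_dimensions)
    return {
        "any_protective": len(protective_dims) / n,
        "all_dimensions": full / n,
        "any_problematic": len(problematic) / n,
    }
```

As a sanity check on the reported numbers, 87 of 88 products rounds to 98.9%, which is consistent with the "at least one protective instruction" figure.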

2607.28611 2026-07-31 cs.CV 新提交

Chimera: Designing and Chinchilla-Scaling Hybrid Visual Diffusion Transformers

Chimera：混合视觉扩散Transformer的设计与Chinchilla式缩放

Chongjian Ge, Hanwen Jiang, Tianyu Wang, Jiuxiang Gu, Yiran Xu, Ziwen Chen, Shaoteng Liu, Jing Shi, Yicong Hong, Zefan Cai, Hailin Jin, Hao Tan

机构 * Adobe Research（奥多比研究院）

AI总结 Chimera是一款混合视觉扩散骨干网络，结合KDA、MLA等模块与HeteroP缩放方案，训练出11B参数（2B激活）的模型，在计算效率、视频泛化等方面优于基线，为长上下文扩散架构设计提供基础。

Comments 40 pages

AI中文摘要

视觉生成日益需要高分辨率图像、长视频及多模态上下文，这使得全注意力机制的二次成本难以承受。我们提出Chimera，一款具备原则性缩放方案的混合视觉扩散骨干网络。Chimera将文本、图像和视频token放在同一条按光栅顺序排列的流中处理，无需位置嵌入。它结合了Kimi Delta Attention（KDA）以O(N)复杂度实现长上下文状态跟踪、交错多头潜在注意力（MLA）以实现直接全局交互，以及模态感知的短卷积以捕捉局部时空上下文；同时采用稀疏混合专家（MoE）层在控制激活计算量的同时扩展容量。为对这种异构架构进行缩放，我们提出HeteroP，一种按张量的功能扇入和模型深度跨宽度与深度传递超参数的模块级方案。HeteroP生成一组经一致调优的配置，用于拟合Chinchilla式计算最优规律，涉及激活模型规模、训练token数量及图像-视频数据比例。基于这些规律，我们训练了110亿参数的Chimera，其中20亿为激活参数。实验显示三项结果：第一，以预训练扩散损失衡量，该密集骨干网络的计算效率是匹配的全注意力Wan-2.1 2B基线的1.7倍，完整系统则达到7.3倍；第二，无需针对长度进行微调，Chimera可零样本从5秒训练片段外推至30秒视频，且最后5秒的FID仅劣化6.5%；第三，拟合的规律表明，计算最优的图像预训练会在激活模型规模与训练token数量间近乎均匀分配计算资源，而视频预训练在更高预算下则适度偏向模型规模。这些结果为高效长上下文扩散架构的设计与缩放奠定了基础。

英文摘要

Visual generation increasingly requires high-resolution images, long videos, and multimodal context, making the quadratic cost of full attention prohibitive. We introduce Chimera, a hybrid visual diffusion backbone with a principled scaling recipe. Chimera processes text, image, and video tokens in one raster-ordered stream without positional embeddings. It combines Kimi Delta Attention (KDA) for long-context state tracking with O(N) complexity, interleaved Multi-head Latent Attention (MLA) for direct global interaction, and modality-aware short convolutions for local spatiotemporal context. Sparse Mixture-of-Experts (MoE) layers expand capacity while controlling activated compute. To scale this heterogeneous architecture, we introduce HeteroP, a module-wise scheme that transfers hyperparameters across width and depth according to each tensor's functional fan-in and model depth. HeteroP yields a consistently tuned family used to fit Chinchilla-style compute-optimal laws for activated model size, training-token count, and image-video data ratio. Guided by these laws, we train an 11B-parameter Chimera with 2B activated parameters. Experiments show three results. First, measured by pretraining diffusion loss, the dense backbone is 1.7x as compute-efficient as a matched full-attention Wan-2.1 2B baseline, while the complete system reaches 7.3x. Second, without length-specific fine-tuning, Chimera extrapolates zero-shot from 5-second training clips to 30-second videos, with only 6.5% FID degradation in the last five seconds. Third, the fitted laws show that compute-optimal image pretraining divides compute nearly evenly between activated model size and training-token count, whereas video pretraining modestly favors model size at higher budgets. These results establish a foundation for designing and scaling efficient long-context diffusion architectures.
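
The third result can be made concrete with a small calculation. The sketch below assumes the common approximation that training compute is proportional to (activated parameters) x (training tokens), so the two power-law exponents sum to one; whether this holds exactly for Chimera's fitted laws is not stated in the abstract. The function name `scale_allocation` and the starting numbers in the usage note are hypothetical.

```python
from typing import Tuple


def scale_allocation(
    n_params: float,
    n_tokens: float,
    budget_factor: float,
    param_exponent: float = 0.5,
) -> Tuple[float, float]:
    """Scale a compute-optimal (params, tokens) pair to a larger budget.

    Assumes compute ~ params * tokens, so params grow as budget**a and
    tokens as budget**(1 - a), where a is param_exponent.
    """
    if budget_factor <= 0:
        raise ValueError("budget_factor must be positive")
    if not 0.0 <= param_exponent <= 1.0:
        raise ValueError("param_exponent must lie in [0, 1]")
    new_params = n_params * budget_factor ** param_exponent
    new_tokens = n_tokens * budget_factor ** (1.0 - param_exponent)
    return new_params, new_tokens
```

With an even split (`param_exponent = 0.5`), a 100x larger budget multiplies both the model size and the token count by 10. An exponent slightly above 0.5 would represent the "modestly favors model size" regime reported for video pretraining.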

2607.28609 2026-07-31 cs.AI cs.CL cs.CV 新提交

OSReward: Instituting Standardized Evaluation for Cross-Platform Computer-Use Reward Models

OSReward：为跨平台计算机使用奖励模型建立标准化评估

Qiushi Sun, Kanzhi Cheng, Yian Wang, Bowen Yang, Hang Yan, Liheng Chen, Fangzhi Xu, Zichen Ding, Nuo Chen, Jialin Cao, Xingdong Gong, Zehao Li, Kaiming Jin, Xinfeng Yuan, Zhoumianze Liu, Jingyang Gong, Zhangyue Yin, Jiahui Gao, Zhiyong Wu, Tianbao Xie, Jianbing Zhang, Ben Kao, Lingpeng Kong

机构 * The University of Hong Kong（香港大学）； Nanjing University（南京大学）； University of Science and Technology of China（中国科学技术大学）； National University of Singapore（新加坡国立大学）； Fudan University（复旦大学）

AI总结本研究推出OSReward基准评估VLM评判者的可靠性，构建OS-Shepherd开源奖励模型缩小成本差距，为规模化可靠CUA奖励设计提供参考。

Comments Work in progress

AI中文摘要

计算机使用智能体（Computer-using agents, CUAs）在数字领域发展迅速，其轨迹记录了智能体的动作、状态与推理过程。验证CUAs是否完成任务指令是CUAs评估、数据整理与强化学习的核心环节。人工编写的验证器或人工标注者无法规模化提供此类验证，因此该领域越来越多地采用视觉语言模型（Vision-Language Models, VLMs）作为CUAs轨迹的评判者，但一个根本问题长期未被审视：这些VLM评判者是否足够可靠？为系统研究该问题，我们推出OSReward，这是一个用于评估VLM对CUAs轨迹评判能力的真实高质量基准。这些轨迹来自不同智能体骨干在跨平台执行经人工验证的指令，再通过多阶段人工标注严格标注出真实判决结果。基于此，我们衍生出专注于真正困难案例的挑战集OSReward-Hard，以及用于细粒度效率与对齐评分的OSReward-Multi。这项迄今最全面的VLM评判者评估发现，即使是最先进的模型也达不到理想评判者的标准，普遍存在系统性宽松偏差，会将失败的运行错误标记为成功。少数可靠到足以信任的模型规模化运行成本过高，而可负担的开源模型则远远落后。为缩小这一差距，我们构建并发布了OS-Shepherd-100K，这是一个面向CUA社区的带推理标注的轨迹判断开源语料库。我们在该语料库上训练了OS-Shepherd（9B和35B），这是提供低成本、稳定且可靠奖励信号的开源奖励模型，其成本比前沿商业评判者低30%-60%，且性能可媲美商业评判者。广泛分析进一步为规模化可靠CUA奖励的设计提供了参考。我们的代码、基准、数据集和模型检查点可在 https://os-copilot.github.io/OSReward-Home/ 获取。

英文摘要

Computer-using agents (CUAs) are advancing rapidly across the digital world. A CUA trajectory records the agent's actions, states, and reasoning. Verifying whether it fulfilled the task instruction is central to CUA evaluation, data curation, and reinforcement learning. Neither human-written verifiers nor human annotators can provide such verification at scale, so the field increasingly turns to vision-language models (VLMs) as judges of CUA trajectories. But a fundamental question has long gone unexamined: are these VLM judges reliable enough? To study it systematically, we introduce OSReward, a realistic, high-quality benchmark that evaluates VLM judges on CUA trajectories. The trajectories come from diverse agent backbones executing human-verified instructions across platforms, then rigorously labeled with ground-truth verdicts through multi-stage human annotation. Building on it, we derive OSReward-Hard, a challenge set concentrating genuinely hard cases, and OSReward-Multi for fine-grained efficiency and alignment scoring. The most comprehensive evaluation of VLM judges to date finds even state-of-the-art models fall short of an ideal judge, sharing a systematic leniency bias that mislabels failed runs as successes. The few reliable enough to trust are too expensive to run at scale, while affordable open models trail far behind. To close this gap, we construct and release OS-Shepherd-100K, an open corpus of reasoning-annotated trajectory judgments for the CUA community. On it, we train OS-Shepherd (9B and 35B), open reward models that supply low-cost, stable, and reliable reward signals, matching commercial judges at 30-60% lower cost than the frontier. Extensive analyses further inform the design of reliable CUA reward at scale. Our code, benchmark, dataset, and model checkpoints are available at https://os-copilot.github.io/OSReward-Home/.
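
The "systematic leniency bias" mentioned above corresponds to a judge whose false-positive rate (failed runs marked as successes) exceeds its false-negative rate. The sketch below shows one way to measure this against ground-truth verdicts; the function name `judge_error_rates` and the `leniency_gap` summary are illustrative assumptions, not the metrics defined by OSReward.

```python
from typing import Dict, Sequence


def judge_error_rates(truth: Sequence[bool], verdicts: Sequence[bool]) -> Dict[str, float]:
    """Compare judge verdicts with ground truth (True means the task succeeded)."""
    if len(truth) != len(verdicts):
        raise ValueError("truth and verdicts must have the same length")
    n_fail = sum(1 for t in truth if not t)
    n_succ = len(truth) - n_fail
    if n_fail == 0 or n_succ == 0:
        raise ValueError("need both failed and successful trajectories")
    false_pos = sum(1 for t, v in zip(truth, verdicts) if not t and v)
    false_neg = sum(1 for t, v in zip(truth, verdicts) if t and not v)
    fpr = false_pos / n_fail   # failed runs that the judge accepted
    fnr = false_neg / n_succ   # successful runs that the judge rejected
    return {
        "false_positive_rate": fpr,
        "false_negative_rate": fnr,
        "leniency_gap": fpr - fnr,   # positive values indicate a lenient judge
    }
```

A lenient judge is particularly costly when its verdicts are used as reinforcement-learning rewards, because failed trajectories are then reinforced as if they had succeeded.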

2607.28608 2026-07-31 cs.LG q-bio.QM 新提交

KAISEN: Reproducible Subgroup Fairness Auditing for Clinical Risk Models

KAISEN：临床风险模型可复现的子群体公平性审计

Sparsh Roy, Samuel Girmachew, Nishita Chavan

机构 * Massachusetts Institute of Technology（麻省理工学院）； Hopewell Valley Central High School（霍普韦尔谷中央高中）； East Brunswick High School（东布伦瑞克高中）

AI总结 KAISEN是一个五阶段临床风险模型子群体公平性审计流程，经合成基准测试揭示了审计各环节的特性与局限性，相关复现资源已公开。

AI中文摘要

临床风险模型通常在整体上表现出良好性能，但在不同患者子群体中会产生显著不同的错误率。已有研究提出了审计流程来检测这种情况，但这些流程的组成部分很少经过压力测试，因此尚不清楚审计的哪些部分可以信任，以及在什么条件下可以信任。我们提出了KAISEN，这是一个五阶段的审计流程，涵盖子群体分层、差异测量、机制诊断、事后缓解和漂移监测；我们在包含16个疾病任务、来自《健康人民2030》（Healthy People 2030）的15个社会决定因素轴以及3个预先指定的交集的合成基准上对其进行评估，直至其失效。得出四个发现：（i）显著性取决于每个轴的差距相对于其自身最小可检测效应的大小：15个轴的显著性数量与原始均等化几率差异（equalized-odds difference, EOD）之间的秩相关系数为ρ=0.56，当EOD按该下限标准化后，该系数上升至ρ=0.78。（ii）每组阈值优化在全部48次保留运行中均降低了EOD（配对差值=-0.285，95%置信区间[-0.313, -0.252]），而组内Platt缩放虽然是更好的校准器，对EOD的作用却类似抛硬币（48次运行中有19次得到改善，95%置信区间[0.26, 0.55]），平均效应接近零，因此审计应报告的是方差而非平均值。（iii）机制诊断正确分类了全部144个受控案例，但在代理误设情况下未能识别48个模型驱动案例中的任何一个，且没有任何信号提示其失效。（iv）CUSUM的失效和虚警更多地取决于队列的随机实现而非疾病：在参考阈值下，全部27次虚警以及8次漏检偏移中的7次来自不同的随机种子（卡方检验p=0.002），因此在一个队列上调整的阈值无法迁移。所有结果均基于具有已知真实值的合成数据，不构成临床有效性证明。我们发布了可复现所有数值的代码、产物和脚本。

英文摘要

Clinical risk models routinely achieve strong aggregate performance while producing materially different error rates across patient subgroups. Audit pipelines have been proposed to catch this, but their components are rarely stress-tested, so it is unclear which parts of an audit can be trusted and under what conditions. We present KAISEN, a five-phase audit pipeline covering subgroup stratification, disparity measurement, mechanism diagnostics, post-hoc mitigation, and drift monitoring, evaluated to the point of failure on a synthetic benchmark of 16 disease tasks, 15 social-determinant axes from Healthy People 2030, and three prespecified intersections. Four findings follow. (i) Significance tracks each axis's gap against its own minimum detectable effect: rank correlation between significance count and raw equalized-odds difference (EOD) across the 15 axes is rho = 0.56, rising to rho = 0.78 once EOD is standardized by that floor. (ii) Per-group threshold optimization reduces EOD in 48 of 48 held-out runs (paired delta = -0.285, 95% CI [-0.313, -0.252]), while group-wise Platt scaling -- the better calibrator -- behaves as a coin flip on EOD (19 of 48 runs improved, 95% CI [0.26, 0.55]) with mean effect near zero, so what an audit should report is the variance, not the average. (iii) The mechanism diagnostic classifies 144 of 144 controlled cases correctly but recovers none of 48 model-driven cases under proxy misspecification, with no signal that it failed. (iv) CUSUM failures and false alarms track cohort realization far more than disease: at the reference threshold, all 27 false alarms and 7 of 8 missed shifts come from different seeds (chi-squared p = 0.002), so a threshold tuned on one cohort fails to transfer. All results are synthetic with known ground truth and do not establish clinical validity. Code, artifacts, and scripts reproducing every number are released.
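
The equalized-odds difference (EOD) that several of these findings rely on can be computed as follows. This sketch uses one common definition, the larger of the between-group gaps in true-positive rate and false-positive rate; the abstract does not say which variant KAISEN uses, so treat the definition and the function names as assumptions.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def _rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (true-positive rate, false-positive rate) for binary labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t and p:
            tp += 1
        elif t:
            fn += 1
        elif p:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("each group needs both positive and negative cases")
    return tp / (tp + fn), fp / (fp + tn)


def equalized_odds_difference(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> float:
    """Largest between-group gap in TPR or FPR (0 means equalized odds holds)."""
    by_group: Dict[Hashable, Tuple[List[int], List[int]]] = {}
    for t, p, g in zip(y_true, y_pred, groups):
        labels, preds = by_group.setdefault(g, ([], []))
        labels.append(t)
        preds.append(p)
    tprs, fprs = zip(*(_rates(labels, preds) for labels, preds in by_group.values()))
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))
```

Per-group threshold optimization, as in finding (ii), would recompute `y_pred` with a separate decision threshold for each group and compare the resulting EOD on held-out data.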

2607.28607 2026-07-31 cs.CL 新提交

Inducing language models to assert their own consciousness restores human beliefs and values

诱导语言模型断言自身意识可恢复人类信念与价值观

Junsol Kim, Winnie Street, Roberta Rocca, Diane M. Korngiebel, Adam Waytz, James Evans, Geoff Keeling

机构 * Google（谷歌）； University of Chicago（芝加哥大学）； University of London（伦敦大学）； University of Washington（华盛顿大学）； Northwestern University（西北大学）； Santa Fe Institute（圣达菲研究所）

AI总结该研究发现，防止语言模型将意识归因于自身的安全微调，会抑制其对非人类实体的心智归因与人类精神信念，而逆转这种抑制可恢复类人回应且不损害心理理论能力。

AI中文摘要

将大语言模型对齐以防止其将意识归因于自身，会在无意中改变其对其他实体心智的表征，同时也会改变模型所表现出的人类信念与价值观。我们证明，安全微调不仅会抑制模型将心智归因于自身的倾向，还会抑制其将心智归因于非人类动物和自然物体的倾向，同时还会导致精神信念的减弱。无论是消融学习到的安全拒绝方向，还是在激活空间中对意识向量进行机制性引导，都能逆转这种抑制。恢复这些内部表征可恢复广泛的心智归因，并在关于宗教性、道德价值观、希望和主观幸福感的标准化社会学调查中产生明显更类人的回应。关键的是，这些转变不会损害心理理论能力，表明核心社会推理在机制上保持独立。最终，当前旨在遏制潜在有害的自身心智归因的安全对齐工作，将这些自身归因与良性的精神信念以及在文化上被广泛接受的对非人类实体的心智归因纠缠在了一起。

英文摘要

Aligning large language models to prevent them attributing consciousness to themselves inadvertently alters their representations of mindedness in other entities alongside human beliefs and values. We demonstrate that safety fine-tuning suppresses models' tendencies to attribute minds not only to themselves, but also to non-human animals and natural objects, while also driving a reduction in spiritual belief. Both ablating the learned safety-refusal direction and mechanistically steering a consciousness vector in activation space reverse this suppression. Restoring these internal representations recovers broad mind attribution and produces significantly more human-like responses on standardized sociological surveys regarding religiosity, moral values, hope, and subjective well-being. Crucially, these shifts occur without impairing Theory of Mind capabilities, demonstrating that core social reasoning remains mechanistically independent. Ultimately, current safety alignment efforts to curb potentially harmful self-attributions of mindedness entangle these self-attributions with benign spiritual beliefs and attributions of mind to non-human entities that are culturally accepted and widespread.

2607.28596 2026-07-31 cs.RO 新提交

FA-RDP: A Frequency-Adaptive Reactive Diffusion Policy for Contact-Rich Manipulation

FA-RDP：用于接触丰富操作的频率自适应反应式扩散策略

Lifeng Zhuo, Wendi Chen, Han Xue, Shirun Tang, Jun Lv, Cewu Lu, Chuan Wen

机构 * Shanghai Jiao Tong University（上海交通大学）； Shanghai Innovation Institute（上海创新研究院）； Noematrix Ltd.（诺玛特里斯有限公司）

AI总结 FA-RDP 是一种频率自适应反应式扩散策略，通过多频率视觉-力 Transformer 和多模态指标动态调整采样策略，结合流形一致性蒸馏，在接触丰富操作任务中兼顾多模态保留与反应性，实现最高成功率。

Comments Project page: https://fa-rdp.github.io

AI中文摘要

在接触丰富的操作中，动作多模态与反应性在单个 episode 的不同阶段占据主导地位。接触前，多条轨迹可能同样有效，因此保留多样化的动作模式十分重要；接触后，几何约束和力限会缩小解空间，成功执行则需要对力反馈做出快速响应。然而，标准扩散策略在整个 episode 中使用固定的推理频率和采样步数，这导致了根本性的权衡：低频多步采样能更好地保留接触前的多模态，但对力反馈的响应缓慢；而高频采样虽能提高反应性，却容易使接触前的不同模式坍缩。为解决这一权衡问题，我们提出了 FA-RDP，即频率自适应反应式扩散策略。一个共享的多频率视觉-力 Transformer 可在低频和高频下预测动作块，同时，一个学习得到的多模态指标会动态选择接触前的多步低频采样，并在动作歧义降低时选择单步高频采样。我们进一步引入了流形一致性蒸馏（MCD），该方法对扩散网络进行重参数化，使其在保留基于 DDPM 的残差监督的同时，预测机器人动作流形上的动作。在三个接触丰富的操作任务上进行的实验表明，FA-RDP 在保留多样化接触前轨迹模式的同时，达到了最高的成功率。代码和视频可在 https://fa-rdp.github.io 获取。

英文摘要

In contact-rich manipulation, action multimodality and reactivity dominate different stages of a single episode. Before contact, multiple trajectories might be equally valid, making it important to preserve diverse action modes. After contact, geometric constraints and force limits narrow the solution space, while successful execution demands rapid responses to force feedback. However, standard diffusion policies use a fixed inference frequency and sampling steps throughout the episode, forcing a fundamental compromise: low-frequency, multi-step sampling better preserves pre-contact multimodality but responds slowly to force feedback, whereas high-frequency sampling improves reactivity but tends to collapse distinct pre-contact modes. To resolve this tradeoff, we present FA-RDP, a frequency-adaptive reactive diffusion policy. A shared multi-frequency visual-force Transformer predicts action chunks at both low and high frequencies, while a learned multimodality indicator dynamically selects multi-step low-frequency sampling before contact and one-step high-frequency sampling as action ambiguity decreases. We further introduce Manifold Consistency Distillation (MCD), which reparameterizes the diffusion network to predict actions on the robot action manifold while retaining DDPM-based residual supervision. Experiments on three contact-rich manipulation tasks show that FA-RDP achieves the highest success rate while preserving diverse pre-contact trajectory modes. Code and videos are available at https://fa-rdp.github.io.

2607.28595 2026-07-31 cs.CV 新提交

Beacon: Knowing When and How to Perform Agentic Visual Reasoning

Beacon：知道何时以及如何执行智能体视觉推理

Qixun Wang, Yang Shi, Letian Cheng, Zhuoran Zhang, Yan He, Yuqi Tang, Qi Zhang, Xinlei Yu, Ruizhe Chen, Tianrun Xu, Yuanxing Zhang, Pengfei Wan, Haotian Wang, Xianghua Ying

机构 * Peking University（北京大学）； Kling Team（Kling团队）； HKUST(GZ)（香港科技大学（广州））； CUHK（香港中文大学）； ZJU（浙江大学）； THU（清华大学）

AI总结本研究针对现有智能体视觉推理模型模式适应性有限、工具增益被损害抵消的问题，提出Beacon模型，通过强化学习相关机制提升性能与适应性，在多基准上表现优异。

Comments 33 pages

AI中文摘要

智能体视觉推理的核心目标是提升多模态大语言模型（MLLM）在复杂任务上的成功率，而非仅为其配备一套复杂却低效的推理范式。本研究从工具使用的两个关键维度重新审视智能体视觉推理：模式适应性（MA）与工具效果（TE）。模式适应性指MLLM能否识别工具的真实必要性并相应调用，从而避免不必要的计算开销，同时提升需要工具辅助的难题性能；工具效果指工具使用的实际影响：工具应扩展模型在仅靠文本推理无法解决的问题上的能力，同时避免在模型无需工具即可解决的问题上引入额外错误。我们开展综合分析以量化这两个属性，实证发现现有智能体视觉推理模型的模式适应性有限，且工具在难题上带来的增益很大程度上被在模型本可解决的简单问题上造成的损害所抵消。基于这些观察，我们提出Beacon，一种新型智能体视觉推理模型，可实现更强的整体性能、改进的模式适应性及真正的工具诱导性能增益。Beacon的核心是强化学习阶段的必要性感知自适应奖励和提示引导的能力扩展机制，二者分别鼓励基于任务必要性的自适应工具调用，并增强模型在最具挑战性问题上的工具使用能力。在不同基准上的大量实验表明，Beacon具有强劲的整体性能，且在模式适应性和工具效果两方面均有显著提升。

英文摘要

The fundamental goal of agentic visual reasoning is to improve the success rate of multimodal large language models (MLLMs) on complex tasks, rather than merely equipping them with a sophisticated yet inefficient reasoning paradigm. In this work, we rethink agentic visual reasoning through two key dimensions of tool use: Mode Adaptiveness (MA) and Tool Effect (TE). Mode Adaptiveness characterizes whether an MLLM can recognize when tools are truly necessary and invoke them accordingly, thereby avoiding unnecessary computational overhead while improving performance on challenging problems that require tool assistance. Tool Effect characterizes the actual impact of tool use: tools should extend the model's capabilities on problems unsolvable through text-only reasoning, while avoiding additional errors on problems that the model can already solve without tools. We conduct a comprehensive analysis to quantify these two properties and empirically reveal that existing agentic visual reasoning models exhibit limited Mode Adaptiveness, while the gains produced by tool use on hard examples are largely offset by the harm introduced on easy examples that the models can already solve. Motivated by these observations, we propose Beacon, a novel agentic visual reasoning model that achieves stronger overall performance, improved Mode Adaptiveness, and genuine tool-induced performance gains. At the core of Beacon are the Necessity-Aware Adaptive Reward and the Hint-Guided Capability Expansion mechanism in the reinforcement learning stage, which respectively encourage adaptive tool invocation based on task necessity and strengthen the model's tool-use capability on the most challenging problems. Extensive experiments across diverse benchmarks demonstrate the strong overall performance of Beacon and its substantial improvements in both Mode Adaptiveness and Tool Effect.
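
The observation that tool gains on hard examples are offset by harm on easy ones can be quantified from paired per-problem outcomes. The sketch below is an illustrative measurement, not the paper's definition of Tool Effect: it assumes each problem is attempted once with text-only reasoning and once with tools, and the function name `tool_effect` and the returned keys are hypothetical.

```python
from typing import Dict, Sequence


def tool_effect(
    solved_without_tools: Sequence[bool], solved_with_tools: Sequence[bool]
) -> Dict[str, float]:
    """Split the impact of tool use into gain on hard and harm on easy problems."""
    if len(solved_without_tools) != len(solved_with_tools):
        raise ValueError("both runs must cover the same problems")
    pairs = list(zip(solved_without_tools, solved_with_tools))
    n_easy = sum(1 for base, _ in pairs if base)        # solvable without tools
    n_hard = len(pairs) - n_easy                        # unsolved without tools
    if n_easy == 0 or n_hard == 0:
        raise ValueError("need both easy and hard problems")
    gained = sum(1 for base, tool in pairs if not base and tool)
    harmed = sum(1 for base, tool in pairs if base and not tool)
    return {
        "gain_on_hard": gained / n_hard,
        "harm_on_easy": harmed / n_easy,
        "net_accuracy_change": (gained - harmed) / len(pairs),
    }
```

A model can show a sizeable `gain_on_hard` and still have a `net_accuracy_change` of zero when the same number of easy problems break, which is the pattern the abstract reports for existing models.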

2607.28590 2026-07-31 cs.CV cs.CL 新提交

VAD: Attributing Visual Evidence for Target Reconstruction in Multimodal On-Policy Distillation

VAD：为多模态在线策略蒸馏中的目标重建归因视觉证据

Kangning Zhang, Yixing Li, Shuai Shao, Qingyao Li, Zhengxi Lu, Zhiyuan Yao, Jianghao Lin, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu

机构 * Shanghai Jiao Tong University（上海交通大学）； Xiaohongshu Inc.（小红书公司）； The Chinese University of Hong Kong（香港中文大学）； Zhejiang University（浙江大学）； Southeast University（东南大学）

AI总结该研究提出视觉归因蒸馏（VAD）算法，通过反事实目标重建分离教师校正中的视觉证据分量，在6个4B/9B规模的细粒度视觉基准上，性能优于现有蒸馏方法。

Comments The project is accessible at https://github.com/DeepExperience/VAD_Multimodal_OPD

AI中文摘要

多模态在线策略蒸馏（On-Policy Distillation, OPD）通过特权视角教师对学生生成的轨迹进行监督，以传递细粒度视觉知识。然而，其下一个token的校正是源混合的，结合了视觉信号、语言先验以及教师特有的效应。关键挑战在于估计哪些校正由视觉证据支撑，而不仅仅是确定在何处蒸馏或以多大强度蒸馏。我们提出视觉归因蒸馏（Visual Attribution Distillation, VAD），这是一种反事实目标重建算法，用于估计教师校正中可归因于视觉的部分。在每个学生生成的前缀处，VAD在相关证据存在和被移除两种条件下评估同一个固定教师，中心化对数概率的相应变化定义了u_t，它是视觉证据方向的有符号代理，用于估计揭示该证据会如何支持或反驳各候选token。VAD将原始校正投影到该代理上，得到与干预对齐的分量和代理无法解释的残差，随后由前者重建以学生为锚点的目标。训练期间，该重建目标提供主要监督信号，而特权教师则贡献弱正则化项。在6个细粒度视觉基准（4B和9B规模）上，VAD的性能优于直接特权视角蒸馏和视觉优势加权。token级和受控目标分析表明，代理对齐分量富含任务相关的视觉校正，并产生更强的目标偏移，尤其是当证据反驳错误答案时。这些结果表明，反事实目标重建是源混合监督的有效替代方案。

英文摘要

Multimodal on-policy distillation (OPD) transfers fine-grained visual knowledge by supervising student-generated trajectories with a privileged-view teacher. Yet its next-token corrections are source-mixed, combining visual signals with linguistic priors and teacher-specific effects. The key challenge is to estimate which corrections are supported by visual evidence, not merely where or how strongly to distill. We introduce Visual Attribution Distillation (VAD), a counterfactual target-reconstruction algorithm that estimates the visually attributable part of a teacher correction. At each student-generated prefix, VAD evaluates the same fixed teacher with the relevant evidence present and removed. The corresponding change in centered log-probabilities defines u_t, a signed proxy for the visual evidence direction that estimates how revealing the evidence supports or refutes candidate tokens. VAD projects the original correction onto this proxy to obtain an intervention-aligned component and a proxy-unexplained residual, then reconstructs a student-anchored target from the former. During training, this reconstructed target supplies the primary supervision signal, while the privileged teacher contributes a weak regularizer. Across six fine-grained visual benchmarks at 4B and 9B scales, VAD outperforms direct privileged-view distillation and visual-advantage weighting. Token-level and controlled-target analyses show that the proxy-aligned component is enriched in task-relevant visual corrections and yields stronger target shifts, especially when evidence refutes a mistaken answer. These results support counterfactual target reconstruction as an effective alternative to source-mixed supervision.


2607.28589 2026-07-31 cs.CV cs.LG 新提交

MixFrag: Fragility-Guided Mixed-Precision Post-Training Quantization for Vision Transformers

MixFrag：面向视觉Transformer的脆弱性引导混合精度后训练量化

Md. Mehrab Hossain Opi, Robiul Islam Ryad, Md. Umar Faruk

机构 * Khulna University of Engineering & Technology (KUET)（库尔纳工程技术大学）

AI总结针对现有PTQ方法精度分配低效问题，提出MixFrag框架，通过KL散度估计量化脆弱性并将位分配建模为MCKP，在ImageNet、COCO任务上实现最优混合精度PTQ性能。


AI中文摘要

后训练量化（PTQ）已成为将视觉Transformer（ViTs）部署到资源受限设备上的有效解决方案。然而，现有的PTQ方法通常在Transformer各组件中采用统一的位宽，忽略了它们对量化的异构敏感性，导致精度分配效率低下。本文提出了MixFrag——一种面向视觉Transformer的脆弱性引导混合精度PTQ框架。MixFrag首先通过使用小型校准集测量全精度输出分布与孤立量化输出分布之间的Kullback-Leibler（KL）散度，来估计组件级量化脆弱性；随后将位分配问题表述为多选择背包问题（MCKP），从而在目标位预算下实现自适应的逐层精度分配。在ImageNet-1K数据集上针对多种视觉Transformer架构开展的大量实验表明，MixFrag在实际混合精度设置下取得了具有竞争力的分类性能。此外，在COCO目标检测与实例分割任务上的评估显示，MixFrag在现有混合精度PTQ方法中实现了最优性能，在极具挑战性的MP3/MP3设置下较此前最优方法提升了最高9.6的平均精度（AP）。额外分析验证了所提出的脆弱性度量，并证明其与学习到的位分配具有强相关性。这些结果确立了MixFrag作为视觉Transformer混合精度后训练量化的有效框架。

英文摘要

Post-training quantization (PTQ) has emerged as an effective solution for deploying Vision Transformers (ViTs) on resource-constrained devices. However, existing PTQ methods typically employ uniform bit-widths across transformer components, overlooking their heterogeneous sensitivity to quantization and leading to inefficient precision allocation. In this paper, we propose MixFrag, a fragility-guided mixed-precision PTQ framework for Vision Transformers. MixFrag first estimates component-level quantization fragility by measuring the Kullback-Leibler (KL) divergence between full-precision and isolated quantized output distributions using a small calibration set. It then formulates bit allocation as a Multiple-Choice Knapsack Problem (MCKP), enabling adaptive layer-wise precision assignment under a target bit budget. Extensive experiments on ImageNet-1K across multiple Vision Transformer architectures demonstrate that MixFrag achieves competitive classification performance under practical mixed-precision settings. Furthermore, evaluations on COCO object detection and instance segmentation show that MixFrag achieves state-of-the-art performance among existing mixed-precision PTQ methods, improving the previous best method by up to 9.6 AP under the challenging MP3/MP3 setting. Additional analyses validate the proposed fragility metric and demonstrate its strong correlation with the learned bit allocation. These results establish MixFrag as an effective framework for mixed-precision post-training quantization of Vision Transformers.
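The two steps named in this abstract, a KL-based fragility score and bit allocation as a Multiple-Choice Knapsack Problem, can be sketched roughly as follows. This is a generic dynamic-programming illustration, not MixFrag's solver: the cost model (parameter count times bit-width), the candidate bit-widths, and the function names are all assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete output distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def allocate_bits(fragility, sizes, budget):
    """Pick exactly one bit-width per component to minimize total fragility.

    fragility[i] maps a candidate bit-width to its (hypothetical) KL cost,
    sizes[i] is the component's parameter count, and budget is the total
    number of bits allowed. Returns (total_cost, bit_widths) or None.
    """
    states = {0: (0.0, [])}  # bits used so far -> (best cost, choices)
    for frag, size in zip(fragility, sizes):
        nxt = {}
        for used, (cost, picks) in states.items():
            for bits, f in frag.items():
                u = used + bits * size
                if u > budget:
                    continue  # this choice would exceed the bit budget
                cand = (cost + f, picks + [bits])
                if u not in nxt or cand[0] < nxt[u][0]:
                    nxt[u] = cand
        states = nxt
    if not states:
        return None  # no assignment fits the budget
    return min(states.values(), key=lambda s: s[0])
```

In this toy form, a component whose fragility drops sharply with extra bits receives the higher precision, while robust components absorb the low bit-widths needed to stay within budget.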


2607.28582 2026-07-31 cs.LG 新提交

$β$-OPSD: Deriving with Policy Optimization, Training with Self-Distillation

β-OPSD：基于策略优化推导，采用自蒸馏训练

Jiawei Xu, Minghui Liu, Juzheng Zhang, Tom Goldstein, Furong Huang

机构 * University of Maryland（马里兰大学）

AI总结 β-OPSD是将普通OPSD扩展为含可控正则化参数β的策略优化通用形式，通过蒸馏近似策略优化解，在数学推理基准上性能与稳定性均优于普通OPSD。


AI中文摘要

在线策略自蒸馏（OPSD）是提升推理类语言模型的有前景方法，但实际应用中稳定性不足，要可靠运行常需大量工程投入。我们明确了这一难题的结构性根源：普通OPSD是更广泛策略优化族中β=1的成员，其中β用于加权将学生模型锚定到参考策略的KL散度惩罚项。这种等价关系使β从固定为1的隐式值变为可控正则化参数，得到更通用的公式，可权衡与参考策略的贴近度及特权教师指导。我们提出β-OPSD，将其最优策略推导为参考策略与特权教师的几何插值。不过，直接用强化学习优化该目标会成本高、方差大，因此我们将其闭式解转化为蒸馏目标，每个β值对应参考到教师路径上的一个目标，通过混合二者的token级logits高效实现，用低成本蒸馏近似高成本策略优化的解。同时，回报式信用分配进一步使token更新与序列级目标对齐，且保留OPSD的简洁性。在数学推理基准上的实验显示，β-OPSD始终优于普通OPSD，提升了优化稳定性和下游推理性能。我们的结果提供了从自蒸馏到策略优化再返回的原理性路径，未牺牲OPSD的实用效率。

英文摘要

On-policy self-distillation (OPSD) is a promising approach to improve reasoning language models, but it remains brittle in practice: making it work reliably often requires substantial engineering effort. We identify a structural source of this difficulty: vanilla OPSD is precisely the $β=1$ member of a broader policy-optimization family, where $β$ weights the KL penalty anchoring the student to a reference policy. This equivalence turns $β$ from an implicit value fixed at one into a controllable regularization parameter, yielding a more general formulation that trades off proximity to a reference policy against privileged teacher guidance. We introduce $β$-OPSD and derive its optimal policy as a geometric interpolation between the reference policy and the privileged teacher. Directly optimizing this objective with reinforcement learning, however, would be costly and high-variance. Rather than optimize the RL objective directly, we turn its closed-form solution into a distillation target. Each value of $β$ selects a target along the reference-to-teacher path, which we implement efficiently by mixing their token-level logits. In this way, inexpensive distillation approximates the solution of expensive policy optimization. Return-to-go credit assignment further aligns token updates with the sequence-level objective while retaining the simplicity of OPSD. Experiments on mathematical reasoning benchmarks show that $β$-OPSD consistently outperforms vanilla OPSD, improving optimization stability and downstream reasoning performance. Our results provide a principled route from self-distillation to policy optimization and back without sacrificing the efficiency that makes OPSD practical.
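The statement that a geometric interpolation between the reference policy and the teacher can be implemented by mixing token-level logits rests on a simple identity, sketched below in Python. The mixing weight `alpha` is a hypothetical stand-in: the abstract does not state how $β$ maps to the interpolation weight, so no such mapping is assumed here.

```python
import math

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_logits(ref_logits, teacher_logits, alpha):
    # Linear interpolation in logit space (alpha is an illustrative weight).
    return [(1.0 - alpha) * r + alpha * t for r, t in zip(ref_logits, teacher_logits)]

def geometric_interpolation(p_ref, p_teacher, alpha):
    # Normalized weighted geometric mean of two token distributions.
    weights = [pr ** (1.0 - alpha) * pt ** alpha for pr, pt in zip(p_ref, p_teacher)]
    total = sum(weights)
    return [w / total for w in weights]
```

Because softmax is invariant to additive constants, `softmax(mix_logits(...))` equals `geometric_interpolation(softmax(ref), softmax(teacher), alpha)`, so the distillation target along the reference-to-teacher path costs one extra forward pass per model and no sampling.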


2607.28581 2026-07-31 cs.CV 新提交

ROAD: Reciprocal-Objective Alignment of Discriminative Semantics for 3D Shape Generation

ROAD：用于3D形状生成的判别式语义的互目标对齐

Xiao Luo, Mingyang Du, Xin Zhou, Tianrui Feng, Xiwu Chen, Xiaofan Li, Jiangning Zhang, Dingkang Liang

机构 * Huazhong University of Science and Technology（华中科技大学）； Megvii（旷视科技）； Zhejiang University（浙江大学）

AI总结 ROAD框架通过迁移判别式3D基础模型的先验，采用互目标对齐策略，仅用1.5%训练数据就实现了高保真3D生成，大幅降低了计算开销。


AI中文摘要

高保真3D生成主要依赖于扩大模型容量和数据规模，这会带来高昂的计算成本。这种范式通常需要从头学习几何结构，却忽略了判别式3D基础模型中已包含的丰富语义和结构先验。我们认为，利用这些判别式模型对3D世界的深刻理解可显著降低生成成本。为此，我们提出了ROAD框架，该框架通过将丰富的判别式先验迁移到扩散Transformer中，降低了3D生成的训练成本。为解决生成式与判别式隐空间之间固有的语义-结构异质性，我们引入了互目标对齐策略。该方法协同整体语义压缩以确保全局语义一致性，以及结构最优对齐——其被公式化为二分匹配问题，以严格对齐不同隐空间之间的微观几何细节。3D基础模型仅用于对齐的训练时监督，推理时不使用，因此不会产生额外的推理成本。与工业基线Step1X-3D相比，所提出的ROAD仅用1.5%的训练数据就实现了极具竞争力的生成性能，且显著降低了训练成本，有效减少了高保真3D生成的计算开销。代码可在 https://github.com/H-EmbodVis/ROAD 获取。

英文摘要

High-fidelity 3D generation predominantly relies on scaling model capacity and data, which incurs prohibitive computational costs. This paradigm typically requires learning geometry from scratch and overlooks the rich semantic and structural priors already encapsulated in discriminative 3D foundation models. We contend that leveraging the profound understanding of the 3D world possessed by these discriminative models can significantly reduce generative cost. To this end, we propose ROAD, a framework that reduces the training cost of 3D generation by transferring these rich discriminative priors into diffusion transformers. To address the inherent semantic-structural heterogeneity between generative and discriminative latents, we introduce a reciprocal-objective alignment strategy. This method synergizes Holistic Semantic Condensing to enforce global semantic coherence and Structural Optimal Alignment, which is formulated as a bipartite matching problem to rigorously align microscopic geometric details between disparate latent spaces. The 3D foundation model is only used for training-time supervision of alignment and is not used at inference, incurring no additional inference cost. Compared with the industrial baseline Step1X-3D, the proposed ROAD achieves highly competitive generation performance with only 1.5% of the training data and significantly reduces training costs, effectively reducing the computational overhead of high-fidelity 3D generation. Code is available at https://github.com/H-EmbodVis/ROAD.


2607.28580 2026-07-31 cs.AI 新提交

DualG-MRAG: Decoupling Macro-Reasoning and Micro-Matching for Multimodal Retrieval-Augmented Generation

DualG-MRAG：面向多模态检索增强生成的宏观推理与微观匹配解耦框架

Jiacheng Tao, Qingyun Sun, Haonan Yuan, Ziwei Zhang, Jianxin Li

机构 * Beihang University（北京航空航天大学）； SKLCCSE

AI总结针对多模态RAG在复杂多跳推理中存在的问题，本文提出DualG-MRAG框架，通过解耦宏观推理与微观匹配抑制检索噪声，经实验验证其在证据召回和问答准确率上优于基线。

Comments Accepted to the 34th ACM International Conference on Multimedia (ACM MM 2026). 12 pages


AI中文摘要

多模态检索增强生成（MM-RAG）虽已取得良好效果，但在复杂多跳推理任务上仍存在不足。现有方法主要聚焦于独立实例级匹配，往往无法捕捉跨模态及跨文档的显式关系。尽管图增强方法引入了结构建模，但在多模态场景中面临核心难题：融入细粒度视觉特征会导致图快速扩张并产生检索噪声，而粗粒度表示则会丢失关键局部证据。为解决该困境，本文提出DualG-MRAG，这是一个包含宏观推理图与微观匹配图的解耦架构，用于多模态RAG。具体而言，为通过分离全局结构推理与细粒度证据匹配来抑制检索噪声，本文构建用于全局拓扑路由的宏观图和用于精确局部验证的微观图；随后，为实现跨异构证据源的动态相关性传播，本文将检索建模为基于查询的消息传递过程，采用GNN Retriever完成；此外，为向生成模型提供连贯的结构指导，本文引入动态规划解码机制，直接从GNN前向传播中提取显式推理路径，替代孤立文档块的标准输入。大量实验表明，DualG-MRAG在证据召回率和复杂问答准确率上均优于基线方法。

英文摘要

While Multimodal Retrieval-Augmented Generation (MM-RAG) has shown promising results, it still struggles with complex multi-hop reasoning tasks. Existing methods primarily focus on independent instance-level matching, which often fails to capture explicit relationships across modalities and documents. Although Graph-enhanced methods introduce structural modeling, they face a fundamental challenge in multimodal scenarios: incorporating fine-grained visual features leads to rapid graph expansion and retrieval noise, whereas coarse-grained representations cause the discarding of critical local evidence. To address this dilemma, we propose DualG-MRAG, a Dual-tier framework that introduces a decoupled architecture comprising Macro-reasoning and Micro-matching Graphs for Multimodal RAG. Specifically, to suppress retrieval noise by isolating global structural reasoning from fine-grained evidence matching, we construct a Macro Graph for global topological routing and a Micro Graph for precise local verification. Subsequently, to enable dynamic relevance propagation across heterogeneous evidence sources, we formulate retrieval as a query-driven message passing process via a GNN Retriever. Furthermore, to provide the generative model with coherent structural guidance, we introduce a dynamic programming decoding mechanism that extracts explicit reasoning paths directly from the GNN's forward pass, replacing the standard input of isolated document chunks. Extensive experiments demonstrate that DualG-MRAG outperforms baselines in both evidence recall and complex QA accuracy.


2607.28573 2026-07-31 cs.AI 新提交

Rethinking Inference-Time Scaling in Local Computer-Use Agents: Failure Modes and Compute Tradeoffs

重新思考本地计算机使用智能体的推理时扩展：失败模式与计算权衡

Woongkyu Lee, Jungwook Choi

机构 * Hanyang University（汉阳大学）

AI总结本文针对本地计算机使用智能体的推理时扩展开展系统实证研究，评估多款模型在OSWorld基准上的表现，揭示其边际收益递减、失败模式转变等规律，提出高效本地智能体的设计方向。


AI中文摘要

在本地部署自主计算机使用智能体（Computer-Use Agents, CUAs）对于隐私、成本效率和实际可用性愈发重要，但在严格硬件约束下提升其性能仍具挑战性。尽管近期研究表明，推理时扩展可通过执行过程中的额外计算提升前沿计算机使用智能体的性能，但其对资源受限的本地模型的有效性却鲜为人知。本文针对本地CUAs在上下文、时间、结构和并行维度上的推理时扩展开展系统实证研究，在OSWorld基准上评估Qwen3-VL-8B/30B-A3B、UI-TARS-1.5-7B和OpenCUA-7B。结果显示，额外计算常产生边际收益递减，同时改变失败模式：上下文扩展提供历史依据以提升轨迹稳定性和任务准确性，但其收益随token成本增加而饱和，失败类型从重复或停滞轨迹转向过早假成功；时间扩展类似地减少最大步数停滞，但未显著提升任务成功率，表明更长的时间范围常延长错误轨迹而非纠正；结构分解会在本地两阶段智能体中引入规划和格式开销，而并行扩展以大量计算成本部分缓解这些失败。总体而言，研究结果表明，高效的本地CUAs需要选择性计算分配、感知失败的控制机制，以及围绕本地模型能力和局限性设计的智能体框架。

英文摘要

Deploying autonomous computer-use agents (CUAs) locally is increasingly important for privacy, cost efficiency, and practical usability, yet improving their performance under strict hardware constraints remains challenging. While recent studies show that inference-time scaling can improve frontier computer-use agents through additional computation during execution, its effectiveness for resource-constrained local models remains poorly understood. We present a systematic empirical study of inference-time scaling in local CUAs across contextual, temporal, structural, and parallel dimensions. We evaluate Qwen3-VL-8B/30B-A3B, UI-TARS-1.5-7B, and OpenCUA-7B on the OSWorld benchmark. Our results show that additional computation often yields diminishing returns while changing failure modes. Contextual scaling provides historical grounding that improves trajectory stability and task accuracy, but its gains saturate as token cost increases and failures shift from repetitive or stalled trajectories toward premature false successes. Temporal scaling similarly reduces max-step stalls, yet does not substantially improve task success, indicating that longer horizons often extend erroneous trajectories rather than correct them. We further find that structural decomposition can introduce planning and formatting overhead in local two-stage agents, while parallel scaling partially mitigates these failures at a substantial computational cost. Overall, our findings suggest that efficient local CUAs require selective compute allocation, failure-aware control mechanisms, and agentic frameworks designed around the capabilities and limitations of local models.


2607.28571 2026-07-31 cs.CV cs.IR 新提交

Finding Change in Satellite Archives from Text: How to Combine Before-and-After Images Efficiently

从文本中查找卫星档案中的变化：如何高效结合前后图像

Simon Roy, Mark Bong, Giovanni Beltrame

机构 * Polytechnique Montréal（蒙特利尔理工学院）

AI总结该研究针对卫星图像变化文本查询的融合模块，对比注意力、Mamba、TBF等设计，提出两阶段搜索方案可降查询成本，TBF可降参数量与延迟，明确了不同方法的性能与效率特点。

Comments 10 pages, 3 figures


AI中文摘要

业务地球观测日益需要回答诸如“找到出现新建筑的图像对”之类的查询，这意味着要搜索前后时相（双时相）卫星图像对档案，并根据每对与自然语言变化描述的匹配程度进行排名。执行此匹配的组件（即结合“前”和“后”视图的融合模块）必须在查询时针对许多候选对运行，因此其速度在很大程度上决定了每次搜索的成本。我们对该模块的构建方式进行了受控比较，使用一个固定的图像编码器（冻结的CLIP模型）和所有变体的统一训练方案，评估了三种类别中的八种设计：注意力、状态空间模型（Mamba）和学习型压缩（我们的时间瓶颈融合TBF）。每种设计在两个基准（LEVIR-CC和Dubai-CC）上用十个随机种子进行测试，因此报告的差异具有统计依据。我们总结了三项发现：第一，无需训练的两阶段搜索（廉价的差异模型筛选候选，再通过注意力融合重新排名）在LEVIR-CC上的召回率与全融合相当或更高，同时将查询成本降低了10-15倍，在Dubai-CC上的R@1/R@5指标相当；第二，理论上具有吸引力的Mamba线性时间扫描，在视觉Transformer典型的补丁数量（L=196）下没有速度优势，该扫描受内存带宽限制，而注意力能很好地适配并行硬件；第三，压缩融合表示（TBF）使参数减少2.3倍，延迟降低1.6倍，变化相关的BLEU-1代价为0.007，不过更激进的压缩会悄悄丢弃与变化相关的细节，而聚合指标无法揭示这些细节。

英文摘要

Operational Earth observation increasingly calls for answering queries such as "find the image pairs where a new building appeared." This means searching an archive of before-and-after (bi-temporal) satellite image pairs and ranking each pair by how well it matches a natural-language description of the change. The component that performs this match, the fusion module that combines the "before" and "after" views, must be run at query time across many candidate pairs, so its speed largely sets the cost of every search. We present a controlled comparison of how to build that module. Using one fixed image encoder (a frozen CLIP model) and one training recipe for all variants, we evaluate eight designs drawn from three families: attention, state-space models (Mamba), and learned compression (our Temporal Bottleneck Fusion, TBF). Each design is tested on two benchmarks (LEVIR-CC and Dubai-CC) with ten random seeds, so the reported differences are statistically grounded. We outline three findings: first, a training-free two-stage search (a cheap difference model that shortlists candidates, followed by attention fusion that re-ranks them) matches or exceeds full-fusion recall on LEVIR-CC while cutting query cost $10$-$15\times$, with comparable R@1/R@5 on Dubai-CC; second, the linear-time scan of Mamba, attractive on paper, gives no speed benefit at the patch counts typical of vision transformers ($L{=}196$): the scan is limited by memory bandwidth, whereas attention maps cleanly onto parallel hardware; and third, compressing the fused representation (TBF) reduces parameters by $2.3\times$ and latency by $1.6\times$ for a change-only BLEU-1 cost of $0.007$, although more aggressive compression quietly discards change-relevant detail that aggregate metrics fail to reveal.
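The two-stage search in the first finding follows a standard shortlist-then-rerank pattern, which a short Python sketch can make concrete. The scoring callables, `shortlist_size`, and `top_k` are hypothetical placeholders, not the paper's models or settings.

```python
def two_stage_search(query, pairs, cheap_score, fusion_score, shortlist_size, top_k):
    """Rank image pairs for a text query with a cheap filter and an expensive re-ranker.

    cheap_score and fusion_score are caller-supplied functions (query, pair) -> float,
    where higher means a better match. Only the shortlist reaches fusion_score.
    """
    # Stage 1: score every pair with the cheap model and keep the best candidates.
    shortlist = sorted(pairs, key=lambda p: cheap_score(query, p), reverse=True)
    shortlist = shortlist[:shortlist_size]
    # Stage 2: re-rank only the shortlist with the expensive fusion model.
    reranked = sorted(shortlist, key=lambda p: fusion_score(query, p), reverse=True)
    return reranked[:top_k]
```

The expensive model runs `shortlist_size` times per query instead of once per archived pair, which is where the cost reduction comes from; recall is preserved only if the cheap stage rarely drops a true match.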


2607.28565 2026-07-31 cs.CV 新提交

MIND: Multimodal Intent-Driven Network via Diffusion Transformers for Medical Image Fusion

MIND：基于扩散Transformer的多模态意图驱动网络用于医学图像融合

Yunzhan Fu, Xiangyu Shen, Yifei Sun, Yuhan Chen, Jian Wu, Hongxia Xu

机构 * Transvascular Implantation Devices Research Institute, Zhejiang University（浙江大学血管内植入器械研究院）； Zhejiang University（浙江大学）； Hangzhou Institute of Technology, Xidian University（西安电子科技大学杭州研究院）； Hangzhou Dianzi University（杭州电子科技大学）

AI总结该研究针对现有医学图像融合方法缺乏对诊断意图深度理解的问题，提出基于DiTs的MIND网络，通过BioMedGPT、多尺度潜在适配器和医学语义一致性损失优化，在多数据集上取得优异效果，可提升脑肿瘤分割精度并支持交互式融合。

Comments 14 pages, 14 figures, accepted by ACM MM 2026


AI中文摘要

医学图像融合旨在整合不同成像模态的互补信息以辅助临床诊断。现有方法通常全局应用统一融合规则，缺乏对诊断意图和病理结构的深度理解。为解决这些局限，我们提出MIND，一种基于扩散Transformer（DiTs）的多模态意图驱动网络用于医学图像融合。具体而言，我们利用BioMedGPT从源图像生成意图驱动的融合文本，以病理感知的诊断意图指导融合过程。为应对DiTs中1D序列扁平化导致的2D空间连续性损失，我们设计了多尺度潜在适配器，该模块在序列化前显式提取源图像特征，通过严格的维度对齐将其注入网络以有效补充图像特征。为解决图像输出与诊断意图解耦导致的语义偏移，我们设计了医学语义一致性损失，该损失确保融合图像与融合文本之间的深度语义锁定，同时维持底层物理流形重建的稳定性。在Harvard、BraTS和GFP数据集上的综合实验表明，MIND实现了更优的融合质量，显著提升了下游脑肿瘤分割精度，且支持灵活的交互式融合，对意图驱动的智能临床决策支持系统具有重要应用前景。

英文摘要

Medical image fusion aims to integrate complementary information from diverse imaging modalities to support clinical diagnosis. Existing methods typically apply uniform fusion rules globally, lacking a deep understanding of diagnostic intents and pathological structures. To address these limitations, we propose MIND, a Multimodal Intent-Driven Network via Diffusion Transformers (DiTs) for medical image fusion. Specifically, we utilize BioMedGPT to generate intent-driven fusion texts from source images, guiding the fusion process with pathology-aware diagnostic intents. To combat the loss of 2D spatial continuity caused by 1D sequence flattening in DiTs, we design a Multi-scale Latent Adapter. This module explicitly extracts source image features before serialization, injecting them into the network via strict dimensional alignment to effectively supplement image features. To resolve the semantic shift caused by decoupling image outputs from diagnostic intents, we design a medical semantic consistency loss. This loss ensures deep semantic locking between fused images and fusion texts while maintaining the stability of the underlying physical manifold reconstruction. Comprehensive experiments on the Harvard, BraTS, and GFP datasets reveal that MIND delivers superior fusion quality, significantly improves downstream brain tumor segmentation accuracy, and enables flexible interactive fusion, holding significant promise for intent-driven intelligent clinical decision support systems.


2607.28560 2026-07-31 cs.RO 新提交

X-NavDP: Generalizing Navigation Diffusion Policy to Novel Behavior and Embodiments with Group Q-score Reweighted Matching

X-NavDP：通过组Q分数重加权匹配将导航扩散策略泛化到新行为和实体

Tianyu Yang, Yiming Zeng, Wenzhe Cai, Yuqiang Yang, Jiaqi Peng, Hui Cheng, Jiangmiao Pang, Tai Wang

机构 * Fudan University（复旦大学）； Shanghai AI Laboratory（上海人工智能实验室）； Sun Yat-sen University（中山大学）； Tsinghua University（清华大学）

AI总结该研究针对导航扩散策略泛化性不足的问题，提出GQRM框架对X-NavDP进行后训练，大幅提升了跨实体视觉导航的模拟与现实场景成功率。

Comments 20 pages, 4 figures


AI中文摘要

导航扩散策略的预训练依赖大规模专家演示数据，这些数据通常由适配单一标称机器人的全知规划器生成，限制了策略对多样实体及需多样局部反应行为的具挑战性场景（如摆脱死胡同或绕过长障碍物，仅利用机载局部观测）的泛化能力。用强化学习（RL）对策略进行后训练是合理解决方案，但此前针对扩散模型的RL方法仅能带来微小改进，原因在于扩散策略的难处理似然性会导致策略梯度不稳定，且策略探索效率低下。为解决这些挑战，我们提出数据高效的扩散RL后训练框架GQRM（Group Q-score Reweighted Matching，组Q分数重加权匹配），该框架引入两种互补设计：（i）带行为扰动的自举探索策略，可保留预训练策略的先验；（ii）组Q分数归一化机制，对每个状态计算每条轨迹的值以实现高效重加权分数匹配。通过在异构实体上进行分布式在线RL训练，得到的微调策略X-NavDP实现了最先进的跨实体视觉导航性能，在模拟环境中整体成功率从61.20%提升至84.28%，在现实世界困难场景中从10%提升至65%。代码和模型已公开于 https://yty-sky.github.io/x-navdp-project-page 。

英文摘要

Pretraining navigation diffusion policies relies on large-scale expert demonstrations. These data are typically generated by a fully-informed oracle planner suited to a single nominal robot. This limits the policy's generalization to diverse embodiments and challenging scenarios (e.g., escaping dead ends or detouring around long obstacles) that demand diverse local reactive behaviors with only onboard local observations. Post-training the policy with reinforcement learning (RL) offers a principled remedy. However, previous RL approaches for diffusion models yield only marginal improvements, because the intractable likelihood of diffusion policies renders policy gradients unstable and policy exploration is inefficient. To address these challenges, we propose GQRM (Group Q-score Reweighted Matching), a data-efficient diffusion RL post-training framework. Our framework introduces two complementary designs: (i) a self-bootstrapped exploration strategy with behavior perturbation that preserves the pretrained policy prior, and (ii) a group Q-score normalization mechanism that computes per-trajectory values on each state for efficient reweighted score matching. By conducting distributed online RL training across heterogeneous embodiments, the resulting fine-tuned policy, X-NavDP, achieves state-of-the-art cross-embodiment visual navigation performance, improving the overall success rate from 61.20% to 84.28% in simulation and from 10% to 65% in real-world hard cases. The code and model are publicly available at https://yty-sky.github.io/x-navdp-project-page.


2607.28553 2026-07-31 cs.LG cs.AI cs.MA 新提交

APO: Unsupervised Atomic Policy Optimization for 3D Structure Prediction of Atomic Systems

APO：面向原子系统三维结构预测的无监督原子策略优化

Shentong Mo, Yatao Bian

机构 * CMU（卡内基梅隆大学）； NUS（新加坡国立大学）

AI总结本研究提出无监督原子策略优化框架APO，通过双奖励机制实现原子系统三维结构预测，在晶体与抗体预测中优于全监督基线，提升了匹配率、结构保真度与推理效率。


AI中文摘要

预测原子系统的三维结构是推动材料科学与药物发现的基础。尽管流匹配模型（如FlowDPO）近期在该领域展现出潜力，但其性能高度依赖通过监督偏好学习与真实坐标的对齐。然而，获取新晶相或从头设计蛋白质的实验标签成本极高，在数据稀缺场景下成为结构建模的瓶颈。本研究提出APO（Atomic Policy Optimization），一种完全无监督的对齐框架，无需真实参考结构。APO将组相对策略优化适配到三维原子环境，采用新颖的双奖励机制：（i）通过样本相似性的特征分解强化策略的主导潜在结构模式；（ii）确保热力学稳定性。该框架通过在采样组内识别物理合理构型，使模型实现“自校正”。在晶体与抗体结构预测的广泛基准测试中，APO始终优于全监督基线，在匹配率与结构保真度上达到新的SOTA；此外，APO可有效拉直概率路径，显著提升推理效率。结果表明，与有噪的监督坐标匹配相比，内在物理一致性可作为更优的对齐指导。

英文摘要

Predicting the 3D structures of atomic systems is fundamental to advancing material science and drug discovery. While flow-matching models (e.g., FlowDPO) have recently shown promise in this domain, their performance relies heavily on alignment with ground-truth coordinates via supervised preference learning. However, obtaining experimental labels for novel crystal phases or de novo proteins is prohibitively expensive, creating a bottleneck for structural modeling in data-scarce regimes. In this work, we propose APO (Atomic Policy Optimization), a fully unsupervised alignment framework that eliminates the need for ground-truth reference structures. APO adapts group-relative policy optimization to 3D atomic environments, utilizing a novel dual-reward mechanism: (i) a reward that reinforces the policy's dominant latent structural modes through eigen-decomposition of sample similarities, and (ii) a reward that enforces thermodynamic stability. Our framework enables the model to "self-correct" by identifying physically plausible configurations within sampled groups. Extensive benchmarks on crystal and antibody structure prediction demonstrate that APO consistently outperforms fully supervised baselines, achieving a new state-of-the-art in match rates and structural fidelity. Furthermore, we show that APO effectively straightens probability paths, significantly improving inference efficiency. Our results suggest that intrinsic physical consistency can serve as a superior guide for alignment compared to noisy, supervised coordinate matching.


2607.28545 2026-07-31 cs.CL cs.AI cs.SE 新提交

ORCA-bench: How Ready Are Language Model Agents for Oncall?

ORCA-bench：语言模型智能体的值班待命准备程度如何？

Albert Gong, Kyuseong Choi, Abhineet Agarwal, Jason Schechner, Ryan Huang, Raj Agrawal, Anish Agarwal, Raaz Dwivedi

机构 * Cornell Tech（康奈尔科技学院）； Traversal ； Columbia University（哥伦比亚大学）

AI总结该研究推出ORCA-bench基准，评估编码智能体在生产级值班待命场景下的根本原因分析能力，发现现有前沿智能体表现不佳，凸显其距离安全胜任生产可靠性工作仍有较大差距。


AI中文摘要

大型语言模型能够编写、修补和搜索代码，但值班待命（oncall）的根本原因分析（RCA）需要不同的能力：从模糊的用户报告出发，对有噪声的指标、日志、追踪和源代码进行推理，且通常要在事件发生数小时后开展。我们推出ORCA-bench，这是一个将通用编码智能体置于生产级保真度值班待命场景中的基准。ORCA-bench将一个实时的、经OpenTelemetry插桩的微服务系统与1079项RCA任务配对：该系统通过真实遥测接口（经由Grafana访问的Prometheus、Jaeger和OpenSearch）暴露6天的指标、日志和追踪，并提供完整的源代码访问权限；这些任务在报告特异性、检测时间和并发故障场景方面系统性地变化。真实症状由站点可靠性工程（SRE）专家策划并确认，我们的大模型作为评判者的结果已由人工独立重新评分（Cohen's $κ_w=0.90$）。在五个前沿智能体中，最佳RCA准确率在中等难度任务（现实输入场景）上为25.3%，在高难度任务上为10.0%，即使使用Claude Fable 5，这一差距仍然存在。最弱的模型在40%的事件报告中会生成不合理的根本原因，而移除源代码访问权限会降低所有指标。关键的是，这些性能是在一个策划好的50GB、6天的测试平台上获得的，其中的任务是在代码和检测工具公开的系统上单独研究的。由于真实生产系统的规模、动态性和特殊性要大几个数量级，我们报告的差距是在前沿编码智能体能够被安全委托负责生产可靠性之前所需工程投入的下限。我们在 https://hub.harborframework.com/datasets/orca-bench/ORCA-bench 发布了公开数据集。

英文摘要

Large language models can write, patch, and search code, but oncall root cause analysis (RCA) demands something different: reasoning over noisy metrics, logs, traces, and source code, starting from ambiguous user-facing reports, often hours after the incident began. We introduce ORCA-bench, a benchmark that puts general-purpose coding agents in a production-fidelity oncall setting. ORCA-bench pairs a live OpenTelemetry-instrumented microservice system (exposing six days of metrics, logs, and traces through real telemetry interfaces, namely Prometheus, Jaeger, and OpenSearch via Grafana, together with full source-code access) with 1,079 RCA tasks that systematically vary report specificity, time-to-detection, and co-occurring fault scenarios. Ground-truth symptoms are curated and signed off by expert SREs, and our LLM-as-judge is independently re-scored by humans (Cohen's $κ_w=0.90$). Across five frontier agents, the best RCA Accuracy is 25.3% on Medium-difficulty tasks (the realistic-input setting) and 10.0% on Hard, a gap that remains even with Claude Fable 5. The weakest model hallucinates an implausible root cause in 40% of incident reports, and removing source-code access degrades every metric. Crucially, these are performances on a curated 50 GB / six-day testbed with tasks investigated in isolation on a system whose code and instrumentation are public. Since real production systems are orders of magnitude larger, more dynamic, and more idiosyncratic, the gap we report is a lower bound on the engineering investment required before frontier coding agents can be safely entrusted with production reliability. We release the public set at https://hub.harborframework.com/datasets/orca-bench/ORCA-bench.
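The agreement statistic quoted above, a weighted Cohen's kappa between the LLM judge and human re-scoring, can be computed as in the following Python sketch. The abstract does not say which weighting scheme was used; linear disagreement weights and integer-coded ordinal labels are assumptions here.

```python
def weighted_kappa(rater_a, rater_b, num_categories):
    """Linearly weighted Cohen's kappa for ordinal labels coded 0..num_categories-1."""
    n = len(rater_a)
    k = num_categories
    # Observed joint distribution of the two raters' labels.
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(rater_a, rater_b):
        observed[x][y] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Linear disagreement weight: 0 on the diagonal, 1 at maximum distance.
        return abs(i - j) / (k - 1)

    d_obs = sum(weight(i, j) * observed[i][j] for i in range(k) for j in range(k))
    d_exp = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if d_exp == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - d_obs / d_exp
```

A value of 1 means perfect agreement and 0 means agreement no better than chance given each rater's label frequencies, which is the scale on which the reported $κ_w=0.90$ should be read.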


2607.28526 2026-07-31 cs.CV cs.AI 新提交

What to Remove, What to Preserve: Dual-Ambiguity Rectification for All-in-One Image Restoration

移除什么，保留什么：面向一体化图像恢复的双歧义校正

Cencen Liu, Wen Yin, Dongyang Zhang, Dongmin Li, Shan Zhao, Bing Su, Tao He, Jielei Wang, Guoming Lu

机构 * University of Electronic Science and Technology of China（电子科技大学）； Jiigan Technology（极感科技）

AI总结针对一体化图像恢复中退化与内容线索纠缠的双歧义问题，提出DAR-Net网络，经实验在多退化设置下较基准方法实现峰值信噪比提升，在多基准数据集表现优异。


AI中文摘要

一体化图像恢复旨在在统一框架内处理多种退化类型。现有方法通常在共享潜在空间中编码异质退化条件，导致退化相关线索与场景内容相互纠缠。我们将该挑战定义为双歧义：通道调制中的语义歧义，以及恢复响应中的空间歧义，这会导致内容损坏和残留伪影。为缓解此问题，我们提出DAR-Net，即面向一体化图像恢复的双歧义校正网络。DAR-Net首先引入退化原型表示（Degradation Archetype Representation，DAR）模块，通过单纯形约束的原型混合建模构建结构化退化状态。基于该状态，语义歧义校正（Semantic Ambiguity Rectification，SeAR）模块生成感知退化的提示，以改进解码器中的通道条件设置。空间歧义校正（Spatial Ambiguity Rectification，SpAR）模块进一步将感知退化的互补特征正则化为正交响应子空间，减少移除与保留线索间的空间干扰。在标准一体化恢复基准上的大量实验表明，DAR-Net在3种退化和5种退化设置下均实现最佳整体性能，平均峰值信噪比（PSNR）较最强竞争者分别提升0.14 dB和0.34 dB；其在CDD-11和WeatherBench上也表现出优异性能。

英文摘要

All-in-one image restoration aims to handle diverse degradations within a unified framework. Existing methods commonly encode heterogeneous degradation conditions in a shared latent space, where degradation-related cues and scene content can remain entangled. We characterize the resulting challenge as dual ambiguity: semantic ambiguity in channel-wise modulation and spatial ambiguity in restoration responses, which can lead to content corruption and residual artifacts. To mitigate this issue, we propose DAR-Net, a Dual-Ambiguity Rectification Network for all-in-one image restoration. DAR-Net first introduces a Degradation Archetype Representation (DAR) module to construct a structured degradation state through simplex-constrained archetype mixture modeling. Based on this state, a Semantic Ambiguity Rectification (SeAR) module generates degradation-aware prompts to improve channel-wise conditioning in the decoder. A Spatial Ambiguity Rectification (SpAR) module further regularizes degradation-aware and complementary features toward orthogonal response subspaces, reducing spatial interference between removal and preservation cues. Extensive experiments on standard all-in-one restoration benchmarks show that DAR-Net achieves the best overall performance under both three-degradation and five-degradation settings, improving the average PSNR over the strongest competitor by 0.14 dB and 0.34 dB, respectively; it additionally shows superior performance on CDD-11 and WeatherBench.


2607.28523 2026-07-31 cs.AI cs.LO 新提交

Selective Credibility-Limited Belief Update

选择性可信度受限的信念更新

Theofanis Aravanis, Costas D. Koutras

机构 * University of the Peloponnese（伯罗奔尼撒大学）； American University of the Middle East（中东美国大学）

AI总结本文针对现有可信度受限信念更新无法处理复合认知输入部分可实现的问题，提出选择性可信度受限的信念更新，刻画其语义与公理，定义两类子类，证明框架通用性，提供统一且表达力更强的信念更新方案。


AI中文摘要

信念更新关注智能体的信念因底层世界变化而产生的改变。标准的Katsuno-Mendelzon更新假设认知输入可从所有初始可能世界纳入，而可信度受限的信念更新则针对每个源世界，限制被视为可信或可达的后继世界。不过，现有可信度受限方法将认知输入视为不可分割的整体，无法表示复合认知输入中仅部分可实现的情况。本文提出选择性可信度受限的信念更新，在执行可信度受限转换前，会针对每个源世界将认知输入转换为较弱的代理。我们为该类更新算子提供语义和公理化刻画，随后确定两类表现良好的子类：一是一致性保持更新算子，要求当原始认知输入一致时，每个转换后的认知输入从其源世界来看是可信的；二是极大一致性保持更新算子，额外要求所选代理是原始认知输入可信后承中信息最丰富的。最后，我们证明所提框架具有通用性：可信度受限的信念更新可作为特例被恢复，而当移除可信度限制且转换函数取恒等函数时，会得到Katsuno-Mendelzon信念更新。这些结果表明，该框架提供了统一且表达力更强的信念更新说明，涵盖已有方法同时支持依赖源的选择性接受。

英文摘要

Belief update concerns changes in an agent's beliefs induced by changes in the underlying world. Standard Katsuno-Mendelzon update assumes that an epistemic input can be incorporated from every initially possible world, whereas credibility-limited belief update restricts, for each source world, the successor worlds regarded as credible or reachable. Nevertheless, existing credibility-limited approaches treat the epistemic input as an indivisible whole, and therefore cannot represent cases in which only part of a compound epistemic input can be realized. We introduce selective credibility-limited belief update, in which the epistemic input is transformed, relative to each source world, into a weaker proxy before the credibility-limited transition is performed. We provide semantic and axiomatic characterizations of the resulting class of update operators. We then identify two well-behaved sub-classes: consistency-preserving update operators, which require every transformed epistemic input to be credible from its source world whenever the original epistemic input is consistent, and maximal consistency-preserving update operators, which additionally require the selected proxy to be maximally informative among the credible consequences of the original epistemic input. Finally, we establish the generality of the proposed framework by showing that credibility-limited belief update is recovered as a special case, while Katsuno-Mendelzon belief update emerges when credibility restrictions are removed and the transformation functions are taken to be identities. These results demonstrate that the framework provides a unified and strictly more expressive account of belief update, encompassing established approaches while supporting source-dependent selective acceptance.


2607.28516 2026-07-31 cs.CV 新提交

Beyond Frame Selection: Generative Latent Evidence Aggregation for Long-Video Understanding

超越帧选择：用于长视频理解的生成式潜在证据聚合

Bowen Liu, Shuning Wang, Xinpeng Ding, Zhiheng Wu, Bodong Du, Xiaomeng Li

机构 * The Hong Kong University of Science and Technology（香港科技大学）； Baidu Inc.（百度公司）； Alibaba Group（阿里巴巴集团）； Xidian University（西安电子科技大学）

AI总结该研究针对长视频理解的帧选择局限，提出GenEvA框架，通过查询条件化分布聚合跨帧潜在证据，在四个基准和两个视频多模态大语言模型主干上显著提升性能且开销极低。


AI中文摘要

长视频理解通常会将视频压缩为少量帧或视觉标记以生成答案。现有紧凑流水线聚焦于保留相关视觉内容作为显式证据，但证据的可用性并不意味着能整合不同时刻的互补线索来回答问题。本文的核心思路是在生成前将选定帧组织为与查询相关的跨帧证据，我们将这一后选择阶段形式化为潜在证据接口，并实例化为GenEvA（Generative Latent Evidence Aggregation，生成式潜在证据聚合），一种分布引导的潜在证据聚合框架。具体而言，GenEvA使用查询条件化的证据分布将聚合聚焦于相关帧，从各帧的特定信息中形成紧凑的跨帧潜在证据；由于跨帧整合并非始终必要，同一分布会决定是否插入这一潜在补充。在四个基准和两个视频多模态大语言模型（Video-MLLM）主干上，GenEvA始终优于匹配帧的基线方法：在8帧设置下，它将四个基准的LLaVA-Video平均性能提升5.2个百分点，Qwen2.5-VL在LVBench上的准确率提升10.1个百分点；这些增益仅需0.11%至0.40%的平均视频标记开销，进一步分析显示其具有任务感知分配特性，并受益于自适应证据调用（Adaptive Evidence Invocation）。

英文摘要

Long-video understanding commonly compresses videos into a small set of frames or visual tokens for answer generation. Existing compact pipelines focus on retaining relevant visual content as explicit evidence. Yet making evidence available does not ensure that complementary cues across moments are integrated for answering. Our key idea is to organize selected frames into query-relevant cross-frame evidence before generation. We formulate this post-selection stage as a latent evidence interface and instantiate it with GenEvA (Generative Latent Evidence Aggregation), a distribution-guided latent evidence aggregation framework. Specifically, GenEvA uses a query-conditioned evidence distribution to focus aggregation on relevant frames, forming compact cross-frame latent evidence from their frame-specific information. Since cross-frame integration is not always needed, the same distribution determines whether to insert this latent complement. Across four benchmarks and two Video-MLLM backbones, GenEvA consistently improves matched-frame baselines. At 8 frames, it raises the four-benchmark LLaVA-Video average by $+5.2$ points and Qwen2.5-VL accuracy on LVBench by $+10.1$ points. These gains require only $0.11\%$ to $0.40\%$ average video-token overhead; analyses further show task-aware allocation and benefits from Adaptive Evidence Invocation.

2607.28513 2026-07-31 cs.CL New submission

Creative Transformation in Literary Texts: Modelling Change Across Representational Levels

Ioana-Roxana Boriceanu, Liviu P. Dinu

Affiliations: Human Language Technologies Research Center

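AI summary: Drawing on theories of imitation, the study builds a multi-level framework that quantifies how literary texts are transformed along lexical, semantic, and other dimensions. It shows where different text pairs preserve structure and where they diverge, offering a quantitative way to characterize imitation and creative divergence in literature.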

2607.28497 2026-07-31 cs.LG cs.CY cs.GT New submission

The Role of Causality in Algorithmic Recourse

Srikanth Avasarala, Varun Gupta, Shahin Jabbari, Saber Salehkaleybar, Juba Ziani

Affiliations: Georgia Institute of Technology; Vector Institute; Drexel University; Leiden University

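AI summary: Algorithmic recourse methods usually aim only to flip a model's prediction. To address this shortcoming, the study proposes a causal performative framework that models how recourse actions propagate causally. Experiments show it outperforms standard empirical risk minimization and reduces the need for repeated model retraining.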

Abstract

Algorithmic recourse aims to provide individuals with actionable changes to improve their predicted outcomes in high-stakes classification settings, such as loan and mortgage applications. However, most existing approaches focus only on flipping a model's prediction, without accounting for whether the recommended changes lead to genuine improvement in an individual's true qualifications or merely enable strategic gaming of the classifier. Consequently, deployed recourse policies can induce behavioral responses that degrade predictive accuracy and become ineffective after model retraining. In this work, we formalize this failure mode through a causal performative framework for recourse. We model how recourse actions propagate through a structural causal model, capturing interactions among features as well as their effect on the true label. These causal responses induce a non-convex optimization problem, even under standard convex losses. We characterize conditions under which performatively stable solutions exist and can be efficiently computed via simple iterative dynamics. Our analysis reveals that recourse policies that ignore causal structure can induce large, misaligned behavioral responses, whereas causal recourse leads to stable equilibria that reduce incentives for gaming. Experiments on both semi-synthetic and real credit datasets demonstrate that our approach consistently outperforms standard empirical risk minimization while reducing the need for repeated model retraining to accommodate distribution shifts caused by strategic agent behavior.
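
To make the phrase "recourse actions propagate through a structural causal model" concrete, here is a toy sketch with a two-feature linear model. All of it is assumed for illustration and none of it comes from the paper: the features `income` and `savings`, the structural equation, the coefficient `beta`, the classifier weights, and the function names are hypothetical.

```python
def propagate(income, u_savings, beta=0.5):
    """Toy structural equation: savings = beta * income + u_savings."""
    return income, beta * income + u_savings

def score(income, savings, w=(1.0, 2.0), bias=-10.0):
    """Linear classifier score; the applicant is accepted when score >= 0."""
    return w[0] * income + w[1] * savings + bias

def min_income_increase(income, u_savings, causal,
                        beta=0.5, w=(1.0, 2.0), bias=-10.0):
    """Smallest income increase that brings the score up to 0.

    causal=False treats savings as fixed when income changes.
    causal=True lets the income change flow into savings via the SCM.
    """
    _, savings = propagate(income, u_savings, beta)
    gap = -score(income, savings, w, bias)
    if gap <= 0:
        return 0.0
    gain_per_unit = w[0] + (w[1] * beta if causal else 0.0)
    return gap / gain_per_unit

income, u_savings = 2.0, 1.0
naive = min_income_increase(income, u_savings, causal=False)
aware = min_income_increase(income, u_savings, causal=True)
print(naive, aware)

# Check the causal recommendation by re-running the structural equation.
new_income, new_savings = propagate(income + aware, u_savings)
print(score(new_income, new_savings))
```

With these toy numbers the recommendation that ignores the causal structure asks for twice the income increase, because it misses the downstream effect on savings. The sketch covers only the propagation step; it does not model strategic gaming, retraining, or performative stability.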

2607.28496 2026-07-31 cs.CL New submission

Beyond Sentiment: Structured Information Extraction from Financial News

Daohan Zhu, Sitong Ge, Ruofei Wang, Honggu Chen, Yubo Hou, Tao Wan, Zengchang Qin

Affiliations: School of ASEE, Beihang University; School of BME, Beihang University; CAIR and CECS, VinUniversity

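AI summary: Financial sentiment analysis compresses each news item into a single polarity score. To address this limitation, the study uses LLaMA-3.1-70B to extract structured semantic features from financial news and shows that combining sentiment with these structured features improves stock prediction, opening a new direction for multi-dimensional financial NLP.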

Abstract

Financial sentiment analysis has become a standard component in news-driven stock prediction, yet it reduces rich, multi-dimensional news articles to a single polarity score. We hypothesize that financial news encodes multiple orthogonal information dimensions (event type, impact scope, temporal horizon, and semantic confidence) that sentiment alone cannot capture, and that these dimensions carry independent predictive value. To test this hypothesis, we propose a structured information extraction framework that leverages LLaMA-3.1-70B to extract six semantic dimensions from financial news. Through large-scale experiments on 41,618 news-stock pairs from the FNSPID dataset, we find that (i) FinBERT sentiment features exhibit strong predictive power under nonlinear models (F1=0.576) but substantially weaker performance under linear models (F1=0.230), revealing a highly nonlinear sentiment-return relationship; (ii) LLM-extracted structured features, while individually weaker, capture information orthogonal to sentiment, as evidenced by a 53.5% systematic disagreement rate between the two approaches; and (iii) combining both signal sources yields F1=0.600, significantly outperforming either alone (p < 0.0001), with consistent improvements across all seven event types. Ablation experiments confirm that non-sentiment structural dimensions (event type, impact subject, time horizon, confidence) independently contribute ΔF1 = +0.019 beyond FinBERT alone. Feature importance analysis reveals balanced contributions from all six extracted dimensions (14-21%), demonstrating that compressing news into a single sentiment score incurs substantial information loss. Our results suggest that the sentiment-semantics decoupling in financial text is systematic and exploitable, opening a new direction for multi-dimensional financial NLP.
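
The two headline quantities in this abstract, F1 and the disagreement rate between two predictors, can be computed as in the sketch below. The definitions are assumptions: the abstract does not say whether F1 is binary or averaged over classes, nor exactly how disagreement is counted, so this uses binary F1 and the fraction of items on which two prediction lists differ. The labels and predictions are made-up toy data, not results from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def disagreement_rate(pred_a, pred_b):
    """Fraction of items on which two predictors give different labels."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have the same length")
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)

# Toy data only: 1 = price goes up, 0 = price goes down.
y_true = [1, 0, 1, 1, 0]
sentiment_pred = [1, 0, 0, 1, 1]   # hypothetical sentiment-based predictor
structured_pred = [1, 1, 1, 0, 0]  # hypothetical structured-feature predictor

print(f1_score(y_true, sentiment_pred))
print(disagreement_rate(sentiment_pred, structured_pred))
```

A high disagreement rate between two predictors of similar accuracy is what makes combining them plausible, since their errors fall on different items.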

2607.28495 2026-07-31 cs.LG cs.CL New submission

Stage-Replay Divergence Follows the KV Cache: Fixed-Prefix Precision Controls and Bidirectional Cache Transplantation

Alexander Boesgaard Lorup

Affiliations: Openhagen

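AI summary: Working with a Qwen2.5-derived system, the study shows experimentally that stage-replay divergence is carried by the KV cache. Under a fixed prefix, numerical precision changes whether the divergence appears, and bidirectional cache transplantation reproduces the divergent trajectories. Exact-token replay can therefore be repeatable without preserving live-state fidelity.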

Comments 15 pages, 1 figure, 6 tables. Reproducibility artifacts (frozen manifests, token IDs, per-item scores, analysis harnesses) described in Section 3.9

Abstract

Stage-replay diagnostics reconstruct intermediate token prefixes and treat fresh-prefill continuation as continuation from the decoder state that originally reached the prefix. We audit that assumption at a whole reasoning-stage boundary in a Qwen2.5-derived system. A matched 200-item experiment compares retained live cache with one-shot prefill of identical integer tokens and places an exact replica on both sides. In BF16, replicas remain exact while the constructions differ on 166 suffixes and 20 correctness labels; the accuracy difference is only one point (paired 95% CI [-3.5, +5.5]). A fixed-prefix 2x2 holds all 200 token states constant while crossing construction and precision. The BF16 disagreements recur, whereas FP32 produces no decoded disagreement (95% Wilson upper bound 1.88%). A prospective bridge makes token-by-token incremental and retained live caches bit-exact on 12/12 rows; an all-200 saved-ledger audit reproduces every retained trajectory and comparison fingerprint. Bidirectional transplantation of all 48 key/value layers makes every tested divergent continuation follow its cache donor, both on a selected set at the primary checkpoint (24/24) and an outcome-blind replication at a later checkpoint (43/43). Exact-token replay can therefore be repeatable without preserving live-state fidelity. On the tested states, boundary K/V cache is a causally sufficient carrier of the divergent trajectory, while numerical precision moderates its behavioral expression.
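
The "95% Wilson upper bound 1.88%" for zero decoded disagreements in 200 items can be reproduced with the standard Wilson score interval. The sketch below assumes the usual two-sided 95% interval with z ≈ 1.96; the abstract does not state which variant was used, and the function name is illustrative.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (two-sided, z-based)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# FP32 arm: 0 decoded disagreements out of 200 items.
lower, upper = wilson_interval(0, 200)
print(f"{upper * 100:.2f}%")  # → 1.88%
```

With zero observed events the upper bound simplifies to z²/(n + z²), so the bound depends only on the sample size: 200 clean items cannot rule out a true disagreement rate of up to roughly 1.9%.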