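arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.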

2605.30268 2026-05-29 cs.CV cs.AI

PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions

Omer Benishu, Gal Fiebelman, Sagie Benaim

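AI summary: Proposes the PhyGenHOI framework, which combines a motion diffusion model with the Material Point Method and uses a windowed attraction loss, contact-driven re-simulation, and a masked video-SDS objective to generate physically consistent, visually realistic dynamic 4D human-object interaction scenes.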

Abstract

We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and target object represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through actions, such as punching or kicking, in accordance with a given input text. To this end, we introduce PhyGenHOI, a novel framework that couples generative human motion with an explicit physical object simulation. We model the human as a semantic agent driven by a Motion Diffusion Model (MDM) and the object as a physical agent simulated via the Material Point Method (MPM), utilizing 3D Gaussians as a unified, differentiable representation. We supervise their interaction through three coupled mechanisms: (1) A Windowed Attraction Loss that temporally synchronizes generative motion to intercept the object; (2) A Contact-Driven Re-simulation step that triggers physically consistent momentum transfer upon impact; and (3) A Masked Video-SDS objective that injects video-based priors to enhance contact fidelity. Experiments show PhyGenHOI generates physically consistent 4D HOI across diverse actions, humans, and objects, outperforming baselines. Project page and videos: https://omerbenishu.github.io/PhyGenHOI/

2605.30265 2026-05-29 cs.CV cs.CL

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang

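AI summary: Targeting the "carrier sensitivity" problem, in which vision-language models degrade under modality substitution, the paper proposes Local Modality Substitution (LoMo), a data curation paradigm that dynamically renders text spans as images to train cross-modal representational invariance, significantly improving multimodal reasoning and fusion.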

Abstract

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundational models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.

2605.30263 2026-05-29 cs.CV

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Min Zhao, Hongzhou Zhu, Bokai Yan, Zihan Zhou, Yimin Chen, Wenqiang Sun, Kaiwen Zheng, Guande He, Xiao Yang, Chongxuan Li, Fan Bao, Jun Zhu

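AI summary: Proposes minWM, a full-stack open-source framework that uses the Causal Forcing / Causal Forcing++ pipeline to convert bidirectional video diffusion models into controllable, low-latency autoregressive world models, with support for camera control and multiple backbone architectures.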

Abstract

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible: we instantiate it on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, training recipes, and latency targets. Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. We hope minWM serves as a reproducible and extensible recipe for building and adapting real-time interactive video world models. Project page: https://github.com/shengshu-ai/minWM

2605.30260 2026-05-29 cs.CL cs.AI cs.CV cs.LG

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang, Hui Xue, Ningyu Zhang

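AI summary: Proposes the Parametric Memory Law, a power law relating loss reduction under LoRA fine-tuning to effective parameters and sequence length, and builds on it to design MemFT, an optimization strategy that improves memory fidelity and efficiency.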

Comments: Ongoing work

Abstract

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically quantify exact parametric memory. We introduce the Parametric Memory Law, a robust power law linking loss reduction $\Delta L$ to effective parameters and sequence length. At the token level, fine-grained analysis reveals a deterministic phase transition, demonstrating that a prediction probability of p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. Driven by these insights, we introduce MemFT, a threshold-guided optimization strategy that dynamically redistributes the training budget toward sub-threshold tokens. Empirical evaluations demonstrate that MemFT can enhance memory fidelity and efficiency. Code will be released at https://github.com/zjunlp/ParametricMemoryLaw.
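The sufficiency of p > 0.5 follows from the fact that next-token probabilities sum to 1: at most one token can exceed 0.5, and that token is necessarily the argmax that greedy decoding selects. The sketch below illustrates this on toy distributions, assuming each step's probabilities are computed with the correct prefix. The helper that lists sub-threshold positions is a hypothetical illustration of what a threshold-guided strategy might prioritize; none of the names or numbers come from the authors' code.

```python
def greedy_token(probs):
    """Index of the most probable token, i.e. what greedy decoding emits."""
    return max(range(len(probs)), key=lambda i: probs[i])


def recall_guaranteed(step_probs, target_ids, threshold=0.5):
    """True if every target token exceeds the threshold at its own step."""
    return all(probs[t] > threshold for probs, t in zip(step_probs, target_ids))


def sub_threshold_positions(step_probs, target_ids, threshold=0.5):
    """Positions whose target token is at or below the threshold (hypothetical helper)."""
    return [i for i, (probs, t) in enumerate(zip(step_probs, target_ids))
            if probs[t] <= threshold]


# Toy next-token distributions over a 3-token vocabulary (made-up values).
step_probs = [
    [0.70, 0.20, 0.10],  # target 0 has p = 0.70
    [0.10, 0.60, 0.30],  # target 1 has p = 0.60
    [0.40, 0.35, 0.25],  # target 1 has p = 0.35, below the threshold
]
target_ids = [0, 1, 1]

print(recall_guaranteed(step_probs[:2], target_ids[:2]))
print(sub_threshold_positions(step_probs, target_ids))
```

In this toy sequence the first two steps are guaranteed under greedy decoding, while the third target token is sub-threshold and is in fact not the greedy choice.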

2605.30257 2026-05-29 cs.CV

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

Ciara Rowles, Reshinth Adithyan, Nikhil Pinnaparaju, Vikram Voleti, Mark Boss

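AI summary: Proposes the Stable-Layers framework, which fine-tunes a pretrained layer decomposition model with reinforcement learning (Flow-GRPO) and vision-language model (VLM) scoring, without paired supervision; it addresses the lack of variance in the scoring signal and improves layer separation quality and reconstruction accuracy.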

Comments: 25 pages, 8 figures, 4 tables. Project page: https://stability-ai.github.io/stable-layers.github.io/

Abstract

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-language model (VLM). Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation, sampling multiple candidate decompositions per image, scoring them with a VLM, and optimising the policy from group-relative advantages. The key challenge lies in designing a reliable reward signal: VLMs scoring samples in isolation tend to compress their judgements into a narrow band, leaving GRPO with little within-group variance to learn from. We address this with a two-stage evaluation pipeline that pairs structured per-sample scoring across five edit-centric criteria with a grid-based calibration step in which the VLM re-scores all candidates side-by-side. Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error on the Crello dataset compared to the base model.
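The variance problem can be illustrated with the group-relative advantage used in GRPO-style methods, where each candidate's reward is normalized against the other candidates sampled for the same image. The sketch below uses the commonly cited mean and standard-deviation normalization; the scores, the `eps` constant, and the function name are illustrative assumptions rather than details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Isolated scoring: the judge gives every candidate the same score.
compressed = group_relative_advantages([7, 7, 7, 7])

# Side-by-side re-scoring (hypothetical values) spreads the candidates out.
calibrated = group_relative_advantages([3, 5, 7, 9])

print(compressed)
print(calibrated)
```

With identical scores every advantage is zero, so the update carries no preference between candidates; spread-out scores give a clear ranking to learn from.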

2605.30256 2026-05-29 cs.CV cs.CL cs.HC

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello

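AI summary: Proposes VideoFDB, the first benchmark for full-duplex audio-visual-to-audio-visual (AV2AV) conversation. Using 237 real video clips, a taxonomy of perception and generation behaviors, and a rubric-based LM-as-judge framework, it systematically evaluates agents on nonverbal conversational dynamics and finds failures such as captioning collapse and visual-stream ignorance in existing systems.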

Comments: Project page: https://research.nvidia.com/labs/amri/projects/video-fdb/

Abstract

Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents.

2605.30251 2026-05-29 cs.CL cs.AI

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu, Xing Shi, Jingtao Xu, Zhihui Li, Yawei Luo

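AI summary: Proposes Canonical-Context On-Policy Distillation (CCOPD), a teacher-student method that aligns a model's behavior under a full prompt with its behavior when information is revealed gradually, reducing self-anchored drift. Trained on multi-turn math conversations, it yields a 32% average improvement on raw-sharded tasks.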

Abstract

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.

2605.30250 2026-05-29 cs.CV cs.GR

Ambient-robust Inverse Rendering using Active RGB-NIR Imaging

Hoon-Gyu Chung, Jinnyeong Kim, Hyunwoo Kang, Seung-Hwan Baek

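AI summary: Proposes a three-stage inverse rendering method based on active RGB-NIR imaging that combines multi-view RGB images under ambient light with active NIR flash images to reconstruct geometry and reflectance robustly under changing external illumination.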

Comments: 11 pages

Abstract

Inverse rendering aims to reconstruct geometry and reflectance of objects from images. Despite recent progress, existing methods often produce inaccurate reconstructions that are sensitive to ambient illumination conditions. Here we introduce an ambient-robust inverse rendering method enabled by active RGB-NIR imaging. Our key insight is to leverage near-infrared (NIR) flash illumination, which is imperceptible to human observers, to obtain stable point-light shading that is largely invariant to ambient illumination. By using multi-view RGB images illuminated by ambient light and NIR images acquired with active NIR flash illumination, we reconstruct accurate geometry and reflectance by exploiting the complementary benefits of RGB and NIR images via a three-stage inverse rendering method. To enable dense multi-view acquisition, we develop an active imaging system equipped with an RGB-NIR camera and an NIR flash mounted on a mobile base. Using this system, we collect the first multi-view RGB-NIR inverse rendering dataset captured under multiple ambient illumination conditions. Experiments demonstrate that our method outperforms prior approaches, achieving accurate geometry and reflectance estimation across multiple ambient lighting scenarios.

2605.30247 2026-05-29 cs.LG cs.MM

OOD-GraphLLM: Graph Large Language Model for Out-of-Distribution Generalized Drug Synergy Prediction

Xin Wang, Linxin Xiao, Yang Yao, Wenwu Zhu

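AI summary: To handle the out-of-distribution shifts that novel compounds cause in drug synergy prediction, the paper proposes the OOD-GraphLLM framework, which jointly optimizes molecular graph representations and biomedical semantic language representations for accurate prediction.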

Comments: 12 pages, 9 figures, ACM KDD 2026

Abstract

Drug synergy prediction (DSP) aims to identify efficacious drug combinations under various cellular contexts with different targets. However, the continual emergence of novel compounds results in variations in molecular scaffolds and sizes, causing drug synergy data to exhibit out-of-distribution (O.O.D.) shifts with respect to topological structure. Existing works rely on the in-distribution (I.D.) assumption and fail to handle these O.O.D. shifts. To solve this problem, we study out-of-distribution generalized drug synergy prediction through a graph large language model for the first time. Nevertheless, O.O.D. generalized DSP is highly non-trivial, posing several challenges: i) how to discover structurally relevant and irrelevant molecular representations with respect to cell targets; ii) how to find the optimal graph neural architectures that accurately calculate molecular representations; and iii) how to jointly leverage molecular structural and semantic information in LLMs. To address these challenges, we propose OOD-GraphLLM, a novel graph LLM framework that accurately predicts drug synergy under O.O.D. settings by jointly optimizing molecular graph representations and biomedical semantic language representations in a unified manner. Furthermore, we finetune DrugSyn-LLM, a biomedical LLM, and employ a retrieval-augmented biomedical instruction tuning strategy to align molecular topological information and molecular semantic information with language-based reasoning for O.O.D. generalized DSP. Both the source code (https://github.com/EkkoXiao/Bio-GraphLLM) and the released model (https://mn.cs.tsinghua.edu.cn/bio-graphllm/) are publicly available; users can download model resources and use the system interactively through a web interface.

2605.30245 2026-05-29 cs.CL

Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning

Shaojie Wang, Liang Zhang

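AI summary: Proposes the PPC framework, which adds an explicit problem-understanding stage (the preplan) to close the paradigm gap between "how to solve" and "what to solve" in existing plan-based reasoning methods, achieving the best results on multiple mathematical reasoning benchmarks.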

Abstract

Current plan-based reasoning methods improve large language models (LLMs) by inserting a planning stage before execution, giving rise to the question $\rightarrow$ plan $\rightarrow$ CoT paradigm. While effective, a closer examination reveals an inherent paradigm-level gap: both the planning and its execution stages decide how to solve a problem, while the prior question of what to solve (recognizing the problem type, the applicable tools, and the foreseeable pitfalls) remains entirely implicit. To bridge this gap, we propose PPC (Preplan-Plan-CoT), a framework that introduces an explicit problem-understanding stage, the preplan, yielding a new question $\rightarrow$ preplan $\rightarrow$ plan $\rightarrow$ CoT paradigm. Realizing this paradigm requires safeguarding the conceptual integrity of preplan at both ends. Specifically, we design a three-stage synthesis pipeline with a spoiler-score detector that filters out leakage and spoiler failures to build clean preplan supervision, and a composite GRPO reward enforces that the generated plan genuinely follows from the preplan. Experiments across four backbones and five mathematical reasoning benchmarks show that PPC achieves the best results on 39 of 40 metrics, improving maj@16 and pass@16 by +2.23 and +3.06 over the strongest baseline without introducing additional inference token overhead.
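As a reading aid for the reported metrics: pass@k is usually taken to mean that at least one of k sampled answers is correct, and maj@k that the most frequent of the k answers is correct. The sketch below computes both for a single problem under those common definitions. The paper may use a different estimator (for example an unbiased pass@k estimate), and the sample answers are invented.

```python
from collections import Counter


def pass_at_k(answers, reference):
    """True if any of the sampled answers matches the reference."""
    return any(a == reference for a in answers)


def maj_at_k(answers, reference):
    """True if the most frequent sampled answer matches the reference."""
    majority, _count = Counter(answers).most_common(1)[0]
    return majority == reference


# Five hypothetical sampled final answers for one problem.
samples = ["12", "15", "12", "9", "12"]

print(pass_at_k(samples, "12"), maj_at_k(samples, "12"))
```

Averaging these per-problem booleans over a benchmark gives the percentage-style scores; maj@k is the stricter of the two, since a correct answer that appears only once counts for pass@k but not for maj@k.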

2605.30244 2026-05-29 cs.CV cs.AI

Reinforcement Learning with Robust Rubric Rewards

Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu

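AI summary: For partially verifiable vision-language tasks, proposes RLR^3, which uses two rubric execution paths, a minimal exposure strategy, and hierarchical aggregation to extend verification from the task level to the criterion level, with an average gain of 4.7 points across 15 benchmarks.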

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on the execution accuracy during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), extending RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through two execution paths: an LLM-as-an-extractor paired with a deterministic verifier, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the official instruct-to-thinking model gap. Controlled audits confirm our deterministic verification and minimal exposure significantly reduce exploitable false positives.

2605.30241 2026-05-29 cs.CL cs.CY cs.SI

CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild

Sahajpreet Singh, Insyirah Mujtahid, Min-Yen Kan, Kokil Jaidka

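AI summary: Proposes the CommunityFact benchmark, which uses multilingual, multi-domain claims, LLM evaluation, and Community Notes analysis to reveal the difficulty of closed-input verification, the gains from web search, and biases in evidence selection strategies.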

Abstract

Misinformation verification increasingly occurs in public, fast-moving, and multilingual online settings, where static benchmarks provide an incomplete measure of model reliability. We introduce CommunityFact, a refreshable benchmark for misinformation detection in the wild, with three major goals: coverage, granularity, and redistributability. This release contains 15,992 standalone claims across five languages and two domains. We evaluate ten LLMs under varying inference-time capabilities, including thinking and web search. Our results show that closed-input verification remains challenging, web access yields the largest gains, and web-enabled LLMs' source-selection policies are systematically misaligned with the sources human Community Notes raters converge on, a gap that closes through model-specific mechanisms of retrieval expansion or pruning. We further find substantial variation across language-domain slices and across the evidence ecosystems used by web-enabled systems. Beyond evaluation, CommunityFact positions Community Notes as a training signal for claim-conditioned source suggesters that could improve factual verification on novel claims.

2605.30239 2026-05-29 cs.CV

SAM3D-Phys: Towards Multi-Object Interactive Simulation in Real World

Xin Dong, Weijian Deng, Lihan Zhang, Tianru Dai, Wenfeng Deng, Yansong Tang

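AI summary: Proposes the SAM3D-Phys framework, which combines scene reconstruction with SAM3D generative priors to recover complete, simulatable object geometry from partial observations, and achieves scene consistency through physics-constrained optimization and mask-guided appearance distillation, supporting simultaneous interactive simulation of multiple objects.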

Comments: 23 pages, 11 figures

Abstract

This work addresses the problem of recovering complete, simulatable object geometry from reconstructed real-world scenes, enabling physics-based interaction with objects embedded in the scene. While modern multi-view reconstruction methods can produce visually accurate environments, objects are often incomplete due to occlusions and limited observations, making them unsuitable for physics simulation. To address this limitation, we propose SAM3D-Phys, a framework that integrates scene reconstruction with generative 3D priors of SAM3D to recover physically simulatable objects. Our approach first reconstructs the scene from multi-view images to obtain scene geometry and partial observations of objects. We then leverage SAM3D to infer complete object geometry from these partial observations. To ensure that the recovered objects remain consistent with the reconstructed scene, we restore scene-consistent object states through two complementary strategies: a physics-constrained spatial optimization algorithm that iteratively aligns the recovered object to its original location, and a mask-guided appearance distillation module that refines texture fidelity based on the observed images. By recovering complete object geometry and restoring its pose and appearance within the scene, SAM3D-Phys produces clean object representations suitable for physics-based simulation, enabling simultaneous and physically consistent interactive simulation of multiple objects within a reconstructed scene. Project page: https://chnxindong.github.io/sam3d-phys/

2605.30235 2026-05-29 cs.CV

BullingerDB: A Dataset for Handwritten Text Recognition and Writer Retrieval

Marco Peer, Anna Scius-Bertrand, Patricia Scheurer, Andreas Fischer

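AI summary: Presents BullingerDB, a large-scale historical document dataset based on the correspondence of Heinrich Bullinger, for handwritten text recognition and writer retrieval, and introduces a temporal nDCG metric to evaluate retrieval performance.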

Comments: Accepted for presentation at ICDAR2026. Dataset available via zenodo

Abstract

We present BullingerDB, a large-scale benchmark dataset for historical document analysis based on the correspondence of Heinrich Bullinger (1504-1575). The corpus comprises 20,898 pages and 499,222 text lines written by 796 writers over six decades, featuring stylistic variation, multilingual content (mostly Latin and Early New High German) as well as meta-information such as writer identity and time. We evaluate BullingerDB on text recognition and writer retrieval. TrOCR, the best performing model, achieves a CER of 9.1%. For writer retrieval, we introduce a temporal nDCG metric to assess time-aware retrieval. While temporally coherent retrieval is achievable, mAP (78.3%) scores indicate challenges due to long-term stylistic variation. With BullingerDB, we aim to establish a new benchmark for multilingual historical text recognition and temporally-aware writer analysis.
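For readers unfamiliar with the metric, character error rate (CER) is conventionally the character-level edit distance between a predicted transcription and the reference, divided by the reference length. A minimal sketch of that standard definition follows. The evaluation protocol actually used for BullingerDB (text normalization, aggregation over lines) is not described in the abstract, and the example strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("kitten", "sitting"))
```

Under this definition a CER of 9.1% corresponds to roughly one character edit per eleven reference characters.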

2605.30233 2026-05-29 cs.CL cs.AI

Do Language Models Track Entities Across State Changes?

Zilu Tang, Qiao Zhao, Gabriel Franco, Derry Wijaya, Aaron Mueller, Sebastian Schuster, Najoung Kim

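AI summary: Studies the entity tracking mechanisms language models use for multi-step state-changing operations expressed in natural language, finding a non-incremental strategy that aggregates information in parallel at the last token, and revealing a global suppression tag for the REMOVE operation along with the failure modes it causes.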

Comments: ICML main conference 2026, 9 pages

英文摘要

Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of work investigates how transformer language models (LMs) solve entity binding $\textit{without}$ state changes. However, there is limited understanding of how non-toy LMs address ET problems of realistic difficulties expressed in natural language. To this end, we investigate the mechanisms underlying ET in more complex scenarios featuring multiple state-changing operations. We find that LMs do not incrementally track world states across tokens or query-relevant states across layers, but simply aggregate relevant information in parallel at the last token when the query becomes evident. We further investigate mechanisms of individual operations ($\texttt{PUT}$, $\texttt{REMOVE}$, $\texttt{MOVE}$) to characterize this non-incremental ET mechanism. Surprisingly, LMs implement the $\texttt{REMOVE}$ operation with a fragile global suppression tag; this global removal mechanism predicts various failure modes that we confirm behaviorally. We provide a mechanistic solution of nullifying this tag to partially address this issue. Overall, our findings reveal that LMs solve a fundamentally sequential task using a non-sequential strategy. More broadly, our work illustrates how behavioral and mechanistic analyses can fruitfully interact. Behavioral results inform mechanistic hypotheses, and insights from mechanistic analyses help build stronger behavioral evaluations by predicting failure modes missing from existing evaluations.
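For reference, the sequential strategy that the abstract contrasts with the models' behavior can be sketched as an explicit state tracker. The tuple encoding of the operations and the container names are assumptions made for illustration; only the operation names PUT, REMOVE and MOVE come from the abstract.

```python
def track_entities(operations):
    """Apply operations in order and return the final contents of each container."""
    state = {}
    for op in operations:
        kind = op[0]
        if kind == "PUT":        # ("PUT", item, container)
            _, item, box = op
            state.setdefault(box, set()).add(item)
        elif kind == "REMOVE":   # ("REMOVE", item, container)
            _, item, box = op
            state.setdefault(box, set()).discard(item)
        elif kind == "MOVE":     # ("MOVE", item, source, destination)
            _, item, src, dst = op
            state.setdefault(src, set()).discard(item)
            state.setdefault(dst, set()).add(item)
        else:
            raise ValueError(f"unknown operation: {kind}")
    return state
```

In this tracker a REMOVE only affects the named container, so an item removed from one container can still be placed in another later. A global suppression tag of the kind the abstract describes would plausibly struggle with exactly such cases.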

2605.30232 2026-05-29 cs.LG cs.CL

How's it going? Reinforcement learning in language models recruits a functional welfare axis

进展如何？语言模型中的强化学习招募了一个功能性福利轴

Andy Q Han, David J. Chalmers, Pavel Izmailov

AI总结本文通过迷宫环境实验，发现强化学习会招募语言模型中预先存在的功能性福利表征（即对系统目标达成程度的估计），从而广泛影响模型行为，且该表征在训练前已存在。

Comments: 81 pages, 43 figures, 32 tables

AI中文摘要

强化学习如何塑造语言模型的内部表征？我们提出证据表明，RL招募了一个预先存在的功能性福利表征：即对系统相对于其目标表现好坏程度的估计。我们在一个新颖的、语义中性的迷宫环境中训练了几个语言模型。然后，我们提取奖励和惩罚轨迹的概念向量，并在与迷宫环境无关的设置中评估这些向量。惩罚向量表现为负面福利的表征：它促进失败和不可能性标记，与负面情绪概念对齐，与目标达成呈负相关，并且通过它进行引导会引发负面自我报告、病理性回溯、拒绝和不确定性。正向奖励向量则表现为镜像，两者几乎反平行。这些效应在控制图块到奖励映射、规模、指令微调、RL训练算法、模型家族以及LoRA与全微调时都很稳健，并且当我们用监督微调替换RL时，这些效应在很大程度上仍然存在。重要的是，这些向量在模型经历迷宫训练之前就已经有效。结合这些效应也出现在仅预训练模型中的观察，我们因此认为，这个功能性福利轴先于后训练而存在：它是被后训练招募的，而不是由后训练创造的。虽然我们不对福利体验作任何断言，但该轴提供了一个例证，即极少量的奖励信号可以通过招募预先存在的类似福利的表征来广泛影响模型行为，这对可解释性、后训练动态和对齐具有启示意义。

英文摘要

How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories, and evaluate those vectors in settings unrelated to the maze environment. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, it negatively tracks goal-achievement, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust when controlling for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full-finetuning, and largely persist when we replace RL with supervised fine-tuning. Importantly, the vectors are effective in models before they have undergone maze training. Combined with observations that the effects also appear in pretrain-only models, we therefore argue that this functional welfare axis pre-exists post-training: it is recruited, rather than created, by post-training. While we make no claims about any experience of welfare, the axis offers a demonstration that minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.
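The abstract does not say how the concept vectors are extracted. One common choice, assumed here purely for illustration, is a difference of mean activations between a target set and a baseline set, with cosine similarity used to check whether two vectors are nearly antiparallel. All names and toy activations below are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_vector(target_acts, baseline_acts):
    """Difference of means: mean target activation minus mean baseline activation."""
    target, baseline = mean_vector(target_acts), mean_vector(baseline_acts)
    return [t - b for t, b in zip(target, baseline)]


def cosine(u, v):
    """Cosine similarity; values near -1 indicate antiparallel vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2-D activations, constructed so the two concept vectors are exactly opposite.
neutral = [[0.0, 1.0], [0.0, 1.0]]
reward_vec = concept_vector([[1.0, 2.0], [3.0, 2.0]], neutral)
punish_vec = concept_vector([[-1.0, 0.0], [-3.0, 0.0]], neutral)
similarity = cosine(reward_vec, punish_vec)
```

With real model activations the similarity would be an empirical quantity; the toy data is contrived only to show what "nearly antiparallel" means numerically.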

2605.30231 2026-05-29 cs.CV cs.AI

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

超越3D VQA：将3D空间先验注入视觉-语言模型以增强几何推理

Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao

AI总结提出GASP框架，通过将几何先验注入LLM的Transformer层，利用对比损失和深度一致性监督训练，显著提升VLM的3D空间推理能力，在多个基准上取得大幅提升。

Comments: CVPR 2026. Project page: https://danielchyeh.github.io/GASP/

AI中文摘要

视觉-语言模型（VLM）通常在鲁棒的3D空间推理方面存在困难。依赖于使用3D视觉问答（VQA）数据集进行微调的主流方法可能过度拟合数据集特定的偏差，而集成专门的3D视觉编码器往往不灵活且繁琐。在本文中，我们认为真正的空间理解应该源于学习基本的几何先验，而不仅仅是来自高级VQA监督。我们提出了GASP（几何感知空间先验），这是一个将这些先验直接注入LLM的Transformer层的框架。GASP采用一个小的对应头，作为跨所有层的深度监督信号，并使用一个双重目标进行训练，该目标利用大规模视频场景的真实几何：基于真实点对应的对比损失强制2D视图不变性，而深度一致性监督解决3D几何歧义。我们的分析首先提供了一个诊断，表明标准VLM的内部对应匹配精度非常低（通常低于5%）。然后我们证明，我们的训练显著改善了这种行为，将逐层峰值对应提升到70%以上，并保持超过85%的时间鲁棒性，而基线仍低于5%。这些内部改进转化为下游空间基准的显著提升，包括在All-Angles Bench上+18.2%，在VSI-Bench上+29.0%，所有这些都没有在任何3D VQA数据上进行训练。我们的发现表明，从基本几何先验中学习是实现具有更可靠3D空间推理的VLM的一条有前途且可推广的途径。

英文摘要

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
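The contrastive objective on point correspondences might look roughly like the sketch below. It assumes an InfoNCE-style loss with cosine similarity and a temperature, where `feats_a[i]` and `feats_b[i]` are features of the same 3D point seen in two views. The exact loss form, the temperature value and all names are assumptions, not details from the paper.

```python
import math


def info_nce(feats_a, feats_b, temperature=0.1):
    """Mean InfoNCE loss; row i of feats_a should match row i of feats_b."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    loss = 0.0
    for i, query in enumerate(a):
        # Scaled cosine similarity of this query against every candidate point.
        logits = [sum(x * y for x, y in zip(query, key)) / temperature for key in b]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]  # negative log-softmax of the true match
    return loss / len(a)
```

The loss is small when each point is most similar to its true counterpart and large when correspondences are confused, which is the behavior a view-invariance objective needs.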

2605.30230 2026-05-29 cs.CV

IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation

IP-Adapter 就够了：迈向免微调扩散模型的人脸说话视频生成

Hao Wu, Xiangyang Luo, Hao Wang, Jiawei Zhang, Yi Zhang, Jinwei Wang

AI总结提出一种免微调范式，利用预训练的 Stable Diffusion 和 IP-Adapter，结合三个无参数组件解决身份漂移、同步误差和时间不稳定问题，在唇同步精度和视觉保真度上超越现有方法。

AI中文摘要

随着扩散模型的快速发展，人脸说话视频生成取得了显著进展。然而，现有的基于扩散的方法仍然需要特定任务的微调和大规模音视频数据集，导致计算成本高昂，阻碍了扩散方法在学术界的可扩展性和可访问性。为了解决这个问题，我们提出了一种免微调范式，直接使用 Stable Diffusion 和 IP-Adapter 的预训练权重进行人脸说话视频生成。该骨干网络利用 IP-Adapter 的视觉嵌入能力，从预训练的 Stable Diffusion 中挖掘与嘴唇相关的语义。为了解决身份漂移、同步误差和时间不稳定的挑战，我们还设计了三个无训练参数组件：（1）结构器（Structurist），显式解耦并重新组合嘴唇和外观特征，以减轻身份漂移和外观失真；（2）结构控制器（Structure Controller），基于准单调运动趋势自适应细化嵌入，实现精确的唇同步；（3）噪声传感器（Noise Sensor），引入高斯先验来检测和抑制闪烁和抖动伪影，增强时间一致性。实验结果表明，我们的方法在唇同步精度（PCLD 至少提升 0.16）和视觉保真度（FID 至少提升 0.7）方面均优于现有最先进方法，建立了一种新颖的免微调扩散框架用于人脸说话视频生成。

英文摘要

With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder scalability and accessibility of diffusion-based approaches across the research community. To address this, we propose a fine-tuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter. This backbone leverages the visual embedding capability of IP-Adapter to mine lip-related semantics from the pretrained Stable Diffusion. To address the challenges of identity drift, synchronization errors, and temporal instability, we also design three trainable-parameter-free components: (1) the Structurist, which explicitly disentangles and reassembles lip and appearance features to mitigate identity drift and appearance distortion; (2) the Structure Controller, which adaptively refines embeddings based on quasi-monotonic motion trends for precise lip synchronization; and (3) the Noise Sensor, which introduces a Gaussian prior to detect and suppress flicker and jitter artifacts and enhance temporal consistency. Experimental results show that our method outperforms existing SOTA approaches in both lip-sync accuracy (at least 0.16 gain in PCLD) and visual fidelity (at least 0.7 improvement in FID), establishing a novel fine-tuning-free diffusion framework for talking face generation.

2605.30229 2026-05-29 cs.LG

Anti Mode-Collapse in Mean-Field Transformer via Auxiliary Variables

基于辅助变量的平均场Transformer抗模式坍缩

Masaaki Imaizumi, Masanori Koyama, Noboru Isobe, Kohei Hayashi

AI总结本研究利用平均场Transformer模型从理论上证明位置编码等辅助变量能防止自注意力机制的模式坍缩，并揭示其表示普适性与亚稳态性质。

Comments: 39 pages

AI中文摘要

我们使用基于平均场的Transformer模型从理论上研究辅助变量（如位置编码）如何防止自注意力机制的模式坍缩。近年来，由于平均场Transformer能够全面分析token交互，利用其分析自注意力机制性质的方法引起了广泛关注。然而，对该简单模型的分析表明，在长推理（即多层）过程中会出现模式坍缩，即token分布退化为单点，这与实际情况不符。本研究考察了该平均场Transformer模型，并证明引入辅助变量（如位置编码）可作为对抗理论模式坍缩的反作用力。具体而言，我们表明在理论框架中，能量最大化分布不会退化为单点，而是由辅助变量分布的推前（pushforward）刻画，从而避免集中于Dirac测度。我们的主要例子是位置编码和固定提示插入，它们被视为并行辅助变量机制。此外，我们证明位置编码和提示插入在极限情况下具有表示普适性，即推理的极限分布可以精确表示一大类分布。我们还分析了位置编码和亚稳态的几个关键性质，并通过数学实验验证了我们的理论结果。

英文摘要

We use a mean-field-based transformer model to theoretically investigate how auxiliary variables, such as positional encoding, prevent mode collapse of self-attention mechanisms. The use of mean-field transformers to analyze the properties of self-attention mechanisms has garnered significant attention in recent years due to their ability to comprehensively analyze token interactions. However, analysis of this simple model suggests that mode collapse, where token distributions degenerate to a single point, occurs during long inferences (i.e., many layers), indicating a discrepancy with reality. This study investigates this mean-field transformer model and demonstrates that the introduction of auxiliary variables, such as positional encoding, acts as a counterforce against theoretical mode collapse. Specifically, we show that in the theoretical scheme, the energy-maximizing distribution does not degenerate to a single point; instead, it is characterized by a pushforward of the auxiliary variable distribution, thereby avoiding concentration in the Dirac measure. Our main examples are the positional encoding and the fixed prompt insertion treated as a parallel auxiliary-variable mechanism. Furthermore, we demonstrate that positional encoding and prompt insertion possess universality of representation in the limit, meaning that the limit distribution of inference can exactly represent a wide class of distributions. We also analyze several key properties of positional encoding and metastability, and validate our theoretical results through mathematical experiments.

2605.30227 2026-05-29 cs.MA cs.AI

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

统一基于LLM的多智能体提示优化中的时间与结构信用分配

Wenwu Li, Yuran Song, Mingze Zhao, Bo Jin, Wenhao Li

AI总结提出通过时间信用（状态空间瓶颈识别关键轮次）和结构信用（固定角色策略隔离智能体贡献）分解误差信号，并利用离散言语化块坐标下降算法迭代优化角色提示和聚合协议，降低查询复杂度并提升性能。

Comments: 15 pages, 4 figures, 6 tables

AI中文摘要

虽然多智能体系统（MAS）通过协作交互使大型语言模型能够处理复杂推理任务，但由于计算图的离散、不可微性质以及全局监督信号的稀疏性，优化其动态仍然是一个严峻的挑战。现有的黑盒优化器难以将轨迹级别的失败归因于特定的局部组件，导致低效、高方差的探索。我们认为，可处理的MAS优化需要结构归纳偏差来解开误差信号。我们提出了时间和结构信用分配，它沿着两个轴分解目标：（i）时间信用，使用状态空间瓶颈识别关键轮次；（ii）结构信用，使用固定角色策略隔离智能体贡献。利用这些分解后的信号，我们引入了一种离散的、言语化的块坐标下降算法用于迭代优化。它不是不加区分的全局更新，而是在优化角色提示和聚合协议之间交替，使用LLM生成的“代理梯度”仅针对识别出的薄弱环节。在多种推理基准测试中，我们的方法在提高性能的同时显著降低了查询复杂度，为自我改进的MAS提供了一条有原则且可解释的路径。

英文摘要

While Multi-Agent Systems (MAS) empower Large Language Models to tackle complex reasoning tasks through collaborative interaction, optimizing their dynamics remains a formidable challenge due to the discrete, non-differentiable nature of the computation graph and the sparsity of global supervisory signals. Existing black-box optimizers struggle to attribute trajectory-level failure to specific local components, resulting in inefficient, high-variance exploration. We argue that tractable MAS optimization needs structural inductive biases to disentangle error signals. We propose temporal and structural credit assignment, which decomposes the objective along two axes: (i) temporal credit, using state-space bottlenecks to identify critical rounds, and (ii) structural credit, using stationary role policies to isolate agent contributions. Leveraging these decomposed signals, we introduce a discrete, verbalized block coordinate descent algorithm for iterative refinement. Rather than indiscriminate global updates, it alternates between optimizing role prompts and aggregation protocols, using LLM-generated "proxy gradients" to target only the identified weak links. Across diverse reasoning benchmarks, our approach substantially reduces query complexity while improving performance, providing a principled and interpretable path toward self-improving MAS.

2605.30220 2026-05-29 cs.LG

TriSearch: Learning to Optimize Triangulations via Bistellar Flips

TriSearch：通过双星翻转学习优化三角剖分

Yiran Wang, Guido Montúfar

AI总结提出基于强化学习的框架TriSearch，利用回路支撑的子三角剖分动作表示，通过双星翻转优化多面体三角剖分目标，实现零样本泛化到更大实例。

AI中文摘要

我们引入了TriSearch，这是一个强化学习框架，用于通过双星翻转优化多面体三角剖分上的目标。关键思想是一种回路支撑的子三角剖分动作表示：可行的翻转由其支撑回路和实现的局部子三角剖分编码，使得学习策略能够利用局部几何和组合特征对它们进行排序。这产生了一个维度无关的接口，并能够在不显式枚举整个三角剖分空间的情况下高效遍历翻转图。在3D和4D中实例化后，TriSearch从小的训练实例零样本泛化到具有指数级更大搜索空间的大型多面体。它在3D中的度量目标上达到了顶级性能，并且在4D中，在固定预算下，发现了比现有采样器更多的自反多面体的不同精细、正则、星形三角剖分，对应于Calabi-Yau三维流形。

英文摘要

We introduce TriSearch, a reinforcement learning framework for optimizing objectives over triangulations of a polytope via bistellar flips. The key idea is a circuit-supported subtriangulation action representation: feasible flips are encoded by their supporting circuit and realized local subtriangulation, enabling a learned policy to rank them using local geometric and combinatorial features. This yields a dimension-agnostic interface and enables efficient traversal of the flip graph without explicit enumeration of the full triangulation space. Instantiated in 3D and 4D, TriSearch generalizes zero-shot from small training instances to larger polytopes with exponentially larger search spaces. It achieves top performance on metric objectives in 3D and, in 4D, discovers more distinct Fine, Regular, Star triangulations of reflexive polytopes, corresponding to Calabi-Yau threefolds, than existing samplers under a fixed budget.

2605.30219 2026-05-29 cs.AI cs.CL cs.LG

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

模型何时应改变想法？大语言模型中的上下文信念管理

Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng

AI总结提出上下文信念管理（CBM）框架，通过引入BeliefTrack基准和信念状态奖励的强化学习，将大语言模型在长程交互中的信念更新失败率平均降低70.9%。

Comments: Work in progress

AI中文摘要

长程交互要求语言模型管理累积信息：何时更新状态、何时保持状态、以及忽略什么。我们将这一挑战研究为\textbf{上下文信念管理（CBM）}：在隔离任务无关噪声的同时，维护与形式证据对齐的预测信念状态。为了使CBM可测量，我们引入了BeliefTrack，一个涵盖规则发现和电路诊断的封闭世界基准，其中有限的信念空间和符号验证器支持精确的逐轮评估。BeliefTrack诊断三种失败：保持失败、更新失败和隔离失败。在多个大语言模型中，原始模型表现出严重的CBM失败，而显式的信念跟踪提示提供的改进有限。相比之下，使用信念状态奖励的强化学习平均将失败率降低了70.9%。进一步的探测揭示了这些失败背后的潜在信念状态动态，而表示级引导在两个任务上将失败率降低了46.1%。\footnote{代码即将在 https://github.com/zjunlp/CBM 发布。}

英文摘要

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as \textbf{Contextual Belief Management (CBM)}: maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief-tracking prompts provide limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9\% on average. Further probing reveals latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1\% across two tasks.\footnote{Code is coming soon at https://github.com/zjunlp/CBM.}

2605.30218 2026-05-29 cs.LG cs.PF

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

MarginGate: 用于批量不变LLM推理的稀疏边际触发验证

Kexin Chu, Yang Zhou, Wei Zhang

AI总结提出MarginGate方法，利用logit边际稀疏触发验证，仅对低边际步骤进行验证并修复，以低成本实现批量不变LLM推理的确定性解码。

Comments: 13 pages, 5 figures, 11 tables

AI中文摘要

零温度BF16 LLM推理通常被认为是可重现的，但同一请求在单独解码或位于较大批次内时可能产生不同的token。现有修复方法使用批量不变算子或LLM-42的逐token验证，即使在大多数步骤稳定时也会产生成本。我们询问验证是否可以仅应用于翻转的token。在五个模型上，批次诱导的token翻转在翻转率基准上是稀疏的：在MATH500上，Llama-3.1-8B在$0.48\%$的同步解码步骤中翻转，所有测试模型在MATH500、GSM8K和HumanEval上的翻转率保持在0.3-1.3%范围内。翻转前K/V扰动保持平坦，而低top-1/top-2 logit边际暴露了大部分翻转风险。MarginGate将这些观察转化为验证器策略：它在高边际步骤上保持BF16解码，仅验证低边际步骤，并通过替换当前K/V列修复确认的不匹配。我们在四个数据集上评估，在MATH500上校准并迁移到GSM8K、SharedGPT和HumanEval。MarginGate在Llama-3.1-8B和Qwen2.5-14B上以18.56%/15.05%的验证器触发率恢复100%序列级确定性解码，相对于始终验证，将LLM-42的延迟增量降低2.23倍/1.99倍。在DSR1-Distill-Qwen-7B上，相同策略在更困难的条件下以49.50%的触发率达到确定性。

英文摘要

Temperature-zero BF16 LLM inference is often treated as reproducible, yet the same request can emit different tokens when decoded alone or inside a larger batch. Existing fixes use batch-invariant operators or LLM-42's per-token verification, incurring cost even when most steps are stable. We ask whether verification can be applied exclusively to flipped tokens. Across five models, batch-induced token flips are sparse on the flip-rate benchmarks: on MATH500, Llama-3.1-8B flips on $0.48\%$ of synchronous decode steps, and all tested models stay within the 0.3-1.3% range on MATH500, GSM8K, and HumanEval. K/V perturbations remain flat before flips, while low top-1/top-2 logit margins expose much of the flip risk. MarginGate turns these observations into a verifier policy: it keeps BF16 decoding on high-margin steps, verifies only low-margin steps, and repairs confirmed mismatches by replacing the current K/V column. We evaluate on four datasets, calibrating on MATH500 and transferring to GSM8K, SharedGPT, and HumanEval. MarginGate restores 100% sequence-level deterministic decoding on Llama-3.1-8B and Qwen2.5-14B with 18.56%/15.05% verifier trigger rates, reducing LLM-42's latency increment by 2.23x/1.99x relative to always-on verification. On DSR1-Distill-Qwen-7B, the same policy reaches determinism in a harder regime at 49.50% triggers.
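The gating idea can be sketched in a few lines: compute the gap between the two largest logits at each decode step and call the verifier only when that gap is small. The threshold value and function names below are hypothetical. According to the abstract, the real policy is calibrated on MATH500, and the repair step that replaces the current K/V column is not shown here.

```python
def logit_margin(logits):
    """Gap between the largest and second-largest logit of one decode step."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2


def needs_verification(logits, threshold):
    """Trigger the verifier only on low-margin steps."""
    return logit_margin(logits) < threshold


def trigger_rate(step_logits, threshold):
    """Fraction of decode steps that would be sent to the verifier."""
    flags = [needs_verification(logits, threshold) for logits in step_logits]
    return sum(flags) / len(flags)
```

A higher threshold sends more steps to the verifier, trading latency for a lower chance of missing a batch-induced flip. The trigger rates reported in the abstract are the outcome of that trade-off on each model.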

2605.30214 2026-05-29 cs.CL

GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German

GRUFF：德语中LLM的代词忠实度、推理与偏见

Fabian Mewes, Anne Lauscher, Vagrant Gautam

AI总结通过构建大规模德语数据集GRUFF，研究大型语言模型在四种性别一致系统与四组代词上的代词忠实度，发现模型在无显式上下文时对阳性和阴性实体表现出强语法一致，但对新代词xier和en较弱，且职业刻板印象在不同语法格和模型间相关性低。

AI中文摘要

第三人称单数代词长期以来被用于研究语言模型中的刻板偏见以及测试其推理指代的能力。最近，通过代词忠实度任务研究了推理与偏见之间的相互作用，该任务评估模型正确复用先前为某个话语实体指定的代词的能力，而不受中间提到的其他潜在干扰话语实体的影响。然而，此类研究主要关注英语，这是一种语法性别有限且几乎没有性别一致的语言。在本文中，我们贡献了一个新颖的大规模数据集GRUFF，用于测量德语中的代词忠实度，涵盖了名词中的四种不同性别一致系统以及四组代词。利用该数据集，我们展示了LLM在缺乏显式上下文时对阳性和阴性实体表现出强语法一致，但对新代词xier和en则不然。模型通常对干扰项不鲁棒，但仅编码器模型在德语中比在英语中更鲁棒，反映了语法性别的重要性。最后，我们表明，在此上下文中，职业刻板印象在不同语法格之间以及大多数模型之间相关性较低，除了具有紧密相关架构的模型。我们发布所有代码和数据，以鼓励在德语中进一步研究性别包容性语言和指代推理。

英文摘要

Third-person singular pronouns have long been used to study stereotypical biases in language models and to test their abilities to reason about reference. More recently, the interplay between reasoning and bias has been investigated with the task of pronoun fidelity, which assesses models' abilities to correctly reuse a previously-specified pronoun for a discourse entity, independent of other potentially distracting discourse entities mentioned in between. However, such research focuses on English, which is a language with limited grammatical gender and almost no gender agreement. In this paper we contribute a novel, large-scale dataset, GRUFF, to measure pronoun fidelity in German, covering four different gender agreement systems in nouns, and four sets of pronouns. With this dataset, we show that LLMs show strong grammatical agreement for masculine and feminine entities in the absence of explicit context, but not for neopronouns xier and en. Models are generally not robust to distractors, but encoder-only models are more robust in German than in English, reflecting the importance of grammatical gender. Finally, we show that occupational stereotypes in this context are poorly correlated across grammatical cases, and across most models, except ones with closely related architectures. We release all code and data to encourage further work on gender-inclusive language and referential reasoning in German.

2605.30213 2026-05-29 cs.LG

Faithful Embeddings of Irregular and Asynchronous Data for Online Log-NCDEs

不规则和异步数据的忠实嵌入用于在线Log-NCDEs

Benjamin Walker, Alexandre Bloch, Lingyi Yang, Sam Morley, Terry Lyons

AI总结针对不规则和异步数据，提出一种连续且单射的嵌入方法，基于Log-NCDEs实现无需插值的在线计算，并证明其通用性。

Comments: 34 pages, 16 figures

AI中文摘要

连续时间模型是不规则和异步数据的自然选择。一个核心设计选择是如何将离散观测嵌入到连续时间中。基于插值和插补的嵌入重构了连续的观测路径，使得模型对重构的选择敏感。我们表明这种重构步骤是不必要的；在温和条件下，只要从数据到输入的嵌入是连续且单射的，模型输入空间上的紧集通用性就会转移到数据空间。受此结果指导，并基于神经控制微分方程（NCDEs）的直线控制路径，我们为Log-NCDEs（一类通用的连续时间模型）引入了一种连续且单射的嵌入。我们的方法将观测记录为增量，并在任意查询区间上组合它们，直接形成对数签名。这提供了区间级别的摘要，而无需先对观测变量进行插值，同时支持在线计算。在合成控制动力学和真实世界时间序列数据集上的实验表明，该表示准确、高效，并且对不规则、异步和稀疏观测具有鲁棒性。

英文摘要

Continuous-time models are a natural choice for irregular and asynchronous data. A central design choice is how to embed discrete observations into continuous time. Interpolation- and imputation-based embeddings reconstruct a continuous observation path, making the model sensitive to the choice of reconstruction. We show that this reconstruction step is unnecessary; under mild conditions, compact-set universality on the model input space transfers to the data space whenever the embedding from data to input is continuous and injective. Guided by this result, and building on the rectilinear control path for Neural Controlled Differential Equations (NCDEs), we introduce a continuous and injective embedding for Log-NCDEs, a universal class of continuous-time models. Our approach records observations as increments and composes them over arbitrary query intervals to directly form log-signatures. This provides interval-level summaries without first interpolating the observed variables, while supporting online computation. Experiments on synthetic controlled dynamics and real-world time-series datasets show that the representation is accurate, efficient, and robust to irregular, asynchronous, and sparse observations.
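To make "recording observations as increments and composing them" concrete, here is a minimal sketch truncated at depth 2, where a log-signature consists of the total increment plus an antisymmetric area matrix. Two consecutive pieces combine through the Baker-Campbell-Hausdorff formula, which at this depth reduces to adding half the commutator of the increments. The truncation depth, the data layout and the function names are assumptions; the paper's actual embedding may differ.

```python
def from_increment(dx):
    """Depth-2 log-signature of a straight segment: the increment and zero area."""
    d = len(dx)
    return list(dx), [[0.0] * d for _ in range(d)]


def compose(left, right):
    """Log-signature of two consecutive path pieces (BCH truncated at depth 2)."""
    (a, area_a), (b, area_b) = left, right
    d = len(a)
    increment = [a[i] + b[i] for i in range(d)]
    area = [[area_a[i][j] + area_b[i][j] + 0.5 * (a[i] * b[j] - a[j] * b[i])
             for j in range(d)] for i in range(d)]
    return increment, area


def log_signature(increments):
    """Fold a sequence of observation increments into one interval-level summary."""
    result = from_increment([0.0] * len(increments[0]))
    for dx in increments:
        result = compose(result, from_increment(dx))
    return result
```

An asynchronous update is simply an increment that is zero in the channels that did not change, so no interpolation is needed. Because `compose` is associative, summaries over adjacent intervals can be merged as new observations arrive, which is what makes online computation possible.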

2605.30211 2026-05-29 cs.CV

Cycle Consistency in Video Object-Centric Learning

视频目标中心学习中的循环一致性

Rongzhen Zhao, Zhiyuan Li, Ruonan Wei, Juho Kannala, Joni Pajarinen

AI总结针对视频目标中心学习中潜在槽空间难以直接应用循环一致性的问题，提出隐式循环一致性（ICC），将约束从槽空间转移到连续重建流形，避免特征坍塌并提升性能。

Comments: 14 pages

AI中文摘要

自监督视频目标中心学习（OCL）旨在发现不同目标并跨时间关联它们，而自监督多目标跟踪（MOT）则侧重于关联预定义的目标检测或分割。尽管循环一致性（CC）在MOT中已成熟应用，但它不能简单或显式地应用于OCL的潜在槽空间。与MOT中确定性和理想的目标表示不同，OCL槽由于非唯一的场景分解而固有地具有随机性和模糊性。在槽上强制执行显式循环一致性（ECC）会导致刚性均值寻求，这严重惩罚了模型探索替代但同样有效的分解，从而驱动特征坍塌。为解决这一困境，我们提出隐式循环一致性（ICC），它将循环一致性约束从限制性的槽空间转移到连续的重建流形，鼓励槽在集体解释视觉场景上达成软共识，而不是强制刚性点对点特征对齐。在复杂视频OCL基准上的大量实验表明，ICC避免了特征坍塌，并优于ECC基线。我们的源代码、模型检查点和训练日志可在 https://github.com/Genera1Z/ICC 获取。

英文摘要

Self-supervised video Object-Centric Learning (OCL) aims to discover distinct objects and associate them across time, whereas self-supervised Multi-Object Tracking (MOT) focuses on associating pre-defined object detections or segmentations. Although well-established in MOT, Cycle Consistency (CC) cannot naively or explicitly apply to the latent slot space of OCL. Unlike the deterministic and ideal object representations in MOT, OCL slots are inherently stochastic and ambiguous due to non-unique scene decompositions. Enforcing explicit cycle consistency (ECC) on slots imposes rigid mean seeking. This severely penalizes the model for exploring alternative but equally valid decompositions, thereby driving towards feature collapse. To resolve this dilemma, we propose \textit{Implicit Cycle Consistency (ICC)}, which shifts the cycle-consistency constraint from the restrictive slot space to the continuous reconstruction manifold, encouraging slots to reach a soft consensus on collectively interpreting the visual scene rather than forcing rigid point-to-point feature alignment. Extensive experiments on complex video OCL benchmarks demonstrate that ICC avoids feature collapse and outperforms ECC baselines. Our source code, model checkpoints and training logs are provided on https://github.com/Genera1Z/ICC.

2605.30208 2026-05-29 cs.SE cs.AI

Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency

Meta的低风险代码审查自动化：RADAR、风险校准与审查效率

Chris Adams, Arjun Singh Banga, Parveen Bansal, Souvik Bhattacharya, Rujin Cao, Pedro Canahuati, Nate Cook, Brian Ellis, Prabhakar Goyal, Gurinder Grewal, Tianyu He, Matt Labunka, Alex Manners, David Molnar, Ging Cee Ng, Vishal Parekh, Jiefu Pei, Frederic Sagnes, James Saindon, Will Shackleton, Sid Sidhu, Gursharan Singh, Karthik Chengayan Sridhar, Matt Steiner, Pratibha Udmalpet, Sean Xia, Stacey Yan, Audris Mockus, Peter Rigby, Nachiappan Nagappan

AI总结提出RADAR系统，通过多阶段漏斗对代码差异进行风险分层自动化审查，在Meta部署后显著提升审查效率并降低风险。

AI中文摘要

AI辅助编码工具改变了软件生产。在Meta，每个人工落地差异（diff）的有效代码行数同比增长105.9%，每位开发者的差异量增长51%，其中智能体AI贡献了超过80%的增长。与此同时，获得及时审查的差异比例下降，暴露出代码供应与审查带宽之间不断扩大的差距。我们提出三个问题，从可行性到校准再到影响：（1）风险分层的自动化能否在不同组织中大规模运行，（2）调整风险阈值如何影响自动化产出与安全性之间的权衡，（3）自动化审查在多大程度上减少AI生成变更的端到端延迟？我们部署了RADAR（风险感知差异自动审查），一个多阶段漏斗，根据作者和源类型对每个差异进行分类，应用资格门控、静态启发式、机器学习差异风险评分、基于LLM的自动化代码审查，以及在落地合格变更前的确定性验证。我们通过覆盖535K+个RADAR审查差异的遥测、政策变更的前后观察比较以及效率结果的双重差分分析来评估RADAR。RADAR已审查535K+个差异并落地331K+个。将差异风险评分阈值从第25百分位放宽到第50百分位，批准率提高到60.31%。RADAR审查差异的回滚率是非RADAR差异的1/3，生产事故率是非RADAR差异的1/50。RADAR将中位关闭时间减少超过330%，中位差异审查墙时间减少35%。风险感知的分层自动化可以显著减少由AI驱动的代码增长造成的审查瓶颈，同时不损害生产安全。

英文摘要

AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type, applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the Production Incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.
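The threshold relaxation described above can be pictured with a small sketch. It assumes, hypothetically, that a diff is eligible for automated review when its risk score is at or below a chosen percentile of historical scores. The nearest-rank percentile, the score values and the function names are illustrative and not taken from RADAR.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def eligible_for_auto_review(score, historical_scores, p):
    """A diff passes the risk gate if its score is at or below the p-th percentile."""
    return score <= percentile(historical_scores, p)


# Hypothetical historical risk scores (lower means less risky).
history = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
strict = eligible_for_auto_review(0.3, history, 25)
relaxed = eligible_for_auto_review(0.3, history, 50)
```

In this toy setup the same diff is rejected under the stricter gate and accepted under the relaxed one, which is the yield-versus-safety trade-off the abstract's second question is about.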

2605.30207 2026-05-29 cs.AI

Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit

检索增强商业对话中品牌推荐的人格条件化：一种突出性分层跨提供商审计

Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

AI总结本研究通过审计10种人格×8个提示×3种模型配置的2000次运行，发现用户人格显著改变AI推荐品牌集，且效果在中等市场品牌和依赖先验的生成路径中更为突出。

AI中文摘要

同一个提示“最佳CRM软件”会由背景迥异的买家（独立创始人、企业副总裁、英国中小企业主）发送给AI助手。我们审计了这种上下文变化在多大程度上重塑模型推荐的品牌。审计采样了2000次运行，覆盖10种人格×8个提示×3种模型配置×N=10次重复的设计空间，其中两个OpenAI单元覆盖全部8个提示，Anthropic sonnet-4.6/低单元覆盖4个提示。在用户消息前添加人格，相对于同人格基线，推荐集相似度（Jaccard）下降Delta = -0.12至-0.20（聚类95%置信区间在所有三个测量单元上排除零；sonnet单元的置信区间仅基于4个提示聚类，相应更宽）。该效应具有明显的突出性分层：品类领导者具有人格抗性（跨人格约80%相同品牌一致性），但中等市场品牌随人格变化最多更换75%的推荐集。Anthropic模型的点估计效应大于OpenAI配置，尽管聚类置信区间在更接近的对比（sonnet vs. OpenAI/高）中重叠；这种不对称性与Anthropic更偏向无检索归因的生成路径一致（43-52%的推荐没有观察到检索层证据，而OpenAI为8-29%，记录在Jack 2026中）。任何AI品牌感知的测量都必须以提供查询的买家人格为条件：相同的提示根据模型认为谁在提问而产生实质上不同的推荐集，而跨人格聚合的测量协议系统性地掩盖了这种变化。该效应集中在中等市场，并且在我们审计中最依赖先验的生成路径上最大，这与人格响应性随着模型更依赖训练数据先验和更丰富的上下文集成而增强是一致的。

英文摘要

The same prompt -- "best CRM software" -- reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP, a UK SMB owner. We audit how strongly that contextual variation reshapes which brands the model recommends. The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Delta = -0.12 to -0.20 relative to a same-persona baseline (clustered 95% CIs exclude zero on all three measured cells; the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). The effect is sharply prominence-stratified: category leaders are persona-resistant (~80% same-brand consistency across personas), but mid-market brands swap up to 75% of the recommendation set as the persona changes. The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high); the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29%, documented in Jack 2026). Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit, consistent with persona responsiveness growing as models lean more on training-data priors and richer context integration.
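The similarity measure used above is the Jaccard index between recommendation sets. The sketch below computes it and a Delta of the kind reported, taken here as cross-persona similarity minus same-persona similarity. The brand labels are placeholders and the exact aggregation in the audit is not specified in the abstract, so treat the Delta computation as an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two recommendation sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder brand sets; two runs with the same persona, one with another persona.
same_persona = jaccard({"A", "B", "C"}, {"A", "B", "C", "D"})
cross_persona = jaccard({"A", "B", "C"}, {"A", "D"})
delta = cross_persona - same_persona
```

A negative Delta means that changing the persona alters the recommended set more than run-to-run noise within a single persona does.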

2605.30202 2026-05-29 cs.CL

A Dual-Path Architecture for Scaling Compute and Capacity in LLMs

一种用于扩展LLMs计算和容量的双路径架构

Markus Frey, Behzad Shomali, Joachim Koehler, Mehdi Ali

AI总结提出一种双路径块架构，通过深度子层（参数共享重复K次）和宽度子层（单次大FFN）并行扩展计算和容量，在语言建模和下游任务上超越等FLOPs基线模型，且门控机制可解释地分配每令牌路径。

AI中文摘要

循环Transformer多次应用共享块，已成为语言模型中参数高效扩展计算的途径。然而，在固定FLOPs下，循环模型的容量严格低于基线Transformer。我们提出一种新颖的双路径块，可以灵活扩展计算（应用于隐藏状态的顺序操作数量）和容量（单步可用参数）。为此，我们在单层内将两个轴暴露为并行通路：一个深度子层，使用共享参数重复应用K次；一个宽度子层，包含一次应用的大型前馈网络。独立的每令牌门控组合两个轴，并允许详细的每令牌路由分析。我们表明，在两个FLOP预算下，我们的双路径模型在语言建模和下游评估上超越了等FLOPs匹配模型，同时在匹配FLOPs下使用比基线更少的参数。学习到的门控直接可解释，并显示系统的每令牌分配：功能词和词汇内容倾向于宽度路径，而标点、符号和算术令牌倾向于深度路径。

English abstract

Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We propose a novel dual-path block that can flexibly scale both compute (the number of sequential operations applied to a hidden state) and capacity (the parameters available at a single step). To this end, we expose both axes as parallel pathways within a single layer: a deep sublayer re-applied K times with shared parameters, and a wide sublayer with an enlarged feed-forward network applied once. Independent per-token gates combine both axes and allow detailed per-token routing analyses. We show that across two FLOP budgets, our dual-path model surpasses iso-FLOP matched models on language modeling and downstream evaluations, while using fewer parameters than the baseline at matched FLOPs. The learned gates are directly interpretable and show systematic per-token allocation: function words and lexical content trend wide, while punctuation, symbols, and arithmetic tokens trend deep.
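The structure described in this abstract can be pictured with a small toy sketch, assuming a residual combination and sigmoid gates. Every function name, the elementwise toy layers, and the way the two paths are summed are assumptions for illustration; the abstract only states that a shared deep sublayer is re-applied K times, a wide feed-forward sublayer is applied once, and independent per-token gates combine them. In a real model the gate logits would be computed from the token's hidden state rather than passed in.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def deep_path(h, shared_weight, k):
    """Re-apply one shared toy layer k times: compute grows with k, parameters do not."""
    for _ in range(k):
        h = [math.tanh(shared_weight * x) for x in h]
    return h


def wide_path(h, wide_params):
    """Apply a toy feed-forward map once; its width is len(wide_params).

    Each (u, v) pair is one hidden ReLU unit: v * relu(u * x).
    """
    return [sum(v * max(0.0, u * x) for u, v in wide_params) for x in h]


def dual_path_block(h, shared_weight, k, wide_params, deep_logit, wide_logit):
    """Combine both paths for a single token with two independent gates.

    Assumption: the gated outputs are added to the input as a residual.
    """
    g_deep = sigmoid(deep_logit)
    g_wide = sigmoid(wide_logit)
    deep = deep_path(h, shared_weight, k)
    wide = wide_path(h, wide_params)
    return [x + g_deep * d + g_wide * w for x, d, w in zip(h, deep, wide)]


# One hypothetical token state routed mostly to the deep path.
token = [0.5, -1.0]
params = [(1.0, 0.5), (-1.0, 0.5)]
print(dual_path_block(token, 1.0, 3, params, deep_logit=4.0, wide_logit=-4.0))
```

Because the two gates are independent rather than a softmax over paths, a token can draw on both paths, on one, or on neither, which is what makes per-token routing statistics like those in the abstract possible to read off.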

2605.30201 2026-05-29 cs.LG cs.AI

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang

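AI summary: Targets a failure mode of GRPO under sparse verifiable rewards. HPO improves training by down-weighting negative-advantage updates and using mean-length normalization, and an adaptive variant, A-HPO, is also introduced. Rewards improve substantially in the TeleLogs and Countdown experiments.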

English abstract

We investigate a narrow but common failure mode of GRPO-style reinforcement learning with sparse verifiable rewards: early updates contain more responses with negative advantages than with positive advantages, while response-level length normalization ties the magnitude of the update to the length of the output. We propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that reduces the weight of negative-advantage updates and replaces per-response length normalization with mean-length normalization. We further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight based on batch-level advantage-sign statistics, thereby removing the need to tune a fixed hysteretic weight. In our TeleLogs and Countdown experiments, A-HPO improves the reward per update compared to GRPO, with the largest gains in early sparse-reward regimes. On TeleLogs, A-HPO achieves a final reward of 0.84, outperforming SAPO by 5%, GSPO by 11%, and GRPO by 15%, while maintaining a comparable response length. On Countdown, A-HPO achieves the largest gains in the initial and most difficult configurations across 1.5B-7B models. Ablation studies on the hysteretic weight show that the gains of A-HPO come from better balancing the contributions of positive and negative advantages compared to positive-only or fully symmetric updates.
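The two modifications named in this abstract can be sketched as per-token weights on the policy-gradient term, assuming a fixed hysteretic weight `beta` in [0, 1]. The function names, the example advantages and lengths, and the exact form of both weightings are assumptions for illustration; the abstract does not give the loss formula, and the adaptive rule used by A-HPO is not specified there, so it is not reproduced here.

```python
def grpo_style_token_weights(advantages, lengths):
    """Per-token weight under per-response length normalization.

    Each token of response i is weighted by A_i / len_i, so longer
    responses receive smaller per-token updates.
    """
    return [a / n for a, n in zip(advantages, lengths)]


def hpo_style_token_weights(advantages, lengths, beta):
    """Per-token weight with two changes (a sketch, not the authors' code).

    1. Negative advantages are scaled by beta (the hysteretic weight).
    2. Every response is normalized by the batch mean length instead of
       its own length.
    """
    mean_len = sum(lengths) / len(lengths)
    weights = []
    for a in advantages:
        scaled = a if a >= 0 else beta * a
        weights.append(scaled / mean_len)
    return weights


# Hypothetical sparse-reward group: one success, three failures of growing length.
advantages = [1.5, -0.5, -0.5, -0.5]
lengths = [100, 200, 300, 400]

print(grpo_style_token_weights(advantages, lengths))
print(hpo_style_token_weights(advantages, lengths, beta=0.5))
```

In the first weighting the three failed responses get different per-token weights purely because of their lengths; in the second they get the same, smaller weight, so the single positive response is not outvoted by the negative majority.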
