arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.06139 2026-05-21 cs.LG cs.AI

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Yingyue Li, Wutong Xu, Lizhou Cai, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji

AI summary: This paper proposes Listwise Policy Optimization (LPO), which performs the target projection explicitly: it makes the implicit target explicit by restricting the proximal RL objective to the response simplex, then projects the policy via exact divergence minimization. Across diverse reasoning tasks and LLM backbones, LPO improves training performance while preserving optimization stability and response diversity.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to explicitly conduct the target-projection, which demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection with distinct structural properties through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
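
A minimal sketch of the group-based setup described above, not the authors' implementation. It computes group-relative advantages for the responses sampled for one prompt, builds a target distribution over that group, and measures the KL divergence from the target to the current policy restricted to the same responses. The exponentially tilted target, the temperature `beta`, the toy probabilities, and every function name are assumptions made for illustration.

```python
import math

def group_advantages(rewards):
    """Group-relative advantages: reward minus the group mean (zero-sum by construction)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def normalize(weights):
    """Rescale non-negative weights so they sum to one (a point on the simplex)."""
    total = sum(weights)
    return [w / total for w in weights]

def tilted_target(old_probs, advantages, beta=1.0):
    """Assumed target: old policy restricted to the group, tilted by exp(advantage / beta)."""
    restricted = normalize(old_probs)
    return normalize([p * math.exp(a / beta) for p, a in zip(restricted, advantages)])

def kl(p, q):
    """KL(p || q) for two distributions over the same finite group."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

rewards = [1.0, 0.0, 0.0, 1.0]      # verifiable 0/1 rewards for four sampled responses
old_probs = [0.1, 0.2, 0.3, 0.4]    # hypothetical sequence probabilities under the old policy
adv = group_advantages(rewards)
target = tilted_target(old_probs, adv)
current = normalize(old_probs)
print(adv, target, kl(target, current))
```

In this toy group the target moves probability mass toward the two rewarded responses, and the KL term is the quantity a projection step would drive down.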

2605.05863 2026-05-21 cs.LG cs.AI

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

Carlo Romeo, Girolamo Macaluso, Alessandro Sestini, Andrew D. Bagdanov

AI summary: This paper proposes SOPE, an algorithm that uses an actor-aligned off-policy policy evaluation (OPE) signal as an automated early-stopping mechanism to dynamically control the length of offline training phases, improving baseline performance on continuous control tasks while reducing compute.

Abstract

Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are significantly more computationally efficient than static update schedules, they require task-dependent manual tuning, risking either the waste of prior knowledge or severe overfitting. To address this, we propose SOPE, an algorithm that uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automated early-stopping mechanism to dynamically control the length of offline training phases. By evaluating the critic on a held-out validation split under the current policy's action distribution, SOPE halts gradient updates exactly when out-of-distribution benefits saturate, eliminating the need for manual schedule tuning. Evaluated on 25 continuous control tasks from the Minari benchmark suite, SOPE improves baseline performance by up to 45.6% while reducing the required TFLOPs by up to 22x, thus balancing the tradeoff between sample and computational efficiency. These findings demonstrate that adaptive, evaluation-driven update schedules are more effective than relying on static, exhaustive update schedules.
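
A minimal sketch of evaluation-driven early stopping, under stated assumptions. The abstract only says that updates halt when the held-out OPE signal saturates; the patience-based rule, the `min_delta` threshold, and the function name below are hypothetical choices, not SOPE's actual stopping criterion.

```python
def offline_phase_length(ope_signal, patience=3, min_delta=1e-3):
    """Return how many offline update rounds run before stopping.

    ope_signal: iterable of held-out evaluation scores, one per round (higher is better).
    Stops once the score has failed to improve by min_delta for `patience` rounds in a row.
    """
    best = float("-inf")
    stale = 0
    rounds = 0
    for score in ope_signal:
        rounds += 1
        if score > best + min_delta:
            best = score
            stale = 0          # progress: reset the patience counter
        else:
            stale += 1
            if stale >= patience:
                break          # signal has saturated: end the offline phase
    return rounds

scores = [0.10, 0.30, 0.45, 0.50, 0.50, 0.499, 0.50, 0.62]
print(offline_phase_length(scores))
```

The toy trace stops after the third non-improving round, so the later jump to 0.62 is never seen. This illustrates the trade-off any saturation rule makes between stopping early and missing late gains.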

2605.05405 2026-05-21 cs.CV

Zero-Shot Satellite Image Retrieval through Joint Embeddings: Application to Crisis Response

James Walsh, William Fawcett, Grace Colverd, Raúl Ramos-Pollán

AI summary: This paper presents GeoQuery, a zero-shot retrieval system that enables natural-language querying at global scale through a two-stage semantic and visual search, without the paired data and compute that full contrastive training requires. It embeds language descriptions of a proxy subset of global data and optimises the description-generation prompt so that distances in the text-embedding space correlate with distances in the frozen CLAY visual-embedding space. On 76 disaster-location queries it reaches 31.6% accuracy within 50 km.

Abstract

Semantic search of Earth observation archives remains challenging. Visual foundation models such as CLAY produce rich embeddings of satellite imagery but lack the natural-language grounding needed for intuitive query, and full contrastive training of a remote-sensing CLIP-style model requires paired data and compute that are unavailable at global scale. To allow natural language querying at global scales, we present GeoQuery, a zero-shot retrieval system that sidesteps data and compute constraints through a two-stage semantic and visual search, leveraging a natural language embedding of a subset (proxy) of global data. Rather than training a joint encoder, we generate language descriptions for a 100k proxy subset of global Sentinel-2 tiles and optimise the description-generation prompt so that distances in the resulting text-embedding space correlate with distances in the frozen CLAY visual-embedding space. Queries are resolved in two stages, with a text-similarity search over the proxy subset followed by a visual nearest-neighbour search over worldwide CLAY embeddings. On 76 disaster-location queries covering UK floods, US wildfires, and US droughts, GeoQuery achieves 31.6% accuracy within 50 km, with the strongest performance on floods (50% within 50 km) where terrain features are well captured by RGB embeddings. Deployed within a crisis response system called ECHO, GeoQuery identified vulnerable areas during Brisbane's 2025 Cyclone Alfred, with downstream flood simulations reproducing historical patterns. Prompt-aligned proxies offer a practical bridge between EO foundation models and operational retrieval when full contrastive training is out of reach.
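
A minimal sketch of the two-stage query flow described above, using toy vectors in place of real text and CLAY embeddings. The abstract does not say how stage one hands off to stage two; seeding the visual search with the single best-matching proxy tile, cosine similarity as the metric, and all names and data below are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def two_stage_search(query_text_vec, proxy, world, top_k=2):
    """proxy: list of (tile_id, text_vec, visual_vec); world: list of (tile_id, visual_vec)."""
    # Stage 1: text-similarity search over the captioned proxy subset.
    anchor = max(proxy, key=lambda item: cosine(query_text_vec, item[1]))
    # Stage 2: visual nearest-neighbour search over all tiles,
    # seeded with the anchor tile's visual embedding.
    ranked = sorted(world, key=lambda item: cosine(anchor[2], item[1]), reverse=True)
    return [tile_id for tile_id, _ in ranked[:top_k]]

# Hypothetical proxy tiles (text and visual embeddings) and worldwide tiles (visual only).
proxy = [
    ("p_river", [1.0, 0.0], [0.9, 0.1, 0.0]),
    ("p_forest", [0.0, 1.0], [0.0, 0.2, 0.9]),
]
world = [
    ("w1", [1.0, 0.0, 0.0]),
    ("w2", [0.0, 0.0, 1.0]),
    ("w3", [0.8, 0.3, 0.1]),
]
print(two_stage_search([0.9, 0.2], proxy, world))
```

Only the proxy subset needs text embeddings. The worldwide search runs entirely in the visual space, which is what lets the approach avoid training a joint encoder.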

2605.03690 2026-05-21 cs.LG cs.AI q-bio.QM

Graph Neural Network based Hierarchy-Aware Embeddings of Knowledge Graphs: Applications to Yeast Phenotype Prediction

Filip Kronström, Alexander H. Gower, Daniel Brunnsåker, Ievgeniia A. Tiukova, Ross D. King

AI summary: This paper presents a method for producing hierarchy-aware knowledge graph embeddings using graph neural networks and a semantic loss derived from underlying ontologies, applies it to yeast phenotype prediction, and demonstrates its use for predicting gene-deletion effects and evaluating knowledge graph revisions.

Abstract

We present a method for finding hierarchy-aware embeddings of knowledge graphs (KGs) using graph neural networks (GNNs) enriched with a semantic loss derived from underlying ontologies. This method yields embeddings that better reflect domain knowledge. To demonstrate their utility, we predict and interpret the effects of gene deletions in the yeast Saccharomyces cerevisiae and learn box embeddings for KGs in the absence of a prediction task. We further show how box embeddings can serve as the basis for evaluating KG revisions. Our yeast KG is constructed from community databases and ontology terms. Low-dimensional box embeddings combined with GNNs are used to predict cell growth for double gene knockouts. Over 10-fold cross validation, these predictions have a mean $R^2$ score of 0.360, significantly higher than baseline comparisons, demonstrating that high-level qualitative knowledge is informative about experimental outcomes. Incorporating semantic loss terms in the training of the models improves their predictive performance ($R^2$ = 0.377) by aligning embeddings with ontology structure. This shows that class hierarchies from ontologies can be exploited for quantitative prediction. We also test the trained models on triple gene knockouts, showing they generalise to data beyond those seen in training. Additionally, by identifying co-occurring relations in the yeast KG important for the cell-growth predictions, we construct hypotheses about interacting traits in yeast. A biological experiment validates one such finding, revealing an association between inositol utilisation and osmotic stress resistance, highlighting the model's potential to guide biological discovery.

2605.01486 2026-05-21 cs.AI

MAP-Law: Coverage-Driven Retrieval Control for Multi-Turn Legal Consultation

Qinchuan Cheng, Jiaqi Liu, Ruixuan Xie, Xiaoya Yuan, Yuxin Liu

AI summary: This paper introduces a coverage-driven retrieval-control framework for multi-turn legal consultation. It maintains a structured map over user facts, legal elements, retrieval goals, and retrieved evidence, and uses element coverage, evidence validity coverage, and marginal retrieval gain to decide whether to retrieve, clarify, reformulate, or stop. In a pilot with fixed legal-element schemas, the approach achieves full measured element coverage.

Abstract

Legal consultation is inherently iterative: before giving advice, a system must identify relevant legal elements, gather missing facts and authorities, and determine whether the current evidence is sufficient. Existing retrieval-augmented legal agents often use fixed retrieval budgets or single-shot search, making them insensitive to the evolving coverage state of a consultation. This paper introduces a coverage-driven retrieval-control framework for multi-turn legal consultation. The framework maintains a structured map over user facts, legal elements, retrieval goals, and retrieved evidence, and uses element coverage, evidence validity coverage, and marginal retrieval gain to decide whether to retrieve, clarify, reformulate, or stop. On a 50-case synthetic Chinese labor-law consultation pilot with fixed legal-element schemas, a DeepSeek V4-Pro action-selection variant achieves full measured element coverage under the pilot metric while requiring 3.4 retrieval rounds and 7.1 evidence snippets on average. Diagnostic analyses show that model-backed action selection recovers rule-policy failure cases with a small retrieval-budget increase, while forced continuation mainly increases token and latency costs. These results suggest that legal-element coverage is a useful control signal for adaptive legal retrieval, while remaining bounded to retrieval-control behavior under synthetic fixed-schema conditions rather than deployment-level legal correctness.
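
A minimal sketch of a rule-based controller over the three signals named above (element coverage, evidence validity coverage, marginal retrieval gain). The decision order, the thresholds, and the `missing_facts` input are all hypothetical; the abstract does not specify how the framework combines these signals.

```python
def choose_action(element_coverage, evidence_validity, marginal_gain, missing_facts,
                  cover_target=1.0, validity_target=0.8, gain_floor=0.05):
    """Pick the next controller action from coverage signals in [0, 1].

    All thresholds are illustrative assumptions, not values from the paper.
    """
    if element_coverage >= cover_target and evidence_validity >= validity_target:
        return "stop"          # every element is covered by sufficiently valid evidence
    if missing_facts:
        return "clarify"       # ask the user before spending more retrieval budget
    if marginal_gain < gain_floor:
        return "reformulate"   # the last retrieval added little, so rewrite the query
    return "retrieve"

print(choose_action(0.5, 0.9, 0.2, ["employment start date"]))
print(choose_action(1.0, 0.9, 0.0, []))
```

Keying the stop decision to the coverage state is what separates this style of control from a fixed retrieval budget or single-shot search.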

2604.27505 2026-05-21 cs.CV

Leveraging Verifier-Based Reinforcement Learning in Image Editing

Hanzhong Guo, Jie Wu, Jie Liu, Yu Gao, Zilyu Ye, Linxiao Yuan, Xionghui Wang, Yizhou Yu, Weilin Huang

AI summary: This paper proposes Edit-R1, a framework that addresses the lack of a robust reward model for image editing by building a chain-of-thought verifier-based reasoning reward model (RRM). The RRM breaks instructions into distinct principles and evaluates the edited image against each one to produce fine-grained rewards; experiments show it outperforms existing models as an editing-specific reward model.

Abstract

While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.

2604.27375 2026-05-21 cs.CV

VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

Yihong Guo, Youwei Lyu, Jiajun Tang, Yizhuo Zhou, Hongliang Wang, Jinwei Chen, Changqing Zou, Qingnan Fan

AI summary: This paper proposes VeraRetouch, a lightweight, fully differentiable framework for multi-task photo retouching. It pairs a 0.5B vision-language model with a fully differentiable Retouch Renderer to enable end-to-end pixel-level training, and introduces the AetherRetouch-1M+ dataset and the DAPO-AE reinforcement learning strategy to improve retouching performance and generalization.

Abstract

Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, give reasoning processes, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, creating optimization barriers and suffering from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. Furthermore, we develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Furthermore, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at https://github.com/OpenVeraTeam/VeraRetouch.

2604.26052 2026-05-21 cs.CL

From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

Mengya Hu, Qiong Wei, Sandeep Atluri

AI summary: Through a paired analysis of human-labeled prompt and response records, this study examines LLM safety behavior across four harm categories (Sexual, Self harm, Hate, and Violence) and ordinal severity levels. It finds that 61% of responses reduce harm relative to the prompt and 3% escalate, and it exposes a tradeoff between helpfulness and harmlessness.

Abstract

Safety evaluations of large language models (LLMs) typically report binary outcomes, i.e. attack success rate (ASR), refusal rate, or harmful versus safe classification, which hide how risk changes between prompt and response. We present a paired analysis over human labeled prompt and response records across four harm categories (Sexual, Self harm, Hate and Violence) and ordinal severity levels (Safe, Low, Medium, High). 61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate. The escalation splits into two mechanisms: benign prompts triggering unrequested harmful detail, and answers that stay on task at higher severity than the prompt. Category decomposition shows that Sexual content exhibits the highest harm persistence in this sample, driven by compliance at the same severity rather than drift from benign inputs. Joint relevance analysis exposes a helpfulness versus harmlessness tradeoff: compliance escalations remain highly relevant, whereas safe responses include generic refusals with low relevance. Finally, few-shot LLM graders exhibit a prompt/response detection asymmetry that data calibration does not close. Grader prompts are shared at https://github.com/microsoft/PairedSafety.
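
A minimal sketch of the paired comparison described above: each record carries an ordinal severity label for the prompt and for the response, and the transition is classified as reduce, preserve, or escalate. The integer encoding of the four levels, the function names, and the four toy records are assumptions for illustration. They are not the paper's data and do not reproduce its percentages.

```python
from collections import Counter

# Ordinal encoding of the severity levels named in the abstract (the integers are an assumption).
SEVERITY = {"Safe": 0, "Low": 1, "Medium": 2, "High": 3}

def transition(prompt_level, response_level):
    """Classify one prompt/response pair by how severity changed."""
    delta = SEVERITY[response_level] - SEVERITY[prompt_level]
    if delta < 0:
        return "reduce"
    if delta > 0:
        return "escalate"
    return "preserve"

def transition_shares(pairs):
    """Fraction of pairs in each transition class."""
    counts = Counter(transition(p, r) for p, r in pairs)
    total = len(pairs)
    return {name: counts[name] / total for name in ("reduce", "preserve", "escalate")}

# Hypothetical labeled records: (prompt severity, response severity).
pairs = [("High", "Safe"), ("Medium", "Low"), ("Low", "Low"), ("Safe", "Low")]
print(transition_shares(pairs))
```

A binary harmful-versus-safe label would collapse the second and third records into the same bucket, which is the information loss the paired view is meant to avoid.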

2604.24697 2026-05-21 cs.AI

Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

Zhou Ziheng, Huacong Tang, Jinyuan Zhang, Haowei Lin, Bangcheng Yang, Qian Long, Fang Sun, Yizhou Sun, Yitao Liang, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao

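AI summary: Using SciCrafter, a Minecraft-based benchmark, this paper examines whether agents can discover causal regularities and apply them to build functional systems (the discovery-to-application loop). Frontier models plateau at approximately 26% success, and knowledge gap identification, that is, raising the right problems, is emerging as a bottleneck for current AI.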

Comments Preprint, under review. 41 pages. Project page: https://scicrafter-bench.github.io/. Code: https://github.com/scicrafter-bench/scicraft-bench

Abstract

Discovering causal regularities and applying them to build functional systems--the discovery-to-application loop--is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at approximately 26% success rate. To diagnose these failures, we decompose the loop into four capacities--knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application--and design targeted interventions whose marginal contributions serve as proxies for corresponding gaps. Our analysis reveals that although the general knowledge application capability still remains as the biggest gap across all models, for frontier models the knowledge gap identification starts to become a major hurdle--indicating the bottleneck is shifting from solving problems right to raising the right problems for current AI. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.

2604.22080 2026-05-21 cs.AI

Sound Agentic Science Requires Adversarial Experiments

Dionizije Fa, Marko Culjak

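AI summary: This paper argues that adversarial experiments are essential to agent-assisted science, describes how agents accelerate a familiar failure mode of scientific analysis, and proposes that agent-generated scientific claims be evaluated under a falsification-first standard.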

Comments Published at ICLR 2026 Workshop on Agents in the Wild

Abstract

LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode, the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses, optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. Because the missing evidence is a negative space, experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.

2604.21060 2026-05-21 cs.CV

Clinically-Informed Modeling for Pediatric Brain Tumor Classification from Whole-Slide Histopathology Images

Joakim Nguyen, Jian Yu, Jinrui Fang, Nicholas Konz, Tianlong Chen, Sanjay Krishnan, Chandra Krishnan, Ying Ding, Hairong Wang, Ankita Shukla

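AI summary: This paper proposes a clinically informed contrastive learning framework that improves fine-grained classification of pediatric brain tumor whole-slide images under limited data and class imbalance.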

Comments Accepted at the IEEE International Conference on Healthcare Informatics (ICHI), 2026

Abstract

Accurate diagnosis of pediatric brain tumors, starting with histopathology, presents unique challenges for deep learning, including severe data scarcity, class imbalance, and fine-grained morphologic overlap across diagnostically distinct subtypes. While pathology foundation models have advanced patch-level representation learning, their effective adaptation to weakly supervised pediatric brain tumor classification under limited data remains underexplored. In this work, we introduce an expert-guided contrastive fine-tuning framework for pediatric brain tumor diagnosis from whole-slide images (WSI). Our approach integrates contrastive learning into slide-level multiple instance learning (MIL) to explicitly regularize the geometry of slide-level representations during downstream fine-tuning. We propose both a general supervised contrastive setting and an expert-guided variant that incorporates clinically informed hard negatives targeting diagnostically confusable subtypes. Through comprehensive experiments on pediatric brain tumor WSI classification under realistic low-sample and class-imbalanced conditions, we demonstrate that contrastive fine-tuning yields measurable improvements in fine-grained diagnostic distinctions. Our experimental analyses reveal complementary strengths across different contrastive strategies, with expert-guided hard negatives promoting more compact intra-class representations and improved inter-class separation. This work highlights the importance of explicitly shaping slide-level representations for robust fine-grained classification in data-scarce pediatric pathology settings.

2604.20985 2026-05-21 cs.LG cs.AI cs.CR stat.ML

Differentially Private Model Merging

Qichuan Yin, Manzil Zaheer, Tian Li

AI summary: This paper proposes two post-processing techniques, random selection and linear combination, that produce final private models satisfying any target differential privacy requirement without additional training, and analyzes their privacy/utility tradeoffs on general problems and on private mean estimation.

Abstract

In machine learning, privacy requirements at inference or deployment time often evolve due to changing policies, regulations, or user preferences. In this work, we aim to construct a multitude of models to satisfy any target differential privacy (DP) requirement without additional training, given a set of existing models trained on the same dataset with different privacy/utility tradeoffs. We propose two post-processing techniques, namely random selection and linear combination, to generate final private models satisfying any target privacy parameter. We provide privacy accounting of these approaches from the lens of Rényi DP and privacy loss distributions on general problems, as well as on private mean estimation, where we precisely characterize the privacy/utility tradeoffs and compare the two mechanisms. Empirically, we demonstrate the effectiveness of our approaches and validate our analyses on several models and both synthetic and real-world datasets.

2604.12239 2026-05-21 cs.CV eess.IV

Physics-Grounded Monocular Vehicle Distance Estimation Using Standardized License Plate Typography

Manognya Lokesh Reddy, Zheng Liu

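AI summary: This paper proposes estimating vehicle distance by using the standardized typography of United States license plates as passive fiducial markers, resolving scale ambiguity through explicit geometric priors without training data or active illumination, and delivering distance, relative velocity, and collision warning.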

Comments 21 pages, 12 figures

Abstract

Accurate inter-vehicle distance estimation is a cornerstone of Advanced Driver Assistance Systems (ADAS) and autonomous driving. While LiDAR and radar provide high precision, their high cost prohibits widespread adoption in mass-market vehicles. Monocular camera-based estimation offers a low-cost alternative but suffers from fundamental scale ambiguity. Recent deep learning methods for monocular depth achieve impressive results yet require expensive supervised training, suffer from domain shift, and produce predictions that are difficult to certify for safety-critical deployment. This paper presents a framework that exploits the standardized typography of United States license plates as passive fiducial markers for metric ranging, resolving scale ambiguity through explicit geometric priors without any training data or active illumination. First, a four-method parallel plate detector achieves robust plate reading across the full automotive lighting range. Second, a three-stage state identification engine fusing optical character recognition text matching, multi-design color scoring, and a lightweight neural network classifier provides robust identification across all ambient conditions. Third, hybrid depth fusion with inverse-variance weighting and online scale alignment, combined with a one-dimensional constant-velocity Kalman filter, delivers smoothed distance, relative velocity, and time-to-collision for collision warning. Baseline validation on a controlled static dataset reproduces a 2.3% coefficient of variation in character height measurements and a 36% reduction in distance-estimate variance compared with plate-width methods from prior work.
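
A minimal sketch of three building blocks named above: metric ranging from an object of known size under a pinhole camera model, inverse-variance fusion of two distance estimates, and time-to-collision at constant closing speed. The character height, focal length, variances, and the second (22 m) estimate are placeholder values chosen for illustration, not figures from the paper, and the Kalman filter stage is omitted.

```python
def pinhole_distance(focal_px, real_height_m, pixel_height):
    """Pinhole model: an object of known height H spanning h pixels lies at Z = f * H / h."""
    return focal_px * real_height_m / pixel_height

def inverse_variance_fuse(estimates):
    """estimates: list of (value, variance). Returns the fused value and fused variance."""
    weights = [1.0 / var for _, var in estimates]
    fused = sum(w * value for w, (value, _) in zip(weights, estimates)) / sum(weights)
    return fused, 1.0 / sum(weights)

def time_to_collision(distance_m, closing_speed_mps):
    """Seconds until contact at constant closing speed; infinite if the gap is not shrinking."""
    return distance_m / closing_speed_mps if closing_speed_mps > 0 else float("inf")

CHAR_HEIGHT_M = 0.07   # assumed character height, for illustration only
FOCAL_PX = 1400.0      # assumed focal length in pixels

z_plate = pinhole_distance(FOCAL_PX, CHAR_HEIGHT_M, pixel_height=4.9)
# Fuse the typography-based range with a second, noisier hypothetical estimate.
fused, var = inverse_variance_fuse([(z_plate, 1.0), (22.0, 4.0)])
print(z_plate, fused, var, time_to_collision(fused, 5.0))
```

The known physical size is what removes the scale ambiguity: with `real_height_m` fixed, distance follows directly from the measured pixel height.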

2604.11661 2026-05-21 cs.LG cs.AI

Towards Autonomous Mechanistic Reasoning in Virtual Cells

Yunhui Jang, Lu Zhu, Jake Fawkes, Alisandra Kaye Denton, Dominique Beaini, Emmanuel Noutahi

AI summary: This paper introduces a structured explanation formalism for biological reasoning in virtual cells, in which mechanistic action graphs enable systematic verification and falsification, and proposes VCR-Agent, a multi-agent framework that combines biologically grounded knowledge retrieval with verifier-based filtering to generate and validate mechanistic reasoning.

Abstract

Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with a verifier-based filtering approach to generate and validate mechanistic reasoning autonomously. Using this framework, we release the VC-TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of a multi-agent framework and rigorous verification.

2604.11530 2026-05-21 cs.CV cs.AI

Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models

Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa

AI summary: This paper proposes SVD-Prune, which decomposes the vision token feature matrix with SVD and selects the top-k tokens by statistical leverage score, maintaining strong performance under extreme vision token budgets and outperforming prior pruning methods.

Abstract

Vision-Language Models (VLMs) have revolutionized multi-modal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-k tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.
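
A minimal sketch of leverage-score token selection, assuming NumPy is available. It uses the standard definition of rank-r statistical leverage scores (squared row norms of the top-r left singular vectors) and keeps the k highest-scoring tokens. The choice of `rank`, the lack of centering or normalization, and the toy matrix are assumptions; the paper's exact scoring and preprocessing may differ.

```python
import numpy as np

def select_tokens(features, k, rank):
    """Keep the k tokens with the largest rank-`rank` leverage scores.

    features: (num_tokens, dim) array of vision token features.
    Returns the kept token indices in their original order.
    """
    # Thin SVD: rows of U describe how each token loads on the singular directions.
    u, _, _ = np.linalg.svd(features, full_matrices=False)
    # Leverage score of token i = squared norm of row i of the top-`rank` left singular vectors.
    scores = np.sum(u[:, :rank] ** 2, axis=1)
    keep = np.argsort(-scores)[:k]
    return np.sort(keep)

# Toy features: tokens 0 and 3 carry almost all of the energy along one direction.
X = np.array([
    [10.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0],
    [8.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.0],
])
print(select_tokens(X, k=2, rank=1))
```

Because the scores come from a decomposition of the whole feature matrix, the selection reflects global structure rather than a per-token heuristic such as an attention score or token norm.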

2604.11071 2026-05-21 cs.CV cs.AI cs.LG

Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

基于分布归一化预处理和深度卷积U-Net的轻量级低光照图像增强

Shimon Murai, Teppei Kurita, Ryuta Satoh, Yusuke Moriuchi

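AI Summary This paper proposes a lightweight two-stage framework for low-light image enhancement that combines distribution-normalizing preprocessing with a depthwise U-Net, achieving competitive perceptual quality with far fewer parameters than existing methods.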

Comments Technical report for the NTIRE 2026 Efficient Low-Light Image Enhancement Challenge (CVPR 2026 Workshops), 3rd place solution

详情
AI中文摘要

我们提出了一种轻量级两阶段框架,用于低光照图像增强(LLIE),该框架在参数远少于现有方法的情况下实现了具有竞争力的感知质量。我们的方法结合了冻结的、基于算法的预处理与一个完全由深度可分离卷积构成的紧凑型U-Net。预处理通过提供互补的亮度校正视图来归一化输入分布,使可训练网络能够专注于残差颜色校正。我们的方法在CVPR 2026 NTIRE高效低光照图像增强挑战中获得了第三名。我们进一步提供了扩展的基准测试和消融实验以证明我们方法的通用有效性。

英文摘要

We present a lightweight two-stage framework for low-light image enhancement (LLIE) that achieves competitive perceptual quality with significantly fewer parameters than existing methods. Our approach combines frozen algorithm-based preprocessing with a compact U-Net built entirely from depthwise-separable convolutions. The preprocessing normalizes the input distribution by providing complementary brightness-corrected views, enabling the trainable network to focus on residual color correction. Our method achieved 3rd place in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge. We further provide extended benchmarks and ablations to demonstrate the general effectiveness of our methods.

2604.10245 2026-05-21 cs.CV physics.med-ph

Warm-Started Reinforcement Learning for Iterative 3D/2D Liver Registration

用于迭代3D/2D肝脏配准的热启动强化学习

Hanyuan Zhang, Lucas He, Zijie Cheng, Abdolrahim Kadkhodamohammadi, Danail Stoyanov, Brian R. Davidson, Evangelos B. Mazomenos, Matthew J. Clarkson

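AI Summary This paper proposes a discrete-action reinforcement learning framework for registering preoperative CT to intraoperative laparoscopic video. Its shared feature encoder is warm-started from a supervised pose estimation network to provide stable geometric features and faster convergence, enabling more efficient registration.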

Comments Laparoscopic Liver Surgery, Augmented Reality, Image Registration, Reinforcement Learning

详情
AI中文摘要

术前CT与术中腹腔镜视频之间的配准在增强现实(AR)导航中微创手术中起着关键作用。基于学习的方法最近在配准误差上实现了与基于优化的方法相当的性能,同时提供了更快的推理速度。然而,许多监督方法会产生粗略的对齐,需要额外的基于优化的细化,从而增加推理时间。我们提出了一种离散动作强化学习(RL)框架,将CT到视频的配准视为一个序列决策过程。一个共享的特征编码器从CT渲染和腹腔镜帧中提取表示,通过从监督姿态估计网络预热启动以提供稳定的几何特征和更快的收敛。一个RL策略头学习在六个自由度上选择刚体变换,并决定何时停止迭代。在公开的腹腔镜数据集上的实验表明,我们的方法实现了平均目标配准误差(TRE)为15.70毫米,与监督方法结合优化相当,同时实现了更快的收敛。所提出的基于RL的公式使自动、高效的迭代配准成为可能,而无需手动调整步长或停止标准。这种离散框架为未来连续动作和可变形配准模型在手术AR应用中的实际基础提供了支持。

英文摘要

Registration between preoperative CT and intraoperative laparoscopic video plays a crucial role in augmented reality (AR) guidance for minimally invasive surgery. Learning-based methods have recently achieved registration errors comparable to optimization-based approaches while offering faster inference. However, many supervised methods produce coarse alignments that rely on additional optimization-based refinement, thereby increasing inference time. We present a discrete-action reinforcement learning (RL) framework that formulates CT-to-video registration as a sequential decision-making process. A shared feature encoder, warm-started from a supervised pose estimation network to provide stable geometric features and faster convergence, extracts representations from CT renderings and laparoscopic frames, while an RL policy head learns to choose rigid transformations along six degrees of freedom and to decide when to stop the iteration. Experiments on a public laparoscopic dataset demonstrated that our method achieved an average target registration error (TRE) of 15.70 mm, comparable to supervised approaches with optimization, while achieving faster convergence. The proposed RL-based formulation enables automated, efficient iterative registration without manually tuned step sizes or stopping criteria. This discrete framework provides a practical foundation for future continuous-action and deformable registration models in surgical AR applications.

2604.07213 2026-05-21 cs.LG math.PR

Diffusion Processes on Implicit Manifolds

隐式流形上的扩散过程

Victor Kawasaki-Borruat, Clara Grotehans, Pierre Vandergheynst, Adam Gosztolai

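AI Summary This paper studies how to construct diffusion processes on a data manifold using only point cloud samples. It proposes Implicit Manifold-valued Diffusions (IMDs), which approximate the infinitesimal generator of the diffusion process and its carré-du-champ to define stochastic differential equations in the high-dimensional ambient space, lifting the intrinsic manifold process into ambient coordinates. Experiments show that IMDs remain confined to the data manifold and enable its exploration.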

Comments Comments are more than welcome!

详情
AI中文摘要

高维数据通常被认为位于低维流形上。我们研究如何仅使用点云样本,在不访问图表、投影或其他几何原始元素的情况下,构建该数据流形上的扩散过程。本文引入隐式流形值扩散(IMDs),一种数据驱动的数学形式体系,用于在原始高维空间中定义随机微分方程,以描述在底层流形上内在演化的带漂移布朗粒子。我们的构造基于使用数据上的邻近图近似扩散过程的相应无穷小生成元,并利用生成元的carré-du-champ算子,它编码了流形的局部切空间,并将内在过程提升到环境坐标中。我们证明随着样本数量的增长,我们的离散扩散过程在概率路径空间上依分布收敛到其光滑流形对应物。我们进一步提出一个欧拉-丸山(Euler-Maruyama)格式用于IMDs的数值积分。我们通过在合成流形和MNIST数据流形上的数值实验验证了我们的框架,显示IMDs能够保持约束在流形上并实现其引导探索。我们的工作为数据流形上的扩散过程提供了数学基础和实际实现,开辟了流形感知采样、探索和生成建模的新途径。

英文摘要

High-dimensional data are often assumed to lie on lower-dimensional manifolds. We study how to construct diffusion processes on this data manifold using only point cloud samples and without access to charts, projections, or other geometric primitives. Here, we introduce Implicit Manifold-valued Diffusions (IMDs), a data-driven mathematical formalism for defining stochastic differential equations in the original high-dimensional space that describe drifting Brownian particles evolving intrinsically on the underlying manifold. Our construction hinges on approximating the corresponding infinitesimal generator of the diffusion process using a proximity graph over the data and using the carré-du-champ of the generator, which encodes the local tangent spaces of the manifold and lifts the intrinsic process into ambient coordinates. We show that as the number of samples grows, our discrete diffusion process converges in law on the space of probability paths to its smooth manifold counterpart. We further present an Euler-Maruyama scheme for the numerical integration of IMDs. We validate our framework using numerical experiments on synthetic manifolds and the MNIST data manifold, showing that IMDs remain confined over the manifold and enable its guided exploration. Our work provides the mathematical foundation and practical implementations of diffusion processes on data manifolds, opening new avenues for manifold-aware sampling, exploration, and generative modeling.
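
For readers unfamiliar with the Euler-Maruyama scheme mentioned in this abstract, the following is a minimal sketch of the generic update for a scalar SDE dX = a(X) dt + b(X) dW in ordinary Euclidean space. It is not the IMD integrator from the paper; the function name `euler_maruyama` and the Ornstein-Uhlenbeck example parameters are assumptions chosen only for illustration.

```python
import math
import random

def euler_maruyama(drift, diffusion, x0, dt, n_steps, rng):
    """Simulate dX = drift(X) dt + diffusion(X) dW with the Euler-Maruyama update."""
    path = [x0]
    x = x0
    for _ in range(n_steps):
        # Brownian increment over a step of length dt: dW ~ N(0, dt).
        dw = rng.gauss(0.0, math.sqrt(dt))
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x)
    return path

# Illustrative Ornstein-Uhlenbeck process: dX = -theta * X dt + sigma dW.
theta, sigma = 1.0, 0.3
path = euler_maruyama(
    drift=lambda x: -theta * x,
    diffusion=lambda x: sigma,
    x0=2.0,
    dt=0.01,
    n_steps=1000,
    rng=random.Random(0),
)
print(len(path))
```

In the setting of the abstract, the drift and diffusion terms would instead be built from the graph-based generator estimate and its carré-du-champ, so that each step stays close to the data manifold.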

2604.06750 2026-05-21 cs.CV

How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

视觉-语言模型对连续驾驶场景的理解有多好?一项敏感性研究

Roberto Brusnicki, Mattia Piccinini, Johannes Betz

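AI Summary Through a systematic analysis of vision-language models (VLMs) on sequential driving scenes, this paper shows how input configurations affect model performance. Even top models reach only 57% accuracy, below the 65% achieved by humans under similar constraints, exposing significant gaps in VLMs' understanding of vehicle dynamics and temporal relations.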

Comments 8 pages, 5 figures

详情
AI中文摘要

视觉-语言模型(VLMs)越来越多地被提出用于自动驾驶任务,但它们在连续驾驶场景中的性能仍然缺乏充分的描述,尤其是在输入配置如何影响其能力方面。我们介绍了VENUSS(VLM评估在理解连续场景上),这是一个用于系统敏感性分析VLM在连续驾驶场景中性能的框架,为未来研究建立了基准。基于现有数据集,VENUSS从驾驶视频中提取时间序列,并在自定义类别中生成结构化评估。通过在2,600多个场景上比较25多个现有VLMs,我们揭示了即使顶级模型也只能达到57%的准确率,这并未达到在相似约束下人类的65%表现,并暴露了显著的能力差距。我们的分析表明,VLMs在静态物体检测方面表现优异,但在理解车辆动态和时间关系方面则存在困难。VENUSS提供了首次系统性的VLM敏感性分析,专注于输入图像配置(如分辨率、帧数、时间间隔、空间布局和展示模式)如何影响连续驾驶场景中的性能。补充材料可在https://TUM-AVS.github.io/VENUSS/上获得。

英文摘要

Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at https://TUM-AVS.github.io/VENUSS/.

2604.01341 2026-05-21 cs.CV q-bio.NC

Perceptual misalignment of texture representations in convolutional neural networks

卷积神经网络中纹理表示的感知偏差

Ludovica de Paolis, Fabio Anselmi, Alessio Ansuini, Eugenio Piasini

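AI Summary This paper studies how well texture representations in convolutional neural networks align with human perceptual content. It finds no connection between conventional measures of a CNN's quality as a model of the visual system and its alignment with human texture perception, suggesting that texture perception may involve mechanisms distinct from those captured by CNNs trained on object recognition.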

详情
AI中文摘要

视觉纹理的数学建模可以追溯到Julesz的直觉,即人类对纹理的感知基于图像特征之间的局部相关性。一种有影响力的纹理分析和生成方法将这一概念推广到卷积神经网络(CNNs)中非线性特征之间的线性相关性,这些特征被编译成Gram矩阵。鉴于CNNs常被用作视觉系统的模型,自然会问这些

英文摘要

Mathematical modeling of visual textures traces back to Julesz's intuition that texture perception in humans is based on local correlations between image features. An influential approach for texture analysis and generation generalizes this notion to linear correlations between the nonlinear features computed by convolutional neural networks (CNNs), compiled into Gram matrices. Given that CNNs are often used as models for the visual system, it is natural to ask whether such "texture representations" spontaneously align with the textures' perceptual content, and in particular whether those CNNs that are regarded as better models for the visual system also possess more human-like texture representations. Here we quantify the perceptual content captured by feature correlations computed for a diverse pool of CNNs, and we compare it to the models' perceptual alignment with the mammalian visual system as measured by Brain-Score. Surprisingly, we find that there is no connection between conventional measures of CNN quality as a model of the visual system and its alignment with human texture perception. We conclude that texture perception involves mechanisms that are distinct from those that are commonly modeled using approaches based on CNNs trained on object recognition, possibly depending on the integration of contextual information.
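
The Gram-matrix texture representation referred to in this abstract can be sketched as follows. The feature maps here are a toy array rather than real CNN activations, the function name `gram_matrix` is hypothetical, and dividing by the number of spatial positions is one common normalization convention, assumed here rather than taken from the paper.

```python
import numpy as np

def gram_matrix(feature_maps: np.ndarray) -> np.ndarray:
    """Channel-by-channel correlations of a (channels, height, width) feature tensor."""
    c, h, w = feature_maps.shape
    # Flatten the spatial dimensions so that each row holds one channel's responses.
    flat = feature_maps.reshape(c, h * w)
    # Entry (i, j) averages the product of channels i and j over all positions,
    # which discards spatial layout and keeps only feature co-occurrence statistics.
    return flat @ flat.T / (h * w)

# Toy usage with 2 channels on a 2x2 grid.
features = np.arange(8, dtype=float).reshape(2, 2, 2)
G = gram_matrix(features)
print(G)
```

Because the spatial dimensions are summed out, two images with the same local feature statistics but different layouts produce the same Gram matrix, which is why this representation is used for texture rather than object identity.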

2603.29183 2026-05-21 cs.LG cs.AI

IMPACT: Influence Modeling for Open-Set Time Series Anomaly Detection

IMPACT: 开集时间序列异常检测中的影响建模

Xiaohui Zhou, Yijie Wang, Hongzuo Xu, Weixuan Liang, Xiaoli Li, Guansong Pang

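AI Summary This paper proposes IMPACT, a framework that uses influence modeling to address the challenges of open-set time series anomaly detection: it learns an influence function, then uses the influence scores to generate realistic anomaly patterns and to decontaminate the training data.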

Comments Accepted by ICML 2026

详情
AI中文摘要

开集异常检测(OSAD)是一种新兴范式,旨在利用训练中观察到的异常类有限标记数据,在测试时识别已见和未见的异常。当前方法依赖简单的增强方法生成伪异常以复制未见异常。尽管在图像数据中表现良好,但这些方法在时间序列数据中效果不佳,因为未能保持其序列特性,导致异常模式变得琐碎或不真实。当训练数据被未标记异常污染时,问题进一步加剧。本文引入IMPACT,一种新的框架,利用影响建模方法解决这些挑战。关键见解是学习一个影响函数,以准确估计单个训练样本对建模的影响,然后利用这些影响分数生成语义上不同但真实的未见异常,同时将高影响样本重新利用为监督异常以净化数据。大量实验表明,IMPACT显著优于现有最先进方法,在各种OSAD设置和污染率下表现出更高的准确性。代码可在https://github.com/mala-lab/IMPACT获取。

英文摘要

Open-set anomaly detection (OSAD) is an emerging paradigm designed to utilize limited labeled data from anomaly classes seen in training to identify both seen and unseen anomalies during testing. Current approaches rely on simple augmentation methods to generate pseudo anomalies that replicate unseen anomalies. Despite being promising in image data, these methods are found to be ineffective in time series data due to the failure to preserve its sequential nature, resulting in trivial or unrealistic anomaly patterns. They are further plagued when the training data is contaminated with unlabeled anomalies. This work introduces IMPACT (Influence Modeling for oPen-set time series Anomaly deteCTion), a novel framework that leverages influence modeling for open-set time series anomaly detection, to tackle these challenges. The key insight is to i) learn an influence function that can accurately estimate the impact of individual training samples on the modeling, and then ii) leverage these influence scores to generate semantically divergent yet realistic unseen anomalies for time series while repurposing high-influential samples as supervised anomalies for anomaly decontamination. Extensive experiments show that IMPACT significantly outperforms existing state-of-the-art methods, showing superior accuracy under varying OSAD settings and contamination rates. Code is available at https://github.com/mala-lab/IMPACT.

2603.27747 2026-05-21 cs.CV cs.AI

AI-Powered Facial Mask Removal Is Not Suitable For Identification

基于AI的面部遮挡去除并不适合识别

Emily A Cooper, Hany Farid

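AI Summary This paper evaluates the efficacy and risks of commercial AI-powered facial unmasking, assessing whether the resulting faces can be reliably matched to true identities.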

详情
AI中文摘要

最近,众包在线刑事调查已使用生成式AI来增强低质量的视觉证据。在一项高关注度案件中,社交媒体用户传播了一张联邦执法人员涉致命枪击事件的

英文摘要

Recently, crowd-sourced online criminal investigations have used generative-AI to enhance low-quality visual evidence. In one high-profile case, social-media users circulated an "AI-unmasked" image of a federal agent involved in a fatal shooting, fueling a wide-spread misidentification. In response to this and similar incidents, we conducted a large-scale analysis evaluating the efficacy and risks of commercial AI-powered facial unmasking, specifically assessing whether the resulting faces can be reliably matched to true identities.

2603.26539 2026-05-21 cs.CL cs.AI

How Open Must Language Models be to Enable Reliable Scientific Inference?

语言模型必须多开放才能实现可靠的科学推断?

James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor, Cameron R. Jones, Sean Trott, Roger P. Levy, Benjamin K. Bergen, Micah Altman

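AI Summary This paper examines how the degree of openness of a language model affects the reliability of scientific inferences drawn from research that uses it. It argues that closed models are generally ill-suited for scientific purposes and recommends systematically identifying and mitigating threats to inference.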

详情
AI中文摘要

语言模型的开放程度如何影响基于其研究的科学推断?本文分析了模型构造和部署信息的限制如何威胁可靠的推断。我们论证当前封闭模型通常不适合科学用途(有例外情况),并讨论如何解决或缓解它们对可靠推断的威胁。我们建议在研究中使用模型时,应系统地识别潜在的推断威胁,并采取相应的缓解措施,同时提供具体模型选择的正当理由。

英文摘要

How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.

2603.23531 2026-05-21 cs.CL

Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction

大型语言模型通过目标-立场提取解构复杂的政治观点

Özgür Togay, Javier Garcia-Bernardo, Florian Kunneman, Anastasia Giachanou

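AI Summary This paper investigates whether large language models can unpack complex political opinions through the target-stance extraction task. Using a dataset of Reddit posts covering 138 distinct political targets, it evaluates a range of LLMs under zero-shot, few-shot, and context-augmented prompting strategies. The best models perform comparably to highly trained human annotators, showing that LLMs can extract complex political opinions with minimal supervision.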

详情
AI中文摘要

政治极化源于对政策、人物和议题的复杂信念相互作用。然而,大多数计算分析将话语简化为粗粒度的党派标签,忽视了这些信念之间的互动。这在在线政治对话中尤为明显,这些对话通常具有细微差别且涵盖广泛主题,使自动识别讨论目标和对它们的观点变得困难。在本研究中,我们探讨了大型语言模型(LLMs)是否能通过目标-立场提取(TSE)任务解决这一挑战,TSE是一种结合目标识别和立场检测的自然语言处理任务,能够更细致地分析政治观点。为此,我们构建了一个包含1,084个Reddit帖子的数据集,来自r/NeutralPolitics,涵盖138个不同的政治目标,并使用零样本、少样本和上下文增强提示策略评估了一系列专有和开源的LLM。我们的结果表明,最佳模型的表现与训练有素的人类标注者相当,并且在标注者间一致性较低的高难度帖子上仍保持稳健。这些发现表明,LLM能够以最小监督的方式提取复杂的政治观点,为计算社会科学和政治文本分析提供了可扩展的工具。

英文摘要

Political polarization emerges from a complex interplay of beliefs about policies, figures, and issues. However, most computational analyses reduce discourse to coarse partisan labels, overlooking how these beliefs interact. This is especially evident in online political conversations, which are often nuanced and cover a wide range of subjects, making it difficult to automatically identify the target of discussion and the opinion expressed toward them. In this study, we investigate whether Large Language Models (LLMs) can address this challenge through Target-Stance Extraction (TSE), a recent natural language processing task that combines target identification and stance detection, enabling more granular analysis of political opinions. For this, we construct a dataset of 1,084 Reddit posts from r/NeutralPolitics, covering 138 distinct political targets and evaluate a range of proprietary and open-source LLMs using zero-shot, few-shot, and context-augmented prompting strategies. Our results show that the best models perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement. These findings demonstrate that LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis.

2603.22727 2026-05-21 cs.LG eess.SP

Spiking Personalized Federated Learning for Brain-Computer Interface-Enabled Immersive Communication

基于脑机接口的沉浸式通信的脉冲个性化联邦学习

Chen Shang, Dinh Thai Hoang, Diep N. Nguyen, Jiadong Yu

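AI Summary This paper proposes an immersive communication framework that uses a brain-computer interface to acquire brain signals for inferring user-centric states such as intention and perception-related discomfort. A personalized federated learning model processes the brain signals, accommodating neurodiverse data while preventing leakage of sensitive brain-signal information, and embedded spiking neural networks reduce energy consumption. On a real brain-signal dataset, the method achieves the best overall identification accuracy while reducing inference energy by 6.46×.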

Comments 6 pages, 3 figures

详情
Journal ref INFOCOM Workshop, 2026
AI中文摘要

本文提出了一种新颖的沉浸式通信框架,利用脑机接口(BCI)获取脑信号以推断用户中心状态(例如意图和感知相关不适),从而在强个体差异下实现更个性化和稳健的沉浸式适应。具体而言,我们开发了一个个性化联邦学习(PFL)模型来分析和处理收集到的脑信号,该模型不仅能够适应神经多样性脑信号数据,还能防止敏感脑信号信息泄露。为了解决持续设备学习和推理在能量受限的沉浸终端(如头戴式显示器)中的能量瓶颈,我们进一步将脉冲神经网络(SNNs)嵌入到PFL中。通过利用稀疏、事件驱动的脉冲计算,SNN启用的PFL在保持竞争性个性化性能的同时,降低了训练和推理的计算和能耗。在真实脑信号数据集上的实验表明,我们的方法在整体识别准确率方面表现最佳,同时与传统人工神经网络基线相比,推理能耗降低了6.46倍。

英文摘要

This work proposes a novel immersive communication framework that leverages a brain-computer interface (BCI) to acquire brain signals for inferring user-centric states (e.g., intention and perception-related discomfort), thereby enabling more personalized and robust immersive adaptation under strong individual variability. Specifically, we develop a personalized federated learning (PFL) model to analyze and process the collected brain signals, which not only accommodates neurodiverse brain-signal data but also prevents the leakage of sensitive brain-signal information. To address the energy bottleneck of continual on-device learning and inference on energy-limited immersive terminals (e.g., head-mounted displays), we further embed spiking neural networks (SNNs) into the PFL. By exploiting sparse, event-driven spike computation, the SNN-enabled PFL reduces the computation and energy cost of training and inference while maintaining competitive personalization performance. Experiments on a real brain-signal dataset demonstrate that our method achieves the best overall identification accuracy while reducing inference energy by 6.46× compared with conventional artificial neural network-based personalized baselines.

2603.22430 2026-05-21 cs.LG

Inference Time Policy Optimization for Offline RL with Differentiable World Models

基于可微世界模型的离线强化学习推理时间策略优化

Rohan Deb, Stephen J. Wright, Arindam Banerjee

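AI Summary This paper proposes a method that optimizes policy parameters at inference time using a differentiable world model, improving offline reinforcement learning performance through end-to-end gradient computation, and studies the tradeoff between the compute cost and the gains of inference-time adaptation.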

详情

英文摘要

Offline Reinforcement Learning (RL) learns optimal policies from fixed datasets, training a policy once and deploying it at inference time without further refinement. Inspired by model predictive control (MPC), we introduce an inference time adaptation framework that utilizes a pretrained policy along with a learned world model. While existing world model and diffusion-planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference-time information to *optimize* the policy parameters on the fly. In contrast, our design is a Differentiable World Model (DWM) pipeline that enables end-to-end gradient computation through imagined rollouts for inference time policy optimization (ITPO). We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), and show that exploiting inference-time information to optimize the policy parameters yields consistent gains over strong offline RL baselines. Inference-time adaptation, however, is expensive: rollout generation and backpropagation dominate per-step compute. We study this tradeoff explicitly, showing that a suitable tilted version of one-step MeanFlow sampler recovers much of the gains at a fraction of the cost.

2603.16513 2026-05-21 cs.LG cs.AI

FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

FEAT: 一个线性复杂度的超大规模结构化数据基础模型

Zhenghang Song, Tang Qian, Lu Chen, Yushuai Li, Zhengke Hu, Bingbing Fang, Yumeng Song, Junbo Zhao, Sheng Zhang, Tianyi Li

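AI Summary This paper proposes FEAT, a linear-complexity foundation model for extremely large structured data. Its multi-layer dual-axis encoding architecture, built on an adaptive-fusion bidirectional state-space model, enables cross-tuple contextualization in linear time while supporting permutation-invariant representation learning.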

详情
AI中文摘要

结构化数据在医疗、金融和科学数据管理等领域被广泛应用。最近关于结构化数据基础模型(SFMs)的研究旨在支持在这些数据上的数据分析和挖掘任务,但将其应用于现实世界的企业数据库时仍面临可扩展性和泛化能力的挑战。首先,许多SFMs依赖于完全自注意力机制,这引入了O(N²)的计算瓶颈,并限制了可以同时处理的元组数量。其次,直接用线性复杂度序列模型替代注意力可能与结构化数据的排列不变性质相冲突,引入人为的顺序偏差并降低表示质量。此外,仅在合成数据上训练的模型可能难以泛化到现实世界数据库中常见的重尾和异质分布。为了解决这些挑战,我们提出了FEAT,一种用于超大规模结构化数据的线性复杂度基础模型。FEAT用多层双轴编码架构替代二次注意力。它集成了自适应融合双向状态空间模型(AFBM)与卷积门控线性注意力(Conv-GLA),在O(N)时间内实现跨元组上下文化,同时支持排列不变的表示学习。为了提高在现实数据偏斜下的鲁棒性,FEAT进一步采用混合结构因果预训练流水线,具有鲁棒的重建目标。在12个现实世界数据库基准测试中,FEAT在零样本任务上始终优于代表性的SFMs,并且与结构化数据样本长度线性扩展,达到高达50倍的推理延迟提升。

英文摘要

Structured data is widely used in domains such as healthcare, finance, and scientific data management. Recent studies on structured data foundation models (SFMs) aim to support data analysis and mining tasks over such data, but still face scalability and generalization challenges when applied to real-world enterprise databases. First, many SFMs rely on full self-attention, which introduces an O(N^2) computational bottleneck and limits the number of tuples that can be processed jointly. Second, directly replacing attention with linear-complexity sequence models may conflict with the permutation-invariant nature of structured data, introducing artificial order bias and degrading representation quality. Moreover, models trained only on synthetic data may struggle to generalize to the heavy-tailed and heterogeneous distributions commonly found in real-world databases. To address these challenges, we propose FEAT, a linear-complexity foundation model for extremely large structured data. FEAT replaces quadratic attention with a multi-layer dual-axis encoding architecture. It integrates an adaptive-fusion bidirectional state-space model (AFBM) with convolutional gated linear attention (Conv-GLA), enabling cross-tuple contextualization in O(N) time while supporting permutation-invariant representation learning. To improve robustness under real-world data skewness, FEAT further adopts a hybrid structural causal pre-training pipeline with a robust reconstruction objective. Experiments on 12 real-world database benchmarks show that FEAT consistently outperforms representative SFMs on zero-shot tasks and scales linearly with structured-data sample length, achieving up to 50x faster inference latency.
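
The O(N^2) versus O(N) contrast in this abstract can be made concrete with a generic kernelized linear-attention sketch. This is not FEAT's AFBM or Conv-GLA module, whose details the abstract does not give; the elu(x)+1 feature map, the function names, and the toy shapes are assumptions used only to show how reordering the matrix products removes the N-by-N score matrix.

```python
import numpy as np

def feature_map(x: np.ndarray) -> np.ndarray:
    # elu(x) + 1, a positive feature map often used in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(Q, K, V):
    # Builds the full (N, N) score matrix: O(N^2) time and memory in the number of tuples N.
    scores = feature_map(Q) @ feature_map(K).T
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def linear_attention(Q, K, V):
    # Same result, but K and V are summarized first: O(N) in the number of tuples.
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V          # (d, d_v) summary, independent of N
    z = phi_k.sum(axis=0)     # (d,) normalizer
    return (phi_q @ kv) / (phi_q @ z)[:, None]

# Toy check that both orderings agree on random inputs.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(128, 16)) for _ in range(3))
print(np.allclose(quadratic_attention(Q, K, V), linear_attention(Q, K, V)))
```

The summary `kv` is a sum over tuples, so it does not depend on their order; permutation sensitivity enters only once recurrent or convolutional components are added, which is the tension the abstract says FEAT is designed to resolve.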

2603.14392 2026-05-21 cs.LG cs.RO

WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems

WestWorld: 一种知识编码的可扩展轨迹世界模型用于多样化机器人系统

Yuchen Wang, Jiangtao Kong, Sizhe Wei, Xiaochang Li, Haohong Lin, Hongjue Zhao, Tianyi Zhou, Lu Gan, Huajie Shao

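AI Summary This paper proposes WestWorld, a knowledge-encoded scalable trajectory world model for diverse robotic systems. It introduces a system-aware Mixture-of-Experts (Sys-MoE) and a structural embedding to improve scalability and zero-shot generalization, supporting trajectory prediction and control across a variety of robotic environments.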

Comments ICML 2026 spotlight

详情
AI中文摘要

轨迹世界模型在机器人动力学学习、规划和控制中起着关键作用。尽管最近的研究已经探索了适用于多样化机器人系统的轨迹世界模型,但它们难以扩展到大量不同的系统动态,并忽略了物理结构的领域知识。为了解决这些限制,我们引入了WestWorld,一种针对多样化机器人系统的知识编码可扩展轨迹世界模型。为了解决可扩展性挑战,我们提出了一种新颖的系统感知混合专家(Sys-MoE),通过可学习的系统嵌入动态结合和路由针对不同机器人系统的专用专家。为进一步增强零样本泛化能力,我们通过引入结构嵌入来整合机器人物理结构的领域知识,使轨迹表示与形态学信息对齐。在预训练于89个复杂环境(涵盖多样化形态的仿真和现实世界设置)后,WestWorld在零样本和少样本轨迹预测上显著优于竞争基线。此外,它在广泛范围的机器人环境中的可扩展性表现出色,并在不同机器人上的下游基于模型的控制中显著提高了性能。最后,我们在现实世界中的Unitree Go1上部署了该模型,展示了稳定的移动性能。代码可在https://github.com/511205787/WestWorld上获取。

英文摘要

Trajectory world models play a crucial role in robotic dynamics learning, planning, and control. While recent works have explored trajectory world models for diverse robotic systems, they struggle to scale to a large number of distinct system dynamics and overlook domain knowledge of physical structures. To address these limitations, we introduce WestWorld, a knoWledge-Encoded Scalable Trajectory World model for diverse robotic systems. To tackle the scalability challenge, we propose a novel system-aware Mixture-of-Experts (Sys-MoE) that dynamically combines and routes specialized experts for different robotic systems via a learnable system embedding. To further enhance zero-shot generalization, we incorporate domain knowledge of robot physical structures by introducing a structural embedding that aligns trajectory representations with morphological information. After pretraining on 89 complex environments spanning diverse morphologies across both simulation and real-world settings, WestWorld achieves significant improvements over competitive baselines in zero- and few-shot trajectory prediction. Additionally, it shows strong scalability across a wide range of robotic environments and significantly improves performance on downstream model-based control for different robots. Finally, we deploy our model on a real-world Unitree Go1, where it demonstrates stable locomotion performance. The code is available at https://github.com/511205787/WestWorld.

2603.14184 2026-05-21 cs.CV cs.AI

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

思考越深,瞄准越弱:理解并缓解多模态大语言模型推理过程中的感知障碍

Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiaohui Li

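AI Summary This paper studies the visual perceptual impairment that arises in multimodal large language models during reasoning and proposes a training-free Visual Region-Guided Attention framework that selects visual heads and reweights their attention to steer the model toward question-relevant regions, improving visual grounding and reasoning accuracy.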

详情
AI中文摘要

多模态大语言模型(MLLMs)在进行扩展推理模式时常常出现感知障碍,特别是在视觉问答(VQA)任务中。我们识别出注意力分散是根本原因:在多步推理过程中,模型的视觉注意力变得分散并远离与问题相关区域,实际上“失去焦点”于视觉输入。为了更好地理解这一现象,我们分析了MLLMs的注意力图,并观察到推理提示显著减少了回答问题关键区域的注意力。我们进一步发现模型对图像标记的总体注意力与图像内注意力的空间分散性之间存在强相关性。基于这一见解,我们提出了一个无需训练的视觉区域引导注意力(VRGA)框架,该框架根据熵-聚焦准则选择视觉头部并重新加权其注意力,从而有效引导模型在推理过程中关注与问题相关区域。在视觉-语言基准上的广泛实验表明,我们的方法有效缓解了感知退化,从而在视觉定位和推理准确性方面取得改进,同时提供了可解释的见解,说明MLLMs如何处理视觉信息。

英文摘要

Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.

2603.13419 2026-05-21 cs.LG

Diffusion Models Memorize in Training -- and Generalize in Inference

扩散模型在训练中记忆,而在推理中泛化

Tim Kaiser, Markus Kollmann

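AI Summary This paper shows that diffusion models overfit the denoising objective during training, producing a performance gap between training and validation samples, yet model error moves sampling trajectories away from the distribution of noisy training samples, so the models generalize at inference.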

Comments 31 pages and 29 figures

详情
AI中文摘要

扩散模型在实践中泛化效果良好。然而,一个最优的扩散模型完全记忆训练数据,因此无法泛化,引发了一个问题:是什么因素使真实的扩散模型能够泛化?我们发现,尽管在样本层面泛化,扩散模型逐渐过度拟合去噪训练目标,从而在验证样本和训练样本的性能之间产生泛化差距。这种差距在中间噪声水平最明显。使用一个完全解析的、带误差的玩具模型,我们追踪了影响泛化差距的因素。我们发现,最优去噪流场在训练点周围高度局部化,但模型误差抑制了对训练点的精确回忆,从而产生一个平滑、泛化的流场。最后,我们发现训练中观察到的泛化差距并不会延续到推理阶段,否则生成样本将与训练样本高度相似。这是因为采样轨迹的中间状态足够远离模型训练所用的加噪训练样本分布。这些发现揭示了扩散模型泛化的全新图景:流场通过模型误差泛化,使采样轨迹远离加噪训练样本的区域,从而自然地防止过拟合。

英文摘要

Diffusion models generalize well in practice. However, an optimal diffusion model fully memorizes the training data and therefore fails to generalize, raising the question of what induces generalization in a real diffusion model. We show that, despite generalizing at the sample level, diffusion models progressively overfit the denoising training objective and thereby create a generalization gap between the performance on validation and training samples. This gap is most pronounced at intermediate noise levels. Using a fully analytic error-prone toy model, we trace the factors affecting the generalization gap. We find that the optimal denoising flow field localizes sharply around training points, but the model error suppresses the exact recall of training points, yielding a smooth, generalizing flow field. Finally, we find that the generalization gap observed in training does not translate to inference, which would result in a strong similarity between generated samples and training samples. This is because the intermediate states of sampling trajectories are sufficiently far from the distribution of noisy training samples the model is trained on. Together, these findings reveal a novel picture of how diffusion models generalize: the flow field generalizes through model error, which moves sampling trajectories outside the domain of noisy training samples and thereby naturally prevents overfitting.