arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2604.04940 2026-05-27 cs.AI

ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback

Cuong Van Duc, Minh Nguyen Dinh Tuan, Tam Vu Duc, Tung Vu Duy, Son Nguyen Van, Hanh Nguyen Thi, Binh Huynh Thi Thanh

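AI summary: For heuristic design in NP-hard combinatorial optimization problems, this paper proposes the ReVEL framework, which uses behavior-aware grouping and multi-turn iterative refinement so that an LLM jointly optimizes heuristics with accumulated performance feedback; experiments show it outperforms existing LLM-guided evolutionary baselines.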

Abstract

Designing effective heuristics for NP-hard combinatorial optimization problems remains challenging and often requires substantial domain expertise. Recent LLM-guided evolutionary methods have shown promise for automated heuristic generation, but most existing approaches refine heuristics independently or through limited pairwise feedback. We propose ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback, a framework for group-wise multi-turn heuristic refinement. ReVEL organizes heuristics into behavior-aware reflective groups, including similarity-driven groups for localized refinement and diversity-driven groups for exploratory search. Within each group, the LLM performs iterative multi-turn refinement using accumulated performance feedback, enabling related heuristics to be jointly analyzed and progressively improved across evolutionary iterations. Experiments on standard combinatorial optimization benchmarks show that ReVEL generally improves optimization performance over existing LLM-guided evolutionary baselines across multiple settings and LLM backbones. Additional analyses suggest that behavior-aware grouping contributes to more consistent refinement trajectories during iterative heuristic evolution.

2603.27146 2026-05-27 cs.CL

Learning to Predict Future-Aligned Research Proposals with Language Models

Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu, Jiawei Han, Heng Ji

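AI summary: This paper reframes research proposal generation as a time-sliced scientific forecasting problem, uses the Future Alignment Score (FAS) to evaluate whether a model anticipates the directions of papers published after the cutoff, and builds a time-consistent dataset with reasoning traces for training; experiments show that future-aligned fine-tuning significantly improves proposal quality.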

Abstract

Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after that cutoff. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 21,835 paper occurrences across 3,642 instances from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining a 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method. Our code and data are publicly available at https://github.com/Arthur-Heng/future-aligned-proposals.

2509.08289 2026-05-27 cs.CV

Dual-Thresholded Heatmap-Guided Proposal Clustering and Negative Certainty Supervision with Enhanced Base Network for Weakly Supervised Object Detection

Yuelin Guo, Haoyu He, Zhiyuan Chen, Zitong Huang, Renhao Lu, Lu Shi, Zejun Wang, Weizhe Zhang

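AI summary: Proposes the DANCE method, which combines dual-thresholded heatmap-guided proposal selection, an enhanced base network, and a negative certainty supervision loss to address incomplete pseudo GT boxes, the semantic gap, and slow convergence in weakly supervised object detection.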

Comments IEEE TIP Minor Revision

Abstract

Weakly supervised object detection (WSOD) has attracted significant attention in recent years, as it does not require box-level annotations. State-of-the-art methods generally adopt a multi-module network, which employs WSDDN as the multiple instance detection network module and uses multiple instance refinement modules to refine performance. However, these approaches suffer from three key limitations. First, existing methods tend to generate pseudo GT boxes that either focus only on discriminative parts, failing to capture the whole object, or cover the entire object but fail to distinguish between adjacent intra-class instances. Second, the foundational WSDDN architecture lacks a crucial background class representation for each proposal and exhibits a large semantic gap between its branches. Third, prior methods discard ignored proposals during optimization, leading to slow convergence. To address these challenges, we propose the Dual-thresholded heAtmap-guided proposal clustering and Negative Certainty supervision with Enhanced base network (DANCE) method for WSOD. Specifically, we first devise a heatmap-guided proposal selector (HGPS) algorithm, which utilizes dual thresholds on heatmaps to pre-select proposals, enabling pseudo GT boxes to both capture the full object extent and distinguish between adjacent intra-class instances. We then construct a weakly supervised basic detection network (WSBDN), which augments each proposal with a background class representation and uses heatmaps for pre-supervision to bridge the semantic gap between matrices. Finally, we introduce a negative certainty supervision (NCS) loss on ignored proposals to accelerate convergence. Extensive experiments on the challenging PASCAL VOC and MS COCO datasets demonstrate the effectiveness and superiority of our method. Our code is publicly available at https://github.com/gyl2565309278/DANCE.

1403.1076 2026-05-27 cs.AI

A Discussion to Qualify Intelligence

Kieran Greer

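AI summary: This paper attempts a unified definition of intelligence applicable to both the natural world and artificial intelligence, proposes a metric based on Kolmogorov complexity theory, and distinguishes intelligence from consciousness.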

Comments Newly edited version

Journal ref Scientific Insights, 2(1), pp. 1-15

Abstract

Our understanding of intelligence is directed primarily at the human level. This paper attempts to give a more unifying definition that can be applied to the natural world in general and then Artificial Intelligence. The definition would be used more to qualify than quantify it and might help when making judgements on the matter. While correct behaviour is the preferred definition, a metric that is grounded in Kolmogorov's Complexity Theory is suggested, which leads to a measurement about entropy. A version of an accepted AI test is then put forward as the 'acid test' and might be what a free-thinking program would try to achieve. Recent work by the author has been more from a direction of mechanical processes, built from structure. This paper agrees that intelligence is a pro-active event, but also notes a second aspect to it that is in the background and mechanical. The paper suggests looking at intelligence and the conscious as being slightly different, where the conscious is this more mechanical aspect. In fact, a surprising conclusion can be a passive but intelligent brain being invoked by active and less intelligent senses.

2604.03785 2026-05-27 cs.AI cs.MA

Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning

Zihong Gao, Hongjian Liang, Lei Hao, Liangjun Ke

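AI summary: To address information misalignment caused by cross-timestep communication delays in partially observable environments, the paper proposes the Communication Gain and Delay Cost (CGDC) metric and builds on it the actor-critic framework CDCMA, which predicts future observations and fuses delayed messages via attention to improve the performance, robustness, and generalization of cooperative multi-agent reinforcement learning.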

Abstract

Communication is essential for coordination in cooperative multi-agent reinforcement learning under partial observability, yet cross-timestep delays cause messages to arrive multiple timesteps after generation, inducing temporal misalignment and making information stale when consumed. We formalize this setting as a delayed-communication partially observable Markov game (DeComm-POMG) and decompose a message's effect into communication gain and delay cost, yielding the Communication Gain and Delay Cost (CGDC) metric. We further establish a value-loss bound showing that the degradation induced by delayed messages is upper-bounded by a discounted accumulation of an information gap between the action distributions induced by timely versus delayed messages. Guided by CGDC, we propose CDCMA, an actor-critic framework that requests messages only when predicted CGDC is positive, predicts future observations to reduce misalignment at consumption, and fuses delayed messages via CGDC-guided attention. Experiments on no-teammate-vision variants of Cooperative Navigation and Predator Prey, and on SMAC maps across multiple delay levels show consistent improvements in performance, robustness, and generalization, with ablations validating each component.

2604.00648 2026-05-27 cs.CV

DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization

Zhengxian Yang, Fei Xie, Xutao Xue, Rui Zhang, Taicheng Huang, Yang Liu, Mengqi Ji, Tao Yu

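AI summary: To address the information loss and blurred detail associated with fisheye camera input, the paper integrates a fisheye camera model into the 3DGS framework and introduces a feature-overlap-driven cross-view joint optimization strategy, enabling training on native fisheye images without preprocessing and improving reconstruction quality.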

Comments CVPR 2026 Highlight; Fix NSFC ID

Abstract

3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) Black borders at image edges cause information loss and negate the fisheye's large FOV advantage; 2) Undistortion's stretch-and-interpolate resampling spreads each pixel's value over a larger area, diluting detail density, which causes 3DGS to overfit these low-frequency zones and produces blur and floating artifacts. In this work, we integrate a fisheye camera model into the original 3DGS framework, enabling native fisheye image input for training without preprocessing. Despite correct modeling, we observed that the reconstructed scenes still exhibit floaters at image edges: distortion increases toward the periphery, and 3DGS's original optimization, which randomly selects a view at each iteration, ignores the cross-view correlations of a Gaussian, leading to extreme shapes (e.g., oversized or elongated) that degrade reconstruction quality. To address this, we introduce a feature-overlap-driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views, a technique equally applicable to existing pinhole-camera-based pipelines. Our DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets. Project Page: https://yzxqh.github.io/DirectFisheye-GS/ .

2603.25152 2026-05-27 cs.AI cs.IR

OMD-GraphRAG: Enhancing GraphRAG with Ontology-Guided Extraction, Multi-Dimensional Clustering and Dual-Channel Fusion

Jie Wang, Honghua Huang, Xi Ge, Jianhui Su, Wen Liu, Shiguo Lian

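AI summary: Proposes the OMD-GraphRAG framework, which improves GraphRAG performance on complex reasoning and multi-hop queries through ontology-guided knowledge extraction, multi-dimensional community clustering, and dual-channel graph retrieval fusion.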

Abstract

Retrieval-Augmented Generation (RAG) systems face significant challenges in complex reasoning, multi-hop queries, and domain-specific QA. While existing GraphRAG frameworks have made progress in structural knowledge organization, they still have limitations in knowledge extraction precision, community report integrity, and retrieval performance. This paper proposes OMD-GraphRAG, an enhanced framework built upon open-source GraphRAG. The framework introduces three core innovations: (1) Ontology-Guided Knowledge Extraction, which uses a predefined schema to guide LLMs in accurately identifying domain-specific entities and relations; (2) a Multi-Dimensional Community Clustering Strategy, which improves community completeness through alignment completion, attribute-based clustering, and multi-hop relationship clustering; (3) Dual-Channel Graph Retrieval Fusion, which balances QA accuracy and performance through hybrid graph and community retrieval. Evaluation results on the MultiHop-RAG benchmark show that OMD-GraphRAG outperforms mainstream open-source solutions (e.g., LightRAG) in comprehensive F1 scores, particularly on inference and temporal queries.

2602.02192 2026-05-27 cs.LG cs.DC

ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

Jingwei Song, Meng Chen, Jie Xiao, Qingnan Ren, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Zhisheng Chen, Ziqian Bi, Shuo Lu, Yiqun Duan, Xu Wang, Rymon Yu, Lynn Ai, Eric Yang, Tianyu Shi

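AI summary: Proposes ECHO-2, a distributed reinforcement learning framework that overlaps rollout generation, dissemination, and training and combines peer-assisted pipelined broadcast with cost-aware activation of heterogeneous workers, significantly improving cost efficiency while preserving reward performance.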

Comments 24 pages, 7 figures

Abstract

Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of LLMs ranging from 4B to 32B parameters under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.

2603.28730 2026-05-27 cs.RO cs.CL cs.CV

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza

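AI summary: Proposes the SOLE-R1 model, which uses video-language spatiotemporal reasoning to produce dense task-progress estimates as the sole reward signal, enabling zero-shot online reinforcement learning without ground-truth rewards, demonstrations, or task-specific tuning.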

Abstract

Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. We introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including Robometer, RoboReward, ReWiND, GPT-5, and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking. We release all models, data, code, and demos at the anonymous page: https://philip-mit.github.io/sole-r1/

2601.18987 2026-05-27 cs.CL cs.AI cs.PL

LLMs versus the Halting Problem: Characterizing Program Termination Reasoning

Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, Peter O'Hearn

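AI summary: This paper evaluates frontier LLMs on program termination reasoning and finds that GPT-5 and Claude Sonnet 4.5 reach the level of top verification tools at judging termination of C programs but frequently fail to produce formal proofs; it also introduces a divergence precondition formulation to describe non-termination conditions.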

Abstract

Determining whether a program terminates is a central problem in computer science. Turing's Halting Problem established termination as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Hence, verification tools approximate termination, sometimes failing to prove or disprove it; these tools rely on problem-specific architectures and are usually tied to particular programming languages. Recent advances in LLMs raise a natural question: to what extent can they reason about program termination? We evaluate frontier LLMs on a diverse set of C programs from the International Competition on Software Verification (SV Comp) 2025. Our results show that GPT-5 and Claude Sonnet 4.5 achieve scores comparable to top-ranked verification tools (with test-time scaling). However, while models often correctly infer whether programs terminate, they frequently fail to construct a witness as formal proof, revealing a gap between semantic recognition and symbolic proof generation. Performance further degrades as code length increases. To analyze this gap, we introduce a divergence precondition formulation that characterizes non-termination conditions as logical constraints. We hope these findings motivate future research on real-world termination benchmarks, neuro-symbolic approaches that combine LLMs with symbolic verification methods, and, more broadly, LLM reasoning on other undecidable problems.
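
To make the idea of a divergence precondition concrete, here is a toy illustration that is not taken from the paper. For the loop `while x > 0: x += y`, non-termination can be characterized by the logical constraint `x > 0 and y >= 0`. The sketch below cross-checks that hand-written constraint against bounded execution; the function names and the step budget are assumptions made for this example, and a step budget only approximates non-termination.

```python
def terminates_within(x: int, y: int, max_steps: int = 10_000) -> bool:
    """Run `while x > 0: x += y` under a step budget; True if the loop exits."""
    steps = 0
    while x > 0:
        if steps >= max_steps:
            return False  # budget exhausted: treated here as non-terminating
        x += y
        steps += 1
    return True


def divergence_precondition(x: int, y: int) -> bool:
    """Hypothetical constraint on the inputs under which the loop never exits."""
    return x > 0 and y >= 0


# Cross-check the constraint against bounded execution on a small input grid.
for x0 in range(-5, 6):
    for y0 in range(-5, 6):
        assert divergence_precondition(x0, y0) == (not terminates_within(x0, y0))
```

The constraint plays the role of a symbolic witness: it describes every diverging input at once, whereas bounded execution can only sample inputs one at a time.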

2603.25415 2026-05-27 cs.AI cs.RO

Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation

Roman Küble, Marco Hüller, Mrunmai Phatak, Rainer Lienhart, Jörg Hähner

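AI summary: Proposes a modular navigation component that modernises decision-making in embodied semantic scene graph generation by replacing the policy-optimisation method and redesigning the discrete action representation, and evaluates how different action sets and policy structures affect scene graph completeness, execution safety, and navigation behaviour.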

Abstract

Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. In Organic Computing, such models are a key enabler for objective-driven self-adaptation under uncertainty and resource constraints. The core challenge is to acquire observations maximising model quality and downstream usefulness within a limited action budget. Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off information gain against navigation cost and decide when additional actions yield diminishing returns. This work presents a modular navigation component for Embodied Semantic Scene Graph Generation and modernises its decision-making by replacing the policy-optimisation method and revisiting the discrete action formulation. We study compact and finer-grained, larger discrete motion sets and compare a single-head policy over atomic actions with a factorised multi-head policy over action components. We evaluate curriculum learning and optional depth-based collision supervision, and assess SSG completeness, execution safety, and navigation behaviour. Results show that replacing the optimisation algorithm alone improves SSG completeness by 21% relative to the baseline under identical reward shaping. Depth mainly affects execution safety (collision-free motion), while completeness remains largely unchanged. Combining modern optimisation with a finer-grained, factorised action representation yields the strongest overall completeness-efficiency trade-off.

2601.04426 2026-05-27 cs.AI

XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs

Linzhang Li, Yixin Dong, Guanjie Wang, Ziyi Xu, Alexander Jiang, Tianqi Chen

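AI summary: To address the challenges of dynamic structured generation in agentic LLMs (such as tool calling and response protocols), the paper proposes the XGrammar-2 engine, which achieves efficient compilation and near-zero overhead through tag-triggered structure switching and cross-grammar substructure caching.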

Comments 10 pages, ACM CAIS 26

Abstract

Modern LLM agents increasingly rely on dynamic structured generation, such as tool calling and response protocols. Unlike traditional structured generation with static structures, these workloads vary both across requests and within a request, posing new challenges to existing engines. We present XGrammar-2, a structured generation engine for dynamic agentic workloads. Our design is based on two key ideas: first-class support for tag-triggered structure switching, and fine-grained reuse across requests with different output structures. Concretely, XGrammar-2 introduces TagDispatch for dynamic structural dispatching and Cross-Grammar Cache for substructure-level cache reuse across grammars. It further improves efficiency with an Earley-based adaptive token mask cache, just-in-time compilation, and repetition state compression. Experiments show that XGrammar-2 achieves over 6x faster compilation than prior structured generation engines, and incurs near-zero end-to-end overhead in modern LLM serving systems.

2512.01678 2026-05-27 cs.LG cs.DC cs.PL

Morphling: Fast, Fused, and Flexible GNN Training at Scale

Anubhab, Rupesh Nasre

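AI summary: Proposes Morphling, a domain-specific code synthesizer that uses architecture-aware primitives and a runtime sparsity-aware execution engine to significantly raise GNN training throughput and reduce memory consumption on CPUs, GPUs, and in distributed settings.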

Abstract

Graph Neural Networks (GNNs) present a fundamental hardware challenge by fusing irregular, memory-bound graph traversals with regular, compute-intensive dense matrix operations. While frameworks such as PyTorch Geometric (PyG) and Deep Graph Library (DGL) prioritize high-level usability, they fail to address these divergent execution characteristics. As a result, they rely on generic kernels that suffer from poor cache locality, excessive memory movement, and substantial intermediate allocations. To address these limitations, we present Morphling, a domain-specific code synthesizer designed to bridge this gap. Morphling compiles high-level GNN specifications into portable, backend-specialized implementations targeting OpenMP, CUDA, and MPI. It achieves this by instantiating a library of optimized, architecture-aware primitives tailored to each execution environment. Morphling also incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths using input feature statistics, reducing unnecessary computation on zero-valued entries. We evaluate Morphling on eleven real-world datasets spanning diverse graph structures, feature dimensionalities, and sparsity regimes. Morphling improves per-epoch training throughput by an average of 20X on CPUs, 19X on GPUs, and 6X in distributed settings over PyG and DGL, with peak speedups reaching 66X. Morphling's memory-efficient layouts further reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware. These findings demonstrate that specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.
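
The runtime choice between dense and sparse execution paths can be pictured with a small sketch. This is not Morphling's code or API: the density statistic, the 0.5 threshold, and every function name below are assumptions chosen for illustration, and plain Python lists stand in for the optimized kernels the abstract describes.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def density(features: Matrix) -> float:
    """Fraction of non-zero entries in the feature matrix."""
    total = sum(len(row) for row in features)
    nonzero = sum(1 for row in features for v in row if v != 0.0)
    return nonzero / total if total else 0.0


def matmul_dense(features: Matrix, weights: Matrix) -> Matrix:
    """Plain product that touches every feature entry."""
    cols = len(weights[0])
    return [
        [sum(row[k] * weights[k][j] for k in range(len(row))) for j in range(cols)]
        for row in features
    ]


def matmul_sparse(features: Matrix, weights: Matrix) -> Matrix:
    """Product that skips zero-valued feature entries."""
    cols = len(weights[0])
    out = []
    for row in features:
        acc = [0.0] * cols
        for k, v in enumerate(row):
            if v != 0.0:
                for j in range(cols):
                    acc[j] += v * weights[k][j]
        out.append(acc)
    return out


def transform(features: Matrix, weights: Matrix,
              threshold: float = 0.5) -> Tuple[str, Matrix]:
    """Pick an execution path from an input statistic (threshold is assumed)."""
    if density(features) < threshold:
        return "sparse", matmul_sparse(features, weights)
    return "dense", matmul_dense(features, weights)


X = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]    # mostly zero features
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
path, result = transform(X, W)
```

Both paths compute the same product; only the amount of work on zero-valued entries differs, which is why a cheap statistic on the input is enough to choose between them.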

2603.23994 2026-05-27 cs.LG cs.AI

Understanding the Challenges in Iterative Generative Optimization with LLMs

Allen Nie, Xavier Daull, Zhiyi Kuang, Abhinav Akkiraju, Anish Chaudhuri, Max Piasevoli, Ryan Rong, YuCheng Yuan, Prerit Choudhary, Shannon Xiao, Rasool Fakoor, Adith Swaminathan, Ching-An Cheng

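AI summary: Through case studies, this paper shows that hidden design choices in LLM-based iterative generative optimization, namely the starting artifact, credit assignment, and batching, can determine whether optimization succeeds, and argues that the lack of a universal cross-domain way to set up learning loops is a major hurdle for productionization and adoption.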

Comments 39 pages, 17 figures

Abstract

Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet in practice remains brittle: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make "hidden" design choices: what can the optimizer edit, and what is the "right" learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and batching trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard, we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.

2603.20020 2026-05-27 cs.CV cs.AI

Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR

分离跳跃链接与$R$-探针:解耦特征聚合与梯度传播用于MLLM OCR

Ziye Yuan, Ruchang Yao, Chengxin Zheng, Yusheng Zhao, Daxiang Dong, Ming Zhang

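AI summary: To address the loss of fine-grained visual information that gradient interference causes in MLLMs on OCR tasks, the paper proposes Detached Skip-Links, which decouple forward feature aggregation from backward gradient propagation, and introduces $R$-Probe to diagnose the reconstructability of visual tokens, improving performance on OCR and general multimodal tasks.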

Comments Accepted by ICML 2026. Ziye Yuan and Ruchang Yao contributed equally to this work (co-first authors, listed in random order)

AI中文摘要

多模态大语言模型(MLLMs)擅长高级推理,但在OCR任务中失败,因为细粒度视觉细节被破坏或错位。我们发现了多层特征融合中一个被忽视的优化问题。跳跃路径引入了从高级语义目标到早期视觉层的直接反向传播路径。这种机制覆盖了低级信号并破坏了训练稳定性。为了缓解这种梯度干扰,我们提出了分离跳跃链接(Detached Skip-Links),这是一种最小的修改,在前向传播中重用浅层特征,同时在联合训练期间停止通过跳跃分支的梯度。这种非对称设计减少了梯度干扰,提高了稳定性和收敛性,且无需增加可学习参数。为了诊断细粒度信息是否被保留并可供LLM使用,我们引入了$R$-探针($R$-Probe),它使用从LLM前四分之一层初始化的浅层解码器测量投影视觉令牌的像素级可重构性。在多个ViT骨干网络和多模态基准测试中,以及高达7M训练样本的规模下,我们的方法持续改进了以OCR为中心的基准测试,并在通用多模态任务上取得了明显提升。

英文摘要

Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
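
The asymmetry described above, where shallow features are reused in the forward pass but receive no gradient through the skip branch, can be illustrated with a toy scalar example. This is a minimal sketch and not the authors' implementation: the `Dual` class, `detach`, and `fuse` are hypothetical names, and a simple squaring stands in for the deeper layers. In a real framework the same effect would typically come from a stop-gradient call such as `torch.Tensor.detach()` or `jax.lax.stop_gradient`.

```python
class Dual:
    """Forward-mode autodiff scalar: a value and its derivative w.r.t. one parameter."""

    def __init__(self, val, grad=0.0):
        self.val, self.grad = val, grad

    def __add__(self, other):
        return Dual(self.val + other.val, self.grad + other.grad)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.val * other.val,
                    self.val * other.grad + self.grad * other.val)


def detach(x):
    """Keep the forward value, drop the derivative (stop-gradient)."""
    return Dual(x.val, 0.0)


def fuse(w, x, detached):
    """Toy fusion: deep path plus a skip connection from the shallow feature."""
    shallow = w * x              # stand-in for an early visual feature
    deep = shallow * shallow     # stand-in for the later layers (assumption)
    skip = detach(shallow) if detached else shallow
    return deep + skip


w = Dual(3.0, 1.0)   # differentiate with respect to w
x = Dual(2.0)        # input, treated as a constant

plain = fuse(w, x, detached=False)
det = fuse(w, x, detached=True)

print(plain.val, det.val)    # forward outputs are identical: 42.0 42.0
print(plain.grad, det.grad)  # the skip branch adds gradient only when attached: 26.0 24.0
```

The forward value is unchanged by detaching; only the gradient contribution that would flow back through the skip branch disappears, which is the decoupling the abstract describes.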

2603.17685 2026-05-27 cs.LG

Flow Matching Policy Optimization with Mirror Descent and Entropy Constraints

基于镜像下降和熵约束的流匹配策略优化

Ting Gao, Stavros Orfanoudakis, Nan Lin, Winnie Daamen, Serge Hoogendoorn, Elvin Isufi

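AI summary: To address the challenge of balancing policy expressiveness with the exploration-exploitation trade-off in online RL, the paper proposes FMER, a framework based on ODE flow matching that combines simulation-free policy optimization, a tractable entropy objective, and dynamic temperature tuning, and achieves superior performance on sparse-reward tasks.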

AI中文摘要

平衡策略表达性与探索-利用权衡是在线强化学习(RL)中的核心挑战。虽然基于随机微分方程(SDE)的扩散策略可以表示复杂的多模态动作分布,但它们存在两个关键限制:其随机逆过程使熵难以处理(需要启发式探索),并且通过长去噪链计算策略梯度既昂贵又不稳定。在这项工作中,我们表明基于ODE的流匹配通过实现免模拟策略优化和可处理的熵计算,从本质上解决了这些问题。基于此,我们引入了基于镜像下降和熵约束的流匹配策略优化(FMER)。我们的框架以三种方式利用这一见解。首先,我们从理论上证明,最小化优势加权条件流匹配损失可以作为策略镜像下降的免模拟替代。这引导速度场朝向高价值区域,同时完全避免通过ODE求解器进行反向传播。其次,我们推导了一个解析熵目标,该目标校正了由$\tanh$变换(将无界潜在空间映射到有界动作)引起的密度失真,从而促进了有原则的最大熵优化。最后,我们基于有效样本量动态调整镜像下降温度,以在训练期间强制执行稳健的信任区域。实验评估表明,FMER在具有挑战性的稀疏奖励FrankaKitchen环境中实现了优越的性能,同时在标准密集奖励MuJoCo基准测试中保持了有竞争力的结果。

英文摘要

Balancing policy expressiveness with the exploration-exploitation trade-off is a core challenge in online Reinforcement Learning (RL). While Stochastic Differential Equation (SDE)-based diffusion policies can represent complex, multimodal action distributions, they suffer from two critical limitations: their stochastic reverse processes render entropy intractable (necessitating heuristic exploration), and computing policy gradients through long denoising chains is expensive and unstable. In this work, we show that ODE-based flow matching inherently resolves these issues by enabling both simulation-free policy optimization and tractable entropy computation. Building on this, we introduce Flow Matching Policy Optimization with Mirror Descent and Entropy Constraints (FMER). Our framework exploits this insight in three ways. First, we theoretically establish that minimizing an advantage-weighted conditional flow matching loss acts as a simulation-free surrogate for policy mirror descent. This steers the velocity field toward high-value regions while entirely avoiding backpropagation through the ODE solver. Second, we derive an analytic entropy objective that corrects for the density distortion caused by the $\tanh$ transformation (mapping an unbounded latent space to bounded actions), thereby facilitating principled maximum-entropy optimization. Finally, we dynamically tune the mirror descent temperature based on the effective sample size to enforce a robust trust region during training. Empirical evaluations demonstrate that FMER achieves superior performance on the challenging sparse-reward FrankaKitchen environment, while maintaining competitive results across standard dense-reward MuJoCo benchmarks.
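
Two quantities named in the abstract have standard textbook forms that a short sketch can make concrete: the change-of-variables correction for a $\tanh$-squashed action, and the effective sample size of advantage-based weights. The code below assumes a one-dimensional Gaussian latent and exponential weights $w_i = \exp(A_i / \tau)$; these choices, and all function names, are illustrative assumptions and are not taken from the paper, whose exact entropy objective and temperature rule are not given in the abstract.

```python
import math


def tanh_log_det(u):
    """log|d tanh(u)/du| = log(1 - tanh(u)^2), written in a numerically stable form."""
    x = -2.0 * u
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return 2.0 * (math.log(2.0) - u - softplus)


def squashed_log_prob(u, mu=0.0, sigma=1.0):
    """Log-density of a = tanh(u) when u ~ Normal(mu, sigma^2) (assumed latent)."""
    gauss = (-0.5 * ((u - mu) / sigma) ** 2
             - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))
    # Change of variables: subtract the log-Jacobian of the squashing map.
    return gauss - tanh_log_det(u)


def effective_sample_size(advantages, temperature):
    """Kish effective sample size of weights w_i = exp(A_i / temperature)."""
    m = max(advantages)  # shift for numerical stability; ESS is scale-invariant
    w = [math.exp((a - m) / temperature) for a in advantages]
    return sum(w) ** 2 / sum(x * x for x in w)


# A low temperature concentrates weight on the best sample,
# a high temperature spreads it out.
print(effective_sample_size([0.0, 10.0], temperature=0.1))
print(effective_sample_size([0.0, 10.0], temperature=100.0))
```

Monitoring the effective sample size in this way gives one plausible signal for raising or lowering the temperature: a value near 1 means a single sample dominates the update, a value near the batch size means the weights are almost uniform.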

2603.11790 2026-05-27 cs.LG

Disentangled Representation Learning through Unsupervised Symmetry Group Discovery

通过无监督对称群发现实现解缠表示学习

Barthélémy Dang-Nhu, Louis Annabi, Sylvain Argentieri

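AI summary: Proposes a method by which an embodied agent autonomously discovers the group structure of its action space through unsupervised interaction with the environment, proves identifiability of the true symmetry group decomposition under minimal assumptions, and derives two algorithms, one for discovering the group decomposition and one for learning Linear Symmetry-Based Disentangled (LSBD) representations.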

AI中文摘要

基于对称性的解缠表示学习利用环境变换的群结构来揭示潜在的变化因素。先前的基于对称性的解缠方法需要对称群结构的强先验知识,或对子群性质做出限制性假设。在这项工作中,我们通过提出一种方法消除了这些约束,该方法使具身智能体通过与环境的无监督交互自主发现其动作空间的群结构。我们证明了在最小假设下真实对称群分解的可识别性,并推导出两种算法:一种用于从交互数据中发现群分解,另一种用于在不假设特定子群性质的情况下学习线性对称基解缠(LSBD)表示。我们的方法在三个表现出不同群分解的环境中得到了验证,其性能优于现有的LSBD方法。

英文摘要

Symmetry-based disentangled representation learning leverages the group structure of environment transformations to uncover the latent factors of variation. Prior approaches to symmetry-based disentanglement have required strong prior knowledge of the symmetry group's structure, or restrictive assumptions about the subgroup properties. In this work, we remove these constraints by proposing a method whereby an embodied agent autonomously discovers the group structure of its action space through unsupervised interaction with the environment. We prove the identifiability of the true symmetry group decomposition under minimal assumptions, and derive two algorithms: one for discovering the group decomposition from interaction data, and another for learning Linear Symmetry-Based Disentangled (LSBD) representations without assuming specific subgroup properties. Our method is validated on three environments exhibiting different group decompositions, where it outperforms existing LSBD approaches.

2603.17218 2026-05-27 cs.CL cs.AI cs.GT

Alignment Makes Language Models Normative, Not Descriptive

对齐使语言模型变得规范,而非描述性

Eilam Shapira, Moshe Tennenholtz, Roi Reichart

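AI summary: Comparing 120 base-aligned model pairs on more than 10,000 real human decisions, the paper finds that alignment induces a normative bias: it improves prediction in one-shot textbook games but hurts prediction in multi-round strategic games, where behavior is shaped by descriptive dynamics such as reciprocity and retaliation.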

AI中文摘要

后训练对齐优化语言模型以匹配人类偏好信号,但这一目标并不等同于对观察到的人类行为进行建模。我们在多轮战略博弈——讨价还价、说服、谈判和重复矩阵博弈中,比较了120个基础-对齐模型对在超过10,000个真实人类决策上的表现。在这些设置中,基础模型在预测人类选择方面以近10:1的比例优于其对齐版本,这一结果在模型家族、提示表述和博弈配置中均稳健成立。然而,在人类行为更可能遵循规范预测的设置中,这一模式发生了逆转:对齐模型在所有12种测试的单轮教科书式博弈以及非战略彩票选择中占据主导地位——甚至在多轮博弈本身中,在交互历史发展之前的第一轮也是如此。这种边界条件模式表明,对齐诱导了规范性偏差:当人类行为相对较好地由规范性解决方案捕捉时,它改善了预测;但在多轮战略设置中,当行为由互惠、报复和依赖于历史的适应等描述性动态塑造时,它损害了预测。这些结果揭示了在优化模型以供人类使用和将其用作人类行为代理之间的根本权衡。

英文摘要

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.

2601.10566 2026-05-27 cs.CL cs.LG

Representation-Aware Unlearning via Activation Signatures: From Suppression to Entity-Signature Erasure

基于激活签名的表示感知遗忘:从抑制到实体签名擦除

Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib, K. M. Shadman Wadith, Nazia Tasnim, Farig Sadeque

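AI summary: Proposes the ERUF framework, which performs representation-level unlearning by mining entity-specific activation signatures and suppressing the corresponding directions, jointly achieving surface-level suppression, internal attenuation, and utility preservation.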

Comments 16 pages, 4 figures

AI中文摘要

实体级遗忘通常通过模型输出评估:是否停止命名目标、拒绝查询或改变真值比分布。然而,这些输出级测试无法显示主体的内部表示是否被衰减。我们引入实体表示遗忘框架(ERUF),这是一个表示感知框架,挖掘主体特定的激活签名,抑制相应的激活方向,并将行为蒸馏到LoRA参数中。在评估的基线中,ERUF是唯一同时实现表面级抑制、内部衰减和效用保留的方法。在TOFU forget10上,ERUF达到FQ=0.99和MU=0.62,匹配报告的神谕效用,同时接近神谕遗忘质量。在大多数标准基础模型设置中,ERUF保持低泄漏和低内部目标激活,SMR在0.00%至1.10%之间,EL10低于0.06,效用漂移低于3%。在Llama-3.1-8B上,对抗性实体恢复从63.89%降至20.15%,而名称无关恢复减少72.7%至77.4%。联合表面/内部诊断进一步揭示了推理优先模型中仅靠表面指标无法发现的尺度依赖行为。我们将这些结果解释为表示层面衰减的操作性证据,而非不可逆删除的正式保证。

英文摘要

Entity-level unlearning is usually evaluated by what a model says: whether it stops naming the target, refuses a query, or shifts a Truth Ratio distribution. These output-level tests, however, do not show whether a subject's internal representation has been attenuated. We introduce the Entity Representation Unlearning Framework (ERUF), a representation-aware framework that mines subject-specific activation signatures, suppresses the corresponding activation direction, and distills the behavior into LoRA parameters. Among evaluated baselines, ERUF is the only method that jointly achieves surface-level suppression, internal attenuation, and utility preservation. On TOFU forget10, ERUF achieves FQ = 0.99 and MU = 0.62, matching reported oracle utility while approaching oracle forget quality. Across most standard foundation-model settings, ERUF maintains low leakage and low internal target activation, with SMR between 0.00% and 1.10%, EL10 below 0.06, and utility drift below 3%. On Llama-3.1-8B, adversarial entity recovery falls from 63.89% to 20.15%, while name-agnostic recovery decreases by 72.7% to 77.4%. Joint surface/internal diagnostics further reveal scale-dependent behavior in reasoning-prior models that surface metrics alone would miss. We interpret these results as operational evidence of representation-level attenuation, not as a formal guarantee of irreversible deletion.

2601.03471 2026-05-27 cs.CL cs.AI

EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering and Reasoning

EpiQAL:大型语言模型在流行病学问答与推理中的基准测试

Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Ziyang Zhang, Carl Yang, Max S. Y. Lau, Qi He, Lu Cheng, Wei Jin

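AI summary: Proposes the EpiQAL benchmark, which evaluates LLMs on epidemiological reasoning through three subsets (factual recall, multi-step inference, and conclusion reconstruction under incomplete information), and finds that current models show limited performance, especially on multi-step inference.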

Comments 31 pages, 7 figures, 25 tables

AI中文摘要

可靠的流行病学推理需要综合研究证据来推断疾病负担、传播动态和人群层面的干预效果。现有的医学问答基准主要强调临床知识或患者层面的推理,但很少有系统评估基于证据的流行病学推理。我们提出了EpiQAL,这是首个针对多种疾病的流行病学问答诊断基准,包含三个从开放获取文献构建的子集。这三个子集逐步测试事实回忆、多步推理以及在不完整信息下的结论重建,并通过结合分类学指导、多模型验证和难度筛选的质量控制流程构建。对涵盖开源和专有系统的15个模型的实验表明,当前LLM在流行病学推理上表现有限,其中多步推理构成最大挑战。模型排名在不同子集间发生变化,仅靠规模并不能预测成功。思维链提示有利于多步推理,但在其他情况下效果不一。EpiQAL为证据基础、推断性推理和结论重建提供了细粒度的诊断信号。

英文摘要

Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The three subsets progressively test factual recall, multi-step inference, and conclusion reconstruction under incomplete information, and are constructed through a quality-controlled pipeline combining taxonomy guidance, multi-model verification, and difficulty screening. Experiments on fifteen models spanning open-source and proprietary systems reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence-grounding, inferential reasoning, and conclusion reconstruction.

2601.03079 2026-05-27 cs.CL

Learning to Diagnose and Correct Errors: Towards Moral Sensitivity Acquisition in Large Language Models

学习诊断和纠正错误:大型语言模型中的道德敏感性获取

Bocheng Chen, Xi Chen, Han Zi, Haitao Mao, Zimo Qi, Xitong Zhang, Kristen Johnson, Guangliang Liu

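AI summary: Proposes a pragmatic inference approach through which LLMs acquire moral sensitivity by diagnosing and correcting moral errors, and validates its effectiveness on multiple tasks.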

AI中文摘要

道德敏感性是人类道德能力最基础的能力。尽管许多方法旨在使大型语言模型(LLMs)与人类道德价值观对齐,但它们主要关注拟合道德适当文本的分布,而忽视了如何使LLMs获得道德敏感性。在本文中,我们朝着解决以下问题迈出了一步:LLMs如何获得道德敏感性?具体来说,我们提出了一种实用推理方法,通过使LLMs能够诊断和纠正道德错误来促进其道德敏感性的获取。我们的实用推理方法的一个核心优势在于其统一的视角:它不是对语义多样且复杂的表面形式进行道德话语建模,而是提供了一个基于推理负荷设计实用推理过程的原则性框架。实验证据表明,我们的实用方法能够使LLMs获得道德敏感性,并在多个任务中有效泛化。

英文摘要

Moral sensitivity is the most fundamental capability underlying human moral competence. Although many approaches aim to align large language models (LLMs) with human moral values, they primarily focus on fitting the distributions of morally appropriate texts while overlooking how to enable moral sensitivity acquisition in LLMs. In this paper, we take a step toward addressing the question: How can moral sensitivity be acquired in LLMs? Specifically, we propose a pragmatic inference approach that facilitates moral sensitivity acquisition in LLMs by enabling them to diagnose and correct moral errors. A central strength of our pragmatic inference approach lies in its unified perspective: rather than modeling moral discourses across semantically diverse and complex surface forms, it provides a principled framework for designing pragmatic inference procedures grounded in their inferential load. Empirical evidence demonstrates that our pragmatic approach can enable moral sensitivity acquisition in LLMs and generalizes effectively across tasks.

2603.16870 2026-05-27 cs.CV cs.AI

Demystifying Video Reasoning

揭秘视频推理

Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

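AI summary: Through experiments, this paper shows that reasoning in video diffusion models emerges mainly along the denoising steps, which it terms the Chain-of-Steps (CoS) mechanism, identifies emergent behaviors such as working memory, self-correction, and perception before action, and proposes a training-free ensembling strategy to improve reasoning.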

Comments Homepage: https://www.wruisi.com/demystifying_video_reasoning

AI中文摘要

近期视频生成的进展揭示了一个意外现象:基于扩散的视频模型展现出非平凡的推理能力。先前的工作将此归因于链式帧(CoF)机制,假设推理在视频帧间顺序展开。在本工作中,我们挑战这一假设,并揭示了一个根本不同的机制。我们表明视频模型中的推理主要沿着扩散去噪步骤涌现。通过定性分析和针对性探测实验,我们发现模型在早期去噪步骤中探索多个候选解,并逐步收敛到最终答案,我们将此过程称为链式步骤(CoS)。除了这一核心机制,我们还识别出对模型性能至关重要的几种涌现推理行为:(1)工作记忆,支持持久参考;(2)自我修正与增强,允许从不正确的中间解中恢复;(3)先感知后行动,早期步骤建立语义基础,后期步骤执行结构化操作。在扩散步骤内部,我们进一步揭示了扩散变换器中的自演化功能特化:早期层编码密集的感知结构,中间层执行推理,后期层巩固潜在表示。受这些见解的启发,我们提出了一种简单的无需训练的策略作为概念验证,展示了如何通过集成来自相同模型不同随机种子的潜在轨迹来改进推理。总体而言,我们的工作系统性地理解了推理如何在视频生成模型中涌现,为未来研究更好地利用视频模型固有的推理动态作为智能的新基础提供了基础。

英文摘要

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

2603.16654 2026-05-27 cs.CL cs.AI cs.LG

Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Omanic:迈向大语言模型多跳推理的逐步评估

Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li

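AI summary: To address the difficulty of diagnosing intermediate-step reasoning failures of LLMs in multi-hop QA, the paper proposes the Omanic benchmark, which decomposes questions into single-hop sub-questions and analyzes step-level errors, revealing a later-hop bottleneck, a factual knowledge floor, and error propagation; fine-tuning on OmanicSynth improves performance on several reasoning benchmarks.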

AI中文摘要

仅从最终答案评估大语言模型(LLM)的推理能力可能会掩盖中间步骤的失败,尤其是在没有步骤级标注的多跳问答基准中。为解决这一问题,我们引入了Omanic,一个开放域4跳问答基准,它不仅用于衡量最终答案的准确性,还用于诊断推理在何处中断。Omanic包含10,296个机器生成的训练示例(OmanicSynth)和967个经专家审核的人工标注评估示例(OmanicBench),每个评估问题被分解为单跳子问题、中间答案和结构化图拓扑。对专有和开源LLM的实验表明,Omanic具有挑战性,而逐步分析揭示了后期跳数瓶颈、事实知识下限以及沿推理链的错误传播。在OmanicSynth上微调可迁移到六个推理和数学基准,平均提升7.41分,验证了其作为推理能力迁移监督的有效性。我们在https://huggingface.co/datasets/li-lab/Omanic 发布数据,在https://github.com/XiaojieGu/Omanic 发布代码。

英文摘要

Evaluating the reasoning abilities of large language models (LLMs) solely from final answers can obscure failures in intermediate steps, especially in multi-hop QA benchmarks without step-level annotations. To address this gap, we introduce Omanic, an open-domain 4-hop QA benchmark designed not only to measure final-answer accuracy but also to diagnose where reasoning breaks down. Omanic contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench), with each evaluation question decomposed into single-hop sub-questions, intermediate answers, and structured graph topologies. Experiments with proprietary and open-source LLMs show that Omanic is challenging, while step-wise analysis reveals a later-hop bottleneck, factual knowledge floor, and error propagation along reasoning chains. Fine-tuning on OmanicSynth transfers to six reasoning and mathematics benchmarks, yielding a 7.41-point average gain and validating its effectiveness as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.

2603.13853 2026-05-27 cs.CL cs.AI

APEX-Searcher: Refining Credit Assignment with Subgoaling for Agentic Retrieval-Augmented Generation

APEX-Searcher: 通过子目标细化信用分配以增强智能体检索增强生成

Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao

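AI summary: To address ambiguous retrieval paths and sparse end-to-end RL rewards in complex multi-hop question answering, the paper proposes APEX-Searcher, which separates credit assignment for planning and execution (planning is optimized with RL, execution is learned with SFT) and achieves consistent gains across multiple benchmarks.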

AI中文摘要

检索增强生成(RAG)将大型语言模型(LLMs)与外部知识连接起来,但单轮检索通常不足以应对复杂的多跳问题。为了增强复杂任务的搜索能力,大多数现有工作通过端到端训练将多轮迭代检索与推理过程相结合。虽然这些方法提高了问题解决性能,但它们仍然面临任务推理和模型训练方面的挑战,尤其是模糊的检索执行路径和端到端强化学习(RL)中的稀疏奖励,这可能导致不准确的检索结果和较低的性能。我们将这些失败归因于层次化的信用纠缠:单一的最终奖励同时更新规划和执行,因此模型无法清晰地区分规划错误和检索错误。我们提出APEX-Searcher,它采用了一种细化信用分配的范式:规划通过带有规划级奖励的RL进行优化,而执行则通过SFT学习。大量实验表明,在多跳RAG和任务规划基准上均取得了一致的提升。

英文摘要

Retrieval-augmented generation (RAG) connects large language models (LLMs) to external knowledge, but single-round retrieval is often insufficient for complex multi-hop questions. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches improve problem-solving performance, they still face challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in end-to-end reinforcement learning (RL), which can lead to inaccurate retrieval results and lower performance. We attribute these failures to hierarchical credit entanglement: a single final reward updates planning and execution together, so the model cannot clearly separate plan errors from retrieval errors. We propose APEX-Searcher, which uses a Refining Credit Assignment paradigm: planning is optimized by RL with a plan-level reward, while execution is learned by SFT. Extensive experiments show consistent gains in both multi-hop RAG and task planning across benchmarks.

2603.15500 2026-05-27 cs.AI cs.LG

Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty

不确定性下通过策略信息分配理解LLM中的推理

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dongsheng Li, Yuqing Yang

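AI summary: This paper proposes an information-theoretic framework that separates reasoning into procedural advancement and epistemic verbalization (the token-level externalization of uncertainty), proves that sporadic verbalization restores convergence even without explicit error triggers, and shows empirically that small-scale SFT suffices to instill or suppress this capability, recasting reasoning as strategic information allocation under uncertainty.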

AI中文摘要

LLM 经常表现出“啊哈”时刻,例如在“Wait”等标记后进行自我修正,但其潜在机制仍不清楚。标准 LLM 主要通过无声发散崩溃,即轨迹偏离正确答案但仍保持局部连贯,因此没有显式错误触发反应性自我修正。我们引入一个信息论框架,将推理分解为程序推进和认知外化(不确定性的标记级外化),并证明零散外化能在没有显式错误触发的情况下恢复向正确答案的收敛。实验上,一个最小的怀疑线索即可恢复失败的轨迹,小规模 SFT 足以灌输或抑制这种能力,这表明强推理更少依赖于非凡的内在机制,而更多依赖于外化不确定性的语言习惯。我们的框架将推理重新定义为不确定性下的策略信息分配,为理解和推进 LLM 推理提供了新视角。

英文摘要

LLMs often exhibit Aha moments such as self-correction after tokens like "Wait," yet the underlying mechanism remains unclear. Standard LLMs collapse mainly through silent divergence, where trajectories drift from the correct answer yet remain locally coherent, so no explicit error triggers reactive self-correction. We introduce an information-theoretic framework that separates reasoning into procedural advancement and epistemic verbalization, the token-level externalization of uncertainty, and prove that sporadic verbalization restores convergence toward the correct answer even without explicit error triggers. Empirically, a minimal doubt cue recovers failed trajectories, and small-scale SFT suffices to instill or suppress this capability, suggesting that strong reasoning hinges less on an extraordinary inner mechanism than on the linguistic habit of externalizing uncertainty. Our framework recasts reasoning as strategic information allocation under uncertainty, offering a new lens for understanding and advancing LLM reasoning.

2603.13282 2026-05-27 cs.LG cs.AI

FedTreeLoRA: Reconciling Statistical and Functional Heterogeneity in Federated LoRA Fine-Tuning

FedTreeLoRA:协调联邦LoRA微调中的统计异质性与功能异质性

Jieming Bian, Lei Wang, Letian Zhang, Jie Xu

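AI summary: To address statistical and functional heterogeneity in federated LoRA fine-tuning, which are orthogonal in source yet coupled in interaction, the paper proposes FedTreeLoRA, a tree-structured aggregation framework that balances generalization and personalization through layer-wise alignment.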

Comments Accepted by ICML 2026

AI中文摘要

基于低秩自适应(LoRA)的联邦学习(FL)已成为隐私保护的大语言模型微调的标准方法。然而,现有的个性化方法主要在一种限制性的平面模型假设下运行:它们处理客户端的\textit{统计异质性},但将模型视为一个整体块,忽略了跨LLM层的\textit{功能异质性}。我们认为这两个维度,即统计(水平)异质性和功能(垂直)异质性,在来源上是正交的,但在交互中是耦合的,这意味着参数共享的最优深度在功能上依赖于客户端的相似性。为了解决这个问题,我们提出了\textbf{FedTreeLoRA},一个采用树结构聚合进行细粒度逐层对齐的框架。通过动态构建聚合层次结构,FedTreeLoRA允许客户端在浅层“树干”上共享广泛共识,同时在深层“树枝”上逐步特化。在NLU和NLG基准上的实验表明,FedTreeLoRA通过有效协调泛化与个性化,显著优于现有最先进方法。

英文摘要

Federated Learning (FL) with Low-Rank Adaptation (LoRA) has become a standard for privacy-preserving LLM fine-tuning. However, existing personalized methods predominantly operate under a restrictive Flat-Model Assumption: they address client-side \textit{statistical heterogeneity} but treat the model as a monolithic block, ignoring the \textit{functional heterogeneity} across LLM layers. We argue that these two dimensions, statistical (horizontal) and functional (vertical), are \textit{orthogonal in source yet coupled in interaction}, implying that the optimal depth of parameter sharing is functionally dependent on client similarity. To address this, we propose \textbf{FedTreeLoRA}, a framework employing tree-structured aggregation for fine-grained, layer-wise alignment. By dynamically constructing an aggregation hierarchy, FedTreeLoRA allows clients to share broad consensus on shallow 'trunks' while progressively specializing on deep 'branches'. Experiments on NLU and NLG benchmarks demonstrate that FedTreeLoRA significantly outperforms state-of-the-art methods by effectively reconciling generalization and personalization.
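
The idea of sharing shallow layers broadly while specializing deeper layers within smaller client groups can be sketched as layer-wise group averaging. This is only an illustration under strong assumptions: the tree is fixed by hand here (the abstract says it is constructed dynamically), plain averaging stands in for whatever aggregation rule the paper uses, and `average`, `tree_aggregate`, and the data layout are hypothetical.

```python
def average(vectors):
    """Element-wise mean of equally sized lists of floats."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def tree_aggregate(client_layers, groups_per_layer):
    """Aggregate each layer within the client groups assigned to that layer.

    client_layers[c][l] is client c's update vector for layer l.
    groups_per_layer[l] is a partition of the client ids for layer l:
    one big group near the trunk, finer groups toward the branches.
    """
    result = {c: [] for c in client_layers}
    for layer, groups in enumerate(groups_per_layer):
        for group in groups:
            mean = average([client_layers[c][layer] for c in group])
            for c in group:
                result[c].append(list(mean))
    return result


# Three clients, two layers; each "layer" is a one-element vector for brevity.
client_layers = {
    0: [[1.0], [10.0]],
    1: [[2.0], [20.0]],
    2: [[3.0], [60.0]],
}
groups_per_layer = [
    [[0, 1, 2]],     # shallow layer: all clients share one trunk
    [[0, 1], [2]],   # deep layer: clients 0 and 1 share a branch, client 2 is alone
]

merged = tree_aggregate(client_layers, groups_per_layer)
print(merged[0])  # [[2.0], [15.0]]
print(merged[2])  # [[2.0], [60.0]]
```

All three clients end up with the same shallow-layer parameters, while the deep layer differs between the two branches.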

2603.12754 2026-05-27 cs.CL

A Method for Learning Large-Scale Computational Construction Grammars from Semantically Annotated Corpora

一种从语义标注语料库学习大规模计算构式语法的方法

Paul Van Eecke, Katrien Beuls

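AI summary: Proposes a method for automatically learning large-scale, broad-coverage computational construction grammars from semantically annotated corpora, producing networks of tens of thousands of constructions that support frame-semantic analysis of open-domain text and reveal syntactico-semantic usage patterns.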

Comments Accepted for oral presentation at CoNLL 2026

AI中文摘要

我们提出了一种从语言使用语料库中学习大规模、广覆盖构式语法的方法。该方法从带有成分结构和语义框架标注的话语出发,促进学习可解释的计算构式语法,捕捉句法结构与所表达语义关系之间的复杂关联。生成的语法由在流体构式语法框架内形式化的数万个构式网络组成。这些语法不仅支持开放域文本的框架语义分析,还蕴含了关于学习数据中句法-语义使用模式的大量信息。该方法及学习到的语法有助于基于使用的构式主义语言方法的规模化,因为它们证实了若干基本构式语法猜想的可扩展性,同时为广覆盖语料库中英语论元结构的构式主义研究提供了实用工具。

英文摘要

We present a method for learning large-scale, broad-coverage construction grammars from corpora of language use. Starting from utterances annotated with constituency structure and semantic frames, the method facilitates the learning of human-interpretable computational construction grammars that capture the intricate relationship between syntactic structures and the semantic relations they express. The resulting grammars consist of networks of tens of thousands of constructions formalised within the Fluid Construction Grammar framework. Not only do these grammars support the frame-semantic analysis of open-domain text, they also house a trove of information about the syntactico-semantic usage patterns present in the data they were learnt from. The method and learnt grammars contribute to the scaling of usage-based, constructionist approaches to language, as they corroborate the scalability of a number of fundamental construction grammar conjectures while also providing a practical instrument for the constructionist study of English argument structure in broad-coverage corpora.

2603.09551 2026-05-27 cs.CV

GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision

GeoSolver: 利用细粒度过程监督扩展遥感中的测试时推理

Lang Sun, Ronghao Fu, Zhuoran Duan, Haoran Liu, Xueyan Liu, Bo Yang

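AI summary: Proposes the GeoSolver framework, which builds the large-scale process supervision dataset Geo-PRM-2M, trains the process reward model GeoPRM, and combines them with the Process-Aware Tree-GRPO reinforcement learning algorithm to enable verifiable step-by-step reasoning in remote sensing, achieving state-of-the-art performance on multiple benchmarks and supporting test-time scaling.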

Comments Code: https://github.com/yourname/GeoSolver

AI中文摘要

尽管视觉语言模型(VLM)显著推进了遥感解译,但使其能够执行复杂、逐步推理仍然极具挑战性。最近在该领域引入思维链(CoT)推理的努力显示出前景,但确保这些中间步骤的视觉忠实性仍是一个关键瓶颈。为解决这一问题,我们提出了GeoSolver,一个新颖的框架,将遥感推理转向可验证的、过程监督的强化学习。我们首先构建了Geo-PRM-2M,一个大规模的、令牌级过程监督数据集,通过熵引导的蒙特卡洛树搜索(MCTS)和有针对性的视觉幻觉注入合成。基于该数据集,我们训练了GeoPRM,一个令牌级过程奖励模型(PRM),提供细粒度的忠实性反馈。为了有效利用这些验证信号,我们提出了过程感知树GRPO,一种强化学习算法,将树结构探索与忠实性加权奖励机制相结合,以精确分配中间步骤的信用。大量实验表明,我们的最终模型GeoSolver-9B在多样化的遥感基准上实现了最先进的性能。至关重要的是,GeoPRM解锁了鲁棒的测试时扩展(TTS)。作为通用的地理空间验证器,它无缝地扩展了GeoSolver-9B的性能,并直接增强了通用VLM,突显了其卓越的跨模型泛化能力。

英文摘要

While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.

2512.09700 2026-05-27 cs.CV eess.IV

LiM-YOLO: Less is More with Pyramid Level Shift for Ship Detection in Optical Remote Sensing

LiM-YOLO:基于金字塔层级偏移的光学遥感舰船检测中少即是多

Seon-Hoon Kim, Yerin Kim, Hyeji Sim, Youeyun Jung, Okchul Jung, Daewon Chung

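AI summary: To address the spatial feature dilution that small target scales and high aspect ratios cause in deep feature pyramid levels for ship detection in optical remote sensing, the paper proposes the LiM-YOLO detector, which uses a Pyramid Level Shift Strategy to move the detection head from strides 8, 16, and 32 to strides 4, 8, and 16 and adds a group-normalized auxiliary projection module, surpassing larger state-of-the-art detectors with 64.1% fewer parameters than its YOLOv9 baseline.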

Comments 16 pages, 6 figures, 9 tables

AI中文摘要

通用目标检测器在应用于卫星图像中的舰船检测时面临根本性的结构限制,其中舰船尺度分布集中在较小尺寸和高长宽比。在传统的YOLO架构中,最深的特征金字塔层级(步长32)将窄长船只压缩为亚像素表示,导致严重的空间特征稀释并影响准确的舰船边界回归。我们提出Less is More YOLO,一种基于YOLOv9超大变体的精简检测器,以解决这些领域特定的结构冲突。通过对四个主要基准(SODA-A、DOTA-v1.5、FAIR1M-v2.0和ShipRSImageNet)中舰船尺度分布的统计分析,我们引入了一种金字塔层级偏移策略,将检测头从步长8、16、32移至步长4、8、16。该偏移满足基于奈奎斯特-香农原理推导出的最窄目标的空间可表示性条件,同时消除了最深金字塔层级的计算冗余。为了进一步稳定高分辨率卫星输入上的训练,我们引入了一个组归一化辅助投影模块,将组归一化引入投影路径,缓解了内存受限的微批量训练中的梯度不稳定性。在这四个数据集上验证,我们的检测器仅用21.16百万参数就达到了0.600的mAP_{50-95},相比超大YOLOv9基线(58.99百万参数)减少了64.1%。尽管尺寸紧凑,我们的模型超越了多达三倍大的最先进检测器,验证了有针对性的金字塔层级偏移实现了准确性与效率之间的“少即是多”平衡。代码可在https://github.com/egshkim/LiM-YOLO获取。

英文摘要

General-purpose object detectors face fundamental structural limitations when applied to ship detection in satellite imagery, where the ship scale distribution is concentrated at small sizes and high aspect ratios. In conventional You Only Look Once architectures, the deepest feature pyramid level (stride 32) compresses narrow vessels into sub-pixel representations, causing severe spatial feature dilution and compromising accurate ship boundary regression. We propose Less is More YOLO, a streamlined detector built upon the extra-large variant of YOLOv9, to address these domain-specific structural conflicts. From a statistical analysis of ship scale distributions across four major benchmarks (SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet), we introduce a Pyramid Level Shift Strategy that shifts the detection head from strides 8, 16, and 32 to strides 4, 8, and 16. This shift satisfies a spatial representability condition derived from the Nyquist-Shannon principle for the narrowest targets, while eliminating the computational redundancy of the deepest pyramid level. To further stabilize training on high-resolution satellite inputs, we incorporate a group-normalized auxiliary projection module that introduces Group Normalization into the projection path, mitigating gradient instability in memory-constrained micro-batch regimes. Validated on these four datasets, our detector attains an mAP_{50-95} of 0.600 with only 21.16 million parameters, a 64.1% reduction from the extra-large YOLOv9 baseline (58.99 million). Despite this compact size, our model surpasses state-of-the-art detectors up to three times larger, validating that a well-targeted pyramid level shift achieves a "Less is More" balance between accuracy and efficiency. The code is available at https://github.com/egshkim/LiM-YOLO.
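
The stride argument and the reported parameter reduction can be checked with simple arithmetic. The sketch below assumes a hypothetical vessel 10 pixels wide (the abstract gives no specific target size) and counts how many feature-map cells that width spans at each stride; the parameter counts are the ones stated in the abstract. The function name `cells_covered` is illustrative only, and the paper's actual representability condition is not reproduced here.

```python
def cells_covered(size_px, stride):
    """Number of feature-map cells spanned by an object dimension at a given stride."""
    return size_px / stride


BASELINE_STRIDES = (8, 16, 32)   # conventional YOLO detection heads
SHIFTED_STRIDES = (4, 8, 16)     # after the pyramid level shift

ship_width_px = 10  # hypothetical narrow vessel, chosen for illustration

baseline = [cells_covered(ship_width_px, s) for s in BASELINE_STRIDES]
shifted = [cells_covered(ship_width_px, s) for s in SHIFTED_STRIDES]

print(baseline)  # [1.25, 0.625, 0.3125]
print(shifted)   # [2.5, 1.25, 0.625]

# Parameter reduction from the counts reported in the abstract (in millions).
reduction = 1 - 21.16 / 58.99
print(round(reduction * 100, 1))  # 64.1
```

At stride 32 the assumed vessel covers less than one third of a cell, which is the sub-pixel compression the abstract refers to; at stride 4 it spans two and a half cells.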

2603.08413 2026-05-27 cs.LG cs.AI

Geometrically Constrained Outlier Synthesis

几何约束异常合成

Daniil Karzanov, Marcin Detyniecki

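AI summary: Proposes Geometrically Constrained Outlier Synthesis (GCOS), which generates geometrically constrained virtual outliers in the hidden feature space and combines them with contrastive regularization to improve the robustness of image classifiers to out-of-distribution samples.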

Comments 19 pages, accepted to ICML 2026

AI中文摘要

用于图像分类的深度神经网络通常对分布外(OOD)样本表现出过度自信。为了解决这个问题,我们引入了几何约束异常合成(GCOS),这是一种训练时正则化框架,旨在提高推理时的OOD鲁棒性。GCOS通过生成隐藏特征空间中尊重分布内(ID)数据学习到的流形结构的虚拟异常,解决了先前合成方法的局限性。合成分两个阶段进行:(i)从训练特征中提取的主方差子空间识别出几何信息引导的离流形方向;(ii)由校准集中非一致性得分的经验分位数定义的一个类共形壳,自适应地控制合成幅度以产生边界样本。该壳确保生成的异常既不是微不足道可检测的,也不是与分布内数据无法区分的,从而促进更平滑地学习鲁棒特征。这与对比正则化目标相结合,在选定的得分空间(如马氏距离或基于能量的)中促进ID和OOD样本的可分离性。实验表明,在近OOD基准测试(定义为异常与分布内数据共享相同语义域的任务)上,使用标准基于能量的推理时,GCOS优于最先进的方法。作为探索性扩展,该框架自然地过渡到共形OOD推理,将不确定性得分转化为统计上有效的p值,并启用具有形式误差保证的阈值,为更可预测和可靠的OOD检测提供了途径。

英文摘要

Deep neural networks for image classification often exhibit overconfidence on out-of-distribution (OOD) samples. To address this, we introduce Geometrically Constrained Outlier Synthesis (GCOS), a training-time regularization framework aimed at improving OOD robustness during inference. GCOS addresses a limitation of prior synthesis methods by generating virtual outliers in the hidden feature space that respect the learned manifold structure of in-distribution (ID) data. The synthesis proceeds in two stages: (i) a dominant-variance subspace extracted from the training features identifies geometrically informed, off-manifold directions; (ii) a conformally-inspired shell, defined by the empirical quantiles of a nonconformity score from a calibration set, adaptively controls the synthesis magnitude to produce boundary samples. The shell ensures that generated outliers are neither trivially detectable nor indistinguishable from in-distribution data, facilitating smoother learning of robust features. This is combined with a contrastive regularization objective that promotes separability of ID and OOD samples in a chosen score space, such as Mahalanobis or energy-based. Experiments demonstrate that GCOS outperforms state-of-the-art methods using standard energy-based inference on near-OOD benchmarks, defined as tasks where outliers share the same semantic domain as in-distribution data. As an exploratory extension, the framework naturally transitions to conformal OOD inference, which translates uncertainty scores into statistically valid p-values and enables thresholds with formal error guarantees, providing a pathway toward more predictable and reliable OOD detection.
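
The conformal step mentioned at the end of the abstract, turning an uncertainty score into a p-value with a formal error guarantee, has a standard form that can be sketched briefly. The code below uses the usual split-conformal p-value on a held-out calibration set of in-distribution nonconformity scores, where a larger score means a more unusual sample. The function names, the toy scores, and the threshold `alpha` are assumptions for illustration; the paper's own scoring function and procedure are not specified in the abstract.

```python
def conformal_p_value(test_score, calibration_scores):
    """p = (1 + #{calibration scores >= test score}) / (n + 1)."""
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least) / (n + 1)


def is_ood(test_score, calibration_scores, alpha=0.05):
    """Flag a sample as out-of-distribution when its p-value is at most alpha."""
    return conformal_p_value(test_score, calibration_scores) <= alpha


# Toy calibration set: nonconformity scores 1.0 .. 19.0 from in-distribution data.
calibration = [float(i) for i in range(1, 20)]

print(conformal_p_value(10.0, calibration))  # 0.55, typical of in-distribution data
print(conformal_p_value(100.0, calibration)) # 0.05, more extreme than every calibration score
print(is_ood(100.0, calibration))            # True
```

If calibration and test samples are exchangeable, an in-distribution sample is flagged with probability at most `alpha`, which is the kind of threshold guarantee the abstract alludes to.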