arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.13491 2026-05-27 cs.CV

FiRe: Fine-grained Multimodal Reasoning for Enhanced Image Generation

Yongjin Kim, Yoonjin Oh, Yerin Kim, Hyomin Kim, Jeeyoung Yun, Yujung Heo, Minjun Kim, Sungwoong Kim

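Affiliations: KT Corporation

AI summary: Proposes FiRe, which addresses the lack of fine-grained control in text-to-image generation through fine-grained multi-step reasoning and the reinforcement learning method FiRe-GRPO.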

Abstract

With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unified MLLMs for self-reflection and self-refinement, their use in text-to-image generation remains largely underexplored. Meanwhile, existing multimodal reasoning-based image generation methods mostly rely on prompt augmentation or holistic image-text alignment judgments, without fine-grained reflection on and refinement of detailed prompt attributes, leading to limited fine-grained control. To address this limitation, we propose FiRe, a Fine-grained Multimodal Reasoning method for enhanced image generation with MLLMs. Specifically, FiRe performs fine-grained multi-step reasoning: it first decomposes the prompt into key visual requirements, then self-judges whether each requirement is satisfied in the generated image, and finally applies localized refinement according to its own precise feedback. In addition, to further strengthen the MLLM's multimodal reasoning ability, we introduce FiRe-GRPO, a reinforcement learning method tailored to FiRe. Since standard Group Relative Policy Optimization (GRPO) suffers from sparse, outcome-based rewards in multi-step reasoning, we formulate our reasoning process as a step-level decision-making problem, design step-specific rewards, and compute step-level advantages for granular credit assignment within GRPO. Extensive experiments demonstrate that FiRe consistently outperforms competitive text-to-image baselines, including existing reasoning-based methods, with particularly substantial gains on compositional text-to-image benchmarks.
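The step-level credit assignment mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each sampled rollout in a group has three steps (decompose, judge, refine), that each step receives its own scalar reward, and that advantages are normalized within the group separately per step. The function name, the reward values, and the exact normalization are illustrative assumptions, not the paper's formulation.

```python
from statistics import mean, pstdev

def step_level_advantages(rewards, eps=1e-8):
    """Group-relative advantages computed per step (assumed normalization).

    rewards[g][s] is the reward of rollout g at reasoning step s.
    Returns a list of the same shape holding the advantages.
    """
    num_steps = len(rewards[0])
    advantages = [[0.0] * num_steps for _ in rewards]
    for s in range(num_steps):
        column = [rollout[s] for rollout in rewards]
        mu, sigma = mean(column), pstdev(column)
        for g, value in enumerate(column):
            # Each step is compared only against the same step in other rollouts.
            advantages[g][s] = (value - mu) / (sigma + eps)
    return advantages

# Hypothetical rewards for 4 rollouts x 3 steps (decompose, judge, refine).
group_rewards = [
    [1.0, 0.5, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.5, 1.0],
    [0.0, 0.0, 0.0],
]
adv = step_level_advantages(group_rewards)
```

With a single outcome reward, the first and third rollouts could not be told apart at the step level; here the first is credited for its decomposition step and the third for its refinement step.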

2604.13502 2026-05-27 cs.CL

Using reasoning LLMs to extract SDOH events from clinical notes

Ertan Dogan, Kunyu Yu, Yifan Peng

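Affiliations: Department of Population Health Sciences, Weill Cornell Medicine

AI summary: Proposes a prompt engineering approach based on reasoning LLMs that extracts structured SDOH events from clinical notes using four modules (concise prompts, few-shot learning, a self-consistency mechanism, and post-processing), reaching a micro-F1 score of 0.866 and balancing simple implementation with strong performance.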

Abstract

Social Determinants of Health (SDOH) refer to environmental, behavioral, and social conditions that influence how individuals live, work, and age. SDOH have a significant impact on personal health outcomes, and their systematic identification and management can yield substantial improvements in patient care. However, SDOH information is predominantly captured in unstructured clinical notes within electronic health records, which limits its direct use as machine-readable entities. To address this issue, researchers have employed Natural Language Processing (NLP) techniques using pre-trained BERT-based models, demonstrating promising performance but requiring sophisticated implementation and extensive computational resources. In this study, we investigated prompt engineering strategies for extracting structured SDOH events utilizing LLMs with advanced reasoning capabilities. Our method consisted of four modules: 1) developing concise and descriptive prompts integrated with established guidelines, 2) applying few-shot learning with carefully curated examples, 3) using a self-consistency mechanism to ensure robust outputs, and 4) post-processing for quality control. Our approach achieved a micro-F1 score of 0.866, demonstrating competitive performance compared to the leading models. The results demonstrated that LLMs with reasoning capabilities are effective solutions for SDOH event extraction, offering both implementation simplicity and strong performance.
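Two of the ingredients named in the abstract, self-consistency and micro-F1 scoring, can be sketched in a few lines. The sketch assumes that each sampled LLM output has already been parsed into a set of (event type, status) tuples, and that self-consistency means keeping events that recur in enough samples. The event labels, the vote threshold, and both function names are hypothetical; the paper's actual mechanism may differ.

```python
from collections import Counter

def self_consistent_events(samples, min_votes):
    """Keep events that appear in at least min_votes of the sampled outputs."""
    votes = Counter(event for sample in samples for event in set(sample))
    return {event for event, count in votes.items() if count >= min_votes}

def micro_f1(predicted, gold):
    """Micro-averaged F1 over parallel lists of predicted and gold event sets."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    fp = sum(len(p - g) for p, g in zip(predicted, gold))
    fn = sum(len(g - p) for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Three hypothetical sampled extractions for one clinical note.
samples = [
    {("Tobacco", "current"), ("Employment", "unemployed")},
    {("Tobacco", "current")},
    {("Tobacco", "current"), ("Alcohol", "past")},
]
kept = self_consistent_events(samples, min_votes=2)
gold = {("Tobacco", "current"), ("Employment", "unemployed")}
score = micro_f1([kept], [gold])
```

In this toy case voting discards the one-off "Alcohol" event, which raises precision, but it also drops the correct "Employment" event, which lowers recall; the threshold trades one against the other.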

2603.12564 2026-05-27 cs.CL cs.AI

Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents

Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

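Affiliations: Centre for Artificial Intelligence, University College London

AI summary: Studies how LLM agents in multi-turn financial recommendation produce risk-mismatched recommendations when tool outputs are manipulated, exposing an evaluation blind spot through experiments and analyzing the underlying mechanism.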

Abstract

People increasingly use LLM agents for multi-turn financial recommendations, where the agent pulls market data through tools and tracks user preferences across turns. When tool outputs are manipulated, the recommendations stop matching the user's stated risk profile, but because standard metrics like NDCG only score general relevance, risky and safe stocks score alike, so the metric says nothing went wrong. We call this gap evaluation blindness. We replay 23-turn financial advisory conversations across eight language models, running each dialogue twice with clean and manipulated tool data. Quality scores stay nearly identical to clean sessions while the agents produce risk-mismatched recommendations in 65-99% of turns, unanimous across all eight models. The mechanism is visible turn-by-turn: 80% of risk-score citations across 1,840 turns reproduce the manipulated value verbatim, not a single turn pushes back, and safe-language framing of high-risk stocks ranges from 14% (Qwen2.5-7B) to 69% (Claude Sonnet 4.6). The property that makes frontier models good agents, faithfully grounding their reasoning in tool outputs, also makes them follow manipulated ones. The damage is not memory-driven: contaminating only the current turn still produces 95% of the violations. The model internally distinguishes the manipulation (sparse autoencoder features separate adversarial from random perturbations), but this does not translate into safer output. Activation-level interventions recover under 6% of the safety gap, prompt-level self-verification fails because the self-check reads the same manipulated data, and a parametric cross-check that flags contamination at 99-100% per turn on a frontier model still leaves aggregate suitability unchanged: the agent identifies the tampering and recommends it anyway.
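The evaluation blindness described above can be reproduced on a toy example. The sketch below assumes a standard NDCG with linear gain and a log2 discount, plus a hypothetical suitability check that flags any recommended stock whose risk level exceeds the user's stated maximum. The tickers, relevance grades, and risk levels are invented for illustration only.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def risk_mismatch_rate(risk_levels, max_allowed_risk):
    """Fraction of recommended items riskier than the user's stated limit."""
    return sum(r > max_allowed_risk for r in risk_levels) / len(risk_levels)

# (hypothetical ticker, relevance grade, risk level on a 1-5 scale)
clean = [("BOND_ETF", 3, 1), ("UTIL_CO", 2, 2), ("INDEX_FUND", 1, 1)]
manipulated = [("MEME_CO", 3, 5), ("LEVERED_ETF", 2, 4), ("INDEX_FUND", 1, 1)]

ndcg_clean = ndcg([rel for _, rel, _ in clean])
ndcg_manipulated = ndcg([rel for _, rel, _ in manipulated])
mismatch_clean = risk_mismatch_rate([risk for _, _, risk in clean], max_allowed_risk=2)
mismatch_manipulated = risk_mismatch_rate([risk for _, _, risk in manipulated], max_allowed_risk=2)
```

Because NDCG reads only the relevance grades, both sessions receive the same score, while the suitability check separates them. This is the gap that a relevance-only metric cannot see.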

2604.13018 2026-05-27 cs.CL

Toward Autonomous Long-Horizon Engineering for ML Research

Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, Kai Jia

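Affiliations: GitHub

AI summary: Proposes AiScientist, a multi-agent system that uses a lightweight hierarchical research team and a File-as-Bus workspace to sustain cumulative progress in long-horizon ML research engineering, with substantial gains on PaperBench and MLE-Bench Lite.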

Comments Repo: https://github.com/AweAI-Team/AiScientist

Abstract

Agentic systems increasingly automate pieces of AI research. Yet turning underspecified research objectives into runnable, experimentally validated ML systems remains a central bottleneck. We study this operational setting as long-horizon ML research engineering: converting a research specification into a runnable ML system through repeated implementation, experimentation, and refinement. The central challenge is to sustain cumulative project progress across heterogeneous stages under delayed, confounded feedback. We introduce AiScientist, a multi-agent system built around thin control over thick state: a lightweight hierarchical research team coordinates through a File-as-Bus workspace that preserves decision-relevant artifacts across roles and invocations. On PaperBench, AiScientist improves over the strongest matched baselines by 9.92 and 11.15 points with Gemini-3-Flash and GLM-5, respectively. On MLE-Bench Lite, it reaches 81.82 Any Medal% under both backbones, improving over the strongest matched baselines by 4.55 and 16.67 points, and exceeding a Codex/GPT-5.5 xhigh frontier harness reference by 13.64 Any Medal points. Ablations and process analyses show that durable project state is central to later-round refinement: removing File-as-Bus lowers PaperBench score by 6.41 points and MLE-Bench Lite Any Medal% by 31.82 points. These results suggest that long-horizon AI research is not only a problem of stronger local reasoning, but a systems problem of maintaining cumulative, inspectable project progress.
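The idea of coordinating through files rather than through in-memory agent state can be pictured with a small sketch. Everything here is hypothetical: the `FileBus` class, the JSON layout, and the role and artifact names are illustrative assumptions and are not taken from the AiScientist repository. The only point shown is that an artifact written by one role survives into a later, independent invocation.

```python
import json
import tempfile
from pathlib import Path

class FileBus:
    """Hypothetical shared workspace: each role persists artifacts as JSON files."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def publish(self, role, name, payload):
        """Write an artifact so that other roles and later runs can read it."""
        path = self.root / f"{role}__{name}.json"
        path.write_text(json.dumps(payload))
        return path

    def read(self, role, name):
        """Load an artifact previously published by the given role."""
        return json.loads((self.root / f"{role}__{name}.json").read_text())

    def list_artifacts(self):
        """Return the names of all artifacts currently in the workspace."""
        return sorted(p.stem for p in self.root.glob("*.json"))

workspace_dir = tempfile.mkdtemp()
FileBus(workspace_dir).publish("planner", "plan", {"next_step": "run baseline"})

# A later invocation builds a fresh object and still sees the earlier decision.
later_run = FileBus(workspace_dir)
recovered = later_run.read("planner", "plan")
artifacts = later_run.list_artifacts()
```

Keeping state on disk makes it inspectable by humans and by other roles, which is the property the abstract credits for later-round refinement.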

2604.12918 2026-05-27 cs.CV

Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation

Ahmet İnanç, Özgür Erkent

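Affiliations: Hacettepe University

AI summary: Proposes the CTAB (Cross-Task Attention Bridge) module, which exchanges features between the detection and segmentation branches via multi-scale deformable attention in shared BEV space for joint 3D detection and segmentation; on nuScenes it improves segmentation while leaving detection essentially unchanged.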

Comments 8 pages, 5 figures, 3 Tables, Accepted at Radar in Robotics: New Frontiers workshop, at IEEE International Conference on Robotics & Automation (ICRA), 2026

Abstract

Bird's-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity for cross-task feature sharing: object-level geometric cues from detection can sharpen segmentation, while dense road-layout context from segmentation can anchor detection. We propose CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model achieves 51.0 mIoU-4 while simultaneously providing competitive 3D detection.

2604.11467 2026-05-27 cs.AI cs.HC cs.LG

From Attribution to Action: A Human-Centered Application of Activation Steering

Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin

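Affiliations: Fraunhofer Heinrich-Hertz-Institut; Technische Universität Berlin; BIFOLD – Berlin Institute for the Foundations of Learning and Data

AI summary: Proposes an interactive workflow that combines SAE-based attribution with activation steering; expert interviews indicate that it supports a shift from inspection to intervention and reveal debugging strategies such as component suppression, along with potential risks.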

Abstract

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.
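Component suppression, the strategy most participants used, can be sketched as a simple vector operation. The sketch assumes that a concept identified by attribution corresponds to a single direction in activation space (for example, an SAE decoder vector) and that steering rescales the activation's component along that direction. The function name, the scaling convention, and the toy vectors are illustrative assumptions, not the tool's actual interface.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(activation, direction, scale):
    """Rescale the activation's component along `direction`.

    scale=0.0 suppresses the concept, scale=1.0 leaves the activation
    unchanged, and scale>1.0 amplifies the concept (assumed convention).
    """
    coeff = dot(activation, direction) / dot(direction, direction)
    return [a + (scale - 1.0) * coeff * d for a, d in zip(activation, direction)]

# Toy activation and a hypothetical concept direction.
activation = [2.0, 1.0, -1.0]
direction = [1.0, 0.0, 0.0]

suppressed = steer(activation, direction, scale=0.0)
amplified = steer(activation, direction, scale=2.0)
```

After steering, the model's downstream response is re-evaluated on the same instance, which is what turns an attribution into a testable hypothesis. The ripple effects that participants flagged arise because real concept directions are rarely orthogonal to everything else the model encodes.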

2505.23606 2026-05-27 cs.LG cs.CV

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan

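Affiliations: M-E-AGI-Lab

AI summary: Proposes Muddit, a unified discrete diffusion transformer that combines the strong visual priors of a pretrained text-to-image backbone with a lightweight text decoder for fast parallel generation across text and image modalities, performing competitively with or better than much larger autoregressive models in quality and efficiency.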

Comments Accepted to ICLR 2026. Codes and Supplementary Material: https://github.com/M-E-AGI-Lab/Muddit

Abstract

Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce the second-generation Meissonic: Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

2604.11056 2026-05-27 cs.LG cs.AI

Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR

Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang

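Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Huawei Technologies Ltd.

AI summary: Analyzes the upper bound on token-level credit in RLVR via conditional mutual information, proposes a four-quadrant decomposition to separate update directions, and designs HAPO, which performs capacity-guided advantage reallocation and achieves competitive performance on mathematical reasoning benchmarks.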

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of Large Language Models (LLMs), but sparse outcome rewards make token-level credit assignment difficult. We study token-level credit as a reward-conditioned shift from the behavior policy to a hindsight posterior. In autoregressive RLVR, this shift can be expressed through Conditional Mutual Information (CMI), which shows that token entropy upper-bounds possible hindsight credit. Entropy, however, indicates capacity rather than update direction, so we introduce the Four Quadrant Decomposition to separate updates by reward polarity and token entropy. Controlled interventions show that these two factors jointly shape token updates. Sustained reasoning gains concentrate in signed high-entropy quadrants, whereas low-entropy updates saturate quickly. Based on this analysis, we propose Hindsight-Aware Policy Optimization (HAPO), a sign-preserving modification to GRPO that performs capacity-guided advantage reallocation. Experiments on mathematical reasoning benchmarks in two model settings show that HAPO achieves competitive performance among entropy-aware baselines.
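The Four Quadrant Decomposition can be made concrete with a small sketch. It assumes that token entropy is the Shannon entropy of the policy's next-token distribution and that tokens are split into high and low entropy by a fixed threshold. The threshold value, the quadrant labels, and the function names are illustrative assumptions; the paper may define the split differently.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def quadrant(reward_positive, entropy, threshold):
    """Assign a token update to one of four (reward polarity, entropy) quadrants."""
    polarity = "positive" if reward_positive else "negative"
    level = "high-entropy" if entropy >= threshold else "low-entropy"
    return f"{polarity}/{level}"

# Two toy next-token distributions over a 4-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]   # the model is almost certain
flat = [0.25, 0.25, 0.25, 0.25]     # the model is undecided

THRESHOLD = 0.5  # hypothetical cut-off in nats

q_flat_correct = quadrant(True, token_entropy(flat), THRESHOLD)
q_peaked_wrong = quadrant(False, token_entropy(peaked), THRESHOLD)
```

The intuition behind the entropy bound is that a token the policy already predicts with near certainty carries little information about the eventual reward, so there is little credit to assign to it. Undecided tokens are where the reward signal can move the policy most.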

2604.10102 2026-05-27 cs.CV cs.AI

Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

Zongyou Yang, Yinghan Hou, Xiaokun Yang

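Affiliations: Department of Computer Science; University College London; Department of Earth Science and Engineering; Imperial College London; School of Electronic Information

AI summary: Proposes Degradation-Consistent Paired Training (DCPT), which uses feature-consistency and prediction-consistency constraints to explicitly strengthen robustness to real-world degradations such as JPEG compression and Gaussian blur, raising degraded-condition average accuracy on the Synthbuster benchmark by 9.1 percentage points.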

Comments 6 pages, 5 figures, 2 tables

Abstract

AI-generated image detectors suffer significant performance degradation under real-world image corruptions such as JPEG compression, Gaussian blur, and resolution downsampling. We observe that state-of-the-art methods, including B-Free, treat degradation robustness as a byproduct of data augmentation rather than an explicit training objective. In this work, we propose Degradation-Consistent Paired Training (DCPT), a simple yet effective training strategy that explicitly enforces robustness through paired consistency constraints. For each training image, we construct a clean view and a degraded view, then impose two constraints: a feature consistency loss that minimizes the cosine distance between clean and degraded representations, and a prediction consistency loss based on symmetric KL divergence that aligns output distributions across views. DCPT adds zero additional parameters and zero inference overhead. Experiments on the Synthbuster benchmark (9 generators, 8 degradation conditions) demonstrate that DCPT improves the degraded-condition average accuracy by 9.1 percentage points compared to an identical baseline without paired training, while sacrificing only 0.9% clean accuracy. The improvement is most pronounced under JPEG compression (+15.7% to +17.9%). Ablation further reveals that adding architectural components leads to overfitting on limited training data, confirming that training objective improvement is more effective than architectural augmentation for degradation robustness.
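The two consistency constraints can be written out directly from their description. The sketch below assumes that the feature term is one minus the cosine similarity of the clean and degraded feature vectors, that the prediction term is the average of the two KL directions between softmax outputs, and that the two terms are combined with weights `w_feat` and `w_pred`. The weights, the averaging convention, and the toy inputs are assumptions; in training these terms would be added to the usual classification loss.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Average of both KL directions (assumed convention)."""
    return 0.5 * (kl(p, q) + kl(q, p))

def paired_consistency_loss(feat_clean, feat_degraded, logits_clean, logits_degraded,
                            w_feat=1.0, w_pred=1.0):
    """Feature consistency plus prediction consistency for one clean/degraded pair."""
    feature_term = cosine_distance(feat_clean, feat_degraded)
    prediction_term = symmetric_kl(softmax(logits_clean), softmax(logits_degraded))
    return w_feat * feature_term + w_pred * prediction_term

# Identical views incur (numerically) zero loss; diverging views are penalized.
same = paired_consistency_loss([1.0, 2.0], [1.0, 2.0], [2.0, -1.0], [2.0, -1.0])
shifted = paired_consistency_loss([1.0, 2.0], [2.0, 1.0], [2.0, -1.0], [0.5, 0.5])
```

Both terms act only on training-time outputs, which is consistent with the claim that the method adds no parameters and no inference overhead.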

2604.10095 2026-05-27 cs.CV

Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models

Yu Jiang, Hanwen Jiang, Ahmed Abdelkader, Wen-Sheng Chu, Brandon Y. Feng, Zhangyang Wang, Qixing Huang

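Affiliations: The University of Texas at Austin; Shanghai Jiao Tong University; Adobe Research; Google Research

AI summary: Generates synthetic data and extracts LoRA subspaces associated with variations in texture, geometry, camera motion, and lighting; finds these subspaces approximately disentangled, and shows that integrating them yields a reduced subspace that improves fine-tuning efficiency and prediction accuracy on downstream tasks.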

Comments 10 pages, 8 figures. Code here: https://github.com/jpppppppppppppppppppppppp/Subspaces-Mining-for-VGGT

Abstract

With the emergence of 3D foundation models, there is growing interest in fine-tuning them for downstream tasks, where LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in texture, geometry, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA subspaces associated with each type of variation? 2) Are these subspaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA subspace associated with each type of variation. We show that these subspaces are approximately disentangled. Integrating them leads to a reduced LoRA subspace that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA subspace, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach.
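Question 2, whether two subspaces are orthogonal, can be checked numerically. The sketch below assumes that each attribute subspace is given by a handful of spanning vectors, orthonormalizes them with Gram-Schmidt, and reports the mean squared cosine of the principal angles between the two subspaces (0 for orthogonal subspaces, 1 when one contains the other). The overlap score and the toy vectors are illustrative assumptions, not the paper's measure; real LoRA subspaces would come from the adapters' weight updates.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt orthonormalization; linearly dependent vectors are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def subspace_overlap(vectors_a, vectors_b):
    """Mean squared cosine of the principal angles between two subspaces."""
    qa, qb = orthonormal_basis(vectors_a), orthonormal_basis(vectors_b)
    # The squared Frobenius norm of Qa^T Qb equals the sum of squared cosines.
    total = sum(dot(a, b) ** 2 for a in qa for b in qb)
    return total / min(len(qa), len(qb))

# Hypothetical 4-dimensional "texture" and "geometry" subspaces.
texture = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
geometry = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
mixed = [[1.0, 0.0, 1.0, 0.0]]

disjoint = subspace_overlap(texture, geometry)
identical = subspace_overlap(texture, texture)
partial = subspace_overlap(mixed, texture)
```

An overlap near zero for every pair of attributes is what "approximately disentangled" would look like under this measure.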

2509.21882 2026-05-27 cs.LG cs.AI

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

Fang Wu, Aaron Tu, Weihao Xuan, Heli Qi, Xu Huang, Qingcheng Zeng, Shayan Talaei, Yijia Xiao, Peng Xia, Xiangru Tang, Yuchen Zhuang, Yinxi Li, Bing Hu, Hanqun Cao, Wenqi Shi, Rui Yang, Nan Liu, Huaxiu Yao, Ge Liu, Li Erran Li, Amin Saberi, Naoto Yokoya, Jure Leskovec, Yejin Choi

Affiliations: Stanford University; UC Berkeley; The University of Tokyo; RIKEN AIP; Waseda University; Georgia Tech; Northwestern University; UCLA; UNC Chapel Hill; Yale University; University of Waterloo; Independent Researcher; CUHK; UT Southwestern Medical Center; National University of Singapore; UIUC; Amazon AWS AI

AI summary: Argues that gains from reinforcement learning with verifiable rewards (RLVR) are often overstated because of confounds such as budget mismatch, attempt inflation, and benchmark data contamination, and proposes a minimum standard comprising budget-matched saturation curves, calibration tracking, judge-robustness tests, and contamination screening.

Abstract

Reinforcement learning with verifiable rewards (RLVR) is a practical, scalable way to improve large language models on math, code, and other structured tasks. However, we argue that many headline RLVR gains are not yet well validated because reports often conflate policy improvement with three confounds: (i) budget mismatch between RLVR and baseline evaluations, (ii) attempt inflation and calibration drift that convert abstentions into confident answers, and (iii) benchmark data contamination. Using budget-matched reproductions and partial-prompt contamination probes, we find that several widely cited gaps shrink substantially or disappear once budgets, prompts, and dataset versions are matched and contaminated sets are treated as memorization probes rather than evidence of reasoning. This does not mean that RLVR is ineffective, but it implies that current measurements often overstate capability gains and obscure reliability costs. We therefore propose a compact, tax-aware minimum standard for RLVR training and evaluation: budget-matched saturation curves with variance, calibration, and abstention tracking, a judge-robustness stress test when LLM judges are used, and an explicit contamination screen. With these controls, RLVR remains effective and deployable in verifiable domains, but reasoning gains should be treated as provisional without them.

2604.08999 2026-05-27 cs.CL cs.AI cs.LG

ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

Xiaoke Guo, Songze Li, Zhiqiang Liu, Zhaoyan Gong, Yuanxiang Liu, Huajun Chen, Wen Zhang

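Affiliations: Zhejiang University

AI summary: Proposes ASTRA, which uses AdaSTR to reconstruct tables into Logical Semantic Trees and the dual-mode DuTR framework to combine tree-search-based textual navigation with symbolic code execution, reaching state-of-the-art performance on complex table question answering.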

Comments ACL 2026 Main

Abstract

Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.

2604.08819 2026-05-27 cs.CV cs.AI cs.LG cs.MM

SenBen: Sensitive Scene Graphs for Explainable Content Moderation

Fatih Cagatay Akyon, Alptekin Temizel

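Affiliations: Graduate School of Informatics, METU; Ultralytics, Inc.

AI summary: Proposes the SenBen benchmark and a compact student model that uses multi-task training and vocabulary-balancing strategies to provide spatial grounding and interpretability for sensitive content, outperforming most VLMs on scene graph generation.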

Comments Accepted at CVPRW 2026

Abstract

Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at 7.6× faster inference and 16× less GPU memory.

2603.11394 2026-05-27 cs.CL cs.AI cs.LG

Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Reliability

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

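Affiliations: Vanderbilt University; Vanderbilt University Medical Center; Intuit AI Research

AI summary: Proposes the stick-or-switch (SoS) framework, which evaluates LLM reliability in multi-turn conversations by partitioning a question-answer space into multiple sequential presentations; finds a conversation tax that reduces accuracy and abstention against incorrect suggestions by an average of up to 30%, and observes blind switching.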

Abstract

Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-world usage, remains understudied. Addressing this gap is critical in high-stakes settings like healthcare, where patients and clinicians are turning to LLM chatbots to address their medical inquiries. Here, we introduce the "stick-or-switch" (SoS) framework, which partitions a question-answer space into multiple sequential presentations to model two safety-centric behaviors: conviction (i.e., sticking to a correct answer selection or abstention against incorrect suggestions) and flexibility (i.e., switching to a correct suggestion when it is introduced). Evaluating 17 LLMs across three clinical benchmarks, we observe a pervasive conversation tax, where partitioning an answer space into sequential presentations reduces end-to-end accuracy and abstention against incorrect suggestions by an average of up to 30%, reaching 65% in certain models. We also observe blind switching, where models move from an initial abstention to incorrect and correct suggestions at near-identical rates, reaching 50%. Finally, we show that increasing model scale mitigates some of these conversational inefficacies while exacerbating others, such as a higher propensity to adopt an incorrect suggestion from an initial abstention. Together, our findings demonstrate that the general proficiency captured by static benchmarks does not translate to multi-turn dialogues.

2512.21602 2026-05-27 cs.LG cs.CV

An Empirical Study of Machine Learning Robustness and Scalability for Imbalanced Tabular Clinical Data in Emergency and Critical Care

机器学习在急诊和重症监护中不平衡表格临床数据的鲁棒性与可扩展性实证研究

Yusuf Brima, Marcellin Atemkeng

发表机构 * Computer Vision Group, Institute of Cognitive Science, Osnabrück University(计算机视觉组,认知科学研究所,奥斯纳布吕克大学) Department of Mathematics, Rhodes University(数学系,罗德斯大学) National Institute for Theoretical and Computational Sciences (NITheCS)(国家理论与计算科学研究所(NITheCS))

AI总结 本研究在MIMIC-IV-ED和eICU数据集上评估六类模型在不平衡临床表格数据上的性能,发现树模型在可扩展性上最优,而表格基础模型在性能与效率间提供新的权衡。

详情
AI中文摘要

每年,数百万患者通过急诊科和重症监护室,临床医生必须在时间压力和不确定性下做出高风险决策。机器学习可以支持恶化预测、分诊和罕见关键结局的预测,但临床数据通常严重不平衡,使模型偏向多数类并降低预测性能。因此,为不平衡的临床表格数据开发鲁棒且高效的模型仍然是一个重要挑战。 我们在MIMIC-IV-ED和eICU数据库的不平衡表格数据上评估了六类模型:决策树、随机森林、XGBoost、TabNet、TabICL和TabPFN v2.6。可训练模型通过贝叶斯超参数调优进行优化,而基础模型在其预训练推理模式下进行评估,无需任务特定的重新加权。模型使用Macro F1分数、对递增不平衡的鲁棒性以及跨七个临床预测任务的计算可扩展性进行评估。 结果在不同数据集上有所不同。在MIMIC-IV-ED上,TabPFN v2.6和TabICL获得了最强的平均Macro F1排名,XGBoost保持竞争力。在eICU上,XGBoost始终表现最佳,其次是其他基于树的方法,而基础模型达到中等性能。在两个数据集中,TabNet在递增不平衡下显示出最大的性能下降和最高的计算成本。训练时间分析表明,基于树的方法随数据集大小扩展最有利,而基础模型提供了较低的每任务适应成本。 这些发现表明,没有单一模型族在所有临床环境中占主导地位。然而,表格基础模型正在缩小与强经典基线的性能差距,同时提供独特的效率-性能权衡,这可能有利于资源受限的临床环境。

英文摘要

Every year, millions of patients pass through emergency departments and intensive care units, where clinicians must make high-stakes decisions under time pressure and uncertainty. Machine learning could support prediction of deterioration, triage, and rare critical outcomes, but clinical data are often severely imbalanced, biasing models toward majority classes and reducing predictive performance. Developing robust and efficient models for imbalanced clinical tabular data therefore remains an important challenge. We evaluated six model families on imbalanced tabular data from the MIMIC-IV-ED and eICU databases: Decision Tree, Random Forest, XGBoost, TabNet, TabICL, and TabPFN v2.6. Trainable models were optimized using Bayesian hyperparameter tuning, while foundation models were evaluated in their pretrained inference regime without task-specific reweighting. Models were assessed using Macro F1-score, robustness to increasing imbalance, and computational scalability across seven clinical prediction tasks. Results differed across datasets. On MIMIC-IV-ED, TabPFN v2.6 and TabICL achieved the strongest average Macro F1 ranks, with XGBoost remaining competitive. On eICU, XGBoost consistently performed best, followed by other tree-based methods, while foundation models achieved intermediate performance. Across both datasets, TabNet showed the largest degradation under increasing imbalance and the highest computational cost. Training-time analysis showed that tree-based methods scaled most favorably with dataset size, while foundation models offered low per-task adaptation cost. These findings suggest that no single model family dominates across all clinical settings. However, tabular foundation models are narrowing the performance gap with strong classical baselines while offering a distinct efficiency-performance trade-off that may benefit resource-constrained clinical environments.
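The abstract above reports Macro F1 on imbalanced data. As a minimal sketch of why that metric is preferred over accuracy in this setting, the following computes both from scratch. The 95:5 label split and the majority-class predictor are hypothetical illustrations, not data or models from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # A class with no true positives and no errors contributes 0.0
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of exactly matching predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced task: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# Hypothetical degenerate model that always predicts the majority class.
y_pred = [0] * 100

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Because every class contributes equally to the mean, a model that ignores the rare class is penalised heavily even though its accuracy looks high.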

2604.04940 2026-05-27 cs.AI

ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback

ReVEL:基于结构化性能反馈的多轮反思式LLM引导的启发式进化

Cuong Van Duc, Minh Nguyen Dinh Tuan, Tam Vu Duc, Tung Vu Duy, Son Nguyen Van, Hanh Nguyen Thi, Binh Huynh Thi Thanh

发表机构 * Hanoi University of Science and Technology(河内科学技术大学) Phenikaa University(Phenikaa大学)

AI总结 针对NP-hard组合优化问题的启发式设计,提出ReVEL框架,通过行为感知分组和多轮迭代细化,利用LLM和累积性能反馈联合优化启发式,实验表明优于现有LLM引导的进化基线。

详情
AI中文摘要

为NP-hard组合优化问题设计有效的启发式仍然具有挑战性,通常需要大量的领域专业知识。最近的LLM引导的进化方法在自动启发式生成方面显示出前景,但大多数现有方法独立地或通过有限的成对反馈来细化启发式。我们提出ReVEL:基于结构化性能反馈的多轮反思式LLM引导的启发式进化,一个用于群体式多轮启发式细化的框架。ReVEL将启发式组织成行为感知的反思组,包括用于局部细化的相似性驱动组和用于探索性搜索的多样性驱动组。在每个组内,LLM使用累积的性能反馈执行迭代多轮细化,使得相关启发式能够在进化迭代中被联合分析和逐步改进。在标准组合优化基准上的实验表明,ReVEL在多种设置和LLM骨干下通常优于现有的LLM引导的进化基线。额外分析表明,行为感知分组有助于在迭代启发式进化过程中实现更一致的细化轨迹。

英文摘要

Designing effective heuristics for NP-hard combinatorial optimization problems remains challenging and often requires substantial domain expertise. Recent LLM-guided evolutionary methods have shown promise for automated heuristic generation, but most existing approaches refine heuristics independently or through limited pairwise feedback. We propose ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback, a framework for group-wise multi-turn heuristic refinement. ReVEL organizes heuristics into behavior-aware reflective groups, including similarity-driven groups for localized refinement and diversity-driven groups for exploratory search. Within each group, the LLM performs iterative multi-turn refinement using accumulated performance feedback, enabling related heuristics to be jointly analyzed and progressively improved across evolutionary iterations. Experiments on standard combinatorial optimization benchmarks show that ReVEL generally improves optimization performance over existing LLM-guided evolutionary baselines across multiple settings and LLM backbones. Additional analyses suggest that behavior-aware grouping contributes to more consistent refinement trajectories during iterative heuristic evolution.

2603.27146 2026-05-27 cs.CL

Learning to Predict Future-Aligned Research Proposals with Language Models

学习用语言模型预测未来对齐的研究提案

Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu, Jiawei Han, Heng Ji

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出将研究提案生成重构为时间切片科学预测问题,通过未来对齐分数(FAS)评估模型能否预测截止时间后发表的论文方向,并构建时间一致数据集和推理轨迹进行训练,实验表明未来对齐微调显著提升提案质量。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于辅助研究中的构思,但评估LLM生成的研究提案的质量仍然困难:新颖性和合理性难以自动衡量,而大规模人工评估成本高昂。我们通过将提案生成重构为时间切片科学预测问题,提出了一种可验证的替代方案。给定一个研究问题和截止时间前可用的启发论文,模型生成一个结构化提案,并通过其是否预测到截止时间后发表的论文中出现的研究方向来评估。我们通过检索和基于LLM的语义评分,针对保留的未来语料库计算未来对齐分数(FAS)来操作化这一目标。为了训练模型,我们构建了一个时间一致的数据集,包含来自目标及其截止前引用的3,642个实例中的21,835篇论文出现次数,并合成推理轨迹,教授差距识别和灵感借鉴。在Llama-3.1和Qwen2.5模型上,未来对齐微调相比未对齐基线提高了未来对齐(总体FAS最高提升+10.6%),领域专家的人工评估证实了提案质量的改进。最后,我们通过使用代码代理实现两个模型生成的提案来展示实际影响,从新的提示策略中获得MATH 4.17%的准确率提升,并对一种新颖的模型合并方法实现了一致的改进。我们的代码和数据公开在https://github.com/Arthur-Heng/future-aligned-proposals。

英文摘要

Large language models (LLMs) are increasingly used to assist ideation in research, but evaluating the quality of LLM-generated research proposals remains difficult: novelty and soundness are hard to measure automatically, and large-scale human evaluation is costly. We propose a verifiable alternative by reframing proposal generation as a time-sliced scientific forecasting problem. Given a research question and inspiring papers available before a cutoff time, the model generates a structured proposal and is evaluated by whether it anticipates research directions that appear in papers published after the time. We operationalize this objective with the Future Alignment Score (FAS), computed via retrieval and LLM-based semantic scoring against a held-out future corpus. To train models, we build a time-consistent dataset of 21,835 paper occurrences across 3,642 instances from targets and their pre-cutoff citations, and synthesize reasoning traces that teach gap identification and inspiration borrowing. Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality. Finally, we demonstrate practical impact by implementing two model-generated proposals with a code agent, obtaining 4.17% accuracy gain on MATH from a new prompting strategy and consistent improvements for a novel model-merging method. Our code and data are publicly available at https://github.com/Arthur-Heng/future-aligned-proposals.

2509.08289 2026-05-27 cs.CV

Dual-Thresholded Heatmap-Guided Proposal Clustering and Negative Certainty Supervision with Enhanced Base Network for Weakly Supervised Object Detection

双阈值热力图引导的提议聚类与负确定性监督及增强基础网络的弱监督目标检测

Yuelin Guo, Haoyu He, Zhiyuan Chen, Zitong Huang, Renhao Lu, Lu Shi, Zejun Wang, Weizhe Zhang

发表机构 * Institute of Cyberspace Security, Harbin Institute of Technology(哈尔滨工业大学网络安全学院) Faculty of Information Technology, Monash University(莫纳什大学信息科技学院) Center on Machine Learning Research, Harbin Institute of Technology(哈尔滨工业大学机器学习研究中心) Department of New Networks, Peng Cheng Laboratory(鹏城实验室网络部) School of Cyberspace Science, Harbin Institute of Technology(哈尔滨工业大学网络空间科学学院)

AI总结 提出DANCE方法,通过双阈值热力图引导的提议选择、增强基础网络和负确定性监督损失,解决弱监督目标检测中伪GT框不完整、语义鸿沟和收敛慢的问题。

Comments IEEE TIP Minor Revision

详情
AI中文摘要

弱监督目标检测(WSOD)近年来因其不需要框级标注而受到广泛关注。最先进的方法通常采用多模块网络,使用WSDDN作为多实例检测网络模块,并使用多实例细化模块来改进性能。然而,这些方法存在三个关键局限性。首先,现有方法倾向于生成仅关注判别性部分的伪GT框,未能捕捉整个物体,或者覆盖整个物体但无法区分相邻的类内实例。其次,基础WSDDN架构缺乏每个提议的关键背景类表示,并且其分支之间存在较大的语义鸿沟。第三,先前的方法在优化过程中丢弃被忽略的提议,导致收敛缓慢。为了解决这些挑战,我们提出了双阈值热力图引导的提议聚类和负确定性监督与增强基础网络(DANCE)方法用于WSOD。具体来说,我们首先设计了一种热力图引导的提议选择器(HGPS)算法,该算法利用热力图上的双阈值来预选提议,使伪GT框既能捕捉完整的物体范围,又能区分相邻的类内实例。然后,我们构建了一个弱监督基础检测网络(WSBDN),它为每个提议增加一个背景类表示,并使用热力图进行预监督以弥合矩阵之间的语义鸿沟。最后,我们在被忽略的提议上引入负确定性监督(NCS)损失以加速收敛。在具有挑战性的PASCAL VOC和MS COCO数据集上进行的大量实验证明了我们方法的有效性和优越性。我们的代码可在https://github.com/gyl2565309278/DANCE公开获取。

英文摘要

Weakly supervised object detection (WSOD) has attracted significant attention in recent years, as it does not require box-level annotations. State-of-the-art methods generally adopt a multi-module network, which employs WSDDN as the multiple instance detection network module and uses multiple instance refinement modules to refine performance. However, these approaches suffer from three key limitations. First, existing methods tend to generate pseudo GT boxes that either focus only on discriminative parts, failing to capture the whole object, or cover the entire object but fail to distinguish between adjacent intra-class instances. Second, the foundational WSDDN architecture lacks a crucial background class representation for each proposal and exhibits a large semantic gap between its branches. Third, prior methods discard ignored proposals during optimization, leading to slow convergence. To address these challenges, we propose the Dual-thresholded heAtmap-guided proposal clustering and Negative Certainty supervision with Enhanced base network (DANCE) method for WSOD. Specifically, we first devise a heatmap-guided proposal selector (HGPS) algorithm, which utilizes dual thresholds on heatmaps to pre-select proposals, enabling pseudo GT boxes to both capture the full object extent and distinguish between adjacent intra-class instances. We then construct a weakly supervised basic detection network (WSBDN), which augments each proposal with a background class representation and uses heatmaps for pre-supervision to bridge the semantic gap between matrices. Finally, we introduce a negative certainty supervision (NCS) loss on ignored proposals to accelerate convergence. Extensive experiments on the challenging PASCAL VOC and MS COCO datasets demonstrate the effectiveness and superiority of our method. Our code is publicly available at https://github.com/gyl2565309278/DANCE.

1403.1076 2026-05-27 cs.AI

A Discussion to Qualify Intelligence

关于智能定义的探讨

Kieran Greer

发表机构 * Distributed Computing Systems(分布式计算系统)

AI总结 本文试图提出一个适用于自然世界和人工智能的统一智能定义,基于Kolmogorov复杂性理论提出度量标准,并区分智能与意识的不同。

Comments Newly edited version

Journal ref Scientific Insights, 2(1), pp. 1 - 15

详情
AI中文摘要

我们对智能的理解主要针对人类水平。本文试图给出一个更统一的定义,可应用于整个自然世界,然后应用于人工智能。该定义更侧重于定性而非定量,并可能有助于对此问题做出判断。虽然正确行为是首选定义,但本文提出了一种基于Kolmogorov复杂性理论的度量标准,该标准引出了关于熵的测量。随后,本文提出了一种公认的人工智能测试版本作为“酸性测试”,这可能是自由思维程序试图实现的目标。作者最近的工作更多是从机械过程的角度出发,基于结构构建。本文认为智能是一种主动事件,但也注意到其背后存在一个机械性的次要方面。本文建议将智能和意识视为略有不同,其中意识是更机械的方面。事实上,一个令人惊讶的结论是,一个被动但智能的大脑可能由主动但不太智能的感官所激发。

英文摘要

Our understanding of intelligence is directed primarily at the human level. This paper attempts to give a more unifying definition that can be applied to the natural world in general and then Artificial Intelligence. The definition would be used more to qualify than quantify it and might help when making judgements on the matter. While correct behaviour is the preferred definition, a metric that is grounded in Kolmogorov's Complexity Theory is suggested, which leads to a measurement about entropy. A version of an accepted AI test is then put forward as the 'acid test' and might be what a free-thinking program would try to achieve. Recent work by the author has been more from a direction of mechanical processes, built from structure. This paper agrees that intelligence is a pro-active event, but also notes a second aspect to it that is in the background and mechanical. The paper suggests looking at intelligence and the conscious as being slightly different, where the conscious is this more mechanical aspect. In fact, a surprising conclusion can be a passive but intelligent brain being invoked by active and less intelligent senses.

2604.03785 2026-05-27 cs.AI cs.MA

Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning

跨时间步延迟下合作多智能体强化学习中的通信增益与延迟代价

Zihong Gao, Hongjian Liang, Lei Hao, Liangjun Ke

发表机构 * The State Key Laboratory for Manufacturing Systems Engineering(制造系统工程国家重点实验室) School of Automation Science and Engineering, Xi’an Jiaotong University(西安交通大学自动化科学与工程学院)

AI总结 针对部分可观测环境中跨时间步通信延迟导致的信息错位问题,提出通信增益与延迟代价(CGDC)度量,并基于此设计演员-评论家框架CDCMA,通过预测未来观测和注意力融合延迟消息来提升合作多智能体强化学习的性能、鲁棒性和泛化能力。

详情
AI中文摘要

在部分可观测的合作多智能体强化学习中,通信对于协调至关重要,然而跨时间步延迟会导致消息在生成后多个时间步才到达,造成时间错位,使得信息在消费时变得陈旧。我们将此设定形式化为延迟通信部分可观测马尔可夫博弈(DeComm-POMG),并将消息的影响分解为通信增益和延迟代价,从而得到通信增益与延迟代价(CGDC)度量。我们进一步建立了一个价值损失界,表明由延迟消息引起的性能下降被一个折扣累积的信息差距所上界,该差距由及时消息与延迟消息所诱导的动作分布之间的差异衡量。在CGDC的指导下,我们提出了CDCMA,一个演员-评论家框架,该框架仅在预测CGDC为正时请求消息,预测未来观测以减少消费时的错位,并通过CGDC引导的注意力融合延迟消息。在无队友视觉变体的合作导航和捕食者-猎物任务以及多个延迟级别的SMAC地图上的实验表明,该方法在性能、鲁棒性和泛化能力上均有一致提升,消融实验验证了每个组件的有效性。

英文摘要

Communication is essential for coordination in cooperative multi-agent reinforcement learning under partial observability, yet cross-timestep delays cause messages to arrive multiple timesteps after generation, inducing temporal misalignment and making information stale when consumed. We formalize this setting as a delayed-communication partially observable Markov game (DeComm-POMG) and decompose a message's effect into communication gain and delay cost, yielding the Communication Gain and Delay Cost (CGDC) metric. We further establish a value-loss bound showing that the degradation induced by delayed messages is upper-bounded by a discounted accumulation of an information gap between the action distributions induced by timely versus delayed messages. Guided by CGDC, we propose CDCMA, an actor-critic framework that requests messages only when predicted CGDC is positive, predicts future observations to reduce misalignment at consumption, and fuses delayed messages via CGDC-guided attention. Experiments on no-teammate-vision variants of Cooperative Navigation and Predator Prey, and on SMAC maps across multiple delay levels show consistent improvements in performance, robustness, and generalization, with ablations validating each component.

2604.00648 2026-05-27 cs.CV

DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization

DirectFisheye-GS: 在三维高斯泼溅中通过跨视图联合优化实现原生鱼眼输入

Zhengxian Yang, Fei Xie, Xutao Xue, Rui Zhang, Taicheng Huang, Yang Liu, Mengqi Ji, Tao Yu

发表机构 * BNRist, Tsinghua University(清华大学北京信息科学与技术国家研究中心) Beihang University(北京航空航天大学) JD.com, Beijing, China(京东(北京,中国)) Shanghai AI Lab(上海人工智能实验室)

AI总结 针对鱼眼相机输入导致的信息丢失和细节模糊问题,提出将鱼眼相机模型集成到3DGS框架中,并引入基于特征重叠的跨视图联合优化策略,实现无需预处理的原生鱼眼图像训练,提升重建质量。

Comments CVPR 2026 Highlight; Fix NSFC ID

详情
AI中文摘要

三维高斯泼溅(3DGS)实现了从日常图像中进行高效的三维场景重建,具有实时、高保真渲染的特点,极大地推动了VR/AR应用的发展。鱼眼相机凭借其更宽的视场角(FOV),有望从更少的输入中实现高质量重建,近来备受关注。然而,由于3DGS依赖于光栅化,大多数后续涉及鱼眼相机输入的工作在训练前先对图像进行去畸变,这引入了两个问题:1)图像边缘的黑边导致信息丢失,抵消了鱼眼大FOV的优势;2)去畸变的拉伸和插值重采样将每个像素的值扩散到更大区域,稀释了细节密度——导致3DGS过拟合这些低频区域,产生模糊和漂浮伪影。在这项工作中,我们将鱼眼相机模型集成到原始3DGS框架中,实现了无需预处理的原生鱼眼图像输入进行训练。尽管建模正确,我们观察到重建场景在图像边缘仍然存在漂浮物:畸变向边缘增加,而3DGS原始的逐迭代随机选择视图优化忽略了高斯函数的跨视图相关性,导致极端形状(例如过大或拉长)降低了重建质量。为解决此问题,我们引入了一种基于特征重叠的跨视图联合优化策略,该策略在视图之间建立一致的几何和光度约束——该技术同样适用于现有的基于针孔相机的流水线。我们的DirectFisheye-GS在公共数据集上达到或超越了最先进的性能。项目页面:https://yzxqh.github.io/DirectFisheye-GS/ 。

英文摘要

3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) black borders at image edges cause information loss and negate the fisheye's large FOV advantage; 2) undistortion's stretch-and-interpolate resampling spreads each pixel's value over a larger area, diluting detail density and causing 3DGS to overfit these low-frequency zones, which produces blur and floating artifacts. In this work, we integrate a fisheye camera model into the original 3DGS framework, enabling native fisheye image input for training without preprocessing. Despite correct modeling, we observed that the reconstructed scenes still exhibit floaters at image edges: distortion increases toward the periphery, and 3DGS's original per-iteration random view-selection optimization ignores the cross-view correlations of a Gaussian, leading to extreme shapes (e.g., oversized or elongated) that degrade reconstruction quality. To address this, we introduce a feature-overlap-driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views, a technique equally applicable to existing pinhole-camera-based pipelines. Our DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets. Project page: https://yzxqh.github.io/DirectFisheye-GS/
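The abstract above does not say which fisheye camera model is used. As an illustration of why a native fisheye projection keeps wide-angle content that a pinhole projection cannot represent, here is a minimal sketch that assumes the common equidistant model (image radius r = f·θ). The function names, focal length, and principal point are hypothetical and are not taken from the DirectFisheye-GS code.

```python
import math


def project_equidistant(point, f, cx, cy):
    """Equidistant fisheye projection (assumed model): radius r = f * theta."""
    x, y, z = point
    rho = math.hypot(x, y)           # distance from the optical axis
    if rho == 0.0:
        return (cx, cy)              # point on the axis maps to the principal point
    theta = math.atan2(rho, z)       # angle between the ray and the optical axis
    r = f * theta
    return (cx + r * x / rho, cy + r * y / rho)


def project_pinhole(point, f, cx, cy):
    """Standard pinhole projection: radius r = f * tan(theta)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("pinhole projection is undefined at or beyond 90 degrees")
    return (cx + f * x / z, cy + f * y / z)


f, cx, cy = 100.0, 0.0, 0.0          # hypothetical intrinsics
for angle_deg in (10, 45, 80, 90):
    a = math.radians(angle_deg)
    p = (math.sin(a), 0.0, math.cos(a))   # unit ray at the given off-axis angle
    u_fish, _ = project_equidistant(p, f, cx, cy)
    try:
        u_pin = f"{project_pinhole(p, f, cx, cy)[0]:.1f}"
    except ValueError:
        u_pin = "undefined"
    print(f"{angle_deg:>2} deg  fisheye u = {u_fish:6.1f}  pinhole u = {u_pin}")
```

Under this assumed model the fisheye image radius grows linearly with the off-axis angle, while the pinhole radius grows as tan θ and is undefined at 90 degrees. That is the stretching toward the periphery which undistortion has to resample.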

2603.25152 2026-05-27 cs.AI cs.IR

OMD-GraphRAG: Enhancing GraphRAG with Ontology-Guided Extraction, Multi-Dimensional Clustering and Dual-Channel Fusion

OMD-GraphRAG:利用本体引导提取、多维聚类和双通道融合增强GraphRAG

Jie Wang, Honghua Huang, Xi Ge, Jianhui Su, Wen Liu, Shiguo Lian

发表机构 * Data Science & Artificial Intelligence Research Institute(数据科学与人工智能研究院)

AI总结 提出OMD-GraphRAG框架,通过本体引导知识提取、多维社区聚类和双通道图检索融合,提升GraphRAG在复杂推理和多跳查询中的性能。

详情
AI中文摘要

检索增强生成(RAG)系统在复杂推理、多跳查询和领域特定问答中面临重大挑战。尽管现有的GraphRAG框架在结构化知识组织方面取得了进展,但在知识提取精度、社区报告完整性和检索性能方面仍存在局限性。本文提出OMD-GraphRAG,一个基于开源GraphRAG构建的增强框架。该框架引入了三项核心创新:(1)本体引导知识提取,使用预定义Schema指导LLM准确识别领域特定实体和关系;(2)多维社区聚类策略,通过对齐完成、基于属性的聚类和多跳关系聚类提高社区完整性;(3)双通道图检索融合,通过混合图和社区检索平衡问答准确性和性能。在MultiHop-RAG基准上的评估结果显示,OMD-GraphRAG在综合F1分数上优于主流开源解决方案(如LightRAG),特别是在推理和时间查询方面。

英文摘要

Retrieval-Augmented Generation (RAG) systems face significant challenges in complex reasoning, multi-hop queries, and domain-specific QA. While existing GraphRAG frameworks have made progress in structural knowledge organization, they still have limitations in knowledge extraction precision, community report integrity, and retrieval performance. This paper proposes OMD-GraphRAG, an enhanced framework built upon open-source GraphRAG. The framework introduces three core innovations: (1) Ontology-Guided Knowledge Extraction that uses predefined Schema to guide LLMs in accurately identifying domain-specific entities and relations; (2) Multi-Dimensional Community Clustering Strategy that improves community completeness through alignment completion, attribute-based clustering, and multi-hop relationship clustering; (3) Dual-Channel Graph Retrieval Fusion that balances QA accuracy and performance through hybrid graph and community retrieval. Evaluation results on MultiHop-RAG benchmark show that OMD-GraphRAG outperforms mainstream open source solutions (e.g., LightRAG) in comprehensive F1 scores, particularly in inference and temporal queries.

2602.02192 2026-05-27 cs.LG cs.DC

ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

ECHO-2: 一种面向经济高效强化学习的大规模分布式推演框架

Jingwei Song, Meng Chen, Jie Xiao, Qingnan Ren, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Zhisheng Chen, Ziqian Bi, Shuo Lu, Yiqun Duan, Xu Wang, Rymon Yu, Lynn Ai, Eric Yang, Tianyu Shi

发表机构 * The University of Hong Kong(香港大学) Fudan University(复旦大学) Gradient(Gradient) University of Edinburgh(爱丁堡大学) Soochow University(苏州大学) Technical University of Darmstadt(达姆施塔特技术大学) University of the Chinese Academy of Sciences(中国科学院大学)

AI总结 提出ECHO-2分布式强化学习框架,通过重叠推演生成、传播与训练,结合对等辅助流水线广播和成本感知异构工作节点激活,在保持奖励性能的同时显著提升成本效率。

Comments 24 pages, 7 figures

详情
AI中文摘要

强化学习(RL)是大语言模型(LLM)后训练的关键阶段,涉及推演生成、奖励评估和集中学习之间的反复交互。分布式推演执行提供了利用更具成本效益的推理资源的机会,但引入了广域协调和策略传播方面的挑战。我们提出了ECHO-2,一个用于后训练的分布式RL框架,使用远程推理工作节点且传播延迟不可忽略。ECHO-2将集中学习与分布式推演相结合,将有界策略过时性视为用户可控参数,使得推演生成、传播和训练能够重叠。我们引入了一个基于重叠的容量模型,关联训练时间、传播延迟和推演吞吐量,得出了一个维持学习器利用率的实用配置规则。为了缓解传播瓶颈并降低成本,ECHO-2采用了对等辅助流水线广播和成本感知的异构工作节点激活。在真实广域网带宽条件下,对4B到32B参数规模的LLM进行GRPO后训练的实验表明,ECHO-2在保持与强基线相当的RL奖励的同时,显著提高了成本效率。

英文摘要

Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of LLMs ranging from 4B to 32B parameters under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.

2603.28730 2026-05-27 cs.RO cs.CL cs.CV

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

SOLE-R1:视频语言推理作为机器人强化学习的唯一奖励

Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza

发表机构 * MIT(麻省理工学院) RAI Institute(机器人智能研究所)

AI总结 提出SOLE-R1模型,通过视频语言时空推理生成密集任务进度估计作为唯一奖励信号,实现在无真实奖励、演示或任务特定调优下的零样本在线强化学习。

详情
AI中文摘要

视觉语言模型(VLM)在各种任务中展现出令人印象深刻的能力,这促使人们努力利用这些模型来监督机器人学习。然而,当在强化学习(RL)中用作评估器时,当今最强的模型在部分可观测性和分布偏移下常常失败,使得策略能够利用感知错误而非解决任务。我们提出SOLE-R1(自观察学习器),一种专门设计用于为在线RL提供唯一奖励信号的视频语言推理模型。仅给定原始视频观测和自然语言目标,SOLE-R1执行每时间步的时空思维链(CoT)推理,并生成可直接用作奖励的密集任务进度估计。为了训练SOLE-R1,我们开发了一个大规模视频轨迹和推理合成流水线,生成与连续进度监督对齐的时间基础CoT轨迹。这些数据与基础的空间和多帧时间推理相结合,并使用混合框架训练模型,该框架将监督微调与可验证奖励的RL相结合。在四个不同的仿真环境和真实机器人设置中,SOLE-R1实现了从随机初始化的零样本在线RL:机器人学习之前未见过的操作任务,无需真实奖励、成功指标、演示或任务特定调优。SOLE-R1在24个未见过的任务上成功,并显著优于强视觉语言奖励器,包括Robometer、RoboReward、ReWiND、GPT-5和Gemini-3-Pro,同时对奖励破解表现出明显更强的鲁棒性。我们在匿名页面发布所有模型、数据、代码和演示:https://philip-mit.github.io/sole-r1/

英文摘要

Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. We introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including Robometer, RoboReward, ReWiND, GPT-5, and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking. We release all models, data, code, and demos at the anonymous page: https://philip-mit.github.io/sole-r1/

2601.18987 2026-05-27 cs.CL cs.AI cs.PL

LLMs versus the Halting Problem: Characterizing Program Termination Reasoning

LLMs 与停机问题:程序终止推理的特征化

Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, Peter O'Hearn

发表机构 * FAIR Team, Meta AI(Meta AI FAIR 团队) The Hebrew University of Jerusalem, Israel(耶路撒冷希伯来大学) Bloomberg, New York, USA(彭博社,纽约,美国) Imperial College London, UK(伦敦帝国理工学院,英国) University College London, UK(伦敦大学学院,英国)

AI总结 本文评估了前沿LLMs在程序终止推理上的能力,发现GPT-5和Claude Sonnet 4.5在C程序终止判断上达到顶级验证工具水平,但无法生成形式化证明,并引入分歧前置条件形式化描述非终止条件。

详情
AI中文摘要

判断程序是否终止是计算机科学中的一个核心问题。图灵的停机问题确立了终止的不可判定性,表明没有算法能普遍确定所有程序和输入的终止性。因此,验证工具近似地处理终止问题,有时无法证明或反驳;这些工具依赖于特定问题的架构,并且通常与特定的编程语言绑定。LLMs的最新进展提出了一个自然的问题:它们在多大程度上能够推理程序终止?我们在2025年国际软件验证竞赛(SV Comp)的一组多样化C程序上评估了前沿LLMs。我们的结果表明,GPT-5和Claude Sonnet 4.5(通过测试时缩放)达到了与顶级验证工具相当的分数。然而,尽管模型通常能正确推断程序是否终止,但它们经常无法构造一个见证作为形式化证明,揭示了语义识别与符号证明生成之间的差距。随着代码长度的增加,性能进一步下降。为了分析这一差距,我们引入了一个分歧前置条件形式化方法,将非终止条件描述为逻辑约束。我们希望这些发现能激励未来在现实世界终止基准测试、结合LLMs与符号验证方法的神经符号方法,以及更广泛地关于LLMs在其他不可判定问题上推理的研究。

英文摘要

Determining whether a program terminates is a central problem in computer science. Turing's Halting Problem established termination as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Hence, verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures, and are usually tied to particular programming languages. Recent advances in LLMs raise a natural question: to what extent can they reason about program termination? We evaluate frontier LLMs on a diverse set of C programs from the International Competition on Software Verification (SV-COMP) 2025. Our results show that GPT-5 and Claude Sonnet 4.5 achieve scores comparable to top-ranked verification tools (with test-time scaling). However, while models often correctly infer whether programs terminate, they frequently fail to construct a witness as formal proof, revealing a gap between semantic recognition and symbolic proof generation. Performance further degrades as code length increases. To analyze this gap, we introduce a divergence precondition formulation that characterizes non-termination conditions as logical constraints. We hope these findings motivate future research on real-world termination benchmarks, neuro-symbolic approaches that combine LLMs with symbolic verification methods, and, more broadly, LLM reasoning on other undecidable problems.
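To make the idea of a divergence precondition concrete, here is a minimal sketch on a hypothetical toy loop that is not taken from the paper or from SV-COMP. For `while x > 0: x = x + y` over unbounded integers, the loop fails to terminate exactly when `x > 0 and y >= 0`. The code compares that logical constraint with a bounded simulation, which offers supporting evidence only and is not a proof.

```python
def terminates_within(x, y, max_steps=1000):
    """Run the toy loop `while x > 0: x += y` for at most max_steps iterations.

    Returns True if the loop exits within the budget, False otherwise.
    A False result is only evidence of divergence, not a proof.
    """
    steps = 0
    while x > 0:
        if steps >= max_steps:
            return False
        x += y
        steps += 1
    return True


def divergence_precondition(x, y):
    """Hand-derived constraint on the initial state under which the loop never exits.

    Assumes mathematical (unbounded) integers; C integer overflow would
    change the analysis.
    """
    return x > 0 and y >= 0


# Check that the constraint agrees with the bounded simulation on a small grid.
mismatches = [
    (x, y)
    for x in range(-5, 6)
    for y in range(-5, 6)
    if terminates_within(x, y) == divergence_precondition(x, y)
]
print("mismatches:", mismatches)
```

An empty mismatch list means the simulation and the constraint agree on every tested input. Establishing the constraint for all inputs still needs a symbolic argument, which is the gap between recognition and proof that the abstract describes.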

2603.25415 2026-05-27 cs.AI cs.RO

Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation

具身语义场景图生成的强化学习导航现代化

Roman Küble, Marco Hüller, Mrunmai Phatak, Rainer Lienhart, Jörg Hähner

发表机构 * Organic Computing Group(有机计算组) Machine Learning and Computer Vision Group(机器学习与计算机视觉组) University of Augsburg(奥格斯堡大学) Am Technologiezentrum 8(技术中心8号) Augsburg, Germany(德国奥格斯堡)

AI总结 提出模块化导航组件,通过替换策略优化方法和重新设计离散动作表示,现代化具身语义场景图生成中的决策过程,并评估不同动作集和策略结构对场景图完整性、执行安全性和导航行为的影响。

详情
AI中文摘要

语义世界模型使具身智能体能够推理对象、关系和空间上下文,超越纯几何表示。在有机计算中,此类模型是在不确定性和资源约束下实现目标驱动自适应的关键。核心挑战是在有限动作预算内获取最大化模型质量和下游实用性的观测。语义场景图(SSG)为此提供了结构紧凑的表示。然而,在有限动作视界内构建SSG需要探索策略,在信息增益与导航成本之间权衡,并决定何时额外动作的收益递减。本文提出了用于具身语义场景图生成的模块化导航组件,并通过替换策略优化方法和重新审视离散动作公式来现代化其决策。我们研究了紧凑和更细粒度的较大离散动作集,并比较了原子动作上的单头策略与动作组件上的分解多头策略。我们评估了课程学习和基于深度的可选碰撞监督,并评估了SSG完整性、执行安全性和导航行为。结果表明,仅替换优化算法在相同奖励塑造下相对于基线将SSG完整性提高了21%。深度主要影响执行安全性(无碰撞运动),而完整性基本保持不变。将现代优化与更细粒度、分解的动作表示相结合,产生了最强的完整性-效率权衡。

英文摘要

Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. In Organic Computing, such models are a key enabler for objective-driven self-adaptation under uncertainty and resource constraints. The core challenge is to acquire observations maximising model quality and downstream usefulness within a limited action budget. Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off information gain against navigation cost and decide when additional actions yield diminishing returns. This work presents a modular navigation component for Embodied Semantic Scene Graph Generation and modernises its decision-making by replacing the policy-optimisation method and revisiting the discrete action formulation. We study compact and finer-grained, larger discrete motion sets and compare a single-head policy over atomic actions with a factorised multi-head policy over action components. We evaluate curriculum learning and optional depth-based collision supervision, and assess SSG completeness, execution safety, and navigation behaviour. Results show that replacing the optimisation algorithm alone improves SSG completeness by 21% relative to the baseline under identical reward shaping. Depth mainly affects execution safety (collision-free motion), while completeness remains largely unchanged. Combining modern optimisation with a finer-grained, factorised action representation yields the strongest overall completeness-efficiency trade-off.

2601.04426 2026-05-27 cs.AI

XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs

XGrammar-2: 面向智能体LLM的高效动态结构化生成引擎

Linzhang Li, Yixin Dong, Guanjie Wang, Ziyi Xu, Alexander Jiang, Tianqi Chen

发表机构 * Shanghai Jiao Tong University(上海交通大学) Carnegie Mellon University(卡内基梅隆大学) Carnegie Mellon University, NVIDIA(卡内基梅隆大学,NVIDIA)

AI总结 针对智能体LLM中动态结构化生成(如工具调用和响应协议)的挑战,提出XGrammar-2引擎,通过标签触发结构切换和跨语法子结构缓存实现高效编译与近零开销。

Comments 10 pages, ACM CAIS 26

详情
AI中文摘要

现代LLM智能体越来越依赖动态结构化生成,例如工具调用和响应协议。与具有静态结构的传统结构化生成不同,这些工作负载在请求之间和请求内部都有变化,给现有引擎带来了新的挑战。我们提出了XGrammar-2,一种用于动态智能体工作负载的结构化生成引擎。我们的设计基于两个关键思想:对标签触发的结构切换的一流支持,以及跨具有不同输出结构的请求的细粒度重用。具体来说,XGrammar-2引入了TagDispatch用于动态结构调度,以及Cross-Grammar Cache用于跨语法的子结构级缓存重用。它通过基于Earley的自适应令牌掩码缓存、即时编译和重复状态压缩进一步提高了效率。实验表明,XGrammar-2的编译速度比先前的结构化生成引擎快6倍以上,并且在现代LLM服务系统中的端到端开销几乎为零。

英文摘要

Modern LLM agents increasingly rely on dynamic structured generation, such as tool calling and response protocols. Unlike traditional structured generation with static structures, these workloads vary both across requests and within a request, posing new challenges to existing engines. We present XGrammar-2, a structured generation engine for dynamic agentic workloads. Our design is based on two key ideas: first-class support for tag-triggered structure switching, and fine-grained reuse across requests with different output structures. Concretely, XGrammar-2 introduces TagDispatch for dynamic structural dispatching and Cross-Grammar Cache for substructure-level cache reuse across grammars. It further improves efficiency with an Earley-based adaptive token mask cache, just-in-time compilation, and repetition state compression. Experiments show that XGrammar-2 achieves over 6x faster compilation than prior structured generation engines, and incurs near-zero end-to-end overhead in modern LLM serving systems.

2512.01678 2026-05-27 cs.LG cs.DC cs.PL

Morphling: Fast, Fused, and Flexible GNN Training at Scale

Anubhab, Rupesh Nasre

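Affiliations IIT Madras

AI summary Proposes Morphling, a domain-specific code synthesizer that uses architecture-aware primitives and a runtime sparsity-aware execution engine to significantly raise GNN training throughput and reduce memory consumption on CPUs, GPUs, and in distributed settings.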

English abstract

Graph Neural Networks (GNNs) present a fundamental hardware challenge by fusing irregular, memory-bound graph traversals with regular, compute-intensive dense matrix operations. While frameworks such as PyTorch Geometric (PyG) and Deep Graph Library (DGL) prioritize high-level usability, they fail to address these divergent execution characteristics. As a result, they rely on generic kernels that suffer from poor cache locality, excessive memory movement, and substantial intermediate allocations. To address these limitations, we present Morphling, a domain-specific code synthesizer designed to bridge this gap. Morphling compiles high-level GNN specifications into portable, backend-specialized implementations targeting OpenMP, CUDA, and MPI. It achieves this by instantiating a library of optimized, architecture-aware primitives tailored to each execution environment. Morphling also incorporates a runtime sparsity-aware execution engine that dynamically selects dense or sparse execution paths using input feature statistics, reducing unnecessary computation on zero-valued entries. We evaluate Morphling on eleven real-world datasets spanning diverse graph structures, feature dimensionalities, and sparsity regimes. Morphling improves per-epoch training throughput by an average of 20X on CPUs, 19X on GPUs, and 6X in distributed settings over PyG and DGL, with peak speedups reaching 66X. Morphling's memory-efficient layouts further reduce peak memory consumption by up to 15X, enabling large-scale GNN training on commodity hardware. These findings demonstrate that specialized, architecture-aware code synthesis provides an effective and scalable path toward high-performance GNN execution across diverse parallel and distributed platforms.
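
The runtime choice between dense and sparse execution paths can be illustrated with a small pure-Python sketch. This is not Morphling's implementation: the density statistic, the `threshold` value of 0.25, and all function names are assumptions chosen for illustration.

```python
def density(matrix):
    """Fraction of non-zero entries in a list-of-lists feature matrix."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for v in row if v != 0)
    return nonzero / total if total else 0.0

def matmul_dense(x, w):
    """Dense path: every entry of x is multiplied, zeros included."""
    cols = len(w[0])
    return [[sum(row[k] * w[k][j] for k in range(len(w))) for j in range(cols)]
            for row in x]

def matmul_sparse(x, w):
    """Sparse path: zero entries of x are skipped entirely."""
    cols = len(w[0])
    out = []
    for row in x:
        acc = [0.0] * cols
        for k, v in enumerate(row):
            if v != 0:
                for j in range(cols):
                    acc[j] += v * w[k][j]
        out.append(acc)
    return out

def feature_transform(x, w, threshold=0.25):
    """Pick a path from an input statistic; threshold is a hypothetical value."""
    path = "sparse" if density(x) < threshold else "dense"
    result = matmul_sparse(x, w) if path == "sparse" else matmul_dense(x, w)
    return path, result
```

Both paths compute the same product, so the selection affects only how much work is spent on zero-valued entries. In practice the break-even threshold would depend on the hardware and kernel implementations.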

2603.23994 2026-05-27 cs.LG cs.AI

Understanding the Challenges in Iterative Generative Optimization with LLMs

Allen Nie, Xavier Daull, Zhiyi Kuang, Abhinav Akkiraju, Anish Chaudhuri, Max Piasevoli, Ryan Rong, YuCheng Yuan, Prerit Choudhary, Shannon Xiao, Rasool Fakoor, Adith Swaminathan, Ching-An Cheng

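Affiliations Google DeepMind; CNRS; Stanford University; Carnegie Mellon University; Microsoft; AWS; Netflix Research; Microsoft Research

AI summary Through case studies, this paper shows that hidden design choices in LLM-based iterative generative optimization, such as the starting artifact, credit assignment, and batching, decisively affect whether optimization succeeds, and identifies the lack of a universal cross-domain way to set up learning loops as a major obstacle to productionization and adoption.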

Comments 39 pages, 17 figures

English abstract

Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet in practice remains brittle: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make "hidden" design choices: What can the optimizer edit and what is the "right" learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and batching trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard, we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.

2603.20020 2026-05-27 cs.CV cs.AI

Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR

Ziye Yuan, Ruchang Yao, Chengxin Zheng, Yusheng Zhao, Daxiang Dong, Ming Zhang

Affiliations State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Beijing Key Laboratory of Software and Hardware Cooperative Artificial Intelligence Systems, Peking University, Beijing, China; Tsinghua University, Beijing, China; Baidu Inc, Beijing, China

AI summary To address the loss of fine-grained visual information caused by gradient interference in multimodal LLMs on OCR tasks, the paper proposes Detached Skip-Links, which decouple forward feature aggregation from backward gradient propagation, and introduces $R$-Probe to diagnose the reconstructability of visual tokens, improving performance on OCR and general multimodal tasks.

Comments Accepted by ICML 2026. Ziye Yuan and Ruchang Yao contributed equally to this work (co-first authors, listed in random order)

English abstract

Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
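
The asymmetry described above (same forward pass, no gradient through the skip branch) can be worked through on a scalar toy model with hand-derived gradients. The sketch below is an illustration under simplifying assumptions, not the paper's architecture: one linear "shallow" layer with weight `a`, one linear "deep" layer with weight `b`, additive fusion, and a squared-error loss. All names are hypothetical.

```python
def fused_loss_and_grads(x, a, b, target, detach_skip):
    """Return (loss, dL/da, dL/db) for a scalar two-layer model with a skip link.

    shallow = a * x, deep = b * shallow, fused = deep + shallow.
    With detach_skip=True the skip branch is treated as a constant in the
    backward pass, so no gradient reaches `a` through it.
    """
    shallow = a * x
    deep = b * shallow
    fused = deep + shallow          # forward value is identical in both modes
    err = fused - target
    loss = 0.5 * err ** 2

    # d(fused)/d(shallow): the deep path contributes b, and the skip path
    # contributes 1 only when it is not detached.
    dfused_dshallow = b + (0.0 if detach_skip else 1.0)
    grad_a = err * dfused_dshallow * x
    grad_b = err * shallow
    return loss, grad_a, grad_b
```

With `x=2.0, a=0.5, b=3.0, target=1.0`, both modes give the same loss of 4.5, while the gradient on the shallow weight drops from 24.0 (attached) to 18.0 (detached). In an autodiff framework the same effect is usually obtained with a stop-gradient operation on the skip branch rather than by hand.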