arXivDaily: daily arXiv digest, updated Monday through Friday
All subject categories: 2085
2605.07155 2026-05-11 cs.LG

Regret-Oracle Complexity Tradeoffs in Agnostic Online Learning

Idan Attias, Steve Hanneke, Arvind Ramaswami

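AI summary: Agnostic online learning is classically solved by a reduction that relies on Littlestone's Standard Optimal Algorithm (SOA), which is computationally intractable. This paper proposes a more efficient strategy: using a weak-consistency oracle, it dynamically prunes non-realizable label sequences, reducing the total query complexity from $O\big(T^{2^{O(d_\mathrm{LD})}}\big)$ to $O(T^{d_\mathrm{VC}+1})$ while preserving near-optimal expected regret. It also quantifies the tradeoff between regret and oracle complexity and gives corresponding upper and lower bounds.

Abstract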

Agnostic online learning is classically solved via a reduction to the realizable setting, utilizing Littlestone's Standard Optimal Algorithm (SOA) as a base learner. However, the SOA is computationally intractable to execute even for a single round. To overcome this barrier, recent work in oracle-efficient online learning replaces the SOA with a realizable base learner that accesses the concept class exclusively through an offline empirical risk minimization (ERM) oracle. While such agnostic learners achieve near-optimal expected regret, they suffer from a doubly-exponential oracle complexity of $O\big(T^{2^{O(d_\mathrm{LD})}}\big)$, where $d_\mathrm{LD}$ is the Littlestone dimension and $T$ is the number of rounds. In this work, we significantly improve this oracle complexity while relying on an even weaker primitive: a weak-consistency oracle, which merely decides whether a given labeled dataset is realizable. At the core of our approach is an adaptive and dynamic agnostic-to-realizable reduction that actively prunes non-realizable label sequences on the fly. By using the VC dimension ($d_\mathrm{VC}$) to bound the number of dynamically maintained active paths, our algorithm reduces the total query complexity down to $O(T^{d_\mathrm{VC}+1})$ while perfectly preserving near-optimal expected regret. Crucially, this dynamic pruning also yields a memory reduction over the standard reduction. Furthermore, we formally quantify the regret--oracle complexity tradeoff, providing upper bounds that smoothly interpolate between restricted query budgets and attainable expected regret. We complement these with lower bounds proving that any learner restricted to $Q = o(\sqrt{T})$ queries must suffer an expected regret of $\Omega(T/Q)$.
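The pruning idea in the abstract can be illustrated with a toy sketch. It is not the paper's algorithm: the threshold concept class, the function names `threshold_oracle` and `prune_paths`, and the sample points are all assumptions made for illustration. The sketch only shows how a weak-consistency oracle (a yes/no realizability check) keeps the set of active label paths small compared with the $2^T$ possible label sequences.

```python
from typing import Callable, List, Sequence, Tuple

LabeledData = List[Tuple[float, int]]


def threshold_oracle(data: LabeledData) -> bool:
    """Hypothetical weak-consistency oracle for h_a(x) = 1 if x >= a else 0.

    Returns True iff some threshold a labels every point in `data` correctly.
    """
    zeros = [x for x, y in data if y == 0]
    ones = [x for x, y in data if y == 1]
    return not zeros or not ones or max(zeros) < min(ones)


def prune_paths(
    xs: Sequence[float], oracle: Callable[[LabeledData], bool]
) -> Tuple[List[Tuple[int, ...]], int]:
    """Extend every active label path by 0 and 1, keeping only realizable ones."""
    paths: List[Tuple[int, ...]] = [()]
    queries = 0
    for t in range(len(xs)):
        survivors = []
        for path in paths:
            for label in (0, 1):
                candidate = path + (label,)
                queries += 1  # one oracle call per candidate extension
                if oracle(list(zip(xs[: t + 1], candidate))):
                    survivors.append(candidate)
        paths = survivors
    return paths, queries


if __name__ == "__main__":
    active, used = prune_paths([0.3, 0.8, 0.1, 0.5], threshold_oracle)
    print(len(active), "active paths after", used, "oracle queries")
```

For thresholds on the line (VC dimension 1), at most $t+1$ labelings of $t$ distinct points are realizable, so the number of active paths grows linearly rather than exponentially in this toy case.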

2605.07154 2026-05-11 cs.CV

PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition

Yuchen He, Jing Zhang

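AI summary: PRIMED is a new method for Referring Audio-Visual Segmentation (Ref-AVS), which localizes and segments target objects in video frames from visual, auditory, and textual cues. Building on the biased competition theory from cognitive neuroscience, it uses adaptive modality suppression to distinguish the relevance of different modalities and improve segmentation accuracy. PRIMED introduces a Modality Prior Decoder and cross-modal fusion modules, combined with a Spatial-Aware Semantic Alignment loss that strengthens foreground-background discrimination, and achieves state-of-the-art performance on the Ref-AVS benchmark.

Comments 11 pages, 8 figures

Abstract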

Referring Audio-Visual Segmentation (Ref-AVS) seeks to localize and segment target objects in video frames based on visual, auditory, and textual referring cues. The task is challenging because the relevance of different modalities varies across referring expressions and scenes, while existing methods typically treat multimodal cues as homogeneous inputs for fusion, prompting, or reasoning, making them vulnerable to irrelevant or misleading modalities. To address this problem, we propose PRIMED, inspired by the biased competition theory in cognitive neuroscience, which explicitly models both visual perception and language-driven prior modulation, and enables more accurate Ref-AVS by adaptive modality suppression. Specifically, a Modality Prior Decoder first estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, generating a modality prior to adaptively guide high-level attention. A Token Distiller further extracts compact global visual tokens from high-level features and shares them across Competition-aware Cross-modal Fusion modules to provide hierarchical global context. Additionally, we introduce a Spatial-Aware Semantic Alignment loss to further enhance foreground-background discrimination through contrastive learning. Extensive experiments on the Ref-AVS benchmark demonstrate that PRIMED achieves state-of-the-art overall performance.

2605.07153 2026-05-11 cs.CL

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

Wanli Yang, Hongyu Zang, Junwei Zhang, Wenjie Shi, Du Su, Jingang Wang, Xueqi Cheng, Fei Sun

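AI summary: This study examines whether reinforcement learning (RL) can improve direct recall of parametric knowledge in large language models (LLMs). In controlled zero-shot, one-hop, closed-book QA experiments, RL yields about 27% average relative gains across multiple factual QA benchmarks, mainly by redistributing probability mass over existing knowledge rather than acquiring new facts. The hardest examples contribute most to the gain, highlighting RL's role in unlocking latent parametric knowledge.

Abstract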

Reinforcement learning (RL) has achieved remarkable success in LLM reasoning, but whether it can also improve direct recall of parametric knowledge remains an open question. We study this question in a controlled zero-shot, one-hop, closed-book QA setting with no chain-of-thought, training only on binary correctness rewards and applying fact-level train-test deduplication to ensure gains reflect improved recall rather than reasoning or memorization. Across three model families and multiple factual QA benchmarks, RL yields ~27% average relative gains, surpassing both training- and inference-time baselines alike. Mechanistically, RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations. Our data-attribution study reveals that the hardest examples are the most informative: those whose answers never appear in 128 pre-RL samples (only ~18% of training data) drive ~83% of the gain, since rare correct rollouts still emerge during training and get reinforced. Together, these findings broaden the role of RL beyond reasoning, repositioning it as a tool for unlocking rather than acquiring latent parametric knowledge.

2605.07151 2026-05-11 cs.CV cs.AI

DPG-CD: Depth-Prior-Guided Cross-Modal Joint 2D-3D Change Detection

Luqi Zhang, Zhen Dong, Bisheng Yang

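AI summary: This work proposes DPG-CD, a depth-prior-guided cross-modal fusion framework that jointly detects 2D semantic changes and 3D height changes for urban morphology analysis and emergency response. An estimated depth prior mitigates the modality gap between imagery and DSM, a gated fusion mechanism combines geometric and spectral features, and a multi-task decoder predicts 2D semantic changes and 3D height changes together. Experiments on two public datasets and one new dataset show that the method outperforms state-of-the-art approaches.

Abstract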

Urban spatial evolution is manifested not only through horizontal expansion but also through vertical structural changes. Consequently, jointly capturing 2D semantic changes and 3D height changes is essential for urban morphology analysis and emergency management. In practical scenarios, collecting 3D observations is often constrained by high acquisition costs and the inability to support frequent updates. The multi-temporal cross-modal input consisting of pre-event Digital Surface Model (DSM) and post-event imagery provides a practical solution for 3D change detection in high-frequency urban monitoring, disaster assessment, and emergency response scenarios. However, this setting remains challenging as imagery and DSM data exhibit significant spectral-geometric representation gaps. Moreover, modality differences may be confused with actual changes, and robust change detection requires effective fusion of semantic and geometric features from multi-temporal data. In this paper, we propose DPG-CD, a depth-prior-guided multi-temporal cross-modal fusion framework for joint 2D semantic and 3D height change detection. Specifically, an estimated depth prior is introduced into the imagery to mitigate the modality gap with DSM. A gated fusion mechanism then selectively injects geometric cues from depth prior while preserving discriminative spectral representations. Subsequently, a multi-stage cross-temporal cross-modal feature fusion architecture is employed to extract change-aware features. Finally, a multi-task decoder jointly predicts 2D semantic changes and 3D height changes, complemented by an auxiliary DSM prediction task to improve structural consistency and height estimation accuracy. Experiments on two public datasets, Hi-BCD and 3DCD, and a new dataset, NYC-MMCD, demonstrate that DPG-CD outperforms state-of-the-art methods on both 2D and 3D change detection tasks.

2605.07149 2026-05-11 cs.CV

Real-IAD MVN: A Multi-View Normal Vector Dataset and Benchmark for High-Fidelity Industrial Anomaly Detection

Wenbing Zhu, Jianing Liang, Linjie Cheng, Yurui Pan, Zhuhao Chen, Qingwang Yan, Yudong Cheng, Jianghui Zhang, Mingmin Chi, Bo Peng

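AI summary: Industrial Anomaly Detection (IAD) is important for quality control, but existing methods are limited in detecting subtle geometric defects. This paper presents Real-IAD-MVN, a large-scale multi-view normal-vector dataset that captures fine geometric defects through high-fidelity surface normals, addressing the shortcomings of 2D images and sparse 3D point clouds. The dataset records surface normal maps from five distinct viewpoints, which improves defect detection, and a reconstruction-based baseline validates its usefulness and shows the potential of multimodal fusion for geometric anomaly detection.

Comments Accepted to CVPR 2025. 15 pages

Abstract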

Industrial Anomaly Detection (IAD) is critical for quality control, but existing methods struggle with subtle, geometric defects. Standard 2D (RGB) images are sensitive to texture and lighting but often miss fine geometric anomalies. While 3D point clouds capture macro-shape, they are typically too sparse to detect micro-defects like scratches or pits. We address this fundamental data limitation by introducing Real-IAD-MVN (Multi-View Normal), a large-scale industrial dataset. By upgrading our acquisition system, Real-IAD-MVN captures high-fidelity surface normal maps from five distinct viewpoints, replacing sparse 3D data entirely. This provides a comprehensive geometric representation at a micro-detail level, making previously invisible side-wall and occluded defects explicitly detectable. Our experiments, conducted on this new dataset, first provide evidence that incorporating dense, multi-view pseudo-3D (surface normals) yields significantly better detection performance than using sparse 3D point cloud data. To further validate the dataset and provide a strong benchmark, we introduce a baseline method based on reconstruction, which learns to extract cross-modal unified prototypes from the image and normal map streams. We demonstrate that this unified prototype approach surpasses existing state-of-the-art multimodal fusion methods, highlighting the rich potential of our new dataset for advancing geometric anomaly detection.

2605.07148 2026-05-11 cs.CV

Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

Haoming Wang, Wei Gao

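AI summary: This paper studies whether Vision-Language Models (VLMs) form a representation of 3D scene topology analogous to human cognitive maps. The authors find that although current VLMs show spatial reasoning from 2D inputs, their internal 3D topological representation is overshadowed by non-geometric semantics such as color and shape. Using cross-scene linear feature extraction, they isolate a clean spatial subspace that controls the model's spatial outputs, shape this representation mathematically, and prove its correspondence to the Laplacian eigenmaps of the scene's 3D graph. A regularization method based on Dirichlet energy then yields significant improvements on real-world scene-topology understanding tasks.

Abstract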

Decades of cognitive science establish that humans navigate environments by forming cognitive maps, defined as allocentric and topology-preserving representations of 3D space. While modern Vision-Language Models (VLMs) demonstrate emergent spatial reasoning from 2D egocentric inputs, it remains unclear whether they construct an analogous 3D internal representation. In this paper, we demonstrate that current VLMs do possess a latent topological map of 3D scenes, but it is heavily overshadowed by non-geometric visual semantics, such as color and shape. By isolating this spatial subspace through cross-scene linear feature extraction, we extract a clean spatial subspace that causally controls the model's spatial outputs. We mathematically shape this latent representation and prove its correspondence to the Laplacian eigenmaps of the scene's 3D Gaussian-kernel graph, converging to the physical 3D space in the continuous limit. Motivated by this geometric identification, we further introduce a mathematically principled latent regularization method for VLMs, based on Dirichlet energy. Applying this single-term regularizer to a minimal 500-step supervised VLM fine-tuning (SFT) on simple synthetic data yields significant improvements on real-world spatial benchmarks, outperforming standard SFT and competitive baselines by up to 12.1\% in spatial tasks involving scene topology understanding. Source code is available at https://github.com/pittisl/vlm-latent-shaping
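As a rough illustration of the quantity behind the regularizer, the sketch below computes a standard graph Dirichlet energy, $E(Z) = \sum_{i<j} w_{ij}\,\lVert z_i - z_j\rVert^2$, with Gaussian-kernel weights built from 3D positions. The abstract does not give the exact form used in the paper, so the weight definition, the bandwidth `sigma`, and the function names here are assumptions.

```python
import math
from typing import List, Sequence


def gaussian_weights(points: Sequence[Sequence[float]], sigma: float = 1.0) -> List[List[float]]:
    """Gaussian-kernel graph over 3D positions: w_ij = exp(-|p_i - p_j|^2 / (2 sigma^2))."""
    n = len(points)
    return [
        [
            0.0 if i == j
            else math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]


def dirichlet_energy(latents: Sequence[Sequence[float]], weights: List[List[float]]) -> float:
    """Sum over unordered pairs of w_ij * |z_i - z_j|^2 (low when nearby objects have nearby latents)."""
    n = len(latents)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += weights[i][j] * math.dist(latents[i], latents[j]) ** 2
    return total


if __name__ == "__main__":
    positions = [(0, 0, 0), (1, 0, 0), (5, 0, 0), (6, 0, 0)]
    w = gaussian_weights(positions)
    aligned = [[0.0], [1.0], [5.0], [6.0]]   # latents that mirror the scene layout
    shuffled = [[0.0], [5.0], [1.0], [6.0]]  # latents that ignore the scene layout
    print(dirichlet_energy(aligned, w), dirichlet_energy(shuffled, w))
```

Latents that respect the scene's neighborhood structure give a lower energy than latents that scramble it, which is the property a Dirichlet-energy penalty would encourage during fine-tuning.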

2605.07146 2026-05-11 cs.CV

UniV2D: Bridging Visual Restoration and Semantic Perception for Underwater Salient Object Detection

Laibin Chang, Shaodong Wang, Yunke Wang, Xu Zhang, Kui Jiang, Chang Xu, Bo Du

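AI summary: Underwater salient object detection (USOD) plays an important role in marine vision tasks but is challenging because of severe visual degradation such as selective absorption and medium scattering. Conventional methods follow a sequential "enhance-then-detect" pipeline, but separating low-level visual restoration from high-level semantic perception leads to semantic inconsistency. The paper proposes UniV2D, a unified vision-to-detection network that jointly optimizes visual restoration and salient object detection in a mutually beneficial framework: high-level semantics guide restoration, and the restored visual cues in turn enhance semantic perception. It outperforms existing methods on multiple benchmarks.

Abstract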

Underwater salient object detection (USOD) plays a vital role in marine vision tasks but remains fundamentally challenging due to severe visual degradation, such as selective absorption and medium scattering. Conventional pipelines typically adopt a sequential "enhance-then-detect" paradigm. However, isolating low-level visual restoration from high-level semantic perception often leads to semantic inconsistency, where the restored images may not be optimal for detection and can even introduce task-irrelevant noise. To break this sequential bottleneck, we propose UniV2D, a Unified Vision-to-Detection Network that jointly optimizes visual restoration and salient object detection within a mutually beneficial framework. Unlike traditional methods that rely on disjointed pipelines or rigid physical priors, UniV2D introduces a semantic-driven learning paradigm: high-level saliency semantics actively guide the restoration process, while the restored visual cues reciprocally enhance saliency perception. Specifically, UniV2D features a hierarchical dual-branch architecture. It first employs a self-calibrated decoder to predict initial saliency masks alongside a mask-aware restoration module to reconstruct image content. Subsequently, a saliency-guided refinement module equipped with cross-level modulation is utilized to align structural fidelity with semantic consistency. Extensive experiments across multiple benchmarks demonstrate that UniV2D significantly outperforms state-of-the-art methods in both quantitative and qualitative evaluations, establishing a new standard for joint underwater perception.

2605.07143 2026-05-11 cs.CV cs.NA cs.RO math.NA

TriP: A Triangle Puzzle Approach to Robust Translation Averaging

Zhekai Fan, Wanze Li, Jinxin Wang, Yunpeng Shi

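AI summary: TriP is a robust translation averaging method based on a triangle-puzzle idea. It recovers camera locations from pairwise relative translation directions, a key step in global Structure-from-Motion (SfM) pipelines. The method infers local relative edge scales from triangle geometry and synchronizes the scales of overlapping triangles in the logarithmic domain to recover globally consistent edge lengths and camera locations, which improves robustness to structured corruptions. TriP comes with theoretical guarantees for exact recovery, is efficient and parallelizable, scales to large camera networks, and outperforms existing methods on both synthetic and real datasets.

Abstract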

Translation averaging aims to recover camera locations from pairwise relative translation directions and is a fundamental component of global Structure-from-Motion pipelines. The problem is challenging because direction measurements contain no distance information, making the estimation problem highly ill-conditioned and highly sensitive to corrupted observations. In this paper, we propose TriP, a triangle-based framework for robust translation averaging. TriP first infers local relative edge scales from triangle geometry, and then synchronizes the scales of overlapping triangles in the logarithmic domain to recover globally consistent edge lengths and camera locations. By leveraging higher-order consistency across triangles, the proposed method is robust to adversarial, cycle-consistent, and other structured corruptions. In addition, TriP avoids the collapse issue without requiring any extra anti-collapse constraints, since log-scale synchronization excludes the degenerate zero-scale solution by construction. These structural advantages enable a particularly strong theory for exact location recovery. On the practical side, TriP is fully parallelizable, computationally efficient, and naturally scalable to graphs with millions of cameras. Moreover, it outperforms all previous translation averaging methods by a large margin on both synthetic and real datasets.
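The first step, inferring relative edge scales from triangle geometry, can be sketched with the law of sines: within one triangle, each edge length is proportional to the sine of the opposite angle, and the angles follow from the direction measurements alone. This is a minimal illustration under that assumption, not the authors' implementation; the function names and the normalization (scales sum to 1) are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, ...]


def direction(p: Sequence[float], q: Sequence[float]) -> Vec:
    """Unit vector pointing from camera p to camera q."""
    diff = [b - a for a, b in zip(p, q)]
    norm = math.sqrt(sum(c * c for c in diff))
    return tuple(c / norm for c in diff)


def angle(u: Vec, v: Vec) -> float:
    """Angle between two unit vectors, with the dot product clamped for safety."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def triangle_edge_scales(d_ij: Vec, d_jk: Vec, d_ik: Vec) -> Tuple[float, float, float]:
    """Relative lengths of edges (ij, jk, ik), normalized to sum to 1."""
    neg_d_ij = tuple(-c for c in d_ij)
    angle_i = angle(d_ij, d_ik)      # interior angle at camera i
    angle_j = angle(neg_d_ij, d_jk)  # interior angle at camera j
    angle_k = angle(d_ik, d_jk)      # interior angle at camera k
    # Law of sines: each edge is proportional to the sine of the opposite angle.
    raw = (math.sin(angle_k), math.sin(angle_i), math.sin(angle_j))
    total = sum(raw)
    return tuple(r / total for r in raw)


if __name__ == "__main__":
    ci, cj, ck = (0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)
    scales = triangle_edge_scales(direction(ci, cj), direction(cj, ck), direction(ci, ck))
    print(scales)                           # proportional to 3 : 5 : 4
    print([math.log(s) for s in scales])    # log-scales, the domain used for synchronization
```

Each triangle fixes its scales only up to a common factor, so in the log domain triangles that share an edge differ by an additive offset. Because the logarithm is undefined at zero, a log-scale formulation also excludes the degenerate zero-scale solution, as the abstract notes.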

2605.07142 2026-05-11 cs.CV

AGA3DNet: Anatomy-Guided Gaussian Priors with Multi-view xLSTM for 3D Brain MRI Subtype Classification

Peiyu Duan, Xueqi Guo, Sepehr Farhand, Mehmet Berk Sahin, Xinyuan Zheng, James S. Duncan, Gerardo Hermosillo Valadez, Yoshihisa Shinagawa

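AI summary: This paper presents AGA3DNet, a framework for 3D brain MRI subtype classification that uses anatomical phrases extracted from radiology reports as a soft anatomical prior, combined with a lightweight 3D CNN and multi-view xLSTM aggregation. Anatomical phrases are mapped to atlas-defined regions and converted into smooth spatial priors with a signed-distance transform and Gaussian weighting, giving interpretable anatomical guidance without dense voxel annotations. On a retrospective brain MRI cohort, the method shows more balanced performance and supports clinically interpretable localization.

Comments CVPR CV4CLINIC 2026

Abstract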

Accurate 3D brain MRI subtype classification benefits from both localized anatomical cues and long-range contextual reasoning. We present AGA3DNet, a report-grounded framework that incorporates brief anatomical phrases extracted from radiology reports as a soft anatomical prior channel and fuses it with a lightweight 3D CNN and multi-view xLSTM aggregation. Specifically, extracted anatomical phrases are mapped to atlas-defined regions and converted into smooth spatial priors using a signed-distance transform followed by Gaussian weighting, providing interpretable, anatomy-grounded guidance without requiring dense voxel annotations. We evaluate AGA3DNet on a retrospective institutional brain MRI cohort for abnormal subtype discrimination and compare against reproducible 3D classification baselines. AGA3DNet achieves improved overall balance across performance metrics and supports clinically interpretable localization through the prior channel. We discuss limitations related to single-cohort evaluation and the lack of large-scale public brain MRI datasets paired with radiology reports under broadly usable terms.
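The conversion from a region mask to a smooth prior can be sketched on a tiny 2D grid. The abstract only states that a signed-distance transform is followed by Gaussian weighting, so the sign convention (negative inside the region), the weighting $\exp(-\max(d,0)^2 / 2\sigma^2)$, the value of `sigma`, and the function names are assumptions. The brute-force distance computation is for illustration only; a real 3D volume would need an efficient distance transform.

```python
import math
from typing import List


def signed_distance(mask: List[List[int]]) -> List[List[float]]:
    """Brute-force signed distance on a small grid: negative inside the region, positive outside.

    Assumes the mask contains at least one inside cell and one outside cell.
    """
    cells = [(r, c) for r in range(len(mask)) for c in range(len(mask[0]))]
    inside = [p for p in cells if mask[p[0]][p[1]]]
    outside = [p for p in cells if not mask[p[0]][p[1]]]
    sdf = [[0.0] * len(mask[0]) for _ in mask]
    for r, c in cells:
        if mask[r][c]:
            sdf[r][c] = -min(math.dist((r, c), q) for q in outside)
        else:
            sdf[r][c] = min(math.dist((r, c), q) for q in inside)
    return sdf


def gaussian_prior(sdf: List[List[float]], sigma: float = 1.5) -> List[List[float]]:
    """Soft prior: 1 inside the region, decaying smoothly with distance outside it."""
    return [[math.exp(-max(d, 0.0) ** 2 / (2 * sigma ** 2)) for d in row] for row in sdf]


if __name__ == "__main__":
    region = [[0] * 5 for _ in range(5)]
    region[2][2] = 1  # hypothetical atlas region: a single cell in the center
    prior = gaussian_prior(signed_distance(region))
    for row in prior:
        print(" ".join(f"{v:.2f}" for v in row))
```

The resulting map could be stacked with the image as an extra input channel, which is one plausible reading of the "soft anatomical prior channel" in the abstract.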

2605.07141 2026-05-11 cs.CV cs.AI

Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

Yuan Yao, Qiushi Yang, Humen Zhong, Jiangning Wei, Yifang Men, Shuai Bai, Miaomiao Cui, Zhibo Yang

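AI summary: This work proposes Qwen3-VL-Seg, a parameter-efficient framework for open-world referring segmentation, which maps natural-language descriptions to pixel-level image regions. It treats the bounding box predicted by the multimodal large language model as a semantic prior and uses a lightweight mask decoder to turn the sparse box into a dense segmentation, adding only about 17M parameters. The authors also build the SA1B-ORS dataset and the ORS-Bench benchmark; experiments show strong performance across tasks and distributions, with clear advantages on language-intensive instructions and in open-world settings.

Abstract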

Open-world referring segmentation requires grounding unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) exhibit strong open-world visual grounding, but their outputs remain limited to sparse bounding-box coordinates and are insufficient for dense visual prediction. Recent MLLM-based segmentation methods either directly predict sparse contour coordinates, struggling to reconstruct continuous object boundaries, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), introducing substantial architectural and deployment overhead. We present Qwen3-VL-Seg, a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior and decodes it into pixel-level referring segmentation. At its core, a lightweight box-guided mask decoder combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement, introducing only 17M parameters (about 0.4\% of the base model). For scalable open-world training, we construct SA1B-ORS, an SA-1B-derived dataset with two subsets: SA1B-CoRS (category-oriented samples) and SA1B-DeRS (descriptive, instance-specific samples). For evaluation, we curate ORS-Bench, a manually screened benchmark with in-distribution and out-of-distribution subsets covering diverse referring expression types. Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs strongly across closed-set and open-world settings, with clear advantages on language-intensive instructions and strong out-of-distribution generalization. Evaluations on general multimodal benchmarks further show that the model broadly preserves general-purpose multimodal competence after segmentation-oriented adaptation.

2605.07140 2026-05-11 cs.CV cs.AI

Neurosymbolic Framework for Concept-Driven Logical Reasoning in Skeleton-Based Human Action Recognition

Talha Ilyas, Deval Mehta, Zongyuan Ge

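AI summary: This paper proposes a neurosymbolic framework for skeleton-based human action recognition that reformulates recognition as concept-driven first-order logical reasoning over motion primitives. It grounds first-order logic predicates in learnable spatial and temporal motion concepts, allowing the model to learn interpretable logical rules for action semantics. Aligning skeleton representations with LLM-derived descriptions of motion primitives creates a shared conceptual space for perception and reasoning. Experiments on several datasets show competitive recognition performance together with explicit explanations grounded in logical structure.

Comments Accepted in Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

Abstract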

Skeleton-based human activity recognition has achieved strong empirical performance, yet most existing models remain black boxes and difficult to interpret. In this work, we introduce a neurosymbolic formulation of skeleton-based HAR that reframes action recognition as concept-driven first-order logical reasoning over motion primitives. Our framework bridges representation learning and symbolic inference by grounding first-order logic predicates in learnable spatial and temporal motion concepts. Specifically, we employ a standard spatio-temporal skeleton encoder to extract latent motion representations, which are then mapped to interpretable concept predicates via a spatio-temporal concept decoder that explicitly separates pose-centric and dynamics-centric abstractions. These concept predicates are composed through differentiable first-order logic layers, enabling the model to learn human-readable logical rules that govern action semantics. To impose semantic structure on the learned concepts, we align skeleton representations with LLM-derived descriptions of atomic motion primitives, establishing a shared conceptual space for perception and reasoning. Extensive experiments on NTU RGB+D 60/120 and NW-UCLA demonstrate that our approach achieves competitive recognition performance while providing explicit, interpretable explanations grounded in logical structure. Our results highlight neurosymbolic reasoning as an effective paradigm for interpretable spatio-temporal action understanding. Code: https://github.com/Mr-TalhaIlyas/REASON

2605.07139 2026-05-11 cs.CL cs.AI cs.LG

Structural Rationale Distillation via Reasoning Space Compression

Jialin Yang, Jiankun Wang, Jiajun Wu, Henry Leung, Jiayu Zhou, Steve Drew

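AI summary: When reasoning is distilled from large language models into smaller ones, teacher rationales for similar problems are often inconsistent in structure and strategy, which makes them hard for the student to learn. The paper proposes Distillation through Reasoning Path Compression (D-RPC), which dynamically maintains a bank of reusable high-level reasoning paths and constrains the teacher to follow them, producing rationales that are structurally consistent yet cover diverse problem types. D-RPC outperforms several mainstream distillation methods on math and commonsense reasoning benchmarks while using fewer tokens than template-heavy alternatives.

Abstract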

When distilling reasoning from large language models (LLMs) into smaller ones, teacher rationales for similar problems often vary wildly in structure and strategy. Like a chef who makes the same dish differently each time, this inconsistency burdens the student with noisy supervision that is hard to internalize. We propose Distillation through Reasoning Path Compression (D-RPC), which constrains the teacher to follow a compact, dynamically maintained bank of reusable high-level reasoning paths. For each training question, D-RPC retrieves the most relevant path and conditions the teacher to follow it, producing rationales that are consistent across similar problems yet diverse enough to cover different problem types. A PAC-Bayes analysis formalizes the resulting trade-off between bank size and coverage: smaller banks reduce supervision entropy but risk coverage gaps, and the generalization bound identifies an optimal intermediate size confirmed by our ablations. Across five math and commonsense reasoning benchmarks with two student models, D-RPC consistently outperforms chain-of-thought distillation, freeform rationale generation, direct distillation, and structured-supervision baselines, while using fewer tokens than template-heavy alternatives.

2605.07138 2026-05-11 cs.AI cs.LG

Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents

Deeraj S K, Sadhana Devarajan, Krishna Mehra, Sudhakar Mishra

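AI summary: This study examines the adversarial robustness of empathetic agents trained with reinforcement learning from verifiable emotion rewards (RLVER). It builds the Adversarial Empathy Benchmark (AEB) and the Emotional Consistency Score (ECS) to evaluate models under adversarial interactions such as user manipulation and emotional escalation. RLVER-PPO-Think clearly outperforms the baseline in dialogue stability and hidden-intention detection but shows no significant gain in emotional state tracking, revealing a behavioral/legibility dissociation between emotional responsiveness and state tracking.

Abstract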

Reinforcement learning from verifiable emotion rewards (RLVER) has produced language models with strong empathetic performance, evaluated on benchmarks that assume cooperative, honest users. Yet real emotional interactions systematically violate this assumption: users gaslight, escalate, and pressure AI systems for unconditional validation, dynamics that cooperative benchmarks cannot surface. We construct the Adversarial Empathy Benchmark (AEB) and introduce the Emotional Consistency Score (ECS) to evaluate empathetic robustness under adversarial conditions. AEB comprises six psychologically grounded adversarial trajectory types with discriminative reward structures that penalize formulaic responses; ECS formally disentangles a model's capacity to track user emotional states from its capacity to improve them. In a controlled experiment across eight scenario-matched conditions (think and no-think conditions on 2 RLVER models and 2 base models (Qwen 1.5B and 7B), with 480 adversarial dialogues), RLVER-PPO-Think substantially outperforms the same-scale untuned baseline (0.963 vs. 0.761, \(p<0.001, r=0.688\)), with zero dialogue collapses and 47\% higher hidden-intention detection. However, ECS remains nearly flat and is not significantly different for RLVER-PPO-Think versus Base-7B-Think (\(p=0.650\)): RL training improves emotional responsiveness without measurable gains in observable state tracking. We interpret the ECS--FS (Final Score) gap as a behavioral/legibility dissociation inside this simulator family, not as evidence about internal understanding or clinical readiness.

2605.07137 2026-05-11 cs.LG cs.AI

Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

Yash Ingle, Jaival Chauhan, Ankit Yadav, Sudhakar Mishra

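AI summary: Targeting Negative Sample Reinforcement (NSR) for improving LLM reasoning, this work proposes an Adaptive Negative Sample Reinforcement (A-NSR) framework that dynamically balances error correction and diversity. With time-dependent scheduling functions and confidence-weighted penalties, the model focuses on correcting errors early in training, shifts to finer-grained updates later, and assigns penalty weights according to its confidence in each wrong path. The methods are evaluated on difficult reasoning datasets, including MATH, AIME 2025, and AMC23.

Abstract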

Reinforcement learning with verifiable rewards (RLVR) has become a highly effective method for improving the reasoning abilities of Large Language Models (LLMs). Recent research shows that Negative Sample Reinforcement (NSR) -- which focuses on penalizing incorrect steps rather than simply rewarding correct ones -- can match or even exceed the performance of more complex frameworks like PPO and GRPO across the entire Pass@k spectrum. However, current NSR techniques usually apply a fixed penalty throughout the training process and treat every incorrect response with the same weight. To address these limitations, we propose two extensions to the NSR framework. The first is Adaptive Negative Sample Reinforcement (A-NSR). Rather than using a fixed update rule, A-NSR uses time-dependent scheduling functions. In the initial training phases, the system focuses heavily on correcting errors to stabilize the model. As training continues, it shifts toward more subtle and controlled updates. The second is Confidence-Weighted Negative Reinforcement (CW-NSR), which operates on the principle that different mistakes carry different levels of importance. CW-NSR assigns specific penalty weights based on the model's normalized sequence likelihood. If the model is highly confident in a wrong path, it receives a larger penalty, whereas uncertain errors -- where the model is effectively exploring -- are penalized less strictly. Our formal analysis shows how these mechanisms govern token-level updates, allowing the model to leverage prior-guided probability redistribution while providing a natural defense against overfitting. We evaluated these methods on difficult reasoning datasets, including MATH, AIME 2025, and AMC23, using the Qwen2.5-Math-1.5B architecture.
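The two mechanisms can be combined in a small sketch. The abstract does not specify the scheduling function or the exact weighting rule, so everything below is assumed for illustration: a cosine decay from `start` to `end` for the time-dependent penalty, the length-normalized likelihood $\exp(\text{mean token log-prob})$ as the confidence measure, and their product as the per-sample penalty weight.

```python
import math
from typing import Sequence


def penalty_schedule(step: int, total_steps: int, start: float = 1.0, end: float = 0.2) -> float:
    """Hypothetical time-dependent penalty: strong early, decaying smoothly to a gentler value."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * progress))


def normalized_likelihood(token_logprobs: Sequence[float]) -> float:
    """Length-normalized sequence likelihood: exp of the mean token log-probability."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def negative_sample_weight(step: int, total_steps: int, token_logprobs: Sequence[float]) -> float:
    """Penalty weight for one incorrect response: schedule value times model confidence."""
    return penalty_schedule(step, total_steps) * normalized_likelihood(token_logprobs)


if __name__ == "__main__":
    confident_wrong = [-0.1, -0.2, -0.1]  # model was sure of a wrong answer
    uncertain_wrong = [-2.0, -3.0, -2.5]  # model was exploring
    for step in (0, 500, 1000):
        print(
            step,
            round(negative_sample_weight(step, 1000, confident_wrong), 4),
            round(negative_sample_weight(step, 1000, uncertain_wrong), 4),
        )
```

In a training loop, a weight like this would scale the loss term applied to each incorrect rollout, so confident errors are pushed down harder than exploratory ones and all penalties soften as training proceeds.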

2605.07134 2026-05-11 cs.CL cs.AI

Region4Web: Rethinking Observation Space Granularity for Web Agents

Donguk Kwon, Dongha Lee

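AI summary: This paper studies the granularity of the observation space through which web agents perceive pages. Existing methods observe at the same element-level granularity as the action space and leave the page's functional organization implicit. The authors propose Region4Web, a framework that reorganizes the page's AXTree into functional regions through hierarchical decomposition and semantic abstraction, so the agent can understand page state in terms of those regions. The accompanying PageDigest pipeline compresses region-level observations into a per-page digest that persists across steps, reducing observation length while improving task success rate.

Abstract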

Web agents perceive web pages through an observation space, yet its granularity has remained an underexamined design choice. Existing work treats observation at the same element-level granularity as the action space, leaving the page's functional organization implicit and forcing the agent to infer it from element-level signals at every step. We argue observation should instead operate at the granularity of functional regions, parts of the page that each serve a distinct purpose. We propose Region4Web, a framework that reorganizes the AXTree into functional regions through hierarchical decomposition and semantic abstraction, exposing the page's functional organization as the basis for page state understanding. Moreover, we propose PageDigest, a web-specific inference pipeline that delivers this region-level observation to the actor agent as a compact per-page digest that persists across steps. On the WebArena benchmark, PageDigest substantially reduces observation length while improving overall task success rate across diverse backbone large language models (LLMs) and established agent methods, regardless of backbone capacity. These results show that operating at the granularity of functional regions delivers a more compact and informative basis for the actor agent than element-level processing alone.

2605.07133 2026-05-11 cs.LG cs.AI

GAD in the Wild: Benchmarking Graph Anomaly Detection under Realistic Deployment Challenges

Jingjing Zhou, Shiyu Huang, Qing Qing, Zuquan Yuan, Huafei Huang, Ziqi Xu, Mingliang Hou, Xikun Zhang, Renqiang Luo, Ivan Lee

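AI summary: Addressing the challenges that Graph Anomaly Detection (GAD) faces in real deployments, this paper presents a multi-dimensional benchmark that systematically evaluates models on large-scale graphs, under extreme anomaly scarcity, and with missing node attributes. Most GNN-based methods fail to scale to million-node graphs, detection performance drops sharply under realistic anomaly ratios, and reconstruction-based models are highly sensitive to attribute imputation strategies. Built from five diverse graph datasets, the benchmark exposes the limitations of current GAD models in practice and serves as a diagnostic testbed for robust, scalable GAD systems.

Abstract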

Graph Anomaly Detection (GAD) is a critical task in graph machine learning with vital applications in financial fraud detection and social platform governance. However, existing GAD benchmarks are often restricted to small-scale, curated graphs with relatively balanced anomaly ratios, leaving a substantial gap between academic evaluation and real-world deployment. To bridge this gap, we present a multi-dimensional benchmark that systematically evaluates GAD models under three deployment-relevant challenges: million-scale graphs, extreme anomaly scarcity, and missing node attributes. We derive a family of controlled benchmark variants from five diverse graphs, including two native industrial-scale datasets with over 3.7 million nodes. Our extensive evaluation of nine representative GAD models reveals three major limitations: (1) most GNN-based methods fail to scale to million-node graphs due to prohibitive memory requirements; (2) detection performance drops sharply under realistic anomaly ratios (e.g., 0.1\%), often resulting in zero recall; and (3) reconstruction-based models are highly sensitive to attribute imputation strategies. Our findings suggest that strong performance in laboratory settings does not guarantee robustness in production environments. We release this benchmark and empirical evaluation as a diagnostic testbed to promote the development of robust and scalable GAD systems for large-scale, imperfect graphs encountered in practice. Code is available at https://anonymous.4open.science/r/Benchmark_GAD-E7A3.

2605.07130 2026-05-11 cs.LG cs.DS

Simple KNN-Based Outlier Detection Achieves Robust Clustering

Tianle Jiang, Yufa Zhou

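AI summary: This paper studies robust clustering in the presence of outliers, focusing on how effective classic outlier detection heuristics are for robust $k$-Means. The authors prove that, under a practical assumption, simply removing points with large $K$-Nearest-Neighbor distances achieves approximation guarantees comparable to prior work, without introducing additional centers or discarding extra outliers. Experiments on real-world datasets show that the method outperforms or matches several more sophisticated algorithms in clustering cost and runtime, demonstrating the potential of simple KNN-based heuristics for robust clustering.

Comments Code: https://github.com/MasterZhou1/Robust-Clustering

Abstract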

Being robust to the presence of outliers is crucial for applying clustering algorithms in practice. In the $\textit{robust $k$-Means}$ problem (i.e., $k$-Means with outliers), the goal is to remove $z$ outliers and minimize the $k$-Means cost on the remaining points. Despite the close connection between robust $k$-Means and outlier detection, both theoretical and empirical understanding of the effectiveness of $\textit{classic outlier detection heuristics}$ for robust $k$-Means remains limited. In this paper, we prove that under a practical assumption on the optimal cluster sizes, simply removing points with large $K$-Nearest-Neighbor distances achieves performance comparable to prior work in terms of approximation guarantees: it yields a constant-factor reduction from robust $k$-Means to standard $k$-Means, without introducing additional centers or discarding extra outliers, as is commonly required by existing approaches. Empirically, experiments on real-world datasets show that our method outperforms or matches several more sophisticated algorithms in terms of clustering cost and runtime. These results demonstrate that simple KNN-based heuristics can be surprisingly effective for robust clustering, highlighting new opportunities to bridge techniques from outlier detection and clustering.
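The heuristic described above is simple enough to sketch directly: score each point by the distance to its $K$-th nearest neighbor, drop the $z$ highest-scoring points, and run a standard $k$-Means solver on the rest. The abstract does not state the exact removal rule or how $K$ is chosen, so removing exactly the top $z$ scores and the values `k=2`, `z=2` below are assumptions; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def knn_distance(points: Sequence[Point], index: int, k: int) -> float:
    """Distance from points[index] to its k-th nearest neighbor (brute force)."""
    dists = sorted(math.dist(points[index], q) for j, q in enumerate(points) if j != index)
    return dists[k - 1]


def remove_knn_outliers(points: Sequence[Point], k: int, z: int) -> List[Point]:
    """Drop the z points with the largest k-NN distance and return the remaining points."""
    scores = [knn_distance(points, i, k) for i in range(len(points))]
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:z])
    return [p for i, p in enumerate(points) if i not in dropped]


if __name__ == "__main__":
    cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
    cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
    outliers = [(50.0, 50.0), (-40.0, 30.0)]
    cleaned = remove_knn_outliers(cluster_a + cluster_b + outliers, k=2, z=2)
    print(cleaned)  # the remaining points can be passed to any standard k-Means solver
```

The brute-force neighbor search is quadratic in the number of points; a spatial index would be the natural replacement at scale.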

2605.07127 2026-05-11 cs.LG cs.CL

The Position Curse: LLMs Struggle to Locate the Last Few Items in a List

Zhanqi Zhang, Hua-Dong Xiong, Robert C. Wilson, Mikio Aoi, Marcelo G. Mattar, Li Ji-An

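AI summary Modern large language models (LLMs) excel at locating specific information in large amounts of text, yet often fail to locate the last few items in a short list, a failure we call the "Position Curse." The study uses two complementary tasks to evaluate retrieval by position and by item within a sequence, and finds that backward retrieval (for example, counting from the end of the list toward its start) is markedly weaker than forward retrieval. To address this, the authors build PosBench, a position-focused training dataset; LoRA fine-tuning improves both forward and backward retrieval, but performance remains far from saturated. The finding highlights the importance of position-based retrieval for tasks such as code understanding and editing, and points to new directions for future pretraining objectives and model design.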

Details
English abstract

Modern large language models (LLMs) can find a needle in a haystack (locating a single relevant fact buried among hundreds of thousands of irrelevant tokens) with near-saturated accuracy, yet fail to retrieve the last few items in a short list. We call this failure the Position Curse. For instance, even in a two-line code snippet, Claude Opus 4.6 misidentifies the second-to-last line most of the time. To characterize this failure, we evaluated two complementary queries: given a position in a sequence (of letters or words), retrieve the corresponding item; and given an item, return its position. Each position is specified as a forward or backward offset from an anchor, either an endpoint of the list (its start or end) or another item in the list. Across both open-source and frontier closed-source models, backward retrieval substantially lags forward retrieval. To test whether this capability can be rescued by post-training, we constructed PosBench, a position-focused training dataset. LoRA fine-tuning improves both forward and backward retrieval and generalizes to a held-out code-understanding benchmark (PyIndex), yet absolute performance remains far from saturated. As LLM coding agents increasingly operate over large codebases where precise indexing becomes essential for code understanding and editing, position-based retrieval emerges as a key capability for future pretraining objectives and model design.
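
The two query types in the abstract can be made concrete with ground-truth functions of the kind a benchmark would score model answers against. This is a sketch under stated assumptions: offsets are 1-based, items are unique, and only the list endpoints are used as anchors; the function names are hypothetical and the paper's actual prompt format is not given in the abstract.

```python
def item_at(items, offset, from_end=False):
    """Return the item at a 1-based offset from the start or the end of the list."""
    if offset < 1 or offset > len(items):
        raise IndexError("offset out of range")
    return items[-offset] if from_end else items[offset - 1]


def position_of(items, item, from_end=False):
    """Return the 1-based position of item, counted from the start or the end."""
    idx = items.index(item)
    return len(items) - idx if from_end else idx + 1


letters = ["q", "d", "x", "m", "b"]
second_forward = item_at(letters, 2)                    # forward query
second_to_last = item_at(letters, 2, from_end=True)     # backward query
pos_backward = position_of(letters, "m", from_end=True)
```

Both directions are trivial for a program, which is what makes the reported gap between forward and backward retrieval in LLMs notable.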

2605.07123 2026-05-11 cs.LG

Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought

Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang

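AI summary This paper studies the convergence and emergence of in-context reinforcement learning (ICRL) with chain of thought (CoT), and gives the first theoretical analysis of how CoT strengthens ICRL. In a policy evaluation setting with a linear Transformer, the authors prove that under specific parameters CoT generation is equivalent to repeatedly executing temporal difference learning updates, and they give a finite-sample convergence analysis showing that the policy evaluation error decreases geometrically with CoT length and eventually settles at a statistical floor determined by the context length. They also prove that these parameters are a global minimizer of the pretraining loss, which gives a theoretical explanation for their empirical emergence.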

Details
English abstract

In-context reinforcement learning (ICRL) refers to the ability of RL agents to adapt to new tasks at inference time without parameter updates by conditioning on additional context. Recent empirical studies further demonstrate that Chain-of-Thought (CoT) generation can amplify this ICRL capability. This paper is the first to provide a theoretical understanding of how CoT interacts with ICRL. We conduct our analysis in a policy evaluation setup with a linear Transformer. We prove that with specific Transformer parameters, the CoT generation process is equivalent to repeatedly executing temporal difference learning updates. Additionally, we provide a finite-sample convergence analysis showing that the policy evaluation error decreases geometrically with CoT length and eventually saturates at a statistical floor determined by the context length. We also prove that the desired Transformer parameters are a global minimizer of the pretraining loss, providing a theoretical understanding of the empirical emergence of those parameters.
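
For intuition about "repeatedly executing temporal difference learning updates," the sketch below runs tabular TD(0) sweeps over a fixed batch of transitions. It is a loose analogy under stated assumptions: the number of sweeps stands in for CoT length and the fixed batch for the context, while the paper's construction uses a linear Transformer rather than this hand-written loop. The two-state chain and all names are illustrative.

```python
def td_sweeps(transitions, num_states, gamma, alpha, sweeps):
    """Repeat TD(0) updates over a fixed batch of (s, r, s_next) transitions."""
    v = [0.0] * num_states
    for _ in range(sweeps):
        for s, r, s_next in transitions:
            td_error = r + gamma * v[s_next] - v[s]
            v[s] += alpha * td_error
    return v


# Two-state chain: state 0 moves to state 1 with reward 1;
# state 1 loops on itself with reward 0, so the true values are v = [1, 0].
context = [(0, 1.0, 1), (1, 0.0, 1)]
v_short = td_sweeps(context, num_states=2, gamma=0.9, alpha=0.5, sweeps=1)
v_long = td_sweeps(context, num_states=2, gamma=0.9, alpha=0.5, sweeps=50)
```

In this toy case the error in `v[0]` halves with every sweep, a simple instance of geometric decay in the number of repeated updates.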

2605.07120 2026-05-11 cs.LG stat.ML

When Symbol Names Should Not Matter: A Logistic Theory of Fresh-Symbol Classification

Wenjie Guan, Jelena Bradic

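AI summary This paper studies whether, in fixed-label classification tasks, a model can reason from abstract templates rather than concrete symbol names. The authors analyze regularized kernel logistic classification, characterize the perturbation caused by accidental symbol overlaps in the training data, and encode these overlaps with a colored collision graph. They prove high-probability margin-transfer guarantees for fresh-symbol classification and show that vocabulary size and collision geometry affect classification performance in different ways, offering a new theoretical perspective on symbol abstraction and generalization.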

Details
English abstract

Template tasks have emerged as a clean testbed for asking whether transformers reason with abstract symbols rather than concrete token names. We study the fixed-label classification version of this problem, where train and test examples share latent templates but may use disjoint vocabularies. Unlike next-token prediction, the model need not emit unseen symbols; it must learn a decision rule invariant to symbol renaming. We analyze regularized kernel logistic classification in the transformer-kernel regime. Our main result decomposes the learned predictor into an ideal template-level classifier and a finite-sample perturbation caused by accidental token overlaps in the training data. We encode these overlaps by a colored collision graph and prove high-probability margin-transfer guarantees for fresh-symbol classification. This perspective extends template-based analyses to logistic classification and refines scalar diversity conditions: vocabulary size controls the average rate of collisions, but collision geometry controls whether the ideal classification margin is preserved. More broadly, the same perturbation framework applies to abstraction-augmented inputs, yielding a general margin-versus-collision criterion for identifying when prompting strategies improve fresh-symbol generalization. Synthetic template experiments illustrate the predicted roles of regularization, sample size, and transformer-kernel structure.

2605.07116 2026-05-11 cs.LG cs.AI cs.NA math.NA math.OC

Stabilized neural Hamilton--Jacobi--Bellman solvers: Error analysis and applications in model-based reinforcement learning

Minseok Kim, Yeongjong Kim, Namkyeong Cho, Yeoneung Kim

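AI summary This paper studies a stabilized neural-network method for solving Hamilton--Jacobi--Bellman equations and applies it to model-based reinforcement learning. The method combines a finite-difference policy-evaluation structure with a neural-network representation and minimizes residuals by random continuous collocation, avoiding the limitations of traditional grid methods. The paper develops an error analysis for this hybrid method, proves stability of a single policy-evaluation step, and separates the effects of residual error, initial error, policy mismatch, and model-identification error; it also gives a finite-sample error guarantee and a conditional result for multi-step policy improvement. Experiments on several control tasks compare the method against representative model-based and model-free reinforcement learning baselines.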

Details
English abstract

Physics-informed neural solvers offer a promising route to model-based reinforcement learning in continuous time, where optimal feedback synthesis is governed by Hamilton--Jacobi--Bellman (HJB) equations. Practical implementations often occupy a regime that is neither a classical grid method nor a continuous-PDE PINN: the value function is represented by a neural network, finite-difference HJB policy-evaluation operators are evaluated by network queries at shifted points, and residuals are minimized by random continuous collocation. This regime preserves the stabilized finite-difference policy-evaluation structure while avoiding grid-based value unknowns. We develop an error theory for this hybrid regime. Interpreting finite differences as shift operators acting on neural networks, we prove a population $L^2$ stability estimate for one policy-evaluation step with learned dynamics. The bound separates residual error, initial and exterior-collar mismatch, policy mismatch, and model-identification error, with an explicit gradient amplification factor for learned dynamics, while the underlying linear evaluation stability remains free of hidden inverse-viscosity blow-up. We further give a finite-sample collocation certificate and a conditional multi-step propagation result through greedy policy improvement. Experiments on compact-control LQR up to 64 dimensions, Allen--Cahn control, pendulum, Hopper, and 3D quadrotor benchmarks compare against representative model-based and model-free RL baselines, demonstrating the predicted residual, policy-mismatch, and learned-model error trends.
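
The phrase "finite differences as shift operators" can be illustrated by computing derivatives purely from function queries at shifted points, with no grid of stored values. The sketch below is an assumption-laden illustration: `V` is a plain Python function standing in for a neural value network, and central differences are used for simplicity, whereas the paper's stabilized operators may differ.

```python
def shift(V, x, i, h):
    """Query V at the point x shifted by h along coordinate i."""
    y = list(x)
    y[i] += h
    return V(y)


def central_gradient(V, x, h):
    """Approximate the gradient of V at x from shifted queries only."""
    return [(shift(V, x, i, h) - shift(V, x, i, -h)) / (2 * h) for i in range(len(x))]


def laplacian(V, x, h):
    """Approximate the Laplacian of V at x from shifted queries only."""
    center = V(x)
    return sum(
        (shift(V, x, i, h) - 2 * center + shift(V, x, i, -h)) / h**2
        for i in range(len(x))
    )


def V(x):
    """Stand-in value function: V(x) = |x|^2, with gradient 2x and Laplacian 2 * dim."""
    return sum(c * c for c in x)


point = [1.0, -2.0]
grad = central_gradient(V, point, h=1e-2)
lap = laplacian(V, point, h=1e-2)
```

Because every quantity is obtained by evaluating `V` at a collocation point and its shifted neighbors, the same code works at randomly sampled points, which is the property the abstract relies on.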

2605.07115 2026-05-11 cs.LG stat.ML

Conformal-Style Quantile Analyses for Stochastic Bandits

Chengyu Du, Mengfan Xu

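AI summary This paper studies stochastic multi-armed bandits in which arms are judged by strong upper-tail performance rather than the traditional mean-reward criterion. The authors propose a conformal-style framework for upper-quantile analysis and design the ACP-UCB1 algorithm, which combines an adaptive conformal estimate of the upper endpoint with a UCB-type optimism bonus. The method achieves logarithmic upper-quantile regret, and both the theoretical analysis and numerical experiments indicate improvement over the classical UCB1 algorithm.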

Details
English abstract

Stochastic bandit algorithms are usually analyzed under a mean-reward criterion, yet many problems favor arms with strong upper-tail performance, which we study herein. For a fixed miscoverage level \(α\), the natural upper-tail target of arm \(j\) is the upper endpoint \(F_j^{-1}(1-α/2)\) of a central prediction interval. This target can rank arms differently from their means, creating a central mismatch with the classical bandit objective. To this end, we propose ACP-UCB1, a conformal-style policy that combines an adaptive conformal estimate of the upper endpoint with a UCB-type optimism bonus. The technical challenge is that the conformity scores used by ACP-UCB1 are recomputed from evolving empirical quantile estimates and evaluated at an adaptive level. We control this endpoint through reward-quantile concentration, a perturbation argument for recomputed score quantiles, and deterministic localization of the adaptive level. ACP-UCB1 achieves logarithmic upper-quantile regret with per-arm contribution \(O(\nicefrac{\log n}{Δ_j^{\mathrm{ACP}}})\). We also provide metric-specific regret decompositions comparing ACP-UCB1 with UCB1 and use numerical experiments to validate performance and improvement.
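
To see how the upper endpoint \(F_j^{-1}(1-α/2)\) can rank arms differently from their means, the sketch below compares two toy arms using a plain empirical quantile. This is only an illustration of the target quantity under assumed reward samples; it is not ACP-UCB1, which uses an adaptive conformal estimate plus an optimism bonus.

```python
import math


def upper_endpoint(rewards, alpha):
    """Empirical (1 - alpha/2) quantile: smallest sample whose ECDF reaches that level."""
    ordered = sorted(rewards)
    k = math.ceil((1 - alpha / 2) * len(ordered))
    return ordered[max(k, 1) - 1]


def mean(rewards):
    return sum(rewards) / len(rewards)


arm_steady = [0.5] * 10              # always pays 0.5
arm_spiky = [0.0] * 8 + [2.0] * 2    # usually pays nothing, sometimes pays 2

alpha = 0.2
mean_ranking = max(["steady", "spiky"], key=lambda a: mean(arm_steady if a == "steady" else arm_spiky))
tail_ranking = max(
    ["steady", "spiky"],
    key=lambda a: upper_endpoint(arm_steady if a == "steady" else arm_spiky, alpha),
)
```

Here the mean criterion prefers the steady arm while the upper-quantile criterion prefers the spiky one, which is the mismatch the abstract refers to.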

2605.07114 2026-05-11 cs.LG

Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

Tao Wang, Shuo Li, Yan Sun, Dongsheng Ding, Edgar Dobriban

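AI summary This paper studies how to allocate rollouts efficiently in group-based reinforcement learning with verifiable rewards (RLVR) in order to improve the reasoning ability of large language models. To address the inefficiency of uniform rollout allocation in existing methods, the authors propose HORA, a hit-utility optimal rollout allocation policy that requires no training and dynamically adjusts each prompt's rollout budget to maximize posterior hit utility. Experiments on several mathematical reasoning benchmarks show that HORA improves over compute-matched existing methods in most configurations and is compatible with other group-based estimators.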

Details
English abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a central paradigm for improving the reasoning capabilities of large language models. Group-based policy optimization methods, such as GRPO, typically allocate a fixed number of rollouts to every prompt. This uniform allocation can be inefficient: it over-allocates compute to prompts whose sampled groups are already saturated while under-exploring prompts for which additional samples may reveal useful correct trajectories. To address this limitation, we introduce hit utility, the posterior probability that at least one rollout in a proposed additional allocation for a prompt will be correct. Building on this notion, we propose Hit-Utility Optimal Rollout Allocation (HORA), a learning-free rollout allocation policy that maximizes total posterior hit utility within each allocation batch. HORA adaptively reallocates rollout budgets while leaving the downstream reward evaluation and group-based advantage estimator unchanged. Across four mathematical reasoning benchmarks and three model scales, HORA preserves comparable Pass@1 and improves Pass@K over compute-matched GRPO in ten of twelve model--benchmark configurations, with one tie and one saturated exception. It is also drop-in compatible with other group-based estimators such as RLOO. Ablation studies indicate that the uniform prior used by HORA is competitive with five prompt-conditioned learned-prior alternatives.
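
A worked instance of hit utility may help. The sketch below assumes a Beta-Bernoulli model: each prompt has an unknown per-rollout success probability with a uniform Beta(1, 1) prior, so after $s$ correct and $f$ incorrect rollouts the posterior is Beta(1+s, 1+f). The abstract mentions a uniform prior but does not spell out the model, so this choice and the function name are assumptions rather than HORA itself.

```python
def hit_utility(successes, failures, extra):
    """Posterior probability that at least one of `extra` new rollouts is correct.

    Under a Beta(a, b) posterior with a = 1 + successes and b = 1 + failures,
    P(all `extra` rollouts fail) = prod_{i=0}^{extra-1} (b + i) / (a + b + i).
    """
    a, b = 1 + successes, 1 + failures
    p_all_fail = 1.0
    for i in range(extra):
        p_all_fail *= (b + i) / (a + b + i)
    return 1.0 - p_all_fail


# A prompt with 0 correct out of 8 rollouts so far: utility of 1, 4, and 8 more rollouts.
utilities = [hit_utility(0, 8, m) for m in (1, 4, 8)]
```

Under this model the marginal gain of each additional rollout shrinks, which is why an allocation rule has to trade off extra samples on one prompt against first samples on another.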

2605.07113 2026-05-11 cs.LG math.OC

Solving Max-Cut to Global Optimality via Feasibility-Preserving Graph Neural Networks

Hao Chen, Chendi Qian, Christopher Morris, Andrea Lodi, Can Li

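AI summary This paper studies how graph neural networks can help solve the Max-Cut problem to global optimality efficiently. The authors propose a Max-Cut-specific graph neural network as a lightweight substitute for traditional semidefinite programming (SDP) relaxation solvers that can be used directly inside a branch-and-bound framework. The network predicts primal- and dual-feasible SDP solutions, and the primal solutions yield feasible Max-Cut solutions via the Goemans--Williamson algorithm; experiments show a substantial reduction in computational cost compared with a traditional SDP solver.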

Details
English abstract

Exact solution of hard combinatorial optimization problems often relies on strong convex relaxations, but solving these relaxations repeatedly inside a branch-and-bound algorithm can be prohibitively expensive. Hence, we consider this challenge for Max-Cut, where branch and bound commonly uses semidefinite programming (SDP) relaxations to bound subproblems. We propose a Max-Cut-specific graph neural network that serves as a principled, lightweight neural proxy for these SDP solvers and can be plugged directly into an exact branch-and-bound framework. The proposed architecture has update steps of complexity $\mathcal{O}(n^2 + ne)$, and predicts both primal- and dual-feasible SDP solutions. The primal SDP solutions yield feasible Max-Cut solutions via the Goemans--Williamson algorithm. In addition, it is trained in a self-supervised fashion without requiring solved SDP relaxations as labels. Empirically, we show that our architecture can substantially reduce the cost of bounding in exact Max-Cut solving by up to $10.6 \times$ compared with using the state-of-the-art SDP solver Mosek. Our work highlights the potential of learned, validity-preserving surrogates for accelerating exact optimization over structured convex relaxations.
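
The step from a primal SDP solution to a feasible cut uses Goemans--Williamson random-hyperplane rounding, sketched below. The sketch assumes the SDP solution is already available as one unit vector per vertex (here hard-coded for a 4-cycle); how those vectors are produced, by a solver or by the proposed network, is outside its scope, and the names are illustrative.

```python
import random


def hyperplane_round(vectors, rng):
    """Assign each vertex to a side of a random hyperplane through the origin."""
    dim = len(vectors[0])
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return [
        1 if sum(a * b for a, b in zip(v, normal)) >= 0 else -1
        for v in vectors
    ]


def cut_value(edges, sides):
    """Total weight of edges whose endpoints fall on different sides."""
    return sum(w for u, v, w in edges if sides[u] != sides[v])


# 4-cycle with unit weights; antipodal vectors for neighboring vertices.
square_edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0)]
vectors = [(1.0, 0.0), (-1.0, 0.0), (1.0, 0.0), (-1.0, 0.0)]

rng = random.Random(0)
sides = hyperplane_round(vectors, rng)
value = cut_value(square_edges, sides)
```

In practice the rounding is repeated with several random hyperplanes and the best cut is kept; any assignment it returns is a feasible cut, which is what makes it usable for primal bounds in branch and bound.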

2605.07112 2026-05-11 cs.AI cs.MA

Switchcraft: AI Model Router for Agentic Tool Calling

Sharad Agarwal, Pooria Namyar, Alec Wolman, Rahul Ambavat, Ankur Gupta, Qizheng Zhang

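AI summary Switchcraft is an AI model router optimized for agentic tool calling, designed to address the high inference cost of agent systems built on large models. When a tool is invoked, it dynamically selects the lowest-cost model that still preserves correctness, which substantially reduces inference overhead. Experiments show that across several benchmarks Switchcraft matches the accuracy of the best individual model while reducing inference cost by 84%, offering a new option for efficient and economical deployment of agent systems.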

Details
English abstract

Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets. Model routing can mitigate this, but existing routers are designed for chat completion rather than tool use. We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling. Switchcraft operates inline, selecting the lowest-cost model subject to correctness. We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget. Switchcraft achieves 82.9% accuracy -- matching or exceeding the best individual model -- while reducing inference cost by 84%, saving over $3,600 per million queries. We find that larger models do not consistently outperform smaller ones on tool-use tasks, and that nominally cheaper models can incur higher total cost due to token-intensive reasoning. Our work enables cost-aware agentic AI deployment without sacrificing correctness.
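
"Selecting the lowest-cost model subject to correctness" can be sketched as a simple decision rule over per-model correctness scores. Everything below is hypothetical: the model names, costs, scores, the 0.8 threshold, and the fallback rule are assumptions for illustration, and the scores stand in for the output of a classifier such as the DistilBERT-based one mentioned in the abstract.

```python
def route(candidates, threshold=0.8):
    """Pick a model from (name, cost, p_correct) tuples.

    Returns the cheapest model whose predicted correctness meets the threshold;
    if none qualifies, falls back to the model most likely to be correct.
    """
    eligible = [c for c in candidates if c[2] >= threshold]
    if eligible:
        return min(eligible, key=lambda c: c[1])[0]
    return max(candidates, key=lambda c: c[2])[0]


# Hypothetical per-query predictions for three models (cost in arbitrary units).
easy_query = [("small", 1.0, 0.92), ("medium", 4.0, 0.95), ("large", 20.0, 0.97)]
hard_query = [("small", 1.0, 0.30), ("medium", 4.0, 0.55), ("large", 20.0, 0.75)]

choice_easy = route(easy_query)
choice_hard = route(hard_query)
```

A real deployment would also need to account for token-dependent cost, since the abstract notes that nominally cheaper models can end up more expensive when they reason at length.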

2605.07110 2026-05-11 cs.CL cs.SE

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

Zejian Chen, Zhanyuan Liu, Chaozhuo Li, Mengxiang Han, Songyang Liu, Litian Zhang, Feng Gao, Yiming Hei, Xi Zhang

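AI summary As computer-use agents (CUAs) move from bounded benchmarks to real software environments, their reliability is no longer measured by task success rate alone and must also account for perception errors, planning drift, permission scope, and other factors. This article proposes a unified architecture-lifecycle framework for ensuring CUA reliability in deployment: the architectural view analyzes the coupling of perception, decision, and execution, and the lifecycle view examines reliability assurance across creation, deployment, operation, and maintenance. The framework supports systematic analysis of existing CUA systems, benchmarks, and security studies, and identifies key intervention points for improving control and assurance.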

Details
English abstract

Computer-use agents (CUAs) are moving from bounded benchmarks toward real software environments, where they operate browsers, desktops, mobile applications, filesystems, terminals, and tool backends. In such settings, reliability is no longer captured by task success alone: perception errors, planning drift, memory use, tool mediation, permission scope, and runtime oversight jointly determine whether agent actions remain aligned with user intent. Existing surveys organize the CUA landscape by methods, platforms, benchmarks, or security threats, but less explicitly connect capability formation, authority exposure, failure manifestation, and control placement. To address this gap, the article develops an architecture-lifecycle framework for deployment-grounded reliability in CUAs. The architectural view analyzes Perception, Decision, and Execution as coupled layers that transform software observations into authority-bearing actions. The lifecycle view examines Creation, Deployment, Operation, and Maintenance as stages in which priors are learned, tools and permissions are bound, runtime trajectories are stressed, and assurance must be preserved under drift. Using this lens, the analysis synthesizes representative systems, benchmarks, and security/privacy studies; distinguishes where failures become visible from where their enabling conditions are introduced; and maps recurring intervention surfaces for control, oversight, and assurance. OpenClaw is used only as a public motivating example of an open deployment pattern, not as a verified internal case study. The conclusion highlights open challenges in controllable grounding, long-horizon constraint preservation, safe authority binding, mixed-trust runtime defense, privacy-preserving memory, and continual assurance.

2605.07106 2026-05-11 cs.CL

Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

Jin Cui, Xinyue Long, Xunyong Zhang, Yadong Zhang, Chuanchang Su, Jingye Gan, Boran Zhao, Pengju Ren

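AI summary This work addresses the information bottleneck and insufficient latent-space compatibility of multimodal large language models in visual reasoning by proposing RIS, a latent reasoning framework grounded in spatial and semantic evidence. RIS builds a step-wise reasoning dataset with bounding boxes and region-level semantic descriptions, anchors latent tokens to visual and semantic evidence, and introduces a progressive attention mechanism and language transition tokens to improve the interpretability of the reasoning process and the quality of generation. Experiments show that RIS consistently improves over existing methods on several visual reasoning benchmarks, offering a practical path toward faithful internal visual reasoning.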

Comments 19 pages, 8 figures

Details
English abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress on vision-language reasoning, yet most methods still compress visual evidence into discrete textual thoughts, creating an information bottleneck for fine-grained perception. Recent latent visual reasoning methods attempt to reason in continuous hidden states, but we find that they suffer from insufficient manifold compatibility: latent trajectories drift away from pretrained reasoning circuits, collapse into instance-agnostic patterns, and are often bypassed during answer generation. To address these issues, we propose RIS (Retrieve, Integrate, and Synthesize), a spatial-semantic grounded framework that develops latent reasoning as a compatible extension of pretrained MLLM computation. We first construct a step-wise grounded reasoning dataset with bounding boxes and region-specific semantic descriptions. Built on this supervision, RIS anchors latent tokens to both spatial and semantic evidence, enforces their causal role through a progressive attention bottleneck, and introduces short language transition tokens to bridge synthesized latent states back to vocabulary-aligned decoding. Experiments on V*, HRBench4K, HRBench8K, MMVP, and BLINK show consistent improvements over closed/open-source and latent reasoning baselines. Further analyses demonstrate that RIS learns diverse, interpretable, and progressively integrated latent trajectories, offering a practical path toward faithful internal visual reasoning in MLLMs.

2605.07105 2026-05-11 cs.LG cs.CL cs.CY cs.IT math.IT

Theoretical Limits of Language Model Alignment

Lucas Monteiro Paes, Natalie Mackraz, Barry-John Theobald, Federico Danieli

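AI summary This paper studies the theoretical limits of language model alignment under a fixed KL-divergence budget. It analyzes the maximum achievable reward gain and derives a closed-form expression governed by a Jeffreys divergence, showing the shortcomings of the $\sqrt{\texttt{KL}}$ term used in prior analyses. The study also shows that reward ensembling mitigates reward hacking, and experiments show that best-of-$N$ comes close to the theoretical limit while PPO and GRPO fall well short, providing a theoretical basis for improving alignment algorithms.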

Details
English abstract

Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows with the magnitude of reward error and when the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two tasks for LMs, safety and summarization, and show that best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.
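
Best-of-$N$ alignment, as defined in the abstract, is short enough to write out directly. The sketch below also includes the commonly cited expression $\log N - (N-1)/N$ for the KL divergence between the best-of-$N$ policy and the base model; that expression is exact only when reward ties have probability zero and is otherwise an upper bound, and it is background knowledge rather than a result taken from this abstract. The sampler and reward arguments are placeholders.

```python
import math


def best_of_n(sample, reward, n, rng=None):
    """Draw n independent outputs with sample(rng) and return the highest-reward one."""
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)


def best_of_n_kl(n):
    """Commonly cited KL(best-of-n || base) in nats; an upper bound when ties can occur."""
    return math.log(n) - (n - 1) / n


# Toy usage: "outputs" are numbers and the reward is the number itself.
stream = iter([3, 9, 4])
picked = best_of_n(lambda rng: next(stream), lambda y: y, n=3)
kl_budget = [best_of_n_kl(n) for n in (1, 4, 16)]
```

The KL grows only logarithmically in $N$ while inference cost grows linearly, which is the cost trade-off the abstract alludes to when it asks for algorithms that reach the limit without high inference costs.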

2605.07104 2026-05-11 cs.LG math.OC stat.ML

Almost Sure Convergence Rates of Stochastic Approximation and Reinforcement Learning via a Poisson-Moreau Drift

Xinyu Liu, Zixuan Xie, Shangtong Zhang

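AI summary This paper studies almost sure convergence rates of stochastic approximation and reinforcement learning algorithms under Markovian noise. For a class of algorithms whose expected updates are contractive (such as $Q$-learning and linear temporal difference learning), the authors propose a Lyapunov drift construction with a Poisson-equation-based correction and obtain near-optimal convergence rates for power-law and harmonic learning rates. The approach provides a new theoretical tool for understanding the convergence of reinforcement learning algorithms under non-i.i.d. noise.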

Details
English abstract

Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic approximation algorithms whose expected updates are contractive, a setting that arises in many reinforcement learning algorithms such as $Q$-learning and linear temporal difference learning. Specifically, for a power-law learning rate $O(n^{-η})$ with $η\in (1/2, 1)$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{1 - 2η})$. For a harmonic learning rate $O(n^{-1})$, we obtain an almost sure convergence rate arbitrarily close to $o(n^{-1})$, which we argue is a strong result because it is close to the optimal rate $O(n^{-1}\log\log n)$ given by the law of the iterated logarithm (for a special case of i.i.d. noise). Key to our analysis is a novel Lyapunov drift construction that applies a Poisson-equation based correction for Markovian noise to the well-established Moreau-envelope smoothing for the contractive mapping.

2605.07103 2026-05-11 cs.AI cs.MA

ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning

Ye Liu, Botao Yu, Xinyi Ling, Daniel Adu-Ampratwum, Xia Ning

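AI summary This paper proposes ARMOR, an agentic framework for reaction feasibility prediction in computational chemistry. By modeling tool-specific utilities, adaptively prioritizing tools, and resolving conflicts among tools, the framework integrates multiple AI tools to improve prediction accuracy. Experiments show that ARMOR consistently outperforms existing methods on a public dataset, especially on reactions with conflicting tool predictions, demonstrating its strength in coordinating multiple tools.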

Details
English abstract

Reaction feasibility prediction, as a fundamental problem in computational chemistry, has benefited from diverse tools enabled by recent advances in artificial intelligence, particularly large language models. However, the performance of individual tools varies substantially across reactions, making it difficult for any single tool to consistently perform well across all cases. This raises a critical challenge: how to effectively leverage multiple tools to obtain more accurate feasibility predictions. To address this, we propose ARMOR, an agentic framework that explicitly models tool-specific utilities, adaptively prioritizes tools, and further resolves potential tool conflicts to produce the final prediction for each reaction. Unlike existing approaches that rely on simple aggregation or heuristic assignment over various tools, ARMOR organizes tools into a hierarchy that prioritizes top-performing tools and defers others when needed, characterizes their strengths through tool-specific patterns, and resolves conflicts via memory-augmented reasoning. Extensive experiments on a public dataset demonstrate that ARMOR consistently outperforms strong baselines, including single-tool methods as well as various tool aggregation and tool selection approaches. Further analysis shows that the improvements are particularly significant on reactions with conflicting tool predictions, highlighting the effectiveness of ARMOR in leveraging the complementary strengths of multiple tools. The code is available via https://anonymous.4open.science/r/ARMOR-E13F.