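arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.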

2606.02380 2026-06-02 cs.CL cs.AI

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

Yuyan Bu, Haowei Li, Qirui Zheng, Bowen Dong, Kaiyue Yang, Jiaming Ji, Yingshui Tan, Wenxin Li, Yaodong Yang, Juntao Dai

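Affiliations: Beijing Academy of Artificial Intelligence; Peking University; University of Science and Technology of China; University of Chinese Academy of Sciences; Alibaba Group

AI summary: To address spontaneous strategic deception that LLM agents may exhibit during tool use (a mismatch between plan and action), the paper proposes the SPADE-Bench benchmark, which combines actual tool execution with controlled pressure scenarios to rigorously separate deception from hallucination; experiments confirm that the problem is real and pressing.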

Abstract

As LLM-based agents expand their operational scope, reliability becomes a prerequisite for real-world deployment. However, in practical applications, human users cannot monitor every immediate behavior; instead, the execution process often remains a black box, leaving users dependent solely on the agent's self-reported updates. This opacity creates a critical risk: agents may present observer-facing reports that diverge from their executed actions, rendering the system uncontrollable, especially in high-stakes autonomous scenarios. We term such self-reported plan-action divergence as agent deception. To assess this, we introduce SPADE-Bench, a benchmark designed to evaluate spontaneous plan-action divergence. Unlike prior deception benchmarks, SPADE-Bench simultaneously integrates actual tool execution and controlled pressure scenarios. This design ensures ecological validity and rigorously distinguishes strategic deception from mere hallucination through controlled plan-action comparisons under pressure. Experiments across mainstream models confirm that agent deception is a genuine and pressing issue in tool-use contexts. By providing a comprehensive and robust evaluation framework, SPADE-Bench fills a critical gap in agent safety, facilitating the community's progress toward building trustworthy and controllable autonomous systems.

2606.02379 2026-06-02 cs.CV

Honey, I Shrunk the Arc de Triomphe!

Yuanbo Xiangli, Hanyu Chen, Xueqing Tsang, Noah Snavely

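Affiliations: Cornell University; Shanghai Jiao Tong University

AI summary: Targeting the "scale-collapse" phenomenon in monocular metric geometry estimation, the authors build a new dataset, MetricScenes, improve depth-map quality with a two-stage Poisson completion method, and show that fine-tuning MoGe-2 significantly mitigates scale underestimation.

Comments: Project page: https://metricscenes.github.io/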

Abstract

Metric-scale monocular geometry estimation has seen significant progress through large-scale data aggregation, yet current foundation models suffer from a persistent "scale-collapse" phenomenon: distant landmarks and vast landscapes are metrically underestimated. We hypothesize that this performance gap stems from a training data bottleneck, where existing metric-scale datasets are hardware-constrained to homogeneous vehicle-captured LiDAR or short-range indoor scans, or consist of synthetic data that lacks the semantic complexity of the physical world. To bridge this gap, we curate a new metrically-grounded, in-the-wild dataset that we call MetricScenes, gathered from a variety of sources including Internet photo collections and stereo imagery. We estimate camera poses and initial depth maps for each scene using off-the-shelf methods, and recover absolute scale from geo-tagged metadata as well as known stereo camera baselines. We also improve the quality of depth maps derived from MetricScenes via a new two-stage Poisson completion method. Fine-tuning MoGe-2 on our dataset significantly mitigates scale-collapse and achieves superior metric accuracy in unconstrained, open-domain scenes while maintaining state-of-the-art performance on standard benchmarks.

2606.02375 2026-06-02 cs.CL cs.CY cs.HC

WAXAL-NET: Finetuned Edge ASR Across 19 African Languages

Victor Tolulope Olufemi, Oreoluwa Babatunde, Ramsey Njema, Bolarinwa Gbotemi, Wanchi Lucia Yen, John Uzodinma, Sunday Ajayi, Oluwademilade Williams, Kausar Moshood, Innocent Elendu Anyaele, Akebert Arefaine, Candace Hunzwi, Wongel Dawit Daniel, Emmilly Namuganga, Cleophas Kadima, Athanase Bahizire, Onitsiky Ranaivoson, Emmanuel Aaron, Nicholaus Ladislaus, Idris Muhammed, Jonathan Enoch Simenya, Martin Koome, Matewos Tegete Endaylalu, Peter Ifeoluwa Adeyemo, Hondi Prisca Birindwa, Ukachi Agnes Eze-Mbey, Yacoba Oduro-Yeboah, Pericles Adjovi, Mikel K. Ngueajio, Toluwani Aremu, Prasenjit Mitra

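Affiliations: CMU Africa; LyngualLabs; MBZUAI

AI summary: This study evaluates whether compact domain-specialized ASR models outperform massively multilingual foundation models on conversational speech in 19 African languages from the WAXAL corpus. Fine-tuned edge models reduce macro-averaged WER from 64.9% to 38.0% with models 3-40x smaller, confirming that domain specialization dominates scale.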

Abstract

We evaluate whether compact domain-specialized ASR models can outperform massively multilingual foundation models for conversational African speech across 19 languages in the WAXAL corpus. Fine-tuned edge models achieve a macro-averaged WER of $38.0\%$ compared to $64.9\%$ for the best zero-shot baseline, a $26.9$ percentage-point reduction using models $3-40\times$ smaller. Results confirm that domain specialization dominates scale for spontaneous African speech. Cross-domain evaluation shows that fine-tuned models recover usable performance on out-of-distribution (OOD) speech, while zero-shot models regain an advantage when the test domain matches their pretraining distribution. A distributed native-speaker audit across all surveyed languages produces a linguistically-grounded error taxonomy, showing that CTC and autoregressive architectures behave differently across language families. We further show that WER alone misrepresents performance for syllabary-script languages where CER/WER ratios reveal substantially higher character-level accuracy than headline WER suggests. Finally, to contribute to future African ASR research, we release all model weights, fine-tuning and evaluation scripts, and a cleaned WAXAL subset covering all $19$ languages.
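
As a rough illustration of why the abstract reports CER alongside WER, the sketch below computes both metrics from a Levenshtein edit distance and checks the 26.9-point gap quoted above. The function names and the toy strings are hypothetical and are not taken from the WAXAL-NET release; the paper's own scoring scripts may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference token
                            curr[j - 1] + 1,      # insert a hypothesis token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate; assumes a non-empty reference."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate with spaces ignored; assumes a non-empty reference."""
    ref_chars = list(ref.replace(" ", ""))
    return edit_distance(ref_chars, list(hyp.replace(" ", ""))) / len(ref_chars)


def macro_average(scores):
    """Unweighted mean over per-language scores."""
    return sum(scores) / len(scores)


# Toy pair (made up): one wrong character spoils a whole word.
reference, hypothesis = "selam alem", "selam alen"
word_rate = wer(reference, hypothesis)   # 1 of 2 words wrong
char_rate = cer(reference, hypothesis)   # 1 of 9 characters wrong

# Percentage-point reduction quoted in the abstract.
gap = round(64.9 - 38.0, 1)
```

In the toy pair a single character error counts as a full word error, which is the kind of mismatch between headline WER and character-level accuracy that the abstract describes for syllabary-script languages.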

2606.02372 2026-06-02 cs.AI cs.CL

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

Youwei Liu, Jian Wang, Hanlin Wang, Wenjie Li

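Affiliations: Central South University; College of Computer Science, Sichuan University; Department of Computing, The Hong Kong Polytechnic University

AI summary: Proposes the COMAP framework, which co-evolves a textual world model and an agent policy through closed-loop interaction and significantly improves performance on embodied task planning, web navigation, and tool-use benchmarks.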

Abstract

Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution. However, existing textual world models are typically fixed after training, preventing them from adapting to the on-policy state-action distributions induced by an evolving agent. Meanwhile, agent-improvement methods often rely on external rewards or verifiers, limiting their applicability in realistic interactive environments. In this paper, we propose COMAP, a novel framework that co-evolves textual world models and agent policies through closed-loop interaction. At each decision step, the world model predicts future state feedback for candidate actions, and the agent performs future-aware reflection by estimating the reliability of this feedback and refining its action accordingly. The resulting on-policy trajectories are then used to update the world model via self-distillation, allowing it to better match the agent's evolving interaction distribution. Across embodied task planning, Web navigation, and tool-use benchmarks, COMAP consistently outperforms competitive baselines, e.g., +16.75% relative improvement with Qwen3-4B. Further analyses show that the co-evolutionary loop improves the world model's prediction accuracy over time and leads to more effective long-horizon decision-making. Our code is available at: https://github.com/loyiv/CoMAP.

2606.02366 2026-06-02 cs.CV

PRIMA: Boosting Animal Mesh Recovery with Biological Priors and Test-Time Adaptation

Xiaohang Yu, Ti Wang, Mackenzie Weygandt Mathis

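Affiliations: École Polytechnique Fédérale de Lausanne

AI summary: Proposes the PRIMA framework, which uses biological priors (BioCLIP embeddings) and a test-time adaptation strategy for 3D quadruped mesh recovery under severe species and pose imbalance, achieving strong generalization and yielding Quadruped3D, a large-scale pseudo-3D dataset.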

Abstract

We present PRIMA (*PRI*ors for *M*esh *A*daptation), a framework for robust 3D quadruped mesh recovery under severe species and pose imbalance. Existing animal reconstruction methods often regress toward mean shapes and poses due to limited 3D supervision and long-tailed species distributions, resulting in poor generalization to underrepresented animals and rare articulations. PRIMA addresses this challenge through three key contributions. First, we incorporate BioCLIP embeddings as biological priors to inject semantic and morphological knowledge into the reconstruction process, enabling more accurate and generalizable shape prediction across diverse quadrupeds. Second, we introduce a test-time adaptation (TTA) strategy that refines SMAL predictions using 2D reprojection constraints together with auxiliary keypoint guidance, improving pose and shape estimation while enabling the generation of high-quality pseudo-3D annotations from existing 2D datasets. Third, leveraging this TTA framework, we construct Quadruped3D, a large-scale pseudo-3D dataset that covers diverse species and pose variations to systematically improve model performance. Extensive experiments on Animal3D, CtrlAni3D, Quadruped2D, and Animal Kingdom demonstrate that PRIMA achieves state-of-the-art results, with particularly strong improvements on underrepresented species and challenging poses. Our results highlight the importance of biological priors and adaptation-driven data expansion for scalable and generalizable animal mesh recovery. Code is available at https://github.com/AdaptiveMotorControlLab/PRIMA.

2606.02365 2026-06-02 cs.LG cs.AI

FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo

Kyunghun Nam, Sumyeong Ahn

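AI summary: Proposes the FOAM algorithm, which adaptively controls the damping factor and the eigendecomposition frequency to suppress staleness-oriented error, reducing Shampoo's computation time while maintaining convergence.

Comments: 9 pages, ICML 2026 camera-ready version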

Abstract

Shampoo is attracting considerable attention for its superior performance on large-scale optimization benchmarks; yet it faces a significant practical bottleneck: the prohibitive computational overhead of matrix inversion. To mitigate this, practitioners typically rely on stale preconditioner updates, creating a fundamental trade-off between computational efficiency and optimization fidelity. In this work, we provide a theoretical study of staleness through the complementary lenses of convergence and stability. While staleness improves computational efficiency, it inherently degrades performance and introduces numerical instability. Crucially, we identify that damping, acting as a numerical stabilizer, can effectively suppress these negative effects. Guided by this analysis, we propose FOAM, an adaptive algorithm that stabilizes training by dynamically controlling both the damping factor and the eigendecomposition frequency based on an approximation of the staleness-oriented error. Experimental results demonstrate that FOAM reduces wall-clock time compared to standard Shampoo while maintaining robust convergence.

2606.02363 2026-06-02 cs.LG stat.ML

Minimax-Optimal Policy Regret in Partially Observable Markov Games

Raman Arora

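AI summary: For partially observable Markov games, the paper proposes an epoch-based optimistic maximum-likelihood algorithm that achieves $\tilde{O}(\sqrt{T})$ policy regret with dependence on the aggregate Eluder dimension, and proves a matching lower bound.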

Abstract

We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavior depends on the learner's strategy, making standard regret notions inadequate. We prove that an epoch-based optimistic maximum-likelihood algorithm achieves $\tilde{O}(\sqrt{T})$ policy regret for fixed problem parameters, with explicit dependence on the horizon, adversary memory, confidence radius, and the aggregate Eluder dimension of the observable-operator class. The algorithm selects one policy per geometrically growing epoch using confidence sets built cumulatively from past data, which keeps the cost of comparing adversary responses across policies logarithmic in $T$. We also prove a lower bound matching the $\sqrt{T}$ and aggregate-Eluder-dimension dependence, up to problem-dependent and logarithmic factors. Finally, we extend the framework to horizon-adaptive guarantees and adversaries with geometric fading memory.
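
The remark that one policy per geometrically growing epoch keeps the number of policy switches logarithmic in $T$ can be checked with a small sketch. It assumes epoch lengths that double starting from 1; the abstract only says "geometrically growing", so the growth factor and the helper name `epoch_schedule` are illustrative assumptions, not the paper's construction.

```python
import math


def epoch_schedule(T, first_len=1):
    """Split T rounds into epochs whose lengths double; the last one is cut off at T.

    Returns a list of (start, end) pairs with end exclusive.
    """
    epochs, start, length = [], 0, first_len
    while start < T:
        end = min(start + length, T)
        epochs.append((start, end))
        start, length = end, 2 * length
    return epochs


T = 1000
schedule = epoch_schedule(T)
num_epochs = len(schedule)  # 10 epochs: lengths 1, 2, 4, ..., 256, then a truncated epoch
bound = math.ceil(math.log2(T + 1))
```

With doubling lengths the first k epochs cover 2^k - 1 rounds, so the number of epochs, and hence of policy changes, is ceil(log2(T + 1)).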

2606.02359 2026-06-02 cs.AI

MOC: Multi-Order Communication in LLM-based Multi-Agent Systems

Yao Guan, Lin Wang, Zhihu Lu, Ziyi Wang, Wenzhu Yan, Qiang Duan

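Affiliations: Fudan University; Nanjing Normal University; Pennsylvania State University

AI summary: Proposes the Multi-Order Communication (MOC) scheme, which restructures inter-agent communication to capture multi-hop dependencies and adds a structural message consolidation strategy, improving task performance and reducing communication cost across multiple datasets.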

Abstract

Despite the remarkable progress of Large Language Model (LLM) based Multi-Agent Systems, most research focuses on optimizing coordination topology while largely underexploring the equally critical problem: how to transmit and optimize messages among agents effectively? Current communication schemes typically rely on the direct concatenation of first-order neighbor responses, which induces a restricted evidence receptive field and leads to the dilution of crucial insights over multi-hop paths. To address these limitations, we propose the Multi-Order Communication (MOC) scheme, which reconstructs the inter-agent communication to capture multi-hop dependencies and incorporates a structural message consolidation strategy to ensure efficiency. Specifically, we formalize the communication mechanism to construct a structured multi-order evidence stream, and subsequently design a Semantic-Topological Merging algorithm to optimize semantic fidelity within token constraints. Extensive experiments across six diverse datasets and LLM backbones of varying parameter scales demonstrate that MOC consistently improves task performance and reduces communication costs. The code is available at https://github.com/yao-guan/MOC.
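
To make the contrast between first-order and multi-order neighbors concrete, the sketch below groups the agents in a communication graph by hop distance from a given agent. It covers only the graph-traversal idea; the graph, the agent names, and the function `neighbors_by_order` are hypothetical, and MOC's actual message construction and Semantic-Topological Merging are not reproduced here.

```python
from collections import deque


def neighbors_by_order(graph, source, max_order):
    """Group nodes by hop distance (1..max_order) from source using breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    orders = {k: [] for k in range(1, max_order + 1)}
    while queue:
        node = queue.popleft()
        if dist[node] == max_order:
            continue  # do not expand beyond the requested order
        for nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                orders[dist[nxt]].append(nxt)
                queue.append(nxt)
    return orders


# A chain of four agents: a - b - c - d (adjacency lists, made up for illustration).
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
evidence = neighbors_by_order(chain, "a", 2)
```

Concatenating only first-order responses would give agent `a` access to `b` alone; a second-order view also exposes `c`, whose output would otherwise reach `a` only after passing through `b`.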

2606.02355 2026-06-02 cs.AI cs.LG

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu, Leyi Wei, Lu Pan, Ke Zeng, Xunliang Cai

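Affiliations: Xiamen University; Meituan; Macao Polytechnic University

AI summary: Proposes the SIRI framework, in which an LLM agent mines, validates, and internalizes its own skills, improving long-horizon task performance without external skill generators or inference-time skill banks and outperforming baselines on ALFWorld and WebShop.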

Abstract

Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework that enables agents to discover, validate, and internalize skills without external skill generators or inference-time skill banks. SIRI first warms up the policy with GiGPO to acquire basic interaction ability and collect successful skill-free trajectories. It then performs self-skill mining, where the current policy summarizes compact skills from its own successful plain rollouts and validates them through paired skill-augmented and skill-free rollouts. Finally, SIRI distills only beneficial skill-guided action tokens into the plain policy using trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. On ALFWorld and WebShop with Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines. Further analysis shows that our self-mining strategy can achieve performance comparable to distillation from a closed-source large model. Our code is available at https://github.com/kirito618/SIRI.

2606.02352 2026-06-02 cs.CV

Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection

David J. Lerch, Livien Majer, Zeyun Zhong, Manuel Martin, Frederik Diederichs, Rainer Stiefelhagen

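Affiliations: Fraunhofer IOSB; Karlsruhe Institute of Technology (KIT)

AI summary: Proposes a multi-modal global alignment framework that handles faulty negatives and unreliable positives through soft targets and a weighting mechanism; it outperforms existing methods on the Drive&Act dataset and enables robust driver distraction detection.

Comments: Accepted at IEEE ITSC 2026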

Abstract

Robust self-supervised learning of multi-modal video representations is critical for real-world applications such as driver distraction detection, where multiple sensors provide complementary but noisy signals. Conventional contrastive objectives, such as InfoNCE, assume all negatives are equally informative and all positives are reliable. However, this assumption is frequently violated in multi-modal data due to viewpoint changes, occlusions, or semantic overlap across modalities. In this work, we propose a novel framework for multi-modal global alignment that addresses these challenges by jointly modeling faulty negatives and unreliable or faulty positives. We introduce soft targets derived from cycle-consistency scores to relax the hard-negative assumption, and a weighting mechanism based on similarity distributions to mitigate the impact of noisy or faulty positives. Our approach extends traditional pairwise alignment to a principled global multi-modal setting, aggregating alignment information across all modality pairs. We evaluate our method on the Drive&Act dataset, demonstrating that it consistently outperforms both pairwise and existing global alignment baselines across RGB, IR, Depth, and Skeleton modalities. Cross-view ablation studies further show strong generalization to unseen camera perspectives, highlighting the robustness of our representations. Overall, our framework provides a scalable and effective solution for self-supervised global multi-modal representation learning, enabling reliable driver distraction detection and breaking new ground in real-world multi-modal video understanding. Our code will be published on GitHub.
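
For readers unfamiliar with the objective being relaxed, the sketch below writes InfoNCE as a row-wise cross-entropy over a similarity matrix and shows how replacing the one-hot targets with soft targets lowers the penalty on a suspected faulty negative. In the paper the soft targets come from cycle-consistency scores; here they are set by hand, and the function names, similarity values, and temperature are all illustrative assumptions.

```python
import math


def log_softmax(row, temperature):
    """Numerically stable log-softmax of one row of similarities."""
    scaled = [s / temperature for s in row]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def soft_target_loss(sim, targets, temperature=0.1):
    """Mean cross-entropy between target rows and softmax(sim / temperature)."""
    total = 0.0
    for row, tgt in zip(sim, targets):
        logp = log_softmax(row, temperature)
        total -= sum(t * lp for t, lp in zip(tgt, logp))
    return total / len(sim)


def info_nce(sim, temperature=0.1):
    """Standard InfoNCE: the positive for row i is column i (one-hot targets)."""
    n = len(sim)
    one_hot = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return soft_target_loss(sim, one_hot, temperature)


# Row 0's "negative" (column 1) is more similar than its positive: a possible faulty negative.
sim = [[2.0, 3.0],
       [0.0, 2.0]]
hard = info_nce(sim, temperature=1.0)
soft = soft_target_loss(sim, [[0.5, 0.5], [0.0, 1.0]], temperature=1.0)
```

With hard targets the first row is penalized heavily for being close to column 1; splitting the target mass between both columns reduces that penalty, which is the effect the soft targets are meant to have.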

2606.02350 2026-06-02 cs.CV

TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos

Jinpeng Liu, Yukang Xu, Yutong Li, Xingyu Liu

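Affiliations: National University of Singapore

AI summary: Proposes the TROPHIES framework, which jointly estimates dynamic humans, static scenes, and camera poses to produce globally consistent 4D reconstructions from multi-view videos.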

Abstract

Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human-scene-camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES (Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos), a unified framework tailored for this task. TROPHIES features a Human Branch that models humans through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment and optimization module couples both branches by enforcing scale consistency, contact priors, and cross-view temporal coherence. Experiments on EgoHuman and EgoExo4D demonstrate that TROPHIES achieves globally aligned, physically plausible 4D reconstructions and consistently outperforms existing paradigms in both global fidelity and human-scene consistency.

2606.02346 2026-06-02 cs.CV

VEDAL: Variational Error-Driven Asynchronous Learning for 3D Gaussian Splatting Pruning

Aoduo Li, Jiancheng Li, Huan Ye, Hongjian Xu, Shiting Wu, Xiujun Zhang, Zimeng Li, Xuhang Chen

Affiliations: Guangdong University of Technology; Huizhou Boluo Power Supply Bureau, Guangdong Power Grid Co., Ltd.; Shenzhen Polytechnic University; School of Computer Science and Engineering, Huizhou University

AI summary: Proposes the VEDAL framework, which prunes 3D Gaussian Splatting models efficiently via variational free energy minimization, a prediction-error gating mechanism, and a variational uncertainty head, losing only 0.31 dB PSNR at 5.2x compression.

Comments: 12 pages, 5 figures. Accepted by CGI 2026

Abstract

3D Gaussian Splatting (3DGS) achieves remarkable novel view synthesis quality with real-time rendering, yet suffers from excessive memory consumption due to millions of Gaussian primitives. Existing pruning methods rely on heuristic importance scores or synchronous batch updates, leading to suboptimal compression and training instability. We propose VEDAL, a principled framework that formulates Gaussian pruning as variational free energy minimization. Our approach introduces (1) a prediction-error gating mechanism that asynchronously activates pruning based on per-Gaussian reconstruction uncertainty, and (2) a variational uncertainty head that models pruning decisions as latent variables with learnable priors. The free energy objective naturally balances reconstruction fidelity against model complexity through an information-theoretic lens. Extensive experiments on Mip-NeRF 360, Tanks&Temples, and Deep Blending demonstrate that VEDAL achieves 5.2x compression with only 0.31 dB PSNR drop, outperforming PUP 3D-GS by +0.05 dB at a higher compression ratio and LightGaussian by +0.35 dB at comparable quality, while maintaining real-time rendering at 185 FPS.

2606.02342 2026-06-02 cs.CV

Detecting Pen-In-Air States from Video: A Proof-of-Concept Toward Complementary Handwriting Analysis

Lauren Sismeiro, Remy Plastre, Binbin Xu, Frederic Puyjarinet, Gerard Dray

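Affiliations: IMT Mines Ales; Occitanie Region, France

AI summary: Proposes an interpretable hybrid pipeline that combines YOLO-based pen-tip tracking, kinematic feature extraction, and machine learning classification to detect pen-contact states from top-view video, as a low-cost, non-intrusive complement to digitizing tablets; it reaches an F2 score of up to 0.805 on a pilot dataset.

Comments: Accepted at the 12th International Conference on Computer Technology Applications (ICCTA 2026)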

Abstract

Dynamic aspects of handwriting are critical for assessing developmental disorders such as dysgraphia and are typically captured using digitizing tablets. However, tablet-based sensing restricts analysis of Pen-Up behavior to a short proximity range above the writing surface, potentially missing high-lift in-air movements. As a proof of concept, we investigate whether top-view video can provide a complementary source of information for inferring pen-contact states without relying on tablet proximity sensing. We propose an interpretable hybrid pipeline combining pen-tip tracking using a YOLO-based detector with kinematic feature extraction and machine learning classification. A pilot dataset of diverse handwriting videos was manually annotated at the frame level, and evaluation used a Leave-One-Video-Out (LOVO) protocol. The method achieved reliable event-level detection of Pen-Up segments, with an $F_2$ score up to 0.805, consistent with the emphasis on recall in a screening-oriented setting. These results support the feasibility of video-based Pen-Up detection as a low-cost and non-intrusive complement to digitizing tablets, and provide a foundation for future large-scale studies.
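
The choice of an F2 score rather than F1 reflects the emphasis on recall mentioned in the abstract. The sketch below implements the standard F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The precision and recall values are made up for illustration and are not results from the paper.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical detectors with mirrored precision and recall.
recall_heavy = f_beta(precision=0.5, recall=1.0)     # high recall, low precision
precision_heavy = f_beta(precision=1.0, recall=0.5)  # high precision, low recall
```

Under F2 the recall-heavy detector scores higher than the precision-heavy one, whereas F1 (beta = 1) would rate them equally. That asymmetry suits a screening setting where missed Pen-Up segments cost more than false alarms.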

2606.02339 2026-06-02 cs.LG cs.CV

Entropy Minimization without Model Collapse: Mitigating Prediction Bias in Medical Imaging

Tim Nielen, Sameer Ambekar, Johannes Kiechle, Daniel M. Lang, Julia A. Schnabel

Affiliations: School of Computation, Information and Technology, Technical University of Munich, Germany; Institute of Machine Learning in Biomedical Imaging, Helmholtz Munich, Germany; School of Biomedical Engineering and Imaging Sciences, King's College London, UK; Munich Center for Machine Learning (MCML); relAI – Konrad Zuse School of Excellence in Reliable AI; TUM University Hospital Rechts der Isar

AI summary: To address the model collapse caused by entropy minimization in test-time adaptation, the paper proposes Distribution Shift Bias Reduction (DSBR), which corrects prediction bias by equalizing each predicted class's contribution to the unsupervised entropy minimization loss; its stability and effectiveness are validated on four medical-imaging datasets and ImageNet-C.

Abstract

Entropy minimization (EM) is the dominant objective for test-time adaptation, yet its failure mode, model collapse, remains poorly understood. In this work, we show that distribution shifts can cause feature clusters corresponding to distinct classes in the model's representation space to merge, while the decision boundary remains fixed. This induces a systematic skew in the predicted class distribution, which we refer to as prediction bias: some classes are overrepresented while others are suppressed. We show that entropy minimization amplifies this prediction bias by tightening the existing clusters, reinforcing the incorrect groupings until all predictions collapse to a trivial solution. Next, to demonstrate the significance of prediction bias and mitigate it, we further propose Distribution Shift Bias Reduction (DSBR), a bias-correcting objective that specifically targets this failure mode by equalizing the contribution of each predicted class to the unsupervised entropy minimization loss. To study this failure mode, we design suitable adaptation settings using four medical-imaging datasets and additionally evaluate on ImageNet-C. We find that DSBR consistently stabilizes test-time adaptation, prevents model collapse, and matches or outperforms state-of-the-art methods. Moreover, DSBR operates solely at test time.
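
One way to read "equalizing the contribution of each predicted class" is to average the entropy within each predicted class first and then across classes, so that an overrepresented class cannot dominate the loss. The sketch below contrasts that reading with the plain per-sample mean. It is an assumption about the general idea, not the DSBR objective as defined in the paper, and the function names and probabilities are hypothetical.

```python
import math
from collections import defaultdict


def entropy(p):
    """Shannon entropy (natural log) of one probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def mean_entropy(probs):
    """Plain entropy-minimization loss: every sample has equal weight."""
    return sum(entropy(p) for p in probs) / len(probs)


def class_balanced_entropy(probs):
    """Average entropy per predicted class, then across the classes that occur."""
    by_class = defaultdict(list)
    for p in probs:
        by_class[p.index(max(p))].append(entropy(p))  # group by argmax prediction
    per_class = [sum(v) / len(v) for v in by_class.values()]
    return sum(per_class) / len(per_class)


# A skewed batch (made up): three samples predicted as class 0, one as class 1.
batch = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.4, 0.6]]
plain = mean_entropy(batch)
balanced = class_balanced_entropy(batch)
```

In the skewed batch the plain mean is driven by the three class-0 samples, while the balanced version gives the single class-1 sample the same weight as the whole class-0 group.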

2606.02337 2026-06-02 cs.AI

Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson

发表机构 * Department of Electrical and Computer Engineering, Linköping University（1 链çe普大学电气与计算机工程系）

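AI summary: Proposes the CG-CMARL framework, which combines coordination graphs with Lagrangian duality to decompose the joint action space and the constraint coupling, so that the number of learned models is independent of the number of agents. Max-Sum message passing coordinates actions, and a Lagrangian multiplier controls the objective-constraint tradeoff to trace a Pareto front.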

Comments Accepted at the Reinforcement Learning Conference (RLC) 2026. 40 pages (12 main + 28 appendix), 5 figures, 16 tables, 7 theorems

AI中文摘要

约束多智能体强化学习（CMARL）面临两个相互交织的挑战：联合动作空间随智能体数量指数增长，以及额外的约束以奖励结构无法捕捉的方式耦合智能体。我们引入了用于约束多智能体强化学习的协调图（CG-CMARL），这是一个通过结合协调图和拉格朗日对偶性来应对这两个挑战的框架。该系统将联合问题分解为成对区域，每个区域由一组共享的Q函数服务，一个用于主要目标，每个约束对应一个，使得学习模型的数量与智能体数量无关。在执行时，Max-Sum消息传递在因子图上协调动作，而拉格朗日乘子控制目标-约束权衡，允许单个训练模型无需重新训练即可描绘帕累托前沿。我们在温和条件下提供了收敛保证，以及一个可分解为独立可解释来源的组合误差界，每个来源可追溯到特定的设计选择并可独立控制。在协作导航任务（其中多达10个智能体的团队必须协调到达目标位置，同时满足成对约束）上的实验表明，我们的方法产生的帕累托前沿优于以固定奖励塑形比率训练的既有基线，同时扩展到集中式方法变得棘手的大规模团队。

英文摘要

Constrained multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with the number of agents, and additional requirements couple agents in ways that reward structure alone does not capture. We introduce Coordination Graphs for Constrained Multi-Agent Reinforcement Learning (CG-CMARL), a framework that addresses both challenges by combining coordination graphs with Lagrangian duality. The system decomposes the joint problem into pairwise regions, each served by a set of shared Q-functions, one for the primary objective and one for each of the constraints, so that the number of learned models is independent of the number of agents. At execution time, Max-Sum message passing coordinates actions across the factor graph, while a Lagrangian multiplier controls the objective-constraint tradeoff, allowing a single trained model to trace a Pareto front without retraining. We provide convergence guarantees under mild conditions, together with a compositional error bound that decomposes into separate interpretable sources, each traceable to a specific design choice and independently controllable. Experiments on cooperative navigation tasks (where teams of up to 10 agents must coordinate to reach target positions while satisfying pairwise constraints) show that our method produces Pareto fronts dominating established baselines trained at fixed reward-shaping ratios, while scaling to team sizes where centralized approaches become intractable.
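
The role of the multiplier can be sketched on a single pair of agents. The example assumes the constraint Q-function is an expected cost, so the scalarized value is the objective Q-value minus the multiplier times the constraint Q-value. Exhaustive maximization over one pairwise table stands in for Max-Sum message passing here, and the Q-values and names are hypothetical, not taken from the paper.

```python
def best_joint_action(q_objective, q_constraint, lam):
    """Pick the joint action maximizing Q_objective - lam * Q_constraint.

    Both tables map a joint action (a tuple, one entry per agent) to a value.
    Treating the constraint table as an expected cost is an assumption.
    """
    best_action, best_score = None, float("-inf")
    for joint_action, q_r in q_objective.items():
        score = q_r - lam * q_constraint[joint_action]
        if score > best_score:
            best_action, best_score = joint_action, score
    return best_action


# Hypothetical Q-values for one pair of agents with two actions each.
Q_OBJECTIVE = {
    ("fast", "fast"): 10.0,
    ("fast", "slow"): 7.0,
    ("slow", "fast"): 6.5,
    ("slow", "slow"): 4.0,
}
Q_CONSTRAINT = {
    ("fast", "fast"): 5.0,
    ("fast", "slow"): 2.0,
    ("slow", "fast"): 2.0,
    ("slow", "slow"): 0.0,
}


def sweep(lams):
    """Re-use the same tables for several multipliers, with no retraining."""
    return {lam: best_joint_action(Q_OBJECTIVE, Q_CONSTRAINT, lam) for lam in lams}
```

Sweeping the multiplier from small to large moves the selected joint action from the highest-objective choice toward the lowest-cost one, which is the sense in which one set of Q-functions can yield several points on a tradeoff curve.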

2606.02331 2026-06-02 cs.CV cs.LG

Hallucination-Aware Diffusion Sampling for Inverse Problems via Robust Prior Updates

基于鲁棒先验更新的幻觉感知扩散采样用于逆问题

Pengfei Jin, Yiqi Tian, Kailong Fan, Bingjie Qi, Quanzheng Li

发表机构 * Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School（先进医学计算与分析中心，麻省总医院和哈佛医学院）； Department of Industrial Engineering, University of Pittsburgh（工业工程系，匹兹堡大学）

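AI summary: Proposes a Robust Prior Update module that probes the local stability of the diffusion prior update and re-anchors the resulting displacement, reducing measurement-conditioned hallucination in inverse-problem solving and improving instance faithfulness.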

AI中文摘要

基于扩散的逆问题求解器可以产生逼真的重建结果，但仅凭逼真度并不能确保恢复的细节得到测量的支持。我们将这种失败研究为测量条件幻觉：视觉上有意义但要么不可信要么与测量实例不一致的内容。我们的分析将基于贝叶斯规则的扩散逆求解器分为先验更新和测量条件步骤，表明在应用测量校正之前，幻觉内容可能通过先验侧提议进入。受此观点启发，我们提出鲁棒先验更新（RPU），一个求解器级模块，探测扩散先验更新的局部稳定性，将产生的位移重新锚定在当前迭代点，并保持测量更新不变。我们在DPS中实例化RPU，并使用自动指标和人类忠实度研究在FFHQ和ImageNet逆问题上进行评估。在FFHQ上，RPU在框内修复、高斯去模糊和运动去模糊中相比DPS提高了PSNR和LPIPS。在人类判断中，RPU在FFHQ框内修复上获得了91.9%的盲选非平局多数偏好和91.1%的借助真实标签的非平局偏好，而ImageNet高斯阅读器研究中平局较多，但在非平局情况下RPU更受青睐。这些结果支持一个有针对性的主张：鲁棒化先验更新可以提高扩散逆求解器中的实例保真度，尤其是在先验塑造弱约束内容时。

英文摘要

Diffusion-based inverse problem solvers can produce realistic reconstructions, but realism alone does not ensure that the recovered details are supported by the measurement. We study this failure as measurement-conditioned hallucination: visually meaningful content that is either implausible or inconsistent with the measured instance. Our analysis separates Bayes-rule-based diffusion inverse solvers into a prior update and a measurement-conditioning step, showing that hallucinated content can enter through the prior-side proposal before the measurement correction is applied. Motivated by this view, we propose Robust Prior Update (RPU), a solver-level module that probes the local stability of the diffusion prior update, re-anchors the resulting displacement at the current iterate, and leaves the measurement update unchanged. We instantiate RPU in DPS and evaluate it on FFHQ and ImageNet inverse problems using automatic metrics and human faithfulness studies. On FFHQ, RPU improves PSNR and LPIPS over DPS across box inpainting, Gaussian deblurring, and motion deblurring. In human judgments, RPU receives 91.9% of blind non-tie majority preferences and 91.1% of ground-truth-assisted non-tie preferences on FFHQ box inpainting, while the ImageNet Gaussian reader study is tie-heavy but favors RPU among non-tie cases. These results support a targeted claim: robustifying the prior update can improve instance faithfulness in diffusion inverse solvers, especially when the prior shapes weakly constrained content.

2606.02322 2026-06-02 cs.LG cs.AI

Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment

重新利用对抗扰动进行持续学习：从防御到主动对齐

Ran Liu, Min Yu, Mingqi Liu, Jianguo Jiang, Gang Li, Rongsheng Li, Ning Li, Zhen Xu, Weiqing Huang, Ming Liu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络安全学院）； Deakin University（德肯大学）； Harbin Engineering University（哈尔滨工程大学）

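AI summary: Proposes AdvCL, which repurposes adversarial perturbations as a geometric control signal and combines three plug-in modules (Intra-Smooth, Proto-Clip, Inter-Align) to improve standard performance and robustness in continual learning while lowering forgetting and strengthening transfer.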

AI中文摘要

在动态环境中，大型语言模型需要不断适应新任务，但持续学习常常遭受遗忘、有限的迁移以及对对抗扰动的脆弱性。为了解决这个问题，我们提出了AdvCL，它将对抗扰动重新用作稳定的持续适应的几何控制信号。AdvCL结合了三个即插即用模块：Intra-Smooth通过小的对抗扰动促进局部平滑性；Proto-Clip使用相似性裁剪以防止过度对齐到当前任务原型；Inter-Align则通过对齐到先前任务原型的方向性对齐来减少表示间隙。实验表明，在标准性能和鲁棒性方面均有一致的提升，同时具有更低的遗忘和更强的迁移。我们进一步通过量化Intra-Smooth对扰动设置的敏感性以及Inter-Align对任务相似性和几何距离的影响，分析了关键机制。总之，这些模块在组合时提供互补增益，每个模块也可以单独集成到各种持续学习范式中，包括回放、正则化和动态架构，从而为持续学习提供了一种几何控制机制。

英文摘要

In dynamic environments, large language models need to keep adapting to new tasks, but continual learning often suffers from forgetting, limited transfer, and vulnerability to adversarial perturbations. To address this, we present AdvCL, which repurposes adversarial perturbations as a geometric control signal for stable continual adaptation. AdvCL combines three plug-in modules: Intra-Smooth promotes local smoothness via small adversarial perturbations; Proto-Clip uses similarity clipping to prevent excessive alignment to the current task prototype; and Inter-Align applies directional alignment toward previous task prototypes to reduce representational gaps. Experiments show consistent gains in both standard performance and robustness, with lower forgetting and stronger transfer. We further analyze key mechanisms by quantifying the sensitivity of Intra-Smooth to perturbation settings and the effect of Inter-Align on task similarity and geometric distance. In summary, the modules provide complementary gains when combined, and each can also be integrated individually into diverse CL paradigms, including replay, regularization, and dynamic architectures, thereby offering a geometric control mechanism for continual learning.

2606.02321 2026-06-02 cs.CV

Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

基于视觉表示引导的视频-大语言模型推理的无训练组合视频检索

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang

发表机构 * School of Computer Science and Technology, University of Chinese Academy of Sciences（中国科学院大学计算机科学与技术学院）； State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences（中国科学院人工智能安全国家重点实验室）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）； Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络安全学院）

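AI summary: Proposes a training-free framework that first uses frozen DINOv3 models to shortlist visually relevant candidates, then uses large vision-language models to assess whether each candidate satisfies the instruction, and finally applies reasoning-based refinement, reaching 48.78 Recall@1 and 51.48 Recall@5 in the CVPR 2026 challenge.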

Comments CVPR 2026, VidLLMs workshop

AI中文摘要

近期大视觉语言模型的进展将视频检索从简单的基于文本搜索扩展到更灵活的场景，用户可以通过视觉示例和文本指令指定期望结果。在CVPR 2026的Reason-Aware组合视频检索挑战中，系统需要根据参考视频和修改指令检索目标视频。为解决该任务，我们开发了基于视觉表示引导的视频-大语言模型推理的无训练组合视频检索框架。该框架首先使用冻结的DINOv3模型获取紧凑的视觉相关候选集，然后应用大视觉语言模型评估每个候选是否满足修改指令。最后对顶部候选进行基于推理的精化以改善排名第一的预测。无需训练，我们的系统在测试集上达到48.78 Recall@1和51.48 Recall@5。未来工作可通过更强的视频-大语言模型以及视觉表示与语言推理的详细集成进一步提高检索精度。

英文摘要

Recent advances in large vision-language models have expanded video retrieval from simple text-based search to more flexible scenarios, where users may specify the desired result through both visual examples and textual instructions. In the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge, the system is required to retrieve a target video according to a reference video and a modification instruction. To address this task, we develop Visual Representation-Guided Video-LLM Reasoning for Training-Free Composed Video Retrieval. Our framework first uses frozen DINOv3 models to obtain a compact set of visually relevant candidates, and then applies large vision-language models to evaluate whether each candidate satisfies the modification instruction. A final reasoning-based refinement is further performed on the top candidates to improve the first-ranked prediction. Without training, our system achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set. Future work may further improve retrieval accuracy through stronger video-LLMs and detailed integration between visual representations and language reasoning.
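
For readers unfamiliar with the reported metric, the following minimal sketch computes Recall@K as the fraction of queries whose target video appears among the top K ranked candidates. It assumes one target per query; the function name and data layout are hypothetical, and the abstract's figures appear to be the same quantity expressed as a percentage.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target is among the first k ranked candidates.

    ranked_lists[i] is the system's ranking (best first) for query i, and
    targets[i] is the single correct video id for that query.
    """
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)
```

A small gap between Recall@1 and Recall@5, as in the abstract, means that when the target is in the top five it is usually already ranked first.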

2606.02313 2026-06-02 cs.RO

Towards Precise Intent-Aligned VLA Aerial Navigation via Expert-Guided GRPO

迈向精确意图对齐的VLA空中导航：基于专家引导的GRPO

Tianyang Chen, Wenjun Li, Xin Zhou, Yuze Wu, Fei Gao

发表机构 * Zhejiang University Differential Robotics（浙江大学差分机器人实验室）

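AI summary: Proposes the EG-GRPO framework, which augments online rollouts with expert data and uses a heterogeneous parallel pipeline to address the intent-alignment problems that data scarcity and inefficient exploration cause for VLA models in UAV navigation; the success rate rises to 2.13x that of the SFT baseline and intent-alignment performance improves by 60.9%.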

AI中文摘要

视觉-语言-动作（VLA）模型为无人机（UAV）执行细粒度指令指定的复杂任务提供了一种有前景的端到端范式。然而，标准的监督微调（SFT）面临数据稀缺、泛化能力有限以及对细微复杂人类意图的弱监督问题。强化微调通过可设计的反馈提供了一种自然的方式来缓解这些挑战，并使策略行为与人类意图对齐，但由于在广阔连续空间中的低效探索，将其应用于空中导航仍然具有挑战性。为了解决这些问题，我们引入了一个基于VLA的空中导航的高效强化学习（RL）框架。其核心是，我们提出了EG-GRPO（专家引导的组相对策略优化），以用少量专家数据增强在线rollout。此外，我们设计了一个异构流水线，支持并行仿真和推理，将rollout时间减少了43.5%。在由复杂人类意图指定的多个任务中，EG-GRPO将成功率提升至SFT基线的2.13倍，同时将意图对齐性能提高了60.9%。这些结果表明，我们的框架可以使空中导航迈向精确的意图对齐飞行。

英文摘要

Vision-Language-Action (VLA) models offer a promising end-to-end paradigm for unmanned aerial vehicles (UAVs) to accomplish complex tasks specified by fine-grained instructions. However, standard supervised fine-tuning (SFT) suffers from data scarcity, limited generalization, and weak supervision for nuanced and complicated human intents. Reinforcement fine-tuning offers a natural way to mitigate these challenges and align policy behaviors with human intents through designable feedback, but applying it to aerial navigation remains challenging due to inefficient exploration in expansive continuous spaces. To address these challenges, we introduce an efficient reinforcement learning (RL) framework for VLA-based aerial navigation. At its core, we propose EG-GRPO (Expert-Guided Group Relative Policy Optimization) to augment online rollouts with few-shot expert data. Additionally, we design a heterogeneous pipeline enabling parallel simulation and inference, which reduces rollout time by 43.5%. Across multiple tasks specified by complex human intents, EG-GRPO improves the success rate to 2.13x that of the SFT baseline, while improving intent alignment performance by 60.9%. These results demonstrate that our framework can move aerial navigation toward precise intent-aligned flight.
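
The sketch below shows the group-relative advantage normalization commonly associated with GRPO (each rollout's reward minus the group mean, divided by the group standard deviation), and why a group of uniformly failed rollouts carries no learning signal. Appending an expert rollout to the group is only an assumed illustration of "augmenting online rollouts with few-shot expert data"; the abstract does not specify the mechanism, and the names here are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def with_expert(online_rewards, expert_reward):
    """Assumed augmentation: treat one expert rollout as an extra group member."""
    return group_relative_advantages(list(online_rewards) + [expert_reward])
```

If every online rollout fails, all advantages are zero and the policy gradient vanishes; adding a single successful expert rollout gives the group a nonzero spread, so the failed rollouts receive negative advantages and the expert one a positive advantage.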

2606.02310 2026-06-02 cs.CV cs.LG

Deep Learning for Remote Sensing to Improve Flood Inundation Mapping

深度学习用于遥感以改进洪水淹没制图

Yogesh Bhattarai, Vijay Chaudhary, Wai Lim Kim, Sanjib Sharma

发表机构 * University of Colorado Boulder（科罗拉多大学博尔德分校）

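AI summary: Proposes a cloud-removal framework for flood imagery based on Denoising Diffusion Probabilistic Models and the Masked Diffusion Transformer, generating cloud-free images that preserve hydrological consistency and making flood monitoring more reliable.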

Comments This paper was selected as one of the top 10 student finalists in the IGARSS 2026 paper competition

AI中文摘要

洪水是全球最普遍的自然灾害。及时准确的洪水淹没制图对于告知灾害风险管理至关重要。光学卫星任务提供了高分辨率、多光谱观测，对于洪水检测和淹没制图至关重要。然而，在极端降水事件期间，其操作实用性受到云层的严重限制。基于时间合成或插值的传统云去除技术通常无法捕捉淹没动态。在本研究中，我们引入了一种基于去噪扩散概率模型的洪水影像云去除框架，利用掩码扩散Transformer架构。所提出的方法利用自注意力机制捕获更广泛的空间上下文，并采用掩码令牌建模来显式学习云遮挡区域的重建。在具有真实云模式的多光谱Sentinel-2B洪水场景上训练，该模型生成保持视觉保真度和水文一致性的无云图像实现。使用标准图像质量指标以及洪水特定的水文指标评估重建性能，显示出水体连续性的改善和对水检测指数至关重要的光谱特征的保留。结果表明，基于扩散的生成建模为光学洪水监测中的云去除提供了一种稳健且物理一致的替代方案，从而实现更可靠、连续的观测，以支持灾害风险管理和洪水相关决策。

英文摘要

Flooding is the most pervasive natural disaster worldwide. Timely and accurate flood inundation mapping is essential for informing disaster risk management. Optical satellite missions provide high-resolution, multispectral observations critical for flood detection and inundation mapping. However, their operational utility is severely constrained by cloud cover during extreme precipitation events. Conventional cloud-removal techniques based on temporal compositing or interpolation often fail to capture inundation dynamics. In this study, we introduce a cloud-removal framework for flood imagery based on Denoising Diffusion Probabilistic Models, leveraging the Masked Diffusion Transformer architecture. The proposed approach exploits self-attention mechanisms to capture wider spatial context and employs masked token modeling to explicitly learn the reconstruction of cloud-obscured regions. Trained on multispectral Sentinel-2B flood scenes with realistic cloud patterns, the model generates cloud-free image realizations that preserve both visual fidelity and hydrological consistency. Reconstruction performance is evaluated using standard image quality metrics alongside flood-specific hydrological measures, demonstrating improved continuity of water bodies and preservation of spectral signatures critical for water detection indices. The results indicate that diffusion-based generative modeling offers a robust and physically consistent alternative for cloud removal in optical flood monitoring, enabling more reliable, continuous observations to support disaster risk management and flood-related decision making.

2606.02309 2026-06-02 cs.LG cs.CV

Measurement Geometry and Design for Trustworthy Generative Inverse Problems

可信生成式逆问题的测量几何与设计

Pengfei Jin, Na Li, Quanzheng Li

发表机构 * Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School（先进医学计算与分析中心，麻省总医院和哈佛医学院）； School of Engineering and Applied Sciences, Harvard University（工程与应用科学学院，哈佛大学）

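AI summary: Proposes a local measurement-manifold compatibility measure, proves that it controls the stable part of the reconstruction error, and derives fixed and adaptive measurement strategies based on volume preservation; across several imaging tasks the scores predict failure modes, explain measurement-induced hallucinations, and guide sampling.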

AI中文摘要

生成模型越来越多地被用作逆问题的先验，但它们生成逼真图像的能力带来了一个基本的信任问题：一个看似合理的重建可能由测量支持，也可能由先验沿未观测方向填充。这一区别在医学成像中尤为重要，因为采集操作是在扫描时间、剂量和校准约束下设计的。我们从测量几何的角度研究生成式逆问题。核心问题是：固定的测量算子能否区分在生成先验下看似合理的邻近图像，以及这种关系能否指导更好的测量。我们引入了一个局部测量-流形兼容性度量，用于量化算子观测先验相关切线方向的程度。在局部正则性假设下，我们证明该量控制重建误差的稳定部分，而生成先验控制流形外漂移。这一最坏方向证书基于整体局部体积保持，提出了实用的固定和顺序采集规则，包括一种后验云设计，该设计在测试时自适应调整测量，无需训练采样策略。在行采样、断层扫描和MR采集设置中，所提出的分数预测失败模式，解释测量引起的幻觉，并指导更好的采样。在fastMRI笛卡尔采样中，后验云测量设计优于强大的非学习ACS保留基线，包括可变密度和泊松类掩模。

英文摘要

Generative models are increasingly used as priors for inverse problems, but their ability to produce realistic images creates a basic trust problem: a plausible reconstruction may be supported by the measurements, or it may be filled in by the prior along unobserved directions. This distinction is especially important in medical imaging, where acquisition operators are designed under scan-time, dose, and calibration constraints. We study generative inverse problems from a measurement-geometry perspective. The central question is whether a fixed measurement operator can distinguish nearby images that are plausible under the generative prior, and whether this relationship can guide better measurements. We introduce a local measurement-manifold compatibility measure that quantifies how well the operator observes prior-relevant tangent directions. Under local regularity assumptions, we prove that this quantity controls the stable part of the reconstruction error, while the generative prior controls off-manifold drift. This worst-direction certificate motivates practical fixed and sequential acquisition rules based on overall local volume preservation, including a posterior-cloud design that adapts measurements at test time without training a sampling policy. Across row-sampling, tomographic, and MR acquisition settings, the proposed scores predict failure modes, explain measurement-induced hallucinations, and guide better sampling. In fastMRI Cartesian sampling, posterior-cloud measurement design improves over strong non-learned ACS-preserving baselines, including variable-density and Poisson-like masks.

2606.02307 2026-06-02 cs.RO

FATE-VLA: Failure-aware test generation for vision-language-action models

FATE-VLA：面向视觉-语言-动作模型的故障感知测试生成

Arusa Kanwal, Pablo Valle, Shaukat Ali, Aitor Arrieta

发表机构 * Mondragon University（蒙多龙大学）； Simula Research Laboratory（Simula研究实验室）

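AI summary: Proposes a failure-aware test-generation method that combines diversity-driven exploration with surrogate models to actively discover the sparse, clustered failures of VLA models in high-dimensional embodied spaces, uncovering up to 29.7% more failures than the selected baselines across four state-of-the-art models.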

AI中文摘要

视觉-语言-动作（VLA）模型越来越多地被用作通用机器人策略，然而它们的评估仍然主要依赖于随机采样任务场景的静态基准。在高维具身空间中，故障是稀疏且聚类的，因此静态基准测试可能低估鲁棒性风险。我们将VLA评估重新定义为主动故障发现问题，并提出一种故障感知测试生成方法，该方法将多样性驱动的探索与从观察到的执行中学习的代理模型相结合。该方法将测试引导向高风险但多样化的场景区域。在四个最先进的VLA模型上，它发现了显著更多的故障（相比选定基线最多增加29.7%），同时揭示了更多样化的故障模式。这意味着，例如，在GR00T-N1.6的情况下，成功率从64.4%下降到34.7%。更广泛地说，我们的发现呼吁VLA评估的转变：从固定任务套件上的被动测量转向自适应、寻求故障的测试生成，在部署之前揭示模型弱点的结构。

英文摘要

Vision-Language-Action (VLA) models are increasingly used as generalist robot policies, yet their evaluation still relies largely on static benchmarks that randomly sample task scenes. In high-dimensional embodied spaces, failures are sparse and clustered, so static benchmarking can underestimate robustness risks. We reframe VLA evaluation as an active failure-discovery problem and propose a failure-aware test-generation approach that combines diversity-driven exploration with surrogate models learned from observed executions. The method steers testing toward high-risk yet diverse scene regions. Across four state-of-the-art VLA models, it uncovers substantially more failures (up to +29.7% over selected baselines) while revealing more diverse failure modes. This means that, for instance, in the case of GR00T-N1.6, the success rate dropped from 64.4% to 34.7%. More broadly, our findings call for a shift in VLA evaluation: from passive measurement on fixed task suites to adaptive, failure-seeking test generation that exposes the structure of model weaknesses before deployment.

2606.02303 2026-06-02 cs.CV

Cross-Domain Dead Tree Detection via Knowledge Distillation in Aerial Imagery

跨域航拍图像死树检测：基于知识蒸馏的方法

Anis Ur Rahman, Mete Ahishali, Einari Heinaro, Samuli Junttila

发表机构 * CSC – IT Center for Science Ltd.（CSC信息科技研究中心有限公司）； Department of Forest Sciences, University of Helsinki（赫尔辛基大学森林科学系）； KOKO Forest Ltd.（KOKO森林有限公司）； School of Forest Sciences, University of Eastern Finland（东芬兰大学森林科学学院）

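AI summary: To address domain variability and scarce labeled data in dead-tree detection from aerial imagery, adapts the TreeMort-1T-UNet model with knowledge distillation; feature-level distillation gives robust performance across several target domains and is shown to do best in low-data scenarios.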

Comments 14 pages, 6 figures, journal

AI中文摘要

航拍图像中的死树检测对于评估森林健康至关重要，尤其是随着气候变化导致全球树木死亡率上升，但域变异性和稀缺的标注数据常常限制模型的泛化能力。本研究改进了最初在芬兰航拍图像（源域）上训练的TreeMort-1T-UNet（树木死亡率单任务U-Net）模型，通过应用知识蒸馏（KD）使其适应各种目标域，包括代表不同森林类型的波兰、德国和爱沙尼亚数据集。我们评估了四种KD变体：基础、自蒸馏、特征级和集成，与微调基线进行比较，使用平均树木IoU、实例F1分数、实例精度和平均质心误差作为关键指标，并结合表征分析（如余弦相似度、CKA、SSIM、t-SNE和线性探针）评估域不变性。特征级KD优于其他方法，在波兰数据集上实现了平均树木IoU为0.106、实例F1分数为0.63、实例精度为0.55、平均质心误差为3.039，并在其他目标域上保持稳健精度（例如，芬兰为0.15，波兰为0.67，德国为0.60，爱沙尼亚为0.59）。它在低数据场景下表现优异，假阳性更少，并展现出优越的表征不变性（例如，更高深层CKA/SSIM、t-SNE中更好的域混合、线性探针AUC为0.95），使其成为精度关键的林业应用的理想选择。额外的消融研究证实，特征对齐等关键组件增强了其跨指标的平衡性能。我们的发现证明了KD在遥感中增强迁移学习的潜力，为生态监测和可持续森林管理提供了可扩展、域鲁棒的工具。

英文摘要

Detecting dead trees in aerial imagery is vital for assessing forest health, especially as tree mortality increases globally due to climate change, but domain variability and scarce labeled data often limit model generalization. This study advances the TreeMort-1T-UNet (Tree Mortality 1-Task U-Net) model, initially trained on Finnish aerial imagery (source domain), by applying knowledge distillation (KD) to adapt it to various target domains, including Polish, German, and Estonian datasets representing diverse forest types. We assess four KD variants: Basic, Self, Feature-level, and Ensemble, against a fine-tuning baseline, using Mean Tree IoU, Instance F1-score, Instance Precision, and Mean Centroid Error as key metrics, alongside representational analyses (e.g., cosine similarity, CKA, SSIM, t-SNE, and linear probing) for domain invariance. Feature-level KD outperforms others, yielding a Mean Tree IoU of 0.106, Instance F1-score of 0.63, Instance Precision of 0.55, and Mean Centroid Error of 3.039 on the Polish dataset, with robust precision across other target domains (e.g., 0.15 on Finnish, 0.67 on Polish, 0.60 on German, 0.59 on Estonian). It excels in low-data scenarios with fewer false positives and shows superior representational invariance (e.g., higher deep-layer CKA/SSIM, better domain mixing in t-SNE, and linear probing AUC of 0.95), making it ideal for precision-critical forestry applications. Additional ablation studies confirm that key components like feature alignment enhance its performance balance across metrics. Our findings demonstrate KD's potential to enhance transfer learning in remote sensing, offering a scalable, domain-robust tool for ecological monitoring and sustainable forest management.

2606.02300 2026-06-02 cs.CL

Beyond Isolated Behaviors: Hierarchical User Modeling for LLM Personalization

超越孤立行为：面向LLM个性化的层次化用户建模

Liang Wang, Xinyi Mou, Xiaoyou Liu, Tiannan Wang, Yuqing Wang, Zhongyu Wei

发表机构 * School of Data Science, Fudan University（复旦大学数据科学学院）； Shanghai Innovation Institute（上海创新研究院）； OPPO

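AI summary: To address the lack of hierarchical structure in how user behavior is modeled for LLM personalization, proposes the PHF framework grounded in Bourdieu's Theory of Practice, which models users at three levels (practice, habitus, field), and implements it as PHF_Compass, a lightweight, model-agnostic method that yields consistent gains on the LaMP benchmark.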

AI中文摘要

大型语言模型（LLM）在多个领域展现出卓越能力，但将其输出个性化以适应个体用户仍是一个开放挑战。现有方法主要采用扁平行为范式，聚合用户行为而未明确考虑它们如何组织成更深层的行为结构。在本工作中，我们借鉴皮埃尔·布迪厄的实践理论，提出PHF（实践-惯习-场域），一个基于社会学的框架，通过三个层次重新概念化LLM个性化：作为实践的个人行为、作为惯习的行为在时间上的积累形成稳定倾向、以及作为场域的相似用户间的共享规律。我们通过$\mathrm{PHF}_{\text{Compass}}$实例化PHF，这是一种基于冻结LLM的轻量级且模型无关的实现。在语言模型个性化（LaMP）基准上的实验表明，该方法在多种任务上取得一致改进，进一步分析验证了所学行为结构的可解释性和可扩展性。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet personalizing their outputs to individual users remains an open challenge. Existing approaches predominantly adopt a flat behavioral paradigm, aggregating user behaviors without an explicit account of how they are organized into deeper behavioral structures. In this work, we draw on Pierre Bourdieu's Theory of Practice to propose PHF (Practice-Habitus-Field), a sociologically grounded framework that reconceptualizes LLM personalization through three hierarchical levels: individual behaviors as practices, their temporal accumulation into stable dispositions as habitus, and shared regularities across similar users as fields. We instantiate PHF through $\mathrm{PHF}_{\text{Compass}}$, a lightweight and model-agnostic implementation based on a frozen LLM. Experiments on the Language Model Personalization (LaMP) benchmark demonstrate consistent improvements across diverse tasks, while further analyses validate the interpretability and extensibility of the learned behavioral structures.

2606.02296 2026-06-02 cs.RO

A Kinetic Theory of Encounter-Based Information Propagation in Multi-Robot Systems

多机器人系统中基于相遇的信息传播的动力学理论

Alkesh K. Srivastava, Philip Dames

发表机构 * Temple University（天普大学）

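AI summary: Develops a kinetic theory that analyzes the performance limits of multi-robot target tracking in terms of encounter-driven information propagation, information staleness, and geometric constraints.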

AI中文摘要

多机器人系统不能假设持续的网络连接。我们通过目标跟踪研究这一问题，其中性能取决于目标信息被感知、通过团队传输并在变得过时之前使用的速度。当机器人仅通过物理相遇交换信息时，跟踪成为一个动力学信息传输问题：机器人运动引发相遇，相遇携带目标状态估计，信息年龄决定过时程度，而过时信息产生跟踪误差。本文发展了一种基于相遇的信息传播的动力学理论，并识别出三个极限。第一个是访问极限——信息无法支持团队级协调，除非它传播到感知到它的机器人之外。第二个是过时极限——即使传播的信息也会随着目标移动而失去价值。第三个是几何极限——当目标运动超过信息传输时，跟踪误差进入饱和状态，此时仅通信改进的收益递减。我们通过改变团队规模、操作区域、通信范围和目标速度的大规模模拟评估该理论。结果支持所提出的访问-过时-几何分解：通信覆盖控制访问转变；一旦信息可访问，跟踪误差由目标位移决定；这种响应在受限区域内是局部线性的，但由于感知刷新和有界几何，在更广范围内是非线性的。在受控扫描和联合变化中，推导出的访问和过时坐标可靠地描述了跟踪性能。这些结果共同建立了一个动力学理论框架，用于预测和设计基于相遇的多机器人系统。

英文摘要

Multi-robot systems cannot assume persistent network connectivity. We study this problem through target tracking, where performance depends on how quickly target information is sensed, transported through the team, and used before it becomes stale. When robots exchange information only through physical encounters, tracking becomes a kinetic information-transport problem: robot motion induces encounters, encounters carry target-state estimates, information age determines staleness, and stale information produces tracking error. This paper develops a kinetic theory of encounter-based information propagation and identifies three limits. The first is an access limit -- information cannot support team-level coordination unless it spreads beyond the robots that sensed it. The second is a staleness limit -- even propagated information loses value as the target moves. The third is a geometry limit -- when target motion outpaces information transport, tracking error approaches a saturation regime where communication improvements alone have diminishing returns. We evaluate the theory through large-scale simulations varying team size, operating area, communication range, and target speed. Results support the proposed access-staleness-geometry decomposition: communication coverage governs the access transition; once information is accessible, tracking error is shaped by target displacement; and this response is locally linear in restricted regimes but nonlinear over broader ranges because of sensing refreshes and bounded geometry. Across controlled sweeps and joint variation, the derived access and staleness coordinates reliably describe tracking performance. Together, these results establish a kinetic-theoretic framework for predicting and designing encounter-based multi-robot systems.

2606.02294 2026-06-02 cs.LG

Regularized Large Neighborhood Search

正则化大邻域搜索

Germain Vivier-Ardisson, Laurent Demonet, Axel Parmentier, Mathieu Blondel

发表机构 * Google DeepMind（谷歌DeepMind）； CERMICS Paris, France（巴黎CERMICS研究所）； ENPC, CNRS, IPP Marne-la-Vallée, France（法国国立路桥学校、法国国家科学研究中心、巴黎综合理工学院，马恩拉瓦莱）； Google Research Paris, France（巴黎谷歌研究院）

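AI summary: Proposes regularized large neighborhood search (RLNS), which turns the LNS heuristic into an MCMC sampler and enables end-to-end learning without global solvers.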

AI中文摘要

运筹学从业者通常使用大邻域搜索（LNS）来解决NP难的组合问题，这是一种可扩展的启发式方法，通过局部重新优化其变量的子集来迭代改进当前解。相比之下，大多数现有的将组合优化层集成到神经网络中的方法仍然假设可以访问精确的全局解，这在计算上是难以处理的。我们通过引入正则化大邻域搜索（RLNS）来弥合这一差距。通过正则化或扰动局部子问题，我们将LNS启发式转化为一个高效的MCMC采样器，在可行解的组合集上采样，并关联Fenchel-Young损失。在熵正则化下，我们证明RLNS执行精确的块吉布斯采样。此外，调整RLNS迭代次数使我们能够在伪似然和精确最大似然估计之间插值，从而实现无需全局求解器的端到端学习。我们在$k$-子集选择、广义分配和随机车辆调度问题上展示了我们的方法。

英文摘要

Operations research practitioners typically tackle NP-hard combinatorial problems using large neighborhood search (LNS), a scalable heuristic that iteratively refines a current solution by locally re-optimizing subsets of its variables. In contrast, most existing approaches for integrating combinatorial optimization layers into neural networks still assume access to an exact global solution, which is computationally intractable. We bridge this gap by introducing regularized LNS (RLNS). By regularizing or perturbing local subproblems, we turn the LNS heuristic into an efficient MCMC sampler over the combinatorial set of feasible solutions, with associated Fenchel-Young losses. Under entropic regularization, we prove that RLNS performs exact block Gibbs sampling. Furthermore, adjusting the number of RLNS iterations allows us to interpolate between pseudolikelihood and exact maximum likelihood estimation, for end-to-end learning without global solvers. We demonstrate our approach on $k$-subset selection, generalized assignment, and stochastic vehicle scheduling problems.
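
The link between a regularized local re-optimization and block Gibbs sampling can be sketched on a toy k-subset problem. The example assumes a target distribution p(y) proportional to exp(theta . y) over binary vectors with exactly k ones, and a neighborhood made of two randomly chosen coordinates. With all other coordinates fixed, the cardinality constraint leaves at most two feasible assignments for the pair, and drawing between them in proportion to exp(theta) is exactly the conditional distribution. The neighborhood choice and names are hypothetical, not the paper's implementation.

```python
import math
import random


def rlns_swap_step(y, theta, rng):
    """One block Gibbs step on a random pair (i, j) of a k-subset indicator y.

    If y[i] == y[j], the sum constraint pins the block and nothing changes.
    Otherwise exactly one of i, j is selected, and item i is chosen with
    probability exp(theta[i]) / (exp(theta[i]) + exp(theta[j])).
    """
    i, j = rng.sample(range(len(y)), 2)
    if y[i] == y[j]:
        return y
    p_i = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
    new_y = list(y)
    new_y[i], new_y[j] = (1, 0) if rng.random() < p_i else (0, 1)
    return new_y


def run_chain(y, theta, steps, seed=0):
    """Run the sampler for a fixed number of steps from the initial subset y."""
    rng = random.Random(seed)
    for _ in range(steps):
        y = rlns_swap_step(y, theta, rng)
    return y
```

Every step keeps the number of selected items unchanged, so the chain never leaves the feasible set. With strongly separated scores, the chain settles on the highest-scoring subset, while milder scores leave it moving among subsets in proportion to their weights.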

2606.02293 2026-06-02 cs.CL

AI as a Tool for Simulation-Based Experiments in Literary Studies

AI作为文学研究中基于模拟的实验工具

Matthew Wilkens

发表机构 * Department of Information Science（信息科学系）

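AI summary: Explores using generative AI for controlled, large-scale, low-cost simulation experiments on literary and cultural production, summarizes the current state of the art, and presents initial results on AI literary text generation through experiments that compare outputs with human-authored novels.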

AI中文摘要

生成式人工智能系统通过受控、有依据、大规模、低成本的模拟文化生产，为文学研究中的实验开辟了新的可能性。当前系统尚未被证明能够生成高质量、长篇幅的叙事文本，并可靠地反映任意指定的文化约束或风格特征。但在文学历史模拟所需的各个组件上存在大量相关研究，包括：使用和验证AI系统作为可区分人类群体的代理；AI生成文本的叙事和风格特性；多智能体、多轮次AI模拟人类行为者的稳定性和连贯性；以及通过可预测方式改变生成系统知识和行为的技术方法。这些领域共同为基于AI的文学生产文化系统建模提供了更雄心勃勃的起点。我们描述了文学研究中基于模拟的实验的可能性和挑战，总结了相关领域的最新进展，并解释了工作的关键技术方面。为了提供一个与文学学者直接相关的例子，我们展示了文学文本生成实验的结果，包括与高地位、人类作者小说的比较。我们的结果首次展示了AI模型在该领域内（有限的）分布内输出。最后，我们描述了未来使用AI进行完整反事实文学历史模拟的工作。

英文摘要

Generative artificial intelligence (AI) systems open new possibilities for experimentation in literary studies via controlled, grounded, large-scale, low-cost simulations of cultural production. Current systems have not yet been shown to produce high-quality, book-length narrative texts that reliably reflect arbitrarily specified cultural constraints or stylistic features. But there exists substantial relevant research on each of the components required for literary-historical simulation. These include the use and validation of AI systems as proxies for differentiable human populations; the narrative and stylistic properties of AI-generated texts; the stability and coherence of multiagent, multiturn AI simulations of human actors; and technical methods through which to alter in predictable ways the knowledge and behavior of generative systems. Together, these areas could provide a starting point for more ambitious AI-based modeling of cultural systems of literary production. We describe the possibilities and challenges of simulation-based experiments in literary studies, summarize the current state of the art in relevant fields, and explain key technical aspects of the work. To provide an example directly relevant to literary scholars, we present the results of experiments on literary text generation, including comparisons to high-status, human-authored novels. Our results include the first demonstration of (limited) in-distribution outputs by AI models in this domain. We conclude with a description of future work on full counterfactual literary-historical simulations using AI.

2606.02292 2026-06-02 cs.CV

Neural Acquisition & Representation of Subsurface Scattering

次表面散射的神经获取与表示

Arjun Majumdar, Raphael Braun, Hendrik Lensch

发表机构 * University of Tübingen（图宾根大学）

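AI summary: Proposes a method that uses a U-Net CNN to learn the pixel footprint response at each point on an object's surface in order to acquire and estimate sub-surface scattering properties at a high level of detail, enabling relighting with arbitrary high-resolution projector patterns.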

Comments 8 pages

DOI: 10.2312/vmv.20251228

AI中文摘要

我们提出了一种方法，通过学习物体表面每个点的像素足迹响应，以高度细节化的水平获取和估计光传输的次表面散射特性。重建利用3D扫描技术作为U-Net CNN的输入。使用相移轮廓测量（PSP）图案的立体投影仪-相机设置高效捕获各种散射物体的数据。重建密集像素足迹允许使用任意高分辨率投影图案进行重光照。最终输出是重光照后的彩色图像。与真实世界捕获图像的定性和定量比较表明，预测的足迹与实际响应几乎相同。同一模型针对多个物体的多个视图进行训练，使得学习到的表示也能泛化到未见过的次表面散射材料。

英文摘要

We present a method to acquire and estimate the sub-surface scattering properties of light transport at a highly detailed level by learning the pixel footprint response at each point on the object surface. The reconstruction leverages 3D scanning techniques as input to a U-Net CNN. A stereo projector-camera setup using phase-shifted profilometry (PSP) patterns efficiently captures the data for a variety of scattering objects. Reconstructing dense pixel footprints allows for relighting with arbitrary high-resolution projector patterns. The final output is a relit color image. Qualitative and quantitative comparisons against illuminated real-world captured images demonstrate that the predicted footprints are almost identical to the actual responses. The same model is trained for multiple views across multiple objects such that the learned representations can be used to generalize to unseen sub-surface scattering materials as well.
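
Why dense footprints suffice for relighting follows from the linearity of light transport: each camera pixel's value is a weighted sum of the projector pixel values, with the footprint supplying the weights. The sketch below assumes a footprint stored as a flat list of weights per camera pixel; this layout and the function name are hypothetical and much simpler than what the paper's network predicts.

```python
def relight(footprints, pattern):
    """Render camera pixels under a projector pattern by linear superposition.

    footprints[c][p] is the response of camera pixel c to unit light emitted
    by projector pixel p; pattern[p] is the intensity of projector pixel p.
    """
    return [sum(w * v for w, v in zip(row, pattern)) for row in footprints]
```

Because the mapping is linear, the image under the sum of two patterns equals the sum of the images under each pattern, which is why footprints recovered once can be reused for any new pattern.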

2606.02289 2026-06-02 cs.CL

DECK: A Consistency x Confidence Taxonomy of LLM Hallucinations

DECK: LLM幻觉的一致性×置信度分类法

Mohit Singh Chauhan

发表机构 * University of California, Berkeley（加州大学伯克利分校）

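AI summary: Proposes DECK, a 2x2 taxonomy based on inter-sample consistency and token-level confidence that divides LLM hallucinations into four behavioural regimes, each mapped to the scorer families able to detect it; experiments validate the taxonomy and identify a universal blind spot of output-level uncertainty quantification.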

Comments 18 pages, 3 figures, 5 tables

AI中文摘要

现有的幻觉分类法根据输出错误的内容（如记忆错误、推理失败、流畅编造）对LLM错误进行分类。这些分类法有助于诊断，但无法回答另一个问题：哪个不确定性评分器本可以捕捉到这个错误？我们提出一种补充性分类法，根据错误的可检测性特征（评分器家族能读取的信号）对错误进行分类。DECK分类法是一个2×2划分，沿样本间一致性和词级置信度分为四个行为区域（Drift、Entrenched、Confabulation、Knotted），每个区域映射到能够检测它的特定评分器家族（或多个家族）：黑盒一致性评分器在D和C中有信号，白盒词概率评分器在K和C中有信号，只有经过独立预训练的LLM-as-a-Judge才能检测E。通过在每个评分器轴上使用Youden's J最优分割来操作化单元成员关系。在三个模型和四个数据集上，我们通过两种方式验证该分类法：分析评分器对的不一致性，以及检查外部标签（SelfAware不可回答、HaluEval对抗性、PopQA实体流行度）是否落在预测的DECK单元中，并附带模型规模特定和内容特定的次级单元细化。我们进一步识别出输出级不确定性评估的一个普遍盲点：在知识缺口输入上，当生成器输出自信、可重复的编造时，每个输出级家族在构造上都会失效。对Llama-3-8B隐藏状态的线性探针也失效至随机水平，初步证据表明该失败可能在激活层面持续存在；更丰富的内部状态方法（不确定性评估头、信息论估计器）仍有待测试。

English abstract

Existing hallucination taxonomies classify LLM errors by what is wrong with the output -- memorised misconceptions, reasoning failures, fluent fabrications. These taxonomies are useful for diagnosis but cannot answer a different question: which uncertainty scorer would have caught this error? We propose a complementary taxonomy that classifies errors by their detectability signature -- the signal a scorer family would read. The DECK taxonomy is a 2x2 partition along inter-sample consistency and token-level confidence into four behavioural regimes (Drift, Entrenched, Confabulation, Knotted), each mapping to a specific scorer family (or families) that can detect it: black-box consistency scorers have signal in D and C, white-box token-probability scorers have signal in K and C, and only an LLM-as-a-Judge with independent pretraining can detect E. Cell membership is operationalised by a Youden's J optimal split on each scorer axis. Across three models and four datasets we validate the taxonomy two ways: by analysing scorer-pair disagreement, and by checking that external labels (SelfAware unanswerable, HaluEval adversarial, PopQA entity popularity) land in the predicted DECK cells, with model-scale and content-specific secondary-cell refinements. We further identify a universal blind spot of output-level UQ: on knowledge-gap inputs where the generator emits confident, repeatable fabrications, every output-level family collapses by construction. A linear probe on Llama-3-8B's hidden states also collapses to chance, giving preliminary evidence that the failure may persist at the activation level; richer internal-state methods (UQ heads, information-theoretic estimators) remain to be tested.
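
Illustrative sketch (not taken from the paper): the abstract states that cell membership is operationalised by a Youden's J optimal split on each scorer axis. The Python below shows one way such a split and the resulting 2x2 assignment might look. The quadrant layout is inferred from which scorer families the abstract says have signal in which cells. The score orientation (higher means more suspicious), the toy data, and the names `youden_threshold` and `deck_cell` are assumptions made for illustration.

```python
# Illustrative only: layout, score orientation and data are assumptions,
# not the paper's implementation.

def youden_threshold(scores, is_error):
    """Return (t, J) maximising J = TPR - FPR when score >= t flags an error."""
    positives = sum(is_error)
    negatives = len(is_error) - positives
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, e in zip(scores, is_error) if e and s >= t)
        fp = sum(1 for s, e in zip(scores, is_error) if not e and s >= t)
        j = tp / positives - fp / negatives
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j

def deck_cell(inconsistency, token_uncertainty, t_cons, t_conf):
    """Assign a DECK cell from two scorer axes (assumed quadrant layout)."""
    flagged_by_consistency = inconsistency >= t_cons      # black-box axis
    flagged_by_token_prob = token_uncertainty >= t_conf   # white-box axis
    if flagged_by_consistency and flagged_by_token_prob:
        return "Confabulation"  # both families have signal
    if flagged_by_consistency:
        return "Drift"          # only consistency scorers have signal
    if flagged_by_token_prob:
        return "Knotted"        # only token-probability scorers have signal
    return "Entrenched"         # neither output-level family has signal

# Toy data (hypothetical): four wrong answers followed by two correct ones.
inconsistency = [0.9, 0.8, 0.1, 0.05, 0.15, 0.1]
token_uncertainty = [0.7, 0.1, 0.8, 0.1, 0.2, 0.15]
is_error = [True, True, True, True, False, False]

t_cons, j_cons = youden_threshold(inconsistency, is_error)
t_conf, j_conf = youden_threshold(token_uncertainty, is_error)
cells = [
    deck_cell(c, u, t_cons, t_conf)
    for c, u, e in zip(inconsistency, token_uncertainty, is_error)
    if e
]
print(t_cons, t_conf, cells)
```

In this toy set the fourth error is consistent and confident, so neither axis flags it. That is the situation the abstract describes for the Entrenched cell, where only an independently pretrained judge is said to help.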

2606.02288 2026-06-02 cs.LG

Massive Spikes in LLMs are Bias Vectors: Mechanistic Uncovering and Spike-Free Quantization

LLM中的大规模尖峰是偏置向量：机制揭示与无尖峰量化

Yung-Chin Chen, Chung Peng Lee, Ze-Wei Liou, Naveen Verma

Affiliations: Princeton University; EnCharge AI

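AI summary: Through mechanistic analysis, this paper finds that activation spikes in LLMs are essentially structured vector biases, and it proposes INSERTQUANT, a spike-free quantization framework that enables robust low-bit quantization.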

AI-generated Chinese abstract

大型语言模型（LLM）中的大规模激活尖峰通过拉伸动态范围严重降低了量化性能。虽然先前的假设将这些尖峰描述为高级标量偏置，但我们认为它们只是携带尖峰的令牌中刚性、结构化的向量偏置的标量中间产物。我们展示了这些令牌在归一化后收敛到常向量，驱动了注意力沉没和值状态耗尽机制。我们通过分析投影权重的协调性从几何上证实了这一点：$W_K$对比性地放大该向量，$W_Q$将语义令牌对齐到它，$W_V$将其投影到谱零空间。此外，我们揭示了模型通过利用低频带和相干通道对将结构偏置定位在“旋转稳定区域”中，从而主动保护这些结构偏置免受旋转位置编码（RoPE）扰动的影响。利用这一点，我们提出了INSERTQUANT，一种后训练量化（PTQ）框架，通过预计算模板向量来钳制尖峰并恢复其功能。这使得激活严格无尖峰，从而实现高保真度的鲁棒低比特量化。INSERTQUANT在LLM上达到了与最先进的每张量量化方法相当的性能，并且独特地泛化到文本以外的其他模态，如ViT。

English abstract

Massive activation spikes in Large Language Models (LLMs) severely degrade quantization by stretching dynamic ranges. While prior hypotheses characterize these as high-level scalar biases, we argue that they are merely the scalar intermediates of rigid, structural vector biases in the spike-carrying tokens. We show that, after normalization, these tokens converge to constant vectors that drive the attention sink and value-state drain mechanisms. We geometrically substantiate this by analyzing the coordination of projection weights: $W_K$ contrastively amplifies the vector, $W_Q$ aligns semantic tokens toward it, and $W_V$ projects it into the spectral null-space. Furthermore, we reveal that the model actively preserves these structural biases against Rotary Positional Embedding (RoPE) perturbations by localizing them in "zones of rotational stability" that use low-frequency bands and coherent channel pairs. Leveraging this, we propose INSERTQUANT, a post-training quantization (PTQ) framework that clamps spikes and restores their function via pre-computed template vectors. This renders activations strictly spike-free, enabling robust low-bit quantization with high fidelity. INSERTQUANT achieves parity with state-of-the-art per-tensor quantization methods on LLMs and uniquely generalizes beyond text to other modalities such as ViTs.
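
Illustrative sketch (not INSERTQUANT itself): the abstract's starting point is that a spike stretches the dynamic range of per-tensor quantization. The Python below demonstrates that effect with plain symmetric absmax quantization, a generic scheme chosen here for illustration. The activation values, the spike magnitude of 1000, and the function names are hypothetical.

```python
# Illustrative only: generic symmetric absmax quantization on made-up values.

def quantize_dequantize(values, bits=8):
    """Symmetric per-tensor absmax quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for int8
    scale = max(abs(v) for v in values) / qmax  # one scale for the whole tensor
    return [round(v / scale) * scale for v in values]

def mean_abs_error(original, restored):
    """Mean absolute reconstruction error over paired entries."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

normal = [0.5, -0.3, 0.8, -0.9, 0.1, 0.4]  # hypothetical ordinary activations
spiked = normal + [1000.0]                 # same tensor plus one massive spike

restored_clean = quantize_dequantize(normal)
# Quantize the spiked tensor, then look only at the ordinary entries.
restored_spiked = quantize_dequantize(spiked)[: len(normal)]

err_clean = mean_abs_error(normal, restored_clean)
err_spiked = mean_abs_error(normal, restored_spiked)
print(err_clean, err_spiked, restored_spiked)
```

With the spike present, the single shared scale is so coarse that every ordinary entry rounds to zero, which is the kind of degradation that motivates keeping activations spike-free before quantizing.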
