arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.02355 2026-06-02 cs.AI cs.LG

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu, Leyi Wei, Lu Pan, Ke Zeng, Xunliang Cai

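AI summary: Proposes the SIRI framework, in which an LLM agent mines, validates, and internalizes its own skills, improving long-horizon task performance without an external skill generator or an inference-time skill bank; it outperforms baselines on ALFWorld and WebShop.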

Abstract

Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework that enables agents to discover, validate, and internalize skills without external skill generators or inference-time skill banks. SIRI first warms up the policy with GiGPO to acquire basic interaction ability and collect successful skill-free trajectories. It then performs self-skill mining, where the current policy summarizes compact skills from its own successful plain rollouts and validates them through paired skill-augmented and skill-free rollouts. Finally, SIRI distills only beneficial skill-guided action tokens into the plain policy using trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. On ALFWorld and WebShop with Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines. Further analysis shows that our self-mining strategy can achieve performance comparable to distillation from a closed-source large model. Our code is available at https://github.com/kirito618/SIRI.

2606.02352 2026-06-02 cs.CV

Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection

David J. Lerch, Livien Majer, Zeyun Zhong, Manuel Martin, Frederik Diederichs, Rainer Stiefelhagen

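AI summary: Proposes a multi-modal global alignment framework that uses soft targets and a weighting mechanism to handle faulty negatives and unreliable positives; it outperforms existing methods on the Drive&Act dataset, yielding robust driver distraction detection.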

Comments
Accepted at IEEE ITSC 2026

Abstract

Robust self-supervised learning of multi-modal video representations is critical for real-world applications such as driver distraction detection, where multiple sensors provide complementary but noisy signals. Conventional contrastive objectives, such as InfoNCE, assume all negatives are equally informative and all positives are reliable. However, this assumption is frequently violated in multi-modal data due to viewpoint changes, occlusions, or semantic overlap across modalities. In this work, we propose a novel framework for multi-modal global alignment that addresses these challenges by jointly modeling faulty negatives and unreliable or faulty positives. We introduce soft targets derived from cycle-consistency scores to relax the hard-negative assumption, and a weighting mechanism based on similarity distributions to mitigate the impact of noisy or faulty positives. Our approach extends traditional pairwise alignment to a principled global multi-modal setting, aggregating alignment information across all modality pairs. We evaluate our method on the Drive&Act dataset, demonstrating that it consistently outperforms both pairwise and existing global alignment baselines across RGB, IR, Depth, and Skeleton modalities. Cross-view ablation studies further show strong generalization to unseen camera perspectives, highlighting the robustness of our representations. Overall, our framework provides a scalable and effective solution for self-supervised global multi-modal representation learning, enabling reliable driver distraction detection and pioneering in real-world multi-modal video understanding. Our code will be published on GitHub.
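
The contrast between a hard InfoNCE target and a soft target can be illustrated with a small sketch. This is not the authors' implementation: the function names, the temperature, and the soft-target values below are assumptions chosen for illustration, and the cycle-consistency scores from which the paper derives its targets are not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_info_nce(similarities, targets, temperature=0.1):
    """Cross-entropy between a target distribution and softmax(sim / T).

    With a one-hot `targets` vector this is standard InfoNCE for one anchor.
    """
    probs = softmax([s / temperature for s in similarities])
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)


# One anchor compared with four candidates. Index 0 is the nominal positive;
# index 1 stands in for a likely false negative (hypothetical values).
sims = [0.9, 0.8, 0.1, -0.2]
hard_targets = [1.0, 0.0, 0.0, 0.0]   # standard InfoNCE
soft_targets = [0.7, 0.3, 0.0, 0.0]   # assumed soft target, for illustration

hard_loss = soft_target_info_nce(sims, hard_targets)
soft_loss = soft_target_info_nce(sims, soft_targets)
```

With the soft target, part of the probability mass is assigned to the second candidate, so the loss no longer pushes that candidate away as a pure negative.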

2606.02350 2026-06-02 cs.CV

TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos

Jinpeng Liu, Yukang Xu, Yutong Li, Xingyu Liu

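AI summary: Proposes the TROPHIES framework, which jointly estimates dynamic humans, static scenes, and camera poses to achieve globally consistent 4D reconstruction from multi-view videos.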

Abstract

Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human-scene-camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES (Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos), a unified framework tailored for this task. TROPHIES features a Human Branch that models humans through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment and optimization module couples both branches by enforcing scale consistency, contact priors, and cross-view temporal coherence. Experiments on EgoHuman and EgoExo4D demonstrate that TROPHIES achieves globally aligned, physically plausible 4D reconstructions and consistently outperforms existing paradigms in both global fidelity and human-scene consistency.

2606.02346 2026-06-02 cs.CV

VEDAL: Variational Error-Driven Asynchronous Learning for 3D Gaussian Splatting Pruning

Aoduo Li, Jiancheng Li, Huan Ye, Hongjian Xu, Shiting Wu, Xiujun Zhang, Zimeng Li, Xuhang Chen

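AI summary: Proposes the VEDAL framework, which prunes 3D Gaussian Splatting models efficiently via variational free energy minimization, a prediction-error gating mechanism, and a variational uncertainty head, losing only 0.31 dB PSNR at 5.2x compression.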

Comments
12 pages, 5 figures. Accepted by CGI 2026

Abstract

3D Gaussian Splatting (3DGS) achieves remarkable novel view synthesis quality with real-time rendering, yet suffers from excessive memory consumption due to millions of Gaussian primitives. Existing pruning methods rely on heuristic importance scores or synchronous batch updates, leading to suboptimal compression and training instability. We propose VEDAL, a principled framework that formulates Gaussian pruning as variational free energy minimization. Our approach introduces (1) a prediction-error gating mechanism that asynchronously activates pruning based on per-Gaussian reconstruction uncertainty, and (2) a variational uncertainty head that models pruning decisions as latent variables with learnable priors. The free energy objective naturally balances reconstruction fidelity against model complexity through an information-theoretic lens. Extensive experiments on Mip-NeRF 360, Tanks&Temples, and Deep Blending demonstrate that VEDAL achieves 5.2x compression with only 0.31 dB PSNR drop, outperforming PUP 3D-GS by +0.05 dB at a higher compression ratio and LightGaussian by +0.35 dB at comparable quality, while maintaining real-time rendering at 185 FPS.

2606.02345 2026-06-02 stat.ML cs.LG

Doing well with less! On Sampling Techniques for Empirical Pairwise Loss Estimation/Minimization

Louise Davy, Stephan Clémençon, Charlotte Laclau

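AI summary: Uses survey sampling techniques that sample pairs directly rather than individual observations, achieving estimation or optimization performance comparable to full pairwise evaluation while retaining only a fraction of the pair information, and offering a theoretically grounded trade-off between accuracy and computational cost.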

Abstract

Many machine learning problems, including similarity learning, ranking, and clustering, rely on empirical pairwise loss functions whose quadratic computational cost quickly becomes prohibitive at scale. We demonstrate how a frugal approach that retains only a fraction of the available information on pairs can achieve estimation or optimization performance comparable to that obtained by using all pairs, by leveraging survey sampling techniques. A central finding, supported by both theory and experiments, is that such sampling plans must target pairs directly rather than individual observations. In particular, for pairwise losses between high-dimensional vectors such as embeddings in vision or graph learning, assigning higher inclusion probabilities to informative pairs using suitable auxiliary information yields performance close to full pairwise evaluation, providing a principled and theoretically grounded trade-off between accuracy and computational cost.
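
Sampling pairs directly can be sketched with a Horvitz-Thompson estimator, a standard survey sampling tool: each pair is kept with a known inclusion probability and reweighted by its inverse, which keeps the estimate of the full pairwise mean unbiased. The sketch below is a minimal illustration under assumptions, not the paper's sampling plan: the function names, the squared-gap loss, and the uniform 20% inclusion probability are hypothetical.

```python
import itertools
import random


def full_pairwise_mean(points, loss):
    """Exact mean of the loss over all unordered pairs (quadratic cost)."""
    pairs = list(itertools.combinations(range(len(points)), 2))
    return sum(loss(points[i], points[j]) for i, j in pairs) / len(pairs)


def sampled_pairwise_mean(points, loss, inclusion_prob, rng):
    """Horvitz-Thompson estimate under Poisson sampling of pairs.

    Each pair (i, j) is kept independently with probability
    inclusion_prob(i, j) and its loss is divided by that probability.
    """
    n_pairs = len(points) * (len(points) - 1) // 2
    total = 0.0
    for i, j in itertools.combinations(range(len(points)), 2):
        prob = inclusion_prob(i, j)
        if rng.random() < prob:
            total += loss(points[i], points[j]) / prob
    return total / n_pairs


def squared_gap(a, b):
    return (a - b) ** 2


def uniform_inclusion(i, j):
    # Assumed constant probability; auxiliary information about informative
    # pairs could replace this with a pair-dependent value.
    return 0.2


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(60)]

exact = full_pairwise_mean(data, squared_gap)
estimates = [sampled_pairwise_mean(data, squared_gap, uniform_inclusion, rng)
             for _ in range(200)]
average_estimate = sum(estimates) / len(estimates)
```

Each single estimate evaluates the loss on roughly a fifth of the pairs; averaging many such estimates should approach the exact value, which is what unbiasedness predicts.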

2606.02342 2026-06-02 cs.CV

Detecting Pen-In-Air States from Video: A Proof-of-Concept Toward Complementary Handwriting Analysis

Lauren Sismeiro, Remy Plastre, Binbin Xu, Frederic Puyjarinet, Gerard Dray

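AI summary: Proposes an interpretable hybrid pipeline combining YOLO-based pen-tip tracking, kinematic feature extraction, and machine learning classification to detect pen-contact states from top-view video, as a low-cost, non-intrusive complement to digitizing tablets; it reaches an F2 score of up to 0.805 on a pilot dataset.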

Comments
Accepted for the 12th International Conference on Computer Technology Applications (ICCTA 2026)

Abstract

Dynamic aspects of handwriting are critical for assessing developmental disorders such as dysgraphia and are typically captured using digitizing tablets. However, tablet-based sensing restricts analysis of Pen-Up behavior to a short proximity range above the writing surface, potentially missing high-lift in-air movements. As a proof of concept, we investigate whether top-view video can provide a complementary source of information for inferring pen-contact states without relying on tablet proximity sensing. We propose an interpretable hybrid pipeline combining pen-tip tracking using a YOLO-based detector with kinematic feature extraction and machine learning classification. A pilot dataset of diverse handwriting videos was manually annotated at the frame level and evaluation used a Leave-One-Video-Out (LOVO) protocol. The method achieved reliable event-level detection of Pen-Up segments, with an F_2 score up to 0.805, consistent with the emphasis on recall in a screening-oriented setting. These results support the feasibility of video-based Pen-Up detection as a low-cost and non-intrusive complement to digitizing tablets, and provide a foundation for future large-scale studies.
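
For readers unfamiliar with the metric, the F-beta score weights recall beta times as heavily as precision, so F2 favors recall. The sketch below uses the standard formula; the precision and recall values are hypothetical and are not taken from the paper.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical detector: modest precision, high recall.
precision, recall = 0.6, 0.9
f2 = f_beta(precision, recall)            # about 0.818
f1 = f_beta(precision, recall, beta=1.0)  # about 0.720
```

In this example F2 exceeds F1 because the detector's recall is higher than its precision, which matches the screening-oriented emphasis the abstract describes.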

2606.02339 2026-06-02 cs.LG cs.CV

Entropy Minimization without Model Collapse: Mitigating Prediction Bias in Medical Imaging

Tim Nielen, Sameer Ambekar, Johannes Kiechle, Daniel M. Lang, Julia A. Schnabel

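AI summary: To address the model collapse caused by entropy minimization in test-time adaptation, proposes Distribution Shift Bias Reduction (DSBR), which corrects prediction bias by equalizing each predicted class's contribution to the unsupervised entropy minimization loss; its stability and effectiveness are evaluated on four medical-imaging datasets and ImageNet-C.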

Abstract

Entropy minimization (EM) is the dominant objective for test-time adaptation, yet its failure mode, model collapse, remains poorly understood. In this work, we show that distribution shifts can cause feature clusters corresponding to distinct classes in the model's representation space to merge, while the decision boundary remains fixed. This induces a systematic skew in the predicted class distribution, referred to as prediction bias. Prediction bias refers to a shift in the predicted class distribution, with some classes overrepresented and others suppressed. We show that entropy minimization amplifies this prediction bias by tightening the existing clusters, reinforcing the incorrect groupings until all predictions collapse to a trivial solution. Next, to demonstrate the significance of prediction bias and mitigate it, we further propose Distribution Shift Bias Reduction (DSBR), a bias-correcting objective that specifically targets this failure mode by equalizing the contribution of each predicted class to the unsupervised entropy minimization loss. To study this failure mode, we design suitable adaptation settings using four medical-imaging datasets and additionally evaluate on ImageNet-C. We find that DSBR consistently stabilizes test-time adaptation, prevents model collapse, and matches or outperforms state-of-the-art methods. Moreover, DSBR operates solely at test-time.
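
One possible reading of "equalizing the contribution of each predicted class" is to average the entropy within each predicted class first and then across classes, so that an over-represented class cannot dominate the loss. The sketch below illustrates that reading only; it is an assumption, not the DSBR objective as defined in the paper, and all names and values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def plain_entropy_loss(batch):
    """Standard entropy minimization loss: mean entropy over samples."""
    return sum(entropy(p) for p in batch) / len(batch)


def class_balanced_entropy_loss(batch):
    """Mean entropy within each predicted class, then mean across classes."""
    by_class = {}
    for probs in batch:
        predicted = max(range(len(probs)), key=probs.__getitem__)
        by_class.setdefault(predicted, []).append(entropy(probs))
    per_class = [sum(values) / len(values) for values in by_class.values()]
    return sum(per_class) / len(per_class)


# Skewed batch: three samples predicted as class 0, one as class 1.
batch = [[0.9, 0.1], [0.8, 0.2], [0.95, 0.05], [0.4, 0.6]]
plain = plain_entropy_loss(batch)
balanced = class_balanced_entropy_loss(batch)
```

In the plain loss the single class-1 sample carries a quarter of the weight; in the balanced variant its class carries half, so the under-represented class is not drowned out.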

2606.02337 2026-06-02 cs.AI

Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson

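AI summary: Proposes the CG-CMARL framework, which uses coordination graphs and Lagrangian duality to decompose the joint action space and the constraint coupling, so that the number of learned models is independent of the number of agents; Max-Sum message passing coordinates actions and a Lagrangian multiplier controls the objective–constraint trade-off, tracing a Pareto front.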

Comments
Accepted at the Reinforcement Learning Conference (RLC) 2026. 40 pages (12 main + 28 appendix), 5 figures, 16 tables, 7 theorems

Abstract

Constrained multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with the number of agents, and additional requirements couple agents in ways that reward structure alone does not capture. We introduce Coordination Graphs for Constrained Multi-Agent Reinforcement Learning (CG-CMARL), a framework that addresses both challenges by combining coordination graphs with Lagrangian duality. The system decomposes the joint problem into pairwise regions, each served by a set of shared Q-functions, one for the primary objective and one for each of the constraints, so that the number of learned models is independent of the number of agents. At execution time, Max-Sum message passing coordinates actions across the factor graph, while a Lagrangian multiplier controls the objective–constraint trade-off, allowing a single trained model to trace a Pareto front without retraining. We provide convergence guarantees under mild conditions, together with a compositional error bound that decomposes into separate interpretable sources, each traceable to a specific design choice and independently controllable. Experiments on cooperative navigation tasks (where teams of up to 10 agents must coordinate to reach target positions while satisfying pairwise constraints) show that our method produces Pareto fronts dominating established baselines trained at fixed reward-shaping ratios, while scaling to team sizes where centralized approaches become intractable.
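
How a multiplier can trade the objective against a constraint at execution time is easiest to see in a single-state toy case. The sketch below is a simplification under assumptions: it scores each action as objective value minus multiplier times constraint cost, for one agent and tabulated values, and it omits the coordination graph and Max-Sum message passing entirely. The names and numbers are hypothetical.

```python
def lagrangian_greedy_action(q_objective, q_cost, multiplier):
    """Pick the action maximizing Q_objective - multiplier * Q_cost."""
    scores = [qo - multiplier * qc for qo, qc in zip(q_objective, q_cost)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical values for three actions in one state.
q_objective = [1.0, 0.8, 0.3]   # higher is better
q_cost = [0.9, 0.4, 0.0]        # expected constraint cost, lower is better

# Sweeping the multiplier with fixed Q-values yields different trade-offs.
frontier = []
for multiplier in [0.0, 0.5, 1.0, 2.0, 5.0]:
    action = lagrangian_greedy_action(q_objective, q_cost, multiplier)
    frontier.append((multiplier, action, q_objective[action], q_cost[action]))
```

Because the Q-values stay fixed while only the multiplier changes, the sweep needs no retraining, which is the property the abstract highlights.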

2606.02331 2026-06-02 cs.CV cs.LG

Hallucination-Aware Diffusion Sampling for Inverse Problems via Robust Prior Updates

Pengfei Jin, Yiqi Tian, Kailong Fan, Bingjie Qi, Quanzheng Li

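AI summary: Proposes a Robust Prior Update module that probes the local stability of the diffusion prior update and re-anchors the displacement, reducing measurement-conditioned hallucination in inverse problem solving and improving instance faithfulness.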

Abstract

Diffusion-based inverse problem solvers can produce realistic reconstructions, but realism alone does not ensure that the recovered details are supported by the measurement. We study this failure as measurement-conditioned hallucination: visually meaningful content that is either implausible or inconsistent with the measured instance. Our analysis separates Bayes-rule-based diffusion inverse solvers into a prior update and a measurement-conditioning step, showing that hallucinated content can enter through the prior-side proposal before the measurement correction is applied. Motivated by this view, we propose Robust Prior Update (RPU), a solver-level module that probes the local stability of the diffusion prior update, re-anchors the resulting displacement at the current iterate, and leaves the measurement update unchanged. We instantiate RPU in DPS and evaluate it on FFHQ and ImageNet inverse problems using automatic metrics and human faithfulness studies. On FFHQ, RPU improves PSNR and LPIPS over DPS across box inpainting, Gaussian deblurring, and motion deblurring. In human judgments, RPU receives 91.9% of blind non-tie majority preferences and 91.1% of ground-truth-assisted non-tie preferences on FFHQ box inpainting, while the ImageNet Gaussian reader study is tie-heavy but favors RPU among non-tie cases. These results support a targeted claim: robustifying the prior update can improve instance faithfulness in diffusion inverse solvers, especially when the prior shapes weakly constrained content.

2606.02328 2026-06-02 cs.LG

Riemannian Gradient Descent for Low-Rank Architectures

Nicholas Knight

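AI summary: Explores Riemannian optimization techniques for low-rank matrix parameters and applies them to the multihead attention parameters of small language models, but does not clearly outperform an AdamW baseline.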

Abstract

We explore Riemannian optimization techniques for rank-factored matrix parameters, targeting contemporary deep learning applications. We examine ten points in the algorithm design space: two geometries for rank-$r$ matrices, three geometries for rank-$r$ partial isometries, and block-matrix variants of these five, where factors are shared across block-rows and block-columns. We apply our methods to the multihead attention parameters in small language models. After tuning learning rates, our methods do not conclusively outperform an AdamW baseline. Our implementations are available online.

2606.02326 2026-06-02 cs.AI

Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions

Yifan Wang

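AI summary: Proposes the Repair-Augmented Constraint Learning (RACL) framework, which builds known repair operations into the classifier semantics and considers affordable repairs before vetoing, lowering the false-veto rate and clarifying the learnability of the decision rule.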

Comments
7 pages, 3 figures

Abstract

Hard constraints are usually treated as terminal vetoes: once a candidate violates a requirement, the learned rule rejects it and any repair is handled outside the decision semantics. This misses a common deployed regime in which the system already knows a finite menu of modifications, such as adding a ticket option, changing a configuration, or requesting an available service upgrade. Existing constraint-learning, soft-relaxation, and recourse methods address nearby problems, but they do not learn whether an option should be repaired before being vetoed. We introduce Repair-Augmented Constraint Learning (RACL), a contextual decision framework that lifts known repair operators into the classifier semantics. A candidate is accepted when an affordable repair makes it feasible and preferred enough; otherwise the system returns a structured rejection credit and, when applicable, a repair plan. This repair-before-veto view strictly generalizes no-repair HASSLE-style semantics, reveals an irreducible false-veto gap for terminal-veto rules, separates binary-label non-identifiability from decision-rule learnability, and gives capacity and calibration bounds for the observed-feasibility shared-weight setting. Across controlled and DB1B-derived benchmarks, RACL recovers the intended credit and repair structure. On the hardest raw-data-derived tier, validation-selected RACL reduces false vetoes to 10/4039 (FVR 0.0025), versus about 1064/4039 for the strongest repair-search black-box baseline, while making the FVR/EDR trade-off explicit.

2606.02322 2026-06-02 cs.LG cs.AI

Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment

Ran Liu, Min Yu, Mingqi Liu, Jianguo Jiang, Gang Li, Rongsheng Li, Ning Li, Zhen Xu, Weiqing Huang, Ming Liu

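AI summary: Proposes the AdvCL framework, which repurposes adversarial perturbations as a geometric control signal and combines three plug-in modules (Intra-Smooth, Proto-Clip, Inter-Align) to improve standard performance and robustness in continual learning while reducing forgetting and strengthening transfer.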

Abstract

In dynamic environments, large language models need to keep adapting to new tasks, but continual learning often suffers from forgetting, limited transfer, and vulnerability to adversarial perturbations. To address this, we present AdvCL, which repurposes adversarial perturbations as a geometric control signal for stable continual adaptation. AdvCL combines three plug-in modules: Intra-Smooth promotes local smoothness via small adversarial perturbations; Proto-Clip uses similarity clipping to prevent excessive alignment to the current task prototype; and Inter-Align applies directional alignment toward the previous task prototype to reduce representational gaps. Experiments show consistent gains in both standard performance and robustness, with lower forgetting and stronger transfer. We further analyze key mechanisms by quantifying the sensitivity of Intra-Smooth to perturbation settings and the effect of Inter-Align on task similarity and geometric distance. In summary, the modules provide complementary gains when combined, and each can also be integrated individually into diverse CL paradigms, including replay, regularization, and dynamic architectures, thereby offering a geometric control mechanism for continual learning.

2606.02321 2026-06-02 cs.CV

Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang

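AI summary: Proposes a training-free framework that first selects visually relevant candidates with frozen DINOv3 models, then uses large vision-language models to assess whether each candidate matches the instruction, and finally applies reasoning-based refinement, achieving 48.78 Recall@1 and 51.48 Recall@5 in a CVPR 2026 challenge.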

Comments
CVPR 2026, VidLLMs workshop

Abstract

Recent advances in large vision-language models have expanded video retrieval from simple text-based search to more flexible scenarios, where users may specify the desired result through both visual examples and textual instructions. In the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge, the system is required to retrieve a target video according to a reference video and a modification instruction. To address this task, we develop Visual Representation-Guided Video-LLM Reasoning for Training-Free Composed Video Retrieval. Our framework first uses frozen DINOv3 models to obtain a compact set of visually relevant candidates, and then applies large vision-language models to evaluate whether each candidate satisfies the modification instruction. A final reasoning-based refinement is further performed on the top candidates to improve the first-ranked prediction. Without training, our system achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set. Future work may further improve retrieval accuracy through stronger video-LLMs and detailed integration between visual representations and language reasoning.
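
Recall@K, the metric reported above, is the fraction of queries whose target appears among the top K retrieved items. A minimal sketch follows; the video identifiers are made up, and the function returns a fraction in [0, 1], whereas the abstract's figures appear to be percentages.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target is within the top-k ranked results."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)


# Three hypothetical queries, each with a ranked candidate list and one target.
ranked = [
    ["v3", "v1", "v7"],
    ["v2", "v9", "v4"],
    ["v5", "v6", "v8"],
]
targets = ["v3", "v4", "v0"]

r_at_1 = recall_at_k(ranked, targets, 1)   # only the first query hits at rank 1
r_at_3 = recall_at_k(ranked, targets, 3)   # the second query hits at rank 3
```

Recall@K can only stay equal or rise as K grows, so a small gap between Recall@1 and Recall@5 suggests that a target that is retrieved at all is usually ranked first.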

2606.02320 2026-06-02 cs.CL

TVIR: Building Deep Research Agents Towards Text–Visual Interleaved Report Generation

Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu, Yishuo Yuan, He Zhu, Jiakai Wang, Qianqian Xie, Yifan Zhao, Xinlong Yang, Hao Cong, Zhiheng Yao, Fengxia Xie, Zihao Xu, Haoran Xu, Zhaohui Wang, Minghao Liu, Shirong Lin, Yingshui Tan, Yuchi Xu, Wenbo Su, Zhaoxiang Zhang, Bo Zheng, Jiaheng Liu

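AI summary: Proposes the TVIR benchmark and a hierarchical multi-agent framework to address the factual reliability and alignment of visual elements in deep research reports.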

Abstract

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text–Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.

2606.02313 2026-06-02 cs.RO

Towards Precise Intent-Aligned VLA Aerial Navigation via Expert-Guided GRPO

Tianyang Chen, Wenjun Li, Xin Zhou, Yuze Wu, Fei Gao

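AI summary: Proposes the EG-GRPO framework, which augments online rollouts with expert data and uses a heterogeneous parallel pipeline to address the intent-alignment problems that data scarcity and inefficient exploration cause for VLA models in UAV navigation; the success rate rises to 2.13x that of the SFT baseline and intent alignment performance improves by 60.9%.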

Abstract

Vision-Language-Action (VLA) models offer a promising end-to-end paradigm for unmanned aerial vehicles (UAVs) to accomplish complex tasks specified by fine-grained instructions. However, standard supervised fine-tuning (SFT) suffers from data scarcity, limited generalization, and weak supervision for nuanced and complicated human intents. Reinforcement fine-tuning offers a natural way to mitigate these challenges and align policy behaviors with human intents through designable feedback, but applying it to aerial navigation remains challenging due to inefficient exploration in expansive continuous spaces. To address these challenges, we introduce an efficient reinforcement learning (RL) framework for VLA-based aerial navigation. At its core, we propose EG-GRPO (Expert-Guided Group Relative Policy Optimization) to augment online rollouts with few-shot expert data. Additionally, we design a heterogeneous pipeline enabling parallel simulation and inference, which reduces rollout time by 43.5%. Across multiple tasks specified by complex human intents, EG-GRPO improves the success rate to 2.13x that of the SFT baseline, while improving intent alignment performance by 60.9%. These results demonstrate that our framework can move aerial navigation toward precise intent-aligned flight.
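
Why expert data can help when exploration is inefficient can be seen from the group-relative advantage commonly used in GRPO, where each rollout's reward is normalized by the mean and standard deviation of its group. The sketch below is an illustration under assumptions: the abstract does not say how EG-GRPO mixes expert data into a group, so simply appending an expert reward to the group is a hypothetical choice, as are the names and values.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


online_rewards = [0.0, 0.0, 0.0, 0.0]   # every sampled rollout failed
expert_rewards = [1.0]                  # hypothetical expert demonstration

without_expert = group_relative_advantages(online_rewards)
with_expert = group_relative_advantages(online_rewards + expert_rewards)
```

When every online rollout fails, all advantages are zero and the group provides no learning signal; adding one successful trajectory gives the group a nonzero spread, so the failed rollouts receive negative advantages and the successful one a positive advantage.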

2606.02310 2026-06-02 cs.CV cs.LG

Deep Learning for Remote Sensing to Improve Flood Inundation Mapping

Yogesh Bhattarai, Vijay Chaudhary, Wai Lim Kim, Sanjib Sharma

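AI summary: Proposes a cloud-removal framework for flood imagery based on Denoising Diffusion Probabilistic Models and the Masked Diffusion Transformer, generating cloud-free images that preserve hydrological consistency and making flood monitoring more reliable.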

Comments
This paper has been selected as one of the top 10 student finalists in the IGARSS 2026 paper competition

Abstract

Flooding is the most pervasive natural disaster worldwide. Timely and accurate flood inundation mapping is essential for informing disaster risk management. Optical satellite missions provide high-resolution, multispectral observations critical for flood detection and inundation mapping. However, their operational utility is severely constrained by cloud cover during extreme precipitation events. Conventional cloud-removal techniques based on temporal compositing or interpolation often fail to capture inundation dynamics. In this study, we introduce a cloud-removal framework for flood imagery based on Denoising Diffusion Probabilistic Models, leveraging the Masked Diffusion Transformer architecture. The proposed approach exploits self-attention mechanisms to capture wider spatial context and employs masked token modeling to explicitly learn the reconstruction of cloud-obscured regions. Trained on multispectral Sentinel-2B flood scenes with realistic cloud patterns, the model generates cloud-free image realizations that preserve both visual fidelity and hydrological consistency. Reconstruction performance is evaluated using standard image quality metrics alongside flood-specific hydrological measures, demonstrating improved continuity of water bodies and preservation of spectral signatures critical for water detection indices. The results indicate that diffusion-based generative modeling offers a robust and physically consistent alternative for cloud removal in optical flood monitoring, enabling more reliable, continuous observations to support disaster risk management and flood-related decision making.

2606.02309 2026-06-02 cs.LG cs.CV

Measurement Geometry and Design for Trustworthy Generative Inverse Problems

可信生成式逆问题的测量几何与设计

Pengfei Jin, Na Li, Quanzheng Li

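AI summary: Proposes a local measurement-manifold compatibility measure, proves that it controls the stable part of the reconstruction error, and designs fixed and adaptive measurement strategies based on volume preservation; across several imaging tasks the scores predict failure modes, explain measurement-induced hallucinations, and guide sampling.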

详情
AI中文摘要

生成模型越来越多地被用作逆问题的先验,但它们生成逼真图像的能力带来了一个基本的信任问题:一个看似合理的重建可能由测量支持,也可能由先验沿未观测方向填充。这一区别在医学成像中尤为重要,因为采集操作是在扫描时间、剂量和校准约束下设计的。我们从测量几何的角度研究生成式逆问题。核心问题是:固定的测量算子能否区分在生成先验下看似合理的邻近图像,以及这种关系能否指导更好的测量。我们引入了一个局部测量-流形兼容性度量,用于量化算子观测先验相关切线方向的程度。在局部正则性假设下,我们证明该量控制重建误差的稳定部分,而生成先验控制流形外漂移。这一最坏方向证书基于整体局部体积保持,提出了实用的固定和顺序采集规则,包括一种后验云设计,该设计在测试时自适应调整测量,无需训练采样策略。在行采样、断层扫描和MR采集设置中,所提出的分数预测失败模式,解释测量引起的幻觉,并指导更好的采样。在fastMRI笛卡尔采样中,后验云测量设计优于强大的非学习ACS保留基线,包括可变密度和泊松类掩模。

英文摘要

Generative models are increasingly used as priors for inverse problems, but their ability to produce realistic images creates a basic trust problem: a plausible reconstruction may be supported by the measurements, or it may be filled in by the prior along unobserved directions. This distinction is especially important in medical imaging, where acquisition operators are designed under scan-time, dose, and calibration constraints. We study generative inverse problems from a measurement-geometry perspective. The central question is whether a fixed measurement operator can distinguish nearby images that are plausible under the generative prior, and whether this relationship can guide better measurements. We introduce a local measurement-manifold compatibility measure that quantifies how well the operator observes prior-relevant tangent directions. Under local regularity assumptions, we prove that this quantity controls the stable part of the reconstruction error, while the generative prior controls off-manifold drift. This worst-direction certificate motivates practical fixed and sequential acquisition rules based on overall local volume preservation, including a posterior-cloud design that adapts measurements at test time without training a sampling policy. Across row-sampling, tomographic, and MR acquisition settings, the proposed scores predict failure modes, explain measurement-induced hallucinations, and guide better sampling. In fastMRI Cartesian sampling, posterior-cloud measurement design improves over strong non-learned ACS-preserving baselines, including variable-density and Poisson-like masks.

2606.02307 2026-06-02 cs.RO

FATE-VLA: Failure-aware test generation for vision-language-action models

FATE-VLA:面向视觉-语言-动作模型的故障感知测试生成

Arusa Kanwal, Pablo Valle, Shaukat Ali, Aitor Arrieta

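AI summary: Proposes a failure-aware test-generation method that combines diversity-driven exploration with surrogate models to actively discover the sparse, clustered failures of VLA models in high-dimensional embodied spaces; on four state-of-the-art models it finds up to 29.7% more failures than the baselines.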

详情
AI中文摘要

视觉-语言-动作(VLA)模型越来越多地被用作通用机器人策略,然而它们的评估仍然主要依赖于随机采样任务场景的静态基准。在高维具身空间中,故障是稀疏且聚类的,因此静态基准测试可能低估鲁棒性风险。我们将VLA评估重新定义为主动故障发现问题,并提出一种故障感知测试生成方法,该方法将多样性驱动的探索与从观察到的执行中学习的代理模型相结合。该方法将测试引导向高风险但多样化的场景区域。在四个最先进的VLA模型上,它发现了显著更多的故障(相比选定基线最多增加29.7%),同时揭示了更多样化的故障模式。这意味着,例如,在GR00T-N1.6的情况下,成功率从64.4%下降到34.7%。更广泛地说,我们的发现呼吁VLA评估的转变:从固定任务套件上的被动测量转向自适应、寻求故障的测试生成,在部署之前揭示模型弱点的结构。

英文摘要

Vision-Language-Action (VLA) models are increasingly used as generalist robot policies, yet their evaluation still relies largely on static benchmarks that randomly sample task scenes. In high-dimensional embodied spaces, failures are sparse and clustered, so static benchmarking can underestimate robustness risks. We reframe VLA evaluation as an active failure-discovery problem and propose a failure-aware test-generation approach that combines diversity-driven exploration with surrogate models learned from observed executions. The method steers testing toward high-risk yet diverse scene regions. Across four state-of-the-art VLA models, it uncovers substantially more failures (up to +29.7% over selected baselines) while revealing more diverse failure modes. This means that, for instance, in the case of GR00T-N1.6, the success rate dropped from 64.4% to 34.7%. More broadly, our findings call for a shift in VLA evaluation: from passive measurement on fixed task suites to adaptive, failure-seeking test generation that exposes the structure of model weaknesses before deployment.

2606.02304 2026-06-02 cs.CL

Unified Context Evolution for LLM Agents

统一上下文演化:面向LLM智能体

Zixuan Zhu, Yitong Hu, Yong Dai, Junfeng Fang, Chunyang Jiang, Senkang Hu, Yuzhi Zhao

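AI summary: Proposes the Unified Context Evolution (UCE) framework, which externalizes agent experience into four types of Evolvable Context Units (ECUs) to accumulate knowledge across tasks and schedule its generation dynamically, substantially improving performance on ALFWorld and WebShop.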

详情
AI中文摘要

基于LLM的智能体可以通过结合推理与环境反馈来解决多步交互任务,然而每个回合从相同的固定上下文开始,任务结束后沿途发现的任何有用策略都会丢失。现有方法要么将学习限制在当前任务,要么将所有经验汇集到单一的无类型存储中,而不区分知识类型、通过使用跟踪质量或平衡库中仍缺乏的内容。我们引入了统一上下文演化(UCE),一种无梯度框架,将智能体经验外部化到不断演化的类型化可演化上下文单元(ECU)库中。UCE将经验分解为四种互补类型(记忆、策略、工作流和技能),每种类型从轨迹中根据类型特定条件生成,在决策时检索,通过重复使用结果评分,并在不再有价值时修剪。调度模块将每个周期的生成预算分配给库中最弱的类型。在两个交互基准测试中,UCE将ALFWorld的成功率从75.4%提高到96.3%,将WebShop的任务得分从45.1%提高到61.3%,并且累积的库无需重新训练即可迁移到其他智能体主干。

英文摘要

LLM-based agents can solve multi-step interactive tasks by combining reasoning with environment feedback, yet each episode starts from the same fixed context and any useful strategy discovered along the way is lost once the task ends. Existing approaches either limit learning to the current task or pool all experience into a single untyped store, without distinguishing knowledge types, tracking quality through use, or balancing what the library still lacks. We introduce Unified Context Evolution (UCE), a gradient-free framework that externalizes agent experience into an evolving library of typed Evolvable Context Units (ECUs). UCE decomposes experience into four complementary types (Memory, Strategy, Workflow, and Skill), each generated from trajectories under type-specific conditions, retrieved at decision time, scored through repeated usage outcomes, and pruned when no longer valuable. A scheduling module allocates each cycle's generation budget toward the types where the library is weakest. Across two interactive benchmarks, UCE raises ALFWorld success from 75.4% to 96.3% and WebShop task score from 45.1% to 61.3%, and the accumulated library transfers to alternative actor backbones without retraining.

2606.02303 2026-06-02 cs.CV

Cross-Domain Dead Tree Detection via Knowledge Distillation in Aerial Imagery

跨域航拍图像死树检测:基于知识蒸馏的方法

Anis Ur Rahman, Mete Ahishali, Einari Heinaro, Samuli Junttila

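AI summary: To address domain shift and scarce labeled data in dead-tree detection from aerial imagery, adapts the TreeMort-1T-UNet model with knowledge distillation; feature-level distillation gives robust performance across several target domains and performs best in low-data scenarios.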

详情
Comments
14 pages, 6 figures, journal
AI中文摘要

航拍图像中的死树检测对于评估森林健康至关重要,尤其是随着气候变化导致全球树木死亡率上升,但域变异性和稀缺的标注数据常常限制模型的泛化能力。本研究改进了最初在芬兰航拍图像(源域)上训练的TreeMort-1T-UNet(树木死亡率单任务U-Net)模型,通过应用知识蒸馏(KD)使其适应各种目标域,包括代表不同森林类型的波兰、德国和爱沙尼亚数据集。我们评估了四种KD变体:基础、自蒸馏、特征级和集成,与微调基线进行比较,使用平均树木IoU、实例F1分数、实例精度和平均质心误差作为关键指标,并结合表征分析(如余弦相似度、CKA、SSIM、t-SNE和线性探针)评估域不变性。特征级KD优于其他方法,在波兰数据集上实现了平均树木IoU为0.106、实例F1分数为0.63、实例精度为0.55、平均质心误差为3.039,并在其他目标域上保持稳健精度(例如,芬兰为0.15,波兰为0.67,德国为0.60,爱沙尼亚为0.59)。它在低数据场景下表现优异,假阳性更少,并展现出优越的表征不变性(例如,更高深层CKA/SSIM、t-SNE中更好的域混合、线性探针AUC为0.95),使其成为精度关键的林业应用的理想选择。额外的消融研究证实,特征对齐等关键组件增强了其跨指标的平衡性能。我们的发现证明了KD在遥感中增强迁移学习的潜力,为生态监测和可持续森林管理提供了可扩展、域鲁棒的工具。

英文摘要

Detecting dead trees in aerial imagery is vital for assessing forest health, especially as tree mortality increases globally due to climate change, but domain variability and scarce labeled data often limit model generalization. This study advances the TreeMort-1T-UNet (Tree Mortality 1-Task U-Net) model, initially trained on Finnish aerial imagery (source domain), by applying knowledge distillation (KD) to adapt it to various target domains, including Polish, German, and Estonian datasets representing diverse forest types. We assess four KD variants: Basic, Self, Feature-level, and Ensemble, against a fine-tuning baseline, using Mean Tree IoU, Instance F1-score, Instance Precision, and Mean Centroid Error as key metrics, alongside representational analyses (e.g., cosine similarity, CKA, SSIM, t-SNE, and linear probing) for domain invariance. Feature-level KD outperforms others, yielding a Mean Tree IoU of 0.106, Instance F1-score of 0.63, Instance Precision of 0.55, and Mean Centroid Error of 3.039 on the Polish dataset, with robust precision across other target domains (e.g., 0.15 on Finnish, 0.67 on Polish, 0.60 on German, 0.59 on Estonian). It excels in low-data scenarios with fewer false positives and shows superior representational invariance (e.g., higher deep-layer CKA/SSIM, better domain mixing in t-SNE, and linear probing AUC of 0.95), making it ideal for precision-critical forestry applications. Additional ablation studies confirm that key components like feature alignment enhance its performance balance across metrics. Our findings demonstrate KD's potential to enhance transfer learning in remote sensing, offering a scalable, domain-robust tool for ecological monitoring and sustainable forest management.
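
The abstract lists CKA among its representational analyses. A minimal sketch of linear CKA, computed on centered feature matrices as ||XᵀY||²_F / (||XᵀX||_F · ||YᵀY||_F), is given below. The abstract does not say which kernel the authors use, so the linear variant is an assumption here, and the function names are hypothetical.

```python
import math


def center_columns(rows):
    """Subtract each column's mean so every feature is zero-centered."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_frobenius_sq(x, y):
    """Squared Frobenius norm of X^T Y for row-major matrices X and Y."""
    total = 0.0
    for x_col in zip(*x):
        for y_col in zip(*y):
            dot = sum(a * b for a, b in zip(x_col, y_col))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two feature matrices with the same number of rows."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(
        cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc)
    )
    return numerator / denominator
```

Each row is one sample (for example, one image patch) and each column one feature. A value near 1 means the two layers encode the samples with similar geometry, which is how "higher deep-layer CKA" can be read as evidence of domain invariance.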

2606.02302 2026-06-02 cs.CR cs.AI

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

SeClaw: 面向自主代理评估的规范驱动安全任务合成

Hao Cheng, Changtao Miao, Tianle Song, Yin Wu, He Liu, Erjia Xiao, Junchi Chen, Xiaoyu Shi, Yichi Wang, Jing Yang, Taowen Wang, Jinhao Duan, Mengshu Sun, Peiyan Dong, Xuan Shen, Yang Cao, Renjing Xu, Kaidi Xu, Jindong Gu, Bo Zhang, Jize Zhang, Chenhao Lin, Philip Torr, Chao Shen

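AI summary: Proposes the SeClaw framework, which combines spec-driven security task synthesis with execution-based security evaluation for scalable, reproducible assessment of the security risks of autonomous LLM agents in stateful environments.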

详情
AI中文摘要

自主LLM代理越来越多地在有状态环境中运行,访问工具、文件、内存和外部服务。虽然这些能力支持复杂的现实工作流,但它们也引入了难以通过现有评估捕获的安全风险。当前的代理安全基准通常依赖手动策划的任务,对新兴威胁的覆盖有限,并且主要关注最终结果而非导致不安全行为的执行过程。我们引入了SeClaw,一个结合规范驱动的安全任务合成与基于执行的安全评估的框架,用于自主代理。规范驱动的安全任务合成能够从结构化风险规范中可扩展且可控地构建安全任务,而SeClaw docker提供了一个标准化测试平台,用于评估代理在各种安全风险场景下的行为。该基准涵盖了由资源、用户任务、环境和内在代理行为引起的风险,并支持对不安全行为的轨迹感知评估,超越最终响应。通过桥接系统化的任务合成和可复现的安全评估,SeClaw为测量、诊断和比较自主LLM代理中的安全故障提供了实用基础。代码可在 https://github.com/seclaw-eval/seclaw-eval 获取。

英文摘要

Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks that are difficult to capture with existing evaluations. Current agent security benchmarks often rely on manually curated tasks, provide limited coverage of emerging threats, and focus primarily on final outcomes rather than the execution processes that lead to unsafe behavior. We introduce SeClaw, a framework that combines specification-driven security task synthesis with execution-based security evaluation for autonomous agents. Spec-driven security task synthesis enables scalable and controllable construction of security tasks from structured risk specifications, while SeClaw docker provides a standardized testbed for evaluating agent behavior under diverse safety-risk scenarios. The benchmark covers risks arising from resources, user tasks, environments, and intrinsic agent behaviors, and supports trajectory-aware assessment of unsafe actions beyond final responses. By bridging systematic task synthesis and reproducible security evaluation, SeClaw provides a practical foundation for measuring, diagnosing, and comparing security failures in autonomous LLM agents. The code is available at https://github.com/seclaw-eval/seclaw-eval.

2606.02301 2026-06-02 cs.HC cs.AI cs.CV

Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video

定量运动测试:从单部智能手机视频测量患者运动

Pranav Mahajan, Amanda Wall, Eleonora Maria Camerone, Julie Stebbins, Eoin Kelleher, Shuangyi Tong, Annina Schmid, Katja Wiech, Anushka Irani, Ben Seymour

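AI summary: Proposes Quantitative Movement Testing (QMT), a computer-vision method that uses deep-learning 3D pose estimation to extract kinematic biomarkers from monocular smartphone video; it agrees closely with optical motion capture in laboratory validation (r > 0.85) and shows reliability and longitudinal monitoring capability in fibromyalgia and chronic sciatica patients.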

详情
AI中文摘要

慢性疼痛通过降低功能能力而损害生活质量,但在现实环境中客观测量这种功能影响仍然具有挑战性。虽然光学运动捕捉为评估运动质量改变提供了高精度,但成本高昂且局限于实验室环境。我们旨在开发并验证定量运动测试(QMT),这是一个从标准单目智能手机视频中提取3D运动生物标志物的计算机视觉流程,平衡临床可及性与生物力学精度。我们利用基于深度学习的3D姿态估计,在健康对照组(N=13)中针对金标准光学运动捕捉验证了QMT流程。经过留一法受试者校准以纠正系统偏差后,我们在两个前瞻性临床队列中部署QMT以评估现实世界效用:一项纤维肌痛患者的干预前后试验,以及一项慢性坐骨神经痛患者和健康对照的30天纵向家庭监测研究。在实验室验证中,QMT提取的临床运动指标与光学运动捕捉高度一致,显示出强相关性(r>0.85)和低平均绝对误差。QMT在纤维肌痛患者中显示出高重测信度(r>0.86),并成功追踪了慢性坐骨神经痛患者的日常运动波动。虽然现实家庭环境引入了比实验室环境更高的测量方差,但QMT完全基于远程记录发现了健康对照组和坐骨神经痛患者之间的组级差异。单目3D姿态估计为传统评估提供了一种可扩展的替代方案。QMT为临床试验中跟踪疾病进展和治疗反应提供了客观、可及的生物标志物,但需要进一步研究以优化家庭环境中的可靠性。

英文摘要

Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challenging in real-world settings. While optical motion capture provides high precision for assessing altered movement quality, it is costly and restricted to laboratory environments. We aimed to develop and validate Quantitative Movement Testing (QMT), a computer vision pipeline extracting 3D kinematic biomarkers from standard monocular smartphone video, balancing clinical accessibility with biomechanical accuracy. We validated the QMT pipeline, utilising deep learning-based 3D pose-estimation, against gold-standard optical motion capture in healthy controls (N=13). Following leave-one-subject-out calibration to correct systematic bias, we deployed QMT in two prospective clinical cohorts to assess real-world utility: a pre- and post-intervention trial for fibromyalgia patients, and a 30-day longitudinal at-home monitoring study of chronic sciatica patients and healthy controls. In laboratory validation, QMT extracted clinical kinematic metrics with high agreement to optical motion capture, yielding strong correlations (r > 0.85) and low mean absolute errors. QMT demonstrated high test-retest reliability (r > 0.86) in fibromyalgia patients and successfully tracked day-to-day movement fluctuations in chronic sciatica. While real-world home settings introduced higher measurement variance than lab settings, QMT found group-level differences between healthy controls and sciatica patients based entirely on remote recordings. Monocular 3D pose estimation offers a scalable alternative to traditional assessments. QMT provides an objective, accessible biomarker for tracking disease progression and treatment response in clinical trials, though further research is needed to optimise reliability in home environments.
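
The validation procedure described above (leave-one-subject-out calibration, then correlation and mean absolute error against motion capture) can be sketched as follows. The abstract does not specify the calibration model, so the linear bias correction used here is an assumption, as are all function names and the data layout.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def loso_calibrate(data):
    """data maps subject id -> list of (video_estimate, mocap_reference).

    Each subject is corrected with a line fitted on the other subjects only,
    so no subject's own reference data leaks into its correction.
    """
    corrected = {}
    for held_out in data:
        train = [p for s, pairs in data.items() if s != held_out for p in pairs]
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        corrected[held_out] = [
            (slope * x + intercept, y) for x, y in data[held_out]
        ]
    return corrected


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_abs_error(xs, ys):
    """Mean absolute difference between estimates and references."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)
```

Holding each subject out of its own calibration fit keeps the reported agreement from being inflated by subject-specific tuning.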

2606.02300 2026-06-02 cs.CL

Beyond Isolated Behaviors: Hierarchical User Modeling for LLM Personalization

超越孤立行为:面向LLM个性化的层次化用户建模

Liang Wang, Xinyi Mou, Xiaoyou Liu, Tiannan Wang, Yuqing Wang, Zhongyu Wei

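AI summary: To address the lack of hierarchical structure in how LLM personalization models user behavior, proposes the PHF framework based on Bourdieu's Theory of Practice, with three levels (practice, habitus, field), and implements the lightweight, model-agnostic PHF_Compass, which yields consistent gains on the LaMP benchmark.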

详情
AI中文摘要

大型语言模型(LLM)在多个领域展现出卓越能力,但将其输出个性化以适应个体用户仍是一个开放挑战。现有方法主要采用扁平行为范式,聚合用户行为而未明确考虑它们如何组织成更深层的行为结构。在本工作中,我们借鉴皮埃尔·布迪厄的实践理论,提出PHF(实践-惯习-场域),一个基于社会学的框架,通过三个层次重新概念化LLM个性化:作为实践的个人行为、作为惯习的行为在时间上的积累形成稳定倾向、以及作为场域的相似用户间的共享规律。我们通过$\mathrm{PHF}_{\text{Compass}}$实例化PHF,这是一种基于冻结LLM的轻量级且模型无关的实现。在语言模型个性化(LaMP)基准上的实验表明,该方法在多种任务上取得一致改进,进一步分析验证了所学行为结构的可解释性和可扩展性。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet personalizing their outputs to individual users remains an open challenge. Existing approaches predominantly adopt a flat behavioral paradigm, aggregating user behaviors without an explicit account of how they are organized into deeper behavioral structures. In this work, we draw on Pierre Bourdieu's Theory of Practice to propose PHF (Practice-Habitus-Field), a sociologically grounded framework that reconceptualizes LLM personalization through three hierarchical levels: individual behaviors as practices, their temporal accumulation into stable dispositions as habitus, and shared regularities across similar users as fields. We instantiate PHF through $\mathrm{PHF}_{\text{Compass}}$, a lightweight and model-agnostic implementation based on a frozen LLM. Experiments on the Language Model Personalization (LaMP) benchmark demonstrate consistent improvements across diverse tasks, while further analyses validate the interpretability and extensibility of the learned behavioral structures.

2606.02296 2026-06-02 cs.RO

A Kinetic Theory of Encounter-Based Information Propagation in Multi-Robot Systems

多机器人系统中基于相遇的信息传播的动力学理论

Alkesh K. Srivastava, Philip Dames

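AI summary: Proposes a kinetic theory that analyzes the performance limits of multi-robot target tracking in terms of encounter-driven information propagation, staleness, and geometric constraints.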

详情
AI中文摘要

多机器人系统不能假设持续的网络连接。我们通过目标跟踪研究这一问题,其中性能取决于目标信息被感知、通过团队传输并在变得过时之前使用的速度。当机器人仅通过物理相遇交换信息时,跟踪成为一个动力学信息传输问题:机器人运动引发相遇,相遇携带目标状态估计,信息年龄决定过时程度,而过时信息产生跟踪误差。本文发展了一种基于相遇的信息传播的动力学理论,并识别出三个极限。第一个是访问极限——信息无法支持团队级协调,除非它传播到感知到它的机器人之外。第二个是过时极限——即使传播的信息也会随着目标移动而失去价值。第三个是几何极限——当目标运动超过信息传输时,跟踪误差进入饱和状态,此时仅通信改进的收益递减。我们通过改变团队规模、操作区域、通信范围和目标速度的大规模模拟评估该理论。结果支持所提出的访问-过时-几何分解:通信覆盖控制访问转变;一旦信息可访问,跟踪误差由目标位移决定;这种响应在受限区域内是局部线性的,但由于感知刷新和有界几何,在更广范围内是非线性的。在受控扫描和联合变化中,推导出的访问和过时坐标可靠地描述了跟踪性能。这些结果共同建立了一个动力学理论框架,用于预测和设计基于相遇的多机器人系统。

英文摘要

Multi-robot systems cannot assume persistent network connectivity. We study this problem through target tracking, where performance depends on how quickly target information is sensed, transported through the team, and used before it becomes stale. When robots exchange information only through physical encounters, tracking becomes a kinetic information-transport problem: robot motion induces encounters, encounters carry target-state estimates, information age determines staleness, and stale information produces tracking error. This paper develops a kinetic theory of encounter-based information propagation and identifies three limits. The first is an access limit -- information cannot support team-level coordination unless it spreads beyond the robots that sensed it. The second is a staleness limit -- even propagated information loses value as the target moves. The third is a geometry limit -- when target motion outpaces information transport, tracking error approaches a saturation regime where communication improvements alone have diminishing returns. We evaluate the theory through large-scale simulations varying team size, operating area, communication range, and target speed. Results support the proposed access-staleness-geometry decomposition: communication coverage governs the access transition; once information is accessible, tracking error is shaped by target displacement; and this response is locally linear in restricted regimes but nonlinear over broader ranges because of sensing refreshes and bounded geometry. Across controlled sweeps and joint variation, the derived access and staleness coordinates reliably describe tracking performance. Together, these results establish a kinetic-theoretic framework for predicting and designing encounter-based multi-robot systems.

2606.02294 2026-06-02 cs.LG

Regularized Large Neighborhood Search

正则化大邻域搜索

Germain Vivier-Ardisson, Laurent Demonet, Axel Parmentier, Mathieu Blondel

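AI summary: Proposes regularized large neighborhood search (RLNS), which turns the LNS heuristic into an MCMC sampler and enables end-to-end learning without a global solver.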

详情
AI中文摘要

运筹学从业者通常使用大邻域搜索(LNS)来解决NP难的组合问题,这是一种可扩展的启发式方法,通过局部重新优化其变量的子集来迭代改进当前解。相比之下,大多数现有的将组合优化层集成到神经网络中的方法仍然假设可以访问精确的全局解,这在计算上是难以处理的。我们通过引入正则化大邻域搜索(RLNS)来弥合这一差距。通过正则化或扰动局部子问题,我们将LNS启发式转化为一个高效的MCMC采样器,在可行解的组合集上采样,并关联Fenchel-Young损失。在熵正则化下,我们证明RLNS执行精确的块吉布斯采样。此外,调整RLNS迭代次数使我们能够在伪似然和精确最大似然估计之间插值,从而实现无需全局求解器的端到端学习。我们在$k$-子集选择、广义分配和随机车辆调度问题上展示了我们的方法。

英文摘要

Operations research practitioners typically tackle NP-hard combinatorial problems using large neighborhood search (LNS), a scalable heuristic that iteratively refines a current solution by locally re-optimizing subsets of its variables. In contrast, most existing approaches for integrating combinatorial optimization layers into neural networks still assume access to an exact global solution, which is computationally intractable. We bridge this gap by introducing regularized LNS (RLNS). By regularizing or perturbing local subproblems, we turn the LNS heuristic into an efficient MCMC sampler over the combinatorial set of feasible solutions, with associated Fenchel-Young losses. Under entropic regularization, we prove that RLNS performs exact block Gibbs sampling. Furthermore, adjusting the number of RLNS iterations allows us to interpolate between pseudolikelihood and exact maximum likelihood estimation, for end-to-end learning without global solvers. We demonstrate our approach on $k$-subset selection, generalized assignment, and stochastic vehicle scheduling problems.
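
To make the link between LNS and block Gibbs sampling concrete, here is a toy sketch for $k$-subset selection. It is not the authors' implementation: it assumes a target distribution p(y) proportional to exp(theta · y) over binary vectors with exactly k ones, and it replaces the "re-optimize a subset of variables" step of LNS with an exact draw from the block's conditional distribution. All names and parameters are hypothetical.

```python
import itertools
import math
import random


def resample_block(y, theta, block, rng):
    """Draw the block's variables from their exact conditional distribution.

    Variables outside the block stay fixed, so the block must keep the same
    number of selected items to preserve the k-subset constraint.
    """
    ones_in_block = sum(y[i] for i in block)
    candidates = list(itertools.combinations(block, ones_in_block))
    weights = [math.exp(sum(theta[i] for i in chosen)) for chosen in candidates]
    chosen = rng.choices(candidates, weights=weights, k=1)[0]
    for i in block:
        y[i] = 1 if i in chosen else 0


def block_gibbs_k_subset(theta, k, block_size, n_iters, seed=0):
    """Sample y in {0,1}^n with sum(y) == k, targeting exp(theta . y)."""
    rng = random.Random(seed)
    n = len(theta)
    y = [1] * k + [0] * (n - k)  # any feasible starting solution
    for _ in range(n_iters):
        block = rng.sample(range(n), block_size)  # the "neighborhood"
        resample_block(y, theta, block, rng)
    return y
```

Enumerating the block's feasible assignments is only practical for small blocks. In a real LNS setting the local subproblem would be handled by a solver, which is where the regularization or perturbation described in the abstract would come in.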

2606.02293 2026-06-02 cs.CL

AI as a Tool for Simulation-Based Experiments in Literary Studies

AI作为文学研究中基于模拟的实验工具

Matthew Wilkens

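AI summary: Explores using generative AI for controlled, large-scale, low-cost simulation experiments on literary and cultural production, reviews the current state of the art, and presents initial results on AI literary text generation through experiments that compare against human-authored novels.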

详情
AI中文摘要

生成式人工智能系统通过受控、有依据、大规模、低成本的模拟文化生产,为文学研究中的实验开辟了新的可能性。当前系统尚未被证明能够生成高质量、长篇幅的叙事文本,并可靠地反映任意指定的文化约束或风格特征。但在文学历史模拟所需的各个组件上存在大量相关研究,包括:使用和验证AI系统作为可区分人类群体的代理;AI生成文本的叙事和风格特性;多智能体、多轮次AI模拟人类行为者的稳定性和连贯性;以及通过可预测方式改变生成系统知识和行为的技术方法。这些领域共同为基于AI的文学生产文化系统建模提供了更雄心勃勃的起点。我们描述了文学研究中基于模拟的实验的可能性和挑战,总结了相关领域的最新进展,并解释了工作的关键技术方面。为了提供一个与文学学者直接相关的例子,我们展示了文学文本生成实验的结果,包括与高地位、人类作者小说的比较。我们的结果首次展示了AI模型在该领域内(有限的)分布内输出。最后,我们描述了未来使用AI进行完整反事实文学历史模拟的工作。

英文摘要

Generative artificial intelligence (AI) systems open new possibilities for experimentation in literary studies via controlled, grounded, large-scale, low-cost simulations of cultural production. Current systems have not yet been shown to produce high-quality, book-length narrative texts that reliably reflect arbitrarily specified cultural constraints or stylistic features. But there exists substantial relevant research on each of the components required for literary-historical simulation. These include the use and validation of AI systems as proxies for differentiable human populations; the narrative and stylistic properties of AI-generated texts; the stability and coherence of multiagent, multiturn AI simulations of human actors; and technical methods through which to alter in predictable ways the knowledge and behavior of generative systems. Together, these areas could provide a starting point for more ambitious AI-based modeling of cultural systems of literary production. We describe the possibilities and challenges of simulation-based experiments in literary studies, summarize the current state of the art in relevant fields, and explain key technical aspects of the work. To provide an example directly relevant to literary scholars, we present the results of experiments on literary text generation, including comparisons to high-status, human-authored novels. Our results include the first demonstration of (limited) in-distribution outputs by AI models in this domain. We conclude with a description of future work on full counterfactual literary-historical simulations using AI.

2606.02292 2026-06-02 cs.CV

Neural Acquisition & Representation of Subsurface Scattering

次表面散射的神经获取与表示

Arjun Majumdar, Raphael Braun, Hendrik Lensch

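AI summary: Proposes a method that acquires and estimates sub-surface scattering properties at a high level of detail by using a U-Net CNN to learn the pixel footprint response at each point on the object surface, enabling relighting with arbitrary high-resolution projector patterns.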

详情
Comments
8 pages
AI中文摘要

我们提出了一种方法,通过学习物体表面每个点的像素足迹响应,以高度细节化的水平获取和估计光传输的次表面散射特性。重建利用3D扫描技术作为U-Net CNN的输入。使用相移轮廓测量(PSP)图案的立体投影仪-相机设置高效捕获各种散射物体的数据。重建密集像素足迹允许使用任意高分辨率投影图案进行重光照。最终输出是重光照后的彩色图像。与真实世界捕获图像的定性和定量比较表明,预测的足迹与实际响应几乎相同。同一模型针对多个物体的多个视图进行训练,使得学习到的表示也能泛化到未见过的次表面散射材料。

英文摘要

We present a method to acquire and estimate the sub-surface scattering properties of light transport at a highly detailed level by learning the pixel footprint response at each point on the object surface. The reconstruction leverages 3D scanning techniques as input to a U-Net CNN. A stereo projector-camera setup using phase-shifted profilometry (PSP) patterns efficiently captures the data for a variety of scattering objects. Reconstructing dense pixel footprints allows for relighting with arbitrary high-resolution projector patterns. The final output is a relit color image. Qualitative and quantitative comparisons against illuminated real-world captured images demonstrate that the predicted footprints are almost identical to the actual responses. The same model is trained for multiple views across multiple objects such that the learned representations can be used to generalize to unseen sub-surface scattering materials as well.

2606.02289 2026-06-02 cs.CL

DECK: A Consistency x Confidence Taxonomy of LLM Hallucinations

DECK: LLM幻觉的一致性×置信度分类法

Mohit Singh Chauhan

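AI summary: Proposes DECK, a 2×2 taxonomy based on inter-sample consistency and token-level confidence that divides LLM hallucinations into four behavioral regimes, each mapped to the scorer families able to detect it; experiments validate the taxonomy and identify a universal blind spot of output-level uncertainty quantification.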

详情
Comments
18 pages, 3 figures, 5 tables
AI中文摘要

现有的幻觉分类法根据输出错误的内容(如记忆错误、推理失败、流畅编造)对LLM错误进行分类。这些分类法有助于诊断,但无法回答另一个问题:哪个不确定性评分器本可以捕捉到这个错误?我们提出一种补充性分类法,根据错误的可检测性特征(评分器家族能读取的信号)对错误进行分类。DECK分类法是一个2×2划分,沿样本间一致性和词级置信度分为四个行为区域(Drift、Entrenched、Confabulation、Knotted),每个区域映射到能够检测它的特定评分器家族(或多个家族):黑盒一致性评分器在D和C中有信号,白盒词概率评分器在K和C中有信号,只有经过独立预训练的LLM-as-a-Judge才能检测E。通过在每个评分器轴上使用Youden's J最优分割来操作化单元成员关系。在三个模型和四个数据集上,我们通过两种方式验证该分类法:分析评分器对的不一致性,以及检查外部标签(SelfAware不可回答、HaluEval对抗性、PopQA实体流行度)是否落在预测的DECK单元中,并附带模型规模特定和内容特定的次级单元细化。我们进一步识别出输出级不确定性评估的一个普遍盲点:在知识缺口输入上,当生成器输出自信、可重复的编造时,每个输出级家族在构造上都会失效。对Llama-3-8B隐藏状态的线性探针也失效至随机水平,初步证据表明该失败可能在激活层面持续存在;更丰富的内部状态方法(不确定性评估头、信息论估计器)仍有待测试。

英文摘要

Existing hallucination taxonomies classify LLM errors by what is wrong with the output -- memorised misconceptions, reasoning failures, fluent fabrications. These taxonomies are useful for diagnosis but cannot answer a different question: which uncertainty scorer would have caught this error? We propose a complementary taxonomy that classifies errors by their detectability signature -- the signal a scorer family would read. The DECK taxonomy is a 2x2 partition along inter-sample consistency and token-level confidence into four behavioural regimes (Drift, Entrenched, Confabulation, Knotted), each mapping to a specific scorer family (or families) that can detect it: black-box consistency scorers have signal in D and C, white-box token-probability scorers have signal in K and C, and only an LLM-as-a-Judge with independent pretraining can detect E. Cell membership is operationalised by a Youden's J optimal split on each scorer axis. Across three models and four datasets we validate the taxonomy two ways: by analysing scorer-pair disagreement, and by checking that external labels (SelfAware unanswerable, HaluEval adversarial, PopQA entity popularity) land in the predicted DECK cells, with model-scale and content-specific secondary-cell refinements. We further identify a universal blind spot of output-level UQ: on knowledge-gap inputs where the generator emits confident, repeatable fabrications, every output-level family collapses by construction. A linear probe on Llama-3-8B's hidden states also collapses to chance, giving preliminary evidence that the failure may persist at the activation level; richer internal-state methods (UQ heads, information-theoretic estimators) remain to be tested.
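
The 2x2 partition can be sketched in a few lines. Youden's J is sensitivity + specificity - 1, which equals TPR - FPR, and the split is the threshold that maximizes it on each scorer axis. The cell mapping below is inferred from the abstract's statement about which scorer family has signal in which regime: low consistency gives D or C, low confidence gives K or C, and neither gives E. It is not the paper's code, and the function names and score conventions (higher means more consistent or more confident) are assumptions.

```python
def youden_threshold(scores, is_error):
    """Pick the cut t maximising J = TPR - FPR, flagging score <= t as error."""
    positives = sum(is_error)
    negatives = len(is_error) - positives
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, e in zip(scores, is_error) if e and s <= t)
        fp = sum(1 for s, e in zip(scores, is_error) if not e and s <= t)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_t, best_j = t, j
    return best_t


def deck_cell(consistency, confidence, t_consistency, t_confidence):
    """Map one erroneous output to a DECK regime (inferred mapping)."""
    low_consistency = consistency <= t_consistency
    low_confidence = confidence <= t_confidence
    if low_consistency and low_confidence:
        return "Confabulation"  # both scorer families have signal
    if low_consistency:
        return "Drift"          # only consistency scorers have signal
    if low_confidence:
        return "Knotted"        # only token-probability scorers have signal
    return "Entrenched"         # consistent and confident: needs a judge
```

Under this reading, the Entrenched cell is the one where both output-level signals look healthy, which matches the abstract's point that only an independently pretrained judge can detect it.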

2606.02288 2026-06-02 cs.LG

Massive Spikes in LLMs are Bias Vectors: Mechanistic Uncovering and Spike-Free Quantization

LLM中的大规模尖峰是偏置向量:机制揭示与无尖峰量化

Yung-Chin Chen, Chung Peng Lee, Ze-Wei Liou, Naveen Verma

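AI summary: Through mechanistic analysis, shows that activation spikes in LLMs are in essence structured vector biases, and proposes INSERTQUANT, a spike-free quantization framework that enables robust low-bit quantization.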

详情
AI中文摘要

大型语言模型(LLM)中的大规模激活尖峰通过拉伸动态范围严重降低了量化性能。虽然先前的假设将这些尖峰描述为高级标量偏置,但我们认为它们只是携带尖峰的令牌中刚性、结构化的向量偏置的标量中间产物。我们展示了这些令牌在归一化后收敛到常向量,驱动了注意力沉没和值状态耗尽机制。我们通过分析投影权重的协调性从几何上证实了这一点:$W_K$对比性地放大该向量,$W_Q$将语义令牌对齐到它,$W_V$将其投影到谱零空间。此外,我们揭示了模型通过利用低频带和相干通道对将结构偏置定位在“旋转稳定区域”中,从而主动保护这些结构偏置免受旋转位置编码(RoPE)扰动的影响。利用这一点,我们提出了INSERTQUANT,一种后训练量化(PTQ)框架,通过预计算模板向量来钳制尖峰并恢复其功能。这使得激活严格无尖峰,从而实现高保真度的鲁棒低比特量化。INSERTQUANT在LLM上达到了与最先进的每张量量化方法相当的性能,并且独特地泛化到文本以外的其他模态,如ViT。

英文摘要

Massive activation spikes in Large Language Models (LLMs) severely degrade quantization by stretching dynamic ranges. While prior hypotheses characterize these as high-level scalar biases, we argue that they are merely the scalar intermediates of rigid, structural vector biases in the spike-carrying tokens. We show that these tokens converge to constant vectors after normalization that drive the attention sink and value-state drain mechanisms. We geometrically substantiate this by analyzing the coordination of projection weights: $W_K$ contrastively amplifies the vector, $W_Q$ aligns semantic tokens toward it, and $W_V$ projects it into the spectral null-space. Furthermore, we reveal that the model actively preserves these structural biases against Rotary Positional Embedding (RoPE) perturbations by localizing them in "zones of rotational stability" utilizing low-frequency bands and coherent channel pairs. Leveraging this, we propose INSERTQUANT, a post-training quantization (PTQ) framework that clamps spikes and restores their function via pre-computed template vectors. This renders activations strictly spike-free, enabling robust low-bit quantization with high fidelity. INSERTQUANT achieves parity with state-of-the-art per-tensor quantization methods on LLMs and uniquely generalizes beyond text to other modalities such as ViTs.

2606.02287 2026-06-02 cs.LG cs.AI

CityTrajBench: A Unified Benchmark for City-Scale Vehicle Trajectory Generation

CityTrajBench: 城市尺度车辆轨迹生成的统一基准

Shibo Zhu, Xiaodan Shi, Dayin Chen, Yuntian Chen, Haoran Zhang, Tianhao Wu, Jinyue Yan

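AI summary: Because inconsistent datasets, preprocessing, representations, and evaluation metrics make trajectory generation methods hard to compare, proposes CityTrajBench, a unified benchmark framework that standardizes data processing, model adaptation, and multi-level evaluation; it compares statistical, VAE, GAN, diffusion, and flow-matching models on three real-world datasets and reveals trade-offs among models on metrics such as global realism and trajectory-level geometric fidelity.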

详情
AI中文摘要

城市轨迹生成是交通模拟、城市规划和移动性分析的基础任务。然而,由于现有研究通常依赖不同的数据集、预处理流程、轨迹表示和评估指标,轨迹生成方法之间的系统比较仍然困难。这种碎片化使得报告的性能差异是否源于生成机制本身或实验协议不一致变得不明确。为解决这一问题,我们提出了CityTrajBench,一个用于城市尺度车辆轨迹生成的统一基准框架和协议。CityTrajBench在共同设置下标准化了数据摄入、轨迹归一化、特征构建、模型适配、地图感知后处理、模型选择和多级评估。它支持异构生成器,包括统计基线、基于VAE、GAN、扩散和流匹配的模型,并在三个真实世界城市轨迹数据集上评估它们。该基准衡量全局空间真实性、行程级分布保真度、轨迹级几何相似性、条件移动一致性和效率。实验揭示了模型家族之间的明确权衡:DiffTraj在轨迹级几何保真度上最强,DiffRNTraj在结构敏感的全局真实性上具有竞争力,而TrajFlow在真实性、质量、条件一致性和效率之间提供了强平衡。同时,一个简单的马尔可夫基线在粗粒度行程和局部移动统计上仍具有竞争力。这些发现表明,城市轨迹生成质量本质上是多目标的,没有单一模型在所有标准上同等占优,并且CityTrajBench为未来城市移动性生成研究提供了可复现的基准协议和测试平台。

英文摘要

Urban trajectory generation is a fundamental task for transportation simulation, urban planning, and mobility analytics. However, systematic comparison across trajectory generation methods remains difficult because existing studies often rely on different datasets, preprocessing pipelines, trajectory representations, and evaluation metrics. This fragmentation makes it unclear whether reported performance differences arise from the generation mechanism itself or from inconsistent experimental protocols. To address this issue, we present CityTrajBench, a unified benchmark framework and protocol for city-scale vehicle trajectory generation. CityTrajBench standardizes data ingestion, trajectory normalization, feature construction, model adaptation, map-aware post-processing, model selection, and multi-level evaluation under a common setting. It supports heterogeneous generators, including statistical baselines, VAE-based, GAN-based, diffusion-based, and flow-matching-based models, and evaluates them on three real-world urban trajectory datasets. The benchmark measures global spatial realism, trip-level distribution fidelity, trajectory-level geometric similarity, conditional mobility consistency, and efficiency. Experiments reveal clear trade-offs across model families: DiffTraj is strongest on trajectory-level geometric fidelity, DiffRNTraj is competitive on structure-sensitive global realism, and TrajFlow provides a strong balance across realism, quality, conditional consistency, and efficiency. Meanwhile, a simple Markov baseline remains competitive on coarse-grained trip and local-movement statistics. These findings show that urban trajectory generation quality is inherently multi-objective, that no single model dominates all criteria equally, and that CityTrajBench provides a reproducible benchmark protocol and testbed for future research on urban mobility generation.