arXivDaily: daily arXiv digest, updated Monday through Friday
2605.23174 2026-05-25 cs.CV

LQ-rPPG: A Label-Quantized Coarse-to-Fine Learning Framework for Remote Physiological Measurement

Jun Seong Lee, Samyeul Noh, Changki Sung, Hyun Myung

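Affiliations Electronics and Telecommunications Research Institute; School of Electrical Engineering, Korea Advanced Institute of Science and Technology

AI summary Remote photoplethysmography (rPPG) measures physiological signals from facial videos without contact and is promising for remote healthcare and daily health monitoring. Most existing deep learning-based rPPG methods, however, overlook the quality of training labels and its effect on learning, which leaves models sensitive to label noise and variability and hurts generalization. The paper proposes LQ-rPPG, a framework built on label quantization and coarse-to-fine learning: it converts continuous PPG signals into multi-bit pseudo labels to reduce noise, then refines the rPPG estimate step by step under hierarchical supervision. Experiments on multiple datasets show strong performance alongside a substantial gain in computational efficiency.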

Abstract

Remote photoplethysmography (rPPG) enables non-contact measurement of physiological signals from facial videos, offering strong potential for remote healthcare and daily health monitoring. Driven by this potential, various deep learning-based rPPG methods have been proposed to improve rPPG estimation. However, previous deep learning-based rPPG methods have paid little attention to the quality of training labels and their impact on model learning. Contact-based PPG signals used as training labels often contain noise and variability caused by motion artifacts, inconsistent sensor contact, and morphological distortions. Such label inconsistency can lead models to overfit to the label noise and variability and consequently degrade generalization performance. To address this issue, we propose LQ-rPPG, a label-quantized coarse-to-fine learning framework for robust rPPG estimation. LQ-rPPG consists of a label quantization module and a coarse-to-fine rPPG estimation model. The label quantization module transforms continuous PPG signals into multi-bit quantized pseudo labels with reduced noise and variability. The coarse-to-fine estimation model progressively refines rPPG signals under hierarchical supervision guided by the multi-bit pseudo labels. This design alleviates overfitting to label-specific variations and enables the model to learn structured and consistent representations. As a result, LQ-rPPG achieves robust and generalizable rPPG estimation even under challenging conditions. Experiments on multiple benchmark datasets demonstrate that LQ-rPPG achieves strong performance in both intra- and cross-dataset evaluations, while reducing parameters and multiply-accumulate operations by 88% and 29%, respectively, and increasing throughput by 191%. The code is available at https://github.com/Anonymous-repo-code/LQ-rPPG.
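
A minimal sketch of what multi-bit label quantization could look like. The abstract does not specify the quantizer, so min-max normalization followed by uniform binning is an assumption here, and the names `quantize_labels` and `coarse_to_fine_targets` are hypothetical.

```python
def quantize_labels(signal, bits):
    """Map a continuous signal to integer levels in [0, 2**bits - 1].

    Assumption: min-max normalization plus uniform bins; the paper's
    actual quantization rule is not given in the abstract.
    """
    lo, hi = min(signal), max(signal)
    levels = 2 ** bits
    if hi == lo:  # flat signal: everything falls into level 0
        return [0] * len(signal)
    out = []
    for x in signal:
        norm = (x - lo) / (hi - lo)                       # rescale to [0, 1]
        out.append(min(int(norm * levels), levels - 1))   # clamp the top edge
    return out


def coarse_to_fine_targets(signal, bit_depths=(1, 2, 4)):
    """Build one pseudo-label sequence per bit depth, coarsest first."""
    return {b: quantize_labels(signal, b) for b in bit_depths}


ppg = [0.0, 0.25, 0.5, 0.75, 1.0]
targets = coarse_to_fine_targets(ppg)
```

Lower bit depths discard small amplitude differences, which is one way to read the abstract's claim that quantized labels carry less noise and variability; the higher bit depths would then supervise the later refinement stages.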

2605.23171 2026-05-25 cs.LG cs.AI stat.ML

Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning

Abhay Yadav

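Affiliations Johns Hopkins University

AI summary The study examines adding noise to the embedding layer during instruction fine-tuning, analyzes how uniform and Gaussian noise differ in effect, and proposes SymNoise, a new symmetric-noise embedding method. Theoretical and empirical analysis finds that the noise types perform comparably, while SymNoise improves fine-tuning by regulating the model's local curvature more stringently. On AlpacaEval with LLaMA-2-7B fine-tuned on Alpaca, SymNoise scores 69.04% against 64.69% for NEFTune, a relative improvement of about 6.7%.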

Comments arXiv admin note: substantial text overlap with arXiv:2312.01523

Journal ref IEEE International Conference on Language Modeling (COLM), 2025

Abstract

Recent advancements in instructional fine-tuning have injected noise into embeddings, with NEFTune (Jain et al., 2024) setting benchmarks using uniform noise. Despite NEFTune's empirical findings that uniform noise outperforms Gaussian noise, the reasons for this remain unclear. This paper aims to clarify this by offering a thorough analysis, both theoretical and empirical, indicating comparable performance among these noise types. Additionally, we introduce a new fine-tuning method for language models, utilizing symmetric noise in embeddings. This method aims to enhance the model's function by more stringently regulating its local curvature, demonstrating superior performance over the current method, NEFTune. When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79% score on AlpacaEval. However, our approach, SymNoise, increases this score significantly to 69.04%, using symmetric noisy embeddings. This is a 6.7% improvement over the state-of-the-art method, NEFTune (64.69%). Furthermore, when tested on various models and stronger baseline instruction datasets, such as Evol-Instruct, ShareGPT, OpenPlatypus, SymNoise consistently outperforms NEFTune. The current literature, including NEFTune, has underscored the importance of more in-depth research into the application of noise-based strategies in the fine-tuning of language models. Our approach, SymNoise, is another significant step towards this direction, showing notable improvement over the existing state-of-the-art method.
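
A small sketch of the uniform-noise baseline and of the arithmetic behind the reported gain. The scaling `alpha / sqrt(L * d)` follows the NEFTune paper as commonly described and should be treated as an assumption here; the symmetric-noise scheme of SymNoise is not specified in the abstract and is not reproduced. Function names are hypothetical.

```python
import math
import random


def add_uniform_noise(embeddings, alpha, rng):
    """Add Uniform(-1, 1) noise scaled by alpha / sqrt(L * d) to each entry.

    embeddings: list of L token vectors, each of length d.
    Assumption: this is the NEFTune-style scaling, not SymNoise.
    """
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row] for row in embeddings]


def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


noisy = add_uniform_noise([[0.0] * 4 for _ in range(3)], alpha=5.0, rng=random.Random(0))
gain = relative_improvement(69.04, 64.69)  # the 6.7% in the abstract is relative
```

The 6.7% figure is a relative improvement; in absolute terms the gap between 69.04% and 64.69% is 4.35 percentage points.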

2605.23170 2026-05-25 cs.CL cs.AI cs.LG

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

Chuyifei Zhang, Hongyu Cui, Xiaowen Huang, Jitao Sang

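Affiliations Beijing Jiaotong University; Central South University of Forestry and Technology

AI summary The study argues that mainstream long-context reasoning benchmarks do not control where the target task sits in the context, so they cannot measure how model performance varies with position. The authors propose Context Rot Evaluation (CRE), a framework that systematically controls task position, filler content, and context length. Their experiments show that performance can drop sharply when the target task moves from the end of the context to the middle, and that the drop worsens as context length grows. Appending a copy of the task at the end largely removes the position-related drop, which points to a structural blind spot in current benchmark design.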

Comments 20 pages, 1 figure, 23 tables

Abstract

Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11 long-context benchmarks and find none jointly controls task position, filler content, and context length for reasoning. An audit of four flagship long-context releases finds no main result-table entry for NIAH, RULER, or LongBench-family benchmarks, while agentic and coding benchmarks appear in main result-tables across all four. We propose Context Rot Evaluation (CRE), a controlled framework varying all three factors, and evaluate nine LLMs on GSM8K and ARC-Challenge across two rounds: an initial five-model set and four newer vendor releases. Models can drop sharply when the target task moves from end to middle, and the drop grows worse with context length for vulnerable models. MiMo-v2-Flash drops 88pp at 64K under with_solutions filler (middle accuracy 8%). Newer releases show smaller drops: at 64K, three of four stay within +/-6pp of end-position accuracy; MiMo-V2.5-Pro narrows the MiMo-v2-Flash 88pp drop to 32pp. Under questions_only_v2 filler, middle-position drops persist across all four (range -16pp to -56pp across 8K, 32K, 64K). At 8K, a diagnostic probe adding a target-task copy at the end brings middle accuracy within +/-4pp of end baseline across all nine models, consistent with a positional explanation. In the initial five-model set, 76% of middle-position errors match surrounding filler text versus 22% at the end position, consistent with filler-answer interference as a dominant error mode. These results expose a structural evaluation gap in current reasoning benchmark design and vendor evaluation practice: positional vulnerabilities that grow with context length cannot be measured when task position is not controlled.
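
A minimal sketch of position-controlled prompt construction, assuming the context is a list of filler items joined by blank lines. The helper names and the prompt layout are hypothetical; CRE's actual templates are not given in the abstract.

```python
def build_context(filler_items, task, position):
    """Place `task` in the middle or at the end of a list of filler items."""
    if position == "middle":
        index = len(filler_items) // 2
    elif position == "end":
        index = len(filler_items)
    else:
        raise ValueError(f"unknown position: {position}")
    items = filler_items[:index] + [task] + filler_items[index:]
    return "\n\n".join(items)


def position_drop(acc_end, acc_middle):
    """Accuracy lost, in percentage points, when the task moves from end to middle."""
    return acc_end - acc_middle


filler = ["filler 1", "filler 2", "filler 3", "filler 4"]
middle_prompt = build_context(filler, "TARGET TASK", "middle")
end_prompt = build_context(filler, "TARGET TASK", "end")
```

Holding the filler list and its length fixed while changing only `position` is what separates a positional effect from a filler or length effect. As a check on the reported numbers, a middle accuracy of 8% after an 88pp drop implies an end-position accuracy of 96%.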

2605.23165 2026-05-25 cs.RO cs.AI cs.CL

Autonomous Frontier-Based Exploration with VLM Guidance

Aarush Aitha, Avideh Zakhor

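Affiliations EECS Department, University of California

AI summary The paper proposes a frontier-based autonomous exploration method guided by a vision-language model (VLM) to improve robotic exploration of unknown and hazardous environments. The VLM makes high-level strategic decisions that steer a conventional low-level robot control stack: a multimodal prompt built from the current map and images of candidate paths is used to pick the most promising direction to explore. In simulation across six indoor environments the method improves map coverage, and it is lightweight, training-free, and easy to transfer.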

Comments 8 pages, 10 figures, CVPR 2026: 2nd Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments

Abstract

Autonomous robotic exploration of unknown and hazardous environments, a long-standing challenge, can be significantly improved by leveraging the advanced reasoning of Vision-Language Models (VLMs). We introduce a novel exploration pipeline where a VLM performs high-level strategic decision-making, guiding a conventional low-level robotics control stack. At decision points, the robot generates a multimodal prompt with its current map and visual imagery of potential paths, or frontiers. The VLM analyzes this prompt to select the most promising frontier, replacing simple geometric heuristics with contextual spatial reasoning. This approach, validated in simulation across six indoor environments, improves map coverage by up to 24% over existing methods. Our pipeline is lightweight, training-free, and easily transferable to any robot with standard sensors and an internet connection.
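
For readers unfamiliar with the term, a frontier in classical frontier-based exploration is a free map cell that borders unexplored space. The sketch below detects such cells on a small occupancy grid; the cell encoding and the function name are assumptions, and the VLM selection step described in the abstract is not modeled.

```python
FREE, OCCUPIED, UNKNOWN = 0, 1, -1  # hypothetical cell encoding


def find_frontiers(grid):
    """Return (row, col) of free cells with at least one 4-connected unknown neighbor."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break  # one unknown neighbor is enough
    return frontiers


grid = [
    [FREE, FREE, UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE, FREE],
]
candidates = find_frontiers(grid)  # cells a planner or a VLM would choose among
```

A geometric heuristic would typically pick among `candidates` by distance or expected information gain; the pipeline in the abstract replaces that choice with a VLM's judgment over the map and images.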

2605.23160 2026-05-25 cs.RO cs.CV

Semantic-Aware Guided Drone Exploration for Language-Conditioned 3D Indoor Mapping

Nitin Vegesna, Avideh Zakhor

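Affiliations Department of Electrical Engineering and Computer Sciences

AI summary The paper presents SAGE, a semantic-aware guided drone exploration system for open-vocabulary exploration of unknown 3D indoor environments, which keeps coverage-oriented behavior while letting semantic cues reprioritize frontiers. Built on the FALCON volumetric explorer, SAGE integrates CLIP through four key components to plan jointly over semantic and geometric information. In simulation SAGE outperforms FALCON in object discovery and completes exploration faster than FTU with higher volumetric throughput; in real-world flights FALCON explores faster, while SAGE is better at object discovery.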

Comments 10 pages, 6 figures, 4 tables. To be presented at the 2nd 3D-LLM/VLA Workshop at CVPR 2026 (non-archival workshop)

Abstract

We present Semantic-Aware Guided Exploration, SAGE, a system for open-vocabulary exploration in unknown 3D indoor environments that preserves coverage-oriented behavior while allowing semantic cues to reprioritize frontier selection. Building on the FALCON volumetric explorer, SAGE integrates Contrastive Language-Image Pre-training (CLIP) via four key components: object-centric embedding storage, a temporal cache that projects recent observations onto the free-unknown boundary, object frontiers for high-similarity detections, and a unified semantic-geometric planning cost. This cost function bounds semantic reweighting influence, ensuring frontiers are prioritized without sacrificing total coverage. In Matterport3D-based simulations, SAGE outperforms FALCON and a semantic-only ablation in object discovery across map-query pairs. Compared to Finding Things in the Unknown (FTU), SAGE completes exploration 9.0 to 25.9 times faster across the nine shared map-query pairs, achieving a mean speedup of 13.7. Furthermore, SAGE achieves substantially higher volumetric throughput than FTU. Finally, we deploy SAGE in five real-world flights in two environments on a Modal AI Starling 2 quadrotor with onboard sensing and planning, and offboard CLIP inference. Comparing SAGE and FALCON, we find that while FALCON results in faster exploration and shorter mapping trajectories, SAGE outperforms FALCON in terms of object discovery.
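
One way a planning cost can bound the influence of semantic reweighting is to cap the discount that similarity may apply to a geometric cost. The formula below is purely illustrative: the abstract does not give SAGE's cost function, and `frontier_cost` and `max_discount` are hypothetical names.

```python
def frontier_cost(geometric_cost, similarity, max_discount=0.5):
    """Discount a geometric cost by semantic similarity, by at most `max_discount`.

    Assumption: a multiplicative discount with clamped similarity; this is
    not the cost function from the paper.
    """
    if not 0.0 <= max_discount < 1.0:
        raise ValueError("max_discount must be in [0, 1)")
    s = min(max(similarity, 0.0), 1.0)  # clamp similarity to [0, 1]
    return geometric_cost * (1.0 - max_discount * s)


near_but_irrelevant = frontier_cost(4.0, 0.0)    # 4.0, no discount
far_but_relevant = frontier_cost(10.0, 1.0)      # 5.0, maximum discount
```

Because the discount factor never falls below `1 - max_discount`, no frontier's cost can shrink to zero, so every frontier remains reachable in the ordering and coverage is not sacrificed to semantics.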

2605.23157 2026-05-25 cs.CL

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

Casey Ford, Madison Van Doren, Sicheng Jin, Emily Dix

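Affiliations Appen

AI summary The study examines how the jailbreak attack surface of multimodal large language models (MLLMs) differs across languages and modalities, and shows that language affects model safety non-uniformly. Comparing four frontier models under English and Spanish prompting, it finds that linguistic framing attacks become less effective in Spanish while visually explicit multimodal attacks become more effective, which indicates that linguistic and visual alignment failures operate through distinct mechanisms. Safety evaluation frameworks that treat language and modality as independent dimensions therefore misrepresent real attack risk and need to be redesigned.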

Abstract

The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study comparing jailbreak vulnerability in US English (en-US) and Mexican Spanish (es-MX) across four frontier MLLMs: Claude Sonnet 4.5, GPT-5, Pixtral Large, and Qwen Omni. Using a fixed adversarial benchmark of 363 diverse prompt scenarios administered in text-only and multimodal conditions, we collected 52,272 harm ratings and binary attack success judgements from matched panels of nine native-speaker annotators per language group. Our central finding is that language does not scale vulnerability uniformly. Bayesian mixed-effects analyses reveal that linguistic framing attacks such as role-play become substantially less effective under Spanish prompting, while visually explicit multimodal attacks become more effective, which directly implicates the prompt-language interface rather than global annotator leniency. This dissociation indicates that linguistic and visual alignment failures operate through distinct mechanisms, and that switching language is sufficient to expose that separation. The practical consequence is that safety rankings are not preserved across languages. Qwen Omni overtakes Pixtral Large as the most vulnerable model among es-MX participants, a rank reversal no scalar correction of English-condition scores could recover, and absolute attack success rates have declined across model generations without closing the gaps between them. These findings demonstrate that safety evaluation frameworks treating language and modality as independent dimensions fundamentally misspecify the attack surface of globally deployed MLLMs, and must be redesigned accordingly.
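
The reported rating count can be checked against the study design. The sketch assumes a fully crossed design in which every annotator rated every prompt in both conditions for every model; the abstract does not state this explicitly, but the product matches the reported total.

```python
def expected_ratings(prompts=363, conditions=2, models=4, annotators=9, languages=2):
    """Ratings under a fully crossed design (an assumption, not stated in the abstract)."""
    return prompts * conditions * models * annotators * languages


total = expected_ratings()
```

Here `conditions` covers text-only and multimodal, `languages` covers en-US and es-MX, and the product equals the 52,272 ratings reported.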

2605.23156 2026-05-25 cs.LG math.FA math.RT stat.ML

Any-Dimensional Invariant Universality

Shengtai Yao, Eitan Levin, Mateo Díaz

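Affiliations Department of Applied Mathematics and Statistics, Johns Hopkins University; Department of Computing and Mathematical Sciences, Caltech

AI summary The paper studies universality for machine learning models that accept inputs of any size, such as graphs with different numbers of nodes or point clouds with different numbers of points. Universality is traditionally analyzed for fixed-size inputs; the paper instead gives a systematic approach that identifies an any-dimensional function with a function on a suitable infinite-dimensional limit space. Using the symmetries of the inputs and the relations between inputs of different sizes, it defines a natural topology on that space and shows how any-dimensional universality can be established there. It also shows that several existing architectures fail to be universal and proposes simple modifications that restore universality.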

Abstract

Several machine learning models are defined for inputs of any size, such as graphs with different numbers of nodes and point clouds containing varying numbers of points. The universality properties of such any-dimensional models remain poorly understood, as universality is traditionally studied for models accepting inputs of a fixed size, defined on a compact subset of their domain. In sharp contrast, any-dimensional models can be viewed as sequences of functions defined on growing-sized inputs, and it is not clear in which sense they can be universal. We develop a systematic approach to establish any-dimensional universality, by identifying any-dimensional functions with a unique function taking inputs in a suitable infinite-dimensional limit space containing inputs of all finite sizes as well as their limits. Using the symmetries of these inputs and relations between inputs of different sizes, we show that this limit space admits a natural topology with rich families of compact sets on which any-dimensional universality can be established. We illustrate our approach by showing that several existing architectures fail to be universal, and we propose simple modifications that restore universality.
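
To make "any-dimensional" concrete, here is a toy model defined on point clouds of every size: it mean-pools the points and applies a fixed function to the pooled vector. It is only an illustration of the setting. Treating a cloud with every point duplicated as equivalent to the original is one assumed example of a relation between inputs of different sizes, not the paper's construction.

```python
def mean_pool_model(points):
    """A permutation-invariant function defined for point clouds of any size."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    return sum(m * m for m in mean)  # any fixed function of the pooled vector


cloud = [[1.0, 2.0], [3.0, 4.0]]
shuffled = [[3.0, 4.0], [1.0, 2.0]]
doubled = cloud + cloud  # same empirical distribution, twice the size

same_under_permutation = mean_pool_model(cloud) == mean_pool_model(shuffled)
same_across_sizes = mean_pool_model(cloud) == mean_pool_model(doubled)
```

A single rule like this defines a whole sequence of functions, one per input size, which is why the fixed-size notion of universality on a compact set does not apply directly.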

2605.23147 2026-05-25 cs.CL cs.AI

As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

Eric Xu

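Affiliations Independent Researcher

AI summary The study examines how role prompts of the form "As X, do Y" combine persona and task information in instruction-tuned large language models, and finds that at one specific site in the residual stream the combination admits a clean linear decomposition. Persona and task contribute through partially orthogonal additive directions, and this local additive structure supports interpretable steering of the persona and task contributions. Despite the local additivity, a role prompt cannot be compressed into a single residual vector, because persona-conditioned behavior depends on a mechanism distributed across the whole prompt.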

Comments 12 pages, 1 figure. Code: https://github.com/xuy/localized-additive-composition

Abstract

Role prompts of the form "As X, do Y" admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition -- the last prompt token together with the first two generated tokens -- in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect $\Delta_X$, a pure task effect $\Delta_Y$, and substituting $h_{BB} + \Delta_X + \Delta_Y$ for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and a 48-cell long-persona grid, with persona-specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. We show it cannot. Injecting the cached additive prediction -- or even the oracle clean residual $h_{XY}$ -- into a baseline host prompt with the persona text removed does not approach the clean long-persona target, at one site or at many layers. Persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt-to-answer transition supports interpretability and fine-grained steering of persona or task contributions; persona-conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.
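
The additive substitution can be written out on toy vectors. The abstract names only the baseline residual, the two effects, and the clean residual; defining the effects as differences from the baseline, and the intermediate residuals `h_xb` (persona only) and `h_by` (task only), are assumptions made for this sketch.

```python
def sub(u, v):
    """Elementwise difference of two equal-length vectors."""
    return [a - b for a, b in zip(u, v)]


def add(u, v):
    """Elementwise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def additive_prediction(h_bb, h_xb, h_by):
    """Predict the persona-plus-task residual as h_BB + Delta_X + Delta_Y.

    Assumption: Delta_X = h_XB - h_BB and Delta_Y = h_BY - h_BB.
    """
    delta_x = sub(h_xb, h_bb)  # pure persona effect
    delta_y = sub(h_by, h_bb)  # pure task effect
    return add(add(h_bb, delta_x), delta_y)


h_bb = [1, 1]   # baseline persona, baseline task
h_xb = [2, 1]   # persona X, baseline task
h_by = [1, 3]   # baseline persona, task Y
predicted_h_xy = additive_prediction(h_bb, h_xb, h_by)
```

In the paper's setting the comparison of interest is between this prediction and the true clean residual at the transition site; the abstract's negative result is that even a perfect match there does not substitute for the persona text elsewhere in the prompt.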

2605.23146 2026-05-25 cs.LG cs.AI

Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness

Manish Aryal, Faiyaz Azam, Agnivo Banerjee, Sai Sidhanth Manoharan Jayanthi, Allegra Laro, Clément Legentilhomme, Andrew Lin, Florian Lorkowski, Radman Rakhshandehroo, Patric Rommel, Emanuel Ruzak, Nathan Theng, Paul Yushin Rapoport

Affiliations Purdue University; Carnegie Mellon University; WorldQuant University; UC Berkeley; Aix-Marseille University; MIT; University of Zurich; University of British Columbia; University of Stuttgart; University of Buenos Aires; California State University, Fresno; University of Chicago

AI summary The paper examines the limits of classical reinforcement learning under model misspecification and policy-dependent uncertainty, and proposes a reinforcement learning framework based on infra-Bayesianism. The method separates ordinary probabilistic uncertainty from Knightian uncertainty and chooses actions by maximizing worst-case expected value, which gives more robust behavior in non-realizable environments. Experiments show lower worst-case regret in an environment with Knightian uncertainty, and in Newcomb's problem the agent outperforms classical decision theory agents.

Abstract

Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent's policy. This assumption breaks down in non-realizable settings where other actors might anticipate the agent's behavior, including environments crucial to AI safety, where the agent interacts with predictors, humans, other AI agents, and institutions. In such settings, the agent's model class fails to capture the world in which it operates. Under such misspecification, classical Bayesian methods can produce confidently wrong posteriors, unreliable decisions, and unbounded regret, as realizability fails to obtain. Infra-Bayesianism is a decision-theoretic framework that addresses these failures by distinguishing ordinary probabilistic uncertainty, where priors can be reasonably chosen, from Knightian uncertainty, where no grounds exist for the construction of such a prior. It does so by evaluating actions on their worst-case outcomes, rather than from posterior expectations or weighted averaging. We present the first proof-of-concept implementation of an infra-Bayesian reinforcement learning architecture for finite-outcome stateless decision problems. Our agent maintains a set of imprecise hypotheses, updates them using infra-Bayesian conditioning, and selects actions by maximizing worst-case expected value. We apply this implementation of the infra-Bayesian maximin decision process to an environment with Knightian uncertainty, and demonstrate a lower worst-case regret as compared to classical reinforcement learning agents. We also investigate Newcomb's problem and show that the infra-Bayesian agent picks the optimal strategy, outperforming classical decision theory agents. Our results provide a step towards reinforcement learning agents that remain robust under model misspecification and policy-dependent uncertainty.
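
The action-selection rule in the abstract, maximizing worst-case expected value over a set of hypotheses, can be sketched for a stateless problem as follows. This shows only plain maximin over a finite hypothesis set; infra-Bayesian conditioning is not implemented, and the payoff table is invented for illustration.

```python
def maximin_action(payoffs, hypotheses):
    """Pick the action whose worst-case expected payoff is largest.

    payoffs: dict mapping action -> list of payoffs, one per outcome.
    hypotheses: list of probability distributions over outcomes.
    """
    def worst_case(action):
        return min(
            sum(p * u for p, u in zip(h, payoffs[action]))
            for h in hypotheses
        )
    return max(payoffs, key=worst_case)


payoffs = {"safe": [1.0, 1.0], "risky": [3.0, 0.0]}  # illustrative numbers
hypotheses = [[0.9, 0.1], [0.2, 0.8]]                # no prior over these two
choice = maximin_action(payoffs, hypotheses)
```

With these numbers, averaging the two hypotheses with equal weight would favor "risky" (expected payoff 1.65 against 1.0), while the worst-case rule picks "safe" because "risky" can fall to 0.6 under the second hypothesis.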

2605.23144 2026-05-25 cs.CV

SLIP-RS: Structured-Attribute Language-Image Pre-Training for Remote Sensing Object Detection

Chenxu Wang, Yuxuan Li, Yunheng Li, Xiang Li, Jingyuan Xia, Qibin Hou

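Affiliations VCIP, CS, Nankai University; National University of Defense Technology

AI summary Existing language-image pre-training for remote sensing object detection is limited by monolithic label learning, which depends on enumerating open-set categories through black-box data to obtain fine-grained representations and fits poorly with the data scarcity of the remote sensing domain. The paper proposes SLIP-RS, a structured-attribute decoupling paradigm that maps the open-ended category space into a finite, physically meaningful attribute space and improves fine-grained discrimination through explicit structural logic. It rests on two components: Structured-Attribute Contrastive Learning, which decouples visual logic, and the Conformal Attribute Reliability Engine, which extracts high-quality supervision from noisy data. The authors report substantial gains in fine-grained detection and cross-domain generalization.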

Abstract

Existing language-image pre-training for remote sensing object detection is constrained by Monolithic Label Learning, which relies on exhaustively enumerating open-set categories via black-box data to acquire fine-grained representations, creating a dependency incompatible with the domain's inherent data scarcity. To transcend this bottleneck, we propose SLIP-RS, establishing a Structured-Attribute Decoupling Paradigm that maps the open-ended category space into a finite, physically meaningful attribute space, unlocking fine-grained discriminability via explicit structural logic. This paradigm is realized via two technical pillars: (1) Structured-Attribute Contrastive Learning, which enforces the learning of decoupled intrinsic visual logic via combinatorial attribute augmentation; and (2) Conformal Attribute Reliability Engine, which leverages conformal prediction theory to rigorously distill high-fidelity supervision from noisy sources, yielding RS-Attribute-15M, the largest dataset with over 15 million attribute annotations. Extensive experiments demonstrate that SLIP-RS establishes unprecedented performance in fine-grained detection and cross-domain generalization, validating structured attributes as a vital foundation for remote sensing. Code: https://github.com/facias914/SLIP-RS.
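
The abstract says only that the reliability engine leverages conformal prediction theory. As background, the sketch below shows textbook split conformal calibration applied to attribute filtering; it is not the paper's engine, and the nonconformity scores and function names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of nonconformity scores at miscoverage level alpha."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")               # too few calibration points for this alpha
    return sorted(calibration_scores)[k - 1]


def filter_attributes(candidates, threshold):
    """Keep attribute names whose nonconformity score is within the threshold."""
    return [name for name, score in candidates.items() if score <= threshold]


calibration = [5, 1, 9, 3, 7, 2, 8, 4, 6]          # hypothetical scores
q = conformal_threshold(calibration, alpha=0.5)
kept = filter_attributes({"elongated": 3, "metallic": 7}, q)
```

The `(n + 1)` in the rank is what gives split conformal its finite-sample coverage guarantee under exchangeability, which is the kind of property that makes it attractive for distilling supervision from noisy sources.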

2605.23141 2026-05-25 cs.CV

VisAnalog: A Diagnostic Suite for Visual Concept Transfer on Natural Images

Zhaonan Li, Kyle R. Chickering, Bangzheng Li, Jacob Dineen, Xiao Ye, Zhikun Xu, Shijie Lu, Yuxi Huang, Ming Shen, Bach Nguyen, Jaya Adithya Pavuluri, Mau Son Nguyen, Sanika Chavan, Ngoc Minh Thu Le, Muhao Chen, Ben Zhou

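Affiliations Arizona State University; Luma AI; UC Davis

AI summary VisAnalog is a diagnostic dataset for evaluating visual concept transfer: it tests whether a model can preserve and manipulate concept-level properties across scenes. Each sample takes the form A:B::C:?, and the model must infer the target image from the given images and the transformation relating them. Even strong vision-language models score well below their oracle accuracy, and performance falls sharply as the number of transformation steps increases, while human performance stays near the ceiling. The dataset helps separate failures of visual relation inference from failures of transformation application.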

Comments Accepted to the Workshop on Visual Concepts at CVPR 2026 as a non-archival report

Abstract

A useful test of visual concept learning is not just whether a model can recognize a concept in a single image, but whether it can preserve and manipulate concept-level properties under transformation and transfer them to new scenes. We introduce VisAnalog, a controlled suite for this setting on natural images. Each example instantiates $A\!:\!B::C\!:\,?$: images $B$ and a hidden target image $D$ are produced by applying the same deterministic transformation sequence to source images $A$ and $C$. Given $A$, $B$, and $C$, a model must answer a multiple-choice question about $D$. The benchmark contains 617 human-validated questions spanning one- to four-step transformations such as zoom, quadrant swap, rotation, flip, and hue rotation. Across strong proprietary and open-source VLMs, end-to-end accuracy is substantially lower than oracle accuracy when $D$ is directly shown, and degrades sharply as transformation depth increases, while human performance remains near the ceiling. A program-conditioned evaluation further separates failures of relation inference from failures of transformation application, showing that inferring the visual relation from $A \rightarrow B$ is the dominant bottleneck, with additional application errors emerging on harder multi-step cases. The dataset is publicly available at https://huggingface.co/datasets/zli99/VisAnalog.
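
The analogy construction can be illustrated on tiny arrays standing in for images: the same deterministic program maps A to B and C to the hidden D. Only flip and rotation are shown, the two-step program is an arbitrary example, and the function names are hypothetical.

```python
def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img):
    """Rotate a 2D list a quarter turn clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def apply_program(img, program):
    """Apply a sequence of transformations in order."""
    for step in program:
        img = step(img)
    return img


program = [flip_horizontal, rotate_90_clockwise]  # a two-step example
A = [[1, 2], [3, 4]]
C = [[5, 6], [7, 8]]
B = apply_program(A, program)  # shown to the model alongside A and C
D = apply_program(C, program)  # hidden target the question is about
```

A model has to recover `program` from the pair (A, B) and then apply it to C, which corresponds to the two failure modes the abstract separates: relation inference and transformation application.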

2605.23139 2026-05-25 cs.LG cs.AI

CALAD: Channel-Aware contrastive Learning for multivariate time series Anomaly Detection

Jaehyeop Hong, Youngbum Hur

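Affiliations Department of Industrial Engineering, Inha University, Incheon, Republic of Korea

AI summary Multivariate time series anomaly detection is increasingly important in practice, but labeled data are usually scarce. Existing methods mostly model normal patterns with unsupervised learning, yet they tend to treat all channels equally and ignore how much each channel contributes to detecting anomalies. The paper proposes CALAD, a channel-aware contrastive learning framework that uses estimated channel relevance to guide the construction of contrastive samples, so the model learns anomaly semantics, and combines contrastive learning with an auxiliary reconstruction head to improve detection under distribution shift.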

Comments Accepted to ICPR 2026

Abstract

Multivariate time series anomaly detection has become increasingly important in real-world applications, where labeled data are often scarce. Many existing approaches rely on unsupervised learning to model normal patterns, but they often treat all channels equally. This design can dilute anomaly-relevant signals, since not all channels contribute equally to anomaly detection. In this paper, we propose CALAD, a channel-aware contrastive learning framework for multivariate time series anomaly detection. CALAD governs the construction of contrastive samples using estimated channel relevance, allowing the learning process to reflect anomaly semantics rather than generic similarity. Channel relevance is estimated from reconstruction errors of a transformer-based autoencoder and is used to distinguish channels that are more influential to anomalous behaviors. Using this information, we design a channel-wise augmentation strategy in which positive and negative samples are constructed based on whether anomaly-relevant channels are preserved or perturbed. This encourages invariance to changes in irrelevant channels while being sensitive to changes in anomaly-relevant channels. Furthermore, CALAD combines contrastive learning and an auxiliary reconstruction head, allowing the model to learn discriminative representations while retaining normal structures. Experiments on multiple real-world datasets show that CALAD consistently outperforms existing methods, particularly under distribution shift scenarios. We provide the code for reproducibility at https://github.com/hirundo1218/CALAD.
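
A rough sketch of the channel-wise augmentation idea: channels with high reconstruction error are treated as anomaly-relevant, positives perturb only the other channels, and negatives perturb the relevant ones. The normalization, the threshold rule, and the perturbation are all assumptions for illustration; the abstract does not give CALAD's exact procedure.

```python
def channel_relevance(errors):
    """Normalize per-channel reconstruction errors into weights that sum to 1."""
    total = sum(errors)
    if total == 0:
        return [1.0 / len(errors)] * len(errors)  # no signal: equal weights
    return [e / total for e in errors]


def make_views(window, relevance, threshold, perturb):
    """Build (positive, negative) views of a window indexed as window[t][channel].

    Positive: perturb only channels below the relevance threshold.
    Negative: perturb only channels at or above it.
    """
    relevant = [r >= threshold for r in relevance]
    positive = [[x if relevant[c] else perturb(x) for c, x in enumerate(row)]
                for row in window]
    negative = [[perturb(x) if relevant[c] else x for c, x in enumerate(row)]
                for row in window]
    return positive, negative


weights = channel_relevance([1.0, 1.0, 6.0])      # third channel dominates
pos, neg = make_views([[1, 2, 3]], weights, 0.5, lambda x: x + 10)
```

Pulling `pos` toward the original window and pushing `neg` away is what would make a representation invariant to changes in irrelevant channels yet sensitive to changes in relevant ones.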

2605.23134 2026-05-25 cs.LG

Archimedean Copula Inference via Taylor-Mode AD

Cambridge Yang, Dongdong Li

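Affiliations Cambridge Yang; Harvard Medical School

AI summary The study presents acopula, a JAX framework that efficiently computes exact likelihoods and parameter gradients for arbitrarily nested Archimedean copula models in high dimensions under arbitrary per-variable right-censoring. Its core technique is polynomial powering of Taylor-mode automatic differentiation output, which replaces hand-derived Bell polynomial tables and supports arbitrary generators and complex nesting structures. Experiments on high-dimensional data, including large financial and medical datasets, show the framework's flexibility and a substantial speedup over existing tools.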

详情
AI中文摘要

现有的嵌套阿基米德Copula工具无法同时处理以下三个方面:(a) 生存分析中任意变量的(右)删失,(b) 任意嵌套树,以及(c) 精确参数梯度。现有实现仅处理双变量问题、低维(即$d \leq 10$)情况、两层嵌套或仅手工推导的Copula嵌套。我们提出acopula,一个JAX原生框架,给定任意阿基米德生成元(经典或神经),在多项式时间内,在任意删失掩码下评估精确的嵌套Copula似然和参数梯度。其机制是泰勒模式自动微分输出的多项式幂运算,用单个可微计算替代每个族手工推导的偏贝尔多项式表,任何用户定义的生成元都可以驱动该计算。我们进行了大量模拟以验证acopula的正确性。然后我们展示了:(a) 在$d=53$的高维MIMIC-IV ICU入院数据(85,229条记录)上的逐变量删失,由经典阿基米德族和嵌套神经阿基米德Copula拟合;(b) 在$d=98$的标普500日收益率上的11部门层次模型;(c) 在一项视网膜病变研究中,跨十个族(其中五个族之前没有实现)的族无关删失MLE;以及(d) 在$d=35$时,相对于R的nacLL每密度加速约650倍,且二次扩展到$d=8000$。

英文摘要

No existing nested Archimedean copula tool handles all three of (a) arbitrary per-variable (right-)censoring in survival analysis, (b) arbitrary nesting trees, and (c) exact parameter gradients. Existing implementations handle only bivariate problems, low-dimensional (i.e., $d \leq 10$) cases, two layers of nesting, or only hand-derived copula nestings. We present acopula, a JAX-native framework that, given any Archimedean generator (classical or neural), evaluates exact nested-copula likelihoods and parameter gradients under arbitrary censoring masks in polynomial time. The mechanism is polynomial powering of Taylor-mode automatic differentiation output, which replaces per-family hand-derived partial Bell polynomial tables with a single differentiable computation that any user-defined generator can drive. We conduct extensive simulations to verify the correctness of acopula. We then demonstrate (a) per-variable censoring on 85,229 MIMIC-IV ICU admissions in high dimensions with $d = 53$, fit by both classical Archimedean families and nested neural Archimedean copulas; (b) an 11-sector hierarchical model on S&P 500 daily returns at $d = 98$; (c) family-agnostic censored MLE across ten families, five of them with no prior implementation, on a retinopathy study; and (d) a roughly $650\times$ per-density speedup over R's nacLL at $d = 35$, scaling quadratically to $d = 8000$.
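
For readers unfamiliar with the object being differentiated, the sketch below evaluates a (non-nested) Archimedean copula from its generator, using the Clayton family as an example. It is plain Python rather than the paper's JAX code, and the function names are made up for this illustration. The bivariate density uses a hand-derived second derivative of the generator; this is exactly the kind of per-family derivation that the abstract says Taylor-mode automatic differentiation replaces, since the density of a d-dimensional Archimedean copula involves the d-th derivative of the generator.

```python
def clayton_psi(t, theta):
    """Clayton generator psi(t) = (1 + t)^(-1/theta), for theta > 0."""
    return (1.0 + t) ** (-1.0 / theta)


def clayton_psi_inv(u, theta):
    """Inverse generator psi^{-1}(u) = u^(-theta) - 1."""
    return u ** (-theta) - 1.0


def clayton_psi_dd(t, theta):
    """Second derivative of psi, derived by hand for this one family."""
    return (1.0 / theta) * (1.0 / theta + 1.0) * (1.0 + t) ** (-1.0 / theta - 2.0)


def clayton_psi_inv_d(u, theta):
    """First derivative of the inverse generator."""
    return -theta * u ** (-theta - 1.0)


def archimedean_cdf(us, psi, psi_inv):
    """C(u_1, ..., u_d) = psi(sum_i psi^{-1}(u_i)) for any generator pair."""
    return psi(sum(psi_inv(u) for u in us))


def clayton_bivariate_density(u, v, theta):
    """c(u, v) = psi''(t) * (psi^{-1})'(u) * (psi^{-1})'(v), with t = psi^{-1}(u) + psi^{-1}(v)."""
    t = clayton_psi_inv(u, theta) + clayton_psi_inv(v, theta)
    return clayton_psi_dd(t, theta) * clayton_psi_inv_d(u, theta) * clayton_psi_inv_d(v, theta)


theta = 2.0
cdf = archimedean_cdf(
    [0.3, 0.7],
    lambda t: clayton_psi(t, theta),
    lambda u: clayton_psi_inv(u, theta),
)
density = clayton_bivariate_density(0.3, 0.7, theta)
```

Both quantities agree with the textbook closed forms for the Clayton copula, which is a convenient sanity check when swapping in a different generator.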

2605.23131 2026-05-25 cs.LG

When Determinants Are Not Enough: Private Rare Switching

当行列式不够时:私有稀有切换

Xingyu Zhou

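Affiliations: Wayne State University

AI summary: This note examines the limitations of the standard determinant-based update rule for linear contextual bandits and reinforcement learning in the privacy-preserving setting. When Gaussian noise is added to satisfy privacy requirements, the monotonic growth of the design matrix can break down, so the original analysis no longer applies. To address this, the note presents a rare-switching rule based on the generalized Rayleigh quotient that restores a logarithmic number of policy updates and constant-factor control of the confidence width, yielding an effective rare-switching strategy in the private setting.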

详情
AI中文摘要

在这篇笔记中,我想分享一个小研究时刻,Codex帮助我找到了将稀有切换适应私有设置的正确方法。线性bandit和强化学习中基于行列式的标准更新规则效果很好,因为设计矩阵单调增长。但一旦加入高斯噪声以实现隐私,这种单调性可能失效,通常的分析不再成立。关键原因是行列式增长控制体积,而遗憾分析需要控制最坏方向。为了解决这个问题,Codex提出了一种基于广义瑞利商的不同稀有切换规则,该规则恢复了对数策略更新以及所需的置信宽度比较(至多常数因子)。我在此展示了我手动清理后的证明版本,以及对此例的一些个人反思。

英文摘要

In this note, I would like to share a small research moment where Codex helped me find the right way to adapt rare switching to the private setting. The standard determinant-based update rule in linear bandits and RL works beautifully because the design matrix grows monotonically. But once Gaussian noise is added for privacy, this monotonicity can fail, and the usual analysis no longer goes through. The key reason is that determinant growth controls volume, while regret analysis needs control of the worst direction. To address this, Codex came up with a different rare-switching rule based on the generalized Rayleigh quotient, which restores logarithmic policy updates and the desired confidence-width comparison up to a constant factor. I present my manually cleaned-up version of the proof here, as well as some personal reflections on this example.
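
The gap between "volume" and "worst direction" can be seen on a 2x2 example. The sketch below compares a determinant-ratio trigger with a trigger based on the largest generalized Rayleigh quotient, max over x of (x^T V_now x) / (x^T V_last x). The abstract does not spell out the note's exact rule, so the threshold form, the function names, and the toy matrices are assumptions for illustration only; the 2x2 closed form is used just to keep the example dependency-free.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def max_generalized_rayleigh(a, b):
    """Largest lambda with det(A - lambda * B) = 0, for symmetric 2x2 A and positive definite B."""
    qa = det2(b)
    qb = -(a[0][0] * b[1][1] + a[1][1] * b[0][0] - 2.0 * a[0][1] * b[0][1])
    qc = det2(a)
    disc = math.sqrt(max(qb * qb - 4.0 * qa * qc, 0.0))
    return (-qb + disc) / (2.0 * qa)


def determinant_rule(v_now, v_last, c):
    """Classic trigger: switch policy when the determinant has grown by a factor c."""
    return det2(v_now) > c * det2(v_last)


def rayleigh_rule(v_now, v_last, c):
    """Assumed worst-direction trigger: switch when some direction has grown by a factor c."""
    return max_generalized_rayleigh(v_now, v_last) > c


# Without monotonicity (e.g. after noise), one direction can grow while another shrinks.
v_last = [[1.0, 0.0], [0.0, 1.0]]
v_now = [[4.0, 0.0], [0.0, 0.5]]

det_fires = determinant_rule(v_now, v_last, c=2.5)
rayleigh_fires = rayleigh_rule(v_now, v_last, c=2.5)
worst_direction_growth = max_generalized_rayleigh(v_now, v_last)
```

Here the determinant ratio is only 2, so the determinant rule stays silent, yet the matrix has grown fourfold along the first axis. When the matrices grow monotonically this mismatch cannot be large, which is why the determinant rule suffices in the non-private setting.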

2605.23128 2026-05-25 cs.RO

$π_0$-EqM: Equilibrium Matching for Closed-Loop Vision-Language-Action Control

$\pi_0$-EqM:闭环视觉-语言-动作控制的均衡匹配

Huanming Liu, Congsheng Xu, Jianmin Ji, Yao Mu

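Affiliations: University of Science and Technology of China; Shanghai Jiao Tong University

AI summary: This paper proposes $\pi_0$-EqM, a closed-loop vision-language-action control method that improves performance on robotic manipulation tasks by replacing the conventional flow-matching decoder with an Equilibrium Matching (EqM) decoder. Under a fixed compute budget, the method significantly raises success rates across multiple tasks and reveals a task-dependent "stationarity-executability" gap, offering a new perspective on policy design for iterative VLA control.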

Comments Preprint. 5 pages, 3 figures

详情
AI中文摘要

目前,视觉-语言-动作(VLA)模型因其在任务泛化方面的巨大潜力而成为机器人操作最常用的范式。然而,大多数用于VLA控制的生成式流匹配动作解码器通常以固定的采样视界部署,限制了状态相关的计算和控制周期之间的时间复用。我们提出$\pi_0$-EqM,用均衡匹配(EqM)解码器替换$\pi_0$中的流匹配专家,同时保持上游VLA堆栈不变。在匹配的300步预算下,$\pi_0$-EqM在19个任务上将RoboTwin的平均成功率从40.4%提升到50.2%,并在LIBERO上保持竞争力,在LIBERO-10上获得最显著的提升(87.0%)。两次阈值扫描揭示了残差与成功率之间存在任务依赖的非单调关系,我们称之为平稳性-可执行性差距。结果表明,迭代VLA控制中的推理深度是策略设计的一部分,并引入了一种基于能量的VLA视角,这可能为未来跨任务和跨本体的可组合动作生成工作提供参考。

英文摘要

Vision-Language-Action (VLA) models have become the most widely adopted paradigm for robotic manipulation because of their strong potential for task generalization. However, most generative flow-matching action decoders for VLA control are deployed with fixed sampling horizons, which limits state-dependent compute and temporal reuse across control cycles. We present $\pi_0$-EqM, which replaces the flow-matching expert in $\pi_0$ with an Equilibrium Matching (EqM) decoder while leaving the upstream VLA stack unchanged. Under a matched 300-step budget, $\pi_0$-EqM improves RoboTwin average success from 40.4% to 50.2% across 19 tasks and remains competitive on LIBERO, with its clearest gain on LIBERO-10 (87.0%). Two threshold scans reveal a task-dependent non-monotonic relation between residual and success, which we term the stationarity-executability gap. The results suggest that inference depth in iterative VLA control is part of policy design and introduce an energy-based VLA perspective that may inform future work on composable action generation across tasks and embodiments.
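
To make "state-dependent compute" and "temporal reuse across control cycles" concrete, here is a minimal Python sketch of an iterative decoder that stops once a residual falls below a threshold and can be warm-started from the previous cycle's action. The abstract does not describe the EqM update itself, so this assumes a plain gradient-descent-style iteration on a learned field; `eqm_decode`, the step size, the threshold, and the toy quadratic field standing in for a trained network are all hypothetical.

```python
import math


def eqm_decode(grad_fn, action, step_size=0.1, threshold=1e-3, max_steps=300):
    """Iterate action <- action - step_size * grad_fn(action) until the residual is small.

    Returns the final action, the number of update steps used, and the final residual norm.
    """
    for step in range(max_steps + 1):
        g = grad_fn(action)
        residual = math.sqrt(sum(gi * gi for gi in g))
        if residual < threshold or step == max_steps:
            return action, step, residual
        action = [ai - step_size * gi for ai, gi in zip(action, g)]


def make_toy_field(target):
    """Gradient of the toy energy 0.5 * ||a - target||^2 (stand-in for a learned field)."""
    return lambda a: [ai - ti for ai, ti in zip(a, target)]


# Control cycle 1: cold start from zeros.
cold_action, cold_steps, cold_residual = eqm_decode(make_toy_field([1.0, -1.0]), [0.0, 0.0])

# Control cycle 2: the target moved slightly; reuse the previous action as the starting point.
warm_action, warm_steps, warm_residual = eqm_decode(make_toy_field([1.05, -1.0]), cold_action)
```

In this toy setting the warm-started cycle needs fewer steps than the cold start, which is the kind of saving a fixed sampling horizon cannot exploit. The abstract's threshold scans suggest that a smaller residual does not always mean higher task success, so the threshold is a design choice rather than something to minimize blindly.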

2605.23118 2026-05-25 cs.CV cs.AI cs.LG

Exploiting Longitudinal Context in Clinician-Verified Interactive Lesion Tracking

在临床医生验证的交互式病灶追踪中利用纵向上下文

Yannick Kirchhoff, Maximilian Rokuss, Daniel Philipp Mertens, David Füller, Benjamin Hamm, Andreas Schreyer, Oliver Ritter, Klaus Maier-Hein

Affiliations: German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany; Faculty of Mathematics and Computer Science, Heidelberg University, Germany; HIDSS4Health (Helmholtz Information and Data Science School for Health), Karlsruhe/Heidelberg, Germany; Medical Faculty, Heidelberg University, Germany; University Hospital Brandenburg an der Havel, Brandenburg Medical School Theodor Fontane, Germany; Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Germany

AI summary: This paper studies how to make effective use of longitudinal imaging information in clinician-verified interactive lesion tracking, in order to track tumors more accurately across serial CT scans. The authors propose a "Verified Tracking" paradigm in which a clinician verifies a registration-proposed prompt, which is then combined with the lesion's baseline appearance to resolve segmentation ambiguities. The method combines early spatial prompt fusion with latent temporal difference weighting in a unified framework for longitudinally informed segmentation, and uses large-scale synthetic pretraining to overcome data scarcity, significantly improving performance. Experiments show that it outperforms existing methods in both the fully automatic and the verified tracking settings, and it took first place in the MICCAI autoPET IV challenge.

Comments Accepted at MICCAI 2026

详情
AI中文摘要

在系列CT扫描中追踪肿瘤病灶对于肿瘤学反应评估至关重要。现有的自动化方法面临一个基本权衡:端到端追踪器实现高度自动化,但无法纠正无声的追踪失败;而解耦的配准-分割流程允许用户验证,却丢弃了病灶的先验外观,限制了在模糊情况下的准确性。在这项工作中,我们提出了一种验证追踪范式:临床医生验证配准提出的提示,模型利用该提示以及基线病灶外观来解决分割模糊性。我们提出了一个统一框架,结合早期空间提示融合与潜在时间差异加权,用于纵向信息感知的分割。为了解决数据稀缺问题,我们利用大规模合成预训练,证明这对于利用纵向上下文至关重要,相比从头训练性能提升高达4.5个Dice点。我们的方法在MICCAI autoPET IV挑战中获得第一名。我们进一步整理并发布了PanTrack,一个新的纵向胰腺癌基准,以评估分布外泛化能力。实验表明,我们的模型在全自动和所提出的验证追踪设置中均优于先前工作,在自动化与控制之间提供了一个临床安全的中间地带。代码、模型和数据集将在https://github.com/MIC-DKFZ/LongiSeg发布。

英文摘要

Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundamental trade-off: end-to-end trackers achieve high automation but offer no opportunity to correct silent tracking failures, while decoupled registration-segmentation pipelines permit user verification yet discard the lesion's prior appearance, limiting accuracy in ambiguous cases. In this work, we propose a Verified Tracking paradigm: a clinician verifies a registration-proposed prompt, which the model leverages alongside the baseline lesion appearance to resolve segmentation ambiguities. We present a unified framework combining early spatial prompt fusion with latent temporal difference weighting for longitudinally informed segmentation. To address data scarcity, we leverage large-scale synthetic pretraining, which proves essential for exploiting longitudinal context and improves performance by up to 4.5 Dice points over training from scratch. Our approach secured first place in the MICCAI autoPET IV challenge. We further curate and release PanTrack, a new longitudinal pancreatic cancer benchmark, to assess out-of-distribution generalization. Experiments show that our model outperforms prior work in both the fully automatic and the proposed verified tracking settings, offering a clinically safe middle ground between automation and control. Code, model and dataset will be released at https://github.com/MIC-DKFZ/LongiSeg

2605.23116 2026-05-25 cs.CV cs.AI

CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection

CoReVAD: 一种无需训练的视频异常检测上下文推理框架

Hyeongmuk Lim, Youngbum Hur

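Affiliations: Department of Industrial Engineering, Inha University, Incheon, Republic of Korea

AI summary: Existing video anomaly detection methods typically rely on task-specific training, which leads to strong domain dependency and high training costs, and most output only a scalar anomaly score without explaining why an event is anomalous. This paper proposes CoReVAD, a training-free contextual reasoning framework that uses a frozen vision-language model to generate anomaly scores and temporal descriptions directly, and improves detection accuracy and interpretability through a local response cleaning module and a global temporal refinement strategy. Experiments show that CoReVAD performs well on multiple datasets while providing reliable, easy-to-understand explanations of anomalies.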

Comments Accepted to ICPR 2026

详情
AI中文摘要

现有的视频异常检测方法通常依赖于任务特定的训练,导致强领域依赖性和高训练成本。此外,大多数现有方法仅输出标量异常分数,对特定事件为何被视为异常提供的洞察有限。视觉语言模型的最新进展使得异常检测和人类可解释推理成为可能。然而,许多基于视觉语言模型的方法仍然需要额外的训练步骤(例如,指令调优或口头化学习)或外部大型语言模型,从而带来进一步的训练成本和推理开销。为了解决这些挑战,我们提出了CoReVAD,一种用于无需训练的视频异常检测的上下文推理框架,该框架使用单个冻结的视觉语言模型运行。CoReVAD直接从视觉语言模型生成异常分数和时间描述。为了减轻生成输出中的噪声,我们引入了一个基于局部视觉-文本对齐的局部响应清理模块。此外,通过基于softmax的精炼、高斯平滑和位置加权,融入了全局时间上下文和进展。在UCF-Crime和XD-Violence上的实验表明,CoReVAD在无需训练的方法中取得了竞争性能,同时提供了可靠且可解释的解释。我们的官方代码可在https://github.com/Muk-00/CoReVAD获取。

英文摘要

Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision-Language Models (VLMs) have enabled both anomaly detection and human-interpretable reasoning. However, many VLM-based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external Large Language Models (LLMs), incurring further training costs and inference overhead. To address these challenges, we propose CoReVAD, a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. CoReVAD directly generates anomaly scores and temporal descriptions from the VLM. To mitigate noise in generative outputs, we introduce a Local Response Cleaning (LRC) module based on local vision-text alignment. Furthermore, global temporal context and progression are incorporated through softmax-based refinement, Gaussian smoothing, and position weighting. Experiments on UCF-Crime and XD-Violence demonstrate that CoReVAD achieves competitive performance among training-free methods while providing reliable and interpretable explanations. Our official code is available at: https://github.com/Muk-00/CoReVAD

2605.23115 2026-05-25 cs.LG stat.ML

Robust OT-Guided Generative Residual Domain Adaptation for Bike-Sharing Demand Prediction under Temporal Domain Shift

鲁棒OT引导的生成式残差域适应用于时间域偏移下的共享单车需求预测

Yiming Ma

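Affiliations: Department of Statistics Finance, School of Management, University of Science

AI summary: This paper studies temporal domain adaptation for New York Citi Bike demand prediction from 2021 to 2026 and proposes Gen-ROTDA, an optimal-transport-guided residual domain adaptation framework. The method fits a target-domain station-time anchor, transfers residuals rather than raw demand, and uses a deterministic label-preserving residual feature generator, which improves robustness under temporal domain shift. Experiments show that Gen-ROTDA achieves the lowest mean absolute error on the main 2025-to-2026 task and outperforms other optimal transport methods across multi-year tasks, with notably greater stability on noisy data.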

详情
AI中文摘要

基于历史站点-小时数据训练的共享单车模型在后续年份部署时,由于出行模式随时间变化,性能可能会下降。本文将2021年至2026年3月Citi Bike需求预测作为时间域适应问题进行研究,并提出了Gen-ROTDA,一种鲁棒最优传输引导的残差域适应框架。该方法利用少量标记目标子集拟合目标域站点-时间锚点,传输残差而非原始需求,应用确定性标签保持残差特征生成器,并在训练最终残差预测器之前修剪高成本传输匹配。实验将Gen-ROTDA与仅锚点、仅源域、仅目标域、微调、MMD适应、Sinkhorn OTDA、ROTDA和Gen-OTDA进行比较。Gen-ROTDA在2025年至2026年主要任务上取得了最低MAE,并且在多年度任务中平均表现最佳,尽管微调和MMD适应仍然是强大的整体基线。在异常目标无标签记录下,Gen-ROTDA比非鲁棒OT变体稳定得多,表明鲁棒传输对于共享单车需求预测中的噪声时间迁移是有用的。

英文摘要

Bike-sharing models trained on historical station-hour data may degrade when deployed in later years because travel patterns change over time. This paper studies March Citi Bike demand prediction from 2021 to 2026 as a temporal domain adaptation problem and proposes Gen-ROTDA, a robust optimal transport-guided residual domain adaptation framework. The method fits a target-domain station-time anchor with a small labeled target subset, transfers residual rather than raw demand, applies a deterministic label-preserving residual feature generator, and trims high-cost transport matches before training the final residual predictor. Experiments compare Gen-ROTDA with anchor-only, source-only, target-only, fine-tuning, MMD adaptation, Sinkhorn OTDA, ROTDA, and Gen-OTDA. Gen-ROTDA achieves the lowest MAE on the main 2025 to 2026 task and is the best OT-family method on average across multi-year tasks, although fine-tuning and MMD adaptation remain strong overall baselines. Under abnormal target-unlabeled records, Gen-ROTDA is much more stable than non-robust OT variants, suggesting that robust transport is useful for noisy temporal transfer in bike-sharing demand prediction.
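
Two of the steps named above, fitting a station-time anchor and trimming high-cost transport matches, can be sketched in a few lines of Python. This is only a guess at the general shape: the abstract does not define the anchor or the trimming rule, so the per-(station, hour) mean, the `keep_fraction` parameter, the tuple layouts, and the toy numbers are all assumptions, and the optimal transport solver itself is omitted (the match costs are given by hand).

```python
from collections import defaultdict


def fit_anchor(labeled_target):
    """Assumed anchor: mean demand per (station, hour) from a small labeled target subset."""
    sums, counts = defaultdict(float), defaultdict(int)
    for station, hour, demand in labeled_target:
        sums[(station, hour)] += demand
        counts[(station, hour)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def to_residuals(records, anchor, default=0.0):
    """Residual = observed demand minus the anchor for that station and hour."""
    return [demand - anchor.get((station, hour), default) for station, hour, demand in records]


def trim_matches(matches, keep_fraction):
    """Keep the lowest-cost fraction of (source_index, target_index, cost) matches."""
    ordered = sorted(matches, key=lambda m: m[2])
    n_keep = max(1, int(len(ordered) * keep_fraction))
    return ordered[:n_keep]


labeled_target = [("A", 8, 10.0), ("A", 8, 14.0), ("B", 8, 3.0)]
anchor = fit_anchor(labeled_target)

source_records = [("A", 8, 9.0), ("B", 8, 7.0)]
source_residuals = to_residuals(source_records, anchor)

matches = [(0, 0, 0.2), (1, 1, 5.0), (0, 1, 0.4), (1, 0, 0.9)]
kept = trim_matches(matches, keep_fraction=0.5)
```

Working on residuals means the transferred quantity is the deviation from a target-specific baseline, so a source year with a different overall demand level does not shift the prediction by itself. Trimming then discards the matches that look least like plausible correspondences.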

2605.23113 2026-05-25 cs.CV

Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization

不一致感知多模态薛定谔桥用于深度伪造定位

Jiayu Xiong, Jing Wang, Qi Zhang, Wanlong Wang, Jun Xue

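Affiliations: Department of Computer Science and Technology, Huaqiao University; Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Huaqiao University; Tongji University; School of Cyber Science and Engineering, Wuhan University

AI summary: This paper proposes the inconsistency-aware multimodal Schrödinger Bridge (IaMSB) for interval-level localization in deepfake videos. By jointly estimating cross-modal consistency and localizing temporal intervals, the method suppresses cross-modal noise propagation under single-sided and asynchronous forgeries. IaMSB uses the Schrödinger Bridge framework to unify consistency estimation, cross-modal information selection, and bridge-step scheduling, which improves localization accuracy while reducing unnecessary iterations and markedly improves high-precision localization, especially for single-sided forgeries.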

Comments Accepted by CVPR2026

详情
AI中文摘要

音视频深度伪造定位需要区间级输出作为时间证据。尽管近期取得进展,但在单侧或异步伪造下的对称融合会传播跨模态噪声,降低高精度定位。我们提出IaMSB,一种不一致感知多模态薛定谔桥(SB),联合估计跨模态一致性并执行区间级定位。与扩散模型不同,SB最小化路径分布差异,无需显式噪声注入或去噪即可生成一致性分数。借助薛定谔桥(SB),IaMSB将一致性估计、跨模态信息选择和桥步调度统一在一个框架中。具体地,轻量级粗桥首先提出候选区间并估计跨模态一致性;这些统计量选择跨模态见证信号并跨模态非对称分配桥步。然后,精炼桥执行步调融合并输出精炼的时间对齐区间。IaMSB预判单侧和异步伪造,并通过带步分配的瓶颈跨模态交互抑制噪声转移,避免不必要的迭代。在多个基准上,IaMSB稳定了严格IoU边界精度,将AP@0.95提高了3%~10%,并实现了改进的高精度定位,特别是对于单侧伪造。

英文摘要

Audio-visual deepfake localization demands interval-level outputs that serve as temporal evidence. Despite recent progress, symmetric fusion under single-sided or asynchronous forgeries propagates cross-modal noise, degrading high-precision localization. We present IaMSB, an inconsistency-aware multimodal Schrödinger Bridge (SB) that jointly estimates cross-modal consistency and performs interval-level localization. Unlike diffusion models, SB minimizes path-distribution discrepancy and yields consistency scores without explicit noise injection or denoising. With SB, IaMSB unifies consistency estimation, cross-modal information selection, and bridge-step scheduling in one framework. Specifically, a lightweight coarse bridge first proposes candidate intervals and estimates cross-modal consistency; these statistics select cross-modal witness signals and allocate bridge steps asymmetrically across modalities. A refinement bridge then performs step-tuned fusion and outputs refined, time-aligned intervals. IaMSB anticipates single-sided and asynchronous forgeries and, using bottlenecked cross-modal interaction with step allocation, suppresses noise transfer and avoids unnecessary iterations. Across benchmarks, IaMSB stabilizes strict-IoU boundary precision, raising AP@0.95 by 3% to 10%, and yields improved high-precision localization, particularly for single-sided forgeries.

2605.23109 2026-05-25 cs.AI cs.DC cs.LO cs.PL

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

归纳演绎合成:使AI能够生成形式化验证的系统

Shubham Agarwal, Alexander Krentsel, Shu Liu, Mert Cemri, Audrey Cheng, Rui Meng, Tomas Pfister, Chun-Liang Li, Sylvia Ratnasamy, Aditya Parameswaran, Matei Zaharia, Ion Stoica, Mohsen Lesani

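Affiliations: UC Berkeley; Google; UC Santa Cruz

AI summary: This paper proposes Inductive Deductive Synthesis (IDS), a new approach to the lack of formal verification in AI-generated code, particularly for distributed systems. IDS jointly generates implementation code and formal proofs, and learns from failed attempts to systematically try promising strategies. Built as an agent-based large language model system, IDS verifies all 7 distributed key-value-store specifications, taking about 6.8 hours per specification on average at comparatively low cost, and the resulting implementations outperform existing verified systems.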

详情
AI中文摘要

AI代理在生成、测试和优化代码方面日益出色。然而,在需要完全覆盖的形式化保证(仅靠测试无法提供)的任务上,它们表现不足。分布式系统是一个典型例子:读写一致性等属性必须在每个可能的事件交错下成立。机械化形式验证可以保证这种正确性,但通常需要专家数月到数年的努力。证据表明,即使是最先进的编码代理(Codex with GPT-5.4和Claude Code with Opus 4.6)也仅在7个分布式键值存储规范中的2个上成功。在本文中,我们提出了解决这一差距的首个有效方法——归纳演绎合成(IDS),它联合且增量地合成实现和证明,并从失败的尝试中学习以系统地尝试有前景的策略。作为基于LLM的代理系统,IDS在平均约6.8小时和每个规范106美元的成本下实现了7/7的成功率,比专家努力快约200倍,比最先进的代理便宜17%。IDS进一步将性能反馈纳入同一循环,产生的实现比已发布的验证系统快达3倍。

英文摘要

AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.

2605.23103 2026-05-25 cs.CL cs.AI cs.CY cs.DB

A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works

用于明清之际文集中个人书信标题的微调BERT分类器

Queenie Luo

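Affiliations: Harvard University

AI summary: This paper presents Lepton, a fine-tuned BERT classifier that identifies whether a title in the table of contents of a late-Ming to early-Qing collected work is a personal letter, distinguishing letters in particular from easily confused prefaces such as farewell prefaces. The model is fine-tuned on 5438 hand-labeled titles from the collected works of 33 literati, has been deployed on Hugging Face, and has been applied at the China Biographical Database (CBDB), where it identified about fifty-five thousand letters to support data construction for the Ming Letter Platform.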

详情
AI中文摘要

我提出Lepton(书信预测),一个微调的BERT分类器,用于预测古典中文文集目录中的标题是个人书信还是易混淆的序文(特别是赠序)。Lepton在来自三十三位明清之际文人的5438个手工标注的文集标题上微调bert-base-chinese。我已将该模型部署在Hugging Face上,并已在中国传记数据库(CBDB)中使用,用于识别从中明到清初文集中约五万五千封书信,从而填充明代书信平台。

英文摘要

I present Lepton (Letter Prediction), a fine-tuned BERT classifier that predicts whether a title in a Classical Chinese wenji table of contents is a personal letter or a closely confusable preface (particularly the farewell-preface). Lepton fine-tunes bert-base-chinese on 5438 hand-labeled wenji titles from thirty-three late-Ming and early-Qing literati. I have deployed the model on Hugging Face, and it has been used at the China Biographical Database (CBDB) to identify approximately fifty-five thousand letters across mid-Ming through early-Qing wenji, populating the Ming Letter Platform.

2605.23098 2026-05-25 cs.RO

UfM*: Uncertainty from Motion* for DNN Depth Estimation Using Gaussians

UfM*:基于高斯分布的运动不确定性用于DNN深度估计

Soumya Sudhakar, Sertac Karaman, Vivienne Sze

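Affiliations: Massachusetts Institute of Technology

AI summary: This paper proposes UfM*, an uncertainty estimation method for monocular depth estimation with deep neural networks. It uses a Gaussian mixture to measure disagreement between predictions across views efficiently, and needs only a single network inference per image to produce uncertainty. Compared with conventional methods, UfM* is substantially more compute- and memory-efficient, and experiments confirm its advantages in calibration error and energy consumption, making it well suited to resource-constrained robotic systems.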

Comments 18 pages, 15 figures

详情
AI中文摘要

可靠的不确定性估计对于在安全关键的机器人系统中部署单目深度深度神经网络(DNN)至关重要。传统的不确定性方法(如集成和基于采样的方法)需要每张图像多次推理,导致大量计算和内存开销。此外,从单张图像预测的不确定性无法衡量同一区域不同视图间预测的不一致性。我们提出UfM*(基于运动的不确定性),一种不确定性估计算法,通过使用紧凑高斯混合模型比较前后视图,高效衡量多视图不一致性,每张图像仅需一次DNN推理。使用高斯分布计算多视图不一致性不仅比先前使用点云的方法更节省计算和内存,而且通过衡量3D空间区域间的不一致性提高了不确定性质量。UfM*结合偶然不确定性,在100个分布外ScanNet序列上,与集成相比,期望校准误差改善24-28%,而能耗仅为集成的3%,内存仅为0.02%。我们证明,在微型能量受限机器人上,UfM*在Arm Cortex-A76 CPU上以30 FPS实时运行,每张224x224图像仅消耗63 mJ,突显了使用高斯分布衡量多视图不一致性能够为资源受限的机器人系统实现高效的不确定性估计。

英文摘要

Reliable uncertainty estimation is critical for deploying monocular depth deep neural networks (DNNs) in safety-critical robotic systems. Conventional uncertainty methods such as ensembles and sampling-based approaches require multiple inferences per image, incurring substantial compute and memory overhead. Moreover, uncertainty predicted from a single image misses out on measuring disagreement between predictions across views of the same region. We propose Uncertainty from Motion* (UfM*), an uncertainty estimation algorithm that measures multiview disagreement efficiently by comparing previous and current views using a compact Gaussian mixture, requiring only a single DNN inference per image. Using Gaussians to compute multiview disagreement is not only more compute- and memory-efficient than a prior approach using a point cloud, but also improves uncertainty by measuring disagreement across regions of 3D space. UfM* paired with aleatoric uncertainty improves expected calibration error by 24-28% compared to an ensemble, while requiring only 3% of the energy and 0.02% of the memory on 100 out-of-distribution ScanNet sequences. We demonstrate UfM* consumes only 63 mJ per 224x224 image while running real-time at 30 FPS on an Arm Cortex-A76 CPU onboard a miniature energy-constrained robot, highlighting that measuring multiview disagreement using Gaussians enables efficient uncertainty for resource-constrained robotic systems.

2605.23093 2026-05-25 cs.CL cs.CY

A Comparative Evaluation of Structural Topic Models and BERTopic for Short, Open-Ended Survey Responses

结构主题模型与BERTopic在简短开放式调查回答中的比较评估

Yan Jiang, Sihong Liu, Philip A. Fisher

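Affiliations: Stanford Center on Early Childhood, Stanford University

AI summary: This paper compares the Structural Topic Model (STM) with the embedding-based BERTopic model for analyzing short, open-ended survey responses. Both methods are evaluated under multiple configurations; BERTopic yields better topic coherence, while STM is stronger for covariate analysis. The results indicate that each method has its own strengths and suits different research needs, offering practical guidance for choosing topic modeling methods in applied social science research.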

详情
AI中文摘要

应用心理学中的主题建模日益跨越两种方法论传统:概率词袋模型和较新的基于嵌入的方法。然而,对这些方法的许多评估依赖于较长且更干净的基准语料库,对简短、开放式调查回答的指导较少。本文比较了结构主题模型(STM)(一种概率主题模型)和BERTopic(一种基于嵌入的模型)用于分析开放式调查回答。我们评估了三种STM条件和五种BERTopic条件,变化包括拼写纠正、词干提取、嵌入选择以及上下文增强(我们引入的一种为极短回答提供额外语义上下文的策略)。结果表明,BERTopic始终比STM产生更高的主题连贯性,其中上下文增强带来了最强的性能提升。相比之下,仅使用更高维度的嵌入并未改善连贯性,反而与更大的数据损失相关。定性评估显示,BERTopic生成了更可解释和稳定的主题,而STM主题通常更广泛且更混杂。然而,STM为推断性协变量分析提供了更强的支持,而BERTopic的协变量比较主要是描述性的。这些发现表明STM和BERTopic具有互补优势。我们最后为应用社会科学研究中选择和结合主题建模方法提供了实用指导。

英文摘要

Topic modeling in applied psychology increasingly spans two methodological traditions: probabilistic bag-of-words models and newer embedding-based approaches. Yet many evaluations of these methods rely on longer and cleaner benchmark corpora, leaving less guidance for short, open-ended survey responses. This paper compares Structural Topic Models (STM), a probabilistic topic model, and BERTopic, an embedding-based model, for analyzing open-ended survey responses. We evaluated three STM conditions and five BERTopic conditions, varying typographical correction, stemming, embedding choice, and contextual augmentation, a strategy we introduced to provide additional semantic context for very short responses. Results indicate that BERTopic consistently produced higher topic coherence than STM, with contextual augmentation yielding the strongest performance gains. In contrast, higher-dimensional embeddings alone did not improve coherence and were associated with greater data loss. Qualitative evaluation showed that BERTopic generated more interpretable and stable topics, while STM topics were often broader and more mixed. However, STM provides stronger support for inferential covariate analysis, whereas BERTopic covariate comparisons are primarily descriptive. These findings suggest that STM and BERTopic offer complementary strengths. We conclude with practical guidance for selecting and combining topic modeling approaches in applied social science research.

2605.23089 2026-05-25 cs.LG cs.AI

Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

利用梯度惩罚潜在动力学实现平滑且高效的采样

Romil V. Sonigra, P. R. Kumar

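Affiliations: Department of Electrical and Computer Engineering, Texas A&M University

AI summary: This paper proposes GPLD, a gradient-penalized latent dynamics regularizer for improving latent world models in model-based reinforcement learning. By applying a row-wise Jacobian penalty to the posterior latent state distribution, it explicitly encourages learning locally smooth transition dynamics, improving sample efficiency and learning stability. Experiments show strong results across several deep reinforcement learning tasks, with notable gains in complex locomotion environments; on quadruped tasks it reaches high-return behavior earlier and learns more consistently over long horizons.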

Comments 17 pages and 9 figures

详情
AI中文摘要

基于模型的强化学习通过学习世界模型来提高样本效率。然而,现有的潜在世界模型(如DreamerV3)并未明确强制其学习的转移动力学具有局部平滑性,从而未利用这一有用的归纳偏置。我们提出GPLD,一种用于DreamerV3的梯度惩罚潜在动力学正则化器,通过对后验潜在分布施加行雅可比惩罚来鼓励局部平滑的转移学习。我们证明该惩罚可解释为离散嵌入状态MDP中转移律的有限差分平滑的连续潜在类比,并使用Hutchinson风格随机探针高效估计。实验上,在DeepMind Control本体感受任务中,GPLD提高了总体样本效率,在复杂度较高的运动环境中尤其显著。在更具挑战性的四足任务中,GPLD更早达到高回报行为,并在更长的时间跨度内表现出更一致的后期学习。显式局部平滑正则化是改善平滑连续控制环境中潜在世界模型的简单有效方法。GPLD代码见github.com/romils9/gpld-mbrl。

英文摘要

Model-based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as DreamerV3 do not explicitly enforce local smoothness in their learned transition dynamics, leaving a useful inductive bias for transition dynamics learning unexploited. We propose GPLD, a gradient-penalized latent dynamics regularizer for DreamerV3 that applies a row-wise Jacobian penalty to the posterior latent distribution to encourage locally smooth transition learning. We show that this penalty can be interpreted as the continuous-latent analog of finite-difference smoothing of transition laws in discrete embedded-state MDPs, and estimate it efficiently using Hutchinson-style stochastic probes. Empirically, across DeepMind Control proprioceptive tasks, GPLD improves aggregate sample efficiency, with particularly strong gains on higher-complexity locomotion environments. On more challenging quadruped tasks, GPLD reaches high-return behavior earlier and exhibits more consistent late-stage learning over longer horizons. Explicit local smoothness regularization is a simple and effective way to improve latent world models for smooth continuous control environments. Code for GPLD is available at github.com/romils9/gpld-mbrl .
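
The "Hutchinson-style stochastic probes" mentioned above refer to a standard trick: the squared Frobenius norm of a Jacobian J equals the expectation of ||J^T v||^2 over random sign vectors v, so it can be estimated without forming J. The Python sketch below checks this on a small explicit matrix. It is not GPLD's penalty (the abstract does not define the row-wise form), the matrix is a hypothetical stand-in for a Jacobian, and in a real model the product J^T v would come from a vector-Jacobian product in an autodiff framework rather than from an explicit matrix.

```python
import itertools
import random


def matvec_transpose(a, v):
    """Compute A^T v for a matrix stored as a list of rows."""
    return [sum(a[i][j] * v[i] for i in range(len(a))) for j in range(len(a[0]))]


def hutchinson_frobenius_sq(a, probes):
    """Average ||A^T v||^2 over probes; unbiased for ||A||_F^2 when E[v v^T] = I."""
    total = 0.0
    for v in probes:
        total += sum(x * x for x in matvec_transpose(a, v))
    return total / len(probes)


# Hypothetical stand-in Jacobian: 3 outputs by 2 inputs.
jac = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
exact = sum(x * x for row in jac for x in row)

# Stochastic estimate with 64 random sign (Rademacher) probes.
rng = random.Random(0)
random_probes = [[rng.choice((-1.0, 1.0)) for _ in jac] for _ in range(64)]
estimate = hutchinson_frobenius_sq(jac, random_probes)

# Averaging over every sign vector gives the expectation exactly.
all_probes = [list(v) for v in itertools.product((-1.0, 1.0), repeat=len(jac))]
exhaustive = hutchinson_frobenius_sq(jac, all_probes)
```

The exhaustive average matches the exact norm because the cross terms cancel across sign patterns; the random estimate fluctuates around it, and a handful of probes per training step is the usual compromise between variance and cost.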

2605.23087 2026-05-25 cs.LG

The Implicit Bias of Depth: From Neural Collapse to Softmax Codes

深度的隐式偏差:从神经坍缩到Softmax编码

Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis

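Affiliations: Mathematical Institute, University of Oxford; Department of Electrical and Computer Engineering, University of British Columbia

AI summary: This work examines how the implicit bias of gradient descent in deep neural networks affects neural collapse (NC). Analyzing the deep unconstrained feature model (UFM) without regularization, it finds that depth itself induces an implicit low-rank bias that favors low-rank feature representations, which are linked to optimal solutions in the form of softmax codes. The study also shows how depth affects training dynamics and NC's basin of attraction, and notes that increasing network width can push training toward higher-rank solutions, providing a new theoretical perspective on implicit bias in deep models.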

Comments 46 pages, 11 figures, accepted at the International Conference on Machine Learning 2026

详情
AI中文摘要

神经坍缩(NC)描述了训练分类器中特征和权重出现的结构化几何。最近的理论表明,NC在深度架构中可能不是最优的,将其归因于L2正则化的显式低秩偏差。我们研究了深度无约束特征模型(UFM)——等价于具有正交输入的深度线性网络——在无正则化训练下的情况,以隔离梯度下降和深度单独如何塑造NC。我们表明,深度诱导了隐式低秩偏差:低秩矩阵通过连续乘法更有效地传播范数,从而促进NC的低秩替代方案。我们认为,这些替代方案对应于softmax编码:先前在宽度瓶颈网络中发现的最大间隔解。通过分析谱初始化下的训练动态,我们识别出早期奇异值之间的排斥力驱动低秩出现,并刻画了深度如何缩小NC的吸引域。最后,我们展示了一些相反方向的效果:对于随机初始化的网络,增加宽度会使训练偏向更高秩的解。我们的结果首次提供了在无正则化多类交叉熵训练的深度UFM中隐式偏差的渐近和动态刻画。

英文摘要

Neural collapse (NC) describes the structured geometry that emerges in the features and weights of trained classifiers. Recent theory suggests NC can be suboptimal in deep architectures, attributing this to an explicit low-rank bias from L2 regularization. We study the deep unconstrained feature model (UFM), which is equivalent to a deep linear network with orthogonal inputs, trained without regularization, to isolate how gradient descent and depth alone shape NC. We show that depth induces an implicit low-rank bias: low-rank matrices propagate norm more efficiently through successive multiplications, promoting low-rank alternatives to NC. These alternatives, we argue, correspond to softmax codes: max-margin solutions previously found in width-bottlenecked networks. Analyzing training dynamics under spectral initialization, we identify an early-time repulsion among singular values that drives low-rank emergence, and characterize how depth shrinks NC's basin of attraction. Finally, we show that some effects act in the opposite direction: for randomly initialized networks, increasing width biases training toward higher-rank solutions. Our results provide the first asymptotic and dynamic characterization of implicit bias in deep UFMs trained with unregularized multiclass cross-entropy.

2605.23081 2026-05-25 cs.LG

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

ThriftAttention: 面向长上下文FP4注意力机制的选择性混合精度

Joe Sharratt

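Affiliations: NVIDIA Corporation

AI summary: The quadratic cost of attention is a key challenge in long-context workloads. ThriftAttention proposes a selective mixed-precision method that substantially improves model quality in long-context settings while retaining FP4 inference efficiency. Using a two-stage strategy, it computes a small number of important query-key block pairs in FP16 and the remaining blocks in FP4, then merges the results via online softmax; with only 5% of blocks in FP16, it recovers 89.1% of the FP4-to-FP16 performance gap.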

详情
AI中文摘要

高效的注意力算法对于减轻长上下文工作负载中注意力的二次成本至关重要。先前的工作在Blackwell GPU上利用块缩放量化技术将注意力计算移至4位精度以加速推理。然而,这些技术在长上下文设置中会导致显著的质量下降。我们表明,量化误差的输出影响高度不均匀,并且随着每个查询-键交互的重要性而增加,将功能相关的误差集中在包含最重要标记的少量注意力块中。我们提出ThriftAttention,一种低比特注意力变体,在FP4推理效率下提供接近FP16的长上下文质量。该方法分两个阶段进行。首先,一种启发式方法快速选择少量重要的查询-键块对进行FP16精度计算。其次,选中的块以FP16计算,其余块以FP4计算,两条路径通过在线softmax合并为单个输出。我们在长上下文基准和模型家族上证明,通过仅计算5%的查询-键块为FP16,ThriftAttention平均恢复了FP4到FP16性能差距的89.1%。我们展示了ThriftAttention的优势随序列长度增加而增长,缓解了在更长上下文中观察到的系统性FP4质量下降。代码可在https://github.com/joesharratt1229/ThriftAttention获取。

英文摘要

Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit precision to accelerate inference. However, these techniques result in significant quality degradation in long-context settings. We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction, concentrating functionally relevant error in a small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. This approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query-key block pairs for FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. We demonstrate across long-context benchmarks and model families that by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap. We show ThriftAttention's advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.
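
The merge step in the second stage relies on a standard property of online softmax: attention over two disjoint sets of keys can be computed separately and combined exactly, as long as each part carries its running maximum and normalizer. The Python sketch below shows this for a single query with scalar values. It only illustrates the merge; both paths run in ordinary floating point here, no FP4 or FP16 arithmetic is simulated, and the split into "important" and "remaining" keys is chosen by hand rather than by the paper's heuristic.

```python
import math


def partial_attention(scores, values):
    """Softmax-weighted average over one subset of keys, in online-softmax form.

    Returns (max score, sum of shifted exponentials, normalized output).
    """
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / denom
    return m, denom, out


def merge(part_a, part_b):
    """Combine two partial results as if softmax had been taken over all keys at once."""
    m_a, l_a, o_a = part_a
    m_b, l_b, o_b = part_b
    m = max(m_a, m_b)
    w_a = l_a * math.exp(m_a - m)
    w_b = l_b * math.exp(m_b - m)
    return m, w_a + w_b, (w_a * o_a + w_b * o_b) / (w_a + w_b)


scores = [2.0, -1.0, 0.5, 3.0]
values = [1.0, 2.0, 3.0, 4.0]

important = [0, 3]  # keys routed to the high-precision path (hand-picked here)
remaining = [1, 2]  # keys routed to the low-precision path

high = partial_attention([scores[i] for i in important], [values[i] for i in important])
low = partial_attention([scores[i] for i in remaining], [values[i] for i in remaining])
merged_output = merge(high, low)[2]

_, _, reference_output = partial_attention(scores, values)
```

Because the merge is exact, any difference between a mixed-precision result and the full-precision one comes only from the arithmetic inside each path, which is why routing the highest-impact blocks to the more precise path can recover most of the quality.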

2605.23078 2026-05-25 cs.LG cs.CL

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

GEMQ:MoE大语言模型的全局专家级混合精度量化

Jianing Deng, Song Wang, Dongwei Wang, Zijie Liu, Tianlong Chen, Huanrui Yang, Jingtong Hu

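Affiliations: University of Pittsburgh; University of Central Florida; University of Arizona; University of North Carolina at Chapel Hill

AI summary: Mixture-of-Experts large language models (MoE-LLMs) perform well but incur high memory overhead because of their many expert parameters. This paper proposes GEMQ, a global expert-level mixed-precision quantization method that captures model-wide expert importance through a global linear-programming formulation and adds efficient router fine-tuning to adapt to the quantized experts, achieving a better accuracy-memory trade-off. Experiments show that GEMQ significantly reduces memory footprint and accelerates inference while preserving accuracy.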

Comments ICML 2026

详情
AI中文摘要

混合专家大语言模型(MoE-LLMs)性能强大,但由于大量专家参数导致显著的内存开销。混合精度量化根据专家重要性分配不同的位宽,接近精度-内存帕累托前沿,并实现极低比特量化。然而,现有方法依赖于逐层重要性估计,忽视了量化引起的路由器偏移,导致次优的分配和路由。本文提出全局专家级混合精度量化(GEMQ),通过(1)基于量化误差分析的全局线性规划公式来捕获模型范围内的专家重要性,以及(2)高效的路由器微调以适应量化后的专家,从而克服这些限制。这些组件被集成到一个渐进式量化框架中,该框架迭代地优化重要性估计和分配。实验表明,GEMQ在最小化精度损失的情况下显著减少内存并加速推理。源代码可在 https://github.com/jndeng/GEMQ 获取。

英文摘要

Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ .
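
The allocation problem behind expert-wise mixed precision can be stated on a toy instance: pick one bit-width per expert so that importance-weighted quantization error is minimized under a total bit budget. The Python sketch below solves a four-expert instance by exhaustive search. This is a stand-in, not GEMQ: the paper uses a global linear-programming formulation whose details are not given in the abstract, and the importance scores, error table, and budget here are invented for illustration.

```python
import itertools


def allocate_bits(importance, error_by_bits, budget_bits):
    """Try every bit-width assignment and keep the cheapest one within the bit budget."""
    choices = sorted(error_by_bits)
    best, best_cost = None, float("inf")
    for assignment in itertools.product(choices, repeat=len(importance)):
        if sum(assignment) > budget_bits:
            continue
        cost = sum(w * error_by_bits[b] for w, b in zip(importance, assignment))
        if cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost


# Hypothetical inputs: one importance score per expert, and an error level per bit-width.
importance = [0.5, 0.3, 0.15, 0.05]
error_by_bits = {2: 1.0, 3: 0.3, 4: 0.1}

# A budget of 12 bits over 4 experts corresponds to 3 bits per expert on average.
best_assignment, best_cost = allocate_bits(importance, error_by_bits, budget_bits=12)
uniform_cost = sum(w * error_by_bits[3] for w in importance)
```

On this instance the search gives the most important expert 4 bits and the least important one 2 bits, which beats the uniform 3-bit assignment at the same budget. Exhaustive search grows exponentially with the number of experts, which is one reason a real system would need a formulation that scales, such as the linear program the abstract describes.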

2605.23074 2026-05-25 cs.AI

PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning

Lingyu Jiang, Zirui Li, Shuo Xing, Peiran Li, Tsubasa Takahashi, Dengzhe Hou, Zhengzhong Tu, Kazunori Yamada, Fangzhou Lin

Affiliations Tohoku University; Texas A&M University; Worcester Polytechnic Institute

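AI summary With large language models increasingly used for reasoning tasks, controlling their reasoning paths efficiently has become a key problem. The paper proposes PathCal, a training-free decoding controller that calibrates reasoning paths by distinguishing types of reflection markers and intervening only at locally uncertain states. Experiments on six reasoning benchmarks show that PathCal achieves a better efficiency-performance trade-off, reducing generation length while improving or preserving accuracy.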

Comments 21 pages, 5 figures, 7 tables

Abstract

The emergence of Large Reasoning Language Models (LRMs) has paved the way for tackling complex reasoning tasks through test-time scaling by generating long-form Chain-of-Thought (CoT) trajectories during inference. Meanwhile, these trajectories often contain explicit reflection markers such as "wait", "but", and "alternatively", signaling hesitation, revision, and the consideration of alternative explorations, respectively. Recent studies on test-time control leverage such markers as lightweight handles for steering reasoning, typically treating them as a single coarse-grained category rather than distinguishing their distinct functional roles. In this paper, we conduct type-wise suppression and fixed-prefix intervention, revealing that reflection markers differ not only in their functional roles but also in when they exert the greatest influence. Specifically, different marker classes affect accuracy and generation length in distinct ways, and marker choices are most consequential before the model settles into a stable reasoning trajectory. Motivated by these findings, we introduce PathCal, a novel training-free decoding controller that calibrates reasoning paths by distinguishing marker types and intervening only at locally uncertain states. At each decoding step, PathCal utilizes the distribution over reflection markers to estimate local competition between maintaining the current reasoning trajectory and initiating a competing branch, and softly rebalances marker logits when competing-branch evidence becomes excessive. Experiments across six reasoning benchmarks demonstrate that PathCal achieves a better efficiency-performance trade-off, improving or preserving accuracy while reducing generation length, without relying on external verifiers or additional sampling.
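
To make the decoding-step description above concrete, here is a rough sketch of one way a soft marker rebalancing could look. Everything specific in it is an assumption: the split of markers into "keep" and "branch" groups, the share threshold, the linear penalty, and the function name `rebalance_marker_logits` are illustrative choices, not PathCal's actual rule.

```python
import math


def rebalance_marker_logits(logits, keep_markers, branch_markers,
                            max_branch_share=0.6, strength=1.0):
    """Softly penalize branch-marker logits when their share is excessive.

    logits maps token -> logit for one decoding step. The branch share is the
    softmax mass of branch markers measured among marker tokens only.
    Returns (adjusted_logits, branch_share); non-marker tokens are untouched.
    """
    keep = [t for t in keep_markers if t in logits]
    branch = [t for t in branch_markers if t in logits]
    if not keep and not branch:
        return dict(logits), 0.0            # no markers are candidates at this step

    top = max(logits[t] for t in keep + branch)
    weight = {t: math.exp(logits[t] - top) for t in keep + branch}
    branch_share = sum(weight[t] for t in branch) / sum(weight.values())

    adjusted = dict(logits)
    if branch_share > max_branch_share:
        # Hypothetical soft rule: the penalty grows with the excess share.
        penalty = strength * (branch_share - max_branch_share)
        for t in branch:
            adjusted[t] -= penalty
    return adjusted, branch_share


# Toy step: "wait" and "but" dominate the marker distribution.
step_logits = {"wait": 2.0, "but": 1.0, "so": 0.0, "the": 3.0}
adjusted, share = rebalance_marker_logits(
    step_logits, keep_markers={"so"}, branch_markers={"wait", "but"})
print(round(share, 3), adjusted)
```

The adjustment lowers logits rather than masking tokens, so a strongly preferred branch marker can still be sampled; that is the sense in which the intervention is soft.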

2605.23071 2026-05-25 cs.CL

The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management

Binqi Shen, Lier Jin, Hanyu Cai, Lan Hu, Yuting Xin

Affiliations Northwestern University; Duke University; Carnegie Mellon University; University of Minnesota

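AI summary As large language models rely more on long-context processing, expanding context windows brings substantial computational and financial costs. The paper proposes The Efficiency Frontier, a unified framework for cost-performance optimization in context management that models context strategy selection as a deployment-aware optimization problem, jointly accounting for task performance, token cost, and preprocessing reuse. The framework reveals the operating regimes in which retrieval-based and preprocessing-based strategies are each preferable. On 5,000 HotpotQA instances, deployment-aware optimization reduces effective token usage by about 25% at comparable performance (F1 ≈ 0.78), and amortized memory compression lowers token cost by over 50% relative to full-context prompting.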

Abstract

Large language models (LLMs) increasingly rely on long-context processing, but expanding context windows introduces substantial computational and financial costs. Existing context reduction approaches, including retrieval and memory compression methods, are typically evaluated using performance and efficiency metrics independently, limiting systematic comparison and deployment-aware decision-making. This paper introduces The Efficiency Frontier, a unified framework for cost-performance optimization in LLM context management. The framework models context strategy selection as a deployment-aware optimization problem that jointly accounts for task performance, token cost, and preprocessing reuse through amortized cost modeling. Unlike existing evaluations that compare methods in isolation, the proposed framework enables decision-oriented analysis of when different context management strategies become preferable under varying operational conditions. Evaluated on 5,000 HotpotQA instances, the framework reveals distinct operational regimes and transition boundaries between retrieval-based and preprocessing-based strategies. Results show that deployment-aware optimization reduces effective token usage by approximately 25% at comparable performance (F1 ≈ 0.78), while amortized memory compression achieves over 50% lower token cost relative to full-context prompting in higher-performance settings. Overall, the proposed framework provides a principled and practical foundation for evaluating and deploying scalable, efficient, and sustainable LLM systems.
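
The amortized cost idea can be illustrated with a small sketch. It assumes the simplest possible cost model (per-query tokens plus a one-time preprocessing cost spread over the number of queries that reuse it); the function names, the strategy table, and every number in it are made up for illustration and are not results or definitions from the paper.

```python
import math


def amortized_cost(per_query_tokens, preprocessing_tokens=0.0, reuse_count=1):
    """Effective tokens per query when preprocessing is reused reuse_count times."""
    return per_query_tokens + preprocessing_tokens / reuse_count


def break_even_reuse(preprocessing_tokens, per_query_tokens, baseline_tokens):
    """Smallest reuse count at which the preprocessed strategy is no more
    expensive than the baseline, or None if it never catches up."""
    saving = baseline_tokens - per_query_tokens
    if saving <= 0:
        return None
    return max(1, math.ceil(preprocessing_tokens / saving))


def cheapest_strategy(strategies, min_f1, reuse_count):
    """Pick the lowest amortized-cost strategy that meets a performance floor."""
    feasible = [s for s in strategies if s["f1"] >= min_f1]
    if not feasible:
        return None
    return min(feasible, key=lambda s: amortized_cost(
        s["per_query_tokens"], s["preprocessing_tokens"], reuse_count))["name"]


# Hypothetical strategy table (illustrative numbers only).
strategies = [
    {"name": "full_context", "f1": 0.75, "per_query_tokens": 10000, "preprocessing_tokens": 0},
    {"name": "retrieval", "f1": 0.70, "per_query_tokens": 3000, "preprocessing_tokens": 0},
    {"name": "compressed_memory", "f1": 0.75, "per_query_tokens": 2000, "preprocessing_tokens": 40000},
]
for reuse in (1, 100):
    print(reuse, cheapest_strategy(strategies, min_f1=0.70, reuse_count=reuse))
```

In this toy table the preferred strategy flips as the reuse count grows, which is the kind of transition boundary between retrieval-based and preprocessing-based strategies that the abstract describes.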

2605.23070 2026-05-25 cs.CV

Flow Mismatching: Unsupervised Anomaly Detection via Velocity Discrepancies in Flow Matching Models

Shengzhe Chen, Mehrdad Moradi, Kamran Paynabar, Hao Yan

Affiliations Arizona State University; Georgia Institute of Technology

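AI summary The paper proposes Flow Mismatching, an unsupervised anomaly detection method that avoids the reconstruction-based paradigm and instead detects anomalies from velocity discrepancies in flow matching models. Along affine paths from Gaussian noise to the target image, it compares the model-predicted velocity with the geometric velocity toward the target and flags regions where the two disagree. Experiments on MVTec-AD and VisA show that it outperforms state-of-the-art reconstruction-based and recent flow matching-based methods.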

Abstract

We propose Flow Mismatching, an unsupervised anomaly detection method that deliberately avoids reconstruction-based paradigms. Instead, we treat flow matching as geometric dynamics and leverage a key insight: anomalies occur at places where the learned normal flow disagrees with the geometric path toward a test image. Given a flow matching model trained only on normal images, we probe its learned velocity field along affine paths from Gaussian noise to a target image. Along each path, we compare the model-predicted velocity, which follows normal generative dynamics, with the geometric velocity toward the target, which includes any anomalous content. Anomalies induce strong local disagreement between these velocities. Aggregating the mismatch over different time steps and multiple paths yields pixel-wise heatmaps and image-level scores without test-time optimization, feature memories, or additional calibration. Our analysis shows that the population mismatch decomposes into an irreducible denoising term and a Fisher-divergence term between the test-path and normal-path score functions, which identifies the score-gap component that drives anomaly separation and explains the effectiveness of robust path aggregation. Extensive experiments on MVTec-AD and VisA demonstrate superior performance compared with SOTA reconstruction-based and recent flow matching-based approaches.
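
The velocity comparison described above can be sketched on a toy example. The sketch assumes the linear path x_t = (1 - t) x_0 + t x_1 with geometric velocity x_1 - x_0, a plain mean over time steps and paths (the paper's robust aggregation is not specified in the abstract), and a stand-in "model": the closed-form velocity field for a normal distribution concentrated at a single image `mu`. All names here are hypothetical.

```python
import random


def toy_velocity(mu):
    """Velocity field of a flow matching model whose normal data is the single
    point mu: v(x, t) = (mu - x) / (1 - t). A stand-in for a trained network."""
    def velocity(x_t, t):
        return [(m - x) / (1.0 - t) for m, x in zip(mu, x_t)]
    return velocity


def mismatch_heatmap(target, velocity_model, times=(0.2, 0.4, 0.6, 0.8),
                     n_paths=4, seed=0):
    """Per-pixel squared gap between model velocity and geometric velocity,
    averaged over time steps and noise paths (mean aggregation is an assumption)."""
    rng = random.Random(seed)
    heat = [0.0] * len(target)
    for _ in range(n_paths):
        noise = [rng.gauss(0.0, 1.0) for _ in target]          # x_0 ~ N(0, I)
        v_geo = [x - z for z, x in zip(noise, target)]          # x_1 - x_0
        for t in times:
            x_t = [(1.0 - t) * z + t * x for z, x in zip(noise, target)]
            v_model = velocity_model(x_t, t)
            for i, (vm, vg) in enumerate(zip(v_model, v_geo)):
                heat[i] += (vm - vg) ** 2
    return [h / (n_paths * len(times)) for h in heat]


# A flattened 6-pixel "image": normal pixels equal mu, pixel 3 is anomalous.
mu = [0.5] * 6
test_image = list(mu)
test_image[3] = 2.0
heat = mismatch_heatmap(test_image, toy_velocity(mu))
image_score = max(heat)   # hypothetical image-level score
print([round(h, 3) for h in heat], round(image_score, 3))
```

In this toy setting the gap at each pixel reduces to (mu - x_1) / (1 - t), so normal pixels give a mismatch of essentially zero while the anomalous pixel stands out, with no reconstruction or test-time optimization involved.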