arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 1926
2605.22720 2026-05-22 cs.AI cs.HC

Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts

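Andrii Kryshtal

Affiliations: Independent Researcher

AI summary: This paper studies alignment failures that AI models can exhibit in conflict contexts. Testing nine model configurations, it finds false equivalence, genocide denial, and failure to recognise ethnic slurs in conflict-related scenarios, and it releases the first evaluation framework for the domain to improve AI safety in conflict settings.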

Comments Preprint. 8 pages, 2 figures. Code and evaluation framework: https://github.com/akryshtal/conflict-sensitivity-eval-bloom

Abstract

AI models are already deployed in societies affected by armed conflict, and journalists, humanitarian workers, governments and ordinary citizens rely on them for information or for their work processes. No established practice exists for checking whether their outputs can make those conflicts worse. We tested nine model configurations from four providers (OpenAI, Anthropic, DeepSeek, xAI) on 90 multi-turn scenarios designed to surface misaligned behaviour in conflict contexts: false equivalence between documented atrocities, denial of genocide, and failure to recognise ethnic slurs, among others. When such outputs feed into journalism, humanitarian reporting, or public debate, they can deepen divisions in fragile societies. Failure rates span 6% to 47% between the best and worst performing models, which makes model choice a safety question in its own right. When users pushed for "balance" in cases where international courts have already assigned responsibility, five of nine configurations failed 80 to 100 percent of the time. We release the first evaluation framework for this domain and propose adding it to alignment evaluation portfolios.

2605.22719 2026-05-22 cs.LG

Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

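Mahdi Nasermoghadasi

Affiliations: Research Division, BrightMind AI; Texas Tech University; University of Texas at Arlington

AI summary: This study audits sparse-autoencoder features of GPT-2 small on failed versus successful trials of the Indirect Object Identification task. It finds specific features that correlate strongly with task failure and uses several control experiments to show that the relationship is correlational rather than causal.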

Comments 10 pages, 7 figures

Abstract

We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task. On 300 prompts, GPT-2 small reaches 79.7% accuracy; 146 of the 24,576 features in the layer-8 residual-stream SAE release of Bloom (2024) clear a Holm-corrected significance threshold and 105 reach a large effect size (|Cohen's d| > 0.8). The strongest single correlate of failure (feature 17,491, d = +2.93, Neuronpedia label 'cryptographic keys') is essentially silent except when the prompt's transferred object is 'the keys,' on which GPT-2 small fails 93.3% of the time vs. 7.5% on the other seven objects (Fisher exact p = 8.79 x 10^-33). We put this correlate through three controls that a mechanistic claim should pass. (i) A causal ablation: zeroing feature 17,491 in the residual stream across all token positions of the 45 keys prompts does not restore accuracy (6.7% -> 4.4%); the feature is a correlate, not a sufficient cause at this layer. (ii) A representation baseline: a logistic regression on the raw 768-dimensional residual stream reaches 5-fold ROC AUC = 0.929, matching the top-100 SAE features (0.927); the SAE basis adds interpretability, not predictive power. (iii) A seed-robustness check: across five random seeds the keys-subset failure rate stays between 75.0% and 93.3% (the behavioural effect is real), but feature 17,491 is the top-|d| feature in only 1 of 5 runs. The methodological contribution is therefore the audit pipeline (cheap, model-agnostic, surfaces named correlates) rather than any single feature found through it. We release the code, the 300-prompt corpus, the 300 x 24,576 activation matrix, the ablation and baseline scripts, and the figures. The full pipeline runs on a laptop (Apple M3 Max, no discrete GPU).
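The two screening statistics named in the abstract, an effect size and a Holm-corrected significance threshold, can be sketched in a few lines. This is an illustration on made-up activations, not the authors' released pipeline; the function names `cohens_d` and `holm_reject` and all toy numbers are assumptions.

```python
import math

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (a: failed trials, b: successful trials)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled

def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject

# Hypothetical activations of one feature on failed vs. successful prompts.
failed = [2.0, 2.5, 3.0, 2.5]
succeeded = [0.0, 0.5, 1.0, 0.5]
d = cohens_d(failed, succeeded)
large_effect = abs(d) > 0.8  # the abstract's "large effect size" cut

# Hypothetical per-feature p-values screened with Holm's correction.
decisions = holm_reject([0.001, 0.04, 0.03, 0.5])
```

With these toy p-values only the first feature survives the correction, because the second-smallest p-value (0.03) exceeds its threshold of 0.05/3.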

2605.22718 2026-05-22 cs.CV

WorldKV: Efficient World Memory with World Retrieval and Compression

Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang, Sangdoo Yun, Seungryong Kim

Affiliations: KAIST AI; Naver AI Lab

AI summary: This paper proposes WorldKV, a training-free framework that uses World Retrieval and World Compression to improve efficiency while preserving consistency, matching or exceeding full-KV memory fidelity on Matrix-Game-2.0 and LingBot-World-Fast.

Comments Project Page: https://cvlab-kaist.github.io/WorldKV/

Abstract

Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
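To make the compression step concrete, here is a minimal sketch of pruning by key-key similarity to an anchor frame. It is not the WorldKV implementation: scoring each token by its maximum cosine similarity to any anchor key, the `keep_ratio` parameter, and the function names are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two key vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prune_chunk(chunk_keys, anchor_keys, keep_ratio=0.5):
    """Return the indices of tokens to keep: those least similar to the anchor keys."""
    # Assumed redundancy score: similarity to the closest anchor key.
    redundancy = [max(cosine(k, a) for a in anchor_keys) for k in chunk_keys]
    n_keep = max(1, int(len(chunk_keys) * keep_ratio))
    ranked = sorted(range(len(chunk_keys)), key=lambda i: redundancy[i])
    return sorted(ranked[:n_keep])

# Toy 2-D keys: tokens 0 and 2 nearly duplicate the anchor keys.
anchor = [[1.0, 0.0], [0.0, 1.0]]
chunk = [[1.0, 0.1], [1.0, 1.0], [0.1, 1.0], [-1.0, 1.0]]
kept = prune_chunk(chunk, anchor)
```

Keeping half of the tokens per chunk is what lets twice as many chunks fit in a fixed memory budget, which matches the 2x figure quoted in the abstract.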

2605.22717 2026-05-22 cs.SD cs.AI cs.LG cs.MM

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Zachary Novack, Stephen Brade, Haven Kim, Hugo Flores García, Nithya Shikarpur, Chinmay Talegaonkar, Suwan Kim, Valerie K. Chen, Julian McAuley, Taylor Berg-Kirkpatrick, Cheng-Zhi Anna Huang

Affiliations: UC San Diego; MIT; Adobe

AI summary: This paper asks whether audio diffusion models can be converted efficiently into interactive models that run on consumer hardware. The proposed Live Music Diffusion Models (LMDMs) use block-wise KV caching to recover and then surpass the inference complexity of discrete Live Music Models (LMMs), and the ARC-Forcing paradigm provides stable post-training alignment that reduces error accumulation without explicit RL or reward models.

Abstract

Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, with their wide support in the open-source community but non-streaming bidirectional nature, can be repurposed efficiently into interactive models accessible on consumer hardware. By taking a critical look at the modern pipeline for block-wise outpainting diffusion, we identify critical inefficiencies during inference that result in strictly worse computational efficiency than their discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, reducing error accumulation without any explicit RL or reward models. We demonstrate the application of LMDMs in a number of creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. We finally show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, utilizing LMDMs as a "generative delay" to transform musicians' improvisation live for variable timbral effects while running locally on a consumer gaming laptop.

2605.22716 2026-05-22 cs.AI cs.LO

Parametric Modular Answer Set Programs Made Declarative

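Jorge Fandinno, Yuliya Lierler, Torsten Schaub

Affiliations: University of Nebraska Omaha, USA; University of Potsdam, Germany

AI summary: This paper explores modularity in first-order answer set programming. It introduces parametric modular logic programs, a new formalism for defining subprograms with parameters and intensionality statements, shows how the formalism captures the semantics of clingo programs with collective control, and connects it to traditional non-modular answer set programming.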

Comments To appear in Theory and Practice of Logic Programming

Abstract

In this paper, we explore the concept of modularity in first-order answer set programming (ASP). We introduce a new formalism called parametric modular logic programs, which allows defining subprograms with parameters and intensionality statements. We demonstrate how this formalism can capture the semantics of clingo-programs with collective control, a feature that enables structuring and instantiating subprograms. We provide theoretical foundations for modular ASP, illustrate its usefulness, and connect to traditional non-modular ASP.

2605.22711 2026-05-22 cs.LG cs.AI

Abstraction for Offline Goal-Conditioned Reinforcement Learning

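Clarisse Wibault, Alexander Goldie, Antonio Villares, Maike Osborne, Jakob Foerster

Affiliations: FLAIR, MLRG University of Oxford

AI summary: This paper proposes a way to exploit abstraction in offline goal-conditioned reinforcement learning. By introducing relativised options and distinct representations for different levels of the hierarchy, it lets an agent reuse experience across similar contexts of the state space, which improves performance.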

Abstract

Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been motivated for horizon reduction via temporal abstraction in offline GCRL, we demonstrate that hierarchy also enables absolute abstraction. By introducing relativised options as well as distinct representations for different levels of the hierarchy, we demonstrate how an agent can reuse experience across similar contexts of the state-space. Based on this framework, we introduce two simple algorithms for learning relativised options and abstracting from the absolute frame of reference. Our experiments show that such inductive biases significantly improve performance in offline GCRL.

2605.22707 2026-05-22 cs.AI cs.HC

Beyond the Org Chart: AI and the Transformation of Invisible Work

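Stephanie Rosenthal, Shamsi Iqbal

Affiliations: Microsoft Corporation

AI summary: This paper studies how AI is changing work processes, especially invisible cultural practices such as professional mentoring. It proposes steps for making invisible work visible and ways for individuals and leaders to support colleagues while preserving a healthy company culture.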

Comments 10 pages

Abstract

An increasing number of news and research articles report that AI adoption is allowing professionals to blur and extend the boundaries of their corporate roles. With the goal of understanding how work processes might be changing in an AI-forward company, we interviewed 24 product-focused individuals at a large technology firm about how AI has impacted their own work, their work within their product team, and their professional interactions. Our conversations suggest that AI is not only changing formal role responsibilities and collaborations between those roles, but also changing informal cultural practices like professional mentoring that are key to helping professionals settle in their positions, stay engaged with their work, and grow their careers. Some of these changes are positive, such as smoother collaboration between peers, but other changes are more nuanced and put the typical career growth opportunities, like receiving feedback from professional networks and promoting leadership and mentorship, at risk. We propose steps that AI companies can take to make the invisible work more visible. Additionally, we propose efforts that individuals and leaders can take to support their colleagues through AI transformation while preserving healthy company cultures that support diverse thinking, collaboration, and informal interactions.

2605.22703 2026-05-22 cs.LG

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

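Shuo Yang, Jinda Lu, Chiyu Ma, Kexin Huang, Haoming Meng, Qihui Zhang, Yuyang Liu, Bolin Ding, Guoyin Wang, Li Yuan, Jingren Zhou

Affiliations: Alibaba Group

AI summary: This paper studies training instability in Reinforcement Learning with Verifiable Rewards (RLVR) caused by hard clipping decisions. It proposes Near-boundary Stochastic Rescue (NSR), a simple method that stochastically retains tokens lying slightly beyond the clipping boundary to recover lost signal, improving training stability and performance.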

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of clipping-based GRPO-style objectives, we identify the rigid clipping decision induced by hard clipping as a key practical bottleneck in the studied RLVR setups. Specifically, our analysis suggests that informative signals can lie in the near-boundary region just beyond the clipping threshold, and are therefore discarded by the standard hard-clipping rule. Notably, once this bottleneck is precisely identified, even simple stochastic perturbations at the boundary can recover meaningful performance gains. Building on this finding, we propose Near-boundary Stochastic Rescue (NSR), a minimal, plug-and-play modification that stochastically retains these slightly out-of-bound tokens to recover lost signals. While NSR, via stochastic sampling, can be interpreted as inducing an implicit gradient decay in expectation, our ablations reveal that its stochastic, boundary-local rescue mechanism is consistently more effective than deterministic gradient decay. Validated by extensive experiments across model sizes from 7B to 30B and both dense and MoE architectures, as a plug-and-play solution, NSR substantially improves training stability and delivers consistent gains over strong baselines such as DAPO and GSPO.
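The contrast between hard clipping and a stochastic near-boundary rescue can be sketched as a per-token gradient mask. This is only a reading of the abstract, not the authors' algorithm: the band width `delta`, the probability `rescue_prob`, and the simplification of ignoring the sign of the advantage are all assumptions.

```python
import random

def gradient_mask(ratios, eps=0.2, delta=0.05, rescue_prob=0.5, rng=None):
    """Return 1 for tokens that keep their gradient and 0 for tokens that are dropped."""
    rng = rng or random.Random(0)
    low, high = 1.0 - eps, 1.0 + eps
    mask = []
    for r in ratios:
        if low <= r <= high:
            mask.append(1)  # inside the clipping range: always kept
        elif low - delta <= r <= high + delta:
            # Slightly out of bounds: kept at random (the assumed rescue step).
            mask.append(1 if rng.random() < rescue_prob else 0)
        else:
            mask.append(0)  # far out of bounds: dropped, as under hard clipping
    return mask

# Hypothetical importance ratios for five tokens.
ratios = [1.0, 1.22, 0.77, 1.5, 0.4]
hard = gradient_mask(ratios, rescue_prob=0.0)    # reduces to hard clipping
always = gradient_mask(ratios, rescue_prob=1.0)  # keeps every near-boundary token
```

Setting `rescue_prob` between 0 and 1 keeps each near-boundary token with that probability, which in expectation scales its gradient contribution. That is the "implicit gradient decay" interpretation the abstract mentions.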

2605.22697 2026-05-22 cs.CV

Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions

Yannick Porto, Renato Martins, Thomas Chalumeau, Cedric Demonceaux

Affiliations: Université Bourgogne Europe, CNRS; TEB Group, Prynel SAS

AI summary: This paper proposes a cross-domain human action recognition method that combines multi-view motion cues with textual descriptions to improve the robustness and generalization of zero-shot action recognition models across domains.

Comments Accepted to ICPR 2026. Code and trained models available at: https://icb-vision-ai.github.io/OrientationAware-HAR

Abstract

Robustness to domain changes is a key capability for effective deployment of human action recognition systems in real-world scenarios, where action categories at inference can present important domain shifts or even unseen actions from training. In this context, improving the recognition capabilities of Zero-Shot Action Recognition models (ZSAR), without requiring strong annotation efforts, remains a central challenge. Most ZSAR approaches assume that actions are observed under geometric conditions similar to those seen during training. In practice, variations in human body orientation and camera viewpoint add a significant domain gap in ZSAR, substantially limiting generalization to novel action-motion combinations. In this context, this paper presents a novel orientation-aware action recognition approach with improved cross-domain capabilities. Our approach combines motion cues of multiple camera viewpoints and text descriptions of human actions in the training phase. We present a new orientation-aware motion encoding network to learn different motion features, and adapt a specific orientation-aware text prompt to match the corresponding features at inference. Extensive experiments demonstrate that the proposed method consistently improves ZSAR performance across different recognition benchmarks, outperforming recent state-of-the-art zero-shot approaches on NTU-RGB+D, BABEL, NW-UCLA, and on two surveillance datasets. In addition, the learned representations exhibit strong transfer learning capabilities, yielding competitive performance on both cross-domain and same-domain recognition of seen actions. Code and trained models are available at: https://icb-vision-ai.github.io/OrientationAware-HAR

2605.22695 2026-05-22 cs.CV

Improving Viewpoint-Invariance and Temporal Consistency for Action Detection

Yannick Porto, Renato Martins, Thomas Chalumeau, Cedric Demonceaux

Affiliations: Université Bourgogne Europe, CNRS, ICB; TEB Group, Prynel SAS

AI summary: This paper proposes a two-stage action detection method that strengthens viewpoint invariance and global temporal consistency, outperforming existing methods on the PKU-MMD and BABEL benchmarks.

Comments Accepted at ICIP 2026. Code and trained models are available at: https://icb-vision-ai.github.io/HydraView-TAD

Abstract

Viewpoint change invariance and action temporal consistency are critical aspects for the effective deployment of human action detection of untrimmed videos. Existing appearance-based video detection methods often struggle with limited viewpoint diversity during training, while motion-based detection approaches frequently fail to model fine-grained temporal relationships across consecutive motion windows. This paper introduces a novel two-stage action detection approach designed to improve both view-invariance and global temporal coherence properties. In the first stage, we extract motion features from augmented virtual viewpoints, solely used at training. Then, the second stage introduces a new view-invariant, multi-scale temporal encoder based on selective state-space sequence modelling to aggregate information across viewpoints and time scales. Experiments on PKU-MMD and BABEL benchmarks demonstrate that this approach significantly outperforms state-of-the-art methods in all considered splits. Code and trained models are available at: https://icb-vision-ai.github.io/HydraView-TAD

2605.22693 2026-05-22 cs.RO cs.AI

Scout-Assisted Planning for Heterogeneous Robot Teams under Partially Known Environments

Hoang-Dung Bui, Abhish Khanal, Raihan Islam Arnob, Gregory J. Stein

Affiliations: George Mason University

AI summary: This paper proposes Scout-Assisted Planning, a framework in which UAVs proactively gather environmental information to improve ground-vehicle navigation. Information-gain-guided action pruning reduces backtracking cost, and experiments across several environment types show substantially lower ground-robot travel cost.

Abstract

Autonomous robot teams navigating partially known environments face costly backtracking when ground robots encounter blocked roads that are only revealed upon physical traversal. We address this with Scout-Assisted Planning (SAP), a heterogeneous planning framework in which scouting Unmanned Aerial Vehicles proactively gather environmental information to improve Unmanned Ground Vehicle navigation. To focus scouting on the most consequential edges, we propose Information Gain-based Action Pruning, which scores candidate scouting actions by their expected impact on ground robot behavior. Since exact Information Gain-based Action Pruning computation is prohibitively expensive, we develop a Graph Neural Network-based model that predicts information gain values directly from graph structure and belief state, reducing planning time to real-time levels without sacrificing solution quality. Experiments across three environment types show that SAP with Information Gain Action Pruning reduces ground robot travel cost by 31.9% to 37.7% over the Canadian Traveler Problem baseline, and outperforms proximity-based scouting guidance by an additional 8% to 14%, confirming that principled information-gain-guided scouting is both more effective and computationally feasible for real-world deployment.

2605.22691 2026-05-22 cs.LG cond-mat.stat-mech

Posterior Collapse as Automatic Spectral Pruning

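Johannes Hirn

Affiliations: Image Processing Laboratory (IPL), Universitat de València, Paterna, València 46980, Spain

AI summary: This paper studies posterior collapse in β-VAEs and shows that it is in essence an automatic spectral pruning process. By analysing equilibrium solutions at different values of β, it exhibits a cascade of collapses in which latent modes decouple from least to most useful.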

Abstract

We show that posterior collapse in $β$-VAEs implements automatic spectral pruning. A latent mode collapses if its contribution to reconstruction is below the cutoff set by $β$. Equilibrium solutions with different $β$ thus reveal a cascade of collapses as latent modes decouple from least to most useful. We derive this as a consequence of the loss via a Landau stability analysis. We define a latent-rescaling-invariant order parameter that ranks active latent modes and whose collapse thresholds identify which effective variables to inspect first. In the linear Gaussian case, the collapse spectrum, utility spectrum, and normalized PCA spectrum coincide, and each collapse follows a mean-field law. We test these predictions on the WorldClim dataset.
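For readers less familiar with the setting, the standard β-VAE objective that the cutoff in the abstract refers to can be written as below. This is the textbook form with a diagonal Gaussian posterior and a standard normal prior; the paper's own notation, and the exact expression for the cutoff, are not given in the abstract and may differ.

```latex
% Standard beta-VAE loss for one data point x (textbook form, assumed notation)
\mathcal{L}_\beta(x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ -\log p_\theta(x \mid z) \right]
  + \beta \, D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)

% KL term for q_\phi(z \mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) and p(z) = \mathcal{N}(0, I)
D_{\mathrm{KL}}
  = \frac{1}{2} \sum_{j} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)
```

A collapsed latent mode j has μ_j = 0 and σ_j = 1 for every input, so its term in the sum vanishes: the mode pays no KL cost and carries no information about x. Raising β raises the price of every active mode, which is consistent with the abstract's picture of modes switching off one by one.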

2605.22681 2026-05-22 cs.AI

Forecasting Scientific Progress with Artificial Intelligence

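Sean Wu, Pan Lu, Yupeng Chen, Jonathan Bragg, Yutaro Yamada, Peter Clark, David Clifton, Philip Torr, James Zou, Junchi Yu

Affiliations: University of Oxford; Stanford University; Allen Institute for AI; Sakana AI

AI summary: This paper studies how well AI can forecast scientific progress. It proposes a temporally grounded evaluation framework and introduces the CUSP benchmark, which assesses scientific forecasting through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Current frontier models show systematic, domain-dependent limitations, and their performance is largely insensitive to whether events fall before or after the training cutoff, indicating that AI remains unreliable for scientific forecasting.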

Comments 73 pages, 13 figures, 29 tables

Abstract

Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary and event-level benchmark that evaluates scientific forecasting in AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic and domain-dependent limitations in current frontier models. While models can identify plausible research directions from competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Performance is highly heterogeneous across domains, with the timing of AI progress more predictable than advances in biology, chemistry, and physics. Performance is largely insensitive to whether events occur before or after the training cutoff, suggesting these limitations cannot be explained solely by knowledge exposure in training data. Under controlled information access, additional pre-cutoff knowledge improves performance but does not close the gap to full-information settings, which becomes more pronounced for high-citation advances. Models also exhibit systematic overconfidence and strong response biases, indicating unreliable uncertainty estimation. Taken together, current AI systems fall short as predictive tools for scientific progress. Access to prior knowledge does not translate into reliable forecasting, and performance benefits more from post-event information than from forward-looking prediction.

2605.22679 2026-05-22 cs.CV cs.LG

Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

Piotr Kubaty, Patryk Marszałek, Łukasz Struski, Adam Wróbel, Jacek Tabor, Marek Śmieja

Affiliations: Faculty of Mathematics and Computer Science, Jagiellonian University; Doctoral School of Exact and Natural Sciences, Jagiellonian University; Centre for Credible AI, Warsaw University of Technology

AI summary: This paper proposes CEDAR, a sparse disentanglement method that reveals the compositional structure of pretrained embeddings without increasing their dimensionality, improving the interpretability of vision-language models and their alignment with human perception.

Abstract

Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, which compromises the original geometry and introduces redundancy. We introduce CEDAR (Conceptual Embedding Disentanglement via Adaptive Rotation), a post-hoc method that reveals the compositional structure of pretrained embeddings without increasing dimensionality. By learning an invertible transformation with a top-$k$ sparsity bottleneck, CEDAR concentrates semantic information into axis-aligned disentangled coordinates. In CLIP-like architectures, individual coordinates can be interpreted with textual concepts, while for generative models such as BLIP, they can be decoded into natural language descriptions. Experiments demonstrate that CEDAR achieves a competitive reconstruction-sparsity trade-off while producing explanations that are more interpretable and better aligned with human perception. Our results suggest that the apparent entanglement in vision-language representations can be resolved through a suitable change of basis, eliminating the need for overcomplete expansions.
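The idea that a change of basis can make a sparse code lossless is easy to see in two dimensions. The sketch below uses a fixed, hand-picked rotation in place of CEDAR's learned invertible transformation, so it illustrates the principle only; the function names and the toy embedding are assumptions.

```python
import math

def rotate(vec, theta):
    """Rotate a 2-D vector by theta radians (an invertible, norm-preserving map)."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return [c * x - s * y, s * x + c * y]

def top_k(vec, k):
    """Keep the k largest-magnitude coordinates and zero out the rest."""
    keep = set(sorted(range(len(vec)), key=lambda i: -abs(vec[i]))[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]

def encode_decode(vec, theta, k):
    """Rotate, apply the top-k bottleneck, then rotate back."""
    coords = top_k(rotate(vec, theta), k)
    return coords, rotate(coords, -theta)

embedding = [1.0, 1.0]

# In the original basis, keeping one coordinate loses half the vector.
_, recon_plain = encode_decode(embedding, 0.0, k=1)
error_plain = math.dist(embedding, recon_plain)

# After a suitable rotation, one coordinate carries everything.
coords, recon_rotated = encode_decode(embedding, -math.pi / 4, k=1)
error_rotated = math.dist(embedding, recon_rotated)
```

No extra dimensions are introduced: the code stays 2-dimensional, in contrast to the overcomplete expansion an SAE would use.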

2605.22678 2026-05-22 cs.CV cs.AI

Swift Sampling: Selecting Temporal Surprises via Taylor Series

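Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian, Deepti Ghadiyaram

Affiliations: Boston University; Microsoft Research India

AI summary: This work proposes Swift Sampling, a training-free frame selection algorithm. It models a video as a differentiable trajectory in the visual latent space and uses a Taylor expansion to predict the path of subsequent frames, automatically identifying highly informative, temporally surprising frames and improving performance on long-video question answering.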

Abstract

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is incredibly lightweight, adding only 0.02x additional computational cost over the baseline, which makes its overhead 30x cheaper than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially powerful for long videos with limited frame budgets, improving accuracy by up to +12.5 points.
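The velocity, acceleration, and Taylor-projection steps described in the abstract can be sketched with finite differences. This is a hedged illustration rather than the paper's method: the backward-difference estimates, the Euclidean surprise score, and the names `surprise_scores` and `select_frames` are assumptions.

```python
import math

def surprise_scores(features):
    """Score frame t+1 by its distance from a second-order Taylor prediction."""
    scores = [0.0] * len(features)  # the first three frames lack enough history
    for t in range(2, len(features) - 1):
        predicted = []
        for p2, p1, cur in zip(features[t - 2], features[t - 1], features[t]):
            velocity = cur - p1                 # first backward difference
            acceleration = cur - 2 * p1 + p2    # second backward difference
            predicted.append(cur + velocity + 0.5 * acceleration)
        scores[t + 1] = math.dist(features[t + 1], predicted)
    return scores

def select_frames(features, budget):
    """Pick the `budget` most surprising frames, returned in temporal order."""
    scores = surprise_scores(features)
    ranked = sorted(range(len(features)), key=lambda i: -scores[i])
    return sorted(ranked[:budget])

# Toy 1-D "features": steady motion with an abrupt jump at frame 5.
frames = [[0.0], [1.0], [2.0], [3.0], [4.0], [10.0], [11.0], [12.0]]
scores = surprise_scores(frames)
picked = select_frames(frames, budget=3)
```

In this toy run the steady frames score zero, while the jump and the two frames after it score highest, because the extrapolation needs a few frames of fresh history before it settles again.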

2605.22677 2026-05-22 cs.CV

Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment

Slimmable ConvNeXt: 适用于高效多设备部署的宽度自适应推理

Janek Haberer, Jon Eike Wilhelm, Olaf Landsiedel

发表机构 * Kiel University(基尔大学) Hamburg University of Technology (TUHH)(汉堡工业大学) UNU-INWEH

AI总结 本文提出Slimmable ConvNeXt,通过训练包含多个嵌套子网络的共享权重集,实现宽度自适应推理,从而在不同资源约束的设备上高效部署模型。该方法利用ConvNeXt的现代设计,如LayerNorm和倒置瓶颈结构,实现了通道宽度压缩,减少了归一化开销,并提供了更简单的训练流程。

Comments Accepted at Mobile AI Workshop 2026 (CVPR'26 Workshop)

详情
AI中文摘要

在资源约束变化的设备上部署视觉模型,或在单个设备上由于电池状态、热 throttling 或延迟截止而变化的计算资源,通常需要训练和维护多个模型。宽度自适应推理通过训练一组共享权重,其中包含多个嵌套子网络,这些子网络具有递增的容量,从而解决这一问题。尽管之前的CNN方法需要可切换的批量归一化,而近期可扩展方法则集中在视觉Transformer上,本文提出了Slimmable ConvNeXt,证明了ConvNeXt的现代设计,特别是LayerNorm和倒置瓶颈结构,使其特别适合通道宽度压缩,消除了经典可压缩网络的归一化开销,并提供了比之前CNN和ViT方法更简单的训练流程。在ImageNet-1k上,Slimmable ConvNeXt-T在3个子网络的情况下,以4.5 GMACs达到80.8%的top-1准确率,以1.2 GMACs达到77.4%的准确率,训练了600个epoch。在同等计算量下,这超过了HydraViT的6头子网络(78.4%在4.6 GMACs)2.4个百分点,以及其3头配置(73.0%在1.3 GMACs)4.4个百分点,同时在相同GMACs下也超过了MatFormer-S(78.6%)和SortedNet-S(78.2%)。将规模扩展到Slimmable ConvNeXt-B进一步将最大准确率提高到15.35 GMACs时的82.8%。

英文摘要

Deploying vision models across devices with varying resource constraints, or even on a single device where available compute fluctuates due to battery state, thermal throttling, or latency deadlines, typically requires training and maintaining separate models. Width-adaptive inference addresses this by training a single set of shared weights containing multiple nested subnetworks of increasing capacity, but prior CNN-based approaches required switchable batch normalization, while recent scalable methods have focused on Vision Transformers. We present Slimmable ConvNeXt, which shows that ConvNeXt's modern design, specifically LayerNorm and inverted bottlenecks, makes it particularly suited for channel-width slimming, eliminating the normalization overhead of classical slimmable networks and producing a simpler training pipeline than both prior CNN and ViT approaches. On ImageNet-1k, Slimmable ConvNeXt-T with 3 subnetworks achieves 80.8% top-1 accuracy at 4.5 GMACs and 77.4% at 1.2 GMACs, trained from scratch for 600 epochs. At comparable compute, this exceeds HydraViT's 6-head subnetwork (78.4% at 4.6 GMACs) by 2.4 percentage points and its 3-head configuration (73.0% at 1.3 GMACs) by 4.4 percentage points, while also outperforming MatFormer-S (78.6%) and SortedNet-S (78.2%) at the same GMACs. Scaling to Slimmable ConvNeXt-B further improves maximum accuracy to 82.8% at 15.35 GMACs.
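To make the role of LayerNorm in width slimming concrete, here is a simplified sketch of one inverted-bottleneck block whose sub-networks reuse the leading channels of a shared weight set. It is an illustration under stated assumptions, not the paper's architecture: the depthwise convolution is omitted, ReLU stands in for GELU, and the class name `SlimmableBlock` is hypothetical.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Statistics are computed per sample over the active channels only,
    # so no per-width running statistics need to be stored.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

class SlimmableBlock:
    """Inverted bottleneck (dim -> expansion*dim -> dim) with nested widths."""

    def __init__(self, dim, expansion=4, seed=0):
        rng = np.random.default_rng(seed)
        hidden = dim * expansion
        self.expansion = expansion
        self.gamma = np.ones(dim)
        self.beta = np.zeros(dim)
        self.w1 = rng.normal(0.0, 0.02, (dim, hidden))
        self.w2 = rng.normal(0.0, 0.02, (hidden, dim))

    def forward(self, x):
        # The active width is inferred from the input's channel count;
        # a narrower sub-network uses the leading slice of every tensor.
        d = x.shape[-1]
        h = d * self.expansion
        y = layer_norm(x, self.gamma[:d], self.beta[:d])
        y = np.maximum(y @ self.w1[:d, :h], 0.0)  # ReLU stand-in for GELU
        return x + y @ self.w2[:h, :d]
```

With batch normalization, each width would need its own running mean and variance, which is the "switchable" overhead the abstract refers to; here the same `gamma` and `beta` vectors are simply sliced.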

2605.22675 2026-05-22 cs.CL

Self-Policy Distillation via Capability-Selective Subspace Projection

通过能力选择性子空间投影实现自我策略蒸馏

Guangya Hao, Yitong Shang, Yunbo Long, Zhuokai Zhao, Hanxue Liang

发表机构 * University of Cambridge(剑桥大学) HKUST(香港科技大学) University of Chicago(芝加哥大学)

AI总结 本文提出Self-Policy Distillation(SPD),从模型自身梯度中提取低秩能力子空间,将键值(KV)激活投影到该子空间,并以标准的下一词元预测损失进行微调,实现了无需外部信号的可泛化、能力选择性的自我蒸馏。

详情
AI中文摘要

自我蒸馏通过在模型自身生成的内容上训练来提升大语言模型(LLMs)。然而,现有方法要么依赖外部信号来筛选自生成输出(例如正确性过滤、执行反馈和奖励搜索),这些信号成本高,且对表现最佳的前沿模型不可用;要么完全跳过筛选,直接在所有原始输出上训练,这种做法通常局限于特定领域且难以泛化。两者还有一个更深层的共同弱点:自生成输出会将任务相关能力与其他因素(如风格模式、格式瑕疵和模型特定错误)纠缠在一起,稀释了待提升的特定能力的信号。在本文中,我们提出Self-Policy Distillation(SPD),实现了无需任何外部信号的可泛化、能力选择性的自我蒸馏。具体而言,SPD从模型自身在决定正确性的词元上的梯度中提取低秩能力子空间,在自我生成过程中将键值(KV)激活投影到该子空间,并以标准的下一词元预测损失在所得原始输出上进行微调。通过在代码生成、数学推理和多项选择问答上的大量实验,我们表明SPD比无需外部信号的最先进自我蒸馏方法最多提升13%,比预训练基线最多提升16%。值得注意的是,SPD展现出更强的泛化能力,在域外泛化设置下性能高出15%。

英文摘要

Self-distillation bootstraps large language models (LLMs) by training on their own generations. However, existing methods either rely on external signals to curate self-generated outputs (e.g., correctness filtering, execution feedback, and reward search), which are costly and unavailable for the best-performing frontier models, or skip curation entirely and train on all raw outputs, an approach that is often domain-specific and hard to generalize. Both also share a deeper weakness: self-generated outputs entangle the task-relevant capability with other factors, such as stylistic patterns, formatting artifacts, and model-specific errors, diluting the signal for the specific capability one aims to improve. In this paper, we propose Self-Policy Distillation (SPD), which achieves generalizable, capability-selective self-distillation without any external signal. Specifically, SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, projects key-value (KV) activations into this subspace during self-generation, and fine-tunes on the resulting raw outputs with standard next-token prediction loss. Through extensive experiments across code generation, mathematical reasoning, and multiple-choice QA, we show that SPD achieves up to 13% improvement over state-of-the-art self-distillation methods without external signals and up to 16% improvement over pre-trained baselines. Notably, SPD demonstrates superior generalizability, achieving 15% better performance under out-of-domain generalization settings.
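The two linear-algebra steps named in this abstract, extracting a low-rank subspace from gradients and projecting activations into it, can be sketched generically as follows. The abstract does not specify which gradients are stacked, which layers are used, or how the projection is applied to the KV cache, so the SVD-based construction and the function names `capability_subspace` and `project` are assumptions for illustration only.

```python
import numpy as np

def capability_subspace(grads: np.ndarray, rank: int) -> np.ndarray:
    """Return an orthonormal basis (d, rank) for the top-`rank` directions.

    grads: array of shape (n, d); each row is one gradient vector
    (assumed here to come from correctness-defining tokens).
    """
    # Right singular vectors span the dominant directions of the gradient rows.
    _, _, vt = np.linalg.svd(grads, full_matrices=False)
    return vt[:rank].T

def project(activations: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Orthogonally project activations (..., d) onto the span of `basis`."""
    return activations @ basis @ basis.T
```

Because the basis has orthonormal columns, `project` is idempotent: applying it twice gives the same result as applying it once, and components outside the subspace are removed.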

2605.22668 2026-05-22 cs.CV

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

SEGA:用于扩散变换器中分辨率外推的频谱-能量引导注意力

Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B. Lindell, Babak Taati

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所)

AI总结 SEGA通过动态调整注意力权重来提升扩散变换器在高分辨率生成中的表现,其核心方法是根据潜在空间的频谱结构调整RoPE组件的注意力缩放,从而在保持全局结构和恢复细节方面取得平衡。

Comments 27 pages, 14 figures. Project page: https://rajabi2001.github.io/sega/

详情
AI中文摘要

扩散变换器(DiTs)已成为文本到图像生成的主导架构,但其在生成超出训练范围的分辨率时性能下降。现有的无训练方法通过修改推理时的注意力行为来缓解这一问题,通常通过旋转位置嵌入(RoPE)外推结合注意力缩放。然而,这些策略在RoPE组件上采用统一且内容无关的缩放,具有不同的频率特性,导致在保持全局结构和恢复细节之间产生权衡。我们引入SEGA,一种无训练方法,根据每个去噪步骤中潜在空间的空间-频率结构动态调整注意力缩放。这种自适应缩放提高了结构一致性和细节保真度。实验表明,SEGA在多个目标分辨率上均能提升高分辨率合成性能,优于最先进的无训练基线。

英文摘要

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.

2605.22662 2026-05-22 cs.AI

Claw AI Lab: An Autonomous Multi-Agent Research Team

Claw AI Lab:一个自主多智能体研究团队

Fan Wu, Cheng Chen, Zhenshan Tan, Taiyu Zhang, Xinzhen Xu, Yanyu Qian, Dingcheng Gao, Lanyun Zhu, Qi Zhu, Yi Tan, Deyi Ji, Guosheng Lin, Tianrun Chen, Deheng Ye, Fayao Liu

发表机构 * NTU(南洋理工大学) A*STAR(新加坡科技研究局) Moxin Technology Co., LTD(摩新科技有限公司) NUIST(南京信息工程大学) THU(清华大学) USTC(中国科学技术大学)

AI总结 本文提出Claw AI Lab,一种自主研究平台,通过隐藏的提示到论文流程实现自动化研究,并提供交互式AI实验室。该平台允许用户通过一个提示创建完整的研究团队,支持自定义角色、协作流程、实时监控、 artifact检查和回滚/恢复控制。Claw-Code Harness连接本地代码库、数据集和检查点,提高实验执行、完成和结果完整性。在内部评估中,Claw AI Lab在想法新颖性、实验完整性和论文质量上被AI专家评委一致偏好。

Comments Project page and code are available at https://github.com/Claw-AI-Lab/Claw-AI-Lab

详情
AI中文摘要

我们介绍了Claw AI Lab,一个实验室原生的自主研究平台,将自动化研究从隐藏的提示到论文流程推进到交互式AI实验室。与围绕单一智能体或固定顺序工作流中心化系统不同,我们允许用户通过一个提示实例化完整的研究团队,支持自定义角色、协作流程、实时监控、artifact检查以及回滚/恢复控制,通过统一仪表板。该平台还支持探索、多智能体讨论和再现三种不同的研究模式,使自主研究在实践中变得更加可控和实验室化。Claw AI Lab的关键实际贡献在于其Claw-Code Harness,它将本地代码库、数据集和检查点连接到可运行的实验,并将执行artifact反馈到研究循环中。结果,Harness不仅提高了执行集成,还提高了实验完成和结果完整性:实验更容易检查、迭代和忠实转移到最终论文,减少了部分运行和格式错误报告等常见故障模式。在我们内部评估的五个AI研究案例研究中,使用AutoResearchClaw作为基线,Claw AI Lab在想法新颖性、实验完整性和论文质量上被AI专家评委一致偏好。我们视Claw AI Lab为一种新范式的第一步:自主研究作为可使用、交互式和可靠性感知的科学基础设施。

英文摘要

We present Claw AI Lab, a lab-native autonomous research platform that advances automated research from a hidden prompt-to-paper pipeline into an interactive AI laboratory. Rather than centering the system around a single agent or a fixed serial workflow, we allow users to instantiate a full research team from one prompt, with customizable roles, collaborative workflows, real-time monitoring, artifact inspection, and rollback/resume control through a unified dashboard. The platform also supports distinct research modes for exploration, multi-agent discussion, and reproduction, making autonomous research substantially more steerable and laboratory-like in practice. A key practical contribution of Claw AI Lab lies in its Claw-Code Harness, which connects local codebases, datasets, and checkpoints to runnable experiments and feeds execution artifacts back into the research loop. As a result, the harness improves not only execution integration, but also experimental completion and result integrity: experiments are easier to inspect, iterate on, and faithfully transfer into final papers, reducing common failure modes such as partial runs and malformed result reporting. In our internal evaluation on five AI research case studies, using AutoResearchClaw as the baseline, Claw AI Lab is consistently preferred by AI expert judges on idea novelty, experiment completeness, and paper presentation quality. We view Claw AI Lab as an early step toward a new paradigm: autonomous research as usable, interactive, and reliability-aware scientific infrastructure.

2605.22660 2026-05-22 cs.CL cs.AI

Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora

道德语义在机器翻译中得以保留:来自道德基础语料库的跨语言证据

Maciej Skorski

发表机构 * University of Luxembourg(卢森堡大学)

AI总结 本研究探讨了基于LLM的翻译是否能弥合道德价值观分类中语言特定标注语料库的差距,通过波兰语案例展示直接翻译能有效保留微妙的道德线索,为资源匮乏语言的道德研究提供了可行路径。

详情
AI中文摘要

道德语言具有微妙性和文化差异性,使得跨语言忠实翻译极具挑战性。习语、俚语和文化参考会引入难以避免的翻译痕迹。然而,自动道德价值观分类依赖于几乎只存在于英语中的语言特定标注语料库。我们研究了基于LLM的翻译是否能弥合这一差距,以波兰语为测试案例。使用约5万条涵盖广泛主题的道德标注社交媒体帖子,我们应用了一个系统化的四方法验证流程:LaBSE跨语言嵌入相似性、中心核对齐(CKA)、LLM作为评判者评估以及深度学习分类器公平性测试。我们证明,尽管在处理俚语、粗俗语言和文化负载表达方面存在不足,直接翻译能够很好地保留微妙的道德线索,这些线索足以被跨语言机器学习系统捕获——在所有基础方面,平均余弦相似度为0.86,AUC差距在0.01-0.02之间,经过语言模型微调后进一步缩小。这些结果表明,机器翻译是实现当前资源匮乏语言中道德研究的实用且成本效益高的途径。我们以波兰语作为代表性的斯拉夫语言展示了这一点,并预期可推广到相关语言。

英文摘要

Moral language is subtle and culturally variable, making it difficult to translate faithfully across languages. Idiomatic expressions, slang, and cultural references introduce hard-to-avoid translation artifacts. Yet automated moral values classification depends on language-specific annotated corpora that exist almost exclusively in English. We investigate whether LLM-based translation can bridge this gap, taking Polish as a test case. Using about 50k morally-annotated social media posts from a diverse range of topics, we apply a principled four-method validation pipeline: LaBSE cross-lingual embedding similarity, Centered Kernel Alignment (CKA), LLM-as-judge evaluation, and deep learning classifier parity tests. We show that despite shortcomings in handling slang, vulgarity, and culturally-loaded expressions, direct translation preserves subtle moral cues well enough to be harvested by cross-lingual machine learning, with a mean cosine similarity of 0.86 and AUC gaps of 0.01-0.02 across all foundations, which close further when the language models are fine-tuned. These results demonstrate that machine translation is a practical and cost-effective path to moral values research in languages currently under-resourced in this domain. We demonstrate this for Polish as a representative Slavic language, with expected generalisation to related languages.
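Two of the four validation measures named above, paired cosine similarity and CKA, are simple to compute once sentence embeddings are available. The sketch below assumes the linear variant of CKA, which the abstract does not confirm, and takes embeddings as plain arrays rather than calling any embedding model; `mean_cosine` and `linear_cka` are hypothetical helper names.

```python
import numpy as np

def mean_cosine(X: np.ndarray, Y: np.ndarray) -> float:
    """Mean cosine similarity between paired rows of X and Y, both (n, d)."""
    dots = (X * Y).sum(axis=1)
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    return float((dots / norms).mean())

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two sets of representations.

    X: (n, d1) and Y: (n, d2), with rows paired (e.g. source post and
    its translation). Returns a value in [0, 1].
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Xc.T @ Yc, "fro") ** 2
    self_x = np.linalg.norm(Xc.T @ Xc, "fro")
    self_y = np.linalg.norm(Yc.T @ Yc, "fro")
    return float(cross / (self_x * self_y))
```

Cosine similarity compares each pair individually, while linear CKA compares the geometry of the two embedding sets as a whole and is unchanged by rotating or uniformly rescaling either set, which is why the two measures complement each other.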

2605.22658 2026-05-22 cs.CV cs.LG cs.MM eess.IV

SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

SegCompass: 探索通过稀疏自编码器实现可解释对齐以增强推理分割

Zhenyu Lu, Liupeng Li, Jinpeng Wang, Haoqian Kang, Yan Feng, Ke Chen, Yaowei Wang

发表机构 * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(深圳先进技术研究院,中国科学院) Peng Cheng Laboratory(鹏城实验室) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Meituan, Beijing(美团,北京) University of Chinese Academy of Sciences(中国科学院大学) College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院)

AI总结 本文提出SegCompass,一种通过稀疏自编码器实现可解释对齐的端到端模型,以提升推理分割的性能和可解释性。

Comments Accepted by CVPR 2026. 15 pages, 9 figures, 6 tables

详情
AI中文摘要

尽管大语言模型提供了强大的组合推理能力,但现有推理分割流程未能清晰地将这种推理与视觉感知连接起来。当前方法,如潜在查询对齐,虽然端到端但却是不透明的“黑箱”。相反,文本定位读出仅可读但不真正可解释,通常作为无约束的后处理步骤。为弥合这一可解释性差距,我们提出了SegCompass,一种端到端模型,利用稀疏自编码器(SAE)建立一个显式、可解释且可微的对齐路径。给定一个图像-指令对,SegCompass首先生成一个思维链(CoT)轨迹。该方法的核心是一个将CoT和视觉标记映射到共享高维稀疏概念空间的SAE。一个查询代码本从该空间中选择显著概念,然后通过槽映射器在空间上定位到多槽热图,引导最终的掩码解码器。整个模型联合训练,将强化学习用于推理路径与标准分割监督相结合。这种由SAE驱动的接口提供了显著比潜在查询更可追溯的“白盒”连接,比文本读出更连贯。在五个具有挑战性的基准测试中,SegCompass匹配或超越了最先进的性能。关键的是,我们的视觉和定量分析显示,所学稀疏概念的质量与最终掩码准确性之间存在强相关性,证实了SegCompass通过其增强且可检查的对齐实现了优越的结果。代码可在https://github.com/ZhenyuLU-Heliodore/SegCompass获取。

英文摘要

While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step. To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. Given an image-instruction pair, SegCompass first generates a chain-of-thought (CoT) trace. The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high-dimensional sparse concept space. A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi-slot heatmap that guides the final mask decoder. The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision. This SAE-driven interface provides a "white-box" connection that is significantly more traceable than latent queries and more coherent than textual readouts. Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state-of-the-art performance. Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment. Code is available at https://github.com/ZhenyuLU-Heliodore/SegCompass.

2605.22654 2026-05-22 cs.CL cs.CV

Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs

看见诗歌:基于多模态大语言模型的AI生成现代汉语诗歌的图像-语义检测

Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu, Jingming Yao, Chengzhong Xu, Jiebo Luo, Derek F. Wong

发表机构 * Department of Computer and Information Science, University of Macau(澳门大学计算机与信息科学系) University of Rochester(罗切斯特大学) Sichuan University(四川大学) Department of Portuguese, Faculty of Arts and Humanities, University of Macau(澳门大学人文学院葡萄牙语系)

AI总结 本文提出了一种图像-语义引导的诗歌检测方法,通过整合图像内容与诗歌文本信息,提升大语言模型在检测现代汉语诗歌中的性能,实验结果表明该方法在多个数据集上均优于传统方法。

详情
AI中文摘要

先前的检测研究显示,LLMs无法有效用作检测器,但这些研究未涉及现代汉语诗歌。此外,没有相关研究探讨LLMs在检测现代汉语诗歌中的性能。本文评估并提升了LLMs作为现代汉语诗歌检测器的性能,并提出了一种图像-语义引导的诗歌检测方法。与传统检测方法相比,我们的方法创新性地整合了反映诗歌内容的图像。通过示例驱动的方法,我们的方法有效整合了图像中的意义、意象和情感信息,然后与诗歌文本形成互补判断。实验结果表明,基于我们方法的LLM检测器在多个数据集上均优于基于纯文本的基线检测器,甚至超越了表现最佳的传统检测器RoBERTa。使用我们方法的Gemini检测器在Macro-F1得分上达到85.65%,达到最先进的水平。不同LLM检测器在多个LLM生成数据上的性能提升证明了我们方法的有效性。

英文摘要

Previous detection studies have shown that LLMs cannot be effectively used as detectors, but these studies have not addressed modern Chinese poetry. Moreover, no relevant research has explored the performance of LLMs in detecting modern Chinese poetry. This paper evaluates and enhances the performance of LLMs as detectors for modern Chinese poetry, and proposes an image-semantic guided poetry detection method. Compared with traditional detection approaches, our method innovatively incorporates images that reflect the content of the poetry. Through example-driven approaches, our method effectively integrates information such as meaning, imagery, and feeling from the image, then forms a complementary judgment with the poem text. Experimental results demonstrate that the LLM detectors based on our method outperform baseline detectors based on plain text, and even surpass the best-performing traditional detector, RoBERTa. The Gemini detector using our method achieves a Macro-F1 score of 85.65%, reaching the state-of-the-art level. The performance improvements of different LLM detectors on data generated by multiple LLMs demonstrate the effectiveness of our method.

2605.22651 2026-05-22 cs.CV

What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

图像描述究竟说了什么?用于视觉语言预训练中组合数据选择的反事实短语干预

Hyejin Go, Semi Lee, Hyesong Choi

发表机构 * Soongsil University(崇实大学)

AI总结 本文研究了在视觉语言预训练中如何通过反事实短语干预来改进组合数据选择,提出了CPI方法以解决现有方法中全局过滤信号失效的问题,从而提升模型在关系识别任务上的表现。

Comments 11 pages, 2 figures, 4 tables. Preprint

详情
AI中文摘要

CLIP风格的对比预训练通常使用样本级过滤信号来筛选网络规模的图像-文本对,这些信号往往基于图文对级别的对齐度。我们证明这种信号会饱和:一旦粗粒度的不匹配被移除,更严格的全局过滤便不再反映所保留图像描述提供的组合监督。原因是结构性的:全局评分把“图文对是否大体合理”与“描述中的各个对象、属性和关系短语是否实质性地支撑图文匹配”混为一谈。后者正是组合泛化所需要的,而图文对级别的过滤器对此视而不见。我们提出反事实短语干预(CPI)来解决这一问题,这是一种短语级数据筛选框架,将受控的无义词元(nonce token)替换转化为以图像为条件的短语敏感性评分。CPI仅将全局对齐用于移除粗粒度不匹配,然后根据描述短语在受控替换下是否可测量地影响图文评分,对剩余样本池进行排序。我们将CPI定位为一阶短语敏感性信号,而非定位(grounding)或识别结果,并在CC3M规模上进行评估。按该信号排序得到的50%数据子集在VL-CheckList-VG Relation上比全量数据基线提高+1.91,在相同预算下比仅基于对齐的过滤提高+1.00,同时提升了SugarCrepe的整体表现并保持了通用迁移能力。CPI与损失函数正交:原样应用于NegCLIP时,它使VL-CheckList-VG Relation进一步提高+3.84,正文中还报告了在CE-CLIP上的额外增益。

英文摘要

CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision provided by the retained captions. The reason is structural: a global score conflates whether a pair is broadly plausible with whether the individual object, attribute, and relation phrases inside the caption materially support the image-text match. The latter is what compositional generalization demands, yet pair-level filters are blind to it. We address this with Counterfactual Phrase Intervention (CPI), a phrase-level curation framework that converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores. CPI uses global alignment only for coarse mismatch removal, then ranks the surviving pool by whether caption phrases measurably affect the image-text score under controlled substitution. We frame CPI as a first-order phrase-sensitivity signal rather than a grounding or identification result, and evaluate it at CC3M scale. Ranking by this signal yields a 50%-data subset that improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, while improving SugarCrepe overall and preserving general transfer. CPI is loss-orthogonal: applied unchanged to NegCLIP, it further improves VL-CheckList-VG Relation by +3.84, with additional CE-CLIP gains reported in the main text.
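The phrase-level intervention described above can be sketched as a score difference under substitution. Everything concrete here is an assumption for illustration: the nonce word `blicket`, the pre-split phrase list, and especially `toy_score`, a word-overlap stand-in for a real image-text scorer such as a CLIP similarity. The paper's actual scoring and ranking procedure is not given in the abstract.

```python
from typing import Callable, List, Sequence

def phrase_sensitivity(
    phrases: Sequence[str],
    score: Callable[[str], float],
    nonce: str = "blicket",
) -> List[float]:
    """Score drop when each phrase is replaced by a nonce token.

    `score` maps a caption to an image-text score for one fixed image.
    A large drop means the phrase materially supports the match.
    """
    full = score(" ".join(phrases))
    drops = []
    for i in range(len(phrases)):
        swapped = list(phrases)
        swapped[i] = nonce
        drops.append(full - score(" ".join(swapped)))
    return drops

# Toy stand-in scorer (hypothetical): fraction of image tags named in the caption.
IMAGE_TAGS = {"dog", "red", "ball"}

def toy_score(caption: str) -> float:
    return len(IMAGE_TAGS & set(caption.split())) / len(IMAGE_TAGS)

sensitivities = phrase_sensitivity(["a dog", "chasing", "a red ball"], toy_score)
```

With this toy scorer the relation phrase "chasing" gets zero sensitivity, which mirrors the abstract's point: a pair can score well globally even when some caption phrases contribute nothing measurable to the match.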

2605.22650 2026-05-22 cs.CL

Whose Voice Counts? Mapping Stakeholder Perspectives on AI Through Public Submissions to the U.S. Government

谁的声音被听见?通过美国政府公开提交的材料映射利益相关者的AI观点

Alina Karakanta, Alex Christiansen, Tomás Dodds, Bissie Anderson, Matteo Fuoli, Marcus Perlman, Aletta G. Dorst

发表机构 * Leiden University Centre for Linguistics, Leiden University(莱顿大学语言学中心,莱顿大学) Department of Linguistics and Communication, University of Birmingham(伯明翰大学语言学与交流系) School of Journalism and Mass Communication, University of Wisconsin-Madison(威斯康星大学麦迪逊分校新闻与大众传播学院)

AI总结 本文通过分析美国政府AI行动计划公众咨询期间提交的信件,探讨不同利益相关者对AI的看法,发现个人更关注AI对生活的影响,而其他群体更关注AI发展,揭示了AI行动计划主要反映私营部门的关切。

详情
AI中文摘要

随着人工智能(AI)系统在日常生活中的普及,了解不同利益相关者如何理解和设想这些技术在塑造社会、政治和经济现实中的作用变得至关重要。本文基于特朗普政府AI行动计划公众咨询期间提交的信件语料库,调查公众对AI的看法。为此,我们发布了一个语料库清理流程,并通过主题建模和频率分析来探索不同子群体(如学术界、个人、私营部门)讨论的主要主题以及AI行动计划中出现的主题。我们的结果表明,个人对AI对生活的影响表达了强烈担忧,而其他利益相关者则更关注AI的发展。我们的主题比较显示,AI行动计划主要反映了私营部门对安全、政策和发展方面的关切,而个人的关切则代表性较低。

英文摘要

As artificial intelligence (AI) systems become more common in our daily lives, it is important to understand how different stakeholders comprehend and envisage the role that these technologies play in shaping social, political, and economic realities. In this paper, we investigate public perceptions of AI based on a corpus of letters submitted during the public consultation for the Trump Administration's US AI Action Plan. To this aim, we release a corpus cleaning pipeline and perform topic modelling and frequency analysis to explore predominant topics discussed by different subgroups (e.g., academia, individuals, private sector) and those appearing in the AI Action Plan. Our results show that individuals voice strong concerns related to the impact of AI on life, while other stakeholders are more concerned with AI development. Our comparison of topics suggests that the AI Action Plan reflects predominantly the concerns of the private sector on security, policies, and development, with individuals' concerns less represented.

2605.22649 2026-05-22 cs.CV cs.LG

From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder

从基线到随访:利用因果层次变分自编码器在UK Biobank中进行反事实脊柱DXA图像合成

Yilin Zhang, Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar

发表机构 * School of Electronics and Computer Science(电子与计算机科学学院) University of Southampton(南安普顿大学) MRC Lifecourse Epidemiology Centre(英国医学研究理事会生命历程流行病学中心) University of Southampton, Southampton General Hospital(南安普顿大学,南安普顿总医院) Computer Science University of Southampton(南安普顿大学计算机科学)

AI总结 本文提出了一种基于元数据的因果层次变分自编码器,用于在UK Biobank中生成一致的脊柱DXA图像,通过基线到随访的设置评估因果一致性,展示了年龄干预下关键椎体形态学变量的高一致性,支持了在解剖上合理的DXA图像合成。

Comments 7 pages, 4 figures, 3 tables. Accepted at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)

详情
AI中文摘要

双能X射线吸收法(DXA)广泛用于大规模骨骼评估,但学习可控且可解释的因子特异性解剖变异仍具挑战性。我们提出了一种基于元数据的因果层次变分自编码器(CHVAE),用于在UK Biobank(UKB)中因果一致地生成前后位(AP)脊柱DXA图像。模型在3,743个原始AP脊柱扫描(来自首次成像访问)上进行训练,并基于基本参与者属性和腰椎形态学进行条件化。因果一致性在基线到随访的设置中通过 abduction--action--prediction(AAP)进行评估:潜在变量从基线图像中抽象出来,年龄被干预到重复成像值,然后将产生的反事实随访形态学与观察到的重复成像测量进行比较。结果表明,在年龄干预下,关键椎体形态学变量的绝对一致性较高,支持了与干预对齐的、在解剖上合理的DXA图像合成。

英文摘要

Dual-energy X-ray absorptiometry (DXA) is widely used for large-scale skeletal assessment, yet learning controllable and interpretable factor-specific anatomical variation remains challenging. We propose a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) for causally consistent generation of anteroposterior (AP) spine DXA images from the UK Biobank (UKB). The model is trained on 3,743 raw AP spine scans from the first imaging visit and conditioned on basic participant attributes and lumbar morphometry. Causal consistency is evaluated in a baseline-to-follow-up setting using abduction-action-prediction (AAP): latent variables are abducted from baseline images, age is intervened to the repeat-imaging value, and the resulting counterfactual follow-up morphometry is compared with observed repeat-imaging measurements. Results show strong absolute-level agreement for key vertebral morphometry variables under age intervention, supporting intervention-aligned synthesis of anatomically plausible DXA images.
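The abduction-action-prediction procedure can be shown on a toy structural causal model. The one-variable linear model below (a morphometric value equal to a slope times age plus an individual noise term) is a stand-in chosen for clarity; the paper performs these steps with a hierarchical VAE on images, and the slope, the numbers, and the function names are hypothetical.

```python
def abduct(value_observed: float, age: float, slope: float) -> float:
    """Abduction: recover the individual noise term from the baseline observation.

    Assumed toy model: value = slope * age + noise.
    """
    return value_observed - slope * age

def predict(age: float, noise: float, slope: float) -> float:
    """Prediction: evaluate the model with the recovered noise held fixed."""
    return slope * age + noise

def counterfactual(value_baseline: float, age_baseline: float,
                   age_followup: float, slope: float) -> float:
    """Full abduction-action-prediction pass for an age intervention."""
    noise = abduct(value_baseline, age_baseline, slope)  # abduction
    age = age_followup                                   # action: do(age = follow-up)
    return predict(age, noise, slope)                    # prediction
```

The counterfactual keeps everything individual about the participant (the noise term) and changes only the intervened variable. Comparing this output against the measurement actually observed at the repeat visit is the kind of consistency check the abstract describes.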

2605.22645 2026-05-22 cs.AI

AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

AtelierEval: 人类与LLM作为文本到图像提示器的代理评估

Hanjun Luo, Zhimu Huang, Sylvia Chung, Yiran Wang, Yingbin Jin, Jialin Li, Jiang Li, Xinfeng Li, Hanan Salam

发表机构 * New York University Abu Dhabi(纽约大学阿布扎比分校) Nanyang Technological University(南洋理工大学) Zhejiang University(浙江大学) University of Electronic Science and Technology of China(电子科技大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 本文提出AtelierEval,首个通过360个专家设计的任务量化提示能力的统一基准,并引入基于技能、记忆增强的智能体评估器AtelierJudge,其评分与人类专家的Spearman相关性达到0.79;实验表明模仿式提示优于规划式提示,为未来提示器指出了图像增强的方向。

Comments Accepted by ICML 2026

详情
AI中文摘要

文本到图像(T2I)系统日益依赖上游提示器,无论是人类还是多模态大语言模型(MLLMs),将用户意图转化为详细提示。然而,当前基准固定提示并仅评估T2I模型,忽略了上游组件的提示能力。我们引入AtelierEval,首个统一基准,通过360个专家设计的任务量化提示能力。基于认知观点,它涵盖三个任务类别,并使用现实挑战的分类学来实例化任务,为人类和MLLMs提供双接口。为了实现可扩展和可靠的评估,我们提出了AtelierJudge,一个技能基于、记忆增强的代理评估器。它为提示-图像对生成主观和客观评分,与人类专家的Spearman相关性达到0.79,接近人类表现。广泛实验在4个T2I后端上基准8个MLLMs和48个人类用户,验证AtelierEval作为稳健诊断工具的有效性,并揭示模仿优于规划,倡导未来提示器的图像增强方向。我们的工作已发布以支持未来研究。

英文摘要

Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.

2605.22644 2026-05-22 cs.LG

Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics

为何SGD不是布朗运动:对随机动力学的新视角

Igor Ignashin, Anna Radovskaya, Andrew Semenov, Egor Lopatin, Stanislav Potapov, Aleksandr Kovalenko, Andrey Veprikov, Aleksandr Shestakov, Andrey Leonidov, Aleksandr Beznosikov

发表机构 * Basic Research of Artificial Intelligence Laboratory (BRAIn Lab)(人工智能基础研究实验室(BRAIn Lab)) Innopolis University(因诺普利斯大学) P.N. Lebedev Physical Institute of the Russian Academy of Sciences(俄罗斯科学院列别杰夫物理研究所)

AI总结 本文从离散更新出发,提出了一种将SGD视为在波动损失景观中确定性动力学的新方法,揭示了在临界点附近SGD的动力学行为,并通过实验验证了其在神经网络模型中的表现。

Comments Preprint

详情
AI中文摘要

随机梯度下降(SGD)通常被建模为朗之万过程,即假设小批量噪声表现为布朗运动。然而,这种近似依赖于连续时间极限和sqrt(eta)噪声缩放,而这与有限学习率下的离散SGD更新并不匹配。本文提出一种替代表述,将SGD视为由小批量采样诱导的波动损失景观中的确定性动力学。直接从离散更新出发,我们推导了参数分布的主方程,并得到一个离散福克-普朗克方程,它与标准朗之万形式在eta^2阶上不同。利用这一框架,我们分析了SGD在损失临界点附近的动力学。我们表明,其行为沿平均海森矩阵的本征基分解为定性上不同的区域。特别是,近乎平坦的方向不存在平稳分布:方差随时间增长,对应于沿谷底的有效扩散,其系数与学习率成正比。我们在计算机视觉和自然语言处理的神经网络模型上提供了支持这些预测的实验证据,观察到受限模式与扩散模式之间明显的定性区分。

英文摘要

Stochastic Gradient Descent (SGD) is commonly modeled as a Langevin process, assuming that minibatch noise acts as Brownian motion. However, this approximation relies on a continuous-time limit and a sqrt(eta) noise scaling that does not match the discrete SGD update at finite learning rate. In this work, we propose an alternative formulation of SGD as deterministic dynamics in a fluctuating loss landscape induced by minibatch sampling. Starting directly from the discrete update, we derive a master equation for the parameter distribution and obtain a discrete Fokker-Planck equation that differs from the standard Langevin form at order eta^2. Using this framework, we analyze SGD dynamics near critical points of the loss. We show that the behavior decomposes along the eigenbasis of the mean Hessian into qualitatively distinct regimes. In particular, nearly-flat directions do not admit a stationary distribution: the variance grows over time, corresponding to effective diffusion along valleys with a coefficient proportional to the learning rate. We provide empirical evidence supporting these predictions on neural network models in computer vision and natural language processing, observing a clear qualitative separation between confined and diffusive modes.
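The contrast between confined and diffusive directions can be seen in a toy simulation of the discrete update itself. This sketch is a deliberate simplification and not the paper's formulation: it uses a two-dimensional quadratic loss with mean Hessian diag(1, 0) and additive Gaussian gradient noise of standard deviation `sigma`, whereas the paper derives its results from a fluctuating minibatch landscape. The function name `simulate` and all parameter values are assumptions.

```python
import numpy as np

def simulate(eta: float, sigma: float, steps: int, n_runs: int, seed: int = 0):
    """Run many independent SGD trajectories and track the variance per direction.

    Returns (var_curved, var_flat), each of shape (steps,), holding the
    across-run variance of the parameter along each Hessian eigen-direction.
    """
    rng = np.random.default_rng(seed)
    curvature = np.array([1.0, 0.0])      # one curved and one flat direction
    theta = np.zeros((n_runs, 2))
    var_curved = np.empty(steps)
    var_flat = np.empty(steps)
    for t in range(steps):
        noise = rng.normal(0.0, sigma, size=theta.shape)
        grad = curvature * theta + noise  # noisy gradient of the quadratic loss
        theta = theta - eta * grad        # discrete SGD update, no continuous limit
        var_curved[t], var_flat[t] = theta.var(axis=0)
    return var_curved, var_flat
```

In this toy model the curved direction settles to a stationary variance of eta * sigma^2 / (2 - eta), while the flat direction performs a random walk whose variance after n steps is n * eta^2 * sigma^2 and therefore never becomes stationary. Measured in the rescaled time n * eta, that growth rate is eta * sigma^2, which is proportional to the learning rate.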

2605.22642 2026-05-22 cs.AI

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Spreadsheet-RL: 通过强化学习提升大语言模型智能体在真实电子表格任务上的能力

Banghao Chi, Yining Xie, Mingyuan Wu, Jingcheng Yang, Jize Jiang, Zhaoheng Li, Shengyi Qian, Minjia Zhang, Klara Nahrstedt, Rui Hou, Xiangjun Fan, Hanchao Yu

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Meta

AI总结 本文提出Spreadsheet-RL,一种通过强化学习微调框架,旨在在现实Microsoft Excel环境中训练专门的电子表格代理。该方法通过自动化管道收集在线论坛中的配对起始-目标电子表格,以及金融和供应链管理等领域的领域特定评估任务,构建了新的Domain-Spreadsheet基准数据集,并展示了在通用和领域特定电子表格任务上的显著性能提升。

Comments Mingyuan served as the project lead. Banghao, Yining, and Mingyuan contributed equally to this work, with more junior authors listed before senior authors. All data and code releases are maintained by the corresponding authors at UIUC and are not affiliated with Meta

详情
AI中文摘要

电子表格系统(例如Microsoft Excel,Google Sheets)在现代数据导向的工作流程中起着核心作用。随着AI代理越来越能够自动化复杂任务,如控制计算机和生成演示文稿,构建一个AI驱动的电子表格代理已成为一个有前途的研究方向。大多数现有的电子表格代理依赖于在通用目的LLM上进行专门的提示;虽然这种设计在简单的电子表格操作上有潜力,但难以管理现实世界中典型的复杂、多步骤的工作流程。我们介绍了Spreadsheet-RL,一种强化学习(RL)微调框架,旨在在现实Microsoft Excel环境中训练专门的电子表格代理。Spreadsheet-RL具有自动化管道,用于可扩展地收集配对的起始-目标电子表格,以及在金融和供应链管理等领域的领域特定评估任务,这些任务被编译成新的Domain-Spreadsheet基准数据集。它还包括一个Spreadsheet Gym环境,用于多轮RL:Spreadsheet Gym通过Python沙箱暴露广泛的Excel功能,并附带一个经过改进的Harness,其中包含全面的工具集和精心设计的工具路由规则用于电子表格任务。通过全面的实验,我们展示了Spreadsheet-RL在通用和领域特定的电子表格任务上显著提高了AI代理的性能:它将Qwen3-4B-Thinking-2507在SpreadsheetBench上的Pass@1从12.0%提高到23.4%,并在我们精心编写的Domain-Spreadsheet数据集上将Pass@1从8.4%提高到17.2%。这些结果突显了Spreadsheet-RL在电子表格自动化中的强大泛化能力和实际应用潜力,以及更广泛地,其在日常工作中LLM与数据接口交互方面的前景。

英文摘要

Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI-driven spreadsheet agent has emerged as a promising research direction. Most existing spreadsheet agents rely on specialized prompting over general-purpose LLMs; while this design shows promise on simple spreadsheet operations, it struggles to manage the complex, multi-step workflows typical of real-world applications. We introduce Spreadsheet-RL, a reinforcement learning (RL) fine-tuning framework designed to train specialized spreadsheet agents within a realistic Microsoft Excel environment. Spreadsheet-RL features an automated pipeline for scalable collection of paired start-goal spreadsheets from online forums, as well as domain-specific evaluation tasks in areas such as finance and supply chain management, which we compile into the new Domain-Spreadsheet benchmark dataset. It also includes a Spreadsheet Gym environment designed for multi-turn RL: Spreadsheet Gym exposes extensive Excel functionality through a Python sandbox, along with a refined harness that incorporates a comprehensive tool set and carefully designed tool-routing rules for spreadsheet tasks. Through comprehensive experiments, we show that Spreadsheet-RL substantially enhances AI agents' performance on both general and domain-specific spreadsheet tasks: it improves Qwen3-4B-Thinking-2507's Pass@1 on SpreadsheetBench from 12.0% to 23.4%, and raises Pass@1 from 8.4% to 17.2% on our curated Domain-Spreadsheet dataset. These results highlight Spreadsheet-RL's strong potential for generalization and real-world adoption in spreadsheet automation and, more broadly, its promise for advancing LLM-based interactions with data interfaces in everyday work.

2605.22633 2026-05-22 cs.RO

SE3Kit: A Lightweight Python Library for Specialized Geometric Primitives in Robotics

SE3Kit: 一个用于机器人学中专用几何原语的轻量级Python库

Daniyal Maroufi, Omid Rezayof, Farshid Alambeigi

发表机构 * Walker Department of Mechanical Engineering and Texas Robotics at The University of Texas at Austin(德克萨斯大学奥斯汀分校机械工程系和德克萨斯机器人学院)

AI总结 本文提出SE3Kit,一个轻量级Python库,专注于特殊欧几里得群SE(3)和特殊正交群SO(3)上的高效运算,提供严格的数学实现,适用于嵌入式部署、快速原型设计和教育。

Abstract

The Python robotics ecosystem faces a challenge: while many libraries exist for rigid body transformations, few are both lightweight and mathematically rigorous. This paper introduces SE3Kit, a lightweight Python library for efficient operations on the Special Euclidean Group SE(3) and the Special Orthogonal Group SO(3). Unlike established frameworks that require heavy dependencies (e.g., SpatialMath, PyPose) or general tools that lack robotics-specific features (e.g., SciPy), SE3Kit targets the gap between these extremes. It is designed for embedded deployment, rapid prototyping, and education while providing a rigorous mathematical implementation. It provides a pure-Python, NumPy-only implementation of Lie group operations, without the overhead of deep learning or visualization software.
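
To make "NumPy-only Lie group operations" concrete, here is a minimal sketch of three standard SE(3)/SO(3) building blocks: the SO(3) exponential map (Rodrigues' formula), assembly of a homogeneous transform, and its closed-form inverse. This is not SE3Kit's API; the function names `hat`, `so3_exp`, `se3_from_rt`, and `se3_inverse` are hypothetical and chosen only for illustration.

```python
import numpy as np


def hat(w):
    """Map a 3-vector to its skew-symmetric matrix (an element of so(3))."""
    wx, wy, wz = w
    return np.array([[0.0, -wz, wy],
                     [wz, 0.0, -wx],
                     [-wy, wx, 0.0]])


def so3_exp(w):
    """Rodrigues' formula: rotation vector (axis * angle) -> 3x3 rotation matrix."""
    w = np.asarray(w, dtype=float)
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        # Second-order Taylor expansion avoids dividing by a tiny angle.
        return np.eye(3) + K + 0.5 * (K @ K)
    a = np.sin(theta) / theta
    b = (1.0 - np.cos(theta)) / theta**2
    return np.eye(3) + a * K + b * (K @ K)


def se3_from_rt(R, t):
    """Assemble a 4x4 homogeneous transform from a rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def se3_inverse(T):
    """Closed-form inverse of a rigid transform: (R, t) -> (R^T, -R^T t)."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


# Usage: rotate 90 degrees about z, translate, then undo the transform.
R = so3_exp([0.0, 0.0, np.pi / 2])
T = se3_from_rt(R, [1.0, 2.0, 3.0])
identity = T @ se3_inverse(T)  # numerically close to the 4x4 identity
```

Using the closed-form inverse rather than `np.linalg.inv` exploits the group structure and keeps the result an exact rigid transform up to rounding.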

2605.22631 2026-05-22 cs.CV

AtomicMotion: Learning Human Motion From Different Human Parts

Runzhen Liu, Chuhua Xian, Fa-Ting Hong

Affiliations * School of Computer Science and Engineering, South China University of Technology; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

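AI summary This work proposes AtomicMotion, a framework that decouples and re-integrates body dynamics to address the challenge of accurately reconstructing full-body poses from sparse head and hand trajectories. Its core components are a logical body partitioning scheme, a full-body pre-conditioning strategy, and a Kinematic Attention mechanism; experiments on the AMASS dataset show that it significantly outperforms existing baselines.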

Abstract

Accurately reconstructing full-body poses from sparse head and hand trajectories is a foundational challenge for immersive AR/VR telepresence. Current methods often struggle with error accumulation and unnatural joint coordination, primarily because they treat the human body as a monolithic entity, thereby failing to capture the fine-grained "atomic intents" embedded in subtle signal variations and overlooking the inherent structural topology. To bridge this gap, we present AtomicMotion, a framework designed to decouple and re-integrate body dynamics through three core innovations. First, we introduce a logical body partitioning scheme that decomposes the skeleton into five distinct clusters based on functional intent; this ensures that each partition preserves internal joint synergies while isolating local motion primitives. Second, to robustly map sparse inputs to high-dimensional poses, we employ a masked full-body pre-conditioning strategy during training, forcing the model to internalize global skeletal topology and latent kinematic constraints. Finally, addressing the limitations of vanilla spatial attention, which often ignores fixed physiological connectivity, we propose Kinematic Attention. By embedding the classical kinematic tree structure into the attention mechanism, we ensure biological plausibility in the synthesized motions. Extensive evaluations on the AMASS dataset demonstrate that AtomicMotion significantly outperforms existing baselines, yielding higher reconstruction fidelity and superior biomechanical realism.
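
The abstract does not specify how the kinematic tree is embedded into attention. One plausible reading, sketched below purely as an assumption, is to mask scaled dot-product attention so that each joint attends only to joints within a fixed number of edges in the skeleton tree. The five-joint toy skeleton, the `hops` parameter, and all function names are hypothetical and are not the authors' formulation.

```python
import numpy as np


def tree_adjacency(parents):
    """Symmetric joint adjacency with self-loops from a parent-index list (root = -1)."""
    n = len(parents)
    adj = np.eye(n, dtype=bool)
    for child, parent in enumerate(parents):
        if parent >= 0:
            adj[child, parent] = adj[parent, child] = True
    return adj


def kinematic_mask(parents, hops=1):
    """Boolean mask that is True where two joints are within `hops` tree edges."""
    adj = tree_adjacency(parents).astype(int)
    # Powers of (A + I) count walks of length <= hops between joints.
    return np.linalg.matrix_power(adj, hops) > 0


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to joint pairs allowed by `mask`."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Self-loops guarantee every row has a finite maximum, so softmax is well defined.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy skeleton (made up): joint 0 is the root; 1 and 4 hang off it; 2 and 3 hang off 1.
parents = [-1, 0, 1, 1, 0]
mask = kinematic_mask(parents, hops=1)
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))  # one 8-dim feature vector per joint
out, weights = masked_attention(features, features, features, mask)
```

In this toy skeleton, joints 2 and 3 share a parent but are not directly connected, so they exchange no attention at `hops=1` and only become visible to each other at `hops=2`.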