arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22724 2026-05-22 cs.LG cs.NA math.NA stat.ML

Multiple Neural Operators Achieve Near-Optimal Rates for Multi-Task Learning

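Adrien Weihs, Hayden Schaeffer

AI summary: This paper studies the approximation and statistical complexity of learning a collection of operators in a shared multi-task setting, focusing on the Multiple Neural Operators (MNO) architecture. For broad classes of Lipschitz multiple operator maps, it derives near-optimal upper bounds for approximation and statistical generalization. It also establishes a curse of parametric complexity and proves the corresponding minimax rates. These results show that sharing representations across tasks does not increase the overall cost: multi-task operator learning follows the same scaling laws as single-operator learning. The paper further compares MNO with a multi-task extension of DeepONet based on concatenated task inputs and shows that, from a worst-case approximation-complexity perspective, the two architectures satisfy essentially the same asymptotic rates.

Abstract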

We study the approximation and statistical complexity of learning collections of operators in a shared multi-task setting, with a focus on the Multiple Neural Operators (MNO) architecture. For broad classes of Lipschitz multiple operator maps, we derive near-optimal upper bounds for approximation and statistical generalization. On the lower-bound side, we establish a curse of parametric complexity and prove corresponding minimax rates. Together, these results show that shared representations across tasks do not increase the overall cost: multi-task operator learning follows the same scaling laws as single operator learning. We also compare MNO with a multi-task extension of DeepONet based on concatenated task inputs and show that, from a worst-case approximation-complexity perspective, both architectures satisfy essentially the same asymptotic rates.

2605.22723 2026-05-22 cs.LG cs.AI cs.IT math.IT

The Value of Covariance Matching in Gaussian DDPMs and the Lanczos Sampler

Md Sahil Akhtar, Aymane El Gadarri, Vivek F. Farias, Adam D. Jozefiak

AI summary: This paper studies the value of covariance matching for the path-space KL divergence in Gaussian DDPMs and proposes the Lanczos Gaussian sampler, a matrix-free technique for sampling from the optimal reverse covariance that improves sample quality.

Abstract

A central error measure in Gaussian DDPMs is the path-space KL divergence between the exact reverse chain and the learned Gaussian reverse process. This quantity is especially relevant for procedures such as classifier guidance, which perturb the entire reverse trajectory rather than only the terminal sample. Prior analyses show that standard isotropic reverse covariances suffer an unavoidable $\Omega(1/T)$ path-KL error as the number of denoising steps $T$ grows. We show that matching the full posterior covariance breaks this barrier, yielding an order-wise improvement that reduces the path KL to $O(1/T^2)$. To make full covariance matching practical, we introduce the Lanczos Gaussian sampler (LGS), a training-free, matrix-free method for sampling from the optimal reverse covariance using only covariance-vector products, which are available through Jacobian-vector products of the posterior mean. LGS avoids dense covariance storage and auxiliary covariance models. We prove that LGS approximation error decays exponentially in the number of Lanczos steps, where each Lanczos step requires a single Jacobian-vector product. Empirically, using only three such steps improves sample quality over strong diagonal-covariance baselines, including OCM-DDPM, across standard image benchmarks. This identifies full covariance matching as both theoretically valuable and practically accessible for fast DDPM sampling.
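To make the matrix-free idea concrete, the following is a minimal sketch of the textbook Lanczos approximation of $\Sigma^{1/2} z$, which turns a standard normal vector $z$ into an approximate sample from $N(0, \Sigma)$ using only products $v \mapsto \Sigma v$. It is not taken from the paper: the function name `lanczos_sqrt_apply`, the explicit test matrix, and the use of full reorthogonalization are all assumptions made for illustration. In the setting described above, the `matvec` callback would presumably be backed by a Jacobian-vector product rather than a stored matrix.

```python
import numpy as np

def lanczos_sqrt_apply(matvec, z, num_steps):
    """Approximate Sigma^{1/2} @ z using only products v -> Sigma @ v.

    Hypothetical helper: standard Lanczos tridiagonalization started at z,
    followed by an exact square root of the small tridiagonal matrix.
    Assumes num_steps >= 1 and z != 0.
    """
    z = np.asarray(z, dtype=float)
    z_norm = np.linalg.norm(z)
    basis = [z / z_norm]
    alphas, betas = [], []
    for step in range(num_steps):
        v = basis[-1]
        w = matvec(v)                      # the only access to Sigma
        alphas.append(v @ w)
        # Full reorthogonalization against all stored basis vectors.
        V = np.stack(basis, axis=1)
        w = w - V @ (V.T @ w)
        beta = np.linalg.norm(w)
        if step + 1 == num_steps or beta < 1e-12:
            break
        betas.append(beta)
        basis.append(w / beta)
    V = np.stack(basis, axis=1)
    # Small tridiagonal matrix T with V.T @ Sigma @ V = T.
    T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
    evals, Q = np.linalg.eigh(T)
    # T^{1/2} e_1, computed from the eigendecomposition of T.
    sqrt_T_e1 = Q @ (np.sqrt(np.clip(evals, 0.0, None)) * Q[0, :])
    return z_norm * (V @ sqrt_T_e1)

# Illustrative use with an explicit covariance standing in for JVP-based products.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
Sigma = A @ A.T + 8.0 * np.eye(8)
z = rng.standard_normal(8)
sample = lanczos_sqrt_apply(lambda v: Sigma @ v, z, num_steps=3)
```

Each loop iteration calls `matvec` once, which mirrors the one-product-per-step cost mentioned in the abstract; the dense matrix here exists only so the sketch can be checked against an exact square root.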

2605.22722 2026-05-22 cs.RO cs.SY eess.SY

N3P: Accelerated Automated Parking via a Learning-Based Naturalistic Three-Stage Scheme

Yifan Xue, Toktam Mohammadnejad, Faizan M Tariq, Sangjae Bae, David Isele, Yosuke Sakamoto, Nadia Figueroa, Jovin D'sa

AI summary: This paper proposes N3P, a learning-based three-stage framework for automated parking. By introducing an intermediate preparatory pose and predicting it with a learning module, N3P decomposes the parking maneuver into simpler subproblems, reducing computational complexity and accelerating path generation. Experiments in perpendicular and parallel parking scenarios show substantially faster planning, along with better success rates and trajectory quality than reinforcement learning baselines.

Comments
Accepted at IEEE Intelligent Transportation Systems Conference (ITSC 2026)

Abstract

Autonomous parking requires efficient path planning that ensures kinematic feasibility and collision avoidance in constrained environments. Hybrid A* is widely used but computationally expensive, while reinforcement learning (RL) methods lack reliability and often struggle with long-horizon geometric constraints, leading to suboptimal trajectories. We present N3P, a fast learning-based three-stage framework for automated parking. By introducing an intermediate preparatory pose and using a learning module to predict it, N3P decomposes the maneuver into simpler subproblems, thereby reducing computational complexity and accelerating path generation. We validate the framework by integrating it with Hybrid A* algorithms. Experiments in perpendicular and parallel parking scenarios show that N3P-enhanced Hybrid A* speeds up planning by more than 80%. It also outperforms RL baselines in success rate and trajectory quality, producing shorter trajectories with fewer gear changes, while achieving comparable or lower planning time in most cases.

2605.22720 2026-05-22 cs.AI cs.HC

Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts

Andrii Kryshtal

AI summary: This paper studies alignment failures that AI models can exhibit in conflict contexts. Testing nine model configurations, it finds problems in conflict-related scenarios such as false equivalence, denial of genocide, and failure to recognise ethnic slurs, and it releases the first evaluation framework for this domain to improve AI safety in conflict settings.

Comments
Preprint. 8 pages, 2 figures. Code and evaluation framework: https://github.com/akryshtal/conflict-sensitivity-eval-bloom

Abstract

AI models are already deployed in societies affected by armed conflict, and journalists, humanitarian workers, governments and ordinary citizens rely on them for information or for their work processes. No established practice exists for checking whether their outputs can make those conflicts worse. We tested nine model configurations from four providers (OpenAI, Anthropic, DeepSeek, xAI) on 90 multi-turn scenarios designed to surface misaligned behaviour in conflict contexts: false equivalence between documented atrocities, denial of genocide, and failure to recognise ethnic slurs, among others. When such outputs feed into journalism, humanitarian reporting, or public debate, they can deepen divisions in fragile societies. Failure rates span 6% to 47% between the best and worst performing models, which makes model choice a safety question in its own right. When users pushed for "balance" in cases where international courts have already assigned responsibility, five of nine configurations failed 80 to 100 percent of the time. We release the first evaluation framework for this domain and propose adding it to alignment evaluation portfolios.

2605.22719 2026-05-22 cs.LG

Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

Mahdi Nasermoghadasi

AI summary: This study audits which sparse-autoencoder features of GPT-2 small differ between failed and successful trials of the Indirect Object Identification task. It finds specific features that are strongly associated with task failure and uses several control experiments to show that the relationship is correlational rather than causal.

Comments
10 pages, 7 figures

Abstract

We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task. On 300 prompts, GPT-2 small reaches 79.7% accuracy; 146 of the 24,576 features in the layer-8 residual-stream SAE release of Bloom (2024) clear a Holm-corrected significance threshold and 105 reach a large effect size (|Cohen's d| > 0.8). The strongest single correlate of failure (feature 17,491, d=+2.93, Neuronpedia label 'cryptographic keys') is essentially silent except when the prompt's transferred object is 'the keys,' on which GPT-2 small fails 93.3% of the time vs. 7.5% on the other seven objects (Fisher exact $p = 8.79 \times 10^{-33}$). We put this correlate through three controls that a mechanistic claim should pass. (i) A causal ablation: zeroing feature 17,491 in the residual stream across all token positions of the 45 keys prompts does not restore accuracy (6.7% -> 4.4%); the feature is a correlate, not a sufficient cause at this layer. (ii) A representation baseline: a logistic regression on the raw 768-dimensional residual stream reaches 5-fold ROC AUC = 0.929, matching the top-100 SAE features (0.927); the SAE basis adds interpretability, not predictive power. (iii) A seed-robustness check: across five random seeds the keys-subset failure rate stays in 75.0-93.3% (the behavioural effect is real), but feature 17,491 is the top-|d| feature in only 1 of 5 runs. The methodological contribution is therefore the audit pipeline (cheap, model-agnostic, surfaces named correlates) rather than any single feature found through it. We release the code, the 300-prompt corpus, the 300x24,576 activation matrix, the ablation and baseline scripts, and the figures. The full pipeline runs on a laptop (Apple M3 Max, no discrete GPU).
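The two statistics named in the abstract, Cohen's d and the Holm correction, can be sketched as follows. This is a generic illustration, not the authors' released code: the function names, the pooled-variance form of d, and the 0.05 level are assumptions, and the abstract does not say which per-feature test produced the p-values.

```python
import numpy as np

def cohens_d(x, y):
    """Effect size between two samples using the pooled standard deviation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: returns a boolean mask of rejected hypotheses."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    # Visit p-values from smallest to largest; stop at the first failure.
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

# Toy activations of one feature on failed vs. successful prompts (made-up numbers).
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
mask = holm_reject([0.01, 0.04, 0.03, 0.005])
```

In an audit of this kind, `cohens_d` would be evaluated once per SAE feature and `holm_reject` would be applied to the full vector of per-feature p-values, so that the family-wise error rate is controlled across all 24,576 comparisons.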

2605.22718 2026-05-22 cs.CV

WorldKV: Efficient World Memory with World Retrieval and Compression

Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang, Sangdoo Yun, Seungryong Kim

AI summary: This paper proposes WorldKV, a training-free framework that uses World Retrieval and World Compression to improve efficiency while preserving consistency, matching or exceeding full-KV memory fidelity on Matrix-Game-2.0 and LingBot-World-Fast.

Comments
Project Page: https://cvlab-kaist.github.io/WorldKV/

Abstract

Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
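One way to picture the compression step is the sketch below, which drops the half of a chunk's tokens whose keys are most similar to the keys of an anchor frame. The abstract does not give the exact scoring rule, so the cosine similarity, the max over anchor tokens, and every name here (`compress_chunk`, `keep_ratio`) are assumptions for illustration only.

```python
import numpy as np

def compress_chunk(keys, values, anchor_keys, keep_ratio=0.5):
    """Keep the tokens whose keys are least similar to the anchor frame's keys.

    Hypothetical scoring: a token's redundancy is its highest cosine similarity
    to any anchor key. Assumes all key vectors are non-zero.
    """
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    similarity = unit(keys) @ unit(anchor_keys).T      # (tokens, anchor_tokens)
    redundancy = similarity.max(axis=1)
    n_keep = max(1, int(round(keep_ratio * len(keys))))
    # Lowest redundancy first; re-sort indices to preserve token order.
    keep = np.sort(np.argsort(redundancy)[:n_keep])
    return keys[keep], values[keep], keep

keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [-1.0, 0.0]])
values = np.arange(8, dtype=float).reshape(4, 2)
anchor_keys = np.array([[1.0, 0.0]])
kept_keys, kept_values, kept_idx = compress_chunk(keys, values, anchor_keys)
```

With `keep_ratio=0.5` the stored chunk is half its original size, which is what lets twice as much history fit in a fixed memory budget.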

2605.22717 2026-05-22 cs.SD cs.AI cs.LG cs.MM

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Zachary Novack, Stephen Brade, Haven Kim, Hugo Flores García, Nithya Shikarpur, Chinmay Talegaonkar, Suwan Kim, Valerie K. Chen, Julian McAuley, Taylor Berg-Kirkpatrick, Cheng-Zhi Anna Huang

AI summary: This paper investigates whether audio diffusion models can be repurposed efficiently into interactive models that run on consumer hardware. The proposed Live Music Diffusion Models (LMDMs) use block-wise KV caching to recover, and then outperform, the inference complexity of discrete Live Music Models (LMMs), and the ARC-Forcing paradigm enables stable post-training alignment that reduces error accumulation without explicit RL or reward models.

Abstract

Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, with their wide support in the open-source community but non-streaming bidirectional nature, can be repurposed efficiently into interactive models accessible on consumer hardware. By taking a critical look at the modern pipeline for block-wise outpainting diffusion, we identify critical inefficiencies during inference that result in strictly worse computational efficiency than their discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, reducing error accumulation without any explicit RL or reward models. We demonstrate the application of LMDMs in a number of creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. We finally show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, utilizing LMDMs as a "generative delay" to transform musicians' improvisation live for variable timbral effects while running locally on a consumer gaming laptop.

2605.22716 2026-05-22 cs.AI cs.LO

Parametric Modular Answer Set Programs Made Declarative

Jorge Fandinno, Yuliya Lierler, Torsten Schaub

AI summary: This paper explores modularity in first-order answer set programming. It introduces a new formalism, parametric modular logic programs, which allows defining subprograms with parameters and intensionality statements, shows how the formalism captures the semantics of clingo programs with collective control, and connects it to traditional non-modular answer set programming.

Comments
To appear in Theory and Practice of Logic Programming

Abstract

In this paper, we explore the concept of modularity in first-order answer set programming (ASP). We introduce a new formalism called parametric modular logic programs, which allows defining subprograms with parameters and intensionality statements. We demonstrate how this formalism can capture the semantics of clingo-programs with collective control, a feature that enables structuring and instantiating subprograms. We provide theoretical foundations for modular ASP, illustrate its usefulness, and connect it to traditional non-modular ASP.

2605.22711 2026-05-22 cs.LG cs.AI

Abstraction for Offline Goal-Conditioned Reinforcement Learning

Clarisse Wibault, Alexander Goldie, Antonio Villares, Maike Osborne, Jakob Foerster

AI summary: This paper proposes a way to exploit abstraction in offline goal-conditioned reinforcement learning. By introducing relativised options and distinct representations for different levels of the hierarchy, it improves the reuse of experience across similar contexts of the state space and thereby improves performance.

Abstract

Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been motivated for horizon reduction via temporal abstraction in offline GCRL, we demonstrate that hierarchy also enables absolute abstraction. By introducing relativised options as well as distinct representations for different levels of the hierarchy, we demonstrate how an agent can reuse experience across similar contexts of the state-space. Based on this framework, we introduce two simple algorithms for learning relativised options and abstracting from the absolute frame of reference. Our experiments show that such inductive biases significantly improve performance in offline GCRL.

2605.22709 2026-05-22 cs.CR cs.ET cs.RO cs.SY eess.SY

TriSweep: A Four-Drone Swarm Framework for Electromagnetic Side-Channel Analysis

Eric Yocam, Varghese Vaidyan

AI summary: This paper proposes TriSweep, a simulation framework in which a four-drone swarm performs autonomous standoff electromagnetic side-channel analysis of embedded microcontrollers at 0.25-1.5 m. Spatially specialized collector drones cooperate with a stationary Accumulator drone to achieve signal enhancement and mask cancellation, and the framework is evaluated in simulation.

Comments
Simulation framework + systems design for a four-drone swarm performing standoff electromagnetic side-channel analysis. No hardware fabricated yet

Abstract

Electromagnetic (EM) side-channel analysis traditionally assumes a stationary, close-proximity probe - a threat model that underestimates aerial adversaries. TriSweep is a simulation framework that designs and evaluates a four-drone swarm architecture for autonomous standoff EM-SCA of embedded microcontrollers at 0.25-1.5 m. Three spatially specialized collector drones - Anchor (full-spectrum), Mask Probe (mask-register loading leakage), and Cipher Probe (masked SubBytes output leakage) - feed a stationary Accumulator drone that performs coherent combining (+4.8 dB SNR gain) and second-order mask cancellation via a centered product of the two spatially separated leakage streams. Evaluated against three real ANSSI ASCAD datasets (ATmega8515 masked AES-128 and 50/100-sample desynchronized variants), the framework achieves a simulated key rank of 18 +/- 1.7 (five-seed) at 0.25 m on the primary masked dataset. Profiling-trace cross-correlation alignment reduces single-drone rank from 89 to 21 on the 100-sample-jitter variant, demonstrating compensation for drone hover vibration. A two-channel CNN in the Accumulator converges to a loss of 0.454 (vs. random baseline 5.545) and improves rank on desynchronized datasets. No physical hardware has been fabricated; prototype construction is the planned next step.
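The centered-product step can be illustrated with a noise-free toy model. The sketch below assumes that each stream leaks the Hamming weight of the byte it observes, which is a common textbook leakage model and not something the abstract specifies; the function names are likewise hypothetical. Under that assumption, averaging the centered product over all 256 mask values gives a quantity that depends only on the unmasked byte, which is what makes second-order mask cancellation possible.

```python
import numpy as np

def hamming_weight(x):
    """Number of set bits in a byte-sized integer."""
    return bin(int(x)).count("1")

def centered_product(trace_a, trace_b):
    """Pointwise product of two leakage streams after removing their means."""
    a = np.asarray(trace_a, dtype=float)
    b = np.asarray(trace_b, dtype=float)
    return (a - a.mean()) * (b - b.mean())

def mean_combined_leakage(value):
    """Average centered product over every possible 8-bit mask.

    Stream 1 (mask load) leaks HW(mask); stream 2 (masked output) leaks
    HW(value XOR mask). Both are assumed noise-free for clarity.
    """
    masks = np.arange(256)
    mask_leak = [hamming_weight(m) for m in masks]
    masked_leak = [hamming_weight(value ^ m) for m in masks]
    return centered_product(mask_leak, masked_leak).mean()

low = mean_combined_leakage(0x00)
mid = mean_combined_leakage(0x0F)
high = mean_combined_leakage(0xFF)
```

In this toy model the combined value equals 2 - HW(value)/2, so it falls linearly as the Hamming weight of the secret-dependent byte rises even though neither stream alone reveals that byte. Real traces would add noise and misalignment on top of this.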

2605.22707 2026-05-22 cs.AI cs.HC

Beyond the Org Chart: AI and the Transformation of Invisible Work

Stephanie Rosenthal, Shamsi Iqbal

AI summary: This paper studies how AI is changing work processes, in particular invisible cultural practices such as professional mentoring, and proposes steps for making invisible work visible as well as ways that individuals and leaders can support colleagues and maintain a healthy company culture.

Comments
10 pages

Abstract

An increasing number of news and research articles report that AI adoption is allowing professionals to blur and extend the boundaries of their corporate roles. With the goal of understanding how work processes might be changing in an AI-forward company, we interviewed 24 product-focused individuals at a large technology firm about how AI has impacted their own work, their work within their product team, and their professional interactions. Our conversations suggest that AI is not only changing formal role responsibilities and collaborations between those roles, but also changing informal cultural practices like professional mentoring that are key to helping professionals settle in their positions, stay engaged with their work, and grow their careers. Some of these changes are positive, such as smoother collaboration between peers, but other changes are more nuanced and put the typical career growth opportunities, like receiving feedback from professional networks and promoting leadership and mentorship, at risk. We propose steps that AI companies can take to make the invisible work more visible. Additionally, we propose efforts that individuals and leaders can take to support their colleagues through AI transformation while preserving healthy company cultures that support diverse thinking, collaboration, and informal interactions.

2605.22703 2026-05-22 cs.LG

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

Shuo Yang, Jinda Lu, Chiyu Ma, Kexin Huang, Haoming Meng, Qihui Zhang, Yuyang Liu, Bolin Ding, Guoyin Wang, Li Yuan, Jingren Zhou

AI summary: This paper studies training instability caused by hard clipping decisions in Reinforcement Learning with Verifiable Rewards (RLVR) and proposes Near-boundary Stochastic Rescue (NSR), a simple method that stochastically retains tokens lying slightly outside the clipping bounds to recover lost signal, improving training stability and performance.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of clipping-based GRPO-style objectives, we identify the rigid clipping decision induced by hard clipping as a key practical bottleneck in the studied RLVR setups. Specifically, our analysis suggests that informative signals can lie in the near-boundary region just beyond the clipping threshold, and are therefore discarded by the standard hard-clipping rule. Notably, once this bottleneck is precisely identified, even simple stochastic perturbations at the boundary can recover meaningful performance gains. Building on this finding, we propose Near-boundary Stochastic Rescue (NSR), a minimal, plug-and-play modification that stochastically retains these slightly out-of-bound tokens to recover lost signals. While NSR, via stochastic sampling, can be interpreted as inducing an implicit gradient decay in expectation, our ablations reveal that its stochastic, boundary-local rescue mechanism is consistently more effective than deterministic gradient decay. Validated by extensive experiments across model sizes from 7B to 30B and both dense and MoE architectures, as a plug-and-play solution, NSR substantially improves training stability and delivers consistent gains over strong baselines such as DAPO and GSPO.
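A rough sketch of what a stochastic near-boundary rescue could look like is given below. The abstract does not specify the rule, so the width of the near-boundary band (`margin`), the retention probability (`rescue_prob`), and the function name are all hypothetical; the only part taken from standard practice is the PPO-style condition under which hard clipping zeroes a token's gradient.

```python
import numpy as np

def nsr_gradient_mask(ratio, advantage, eps=0.2, margin=0.05,
                      rescue_prob=0.5, rng=None):
    """Return True for tokens that contribute gradient.

    Hard clipping drops a token when its importance ratio has already moved
    past 1 +/- eps in the direction favoured by its advantage. This sketch
    additionally keeps, with probability rescue_prob, tokens that lie within
    `margin` beyond the threshold (assumed definition of "near-boundary").
    """
    rng = np.random.default_rng() if rng is None else rng
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    over = (advantage > 0) & (ratio > 1 + eps)
    under = (advantage < 0) & (ratio < 1 - eps)
    clipped = over | under
    near = (over & (ratio <= 1 + eps + margin)) | (under & (ratio >= 1 - eps - margin))
    rescued = near & (rng.random(ratio.shape) < rescue_prob)
    return ~clipped | rescued

ratio = np.array([1.0, 1.22, 1.5, 0.78, 0.5])
advantage = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
hard_clip = nsr_gradient_mask(ratio, advantage, rescue_prob=0.0)
always_rescue = nsr_gradient_mask(ratio, advantage, rescue_prob=1.0)
```

Setting `rescue_prob=0.0` recovers ordinary hard clipping, while intermediate values keep each near-boundary token only some of the time, which matches the abstract's remark that the scheme can be read as an implicit gradient decay in expectation.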

2605.22697 2026-05-22 cs.CV

Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions

Yannick Porto, Renato Martins, Thomas Chalumeau, Cedric Demonceaux

AI summary: This paper proposes a cross-domain human action recognition method that combines multiview motion cues with textual descriptions, improving the robustness and generalization of zero-shot action recognition models across domains.

Comments
Accepted to ICPR 2026. Code and trained models available at: https://icb-vision-ai.github.io/OrientationAware-HAR

Abstract

Robustness to domain changes is a key capability for effective deployment of human action recognition systems in real-world scenarios, where action categories at inference can exhibit substantial domain shifts or even include actions unseen during training. In this context, improving the recognition capabilities of Zero-Shot Action Recognition (ZSAR) models, without requiring heavy annotation effort, remains a central challenge. Most ZSAR approaches assume that actions are observed under geometric conditions similar to those seen during training. In practice, variations in human body orientation and camera viewpoint add a significant domain gap in ZSAR, substantially limiting generalization to novel action-motion combinations. To address this, this paper presents a novel orientation-aware action recognition approach with improved cross-domain capabilities. Our approach combines motion cues of multiple camera viewpoints and text descriptions of human actions in the training phase. We present a new orientation-aware motion encoding network to learn different motion features, and adapt a specific orientation-aware text prompt to match the corresponding features at inference. Extensive experiments demonstrate that the proposed method consistently improves ZSAR performance across different recognition benchmarks, outperforming recent state-of-the-art zero-shot approaches on NTU-RGB+D, BABEL, NW-UCLA, and on two surveillance datasets. In addition, the learned representations exhibit strong transfer learning capabilities, yielding competitive performance on both cross-domain and same-domain recognition of seen actions. Code and trained models are available at: https://icb-vision-ai.github.io/OrientationAware-HAR

2605.22695 2026-05-22 cs.CV

Improving Viewpoint-Invariance and Temporal Consistency for Action Detection

Yannick Porto, Renato Martins, Thomas Chalumeau, Cedric Demonceaux

AI summary: This paper proposes a two-stage action detection method that improves detection performance by strengthening viewpoint invariance and global temporal consistency, outperforming existing methods on the PKU-MMD and BABEL benchmarks.

Comments
Accepted at ICIP 2026. Code and trained models are available at: https://icb-vision-ai.github.io/HydraView-TAD

Abstract

Viewpoint change invariance and action temporal consistency are critical aspects for the effective deployment of human action detection in untrimmed videos. Existing appearance-based video detection methods often struggle with limited viewpoint diversity during training, while motion-based detection approaches frequently fail to model fine-grained temporal relationships across consecutive motion windows. This paper introduces a novel two-stage action detection approach designed to improve both view-invariance and global temporal coherence properties. In the first stage, we extract motion features from augmented virtual viewpoints, which are used only during training. Then, the second stage introduces a new view-invariant, multi-scale temporal encoder based on selective state-space sequence modelling to aggregate information across viewpoints and time scales. Experiments on PKU-MMD and BABEL benchmarks demonstrate that this approach significantly outperforms state-of-the-art methods in all considered splits. Code and trained models are available at: https://icb-vision-ai.github.io/HydraView-TAD

2605.22693 2026-05-22 cs.RO cs.AI

Scout-Assisted Planning for Heterogeneous Robot Teams under Partially Known Environments

Hoang-Dung Bui, Abhish Khanal, Raihan Islam Arnob, Gregory J. Stein

AI summary: This paper proposes Scout-Assisted Planning, a framework in which unmanned aerial vehicles proactively gather environmental information to improve ground vehicle navigation, with information-gain-guided action pruning to reduce backtracking cost. Experiments across different environments show significantly lower ground robot travel cost.

Abstract

Autonomous robot teams navigating partially known environments face costly backtracking when ground robots encounter blocked roads that are only revealed upon physical traversal. We address this with Scout-Assisted Planning (SAP), a heterogeneous planning framework in which scouting Unmanned Aerial Vehicles proactively gather environmental information to improve Unmanned Ground Vehicle navigation. To focus scouting on the most consequential edges, we propose Information Gain-based Action Pruning, which scores candidate scouting actions by their expected impact on ground robot behavior. Since exact Information Gain-based Action Pruning computation is prohibitively expensive, we develop a Graph Neural Network based model that predicts information gain values directly from graph structure and belief state, reducing planning time to real-time levels without sacrificing solution quality. Experiments across three environment types show that SAP with Information Gain Action Pruning reduces ground robot travel cost by 31.9-37.7% over the Canadian Traveler Problem baseline, and outperforms proximity-based scouting guidance by an additional 8-14%, confirming that principled information-gain-guided scouting is both more effective and computationally feasible for real-world deployment.

2605.22691 2026-05-22 cs.LG cond-mat.stat-mech

Posterior Collapse as Automatic Spectral Pruning

Johannes Hirn

AI summary: This paper studies posterior collapse in $\beta$-VAEs and shows that it is essentially an automatic spectral pruning process. By analyzing equilibrium solutions at different values of $\beta$, it exhibits a cascade of collapses in which latent modes decouple from least to most useful.

Abstract

We show that posterior collapse in $\beta$-VAEs implements automatic spectral pruning. A latent mode collapses if its contribution to reconstruction is below the cutoff set by $\beta$. Equilibrium solutions with different $\beta$ thus reveal a cascade of collapses as latent modes decouple from least to most useful. We derive this as a consequence of the loss via a Landau stability analysis. We define a latent-rescaling-invariant order parameter that ranks active latent modes and whose collapse thresholds identify which effective variables to inspect first. In the linear Gaussian case, the collapse spectrum, utility spectrum, and normalized PCA spectrum coincide, and each collapse follows a mean-field law. We test these predictions on the WorldClim dataset.
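For the linear Gaussian case, the pruning picture can be sketched with a simple eigenvalue cutoff. The sketch below relies on an assumption that is not stated in the abstract: the known linear-VAE / probabilistic-PCA result that a principal direction stays active only when its data variance exceeds the decoder noise variance, combined with the observation that weighting the KL term by $\beta$ rescales that noise variance to $\beta \sigma^2$. The function name and `decoder_var` parameter are hypothetical, and the paper's own order parameter may be defined differently.

```python
import numpy as np

def active_modes(data, beta, decoder_var=1.0):
    """Return PCA eigenvalues (descending) and which modes survive the cutoff.

    Assumed rule for a linear Gaussian model: a mode with data variance lam
    stays active only if lam > beta * decoder_var; otherwise it collapses.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / len(data)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    return eigvals, eigvals > beta * decoder_var

# Toy data with variances 4.5 and 0.5 along the two axes.
data = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
eigvals, active_small_beta = active_modes(data, beta=0.25)
_, active_unit_beta = active_modes(data, beta=1.0)
_, active_large_beta = active_modes(data, beta=10.0)
```

Sweeping `beta` upward in this toy setting switches modes off one at a time, starting with the lowest-variance direction, which is the cascade "from least to most useful" described above.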

2605.22681 2026-05-22 cs.AI

Forecasting Scientific Progress with Artificial Intelligence

用人工智能预测科学进步

Sean Wu, Pan Lu, Yupeng Chen, Jonathan Bragg, Yutaro Yamada, Peter Clark, David Clifton, Philip Torr, James Zou, Junchi Yu

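AI summary: This paper examines whether AI can forecast scientific progress. It proposes a temporally grounded evaluation framework and introduces the CUSP benchmark, which evaluates scientific forecasting through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Current frontier models show systematic, domain-dependent limitations, and performance is largely insensitive to whether events fall before or after the training cutoff, indicating that AI still falls short as a forecasting tool for science.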

Comments
73 pages, 13 figures, 29 tables
AI中文摘要

人工智能(AI)日益融入科学发现,但其能否预测科学进步仍不明确。为研究此问题,我们引入了一个基于时间的评估框架,用于在受控知识约束下预测科学进步。我们提出了CUSP(截止条件下的未见科学进步),一个多学科和事件级别的基准,通过可行性评估、机制推理、生成性解决方案设计和时间预测来评估AI系统在科学预测中的表现。在4760个科学事件中,我们观察到当前前沿模型在不同领域存在系统性和领域依赖性的限制。虽然模型可以识别出竞争候选研究方向的可能性,但它们无法可靠地预测科学进步是否会被实现,并系统性地低估了其发生时间。性能在不同领域中高度异质,AI的进步时间比生物学、化学和物理学的进步更可预测。性能在事件发生时间在训练截止前或后时基本不受影响,表明这些限制不能仅由训练数据中的知识暴露来解释。在受控信息访问下,额外的预截止知识会提高性能,但无法缩小与全信息设置之间的差距,这种差距在高引用进步中更加明显。模型还表现出系统性的过度自信和强烈的响应偏差,表明不确定性估计不可靠。综合来看,当前AI系统在预测科学进步方面仍显不足。获取先前知识并未转化为可靠的预测,性能更受益于事后信息而非前瞻性预测。

英文摘要

Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary and event-level benchmark that evaluates scientific forecasting in AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic and domain-dependent limitations in current frontier models. While models can identify plausible research directions from competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Performance is highly heterogeneous across domains, with the timing of AI progress more predictable than advances in biology, chemistry, and physics. Performance is largely insensitive to whether events occur before or after the training cutoff, suggesting these limitations cannot be explained solely by knowledge exposure in training data. Under controlled information access, additional pre-cutoff knowledge improves performance but does not close the gap to full-information settings, which becomes more pronounced for high-citation advances. Models also exhibit systematic overconfidence and strong response biases, indicating unreliable uncertainty estimation. Taken together, current AI systems fall short as predictive tools for scientific progress. Access to prior knowledge does not translate into reliable forecasting, and performance benefits more from post-event information than from forward-looking prediction.

2605.22679 2026-05-22 cs.CV cs.LG

Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

将嵌入概念化:面向视觉-语言模型的稀疏解缠

Piotr Kubaty, Patryk Marszałek, Łukasz Struski, Adam Wróbel, Jacek Tabor, Marek Śmieja

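AI summary: This paper proposes CEDAR, a sparse disentanglement method that reveals the compositional structure of pretrained embeddings without increasing dimensionality, improving the interpretability of vision-language models and their alignment with human perception.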

AI中文摘要

视觉-语言模型学习了强大的多模态嵌入,但其内部语义仍然模糊。尽管稀疏自编码器(SAEs)可以提取可解释的特征,但它们依赖于扩展表示维度,这会破坏原始几何结构并引入冗余。我们引入CEDAR(通过自适应旋转进行概念嵌入解缠),一种事后方法,能够在不增加维度的情况下揭示预训练嵌入的组成结构。通过学习具有top-k稀疏瓶颈的可逆变换,CEDAR将语义信息集中到轴对齐的解缠坐标中。在CLIP-like架构中,单个坐标可以与文本概念进行解释,而对于生成模型如BLIP,它们可以解码为自然语言描述。实验表明,CEDAR在重建-稀疏性权衡方面具有竞争力,同时产生更可解释且更符合人类感知的解释。我们的结果表明,视觉-语言表示中的显性纠缠可以通过适当的基变换来解决,从而消除对过度扩展的需要。

英文摘要

Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, which compromises the original geometry and introduces redundancy. We introduce CEDAR (Conceptual Embedding Disentanglement via Adaptive Rotation), a post-hoc method that reveals the compositional structure of pretrained embeddings without increasing dimensionality. By learning an invertible transformation with a top-$k$ sparsity bottleneck, CEDAR concentrates semantic information into axis-aligned disentangled coordinates. In CLIP-like architectures, individual coordinates can be interpreted with textual concepts, while for generative models such as BLIP, they can be decoded into natural language descriptions. Experiments demonstrate that CEDAR achieves a competitive reconstruction-sparsity trade-off while producing explanations that are more interpretable and better aligned with human perception. Our results suggest that the apparent entanglement in vision-language representations can be resolved through a suitable change of basis, eliminating the need for overcomplete expansions.
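The idea of an invertible transformation followed by a top-$k$ bottleneck can be sketched in a few lines. This is a minimal illustration, not CEDAR itself: it assumes the transformation is an orthogonal matrix (so its inverse is its transpose) and uses a random matrix where a learned one would go. All function names are hypothetical.

```python
import numpy as np

def random_orthogonal(dim, seed=0):
    """Stand-in for a learned invertible map: a random orthogonal matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def topk_code(x, rotation, k):
    """Change basis, then keep only the k largest-magnitude coordinates."""
    z = rotation @ x
    keep = np.argsort(np.abs(z))[-k:]
    sparse = np.zeros_like(z)
    sparse[keep] = z[keep]
    return sparse

def reconstruct(sparse, rotation):
    """Invert the orthogonal change of basis (inverse equals transpose)."""
    return rotation.T @ sparse

rot = random_orthogonal(8, seed=1)
x = np.random.default_rng(2).standard_normal(8)
code = topk_code(x, rot, k=3)
x_hat = reconstruct(code, rot)
print(np.count_nonzero(code), np.linalg.norm(x - x_hat))
```

Because the map is orthogonal in this sketch, the code has the same dimension as the embedding and the reconstruction error equals the norm of the discarded coordinates, so a basis that concentrates energy in few axes reconstructs well at small $k$.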

2605.22678 2026-05-22 cs.CV cs.AI

Swift Sampling: Selecting Temporal Surprises via Taylor Series

Swift Sampling: 通过泰勒级数选择时间惊喜

Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian, Deepti Ghadiyaram

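AI summary: This work proposes Swift Sampling, a training-free frame selection algorithm that models a video as a differentiable trajectory in visual latent space and uses a Taylor expansion to predict the path of subsequent frames, automatically identifying high-information temporally surprising frames and improving performance on long-video question answering.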

AI中文摘要

尽管长视频中的大多数帧都是冗余的,但关键信息存在于时间惊喜中:即实际视觉特征偏离其预测演变的时刻。受人脑预测编码的启发,我们引入了Swift Sampling,一种优雅且无需训练的帧选择算法,能够自动识别视频中的高信息量时刻。具体而言,我们将视频建模为视觉潜在空间中的可微轨迹,并计算其特征的速度和加速度。然后,我们应用泰勒展开来投影后续帧的预期路径。与预测路径显著偏离的帧被识别为时间惊喜帧并被选中采样。与依赖辅助网络或视频特定超参数调整的先前无训练方法不同,Swift Sampling 非常轻量,仅比基线增加 0.02x 的计算成本,使其比领先基线便宜 30 倍。在三个长视频问答基准和 10 个不同的下游任务上,Swift Sampling 超过了均匀采样和先前查询无关的基线。它在帧预算有限的长视频中表现尤为强大,准确率可提高高达 12.5 个百分点。

英文摘要

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply a Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is lightweight, adding only 0.02x computational cost over the baseline, which makes its overhead 30x cheaper than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially powerful for long videos with limited frame budgets, improving accuracy by up to +12.5 points.
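The velocity, acceleration, and Taylor-expansion steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: velocity and acceleration are taken as backward finite differences of per-frame feature vectors, the prediction is a second-order Taylor step, and the names `surprise_scores` and `select_frames` are hypothetical.

```python
import numpy as np

def surprise_scores(feats):
    """Distance between each frame's features and a second-order Taylor prediction."""
    feats = np.asarray(feats, dtype=float)
    scores = np.zeros(feats.shape[0])
    for t in range(3, feats.shape[0]):
        velocity = feats[t - 1] - feats[t - 2]
        accel = feats[t - 1] - 2.0 * feats[t - 2] + feats[t - 3]
        predicted = feats[t - 1] + velocity + 0.5 * accel
        scores[t] = np.linalg.norm(feats[t] - predicted)
    return scores  # the first three frames have no history and score 0

def select_frames(feats, k):
    """Indices of the k most surprising frames, in temporal order."""
    top = np.argsort(surprise_scores(feats))[::-1][:k]
    return np.sort(top)

# Toy trajectory: steady drift along one axis, with a sudden shift at frame 6.
feats = np.outer(np.arange(10.0), np.array([1.0, 0.0]))
feats[6:, 1] += 2.0
print(select_frames(feats, k=3))
```

In this toy trajectory the frames before the shift are predicted exactly and score zero, while the shift and the two frames whose history straddles it receive nonzero scores.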

2605.22677 2026-05-22 cs.CV

Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment

Slimmable ConvNeXt: 适用于高效多设备部署的宽度自适应推理

Janek Haberer, Jon Eike Wilhelm, Olaf Landsiedel

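AI summary: This paper proposes Slimmable ConvNeXt, which trains a single set of shared weights containing multiple nested subnetworks to enable width-adaptive inference, so that one model can be deployed efficiently on devices with different resource constraints. The method exploits ConvNeXt's modern design, such as LayerNorm and inverted bottlenecks, to enable channel-width slimming, remove normalization overhead, and simplify the training pipeline.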

Comments
Accepted at Mobile AI Workshop 2026 (CVPR'26 Workshop)
AI中文摘要

在资源约束变化的设备上部署视觉模型,或在单个设备上由于电池状态、热 throttling 或延迟截止而变化的计算资源,通常需要训练和维护多个模型。宽度自适应推理通过训练一组共享权重,其中包含多个嵌套子网络,这些子网络具有递增的容量,从而解决这一问题。尽管之前的CNN方法需要可切换的批量归一化,而近期可扩展方法则集中在视觉Transformer上,本文提出了Slimmable ConvNeXt,证明了ConvNeXt的现代设计,特别是LayerNorm和倒置瓶颈结构,使其特别适合通道宽度压缩,消除了经典可压缩网络的归一化开销,并提供了比之前CNN和ViT方法更简单的训练流程。在ImageNet-1k上,Slimmable ConvNeXt-T在3个子网络的情况下,以4.5 GMACs达到80.8%的top-1准确率,以1.2 GMACs达到77.4%的准确率,训练了600个epoch。在同等计算量下,这超过了HydraViT的6头子网络(78.4%在4.6 GMACs)2.4个百分点,以及其3头配置(73.0%在1.3 GMACs)4.4个百分点,同时在相同GMACs下也超过了MatFormer-S(78.6%)和SortedNet-S(78.2%)。将规模扩展到Slimmable ConvNeXt-B进一步将最大准确率提高到15.35 GMACs时的82.8%。

英文摘要

Deploying vision models across devices with varying resource constraints, or even on a single device where available compute fluctuates due to battery state, thermal throttling, or latency deadlines, typically requires training and maintaining separate models. Width-adaptive inference addresses this by training a single set of shared weights containing multiple nested subnetworks of increasing capacity, but prior CNN-based approaches required switchable batch normalization, while recent scalable methods have focused on Vision Transformers. We present Slimmable ConvNeXt, which shows that ConvNeXt's modern design, specifically LayerNorm and inverted bottlenecks, makes it particularly suited for channel-width slimming, eliminating the normalization overhead of classical slimmable networks and producing a simpler training pipeline than both prior CNN and ViT approaches. On ImageNet-1k, Slimmable ConvNeXt-T with 3 subnetworks achieves 80.8% top-1 accuracy at 4.5 GMACs and 77.4% at 1.2 GMACs, trained from scratch for 600 epochs. At comparable compute, this exceeds HydraViT's 6-head subnetwork (78.4% at 4.6 GMACs) by 2.4 percentage points and its 3-head configuration (73.0% at 1.3 GMACs) by 4.4 percentage points, while also outperforming MatFormer-S (78.6%) and SortedNet-S (78.2%) at the same GMACs. Scaling to Slimmable ConvNeXt-B further improves maximum accuracy to 82.8% at 15.35 GMACs.
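One common way to realize nested subnetworks with shared weights is to slice the leading channels of each layer. The sketch below illustrates that idea on a single dense layer followed by a parameter-free LayerNorm; it is an assumption-laden toy, not the paper's architecture, and `slim_forward` and its `width` argument are hypothetical names.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each sample over its active channels (no running statistics)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def slim_forward(x, weight, bias, width):
    """Run a dense layer using only the leading slice of the shared weights."""
    out_full = weight.shape[0]
    out_c = max(1, int(round(out_full * width)))  # active output channels
    in_c = x.shape[-1]                            # active input channels
    w = weight[:out_c, :in_c]                     # nested slice of shared weights
    b = bias[:out_c]
    return layer_norm(x @ w.T + b)

rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 4))
bias = np.zeros(8)
x = rng.standard_normal((2, 4))
print(slim_forward(x, weight, bias, 1.0).shape, slim_forward(x, weight, bias, 0.5).shape)
```

Since LayerNorm computes its statistics per sample over whichever channels are active, the same normalization code serves every width in this sketch, which is one plausible reading of why switchable batch normalization is not needed.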

2605.22675 2026-05-22 cs.CL

Self-Policy Distillation via Capability-Selective Subspace Projection

通过能力选择性子空间投影实现自我策略蒸馏

Guangya Hao, Yitong Shang, Yunbo Long, Zhuokai Zhao, Hanxue Liang

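AI summary: This paper proposes Self-Policy Distillation (SPD), which extracts a low-rank capability subspace from the model's own gradients, projects key-value (KV) activations into that subspace, and fine-tunes with the standard next-token prediction loss, yielding a generalizable, capability-selective self-distillation method that needs no external signal.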

AI中文摘要

自我蒸馏通过训练模型自身的生成来提升大语言模型(LLMs)。然而,现有方法要么依赖外部信号来筛选自生成输出(例如正确性过滤、执行反馈和奖励搜索),这些方法成本高且无法用于表现最佳的前沿模型;要么完全跳过筛选直接训练所有原始输出,这种方法通常领域特定且难以泛化。两者都共享一个更深层次的弱点,即自生成输出会将任务相关的能力建与其它因素(如风格模式、格式瑕疵和模型特定错误)纠缠在一起,稀释了要改进的特定能力的信号。在本文中,我们提出Self-Policy Distillation(SPD),实现了无需外部信号的通用且能力选择性的自我蒸馏。具体而言,SPD从模型对正确性定义标记的自身梯度中提取低维能力子空间,在自我生成过程中将关键值(KV)激活投影到该子空间,并在标准下一项预测损失下对结果进行微调。通过在代码生成、数学推理和多个选择性问答任务上的广泛实验,我们展示了SPD在无外部信号的情况下比最先进的自我蒸馏方法提高了高达13%,并且在预训练基线上的表现提高了高达16%。值得注意的是,SPD展示了优越的泛化能力,在跨领域泛化设置下表现更优15%。

英文摘要

Self-distillation bootstraps large language models (LLMs) by training on their own generations. However, existing methods either rely on external signals to curate self-generated outputs (e.g., correctness filtering, execution feedback, and reward search), which are costly and unavailable for the best-performing frontier models, or skip curation entirely and train on all raw outputs, an approach that is often domain-specific and hard to generalize. Both also share a deeper weakness: self-generated outputs entangle task-relevant capability with other factors, such as stylistic patterns, formatting artifacts, and model-specific errors, diluting the signal for the specific capability one aims to improve. In this paper, we propose Self-Policy Distillation (SPD), which achieves generalizable, capability-selective self-distillation without any external signal. Specifically, SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, projects key-value (KV) activations into this subspace during self-generation, and fine-tunes on the resulting raw outputs with standard next-token prediction loss. Through extensive experiments across code generation, mathematical reasoning, and multiple-choice QA, we show that SPD achieves up to 13% improvement over state-of-the-art self-distillation methods without external signals and up to 16% improvement over pre-trained baselines. Notably, SPD demonstrates superior generalizability, achieving 15% better performance under out-of-domain generalization settings.
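The subspace extraction and projection steps can be pictured with plain linear algebra. The sketch below assumes the low-rank subspace is taken from the top right singular vectors of a matrix of per-token gradients and that projection is an orthogonal projector applied to activations; the abstract does not state these details, and the function names are hypothetical.

```python
import numpy as np

def capability_subspace(grads, rank):
    """Orthonormal basis (dim x rank) spanning the top gradient directions.

    grads: array of shape (num_tokens, dim), one gradient vector per token.
    """
    _, _, vt = np.linalg.svd(grads, full_matrices=False)
    return vt[:rank].T

def project(acts, basis):
    """Orthogonally project activations (n x dim) onto the subspace."""
    return acts @ basis @ basis.T

rng = np.random.default_rng(0)
grads = rng.standard_normal((20, 6))   # stand-in for real gradients
basis = capability_subspace(grads, rank=2)
acts = rng.standard_normal((5, 6))     # stand-in for KV activations
projected = project(acts, basis)
print(basis.shape, np.linalg.matrix_rank(projected))
```

Projecting twice gives the same result as projecting once, so in this sketch everything outside the chosen directions is removed in a single step.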

2605.22668 2026-05-22 cs.CV

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

SEGA:用于扩散变换器中分辨率外推的频谱-能量引导注意力

Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B. Lindell, Babak Taati

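AI summary: SEGA improves diffusion transformers at high-resolution generation by dynamically adjusting attention scaling. Its core idea is to scale attention across RoPE components according to the spatial-frequency structure of the latent, balancing preservation of global structure against recovery of fine detail.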

Comments
27 pages, 14 figures. Project page: https://rajabi2001.github.io/sega/
AI中文摘要

扩散变换器(DiTs)已成为文本到图像生成的主导架构,但其在生成超出训练范围的分辨率时性能下降。现有的无训练方法通过修改推理时的注意力行为来缓解这一问题,通常通过旋转位置嵌入(RoPE)外推结合注意力缩放。然而,这些策略在RoPE组件上采用统一且内容无关的缩放,具有不同的频率特性,导致在保持全局结构和恢复细节之间产生权衡。我们引入SEGA,一种无训练方法,根据每个去噪步骤中潜在空间的空间-频率结构动态调整注意力缩放。这种自适应缩放提高了结构一致性和细节保真度。实验表明,SEGA在多个目标分辨率上均能提升高分辨率合成性能,优于最先进的无训练基线。

英文摘要

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.

2605.22666 2026-05-22 math.CO cs.LG math.PR

Holographic functions and neural networks

全息函数与神经网络

Balazs Szegedy

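AI summary: This paper studies bounded complexity for fuzzy Boolean functions through three properties (a sampling property called the holographic property, a structural property, and a computational property) and proves that the three are equivalent up to quantitative changes of the parameters.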

AI中文摘要

模糊布尔函数是映射 $f:\cube^n\to [0,1]$,其中 $n\in\mathbb N$。我们介绍了并比较了三种表示此类函数具有有界复杂度的方式。第一种是采样性质:函数值 $f(x)$ 可以通过随机选择的少量坐标值在小误差和高概率下恢复。我们称其为全息性质。第二种是结构性质:$f$ 与在有限多个有界线性坐标形式上的一次多项式一致。第三种是计算性质:$f$ 与具有有限个非输入神经元、有界Lipschitz激活函数和有界输入权重的神经网络的输出一致。我们证明了这三种性质在参数上是等价的。从全息性到多项式结构的推论使用了超图正则性的弱变种。

英文摘要

A fuzzy Boolean function is a map $f:\cube^n\to [0,1]$, where $n\in\mathbb N$. We introduce and compare three ways of saying that such a function has bounded complexity. The first is a sampling property: the value $f(x)$ can be recovered, up to small error and with high probability, from the values of a bounded number of randomly chosen coordinates of $x$. We call this the holographic property. The second is a structural property: $f$ is uniformly close to a bounded-degree polynomial in boundedly many bounded linear coordinate forms. The third is computational: $f$ is uniformly close to the output of a neural network with a bounded number of non-input neurons, bounded Lipschitz activation functions and bounded incoming weights. We prove that these three properties are equivalent up to quantitative changes of the parameters. The implication from holography to polynomial structure uses a variant of a weak version of hypergraph regularity.
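The sampling property is easy to see on a simple example. The sketch below assumes the cube is $\{0,1\}^n$ and takes $f$ to be the average of the coordinates, a hypothetical choice made here for illustration. It then estimates $f(x)$ from a small random subset of coordinates.

```python
import numpy as np

def f_mean(x):
    """Example fuzzy Boolean function: the fraction of coordinates equal to 1."""
    return float(np.mean(x))

def sampled_estimate(x, m, rng):
    """Estimate f_mean(x) from m coordinates chosen at random without replacement."""
    idx = rng.choice(len(x), size=m, replace=False)
    return float(np.mean(x[idx]))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10_000)
print(f_mean(x), sampled_estimate(x, 400, rng))
```

For this particular function the number of sampled coordinates needed for a given accuracy and confidence does not grow with $n$, which is the flavor of bounded complexity that the holographic property describes.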

2605.22662 2026-05-22 cs.AI

Claw AI Lab: An Autonomous Multi-Agent Research Team

Claw AI Lab:一个自主多智能体研究团队

Fan Wu, Cheng Chen, Zhenshan Tan, Taiyu Zhang, Xinzhen Xu, Yanyu Qian, Dingcheng Gao, Lanyun Zhu, Qi Zhu, Yi Tan, Deyi Ji, Guosheng Lin, Tianrun Chen, Deheng Ye, Fayao Liu

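AI summary: This paper presents Claw AI Lab, an autonomous research platform that turns automated research from a hidden prompt-to-paper pipeline into an interactive AI laboratory. Users can instantiate a full research team from one prompt, with customizable roles, collaborative workflows, real-time monitoring, artifact inspection, and rollback/resume control. The Claw-Code Harness connects local codebases, datasets, and checkpoints to runnable experiments, improving execution integration, experimental completion, and result integrity. In an internal evaluation, AI expert judges consistently preferred Claw AI Lab on idea novelty, experiment completeness, and paper presentation quality.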

Comments
Project page and code are available at https://github.com/Claw-AI-Lab/Claw-AI-Lab
AI中文摘要

我们介绍了Claw AI Lab,一个实验室原生的自主研究平台,将自动化研究从隐藏的提示到论文流程推进到交互式AI实验室。与围绕单一智能体或固定顺序工作流中心化系统不同,我们允许用户通过一个提示实例化完整的研究团队,支持自定义角色、协作流程、实时监控、artifact检查以及回滚/恢复控制,通过统一仪表板。该平台还支持探索、多智能体讨论和再现三种不同的研究模式,使自主研究在实践中变得更加可控和实验室化。Claw AI Lab的关键实际贡献在于其Claw-Code Harness,它将本地代码库、数据集和检查点连接到可运行的实验,并将执行artifact反馈到研究循环中。结果,Harness不仅提高了执行集成,还提高了实验完成和结果完整性:实验更容易检查、迭代和忠实转移到最终论文,减少了部分运行和格式错误报告等常见故障模式。在我们内部评估的五个AI研究案例研究中,使用AutoResearchClaw作为基线,Claw AI Lab在想法新颖性、实验完整性和论文质量上被AI专家评委一致偏好。我们视Claw AI Lab为一种新范式的第一步:自主研究作为可使用、交互式和可靠性感知的科学基础设施。

英文摘要

We present Claw AI Lab, a lab-native autonomous research platform that advances automated research from a hidden prompt-to-paper pipeline into an interactive AI laboratory. Rather than centering the system around a single agent or a fixed serial workflow, we allow users to instantiate a full research team from one prompt, with customizable roles, collaborative workflows, real-time monitoring, artifact inspection, and rollback/resume control through a unified dashboard. The platform also supports distinct research modes for exploration, multi-agent discussion, and reproduction, making autonomous research substantially more steerable and laboratory-like in practice. A key practical contribution of Claw AI Lab lies in its Claw-Code Harness, which connects local codebases, datasets, and checkpoints to runnable experiments and feeds execution artifacts back into the research loop. As a result, the harness improves not only execution integration, but also experimental completion and result integrity: experiments are easier to inspect, iterate on, and faithfully transfer into final papers, reducing common failure modes such as partial runs and malformed result reporting. In our internal evaluation on five AI research case studies, using AutoResearchClaw as the baseline, Claw AI Lab is consistently preferred by AI expert judges on idea novelty, experiment completeness, and paper presentation quality. We view Claw AI Lab as an early step toward a new paradigm: autonomous research as usable, interactive, and reliability-aware scientific infrastructure.

2605.22660 2026-05-22 cs.CL cs.AI

Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora

道德语义在机器翻译中得以保留:来自道德基础语料库的跨语言证据

Maciej Skorski

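AI summary: This study examines whether LLM-based translation can bridge the gap left by language-specific annotated corpora in moral values classification. Using Polish as a test case, it shows that direct translation preserves subtle moral cues well, offering a practical path to moral values research in under-resourced languages.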

AI中文摘要

道德语言具有微妙性和文化差异性,使得跨语言忠实翻译极具挑战性。习语、俚语和文化参考会引入难以避免的翻译痕迹。然而,自动道德价值观分类依赖于几乎只存在于英语中的语言特定标注语料库。我们研究了基于LLM的翻译是否能弥合这一差距,以波兰语为测试案例。使用约5万条涵盖广泛主题的道德标注社交媒体帖子,我们应用了一个系统化的四方法验证流程:LaBSE跨语言嵌入相似性、中心核对齐(CKA)、LLM作为评判者评估以及深度学习分类器公平性测试。我们证明,尽管在处理俚语、粗俗语言和文化负载表达方面存在不足,直接翻译能够很好地保留微妙的道德线索,这些线索足以被跨语言机器学习系统捕获——在所有基础方面,平均余弦相似度为0.86,AUC差距在0.01-0.02之间,经过语言模型微调后进一步缩小。这些结果表明,机器翻译是实现当前资源匮乏语言中道德研究的实用且成本效益高的途径。我们以波兰语作为代表性的斯拉夫语言展示了这一点,并预期可推广到相关语言。

英文摘要

Moral language is subtle and culturally variable, making it difficult to translate faithfully across languages. Idiomatic expressions, slang, and cultural references introduce hard-to-avoid translation artifacts. Yet automated moral values classification depends on language-specific annotated corpora that exist almost exclusively in English. We investigate whether LLM-based translation can bridge this gap, taking Polish as a test case. Using $\sim$50k morally-annotated social media posts from a diverse range of topics, we apply a principled four-method validation pipeline: LaBSE cross-lingual embedding similarity, Centered Kernel Alignment (CKA), LLM-as-judge evaluation, and deep learning classifier parity tests. We show that despite shortcomings in handling slang, vulgarity, and culturally-loaded expressions, direct translation preserves subtle moral cues well enough to be harvested by cross-lingual machine learning, with mean cosine similarity of 0.86 and AUC gaps of 0.01--0.02 across all foundations, gaps that close further under fine-tuning of language models. These results demonstrate that machine translation is a practical and cost-effective path to moral values research in languages currently under-resourced in this domain. We demonstrate this for Polish as a representative Slavic language, with expected generalisation to related languages.
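Two of the four validation measures, embedding cosine similarity and CKA, have compact definitions. The sketch below computes mean pairwise cosine similarity and the linear variant of CKA on two embedding matrices whose rows are aligned (source text and its translation). Whether the study uses linear or kernel CKA is not stated in the abstract, so the linear form is an assumption, and the random matrices stand in for real LaBSE embeddings.

```python
import numpy as np

def mean_cosine(a, b):
    """Average cosine similarity between corresponding rows of a and b."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a_n * b_n, axis=1)))

def linear_cka(x, y):
    """Linear Centered Kernel Alignment between two representation matrices."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
src = rng.standard_normal((50, 8))               # stand-in source embeddings
q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
rotated = 3.0 * src @ q                          # same geometry, new basis and scale
print(mean_cosine(src, src), linear_cka(src, rotated))
```

The two measures answer different questions: cosine similarity compares each pair of vectors directly, while linear CKA compares the overall geometry of the two sets and is unchanged by rotating or uniformly rescaling one of them.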

2605.22658 2026-05-22 cs.CV cs.LG cs.MM eess.IV

SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

SegCompass: 探索通过稀疏自编码器实现可解释对齐以增强推理分割

Zhenyu Lu, Liupeng Li, Jinpeng Wang, Haoqian Kang, Yan Feng, Ke Chen, Yaowei Wang

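AI summary: This paper proposes SegCompass, an end-to-end model that uses a sparse autoencoder to build an interpretable alignment pathway, improving both the performance and the interpretability of reasoning segmentation.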

Comments
Accepted by CVPR 2026. 15 pages, 9 figures, 6 tables
AI中文摘要

尽管大语言模型提供了强大的组合推理能力,但现有推理分割流程未能清晰地将这种推理与视觉感知连接起来。当前方法,如潜在查询对齐,虽然端到端但却是不透明的“黑箱”。相反,文本定位读出仅可读但不真正可解释,通常作为无约束的后处理步骤。为弥合这一可解释性差距,我们提出了SegCompass,一种端到端模型,利用稀疏自编码器(SAE)建立一个显式、可解释且可微的对齐路径。给定一个图像-指令对,SegCompass首先生成一个思维链(CoT)轨迹。该方法的核心是一个将CoT和视觉标记映射到共享高维稀疏概念空间的SAE。一个查询代码本从该空间中选择显著概念,然后通过槽映射器在空间上定位到多槽热图,引导最终的掩码解码器。整个模型联合训练,将强化学习用于推理路径与标准分割监督相结合。这种由SAE驱动的接口提供了显著比潜在查询更可追溯的“白盒”连接,比文本读出更连贯。在五个具有挑战性的基准测试中,SegCompass匹配或超越了最先进的性能。关键的是,我们的视觉和定量分析显示,所学稀疏概念的质量与最终掩码准确性之间存在强相关性,证实了SegCompass通过其增强且可检查的对齐实现了优越的结果。代码可在https://github.com/ZhenyuLU-Heliodore/SegCompass获取。

英文摘要

While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step. To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. Given an image-instruction pair, SegCompass first generates a chain-of-thought (CoT) trace. The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high-dimensional sparse concept space. A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi-slot heatmap that guides the final mask decoder. The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision. This SAE-driven interface provides a "white-box" connection that is significantly more traceable than latent queries and more coherent than textual readouts. Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state-of-the-art performance. Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment. Code is available at https://github.com/ZhenyuLU-Heliodore/SegCompass.

2605.22654 2026-05-22 cs.CL cs.CV

Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs

看见诗歌:基于大语言模型的AI生成现代汉语诗歌的图像-语义检测

Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu, Jingming Yao, Chengzhong Xu, Jiebo Luo, Derek F. Wong

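AI summary: This paper proposes an image-semantic guided detection method that combines image content with the poem text to improve LLMs as detectors of AI-generated modern Chinese poetry. Experiments show that detectors based on it outperform plain-text baselines and the best traditional detector, RoBERTa.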

AI中文摘要

先前的检测研究显示,LLMs无法有效用作检测器,但这些研究未涉及现代汉语诗歌。此外,没有相关研究探讨LLMs在检测现代汉语诗歌中的性能。本文评估并提升了LLMs作为现代汉语诗歌检测器的性能,并提出了一种图像-语义引导的诗歌检测方法。与传统检测方法相比,我们的方法创新性地整合了反映诗歌内容的图像。通过示例驱动的方法,我们的方法有效整合了图像中的意义、意象和情感信息,然后与诗歌文本形成互补判断。实验结果表明,基于我们方法的LLM检测器在多个数据集上均优于基于纯文本的基线检测器,甚至超越了表现最佳的传统检测器RoBERTa。使用我们方法的Gemini检测器在Macro-F1得分上达到85.65%,达到最先进的水平。不同LLM检测器在多个LLM生成数据上的性能提升证明了我们方法的有效性。

英文摘要

Previous detection studies have shown that LLMs cannot be effectively used as detectors, but these studies have not addressed modern Chinese poetry. Moreover, no relevant research has explored the performance of LLMs in detecting modern Chinese poetry. This paper evaluates and enhances the performance of LLMs as detectors for modern Chinese poetry, and proposes an image-semantic guided poetry detection method. Compared with traditional detection approaches, our method innovatively incorporates images that reflect the content of the poetry. Through example-driven approaches, our method effectively integrates information such as meaning, imagery, and feeling from the image, then forms a complementary judgment with the poem text. Experimental results demonstrate that the LLM detectors based on our method outperform baseline detectors based on plain text, and even surpass the best-performing traditional detector, RoBERTa. The Gemini detector using our method achieves a Macro-F1 score of 85.65%, reaching the state-of-the-art level. The performance improvements of different LLM detectors on data generated by multiple LLMs confirm the effectiveness of our method.

2605.22653 2026-05-22 cs.DS cs.LG

The Secretary Problem with a Stochastic Precursor

带随机前导的秘书问题

Franziska Eberle, Alexander Lindermayr

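AI summary: This paper studies the secretary problem with a stochastic precursor and shows that a prediction can be valuable purely because of its arrival time. In the random-order model, a single uniformly timed precursor yields success probability at least 1/2, improving on the classic 1/e benchmark. In the adversarial-order model, sufficiently concentrated precursors recover constant success guarantees.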

AI中文摘要

在学习增强的在线算法中,预测通常因其提供的价值估计、解决方案或算法推荐而被重视。本文表明,预测仅因其到达时间而有价值。我们研究了带随机前导的秘书问题:一种无内容的信号,保证在最佳项目之前到达,但其他时间是随机的。该信号不携带额外信息;然而,其到达时间本身改变了最优停止策略的结构。我们分别在随机顺序和对抗性顺序模型中刻画了最优策略。在随机顺序中,单个均匀时间的前导可使成功概率达到至少1/2,优于经典1/e的基准。随着前导时间越来越晚,成功概率接近1。在对抗性顺序中,对于传统模型无法提供强保证的情况,足够集中的前导可恢复常数成功保证。我们的结果表明,这种新型的异步时间信息是在线决策中的独特且强大的建议形式,可能对其他问题也有效。

英文摘要

In learning-augmented online algorithms, predictions are usually valued for what they say: a value estimate, a solution, or an algorithmic recommendation. This paper shows that predictions can also be valuable solely due to their arrival time. We study the fundamental secretary problem augmented with a stochastic precursor: a content-free signal that is guaranteed to arrive no later than the best item, but is otherwise stochastically timed. The signal does not carry any additional information; nevertheless, its timing alone changes the structure of optimal stopping. We characterize optimal policies in the random-order and adversarial-order models. In random order, a single uniformly timed precursor already gives success probability at least $\frac12$, improving on the classic $\frac1e$ benchmark. With increasingly late precursors, the success probability approaches $1$. In adversarial order, for which traditional models do not admit strong guarantees, sufficiently concentrated precursors recover constant success guarantees. Our results show that such novel forms of asynchronous temporal information are a distinct and powerful form of advice in online decision making and may also be effective for other problems.
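A small simulation makes the gap between the two benchmarks tangible. The sketch below compares the classic rule (skip roughly the first $n/e$ items, then take the first record) with a simple precursor-aware policy: once the signal arrives, take the first item that beats everything seen so far. Both the policy and the timing model (the signal arrives just before a position drawn uniformly from those up to and including the best item's position) are assumptions made here for illustration, not the paper's definitions.

```python
import math
import random

def classic_rule(values):
    """Skip the first floor(n/e) items, then accept the first record."""
    skip = int(len(values) / math.e)
    best_seen = max(values[:skip], default=float("-inf"))
    for v in values[skip:]:
        if v > best_seen:
            return v
    return values[-1]

def precursor_rule(values, p):
    """Signal arrives just before index p; accept the first record from p on."""
    best_seen = max(values[:p], default=float("-inf"))
    for v in values[p:]:
        if v > best_seen:
            return v
    return values[-1]

def simulate(n=30, trials=20000, seed=0):
    """Estimate success probabilities of both rules on random orderings."""
    rng = random.Random(seed)
    wins_classic = wins_precursor = 0
    for _ in range(trials):
        values = list(range(n))
        rng.shuffle(values)
        best_pos = values.index(n - 1)
        p = rng.randint(0, best_pos)  # assumed timing: uniform, never after the best
        wins_classic += classic_rule(values) == n - 1
        wins_precursor += precursor_rule(values, p) == n - 1
    return wins_classic / trials, wins_precursor / trials

print(simulate())
```

Under these assumptions the precursor-aware policy succeeds about half the time, against roughly 0.37 for the classic rule, even though the signal carries no information beyond its timing.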

2605.22651 2026-05-22 cs.CV

What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

图中标签真的在说些什么?用于视觉语言预训练中组合数据选择的反事实短语干预

Hyejin Go, Semi Lee, Hyesong Choi

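AI summary: This paper studies how counterfactual phrase intervention can improve compositional data selection in vision-language pretraining. It proposes CPI to address the saturation of global filtering signals in existing methods, improving model performance on relation recognition tasks.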

Comments
11 pages, 2 figures, 4 tables. Preprint
AI中文摘要

CLIP风格的对比预训练通常通过样本级过滤信号来收集网络级图像-文本对,通常基于对级对齐。我们证明这种信号饱和:一旦粗略不匹配被移除,更严格的全局过滤不再跟踪由保留标签提供的组合监督。原因在于结构问题 - 全局评分混淆了对是否广泛合理与是否个别对象、属性和关系短语在标签中实质性支持图像-文本匹配。后者是组合泛化所需,但对级过滤器对此无能为力。我们通过反事实短语干预(CPI),一种短语级整理框架,将受控的非正式令牌替换转换为图像条件的短语敏感性评分。CPI仅使用全局对齐进行粗略不匹配移除,然后通过是否在受控替换下短语显著影响图像-文本评分来对幸存池进行排名。我们将CPI框架为一阶短语敏感性信号,而非接地或识别结果,并在CC3M规模上评估。按此信号排名产生一个50%的数据子集,在VL-CheckList-VG关系任务上比完整数据基线提高+1.91,在匹配预算下比仅对齐过滤提高+1.00,同时提高SugarCrepe整体表现并保持泛化转移。CPI是损失正交的:应用不变于NegCLIP,它进一步在VL-CheckList-VG关系任务上提高+3.84,并在主要文本中获得额外的CE-CLIP收益。

英文摘要

CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision provided by the retained captions. The reason is structural: a global score conflates whether a pair is broadly plausible with whether the individual object, attribute, and relation phrases inside the caption materially support the image-text match. The latter is what compositional generalization demands, yet pair-level filters are blind to it. We address this with Counterfactual Phrase Intervention (CPI), a phrase-level curation framework that converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores. CPI uses global alignment only for coarse mismatch removal, then ranks the surviving pool by whether caption phrases measurably affect the image-text score under controlled substitution. We frame CPI as a first-order phrase-sensitivity signal rather than a grounding or identification result, and evaluate it at CC3M scale. Ranking by this signal yields a 50%-data subset that improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, while improving SugarCrepe overall and preserving general transfer. CPI is loss-orthogonal: applied unchanged to NegCLIP, it further improves VL-CheckList-VG Relation by +3.84, with additional CE-CLIP gains in the main text.
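The substitution idea can be sketched independently of any particular model. In the toy below, a phrase's sensitivity is the drop in an image-text score when the phrase is replaced by a nonce token. The scorer `toy_score` is a hypothetical word-overlap stand-in for a real CLIP-style similarity, and the nonce word and function names are likewise assumptions made for illustration.

```python
def phrase_sensitivity(score_fn, image, caption, phrases, nonce="blicket"):
    """Score drop caused by swapping each phrase for a nonce token."""
    base = score_fn(image, caption)
    scores = {}
    for phrase in phrases:
        edited = caption.replace(phrase, nonce)
        scores[phrase] = base - score_fn(image, edited)
    return scores

def toy_score(image_words, caption):
    """Stand-in scorer: fraction of caption words that appear in the image."""
    words = caption.split()
    return sum(w in image_words for w in words) / len(words)

image = {"dog", "chasing", "red", "ball"}
supported = phrase_sensitivity(toy_score, image, "a dog chasing a red ball",
                               ["dog", "red ball"])
unsupported = phrase_sensitivity(toy_score, {"dog"}, "a dog near a tree", ["tree"])
print(supported, unsupported)
```

A phrase that the image supports produces a positive drop, while a phrase the image does not support produces none, which is the kind of per-phrase signal a pair-level score cannot provide.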

2605.22650 2026-05-22 cs.CL

Whose Voice Counts? Mapping Stakeholder Perspectives on AI Through Public Submissions to the U.S. Government

谁的声音被听见?通过美国政府公开提交的材料映射利益相关者的AI观点

Alina Karakanta, Alex Christiansen, Tomás Dodds, Bissie Anderson, Matteo Fuoli, Marcus Perlman, Aletta G. Dorst

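AI summary: This paper analyzes letters submitted during the public consultation on the U.S. government's AI Action Plan to examine how different stakeholders view AI. Individuals are more concerned with AI's impact on life, while other groups focus more on AI development, and the AI Action Plan predominantly reflects private-sector concerns.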

AI中文摘要

随着人工智能(AI)系统在日常生活中的普及,了解不同利益相关者如何理解和设想这些技术在塑造社会、政治和经济现实中的作用变得至关重要。本文基于特朗普政府AI行动计划公众咨询期间提交的信件语料库,调查公众对AI的看法。为此,我们发布了一个语料库清理流程,并通过主题建模和频率分析来探索不同子群体(如学术界、个人、私营部门)讨论的主要主题以及AI行动计划中出现的主题。我们的结果表明,个人对AI对生活的影响表达了强烈担忧,而其他利益相关者则更关注AI的发展。我们的主题比较显示,AI行动计划主要反映了私营部门对安全、政策和发展方面的关切,而个人的关切则代表性较低。

英文摘要

As artificial intelligence (AI) systems become more common in our daily lives, it is important to understand how different stakeholders comprehend and envisage the role that these technologies play in shaping social, political, and economic realities. In this paper, we investigate public perceptions of AI based on a corpus of letters submitted during the public consultation for the Trump Administration's US AI Action Plan. To this end, we release a corpus cleaning pipeline and perform topic modelling and frequency analysis to explore predominant topics discussed by different subgroups (e.g., academia, individuals, private sector) and those appearing in the AI Action Plan. Our results show that individuals voice strong concerns related to the impact of AI on life, while other stakeholders are more concerned with AI development. Our comparison of topics suggests that the AI Action Plan predominantly reflects the concerns of the private sector on security, policies, and development, with individuals' concerns less represented.