arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22538 2026-05-22 cs.CV

Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking

Deyi Zhu, Yuji Wang, Yong Liu, Yansong Tang, Bingyao Yu, Jiwen Lu, Jie Zhou

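Affiliations Shenzhen International Graduate School, Tsinghua University; Department of Automation, Tsinghua University

AI summary This paper proposes the SAMOSA framework, which explicitly exploits motion, geometry, and semantic cues to improve SAM 2 on complex nonlinear visual object tracking, yielding a more robust and generalizable tracker.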

English abstract

Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.

2605.22537 2026-05-22 cs.LG

F-TIS: Harnessing Diverse Models in Collaborative GRPO

Nikolay Blagoev, Oğuzhan Ersoy, Wendelin Boehmer, Lydia Yiyu Chen

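Affiliations Gensyn; TU Delft; University of Neuchatel

AI summary This paper proposes F-TIS, a method that uses heterogeneous models in collaborative GRPO training to improve each local model's learning. It achieves efficient communication and consistent final model convergence, and in some cases improves the model's generalization on out-of-distribution tasks.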

Comments Accepted to ICML 2026 Workshop Scalable Learning and Optimization for Efficient Multimodal AI Agents (SCALE)

English abstract

Reinforcement learning methods such as GRPO have become very popular in LLM post-training. In GRPO, models produce completions for a set of prompts, the completions are rewarded, and the policy is updated towards the relatively high-reward completions. Because models are auto-regressive, the generation phase of this style of training can be extremely time-consuming. As a solution, prior work has sought to distribute the inference step across many nodes working in parallel. These works primarily assume homogeneous models during training in order to keep samples as close to on-policy as possible. This assumption may be impractical in decentralized systems, where parties with different compute resources and preferences may wish to collaborate on the same task. Thus, decentralized training requires an approach that can handle heterogeneous models, that is, different models collaborating on the same tasks. However, heterogeneity leads to highly off-policy samples during training, and prior work has found that off-policy samples can hurt GRPO convergence. To enable heterogeneity, we propose Filtered Truncated Importance Sampling (F-TIS), a GRPO-style training paradigm that can use off-policy samples to improve a local model's learning. Our framework allows various models to collaborate in the same RL training run while remaining communication-efficient. We extensively evaluate F-TIS in various heterogeneous setups and show that it exhibits final model convergence identical to purely on-policy training. Furthermore, in some setups we observe better generalization on out-of-distribution tasks than with on-policy training, increasing the model's performance by up to 12%.
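
Illustrative sketch (not from the paper): the abstract does not spell out the F-TIS update, so the code below only shows generic truncated importance weighting with a filter inside one GRPO-style group. It assumes each completion comes with its sequence log-probability under the local policy and under the model that generated it. The names `clip_max` and `min_ratio` and their values are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Standardize rewards within one prompt's group of completions (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All completions scored the same: no learning signal for this group.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def truncated_weights(logp_local, logp_behavior, clip_max=2.0, min_ratio=0.1):
    """Importance ratio pi_local / pi_behavior per completion.

    clip_max and min_ratio are hypothetical knobs: ratios are capped at
    clip_max (truncation), and samples whose ratio falls below min_ratio
    are dropped (filtering) by giving them zero weight.
    """
    weights = []
    for lp_local, lp_behavior in zip(logp_local, logp_behavior):
        ratio = math.exp(lp_local - lp_behavior)
        weights.append(0.0 if ratio < min_ratio else min(ratio, clip_max))
    return weights


def sample_coefficients(rewards, logp_local, logp_behavior):
    """Scalar that would multiply each completion's log-likelihood gradient."""
    advantages = group_advantages(rewards)
    weights = truncated_weights(logp_local, logp_behavior)
    return [w * a for w, a in zip(weights, advantages)]


# Four completions for one prompt; the last one is very unlikely under the
# local policy and is therefore filtered out.
rewards = [1.0, 0.0, 1.0, 0.0]
logp_local = [-1.0, -1.0, -1.0, -5.0]
logp_behavior = [-1.0, -3.0, -1.0, -1.0]
coeffs = sample_coefficients(rewards, logp_local, logp_behavior)
```

Capping the ratio bounds the variance that a single strongly off-policy sample can inject, at the price of some bias; whether F-TIS filters on the ratio or on another criterion is not stated in the abstract.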

2605.22536 2026-05-22 cs.CV cs.CL

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong

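Affiliations Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; University of Electronic Science and Technology of China; Chongqing University; The University of Tokyo; Beihang University; Northwestern Polytechnical University

AI summary This paper presents SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. A physically grounded degradation synthesis engine generates nine degradation types, which are used to evaluate the spatial reasoning of multimodal large language models under visual degradation. The paper also shows that fine-tuning under degraded conditions improves model robustness.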

English abstract

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.

2605.22535 2026-05-22 cs.AI

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng, Peter O'Hearn, Earl T. Barr, Mark Harman, Federica Sarro, He Ye

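Affiliations University College London; Nanjing University; Tencent

AI summary This paper presents TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from real-world terminal recordings. From 80,870 terminal recordings it produces 1,530 validated tasks spanning 18 real-world categories, from short everyday operations to workflows exceeding 50 steps, covering 1,280 unique commands. A Verified subset of 200 representative tasks is curated from these. A comprehensive evaluation across eight frontier models and six agents finds that current systems still struggle with real terminal workflows, with a maximum pass rate of 62.5%. TerminalWorld also captures real-world terminal capabilities distinct from existing expert-curated benchmarks such as Terminal-Bench, correlating only weakly with their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, so it can keep evaluating agents in real terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.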

English abstract

We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.

2605.22531 2026-05-22 cs.LG

Disentanglement Beyond Generative Models with Riemannian ICA

Edmond Cunningham

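Affiliations University of Massachusetts Amherst

AI summary This paper proposes Riemannian ICA, a disentanglement method that does not rely on a generative model. It introduces a disentanglement tensor to study local disentanglement properties, providing a theoretical basis for understanding feature disentanglement without generative assumptions.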

English abstract

There is a gap between the theoretical foundations of disentanglement and the practice of modern representation learning. Existing theoretical frameworks, particularly Independent Component Analysis (ICA) and its nonlinear variants, assume a generative model with statistically independent latent variables underlying the data so that disentanglement amounts to identifying the latents that could have generated the data. This generative framework is interpretable and theoretically justified, but its strong assumptions make it difficult to apply to modern representation learning. Modern pretrained encoders often learn features that exhibit disentangled properties without making generative assumptions, yet there is no general theory for interpreting these features as independent factors of variation. We take a step toward such a theory by introducing Riemannian ICA (RICA), which replaces ICA's global generative model with local geometric structure. RICA is founded on the observation that in ICA, the factors of variation underlying a data point can be understood through radial curves emanating from the point that map to axis-aligned lines in the latent space. We formalize this perspective using Riemannian geometry and introduce our theory in a way that is consistent with the existing generative approach. Our main contribution is the disentanglement tensor, which encodes a second-order notion of disentanglement that we call pointwise disentanglement. This tensor depends on the Hessian of the data log likelihood as well as the Ricci curvature induced by the model. In a controlled source recovery setting with known ground-truth sources, RICA recovers sources across several manifolds, while the success of ICA baselines depends on the coordinates used to represent the observations. Our work provides a theoretical basis for studying local disentanglement without assuming a global generative model.

2605.22530 2026-05-22 cs.AI

A Subjective Logic-based method for runtime confidence updates in safety arguments

Benjamin Herd, Jessica Kelly, Clarissa Heinemann, João-Vitor Zacchi

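Affiliations Fraunhofer Institute for Cognitive Systems (IKS)

AI summary This paper proposes a Subjective Logic-based method for dynamic quantitative assurance in safety arguments. It integrates design-time evidence with windowed runtime Safety Performance Indicators (SPIs) to quantify and propagate confidence across the development lifecycle. At runtime, SPI evidence is continuously evaluated and the targeted claims are updated by a rule that increases confidence when no violations occur and imposes an immediate penalty when they do. The design prioritizes safety-relevant responsiveness over exact classical Bayesian posterior updates.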

Comments Accepted for publication at the 41st ACM/SIGAPP Symposium on Applied Computing (SAC 2026)

Journal ref Proceedings of the 41st ACM/SIGAPP Symposium on Applied Computing (SAC '26), 2026

English abstract

We present a method for dynamic quantitative assurance that enhances static safety cases with continuous, runtime-driven confidence updates. The method quantifies and propagates confidence across the development lifecycle by integrating design-time evidence and windowed runtime Safety Performance Indicators (SPIs) within a single Subjective Logic (SL)-based assurance case. At runtime, SPI evidence is continuously evaluated, and targeted claims are updated using a rule that increases confidence in the absence of violations and imposes prompt penalties when violations occur. This design prioritizes safety-relevant responsiveness over exact classical Bayesian posterior updates. We demonstrate the method using a simulation-based construction zone assist function, focusing on an ML-based construction cone detection component, and show how confidence evolves as SPI evidence is observed in operation.
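
Illustrative sketch (not from the paper): the abstract does not give the update rule, so the code below pairs the standard Subjective Logic mapping from evidence counts to a binomial opinion (belief, disbelief, uncertainty, with prior weight W = 2) with a hypothetical asymmetric rule. In this rule a violation-free SPI window adds one unit of positive evidence, and each violation adds `penalty` units of negative evidence. The `penalty` value and the function names are assumptions.

```python
from dataclasses import dataclass

W = 2.0  # non-informative prior weight commonly used in Subjective Logic


@dataclass
class Opinion:
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def projected(self):
        """Projected probability: belief plus the base-rate share of uncertainty."""
        return self.belief + self.base_rate * self.uncertainty


def opinion_from_evidence(r, s, base_rate=0.5):
    """Map positive evidence r and negative evidence s to a binomial opinion."""
    total = r + s + W
    return Opinion(r / total, s / total, W / total, base_rate)


def update_evidence(r, s, violations, penalty=5.0):
    """Hypothetical windowed update: slow reward, prompt penalty."""
    if violations == 0:
        return r + 1.0, s
    return r, s + penalty * violations


# Start from design-time evidence, then observe a clean window and a window
# with one SPI violation.
r, s = 8.0, 0.0
design_time = opinion_from_evidence(r, s)
r, s = update_evidence(r, s, violations=0)
r, s = update_evidence(r, s, violations=1)
after_violation = opinion_from_evidence(r, s)
```

Weighting violations more heavily than clean windows is one simple way to get the prompt penalties the abstract describes; it deliberately departs from a symmetric Bayesian count update.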

2605.22529 2026-05-22 cs.LG cs.AI

Stabilising Explainability Fragility in Cybersecurity AI: The Impact and Mitigation of Multicollinearity in Public Benchmark Datasets

Ioannis J. Vourganas, Anna Lito Michala

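Affiliations Netrity Ltd; University of Glasgow

AI summary This paper investigates an unexplored but important vulnerability in AI explainability for intrusion detection systems (IDS): instability induced by multicollinearity. Although post-hoc explainability tools such as SHAP and LIME are widely relied upon, the effect of correlated features on explanation robustness has not been evaluated. The authors introduce a formal theorem showing that multicollinearity inflates attribution variance, which demonstrates that explanations and feature importances are non-identifiable under multicollinearity. Comprehensive experiments on a representative benchmark dataset, UNSW-NB15, validate the theorem. Four widely used model families (linear, tree-based, kernel, and neural) are evaluated on full feature sets and on sets pruned by VIF and correlation thresholds. The paper proposes a new metric, the Explainability Fragility Score, and two new mitigation methods with varying integration complexity. CAA-Filtering stabilises explanations by grouping the attributions of trained models. SHARP is a new training-time regularisation framework that penalises attribution instability, making explainability stability controllable and monotonically improvable. The results support stable predictive performance, with Kendall's τ used to quantify instability across bootstrapped explanations. The work bears directly on the trustworthiness and reproducibility of XAI in security-critical domains, motivates building multicollinearity mitigation into IDS pipelines, and offers a set of guidelines for practitioners.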

Comments 35 pages, 3 figures, submitted to ACM TAISAP

English abstract

This paper investigates an unexplored yet impactful vulnerability in AI explainability used in intrusion detection systems (IDS): multicollinearity-induced instability. Despite extensive reliance on post-hoc explainability tools such as SHAP or LIME, the impact of correlated features on explanation robustness has not been evaluated. We introduce a formal theorem stating that multicollinearity inflates attribution variance. This demonstrates that explanations and feature importances are non-identifiable under multicollinearity. A suite of comprehensive experiments validates the theorem on a representative benchmark dataset, UNSW-NB15. Four widely used families of models (linear, tree-based, kernel, and neural) are evaluated across full and pruned feature sets based on VIF and correlation thresholding. We propose a novel metric, the Explainability Fragility Score, and two novel methods to mitigate fragility with varying integration complexity. CAA-Filtering focuses on stabilising explanations by grouping attributions of trained models. SHARP is a novel training-time regularisation framework that penalises attribution instability, enabling controllable and monotonic improvement of explainability stability. The findings support stable predictive performance, using Kendall's τ to quantify instability across bootstrapped explanations. This work has direct implications for the trustworthiness and reproducibility of XAI in security-critical contexts, and motivates incorporating multicollinearity mitigations into IDS pipelines, providing a set of guidelines for practitioners.
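
Illustrative sketch (not from the paper): the abstract names two standard quantities, the variance inflation factor (VIF) and Kendall's τ across bootstrapped explanations, without giving formulas. The code below computes Kendall's τ-a between attribution vectors and averages it over all pairs of bootstrap runs, plus the textbook VIF from an R² value. It is not the paper's Explainability Fragility Score; the function names and the toy attribution values are assumptions.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score vectors (tied pairs count as 0)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)  # +1 concordant, -1 discordant
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs


def mean_pairwise_tau(attribution_runs):
    """Average tau over all pairs of bootstrap runs; 1.0 means identical rankings."""
    pairs = list(combinations(attribution_runs, 2))
    return sum(kendall_tau(a, b) for a, b in pairs) / len(pairs)


def vif_from_r2(r_squared):
    """VIF of a feature, given the R^2 from regressing it on the other features."""
    return 1.0 / (1.0 - r_squared)


# Toy attributions for three features over three bootstrap resamples. The
# third run reverses the ranking, as can happen when features are collinear.
runs = [
    [0.5, 0.3, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.3, 0.5],
]
stability = mean_pairwise_tau(runs)
```

A mean τ near 1 indicates that resampling barely changes the feature ranking, while values near 0 or below indicate fragile explanations. A feature whose R² against the others is 0.9 has a VIF of about 10, a common rule-of-thumb threshold for pruning.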

2605.22521 2026-05-22 cs.RO cs.HC

Quantifying Full-Body Immersion

Alihan Bakir, Ekrem Yüksel, Fabio Zuliani, Neil Chennoufi, Francesco Bruno, Jamie Paik

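Affiliations Reconfigurable Robotics Lab

AI summary This paper proposes a new paradigm for immersive virtual experiences based on whole-body kinetic interaction. It defines three levels of immersion (audio-visual, physical, and full-body) and uses modular robotic surface units to render immersive environments at scale, advancing the symbiosis of humans and virtual environments.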

Comments This manuscript is under consideration for possible publication in Nature. Copyright may be transferred to Nature if the manuscript is accepted for publication, without further notice.

English abstract

Humanity is at the forefront of yet another digital revolution, where the lines between real and virtual worlds are dissolving, reshaping how we perceive and interact with our surroundings. In this context, we introduce a transformative paradigm for immersive virtual experiences centered around whole-body kinetic interactions. Our approach redefines immersion through three distinct levels: audio-visual immersion, capturing sensory realism; physical immersion, delivering haptic feedback; and full-body immersion (FBI), where dynamic bodily interaction integrates seamlessly with virtual environments. At the core of this innovation lies a scalable, distributable platform based on modular robotic surface units inspired by the adaptive designs of nature. These units enable the rendering of immersive environments at any scale, from intimate personal experiences to expansive multi-user settings, dynamically adapting to interactions in real time. The modular system distributes force, shape, and motion feedback throughout entire spaces, replicating the physical characteristics of the environment and enabling a new depth of engagement through FBI. By combining scalability, adaptability, and dynamic physical engagement, this framework bridges the gap between real and virtual worlds. It offers an unprecedented level of immersion where users can engage their entire bodies in symbiotic interactions with the virtual space. This work not only advances immersive technology but also redefines how humans and virtual environments coexist, setting a foundation for a new era of human-environment synthesis.

2605.22513 2026-05-22 cs.AI

Meta-Learning for Rapid Adaptation in Reference Tracking of Uncertain Nonlinear Systems

Jiaqi Yan, Ankush Chakrabarty, Niklas Schmid, John Lygeros, Alisa Rupenyan

Affiliations School of Automation Science and Electrical Engineering, Beihang University; Mitsubishi Electric Research Laboratories; Automatic Control Laboratory, ETH Zurich; ZHAW Centre for Artificial Intelligence, Zurich University of Applied Sciences

AI summary This paper addresses reference tracking for uncertain nonlinear systems with a meta-learning-based control framework. It uses data from source systems to accelerate training and improve control performance, and a two-phase procedure enables rapid adaptation to the target system.

Comments 13 pages

English abstract

In this paper, we address the problem of reference tracking for uncertain nonlinear systems. Since collecting data from the target system (i.e., the system of interest) is often challenging, our objective is to design optimal controllers using limited target system data. Meta-learning provides a promising paradigm by leveraging offline data from source systems (systems sharing structural similarities with the target system) to accelerate training and enhance control performance. Motivated by this idea, we propose a meta-learning-based control framework that tailors the implicit model-agnostic meta-learning (iMAML) algorithm to the control setting. The framework operates in two phases: an (offline) meta-training phase, where an aggregated representation is learned from source data to capture the shared system dynamics among similar systems, and an (online) meta-adaptation phase, where this representation is fine-tuned on the target system using only a few data samples and limited adaptation steps. We formulate this framework as a bi-level optimization problem and provide an efficient solution with reduced storage complexity and few approximations. The proposed framework is general, allowing various learning algorithms to be integrated. To demonstrate this flexibility, we propose two specific learning algorithms that can be incorporated into our framework based on a neural state-space model and a deep Q-network, respectively. The primary distinction between these approaches is whether explicit system identification is required. Numerical simulations and hardware experiments demonstrate that the proposed methods enhance control performance and consistently outperform baseline approaches.
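
Illustrative sketch (not from the paper): the two-phase, bi-level structure can be seen on a toy scalar problem. The code assumes the standard iMAML form, where the inner problem is argmin over φ of L_i(φ) + (λ/2)(φ − θ)² and the meta-gradient is (1 + L_i''(φ*)/λ)⁻¹ · L_i'(φ*). Each hypothetical "system" is a quadratic loss 0.5·a·(φ − c)², so the inner problem has a closed form. The paper's neural state-space and deep Q-network learners are not reproduced, and the same loss is used for adaptation and evaluation for brevity.

```python
def adapt(theta, a, c, lam):
    """Inner (adaptation) step in closed form:
    argmin_phi 0.5*a*(phi - c)**2 + 0.5*lam*(phi - theta)**2."""
    return (a * c + lam * theta) / (a + lam)


def task_loss(phi, a, c):
    """Toy per-system loss with curvature a and optimum c."""
    return 0.5 * a * (phi - c) ** 2


def implicit_meta_grad(theta, a, c, lam):
    """iMAML meta-gradient: (1 + hessian / lam)^-1 * dL/dphi at the adapted phi."""
    phi = adapt(theta, a, c, lam)
    grad_phi = a * (phi - c)
    hessian = a  # second derivative of the quadratic task loss
    return grad_phi / (1.0 + hessian / lam)


def meta_train(tasks, theta=0.0, lam=1.0, lr=1.0, steps=200):
    """Offline phase: gradient descent on the meta-parameter over source systems."""
    for _ in range(steps):
        grad = sum(implicit_meta_grad(theta, a, c, lam) for a, c in tasks) / len(tasks)
        theta -= lr * grad
    return theta


# Offline meta-training on two source systems, given as (curvature a, optimum c).
source_tasks = [(1.0, 1.0), (1.0, 3.0)]
theta_meta = meta_train(source_tasks)

# Online adaptation to a target system whose optimum lies at 2.5.
phi_target = adapt(theta_meta, a=1.0, c=2.5, lam=1.0)
```

The implicit gradient needs only the adapted parameter, not the path the inner optimizer took, which is where the storage saving comes from. With vector-valued parameters the scalar division becomes a linear solve, which iMAML typically approximates with conjugate gradient.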

2605.22507 2026-05-22 cs.LG stat.ML

Generative Modeling by Value-Driven Transport

Pablo Moreno-Muñoz, Adrian Müller, Gergely Neu

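Affiliations Universitat Pompeu Fabra Barcelona; ETH Zürich; ICREA & Universitat Pompeu Fabra Barcelona

AI summary This paper proposes a new generative modeling framework based on a discrete-time stochastic control formulation of measure transport. The dual variables of a linear program directly encode the optimal control policy, and an efficient simulation-free primal-dual algorithm computes approximately optimal value functions and value-driven transport (VDT) policies, which show strong performance and good scalability across several experiments.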

English abstract

We propose a new framework for generative modeling based on a discrete-time stochastic control formulation of measure transport. Adapting classic results from control theory, we formulate our problem as a linear program whose dual variables correspond to the optimal value function of the control problem, which directly encodes the optimal control policy. Exploiting this LP formulation, we develop an efficient simulation-free primal-dual algorithm for computing approximately optimal value functions and the associated value-driven transport (VDT) policies which approximate the true optimal policy. We show that well-trained VDT policies enjoy numerous favorable properties in comparison with other state-of-the-art methods based on flows, diffusions, or Schrödinger bridges: they lead to straight transport paths which can be simulated quickly and robustly, and can be enhanced in all the same ways as diffusion and flow-based models (e.g., conditional generation, classifier-free guidance, unpaired data-to-data translation are all easy to incorporate). We evaluate our methodology in a range of experiments, with results that indicate strong performance and good potential for scalability.

2605.22504 2026-05-22 cs.AI cs.CV

LACO: Adaptive Latent Communication for Collaborative Driving

Tianhao Chen, Yuheng Wu, Dongman Lee

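Affiliations Korea Advanced Institute of Science & Technology

AI summary This paper proposes LACO, a training-free latent communication paradigm that addresses the latency and information loss of communication in collaborative driving through Iterative Latent Deliberation, Cross-Horizon Saliency Attribution, and Structured Semantic Knowledge Distillation. Experiments show that it reduces communication and inference latency while maintaining strong collaborative driving performance.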

English abstract

Collaborative driving aims to improve safety and efficiency by enabling connected vehicles to coordinate under partial observability. Recent approaches have evolved from sharing visual features for perception to exchanging language-based reasoning through foundation models for behavioral coordination. Though communicating in language provides intuitive information, it introduces two challenges: high latency caused by autoregressive decoding and information loss caused by compressing rich internal representations into discrete tokens. To address these challenges, we analyze latent communication in collaborative driving under inherent limitations of multi-agent settings. Our analysis reveals agent identity confusion, where direct fusion of latent states entangles decision representations across vehicles. Motivated by this, we propose LACO, a training-free LAtent COmmunication paradigm that seamlessly adapts pretrained driving models to collaborative settings. LACO introduces Iterative Latent Deliberation (ILD) for latent reasoning, Cross-Horizon Saliency Attribution (CHSA) for communication-efficient information selection, and Structured Semantic Knowledge Distillation (SSKD) to stabilize ego-centric decision making. Closed-loop experiments in CARLA show that LACO notably reduces communication and inference latency while maintaining strong collaborative driving performance.

2605.22502 2026-05-22 cs.AI cs.LG

Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

Simon Dennis, Rivaan Patil, Kevin Shabahang, Hao Guo

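Affiliations University of Melbourne

AI summary This paper studies how to compile agentic workflows into LLM weights for efficiency. Validated on travel booking, Zoom support, and insurance claims tasks, the compilation approach is shown to cut cost while maintaining high-quality performance.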

Comments 19 pages

English abstract

Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex. All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn. Recent work has shown this architecture is dominated for procedural tasks by simply providing the procedure in a frontier model's system prompt [Dennis et al., 2026a], at the cost of consuming the context window, requiring a frontier model for every conversation, and exposing proprietary procedures to third-party providers. Compiling the procedure into the weights of a small fine-tuned model -- creating a subterranean agent -- should resolve all of these concerns, and prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique works. Yet developer adoption has overwhelmingly favored orchestration. We identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).

2605.22501 2026-05-22 cs.CL cs.AI cs.IR

BeLink: Biomedical Entity Linking Meets Generative Re-Ranking

Darya Shlyk, Stefano Montanelli, Lawrence Hunter

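Affiliations University of Milan; University of Chicago

AI summary This paper proposes a generative re-ranking method that uses instruction tuning to make biomedical entity linking more efficient and accurate, improving linking accuracy by 3%-24% on multiple benchmarks while reducing inference time.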

Comments Accepted to ACM SIGIR 2026

English abstract

Despite recent progress, Biomedical Entity Linking (BEL) with large language models (LLMs) remains computationally inefficient and challenging to deploy in practical settings. In this work, we demonstrate that instruction-tuning of open-source generative models can offer an effective solution when applied at the re-ranking stage of the BEL pipeline. We propose a set-wise instruction-tuning formulation that enables fast and accurate candidate selection. Our method demonstrates strong performance on multiple BEL benchmarks, yielding significant improvements in linking accuracy (3%-24%) while reducing inference time compared to the state-of-the-art. We integrate our generative re-ranker into BeLink, a modular, end-to-end system designed for practical real-world BEL applications.

2605.22498 2026-05-22 cs.LG cs.AI cs.SC

The Neural Compiler: Program-to-Network Translation for Hybrid Scientific Machine Learning

Lucas Sheneman

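Affiliations Institute for Interdisciplinary Data Sciences; University of Idaho

AI summary This work presents a neural compiler that translates programs into differentiable PyTorch modules for hybrid scientific machine learning. It generates correct, differentiable modules from symbolic specifications, providing systematic composability.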

Comments 21 pages, 10 figures, 10 tables. Preprint; source code available at https://github.com/sheneman/neural_compiler

英文摘要

Scientific machine learning often requires combining known physics with unknown parameters or correction terms learned from data. Existing approaches either ignore known structure, encode it as a soft penalty, or require hand-written PyTorch code for each equation. We present The Neural Compiler, a system that translates programs written in a first-order Scheme-like expression language into frozen, differentiable PyTorch modules. These modules match the source program to floating-point precision and provide gradients through autograd. In hybrid models, the compiled module encodes known physics exactly while learned components model the unknown remainder. We evaluate the compiler across six experiment domains: Feynman physics equations, Lotka-Volterra dynamics, a damped pendulum, a one-dimensional heat equation, three-dimensional vector mechanics, and compositional generalization. Compiled modules match hand-coded PyTorch implementations numerically for single equations, showing no accuracy loss from compilation. With only 1 to 4 trainable parameters, compiled models recover physical constants to less than 1 percent error in most cases, while standard PINN baselines with more than 8500 parameters show 7 to 93 percent error. Compiled modules also compose with zero error, while neural approximations can accumulate large errors in deep composition chains. The main value of the compiler is not improved accuracy over hand-coded equations, but systematic composability: it generates correct, differentiable modules from symbolic specifications without rewriting each equation by hand. The system supports 51 primitive operations, including vector and matrix algebra, enabling PDE discretizations and hybrid scientific models. This string-in, module-out interface also provides a natural target for large language models that translate scientific descriptions into executable differentiable modules.
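
The string-in, callable-out idea described above can be illustrated with a toy sketch. This is not the authors' compiler: it targets plain Python closures rather than frozen, differentiable PyTorch modules, it supports only six primitives instead of the reported 51, and the names `parse`, `compile_expr`, and `PRIMITIVES` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple, Union

Expr = Union[float, str, List["Expr"]]
Env = Dict[str, float]

# A handful of primitives; the abstract reports 51, which this sketch does not reproduce.
PRIMITIVES: Dict[str, Callable[..., float]] = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    "sin": math.sin,
    "exp": math.exp,
}


def parse(source: str) -> Expr:
    """Turn '(+ x (* 2 y))' into nested lists of floats and symbol names."""
    tokens = source.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos: int) -> Tuple[Expr, int]:
        token = tokens[pos]
        if token == "(":
            items: List[Expr] = []
            pos += 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1
        try:
            return float(token), pos + 1
        except ValueError:
            return token, pos + 1  # not a number, so treat it as a symbol

    expr, end = read(0)
    if end != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return expr


def compile_expr(expr: Expr) -> Callable[[Env], float]:
    """Compile a parsed expression once; the returned closure is evaluated per input."""
    if isinstance(expr, float):
        return lambda env: expr
    if isinstance(expr, str):
        return lambda env: env[expr]
    op = PRIMITIVES[expr[0]]
    args = [compile_expr(arg) for arg in expr[1:]]
    return lambda env: op(*(arg(env) for arg in args))


# Damped pendulum angular acceleration: -k*sin(theta) - c*omega
accel = compile_expr(parse("(- (* (- 0 k) (sin theta)) (* c omega))"))
print(accel({"k": 2.0, "theta": 0.0, "c": 0.5, "omega": 1.0}))  # -0.5
```

In a hybrid model of the kind the abstract describes, a compiled function like `accel` would encode the known physics exactly, while symbols such as `k` and `c` would be the few trainable parameters.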

2605.22496 2026-05-22 cs.LG

The Signal in the Noise: OOD Detection Through Goodness-of-Fit Testing in Factorised Latent Spaces

噪声中的信号:通过因子化潜在空间中的拟合性检验进行分布外检测

Philipp Bomatter, Jack Geary, Henry Gouk

发表机构 * School of Informatics, University of Edinburgh(爱丁堡大学信息学院)

AI总结 本文提出了一种基于因子化潜在空间中拟合性检验的分布外检测方法SITN,该方法无需访问分布外数据,计算开销小,并能严格控制误报率。

详情
AI中文摘要

深度生成模型为分布外(OOD)检测提供了自然的基础,但先前的工作表明,它们给出的似然在区分分布内与分布外数据时出了名地不可靠。在本文中,我们利用连续归一化流的微分同胚和质量保持性质来解决这个问题。我们的分析表明,分布外样本会被映射到在噪声先验下高度非典型的噪声样本,而这种非典型性无法由似然捕捉。基于这一观察,我们提出了一种新方法Signal in the Noise(SITN),用于单样本级别的分布外检测。SITN不需要访问分布外数据,计算开销极小,并能严格控制误报率。基于标准基准和合成扰动的全面评估表明了该方法的有效性,并表明它不存在基于似然的方法所固有的复杂性偏差。

英文摘要

Deep generative models offer a natural foundation for out-of-distribution (OOD) detection, yet prior work has shown that their assigned likelihoods are notoriously unreliable indicators for in- vs out-of-distribution data. In this paper, we address this problem by leveraging the diffeomorphic and mass-preserving properties of continuous normalising flows. Our analysis shows that OOD samples are mapped to noise samples that are highly atypical under the noise prior in ways not captured by the likelihood. Based on this observation, we propose a new method -- Signal in the Noise (SITN) -- for OOD detection on the single-sample level. SITN requires no access to OOD data, incurs minimal computational overhead, and provides strict control of false positive rates. Comprehensive evaluations through standard benchmarks and synthetic perturbations highlight the method's effectiveness and the absence of the complexity bias inherent to likelihood-based methods.
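
To see why a goodness-of-fit test can flag a noise sample that the likelihood treats as ordinary, consider the hedged sketch below. It assumes a factorised standard Gaussian noise prior and uses a one-sample Kolmogorov-Smirnov test over the coordinates of a single latent vector. The abstract does not specify which test SITN uses, so this choice and the helper names are illustrative assumptions.

```python
import math
import random


def normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(z):
    """Kolmogorov-Smirnov distance between the empirical CDF of z and N(0, 1)."""
    xs = sorted(z)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = normal_cdf(x)
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


def ks_critical_value(n: int, alpha: float) -> float:
    """Asymptotic critical value; reject the Gaussian hypothesis above it."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)


n = 1000
rng = random.Random(0)
typical = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Every coordinate is +1 or -1, so the squared norm is exactly n, the expected
# value under N(0, I). The Gaussian log-density depends only on that norm, so
# the likelihood looks unremarkable even though no coordinate looks Gaussian.
atypical = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]

threshold = ks_critical_value(n, alpha=0.05)
print(ks_statistic(typical), ks_statistic(atypical), threshold)
```

The rejection threshold depends only on the null distribution and the chosen `alpha`, which is how a test of this kind can bound the false positive rate without ever seeing OOD data.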

2605.22493 2026-05-22 cs.LG cs.AI cs.RO

Understanding Multimodal Failure in Action-Chunking Behavioral Cloning

理解动作分块行为克隆中的多模态失败

Lorenzo Mazza, Massimiliano Datres, Ariel Rodriguez, Sebastian Bodenstedt, Gitta Kutyniok, Stefanie Speidel

发表机构 * NCT-Dresden(NCT-德累斯顿)

AI总结 研究行为克隆在多模态情况下失败的机制,分析不同多模态参数化在动作分块策略中的不同失效方式,并提出通过调整正则化程度和改进生成策略来提升鲁棒性的方法。

详情
AI中文摘要

当同一观测对应多个有效动作时,行为克隆会变得困难。我们针对动作分块策略研究了这一问题,并表明不同的多模态参数化会以不同的方式失败。对于隐变量策略,后验-先验正则化使部署时的采样更可靠,但过度正则化会移除区分演示模式所需的、以动作为条件的信息。减弱这种正则化可以保留模式信息,但此时成功与否取决于先验是否覆盖相关的隐变量区域。对于动作空间生成策略,多模态性受到从基础空间到动作空间的传输映射平滑性的限制:Lipschitz常数较小的映射无法为许多相互远离的模式分配可观的概率。因此,覆盖许多模式要么需要基础空间中的急剧过渡,要么需要动作空间中位于支撑集之外的桥接区域。在合成多模态任务和机器人仿真基准上的实验支持了这些机制。

英文摘要

Behavioral cloning becomes difficult when the same observation admits several valid actions. We study this problem for action-chunking policies and show that different multimodal parameterizations fail in different ways. For latent-variable policies, posterior-prior regularization makes deployment-time sampling more reliable, but excessive regularization removes the action-conditioned information needed to distinguish demonstrated modes. Reducing this regularization can preserve mode information, but then success depends on whether the prior covers the relevant latent regions. For action-space generative policies, multimodality is constrained by the smoothness of the base-to-action transport: a map with small Lipschitz constant cannot assign substantial probability to many well-separated modes. Covering many modes therefore requires either sharp transitions in base space or off-support bridge regions in action space. Experiments on synthetic multimodal tasks and robotic simulation benchmarks support these mechanisms.

2605.22492 2026-05-22 cs.CV

Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline

低数据条件下无需训练的细粒度语义分割:FungiTastic基线

Sebastian Cavada, Francesco Pelosin, Lapo Faggi

发表机构 * Covision Lab(Covision实验室)

AI总结 本文提出了一种无需训练的两阶段框架,用于在低数据环境下实现细粒度语义分割:先通过宏观分类学提示生成蘑菇掩码,再利用嵌入空间中的原型匹配分配细粒度标签,具有更好的可扩展性并保持较低的分割成本。

Comments Accepted at the 13th Workshop on Fine-Grained Visual Categorization, CVPR 2026

详情
AI中文摘要

细粒度语义分割既需要精确的定位,也需要区分视觉上相似的类别。在FungiTastic中,长尾分布和图像采集条件的强烈变化使这一问题更加复杂。我们提出了一种无需训练的两阶段框架,将分割与分类解耦。SAM3首先使用宏观分类学提示生成类别无关的蘑菇掩码,DINOv3随后通过嵌入空间中的原型匹配分配细粒度标签。为了改进这一阶段,我们对DINOv3特征空间进行了简单的变换,以提升基于原型的分类效果。与类别特定的提示相比,我们的方法更具可扩展性,并保持较低的分割成本。我们报告了从单样本到数百样本设置下的结果,据我们所知,这是低数据设置下细粒度语义分割的首个基线。

英文摘要

Fine-grained semantic segmentation requires both precise localization and discrimination between visually similar classes. In FungiTastic, this problem is further complicated by a long-tailed distribution and strong variation in image acquisition conditions. We propose a training-free two-stage framework that decouples segmentation from classification. SAM3 first produces class-agnostic mushroom masks using macro-taxonomic prompts, and DINOv3 then assigns fine-grained labels through prototype matching in the embedding space. To improve this stage, we apply a simple transformation of the DINOv3 feature space that improves prototype-based classification. Compared with class-specific prompting, our approach is more scalable and keeps the segmentation cost low. We report results from one-shot to few-hundred-shot regimes, providing, to the best of our knowledge, the first baseline for fine-grained semantic segmentation in low-data settings.
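
The second stage, prototype matching in the embedding space, can be sketched as a nearest-prototype classifier. The sketch assumes each class prototype is the normalised mean of its few-shot embeddings and that similarity is cosine. The abstract confirms neither detail, the feature-space transformation it mentions is not modelled, and the function names and toy 2-D vectors are hypothetical stand-ins for DINOv3 features.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def build_prototypes(support):
    """support: {label: [embedding, ...]} -> {label: unit-norm mean embedding}."""
    prototypes = {}
    for label, embeddings in support.items():
        dim = len(embeddings[0])
        mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
        prototypes[label] = normalize(mean)
    return prototypes


def classify(embedding, prototypes):
    """Return the label whose prototype has the highest cosine similarity."""
    query = normalize(embedding)
    return max(
        prototypes,
        key=lambda label: sum(q * p for q, p in zip(query, prototypes[label])),
    )


# Toy support set: two shots for one class, one shot for the other.
support = {"class_a": [[1.0, 0.1], [0.9, 0.0]], "class_b": [[0.0, 1.0]]}
prototypes = build_prototypes(support)
print(classify([0.8, 0.2], prototypes))  # class_a
```

Because new classes only add prototypes, the segmentation stage does not need to be rerun per class, which matches the scalability argument in the abstract.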

2605.22488 2026-05-22 cs.LG

Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer

被表示不等于被计算:对Transformer中候选算法中间量的因果检验

Ishita Darade, Sushrut Thorat

发表机构 * MKSSS's Cummins College of Engineering for Women(MKSSS女子工程学院) Institute of Cognitive Science(认知科学研究所) Osnabrück University(奥斯纳布吕克大学)

AI总结 本文研究了变换器在执行算术任务时如何整合组件,发现模型虽然能准确回答问题,但其内部表示与计算路径之间存在因果分离,表明探针结果可能与实际因果观察有显著差异。

Comments 16 pages, 4 figures

详情
AI中文摘要

结构化提示要求根据与任务相关的关系整合各个组成部分。在语言或视觉任务中,网络如何实现这种整合往往难以判断,因为这些关系很少被精确到足以定义候选内部算法的程度。算术提供了一个更清晰的环境。我们研究了一个在进制数位提取任务上训练的Transformer:给定N、B和D,它必须给出N的B进制展开式中B^D的系数。闭式解floor(N/B^D) mod B提供了显式的候选算法中间量。在三个随机种子下,模型在留出的数字-进制交集上达到了99.83%的精确答案准确率,确立了可靠的任务能力。线性探针能够解码这些中间量,使分阶段的算术计算看起来合理。因果测试则将表示与使用区分开来:在从以D为输入的流到输出位置的局部路径中,行为取决于早期的D选择性通信,而与N和B无关。与此相关,稀疏电路搜索发现N、B和D的路径大多相互分离,并在后期才结合,而不是探针所暗示的分阶段路径。因此,模型表示了使闭式解显得合理的中间量,但已识别的局部因果路径并未将它们传递到输出流。这一案例表明,即使存在显式的算法假设,基于探针的结论也可能与因果观察结果大相径庭。

英文摘要

Structured prompts require integrating components according to task-relevant relations. How a network implements this integration is often hard to judge in language or vision, where those relations are rarely specified precisely enough to define a candidate internal algorithm. Arithmetic offers a cleaner setting. We study a Transformer trained on base-digit extraction: given $N$, $B$, and $D$, it must report the coefficient of $B^D$ in the base-$B$ expansion of $N$. The closed-form solution, $\lfloor N/B^D \rfloor \bmod B$, provides explicit candidate algorithmic intermediates. Across three seeds, the model reaches 99.83% exact-answer accuracy on held-out number-base intersections, establishing reliable task competence. Linear probes decode the intermediates, making staged arithmetic computation plausible. Causal tests then separate representation from use: within the localized route from the stream with $D$ as input to the output positions, behavior depends on early $D$-selective communication, independent of $N$ and $B$. Relatedly, a sparse circuit search finds mostly separate $N$, $B$, and $D$ routes that combine late rather than the staged route suggested by the probes. Thus, the model represents the intermediates that make the closed-form solution plausible, but the identified localized causal route does not transmit them to the output stream. This case shows that probe-based conclusions can diverge sharply from causal observations, even when explicit algorithmic hypotheses are available.
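
The closed-form target $\lfloor N/B^D \rfloor \bmod B$ can be written directly as a short reference function. The two intermediate values below correspond to the "staged" computation that the probes suggest. The function name and input checks are illustrative, not taken from the paper.

```python
def base_digit(n: int, b: int, d: int) -> int:
    """Coefficient of b**d in the base-b expansion of n: floor(n / b**d) mod b."""
    if n < 0 or b < 2 or d < 0:
        raise ValueError("expects n >= 0, b >= 2, d >= 0")
    quotient = n // (b ** d)  # candidate intermediate 1: floor(N / B^D)
    return quotient % b       # candidate intermediate 2: reduce mod B


print(base_digit(1234, 10, 2))  # 1234 = 1*10^3 + 2*10^2 + 3*10 + 4, so 2
print(base_digit(13, 2, 2))     # 13 = 0b1101, bit 2 is 1
```

The paper's point is that a model can linearly encode `quotient` without routing it to the output, so a function like this describes the task and not necessarily the network's mechanism.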

2605.22487 2026-05-22 cs.CL

Polite on the Surface, Wrong in Practice: A Curated Dataset for Fixing Honorific Failures in Multilingual Bangla Generation

表面礼貌,实践错误:一个用于修复多语言孟加拉语生成中敬语失误的定制数据集

Md. Asaduzzaman Shuvo, Mahedi Hasan, Md. Tashin Parvez, Azizul Haque Noman, Md. Shafayet Hossain Ovi

发表机构 * United International University(联合国际大学)

AI总结 本文提出了一个定制数据集BLADE,用于改进多语言孟加拉语生成中敬语处理的准确性,通过系统微调和评估领先的开源架构,如DeepSeek-8B和LLaMA-3.2-3B,以提高结构忠实度和敬语对齐度。

详情
AI中文摘要

近年来,多语言大语言模型(MLLMs)的进展显著增强了跨语言对话能力,但建模文化细腻和上下文依赖的交流仍是一个关键瓶颈。具体而言,现有最先进的模型在处理低资源环境如孟加拉语中的结构变化、地区习语和敬语一致性时存在严重的语用差距。为了解决这一限制,我们引入了一个新的、文化对齐的指令微调数据集和基准框架,即BangLa Application and DialoguE生成-BLADE,包含4,196对精心编纂的交互对。我们利用这一资源系统地微调和评估领先的开源架构,包括DeepSeek-8B和LLaMA-3.2-3B,通过LoRA适配器在4位NormalFloat(NF4)量化框架下进行参数高效微调。我们的实证评估表明,使用我们数据集微调的模型在结构忠实度和敬语对齐方面有显著改进,为弥合低资源多语言文本生成中的语用差距提供了严格基准。代码和数据集:https://github.com/ashuvo25/Bangla_Application_LLM/tree/main

英文摘要

Recent advances in Multilingual Large Language Models (MLLMs) have significantly enhanced cross-lingual conversational capabilities, yet modeling culturally nuanced and context-dependent communication remains a critical bottleneck. Specifically, existing state-of-the-art models exhibit a severe pragmatic gap when handling structural variations, regional idioms, and honorific consistencies in low-resource contexts like Bangla. To address this limitation, we introduce a novel, culturally aligned instruction-tuning dataset for BangLa Application and DialoguE generation (BLADE) and benchmarking framework comprising 4,196 meticulously curated interaction pairs. We leverage this resource to systematically fine-tune and evaluate leading open-weight architectures, including DeepSeek-8B and LLaMA-3.2-3B, utilizing parameter-efficient fine-tuning via LoRA adapters in a 4-bit NormalFloat (NF4) quantization framework. Our empirical evaluations demonstrate that models fine-tuned on our dataset yield substantial improvements in structural fidelity and honorific alignment, providing a rigorous benchmark for bridging pragmatic disparities in low-resource multilingual text generation. Code and dataset: https://github.com/ashuvo25/Bangla_Application_LLM/tree/main

2605.22484 2026-05-22 cs.CV

Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling

监督分类头作为语义原型:通过权重重用解锁视觉-语言对齐

David Méndez, Roberto Confalonieri, Natalia Díaz Rodríguez

发表机构 * Department of Computer Science and Artificial Intelligence, DaSCI Institute, University of Granada, Granada, Spain(计算机科学与人工智能系,DaSCI研究所,格拉纳达大学,格拉纳达,西班牙) Department of Mathematics "Tullio Levi-Civita", University of Padova, Padova, Italy("Tullio Levi-Civita"数学系,帕多瓦大学,帕多瓦,意大利)

AI总结 本文提出利用预训练视觉模型的分类头作为语义原型,通过权重重用实现视觉-语言对齐,提升跨模态检索、零样本和少样本分类任务的性能。

详情
AI中文摘要

视觉-语言模型(VLMs)通过将图像和文本映射到共享空间,在零样本分类和跨模态检索等任务中表现出色,但需要昂贵的端到端训练和大量配对数据。当前的后处理对齐方法通过轻量级映射连接预训练编码器来降低计算成本,但仍需大量配对数据。在本文中,我们研究了重新利用预训练视觉模型的分类头作为语义原型的潜力。这些权重的重用,通常在预训练后被丢弃,解锁了两种不同的能力:它使零样本对齐成为可能,通过将权重用作语义锚点,并通过将这些原型与真实图像-文本对混合,成为一种稳健的数据增强策略。我们证明,将我们的方法与几种最先进的后处理对齐技术结合,能够一致地提高跨模态检索、零样本和少样本分类任务的准确性。

英文摘要

Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational costs by connecting pretrained encoders through lightweight mappings, yet still demand substantial paired data. In this work, we investigate the potential of repurposing the classification heads of pretrained vision models as semantic prototypes. The recycling of these weights, typically discarded after pretraining, unlocks two distinct capabilities: it enables zero-shot alignment by using weights as semantic anchors, and serves as a robust data augmentation strategy by mixing these prototypes with real image-text pairs. We demonstrate that integrating our approach with several state-of-the-art post-hoc alignment techniques consistently boosts accuracy in cross-modal retrieval, zero- and few-shot classification tasks.

2605.22481 2026-05-22 cs.LG math.ST stat.TH

When Stronger Triggers Backfire: A High-Dimensional Theory of Backdoor Attacks

当更强的触发器反噬:高维背景下后门攻击的理论

Donald Flynn, Hadas Yaron Goldhirsh, Jonathan P. Keating, Inbar Seroussi

发表机构 * Mathematical Institute, University of Oxford(牛津大学数学研究所) School of Mathematical Science, Tel Aviv University(特拉维夫大学数学科学学院) School of Mathematical Science and Computer Science, Tel Aviv University(特拉维夫大学数学科学与计算机科学学院)

AI总结 本文研究了在高维情况下后门毒化攻击的行为,发现更强的训练触发器有助于防御者,并通过高维理论分析了后门攻击的核心机制和影响因素。

详情
AI中文摘要

后门投毒攻击在高维情况下表现出反直觉的行为:更强的训练触发器反而可能有助于防御者。我们在比例极限(p/n→κ)下研究高斯混合数据上的正则化广义线性模型,在测试触发器固定的情况下改变训练触发强度α。由此出现三种现象:(i)干净测试准确率随α增大而提高;(ii)攻击成功率在某个有限的α处达到峰值,随后下降;(iii)最具破坏性的触发方向是数据协方差矩阵最小特征值对应的特征向量。对于平方损失,我们以闭式形式证明了全部三个结果,并通过高斯代理不动点系统将(i)和(ii)推广到一般的凸GLM损失。我们发现,与κ成正比的有限样本噪声底是(i)背后的机制,而这在经典的n>>p分析中是不可见的。在CIFAR-10和高斯替代数据上的实验与理论高度吻合;ResNet-18实验表明,同样的现象在凸设置之外也会出现。

英文摘要

Backdoor poisoning attacks behave counter-intuitively in high dimensions: stronger training triggers can help the defender. We study regularised generalised linear models on Gaussian-mixture data in the proportional regime ($p/n \to κ$), varying the training trigger strength $α$ against a fixed test trigger. Three phenomena emerge: (i) clean test accuracy increases with $α$; (ii) attack success peaks at a finite $α$ and then declines; and (iii) the most damaging trigger direction is the minimum eigenvector of the data covariance. We prove all three results in closed form for the squared loss, and extend (i) and (ii) to general convex GLM losses via a Gaussian-proxy fixed-point system. We identify a finite-sample noise floor proportional to $κ$ as the mechanism behind (i), invisible to classical $n \gg p$ analysis. Experiments on CIFAR-10 and Gaussian surrogates match the theory closely; ResNet-18 experiments show the same phenomena beyond the convex setting.

2605.22480 2026-05-22 cs.LG cs.AI

Implicit Regularization of Mini-Batch Training in Graph Neural Networks

图神经网络中mini-batch训练的隐式正则化

Clement Wang, Antoine Vialle, Robin Vaysse, Thomas Bonald

发表机构 * Institut Polytechnique de Paris(巴黎理工学院) Mirakl

AI总结 本文研究了图神经网络中mini-batch训练的隐式正则化现象,发现简单的随机节点采样方法在多个数据集上表现优异,且效率更高。

详情
AI中文摘要

图神经网络(GNN)的mini-batch训练与在i.i.d.数据上的训练有本质区别:采样子图会改变拓扑结构并引入边界效应,这促使先前的工作发展出结构感知采样器,以保持局部连通性并降低嵌入方差。令人惊讶的是,我们发现最简单的方案,即随机节点采样(RNS),也就是在均匀采样节点的诱导子图上训练,在10个数据集中的8个上达到或超过了全图训练的效果,而墙钟时间和内存消耗仅为后者的一小部分。为了解释这一点,我们对图mini-batch随机梯度下降(SGD)进行后向误差分析,并表明它隐式地最小化采样损失加上一个与mini-batch梯度方差成正比的正则项,而该方差直接由采样器决定。尽管RNS丢弃了局部结构,但它产生的mini-batch的期望损失更接近全图损失,且每批梯度的方差更低,从而得到更好的隐式目标。我们的分析将图采样器的选择重新表述为一种隐式正则化,并表明RNS是一种强有力且有理论依据的可扩展GNN训练方法。

英文摘要

Mini-batch training of Graph Neural Networks (GNNs) is fundamentally different from training on i.i.d. data: sampling a subgraph alters the topology and introduces boundary effects, leading prior work to develop structure-aware samplers that preserve local connectivity and reduce embedding variance. Surprisingly, we demonstrate that the simplest possible scheme, Random Node Sampling (RNS), training on the induced subgraph of uniformly sampled nodes, matches or outperforms full-graph training on 8 of 10 datasets at a fraction of the wall-clock time and memory. To explain this, we apply backward error analysis to graph mini-batch Stochastic Gradient Descent (SGD) and show that it implicitly minimizes the sampled loss plus a regularizer proportional to the mini-batch gradient variance, a quantity directly shaped by the sampler. Although RNS discards local structure, it produces mini-batches whose expected loss is closer to the full-graph loss, and whose per-batch gradients have lower variance, yielding a better implicit objective. Our analysis reframes the choice of graph sampler as a form of implicit regularization, and identifies RNS as a strong, theoretically grounded method for scalable GNN training.
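
Random Node Sampling as described above, uniform node sampling followed by taking the induced subgraph, takes only a few lines. The sketch assumes an edge-list graph representation and relabels sampled nodes to local indices. The function name and the toy 5-cycle are illustrative, not the authors' code.

```python
import random


def random_node_sample(num_nodes, edges, batch_size, rng):
    """Sample nodes uniformly without replacement and return the induced subgraph.

    Returns the sampled original node ids and the kept edges, relabelled to
    local indices 0..batch_size-1. Edges with an endpoint outside the sample
    are dropped, which is the loss of local structure the abstract mentions.
    """
    sampled = sorted(rng.sample(range(num_nodes), batch_size))
    local_id = {node: i for i, node in enumerate(sampled)}
    induced = [
        (local_id[u], local_id[v])
        for u, v in edges
        if u in local_id and v in local_id
    ]
    return sampled, induced


edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]  # a 5-cycle
nodes, sub_edges = random_node_sample(5, edges, 3, random.Random(0))
print(nodes, sub_edges)
```

Each training step would draw a fresh sample. According to the abstract's analysis, the resulting per-batch gradient variance acts as the implicit regulariser.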

2605.22476 2026-05-22 cs.LG cs.CL

Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity

结构稀疏注意力用于具有次二次序列复杂度的实体跟踪

Hangyue Zhao, Paul Caillon, Erwan Fagnou, Alexandre Allauzen

发表机构 * ESPCI PSL(巴黎高等物理化学工业学院,巴黎文理研究大学) LAMSADE, Université Paris Dauphine - PSL(LAMSADE,巴黎多菲纳大学,巴黎文理研究大学)

AI总结 本文提出了一种结构稀疏注意力机制,用于在长序列中高效维护和更新实体和属性的潜在状态,通过减少计算复杂度提升实体跟踪的效率和准确性。

Comments 12 pages, 1 figure, 9 tables

详情
AI中文摘要

实体跟踪需要在长序列中维护和更新实体及其属性的潜在状态。最近的任务特定注意力算子可以通过在单层内进行多跳状态传播,将深层Transformer堆栈压缩为少数几层,但其稠密计算仍然代价高昂。我们表明,在这种设置下,学习到的注意力具有很强的结构性:大部分注意力质量集中在局部的块对角邻域内,只有少量跨块残余。利用这一点,我们推导出一种预解式(resolvent-style)算子的分块计算方法,它精确保留块内交互,并通过一个缩减系统传递跨块交互。所得计算在序列长度上是次二次的,复杂度为$O(n^{4/3}d)$(当$d\approx n$时为$O(n^{7/3})$)。在受控的跟踪基准上,我们的方法达到了与稠密算子相当的准确率,同时在标准化测量协议下将墙钟时间减少了12-29%,并且在精确匹配准确率相当的情况下,比紧凑的稠密Transformer快至多2.4倍。我们还提供了关于块大小和模型容量的消融实验,并指出了一个局限:当同时演化的属性数量超过注意力头的数量时,性能会崩溃。

英文摘要

Entity tracking requires maintaining and updating latent states for entities and attributes over long sequences. Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation remains expensive. We show that in this setting, learned attention is strongly structured: most mass concentrates in local block-diagonal neighborhoods with a light cross-block residue. Exploiting this, we derive a blockwise evaluation of a resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system. The resulting evaluation is subquadratic in sequence length $O(n^{4/3}d)$ (and $O(n^{7/3})$ when $d\approx n$). On controlled tracking benchmarks, our method matches the dense operator's accuracy while reducing wall-clock time by $12-29\%$ under a standardized measurement protocol, and is up to $2.4 \times$ faster than a compact dense Transformer at comparable exact-match accuracy. We further provide ablations over block size and model capacity, and identify a limitation: performance collapses when the number of simultaneously evolving properties exceeds the number of attention heads.

2605.22472 2026-05-22 cs.LG

Winner-Take-All bottlenecks enforce disentangled symbolic representations in multi-task learning

赢家通吃瓶颈强制多任务学习中的解耦符号表示

Julian Gutheil, Simon Hitzginger, Robert Legenstein

发表机构 * Institute of Machine Learning and Neural Computation(机器学习与神经计算研究所) Graz University of Technology(格拉茨技术大学) Graz, Austria(奥地利格拉茨)

AI总结 本文研究了赢家通吃瓶颈在多任务学习中强制提取数据类别潜在因素的作用,证明了其产生的表示具有高度符号性,并通过实验验证了其在泛化方面的优势。

详情
AI中文摘要

赢家通吃(WTA)网络是大脑皮层网络中的核心电路模体。此外,类WTA激活在现代深度学习模型中也十分常见,例如Transformer注意力层中的softmax激活。尽管WTA在相对简单的生成模型中提取潜在因素的作用已有研究,但在潜在因素高度非线性纠缠的情形下,其作用仍不清楚。本文表明,在某些明确定义的条件下,深度神经网络中的WTA瓶颈可以在多任务学习设置中强制提取数据的类别型潜在因素。特别地,我们证明了WTA瓶颈中出现的表示具有高度符号性,其中单个神经元或一组神经元编码单个抽象特征(如特定的物体、颜色或位置)是否存在。我们进一步在两个数据集上通过实验表明,即使架构和设置不完全满足定理的假设,这一结论依然成立,并展示了所获得的符号表示在泛化方面的优势。我们提出的模型为理解带有类WTA组件的深度神经网络的泛化能力提供了见解,并可能成为符号AI与亚符号AI系统之间的接口。

英文摘要

Winner-take-all (WTA) networks constitute a central circuit motif in cortical networks of the brain. In addition, WTA-like activations are abundant in modern deep learning models in the form of the softmax activation for example in attention layers of transformers. While their role in the extraction of latent factors has been studied for relatively simple generative models, their role in the context of highly non-linearly entangled latent factors has remained elusive. In this article, we show that a WTA bottleneck within a deep neural network can enforce under certain well-defined conditions the extraction of categorical latent factors of the data in a multi-task learning setup. In particular, we prove that the representation that emerges in the WTA bottleneck is highly symbolic, where a single neuron or a population of neurons encodes the presence of a single abstract feature such as a specific object, color, or position. We furthermore show empirically on two datasets, that this also holds for architectures and setups that do not fully comply with the assumptions of our theorem and demonstrate the advantages of the acquired symbolic representation for generalization. Our proposed model provides insights into the generalization capabilities of deep neural networks with WTA-like components and may serve as an interface between symbolic and subsymbolic AI systems.
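
The link between softmax and winner-take-all mentioned above can be made concrete: a hard WTA layer outputs a one-hot code for the most active unit, and a softmax approaches that code as its temperature shrinks. This is a generic illustration of the two activations, not the bottleneck architecture or training setup studied in the paper.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax; lower temperature sharpens the output."""
    peak = max(xs)
    exps = [math.exp((x - peak) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hard_wta(xs):
    """One-hot code for the most active unit (first index wins ties)."""
    winner = max(range(len(xs)), key=lambda i: xs[i])
    return [1.0 if i == winner else 0.0 for i in range(len(xs))]


activations = [0.1, 2.0, -1.0]
print(hard_wta(activations))  # [0.0, 1.0, 0.0]
print(softmax(activations, temperature=1.0))
print(softmax(activations, temperature=0.01))
```

A one-hot output is the sense in which such a bottleneck is "symbolic": each unit can stand for one categorical value of a latent factor.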

2605.22471 2026-05-22 cs.LG

Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers

迷失在标记化中:图标记化在Transformer中的基本权衡

Maya Bechler-Speicher, Gilad Yehudai, Gil Harari, Clayton Sanford, Amir Globerson, Joan Bruna

发表机构 * Courant Institute of Mathematical Sciences, New York University(纽约大学柯朗数学科学研究所) John A. Paulson School of Engineering and Applied Sciences, Harvard University(哈佛大学约翰·A·保尔森工程与应用科学学院) Google Research(谷歌研究院) Tel-Aviv University(特拉维夫大学)

AI总结 本文研究了图标记化在Transformer中的基本权衡,探讨了不同标记化方法对模型表达能力的影响,并通过实验验证了不同任务对不同结构视图的偏好。

详情
AI中文摘要

Transformer已经成为图学习的核心架构,但将其应用于图时需要首先选择一种标记化方法:一种从图到标记的映射,它决定了哪些结构信息会在输入中暴露出来。在本工作中,我们表明这一选择是Transformer表达能力的基本组成部分。我们考察了作为许多现有图标记化方法基础的三种标记化:谱标记化、随机游走标记化和邻接标记化。我们证明,不同的标记化会导致不同的深度区间:同一图计算在一种标记化下可以由浅层Transformer实现,而在另一种标记化下则需要大得多的深度。例如,我们证明随机游走标记化对任意游走长度都是有损的,因此一般无法从中恢复出图;而谱标记化虽然无损,但对局部任务是病态的。我们进一步表明,尽管随机游走标记化和谱标记化都源自邻接信息,但深度有限的Transformer一般无法在不同标记化族之间进行转换。特别地,我们建立了下界和不可能性结果,表明不利的标记化可能使模型无法高效恢复更合适的结构表示。最后,我们通过在合成任务和真实任务上的受控实验补充了理论,验证了所预测的分离,并表明不同任务偏好不同的结构视图,而结合互补的标记化可以让Transformer利用每种表示各自不同的信号。

英文摘要

Transformers have become a central architecture for graph learning, but their application to graphs requires first choosing a tokenization: a graph-to-token map that determines which structural information is exposed at the input. In this work, we show that this choice is a fundamental component of transformer expressivity. We examine three tokenizations that serve as building blocks for many existing graph tokenizations: spectral, random-walk, and adjacency tokenizations. We prove that different tokenizations induce distinct depth regimes: the same graph computation may be realizable by a shallow transformer under one tokenization, while requiring substantially larger depth under another. For example, we prove that random-walk tokenization is lossy for any walk length, making it impossible in general to recover the graph from it, and that while spectral tokenization is lossless, it is ill-conditioned for local tasks. We further show that although both random-walk and spectral tokenizations are derived from adjacency information, it is impossible for a limited-depth transformer to convert between tokenization families in general. In particular, we establish lower bounds and impossibility results showing that unfavorable tokenizations may preclude the efficient recovery of more suitable structural representations. Finally, we complement our theory with controlled experiments on synthetic and real-world tasks, validating the predicted separations and showing that different tasks favor different structural views, and combining complementary tokenizations allows the transformer to leverage distinct signals from each representation.

2605.22469 2026-05-22 cs.CV

MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation

MaSC:一种用于评估概念驱动生成的遮蔽相似度度量

Patryk Bartkowiak, Lennart Petersen, Bartosz Kotrys, Dominik Michels, Soren Pirk, Wojtek Palubicki

发表机构 * Adam Mickiewicz University(亚当·密茨凯维奇大学) Kiel University(基尔大学) ArtCollect(艺术收藏) KAUST(阿卜杜拉国王科技大学)

AI总结 本文提出MaSC,一种基于遮蔽的相似度度量方法,用于评估文本到图像扩散模型中单概念个性化生成的保真度和提示遵循性,通过使用外部提供的前景概念遮罩将评估分解为针对主体的概念保真度和基于背景的提示遵循性。

Comments 20 pages, 2 figures, 7 tables

详情
AI中文摘要

评估文本到图像扩散模型中单概念个性化生成需要测量概念保真度(捕捉参考的识别保真度)和提示遵循性(捕捉生成场景是否匹配提示)。现有度量通常使用全局图像或文本-图像嵌入,如CLIP-I、DINO和CLIP-T。我们证明这些度量与人类感知相关性差,因为它们将图像视为整体而非将概念主体与背景分离。我们引入MaSC,一种遮蔽相似度度量,使用外部提供的前景概念遮罩将评估分解为主体特定的概念保真度和基于背景的提示遵循性。MaSC通过冻结的SigLIP2 SO400M-NaFlex特征计算两个分数:概念保真度通过前景参考块与生成图像块之间的遮蔽最大余弦匹配测量,提示遵循性通过比较仅背景的池化图像嵌入与无主体提示嵌入进行比较。在DreamBench++人类评分中,MaSC在概念保真度上达到Krippendorff alpha = 0.471,优于所有测试的非LLM基线和GPT-4V,并接近GPT-4o。在ORIDa,一个跨物理环境的真实照片身份保真度基准中,MaSC达到AUC = 0.992,几乎完美地区分相同主体与跨主体对。其提示遵循性分数也优于DreamBench++中自带的CLIP-T基线。这些结果表明,空间分解聚合是评估概念驱动生成的强大设计原则。

英文摘要

Evaluating single-concept personalization in text-to-image diffusion requires measuring both concept preservation, which captures identity fidelity to a reference, and prompt following, which captures whether the generated scene matches the prompt. Existing metrics commonly compute these signals using global image or text-image embeddings, such as CLIP-I, DINO, and CLIP-T. We show that such metrics correlate poorly with human perception because they attend to the image as a whole instead of separating the concept subject from the background. We introduce MaSC, a masked similarity metric that uses externally provided foreground concept masks to decompose evaluation into subject-specific concept preservation and background-based prompt following. MaSC computes both scores from frozen SigLIP2 SO400M-NaFlex features: concept preservation is measured by masked max-cosine matching between foreground reference patches and generated-image patches, while prompt following is measured by comparing a background-only pooled image embedding to a subject-stripped prompt embedding. On DreamBench++ human ratings, MaSC achieves Krippendorff alpha = 0.471 for concept preservation, outperforming all tested non-LLM baselines and GPT-4V, and approaching GPT-4o. On ORIDa, a real-photo identity-preservation benchmark across physical environments, MaSC achieves AUC = 0.992, nearly perfectly distinguishing same-subject from cross-subject pairs. Its prompt-following score also outperforms the CLIP-T baseline shipped with DreamBench++. These results show that spatially decomposed aggregation is a strong design principle for evaluating concept-driven generation.
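
The concept-preservation score, masked max-cosine matching between foreground reference patches and generated-image patches, might look roughly like the sketch below. It assumes the per-patch maxima are averaged, an aggregation the abstract does not specify, and it uses toy 2-D vectors in place of SigLIP2 patch features. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def masked_max_cosine(ref_patches, ref_mask, gen_patches):
    """Mean over foreground reference patches of the best match in the generated image."""
    foreground = [p for p, keep in zip(ref_patches, ref_mask) if keep]
    if not foreground or not gen_patches:
        raise ValueError("need at least one foreground patch and one generated patch")
    best = [max(cosine(p, g) for g in gen_patches) for p in foreground]
    return sum(best) / len(best)


ref_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref_mask = [True, False, True]  # the middle patch is background and is ignored
gen_patches = [[2.0, 0.0], [0.0, 3.0]]
print(round(masked_max_cosine(ref_patches, ref_mask, gen_patches), 4))  # 0.8536
```

Only the masked-in reference patches contribute, so a background change in the generated image cannot lower the score. This is the separation that global embeddings such as CLIP-I lack.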

2605.22467 2026-05-22 cs.CV

SADGE: Structure and Appearance Domain Gap Estimation of Synthetic and Real Data

SADGE:合成与真实数据的结构和外观领域差距估计

Patryk Bartkowiak, Bartosz Kotrys, Dominik Michels, Soren Pirk, Wojtek Palubicki

发表机构 * Adam Mickiewicz University(亚当·密茨凯维奇大学) ArtCollect(艺术收藏) KAUST(阿卜杜拉国王科技大学) Kiel University(基尔大学)

AI总结 本文提出SADGE,一种定量相似性度量指标,用于预测合成图像数据集在常见计算机视觉任务上的性能,而无需下游模型训练。研究发现,现有评估指标(如PSNR、FID、CLIP)主要衡量真实与合成图像之间的语义对齐(外观相似性分数),而结构相似性则用于评估领域差距(几何相似性分数)。本文通过多种合成数据集和下游任务证明,单一的外观或几何相似性无法可靠预测下游性能,而是它们的非线性交互决定了合成数据的效用。SADGE在五个公开的合成到真实基准家族和15个数据集变体(79k图像对)中,达到了线性和排名标准下最强的下游转移性能关联性。

详情

英文摘要

We propose SADGE, a quantitative similarity metric that predicts the performance of synthetic image datasets for common computer vision tasks without downstream model training. Estimating whether a synthetic dataset will lead to a model that performs well on real-world data remains a bottleneck in model development. Existing evaluation metrics (e.g., PSNR, FID, CLIP) primarily measure semantic alignment between real and synthetic images (Appearance Similarity Score). Less commonly, structural similarity between images is considered to assess the domain gap (Geometric Similarity Score). However, to the best of our knowledge, there exist no studies that evaluate which similarity metric is the best downstream predictor for a given synthetic dataset. In this paper, we show over a wide variety of different synthetic datasets and downstream tasks that neither appearance nor geometry alone can reliably predict downstream performance; rather, it is their non-linear interplay that dictates synthetic data utility. Specifically, we measure how commonly used Appearance and Geometric Similarity metrics computed between synthetic and real images correlate with downstream performance in object detection, semantic segmentation, and pose estimation. Across five public synthetic-to-real benchmark families and 15 dataset-level variants (79k image pairs), SADGE achieves the strongest association with downstream transfer performance under both linear and rank-based criteria, reaching Pearson r=0.88 and Spearman rho=0.77. We compute for each combination of geometry-based methods and appearance-based approaches SADGE scores across all benchmark families. The best configuration is obtained by fusing DINOv3 appearance similarity with MASt3R geometric consistency through a constrained bilinear interaction, outperforming both the strongest geometry-only baseline and the strongest appearance-only baseline.

2605.22465 2026-05-22 cs.CL

In Silico Modeling of the RAMPHO Buffer: Dissociating Informational and Energetic Masking via Phonetic Entropy in Deep Neural Networks

RAMPHO缓冲区的计算机模拟:通过语音熵在深度神经网络中分离信息性和能量性遮蔽

Stefan Bleeck

发表机构 * Institute of Sound and Vibration Research (ISVR), University of Southampton(声学与振动研究所(ISVR),南安普顿大学)

AI总结 本文利用自监督声学模型wav2vec 2.0的语音熵对RAMPHO缓冲区进行计算机模拟,分离了信息性遮蔽和能量性遮蔽,揭示了一个认知-声学帕累托优化问题。

详情
AI中文摘要

在多说话者环境中聆听的根本挑战是一个认知瓶颈,语言理解容易度(ELU)模型将其定义为RAMPHO情景缓冲区内的失败。当前用于语音增强的深度神经网络只针对物理声学进行优化,未能考虑信息性遮蔽带来的认知代价。本文利用自监督声学模型(wav2vec 2.0)的逐帧语音熵,对RAMPHO缓冲区进行了计算机模拟。通过在信噪比(SNR)扫描范围内对比语义完整的干扰源与相位去相关的干扰源(Concentration Shield),我们成功地将信息性干扰的认知代价与能量衰减的物理代价分离开来。该模拟揭示了一个认知-声学帕累托优化问题:破坏干扰源的语义内容可以在高SNR下解除信息性遮蔽,但在低SNR下会从根本上削弱时间瞥见线索。

英文摘要

The fundamental challenge of listening in multi-talker environments is a cognitive bottleneck, defined by the Ease of Language Understanding (ELU) model as a failure within the RAMPHO episodic buffer. Current deep neural networks for speech enhancement optimize purely for physical acoustics, failing to account for the cognitive penalty of informational masking. Here, we present an in silico simulation of the RAMPHO buffer using the frame-by-frame phonetic entropy of a self-supervised acoustic model (wav2vec 2.0). By contrasting a semantically intact distractor with a phase-decorrelated distractor (the Concentration Shield) across a signal-to-noise ratio (SNR) sweep, we successfully dissociate the cognitive penalty of informational distraction from the physical penalty of energetic decay. The simulation reveals a cognitive-acoustic Pareto optimization problem: destroying a distractor's semantic payload provides a release from informational masking at high SNRs, but fundamentally degrades temporal glimpsing cues at low SNRs.
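Frame-by-frame entropy of this kind can be read as the Shannon entropy of the model's per-frame posterior over phonetic classes. The sketch below assumes each frame is a plain list of logits standing in for the output of an acoustic model such as wav2vec 2.0; the exact definition used in the paper is not given in the abstract, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Turn one frame's logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_entropy(logits):
    """Shannon entropy (in bits) of one frame's posterior."""
    return -sum(p * math.log2(p) for p in softmax(logits) if p > 0)


def mean_phonetic_entropy(frames):
    """Average entropy over all frames of an utterance."""
    return sum(frame_entropy(f) for f in frames) / len(frames)
```

A frame with a uniform posterior over four classes has an entropy of 2 bits (maximal uncertainty), while a frame dominated by a single class has an entropy close to 0.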

2605.22462 2026-05-22 cs.CL cs.AI

From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models

从相关性到因果:一种五阶段方法用于Transformer语言模型中的特征分析

Caleb Munigety

发表机构 * Independent Researcher(独立研究者)

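AI summary: This paper proposes a five-stage methodology for causal feature analysis in transformer language models and demonstrates it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task: activation patching recovers the canonical IOI circuit, a sparse autoencoder recovers per-name selective features, causal validation finds these features specifically but only partially causal, robustness testing exposes a gap between detection robustness and causal robustness, and a deployment evaluation shows the cost savings from an optimal monitor configuration.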

详情
AI中文摘要

我们提出了一种五阶段方法用于Transformer语言模型中的因果特征分析(探针设计、特征提取、因果验证、鲁棒性测试和部署集成),并在GPT-2小型模型上端到端地执行了间接宾语识别(IOI)任务。激活补丁恢复了经典的IOI电路(第9层头9单独恢复+1.02)。稀疏自编码器恢复了每名称选择性特征,其效果大小为30到50个激活单元。因果验证发现这些特征具有特定但部分因果性:删除十五个特征后,模型在98%的提示上仍保持准确。两种受NLA启发的评估强化了这一观点:十五个选择性特征仅解释了激活方差的31%,而SAE的解释为99.7%,选择性比率与因果力呈负相关(r = -0.56)。三种分布偏移下的鲁棒性测试发现,电路能够顺利转移,但特征消融效果显著下降,揭示了检测鲁棒性与因果鲁棒性之间的差距。基于成本的部署评估(假设$50/FN,$0.42/FP,2%错误率)发现最优监控配置可使每1000次查询的成本降至$8.96,相比$1000的基准,节省了99.1%。最优组合策略随成本比和基础率变化。各阶段的结合产生了单一阶段无法产生的发现。

英文摘要

We propose a five-stage methodology for causal feature analysis in transformer language models (probe design, feature extraction, causal validation, robustness testing, and deployment integration) and demonstrate it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. Activation patching recovers the canonical IOI circuit (layer-9 head 9 alone gives recovery +1.02). A sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation finds these features specifically but only partially causal: ablating fifteen of them leaves the model accurate on 98% of prompts. Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56). Robustness testing under three distribution shifts finds that the circuit transfers cleanly but feature ablation effects degrade substantially, exposing a gap between detection robustness and causal robustness. A cost-based deployment evaluation (assumed $50/FN, $0.42/FP, 2% error rate) finds an optimal monitor configuration yielding $8.96 per 1000 queries against a $1000 baseline, a 99.1% saving. Optimal composition strategy varies with cost ratio and base rate. The conjunction of stages produces findings no single stage would.
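The deployment figures can be checked with simple arithmetic. The sketch below assumes that the $1000 baseline corresponds to every error going undetected and being charged as a false negative; this reading is consistent with the stated numbers but is not spelled out in the abstract. The function names are illustrative.

```python
def baseline_cost(queries, error_rate, cost_per_fn):
    """Cost when no monitor runs and every error counts as a false negative."""
    return queries * error_rate * cost_per_fn


def relative_saving(baseline, monitored):
    """Fraction of the baseline cost that the monitor removes."""
    return (baseline - monitored) / baseline


# Values taken from the abstract: 1000 queries, 2% error rate, $50 per FN,
# and $8.96 per 1000 queries under the optimal monitor configuration.
baseline = baseline_cost(1000, 0.02, 50.0)
saving = relative_saving(baseline, 8.96)
print(f"baseline ${baseline:.2f}, saving {saving:.1%}")
```

Under this assumption, 1000 queries at a 2% error rate yield 20 errors, which at $50 each gives the $1000 baseline, and the reduction to $8.96 matches the reported 99.1% saving.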

2605.22457 2026-05-22 cs.AI cs.SY eess.SY

KAPPS: A knowledge-based CPPS Architecture for the Circular Factory

KAPPS:一种基于知识的闭环工厂CPPS架构

Etienne Hoffmann, Jan-Felix Klein, Sören Weindel, Max Goebels, Sebastian Behrendt, Daniel Hernández, Ratan Bahadur Thapa, Jürgen Fleischer, Kai Furmans, Steffen Staab

发表机构 * Institute for Material Handling and Logistics (IFL), Karlsruhe Institute of Technology(材料搬运与物流研究所(IFL),卡尔斯鲁厄技术大学) Department of Production Engineering, KTH Royal Institute of Technology(生产工程系,皇家理工技术大学) Institute of Production Science (wbk), Karlsruhe Institute of Technology(生产科学研究所(wbk),卡尔斯鲁厄技术大学) Analytic Computing, Institute for Artificial Intelligence, University of Stuttgart(分析计算,人工智能研究所,斯图加特大学) Electronics and Computer Science, University of Southampton(电子与计算机科学,南安普顿大学)

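AI summary: This paper proposes KAPPS, a knowledge-based CPPS architecture for the circular factory that addresses circular manufacturing's need to handle variable product states, dynamically reconfigurable processes, and the integration of human and machine knowledge; it uses a knowledge graph and a semantic interface layer for data integration and reasoning, improving the flexibility and adaptability of manufacturing systems.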

Comments Submitted to Journal of Manufacturing Systems (JMS)

详情
AI中文摘要

尽管线性制造依赖于同质材料和预定义的过程序列,但闭环制造重新引入了具有异质和不确定条件的使用产品。这种转变要求制造系统能够处理可变的产品状态、动态可重构的过程以及人机知识的整合。传统制造IT架构,设计用于稳定结构和确定性执行,无法满足这些需求,因为它们无法充分表示和管理运行时单个组件的唯一性。遵循设计科学方法,为闭环制造设计CPPS,我们从五个互补的视角中推导出14个需求。基于这些需求,我们设计了KAPPS,一种基于知识的架构,利用以本体为基础的知识图谱作为统一的数据骨干,结合语义接口层,实现跨异构系统和服务的一致数据和信息集成、推理和通信,使知识图谱从集成层转变为工厂的权威写时状态。KAPPS集成了约束执行和事件驱动规划模块,使在不确定性和人机知识交换下执行计划能够逐步适应。通过两个实施用例验证了KAPPS的适用性:(i) 通过知识图谱中介服务进行异常检测和学习;(ii) 在模块化输送系统中运行时约束执行。随后,该架构被评估以满足14个需求(摘要已缩短)

英文摘要

While linear manufacturing relies on homogeneous materials and predefined process sequences, circular manufacturing reintroduces used products with heterogeneous and uncertain conditions. This shift demands manufacturing systems capable of handling variable product states, dynamically reconfigurable processes, and the integration of human and machine knowledge. Conventional manufacturing IT architectures, designed for stable structures and deterministic execution, are unable to meet these requirements, as they cannot adequately represent and manage the uniqueness of individual components at runtime. Following a design science methodology for developing a Cyber-Physical Production System (CPPS) for circular manufacturing, we derive 14 requirements from five complementary perspectives. Based on these requirements, we design KAPPS, a knowledge-based architecture that uses an ontology-grounded knowledge graph as a unifying data backbone, combined with a semantic interface layer to enable consistent data and information integration, reasoning, and communication across heterogeneous systems and services, turning the knowledge graph from an integration layer into the factory's authoritative write-time state. KAPPS incorporates modules for constraint enforcement and event-driven planning, enabling incremental adaptation of execution plans under uncertainty and human-machine knowledge exchange. The applicability of KAPPS is demonstrated through two implemented use cases: (i) anomaly detection and learning through knowledge-graph-mediated services and (ii) runtime constraint enforcement in a modular conveyor system. Subsequently, the architecture is evaluated against the 14 requirements (ed. abstract shortened)