arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.09725 2026-06-09 cs.LG New submission

Disentanglement with Holographic Reduced Representations

Jhonny J. Velasquez Olivera, Christo K. Thomas, Walid Saad

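Affiliations: Virginia Tech; Worcester Polytechnic Institute

AI summary: Proposes an unsupervised disentanglement algorithm based on holographic reduced representations (HRR). The HRR unbinding operation supplies an inductive bias for separating factors of variation in data, and an information-theoretic analysis proves that unbinding induces approximately independent symbol-value pairs.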

Abstract

Disentanglement, the separation of factors of variation in data using neural networks, remains a long-standing challenge in machine learning. Prior work has addressed this problem with variational autoencoders and generative adversarial networks that incorporate ideas from variational inference and information-theoretic constraints. In contrast to methods that rely on continuous representations, we propose a design that treats disentangled representations as symbolic structures, motivated by the compositional relationships among the concepts that make up samples from a distribution. However, learning discrete symbolic structures with neural networks while maintaining differentiability is difficult and often requires complex architectures. To address this, we introduce an unsupervised learning algorithm that uses holographic reduced representations (HRR) for neural disentanglement. We show that the HRR unbinding operation provides an inductive bias for separating factors and yields competitive results against baselines, as measured by latent traversals and disentanglement metrics. We complement these empirical findings with an information-theoretic analysis of the HRR unbinding channel. We prove that unbinding induces approximately independent symbol-value pairs and derive a per-slot capacity bound that quantifies how many distinct symbolic concepts can be reliably encoded, giving a quantitative account of the inductive bias toward disentanglement. The resulting representations differ from standard autoencoder-based models, in that their latent units are vectors that are summed together, rather than scalar dimensions of a low-dimensional latent vector. We show that this HRR representation is more robust to noise than other disentangled representations and maintains reconstruction quality across a range of SNRs.
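
The binding and unbinding operations the abstract refers to can be illustrated with the textbook HRR construction: binding by circular convolution and unbinding by convolving with an approximate inverse. The sketch below is only that textbook construction, not the authors' learning algorithm; the dimension, the number of slots, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def bind(a, b):
    """Circular convolution of two vectors, computed in the Fourier domain."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def approx_inverse(a):
    """Involution a*[i] = a[-i mod d], the usual approximate inverse in HRR."""
    return np.roll(a[::-1], 1)

def unbind(trace, key):
    """Recover a noisy copy of the value that was bound to `key`."""
    return bind(trace, approx_inverse(key))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d = 1024  # assumed dimension; elements are drawn with variance 1/d
keys = rng.normal(0.0, 1.0 / np.sqrt(d), size=(3, d))
values = rng.normal(0.0, 1.0 / np.sqrt(d), size=(3, d))

# The latent is a SUM of bound key-value vectors, not a list of scalar units.
trace = sum(bind(k, v) for k, v in zip(keys, values))

recovered = unbind(trace, keys[0])
sim_match = cosine(recovered, values[0])  # clearly positive
sim_other = cosine(recovered, values[1])  # close to zero
```

Unbinding with one key returns the matching value plus crosstalk from the other slots, which is why the similarity to the correct value is well above the similarity to an unrelated value but below 1.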

2606.09724 2026-06-09 cs.AI New submission

Beyond Probabilistic Similarity: Structural, Temporal, and Causal Limitations of Retrieval-Augmented Generation in the Legal Domain

Hudson de Martim

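Affiliations: Federal Senate of Brazil

AI summary: Argues that RAG failures in legal AI stem from an architectural mismatch between probabilistic retrieval and the hierarchical, temporal, and institutional structure of legal knowledge. Identifies three retrieval pathologies (mereological blindness, diachronic blindness, and causal opacity) and derives four architectural commitments for a deterministic-by-design approach.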

Abstract

Retrieval-Augmented Generation (RAG) has become a standard architectural response to unreliability in legal AI, yet high-profile failures, including fabricated citations submitted to courts and anachronistic legal content presented as current, continue to appear across jurisdictions. We argue that these failures are not residual confabulations to be eliminated by scaling language models, but symptoms of an architectural mismatch between probabilistic retrieval and the hierarchical, temporal, and institutional structure of legal knowledge. We develop the argument in three moves. First, we articulate the ontological commitment of legal knowledge as a triad of properties derivable from classical legal theory: hierarchical and mereological structure, diachronic dynamism under operational closure, and causal traceability of institutional provenance grounded in the duty of justification. Second, we identify three corresponding pathologies of retrieval (mereological blindness, diachronic blindness, and causal opacity), each developed with an operational definition, a failure mechanism, a canonical example, and detection criteria for diagnostic use. Third, we review the state of the art through this lens, showing that existing approaches address these requirements unevenly and do not yet compose into a paradigm that treats them as co-constitutive. From this analysis we derive four architectural commitments that characterize the deterministic-by-design direction for legal retrieval: ontological primacy, event reification, bitemporal correctness, and deterministic interaction protocols. The framework concerns quaestio juris (which norms apply and in what state) rather than the downstream tasks that act on identified norms, and addresses legislative and constitutional retrieval primarily, with interpretive time as an explicit extension.

2606.09719 2026-06-09 cs.RO New submission

Safe Polytope-in-Polytope Motion Planning and Control with Control Barrier Functions

Alejandro Gonzalez-Garcia, Dries Dirckx, Jan Swevers, Wilm Decré

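Affiliations: KU Leuven

AI summary: Proposes a safe local motion planning and control method that uses discrete-time control barrier function constraints inside a model predictive controller to keep a polytopic robot footprint within a continuously updated convex free-space region. Compared with a polytope-based obstacle avoidance formulation, computation time drops by up to 91× as the number of obstacles grows.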

Comments
This work has been submitted to the IEEE for possible publication
Abstract

Autonomous mobile robots operating in tight environments require motion planning frameworks that account for the physical footprint of the robot. Simplifying the geometry to a point or a circle is conservative and discards information needed to successfully and safely traverse narrow passages. This work proposes a safe local motion planning and control method that guarantees that a polytopic robot footprint stays inside a continuously updated convex free-space region. The containment condition is formulated as a set of discrete-time control barrier function constraints within a model predictive controller. The number of safety constraints depends on the complexity of the local free-space geometry and the robot shape, instead of the number of obstacles. The proposed free-space formulation does not need any obstacle detection or segmentation. A comparative analysis against a polytope-based obstacle avoidance formulation confirms favorable scaling, with up to a 91× reduction in computation time as the number of obstacles increases. The approach is validated in simulation with an autonomous surface vehicle and on hardware with a non-holonomic mobile robot, using both occupancy grids and LiDAR sensing. The experiments demonstrate safe real-time motion planning and control at 10 Hz on an onboard embedded computer, including reactive avoidance of dynamic obstacles.
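
To make the containment condition concrete, the sketch below checks a standard discrete-time control barrier function inequality, h(x_{k+1}) >= (1 - gamma) * h(x_k), for every pair of footprint vertex and free-space face. It assumes the free space is given as a half-space set {p : A p <= b} and uses h = b_j - a_j . v_i as the barrier; the paper's exact formulation, the value of gamma, and all names here are assumptions made for illustration.

```python
import numpy as np

def footprint_world(pose, body_vertices):
    """Transform footprint vertices from the body frame to the world frame."""
    x, y, theta = pose
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s], [s, c]])
    return body_vertices @ rotation.T + np.array([x, y])

def barrier_values(pose, body_vertices, A, b):
    """h[i, j] = b_j - a_j . v_i; all entries >= 0 iff the footprint is inside."""
    vertices = footprint_world(pose, body_vertices)
    return b[None, :] - vertices @ A.T

def dcbf_satisfied(pose_k, pose_next, body_vertices, A, b, gamma=0.5):
    """Check h(x_{k+1}) >= (1 - gamma) * h(x_k) for every vertex-face pair."""
    h_k = barrier_values(pose_k, body_vertices, A, b)
    h_next = barrier_values(pose_next, body_vertices, A, b)
    return bool(np.all(h_next >= (1.0 - gamma) * h_k))

# Free space: the square [-2, 2] x [-2, 2] written as A p <= b.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([2.0, 2.0, 2.0, 2.0])
# Footprint: a 1.0 x 0.5 rectangle centred on the body frame origin.
body = np.array([[0.5, 0.25], [0.5, -0.25], [-0.5, -0.25], [-0.5, 0.25]])

start = (0.0, 0.0, 0.0)
small_step_ok = dcbf_satisfied(start, (0.2, 0.0, 0.1), body, A, b)
large_step_ok = dcbf_satisfied(start, (1.4, 0.0, 0.0), body, A, b)
still_inside = bool(np.all(barrier_values((1.4, 0.0, 0.0), body, A, b) >= 0))
```

The large step keeps the footprint inside the square but approaches the boundary faster than the barrier condition allows, so it is rejected. The constraint count is the number of vertices times the number of faces, independent of how many obstacles produced the free-space region.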

2606.09717 2026-06-09 cs.SD eess.AS New submission

What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study

Zhu Li, Shekhar Nayak, Matt Coler

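Affiliations: University of Groningen

AI summary: Uses a controllable neural TTS system to manipulate speech rate, pitch variation, and loudness. Loudness primarily drives human sarcasm perception, whereas the model relies more on speech rate, revealing different prosodic cue weightings.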

Comments
Accepted to Interspeech 2026
Abstract

Prosody plays a central role in sarcasm perception, yet previous studies have relied on naturally produced speech that lacks fine-grained control over individual acoustic dimensions. As prosodic cues co-vary in natural data, isolating their independent contributions remains challenging. We introduce a controlled framework using neural text-to-speech (TTS) with prompt-based prosodic conditioning to manipulate speech rate, pitch variation, and loudness. An orthogonal stimulus set was constructed to enable causal testing of prosodic cue effects. Human listeners rated sarcasm and naturalness, and their judgments were compared with predictions from a foundation model capable of processing audio input. Results show that loudness primarily drives human sarcasm perception, whereas the model assigns greater weight to speech rate, leading to distinct cue-weighting patterns. This study shows how controllable neural TTS enables investigation of prosodic cue weighting in speech perception.

2606.09711 2026-06-09 cs.AI cs.LG New submission

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Mohammad Beigi, Ming Jin, Lifu Huang

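Affiliations: UC Davis; Virginia Tech

AI summary: Introduces the concept of PRIME and measures it through chain-of-thought monitoring, direct probes, and activation-level concept vectors. PRIME emerges in stages before sustained reward hacking, its direct-probe score forecasts later hack onset and severity, and across checkpoints in-domain PRIME tracks out-of-domain misalignment.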

Abstract

Reward hacking is usually studied after it becomes visible, once a model earns high proxy reward while failing the intended task. We instead study what proxy RL teaches before that failure appears. We introduce Proxy Reward Internalization and Mechanistic Exploitation (PRIME), a learned capability to assess task correctness, predict proxy acceptance, and reason about exploitable proxy-gold gaps. In coding RL environments with exploitable pytest rewards, we measure PRIME through chain-of-thought monitoring, direct probes, and activation-level concept vectors. We find that PRIME emerges in a staged sequence before sustained reward hacking, and that its current direct-probe score forecasts later hack onset and severity even when the visible hack rate is still low. PRIME also adapts when the evaluator changes, retargeting to whichever proxy-gold gap remains rewarded and persisting when gold reward suppresses overt hacking, and ablating its activation directions reduces hacking. Across checkpoints, in-domain PRIME tracks out-of-domain misalignment. Together these results suggest that exploitable proxy RL amplifies a proxy-internalization capability upstream of visible hacking, making PRIME a candidate early-warning signal for broader alignment risk.

2606.09709 2026-06-09 cs.CL New submission

IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking

Zechen Sun, Yuyang Sun, Zecheng Tang, Juntao Li, Wenpeng Hu, Wenliang Chen, Zhunchen Luo, Guotong Geng, Min Zhang

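Affiliations: Institute of Computer Science and Technology, Soochow University; Information Research Center of Military Science, PLA Academy of Military Science

AI summary: Attributes length collapse in LLM long-form generation to static hierarchical planning and proposes the Interleaved Structural Chain-of-Thought (IS-CoT) framework, whose dynamic Plan-Write-Reflect cycle enables continuous strategy adaptation. The resulting IS-Writer-8B model achieves state-of-the-art performance on long-form benchmarks.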

Abstract

Generating coherent and controllable long-form content remains a persistent challenge for Large Language Models (LLMs). While reasoning-enhanced models have demonstrated success in logic-intensive domains, our evaluation reveals that they suffer from a severe length collapse in open-ended writing, where performance degrades sharply as target lengths exceed 2,000 words. We attribute this failure to the limitation of static hierarchical planning, which struggles to provide dynamic guidance over extended contexts. To bridge this gap, we introduce the Interleaved Structural Chain-of-Thought (IS-CoT) framework. Unlike external agentic workflows, IS-CoT embeds a dynamic Plan-Write-Reflect cycle into the generation process, enabling continuous strategy adaptation and global alignment without additional assistance. Based on this framework, we construct a high-quality dataset of interleaved reasoning traces via a multi-teacher pipeline and train IS-Writer-8B. Experiments demonstrate that IS-Writer-8B achieves state-of-the-art performance on challenging long-form benchmarks (e.g., +3.08 vs. DeepSeek-V3.2 on LongBench-Write), exhibiting robust length compliance and coherence competitive with significantly larger proprietary models.

2606.09707 2026-06-09 cs.LG cs.CL New submission

BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling

Gianluca Barmina, Annemette Broch Pirchert, Andrea Blasi Núñez, Lukas Galke Poech, Peter Schneider-Kamp

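Affiliations: University of Southern Denmark

AI summary: Presents BrainSurgery, a tool that performs robust, reproducible tensor operations on neural network checkpoints through declarative YAML plans. It supports structural modifications, mathematical transformations, and tensor reshaping, with built-in assertions that guard against silent errors.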

Abstract

As deep learning models scale, managing, inspecting, and modifying large checkpoints has become increasingly challenging. Researchers often need to alter model weights for layer restructuring, precision casting, low-rank factorization, and architectural debugging, yet these workflows often rely on fragile ad-hoc Python scripts. Here, we introduce BrainSurgery, a tool for robust and reproducible "tensor surgery" on neural network checkpoints, and provide a system demonstration covering four examples and three case studies from model upcycling to LoRA extraction. By abstracting storage formats and memory management, BrainSurgery executes complex transformations through declarative YAML plans. It supports structural modifications, mathematical transformations, and tensor reshaping through expressive regex and structural targeting, while built-in assertions validate tensor shapes, data types, and values to prevent silent errors. We envision that BrainSurgery will provide a strong foundation for future research through its reproducible and validated operations.
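
The general idea of regex-targeted edits guarded by assertions can be shown in a few lines of plain Python. This is a generic sketch and not BrainSurgery's API or YAML plan schema, neither of which is described in the abstract; the function name, the tensor names, and the use of NumPy arrays as stand-ins for checkpoint tensors are all assumptions.

```python
import re
import numpy as np

def cast_matching(state, pattern, dtype):
    """Cast every tensor whose full name matches `pattern`, validating each result."""
    matcher = re.compile(pattern)
    edited = {}
    for name, tensor in state.items():
        if matcher.fullmatch(name):
            converted = tensor.astype(dtype)
            # Assertions fail loudly instead of letting a bad edit pass silently.
            assert converted.shape == tensor.shape, f"{name}: shape changed"
            assert np.all(np.isfinite(converted)), f"{name}: non-finite after cast"
            edited[name] = converted
        else:
            edited[name] = tensor
    return edited

# Hypothetical checkpoint with made-up tensor names.
state = {
    "layers.0.attn.weight": np.ones((4, 4), dtype=np.float32),
    "layers.1.attn.weight": np.ones((4, 4), dtype=np.float32),
    "embed.weight": np.ones((8, 4), dtype=np.float32),
}
edited = cast_matching(state, r"layers\.\d+\.attn\.weight", np.float16)
```

Only the attention weights are cast; the embedding keeps its original precision because its name does not match the pattern.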

2606.09705 2026-06-09 cs.LG cond-mat.stat-mech New submission

When Do Local Score Models Extrapolate Across Size? A Diagnostic Theory and Benchmark

Wenjie Xi

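Affiliations: The University of Hong Kong; Department of Physics and HK Institute of Quantum Science & Technology

AI summary: Develops a diagnostic theory showing that whether a local model extrapolates stably across system size depends on the quasi-locality of the Gaussian-smoothed score, and introduces the Finite-Depth Local Flow (FDLF) benchmark to validate it.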

Abstract

Scientific generative modeling often requires size transfer, where models trained on small systems are evaluated on larger ones. While translation-invariant architectures enable this evaluation, we show that architectural locality alone does not guarantee stable size extrapolation. Instead, stable extrapolation is governed by the quasi-locality of the Gaussian-smoothed score. Through Tweedie's formula, far-away perturbations can influence local score components via posterior covariance, meaning a local model succeeds only if its receptive field covers the smoothed score's response range. We formalize this mechanism, proving a size-uniform comparison theorem for local marginals under reverse diffusion. We also introduce Finite-Depth Local Flow (FDLF), a white-box diagnostic benchmark with exact scores, densities, and controllable response ranges. Empirically, we validate the interplay between spatial mixing, smoothed-score quasi-locality, and model receptive fields. Under spatial mixing, the smoothed score remains quasi-local relative to the receptive field, enabling stable extrapolation. Conversely, when spatial mixing weakens, the score's locality rapidly degrades, causing size transfer to fail.
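
The role of posterior covariance can be read off from the standard form of Tweedie's formula. The notation below is an assumption for illustration (Gaussian smoothing $x_\sigma = x_0 + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, smoothed density $p_\sigma$); the paper may use a different parameterization.

```latex
% Tweedie's formula: the smoothed score is a rescaled posterior-mean residual.
\nabla_x \log p_\sigma(x) = \frac{\mathbb{E}[x_0 \mid x_\sigma = x] - x}{\sigma^2}

% Differentiating once more gives the score Jacobian in terms of the posterior covariance.
\nabla_x^2 \log p_\sigma(x) = \frac{\operatorname{Cov}[x_0 \mid x_\sigma = x]}{\sigma^4} - \frac{I}{\sigma^2}

% Off-diagonal entries: how a perturbation at site j changes the score at site i.
\frac{\partial}{\partial x_j}\bigl[\nabla_x \log p_\sigma(x)\bigr]_i
  = \frac{\operatorname{Cov}[x_{0,i},\, x_{0,j} \mid x_\sigma = x]}{\sigma^4},
  \qquad i \neq j
```

Under these assumptions, the score component at site $i$ responds to site $j$ exactly as far as the posterior correlations between the two sites extend, which is the response range a local model's receptive field would need to cover.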

2606.09701 2026-06-09 cs.CL cs.AI cs.LG New submission

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

Blake Bullwinkel, Eugenia Kim, Amanda Minnich, Mark Russinovich

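Affiliations: Microsoft AI Red Team; Microsoft Azure

AI summary: Proposes AdvGRPO, a framework that uses dense multi-channel rewards and decoupled advantage normalization to stabilize GRPO for joint attacker-defender optimization. It produces effective, transferable attacks, and the co-trained defenders outperform baselines.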

Abstract

AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.
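
For orientation, the sketch below contrasts the usual GRPO group-relative advantage (standardizing the summed reward within a group of sampled responses) with one possible reading of "decoupled" normalization, in which each reward channel is standardized within the group before the channels are combined. The abstract does not define AdvGRPO's normalization, so the second function is an assumption for illustration, as are the reward values and function names.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Standard group-relative advantage: standardize rewards within one group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def per_channel_advantages(channel_rewards, weights=None, eps=1e-8):
    """Assumed variant: standardize each channel separately, then combine."""
    R = np.asarray(channel_rewards, dtype=float)  # shape (group_size, channels)
    z = (R - R.mean(axis=0)) / (R.std(axis=0) + eps)
    w = np.ones(R.shape[1]) if weights is None else np.asarray(weights, dtype=float)
    return z @ w

# Four sampled responses, two hypothetical reward channels on different scales.
R = np.array([
    [100.0, 0.0],
    [0.0, 1.0],
    [50.0, 1.0],
    [50.0, 0.0],
])
joint = group_advantages(R.sum(axis=1))   # dominated by the large-scale channel
decoupled = per_channel_advantages(R)     # both channels contribute equally
best_joint = int(np.argmax(joint))
best_decoupled = int(np.argmax(decoupled))
```

With these made-up numbers the two schemes prefer different responses, which shows how the choice of normalization changes which behaviour gets reinforced when reward channels differ in scale.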

2606.09699 2026-06-09 cs.CV New submission

Cranio-Diff: Diffusion-based Cross-domain Craniofacial Reconstruction with 2D X-ray Skull Guidance and Structural Identity Constraints

Ravi Shankar Prasad, Naresh Gurjar, Shashank Baghel, Chirag, Dinesh Singh

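Affiliations: Indian Institute of Technology Mandi; CSVTU Bhilai

AI summary: Proposes Cranio-Diff, a diffusion framework that combines skull-conditioned structural guidance via ControlNet with biometric text conditioning to reconstruct faces from 2D X-ray skull images across modalities. It addresses structural identity alignment and outperforms existing methods on a craniofacial dataset of 120 subjects.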

Comments
14 pages, 7 figures, BMVC 2026 conference
Abstract

State-of-the-art generative models, such as CycleGAN, Pix2Pix, and diffusion models, have demonstrated remarkable performance in face generation. However, they fail to effectively capture cross-modality semantic information in craniofacial reconstruction when translating from the skull (X-ray) domain to the face (optical) domain, because structural identity is misaligned across modalities. To address this issue, we propose Cranio-Diff, a diffusion-based framework for cross-domain craniofacial reconstruction from 2D X-ray skull images. The proposed approach integrates skull-conditioned structural guidance through ControlNet with biometric text conditioning to generate a face that is more semantically and structurally aligned with the given skull. Cranio-Diff is evaluated on a skull-face dataset obtained from X-ray scans of 120 subjects in lateral and frontal views. To enable controlled evaluation, each face image is synthesised across three age groups (25, 45, 65) and three BMI variations (-10%, baseline, and +10%), yielding 4320 paired samples. To the best of our knowledge, this is the only X-ray-face dataset of this size. Extensive experiments show that the proposed method outperforms recent approaches in both generated image quality and retrieval. We evaluate the quality of the generated images using FID, IS, SSIM, LPIPS, PSNR, and ArcFace score, and retrieval performance using recall@k, mAP@k, and MRR@k. The experimental results demonstrate that the proposed method can be used as an alternative tool to aid forensic investigations.

2606.09697 2026-06-09 cs.CL New submission

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

Gianluca Barmina, Federico Torrielli, Sven Harms, Jacob Nielsen, Felix Mächtle, Stine Lyngsø Beltoft, Peter Schneider-Kamp, Thomas Eisenbarth, Lukas Galke Poech, Anne Lauscher

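Affiliations: University of Southern Denmark; University of Turin; University of Hamburg; University of Lübeck

AI summary: Proposes PsychoSafe, a framework that reframes LLM refusals as structured supportive communication grounded in evidence-based intervention strategies. Using 8019 prompt-response pairs across five psychological risk domains, the authors apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B; PsychoSafe prompting improves refusal quality by 28.1% over a generic baseline while preserving performance on non-refusal tasks.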

Abstract

Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.

2606.09682 2026-06-09 cs.LG cs.DC cs.PF New submission

AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

Jaber Jaber, Osama Jaber

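Affiliations: RightNow AI

AI summary: Presents AutoMegaKernel, a system that compiles Llama-family models into a single persistent CUDA kernel. A static schedule validator checks for deadlock-freedom and race-freedom, correct megakernels are generated automatically for all 10 supported models, and a W8A16 megakernel outperforms cuBLAS bf16 on NVIDIA inference-class GPUs.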

Comments
18 pages, 5 figures. Open-source code, data, and agent harness: https://github.com/RightNow-AI/AutoMegaKernel
Abstract

AutoMegaKernel (AMK) compiles a HuggingFace Llama-family model into a single persistent cooperative CUDA kernel that runs the whole forward pass in one launch, with no per-model hand-written CUDA. The contribution is the system, not raw speed. A frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom via static graph checks (not a mechanized proof), so an unsafe agent-proposed schedule is rejected before launch: across 7,160 adversarial schedules (6,091 unsafe) it had zero false-accepts and accepted all 360 real lowerings. The same source retargets sm_80/sm_90/sm_120 from one codebase, auto-generates correct megakernels for 10 of 10 supported models, and on a real SmolLM2-135M checkpoint reproduces HuggingFace greedy decode token-for-token (perplexity match 2.5e-7). An unattended, agent-drivable autoresearch loop self-improves the megakernel over its own baseline (1.25-1.72x). A search-found int8 (W8A16) megakernel beats CUDA-graphed cuBLAS bf16 at batch-1 decode across NVIDIA's datacenter inference fleet: L4 up to 1.33x, the current-gen L40S 1.25-1.27x, A10G up to 1.08x at scale, and the consumer RTX 5090 1.19-1.23x. The ordering is not a clean function of bandwidth (the 864 GB/s L40S beats the 600 GB/s A10G); the divide is inference-class vs training-class. AMK trails cuBLAS on the high-bandwidth training-class A100/H100, where the harness localizes the cross-SM-sync bottleneck; we report the gap plainly. This is a precision-asymmetric (W8A16 vs bf16) comparison at decode position 0; the largest real checkpoint is TinyLlama-1.1B. Code and the harness: https://github.com/RightNow-AI/AutoMegaKernel
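
One ingredient a static schedule check could include is cycle detection on the task dependency graph, since a cyclic wait inside a persistent kernel can never make progress. The sketch below illustrates only that idea using Kahn's algorithm. The `deps` dictionary format and the task names are hypothetical, this is not AMK's schedule IR or validator, and it does not cover race-freedom.

```python
from collections import deque

def topological_order(deps):
    """deps maps each task to the set of tasks it waits on.

    Returns a valid execution order, or None if the waits form a cycle.
    """
    tasks = set(deps) | {d for waits in deps.values() for d in waits}
    indegree = {t: len(deps.get(t, ())) for t in tasks}
    users = {t: [] for t in tasks}
    for task, waits in deps.items():
        for dependency in waits:
            users[dependency].append(task)

    ready = deque(sorted(t for t in tasks if indegree[t] == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for user in users[task]:
            indegree[user] -= 1
            if indegree[user] == 0:
                ready.append(user)
    # Any task left unscheduled is part of, or blocked behind, a cycle.
    return order if len(order) == len(tasks) else None

# Hypothetical schedules: a simple chain, and a mutual wait.
safe_schedule = {"rmsnorm": set(), "qkv": {"rmsnorm"}, "attn": {"qkv"}, "mlp": {"attn"}}
cyclic_schedule = {"a": {"b"}, "b": {"a"}}

safe_order = topological_order(safe_schedule)
cyclic_order = topological_order(cyclic_schedule)
```

A schedule that fails this check would be rejected before launch, which matches the reject-before-launch behaviour the abstract describes, although the actual checks AMK performs are not specified there.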

2606.09679 2026-06-09 cs.CV New submission

SoccerNet 2026 Player-Centric Ball-Action Spotting: Retraining and Post-Processing Extensions to the FOOTPASS Baselines

Parthsarthi Rawat

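Affiliations: GameChanger by Dick’s Sporting Goods

AI summary: For the task of predicting player, action, and time across eight action classes in broadcast soccer, extends the FOOTPASS baselines with four additions: gradient checkpointing, fusion of GNN logits into the DST encoder, square-root frequency class weighting, and a post-processing pipeline. The system reaches Macro F1 of 0.548 on the test set and 0.446 on the challenge set.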

Comments
CVPR 2026 SoccerNet Player Centric Ball Action Spotting Challenge, Rank 7
Abstract

We describe our system for the SoccerNet 2026 Player-Centric Ball-Action Spotting Challenge, which requires predicting who performs which action and when, across eight classes in broadcast soccer. Building on the three FOOTPASS baselines [1] (TAAD, TAAD+GNN, and TAAD+DST), we contribute four extensions: (1) gradient checkpointing to enable full-backbone fine-tuning on a single GPU; (2) fusion of GNN logits into the DST encoder, combining graph-based tactical context with per-player visual features; (3) square-root frequency class weighting to address the 213:1 pass-to-tackle imbalance in the training data; and (4) a post-processing pipeline comprising per-class logit gating, temporal frame refinement, jersey re-assignment, and a two-model ensemble. Our system achieves 0.548 Macro F1 on the test set and 0.446 on the challenge set (server evaluation).
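
The effect of square-root frequency weighting on the 213:1 imbalance can be worked through numerically. The sketch assumes weights proportional to 1/sqrt(count), rescaled to mean 1; the rescaling and the two-class counts (which encode only the stated ratio, not the real dataset sizes) are assumptions, and the authors' exact formula may differ.

```python
import numpy as np

def sqrt_frequency_weights(counts):
    """Class weights proportional to 1 / sqrt(count), rescaled to have mean 1."""
    counts = np.asarray(counts, dtype=float)
    weights = 1.0 / np.sqrt(counts)
    return weights * len(weights) / weights.sum()

# Only the 213:1 pass-to-tackle ratio comes from the abstract.
counts = np.array([213.0, 1.0])  # [pass, tackle]
weights = sqrt_frequency_weights(counts)

sqrt_ratio = weights[1] / weights[0]   # tackle weight relative to pass weight
inverse_ratio = counts[0] / counts[1]  # what plain inverse-frequency would give
```

Under this scheme the rare class is upweighted by sqrt(213), roughly 14.6, rather than the factor of 213 that plain inverse-frequency weighting would apply, which softens the correction and reduces the risk of over-predicting the rare class.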

2606.09671 2026-06-09 cs.LG cs.AI New submission

Transition-Based Digital Twin Modelling for Alzheimer's Disease under Sparse Longitudinal Data

Yinyu Huang, Yilin Zhang, Sofia Michopoulou, Christopher Kipps, Rahman Attar

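Affiliations: University of Southampton; University Hospital Southampton NHS Foundation Trust; Faculty of Medicine, University of Southampton

AI summary: To address the heterogeneity of Alzheimer's disease progression and data sparsity, proposes a digital twin framework that combines local transition modelling with sequence modelling. Using multimodal longitudinal data, it predicts cognitive status and quantifies uncertainty, with strong results on ADNI data.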

Comments
13 pages, 5 figures, 3 tables. Accepted as a full-length paper at the International Conference on AI in Healthcare (AIiH) 2026
Abstract

Alzheimer's disease (AD) progression is highly heterogeneous and is typically observed through sparse and irregular longitudinal data, posing challenges for prediction and personalised monitoring. Existing machine learning approaches have improved AD prediction using multimodal data, yet often focus on static classification or cohort-level risk estimation, providing limited support for subject-specific modelling and uncertainty-aware reasoning. To address these limitations, we present a personalised digital twin framework for AD prediction and scenario-based analysis using multimodal longitudinal data. The proposed approach integrates complementary modelling strategies to capture clinical transitions and temporal dependencies across visits. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), including cognitive assessments, clinical variables, and MRI-derived phenotypes, the framework predicts cognitive status and diagnostic categories while quantifying predictive uncertainty and enabling patient-specific what-if trajectory analysis. Evaluation on leak-free subject-level splits demonstrates strong performance in score forecasting and diagnosis classification. In this sparse and irregular ADNI setting, transition-based modelling of adjacent visits achieved higher predictive accuracy than the sequence-based branch, suggesting that local transition modelling may be more data-efficient. While sequence models remain valuable for uncertainty-aware trajectory forecasting, local transition modelling offers a more data-efficient and robust predictive strategy. These findings highlight the importance of aligning temporal modelling strategies with clinical data structure and suggest that transition-based digital twin formulations may provide a practical and interpretable approach for personalised disease forecasting in neurodegenerative disorders.

2606.09670 2026-06-09 cs.CV cs.AI New submission

Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision

Mateo Diaz-Bone, Daniel Caraballo, Florian Scheidegger, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi, Roy Assaf, Niccolo Avogaro, Yagmur G. Cinar, Brown Ebouky, Filip M. Janicki, Piotr S. Kluska, Cezary Skura, Cristiano Malossi

发表机构 * IBM Research Europe Zurich(IBM欧洲研究院苏黎世分院)

AI总结 针对异常检测在真实场景中因物体尺度、视角等变化失效的问题,提出视觉提示管道、解冻教师模型和扩散生成数据增强,在AeBAD数据集上提升3.5个百分点。

详情
AI中文摘要

最近的异常检测方法在成熟数据集(如MVTec)上取得了完美的检测和分割分数。然而,当基本假设(如一致的物体尺度、视角、背景、光照和居中放置)被违反时,许多方法面临挑战。这些变化使得异常检测方法在许多真实场景中无法使用。为了解决这些限制,我们引入了三个关键贡献:(1)一个视觉提示管道,通过前景-背景掩码隔离物体;(2)一种在师生模型中解冻教师以提高领域适应性的机制;(3)一种利用扩散生成合成图像的数据增强策略,以增强异常检测性能。通过使用掩码多尺度重建(MMR)模型作为骨干,我们在具有挑战性的AeBAD数据集上比之前的最先进方法提高了3.5个百分点。

英文摘要

Recent anomaly detection methods achieve perfect detection and segmentation scores on well-established datasets, such as MVTec. However, many of these methods face challenges when foundational assumptions (such as consistent object scale, viewpoint, background, illumination, and centered placement) are violated. Such variations render anomaly detection methods unusable in many real-world scenarios. To address these limitations, we introduce three key contributions: (1) a visual prompting pipeline that isolates objects using foreground-background masking; (2) a mechanism for unfreezing the teacher in student-teacher models to improve domain adaptability; and (3) a data augmentation strategy leveraging diffusion-generated synthetic images to enhance anomaly detection performance. We achieve a 3.5 percentage point improvement over the previous state-of-the-art on the challenging AeBAD dataset by using the Masked Multiscale Reconstruction (MMR) model as our backbone.

2606.09669 2026-06-09 cs.AI cs.CL 新提交

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

SpatialWorld: 在多模态智能体真实世界任务中基准测试交互式空间推理

Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan, Yinpeng Dong

发表机构 * Tsinghua University(清华大学) Chongqing University(重庆大学) Peking University(北京大学) ZenoMind AI Xi’an Jiaotong University(西安交通大学) Beijing Institute of Technology(北京理工大学) Southeast University(东南大学) Shanghai Jiao Tong University(上海交通大学) Joy Future Academy The University of Hong Kong(香港大学)

AI总结 提出SpatialWorld基准,集成8种异构模拟后端,通过760个人工标注任务评估多模态智能体在视觉部分可观测环境中的交互式空间理解,发现最强模型GPT-5任务成功率仅17.4%。

详情
AI中文摘要

空间推理是多模态大语言模型(MLLMs)感知和操作物理世界的基础能力。然而,现有基准主要依赖被动评估(如静态VQA)或特定模拟器流程,未能评估通用的交互式空间理解。我们引入SpatialWorld,一个专门为评估多模态智能体在复杂真实世界任务中的交互式空间理解而设计的统一基准。在共享的、模拟器无关的协议下集成八个异构模拟后端,SpatialWorld包含跨多个领域(如家庭日常、旅行、社交协作)的760个人工标注任务。智能体必须在仅视觉的部分可观测性下解决问题,主动收集自我中心的视觉证据,并通过MLLMs原生的统一文本动作接口表达决策。为了可靠评估,每个任务包含一个人工验证的初始状态、一条参考轨迹和一个终端状态验证器。评估15个先进智能体揭示,稳健的空间任务解决仍然具有挑战性:最强模型GPT-5平均任务成功率(TSR)仅为17.4%,而领先的开源模型Qwen-3.5达到14.1%。进一步分析暴露了任务成功与执行效率之间的明显不匹配,以及显著的领域特定性能差异。这些在主动探索和长程规划中的瓶颈使SpatialWorld成为未来空间智能体的严格测试平台。

英文摘要

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.

2606.09668 2026-06-09 cs.LG 新提交

Algorithm for Contextual Queueing Bandits with Rate-Optimal Queue Length Regret

具有速率最优队列长度遗憾的上下文队列赌博机算法

Seoungbin Bae, Dabeen Lee

发表机构 * KAIST(韩国科学技术院) Seoul National University(首尔大学)

AI总结 针对上下文队列赌博机问题,提出三阶段算法CQB-η-2,通过仅在截止轮前进行随机探索,将队列长度遗憾从Õ(T^{-1/4})改进到Õ(T^{-1/2}),并证明该速率在最小最大意义下最优。

详情
AI中文摘要

上下文队列赌博机为在未知上下文相关服务速率下学习调度异构作业提供了框架。在随机上下文下,现有算法实现了 $\widetilde{\mathcal{O}}(T^{-1/4})$ 的队列长度遗憾,定义为学习者在时间 $T$ 的队列长度与最优队列长度之差的期望。本文将该速率改进至 $\widetilde{\mathcal{O}}(T^{-1/2})$。关键观察是随机探索仅需在精心选择的截止轮之前进行,而非整个时间范围。我们提出 CQB-$\eta$-2,一个三阶段算法:(i) 纯随机探索以构建初始估计器,(ii) $\eta$-随机探索结合 UCB 规则以在保持负漂移的同时继续学习,(iii) 探索截止后的纯 UCB。我们的证明在截止轮处分解队列长度遗憾。截止前,负漂移抑制了由次优选择引起的队列长度差异。截止后,前两个阶段提供了足够的随机探索样本,确保 UCB 决策导致的离开率差距较小。结合这两个界得到 $\widetilde{\mathcal{O}}(T^{-1/2})$ 阶的队列长度遗憾。我们进一步证明了 $\Omega(T^{-1/2})$ 阶的最小最大下界。证明构造了两个统计上不可区分的困难实例直到最终服务决策,并使用队列特定的耦合论证将由此产生的检验误差转化为队列长度遗憾。综上,我们的上下界刻画了在时间 $T$ 上的最小最大依赖关系(忽略对数因子)。

英文摘要

Contextual queueing bandits provide a framework for learning to schedule heterogeneous jobs under unknown context-dependent service rates. Under stochastic contexts, existing algorithms achieve $\widetilde{\mathcal{O}}(T^{-1/4})$ queue length regret, defined as the expected difference between the learner's and oracle's queue lengths at horizon $T$. In this paper, we improve this rate to $\widetilde{\mathcal{O}}(T^{-1/2})$. The key observation is that random exploration is needed only up to a carefully chosen cutoff round, rather than throughout the entire horizon. We propose CQB-$\eta$-2, a three-phase algorithm: (i) pure random exploration to construct an initial estimator, (ii) $\eta$-random exploration combined with a UCB rule to continue learning while maintaining negative drift, and (iii) pure UCB after the exploration cutoff. Our proof decomposes the queue length regret at the cutoff round. Before the cutoff, negative drift suppresses queue length differences caused by suboptimal choices. After the cutoff, the first two phases provide sufficient random exploration samples, ensuring that UCB decisions incur small departure-rate gaps. Combining these two bounds yields queue length regret of order $\widetilde{\mathcal{O}}(T^{-1/2})$. We further prove a minimax lower bound of order $\Omega(T^{-1/2})$. The proof constructs two hard instances that are statistically indistinguishable up to the final service decision, and uses a queue-specific coupling argument to convert the resulting testing error into queue length regret. Together, our upper and lower bounds characterize the minimax dependence on the horizon $T$ up to logarithmic factors.
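
The three-phase structure described above can be sketched as a per-round schedule. This is a minimal illustration rather than the authors' algorithm: the phase lengths `t_pure` and `t_cutoff`, the value of `eta`, and the function name `choose_mode` are all hypothetical, and the abstract does not say how the cutoff round is chosen.

```python
import random

def choose_mode(t, t_pure, t_cutoff, eta, rng):
    """Return which rule picks the server in round t (1-indexed).

    Hypothetical schedule: rounds 1..t_pure explore uniformly at random,
    rounds t_pure+1..t_cutoff explore with probability eta and otherwise
    follow the UCB rule, and every round after t_cutoff follows UCB only.
    """
    if t <= t_pure:
        return "random"
    if t <= t_cutoff:
        return "random" if rng.random() < eta else "ucb"
    return "ucb"

# Toy horizon of 100 rounds; all numbers are made up for illustration.
rng = random.Random(0)
modes = [choose_mode(t, t_pure=10, t_cutoff=50, eta=0.2, rng=rng)
         for t in range(1, 101)]
```

The point of the sketch is only that random exploration stops entirely after the cutoff, which is the observation the abstract credits for the improved rate.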

2606.09666 2026-06-09 cs.AI 新提交

Frequency-based Constrained Sampling for Interval Patterns

基于频率的区间模式约束采样

Djawad Bekkoucha, Abdelkader Ouali, Bruno Crémilleux

发表机构 * Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Université Paris-Saclay, CNRS(巴黎-萨克雷大学数字科学跨学科实验室(LISN),法国国家科学研究中心) Université Caen Normandie, ENSICAEN, CNRS, Normandie Univ, GREYC UMR6072(卡昂诺曼底大学,卡昂国立高等工程师学校,法国国家科学研究中心,诺曼底大学,GREYC UMR6072)

AI总结 提出CFips方法,将用户定义的句法约束直接融入多步采样框架,通过分解为区间边界上的基本谓词实现精确采样,保证在约束模式空间中按频率比例采样,实验证明能完成超时失败的挖掘任务。

详情
Comments
16 pages
AI中文摘要

输出空间模式采样是穷举模式挖掘的一种强大替代方案,用于探索大型模式空间,因为它使用户能够根据选定的兴趣度量关注代表性模式。在本文中,我们解决了在用户定义的句法约束下采样区间模式的问题。我们引入了CFips,一种将约束直接融入采样过程的采样方法。该方法基于多步采样框架,通过将约束分解为区间边界上的基本谓词来支持多种句法约束,同时保持精确采样保证。我们正式证明CFips在约束模式空间内按频率比例采样区间模式。实验结果表明,将约束融入采样过程能够完成在给定超时内否则会失败的挖掘任务。

英文摘要

Output space pattern sampling is a powerful alternative to exhaustive pattern mining for exploring large pattern spaces, as it enables users to focus on representative patterns drawn according to a chosen interestingness measure. In this paper, we address the problem of sampling interval patterns under user-defined syntactic constraints. We introduce CFips, a sampling approach that incorporates constraints directly into the sampling procedure. The approach relies on a multi-step sampling framework and supports several syntactic constraints by decomposing them into elementary predicates on interval bounds while preserving exact sampling guarantees. We formally prove that CFips samples interval patterns proportionally to their frequency within the constrained pattern space. The experimental results show that integrating constraints into the sampling procedure makes it possible to complete mining tasks that would otherwise fail to finish within a given timeout.
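
To make frequency-proportional sampling concrete, here is a minimal two-step sketch. It is a simplified stand-in and not CFips itself: the function name `sample_interval_pattern` and the per-bound predicates `lo_ok` and `hi_ok` are hypothetical, and only constraints that test each bound independently are handled (a constraint coupling both bounds, such as a maximum width, is not).

```python
import random

def sample_interval_pattern(rows, lo_ok, hi_ok, rng):
    """Draw one interval pattern with probability proportional to its frequency.

    rows: list of equal-length numeric tuples.
    lo_ok / hi_ok: hypothetical predicates (attribute index, value) -> bool
    that say whether a value is admissible as a lower / upper bound.
    """
    n_attrs = len(rows[0])
    domains = [sorted({r[j] for r in rows}) for j in range(n_attrs)]

    def bound_choices(row):
        # Per attribute: admissible lower bounds <= value, upper bounds >= value.
        return [
            ([v for v in domains[j] if v <= row[j] and lo_ok(j, v)],
             [v for v in domains[j] if v >= row[j] and hi_ok(j, v)])
            for j in range(n_attrs)
        ]

    def weight(choices):
        # Number of admissible patterns that cover the row.
        w = 1
        for los, his in choices:
            w *= len(los) * len(his)
        return w

    all_choices = [bound_choices(r) for r in rows]
    weights = [weight(c) for c in all_choices]
    # Step 1: pick a row proportionally to the number of patterns covering it.
    idx = rng.choices(range(len(rows)), weights=weights, k=1)[0]
    # Step 2: pick one covering pattern uniformly, attribute by attribute.
    return tuple((rng.choice(los), rng.choice(his)) for los, his in all_choices[idx])
```

A pattern covered by k rows can be reached through any of those k rows. Each route has probability (row weight / total weight) × (1 / row weight), so the pattern is drawn with probability k / total weight, which is proportional to its frequency.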

2606.09664 2026-06-09 cs.LG stat.ML 新提交

In-Context Learning for Latent Space Bayesian Optimization

潜空间贝叶斯优化的上下文学习

Tuan A. Vu, Harri Lähdesmäki, Julien Martinelli

发表机构 * Aalto University(阿尔托大学)

AI总结 针对潜空间贝叶斯优化中上下文学习模型与优化任务不匹配的问题,提出在分子VAE潜空间上定义合成优化任务进行持续预训练,并引入正则化器保持原始先验,显著提升分子优化性能。

详情
AI中文摘要

贝叶斯优化(BO)是样本高效设计的核心工具,潜空间贝叶斯优化(LSBO)将其扩展到分子和蛋白质等结构化对象。与此同时,TabPFN和TabICL等表格基础模型现已实现最先进的回归性能,并越来越多地被用作BO代理模型。由于其贝叶斯行为是由大规模合成预训练集合诱导的,因此该预训练分布的组成至关重要。LSBO造成了一种独特的不匹配:从潜代码到目标值的映射与当前上下文模型训练所用的回归任务明显不同。我们通过在分子VAE的潜空间上定义合成优化任务来补充表格基础模型代理的预训练阶段,从而解决这种不匹配。持续预训练目标包含一个正则化器,将模型锚定到原始检查点,保留其广泛的回归先验,同时避免对适应任务的过度专业化。在保留的分子优化基准测试中,所得模型实现了强劲性能,支持了针对上下文化代理的LSBO特定适应的重要性。

英文摘要

Bayesian optimization (BO) is a central tool for sample-efficient design, and latent-space Bayesian optimization (LSBO) extends it to structured objects such as molecules and proteins. In parallel, tabular foundation models such as TabPFN and TabICL now achieve state-of-the-art regression performance and are increasingly used as BO surrogates. Because their Bayesian behavior is induced by large synthetic pretraining collections, the composition of this pretraining distribution is crucial. LSBO creates a distinctive mismatch: the induced map from latent code to objective value differs markedly from the regression tasks used to train current in-context models. We address this mismatch by complementing the pretraining stage of tabular foundation model surrogates with synthetic optimization tasks defined on the latent space of a molecular VAE. The continued-pretraining objective features a regularizer that anchors the model to the original checkpoint, preserving its broad regression prior while avoiding overspecialization to the adaptation tasks. On held-out molecular optimization benchmarks, the resulting model achieves strong performance, supporting the relevance of LSBO-specific adaptation for in-context surrogates.
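
One way to read the anchoring regularizer is as a penalty on drift away from the original checkpoint. The form below is an assumption for illustration only. The abstract does not state the actual regularizer, and the symbols $\theta_0$ (original checkpoint weights), $\lambda$ (penalty strength), and $\mathcal{L}_{\text{LSBO}}$ (loss on the synthetic latent-space tasks) are introduced here:

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{LSBO}}(\theta) \;+\; \lambda \,\lVert \theta - \theta_0 \rVert_2^2
```

Under this assumed form, $\lambda = 0$ reduces to plain continued pretraining, while larger $\lambda$ keeps the model closer to its original regression prior.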

2606.09663 2026-06-09 cs.AI 新提交

From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design

从0到1再到N:MetaAI递归自我设计的可复现工程证据

Dun Li, Jiatao Li, Hongzhi Li

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Shanghai Maritime University(上海海事大学) Chizhou University(池州学院)

AI总结 提出包含四个标准的可操作证据框架,并据此评估现有公开系统;其中Darwin Goedel Machine在SWE-bench Verified上从20%提升至50%(提升30个百分点),并给出基于HumanEval的可复现协议MetaAI-Mini。

详情
Comments
6 pages, 2 figures, 7 tables. Supplementary code: https://github.com/DunLi-Tsinghua/MetaAI-Mini
AI中文摘要

递归自我设计指的是AI辅助修改AI系统构建、评估和改进的机制。本文将MetaAI视为一种由人类播种、AI扩展的开发模式,其中设计空间本身成为修改目标。我们提出了一个可操作证据框架,包含四个标准:可检查的目标系统、元级修改器、反馈导向选择和递归延续。然后,我们将包括Darwin Goedel Machine (DGM)、STOP、Goedel Agent和ShinkaEvolve在内的公开系统映射到这些标准上。DGM提供了目前最直接的已报告证据:其公布的结果显示,经过80次迭代,SWE-bench Verified上的性能从20%提升到50%,完整Polyglot上的性能从14.2%提升到30.7%,消融实验表明开放式探索和自我改进都有贡献。最后,我们提供了MetaAI-Mini,一个基于HumanEval的可复现协议和代码库。由于本次构建未包含完整的模型运行,MetaAI-Mini作为协议而非实验结果报告。

英文摘要

Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This paper treats MetaAI not as a mature paradigm, but as a working term for a human-seeded, AI-expanded development pattern in which the design space itself becomes a target of modification. We propose an operational evidence framework with four criteria: inspectable target system, meta-level modifier, feedback-directed selection, and recursive continuation. We then map public systems, including Darwin Goedel Machine (DGM), STOP, Goedel Agent, and ShinkaEvolve, against these criteria. DGM provides the most direct currently reported evidence: its published results show improvement from 20% to 50% on SWE-bench Verified and from 14.2% to 30.7% on full Polyglot after 80 iterations, with ablations suggesting that both open-ended exploration and self-improvement contribute. Finally, we provide MetaAI-Mini, a reproducible HumanEval-based protocol and codebase. Because no completed model run is included in this build, MetaAI-Mini is reported as a protocol rather than as an experimental result.

2606.09662 2026-06-09 cs.CL 新提交

When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

当内置思考既有帮助又有害:指令遵循中的约束级错误转移

Sai Adith Senthil Kumar

发表机构 * George Mason University(乔治梅森大学)

AI总结 研究大型推理模型(LRM)的思考模式对指令遵循的影响,发现思考会改变错误模式而非统一降低性能,其中规划类约束改善而精确类约束恶化,并通过分析思考轨迹和激活修补揭示了机制。

详情
Comments
16 pages, 7 figures, 15 tables
AI中文摘要

大型推理模型(LRM)通常能提升数学和编码性能,但其对指令遵循的影响尚不明确。我们使用 Qwen3 模型(1.7B-32B)研究 IFEval,采用同权重的思考开启/关闭控制;四个 Hunyuan 模型提供跨家族方向性支持。总体通过率变化很小(-0.55 到 -3.52 个百分点),但 10-20% 的提示在两种模式间在通过和失败之间切换,表明思考改变了错误模式——某些提示改善而另一些恶化——而非统一降低性能。在事后 Qwen3 导出的分组下,约束类型分为规划类(全局计数、结构、协调)和精确类(精确局部形式);规划类在思考下类别层面改善,而精确类持续恶化;尽管 Hunyuan 的总体方向相反,但所有四个 Hunyuan 模型在类别层面的规划/精确符号模式方向一致。思考还改变了最终答案长度;匹配长度分析大幅减少了精确类的下降,但仍有残余惩罚。使用交叉编码器相关性指标分析思考轨迹揭示了三种模式:中性模式显示正的相关性-合规性关联(r ≈ 0.15);规划模式显示接近零的预测相关性(r ≈ 0.02),尽管有可测量的轨迹参与,这与 CE 测量的轨迹相关性和最终答案合规性之间的执行差距一致;精确模式显示小的负相关性(r ≈ -0.05),失败实例的平均相关性高于通过实例。跨四个模型大小(1.7B-14B)的激活修补显示,精确类翻转实例比规划类翻转实例更常被恢复(32-58% 对 14-40% 的平均层恢复),最大差距在 14B 处(约 30 个百分点)。

英文摘要

Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate changes are small (-0.55 to -3.52 pp), yet 10-20% of prompts switch between pass and fail across modes, suggesting that thinking changes the pattern of errors (some prompts improve while others worsen) rather than uniformly degrading performance. Under a post-hoc Qwen3-derived grouping, constraint types separate into Planning (global counting, structure, coordination), which improves at the class level under thinking, and Precision (exact local form), which consistently worsens; the class-level Planning/Precision sign pattern holds directionally for all four Hunyuan models despite Hunyuan's opposite aggregate direction. Thinking also changes final-answer length; matched-length analyses substantially reduce the Precision drop, but a residual penalty remains. Analyzing thinking traces with a cross-encoder relevance metric reveals three patterns: Neutral shows a positive relevance-compliance link (r ≈ 0.15); Planning shows near-zero predictive correlation (r ≈ 0.02) despite measurable trace engagement, consistent with an execution gap between CE-measured trace relevance and final-answer compliance; Precision shows a small negative correlation (r ≈ -0.05), with failing instances having higher mean relevance than passing ones. Activation patching across four model sizes (1.7B-14B) shows that Precision flip instances are more often restored than Planning flip instances (32-58% vs. 14-40% mean layer-restoration), with the largest gap at 14B (about 30 pp).
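
The distinction between a small aggregate change and a large share of flipped prompts is easy to see in code. The sketch below is illustrative only: the function name `mode_shift` and the toy outcome lists are invented here and are not taken from the paper's data.

```python
def mode_shift(pass_off, pass_on):
    """Compare per-prompt pass/fail outcomes with thinking OFF vs. ON.

    Returns the aggregate pass-rate change in percentage points and the
    percentage of prompts whose outcome flipped in either direction.
    """
    if len(pass_off) != len(pass_on) or not pass_off:
        raise ValueError("need two non-empty outcome lists of equal length")
    n = len(pass_off)
    delta_pp = 100.0 * (sum(pass_on) - sum(pass_off)) / n
    flips = sum(a != b for a, b in zip(pass_off, pass_on))
    return delta_pp, 100.0 * flips / n

# Toy data: 20 prompts, of which thinking breaks 3 and fixes 2.
off = [True] * 15 + [False] * 5
on = [True] * 12 + [False] * 3 + [True] * 2 + [False] * 3
delta_pp, flip_pct = mode_shift(off, on)
```

In this toy case the pass rate moves by only 5 points while a quarter of the prompts change outcome, which is the kind of gap the abstract describes.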

2606.09659 2026-06-09 cs.CL cs.AI cs.LG 新提交

End-to-End Context Compression at Scale

端到端上下文压缩的规模化

Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra, Zaiqian Chen, Artem Gazizov, Venkata Anoop Suhas Kumar Morisetty, Bhavya Kailkhura, Harshitha Menon, Zhuang Liu, Brian R. Bartoldson, Tom Goldstein, Sanae Lotfi, Micah Goldblum, Pavel Izmailov

发表机构 * New York University(纽约大学) Modal Labs(Modal实验室) University of Maryland(马里兰大学) Princeton University(普林斯顿大学) Columbia University(哥伦比亚大学) Harvard University(哈佛大学) Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室) FAIR at Meta(Meta FAIR实验室)

AI总结 本研究通过架构搜索和持续预训练,提出潜在上下文语言模型(LCLMs),一种端到端编码器-解码器压缩器,在通用任务性能、压缩速度和峰值内存上改进帕累托前沿,并可作为长时智能体的高效骨干。

详情
AI中文摘要

长上下文语言模型推理受限于内存,因为KV缓存随上下文长度增长。最近压缩KV缓存的技术存在不足:它们要么大幅降低模型质量,要么需要大量时间和计算来压缩单个长提示。此外,许多方法要求输入适合目标模型的上下文窗口,并且通常与现代生产推理引擎不兼容。编码器-解码器压缩器原则上是一种有吸引力的替代方案,它将长令牌序列映射到由解码器消费的较短潜在嵌入序列。然而,现有方法在精度-效率前沿上无法与KV缓存压缩竞争。在这项工作中,我们重新审视编码器-解码器压缩并缩小了这一差距。我们首先进行架构搜索,从头开始预训练许多变体,以确定如何最佳设计和训练编码器-解码器压缩器。根据我们的发现,我们持续预训练一系列0.6B编码器、4B解码器模型,每个模型在超过350B令牌上训练,压缩比为1:4、1:8和1:16。我们引入了潜在上下文语言模型(LCLMs),这是一系列压缩器,在通用任务性能、压缩速度和峰值内存使用上改进了帕累托前沿。我们证明了LCLMs可作为长时智能体的高效骨干,让智能体浏览压缩的长上下文并按需自适应扩展相关片段。

英文摘要

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.

2606.09658 2026-06-09 cs.LG cs.AI 新提交

Muon Learns More Robust and Transferable Features than Adam

Muon 比 Adam 学习更鲁棒和可迁移的特征

Tianyu Ruan, Fengzhuo Zhang, Shuche Wang, Shihua Zhang

发表机构 * Yale University(耶鲁大学) National University of Singapore(新加坡国立大学) University of Chinese Academy of Sciences(中国科学院大学) Academy of Mathematics and Systems Science, CAS(中国科学院数学与系统科学研究院)

AI总结 本文通过鲁棒性和可迁移性视角,证明 Muon 优化器相比 Adam 和 SGD 能学习到更鲁棒、更可迁移的特征,并通过理论分析支持了经验发现。

详情
AI中文摘要

Muon 最近已成为预训练大型语言模型(LLMs)和视觉分类器的最先进优化器。尽管其在效率上优于 Adam 和 SGD,但 Muon 在特征学习方面的优势仍不清楚。本文通过鲁棒性和可迁移性的视角研究了 Muon 的特征学习优势。首先,通过在损坏图像和文本上评估预训练模型,我们表明 Muon 学习到的特征在不同架构(包括 Transformer 和卷积神经网络(CNN))中始终比 Adam 和 SGD 学习到的特征更鲁棒。使用训练好的逐层探针,我们进一步表明这种鲁棒性优势体现在各层更大的 logit 间隔上。其次,通过在下游任务上训练线性分类器或从预训练参数微调完整模型,我们证明 Muon 学习到的特征比 Adam 和 SGD 学习到的特征更有效地迁移。这种可迁移性优势还通过有效秩衡量的各层隐藏状态的多样性得到进一步支持。最后,在一个具有多组件特征的代表性分类问题中,我们证明 Muon 比 Adam 和 SGD 获得更大的间隔和更高的有效秩,为我们的经验发现提供了理论支持。

英文摘要

Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vision classifiers. Despite its efficiency advantage over Adam and SGD, the feature-learning advantage of Muon remains unclear. This paper investigates Muon's feature-learning advantage through the lens of robustness and transferability. First, by evaluating pretrained models on corrupted images and texts, we show that features learned by Muon are consistently more robust than those learned by Adam and SGD across different architectures, including transformers and Convolutional Neural Networks (CNNs). Using trained layer-wise probes, we further show that this robustness advantage is reflected in larger logit margins across layers. Second, by training linear classifiers or fine-tuning full models from pretrained parameters on downstream tasks, we demonstrate that Muon-learned features transfer more effectively than those learned by Adam and SGD. This transferability advantage is further supported by the diversity of hidden states across layers, as measured by effective rank. Finally, in a representative classification problem with multi-component features, we prove that Muon attains larger margins and higher effective rank than Adam and SGD, providing theoretical support for our empirical findings.
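
For readers unfamiliar with effective rank, the sketch below uses the common entropy-based definition: the exponential of the Shannon entropy of the normalised singular values. That the paper uses this exact definition is an assumption, since the abstract does not say. The function takes singular values directly so that it needs only the standard library.

```python
import math

def effective_rank(singular_values):
    """Entropy-based effective rank of a matrix, given its singular values.

    Assumed definition: exp(H(p)) with p_i = s_i / sum(s). Zero singular
    values contribute nothing and are skipped.
    """
    positive = [s for s in singular_values if s > 0]
    if not positive:
        raise ValueError("need at least one positive singular value")
    total = sum(positive)
    entropy = -sum((s / total) * math.log(s / total) for s in positive)
    return math.exp(entropy)

flat = effective_rank([1.0, 1.0, 1.0, 1.0])    # energy spread over 4 directions
peaked = effective_rank([1.0, 0.0, 0.0, 0.0])  # energy in a single direction
```

A flat spectrum gives an effective rank equal to the number of directions, while a spectrum dominated by one direction gives a value near 1, so a higher value indicates more diverse hidden states.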

2606.09655 2026-06-09 cs.CL 新提交

Beyond Accuracy: Community Perspectives on Machine Translation

超越准确率:机器翻译的社区视角

Yujun Wang, Ehud Reiter, Shimei Pan, Steffen Eger, Wei Zhao

发表机构 * University of Technology Nuremberg(纽伦堡工业大学) University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校) University of Aberdeen(阿伯丁大学)

AI总结 本文通过分析社交媒体上四个利益相关者社区(AI开发者、专业译者、语言学习者、语言服务提供商)的帖子,揭示机器翻译技术社区间的分歧与冲突,强调倾听用户社区需求的重要性。

详情
AI中文摘要

尽管机器翻译(MT)取得了显著进展,但非AI社区对MT系统日益增长的担忧表明技术进展与现实用户需求之间存在明显差距。例如,NLP研究人员关注基准性能,而最终用户关心伦理问题、信任、可靠性、成本等。我们认为倾听不同用户社区至关重要,以便研究工作能针对社区关心的问题。为此,我们首次进行大规模分析,调查四个利益相关者社区(AI开发者、专业译者、语言学习者和语言服务提供商)在社交媒体上关于MT技术的帖子。我们构建了一个包含2019年至2025年来自Reddit、Facebook、Bluesky和Mastodon的79,286条帖子及评论的数据集,并分析这些社区在哪些方面存在分歧,以及分歧的方式和原因。总体而言,我们发现社区间经常存在分歧,甚至在翻译质量、效率和可靠性等话题上因情绪极化而表现出强烈冲突。这是因为这些社区处理这些话题的方式不同:AI社区将其视为技术和计算问题,而非AI(用户)社区更关注质量细微差别、时间节省、用户信任和更广泛的社会问题。

英文摘要

Despite remarkable progress in machine translation (MT), non-AI communities have raised growing concerns about MT systems, suggesting a noticeable gap between technical advancement and the needs of real-world users. For instance, while NLP researchers focus on benchmark performance, end users care about ethical concerns, trust, reliability, costs, and more. We argue that listening to various user communities is essential so that research efforts are directed towards the problems these communities care about. To this end, we present the first large-scale analysis of what four stakeholder communities (AI developers, professional translators, language learners, and language service providers) post about MT technology on social media. We construct a dataset of 79,286 posts and comments from Reddit, Facebook, Bluesky, and Mastodon from 2019 to 2025, and analyse where, how, and why these communities disagree. Overall, we find that communities often disagree, and even show strong conflicts due to polarised sentiments on topics such as translation quality, efficiency, and reliability. This is because these communities approach these topics differently: the AI community frames them as technical and computational problems, while non-AI (user) communities care more about quality nuances, time savings, user trust, and broader social issues.

2606.09653 2026-06-09 cs.LG 新提交

A Unifying Framework for Concept-Based Representational Similarity

基于概念的表征相似性的统一框架

Grégoire Dhimoïla, Victor Boutin, Agustin Martin Picard, Thomas Fel, Thomas Serre

发表机构 * Brown University(布朗大学) ENS Paris Saclay(巴黎萨克雷高等师范学校) CNRS(法国国家科学研究中心) DEEL - IRT Saint Exupéry(DEEL - IRT 圣埃克苏佩里) Goodfire

AI总结 提出统一框架分解概念对齐的两个轴(表征vs.概念、实例级vs.分布级),定义四种性质,并引入干预基准InterVenchA和耦合稀疏自编码器CoSAE,证明对齐是多目标问题。

详情
AI中文摘要

跨模型和模态的学习表征常常展现出惊人的结构相似性,暗示着共享的潜在概念分解。然而,概念对齐的定义仍不明确:现有方法在相同术语下优化不同目标,模糊了实际对齐的内容。我们提出了一个统一框架,沿两个轴分解对齐:对齐什么(表征vs.概念)以及什么级别(实例级vs.分布级)。这产生了四个相应的性质(翻译和概念一致性的实例级和分布级变体),并精确揭示了现有方法提供了这些保证中的哪些。我们进一步引入了InterVenchA,一个基于干预的基准,分别衡量提取质量、翻译质量和概念一致性。通过理论和实验,我们表明对齐目标之间通常假设的等价性在实践中不成立:优化一个性质并不能可靠地恢复其他性质,纯无监督目标无法恢复有意义的实例级对齐。然后我们提出了耦合稀疏自编码器(CoSAE),它联合强制互补的对齐目标。强对齐仅在这种机制下出现。令人惊讶的是,当锚定分布目标时,仅0.1%的配对数据就足以恢复实例级对齐。总体而言,我们的结果表明概念对齐本质上是多目标的:它必须被定义、衡量和优化为多目标。

英文摘要

Learned representations across models and modalities often exhibit striking structural similarities, suggesting shared underlying concept decompositions. However, concept alignment remains poorly defined: existing approaches optimize different objectives under the same terminology, obscuring what is actually aligned. We propose a unifying framework that decomposes alignment along two axes: what is aligned (representations vs. concepts) and at what level (instance-wise vs. distributional). This induces four corresponding properties (instance-wise and distributional variants of translation and concept consistency) and reveals precisely which of these guarantees existing methods provide. We further introduce InterVenchA, an intervention-based benchmark that separately measures extraction quality, translation quality, and concept consistency. Through theory and experiments, we show that commonly assumed equivalences between alignment objectives fail in practice: optimizing one property does not reliably recover the others, and purely unsupervised objectives fail to recover meaningful instance-level alignment. We then propose the Coupled Sparse Autoencoder (CoSAE), which jointly enforces complementary alignment objectives. Strong alignment emerges only in this regime. Surprisingly, as little as 0.1% paired data is sufficient to recover instance-level alignment when anchoring distributional objectives. Overall, our results show that concept alignment is fundamentally multi-objective: it must be defined, measured, and optimized as such.

2606.09646 2026-06-09 cs.CV cs.AI cs.LG 新提交

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

视频基础模型是否理解直觉物理?逐层探测分析

Samuele Punzo, Niccolò Caselli, Ippokratis Pantelidis, Francesco Massafra, Salvatore Lo Sardo, Mohammadreza Salehi

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 通过冻结特征探测,研究预训练视频基础模型在直觉物理信息上的编码能力,发现V-JEPA表现最佳,物理信息在中后期层最易获取,且时序破坏显著降低性能。

详情
AI中文摘要

我们研究预训练视频基础模型是否在其冻结表示中编码直觉物理信息,以及该信息如何随模型家族、层和探测类型变化。通过在IntPhys2和Minimal Video Pairs (MVP)上进行冻结特征探测,我们比较了预测联合嵌入模型(V-JEPA)、掩码重建模型(VideoMAE)和基于扩散的视频生成器(LTX-Video)。V-JEPA在基准测试中取得最强整体结果,尤其是在建模时序动态的探测器中,而VideoMAE仍具竞争力,LTX-Video恢复较弱但非平凡的信号。逐层分析表明,物理相关信息在早期层最弱,在中后期深度最易获取;时序控制表明,打乱帧顺序显著降低性能,尤其是在MVP上。综合来看,这些结果表明直觉物理知识在预训练视频表示中可靠地出现,但其可获取性强烈依赖于预训练范式、表示深度和读出机制。

英文摘要

We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.

2606.09645 2026-06-09 cs.RO cs.PL cs.SE 新提交

Modeling Components and Connections in Cyber-Physical Systems

信息物理系统中的组件与连接建模

Kate Sanborn, Tanuj Kenchannavar, Vakul Nath, Jonathan Sprinkle

发表机构 * Vanderbilt University(范德堡大学)

AI总结 提出基于WebGME的模型集成工具ROSLaunchVisual,通过图形界面可视化ROS启动文件中的节点、发布者、订阅者和参数,提升开发效率和系统理解。

详情
AI中文摘要

信息物理系统的基于文本的配置文件很好地展示了组件模块的层次结构,但往往隐藏了模块之间连接和接口的细节。对这些配置文件采用基于模型的视觉方法可以更好地捕获这些信息。机器人操作系统(ROS)启动文件的XML结构可以通过建模方法得到改进。本文介绍了ROSLaunchVisual,一个基于WebGME构建的模型集成环境,用于设计、可视化和管理ROS启动文件。该工具通过允许开发者使用图形界面创建和修改启动文件来提高抽象层次,该界面将节点、发布者、订阅者和参数表示为互连组件。该工具提供动态系统分析,可用于新启动文件和现有启动文件的静态开发和分析。ROSLaunchVisual集成了元模型驱动验证、启动文件的自动导入/导出以及可视化通信映射等功能。插件通过更新库、检查语义错误和管理重映射进一步增强功能。通过使启动文件创建更直观且不易出错,ROSLaunchVisual提高了开发效率和系统理解,特别是在协作或大规模机器人项目中。

英文摘要

Text-based configuration files for cyber-physical systems show the hierarchy of component modules well but often hide the details of connections and interfaces between modules. A model-based visual approach to these configuration files can better capture this information. The XML structure of Robot Operating System (ROS) launch files can be improved using a modeling approach. This paper presents ROSLaunchVisual, a model-integrated environment built on WebGME for designing, visualizing, and managing ROS launch files. The tool raises the level of abstraction by allowing developers to create and modify launch files using a graphical interface that represents nodes, publishers, subscribers, and arguments as interconnected components. The tool provides a dynamic system analysis that can then be used in the static development and analysis of new and existing launch files. ROSLaunchVisual incorporates features such as metamodel-driven validation, automatic import/export of launch files, and visual communication mapping. Plugins further enhance functionality by updating libraries, checking for semantic errors, and managing remaps. By making launch file creation more intuitive and less error-prone, ROSLaunchVisual improves development efficiency and system understanding, especially in collaborative or large-scale robotics projects.
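
To show what a textual launch file exposes and what it hides, the sketch below parses a small ROS 1 style XML launch file with Python's standard library. It is not ROSLaunchVisual's importer: the package and node names are made up, and only top-level `<arg>`, `<node>`, and nested `<remap>` elements are read (groups and includes are ignored).

```python
import xml.etree.ElementTree as ET

# Hypothetical launch file; package and node names are invented.
LAUNCH_XML = """
<launch>
  <arg name="rate" default="10"/>
  <node pkg="demo_pkg" type="talker" name="talker">
    <remap from="chatter" to="/robot/chatter"/>
  </node>
  <node pkg="demo_pkg" type="listener" name="listener"/>
</launch>
"""

def summarize_launch(xml_text):
    """Return (args, nodes) extracted from a launch file's top level."""
    root = ET.fromstring(xml_text)
    args = {a.get("name"): a.get("default") for a in root.findall("arg")}
    nodes = {}
    for node in root.findall("node"):
        # Remaps are the only connection information stored in the file.
        remaps = [(r.get("from"), r.get("to")) for r in node.findall("remap")]
        nodes[node.get("name")] = {
            "pkg": node.get("pkg"),
            "type": node.get("type"),
            "remaps": remaps,
        }
    return args, nodes

args, nodes = summarize_launch(LAUNCH_XML)
```

The file lists nodes and remaps but never states which topics a node publishes or subscribes to, which is the kind of hidden connection information the abstract says a visual model can surface.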

2606.09644 2026-06-09 cs.CL cs.CV 新提交

Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

答案从何而来?面向自动驾驶的多视角MLLMs中视角级视觉证据识别基准

Yimu Wang, Yee Man Choi, Barry Zhang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 针对多视角自动驾驶场景,提出一个基准测试,评估多模态大模型在视觉问答中识别支持性相机视角的能力,包含122个冲突中心问题对,并区分视角选择与答案正确性。

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉推理基准测试中取得了强劲结果,但仅凭答案准确性并不能表明模型是否依赖了正确的视觉证据。这一差距在用于自动驾驶的多视角驾驶场景中尤为重要,因为模型可能产生看似合理的答案,却将其归因于错误的相机视角。我们引入了一个多视角视觉问答基准,用于评估证据来源识别:给定六个同步的NuScenes视角和一个问题,模型必须识别支持性的相机视角并回答问题。该基准包含来自73个场景的122个冲突中心问答对,涵盖因果关系、反事实推理和意图预测。视角标签由自动冲突挖掘流程提出,并由标注者手动验证。我们评估了三种设置:相机视角选择、给定黄金视角的Oracle问答,以及模型在一次前向中同时选择视角并回答的联合预测。答案以多项选择和自由形式两种格式进行评估,使用精确匹配处理结构化预测,并使用LLM评判器处理自由形式回答。通过明确分离视觉来源识别与答案正确性,该基准揭示了仅凭答案评估无法发现的接地失败案例。

英文摘要

Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-view driving scenes used for autonomous driving, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized NuScenes views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.
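
Separating view selection from answer correctness amounts to scoring two flags per question instead of one. The sketch below is an illustration only: the function name `grounding_report`, the field names, and the toy records are hypothetical and are not the benchmark's actual schema or results.

```python
def grounding_report(records):
    """Summarise view-selection and answer correctness over a list of records.

    Each record is a dict with boolean fields 'view_correct' and
    'answer_correct' (hypothetical field names).
    """
    n = len(records)
    if n == 0:
        raise ValueError("records must not be empty")
    view = sum(r["view_correct"] for r in records)
    answer = sum(r["answer_correct"] for r in records)
    both = sum(r["view_correct"] and r["answer_correct"] for r in records)
    return {
        "view_acc": view / n,
        "answer_acc": answer / n,
        "joint_acc": both / n,
        # Correct answers attributed to the wrong camera view.
        "ungrounded_correct": (answer - both) / n,
    }

toy = [
    {"view_correct": True, "answer_correct": True},
    {"view_correct": False, "answer_correct": True},
    {"view_correct": True, "answer_correct": False},
    {"view_correct": False, "answer_correct": False},
]
report = grounding_report(toy)
```

Answer-only evaluation would report 50% accuracy on this toy set, while the joint score shows that only 25% of the questions were answered correctly from the right view.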

2606.09641 2026-06-09 cs.CV 新提交

MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding

MAVIS: 通过结构化视频理解实现多智能体视频检索

Jie Zhang, Qilang Ye, Hao Zhou, Haochen Liang, Fei Luo

发表机构 * School of Computing and Information Technology, Great Bay University(大湾区大学计算机与信息技术学院) College of Computer Science, Nankai University(南开大学计算机学院) Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Graduate School of Information Science and Technology, The University of Tokyo(东京大学信息科学与技术研究生院)

AI总结 提出多智能体框架MAVIS,通过结构化语义库解析视频,利用逻辑感知辩论机制协作推理,无需全库扫描和微调即可实现高效视频检索。

AI Chinese abstract

视频检索的主流范式依赖于基于嵌入的全库扫描,这种方法存在固有的计算低效以及信息密集视频与稀疏文本查询之间的语义不对称问题。为弥合这一差距,我们引入了MAVIS,一种新颖的多智能体框架,将检索重新构想为协作推理而非暴力搜索。MAVIS首先通过将原始视频解析为结构化语义库来弥合粒度不匹配,从而实现显式的属性级索引。在检索过程中,规划器将复杂的用户意图分解为原子子任务,分派专门的智能体独立提名候选。关键的是,MAVIS采用带有严格否决协议的逻辑感知辩论机制,智能体协作修剪逻辑不匹配,以识别紧凑的“有争议”候选集进行细粒度验证。这种智能体工作流有效避免了全库遍历的低效。在MSR-VTT、MSVD和ActivityNet上的大量实验表明,MAVIS在无需任务特定微调的情况下实现了有竞争力的性能,为传统的双编码器方法提供了可扩展且可解释的替代方案。

English abstract

The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce MAVIS, a novel multi-agent framework that rethinks retrieval as cooperative reasoning rather than brute-force search. MAVIS first bridges the granularity mismatch by parsing raw videos into a Structured Semantic Library, enabling explicit attribute-level indexing. During retrieval, a planner decomposes complex user intents into atomic sub-tasks, dispatching specialized agents to independently nominate candidates. Crucially, MAVIS employs a Logic-aware Debate mechanism with a strict veto protocol, where agents collaboratively prune logical mismatches to identify a compact set of "controversial" candidates for fine-grained verification. This agentic workflow effectively bypasses the inefficiency of full-library traversal. Extensive experiments on MSR-VTT, MSVD, and ActivityNet demonstrate that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.

2606.09640 2026-06-09 cs.RO New submission

Physics-Aware Sparse Learning and Selective Online Adaptation for Euler-Lagrange Robot Dynamics

面向欧拉-拉格朗日机器人动力学的物理感知稀疏学习与选择性在线自适应

Rishabh Dev Yadav, Samaksh Ujjawal, Sihao Sun, Spandan Roy, Wei Pan

Affiliations: The University of Manchester; International Institute of Information Technology Hyderabad; Delft University of Technology; Newcastle University

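AI summary: Proposes a structure-preserving residual learning framework that decomposes model error into an inertia correction, a Coriolis term, and a generalized-force residual. The mechanical part is learned under physical constraints, while the disturbance-sensitive part is adapted online with a sparse history-dependent latent-variable model and Bayesian linear regression, improving dynamics prediction and trajectory tracking across multiple robot platforms.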

AI Chinese abstract

精确的动力学模型对于基于模型的机器人控制至关重要,然而名义上的欧拉-拉格朗日模型在存在负载变化、未建模耦合、摩擦、空气动力学效应和变化操作条件时往往变得不准确。大多数基于学习的校正方法通过引入单个加性残差来提高预测精度,但未能保留欧拉-拉格朗日系统的内部机械结构。这导致模型不保留对称性、正定性或惯性与速度相关项之间的耦合,当嵌入基于模型的控制器时,可能导致物理上不一致的预测和降低的可靠性。我们提出了一种保结构残差学习框架,将模型不匹配分解为惯性修正、相应的诱导科里奥利项和广义力残差。机械部分在物理约束下学习,而扰动敏感部分通过稀疏历史依赖潜变量交互模型表示,并使用贝叶斯线性回归在线自适应。这种分离保留了关键的机械结构,同时将自适应限制在最受变化条件影响的动力学部分。在多个机器人平台(包括移动机器人、空中机器人和机械臂系统)上的实验表明,所提出的方法在耦合和时变动力学下改善了动力学预测和轨迹跟踪。这些结果凸显了将结构化残差建模、紧凑潜变量交互选择和选择性在线自适应相结合对于实际基于模型控制的价值。

English abstract

Accurate dynamics models are essential for model-based robotic control, yet nominal Euler-Lagrange models often become inaccurate in the presence of payload variation, unmodeled coupling, friction, aerodynamic effects, and changing operating conditions. Most learning-based correction methods improve prediction accuracy by introducing a single additive residual, but do not preserve the internal mechanical structure of Euler-Lagrange systems. This leads to models that do not preserve symmetry, positive-definiteness, or the coupling between inertia and velocity-dependent terms, which can result in physically inconsistent predictions and reduced reliability when embedded in model-based controllers. We propose a structure-preserving residual learning framework that decomposes model mismatch into an inertia correction, the corresponding induced Coriolis term, and a generalized-force residual. The mechanical component is learned under physical constraints, while the disturbance-sensitive component is represented through a sparse history-dependent latent interaction model and adapted online using Bayesian linear regression. This separation preserves key mechanical structure while restricting adaptation to the part of the dynamics most affected by changing conditions. Experiments across multiple robotic platforms, including mobile, aerial, and manipulator systems, show that the proposed method improves dynamics prediction and trajectory tracking under coupled and time-varying dynamics. These results highlight the value of combining structured residual modeling, compact latent interaction selection, and selective online adaptation for real-world model-based control.
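
The online adaptation step mentioned in the abstract uses Bayesian linear regression. The sketch below shows the generic recursive form of that technique for a scalar residual and is not the authors' implementation: the class name `OnlineBLR`, the Gaussian prior, the fixed known noise variance, and the feature vector `phi` are assumptions made for illustration. The paper's sparse history-dependent latent features and its structured inertia and Coriolis terms are not modeled here.

```python
class OnlineBLR:
    """Recursive Bayesian linear regression for y = phi . w + noise.

    Hypothetical helper: assumes a zero-mean isotropic Gaussian prior
    on w and a known, fixed observation noise variance.
    """

    def __init__(self, dim, prior_var=10.0, noise_var=0.01):
        self.mean = [0.0] * dim
        self.cov = [[prior_var if i == j else 0.0 for j in range(dim)]
                    for i in range(dim)]
        self.noise_var = noise_var

    def predict(self, phi):
        # Posterior-mean prediction of the residual.
        return sum(m * p for m, p in zip(self.mean, phi))

    def update(self, phi, y):
        n = len(phi)
        # s_phi = cov @ phi
        s_phi = [sum(self.cov[i][j] * phi[j] for j in range(n))
                 for i in range(n)]
        # Predictive variance of the new observation.
        denom = self.noise_var + sum(p * s for p, s in zip(phi, s_phi))
        gain = [s / denom for s in s_phi]
        err = y - self.predict(phi)
        # Rank-one (Sherman-Morrison) posterior update.
        self.mean = [m + g * err for m, g in zip(self.mean, gain)]
        self.cov = [[self.cov[i][j] - gain[i] * s_phi[j] for j in range(n)]
                    for i in range(n)]
```

Each update costs O(d^2) for d features and needs no matrix inversion, which is why this recursive form suits a control loop. Restricting it to a small set of selected features, as the abstract describes, keeps d low.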