arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.27575 2026-06-02 cs.AI

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

Nikita Benkovich, Vitalii Valkov

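Affiliations: Agyn, Inc.; Mila AI e-Lab

AI summary: Proposes Agyn, an open-source platform that addresses isolation, governance, and scaling of AI agents in production through a signal-driven, stateful serverless runtime, Terraform-based agent definitions, and a zero-trust security model.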

Abstract

As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, and often operate with privileged access to internal services, the engineering challenge shifts from building individual agents to operating them at scale with proper isolation, governance, and security. In this paper we present Agyn, an open-source platform designed around three key principles tailored for agent workloads: a signal-driven, stateful serverless runtime on Kubernetes; a Terraform provider for agent and harness definition; and a security model grounded in zero-trust and least-privilege principles. Agyn is agent-agnostic, model-agnostic, and cloud-agnostic.

2605.27569 2026-06-02 cs.AI

RULER: Representation-Level Verification of Machine Unlearning

Georgina Cosma, Axel Finke

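Affiliations: Department of Computer Science, Loughborough University, UK; School of Mathematics, Statistics and Physics, Newcastle University, UK

AI summary: Proposes RULER, a set of representation-level verification metrics (the oracle-based M2 and the oracle-free M4) that detect residual information about forgotten records in a model's intermediate representations after machine unlearning. Methods that pass output-level verification still show significant residuals at the representation level.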

Abstract

Machine unlearning aims to remove the influence of specific training records from a deployed model without retraining from scratch. Current protocols verify this at the output level through membership inference, retain accuracy, and forget-set accuracy, but a model can satisfy all three whilst still encoding forgotten records in its intermediate representations. We introduce RULER, a set of representation-level verification metrics. The oracle-comparative metric M2 measures whether forget-set records occupy the same representational position as in a model retrained without them. The oracle-free metric M4 detects residuals from the unlearned model's internal similarity structure alone, without retraining. Four approximate unlearning methods all pass output-level evaluation, yet under a linear mixed-effects model M2 detects significant residuals in 10 of 12 conditions (p<0.05), with effect sizes growing as the forget fraction increases. A fifth method, Bad Teacher, shows the same residuals despite a different forgetting mechanism. M4 acts as a pre-unlearning diagnostic across tabular, image, clinical text, and face-identity settings: it detects identity-level memorisation in face recognition models where no tested method fully erases the signal.

2605.27458 2026-06-02 cs.CV cs.AI cs.CL cs.LG

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

Yongjin Cui, Xiaohui Fan, Huajun Chen

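Affiliations: Zhejiang University

AI summary: Targets the multi-source information fusion challenge posed by heterogenous attention structures (such as co-attention) in Transformers. Proposes a generic interpretation method and, through an experimental analysis paradigm, gives semantic and logical interpretations of representative models.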

Abstract

The Transformer has significantly advanced artificial intelligence, including the development of agents. We categorize the attention structures of Transformers into two types based on the source of the input information: homogenous and heterogenous. Heterogenous attention structures, with co-attention as a typical example, process information from different sources, and they are the foundation that lets Transformer models achieve more complex functions and integrate more modalities. Whether for research purposes or policy requirements, interpreting Transformer models with heterogenous attention structures is an important task, and the fusion of information from different sources brings new challenges. Our work has two parts: method and experimentation. On the method side, we propose an interpretation method for Transformer models with heterogenous attention structures. On the experimental side, we use our experimental analysis paradigm to interpret the operating mechanisms of representative models and to conduct semantic and logical interpretation.

2605.27180 2026-06-02 cs.RO

Towards Drone-based Mapping of Volcanic Gases using Gas Tomography

Marius Schaab, Niklas Karbach, Antonia Rabe, Thomas Wiedemann, Patrick Hinsen, Dmitriy Shutin, Thorsten Hoffmann, Achim J. Lilienthal

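Affiliations: German Research Foundation (DFG); Istituto Nazionale di Geofisica e Vulcanologia (INGV)

AI summary: To address interference from drone rotor downwash, proposes a model-based gas tomography method built on a Lagrangian model, enabling accurate mapping of volcanic gas emissions.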

Abstract

Volcanoes emit large amounts of CO2, directly influencing human lives. Mapping volcanic gas emissions helps to forecast eruptions and understand the impact of volcanoes on climate and the environment. Drone-based gas sensing significantly reduces risks in volcanic monitoring but faces technical limitations when measuring gas, as rotor downwash disperses the gas plume before detection. Gas Tomography using remote gas sensing addresses this challenge. At the Salinelle dei Cappuccini mud volcanoes, we demonstrate that while drone-mounted in-situ sensors failed to detect CO2 emissions due to aerodynamic disturbance, open-path sensing successfully enabled remote gas distribution mapping. We present a novel model-based gas tomographic reconstruction approach that incorporates a Lagrangian model to compensate for wind-induced advection. The resulting gas distribution maps align with manually collected in-situ measurements, confirming that model-based gas tomography effectively overcomes downwash limitations and enables accurate mapping of volcanic emissions.
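As a rough illustration of the measurement side of gas tomography, the sketch below approximates what a single open-path reading represents: the concentration integrated along the beam. The gridded concentration map, the cell size, and the function name `path_integral` are assumptions made for this example. The sketch does not cover the reconstruction step or the Lagrangian wind compensation described in the abstract.

```python
def path_integral(grid, start, end, cell=1.0, samples=200):
    """Approximate the line integral of concentration along start -> end (midpoint rule)."""
    (x0, y0), (x1, y1) = start, end
    length = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    step = length / samples
    total = 0.0
    for k in range(samples):
        t = (k + 0.5) / samples
        x, y = x0 + t * (x1 - x0), y0 + t * (y1 - y0)
        # Map the sample point to a grid cell (row = y, column = x).
        i, j = int(y // cell), int(x // cell)
        if 0 <= i < len(grid) and 0 <= j < len(grid[0]):
            total += grid[i][j] * step
    return total


# Hypothetical 3x3 map with a single emitting cell in the centre.
concentration = [
    [0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0],
]
reading = path_integral(concentration, (0.0, 1.5), (3.0, 1.5))
```

A tomographic reconstruction would combine many such readings along different beams to estimate the grid values; that inverse step is left out here.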

2605.27095 2026-06-02 cs.LG

Adversarial Dual On-Policy Distillation from Expressive Teacher

Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor W. Tsang, Yang You

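Affiliations: National University of Singapore; University of Technology Sydney; University of Science and Technology of China

AI summary: Proposes FA-OPD, an adversarial dual on-policy distillation method that pairs a flow-matching teacher with a lightweight MLP student and uses two signal channels (reward and action). It outperforms strong baselines on robot navigation, manipulation, and locomotion tasks and is robust to noisy or limited demonstrations.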

Comments
arXiv admin note: substantial text overlap with arXiv:2510.09222

Abstract

Learning from demonstrations in embodied control is often cast as behavioral cloning, and recent diffusion or flow-matching policies improve this paradigm by modeling multi-modal expert actions. Yet these methods remain offline supervised learners: the policy is trained only on expert states and receives no corrective signal on the states it actually visits. On-policy distillation (OPD) offers a natural remedy, but standard OPD assumes a strong fixed teacher, which is unavailable in demonstration-only control. We propose FA-OPD, an adversarial dual on-policy distillation method in which a Flow Matching (FM) teacher is learned from demonstrations and co-trained with a lightweight MLP student. The teacher provides two complementary signals on student rollouts. The reward channel learns an expert-likeness objective over state-action pairs and drives online exploration through long-horizon policy optimization. The action channel supplies dense local targets at student-visited states, stabilizing exploitation. FA-OPD couples them so that reward distillation enables generalization beyond point-wise demonstrations, while action distillation keeps exploration anchored near expert-like behavior. Across six robot navigation, manipulation, and locomotion benchmarks, FA-OPD beats strong baselines and shows much stronger robustness under noisy or limited demonstrations. Source code: https://github.com/vanzll/FA-OPD.

2605.27044 2026-06-02 cs.AI

BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting

Ruifeng Tan, Jintao Dong, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang

Affiliations: Sustainable Energy and Environment Thrust, The Hong Kong University of Science and Technology (Guangzhou); School of Computer Science and Engineering, Central South University; Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou); Material Genome Institute, Shanghai University

AI summary: Proposes BatteryMFormer, a multi-level Transformer that predicts a battery's full-life state-of-health trajectory from early data using an aging-condition-aware decoder, a meta degradation pattern memory, and a dual-view encoder. It outperforms existing methods on four battery domains.

Comments
Accepted by KDD 2026

Abstract

Early battery degradation trajectory forecasting (BDTF), which predicts the full-life state-of-health trajectory from early operational data, is critical for battery optimization, manufacturing, and deployment. Battery degradation data exhibit two key characteristics. First, degradation data present a multi-level structure, including regularities shared within aging conditions and trajectory patterns shared across batteries. Second, degradation-related variations in voltage-current profiles are often localized to specific state of charge (SOC) intervals. Existing approaches often fail to explicitly model these characteristics. To bridge this gap, we propose BatteryMFormer, a multi-level Transformer for early BDTF. BatteryMFormer integrates (1) an aging-condition-aware decoder that injects aging-condition priors via aging-condition-informed queries and aging-condition-aware attention, (2) a meta degradation pattern memory that learns and retrieves trajectory prototypes to guide long-horizon forecasting, and (3) a dual-view encoder that jointly captures temporal dynamics and SOC-localized variations from voltage and current time series. Extensive experiments on four battery domains show that BatteryMFormer consistently outperforms state-of-the-art baselines, marking a significant step toward reliable BDTF. Our code is available at https://github.com/Ruifeng-Tan/BatteryMFormer.

2605.27000 2026-06-02 cs.CL cs.AI

Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning

Yilong Li, Suman Banerjee, Tong Che

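Affiliations: University of Wisconsin–Madison; NVIDIA Research

AI summary: Proposes Coordinated Pass@K Policy Optimization (CPPO), in which a planner generates multiple strategies and coordinates the solver's attempts, addressing the redundant reasoning paths that repeated sampling produces in code generation. It significantly improves pass@4 on several benchmarks.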

Comments
Code reasoning; pass@K optimization; coordinated planning; verifiable rewards; strategy diversity

Abstract

Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical metric. Yet the standard policy class draws $K$ independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste the budget on redundant rollouts. This failure is costly in competitive programming, where many problems admit multiple distinct algorithmic strategies and pass@$K$ requires only one correct attempt. We propose Coordinated Pass@$K$ Policy Optimization (CPPO), which turns pass@$K$ generation into joint exploration over strategies: a planner emits a tuple of $K{=}4$ alternative high-level methods, and a shared solver attempts one solution per method. CPPO trains this joint policy with a multiplicative planner reward, $R_{\mathrm{plan}} = J_\psi \cdot R_{\mathrm{out}}$, assigning credit only to valid strategy tuples that lead to verifier-confirmed pass@$K$ success. Across APPS, CodeContests, and LiveCodeBench-v6, CPPO improves pass@$4$ over direct sampling, planning baselines, planner-only SFT, and pass@$K$-oriented RL under the same $K{=}4$ solver-attempt budget, with statistically significant gains on six of nine model--benchmark cells. The largest single gain is $+0.16$ on Qwen3.5-9B LiveCodeBench-v6 over the strongest baseline, PKPO ($0.588 \rightarrow 0.748$; paired bootstrap, $p < 0.05$).
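To make the two quantities in this abstract concrete, the sketch below shows the widely used unbiased pass@k estimator and a toy version of a multiplicative planner reward. The abstract does not specify how the two reward factors are computed, so treating both as binary, along with the names `planner_reward` and `tuple_is_valid`, is an assumption made only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def planner_reward(tuple_is_valid: bool, attempts_passed: list[bool]) -> float:
    """Toy multiplicative reward: a validity gate times the verifier outcome."""
    j = 1.0 if tuple_is_valid else 0.0             # stands in for J_psi (assumed binary)
    r_out = 1.0 if any(attempts_passed) else 0.0   # pass@K success: at least one attempt passes
    return j * r_out
```

Because the two factors are multiplied, a strategy tuple earns credit only when it is judged valid and at least one of its attempts passes the verifier; either factor alone being zero removes the reward.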

2605.26919 2026-06-02 cs.LG stat.ML

Agile Online Model Selection: Resolving Adaptation Lag via Safeguarded Large Learning Rates

Kei Takemura, Ryuta Matsuno, Keita Sakuma

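Affiliations: NEC Corporation

AI summary: Proposes an optimistic online mirror descent algorithm that uses safeguarded large learning rates (up to Θ(T)) together with a post-hoc penalty mechanism, achieving fast adaptation in non-stationary environments while keeping near-optimal regret bounds.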

Comments
Accepted to KDD 2026

Abstract

Maintaining predictive accuracy in non-stationary environments requires online model selection to adapt autonomously to unknown distribution shifts. However, existing tuning-free algorithms face a fundamental trade-off between robustness and agility. Specifically, to ensure dynamic regret bounds, they must restrict learning rates to small constants (e.g., $O(1)$). This restriction inevitably causes significant adaptation lag during abrupt changes. To resolve this, we propose a novel optimistic online mirror descent that utilizes safeguarded large learning rates up to $\Theta(T)$, where $T$ is the number of rounds. Our key technical contribution is a post-hoc penalty mechanism that dynamically monitors unstable updates and excludes learning rates incurring excessive regret, eliminating the need for restrictive a priori constraints. We show that the cumulative penalty remains $O(\log T)$, allowing our algorithm to match near-optimal worst-case guarantees while achieving superior rates in benign cases. Empirical evaluations on three synthetic and eleven diverse real-world datasets demonstrate that our approach reduces the adaptation lag from hundreds of rounds to a few rounds, consistently outperforming tuning-free baselines.
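The link between learning rate and adaptation lag can be seen in a toy experiment. The sketch below is not the paper's algorithm: it uses plain exponential weights (mirror descent with an entropy regularizer) plus fixed-share mixing, with no optimism, safeguard, or penalty mechanism. The two-model setup, the shift point, the `share` value, and the function name `adaptation_lag` are all assumptions chosen for illustration.

```python
import math


def adaptation_lag(eta, shift=200, horizon=600, share=0.01):
    """Rounds after an abrupt shift until the newly best model holds most of the weight."""
    w = [0.5, 0.5]
    for t in range(horizon):
        # Model 0 is perfect before the shift, model 1 after it.
        losses = [0.0, 1.0] if t < shift else [1.0, 0.0]
        if t >= shift and w[1] > 0.5:
            return t - shift
        # Exponential-weights step with learning rate eta.
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
        total = sum(w)
        # Fixed-share mixing keeps every weight above share / len(w).
        w = [(1 - share) * wi / total + share / len(w) for wi in w]
    return None


lag_small = adaptation_lag(eta=0.05)
lag_large = adaptation_lag(eta=5.0)
```

In this toy setting the larger rate recovers within a few rounds after the shift, while the smaller rate needs considerably more. What the sketch does not show is the cost of a large rate under noisy losses, which is the instability the abstract's safeguard is meant to control.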

2605.25246 2026-06-02 cs.AI

FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

Minwei Kong, Chonghe Jiang, Ao Qu, Wenbin Ouyang, Zhaoming Zeng, Xiaotong Guo, Zhekai Li, Junyi Li, Yi Fan, Xinshou Zheng, Xi Jing, Yikai Zhang, Zhiwei Liang, Seonghoo Kim, Runqing Yang, Zijian Zhou, Sirui Li, Han Zheng, Wangyang Ying, Ou Zheng, Chonghuan Wang, Jinglong Zhao, Hanzhang Qin, Cathy Wu, Paul Pu Liang, Jinhua Zhao, Hai Wang

Affiliations: Singapore-MIT Alliance for Research and Technology; Massachusetts Institute of Technology; Northeastern University; Uber; Shanghai Jiaotong University; Boston University; Emory University; Northwestern University; National University of Singapore; Microsoft; University of Texas at Dallas; Singapore Management University

AI summary: Proposes the FrontierOR benchmark, which systematically evaluates whether LLMs can design scalable algorithms (rather than merely generate solver code) for realistic large-scale optimization problems. The strongest model outperforms Gurobi in only 31% of cases.

Abstract

Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models both in one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi in only 31% of cases in both solution quality and computational efficiency, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, which enables future LLMs and agents to be systematically tested on whether they can move beyond correct formulation toward a feasible, high-quality, and efficient algorithm. Code and data are publicly released at https://github.com/Minw913/FrontierOR.

2605.26684 2026-06-02 cs.LG cs.AI

Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

Xin Cheng, Shuo He, Lang Feng, HaiYang Xu, Ming Yan, Lei Feng, Bo An

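Affiliations: arXiv.org; University of Science and Technology of China

AI summary: Proposes GraphGPO, which builds a state-transition graph and uses its global information to estimate each state's distance to the task goal, enabling step-level credit assignment and improving training efficiency and performance.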

Comments
Accepted by ICML 2026

Abstract

Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLMs) and have been rapidly extended to agentic tasks. However, their credit assignment relies heavily on coarse-grained trajectory-level attribution according to final outcomes, making it difficult to capture the contribution of individual steps, such as valuable steps obscured within failed trajectories. To uncover latent information and enable more faithful step-level credit assignment, we propose Graph-based Group Policy Optimization (GraphGPO), which first aggregates all rollout trajectories into a unified state-transition graph and then estimates the distance from each state to the task goal using the global information encoded in the graph. Finally, GraphGPO assigns credit to each edge by estimating a graph-based advantage, based on how much the transition reduces the distance to the task goal. In this way, GraphGPO significantly improves training efficiency and achieves state-of-the-art performance across a range of challenging benchmarks.
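The three steps described in this abstract (merge rollouts into a graph, estimate distance to the goal, credit each edge by the distance it removes) can be sketched as follows. This is a minimal illustration under stated assumptions: states are hashable labels, the goal state is known, and distance is the shortest hop count found by breadth-first search. The paper's actual distance estimate and advantage formula are not given in the abstract, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(trajectories):
    """Merge rollouts into one directed graph: state -> set of successor states."""
    succ = defaultdict(set)
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            succ[s].add(s_next)
    return succ


def distance_to_goal(succ, goal):
    """Breadth-first search on reversed edges: hops from each state to the goal."""
    pred = defaultdict(set)
    for s, nexts in succ.items():
        for n in nexts:
            pred[n].add(s)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        cur = queue.popleft()
        for p in pred[cur]:
            if p not in dist:
                dist[p] = dist[cur] + 1
                queue.append(p)
    return dist


def edge_credit(dist, s, s_next):
    """Credit for an edge = reduction in distance to the goal (None if the goal is unreachable)."""
    if s not in dist or s_next not in dist:
        return None
    return dist[s] - dist[s_next]


# One successful rollout and one failed rollout that share state "B".
rollouts = [["A", "B", "G"], ["A", "C", "B", "D"]]
dist = distance_to_goal(build_graph(rollouts), goal="G")
```

In this example the step from "C" to "B" inside the failed rollout still receives positive credit, because the merged graph shows that "B" is one hop from the goal. That is the kind of step a trajectory-level reward would miss.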

2605.26660 2026-06-02 cs.LG

WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization

Phong Nam Huu Nguyen, Khoi M. Le, Cong-Duy T Nguyen, Anh Tuan Luu, Thong Thanh Nguyen, Tho Quan

Affiliations: CAIR, VinUniversity, Vietnam; Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, Ho Chi Minh City, Vietnam; CCDS, Nanyang Technological University, Singapore; School of Computing, National University of Singapore, Singapore

AI summary: Proposes WINDQuant, a reinforcement-learning-based controller that assigns bit-widths and quantization treatments to column chunks under a global storage budget, enabling fine-grained mixed precision for ultra-low-bit LLM quantization with competitive performance.

Abstract

Quantization is an effective approach to reduce the memory footprint and inference cost of large language models (LLMs), yet maintaining performance in the ultra-low-bit regime remains challenging. Existing post-training methods often suffer from severe accuracy degradation, while quantization-aware training requires costly retraining and additional resources. Moreover, most mixed-precision strategies rely on coarse-grained or heuristic sensitivity analysis that overlooks fine-grained variations within weight matrices. We propose WINDQuant, a reinforcement-learning-based allocation controller for ultra-low-bit LLM quantization. Rather than introducing another low-level quantization operator, WINDQuant learns how to assign bit-widths and quantization treatments to fine-grained column chunks under a global storage budget. By operating at the column-chunk level, WINDQuant enables flexible and fine-grained precision assignment within layers under a global target bit-width. The implementation combines PPO with activation-aware calibration, lightweight per-unit quantizer fitting, and explicit effective-bit accounting of the learned mixed-precision plan. Experiments on LLaMA models demonstrate that WINDQuant achieves competitive performance in ultra-low-bit settings while reducing optimization overhead relative to retraining-based approaches, highlighting reinforcement learning as a practical controller for adaptive mixed-precision quantization.
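The idea of a global storage budget over column chunks can be illustrated with simple effective-bit accounting. The sketch below only averages the assigned bit-widths weighted by chunk size; it ignores the storage overhead of scales, zero points, and chunk metadata, which a real accounting would have to include. The function names and the example plan are assumptions for illustration, not the paper's implementation.

```python
def effective_bits(chunk_sizes, chunk_bits):
    """Average bits per weight for a mixed-precision plan (payload bits only)."""
    total_weights = sum(chunk_sizes)
    total_bits = sum(n * b for n, b in zip(chunk_sizes, chunk_bits))
    return total_bits / total_weights


def within_budget(chunk_sizes, chunk_bits, target_bits):
    """True if the plan's average bit-width does not exceed the global target."""
    return effective_bits(chunk_sizes, chunk_bits) <= target_bits


# Hypothetical plan: three column chunks with 4-, 2-, and 1-bit assignments.
sizes = [128, 128, 256]
plan = [4, 2, 1]
avg = effective_bits(sizes, plan)
```

A controller that searches over such plans can spend more bits on sensitive chunks as long as other chunks are pushed low enough to keep the average at or below the target.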

2605.26632 2026-06-02 cs.LG

RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

Xing Cong, Hanlin Tang, Kan Liu, Lan Tao, Lin Qu, Chenhao Xie

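Affiliations: Alibaba Group; Independent Researcher

AI summary: To reduce the high inference cost of diffusion models, proposes moving N:M semi-structured sparsity from weights to activations, combined with error compensation, for up to a 1.55x average speedup in linear layers while preserving generation quality.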

Comments
33 pages, 18 figures, Accepted by ICML 2026

Abstract

Diffusion Transformers (DiT) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality. Our study, however, shows that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving up to a 1.55x speedup on average in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.
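For readers unfamiliar with N:M semi-structured sparsity, the sketch below applies a 2:4 pattern to one row of activations: in every group of four values, only the two with the largest magnitude survive. Magnitude-based selection is an assumption made for this example. The abstract does not state the selection rule, and the error-compensation step and the CUDA kernels are not represented here.

```python
def nm_sparsify(row, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m and zero the rest."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest |x| in this group (ties go to the earlier position).
        keep = set(sorted(range(m), key=lambda i: -abs(group[i]))[:n])
        out.extend(x if i in keep else 0.0 for i, x in enumerate(group))
    return out


activations = [0.1, -3.0, 2.0, 0.0, 1.0, 1.0, -1.0, 0.5]
sparse = nm_sparsify(activations)
```

Because exactly half of each group is zero, hardware with 2:4 sparse matrix support can skip those multiplications, which is where the "nearly halve FLOPs" figure in the abstract comes from.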

2605.26625 2026-06-02 cs.RO cs.SY eess.SY

Provably Safe Motion Planning Under Unknown Disturbances

Ibon Gracia, Qi Heng Ho, Luca Laurenti, Morteza Lahijanian

Affiliations: Department of Aerospace Engineering Sciences at the University of Colorado Boulder; Department of Aerospace and Ocean Engineering; Delft University of Technology; Italian Institute of Artificial Intelligence

AI summary: For robotic systems subject to random disturbances of unknown distribution, proposes a learning-based, sampling-based motion planning algorithm built on Wasserstein ambiguity tubes, yielding safe planning that is probabilistically complete and less conservative.

Abstract

We present a provably safe sampling-based motion planning algorithm for robotic systems affected by random disturbances of unknown distribution. We consider systems with linear or linearizable dynamics evolving in a workspace with arbitrarily shaped obstacles, subject to state and control constraints. Safety requirements are formulated as chance constraints. Our approach leverages data from trajectories of the system to learn a Wasserstein ambiguity tube, i.e., a sequence of ambiguity sets, which contains the trajectory of the system's state distribution with high confidence. This ambiguity tube is then used in a probabilistically complete algorithm to grow a sampling-based motion planning tree that respects the constraints of the problem. We show that learning several lower-dimensional ambiguity tubes instead of a single high-dimensional one effectively reduces the conservatism and boosts scalability. Additionally, we design an efficient bandit-based validity checker that markedly increases the empirical performance of our approach without sacrificing probabilistic completeness. Case studies show our algorithm finds valid plans in cluttered environments under strict safety thresholds, outperforming state-of-the-art methods.

2605.26444 2026-06-02 cs.CL

NanoSpec: Accelerating Speculative Decoding using Minimalist In-Context Vocabularies

Zhiyang Chen, Daliang Xu, Yinyuan Zhang, Chenghua Wang, Mengwei Xu, Yun Ma

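Affiliations: arXiv.org cs.CL

AI summary: Proposes NanoSpec, a training-free method that dynamically builds a minimalist, context-aware vocabulary (fewer than 3k tokens on average) and, with a system-algorithm co-design, cuts speculative-decoding draft time by 51.6% on average for a 1.17-1.29x end-to-end speedup.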

Abstract

The massive vocabulary sizes of large language models, often exceeding 100k tokens, impose a computational bottleneck on the final linear projection layer during speculative decoding. Existing vocabulary pruning solutions rely on static or coarse-grained sub-vocabularies that necessitate large active sizes ($\sim$30k) to maintain draft quality. We propose NanoSpec, a novel training-free approach that breaks this trade-off by dynamically constructing a minimalist, context-aware active vocabulary for each generation step. Leveraging the inherent temporal locality of language generation, NanoSpec achieves high coverage while slashing the average vocabulary size by over $40\times$ (to $<$3k tokens) without requiring any auxiliary trained parameters. To realize the theoretical benefits of such high sparsity on modern hardware, we introduce a system-algorithm co-design that overcomes the inefficiencies of sparse memory access through asynchronous gathering and GPU-resident state management. As a complementary plug-and-play module, NanoSpec cuts draft time by an average of 51.6%, delivering a $1.17$-$1.29\times$ end-to-end speedup over the state-of-the-art speculative decoding methods EAGLE-2 and EAGLE-3 across 7 tasks and outperforming complex training-based pruning baselines.
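The cost saving from a small active vocabulary comes from scoring only a subset of output-projection rows. The sketch below illustrates that idea under loud assumptions: the active set is simply the token ids seen in a recent context window plus an always-on set, which is one plausible reading of "temporal locality" and not the construction the paper uses. The names `active_vocab` and `restricted_argmax` are hypothetical.

```python
def active_vocab(context_ids, window=64, always_on=()):
    """Token ids seen in the last `window` positions, plus an always-on set."""
    recent = context_ids[-window:]
    return sorted(set(recent) | set(always_on))


def restricted_argmax(hidden, lm_head, vocab_ids):
    """Score only the rows of lm_head listed in vocab_ids and return the best token id."""
    best_id, best_score = None, float("-inf")
    for tid in vocab_ids:
        score = sum(h * w for h, w in zip(hidden, lm_head[tid]))
        if score > best_score:
            best_id, best_score = tid, score
    return best_id


# Toy output projection with four tokens and a 2-dimensional hidden state.
lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
hidden = [1.0, 2.0]
vocab = active_vocab([0, 1, 1, 3], window=3, always_on=(0,))
draft_token = restricted_argmax(hidden, lm_head, vocab)
```

In this toy case token 2 would win over the full vocabulary but is outside the active set, so the draft picks token 1 instead. In speculative decoding the target model still verifies every drafted token, so a coverage miss of this kind costs acceptance rate, not output correctness.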

2605.26431 2026-06-02 cs.CL stat.AP

Probing Minimalist Phase Structure in LLMs: What Universal Dependencies Cannot Represent

Yuanhao Chen, Peter Chin

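Affiliations: Dartmouth College

AI summary: Using conditions designed to hold UD distance invariant, evaluates structural probes on wh-movement stimuli and finds a phase-count gradient and a phase-internal cohesion effect, suggesting that distributional pretraining can induce representations of formal-syntactic abstractions beyond UD.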

Abstract

Structural probes train on Universal Dependencies (UD), which does not encode formal-syntactic abstractions such as phase boundaries or phase-internal cohesion. Whether large language models (LLMs) encode these remains an open question that UD-based probing cannot answer by construction. We evaluate structural probes on wh-movement stimuli where UD distances are invariant across conditions by design -- any non-zero effect therefore reflects structure beyond UD. The three conditions -- bare small clause, infinitival, and finite -- are ordered by the number of Minimalist Program (MP) phase boundaries the wh-element crosses. Across 13 LLMs from four families, we find a phase-count gradient on a cross-clause pair (12/13 models) and a 13/13 sign asymmetry on a within-clause pair whose UD distance is identical across conditions -- the latter specifically predicted by phase-internal cohesion, an MP abstraction invisible to UD by construction. Activation patching confirms the representations are causally active in 12/13 models. These findings suggest that distributional pretraining can induce representations aligned with formal-syntactic abstractions beyond the reach of annotation-based probing; UD-grounded probes provide a lower bound on syntactic encoding, not an upper bound.
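A structural probe in the commonly used distance formulation learns a linear map B so that the squared distance between projected token representations approximates syntactic tree distance. The sketch below computes that probe distance for one token pair. The abstract does not state which probe formulation the authors use, so this form and the toy values are assumptions for illustration.

```python
def probe_distance_sq(B, h_i, h_j):
    """Squared probe distance ||B (h_i - h_j)||^2 between two token representations."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    projected = [sum(row[k] * diff[k] for k in range(len(diff))) for row in B]
    return sum(p * p for p in projected)


# Toy 2x2 probe matrix and two 2-dimensional hidden states.
B = [[1.0, 0.0], [0.0, 2.0]]
d = probe_distance_sq(B, [1.0, 1.0], [0.0, 0.0])
```

If the UD distance for a token pair is held fixed across conditions, any systematic difference in this probe distance between conditions cannot be explained by UD structure, which is the logic of the design described in the abstract.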

2605.26397 2026-06-02 cs.CL cs.AI

Algorithmic Fragility and Persona Bias in LLM-Generated Autistic Communication

LLM生成的自闭症交流中的算法脆弱性与人格偏见

Naba Rizvi, Mohammed Rizvi, Harper Strickland, Saleha Ahmedi, Nedjma Ousidhoum

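Affiliations: University of California, San Diego; Georgia Institute of Technology; Cornell University; Cardiff University

AI summary: Using a dual-persona rewrite paradigm, the paper finds systematic failures when LLMs generate autistic-persona text, including lexical and affective divergence and output collapse; these failures track alignment strategy rather than parameter scale, indicating that current alignment training produces a deep representational gap.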

详情
Comments
main paper: 9 pages; total: 19 pages; 2 figures; 5 tables
AI中文摘要

安全对齐减少了明确有害的输出,但无意中编码了一种对边缘化交流的净化、神经典型化表征。我们使用双人格改写范式研究这种编码,提示十个大型语言模型(LLM)从自闭症或神经典型人格改写自然发生的自闭症话语。我们发现,尽管语义相似性相当,自闭症人格改写比神经典型改写在词汇形式和情感语域上偏离显著更多。此外,大多数模型将跨人格生成折叠成几乎相同的输出。为了揭示这种生成崩溃背后的机制,我们引入了一个多智能体定性分析框架。我们的结果揭示了系统性输出擦除、刻板幻觉和任务回避元评论是此任务的普遍失败模式,这些模式按对齐策略而非参数规模聚类。最后,我们与自闭症人类标注者的针对性比较表明,社区内部知识相对于LLM分类产生了系统性标签反转。我们的发现表明,当前的对齐训练导致仅通过定性分析可见的人格特定生成崩溃,证实了提示工程无法解决的深层表征鸿沟。

英文摘要

Safety alignment reduces explicitly harmful outputs but inadvertently encodes a sanitized, neuronormative representation of marginalized communication. We investigate this encoding using a dual-persona rewrite paradigm, prompting ten large language models (LLMs) to rewrite naturally occurring autistic discourse from either an autistic or a neurotypical persona. We find that autistic-persona rewrites diverge significantly more in lexical form and affective register than neurotypical rewrites, despite equivalent semantic similarity. Furthermore, most models collapse cross-persona generations into near-identical outputs. To uncover the mechanisms behind this generative breakdown, we introduce a multi-agent qualitative analysis framework. Our results reveal that systemic output erasure, stereotyped hallucination, and task-evasive meta-commentary are pervasive failure modes for this task and that they cluster by alignment strategy rather than parameter scale. Finally, our targeted comparison with autistic human annotators demonstrates that community-insider knowledge produces systematic label reversals relative to LLM classifications. Our findings indicate that current alignment training causes persona-specific generative breakdown visible only through qualitative analysis, confirming a deep representational gap that prompt engineering cannot resolve.

2605.26292 2026-06-02 cs.CV cs.CL

Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

Evi-Steer:通过高效且可泛化的证据调优学习引导生物医学视觉-语言模型

Taha Koleilat, Hassan Rivaz, Yiming Xiao

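Affiliations: Concordia University

AI summary: Proposes Evi-Steer, an evidential cross-modal low-dimensional steering framework that gives BiomedCLIP uncertainty-aware parameter-efficient fine-tuning while updating only 0.11% of parameters, and that outperforms existing methods on 15 biomedical datasets in few-shot and domain-generalization settings.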

详情
Comments
MICCAI 2026 Early Accept; Project Page: https://tahakoleilat.github.io/Evi-Steer. This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published as part of the MICCAI 2026 proceedings in October
AI中文摘要

视觉-语言基础模型的参数高效适配对于生物医学图像的精确多模态理解至关重要,但现有方法仍具有确定性,且在域偏移或模糊的图像-文本对齐下常常表现不佳。这一限制在临床中尤为关键,因为模型应在低数据 regime 和域偏移下保持鲁棒性。我们提出了Evi-Steer,一个用于BiomedCLIP的证据跨模态低维引导框架,能够在仅更新总模型参数0.11%的情况下实现不确定性感知的参数高效微调。我们的方法在视觉和文本编码器中执行轻量级低维令牌更新,同时估计认知不确定性。这些不确定性估计更新门控残差,使模型在证据较弱时能够保守地适应。此外,我们引入了基于Dempster-Shafer理论的跨模态置信度融合,使视觉适应能够以文本置信度为条件,并抑制冲突或不确定的跨模态更新。我们在涵盖8个器官和8种成像模态的15个生物医学成像数据集上,在少样本学习和域泛化设置下进行了全面评估。Evi-Steer在少样本学习和域偏移设置下始终优于最先进的方法,展示了在真实临床环境中部署视觉-语言模型的实用且鲁棒的途径。代码可在https://github.com/HealthX-Lab/Evi-Steer获取。

英文摘要

Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low-data regimes and under domain shift. We present Evi-Steer, an evidential cross-modal low-dimensional steering framework for BiomedCLIP that enables uncertainty-aware parameter-efficient fine-tuning while updating only 0.11% of total model parameters. Our approach performs lightweight low-dimensional token updates in both vision and text encoders while simultaneously estimating epistemic uncertainty. These uncertainty estimates gate the residual updates, allowing the model to adapt conservatively when evidence is weak. Furthermore, we introduce cross-modal confidence fusion based on Dempster-Shafer theory, enabling visual adaptation to be conditioned on textual confidence and suppressing conflicting or uncertain cross-modal updates. We conduct a comprehensive evaluation on 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities under few-shot learning and domain generalization settings. Evi-Steer consistently outperforms state-of-the-art methods under few-shot learning and domain shift settings, demonstrating a practical and robust pathway for deploying vision-language models in real-world clinical settings. Code is available at https://github.com/HealthX-Lab/Evi-Steer.

2605.26068 2026-06-02 cs.LG cs.AI

Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark

重新思考异常检测中的弱监督:一个综合基准

Xu Yao, Siyuan Zhou, Zhenbo Wu, Chaochuan Hou, Shuang Liang, Shiping Wang, Hailiang Huang, Songqiao Han, Minqi Jiang

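Affiliations: Shanghai University of Finance and Economics; Ant Group; Key Laboratory of Interdisciplinary Research of Computation and Economics

AI summary: Proposes WSADBench, the first benchmark to jointly evaluate the incomplete, inexact, and inaccurate weakly supervised anomaly detection scenarios. By systematically varying label quantity, granularity, and quality, it maps the performance boundaries of 36 algorithms across 4 modalities and finds, among other insights, that the weak-supervision scenarios are strongly correlated and that specialized WSAD algorithms lead only under extreme label scarcity.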

详情
Comments
Accepted at KDD 2026 Datasets and Benchmarks Track
AI中文摘要

弱监督异常检测(WSAD)已发展出三个主要方向:不完全监督、不精确监督和不准确监督。然而,这些方向仍然相互孤立,缺乏一个统一的框架来评估它们是否解决独特的挑战或共享基本机制。本文介绍了WSADBench,这是第一个统一评估不同弱监督场景的基准,对从专用WSAD方法到先进表格基础模型的多种方法进行基准测试。WSADBench通过系统变化标签数量、粒度和质量,建立了标准化协议来评估4种模态上的36种算法,揭示了各种方法的性能边界。基于超过70万次实验,WSADBench揭示了四个关键见解:(i)这些弱监督场景之间存在强内在相关性,挑战了当前研究方向的孤立性。(ii)专用WSAD算法仅在极端标签稀缺情况下表现出色,但随着监督增加或在OOD场景中,很快被表格基础模型和通用分类方法主导。(iii)未标记数据在不同设置中的效用不一致,与标签细化相比收益微乎其微。(iv)模型对不同类型的标签噪声表现出不对称敏感性。我们发布WSADBench作为开源基准,包含代码和数据集,以促进未来的WSAD研究:https://github.com/SUFE-AILAB/WSADBench。

英文摘要

Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. However, these directions remain isolated, lacking a unified framework to assess whether they address unique challenges or share fundamental mechanisms. This paper introduces WSADBench, the first benchmark that unifies evaluation across distinct weakly supervised scenarios, benchmarking diverse approaches from specialized WSAD methods to advanced tabular foundation models. WSADBench establishes standardized protocols to evaluate 36 algorithms across 4 modalities by systematically varying label quantity, granularity, and quality, revealing the performance boundaries of various methods. Based on over 700K experiments, WSADBench reveals four critical insights: (i) Strong intrinsic correlations exist between these weak supervision scenarios, challenging the isolation of current research directions. (ii) Specialized WSAD algorithms excel only in extreme label-scarcity regimes but are quickly dominated by tabular foundation models and general classification methods as supervision increases or in OOD scenarios. (iii) Unlabeled data shows inconsistent utility across settings, with marginal gains compared to label refinement. (iv) Models exhibit asymmetric sensitivity to different types of label noise. We release WSADBench as an open-source benchmark with code and datasets to facilitate future WSAD research: https://github.com/SUFE-AILAB/WSADBench.

2605.24634 2026-06-02 cs.CV

Resolving Ambiguity in Composed Image Retrieval via Calibrated Interaction

通过校准交互解决组合图像检索中的歧义

Amsisan Tran, Baogh Le, Tuan Kiet Pham, Sui Yang Guang

发表机构 * Amsisan Tran Baogh Le Tuan Kiet Pham Sui Yang Guang

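AI summary: Reframes composed image retrieval as calibrated intent resolution under uncertainty: a conformal prediction layer returns a candidate set with a coverage guarantee, and an expected-information-gain policy asks the most useful clarifying question, addressing query ambiguity and the false-negative problem.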

详情
AI中文摘要

组合图像检索(CIR)使用参考图像和描述如何修改它的文本搜索语料库。尽管从三元组训练的合成器到零样本和生成方法取得了快速进展,但所有系统本质上都共享一个假设:查询映射到单个目标,通过Recall@K针对一个标注进行评分。我们认为这与任务根本不一致。诸如“使其更正式”之类的查询并不命名一个图像,而是命名语料库的一个区域,用户意图中的哪个成员真正是不确定的。这种欠指定是众所周知的假阴性问题的根源,并使得当前模型无法区分精确查询和模糊查询。我们将CIR重新定义为不确定性下的校准意图解析:检索器被包裹在一个共形预测层中,该层返回一个具有覆盖保证的候选集,其大小是歧义的原则性度量;当集合很大时,期望信息增益策略从可解释的歧义轴中提出一个最有用的澄清问题,然后集合收缩。我们引入了AmbiCIR,一个基准和经过人工验证的用户模拟器,它复活了CIRR中休眠的辅助和对话标注,并扩展了CIRCO的多正例设置。在开放域和时尚基准上,我们的方法匹配了单轮最先进水平,确认了校准解析在精确查询上是无成本的,同时以朴素对话基线所需交互预算的一小部分达到预期目标,并且它是第一个为任务报告有效覆盖和校准的方法。

英文摘要

Composed image retrieval (CIR) searches a corpus with a reference image and a text describing how to modify it. Despite rapid progress from triplet-trained compositors to zero-shot and generative methods, essentially all systems share one assumption: that a query maps to a single target, scored by Recall@K against one annotation. We argue this is fundamentally at odds with the task. A query such as "make it more formal" does not name an image but a region of the corpus, and which member the user intends is genuinely underdetermined. This underspecification is the root of the well-known false-negative problem and leaves current models unable to tell a precise query from an ambiguous one. We reframe CIR as calibrated intent resolution under uncertainty: a retriever is wrapped in a conformal prediction layer that returns a candidate set with a coverage guarantee, whose size is a principled measure of ambiguity; when the set is large, an expected-information-gain policy asks the single most useful clarifying question, drawn from interpretable ambiguity axes, and the set contracts. We introduce AmbiCIR, a benchmark and human-validated user simulator that revive the dormant auxiliary and dialogue annotations of CIRR and extend the multiple-positive setting of CIRCO. Across open-domain and fashion benchmarks our method matches single-turn state of the art, confirming that calibrated resolution is cost-free on precise queries, while reaching the intended target in a fraction of the interaction budget required by naive conversational baselines, and it is the first to report valid coverage and calibration for the task.
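The conformal layer described in the abstract can be illustrated with standard split conformal prediction. The sketch below is a minimal version under stated assumptions: the nonconformity score is taken to be something like one minus the retriever's similarity to the annotated target, and the function names and toy scores are hypothetical rather than taken from the paper.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores.

    Returns the k-th smallest score with k = ceil((n + 1) * (1 - alpha)),
    or infinity when k exceeds n (too few calibration points for this alpha).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(cal_scores)[k - 1]


def candidate_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [name for name, s in candidate_scores.items() if s <= threshold]


# Toy calibration scores; lower means the candidate is closer to the query.
cal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
tau = conformal_threshold(cal, alpha=0.5)
kept = candidate_set({"img_a": 0.2, "img_b": 0.5, "img_c": 0.9}, tau)
```

A precise query tends to leave one or two candidates under the threshold, while an ambiguous one leaves many, which is why set size can serve as the ambiguity signal that triggers a clarifying question.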

2605.26102 2026-06-02 cs.CV

InstructSAM: Segment Any Instance with Any Instructions

InstructSAM: 根据任意指令分割任意实例

Yuqian Yuan, Wentong Li, Zhaocheng Li, Yutong Lin, Juncheng Li, Siliang Tang, Jun Xiao, Yueting Zhuang, Wenqiao Zhang

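Affiliations: Zhejiang University; Nanjing University of Aeronautics and Astronautics

AI summary: Proposes InstructSAM, which formulates instruction-driven instance segmentation as a set-structured query prediction problem and designs an explicit reasoning-to-instance query interface that couples a vision-language model with SAM3 for multi-instance segmentation in a single forward pass.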

详情
Comments
19 pages, 8 figures, code: https://github.com/DCDmllm/InstructSAM
AI中文摘要

在本文中,我们介绍了InstructSAM,一个统一且精简的框架,旨在任意指令下进行多实例分割。我们将指令驱动的实例分割形式化为一个集合结构查询预测问题,并提出了一个显式的推理到实例查询接口,优雅地桥接了视觉语言模型(VLM)和SAM3。具体来说,一组可学习的实例查询被注入到VLM中,并与指令和视觉信息进行上下文关联,使每个查询成为一个实例感知槽。混合注意力机制进一步促进了这些查询、视觉令牌和指令令牌之间的交互,改进了实例枚举并减少了重复预测。得到的LLM条件查询被投影到SAM3的检测器查询空间中,以在单次前向传播中驱动准确的多实例分割。这种设计赋予了SAM3高级指令理解、组合推理和实例级集合预测的能力,而无需修改其核心架构。为了支持训练和评估,我们进一步构建了Inst2Seg,一个高质量、大规模的基于指令的实例分割数据集和基准,将自由形式的指令与实例级掩码配对。大量实验表明,仅2B规模的InstructSAM在复杂的指令驱动和短语级指代分割基准上取得了强劲的结果,超越了之前的端到端方法和SAM3的代理流水线,同时实现了高效的单次多实例预测。

英文摘要

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality and large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.

2605.26089 2026-06-02 cs.CV cs.AI

Channel-wise Vector Quantization

通道级向量量化

Wei Song, Tianhang Wang, Yitong Chen, Tong Zhang, Zuxuan Wu, Min Li, Jiaqi Wang, Kaicheng Yu

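Affiliations: Shanghai Innovation Institute; Westlake University; Zhejiang University; Fudan University; JD.COM; University of Chinese Academy of Sciences

AI summary: Proposes Channel-wise Vector Quantization (CVQ) in place of patch-wise quantization and builds a Channel-wise Autoregressive (CAR) model on it; channel-by-channel prediction generates detail progressively and performs strongly in image reconstruction and text-to-image generation.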

详情
AI中文摘要

我们提出了通道级向量量化(CVQ),一种新颖的图像标记化范式,用通道级标记取代补丁级标记。与传统的向量量化(为每个补丁特征向量分配一个离散标记)不同,CVQ 对特征图的每个通道进行量化。这种表示将图像表示为视觉细节的离散层级,而不是空间补丁的网格。基于 CVQ,我们引入了一种新的视觉自回归框架,采用“下一通道预测”。我们的通道级自回归(CAR)模型不是按光栅顺序逐补丁渲染图像,而是顺序预测图像通道,逐步生成更丰富的视觉细节。具体来说,它首先勾勒全局结构,然后细化细粒度属性,类似于人类艺术家的创作流程。实验表明:(1)CVQ 在 16K+ 的码本大小下实现了 100% 的码本利用率,无需任何额外技巧,并且显著提高了传统 VQ 的重建质量;(2)CAR 在 DPG 评分中达到 86.7,在 GenEval 评分中达到 0.79,展示了其在文本到图像生成中的强大有效性。

英文摘要

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual details, rather than as a grid of spatial patches. Based on CVQ, we introduce a new visual autoregressive framework with "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively enriched visual details. Specifically, it first sketches global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a 16K+ codebook size without any bells and whistles, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.
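The difference between patch-wise and channel-wise tokens can be made concrete on a toy feature map. The sketch below is one reading of the abstract, not the authors' implementation: it assumes a feature map stored as `feat[channel][position]`, nearest-neighbor codebook lookup, and hypothetical toy codebooks whose entry length matches whichever axis is being quantized.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared L2 distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )


def patch_tokens(feat, codebook):
    """Conventional VQ: one token per spatial position (a C-dimensional vector)."""
    n_pos = len(feat[0])
    return [nearest(codebook, [row[n] for row in feat]) for n in range(n_pos)]


def channel_tokens(feat, codebook):
    """Channel-wise VQ: one token per channel (an N-dimensional vector)."""
    return [nearest(codebook, row) for row in feat]


# Toy feature map with C = 2 channels and N = 3 spatial positions.
feat = [[0.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]]
patch_ids = patch_tokens(feat, [[0.0, 1.0], [1.0, 0.0]])              # entries of length C
channel_ids = channel_tokens(feat, [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]])  # entries of length N
```

Under this reading the token sequence length equals the number of channels rather than the number of patches, which is what makes next-channel prediction a natural autoregressive order.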

2605.30290 2026-06-02 cs.LG cs.AI cs.CL

Self-Trained Verification for Training- and Test-Time Self-Improvement

自训练验证用于训练和测试时的自我改进

Chen Henry Wu, Aditi Raghunathan

发表机构 * arXiv

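AI summary: Proposes self-trained verification (STV), which trains the verifier to imitate a version of itself that is shown the reference solution, addressing the verifier bottleneck in self-improvement. STV substantially improves verification-refinement loops at test time, and verifier-in-the-loop training (ViL) further improves the generator at training time.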

详情
AI中文摘要

大规模自我改进一直是推理模型的长期目标,有两个自然的实现阶段:测试时,通过验证-细化(V-R)循环;训练时,通过自训练方法。两者都受限于同一个瓶颈:验证器。当验证器得分膨胀而准确率停滞,且反馈过于泛化无法执行时,V-R循环会停滞;当糟糕的自生成数据被加入训练时,自训练同样会失败。更好的验证将解锁两者,但我们想要训练的能力,即捕捉自生成的错误,缺乏训练信号。为了解决这一挑战,我们提出了自训练验证(STV)。我们的关键观察是,虽然模型单独无法捕捉这些错误,但当它看到参考解决方案时却可以。我们将这种不对称性转化为监督目标,训练验证器模仿自身更具信息量的版本。在测试时,STV在困难问题上显著改进了V-R循环,而替代方法(如SFT、对验证器分数进行RL,甚至元验证器)则不然。STV在困难数学任务上大致使准确率翻倍,在科学推理任务上提升14倍(从1.5%到21%)。在训练时,我们额外使用STV验证器在V-R循环内的反馈对生成器进行RL训练——我们称之为验证器在环训练(ViL)。从一个RL收敛的生成器开始,ViL在pass@1上进一步获得33%的提升。更值得注意的是,生成器在测试时无验证器的独立pass@1相对标准RL收敛点提升了30%。因此,困难问题推理的下一个前沿可能在于我们如何训练用于验证和与验证结合的方法。网站:https://ar-forum.github.io/stv-webpage

英文摘要

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with the STV verifier's feedback inside the V-R loop, a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs a relative 30% beyond the point where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification. Website: https://ar-forum.github.io/stv-webpage

2605.30190 2026-06-02 cs.LG

Mean-Field Diffuser: Scaling Offline MARL to Thousands of Agents

均场扩散器:将离线多智能体强化学习扩展到数千个智能体

Wenhao Li, Xiangfeng Wang, Bo Jin

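Affiliations: Tongji University; East China Normal University

AI summary: Proposes MF-Diffuser, which lifts trajectory planning to the Wasserstein space of trajectory distributions and uses propagation of chaos with a hierarchical coarse-to-fine strategy to scale offline multi-agent reinforcement learning to thousands of agents. The theory shows that mean-field approximation error scales as $O(H^2/\sqrt{N})$ and that offline distribution shift does not grow with population size; experiments on three mean-field RL benchmarks show the best return in the majority of settings.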

详情
Comments
62 pages, 15 figures, 16 tables
AI中文摘要

基于扩散的规划在单智能体离线强化学习中取得了显著成果,但由于联合轨迹空间的维度灾难,扩展到多智能体系统仍然难以处理。我们引入了MF-Diffuser,这是一个将轨迹规划提升到轨迹分布的Wasserstein空间的框架,其中混沌传播确保一个小的代表性智能体子集能够捕捉整个群体的动态。我们的方法包括一个值加权的混沌熵目标,该目标协调生成保真度与回报最大化,以及一个分层粗到细策略,在去噪过程中逐步增加智能体群体。我们建立了端到端的次优性界,包含四个可解释项,揭示了均场近似误差以$O(H^2/\sqrt{N})$缩放,而离线分布偏移被证明不会随群体规模$N$增长,并证明生成的策略是一个近似的均场纳什均衡,具有明确的收敛保证。在三个均场RL基准(包括阶段博弈、序列动态和对抗性团队竞争)上的实验表明,MF-Diffuser在大多数设置中实现了最佳回报,在次优离线数据和极端规模($N \geq 10^3$)下增益最大。

英文摘要

Diffusion-based planning has achieved strong results in single-agent offline reinforcement learning, yet scaling to many-agent systems remains intractable due to the curse of dimensionality in the joint trajectory space. We introduce MF-Diffuser, a framework that lifts trajectory planning to the Wasserstein space of trajectory distributions, where the propagation of chaos ensures a small representative subset of agents captures the full population dynamics. Our approach features a value-weighted chaotic entropy objective that reconciles generative fidelity with return maximization, and a hierarchical coarse-to-fine strategy that progressively grows the agent population during denoising. We establish end-to-end suboptimality bounds with four interpretable terms, revealing that mean-field approximation error scales as $O(H^2/\sqrt{N})$ while offline distribution shift provably does not grow with population size $N$, and prove the generated policy is an approximate mean-field Nash equilibrium with explicit convergence guarantees. Experiments on three mean-field RL benchmarks -- spanning stage games, sequential dynamics, and adversarial team competition -- show MF-Diffuser achieves the best return in the majority of settings, with the largest gains on suboptimal offline data and at extreme scales ($N \geq 10^3$).

2605.30188 2026-06-02 cs.LG cs.AI stat.ML

CalArena: A Large-Scale Post-Hoc Calibration Benchmark

CalArena:大规模事后校准基准

Eugène Berta, David Holzmüller, Francis Bach, Michael I. Jordan

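Affiliations: Inria - Ecole Normale Supérieure PSL Research University

AI summary: Proposes CalArena, a large-scale standardized benchmark covering nearly 2000 experiments that compares calibration methods by Post-Hoc Improvement (PHI) in proper scoring rules; it finds that smooth calibration functions outperform binning-based methods and that dedicated multiclass methods are essential in high-dimensional settings.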

详情
Comments
30 pages, 9 figures
AI中文摘要

可靠的概率估计在许多机器学习应用中至关重要,但现代分类器往往校准不佳。事后校准提供了一种简单且广泛使用的解决方案,但由于提出的方法众多,加上小规模和不一致的评估,很难确定哪些方法在实践中真正有效。我们引入了一个大规模、标准化的事后校准基准,涵盖表格和计算机视觉任务的近2000个实验,包括二分类、多分类和大规模分类设置。我们的基准汇集了来自多种经典模型、现代深度学习架构和基础模型的预测,并在通用评估框架内提供了数十种校准方法的统一、可重复实现。我们认为,在适当评分规则下的事后改进(PHI)为比较事后方法提供了传统校准误差估计器的原则性替代方案,同时捕捉校准质量和模型预测性能的潜在退化。利用这一框架,我们进行了迄今为止最全面的事后校准实证研究。我们的结果揭示了跨领域的一致模式:平滑校准函数优于基于分箱的方法,专用多类方法在高维设置中至关重要,而通用机器学习模型在没有校准特定设计的情况下不具备竞争力。为促进未来研究,我们发布了所有数据、代码和评估工具,为开发和比较校准方法提供了一个即插即用的基准。

英文摘要

Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post-hoc calibration provides a simple and widely used solution, but the large number of proposed methods, combined with small-scale and inconsistent evaluations, makes it difficult to determine which approaches are truly effective in practice. We introduce a large-scale, standardized benchmark for post-hoc calibration, covering nearly 2000 experiments across tabular and computer vision tasks, including binary, multiclass, and large-scale classification settings. Our benchmark aggregates predictions from a diverse set of classical models, modern deep learning architectures, and foundation models, and provides unified, reproducible implementations of dozens of calibration methods within a common evaluation framework. We argue that Post-Hoc Improvement (PHI) in proper scoring rules offers a principled alternative to traditional calibration error estimators for comparing post-hoc methods, capturing both calibration quality and potential degradation to the model's predictive performance. Using this framework, we conduct the most comprehensive empirical study of post-hoc calibration to date. Our results reveal consistent patterns across domains: smooth calibration functions outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. To facilitate future research, we release all data, code, and evaluation tools, providing a plug-and-play benchmark for developing and comparing calibration methods.
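To make the idea of measuring improvement in a proper scoring rule concrete, the sketch below applies binary temperature scaling and reports the drop in log loss. It assumes PHI is the score before calibration minus the score after; the abstract does not give the exact definition, so treat that formula, the function names, and the toy data as illustrative assumptions.

```python
import math


def log_loss(probs, labels):
    """Mean negative log-likelihood for binary labels (a proper scoring rule)."""
    total = sum(math.log(p if y == 1 else 1.0 - p) for p, y in zip(probs, labels))
    return -total / len(labels)


def temperature_scale(probs, temperature):
    """Divide each logit by the temperature and map back to a probability."""
    scaled = []
    for p in probs:
        z = math.log(p / (1.0 - p)) / temperature
        scaled.append(1.0 / (1.0 + math.exp(-z)))
    return scaled


def post_hoc_improvement(probs, calibrated, labels):
    """Assumed form of PHI: log loss before calibration minus log loss after."""
    return log_loss(probs, labels) - log_loss(calibrated, labels)


# Toy overconfident classifier: always 0.99, but wrong on one of four examples.
raw = [0.99, 0.99, 0.99, 0.99]
labels = [1, 1, 1, 0]
phi = post_hoc_improvement(raw, temperature_scale(raw, 2.0), labels)
```

A positive value means the calibrated probabilities score better; a method that damages predictive performance would show up as a negative value, which a calibration-error estimate alone would not reveal.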

2605.30122 2026-06-02 cs.LG cs.AI

Beyond MSE: Improving Precipitation Nowcasting with Multi-Quantile Regression

超越MSE:利用多分位数回归改进降水临近预报

Gijs van Nieuwkoop, Siamak Mehrkanoon

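Affiliations: Department of Information and Computing Sciences, Utrecht University

AI summary: Replaces the mean squared error (MSE) training objective of a deterministic precipitation nowcasting model with a multi-quantile regression loss. Validated with SmaAt-UNet on Dutch radar precipitation data, the change lowers the test-set MSE of the central deterministic forecast by 8.6% and yields upper-quantile outputs that help predict heavy precipitation.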

详情
Comments
7 pages, 5 figs
AI中文摘要

深度学习降水临近预报模型通常使用逐点损失(如均方误差或平均绝对误差)进行优化,这可能导致预测过于平滑且对强降雨的表示较差。本研究探讨了是否可以通过将训练重新表述为多分位数回归问题来提高已建立的确定性临近预报架构的预测性能。使用SmaAt-UNet作为核心模型,我们在荷兰雷达降水临近预报上比较了MSE、MAE和多分位数pinball损失训练。结果表明,多分位数训练改进了中心确定性预测,与使用MSE训练的模型相比,测试集MSE降低了8.6%,同时产生的高分位数输出对强降水的风险敏感预测很有用。这些发现表明,分位数回归提供了一种简单的替代标准逐点损失的方法,无需新的架构或生成采样过程。我们模型和训练设置的实现可在GitHub上获取。

英文摘要

Deep-learning precipitation nowcasting models are often optimized using pointwise losses such as mean squared error or mean absolute error, which can lead to overly smooth forecasts and poor representation of heavy rainfall. This study investigates whether the predictive performance of an established deterministic nowcasting architecture can be improved by reformulating training as a multi-quantile regression problem. Using SmaAt-UNet as a core model, we compare MSE, MAE, and multi-quantile pinball-loss training on radar precipitation nowcasting over the Netherlands. The results show that multi-quantile training improves the central deterministic forecast, decreasing test-set MSE by 8.6% compared to a model trained using MSE, while also producing upper-quantile outputs that are useful for risk-sensitive prediction of heavy precipitation. These findings suggest that quantile regression provides a simple alternative to standard pointwise losses without requiring a new architecture or generative sampling procedure. The implementation of our models and training setup is available on GitHub: https://github.com/gijsvn/Multi-Quantile-Precipitation-Nowcasting
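The pinball loss named in the abstract has a standard closed form, sketched below for a single target value. The averaging over quantile levels and the dictionary layout are illustrative assumptions; the paper's exact quantile set and weighting are not given in the abstract.

```python
def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss for one prediction at quantile level tau in (0, 1)."""
    err = y_true - y_pred
    return max(tau * err, (tau - 1.0) * err)


def multi_quantile_loss(y_true, preds_by_tau):
    """Average pinball loss over several quantile levels (assumed equal weights)."""
    losses = [pinball_loss(y_true, pred, tau) for tau, pred in preds_by_tau.items()]
    return sum(losses) / len(losses)


# Toy example: observed rain rate 10, predicted at three quantile levels.
loss = multi_quantile_loss(10.0, {0.1: 4.0, 0.5: 9.0, 0.9: 15.0})
```

At a high quantile level such as 0.9, under-prediction costs nine times as much as over-prediction of the same size, which pushes the upper-quantile output toward the heavy-rainfall tail instead of the smoothed average that MSE favors.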

2605.30000 2026-06-02 cs.AI

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

Cookie-Bench: 面向网页生成的连续屏幕按键交互评估

Haoyue Yang, Zhangxiao Shen, Fan Ding, Hangting Lou, Yifeng Kou, Haoqing Yu, Jingyao Li, Zhengfan Wu, Siqi Bao, Jing Liu, Hua Wu

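Affiliations: Baidu Inc., Beijing, China

AI summary: Proposes Cookie-Bench, a reference-free, autonomously driven, and holistically reasoned evaluation framework for web generation that collects evidence and scores in stages, guided by metacognitive monitoring, and agrees closely with expert ratings.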

详情
AI中文摘要

前端网页代码已成为每个前沿LLM发布的核心产品面,但以开发速度评估这些交互式应用仍然成本高昂,因为像Arena这样的人类评判排行榜无法扩展。现有的自动化代理通常依赖参考实现、测试套件或严格检查清单,并且往往遗漏人类评审员在实时会话中进行的推理综合。我们阐述了一种新的评估体系,该体系同时具有无参考、自主驱动和整体推理的特点,并通过两个工件实例化。其一是一个涵盖11个领域、54个叶节点、1000个查询的WebDev基准测试,包括静态展示和交互式应用任务,在三个难度等级和三个目标语言组之间平衡,并且重写了任务简介以防止从流传的提示中回忆。其二是一个评估框架,它基于Flavell的元认知监控,将证据收集与判断分离为三个阶段:静态感知通过被动观察形成第一印象;代理驱动交互自主探索应用,同时捕获连续屏幕视频、音频和每步截图;动态评分仅在证据链完成后发出整体功能和美学评判,并附带结构化失败归因。在该基准上,该框架与专家人类评分高度一致,同时揭示了13个前沿LLM在交互式网页生成上的显著提升空间。https://anonymous.4open.science/r/Cookie-3CE/

英文摘要

Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, and instantiate it through two artifacts. The first is an 11-domain, 54-leaf, 1,000-query WebDev benchmark spanning both static-presentation and interactive-application tasks, balanced across three difficulty tiers and three target-language groups, with briefs rewritten to resist recall from circulated prompts. The second is an evaluation framework that, grounded in Flavell's metacognitive monitoring, separates evidence accumulation from judgment across three stages: Static Perception forms a first impression from passive observation; Agent-Driven Interaction explores the application autonomously while capturing continuous screen video, audio, and per-step screenshots; Dynamic Scoring issues holistic functionality and aesthetics verdicts with structured failure attribution only after the evidence chain is complete. On the benchmark, the framework aligns closely with expert human ratings while surfacing substantial headroom across 13 frontier LLMs on interactive web generation. https://anonymous.4open.science/r/Cookie-3CE/

2605.29987 2026-06-02 cs.LG cs.CL

MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment

MIC: 通过各向同性子空间对齐最大化自适应表示中的信息容量

Dang Nguyen Hong, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham

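Affiliations: National University of Singapore

AI summary: To address dimensional redundancy and spectral collapse in multi-scale representations, proposes MIC, which combines isotropic subspace alignment, Soft Collapse Regularization, and Spectral Isotropy Regularization under a self-distillation objective to produce semantically dense, highly discriminative representations, clearly outperforming baselines in high-compression scenarios.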

详情
Comments
Accepted at the GlobalSouthML Workshop at ICML 2026. 8 pages, 2 figures
AI中文摘要

尽管多尺度表示学习能够实现弹性维度的嵌入,但嵌套子空间常常遭受维度冗余和谱坍缩的困扰。为了解决这一问题,我们引入了MIC,一个通过各向同性子空间对齐来优化多粒度嵌入几何景观的框架。MIC采用软坍缩正则化(SCR),通过交叉相关惩罚来减轻前缀子空间和残差子空间之间的冗余,同时使用谱各向同性正则化(SIR)确保低维前缀的超球面均匀性。通过自蒸馏目标统一这些策略,MIC生成语义密集的表示,并保持高判别力。我们的实验表明,MIC显著优于标准基线,特别是在维持信息容量最为关键的高压缩场景中。

英文摘要

Although multi-scale representation learning enables elastic-dimension embeddings, nested subspaces often suffer from dimensional redundancy and spectral collapse. To address this, we introduce MIC, a framework that optimizes the geometric landscape of multi-granular embeddings through isotropic subspace alignment. MIC employs Soft Collapse Regularization (SCR) to mitigate redundancy between prefix and residual subspaces via cross-correlation penalties, alongside Spectral Isotropy Regularization (SIR) to ensure hyper-spherical uniformity in low-dimensional prefixes. By unifying these strategies through a self-distillation objective, MIC generates semantically dense representations that maintain high discriminative power. Our experiments demonstrate that MIC significantly outperforms standard baselines, particularly in high-compression scenarios where maintaining informational capacity is most critical.

2605.29977 2026-06-02 cs.CV cs.LG

EVL-ECG: Efficient ECG Interpretation With Multi-Aspect Heterogeneous Knowledge Distillation

EVL-ECG:面向多视角异构知识蒸馏的高效心电图解读

Dang Nguyen Hong, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham

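Affiliations: University of Notre Dame

AI summary: Proposes EVL-ECG, which achieves cross-architecture knowledge distillation through three components (multi-head cross-attention alignment, optimal-transport-based visual feature matching, and geometric relation matching) for efficient ECG interpretation in resource-constrained settings.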

详情
Comments
Accepted at the SD4H Workshop at ICML 2026. 7 pages, 3 figures
AI中文摘要

高保真心电图解读越来越依赖于大规模基础模型,但其在临床边缘护理中的部署仍受到极端计算需求的阻碍。虽然知识蒸馏(KD)是一种有前景的解决方案,但传统方法在跨异构架构传递知识时,无法捕捉心电图信号的复杂时空依赖关系。本文提出EVL-ECG,一个专门用于心脏诊断逻辑跨架构蒸馏的框架。EVL-ECG引入了三种心电图感知创新:(1)多头交叉注意力对齐,协调架构差异以保留细粒度形态特征;(2)基于最优传输的视觉特征匹配,利用最优传输在标记表示不匹配的情况下保持跨心电图导联的全局结构关系;(3)几何结构内关系匹配,蒸馏教师模型的潜在诊断推理。在心电图基准测试上的评估表明,EVL-ECG相比现有基线,AUC提升高达2.4%,临床准确率提升1.1%。值得注意的是,EVL-ECG建立了一个高效的20亿参数心电图基础模型,适用于资源受限的临床环境。

英文摘要

High-fidelity ECG interpretation is increasingly reliant on massive foundation models, yet their deployment in clinical edge-care remains hindered by extreme computational demands. While knowledge distillation (KD) is a promising solution, traditional methods fail to capture the complex spatio-temporal dependencies of ECG signals when transferring knowledge across heterogeneous architectures. In this paper, we propose EVL-ECG, a framework specifically designed for cross-architecture distillation of cardiac diagnostic logic. EVL-ECG introduces three ECG-aware innovations: (1) Multi-Head Cross-Attention Alignment, which harmonizes architectural discrepancies to preserve fine-grained morphological features; (2) Optimal Transport-based Visual Feature Matching, utilizing optimal transport to maintain global structural relationships across ECG leads despite mismatched token representations; and (3) Geometric Intra-Architecture Relation Matching, which distills the latent diagnostic reasoning of the teacher model. Evaluations across ECG benchmarks demonstrate that EVL-ECG yields improvements of up to 2.4% AUC and 1.1% clinical accuracy over existing baselines. Notably, EVL-ECG establishes an efficient 2B-parameter ECG foundation model, suitable for resource-constrained clinical environments.

2605.29973 2026-06-02 cs.RO

Replicable Simulation-Based Robot Validation through Provenance

通过数据溯源实现可复现的基于仿真的机器人验证

Argentina Ortega, Samuel Wiest, Frederik Pasch, Nico Hochgeschwender

发表机构 * Argentine Institute of Technology(阿根廷技术研究所)

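AI summary: To improve the replicability of simulation-based robot validation, proposes integrating data provenance and the FAIR principles into the testing process by tracking links between artifacts and attaching machine-readable metadata, and demonstrates the approach on a mobile robot navigation dataset.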

详情
Comments
Accepted for publication at 2026 IEEE RAS International Conference on Engineering Reliable Autonomous Systems (ERAS)
AI中文摘要

机器人行为通常通过基于仿真的测试来验证,然而此类测试活动的可复现性关键取决于测试配置、执行和后处理过程的透明文档化。我们认为,数据溯源结合FAIR原则(可发现、可访问、可互操作、可重用)通过显式追踪工件之间的链接以及附加关于文件来源和关键设计决策的机器可读元数据,弥补了这一不足。此外,溯源和元数据不能被视为仅局限于最终数据集的后续补充;它们必须集成到生成这些数据集的测试过程中,以便能够端到端地重建证据。我们通过为现有的基于仿真的测试框架增加溯源追踪和元数据收集机制,并利用这些扩展为移动机器人导航数据集添加结构化溯源和符合FAIR原则的元数据来证明这一点。最后,我们讨论了在此集成过程中遇到的障碍——例如词汇对齐、属性选择和领域标准的采纳——并提供了在机器人验证工作流中实现以溯源为中心的FAIR元数据的可操作建议。

英文摘要

Robot behavior is often validated through simulation-based testing, yet the replicability of such campaigns depends critically on transparent documentation of how tests are configured, executed, and post-processed. We argue that data provenance, coupled with the FAIR principles (findability, accessibility, interoperability, and reusability), addresses this gap by explicitly tracking links between artifacts and by attaching machine-readable metadata about file origins and key design decisions. Moreover, provenance and metadata cannot be treated as an afterthought confined to final datasets; they must be integrated into the testing processes that generate those datasets so that evidence can be reconstructed end-to-end. We demonstrate this by augmenting an existing simulation-based testing framework with provenance tracking and metadata collection mechanisms, and by using these extensions to enrich a mobile robot navigation dataset with structured provenance and FAIR-aligned metadata. Finally, we discuss obstacles encountered in this integration -- such as vocabulary alignment, attribute selection, and adoption of domain standards -- and provide actionable recommendations for implementing provenance-centric, FAIR metadata in robotics validation workflows.

2605.29948 2026-06-02 cs.SD cs.AI eess.AS

HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

HoliTok: 一种具有鲁棒的双重语音生成与理解能力的连续整体式分词

Bohan Li, Shi Lian, Hankun Wang, Yiwei Guo, Yu Xi, Zhihan Li, Da Zheng, Colin Zhang, Kai Yu

Affiliations: X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China; hi lab, Xiaohongshu Inc, China

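AI summary: Proposes HoliTok, a continuous holistic speech tokenization model trained with a progressive strategy that jointly preserves signal fidelity, incorporates semantic information, and maintains latent learnability. A unified AR+DiT model built on this tokenization performs speech synthesis and recognition, and experiments show that HoliTok operates robustly in the unified generation-understanding architecture without additional optimization tricks.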

Comments
14 pages, 2 figures, 8 tables
AI Chinese abstract

统一的语音基础模型需要一个整体式的分词空间,该空间既要能被语言模型学习,又要能解码为高质量波形。然而,现有的语音分词器往往无法同时满足这些要求,导致架构复杂度和训练设计增加。我们提出HoliTok,一种用于统一生成-理解建模的连续整体式语音分词模型。HoliTok将48 kHz语音编码为紧凑的25 Hz序列,包含128维潜在向量。它采用渐进策略进行训练,联合保留信号级保真度、融入语义信息并保持强大的潜在可学习性。基于此分词,我们构建了一个统一的AR+DiT模型用于语音合成和识别,其中相同的潜在序列既支持生成特定任务,也支持统一的生成-理解任务。实验表明,HoliTok实现了有竞争力的重建保真度,提高了高质量和可控合成中的生成可学习性,并且在评估的表示中,它是唯一一个在我们的统一生成-理解架构中无需额外优化技巧即可鲁棒运行的表示。这些结果表明,HoliTok作为一种有效的语音分词器,为统一口语建模提供了基础的表示接口。代码可在 https://github.com/bovod-sjtu/HoliTok 获取。

English abstract

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.
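
The rates quoted in the abstract imply a fixed temporal compression, which the short arithmetic sketch below makes explicit. Only the 48 kHz input rate, the 25 Hz latent rate, and the 128-dimensional latents come from the abstract; the 10-second clip, the function name `latent_shape`, and the assumption of no padding or truncation are illustrative, and nothing here reflects the actual HoliTok code.

```python
SAMPLE_RATE_HZ = 48_000   # input audio rate, from the abstract
FRAME_RATE_HZ = 25        # latent sequence rate, from the abstract
LATENT_DIM = 128          # latent dimensionality, from the abstract


def latent_shape(duration_s: float) -> tuple[int, int]:
    """Shape (frames, dims) of the latent sequence, assuming no padding."""
    frames = int(duration_s * FRAME_RATE_HZ)
    return frames, LATENT_DIM


# Waveform samples covered by one latent frame.
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ
print(samples_per_frame)  # 1920

# Hypothetical 10-second clip.
frames, dims = latent_shape(10.0)
print(frames, dims)  # 250 128

# Waveform samples represented by each latent scalar.
samples_per_value = (10 * SAMPLE_RATE_HZ) / (frames * dims)
print(samples_per_value)  # 15.0
```

Under these assumptions, each latent frame summarizes 1920 waveform samples, so a language model sees 25 steps per second of audio instead of 48,000.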