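arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.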

2605.28820 2026-05-28 cs.CV

From Pixels to Words -- Towards Native One-Vision Models at Scale

Haiwen Diao, Jiahao Wang, Penghao Wu, Yuhao Dong, Yuwei Niu, Yue Zhu, Zhongang Cai, Weichen Fan, Linjun Dai, Silei Wu, Xuanyu Zheng, Mingxuan Li, Yuanhan Zhang, Bo Li, Hanming Deng, Huchuan Lu, Quan Wang, Lei Yang, Lewei Lu, Dahua Lin, Ziwei Liu

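AI summary: This paper proposes NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end without external encoders or adapters. It narrows the gap to modular models while excelling at fine-grained visual perception, supporting the feasibility and competitiveness of native one-vision architectures.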

Comments: 13 pages, 6 figures

Abstract

Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impressive performance on single images, remain largely unexplored in multi-image, video understanding, and spatial intelligence. Hence, we introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov enables fine-grained and unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native "one-vision" architectures are not only feasible but competitive at scale. Beyond empirical performance, we unveil systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.


2605.28819 2026-05-28 cs.LG cs.CL

PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

Yangyi Huang, Ruotian Peng, Zeju Qiu, Jiale Kang, Yandong Wen, Bernhard Schölkopf, Weiyang Liu

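AI summary: Introduces PEFT-Arena, a benchmark that evaluates parameter-efficient finetuning methods through the stability-plasticity dilemma. Orthogonal finetuning achieves the most favorable Pareto frontier under comparable parameter budgets, and the differences between methods are analyzed geometrically in weight space and activation space.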

Comments: Technical report v1 (28 pages, 9 figures, project page: https://spherelab.ai/PEFT-Arena/)

Abstract

Parameter-efficient finetuning (PEFT) has become the standard approach for adapting large language models, yet evaluations largely emphasize downstream accuracy while overlooking the retention of pretrained capabilities. We argue that PEFT should be assessed through the stability-plasticity dilemma: the trade-off between target-task adaptation and resistance to forgetting. We introduce PEFT-Arena, a benchmark that jointly measures downstream performance and general capability retention. Across methods, we find distinct stability-plasticity profiles; under comparable parameter budgets, orthogonal finetuning achieves the most favorable Pareto frontier. To explain these differences, we analyze PEFT updates from two geometric perspectives. In weight space, spectral analysis reveals how parameterizations interact with the pretrained singular-value structure. In activation space, retention metrics show whether finetuning preserves or distorts general-capability representations, with forgetting linked to non-isometric representation distortion. Finally, an analysis shows that final SFT checkpoints often overshoot a better target-retention operating point. Inspired by this, we present case studies of a post-hoc improvement with path-wise rewinding.
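The Pareto-frontier comparison described in this abstract can be illustrated with a minimal sketch. It assumes each method is summarized by two higher-is-better scores (downstream performance and capability retention); the `Result` type, the method names, and the numbers are hypothetical and are not taken from the paper or its benchmark.

```python
from typing import NamedTuple


class Result(NamedTuple):
    method: str
    downstream: float  # target-task score, higher is better (hypothetical)
    retention: float   # general-capability retention, higher is better (hypothetical)


def pareto_front(results: list[Result]) -> list[Result]:
    """Return the results that no other result dominates on both axes."""
    front = []
    for r in results:
        dominated = any(
            o.downstream >= r.downstream
            and o.retention >= r.retention
            and (o.downstream > r.downstream or o.retention > r.retention)
            for o in results
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.downstream)


# Made-up numbers for illustration only; not results from the paper.
demo = [
    Result("method_a", 0.70, 0.90),
    Result("method_b", 0.80, 0.85),
    Result("method_c", 0.75, 0.80),  # dominated by method_b on both axes
    Result("method_d", 0.85, 0.60),
]
front_names = [r.method for r in pareto_front(demo)]
```

A method sits on the frontier when no other method is at least as good on both axes and strictly better on one, which is the sense in which one family can offer a more favorable stability-plasticity trade-off than another.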


2605.28818 2026-05-28 cs.CL q-bio.NC

VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

Jinzhou Wu, Zhengwu Ma, Jixing Li, Baoping Tang, Zitong Lu

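AI summary: Compares tightly matched LLM and VLM pairs under a strictly text-only setting and finds that multimodal pretraining confers no global advantage in human alignment during natural reading, although VLMs show a selective advantage on sentences with strong visual semantic content.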

Comments: 17 pages, 10 figures

Abstract

Large language models (LLMs) have become increasingly useful computational models of human language processing, but it remains unclear whether vision-language learning makes text representations more human-like during natural reading. Here, we address this question by comparing tightly matched LLM and vision-language model (VLM) pairs under a strictly text-only setting, allowing us to isolate the effect of multimodal training history from online visual input or cross-modal fusion. We evaluate model alignment with a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. Our findings demonstrate that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing. However, the VLM advantage could emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignments. Together, our findings provide a controlled in silico framework for testing how visual learning history shapes model-human alignment of language processing, suggesting that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.


2605.28816 2026-05-28 cs.CV

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Fangfu Liu, Kai He, Tianchang Shen, Tianshi Cao, Sanja Fidler, Yueqi Duan, Jun Gao, Igor Gilitschenski, Zian Wang, Xuanchi Ren

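AI summary: Proposes a generative multi-agent world model that uses Simplex Rotary Agent Encoding to make agents permutation-equivalent and Sparse Hub Attention to reduce cross-agent attention cost, supporting interactive multiplayer video generation.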

Comments: Project Page: https://research.nvidia.com/labs/sil/projects/gamma-world

Abstract

World models for interactive video generation have largely focused on single-agent settings, where future observations are generated from a single control signal. However, many generated environments require multi-agent interaction: multiple players, robots, or embodied agents act simultaneously within a shared space. Scaling world models to such settings requires a principled multi-agent design: agents should remain independently controllable, permutation-symmetric, and support efficient inference while maintaining consistency across time and perspectives. In this paper, we present our generative multi-agent world model for interactive simulation. It introduces Simplex Rotary Agent Encoding, a parameter-free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. This gives each agent a distinct phase while making all agents permutation-equivalent, enabling scalable agent identity without learned per-slot identities or a fixed agent ordering. To avoid dense all-to-all attention across agents, we further propose Sparse Hub Attention, where learnable hub tokens mediate token interaction across agents, reducing cross-agent attention cost from quadratic to linear in the number of agents. For real-time rollout, we distill a full-context diffusion teacher into a causal student that generates temporal blocks sequentially with KV caching, enabling action-responsive generation at 24 FPS. Experiments in multiplayer virtual environments show that our model improves video fidelity, action controllability, and inter-agent consistency over slot-based and dense-attention baselines, while generalizing from two to four players without additional training.


2605.28814 2026-05-28 cs.CL

Self-Improving Language Models with Bidirectional Evolutionary Search

Guowei Xu, Zhenting Qi, Huangyuan Su, Weirui Ye, Himabindu Lakkaraju, Sham M. Kakade, Yilun Du

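AI summary: Proposes Bidirectional Evolutionary Search (BES), a framework that couples forward candidate evolution with backward goal decomposition to address the sparse verification signals and autoregressive-expansion limits of conventional search, improving language models in both post-training and inference.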

Abstract

Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable subgoals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES enables consistent gains, and on three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at https://github.com/Embodied-Minds-Lab/BES.


2605.28812 2026-05-28 cs.RO cs.AI cs.LG

Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation

Jiahe Pan, Stelian Coros, Jitendra Malik, Toru Lin

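AI summary: Introduces Center-of-Pressure (CoP), a physics-grounded tactile representation, together with sensor calibration based on differentiable dynamics. It achieves zero-shot sim-to-real transfer on a multi-fingered hand and outperforms binary-contact and raw-taxel baselines on peg-in-hole insertion and ball balancing.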

Comments: Project site: https://mpan31415.github.io/tactile_rep/

Abstract

A primary bottleneck in contact-rich manipulation is the difficulty of collecting real-world data. Sim-to-real reinforcement learning offers a scalable alternative, but the simulation-reality gap prevents information-dense modalities like touch from being effectively used. Existing sim-to-real methods often mitigate this gap by simplifying tactile data into coarse low-dimensional features -- sacrificing the richness required for complex manipulation. In this work, we introduce Center-of-Pressure (CoP), an effective tactile representation grounded in physical principles that preserves dense contact information while maintaining robustness for sim-to-real transfer. To support this representation, we propose a sensor calibration scheme based on differentiable dynamics, enabling the estimation of taxel orientations without requiring ground-truth force measurements. We evaluate CoP on two blind, challenging contact-rich manipulation tasks: peg-in-hole insertion and ball balancing. Across both tasks, policies conditioned on CoP achieve zero-shot sim-to-real transfer on a multi-fingered hand, and outperform both coarse binary-contact and raw-taxel baselines. Analysis of learned policy states further suggests that CoP-conditioned policies encode task-relevant physical properties, such as object mass, as an emergent byproduct of control.
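For readers unfamiliar with the term, a center of pressure is conventionally the pressure-weighted mean of the contact locations. The sketch below computes that textbook quantity for a small taxel array. It is an assumption that the paper's representation resembles this; the function name, array layout, and toy readings are hypothetical.

```python
import numpy as np


def center_of_pressure(positions, pressures, eps: float = 1e-9):
    """Pressure-weighted mean taxel position.

    positions: (N, D) taxel coordinates; pressures: (N,) readings.
    Returns (cop, total_pressure); cop is all zeros when there is no contact.
    """
    positions = np.asarray(positions, dtype=float)
    # Negative readings are treated as sensor noise and clipped to zero.
    pressures = np.clip(np.asarray(pressures, dtype=float), 0.0, None)
    total = float(pressures.sum())
    if total < eps:
        return np.zeros(positions.shape[1]), 0.0
    cop = (pressures[:, None] * positions).sum(axis=0) / total
    return cop, total


# Toy 2x2 taxel pad with contact only along the bottom edge (made-up readings).
pad = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cop, total = center_of_pressure(pad, [1.0, 1.0, 0.0, 0.0])
```

Compared with a binary contact flag, the pair (location, total pressure) keeps where and how strongly contact occurs, while staying far lower-dimensional than raw taxel readings.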


2605.28811 2026-05-28 cs.CV

HarmoVid: Relightful Video Portrait Harmonization

Jun Myeong Choi, Jae Shin Yoon, Luchao Qi, Roni Sengupta, Joon-Young Lee

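AI summary: Proposes a video diffusion model and a lighting deflickering method that harmonize a foreground video with a target background scene in shadows, color tone, and illumination intensity, addressing temporal jitter in video.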

Comments: CVPR 2026

Abstract

We present a method for harmonizing the lighting of a foreground video to match a target background scene, adjusting shadows, color tone, and illumination intensity (relightful harmonization). Unlike images, acquiring labeled data for videos, where identical motions are recorded under different lighting conditions, is practically infeasible and non-scalable. While one way to create such paired data is to apply existing image-based harmonization models frame by frame to a video, the resulting outputs often suffer from significant temporal jitters. We overcome this problem by introducing a novel lighting deflickering model that can stabilize the global and local lighting flickering artifacts. Our video diffusion model learns from these upgraded deflickered data with a volume of real and synthetic videos to generate high-quality video harmonization results. We further propose an asymmetric alpha mask conditioning technique to learn the clean boundaries from real videos. Experiments demonstrate that our model achieves strong temporal coherence, naturalness, cleaner boundaries, and physically meaningful lighting behavior, while maintaining strong relighting expressiveness compared to prior image-based and video-based harmonization methods.


2605.28810 2026-05-28 cs.LG cs.IR cs.SD

Affective Music Recommendation: A Rollout-Based World Model for Offline Preference Optimization

Audrey Chan, Aaron Labbé, Jacob Lavoie, Jordan Bannister, Arsène Fansi Tchango, Guillaume Lajoie, Laurent Charlin

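AI summary: To address the ethical constraints on online experimentation with emotion, presents AMRS, an affective music recommendation system built on a rollout-based world model. A causal transformer predicts users' affective states, and offline preference optimization improves the recommendations.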

Abstract

Functional music applications, from consumer focus and sleep aids to clinical interventions, share a distinctive recommendation problem: success is defined by the listener's affective state, but online experimentation on emotion is ethically constrained, particularly for clinical populations who cannot reliably skip a song or report distress. We describe AMRS, the Affective Music Recommendation System deployed on LUCID's health-and-wellness platforms, which serve clinical users (primarily older adults with neurocognitive conditions) and consumer-wellness users across energize, focus, calm, and sleep modes. AMRS is built around a rollout-based world model: a causal transformer trained on logged listening data to jointly predict engagement, binary rating, and self-reported valence and arousal. The world model serves both as an in-silico simulator for offline policy training and as a stress-testing tool before deployment. A recommender policy initialized by behaviour cloning is fine-tuned offline with Direct Preference Optimization (DPO) against a configurable multi-objective utility function. Under a strict cold-start protocol, the world model predicts both behavioural and affective signals with usable fidelity; DPO improves predicted valence and arousal over the cloned baseline while maintaining a similar diversity profile and avoiding the distributional collapse produced by greedy optimization. We position the work as an early deployed validation of a methodology for affective recommendation when online experimentation is ethically untenable.


2605.28809 2026-05-28 cs.CV cs.LG

AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning

Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou

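AI summary: Proposes AREA, which stabilizes attribute extraction via principal geodesic analysis, stabilizes attribute aggregation with lightweight task-specific experts regularized by a variational information bottleneck, and routes via optimal transport at inference, to mitigate catastrophic forgetting in CLIP-based class-incremental learning.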

Comments: Accepted to ICML 2026. Code is available at https://github.com/LAMDA-CL/ICML2026-AREA

Abstract

Class-Incremental Learning (CIL) is important in building real-world learning systems. In CLIP-based CIL, the model performs classification by comparing similarity between visual and textual embeddings obtained from template prompts, e.g., "a photo of a [CLASS]". This seemingly monolithic matching process can be decomposed into two conceptually distinct stages: attribute extraction and attribute aggregation. For example, a model may recognize cat using attributes such as fur texture and whiskers. When learning a new class like car, the model must extract additional attributes like wheels and adjust how they are aggregated in the shared representation space. However, since only data from the current task is available, incremental updates can bias both attribute extraction and aggregation toward new classes, leading to catastrophic forgetting. Therefore, we propose AREA for attribute extraction and aggregation in CLIP-based CIL. To stabilize extraction, we anchor class-level visual and textual attributes on the hyperspherical embedding space via principal geodesic analysis. To stabilize aggregation, we learn lightweight task-specific experts with scoring and residual refinement, regularized by a variational information bottleneck objective. During inference, we perform routing over task attribute manifolds via optimal transport for more concise prediction. Experiments show that AREA consistently outperforms SOTA methods. Code is available at https://github.com/LAMDA-CL/ICML2026-AREA.


2605.28807 2026-05-28 cs.AI

Calibrating Conservatism for Scalable Oversight

William Overman, Mohsen Bayati

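AI summary: Proposes Calibrated Collective Oversight (CCO), which calibrates conservatism online so that undesirable outcomes stay below a user-specified threshold without distributional assumptions, and validates it on a modified version of SWE-bench and on MACHIAVELLI.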

Abstract

Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful oversight of systems that may exceed their own capabilities? Existing approaches to scalable oversight rely on complex assumptions, remain largely heuristic, or lack practical methods for sequential settings with statistical guarantees. We introduce Calibrated Collective Oversight (CCO), which aggregates diverse auxiliary scoring functions into a penalty measuring deviation from a conservative baseline. Inspired by Attainable Utility Preservation, CCO enables collective conservatism: actions face a penalty proportional to overseer concern, so high-utility actions are still selected when overseers find them unobjectionable and overridden only when concern accumulates. CCO calibrates this conservatism online using Conformal Decision Theory, ensuring that undesirable outcomes remain below a user-specified target threshold with finite-time bounds and no distributional assumptions. On a modified version of SWE-bench, weaker overseers successfully constrain an adversarially misaligned stronger agent; on MACHIAVELLI, CCO substantially reduces ethical violations while preserving reward. In both settings, empirical violation rates closely match the specified targets, as predicted by the theory.
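The idea of penalizing actions in proportion to overseer concern, with the penalty weight tuned online toward a target violation rate, can be sketched generically. The code below is an assumption-laden illustration in the spirit of online conformal-style controllers, not the paper's algorithm: the function names, the additive update rule, and the learning rate are all hypothetical.

```python
def choose_action(utilities, concerns, lam: float) -> int:
    """Pick the index maximizing utility minus lam times aggregated overseer concern."""
    scores = [u - lam * c for u, c in zip(utilities, concerns)]
    return max(range(len(scores)), key=scores.__getitem__)


def update_conservatism(lam: float, violated: bool, target: float, lr: float = 0.1) -> float:
    """One online step (hypothetical rule): tighten after a violation, relax otherwise.

    The step is proportional to (observed violation - target rate), so the
    weight settles where the long-run violation rate is near the target.
    """
    return max(0.0, lam + lr * (float(violated) - target))


# Toy case: action 0 has higher utility but draws strong overseer concern.
utilities, concerns = [1.0, 0.6], [0.9, 0.0]
pick_unpenalized = choose_action(utilities, concerns, lam=0.0)
pick_penalized = choose_action(utilities, concerns, lam=1.0)
lam_after_violation = update_conservatism(0.5, violated=True, target=0.1)
lam_after_safe_step = update_conservatism(0.5, violated=False, target=0.1)
```

With no penalty the high-utility action wins; once the weight is large enough, the action the overseers object to is overridden, which mirrors the behavior the abstract describes.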


2605.28806 2026-05-28 cs.CV cs.CL cs.IR

Personal Visual Memory from Explicit and Implicit Evidence

Viet Nguyen, Thao Nguyen, Vishal M. Patel, Yuheng Li

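AI summary: Introduces a personal visual memory benchmark and VisualMem, a hybrid visual-text architecture that strengthens the long-term memory of AI agents with explicit and implicit visual evidence, substantially improving performance on personalization tasks.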

Comments: Project Page: https://viettmab.github.io/visualmem-page/

Abstract

Long-term memory is increasingly important for personalized AI agents, yet existing benchmarks and methods remain largely text-centric. Even when images are included, the user-specific information needed for later questions is typically recoverable from text alone, and most memory systems reduce image turns to generic captions. Yet images often carry personal information that text rarely states -- both explicit evidence, such as recurring user-associated entities, and implicit evidence, such as latent user facts inferred from visual or multimodal cues. We introduce a benchmark for personal visual memory that targets both forms of evidence, and propose VisualMem, a hybrid visual--text architecture that augments a text-memory backend with a structured personal visual memory module. Rather than collapsing images into captions, VisualMem uses conversational context to resolve identity, ownership, and durable user facts. Experiments show that VisualMem substantially outperforms prior memory systems on our benchmark while remaining competitive on standard text-memory benchmarks, indicating that personal visual memory is a distinct and important component of long-term memory for personalized AI agents.


2605.28805 2026-05-28 cs.CL cs.AI cs.CV cs.LG

OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang

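AI summary: Proposes OmniVerifier-M1, which uses symbolic meta-verification (e.g., bounding boxes) and decoupled reinforcement learning to provide reliable, fine-grained verification for multimodal large language models and dynamic region-level self-correction.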

Comments: ICML 2026. Project: https://github.com/Cominclip/OmniVerifier

Abstract

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.


2605.28803 2026-05-28 cs.CV cs.LG

Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

Xinyu Wang, Mingze Li, Sicheng Lyu, Dongxiu Liu, Kaicheng Yang, Ziyu Zhao, Yufei Cui, Xiao-Wen Chang, Peng Lu

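AI summary: Proposes Ω-QVLA, described as the first training-free post-training quantization framework to compress both the language backbone and the diffusion action head of VLA models to uniform W4A4 precision, using a composite SVD-Hadamard rotation and per-step DiT activation scaling. On LIBERO it matches or exceeds FP16 performance while reducing the static memory footprint by 71.3%.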

Abstract

Vision-Language-Action (VLA) models unify perception, reasoning, and control within a single policy, yet their multi-billion-parameter backbones and diffusion-based action heads make on-device deployment prohibitively expensive. Prior quantization efforts offer only partial solutions, compressing the LLM backbone while leaving the DiT action head at full precision, or resorting to mixed-precision schemes, driven by the belief that uniformly quantizing the action head is inherently unstable. We challenge this assumption with Omega-QVLA, the first training-free post-training quantization framework that compresses both the language backbone and the entire diffusion action head of a VLA model to a uniform W4A4 precision, eliminating the need for mixed-precision allocation. Omega-QVLA combines a composite SVD-Hadamard rotation that equalizes per-channel weight energy while diffusing residual activation outliers with per-step DiT activation scaling quantization that absorbs dynamic-range drift across denoising steps. On LIBERO, Omega-QVLA compresses Pi 0.5 and GR00T N1.5 to W4A4 with 98.0% and 87.8% task success rates, matching or exceeding their FP16 references of 97.1% and 87.0%, while reducing the static memory footprint by 71.3%. Real-world manipulation experiments further confirm smooth, accurate manipulation where prior methods fail. Code is available at https://github.com/UCMP13753/Omega-QVLA.
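One ingredient named in this abstract, a Hadamard rotation that diffuses outliers, can be illustrated in isolation. The sketch below builds an orthonormal Sylvester Hadamard matrix and applies it to a toy vector with a single outlier channel. It shows only the generic rotation effect (smaller peak, same norm, exactly invertible); the SVD component, the per-step scaling, and the paper's actual procedure are not reproduced, and the toy vector is made up.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


# Toy activation vector with one outlier channel (illustrative data only).
x = np.array([100.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
H = hadamard(8)
x_rot = H @ x            # rotated activations: the outlier is spread over all channels
x_back = H.T @ x_rot     # the rotation is orthogonal, so it can be undone exactly

peak_before = np.abs(x).max()
peak_after = np.abs(x_rot).max()
```

Because the rotation is orthogonal, it can be folded into adjacent weights without changing the layer's output in full precision; whether it lowers quantization error in a given setting depends on the quantizer and the data, which this toy example does not measure.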


2605.28802 2026-05-28 cs.CL

Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization

Beiduo Chen, Pingjun Hong, Ziyun Zhang, Benjamin Roth, Anna Korhonen, Barbara Plank

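AI summary: Studies whether large language models can learn and reproduce annotator-specific label-explanation behavior, and proposes cross-annotator preference optimization (CAPO), which contrasts a target annotator's response with other valid but less target-specific annotations to improve imitation and attribution.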

Comments: 43 pages, 20 figures

Abstract

Free-text explanations extend human label variation (HLV) beyond label disagreement by revealing the reasoning and preferences behind annotators' decisions. We study whether large language models (LLMs) can learn and reproduce such annotator-specific label-explanation behavior. Using two sentence-pair tasks with four annotators each -- natural language inference and paraphrase judgment -- we first analyze whether annotators exhibit stable individual patterns. We find that such patterns are weak at the single-annotation level due to strong input-content effects, but become detectable after input-content reduction and annotator-level aggregation. We then compare prompting and supervised fine-tuning (SFT) baselines and propose cross-annotator preference optimization (CAPO), which contrasts a target annotator's response with other valid but less target-specific annotations for the same input. Experiments show that prompting is limited and unstable, SFT better captures annotator-specific behavior, and CAPO further improves aggregation-aware imitation and judge-based attribution while preserving target-specific reasoning patterns under human validation. Overall, our results show that HLV can be learned as annotator-specific label-explanation behavior, suggesting a path toward scalable explanation-based annotation grounded in annotator histories rather than labels alone.


2605.28792 2026-05-28 cs.AI cs.HC cs.LG

CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models

Abhilash Durgam, Nyle Siddiqui, Jeffrey A. Chan-Santiago, Qiushi Fu, Elakkat D. Gireesh, Mubarak Shah

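AI summary: Proposes CaMBRAIN, described as the first causal Mamba-based state space model for EEG, trained with a multi-stage self-supervised pipeline for real-time, long-range continuous inference. It reports state-of-the-art results on three datasets with more than 10x higher throughput.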

Comments: 22 pages, 3 figures, 8 tables

Abstract

Electroencephalography (EEG) is a critical, non-invasive method to monitor electrical brain activity. EEGs can span anywhere from a couple of seconds to multiple hours, posing a major hurdle for existing deep learning methods due to two factors: (1) existing EEG models are predominantly built upon the attention mechanism, incurring quadratic scaling as the sequence length increases, and (2) raw EEG signals must be processed in a sliding-window fashion due to fixed-length input requirements, preventing global understanding of the entire signal. To this end, we propose CaMBRAIN - the first Causal, Mamba-based state space model (SSM) capable of real-time inference of EEG signals, arguing that bidirectional approaches are needlessly expensive given the causal, unidirectional nature of EEG. However, training such a model is non-trivial, as crucial EEG events can be extremely brief - within fractions of a second - yet separated by long intervals spanning minutes. Current EEG methods use self-supervised objectives that optimize for signal reconstruction, but these are not well suited for streaming SSMs; they fail to explicitly train the hidden state to retain the salient long-range context needed for streaming inference. We therefore introduce a multi-stage self-supervised training pipeline specifically tailored to encourage long-range memory retention and strong performance on EEG signals, while preserving the linear-time complexity of state space models. CaMBRAIN achieves state-of-the-art (SOTA) results across 3 different EEG datasets with >10x higher throughput than existing models, making it the first model capable of long-range, continuous inference of variable-length EEG signals.
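To illustrate why a causal state space model suits streaming, here is a minimal sketch assuming a toy scalar recurrence h_t = a*h_(t-1) + b*x_t with readout y_t = c*h_t. The coefficients and function names are illustrative assumptions, not taken from CaMBRAIN or Mamba. Processing a signal chunk by chunk while carrying the hidden state should give the same outputs as one full pass, with only a fixed-size state kept between chunks.

```python
def ssm_scan(xs, h0=0.0, a=0.9, b=0.1, c=1.0):
    """Run a toy causal linear SSM over xs; return (outputs, final_state)."""
    h = h0
    ys = []
    for x in xs:
        h = a * h + b * x   # state update sees only the past state and current input
        ys.append(c * h)    # readout
    return ys, h


def stream(chunks, **kw):
    """Process chunks in arrival order, carrying only the scalar state between them."""
    h = 0.0
    out = []
    for chunk in chunks:
        ys, h = ssm_scan(chunk, h0=h, **kw)
        out.extend(ys)
    return out


signal = [0.0, 1.0, 0.5, -0.25, 2.0, 0.0, 0.0, 1.5]
full, _ = ssm_scan(signal)
# Variable-length chunks, as they might arrive from a recording device.
streamed = stream([signal[:3], signal[3:4], signal[4:]])
```

Because the recurrence never looks ahead, no sliding window or fixed input length is needed; a bidirectional model would instead have to wait for the end of the signal.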

2605.28791 2026-05-28 cs.CL cs.AI

Skill-Conditioned Gated Self-Distillation for LLM Reasoning

Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai, Senkang Hu, Yuzhi Zhao

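AI summary: Proposes Skill-Conditioned Gated Self-Distillation (SGSD), which retrieves skill-mistake pairs from an experience-derived skill bank to build a multi-teacher pool, uses a verifier to validate each teacher's polarity, and distills informative teacher-student disagreements through a robust gated objective, improving mathematical reasoning under a weaker privileged-information assumption.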

Abstract

On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at https://github.com/walawalagoose/SGSD.
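The polarity validation and gating described above can be sketched roughly as follows. This is a hypothetical simplification, not the SGSD objective itself. It assumes a teacher's stance is the log-probability difference between teacher and student on a rollout, and that the gate is a simple band on the size of that difference. The thresholds `low` and `high` and the function name are invented for illustration.

```python
def gated_signal(teacher_lp, student_lp, success, low=0.1, high=5.0):
    """Return a signed, gated supervision signal for one teacher on one rollout.

    teacher_lp / student_lp: log-probabilities each model assigns to the rollout.
    success: verifier outcome for the rollout (True = correct).
    """
    stance = teacher_lp - student_lp       # > 0 means the teacher favors the rollout
    agrees = (stance > 0) == success       # supports a success or suppresses a failure
    validated = stance if agrees else -stance   # reverse a teacher with the wrong stance
    # Gate: drop near-zero (uncertain) and very large (extreme) disagreements.
    gate = 1.0 if low <= abs(stance) <= high else 0.0
    return gate * validated


def pooled_signal(teacher_lps, student_lp, success):
    """Average the gated signals over a pool of skill-conditioned teachers."""
    signals = [gated_signal(t, student_lp, success) for t in teacher_lps]
    return sum(signals) / len(signals)
```

Under these assumptions, a teacher that favors a rollout the verifier rejects contributes a negative signal rather than being imitated, and teachers with negligible or extreme disagreement contribute nothing.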

2605.28787 2026-05-28 cs.IR cs.AI

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy

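AI summary: Compares a Baseline Agent (searching the open web) with a Semantic Agent (using schema.org metadata) on data retrieval and finds that semantic metadata yields higher precision for actionable data (65.7% higher overall precision), while the Baseline Agent covers more questions but suffers "Last-Mile Utility" failures.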

Abstract

In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like schema.org has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for machine-actionable data and enabled discovery tools like Google Dataset Search. However, the rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: Is semantic metadata still necessary for agentic data discovery, or can agents reliably retrieve actionable data directly from the web? We present a comparative analysis of agentic data retrieval across two distinct environments: a Baseline Agent searching billions of open-web documents, and a Semantic Agent leveraging a corpus of 90 million datasets using schema.org. We deploy an "LLM-as-a-judge" evaluation pipeline, mapped directly to the FAIR principles, to assess the semantic relevance, data accessibility, and computational utility of the retrieved data. Our results reveal a clear divergence. The Semantic Agent excels at retrieving actionable data, achieving a 44.9% higher precision for metadata-rich registries and a 46.6% higher precision for pages with machine-readable downloads among its returned results. Conversely, the Baseline Agent frequently suffers "Last-Mile Utility" failures, retrieving prose-heavy pages (20.1% of results) and portal landing pages (8.5%) rather than actual data pages. While the Baseline Agent achieves higher coverage by answering 40% more questions, the Semantic Agent delivers greater accuracy, achieving 65.7% higher overall precision in retrieving FAIR-compliant datasets. We conclude that while unstructured retrieval supports broad exploratory tasks, structured ecosystems remain the indispensable foundation for reliable, execution-oriented autonomous workflows.
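As a rough illustration of what "machine-readable downloads" can mean for a schema.org record, the sketch below checks a JSON-LD `Dataset` description for `distribution` entries that carry a `contentUrl` and a data-oriented `encodingFormat`. The property names are standard schema.org vocabulary. The set of accepted formats, the function name, and the example record are assumptions made for this example and do not come from the paper's pipeline.

```python
import json

MACHINE_READABLE = {"text/csv", "application/json"}  # illustrative choice, not from the paper


def machine_readable_downloads(jsonld_text):
    """Return contentUrl values of machine-readable distributions in a Dataset record."""
    doc = json.loads(jsonld_text)
    if doc.get("@type") != "Dataset":
        return []
    dist = doc.get("distribution", [])
    if isinstance(dist, dict):  # a single distribution may appear without a list
        dist = [dist]
    return [
        d["contentUrl"]
        for d in dist
        if d.get("encodingFormat") in MACHINE_READABLE and "contentUrl" in d
    ]


# Hypothetical record: one real data file and one landing page.
record = json.dumps({
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example rainfall table",
    "distribution": [
        {"@type": "DataDownload", "encodingFormat": "text/csv",
         "contentUrl": "https://example.org/rainfall.csv"},
        {"@type": "DataDownload", "encodingFormat": "text/html",
         "contentUrl": "https://example.org/landing"},
    ],
})
urls = machine_readable_downloads(record)
```

A prose-heavy page or portal landing page offers no such field to inspect, which is one way to picture the "Last-Mile Utility" failures reported for the Baseline Agent.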

2605.28782 2026-05-28 cs.CL

Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay

Mariah Al Giptiah Binte Yusoff, Jakin Tan, Bocheng Chen, Guangliang Liu, Xi Chen

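AI summary: Builds the MalayPrag benchmark and proposes five attributes to systematically evaluate and improve how large language models handle discourse particles in colloquial Malay.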

Abstract

Discourse particles, such as "well" and "kind of", are crucial components that enable LLMs to "speak" more like humans. They are used to convey emotions, intentions, and interpersonal meanings. However, existing studies have not yet built a comprehensive understanding of LLMs' capabilities in handling discourse particles. Moreover, the limited number of studies focuses primarily on high-resource languages such as English, with little attention paid to Southeast Asian languages. In this paper, we (1) propose MalayPrag, a benchmark designed to systematically evaluate and analyze LLMs' capabilities in handling discourse particles in colloquial Malay; and (2) introduce five attributes that provide a linguistically grounded, unified framework for interpreting the pragmatic functions of discourse particles. Applying these two contributions, we prompt ten off-the-shelf LLMs to perform three prediction tasks. The experimental results reveal substantial challenges for current LLMs in accurately connecting discourse particles with their pragmatic functions in Malay. The provision of the five attributes designed in this study is found to significantly improve these connections, highlighting the need for structured scaffolding for models' pragmatic competence.

2605.28780 2026-05-28 cs.CV cs.LG

Bias Leaves a Gradient Trail: Label-Free Bias Identification via Gradient Probes on Concept Decompositions

Thomas Vitry, Kieran Edgeworth, Stefan Wermter, Jae Hee Lee

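AI summary: Proposes a bias-label-free, post-hoc method that extracts concept vectors with non-negative matrix factorization and uses gradient signals from misclassified examples to identify spurious correlations in vision models, improving worst-group accuracy without retraining.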

Comments: Accepted to the 49th German Conference on Artificial Intelligence (KI2026)

Abstract

Vision classifiers can exploit spurious correlations, achieving high in-distribution accuracy yet failing under distribution shift. Existing approaches to bias mitigation and analysis often depend on curated datasets, spurious-attribute or group labels, or retraining, which may be infeasible once a model is deployed or the relevant bias is unknown. We present a bias-label-free, post-hoc method for identifying spurious concepts in frozen vision models, relying only on standard class labels from a held-out audit dataset. For each target class, we collect patches from inputs predicted as that class and apply non-negative matrix factorization to intermediate activations to obtain a bank of interpretable concept vectors. Candidate concepts are then ranked with a bias estimator derived from their interaction with backpropagated gradients on misclassified examples: bias concepts tend to get activated when correcting false negatives and suppressed when correcting false positives. On Colored MNIST and Waterbirds the method recovers concepts aligned with the known spurious cue, and on CelebA it surfaces decision-relevant directions that only partially coincide with the annotated gender attribute; suppressing the top-ranked concepts at inference time improves worst-group accuracy by up to 17.9 percentage points on Waterbirds and 10.4 on CelebA without any retraining or parameter updates. Our method identifies decision-relevant spurious directions that need not coincide with annotated ones, providing both an interpretable auditing tool and an actionable debiasing handle for frozen vision models. Code is available at https://github.com/vitryt/label-free-bias-identification.
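The ranking idea can be sketched under strong simplifying assumptions. Concept vectors are taken as given (the NMF step is omitted), and the estimator below is a hypothetical stand-in, not the paper's definition. For each concept it measures how far a loss-reducing step (along the negative gradient with respect to the activations) pushes the concept on false negatives, minus the same quantity on false positives. A concept that is raised to fix false negatives and lowered to fix false positives scores high.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bias_scores(concepts, fn_grads, fp_grads):
    """Score concepts by how a loss-reducing step moves them on FN versus FP examples.

    concepts: list of concept vectors in activation space.
    fn_grads / fp_grads: loss gradients w.r.t. activations on false negatives / positives.
    """
    scores = []
    for c in concepts:
        # A gradient-descent step moves activations along -grad.
        fn_push = sum(dot([-g for g in grad], c) for grad in fn_grads) / len(fn_grads)
        fp_push = sum(dot([-g for g in grad], c) for grad in fp_grads) / len(fp_grads)
        scores.append(fn_push - fp_push)
    return scores


# Toy 2-D activation space with two hand-made concepts and one gradient per error type.
concepts = [[1.0, 0.0], [0.0, 1.0]]
fn_grads = [[-1.0, 0.1]]
fp_grads = [[0.8, 0.0]]
scores = bias_scores(concepts, fn_grads, fp_grads)
ranking = sorted(range(len(concepts)), key=lambda i: scores[i], reverse=True)
```

In this toy setup the first concept ranks highest, and it would be the candidate to suppress at inference time.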

2605.28779 2026-05-28 cs.CL cs.CV

The Abstraction Gap in Vision-Language Causal Reasoning

Chinh Hoang, Mohammad Rashedul Hasan

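AI summary: To separate linguistic fluency from faithful causal reasoning in the causal explanations of vision-language models (VLMs), proposes a dual-probe methodology and the Abstraction Gap (AG) metric; evaluation on the CAGE benchmark shows a large AG for seven of eight models, while one model reaches near-zero AG, indicating that the capability depends on pretraining and architectural choices.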

Abstract

Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic quality. The Chain-Text Probe requires models to first generate explicit causal chains. The Abstraction Gap (AG) metric quantifies the normalized performance difference. Evaluating eight VLMs on CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl's causal hierarchy, we find seven models exhibit AG exceeding 0.50 with text scores of 6--8 but chain scores below 2.5. Fine-tuning on 45,000 chain-annotated examples fails to close the gap. However, one model achieves near-zero AG. The capability exists within current VLM architectures and depends on pretraining and architectural choices. CAGE provides a diagnostic tool for assessing faithful causal reasoning in VLMs.
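The abstract does not give the AG formula, so the following is only a worked example under an assumed normalization: the drop from the text score to the chain score, divided by the text score. This choice is consistent with the reported ranges (a text score of 6 and a chain score of 2.5 gives about 0.58, above 0.50), but the paper's exact definition may differ.

```python
def abstraction_gap(text_score, chain_score):
    """Assumed form of AG: relative drop from the text probe to the chain probe."""
    if text_score <= 0:
        raise ValueError("text_score must be positive")
    return (text_score - chain_score) / text_score


# Scores in the ranges quoted in the abstract (text 6 to 8, chain below 2.5).
low_text = abstraction_gap(6.0, 2.5)
high_text = abstraction_gap(8.0, 2.5)
no_gap = abstraction_gap(7.0, 7.0)
```

Under this assumption, a model whose explicit causal chains score as well as its fluent answers has a gap of zero, matching the "near-zero AG" case described above.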

2605.28778 2026-05-28 cs.CL

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

Gabrielle Kaili-May Liu, Arman Cohan

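AI summary: Presents the first systematic study of whether large language models (LLMs) can associate their linguistic confidence markers with intrinsic confidence in a stable and generalizable way, and of how contextual features affect this; proposes 7 metrics for the stability of marker internal confidence and finds that LLMs remain faithfully miscalibrated even under a model-centric interpretation of marker meanings.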

Comments: Code: https://github.com/yale-nlp/marker_internal_confidence

Abstract

LLMs' linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., "it is likely...") in a human-aligned fashion, it remains unclear whether models can apply their own linguistic confidence framework to associate markers with specific confidence levels in a stable and generalizable way, and how contextual features impact this ability. We conduct the first systematic study of this question, formalizing _marker internal confidence_ (MIC) as the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain. We present 7 metrics to evaluate the stability of MICs within and across distributions. Applying our analysis framework to diverse models and tasks, we find that LLMs remain faithfully miscalibrated even under model-centric interpretation of marker meanings, struggling to differentiate markers by internal confidence across distributions despite preserving a somewhat consistent ranking order across tasks. This supplies critical, complementary evidence to existing work toward a holistic understanding of faithful calibration in LLMs, emphasizing the need for more aligned and stable marker use to improve trustworthiness and reliability.
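A minimal sketch of the MIC idea, assuming it can be estimated as the mean intrinsic confidence over the responses in which a marker appears. The estimator, the toy numbers, and the helper names are illustrative assumptions; the paper's 7 metrics are not reproduced here.

```python
from collections import defaultdict


def marker_confidence(records):
    """records: iterable of (marker, intrinsic_confidence) pairs for one task."""
    sums = defaultdict(lambda: [0.0, 0])
    for marker, conf in records:
        sums[marker][0] += conf
        sums[marker][1] += 1
    return {m: s / n for m, (s, n) in sums.items()}


def ranking(mic):
    """Markers ordered from highest to lowest estimated confidence."""
    return sorted(mic, key=mic.get, reverse=True)


# Hypothetical observations from two task domains.
task_a = [("certainly", 0.9), ("certainly", 0.7), ("likely", 0.6), ("possibly", 0.5)]
task_b = [("certainly", 0.6), ("likely", 0.55), ("possibly", 0.5), ("possibly", 0.3)]
mic_a, mic_b = marker_confidence(task_a), marker_confidence(task_b)
same_order = ranking(mic_a) == ranking(mic_b)
shift = abs(mic_a["certainly"] - mic_b["certainly"])
```

In the toy data the marker ordering is the same in both tasks while the confidence attached to "certainly" drops by about 0.2, which mirrors the reported pattern of consistent ranking but unstable values across distributions.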

2605.28775 2026-05-28 cs.LG cs.AI cs.CL

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Suji Kim, Kangsan Kim, Sung Ju Hwang

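AI summary: Proposes LearnWeak, a framework that uses a stronger reference agent to identify a student agent's weaknesses in a target domain, automatically synthesizes targeted tasks and supervision, and adds an error-aware specialization objective, improving small computer-use agents across eight OSWorld domains.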

Abstract

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.

2605.28774 2026-05-28 cs.CL

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, Byung-Kwan Lee

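AI summary: To address the asymmetry between thinking and tool use in multimodal agentic reasoning (the Thinking-Acting Gap), proposes AXPO, which fixes the thinking prefix and resamples the tool call and its continuation, combined with uncertainty-based prefix selection, improving average Pass@1 and Pass@4 over SFT+GRPO.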

Comments: Project page: https://byungkwanlee.github.io/AXPO-page/

Abstract

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary action). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when attempted, the tool-using rollouts within a group are all wrong on ~40% of questions, suppressing the learning signal at the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model with SFT+AXPO surpasses the 32B Base on Pass@4 with 4 times fewer parameters.
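The resampling step can be pictured with a small sketch. Everything concrete here is a hypothetical illustration: the rollout dictionaries, the `uncertainty` field, the rule of picking the most uncertain prefix, and the stub sampler are assumptions, since the abstract does not specify how uncertainty is measured or used.

```python
def all_wrong_tool_subgroup(group):
    """Return the tool-using rollouts if every one of them is wrong, else []."""
    tool = [r for r in group if r["used_tool"]]
    if tool and not any(r["correct"] for r in tool):
        return tool
    return []


def pick_prefix(tool_rollouts):
    """Pick the thinking prefix with the highest (hypothetical) uncertainty score."""
    return max(tool_rollouts, key=lambda r: r["uncertainty"])["prefix"]


def resample(prefix, sample_continuation, k):
    """Keep the prefix fixed and draw k fresh tool calls plus continuations."""
    return [prefix + sample_continuation(prefix, i) for i in range(k)]


# One rollout group for a single question; only the tool-free rollout is correct.
group = [
    {"prefix": "think-A ", "used_tool": True, "correct": False, "uncertainty": 0.2},
    {"prefix": "think-B ", "used_tool": True, "correct": False, "uncertainty": 0.7},
    {"prefix": "think-C ", "used_tool": False, "correct": True, "uncertainty": 0.1},
]
subgroup = all_wrong_tool_subgroup(group)
prefix = pick_prefix(subgroup) if subgroup else None
# Stub sampler standing in for the policy model.
new_rollouts = resample(prefix, lambda p, i: f"<tool_call #{i}>", k=3) if prefix else []
```

The point of fixing the prefix is that the extra samples differ only from the tool call onward, so any reward difference among them is attributable to the tool-use decision rather than to the preceding reasoning.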

2605.28773 2026-05-28 cs.CL cs.AI cs.LG cs.MA cs.MM

Rethinking Memory as Continuously Evolving Connectivity

Jizhan Fang, Buqiang Xu, Zhixian Wang, Haoliang Cao, Xinle Deng, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Ying Wei, Guozhou Zheng, Feiyu Xiong, Haofen Wang, Huajun Chen, Ningyu Zhang

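AI summary: Proposes FluxMem, a framework that models memory as a heterogeneous graph and evolves its topology through three stages (initial connection formation, feedback-driven refinement, and long-term consolidation) to address the brittleness of existing memory-augmented LLM agents in dynamic environments.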

Comments: Ongoing work

Abstract

Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelines, which is brittle in dynamic agentic environments where feedback, task variation, and heterogeneous signals continuously reshape what should be remembered and how it should be connected. To address this, we propose FluxMem, a connectivity-evolving memory framework that models memory as a heterogeneous graph and progressively refines its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. During execution, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills recurrent successful trajectories into reusable procedural circuits, guided by one metric for memory generalizability and evolutionary maturity. Across three fundamentally distinct benchmarks including LoCoMo, Mind2Web, and GAIA, FluxMem achieves consistent state-of-the-art performance, demonstrating strong adaptation and generalization in complex agentic environments. The code will be open-sourced in https://github.com/zjunlp/LightMem.

2605.28769 2026-05-28 cs.LG

Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations

Kevin Y. Li, Asher Trockman, Ananda Theertha Suresh, Ziteng Sun

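AI summary: Proposes Oryx, a hybrid model that switches flexibly between mixers (attention and linear recurrence) along the sequence axis for efficient long-context processing; at the 1.4B scale it outperforms its single-mixer baselines by at least 0.7 percentage points on averaged language modeling tasks.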

Abstract

Softmax attention is the cornerstone of modern large language models, but its memory scales linearly and compute quadratically with sequence length. Linear recurrent models, such as linear attention and state space models, have become widely studied as alternatives to attention due to their linear compute and constant memory. While these sub-quadratic token mixing methods, or mixers, achieve promising efficiency gains and competitive results on a wide range of benchmarks, current linear recurrent models still lag behind on tasks that require long-context retrieval or in-context learning. A growing body of work studies hybrid architectures that attempt to mitigate these trade-offs by statically interleaving or merging attention and recurrent blocks. In this work, we explore a new axis of developing hybrid models: across the token sequence. We propose Oryx, a hybrid model that can, throughout a sequence, flexibly switch between different mixers, for example quadratic attention for rich context utilization and linear recurrences for efficient generation. Oryx ties at least 90% of its parameters across mixers, enabling attention and recurrent modes to operate over shared internal representations. We validate our design with Mamba-2 and Gated DeltaNet variants, up to 1.4B models. Under fixed token budgets and a mixed-training strategy, Oryx achieves comparable or better performance than its single-mixer baselines. At the 1.4B scale, all instances of Oryx outperform their respective baselines by at least 0.7 percentage points on averaged language modeling tasks. On retrieval tasks, Oryx achieves performance comparable to the Transformer baseline even when processing only a tiny fraction (<10%) of the tokens in attention mode. These results suggest that attention and linear recurrent models can share internal representations, and motivate sequence-axis hybridization as a promising direction.
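To make the "linear compute and constant memory" claim concrete, the sketch below uses plain unnormalized linear attention, a simpler stand-in for the Mamba-2 and Gated DeltaNet mixers named above, and not the Oryx architecture. It computes the same causal outputs two ways: a quadratic form that revisits every earlier token, and a recurrent form that only updates a fixed-size state matrix S with the outer product of each key and value.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def quadratic_form(qs, ks, vs):
    """o_t = sum over s <= t of (q_t . k_s) * v_s, recomputed from all past tokens."""
    out = []
    for t, q in enumerate(qs):
        acc = [0.0] * len(vs[0])
        for s in range(t + 1):
            w = dot(q, ks[s])
            acc = [a + w * v for a, v in zip(acc, vs[s])]
        out.append(acc)
    return out


def recurrent_form(qs, ks, vs):
    """Same outputs from a running state S = sum of outer(k_s, v_s)."""
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # size is independent of sequence length
    out = []
    for q, k, v in zip(qs, ks, vs):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


qs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
ks = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
vs = [[1.0, 2.0], [0.0, 1.0], [4.0, 0.0]]
quad = quadratic_form(qs, ks, vs)
rec = recurrent_form(qs, ks, vs)
```

The identity holds only because there is no softmax; softmax attention needs the full key-value history, which is the memory growth the abstract contrasts with recurrent mixers.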

2605.28767 2026-05-28 cs.LG stat.ML

Principled Algorithms for Optimizing Generalized Metrics in Multi-Label Learning

Mehryar Mohri, Yutao Zhong

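AI summary: Building on H-consistency theory, designs decomposable surrogate loss functions and proposes the MMO family of algorithms for optimizing generalized linear-fractional metrics in multi-label learning, validating scalability and strong performance on large-scale datasets.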

Abstract

Many real-world classification tasks require predicting multiple labels per instance, necessitating the optimization of complex evaluation metrics such as the $F$-measure and Jaccard index. While the Empirical Utility Maximization (EUM) framework is natural for these population-level metrics, existing theoretical results are largely limited to asymptotic Bayes-consistency. In this paper, we develop principled learning algorithms for optimizing a broad class of generalized metrics within the EUM framework, grounded in the stronger notion of $H$-consistency. Our key contribution is the design of novel surrogate loss functions for multi-label learning that admit provable $H$-consistency bounds, enabling optimization with non-asymptotic guarantees tailored to the hypothesis class and finite samples. Crucially, we prove these combinatorially formulated surrogates decompose exactly, operating in strictly $O(l)$ time without approximations. Building on this foundation, we introduce MMO (Multi-Label Metric Optimization), a new family of algorithms for optimizing generalized linear-fractional metrics. We validate our approach through extensive experiments, demonstrating robust scalability and superior performance over state-of-the-art continuous baselines on large-scale datasets (MS-COCO, Reuters-21578) in high-sparsity, deep learning regimes. Our results offer both theoretical rigor and practical effectiveness for general multi-label metric optimization.
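For readers unfamiliar with the term, "linear-fractional" means a ratio of two affine functions of the confusion counts. The sketch below writes the F-measure and the Jaccard index in that form using their standard definitions. The coefficient-tuple encoding and helper names are choices made for this example, and the code evaluates the metrics only; it does not implement the MMO surrogate losses.

```python
def counts(y_true, y_pred):
    """True positives, false positives, false negatives for one label set."""
    tp = len(y_true & y_pred)
    return tp, len(y_pred - y_true), len(y_true - y_pred)


def linear_fractional(tp, fp, fn, num, den):
    """Ratio of affine functions of (TP, FP, FN); num/den = (c0, c_tp, c_fp, c_fn)."""
    n = num[0] + num[1] * tp + num[2] * fp + num[3] * fn
    d = den[0] + den[1] * tp + den[2] * fp + den[3] * fn
    return n / d if d else 0.0


def f_beta(tp, fp, fn, beta=1.0):
    # F_beta = (1 + b^2) TP / ((1 + b^2) TP + b^2 FN + FP)
    b2 = beta * beta
    return linear_fractional(tp, fp, fn, (0, 1 + b2, 0, 0), (0, 1 + b2, 1, b2))


def jaccard(tp, fp, fn):
    # Jaccard = TP / (TP + FP + FN)
    return linear_fractional(tp, fp, fn, (0, 1, 0, 0), (0, 1, 1, 1))


tp, fp, fn = counts({"cat", "dog", "bird"}, {"cat", "dog", "car"})
f1 = f_beta(tp, fp, fn)
jac = jaccard(tp, fp, fn)
```

Because the counts appear in both numerator and denominator, such metrics do not decompose into a sum of per-label losses, which is why dedicated surrogates are needed.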

2605.28764 2026-05-28 cs.AI cs.DC cs.MA

SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks

Edwin Jose

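AI summary: Proposes SwarmHarness, a decentralised protocol that combines DHT-based registration, utility-function routing, and Shapley-value incentives so that a compute swarm can self-organise and allocate tasks without a central authority.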

Abstract

Vast quantities of compute (GPU cycles on personal workstations, idle inference servers, and edge devices between jobs) go unused because no incentive-aligned protocol exists for their owners to share them safely and profitably. Existing approaches either require a trusted central coordinator (cloud marketplaces), demand heavy blockchain infrastructure (Golem, BrokerChain), or lack an incentive layer entirely (BOINC, Petals). We propose SwarmHarness, a decentralised protocol in which HarnessAPI skill nodes self-organise into a compute swarm without any central authority. SwarmHarness has three interlocking components: a SwarmRegistry built on a Distributed Hash Table (DHT) for peer discovery and capability advertisement; a SwarmRouter that dispatches tasks to nodes using a utility function over capability, load, latency, and trust; and SwarmCredit, an incentive mechanism that attributes compute-credit rewards to contributing nodes via a Shapley-value approximation. Nodes earn credits by serving tasks and spend credits to submit them; idle nodes that never contribute drain credits and lose routing priority, creating a self-regulating participation economy. As nodes specialise toward high-reward skills and routing signals act as digital pheromones, the network exhibits emergent collective intelligence analogous to biological swarms. Beyond compute sharing, SwarmHarness is a foundational primitive for autonomous distributed AI agent networks in which agents hire compute, route subtasks, and settle credits without human intermediation.
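Two of the components above lend themselves to a small sketch. The routing utility here is a hypothetical unweighted combination of capability, load, latency, and trust, since the abstract does not give the actual function. The credit attribution uses exact Shapley values by enumerating all orderings, which is feasible only for a handful of nodes; SwarmCredit is described as using an approximation instead. Node names, numbers, and the coalition value function are invented for illustration.

```python
from itertools import permutations


def route(nodes):
    """Pick the node with the highest (assumed) utility."""
    def utility(n):
        return n["capability"] - n["load"] - n["latency"] + n["trust"]
    return max(nodes, key=lambda name: utility(nodes[name]))


def shapley(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            phi[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: phi[p] / len(orders) for p in players}


THROUGHPUT = {"a": 4.0, "b": 2.0, "c": 0.0}  # toy per-node contribution


def value(coalition):
    """Toy coalition value: summed throughput plus a bonus when a and b cooperate."""
    bonus = 3.0 if {"a", "b"} <= coalition else 0.0
    return sum(THROUGHPUT[p] for p in coalition) + bonus


nodes = {
    "n1": {"capability": 1.0, "load": 0.9, "latency": 0.2, "trust": 0.8},
    "n2": {"capability": 1.0, "load": 0.2, "latency": 0.3, "trust": 0.7},
}
chosen = route(nodes)
credits = shapley(["a", "b", "c"], value)
```

In the toy coalition the cooperation bonus is split evenly between the two nodes that create it, and the idle node earns nothing, which is the property that makes Shapley-style attribution attractive for a participation economy.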

2605.28763 2026-05-28 cs.AI

CubePart: An Open-Vocabulary Part-Controllable 3D Generator

Yiheng Zhu, Kangle Deng, Jean-Philippe Fauconnier, Inaki Navarro, Daiqing Li, Ava Pun, Yinan Zhang, Peiye Zhuang, Xiaoxia Sun, Maneesh Agrawala, Kiran Bhat, Tinghui Zhou

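AI summary: Proposes CubePart, a framework that generates part-level 3D meshes from a user-defined, open-vocabulary parts schema; the resulting assets can be used in game engines directly, without manual post-processing.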

Comments: SIGGRAPH 2026. Project Page: https://cubepart.github.io/

Abstract

Interactive 3D assets used in games and simulation are typically decomposed into specific semantic parts to support animation, physics, and scripted behaviors, yet most generative 3D models produce either monolithic meshes or arbitrary part decompositions that cannot be aligned with application-specific requirements. We present CubePart, a generative framework for open-vocabulary, part-controllable 3D mesh generation that exposes part structure as an explicit inference-time control signal. Given a global text prompt and a user-defined parts schema expressed as an open-ended list of part names, our method generates a set of meshes - one per schema element - that assemble into a coherent object while respecting the specified semantic structure. To enable this capability, we introduce a scalable data pipeline to construct a large open-vocabulary, part-labeled 3D dataset, along with a two-stage generative architecture that separates global shape synthesis from part-level decoding. We demonstrate that the resulting assets can be directly integrated into game engines and driven by animation and behavior scripts without manual post-processing. Project Page: https://cubepart.github.io/

2605.28760 2026-05-28 cs.LG

LLM Zeroth-Order Fine-Tuning is an Inference Workload

Zelin Li, Caiwen Ding

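AI summary: Shows that LLM zeroth-order fine-tuning is an inference-dominated workload; executing its repeated scoring phase in a serving runtime yields an 8.13x speedup on a 20k-step OPT-13B SST-2 LoZO run while reaching 0.922 final evaluation accuracy.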

Comments: 12 pages, 4 figures, 3 tables, including appendix and references

Abstract

Zeroth-order (ZO) fine-tuning is attractive for large language models because it replaces backpropagation with forward objective evaluations. Existing implementations nevertheless execute ZO algorithms inside conventional training loops, even though their dominant work is repeated scoring under nearby parameter states. This creates a workload-runtime mismatch: the algorithm asks for structured inference-style scoring, while the system exposes a sequence of fragmented training-loop steps. We show that LLM ZO fine-tuning is an inference-dominated workload and execute its repeated scoring phase through a serving runtime. On OPT-13B SST-2, the resulting vLLM execution path completes the 20k-step LoZO run in 0.51 estimated training hours versus 4.15 hours for the official LoZO baseline under the matched LoRA-only setting, an 8.13x speedup, while reaching 0.922 final evaluation accuracy and 0.931 final full-validation accuracy. In core-step scaling experiments across OPT-1.3B to OPT-13B, the same runtime reorganization gives 2.34x--7.72x speedups. A MeZO-style high-rank factorized experiment shows that the same runtime paradigm can track a MeZO-like loss trajectory while running up to 2.55x faster. More broadly, representing ZO updates as dynamic adapter states suggests a practical path toward inference-time training, where lightweight adaptation can be scheduled as an inference-like workload rather than as a separate training job.
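The claim that ZO fine-tuning "replaces backpropagation with forward objective evaluations" can be seen in a minimal two-point estimator in the style of SPSA and MeZO. The quadratic toy loss, step size, and perturbation scale below are assumptions for illustration; this is not the LoZO method or the vLLM execution path from the paper. Each step scores the objective at two nearby parameter states and never computes a gradient by backpropagation.

```python
import random


def loss(theta, target=(1.0, -2.0, 3.0)):
    """Toy objective standing in for a model's forward-pass loss."""
    return sum((t - g) ** 2 for t, g in zip(theta, target))


def zo_step(theta, rng, lr=0.05, eps=1e-3):
    """One zeroth-order update using exactly two forward evaluations."""
    z = [rng.gauss(0.0, 1.0) for _ in theta]              # random direction
    plus = loss([t + eps * zi for t, zi in zip(theta, z)])
    minus = loss([t - eps * zi for t, zi in zip(theta, z)])
    g = (plus - minus) / (2 * eps)                        # directional derivative estimate
    return [t - lr * g * zi for t, zi in zip(theta, z)]


rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
initial = loss(theta)
for _ in range(300):
    theta = zo_step(theta, rng)
final = loss(theta)
```

The inner work of every step is "score two parameter states", which is the repeated-scoring pattern the abstract argues is better served by an inference runtime than by a training loop.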

2605.28751 2026-05-28 cs.LG cs.AI cs.CL

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

Kunhao Zheng, Pierre Chambon, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin Negrevergne, Gabriel Synnaeve

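AI summary: Extrapolative weight averaging extends the Pareto front between fine-tuned checkpoints without additional RL training, trading off correctness against efficiency in competitive programming and improving inference-time performance.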

Comments: 54 pages

English abstract

Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extrapolative weight averaging can extend such frontiers to new checkpoints useful at inference time, without additional RL training, remains unclear. We study this question in RL for competitive programming, where hidden unit tests under time and memory limits enforce both functional correctness and computational efficiency. Starting from a shared initialization, we train checkpoints under nested unit-test coverage: low-coverage rewards require passing smaller-input tests, while high-coverage rewards require passing progressively larger tests up to the full suite. This sweep reveals the emergence of a correctness-efficiency frontier: on hard problems, higher-coverage reward reduces optimization failures but increases correctness failures, leaving solve rate nearly unchanged. Interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. Both the frontier and its extrapolative continuation appear across three inference settings, pure reasoning, tool use, and agentic coding, and across two model scales, 32B and 7B. At the problem level, moving along the frontier changes which problems are solved, making extrapolated checkpoints complementary policies in inference-time scaling. Ensembles with extrapolative weight averaging broaden coverage and improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. These results show that nested unit-test coverage in code RL induces a frontier that extrapolative weight averaging can navigate, extend, and exploit.
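
The difference between interpolating and extrapolating two checkpoints can be sketched in a few lines. The abstract does not give the exact formula, so the code assumes the common linear form `theta(alpha) = (1 - alpha) * theta_low + alpha * theta_high`, where `alpha` in [0, 1] interpolates and `alpha` outside that range extrapolates. The function name `combine_checkpoints` and the dictionary-of-lists checkpoint layout are hypothetical stand-ins for real model state.

```python
def combine_checkpoints(theta_low, theta_high, alpha):
    """Linearly combine two checkpoints (illustrative assumption, not the paper's code).

    alpha in [0, 1] interpolates between the endpoints;
    alpha < 0 or alpha > 1 extrapolates beyond them.
    """
    if theta_low.keys() != theta_high.keys():
        raise ValueError("checkpoints must share the same parameter names")
    combined = {}
    for name in theta_low:
        low, high = theta_low[name], theta_high[name]
        if len(low) != len(high):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        combined[name] = [(1 - alpha) * a + alpha * b for a, b in zip(low, high)]
    return combined


if __name__ == "__main__":
    # Toy checkpoints standing in for low- and high-coverage policies.
    low = {"w": [0.0, 2.0]}
    high = {"w": [1.0, 4.0]}
    print(combine_checkpoints(low, high, 0.5))   # midpoint between endpoints
    print(combine_checkpoints(low, high, 1.5))   # past the high-coverage endpoint
    print(combine_checkpoints(low, high, -0.5))  # past the low-coverage endpoint
```

No training occurs in either case: each new checkpoint is a weighted combination of existing weights, which is why the abstract describes the extension as requiring no additional RL training.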
