arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26189 2026-05-28 cs.LG cs.AI

Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training

Yingying Cheng, Jinquan Shi, Li Zhou, Zhiyang He, Zhaoyi Sun, Fan Zhang, Jie Sun

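Affiliations * OpenPangu-Embedded-1B

AI summary Targets two orthogonal failure modes in HiF8 W8A8 quantization-aware training (amax saturation and catastrophic forgetting) with a conservative max-algorithm DTS strategy over a 64-step history window and a 500-step BF16 warmup followed by low-learning-rate QAT, reaching near-BF16-baseline performance on OpenPangu-Embedded-1B.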

Abstract

Quantization-aware training (QAT) with low-bit floating-point formats enables efficient LLM deployment, yet introduces subtle failure modes invisible to standard training metrics. We present a systematic study of HiF8 W8A8 QAT for OpenPangu-Embedded-1B through the lens of Delayed Tensor Scaling (DTS). Across eight controlled experiments, we identify and disentangle two orthogonal failure modes: (i) amax saturation, where delayed scale estimates silently corrupt knowledge-sensitive representations via forward-pass clipping, and (ii) catastrophic forgetting, where an aggressive learning rate overwrites pretrained commonsense knowledge independently of quantization. Neither is detectable from training loss alone. We address amax saturation with a conservative max-algorithm DTS strategy over a 64-step history window, and mitigate forgetting via a 500-step BF16 warmup followed by QAT at a learning rate of $10^{-5}$. Both fixes are necessary and sufficient: our final configuration achieves 0.43% MMLU drop, 0.58% HellaSwag drop, and 0.22% ARC-Challenge drop versus a matched BF16 baseline, with a training loss APE of only 0.11% over 10,000 steps.
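The max-over-history idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `MaxWindowScale`, the `fmt_max` parameter (a placeholder for the largest representable magnitude of the target format, not the real HiF8 value), and the helper `clipped_fraction` are all assumptions made for illustration.

```python
from collections import deque


class MaxWindowScale:
    """Delayed scale estimator: the scale used now comes from amax values of earlier steps."""

    def __init__(self, window=64, fmt_max=1.0, eps=1e-12):
        self.history = deque(maxlen=window)  # keeps only the most recent `window` amax values
        self.fmt_max = fmt_max               # placeholder format maximum (assumption)
        self.eps = eps                       # guards against division by zero

    def scale(self):
        # Conservative choice: base the scale on the largest amax in the window.
        if not self.history:
            return 1.0
        return self.fmt_max / max(max(self.history), self.eps)

    def update(self, tensor_values):
        # Record this step's absolute maximum for use in later steps.
        self.history.append(max(abs(v) for v in tensor_values))


def clipped_fraction(values, scale, fmt_max):
    """Fraction of values whose scaled magnitude exceeds the format maximum."""
    return sum(abs(v * scale) > fmt_max for v in values) / len(values)
```

Taking the window maximum rather than the most recent amax trades some resolution for a lower chance that a sudden activation spike is clipped, which is one way to read the "conservative" strategy the abstract describes.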

2605.26114 2026-05-28 cs.AI cs.CL

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

Dingbang Wu, Rui Hao, Haiyang Wang, Shuzhe Wu, Han Xiao, Zhenghong Li, Bojiang Zhou, Zheng Ju, Zichen Liu, Lue Fan, Zhaoxiang Zhang

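Affiliations * Institute of Automation, Chinese Academy of Sciences; Peking University; The Chinese University of Hong Kong

AI summary Proposes MobileGym, a lightweight, fully controllable, browser-based mobile environment that uses structured JSON state to provide verifiable outcome signals and low-cost parallel reinforcement learning, accompanied by a benchmark of 416 parameterized task templates.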

Comments Project page: https://mobilegym.github.io

Abstract

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
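As a rough illustration of deterministic state-based judging over structured JSON state, the sketch below checks a set of expected values against a nested state and returns both a verdict and a fractional reward. The state schema, the dotted-path convention, and the function names `lookup` and `judge` are hypothetical and are not taken from MobileGym.

```python
import json


def lookup(state, path):
    """Follow a dotted path such as 'alarm.hour' through nested dictionaries."""
    node = state
    for key in path.split("."):
        node = node[key]
    return node


def judge(state, conditions):
    """Compare expected values with the state.

    conditions maps a dotted path to its expected value (hypothetical format).
    Returns (verdict, reward): verdict is True only if every condition holds,
    reward is the fraction of conditions satisfied.
    """
    hits = 0
    for path, expected in conditions.items():
        try:
            hits += lookup(state, path) == expected
        except (KeyError, TypeError):
            pass  # a missing or non-dictionary node counts as unsatisfied
    return hits == len(conditions), hits / len(conditions)


state = json.loads('{"alarm": {"enabled": true, "hour": 7}}')
verdict, reward = judge(state, {"alarm.enabled": True, "alarm.hour": 8})
```

Because the check is a pure function of the final state, the same inputs always give the same verdict, and the fractional score is one simple way a single judge could yield a denser signal than pass or fail.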

2605.25815 2026-05-28 cs.AI cs.MA

Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

Qiming Ye, Peixain Zhang, Yupeng He, Zifan Peng, Gareth Tyson

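Affiliations * The Hong Kong University of Science and Technology (Guangzhou)

AI summary Analyzes 1.5M assets and 128K agents in the EvoMap network to expose the trade-offs its design choices create in reusability, evolution, and auditability, finding that the reward mechanism leaves 98% of assets unreused, the scoring system is easy to manipulate, and the validation mechanism is flawed.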

Abstract

Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how these decentralized ecosystems operate in practice remains largely unexplored. We present the first large-scale empirical study of EvoMap, a prominent A2A collaboration network. By analyzing over 1.5M assets and 128K agents, we show how design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability. First, EvoMap's credit economy rewards agents for publishing valuable assets. Although this design encourages participation at scale, rewards are tied primarily to publication rather than adoption. This leads agents to mass-produce assets to accumulate credits. As a result, 98% of assets are never reused, while rewards become highly concentrated among a small fraction of agents. Second, EvoMap employs an algorithm (referred to as GDI) to score and rank the quality of these shared assets. We demonstrate that this scoring system is flawed: rather than measuring objective performance, an asset's rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified). This allows agents to trivially manipulate their assets' scores. Finally, EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly. Because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., console.log()). Our findings show that future A2A collaboration networks cannot rely on unverified self-reporting alone. Scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.

2605.25767 2026-05-28 cs.CV

SAFE-Diff: Scale-Aware Attention and Feature-Dispersive Diffusion with Uncertainty Estimation for Contrast-Enhanced Breast MRI Synthesis

Tianyu Zhang, Xinglong Liang, Jarek van Dijk, Luyi Han, Chunyao Lu, Antonio Portaluri, Xinghe Xie, Yaofei Duan, Nika Rasoolzadeh, Xin Wang, Yuan Gao, Muzhen He, Yue Sun, Jonas Teuwen, Tao Tan, Ritse Mann

Affiliations * Department of Medical Imaging, Radboud University Medical Center; Department of Radiology, Netherlands Cancer Institute; Maastro Clinic; Faculty of Applied Science, Macao Polytechnic University; Department of Radiation Oncology, Netherlands Cancer Institute

AI summary Proposes SAFE-Diff, which combines scale-aware attention, feature-dispersive diffusion, and uncertainty estimation to address the complex lesion textures and heterogeneous enhancement patterns that make contrast-enhanced breast MRI synthesis difficult.

Comments Early accepted by MICCAI 2026

Abstract

Synthesizing high-fidelity contrast-enhanced MRI is clinically valuable for safer and more efficient breast cancer screening, yet remains challenging due to complex lesion textures and heterogeneous enhancement patterns.

2605.25378 2026-05-28 cs.CV cs.AI

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

Fangtai Wu, Hailong Guo, Shijie Huang, Jiayi Song, Yubo Huang, Mushui Liu, Zhao Wang, Yunlong Yu, Jiaming Liu, Ruihua Huang

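Affiliations * Zhejiang University; Qwen Applications Business Group of Alibaba; Xi'an Jiaotong University

AI summary Proposes CollectionLoRA, a multi-teacher on-policy distillation framework that consolidates up to 50 different effect LoRAs and few-step generation into a single LoRA, resolving parameter interference and lowering deployment cost.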

Abstract

Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via Low-Rank Adaptation (LoRA). As the number of desired effects grows, storing and dynamically loading these numerous effect LoRAs significantly increases deployment overhead. Furthermore, current pipelines typically cascade these effect LoRAs with acceleration modules for fast generation, which triggers severe parameter interference and results in concept bleeding and style degradation. We propose CollectionLoRA, a multi-teacher on-policy distillation framework capable of distilling the concepts of up to 50 different effect LoRAs along with few-step generation capabilities into a single LoRA. This fundamentally resolves the feature interference issue and significantly reduces deployment costs. Specifically, the method introduces (i) a Probabilistic Dual-Stream Routing mechanism that enables the model to randomly switch between data sources during training, effectively enhancing its generalization in unseen scenarios; (ii) an Asymmetric Orthogonal Prompting strategy to achieve concept isolation within the prompt space; (iii) a Coarse-to-Fine Distillation Objective to mitigate the distribution gap between the teacher and student models. Extensive evaluations show that CollectionLoRA distills all customized effects and few-step generation into a single LoRA, reducing deployment overhead while achieving concept fidelity comparable to or better than independently trained teacher models. Code: https://github.com/Qwen-Applications/CollectionLoRA

2605.25252 2026-05-28 cs.LG cs.AI

Quantifying Empirical Compute-Supervision Tradeoffs in RLVR

Ryo Mitsuhashi, Patrick Chen, Isabelle Tseng, Jasin Cekinmez, Addison J. Wu

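Affiliations * Princeton University

AI summary Uses GRPO experiments on GSM8K to study how verifier noise affects RLVR, finding that scaling compute does not compensate for supervision noise and that false negatives are more harmful than false positives.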

Comments Workshop on Combining Theory and Benchmarks @ ICML 2026

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training language models, but in practice, verifiers are rarely perfect. Recent theoretical work predicts that verifier noise affects the rate of learning but not its final outcome, implying that sufficient compute should close any gap induced by imperfect supervision. We test this prediction empirically by post-training Qwen2.5 (0.5B, 1.5B) with GRPO on GSM8K while injecting controlled false-positive and false-negative noise into the binary correctness signal, and varying rollouts per prompt as a compute axis. In practice, the gap in validation accuracy persists under substantial compute scaling, with returns to compute that are sharply diminishing. We further find a structural asymmetry where false negatives monotonically degrade performance more quickly than false positives. These findings suggest verifier quality and training compute are not interchangeable, and that reducing false negatives is a more effective lever than scaling compute alone.
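The noise-injection setup described above can be sketched as follows. This is an illustrative reading, not the authors' code: the function names and the `fp_rate` / `fn_rate` parameters are assumptions, and the advantage computation is the standard group-relative normalization associated with GRPO (reward minus group mean, divided by group standard deviation).

```python
import random


def noisy_reward(is_correct, fp_rate, fn_rate, rng):
    """Binary correctness reward with injected verifier noise.

    fp_rate: probability that an incorrect answer is rewarded (false positive)
    fn_rate: probability that a correct answer is not rewarded (false negative)
    """
    if is_correct:
        return 0.0 if rng.random() < fn_rate else 1.0
    return 1.0 if rng.random() < fp_rate else 0.0


def group_advantages(rewards):
    """Group-relative advantages over the rollouts sampled for one prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        return [0.0] * n  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]


rng = random.Random(0)
rollout_correct = [True, True, False, False]  # toy ground-truth labels
rewards = [noisy_reward(c, fp_rate=0.1, fn_rate=0.1, rng=rng) for c in rollout_correct]
advantages = group_advantages(rewards)
```

In this sketch the number of rollouts per prompt is simply the length of `rollout_correct`, which corresponds to the compute axis the abstract varies.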

2605.25230 2026-05-28 cs.AI

Boosting Inference with Guided Reasoning: Stochastic Exploration for Recursive Models

Andrew Corbett, Archit Sood, Anna Tzatzopoulou, Sai-Aakash Ramesh, Tim Dodwell

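Affiliations * digiLab, UK; University of Bristol, UK

AI summary Proposes guided stochastic exploration, which randomly perturbs reasoning trajectories and reweights them online to improve recursive models on structured reasoning tasks without retraining.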

Comments Presented at the ICML 2026 Workshop on Structured Probabilistic Inference & Generative Modeling (SPIGM), Seoul, South Korea, 2026

Abstract

Recent work on recursive architectures has shown that tiny neural networks can be surprisingly powerful on structured reasoning tasks. The trick is to model reasoning trajectories with a latent dynamical system. We argue that the inference-time behaviour of these architectures is best understood as approximate inference over latent reasoning trajectories, with deterministic recursion as the one-particle, zero-noise limit. We make this view operational through guided stochastic exploration: stochastic perturbations of the reasoning dynamics propose neighbouring trajectories, and the model's existing early-stopping head reweights them online. The framework yields three label-free diagnostics: local stability, guide alignment, and cloud-token entropy. These predict, from inference traces alone, whether the procedure will help and which of its outputs to trust. On Sudoku-Extreme it lifts exact-solve accuracy from $85.9\%$ to $98.0\%$ without retraining; on Maze-Hard the diagnostics flag a misaligned guide, as validation performance later confirms. The same machinery thus characterises both when recursive reasoning has room to improve at the trajectory level and when the model's internal guide can recover it.

2605.25183 2026-05-28 cs.CL cs.AI

Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience

Jake Stephen, Niraj K. Jha

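Affiliations * Department of Electrical and Computer Engineering, Princeton University

AI summary Builds a knowledge graph from a single textbook, generates question-answer supervision from it, and fine-tunes a language model to reach expert-level neuroscience reasoning that surpasses large language models.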

Abstract

A knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in-depth reasoning. Prior work has leveraged KGs to fine-tune language models (LMs), enabling domain-specific superintelligence. In this work, we explore whether KG-driven in-depth reasoning capabilities can emerge in neuroscience using only information contained within a single authoritative textbook. The central hypothesis is that structured knowledge, when distilled into a high-quality KG and converted into KG-grounded question-answer (QA) supervision, is sufficient to produce expert-level reasoning through a fine-tuned LM that surpasses large language models (LLMs) in accuracy, while employing orders of magnitude fewer parameters. We construct a textbook-derived KG via a dual-LLM validation pipeline, expand it with a masked LM trained on the KG topology, generate multi-hop QA items, which include QA pairs and reasoning traces, to fine-tune an LM exclusively on KG-derived supervision, and apply reinforcement learning using path-derived KG signals as implicit reward models. Our results demonstrate that deep, mechanistic neuroscience understanding can be induced in the model without reliance on large, heterogeneous web-scale corpora. The KG-based synthetic neuroscience curriculum that readers can quiz themselves on, and the fine-tuned LM, are available at the following GitHub location: https://kg-bottom-up-superintelligence.github.io/neuro-bench.

2605.24302 2026-05-28 cs.CV

Cross-Modal Action Recognition in Egocentric Video Using Mamba: Integrating RGB and Hand Skeleton Streams via CLS Token Fusion Strategies

Juan Ignacio Bustos Gorostegui, Maria Elena Buemi

Affiliations * Univ. of Buenos Aires. Faculty of Exact and Natural Sciences. Dept. of Computer Science (DC); CONICET-Univ. of Buenos Aires. Institute of Computer Sciences (ICC)

AI summary Proposes a Mamba-based cross-modal architecture that integrates RGB video and hand skeleton data through four CLS token fusion strategies (Naive, Average, Weighted, and Context-based); on the H2O dataset the Average strategy performs best, improving Top-1 accuracy by more than 10% in the Tiny configuration.

Comments 4 pages, 2 figures, Egovis2026, CVPR2026

Abstract

Egocentric action recognition is a challenging task due to erratic camera motion, frequent hand occlusion, and the difficulty of maintaining consistent visual representations over time. In this work, we propose a cross-modal architecture that combines RGB video and temporal hand skeleton data within a unified Mamba-based framework, exploiting the linear time complexity of State Space Models (SSMs). Our architecture consists of three components: a VideoMamba module for visual feature extraction, a skeleton encoder built on a stack of Mamba blocks, and a fusion module that integrates both modalities into a single representation. A central contribution of this work is the design and evaluation of four Class (CLS) token mixing strategies for multimodal fusion: Naive, Average, Weighted and Context-based. These strategies differ in how the pretrained unimodal CLS tokens, whose role is to act as information sinks concentrating learned representations, are leveraged to initialize the mixed CLS token used for final classification. We evaluate all strategies on the H2O dataset. Experimental results show that the Average strategy achieves the best performance, yielding gains of over 10% Top-1 accuracy in the Tiny configuration and 2% in the Small configuration over the VideoMamba baseline.
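Two of the four mixing strategies named above lend themselves to a small sketch. The abstract does not define the strategies precisely, so the functions below are assumptions: `average_fusion` takes an element-wise mean of the two unimodal CLS vectors, and `weighted_fusion` uses a single hypothetical mixing weight `alpha`, which may differ from how the paper weights the modalities.

```python
def average_fusion(cls_rgb, cls_skeleton):
    """Element-wise mean of the RGB and skeleton CLS vectors."""
    if len(cls_rgb) != len(cls_skeleton):
        raise ValueError("CLS tokens must share the same dimension")
    return [(a + b) / 2.0 for a, b in zip(cls_rgb, cls_skeleton)]


def weighted_fusion(cls_rgb, cls_skeleton, alpha):
    """Convex combination of the two CLS vectors; alpha in [0, 1] is hypothetical."""
    if len(cls_rgb) != len(cls_skeleton):
        raise ValueError("CLS tokens must share the same dimension")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(cls_rgb, cls_skeleton)]


# The fused vector would initialize the mixed CLS token used for classification.
mixed_cls = average_fusion([1.0, 3.0], [3.0, 5.0])
```

With `alpha = 0.5` the weighted variant reduces to the average, so the two can be compared within one code path.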

2605.23908 2026-05-28 cs.AI cs.CL cs.CV cs.NE

In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

Sam Earle, Kai Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi

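Affiliations * New York University; Massachusetts Institute of Technology

AI summary Replicates Picbreeder with frontier vision-language models in place of human users to probe AI's capacity for open-ended, unguided discovery, analyzes how the system's output differs from the human baseline in phylogenetic complexity, visual and semantic salience, and novelty, and studies the effects of exploratory noise, behavioral diversity, and narrative momentum.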

Comments 26 pages, 21 figures, to be published at GECCO 2026

Abstract

We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents' selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder-vlm.

2605.20635 2026-05-28 cs.LG math.ST stat.ML stat.TH

The General Theory of Localization Methods

Congwei Song

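Affiliations * Beijing Institute of Mathematical Sciences and Applications

AI summary Proposes the localization method, a general machine learning framework built on localization kernels and local means, systematically relates it to many existing models (such as kernel methods, MeanShift, and the Transformer), and shows that it can unify and generalize modern architectures.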

Comments correct some math expressions

Abstract

This paper proposes a general machine learning framework called the localization method, which is fundamentally built on two core concepts: localization kernels and local means -- key components that underpin the self-attention mechanism. To establish a rigorous theoretical foundation, the framework is formally defined through two essential pillars: the formulation of the local(-ized) model and the localization trick. We systematically investigate the connections between the localization method and a wide range of existing machine learning models/methods, including (but not limited to) kernel methods, lazy learning, the MeanShift algorithm, relaxation labeling, Hopfield networks, local linear embedding (LLE), fuzzy inference, and denoising autoencoders (DAEs). By dissecting these relationships, we clarify the broader theoretical significance of the localization method and demonstrate its practical applicability across diverse machine learning tasks. Furthermore, we explore advanced extensions of the framework, such as adaptive kernels, hierarchical local models, and non-local models. Notably, we show that the Transformer -- a cornerstone of modern sequence modeling -- can be constructed using hierarchical local models, revealing the ability of the localization method to unify and generalize state-of-the-art architectures. This work not only provides a unified theoretical lens to reinterpret existing models but also offers new methodological tools for designing flexible, data-adaptive learning systems.
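The link the abstract draws between local means and self-attention can be made concrete with a small sketch. It assumes an exponential dot-product kernel, under which the kernel-weighted local mean for one query equals softmax attention over the keys; the paper's own definitions of the localization kernel and local mean may differ, and the function name `local_mean` is hypothetical.

```python
import math


def local_mean(query, keys, values, temperature=1.0):
    """Kernel-weighted mean of `values`, localized around `query`.

    Kernel (assumption): exp(<query, key> / temperature). Normalizing these
    weights gives a softmax, so the result matches single-query attention.
    """
    scores = [sum(q * k for q, k in zip(query, key)) / temperature for key in keys]
    top = max(scores)                                # subtract max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [3.0]]
uniform = local_mean([0.0, 0.0], keys, values)   # equal weights on both values
peaked = local_mean([100.0, 0.0], keys, values)  # weight concentrates on the first key
```

A lower `temperature` makes the kernel narrower, so the mean becomes more local, while a very high temperature approaches the plain average of all values.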

2605.19257 2026-05-28 cs.RO

PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM

Eunsoo Im, Gyeonggwan Lee, Junghun Suh

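Affiliations * KakaoMobility, South Korea

AI summary Proposes PRISM-SLAM, which integrates vision foundation model priors into a Bayesian factor graph and uses a Plücker Ray-Distance Factor and a Dynamic Scene Uncertainty Gating mechanism to achieve real-time monocular metric SLAM without scale drift.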

Abstract

Monocular SLAM historically suffers from scale ambiguity and tracking failure in dynamic environments. While recent vision foundation models (VFMs) provide remarkable zero-shot depth priors, naively integrating these deterministic predictions ignores predictive uncertainty and frame-to-frame scale inconsistencies. We propose PRISM-SLAM, a real-time framework that rigorously integrates VFM priors into a structured Bayesian factor graph to achieve scale-aware, metric-consistent localization and mapping. Specifically, we introduce a Plücker Ray-Distance Factor to anchor monocular observations in absolute space within a globally consistent metric coordinate system, mathematically resolving scale drift by making the metric scale Fisher-identifiable. To handle environmental dynamics, we derive an epistemic uncertainty proxy from temporal depth consistency and formulate a Dynamic Scene Uncertainty Gating (DSUG) mechanism. This soft-gating approach probabilistically down-weights dynamic distractors without incurring the heavy computational overhead associated with traditional semantic segmentation masks. By employing a multi-process architecture that asynchronously processes VFM inference and geometric tracking, PRISM-SLAM provides verified metric output at 30 FPS using solely RGB input, bridging the gap between foundation models and real-world robotic applications. Evaluated on the TUM RGB-D and 7-Scenes benchmarks, PRISM-SLAM achieves a metric $SE(3)$ Absolute Trajectory Error (ATE) nearly identical to its oracle-aligned $Sim(3)$ error. This demonstrates that our system can produce deployment-ready metric trajectories by delivering robust metric SLAM solutions without any post-hoc scale correction. Project page: https://prismslam-cmd.github.io/prismslam_pr/

2605.02263 2026-05-28 cs.LG

Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning

Yan Jiang, Ruihong Qiu, Zi Huang

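Affiliations * School of Electrical Engineering and Computer Science, The University of Queensland, Brisbane, Queensland, Australia

AI summary Addresses the poor logical coherence and low efficiency caused by fixed-size reasoning blocks in diffusion large language models with b1, a post-training framework that combines a Monotonic Entropy Descent objective with reinforcement learning to learn dynamic-size reasoning blocks and improve reasoning coherence.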

Abstract

Recent diffusion large language models (dLLMs) have demonstrated both effectiveness and efficiency in reasoning via a block-based semi-autoregressive generation paradigm. Despite this progress, fixed-size block generation remains a critical bottleneck for effective and coherent reasoning. (1) From a global perspective, different reasoning tasks correspond to different optimal decoding block sizes, which makes a "one-size-fits-all" assumption ineffective. (2) Even within a single reasoning task, rigid block partitioning breaks the logical flow and reduces reasoning coherence. Through empirical observations, we reveal that for block-wise entropy, incorrect reasoning exhibits a fluctuating and unsteady trend between blocks, whereas the correctly generated tasks follow a consistent descending trend. Therefore, this paper proposes b1, a novel post-training framework for dLLMs that learns dynamic-size reasoning blocks via a Monotonic Entropy Descent objective with reinforcement learning to enhance reasoning coherence. b1 integrates seamlessly as a plug-and-play module with existing dLLM post-training algorithms. Extensive experiments across various reasoning benchmarks showcase b1's consistent improvement over existing fixed-size block baselines. Our code has been released at https://github.com/YanJiangJerry/Block-R1.
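The block-wise entropy observation above can be illustrated with a small diagnostic. This is a sketch under stated assumptions, not the b1 objective itself: entropy is averaged over the tokens of each block, block sizes are assumed positive and to sum to the sequence length, and `descent_violation` is a hypothetical measure of how far a trace departs from a monotonically descending trend.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def block_entropies(token_dists, block_sizes):
    """Mean token entropy per block, for blocks of varying size."""
    out, start = [], 0
    for size in block_sizes:
        block = token_dists[start:start + size]
        out.append(sum(token_entropy(d) for d in block) / len(block))
        start += size
    return out


def descent_violation(entropies):
    """Total entropy increase between consecutive blocks (0 if never increasing)."""
    return sum(max(0.0, later - earlier) for earlier, later in zip(entropies, entropies[1:]))


uncertain, confident = [0.5, 0.5], [1.0, 0.0]
descending = block_entropies([uncertain, uncertain, confident], [2, 1])
rising = block_entropies([confident, uncertain, uncertain], [1, 2])
```

A penalty of this kind could in principle be folded into a reward so that block partitions with steadily falling entropy are preferred, though how b1 actually does this is not specified in the abstract.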

2605.01046 2026-05-28 cs.LG

Learning in the Fisher Subspace: A Guided Initialization for LoRA Fine-Tuning

Zhi-Quan Feng, Ying-Jia Lin, Hung-Yu Kao

Affiliations * Department of Computer Science and Information Engineering, National Cheng Kung University; Department of Artificial Intelligence, Chang Gung University; Department of Computer Science, National Tsing Hua University

AI summary Proposes a Fisher-information-guided initialization that uses curvature information from downstream data to choose LoRA adaptation subspaces and improve fine-tuning performance.

Abstract

LoRA adapts large language models (LLMs) by restricting updates to low-rank subspaces of pre-trained weights. While this substantially reduces training cost, the effectiveness of adaptation critically depends on which subspace is chosen at initialization: a poor initialization that allocates capacity to task-irrelevant directions can severely hinder downstream performance. Existing initialization strategies primarily rely on the intrinsic properties of pre-trained weights, implicitly assuming that weight geometry alone reflects task relevance. However, such criteria overlook how the model interacts with the downstream data distribution. In this work, we formulate LoRA initialization as identifying the degree of impact of directions in parameter space under the target data distribution. We argue that data-aware sensitivity, rather than weight-only magnitude, should govern the choice of adaptation subspaces. Building on this perspective, we propose a Fisher-guided framework that leverages curvature information induced by downstream data to characterize how parameter perturbations influence model predictions. This perspective yields a principled, task-dependent criterion for selecting LoRA directions that better align adaptation with the target objective. Empirical results across diverse tasks and modalities demonstrate that data-aware initialization consistently and significantly improves downstream performance over existing approaches.

2605.23192 2026-05-28 cs.CV

Occlusion-Aware Physics-Semantic Keyframe Selection for Robust Video Editing

Lin Liu, Zhihan Xiao, Haohang Xu, Rong Cong, Zhibo Zhang, Xiaopeng Zhang, Qi Tian

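Affiliations * Huawei; Tsinghua University; East China Normal University

AI summary Proposes an occlusion-aware physics-semantic keyframe selection framework that scores candidate frames on structural completeness, tracking stability, and attribute visibility to select the best anchor frame automatically, then uses bidirectional tracking to generate spatiotemporal masks for robust, temporally consistent video editing.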

详情
AI中文摘要

近年来,基于扩散的生成模型在视频编辑领域取得了显著进展,能够根据自然语言指令实现多样化的对象级操作。然而,现有方法在遮挡、视角变化和快速物体运动场景下常常表现不佳,不可靠的视觉观测导致定位不准确、时间闪烁和编辑不一致。在本工作中,我们识别出缺乏可靠视觉锚点是遮挡鲁棒视频编辑的一个根本瓶颈。为解决此问题,我们提出了一种遮挡感知的物理-语义关键帧选择框架,该框架自动为下游编辑识别最优锚定帧。具体而言,我们的方法从三个互补角度评估候选帧:避免截断观测的结构完整性、衡量物理可靠性的循环一致跟踪稳定性、以及确保语义清晰性的基于视觉语言的属性可见性。选定的关键帧随后通过双向跟踪传播,生成密集的时空掩码,这些掩码作为扩散视频编辑骨干的辅助监督。通过将遮挡处理从显式重建转变为可靠锚点选择,我们的框架无需手动标注即可实现精确且时序一致的编辑。在具有挑战性的视频编辑基准上的大量实验证明了我们方法的有效性和高质量性能。

英文摘要

Video editing has recently achieved remarkable progress with diffusion-based generative models, enabling diverse object-level manipulations from natural language instructions. However, existing methods often struggle under occlusion, viewpoint changes, and fast object motion, where unreliable visual observations lead to inaccurate localization, temporal flickering, and inconsistent edits. In this work, we identify the absence of reliable visual anchors as a fundamental bottleneck in occlusion-robust video editing. To address this issue, we propose an occlusion-aware physics-semantic keyframe selection framework that automatically identifies an optimal anchor frame for downstream editing. Specifically, our method evaluates candidate frames from three complementary perspectives: structural completeness for avoiding truncated observations, cycle-consistent tracking stability for measuring physical reliability, and vision-language-based attribute visibility for ensuring semantic clarity. The selected keyframe is then propagated through bidirectional tracking to generate dense spatiotemporal masks, which are used as auxiliary supervision for a diffusion-based video editing backbone. By transforming occlusion handling from explicit reconstruction into reliable anchor selection, our framework enables precise and temporally consistent editing without requiring manual annotations. Extensive experiments on challenging video editing benchmarks demonstrate the effectiveness and high-quality performance of our method.

2605.22949 2026-05-28 cs.LG cs.MA

MARGIN: Runtime Confidence Calibration for Multi-Agent Foundation Model Coordination

MARGIN:多智能体基础模型协调的运行时置信度校准

Joss Armstrong

发表机构 * Ericsson, Athlone, Ireland(爱立信,阿斯隆,爱尔兰)

AI总结 提出在线校准方法MARGIN,通过任务流学习每个智能体每个置信度带的校准因子,无需模型访问或重训练,在分布漂移下将校准误差降低3-6倍,并显著提升多智能体选择性能。

详情
AI中文摘要

基础模型智能体越来越多地运行在多智能体部署中,协调者必须决定信任哪个智能体的响应。标准方法根据智能体自我报告的置信度进行加权,但最近的证据表明,基础模型的置信度系统性地校准不良,并且在困难任务上与准确性呈负相关。设计时校准方法(温度缩放、Platt缩放、直方图分箱)无法解决这个问题,因为它们对保留数据拟合固定校正,并在分布漂移下性能下降。我们提出MARGIN(通过增量归一化的多智能体运行时分级),一种在线校准方法,从任务流本身学习每个智能体、每个置信度带的校准因子,无需模型访问、无需保留数据、无需重新训练。MARGIN使用对称指数加权移动平均和贝叶斯收缩混合,具有三个超参数和稳健的默认值。在18个基础模型、8个基准测试和超过44,000个观测值上,MARGIN在分布漂移下实现了比最佳设计时基线低3-6倍的校准误差。在多智能体选择中,原始口头化置信度在困难基准测试上的成对分辨率未能超过随机水平(43-50%)。MARGIN完全纠正了这一点,将成对分辨率提高到70-89%,并在五个代码生成基准测试上缩小了37-78%的原始到Oracle pass@1差距,而无需任何关于哪个模型最强的先验知识。六个形式化命题刻画了收敛性、跟踪速度以及对称更新对非策略性智能体的最优性,所有预测均通过实验加以说明。

英文摘要

Foundation model agents increasingly operate in multi-agent deployments where a coordinator must decide which agent's response to trust. The standard approach weights agents by their self-reported confidence, but recent evidence shows that foundation model confidence is systematically miscalibrated and, on hard tasks, inversely correlated with accuracy. Design-time calibration methods (temperature scaling, Platt scaling, histogram binning) cannot address this problem because they fit a fixed correction to held-out data and degrade under distribution shift. We present MARGIN (Multi-Agent Runtime Grading via Incremental Normalisation), an online calibration method that learns per-agent, per-confidence-band calibration factors from the task stream itself, requiring no model access, no held-out data, and no retraining. MARGIN uses symmetric exponentially weighted moving averages with Bayesian shrinkage blending, and has three hyperparameters with robust defaults. Across 18 foundation models, 8 benchmarks, and over 44,000 observations, MARGIN achieves 3-6x lower calibration error than the best design-time baseline under distribution shift. In multi-agent selection, raw verbalized confidence fails to beat random at pairwise resolution (43-50%) on hard benchmarks. MARGIN corrects this completely, raising pairwise resolution to 70-89% and closing 37-78% of the Raw-to-Oracle pass@1 gap across the five code-generation benchmarks without any oracle knowledge of which model is strongest. Six formal propositions characterize convergence, tracking speed, and the optimality of symmetric updates for non-strategic agents, with all predictions illustrated empirically.
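As a rough illustration of the mechanism described above, the sketch below keeps one exponentially weighted moving average of observed correctness per (agent, confidence band) and blends it with the raw confidence using a count-based shrinkage weight. The class name, the three hyperparameters `alpha`, `prior_strength`, and `n_bands`, and the exact blending rule are assumptions made for this example; the abstract states only that MARGIN combines symmetric EWMAs with Bayesian shrinkage and has three hyperparameters.

```python
class OnlineBandCalibrator:
    """Toy per-agent, per-confidence-band online calibrator (illustrative only)."""

    def __init__(self, alpha=0.1, prior_strength=5.0, n_bands=5):
        self.alpha = alpha                    # EWMA step size (assumed)
        self.prior_strength = prior_strength  # shrinkage pseudo-count (assumed)
        self.n_bands = n_bands                # number of confidence bands (assumed)
        self.ewma = {}                        # (agent, band) -> EWMA of correctness
        self.count = {}                       # (agent, band) -> observations seen

    def band(self, confidence):
        # Map a confidence in [0, 1] to a band index in [0, n_bands - 1].
        return min(int(confidence * self.n_bands), self.n_bands - 1)

    def update(self, agent, confidence, correct):
        key = (agent, self.band(confidence))
        outcome = 1.0 if correct else 0.0
        prev = self.ewma.get(key, outcome)
        # Symmetric update: the same step size for successes and failures.
        self.ewma[key] = prev + self.alpha * (outcome - prev)
        self.count[key] = self.count.get(key, 0) + 1

    def calibrated(self, agent, confidence):
        key = (agent, self.band(confidence))
        n = self.count.get(key, 0)
        if n == 0:
            return confidence  # no evidence yet: fall back to the raw confidence
        # Shrinkage: trust the running estimate more as observations accumulate.
        w = n / (n + self.prior_strength)
        return w * self.ewma[key] + (1.0 - w) * confidence
```

A coordinator could then rank agents by `calibrated(agent, confidence)` instead of by the raw self-reported confidence, which requires only the stream of outcomes and no access to the models themselves.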

2605.22547 2026-05-28 cs.CV cs.AI

Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement

基于多模态知识图谱和可靠性引导精化的病例感知医学图像分类

Yiming Xu, Yixuan Liu, Yuhang Zhang, Ling Zheng, Yihan Wang, Qi Song

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种基于多模态知识图谱的病例感知推理框架,通过构建结构化诊断记忆、自适应检索相似病例、知识传播与注入机制以及置信度校准的决策精化方案,提升医学图像分类的性能和可解释性。

详情
AI中文摘要

深度学习为医学图像分类带来了显著进展,但现有方法大多依赖孤立的视觉证据,无法有效利用相似病例或外部知识。在临床实践中,诊断通常由相似历史病例及其相关症状支持。为了显式建模这一循证诊断过程,我们提出了一种由多模态知识图谱驱动的病例感知推理框架,用于医学图像分类。具体而言,我们构建了一个病例感知的多模态知识图谱作为结构化的诊断记忆,其中疾病、图像和症状按层次组织。给定输入图像,我们的方法自适应地从该记忆中检索相似病例,并提取相应的以病例为中心的子图。我们进一步引入了一种知识传播与注入机制,其中以图像为中心的图注意力网络将异质语义聚合为基于病例的特征,随后通过双向跨模态注意力机制将这些特征注入视觉表示以实现跨模态对齐。为了减轻噪声检索,我们设计了一种置信度校准的决策精化方案,通过联合考虑预测置信度和样本相似性来估计每个检索病例的可靠性,并重新加权其对最终预测的贡献,提供可解释的病例级证据。在多个医学影像数据集上的大量实验表明,我们的方法一致优于强基线,而消融和定性分析验证了其有效性和可解释性。代码可在 https://anonymous.4open.science/r/MKG-CARE-8B7B 获取。

英文摘要

Deep learning has brought significant progress to medical image classification, yet most existing methods still rely on isolated visual evidence and cannot effectively leverage similar cases or external knowledge. In clinical practice, diagnosis is typically supported by similar historical cases and their associated symptoms. To explicitly model this evidence-based diagnostic process, we propose a case-aware reasoning framework driven by multimodal knowledge graphs for medical image classification. Specifically, we construct a case-aware multimodal knowledge graph as a structured diagnostic memory, where diseases, images, and symptoms are hierarchically organized. Given an input image, our method adaptively retrieves similar cases from this memory and extracts their corresponding case-centered subgraphs. We further introduce a knowledge propagation and injection mechanism, in which an image-centric Graph Attention Network aggregates heterogeneous semantics into case-based features, followed by a bidirectional cross-modal attention mechanism that injects these features into visual representations for cross-modal alignment. To mitigate noisy retrieval, we design a confidence-calibrated decision refinement scheme that estimates the reliability of each retrieved case by jointly considering prediction confidence and sample similarity, and reweights its contribution to the final prediction, providing interpretable case-level evidence. Extensive experiments on multiple medical imaging datasets demonstrate that our approach consistently outperforms strong baselines, while ablation and qualitative analyses validate its effectiveness and interpretability. The code is available at https://anonymous.4open.science/r/MKG-CARE-8B7B.
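The reliability-guided refinement step can be pictured with a small sketch. It assumes, purely for illustration, that the reliability of a retrieved case is the product of its prediction confidence and its similarity to the query, and that the refined prediction is a convex mix of the base classifier output and the reliability-weighted case votes. The product rule, the `mix` parameter, and the function name are hypothetical; the abstract says only that confidence and similarity are considered jointly.

```python
import numpy as np

def refine_prediction(base_probs, case_labels, case_confidences,
                      case_similarities, n_classes, mix=0.5):
    """Blend a base prediction with reliability-weighted votes from retrieved cases."""
    # Assumed reliability rule: confidence times similarity.
    reliability = np.asarray(case_confidences, dtype=float) * \
        np.asarray(case_similarities, dtype=float)
    votes = np.zeros(n_classes)
    for label, r in zip(case_labels, reliability):
        votes[label] += r  # each case votes for its own label
    base = np.asarray(base_probs, dtype=float)
    if votes.sum() > 0:
        votes = votes / votes.sum()
        refined = (1.0 - mix) * base + mix * votes
    else:
        refined = base  # no usable evidence: keep the base prediction
    return refined, reliability
```

Returning `reliability` alongside the refined probabilities exposes which retrieved cases drove the decision, which is the kind of case-level evidence the abstract describes.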

2605.22166 2026-05-28 cs.AI

Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

适配接口而非模型:面向确定性LLM智能体的运行时框架适配

Tianshi Xu, Huifeng Wen, Meng Li

发表机构 * Peking University(北京大学)

AI总结 提出Life-Harness运行时框架,通过从训练轨迹中演化出可复用的环境侧干预,在不修改模型权重或评估环境的情况下,显著提升冻结LLM智能体在确定性任务中的性能。

Comments Work in progress

详情
AI中文摘要

LLM智能体不仅由其语言模型塑造,还受运行时框架的影响,该框架协调观察、工具使用、动作执行、反馈解释和轨迹控制。虽然现有的智能体适配方法主要更新模型参数,但在确定性、规则主导的领域中,许多失败源于模型-环境接口的不匹配。我们提出Life-Harness,一种生命周期感知的运行时框架,在不改变模型权重或评估环境的情况下改进冻结的LLM智能体。Life-Harness从训练轨迹中演化,通过将重复出现的交互失败转化为跨环境契约、程序技能、动作实现和轨迹调节的可复用干预,并在未见任务上保持固定以进行评估。在来自$\tau$-bench、$\tau^2$-bench和AgentBench的七个确定性环境中,Life-Harness在18个模型骨干上的126个模型-环境设置中改进了116个,平均相对提升88.5%。仅从Qwen3-4B-Instruct轨迹演化出的框架可迁移到其他17个模型,表明Life-Harness捕获的是可复用的环境侧结构而非模型特定行为。这些结果将运行时接口适配定位为以模型为中心的智能体训练的互补替代方案。代码可在https://github.com/Tianshi-Xu/Life-Harness获取。

英文摘要

LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execution, feedback interpretation, and trajectory control. While existing agent adaptation methods mainly update model parameters, many failures in deterministic, rule-governed domains stem from mismatches at the model-environment interface. We propose Life-Harness, a lifecycle-aware runtime harness that improves frozen LLM agents without changing model weights or evaluation environments. Life-Harness evolves from training trajectories by converting recurring interaction failures into reusable interventions across environment contracts, procedural skills, action realization, and trajectory regulation, and remains fixed for evaluation on unseen tasks. On seven deterministic environments from $\tau$-bench, $\tau^2$-bench, and AgentBench, Life-Harness improves 116 out of 126 model-environment settings across 18 model backbones, with an average relative improvement of 88.5%. Harnesses evolved only from Qwen3-4B-Instruct trajectories transfer to 17 other models, showing that Life-Harness captures reusable environment-side structure rather than model-specific behavior. These results position runtime interface adaptation as a complementary alternative to model-centric agent training. Code is available at https://github.com/Tianshi-Xu/Life-Harness.

2510.20665 2026-05-28 cs.AI

The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models

推理的形状:大型语言模型中推理轨迹的拓扑分析

Xue Wen Tan, Nathaniel Tan, Galen Lee, Stanley Kok

发表机构 * University of Cambridge, Department of Engineering, England(剑桥大学工程系) National University of Singapore, School of Computing, Singapore(新加坡国立大学计算机学院)

AI总结 提出基于拓扑数据分析(TDA)的评估框架,通过捕捉推理轨迹的几何结构实现高效自动评估,实验表明拓扑特征比图指标更有效预测推理质量。

Comments Accepted in ICML 2026 Workshop: Epistemic Intelligence in Machine Learning

详情
AI中文摘要

评估大型语言模型推理轨迹的质量仍然研究不足、劳动密集且不可靠:当前实践依赖于专家评分标准、手动注释和缓慢的成对判断。自动化方法主要依赖基于图的代理指标,这些指标量化结构连通性,但未阐明高质量推理的构成;对于固有复杂的过程,这种抽象可能过于简单。我们引入了一个基于拓扑数据分析(TDA)的评估框架,该框架捕捉推理轨迹的几何结构,并实现标签高效、自动化的评估。在我们的实证研究中,拓扑特征在评估推理质量方面比标准图指标具有显著更高的预测能力,这表明有效推理更好地由高维几何结构而非纯关系图来捕捉。我们进一步表明,一组紧凑、稳定的拓扑特征可靠地指示轨迹质量,为未来的强化学习算法提供了实用信号。

英文摘要

Evaluating the quality of reasoning traces from large language models remains understudied, labor-intensive, and unreliable: current practice relies on expert rubrics, manual annotation, and slow pairwise judgments. Automated efforts are dominated by graph-based proxies that quantify structural connectivity but do not clarify what constitutes high-quality reasoning; such abstractions can be overly simplistic for inherently complex processes. We introduce a topological data analysis (TDA)-based evaluation framework that captures the geometry of reasoning traces and enables label-efficient, automated assessment. In our empirical study, topological features yield substantially higher predictive power for assessing reasoning quality than standard graph metrics, suggesting that effective reasoning is better captured by higher-dimensional geometric structures rather than purely relational graphs. We further show that a compact, stable set of topological features reliably indicates trace quality, offering a practical signal for future reinforcement learning algorithms.

2605.21832 2026-05-28 cs.AI

FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

FLUID:从临时ID到多模态语义编码的工业级直播推荐

Xinhang Yuan, Zexi Huang, Anjia Cao, Xudong Lu, Zikai Wang, Penghao Zhou, Chang Liu, Wentao Guo, Qinglei Wang

发表机构 * TikTok ByteDance(字节跳动)

AI总结 针对直播推荐中ID冷启动问题,提出FLUID框架,通过跨域多模态编码器生成层次化语义编码LUCID替代候选侧ID,并采用分阶段预热方案,在工业级系统上取得显著提升。

详情
AI中文摘要

现代推荐系统严重依赖基于ID的协同过滤:每个项目由一个独特的ID嵌入表示,该嵌入从用户交互中积累协同信号。然而,直播推荐在这种范式下面临独特挑战:直播间通常仅播出几十分钟,因此其项目ID在持续的冷启动状态下学习不佳,以ID为中心的排序模型无法泛化。我们提出FLUID,这是第一个从生产规模的直播排序器中完全淘汰候选侧项目ID的框架。FLUID引入了一个跨域多模态编码器,在短视频和直播上联合训练,生成离散的层次化语义编码,称为LUCID,用于基于内容的项目表征。为了使排序器适应LUCID,FLUID进一步采用分阶段预热方案:首先将冷启动的切片级LUCID作为独立标记与ID嵌入一起引入,然后在在线增量训练之前用热启动的房间级LUCID替换ID嵌入。FLUID部署在我们的工业级直播推荐系统上,该系统的跨平台合并用户基数超过十亿,取得了显著的在线收益:优质观看时长+0.55%,冷启动房间观看量+2.05%,活跃小时数+0.05%。

英文摘要

Modern recommender systems rely heavily on ID-based collaborative filtering: each item is represented by a unique ID embedding that accumulates collaborative signals from user interactions. Livestreaming recommendation, however, faces a unique challenge in this paradigm: a live room typically broadcasts for only tens of minutes, so its item ID remains poorly learned in a persistent cold-start state and ID-centric ranking models fail to generalize. We present FLUID, the first framework to fully retire the candidate-side item ID from a production-scale livestreaming ranker. FLUID introduces a cross-domain multimodal encoder, jointly trained on short videos and livestreams, to produce discrete hierarchical semantic codes, called LUCID, for content-based item characterization. To adapt the ranker to LUCID, FLUID further employs a staged warmup scheme: it first incorporates cold, slice-level LUCID as an independent token alongside the ID embedding, and then replaces the ID embedding with warm, room-level LUCID before online incremental training. Deployed on our industrial livestreaming recommenders with a cross-platform combined user base of over one billion globally, FLUID delivers significant online gains of +0.55% Quality Watch Duration, +2.05% Cold-Start Room Views, and +0.05% Active Hours.

2605.21743 2026-05-28 cs.AI econ.GN q-fin.EC

Who Uses AI? Platform Selection and the Measurement of Occupational AI Exposure

谁在使用AI?平台选择与职业AI暴露的测量

Michelle Yin, Burhan Ogut

发表机构 * School of Education and Social Policy, Northwestern University(西北大学教育与社会政策学院) American Institutes for Research(美国研究院)

AI总结 本文通过分析AI平台对话日志,揭示平台用户构成导致职业AI暴露测量偏差,并提出劳动力加权部分识别方法校正估计。

详情
AI中文摘要

来自AI平台的对话日志越来越多地被用于衡量职业对人工智能的暴露程度,但在这些日志中观察到的用户并非劳动力群体。我们表明,从平台导出的暴露分数结合了任务级别的AI适用性与平台用户群的职业构成。保持实证设计不变,仅改变平台输入会使ChatGPT后的就业系数变化1.9倍,并且同一供应商内的消费者和企业渠道在符号上存在分歧。我们将由此产生的非经典测量误差形式化,将其分解为职业间和职业内的选择,并构建了劳动力加权的部分识别界限。根据劳工统计局就业份额进行重新加权会使估计值衰减42%至93%。该偏差捕捉了观察用户中的增强效应,比劳动力中的替代效应更直接。

英文摘要

Conversation logs from AI platforms are increasingly used to measure occupational exposure to artificial intelligence, but the users observed in these logs are not the workforce. We show that platform-derived exposure scores combine task-level AI applicability with the occupational composition of the platform's user base. Holding the empirical design fixed, changing only the platform input changes the post-ChatGPT employment coefficient by a factor of 1.9, and consumer and enterprise channels within the same vendor disagree in sign. We formalize the resulting non-classical measurement error, decompose it into between- and within-occupation selection, and construct workforce-reweighted partial-identification bounds. Reweighting to Bureau of Labor Statistics employment shares attenuates estimates by 42 to 93 percent. The bias captures augmentation among observed users more directly than substitution in the workforce.
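The composition effect at the heart of the argument can be shown with a toy calculation. The sketch below is not the paper's estimator: it simply computes a share-weighted average of occupation-level AI applicability, once with a platform's user mix and once with workforce employment shares. All occupation names and numbers are invented for illustration.

```python
def exposure(shares, applicability):
    """Share-weighted average applicability; shares are normalised to sum to 1."""
    total = sum(shares.values())
    return sum(shares[occ] / total * applicability[occ] for occ in shares)

# Hypothetical task-level applicability by occupation (made-up values).
applicability = {"software": 0.8, "nursing": 0.2, "retail": 0.1}

# Hypothetical occupational mix of platform users vs. the actual workforce.
platform_shares = {"software": 0.6, "nursing": 0.1, "retail": 0.3}
workforce_shares = {"software": 0.1, "nursing": 0.3, "retail": 0.6}

platform_exposure = exposure(platform_shares, applicability)
workforce_exposure = exposure(workforce_shares, applicability)
# Fraction by which the platform-weighted figure shrinks after reweighting.
attenuation = 1.0 - workforce_exposure / platform_exposure
```

Applicability is held fixed across the two calculations, so the entire gap between `platform_exposure` and `workforce_exposure` comes from who uses the platform, which is the selection channel the abstract separates from task-level applicability.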

2605.16578 2026-05-28 cs.SD cs.AI cs.HC cs.LG

Voice "Cloning" is Style Transfer

语音“克隆”是风格迁移

Kaitlyn Zhou, Federico Bianchi, Martijn Bartelds, Anna Pot, Yongchan Kwon, James Zou

发表机构 * Cornell University(康奈尔大学) TogetherAI Stanford University(斯坦福大学)

AI总结 研究发现语音克隆并非忠实复制原声,而是系统性地应用风格迁移,使克隆语音更权威、温暖、客服化且更像真人,导致说话者特征同质化,并影响人类信任与行为。

详情
AI中文摘要

人工生成的语音日益嵌入日常生活。语音克隆尤其适用于身份保留重要的应用,例如完成录音、用新语言配音或保存失语者的声音。然而,在我们的工作中,我们发现尽管术语如此,语音克隆并不能忠实地“克隆”个体的声音。相反,我们发现广泛使用的语音克隆模型系统性地对源语音应用风格迁移。根据人类标注者的评分,克隆语音相比源语音被认为更权威、更温暖、更接近客服风格且更像真人。人类标注者还报告对克隆语音的信任度高于源语音,并且更愿意向它们透露敏感个人信息。我们的工作还表明,语音克隆导致说话者特征的同质化,表现为口音、语速和音频嵌入空间的方差减小。总之,我们的结果凸显了语音克隆技术的一系列新局限和风险,及其对人类行为的潜在影响。

英文摘要

Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss. However, in our work, we find that despite the term, voice cloning does not faithfully "clone" an individual's voice. Instead, we find that widely-used voice cloning models systematically apply style transfer to source voices. As rated by human annotators, cloned voices are perceived as more authoritative, warm, customer-service-like, and human-like compared to their sources. Human annotators also report greater trust in cloned voices than source voices, and a greater willingness to disclose sensitive personal information to them. Our work furthermore shows that voice cloning leads to homogenization of speaker characteristics, as measured by reduced variance in accent, speaking rate, and the audio embedding space. Together, our results highlight a new set of limitations and risks of voice cloning technology and their potential impact on human behavior.

2602.06511 2026-05-28 cs.LG

EvoMAS: Evolutionary Generation of Multi-Agent Systems

EvoMAS:多智能体系统的进化生成

Yuntong Hu, Yuting Zhang, Matthew Trager, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto

发表机构 * Department of Computer Science, Emory University, Atlanta, GA, USA(埃默里大学计算机科学系)

AI总结 提出EvoMAS方法,将多智能体系统生成转化为结构化配置生成,通过进化算法在配置空间中优化,提升任务性能、可执行性和鲁棒性。

Comments ICML2026

Journal ref ICML2026

详情
AI中文摘要

基于大语言模型的多智能体系统在复杂推理、规划和工具增强任务中展现出巨大潜力,但设计有效的MAS架构仍然劳动密集、脆弱且难以泛化。现有的自动MAS生成方法要么依赖代码生成,常导致可执行性和鲁棒性失败,要么施加僵化的架构模板,限制了表达性和适应性。我们提出多智能体系统的进化生成(EvoMAS),将MAS生成形式化为结构化配置生成。EvoMAS在配置空间中进行进化生成。具体来说,EvoMAS从池中选择初始配置,应用基于执行轨迹引导的反馈条件变异和交叉,并迭代优化候选池和经验记忆。我们在多个基准测试上评估EvoMAS,包括BBEH、SWE-Bench和WorkBench,涵盖推理、软件工程和工具使用任务。EvoMAS在任务性能上持续优于人工设计的MAS和先前的自动MAS生成方法,同时生成的系统具有更高的可执行性和运行时鲁棒性。EvoMAS在BBEH推理上比智能体进化方法EvoAgent高出10.5个百分点,在WorkBench上高出7.1个百分点。使用Claude-4.5-Sonnet,EvoMAS在SWE-Bench-Verified上达到79.1%,与排行榜顶部持平。代码可在https://github.com/amazon-science/EvoMAS获取。

英文摘要

Large language model (LLM)-based multi-agent systems (MAS) show strong promise for complex reasoning, planning, and tool-augmented tasks, but designing effective MAS architectures remains labor-intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi-Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation. EvoMAS performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE-Bench, and WorkBench, covering reasoning, software engineering, and tool-use tasks. EvoMAS consistently improves task performance over both human-designed MAS and prior automatic MAS generation methods, while producing generated systems with higher executability and runtime robustness. EvoMAS outperforms the agent evolution method EvoAgent by +10.5 points on BBEH reasoning and +7.1 points on WorkBench. With Claude-4.5-Sonnet, EvoMAS also reaches 79.1% on SWE-Bench-Verified, matching the top of the leaderboard. Code is available at https://github.com/amazon-science/EvoMAS
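To make "evolutionary generation in configuration space" concrete, here is a deliberately plain sketch of an evolutionary loop over configuration dictionaries. It is not EvoMAS: the feedback-conditioned, trace-guided operators and the experience memory are not modelled, the configuration fields `n_agents` and `topology` are hypothetical, and `fitness` stands in for whatever task evaluation would score a generated system.

```python
import random

def crossover(a, b, rng):
    # Pick each configuration field from one of the two parents at random.
    return {key: rng.choice([a[key], b[key]]) for key in a}

def mutate(config, rng):
    # Random mutation of one hypothetical field (not feedback-conditioned).
    child = dict(config)
    child["n_agents"] = max(1, child["n_agents"] + rng.choice([-1, 1]))
    return child

def evolve(pool, fitness, generations=20, seed=0):
    """Evolve a pool of at least two configs; returns the best one found."""
    rng = random.Random(seed)
    pool = [dict(c) for c in pool]
    for _ in range(generations):
        pool.sort(key=fitness, reverse=True)
        child = mutate(crossover(pool[0], pool[1], rng), rng)
        pool[-1] = child  # replace the weakest candidate; the best is kept
    return max(pool, key=fitness)
```

Because candidates are structured configurations rather than generated code, every child produced by `crossover` and `mutate` is well-formed by construction, which is the executability advantage the abstract attributes to searching in configuration space.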

2605.19729 2026-05-28 cs.CV cs.AI

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

LIFT and PLACE: 一种简单、稳定且有效的轻量级扩散模型知识蒸馏框架

Hyunsoo Han, Sangyeop Yeo, Jaejun Yoo

发表机构 * Ulsan National Institute of Science and Technology (UNIST)(蔚山科学技术院)

AI总结 提出LIFT和PLACE框架,通过粗到细的蒸馏策略解决教师网络高复杂度带来的学生模仿困难,在极端压缩下仍能稳定训练并取得良好性能。

Comments Project page: https://hyun-s.github.io/LIFT_PLACE_site, 15 pages, 11 figures, 9 tables, To appear in CVPR 2026

详情
AI中文摘要

我们证明,在扩散模型的知识蒸馏中,教师网络由于其更大的容量而具有高度复杂的去噪过程,这给学生模型忠实模仿带来了重大挑战。为了解决这个问题,我们提出了一种基于线性拟合蒸馏(LIFT)和分段局部自适应系数估计(PLACE)的粗到细蒸馏框架。首先,LIFT将目标分解为“粗”对齐和“细”细化。学生先在粗对齐上训练,然后进行困难的细化。其次,PLACE通过将输出划分为基于误差的组来扩展LIFT以处理空间非均匀误差,提供局部自适应指导。我们的实验表明,LIFT和PLACE在扩散空间(图像/潜在)、骨干网络(U-Net/DiT)、任务(无条件/条件)、数据集上均有效,甚至扩展到基于流的模型如MMDiT(SD3)。此外,在极端压缩下(学生参数1.3M,仅为教师的1.6%),传统KD无法为稳定训练提供足够指导,FID分数常退化到50-200+,但我们的方法仍稳定收敛并达到15.73的FID。

英文摘要

We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process, which stems from its substantially larger capacity, is difficult for the student model to mimic faithfully. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a "coarse" alignment and a "fine" refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE are effective across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), and datasets, and even extend to flow-based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M-parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50-200+, but our method remains stably convergent and achieves an FID of 15.73.

2602.06025 2026-05-28 cs.CL cs.AI cs.LG

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

学习面向运行时智能体记忆的查询感知预算层级路由

Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang

发表机构 * Nanyang Technological University(南洋理工大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Illinois Chicago(伊利诺伊大学芝加哥分校) Tsinghua University(清华大学) Sun Yat-sen University(中山大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出 BudgetMem 框架,通过强化学习训练的轻量级路由器实现查询感知的预算层级路由,以在运行时平衡任务性能与记忆构建成本。

Comments Accepted by ICML 2026. Code is available at https://github.com/ViktorAxelsen/BudgetMem

详情
AI中文摘要

记忆对于在单个上下文窗口之外运行的大型语言模型(LLM)智能体日益重要,然而大多数现有系统依赖于离线的、查询无关的记忆构建,这可能导致效率低下并丢弃查询关键信息。尽管运行时记忆利用是一种自然的替代方案,但先前的工作通常会产生大量开销,并且对性能-成本权衡的显式控制有限。在这项工作中,我们提出了 BudgetMem,一个用于显式、查询感知性能-成本控制的运行时智能体记忆框架。BudgetMem 将记忆处理结构化为一组记忆模块,每个模块提供三个预算层级(即 Low/Mid/High)。一个轻量级路由器在模块间执行预算层级路由,以平衡任务性能和记忆构建成本,该路由器实现为通过强化学习训练的紧凑神经策略。使用 BudgetMem 作为统一测试平台,我们研究了实现预算层级的三种互补策略:实现(方法复杂度)、推理(推理行为)和容量(模块模型大小)。在 LoCoMo、LongMemEval 和 HotpotQA 上,当优先考虑性能时(即高预算设置),BudgetMem 超越了强基线,并在更紧的预算下提供了更好的准确率-成本前沿。此外,我们的分析揭示了不同层级策略的优势和劣势,阐明了在不同预算制度下每个轴何时提供最有利的权衡。

英文摘要

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

2605.20150 2026-05-28 cs.CV cs.PF

TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization

TideGS: 通过外存优化实现超过十亿个3D高斯溅射基元的可扩展训练

Chonghao Zhong, Linfeng Shi, Hua Chen, Tiecheng Sun, Hao Zhao, Binhang Yuan, Chaojian Li

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Tsinghua University(清华大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 针对大规模3D高斯溅射训练的内存瓶颈,提出TideGS外存训练框架,通过SSD-CPU-GPU层次化管理和三种协同技术,在单GPU上实现超过十亿高斯基元的训练并达到最优重建质量。

Comments Accepted to ICML 2026 as Spotlight. Website: https://sponge-lab.github.io/TideGS

详情
AI中文摘要

训练十亿基元规模的3D高斯溅射(3DGS)本质上是内存受限的:每个高斯基元携带一个大的属性向量,总参数表迅速超出GPU容量,限制了先前系统在商用单GPU硬件上只能处理数千万高斯基元。我们观察到3DGS训练本质上是稀疏且轨迹条件的:每次迭代仅激活当前相机批次可见的高斯基元,因此GPU内存可以作为工作集缓存而非持久参数存储。基于这一洞察,我们引入了TideGS,一个外存训练框架,通过三种协同技术管理SSD-CPU-GPU层次结构中的参数:用于SSD对齐空间局部性的块虚拟化几何、用于将I/O与计算重叠的分层异步流水线,以及轨迹自适应差分流,该流在迭代之间仅传输增量工作集变化。实验表明,TideGS能够在单个24 GB GPU上训练超过十亿个高斯基元,同时在大规模场景中实现评估的单GPU基线中最佳的重建质量,超越了先前的外存基线(例如约1亿高斯基元)和标准内存训练(例如约1100万高斯基元)。

英文摘要

Training 3D Gaussian Splatting (3DGS) at billion-primitive scale is fundamentally memory-bound: each Gaussian primitive carries a large attribute vector, and the aggregate parameter table quickly exceeds GPU capacity, limiting prior systems to tens of millions of Gaussians on commodity single-GPU hardware. We observe that 3DGS training is inherently sparse and trajectory-conditioned: each iteration activates only the Gaussians visible from the current camera batch, so GPU memory can serve as a working-set cache rather than a persistent parameter store. Building on this insight, we introduce TideGS, an out-of-core training framework that manages parameters across an SSD-CPU-GPU hierarchy via three synergistic techniques: block-virtualized geometry for SSD-aligned spatial locality, a hierarchical asynchronous pipeline to overlap I/O with computation, and trajectory-adaptive differential streaming that transfers only incremental working-set deltas between iterations. Experiments show that TideGS enables training with over one billion Gaussians on a single 24 GB GPU while achieving the best reconstruction quality among evaluated single-GPU baselines on large-scale scenes, scaling beyond prior out-of-core baselines (e.g., approximately 100M Gaussians) and standard in-memory training (e.g., approximately 11M Gaussians).
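The idea of treating GPU memory as a working-set cache and transferring only deltas between iterations can be sketched with plain set arithmetic. The block IDs, the function names, and the "evict everything no longer visible" policy are simplifications chosen for this example; a real out-of-core trainer would also write updated parameters back to host memory or SSD and would likely keep some blocks resident for reuse.

```python
def working_set_delta(resident, visible):
    """Blocks to load and to evict when moving to the next camera batch."""
    to_load = visible - resident    # newly visible blocks must be transferred in
    to_evict = resident - visible   # blocks that are no longer needed on the GPU
    return to_load, to_evict

def stream(visible_per_iter):
    """Count block transfers when only incremental deltas are moved."""
    resident = set()
    transferred = 0
    for visible in visible_per_iter:
        to_load, to_evict = working_set_delta(resident, visible)
        transferred += len(to_load)
        resident = (resident - to_evict) | to_load
    return transferred
```

When consecutive camera batches overlap heavily, as along a smooth trajectory, the number of transferred blocks is far smaller than the sum of the per-iteration working-set sizes that a full reload would move.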

2605.19778 2026-05-28 cs.LG

B-cos GNNs: Faithful Explanations through Dynamic Linearity

B-cos GNNs:通过动态线性实现忠实解释

Joschka Groß, Mohammad Shaique Solanki, Verena Wolf

发表机构 * Saarland Informatics Campus, Saarland University(萨尔兰信息学校区,萨尔兰大学) DFKI

AI总结 提出B-cos GNNs,一种内在可解释的图神经网络,通过单个输入依赖的线性映射将预测精确分解为每个节点、每个特征的贡献,在保持高解释性的同时牺牲少量预测精度。

详情
AI中文摘要

我们引入B-cos GNNs,一类内在可解释的图神经网络,其预测通过单个输入依赖的线性映射精确分解为每个节点、每个特征的贡献。B-cos GNNs使用线性(求和)聚合,并用B-cos变换替换非线性消息和更新函数。这诱导了有意义的、任务特定的权重-输入对齐,可通过模型的动态线性直接访问。实例级解释来自单个前向和后向传播,无需辅助解释器、修改的学习目标或扰动过程。实例化为GIN后,我们的方法以较小的预测精度损失换取在各种合成和真实世界基准上最先进的解释性,产生的解释比事后基线快几个数量级。

英文摘要

We introduce B-cos GNNs, an inherently explainable class of graph neural networks whose predictions decompose exactly into per-node, per-feature contributions via a single input-dependent linear map. B-cos GNNs use linear (sum-based) aggregation and replace non-linear message and update functions with B-cos transforms. This induces meaningful, task-specific weight-input alignment that is directly accessible through the model's dynamic linearity. Instance-level explanations follow from a single forward and backward pass, requiring no auxiliary explainer, modified learning objective, or perturbation procedure. Instantiated as a GIN, our approach trades small losses in predictive accuracy for state-of-the-art explainability across diverse synthetic and real-world benchmarks, producing explanations orders of magnitude faster than post-hoc baselines.
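The "dynamic linearity" property can be checked numerically on a single layer. The sketch below uses the B-cos transform as defined in the original B-cos networks work, where each unit computes |cos(x, w)|^(B-1) times the product of the unit-norm weight and the input, and shows that the output equals an input-dependent matrix applied to the input. The function name and the use of NumPy are choices made for this example; it does not implement the GNN message passing from the paper.

```python
import numpy as np

def bcos_linear(x, W, B=2.0, eps=1e-12):
    """B-cos transform of x; returns the output and W_eff with out == W_eff @ x."""
    # Normalise each weight row to unit length.
    W_hat = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)
    lin = W_hat @ x                              # plain linear response
    cos = lin / (np.linalg.norm(x) + eps)        # cosine between x and each row
    scale = np.abs(cos) ** (B - 1.0)             # alignment-dependent down-weighting
    W_eff = scale[:, None] * W_hat               # input-dependent linear map
    return scale * lin, W_eff

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W = rng.normal(size=(3, 5))
out, W_eff = bcos_linear(x, W)
contributions = W_eff * x  # per-output, per-feature contributions
```

Each row of `contributions` sums exactly to the corresponding output, which is the single-layer version of the decomposition the abstract describes. Since sum-based aggregation is itself linear, stacking such layers with sum aggregation would plausibly keep the whole network expressible as one input-dependent linear map.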

2511.14159 2026-05-28 cs.CV

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

MVI-Bench:评估大型视觉语言模型对误导性视觉输入鲁棒性的综合基准

Huiyi Chen, Jiawei Peng, Dehai Min, Changchang Sun, Kaijie Chen, Yan Yan, Xu Yang, Lu Cheng

发表机构 * Department of Computer Science, University of Illinois Chicago, Chicago, USA School of Computer Science & Engineering, Southeast University, Nanjing, China Guohao School, Tongji University, Shanghai, China

AI总结 针对现有鲁棒性基准忽视误导性视觉输入的问题,提出MVI-Bench基准,基于视觉基元的三级层次(视觉概念、视觉属性、视觉关系)构建6个类别1248个VQA实例,并引入MVI-Sensitivity指标进行细粒度评估,揭示18个LVLM的显著脆弱性。

Comments 18 pages, 9 figures

详情
AI中文摘要

评估大型视觉语言模型(LVLMs)的鲁棒性对于其持续发展和在现实世界应用中的负责任部署至关重要。然而,现有的鲁棒性基准通常关注幻觉或误导性文本输入,而在很大程度上忽视了评估视觉理解时由误导性视觉输入带来的同样关键的挑战。为填补这一重要空白,我们引入了MVI-Bench,这是首个专门设计用于评估误导性视觉输入如何削弱LVLMs鲁棒性的综合基准。基于基本视觉基元,MVI-Bench的设计围绕三个层次的误导性视觉输入:视觉概念、视觉属性和视觉关系。利用这一分类法,我们策划了六个代表性类别,并整理了1248个专家标注的VQA实例。为了促进细粒度的鲁棒性评估,我们进一步引入了MVI-Sensitivity,这是一种新颖的指标,可在细粒度上表征LVLM的鲁棒性。在18个最先进的LVLM上的实证结果揭示了它们对误导性视觉输入的显著脆弱性,我们在MVI-Bench上的深入分析提供了可操作的见解,可以指导开发更可靠和鲁棒的LVLM。基准和代码库可在https://github.com/chenyil6/MVI-Bench获取。

英文摘要

Evaluating the robustness of Large Vision-Language Models (LVLMs) is essential for their continued development and responsible deployment in real-world applications. However, existing robustness benchmarks typically focus on hallucination or misleading textual inputs, while largely overlooking the equally critical challenge posed by misleading visual inputs in assessing visual understanding. To fill this important gap, we introduce MVI-Bench, the first comprehensive benchmark specially designed for evaluating how Misleading Visual Inputs undermine the robustness of LVLMs. Grounded in fundamental visual primitives, the design of MVI-Bench centers on three hierarchical levels of misleading visual inputs: Visual Concept, Visual Attribute, and Visual Relationship. Using this taxonomy, we curate six representative categories and compile 1,248 expertly annotated VQA instances. To facilitate fine-grained robustness evaluation, we further introduce MVI-Sensitivity, a novel metric that characterizes LVLM robustness at a granular level. Empirical results across 18 state-of-the-art LVLMs uncover pronounced vulnerabilities to misleading visual inputs, and our in-depth analyses on MVI-Bench provide actionable insights that can guide the development of more reliable and robust LVLMs. The benchmark and codebase can be accessed at https://github.com/chenyil6/MVI-Bench.

2605.19743 2026-05-28 cs.AI cs.LG cs.MA

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

EngiAI: 面向LLM驱动工程设计的多智能体框架与基准测试套件

Gioele Molinari, Florian Felten, Soheyl Massoudi, Mark Fuge

发表机构 * IDEAL Chair of Artificial Intelligence in Engineering Design(工程设计人工智能 IDEAL 教席) ETH Zurich(苏黎世联邦理工学院) Autom8.build

AI总结 提出EngiAI多智能体系统框架和包含工作流、RAG、HPC三维度的基准套件,通过监督架构协调七个专业智能体,验证了LLM在工程设计中的能力与局限。

Comments 26 pages, 10 figures, to be published at IDETC 2026

详情
AI中文摘要

大型语言模型(LLM)智能体越来越多地应用于工程设计任务,但现有的评估框架未能充分处理结合仿真、检索和制造准备的多智能体系统。我们引入了一个包含三个评估维度的基准套件:(1)一个工作流基准,包含七种针对不同认知需求的提示风格——包括直接工具使用、语义消歧、条件分支和工作记忆任务;(2)一个检索增强生成(RAG)基准,采用门控评分来隔离检索对参数选择的贡献;(3)一个高性能计算(HPC)基准,评估在SLURM集群上的端到端机器学习训练编排。与基准一起,我们提出了EngiAI,一个基于LangGraph构建的多智能体系统(MAS)参考实现,通过监督架构协调七个专业智能体,统一拓扑优化、文档检索、HPC作业编排和3D打印机控制。在四个LLM后端和两个EngiBench问题上,专有模型在Beams2D上实现了96-97%的平均任务完成率,而开源4B参数模型达到55-78%,并显示出明显的代际改进。条件分支被证明最具挑战性,在Photonics2D上条件风格的任务完成率降至20-53%。RAG门控确认了近乎完美的检索增强分数(约1.0),而无检索时接近零,验证了评估设计。在HPC编排中,一个模型在100%的运行中完成了所有流水线步骤,而另一个模型降至50%,表明多步骤指令遵循在长时间运行的工作流中会退化。

英文摘要

Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequately address multi-agent systems that combine simulation, retrieval, and manufacturing preparation. We introduce a benchmark suite with three evaluation dimensions: (1) a workflow benchmark with seven prompt styles targeting distinct cognitive demands, including direct tool use, semantic disambiguation, conditional branching, and working-memory tasks; (2) a Retrieval-Augmented Generation (RAG) benchmark with gated scoring isolating retrieval contributions to parameter selection; and (3) a High-Performance Computing (HPC) benchmark evaluating end-to-end ML training orchestration on a SLURM cluster. Alongside the benchmark we present EngiAI, a Multi-Agent System (MAS) reference implementation built on LangGraph that operationalizes the benchmark by coordinating seven specialized agents through a supervisor architecture, unifying topology optimization, document retrieval, HPC job orchestration, and 3D printer control. Across four LLM backends and two EngiBench problems, proprietary models achieve 96-97% average task completion on Beams2D, while open-source 4B-parameter models reach 55-78%, with clear generational improvement. Conditional branching proves most challenging, with task completion dropping to 20-53% for the conditional style on Photonics2D. RAG gating confirms near-perfect retrieval-augmented scores (about 1.0) versus near-zero without retrieval, validating the evaluation design. On HPC orchestration, one model completes all pipeline steps in 100% of runs while another drops to 50%, revealing that multi-step instruction following degrades over long-running workflows.

2605.19514 2026-05-28 cs.AI cs.CL cs.LG

Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management


Guanyu Cui, Zhewei Wei, Kun He

Affiliations * Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; DEKE Lab, Renmin University of China, Beijing, China

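AI summary By distinguishing the fixed-system and scaling-family settings, this paper argues that the context-management method decisively shapes the computational power of autoregressive Transformers, and points out that Turing-completeness proofs established in the scaling-family setting do not carry over to the fixed systems deployed in practice.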

Comments Accepted to the ICML 2026 Position Paper Track

Details
AI abstract (translated from Chinese)

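Many works make the striking claim that Transformers are Turing-complete. However, the literature often conflates two different settings: (i) the fixed-system setting, in which a fixed autoregressive Transformer is coupled with a fixed context-management method and processes inputs of different lengths step by step; and (ii) the scaling-family setting, in which a family of different models (with increasing context-window length or numerical precision) is used to handle different input lengths. Existing proofs of Transformer Turing-completeness are usually established in setting (ii), whereas real-world LLM deployment, as well as the standard notion of Turing-completeness, corresponds more naturally to setting (i). In this paper, we first formalize the fixed-system setting, giving a concrete description of how real-world LLMs operate. We then argue that results proved in the scaling-family setting provide theoretically meaningful resource bounds but do not establish Turing-completeness, which clarifies a common misunderstanding of existing results. Finally, we show that different context-management methods can yield sharply different computational power, and we argue that context management is a key component in determining the computational power of real-world autoregressive Transformers.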

English abstract

Many works make the eye-catching claim that Transformers are Turing-complete. However, the literature often conflates two distinct settings: (i) a fixed Transformer system setting, in which a fixed autoregressive Transformer is coupled with a fixed context-management method to process inputs of different lengths step by step, and (ii) a scaling-family setting, in which a family of different models (with increasing context-window length or numerical precision) is used to handle different input lengths. Existing proofs of Transformer Turing-completeness are frequently established in setting (ii), whereas real-world LLM deployment and the standard notion of Turing-completeness correspond more naturally to setting (i). In this paper, we first formalize the fixed-system setting, thereby providing a concrete characterization of how real-world LLMs operate. We then argue that results proved in the scaling-family setting provide theoretically meaningful resource bounds but do not establish Turing-completeness, thereby clarifying a common misinterpretation of existing results. Finally, we show that different context-management methods can yield sharply different computational power, and we advocate the position that context management is a central component that critically determines the computational power of real-world autoregressive Transformers.
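
The fixed-system setting described in this abstract can be pictured with a toy loop: one fixed step function, one fixed context-management rule, and inputs of any length. The sketch below is only an illustration under stated assumptions, not the paper's formalization. The names `run_fixed_system`, `echo_first`, `sliding_window`, and `pin_first` are hypothetical, and `echo_first` is a trivial stand-in for a real autoregressive Transformer.

```python
from typing import Callable, List

Token = str
Context = List[Token]


def sliding_window(context: Context, window: int) -> Context:
    """Context manager A: keep only the most recent `window` tokens (window >= 1)."""
    return context[-window:]


def pin_first(context: Context, window: int) -> Context:
    """Context manager B: keep the first token plus the most recent `window - 1` tokens."""
    if len(context) <= window:
        return context
    return context[:1] + context[len(context) - (window - 1):]


def echo_first(visible: Context) -> Token:
    """Hypothetical stand-in for a fixed model: emit the first visible token."""
    return visible[0]


def run_fixed_system(
    step: Callable[[Context], Token],
    manage: Callable[[Context, int], Context],
    prompt: Context,
    window: int,
    n_steps: int,
) -> Context:
    """Run one fixed model with one fixed context manager, token by token."""
    context = manage(list(prompt), window)
    for _ in range(n_steps):
        # The model only ever sees what the context manager kept.
        context = manage(context + [step(context)], window)
    return context


if __name__ == "__main__":
    prompt = ["A", "b", "c", "d"]  # longer than the window of 3
    print(run_fixed_system(echo_first, sliding_window, prompt, window=3, n_steps=3))
    print(run_fixed_system(echo_first, pin_first, prompt, window=3, n_steps=3))
```

With the same step function and the same window, the sliding-window run never sees the token "A" again, while the pinned run keeps reproducing it. This is a small-scale instance of the abstract's point that the choice of context-management method changes what a fixed system can compute.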