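arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.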

2605.31534 2026-06-01 cs.CV cs.AI

Feature-Optimized Vision for Adaptive 3D Scene Reconstruction

Eric Liang

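AI summary: Proposes an adaptive, feature-optimized vision front end that scores candidate features on texture, repeatability, distinctiveness, expected triangulation angle, and spatial coverage to allocate a per-view feature budget, aiming to maximize useful tracks and reduce reconstruction RMSE.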

Abstract

Three-dimensional scene reconstruction depends on local image evidence that is both visually discriminative and geometrically useful. Fixed feature thresholds and uniform feature budgets are easy to deploy, but they can waste computation on repeated texture, low-parallax regions, or unstable points. This paper proposes an adaptive feature-optimized vision front end for 3D reconstruction. The method scores candidate features by texture, repeatability, distinctiveness, expected triangulation angle, and spatial coverage, then allocates a per-view feature budget to maximize useful tracks under a fixed reconstruction pipeline. A small synthetic multi-view prototype evaluates four selection policies across corridor, facade, object-table, and cluttered scenes. Compared with random, texture-only, and uniform-grid baselines, the adaptive policy obtains the best quality-aware completeness and the lowest aggregate reconstruction RMSE while preserving broad image coverage. The result is not a replacement for modern learned matching or neural reconstruction systems; it is a modular front-end policy that can make classical and learned 3D pipelines more deliberate about which visual evidence they spend compute on.
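The scoring-then-budgeting idea in this abstract can be sketched as follows. This is a minimal illustration, not the paper's method: the `Candidate` layout, the cue weights in `WEIGHTS`, the grid-based coverage bonus, and the greedy loop are all assumptions, since the abstract does not say how the cues are combined or how the budget is allocated.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate feature; cue scores are assumed to be normalised to [0, 1]."""
    x: float
    y: float
    texture: float
    repeatability: float
    distinctiveness: float
    triangulation: float  # score for the expected triangulation angle


# Hypothetical cue weights (assumption; not taken from the paper).
WEIGHTS = {"texture": 0.2, "repeatability": 0.3,
           "distinctiveness": 0.2, "triangulation": 0.3}


def base_score(c):
    """Weighted sum of the per-feature cues."""
    return sum(w * getattr(c, name) for name, w in WEIGHTS.items())


def select_features(candidates, budget, width, height, grid=4, coverage_bonus=0.25):
    """Greedy per-view selection: base score plus a bonus for still-empty grid cells."""
    def cell(c):
        col = min(int(c.x / width * grid), grid - 1)
        row = min(int(c.y / height * grid), grid - 1)
        return row, col

    remaining = list(candidates)
    chosen, occupied = [], set()
    while remaining and len(chosen) < budget:
        best = max(
            remaining,
            key=lambda c: base_score(c) + (0.0 if cell(c) in occupied else coverage_bonus),
        )
        chosen.append(best)
        occupied.add(cell(best))
        remaining.remove(best)
    return chosen
```

With a budget of two, two strong candidates in the same image corner and one weaker candidate elsewhere, this sketch keeps the strongest corner feature and then prefers the weaker but better-placed one, which is the kind of trade-off between cue quality and spatial coverage the abstract describes.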

2605.31532 2026-06-01 cond-mat.soft cs.LG

Discovering Thermodynamically Admissible Dissipation Potentials via Grammar-Based Symbolic Regression

Federico Califano, Jacopo Ciambella

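AI summary: Proposes a grammar-based symbolic regression framework that automatically discovers dissipation potentials satisfying the thermodynamic constraints (convexity and non-negativity) within the Generalized Standard Materials formalism, validated on synthetic and experimental data.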

Abstract

Constitutive laws for inelastic materials must satisfy strict thermodynamic admissibility requirements, yet current data-driven approaches sacrifice interpretability, even when formal guarantees are provided by physics-encoded architectures. We propose a symbolic regression framework for the data-driven discovery of dissipation potentials governing the evolution of internal variables within the Generalized Standard Materials (GSM) formalism. Starting from the Clausius-Duhem inequality, we enforce the thermodynamic requirements, convexity and non-negativity, that the dual dissipation potential must satisfy to guarantee non-negative mechanical dissipation. These requirements are formulated in the general subdifferential setting, encompassing rate-dependent (viscoelastic) and viscoplastic dissipative mechanisms, including potentials with genuine elastic domains, within a unified framework. Candidate potentials are generated by a composition-extended convexity-preserving grammar that guarantees thermodynamic admissibility by construction. The framework is validated on synthetic datasets spanning Newtonian, power-law, and Bingham viscoplastic ground truths under process and measurement noise, and on experimental oscillatory shear measurements of a synthetic elastomer across multiple strain amplitudes and frequencies, where the discovered potentials reproduce the amplitude-dependent softening of the dynamic moduli and outperform a calibrated linear Zener baseline.
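Why convexity and non-negativity of the dual dissipation potential are enough for non-negative dissipation can be seen in a short worked derivation. The notation is an assumption made here for illustration: $A$ is the thermodynamic force, $\dot{\alpha}$ the internal-variable rate, $\varphi^{*}$ the dual dissipation potential, and the normalisation $\varphi^{*}(0) = 0$ is the usual GSM convention rather than something stated in the abstract.

```latex
\begin{align}
\dot{\alpha} &\in \partial\varphi^{*}(A)
  && \text{(evolution law in subdifferential form)} \\
\varphi^{*}(B) &\ge \varphi^{*}(A) + \dot{\alpha}\cdot(B - A) \quad \forall B
  && \text{(subgradient inequality, from convexity)} \\
\mathcal{D} = A\cdot\dot{\alpha} &\ge \varphi^{*}(A) - \varphi^{*}(0) = \varphi^{*}(A) \ge 0
  && \text{(choose } B = 0 \text{, then use non-negativity)}
\end{align}
```

Two textbook instances, given here only as illustrations, fit this pattern: the Newtonian potential $\varphi^{*}(A) = A^{2}/(2\eta)$ and the Bingham-type potential $\varphi^{*}(A) = \langle |A| - \sigma_y \rangle_{+}^{2}/(2\eta)$, whose flat region $|A| \le \sigma_y$ is the elastic domain.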

2605.31529 2026-06-01 cs.CV

SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

Yulu Pan, Han Yi, Seongsu Ha, Md Mohaiminul Islam, Benjamin Zhang, Lorenzo Torresani, Gedas Bertasius

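AI summary: Introduces SVI-Bench, a large-scale benchmark that uses team sports as a dynamic microworld. Its 9 tasks span four pillars (Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, and Agentic Synthesis) and evaluate video intelligence from perception through to strategic planning; models perform well on perceptual tasks but degrade sharply at the higher cognitive levels.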

Abstract

True video intelligence demands more than recognizing what is visible: it requires reasoning about why events unfold, predicting what would change under different conditions, and deciding what to do next. We refer to this progression, from perception through causal reasoning and simulation to strategic planning, as Strategic Video Intelligence (SVI). No existing benchmark evaluates this capability stack: in-the-wild videos lack verifiable ground truth for causal and strategic questions, while synthetic environments sacrifice the complexity of real multi-agent systems. To bridge this gap, we introduce SVI-Bench, a large-scale benchmark that leverages team sports as a dynamic microworld, combining the complexity of real-world multi-agent interaction (10-22 agents making coordinated decisions under adversarial pressure) with the verifiability of explicit rules and definitive outcomes. SVI-Bench comprises approximately 35K hours of broadcast video, 15M annotated actions, 15K hours of expert commentary, 23K game reports, and 103K structured statistical records across basketball, soccer, and hockey, all constructed via a data engine that transforms raw game data into a dense, cross-referenced corpus. We organize evaluation into 9 tasks spanning a progressive four-pillar hierarchy: Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, and Agentic Synthesis. Evaluating strong multimodal and agentic baselines, we find a capability cliff: models perform competently on perceptual tasks, achieving approximately 73% on fine-grained action QA, but degrade sharply at each successive cognitive level. Agentic tasks prove hardest: the strongest model achieves only 5% accuracy when required to autonomously gather and integrate evidence across a corpus of 1.8M clips.

2605.31524 2026-06-01 cs.LG cs.LO

Value Functions as Supermartingale Certificates

Alessandro Abate, Daniel Contro, Mirco Giacobbe, Agustín Martínez-Suñé, Diptarko Roy

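AI summary: Establishes a theoretical link between value functions and Streett supermartingale certificates, connecting formal verification of stochastic systems with reinforcement learning and suggesting an RL-based route to certificate synthesis for ω-regular properties.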

Comments: To appear in SAIV'26

Abstract

Certification methods for stochastic systems provide sufficient proof rules, based on real-valued supermartingale certificates, to determine the almost-sure satisfaction of $ω$-regular properties (and therefore of linear temporal logic) over general state spaces, encompassing both countably infinite and continuous state spaces. Conversely, reinforcement learning (RL) methods for $ω$-regular tasks have received considerable attention, but they typically lack formal guarantees that the learned policy satisfies the specification, except possibly for finite state and action spaces. We bridge these two lines of research by establishing a novel theoretical connection: under an appropriate reward, the value function associated to a policy that almost surely satisfies an $ω$-regular property encodes a Streett supermartingale certificate for that specification. Our results, validated experimentally on finite Markov decision processes, hold for finite, countably infinite, and continuous state spaces, suggesting a principled route to certificate synthesis via RL.

2605.31522 2026-06-01 cs.LG q-bio.GN q-bio.QM

Chem-PerturBridge: a harmonized compendium of small molecule perturbation transcriptomic effects

Artur Szałata, Olga Novitskaia, Maiia Shulman, Matthew Mella, Altynbek Zhubanchaliyev, Fabian J. Theis

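AI summary: To address the fragmentation of small-molecule perturbation transcriptomic data, builds Chem-PerturBridge, a harmonized resource covering over 37k compounds, 136 cellular contexts, and 1.25M samples, and demonstrates its use for evaluating cross-dataset signature agreement and for pretraining compound representations.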

Comments: 33 pages, 6 figures, 16 tables

Abstract

Large perturbation models require training data encompassing chemical, cellular, and assay diversity. Current transcriptomic resources for small-molecule modeling, however, are fragmented across technologies, metadata conventions, controls, doses, and preprocessing pipelines. We introduce Chem-PerturBridge, a harmonized multi-dataset resource comprising over 37k compounds, 136 cellular contexts, and 1.25M transcriptomic samples across eight assay types, with standardized identifiers, metadata, and replicate-aware condition-level effects. We use the resource to evaluate matched-condition agreement across datasets and replicate agreement within datasets. Matched same-compound conditions generally show weak agreement in fine-grained logFC rankings and magnitudes across most dataset pairs, often falling below same-context different-compound baselines. In contrast, logFC direction agreement is substantially more stable and usually exceeds these baselines. We further evaluate Chem-PerturBridge as a pretraining resource for compound representation learning. Under a compound-held-out OP3 evaluation split, embeddings pretrained on Chem-PerturBridge improve over L1000-only embeddings, Morgan fingerprints, and the descriptor-free OP3 baseline across metrics. An extensive molecule-holdout evaluation across 11 datasets further shows that models trained on Chem-PerturBridge outperform or match those that are not. Chem-PerturBridge therefore supports both diagnostic evaluation of cross-dataset signature agreement and model-oriented reuse of heterogeneous perturbation transcriptomic data.
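The contrast the abstract draws between direction agreement and fine-grained rank agreement can be illustrated with a small sketch. The function names and the toy logFC vectors are hypothetical, and the paper's exact metric definitions may differ; the point is only that two signatures can agree on every sign while their rankings diverge.

```python
def direction_agreement(logfc_a, logfc_b):
    """Fraction of genes whose log fold changes share a sign (exact zeros are skipped)."""
    pairs = [(a, b) for a, b in zip(logfc_a, logfc_b) if a != 0 and b != 0]
    if not pairs:
        return float("nan")
    same = sum(1 for a, b in pairs if (a > 0) == (b > 0))
    return same / len(pairs)


def ranks(values):
    """Rank positions (0 = smallest). Ties are broken by input order, fine for a sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy signatures for the same compound measured in two hypothetical datasets.
dataset_1 = [2.0, 1.0, 0.5, -1.0, -2.0]
dataset_2 = [0.4, 1.5, 0.9, -0.1, -0.3]
```

On these toy vectors every gene moves in the same direction in both datasets, yet the rank correlation is well below one, mirroring the qualitative pattern reported in the abstract.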

2605.31521 2026-06-01 cs.CL cs.SD

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu, Sijun Zhang, Wei Jia, Yuan Liu, Houfeng Wang, Xiao Zhou

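AI summary: Proposes the UniAudio-Token framework, which uses Semantic-Acoustic Primitives (SAP) and Semantic-Acoustic Equilibrium (SAE) to give semantic tokenizers general audio perception without sacrificing speech ability, yielding a unified audio interface.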

Comments: 19 pages, 10 figures

Abstract

Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.

2605.31520 2026-06-01 cs.SE cs.AI cs.CR

Separating Secrets from Placeholders: A Hybrid CNN-CodeBERT Framework for Three-Class Credential Leakage Detection

Maksuda Bilkis Baby, Khushika Shah, Naiyue Liang, Lei Zhang

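AI summary: To address the high false-positive rates of existing credential leakage detectors, proposes a three-class framework that combines CodeBERT-based semantic understanding with character-level pattern recognition and models placeholder or weak credentials as a distinct class. On a newly constructed 9,426-sample dataset it reaches an MCC of 0.86 and a macro F1 of 0.90, and cuts high-severity alerts by 33% without sacrificing security coverage.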

Comments: Accepted at ICSME 2026 (International Conference on Software Maintenance and Evolution)

Abstract

Credential leakage in public source code repositories poses a critical security threat, with over 23.8 million secrets exposed in 2024 alone. Existing detection tools suffer from high false-positive rates because rigid pattern matching and binary classification schemes fail to distinguish genuine credentials from placeholder or weak credentials. We propose a three-class classification framework that explicitly models placeholder or weak credentials as a distinct class, leveraging CodeBERT-based semantic understanding combined with character-level pattern recognition. We evaluate our approach on a newly constructed dataset of 9,426 samples spanning 10 programming languages. Our model achieves a Matthews Correlation Coefficient of 0.86 and a macro F1-score of 0.90, achieving 93% recall and 89% precision for genuine credential leaks while reducing high severity alerts by 33.0% (from 373 to 250) without sacrificing security coverage. Compared to prior character-level approaches, our method improves placeholder or weak credential detection from 54% to 81% F1-score while maintaining strong cross language generalization, with 9 of 10 languages achieving F1 above 0.80 under leave-one-language-out evaluation.
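The headline numbers in this abstract can be cross-checked with a few lines of arithmetic. The sketch below only uses figures quoted in the abstract; the helper names are hypothetical, and the F1 value is what the rounded precision and recall imply, not a number reported by the authors.

```python
def relative_reduction(before, after):
    """Fractional drop when a count goes from `before` to `after`."""
    return (before - after) / before


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


alert_drop = relative_reduction(373, 250)   # high-severity alerts, from the abstract
genuine_f1 = f1(0.89, 0.93)                 # genuine-credential class, rounded P and R

print(f"alert reduction: {alert_drop:.1%}")                       # alert reduction: 33.0%
print(f"F1 implied by rounded precision/recall: {genuine_f1:.2f}")  # 0.91
```

The drop from 373 to 250 alerts is 123 alerts, which matches the stated 33.0% reduction.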

2605.31518 2026-06-01 cs.LG

On the Relationship Between Activation Outliers and Feature Death in Sparse Autoencoders

Elana Simon, Etowah Adams, James Zou

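AI summary: Through theoretical analysis and experiments, shows how dimension-level activation outliers cause feature death in sparse autoencoders, and shows that mean-centering as a preprocessing step eliminates this outlier-induced death.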

Comments: Accepted to ICML 2026 main conference

Abstract

Sparse autoencoders (SAEs) decompose neural network activations into interpretable features, but many learned features never activate, a problem called feature death that wastes dictionary capacity and can reintroduce superposition. Death rates vary dramatically between models: near-zero on GPT-2, over 70% on AlphaFold3 with identical configurations. We find that dimension-level activation outliers (dimensions whose mean magnitude is large relative to per-token variation) cause this by shifting pre-activations at initialization based on each feature's alignment with the activation mean. Features anti-aligned with the mean receive permanently negative pre-activations and never fire. We formalize outlier severity as $γ= \|μ\|/\|σ\|$; it predicts initial death rates (Spearman $ρ= 0.89$ for dead-by-TopK, $0.82$ for dead-by-ReLU) across 454 model-layer combinations spanning language, vision, protein, and genomic models. Dead features can revive during training, but recovery requires the SAE bias to learn the activation mean, a process that is prohibitively slow at high $γ$. Mean-centering (subtracting the activation mean) sidesteps this and eliminates outlier-induced death across all tested models, confirming the mechanism and providing a principled basis for when and why this preprocessing step is necessary.
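The severity measure $γ= \|μ\|/\|σ\|$ and the effect of mean-centering can be sketched on a toy activation matrix. Everything below is an illustrative assumption: the four two-dimensional "tokens", the encoder direction `w`, and the simplification that the SAE encoder bias is zero at initialization, so a feature counts as dead when its pre-activation is non-positive on every token.

```python
import math


def column_stats(acts):
    """Per-dimension mean and (population) standard deviation over tokens."""
    n, d = len(acts), len(acts[0])
    mu = [sum(row[j] for row in acts) / n for j in range(d)]
    sigma = [math.sqrt(sum((row[j] - mu[j]) ** 2 for row in acts) / n) for j in range(d)]
    return mu, sigma


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def outlier_severity(acts):
    """gamma = ||mu|| / ||sigma||, following the definition quoted in the abstract."""
    mu, sigma = column_stats(acts)
    return norm(mu) / norm(sigma)


def mean_center(acts):
    """Subtract the activation mean from every token."""
    mu, _ = column_stats(acts)
    return [[x - m for x, m in zip(row, mu)] for row in acts]


def dead_at_init(acts, w):
    """True if ReLU(w . x) is zero for every token (bias assumed to be zero)."""
    return all(sum(wi * xi for wi, xi in zip(w, row)) <= 0 for row in acts)


# Dimension 0 has a large mean (50) relative to its per-token variation (1).
acts = [[51.0, 1.0], [49.0, -1.0], [51.0, -1.0], [49.0, 1.0]]
w = [-0.1, 1.0]  # hypothetical encoder direction, anti-aligned with the mean
centered = mean_center(acts)
```

Here the feature `w` never fires on the raw activations because the mean pushes every pre-activation negative, while the same direction fires on some tokens once the mean is subtracted, which is the mechanism the abstract describes.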

2605.31513 2026-06-01 cs.CV

Personalize Your Large Vision-language Models With In-context Prompt Tuning

Yanshu Li, Jiaqian Li, Kuai Yu, Xi Xiao, Dongfang Liu, Tianyang Wang, Ruixiang Tang

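AI summary: Proposes in-context prompt tuning (ICPT), in which a lightweight projection module extracts fine-grained visual semantics from multiple reference images and converts them into continuous prompts, with geometric regularization to counter environmental bias and cross-concept interference, for efficient personalization.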

Comments: 27 pages, 10 figures, 5 tables

Abstract

Large vision-language models (LVLMs) have demonstrated strong general multimodal capability and are increasingly deployed in downstream systems. This trend has driven growing interest in LVLM personalization, which aims to enable models to quickly and effectively learn out-of-distribution multimodal concepts to meet user-specific needs. However, many existing methods rely on inference-time training, which reduces efficiency. They also struggle to maintain accuracy in complex multi-image, multi-concept settings. These limitations restrict the broader deployment of LVLM-based systems. Therefore, this paper proposes in-context prompt tuning (ICPT). Specifically, ICPT employs a lightweight projection module capable of operating in complex scenarios to extract fine-grained visual semantics from multiple reference images, seamlessly transforming these features alongside identity-label mappings into continuous prompts. To maximize computational efficiency, this module adaptively determines the prompt length based on the intrinsic visual complexity of each concept. Crucially, to overcome the environmental biases and cross-concept interference prevalent in real-world applications, we introduce two novel geometric regularizations. These constraints refine prompt representations by decoupling key identities from transient environmental states and separating concepts to avoid semantic confusion. Extensive experiments show that ICPT achieves state-of-the-art personalization accuracy across diverse tasks and LVLM backbones.

2605.31512 2026-06-01 cs.CL

Reliable Multilingual Orthopedic Decision Support from Clinical Narratives: Language-Aware Adaptation and Verification-Guided Deferral

Danish Ali, Li Xiaojian, Sundas Iqbal, Farrukh Zaidi

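AI summary: For multilingual orthopedic decision support in low-resource healthcare settings, proposes a reliability-oriented framework that combines the language-aware adapted encoder IndicBERT-HPA with a deterministic selective-verification layer, achieving the strongest performance among the compared models on classification of English, Hindi, and Punjabi clinical text.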

Abstract

Multilingual orthopedic decision support remains challenging in low-resource healthcare settings, where clinical narratives contain specialized terminology, mixed scripts, incomplete evidence, label imbalance and language-dependent documentation patterns. This article presents a reliability-oriented framework for classifying free-text orthopedic notes in English, Hindi and Punjabi. We compare task-aligned multilingual transformer encoders, a task-fine-tuned DistilBERT baseline, zero-shot instruction-tuned large language models (LLMs) and a domain-adaptive encoder, IndicBERT-HPA. IndicBERT-HPA augments IndicBERT with language-aware orthopedic adapter heads to support clinically relevant multilingual representation learning. Evaluation extends beyond aggregate accuracy to per-class performance, ROC-AUC, AUPRC, expected calibration error, cross-language stability and robustness under controlled balanced and natural-prevalence distributions. The evaluated zero-shot LLMs remain substantially less effective than task-adapted encoders for closed-set classification, with language-dependent instability. Under natural clinical prevalence, IndicBERT-HPA achieves the strongest overall performance, reaching an averaged Macro-F1 of 0.8792, Macro-AUROC of 0.894 and AUPRC of 0.902. We further implement a deterministic selective-verification layer combining confidence gating, evidence-consistency checking and language-risk screening. On a randomly selected held-out 5,000-record subset, it achieves 84.4% selective accuracy and 0.76 selective Macro-F1 at 72.3% coverage, compared with 71.5% accuracy and 0.65 Macro-F1 for accept-all prediction. These results support reliability-oriented multilingual clinical decision support with explicit deferral.
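The coverage versus selective-accuracy trade-off reported above can be made concrete with a small sketch of confidence gating. This covers only one of the three checks the abstract lists (the evidence-consistency and language-risk checks are not modelled), and the function name, threshold, and toy records are hypothetical.

```python
def selective_metrics(confidences, predictions, labels, threshold):
    """Return (coverage, selective accuracy) when low-confidence cases are deferred."""
    accepted = [
        (pred, gold)
        for conf, pred, gold in zip(confidences, predictions, labels)
        if conf >= threshold
    ]
    coverage = len(accepted) / len(labels)
    if not accepted:
        return coverage, float("nan")
    accuracy = sum(pred == gold for pred, gold in accepted) / len(accepted)
    return coverage, accuracy


# Toy records: the one wrong prediction also has low confidence.
confidences = [0.95, 0.90, 0.60, 0.55, 0.80]
predictions = ["fracture", "sprain", "fracture", "arthritis", "sprain"]
labels = ["fracture", "sprain", "sprain", "arthritis", "sprain"]

accept_all = selective_metrics(confidences, predictions, labels, threshold=0.0)
gated = selective_metrics(confidences, predictions, labels, threshold=0.7)
```

Raising the threshold lowers coverage but raises accuracy on the accepted cases, which is the same direction of change as the abstract's comparison between accept-all prediction and the selective layer at 72.3% coverage.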

2605.31509 2026-06-01 cs.LG cs.AI

Skill Reuse as Compression in Agentic RL

Zhikun Xu, Yu Feng, Jacob Dineen, Taiwei Shi, Jieyu Zhao, Ben Zhou

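AI summary: Proposes ReuseRL, which applies the minimum description length principle to compress successful trajectories into a dictionary of reusable skills and uses a segmentation cost to penalize inefficiently encoded behavior, improving in-distribution and out-of-distribution success rates across multiple environments.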

2605.31508 2026-06-01 cs.CV

Internalizing Temporal Consistency in Video Object-Centric Learning without Explicit Regularization

Rongzhen Zhao, Zhiyuan Li, Juho Kannala, Joni Pajarinen

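AI summary: Proposes a video object-centric learning method that needs no explicit temporal-consistency loss (SSC), learning temporal consistency implicitly through Chrono-Channel Decomposition (CCD) and Cross-Temporal Reconstruction (CTR), and improving training efficiency and performance.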

Comments: 14 pages

Abstract

Video Object-Centric Learning (OCL) aims to represent objects as slot vectors and maintain their consistency across frames. Slot-Slot Contrastive (SSC) loss has become the cornerstone for state-of-the-art (SOTA) video OCL methods. While highly effective, SSC relies on one-to-one object correspondence across frames and introduces an extra loss. Following Occam's Razor, we propose a paradigm shift: temporal consistency is better enforced as an implicit model design rather than an explicit loss. To elegantly exclude SSC (xSSC), we introduce two quasi-zero-overhead synergistic mechanisms: (i) Chrono-Channel Decomposition (CCD) structurally disentangles slot representations along the channel dimension into static and dynamic sub-spaces, serving as an empirically unified information bottleneck; (ii) Cross-Temporal Reconstruction (CTR) stochastically reconstructs target features of either the current or previous time step by fusing current slots' static channels and target slots' dynamic channels, using a single standard OCL decoder with minor training adaptation. Thereby, the slot sets inherently learn temporal consistency by minimizing the standard reconstruction error alone. Extensive experiments show that integrating xSSC into leading baselines not only improves training efficiency but also establishes new SOTAs on video object discovery and recognition tasks. Furthermore, our PCA and gradient analyses confirm that objects' time-invariant semantics and time-variant kinematics are encoded into the proposed sub-spaces. Our source code, model checkpoints and training logs are provided on https://github.com/Genera1Z/xSSC.
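The channel split and cross-temporal fusion described in this abstract can be sketched on plain Python lists. This is a loose illustration under several assumptions that the abstract does not confirm: static channels are taken to be the first `n_static` entries of a slot, slots are assumed index-aligned across the two time steps, and the target step is chosen with probability 0.5. The function names are hypothetical and are not taken from the released code.

```python
import random


def split_channels(slot, n_static):
    """CCD-style split: the first n_static channels are 'static', the rest 'dynamic'."""
    return slot[:n_static], slot[n_static:]


def fuse_for_target(current_slot, target_slot, n_static):
    """Combine the current slot's static channels with the target slot's dynamic ones."""
    static, _ = split_channels(current_slot, n_static)
    _, dynamic = split_channels(target_slot, n_static)
    return static + dynamic


def ctr_decoder_input(slots_t, slots_prev, n_static, rng):
    """Pick the current or previous step as reconstruction target and fuse per slot."""
    target_is_prev = rng.random() < 0.5
    target_slots = slots_prev if target_is_prev else slots_t
    fused = [fuse_for_target(cur, tgt, n_static) for cur, tgt in zip(slots_t, target_slots)]
    return fused, target_is_prev
```

When the target is the current step, the fused slot equals the current slot, so the usual reconstruction is recovered; when the target is the previous step, only the dynamic channels change, which is what would push the static channels to carry time-invariant content.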

2605.31504 2026-06-01 cs.LG stat.ML

When Are Multimodal Predictions Biologically Supported? A Diagnostic Evaluation Framework

Dylan Steiner, Gustavo Arango-Argoty, Gerald Sun, Etai Jacob

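AI summary: Proposes the DECAT framework, which uses five null-referenced metrics and a rule-based decision procedure to classify multimodal representations into four diagnostic scenarios, detecting whether a model has learned shared biology, single-modality biology, or spurious correlations.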

Abstract

Multimodal models in oncology can produce accurate predictions, but accurate prediction does not reveal whether the model has learned biology that is shared across modalities, biology confined to one modality, or spurious correlations that reflect confounders rather than genuine biology. We introduce DECAT, a model-agnostic post-hoc evaluation framework that classifies multimodal representations into four diagnostic scenarios for a given task and modality, using five null-referenced metrics and a rule-based decision procedure. The framework operates on learned representations, requires no knowledge of which specific confounder is present, and returns indeterminate when the evidence is insufficient. We validate DECAT on synthetic data across four multimodal model classes (over 2,500 trained representations) and on real data from 8,979 TCGA patients, evaluating both multimodal embeddings and five pretrained pathology foundation models. Entangled models (e.g., CLIP) achieve near-perfect shared biology detection but falsely claim shared biology in the majority of cases where it is absent on real foundation model embeddings. This false claim rate increases with confound strength so that larger cohorts and stronger representations produce more confident but still incorrect diagnoses. Applied to both multimodal TCGA embeddings and five pathology foundation models without paired RNA, DECAT detects confounding invisible to AUROC without requiring the confounder labels, as confirmed by post-hoc stratification.

2605.31503 2026-06-01 cs.CV cs.LG

How can embedding models bind concepts?

Arnas Uselis, Darina Koishigarina, Seong Joon Oh

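AI summary: Studies the limits of concept binding in vision-language embedding models such as CLIP. Scene embeddings decompose additively into object representations, but CLIP's high-complexity binding function hinders generalization, whereas transformers trained with sufficient data coverage learn low-complexity binding functions with multiplicative interactions and generalize systematically.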

Comments: ICML 2026

Abstract

Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with binding: they recognize individual concepts but fail to represent which concepts form which objects. Although CLIP behaves like a bag-of-concepts model in cross-modal retrieval, object information is recoverable from its image and text embeddings separately. We study this tension through the binding function, which maps concepts to scene embeddings. We find that scene embeddings decompose additively into object representations, explaining why uni-modal probes can recover object information. However, CLIP's binding function is high-complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. We then ask whether this limitation is fundamental. We show that it is not. In controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage. These models learn low-complexity binding functions characterized by multiplicative interactions between concepts, enabling systematic generalization. Code is publicly available at https://github.com/oshapio/binding-concepts-complexity.
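A toy example helps show why a bag-of-concepts embedding cannot bind while a multiplicative binding function can, even when the scene embedding stays additive over objects. The one-hot concept vectors and the outer-product binding below are assumptions chosen for illustration; they are not the construction or the models used in the paper.

```python
# Hypothetical one-hot concept vectors.
COLORS = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}
SHAPES = {"square": [1.0, 0.0], "circle": [0.0, 1.0]}


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def bag_of_concepts(scene):
    """Each object is the concatenation [color ; shape]; the scene is the sum of objects."""
    total = [0.0] * 4
    for color, shape in scene:
        total = add(total, COLORS[color] + SHAPES[shape])
    return total


def multiplicative_binding(scene):
    """Each object is the flattened outer product color x shape; the scene is the sum."""
    total = [0.0] * 4
    for color, shape in scene:
        obj = [c * s for c in COLORS[color] for s in SHAPES[shape]]
        total = add(total, obj)
    return total


# Two scenes with the same concepts but swapped color-shape pairings.
scene_a = [("red", "square"), ("blue", "circle")]
scene_b = [("blue", "square"), ("red", "circle")]
```

Both scenes map to the same bag-of-concepts vector, so the pairing is lost, whereas the multiplicative version yields different scene embeddings for the two pairings.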

2605.31500 2026-06-01 cs.LG cs.AI

On Efficient Scaling of GNNs via IO-Aware Layers Implementations

Daria Fomina, Daniil Krasylnikov, Alexey Boykov, Andrey Dolgovyazov, Vyacheslav Zhdanovskiy, Fedor Velikonivtsev

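AI summary: Targeting the sparse, irregular memory-access bottleneck in GNNs, develops GPU kernels for three kernel families (SpMM-based convolutions, reduction-based aggregations, and attention-based layers) that reduce data movement and improve locality, reaching up to 8.5× speedup with up to 76× lower peak memory for GATv2 on realistic graphs.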

Comments: International Conference on Machine Learning (ICML) 2026, Spotlight Paper

Abstract

Graph Neural Networks (GNNs) are bottlenecked by sparse, irregular memory access. Popular frameworks such as DGL and PyTorch Geometric support general message passing, but complex layers often materialize edge-wise intermediates, increasing memory traffic and limiting scalability on large graphs. We take an I/O- and arithmetic-intensity-centric view and show that widely used layers fall into three kernel families: SpMM-based convolutions, reduction-based aggregations, and attention-based layers (GATv2/Graph Transformer). For each family, we develop GPU kernels that reduce data movement, improve locality, and remain robust across realistic graphs. We also study graph reordering and find that its impact depends on the kernel mapping: it benefits neighbor-parallel (gather-dominated) kernels more consistently than feature-parallel designs. Empirically, our fused attention kernels reach up to 3.9× speedup for Graph Transformer (median 1.6×), with Tensor Core (block-sparse) variants up to 7.3× on locally dense graphs; for GATv2 we reach up to 8.5× speedup (median 2.0×) while reducing peak memory by up to 76× (median 6×). Our degree-aware reduction kernels achieve up to 10× speedup (median 2.6×). For SpMM-based layers, properly cached cuSPARSE achieves up to 8× speedup over DGL and outperforms evaluated custom baselines in the majority of evaluations. We release our implementations as drop-in replacements to support reproducible, hardware-aware GNN acceleration.
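The difference between materializing edge-wise intermediates and aggregating directly in an SpMM-style loop can be shown in a CPU-only sketch. This is plain Python for illustration, not the paper's GPU kernels or any DGL, PyTorch Geometric, or cuSPARSE API; the function names and the tiny graph are hypothetical.

```python
def aggregate_edgewise(edges, features, num_nodes):
    """Sum-aggregation that first copies one message per edge (|E| x d intermediate)."""
    messages = [(dst, list(features[src])) for src, dst in edges]
    dim = len(features[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for dst, msg in messages:
        for k in range(dim):
            out[dst][k] += msg[k]
    return out


def to_csr(edges, num_nodes):
    """Build CSR (indptr, indices) with one row per destination node."""
    indptr = [0] * (num_nodes + 1)
    for _, dst in edges:
        indptr[dst + 1] += 1
    for i in range(num_nodes):
        indptr[i + 1] += indptr[i]
    indices = [0] * len(edges)
    cursor = indptr[:-1].copy()
    for src, dst in edges:
        indices[cursor[dst]] = src
        cursor[dst] += 1
    return indptr, indices


def aggregate_spmm(indptr, indices, features):
    """Row-wise CSR SpMM: accumulate neighbours straight into the output row."""
    dim = len(features[0])
    out = []
    for row in range(len(indptr) - 1):
        acc = [0.0] * dim
        for pos in range(indptr[row], indptr[row + 1]):
            src = indices[pos]
            for k in range(dim):
                acc[k] += features[src][k]
        out.append(acc)
    return out


edges = [(0, 1), (2, 1), (1, 0)]             # (source, destination) pairs
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
indptr, indices = to_csr(edges, num_nodes=3)
```

Both routes produce the same aggregated features; the SpMM form simply never stores the per-edge messages, which is the memory-traffic saving the abstract attributes to avoiding edge-wise intermediates.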

2605.31497 2026-06-01 cs.LG stat.ML

Assign and Add: A Mechanistic Study of Compositional Arithmetic

Assign and Add: 组合算术的机制研究

Brady Exoo, Alberto Bietti, John Sous

AI总结通过变量赋值和模加法任务，研究Transformer中组合泛化的机制，发现模型利用同一模加法模块处理直接和间接输入，并揭示了三阶段学习动态。

AI中文摘要

大型语言模型能够组合技能以执行复杂任务，其中许多任务可能在训练期间未曾见过。这种组合发生的具体细节仍然难以捉摸。在本文中，我们通过考虑一个涉及变量赋值和模加法的简单受控设置，研究Transformer中组合泛化的机制。通过将训练数据划分为不相交的集合，我们观察到小型Transformer能够泛化到先前未见过的变量和数字组合。我们的机制分析表明，无论输入是直接给出还是通过单独的变量赋值机制间接给出，都使用相同的“模加法”MLP模块。我们还从经验角度分析了训练动态，揭示了三个学习阶段：首先学习模加法，然后学习变量赋值所需的结构，最后是精炼阶段，模型泛化到训练中未见的一些困难序列。最后，我们提供了一个理论框架来解释组合性如何从训练动态中涌现。这些结果表明，组合泛化可以是Transformer内部机制组合性的自然结果。

英文摘要

Large language models are able to compose skills in order to perform complex tasks, many of which might not have been seen during training. The details of how exactly this composition occurs remain elusive. In this paper, we study a mechanism for compositional generalization in transformers by considering a simple controlled setting involving variable assignment and modular addition. By partitioning our training data into disjoint sets, we observe that small transformers are able to generalize to previously unseen combinations of variables and numbers. Our mechanistic analysis shows that the same "modular addition" MLP module is used whether the inputs are given directly or indirectly through a separate variable assignment mechanism. We also analyze the training dynamics from an empirical lens, which reveals three phases of learning: first, modular addition is learned, then the structure required for variable assignment, and finally a refinement phase where the model generalizes to some hard sequences not seen in training. Finally, we provide a theoretical framework to explain how compositionality emerges from training dynamics. These results suggest that compositional generalization can be a natural consequence of the compositionality of internal mechanisms in transformers.
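
As a rough illustration of the setting, the sketch below generates "assign, then add" examples and splits (variable, value) pairs into disjoint sets. The abstract does not give the prompt format, modulus, or split ratio, so `P`, `VARIABLES`, the string layout, and the 80/20 split are all assumptions made for illustration.

```python
import itertools
import random

P = 7                         # modulus (hypothetical choice)
VARIABLES = ["a", "b", "c"]   # variable names (hypothetical)


def make_example(var1, val1, var2, val2, p=P):
    """Build one prompt/answer pair: assign two variables, then add them mod p."""
    prompt = f"{var1}={val1} {var2}={val2} {var1}+{var2}="
    return prompt, (val1 + val2) % p


def split_assignments(seed=0):
    """Partition (variable, value) pairs into disjoint train and held-out sets."""
    pairs = list(itertools.product(VARIABLES, range(P)))
    random.Random(seed).shuffle(pairs)
    cut = int(0.8 * len(pairs))
    return set(pairs[:cut]), set(pairs[cut:])


TRAIN_PAIRS, HELD_OUT_PAIRS = split_assignments()
```

For example, `make_example("a", 3, "b", 5)` yields the prompt `a=3 b=5 a+b=` with answer `1`. Held-out pairs never appear in training assignments, so a model that answers them correctly must reuse the addition it learned on other variable/number combinations.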

2605.31494 2026-06-01 cs.CL cs.LG

Consolidating Rewarded Perturbations for LLM Post-Training

整合奖励扰动用于大语言模型后训练

Zheyu Zhang, Shuo Yang, Gjergji Kasneci

AI总结提出CoRP方法，通过奖励加权聚合、兼容性重加权和验证门控，将奖励扰动整合为单一模型，无需梯度，在单次推理下平均提升8.1分。

AI中文摘要

语言模型的后训练通常被框架为通过梯度下降实现的样本-分数-更新循环。最近的一系列工作，以RandOpt为例，将此循环转移到权重空间，在预训练模型周围采样高斯扰动，并在推理时集成前K个奖励专家。虽然在与PPO和GRPO匹配训练计算量下具有竞争力，但这种预测级集成每个测试样本需要K次前向传播，并且不能干净地扩展到自由生成。我们询问是否可以将奖励种群折叠成一个单一的可部署模型，用一次整合更新替代推理时集成。对25个模型-任务对的拆分半分析揭示了每种情况下可复现的低秩结构。我们将这种几何结构转化为CoRP（整合奖励扰动），这是一种无梯度算子，结合了奖励加权聚合、兼容性感知重加权和保留验证门控，且没有梯度通过语言模型。在从0.5B到8B的五个语言模型和涵盖数学、代码和创意写作的五个任务上，CoRP平均将基础模型提升了8.1分。使用RandOpt扰动预算的十分之一，CoRP超过了单次推理的RandOpt 6.5分，并恢复了50次多数投票集成增益的一半以上，而每个测试样本只需一次前向传播。

英文摘要

Post-training of language models is commonly framed as a sample-score-update loop implemented by gradient descent. A recent line of work, exemplified by RandOpt, relocates this loop to weight space, sampling Gaussian perturbations around a pretrained model and ensembling the top-K rewarded specialists at inference. While competitive with PPO and GRPO under matched training compute, this prediction-level ensemble incurs K forward passes per test example and does not extend cleanly to free-form generation. We ask whether the rewarded population can instead be folded into a single deployable model, replacing the inference-time ensemble with one consolidated update. A split-half analysis over 25 model-task pairs reveals reproducible low-rank structure in every case. We turn this geometry into CoRP (Consolidating Rewarded Perturbations), a gradient-free operator that combines reward-weighted aggregation, compatibility-aware reweighting, and a held-out validation gate, with no gradient flowing through the language model. Across five language models from 0.5B to 8B and five tasks covering math, code, and creative writing, CoRP improves the base model by 8.1 points on average. Using one tenth of RandOpt's perturbation budget, CoRP exceeds single-inference RandOpt by 6.5 points and recovers more than half of the gain of the 50-pass majority-vote ensemble, at one forward pass per test example.
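
The general shape of a gradient-free consolidation step (sample perturbations, weight them by reward, gate on held-out data) might look like the sketch below. It is not CoRP itself: the compatibility-aware reweighting is omitted, the softmax weighting is an assumed choice, and `consolidate`, `reward_fn`, `val_fn`, and every default value are hypothetical.

```python
import math
import random


def consolidate(base, reward_fn, val_fn, n=32, sigma=0.1, top_k=8, temp=1.0, seed=0):
    """Fold rewarded Gaussian perturbations of `base` into one weight vector."""
    rng = random.Random(seed)
    # Sample perturbations around the base weights and score each perturbed model.
    perturbations = [[rng.gauss(0.0, sigma) for _ in base] for _ in range(n)]
    scored = [
        (reward_fn([w + d for w, d in zip(base, delta)]), delta)
        for delta in perturbations
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    top = scored[:top_k]
    # Reward-weighted aggregation (softmax over the top-k rewards; an assumption).
    best = top[0][0]
    weights = [math.exp((reward - best) / temp) for reward, _ in top]
    total = sum(weights)
    update = [
        sum(w * delta[i] for w, (_, delta) in zip(weights, top)) / total
        for i in range(len(base))
    ]
    candidate = [w + u for w, u in zip(base, update)]
    # Held-out validation gate: keep the update only if it helps on unseen data.
    return candidate if val_fn(candidate) > val_fn(base) else list(base)
```

The returned vector is a single model, so inference needs one forward pass rather than one per ensemble member, which is the trade-off the abstract describes.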

2605.31492 2026-06-01 cs.AI

LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories

LinTree: 通过显式结构化搜索历史提升LLM推理能力

Liwei Kang, Yee Whye Teh, Wee Sun Lee

AI总结针对LLM推理中隐式搜索树导致性能不佳的问题，提出LinTree方法，通过添加父指针显式表示线性化树结构，在Blocks World、网格导航和Sokoban任务中提升了任务性能和搜索效率。

Comments: 16 pages, 3 figures

AI中文摘要

大型语言模型（LLM）通常通过生成中间轨迹来解决推理问题，这些轨迹探索并修正部分解决方案。从搜索的角度来看，这些轨迹可以视为线性化的搜索树，其中模型扩展部分解决方案，失败时放弃并回溯尝试替代方案。与传统启发式搜索相比，这种策略有一个潜在优势：它基于整个搜索轨迹而非仅当前局部状态进行条件化。我们首先测试LLM是否利用这一优势，通过比较轨迹条件推理策略与配备仅观察当前局部状态的LLM启发式的最佳优先搜索。在三个受控推理环境（Blocks World、网格导航和Sokoban）中，我们发现仅原始访问搜索历史不足以可靠地超越启发式搜索。然后我们研究了一个可能的原因：在LLM推理轨迹中，底层搜索树仅隐式表示，当模型回溯或切换分支时，轨迹并未明确标识正在重新访问哪个早期搜索状态。我们表明，添加简单的父指针以显式表示线性化树（LinTree）结构，相对于隐式推理模型和LLM启发式引导搜索，提高了任务性能和搜索效率。这些结果表明，当树结构被显式化时，搜索历史变得最为有用，从而激励LLM推理中更具结构意识的表示。

英文摘要

Large language models (LLMs) often solve reasoning problems by generating intermediate traces that explore and revise partial solutions. From a search perspective, these traces can be viewed as linearized search trees, where the model extends a partial solution, abandons it when it fails, and backtracks to try alternatives. Compared with traditional heuristic-guided search, such a policy has a potential advantage: it conditions on the whole search trace rather than only on the current local state. We first test whether LLMs utilize this advantage by comparing trace-conditioned reasoning policies against best-first search equipped with an LLM heuristic that only observes the current local state. Across three controlled reasoning environments, Blocks World, grid Navigation, and Sokoban, we find that raw access to search history alone is not enough to reliably outperform heuristic search. We then study one possible reason: in LLM reasoning traces, the underlying search tree is only implicitly represented, and when the model backtracks or switches branches, the trace does not explicitly identify which earlier search state is being revisited. We show that adding simple parent pointers to explicitly represent the linearized tree (LinTree) structure improves both task performance and search efficiency relative to implicit reasoning models and LLM-heuristic-guided search. These results suggest that search history becomes most useful when its tree structure is made explicit, motivating more structure-aware representations for LLM reasoning.
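
To make the idea of parent pointers concrete, the sketch below linearizes a depth-first search into a flat trace in which every entry names its parent. The trace format used in the paper is not given in the abstract, so the tuple layout `(node_id, state, parent_id)` and the toy tree are assumptions.

```python
def linearize(tree, root):
    """Depth-first trace where each entry is (node_id, state, parent_id)."""
    trace = []
    stack = [(root, None)]
    while stack:
        state, parent_id = stack.pop()
        node_id = len(trace)
        trace.append((node_id, state, parent_id))
        # Push children in reverse so the leftmost child is expanded first.
        for child in reversed(tree.get(state, [])):
            stack.append((child, node_id))
    return trace


def path_to(trace, node_id):
    """Follow parent pointers to rebuild the partial solution for a node."""
    path = []
    while node_id is not None:
        _, state, parent_id = trace[node_id]
        path.append(state)
        node_id = parent_id
    return path[::-1]


# Hypothetical toy search tree: S expands to A and B; A leads to C, B to G.
TREE = {"S": ["A", "B"], "A": ["C"], "B": ["G"]}
TRACE = linearize(TREE, "S")
```

In this trace the entry for `B` comes right after the dead end `C`, but its parent pointer says it branches from `S`, not from `C`. Without the pointer, a reader of the flat trace would have to infer which earlier state the backtrack returned to.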

2605.31486 2026-06-01 cs.RO

Learning Controlled Separation of Small Objects Between Two Fingers with a Tactile Skin

利用触觉皮肤学习两个手指间小物体的受控分离

Ulf Kasolowsky, Berthold Bäuml

AI总结本文提出并解决了多用途机器人手两个手指间小物体的受控分离任务，通过强化学习训练纯触觉策略，并分析了空间分辨触觉反馈的优势。

AI中文摘要

我们提出并解决了多用途机器人手两个手指间小物体的受控分离这一新任务：在抓取一盒小物体后，任务是丢弃尽可能多的物体，直到手指间保留所需数量。这些物体相对于手指宽度很小，而且绝对尺寸也很小。在我们的案例中，处理的是直径仅为6毫米的小颗粒。我们证明，该任务可以纯粹通过触觉（无视觉）完成，使用指尖上的空间分辨触觉皮肤。分离策略通过强化学习在模拟中训练，使用简单的稀疏奖励，基本上检查是否达到所需物体数量。在模拟实验中，我们详尽分析了使用空间分辨触觉反馈的好处：虽然理想（高分辨率）触觉传感器几乎可以完美完成任务，但空间分辨率较低的传感器（此处为4x4触觉单元）与仅使用手指关节传感器相比，仍能带来高达20%的改进。为了进行此分析，我们还在策略旁边训练了一个估计器，用于预测真实接触位置。最后，我们展示了配备触觉皮肤的DLR-Hand II的成功仿真到现实迁移。

英文摘要

We introduce and solve the novel task of controlled separation of small objects between two fingers of a multi-purpose robotic hand: after grasping from a box of small objects, the hand must drop objects until a desired number remains between the fingers. The objects are small both relative to the width of the fingers and in absolute terms; in our case, the hand handles pellets only 6 mm in diameter. We show that the task can be performed purely by touch (no vision) using a spatially resolved tactile skin on a fingertip. The separation policy is trained in simulation via reinforcement learning using a simple sparse reward that checks whether the desired number of objects has been reached. In simulation experiments, we provide an exhaustive analysis of the benefits of spatially resolved tactile feedback: while an ideal (high-resolution) tactile sensor solves the task almost perfectly, a sensor with lower spatial resolution (here 4x4 taxels) still yields an improvement of up to 20% over using only the fingers' joint sensors. For this analysis, we additionally train an estimator alongside the policy that predicts the ground-truth contact positions. Finally, we demonstrate successful sim-to-real transfer to the DLR-Hand II equipped with a tactile skin.

2605.31485 2026-06-01 cs.LG math.CT

Graphical einops: bridging tensor networks and computation graphs

Graphical einops: 桥接张量网络与计算图

Vincent Wang-Maścianica, Nikhil Khatri

AI总结本文提出一种形式化的图形演算，用于einops的张量编程结构片段，通过等级自然性重写实现张量等变性的图解证明，并应用于注意力掩码转换以优化稀疏注意力实现。

2605.31484 2026-06-01 cs.LG

Balanced LoRA: Removing Parameter Invariance to Accelerate Convergence

平衡LoRA：消除参数不变性以加速收敛

Valérie Castin, Kimia Nadjahi, Pierre Ablin, Gabriel Peyré

AI总结针对LoRA过参数化导致不同低秩因子对条件数差异大而影响收敛速度的问题，提出BaLoRA，通过投影到平衡流形改善损失景观条件，实现更快收敛和更优性能。

Comments: Accepted at ICML 2026

AI中文摘要

低秩适应（LoRA）是微调大型语言模型最广泛采用的方法。值得注意的是，LoRA本质上是过参数化的：多对低秩因子可以产生相同的适应权重矩阵。我们从理论和经验上表明，这些对表现出显著不同的条件数。因此，收敛到不同的损失最小化器直接影响LoRA的收敛速度。基于这一观察，我们引入了平衡低秩适应（BaLoRA），这是LoRA的一种变体，将迭代投影到平衡流形上。该流形在保持适应矩阵的同时改善了损失景观的条件。投影步骤计算量轻，并可以无缝集成到现有的微调流程中。经验上，BaLoRA比标准LoRA收敛更快，并在各种微调任务中实现了更优的性能。

英文摘要

Low-Rank Adaptation (LoRA) is the most widely adopted method for fine-tuning large language models. Notably, LoRA is inherently overparameterized: multiple pairs of low-rank factors can yield the same adapted weight matrix. We show, both theoretically and empirically, that these pairs exhibit significantly different condition numbers. As a result, converging to different loss minimizers directly impacts the convergence rate of LoRA. Building on this observation, we introduce Balanced Low-Rank Adaptation (BaLoRA), a variant of LoRA that projects iterates onto a balanced manifold. This manifold improves the conditioning of the loss landscape while preserving the adapted matrix. The projection step is computationally lightweight and integrates seamlessly into existing fine-tuning pipelines. Empirically, BaLoRA converges faster than standard LoRA and achieves superior performance across a range of fine-tuning tasks.
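
The overparameterization is easiest to see in the rank-1 case, where the update is an outer product of a column factor `b` and a row factor `a`: scaling `b` by `s` and `a` by `1/s` leaves the product unchanged. The sketch below rescales the pair so both factors have equal norm. It assumes "balanced" means equal Gram matrices (equal norms for rank 1), which is a common convention in matrix factorization; the abstract does not state the paper's exact definition or projection, and `balance_rank1` and `outer` are hypothetical helper names.

```python
import math


def balance_rank1(b, a):
    """Rescale rank-1 factors so ||b|| == ||a|| without changing outer(b, a)."""
    norm_b = math.sqrt(sum(x * x for x in b))
    norm_a = math.sqrt(sum(x * x for x in a))
    if norm_b == 0.0 or norm_a == 0.0:
        return list(b), list(a)  # a zero update is already balanced
    s = math.sqrt(norm_a / norm_b)
    return [x * s for x in b], [x / s for x in a]


def outer(b, a):
    """Adapted-weight update for rank-1 factors: the outer product b a^T."""
    return [[x * y for y in a] for x in b]
```

For instance, `b = [4.0, 0.0]` and `a = [0.0, 1.0]` become `[2.0, 0.0]` and `[0.0, 2.0]`: the norms are now equal, while `outer(b, a)` is the same matrix before and after.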

2605.31481 2026-06-01 cs.RO

Batched Differentiable Rigid Body Dynamics in PyTorch for GPU-Accelerated Robot Learning

Yue Wang, Yanran Xu, Wenbo Wu, Chuanhang Qiu, Zhaoxing Li

AI总结提出BARD，一种基于PyTorch的批处理可微刚体动力学库，通过三级缓存、无矩阵乘法的关节变换和层级并行传播，在GPU上实现高达64倍的前向运动学加速，并支持梯度计算。

AI中文摘要

随着机器人控制转向大规模强化学习与循环动力学计算，社区对Pinocchio等CPU绑定库的依赖在基于GPU的训练流程中造成了吞吐瓶颈。我们提出了BARD（批处理铰接刚体动力学），这是一个自包含的PyTorch实现，基于Featherstone的刚体动力学算法，针对批处理GPU评估和自动微分进行了优化。三个设计选择使其高效：分层惰性求值缓存避免冗余树遍历，通过预计算的Rodrigues常数实现无矩阵乘法的关节变换，以及将顺序操作减少为树深度批处理步骤的层级并行传播。在五个机器人模型（7-23自由度）上，BARD在数值上匹配Pinocchio，同时在NVIDIA H200上以批大小4096实现前向运动学高达64倍、雅可比矩阵高达63倍的吞吐量提升。我们通过基于梯度的系统辨识验证了可微性，在7自由度机械臂上，在5%扭矩噪声下将连杆质量恢复至1.24%的平均误差，并将BARD集成到Isaac Lab AMP训练流程中，用于具有4096个并行环境的11自由度脊柱四足机器人，其在循环动力学中比Pinocchio快8.5倍，比ADAM快2.0倍。BARD已开源：https://github.com/YueWang996/bard-pytorch-dynamics。

英文摘要

As robot control shifts toward large-scale reinforcement learning with in-loop dynamics computation, the community's reliance on CPU-bound libraries such as Pinocchio creates a throughput bottleneck in GPU-based training pipelines. We present BARD (Batched Articulated Rigid-body Dynamics), a self-contained PyTorch implementation of Featherstone's rigid-body dynamics algorithms, optimized for batched GPU evaluation and automatic differentiation. Three design choices make this efficient: a tiered lazy-evaluation cache that avoids redundant tree traversals, matmul-free joint transforms via pre-computed Rodrigues constants, and level-parallel propagation that reduces sequential operations to tree-depth batched steps. On five robot models (7-23 DOFs), BARD matches Pinocchio numerically while reaching up to 64x higher throughput for Forward Kinematics and 63x for Jacobians at batch size 4096 on an NVIDIA H200. We validate differentiability through gradient-based system identification on a 7-DOF manipulator, recovering link masses to 1.24% mean error under 5% torque noise, and integrate BARD into an Isaac Lab AMP training pipeline for an 11-DOF spined quadruped with 4096 parallel environments, where it is 8.5x faster than Pinocchio and 2.0x faster than ADAM for in-loop dynamics. BARD is open-sourced at: https://github.com/YueWang996/bard-pytorch-dynamics.
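
One way to read "matmul-free joint transforms via pre-computed Rodrigues constants" is through the Rodrigues formula $R(\theta) = I + \sin\theta\,K + (1-\cos\theta)\,K^2$, where $K$ is the skew-symmetric matrix of the joint axis. Since $K$ and $K^2$ depend only on the fixed axis, they can be computed once, leaving only element-wise work per joint angle. The sketch below shows that idea in plain Python; it is an interpretation rather than BARD's code, and `RevoluteJoint`, `skew`, and `matmul3` are hypothetical names.

```python
import math


def skew(axis):
    """Skew-symmetric cross-product matrix K of a 3-vector."""
    x, y, z = axis
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def matmul3(a, b):
    """3x3 matrix product, used only once per joint at construction time."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


class RevoluteJoint:
    def __init__(self, axis):
        norm = math.sqrt(sum(c * c for c in axis))
        unit = [c / norm for c in axis]
        self.K = skew(unit)
        self.K2 = matmul3(self.K, self.K)  # precomputed constant

    def rotation(self, theta):
        """R = I + sin(theta) K + (1 - cos(theta)) K^2, element-wise only."""
        s, c1 = math.sin(theta), 1.0 - math.cos(theta)
        return [[(1.0 if i == j else 0.0) + s * self.K[i][j] + c1 * self.K2[i][j]
                 for j in range(3)] for i in range(3)]
```

In a batched setting the same expression would broadcast the constant `K` and `K2` against a vector of angles, so each rotation costs a few scaled additions instead of a matrix multiplication.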

2605.31480 2026-06-01 cs.CL

Language Models Can Resolve Reference Compositionally, But It's Not Their Native Strength: The Case of the Personal Relation Task

语言模型可以组合性地解析指代，但这并非其天然优势：以个人关系任务为例

Bart Evelo, Meaghan Fowlie, Denis Paperno

AI总结通过个人关系任务，比较人类与大型语言模型在外延任务（确定指称对象）和内涵任务（结构化表示意义）上的表现，发现人类更擅长外延任务而LLM更擅长内涵任务，表明缺乏指称基础是LLM模拟人类语言理解的关键缺失。

Comments: A pre-MIT Press publication version. Paper accepted to Transactions of the Association for Computational Linguistics

AI中文摘要

神经模型（如大型语言模型）是否真正获得了组合性能力来理解自然语言？当我们谈论语义解释时，可以区分两个互补的方面：确定一个表达式在世界中的指称（我们称之为外延任务）以及以结构化方式表示其意义（我们称之为内涵任务）。我们在个人关系任务（Paperno 2022）的设置中评估了LLM和人类在这两项任务上的表现，该任务给定一个人际关系宇宙，要求解释诸如“Amber的父母的朋友”这样的名词短语。这里，对于内涵任务，答案是公式“friend(parent(amber))”；对于外延任务，答案是具体的人。我们发现人类和LLM表现出相反的强项：人类在外延任务上表现优于内涵任务，而LLM则相反。我们的方法为理解现代机器学习模型中的组合性能力带来了更细致的视角。我们的结果支持这样一种观点：LLM训练中缺乏指称基础是模仿人类语言理解的关键缺失成分。

英文摘要

Do neural models, such as Large Language Models, genuinely acquire compositional abilities for interpretation of natural language? When we talk about semantic interpretation, we can distinguish two complementary aspects: establishing what an expression refers to in the world (which we call the Extensional task) and representing its sense in a structured way (which we call the Intensional task). We evaluate LLMs and humans on both tasks in the setting of the Personal Relation Task (Paperno 2022) in which, given a universe of people and their relationships with each other, one is asked to interpret a noun phrase such as "Amber's parent's friend". Here, for the Intensional task, the answer is the formula "friend(parent(amber))", and for the Extensional task, the person. We find that humans and LLMs show opposite strengths: humans perform better on Extensional than Intensional tasks, and LLMs vice versa. Our methodology brings greater nuance to the understanding of compositional abilities in modern machine learning models. Our results support the notion that the lack of referential grounding in LLM training is a crucial missing component in mimicking human-like language understanding.
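
The two tasks can be contrasted on the abstract's own example. In the sketch below, the Intensional answer is built as a formula string and the Extensional answer is looked up in a small universe. The universe contents (`bob`, `carol`, `dana`) and the assumption that each relation maps a person to exactly one person are invented for illustration; only the phrase and the formula `friend(parent(amber))` come from the abstract.

```python
# Hypothetical universe: each relation maps a person to exactly one person.
RELATIONS = {
    "parent": {"amber": "bob"},
    "friend": {"bob": "carol", "amber": "dana"},
}


def parse(phrase):
    """Split "Amber's parent's friend" into ['amber', 'parent', 'friend']."""
    return phrase.replace("'s", "").lower().split()


def intensional(phrase):
    """Structured sense of the phrase, as a nested formula."""
    name, *relations = parse(phrase)
    formula = name
    for relation in relations:
        formula = f"{relation}({formula})"
    return formula


def extensional(phrase, universe=RELATIONS):
    """Referent of the phrase: follow each relation through the universe."""
    name, *relations = parse(phrase)
    person = name
    for relation in relations:
        person = universe[relation][person]
    return person
```

Here `intensional("Amber's parent's friend")` gives `friend(parent(amber))` without consulting the universe at all, whereas `extensional` must chain two lookups and returns `carol` in this toy universe.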

2605.31478 2026-06-01 cs.SE cs.CL cs.SY eess.SY

Knowledge Boundary Probing and Demand-Guided Intervention for LLM-Based Power System Code Generation

知识边界探测与需求引导干预：面向基于LLM的电力系统代码生成

Hui Wu, Xiaoyang Wang, Zhong Fan

AI总结针对LLM在电力系统代码生成中因API知识边界错误导致失败的问题，提出PowerCodeBench基准、L0-L3文档驱动探测和边界感知干预方法，显著提升模型准确率。

Comments: 43 pages, 12 figures, includes supplementary material

AI中文摘要

大型语言模型（LLMs）越来越多地被用于自动化电力系统分析，但许多公用事业和能源研究实验室出于保密、监管、可重复性和成本原因，要求本地部署。这使得开源模型的可靠性成为一个部署问题。我们表明，电力系统代码生成中的首次失败并非仅由推理主导，而是由结构化的API知识边界错误主导：在版本化的仿真库中出现虚构的函数名、误用的参数以及处理不当的结果表。我们引入了PowerCodeBench，一个经过执行验证的基准生成器，它将自然语言操作员查询与pandapower代码和数值真值配对；一个L0-L3文档驱动的探测程序，用于测量每个模型的API知识概况；以及一种边界感知干预，将查询侧API需求估计与目标主动文档注入和路由被动修正相结合。在一个包含2000个任务的冻结版本上，我们评估了十个开源LLM（1.5B-480B参数）和四个商业中端API。该干预措施使每个评估的至少7B参数的开源模型和每个商业API提升了32到56个准确率点。70B-120B范围内的开源模型匹配了商业中端准确率范围，而Llama-3.1-405B和Qwen3-Coder-480B领先。目标提示在保持全上下文准确率上限的同时，使用了41%的提示令牌成本。结果是在不进行微调或云端推理的情况下，为电网分析工作流提供可靠的本地LLM辅助的准确率侧、部署时路径。

英文摘要

Large language models (LLMs) are increasingly used to automate power-system analysis, but many utilities and energy-research labs require on-premise serving for confidentiality, regulatory, reproducibility, and cost reasons. This makes the reliability of open-weight models a deployment issue. We show that first-pass failures in power-system code generation are dominated not by reasoning alone, but by structured API-knowledge boundary errors: hallucinated function names, misused parameters, and mishandled result tables in versioned simulation libraries. We introduce PowerCodeBench, an execution-validated benchmark generator that pairs natural-language operator queries with pandapower code and numerical ground truth; an L0-L3 documentation-driven probing procedure that measures per-model API knowledge profiles; and a boundary-aware intervention that combines query-side API demand estimation with targeted proactive documentation injection and routed reactive correction. On a 2,000-task frozen release, we evaluate ten open-weight LLMs (1.5B-480B parameters) and four commercial mid-tier APIs. The intervention improves every evaluated open-weight model of at least 7B parameters and every commercial API by 32 to 56 accuracy points. Open-weight models in the 70B-120B range match the commercial mid-tier accuracy range, while Llama-3.1-405B and Qwen3-Coder-480B lead the panel. The targeted prompts preserve the full-context accuracy ceiling while using 41% of the prompt-token cost. The result is an accuracy-side, deployment-time path toward reliable on-premise LLM assistance for grid-analysis workflows without fine-tuning or cloud inference.

2605.31476 2026-06-01 cs.RO

IDOL: Inverse-Dynamics-Guided Future Prediction for End-to-End Autonomous Driving

IDOL: 逆动力学引导的未来预测用于端到端自动驾驶

Chenghao Zhang, Timin Li, Dongmei Li

AI总结提出IDOL框架，通过逆动力学模型将BEV世界模型预测的未来潜在场景状态转化为规划相关的轨迹增量，实现未来预测与轨迹优化的紧密耦合，在NAVSIM基准上达到最优性能。

Comments: 20 pages, 5 figures

AI中文摘要

端到端自动驾驶已成为直接从传感器观测学习规划的有力范式，而近期基于世界模型的方法通过显式推理场景未来演化进一步丰富了这一范式。然而，仅靠未来预测并不能保证更好的规划，除非预测的演化能够转化为规划相关的轨迹更新。当前许多方法仍预测未来场景状态，而未明确解码状态转换中隐藏的运动含义。因此，未来推理通常仅具有描述性价值，而与可执行运动生成的耦合较弱。为解决此限制，我们提出IDOL，一种基于逆动力学的未来预测框架，用于潜在BEV空间中基于世界模型的端到端规划，其中逆动力学作为未来预测与轨迹优化之间的关键桥梁。IDOL首先使用BEV世界模型预测多个未来潜在场景状态，然后对相邻潜在未来应用逆动力学模型，以解码过渡感知的轨迹特征并恢复规划相关的运动增量，解释潜在世界随时间如何演化。这些逆动力学导出的信号用于优化规划轨迹，将未来预测从被动场景预测转变为可操作的规划指导。轻量级闭环细化模块通过重用优化轨迹进行另一轮未来感知推理，进一步改善长时一致性。通过将逆动力学引入潜在未来推理，IDOL加强了世界建模与规划之间的耦合。在NAVSIM v1和NAVSIM v2基准上的大量实验表明，IDOL在可比方法中达到了最先进的性能。

英文摘要

End-to-end autonomous driving has emerged as a compelling paradigm for learning planning directly from sensor observations, while recent world-model-based approaches further enrich this paradigm by enabling explicit reasoning about how the scene may evolve in the future. Yet future prediction alone does not guarantee better planning unless the predicted evolution can be converted into planning-relevant trajectory updates. Many current methods still forecast future scene states without explicitly decoding the motion implications hidden in state transitions. As a result, future reasoning often remains descriptively useful but only weakly coupled to executable motion generation. To address this limitation, we propose IDOL, an inverse-dynamics-guided future prediction framework for world-model-based end-to-end planning in latent BEV space, where inverse dynamics serves as the key bridge between future prediction and trajectory optimization. IDOL first predicts multiple future latent scene states with a BEV world model, then applies an inverse dynamics model to adjacent latent futures to decode transition-aware trajectory features and recover planning-relevant motion deltas that explain how the latent world evolves over time. These inverse-dynamics-derived signals are used to optimize the planned trajectory, turning future forecasting from passive scene anticipation into actionable planning guidance. A lightweight closed-loop refinement module further improves long-horizon consistency by reusing the optimized trajectory for another round of future-aware reasoning. By introducing inverse dynamics into latent future reasoning, IDOL tightens the coupling between world modeling and planning. Extensive experiments on the NAVSIM v1 and NAVSIM v2 benchmarks show that IDOL achieves state-of-the-art performance among comparable methods.

2605.31469 2026-06-01 cs.CL cs.AI cs.SD eess.AS

Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus

扩展匈牙利语对话ASR：BEA-Dialogue+语料库

Máté Gedeon, Piroska Zsófia Barta, Péter Mihajlik, Katalin Mády

AI总结针对匈牙利语对话语音识别训练数据不足的问题，本文通过放宽分割标准扩展BEA-Dialogue语料库至200小时，并评估基于Whisper和FastConformer的模型，证明基于序列化输出训练的微调能持续改善识别性能。

AI中文摘要

匈牙利语对话自动语音识别受到公开对话式训练数据有限的制约。BEA-Dialogue语料库解决了这一需求，但其严格的说话人分离的训练/开发/测试分割将可用材料减少到仅85小时。在本文中，我们介绍了BEA-Dialogue+，这是该语料库的扩展版本，它放宽了实验者和对话伙伴的分割标准，同时保持主要说话人的完全分离。这产生了200小时转录的自然对话，并允许对额外训练数据与分割间说话人重叠之间的权衡进行受控研究。我们在两个语料库版本上评估了多个基于Whisper和FastConformer的模型，包括基于序列化输出训练（SOT）的对话转录微调。我们的结果表明，对于未经微调的模型，较大的语料库更具挑战性，而基于SOT的适应在WER、CER、cpWER和cpCER上产生了一致的改进。总体而言，BEA-Dialogue+为匈牙利语对话ASR提供了一个更大但仍具挑战性的基准，以及用于训练和评估对话转录系统的实用资源。

英文摘要

Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint train/dev/eval split reduces the usable material to only 85 hours. In this paper, we introduce BEA-Dialogue+, an expanded version of the corpus that relaxes the split criterion for experimenters and dialogue partners while preserving complete separation of the primary speakers. This results in 200 hours of transcribed natural conversations and enables a controlled study of the trade-off between additional training data and speaker overlap across the splits. We evaluate several Whisper- and FastConformer-based models on both corpus versions, including Serialized Output Training (SOT)-based fine-tuning for dialogue transcription. Our results show that the larger corpus is more challenging for models without fine-tuning, whereas SOT-based adaptation yields consistent improvements in WER, CER, cpWER, and cpCER. Overall, BEA-Dialogue+ provides a substantially larger yet still demanding benchmark for Hungarian dialogue ASR, and a practical resource for training and evaluating dialogue transcription systems.

2605.31468 2026-06-01 cs.AI

AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle

AutoSci: 面向完整科学生命周期的以记忆为中心的智能体系统

Weitong Qian, Beicheng Xu, Zhongao Xie, Bowen Fan, Guozheng Tang, Jiale Chen, Xinzhe Wu, Mingtian Yang, Chenyang Di, Jiajun Li, Lingching Tung, Peichao Lai, Yifei Xia, Ziyi Guo, Yanwei Xu, Yanzhao Qin, Shaoduo Gan, Xupeng Miao, Bin Cui

AI总结提出AutoSci，一个以记忆为中心、支持完整科学生命周期的智能体系统，通过结构化记忆、多阶段流程、有向无环图增强和演化机制实现自动化科研。

AI中文摘要

科学研究传统上是人力密集型的，要求研究人员在漫长的项目周期中协调文献、想法、实验、手稿和审稿回复。基于LLM的科学智能体的兴起为自动化这一过程创造了机会。这样的系统必须支持完整的研究生命周期，跨项目维护结构化的持久记忆，并随时间改进自身的研究流程。然而，现有系统要么部分满足，要么未能满足这些要求，留下了统一自动化科学研究系统的空白。因此，我们提出了AutoSci，一个面向完整科学生命周期的以记忆为中心的智能体系统。AutoSci围绕四个模块组织。SciMem提供受模式约束的研究记忆，将可重复使用的科学知识分离为长期知识记忆，将项目级工件（如想法、实验、手稿和审稿）分离为活跃研究记忆。SciFlow通过一个控制状态、上下文、验证、反馈和编排的框架执行从文献理解到反驳的五阶段生命周期。SciDAG通过有向无环图形式的多智能体操作符和可重用的阶段特定模板增强困难技能。SciEvolve将来自用户、实验、审稿和外部环境的反馈信号转化为对SciMem组织、SciFlow技能和SciDAG模板的版本化更新。这些模块共同使AutoSci成为一个持久的研究环境，能够在研究项目间执行、记忆和演化。代码仓库位于https://github.com/skyllwt/AutoSci。

英文摘要

Scientific research has traditionally been human-intensive, requiring researchers to coordinate literature, ideas, experiments, manuscripts, and review responses across long project cycles. The rise of LLM-based scientific agents creates an opportunity to automate this process. Such a system must support the full research lifecycle, maintain structured persistent memory across projects, and improve its own research procedures over time. However, existing systems either partially satisfy or fail to satisfy these requirements, leaving a gap for a unified automated scientific research system. As a result, we present AutoSci, a memory-centric agentic system for the full scientific research lifecycle. AutoSci is organized around four modules. SciMem provides schema-governed research memory, separating Long-Term Knowledge Memory for reusable scientific knowledge from Active Research Memory for project-level artifacts such as ideas, experiments, manuscripts, and reviews. SciFlow executes a five-stage lifecycle from literature understanding to rebuttal through a harness that controls state, context, verification, feedback, and orchestration. SciDAG augments difficult skills with DAG-shaped multi-agent operators and reusable stage-specific templates. SciEvolve converts feedback signals from users, experiments, reviews, and external environments into versioned updates to SciMem organization, SciFlow skills, and SciDAG templates. Together, these modules make AutoSci a persistent research environment that can execute, remember, and evolve across research projects. The code repository is available at https://github.com/skyllwt/AutoSci.

2605.31466 2026-06-01 cs.CV

VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

VolFill: 基于体素流匹配的单视图非模态3D场景重建

Tuan Duc Ngo, Chuang Gan, Evangelos Kalogerakis

AI总结提出VolFill框架，利用混合3D VAE和潜在扩散Transformer从单张RGB图像生成完整3D场景结构，在SCRREAM和NRGB-D数据集上显著优于现有方法。

AI中文摘要

从单张RGB图像重建场景的完整几何形状仍然具有挑战性——尤其是在推断视觉证据不完整的隐藏结构时。我们提出了VolFill，一个生成框架，它预测完整场景的3D结构，而不是依赖传统的像素对齐回归。我们的方法利用混合3D VAE将稀疏截断无符号距离函数网格压缩为紧凑的潜在空间，并结合潜在扩散Transformer对该表示进行去噪以恢复完整场景。我们以几何基础模型为条件生成，利用丰富的空间先验进行稳健推理。与受限于逐射线约束或非结构化点云查询的现有方法不同，VolFill提供了一种结构化表示，支持直接表面提取和大规模占用查询。在SCRREAM和NRGB-D数据集上的大量实验表明，我们的方法显著优于当前基线，为整体空间理解提供了稳健的基础。

英文摘要

Reconstructing the complete geometry of a scene from a single RGB image remains challenging, especially when inferring hidden structures where visual evidence is incomplete. We introduce VolFill, a generative framework that predicts the 3D structure of the complete scene rather than relying on traditional pixel-aligned regression. Our method utilizes a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into a compact latent space, paired with a latent Diffusion Transformer that denoises this representation to recover the complete scene. We condition the generation on geometry foundation models, leveraging rich spatial priors for robust reasoning. Unlike existing methods limited by per-ray constraints or unstructured point-cloud queries, VolFill provides a structured representation that supports direct surface extraction and occupancy queries at scale. Extensive experiments on the SCRREAM and NRGB-D datasets demonstrate that our approach significantly outperforms current baselines, providing a robust foundation for holistic spatial understanding.

2605.31464 2026-06-01 cs.LG cs.AI

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

GPU预测器：语言模型作为内核运行时优化的选择性替代

Zaid Khan, Justin Chih-Yao Chen, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal

AI总结研究利用语言模型作为GPU内核性能的选择性替代，通过强化学习提高预测准确性和校准度，在有限GPU评估预算下加速内核搜索。

Comments: Code: https://github.com/codezakh/gpu-forecasters

AI中文摘要

GPU内核是现代深度学习的主力，优化它们（通过进化搜索或编码代理）通常需要在目标硬件上重复测量。虽然这些测量提供了内核搜索所需的地面真实信号，但成本高昂，因为每次评估内核都需要编译并在GPU上重复执行。随着LLM推理的改进降低了编写新内核的成本，并且LLM驱动的搜索扩展到大的搜索预算，设备上的评估成为瓶颈。为了解决这个问题，我们研究LLM如何通过预测所提议内核的性能，作为选择性GPU替代用于内核评估。一个有用的替代应该是准确的，并且应该是选择性的，知道何时可能出错，并推迟到GPU。为了评估替代，我们测量其预测是否准确、校准良好，并且在有限的GPU测量预算下对恢复快速内核实际有用。接下来，我们研究强化学习是否能提高预测准确性和置信度校准。我们的实验表明，LLM可以准确预测相对内核性能，并且通过强化学习可以提高其实用性。在内核搜索中使用替代，使得搜索在相同的GPU评估预算下可以考虑多倍的候选，从而比同等预算的基线找到更快的内核。这些结果表明，LLM可以在内核优化中发挥更广泛的作用，作为GPU的虚拟模型，而不仅仅是搜索的内核生成器。

英文摘要

GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground-truth signal necessary for kernel search, they are costly, because each evaluation of a kernel requires compilation and repeated execution on a GPU. As improvements in LLM inference reduce the cost of writing novel kernels and LLM-driven searches scale to large search budgets, on-device evaluation becomes a bottleneck. To address this, we study how LLMs can serve as selective GPU surrogates for kernel evaluation by forecasting the performance of proposed kernels. A useful surrogate should be accurate, and it should be selective: it should know when it could be wrong and defer to the GPU. To evaluate surrogates, we measure whether their forecasts are accurate, calibrated, and practically useful for recovering fast kernels under limited GPU-measurement budgets. Next, we study whether reinforcement learning can improve forecast accuracy and confidence calibration. Our experiments demonstrate that LLMs can accurately forecast relative kernel performance and that their utility can be improved through reinforcement learning. Used inside a kernel search, the surrogate lets the search consider several times as many candidates under the same GPU evaluation budget, which leads to finding faster kernels than an equal-budget baseline. These results suggest that LLMs can play a broader role in kernel optimization by acting as virtual models of a GPU rather than solely as kernel generators for search.
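
The "selective" part of the surrogate can be sketched as a simple deferral rule: trust confident forecasts, and spend the limited GPU budget on kernels whose forecasts are uncertain. The policy below is an assumed illustration, not the paper's procedure; `select_candidates`, the `forecast` and `measure` callables, and the `min_confidence` threshold are all hypothetical.

```python
def select_candidates(kernels, forecast, measure, budget, min_confidence=0.8):
    """Pick the fastest kernel, measuring on GPU only when the forecast is unsure.

    forecast(kernel) -> (predicted_runtime, confidence in [0, 1])
    measure(kernel)  -> measured runtime (the expensive on-device call)
    """
    results = {}
    uncertain = []
    for kernel in kernels:
        predicted, confidence = forecast(kernel)
        if confidence >= min_confidence:
            results[kernel] = predicted          # trust the surrogate
        else:
            uncertain.append((predicted, kernel))
    spent = 0
    # Defer to the GPU for uncertain kernels, most promising forecasts first.
    for predicted, kernel in sorted(uncertain):
        if spent < budget:
            results[kernel] = measure(kernel)
            spent += 1
        else:
            results[kernel] = predicted          # budget exhausted: fall back
    best = min(results, key=results.get)
    return best, spent
```

With a well-calibrated surrogate, most candidates never reach the GPU, so the same measurement budget covers a much larger pool of proposed kernels.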

2605.31463 2026-06-01 cs.LG cs.AI cs.CL cs.DC

PithTrain: A Compact and Agent-Native MoE Training System

PithTrain: 一个紧凑且面向智能体的MoE训练系统

Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy, Zichun Yu, Junru Shao, Todd C. Mowry, Chenyan Xiong, Tianqi Chen

AI总结提出PithTrain，一个基于智能体原生设计原则的紧凑型MoE训练框架，通过引入ATE-Bench评估智能体任务效率，在保持生产框架吞吐量的同时，将智能体任务轮次和活跃GPU时间分别降低62%和64%。

AI中文摘要

混合专家模型（MoE）已成为前沿语言模型的主导架构。为满足这一需求，生产框架经过多年的工程努力构建了优化的MoE训练栈。然而，为新的架构和系统优化而演进这些栈仍然代价高昂。随着AI编码智能体的兴起，它们可以自动化训练框架开发的部分工作并加速这一演进。但将这些智能体应用于现有框架会带来隐藏成本，这些成本在当今仅关注吞吐量的评估中不可见。我们将这一缺失维度命名为智能体任务效率（ATE）：即使用编码智能体理解、操作和扩展框架的成本。基于四个智能体原生设计原则，我们构建了PithTrain，一个紧凑、智能体原生的MoE训练框架。我们进一步引入了ATE-Bench，涵盖现实世界的训练框架任务。我们的评估表明，PithTrain在吞吐量上与生产框架相当，并且在ATE-Bench上，PithTrain实现了更高的智能体任务效率，智能体轮次减少高达62%，活跃GPU时间减少64%。

英文摘要

Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these stacks for new architectures and system optimizations remains expensive. AI coding agents, now on the rise, could automate parts of training-framework development and accelerate this evolution. But applying them to existing frameworks carries hidden costs that are invisible to today's throughput-only evaluations. We name this missing dimension agent-task efficiency (ATE): the cost of using coding agents to understand, operate, and extend a framework. Grounded in four agent-native design principles, we build PithTrain, a compact, agent-native MoE training framework. We further introduce ATE-Bench, covering real-world training-framework tasks. Our evaluation shows that PithTrain matches the throughput of production frameworks and, on ATE-Bench, enables higher agent-task efficiency, with up to 62% fewer Agent Turns and 64% less Active GPU Time.
