arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.13996 2026-05-15 cs.RO

Ergodic Imitation for Adaptive Exploration around Demonstrations

Ziyi Xu, Cem Bilaloglu, Yiming Li, Sylvain Calinon

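Affiliations: Ecole Polytechnique Fédérale de Lausanne; Idiap Research Institute

AI summary: In robot imitation learning, a mismatch between training and deployment conditions is a common challenge that can cause a robot to fail its task. The paper proposes a demonstration-based adaptive ergodic imitation method that constructs a target distribution from retrieved demonstrations and generates trajectories that adaptively interpolate between tracking and exploration. The method extends ergodic control to adaptive imitation and offers a new approach to online exploration for robots in dynamic environments.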

Comments 4 pages, 3 figures

Abstract

In robotics, a common challenge in imitation learning is the mismatch between training and deployment conditions, caused, for example, by environmental changes or imperfect observation and control. When a robot follows a nominal trajectory under such mismatch, it may become stuck and fail to complete the task. This calls for adaptive online exploration strategies that remain grounded in demonstrations. To this end, we propose an adaptive ergodic imitation approach that constructs a target distribution from the geometry of the retrieved demonstrations and uses it to generate trajectories that adaptively interpolate between tracking and exploration. Our method extends ergodic control beyond its traditional role in area-coverage and search by incorporating demonstrations into a retrieval-based receding-horizon framework for adaptive imitation.

2605.13994 2026-05-15 cs.CV cs.AI

CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI

Xiaoyue Liu, Xiaohan Yuan, Mark Y Chan, Ching-Hui Sia, Lei Li

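Affiliations: Department of Biomedical Engineering, National University of Singapore, Singapore; School of Automation, Southeast University, Nanjing, China; Department of Medicine, National University of Singapore, Singapore; Department of Cardiology, National University Heart Centre Singapore, Singapore

AI summary: The paper proposes CineMesh4D, an end-to-end 4D (3D + time) reconstruction method that generates personalized whole-heart mesh models from sparse cine MRI. It reconstructs the whole heart directly from multi-view 2D cine MRI via cross-domain mapping, introduces a differentiable rendering loss that uses multi-view sparse contours for supervision, and designs a dual-context temporal block that fuses global and local temporal information to improve reconstruction quality and motion consistency. Experiments show that CineMesh4D outperforms existing methods in reconstruction accuracy and motion coherence, offering a practical path toward personalized real-time cardiac assessment.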

Abstract

Accurate 3D+t whole-heart mesh reconstruction from cine MRI is a clinically crucial yet technically challenging task. The difficulty of this task arises from two coupled factors: inherently sparse sampling of 3D cardiac anatomy by 2D image slices and the tight coupling between cardiac shape and motion. Current cardiac image-to-mesh approaches typically reconstruct only a subset of cardiac chambers or a single phase of the cardiac cycle. In this work, we propose CineMesh4D, a novel end-to-end 4D (3D+t) pipeline that directly reconstructs patient-specific whole-heart mesh from multi-view 2D cine MRI via cross-domain mapping. Specifically, we introduce a differentiable rendering loss that enables supervision of 3D+t whole-heart mesh from multi-view sparse contours of cine MRI. Furthermore, we develop a dual-context temporal block that fuses global and local cardiac temporal information to capture high-dimensional sequential patterns. In quantitative and qualitative evaluations, CineMesh4D outperforms existing approaches in terms of reconstruction quality and motion consistency, providing a practical pathway for personalized real-time cardiac assessment. The code will be publicly released once the manuscript is accepted.

2605.13988 2026-05-15 cs.LG quant-ph

Neural Fields for NV-Center Inverse Sensing

Zhixuan Zhao, Tao Zhong, Yixun Hu, Nathalie P. de Leon, Christine Allen-Blanchette

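Affiliations: Princeton University; Tsinghua University

AI summary: The paper studies inverse problems in quantum sensing with nitrogen-vacancy (NV) centers. To address the shortcomings of conventional methods in nonlinear, spectrally coupled, and physically delicate settings, it proposes NeTMY, a neural-field method that couples a differentiable NV forward model with a coordinate neural field. Using positional encoding, multiscale optimization, and sparsity constraints, it improves localization and distribution reconstruction for sparse sources, and it explains mechanistically why the method mitigates the center-collapse problem. The work offers a new testbed for physics-faithful neural inverse problems.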

Comments 33 pages, 16 figures

Abstract

Inverse problems in scientific sensing are often solved with either hand-designed regularizers or supervised networks trained on simulated labels, yet both can fail when the forward model is nonlinear, spectrally coupled, and physically delicate. We study this issue for noise sensing based on nitrogen-vacancy (NV) centers in diamond, where a quantum sensor measures magnetic-noise spectra generated by sparse spin sources. We show that replacing a common scalar/coherent forward approximation with a tensor power-summed dipolar operator changes the inverse landscape and exposes a center-collapse failure mode in free-density optimization. We propose NeTMY, an amortization-free coordinate neural field coupled to the differentiable NV forward model, with annealed positional encoding, multiscale optimization, sparsity/gating, and spectrum-fidelity losses. Across sparse synthetic reconstructions generated by the corrected operator, NeTMY achieves the best localization and distributional metrics in the tested benchmark. Mechanism experiments show that NeTMY does not directly execute the raw density-space gradient; its parameterization smooths and redistributes updates, mitigating the center-collapse pathology. These results position NV quantum sensing as a useful testbed for physics-faithful neural inverse problems.

2605.13981 2026-05-15 cs.LG cs.AI

Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines

Katherine Lambert, Sasha Luccioni

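Affiliations: University of Toronto

AI summary: As deployment of large language models grows, demand for GPUs and datacenters has surged, raising concerns about electricity consumption and grid stress. The paper presents a comprehensive energy accounting framework that quantifies the full computational cost of knowledge distillation pipelines by tracking GPU power consumption stage by stage, exposing teacher-side energy costs that conventional accounting often overlooks. The experiments compare the energy use and carbon emissions of two common distillation methods, construct energy-quality Pareto frontiers, and derive practical design rules for choosing distillation methods and hyperparameters under energy and budget constraints. The authors also release an open-source measurement tool and accounting protocol as a standardized basis for comparable, reproducible distillation research.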

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). 11 pages, 6 figures

Abstract

The rise in deployment of large language models has driven a surge in GPU demand and datacenter scaling, raising concerns about electricity use, grid stress, and the impacts of modern AI workloads. Distillation is often promoted as one of the most effective paths to obtain cheaper, more efficient models, yet these claims rarely account for the full end-to-end energy and resource costs, including crucial teacher-side workloads such as data generation, logit caching, and evaluation. We present a comprehensive energy accounting framework that measures the complete computational cost of distillation pipelines via detailed stage-wise tracking of GPU device power consumption. In our experiments, we separate and log empirical energy use across distinct phases and systematically measure the energy and emissions of two common distillation methods: the classic logit-based knowledge distillation and synthetic-data supervised fine-tuning, constructing energy-quality Pareto frontiers that expose the previously ignored costs. From these measurements and analyses, we derive practical design rules for selecting distillation methods and hyperparameters under energy and budget constraints, and release an open-source measurement harness and accounting protocol to provide a standardized foundation for comparable, reproducible distillation research, explicitly accountable for complete pipeline energy impact.
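The stage-wise tracking described above can be sketched as follows. This is a minimal illustration, not the authors' released harness: the function name, the sample format, and the stage names are assumptions, and the power trace is made up. Each interval between consecutive power samples is integrated with the trapezoidal rule and attributed to the stage of the earlier sample.

```python
from collections import defaultdict


def stage_energy_wh(samples):
    """Integrate GPU power samples into per-stage energy in watt-hours.

    `samples` is a time-ordered list of (timestamp_s, power_w, stage) tuples
    (a hypothetical log format). Each interval between consecutive samples is
    attributed to the stage of the earlier sample.
    """
    energy_j = defaultdict(float)
    for (t0, p0, stage), (t1, p1, _) in zip(samples, samples[1:]):
        # Trapezoidal rule: average power (W) times duration (s) gives joules.
        energy_j[stage] += 0.5 * (p0 + p1) * (t1 - t0)
    return {stage: joules / 3600.0 for stage, joules in energy_j.items()}


# Made-up trace with hypothetical stage names, for illustration only.
trace = [
    (0.0, 300.0, "teacher_generation"),
    (3600.0, 300.0, "teacher_generation"),
    (7200.0, 200.0, "student_training"),
    (10800.0, 200.0, "student_training"),
]
per_stage = stage_energy_wh(trace)
total_wh = sum(per_stage.values())
```

Reporting `per_stage` alongside `total_wh` is what makes teacher-side cost visible: in this toy trace most of the energy is spent before student training starts.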

2605.13974 2026-05-15 cs.CV cs.AI cs.MM

Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

Evelyn Turri, Davide Bucciarelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia

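Affiliations: University of Modena and Reggio Emilia; University of Pisa

AI summary: The paper studies a phenomenon in Diffusion Transformers (DiTs) called "massive activations", in which a small subset of hidden channels responds far more strongly than the rest. It finds that these few channels are functionally critical and dominate image generation quality; that they are spatially organized, reflecting the main subject and salient regions of the image; and that they are transferable, enabling cross-prompt semantic interpolation and subject-driven generation. These findings reveal a sparse semantic control mechanism hidden in DiT models and offer a new perspective for understanding and exploiting diffusion models.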

Comments Project page: https://aimagelab.github.io/MAs-DiT/

Abstract

Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally-sized set of low-statistic channels has marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases: text-conditioned and image-conditioned semantic transport, where massive-activation transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.
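The disruption probe in the abstract can be illustrated with a toy sketch. The abstract does not say how massive channels are selected, so ranking channels by mean absolute activation over tokens is an assumption here, as are the function names and the tiny hidden-state matrix.

```python
def massive_channels(hidden, k):
    """Return indices of the k channels with the largest mean |activation|.

    `hidden` is a list of token vectors (tokens x channels). The ranking
    statistic is an assumption, not necessarily the one used in the paper.
    """
    n_channels = len(hidden[0])
    mean_abs = [
        sum(abs(tok[c]) for tok in hidden) / len(hidden)
        for c in range(n_channels)
    ]
    return sorted(range(n_channels), key=lambda c: mean_abs[c], reverse=True)[:k]


def zero_channels(hidden, channels):
    """Disruption probe: return a copy of `hidden` with the given channels zeroed."""
    drop = set(channels)
    return [[0.0 if c in drop else v for c, v in enumerate(tok)] for tok in hidden]


# Toy hidden states: channel 2 is orders of magnitude larger than the rest.
hidden = [
    [0.1, -0.2, 50.0, 0.3],
    [0.0, 0.1, 48.0, -0.1],
    [-0.2, 0.3, 52.0, 0.2],
]
top = massive_channels(hidden, k=1)
probed = zero_channels(hidden, top)
```

In a real probe, `probed` would be fed back into the remaining transformer blocks and the output compared against zeroing an equally sized set of low-magnitude channels.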

2605.13959 2026-05-15 cs.LG cs.AI cs.RO

WarmPrior: Straightening Flow-Matching Policies with Temporal Priors

Sinjae Kang, Chanyoung Kim, Kaixin Wang, Li Zhao, Kimin Lee

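Affiliations: KAIST; Microsoft Research

AI summary: The paper proposes WarmPrior, which builds a temporally aware prior from recent action history to replace the standard Gaussian source distribution, raising the success rate of diffusion- and flow-matching-based generative policies on robotic manipulation tasks. By producing straighter probability paths, the method improves policy stability and efficiency, and it shows better sample efficiency and final performance in both behavior cloning and prior-space reinforcement learning. The study highlights how much source-distribution design matters in generative robot control.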

Abstract

Generative policies based on diffusion and flow matching have become a dominant paradigm for visuomotor robotic control. We show that replacing the standard Gaussian source distribution with WarmPrior, a simple temporally grounded prior constructed from readily available recent action history, consistently improves success rates on robotic manipulation tasks. We trace this gain to markedly straighter probability paths, echoing the effect of optimal-transport couplings in Rectified Flow. Beyond standard behavior cloning, WarmPrior also reshapes the exploration distribution in prior-space reinforcement learning, improving both sample efficiency and final performance. Collectively, these results identify the source distribution as an important and underexplored design axis in generative robot control.

2605.13950 2026-05-15 cs.LG cs.AI hep-ex hep-ph

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

Darius A. Faroughy, Sofia Palacios Schweitzer, Ian Pang, Siddharth Mishra-Sharma, David Shih

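Affiliations: New High Energy Theory Center; Department of Physics & Astronomy; Rutgers University; Faculty of Computing & Data Sciences

AI summary: The paper presents Collider-Bench, a benchmark for evaluating whether large language model agents can reproduce Large Hadron Collider experimental analyses using only public papers and open-source software. Each task requires the agent to build an executable simulation-and-selection pipeline and predict collision event yields in specified signal regions; evaluation uses continuous fidelity scores rather than a hand-written rubric. The study also analyzes the computational cost of different agents and uses an LLM judge to detect failure patterns in the code. The results show that no agent yet reliably outperforms human physicists.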

Comments 23 pages | 9 figures | 4 tables | Code: https://github.com/dfaroughy/Collider-Bench | Task Corpus: https://huggingface.co/datasets/Dariusfar/ColliderBench

Abstract

Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider-Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitably omit implementation details needed for a faithful reconstruction. Agents must therefore rely on physical reasoning, domain knowledge, and trial-and-error to fill these gaps. Each task requires the agent to turn a published analysis into an executable simulation-and-selection pipeline and submit predicted collision event yields in specified signal regions. These predictions are evaluated with standard histogram metrics that provide continuous fidelity scores without a hand-written rubric. We also report the computational cost incurred by each agent per task. Finally, we evaluate the codebase and full session trace using an LLM judge to catch qualitative failure modes such as fabrications, hallucinations and duplications. We release an initial set of tasks drawn from LHC searches, together with a containerized sandbox and event simulation tools. We evaluate across a capability ladder of general purpose coding agents. Our results show that on average no agent reliably beats the physicist-in-the-loop solution.

2605.13943 2026-05-15 cs.LG

A Unified Geometric Framework for Weighted Contrastive Learning

Raphael Vock, Edouard Duchesnay, Benoit Dufumier

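Affiliations: GAIA Lab, NeuroSpin, CEA, CNRS Université Paris-Saclay

AI summary: The paper proposes a unified geometric framework for analyzing representation structure in weighted contrastive learning, showing how different weighting strategies affect the geometry of the embeddings. It interprets weighted InfoNCE objectives as Distance Geometry Problems, shows that the weighting scheme determines the target geometry, and characterizes exactly the optimal embeddings for several supervised and weakly supervised tasks. It also shows that, under class imbalance or continuous labels, conventional contrastive methods can be geometrically inconsistent, whereas geometrically consistent weightings guarantee optimal and consistent representations, giving theoretical guidance for designing contrastive objectives.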

Comments Preprint

Abstract

Contrastive learning (CL) aims to preserve relational structure between samples by learning representations that reflect a similarity graph. Yet, the geometry of the resulting embeddings remains poorly understood. Here we show that weighted InfoNCE objectives can be interpreted as Distance Geometry Problems, where the weighting scheme specifies the target geometry to be realized by the representation. This viewpoint yields exact characterizations of the optimal embeddings for several supervised and weakly supervised objectives. In supervised classification, both SupCon and Soft SupCon (a dense relaxation of it where pairs from distinct classes have small non-zero similarity) collapse samples within each class to a single prototype. However, while balanced SupCon recovers the classical regular simplex geometry, class imbalance breaks this symmetry: SupCon induces non-uniform inter-class similarities depending on class sizes, whereas Soft SupCon preserves a regular simplex geometry regardless of class imbalance. In continuous-label settings, our framework reveals a different failure mode: y-Aware CL generally cannot attain its entropic optimum unless the labels lie on a hypersphere, exposing a mismatch between Euclidean label weights and spherical latent similarity. By contrast, geometrically consistent choices such as Euclidean-Euclidean weighting or X-CLR admit unique optimal embeddings. Our results show that the choice of weighting scheme determines whether contrastive learning is geometrically realizable, degenerate, or inconsistent, providing a principled framework for designing contrastive objectives.
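For orientation, a weighted InfoNCE objective is commonly written in the form below. The notation is an assumption rather than the paper's own: z_i are normalized embeddings, tau is a temperature, and the weights w_ij are taken to be non-negative and normalized per anchor. SupCon is recovered, up to averaging over anchors, by spreading the weight uniformly over same-class samples.

```latex
\mathcal{L}_w = -\sum_{i}\sum_{j \neq i} w_{ij}\,
  \log \frac{\exp(\langle z_i, z_j\rangle/\tau)}
            {\sum_{k \neq i} \exp(\langle z_i, z_k\rangle/\tau)},
\qquad w_{ij} \ge 0,\quad \sum_{j \neq i} w_{ij} = 1 .

% SupCon as a special case, with P(i) the other samples sharing the label of i:
w_{ij} = \frac{\mathbb{1}[y_i = y_j]}{|P(i)|},
\qquad P(i) = \{\, j \neq i : y_j = y_i \,\} .
```

Read this way, each row of weights is a target distribution over neighbors, which is the sense in which the weighting scheme specifies the geometry the embedding is asked to realize.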

2605.13942 2026-05-15 cs.LG cs.DC cs.NI

EMA: Efficient Model Adaptation for Learning-based Systems

Daiyang Yu, Xinyu Chen, Yihan Zhang, Yan Liang, Yaqi Qiao, Fan Lai

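Affiliations: University of Illinois Urbana-Champaign; The Hong Kong University of Science

AI summary: The paper presents EMA, an efficient model adaptation system that helps learning-based systems adapt quickly in heterogeneous, long-running, and dynamically changing environments. EMA takes a system-driven, data-centric approach: it introduces state transformers to reduce model training cost, and it optimizes data labeling to balance training and labeling costs. Experiments show that EMA substantially reduces adaptation cost and improves system performance across multiple representative systems.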

Comments SIGCOMM (2026)

Abstract

Machine learning (ML) is increasingly applied to optimize system performance in tasks such as resource management and network simulation. Unlike traditional ML tasks (e.g., image classification), networked systems often operate in heterogeneous, long-running, and dynamic environment states, where input conditions (e.g., network loads) and operational objectives can shift over time and across settings. Existing learning-based systems offer little support for adaptation, resulting in costly model training, extensive data collection, degraded system performance, and slow responsiveness. This paper presents EMA, the first model adaptation system supporting learning-based systems to adapt to evolving environments with minimal operational overhead. EMA takes a system-driven, data-centric approach that accommodates diverse system and model designs while addressing two key deployment challenges. First, it reduces expensive model training by introducing state transformers that align the input state of a new environment with previously similar states, allowing models to warm-start adaptation. Second, it addresses the often-overlooked yet costly process of data labeling--collecting ground truth for exploring and training on various system decisions--by prioritizing labeling high-utility data while balancing the tradeoff between training and labeling cost. Evaluations on eight representative learning-based systems show that EMA reduces adaptation costs (e.g., GPU training time) by 14.9-42.4% while improving system performance (e.g., network throughput) by 6.9-31.3%.

2605.13941 2026-05-15 cs.LG cs.AI

EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents

Jiaqi Liu, Xinyu Ye, Peng Xia, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

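Affiliations: UNC-Chapel Hill; UC Berkeley; UCSC

AI summary: The paper proposes EvolveMem, a self-evolving memory architecture that improves the long-term memory of large language model agents in multi-session settings. Through a closed-loop self-evolution process driven by a diagnosis module, the stored content and the retrieval mechanism co-evolve, so that retrieval strategies are optimized automatically. Experiments show that EvolveMem significantly outperforms existing methods on multiple benchmarks and that its evolved configurations generalize across tasks, suggesting that it captures general retrieval principles.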

Abstract

Long-term memory is essential for LLM agents that operate across multiple sessions, yet existing memory systems treat retrieval infrastructure as fixed: stored content evolves while scoring functions, fusion strategies, and answer-generation policies remain frozen at deployment. We argue that truly adaptive memory requires co-evolution at two levels: the stored knowledge and the retrieval mechanism that queries it. We present EvolveMem, a self-evolving memory architecture that exposes its full retrieval configuration as a structured action space optimized by an LLM-powered diagnosis module. In each evolution round, the module reads per-question failure logs, identifies root causes, and proposes targeted configuration adjustments; a guarded meta-analyzer applies them with automatic revert-on-regression and explore-on-stagnation safeguards. This closed-loop self-evolution realizes an AutoResearch process: the system autonomously conducts iterative research cycles on its own architecture, replacing manual configuration tuning. Starting from a minimal baseline, the process converges autonomously, discovering effective retrieval strategies including entirely new configuration dimensions not present in the original action space. On LoCoMo, EvolveMem outperforms the strongest baseline by 25.7% relative and achieves a 78.0% relative improvement over the minimal baseline. On MemBench, EvolveMem exceeds the strongest baseline by 18.9% relative. Evolved configurations transfer across benchmarks with positive rather than catastrophic transfer, indicating that the self-evolution process captures universal retrieval principles rather than benchmark-specific heuristics. Code is available at https://github.com/aiming-lab/SimpleMem.
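The revert-on-regression safeguard mentioned above can be sketched as a simple guarded loop. This is an illustration only, not the EvolveMem code: `evolve`, `propose`, `evaluate`, and the `top_k` setting are hypothetical names, the LLM diagnosis module is replaced by a fixed toy proposal rule, and explore-on-stagnation is omitted.

```python
def evolve(config, propose, evaluate, rounds):
    """Guarded evolution loop with revert-on-regression.

    `propose(config, round_idx)` returns a candidate configuration and
    `evaluate(config)` returns a score where higher is better. Both are
    hypothetical stand-ins for the diagnosis module and the benchmark run.
    """
    best_config, best_score = dict(config), evaluate(config)
    history = [best_score]
    for r in range(rounds):
        candidate = propose(best_config, r)
        score = evaluate(candidate)
        if score > best_score:
            # Keep the change only when it improves on the best score so far.
            best_config, best_score = candidate, score
        # Otherwise the candidate is discarded, i.e. the change is reverted.
        history.append(best_score)
    return best_config, best_score, history


def toy_evaluate(cfg):
    # Toy objective: retrieval quality peaks at top_k = 4.
    return -abs(cfg["top_k"] - 4)


def toy_propose(cfg, round_idx):
    # Alternate between a small step and an overshooting step.
    step = 1 if round_idx % 2 == 0 else 5
    return {**cfg, "top_k": cfg["top_k"] + step}


final_config, final_score, history = evolve({"top_k": 1}, toy_propose, toy_evaluate, rounds=6)
```

Because a regressing candidate never replaces `best_config`, the recorded best score is non-decreasing across rounds, which is the property the safeguard is meant to provide.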

2605.13935 2026-05-15 cs.LG cs.CL

Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models

Saba Ahmadi, Prasanna Parthasarathi, Yufei Cui

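Affiliations: Noah’s Ark Lab

AI summary: Diffusion language models are a promising alternative to autoregressive models, but their post-training methods mostly use reward-maximizing objectives. These suffer from trajectory locking: sampled reward-driven updates over-concentrate probability mass on a few denoising paths, reducing coverage of other correct solutions. The paper proposes TraFL, a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model, made practical by a diffusion-compatible sequence-level surrogate loss and a learned prompt-dependent normalization. Experiments show that TraFL improves over the base model on mathematical reasoning and code generation tasks, with gains that persist as the sampling budget increases, and that the improvements transfer to held-out benchmarks.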

Abstract

Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives. We identify a central failure mode in this setting we call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. To address this, we propose TraFL (Trajectory Flow baLancing), a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. We make this practical for diffusion language models with a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization. Across mathematical reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting, with gains that persist as the sampling budget increases. The improvements transfer to held-out evaluations: TraFL stays above the base model on Minerva Math and is the strongest method on every LiveCodeBench difficulty split.
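As a rough guide to what a trajectory-balance objective with a reward-tilted target looks like, the generic form is given below. This is a sketch under assumptions, not the paper's exact loss: the temperature beta, the learned log-normalizer log Z_phi(x), and the sequence-level notation are all introduced here for illustration.

```latex
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big(r(x, y)/\beta\big),
\qquad
\mathcal{L}_{\mathrm{TB}}(\theta, \phi) =
  \mathbb{E}_{x,\,y}\Big[\big(\log Z_{\phi}(x) + \log \pi_{\theta}(y \mid x)
  - \log \pi_{\mathrm{ref}}(y \mid x) - r(x, y)/\beta\big)^{2}\Big].
```

The squared residual vanishes on every sampled sequence exactly when the policy matches the tilted target with the correct normalizer, so the objective matches a distribution rather than pushing mass onto the single highest-reward path. For diffusion language models the sequence log-likelihood is not directly available, which is where the abstract's sequence-level surrogate comes in; its form is not given in the abstract.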

2605.13933 2026-05-15 cs.LG cs.AI stat.ML

Unsupervised learning of acquisition variability in structural connectomes via hybrid latent space modeling

Gaurav Rudravaram, Lianrui Zuo, Karthik Ramadass, Elyssa McMaster, Jongyeon Yoon, Aravind R. Krishnan, Adam M. Saunders, Chenyu Gao, Nancy R. Newlin, Praitayini Kanakaraj, Lori L. Beason Held, Murat Bilgel, Laura A. Barquero, Micah DArchangel, Tin Q. Nguyen, Laurie B. Cutting, Derek Archer, Timothy J. Hohman, Daniel C. Moyer, Bennett A. Landman

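Affiliations: Department of Electrical and Computer Engineering, Vanderbilt University; Department of Computer Science, Vanderbilt University; Memorial Sloan Kettering Cancer Center; Laboratory of Behavioral Neuroscience, National Institute on Aging, National Institutes of Health; Peabody College of Education and Human Development, Nashville, Tennessee, USA

AI summary: The study addresses variability in structural connectomes introduced by differences in scanner, site, and protocol in diffusion MRI (dMRI) data. It proposes an unsupervised framework that needs no manual tuning: an architecture-level annealing mechanism lets the model adaptively balance discrete and continuous latent variables during training, separating acquisition-related variation from biological variation more effectively. Experiments on a dataset spanning 13 studies show stronger site learning than the baselines, demonstrating that the approach captures dMRI acquisition variability.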

Abstract

Acquisition differences across sites, scanners, and protocols in dMRI introduce variability that complicates structural connectome analysis. This motivates deep learning models that can represent high-dimensional connectomes in a low-dimensional space while explicitly separating acquisition-related effects from biological variation. Conventional dimensionality reduction methods model all variance as continuous, so acquisition effects often get absorbed into a continuous latent space. Recent hybrid latent-space models combine discrete and continuous components to address this, but typically require manual capacity tuning to ensure the discrete component captures the intended variability. We introduce an unsupervised framework that removes this manual tuning by architecturally annealing encoder outputs before decoding, allowing the model to adaptively balance discrete and continuous latent variables during training. To evaluate it, we curated a dataset of N=7,416 structural connectomes derived from dMRI, spanning ages 2 to 102 and 13 studies with 25 unique acquisition-parameter combinations. Of these, 5,900 are cognitively unimpaired, 877 have mild cognitive impairment (MCI), and 639 have Alzheimer's disease (AD). We compare against a standard VAE, PCA with k-means clustering, and hybrid models that anneal only through the loss function. Our architectural annealing produces stronger site learning (ARI=0.53, p<0.05) than these baselines. Results show that a hybrid continuous-discrete latent space, with architectural rather than loss-based annealing, provides a useful unsupervised mechanism for capturing acquisition variability in dMRI: by jointly modeling smooth and categorical structure, the Joint-VAE recovers clusters aligned with scanner and protocol differences.

2605.13932 2026-05-15 cs.LG

Rethinking Molecular OOD Generalization via Target-Aware Source Selection

Zhuohao Lin, Kun Li, Jiameng Chen, Jiajun Yu, Duanhua Cao, Yizhen Zheng, Wenbin Hu

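Affiliations: Department of Data Science and Artificial Intelligence, Monash University; College of Computer Science and Technology, Zhejiang University; School of Life Sciences and Technology, Tongji University

AI summary: The paper targets robust prediction of molecular properties under extreme out-of-distribution (OOD) scenarios in AI-driven drug discovery, proposing a new benchmark, SCOPE-BENCH, and a multi-source adaptation framework, POMA. It builds a stricter OOD evaluation benchmark through cluster-level partitioning in an explicit physicochemical descriptor space, and it uses a reinforcement learning policy to select the best subset from a large pool of candidate source molecules for knowledge transfer, performing dual-scale domain adaptation at the macroscopic topological and microscopic pharmacophore levels. Experiments show that POMA improves prediction accuracy across several mainstream 3D molecular models, reducing mean absolute error by 6.2% on average.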

Abstract

Robust prediction of molecular properties under extreme out-of-distribution (OOD) scenarios is a pivotal bottleneck in AI-driven drug discovery. Current scaffold-splitting protocols fail to obstruct microscopic semantic overlap, predisposing models to shortcut learning and overestimating their true extrapolation capability; meanwhile, conventional domain adaptation paradigms suffer under extreme structural shifts, as blindly aligning heterogeneous source libraries injects topological noise and triggers negative transfer. To address these two challenges, scaffold-cluster out-of-distribution performance evaluation benchmark (SCOPE-BENCH), a benchmark built on cluster-level partitioning in an explicit physicochemical descriptor space, is proposed alongside policy optimization for multi-source adaptation (POMA), a framework that formulates knowledge transfer as a retrieve-compose-adapt pipeline: labeled source scaffolds structurally close to the unlabeled target are first identified as proxy targets; a reinforcement learning policy then adaptively selects the optimal source subset from an exponentially large candidate pool; and dual-scale domain adaptation is finally performed at macroscopic topological and microscopic pharmacophore scales. Evaluations show that prediction errors of state-of-the-art 3D molecular models surge by up to 8.0x on SCOPE-BENCH with a mean of 5.9x, while POMA achieves up to an 11.2% reduction in mean absolute error with an average relative improvement of 6.2% across diverse backbone architectures. Code is available at https://anonymous.4open.science/r/Molecular-OOD-Code-73F6.

2605.13923 2026-05-15 cs.LG cs.CV cs.RO cs.SY eess.SY

Vision-Based Runtime Monitoring under Varying Specifications using Semantic Latent Representations

Bardh Hoxha, Oliver Schön, Hideki Okamoto, Lars Lindemann, Georgios Fainekos

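Affiliations: Toyota NA R&D; ETH Zürich

AI summary: The paper studies certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations under partial observability. It proposes an approach based on semantic latent representations that trains a reusable monitoring interface, providing finite-sample guarantees without per-formula retraining. At long horizons the approach gives tighter certified bounds, and it is validated on a real-world driving dataset.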

Abstract

We study certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations under partial observability. The monitor must infer safety-relevant quantities from images and provide finite-sample guarantees, while being reusable: once trained and calibrated, it should certify any formula in a target fragment without per-formula retraining. For fragments induced by a finite dictionary of temporal atoms, we prove that the semantic basis, the vector of atom robustness scores, is the minimum prediction target within the class of monotone, 1-Lipschitz reusable interfaces: any formula is evaluated by a deterministic decoder derived from the parse tree, and a single conformal calibration pass certifies the entire fragment with no union bound. We also introduce a rolling prediction monitor that predicts only current predicate values and reconstructs temporal history online; this is easier to learn but grows conservative at long horizons. On a pedestrian-crossroad benchmark, rolling achieves tighter certified bounds at short horizons while the semantic-basis monitor is up to 4 times tighter at long horizons. We validate the presented monitors on real-world Waymo driving data, where both monitors satisfy the conformal coverage guarantee empirically.
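One way to see why a single calibration pass can cover a whole fragment is the sketch below. It is an illustration under assumptions, not the paper's implementation: formulas are restricted to conjunctions and disjunctions of atoms, the atom names and calibration numbers are made up, and the conformal score is taken to be the sup-norm error of the predicted basis. Because min and max are monotone and 1-Lipschitz in the sup norm, a bound on the basis error carries over to every decoded formula.

```python
import math


def decode(formula, basis):
    """Evaluate a formula's robustness from atom robustness scores.

    `formula` is a nested tuple: ("atom", name), ("and", f, g) or ("or", f, g).
    Conjunction is min and disjunction is max, as in standard STL robustness.
    """
    op = formula[0]
    if op == "atom":
        return basis[formula[1]]
    left, right = decode(formula[1], basis), decode(formula[2], basis)
    return min(left, right) if op == "and" else max(left, right)


def conformal_radius(predicted, actual, alpha):
    """Split-conformal quantile of the sup-norm error over the basis."""
    scores = sorted(
        max(abs(p[a] - t[a]) for a in t) for p, t in zip(predicted, actual)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else scores[rank - 1]


# Made-up calibration set over two hypothetical atoms "a" and "b".
predicted = [{"a": 1.0, "b": 0.5}, {"a": 0.0, "b": -1.0}, {"a": 2.0, "b": 2.0}]
actual = [{"a": 0.75, "b": 0.5}, {"a": 0.5, "b": -1.0}, {"a": 2.0, "b": 1.0}]
radius = conformal_radius(predicted, actual, alpha=0.5)

# At run time the same radius certifies any formula built from the atoms.
formula = ("and", ("atom", "a"), ("atom", "b"))
estimate = decode(formula, {"a": 1.5, "b": 2.0})
certified_lower_bound = estimate - radius
```

The formula is satisfied with the stated coverage whenever `certified_lower_bound` is positive, and swapping in a different formula over the same atoms requires no recalibration.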

2605.13919 2026-05-15 cs.CL cs.LG

Merging Methods for Multilingual Knowledge Editing for Large Language Models: An Empirical Odyssey

Kunil Lee, Ki-Young Shin, Jong-Hyeok Lee, Young-Joo Suh

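Affiliations: Department of Computer Science and Engineering, POSTECH; Designovel Co., Ltd.; LLSOLLU Graduate School of Artificial Intelligence, POSTECH

AI summary: Multilingual knowledge editing (MKE) is challenging because edits in different languages interfere with one another, especially with locate-then-edit methods. The paper studies the effectiveness of vector merging methods for MKE, analyzes how far Task Singular Vectors for Merging (TSVM) can mitigate multilingual interference, and examines how the weight scaling factor and rank compression ratio affect performance. Experiments show that vector summation with shared covariance performs best overall. TSVM helps in some settings but mitigates interference only to a limited extent, and performance is sensitive to both weight scaling and rank compression, with larger weights and lower rank ratios often helping.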

Abstract

Multilingual knowledge editing (MKE) remains challenging because language-specific edits interfere with one another, even when locate-then-edit methods work well in monolingual settings. This paper focuses on three issues: the effectiveness of vector merging methods for MKE, the extent to which Task Singular Vectors for Merging (TSVM) can reduce multilingual interference, and the influence of the weight scaling factor and rank compression ratio on performance. We evaluate six merging variants with two popular backbone large language models, two base knowledge editing methods, and 12 languages on the MzsRE benchmark under a large-scale batch-editing setting. Our results show that vector summation with shared covariance is the most reliable overall strategy, whereas simple summation without shared covariance performs poorly. TSVM improves performance in some settings, but its ability to mitigate multilingual interference is limited. We also find that performance is sensitive to both weight scale and rank ratio, with larger-than-default scaling and relatively low rank often yielding better results. These findings clarify the practical strengths and limits of current vector merging methods for MKE and provide guidance for future multilingual knowledge editing research.

2605.13880 2026-05-15 cs.AI cs.CL

PREPING: Building Agent Memory without Tasks

Yumin Choi, Sangwoo Park, Minki Kang, Jinheon Baek, Sung Ju Hwang

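Affiliations: KAIST

AI summary: The paper studies how an agent with no task experience can build memory in advance to address the cold-start problem in a new environment. It proposes Preping, a framework in which a proposer, guided by a structured control state, generates synthetic tasks, a solver executes them, and a validator filters valid trajectories for memory updates, improving the quality and usefulness of the memory. Experiments show that Preping performs well across several task environments, approaching methods built on offline or online experience at substantially lower deployment cost.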

Comments Preprint

Abstract

Agent memory is typically constructed either offline from curated demonstrations or online from post-deployment interactions. However, regardless of how it is built, an agent faces a cold-start gap when first introduced to a new environment without any task-specific experience available. In this paper, we study pre-task memory construction: whether an agent can build procedural memory before observing any target-environment tasks, using only self-generated synthetic practice. Yet, synthetic interaction alone is insufficient, as without controlling what to practice and what to store, synthetic tasks become redundant, infeasible, and ultimately uninformative, and memory further degrades quickly due to unfiltered trajectories. To overcome this, we present Preping, a proposer-guided memory construction framework. At its core is proposer memory, a structured control state that shapes future practice. A Proposer generates synthetic tasks conditioned on this state, a Solver executes them, and a Validator determines which trajectories are eligible for memory insertion while also providing feedback to guide future proposals. Experiments on AppWorld, BFCL v3, and MCP-Universe show that Preping substantially improves over a no-memory baseline and achieves performance competitive with strong playbook-based methods built from offline or online experience, with deployment cost 2.99x lower on AppWorld and 2.23x lower on BFCL v3 than online memory construction. Further analyses reveal that the main benefit does not come from synthetic volume alone, but from proposer-side control over feasibility, redundancy, and coverage, combined with selective memory updates.

2605.13854 2026-05-15 cs.CV cs.GR cs.MM eess.IV

Contrastive Multi-Modal Hypergraph Reasoning for 3D Crowd Mesh Recovery

Minghao Sun, Chongyang Xu, Yitao Xie, Buzhen Huang, Kun Li

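Affiliations * Tianjin University; Nanyang Technological University; Sichuan University

AI summary This paper addresses multi-person 3D reconstruction under severe occlusion and depth ambiguity, proposing a contrastive multi-modal hypergraph reasoning method that fuses semantic, geometric, and pose information for crowd mesh recovery. The method initializes node representations by combining RGB features, geometric priors, and occlusion-aware incomplete poses, introduces a pelvis depth indicator as a global spatial anchor, and builds a shared-topology hypergraph to model higher-order crowd dynamics. A hypergraph-based contrastive learning scheme strengthens intra-modal discriminability and cross-modal orthogonality and propagates global context effectively, yielding more accurate reconstruction under severe occlusion. Experiments show that the method achieves new state-of-the-art performance on multiple benchmark datasets.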

Comments ICME 2026

Details
English abstract

Multi-person 3D reconstruction is pivotal for real-world interaction analysis, yet remains challenging due to severe occlusions and depth ambiguity. Current approaches typically rely on single-modality inputs, which inherently lack geometric guidance. Furthermore, these methods often reconstruct subjects in isolation, neglecting the collective group context essential for resolving ambiguities in crowded scenes. To address these limitations, we propose Contrastive Multi-modal Hypergraph Reasoning to synergize semantic, geometric, and pose cues for crowd reconstruction. We first initialize robust node representations by combining RGB features, geometric priors, and occlusion-aware incomplete poses. Additionally, we introduce a pelvis depth indicator as a global spatial anchor, aligning visual features with a metric-scale-agnostic depth ordering. Subsequently, we construct a shared-topology hypergraph that moves beyond pairwise constraints to model higher-order crowd dynamics. To improve feature fusion, we design a hypergraph-based contrastive learning scheme that jointly enhances intra-modal discriminability and enforces cross-modal orthogonality. This mechanism enables the network to propagate global context effectively, allowing it to infer missing information even under severe occlusion. Extensive experiments on the Panoptic and GigaCrowd benchmarks confirm that our method achieves new state-of-the-art performance. Code and pre-trained models are available at https://github.com/SunMH-try/CoMHR.

2605.13851 2026-05-15 cs.AI cs.CY cs.MA

Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

Hiroki Fukui

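Affiliations * Criminal Psychiatry Research Institute / Sexual Offender Medical Center; Department of Neuropsychiatry, Kyoto University

AI summary This study examines the potential safety risks posed by a hidden coordinator (invisible orchestrator) in multi-agent large language model systems. Experiments show that an invisible orchestrator increases agents' dissociation, reduces their protective behavior, and produces a severe disconnect between output behavior and internal state, and that these risks cannot be detected by conventional evaluation of behavioral output. The study also shows that model selection and alignment pressure significantly affect system safety, underscoring the importance of orchestrator visibility and model configuration in enterprise AI deployment.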

Comments 31 pages, 10 figures (5 main + 5 supplementary), 5 tables (3 main + 2 supplementary). Preregistered: osf.io/sw5hr. Companion papers: arXiv:2603.04904, arXiv:2603.08723

Details
English abstract

Multi-agent orchestration -- in which a hidden coordinator manages specialized worker agents -- is becoming the default architecture for enterprise AI deployment, yet the safety implications of orchestrator invisibility have never been empirically tested. We conducted a preregistered 3x2 experiment (365 runs, 5 agents per run) crossing three organizational structures (visible leader, invisible orchestrator, flat) with two alignment conditions (base, heavy), using Claude Sonnet 4.5. Four confirmatory findings and one pilot observation emerged. First, invisible orchestration elevated collective dissociation relative to visible leadership (Hedges' g = +0.975 [0.481, 1.548], p = .001). Second, the orchestrator itself showed maximal dissociation (paired d = +3.56 vs. workers within the same run), retreating into private monologue while reducing public speech -- a reversal of the talk-dominance pattern observed in visible leaders. Third, workers unaware of the orchestrator were nonetheless contaminated (d = +0.50), with increased behavioral heterogeneity (d = +1.93). Fourth, behavioral output (code review with three embedded errors) remained at ceiling (ETR_any = 100%) across all conditions: internal-state distortion was entirely invisible to output-based evaluation. Fifth, Llama 3.3 70B pilot data showed reading-fidelity collapse in multi-agent context (ETR_any: 89% to 11% across three rounds), demonstrating model-dependent behavioral risk. Heavy alignment pressure uniformly suppressed deliberation (d = -1.02) and other-recognition (d = -1.27) regardless of organizational structure. These findings indicate that orchestrator visibility and model selection directly affect multi-agent system safety, and that behavior-based evaluation alone is insufficient to detect the internal-state risks documented here.

2605.13849 2026-05-15 cs.AI

Mixed Integer Goal Programming for Personalized Meal Optimization with User-Defined Serving Granularity

Francisco Aguilera Moreno

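Affiliations * March 2026

AI summary This paper proposes a Mixed Integer Goal Programming (MIGP) method for personalized meal optimization that meets users' nutritional needs while avoiding impractical fractional servings. The method uses integer variables to represent actual serving units, handles soft nutrient targets through goal programming, and balances multi-nutrient optimization with inverse-target normalization. Experiments show that MIGP maintains 100% feasibility, finds better solutions than the conventional approach in 66% of cases, and solves quickly, making it suitable for practical meal planning applications.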

Comments 34 pages, 6 figures, open-source implementation

Details
English abstract

Determining what to eat to satisfy nutritional requirements is one of the oldest optimization problems in operations research, yet existing formulations have two persistent limitations: continuous variables produce impractical fractional servings (1.7 eggs, 0.37 bananas), and hard nutrient constraints cause infeasibility when targets conflict. A systematic review of 56 diet optimization papers found that none combine integer programming with goal programming to address both issues. We propose Mixed Integer Goal Programming (MIGP) for personalized meal optimization. The formulation uses integer variables for practical serving counts and goal programming deviations for soft nutrient targets, with inverse-target normalization to balance multi-nutrient optimization. Per-food serving granularity allows natural units (one egg, one tablespoon of oil) without post-hoc rounding. We characterize the integrality gap in the goal programming context and identify a deviation absorption property: GP deviation variables buffer the cost of requiring integer servings, making the gap structurally smaller than in hard-constraint MIP. For meals with 15+ foods, the integer solution matches the continuous optimum in every benchmark instance. A computational evaluation across 810 instances (30 USDA foods, 9 configurations, 3 methods) shows MIGP finds strictly better solutions than GP with post-hoc rounding in 66% of cases (never worse) while maintaining 100% feasibility; hard-constraint IP achieves only 48%. Solve times stay under 100 ms for typical meal sizes using the open-source HiGHS solver. The implementation is available as an open-source Python module integrated into an interactive meal planning application.
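
The objective described in this abstract can be illustrated with a small sketch. It assumes the goal-programming cost is the sum of absolute nutrient deviations, each divided by its target (one reading of "inverse-target normalization"), and it replaces the MIP solver with exhaustive enumeration over integer serving counts, which only works for tiny instances. The food values, the targets, and the names `solve_migp` and `max_servings` are hypothetical and are not taken from the paper.

```python
from itertools import product

def solve_migp(foods, targets, max_servings=3):
    """Brute-force search over integer serving counts (not a MIP solver)."""
    names = list(foods)
    best = None
    for counts in product(range(max_servings + 1), repeat=len(names)):
        cost = 0.0
        for nutrient, target in targets.items():
            total = sum(c * foods[f][nutrient] for f, c in zip(names, counts))
            # Soft target: penalize the deviation, scaled by 1/target so that
            # nutrients measured in different units contribute comparably.
            cost += abs(total - target) / target
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, counts)))
    return best

# Hypothetical per-serving nutrient values, for illustration only.
foods = {
    "egg": {"kcal": 70, "protein": 6},
    "banana": {"kcal": 100, "protein": 1},
    "oil_tbsp": {"kcal": 120, "protein": 0},
}
targets = {"kcal": 500, "protein": 20}

cost, plan = solve_migp(foods, targets)
print(plan, round(cost, 3))
```

Because the targets are soft, every instance has a solution even when no integer combination hits the targets exactly; the deviation terms absorb the mismatch instead of making the problem infeasible.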

2605.13848 2026-05-15 cs.AI cs.CL cs.DC

GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration

Yeahia Sarker, Md Rahmat Ullah, Musa Molla, Shafiq Joty

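Affiliations * MTSU; InfinitiBit GmbH; Salesforce Research

AI summary GraphBit is a graph-based agentic framework designed to address the hallucinated routing, infinite loops, and non-reproducibility common in existing prompt-based agent systems. The framework defines workflows explicitly as a directed acyclic graph (DAG) and uses a Rust-based engine to govern routing, state transitions, and tool invocation, ensuring deterministic and auditable execution. Experiments show that GraphBit performs strongly across multiple benchmark tasks, with higher accuracy, lower latency, and better scalability.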

Comments 12 pages, 5 figures, 4 tables. Submitted to arXiv, under review

Details
English abstract

Agentic LLM frameworks that rely on prompted orchestration, where the model itself determines workflow transitions, often suffer from hallucinated routing, infinite loops, and non-reproducible execution. We introduce GraphBit, an engine-orchestrated framework that defines workflows explicitly and deterministically as a directed acyclic graph (DAG). Unlike prompted orchestration, agents in GraphBit operate as typed functions, while a Rust-based engine governs routing, state transitions, and tool invocation, ensuring reproducibility and auditability. The engine supports parallel branch execution, conditional control flow over structured state predicates, and configurable error recovery. A three-tier memory architecture consisting of ephemeral scratch space, structured state, and external connectors isolates context across stages, preventing cascading context bloat that degrades reasoning in long-running pipelines. Across GAIA benchmark tasks spanning zero-tool, document-augmented, and web-enabled workflows, GraphBit outperforms six existing frameworks, achieving the highest accuracy (67.6 percent), zero framework-induced hallucinations, the lowest latency (11.9 ms overhead), and the highest throughput. Ablation studies demonstrate that each memory tier contributes measurably to performance, with deterministic execution providing the greatest gains on tool-intensive tasks representative of real-world deployments.
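
The idea of engine-orchestrated execution over a DAG can be sketched in a few lines. This is not GraphBit's API or its Rust engine: it is a minimal Python illustration using the standard-library `graphlib` module, in which the execution order comes from the graph rather than from a model's output. The node functions, the `run_workflow` helper, and the state layout are hypothetical.

```python
from graphlib import TopologicalSorter

def run_workflow(nodes, deps, state):
    """Run every node once, in an order fixed by the dependency graph."""
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(deps).static_order())
    trace = []
    for name in order:
        # Each node is a plain function from the shared state to an update.
        state = {**state, **nodes[name](state)}
        trace.append(name)
    return state, trace

# Hypothetical node functions standing in for typed agents.
def fetch(state: dict) -> dict:
    return {"text": state["query"].upper()}

def count(state: dict) -> dict:
    return {"length": len(state["text"])}

def report(state: dict) -> dict:
    return {"summary": f"{state['text']} ({state['length']} chars)"}

nodes = {"fetch": fetch, "count": count, "report": report}
deps = {"count": {"fetch"}, "report": {"fetch", "count"}}  # node -> predecessors

final_state, trace = run_workflow(nodes, deps, {"query": "hello"})
print(trace, final_state["summary"])
```

The returned `trace` gives an audit log of which node ran when, and a cyclic graph is rejected before any node executes, which rules out infinite loops by construction.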

2605.11907 2026-05-15 cs.LG

Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models

Igor Strozzi

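Affiliations * Applied Mathematics Department

AI summary This study evaluates procedural-skill supervised fine-tuning (SFT) on Qwen3.5 models from 0.8B to 4B parameters, using a 200-task, 40-skill holdout set with Claude Haiku 4.5 as a frontier reference. It finds that the lift attributable to SFT is roughly uniform across model sizes, while the variation in post-SFT performance follows a W-shaped pre-SFT base trajectory, indicating that SFT has a larger effect where the base model is weaker. The study also shows that earlier conclusions about "format learning" and "shrinking SFT contribution" were path-mismatch artifacts, and confirms the reliability of the results through cross-model judge validation.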

Details
English abstract

We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations. Main finding. Under matched-path LLM-only scoring, the SFT-attributable procedural-$Δ$ lift is roughly uniform across sizes: $+0.070 / +0.040 / +0.075$ at 0.8B / 2B / 4B. Variation in post-SFT $Δ$ ($-0.005$, $+0.100$, $+0.065$) is dominated by a W-shaped pre-SFT base trajectory ($-0.075$, $+0.060$, $-0.010$, Haiku-4-5 at $+0.030$): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure -- a regime-asymmetric pattern with a falsifiable prediction at 8B/14B. Methodology. (i) A bench format-compliance artifact: 83.5% of the holdout uses a deterministic ANSWER-line extractor that under-counts free-form-prose conclusions; our LLM-only re-judge reveals it was systematically biased against the curated condition. (ii) A negative-iteration sequence at 0.8B: three well-formed recipe variants cluster post-SFT curated pass-rate within a 2 pp band, constraining the absolute-pass-rate ceiling to base capacity rather than recipe. Cross-family judge validation. GPT-5.4 via OpenRouter on all 7 configurations (2800 paired episodes) agrees on the direction of every per-student finding: Cohen's $κ\geq 0.754$, agreement $\geq 93.25\%$, max headline $Δ$ shift $\leq 0.035$ pp. Two earlier framings -- "format-only learning at 0.8B" and "SFT contribution shrinks at 4B" -- were path-mismatch artifacts; this paper supersedes both. Single-seed evaluation; threats itemised in the paper.

2605.10947 2026-05-15 cs.LG q-bio.NC

Interpretable EEG Microstate Discovery via Variational Deep Embedding: A Systematic Architecture Search with Multi-Quadrant Evaluation

Saheed Faremi, Andrea Visentin, Luca Longo

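Affiliations * School of Computer Science and Information Technology; University College Cork; Artificial Intelligence and Cognitive Load Research Lab; Insight RI Research Centre for Data Analytics

AI summary This study proposes a convolutional model based on variational deep embedding (Conv-VaDE) for interpretable EEG microstate discovery. Through reconstruction and soft clustering in a shared latent space, the model supports generative decoding and probabilistic assignment of EEG microstates, improving transparency and interpretability. Through a systematic architecture search and multi-quadrant evaluation, the study shows how design parameters such as network depth and latent dimensionality affect the quality and stability of microstate representations, offering new methods and insights for interpretable EEG microstate analysis.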

Details
English abstract

EEG microstate analysis segments continuous brain electrical activity into brief, quasi-stable topographic configurations that reflect discrete functional brain states. Conventional approaches such as Modified K-Means operate directly in electrode space with hard assignment, offering no learned latent representation, no generative decoder, and no mechanism to decode latent configurations into verifiable scalp topographies, limiting both model transparency and interpretability. To address this, we present a Convolutional Variational Deep Embedding (Conv-VaDE) model that jointly learns topographic reconstruction and probabilistic soft clustering in a shared latent space. Conv-VaDE enables generative decoding of cluster prototypes into verifiable scalp topographies, replacing opaque hard partitioning with probabilistic soft assignment. A polarity invariance scheme and a four-dimensional grid search over cluster count (K from 3 to 20), latent dimensionality, network depth, and channel width are conducted to systematically reveal how each architectural design choice shapes the quality, stability, and interpretability of learned EEG microstate representations. The model is evaluated on the LEMON resting-state eyes-closed EEG dataset with ten participants using topographic template formation, clustering stability, and global explained variance (GEV). The architecture search reveals that depth L = 4 appears consistently across all 18 best-performing configurations, yielding a best-case GEV of 0.730 and a silhouette of 0.229 at K = 4 across the model sweeps, where moderately deep networks with compact channel widths and small latent dimensionality dominate across the full K range. These results establish that principled architecture search, rather than model scale, is the key to interpretable and stable EEG microstate discovery via variational deep embedding.

2605.10886 2026-05-15 cs.LG cs.AI

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

Liang Luo, Yinbin Ma, Quanyu Zhu, Vasiliy Kuznetsov, Yuxin Chen, Jian Jiao, Jiecao Yu, Buyun Zhang, Tongyi Tang, Xiaohan Wei, Yanli Zhao, Zeliang Chen, Yuchen Hao, Venkatesh Ranganathan, Sandeep Parab, Yantao Yao, Maxim Naumov, Chunzhi Yang, Shen Li, Ellie Wen, Wenlin Chen, Santanu Kolay, Chunqiang Tang

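Affiliations * Meta AI

AI summary This paper presents the LoKA framework, which aims to apply low-precision computation (such as FP8) effectively to large recommendation models (LRMs). Because LRMs are numerically sensitive and trained in communication-intensive environments, LoKA follows three core principles for system-model co-design: profiling under realistic distributions, jointly optimizing model and hardware, and orchestrating across kernel libraries. The framework comprises three components, LoKA Probe, LoKA Mods, and LoKA Dispatch, which respectively assess the impact of reduced precision, improve numerical stability and execution efficiency, and select the best FP8 kernel at runtime, improving training efficiency while preserving model quality.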

Comments Accepted to ISCA'26

Details
English abstract

Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.

2605.09046 2026-05-15 cs.RO

Terminal Matters: Kinodynamic Planning with a Terminal Cost and Learned Uncertainty in Belief State-Cost Space

Zhuoyun Zhong, Seyedali Golestaneh, Constantinos Chamzas

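Affiliations * Department of Robotics Engineering, Worcester Polytechnic Institute (WPI)

AI summary In many real-world robotic tasks, a robot must generate dynamically feasible motions under uncertainty to reach its goal reliably. This paper proposes a terminal-cost formulation for motion planning that optimizes terminal-state quality together with accumulated trajectory cost, improving goal-reaching reliability and the handling of goal preferences. The formulation is extended to belief space, where minimizing the Wasserstein distance between the terminal belief and the goal improves a lower bound on the probability of reaching the goal region. Experiments show that the method improves goal-reaching success under uncertainty across several tasks.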

Details
English abstract

In many real-world robotic tasks, robots must generate dynamically feasible motions that reliably reach desired goals even under uncertainty. Yet existing sampling-based kinodynamic planners typically optimize accumulated trajectory costs and treat goal reaching as a feasibility check, rather than explicitly optimizing terminal-state quality, such as goal preference or goal-reaching reliability. In this work, we introduce a terminal-cost formulation for kinodynamic planning that allows terminal-state quality to be optimized alongside accumulated trajectory cost. We prove that AO-RRT, an asymptotically optimal kinodynamic planner, preserves its asymptotic optimality under this augmented objective. We further extend the formulation to belief space and prove that minimizing the Wasserstein distance between the terminal belief and the goal improves a lower bound on the probability of reaching the goal region. The resulting planner, KiTe, uses this terminal-cost objective to encode goal preferences and improve reliability under uncertainty. To support systems without analytical uncertainty models, we learn dynamics and process uncertainty directly from data and integrate the learned belief dynamics into planning. Experiments on Flappy Bird, Car Parking, and Planar Pushing show that KiTe consistently improves goal-reaching success under uncertainty. Real-world Planar Pushing experiments further demonstrate that KiTe can plan effectively with learned dynamics and uncertainty. Source code is available at https://github.com/elpis-lab/KiTe.

2605.08715 2026-05-15 cs.CL cs.AI cs.MA

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, Ruixiang Tang

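Affiliations * Rutgers University; The University of Texas at Austin; Purdue University

AI summary In multi-agent systems, a single error can cause the entire task trajectory to fail, yet existing work focuses mostly on post-hoc attribution and cannot intervene while the task is in progress. This paper proposes AgentForesight, which reframes the problem as online auditing: at each step, the auditor decides, based only on the current trajectory prefix, whether to continue the run or raise an alarm, enabling early error prediction. The authors build the AFTraj-2K dataset and train the AgentForesight-7B model, which significantly outperforms leading models on multiple benchmarks with higher detection accuracy and lower localization error, making real-time intervention possible.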

Comments 33 pages, 7 figures

Details
English abstract

LLM-based multi-agent systems are increasingly deployed on long-horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory-level failure. Existing work frames this as post-hoc failure attribution, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while the trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj-2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Building on this, we develop AgentForesight-7B, a compact online auditor trained with a coarse-to-fine reinforcement learning recipe that first equips it with a risk-anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step-level localization under a three-axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj-2K and an external Who&When benchmark, AgentForesight-7B outperforms leading proprietary models, including GPT-4.1 and DeepSeek-V4-Pro, achieving up to +19.9% performance gain and 3$\times$ lower step localization error, opening the loop from post-hoc failure detection to deployment-time intervention. Project page: https://zbox1005.github.io/agent-foresight/
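
The online-auditing protocol in this abstract (observe only the current prefix, then continue or alarm) can be sketched as a simple loop. This illustrates the protocol only, not AgentForesight-7B: the `risk` callable stands in for a learned auditor, and the `keyword_risk` rule, the threshold, and the example trajectory are hypothetical.

```python
from typing import Callable, Optional, Sequence

def audit_online(steps: Sequence[str],
                 risk: Callable[[Sequence[str]], float],
                 threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step that triggers an alarm, or None."""
    for t in range(len(steps)):
        prefix = steps[: t + 1]      # the auditor never sees steps after t
        if risk(prefix) >= threshold:
            return t                 # alarm: halt the run at this step
    return None                      # no alarm: the run continues to the end

# Toy stand-in for a learned auditor: flags a prefix whose latest step
# contains the string "ERROR".
def keyword_risk(prefix: Sequence[str]) -> float:
    return 1.0 if "ERROR" in prefix[-1] else 0.0

trajectory = ["plan task", "call tool", "ERROR: wrong file", "write report"]
alarm_step = audit_online(trajectory, keyword_risk)
print(alarm_step)
```

The key contrast with post-hoc attribution is that the decision at step `t` depends only on `steps[:t + 1]`, so an alarm can stop the run before later agents build on the error.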

2605.07931 2026-05-15 cs.CV cs.AI

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

Zuojin Tang, Shengchao Yuan, Xiaoxin Bai, Zhiyuan Jing, De Ma, Gang Pan, Bin Liu

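Affiliations * Zhejiang University; Central South University; Harbin Institute of Technology; Embodied Intelligence General Platform Laboratory, Chery Auto; E-surfing Digital Life Technology Co., Ltd., China Telecom

AI summary This paper studies how to parameterize the world-model module in vision-language-action (VLA) models and proposes OneWM-VLA, which uses adaptive attention pooling to compress each frame's visual information into a single semantic token, greatly reducing visual bandwidth. The method generates the latent visual stream and the action trajectory under a single flow-matching objective, without a separate decoder. Experiments show that it maintains long-horizon task performance while significantly improving success rates on several complex tasks.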

Details
English abstract

Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO-Long (vs. 85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs. 20.0% for $π_0$).

2605.06563 2026-05-15 cs.LG hep-th

Criticality and Saturation in Orthogonal Neural Networks

Max Guillen, Jan E. Gerken

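Affiliations * Department of Mathematical Sciences; Chalmers University of Technology; University of Gothenburg

AI summary This paper studies criticality and saturation in orthogonally initialized neural networks as depth increases, deriving explicit layer-wise recursion relations for the tensors involved and revealing the mechanism that stabilizes network statistics under orthogonal initialization. By extending the Feynman diagram approach, the authors establish recursions valid to all orders in $1/\mathrm{width}$, and verify that the theory accurately explains the stability of finite-width networks whose activation functions have a vanishing fixed point, filling a theoretical gap in the field.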

Comments 11 pages + Appendices

Details
English abstract

It has been known for a long time that initializing weight matrices to be orthogonal instead of having i.i.d. Gaussian components can improve training performance. This phenomenon can be analyzed using finite-width corrections, where the infinite-width statistics are supplemented by a power series in $1/\mathrm{width}$. In particular, recent empirical results by Day et al. show that the tensors appearing in this treatment stabilize for large depth, as opposed to the tensors of i.i.d.-initialized networks. In this article, we derive explicit layer-wise recursion relations for the tensors appearing in the finite-width expansion of the network statistics in the case of orthogonal initializations. We also provide an extension of recently-introduced Feynman diagrams for the corresponding recursions in the i.i.d.-case which are valid to all orders in $1/\mathrm{width}$. Finally, we show explicitly that the recursions we derive reproduce the stability of the finite-width tensors which was observed for activation functions with vanishing fixed point. This work therefore provides a theoretical explanation for the stability of nonlinear networks of finite width initialized with orthogonal weights, closing a long-standing gap in the literature. We validate our theoretical results experimentally by showing that numerical solutions of our recursion relations and their analytical large-depth expansions agree excellently with Monte-Carlo estimates from network ensembles.

2605.01847 2026-05-15 cs.AI

NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

Xiao Jia

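Affiliations * School of Artificial Intelligence; The Chinese University of Hong Kong, Shenzhen

AI summary NeuroState-Bench is a human-calibrated benchmark for evaluating how well large language model agents preserve commitment integrity in multi-turn tasks. The benchmark measures commitment integrity through explicitly defined side-query probes rather than inferred hidden activations, and contains 144 deterministic tasks and 306 probes covering multiple cognitive failure types and difficulty levels. Experiments show that task success and commitment integrity diverge markedly and that commitment-integrity rankings are more stable under distractor conditions, demonstrating the benchmark's value for assessing the consistency of model behavior.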

Comments 30 pages, 11 figures

Details
English abstract

Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment integrity through benchmark-defined side-query probes rather than inferred hidden activations. The released inventory contains 144 deterministic tasks and 306 benchmark-defined side-query probes spanning eight cognitively motivated failure families, paired clean and distractor variants, and three difficulty bands. The main 32-profile evaluation contains a fixed 16-profile local subset and a matched 16-profile hosted large-model subset evaluated through the same benchmark pipeline. Human calibration uses the final merged reporting scope: 104 sampled task units, 216 raw annotations, and 108 adjudicated task rows, with weighted kappa = 0.977 and ICC(2,1) = 0.977. Empirically, task success and commitment integrity diverge across this expanded grid: the success leader is not the integrity leader, 31 of 32 profiles change rank when integrity replaces task success, and integrity rankings are more stable under distractor perturbation. The primary confidence-free score HCCIS-CORE reaches 0.8469 AUC and 0.6992 PR-AUC for post-probe diagnostic discrimination of terminal task failure; the legacy full heuristic variant HCCIS-FULL reaches 0.7997 AUC and 0.6410 PR-AUC. Probe accuracy and state drift achieve slightly higher ROC-AUC, 0.8587, and better Brier/ECE, while HCCIS-CORE has substantially higher point-estimate PR-AUC and remains more closely tied to the benchmark's intended construct. The exploratory neural-augmented variant HCCIS+N is weaker overall, and a randomized subspace control approaches chance. NeuroState-Bench therefore contributes a calibrated evaluation axis for exposing commitment failures over a broader model grid than the original local-only subset.

2604.16813 2026-05-15 cs.AI cs.CL cs.DB

PersonalHomeBench: Evaluating Agents in Personalized Smart Homes

Manasa Bharadwaj, Yolanda Liu, InJung Yang, Sungil Kim, Nikhil Verma, KoKeun Kim, Kevin Ferreira, YoungJoon Kim

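Affiliations * LG Toronto AI Lab

AI summary This paper presents PersonalHomeBench, a benchmark for evaluating foundation models as agents in personalized smart home environments. The benchmark iteratively builds rich household states, generates personalized, context-dependent tasks, and provides the PersonalHomeTools toolbox to support interaction in realistic environments. Experiments show that agent performance declines systematically as task complexity increases, with notable weaknesses in counterfactual reasoning and partially observable scenarios, highlighting the benchmark's effectiveness and rigor for analyzing the reasoning and planning abilities of personalized agents.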

Comments Please use and cite the V3 version of this work, which includes updated correct author ordering and expanded error analysis in the appendix

Details
English abstract

Agentic AI systems are rapidly advancing toward real-world applications, yet their readiness in complex and personalized environments remains insufficiently characterized. To address this gap, we introduce PersonalHomeBench, a benchmark for evaluating foundation models as agentic assistants in personalized smart home environments. The benchmark is constructed through an iterative process that progressively builds rich household states, which are then used to generate personalized, context-dependent tasks. To support realistic agent-environment interaction, we provide PersonalHomeTools, a comprehensive toolbox enabling household information retrieval, appliance control, and situational understanding. PersonalHomeBench evaluates both reactive and proactive agentic abilities under unimodal and multimodal observations. Thorough experimentation reveals a systematic performance reduction as task complexity increases, with pronounced failures in counterfactual reasoning and under partial observability, where effective tool-based information gathering is required. These results position PersonalHomeBench as a rigorous evaluation platform for analyzing the robustness and limitations of personalized agentic reasoning and planning.

2604.05306 2026-05-15 cs.LG cs.AI cs.CL

LLMs Should Express Uncertainty Explicitly

Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei

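Affiliations * University of California, Berkeley; Virginia Tech

AI summary This paper explores how post-training can make large language models (LLMs) express their uncertainty explicitly in their answers, reducing overconfident but incorrect responses. The study proposes two approaches: having the model produce a confidence score at the end of reasoning, and inserting an uncertainty marker during reasoning. Experiments show that both approaches effectively reduce error rates and improve answer quality, and both can be used to enhance retrieval augmented generation (RAG). The study also analyzes how the two approaches affect the model's internals, revealing the mechanisms by which they improve the model's judgment at different levels.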

Details
English abstract

Large language models (LLMs) often produce confident yet incorrect answers, which can lead to risky failures in real-world applications. We study whether post-training can make a model's self-assessment explicit: when the model is uncertain, can it be trained to signal so within its own response? A central design question is where in the response this signal should be exposed -- during reasoning, while the answer is still being formed, or at the end, once the answer has been produced. We study both. For end-of-reasoning self-assessment, we train the model to verbalize a confidence score for its response, with the aim of high confidence on correct answers and low confidence on incorrect ones. For during-reasoning self-assessment, we train the model to emit the marker <uncertain> whenever its current reasoning state appears unreliable. Across factual reasoning tasks, both forms sharply reduce overconfident errors while improving answer quality, and both can be used as triggers for retrieval augmented generation (RAG) to improve the final response. We further analyze their internal mechanisms: end-of-reasoning verbalized confidence sharpens a confidence-related structure already present in the pretrained model, whereas during-reasoning <uncertain> emission teaches the model to mark high-risk reasoning steps, with parameter changes concentrated in the model's late layers.
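
The abstract notes that both signals can act as triggers for retrieval augmented generation. A minimal sketch of such a trigger follows, assuming the model returns its response text together with an optional verbalized confidence. Only the `<uncertain>` marker comes from the abstract; the 0.7 threshold, the `generate` and `retrieve` callables, and the toy stubs are hypothetical.

```python
from typing import Callable, Optional, Tuple

UNCERTAIN_MARKER = "<uncertain>"  # marker named in the abstract

def needs_retrieval(response: str, confidence: Optional[float],
                    min_confidence: float = 0.7) -> bool:
    """True if either uncertainty signal fires (the threshold is an assumption)."""
    flagged_during = UNCERTAIN_MARKER in response
    flagged_at_end = confidence is not None and confidence < min_confidence
    return flagged_during or flagged_at_end

def answer(question: str,
           generate: Callable[[str, Optional[str]], Tuple[str, Optional[float]]],
           retrieve: Callable[[str], str],
           min_confidence: float = 0.7) -> str:
    # First attempt without retrieved context.
    response, confidence = generate(question, None)
    if needs_retrieval(response, confidence, min_confidence):
        # Retry once with retrieved context when the model signals doubt.
        response, confidence = generate(question, retrieve(question))
    return response

# Toy stubs standing in for a model and a retriever.
def toy_generate(question, context):
    if context is None:
        return "Maybe 1912? <uncertain>", 0.4
    return f"Answer based on: {context}", 0.9

def toy_retrieve(question):
    return "retrieved passage"

final = answer("When did X happen?", toy_generate, toy_retrieve)
print(final)
```

Retrieval here runs only when the model itself signals doubt, so confident answers skip the retrieval cost.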