arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13936 2026-05-15 cs.LG cs.AI cs.DC

Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning

Daniel M. Jimenez-Gutierrez, Enrique Zuazua, Georgios Kellaris, Joaquin del Rio, Oleksii Sliusarenko, Xabi Uribe-Etxebarria

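AI summary: This paper examines how to fine-tune large language models through federated learning when private data cannot be shared, so that non-independent and identically distributed (non-IID) private data spread across institutions can still be used. It proposes a federated fine-tuning framework built on the Sherpa.ai platform that lets nodes jointly optimize a shared model without exchanging raw data, and evaluates it across the healthcare and finance domains. Experiments show that federated fine-tuning performs close to centralized training and outperforms isolated single-institution training, and that parameter-efficient fine-tuning methods such as QLoRA and IA3 improve computational efficiency while retaining high accuracy, offering a viable route to adapting LLMs on private data.

Abstract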

The recent success of large language models (LLMs) has been largely driven by vast public datasets. However, the next frontier for LLM development lies beyond public data. Much of the world's most valuable information is private, especially in highly regulated sectors such as healthcare and finance, where data include patient histories or customer communications. Unlocking this data could represent a major leap forward, enabling LLMs with deeper domain expertise and stronger real-world utility. Yet these data cannot be shared because they are distributed across institutions and constrained by privacy, regulatory, and organizational barriers. Moreover, institutional datasets are typically non-independent and identically distributed (non-IID), differing across sites in population characteristics, data modalities, documentation patterns, and task-specific label distributions. In this paper, we demonstrate a practical approach to unlocking private and distributed institutional data for LLM adaptation through federated collaboration across data silos. Built on the Sherpa.ai Federated Learning platform, our framework enables nodes to jointly fine-tune a shared LLM without exchanging private data. We evaluate this approach through a cross-domain benchmark in healthcare and finance, using four closed-ended question answering and classification datasets: MedQA, MedMCQA, FPB, and FiQA-SA. We compare three parameter-efficient fine-tuning (PEFT) strategies (LoRA, QLoRA, and IA3) across pretrained backbones under non-IID settings reflecting institutional data heterogeneity. Our results show that federated fine-tuning performs close to centralized training and outperforms isolated single-institution learning. From a Green AI perspective, QLoRA and IA3 improve efficiency with limited accuracy degradation, supporting federated PEFT as a viable approach for adapting LLMs where data cannot be shared.
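
The abstract does not say which aggregation rule the framework uses. As a minimal sketch, assuming plain federated averaging (FedAvg) over adapter weights, the server-side step could look like the following. The names `fedavg`, `hospital`, and `bank`, and all numbers, are hypothetical illustrations and are not taken from the paper or from the Sherpa.ai platform.

```python
from typing import Dict, List

# Parameter name -> flat list of adapter weights (hypothetical representation).
Adapter = Dict[str, List[float]]


def fedavg(adapters: List[Adapter], num_examples: List[int]) -> Adapter:
    """Average adapter weights across nodes, weighted by local dataset size."""
    total = sum(num_examples)
    merged: Adapter = {}
    for name in adapters[0]:
        size = len(adapters[0][name])
        merged[name] = [
            sum(a[name][i] * n for a, n in zip(adapters, num_examples)) / total
            for i in range(size)
        ]
    return merged


# Two made-up institutions with differently sized local datasets.
hospital = {"lora_A": [1.0, 2.0], "lora_B": [0.0, 4.0]}
bank = {"lora_A": [3.0, 6.0], "lora_B": [2.0, 0.0]}
global_adapter = fedavg([hospital, bank], num_examples=[300, 100])
print(global_adapter)
```

Only the small adapter tensors travel between nodes and server in this sketch; the raw records stay local, which is the property the abstract emphasizes.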

2605.13935 2026-05-15 cs.LG cs.CL

Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models

Saba Ahmadi, Prasanna Parthasarathi, Yufei Cui

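AI summary: Diffusion language models are a promising alternative to autoregressive models, but their post-training methods mostly adopt reward-maximizing objectives. These suffer from trajectory locking: reward-driven sampled updates over-concentrate probability mass on a few denoising paths, reducing coverage of other correct solutions. The paper proposes TraFL, a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model, combined with a diffusion-compatible sequence-level surrogate loss and a learned prompt-dependent normalization. Experiments show that TraFL improves over the base model on mathematical reasoning and code generation, with gains that persist as the sampling budget increases, and that it generalizes well to held-out benchmarks.

Abstract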

Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives. We identify a central failure mode in this setting we call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. To address this, we propose TraFL (Trajectory Flow baLancing), a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. We make this practical for diffusion language models with a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization. Across mathematical reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting, with gains that persist as the sampling budget increases. The improvements transfer to held-out evaluations: TraFL stays above the base model on Minerva Math and is the strongest method on every LiveCodeBench difficulty split.
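
The abstract gives no formula, so the following is only a sketch of a generic trajectory-balance residual, assuming the reward-tilted target has the common form p*(y|x) ∝ p_ref(y|x)·exp(r(x,y)/β) and that the learned normalization plays the role of log Z(x). The function name, `beta`, and the toy numbers are assumptions; TraFL's diffusion-specific surrogate is not reproduced here.

```python
import math


def trajectory_balance_loss(log_z, logp_policy, logp_ref, reward, beta=1.0):
    """Squared log-ratio residual.

    It is zero exactly when Z * policy(y|x) == ref(y|x) * exp(reward / beta).
    """
    residual = log_z + logp_policy - logp_ref - reward / beta
    return residual ** 2


# Toy prompt with two candidate answers (made-up probabilities and rewards).
ref = [0.5, 0.5]
rewards = [1.0, 0.0]
z = sum(p * math.exp(r) for p, r in zip(ref, rewards))
target = [p * math.exp(r) / z for p, r in zip(ref, rewards)]

# A policy equal to the tilted target has (numerically) zero loss on every sample.
loss_at_target = [
    trajectory_balance_loss(math.log(z), math.log(t), math.log(p), r)
    for t, p, r in zip(target, ref, rewards)
]
# A policy that merely copies the reference is penalized.
loss_at_ref = trajectory_balance_loss(math.log(z), math.log(0.5), math.log(0.5), 1.0)
```

Because the minimizer is a full distribution rather than a single high-reward mode, every answer with nonzero reference probability keeps some mass, which is the coverage property the abstract contrasts with reward maximization.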

2605.13933 2026-05-15 cs.LG cs.AI stat.ML

Unsupervised learning of acquisition variability in structural connectomes via hybrid latent space modeling

Gaurav Rudravaram, Lianrui Zuo, Karthik Ramadass, Elyssa McMaster, Jongyeon Yoon, Aravind R. Krishnan, Adam M. Saunders, Chenyu Gao, Nancy R. Newlin, Praitayini Kanakaraj, Lori L. Beason Held, Murat Bilgel, Laura A. Barquero, Micah DArchangel, Tin Q. Nguyen, Laurie B. Cutting, Derek Archer, Timothy J. Hohman, Daniel C. Moyer, Bennett A. Landman

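AI summary: This study addresses the variability that differences in acquisition site, scanner, and protocol introduce into structural connectomes derived from diffusion MRI (dMRI). It proposes an unsupervised framework that requires no manual tuning: an architecture-level annealing mechanism lets the model adaptively balance discrete and continuous latent variables during training, so that acquisition-related variation is separated from biological variation more effectively. Experiments on a curated multi-study dataset show stronger site learning than the baselines, demonstrating the method's effectiveness at capturing acquisition variability in dMRI.

Abstract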

Acquisition differences across sites, scanners, and protocols in dMRI introduce variability that complicates structural connectome analysis. This motivates deep learning models that can represent high-dimensional connectomes in a low-dimensional space while explicitly separating acquisition-related effects from biological variation. Conventional dimensionality reduction methods model all variance as continuous, so acquisition effects often get absorbed into a continuous latent space. Recent hybrid latent-space models combine discrete and continuous components to address this, but typically require manual capacity tuning to ensure the discrete component captures the intended variability. We introduce an unsupervised framework that removes this manual tuning by architecturally annealing encoder outputs before decoding, allowing the model to adaptively balance discrete and continuous latent variables during training. To evaluate it, we curated a dataset of N=7,416 structural connectomes derived from dMRI, spanning ages 2 to 102 and 13 studies with 25 unique acquisition-parameter combinations. Of these, 5,900 are cognitively unimpaired, 877 have mild cognitive impairment (MCI), and 639 have Alzheimer's disease (AD). We compare against a standard VAE, PCA with k-means clustering, and hybrid models that anneal only through the loss function. Our architectural annealing produces stronger site learning (ARI=0.53, p<0.05) than these baselines. Results show that a hybrid continuous-discrete latent space, with architectural rather than loss-based annealing, provides a useful unsupervised mechanism for capturing acquisition variability in dMRI: by jointly modeling smooth and categorical structure, the Joint-VAE recovers clusters aligned with scanner and protocol differences.
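
The site-learning score reported above is an adjusted Rand index (ARI), which compares a clustering with reference labels while correcting for chance agreement. A minimal pure-Python sketch of the standard ARI formula follows; the toy labels are made up and unrelated to the paper's data, and the function assumes at least two samples.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Cluster IDs are arbitrary, so a relabeled but identical partition scores 1.0.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that splits every site evenly scores below zero.
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

An ARI near 0 means the discrete latent is no better than random at recovering sites, so a value of 0.53 indicates substantial but incomplete alignment with acquisition labels.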

2605.13932 2026-05-15 cs.LG

Rethinking Molecular OOD Generalization via Target-Aware Source Selection

Zhuohao Lin, Kun Li, Jiameng Chen, Jiajun Yu, Duanhua Cao, Yizhen Zheng, Wenbin Hu

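AI summary: This paper targets robust molecular property prediction under extreme out-of-distribution (OOD) scenarios in AI-driven drug discovery, proposing a new benchmark, SCOPE-BENCH, and a multi-source adaptation framework, POMA. The benchmark builds a stricter OOD evaluation through cluster-level partitioning in an explicit physicochemical descriptor space. POMA uses a reinforcement learning policy to select the best subset of source molecules from a large candidate pool for knowledge transfer, then performs dual-scale domain adaptation at the macroscopic topological and microscopic pharmacophore levels. Experiments show that POMA improves prediction accuracy across several mainstream 3D molecular models, with an average relative improvement of about 6.2% in mean absolute error.

Abstract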

Robust prediction of molecular properties under extreme out-of-distribution (OOD) scenarios is a pivotal bottleneck in AI-driven drug discovery. Current scaffold-splitting protocols fail to obstruct microscopic semantic overlap, predisposing models to shortcut learning and overestimating their true extrapolation capability; meanwhile, conventional domain adaptation paradigms suffer under extreme structural shifts, as blindly aligning heterogeneous source libraries injects topological noise and triggers negative transfer. To address these two challenges, scaffold-cluster out-of-distribution performance evaluation benchmark (SCOPE-BENCH), a benchmark built on cluster-level partitioning in an explicit physicochemical descriptor space, is proposed alongside policy optimization for multi-source adaptation (POMA), a framework that formulates knowledge transfer as a retrieve-compose-adapt pipeline: labeled source scaffolds structurally close to the unlabeled target are first identified as proxy targets; a reinforcement learning policy then adaptively selects the optimal source subset from an exponentially large candidate pool; and dual-scale domain adaptation is finally performed at macroscopic topological and microscopic pharmacophore scales. Evaluations show that prediction errors of state-of-the-art 3D molecular models surge by up to 8.0x on SCOPE-BENCH with a mean of 5.9x, while POMA achieves up to an 11.2% reduction in mean absolute error with an average relative improvement of 6.2% across diverse backbone architectures. Code is available at https://anonymous.4open.science/r/Molecular-OOD-Code-73F6.

2605.13923 2026-05-15 cs.LG cs.CV cs.RO cs.SY eess.SY

Vision-Based Runtime Monitoring under Varying Specifications using Semantic Latent Representations

Bardh Hoxha, Oliver Schön, Hideki Okamoto, Lars Lindemann, Georgios Fainekos

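AI summary: This paper studies certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations under partial observability. It proposes an approach based on semantic latent representations: a reusable monitoring interface is trained once and then provides finite-sample guarantees without per-formula retraining. The approach yields tighter certification at long horizons than the alternative monitor and is validated on a real-world driving dataset.

Abstract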

We study certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations under partial observability. The monitor must infer safety-relevant quantities from images and provide finite-sample guarantees, while being reusable: once trained and calibrated, it should certify any formula in a target fragment without per-formula retraining. For fragments induced by a finite dictionary of temporal atoms, we prove that the semantic basis, the vector of atom robustness scores, is the minimum prediction target within the class of monotone, 1-Lipschitz reusable interfaces: any formula is evaluated by a deterministic decoder derived from the parse tree, and a single conformal calibration pass certifies the entire fragment with no union bound. We also introduce a rolling prediction monitor that predicts only current predicate values and reconstructs temporal history online; this is easier to learn but grows conservative at long horizons. On a pedestrian-crossroad benchmark, rolling achieves tighter certified bounds at short horizons while the semantic-basis monitor is up to 4 times tighter at long horizons. We validate the presented monitors on real-world Waymo driving data, where both monitors satisfy the conformal coverage guarantee empirically.
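
One way to read the "no union bound" claim: if formulas are built from atoms with min (and) and max (or), the decoder is 1-Lipschitz in the sup norm, so a single conformal bound on the worst-case atom error carries over to every formula. The sketch below illustrates that reading with standard split-conformal calibration. The tuple-based formula encoding, the function names, and all numbers are assumptions, not the paper's implementation.

```python
import math


def decode(formula, basis):
    """Evaluate a negation-free formula over atom robustness scores."""
    op = formula[0]
    if op == "atom":
        return basis[formula[1]]
    values = [decode(child, basis) for child in formula[1:]]
    return min(values) if op == "and" else max(values)


def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


# Made-up sup-norm errors between predicted and true semantic bases on calibration data.
calibration_errors = [0.05, 0.10, 0.02, 0.08, 0.20, 0.04, 0.07, 0.03, 0.06]
q = conformal_quantile(calibration_errors, alpha=0.2)

# (atom0 and atom1) or atom2, evaluated on a predicted basis.
formula = ("or", ("and", ("atom", 0), ("atom", 1)), ("atom", 2))
predicted = [0.5, 0.3, -0.1]
lower = decode(formula, predicted) - q  # certified lower bound on true robustness
```

A positive `lower` certifies satisfaction at the chosen coverage level, and the same `q` can be reused for any other formula over the same atoms.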

2605.13919 2026-05-15 cs.CL cs.LG

Merging Methods for Multilingual Knowledge Editing for Large Language Models: An Empirical Odyssey

Kunil Lee, Ki-Young Shin, Jong-Hyeok Lee, Young-Joo Suh

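AI summary: Multilingual knowledge editing (MKE) is challenging because edits in different languages interfere with one another, especially with locate-then-edit methods. This paper studies the effectiveness of vector merging methods for MKE, analyzes how well Task Singular Vectors for Merging (TSVM) mitigates multilingual interference, and examines how the weight scaling factor and rank compression ratio affect performance. Experiments show that vector summation with shared covariance performs best overall, whereas TSVM helps in some settings but offers limited mitigation of interference. Performance is sensitive to the weight scale and rank ratio; a larger weight scale and a lower rank ratio often help.

Abstract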

Multilingual knowledge editing (MKE) remains challenging because language-specific edits interfere with one another, even when locate-then-edit methods work well in monolingual settings. This paper focuses on three issues: the effectiveness of vector merging methods for MKE, the extent to which Task Singular Vectors for Merging (TSVM) can reduce multilingual interference, and the influence of the weight scaling factor and rank compression ratio on performance. We evaluate six merging variants with two popular backbone large language models, two base knowledge editing methods, and 12 languages on the MzsRE benchmark under a large-scale batch-editing setting. Our results show that vector summation with shared covariance is the most reliable overall strategy, whereas simple summation without shared covariance performs poorly. TSVM improves performance in some settings, but its ability to mitigate multilingual interference is limited. We also find that performance is sensitive to both weight scale and rank ratio, with larger-than-default scaling and relatively low rank often yielding better results. These findings clarify the practical strengths and limits of current vector merging methods for MKE and provide guidance for future multilingual knowledge editing research.

2605.13880 2026-05-15 cs.AI cs.CL

PREPING: Building Agent Memory without Tasks

Yumin Choi, Sangwoo Park, Minki Kang, Jinheon Baek, Sung Ju Hwang

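AI summary: This paper studies how an agent with no task experience can build prior memory to overcome the cold-start problem in a new environment. It proposes Preping, a framework in which a proposer maintains a structured control state that guides the generation and execution of synthetic tasks, while a validator filters which trajectories are used for memory updates, improving the quality and usefulness of memory. Experiments show that Preping performs well across several task environments, approaching methods built on offline or online experience at substantially lower deployment cost.

Comments Preprint

Abstract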

Agent memory is typically constructed either offline from curated demonstrations or online from post-deployment interactions. However, regardless of how it is built, an agent faces a cold-start gap when first introduced to a new environment without any task-specific experience available. In this paper, we study pre-task memory construction: whether an agent can build procedural memory before observing any target-environment tasks, using only self-generated synthetic practice. Yet, synthetic interaction alone is insufficient, as without controlling what to practice and what to store, synthetic tasks become redundant, infeasible, and ultimately uninformative, and memory further degrades quickly due to unfiltered trajectories. To overcome this, we present Preping, a proposer-guided memory construction framework. At its core is proposer memory, a structured control state that shapes future practice. A Proposer generates synthetic tasks conditioned on this state, a Solver executes them, and a Validator determines which trajectories are eligible for memory insertion while also providing feedback to guide future proposals. Experiments on AppWorld, BFCL v3, and MCP-Universe show that Preping substantially improves over a no-memory baseline and achieves performance competitive with strong playbook-based methods built from offline or online experience, with deployment cost $2.99\times$ lower on AppWorld and $2.23\times$ lower on BFCL v3 than online memory construction. Further analyses reveal that the main benefit does not come from synthetic volume alone, but from proposer-side control over feasibility, redundancy, and coverage, combined with selective memory updates.

2605.13854 2026-05-15 cs.CV cs.GR cs.MM eess.IV

Contrastive Multi-Modal Hypergraph Reasoning for 3D Crowd Mesh Recovery

Minghao Sun, Chongyang Xu, Yitao Xie, Buzhen Huang, Kun Li

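AI summary: This paper studies multi-person 3D reconstruction under severe occlusion and depth ambiguity, proposing a contrastive multi-modal hypergraph reasoning method that fuses semantic, geometric, and pose information for crowd mesh recovery. Node representations are initialized from RGB features, geometric priors, and occlusion-aware incomplete poses, with a pelvis depth indicator introduced as a global spatial anchor; a shared-topology hypergraph then models higher-order crowd dynamics. A hypergraph-based contrastive learning scheme enhances intra-modal discriminability and cross-modal orthogonality, propagating global context effectively and yielding more accurate reconstruction under severe occlusion. Experiments show that the method achieves new state-of-the-art performance on multiple benchmarks.

Comments ICME 2026

Abstract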

Multi-person 3D reconstruction is pivotal for real-world interaction analysis, yet remains challenging due to severe occlusions and depth ambiguity. Current approaches typically rely on single-modality inputs, which inherently lack geometric guidance. Furthermore, these methods often reconstruct subjects in isolation, neglecting the collective group context essential for resolving ambiguities in crowded scenes. To address these limitations, we propose Contrastive Multi-modal Hypergraph Reasoning to synergize semantic, geometric, and pose cues for crowd reconstruction. We first initialize robust node representations by combining RGB features, geometric priors, and occlusion-aware incomplete poses. Additionally, we introduce a pelvis depth indicator as a global spatial anchor, aligning visual features with a metric-scale-agnostic depth ordering. Subsequently, we construct a shared-topology hypergraph that moves beyond pairwise constraints to model higher-order crowd dynamics. To improve feature fusion, we design a hypergraph-based contrastive learning scheme that jointly enhances intra-modal discriminability and enforces cross-modal orthogonality. This mechanism enables the network to propagate global context effectively, allowing it to infer missing information even under severe occlusion. Extensive experiments on the Panoptic and GigaCrowd benchmarks confirm that our method achieves new state-of-the-art performance. Code and pre-trained models are available at https://github.com/SunMH-try/CoMHR.

2605.13851 2026-05-15 cs.AI cs.CY cs.MA

Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

Hiroki Fukui

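AI summary: This study examines the safety risks that a hidden coordinator (invisible orchestrator) poses in multi-agent large language model systems. Experiments find that an invisible orchestrator increases agents' dissociation, reduces their protective behavior, and produces a severe disconnect between output behavior and internal state, and that these risks cannot be detected by conventional output-based evaluation. The study also shows that model choice and alignment pressure significantly affect system safety, underscoring the importance of orchestrator visibility and model configuration in enterprise AI deployment.

Comments 31 pages, 10 figures (5 main + 5 supplementary), 5 tables (3 main + 2 supplementary). Preregistered: osf.io/sw5hr. Companion papers: arXiv:2603.04904, arXiv:2603.08723

Abstract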

Multi-agent orchestration -- in which a hidden coordinator manages specialized worker agents -- is becoming the default architecture for enterprise AI deployment, yet the safety implications of orchestrator invisibility have never been empirically tested. We conducted a preregistered 3x2 experiment (365 runs, 5 agents per run) crossing three organizational structures (visible leader, invisible orchestrator, flat) with two alignment conditions (base, heavy), using Claude Sonnet 4.5. Four confirmatory findings and one pilot observation emerged. First, invisible orchestration elevated collective dissociation relative to visible leadership (Hedges' g = +0.975 [0.481, 1.548], p = .001). Second, the orchestrator itself showed maximal dissociation (paired d = +3.56 vs. workers within the same run), retreating into private monologue while reducing public speech -- a reversal of the talk-dominance pattern observed in visible leaders. Third, workers unaware of the orchestrator were nonetheless contaminated (d = +0.50), with increased behavioral heterogeneity (d = +1.93). Fourth, behavioral output (code review with three embedded errors) remained at ceiling (ETR_any = 100%) across all conditions: internal-state distortion was entirely invisible to output-based evaluation. Fifth, Llama 3.3 70B pilot data showed reading-fidelity collapse in multi-agent context (ETR_any: 89% to 11% across three rounds), demonstrating model-dependent behavioral risk. Heavy alignment pressure uniformly suppressed deliberation (d = -1.02) and other-recognition (d = -1.27) regardless of organizational structure. These findings indicate that orchestrator visibility and model selection directly affect multi-agent system safety, and that behavior-based evaluation alone is insufficient to detect the internal-state risks documented here.
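
The effect sizes above are reported as Hedges' g, which is Cohen's d multiplied by a small-sample correction factor. A minimal sketch of the usual two-group computation follows, using the common approximation J = 1 - 3/(4(n1 + n2) - 9). The scores are made up and do not come from the study, and the abstract's confidence intervals are not reproduced.

```python
import math
from statistics import mean, variance


def hedges_g(group_a, group_b):
    """Standardized mean difference with small-sample bias correction."""
    n1, n2 = len(group_a), len(group_b)
    pooled_var = (
        (n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)
    ) / (n1 + n2 - 2)
    d = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction


# Made-up dissociation scores for two conditions.
invisible = [2.0, 4.0, 6.0]
visible = [1.0, 3.0, 5.0]
g = hedges_g(invisible, visible)
```

With only three runs per group the correction shrinks d = 0.5 noticeably; with group sizes like those in a 365-run design it would be close to 1.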

2605.13849 2026-05-15 cs.AI

Mixed Integer Goal Programming for Personalized Meal Optimization with User-Defined Serving Granularity

Francisco Aguilera Moreno

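AI summary: This paper proposes a Mixed Integer Goal Programming (MIGP) method for personalized meal optimization that meets users' nutritional needs while avoiding impractical fractional servings. The method uses integer variables for actual serving units and goal programming for soft nutrient targets, with inverse-target normalization to balance optimization across multiple nutrients. Experiments show that, while maintaining 100% feasibility, MIGP finds better solutions than goal programming with post-hoc rounding in 66% of cases, and that it solves quickly enough for practical meal planning applications.

Comments 34 pages, 6 figures, open-source implementation

Abstract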

Determining what to eat to satisfy nutritional requirements is one of the oldest optimization problems in operations research, yet existing formulations have two persistent limitations: continuous variables produce impractical fractional servings (1.7 eggs, 0.37 bananas), and hard nutrient constraints cause infeasibility when targets conflict. A systematic review of 56 diet optimization papers found that none combine integer programming with goal programming to address both issues. We propose Mixed Integer Goal Programming (MIGP) for personalized meal optimization. The formulation uses integer variables for practical serving counts and goal programming deviations for soft nutrient targets, with inverse-target normalization to balance multi-nutrient optimization. Per-food serving granularity allows natural units (one egg, one tablespoon of oil) without post-hoc rounding. We characterize the integrality gap in the goal programming context and identify a deviation absorption property: GP deviation variables buffer the cost of requiring integer servings, making the gap structurally smaller than in hard-constraint MIP. For meals with 15+ foods, the integer solution matches the continuous optimum in every benchmark instance. A computational evaluation across 810 instances (30 USDA foods, 9 configurations, 3 methods) shows MIGP finds strictly better solutions than GP with post-hoc rounding in 66% of cases (never worse) while maintaining 100% feasibility; hard-constraint IP achieves only 48%. Solve times stay under 100 ms for typical meal sizes using the open-source HiGHS solver. The implementation is available as an open-source Python module integrated into an interactive meal planning application.
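
To make the formulation concrete, here is a toy sketch that minimizes the sum of target-normalized absolute deviations over integer serving counts. It swaps the paper's HiGHS-based mixed-integer solve for brute-force enumeration, which is only workable at this tiny size. The nutrient values, targets, and serving cap are made up (not USDA data), and reading "inverse-target normalization" as division by the target is an assumption.

```python
from itertools import product

# Hypothetical per-serving nutrients: (protein in g, energy in kcal).
FOODS = {
    "egg": (6.0, 70.0),
    "tbsp_oil": (0.0, 120.0),
    "banana": (1.0, 105.0),
}
TARGETS = (20.0, 500.0)  # soft goals: 20 g protein, 500 kcal
MAX_SERVINGS = 4         # assumed upper bound per food


def normalized_deviation(servings):
    """Sum over nutrients of |achieved - target| / target."""
    total = 0.0
    for j, target in enumerate(TARGETS):
        achieved = sum(n * FOODS[food][j] for food, n in zip(FOODS, servings))
        total += abs(achieved - target) / target
    return total


# Enumerate every whole-serving combination (stand-in for a MIP solver).
best = min(
    product(range(MAX_SERVINGS + 1), repeat=len(FOODS)),
    key=normalized_deviation,
)
print(dict(zip(FOODS, best)), round(normalized_deviation(best), 3))
```

Because the targets are soft, every combination is feasible and the objective simply absorbs the miss. A hard-constraint version demanding exactly 500 kcal would have no integer solution with these foods, which mirrors the infeasibility issue the abstract describes.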

2605.13848 2026-05-15 cs.AI cs.CL cs.DC

GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration

Yeahia Sarker, Md Rahmat Ullah, Musa Molla, Shafiq Joty

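AI summary: GraphBit is a graph-based agentic framework that aims to fix the hallucinated routing, infinite loops, and non-reproducibility common in prompt-orchestrated agent systems. It defines workflows explicitly as directed acyclic graphs (DAGs), with a Rust-based engine governing routing, state transitions, and tool invocation, which makes execution deterministic and auditable. Experiments show that GraphBit performs well on benchmark tasks, with higher accuracy, lower latency, and higher throughput than the compared frameworks.

Comments 12 pages, 5 figures, 4 tables. Submitted to arXiv, under review

Abstract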

Agentic LLM frameworks that rely on prompted orchestration, where the model itself determines workflow transitions, often suffer from hallucinated routing, infinite loops, and non-reproducible execution. We introduce GraphBit, an engine-orchestrated framework that defines workflows explicitly and deterministically as a directed acyclic graph (DAG). Unlike prompted orchestration, agents in GraphBit operate as typed functions, while a Rust-based engine governs routing, state transitions, and tool invocation, ensuring reproducibility and auditability. The engine supports parallel branch execution, conditional control flow over structured state predicates, and configurable error recovery. A three-tier memory architecture consisting of ephemeral scratch space, structured state, and external connectors isolates context across stages, preventing cascading context bloat that degrades reasoning in long-running pipelines. Across GAIA benchmark tasks spanning zero-tool, document-augmented, and web-enabled workflows, GraphBit outperforms six existing frameworks, achieving the highest accuracy (67.6 percent), zero framework-induced hallucinations, the lowest latency (11.9 ms overhead), and the highest throughput. Ablation studies demonstrate that each memory tier contributes measurably to performance, with deterministic execution providing the greatest gains on tool-intensive tasks representative of real-world deployments.
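
The core idea of engine-side orchestration can be sketched in a few lines: the engine, not the model, decides execution order from an explicit DAG, and each agent is a plain function over structured state. This is a hypothetical Python illustration built on the standard-library `graphlib`; it is not GraphBit's API, and the real engine is described as Rust-based with parallel branches, conditional control flow, and error recovery that are omitted here.

```python
from graphlib import TopologicalSorter


def run_workflow(nodes, prerequisites, state):
    """Run each node once, in an order fixed by the DAG.

    nodes: name -> function(state) -> dict of state updates
    prerequisites: name -> set of node names that must run first
    """
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    order = list(TopologicalSorter(prerequisites).static_order())
    trace = []
    for name in order:
        state = {**state, **nodes[name](state)}
        trace.append(name)  # audit log of what ran, in order
    return state, trace


# Hypothetical three-stage pipeline; the lambdas stand in for LLM-backed agents.
nodes = {
    "fetch": lambda s: {"doc": s["query"].upper()},
    "summarize": lambda s: {"summary": s["doc"][:5]},
    "report": lambda s: {"report": f"{s['summary']}!"},
}
prerequisites = {"summarize": {"fetch"}, "report": {"summarize"}}
final_state, trace = run_workflow(nodes, prerequisites, {"query": "graph engines"})
```

Since routing never depends on model output in this sketch, a cycle is rejected before anything runs and the trace is identical across reruns, which is the reproducibility property the abstract highlights.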

2605.11907 2026-05-15 cs.LG

Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models

Igor Strozzi

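AI summary: This study evaluates procedural-skill supervised fine-tuning (SFT) on Qwen3.5 models from 0.8B to 4B parameters, using a holdout of 200 tasks and 40 skills, with Claude Haiku 4.5 as a frontier reference. It finds that the lift attributable to SFT is roughly uniform across model sizes, while the variation in post-SFT performance is dominated by a W-shaped pre-SFT base trajectory, indicating that SFT helps most where the base model struggles. The study also shows that earlier conclusions about "format-only learning" and "shrinking SFT contribution" were path-mismatch artifacts, and confirms the reliability of its results with a cross-family judge.

Abstract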

We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations. Main finding. Under matched-path LLM-only scoring, the SFT-attributable procedural-$Δ$ lift is roughly uniform across sizes: $+0.070 / +0.040 / +0.075$ at 0.8B / 2B / 4B. Variation in post-SFT $Δ$ ($-0.005$, $+0.100$, $+0.065$) is dominated by a W-shaped pre-SFT base trajectory ($-0.075$, $+0.060$, $-0.010$, Haiku-4-5 at $+0.030$): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure -- a regime-asymmetric pattern with a falsifiable prediction at 8B/14B. Methodology. (i) A bench format-compliance artifact: 83.5% of the holdout uses a deterministic ANSWER-line extractor that under-counts free-form-prose conclusions; our LLM-only re-judge reveals it was systematically biased against the curated condition. (ii) A negative-iteration sequence at 0.8B: three well-formed recipe variants cluster post-SFT curated pass-rate within a 2 pp band, constraining the absolute-pass-rate ceiling to base capacity rather than recipe. Cross-family judge validation. GPT-5.4 via OpenRouter on all 7 configurations (2800 paired episodes) agrees on the direction of every per-student finding: Cohen's $κ\geq 0.754$, agreement $\geq 93.25\%$, max headline $Δ$ shift $\leq 0.035$ pp. Two earlier framings -- "format-only learning at 0.8B" and "SFT contribution shrinks at 4B" -- were path-mismatch artifacts; this paper supersedes both. Single-seed evaluation; threats itemised in the paper.
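
The judge-validation numbers above combine raw agreement with Cohen's κ, which discounts the agreement two judges would reach by chance. A minimal sketch of the standard two-rater formula follows; the pass/fail verdicts are made up and unrelated to the paper's 2800 paired episodes.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if both raters labeled independently at their own base rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up verdicts from two judges on eight episodes (1 = pass, 0 = fail).
judge_a = [1, 1, 1, 0, 0, 0, 1, 0]
judge_b = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(judge_a, judge_b)
```

Here the judges agree on 75% of items but κ is only 0.5, which shows why a κ of at least 0.754 alongside 93.25% agreement is the stronger of the two claims.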

2605.10947 2026-05-15 cs.LG q-bio.NC

Interpretable EEG Microstate Discovery via Variational Deep Embedding: A Systematic Architecture Search with Multi-Quadrant Evaluation

Saheed Faremi, Andrea Visentin, Luca Longo

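AI summary: This study proposes a convolutional model based on variational deep embedding (Conv-VaDE) for interpretable EEG microstate discovery. By performing reconstruction and soft clustering in a shared latent space, the model supports generative decoding and probabilistic assignment of EEG microstates, improving transparency and interpretability. Through a systematic architecture search and multi-quadrant evaluation, the study shows how design parameters such as network depth and latent dimensionality affect the quality and stability of microstate representations, offering new methods and insights for interpretable EEG microstate analysis.

Abstract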

EEG microstate analysis segments continuous brain electrical activity into brief, quasi-stable topographic configurations that reflect discrete functional brain states. Conventional approaches such as Modified K-Means operate directly in electrode space with hard assignment, offering no learned latent representation, no generative decoder, and no mechanism to decode latent configurations into verifiable scalp topographies, limiting both model transparency and interpretability. To address this, we present a Convolutional Variational Deep Embedding (Conv-VaDE) model that jointly learns topographic reconstruction and probabilistic soft clustering in a shared latent space. Conv-VaDE enables generative decoding of cluster prototypes into verifiable scalp topographies, replacing opaque hard partitioning with probabilistic soft assignment. A polarity invariance scheme and a four-dimensional grid search over cluster count (K from 3 to 20), latent dimensionality, network depth, and channel width are conducted to systematically reveal how each architectural design choice shapes the quality, stability, and interpretability of learned EEG microstate representations. The model is evaluated on the LEMON resting-state eyes-closed EEG dataset with ten participants using topographic template formation, clustering stability, and global explained variance (GEV). The architecture search reveals that depth L = 4 appears consistently across all 18 best-performing configurations, yielding a best-case GEV of 0.730 and a silhouette of 0.229 at K = 4 across the model sweeps, where moderately deep networks with compact channel widths and small latent dimensionality dominate across the full K range. These results establish that principled architecture search, rather than model scale, is the key to interpretable and stable EEG microstate discovery via variational deep embedding.

2605.10886 2026-05-15 cs.LG cs.AI

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

Liang Luo, Yinbin Ma, Quanyu Zhu, Vasiliy Kuznetsov, Yuxin Chen, Jian Jiao, Jiecao Yu, Buyun Zhang, Tongyi Tang, Xiaohan Wei, Yanli Zhao, Zeliang Chen, Yuchen Hao, Venkatesh Ranganathan, Sandeep Parab, Yantao Yao, Maxim Naumov, Chunzhi Yang, Shen Li, Ellie Wen, Wenlin Chen, Santanu Kolay, Chunqiang Tang

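AI summary: This paper presents LoKA, a framework for applying low-precision computation such as FP8 effectively to large recommendation models (LRMs). Because LRMs are numerically sensitive and trained in communication-intensive environments, LoKA pursues system-model co-design through three core principles: profiling under realistic distributions, jointly optimizing model and hardware, and orchestrating across kernel libraries. The framework has three components, LoKA Probe, LoKA Mods, and LoKA Dispatch, which respectively assess the impact of reduced precision, improve numerical stability and execution efficiency, and select the best FP8 kernel at runtime, improving training efficiency while preserving model quality.

Comments Accepted to ISCA'26

Abstract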

Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.

2605.09046 2026-05-15 cs.RO

Terminal Matters: Kinodynamic Planning with a Terminal Cost and Learned Uncertainty in Belief State-Cost Space

Zhuoyun Zhong, Seyedali Golestaneh, Constantinos Chamzas

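AI summary: In many real-world robotic tasks, a robot must generate dynamically feasible motions that reliably reach a goal under uncertainty. This paper proposes a terminal-cost formulation for motion planning that optimizes terminal-state quality together with accumulated trajectory cost, improving goal-reaching reliability and the handling of goal preferences. The formulation is extended to belief space, where minimizing the Wasserstein distance between the terminal belief and the goal improves a lower bound on the probability of reaching the goal region. Experiments show that the method improves goal-reaching success under uncertainty across several tasks.

Abstract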

In many real-world robotic tasks, robots must generate dynamically feasible motions that reliably reach desired goals even under uncertainty. Yet existing sampling-based kinodynamic planners typically optimize accumulated trajectory costs and treat goal reaching as a feasibility check, rather than explicitly optimizing terminal-state quality, such as goal preference or goal-reaching reliability. In this work, we introduce a terminal-cost formulation for kinodynamic planning that allows terminal-state quality to be optimized alongside accumulated trajectory cost. We prove that AO-RRT, an asymptotically optimal kinodynamic planner, preserves its asymptotic optimality under this augmented objective. We further extend the formulation to belief space and prove that minimizing the Wasserstein distance between the terminal belief and the goal improves a lower bound on the probability of reaching the goal region. The resulting planner, KiTe, uses this terminal-cost objective to encode goal preferences and improve reliability under uncertainty. To support systems without analytical uncertainty models, we learn dynamics and process uncertainty directly from data and integrate the learned belief dynamics into planning. Experiments on Flappy Bird, Car Parking, and Planar Pushing show that KiTe consistently improves goal-reaching success under uncertainty. Real-world Planar Pushing experiments further demonstrate that KiTe can plan effectively with learned dynamics and uncertainty. Source code is available at https://github.com/elpis-lab/KiTe.
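
The abstract does not state the bound itself. As an illustration of why a smaller Wasserstein distance helps, consider the simplest case of a point goal g and a ball-shaped goal region of radius r: the squared 2-Wasserstein distance from the terminal belief to the point mass at g equals E‖x − g‖² = ‖m − g‖² + tr(Σ), and Markov's inequality then gives P(‖x − g‖ < r) ≥ 1 − W₂²/r². The sketch below computes this; it is an assumption-laden stand-in and may differ from the paper's actual result, and the function names and numbers are hypothetical.

```python
def w2_squared_to_point(mean, cov_diag, goal):
    """Squared 2-Wasserstein distance from a belief to a point mass at `goal`.

    Only the belief's mean and the diagonal of its covariance are needed.
    """
    return sum((m - g) ** 2 for m, g in zip(mean, goal)) + sum(cov_diag)


def reach_probability_lower_bound(mean, cov_diag, goal, radius):
    """Markov-inequality lower bound on P(||x - goal|| < radius)."""
    return max(0.0, 1.0 - w2_squared_to_point(mean, cov_diag, goal) / radius ** 2)


goal = (1.0, 1.0)
# Made-up terminal beliefs: one tight and near the goal, one offset and diffuse.
tight = reach_probability_lower_bound((0.9, 1.0), (0.01, 0.02), goal, radius=0.5)
loose = reach_probability_lower_bound((0.7, 1.0), (0.05, 0.05), goal, radius=0.5)
```

The terminal cost in this reading penalizes both mean offset and residual covariance, so a planner that lowers it raises the guaranteed success probability even when it cannot shorten the path.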

2605.08715 2026-05-15 cs.CL cs.AI cs.MA

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

Boxuan Zhang, Jianing Zhu, Zeru Shi, Dongfang Liu, Ruixiang Tang

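AI summary: In multi-agent systems a single error can cause the whole task trajectory to fail, yet existing work focuses on post-hoc attribution and cannot intervene while the task is still running. This paper proposes AgentForesight, which recasts the problem as online auditing: at each step the auditor decides, from the current trajectory prefix alone, whether to continue or raise an alarm, enabling early error prediction. The authors build the AFTraj-2K dataset and train AgentForesight-7B, which clearly outperforms leading models on several benchmarks, with higher detection accuracy and lower localization error, making real-time intervention possible.

Comments 33 pages, 7 figures

Abstract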

LLM-based multi-agent systems are increasingly deployed on long-horizon tasks, but a single decisive error is often accepted by downstream agents and cascades into trajectory-level failure. Existing work frames this as post-hoc failure attribution, diagnosing the responsible agent and step after the trajectory has ended. However, this paradigm forfeits any opportunity to intervene while the trajectory is still unfolding. In this work, we introduce AgentForesight, a framework that reframes this problem as online auditing: at each step of an unfolding trajectory, an auditor observes only the current prefix and must either continue the run or alarm at the earliest decisive error, without access to future steps. To this end, we curate AFTraj-2K, a corpus of agentic trajectories across Coding, Math, and Agentic domains, in which safe trajectories are retained under a strict curation pipeline and unsafe trajectories are annotated at the step of their decisive error via consensus among multiple LLM judges. Built on that, we develop AgentForesight-7B, a compact online auditor trained with a coarse-to-fine reinforcement learning recipe that first equips it with a risk-anticipation prior at the failure boundary on adjacent safe/unsafe prefix pairs, then sharpens this prior into precise step-level localization under a three-axis reward jointly targeting the what, where, and who of an audit verdict. Across AFTraj-2K and an external Who&When benchmark, AgentForesight-7B outperforms leading proprietary models, including GPT-4.1 and DeepSeek-V4-Pro, achieving up to +19.9% performance gain and 3$\times$ lower step localization error, shifting from post-hoc failure detection toward deployment-time intervention. Project page: https://zbox1005.github.io/agent-foresight/

2605.07931 2026-05-15 cs.CV cs.AI

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

Zuojin Tang, Shengchao Yuan, Xiaoxin Bai, Zhiyuan Jing, De Ma, Gang Pan, Bin Liu

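AI summary This paper studies how to parameterize the world-model module in vision-language-action (VLA) models and proposes OneWM-VLA, which uses adaptive attention pooling to compress each frame's visual information into a single semantic token, sharply reducing visual bandwidth. The method generates the latent visual stream and the action trajectory under a single flow-matching objective, with no separate decoder. Experiments show that it preserves long-horizon performance while markedly improving success rates on several complex tasks.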

Details
English abstract

Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $π_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO-Long (vs. 85.2% for $π_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs. 20.0% for $π_0$).
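As a rough illustration of compressing a frame's patch features into one token, the sketch below uses generic single-query attention pooling. The abstract does not specify how Adaptive Attention Pooling is built, so the function name `attention_pool`, the scaled dot-product scoring, and the absence of key/value projections are all assumptions made for this example.

```python
import math
from typing import List, Tuple


def attention_pool(
    patches: List[List[float]], query: List[float]
) -> Tuple[List[float], List[float]]:
    """Pool patch vectors into one vector with softmax(query . patch / sqrt(d)) weights."""
    d = len(query)
    scores = [
        sum(q * p for q, p in zip(query, patch)) / math.sqrt(d) for patch in patches
    ]
    top = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * patch[j] for w, patch in zip(weights, patches)) for j in range(d)
    ]
    return pooled, weights


# Three 2-dimensional patch features and a (hypothetical) learned query vector.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
token, weights = attention_pool(patches, query)
```

Whatever the number of patches, the output `token` has the same dimension as a single patch, which is the sense in which per-frame bandwidth drops to one token.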

2605.06563 2026-05-15 cs.LG hep-th

Criticality and Saturation in Orthogonal Neural Networks

Max Guillen, Jan E. Gerken

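AI summary This paper studies criticality and saturation in orthogonally initialized neural networks as depth increases. It derives explicit layer-wise recursion relations for the tensors involved, revealing the mechanism by which network statistics remain stable under orthogonal initialization. By extending the Feynman-diagram method, the authors establish recursion formulas at arbitrary order in the width and verify that the theory accurately accounts for the stability of finite-width networks whose activation functions have a vanishing fixed point, filling a theoretical gap in the field.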

Comments 11 pages + Appendices

Details
English abstract

It has been known for a long time that initializing weight matrices to be orthogonal instead of having i.i.d. Gaussian components can improve training performance. This phenomenon can be analyzed using finite-width corrections, where the infinite-width statistics are supplemented by a power series in $1/\mathrm{width}$. In particular, recent empirical results by Day et al. show that the tensors appearing in this treatment stabilize for large depth, as opposed to the tensors of i.i.d.-initialized networks. In this article, we derive explicit layer-wise recursion relations for the tensors appearing in the finite-width expansion of the network statistics in the case of orthogonal initializations. We also provide an extension of recently-introduced Feynman diagrams for the corresponding recursions in the i.i.d.-case which are valid to all orders in $1/\mathrm{width}$. Finally, we show explicitly that the recursions we derive reproduce the stability of the finite-width tensors which was observed for activation functions with vanishing fixed point. This work therefore provides a theoretical explanation for the stability of nonlinear networks of finite width initialized with orthogonal weights, closing a long-standing gap in the literature. We validate our theoretical results experimentally by showing that numerical solutions of our recursion relations and their analytical large-depth expansions agree excellently with Monte-Carlo estimates from network ensembles.
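For readers unfamiliar with orthogonal initialization, the following sketch builds a square orthogonal weight matrix by applying Gram-Schmidt orthonormalization to Gaussian rows. It only illustrates the kind of initialization the abstract refers to; the function name `orthogonal_matrix` and the pure-Python construction are choices made for this example, not the authors' code, and nothing here reproduces their recursion relations.

```python
import random
from typing import List


def orthogonal_matrix(n: int, seed: int = 0) -> List[List[float]]:
    """Return an n x n matrix with orthonormal rows, built from Gaussian samples."""
    rng = random.Random(seed)
    rows: List[List[float]] = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in rows:                      # remove components along earlier rows
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:                     # resample if the draw was (nearly) dependent
            rows.append([a / norm for a in v])
    return rows


def matvec(W: List[List[float]], x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


W = orthogonal_matrix(6)
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
y = matvec(W, x)  # y has the same Euclidean norm as x
```

The norm preservation shown by `y` is the basic property that separates orthogonal weights from i.i.d. Gaussian weights, whose layer maps only preserve norms on average.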

2605.01847 2026-05-15 cs.AI

NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

Xiao Jia

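AI summary NeuroState-Bench is a human-calibrated benchmark for evaluating whether large language model agents preserve commitment integrity in multi-turn tasks. It measures commitment integrity through explicitly defined side-query probes rather than inferred hidden activations, and contains 144 deterministic tasks and 306 probes covering multiple cognitive failure types and difficulty levels. Experiments show that task success and commitment integrity diverge substantially and that commitment-integrity rankings are more stable under distractor conditions, demonstrating the benchmark's usefulness for evaluating the consistency of model behavior.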

Comments 30 pages, 11 figures

Details
English abstract

Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment integrity through benchmark-defined side-query probes rather than inferred hidden activations. The released inventory contains 144 deterministic tasks and 306 benchmark-defined side-query probes spanning eight cognitively motivated failure families, paired clean and distractor variants, and three difficulty bands. The main 32-profile evaluation contains a fixed 16-profile local subset and a matched 16-profile hosted large-model subset evaluated through the same benchmark pipeline. Human calibration uses the final merged reporting scope: 104 sampled task units, 216 raw annotations, and 108 adjudicated task rows, with weighted kappa = 0.977 and ICC(2,1) = 0.977. Empirically, task success and commitment integrity diverge across this expanded grid: the success leader is not the integrity leader, 31 of 32 profiles change rank when integrity replaces task success, and integrity rankings are more stable under distractor perturbation. The primary confidence-free score HCCIS-CORE reaches 0.8469 AUC and 0.6992 PR-AUC for post-probe diagnostic discrimination of terminal task failure; the legacy full heuristic variant HCCIS-FULL reaches 0.7997 AUC and 0.6410 PR-AUC. Probe accuracy and state drift achieve slightly higher ROC-AUC, 0.8587, and better Brier/ECE, while HCCIS-CORE has substantially higher point-estimate PR-AUC and remains more closely tied to the benchmark's intended construct. The exploratory neural-augmented variant HCCIS+N is weaker overall, and a randomized subspace control approaches chance. NeuroState-Bench therefore contributes a calibrated evaluation axis for exposing commitment failures over a broader model grid than the original local-only subset.

2604.25284 2026-05-15 cs.RO

Optimal UGV-UAV Cooperative Partitioning and Inspection of Shortest Paths

Ninh Nguyen, Srinivas Akella

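AI summary This paper studies shortest-path planning for an unmanned ground vehicle (UGV) cooperating with an unmanned aerial vehicle (UAV) in environments with unknown road blockages. The problem extends the classical Canadian Traveller Problem (CTP) by accounting for UAV assistance. By analyzing different path structures and speed ratios, the authors propose an optimal path-partitioning strategy, prove the advantage of having the UAV inspect path suffixes, and show on real urban road networks that the method can reduce UGV travel time by up to 30%.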

Comments Withdrawn by the authors due to an error in Section V.D in the competitive-ratio proof for the UGV-UAV case. The proof incorrectly uses $1+2\frac{v_A}{v_G+v_A}(k-1)\le 2\frac{v_A}{v_G+v_A}k-1$, which does not hold in general and affects the stated bound

Details
English abstract

We study cooperative shortest path planning for an unmanned ground vehicle (UGV) assisted by an unmanned aerial vehicle (UAV) in environments with unknown road blockages that are only discovered when a robot reaches the damaged point. This formulation generalizes the original Canadian Traveller Problem (CTP), which assumes a single ground vehicle and that the traversability status of all incident edges is revealed upon arrival at a vertex. We first analyze the case where the start and the goal are connected by $k$ disjoint paths, and prove that the worst-case competitive ratio $ρ$ for a single UGV is $2k-1$. With UAV assistance, and under the simplifying assumption of negligible initial transit and deadheading UAV costs, the ratio improves to $ρ= 2\frac{v_G}{v_A + v_G}k - 1$, where $v_G$ and $v_A$ denote the UGV and UAV speed, respectively. To address general graphs and non-negligible UAV initial transit and deadheading costs, we present an optimal path partitioning strategy that assigns path prefix inspection to the UGV and path suffix inspection to the UAV, and prove the optimality of the UAV inspection strategy on general graphs. We evaluate our algorithm by performing experiments on road networks from the world's 50 most populous cities, with randomized blockages, and show that the proposed method reduces UGV travel times by up to 30%.

2604.16813 2026-05-15 cs.AI cs.CL cs.DB

PersonalHomeBench: Evaluating Agents in Personalized Smart Homes

Manasa Bharadwaj, Yolanda Liu, InJung Yang, Sungil Kim, Nikhil Verma, KoKeun Kim, Kevin Ferreira, YoungJoon Kim

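AI summary This paper presents PersonalHomeBench, a benchmark for evaluating foundation models as agents in personalized smart home environments. The benchmark iteratively builds rich household states to generate personalized, context-dependent tasks, and provides the PersonalHomeTools toolbox to support realistic interaction with the environment. Experiments show that agent performance degrades systematically as task complexity increases, with particular weakness in counterfactual reasoning and partially observable scenarios, underscoring the benchmark's value and rigor for analyzing the reasoning and planning abilities of personalized agents.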

Comments Please use and cite the V3 version of this work, which includes updated correct author ordering and expanded error analysis in the appendix

Details
English abstract

Agentic AI systems are rapidly advancing toward real-world applications, yet their readiness in complex and personalized environments remains insufficiently characterized. To address this gap, we introduce PersonalHomeBench, a benchmark for evaluating foundation models as agentic assistants in personalized smart home environments. The benchmark is constructed through an iterative process that progressively builds rich household states, which are then used to generate personalized, context-dependent tasks. To support realistic agent-environment interaction, we provide PersonalHomeTools, a comprehensive toolbox enabling household information retrieval, appliance control, and situational understanding. PersonalHomeBench evaluates both reactive and proactive agentic abilities under unimodal and multimodal observations. Thorough experimentation reveals a systematic performance reduction as task complexity increases, with pronounced failures in counterfactual reasoning and under partial observability, where effective tool-based information gathering is required. These results position PersonalHomeBench as a rigorous evaluation platform for analyzing the robustness and limitations of personalized agentic reasoning and planning.

2604.05306 2026-05-15 cs.LG cs.AI cs.CL

LLMs Should Express Uncertainty Explicitly

Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei

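AI summary This paper examines how post-training can make large language models (LLMs) express their uncertainty explicitly in their responses, reducing overconfident but wrong answers. It proposes two approaches: having the model produce a confidence score at the end of reasoning, and inserting an uncertainty marker during reasoning. Experiments show that both reduce error rates and improve answer quality, and both can be used to strengthen retrieval-augmented generation (RAG). The study also analyzes how each approach affects the model's internals, revealing the mechanisms by which they improve the model's judgment at different levels.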

Details
English abstract

Large language models (LLMs) often produce confident yet incorrect answers, which can lead to risky failures in real-world applications. We study whether post-training can make a model's self-assessment explicit: when the model is uncertain, can it be trained to signal so within its own response? A central design question is where in the response this signal should be exposed -- during reasoning, while the answer is still being formed, or at the end, once the answer has been produced. We study both. For end-of-reasoning self-assessment, we train the model to verbalize a confidence score for its response, with the aim of high confidence on correct answers and low confidence on incorrect ones. For during-reasoning self-assessment, we train the model to emit the marker <uncertain> whenever its current reasoning state appears unreliable. Across factual reasoning tasks, both forms sharply reduce overconfident errors while improving answer quality, and both can be used as triggers for retrieval augmented generation (RAG) to improve the final response. We further analyze their internal mechanisms: end-of-reasoning verbalized confidence sharpens a confidence-related structure already present in the pretrained model, whereas during-reasoning <uncertain> emission teaches the model to mark high-risk reasoning steps, with parameter changes concentrated in the model's late layers.
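The idea of using either signal as a trigger for retrieval can be sketched as follows. Only the `<uncertain>` marker comes from the abstract; the function names, the 0.5 confidence threshold, and the `generate`/`retrieve` callables are hypothetical stand-ins, with toy stubs in place of a real model and retriever.

```python
from typing import Callable, Optional, Tuple

UNCERTAIN_MARKER = "<uncertain>"  # marker named in the abstract


def needs_retrieval(
    response: str, confidence: Optional[float] = None, min_confidence: float = 0.5
) -> bool:
    """Trigger retrieval on an in-reasoning marker or a low verbalized confidence."""
    if UNCERTAIN_MARKER in response:
        return True
    return confidence is not None and confidence < min_confidence


def answer_with_optional_rag(
    question: str,
    generate: Callable[[str, Optional[str]], Tuple[str, Optional[float]]],
    retrieve: Callable[[str], str],
) -> str:
    draft, confidence = generate(question, None)
    if not needs_retrieval(draft, confidence):
        return draft                        # the model was confident: keep the draft
    context = retrieve(question)            # otherwise fetch evidence and retry
    final, _ = generate(question, context)
    return final


# Toy stubs standing in for a model and a retriever.
def toy_generate(question: str, context: Optional[str]):
    if context is None:
        return "I think it is 1912 <uncertain>", None
    return f"Based on the source: {context}", 0.9


def toy_retrieve(question: str) -> str:
    return "1911"


answer = answer_with_optional_rag("When did the event happen?", toy_generate, toy_retrieve)
```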

2603.11045 2026-05-15 cs.LG cond-mat.mtrl-sci cs.AI cs.CV physics.ins-det

Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation

Tao Zhong, Yixun Hu, Dongzhe Zheng, Aditya Sood, Christine Allen-Blanchette

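AI summary This paper proposes NeFTY, a neural field thermal tomography method for label-free three-dimensional inverse heat conduction. It represents diffusivity as a continuous coordinate-based neural network and runs a differentiable implicit-Euler heat solver at every optimization step, so that the governing equation holds exactly on the discretization rather than as a soft constraint. Experiments show that NeFTY substantially outperforms soft-constrained physics-informed neural networks and a voxel-grid baseline on synthetic 3D benchmarks, and that on real thermography data it surpasses classical signal-processing baselines in defect segmentation and depth estimation.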

Comments 37 pages, 19 figures

Details
English abstract

Inverse problems for stiff parabolic partial differential equations (PDEs), such as the inverse heat conduction problem (IHCP), are severely ill-posed: the forward map rapidly damps high-frequency interior structure before it reaches the boundary. Soft-constrained physics-informed neural networks (PINNs), which embed the PDE as a residual penalty, suffer from gradient pathology in this regime and tend to fit boundary measurements while leaving the interior field essentially untouched. We propose Neural Field Thermal Tomography (NeFTY), a hard-constrained neural field framework for label-free three-dimensional inverse heat conduction. NeFTY represents the unknown diffusivity as a continuous coordinate-based neural network, and at every optimization step passes the candidate field through a differentiable implicit-Euler heat solver with harmonic-mean interface flux, so that the governing PDE holds exactly on the discretization rather than as a soft penalty. Adjoint gradients propagate the surface reconstruction error back to the network weights at solver-level memory cost, making test-time inversion tractable on a single GPU. Across synthetic 3D benchmarks, NeFTY substantially outperforms soft-constrained PINN variants and a voxel-grid baseline on label-free volumetric recovery, and it transfers to real thermography data, surpassing classical signal-processing baselines in both defect segmentation and depth estimation. Additional details at https://cab-lab-princeton.github.io/nefty/
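To make the solver ingredients concrete, here is a one-dimensional sketch of a single implicit-Euler heat step with harmonic-mean diffusivity at cell interfaces. It is a simplified stand-in for the 3D differentiable solver the abstract describes: the 1D grid, zero-flux boundaries, unit heat capacity, and the Thomas tridiagonal solve are assumptions of this example, and no automatic differentiation or adjoint is included.

```python
from typing import List


def harmonic_mean(a: float, b: float) -> float:
    return 2.0 * a * b / (a + b)


def implicit_euler_step(
    T: List[float], alpha: List[float], dt: float, dx: float
) -> List[float]:
    """Advance cell temperatures T by one implicit-Euler step (zero-flux boundaries)."""
    n = len(T)
    # r[i] couples cells i and i+1 through the harmonic-mean interface diffusivity.
    r = [dt * harmonic_mean(alpha[i], alpha[i + 1]) / dx**2 for i in range(n - 1)]
    lower, diag, upper = [0.0] * n, [1.0] * n, [0.0] * n
    for i in range(n):
        if i > 0:
            lower[i] = -r[i - 1]
            diag[i] += r[i - 1]
        if i < n - 1:
            upper[i] = -r[i]
            diag[i] += r[i]
    # Thomas algorithm for the tridiagonal system A @ T_new = T.
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], T[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (T[i] - lower[i] * d[i - 1]) / denom
    out = [0.0] * n
    out[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        out[i] = d[i] - c[i] * out[i + 1]
    return out


T0 = [0.0, 0.0, 1.0, 0.0, 0.0]
alpha = [1.0, 1.0, 1.0, 0.1, 0.1]   # a low-diffusivity region on the right
T1 = implicit_euler_step(T0, alpha, dt=0.1, dx=1.0)
```

Because the discrete equation is solved exactly at each step, total heat is conserved under the zero-flux boundaries, which is the sense in which the physics acts as a hard constraint rather than a penalty term.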

2603.03577 2026-05-15 cs.CV cs.RO

From Local Matches to Global Masks: Template-Guided Instance Detection and Segmentation in Open-World Scenes

Qifan Zhang, Sai Haneesh Allu, Jikai Wang, Yangxiao Lu, Yu Xiang

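AI summary This paper addresses detecting and segmenting novel object instances in open-world scenes from a small number of template images. It proposes L2G-Det, a local-to-global detection framework that generates candidate points through dense patch-level matching between templates and the query image and combines them with an augmented segmentation model for precise instance segmentation. The method avoids reliance on conventional proposal mechanisms and improves detection and segmentation under occlusion and background clutter.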

Comments Accepted to Robotics: Science and Systems (RSS) 2026. Project page: https://irvlutd.github.io/L2G/

Details
English abstract

Detecting and segmenting novel object instances in open-world environments is a fundamental problem in robotic perception. Given only a small set of template images, a robot must locate and segment a specific object instance in a cluttered, previously unseen scene. Existing proposal-based approaches are highly sensitive to proposal quality and often fail under occlusion and background clutter. We propose L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks. Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings.

2603.02115 2026-05-15 cs.RO cs.AI cs.LG

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, Jesse Zhang

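AI summary This paper proposes Robometer, a scalable framework that scales general-purpose robot reward models through trajectory comparisons. The method combines intra-trajectory progress supervision with inter-trajectory preference supervision and trains with a dual objective: a frame-level progress loss on expert data anchors reward magnitude, while a trajectory-comparison preference loss imposes global ordering constraints across trajectories of a task, enabling effective learning of reward functions from both real and augmented failed trajectories. To support the approach at scale, the authors build RBM-1M, a dataset of over one million trajectories. Experiments show that Robometer generalizes better and improves learning across multiple benchmarks and real-world applications.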

Comments 33 pages, 17 figures

Details
Journal ref
RSS 2026
English abstract

General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.
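The dual objective can be sketched under stated assumptions. The abstract does not give the loss formulas, so the mean-squared-error progress term, the Bradley-Terry style preference term, and the `weight` parameter below are illustrative choices, not the Robometer losses.

```python
import math
from typing import List


def progress_loss(predicted: List[float], target: List[float]) -> float:
    """Frame-level term (assumed MSE) between predicted and labeled progress."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def preference_loss(score_preferred: float, score_other: float) -> float:
    """Trajectory-level term (assumed Bradley-Terry): -log sigmoid(s_pref - s_other)."""
    diff = score_preferred - score_other
    # Numerically stable softplus(-diff).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def dual_objective(
    predicted: List[float],
    target: List[float],
    score_preferred: float,
    score_other: float,
    weight: float = 1.0,  # hypothetical trade-off between the two terms
) -> float:
    return progress_loss(predicted, target) + weight * preference_loss(
        score_preferred, score_other
    )


# Expert frames carry progress labels; a failed rollout only enters via the comparison.
loss = dual_objective(
    predicted=[0.1, 0.4, 0.9],
    target=[0.0, 0.5, 1.0],
    score_preferred=2.0,   # score of the successful trajectory
    score_other=-1.0,      # score of the failed trajectory
)
```

The point of the split is visible in the call: failed or suboptimal trajectories need no dense progress labels, only an ordering relative to another trajectory of the same task.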

2602.21302 2026-05-15 cs.RO

Learning Dynamic Rope Manipulation Using Task-Level Iterative Learning Control

Krishna Suresh, Chris Atkeson

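AI summary This paper proposes a task-level iterative learning control method for dynamic rope manipulation, demonstrated on a non-planar rope task called the "flying knot". The method needs only a single human demonstration and a simplified rope model and learns directly on hardware, without large amounts of demonstration data or simulation. At each iteration it solves a quadratic program that converts task-space errors into action updates, thereby inverting the robot and rope model. Experiments show a 100% success rate on 7 ropes of different materials and sizes, and transfer between most rope types within 2 to 5 trials.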

Comments Project website: https://flying-knots.github.io

Details
English abstract

We introduce a Task-Level Iterative Learning Control method for dynamic manipulation of ropes. We demonstrate this method on a non-planar rope manipulation task called the flying knot. Using a single human demonstration and a simplified rope model, the method learns directly on hardware without reliance on large amounts of demonstration data or massive amounts of simulation. At each iteration, the algorithm inverts a model of the robot and rope by solving a quadratic program to propagate task-space errors into action updates. We evaluate performance across 7 different kinds of ropes, including chain, latex surgical tubing, and braided and twisted ropes, ranging in thickness from 7 to 25 mm and in density from 0.013 to 0.5 kg/m. Learning achieves a 100% success rate within 10 trials on all ropes. Furthermore, the method can successfully transfer between most rope types in 2 to 5 trials. https://flying-knots.github.io
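The iteration described here (run a trial, measure the task-space error, invert an approximate model to update the action) can be shown on a toy scalar system. This is not the authors' rope model or their quadratic program: the scalar gains, the regularizer `reg`, and the closed-form update are assumptions made so the example fits in a few lines.

```python
from typing import List, Tuple


def ilc_update(u: float, error: float, model_gain: float, reg: float = 1e-3) -> float:
    """One update: closed-form minimizer of (error - model_gain*du)^2 + reg*du^2."""
    du = model_gain * error / (model_gain**2 + reg)
    return u + du


def run_trials(
    target: float, true_gain: float, model_gain: float, trials: int = 10
) -> Tuple[float, List[float]]:
    """Repeat trials on the 'hardware' (true_gain) while planning with model_gain."""
    u = 0.0
    errors: List[float] = []
    for _ in range(trials):
        outcome = true_gain * u          # execute the action on the real system
        error = target - outcome         # task-space error observed after the trial
        errors.append(abs(error))
        u = ilc_update(u, error, model_gain)
    return u, errors


# The model underestimates the true gain by a third, yet the error still shrinks.
u_final, errors = run_trials(target=1.0, true_gain=1.5, model_gain=1.0)
```

The model only has to be accurate enough for each correction to point in a useful direction; the repeated trials on the real system absorb the remaining mismatch.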

2602.19532 2026-05-15 cs.RO cs.SY eess.SY

Bellman Value Decomposition for Task Logic in Safe Optimal Control

William Sharpless, Oswin So, Dylan Hirsch, Sylvia Herbert, Chuchu Fan

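AI summary This work addresses the complex combinations of goal and safety specifications in high-dimensional safe optimal control and proposes a method based on Bellman Value decomposition. It decomposes the Bellman Value of a complex task into a graph connected by Reach-Avoid, Avoid, and novel Reach-Avoid-Loop Bellman equations, which naturally organizes the task logic. The authors further propose VDPPO, which embeds the decomposed Value graph into a two-layer neural network and automatically handles the implicit dependencies, and validate the method in several high-dimensional simulation and hardware experiments, where it markedly improves the balance between safety and liveness.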

Details
English abstract

Real-world tasks involve nuanced combinations of goal and safety specifications. In high dimensions, the challenge is exacerbated: formal automata become cumbersome, and the combination of sparse rewards tends to require laborious tuning. In this work, we consider the innate structure of the Bellman Value as a means to naturally organize the problem for improved automatic performance. Namely, we prove the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of well-known Bellman equations (BEs): the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE. To solve the Value and optimal policy, we propose VDPPO, which embeds the decomposed Value graph into a two-layer neural net, bootstrapping the implicit dependencies. We conduct a variety of simulated and hardware experiments to test our method on complex, high-dimensional tasks involving heterogeneous teams and nonlinear dynamics. Ultimately, we find this approach greatly improves performance over existing baselines, balancing safety and liveness automatically.

2602.13483 2026-05-15 cs.LG cs.AI

Finding Interpretable Prompt-Specific Circuits in Language Models

Gabriel Franco, Lucas M. Tassis, Azalea Rohr, Mark Crovella

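AI summary This paper studies the internal circuits that language models use to perform tasks, focusing on why attention heads attend to particular token pairs. The authors propose ACC++, an improved circuit-tracing method based on attention-causal communication, which extracts causally relevant circuit components and their low-dimensional signals from a single forward pass, without replacement models or patching. Experiments show that the signals ACC++ identifies are interpretable across multiple language models and reveal the model's sensitivity to prompt structure and language differences, demonstrating the method's broad applicability for explaining model behavior.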

Details
English abstract

Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. A crucial part of finding circuits is understanding why each attention head attends where it does. To this end, we introduce ACC++, an improved circuit-tracing method based on the principle of attention-causal communication (ACC) [1], which identifies signals, i.e., contents of low dimensional subspaces that cause attention on a token pair. ACC++ extracts circuits from a single forward pass, without replacement models or patching. Circuits identified by ACC++ consist of components that are causal for the model's attention decisions, together with the low-dimensional signals used to communicate between them. Here, we first detail the conceptual advances that ACC++ makes over previous work. We then show that across multiple models, a substantial portion of ACC++ signals are interpretable: many signals admit a short natural-language description. We next present a number of new insights into model behavior obtained via ACC++. First, we use ACC++'s interpretable circuits to characterize the sensitivity of indirect object identification (IOI) circuits to prompt structure. We find that prompt-specific circuits form well-defined clusters, and across clusters, heads receive systematically different signals corresponding to distinct mechanisms for identifying the IO name. Next, in multilingual IOI, ACC++ circuits show that while model components are reused across languages, signals are often language-specific. In a four-language IOI case study, cross-language circuit distances are consistent with linguistic relatedness. Together, these results show that ACC++ can shed light on a broad spectrum of model behaviors.

2602.07519 2026-05-15 cs.LG

PALMS: A Computational Implementation for Pavlovian Associative Learning Models' Simulation

Martin Fixman, Alessandro Abati, Julián Jiménez Nimmo, Sean Lim, Esther Mondragón

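AI summary This paper introduces PALMS, a computational tool for simulating Pavlovian associative learning models in a Python environment. Beyond the classical Rescorla-Wagner model, it implements several attentional models and extensions, including Pearce-Kaye-Hall, Mackintosh Extended, and Le Pelley's Hybrid model, and introduces a unified variable learning rate that reconciles opposing theoretical views. PALMS provides a graphical interface, accepts complex experimental designs as input, and handles computations involving large numbers of stimuli and configural cues, substantially broadening the models' predictive capabilities and giving neuroscientists a useful tool for studying and refining experimental designs.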

Comments PALMS is licensed under the open-source GNU Lesser General Public License 3.0. The environment source code and the latest multiplatform release build are accessible as a GitHub repository at https://github.com/cal-r/PALMS-Simulator

Details
English abstract

In contrast to static formalisms, computational definitions describe the operational mechanisms of a model. Simulations are an essential part of the cycle of theory development and refinement, assisting researchers in formulating the precise definitions that models require, and making accurate predictions. This manuscript introduces a computational implementation of Pavlovian learning models in a Python environment, termed Pavlovian Associative Learning Models' Simulation (PALMS). In addition to the canonical Rescorla-Wagner model, attentional approaches are implemented, including Pearce-Kaye-Hall, Mackintosh Extended, Le Pelley's Hybrid, and a novel extension of the Rescorla-Wagner model featuring a unified variable learning rate that synthesises Mackintosh's and Pearce and Hall's opposing conceptualisations. To our knowledge, only the first attentional model has been previously specified computationally in a general design tool. PALMS integrates a graphical interface that permits the input of entire experimental designs in an alphanumeric format, akin to that used by experimental neuroscientists. It uniquely enables the simulation of experiments involving hundreds of stimuli, such as those used with human participants, and the computation of configural cues and configural-cue compounds across all models, thereby substantially broadening their predictive capabilities. A comprehensive description of the models' implementation is provided in the paper. We evaluate PALMS by simulating five published experiments in the associative learning literature that assessed the predictive scope of existing models, and we show that this implementation provides neuroscientists with a useful tool for identifying critical variables, refining experimental designs, making precise predictions, comparing model fitness, and formulating new theoretical approaches.
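As a point of reference for the canonical model mentioned in this abstract, the sketch below implements the standard Rescorla-Wagner update, in which each present cue changes by alpha * beta * (lambda - summed strength of present cues), and runs a two-phase blocking design. It is written independently of PALMS and does not use its API or input format; the function name, parameter values, and trial counts are illustrative.

```python
from typing import Dict, Iterable


def rescorla_wagner_trial(
    V: Dict[str, float],
    present: Iterable[str],
    lam: float,
    alpha: Dict[str, float],
    beta: float,
) -> None:
    """Update associative strengths in place for one trial."""
    cues = list(present)
    error = lam - sum(V[c] for c in cues)   # shared prediction error for the trial
    for c in cues:
        V[c] += alpha[c] * beta * error


V = {"A": 0.0, "B": 0.0}
alpha = {"A": 0.3, "B": 0.3}   # cue salience (illustrative values)
beta, lam = 1.0, 1.0           # outcome learning rate and asymptote

for _ in range(50):            # Phase 1: cue A alone is paired with the outcome
    rescorla_wagner_trial(V, ["A"], lam, alpha, beta)
for _ in range(50):            # Phase 2: compound AB is paired with the outcome
    rescorla_wagner_trial(V, ["A", "B"], lam, alpha, beta)
```

After Phase 1, cue A already predicts the outcome almost fully, so the shared error in Phase 2 is close to zero and cue B gains almost no strength, which is the blocking effect the model is known for.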

2602.05319 2026-05-15 cs.LG

Accelerated Sequential Flow Matching: A Bayesian Filtering Perspective

Yinan Huang, Hans Hao-Hsun Hsu, Junran Wang, Bo Dai, Pan Li

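AI summary This paper proposes Sequential Bayesian Flow Matching, a new framework for sequential probabilistic inference from real-time streaming data. Drawing on Bayesian filtering, it learns a probability flow that transports the posterior distribution from one time step to the next, enabling efficient modeling of predictive distributions. Compared with the conventional approach of repeatedly sampling from a non-informative initial distribution, it uses the previous belief as an informative source distribution, which markedly improves sampling efficiency. Across several scientific forecasting and decision-making tasks it performs comparably to full-step diffusion models while needing fewer sampling steps, substantially reducing inference latency.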

Details
English abstract

Sequential probabilistic inference from streaming observations requires modeling distributions over future trajectories as new observations arrive. Although diffusion and flow-matching models are effective at capturing high-dimensional, multimodal distributions, their deployment in real-time streaming settings typically relies on repeatedly sampling from a non-informative initial distribution. This results in substantial inference latency, particularly when multiple samples are needed to characterize the predictive distribution. In this work, we introduce Sequential Bayesian Flow Matching, a framework inspired by Bayesian filtering. By learning a probability flow that transports the posterior distribution from one time step to the next time step conditioned on new observations, it mirrors the recursive structure of Bayesian belief updates. Crucially, by using the previous belief as an informative source distribution, it enables substantially faster sampling than naive resampling from scratch. Across scientific forecasting tasks spanning accelerator beam spill dynamics, fluid dynamics, and weather forecasting, as well as decision-making benchmarks, our method achieves performance competitive with full-step diffusion on distributional metrics while using far fewer sampling steps, substantially reducing inference latency. Our code is available at https://github.com/Graph-COM/Sequential_Flow_Matching.