arXivDaily: daily arXiv digest, updated Monday through Friday
2504.07738 2026-05-15 cs.CL

Automated Construction of a Knowledge Graph of Nuclear Fusion Energy for Effective Elicitation and Retrieval of Information

Andrea Loreti, Kesi Chen, Ruby George, Robert Firth, Adriano Agnello, Shinnosuke Tanaka

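Affiliations: UK Atomic Energy Authority, Culham Campus; STFC Hartree Centre; Sci-Tech Daresbury; IBM Research

AI summary: This paper presents a multi-step method for automatically constructing a knowledge graph of nuclear fusion energy, to organize and represent specialized knowledge from large document corpora. It focuses on using pre-trained large language models for automatic named entity recognition and entity resolution, and evaluates their performance against Zipf's law. The authors also develop a knowledge-graph-based retrieval-augmented generation system that uses multiple prompts to give contextually relevant answers to natural-language queries, including complex questions that require reasoning across entities.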

Abstract

In this document, we discuss a multi-step approach to automated construction of a knowledge graph, for structuring and representing domain-specific knowledge from large document corpora. We apply our method to build the first knowledge graph of nuclear fusion energy, a highly specialized field characterized by vast scope and heterogeneity. This is an ideal benchmark to test the key features of our pipeline, including automatic named entity recognition and entity resolution. We show how pre-trained large language models can be used to address these challenges and we evaluate their performance against Zipf's law, which characterizes human natural language. Additionally, we develop a knowledge-graph retrieval-augmented generation system that uses multiple prompts with large language models to provide contextually relevant answers to natural-language queries, including complex multi-hop questions requiring reasoning across interconnected entities.
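As a rough illustration of what evaluating extracted entities against Zipf's law can look like, the sketch below fits the slope of log-frequency versus log-rank for a list of entity mentions. The function name `zipf_slope` and the toy mention list are assumptions made for this example and are not taken from the authors' pipeline; a slope close to -1 is what Zipf's law predicts for natural language.

```python
import math
from collections import Counter

def zipf_slope(mentions):
    """Least-squares slope of log(frequency) versus log(rank)."""
    counts = sorted(Counter(mentions).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical entity mentions whose counts follow 60 / rank: 60, 30, 20, 15, 12.
toy_mentions = (
    ["tokamak"] * 60 + ["plasma"] * 30 + ["divertor"] * 20
    + ["stellarator"] * 15 + ["tritium"] * 12
)
slope = zipf_slope(toy_mentions)
print(f"fitted slope: {slope:.3f}")
```

A slope far from -1 on real extraction output would suggest that entity recognition or resolution is merging or splitting mentions in an unnatural way, which is one plausible reading of why such a check is informative.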

2502.17347 2026-05-15 cs.RO

SoFFT: Spatial Fourier Transform for Modeling Continuum Soft Robots

Daniele Caradonna, Diego Bianchi, Franco Angelini, Egidio Falotico

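Affiliations: The BioRobotics Institute, Scuola Superiore Sant'Anna, Pisa, Italy; Department of Excellence in Robotics

AI summary: This paper proposes SoFFT, a modeling approach based on the spatial Fourier transform for describing the deformation of continuum soft robots. It treats the robot's backbone as a signal in space and time and represents it compactly with the Fourier transform, reducing the degrees of freedom while preserving deformation accuracy. The approach unifies existing modeling strategies within Cosserat Rod Theory and enables a data-driven way to capture deformation experimentally; it is validated in numerical simulations and on a real-world prototype.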

Abstract

Continuum soft robots, composed of flexible materials, exhibit theoretically infinite degrees of freedom, enabling notable adaptability in unstructured environments. Cosserat Rod Theory has emerged as a prominent framework for modeling these robots efficiently, representing continuum soft robots as time-varying curves, known as backbones. In this work, we propose viewing the robot's backbone as a signal in space and time, applying the Fourier transform to describe its deformation compactly. This approach unifies existing modeling strategies within the Cosserat Rod Theory framework, offering insights into commonly used heuristic methods. Moreover, the Fourier transform enables the development of a data-driven methodology to experimentally capture the robot's deformation. The proposed approach is validated through numerical simulations and experiments on a real-world prototype, demonstrating a reduction in the degrees of freedom while preserving the accuracy of the deformation representation.
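The idea of describing a backbone compactly with a few Fourier modes can be sketched as follows. This is a toy under stated assumptions, not the paper's formulation: the backbone is reduced to a scalar curvature profile sampled uniformly along the arc length, the profile is treated as periodic, and the helper names `fourier_coeffs` and `reconstruct` are hypothetical.

```python
import math

def fourier_coeffs(samples, n_modes):
    """Mean value and (cosine, sine) coefficients of the first n_modes harmonics."""
    n = len(samples)
    a0 = sum(samples) / n
    modes = []
    for k in range(1, n_modes + 1):
        a_k = 2.0 / n * sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(samples))
        b_k = 2.0 / n * sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(samples))
        modes.append((a_k, b_k))
    return a0, modes

def reconstruct(a0, modes, n):
    """Evaluate the truncated Fourier series at the n original sample points."""
    values = []
    for i in range(n):
        v = a0
        for k, (a_k, b_k) in enumerate(modes, start=1):
            v += a_k * math.cos(2 * math.pi * k * i / n) + b_k * math.sin(2 * math.pi * k * i / n)
        values.append(v)
    return values

# Assumed curvature profile: 64 samples that contain only two harmonics.
n = 64
curvature = [
    1.0 + 0.5 * math.cos(2 * math.pi * i / n) + 0.2 * math.sin(2 * math.pi * 2 * i / n)
    for i in range(n)
]
a0, modes = fourier_coeffs(curvature, n_modes=2)
approx = reconstruct(a0, modes, n)
max_err = max(abs(a - b) for a, b in zip(curvature, approx))
print(f"{n} samples compressed to {1 + 2 * len(modes)} coefficients, max error {max_err:.2e}")
```

Here five coefficients reproduce all 64 samples, which mirrors the abstract's point about reducing degrees of freedom while preserving the deformation; real backbones would need more modes and a non-periodic treatment.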

2502.09198 2026-05-15 cs.LG

Understanding High-Dimensional Bayesian Optimization

Leonard Papenmeier, Matthias Poloczek, Luigi Nardi

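Affiliations: Department of Computer Science, Lund University, Lund, Sweden; Amazon

AI summary: This paper investigates why simple Bayesian optimization methods perform well on high-dimensional real-world tasks, seemingly contradicting earlier findings. It identifies challenges specific to high-dimensional Bayesian optimization and finds that vanishing gradients caused by Gaussian process initialization schemes are a major cause of failure. The authors show that maximum likelihood estimation of the GP length scales suffices for state-of-the-art performance and propose MSR, a simple MLE variant that achieves state-of-the-art results on a range of real-world applications.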

Comments 22 pages, 21 figures. Accepted to ICML 2025

Journal ref Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:47902-47923, 2025

Abstract

Recent work reported that simple Bayesian optimization (BO) methods perform well for high-dimensional real-world tasks, seemingly contradicting prior work and tribal knowledge. This paper investigates why. We identify underlying challenges that arise in high-dimensional BO and explain why recent methods succeed. Our empirical analysis shows that vanishing gradients caused by Gaussian process (GP) initialization schemes play a major role in the failures of high-dimensional Bayesian optimization (HDBO) and that methods that promote local search behaviors are better suited for the task. We find that maximum likelihood estimation (MLE) of GP length scales suffices for state-of-the-art performance. Based on this, we propose a simple variant of MLE called MSR that leverages these findings to achieve state-of-the-art performance on a comprehensive set of real-world applications. We present targeted experiments to illustrate and confirm our findings.

2502.08208 2026-05-15 cs.LG

Exploring Exploration in Bayesian Optimization

Leonard Papenmeier, Nuojin Cheng, Stephen Becker, Luigi Nardi

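Affiliations: Department of Computer Science, Lund University; Department of Applied Mathematics, University of Colorado Boulder; DBtune

AI summary: In Bayesian optimization, balancing exploration and exploitation is crucial to the performance of acquisition functions. This paper introduces two new measures, observation traveling salesman distance and observation entropy, to quantify how exploratory an acquisition function is. Using them, the authors analyze the exploration behavior of several well-known acquisition functions across a range of black-box problems, link exploration to empirical performance, and uncover new relationships among existing acquisition functions, providing a more systematic and principled basis for acquisition function design.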

Comments 28 pages, 34 figures

Journal ref Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:3388-3415, 2025

Abstract

A well-balanced exploration-exploitation trade-off is crucial for successful acquisition functions in Bayesian optimization. However, there is a lack of quantitative measures for exploration, making it difficult to analyze and compare different acquisition functions. This work introduces two novel approaches - observation traveling salesman distance and observation entropy - to quantify the exploration characteristics of acquisition functions based on their selected observations. Using these measures, we examine the explorative nature of several well-known acquisition functions across a diverse set of black-box problems, uncover links between exploration and empirical performance, and reveal new relationships among existing acquisition functions. Beyond enabling a deeper understanding of acquisition functions, these measures also provide a foundation for guiding their design in a more principled and systematic manner.
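To make the two measures more concrete, here is a minimal sketch of how exploration could be scored from the selected observations alone. The abstract does not give the exact definitions, so both functions are stand-ins: a greedy nearest-neighbor path length approximates a traveling-salesman distance, and a grid-histogram Shannon entropy approximates an observation entropy. All names and the toy points are assumptions.

```python
import math
from collections import Counter

def nearest_neighbor_path_length(points):
    """Length of a greedy nearest-neighbor path through all points (a cheap TSP heuristic)."""
    remaining = list(points[1:])
    current = points[0]
    total = 0.0
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        total += math.dist(current, nxt)
        remaining.remove(nxt)
        current = nxt
    return total

def histogram_entropy(points, bins_per_dim=4):
    """Shannon entropy (in nats) of points in [0, 1]^d binned on a regular grid."""
    cells = Counter(
        tuple(min(int(x * bins_per_dim), bins_per_dim - 1) for x in p) for p in points
    )
    n = len(points)
    return -sum(c / n * math.log(c / n) for c in cells.values())

# Hypothetical observation sets chosen by an exploitative and an explorative strategy.
clustered = [(0.50, 0.50), (0.51, 0.50), (0.50, 0.51), (0.51, 0.51)]
spread = [(0.1, 0.1), (0.9, 0.1), (0.9, 0.9), (0.1, 0.9)]

for name, obs in (("clustered", clustered), ("spread", spread)):
    print(name, nearest_neighbor_path_length(obs), histogram_entropy(obs))
```

Both scores are larger for the spread-out observations, which is the qualitative behavior one would expect from any measure of exploration.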

2409.10038 2026-05-15 cs.CL cs.AI cs.LG

On the Diagram of Thought

Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao

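Affiliations: IIIS Tsinghua University; Shanghai Qi Zhi Institute

AI summary: Large language models (LLMs) excel at many tasks but struggle with complex problems that require structured, multi-step reasoning. This paper proposes the Diagram of Thought (DoT), a framework in which a single LLM builds and navigates a map of its own reasoning: by constructing a dynamic diagram of ideas, the model can propose different lines of thought, critique its own steps, and synthesize validated insights into a final conclusion. The approach needs no external search algorithm or planner, relying only on a deterministic online validator, and is grounded in a category-theoretic framework that provides an auditable step-by-step trace and semantic guarantees for the typed part of the LLM's reasoning.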

Comments 30 pages

Abstract

Large Language Models (LLMs) excel at many tasks but often falter on complex problems that require structured, multi-step reasoning. We introduce the Diagram of Thought (DoT), a framework that enables a single LLM to build and navigate a mental map of its reasoning. Instead of thinking in a straight line, the model constructs a dynamic diagram of ideas, where it can propose different lines of thought, critique its own steps, and synthesize validated insights into a final conclusion. This process is controller-light: it does not require an external search algorithm or planner, but it does use a deterministic online validator for grammar-constrained typed traces, register constraints, and optional solver checks. To clarify the reliability target of this process, we ground DoT in a mathematical framework from category theory. We interpret accepted typed reasoning records as diagrams in a slice topos and model synthesis of the selected proposer subdiagram as a finite limit. In the predicate fragment, this same object is equivalently a variance-reversed colimit in the opposite information order. The resulting formalism gives an auditable, step-by-step trace of the LLM's typed reasoning and separates semantic guarantees for the typed subtrace from unconstrained natural-language text and uncertified operational edges.

2408.16307 2026-05-15 cs.RO cs.AI

Safe Bayesian Optimization for Complex Control Systems via Additive Gaussian Processes

Hongxuan Wang, Xiaocong Li, Lihao Zheng, Adrish Bhaumik, Prahlad Vadakkepat

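Affiliations: National University of Singapore; SIMTech, A*STAR; CUHK, Shenzhen

AI summary: This paper proposes SafeCtrlBO, a safe Bayesian optimization method for simultaneously tuning the parameters of multiple coupled controllers in complex control systems. It uses additive Gaussian process kernels to capture low-order structure across controller gains, which reduces sample complexity, and replaces the computationally expensive expander step of earlier methods with a boundary-based expansion rule. Experiments show that SafeCtrlBO reaches high-performing controller parameters with fewer hardware evaluations while maintaining the high-probability safety criterion and satisfying the hard signal-safety constraint.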

Comments The shorter version has been accepted by IEEE Robotics and Automation Letters. This is the full version

Abstract

Automatic controller tuning is attractive for robotics and mechatronic systems whose dynamics are difficult to model accurately, but direct black-box optimization can be unsafe because each query is executed on the physical plant. Existing safe Bayesian optimization (BO) methods provide high-probability safety guarantees, yet their practical use in multi-loop control is limited by two coupled difficulties: the controller parameter space is often moderately high-dimensional, and hardware evaluations are too expensive to allow hundreds or thousands of exploratory trials. This paper proposes SafeCtrlBO, a safe BO method for simultaneously tuning multiple coupled controllers. The method uses additive Gaussian-process kernels to encode low-order structure across controller gains and reduce the sample complexity associated with dense full-dimensional kernels. It also replaces the expensive potential-expander computation used in SafeOpt-style exploration with a boundary-based expansion rule that preserves the intended safe-set expansion behavior under explicit geometric conditions and is validated empirically. Experiments on synthetic benchmarks and on a permanent magnet synchronous motor (PMSM) speed-control platform show that SafeCtrlBO reaches high-performing controller parameters with fewer hardware evaluations than representative safe BO baselines, while maintaining the prescribed high-probability safety criterion and avoiding violations of the hard signal-safety constraint in the hardware study. The code implementation is publicly available at https://github.com/hxwangnus/SafeCtrlBO.
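For readers unfamiliar with additive kernels, the following sketch shows the simplest case: a first-order additive kernel built as a sum of one-dimensional RBF kernels, one per controller gain. It is an assumption-laden illustration rather than the SafeCtrlBO implementation (which is in the linked repository); the function names, gain values, and length scales are made up, and the paper's "low-order structure" may include higher-order interaction terms that are omitted here.

```python
import math

def rbf_1d(x, y, length_scale):
    """One-dimensional squared-exponential kernel."""
    return math.exp(-0.5 * ((x - y) / length_scale) ** 2)

def additive_kernel(x, y, length_scales):
    """First-order additive kernel: one 1-D RBF term per input dimension, summed."""
    return sum(rbf_1d(xi, yi, ls) for xi, yi, ls in zip(x, y, length_scales))

# Hypothetical gain vectors (for example three controller gains) and length scales.
gains_a = (1.0, 0.10, 0.01)
gains_b = (1.2, 0.10, 0.02)
length_scales = (0.5, 0.05, 0.01)

print(additive_kernel(gains_a, gains_b, length_scales))
```

Because each term depends on a single coordinate, the model only has to learn several one-dimensional functions instead of one dense function over the full gain space, which is the intuition behind the reduced sample complexity.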

2405.07459 2026-05-15 cs.CV

DAPL: Integration of Positive and Negative Descriptions in Text-Based Person Search

Yuchuan Deng, Zhanpeng Hu, Zijie Xin, Chuang Deng, Qijun Zhao

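Affiliations: Sichuan University; Renmin University of China

AI summary: This paper studies how to integrate positive and negative descriptions in text-based person search (TBPS). Existing methods focus mainly on positive attributes and neglect negative descriptions, which can cause false positives. The authors propose the DAPL framework, which combines positive and negative descriptions through Dual Image-Attribute Contrastive learning and Sensitive Image-Attribute Matching learning to improve detection of previously unseen attributes, and introduce a Dynamic Token-wise Similarity loss to refine the alignment of visual and textual embeddings, improving the accuracy and robustness of TBPS.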

Journal ref 2025 IEEE International Conference on Multimedia and Expo (ICME)

Abstract

Text-based person search (TBPS) aims to retrieve specific images of individuals from large datasets using textual descriptions. Existing TBPS methods focus primarily on identifying explicit positive attributes, often neglecting the critical role of negative descriptions. This oversight can lead to false positives, where images that should be excluded based on negative descriptions are incorrectly included, due to partial alignment with the positive criteria. To address this limitation, we propose the Dual Attribute Prompt Learning (DAPL) framework, which incorporates both positive and negative descriptions to improve the interpretative accuracy of vision-language models in TBPS tasks. DAPL combines Dual Image-Attribute Contrastive (DIAC) learning with Sensitive Image-Attribute Matching (SIAM) learning to enhance the detection of previously unseen attributes. Furthermore, to achieve a balance between coarse and fine-grained alignment of visual and textual embeddings, we introduce the Dynamic Token-wise Similarity (DTS) loss. This loss function refines the representation of both matching and non-matching descriptions at the token level, providing more precise and adaptable similarity assessments, and ultimately improving the accuracy of the matching process. Empirical results demonstrate that DAPL outperforms state-of-the-art methods, enhancing both precision and robustness in TBPS tasks.

2304.11468 2026-05-15 cs.LG stat.ML

Increasing the Scope as You Learn: Adaptive Bayesian Optimization in Nested Subspaces

Leonard Papenmeier, Luigi Nardi, Matthias Poloczek

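Affiliations: Lund University; Stanford University; DBtune; Amazon

AI summary: This paper proposes BAxUS, an adaptive Bayesian optimization method that uses a family of nested random subspaces to adjust the search space during optimization, addressing the performance degradation of high-dimensional black-box optimization. The method comes with theoretical guarantees and outperforms state-of-the-art methods across a broad set of applications.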

Comments 28 pages, 8 figures. Accepted to NeurIPS 2022. This is the revised version and includes the appendix

Journal ref Advances in Neural Information Processing Systems 35 (NeurIPS 2022), pp. 11586-11601

Abstract

Recent advances have extended the scope of Bayesian optimization (BO) to expensive-to-evaluate black-box functions with dozens of dimensions, aspiring to unlock impactful applications, for example, in the life sciences, neural architecture search, and robotics. However, a closer examination reveals that the state-of-the-art methods for high-dimensional Bayesian optimization (HDBO) suffer from degrading performance as the number of dimensions increases or even risk failure if certain unverifiable assumptions are not met. This paper proposes BAxUS that leverages a novel family of nested random subspaces to adapt the space it optimizes over to the problem. This ensures high performance while removing the risk of failure, which we assert via theoretical guarantees. A comprehensive evaluation demonstrates that BAxUS achieves better results than the state-of-the-art methods for a broad set of applications.

2605.14186 2026-05-15 cs.LG

LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling

Qi Cao, Yufan Wang, Peijia Qin, Shuhao Zhang, Pengtao Xie

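Affiliations: University of California, San Diego

AI summary: Large language models (LLMs) can produce self-monitoring signals before and after solving a problem, such as a pre-solve estimate of whether they will succeed and a post-solve judgment of whether the answer is correct, but these signals are usually not used to control inference. Inspired by the Nelson-Narens theory from cognitive psychology, this paper proposes a metacognitive harness that separates monitoring from reasoning and uses the two signals to decide when to trust the current solution, when to retry, and when to aggregate multiple attempts. Experiments show that the harness substantially improves a fixed base model on text, code, and multimodal tasks without parameter updates or benchmark-specific fine-tuning.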

Abstract

Large language models (LLMs) often expose useful signals of self-monitoring: before solving a problem, they can estimate whether they are likely to succeed, and after solving it, they can judge whether their answer is likely to be correct. However, these signals are typically measured or elicited in isolation, rather than used to control inference. In this work, we ask whether LLMs possess latent metacognitive ability that can be turned into effective test-time control. Inspired by the Nelson-Narens theory from cognitive psychology, we propose a metacognitive harness that separates monitoring from reasoning. For each problem, the model first reports a pre-solve feeling-of-knowing (FOK) signal; after each solve attempt, it reports a post-solve judgment-of-learning (JOL) signal. Rather than treating these signals as passive confidence estimates, the harness turns them into an explicit control interface for reasoning: it decides when to trust the current solution, when to retry with compact metacognitive feedback, and when to pass multiple attempts to a final aggregator. Across text, code, and multimodal reasoning benchmarks, our harness substantially improves a fixed Claude Sonnet-4.6 base model without parameter updates or benchmark-specific fine-tuning. On the evaluated public benchmark snapshots, it raises pooled accuracy from 48.3 to 56.9 and exceeds the strongest listed leaderboard entries on the three primary evaluation settings: HLE-Verified, LiveCodeBench v6, and R-Bench-V. These results suggest that strong LLMs may already possess useful metacognitive ability, but require an explicit control harness to act on it during reasoning.
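The control flow described in the abstract (report FOK, attempt, report JOL, then trust, retry, or aggregate) can be sketched as a small loop. Everything below is hypothetical scaffolding: the callables `fok`, `solve`, and `jol` stand in for model calls, the threshold and attempt budget are invented, and a majority vote replaces whatever final aggregator the authors actually use.

```python
from collections import Counter

def metacognitive_solve(problem, fok, solve, jol, max_attempts=3, trust_threshold=0.8):
    """Trust a confident attempt, otherwise retry with feedback, then aggregate."""
    prior = fok(problem)  # pre-solve feeling-of-knowing, assumed to lie in [0, 1]
    attempts = []
    feedback = None
    for _ in range(max_attempts):
        answer = solve(problem, feedback)
        confidence = jol(problem, answer)  # post-solve judgment-of-learning in [0, 1]
        attempts.append(answer)
        if confidence >= trust_threshold:
            return answer  # trust the current solution
        # Compact metacognitive feedback passed to the next attempt.
        feedback = f"FOK={prior:.2f}, JOL={confidence:.2f}: answer {answer!r} looked unreliable"
    # No attempt was trusted: aggregate all attempts (majority vote as a stand-in).
    return Counter(attempts).most_common(1)[0][0]

# Toy stand-ins for the model calls.
def toy_fok(problem):
    return 0.4

def toy_solve(problem, feedback):
    return "41" if feedback is None else "42"

def toy_jol(problem, answer):
    return 0.9 if answer == "42" else 0.3

result = metacognitive_solve("6 * 7 = ?", toy_fok, toy_solve, toy_jol)
print(result)
```

In this toy run the first attempt is rejected by its low JOL score and the retry is accepted, so no aggregation step is needed.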

2605.14175 2026-05-15 cs.AI

Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations

Qisong He, Yi Dong, Xiaowei Huang

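Affiliations: School of Computer Science and Informatics, University of Liverpool, UK

AI summary: This paper presents Grounded Continuation, a runtime verifier that checks whether an LLM's reply in a long conversation rests on premises that are still valid in the current context. It maintains an explicit dependency graph: each turn is classified into one of several logical update operations, and dependencies between claims and evidence are recorded, so that a continuation can be verified, and unsupported conclusions traced, at linear per-turn cost. Experiments show that the verifier matches or outperforms LLM-only and retrieval-augmented baselines, with the largest gain on detecting stale premises.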

Abstract

In long conversations, an LLM can produce a next utterance that sounds plausible but rests on premises the conversation has already abandoned. Context-manipulation attacks against deployed agents now actively exploit this gap. We close it with a runtime verifier that maintains an explicit dependency graph: an LLM classifies each turn into one of 8 update operations drawn from four formalisms (dynamic epistemic logic, abductive reasoning, awareness logic, argumentation), and a symbolic engine records which claims depend on which evidence. Checking whether a continuation is supported reduces to a graph walk; retraction propagates through the same graph to flag exactly the conclusions that lose support, with linear per-turn cost and a formal conflict-free guarantee. On LongMemEval-KU oracle (n=78), the verifier reaches 89.7% accuracy vs. 88.5% for the LLM-only baseline (+1.3pp) and 87.2% for a transcript-RAG baseline matched on retrieval budget (+2.6pp); wins among disagreements are correct abstentions where the baseline confabulates. On LoCoMo's 60 official QA items the verifier is competitive with retrieval-augmented baselines. Beyond external benchmarks, we construct two multi-agent scenarios and a 50-item grounding test: on the 15-item stale-premise subset, the verifier reaches 100% accuracy vs. 93.3% (+6.7pp). These instantiate a soundness-faithfulness decomposition: the structural check is sound by construction, and per-deployment LLM extraction faithfulness is the empirical question we measure across four LLM families. The retraction check plateaus at microseconds while history-replay grows linearly with conversation length.
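The retraction step lends itself to a short sketch: record which claims depend on which items, and when an item is retracted, walk the graph to flag everything downstream. The class below is a simplified assumption, not the authors' symbolic engine; in particular it treats a claim as unsupported as soon as any one of its premises is retracted, and the claim strings are invented.

```python
from collections import deque

class DependencyGraph:
    """Tracks which claims rely on which premises and propagates retractions."""

    def __init__(self):
        self.dependents = {}   # item -> set of claims that rely on it
        self.retracted = set()

    def add_claim(self, claim, depends_on=()):
        self.dependents.setdefault(claim, set())
        for premise in depends_on:
            self.dependents.setdefault(premise, set()).add(claim)

    def retract(self, item):
        """Retract an item and return every downstream claim that loses support."""
        self.retracted.add(item)
        lost, queue = [], deque([item])
        while queue:
            current = queue.popleft()
            for claim in sorted(self.dependents.get(current, ())):
                if claim not in self.retracted:
                    self.retracted.add(claim)
                    lost.append(claim)
                    queue.append(claim)
        return lost

    def is_supported(self, claim):
        return claim in self.dependents and claim not in self.retracted

graph = DependencyGraph()
graph.add_claim("user lives in Paris")
graph.add_claim("suggest Paris restaurants", depends_on=["user lives in Paris"])
graph.add_claim("book a table in Paris", depends_on=["suggest Paris restaurants"])
graph.add_claim("user likes jazz")

lost = graph.retract("user lives in Paris")
print(lost)
```

Each node is queued at most once per retraction, so the walk is linear in the size of the affected subgraph, and unrelated claims such as the jazz preference remain supported.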

2605.14174 2026-05-15 cs.RO

Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation

Qisong He, Xinmiao Huang, Jinwei Hu, Zhuoyun Li, Yi Dong, Changshun Wu, Xiaowei Huang

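Affiliations: University of Liverpool; Université Grenoble Alpes

AI summary: To address safe navigation of mobile robots in cluttered environments, this work proposes a reinforcement learning framework that combines Conditional Value-at-Risk (CVaR) constrained optimization with post-training reachability verification. Adding CVaR constraints to the off-policy TD3 algorithm makes the policy more sensitive to high-cost tail events; after training, Taylor Model analysis is used to compute action reachable sets and quantify the policy's safety margins across states. Experiments show that the method achieves the highest safety verification rate across multiple navigation scenarios and reveal risks that conventional average-cost metrics can miss.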

Abstract

Safe navigation for mobile robots demands policies that remain reliable under the high-consequence perception uncertainty of cluttered environments. Yet most existing safe reinforcement learning (RL) methods assess safety through average cumulative cost. Such metrics can mask dangerous tail-risk behaviors. To address this, we propose a framework that trains risk-sensitive policies through Conditional Value-at-Risk (CVaR) constrained optimization on an off-policy TD3 backbone and evaluates their safety margins post-training through neural network reachability verification. During training, the policy is optimized under CVaR constraints on cumulative costs, promoting sensitivity to high-cost tail outcomes rather than average behavior alone. After training, we compute action reachable sets under bounded observation uncertainty using Taylor Model analysis, yielding a safety rate metric that quantifies the proportion of evaluated states at which the policy's reachable action set remains within prescribed safety margins. A key finding is that policies trained with CVaR constraints maintain larger safety margins from obstacles across evaluated states. This makes them significantly more amenable to formal reachability verification. Experiments across ten navigation scenarios and six baselines show that our method achieves a 98.3% success rate, the highest safety verification rate among all compared methods, while revealing that average cost rankings and reachability-based safety rankings can diverge. This indicates that reachability verification captures risks which are missed by empirical cost metrics alone. We further validate our approach on a physical Clearpath Jackal robot, demonstrating successful sim-to-real transfer.
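The claim that average cumulative cost can mask tail risk is easy to see with an empirical CVaR estimate, taken here as the mean of the worst (1 - alpha) fraction of episode costs. The helper `cvar` and the two cost lists are illustrative assumptions, not the paper's training objective or data.

```python
import math

def cvar(costs, alpha=0.9):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of costs."""
    ordered = sorted(costs, reverse=True)
    # round() guards against floating-point noise such as 0.30000000000000004 * 10.
    tail = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:tail]) / tail

# Two hypothetical policies with the same average cumulative cost per episode.
steady = [1.0] * 10           # always a small cost
risky = [0.0] * 9 + [10.0]    # usually free, occasionally a severe violation

for name, costs in (("steady", steady), ("risky", risky)):
    mean_cost = sum(costs) / len(costs)
    print(f"{name}: mean={mean_cost:.1f}, CVaR_0.9={cvar(costs):.1f}")
```

Both policies have a mean cost of 1.0, but the CVaR at level 0.9 separates them clearly, which is why a CVaR constraint pushes training toward policies with thinner cost tails.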

2605.14171 2026-05-15 cs.LG cs.NI

CSI-JEPA: Towards Foundation Representations for Ubiquitous Sensing with Minimal Supervision

Xuanhao Luo, Zhizhen Li, Yuchen Liu

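Affiliations: North Carolina State University, USA

AI summary: This paper proposes CSI-JEPA, a self-supervised learning framework for general-purpose Wi-Fi sensing representations with minimal supervision. It learns reusable temporal-spectral representations from unlabeled CSI data by predicting latent features of masked channel regions, and introduces a channel variation-aware masking strategy to strengthen the representations. Experiments on sensing tasks in several real-world settings show that CSI-JEPA outperforms existing supervised methods and reduces the need for labeled data.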

Abstract

Channel state information (CSI) provides a widely available sensing modality for human and environment perception, but existing CSI sensing models usually rely on task-specific supervised training and require substantial labeled data for each task, device, user, or environment. This limits their scalability in practical deployments where unlabeled CSI is abundant but labeled data is costly to collect. In this paper, we present CSI-JEPA, a self-supervised predictive representation learning framework for label-efficient, multi-task Wi-Fi sensing. CSI-JEPA learns reusable temporal-spectral representations from unlabeled CSI samples by predicting latent features of masked channel regions from visible context. To better match the physical structure of CSI, CSI-JEPA tokenizes channel-response amplitude windows along the time and subcarrier dimensions. It then introduces a channel variation-aware masking strategy that samples predictive targets from regions with stronger local temporal and subcarrier-domain variations. After pretraining, the encoder is frozen and used as a backbone, with lightweight task-specific adapters added for downstream sensing tasks. We evaluate CSI-JEPA on seven real-world Wi-Fi sensing tasks spanning diverse objectives and deployment settings. The results show that CSI-JEPA improves downstream sensing performance over competitive baselines, achieving up to 10.64 percentage points mean accuracy gain over state-of-the-art supervised Transformer and matched-budget label savings of up to 98.0%.
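One way to picture variation-aware masking is to score each time-subcarrier cell by how much the amplitude changes around it and then sample mask targets with probability proportional to that score. The sketch below works on individual cells of a small amplitude grid rather than on tokenized windows, and the scoring rule, function names, and data are all assumptions made for illustration, not the CSI-JEPA implementation.

```python
import random

def local_variation(amplitude):
    """Per-cell score: |change to next time step| + |change to next subcarrier|."""
    n_time, n_sub = len(amplitude), len(amplitude[0])
    scores = [[0.0] * n_sub for _ in range(n_time)]
    for t in range(n_time):
        for s in range(n_sub):
            dt = abs(amplitude[min(t + 1, n_time - 1)][s] - amplitude[t][s])
            ds = abs(amplitude[t][min(s + 1, n_sub - 1)] - amplitude[t][s])
            scores[t][s] = dt + ds
    return scores

def sample_mask_targets(amplitude, n_targets, seed=0, eps=1e-6):
    """Sample distinct (time, subcarrier) cells, favoring high-variation regions."""
    rng = random.Random(seed)
    scores = local_variation(amplitude)
    weights = {(t, s): v + eps for t, row in enumerate(scores) for s, v in enumerate(row)}
    chosen = []
    for _ in range(n_targets):
        cells = list(weights)
        pick = rng.choices(cells, weights=[weights[c] for c in cells], k=1)[0]
        chosen.append(pick)
        del weights[pick]  # sample without replacement
    return chosen

# Toy 4 x 6 amplitude grid: flat except for a disturbance in the last two time steps.
grid = [[1.0] * 6 for _ in range(4)]
grid[2] = [1.0, 1.5, 0.5, 1.8, 0.4, 1.0]
grid[3] = [1.0, 0.6, 1.7, 0.3, 1.9, 1.0]

targets = sample_mask_targets(grid, n_targets=5)
print(targets)
```

The small `eps` keeps flat regions sampleable, so the mask still occasionally covers quiet parts of the channel.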

2605.14169 2026-05-15 cs.CL

BOOKMARKS: Efficient Active Storyline Memory for Role-playing

Letian Peng, Ziche Liu, Yiming Huang, Longfei Yun, Kun Zhou, Yupeng Hou, Jingbo Shang

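Affiliations: University of California, San Diego

AI summary: This paper proposes BOOKMARKS, an efficient active storyline memory framework for role-playing agents (RPAs), addressing the loss of key details that existing methods suffer when they compress information to maintain long-horizon consistency. The method actively initializes and updates task-relevant bookmarks, each recording a question about the story and its answer, which preserves task details while avoiding repeated computation. Experiments show that BOOKMARKS significantly outperforms conventional memory methods across many characters and tasks.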

Abstract

Memory systems are critical for role-playing agents (RPAs) to maintain long-horizon consistency. However, existing RPA memory methods (e.g., profiling) mainly rely on recurrent summarization, whose compression inevitably discards important details. To address this issue, we propose a search-based memory framework called BOOKMARKS, which actively initializes, maintains, and updates task-relevant pieces of bookmarks for the current task (e.g., character acting). A bookmark is structured as the answer to a question at a specific point in the storyline. For each current task, BOOKMARKS selects reusable existing bookmarks or initializes new ones (at storyline beginning) with useful questions. These bookmarks are then synchronized to the current story point, with their answers updated accordingly, so they can be efficiently reused in future grounding rounds. Compared with recurrent summarization, BOOKMARKS offers (1) active grounding for capturing task-specific details and (2) passive updating to avoid unnecessary computation. In implementation, BOOKMARKS supports concept, behavior, and state searches, each powered by an efficient synchronization method. BOOKMARKS significantly outperforms RPA memory baselines on 85 characters from 16 artifacts, demonstrating the effectiveness of search-based memory for RPAs.

2605.14168 2026-05-15 cs.LG cs.DS stat.ML

Finite Sample Bounds for Learning with Score Matching

Devin Smedira, Abhijith Jayakumar, Sidhant Misra, Marc Vuffray, Andrey Y. Lokhov

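Affiliations: Operations Research Center, Massachusetts Institute of Technology; Theoretical division, Los Alamos National Lab

AI summary: This paper studies learning continuous exponential family distributions with score matching from a finite number of samples. The authors give a non-asymptotic sample complexity analysis showing a polynomial dependence on the model dimension, which they state is the first result of its kind. The work fills a gap in the theoretical analysis of score matching and provides theoretical guarantees for high-dimensional statistical learning.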

Comments 22 pages

Abstract

Learning of continuous exponential family distributions with unbounded support remains an important area of research for both theory and applications in high-dimensional statistics. In recent years, score matching has become a widely used method for learning exponential families with continuous variables due to its computational ease when compared against maximum likelihood estimation. However, theoretical understanding of the statistical properties of score matching is still lacking. In this work, we provide a non-asymptotic sample complexity analysis for learning the structure of exponential families of polynomials with score matching. The derived sample bounds show a polynomial dependence on the model dimension. These bounds are the first of their kind, as all prior work has shown only asymptotic bounds on the sample complexity.

2605.14167 2026-05-15 cs.AI cs.CY

The Evaluation Trap: Benchmark Design as Theoretical Commitment

Theodore J Kalaitzidis

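Affiliations: Brown University

AI summary: This paper examines how the theoretical assumptions implicit in AI benchmarks shape the definition of capability and the direction of progress, arguing that when these assumptions go unexamined, benchmarks entrench the dominant paradigm and limit real understanding of capability. It proposes Epistematics, a methodology for deriving evaluation criteria directly from technical capability claims and auditing whether a benchmark can distinguish the claimed capability from proxy behaviors. The core contribution is a meta-evaluative framework consisting of an audit procedure, a failure mode taxonomy, and benchmark-design criteria for improving coherence between an evaluation and its target capability.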

Comments 13 pages

Abstract

Every AI benchmark operationalizes theoretical assumptions about the capability it claims to assess. When assumptions function as unexamined commitments, benchmarks stabilize the dominant paradigm by narrowing what counts as progress. Over time, narrow evaluation reorganizes capability concepts: architectures and definitions are selected for benchmark legibility until evaluation ceases to track an independent object and instead produces a version of the target defined by its own operational assumptions. The result is a trap: evaluation frameworks treat self-reinforcing assessments as valid, both creating and obscuring structural limits on what the current paradigm can accomplish. We introduce Epistematics, a methodology for deriving evaluation criteria directly from technical capability claims and auditing whether proposed benchmarks can discriminate the claimed capability from proxy behaviors. The contribution is meta-evaluative: an audit procedure, a failure mode taxonomy, and benchmark-design criteria for evaluating capability-evaluation coherence. We demonstrate the procedure through a worked audit of Dupoux et al. (2026), a proposal that revises the dominant paradigm's theoretical assumptions at the architectural level while reproducing them in its evaluation criteria, thereby entrenching the constraint it seeks to overcome in a form the evaluation cannot detect.

2605.14164 2026-05-15 cs.AI

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

Stefan Baack, Christo Buschek, Maty Bohacek

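Affiliations: Independent Researcher; Stanford University

AI summary: This study examines the benchmarking cultures of builders of foundation and generative AI models, observing that the main venue for establishing model competencies has shifted from academic papers to company press releases and blog posts, which now largely define the state of the art. By building and open-sourcing the Benchmarking-Cultures-25 dataset, it analyzes 231 benchmarks highlighted across 139 model releases from 11 major AI builders in 2025, reveals a fragmented evaluation landscape with limited cross-model comparability, and proposes a unified taxonomy to disentangle the differing ways builders describe what benchmarks measure.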

Abstract

The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog posts, where model builders highlight results on selected benchmarks. These artifacts now largely define the state of the art for researchers and the public. Despite their prominence, which benchmarks model builders choose to highlight, and what they communicate through this selection, is underexamined. To investigate, we introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data. Our analysis reveals a fragmented evaluation landscape with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder, and 38.5% appear in just one release. Few achieve widespread use (e.g., GPQA Diamond, LiveCodeBench, AIME 2025). Moreover, benchmarks are attributed different competencies by different builders, depending on their narrative. To disentangle these conflicting presentations, we develop a unified taxonomy mapping diverging terminology to a shared framework of measured signals based on what benchmark authors claim to measure. "General knowledge application" is the second most popular, yet vaguely defined, category. Qualitative analysis shows many such benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI. Their authors claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math). We argue that highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation. Data: https://hf.co/datasets/matybohacek/benchmarking-cultures-25; tool: https://bench-cultures.net.

2605.14163 2026-05-15 cs.AI

Agentic Systems as Boosting Weak Reasoning Models

Varun Sunkaraneni, Pierfrancesco Beneventano, Riccardo Neumarker, Tomaso Poggio, Tomer Galanti

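Affiliations: Texas A&M University; MIT

AI summary: This paper studies whether combining the outputs of multiple weak reasoning models can reach the performance of a strong model. The core mechanism is verifier-backed committee search, in which proposal, critic, and comparator modules cooperate at inference time. The authors prove that repeated sampling alone is not sufficient: reliable amplification also requires a local soundness signal such as execution or type checking. Experiments show that a well-designed orchestration of weak-model calls can match strong models, and that the main challenge is selecting the correct solutions from the proposal pool.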

Abstract

Can a committee of weak reasoning-model calls reach the performance of much stronger models? We study verifier-backed committee search as inference-time boosting for reasoning language models. The mechanism is not simply that "more agents help": samples expose latent correct solutions, while critics and comparators must recover them without access to the hidden verifier. We formalize this view by separating proposal coverage, local identifiability, progress, and diversity. We prove that coverage can be amplified by repeated sampling, but cannot by itself create useful critics or comparators; reliable amplification requires an additional local soundness signal, such as execution, proof checking, type checking, tests, or constraint solving. We give rank-based bounds showing when local selection errors compose into reliable trajectories, and characterize the proposer-side ceiling: oracle best-of-k converges only to the mass of task slices on which the proposal system assigns nonzero useful probability. Empirically, on SWE-bench Verified, a single GPT-5.4 nano proposal solves 67.0% of tasks. Using the same nano model, our critic-comparator orchestration reaches 76.4% with k=8 proposals, matching the standalone performance of Gemini 3 Pro and Claude Opus 4.5 Thinking and approaching the 79.0% oracle best-of-8 upper bound. Thus, many correct patches are already present in weak-model proposal pools; the main challenge is selecting them. The remaining failures are mostly proposal-coverage failures, indicating shared blind spots that stronger selection alone cannot close.
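The proposer-side ceiling can be illustrated numerically. If a single proposal solves a task with probability p and proposals are independent (an assumption made here for simplicity), an oracle that picks the best of k proposals succeeds with probability 1 - (1 - p)^k, which approaches 1 only for tasks with p > 0. The per-task probabilities below are invented.

```python
def oracle_best_of_k(task_probs, k):
    """Expected fraction of tasks where at least one of k independent proposals is correct."""
    return sum(1.0 - (1.0 - p) ** k for p in task_probs) / len(task_probs)

# Hypothetical per-task success probabilities; the last task is a shared blind spot.
task_probs = [0.9, 0.5, 0.1, 0.0]

curve = [oracle_best_of_k(task_probs, k) for k in (1, 8, 64, 1024)]
ceiling = sum(p > 0 for p in task_probs) / len(task_probs)

print("oracle best-of-k:", [round(c, 4) for c in curve])
print("ceiling:", ceiling)
```

More sampling lifts coverage from 0.375 toward 0.75 but never beyond it, because the task with zero proposal probability stays unsolved no matter how good the selection stage is.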

2605.14156 2026-05-15 cs.LG

Uncovering Trajectory and Topological Signatures in Multimodal Pediatric Sleep Embeddings

Scott Ye, Harlin Lee

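Affiliations: Department of Radiology; University of California San Francisco; UNC–Chapel Hill; UNC School of Data Science and Society

AI summary: This study investigates the diagnostic information latent in multimodal masked-autoencoder embeddings of pediatric sleep data, augmenting the embeddings with topological features, geometric structure, and electronic health records (EHR). With these additional signals, linear models and multilayer perceptrons perform better on sleep-event prediction tasks while remaining interpretable, and under extreme class imbalance the fusion models improve calibration and robustness.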

Comments Accepted to ML4H 2025, 20 pages, 6 figures

Journal ref Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1392-1411, 2025

Details
English abstract

While generative models have shown promise in pediatric sleep analysis, the latent structure of their multimodal embeddings remains poorly understood. This work investigates session-wide diagnostic information contained in the sequences of 30-second pediatric PSG epochs embedded by a multimodal masked autoencoder. We test whether augmenting embeddings with PHATE-derived per-epoch coordinates and whole-night movement descriptors, persistent homology summaries of the embedding cloud, and EHR yields task-relevant signals. Simple linear and MLP models, chosen for interpretability rather than state-of-the-art performance, show that geometric, topological, and clinical features each provide complementary gains. For binary predictions, feature importance is task-dependent, and more expressive late-fusion models generally perform better, with AUPRC improving from 0.26 to 0.34 for desaturation, 0.31 to 0.48 for EEG arousal, 0.09 to 0.22 for hypopnea, and 0.05 to 0.14 for apnea. We also report Brier score and Expected Calibration Error, where the full fusion model yields the best calibration across all four binary tasks. Our study reveals that latent geometry/topology and EHR offer complementary, interpretable signals beyond embeddings, improving calibration and robustness under extreme imbalance.
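For readers unfamiliar with the two calibration metrics named in the abstract, here is a minimal sketch of the Brier score and of one common equal-width-bin variant of Expected Calibration Error for binary predictions. The bin count and the example values are assumptions; the paper's exact ECE configuration is not stated in the abstract.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE with equal-width bins over the predicted positive probability.

    For each non-empty bin, compare the mean predicted probability with the
    observed fraction of positives, weighted by the bin's share of examples.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_p - frac_pos)
    return ece


# Made-up predictions for a rare positive class.
probs = [0.05, 0.10, 0.20, 0.80, 0.15, 0.05]
labels = [0, 0, 0, 1, 1, 0]
print(brier_score(probs, labels), expected_calibration_error(probs, labels))
```

Lower is better for both metrics.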

2605.14152 2026-05-15 cs.CL cs.AI cs.CR cs.CY

ROK-FORTRESS: Measuring the Effect of Geopolitical Transcreation for National Security and Public Safety

Michael S. Lee, Yash Maurya, Drew Rein, Bert Herring, Jonathan Nguyen, Kyungho Song, Udari Madhushani Sehwag, Jiyeon Cho, Kaustubh Deshpande, Yeongkyun Jang, Jiyeon Joo, Minn Seok Choi, Evi Fuelle, Christina Q Knight, Joseph Brandifino, Max Fenkell

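Affiliations * Scale AI

AI summary This paper presents ROK-FORTRESS, a bilingual benchmark for assessing large language model risks in national security and public safety, focusing on the English-Korean language pair and on interaction effects in the U.S.-ROK geopolitical context. By constructing a "transcreation matrix", the method separates linguistic from geopolitical factors and systematically evaluates model safety behavior across languages and entity contexts. The study finds that the combination of Korean language and Korean geopolitical grounding has a marked effect on model safety behavior, and that models differ in how they respond, suggesting that conventional translation-only evaluation may underestimate the risks arising from the interaction of language and geopolitics.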

Comments 16 pages main body + appendix (63 total), 5 main figures, 4 main tables; dataset at https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public

Details
English abstract

Safety evaluations for large language models (LLMs) increasingly target high-stakes National Security and Public Safety (NSPS) risks, yet multilingual safety is typically assessed through translation-only benchmarks that preserve the underlying scenario, and empirical evidence of how language and geopolitical context interact remains limited to a narrow set of language pairs. We introduce ROK-FORTRESS (https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public), a bilingual, culturally adversarial NSPS benchmark that uses the English--Korean language pair and U.S.--ROK geopolitical axis as a case study, separating the effects of language and geopolitical grounding via a transcreation matrix: adversarial intents are evaluated under controlled combinations of (i) English versus Korean language and (ii) U.S. versus Korean entities, institutions, and operational details. Each adversarial prompt is paired with a dual-use benign counterpart to quantify over-refusal. Model responses are then scored using calibrated LLM-as-a-judge panels, applying our expert-crafted, prompt-specific binary rubrics. Across a dual-track set of frontier and Korean-optimized models, we find a consistent suppression effect in Korean variants and substantial model-to-model variation in how geopolitical grounding interacts with language. In many models, Korean grounding mitigates the Korean language-driven suppression -- with no model showing significant amplification in the other direction -- indicating that, at least in the English--Korean case, safety behavior is shaped by language-as-risk signals and context interactions that translation-only evaluations miss. The transcreation matrix methodology is designed to generalize to other language--culture pairs.

2605.14147 2026-05-15 cs.LG

A Systematic Evaluation of Imbalance Handling Methods in Biomedical Binary Classification

Jiandong Chen, Lingjie Su, Le Peng, Yash Travadi, Rui Zhang, Ju Sun

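Affiliations * Institute for Health Informatics, University of Minnesota; Department of Computer Science and Engineering, University of Minnesota; School of Statistics, University of Minnesota; Division of Computational Health Sciences, Department of Surgery, University of Minnesota

AI summary This study systematically evaluates how commonly used imbalance handling methods affect biomedical binary classification, examining the interplay between model complexity and data modality. Testing several methods on three representative biomedical datasets, it finds that simple models such as logistic regression are insensitive to imbalance handling, whereas complex models such as deep neural networks improve markedly with oversampling or re-weighting. The results indicate that choosing an appropriate imbalance handling method matters for the classification performance of complex models on text and image data.

Comments 18 pages, 1 figure, 4 tables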

Details
English abstract

Objective: The primary goal of this study was to systematically examine the impact of commonly used imbalance handling methods (IHMs) on predictive performance in biomedical binary classification, considering the interplay between model complexity and diverse data modalities. Material and Methods: We evaluated five representative IHMs: random undersampling (RUS), random oversampling (ROS), SMOTE, re-weighting (RW), and direct F1-score optimization (DMO), against a raw training (RAW) baseline. The evaluation encompassed three public biomedical datasets: MIMIC-III (tabular), ADE-Corpus-V2 (text), and MURA (image), spanning three common biomedical data modalities. To assess varying model complexity, we employed a range of architectures, from classical logistic regression and random forest to deep neural networks, including multilayer perceptron (MLP), BiLSTM, BERT, DenseNet, and DINOv2. Results: For simpler models such as logistic regression on tabular data, IHMs yielded no significant advantage over the RAW baseline, aligning with prior findings. However, clear benefits were observed for more complex models and unstructured data: (a) ROS and RW consistently enhanced the performance of powerful models; (b) direct F1-score optimization demonstrated utility primarily for unstructured text and image data; and (c) RUS and SMOTE consistently degraded performance and are therefore not recommended. Conclusion: The effectiveness of IHMs depends on both model complexity and data modality. Performance gains are most pronounced when leveraging appropriate IHMs, such as ROS, RW, and DMO, on high-complexity models.
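Two of the methods the abstract recommends, random oversampling (ROS) and re-weighting (RW), can be sketched in a few lines. This is a minimal illustration under assumptions: the function names are hypothetical, and the inverse-frequency weighting shown here is one common heuristic, not necessarily the scheme used in the study.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples at random until all classes match
    the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def inverse_frequency_weights(y):
    """Per-class loss weights n / (n_classes * class_count)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Toy 9:1 imbalanced dataset.
X = list(range(10))
y = [0] * 9 + [1]
X_ros, y_ros = random_oversample(X, y)
print(Counter(y_ros))
print(inverse_frequency_weights(y))
```

ROS changes the training set, while RW leaves the data untouched and scales each example's loss by its class weight.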

2605.14146 2026-05-15 cs.LG

bde: A Python Package for Bayesian Deep Ensembles via MILE

Vyron Arvanitis, Angelos Aslanidis, Emanuel Sommer, David Rügamer

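Affiliations * Faculty of Physics, LMU Munich; Department of Statistics, LMU Munich; Munich Center for Machine Learning

AI summary bde is a user-friendly Python package for building Bayesian deep ensembles, aimed particularly at tabular data. It is built on an efficient implementation of the sampling-based inference method MILE (Microcanonical Langevin Ensembles) and supports fast training, efficient Markov chain Monte Carlo sampling, and uncertainty quantification for regression and classification tasks, offering a convenient route into Bayesian deep learning.

Details
English abstract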

bde is a user-friendly Python package for Bayesian Deep Ensembles with a particular focus on tabular data. Built on an efficient JAX implementation of the sampling-based inference method Microcanonical Langevin Ensembles (MILE), it provides scikit-learn compatible estimators for fast training, efficient Markov Chain Monte Carlo sampling, and uncertainty quantification in both regression and classification tasks.

2605.14145 2026-05-15 cs.CV

Rethinking the Good Enough Embedding for Easy Few-Shot Learning

Michael Karnes, Alper Yilmaz

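Affiliations * The Ohio State University

AI summary This paper asks whether different deep vision models trained on large-scale data converge toward an "ideal" latent representation space, and argues that a good embedding is enough. Freezing DINOv2-L features and pairing them with a k-nearest-neighbor classifier, the authors build a non-parametric few-shot learning pipeline that needs no backpropagation, identify the best feature extraction layer, and apply principal component analysis and independent component analysis for manifold refinement. Experiments show that the method outperforms sophisticated meta-learning algorithms on several mainstream benchmarks and reaches state-of-the-art performance.

Details
English abstract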

The field of deep visual recognition is undergoing a paradigm shift toward universal representations. The Platonic Representation Hypothesis suggests that diverse architectures trained on massive datasets are converging toward a shared, "ideal" latent space. This again raises a critical question: "Is a Good Embedding All You Need?" In this paper, we leverage this convergence to demonstrate that off-the-shelf embeddings are inherently "good enough" for complex tasks, rendering intensive task-specific fine-tuning unnecessary. We explore this hypothesis within the few-shot learning framework, proposing a straightforward, non-parametric pipeline that entirely bypasses backpropagation. By utilizing a k-Nearest Neighbor classifier on frozen DINOv2-L features, we conduct a layer-wise characterization to identify an optimal feature extraction layer. We further demonstrate that manifold refinement via PCA and ICA provides a beneficial regularizing effect. Our results across four major benchmarks demonstrate that our approach consistently surpasses sophisticated meta-learning algorithms, achieving state-of-the-art performance.
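The core of such a backpropagation-free pipeline is a nearest-neighbor vote over precomputed embeddings. The sketch below assumes the embeddings have already been extracted by a frozen backbone and are non-zero vectors; the cosine similarity metric, the function names, and the two-dimensional toy vectors are illustrative assumptions, and the PCA/ICA refinement step is omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(support, query, k=1):
    """Classify a query embedding by majority vote among its k most similar
    support embeddings. support is a list of (embedding, label) pairs."""
    ranked = sorted(support, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy "few-shot episode": stand-ins for frozen backbone features.
support = [([1.0, 0.0], "cat"), ([0.9, 0.1], "cat"), ([0.0, 1.0], "dog")]
print(knn_predict(support, [0.8, 0.2], k=1))
print(knn_predict(support, [0.1, 1.0], k=1))
```

No parameters are trained: adapting to a new episode only means swapping the support set.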

2605.14141 2026-05-15 cs.AI

Distribution-Aware Algorithm Design with LLM Agents

Saharsh Koganti, Priyadarsi Mishra, Pierfrancesco Beneventano, Tomer Galanti

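Affiliations * Texas A&M University; Massachusetts Institute of Technology

AI summary This paper studies learning in a setting where the learned object is executable solver code rather than a predictive model, stressing that a solver must be not only correct but also fast at runtime. Its central abstraction is the "solver hint": reusable structure inferred from samples and compiled into specialized solver code, improving both solving efficiency and quality. Experiments show that solvers generated by LLM-based code agents substantially outperform existing heuristics and solvers on several combinatorial optimization problems, running up to hundreds of times faster while maintaining high solution quality and sharply reducing computational cost.

Details
English abstract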

We study learning when the learned object is executable solver code rather than a predictor. In this setting, correctness is not enough: two solvers may both return valid solutions on the deployment distribution while differing substantially in runtime. Given samples from an unknown task distribution, the learner returns code evaluated on fresh instances by both solution quality and execution time. Our central abstraction is a \emph{solver hint}: reusable structure inferred from samples and compiled into specialized solver code. We prove that the empirically fastest sample-consistent solver from a fixed library generalizes in both correctness and runtime, and that statistically identifiable hints can be recovered and compiled from polynomially many samples. Empirically, we instantiate the framework with LLM code agents on \(21\) structured combinatorial-optimization target distributions across seven problem classes. The synthesized solvers reach mean normalized quality \(0.971\), improve by \(+0.224\) over the average heuristic pool and by \(+0.098\) over the highest-quality heuristic, and are \(336.9\times\), \(342.8\times\), and \(16.1\times\) faster than the quality-best heuristic, Gurobi, and the selected time-limited exact backend, respectively. On released PACE 2025 Dominating Set private instances, the synthesized solver is valid on all \(100\) graphs and runs about two orders of magnitude faster than top competition solvers, with a moderate quality gap. Inspection shows that many gains come from changing the computational scale: replacing ambient exponential search or general-purpose optimization with compiled distribution-specific computation.
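The selection rule analyzed in the abstract, picking the empirically fastest solver from a fixed library among those consistent with the samples, can be sketched as follows. The sorting task, the three-solver library, and all names here are hypothetical stand-ins chosen for illustration; they do not come from the paper.

```python
import random
import time


def select_fastest_consistent(library, samples, is_valid):
    """Return the name of the solver with the lowest total runtime among
    those whose output is valid on every sample, or None if none qualifies."""
    best_name, best_time = None, float("inf")
    for name, solver in library.items():
        start = time.perf_counter()
        outputs = [solver(instance) for instance in samples]
        elapsed = time.perf_counter() - start
        consistent = all(is_valid(inst, out) for inst, out in zip(samples, outputs))
        if consistent and elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


def bubble_sort(xs):
    """Correct but quadratic-time sorting, used as a slow library member."""
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs


rng = random.Random(0)
samples = [[rng.randint(0, 999) for _ in range(100)] for _ in range(20)]
library = {
    "identity": lambda xs: list(xs),  # fast but wrong on unsorted inputs
    "bubble_sort": bubble_sort,       # correct but slow
    "builtin_sort": sorted,           # correct and fast
}
is_sorted_copy = lambda inst, out: out == sorted(inst)
print(select_fastest_consistent(library, samples, is_sorted_copy))
```

Validity is checked first, so a fast but incorrect solver is never selected; runtime only breaks ties among consistent solvers.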

2605.14136 2026-05-15 cs.CV

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

Nurislam Tursynbek, Zhiqiang Lao, Heather Yu, Gedas Bertasius, Marc Niethammer

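Affiliations * UNC Chapel Hill; Futurewei Technologies Inc; UCSD

AI summary Recent text-to-video diffusion models generate visually appealing frames but still fall short on temporal coherence, often showing flicker, drift, or unstable motion. This paper proposes TeDiO, a training-free, inference-only method that strengthens temporal consistency by regularizing the temporal diagonal patterns in the model's internal attention maps. The method estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates, markedly improving motion smoothness across several video diffusion models without modifying model weights or relying on external motion supervision, while preserving per-frame visual quality.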

Comments CVPR'26 Workshop on Agentic AI for Visual Media

Details
English abstract

Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.

2605.14135 2026-05-15 cs.CV

PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting

Adil Qureshi, Dongki Jung, Jaehoon Choi, Dinesh Manocha

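Affiliations * University of Maryland, College Park

AI summary This paper proposes PanoPlane, a method for high-fidelity indoor novel view synthesis from sparse views, whose core idea is to reconstruct closed room geometry through panoramic scene completion. It introduces a training-free, layout-anchored attention steering mechanism that, at inference time, directs the diffusion model's attention toward the planar surfaces detected in the scene, replacing unconstrained hallucination with geometrically consistent content completion. Experiments on Replica, ScanNet++, and Matterport3D show better novel view synthesis than existing methods, with PSNR improving by up to 17.8%.

Details
English abstract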

We present PanoPlane, an approach for high-fidelity sparse-view indoor novel view synthesis that reconstructs closed room geometry via panoramic scene completion. Unlike perspective-based methods that generate training views from limited fields of view, PanoPlane leverages $360^{\circ}$ panoramic completion to condition the generative process on the full spatial layout. We propose Layout Anchored Attention Steering, a training-free mechanism that steers attention within the diffusion model's internal representation toward the scene's detected planar surfaces at inference time. By directing each unobserved region's attention toward geometrically consistent observed content, our method replaces unconstrained hallucination with grounded surface extrapolation. The resulting panoramic completions provide supervision for 3D Gaussian Splatting, enabling accurate novel-view synthesis across unobserved regions from as few as three input views. Experiments on Replica, ScanNet++, and Matterport3D demonstrate state-of-the-art novel view synthesis quality across 3, 6, and 9 input views, achieving up to $+17.8\%$ improvement in PSNR over the current state-of-the-art baseline without any training or fine-tuning of the diffusion model.

2605.14126 2026-05-15 cs.LG cs.AI

Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)

Marius S. Knorr, Robert Müller, Jan P. Bremer, Nils Schweingruber

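Affiliations * IDM gGmbH, University Medical Center Hamburg-Eppendorf, Hamburg, Germany

AI summary This paper studies how reinforcement learning can improve the multi-step reasoning of healthcare information agents operating under the Fast Healthcare Interoperability Resources (FHIR) standard. The authors model FHIR electronic health records as a queryable structured graph and design a multi-turn agent that acts by writing code, post-training it with reinforcement learning to improve question answering over real hospital data. Experiments show that the method substantially raises answer correctness on the FHIR-AgentBench benchmark while enforcing data-integrity constraints.

Details
English abstract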

Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. An LLM judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 77% on FHIR-AgentBench using a smaller and cheaper Qwen3-8B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training, and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.

2605.14120 2026-05-15 cs.LG cs.CL

Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence

Mashrekur Rahman

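Affiliations * Dartmouth Libraries, Dartmouth College

AI summary This study proposes Mini-JEPA, a fleet of lightweight foundation models for improving hydrologic intelligence systems. Small Joint Embedding Predictive Architecture models are trained separately for different sensors, and a routing agent selects the appropriate models for each question, keeping accuracy high while lowering compute cost. Experiments show that Mini-JEPA performs well across several hydrologic variable prediction tasks and yields clear gains in comparisons against the large AlphaEarth model.

Details
English abstract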

Geospatial foundation models compress multispectral observations into dense embeddings increasingly used in natural-language environmental reasoning systems. A single planetary-scale model, e.g. Google AlphaEarth, handles broad characterization well but may compromise on specialized hydrologic signals. Such generalist models are also often inaccessible, expensive, and require large-scale compute. We propose Mini-JEPAs: a fleet of small sensor-specialized Joint Embedding Predictive Architecture (JEPA) foundation models consulted by a routing agent for specialized questions. We pretrained five 22M-parameter Mini-JEPAs sharing an identical Vision Transformer backbone, JEPA recipe, and 64-d output space, using Sentinel-2 optical, Sentinel-1 SAR, MODIS thermal, multi-temporal Sentinel-2 phenology, and a topography-soil stack. Each Mini-JEPA reconstructs the variable matched to its sensor, with cross-validated $R^2$ reaching 0.97 for elevation, 0.97 for temperature, and 0.81 for precipitation. The five manifolds differ in geometric structure, with global participation ratios from 8.9 to 20.2 and local intrinsic dimensionalities from 2.3 to 9.0. Joint topography-soil and phenology models add predictive value beyond AlphaEarth alone for soil moisture, aridity, and precipitation ($\Delta R^2$ up to 0.031). A router LLM reads per-modality references and selects appropriate sensors with a perfect hit rate over a curated question set. In paired LLM-as-Judge evaluation, dual retrieval over AlphaEarth and the routed fleet outperforms AlphaEarth alone on physics-matched questions (Cohen's $d = 1.10$, $p = 0.031$). Locally trained Mini-JEPAs can be operationalized for hydrologic intelligence with modest compute.

2605.14117 2026-05-15 cs.CL cs.AI

Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards

Luis Lara, Aristides Milios, Zhi Hao Luo, Aditya Sharma, Ge Ya Luo, Christopher Beckham, Florian Golemo, Christopher Pal

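Affiliations * Mila – Quebec AI Institute; Université de Montréal; Polytechnique Montréal; Canada CIFAR AI Chair

AI summary This study proposes a text-based floor plan generation method built on a large language model (LLM) and optimized with reinforcement learning with verifiable rewards (RLVR), aiming to produce high-quality floor plans that satisfy user-defined connectivity and numerical constraints. By fine-tuning the LLM on real floor plans and optimizing against constraint adherence metrics, the method outperforms existing approaches on realism, compatibility, and diversity, with at least a 94% relative reduction on the Compatibility metric, demonstrating that LLMs can handle structured design constraints effectively.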

Comments Accepted to Findings of ACL 2026

Details
English abstract

An AI system for professional floor plan design must precisely control room dimensions and areas while respecting the desired connectivity between rooms and maintaining functional and aesthetic quality. Existing generative approaches focus primarily on respecting the requested connectivity between rooms, but do not support generating floor plans that respect numerical constraints. We introduce a text-based floor plan generation approach that fine-tunes a large language model (LLM) on real plans and then applies reinforcement learning with verifiable rewards (RLVR) to improve adherence to topological and numerical constraints while discouraging invalid or overlapping outputs. Furthermore, we design a set of constraint adherence metrics to systematically measure how generated floor plans align with user-defined constraints. Our model generates floor plans that satisfy user-defined connectivity and numerical constraints and outperforms existing methods on Realism, Compatibility, and Diversity metrics. Across all tasks, our approach achieves at least a 94% relative reduction in Compatibility compared with existing methods. Our results demonstrate that LLMs can effectively handle constraints in this setting, suggesting broader applications for text-based generative modeling.
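A verifiable reward of the kind the abstract describes can be computed directly from the generated geometry. The sketch below is a hypothetical illustration, not the paper's reward: it assumes rooms are axis-aligned rectangles given as (x, y, width, height), returns zero for any overlap, and otherwise scores the fraction of area and adjacency constraints satisfied.

```python
from itertools import combinations


def overlap_area(a, b):
    """Area of intersection of two axis-aligned rectangles (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = min(ax + aw, bx + bw) - max(ax, bx)
    dy = min(ay + ah, by + bh) - max(ay, by)
    return max(dx, 0.0) * max(dy, 0.0)


def share_wall(a, b, eps=1e-6):
    """True if the rectangles touch along an edge segment of positive length."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x_overlap = min(ax + aw, bx + bw) - max(ax, bx)
    y_overlap = min(ay + ah, by + bh) - max(ay, by)
    touch_vertical = abs(x_overlap) <= eps and y_overlap > eps
    touch_horizontal = abs(y_overlap) <= eps and x_overlap > eps
    return touch_vertical or touch_horizontal


def reward(rooms, target_areas, required_adjacent, tol=0.05):
    """rooms maps name -> (x, y, w, h). Returns 0.0 if any two rooms overlap,
    otherwise the fraction of area and adjacency checks that pass."""
    if any(overlap_area(rooms[a], rooms[b]) > 0 for a, b in combinations(rooms, 2)):
        return 0.0
    checks = []
    for name, target in target_areas.items():
        _, _, w, h = rooms[name]
        checks.append(abs(w * h - target) <= tol * target)
    for a, b in required_adjacent:
        checks.append(share_wall(rooms[a], rooms[b]))
    return sum(checks) / len(checks) if checks else 1.0


plan = {"living": (0, 0, 4, 5), "kitchen": (4, 0, 3, 3)}
print(reward(plan, {"living": 20, "kitchen": 9}, [("living", "kitchen")]))
```

Because every check is a deterministic geometric test, the reward needs no learned judge, which is the appeal of the verifiable-reward setting.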

2605.14115 2026-05-15 cs.CL

When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

Yikun Han, Mengfei Lan, Halil Kilicoglu

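Affiliations * University of Illinois Urbana-Champaign

AI summary This study examines how large language models behave in biomedical question answering when the retrieved evidence conflicts. Using several controlled evidence conditions, it finds that accuracy drops markedly when models face contradictory information and that predictions flip. The authors therefore evaluate an abstention score that combines model confidence with evidence-conflict detection, which improves selective accuracy in the hardest conditions and underscores the importance of handling evidence conflict for model uncertainty and robustness.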

Comments Accepted by BioNLP 2026

Details
English abstract

Biomedical retrieval-augmented large language models (LLMs) often face evidence that is incomplete, misleading, or internally contradictory, yet evaluation usually emphasizes answer accuracy under helpful context rather than reliability under conflict. Using HealthContradict, we evaluate six open-weight LLMs under five controlled evidence conditions: no retrieved context, correct-only context, incorrect-only context, and two mixed conditions containing both correct and contradictory documents in opposite orders. In this conflicting-evidence order contrast, where the same two documents are both present and only their order is reversed, accuracy drops for every model and 11.4%--25.2% of predictions flip. To support abstention in these difficult cases, we also evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, this score improves selective accuracy over confidence-only, with mean gains of 7.2--33.4 points in incorrect-only (`IC') and 3.6--14.4 points in incorrect-first conflicting (`ICC') conditions across 75%, 50%, and 25% coverage. These results show that conflicting biomedical evidence is both an uncertainty and robustness problem and motivate evaluation and abstention methods that explicitly account for evidence disagreement.
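Selective accuracy at a given coverage is computed by answering only the highest-scoring fraction of questions and abstaining on the rest. The sketch below shows that computation together with a hypothetical conflict-aware score; the linear penalty and its weight are assumptions for illustration, since the abstract does not specify how confidence and the conflict detector are combined.

```python
def selective_accuracy(scores, correct, coverage):
    """Accuracy on the top `coverage` fraction of examples ranked by score.

    scores: higher means more willing to answer; correct: 0/1 per example.
    """
    n_keep = max(1, round(coverage * len(scores)))
    ranked = sorted(zip(scores, correct), key=lambda pair: pair[0], reverse=True)
    kept = ranked[:n_keep]
    return sum(c for _, c in kept) / n_keep


def conflict_aware_score(confidence, conflict_prob, weight=0.5):
    """Hypothetical combination: lower the answer score when a detector
    reports that the retrieved documents disagree."""
    return confidence - weight * conflict_prob


# Made-up example: the second answer is confident, wrong, and conflicted.
confidence = [0.9, 0.8, 0.4, 0.2]
conflict = [0.0, 0.9, 0.0, 0.1]
correct = [1, 0, 1, 0]
adjusted = [conflict_aware_score(c, p) for c, p in zip(confidence, conflict)]
print(selective_accuracy(confidence, correct, 0.5))
print(selective_accuracy(adjusted, correct, 0.5))
```

In this toy case the conflict penalty demotes the confident but wrong answer, so the retained half of the questions is answered more accurately.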

2605.14111 2026-05-15 cs.AI cs.HC

Modeling Bounded Rationality in Drug Shortage Pharmacists Using Attention-Guided Dynamic Decomposition

Yaniv Eliyahu Amiri, Noah Chicoine, Jacqueline Griffin, Stacy Marsella

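Affiliations * Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA; Department of Mechanical and Industrial Engineering, Northeastern University, Boston, MA, USA; Department of Psychology, Northeastern University, Boston, MA, USA

AI summary This paper studies how hospital pharmacists make decisions during drug shortages under uncertainty, time pressure, and patient risk, and proposes an attention-guided dynamic decomposition framework that splits drugs into a high-cost reasoning set and a low-cost monitoring set, making decisions in a bounded-rational way. It builds two models: an expert agent whose attention weights come from pharmacist interviews, and a learner agent that adapts its attention allocation through experience. Experiments show that the approach supports stable decision-making without complete state reasoning, suggesting that the central decision is less about which action to take than about how to allocate cognitive resources.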

Comments Accepted at CogSci 2026. 6 pages plus references, 1 figure, 2 tables

Details
English abstract

Hospital pharmacists make high-stakes decisions to mitigate drug shortages under uncertainty, time pressure, and patient risk. Interviews revealed that pharmacists focus attention on a small subset of drugs, limiting cognitive effort to the most urgent cases. Motivated by these findings, we formalize a bounded-rational, attention-guided decision framework that dynamically decomposes drugs into a subset for high-cost reasoning and a complementary subset for low-cost monitoring. We develop two agents: an Expert Agent that applies attention weights derived from pharmacist interviews, and a Learner Agent that adapts attention allocation over time through experience. Across simulated scenarios spanning short to long horizons, we show that attention-guided planning supports stable decision-making without complete state reasoning. These results suggest that a primary decision is not what action to take, but where to allocate cognitive effort, and that attention-guided, satisficing strategies can reduce problem complexity while maintaining stable performance.