arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.22948 2026-05-26 cs.LG stat.CO stat.ML

Score-Repellent Monte Carlo: Toward Efficient Non-Markovian Sampler with Constant Memory in General State Spaces

Jie Hu, Lingyun Chen, Geeho Kim, Jinyoung Choi, Bohyung Han, Do Young Eun

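AI summary: Proposes the Score-Repellent Monte Carlo (SRMC) framework, which summarizes trajectory history as a running average of score evaluations and builds a surrogate target through an exponential score tilt, giving non-Markovian sampling in constant memory with lower asymptotic variance and better mode coverage.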

Comments Accepted at ICML 2026 (Spotlight); GitHub Repo: https://github.com/srmc-project/Score-Repellent-Monte-Carlo

English abstract

History-dependent sampling can reduce long-run Monte Carlo variance by discouraging redundant revisits, but existing schemes typically encode history through an empirical measure on a finite state space, which is infeasible in high-dimensional discrete configuration spaces or ill-posed in continuous domains. We propose the Score-Repellent Monte Carlo (SRMC) framework, which summarizes trajectory history by a running average of score evaluations in $\mathbb{R}^d$, where $d$ is the dimension of the score and state representation. This history is converted into a surrogate target through an exponential score tilt indexed by $\alpha$, the repellence strength that controls the magnitude of the history-based repulsion. The surrogate family is normalization-free in the standard MCMC sense, yielding a generic wrapper: at each iteration, any base kernel targeting $\pi$ can instead be run on the current surrogate $\pi_{\theta_n}$ while the history is updated online. We analyze the coupled evolution of the history recursion and Monte Carlo estimators using stochastic approximation with controlled Markovian noise, establishing almost sure convergence and a joint central limit theorem. We further identify regimes in which the asymptotic covariance decreases as $\alpha$ increases, with scaling $O(1/\alpha)$, extending the near-zero-variance effect of finite-state history-dependent samplers to general state spaces with constant memory. Experiments on continuous targets and discrete energy-based models demonstrate improved estimator variance and mode coverage, while retaining $O(d)$ memory usage and modest per-iteration overhead.
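
The constant-memory history summary described in this abstract can be sketched as follows. This is a minimal illustration that assumes only what the abstract states, namely that the history is a running average of score evaluations in $\mathbb{R}^d$. The names `gaussian_score` and `update_history` and the standard normal target are hypothetical, and the exponential score tilt is not reproduced because the abstract does not give its exact form.

```python
import numpy as np

def gaussian_score(x):
    # Score (gradient of the log-density) of a standard normal.
    # Illustrative target only; not taken from the paper.
    return -x

def update_history(theta, score, n):
    # Running average after the (n+1)-th score evaluation.
    # Only the current d-dimensional average is stored, so memory is O(d).
    return theta + (score - theta) / (n + 1)

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 3))   # stand-in for a sampler trajectory
theta = np.zeros(3)
for n, x in enumerate(samples):
    theta = update_history(theta, gaussian_score(x), n)
```

The recursion never stores past states, which is the property the abstract contrasts with empirical-measure histories on finite state spaces.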

2604.13088 2026-05-26 cs.LG cs.AI

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

Fei Ding, Yongkang Zhang, Youwei Wang, Zijian Zeng

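AI summary: To address the credit-assignment problem caused by sparse terminal rewards in multi-step LLM reasoning, proposes a counterfactual-comparison framework and Implicit Behavior Policy Optimization (IBPO), which approximate alternative decisions through trajectory differences and convert sparse rewards into step-sensitive signals, improving training stability and reasoning performance.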

English abstract

Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which creates a poorly conditioned credit-assignment problem: the final feedback is propagated uniformly across all intermediate decisions. This leads to high gradient variance, unstable training, and many ineffective updates, ultimately limiting sustained model improvement. We propose a counterfactual-comparison framework for credit assignment. For each input, the framework samples multiple reasoning trajectories and treats their differences as implicit approximations to alternative decisions. This yields an implicit process-level advantage estimator that converts sparse terminal rewards into step-sensitive learning signals. Building on this framework, we introduce Implicit Behavior Policy Optimization (IBPO), which substantially improves training stability and the performance ceiling on mathematical and code-reasoning benchmarks. Our results point to a promising direction for unlocking the reasoning potential of LLMs.

2604.03675 2026-05-26 cs.AI cs.CL cs.IR

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

Erhan Zhang, Yiqun Chen, Zechun Niu, Wei Yang, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

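AI summary: Proposes the OASES framework, which uses outcome-aligned process rewards and search-evaluation co-training to address sparse rewards and unreliable process supervision in agentic search, and outperforms strong reinforcement learning baselines on multi-hop QA benchmarks.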

English abstract

Agentic search enables language models to solve knowledge-intensive tasks by adaptively acquiring external evidence over multiple steps. Reinforcement learning with verifiable rewards (RLVR) has emerged as a widely adopted training paradigm for search agents, yet outcome-only rewards are sparse and provide limited credit assignment for intermediate search actions. Existing process-reward methods therefore seek to densify supervision through proxy signals, external evaluators, or likelihood-based information gain. However, proxy rewards can deviate from the final outcome objective, while fixed evaluators can become stale as the search policy evolves, leading to unreliable process supervision. To address these challenges, we propose OASES, an Outcome-Aligned Search-Evaluation Supervision framework for agentic search. OASES derives outcome-aligned process rewards by evaluating how well each intermediate search state supports answering the original question. It further co-trains the search policy and the state evaluator on-policy, allowing the evaluator to adapt to evolving search behavior and provide more reliable process rewards. Experiments on five multi-hop QA benchmarks show that OASES consistently outperforms strong RL baselines, with further analyses confirming the benefits of outcome-aligned process rewards and search-evaluation co-training.

2603.29236 2026-05-26 cs.CV

M2H-MX: Multi-Task Semantic and Geometric Perception for Real-Time Monocular 3D Scene Graph Construction

U. V. B. L. Udugama, George Vosselman, Francesco Nex

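AI summary: Proposes M2H-MX, a multi-task perception model whose lightweight decoder uses register-gated global context and controlled cross-task interaction so that depth and semantic predictions reinforce each other under strict latency constraints; integrated into monocular SLAM, it substantially improves trajectory accuracy and map quality.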

Comments 6 pages, 5 figures, 5 tables. Preprint under review

English abstract

Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.
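
The two dense-prediction metrics quoted in this abstract, semantic mIoU and depth RMSE, can be sketched as below. This is a generic illustration under the usual definitions, not the authors' evaluation code; the helper names and the tiny arrays are made up, and benchmark-specific details such as ignored labels or depth masking are left out.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    # Average per-class intersection-over-union, skipping classes that
    # appear in neither the prediction nor the ground truth.
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union > 0:
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

def depth_rmse(pred, target):
    # Root mean squared error over all pixels.
    return float(np.sqrt(np.mean((pred - target) ** 2)))

pred_labels = np.array([[0, 0], [1, 1]])
true_labels = np.array([[0, 1], [1, 1]])
miou = mean_iou(pred_labels, true_labels, num_classes=2)

rmse = depth_rmse(np.array([1.0, 2.0]), np.array([1.0, 4.0]))
```

In the toy labels, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.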

2603.18444 2026-05-26 cs.LG cs.AI

Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

Haechan Kim, Soohyun Ryu, Gyouk Chu, Doohyuk Jang, Eunho Yang

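AI summary: To address the low sample efficiency of reinforcement learning with verifiable rewards, proposes Discounted Beta-Bernoulli reward estimation, which uses historical reward statistics to lower estimation variance and avoid variance collapse, and significantly improves performance on multiple reasoning benchmarks.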

Comments 14 pages, 3 figures

English abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we propose Discounted Beta-Bernoulli (DBB) reward estimation, which leverages historical reward statistics for the non-stationary distribution. Although biased, the resulting estimator exhibits reduced and stable variance, theoretically avoids estimated variance collapse, and achieves lower mean squared error than standard point estimation. Extensive experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate that GRPO with DBB consistently outperforms naive GRPO, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on the 1.7B and 8B models, respectively, without additional computational cost or memory usage.
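
One way to picture a discounted Beta-Bernoulli estimator is the sketch below. It is an assumption-laden illustration, not the paper's algorithm: the class name, the uniform prior, the discount value, and the rule of decaying old pseudo-counts before adding new binary rewards are all hypothetical choices made here for concreteness.

```python
class DiscountedBetaBernoulli:
    """Beta posterior over a success probability with exponentially discounted counts."""

    def __init__(self, prior_a=1.0, prior_b=1.0, discount=0.9):
        self.a, self.b, self.discount = prior_a, prior_b, discount

    def update(self, rewards):
        # Decay old pseudo-counts (history from an older policy counts less),
        # then add the new binary rewards (1 = verified correct).
        successes = sum(rewards)
        failures = len(rewards) - successes
        self.a = self.discount * self.a + successes
        self.b = self.discount * self.b + failures

    def mean(self):
        return self.a / (self.a + self.b)

    def variance(self):
        # Bernoulli variance at the posterior mean; it stays positive
        # as long as both pseudo-counts are positive.
        m = self.mean()
        return m * (1.0 - m)

est = DiscountedBetaBernoulli()
est.update([1, 1, 1, 1])            # a group in which every rollout succeeds
point_estimate = sum([1, 1, 1, 1]) / 4
```

For the all-success group, the plain group mean is 1 and its Bernoulli variance is 0, while the smoothed estimate stays strictly inside (0, 1). That is the kind of collapse the abstract says the estimator avoids, at the cost of some bias.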

2603.17198 2026-05-26 cs.LG cs.CL

Structural Abstraction as an Inductive Bias for Non-Stationary Language Model Training

Elnaz Rahmati, Nona Ghazizadeh, Zhivar Sourati, Nina Rouhani, Morteza Dehghani

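AI summary: Proposes Abstraction-Augmented Training (AAT), which jointly optimizes over concrete instances and their structural abstractions to reduce catastrophic interference and improve relational generalization, showing that structural abstraction is an effective stabilizing learning signal in non-stationary language model training.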

English abstract

A foundational principle in cognitive science holds that intelligent agents do not learn by storing experiences as isolated instances, but by forming abstract schemas that capture relational structure shared across situations. Even though this claim is well supported by behavioral and neuroimaging studies, its role as a computational training signal in language models remains underexplored. We target this gap in the setting of non-stationary language model training, asking: does biasing learning toward structural abstraction reduce catastrophic interference and improve relational generalization, as predicted by human results? To study this question, we introduce Abstraction-Augmented Training (AAT), a lightweight loss-level modification that jointly optimizes over concrete instances and their structural abstractions, and two benchmarks, the Relational Cycle Benchmark (RCB) and the Narrative Abstraction Benchmark (NAB). These resources operationalize core cognitive constructs: entity masking as a computational analog of relational alignment, and proverbs as vehicles for implicit abstract meaning that must be inferred across surface-dissimilar situations. Our empirical results demonstrate that AAT consistently reduces forgetting and improves generalization in a pattern that aligns with cognitive predictions for schema-based learning. Beyond the practical implications for continual learning, these results offer preliminary computational evidence that structural abstraction is a signal for stable learning in non-stationary environments.

2603.17044 2026-05-26 cs.LG cs.AI cs.CV

Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models

Abinav Rao, Sujan Rachuri

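AI summary: Systematic experiments show that generation quality is hard to align when DPO is applied to unified multimodal models, mainly because understanding and generation gradients are near-orthogonal with an 11-14x magnitude imbalance that stems from VQ token count asymmetry.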

Comments Experiments are inconclusive: The claim that architectures such as Chameleon or Emu would exhibit stronger gradient conflict is not supported by experiments or analysis, and all experiments are conducted on Janus-Pro without evaluation on other unified multimodal architectures

English abstract

Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck -- supported by the generation DPO loss converging to ln(2) -- and provide practical guidance for practitioners working with VQ-based unified models.
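
Two quantities in this abstract are easy to check numerically: the DPO loss value ln(2) and the cosine between two gradients. The sketch below uses the standard DPO loss on a single preference pair; the function names and the toy gradient vectors are illustrative assumptions, not values from the paper.

```python
import math

def dpo_loss(logratio_chosen, logratio_rejected, beta=0.1):
    # logratio_* = log pi_theta(y|x) - log pi_ref(y|x) for the chosen
    # and the rejected response of one preference pair.
    margin = beta * (logratio_chosen - logratio_rejected)
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# A policy that does not separate chosen from rejected has zero margin.
no_preference = dpo_loss(0.0, 0.0)

# Toy orthogonal gradients with a 12x norm imbalance (made-up numbers).
g_understanding, g_generation = [1.0, 0.0], [0.0, 12.0]
alignment = cosine(g_understanding, g_generation)
```

A zero margin gives exactly ln(2), so a loss that sits at that value indicates the policy is not learning to prefer the chosen samples. Orthogonal gradients mean neither task directly helps or hurts the other, but the larger one dominates a summed update.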

2603.10267 2026-05-26 cs.CV

A Robust Deep Learning Framework for Bangla License Plate Recognition Using YOLO and Vision-Language OCR

Nayeb Hasin, Md. Arafath Rahman Nishat, Mainul Islam, Khandakar Shakib Al Hasan, Asif Newaz

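AI summary: Proposes a robust Bangla license plate recognition system that combines two-stage adaptive training of YOLOv8 with a ViT + BanglaBERT vision-language OCR model, reaching 97.83% accuracy on plate localization and a character error rate of 0.1323 on character recognition.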

Comments Accepted at the 2026 IEEE International Conference on AI and Data Analytics (ICAD 2026). Final version will appear in IEEE Xplore

English abstract

An Automatic License Plate Recognition (ALPR) system is a crucial element of an intelligent traffic management system. However, detecting Bangla license plates remains challenging because of the complicated character scheme and uneven layouts. This paper presents a robust Bangla license plate recognition system that integrates a deep learning-based object detection model for license plate localization with Optical Character Recognition (OCR) for text extraction. Multiple object detection architectures, including U-Net and several YOLO (You Only Look Once) variants, are compared for license plate localization. This study proposes a novel two-stage adaptive training strategy built upon the YOLOv8 architecture to improve localization performance. The proposed approach outperforms the established models, achieving an accuracy of 97.83% and an Intersection over Union (IoU) of 91.3%. Text recognition is formulated as a sequence generation problem with a VisionEncoderDecoder architecture, and several encoder-decoder combinations are evaluated. The ViT + BanglaBERT model gives better results at the character level, with a Character Error Rate of 0.1323 and a Word Error Rate of 0.1068. The proposed system also performs consistently when tested on an external dataset curated for this study. That dataset has completely different environmental and lighting conditions from the training sample, indicating the robustness of the proposed framework. Overall, our proposed system provides a robust and reliable solution for Bangla license plate recognition and performs effectively across diverse real-world scenarios, including variations in lighting, noise, and plate styles. These strengths make it well suited for deployment in intelligent transportation applications such as automated law enforcement and access control.
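
Character Error Rate and Word Error Rate, as reported in this abstract, are conventionally computed from the edit distance between the reference and the recognized text. The sketch below follows that common definition. It is not the authors' evaluation script, the plate strings are made-up placeholders, and it compares Python code points, which for Bangla text may differ from a grapheme-level count.

```python
def edit_distance(ref, hyp):
    # Levenshtein distance between two sequences (strings or word lists),
    # computed row by row to keep memory linear in len(hyp).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    # Character errors divided by the number of reference characters.
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    # Word errors divided by the number of reference words.
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

reference, recognized = "DHAKA 1234", "DHAKA 1284"   # placeholder plate text
plate_cer = cer(reference, recognized)
plate_wer = wer(reference, recognized)
```

Here one substituted digit out of ten characters gives a CER of 0.1, while one wrong word out of two gives a WER of 0.5.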

2603.09458 2026-05-26 cs.RO

Stein Variational Ergodic Surface Coverage with SE(3) Constraints

Jiayun Li, Yufeng Jin, Sangli Teng, Dejian Gong, Georgia Chalvatzaki

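AI summary: Proposes a sampling-as-optimization method based on preconditioned SE(3) Stein Variational Gradient Descent that generates ergodic trajectories satisfying SE(3) constraints and achieves high-quality coverage of complex 3D point-cloud surfaces.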

English abstract

Surface manipulation tasks require robots to generate trajectories that comprehensively cover complex 3D surfaces while maintaining precise end-effector poses. Existing ergodic trajectory optimization (TO) methods succeed in coverage tasks but struggle with point-cloud targets because of nonconvex optimization landscapes and the inadequate handling of SE(3) constraints in sampling-as-optimization (SAO) techniques. In this work, we introduce a preconditioned SE(3) Stein Variational Gradient Descent (SVGD) approach for SAO ergodic trajectory generation. Our approach makes three contributions. First, we reformulate point-cloud ergodic coverage as a manifold-aware sampling problem. Second, we derive SE(3)-specific SVGD particle updates. Third, we develop a preconditioner to accelerate TO convergence. Our sampling-based framework consistently identifies superior local optima compared to strong optimization-based and SAO baselines while preserving the SE(3) geometric structure. Experiments on a 3D point-cloud surface coverage benchmark and robotic surface drawing tasks demonstrate that, relative to existing TO and SAO approaches, our method achieves superior coverage quality with tractable computation in our setting, and it is validated in real-world robot experiments.
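
For readers unfamiliar with SVGD, the sketch below shows the standard Euclidean particle update with an RBF kernel. It is deliberately not the paper's method: there is no SE(3) structure and no preconditioner, and the fixed bandwidth `h`, the step size, and the standard normal target are assumptions chosen for illustration.

```python
import numpy as np

def svgd_step(x, grad_log_p, step=0.1, h=1.0):
    # x: (n, d) particles. RBF kernel k(a, b) = exp(-||a - b||^2 / h).
    diff = x[:, None, :] - x[None, :, :]            # diff[i, j] = x_i - x_j
    k = np.exp(-np.sum(diff ** 2, axis=-1) / h)     # (n, n), symmetric
    drive = k @ grad_log_p(x)                       # kernel-weighted scores (attraction)
    repulse = (2.0 / h) * np.sum(k[:, :, None] * diff, axis=1)  # kernel gradient (repulsion)
    return x + step * (drive + repulse) / x.shape[0]

def standard_normal_score(x):
    return -x

particles = np.array([[4.0], [5.0], [6.0]])         # start far from the target
after_one = svgd_step(particles, standard_normal_score)

x = particles.copy()
for _ in range(500):
    x = svgd_step(x, standard_normal_score)
```

The attraction term pulls particles toward high-density regions and the repulsion term keeps them apart, so a set of particles spreads over the target rather than collapsing to a single mode.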

2603.08011 2026-05-26 cs.CV

It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models

Jaeha Choi, Jin Won Lee, Siwoo You, Jangho Lee

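AI summary: To address the difficulty vision-language models have reading analog clocks in real-world environments, proposes the TickTockVQA dataset and the Swap-DPO fine-tuning framework, which substantially improve clock-reading accuracy and robustness.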

Comments Accepted to CVPR 2026 Findings

English abstract

Advances in vision-language models (VLMs) have achieved remarkable success on complex multimodal reasoning tasks, leading to the assumption that they should also excel at reading analog clocks. However, contrary to this expectation, our study reveals that reading analog clocks in real-world environments remains a significant challenge for state-of-the-art VLMs. Existing analog clock datasets are largely synthetic or planar with limited stylistic diversity and minimal background context, failing to capture the visual variability of real-world scenes. As a result, VLMs trained on such data exhibit weak spatiotemporal reasoning, frequently confusing the hour and minute hands and struggling under common visual conditions such as occlusion, lighting variation, and cluttered backgrounds. To address this issue, we introduce TickTockVQA, a human-annotated dataset containing analog clocks in diverse real-world scenarios. TickTockVQA provides explicit hour and minute annotations, and includes an AM/PM tag when it is inferable from the visual context. Furthermore, we propose Swap-DPO, a direct preference optimization-based fine-tuning framework to align model reasoning toward accurate time interpretation. Experimental results demonstrate that our approach substantially enhances clock reading accuracy and robustness under real-world conditions, establishing a foundation for future research on spatiotemporal reasoning and visual understanding in VLMs.

2602.20725 2026-05-26 cs.CV

Bridging Rendering and Generative Modeling with Monte Carlo Transport Scheduling

Junwei Shu, Wenjie Liu, Hantang Liu, Changbo Wang, Yang Li

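AI summary: Proposes the Monte Carlo Transport Scheduling framework, which treats progressive path tracing as a continuous sampling-driven transport process, trains on real render endpoints to enable arbitrary-step neural refinement, and transfers to generative models as a physical prior.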

Comments preprint

English abstract

Monte Carlo rendering and modern generative models both transform uncertain states into structured images, yet they are usually studied as separate processes. We introduce Monte Carlo Transport Scheduling, a framework that treats progressive path tracing as a continuous sampling-driven transport process. Our key observation is that the renderer already produces physically valid states along this process: nested Monte Carlo estimates trace a refinement trajectory whose natural time coordinate follows from sampling variance. This view leads to a continuous training framework that learns from real render endpoints rather than synthetic interpolants, preserving the statistical structure of Monte Carlo estimation while enabling arbitrary-step neural refinement. We evaluate the framework on a controlled rendering benchmark designed to separate transport difficulty from scene context, and show that it yields stable render refinement, supports continuous stopping between rendering states, and transfers as a physical prior for frozen generative samplers. These results suggest a common continuous-time substrate for rendering and generation, where Monte Carlo sampling provides both the physical states and the supervision for learning image transport.
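
The idea of nested Monte Carlo estimates whose progress is indexed by sampling variance can be illustrated with prefix means. This sketch is only a loose analogy built from the abstract: the synthetic "radiance" samples, the sample counts, and the choice of 1/n as a time coordinate are all assumptions, and the paper's actual schedule may be defined differently.

```python
import numpy as np

def nested_estimates(samples, counts):
    # Prefix means: each estimate reuses every sample of the coarser one,
    # so the states form a single refinement trajectory.
    running = np.cumsum(samples, axis=0)
    return [running[n - 1] / n for n in counts]

rng = np.random.default_rng(0)
radiance = rng.exponential(scale=1.0, size=(64, 4))   # 64 noisy samples for 4 pixels (synthetic)
counts = [1, 4, 16, 64]
states = nested_estimates(radiance, counts)

# For i.i.d. samples the variance of an n-sample mean is sigma^2 / n,
# so the relative variance 1 / n is one hypothetical time coordinate.
times = [1.0 / n for n in counts]
```

Each state is a valid, if noisy, render of the same scene, which matches the abstract's point that the renderer already produces physically valid intermediate states.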

2602.10090 2026-05-26 cs.AI cs.CL cs.LG

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He

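AI summary: Proposes Agent World Model (AWM), a fully synthetic environment generation pipeline; large-scale reinforcement learning in its code-driven, database-backed environments lets agents generalize across diverse everyday scenarios.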

Comments Accepted to ICML 2026

English abstract

Recent advances in large language models (LLMs) have empowered autonomous agents to perform multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
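
What "code-driven and backed by databases" might look like in miniature is sketched below. This toy environment is entirely hypothetical and is not taken from the AWM repository: the `TodoEnv` class, its tool names, and its schema are invented here to show how executable tools give deterministic state transitions and how a reward can be verified by querying the database.

```python
import sqlite3

class TodoEnv:
    """Toy code-driven environment whose state lives in an in-memory SQLite database."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE todos (id INTEGER PRIMARY KEY, title TEXT, done INTEGER)")
        self.db.execute("INSERT INTO todos (title, done) VALUES ('file report', 0)")

    def call_tool(self, name, **args):
        # Tools are ordinary functions, so the same call on the same state
        # always yields the same transition and observation.
        if name == "list_todos":
            return self.db.execute("SELECT id, title, done FROM todos").fetchall()
        if name == "complete_todo":
            self.db.execute("UPDATE todos SET done = 1 WHERE id = ?", (args["todo_id"],))
            return "ok"
        return f"unknown tool: {name}"

    def reward(self):
        # The task counts as solved only if the database says so.
        (remaining,) = self.db.execute(
            "SELECT COUNT(*) FROM todos WHERE done = 0").fetchone()
        return 1.0 if remaining == 0 else 0.0

env = TodoEnv()
reward_before = env.reward()
observation = env.call_tool("list_todos")
result = env.call_tool("complete_todo", todo_id=1)
reward_after = env.reward()
```

Because the reward reads the stored state instead of asking a model to judge the transcript, it cannot be satisfied by an agent that merely claims the task is done.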

2602.09620 2026-05-26 cs.AI cs.LO

FLINGO -- Instilling ASP Expressiveness into Linear Integer Constraints

Jorge Fandinno, Pedro Cabalar, Philipp Wanko, Torsten Schaub

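AI summary: Presents the FLINGO language and tool, which extend Constraint Answer Set Programming by bringing ASP expressiveness (default values, undefined attributes, non-deterministic choice, and aggregates) into numerical constraints, with a translation to the clingcon format.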

Comments To appear in Theory and Practice of Logic Programming

English abstract

Constraint Answer Set Programming (CASP) is a hybrid paradigm that enriches Answer Set Programming (ASP) with numerical constraint processing, a crucial requirement for many real-world applications. However, the specification of constraints in most CASP solvers aligns more closely with the expressiveness and semantics of the numerical back-end than the ASP paradigm. In the latter, numerical attributes are represented as predicates, which allows declaring default values, leaving the attribute undefined, making non-deterministic assignments with choice rules, or using aggregated values. In CASP, most (if not all) of these features are lost once we switch to a constraint-based representation of those same attributes. In this paper, we present the flingo language (and tool) that incorporates the aforementioned expressiveness within numerical constraints, and we illustrate its use with several examples. Based on previous work that established its semantic foundations, we also present a translation from the newly introduced flingo syntax to regular CASP programs following the clingcon input format.

2602.03955 2026-05-26 cs.AI cs.MA

AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent

Yinyi Luo, Yiqiao Jin, Weichen Yu, Mengqi Zhang, Srijan Kumar, Xiaoxiao Li, Weijie Xu, Xin Chen, Jindong Wang

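AI summary: Proposes the AgentArk framework, which uses three hierarchical distillation strategies to distill the interaction dynamics of multi-agent systems into the weights of a single model, giving a single agent multi-agent reasoning and self-correction abilities while remaining computationally efficient.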

English abstract

While large language model (LLM) multi-agent systems achieve superior reasoning performance through iterative debate, practical deployment is limited by their high computational cost and error propagation. This paper proposes AgentArk, a novel framework to distill multi-agent dynamics into the weights of a single model, effectively transforming explicit test-time interactions into implicit model capabilities. This equips a single agent with the intelligence of multi-agent systems while remaining computationally efficient. Specifically, we investigate three hierarchical distillation strategies across various models, tasks, scaling, and scenarios: reasoning-enhanced fine-tuning; trajectory-based augmentation; and process-aware distillation. By shifting the burden of computation from inference to training, the distilled models preserve the efficiency of one agent while exhibiting strong reasoning and self-correction performance of multiple agents. They further demonstrate enhanced robustness and generalization across diverse reasoning tasks. We hope this work can shed light on future research on efficient and robust multi-agent development. Our code is at https://github.com/AIFrontierLab/AgentArk.

2602.02839 2026-05-26 cs.RO

Language Movement Primitives: Grounding Language Models in Robot Motion

Yinlong Dai, Benjamin A. Christie, Daniel J. Evans, Dylan P. Losey, Simon Stepputtis

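AI summary: Proposes the Language Movement Primitives (LMP) framework, which combines vision-language model (VLM) reasoning with Dynamic Movement Primitive (DMP) parameterization to perform zero-shot robot manipulation tasks.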

English abstract

Enabling robots to perform novel manipulation tasks from natural language instructions remains a fundamental challenge in robotics, despite significant progress in generalized problem solving with foundational models. Large vision and language models (VLMs) are capable of processing high-dimensional input data for visual scene and language understanding, as well as decomposing tasks into a sequence of logical steps; however, they struggle to ground those steps in embodied robot motion. On the other hand, robotics foundation models output action commands, but require in-domain fine-tuning or experience before they are able to perform novel tasks successfully. At its core, there still remains the fundamental challenge of connecting abstract task reasoning with low-level motion control. To address this disconnect, we propose Language Movement Primitives (LMPs), a framework that grounds VLM reasoning in Dynamic Movement Primitive (DMP) parameterization. Our key insight is that DMPs provide a small number of interpretable parameters, and VLMs can set these parameters to specify diverse, continuous, and stable trajectories. Put another way: VLMs can reason over free-form natural language task descriptions, and semantically ground their desired motions into DMPs -- bridging the gap between high-level task reasoning and low-level position and velocity control. Building on this combination of VLMs and DMPs, we formulate our LMP pipeline for zero-shot robot manipulation that effectively completes tabletop manipulation problems by generating a sequence of DMP motions. Across 31 real-world manipulation tasks, we show that LMP achieves 65% task success as compared to 35% for the best performing baseline. See videos at our website: https://collab.me.vt.edu/lmp
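
To make the "small number of interpretable parameters" concrete, the sketch below rolls out a textbook one-dimensional DMP: a critically damped spring toward a goal plus a phase-gated forcing term. The gains, the Euler integration, and the `forcing` callback are standard choices assumed here for illustration. Which parameters the VLM actually sets in LMP, and how, is not specified in the abstract.

```python
def rollout_dmp(y0, goal, forcing=lambda phase: 0.0, tau=1.0,
                alpha=25.0, beta=6.25, alpha_x=3.0, dt=0.001, steps=3000):
    # One-dimensional DMP integrated with explicit Euler steps.
    y, z, x = y0, 0.0, 1.0            # position, scaled velocity, canonical phase
    trajectory = [y]
    for _ in range(steps):
        f = forcing(x) * x * (goal - y0)              # forcing vanishes as phase -> 0
        dz = (alpha * (beta * (goal - y) - z) + f) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z, y, x = z + dz * dt, y + dy * dt, x + dx * dt
        trajectory.append(y)
    return trajectory

# The goal and the forcing shape are the kind of values a planner could choose.
straight = rollout_dmp(y0=0.0, goal=1.0)
shaped = rollout_dmp(y0=0.0, goal=1.0, forcing=lambda phase: 50.0)
```

Because the forcing term is multiplied by a decaying phase, any choice of shape still converges to the goal, which is the stability property the abstract attributes to DMP trajectories.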

2602.02009 2026-05-26 cs.LG

Logic-Guided Vector Fields for Constrained Generative Modeling

逻辑引导的向量场用于约束生成建模

Ali Baheri

AI总结 提出逻辑引导向量场(LGVF)框架,通过可微逻辑约束松弛注入流匹配生成模型,结合训练时逻辑损失和推理时梯度调整,在三个约束生成案例中减少59-82%的约束违反。

详情
AI中文摘要

神经符号系统旨在结合符号逻辑的表达结构与神经学习的灵活性;然而,生成模型通常缺乏在生成时强制执行声明性约束的机制。我们提出了逻辑引导向量场(LGVF),这是一个神经符号框架,将符号知识(指定为逻辑约束的可微松弛)注入流匹配生成模型。LGVF耦合了两种互补机制:(1)训练时逻辑损失,惩罚连续流轨迹上的约束违反,权重强调目标分布附近的正确性;(2)推理时调整,使用约束梯度引导采样,作为对学习动力学的轻量级、逻辑信息校正。我们在三个约束生成案例研究上评估了LGVF,涵盖线性、非线性和多区域可行性约束。在所有设置中,与标准流匹配相比,LGVF将约束违反减少了59-82%,并在每种情况下实现了最低的违反率。在线性和环形设置中,LGVF还通过MMD衡量提高了分布保真度,而在多障碍物设置中,我们观察到满意度-保真度权衡,可行性提高但MMD增加。除了定量收益外,LGVF还产生了具有约束意识的向量场,表现出新兴的避障行为,无需显式路径规划即可将样本绕过禁止区域。

英文摘要

Neuro-symbolic systems aim to combine the expressive structure of symbolic logic with the flexibility of neural learning; yet, generative models typically lack mechanisms to enforce declarative constraints at generation time. We propose Logic-Guided Vector Fields (LGVF), a neuro-symbolic framework that injects symbolic knowledge, specified as differentiable relaxations of logical constraints, into flow matching generative models. LGVF couples two complementary mechanisms: (1) a training-time logic loss that penalizes constraint violations along continuous flow trajectories, with weights that emphasize correctness near the target distribution; and (2) an inference-time adjustment that steers sampling using constraint gradients, acting as a lightweight, logic-informed correction to the learned dynamics. We evaluate LGVF on three constrained generation case studies spanning linear, nonlinear, and multi-region feasibility constraints. Across all settings, LGVF reduces constraint violations by 59-82% compared to standard flow matching and achieves the lowest violation rates in each case. In the linear and ring settings, LGVF also improves distributional fidelity as measured by MMD, while in the multi-obstacle setting, we observe a satisfaction-fidelity trade-off, with improved feasibility but increased MMD. Beyond quantitative gains, LGVF yields constraint-aware vector fields exhibiting emergent obstacle-avoidance behavior, routing samples around forbidden regions without explicit path planning.
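The inference-time adjustment can be pictured with a small sketch: an Euler sampler whose velocity is corrected by the gradient of a differentiable constraint penalty. Everything specific here is an assumption for illustration, not the LGVF code: the constant `field` stands in for a learned vector field, the constraint is a single circular forbidden region with penalty max(0, radius - distance)^2, and `lam` is a hypothetical guidance strength.

```python
import math

def violation_grad(x, center, radius):
    """Gradient of the soft penalty max(0, radius - ||x - center||)^2."""
    dx, dy = x[0] - center[0], x[1] - center[1]
    dist = math.hypot(dx, dy)
    depth = max(0.0, radius - dist)
    if depth == 0.0 or dist == 0.0:
        return (0.0, 0.0)
    scale = -2.0 * depth / dist
    return (scale * dx, scale * dy)

def sample(start, velocity_field, center, radius, lam, steps=100):
    """Euler integration of the flow with a constraint-gradient correction."""
    x, dt = start, 1.0 / steps
    min_dist = math.hypot(x[0] - center[0], x[1] - center[1])
    for k in range(steps):
        v = velocity_field(x, k * dt)
        g = violation_grad(x, center, radius)
        # Step along the field, then descend the penalty (pushes out of the region).
        x = (x[0] + dt * (v[0] - lam * g[0]), x[1] + dt * (v[1] - lam * g[1]))
        min_dist = min(min_dist, math.hypot(x[0] - center[0], x[1] - center[1]))
    return x, min_dist

field = lambda x, t: (4.0, 0.0)  # stand-in for a learned vector field
_, unguided = sample((-2.0, 0.1), field, (0.0, 0.0), 1.0, lam=0.0)
_, guided = sample((-2.0, 0.1), field, (0.0, 0.0), 1.0, lam=5.0)
```

In this toy setting the unguided path cuts almost through the center of the forbidden disk, while the guided path is deflected around it. The correction only softens violations, so the sample can still dip into the region.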

2602.01576 2026-05-26 cs.LG cs.AI cs.CV

Generative Visual Code Mobile World Models

生成式视觉代码移动世界模型

Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin

AI总结 提出通过单一视觉语言模型预测可执行网页代码来生成移动GUI下一状态,结合文本和视觉世界模型优势,实现高保真视觉生成与精确文本渲染。

Comments ICML 2026

详情
AI中文摘要

移动图形用户界面世界模型为在训练和推理时提升移动GUI代理性能提供了有前景的路径。然而,当前方法面临关键权衡:基于文本的世界模型牺牲了视觉保真度,而视觉世界模型在精确文本渲染上的不足导致其依赖缓慢、复杂的流水线和大量外部模型。我们提出一种新范式:通过可渲染代码生成进行视觉世界建模,其中单一视觉语言模型预测下一个GUI状态为可执行网页代码,该代码渲染为像素,而非直接生成像素。这结合了两种方法的优势:视觉语言模型保留其语言先验以实现精确文本渲染,同时其在结构化网页代码上的预训练实现了高保真视觉生成。我们推出了gWorld(8B、32B),这是基于该范式的首个开源权重视觉移动GUI世界模型,以及一个自动合成基于代码的训练数据的数据生成框架(gWorld)。在4个分布内和2个分布外基准测试的广泛评估中,gWorld在准确率与模型规模之间建立了新的帕累托前沿,性能优于8个前沿开源权重模型(其规模大50.25倍以上)。进一步分析表明:(1)通过gWorld扩展训练数据带来有意义的收益;(2)我们流水线的每个组件都提高了数据质量;(3)更强的世界建模提升了下游移动GUI策略性能。

英文摘要

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while the inability of visual WMs to render text precisely has led them to rely on slow, complex pipelines dependent on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in- and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.

2602.01086 2026-05-26 cs.AI cs.CR cs.DB cs.DC cs.SE

MedBeads: An Agent-Native, Immutable Data Substrate for Trustworthy Medical AI

MedBeads:面向可信医疗AI的智能体原生不可变数据基底

Takahito Nakajima

AI总结 针对医疗AI中电子病历与智能体间的上下文不匹配问题,提出基于Merkle有向无环图的不可变数据架构MedBeads,通过确定性图遍历替代概率检索,实现可审计、防篡改的临床上下文提供。

Comments 19 pages, 5 figures. Code available at https://github.com/medbeads/medbeads

详情
AI中文摘要

背景:截至2026年,大型语言模型(LLM)展现出专家级医学知识。然而,将其部署为自主“临床智能体”仍受限。当前的电子病历(EMR)及FHIR等标准专为人工审阅设计,导致“上下文不匹配”:AI智能体接收碎片化数据,必须依赖概率推理(如RAG)重建患者病史。该方法引发幻觉并阻碍可审计性。方法:我们提出MedBeads,一种智能体原生数据基础设施,其中临床事件是不可变的“珠子”——Merkle有向无环图(DAG)中的节点——通过密码学引用因果前驱。这种“一次写入、多次读取”架构使篡改在数学上可检测。我们实现了原型,包含Go核心引擎、用于LLM集成的Python中间件以及基于React的可视化界面。结果:我们使用合成数据成功实现了工作流。FHIR到DAG的转换将扁平资源转化为因果关联图。我们的广度优先搜索(BFS)上下文检索算法以O(V+E)复杂度遍历相关子图,支持实时决策支持。防篡改特性由设计保证:任何修改都会破坏密码学链。可视化通过显式因果链接帮助临床医生理解。结论:MedBeads通过从概率搜索转向确定性图遍历、从可变记录转向不可变链,解决了“上下文不匹配”,为“可信医疗AI”提供了基底。它保证了AI接收的上下文是确定且防篡改的,而LLM负责解释。结构化的珠子格式充当了令牌高效的“AI原生语言”。我们将MedBeads作为开源软件发布,以加速智能体原生数据标准。

英文摘要

Background: As of 2026, Large Language Models (LLMs) demonstrate expert-level medical knowledge. However, deploying them as autonomous "Clinical Agents" remains limited. Current Electronic Medical Records (EMRs) and standards like FHIR are designed for human review, creating a "Context Mismatch": AI agents receive fragmented data and must rely on probabilistic inference (e.g., RAG) to reconstruct patient history. This approach causes hallucinations and hinders auditability. Methods: We propose MedBeads, an agent-native data infrastructure where clinical events are immutable "Beads"--nodes in a Merkle Directed Acyclic Graph (DAG)--cryptographically referencing causal predecessors. This "write-once, read-many" architecture makes tampering mathematically detectable. We implemented a prototype with a Go Core Engine, Python middleware for LLM integration, and a React-based visualization interface. Results: We successfully implemented the workflow using synthetic data. The FHIR-to-DAG conversion transformed flat resources into a causally-linked graph. Our Breadth-First Search (BFS) Context Retrieval algorithm traverses relevant subgraphs with O(V+E) complexity, enabling real-time decision support. Tamper-evidence is guaranteed by design: any modification breaks the cryptographic chain. The visualization aids clinician understanding through explicit causal links. Conclusion: MedBeads addresses the "Context Mismatch" by shifting from probabilistic search to deterministic graph traversal, and from mutable records to immutable chains, providing the substrate for "Trustworthy Medical AI." It guarantees the context the AI receives is deterministic and tamper-evident, while the LLM determines interpretation. The structured Bead format serves as a token-efficient "AI-native language." We release MedBeads as open-source software to accelerate agent-native data standards.
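The core data-structure idea, content-addressed events that reference their causal predecessors and are read back by breadth-first traversal, can be sketched in a few lines. This is not the MedBeads Go engine: the bead layout, the function names, and the synthetic payload strings below are assumptions for illustration.

```python
import hashlib
import json
from collections import deque

def bead_id(body):
    """Content address: SHA-256 over a canonical JSON encoding of the bead."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def add_bead(store, payload, parents=()):
    """Append a bead whose identity covers its payload and its parents' hashes."""
    body = {"payload": payload, "parents": sorted(parents)}
    key = bead_id(body)
    store[key] = body
    return key

def is_intact(store):
    """Any edit to a stored bead makes its recomputed hash differ from its key."""
    return all(bead_id(body) == key for key, body in store.items())

def retrieve_context(store, start):
    """Breadth-first walk over causal predecessors; each node and edge is visited once."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        key = queue.popleft()
        order.append(store[key]["payload"])
        for parent in store[key]["parents"]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order

store = {}
visit = add_bead(store, "visit")
order = add_bead(store, "lab order", [visit])
result = add_bead(store, "lab result", [order])
context = retrieve_context(store, result)
```

Because a child stores its parents' hashes, re-hashing an altered bead would change its key and leave the children pointing at an identifier that no longer exists, which is what makes tampering detectable.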

2601.22984 2026-05-26 cs.AI

Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory

为什么你的深度研究智能体会失败?关于完整研究轨迹中的幻觉评估

Yuhao Zhan, Tianyu Fan, Linxuan Huang, Zirui Guo, Chao Huang

AI总结 针对深度研究智能体(DRA)在完整研究轨迹中累积的幻觉问题,提出从结果评估转向过程感知评估的PING分类法和细粒度评估框架,并构建DeepHalluBench基准,实验揭示系统性的可靠性差距。

详情
AI中文摘要

诊断深度研究智能体(DRA)的失败模式仍然是一个关键挑战。现有基准主要依赖端到端评估,掩盖了在研究轨迹中累积的中间幻觉。为弥补这一差距,我们提出从基于结果的评估转向过程感知评估,通过审计完整计划-搜索-总结轨迹中的幻觉。我们引入PING分类法,将DRA幻觉分为四种互补类型:传播、意图、噪声诱导和接地。我们进一步将该分类法实例化为一个细粒度评估框架,将轨迹分解为原子动作、声明和子查询以进行严格验证。利用该框架隔离100个特别容易产生幻觉的任务(包括对抗性场景),我们策划了DeepHalluBench。对六个代表性DRA的实验表明,在我们的幻觉压力测试集上,所有评估系统仍表现出不可忽视的可靠性差距。此外,我们的诊断分析将这些失败追溯到系统性缺陷,特别是幻觉传播和认知偏差,为未来的架构优化提供了可操作的见解。代码和数据可在https://github.com/yuhao-zhan/DeepHalluBench获取。

英文摘要

Diagnosing failure patterns in Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring intermediate hallucinations that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing hallucinations in the full plan-search-summarize trajectory. We introduce the PING Taxonomy, which categorizes DRA hallucinations into four complementary types: Propagation, Intent, Noise-induced, and Grounding. We further instantiate this taxonomy into a fine-grained evaluation framework that decomposes trajectories into atomic actions, claims, and sub-queries for rigorous verification. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks including adversarial scenarios, we curate DeepHalluBench. Experiments on six representative DRAs show that, on our hallucination-prone stress-test set, all evaluated systems still exhibit non-negligible reliability gaps. Furthermore, our diagnostic analysis traces these failures to systemic deficits, especially hallucination propagation and cognitive biases, providing actionable insights for future architectural optimization. Code and data are available at https://github.com/yuhao-zhan/DeepHalluBench.

2601.21726 2026-05-26 cs.AI

DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting

DropoutTS: 用于鲁棒时间序列预测的样本自适应Dropout

Siru Zhong, Yiqiu Liu, Zhiqing Cui, Zezhi Shao, Fei Wang, Qingsong Wen, Yuxuan Liang

AI总结 针对深度时间序列模型对噪声敏感的问题,提出一种模型无关的插件DropoutTS,通过频谱稀疏性量化实例级噪声并动态调整Dropout率,在抑制伪波动的同时保持细粒度保真度,显著提升模型鲁棒性且几乎不增加参数。

详情
AI中文摘要

深度时间序列模型容易受到现实应用中普遍存在的噪声数据的影响。现有的鲁棒性策略要么修剪数据,要么依赖昂贵的先验量化,无法在有效性和效率之间取得平衡。在本文中,我们引入了DropoutTS,一种模型无关的插件,它将范式从学习“什么”转变为学习“多少”。DropoutTS采用样本自适应Dropout机制:利用频谱稀疏性通过重建残差高效量化实例级噪声,它通过将噪声映射到自适应Dropout率来动态校准模型学习能力——选择性地抑制伪波动,同时保持细粒度保真度。跨不同噪声场景和开放基准的大量实验表明,DropoutTS持续提升优秀骨干模型的性能,在几乎不增加参数且无需修改架构的情况下提供先进的鲁棒性。我们的代码可在https://github.com/CityMind-Lab/DropoutTS获取。

英文摘要

Deep time series models are vulnerable to noisy data ubiquitous in real-world applications. Existing robustness strategies either prune data or rely on costly prior quantification, failing to balance effectiveness and efficiency. In this paper, we introduce DropoutTS, a model-agnostic plugin that shifts the paradigm from "what" to learn to "how much" to learn. DropoutTS employs a Sample-Adaptive Dropout mechanism: leveraging spectral sparsity to efficiently quantify instance-level noise via reconstruction residuals, it dynamically calibrates model learning capacity by mapping noise to adaptive dropout rates, selectively suppressing spurious fluctuations while preserving fine-grained fidelity. Extensive experiments across diverse noise regimes and open benchmarks show that DropoutTS consistently boosts the performance of strong backbones, delivering advanced robustness with negligible parameter overhead and no architectural modifications. Our code is available at https://github.com/CityMind-Lab/DropoutTS.
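One way to read "spectral sparsity ... reconstruction residuals ... adaptive dropout rates" is sketched below. It is a guess at the general shape, not the DropoutTS algorithm: keeping the `keep` strongest DFT bins, using the residual energy fraction as the noise score, and the linear map onto `[p_min, p_max]` are all assumptions.

```python
import cmath
import math
import random

def noise_score(series, keep=2):
    """Fraction of signal energy left outside the `keep` strongest DFT bins.

    By Parseval's theorem this equals the relative energy of the residual
    after reconstructing the series from only those bins.
    """
    n = len(series)
    power = []
    for k in range(n):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(series))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0.0:
        return 0.0
    kept = sum(sorted(power, reverse=True)[:keep])
    return min(1.0, max(0.0, 1.0 - kept / total))

def adaptive_dropout_rate(series, p_min=0.05, p_max=0.5, keep=2):
    """Map the per-sample noise score linearly onto a dropout rate."""
    return p_min + (p_max - p_min) * noise_score(series, keep)

rng = random.Random(0)
clean = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
noisy = [v + rng.gauss(0.0, 0.5) for v in clean]
rate_clean = adaptive_dropout_rate(clean)
rate_noisy = adaptive_dropout_rate(noisy)
```

A pure sinusoid is captured by two bins, so it receives the minimum rate, while the noisy copy spreads energy across the spectrum and receives a higher one.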

2601.21670 2026-05-26 cs.CV cs.LG

Diverse via bounded Agreement: Geometric Regularization for Multimodal Fusion

通过有界一致性实现多样性:多模态融合的几何正则化

Zixuan Xia, Hao Wang, Pengcheng Weng, Yanyu Qian, Yangxin Xu, William Dan, Fei Wang

AI总结 提出一种轻量级即插即用的几何正则化框架,通过有界一致性原则在保持模态特异多样性的同时约束跨模态漂移,提升多模态融合性能。

详情
AI中文摘要

多模态融合通常被视为一个优化平衡问题,通过调整训练信号防止一种模态主导其他模态。然而,平衡优化并不能完全决定中间表示的几何结构。有监督的多模态模型仍可能学习到低多样性的模态特定嵌入,或允许配对的跨模态观测过度分离,从而削弱单模态鲁棒性和多模态融合。 我们引入了\regName,一个轻量级即插即用的多模态表示学习几何正则化框架。\regName不强制执行严格的跨模态对齐,而是遵循有界一致性原则:在仅软约束超过允许一致性带的配对跨模态漂移部分的同时,保留模态特定多样性。在操作上,\regName结合了一个分散项(减轻谱集中度)和一个一致性带锚定项(控制过度配对漂移),无需架构修改或推理时开销。 在音频-视觉、图像-文本和基于RF的基准测试上的实验表明,\regName一致地提高了多模态性能,并常常增强单模态表示。这些结果表明,显式调节表示几何是优化平衡的有效补充,并提供了几何感知正则化可以改善跨不同架构和领域的多模态学习的证据。

英文摘要

Multimodal fusion is often treated as an optimization-balancing problem, where training signals are adjusted to prevent one modality from dominating the others. However, balanced optimization does not fully determine the geometry of intermediate representations. Supervised multimodal models may still learn low-diversity modality-specific embeddings or allow paired cross-modal observations to drift excessively apart, weakening both unimodal robustness and multimodal fusion. We introduce \regName, a lightweight plug-and-play geometric regularization framework for multimodal representation learning. Rather than enforcing rigid cross-modal alignment, \regName follows a bounded-agreement principle: preserve modality-specific diversity while softly constraining only the portion of paired cross-modal drift that exceeds an admissible agreement band. Operationally, \regName combines a dispersion term that mitigates spectral concentration with an agreement-band anchoring term that controls excessive paired drift, requiring no architectural modification or inference-time overhead. Experiments across audio-visual, image-text, and RF-based benchmarks show that \regName consistently improves multimodal performance and often strengthens unimodal representations. These results suggest that explicitly regulating representation geometry is an effective complement to optimization balancing, and provide evidence that geometry-aware regularization can improve multimodal learning across diverse architectures and domains.
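The bounded-agreement idea, penalizing only the part of paired cross-modal drift that exceeds an admissible band, resembles a hinge on the paired distance. The sketch below covers only that anchoring term and is an assumed form, not the paper's loss: the Euclidean distance, the squared hinge, and the `band` value are illustrative choices, and the dispersion term is omitted.

```python
import math

def agreement_band_penalty(emb_a, emb_b, band=0.5):
    """Mean squared hinge on paired distance beyond the admissible band."""
    total = 0.0
    for a, b in zip(emb_a, emb_b):
        drift = math.dist(a, b)          # paired cross-modal drift
        excess = max(0.0, drift - band)  # zero while inside the band
        total += excess ** 2
    return total / len(emb_a)

# Hypothetical 2-D embeddings for two paired samples from two modalities.
audio = [(0.0, 0.0), (1.0, 0.0)]
image = [(0.3, 0.0), (1.0, 2.0)]
penalty = agreement_band_penalty(audio, image)
```

The first pair lies inside the band and contributes nothing, so modality-specific differences of that size are left alone; only the second pair is pulled back.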

2601.19070 2026-05-26 cs.LG

Critical Organization of Deep Neural Networks, and p-Adic Statistical Field Theories

深度神经网络的临界组织与p进统计场论

W. A. Zúñiga-Galindo

AI总结 本文严格证明了深度神经网络在激活函数为sigmoid时的热力学极限,揭示了参数空间中的分岔临界组织,并利用p进整数编码层次结构,将临界组织与层次拓扑联系起来,同时研究了随机版本网络的输出分布。

Comments Many typos and minor errors were corrected. The main theorem was strengthened

详情
AI中文摘要

我们严格研究了深度神经网络(DNNs)和循环神经网络(RNNs)的热力学极限,假设激活函数为sigmoid。热力学极限是一个连续神经网络,其中神经元形成具有无限多个点的连续空间。我们证明,在参数空间的某个区域内,这样的网络存在唯一的状态,该状态连续依赖于参数。在该参数空间区域之外,该状态分裂成无限多个状态。那么,临界组织是参数空间中的一个分岔,网络从唯一状态过渡到无限多个状态。我们使用p进整数来编码层次结构。实际上,我们提出了一种算法,将DNNs和RNNs中使用的层次拓扑重新表述为p进树状结构。在这个框架中,层次组织和临界组织是联系在一起的。我们严格研究了一个玩具模型的临界组织,该模型是一个基于p进细胞神经网络的灰度图像层次边缘检测器。这种网络的临界组织可以描述为一个奇异吸引子。在第二部分,我们研究了DNNs和RNNs的随机版本。在这种情况下,网络参数是二次可积函数空间中的广义高斯随机变量。我们计算了在无限宽度情况下给定输入时输出的概率分布。我们证明它有一个幂次展开,其中常数项是高斯分布。

英文摘要

We rigorously study the thermodynamic limit of deep neural networks (DNNs) and recurrent neural networks (RNNs), assuming that the activation functions are sigmoids. A thermodynamic limit is a continuous neural network, where the neurons form a continuous space with infinitely many points. We show that such a network admits a unique state in a certain region of the parameter space, which depends continuously on the parameters. This state breaks into an infinite number of states outside the mentioned region of parameter space. Then, the critical organization is a bifurcation in the parameter space, where a network transitions from a unique state to infinitely many states. We use p-adic integers to codify hierarchical structures. Indeed, we present an algorithm that recasts the hierarchical topologies used in DNNs and RNNs as p-adic tree-like structures. In this framework, the hierarchical and the critical organizations are connected. We study rigorously the critical organization of a toy model, a hierarchical edge detector for grayscale images based on p-adic cellular neural networks. The critical organization of such a network can be described as a strange attractor. In the second part, we study random versions of DNNs and RNNs. In this case, the network parameters are generalized Gaussian random variables in a space of square-integrable functions. We compute the probability distribution of the output given the input, in the infinite-width case. We show that it admits a power-type expansion, where the constant term is a Gaussian distribution.

2601.11428 2026-05-26 cs.LG

Diagnosing Failure Modes of Neural Operators Across Diverse PDE Families

诊断不同PDE族中神经算子的失败模式

Lennon Shikhman

AI总结 本文提出一个标准化压力测试框架,通过在不同PDE族上测试FNO、DeepONet和CNO三种架构,发现分布内准确率不能可靠预测鲁棒性,且失败模式依赖于架构和PDE族的组合。

Comments Published in Transactions on Machine Learning Research. 17 pages, 7 figures, 1 table

详情
AI中文摘要

神经PDE求解器越来越多地被用作偏微分方程族的学习替代模型,其中关键的机器学习挑战不仅是在固定基准分布上的插值,还包括在系数、边界条件、离散化和滚动时域的结构化偏移下的泛化。然而,评估仍然常常由分布内测试误差主导,使得鲁棒性难以评估。我们引入了一个针对部署相关偏移下神经PDE求解器的标准化压力测试框架。我们在三个代表性架构——傅里叶神经算子(FNO)、DeepONet风格模型和卷积神经算子(CNO)——上实例化该框架,涵盖五个定性不同的PDE族:色散、椭圆、多尺度流体、金融和混沌系统。在750个训练模型中,我们使用基线归一化退化因子以及谱和滚动诊断来测量鲁棒性。由此产生的比较表明,强的分布内准确率不能可靠预测鲁棒性,并且失败模式共同依赖于架构和PDE族。我们的结果为评估神经PDE求解器中的鲁棒性声明提供了更清晰的基础,并表明在结构化偏移下的函数空间泛化应被视为首要评估目标。

英文摘要

Neural PDE solvers are increasingly used as learned surrogates for families of partial differential equations, where the key machine learning challenge is not only interpolation on a fixed benchmark distribution but generalization under structured shifts in coefficients, boundary conditions, discretization, and rollout horizon. Yet evaluation is still often dominated by in-distribution test error, making robustness difficult to assess. We introduce a standardized stress-testing framework for neural PDE solvers under deployment-relevant shift. We instantiate it on three representative architectures -- Fourier Neural Operators (FNOs), a DeepONet-style model, and convolutional neural operators (CNOs) -- across five qualitatively different PDE families: dispersive, elliptic, multi-scale fluid, financial, and chaotic systems. Across 750 trained models, we measure robustness using baseline-normalized degradation factors together with spectral and rollout diagnostics. The resulting comparisons reveal that strong in-distribution accuracy does not reliably predict robustness, and that failure patterns depend jointly on architecture and PDE family. Our results provide a clearer basis for evaluating robustness claims in neural PDE solvers and suggest that function-space generalization under structured shift should be treated as a first-class evaluation target.

2601.10457 2026-05-26 cs.AI

NSR-Boost: A Neuro-Symbolic Residual Boosting Framework for Industrial Legacy Models

NSR-Boost:一种面向工业遗留模型的神经符号残差提升框架

Ziming Dai, Dabiao Ma, Jinle Tong, Mengyuan Han, Jian Yang, Hongtao Liu, Haojun Fei, Qing Yang

AI总结 针对工业遗留模型升级成本高、风险大的问题,提出非侵入式神经符号残差提升框架NSR-Boost,通过残差定位、LLM生成符号专家和轻量聚合器动态集成,显著提升性能并降低坏账率。

Comments Accepted by KDD 2026

详情
AI中文摘要

尽管梯度提升决策树(GBDTs)主导了工业表格应用,但在高并发生产环境中升级遗留模型仍面临高昂的重新训练成本和系统性风险。为解决这一问题,我们提出了NSR-Boost,一种专门为工业场景设计的神经符号残差提升框架。其核心优势在于“非侵入性”。它将遗留模型视为冻结模型,并对预测失败的“困难区域”进行针对性修复。该框架包括三个关键阶段:首先,通过残差发现困难区域;然后,利用大型语言模型(LLM)生成符号代码结构,并通过贝叶斯优化微调参数,从而生成可解释的专家;最后,通过轻量聚合器将专家与遗留模型输出动态集成。实验结果表明,该框架在六个公共数据集和一个私有数据集上显著优于最先进的(SOTA)基线。更重要的是,我们报告了NSR-Boost在Qfin Holdings的核心金融风险控制系统中的成功部署,实际在线流量的实证结果显示出卓越的性能改进和坏账率的显著降低。总之,它有效捕获了传统模型遗漏的长尾风险,并为工业提供了一种安全、低成本的演进范式。

英文摘要

Although Gradient Boosted Decision Trees (GBDTs) dominate industrial tabular applications, upgrading legacy models in high-concurrency production environments still faces prohibitive retraining costs and systemic risks. To address this problem, we present NSR-Boost, a neuro-symbolic residual boosting framework designed specifically for industrial scenarios. Its core advantage lies in being "non-intrusive": it treats the legacy model as frozen and performs targeted repairs on "hard regions" where predictions fail. The framework comprises three key stages: first, finding hard regions through residuals; second, generating interpretable experts by using a Large Language Model (LLM) to produce symbolic code structures and Bayesian optimization to fine-tune their parameters; and finally, dynamically integrating the experts with the legacy model's output through a lightweight aggregator. Experimental results demonstrate that the framework significantly outperforms state-of-the-art (SOTA) baselines across six public datasets and one private dataset. More importantly, we report the successful deployment of NSR-Boost within the core financial risk control system of Qfin Holdings, where empirical results on real-world online traffic exhibit superior performance improvements and a significant reduction in the bad rate. In conclusion, it effectively captures long-tail risks missed by traditional models and offers a safe, low-cost evolutionary paradigm for industry.

2601.10201 2026-05-26 cs.LG cs.AI cs.CL

Future-KL Regularized GRPO: Process-Level Credit Assignment from $f$-Divergence Regularization

未来KL正则化GRPO:基于f-散度正则化的过程级信用分配

Jiarui Yao, Ruida Wang, Hao Bai, Tong Zhang

AI总结 本文提出未来KL正则化策略优化(FRPO),通过因果未来正则化回报修正GRPO中局部KL损失缺失的梯度信号,在数学推理任务中提升pass@16并保持更高熵和更低策略漂移。

详情
AI中文摘要

组相对策略优化(GRPO)广泛用于无评论家的大语言模型(LLM)后训练,但其KL正则化通常作为局部损失侧的token惩罚实现。我们表明这遗漏了自回归KL正则化诱导的策略梯度信号。与标准KL正则化强化学习(RL)目标不同,GRPO的组归一化引入非线性提示级效用;对于二元验证器奖励,该效用为$2\arcsin\sqrt p$。因此,奖励和KL在归一化前无法融合而不改变隐式目标。我们推导了具有token级$f$-散度正则化的GRPO风格目标的on-policy梯度。奖励项恢复标准化的GRPO优势,而正则化项包括局部KL损失遗漏的因果未来正则化回报。对于反向KL,这产生简单的未来KL修正:在优势构建后添加每个token对数比的反向累积和。由此产生的方法,未来KL正则化策略优化(FRPO),不需要评论家或额外的模型传递。在数学推理任务上,FRPO在我们的主要大模型设置中提高了pass@16,同时保持比传统损失侧KL基线更高的熵和更低的策略漂移。

英文摘要

Group Relative Policy Optimization (GRPO) is widely used for critic-free Large Language Model (LLM) post-training, but its KL regularization is usually implemented as a local loss-side token penalty. We show that this misses the policy-gradient signal induced by autoregressive KL regularization. Unlike standard KL-regularized Reinforcement Learning (RL) objectives, GRPO's group normalization induces a non-linear prompt-level utility; for binary verifier rewards, this utility is $2\arcsin\sqrt p$. As a result, reward and KL cannot be fused before normalization without changing the implicit objective. We derive the on-policy gradient of GRPO-style objectives with token-wise $f$-divergence regularization. The reward term recovers the standardized GRPO advantage, while the regularizer term includes a causal future-regularization return-to-go omitted by local KL losses. For reverse KL, this yields a simple future KL correction: add a reverse cumulative sum of per-token log ratios after advantage construction. The resulting method, Future-KL Regularized Policy Optimization (FRPO), requires no critic or extra model passes. On mathematical reasoning tasks, FRPO improves pass@16 in our main large-model setting while maintaining higher entropy and lower policy drift than conventional loss-side KL baselines.
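The "reverse cumulative sum of per-token log ratios after advantage construction" can be sketched as follows. This is an illustrative reading, not the FRPO code: the coefficient `beta`, the sign convention, and the choice to include the current token in each suffix sum are assumptions.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """GRPO-style standardization of rewards within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def reverse_cumsum(values):
    """out[t] = values[t] + values[t + 1] + ... + values[-1]."""
    out, running = [], 0.0
    for v in reversed(values):
        running += v
        out.append(running)
    return out[::-1]

def corrected_token_advantages(advantage, log_ratios, beta=0.1):
    """Subtract a future-KL return-to-go from a sequence-level advantage.

    log_ratios[t] is log pi(a_t | s_t) - log pi_ref(a_t | s_t) for one sample.
    """
    return [advantage - beta * future for future in reverse_cumsum(log_ratios)]

advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
per_token = corrected_token_advantages(1.0, [0.1, 0.2, 0.3], beta=0.5)
```

As a side note on the utility quoted in the abstract: for a binary reward with success probability $p$, the standardized advantage is $(r - p)/\sqrt{p(1-p)}$, and $\frac{d}{dp}\,2\arcsin\sqrt{p} = 1/\sqrt{p(1-p)}$, which is consistent with $2\arcsin\sqrt p$ acting as the prompt-level utility.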

2601.10012 2026-05-26 cs.LG

PID-Guided Partial Alignment for Multimodal Decentralized Federated Learning

PID引导的多模态去中心化联邦学习部分对齐

Yanhang Shi, Xiaoyu Wang, Houwei Cao, Jian Li, Yong Liu

AI总结 针对多模态去中心化联邦学习中异构代理间更新不兼容的问题,提出基于部分信息分解的PARSE框架,通过特征分裂和部分对齐实现高效通信与协作。

详情
AI中文摘要

多模态去中心化联邦学习(DFL)必须支持持有不同模态子集和通常不同模型组件的代理之间的协作,同时在无协调服务器或全局网络视图的点对点(P2P)覆盖网络上运行。一个关键障碍是,传统的多模态训练通常依赖于单一共享表示,这隐含假设异构对等体可以通过相同的通信链路交换和聚合相同的模型组件。在多模态DFL中,这一假设不成立:单模态和多模态代理可能通过共享覆盖网络推送不兼容的更新,削弱代理间迁移和跨模态交互。我们提出PARSE,一个无服务器框架,将部分信息分解(PID)引入多模态DFL。每个代理将其潜在特征分裂为冗余、独特和协同切片(“特征分裂”),并在模态条件化的P2P覆盖网络上进行切片感知通信。在训练过程中,代理仅交换与其邻居在语义上可对齐的切片,根据它们共享的模态和模型组件(“部分对齐”)。这种设计避免了集中式编排和梯度手术式的冲突处理,同时与标准DFL约束和多种P2P覆盖网络拓扑兼容。在多个基准测试和异构代理混合场景中,PARSE在保持每链路负载受限的同时,始终优于任务共享、模态共享和混合共享的多模态DFL基线。关于融合选择和分裂比例的消融实验,以及定性特征分析和覆盖网络拓扑研究,证明了所提出的切片感知设计的鲁棒性和通信效率。

英文摘要

Multimodal decentralized federated learning (DFL) must support collaboration among agents that hold different modality subsets and often different model components, while operating over peer-to-peer (P2P) overlays without a coordinating server or a global network view. A key obstacle is that conventional multimodal training often relies on a single shared representation, which implicitly assumes that heterogeneous peers can exchange and aggregate the same model components over the same communication links. In multimodal DFL, this assumption breaks down: uni- and multimodal agents may push incompatible updates through shared overlays, weakening both inter-agent transfer and cross-modal interaction. We present PARSE, a server-free framework that brings partial information decomposition (PID) into multimodal DFL. Each agent splits its latent features into redundant, unique, and synergistic slices ("feature fission"), and performs slice-aware communication over modality-conditioned P2P overlays. During training, agents exchange only the slices that are semantically alignable with their neighbors, according to the modalities and model components they share ("partial alignment"). This design avoids centralized orchestration and gradient-surgery style conflict handling, while remaining compatible with standard DFL constraints and a range of P2P overlay topologies. Across multiple benchmarks and heterogeneous peer mixes, PARSE consistently outperforms task-, modality-, and hybrid-sharing multimodal DFL baselines while keeping per-link payloads bounded. Ablations on fusion choices and split ratios, together with qualitative feature analyses and overlay-topology studies, demonstrate the robustness and communication efficiency of the proposed slice-aware design.

2601.05847 2026-05-26 cs.CL

Schema-Grounded LLM Extraction for FHIR Patient Digital Twins

基于Schema的LLM抽取用于FHIR患者数字孪生

Rafael Brens, Yuqiao Meng, Luoxi Tang, Zhaohan Xi

AI总结 提出SG-LLM方法,通过检索增强、JSON Schema约束和验证器修复循环,从非结构化EHR中生成有效的FHIR Bundle,并在临床效用实验中优于基线。

详情
AI中文摘要

我们重新审视从非结构化电子健康记录(EHR)构建可互操作患者数字孪生的问题,并认为该任务更适合被视作有效FHIR Bundle的受控生成,而非抽取模块的级联。我们引入SG-LLM,一种基于schema的LLM抽取器,它(i)通过SapBERT索引检索的候选SNOMED-CT、RxNorm和LOINC代码增强提示,(ii)在直接源自FHIR R4 StructureDefinitions的JSON Schema下解码,(iii)关闭一个验证器在环修复阶段,其诊断结果作为结构化错误消息反馈。我们认为,孪生的有用性(而不仅仅是跨度级F1)才是正确的评估对象,并通过一项临床效用实验将其操作化,该实验测量了基于SG-LLM生成的FHIR Bundle与专家策划的Bundle训练的分类器在30天再入院AUROC上的差距。在MIMIC-IV和n2c2 2018 Track 2基准测试上,SG-LLM匹配或超过了强大的联合抽取和普通LLM基线,同时生成了更有效的Bundle。消融实验分离了检索、schema约束和修复循环的贡献。所有代码、提示和schema均已发布。

英文摘要

We revisit the problem of constructing interoperable patient digital twins from unstructured electronic health records (EHRs) and argue that the task is better cast not as a cascade of extraction modules but as constrained generation of a valid FHIR bundle. We introduce SG-LLM, a schema-grounded LLM extractor that (i) augments the prompt with candidate SNOMED-CT, RxNorm, and LOINC codes retrieved through a SapBERT index, (ii) decodes under a JSON Schema derived directly from FHIR R4 StructureDefinitions, and (iii) closes a validator-in-the-loop repair stage whose diagnostics are fed back as structured error messages. We argue that the twin's usefulness, not only span-level F1, is the right object of evaluation, and operationalize this with a clinical-utility experiment that measures the gap in 30-day readmission AUROC between classifiers trained on SG-LLM-generated FHIR bundles versus expert-curated ones. On MIMIC-IV and n2c2 2018 Track 2 benchmarks, SG-LLM matches or exceeds strong joint-extraction and vanilla-LLM baselines while producing substantially more valid bundles. Ablations isolate the contributions of retrieval, schema constraint, and the repair loop. All code, prompts, and schemas are released.

2601.05004 2026-05-26 cs.CL

Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei

大语言模型能否解决自我毁灭亚文化中的语义差异?来自Jirai Kei的证据

Peng Wang, Xilin Tao, Siyi Yao, Jiageng Wu, Yuntao Zou, Zhuotao Tian, Libo Qin, Dagang Li

AI总结 针对亚文化中自我毁灭行为检测面临的知识滞后和语义错位问题,提出多智能体框架SAS,通过自动检索和亚文化对齐显著提升LLM检测性能,并优于现有先进方法。

Comments Preprint

详情
AI中文摘要

自我毁灭行为与复杂的心理状态相关,且难以诊断。由于亚文化群体独特的表达方式,这些行为可能更难识别。随着大语言模型(LLM)在各领域的部署,一些研究者开始探索其在检测自我毁灭行为中的应用。受此启发,我们使用当前基于LLM的方法研究亚文化中的自我毁灭行为检测。然而,这些方法面临两个主要挑战:(1)知识滞后:亚文化俚语演变迅速,快于LLM的训练周期;(2)语义错位:难以把握亚文化特有的具体和细微表达。为解决这些问题,我们提出亚文化对齐求解器(SAS),一个多智能体框架,集成了自动检索和亚文化对齐,显著提升了LLM在检测自我毁灭行为中的性能。实验结果表明,SAS优于当前先进的多智能体框架OWL。值得注意的是,它与微调后的LLM表现相当。我们希望SAS能推动亚文化背景下自我毁灭行为检测领域的发展,并为未来研究者提供宝贵资源。

英文摘要

Self-destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) are being deployed across various fields, some researchers have begun exploring their application for detecting self-destructive behaviors. Motivated by this, we investigate self-destructive behavior detection within subcultures using current LLM-based methods. However, these methods face two main challenges: (1) Knowledge Lag: subcultural slang evolves rapidly, faster than LLMs' training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we propose Subcultural Alignment Solver (SAS), a multi-agent framework that incorporates automatic retrieval and subculture alignment, significantly boosting the performance of LLMs in detecting self-destructive behavior. Our experimental results show that SAS outperforms the current advanced multi-agent framework OWL. Notably, it competes well with fine-tuned LLMs. We hope that SAS will advance the field of self-destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.

2601.03191 2026-05-26 cs.CV cs.AI cs.LG

AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation

AnatomiX:一种解剖学感知的胸部X光解读多模态大语言模型

Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert

AI总结 提出AnatomiX,一种两阶段解剖学感知多模态大语言模型,通过先识别解剖结构再执行下游任务,在解剖定位、短语定位、定位诊断和定位描述任务上相比现有方法提升超过25%。

详情
AI中文摘要

多模态医学大语言模型在胸部X光解读方面取得了显著进展,但在空间推理和解剖学理解方面仍面临挑战。尽管现有的定位技术提高了整体性能,但它们往往未能建立真正的解剖对应关系,导致医学领域中的解剖理解错误。为弥补这一差距,我们引入了AnatomiX,一种用于解剖学定位的胸部X光解读的多任务多模态大语言模型。受放射学工作流程启发,AnatomiX采用两阶段方法:首先识别解剖结构并提取其特征,然后利用大语言模型执行多种下游任务,如短语定位、报告生成、视觉问答和图像理解。在多个基准上的大量实验表明,与现有方法相比,AnatomiX实现了卓越的解剖推理,并在解剖定位、短语定位、定位诊断和定位描述任务上性能提升超过25%。代码和预训练模型可在 https://aneesurhashmi.github.io/anatomix 获取。

英文摘要

Multimodal medical large language models have shown substantial progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two-stage approach: first, it identifies anatomical structures and extracts their features, and then it leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at https://aneesurhashmi.github.io/anatomix

2601.02589 2026-05-26 cs.CL cs.AI

FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions

FlowPlan-G2P:一种将科学论文转化为专利描述的结构化生成框架

Kris W Pan, Yongmin Yoo

AI总结 提出FlowPlan-G2P图介导生成框架,通过概念图归纳、章节级规划和图条件生成三阶段分解,将科学论文转化为符合专利规范的描述,在领域评估中优于大型专有模型。

详情
AI中文摘要

由于科学论文与专利在修辞和结构上的根本差异,从科学论文生成专利描述具有挑战性。现有方法将其视为表面改写,未能捕捉专利起草中固有的层次推理和法定约束。我们提出FlowPlan-G2P,一种图介导的生成框架,将该转换分解为三个阶段:(1)概念图归纳,将技术实体和功能依赖提取为有向图;(2)章节级规划,将图划分为与规范专利章节对齐的连贯子图;(3)图条件生成,基于章节特定子图合成符合法律要求的段落。在专家验证基准上的实验表明,标准NLG指标系统性偏好法律不合规输出而非有效专利描述,这促使我们进行领域特定评估。在该评估下,使用开放权重骨干的FlowPlan-G2P始终优于原始专有模型,表明结构化分解比模型规模更能决定质量。

英文摘要

Generating patent descriptions from scientific papers is challenging due to fundamental rhetorical and structural disparities between the two genres. Existing approaches treat this as surface-level rewriting, failing to capture the hierarchical reasoning and statutory constraints inherent in patent drafting. We propose FlowPlan-G2P, a graph-mediated generation framework that decomposes this transformation into three stages: (1) Concept Graph Induction, extracting technical entities and functional dependencies into a directed graph; (2) Section-level Planning, partitioning the graph into coherent subgraphs aligned with canonical patent sections; and (3) Graph-Conditioned Generation, synthesizing legally compliant paragraphs conditioned on section-specific subgraphs. Experiments on expert-validated benchmarks reveal that standard NLG metrics systematically favor legally non-compliant outputs over valid patent descriptions, motivating our domain-specific evaluation. Under this evaluation, FlowPlan-G2P with an open-weight backbone consistently outperforms vanilla proprietary models, demonstrating that structured decomposition is a stronger determinant of quality than model scale.