2602.02839 2026-05-26 cs.RO

Language Movement Primitives: Grounding Language Models in Robot Motion

Yinlong Dai, Benjamin A. Christie, Daniel J. Evans, Dylan P. Losey, Simon Stepputtis

AI summary: Proposes the Language Movement Primitives (LMP) framework, which combines vision-language model (VLM) reasoning with Dynamic Movement Primitive (DMP) parameterization to perform zero-shot robot manipulation tasks.

Abstract

Enabling robots to perform novel manipulation tasks from natural language instructions remains a fundamental challenge in robotics, despite significant progress in generalized problem solving with foundation models. Large vision and language models (VLMs) are capable of processing high-dimensional input data for visual scene and language understanding, as well as decomposing tasks into a sequence of logical steps; however, they struggle to ground those steps in embodied robot motion. On the other hand, robotics foundation models output action commands, but require in-domain fine-tuning or experience before they are able to perform novel tasks successfully. At its core, the challenge of connecting abstract task reasoning with low-level motion control remains. To address this disconnect, we propose Language Movement Primitives (LMPs), a framework that grounds VLM reasoning in Dynamic Movement Primitive (DMP) parameterization. Our key insight is that DMPs provide a small number of interpretable parameters, and VLMs can set these parameters to specify diverse, continuous, and stable trajectories. Put another way: VLMs can reason over free-form natural language task descriptions, and semantically ground their desired motions into DMPs, bridging the gap between high-level task reasoning and low-level position and velocity control. Building on this combination of VLMs and DMPs, we formulate our LMP pipeline for zero-shot robot manipulation that effectively completes tabletop manipulation problems by generating a sequence of DMP motions. Across 31 real-world manipulation tasks, we show that LMP achieves 65% task success as compared to 35% for the best performing baseline. See videos at our website: https://collab.me.vt.edu/lmp
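
The abstract does not say which DMP formulation is used or exactly which parameters the VLM sets, so the following is only a minimal sketch of a textbook one-dimensional discrete DMP, meant to show what "a small number of interpretable parameters" can look like. The function name `rollout_dmp`, the gain values, and the basis-function layout are illustrative assumptions, not the authors' implementation.

```python
import math


def rollout_dmp(y0, goal, weights, tau=1.0, dt=0.001,
                alpha_z=25.0, beta_z=6.25, alpha_x=3.0):
    """Integrate a 1-D discrete DMP with explicit Euler steps.

    Assumed parameterization (hypothetical, for illustration only):
    start y0, goal, duration tau, and one weight per basis function.
    Returns the list of positions visited during tau seconds.
    """
    n = len(weights)
    # Basis centers are spread along the decaying phase variable x in (0, 1].
    centers = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    widths = [n ** 1.5 / c / alpha_x for c in centers]

    y, z, x = y0, 0.0, 1.0
    path = [y]
    for _ in range(int(tau / dt)):
        # Forcing term: normalized weighted sum of Gaussian bases,
        # scaled by the phase x so that it fades out over time.
        psi = [math.exp(-h * (x - c) ** 2) for h, c in zip(widths, centers)]
        total = sum(psi)
        forcing = 0.0
        if total > 1e-12:
            forcing = sum(p * w for p, w in zip(psi, weights)) / total
        forcing *= x * (goal - y0)

        # Critically damped spring toward the goal plus the forcing term.
        dz = (alpha_z * (beta_z * (goal - y) - z) + forcing) / tau
        dy = z / tau
        dx = -alpha_x * x / tau

        z += dz * dt
        y += dy * dt
        x += dx * dt
        path.append(y)
    return path
```

In this sketch a planner would only choose `goal`, `tau`, and `weights`: with all weights at zero the motion is a smooth reach to the goal, and nonzero weights bend the path on the way there.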

2602.02009 2026-05-26 cs.LG

Why Does Your Deep Research Agent Fail? On Hallucination Evaluation in the Full Research Trajectory

Yuhao Zhan, Tianyu Fan, Linxuan Huang, Zirui Guo, Chao Huang

AI summary: To address hallucinations that accumulate across the full research trajectory of Deep Research Agents (DRAs), proposes the PING taxonomy and a fine-grained evaluation framework that shift from outcome-based to process-aware evaluation, and builds the DeepHalluBench benchmark; experiments reveal systematic reliability gaps.

Abstract

Diagnosing failure patterns in Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring intermediate hallucinations that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing hallucinations in the full plan-search-summarize trajectory. We introduce the PING Taxonomy, which categorizes DRA hallucinations into four complementary types: Propagation, Intent, Noise-induced, and Grounding. We further instantiate this taxonomy into a fine-grained evaluation framework that decomposes trajectories into atomic actions, claims, and sub-queries for rigorous verification. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks, including adversarial scenarios, we curate DeepHalluBench. Experiments on six representative DRAs show that, on our hallucination-prone stress-test set, all evaluated systems still exhibit non-negligible reliability gaps. Furthermore, our diagnostic analysis traces these failures to systemic deficits, especially hallucination propagation and cognitive biases, providing actionable insights for future architectural optimization. Code and data are available at https://github.com/yuhao-zhan/DeepHalluBench.

2601.21726 2026-05-26 cs.AI

DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting

Siru Zhong, Yiqiu Liu, Zhiqing Cui, Zezhi Shao, Fei Wang, Qingsong Wen, Yuxuan Liang

AI summary: To address the noise sensitivity of deep time series models, proposes DropoutTS, a model-agnostic plugin that quantifies instance-level noise via spectral sparsity and dynamically adjusts the dropout rate, suppressing spurious fluctuations while preserving fine-grained fidelity and significantly improving robustness with almost no added parameters.

Abstract

Deep time series models are vulnerable to the noisy data that is ubiquitous in real-world applications. Existing robustness strategies either prune data or rely on costly prior quantification, failing to balance effectiveness and efficiency. In this paper, we introduce DropoutTS, a model-agnostic plugin that shifts the paradigm from "what" to learn to "how much" to learn. DropoutTS employs a Sample-Adaptive Dropout mechanism: leveraging spectral sparsity to efficiently quantify instance-level noise via reconstruction residuals, it dynamically calibrates model learning capacity by mapping noise to adaptive dropout rates, selectively suppressing spurious fluctuations while preserving fine-grained fidelity. Extensive experiments across diverse noise regimes and open benchmarks show that DropoutTS consistently improves the performance of strong backbones, delivering improved robustness with negligible parameter overhead and no architectural modifications. Our code is available at https://github.com/CityMind-Lab/DropoutTS.
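
The abstract describes the mechanism only at a high level, so the following is a rough sketch of one way the idea could work: keep the few largest frequency bins of a sample, treat the energy left over as the reconstruction residual, and map that residual to a dropout rate. The function names, the `keep` parameter, the linear mapping, and the assumption that noisier samples get a higher rate are all illustrative choices, not taken from the paper or its repository.

```python
import cmath
import math


def noise_score(series, keep=2):
    """Share of spectral energy outside the `keep` largest DFT bins.

    By Parseval's theorem this equals the relative energy of the residual
    left after reconstructing the (mean-removed) series from those bins.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [value - mean for value in series]

    # Naive O(n^2) DFT, which is adequate for a short illustrative window.
    energy = []
    for k in range(n):
        coeff = sum(centered[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        energy.append(abs(coeff) ** 2)

    total = sum(energy)
    if total == 0.0:
        return 0.0
    kept = sum(sorted(energy, reverse=True)[:keep])
    # Clamp to [0, 1] to absorb floating-point rounding.
    return min(1.0, max(0.0, 1.0 - kept / total))


def adaptive_dropout_rate(series, p_min=0.05, p_max=0.5, keep=2):
    """Assumed linear map from the noise score to a dropout rate."""
    return p_min + (p_max - p_min) * noise_score(series, keep)
```

A clean sinusoid occupies two conjugate bins, so with `keep=2` its score is close to zero and it receives roughly `p_min`; a sample carrying extra high-frequency content receives a rate closer to `p_max`.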

2601.21670 2026-05-26 cs.CV cs.LG

Future-KL Regularized GRPO: Process-Level Credit Assignment via f-Divergence Regularization

Jiarui Yao, Ruida Wang, Hao Bai, Tong Zhang

AI summary: Proposes Future-KL Regularized Policy Optimization (FRPO), which uses a causal future-regularization return-to-go to restore the gradient signal missed by GRPO's local KL loss; on mathematical reasoning tasks it improves pass@16 while maintaining higher entropy and lower policy drift.

Abstract

Group Relative Policy Optimization (GRPO) is widely used for critic-free Large Language Model (LLM) post-training, but its KL regularization is usually implemented as a local loss-side token penalty. We show that this misses the policy-gradient signal induced by autoregressive KL regularization. Unlike standard KL-regularized Reinforcement Learning (RL) objectives, GRPO's group normalization induces a non-linear prompt-level utility; for binary verifier rewards, this utility is $2\arcsin\sqrt p$. As a result, reward and KL cannot be fused before normalization without changing the implicit objective. We derive the on-policy gradient of GRPO-style objectives with token-wise $f$-divergence regularization. The reward term recovers the standardized GRPO advantage, while the regularizer term includes a causal future-regularization return-to-go omitted by local KL losses. For reverse KL, this yields a simple future KL correction: add a reverse cumulative sum of per-token log ratios after advantage construction. The resulting method, Future-KL Regularized Policy Optimization (FRPO), requires no critic or extra model passes. On mathematical reasoning tasks, FRPO improves pass@16 in our main large-model setting while maintaining higher entropy and lower policy drift than conventional loss-side KL baselines.
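
The reverse-KL correction described above can be sketched in a few lines. This is an illustrative reading of the abstract rather than the authors' code: the function names, the coefficient `beta`, the sign convention (subtracting the penalty), and the choice to include the current token in the suffix sum are all assumptions. As a side note, the derivative of $2\arcsin\sqrt p$ with respect to $p$ is $1/\sqrt{p(1-p)}$, the reciprocal of the standard deviation of a Bernoulli reward, which is consistent with the group normalization step shown first.

```python
import math


def group_standardized_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one prompt group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def add_future_kl_correction(advantage, log_ratios, beta):
    """Per-token advantages with a reverse cumulative sum of log ratios.

    log_ratios[t] is log pi(token_t) - log pi_ref(token_t) for one sampled
    response. The correction is applied after the sequence-level advantage
    has been built, one value per token.
    """
    corrected = [0.0] * len(log_ratios)
    future = 0.0
    # Walk backwards so `future` holds the sum over tokens t, t+1, ..., T-1.
    for t in range(len(log_ratios) - 1, -1, -1):
        future += log_ratios[t]
        corrected[t] = advantage - beta * future
    return corrected
```

Because the correction only needs the per-token log ratios that a KL-penalized trainer already computes, a sketch like this involves no critic and no additional forward passes, in line with the claim in the abstract.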

2601.10012 2026-05-26 cs.LG

PID-Guided Partial Alignment for Multimodal Decentralized Federated Learning

Yanhang Shi, Xiaoyu Wang, Houwei Cao, Jian Li, Yong Liu

AI summary: To address incompatible updates between heterogeneous agents in multimodal decentralized federated learning, proposes PARSE, a framework based on partial information decomposition that uses feature fission and partial alignment for efficient communication and collaboration.

Abstract

Multimodal decentralized federated learning (DFL) must support collaboration among agents that hold different modality subsets and often different model components, while operating over peer-to-peer (P2P) overlays without a coordinating server or a global network view. A key obstacle is that conventional multimodal training often relies on a single shared representation, which implicitly assumes that heterogeneous peers can exchange and aggregate the same model components over the same communication links. In multimodal DFL, this assumption breaks down: uni- and multimodal agents may push incompatible updates through shared overlays, weakening both inter-agent transfer and cross-modal interaction. We present PARSE, a server-free framework that brings partial information decomposition (PID) into multimodal DFL. Each agent splits its latent features into redundant, unique, and synergistic slices ("feature fission"), and performs slice-aware communication over modality-conditioned P2P overlays. During training, agents exchange only the slices that are semantically alignable with their neighbors, according to the modalities and model components they share ("partial alignment"). This design avoids centralized orchestration and gradient-surgery style conflict handling, while remaining compatible with standard DFL constraints and a range of P2P overlay topologies. Across multiple benchmarks and heterogeneous peer mixes, PARSE consistently outperforms task-, modality-, and hybrid-sharing multimodal DFL baselines while keeping per-link payloads bounded. Ablations on fusion choices and split ratios, together with qualitative feature analyses and overlay-topology studies, demonstrate the robustness and communication efficiency of the proposed slice-aware design.

2601.05847 2026-05-26 cs.CL

Schema-Grounded LLM Extraction for FHIR Patient Digital Twins

Rafael Brens, Yuqiao Meng, Luoxi Tang, Zhaohan Xi

AI summary: Proposes SG-LLM, which generates valid FHIR bundles from unstructured EHRs through retrieval augmentation, JSON Schema constraints, and a validator repair loop, and outperforms baselines in a clinical-utility experiment.

Abstract

We revisit the problem of constructing interoperable patient digital twins from unstructured electronic health records (EHRs) and argue that the task is better cast not as a cascade of extraction modules but as constrained generation of a valid FHIR bundle. We introduce SG-LLM, a schema-grounded LLM extractor that (i) augments the prompt with candidate SNOMED-CT, RxNorm, and LOINC codes retrieved through a SapBERT index, (ii) decodes under a JSON Schema derived directly from FHIR R4 StructureDefinitions, and (iii) closes the loop with a validator-driven repair stage whose diagnostics are fed back as structured error messages. We argue that the twin's usefulness, not only span-level F1, is the right object of evaluation, and operationalize this with a clinical-utility experiment that measures the gap in 30-day readmission AUROC between classifiers trained on SG-LLM-generated FHIR bundles and classifiers trained on expert-curated ones. On the MIMIC-IV and n2c2 2018 Track 2 benchmarks, SG-LLM matches or exceeds strong joint-extraction and vanilla-LLM baselines while producing substantially more valid bundles. Ablations isolate the contributions of retrieval, schema constraint, and the repair loop. All code, prompts, and schemas are released.
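
The validator-in-the-loop stage in step (iii) can be pictured as a small control loop. The sketch below is a generic illustration under stated assumptions: `generate_with_repair`, `toy_generate`, and `toy_validate` are hypothetical names, the toy validator checks only two fields, and a real system would call an LLM with schema-constrained decoding and a full FHIR validator instead.

```python
from typing import Callable, Dict, List, Tuple

Bundle = Dict[str, object]


def generate_with_repair(
    note: str,
    generate: Callable[[str, List[str]], Bundle],
    validate: Callable[[Bundle], List[str]],
    max_repairs: int = 3,
) -> Tuple[Bundle, List[str]]:
    """Draft a bundle, then regenerate with validator errors fed back in."""
    bundle = generate(note, [])
    errors = validate(bundle)
    for _ in range(max_repairs):
        if not errors:
            break
        # The structured error messages become part of the next request.
        bundle = generate(note, errors)
        errors = validate(bundle)
    return bundle, errors


def toy_validate(bundle: Bundle) -> List[str]:
    """Stand-in validator: checks two fields only, not real FHIR rules."""
    errors = []
    if bundle.get("resourceType") != "Bundle":
        errors.append("resourceType must be 'Bundle'")
    if not isinstance(bundle.get("entry"), list):
        errors.append("entry must be a list")
    return errors


def toy_generate(note: str, errors: List[str]) -> Bundle:
    """Stand-in generator: forgets `entry` until the validator complains."""
    bundle: Bundle = {"resourceType": "Bundle"}
    if any("entry" in message for message in errors):
        bundle["entry"] = []
    return bundle
```

Calling `generate_with_repair("chest pain, discharged on aspirin", toy_generate, toy_validate)` returns a bundle that passes the toy checks after one repair round. Returning the remaining error list lets a caller decide what to do with bundles that are still invalid when the repair budget runs out.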

2601.05004 2026-05-26 cs.CL

Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei

Peng Wang, Xilin Tao, Siyi Yao, Jiageng Wu, Yuntao Zou, Zhuotao Tian, Libo Qin, Dagang Li

AI summary: To address knowledge lag and semantic misalignment in detecting self-destructive behavior within subcultures, proposes SAS, a multi-agent framework whose automatic retrieval and subculture alignment significantly improve LLM detection performance and outperform existing advanced methods.

Comments: Preprint

Abstract

Self-destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) are deployed across various fields, some researchers have begun exploring their application for detecting self-destructive behaviors. Motivated by this, we investigate self-destructive behavior detection within subcultures using current LLM-based methods. However, these methods face two main challenges: (1) Knowledge Lag: subcultural slang evolves rapidly, faster than LLMs' training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we propose Subcultural Alignment Solver (SAS), a multi-agent framework that incorporates automatic retrieval and subculture alignment, significantly boosting the performance of LLMs in detecting self-destructive behavior. Our experimental results show that SAS outperforms the current advanced multi-agent framework OWL. Notably, it competes well with fine-tuned LLMs. We hope that SAS will advance the field of self-destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.

2601.03191 2026-05-26 cs.CV cs.AI cs.LG

AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation

Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert

AI summary: Proposes AnatomiX, a two-stage anatomy-aware multimodal large language model that first identifies anatomical structures and then performs downstream tasks, improving on existing methods by more than 25% on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning.

Abstract

Multimodal medical large language models have shown substantial progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two-stage approach: first, it identifies anatomical structures and extracts their features, and then it leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning tasks compared to existing approaches. The code and pretrained model are available at https://aneesurhashmi.github.io/anatomix

2601.02589 2026-05-26 cs.CL cs.AI

Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere

Francesco Di Sario, Daniel Rebain, Dor Verbin, Marco Grangetto, Andrea Tagliasacchi

AI summary: Proposes Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting, modeling view-dependent effects through a learnable partition of directions into regions and achieving state-of-the-art reflection modeling.

Abstract

Radiance field methods (e.g. 3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations. SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and fail to capture specular reflections - a key component of realistic rendering. Although alternatives like spherical Gaussians offer improvements, they add significant optimization complexity. We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while keeping optimization simpler than existing alternatives. For reflections - where SH fail - we leverage SV as learnable reflection probes, taking reflected directions as input following principles from classical graphics. This formulation attains state-of-the-art results on synthetic and real-world datasets, demonstrating that SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations. Project page: https://sphericalvoronoi.github.io/

2512.12576 2026-05-26 cs.CL cs.AI

Coupled Variational Reinforcement Learning for Language Model General Reasoning

Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang

AI summary: Proposes CoVRL, which couples prior and posterior distributions through a hybrid sampling strategy to combine variational inference with reinforcement learning, addressing inefficient exploration and trace-answer incoherence in verifier-free RL and improving performance on mathematical and general reasoning benchmarks.

Comments: Accepted to ICML 2026

Abstract

While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the probabilities that LLMs generate reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.

2512.11941 2026-05-26 cs.CV cs.AI

DynaPURLS: Dynamic Refinement of Part-Aware Representations for Skeleton-Based Zero-Shot Action Recognition

Jingmin Zhu, Anqi Zhu, James Bailey, Jun Liu, Hossein Rahmani, Mohammed Bennamoun, Farid Boussaid, Qiuhong Ke

AI summary: Proposes the DynaPURLS framework, which uses multi-scale visual-semantic correspondences and a dynamic refinement module to address domain shift in zero-shot skeleton-based action recognition, achieving the best results on three benchmark datasets.

Comments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence

DOI: 10.1109/TPAMI.2026.3680873

Abstract

Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that rely on aligning skeleton features with static, class-level semantics. This coarse-grained alignment fails to bridge the domain shift between seen and unseen classes, thereby impeding the effective transfer of fine-grained visual knowledge. To address these limitations, we introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences and dynamically refines them at inference time to enhance generalization. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Concurrently, an adaptive partitioning module produces fine-grained visual representations by semantically grouping skeleton joints. To fortify this fine-grained alignment against the train-test domain shift, DynaPURLS incorporates a dynamic refinement module. During inference, this module adapts textual features to the incoming visual stream via a lightweight learnable projection. This refinement process is stabilized by a confidence-aware, class-balanced memory bank, which mitigates error propagation from noisy pseudo-labels. Extensive experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art, setting new state-of-the-art records. The source code is publicly available at https://github.com/Alchemist0754/DynaPURLS

2512.04883 2026-05-26 cs.CV

SDG-Track: A Heterogeneous Observer-Follower Framework for High-Resolution UAV Tracking on Embedded Platforms

Jiawen Wen, Yu Hu, Suixuan Qiu, Jinshan Huang, Xiaowen Chu

AI summary: Proposes the SDG-Track framework, which uses an Observer-Follower architecture with sparse detection-guided tracking and a Dual-Space Recovery mechanism to track UAVs in high-resolution video in real time on embedded platforms, reaching 35.1 FPS while retaining 97.2% of frame-by-frame detection precision.

Comments: Withdrawn by the authors due to unresolved authorship and public-disclosure authorization issues

Abstract

Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our code is publicly available at https://github.com/Jeffry-wen/SDG-Track
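
The division of labor between the two streams can be illustrated with a simplified scheduling loop. This sketch is sequential, whereas the abstract describes the Observer and Follower running on the GPU and CPU respectively; `track`, `detect`, `follow`, and `detect_every` are hypothetical names, and the recovery mechanism is omitted entirely.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Position = Tuple[float, float]


def track(
    frames: Sequence[object],
    detect: Callable[[object], Optional[Position]],
    follow: Callable[[object, Position], Position],
    detect_every: int = 5,
) -> List[Optional[Position]]:
    """Return one position estimate per frame.

    `detect` stands in for the slow, accurate detector (Observer) and
    `follow` for the fast local update such as optical flow (Follower).
    """
    positions: List[Optional[Position]] = []
    last: Optional[Position] = None
    for index, frame in enumerate(frames):
        if last is None or index % detect_every == 0:
            # Anchor frame, or no target yet: ask the detector.
            anchor = detect(frame)
            if anchor is not None:
                last = anchor
        else:
            # In-between frame: propagate the last known position.
            last = follow(frame, last)
        positions.append(last)
    return positions
```

With `detect_every=5`, the expensive detector runs on only one frame in five once a target has been acquired, which is the basic reason a sparse-detection design can raise throughput.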

2511.21734 2026-05-26 cs.CL cs.AI

Asking LLMs to Verify First is Almost Free Lunch

Shiguang Wu, Quanming Yao

AI summary: Proposes the Verification-First (VF) strategy, which verifies a candidate answer before generating a solution to improve reasoning at low computational overhead, and extends it to the iterative Iter-VF method; VF outperforms standard CoT and Iter-VF outperforms existing TTS strategies across multiple benchmarks.

Abstract

To enhance the reasoning capabilities of Large Language Models (LLMs) without costly training or extensive test-time sampling, we introduce Verification-First (VF), a strategy that prompts models to verify a provided candidate answer, even a trivial or random one, before generating a solution. This approach triggers a "reverse reasoning" process complementary to standard forward Chain-of-Thought (CoT), which restricts the logical search space of the answer by pruning the LLM's output distribution. We further generalize VF prompting to Iter-VF, a sequential test-time scaling (TTS) method that iteratively cycles the verification-generation process using the model's previous answer. Extensive experiments across various benchmarks and various LLMs confirm that VF prompting with a random answer consistently outperforms standard CoT with minimal computational overhead, and that Iter-VF outperforms existing TTS strategies. VF is also effective on SOTA thinking models. For example, with simple VF prompting, we obtain a new SOTA accuracy of 94.9% on GPQA-Diamond with Gemini-3-Pro-Preview, where VF yields a relative error reduction of about 30%.
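
The two prompting schemes can be sketched as follows. The prompt wording is invented for illustration, since the abstract does not give the actual template, and `ask_model` is a hypothetical callable standing in for whatever LLM client is used; answer extraction from the model's reply is also left out.

```python
from typing import Callable


def vf_prompt(question: str, candidate: str) -> str:
    """Build a verification-first prompt around a candidate answer."""
    return (
        f"Question: {question}\n"
        f"A candidate answer is: {candidate}\n"
        "First verify whether this candidate answer is correct. "
        "Then solve the problem yourself and state your final answer."
    )


def iter_vf(
    question: str,
    ask_model: Callable[[str], str],
    first_candidate: str,
    rounds: int = 3,
) -> str:
    """Iter-VF sketch: feed each answer back as the next candidate."""
    candidate = first_candidate
    for _ in range(rounds):
        # Each round verifies the previous answer, then re-solves.
        candidate = ask_model(vf_prompt(question, candidate))
    return candidate
```

Plain VF corresponds to a single call with a trivial or random `first_candidate`; each additional round costs one more model call, so the compute spent at test time grows linearly with `rounds`.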

2511.04556 2026-05-26 cs.AI cs.CE

Optimizing Sensor Placement for Flow Reconstruction in Urban Drainage Networks: A Digital Twin-Based Sparse Sensing Approach

Zihang Ding, Amit Kumar, Imran Md. Azizul Islam, Mila Avellar Montezuma, Ruihang Zhang, Kun Zhang

AI summary: To address the difficulty of monitoring urban drainage networks and predicting flows under resource constraints, proposes a digital twin-based, data-driven sparse sensing method that optimizes sensor locations using singular value decomposition and QR factorization for system-level flow reconstruction; validated on the Woodland catchment in Duluth, Minnesota, where three sensors achieve a mean NSE of 0.949.

Comments: 32 pages (including supplementary information), 11 figures. Submitted to Water Research. Partially presented at HydroML 2025 Symposium, Minnesota Water Resources Conference 2025, and AGU Fall Meeting 2025

Abstract

Urban flooding triggered by intense rainfall is becoming increasingly frequent and widespread. While flood prediction and monitoring at high spatio-temporal resolution are desired, practical constraints in time, budget, and technology hinder their full implementation. How to monitor urban drainage networks and predict flow conditions under constrained resources is a major challenge. To address this, we introduce a data-driven sparse sensing (DSS) approach, demonstrated via a digital twin of the Woodland catchment in Duluth, Minnesota. Specifically, we couple EPA-SWMM with singular value decomposition and QR factorization-based sensor selection to optimize monitoring locations for system-level flow reconstruction. An ensemble of SWMM simulations, driven by diverse scenarios, provides the necessary hydraulic data to extract the reduced basis and identify informative sensor locations. Cross-event validation shows that three strategically placed sensors among 77 candidate nodes achieve a mean system-level Nash-Sutcliffe efficiency (NSE) of 0.949 across observed storm events. The QR-selected sensor sets are benchmarked against reference sensor configurations obtained from exhaustive searches and Monte Carlo random placements. This comparison further shows that flow reconstruction based on QR-selected sensors closely tracks the exhaustive optimum while substantially outperforming random placements. We further evaluate the framework's robustness by introducing multiplicative Gaussian noise and simulating individual sensor failures. While the model is relatively resilient to noise, the impact of sensor dropouts depends heavily on the number of sensors allocated and their specific locations.
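
The sensor-selection step and the reported metric can be sketched without any numerical library. The first function is a greedy Gram-Schmidt procedure that picks the same pivots as QR factorization with column pivoting applied to the transposed basis; it assumes the reduced basis (for example, the leading left singular vectors of the simulation snapshots) has already been computed. The function names and the tiny example basis are hypothetical, and the least-squares reconstruction from sensor readings is not shown. The second function is the standard Nash-Sutcliffe efficiency.

```python
from typing import List, Sequence


def select_sensors(basis: Sequence[Sequence[float]], n_sensors: int) -> List[int]:
    """Greedy pivot selection, as in QR with column pivoting.

    basis[i] holds the reduced-basis coefficients of candidate node i.
    Each step picks the node with the largest remaining norm, then removes
    that direction from every node so later picks add new information.
    """
    residual = [list(row) for row in basis]
    chosen: List[int] = []
    for _ in range(n_sensors):
        norms = [sum(v * v for v in row) for row in residual]
        for i in chosen:
            norms[i] = -1.0  # never pick the same node twice
        pivot = max(range(len(residual)), key=norms.__getitem__)
        if norms[pivot] <= 1e-12:
            break  # remaining nodes carry no independent information
        chosen.append(pivot)
        scale = norms[pivot] ** 0.5
        q = [v / scale for v in residual[pivot]]
        for row in residual:
            dot = sum(a * b for a, b in zip(row, q))
            for j in range(len(row)):
                row[j] -= dot * q[j]
    return chosen


def nash_sutcliffe(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / spread
```

For the toy basis `[[1.0, 0.0], [0.9, 0.1], [0.0, 0.8], [0.1, 0.2]]`, the greedy rule skips node 1 even though it has the second-largest norm, because most of its signal is already explained by node 0. An NSE of 1 means a perfect match, and 0 means the reconstruction is no better than predicting the observed mean.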
