arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.08011 2026-05-26 cs.CV

It's Time to Get It Right: Improving Analog Clock Reading and Clock-Hand Spatial Reasoning in Vision-Language Models

Jaeha Choi, Jin Won Lee, Siwoo You, Jangho Lee

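AI summary: To address the difficulty vision-language models have reading analog clocks in real-world settings, the paper introduces the TickTockVQA dataset and the Swap-DPO fine-tuning framework, which substantially improve clock-reading accuracy and robustness.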

Comments Accepted to CVPR 2026 Findings

Abstract

Advances in vision-language models (VLMs) have achieved remarkable success on complex multimodal reasoning tasks, leading to the assumption that they should also excel at reading analog clocks. However, contrary to this expectation, our study reveals that reading analog clocks in real-world environments remains a significant challenge for state-of-the-art VLMs. Existing analog clock datasets are largely synthetic or planar with limited stylistic diversity and minimal background context, failing to capture the visual variability of real-world scenes. As a result, VLMs trained on such data exhibit weak spatiotemporal reasoning, frequently confusing the hour and minute hands and struggling under common visual conditions such as occlusion, lighting variation, and cluttered backgrounds. To address this issue, we introduce TickTockVQA, a human-annotated dataset containing analog clocks in diverse real-world scenarios. TickTockVQA provides explicit hour and minute annotations, and includes an AM/PM tag when it is inferable from the visual context. Furthermore, we propose Swap-DPO, a direct preference optimization-based fine-tuning framework to align model reasoning toward accurate time interpretation. Experimental results demonstrate that our approach substantially enhances clock reading accuracy and robustness under real-world conditions, establishing a foundation for future research on spatiotemporal reasoning and visual understanding in VLMs.
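
The hour/minute confusion mentioned in the abstract can be made concrete with basic dial geometry. The sketch below is only an illustration and is not taken from the paper: the function names are hypothetical, and it assumes an idealized 12-hour dial with angles measured clockwise from 12. It shows what time a reader would report after mistaking each hand for the other.

```python
def hand_angles(hour: int, minute: int) -> tuple[float, float]:
    """Return (hour_hand_deg, minute_hand_deg), measured clockwise from 12."""
    hour_deg = (hour % 12) * 30.0 + minute * 0.5  # 30 deg per hour plus drift
    minute_deg = minute * 6.0                     # 6 deg per minute
    return hour_deg, minute_deg


def read_time(hour_deg: float, minute_deg: float) -> tuple[int, int]:
    """Recover (hour, minute) on a 12-hour dial from the two hand angles."""
    minute = round(minute_deg / 6.0) % 60
    hour = int(hour_deg // 30.0) % 12
    return (12 if hour == 0 else hour), minute


def swapped_reading(hour: int, minute: int) -> tuple[int, int]:
    """Time reported when the hour and minute hands are mistaken for each other."""
    hour_deg, minute_deg = hand_angles(hour, minute)
    return read_time(minute_deg, hour_deg)
```

A pair such as the true reading and `swapped_reading(...)` is one plausible way to picture a hand-swap error; whether Swap-DPO builds its preference pairs this way is not stated in the abstract.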

2602.20725 2026-05-26 cs.CV

Bridging Rendering and Generative Modeling with Monte Carlo Transport Scheduling

Junwei Shu, Wenjie Liu, Hantang Liu, Changbo Wang, Yang Li

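AI summary: Proposes the Monte Carlo Transport Scheduling framework, which treats progressive path tracing as a continuous sampling-driven transport process, trains on real render endpoints to enable arbitrary-step neural refinement, and transfers as a physical prior to generative models.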

Comments preprint

Abstract

Monte Carlo rendering and modern generative models both transform uncertain states into structured images, yet they are usually studied as separate processes. We introduce Monte Carlo Transport Scheduling, a framework that treats progressive path tracing as a continuous sampling-driven transport process. Our key observation is that the renderer already produces physically valid states along this process: nested Monte Carlo estimates trace a refinement trajectory whose natural time coordinate follows from sampling variance. This view leads to a continuous training framework that learns from real render endpoints rather than synthetic interpolants, preserving the statistical structure of Monte Carlo estimation while enabling arbitrary-step neural refinement. We evaluate the framework on a controlled rendering benchmark designed to separate transport difficulty from scene context, and show that it yields stable render refinement, supports continuous stopping between rendering states, and transfers as a physical prior for frozen generative samplers. These results suggest a common continuous-time substrate for rendering and generation, where Monte Carlo sampling provides both the physical states and the supervision for learning image transport.
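
To illustrate what "nested Monte Carlo estimates" with a variance-driven time coordinate could look like, here is a minimal sketch. It assumes that nested estimates are prefix means over one sample stream and that the standard error scales as 1/sqrt(n); the specific mapping in `variance_time` is a hypothetical choice, since the abstract only says the time coordinate follows from sampling variance.

```python
def nested_estimates(samples, counts):
    """Prefix means of one sample stream: each estimate reuses all earlier samples."""
    return [sum(samples[:n]) / n for n in counts]


def variance_time(n, n_max):
    """Hypothetical time coordinate in [0, 1]: 0 at one sample, 1 at n_max samples.

    Assumes the estimator's standard error is proportional to 1/sqrt(n),
    and requires n_max > 1.
    """
    start = 1.0                    # relative standard error at n = 1
    end = 1.0 / n_max ** 0.5       # relative standard error at n = n_max
    return (start - 1.0 / n ** 0.5) / (start - end)
```

Under this assumption, doubling the sample count moves the state forward by less and less "time", which matches the diminishing returns of progressive rendering.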

2602.10090 2026-05-26 cs.AI cs.CL cs.LG

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He

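AI summary: Proposes Agent World Model (AWM), a fully synthetic environment generation pipeline that supports large-scale reinforcement learning in code-driven, database-backed environments so that agents generalize across diverse everyday scenarios.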

Comments Accepted to ICML 2026

Abstract

Recent advances in large language models (LLMs) have empowered autonomous agents to perform multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
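
As a rough picture of what "code-driven and backed by databases" can mean in practice, the toy environment below keeps all state in an in-memory SQLite database and computes the reward by querying that state. This is a sketch under stated assumptions, not AWM's actual interface: the class `TodoEnv`, its tool names, and the schema are all hypothetical.

```python
import sqlite3


class TodoEnv:
    """Toy code-driven environment whose state lives in an in-memory SQLite DB."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE todos (id INTEGER PRIMARY KEY, title TEXT, done INTEGER)"
        )
        self.db.execute("INSERT INTO todos (title, done) VALUES ('file report', 0)")
        self.db.commit()

    def call_tool(self, name: str, **args):
        """Dispatch a tool call; every state transition is an explicit SQL statement."""
        if name == "list_todos":
            return self.db.execute("SELECT id, title, done FROM todos").fetchall()
        if name == "complete_todo":
            cur = self.db.execute(
                "UPDATE todos SET done = 1 WHERE id = ?", (args["todo_id"],)
            )
            self.db.commit()
            return {"updated": cur.rowcount}
        raise ValueError(f"unknown tool: {name}")

    def reward(self) -> float:
        """Reward read from database state rather than judged from generated text."""
        (remaining,) = self.db.execute(
            "SELECT COUNT(*) FROM todos WHERE done = 0"
        ).fetchone()
        return 1.0 if remaining == 0 else 0.0
```

Because transitions are ordinary SQL updates, the same tool call always produces the same state change, which is the consistency property the abstract contrasts with LLM-simulated environments.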

2602.09620 2026-05-26 cs.AI cs.LO

FLINGO -- Instilling ASP Expressiveness into Linear Integer Constraints

Jorge Fandinno, Pedro Cabalar, Philipp Wanko, Torsten Schaub

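AI summary: Presents the flingo language and tool, which extends constraint answer set programming by bringing ASP expressiveness (default values, undefined attributes, non-deterministic choice, and aggregates) into numerical constraints, together with a translation to the clingcon input format.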

Comments To appear in Theory and Practice of Logic Programming

Abstract

Constraint Answer Set Programming (CASP) is a hybrid paradigm that enriches Answer Set Programming (ASP) with numerical constraint processing, a crucial requirement for many real-world applications. However, the specification of constraints in most CASP solvers aligns more closely with the expressiveness and semantics of the numerical back-end than the ASP paradigm. In the latter, numerical attributes are represented as predicates, which allows declaring default values, leaving the attribute undefined, making non-deterministic assignments with choice rules, or using aggregated values. In CASP, most (if not all) of these features are lost once we switch to a constraint-based representation of those same attributes. In this paper, we present the flingo language (and tool) that incorporates the aforementioned expressiveness within numerical constraints, and we illustrate its use with several examples. Based on previous work that established its semantic foundations, we also present a translation from the newly introduced flingo syntax to regular CASP programs following the clingcon input format.

2602.03955 2026-05-26 cs.AI cs.MA

AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent

Yinyi Luo, Yiqiao Jin, Weichen Yu, Mengqi Zhang, Srijan Kumar, Xiaoxiao Li, Weijie Xu, Xin Chen, Jindong Wang

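AI summary: Proposes the AgentArk framework, which uses three hierarchical distillation strategies to distill the interaction dynamics of multi-agent systems into the weights of a single model, giving a single agent multi-agent reasoning and self-correction ability while remaining computationally efficient.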

Abstract

While large language model (LLM) multi-agent systems achieve superior reasoning performance through iterative debate, practical deployment is limited by their high computational cost and error propagation. This paper proposes AgentArk, a novel framework to distill multi-agent dynamics into the weights of a single model, effectively transforming explicit test-time interactions into implicit model capabilities. This equips a single agent with the intelligence of multi-agent systems while remaining computationally efficient. Specifically, we investigate three hierarchical distillation strategies across various models, tasks, scaling, and scenarios: reasoning-enhanced fine-tuning; trajectory-based augmentation; and process-aware distillation. By shifting the burden of computation from inference to training, the distilled models preserve the efficiency of one agent while exhibiting strong reasoning and self-correction performance of multiple agents. They further demonstrate enhanced robustness and generalization across diverse reasoning tasks. We hope this work can shed light on future research on efficient and robust multi-agent development. Our code is at https://github.com/AIFrontierLab/AgentArk.

2602.02839 2026-05-26 cs.RO

Language Movement Primitives: Grounding Language Models in Robot Motion

Yinlong Dai, Benjamin A. Christie, Daniel J. Evans, Dylan P. Losey, Simon Stepputtis

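AI summary: Proposes the Language Movement Primitives (LMP) framework, which combines vision-language model (VLM) reasoning with Dynamic Movement Primitive (DMP) parameterization to perform zero-shot robot manipulation tasks.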

Abstract

Enabling robots to perform novel manipulation tasks from natural language instructions remains a fundamental challenge in robotics, despite significant progress in generalized problem solving with foundational models. Large vision and language models (VLMs) are capable of processing high-dimensional input data for visual scene and language understanding, as well as decomposing tasks into a sequence of logical steps; however, they struggle to ground those steps in embodied robot motion. On the other hand, robotics foundation models output action commands, but require in-domain fine-tuning or experience before they are able to perform novel tasks successfully. At its core, there still remains the fundamental challenge of connecting abstract task reasoning with low-level motion control. To address this disconnect, we propose Language Movement Primitives (LMPs), a framework that grounds VLM reasoning in Dynamic Movement Primitive (DMP) parameterization. Our key insight is that DMPs provide a small number of interpretable parameters, and VLMs can set these parameters to specify diverse, continuous, and stable trajectories. Put another way: VLMs can reason over free-form natural language task descriptions, and semantically ground their desired motions into DMPs -- bridging the gap between high-level task reasoning and low-level position and velocity control. Building on this combination of VLMs and DMPs, we formulate our LMP pipeline for zero-shot robot manipulation that effectively completes tabletop manipulation problems by generating a sequence of DMP motions. Across 31 real-world manipulation tasks, we show that LMP achieves 65% task success as compared to 35% for the best performing baseline. See videos at our website: https://collab.me.vt.edu/lmp
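
The claim that DMPs expose "a small number of interpretable parameters" is easier to see in code. Below is a minimal sketch of a textbook one-dimensional discrete DMP, assuming the usual spring-damper transformation system with a phase-gated forcing term. It is not the paper's implementation: the function name, the gain values, and the basis-width heuristic are assumptions, and which of these parameters the VLM actually sets in LMP is not specified in the abstract.

```python
import math


def dmp_rollout(y0, goal, weights, tau=1.0, dt=0.01, alpha=25.0, alpha_x=4.0):
    """Integrate a one-dimensional discrete DMP with explicit Euler steps.

    Transformation system: tau * dz = alpha * (beta * (goal - y) - z) + f(x)
                           tau * dy = z
    Canonical system:      tau * dx = -alpha_x * x
    """
    beta = alpha / 4.0                                    # critical damping
    n = len(weights)
    centers = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    widths = [n ** 1.5 / c for c in centers]              # heuristic, an assumption
    y, z, x = y0, 0.0, 1.0
    path = [y]
    for _ in range(int(round(tau / dt)) * 2):             # run 2*tau so the spring settles
        psi = [math.exp(-h * (x - c) ** 2) for h, c in zip(widths, centers)]
        total = sum(psi)
        forcing = 0.0
        if total > 1e-10:
            # Normalized basis mixture, gated by the phase x and scaled by (goal - y0)
            forcing = sum(p * w for p, w in zip(psi, weights)) / total * x * (goal - y0)
        dz = (alpha * (beta * (goal - y) - z) + forcing) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z, y, x = z + dz * dt, y + dy * dt, x + dx * dt
        path.append(y)
    return path
```

Here `goal` fixes where the motion ends, `tau` its duration, and `weights` its shape. Because the forcing term vanishes as the phase decays, the trajectory converges to `goal` for any weights, which is the stability property the abstract relies on.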

2602.02009 2026-05-26 cs.LG

Logic-Guided Vector Fields for Constrained Generative Modeling

Ali Baheri

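AI summary: Proposes the Logic-Guided Vector Fields (LGVF) framework, which injects differentiable relaxations of logical constraints into flow matching generative models, combining a training-time logic loss with an inference-time gradient adjustment, and reduces constraint violations by 59-82% across three constrained generation case studies.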

Abstract

Neuro-symbolic systems aim to combine the expressive structure of symbolic logic with the flexibility of neural learning; yet, generative models typically lack mechanisms to enforce declarative constraints at generation time. We propose Logic-Guided Vector Fields (LGVF), a neuro-symbolic framework that injects symbolic knowledge, specified as differentiable relaxations of logical constraints, into flow matching generative models. LGVF couples two complementary mechanisms: (1) a training-time logic loss that penalizes constraint violations along continuous flow trajectories, with weights that emphasize correctness near the target distribution; and (2) an inference-time adjustment that steers sampling using constraint gradients, acting as a lightweight, logic-informed correction to the learned dynamics. We evaluate LGVF on three constrained generation case studies spanning linear, nonlinear, and multi-region feasibility constraints. Across all settings, LGVF reduces constraint violations by 59-82% compared to standard flow matching and achieves the lowest violation rates in each case. In the linear and ring settings, LGVF also improves distributional fidelity as measured by MMD, while in the multi-obstacle setting, we observe a satisfaction-fidelity trade-off, with improved feasibility but increased MMD. Beyond quantitative gains, LGVF yields constraint-aware vector fields exhibiting emergent obstacle-avoidance behavior, routing samples around forbidden regions without explicit path planning.
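
The inference-time mechanism (2) can be sketched as an Euler sampler whose velocity is corrected by the gradient of a constraint penalty. The example below assumes a single linear constraint a.x <= b relaxed as a squared hinge, a fixed guidance strength `lam`, and a user-supplied `velocity` function standing in for the learned vector field; none of these choices are confirmed by the abstract.

```python
def violation(x, a, b):
    """Differentiable relaxation of the constraint a.x <= b: squared hinge."""
    slack = sum(ai * xi for ai, xi in zip(a, x)) - b
    return max(slack, 0.0) ** 2


def violation_grad(x, a, b):
    """Gradient of the squared hinge with respect to x."""
    slack = sum(ai * xi for ai, xi in zip(a, x)) - b
    return [2.0 * max(slack, 0.0) * ai for ai in a]


def guided_sample(x0, velocity, a, b, lam=5.0, steps=100):
    """Euler-integrate dx/dt = v(x, t) - lam * grad violation(x) from t=0 to t=1."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        g = violation_grad(x, a, b)
        x = [xi + dt * (vi - lam * gi) for xi, vi, gi in zip(x, v, g)]
    return x
```

With `lam=0.0` this reduces to plain Euler sampling of the vector field, so the correction is active only where the relaxed constraint is violated.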

2602.01576 2026-05-26 cs.LG cs.AI cs.CV

Generative Visual Code Mobile World Models

Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin

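AI summary: Proposes generating the next mobile GUI state by having a single vision-language model predict executable web code, combining the strengths of text-based and visual world models to achieve high-fidelity visual generation with precise text rendering.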

Comments ICML 2026

Abstract

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at both training and inference time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while visual WMs cannot render text precisely and therefore rely on slow, complex pipelines that depend on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in-distribution and 2 out-of-distribution benchmarks, gWorld sets a new Pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models that are over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.

2602.01086 2026-05-26 cs.AI cs.CR cs.DB cs.DC cs.SE

MedBeads: An Agent-Native, Immutable Data Substrate for Trustworthy Medical AI

Takahito Nakajima

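AI summary: To address the context mismatch between electronic medical records and AI agents in medical AI, proposes MedBeads, an immutable data architecture based on a Merkle directed acyclic graph that replaces probabilistic retrieval with deterministic graph traversal to supply auditable, tamper-evident clinical context.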

Comments 19 pages, 5 figures. Code available at https://github.com/medbeads/medbeads

Abstract

Background: As of 2026, Large Language Models (LLMs) demonstrate expert-level medical knowledge. However, deploying them as autonomous "Clinical Agents" remains limited. Current Electronic Medical Records (EMRs) and standards like FHIR are designed for human review, creating a "Context Mismatch": AI agents receive fragmented data and must rely on probabilistic inference (e.g., RAG) to reconstruct patient history. This approach causes hallucinations and hinders auditability. Methods: We propose MedBeads, an agent-native data infrastructure where clinical events are immutable "Beads"--nodes in a Merkle Directed Acyclic Graph (DAG)--cryptographically referencing causal predecessors. This "write-once, read-many" architecture makes tampering mathematically detectable. We implemented a prototype with a Go Core Engine, Python middleware for LLM integration, and a React-based visualization interface. Results: We successfully implemented the workflow using synthetic data. The FHIR-to-DAG conversion transformed flat resources into a causally-linked graph. Our Breadth-First Search (BFS) Context Retrieval algorithm traverses relevant subgraphs with O(V+E) complexity, enabling real-time decision support. Tamper-evidence is guaranteed by design: any modification breaks the cryptographic chain. The visualization aids clinician understanding through explicit causal links. Conclusion: MedBeads addresses the "Context Mismatch" by shifting from probabilistic search to deterministic graph traversal, and from mutable records to immutable chains, providing the substrate for "Trustworthy Medical AI." It guarantees the context the AI receives is deterministic and tamper-evident, while the LLM determines interpretation. The structured Bead format serves as a token-efficient "AI-native language." We release MedBeads as open-source software to accelerate agent-native data standards.
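
The two core ideas in the Methods and Results, content-addressed immutable nodes and BFS context retrieval, can be sketched in a few lines. The code below is an illustrative Python toy, not the MedBeads prototype (which the abstract describes as a Go core engine): `bead_id`, `BeadStore`, the JSON serialization, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
from collections import deque


def bead_id(payload: dict, parents: list) -> str:
    """Content address: SHA-256 over the payload and the sorted parent hashes."""
    blob = json.dumps({"payload": payload, "parents": sorted(parents)}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class BeadStore:
    """Append-only store of beads keyed by their content hash."""

    def __init__(self) -> None:
        self.beads = {}

    def append(self, payload: dict, parents=None) -> str:
        parents = list(parents or [])
        missing = [p for p in parents if p not in self.beads]
        if missing:
            raise KeyError(f"unknown parent beads: {missing}")
        bid = bead_id(payload, parents)
        self.beads[bid] = {"payload": payload, "parents": parents}
        return bid

    def verify(self) -> bool:
        """Recompute every hash; any edited payload or link no longer matches its key."""
        return all(
            bead_id(b["payload"], b["parents"]) == bid for bid, b in self.beads.items()
        )

    def context(self, start: str) -> list:
        """Breadth-first walk over causal predecessors, O(V + E)."""
        seen, order, queue = {start}, [], deque([start])
        while queue:
            bid = queue.popleft()
            order.append(self.beads[bid]["payload"])
            for parent in self.beads[bid]["parents"]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return order
```

Since each bead's key depends on its parents' keys, changing any ancestor invalidates the stored hash, which is the sense in which tampering becomes detectable; retrieval returns the same context for the same starting bead every time.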

2601.22984 2026-05-26 cs.AI

Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory

Yuhao Zhan, Tianyu Fan, Linxuan Huang, Zirui Guo, Chao Huang

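AI summary: To address hallucinations that accumulate across the full research trajectory of Deep Research Agents (DRAs), proposes the PING taxonomy and a fine-grained framework that shift from outcome-based to process-aware evaluation, and builds the DeepHalluBench benchmark; experiments reveal systemic reliability gaps.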

Abstract

Diagnosing failure patterns in Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring intermediate hallucinations that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing hallucinations in the full plan-search-summarize trajectory. We introduce the PING Taxonomy, which categorizes DRA hallucinations into four complementary types: Propagation, Intent, Noise-induced, and Grounding. We further instantiate this taxonomy into a fine-grained evaluation framework that decomposes trajectories into atomic actions, claims, and sub-queries for rigorous verification. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks, including adversarial scenarios, we curate DeepHalluBench. Experiments on six representative DRAs show that, on our hallucination-prone stress-test set, all evaluated systems still exhibit non-negligible reliability gaps. Furthermore, our diagnostic analysis traces these failures to systemic deficits, especially hallucination propagation and cognitive biases, providing actionable insights for future architectural optimization. Code and data are available at https://github.com/yuhao-zhan/DeepHalluBench.

2601.21726 2026-05-26 cs.AI

DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting

Siru Zhong, Yiqiu Liu, Zhiqing Cui, Zezhi Shao, Fei Wang, Qingsong Wen, Yuxuan Liang

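AI summary: To address the noise sensitivity of deep time series models, proposes DropoutTS, a model-agnostic plugin that quantifies instance-level noise via spectral sparsity and dynamically adjusts dropout rates, suppressing spurious fluctuations while preserving fine-grained fidelity and significantly improving robustness with negligible parameter overhead.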

Abstract

Deep time series models are vulnerable to the noisy data that is ubiquitous in real-world applications. Existing robustness strategies either prune data or rely on costly prior quantification, failing to balance effectiveness and efficiency. In this paper, we introduce DropoutTS, a model-agnostic plugin that shifts the paradigm from "what" to learn to "how much" to learn. DropoutTS employs a Sample-Adaptive Dropout mechanism: leveraging spectral sparsity to efficiently quantify instance-level noise via reconstruction residuals, it dynamically calibrates model learning capacity by mapping noise to adaptive dropout rates, selectively suppressing spurious fluctuations while preserving fine-grained fidelity. Extensive experiments across diverse noise regimes and open benchmarks show that DropoutTS consistently boosts the performance of strong backbones, delivering advanced robustness with negligible parameter overhead and no architectural modifications. Our code is available at https://github.com/CityMind-Lab/DropoutTS.
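
One way to read "quantify instance-level noise via reconstruction residuals" is: keep only the strongest frequency components of a window, and treat the energy they fail to explain as noise. The sketch below follows that reading with a plain DFT from the standard library. It is an assumption-laden illustration rather than the DropoutTS implementation: `noise_score`, `adaptive_dropout`, the `keep` parameter, and the linear mapping to a dropout rate are all hypothetical.

```python
import cmath
import math


def tone(freq, n, amp=1.0):
    """Helper: a sine wave with `freq` cycles over `n` samples."""
    return [amp * math.sin(2.0 * math.pi * freq * t / n) for t in range(n)]


def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for short windows)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def noise_score(x, keep=2):
    """Share of (mean-removed) energy not explained by the `keep` strongest frequencies."""
    n = len(x)
    mean = sum(x) / n
    spec = dft([v - mean for v in x])
    energy = []
    for k in range(1, n // 2 + 1):
        # Each positive frequency has a mirror bin, except the Nyquist bin for even n
        weight = 1.0 if (n % 2 == 0 and k == n // 2) else 2.0
        energy.append(weight * abs(spec[k]) ** 2)
    total = sum(energy)
    if total == 0.0:
        return 0.0
    kept = sum(sorted(energy, reverse=True)[:keep])
    return 1.0 - kept / total


def adaptive_dropout(x, p_min=0.05, p_max=0.5, keep=2):
    """Map the noise score linearly onto a per-sample dropout rate."""
    return p_min + (p_max - p_min) * noise_score(x, keep)
```

A spectrally sparse window scores near zero and receives a dropout rate near `p_min`; a window whose energy is spread across many frequencies is regularized more strongly.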

2601.21670 2026-05-26 cs.CV cs.LG

Diverse via bounded Agreement: Geometric Regularization for Multimodal Fusion

Zixuan Xia, Hao Wang, Pengcheng Weng, Yanyu Qian, Yangxin Xu, William Dan, Fei Wang

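AI summary: Proposes a lightweight plug-and-play geometric regularization framework that follows a bounded-agreement principle, constraining cross-modal drift while preserving modality-specific diversity, to improve multimodal fusion performance.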

Abstract

Multimodal fusion is often treated as an optimization-balancing problem, where training signals are adjusted to prevent one modality from dominating the others. However, balanced optimization does not fully determine the geometry of intermediate representations. Supervised multimodal models may still learn low-diversity modality-specific embeddings or allow paired cross-modal observations to drift excessively apart, weakening both unimodal robustness and multimodal fusion. We introduce \regName, a lightweight plug-and-play geometric regularization framework for multimodal representation learning. Rather than enforcing rigid cross-modal alignment, \regName follows a bounded-agreement principle: preserve modality-specific diversity while softly constraining only the portion of paired cross-modal drift that exceeds an admissible agreement band. Operationally, \regName combines a dispersion term that mitigates spectral concentration with an agreement-band anchoring term that controls excessive paired drift, requiring no architectural modification or inference-time overhead. Experiments across audio-visual, image-text, and RF-based benchmarks show that \regName consistently improves multimodal performance and often strengthens unimodal representations. These results suggest that explicitly regulating representation geometry is an effective complement to optimization balancing, and provide evidence that geometry-aware regularization can improve multimodal learning across diverse architectures and domains.
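
The bounded-agreement idea, penalizing only the part of the paired drift that exceeds a band, resembles a hinge loss on embedding distance. The sketch below shows that reading for the anchoring term only. The function name, the Euclidean distance, the squared hinge, and the `band` value are assumptions; the abstract does not give the actual formula, and the dispersion term is not covered here.

```python
import math


def agreement_band_penalty(za, zb, band=0.5):
    """Mean squared hinge on paired embedding distances that exceed `band`.

    za, zb: equal-length lists of embedding vectors, one pair per sample.
    Pairs whose distance is within the band contribute nothing.
    """
    total = 0.0
    for a, b in zip(za, zb):
        drift = math.dist(a, b)                 # Euclidean distance of the pair
        total += max(drift - band, 0.0) ** 2    # only the excess is penalized
    return total / len(za)
```

Unlike a strict alignment loss, this leaves paired embeddings free to differ inside the band, so modality-specific structure is not pulled toward a single shared point.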

2601.19070 2026-05-26 cs.LG

Critical Organization of Deep Neural Networks, and p-Adic Statistical Field Theories

W. A. Zúñiga-Galindo

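AI summary: Rigorously establishes the thermodynamic limit of deep neural networks with sigmoid activation functions, reveals a critical organization that appears as a bifurcation in parameter space, uses p-adic integers to encode hierarchical structure and so links critical organization to hierarchical topology, and studies the output distribution of random versions of the networks.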

Comments Many typos and minor errors were corrected. The main theorem was strengthened

Abstract

We rigorously study the thermodynamic limit of deep neural networks (DNNs) and recurrent neural networks (RNNs), assuming that the activation functions are sigmoids. A thermodynamic limit is a continuous neural network, where the neurons form a continuous space with infinitely many points. We show that such a network admits a unique state in a certain region of the parameter space, and that this state depends continuously on the parameters. Outside that region, the state breaks into an infinite number of states. The critical organization is thus a bifurcation in the parameter space, where a network transitions from a unique state to infinitely many states. We use p-adic integers to codify hierarchical structures. Indeed, we present an algorithm that recasts the hierarchical topologies used in DNNs and RNNs as p-adic tree-like structures. In this framework, the hierarchical and the critical organizations are connected. We rigorously study the critical organization of a toy model, a hierarchical edge detector for grayscale images based on p-adic cellular neural networks. The critical organization of such a network can be described as a strange attractor. In the second part, we study random versions of DNNs and RNNs. In this case, the network parameters are generalized Gaussian random variables in a space of square-integrable functions. We compute the probability distribution of the output given the input, in the infinite-width case. We show that it admits a power-type expansion, where the constant term is a Gaussian distribution.

2601.11428 2026-05-26 cs.LG

Diagnosing Failure Modes of Neural Operators Across Diverse PDE Families

Lennon Shikhman

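AI summary: Proposes a standardized stress-testing framework and, by testing three architectures (FNO, DeepONet, and CNO) across different PDE families, finds that in-distribution accuracy does not reliably predict robustness and that failure modes depend on the combination of architecture and PDE family.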

Comments Published in Transactions on Machine Learning Research. 17 pages, 7 figures, 1 table

Abstract

Neural PDE solvers are increasingly used as learned surrogates for families of partial differential equations, where the key machine learning challenge is not only interpolation on a fixed benchmark distribution but generalization under structured shifts in coefficients, boundary conditions, discretization, and rollout horizon. Yet evaluation is still often dominated by in-distribution test error, making robustness difficult to assess. We introduce a standardized stress-testing framework for neural PDE solvers under deployment-relevant shift. We instantiate it on three representative architectures -- Fourier Neural Operators (FNOs), a DeepONet-style model, and convolutional neural operators (CNOs) -- across five qualitatively different PDE families: dispersive, elliptic, multi-scale fluid, financial, and chaotic systems. Across 750 trained models, we measure robustness using baseline-normalized degradation factors together with spectral and rollout diagnostics. The resulting comparisons reveal that strong in-distribution accuracy does not reliably predict robustness, and that failure patterns depend jointly on architecture and PDE family. Our results provide a clearer basis for evaluating robustness claims in neural PDE solvers and suggest that function-space generalization under structured shift should be treated as a first-class evaluation target.
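
A "baseline-normalized degradation factor" is most naturally read as the error under shift divided by the in-distribution error of the same model. The sketch below assumes that reading together with a relative L2 error; the exact definitions used in the paper are not given in the abstract, and both function names are hypothetical.

```python
def relative_l2(pred, target):
    """Relative L2 error ||pred - target|| / ||target|| for flat lists of values."""
    num = sum((p - t) ** 2 for p, t in zip(pred, target)) ** 0.5
    den = sum(t ** 2 for t in target) ** 0.5
    return num / den


def degradation_factor(err_shifted, err_baseline, eps=1e-12):
    """Error under a shift relative to the in-distribution (baseline) error.

    A value near 1 means the model is as accurate under the shift as in
    distribution; larger values mean the shift hurts more.
    """
    return err_shifted / max(err_baseline, eps)
```

Normalizing by each model's own baseline separates "how accurate" from "how fragile", which is why a model with the lowest in-distribution error can still show the largest factor.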

2601.10457 2026-05-26 cs.AI

NSR-Boost: A Neuro-Symbolic Residual Boosting Framework for Industrial Legacy Models

Ziming Dai, Dabiao Ma, Jinle Tong, Mengyuan Han, Jian Yang, Hongtao Liu, Haojun Fei, Qing Yang

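AI summary: To address the high cost and risk of upgrading industrial legacy models, proposes NSR-Boost, a non-intrusive neuro-symbolic residual boosting framework that locates hard regions through residuals, generates symbolic experts with an LLM, and integrates them dynamically through a lightweight aggregator, significantly improving performance and lowering the bad rate.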

Comments Accepted by KDD 2026

Abstract

Although Gradient Boosted Decision Trees (GBDTs) dominate industrial tabular applications, upgrading legacy models in high-concurrency production environments still faces prohibitive retraining costs and systemic risks. To address this problem, we present NSR-Boost, a neuro-symbolic residual boosting framework designed specifically for industrial scenarios. Its core advantage lies in being "non-intrusive": it treats the legacy model as frozen and performs targeted repairs on "hard regions" where predictions fail. The framework comprises three key stages: first, finding hard regions through residuals; second, generating interpretable experts by producing symbolic code structures with a Large Language Model (LLM) and fine-tuning their parameters with Bayesian optimization; and third, dynamically integrating the experts with the legacy model's output through a lightweight aggregator. Experimental results demonstrate that the framework significantly outperforms state-of-the-art (SOTA) baselines across six public datasets and one private dataset. More importantly, we report the successful deployment of NSR-Boost within the core financial risk control system of Qfin Holdings, where empirical results on real-world online traffic exhibit superior performance improvements and a significant reduction in the bad rate. In conclusion, it effectively captures long-tail risks missed by traditional models and offers a safe, low-cost evolutionary paradigm for industry.

2601.10201 2026-05-26 cs.LG cs.AI cs.CL

Future-KL Regularized GRPO: Process-Level Credit Assignment from $f$-Divergence Regularization

未来KL正则化GRPO:基于f-散度正则化的过程级信用分配

Jiarui Yao, Ruida Wang, Hao Bai, Tong Zhang

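AI Summary: The paper proposes Future-KL Regularized Policy Optimization (FRPO), which uses a causal future-regularization return-to-go to restore the gradient signal that GRPO's local KL loss omits, improving pass@16 on mathematical reasoning tasks while maintaining higher entropy and lower policy drift.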

详情
AI中文摘要

组相对策略优化(GRPO)广泛用于无评论家的大语言模型(LLM)后训练,但其KL正则化通常作为局部损失侧的token惩罚实现。我们表明这遗漏了自回归KL正则化诱导的策略梯度信号。与标准KL正则化强化学习(RL)目标不同,GRPO的组归一化引入非线性提示级效用;对于二元验证器奖励,该效用为$2\arcsin\sqrt p$。因此,奖励和KL在归一化前无法融合而不改变隐式目标。我们推导了具有token级$f$-散度正则化的GRPO风格目标的on-policy梯度。奖励项恢复标准化的GRPO优势,而正则化项包括局部KL损失遗漏的因果未来正则化回报。对于反向KL,这产生简单的未来KL修正:在优势构建后添加每个token对数比的反向累积和。由此产生的方法,未来KL正则化策略优化(FRPO),不需要评论家或额外的模型传递。在数学推理任务上,FRPO在我们的主要大模型设置中提高了pass@16,同时保持比传统损失侧KL基线更高的熵和更低的策略漂移。

英文摘要

Group Relative Policy Optimization (GRPO) is widely used for critic-free Large Language Model (LLM) post-training, but its KL regularization is usually implemented as a local loss-side token penalty. We show that this misses the policy-gradient signal induced by autoregressive KL regularization. Unlike standard KL-regularized Reinforcement Learning (RL) objectives, GRPO's group normalization induces a non-linear prompt-level utility; for binary verifier rewards, this utility is $2\arcsin\sqrt p$. As a result, reward and KL cannot be fused before normalization without changing the implicit objective. We derive the on-policy gradient of GRPO-style objectives with token-wise $f$-divergence regularization. The reward term recovers the standardized GRPO advantage, while the regularizer term includes a causal future-regularization return-to-go omitted by local KL losses. For reverse KL, this yields a simple future KL correction: add a reverse cumulative sum of per-token log ratios after advantage construction. The resulting method, Future-KL Regularized Policy Optimization (FRPO), requires no critic or extra model passes. On mathematical reasoning tasks, FRPO improves pass@16 in our main large-model setting while maintaining higher entropy and lower policy drift than conventional loss-side KL baselines.
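
The future KL correction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the coefficient `beta`, the minus sign on the correction, and the choice to include the current token in the cumulative sum are all assumptions made for the example.

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Standardize one prompt's group of scalar rewards (GRPO-style advantage)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def reverse_cumsum(values):
    """Entry t holds sum(values[t:]), i.e. the return-to-go from token t."""
    out, running = [], 0.0
    for v in reversed(values):
        running += v
        out.append(running)
    return out[::-1]

def future_kl_advantages(rewards, log_ratios, beta=0.01):
    """Per-token advantages: sequence-level advantage minus beta * future log-ratio sum.

    log_ratios[i][t] = log pi_theta(token t) - log pi_ref(token t) for response i.
    ASSUMPTION: the minus sign, beta, and inclusion of token t itself are illustrative.
    """
    advantages = grpo_advantages(rewards)
    return [[a - beta * g for g in reverse_cumsum(lr)]
            for a, lr in zip(advantages, log_ratios)]
```

The correction is applied only after the rewards have been standardized, which matches the abstract's point that reward and KL cannot be fused before normalization. As a side note on the stated utility, the derivative of $2\arcsin\sqrt p$ is $1/\sqrt{p(1-p)}$, the reciprocal of the standard deviation of a Bernoulli($p$) reward.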

2601.10012 2026-05-26 cs.LG

PID-Guided Partial Alignment for Multimodal Decentralized Federated Learning

PID引导的多模态去中心化联邦学习部分对齐

Yanhang Shi, Xiaoyu Wang, Houwei Cao, Jian Li, Yong Liu

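AI Summary: To address incompatible updates among heterogeneous agents in multimodal decentralized federated learning, the paper proposes PARSE, a framework based on partial information decomposition that achieves efficient communication and collaboration through feature fission and partial alignment.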

详情
AI中文摘要

多模态去中心化联邦学习(DFL)必须支持持有不同模态子集和通常不同模型组件的代理之间的协作,同时在无协调服务器或全局网络视图的点对点(P2P)覆盖网络上运行。一个关键障碍是,传统的多模态训练通常依赖于单一共享表示,这隐含假设异构对等体可以通过相同的通信链路交换和聚合相同的模型组件。在多模态DFL中,这一假设不成立:单模态和多模态代理可能通过共享覆盖网络推送不兼容的更新,削弱代理间迁移和跨模态交互。我们提出PARSE,一个无服务器框架,将部分信息分解(PID)引入多模态DFL。每个代理将其潜在特征分裂为冗余、独特和协同切片(“特征分裂”),并在模态条件化的P2P覆盖网络上进行切片感知通信。在训练过程中,代理仅交换与其邻居在语义上可对齐的切片,根据它们共享的模态和模型组件(“部分对齐”)。这种设计避免了集中式编排和梯度手术式的冲突处理,同时与标准DFL约束和多种P2P覆盖网络拓扑兼容。在多个基准测试和异构代理混合场景中,PARSE在保持每链路负载受限的同时,始终优于任务共享、模态共享和混合共享的多模态DFL基线。关于融合选择和分裂比例的消融实验,以及定性特征分析和覆盖网络拓扑研究,证明了所提出的切片感知设计的鲁棒性和通信效率。

英文摘要

Multimodal decentralized federated learning (DFL) must support collaboration among agents that hold different modality subsets and often different model components, while operating over peer-to-peer (P2P) overlays without a coordinating server or a global network view. A key obstacle is that conventional multimodal training often relies on a single shared representation, which implicitly assumes that heterogeneous peers can exchange and aggregate the same model components over the same communication links. In multimodal DFL, this assumption breaks down: uni- and multimodal agents may push incompatible updates through shared overlays, weakening both inter-agent transfer and cross-modal interaction. We present PARSE, a server-free framework that brings partial information decomposition (PID) into multimodal DFL. Each agent splits its latent features into redundant, unique, and synergistic slices ("feature fission"), and performs slice-aware communication over modality-conditioned P2P overlays. During training, agents exchange only the slices that are semantically alignable with their neighbors, according to the modalities and model components they share ("partial alignment"). This design avoids centralized orchestration and gradient-surgery style conflict handling, while remaining compatible with standard DFL constraints and a range of P2P overlay topologies. Across multiple benchmarks and heterogeneous peer mixes, PARSE consistently outperforms task-, modality-, and hybrid-sharing multimodal DFL baselines while keeping per-link payloads bounded. Ablations on fusion choices and split ratios, together with qualitative feature analyses and overlay-topology studies, demonstrate the robustness and communication efficiency of the proposed slice-aware design.

2601.05847 2026-05-26 cs.CL

Schema-Grounded LLM Extraction for FHIR Patient Digital Twins

基于Schema的LLM抽取用于FHIR患者数字孪生

Rafael Brens, Yuqiao Meng, Luoxi Tang, Zhaohan Xi

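AI Summary: The paper proposes SG-LLM, which generates valid FHIR bundles from unstructured EHRs through retrieval augmentation, JSON Schema constraints, and a validator repair loop, and outperforms baselines in a clinical-utility experiment.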

详情
AI中文摘要

我们重新审视从非结构化电子健康记录(EHR)构建可互操作患者数字孪生的问题,并认为该任务更适合被视作有效FHIR Bundle的受控生成,而非抽取模块的级联。我们引入SG-LLM,一种基于schema的LLM抽取器,它(i)通过SapBERT索引检索的候选SNOMED-CT、RxNorm和LOINC代码增强提示,(ii)在直接源自FHIR R4 StructureDefinitions的JSON Schema下解码,(iii)关闭一个验证器在环修复阶段,其诊断结果作为结构化错误消息反馈。我们认为,孪生的有用性(而不仅仅是跨度级F1)才是正确的评估对象,并通过一项临床效用实验将其操作化,该实验测量了基于SG-LLM生成的FHIR Bundle与专家策划的Bundle训练的分类器在30天再入院AUROC上的差距。在MIMIC-IV和n2c2 2018 Track 2基准测试上,SG-LLM匹配或超过了强大的联合抽取和普通LLM基线,同时生成了更有效的Bundle。消融实验分离了检索、schema约束和修复循环的贡献。所有代码、提示和schema均已发布。

英文摘要

We revisit the problem of constructing interoperable patient digital twins from unstructured electronic health records (EHRs) and argue that the task is better cast not as a cascade of extraction modules but as constrained generation of a valid FHIR bundle. We introduce SG-LLM, a schema-grounded LLM extractor that (i) augments the prompt with candidate SNOMED-CT, RxNorm, and LOINC codes retrieved through a SapBERT index, (ii) decodes under a JSON Schema derived directly from FHIR R4 StructureDefinitions, and (iii) closes a validator-in-the-loop repair stage whose diagnostics are fed back as structured error messages. We argue that the twin's usefulness, not only span-level F1, is the right object of evaluation, and operationalize this with a clinical-utility experiment that measures the gap in 30-day readmission AUROC between classifiers trained on SG-LLM-generated FHIR bundles versus expert-curated ones. On MIMIC-IV and n2c2 2018 Track 2 benchmarks, SG-LLM matches or exceeds strong joint-extraction and vanilla-LLM baselines while producing substantially more valid bundles. Ablations isolate the contributions of retrieval, schema constraint, and the repair loop. All code, prompts, and schemas are released.
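
The validator-in-the-loop repair stage can be pictured as a short retry loop. The sketch below is an assumption-laden illustration rather than the released SG-LLM code: `generate` is a hypothetical callable wrapping the LLM, `toy_validate` stands in for a real FHIR validator and checks only two Bundle fields, and `max_rounds` is an invented parameter.

```python
def toy_validate(bundle):
    """Return a list of structured error messages (empty means valid).

    Stand-in for a real FHIR validator; it checks only two Bundle fields.
    """
    errors = []
    if bundle.get("resourceType") != "Bundle":
        errors.append("resourceType: expected 'Bundle'")
    if "type" not in bundle:
        errors.append("type: required field is missing")
    return errors

def generate_with_repair(note, generate, validate=toy_validate, max_rounds=3):
    """Call generate(note, errors) until validate() reports no errors.

    `generate` is a hypothetical callable wrapping the LLM; the previous
    round's diagnostics are passed back so the model can repair its output.
    """
    errors, bundle = [], None
    for _ in range(max_rounds):
        bundle = generate(note, errors)
        errors = validate(bundle)
        if not errors:
            return bundle, []
    # Give up after max_rounds and surface the remaining diagnostics.
    return bundle, errors
```

Returning the remaining diagnostics alongside the last bundle lets a caller decide whether to discard an invalid result or log it for inspection.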

2601.05004 2026-05-26 cs.CL

Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei

大语言模型能否解决自我毁灭亚文化中的语义差异?来自Jirai Kei的证据

Peng Wang, Xilin Tao, Siyi Yao, Jiageng Wu, Yuntao Zou, Zhuotao Tian, Libo Qin, Dagang Li

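AI Summary: To address the knowledge lag and semantic misalignment that hinder detection of self-destructive behavior in subcultures, the paper proposes SAS, a multi-agent framework that significantly improves LLM detection performance through automatic retrieval and subculture alignment and outperforms existing advanced methods.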

Comments Preprint

详情
AI中文摘要

自我毁灭行为与复杂的心理状态相关,且难以诊断。由于亚文化群体独特的表达方式,这些行为可能更难识别。随着大语言模型(LLM)在各领域的部署,一些研究者开始探索其在检测自我毁灭行为中的应用。受此启发,我们使用当前基于LLM的方法研究亚文化中的自我毁灭行为检测。然而,这些方法面临两个主要挑战:(1)知识滞后:亚文化俚语演变迅速,快于LLM的训练周期;(2)语义错位:难以把握亚文化特有的具体和细微表达。为解决这些问题,我们提出亚文化对齐求解器(SAS),一个多智能体框架,集成了自动检索和亚文化对齐,显著提升了LLM在检测自我毁灭行为中的性能。实验结果表明,SAS优于当前先进的多智能体框架OWL。值得注意的是,它与微调后的LLM表现相当。我们希望SAS能推动亚文化背景下自我毁灭行为检测领域的发展,并为未来研究者提供宝贵资源。

英文摘要

Self-destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) are deployed across various fields, some researchers have begun exploring their application for detecting self-destructive behaviors. Motivated by this, we investigate self-destructive behavior detection within subcultures using current LLM-based methods. However, these methods face two main challenges: (1) Knowledge Lag: subcultural slang evolves rapidly, faster than LLMs' training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we propose Subcultural Alignment Solver (SAS), a multi-agent framework that incorporates automatic retrieval and subculture alignment, significantly boosting the performance of LLMs in detecting self-destructive behavior. Our experimental results show that SAS outperforms the current advanced multi-agent framework OWL. Notably, it competes well with fine-tuned LLMs. We hope that SAS will advance the field of self-destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.

2601.03191 2026-05-26 cs.CV cs.AI cs.LG

AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation

AnatomiX:一种解剖学感知的胸部X光解读多模态大语言模型

Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert

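AI Summary: The paper proposes AnatomiX, a two-stage anatomy-aware multimodal large language model that first identifies anatomical structures and then performs downstream tasks, improving on existing methods by over 25% on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning.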

详情
AI中文摘要

多模态医学大语言模型在胸部X光解读方面取得了显著进展,但在空间推理和解剖学理解方面仍面临挑战。尽管现有的定位技术提高了整体性能,但它们往往未能建立真正的解剖对应关系,导致医学领域中的解剖理解错误。为弥补这一差距,我们引入了AnatomiX,一种用于解剖学定位的胸部X光解读的多任务多模态大语言模型。受放射学工作流程启发,AnatomiX采用两阶段方法:首先识别解剖结构并提取其特征,然后利用大语言模型执行多种下游任务,如短语定位、报告生成、视觉问答和图像理解。在多个基准上的大量实验表明,与现有方法相比,AnatomiX实现了卓越的解剖推理,并在解剖定位、短语定位、定位诊断和定位描述任务上性能提升超过25%。代码和预训练模型可在 https://aneesurhashmi.github.io/anatomix 获取。

英文摘要

Multimodal medical large language models have shown substantial progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two-stage approach: first, it identifies anatomical structures and extracts their features; then, it leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning tasks compared to existing approaches. Code and the pretrained model are available at https://aneesurhashmi.github.io/anatomix

2601.02589 2026-05-26 cs.CL cs.AI

FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions

FlowPlan-G2P:一种将科学论文转化为专利描述的结构化生成框架

Kris W Pan, Yongmin Yoo

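AI Summary: The paper proposes FlowPlan-G2P, a graph-mediated generation framework that decomposes the conversion of scientific papers into patent-compliant descriptions into three stages (concept graph induction, section-level planning, and graph-conditioned generation) and outperforms large proprietary models under domain-specific evaluation.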

详情
AI中文摘要

由于科学论文与专利在修辞和结构上的根本差异,从科学论文生成专利描述具有挑战性。现有方法将其视为表面改写,未能捕捉专利起草中固有的层次推理和法定约束。我们提出FlowPlan-G2P,一种图介导的生成框架,将该转换分解为三个阶段:(1)概念图归纳,将技术实体和功能依赖提取为有向图;(2)章节级规划,将图划分为与规范专利章节对齐的连贯子图;(3)图条件生成,基于章节特定子图合成符合法律要求的段落。在专家验证基准上的实验表明,标准NLG指标系统性偏好法律不合规输出而非有效专利描述,这促使我们进行领域特定评估。在该评估下,使用开放权重骨干的FlowPlan-G2P始终优于原始专有模型,表明结构化分解比模型规模更能决定质量。

英文摘要

Generating patent descriptions from scientific papers is challenging due to fundamental rhetorical and structural disparities between the two genres. Existing approaches treat this as surface-level rewriting, failing to capture the hierarchical reasoning and statutory constraints inherent in patent drafting. We propose FlowPlan-G2P, a graph-mediated generation framework that decomposes this transformation into three stages: (1) Concept Graph Induction, extracting technical entities and functional dependencies into a directed graph; (2) Section-level Planning, partitioning the graph into coherent subgraphs aligned with canonical patent sections; and (3) Graph-Conditioned Generation, synthesizing legally compliant paragraphs conditioned on section-specific subgraphs. Experiments on expert-validated benchmarks reveal that standard NLG metrics systematically favor legally non-compliant outputs over valid patent descriptions, motivating our domain-specific evaluation. Under this evaluation, FlowPlan-G2P with an open-weight backbone consistently outperforms vanilla proprietary models, demonstrating that structured decomposition is a stronger determinant of quality than model scale.

2512.21815 2026-05-26 cs.CV cs.LG

High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models

高熵标记作为视觉-语言模型中的多模态失败点

Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou, Zhaoyuan Yang, Jing Zhang

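AI Summary: The study shows that roughly 20% of tokens, those with high entropy, concentrate a disproportionate share of adversarial influence in vision-language models, and proposes a sparse Entropy-Guided Attack (EGA) that achieves high attack success and harmful rates.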

Comments 19 pages, 11 figures, 8 tables

详情
AI中文摘要

视觉-语言模型(VLM)取得了显著性能,但仍易受对抗攻击。熵作为模型不确定性的度量,与VLM可靠性高度相关。虽然先前的基于熵的攻击在解码步骤中最大化不确定性,隐含假设每个标记对模型不稳定性的贡献相等,但我们揭示了在评估的具有不同架构的代表性开源VLM中,一小部分(约20%)高熵标记在自回归生成过程中集中了不成比例的对抗性影响。我们证明,将这些对抗扰动集中到这些高熵位置,可以在优化更少解码位置的情况下实现与全局方法相当的语义退化。此外,在多个代表性VLM中,此类攻击不仅导致语义漂移,还在当前流程下产生大量不安全子集(20-31%)。值得注意的是,由于这种脆弱的高熵标记在不同架构的VLM中重复出现,针对它们的攻击表现出非平凡的迁移性。受这些发现启发,我们设计了一种简单的熵引导攻击(EGA),该攻击实现了稀疏高熵定位,并通过可重用的标记库扩展,在三个代表性开源VLM上取得了具有竞争力的攻击成功率(93-95%)和相当高的有害率(30.2-38.6%)。

英文摘要

Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, as a measure of model uncertainty, is highly correlated with VLM reliability. While prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token equally contributes to model instability, we reveal that a small fraction (around 20%) of high-entropy tokens, in the evaluated representative open-source VLMs with diverse architectures, concentrates a disproportionate share of adversarial influence during autoregressive generation. We demonstrate that concentrating adversarial perturbations on these high-entropy positions achieves comparable semantic degradation to global methods while optimizing fewer decoding positions. Additionally, across multiple representative VLMs, such attacks induce not only semantic drift but also a substantial unsafe subset (20-31%) under the current pipeline. Remarkably, since such vulnerable high-entropy tokens recur across architecturally diverse VLMs, attacks focused on them exhibit non-trivial transferability. Motivated by these findings, we design a simple Entropy-Guided Attack (EGA) that operationalizes sparse high-entropy targeting and extends it with a reusable token bank, yielding competitive attack success rates (93-95%) with a considerable harmful rate (30.2-38.6%) on the three representative open-source VLMs.
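
To make the "small fraction of high-entropy tokens" concrete, the sketch below ranks decoding positions by the Shannon entropy of their next-token distribution and keeps the top fraction. It covers only this measurement step. The function names and the list-of-probabilities input format are assumptions for illustration, and the 20% default simply mirrors the figure quoted in the abstract.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_positions(step_distributions, fraction=0.2):
    """Return the decoding positions whose entropy falls in the top `fraction`.

    step_distributions[t] is the model's probability vector at decoding step t.
    ASSUMPTION: at least one position is always kept, and ties break by position.
    """
    entropies = [token_entropy(p) for p in step_distributions]
    k = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])
```

A ranking of this kind is also useful defensively, for example to flag the positions where a model's output is least stable.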

2512.21208 2026-05-26 cs.LG math.DS math.OC

A Learning Stability Profile for Finite-Dimensional Learning Dynamics

有限维学习动力学的学习稳定性剖面

Ronald Katende

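AI Summary: The paper proposes a finite-dimensional sensitivity framework in which a Lyapunov criterion controls the Learning Stability Profile, with applications to feedforward networks, residual architectures, stochastic gradient methods, and nonsmooth systems.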

Comments 19 pages, 0 figures

详情
AI中文摘要

我们开发了一个有限维灵敏度框架,用于研究状态包含表示、参数和更新变量的学习系统的稳定性。核心对象是学习稳定性剖面,这是一个方向灵敏度算子集合,记录了输入、参数初始化和更新机制中的扰动如何沿指定学习轨迹传播。主要结果是控制该剖面的Lyapunov准则。在显式的正则性、强制性和耗散性假设下,增量Lyapunov能量对相关的线性化转移算子产生一致或指数衰减的界。该结果被表述为充分稳定性准则,而非无条件逆定理。该框架还区分了终端衰减、剖面有界性和次指数增长,避免了将非正增长指数与一致有界性等同。然后,该剖面被专门应用于几种标准学习机制。谱界为前馈网络提供前向灵敏度估计。耗散性和步长限制为残差架构提供稳定性界。均方收缩假设为随机梯度方法提供参数和更新灵敏度界。局部Lipschitz系统,包括分段线性网络、近端映射、投影更新以及递归或状态空间递归,通过Clarke广义Jacobian和变分Lyapunov不等式处理。所得框架为架构、优化、随机性和非光滑性提供了统一的稳定性语言。其作用是结构性的:它将已知的稳定性机制组织在一个扰动演算中,同时使每种保证所需的假设保持明确。

英文摘要

We develop a finite-dimensional sensitivity framework for studying stability in learning systems whose states include representations, parameters, and update variables. The central object is the Learning Stability Profile, a collection of directional sensitivity operators that records how perturbations in inputs, parameter initialization, and update mechanisms propagate along a specified learning trajectory. The main result is a Lyapunov criterion for controlling this profile. Under explicit regularity, coercivity, and dissipation assumptions, an incremental Lyapunov energy yields uniform or exponentially decaying bounds on the associated linearized transition operators. The result is stated as a sufficient stability criterion, not as an unconditional converse theorem. The framework also distinguishes terminal decay, profile-wise boundedness, and subexponential growth, avoiding the identification of nonpositive growth exponents with uniform boundedness. The profile is then specialized to several standard learning mechanisms. Spectral bounds give forward sensitivity estimates for feedforward networks. Dissipativity and step-size restrictions give stability bounds for residual architectures. Mean-square contraction assumptions yield parameter and update sensitivity bounds for stochastic gradient methods. Locally Lipschitz systems, including piecewise-linear networks, proximal maps, projected updates, and recurrent or state-space recursions, are handled through Clarke generalized Jacobians and variational Lyapunov inequalities. The resulting framework provides a common stability language for architecture, optimization, stochasticity, and nonsmoothness. Its role is structural: it organizes known stability mechanisms within one perturbation calculus while keeping the hypotheses needed for each guarantee explicit.

2512.19097 2026-05-26 cs.LG cs.AI

DIVER-1: Scaling Intracranial EEG Foundation Models for Transferable Representations

DIVER-1: 扩展颅内脑电图基础模型以实现可迁移表示

Danny Dongyeop Han, Yonghyeon Gwon, Ahhyun Lucy Lee, Taeyang Lee, Seong Jin Lee, Jubin Choi, Sebin Lee, Jihyun Bang, Seungju Lee, David Keetae Park, Shinjae Yoo, Chun Kee Chung, Jiook Cha

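AI Summary: The paper proposes DIVER-1, a self-supervised iEEG foundation model that handles variable inputs through designs such as any-variate electrode-time attention and spatio-temporal resampling; pretrained on 5,310 hours of ECoG and SEEG, it surpasses existing models on cognitive decoding and seizure detection and comes with the first controlled compute-aware scaling study.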

Comments 31 pages, 12 figures, 14 tables

详情
AI中文摘要

颅内脑电图(iEEG)提供直接、毫秒级的人类神经活动记录,但由于电极布局、解剖覆盖、参考方案和记录条件在不同患者和中心之间存在差异,可重用的表示学习变得困难。我们引入了DIVER-1,一个用于可变输入记录的自监督iEEG基础模型,它结合了任意变量电极-时间注意力、时空重采样、输入条件位置嵌入和多域掩码重建,而不假设固定的电极布局。我们在5310小时的ECoG和SEEG上预训练了两个变体DIVER-1-0.1s和DIVER-1-1s,涵盖352k通道小时,大约是BrainTreeBank预训练量的54倍。我们在两个保留基准上评估DIVER-1:用于自然认知解码的Neuroprobe和用于癫痫检测的MAYO。在考虑泄漏的Neuroprobe上,尽管预训练时未使用构成Neuroprobe语料库的BrainTreeBank记录,DIVER-1-0.1s仍优于先前评估的iEEG基础模型;它在平均AUROC上也超过了线性频谱图解码器,并与更强的非线性基线保持竞争力,这是先前评估的iEEG基础模型未能达到的水平。DIVER-1-1s在MAYO癫痫检测上也取得了最高的AUROC。最后,我们进行了据我们所知首次受控计算感知的自监督iEEG预训练扩展研究,扫描了数据规模、受试者数量、训练时长和模型大小(高达1.8B参数)。我们的结果表明存在数据受限区域:扩展独特记录和充分训练是比单纯增加参数数量更可靠的扩展轴。代码可在链接处获取。

英文摘要

Intracranial EEG (iEEG) provides direct, millisecond-scale recordings of human neural activity, but reusable representation learning is difficult because electrode layouts, anatomical coverage, referencing schemes, and recording conditions vary across patients and centers. We introduce DIVER-1, a self-supervised iEEG foundation model for variable-input recordings that combines any-variate electrode-time attention, spatio-temporal resampling, input-conditioned positional embeddings, and multi-domain masked reconstruction without assuming a fixed electrode montage. We pretrain two variants, DIVER-1-0.1s and DIVER-1-1s, on 5,310 hours of ECoG and SEEG spanning 352k channel-hours, roughly 54x the BrainTreeBank-based pretraining volume. We evaluate DIVER-1 on two held-out benchmarks: Neuroprobe for naturalistic cognitive decoding and MAYO for seizure detection. On leakage-aware Neuroprobe, DIVER-1-0.1s outperforms prior evaluated iEEG foundation models despite using no BrainTreeBank recordings, the corpus underlying Neuroprobe, during pretraining; it also exceeds the linear spectrogram decoder in mean AUROC and remains competitive with stronger nonlinear baselines, a level prior evaluated iEEG foundation models did not reach. DIVER-1-1s also achieves the top AUROC on MAYO seizure detection. Finally, we conduct, to our knowledge, the first controlled compute-aware scaling study for self-supervised iEEG pretraining, sweeping data scale, subject count, training duration, and model size up to 1.8B parameters. Our results indicate a data-constrained regime: expanding unique recordings and training sufficiently long are more reliable scaling axes than increasing parameter count alone. Code is available at link.

2512.14180 2026-05-26 cs.CV

Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere

球面Voronoi:作为球面可微分划分的定向外观

Francesco Di Sario, Daniel Rebain, Dor Verbin, Marco Grangetto, Andrea Tagliasacchi

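AI Summary: The paper proposes Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting, modeling view-dependent effects through a learnable partition into regions and reaching state-of-the-art results on reflection modeling.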

详情
AI中文摘要

辐射场方法(例如3D高斯泼溅)已成为新视角合成的强大范式,但其外观建模通常依赖于球谐函数(SH),这带来了根本性限制。SH难以处理高频信号,存在吉布斯振铃伪影,并且无法捕捉镜面反射——这是真实感渲染的关键组成部分。尽管球面高斯等替代方案有所改进,但它们增加了显著的优化复杂度。我们提出球面Voronoi(SV)作为3D高斯泼溅中外观表示的统一框架。SV将方向域划分为具有平滑边界的可学习区域,为视图依赖效应提供了直观且稳定的参数化。对于漫反射外观,SV在保持优化比现有替代方案更简单的同时取得了有竞争力的结果。对于反射——SH失败的地方——我们利用SV作为可学习的反射探针,遵循经典图形学原理将反射方向作为输入。该公式在合成和真实世界数据集上取得了最先进的结果,表明SV为显式3D表示中的外观建模提供了一种有原则、高效且通用的解决方案。项目页面:https://sphericalvoronoi.github.io/

英文摘要

Radiance field methods (e.g. 3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations. SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and fail to capture specular reflections - a key component of realistic rendering. Although alternatives like spherical Gaussians offer improvements, they add significant optimization complexity. We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while keeping optimization simpler than existing alternatives. For reflections - where SH fail - we leverage SV as learnable reflection probes, taking reflected directions as input following principles from classical graphics. This formulation attains state-of-the-art results on synthetic and real-world datasets, demonstrating that SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations. Project page: https://sphericalvoronoi.github.io/

2512.12576 2026-05-26 cs.CL cs.AI

Coupled Variational Reinforcement Learning for Language Model General Reasoning

耦合变分强化学习用于语言模型通用推理

Xueru Wen, Jie Lou, Yanjiang Liu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Yaojie Lu, Debing Zhang

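AI Summary: The paper proposes CoVRL, which combines variational inference with reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy, addressing the inefficient exploration and trace-answer incoherence of verifier-free RL and improving performance on mathematical and general reasoning benchmarks.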

Comments Accepted to ICML 2026

详情
AI中文摘要

虽然强化学习在语言模型推理方面取得了显著进展,但它受到可验证奖励要求的限制。最近的无验证器强化学习方法通过利用LLM生成参考答案的概率作为奖励信号来解决这一限制。然而,这些方法通常仅基于问题采样推理轨迹。这种设计将推理轨迹采样与答案信息解耦,导致探索效率低下以及轨迹与最终答案之间的不一致。在本文中,我们提出了Coupled Variational Reinforcement Learning(CoVRL),它通过混合采样策略耦合先验和后验分布,将变分推理与强化学习联系起来。通过构建和优化整合这两种分布的复合分布,CoVRL实现了高效探索,同时保持了思想与答案之间的强一致性。在数学和通用推理基准上的大量实验表明,CoVRL在基础模型上提升了12.4%的性能,并在最先进的无验证器强化学习基线基础上额外提升了2.3%,为增强语言模型的通用推理能力提供了一个原则性框架。

英文摘要

While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifiable rewards. Recent verifier-free RL methods address this limitation by utilizing the probabilities that LLMs generate reference answers as reward signals. However, these approaches typically sample reasoning traces conditioned only on the question. This design decouples reasoning-trace sampling from answer information, leading to inefficient exploration and incoherence between traces and final answers. In this paper, we propose Coupled Variational Reinforcement Learning (CoVRL), which bridges variational inference and reinforcement learning by coupling prior and posterior distributions through a hybrid sampling strategy. By constructing and optimizing a composite distribution that integrates these two distributions, CoVRL enables efficient exploration while preserving strong thought-answer coherence. Extensive experiments on mathematical and general reasoning benchmarks show that CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over state-of-the-art verifier-free RL baselines, providing a principled framework for enhancing the general reasoning capabilities of language models.

2512.11941 2026-05-26 cs.CV cs.AI

DynaPURLS: Dynamic Refinement of Part-Aware Representations for Skeleton-Based Zero-Shot Action Recognition

DynaPURLS: 基于骨架的零样本动作识别中部分感知表示的动态细化

Jingmin Zhu, Anqi Zhu, James Bailey, Jun Liu, Hossein Rahmani, Mohammed Bennamoun, Farid Boussaid, Qiuhong Ke

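AI Summary: The paper proposes DynaPURLS, a framework that tackles domain shift in zero-shot skeleton-based action recognition through multi-scale visual-semantic correspondences and a dynamic refinement module, achieving the best results on three benchmark datasets.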

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence

详情
AI中文摘要

基于骨架的零样本动作识别(ZS-SAR)从根本上受到主流方法的限制,这些方法依赖于将骨架特征与静态的类级语义对齐。这种粗粒度的对齐无法弥合可见类和未见类之间的领域偏移,从而阻碍了细粒度视觉知识的有效迁移。为了解决这些限制,我们引入了DynaPURLS,一个统一的框架,它建立稳健的多尺度视觉-语义对应,并在推理时动态细化它们以增强泛化能力。我们的框架利用大型语言模型生成层次化的文本描述,涵盖全局运动和局部身体部位动态。同时,一个自适应划分模块通过语义分组骨架关节点生成细粒度的视觉表示。为了强化这种细粒度对齐以应对训练-测试领域偏移,DynaPURLS包含一个动态细化模块。在推理时,该模块通过轻量级可学习投影将文本特征适应于输入的视觉流。该细化过程由一个置信度感知的类平衡记忆库稳定,该记忆库减轻了来自噪声伪标签的错误传播。在三个大规模基准数据集(包括NTU RGB+D 60/120和PKU-MMD)上的大量实验表明,DynaPURLS显著优于先前的方法,创造了新的最先进记录。源代码已在https://github.com/Alchemist0754/DynaPURLS公开。

英文摘要

Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that rely on aligning skeleton features with static, class-level semantics. This coarse-grained alignment fails to bridge the domain shift between seen and unseen classes, thereby impeding the effective transfer of fine-grained visual knowledge. To address these limitations, we introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences and dynamically refines them at inference time to enhance generalization. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Concurrently, an adaptive partitioning module produces fine-grained visual representations by semantically grouping skeleton joints. To fortify this fine-grained alignment against the train-test domain shift, DynaPURLS incorporates a dynamic refinement module. During inference, this module adapts textual features to the incoming visual stream via a lightweight learnable projection. This refinement process is stabilized by a confidence-aware, class-balanced memory bank, which mitigates error propagation from noisy pseudo-labels. Extensive experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art, setting new state-of-the-art records. The source code is made publicly available at https://github.com/Alchemist0754/DynaPURLS

2512.04883 2026-05-26 cs.CV

SDG-Track: A Heterogeneous Observer-Follower Framework for High-Resolution UAV Tracking on Embedded Platforms

SDG-Track: 一种用于嵌入式平台高分辨率无人机跟踪的异构观察者-跟随者框架

Jiawen Wen, Yu Hu, Suixuan Qiu, Jinshan Huang, Xiaowen Chu

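AI Summary: The paper proposes SDG-Track, a framework with an Observer-Follower architecture that uses sparse detection-guided tracking and a Dual-Space Recovery mechanism to track UAVs in real time at high resolution on embedded platforms, reaching 35.1 FPS while retaining 97.2% of frame-by-frame detection precision.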

Comments Withdrawn by the authors due to unresolved authorship and public-disclosure authorization issues

详情
AI中文摘要

在边缘设备上对小型无人机(UAV)进行实时跟踪面临根本性的分辨率-速度冲突。将高分辨率图像下采样到标准检测器输入尺寸会导致小目标特征低于可检测阈值。然而,在资源受限平台上处理原生1080p帧无法为平滑云台控制提供足够的吞吐量。我们提出SDG-Track,一种稀疏检测引导跟踪器,采用观察者-跟随者架构来解决这一冲突。观察者流在GPU上以低频率运行高容量检测器,从1920x1080帧中提供准确的位置锚点。跟随者流在CPU上通过ROI约束的稀疏光流执行高频轨迹插值。为了处理由光谱相似干扰物引起的遮挡或模型漂移导致的跟踪失败,我们引入了双空间恢复,一种无需训练的重捕获机制,结合颜色直方图匹配与几何一致性约束。在地对空跟踪站上的实验表明,SDG-Track实现了35.1 FPS的系统吞吐量,同时保留了97.2%的逐帧检测精度。该系统在NVIDIA Jetson Orin Nano上成功跟踪了实际操作条件下的敏捷FPV无人机。我们的论文代码公开在https://github.com/Jeffry-wen/SDG-Track。

英文摘要

Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our code is publicly available at https://github.com/Jeffry-wen/SDG-Track

2511.21734 2026-05-26 cs.CL cs.AI

Asking LLMs to Verify First is Almost Free Lunch

先让LLMs验证几乎是免费的午餐

Shiguang Wu, Quanming Yao

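AI Summary: The paper proposes Verification-First (VF), a strategy that has the model verify a candidate answer before generating a solution, improving reasoning at low computational overhead, and extends it to the iterative Iter-VF method; across multiple benchmarks these outperform standard CoT and existing TTS strategies.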

详情
AI中文摘要

为了在不增加训练成本或大量测试时采样的情况下增强大型语言模型(LLMs)的推理能力,我们引入了Verification-First (VF)策略,该策略在生成解决方案之前提示模型验证提供的候选答案(即使是琐碎或随机的答案)。这种方法触发了一种“反向推理”过程,与标准的前向思维链(CoT)互补,通过修剪LLM的输出分布来限制答案的逻辑搜索空间。我们进一步将VF提示推广到Iter-VF,这是一种顺序测试时缩放(TTS)方法,利用模型之前的答案迭代地循环验证-生成过程。跨多个基准和各种LLMs的大量实验证实,使用随机答案的VF提示在最小计算开销下始终优于标准CoT,并且Iter-VF优于现有的TTS策略。VF在SOTA思考模型上也有效。例如,通过使用简单的VF提示,我们在GPQA-Diamond上使用Gemini-3-Pro-Preview获得了新的SOTA准确率94.9%,其中VF相对减少了约30%的错误。

英文摘要

To enhance the reasoning capabilities of Large Language Models (LLMs) without costly training or extensive test-time sampling, we introduce Verification-First (VF), a strategy that prompts models to verify a provided candidate answer, even a trivial or random one, before generating a solution. This approach triggers a "reverse reasoning" process complementary to standard forward Chain-of-Thought (CoT), which restricts the logical search space of the answer by pruning the LLM's output distribution. We further generalize VF prompting to Iter-VF, a sequential test-time scaling (TTS) method that iteratively cycles the verification-generation process using the model's previous answer. Extensive experiments across various benchmarks and LLMs confirm that VF prompting with a random answer consistently outperforms standard CoT with minimal computational overhead, and that Iter-VF outperforms existing TTS strategies. VF is also effective on SOTA thinking models. For example, with simple VF prompting we obtain a new SOTA accuracy of 94.9% on GPQA-Diamond with Gemini-3-Pro-Preview, where VF reduces the model's errors by roughly 30% in relative terms.
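
A rough sketch of how VF prompting and its iterative variant might be wired up follows. The prompt wording, the function names, and the `ask_model` callable (assumed to return the model's final answer as a string) are all hypothetical and are not taken from the paper.

```python
def vf_prompt(question, candidate):
    """Build a verification-first prompt around a (possibly random) candidate."""
    return (
        f"Question: {question}\n"
        f"A candidate answer is: {candidate}\n"
        "First verify whether this candidate is correct. "
        "Then solve the problem yourself and state your final answer."
    )

def iter_vf(question, ask_model, initial_candidate="1", rounds=3):
    """Iter-VF-style loop: each round verifies the previous round's answer.

    ASSUMPTION: ask_model(prompt) is a hypothetical callable that queries an
    LLM and returns only its final answer as a string.
    """
    candidate = initial_candidate
    for _ in range(rounds):
        candidate = ask_model(vf_prompt(question, candidate))
    return candidate
```

With `rounds=1` and a trivial `initial_candidate`, the loop reduces to plain VF prompting; larger values of `rounds` trade extra sequential calls for additional verification passes.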

2511.04556 2026-05-26 cs.AI cs.CE

Optimizing Sensor Placement for Flow Reconstruction in Urban Drainage Networks: A Digital Twin-Based Sparse Sensing Approach

城市排水管网流量重建的传感器优化布置:基于数字孪生的稀疏传感方法

Zihang Ding, Amit Kumar, Imran Md. Azizul Islam, Mila Avellar Montezuma, Ruihang Zhang, Kun Zhang

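AI Summary: To address monitoring and flow prediction in urban drainage networks under constrained resources, the paper proposes a digital-twin-based, data-driven sparse sensing approach that optimizes sensor locations with singular value decomposition and QR factorization for system-level flow reconstruction; in validation on the Woodland catchment in Duluth, Minnesota, three sensors reach a mean NSE of 0.949.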

Comments 32 pages (including supplementary information), 11 figures. Submitted to Water Research. Partially presented at HydroML 2025 Symposium, Minnesota Water Resources Conference 2025, and AGU Fall Meeting 2025

详情
AI中文摘要

强降雨引发的城市洪水日益频繁和广泛。虽然高时空分辨率的洪水预测和监测是理想的,但时间、预算和技术上的实际限制阻碍了其全面实施。如何在资源受限的情况下监测城市排水管网并预测水流状况是一个主要挑战。为了解决这一问题,我们引入了一种数据驱动的稀疏传感(DSS)方法,通过明尼苏达州德卢斯林地流域的数字孪生进行演示。具体来说,我们将EPA-SWMM与基于奇异值分解和QR分解的传感器选择相结合,以优化系统级流量重建的监测位置。由不同情景驱动的SWMM模拟集成提供了必要的水力数据,以提取降阶基并识别信息丰富的传感器位置。跨事件验证表明,在77个候选节点中,三个策略性放置的传感器在观测到的风暴事件中实现了平均系统级纳什-萨特克利夫效率(NSE)为0.949。将QR选择的传感器集与通过穷举搜索和蒙特卡洛随机放置获得的参考传感器配置进行了基准测试。这一比较进一步表明,基于QR选择的传感器的流量重建紧密跟踪穷举最优值,同时显著优于随机放置。我们通过引入乘性高斯噪声和模拟单个传感器故障进一步评估了框架的鲁棒性。虽然模型对噪声相对具有弹性,但传感器缺失的影响在很大程度上取决于分配的传感器数量及其具体位置。

英文摘要

Urban flooding triggered by intense rainfall is becoming increasingly frequent and widespread. While flood prediction and monitoring in high spatio-temporal resolution are desired, practical constraints in time, budget, and technology hinder its full implementation. How to monitor urban drainage networks and predict flow conditions under constrained resources is a major challenge. To address this, we introduced a data-driven sparse sensing (DSS) approach, demonstrated via a digital-twin of the Woodland catchment in Duluth, Minnesota. Specifically, we coupled EPA-SWMM with singular value decomposition and QR factorization-based sensor selection to optimize monitoring locations for system-level flow reconstruction. An ensemble of SWMM simulations, driven by diverse scenarios, provided the necessary hydraulic data to extract the reduced basis and identify informative sensor locations. Cross-event validation showed that three strategically placed sensors among 77 candidate nodes achieved a mean system-level Nash-Sutcliffe efficiency (NSE) of 0.949 across observed storm events. The QR-selected sensor sets were benchmarked against reference sensor configurations obtained from exhaustive searches and Monte Carlo random-placements. This comparison further showed that flow reconstruction based on QR-selected sensors closely tracked the exhaustive optimum while substantially outperforming random placements. We further evaluated the framework's robustness by introducing multiplicative Gaussian noise and simulating individual sensor failures. While the model is relatively resilient to noise, the impact of sensor dropouts depends heavily on the number of sensors allocated and their specific locations.
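
The QR-based selection step and the NSE metric can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' pipeline: it assumes a reduced basis has already been extracted (for example from an SVD of the simulated flows), takes it as one row of mode coefficients per candidate node, and emulates column-pivoted QR with a greedy Gram-Schmidt loop. The function names are invented for the example.

```python
def qr_pivot_sensors(basis, n_sensors):
    """Greedy column-pivoted QR on the transposed basis.

    basis[i] holds the reduced-mode coefficients of candidate node i.
    At each step the node with the largest residual norm is chosen, and its
    direction is then projected out of every row.
    """
    rows = [[float(v) for v in row] for row in basis]
    chosen = []
    for _ in range(n_sensors):
        norms = [sum(v * v for v in row) for row in rows]
        for i in chosen:
            norms[i] = -1.0  # never pick the same node twice
        pivot = max(range(len(rows)), key=lambda i: norms[i])
        if norms[pivot] <= 1e-12:
            break  # remaining nodes add no new information
        chosen.append(pivot)
        scale = norms[pivot] ** 0.5
        q = [v / scale for v in rows[pivot]]
        for row in rows:
            proj = sum(a * b for a, b in zip(row, q))
            for j in range(len(row)):
                row[j] -= proj * q[j]
    return chosen

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(observed) / len(observed)
    num = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    den = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - num / den
```

The greedy pivoting favors nodes whose mode coefficients are large and not already explained by earlier picks, which is why a handful of sensors can capture most of the variance when the basis is low-rank.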