Abstract

We introduce "Courant", a Perceiver-based encoder-processor-decoder surrogate model that has latent features exhibiting adaptive specialization and local support in the physical space, enabling functionality akin to an adaptive hp-refinement scheme, an attribute that is highly desirable in traditional numerical solvers and scientific machine learning broadly. The proposed architecture combines a shared random Fourier feature coordinate embedding, state-adapted latent queries, and a light-weight decoder. Courant is trained end-to-end with steady or transient simulation data and only a standard L_2 prediction loss in the physical space, achieving competitive accuracy on benchmarks. We demonstrate that Courant's inductive biases yield latents that are interpretable by design: they develop multiscale geometric specialization in the simulation domain and track coherent structures in the time-dependent case, acting analogously to time-evolving spatial basis functions and allowing for decoding a compact, geometry-anchored, partition-of-unity-like decomposition of the simulated field.
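As a rough illustration of the shared random Fourier feature coordinate embedding mentioned in the abstract, the following minimal Python sketch maps a coordinate to cosine and sine features through one fixed random projection. The function name `make_rff_embedding`, the Gaussian sampling of the projection matrix, and the `scale` parameter are assumptions made for illustration; the abstract does not specify how Courant constructs its embedding.

```python
import math
import random


def make_rff_embedding(in_dim, num_features, scale=1.0, seed=0):
    """Return a function mapping a coordinate x to [cos(2*pi*Bx), sin(2*pi*Bx)]."""
    rng = random.Random(seed)
    # B is sampled once and reused for every point, which is what makes the
    # embedding "shared" in this sketch (an assumption, not the paper's code).
    B = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(num_features)]

    def embed(x):
        proj = [2.0 * math.pi * sum(b * xi for b, xi in zip(row, x)) for row in B]
        return [math.cos(p) for p in proj] + [math.sin(p) for p in proj]

    return embed


# Hypothetical 2D coordinate embedded into 16 features (8 cosines, 8 sines).
embed = make_rff_embedding(in_dim=2, num_features=8, scale=2.0)
features = embed([0.25, 0.75])
```

In this sketch a larger `scale` yields higher-frequency features, which is the usual knob for trading smoothness against fine spatial detail.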


2605.25111 2026-05-26 cs.LG

Revisiting Pre-Propagation GNNs: Robust Diffusion Operators and Hidden-State Re-Propagation

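Zichao Yue, Zhiru Zhang

Affiliations: School of Electrical and Computer Engineering, Cornell University, Ithaca, New York, USA

AI summary: Proposes robust graph diffusion operators and a scheme with a small number of hidden-state re-propagations, allowing pre-propagation GNNs to match the accuracy of message-passing GNNs while retaining their training efficiency.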

2605.25110 2026-05-26 cs.CV cs.AI cs.LG

Uncertainty-DTW for Sequences and Visual Tokens

Lei Wang, Syuan-Hao Li, Yongsheng Gao, Piotr Koniusz

Affiliations: School of Engineering and Built Environment, Electrical and Electronic Engineering, Griffith University; School of Computer Science and Engineering, University of New South Wales

AI summary: Proposes an uncertainty-aware dynamic time warping (uDTW) framework that achieves robust alignment through heteroscedastic uncertainty modeling and maximum likelihood estimation, and extends it to sets of visual tokens, outperforming existing methods across several domains.

Comments Research report

WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models

Bohai Gu, Taiyi Wu, Yueyang Yuan, Jian Liu, Xiaocheng Lu, Dazhao Du, Jie Zhang, Jinxiang Lai, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo

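Affiliations: The Hong Kong University of Science and Technology; AI Technology Center, Tencent Video, Tencent; Wuhan University; Peking University

AI summary: Proposes the WorldCraft framework, which uses a trajectory-control pipeline (NWT, SP-LoRA, TASP) to extend interactive video world models from camera navigation to object-level trajectory manipulation, so that object motion along user-specified paths coexists with camera navigation.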

Comments Project page: https://nevsdev.github.io/WorldCraft/

Abstract

Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: users can move the camera, but cannot act on individual objects. Since real-world interaction is inherently object-centric, such models remain closer to passive scene observers than truly manipulable environments. We present WorldCraft, a framework that expands interactive video world models from camera navigation to object-level trajectory actions. Given a user click and a sketched path, WorldCraft generates future frames in which the selected object follows the prescribed trajectory while the camera continues to navigate the scene. WorldCraft achieves this through a trajectory-centric control pipeline: First, Normalized World Trajectory (NWT) represents user-drawn motion in a camera-invariant world coordinate system and dynamically re-projects it under the current camera pose, separating object motion from camera-induced screen-space displacement; Spatial-Pathway LoRA (SP-LoRA) then injects this world-space signal through the model's spatial-control pathway, adding object manipulation capability while preserving the pretrained camera controller; finally, Trajectory-Anchored State Persistence (TASP) treats the world trajectory as a persistent spatial state and refreshes autoregressive memory after trajectory-conditioned generation, allowing moved objects to reappear at their updated positions after leaving the camera view. Experiments show that WorldCraft enables accurate object control, preserves the video-based world model's camera fidelity under camera-only evaluation, and maintains object state across long autoregressive rollouts with off-camera excursions.


2605.25063 2026-05-26 cs.LG cond-mat.mtrl-sci

Reinforcement Learning for Laser Additive Manufacturing Scan-Order Optimisation: A Bilevel Proxy--FEA Diagnostic Framework for Reward and World-Model Diagnosis

Xian Wu, Haoran Li, Dongbin Zhao, Ruiyao Zhang, Yuanqi Chu, Bin Wang

Affiliations: College of Engineering, Design and Physical Sciences, Brunel University London; Pattern Recognition Laboratory, Institute of Automation, Chinese Academy of Sciences; ISIS Neutron and Muon Source, Science and Technology Facilities Council, Rutherford Appleton Laboratory

AI summary: Proposes a bilevel Proxy-FEA diagnostic framework that uses lightweight proxies and sparse finite-element simulations to diagnose reward and world-model fidelity problems when reinforcement learning is applied to scan-order optimisation in laser additive manufacturing.

Comments 31 pages, 7 figures, 3 tables

Abstract

Reinforcement learning offers a promising approach for scan-order optimisation in laser additive manufacturing, where sequential scan decisions critically influence thermal accumulation, residual stress, distortion, and final part quality. A central challenge in applying RL to this domain lies in reward and world-model fidelity: full finite-element analysis is computationally prohibitive for dense in-the-loop evaluation, while cheap thermo-inspired proxy metrics, though efficient, may capture only partial aspects of the true thermo-mechanical objectives. This paper investigates a bilevel Proxy--FEA diagnostic framework for reward and world-model diagnosis in reinforcement-learning-guided scan-order optimisation. The lower level employs lightweight scan-path and thermo-inspired proxies for rapid candidate generation and preliminary policy-side screening, while the upper level utilises sparse Abaqus FEA simulations to provide simulation-based reference labels. The framework is examined on a simplified whole-track heating LDED32 stripe benchmark comprising ten representative scan strategies. Final-cooling residual Mises stress, U3 vertical distortion, and PEEQ plasticity metrics reveal an observed stress--distortion trade-off rather than a single monotonic quality objective. Within the evaluated set, the center_out strategy emerges as a robust compromise candidate, while raster_left_to_right and edge_in form opposing endpoints of the trade-off. Proxy--FEA alignment analysis shows that current cheap path-based metrics predominantly capture distortion-related (U3) behaviour and exhibit only weak correlation with the sparse FEA reference labels. These findings highlight that proxy-only reward designs risk misalignment in future RL training and underscore the value of sparse FEA reference signals for diagnostic-guided reward and world-model refinement prior to large-scale policy optimisation.


2605.25061 2026-05-26 cs.LG cs.AI

GL-LFGNN: A Global-Local Dual-branch Causal Graph Neural Network Based on Liang-Kleeman Information Flow for EEG Emotion Recognition

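Ziyi Wang, Dongyang Kuang

Affiliations: School of Mathematics (Zhuhai), Sun Yat-sen University, Zhuhai, China

AI summary: Proposes GL-LFGNN, which builds directed causal graphs from Liang-Kleeman information flow theory and integrates whole-brain and regional connectivity through a global-local dual-branch architecture, achieving high-accuracy emotion recognition on the MEEG dataset with few parameters.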

Comments 10 pages, 3 figures

Abstract

EEG-based emotion recognition holds significant promise for objective diagnosis of mood disorders. Graph neural networks (GNNs) have emerged as the dominant paradigm for modeling inter-channel dependencies in EEG, yet existing approaches rely on symmetric adjacency matrices derived from spatial proximity or functional correlations that fundamentally capture statistical associations rather than directed causal influences, which conflicts with the inherently asymmetric, causally-driven nature of neural information flow. To bridge this gap, we propose GL-LFGNN, a Global-Local Dual-branch Causal Graph Neural Network grounded in Liang-Kleeman information flow theory. Unlike Granger causality that merely assesses temporal precedence, our approach rigorously quantifies causal strength from a dynamical systems perspective, yielding neurophysiologically interpretable directed graphs. A dual-branch architecture further integrates whole-brain connectivity with region-specific processing aligned to established functional neuroanatomy. On the MEEG dataset, GL-LFGNN achieves 86.17% (Arousal) and 86.71% (Valence) accuracy with only 37K parameters -- approximately 10% of the current state-of-the-art -- demonstrating that principled causal modeling can simultaneously enhance interpretability, generalization, and computational efficiency. Code will be released.


2605.25052 2026-05-26 cs.CL

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

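Yoav Gur-Arieh, Ana Marasović, Mor Geva

Affiliations: Tel Aviv University; University of Utah

AI summary: Addresses the lack of ground-truth validation for chain-of-thought faithfulness metrics by building BonaFide, a dataset with ground-truth faithfulness labels, and systematically evaluating existing metrics; most perform near chance, show biases, and are computationally expensive.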

Abstract

Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they indeed measure faithfulness remains unknown. Answering this requires ground-truth labels, which are hard to obtain since internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies like plausibility or importance, properties orthogonal to faithfulness that can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level while another reaches 0.59 at the step level, with neither transferring across settings, while entailing prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.
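The abstract reports metric quality as AUROC against ground-truth faithfulness labels. As a small worked illustration of what that number means, the sketch below computes AUROC as the probability that a randomly chosen faithful CoT receives a higher metric score than a randomly chosen unfaithful one. The function name `auroc` and the toy scores and labels are hypothetical; they are not data from BonaFide.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison: P(score_pos > score_neg), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: label 1 = faithful CoT, label 0 = unfaithful CoT (made-up values).
metric_scores = [0.9, 0.4, 0.6, 0.2]
truth = [1, 1, 0, 0]
value = auroc(metric_scores, truth)
```

A value of 0.5 corresponds to chance-level ranking, which is the reference point for the "near chance" finding in the abstract.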


2605.25045 2026-05-26 cs.AI

AION: Next-Generation Tasks and Practical Harness for Time Series

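Tianxiang Zhan, Xiaobao Song, Tong Guan, Shirui Pan, Ming Jin

Affiliations: Griffith University; Shenzhen University; Zhejiang University

AI summary: As time series research shifts toward realistic tasks that combine prediction, contextual reasoning, tool use, and structured decision support, proposes the AION harness, which uses temporal grounding, knowledge-grounded reasoning, and reliability mechanisms (such as post-experiment analysis and layered review) to produce more detailed process traces and more review steps.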

Comments Project page and code are available at https://github.com/ztxtech/aion

Abstract

Time series research is moving beyond fixed forecasting benchmarks toward realistic tasks that combine prediction, contextual reasoning, tool use, and structured decision support. Most benchmarks are built around clean data and short evaluation loops; agents alone may miss temporal constraints, evidence checks, or review before finalizing outputs. We first formalize next-generation time series tasks as three-component tuples consisting of a task file, a workspace, and a validation interface. We then present AION, a time series harness built from six component groups: agents, skills, rules, memory, evaluation, and protocols. In this harness, we use three design principles: temporal grounding, temporal knowledge-grounded reasoning, and reliability mechanisms such as post-experiment analysis and layered review. A Kaggle Store Sales case study shows that the harness produces more detailed process traces, more artifacts, and more review steps than the same base agent operating in OpenCode direct build mode. Taken together, these results argue for a paradigm shift from fixed tasks to realistic ones under real-world constraints.


2605.25044 2026-05-26 cs.RO

X-DiffVLA: X-Embodied Diffusion Action Heads for Vision-Language-Action Models

Boyu Li, Chaoyi Xu, Haoqi Yuan, Xinrun Xu, Börje F. Karlsson, Dongbin Zhao, Haoran Li, Zongqing Lu

Affiliations: SKL-MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence; BeingBeyond; School of Computer Science, Peking University

AI summary: Targeting the challenge of learning universal policies from cross-embodied data, proposes X-DiffVLA, which uses a diffusion model and the Embodiment Forcing technique to transfer knowledge across heterogeneous end-effectors, improving results on RoboCasa and Isaac Gym by 15.3% and 12.5%, respectively.

Abstract

Learning universal policies from cross-embodied data remains a fundamental challenge in robotics. Although Vision-Language-Action (VLA) models are pre-trained on large and diverse datasets, they typically rely on embodiment-specific fine-tuning to achieve strong performance in downstream tasks. This requirement severely limits their generalization capability and restricts knowledge transfer across embodiments performing similar tasks. To overcome these limitations, we focus on cross-embodied settings with shared robotic bases and heterogeneous end-effectors, and propose X-DiffVLA, a diffusion-based VLA model featuring a unified cross-embodied action head. X-DiffVLA can leverage the generative strengths of diffusion models to capture both the diversity and latent correlations in cross-embodied datasets. Specifically, we introduce Embodiment Forcing, a classifier-free guidance technique to implicitly steer action generation toward embodiment-specific functional components, capturing fine-grained structural nuances without explicit supervision. In addition, a Morphological Tree Diffusion approach is designed to strengthen behavioral correlations across diverse end-effectors, maximizing the transferability of heterogeneous demonstrations. Experimental results across RoboCasa and Isaac Gym, covering different embodiments from grippers to dexterous hands, show that X-DiffVLA achieves state-of-the-art performance, with improvements of 15.3% and 12.5%, respectively. Real-world evaluations further validate the robustness of the proposed framework and its effectiveness in scalable cross-embodied policy learning.
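Embodiment Forcing is described above as a classifier-free guidance technique. The abstract gives no formula, so the sketch below only shows the standard classifier-free guidance combination of a conditional and an unconditional noise prediction. The function name `cfg_combine`, the list-based vectors, and the guidance weight `w` are illustrative assumptions, not the authors' implementation.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance: uncond + w * (cond - uncond).

    w = 0 gives the unconditional prediction, w = 1 the conditional one,
    and w > 1 extrapolates further toward the condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical noise predictions for a 3-dimensional action.
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]
guided = cfg_combine(eps_uncond, eps_cond, w=2.0)
```

In a cross-embodied setting the condition would presumably be an embodiment descriptor, but how X-DiffVLA encodes it is not stated in the abstract.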


2605.25042 2026-05-26 cs.CV

MimirRAG: A Metadata-Integrated Multi-Agent RAG Framework for Financial Data Retrieval

Magnus Samuelsen, Wilmer Nyström, Somnath Mazumdar, Mansoor Hussain, Mikkel Strange

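Affiliations: Copenhagen Business School

AI summary: Proposes MimirRAG, a multi-agent RAG framework that combines metadata integration, table-aware chunking, and an agentic workflow, reaching 89.3% accuracy on financial data retrieval and outperforming the baselines.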

Abstract

Retrieval-augmented generation (RAG) systems offer a promising approach to reduce hallucinations and improve answer accuracy in large language models (LLMs), a requirement for reliable financial analysis where answers must be grounded in verifiable evidence from filings rather than generated from model priors. However, designing RAG systems that extract meaningful insights from mixed financial documents and integrate into analyst workflows remains challenging. This paper introduces MimirRAG (Metadata-Integrated Multi-Agent Information Retrieval), a multi-agent RAG system developed iteratively to address these challenges. MimirRAG features a modular pipeline encompassing structure-preserving parsing of PDF filings, table-aware chunking, metadata extraction, agent-based retrieval with query planning and hybrid search, validation, and context-aware generation with numerical reasoning support. Our ablation study identifies three key technical enablers for effective financial RAG: metadata integration, table-aware chunking, and an agentic workflow. MimirRAG was evaluated quantitatively using FinanceBench and qualitatively through expert validation with four financial analysts. The system achieved 89.3% accuracy on FinanceBench, outperforming the original benchmark baselines. Expert feedback highlighted that successful deployment also requires calibrated trust, comprehensive data integration, and user personalization. We conclude that combining multi-agent RAG architecture with human-centric design principles can improve the extraction of meaningful insights in financial analysis.


2605.25024 2026-05-26 cs.CV

DA-UCT: Self-Supervised Domain-Adaptive Ultrasound Computed Tomography for Rapid Musculoskeletal Sound Speed Reconstruction

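Tianyu Liu, Heyu Ma, Aiduo Wang, Peiwen Li, Boyi Li, Ying Li, Dan Li, Chengcheng Liu, Dean Ta

Affiliations: College of Biomedical Engineering, Fudan University

AI summary: Proposes the SDA-UCT framework, which uses self-supervised domain adaptation and an attention-enhanced network for rapid, high-resolution musculoskeletal ultrasound computed tomography reconstruction, substantially increasing speed while maintaining high quality.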

Abstract

Ultrasound computed tomography (UCT) via full waveform inversion (FWI) enables high-resolution quantitative imaging for tissue characterization and disease diagnosis. However, UCT suffers from large computational burden and severe convergence issues due to highly nonlinear optimization. Deep learning can accelerate UCT reconstruction, but supervised training requires large-scale labeled datasets difficult to obtain in vivo. To address these limitations, we propose SDA-UCT, a two-stage self-supervised domain-adaptive framework for rapid and accurate UCT imaging of musculoskeletal tissues. SDA-UCT employs an attention-enhanced network (AttUCT) pre-trained on simulation datasets and transfers to in-vivo data via physics-informed self-supervised learning, effectively bridging the simulation-to-real domain gap. A Low-Rank Adaptation (LoRA) mechanism is integrated to enable efficient adaptation across diverse clinical scenarios. Results showed that AttUCT achieved high-quality SOS reconstruction for simulated human forearm with a PSNR of 29.23 dB and SSIM of 0.928, outperforming conventional FWI and existing deep learning methods. Validated on in-vivo data, SDA-UCT successfully reconstructed SOS images revealing complex anatomical structures (skin, fat, muscle, tendon, bone and bone marrow) for human forearm, in high concordance with MRI references. The LoRA mechanism adjusting only 3% of parameters achieved comparable performance to full fine-tuning. The rapid reconstruction (5 ms per frame) enables real-time 3D visualization, achieving five-orders-of-magnitude improvement over traditional FWI. This work represents the first self-supervised domain-adaptive deep learning for rapid, high-resolution in-vivo UCT imaging, showing potential for musculoskeletal disease diagnosis.


2605.25022 2026-05-26 cs.CV cs.AI

D3S2: Diffusion-Guided Dataset Distillation for Semantic Segmentation

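Wenjie Zheng, Haoji Hu, Jiali Lu, Xingze Zou, Jing Wang

Affiliations: Zhejiang University

AI summary: To address long-tailed class imbalance, pixel-level alignment, and high computational cost in dataset distillation for semantic segmentation, proposes the two-stage D3S2 framework, which builds a compact training set through class-balanced mask selection and diffusion-guided image synthesis and markedly improves segmentation performance at very low compression rates.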

Abstract

Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic sets while preserving training efficacy. However, existing studies mainly focus on image classification, leaving dense prediction tasks such as semantic segmentation largely underexplored. In this work, we identify three key challenges for segmentation DD: (i) long-tailed class imbalance, (ii) the need for strict pixel-wise alignment between images and dense labels, and (iii) the high computational cost of optimizing high-resolution data with complex models. To address these challenges, we propose D3S2, a Diffusion-guided Dataset Distillation framework for Semantic Segmentation. Our method adopts a two-stage design. In Class-Balanced Mask Selection, we construct a representative mask set via a greedy strategy that prioritizes underrepresented classes. In Diffusion-Guided Image Synthesis, we employ a pretrained layout-to-image diffusion model to generate images conditioned on the selected masks, naturally ensuring spatial alignment. To further enhance the training utility of synthesized data, we introduce guided diffusion sampling with two complementary objectives: a segmentation-consistency loss for pixel-level alignment, and a class-wise feature matching loss for aligning per-class feature statistics across layers. Extensive experiments demonstrate the superiority of D3S2. Notably, at an extreme compression rate of 1%, our method achieves 24.99% and 35.49% mIoU on ADE20K and COCO-Stuff with Mask2Former (Swin-S), outperforming random selection by 9.34% and 5.70%, respectively.
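The abstract describes the first stage as a greedy strategy that prioritizes underrepresented classes but does not give the scoring rule. The following is one possible sketch of such a selection under stated assumptions: each mask is reduced to the set of class ids it contains, and a class contributes `1 / (1 + count)` to a mask's score, where `count` is how often that class already appears in the selection. The function name `select_masks` and this gain function are hypothetical, not the D3S2 procedure.

```python
def select_masks(mask_classes, budget):
    """Greedily pick up to `budget` mask indices, favoring classes that are
    still rare in the current selection.

    mask_classes[i] is the set of class ids present in mask i.
    """
    counts = {c: 0 for classes in mask_classes for c in classes}
    remaining = list(range(len(mask_classes)))
    selected = []
    while remaining and len(selected) < budget:
        # A class contributes less the more often it has already been selected.
        def gain(i):
            return sum(1.0 / (1 + counts[c]) for c in mask_classes[i])

        best = max(remaining, key=gain)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
        for c in mask_classes[best]:
            counts[c] += 1
    return selected


# Toy example: class 3 is rare and appears only in the last mask.
masks = [{0, 1}, {0, 1}, {0, 2}, {3}]
chosen = select_masks(masks, budget=3)
```

Here the redundant second mask is skipped in favor of masks that introduce classes 2 and 3, which is the qualitative behavior the abstract attributes to class-balanced selection.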


2605.25020 2026-05-26 cs.AI cs.CL

Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients

Abdurrahim Yilmaz, Ayşe Esra Koku Aksu, Duygu Yamen, Vefa Asli Erdemir, Mehmet Salih Gurel, Gulsum Gencoglan, Joram M. Posma, Burak Temelkuran

Affiliations: Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London; Department of Dermatology and Venereology, Istanbul Research and Training Hospital; Department of Dermatology and Venereology, Istanbul Medeniyet University; Department of Dermatology and Venereology, Istanbul Medicana Atakoy Hospital

AI summary: Evaluates whether a locally deployed, privacy-preserving small language model (SLM) can retrieve structured clinical features and generate longitudinal summaries from long-term follow-up records of pemphigus patients; the SLM reached a mean accuracy of 82.25% on feature retrieval, and physicians rated the quality, clinical accuracy, and usefulness of the AI-generated summaries highly.

Abstract

Chronic dermatologic diseases such as pemphigus require long-term follow-up, generating extensive longitudinal clinical documentation that is difficult to review comprehensively during routine visits and increasing clinician workload as well as the risk of missing critical historical information. We evaluated whether a locally deployed, privacy-preserving small language model (SLM) could retrieve structured clinical features and generate longitudinal summaries from long-term dermatology follow-up records. In this retrospective case series, thirty pemphigus patients contributed 541 visit notes that were aggregated into full longitudinal records (89,336 words); 56 clinically relevant features were annotated by two expert dermatologists. The locally deployed SLM (Qwen3 4B Thinking 2507) was queried with each complete record to retrieve 56 features and generate one final report summary. Across 1,680 feature retrieval tasks, mean accuracy was 82.25%. Dermatologists' ratings of AI-generated summaries were high for overall quality (8.23-8.47), clinical accuracy (7.93-8.20), and usefulness (8.47-8.50), with no significant inter-evaluator differences and an overall preference for AI summaries in 53.3% of evaluations. These findings suggest that privacy-preserving, locally deployed SLMs can outperform medical experts and reliably generate clinically meaningful longitudinal summaries. SLMs may support clinical decision-making when integrated with appropriate oversight.


2605.25014 2026-05-26 cs.CV

Stop Denoising Your Blurs

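Sasidhar Parvathireddy, Vamsidhar Saraswathula, Rama Krishna Gorthi

Affiliations: Indian Institute of Technology Tirupati, India

AI summary: Proposes the ConvDiff framework, which replaces additive noise with convolution to build a blur degradation trajectory for diffusion-based image deblurring, bridging the gap between the mathematics of blurring and the design of diffusion algorithms.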

Comments Accepted at IEEE International Conference on Image Processing (ICIP) 2026. 7 pages, 3 figures

Abstract

In recent times, diffusion models have achieved remarkable performance in image restoration tasks. Their core mechanism relies on the restricted presumption of degradation prior to the additive noise operation. However, the blur model, one of the most widely studied degradation formulations, violates this assumption, as it is inherently based on convolution rather than addition. In this paper, we introduce ConvDiff, a novel diffusion-based framework that substitutes the additive operation with convolution for the task of image deblurring. In the forward process, we construct a meaningful trajectory from the clean image to its blurred counterpart by exploiting the frequency domain characteristics of convolution, rather than progressively corrupting the image with additive noise. While the current work instantiates this framework for Gaussian blur, where frequency-domain decomposition yields closed-form and physically valid intermediate states, the underlying principle of constructing degradation trajectories from the blur operator extends naturally to other blur families. This formulation bridges the gap between the mathematical principles of blurring and the iterative design of diffusion-based restoration algorithms, enabling more physically grounded and effective image restoration models.
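To make the frequency-domain idea concrete: a Gaussian blur with standard deviation sigma multiplies each frequency f by exp(-2 * pi^2 * sigma^2 * f^2), so blurs compose in closed form and every intermediate sigma gives a valid, less-blurred state. The sketch below builds such a trajectory for a 1D signal. The linear sigma schedule, the circular boundary handling, the naive DFT, and the names `gaussian_blur_state` and `blur_trajectory` are assumptions for illustration; the abstract does not give ConvDiff's actual decomposition.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def gaussian_blur_state(signal, sigma):
    """Circularly blur a 1D signal with a Gaussian of std `sigma` (in samples)
    by scaling each frequency with exp(-2 * pi^2 * sigma^2 * f^2)."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        f = k / n if k <= n // 2 else (k - n) / n  # signed frequency, cycles/sample
        spectrum[k] *= math.exp(-2.0 * math.pi ** 2 * sigma ** 2 * f ** 2)
    return [v.real for v in idft(spectrum)]


def blur_trajectory(signal, sigma_max, steps):
    """States from the clean signal (sigma = 0) to the fully blurred one
    (sigma = sigma_max), using a linear sigma schedule (an assumption)."""
    return [gaussian_blur_state(signal, sigma_max * t / steps)
            for t in range(steps + 1)]


# Hypothetical clean signal: a single impulse that spreads out along the trajectory.
clean = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
states = blur_trajectory(clean, sigma_max=2.0, steps=4)
```

Because the transfer functions multiply, blurring with sigma 1.2 and then 1.6 matches a single blur with sigma 2.0 in this sketch, which is the closed-form property that makes intermediate states easy to define for the Gaussian case.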


2605.25012 2026-05-26 cs.CV

Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation

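Imanol G. Estepa, Jesús M Rodríguez-de-Vera, Bhalaji Nagarajan, Petia Radeva

Affiliations: Universitat de Barcelona; Barcelona Supercomputing Center (BSC)

AI summary: Proposes the LEASE framework, which uses a paired generative-discriminative codebook design to jointly optimize a masked reconstruction loss and a codebook contrast loss in a discrete token space, unifying visual representation and generation and reaching state-of-the-art performance on ImageNet-1K.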

Comments Accepted at CVPR'26

详情

Abstract

Discriminative and generative vision models excel in their respective domains but remain semantically misaligned, hindering progress toward unified visual learning. We introduce LEASE (LEArning from SEmantic Dictionaries), a self-supervised framework that bridges this gap using a paired generative-discriminative codebook design. LEASE operates entirely in a discrete token space produced through a one-time precomputation step, enabling efficient training without data augmentations, teacher models, or online tokenizers. LEASE integrates two complementary objectives: a masked token reconstruction loss that captures fine-grained generative detail, and a codebook contrastive loss that aligns encoder features with discriminative semantics via adaptive centroid weighting. This dual supervision yields a unified latent space that supports both high-quality generation and strong representation learning. On ImageNet-1K, LEASE achieves state-of-the-art unified performance, outperforming prior VQGAN-based methods such as MAGE and Sorcen across linear probing (up to +1.7%), unconditional generation (-1.26 FID and +10.19 IS w.r.t. MAGE), few-shot learning (+0.56% on average against Sorcen), transfer learning (+0.75% average improvement against MAGE and Sorcen), and robustness benchmarks (+5.86% and +4.25% average improvement against MAGE and Sorcen, respectively). It also competes favorably with domain-specialized contrastive and generative models while surpassing previous MIM methods. The unsupervised LEASE model can also be extended to conditional generation by building upon its learned representations, proving competitive with specialized baselines. Overall, LEASE provides an efficient and effective step toward general-purpose vision models that jointly understand and generate visual content.
