arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.25133 2026-05-26 cs.AI cs.CL

Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

João Sedoc, Baotong Zhang, Dean Foster

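Institutions: New York University

AI summary: Proposes a prover-verifier deliberation protocol grounded in interactive proof theory that enables selective prediction through structured confidence verdicts, achieving a gap of about 30 percentage points in high-confidence precision on GPQA Diamond.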

English abstract

Reliably knowing when a language model is correct is almost as important as being correct. We introduce prover-verifier deliberation (PVD), an inference-time protocol grounded in interactive proof theory, as a mechanism for selective prediction: the protocol produces both an answer and a structured confidence verdict, allowing a system to report high-confidence answers while abstaining on uncertain cases. In each dialogue, a prover defends a candidate answer through checkable sub-claims while a verifier issues targeted challenges and returns Accept, Challenge, or Reject. Because frozen language models are imperfect provers and verifiers operating over a noisy channel, formal soundness and completeness guarantees do not transfer; instead, we characterize the protocol empirically through its coverage-precision behavior. Our main experiment uses Claude Sonnet 4.6 as prover and Claude Haiku 4.5 as verifier on GPQA Diamond. Questions accepted with no answer revision, which we call Accept + No Change (ANC), are reported as the high-confidence subset; we evaluate this subset by its precision and coverage. ANC separates reliable from unreliable answers, yielding a gap of about 30 percentage points in high-confidence precision (HC-Prec) over the non-ANC complement. Robustness experiments with GPT and Gemini pairings show that high HC-Prec can transfer across model families, while verifier strictness and domain competence largely determine the size of the selection gap. On Humanity's Last Exam, weaker prover-verifier pairings can collapse or invert the ANC signal, illustrating a practical failure mode when the verifier operates outside its effective region. Comparisons with self-consistency, universal self-consistency, multi-agent debate, and Reflexion suggest that prover-verifier deliberation supplies a distinct argument-defensibility signal for selective prediction.
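
As a rough illustration of the evaluation described in this abstract, the Python sketch below computes coverage and high-confidence precision for the ANC subset, plus the precision gap against its complement. The `Outcome` record, its field names, and the function names are assumptions made for this example; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """One question's result (hypothetical record layout)."""
    verdict: str          # final verifier verdict: "Accept", "Challenge", or "Reject"
    answer_changed: bool  # True if the prover revised its initial answer
    correct: bool         # True if the final answer matches the gold label


def is_anc(o):
    """Accept + No Change: accepted by the verifier with no answer revision."""
    return o.verdict == "Accept" and not o.answer_changed


def precision(items):
    """Fraction of correct answers; NaN for an empty subset."""
    return sum(o.correct for o in items) / len(items) if items else float("nan")


def selective_metrics(outcomes):
    """Coverage and precision of the ANC subset, and the gap to its complement."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    anc = [o for o in outcomes if is_anc(o)]
    rest = [o for o in outcomes if not is_anc(o)]
    hc_prec = precision(anc)
    complement_prec = precision(rest)
    return {
        "coverage": len(anc) / len(outcomes),
        "hc_prec": hc_prec,
        "complement_prec": complement_prec,
        "gap": hc_prec - complement_prec,
    }
```

A system that abstains outside the ANC subset would report answers on a `coverage` fraction of questions, with `hc_prec` as the precision of what it reports.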

2605.25129 2026-05-26 cs.LG

Blocked Gibbs meets Diffusion Transformers: Unsupervised Learning for Constraint Optimization

Yudong W. Xu, Wenhao Li, Xiaoyu Wang, Scott Sanner, Elias B. Khalil

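Institutions: University of Toronto; Vector Institute

AI summary: Proposes the Blocked Gibbs Diffusion Transformer (BloGDiT), which replaces standard joint Gaussian denoising with blocked Gaussian denoising so that diffusion models can make the large edits to subsets of variables that constraint optimization requires; it matches or outperforms existing methods on Sudoku, Graph Coloring, Maximum Independent Set, and MaxCut.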

English abstract

Diffusion models have shown promise in learning to solve constraint optimization problems. However, they are mostly restricted to problems with binary variables and rely on graph neural networks, hindering their application to a broader range of problems such as those with general discrete variables or constraint structures that necessitate global rather than local reasoning. We investigate the use of Diffusion Transformers to address the aforementioned limitations. A naive implementation performs poorly due to a fundamental mismatch between the standard diffusion process and constraint solving: while the former applies small, incremental denoising across all variables, the latter requires substantially altering specific subsets of variables to attain feasibility or optimality. Our method, Blocked Gibbs Diffusion Transformer (BloGDiT), is the first to address this limitation by replacing standard joint Gaussian denoising with blocked Gaussian denoising. BloGDiT uses iterative block resampling and anneals the block size over time to facilitate large, targeted edits within a block of variables. Across Sudoku, Graph Coloring, Maximum Independent Set, and MaxCut, BloGDiT matches or outperforms existing methods, demonstrating that blocked Gibbs-style diffusion provides a highly effective inductive bias for Transformer-based constraint satisfaction and optimization.

2605.25127 2026-05-26 cs.CV cs.LG

PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration

Haoqing Wu, Alexa Nawotki, Jochen Garcke

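Institutions: Mercedes-Benz AG; University of Bonn; Fraunhofer SCAI

AI summary: Proposes a unified 3D restoration network built on a Pseudo-Query module within a Transformer backbone, which reformulates geometric translation into two cooperative stages to improve structural clarity and local detail, and surpasses existing methods across diverse degradation scenarios.

Comments: To be published in the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026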

English abstract

Point clouds are a fundamental 3D representation in computer vision, enabling a wide range of perception tasks. However, real-world point clouds often suffer from degradations such as incompleteness, noise, outliers, and irregular density, caused by sensor limitations or occlusions. Recovering clean and detailed shapes from such degraded data is crucial for downstream applications. While existing learning-based methods achieve progress on individual tasks like completion or denoising, they typically rely on global bottleneck features, which lose fine-grained geometry and remain sensitive to varying input quality. We propose a unified 3D restoration network that directly takes point clouds as input and adaptively reconstructs high-quality geometry under diverse degradation scenarios. At the core of our approach is a Pseudo-Query module, implemented within a Transformer backbone, which reformulates geometric translation into two cooperative stages to enhance structural clarity, robustness, and local detail preservation. Extensive experiments on curated benchmarks demonstrate that our approach surpasses state-of-the-art performance in general 3D restoration. It effectively handles complex combinations of completion, deformation, and denoising degradations. With this work, we provide a novel unified, point-only backbone for robust 3D restoration, enabling more versatile 3D perception.

2605.25124 2026-05-26 cs.LG

Optimizing Multidimensional Scaling in Gini Metric Spaces

Cassandra Mussard, Stéphane Mussard

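Institutions: GitHub

AI summary: Proposes the Gini Multidimensional Scaling (Gini MDS) framework, which uses a Gini pseudo-distance based on values and their ranks, outperforms Euclidean MDS on data with noise and outliers, and gains GPU acceleration from a PyTorch implementation.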

English abstract

The Gini Multidimensional Scaling (Gini MDS) framework extends Euclidean multidimensional scaling. We introduce a Gini pseudo-distance based on values and their ranks that depends on a tunable hyperparameter. This pseudo-distance allows flexible exploration of latent configurations, enabling embeddings that best match observed dissimilarities. Gini MDS is shown to be robust to noise and outliers, making it well-suited for real-world applications. We provide experiments on 16 UCI datasets with outliers and on MNIST images with noise to show that Gini MDS outperforms Euclidean MDS on noisy data. Finally, a tensor-based implementation in PyTorch provides GPU acceleration and efficient computation compared to the standard MDS of the sklearn library.

2605.25123 2026-05-26 cs.LG cs.AI cs.CL cs.CV stat.ML

Inference-Time Alignment of Diffusion Models via Trust-Region Iterative Twisted Sequential Monte Carlo

Weixin Wang, Yu Yang, Wei Deng, Pan Xu

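Institutions: Duke University; Morgan Stanley

AI summary: Proposes the Trust-Region Iterative Twisted Sequential Monte Carlo (TRI-TSMC) framework, which iteratively learns twisting functions to improve inference-time alignment of diffusion models, with gains on discrete diffusion text generation and text-to-image generation under matched inference-time budgets.

Comments: 34 pages, 6 figures, and 7 tables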

English abstract

We study inference-time alignment for diffusion-based generative models, aiming to steer a base model toward high-reward outputs without updating its weights. Recent Sequential Monte Carlo (SMC)-based steering methods approximate reward-tilted target distributions in a principled way, but their proposals remain largely tied to the base sampler. Since reward information is mainly used after propagation through particle reweighting and resampling, these methods can require large particle budgets and suffer from weight degeneracy and high-variance estimates. One way to reduce variance and improve particle efficiency is to iteratively learn twisting functions that provide look-ahead guidance, as in twisted SMC. However, existing learnable twisting methods are developed mainly for classical sequential inference and can be unstable when applied to diffusion-based alignment with high-dimensional state spaces and terminal, noisy, or black-box rewards. We propose Trust-Region Iterative Twisted Sequential Monte Carlo (TRI-TSMC), a trust-region framework for learning twisting functions in SMC-based inference-time alignment. Each iteration computes an exact KL-constrained update in path space, which admits a closed-form solution by tempered importance reweighting, and projects this target back to the parameterized twisted family by weighted maximum likelihood. Theoretically, we formalize the value-function interpretation of the optimal twisting function and show that it yields a zero-variance sampler. We prove that the trust-region update follows an escort path toward the target distribution, that the weighted maximum-likelihood update is a forward-KL projection, and that the path reduces residual importance-weight variance. Empirically, TRI-TSMC improves primary alignment objectives on discrete diffusion text generation and text-to-image generation under matched inference-time budgets.
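
One common reading of a KL-constrained update solved by tempered importance reweighting is that particle importance weights are raised to a power λ in (0, 1], with λ chosen so the reweighted distribution stays within a KL budget of the current proposal. The plain-Python sketch below illustrates that reading only. The function names, the bisection search, and the particle-based KL estimate are assumptions for illustration and are not the paper's algorithm.

```python
import math


def tempered_weights(log_w, lam):
    """Normalized weights proportional to exp(lam * log_w), computed stably."""
    scaled = [lam * lw for lw in log_w]
    m = max(scaled)
    unnorm = [math.exp(s - m) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def kl_from_uniform(weights):
    """Particle estimate of KL(reweighted || proposal): sum_i W_i * log(N * W_i)."""
    n = len(weights)
    return sum(w * math.log(n * w) for w in weights if w > 0.0)


def trust_region_step(log_w, kl_budget, iters=60):
    """Largest lam in [0, 1] whose tempered weights respect the KL budget.

    The KL estimate is non-decreasing in lam, so bisection is sufficient.
    """
    if kl_from_uniform(tempered_weights(log_w, 1.0)) <= kl_budget:
        return 1.0  # the full reweighting already fits inside the trust region
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_from_uniform(tempered_weights(log_w, mid)) <= kl_budget:
            lo = mid
        else:
            hi = mid
    return lo
```

With λ = 1 the weights are the ordinary importance weights; smaller values of λ give a more conservative step along the geometric path between the proposal and the target.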

2605.25120 2026-05-26 cs.CL cs.AI cs.HC

Evidence-Linked Radiology Reporting: A Human-Supervised Reference Architecture for Structured Imaging Intelligence

Houman Kazemzadeh, Kamyar Naderi

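Institutions: Xylemed

AI summary: Proposes a human-supervised, evidence-linked reference architecture that combines exam-specific templates, speech-to-structure processing, measurement and segmentation capture, controlled AI-assisted drafting, and interoperability based on standards such as DICOM and HL7 FHIR, turning radiology reports from free text into a structured intelligence layer that supports reviewed reporting, longitudinal comparison, clinical data reuse, and system integration.

Comments: Technical report, 27 pages, 2 figures, 12 tables, 1 listing; reference architecture paper; does not report clinical outcomes or validated diagnostic performance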

English abstract

Radiology reports remain the primary mechanism by which imaging findings are communicated to clinical teams. However, much of the structured information behind these reports, including measurements, image evidence, prior comparisons, lesion identity, uncertainty, and terminology, often remains trapped in free text or fragmented across picture archiving and communication systems, radiology information systems, reporting workstations, worksheets, advanced visualization tools, and electronic health records. This paper proposes a human-supervised, evidence-linked reference architecture for structured radiology reporting. The framework combines exam-specific templates, speech-to-structure processing, measurement and segmentation capture, controlled AI-assisted drafting, and standards-based interoperability using DICOM, DICOM Structured Reporting, DICOM Segmentation, HL7 FHIR, RadLex, SNOMED CT, LOINC, and UCUM. The system is positioned not as an autonomous report generator, but as a structured intelligence layer for enterprise imaging that supports reviewed reporting, longitudinal comparison, clinical data reuse, governance, and integration with PACS, RIS, EHR, analytics, and registry workflows. The paper also discusses modality-specific deployment considerations, clinical safety risks, validation requirements, cybersecurity, privacy, quality management, and regulatory boundaries for AI-assisted radiology reporting systems.

2605.25119 2026-05-26 cs.CV cs.AI cs.LG

Trust-Aware Joint Feature-Prediction Discrepancy for Robust Domain Adaptation

Xi Ding, Lei Wang, Syuan-Hao Li, Yongsheng Gao

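Institutions: School of Engineering and Built Environment, Griffith University, Australia

AI summary: Proposes a trust-aware domain adaptation framework whose Joint Feature-Prediction Discrepancy (JFPD) combines uncertainty-aware trust and semantic-alignment trust to give a reliability-aware estimate of domain discrepancy and improve adaptation performance.

Comments: Research report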

English abstract

Domain adaptation aims to mitigate performance degradation caused by distribution shifts between a labeled source domain and an unlabeled or sparsely labeled target domain. Most existing approaches estimate domain discrepancy either in feature space or in prediction space. However, these single-perspective strategies overlook a critical problem under domain shift: the reliability of the signals used for alignment. In practice, both learned representations and semantic predictions may become unreliable, and treating all target samples equally can lead to misleading alignment and suboptimal transfer. We introduce trust-aware domain adaptation, a principled framework that models domain discrepancy through the reliability of feature and prediction signals. Central to our approach is the Joint Feature-Prediction Discrepancy (JFPD), a unified formulation that jointly captures representation divergence and prediction divergence while weighting their contributions by sample-specific trust. Trust is quantified via two complementary mechanisms: uncertainty-aware trust, derived from prediction entropy to suppress unreliable predictions, and semantic-alignment trust, computed from prototype similarity in feature space to emphasize well-aligned representations. By prioritizing confident and semantically consistent samples while down-weighting noisy or ambiguous ones, JFPD provides a reliability-aware estimate of domain discrepancy. We further integrate JFPD into a training objective that guides adaptation toward trustworthy regions of the target domain. Experiments on standard benchmarks demonstrate that the proposed framework consistently achieves superior adaptation performance and yields discrepancy estimates that correlate with target-domain error. This work addresses, for the first time, the importance of modeling trust in the interaction between features and predictions for domain adaptation.

2605.25115 2026-05-26 cs.LG cs.AI cs.CE physics.app-ph

Courant: a State-Adaptive Perceiver-Based Neural Surrogate with Local Support and Interpretable Field Decomposition

Anuj Kumar, Josiah Bjorgaard, Nikolaos Bouklas, Matteo Salvador, Alexander Lavin

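Institutions: Pasteur Labs; Cornell University; Institute for Simulation Intelligence

AI summary: Proposes Courant, a Perceiver-based encoder-processor-decoder surrogate model whose state-adapted latent queries and lightweight decoder yield local support and an interpretable field decomposition akin to adaptive hp-refinement, with competitive accuracy on steady and transient simulation benchmarks.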

English abstract

We introduce "Courant", a Perceiver-based encoder-processor-decoder surrogate model that has latent features exhibiting adaptive specialization and local support in the physical space, enabling functionality akin to an adaptive hp-refinement scheme, an attribute that is highly desirable in traditional numerical solvers and scientific machine learning broadly. The proposed architecture combines a shared random Fourier feature coordinate embedding, state-adapted latent queries, and a light-weight decoder. Courant is trained end-to-end with steady or transient simulation data and only a standard L_2 prediction loss in the physical space, achieving competitive accuracy on benchmarks. We demonstrate that Courant's inductive biases yield latents that are interpretable by design: they develop multiscale geometric specialization in the simulation domain and track coherent structures in the time-dependent case, acting analogously to time-evolving spatial basis functions and allowing for decoding a compact, geometry-anchored, partition-of-unity-like decomposition of the simulated field.

2605.25111 2026-05-26 cs.LG

Revisiting Pre-Propagation GNNs: Robust Diffusion Operators and Hidden-State Re-Propagation

Zichao Yue, Zhiru Zhang

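Institutions: School of Electrical and Computer Engineering, Cornell University, Ithaca, New York, USA

AI summary: Proposes robust graph diffusion operators and a few-shot hidden-state re-propagation scheme that let pre-propagation GNNs match the accuracy of message-passing GNNs while retaining their training efficiency.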

English abstract

Pre-propagation graph neural networks (PPGNNs) decouple node feature propagation from transformation: graph diffusion is performed once as preprocessing, and training reduces to dense per-node transformations. This design enables mini-batch training without inter-node dependencies, avoids repeated sparse matrix--matrix multiplications, and better matches modern accelerators optimized for dense compute. However, their expressivity remains unclear, and empirical results show a gap between PPGNNs and their message-passing counterparts on commonly used graph benchmarks, especially heterophilic ones. In this paper, we propose a suite of robust graph diffusion operators for preprocessing and a few-shot hidden-state re-propagation scheme during training. Our methods improve the validation and test accuracy of PPGNNs, enabling them to match the accuracy of message-passing GNNs while maintaining training efficiency.
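
The decoupling of propagation from transformation described in this abstract can be sketched in a few lines of NumPy. This is a generic SIGN/SGC-style pre-propagation with a symmetrically normalized adjacency, offered only as an illustration. The robust diffusion operators and the re-propagation scheme proposed in the paper are not specified in the abstract and are not reproduced here, and the function names are hypothetical.

```python
import numpy as np


def sym_norm_adj(adj):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = adj + np.eye(adj.shape[0])
    deg = a.sum(axis=1)              # at least 1 for every node because of self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def pre_propagate(adj, x, hops):
    """One-time preprocessing: concatenate [X, A_hat X, A_hat^2 X, ...] per node."""
    a_hat = sym_norm_adj(adj)
    feats = [x]
    for _ in range(hops):
        feats.append(a_hat @ feats[-1])  # the only step that touches the graph
    return np.concatenate(feats, axis=1)
```

Because each row of the result depends only on its own node, a mini-batch is a plain row slice, for example `z[batch_idx]`, which can be fed to any dense per-node classifier without further graph operations.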

2605.25110 2026-05-26 cs.CV cs.AI cs.LG

Uncertainty-DTW for Sequences and Visual Tokens

Lei Wang, Syuan-Hao Li, Yongsheng Gao, Piotr Koniusz

Institutions: School of Engineering and Built Environment, Electrical and Electronic Engineering, Griffith University; School of Computer Science and Engineering, University of New South Wales

AI summary: Proposes uncertainty-aware Dynamic Time Warping (uDTW), which achieves robust alignment through heteroscedastic uncertainty modeling and maximum likelihood estimation, generalizes to sets of visual tokens, and improves on existing methods across several domains.

Comments: Research report

English abstract

Aligning structured data is a fundamental problem in computer vision and machine learning, underlying tasks such as time series analysis, human action recognition, and visual representation learning. Existing alignment methods, including Dynamic Time Warping (DTW) and its differentiable variants, rely on deterministic similarity measures and are therefore sensitive to heterogeneous and noisy features. In this work, we introduce uncertainty-aware alignment, a probabilistic framework that models pairwise correspondences with heteroscedastic uncertainty and performs structured matching along alignment paths. Our formulation, uncertainty-DTW (uDTW), assigns each correspondence a Normal distribution and parametrizes each alignment path by a Maximum Likelihood Estimate objective consisting of (i) a precision-weighted matching term that suppresses unreliable features, and (ii) a log-variance regularization that prevents degenerate solutions. This yields a probabilistic alignment mechanism that is robust to noise and interpretable, as uncertainty directly reflects the reliability of matches. We further generalize this framework from temporal sequences to tokenized visual representations, enabling structured matching over sets of visual tokens. The learned uncertainty can be interpreted as a reverse-attention: semantically relevant regions exhibit low uncertainty and dominate the alignment, while ambiguous/noisy regions have high uncertainty. This provides a connection between alignment, attention, and uncertainty modeling. We evaluate the proposed framework across diverse domains. The results demonstrate consistent improvements over state-of-the-art methods and show that learned uncertainty correlates with semantic importance. These findings establish uncertainty-aware alignment as a general, robust, and interpretable framework for learning from structured data.
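
To make the two-term objective in this abstract concrete, the sketch below runs a classic hard-min DTW recursion in which each cell cost is a Gaussian negative log-likelihood: a precision-weighted squared distance plus a log-variance penalty. It is a minimal illustration under stated assumptions. The variances are supplied by the caller rather than learned, the recursion uses a hard minimum rather than any differentiable relaxation, and the function name `udtw` is hypothetical.

```python
import math


def udtw(xs, ys, var):
    """Hard-min DTW over per-pair Gaussian negative log-likelihood costs.

    xs, ys: sequences of feature vectors (lists of floats, equal dimension).
    var[i][j]: assumed positive variance for matching xs[i] with ys[j].
    """
    n, m = len(xs), len(ys)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sq = sum((a - b) ** 2 for a, b in zip(xs[i - 1], ys[j - 1]))
            v = var[i - 1][j - 1]
            # (i) precision-weighted matching term, (ii) log-variance regularizer
            cost = 0.5 * (sq / v + math.log(v))
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]
```

Raising the variance of a poorly matched pair shrinks its distance term, while the log-variance term keeps the variance from growing without bound, which is the degenerate solution the regularizer is meant to prevent.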

2605.25107 2026-05-26 cs.LG cs.AI cs.NA math.NA

Leveraging Gauge Freedom for Learning Non-Gradient Population Dynamics of Stochastic Systems

Jules Berman, Tobias Blickhan, Benjamin Peherstorfer

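Institutions: Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA

AI summary: To move beyond population dynamics inference that is restricted to gradient flows, proposes Non-Gradient Inference Flows (NGIF), which uses a weak formulation of the continuity equation to parameterize general vector fields and apply selection criteria other than minimal kinetic energy, improving distributional accuracy and better capturing non-potential transport on low- and high-dimensional physics problems.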

English abstract

Existing work on population dynamics inference often focuses on flows arising from vector fields that are the gradients of scalar potentials. Among all admissible flows that are compatible with the population dynamics, gradient flows are optimal in a specific sense: they minimize kinetic energy. The selection of fields based on different criteria corresponds to a gauge freedom when determining population dynamics, which we leverage in this work. We propose Non-Gradient Inference Flows (NGIF), an algorithm to infer non-gradient population dynamics using a weak formulation of the continuity equation. This allows us to parameterize general vector fields and choose other selection criteria beyond minimal kinetic energy. We demonstrate on a variety of low- and high-dimensional physics problems that this more general approach improves distributional accuracy over gradient-restricted baselines and better captures non-potential transport.
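
For readers unfamiliar with the terms, the textbook form of the continuity equation, its weak formulation, and the gauge freedom mentioned in this abstract can be written as follows. The notation (density \rho_t, velocity field v_t, test function \varphi, perturbation u_t) is assumed here for illustration; the paper's exact formulation for stochastic systems is not given in the abstract and may include additional terms.

```latex
\begin{align}
  % Continuity equation: density \rho_t transported by velocity field v_t
  \partial_t \rho_t + \nabla \cdot (\rho_t v_t) &= 0 \\
  % Weak form: integrate against a smooth, compactly supported test function \varphi
  \frac{d}{dt} \int \varphi(x)\, \rho_t(x)\, dx
    &= \int \nabla \varphi(x) \cdot v_t(x)\, \rho_t(x)\, dx \\
  % Gauge freedom: v_t and v_t + u_t generate the same \rho_t whenever
  \nabla \cdot (\rho_t u_t) &= 0
\end{align}
```

The weak form involves only expectations over samples from \rho_t, so it can be estimated from population snapshots. The last line shows why the velocity field is not unique: a gradient field is one particular choice among all fields that satisfy the first equation.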

2605.25095 2026-05-26 cs.AI cs.LG math.OC

RECTOR: Priority-Aware Rule-Based Reranking for Compliance-Aware Autonomous Driving Trajectory Selection

Hadi Hajieghrary, Benedikt Walter, Chaitanya Shinde, Paul Schmitt, Miguel Hurtado

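Institutions: TORC Robotics LLC; Daimler Truck AG; Reynolds & Moore; MassRobotics

AI summary: Proposes RECTOR, a post-generation reranking layer that scores candidate trajectories against a tiered rulebook (Safety > Legal > Road > Comfort) via differentiable proxies and a scene-conditioned applicability mechanism, then selects with a deterministic ε-lexicographic rule; without retraining the predictor, it cuts Safety+Legal violations from 28.58% to 20.42%.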

English abstract

Autonomous driving stacks must pick one trajectory from a multi-modal candidate set; choosing by model confidence ignores safety, traffic-law, and comfort constraints. We present RECTOR (Rule-Enforced Constrained Trajectory Orchestrator), a post-generation reranking layer that scores candidates against a tiered rulebook (Safety ≻ Legal ≻ Road ≻ Comfort) via differentiable proxies and a scene-conditioned applicability mechanism, then selects with a deterministic ε-lexicographic rule that preserves cross-tier priority by construction, without retraining the predictor. On the Waymo Open Motion Dataset validation_interactive split (43,219 augmented instances, K = 6), under Protocol B (28-rule proxy catalog, oracle applicability) rule-aware selection cuts Safety+Legal violations from 28.58% to 20.42% and Total from 40.32% to 32.41% versus confidence-only on the same candidates. A uniform-weight weighted-sum baseline matches binary compliance on this benchmark: the empirical lift comes from rule-aware ranking, while the lexicographic guarantee is the structural differentiator no weight calibration can replicate. Under adversarial confidence corruption, confidence-only selection fails in 100% of scenarios while both rule-aware selectors reject the injected mode in about 96%. All figures are proxy-evaluator results (not a safety certificate), open-loop, 5 s horizon, U.S. rules, validation split.
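
The idea of a tiered, ε-lexicographic selection can be illustrated with a short Python sketch: candidates are filtered tier by tier, keeping only those within ε of the best violation score at each tier, and confidence is used only to break the final tie. The dictionary layout, the tie-break, and the default ε are assumptions made for this example; the paper's exact rule is not given in the abstract.

```python
# Tier order from highest to lowest priority, as listed in the abstract.
TIERS = ("safety", "legal", "road", "comfort")


def eps_lexicographic_select(candidates, eps=1e-3):
    """Return the index of the selected candidate.

    candidates: list of dicts holding a violation score per tier (lower is
    better) and a model 'confidence' (hypothetical layout for illustration).
    """
    survivors = list(range(len(candidates)))
    for tier in TIERS:
        best = min(candidates[i][tier] for i in survivors)
        # Keep only candidates within eps of the best score at this tier, so a
        # lower tier can never override a higher-tier difference larger than eps.
        survivors = [i for i in survivors if candidates[i][tier] <= best + eps]
    # Confidence only breaks ties among candidates equivalent on every tier.
    return max(survivors, key=lambda i: candidates[i]["confidence"])
```

In contrast to a weighted sum, no choice of comfort or confidence values lets a candidate win here if it is worse than another survivor by more than ε on safety.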

2605.25091 2026-05-26 cs.AI

Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat

Chengwei Li, Junlin Liu, Yang Gao

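Institutions: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary: To address the low exploration efficiency, low sample utilization, and poor policy generalization of existing MARL methods in multi-aircraft cooperative air combat, proposes ACE-MAPPO, a hybrid learning framework that integrates evolutionary algorithms with MAPPO and improves performance through a genetic soft update, evolutionary-augmented prioritized trajectory replay, and adversarial evolutionary curriculum learning.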

English abstract

As modern air combat evolves toward beyond-visual-range (BVR) multi-aircraft cooperative engagements, autonomous decision-making for unmanned combat aerial vehicles (UCAVs) faces significant challenges due to high-dimensional state spaces, discrete action commands, and strongly adversarial dynamic environments. To overcome the limitations of existing multi-agent reinforcement learning (MARL) methods in such settings, namely insufficient exploration efficiency, low sample utilization, and poor policy generalization, we propose Adversarial Curriculum and Evolutionary-enhanced Multi-agent Proximal Policy Optimization (ACE-MAPPO), a hybrid learning framework that integrates evolutionary algorithms with MAPPO. Specifically, a genetic soft update mechanism is introduced to enhance population diversity and mitigate convergence to local optima. An evolutionary-augmented prioritized trajectory replay strategy is further employed to improve the utilization of sparse high-value samples. In addition, an adversarial evolutionary curriculum learning mechanism is designed to enable adaptive training with progressively increasing difficulty. Extensive experimental results demonstrate that the proposed method outperforms MAPPO and other baseline algorithms in terms of training stability, convergence speed, and win rate, validating its effectiveness in multi-aircraft cooperative air combat scenarios.

2605.25077 2026-05-26 cs.CV

WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models

Bohai Gu, Taiyi Wu, Yueyang Yuan, Jian Liu, Xiaocheng Lu, Dazhao Du, Jie Zhang, Jinxiang Lai, Shuai Yang, Xiaotong Zhao, Alan Zhao, Song Guo

Institutions: The Hong Kong University of Science and Technology; AI Technology Center, Tencent Video, Tencent; Wuhan University; Peking University

AI summary: Proposes WorldCraft, a framework whose trajectory-control pipeline (NWT, SP-LoRA, TASP) extends interactive video world models from camera navigation to object-level trajectory manipulation, so that objects follow user-specified paths while camera navigation continues.

Comments: Project page: https://nevsdev.github.io/WorldCraft/

English abstract

Recent video-based world models have made pixel-space environments interactive at the camera level: users can navigate viewpoints while the model generates coherent visual continuations. Yet their action spaces remain incomplete: users can move the camera, but cannot act on individual objects. Since real-world interaction is inherently object-centric, such models remain closer to passive scene observers than truly manipulable environments. We present WorldCraft, a framework that expands interactive video world models from camera navigation to object-level trajectory actions. Given a user click and a sketched path, WorldCraft generates future frames in which the selected object follows the prescribed trajectory while the camera continues to navigate the scene. WorldCraft achieves this through a trajectory-centric control pipeline: First, Normalized World Trajectory (NWT) represents user-drawn motion in a camera-invariant world coordinate system and dynamically re-projects it under the current camera pose, separating object motion from camera-induced screen-space displacement; Spatial-Pathway LoRA (SP-LoRA) then injects this world-space signal through the model's spatial-control pathway, adding object manipulation capability while preserving the pretrained camera controller; finally, Trajectory-Anchored State Persistence (TASP) treats the world trajectory as a persistent spatial state and refreshes autoregressive memory after trajectory-conditioned generation, allowing moved objects to reappear at their updated positions after leaving the camera view. Experiments show that WorldCraft enables accurate object control, preserves the video-based world model's camera fidelity under camera-only evaluation, and maintains object state across long autoregressive rollouts with off-camera excursions.

2605.25063 2026-05-26 cs.LG cond-mat.mtrl-sci

Reinforcement Learning for Laser Additive Manufacturing Scan-Order Optimisation: A Bilevel Proxy--FEA Diagnostic Framework for Reward and World-Model Diagnosis

激光增材制造扫描顺序优化的强化学习:用于奖励和世界模型诊断的双层代理-有限元分析诊断框架

Xian Wu, Haoran Li, Dongbin Zhao, Ruiyao Zhang, Yuanqi Chu, Bin Wang

发表机构 * College of Engineering, Design and Physical Sciences, Brunel University London(布鲁内尔大学伦敦工程、设计与物理科学学院) Pattern Recognition Laboratory, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所模式识别实验室) ISIS Neutron and Muon Source, Science and Technology Facilities Council, Rutherford Appleton Laboratory(Rutherford Appleton实验室,科学与技术设施委员会ISIS中子与μ子源)

AI总结 本文提出一个双层代理-有限元分析诊断框架,通过轻量代理和稀疏有限元模拟,诊断强化学习在激光增材制造扫描顺序优化中的奖励和世界模型保真度问题。

Comments 31 pages, 7 figures, 3 tables

详情
AI中文摘要

强化学习为激光增材制造中的扫描顺序优化提供了一种有前景的方法,其中顺序扫描决策关键影响热积累、残余应力、变形和最终零件质量。将RL应用于该领域的一个核心挑战在于奖励和世界模型的保真度:完整的有限元分析在密集的环路评估中计算成本过高,而廉价的热启发代理度量虽然高效,但可能仅捕获真实热机械目标的局部方面。本文研究了一个用于强化学习引导的扫描顺序优化中奖励和世界模型诊断的双层代理-有限元分析诊断框架。下层采用轻量扫描路径和热启发代理进行快速候选生成和初步策略侧筛选,而上层利用稀疏的Abaqus有限元分析模拟提供基于模拟的参考标签。该框架在一个简化的全轨迹加热LDED32条纹基准上进行检验,该基准包含十种代表性扫描策略。最终冷却残余Mises应力、U3垂直变形和PEEQ塑性度量揭示了一个观察到的应力-变形权衡,而非单一单调的质量目标。在评估的集合中,center_out策略成为稳健的折衷候选,而raster_left_to_right和edge_in构成权衡的对立端点。代理-有限元分析对齐分析表明,当前廉价的基于路径的度量主要捕获变形相关(U3)行为,且与稀疏有限元分析参考标签仅呈现弱相关性。这些发现表明,仅代理的奖励设计在未来的RL训练中可能存在错位风险,并强调了在大规模策略优化之前,稀疏有限元分析参考信号对于诊断引导的奖励和世界模型精炼的价值。

英文摘要

Reinforcement learning offers a promising approach for scan-order optimisation in laser additive manufacturing, where sequential scan decisions critically influence thermal accumulation, residual stress, distortion, and final part quality. A central challenge in applying RL to this domain lies in reward and world-model fidelity: full finite-element analysis is computationally prohibitive for dense in-the-loop evaluation, while cheap thermo-inspired proxy metrics, though efficient, may capture only partial aspects of the true thermo-mechanical objectives. This paper investigates a bilevel Proxy--FEA diagnostic framework for reward and world-model diagnosis in reinforcement-learning-guided scan-order optimisation. The lower level employs lightweight scan-path and thermo-inspired proxies for rapid candidate generation and preliminary policy-side screening, while the upper level utilises sparse Abaqus FEA simulations to provide simulation-based reference labels. The framework is examined on a simplified whole-track heating LDED32 stripe benchmark comprising ten representative scan strategies. Final-cooling residual Mises stress, U3 vertical distortion, and PEEQ plasticity metrics reveal an observed stress--distortion trade-off rather than a single monotonic quality objective. Within the evaluated set, the center_out strategy emerges as a robust compromise candidate, while raster_left_to_right and edge_in form opposing endpoints of the trade-off. Proxy--FEA alignment analysis shows that current cheap path-based metrics predominantly capture distortion-related (U3) behaviour and exhibit only weak correlation with the sparse FEA reference labels. These findings highlight that proxy-only reward designs risk misalignment in future RL training and underscore the value of sparse FEA reference signals for diagnostic-guided reward and world-model refinement prior to large-scale policy optimisation.

2605.25061 2026-05-26 cs.LG cs.AI

GL-LFGNN: A Global-Local Dual-branch Causal Graph Neural Network Based on Liang-Kleeman Information Flow for EEG Emotion Recognition

GL-LFGNN:基于Liang-Kleeman信息流的全局-局部双分支因果图神经网络用于脑电情感识别

Ziyi Wang, Dongyang Kuang

发表机构 * School of Mathematics (Zhuhai), Sun Yat-sen University, Zhuhai, China(中山大学数学学院(珠海))

AI总结 提出GL-LFGNN模型,利用Liang-Kleeman信息流理论构建有向因果图,通过全局-局部双分支架构整合全脑与区域连接,在MEEG数据集上以少量参数实现高精度情感识别。

Comments 10 pages, 3 figures

详情
AI中文摘要

基于脑电的情感识别在客观诊断情绪障碍方面具有重要前景。图神经网络已成为建模脑电通道间依赖关系的主流范式,但现有方法依赖于基于空间邻近性或功能相关性导出的对称邻接矩阵,这些矩阵本质上捕捉的是统计关联而非有向因果影响,这与神经信息流固有的非对称、因果驱动特性相冲突。为弥合这一差距,我们提出GL-LFGNN,一种基于Liang-Kleeman信息流理论的全局-局部双分支因果图神经网络。与仅评估时间优先性的格兰杰因果不同,我们的方法从动力系统角度严格量化因果强度,生成神经生理学可解释的有向图。双分支架构进一步将全脑连接性与符合既定功能神经解剖学的区域特定处理相结合。在MEEG数据集上,GL-LFGNN仅用37K参数(约为当前最优模型的10%)便达到86.17%(唤醒度)和86.71%(效价)的准确率,表明原则性的因果建模可同时增强可解释性、泛化能力和计算效率。代码将开源。

英文摘要

EEG-based emotion recognition holds significant promise for objective diagnosis of mood disorders. Graph neural networks (GNNs) have emerged as the dominant paradigm for modeling inter-channel dependencies in EEG, yet existing approaches rely on symmetric adjacency matrices derived from spatial proximity or functional correlations. Such matrices capture statistical associations rather than directed causal influences, which conflicts with the inherently asymmetric, causally driven nature of neural information flow. To bridge this gap, we propose GL-LFGNN, a Global-Local Dual-branch Causal Graph Neural Network grounded in Liang-Kleeman information flow theory. Unlike Granger causality, which merely assesses temporal precedence, our approach rigorously quantifies causal strength from a dynamical systems perspective, yielding neurophysiologically interpretable directed graphs. A dual-branch architecture further integrates whole-brain connectivity with region-specific processing aligned to established functional neuroanatomy. On the MEEG dataset, GL-LFGNN achieves 86.17% (Arousal) and 86.71% (Valence) accuracy with only 37K parameters -- approximately 10% of the current state-of-the-art -- demonstrating that principled causal modeling can simultaneously enhance interpretability, generalization, and computational efficiency. Code will be released.
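
For orientation, Liang-Kleeman information flow between two time series is usually estimated from sample covariances. The block below states the bivariate maximum-likelihood estimator as it commonly appears in the Liang-Kleeman literature. Whether GL-LFGNN uses this bivariate form or a multivariate variant is not stated in the abstract, so treat it as background rather than the authors' exact formulation.

```latex
T_{2\to 1} = \frac{C_{11}\,C_{12}\,C_{2,d1} - C_{12}^{2}\,C_{1,d1}}{C_{11}^{2}\,C_{22} - C_{11}\,C_{12}^{2}},
\qquad
\dot{X}_{1,n} \approx \frac{X_{1,n+k} - X_{1,n}}{k\,\Delta t}
```

Here $C_{ij}$ is the sample covariance of channels $X_i$ and $X_j$, and $C_{i,d1}$ is the sample covariance of $X_i$ with the finite-difference derivative of $X_1$. Swapping the indices gives $T_{1\to 2}$, which in general differs from $T_{2\to 1}$, so the resulting adjacency matrix is asymmetric.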

2605.25052 2026-05-26 cs.CL

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

忠实性指标并不衡量忠实性:基于真实标签的元评估

Yoav Gur-Arieh, Ana Marasović, Mor Geva

发表机构 * Tel Aviv University(特拉维夫大学) University of Utah(犹他大学)

AI总结 针对思维链忠实性度量缺乏真实标签验证的问题,构建了包含真实忠实性标签的数据集BonaFide,系统评估现有指标,发现多数指标表现接近随机、存在偏差且计算成本高。

详情
AI中文摘要

思维链(CoT)已成为解释和审计大型语言模型行为的核心工具。然而,越来越多的证据表明,这些轨迹往往未能忠实反映模型预测背后的计算过程。已有多种忠实性指标被提出,但它们是否真正衡量了忠实性仍不得而知。回答这一问题需要真实标签,但由于内部计算不可直接观察,真实标签难以获取。因此,大多数提出指标的工作仅报告绝对分数或与先前指标的对比,而少数现有基准依赖于似然性或重要性等代理指标,这些属性与忠实性正交,可能误导对CoT可信度的判断。我们通过构建任务来应对这一挑战,这些任务的输出揭示了哪些中间计算必然产生了它们,并开发了一个自动化标注流程,在步骤级和CoT级生成真实忠实性标签。基于这一方法,我们提出了BonaFide基准,包含来自13个任务和10个模型的3066个标注CoT,并利用它首次系统评估了主流忠实性指标。我们的实验表明,大多数指标表现接近随机,存在强烈的预测偏差,并且在更长的CoT上性能下降。最佳指标在CoT级仅达到0.70 AUROC,另一指标在步骤级达到0.59,且两者均无法跨设置迁移,同时计算成本过高。我们的结果暴露了当前忠实性评估中的根本性缺陷,并呼吁开发更可靠、更高效的指标。

英文摘要

Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they indeed measure faithfulness remains unknown. Answering this requires ground-truth labels, which are hard to obtain since internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies like plausibility or importance, properties orthogonal to faithfulness that can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level while another reaches 0.59 at the step level, with neither transferring across settings, while entailing prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.

2605.25045 2026-05-26 cs.AI

AION: Next-Generation Tasks and Practical Harness for Time Series

AION:下一代时间序列任务与实用框架

Tianxiang Zhan, Xiaobao Song, Tong Guan, Shirui Pan, Ming Jin

发表机构 * Griffith University(格里菲斯大学) Shenzhen University(深圳大学) Zhejiang University(浙江大学)

AI总结 针对时间序列研究向结合预测、上下文推理、工具使用和结构化决策支持的现实任务转变,提出AION框架,通过时间锚定、知识推理和可靠性机制(如实验后分析和分层审查)实现更详细的过程追踪和审查步骤。

Comments Project page and code are available at https://github.com/ztxtech/aion

详情
AI中文摘要

时间序列研究正从固定的预测基准转向结合预测、上下文推理、工具使用和结构化决策支持的现实任务。大多数基准基于干净数据和短评估循环构建;仅靠智能体可能会在最终输出前忽略时间约束、证据检查或审查。我们首先将下一代时间序列任务形式化为由任务文件、工作空间和验证接口组成的三元组。然后,我们提出AION,一个由六个组件组(智能体、技能、规则、记忆、评估和协议)构建的时间序列框架。在该框架中,我们使用三个设计原则:时间锚定、时间知识推理以及可靠性机制(如实验后分析和分层审查)。Kaggle商店销售案例研究表明,与在OpenCode直接构建模式下运行的相同基础智能体相比,该框架产生了更详细的过程追踪、更多工件和更多审查步骤。综合来看,这些结果支持从固定任务向现实世界约束下的现实任务的范式转变。

英文摘要

Time series research is moving beyond fixed forecasting benchmarks toward realistic tasks that combine prediction, contextual reasoning, tool use, and structured decision support. Most benchmarks are built around clean data and short evaluation loops; agents alone may miss temporal constraints, evidence checks, or review before finalizing outputs. We first formalize next-generation time series tasks as three-component tuples consisting of a task file, a workspace, and a validation interface. We then present AION, a time series harness built from six component groups: agents, skills, rules, memory, evaluation, and protocols. In this harness, we use three design principles: temporal grounding, temporal knowledge-grounded reasoning, and reliability mechanisms such as post-experiment analysis and layered review. A Kaggle Store Sales case study shows that the harness produces more detailed process traces, more artifacts, and more review steps than the same base agent operating in OpenCode direct build mode. Taken together, these results argue for a paradigm shift from fixed tasks to realistic ones under real-world constraints.

2605.25044 2026-05-26 cs.RO

X-DiffVLA: X-Embodied Diffusion Action Heads for Vision-Language-Action Models

X-DiffVLA:面向视觉-语言-动作模型的跨具身扩散动作头

Boyu Li, Chaoyi Xu, Haoqi Yuan, Xinrun Xu, Börje F. Karlsson, Dongbin Zhao, Haoran Li, Zongqing Lu

发表机构 * SKL-MAIS, Institute of Automation, Chinese Academy of Sciences(SKL-MAIS,自动化研究所,中国科学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(人工智能学院,中国科学院大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院) BeingBeyond School of Computer Science, Peking University(北京大学计算机学院)

AI总结 针对跨具身数据学习通用策略的挑战,提出X-DiffVLA模型,通过扩散模型和具身强制技术实现异构末端执行器间的知识迁移,在RoboCasa和Isaac Gym上分别提升15.3%和12.5%。

详情
AI中文摘要

从跨具身数据中学习通用策略仍然是机器人学中的基本挑战。尽管视觉-语言-动作(VLA)模型在大型多样化数据集上进行了预训练,但它们通常依赖于具身特定的微调才能在下游任务中实现强性能。这一要求严重限制了它们的泛化能力,并阻碍了执行相似任务的具身之间的知识迁移。为了克服这些限制,我们聚焦于共享机器人基座和异构末端执行器的跨具身设置,并提出X-DiffVLA,一种具有统一跨具身动作头的基于扩散的VLA模型。X-DiffVLA能够利用扩散模型的生成优势来捕捉跨具身数据集中的多样性和潜在相关性。具体地,我们引入了具身强制(Embodiment Forcing),一种无分类器引导技术,以隐式地将动作生成导向具身特定的功能组件,无需显式监督即可捕捉细粒度的结构细微差别。此外,设计了形态树扩散(Morphological Tree Diffusion)方法来增强不同末端执行器之间的行为相关性,最大化异构演示的可迁移性。在RoboCasa和Isaac Gym上的实验结果覆盖了从夹爪到灵巧手的多种具身,表明X-DiffVLA达到了最先进的性能,分别提升了15.3%和12.5%。真实世界评估进一步验证了所提出框架的鲁棒性及其在可扩展跨具身策略学习中的有效性。

英文摘要

Learning universal policies from cross-embodied data remains a fundamental challenge in robotics. Although Vision-Language-Action (VLA) models are pre-trained on large and diverse datasets, they typically rely on embodiment-specific fine-tuning to achieve strong performance in downstream tasks. This requirement severely limits their generalization capability and restricts knowledge transfer across embodiments performing similar tasks. To overcome these limitations, we focus on cross-embodied settings with shared robotic bases and heterogeneous end-effectors, and propose X-DiffVLA, a diffusion-based VLA model featuring a unified cross-embodied action head. X-DiffVLA can leverage the generative strengths of diffusion models to capture both the diversity and latent correlations in cross-embodied datasets. Specifically, we introduce Embodiment Forcing, a classifier-free guidance technique to implicitly steer action generation toward embodiment-specific functional components, capturing fine-grained structural nuances without explicit supervision. In addition, a Morphological Tree Diffusion approach is designed to strengthen behavioral correlations across diverse end-effectors, maximizing the transferability of heterogeneous demonstrations. Experimental results across RoboCasa and Isaac Gym, covering different embodiments from grippers to dexterous hands, show that X-DiffVLA achieves state-of-the-art performance, with improvements of 15.3% and 12.5%, respectively. Real-world evaluations further validate the robustness of the proposed framework and its effectiveness in scalable cross-embodied policy learning.
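
Embodiment Forcing is described above as a classifier-free guidance technique. The abstract does not give its formulation, so the sketch below shows only the standard classifier-free guidance blend that such techniques generally build on. The function name, the guidance weight, and the toy noise vectors are assumptions for illustration.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_weight):
    """Standard CFG blend: eps = eps_uncond + w * (eps_cond - eps_uncond).

    w = 0 returns the unconditional prediction, w = 1 the conditional one,
    and w > 1 extrapolates past the conditional prediction.
    """
    return [u + guidance_weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (hypothetical values); in a VLA setting the condition
# could be an embodiment identifier, but that mapping is an assumption here.
eps_uncond = [0.0, 1.0, 2.0]
eps_cond = [1.0, 1.0, 0.0]
guided = classifier_free_guidance(eps_uncond, eps_cond, 2.0)
```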

2605.25042 2026-05-26 cs.CV

Unbiased Diffusion Variational Inversion via Principled Posterior Matching

无偏扩散变分反演:基于原则性后验匹配

Weimin Bai, Yuxuan Gu, Yifei Wang, Weijian Luo, He Sun

发表机构 * Peking University(北京大学)

AI总结 提出原则性后验匹配(PPM)框架,通过精确优化KL散度(利用Fisher散度积分)解决逆问题中模式坍塌和不确定性量化不可靠的问题,统一变分推理和摊销推理,在图像修复、超分辨荧光显微和射电干涉成像中实现高保真重建和校准的不确定性估计。

详情
AI中文摘要

现有的基于分数的逆问题方法通常采用KL散度在反演分布与贝叶斯后验之间的近似最小化。这种近似导致严重的模式坍塌和不可靠的不确定性量化。在本文中,我们提出原则性后验匹配(PPM),一个回归变分推理基础而非使用技巧性近似的框架。我们不依赖启发式近似,而是通过整合Fisher散度严格公式化KL散度的精确优化。我们推导出该积分的可处理等价梯度形式,使得无需先前近似引入的偏差即可进行精确优化。我们的分析清楚地揭示了先前方法中的模式坍塌直接源于这种近似差距。在我们的理论解决方案支持下,PPM统一了两个互补范式:(1)在变分推理中,PPM采用覆盖质量的散度,显著提高了反演多样性和不确定性量化;(2)在摊销推理中,它使得能够训练高效的重建网络以进行快速的单步重建。此外,我们的公式通过推广Fisher散度的积分,自然地扩展到更广泛的散度度量族。我们在具有挑战性的计算成像任务中验证了PPM,包括图像修复、超分辨荧光显微镜和射电干涉黑洞成像。在所有实验中,PPM实现了卓越的重建保真度、忠实的多模态后验恢复以及良好校准的不确定性估计,为科学成像建立了一个稳健的框架。

英文摘要

Existing score-based methods for inverse problems often resort to approximate minimization of the KL divergence between the inversion distribution and the Bayesian posterior. Such an approximation leads to severe mode collapse and unreliable uncertainty quantification. In this paper, we propose Principled Posterior Matching (PPM), a framework that returns to the fundamentals of variational inference, rather than using tricky approximations. Instead of relying on heuristic approximations, we rigorously formulate the exact optimization of the KL divergence via the integration of Fisher divergence. We derive a tractable, equivalent gradient form of this integral, enabling precise optimization without the biases introduced by prior approximations. Our analysis clearly reveals that the mode collapse in previous methods stems directly from this approximation gap. Supported by our theoretical solution, PPM unifies two complementary paradigms: (1) In variational inference, PPM adopts mass-covering divergences that significantly improve the inversion diversity and uncertainty quantification; (2) In amortized inference, it enables the training of an efficient reconstruction network for rapid, single-step reconstruction. Furthermore, our formulation naturally extends to a broader family of divergence measures by generalizing the integral of the Fisher divergence. We validate PPM across challenging computational imaging tasks, including inpainting, super-resolution fluorescent microscopy, and radio interferometric black-hole imaging. In all experiments, PPM achieves superior reconstruction fidelity, faithful multimodal posterior recovery, and well-calibrated uncertainty estimates, establishing a robust framework for scientific imaging.

2605.25041 2026-05-26 cs.RO

RAMBA: 4D Radar Mapping by Bundle Adjustment

RAMBA: 通过束调整的4D雷达建图

Jianzhu Huai, Yiwen Chen, Binliang Wang

发表机构 * State Key Lab of Info Engineering in Surveying, Mapping and Remote Sensing(信息工程测绘遥感国家重点实验室)

AI总结 提出RAMBA框架,利用束调整联合优化雷达帧状态,结合协方差加权几何残差、IMU预积分因子和雷达自速度约束,实现全局一致的4D雷达建图。

Comments 5 pages, 2 figures, to be presented at ISPRS2026 Thematic Session 10 on Radar Perception

详情
AI中文摘要

4D雷达在机器人建图中越来越有吸引力,因为它提供距离、方位角、仰角和多普勒测量,同时在恶劣可见度条件下保持鲁棒性。尽管最近的雷达和雷达-惯性里程计方法已经实现了有前景的在线状态估计性能,但4D雷达的离线全局地图优化仍未得到充分探索。本文提出了RAMBA,一种用于全局一致4D雷达建图的雷达束调整框架。给定来自雷达-惯性里程计前端的初始位姿和雷达帧,RAMBA使用协方差加权几何残差、IMU预积分因子和雷达自速度约束联合优化雷达帧状态。几何残差通过跨选定帧形成基于体素的对应关系,并用点协方差加权每个残差,将成对GICP扩展到多帧优化。为了提高对漂移和重访的鲁棒性,RAMBA在对应关系形成过程中强制时间一致性,同时明确支持闭环约束。在ColoRadar和SNAIL Radar数据集上的实验表明,与雷达-惯性里程计和位姿图优化基线相比,RAMBA提高了地图一致性并通常提升了轨迹精度。

英文摘要

4D radar is increasingly attractive for robotic mapping because it provides range, azimuth, elevation, and Doppler measurements while remaining robust in adverse visibility conditions. Although recent radar and radar--inertial odometry methods have achieved promising online state estimation performance, offline global map refinement for 4D radar remains underexplored. This paper presents RAMBA, a radar bundle-adjustment framework for globally consistent 4D radar mapping. Given initial poses and radar frames from a radar--inertial odometry front-end, RAMBA jointly refines radar frame states using covariance-weighted geometric residuals, IMU preintegration factors, and radar ego-velocity constraints. The geometric residuals extend pairwise GICP to a multi-frame optimization by forming voxel-based correspondences across selected frames and weighting each residual with point covariances. To improve robustness against drift and revisits, RAMBA enforces temporal consistency during correspondence formation while explicitly supporting loop-closure constraints. Experiments on the ColoRadar and SNAIL Radar datasets show that RAMBA improves map consistency and usually enhances trajectory accuracy over radar--inertial odometry and pose-graph optimization baselines.

2605.25039 2026-05-26 cs.CV

AstroRAG -- A Pagerank-Based Retrieval-Augmented Generation Pipeline for Question Answering in Astronomy

AstroRAG -- 一种基于PageRank的检索增强生成管道用于天文学问答

Zhifeng Wang, Jason Jingshi Li, Kaihao Zhang, Ramesh Sankaranarayana

发表机构 * Australian National University(澳大利亚国立大学) Learning Machines Pty Ltd

AI总结 提出AstroRAG,一种基于PageRank的检索增强生成管道,通过两阶段检索(MMR和PR重排序)在严格token预算下选择紧凑互支持的上下文,无需训练且保护隐私,在天文学QA基准上使Mistral-7B准确率和F1分数达到79.49%,性能近乎翻倍。

Comments Accepted to IEEE CAI 2026

详情
AI中文摘要

大型语言模型(LLMs)在自然语言处理中表现出强大的性能,但仅依赖参数化知识时常常产生事实性错误。检索增强生成(RAG)通过将响应基于外部证据来减轻这些错误,然而传统的检索-转储方法经常引入无关上下文,从而降低答案质量。在这项工作中,我们提出了AstroRAG——一种基于PageRank的检索增强生成(RAG)管道,适用于天文学中的问答。该系统在Elasticsearch中执行token感知的分块和每个实例的临时索引,然后执行两阶段检索:(i)最大边际相关性(MMR)以获得一个小的、多样化的候选集,以及(ii)在相似性图上进行读者驱动的PageRank(PR)重排序,以在严格的token预算下识别紧凑、互支持的上下文。我们的设计无需训练、保护隐私且可重复,因为每个实例通过临时索引处理以防止跨任务泄漏。我们在用于天文学QA的AstroQA基准上评估了该管道,并在所有难度级别上展示了有竞争力的性能。特别是,RAG增强的Mistral-7B实现了79.49%的准确率和79.49%的F1分数,几乎是非RAG对应版本性能的两倍。这些结果突显了严格检索和精炼在提升领域特定推理方面的有效性,为将RAG扩展到其他科学领域奠定了坚实基础。

英文摘要

Large language models (LLMs) demonstrate strong performance in natural language processing but often generate factual errors when relying solely on parametric knowledge. Retrieval-Augmented Generation (RAG) mitigates these errors by grounding responses in external evidence, yet conventional retrieve-and-dump approaches frequently introduce irrelevant context that degrades answer quality. In this work, we present AstroRAG -- a PageRank-based retrieval-augmented generation (RAG) pipeline adapted for question answering in astronomy. The system performs token-aware chunking and per-instance, ephemeral indexing in Elasticsearch, then executes a two-stage retrieval: (i) Maximal Marginal Relevance (MMR) to obtain a small, diverse candidate set and (ii) a reader-driven PageRank (PR) re-ranking on a similarity graph to identify a compact, mutually supportive context under a strict token budget. Our design is training-free, privacy-preserving, and reproducible, as each instance is processed through transient indexing to prevent cross-task leakage. We evaluate the pipeline on the AstroQA benchmark for astronomy QA, and demonstrate competitive performance across all difficulty levels. In particular, the RAG-enhanced Mistral-7B achieves 79.49% accuracy and 79.49% F1-score, nearly doubling the performance of its non-RAG counterpart. These results highlight the effectiveness of disciplined retrieval and refinement in boosting domain-specific reasoning, establishing a robust foundation for extending RAG to other scientific fields.
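
The first retrieval stage, Maximal Marginal Relevance, can be sketched in a few lines. This is a generic greedy MMR over embedding vectors, not AstroRAG's code: the function names, the trade-off parameter `lam`, and the toy vectors are assumptions. It shows how MMR skips a near-duplicate passage in favor of a less similar one, which is what yields the small, diverse candidate set described above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k, lam=0.5):
    """Greedy MMR: returns indices of up to k candidates in selection order.

    Each step picks the candidate maximizing
    lam * sim(query, d) - (1 - lam) * max_{s in selected} sim(d, s).
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings (hypothetical): candidates 0 and 1 are near-duplicates,
# candidate 2 is less relevant to the query but covers a different direction.
query = [1.0, 0.0, 0.0]
candidates = [[1.0, 1.0, 0.0], [1.0, 0.9, 0.0], [1.0, 0.0, 1.2]]

diverse = mmr_select(query, candidates, k=2, lam=0.5)
relevance_only = mmr_select(query, candidates, k=2, lam=1.0)
```

With `lam=1.0` the selection reduces to plain relevance ranking and keeps both near-duplicates, while `lam=0.5` replaces the second one with the more distinct candidate.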

2605.25038 2026-05-26 cs.CL cs.LG cs.SE

TRACE: A taxonomy-grounded synthetic dataset for teaching-program generation and session interpretation in Applied Behavior Analysis

TRACE:一个基于分类学的合成数据集,用于应用行为分析中的教学程序生成和会话解释

Festus Kahunla

发表机构 * Drexel University(德雷塞尔大学) Pombo Labs(波莫实验室)

AI总结 提出TRACE数据集,通过分类学驱动的确定性生成器创建2999个合成示例,覆盖教学程序生成和多会话行为解释任务,以解决ABA领域真实数据受隐私保护无法公开的问题。

Comments 11 pages, 3 tables. Dataset: https://huggingface.co/datasets/PomboLabs/TRACE ; code: https://github.com/Pombo-Labs/TRACE

详情
AI中文摘要

应用行为分析(ABA)是一门临床学科,其文档、教学程序和多次会话行为日志具有公式化和高容量的特点,但真实会话数据受HIPAA保护并受专业保密规则约束,阻碍了训练语料库的发布。我们提出了TRACE(分类学参考的ABA临床示例),一个包含2999个示例的合成指令调优数据集,涵盖两项ABA任务:跨离散试验训练、自然环境教学和任务分析的教学程序生成;以及跨十二种轨迹模式和十三种目标行为的多会话行为解释。每个示例均由一个基于经典ABA文献的确定性分类学驱动生成器产生,并且每个示例都带有完整的采样来源,即产生它的确切分类学单元。该数据集以CC BY-NC 4.0(数据)和MIT(代码)许可发布,包含分层训练集(2549)、验证集(149)、测试集(281)和完整性检查集(20)。TRACE是一个研究工件,尚未经过临床验证。

英文摘要

Applied Behavior Analysis (ABA) is a clinical discipline whose documentation (teaching programs and multi-session behavioral logs) is formulaic and high-volume, yet real session data is HIPAA-protected and bound by professional confidentiality rules, blocking the release of a training corpus. We present TRACE (Taxonomy-Referenced ABA Clinical Examples), a 2,999-example synthetic instruction-tuning dataset covering two ABA tasks: teaching-program generation across Discrete Trial Training, Natural Environment Teaching, and Task Analysis; and multi-session behavioral interpretation across twelve trajectory patterns and thirteen target behaviors. Every example is produced by a deterministic taxonomy-driven generator grounded in the canonical ABA literature, and every example carries complete sampling provenance: the exact taxonomy cells that produced it. The dataset is released under CC BY-NC 4.0 for data and MIT for code, with stratified train (2,549), validation (149), test (281), and sanity (20) splits. TRACE is a research artifact and has not been clinically validated.

2605.25036 2026-05-26 cs.CL cs.AI

Language Bias in LVLMs: From In-Depth Analysis to Simple and Effective Mitigation

LVLMs中的语言偏差:从深入分析到简单有效的缓解方法

Yangneng Chen, Jing Li

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳))

AI总结 本文系统研究了大视觉语言模型中的语言偏差问题,发现其根源在于训练中的模态未对齐,并提出了两种简单有效的缓解方法:语言偏差正则化(LBR)和语言偏差惩罚(LBP)。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型视觉语言模型(LVLMs)通过视觉理解扩展了大型语言模型,但仍然容易产生幻觉,即输出流畅但与图像不一致。最近的研究将这一问题与语言偏差联系起来——LVLMs过度依赖文本而忽视视觉输入的倾向。然而,大多数分析仍然是经验性的,没有揭示其根本原因。在本文中,我们对语言偏差进行了系统研究,并确定其根源在于训练过程中的模态未对齐。我们的分析表明,视觉指令微调(VIT)和直接偏好优化(DPO)通常优先考虑文本改进,这可能导致LVLMs过度倾向于语言建模,而不是平衡的多模态理解。为了解决这个问题,我们提出了两种简单而有效的方法:语言偏差正则化(LBR),通过在指令微调期间进行正则化来缓解语言偏差;以及语言偏差惩罚(LBP),在DPO训练过程中惩罚语言偏差。跨多种模型和基准的大量实验证明了我们方法的有效性。LBR在十多个通用基准上持续提高性能,而LBP显著减少了幻觉并提高了可信度。这些方法共同不仅缓解了语言偏差,还促进了LVLMs的整体对齐,且无需引入任何额外数据或辅助模型。我们的代码公开在https://github.com/lab-klc/LVLM-Language-Bias。

英文摘要

Large Vision-Language Models (LVLMs) extend large language models with visual understanding, but remain vulnerable to hallucination, where outputs are fluent yet inconsistent with images. Recent studies link this issue to language bias, the tendency of LVLMs to over-rely on text while neglecting visual inputs. Yet most analyses remain empirical without uncovering its underlying cause. In this paper, we provide a systematic study of language bias and identify its root in modality misalignment during training. Our analysis shows that both Visual Instruction Tuning (VIT) and Direct Preference Optimization (DPO) often prioritize textual improvements, which may cause LVLMs to overly lean toward language modeling rather than balanced multimodal understanding. To address this, we propose two simple yet effective methods: Language Bias Regularization (LBR), which mitigates language bias through regularization during instruction tuning, and Language Bias Penalty (LBP), which penalizes language bias in the DPO training process. Extensive experiments across diverse models and benchmarks demonstrate the effectiveness of our approach. LBR consistently improves performance on over ten general benchmarks, while LBP significantly reduces hallucination and improves trustworthiness. Together, these methods not only mitigate language bias but also advance the overall alignment of LVLMs, all without introducing any additional data or auxiliary models. Our code is publicly available at https://github.com/lab-klc/LVLM-Language-Bias.

2605.25030 2026-05-26 cs.LG

MimirRAG: A Multi-Agent RAG Framework for Financial Data Retrieval with Metadata Integration

MimirRAG:一种集成元数据的金融数据检索多智能体RAG框架

Magnus Samuelsen, Wilmer Nyström, Somnath Mazumdar, Mansoor Hussain, Mikkel Strange

发表机构 * Copenhagen Business School(哥本哈根商学院)

AI总结 提出MimirRAG多智能体RAG框架,通过元数据集成、表格感知分块和智能体工作流,在金融数据检索中实现89.3%准确率,优于基线。

详情
AI中文摘要

检索增强生成(RAG)系统提供了一种有前景的方法来减少大语言模型(LLM)中的幻觉并提高答案准确性,这是可靠金融分析的必要条件,其中答案必须基于文件中的可验证证据,而非从模型先验生成。然而,设计能够从混合金融文档中提取有意义见解并集成到分析师工作流程中的RAG系统仍然具有挑战性。本文介绍了MimirRAG(元数据集成多智能体信息检索),这是一个迭代开发的多智能体RAG系统,旨在应对这些挑战。MimirRAG具有模块化流水线,包括PDF文件的保结构解析、表格感知分块、元数据提取、带有查询规划和混合搜索的基于智能体的检索、验证以及支持数值推理的上下文感知生成。我们的消融研究确定了有效金融RAG的三个关键技术推动因素:元数据集成、表格感知分块和智能体工作流。MimirRAG使用FinanceBench进行定量评估,并通过四位金融分析师的专家验证进行定性评估。该系统在FinanceBench上达到89.3%的准确率,优于原始基准基线。专家反馈强调,成功部署还需要校准信任、全面的数据集成和用户个性化。我们得出结论,将多智能体RAG架构与以人为中心的设计原则相结合,可以改善金融分析中有意义见解的提取。

英文摘要

Retrieval-augmented generation (RAG) systems offer a promising approach to reduce hallucinations and improve answer accuracy in large language models (LLMs), a requirement for reliable financial analysis, where answers must be grounded in verifiable evidence from filings rather than generated from model priors. However, designing RAG systems that extract meaningful insights from mixed financial documents and integrate into analyst workflows remains challenging. This paper introduces MimirRAG (Metadata-Integrated Multi-Agent Information Retrieval), a multi-agent RAG system developed iteratively to address these challenges. MimirRAG features a modular pipeline encompassing structure-preserving parsing of PDF filings, table-aware chunking, metadata extraction, agent-based retrieval with query planning and hybrid search, validation, and context-aware generation with numerical reasoning support. Our ablation study identifies three key technical enablers for effective financial RAG: metadata integration, table-aware chunking, and an agentic workflow. MimirRAG was evaluated quantitatively using FinanceBench and qualitatively through expert validation with four financial analysts. The system achieved 89.3% accuracy on FinanceBench, outperforming the original benchmark baselines. Expert feedback highlighted that successful deployment also requires calibrated trust, comprehensive data integration, and user personalization. We conclude that combining multi-agent RAG architecture with human-centric design principles can improve the extraction of meaningful insights in financial analysis.

2605.25024 2026-05-26 cs.CV

DA-UCT: Self-Supervised Domain-Adaptive Ultrasound Computed Tomography for Rapid Musculoskeletal Sound Speed Reconstruction

DA-UCT:用于快速肌肉骨骼声速重建的自监督域自适应超声计算机断层扫描

Tianyu Liu, Heyu Ma, Aiduo Wang, Peiwen Li, Boyi Li, Ying Li, Dan Li, Chengcheng Liu, Dean Ta

发表机构 * College of Biomedical Engineering, Fudan University(复旦大学生物医学工程学院)

AI总结 提出SDA-UCT框架,通过自监督域自适应和注意力增强网络,实现快速高分辨率肌肉骨骼超声计算机断层扫描重建,显著提升速度并保持高质量。

详情
AI中文摘要

通过全波形反演的超声计算机断层扫描(UCT)能够实现高分辨率定量成像,用于组织表征和疾病诊断。然而,由于高度非线性的优化,UCT存在计算负担大和收敛问题严重等缺点。深度学习可以加速UCT重建,但监督训练需要大规模标记数据集,这在体内难以获得。为了解决这些限制,我们提出了SDA-UCT,一个两阶段自监督域自适应框架,用于快速准确的肌肉骨骼组织UCT成像。SDA-UCT采用在模拟数据集上预训练的注意力增强网络(AttUCT),并通过物理信息自监督学习迁移到体内数据,有效弥合了模拟到真实的域差距。集成了低秩自适应(LoRA)机制,以实现跨不同临床场景的高效自适应。结果表明,AttUCT在模拟人前臂上实现了高质量声速重建,PSNR为29.23 dB,SSIM为0.928,优于传统FWI和现有深度学习方法。在体内数据上验证,SDA-UCT成功重建了揭示人前臂复杂解剖结构(皮肤、脂肪、肌肉、肌腱、骨骼和骨髓)的声速图像,与MRI参考高度一致。仅调整3%参数的LoRA机制实现了与全微调相当的性能。快速重建(每帧5毫秒)实现了实时3D可视化,比传统FWI提高了五个数量级。这项工作代表了首个用于快速、高分辨率体内UCT成像的自监督域自适应深度学习,显示了在肌肉骨骼疾病诊断中的潜力。

英文摘要

Ultrasound computed tomography (UCT) via full waveform inversion (FWI) enables high-resolution quantitative imaging for tissue characterization and disease diagnosis. However, UCT suffers from a large computational burden and severe convergence issues due to highly nonlinear optimization. Deep learning can accelerate UCT reconstruction, but supervised training requires large-scale labeled datasets that are difficult to obtain in vivo. To address these limitations, we propose SDA-UCT, a two-stage self-supervised domain-adaptive framework for rapid and accurate UCT imaging of musculoskeletal tissues. SDA-UCT employs an attention-enhanced network (AttUCT) pre-trained on simulation datasets and transfers to in-vivo data via physics-informed self-supervised learning, effectively bridging the simulation-to-real domain gap. A Low-Rank Adaptation (LoRA) mechanism is integrated to enable efficient adaptation across diverse clinical scenarios. Results showed that AttUCT achieved high-quality speed-of-sound (SOS) reconstruction for a simulated human forearm with a PSNR of 29.23 dB and SSIM of 0.928, outperforming conventional FWI and existing deep learning methods. Validated on in-vivo data, SDA-UCT successfully reconstructed SOS images revealing complex anatomical structures (skin, fat, muscle, tendon, bone and bone marrow) of the human forearm, in high concordance with MRI references. The LoRA mechanism, adjusting only 3% of parameters, achieved performance comparable to full fine-tuning. The rapid reconstruction (5 ms per frame) enables real-time 3D visualization, a five-orders-of-magnitude improvement over traditional FWI. This work represents the first self-supervised domain-adaptive deep learning approach for rapid, high-resolution in-vivo UCT imaging, showing potential for musculoskeletal disease diagnosis.

2605.25022 2026-05-26 cs.CV cs.AI

D3S2: Diffusion-Guided Dataset Distillation for Semantic Segmentation

D3S2: 扩散引导的语义分割数据集蒸馏

Wenjie Zheng, Haoji Hu, Jiali Lu, Xingze Zou, Jing Wang

发表机构 * Zhejiang University(浙江大学)

AI总结 针对语义分割数据集蒸馏中的长尾类别不平衡、像素级对齐和高计算成本问题,提出两阶段框架D3S2,通过类别平衡掩码选择和扩散引导图像合成生成紧凑训练集,在极低压缩率下显著提升分割性能。

详情
AI中文摘要

数据集蒸馏旨在将大规模数据集压缩为紧凑的合成集,同时保持训练效果。然而,现有研究主要关注图像分类,而语义分割等密集预测任务尚未充分探索。本文识别了分割数据集蒸馏的三个关键挑战:(i) 长尾类别不平衡,(ii) 图像与密集标签之间严格的像素级对齐需求,以及(iii) 使用复杂模型优化高分辨率数据的高计算成本。为应对这些挑战,我们提出D3S2,一种扩散引导的语义分割数据集蒸馏框架。我们的方法采用两阶段设计。在类别平衡掩码选择中,我们通过优先考虑低表示类别的贪婪策略构建代表性掩码集。在扩散引导图像合成中,我们使用预训练的布局到图像扩散模型生成以所选掩码为条件的图像,自然确保空间对齐。为进一步增强合成数据的训练效用,我们引入具有两个互补目标的引导扩散采样:用于像素级对齐的分割一致性损失,以及用于对齐跨层每类特征统计的类级特征匹配损失。大量实验证明了D3S2的优越性。值得注意的是,在1%的极低压缩率下,我们的方法在ADE20K和COCO-Stuff上使用Mask2Former (Swin-S)分别达到24.99%和35.49%的mIoU,比随机选择分别高出9.34%和5.70%。

英文摘要

Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic sets while preserving training efficacy. However, existing studies mainly focus on image classification, leaving dense prediction tasks such as semantic segmentation largely underexplored. In this work, we identify three key challenges for segmentation DD: (i) long-tailed class imbalance, (ii) the need for strict pixel-wise alignment between images and dense labels, and (iii) the high computational cost of optimizing high-resolution data with complex models. To address these challenges, we propose D3S2, a Diffusion-guided Dataset Distillation framework for Semantic Segmentation. Our method adopts a two-stage design. In Class-Balanced Mask Selection, we construct a representative mask set via a greedy strategy that prioritizes underrepresented classes. In Diffusion-Guided Image Synthesis, we employ a pretrained layout-to-image diffusion model to generate images conditioned on the selected masks, naturally ensuring spatial alignment. To further enhance the training utility of synthesized data, we introduce guided diffusion sampling with two complementary objectives: a segmentation-consistency loss for pixel-level alignment, and a class-wise feature matching loss for aligning per-class feature statistics across layers. Extensive experiments demonstrate the superiority of D3S2. Notably, at an extreme compression rate of 1%, our method achieves 24.99% and 35.49% mIoU on ADE20K and COCO-Stuff with Mask2Former (Swin-S), outperforming random selection by 9.34% and 5.70%, respectively.
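
The abstract does not specify the greedy rule used in Class-Balanced Mask Selection. The sketch below is one plausible instance, offered only as an illustration: each mask is reduced to the set of class names it contains, and each step picks the mask whose classes are currently least covered, using an assumed gain of `1 / (1 + count)` per class. The function name, the gain, and the toy masks are all hypothetical.

```python
from collections import Counter


def select_masks(masks, budget):
    """Greedily pick up to `budget` masks, favoring under-covered classes.

    masks: dict mapping a mask name to the set of class names it contains.
    Returns the chosen mask names in selection order.
    """
    counts = Counter()          # how often each class has been selected so far
    remaining = dict(masks)
    chosen = []
    while remaining and len(chosen) < budget:
        def gain(name):
            # Classes seen less often so far contribute more to the gain.
            return sum(1.0 / (1 + counts[c]) for c in remaining[name])

        best = max(remaining, key=gain)
        chosen.append(best)
        counts.update(remaining.pop(best))
    return chosen


# Toy label sets (hypothetical): "lamp" is the rare class.
masks = {
    "m0": {"sky", "road"},
    "m1": {"sky", "road", "car"},
    "m2": {"sky", "lamp"},
    "m3": {"road", "car"},
}
picked = select_masks(masks, budget=2)
```

After the first pick covers the common classes, the mask containing the rare class wins the second step, which is the behavior a class-balanced selection is meant to produce.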

2605.25020 2026-05-26 cs.AI cs.CL

Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients

慢性皮肤病纵向数据检索中的隐私保护本地语言模型:在天疱疮患者中的实施

Abdurrahim Yilmaz, Ayşe Esra Koku Aksu, Duygu Yamen, Vefa Asli Erdemir, Mehmet Salih Gurel, Gulsum Gencoglan, Joram M. Posma, Burak Temelkuran

Affiliations * Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London; Department of Dermatology and Venereology, Istanbul Research and Training Hospital; Department of Dermatology and Venereology, Istanbul Medeniyet University; Department of Dermatology and Venereology, Istanbul Medicana Atakoy Hospital

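AI summary This study evaluates whether a locally deployed, privacy-preserving small language model (SLM) can retrieve structured clinical features and generate longitudinal summaries from the long-term follow-up records of pemphigus patients. The SLM reached a mean accuracy of 82.25% on feature retrieval, and physicians rated the AI-generated summaries highly for quality, clinical accuracy, and usefulness.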

Details
AI Chinese abstract

慢性皮肤病如天疱疮需要长期随访,产生大量纵向临床文档,在常规就诊期间难以全面审查,增加了临床医生的工作量以及遗漏关键历史信息的风险。我们评估了本地部署的隐私保护小型语言模型(SLM)是否能够从长期皮肤科随访记录中检索结构化临床特征并生成纵向摘要。在这项回顾性病例系列研究中,30名天疱疮患者贡献了541份就诊记录,汇总为完整的纵向记录(89,336词);由两位皮肤科专家标注了56个临床相关特征。本地部署的SLM(Qwen3 4B Thinking 2507)对每份完整记录进行查询,以检索56个特征并生成一份最终报告摘要。在1,680个特征检索任务中,平均准确率为82.25%。皮肤科医生对AI生成摘要的整体质量(8.23-8.47)、临床准确性(7.93-8.20)和实用性(8.47-8.50)评分较高,评估者间无显著差异,且在53.3%的评估中总体偏好AI摘要。这些发现表明,隐私保护的本地部署SLM可以优于医学专家,并可靠地生成有临床意义的纵向摘要。在适当监督下,SLM可以支持临床决策。

English abstract

Chronic dermatologic diseases such as pemphigus require long-term follow-up, generating extensive longitudinal clinical documentation that is difficult to review comprehensively during routine visits, which increases clinician workload as well as the risk of missing critical historical information. We evaluated whether a locally deployed, privacy-preserving small language model (SLM) could retrieve structured clinical features and generate longitudinal summaries from long-term dermatology follow-up records. In this retrospective case series, thirty pemphigus patients contributed 541 visit notes that were aggregated into full longitudinal records (89,336 words); 56 clinically relevant features were annotated by two expert dermatologists. The locally deployed SLM (Qwen3 4B Thinking 2507) was queried with each complete record to retrieve 56 features and generate one final report summary. Across 1,680 feature retrieval tasks, mean accuracy was 82.25%. Dermatologists' ratings of AI-generated summaries were high for overall quality (8.23-8.47), clinical accuracy (7.93-8.20), and usefulness (8.47-8.50), with no significant inter-evaluator differences and an overall preference for AI summaries in 53.3% of evaluations. These findings suggest that privacy-preserving, locally deployed SLMs can outperform medical experts and reliably generate clinically meaningful longitudinal summaries. SLMs may support clinical decision-making when integrated with appropriate oversight.

2605.25014 2026-05-26 cs.CV

Stop Denoising Your Blurs

停止去噪你的模糊

Sasidhar Parvathireddy, Vamsidhar Saraswathula, Rama Krishna Gorthi

Affiliations * Indian Institute of Technology Tirupati, India

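AI summary Proposes ConvDiff, a framework that replaces additive noise with convolution to construct blur degradation trajectories, enabling diffusion-based image deblurring and bridging the gap between the mathematics of blurring and the design of diffusion algorithms.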

Comments Accepted at IEEE International Conference on Image Processing (ICIP) 2026. 7 pages, 3 figures

Details
AI Chinese abstract

近年来,扩散模型在图像恢复任务中取得了显著性能。其核心机制依赖于在加性噪声操作之前对退化先验的受限假设。然而,模糊模型作为最广泛研究的退化形式之一,违反了这一假设,因为它本质上基于卷积而非加法。在本文中,我们引入了ConvDiff,一种新颖的基于扩散的框架,该框架用卷积替代加法操作,用于图像去模糊任务。在前向过程中,我们利用卷积的频域特性,从清晰图像到其模糊对应物构建有意义的轨迹,而不是用加性噪声逐步破坏图像。虽然当前工作针对高斯模糊实例化了该框架(其中频域分解产生闭式且物理有效的中间状态),但从模糊算子构建退化轨迹的基本原则自然扩展到其他模糊族。该公式弥合了模糊的数学原理与基于扩散的恢复算法的迭代设计之间的差距,从而实现了更物理基础且有效的图像恢复模型。

English abstract

In recent times, diffusion models have achieved remarkable performance in image restoration tasks. Their core mechanism relies on the restricted presumption of degradation prior to the additive noise operation. However, the blur model, one of the most widely studied degradation formulations, violates this assumption, as it is inherently based on convolution rather than addition. In this paper, we introduce ConvDiff, a novel diffusion-based framework that substitutes the additive operation with convolution for the task of image deblurring. In the forward process, we construct a meaningful trajectory from the clean image to its blurred counterpart by exploiting the frequency-domain characteristics of convolution, rather than progressively corrupting the image with additive noise. While the current work instantiates this framework for Gaussian blur, where frequency-domain decomposition yields closed-form and physically valid intermediate states, the underlying principle of constructing degradation trajectories from the blur operator extends naturally to other blur families. This formulation bridges the gap between the mathematical principles of blurring and the iterative design of diffusion-based restoration algorithms, enabling more physically grounded and effective image restoration models.
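
To illustrate why Gaussian blur admits closed-form intermediate states in the frequency domain, here is a small 1D sketch. It relies on a standard property: the Fourier transform of a Gaussian kernel with standard deviation sigma is exp(-sigma^2 w^2 / 2), so raising it to a power t in [0, 1] yields a Gaussian blur with standard deviation sigma * sqrt(t). This is not the ConvDiff forward process itself; the function names, the circular 1D signal, and the linear schedule in t are assumptions chosen for illustration.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * j / n) for k in range(n)) / n
            for j in range(n)]


def gaussian_transfer(n, sigma, t):
    """Transfer function of a Gaussian blur raised to the power t (0 <= t <= 1)."""
    out = []
    for k in range(n):
        f = k / n if k <= n // 2 else (k - n) / n   # signed frequency in cycles/sample
        w = 2 * math.pi * f
        out.append(math.exp(-t * sigma ** 2 * w ** 2 / 2))
    return out


def blur_state(x, sigma, t):
    """Intermediate state between the clean signal (t=0) and the full blur (t=1)."""
    H = gaussian_transfer(len(x), sigma, t)
    return [v.real for v in idft([h * c for h, c in zip(H, dft(x))])]


clean = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
half = blur_state(clean, 2.0, 0.5)
full = blur_state(clean, 2.0, 1.0)
```

Applying the half step twice reproduces the full blur up to floating-point error, which is the composition property that makes every intermediate state itself a valid Gaussian blur of the clean signal.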

2605.25012 2026-05-26 cs.CV

Learning from Semantic Dictionaries: Discriminative Codebook Contrastive Learning for Unified Visual Representation and Generation

从语义字典中学习:面向统一视觉表示与生成的判别式码本对比学习

Imanol G. Estepa, Jesús M Rodríguez-de-Vera, Bhalaji Nagarajan, Petia Radeva

Affiliations * Universitat de Barcelona; Barcelona Supercomputing Center (BSC)

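AI summary Proposes LEASE, a framework with a paired generative-discriminative codebook design that jointly optimizes a masked reconstruction loss and a codebook contrast loss in a discrete token space, enabling unified visual representation and generation with state-of-the-art performance on ImageNet-1K.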

Comments Accepted at CVPR'26

Details
AI Chinese abstract

判别式和生成式视觉模型在各自领域表现出色,但在语义上存在错位,阻碍了统一视觉学习的进展。我们提出LEASE(从语义字典中学习),一种自监督框架,通过配对生成-判别码本设计弥合这一差距。LEASE完全在通过一次性预计算步骤产生的离散标记空间中运行,无需数据增强、教师模型或在线分词器即可高效训练。LEASE整合了两个互补目标:捕获细粒度生成细节的掩码标记重建损失,以及通过自适应质心加权将编码器特征与判别语义对齐的码本对比损失。这种双重监督产生了一个统一潜在空间,同时支持高质量生成和强大的表示学习。在ImageNet-1K上,LEASE实现了最先进的统一性能,在线性探测(相比MAGE和Sorcen提升高达+1.7%)、无条件生成(相比MAGE FID降低1.26,IS提升10.19)、少样本学习(相比Sorcen平均提升+0.56%)、迁移学习(相比MAGE和Sorcen平均提升+0.75%)以及鲁棒性基准(相比MAGE和Sorcen平均提升+5.86%和+4.25%)上均优于先前的VQGAN方法如MAGE和Sorcen。它还能与领域专用的对比和生成模型竞争,同时超越先前的MIM方法。无监督的LEASE模型还可以通过在其学习表示基础上构建扩展到条件生成,与专用基线相比具有竞争力。总体而言,LEASE为联合理解和生成视觉内容的通用视觉模型提供了高效且有效的一步。

English abstract

Discriminative and generative vision models excel in their respective domains but remain semantically misaligned, hindering progress toward unified visual learning. We introduce LEASE (LEArning from SEmantic Dictionaries), a self-supervised framework that bridges this gap using a paired generative-discriminative codebook design. LEASE operates entirely in a discrete token space produced through a one-time precomputation step, enabling efficient training without data augmentations, teacher models, or online tokenizers. LEASE integrates two complementary objectives: a masked token reconstruction loss that captures fine-grained generative detail, and a codebook contrast loss that aligns encoder features with discriminative semantics via adaptive centroid weighting. This dual supervision yields a unified latent space that supports both high-quality generation and strong representation learning. On ImageNet-1K, LEASE achieves state-of-the-art unified performance, outperforming prior VQGAN-based methods such as MAGE and Sorcen across linear probing (up to +1.7%), unconditional generation (-1.26 FID and +10.19 IS relative to MAGE), few-shot learning (+0.56% on average against Sorcen), transfer (+0.75% average improvement against MAGE and Sorcen), and robustness benchmarks (+5.86% and +4.25% average improvement against MAGE and Sorcen, respectively). It also competes favorably with domain-specialized contrastive and generative models while surpassing previous MIM methods. The unsupervised LEASE model can also be extended to conditional generation by building upon its learned representations, proving competitive with specialized baselines. Overall, LEASE provides an efficient and effective step toward general-purpose vision models that jointly understand and generate visual content.
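
The abstract does not specify the form of the codebook contrast loss. As a rough illustration of what "aligning encoder features with codebook centroids" can mean, the sketch below computes a generic InfoNCE-style cross-entropy over cosine similarities between one feature vector and a set of centroids. It is not LEASE's loss: the adaptive centroid weighting and the masked token reconstruction term are omitted, and the function name, the `temperature` value, and the single-positive setup are assumptions.

```python
import math


def codebook_contrast_loss(feature, centroids, target, temperature=0.1):
    """Generic contrastive loss pulling `feature` toward centroids[target].

    Hypothetical illustration only; it omits any adaptive centroid weighting.
    """
    def normalize(v):
        norm = math.sqrt(sum(a * a for a in v)) or 1.0
        return [a / norm for a in v]

    f = normalize(feature)
    # Cosine similarity to every centroid, scaled by the temperature.
    logits = [sum(a * b for a, b in zip(f, normalize(c))) / temperature
              for c in centroids]
    # Numerically stable log-sum-exp for the softmax normaliser.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


centroids = [[1.0, 0.0], [0.0, 1.0]]
aligned = codebook_contrast_loss([1.0, 0.0], centroids, target=0)
misaligned = codebook_contrast_loss([1.0, 0.0], centroids, target=1)
```

The loss is close to zero when the feature points at its assigned centroid and grows when it points at a different one, which is the signal a contrastive objective of this kind provides to the encoder.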