arXivDaily: daily arXiv digest, updated Monday through Friday
2605.24932 2026-05-26 cs.CV

X-Edit: Exact, Explicit, and Explainable Null-Space Editing for Medical Vision Transformers

Yuanye Liu, Siyuan Zhou, Ke Zhang, Lei Li, Wei Chen, Xiahai Zhuang

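Affiliations Fudan University; Johns Hopkins University; National University of Singapore; University of Sydney

AI summary Proposes X-Edit, a framework that uses causal localization and null-space projection to precisely correct errors of ViT models in medical image classification while avoiding catastrophic forgetting.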

Comments Early accepted by MICCAI 2026

Abstract

Pre-trained Vision Transformers (ViTs) are increasingly deployed for medical image classification. However, correcting their inevitable failure cases in dynamic clinical scenarios poses a critical challenge. Conventional fine-tuning approaches inherently suffer from catastrophic forgetting, severely degrading previously acquired diagnostic capabilities. Such instability fundamentally compromises clinical safety. Addressing this vulnerability requires an active, controllable, and reliable intervention mechanism that is both theoretically grounded and inherently interpretable. To this end, we propose X-Edit (eXact, eXplicit, and eXplainable Editing), an efficient null-space model editing framework. X-Edit transitions the editing process from iterative gradient-based optimization to a theoretically grounded, closed-form solution. Specifically, we first use causal tracing to explicitly localize the influential layers governing the erroneous prediction. Subsequently, we construct an orthogonal null-space projection matrix from a curated anchor set. By geometrically constraining the exact parameter update strictly within this null space, we provide mathematical guarantees that the intervention rectifies targeted errors without perturbing established diagnostic representations. Extensive evaluations on six medical imaging benchmarks demonstrate that X-Edit comprehensively suppresses catastrophic forgetting while achieving superior edit success rates. Our code is available at https://github.com/HenryLau7/X-Edit.
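
The null-space constraint described in this abstract can be illustrated with a small linear-algebra sketch. It is not the authors' implementation (their repository is the reference for that). It assumes a single linear layer `W`, a hypothetical anchor matrix `K0` whose columns are inputs whose outputs must stay unchanged, and a rank-one closed-form update; all names and sizes are illustrative.

```python
import numpy as np

def null_space_projector(K0: np.ndarray) -> np.ndarray:
    """Return P with P @ K0 == 0: the projector onto the orthogonal
    complement of the span of the anchor columns of K0 (shape d_in x n)."""
    # Orthonormal basis U of span(K0) from a reduced SVD; drop near-zero directions.
    U, s, _ = np.linalg.svd(K0, full_matrices=False)
    U = U[:, s > 1e-10 * s.max()]
    return np.eye(K0.shape[0]) - U @ U.T

def closed_form_edit(W, k, v, P):
    """Rank-one update dW such that (W + dW) @ k == v and dW @ K0 == 0."""
    r = v - W @ k          # residual the edit must add at key k
    pk = P @ k             # part of k that the anchors do not span
    denom = k @ pk         # equals ||P k||^2 because P is a projector
    if denom < 1e-12:
        raise ValueError("edit key lies in the anchor span; no null-space edit exists")
    return np.outer(r, pk) / denom

rng = np.random.default_rng(0)
d_in, d_out, n_anchor = 16, 8, 5
W = rng.normal(size=(d_out, d_in))
K0 = rng.normal(size=(d_in, n_anchor))   # hypothetical anchor activations
k = rng.normal(size=d_in)                # activation of the failure case
v = rng.normal(size=d_out)               # desired output for that case

P = null_space_projector(K0)
dW = closed_form_edit(W, k, v, P)
W_edited = W + dW
```

In this toy setting `W_edited @ k` equals `v` while `W_edited @ K0` equals `W @ K0` up to floating-point error, which is the "correct the target without perturbing the anchors" property in its simplest form. The guarantee only covers inputs in the span of the anchors, so the choice of anchor set matters.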

2605.24931 2026-05-26 cs.RO

Learning High-Frequency Continuous Action Chunks in Latent Space

Kunyun Wang, Yuhang Zheng, Yupeng Zheng, Jieru Zhao, Wenchao Ding

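Affiliations School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; National University of Singapore, Singapore; Institute of Automation, Chinese Academy of Sciences, Beijing, China; Fudan University, Shanghai, China

AI summary Uses a variational autoencoder to shift high-frequency action learning from the action space to a latent space, and introduces Reuse-then-Refine, a chunk-level refinement strategy, to improve the temporal and spatial consistency of high-frequency control and enable smooth execution of complex contact-rich tasks.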

Comments 17 pages, 10 figures

Abstract

Modern robotic policies increasingly rely on action chunking to execute complex tasks in the physical world. While action chunking improves temporal consistency at moderate action frequencies, it becomes insufficient when the action frequency is further increased (e.g., to 60 Hz). At such high frequencies, policies often fail to generate actions that are both temporally smooth and spatially consistent. We address this challenge by shifting high-frequency action learning from the action space to a latent space with a variational autoencoder (VAE). This formulation significantly improves both temporal and spatial consistency of high-frequency control. To enable smooth real-time execution, we further introduce Reuse-then-Refine, a chunk-level refinement strategy that improves continuity between adjacent action chunks under asynchronous inference. As a result, robots controlled by our policy can execute complex contact-rich tasks continuously, with fewer pauses and jerky motions. Experiments on three real-world contact-rich robotic tasks show that our approach consistently completes tasks with smooth motions. Our code and data are available at https://github.com/tars-robotics/RTR.

2605.24930 2026-05-26 cs.CL

H$^{2}$MT: Semantic Hierarchy-Aware Hierarchical Memory Transformer

Maryam Haghifam, Zifan He, Jason Cong, Yizhou Sun

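Affiliations University of California, Los Angeles

AI summary Proposes H$^{2}$MT, which builds a semantic hierarchy offline, computes memory embeddings via bottom-up post-order aggregation, and routes queries coarse-to-fine at inference, yielding a favorable quality-efficiency trade-off in long-context inference.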

Abstract

Transformer-based LLMs achieve strong results on many language tasks; however, long inputs remain challenging because context windows are finite, and prefill latency and memory grow rapidly with prompt length. Flat token-stream processing and chunk-based retrieval can therefore spend substantial computation and context budget on text unrelated to the query. Offline-indexed RAG additionally introduces external storage and index management overhead, and typically appends retrieved evidence as raw text, increasing prefill cost and latency. H$^{2}$MT makes long-context inference structure-aware: it builds a semantic hierarchy offline, computes a memory embedding for each node via bottom-up post-order aggregation, and routes queries coarse-to-fine at inference to prune irrelevant branches early. On LongBench QA (NarrativeQA, HotpotQA, QASPER) and two structured technical-document settings, H$^{2}$MT achieves favorable quality-efficiency trade-offs, delivering competitive ROUGE-L and F1 (where applicable) with lower peak GPU memory and time-to-first-token (TTFT) than prompt compression, memory-token methods, and retrieval-augmented generation baselines.
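
The two mechanisms named in this abstract, bottom-up post-order aggregation and coarse-to-fine routing, can be sketched on a toy document tree. The sketch below is an illustration under assumptions, not the paper's model: the aggregator is a plain element-wise mean, relevance is cosine similarity, pruning keeps the `top_k` children per level, and the `Node` class and the three-dimensional embeddings are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import math

@dataclass
class Node:
    name: str
    embedding: Optional[List[float]] = None   # given for leaves, computed for internal nodes
    children: List["Node"] = field(default_factory=list)

def aggregate(node: Node) -> List[float]:
    """Post-order pass: every child is summarized before its parent."""
    if node.children:
        child_vecs = [aggregate(c) for c in node.children]
        dim = len(child_vecs[0])
        # Assumed aggregator: element-wise mean of the child embeddings.
        node.embedding = [sum(v[i] for v in child_vecs) / len(child_vecs) for i in range(dim)]
    return node.embedding

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(node: Node, query, top_k: int = 1) -> List[str]:
    """Coarse-to-fine descent: keep only the top_k most similar children per level."""
    if not node.children:
        return [node.name]
    ranked = sorted(node.children, key=lambda c: cosine(query, c.embedding), reverse=True)
    leaves = []
    for child in ranked[:top_k]:        # every other branch is pruned here
        leaves.extend(route(child, query, top_k))
    return leaves

doc = Node("doc", children=[
    Node("sec_install", children=[Node("pip", [1.0, 0.0, 0.0]), Node("source", [0.9, 0.1, 0.0])]),
    Node("sec_api", children=[Node("classes", [0.0, 1.0, 0.0]), Node("functions", [0.0, 0.8, 0.6])]),
])
aggregate(doc)                                    # offline step
selected = route(doc, query=[0.0, 0.6, 0.8])      # inference-time step
```

Here the query is routed into `sec_api` and then to the `functions` leaf, so the `sec_install` subtree is never visited. With `top_k` greater than one, the descent trades some of that saving for recall.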

2605.24928 2026-05-26 cs.CV

MambaDSF: Multi-Scale SSM with Dilated Feature Fusion for Sonar Small Target Detection

Hui Lin, Jiayi Li, Jing Wang, Shenghui Rong

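Affiliations School of Information Science and Engineering, Ocean University of China; School of Information and Communication Technology, Griffith University

AI summary To address insufficient pixel coverage, noise interference, and scale ambiguity in sonar small-target detection, proposes the hybrid MambaDSF framework, which combines a Mamba-enhanced feature pyramid, a dilated fusion encoder, and scale-adaptive loss functions, reaching 91.5% mAP50 on the UATD dataset with 28.7M parameters.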

Comments 8 pages, 4 figures, under review at IEEE Geoscience and Remote Sensing Letters (GRSL)

Abstract

Sonar imaging is the primary modality for underwater target detection, yet small targets remain difficult to detect due to insufficient pixel coverage, low acoustic contrast, and scale ambiguity across imaging ranges. CNN-based detectors extract local features efficiently but cannot suppress noise-induced false alarms without global acoustic context. Transformer-based methods capture long-range dependencies at quadratic computational cost. Existing Mamba-based vision models offer efficient linear-cost scanning but lack multi-scale semantic alignment across pyramid levels, multi-receptive-field fusion, and small-target-aware training supervision needed for reliable sonar detection. This letter proposes Mamba Dilated-Scale Fusion (MambaDSF), a hybrid framework addressing these limitations through three contributions: a Mamba Enhanced Feature Pyramid (MambaEFP) backbone that jointly captures local echo cues and global acoustic context at linear complexity; a Dilate Fusion Mamba (DFMamba) encoder that enforces multi-scale feature alignment across pyramid levels; and Scale-Adaptive Weighted IoU (SA-WIoU) and Cross-Scale Coherence (CSC) losses that stabilize small-target training. MambaDSF achieves 91.5% mAP50 on the UATD forward-looking sonar benchmark with 28.7 million parameters, surpassing all compared detectors. On a small-target subset the gain reaches +2.2 percentage points, and cross-domain evaluation on FLS and MD-FLS confirms the generalization of the proposed architecture. The code is publicly available at https://github.com/IDontKnowAAA/MambaDSF.

2605.24926 2026-05-26 cs.AI

Energy Shields for Fairness

Filip Cano, Thomas A. Henzinger, Konstantin Kueffner

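Affiliations Institute of Science and Technology Austria

AI summary Proposes energy shields, lightweight, physics-inspired adaptive controllers that intervene probabilistically to enforce runtime fairness smoothly, and that are the first fairness shields to provide both short-term safety and long-term liveness guarantees.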

Abstract

Runtime fairness is not a one-time constraint but a dynamic property evaluated over a sequence of decisions. To ensure fairness at runtime, it is necessary to account for past decisions, information neglected by conventional, static classifiers. Traditional fairness shields enforce runtime fairness abruptly, by intervening deterministically whenever a sequence of decisions violates the target for a running fairness measure. This motivates our main conceptual contribution: energy shields. An energy shield is a novel, lightweight, adaptive controller that monitors a sequence of decisions and intervenes probabilistically to ensure runtime fairness smoothly, by utilizing physics-inspired energy functions to nudge the sequence toward fairness: the more unfair the decisions, the stronger the nudging force becomes. This makes energy shields the first fairness shields to provide both short-term safety and long-term liveness guarantees. Safety ensures that the running fairness measure stays within a running target interval with high probability, and liveness ensures that the limit of the fairness measure lies within the limit target interval. Intuitively, the short-term target specifies the tolerated fairness values and the long-term target specifies the desired fairness values. We also provide a synthesis procedure for constructing the least intrusive energy shield for a given target specification, and demonstrate its efficiency experimentally. We evaluate our energy shields against existing fairness shields through the lens of short- and long-term fairness.
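
The idea that the nudging force grows with unfairness can be made concrete with a toy simulation. Everything specific below is an assumption made for illustration: the fairness measure (a running demographic-parity gap between two groups), the quadratic energy function, the override rule, and the biased base classifier. The paper's actual energy functions, guarantees, and synthesis procedure are not reproduced here.

```python
import random

def intervention_probability(gap: float, tolerance: float) -> float:
    """Hypothetical quadratic energy: 0 at perfect parity, 1 at the tolerance boundary."""
    return min(1.0, (gap / tolerance) ** 2)

def run(shielded: bool, steps: int = 5000, tolerance: float = 0.2, seed: int = 0) -> float:
    rng = random.Random(seed)
    seen = {"A": 0, "B": 0}
    accepted = {"A": 0, "B": 0}
    accept_prob = {"A": 0.7, "B": 0.4}       # a deliberately biased base classifier
    for _ in range(steps):
        group = rng.choice("AB")
        decision = rng.random() < accept_prob[group]
        rate = {g: accepted[g] / seen[g] if seen[g] else 0.0 for g in seen}
        gap = rate["A"] - rate["B"]           # running acceptance-rate gap
        if shielded and rng.random() < intervention_probability(gap, tolerance):
            # Nudge toward parity: accept the group that is behind, reject the one ahead.
            behind = "B" if gap > 0 else "A"
            decision = (group == behind)
        seen[group] += 1
        accepted[group] += decision
    return accepted["A"] / seen["A"] - accepted["B"] / seen["B"]

gap_plain = run(shielded=False)
gap_shielded = run(shielded=True)
```

Because the override probability is near zero when the gap is small, the shield in this sketch rarely interferes with a classifier that is already close to parity and intervenes more often as the gap approaches the tolerance. A deterministic shield would correspond to a step function in place of `intervention_probability`.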

2605.24924 2026-05-26 cs.RO

Dynamic Neural Koopman Distillation for Real-Time Robot Control Using Diffusion Models

Lei Zheng, Peiqi Yu, Zengqi Peng, Changliu Liu, Armin Lederer

Affiliations Department of Electrical and Computer Engineering, National University of Singapore; Department of Electrical and Computer Engineering, Carnegie Mellon University; Robotics and Autonomous Systems Thrust, The Hong Kong University of Science and Technology

AI summary Proposes Dynamic Neural Koopman Distillation, which distills multistep diffusion inference into a single forward pass, preserves multimodal expressivity through a Factorized Dynamic Koopman layer, and achieves millisecond-latency closed-loop control on D4RL MuJoCo benchmarks and a physical robot.

Comments 8 pages, 5 figures

Abstract

Diffusion models excel at generating diverse and multimodal trajectories for robotic planning, yet their iterative denoising process introduces latency that is incompatible with high-frequency closed-loop control. To address this problem, we propose Dynamic Neural Koopman Distillation, a framework that distills multistep diffusion inference into a single forward pass while retaining the multimodal expressivity of the teacher model. Specifically, we introduce a Factorized Dynamic Koopman layer that models the denoising process through a factorized latent transition with state-dependent modal gains. We evaluate the proposed method on standard D4RL MuJoCo locomotion benchmarks and a physical Kinova manipulator, comparing against one-step baselines. The results show that our method significantly outperforms existing one-step distillation approaches on the reported locomotion tasks and, relative to the teacher policy, reduces inference latency to the millisecond regime. Hardware experiments further demonstrate that our method enables smooth and fast closed-loop execution while maintaining task success and comparable accuracy. A project page is available at https://fdkoopman.github.io/.

2605.24922 2026-05-26 cs.RO

MuJoCoUni: Persistent Batched Runtime Primitives for MuJoCo

Yufei Jia, Junzhe Wu

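Affiliations Tsinghua University

AI summary Presents MuJoCoUni, a downstream MuJoCo distribution for online robot learning and batched physics evaluation; its BatchEnvPool provides runtime primitives for stateful environment execution, supporting high-throughput parallel execution while preserving upstream semantics.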

Comments Technical report

Abstract

We present MuJoCoUni, a downstream MuJoCo distribution for online robot learning and batched physics evaluation. Alongside the open-loop batched trajectory generation already provided by upstream mujoco.rollout, MuJoCoUni supplies runtime primitives for stateful environment execution. The target workloads need high-throughput parallel execution while retaining upstream CPU MuJoCo semantics for models, sensors, contact, and constraints. Its core object, BatchEnvPool, is a C++/pybind11 executor that owns per-environment mjModel copies, per-thread mjData workers, and an internal thread pool. It provides final-state-only short stepping, sparse reset, reset-lifecycle domain randomization, batched sensor forward evaluation without advancing dynamics, and batched Jacobian and height-field queries. The implementation is confined to the Python binding layer; MuJoCo's solver, contact model, integrator, and core source tree retain upstream semantics. This report describes the BatchEnvPool API, implementation boundary, relationship to rollout, and the validation and benchmark scripts shipped with the open-source mujoco-uni package, which can be installed with pip install mujoco-uni.

2605.24921 2026-05-26 cs.LG

BandVQ: Band-Wise Vector-Quantized EEG Foundation Model

Jamiyan Sukhbaatar, Satoshi Imamura, Toshihisa Tanaka

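Affiliations Tokyo University of Agriculture and Technology; National University of Mongolia

AI summary To address the underrepresentation of frequency-specific activity in EEG foundation models, proposes BandVQ, which uses band-wise VQ-VAE tokenizers and a shared Transformer encoder, is pretrained on 71 public datasets, and reports the highest results on three cognitive tasks and competitive results on three motor imagery tasks.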

Comments 15 pages, 1 figure

Abstract

A central challenge in electroencephalography (EEG) foundation modeling is learning transferable representations across recordings with diverse tasks, montages, references, and spectral characteristics. Existing masked modeling approaches often rely on broadband continuous patches or a single discrete representation, which may underrepresent frequency-specific activity. This paper proposes BandVQ, a band-wise vector-quantized EEG foundation model that decomposes EEG into delta, theta, alpha, beta, and gamma bands, trains an independent VQ-VAE tokenizer for each band, and pretrains a shared Transformer encoder on the resulting discrete VQ code indices. The encoder uses masked code tokens, quantized absolute log-power tokens, channel and temporal embeddings, and metadata prefix tokens representing reference, band, task family, and phase. Region-based masking is also introduced to reduce the trivial reconstruction of spatially adjacent electrodes. The model is pretrained on 71 public EEG corpora comprising over 9,200 subjects and 357,000 single-channel hours and evaluated on six subject-independent classification datasets. Under the current evaluation setting, the proposed model achieves strong transfer performance, with the highest reported results on three cognitive tasks and competitive performance on three motor imagery tasks.
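
The band-wise tokenization step in this abstract can be sketched with NumPy. This is a simplified illustration, not the BandVQ pipeline: the band edges are conventional values rather than the paper's, the band split zeroes FFT bins instead of using a designed filter, the codebooks are random where a VQ-VAE would learn them, and the quantizer acts on raw signal patches rather than on encoder latents.

```python
import numpy as np

# Conventional band edges in Hz; the paper's exact cut-offs are not given in the abstract.
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def split_bands(x: np.ndarray, fs: float) -> dict:
    """Band-limit a 1-D signal by zeroing FFT bins outside each band."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    out = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        out[name] = np.fft.irfft(spectrum * mask, n=len(x))
    return out

def tokenize(x: np.ndarray, codebook: np.ndarray, patch: int) -> np.ndarray:
    """Index of the nearest codeword for each non-overlapping patch of length `patch`."""
    patches = x[: len(x) // patch * patch].reshape(-1, patch)
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

fs, patch = 256.0, 64
t = np.arange(0, 4.0, 1.0 / fs)
eeg = np.sin(2 * np.pi * 10 * t) + 0.3 * np.sin(2 * np.pi * 20 * t)   # synthetic alpha + beta

rng = np.random.default_rng(0)
# One independent codebook per band; random here, learned in a real tokenizer.
codebooks = {name: rng.normal(size=(32, patch)) for name in BANDS}

bands = split_bands(eeg, fs)
tokens = {name: tokenize(sig, codebooks[name], patch) for name, sig in bands.items()}
power = {name: float(np.mean(sig ** 2)) for name, sig in bands.items()}
```

Each band yields its own sequence of 16 discrete code indices for this 4-second single-channel signal, and a shared encoder would then consume all five sequences. The `power` dictionary shows why a per-band view can help: the 10 Hz component dominates the alpha band, while the weaker 20 Hz component is isolated in beta instead of being masked by it.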

2605.24920 2026-05-26 cs.LG cs.AI stat.ML

Quaternion Self-Attention with Shared Scores

Shogo Yamauchi, Tohru Nitta, Hideaki Tamori

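Affiliations Tokyo Woman's Christian University

AI summary Proposes a shared-score quaternion self-attention mechanism that computes a single real-valued score with the quaternion inner product and shares one attention distribution across all components, substantially reducing computational cost while maintaining performance.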

Comments 26 pages, 6 figures and 15 tables. Accepted at ICML 2026

Abstract

Quaternion neural networks are parameter-efficient and model multidimensional dependencies by representing four related features as a single entity. However, existing quaternion self-attention computes component-wise scores and applies independent softmax operations to each component, which increases the computational cost and allows attention distributions to diverge across components. We propose a shared-score quaternion self-attention mechanism that computes a single real-valued score using the quaternion inner product and applies a shared attention distribution across all components. This reduces score-computation multiplications by 75% and the number of softmax operations from four to one. We prove that, when queries and keys are produced by quaternion linear projections that induce component pre-mixing, the component-wise and shared scores lie in the same interaction subspace, indicating that independent component-wise attention primarily re-parameterizes the same interactions rather than expanding the feature interaction space. In speech enhancement, our method reduces inference time by up to 44.3% on a GPU and 58.1% on a CPU while maintaining quality, with consistent trends across vision and natural language processing.
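
The shared-score idea can be written down in a few lines of NumPy. The sketch below assumes that queries, keys, and values are stored as arrays of shape (4, n_tokens, d), one slice per quaternion component, and that the score is scaled by the square root of 4d; the layout, the scaling, and the function names are assumptions for illustration, and the quaternion linear projections that produce Q, K, and V are omitted.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_score_attention(Q, K, V):
    """Q, K, V: arrays of shape (4, n_tokens, d), one slice per quaternion component."""
    d = Q.shape[-1]
    # Quaternion inner product: sum of the real dot products of matching components.
    scores = np.einsum("cnd,cmd->nm", Q, K) / np.sqrt(4 * d)   # scaling is an assumption
    attn = softmax(scores, axis=-1)                            # one softmax instead of four
    out = np.einsum("nm,cmd->cnd", attn, V)                    # same weights for every component
    return out, attn

rng = np.random.default_rng(0)
n_tokens, d = 6, 8
Q, K, V = (rng.normal(size=(4, n_tokens, d)) for _ in range(3))
out, attn = shared_score_attention(Q, K, V)
```

Written this way, the shared score is the ordinary dot product of the two tokens viewed as real vectors of length 4d. It needs four real products per quaternion pair, whereas a full Hamilton product needs sixteen; if the component-wise baseline is built on the Hamilton product, that ratio would be consistent with the 75% reduction quoted in the abstract.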

2605.24919 2026-05-26 cs.CL

MultiHaluDet: Multilingual Hallucination Detection via LLM Hidden State Probing

Riasad Alvi, Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi

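Affiliations United International University; BRAC University

AI summary Proposes MultiHaluDet, which probes the full hidden-state trajectories of frozen LLMs with a hybrid architecture (multi-scale attention and self-attention pooling) followed by a calibrated ensemble of classical classifiers to detect hallucinations across languages; it reaches up to 98.55% AUROC on English benchmarks and generalizes well to high-, medium-, and low-resource languages.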

Comments MeLLM @ ACL 2026

Abstract

Hallucinations in Large Language Models (LLMs) represent a critical barrier to their reliable deployment, a vulnerability heavily exacerbated in non-English and resource-constrained contexts. Existing detection approaches that rely on output confidence heuristics or single-layer internal representations frequently fail to capture deep, complex factual inconsistencies across diverse languages. To address this, we introduce MultiHaluDet, a novel three-stage stacking framework that detects multilingual hallucinations by probing the full hidden state trajectories of frozen LLMs without requiring language-specific fine-tuning. Our method extracts sequential features across multiple layers and processes them via a hybrid architecture using multi-scale attention and self-attention pooling. By generating out-of-fold embeddings that feed into a calibrated classical classifier ensemble, MultiHaluDet captures both fine-grained and coarse-grained patterns of factual inconsistency. Extensive experiments demonstrate that our framework achieves state-of-the-art detection performance, reaching up to 98.55% AUROC on the English HaluEval and TriviaQA benchmarks using Mistral-7B and LLaMA2-7B architectures. Crucially, we rigorously evaluate our framework's cross-lingual generalization across high (French), medium (Bangla), and low-resource (Amharic) languages. MultiHaluDet demonstrates exceptional representational robustness, consistently outperforming baselines and successfully transferring hallucination detection capabilities across typologically diverse linguistic tiers.

2605.24912 2026-05-26 cs.LG cs.AI q-bio.OT

Explainable Retinal Imaging for Prediction of Multi-Organ Dysfunction in Type 2 Diabetes

Mini Han Wang, Liting Huang, Wei Hong, Boonthawan Wingwon

Affiliations Faculty of Computer Science and Artificial Intelligence; Frontier Science Computing Center; Chinese Academy of Sciences; Chinese University of Hong Kong; Zhuhai People's Hospital; Beijing Institute of Technology; Jinan University; Lampang Inter-Tech College

AI summary Builds system-level abnormality indices from routine laboratory biomarkers, predicts multi-system dysregulation in type 2 diabetes with a gradient boosting model, and uses SHAP for interpretability, identifying hyperglycaemia, renal impairment, dyslipidaemia, and inflammation as the dominant drivers.

Comments 15 pages, 8 figures

Abstract

Background: Type 2 diabetes mellitus (T2DM) is increasingly recognised as a systemic disease characterised by coordinated dysfunction across metabolic, renal, lipid, and inflammatory pathways. Existing clinical assessments often fail to capture this multi-dimensional burden. Methods: We conducted a retrospective study of 1,195 patients using routinely collected laboratory biomarkers. System-level abnormality indices were constructed to quantify organ-specific dysfunction, and multi-system involvement was defined as abnormalities in two or more systems. Supervised machine learning models, including logistic regression, random forest, and gradient boosting, were trained to predict multi-system dysregulation. Model interpretability was achieved using SHapley Additive exPlanations (SHAP). Results: The gradient boosting model demonstrated near-perfect discrimination (AUC = 1.000), significantly outperforming logistic regression (AUC = 0.925). Feature attribution analysis revealed that hyperglycaemia, renal impairment, dyslipidaemia, and inflammation were the dominant drivers of multi-system risk. Dose-response relationships observed in partial dependence analyses further supported the biological plausibility of model predictions. Conclusion: This study presents an interpretable, data-driven framework for quantifying systemic disease burden in T2DM. By linking routine biomarkers to multi-organ dysfunction, our approach provides both predictive accuracy and mechanistic insight, offering potential for improved risk stratification and precision medicine in diabetes care. The data and code used in this study are openly available on GitHub at: https://github.com/MiniHanWang/Type-2-Diabetes-1.git
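
The outcome definition in the Methods (abnormality in two or more systems) can be sketched as a small labeling function. The field names and cut-offs below are hypothetical placeholders chosen only to make the example run; they are not the study's definitions and are not clinical guidance.

```python
# Hypothetical cut-offs for illustration only; the study's index definitions may differ.
SYSTEM_RULES = {
    "metabolic":    lambda p: p["hba1c_pct"] >= 7.0,
    "renal":        lambda p: p["egfr"] < 60.0,
    "lipid":        lambda p: p["ldl_mmol_l"] >= 3.4,
    "inflammatory": lambda p: p["crp_mg_l"] >= 3.0,
}

def abnormal_systems(patient: dict) -> list:
    """Names of the systems whose rule flags this patient as abnormal."""
    return [name for name, rule in SYSTEM_RULES.items() if rule(patient)]

def multi_system_label(patient: dict, min_systems: int = 2) -> int:
    """1 if at least `min_systems` systems are flagged abnormal, else 0."""
    return int(len(abnormal_systems(patient)) >= min_systems)

patients = [
    {"hba1c_pct": 8.1, "egfr": 52.0, "ldl_mmol_l": 2.9, "crp_mg_l": 1.2},
    {"hba1c_pct": 6.4, "egfr": 88.0, "ldl_mmol_l": 3.9, "crp_mg_l": 0.8},
]
labels = [multi_system_label(p) for p in patients]
```

The first toy patient is flagged in the metabolic and renal systems and receives label 1; the second is flagged only in the lipid system and receives label 0. A classifier such as gradient boosting would then be trained to predict this binary label.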

2605.24911 2026-05-26 cs.LG cs.AI

Factorize to Generalize: Retrieval-Guided Invariant-Dynamic Decomposition for Time Series Forecasting

Jinjin Chi, Lei Feng, Lulu Zhang, Yongcheng Jing, Yiming Wang, Ximing Li, Jialie Shen, Leszek Rutkowski, Dacheng Tao

Affiliations College of Computer Science and Technology, Jilin University; College of Computing and Data Science, Nanyang Technological University; City St George’s, University of London; Systems Research Institute, Polish Academy of Sciences

AI summary Proposes a retrieval-guided invariant-dynamic decomposition framework that separates stable shared structure from instance-specific variation, improving the robustness of zero-shot time series forecasting under distribution shift.

Abstract

Time series foundation models (TSFMs) have recently achieved strong zero-shot forecasting performance through large-scale pretraining and retrieval-augmented prediction. However, our empirical analysis reveals a non-trivial limitation of retrieval-based forecasting: retrieval tends to induce more oscillatory predictions, improving performance on highly fluctuating series while degrading accuracy on smoother, trend-dominated ones. This suggests that retrieved information may be fused into prediction without explicitly distinguishing stable temporal structure from instance-specific variations, which can reduce robustness under distribution shifts. We propose a Retrieval-guided Invariant-Dynamic DEcomposition framework for time series forecasting. Rather than using retrieval as auxiliary predictive context, we leverage retrieved sequences as implicit samples from related environments to guide representation decomposition. Specifically, we first construct a retrieval-aware representation via attention-based aggregation, and then introduce a retrieval-guided routing mechanism to decompose it into an invariant component capturing stable shared structure and a dynamic component modeling context-dependent variations. These two components are forecast separately and fused for final prediction, enabling the model to preserve transferable patterns while remaining adaptive to evolving dynamics. We further design training objectives that encourage invariant learning and disentanglement, and provide theoretical insight showing that retrieval aggregation reduces variance and approximates invariant representation learning without explicit environment supervision. Extensive experiments demonstrate that our method consistently improves robustness under distribution shifts and outperforms existing TSFMs and retrieval-based baselines in zero-shot forecasting settings.

2605.24910 2026-05-26 cs.AI cs.CE

Noise-Robust Financial Numerical Entity Attribute Tagging

Hsin-Min Lu, Chen-Yang Lai, Yi-Jhen Li, Ju-Chun Yen

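Affiliations National Taiwan University; National Central University

AI summary To address label noise and incomplete attribute coverage in financial numerical entity tagging, proposes NORA, which uses task-aware instance weighting and Neighborhood Prior-adjusted KNN filtering to achieve robust multi-attribute prediction on a benchmark of 6.6 million instances.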

Abstract

Financial Numerical Entity (FNE) understanding aims to recover the meaning of numerical mentions in financial reports. Existing studies primarily focus on concept name prediction and face two important limitations. First, labels derived from inline XBRL may contain errors because filings are usually prepared manually. Second, other important FNE attributes, such as reporting-time relation, measurement scale, and accounting sign, are less emphasized. We propose NOise-Robust Tagging for Rich Financial Numerical Entity Attributes (NORA) to address these gaps. NORA uses task-aware instance-specific weighting to attenuate the influence of noisy labels during training, and we further propose the Neighborhood Prior-adjusted KNN (NPK) filtering method for more reliable evaluation on real-world noisy test sets. In addition, we construct a large-scale benchmark containing 6.6 million instances with multi-attribute labels and filing metadata. Experiments show that NORA performs strongly compared with state-of-the-art noisy-label baselines, including Co-teaching, Mixup, SSR, and SelfMix. Moreover, NORA is robust under both unfiltered and noise-filtered test settings. It achieves the best Accuracy, Macro F1, and Weighted F1 for concept name and time-relation prediction, while remaining competitive on scale and sign prediction. These results demonstrate the value of jointly modeling rich FNE attributes while accounting for label noise in real-world financial filings.

2605.24908 2026-05-26 cs.LG cs.AI

On the Impact of Class Imbalance on the Learning Dynamics of Deep Neural Networks: An Intuitive Insight

Ismail B. Mustapha, Shafaatunnur Hasan, Sunday O. Olatunji, Hatem S. Y. Nabus

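Affiliations Faculty of Computing; Universiti Teknologi Malaysia; Adekunle Ajasin University; Johor, Malaysia; Akungba-Akoko, Nigeria

AI summary By monitoring how deep neural networks learn the majority and minority classes at different imbalance ratios, systematically studies how class imbalance drives the model to underfit the minority class early in training while learning only the majority class, so that the minority-class representations it eventually learns are overfitted rather than generalizable.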

Comments Conference

Abstract

Class imbalance in deep neural networks (DNNs) has received rapidly increasing research attention in recent years. However, the varying accounts in the literature of why DNNs perform poorly on imbalanced data show that little is known about how this age-old phenomenon affects DNN performance. A better understanding of this problem is crucial to developing effective DNN-based methods for imbalanced data. Thus, this study systematically investigates the impact of class imbalance on the learning dynamics of DNNs by monitoring how DNN models learn both the majority and minority classes of datasets with varying imbalance ratios. Experimental findings show that, in contrast to learning from balanced datasets, where a DNN learns the classes similarly, class imbalance has a severely detrimental impact on DNN performance, driving the model to underfit the minority class samples in the early training epochs while learning only the majority class. Although the DNN ultimately learns the minority samples, learning in this manner yields minority representations that do not generalize at test time, because they are merely overfitted to keep the overall training loss as low as possible.

2605.24907 2026-05-26 cs.CL

Overview of the PsyDefDetect Shared Task at BioNLP 2026: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations

Hongbin Na, Zimu Wang, Zhaoming Chen, Yining Hua, Rena Gao, Kailai Yang, Ling Chen, Wei Wang, Shaoxiong Ji, John Torous, Sophia Ananiadou

Affiliations * University of Technology Sydney; Xi’an Jiaotong-Liverpool University; University of Utah; Harvard University; The University of Melbourne; The University of Manchester; ELLIS Institute Finland; University of Turku

AI summary Presents the PsyDefDetect shared task, co-located with BioNLP@ACL 2026. Grounded in the clinically validated DMRS framework, the task asks systems to classify help-seeker utterances into nine categories; the best system reaches a macro F1-score of 0.420, leaving clear room for improvement.

Abstract

We present an overview of PsyDefDetect, the shared task on detecting levels of psychological defense mechanisms in emotional support dialogues, co-located with BioNLP@ACL 2026. Grounded in the clinically validated Defense Mechanism Rating Scales (DMRS) framework, the task asks systems to classify a target seeker utterance, given its preceding dialogue context, into one of nine categories: seven hierarchical DMRS levels plus two auxiliary labels. Participants worked on PsyDefConv, a newly released corpus of 200 dialogues and 2336 help-seeker utterances annotated under DMRS with substantial inter-annotator agreement. The task attracted 172 participants on CodaBench who produced 563 submissions, with 21 teams officially registering their results for the final ranking. The best system achieved a macro F1-score of 0.420, surpassing the strongest fine-tuned baseline reported in the dataset paper by a notable margin, yet leaving clear headroom. Our analysis highlights (i) a persistent tendency to over-predict the majority High-Adaptive class, (ii) a widening gap between accuracy and macro-F1 that reveals class-imbalance sensitivity, and (iii) the value of theory-aware and LLM-based approaches for fine-grained defensive-function classification. We release all task materials and invite the community to continue work on this novel intersection of clinical psychology and NLP.
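Finding (ii) above, the gap between accuracy and macro-F1 under class imbalance, can be illustrated with a small sketch. The labels below are hypothetical toy data, not drawn from PsyDefConv, and the two metric functions are plain reference implementations rather than the task's official scorer.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 8 majority-class and 2 minority-class utterances.
y_true = ["high"] * 8 + ["low"] * 2
# A classifier that always predicts the majority class.
y_pred = ["high"] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}  macro-F1={mf1:.3f}")
```

Because macro-F1 weights every class equally, a system that over-predicts the majority class can keep a high accuracy while its macro-F1 drops sharply.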

2605.24904 2026-05-26 cs.CL

Quantifying the Impact of Translation Errors on Multilingual LLM Evaluation

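Klaudia-Doris Thellmann, Bernhard Stadler, Michael Färber, Jens Lehmann

Affiliations * TUD Dresden University of Technology and ScaDS.AI; InfAI e.V.; Amazon

AI summary Studies how translation errors in machine-translated benchmarks affect the reliability of multilingual LLM evaluation, using automatic error-span detection and an analysis of accuracy drops to relate translation errors to evaluation metrics.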

Abstract

Machine-translated benchmarks are widely used to assess the multilingual capabilities of large language models (LLMs), yet translation errors in these benchmarks remain underexplored, raising concerns about the reliability and comparability of multilingual evaluation. We address two practical gaps: (i) how well automatic MQM-style error spans from LLM judges and a span-aware QE baseline (xCOMET-XXL) match expert human span annotations on benchmark translations, and (ii) how strongly translation errors (as opposed to source-side issues in the English original) explain accuracy drops on translated benchmarks. We find that span agreement is non-trivial on naturally occurring benchmark translations, and that target-side translation errors are consistently associated with measurable, percentage-point drops in translated accuracy even after controlling for English correctness and source-side anomalies.

2605.24902 2026-05-26 cs.CL cs.AI cs.LG

When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation

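Faizan Faisal

Affiliations * University of California, Davis

AI summary Using a source-aware benchmark, evaluates reasoning-enabled LLMs on clinical SOAP note generation and finds that enabling reasoning lowers the quality of GPT-5.4 output, while same-source RAG yields small, model-dependent gains.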

Abstract

Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured clinical documentation; we investigate this question using SOAP note generation from clinical dialogue in a source-aware benchmark spanning OMI Health, ACI-Bench, and PriMock57. We evaluate GPT-5.4, DeepSeek-V4-Flash, and Gemma-4-E4B in a controlled 2x2 design that independently toggles provider-native reasoning and same-source retrieval-augmented generation (RAG). Outputs are assessed using seven automatic metrics alongside two reference-aware LLM judges. Both evaluation approaches agree that a non-reasoning GPT-5.4 configuration achieves the highest overall quality, while DeepSeek-V4-Flash performs best among reasoning-enabled configurations. Enabling reasoning significantly degrades GPT-5.4 performance across all three datasets, whereas same-source RAG yields smaller, model-dependent improvements. Overall, the findings indicate that stronger reasoning capability should not be assumed to improve fidelity-sensitive SOAP note generation without dedicated, task-specific evaluation.

2605.24900 2026-05-26 cs.AI

ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents

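Lei Ding, Bin He, Chenguang Wang, Yang Liu

Affiliations * University of California, Santa Cruz; Zillow Group

AI summary Proposes ProActor, a framework that uses timing-aware reinforcement learning (combining RULER-based rewards with stage-aware composite rewards) and the efficient training system ART-F to substantially improve the timing quality of proactive task scheduling while maintaining action consistency.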

Comments 47 pages, 31 figures. Accepted to ACL 2026

Abstract

Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at appropriate moments - fundamentally shifting from reactive systems that await explicit instructions. However, existing approaches lack generalizable end-to-end solutions for measuring and optimizing such anticipatory behaviors. This paper introduces ProActor, a unified framework for conversational task scheduling that integrates: (1) a domain-agnostic automated annotation methodology that enables scalable proactiveness reinforcement learning (RL) by generating full opportunity time windows instead of rigid point labels, (2) systematic proactiveness metrics capturing both timing quality and reference action alignment, and (3) RL optimization using GRPO with various reward designs. Our insight is that RULER-based rewards with proactiveness rubrics are crucial for improving timing quality, and that proactiveness optimization enabled by stage-aware composite rewards is key to balancing timing quality and reference action alignment. Timing-aware RL requires extensive exploration, demanding efficient infrastructure. We develop ART-F, an adaptive framework combining request-adaptive inference clusters with DDP-based training on single-node multi-GPU systems, enabling LoRA training of 4-bit Qwen2.5-14B-ProActor-Q4 with 4-8x speedups. Experiments on two newly auto-annotated datasets demonstrate significant improvements in proactive timing while maintaining action consistency comparable to state-of-the-art (SOTA) baselines. Ablations validate the effectiveness of distinct composite reward variations.

2605.24899 2026-05-26 cs.AI

TaBIIC2: Interactive Building of Ontological Taxonomies using Weighted Self-Organizing Maps

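Mathieu d'Aquin

Affiliations * LORIA, CNRS, Université de Lorraine

AI summary Presents a tool that uses weighted self-organizing maps as a clustering method to let users progressively and interactively build a taxonomy of concepts from tabular data and define the intension of each concept, offering a middle ground between purely manual analysis and automatic methods.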

Abstract

Ontologies represent the conceptual knowledge of a domain. At the core of an ontology is the taxonomy of concepts and subconcepts that represent specific entities, which can be complex to build. In many cases, information is available in the form of records describing the characteristics of relevant entities, i.e., tabular data. Identifying patterns and similarities in such data can serve as a basis for identifying concepts and organizing them. However, doing so manually can be challenging, and purely automatic approaches, such as agglomerative clustering or relying on a large language model to analyze the data, can leave the user with overwhelming results and little control. In this paper, we describe a tool that enables the progressive and interactive construction of a taxonomy of concepts by identifying clusters as well as their intensional definitions. To do so, we rely on weighted self-organizing maps as a clustering method because they enable the creation of an arbitrary number of clusters that are distinct with respect to the distributions of values of specific characteristics of the clustered entities. We show that, by integrating this mechanism and others for rapidly creating concepts that group together instances from tabular data, this tool represents a middle ground between purely manual analysis and automatic methods for building ontological taxonomies.

2605.24894 2026-05-26 cs.CV

BFS: Back-to-Front Layered Image Synthesis via Knowledge Transfer

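Kyoungkook Kang, Gyujin Sim, Sunghyun Cho

Affiliations * SAMSUNG; POSTECH

AI summary Proposes BFS, a framework that uses a dual-branch diffusion model and two-stage training to transfer knowledge from unlayered image synthesis, producing high-quality foreground layers that harmonize with the background.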

Comments SIGGRAPH 2026

Abstract

As generative models expand the possibilities of visual content creation, layered image synthesis has emerged as a promising direction for controllable and creative editing. However, existing methods struggle to fully realize this potential. Decomposition-based methods often struggle with clean separation, while generation-based methods suffer from difficulty in training data acquisition, reducing quality and scene diversity. In this paper, we propose BFS, a novel generation-based framework for layered image synthesis. Specifically, given a background image and user guidance, BFS synthesizes a foreground layer that incorporates not only a foreground object but also its associated visual effects, such as shadows and reflections, while seamlessly harmonizing with the background to produce a coherent composite. To enable diverse and high-quality foreground layer synthesis while overcoming data scarcity, we leverage the comparatively easy-to-learn knowledge of unlayered image synthesis for the foreground synthesis. To this end, we adopt a dual-branch diffusion framework in which two interconnected branches generate a composite image and a foreground layer, respectively, enabling bidirectional knowledge transfer. Based on this framework, we propose a two-stage training scheme that utilizes a high-quality unlayered composite image dataset to effectively enhance foreground quality. Extensive experiments, including a user study, show that BFS produces high-quality layered images, consistently outperforming prior methods.

2605.24893 2026-05-26 cs.CV

BED-SAM2: Boundary-Enhanced-Depth SAM2 via Monocular Geometric Priors

Tyler Rust, Dara McNally, Kyle O'Donnell, Colin Kelly, Chandra Kambhamettu

Affiliations * University of Delaware; University of South Florida; DEVCOM Army Research Laboratory

AI summary Modifies the SAM2 encoder to directly encode monocular depth information, yielding BED-SAM2, which achieves competitive performance on salient and camouflaged object detection with as few as five training epochs.

Comments 9 pages, 5 figures, 5 tables. Presented as a poster at the CVPR 2026 Workshop on Computer Vision in the Wild (CVinW). Code available at https://github.com/TylerRust-1/BED-SAM2

Abstract

Building upon the SAM2 vision foundation model for downstream segmentation, this study introduces Boundary Enhanced Depth (BED)-SAM2. The SAM2 Hiera encoder architecture is modified to directly encode monocular depth information from RGB images, thereby providing geometric cues that enhance object boundary delineation and facilitate the extraction of camouflaged object shapes. BED-SAM2 demonstrates competitive state-of-the-art performance across multiple salient and camouflaged object detection tasks with as few as five training epochs.

2605.24885 2026-05-26 cs.CL

DTO: a Differentiable Training Objective for Effective Counterfactual Story Rewriting

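Amelia Girard, Massimo Piccardi

Affiliations * University of Technology Sydney

AI summary Proposes a differentiable training objective (DTO) that, via end-to-end backpropagation, jointly optimizes fidelity to the reference rewrite and semantic consistency with the source narrative, addressing the tendency of models to overlook small modifications in counterfactual story rewriting.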

Comments 11 pages, 2 figures

Abstract

Counterfactual story rewriting is a natural language processing task that requires updating an existing story to reflect a chosen alternative event while preserving all the unaffected storyline elements and overall coherence. While large language models have recently made remarkable progress on this task, it remains challenging because the required modifications are typically very small and highly localized. As a consequence, models trained in a conventional manner with the maximum-likelihood training objective tend to overlook these nuances. At the same time, more sophisticated training approaches based on reinforcement learning are notoriously slow and difficult to set up. For these reasons, our paper proposes a novel, differentiable training objective (DTO) that directly optimizes for the requisite counterfactual improvements. In our approach, a transformer model is fine-tuned via end-to-end backpropagation against a fully differentiable loss function that jointly rewards (i) fidelity to the reference rewrite and (ii) semantic consistency with the source narrative. The empirical evaluation on the TimeTravel and ART datasets shows that the proposed DTO approach surpasses a maximum-likelihood baseline and a preference-based approach, and performs competitively against two contemporary large language models on all evaluation metrics. These findings substantiate the effectiveness of task-specific differentiable objectives for nuanced, controlled text-generation tasks.

2605.24883 2026-05-26 cs.AI cs.CR cs.SE

Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications

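Xiaoyue Lu, Xianglin Yang, Haijun Liu, Jiahao Liu, Kuntai Cai, Yan Xiao, Jin Song Dong

Affiliations * Shenzhen Campus of Sun Yat-sen University; National University of Singapore; Independent Researcher

AI summary Proposes POLARIS, a framework that compiles unstructured natural-language policies into first-order logic representations and builds a Semantic Policy Graph to enable coverage-driven, reproducible safety testing, achieving higher policy coverage and attack success counts than baselines.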

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Abstract

The widespread integration of Large Language Models (LLMs) necessitates rigorous and systematic safety evaluation. Existing paradigms either rely on constructed benchmarks to assess safety from predefined perspectives, or employ dynamic red-teaming to probe potential vulnerabilities. While effective, these approaches face challenges, as they depend heavily on expert domain knowledge, offer limited systematic guarantees, and are vulnerable to rapid obsolescence. To address these limitations, we introduce a novel framework POLARIS that brings the rigor of specification-based software testing to AI safety. POLARIS first compiles unstructured natural-language policies into First-Order Logic (FOL) representations, establishing a traceable link between high-level rules and concrete test cases. This formalization enables the construction of a Semantic Policy Graph, where complex policy violation scenarios are encoded as traversable paths. By systematically exploring this graph, POLARIS uncovers compositional violation patterns, which are then instantiated into executable natural-language test queries, enabling coverage-driven and reproducible safety testing. Experiments demonstrate that POLARIS achieves higher policy coverage and attack success counts compared to established baselines. Crucially, by bridging formal methods and AI safety, POLARIS provides a principled, automated approach to ensuring LLMs adhere to safety-critical policies with verifiable traceability. We release our code at https://github.com/huac-lxy/POLARIS.
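The idea of encoding violation scenarios as traversable paths can be sketched as a path enumeration over a small directed graph. This is only an illustration: the graph, its node names, and the `VIOLATION:` prefix are hypothetical, and the sketch does not reproduce how POLARIS derives its Semantic Policy Graph from first-order logic.

```python
def enumerate_paths(graph, start, is_goal, max_len=6):
    """Return all simple paths from start to any node for which is_goal is true."""
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if is_goal(node):
            paths.append(path)
            continue
        if len(path) >= max_len:
            continue  # bound the search depth
        for nxt in graph.get(node, []):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                stack.append((nxt, path + [nxt]))
    return sorted(paths)


# Hypothetical policy graph: edges link conversational conditions, and
# nodes prefixed with "VIOLATION:" mark policy violations.
graph = {
    "user_request": ["asks_medical_advice", "shares_personal_data"],
    "asks_medical_advice": ["requests_dosage"],
    "shares_personal_data": ["requests_dosage", "VIOLATION:privacy_leak"],
    "requests_dosage": ["VIOLATION:unsafe_medical_advice"],
}

paths = enumerate_paths(graph, "user_request", lambda n: n.startswith("VIOLATION:"))
covered = {p[-1] for p in paths}  # distinct violations reached by some path
for p in paths:
    print(" -> ".join(p))
```

Each enumerated path is one candidate scenario that could be turned into a test query, and `covered` gives a simple coverage measure over the violation nodes.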

2605.24879 2026-05-26 cs.LG math.OC

Efficient DP-SGD for LLMs with Randomized Clipping

Enayat Ullah, Sai Aparna Aketi, Devansh Gupta, Huanyu Zhang, Meisam Razaviyayn

Affiliations * Meta Platforms Inc; University of Southern California

AI summary Proposes DP-SGD-RC, which uses stochastic trace estimation (Hutchinson and Hutch++) to cut the memory overhead of per-sample gradient norm estimation, reducing memory and compute complexity while preserving privacy guarantees.

Comments Accepted at ICML 2026

Abstract

Large language models (LLMs) are trained on vast datasets that may contain sensitive information. Differential privacy (DP), the de facto standard for formal privacy guarantees, provides a principled framework for training LLMs with provable privacy protection. However, state-of-the-art DP training implementations rely on fast gradient clipping techniques with memory overhead $O(B \min\{T^2, d^2\})$, where $B$ is the batch size, $T$ is the sequence length, and $d$ is the model width. This becomes prohibitive as both model size and context length grow. We propose DP-SGD-RC, a novel variant of DP-SGD with randomized clipping that reduces memory and compute complexity. DP-SGD-RC leverages stochastic trace estimation methods, specifically Hutchinson's estimator [Hutchinson, 1989] and its improved variant, Hutch++ [Meyer et al., 2021], to reduce the memory footprint of per-sample gradient norm estimation. We provide a tight privacy analysis showing that DP-SGD-RC achieves noise multipliers competitive with deterministic clipping. Experiments fine-tuning Llama 3.2-1B on long-context benchmarks spanning classification, question answering, and summarization tasks demonstrate that DP-SGD-RC matches baseline utility while significantly reducing memory and compute requirements.
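As background for the stochastic trace estimation mentioned above, the following sketch shows the basic Hutchinson estimator applied to a squared Frobenius norm, using the identity $\|G\|_F^2 = \mathrm{tr}(G^\top G) = \mathbb{E}_z \|Gz\|^2$ for Rademacher probes $z$. The matrix `G` is random toy data and the function names are illustrative; how DP-SGD-RC applies the estimator to per-sample gradients, and the Hutch++ refinement, are not reproduced here.

```python
import random


def exact_sq_frobenius(G):
    """Exact squared Frobenius norm: the sum of squared entries of G."""
    return sum(g * g for row in G for g in row)


def hutchinson_sq_frobenius(G, num_probes, rng):
    """Estimate ||G||_F^2 = tr(G^T G) as the mean of ||G z||^2 over Rademacher probes z."""
    n_cols = len(G[0])
    total = 0.0
    for _ in range(num_probes):
        # Rademacher probe: each entry is +1 or -1 with equal probability.
        z = [rng.choice((-1.0, 1.0)) for _ in range(n_cols)]
        Gz = [sum(g * zj for g, zj in zip(row, z)) for row in G]
        total += sum(v * v for v in Gz)
    return total / num_probes


rng = random.Random(0)
# Toy 4x6 matrix standing in for a gradient; purely illustrative.
G = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]
estimate = hutchinson_sq_frobenius(G, num_probes=2000, rng=rng)
exact = exact_sq_frobenius(G)
relative_error = abs(estimate - exact) / exact
print(f"exact={exact:.3f}  estimate={estimate:.3f}  rel.err={relative_error:.3%}")
```

The estimator only needs matrix-vector products with `G`, which is the property that lets a norm be estimated without materializing larger intermediate quantities; the price is sampling noise that shrinks as the number of probes grows.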

2605.24873 2026-05-26 cs.CL cs.AI cs.LG

Towards a Universal Causal Reasoner

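Qirun Dai, Xiao Liu, Jiawei Zhang, Dylan Zhang, Hao Peng, Chenhao Tan

Affiliations * The University of Chicago; University of Illinois Urbana-Champaign

AI summary Proposes UniCo, a data generation framework that covers 18 query types across Pearl's Causal Ladder and translates symbolic examples into code and natural language; supervised fine-tuning on its data substantially improves the causal reasoning and reasoning faithfulness of LLMs.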

Abstract

Despite the importance of causal reasoning, training LLMs to reason causally remains underexplored. Existing data efforts mostly focus on benchmarking LLMs on specific aspects of causality, making them less suitable for training generalizable causal reasoners. To address this, we propose UniCo, a data generation framework that both (1) addresses 18 causal query types across Pearl's Causal Ladder and (2) translates natively symbolic examples into code and natural language forms to simulate real-world use cases where causal terms are not explicitly specified. To ensure data quality, UniCo grounds answers with exact causal inference and filters cases with reasoning shortcuts. Upon supervised finetuning with 66.6K UniCo-generated instances, Qwen3-4B, Qwen3-8B and Olmo-3-7B-Instruct achieve an average improvement of 22.9% across all 18 in-distribution query types, and 8.1% over state-of-the-art causal data generation frameworks on 7 established causal benchmarks outside the training distribution. More importantly, in real-world medical understanding, legal decision, and tabular reasoning, UniCo-trained models consistently display more faithful reasoning traces, outperforming the base models by an average of 20.2% in faithfulness metrics. These results suggest that causality-centered training not only strengthens causal reasoning, but also equips LLMs with a causal mindset in general reasoning tasks.

2605.24872 2026-05-26 cs.LG

Cluster Frequency Conformal Prediction for Local Coverage

Tomer Lavi, Bracha Shapira, Nadav Rappoport

Affiliations * Institute of Applied AI Research (AAIR); Faculty of Computer and Information Science; Ben-Gurion University of the Negev; Institute for Interdisciplinary Computational Science (ICS)

AI summary Proposes Cluster Frequency Conformal Prediction (CFCP), a framework that clusters embeddings and estimates local label-frequency distributions, regularized with a global prior and reliability-aware shrinkage, to improve classwise coverage on top of standard conformal prediction; validated on image and text benchmarks.

Abstract

Conformal prediction provides distribution-free coverage guarantees, but in many-class classification it may still under-cover specific classes or subpopulations, preventing safe deployment in high-stakes applications. We propose Cluster Frequency Conformal Prediction (CFCP), a plug-in framework that adapts conformal prediction to local structure in a learned representation space. CFCP clusters learned embeddings, estimates cluster-level label-frequency distributions from calibration data, and for each test point constructs a sample-specific probability vector by softly mixing nearby cluster distributions regularized with global-prior and reliability-aware shrinkage. This vector is then conformalized using standard set constructors. In the disjoint-split regime, CFCP inherits standard finite-sample marginal validity. Under additional assumptions, CFCP further admits a local-validity interpretation. Since representation clusters aggregate locally similar samples, their empirical class frequencies provide a stable estimate of local label ambiguity. Across image and text benchmarks, CFCP achieves the best class coverage in 15/16 dataset/score-family comparisons and a competitive prediction set size efficiency, with several settings substantially more efficient. Overall, our results show that cluster-frequency information provides an effective localized signal for improving classwise reliability in many-class conformal prediction.
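The final step described above, conformalizing a probability vector with a standard set constructor, can be sketched with plain split conformal prediction. The sketch assumes the nonconformity score $1 - p(y)$ and uses made-up calibration probabilities; the cluster-frequency mixing and shrinkage that produce CFCP's probability vector are not shown.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score 1 - p(true class)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: every class is included
    return scores[k - 1]


def prediction_set(probs, qhat):
    """All classes whose score 1 - p(class) does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]


# Hypothetical calibration data: probability assigned to the true class (class 0)
# for nine calibration points, with the remainder split over two other classes.
true_class_probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
cal_probs = [[p, (1.0 - p) / 2, (1.0 - p) / 2] for p in true_class_probs]
cal_labels = [0] * len(cal_probs)

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
pred = prediction_set([0.6, 0.3, 0.1], qhat)
print(f"qhat={qhat:.2f}  prediction set={pred}")
```

Under exchangeability of calibration and test points, sets built this way contain the true label with probability at least $1 - \alpha$ marginally; this guarantee does not by itself ensure coverage for each class, which is the gap the abstract targets.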

2605.24870 2026-05-26 cs.CV

Trajectory-Consistent Calibration for Cache-Accelerated Diffusion Models

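Mingyu Liang, Dingkun Xu, Jingwei Xu

Affiliations * Laboratory for Novel Software Technology, Nanjing University, China

AI summary To address the loss of generation quality caused by representation deviations in cache-accelerated diffusion models, proposes Trajectory-Consistent Calibration, a training-free method that calibrates cached representations through an offline iterative procedure and consistently improves FID on PixArt-alpha and DiT-XL/2.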

Comments 23 pages, 8 figures, 8 tables. Code is available at https://github.com/NJUDeepEngine/TCC

Abstract

Diffusion Transformers require repeated denoiser evaluations during iterative sampling, making inference computationally expensive. Cache-based acceleration reduces this cost by reusing intermediate representations across denoising steps, but can introduce representation deviations and degrade generation quality. In this paper, we analyze these deviations and show that effective calibration should consider both the direct mismatch caused by reuse and the subsequent trajectory shift induced by earlier corrections. To address this challenge, we propose Trajectory-Consistent Calibration (TCC), a training-free method that calibrates cached representations toward their full-computation counterparts. Specifically, rather than estimating all calibration priors from a single uncorrected cache trajectory, TCC uses an offline iterative procedure so that each prior accounts for the trajectory shift induced by preceding calibrations. Experiments on PixArt-alpha and DiT-XL/2 show that TCC consistently improves FID across representative cache-based acceleration methods while preserving their underlying reuse policies. Notably, in a representative PixArt-alpha cache-acceleration setting based on FORA, TCC reduces FID from 29.83 to 27.35, slightly surpassing the full-computation baseline.

2605.24869 2026-05-26 cs.CL

Lngram: N-gram Conditional Memory in Latent Space

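Yunao Zheng, Guoyang Xia, Xiaojie Wang, Lei Ren

Affiliations * Beijing University of Posts and Telecommunications (BUPT); Li Auto Inc.

AI summary Proposes Lngram, a conditional memory module that learns discrete symbols in latent space and performs N-gram lookup over them, decoupling retrieval from the backbone and improving long-context language modeling and cross-modal tasks.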

Abstract

Sequence modeling requires both compositional reasoning and local static knowledge retrieval, yet standard Transformers handle both through dense computation. Engram partially decouples retrieval from the backbone, but its token-based keys remain tied to text tokenization and hash compression. We propose Lngram, a latent-space conditional memory module that learns discrete symbols directly from hidden states and performs N-gram lookup over these symbols. This design removes the dependence on tokenizer IDs and naturally extends to non-text modalities. In our evaluated settings, Lngram outperforms Transformer and Engram baselines, consistently reduces perplexity in long-context language modeling, and effectively injects domain knowledge when added post hoc to pretrained models. Joint training with the backbone further surpasses full fine-tuning, while experiments on vision-language and vision-language-action tasks show overall gains. Analyses with LogitLens and CKA suggest that Lngram enables prediction-relevant information to emerge earlier, increasing effective depth with limited inference and memory overhead. Code is available at https://github.com/zyaaa-ux/Lngram.

2605.24868 2026-05-26 cs.LG nlin.CD physics.comp-ph

A comparative study of accuracy and rollout stability of temporal surrogate models

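Rajarshi Biswas

Affiliations * Cargill Inc.

AI summary Compares several deep neural network architectures for temporal surrogate modeling of chaotic dynamical systems in terms of long-horizon prediction stability, finding that models with integrator-like updates show lower bias and perturbation amplification, which yields stable long-horizon rollouts and more accurate predictions.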

Comments 24 pages, 18 figures, submitted to journal

Abstract

Temporal surrogate models are effective for predicting chaotic dynamical systems where computational cost can be prohibitive. Several deep neural network architectures can be used for such purposes. In this work, a few commonly used architectures are compared using a common training protocol. The objective is to fairly assess the impact of model architecture on long-horizon prediction stability. Experiments are carried out for three problems: the double pendulum, the Kuramoto-Sivashinsky equations, and the Kolmogorov flow. The experiments are carried out with matching model capacity. Analysis is also carried out for a scenario where each model is individually optimized. It is observed that in both scenarios, the models exhibit categorical differences in long-horizon rollouts. For a concrete quantification, stepwise error injections and perturbation amplifications are analyzed using metrics such as the local Jacobian, relative one-step bias, and finite-time Lyapunov growth. Additionally, an attractor analysis is conducted to assess how well the learned models replicate the underlying system geometry. An ablation study to isolate the impact of each component of a continuous-update architecture is also carried out. It is concluded that models with integrator-like updates show lower bias and perturbation amplification, yielding stable long-horizon rollouts and more accurate predictions.
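To make the perturbation-amplification metrics concrete, the sketch below computes a finite-time Lyapunov growth rate as the average log of the local Jacobian magnitude along a rollout. It uses the one-dimensional logistic map as a stand-in system, which is an assumption for illustration; it is not one of the three benchmark problems, and the exact metric definitions used in the paper may differ.

```python
import math


def logistic_step(x, r):
    """One step of the logistic map x -> r * x * (1 - x)."""
    return r * x * (1.0 - x)


def finite_time_lyapunov(x0, r, n_steps):
    """Average log of the local stretching factor |f'(x)| along an n_steps rollout."""
    x, total = x0, 0.0
    for _ in range(n_steps):
        local_jacobian = r * (1.0 - 2.0 * x)  # derivative of r * x * (1 - x)
        total += math.log(max(abs(local_jacobian), 1e-300))  # guard against log(0)
        x = logistic_step(x, r)
    return total / n_steps


# r = 2.5 settles on a stable fixed point; r = 4.0 is a chaotic regime.
stable = finite_time_lyapunov(0.2, r=2.5, n_steps=2000)
chaotic = finite_time_lyapunov(0.2, r=4.0, n_steps=2000)
print(f"stable regime: {stable:.3f}   chaotic regime: {chaotic:.3f}")
```

A negative rate means small perturbations shrink along the rollout, while a positive rate means they grow exponentially on average; the same quantity can be computed for a learned surrogate by replacing the analytic derivative with the Jacobian of its one-step update.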

2605.24867 2026-05-26 cs.AI cs.CL cs.NI

Clustering as Reasoning: A $k$-Means Interpretation of Chain-of-Thought Graph Learning

聚类即推理:思维链图学习的 $k$-均值解释

Xuanting Xie, Zhaochen Guo, Bingheng Li, Xingtong Yu, Zhifei Liao, Zhao Kang, Yuan Fang

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Singapore Management University(新加坡管理大学) Michigan State University(密歇根州立大学) The Chinese University of Hong Kong(香港中文大学)

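AI summary: Proposes the KCoT framework, which establishes a mathematical correspondence between a Transformer block and the $k$-means algorithm to unify chain-of-thought reasoning with graph representation learning, enabling iterative semantic-topological interaction and outperforming existing methods on standard benchmarks.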

Comments Accepted by ICML 2026

详情
AI中文摘要

思维链(CoT)提示在增强大型语言模型(LLMs)对文本属性图(TAGs)的推理能力方面显示出潜力。本文通过聚类即推理的原则重新审视基于CoT的图学习,提供了关于迭代推理如何在图结构数据上运行的$k$-均值解释。我们观察到现有的图CoT方法依赖于分离的架构和固定的图表示,限制了逐步的语义-拓扑交互和可解释性。为克服这一限制,我们提出了一个名为KCoT的统一框架,将CoT推理与图表示学习相结合。我们的关键理论结果揭示了Transformer块与$k$-均值算法之间的形式数学对应,使得推理可以被解释为迭代的分配和更新步骤。基于这一见解,我们引入了一个语义判别提示,明确将这些步骤形式化为结构化的CoT推理,并采用结构对齐策略将拓扑先验与演化的思维条件表示融合。在标准基准上的实验表明,与最先进的方法相比,该方法持续改进,验证了聚类作为基于CoT的图学习的原则性机制。

英文摘要

Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) on text-attributed graphs (TAGs). This work reframes CoT-based graph learning through the principle of clustering as reasoning, offering a $k$-means interpretation of how iterative reasoning operates over graph-structured data. We observe that existing graph CoT methods rely on disjoint architectures and fixed graph representations, limiting step-by-step semantic-topological interaction and interpretability. To overcome this limitation, we propose a unified framework named KCoT that integrates CoT reasoning with graph representation learning. Our key theoretical result reveals a formal mathematical correspondence between a Transformer block and the $k$-means algorithm, allowing reasoning to be interpreted as iterative assignment and update steps. Based on this insight, we introduce a Semantic Discriminating Prompt that explicitly formulates these steps as structured CoT reasoning, together with a structure-grounded alignment strategy to fuse topological priors with evolving thought-conditioned representations. Experiments on standard benchmarks demonstrate consistent improvements over state-of-the-art methods, validating clustering as a principled mechanism for CoT-based graph learning.
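
The abstract interprets reasoning as iterative assignment and update steps of $k$-means. For readers unfamiliar with those two steps, the sketch below shows plain Lloyd-style $k$-means in pure Python. It is not KCoT and does not reproduce the Transformer correspondence; the function names `assign`, `update`, and `kmeans` and the choice to leave an empty cluster's centroid unchanged are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def assign(points: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    """Assignment step: label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def update(
    points: Sequence[Point], labels: Sequence[int], centroids: Sequence[Point]
) -> List[List[float]]:
    """Update step: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append([float(v) for v in old])
            continue
        dim = len(old)
        new_centroids.append(
            [sum(p[d] for p in members) / len(members) for d in range(dim)]
        )
    return new_centroids


def kmeans(
    points: Sequence[Point], init_centroids: Sequence[Point], max_iters: int = 100
) -> Tuple[List[int], List[List[float]]]:
    """Alternate assignment and update until the labels stop changing."""
    centroids = [[float(v) for v in c] for c in init_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iters):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids
```

Calling `kmeans(points, init_centroids)` with two well-separated groups of points and one initial centroid near each group converges after a single update, since the second assignment step reproduces the first.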