arXivDaily: daily arXiv digest, updated Monday to Friday
2510.08945 2026-05-25 cs.AI

FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

Samuel Hildebrand, Curtis Taylor, Sean Oesch, James M Ghawaly, Amir Sadovnik, Ryan Shivers, Brandon Schreiber, Kevin Kurian

Affiliations: Louisiana State University; Oak Ridge National Lab; University of Florida

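AI summary: This paper proposes FATHOMS-RAG, a framework for assessing the reasoning and observation abilities of multimodal systems that use retrieval-augmented generation (RAG). It contributes a small human-created dataset, several evaluation metrics, and a comparison of open-source and closed-source pipelines, testing how RAG systems handle text, tables, and images. In the experiments, closed-source pipelines significantly outperform open-source ones on both correctness and hallucination metrics, with the largest gaps on questions that depend on multimodal and cross-document information.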

Comments Accepted at SAFE-ML 2026 Workshop at the International Conference on Software Testing (ICST) 2026 Code: https://github.com/Sam-Hildebrand/FATHOMS-RAG

Abstract

Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, evaluating a pipeline's ability to ingest, retrieve, and reason about several modalities of information, differentiating it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline's ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines in both correctness and hallucination metrics, with wider performance gaps in questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating "strongly agree").
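
The abstract names a phrase-level recall metric but does not define it. As a minimal sketch, assuming the metric is the fraction of human-specified key phrases that appear in a pipeline's answer, it could look like the following. The function names `normalize` and `phrase_recall` and the matching rule are illustrative assumptions, not the authors' implementation.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and keep only alphanumeric tokens, single-spaced."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def phrase_recall(answer: str, expected_phrases: list[str]) -> float:
    """Fraction of expected phrases found in the answer (hypothetical definition)."""
    if not expected_phrases:
        raise ValueError("expected_phrases must not be empty")
    # Pad with spaces so a phrase only matches on whole-token boundaries.
    padded = f" {normalize(answer)} "
    hits = sum(1 for p in expected_phrases if f" {normalize(p)} " in padded)
    return hits / len(expected_phrases)


if __name__ == "__main__":
    answer = "The reactor reached 450 degrees in Building 7."
    print(phrase_recall(answer, ["450 degrees", "Building 7", "1998"]))
```

Matching on whole phrases rather than single tokens keeps an answer from scoring well just by echoing common words from the question.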

2510.03508 2026-05-25 cs.LG

D2 Actor Critic: Diffusion Actor Meets Distributional Critic

Lunjun Zhang, Shuo Han, Hanrui Lyu, Bradly C Stadie

Affiliations: Department of Computer Science, University of Toronto; Department of Statistics, Northwestern University

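AI summary: This paper proposes D2AC, a new model-free reinforcement learning algorithm for efficiently training expressive diffusion policies online. Its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time, paired with a robust distributional critic that combines distributional RL with clipped double Q-learning. The algorithm performs strongly on a range of challenging benchmark tasks and is also evaluated for behavioral robustness and generalization on a biologically motivated predator-prey task.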

Comments Accepted to TMLR 2025

Abstract

We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies online effectively. At its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design through a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL scenarios. Beyond standard benchmarks, we also evaluate a biologically motivated predator-prey task to examine the behavioral robustness and generalization capacity of our approach. Code: https://github.com/d2ac-actor-critic/d2ac-public

2510.00948 2026-05-25 cs.CV

InfVSR: Toward Consistency-Driven Streaming Generative Video Super-Resolution

Ziqing Zhang, Kai Liu, Zheng Chen, Xi Li, Yucong Chen, Bingnan Duan, Linghe Kong, Yulun Zhang

Affiliations: Shanghai Jiao Tong University

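AI summary: This paper proposes InfVSR, a generative video super-resolution method that targets the low efficiency and poor temporal consistency of existing approaches on long sequences. It recasts VSR as an autoregressive one-step diffusion framework, adapting a pretrained DiT into a causal structure with a rolling KV cache to enable streaming inference while maintaining local and global consistency. It also introduces patch-wise pixel supervision and cross-chunk distribution matching to distill the diffusion process into a single step, and builds an evaluation benchmark for long-form video.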

Comments Code and model are available at https://github.com/Kai-Liu001/InfVSR

Abstract

Real-world videos often extend over thousands of frames. Existing generative video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor consistency, because temporal decomposition causes artifacts and discontinuities. To break these limits, we propose InfVSR, which reformulates VSR as an autoregressive one-step diffusion paradigm and enables streaming inference with video diffusion priors. First, we adapt the pretrained DiT into a causal structure, maintaining both local and global coherence via rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Our code and models are available at https://github.com/Kai-Liu001/InfVSR.

2510.00915 2026-05-25 cs.LG cs.AI

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama

Affiliations: RIKEN AIP; The University of Tokyo; The University of Melbourne; The University of Sydney

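AI summary: This paper studies how to improve reinforcement learning with verifiable rewards (RLVR) when the verifier is unreliable. Modeling verifier unreliability as a stochastic reward channel with asymmetric noise rates, the authors propose two lightweight corrections: a backward correction that yields an unbiased surrogate reward, and a forward correction that reweights score-function terms so that the expected policy update aligns with the clean gradient direction. Both corrections improve math reasoning performance under synthetic and real verifier noise, with the forward correction more stable under heavier noise. An appeals mechanism based on a lightweight LLM verifier estimates the false-negative rate online and further improves performance.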

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably introduce false negatives (rejecting correct answers) and false positives (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates $\rho_0$ and $\rho_1$, the FP rate and the FN rate, respectively. From this abstraction we derive two lightweight corrections: (i) a backward correction that yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, and (ii) a forward correction that reweights score-function terms so the expected update aligns with the clean gradient direction and requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization pipeline. Both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. Finally, an appeals mechanism with a lightweight LLM verifier estimates the FN rate online and further improves performance.
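
The abstract does not give the formula for the backward correction. A minimal sketch, assuming it follows the standard unbiased estimator for binary labels under asymmetric noise: with FP rate rho0 and FN rate rho1, the observed reward has expectation rho0 + r (1 - rho0 - rho1) given the clean reward r, so subtracting rho0 and rescaling recovers r in expectation. The function names below are illustrative, and the paper's exact form may differ.

```python
def backward_corrected_reward(observed: int, rho0: float, rho1: float) -> float:
    """Surrogate reward whose expectation equals the clean binary reward.

    rho0: assumed false-positive rate, P(observed=1 | clean=0).
    rho1: assumed false-negative rate, P(observed=0 | clean=1).
    """
    denom = 1.0 - rho0 - rho1
    if denom <= 0.0:
        raise ValueError("need rho0 + rho1 < 1 for the correction to exist")
    return (observed - rho0) / denom


def expected_corrected_reward(clean: int, rho0: float, rho1: float) -> float:
    """Exact expectation of the surrogate over the verifier noise."""
    p_observed_one = (1.0 - rho1) if clean == 1 else rho0
    return (p_observed_one * backward_corrected_reward(1, rho0, rho1)
            + (1.0 - p_observed_one) * backward_corrected_reward(0, rho0, rho1))


if __name__ == "__main__":
    for clean in (0, 1):
        print(clean, expected_corrected_reward(clean, rho0=0.1, rho1=0.3))
```

The surrogate can fall outside [0, 1] for individual samples; only its expectation matches the clean reward, which is one plausible reason a backward-style correction would be less stable under heavy noise.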

2510.00526 2026-05-25 cs.CL cs.LG

Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong

Affiliations: University of Illinois Urbana-Champaign

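AI summary: This paper studies objectives beyond negative log likelihood (NLL) for supervised fine-tuning (SFT), examining a family of probability-based objectives across models of differing capability. Extensive experiments and ablations show that model capability is the key factor determining which objective works best: for strong models, prior-leaning objectives such as $-p$ and $-p^{10}$ outperform NLL; for weak models, NLL dominates; in between, no single objective prevails. The work offers a theoretical basis and practical guidance for choosing an objective according to model capability.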

Comments ICML 2026

Abstract

Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. In this work, we systematically study various probability-based objectives and characterize when and why different objectives succeed or fail under varying conditions. Through comprehensive experiments and extensive ablation studies across 8 model backbones, 27 benchmarks, and 7 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. The code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.
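
To see why objectives such as $-p$ and $-p^{10}$ downweight low-probability tokens, compare their gradients with respect to the token's log-probability. For NLL, the derivative of $-\log p$ is $-1$ for every token; for $-p^k$ it is $-k p^k$, which vanishes as $p \to 0$. The sketch below computes these per-token weights; the function names are illustrative and not taken from the paper's code.

```python
def nll_weight(p: float) -> float:
    """|d(-log p) / d(log p)| = 1: NLL weights every token equally."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 1.0


def power_weight(p: float, k: float = 1.0) -> float:
    """|d(-p**k) / d(log p)| = k * p**k: small when the model assigns low p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return k * p ** k


if __name__ == "__main__":
    for p in (0.01, 0.5, 0.9):
        print(p, nll_weight(p), power_weight(p, 1), power_weight(p, 10))
```

Under this view, a token the model already finds unlikely contributes almost nothing to the $-p^{10}$ update, whereas NLL pushes on it as hard as on any other token.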

2509.26383 2026-05-25 cs.CL cs.AI

Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

Junhong Lin, Shicheng Liu, Jinyeop Song, Song Wang, Julian Shun, Yada Zhu

Affiliations: MIT CSAIL; University of Virginia; IBM Research

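AI summary: This study proposes KG-R1, an efficient and transferable agentic framework for knowledge-graph retrieval-augmented generation trained with reinforcement learning. It targets the high inference cost and dependence on specific graph schemas caused by the fixed pipelines of existing KG-RAG systems. In KG-R1 a single agent interacts with the knowledge graph as its environment, learning to retrieve information step by step and fold it into a unified reasoning and generation process, which improves answer accuracy while using fewer generated tokens. On knowledge-graph question answering benchmarks, KG-R1 shows both efficiency and cross-graph transfer, maintaining accuracy on new knowledge graphs without retraining.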

Abstract

Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucination and provide reasoning traces. However, current KG-RAG systems often rely on fixed pipelines of multiple LLM modules (e.g., planning, reasoning, and responding), which inflate inference costs and tie performance to specific graph schemas. To address this, we introduce KG-R1, an agentic framework that optimizes KG-RAG through reinforcement learning (RL). Unlike modular workflows, KG-R1 uses a single agent that interacts with KGs as its environment, learning to retrieve information at each step and incorporating it into its reasoning and generation in a unified process. Across Knowledge-Graph Question Answering (KGQA) benchmarks, KG-R1 demonstrates both efficiency and transferability: using Qwen 2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use much larger foundation or fine-tuned models. Furthermore, KG-R1 exhibits strong plug-and-play capability: after training, it maintains accuracy on unseen KGs without retraining. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at github.com/junhongmit/KG-R1/.

2509.15808 2026-05-25 cs.SD eess.AS

From Independence to Interaction: Speaker-Aware Simulation of Multi-Speaker Conversational Timing

Máté Gedeon, Péter Mihajlik

Affiliations: Dept. of Telecommunications and Artificial Intelligence; Budapest University of Technology and Economics; Speechtex Ltd.

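AI summary: This paper proposes a speaker-aware method for simulating the timing of multi-speaker conversations that captures temporal consistency and realistic turn-taking dynamics. Speaker-specific deviation distributions enforce consistency within each speaker, a Markov chain governs turn-taking, and a fixed room impulse response preserves spatial realism. On several intrinsic metrics the method aligns better with real conversations than the baseline, capturing temporal dependencies and speaker alternation patterns more accurately.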

Comments Submitted to ICASSP 2026

Abstract

We present a speaker-aware approach for simulating multi-speaker conversations that captures temporal consistency and realistic turn-taking dynamics. Prior work typically models aggregate conversational statistics under an independence assumption across speakers and turns. In contrast, our method uses speaker-specific deviation distributions enforcing intra-speaker temporal consistency, while a Markov chain governs turn-taking and a fixed room impulse response preserves spatial realism. We also unify pauses and overlaps into a single gap distribution, modeled with kernel density estimation for smooth continuity. Evaluation on Switchboard using intrinsic metrics (global gap statistics, correlations between consecutive gaps, copula-based higher-order dependencies, turn-taking entropy, and gap survival functions) shows that speaker-aware simulation better aligns with real conversational patterns than the baseline method, capturing fine-grained temporal dependencies and realistic speaker alternation, while revealing open challenges in modeling long-range conversational structure.
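
The ingredients named in the abstract (a Markov chain over speakers, one gap distribution covering both pauses and overlaps, kernel density estimation, and speaker-specific deviations) can be combined in a small simulation. The sketch below is a loose illustration under simplifying assumptions: gaps are drawn from a Gaussian KDE by picking an observed gap and adding Gaussian noise, and each speaker adds a fixed offset. All names, parameter values, and the additive-offset model are hypothetical, not the authors' method.

```python
import random


def sample_kde(samples: list[float], bandwidth: float, rng: random.Random) -> float:
    """Draw from a Gaussian KDE: pick a data point, then add N(0, bandwidth^2) noise."""
    return rng.choice(samples) + rng.gauss(0.0, bandwidth)


def simulate_turns(n_turns, transition, gap_samples, speaker_offset,
                   bandwidth=0.05, seed=0):
    """Return a list of (speaker, gap_seconds); a negative gap means an overlap."""
    rng = random.Random(seed)
    current = next(iter(transition))  # start with the first listed speaker
    turns = []
    for _ in range(n_turns):
        # Shared gap distribution shifted by a per-speaker deviation (assumed form).
        gap = sample_kde(gap_samples, bandwidth, rng) + speaker_offset[current]
        turns.append((current, gap))
        # Markov chain: the next speaker depends only on the current one.
        candidates = list(transition[current])
        weights = [transition[current][s] for s in candidates]
        current = rng.choices(candidates, weights=weights, k=1)[0]
    return turns


if __name__ == "__main__":
    transition = {"A": {"A": 0.2, "B": 0.8}, "B": {"A": 0.7, "B": 0.3}}
    gaps = [-0.3, -0.1, 0.1, 0.2, 0.4, 0.8]  # made-up observed gaps in seconds
    offsets = {"A": -0.05, "B": 0.10}
    for speaker, gap in simulate_turns(5, transition, gaps, offsets):
        print(speaker, round(gap, 3))
```

Treating overlaps as negative gaps is what lets a single continuous distribution cover both cases.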

2509.15105 2026-05-25 cs.LG

Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting

Liran Nochumsohn, Raz Marshanski, Hedi Zisling, Omri Azencot

Affiliations: Faculty of Computer and Information Science, Ben-Gurion University

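AI summary: This paper proposes Super-Linear, a lightweight pretrained mixture-of-experts model for time series forecasting. It replaces deep architectures with frequency-specialized linear experts and uses a lightweight spectral gating mechanism to select the relevant experts dynamically. Super-Linear performs strongly across benchmark datasets while substantially improving computational efficiency, robustness to sampling rates, and interpretability.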

Journal ref Transactions on Machine Learning Research (TMLR), 2026

Abstract

Time series forecasting (TSF) is critical in domains like energy, finance, healthcare, and logistics, requiring models that generalize across diverse datasets. Large pre-trained models such as Chronos and Time-MoE show strong zero-shot (ZS) performance but suffer from high computational costs. In this work, we introduce Super-Linear, a lightweight and scalable mixture-of-experts (MoE) model for general forecasting. It replaces deep architectures with simple frequency-specialized linear experts, trained on resampled data across multiple frequency regimes. A lightweight spectral gating mechanism dynamically selects relevant experts, enabling efficient, accurate forecasting. Despite its simplicity, Super-Linear demonstrates strong performance across benchmarks, while substantially improving efficiency, robustness to sampling rates, and interpretability. The implementation of Super-Linear is available at: https://github.com/azencot-group/SuperLinear.

2509.12958 2026-05-25 cs.AI

Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning

Bihao Zhan, Jie Zhou, Junsong Li, Yutao Yang, Shilian Chen, Qianjun Pan, Xin Li, Wen Wu, Xingjiao Wu, Qin Chen, Hang Yan, Liang He

Affiliations: East China Normal University; Shanghai AI Laboratory; Shanghai Qiji Zhifeng Co., Ltd.

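AI summary: Addressing the privacy shortcomings of continual learning (CL) models, this study proposes a privacy-enhanced continual learning framework (PeCL). It introduces a token-level dynamic differential privacy strategy that allocates privacy budgets according to semantic sensitivity, protecting sensitive information while limiting interference with non-sensitive knowledge. A privacy-guided memory sculpting module then forgets sensitive information while retaining task-invariant historical knowledge. Experiments show that PeCL achieves a better balance between privacy protection and model utility than baseline models.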

Abstract

Continual Learning (CL) models, while adept at sequential knowledge acquisition, face significant and often overlooked privacy challenges due to accumulating diverse information. Traditional privacy methods, like a uniform Differential Privacy (DP) budget, indiscriminately protect all data, leading to substantial model utility degradation and hindering CL deployment in privacy-sensitive areas. To overcome this, we propose a privacy-enhanced continual learning (PeCL) framework that forgets what's sensitive and remembers what matters. Our approach first introduces a token-level dynamic Differential Privacy strategy that adaptively allocates privacy budgets based on the semantic sensitivity of individual tokens. This ensures robust protection for private entities while minimizing noise injection for non-sensitive, general knowledge. Second, we integrate a privacy-guided memory sculpting module. This module leverages the sensitivity analysis from our dynamic DP mechanism to intelligently forget sensitive information from the model's memory and parameters, while explicitly preserving the task-invariant historical knowledge crucial for mitigating catastrophic forgetting. Extensive experiments show that PeCL achieves a superior balance between privacy preserving and model utility, outperforming baseline models by maintaining high accuracy on previous tasks while ensuring robust privacy.

2508.18958 2026-05-25 cs.CV cs.AI

A drone-based framework for coral habitat mapping via weakly supervised segmentation

Matteo Contini, Victor Illien, Sylvain Poulain, Serge Bernard, Julien Barde, Sylvain Bonhommeau, Alexis Joly

Affiliations: IFREMER Délégation Océan Indien (DOI); INRIA, LIRMM, Université de Montpellier, CNRS; UMR Marbec, IRD, Université de Montpellier, CNRS, Ifremer; CNRS, LIRMM, Université de Montpellier

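AI summary: This paper proposes a drone-based weakly supervised segmentation framework for mapping coral habitats. By combining fine-scale multi-label predictions from underwater imagery with broad-coverage aerial data, it trains a high-resolution segmentation model without pixel-level annotations. Validated on coral reef imagery, the approach segments coral morphotypes over large areas and reaches 86.07% pixel accuracy and 52.23% mean Intersection over Union (mIoU).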

Comments Extended journal version of "The Point is the Mask: Scaling coral reef segmentation with weak supervision"

Abstract

Obtaining pixel-level annotations over large spatial extents remains a major bottleneck for deploying machine learning in ecological applications. Here we present a multi-scale weakly supervised semantic segmentation (WSSS) framework that enables training high-resolution segmentation models from dense, classification-based outputs. Our method combines fine-scale, multi-label predictions from underwater imagery with broad-coverage aerial data. We convert these point-level classifications into coarse supervision masks that can be used to train a semantic segmentation model on Unmanned Aerial Vehicle (UAV) orthophotos. A second training step using the model's own refined predictions is then used to further improve spatial accuracy without requiring additional annotations. We demonstrate the approach on coral reef imagery, enabling large-area segmentation of coral morphotypes and illustrating its flexibility in integrating new classes. The final model achieves 86.07% pixel accuracy and 52.23% mean Intersection over Union (mIoU) on manually annotated reef zones, demonstrating that accurate large-scale coral segmentation can be obtained without pixel-level annotations. By bridging image classification and segmentation across scales and modalities, this method provides an efficient solution for deploying segmentation models in settings where annotations are unavailable and opens opportunities for scalable, efficient monitoring in ecology and beyond.
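
The two reported figures, pixel accuracy and mIoU, measure different things, which helps explain how 86.07% and 52.23% can describe the same model. The sketch below computes both from a predicted and a ground-truth label map using the standard definitions. The function name and the choice to skip classes absent from both maps are assumptions, and the paper's evaluation protocol may differ in detail.

```python
def pixel_accuracy_and_miou(pred, truth, num_classes):
    """Return (pixel accuracy, mean IoU) for two same-shaped 2D label maps."""
    p = [c for row in pred for c in row]
    t = [c for row in truth for c in row]
    if not p or len(p) != len(t):
        raise ValueError("label maps must be non-empty and the same size")
    # Pixel accuracy: share of pixels whose predicted class is correct.
    accuracy = sum(a == b for a, b in zip(p, t)) / len(p)
    ious = []
    for c in range(num_classes):
        inter = sum(a == c and b == c for a, b in zip(p, t))
        union = sum(a == c or b == c for a, b in zip(p, t))
        if union:  # skip classes that appear in neither map
            ious.append(inter / union)
    miou = sum(ious) / len(ious) if ious else 0.0
    return accuracy, miou


if __name__ == "__main__":
    pred = [[0, 0], [1, 1]]
    truth = [[0, 1], [1, 1]]
    print(pixel_accuracy_and_miou(pred, truth, num_classes=2))
```

Because mIoU averages over classes rather than pixels, errors on rare classes pull it down far more than they affect pixel accuracy.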

2508.14311 2026-05-25 cs.LG cs.AI

Online Learning with Multiple Fairness Regularizers via Graph-Structured Feedback

Quan Zhou, Jakub Marecek, Robert Shorten

Affiliations: Department of Mathematics, National University of Singapore; Department of Computer Science, Czech Technical University; Dyson School of Design Engineering, Imperial College London; Imperial College London

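AI summary: This paper studies how automated decision systems can satisfy multiple, possibly conflicting, fairness requirements at once. The authors work in a bandit setting with graph-structured feedback, in which the weights of the different fairness objectives are learned adaptively through sequential interaction.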

Comments Published in Transactions on Machine Learning Research (TMLR), 2026. OpenReview: https://openreview.net/forum?id=y8iWuDZtEw

Journal ref Transactions on Machine Learning Research (TMLR), 2026

Abstract

There is an increasing need to enforce multiple, often competing, measures of fairness within automated decision systems. The appropriate weighting of these fairness objectives is typically unknown a priori, may change over time and, in our setting, must be learned adaptively through sequential interactions. In this work, we address this challenge in a bandit setting, where decisions are made with graph-structured feedback.

2508.14083 2026-05-25 cs.LG cs.AI

GeoMAE: Masking Representation Learning for Spatio-Temporal Graph Forecasting with Missing Values

Songyu Ke, Chenyu Wu, Yuxuan Liang, Huiling Qin, Junbo Zhang, Yu Zheng

Affiliations: College of Computer and Data Science, Fuzhou University; JD Intelligent Cities Research; School of Computing and Artificial Intelligence, Southwest Jiaotong University; Hong Kong University of Science and Technology (Guangzhou); Beijing Normal University

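AI summary: GeoMAE is a self-supervised representation learning model for spatio-temporal graph forecasting, aimed at the missing data that adverse environmental conditions and equipment failures cause in urban intelligence systems. It combines an attention-based spatio-temporal forecasting network with an auxiliary learning task inspired by masked autoencoders, capturing dynamic spatial correlations in sensor networks and improving robustness to missing values. On real-world datasets GeoMAE significantly outperforms existing methods, with up to 13.20% relative improvement over the best baselines.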

Comments 34 pages for pre-print version. This work has been published in *Neural Networks*. Please check the latest version via the following DOI

Abstract

The ubiquity of missing data in urban intelligence systems, attributable to adverse environmental conditions and equipment failures, poses a significant challenge to the efficacy of downstream applications, notably in the realms of traffic forecasting and energy consumption prediction. Therefore, it is imperative to develop a robust spatio-temporal learning methodology capable of extracting meaningful insights from incomplete datasets. Despite the existence of methodologies for spatio-temporal graph forecasting in the presence of missing values, unresolved issues persist. Primarily, the majority of extant research is predicated on time-series analysis, thereby neglecting the dynamic spatial correlations inherent in sensor networks. Additionally, the complexity of missing data patterns compounds the intricacy of the problem. Furthermore, the variability in maintenance conditions results in a significant fluctuation in the ratio and pattern of missing values, thereby challenging the generalizability of predictive models. In response to these challenges, this study introduces GeoMAE, a self-supervised spatio-temporal representation learning model. The model comprises three principal components: an input preprocessing module, an attention-based spatio-temporal forecasting network (STAFN), and an auxiliary learning task, which draws inspiration from Masking AutoEncoders to enhance the robustness of spatio-temporal representation learning. Empirical evaluations on real-world datasets demonstrate that GeoMAE significantly outperforms existing benchmarks, achieving up to 13.20% relative improvement over the best baseline models.

2508.10651 2026-05-25 cs.LG

Graph Learning via Logic-Based Weisfeiler-Leman Variants and Tabularization

Reijo Jaakkola, Tomi Janhunen, Antti Kuusisto, Magdalena Ortiz, Matias Selin, Mantas Šimkus

Affiliations: Tampere University; TU Wien

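AI summary: This paper proposes a new graph classification method that tabularizes graph data using logic-based variants of the Weisfeiler-Leman algorithm and then applies standard methods for tabular data. The variants are obtained by modifying the underlying logical framework, and their expressive power is characterized precisely using a generalization of the bisimulation game for generalized quantifiers. Experiments show that the method generally matches the predictive performance of graph neural networks and graph transformers without requiring a GPU or extensive hyperparameter tuning, while being considerably faster.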

Comments New version: Revised the experimental section

Abstract

We present a novel approach for graph classification based on tabularizing graph data via new variants of the Weisfeiler-Leman algorithm and then applying methods for tabular data. The variants are obtained by modifying the underlying logical framework, and we establish a precise theoretical characterization of their expressive power using a novel generalization of the bisimulation game for generalized quantifiers. We then test our method on 14 datasets that span a range of application domains. The experiments demonstrate that on datasets with up to 40 000 samples, our approach generally matches the predictive performance of graph neural networks and graph transformers, without requiring a GPU or extensive hyperparameter tuning. Even when our method's tuning time is included and the baselines' is not, our method is 5-20 times faster. When tuning time is included for all methods, the gap is significantly greater in favour of our method.
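
To make the idea of tabularizing graphs concrete, the sketch below uses the classic 1-dimensional Weisfeiler-Leman color refinement, not the logic-based variants the paper introduces, which the abstract does not specify. Each graph becomes a row of color counts, and the resulting table can be passed to any tabular classifier. The function names and the nested-tuple color encoding are illustrative choices.

```python
from collections import Counter


def wl_feature_row(adj, rounds=2):
    """Count WL colors over all refinement rounds for one graph.

    adj maps each node to a list of its neighbors. A color is a nested tuple
    (own previous color, sorted neighbor colors), so equal colors are directly
    comparable across different graphs without a shared lookup table.
    """
    colors = {v: () for v in adj}
    row = Counter()
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        row.update(colors.values())
    return row


def tabularize(graphs, rounds=2):
    """Turn a list of graphs into (column colors, table of counts)."""
    rows = [wl_feature_row(g, rounds) for g in graphs]
    columns = sorted(set().union(*rows))
    return columns, [[row[c] for c in columns] for row in rows]


if __name__ == "__main__":
    triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
    path = {0: [1], 1: [0, 2], 2: [1]}
    columns, table = tabularize([triangle, path], rounds=1)
    for line in table:
        print(line)
```

Nested tuples grow quickly with the number of rounds, so a practical version would hash or index the colors; the sketch keeps them explicit for readability.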

2508.10016 2026-05-25 cs.CL

Training-Free Multimodal Large Language Model Orchestration

Tianyu Xie, Yuexiao Ma, Yuhang Wu, Wang Chen, Jiayi Ji, Tat-Seng Chua, Xiawu Zheng, Rongrong Ji

Affiliations: Media Analytics and Computing Lab, Xiamen University, Xiamen, China; Institute of Artificial Intelligence, Xiamen University, Xiamen, China; School of Informatics, Xiamen University, Xiamen, China; Peng Cheng Laboratory, Shenzhen, China; Department of Computer Science, National University of Singapore, Singapore

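AI summary: This paper proposes LLM Orchestration, a training-free orchestration framework for multimodal large language models that avoids the heavy data and compute costs of end-to-end alignment when building interactive multimodal assistants. It integrates off-the-shelf modality experts into a unified multimodal input-output system without additional training, using three components: an LLM controller that infers user intent, a cross-modal memory module, and a unified interaction layer. On multimodal benchmarks the method achieves strong performance while keeping orchestration overhead low and allowing modular upgrades.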

Journal ref ICML 2026

Abstract

Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free Large Language Model Orchestration (LLM Orchestration), a training-free orchestration framework that integrates off-the-shelf modality experts into a unified multimodal input-output system without additional gradient-based training for integration. LLM Orchestration comprises three components: (1) an LLM controller that infers user intent and emits explicit control tokens for expert selection and sequencing, enabling protocol-constrained and auditable routing; (2) a text-centric cross-modal memory that compresses multimodal evidence into structured records for lightweight retrieval and reuse, reducing redundant expert invocations across turns; and (3) a unified interaction layer that executes routing and memory decisions to support consistent modality transitions, full-duplex streaming, and interruption-aware dialogue. Across diverse multimodal benchmarks, LLM Orchestration achieves strong performance under standard evaluation constraints while maintaining low orchestration overhead and modular upgradeability, providing a practical alternative to costly joint training for omni-modal systems.

2508.07849 2026-05-25 cs.CL

Evaluating Customized vs. Generalist Transformer-based Models for Legal Contract Classification

Amrita Singh, H. Suhan Karaca, Aditya Joshi, Hye-young Paik, Jiaojiao Jiang

Affiliations: School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney

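AI summary: Despite progress in legal NLP, there has been no comprehensive evaluation of Transformer models customized for legal tasks on contract classification. This paper evaluates 13 legal-specific Transformer models on three English-language contract classification tasks and compares them with 9 generalist models. Legal-specific models perform better on tasks requiring nuanced legal understanding and reduce misclassification of rare classes in imbalanced datasets. Legal-BERT and Contracts-BERT set new state-of-the-art results on two of the three tasks despite having 69% fewer parameters than the best-performing generalist models, underscoring the value of domain-specific customization for legal applications.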

Comments Accepted to Customizable NLP at ACL 2026

详情
AI中文摘要

尽管法律NLP取得了进展,但目前尚无针对合同分类任务对法律任务定制的Transformer模型(本文称为“法律专用”模型)进行全面评估。为填补这一空白,我们对13个法律专用Transformer模型在三个英文合同分类任务上进行了评估,并与9个通用模型进行了比较。结果表明,法律专用模型持续优于通用模型,尤其是在需要精细法律理解的任务上。它们还有助于减少不平衡数据集中罕见类别的误分类。Legal-BERT和Contracts-BERT在三个任务中的两个上建立了新的最优结果,尽管其参数比最佳通用模型少69%。我们还确定了CaseLaw-BERT和LexLM作为合同分类的强有力额外基线。我们的结果凸显了通用模型的不足,强调了领域特定定制的必要性,尤其是在法律应用的背景下。

英文摘要

Despite advances in legal NLP, no comprehensive evaluation of Transformer-based models customized for legal tasks (referred to as 'legal-specific' models in this paper) exists for contract classification tasks. To address this gap, we present an evaluation of 13 legal-specific transformer-based models on 3 English-language contract classification tasks and compare them with 9 generalist models. The results show that legal-specific models consistently outperform generalist models, especially on tasks requiring nuanced legal understanding. They also help reduce misclassification of rare classes in imbalanced datasets. Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks, despite having 69% fewer parameters than the best-performing generalist models. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract classification. Our results highlight the shortcomings of generalist models, emphasizing the need for domain-specific customization, particularly in the context of legal applications.

2508.02332 2026-05-25 cs.LG stat.ML

BOOST: A Data-Driven Framework for the Automated Joint Selection of Kernel and Acquisition Functions in Bayesian Optimization

BOOST: 一种用于贝叶斯优化中核函数与采集函数自动联合选择的数据驱动框架

Joon-Hyun Park, Mujin Cheon, Jeongsu Wi, Dong-Yeun Koh

发表机构 * Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology(化学与生物分子工程系,韩国科学技术院) Department of AX, Korea Advanced Institute of Science and Technology(AX系,韩国科学技术院) Saudi Aramco-KAIST CO2 Management Center(沙特阿美-KAIST二氧化碳管理中心)

AI总结 贝叶斯优化(BO)是一种在昂贵黑箱问题中高度样本高效的优化方法,其性能高度依赖于核函数和采集函数等超参数的选择。本文提出了一种名为BOOST的框架,用于自动联合选择最优的核函数-采集函数对,解决了传统方法中依赖启发式或手动调参的问题。BOOST通过离线评估阶段预测不同核函数-采集函数对的性能,并在实际优化前选择最有可能表现良好的组合,从而提升优化效率和效果。实验表明,BOOST在合成基准和机器学习超参数优化任务中均优于固定超参数的BO方法,并能与先进自适应方法竞争。

Comments 25 pages

详情
AI中文摘要

贝叶斯优化(BO)是一种对昂贵黑箱问题具有高样本效率的方法,其性能关键取决于超参数的选择,包括核函数和采集函数。这带来了一个重要的实际挑战:不恰当的组合可能导致性能差和评估浪费。虽然对核函数和采集函数的单独改进已被积极探索,但自动联合选择最佳超参数对在很大程度上被忽视,迫使从业者依赖启发式方法或昂贵的手动训练。在这项工作中,我们提出了一个框架BOOST(贝叶斯优化与最优核函数和采集函数选择技术),该框架自动化了这一选择过程。BOOST利用一个简单的离线评估阶段来预测各种核函数-采集函数对的性能,并在进行昂贵的评估过程之前识别出最有希望的对。BOOST是一种数据驱动的策略选择程序,它根据候选策略在手头数据上的经验性能来评估核函数-采集函数对。在每次迭代中,先前观察到的点被划分为参考集和查询集。这些子集扮演类似于机器学习中训练集和验证集的角色:参考集用于模型构建,而查询集代表未见的区域,用于回顾性评估每个候选策略在向目标值推进方面的有效性。在合成基准和机器学习超参数优化任务上的实验表明,BOOST始终优于固定超参数的BO,并与最先进的自适应方法保持竞争力,突显了其在各种场景下的鲁棒性。

英文摘要

The performance of Bayesian optimization (BO), a highly sample-efficient method for expensive black-box problems, is critically governed by the selection of its hyperparameters, including the kernel and acquisition functions. This presents a significant practical challenge: an inappropriate combination of these can lead to poor performance and wasted evaluations. While individual improvements to kernel functions and acquisition functions have been actively explored, the joint and autonomous selection of the best pair of these fundamental hyperparameters has been largely overlooked, forcing practitioners to rely on heuristics or costly manual training. In this work, we propose a framework, BOOST (Bayesian Optimization with Optimal Kernel and Acquisition Function Selection Technique), that automates this selection. BOOST utilizes a simple offline evaluation stage to predict the performance of various kernel-acquisition function pairs and identify the most promising pair before committing to the expensive evaluation process. BOOST is a data-driven strategy selection procedure that evaluates kernel-acquisition pairs based on their empirical performance on the data in hand. At each iteration, previously observed points are partitioned into a reference set and a query set. These subsets play roles analogous to training and validation sets in machine learning: the reference set is used for model construction, while the query set represents unseen regions used to retrospectively evaluate how effectively each candidate strategy progresses toward the target value. Experiments on synthetic benchmarks and machine learning hyperparameter optimization tasks demonstrate that BOOST consistently improves over fixed-hyperparameter BO and remains competitive with state-of-the-art adaptive methods, highlighting its robustness across diverse landscapes.
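
The reference/query idea described in the abstract can be sketched in a few lines. The code below is only a toy illustration, not the BOOST implementation: the two scoring functions `exploit` and `explore` are hypothetical stand-ins for real kernel-acquisition pairs (no Gaussian process is fitted), and `steps_to_target` and `select_strategy` are names invented for this example. Each candidate strategy is replayed on the held-out query set, and the one that reaches the best query value in the fewest evaluations is selected.

```python
def steps_to_target(strategy, reference, query):
    """Count how many query points a strategy evaluates before it reaches
    the best (lowest) value in the query set."""
    known = list(reference)       # data the strategy may condition on
    remaining = list(query)       # held-out points treated as unseen
    target = min(y for _, y in query)
    steps = 0
    while remaining:
        # The strategy scores each candidate location; the lowest score is picked.
        pick = min(remaining, key=lambda pt: strategy(pt[0], known))
        remaining.remove(pick)
        known.append(pick)
        steps += 1
        if pick[1] <= target:
            break
    return steps


def exploit(x, known):
    """Hypothetical strategy: score by the value of the nearest known point."""
    return min(known, key=lambda pt: abs(pt[0] - x))[1]


def explore(x, known):
    """Hypothetical strategy: prefer locations far from every known point."""
    return -min(abs(px - x) for px, _ in known)


def select_strategy(strategies, reference, query):
    """Return the name of the strategy that reaches the target in fewest steps."""
    return min(strategies,
               key=lambda name: steps_to_target(strategies[name], reference, query))


def objective(x):
    # Toy objective (an assumption of this example), minimized at x = 2.
    return (x - 2.0) ** 2


reference = [(x, objective(x)) for x in (0.0, 1.0, 5.0)]
query = [(x, objective(x)) for x in (1.8, 3.5, 4.0, -1.0)]
strategies = {"exploit": exploit, "explore": explore}
best = select_strategy(strategies, reference, query)
```

On this toy data the exploiting strategy finds the best query point first, so it would be the one carried forward to the expensive evaluations. In a real setting each strategy would wrap a surrogate model and an acquisition function.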

2507.23372 2026-05-25 cs.CV

UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries

UniEmo: 利用可学习专家查询统一情感理解与生成

Yijie Zhu, Lingsen Zhang, Zitong Yu, Rui Shao, Tao Tan, Liqiang Nie

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Great Bay University(大湾区大学) School of Computing and Information Technology(计算机与信息学院) Dongguan Key Laboratory for Intelligence and Information Technology(东莞智能与信息科技重点实验室) Shenzhen Loop Area Institute(深圳河套学院) Macao Polytechnic University(澳门理工大学)

AI总结 本文提出 UniEmo,一个统一的情感理解和生成框架,通过可学习的专家查询机制,将情感理解与生成任务有机结合。该方法通过分层的情感理解链逐步提取多尺度情感特征,并利用这些特征引导扩散模型生成具有情感表达的图像,同时引入情感相关系数和条件损失以提升生成图像的多样性和真实性。实验表明,UniEmo 在情感理解和生成任务上均优于现有先进方法。

Comments Accepted to TIP 2026

Journal ref IEEE Transactions on Image Processing, vol. 35, pp. 5165-5180, 2026

详情
AI中文摘要

情感理解和生成通常被视为独立的任务,然而它们本质上是互补的,可以相互增强。在本文中,我们提出UniEmo,一个无缝集成这两个任务的统一框架。关键挑战在于情感的抽象性质,需要提取对两个任务都有益的视觉表示。为此,我们提出一个带有可学习专家查询的分层情感理解链,逐步提取多尺度情感特征,从而作为统一的基础步骤。同时,我们融合这些专家查询和情感表示,以指导扩散模型生成引发情感反应的图像。为了增强生成情感图像的多样性和保真度,我们进一步在融合过程中引入情感相关系数和情感条件损失。这一步骤促进了由理解引导的情感生成的融合与对齐。反过来,我们证明联合训练允许生成部分向理解部分提供隐式反馈。此外,我们提出一种新颖的数据过滤算法,以选择由训练良好的模型生成的高质量和多样化的情感图像,这些图像显式地反馈到理解部分。这些生成驱动的双重反馈过程共同增强了模型的理解能力。大量实验表明,UniEmo在情感理解和生成任务上均显著优于现有方法。所提出方法的代码可在 https://github.com/JiuTian-VL/UniEmo 获取。

英文摘要

Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding component. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding component. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which are explicitly fed back into the understanding component. Together, these generation-driven dual feedback processes enhance the model's understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at https://github.com/JiuTian-VL/UniEmo.

2507.22345 2026-05-25 cs.RO

A Reconfigured Wheel-Legged Robot for Enhanced Steering and Adaptability

一种增强转向能力和适应性的重构轮腿机器人

Zhicheng Song, Jinglan Xu, Chunxin Zheng, Yulin Li, Zhihai Bi, Jun Ma

发表机构 * Robotics and Autonomous Systems Thrust, The Hong Kong University of Science and Technology (Guangzhou)(机器人与自主系统方向,香港科技大学(广州))

AI总结 本文提出了一种名为FLORES的新型轮腿机器人设计,通过重新配置前腿结构,使其在平坦地面和复杂地形中均能实现高效移动。该设计采用髋部偏航自由度替代传统髋部滚转自由度,结合定制的强化学习控制器,实现了轮式与腿式运动模式之间的无缝切换和适应性控制。实验表明,FLORES在转向能力、导航效率和多地形适应性方面均有显著提升。

Journal ref IEEE Robotics and Automation Letters, vol. 11, no. 6, pp. 7444-7451, June 2026

详情
AI中文摘要

轮腿机器人结合了腿在崎岖地形上的灵活性和轮子在平坦地面上的效率。然而,现有大多数设计未能充分利用腿式和轮式结构的优势,限制了系统的整体灵活性和效率。我们提出FLORES,一种新型轮腿机器人设计,其独特的前腿配置超越了标准设计方法。具体来说,FLORES将前腿传统的髋关节横滚自由度替换为髋关节偏航自由度,这使得在平坦表面上高效移动的同时,确保在复杂地形中的适应性。这种创新设计促进了不同运动模式(即腿式运动和轮式运动)之间的无缝切换,并优化了在不同环境中的性能。为了充分利用FLORES的机械能力,我们开发了一个定制的强化学习控制器,该控制器采用混合内模,并针对我们独特的机械配置优化了奖励结构。该框架能够生成自适应、多模态的运动策略,促进轮式和腿式运动之间的平滑过渡。此外,我们独特的关节设计使机器人能够表现出新颖且高效的运动步态,充分利用两种运动模式的协同优势。通过全面实验,我们展示了FLORES增强的转向能力、改进的导航效率以及在各种地形上的多功能运动。开源项目可在https://github.com/ZhichengSong6/FLORES获取。

英文摘要

Wheel-legged robots integrate leg agility on rough terrain with wheel efficiency on flat ground. However, most existing designs do not fully capitalize on the benefits of both legged and wheeled structures, which limits overall system flexibility and efficiency. We present FLORES, a novel wheel-legged robot design featuring a distinctive front-leg configuration that sets it apart from standard design approaches. Specifically, FLORES replaces the conventional hip-roll degree of freedom (DoF) of the front leg with hip-yaw DoFs, which allows for efficient movement on flat surfaces while ensuring adaptability when navigating complex terrains. This innovative design facilitates seamless transitions between different locomotion modes (i.e., legged locomotion and wheeled locomotion) and optimizes the performance across varied environments. To fully exploit FLORES's mechanical capabilities, we develop a tailored reinforcement learning (RL) controller that adapts the Hybrid Internal Model (HIM) with a customized reward structure optimized for our unique mechanical configuration. This framework enables the generation of adaptive, multi-modal locomotion strategies that facilitate smooth transitions between wheeled and legged movements. Furthermore, our distinctive joint design enables the robot to exhibit novel and highly efficient locomotion gaits that capitalize on the synergistic advantages of both locomotion modes. Through comprehensive experiments, we demonstrate FLORES's enhanced steering capabilities, improved navigation efficiency, and versatile locomotion across various terrains. The open-source project can be found at https://github.com/ZhichengSong6/FLORES.

2507.12455 2026-05-25 cs.CV

Mitigating Object Hallucinations via Sentence-Level Early Intervention

通过句子级早期干预缓解对象幻觉

Shangpin Peng, Senqiao Yang, Li Jiang, Zhuotao Tian

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) The Chinese University of Hong Kong(香港中文大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 多模态大语言模型在跨模态理解方面取得了显著进展,但在生成过程中常出现与视觉输入矛盾的“幻觉”问题。本文提出了一种基于句子级早期干预的解决方案SENTINEL,通过迭代生成并验证模型输出,构建高质量的领域内偏好数据,进而利用上下文感知的偏好损失进行训练,有效抑制幻觉生成。实验表明,该方法在多个基准测试中大幅减少了幻觉现象,同时保持了模型的通用能力。

详情
AI中文摘要

多模态大语言模型(MLLMs)彻底改变了跨模态理解,但仍难以应对幻觉——即与视觉输入相矛盾的虚构内容。现有的幻觉缓解方法要么计算成本过高,要么在训练数据和模型输出之间引入分布不匹配。我们识别出一个关键见解:幻觉主要出现在文本生成的早期阶段,并通过后续输出传播。为解决此问题,我们提出了SENTINEL(通过领域内偏好学习的句子级早期干预),一个消除对人工标注依赖的框架。具体来说,我们首先通过迭代采样模型输出、通过与两个开放词汇检测器的交叉验证来验证对象存在性,并将句子分类为幻觉/非幻觉类别,从而引导出高质量的领域内偏好对。随后,我们使用上下文连贯的正样本和幻觉负样本迭代构建上下文感知的偏好数据。最后,我们使用上下文感知偏好损失(C-DPO)训练模型,该损失在幻觉最初显现的句子级别强调判别性学习。实验结果表明,与原始模型相比,SENTINEL可将幻觉减少超过90%,并在幻觉基准和通用能力基准上均优于先前的最先进方法,展示了其优越性和泛化能力。模型、数据集和代码可在 https://github.com/pspdada/SENTINEL 获取。

英文摘要

Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at https://github.com/pspdada/SENTINEL.
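
The cross-checking step in the abstract can be sketched as a small labelling rule. The sketch assumes that each of the two open-vocabulary detectors returns a set of detected object names and that disagreement between them yields an "uncertain" label; both the three-way rule and the function names are assumptions of this example and are not taken from the SENTINEL code.

```python
def classify_object(obj, detected_a, detected_b):
    """Label one mentioned object by cross-checking two detector outputs."""
    in_a, in_b = obj in detected_a, obj in detected_b
    if in_a and in_b:
        return "factual"          # both detectors confirm the object
    if not in_a and not in_b:
        return "hallucinated"     # neither detector finds the object
    return "uncertain"            # detectors disagree


def classify_sentence(mentioned_objects, detected_a, detected_b):
    """Label a sentence from the labels of the objects it mentions."""
    labels = [classify_object(o, detected_a, detected_b) for o in mentioned_objects]
    if "hallucinated" in labels:
        return "hallucinated"
    if labels and all(label == "factual" for label in labels):
        return "non-hallucinated"
    return "uncertain"


# Hypothetical detector outputs for one image.
detector_a = {"dog", "frisbee", "tree"}
detector_b = {"dog", "frisbee"}

label_ok = classify_sentence(["dog", "frisbee"], detector_a, detector_b)
label_bad = classify_sentence(["dog", "cat"], detector_a, detector_b)
label_unsure = classify_sentence(["tree"], detector_a, detector_b)
```

Under this rule a sentence labelled "non-hallucinated" could serve as a positive sample and one labelled "hallucinated" as a negative sample, while uncertain sentences would be left out of the preference pairs.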

2506.00560 2026-05-25 cs.RO cs.CV

Using Ensemble Diffusion to Estimate Uncertainty for End-to-End Autonomous Driving

使用集成扩散估计端到端自动驾驶的不确定性

Florian Wintel, Sigmund H. Høeg, Gabriel Kiss, Frank Lindseth

发表机构 * Norwegian University of Science and Technology(挪威科学技术大学)

AI总结 本文提出了一种基于集成扩散模型的端到端自动驾驶系统EnDfuser,用于估计轨迹规划中的不确定性。该方法通过将注意力池化与轨迹规划结合到一个扩散变换器模块中,有效融合了摄像头和激光雷达等多源感知信息,并从单帧感知输入生成多个候选轨迹(共128个),从而提供对不确定未来轨迹空间的可解释性。实验表明,该方法通过设计简单安全规则,在LAV基准测试中提升了1.7%的驾驶性能,展示了集成扩散模型在端到端自动驾驶策略中建模轨迹后验不确定性分布的有效性。

Comments Accepted at NLDL 2026

详情
AI中文摘要

端到端自动驾驶规划系统正在快速改进,特别是在CARLA等闭环模拟环境中。许多此类驾驶系统要么不考虑规划本身的不确定性,要么通过使用不泛化的专用表示来获取不确定性。在本文中,我们提出了EnDfuser,一个使用扩散模型作为轨迹规划器的端到端驾驶系统。EnDfuser通过将注意力池化和轨迹规划结合到一个单一的扩散变换器模块中,有效利用复杂的感知信息,如融合的相机和激光雷达特征。EnDfuser不承诺单一规划,而是通过集成扩散从单一感知帧生成候选轨迹分布(在我们的情况下为128个)。通过观察完整的候选轨迹集,EnDfuser为不确定的多模态未来轨迹空间提供了可解释性。利用这些信息,我们设计了一个简单的安全规则,在LAV基准上将系统的驾驶评分提高了1.7%。我们的发现表明,集成扩散作为传统点估计轨迹规划模块的直接替代品,可以通过建模后验轨迹分布的不确定性,为端到端驾驶策略中的不确定性感知决策过程做出贡献。

英文摘要

End-to-end planning systems for autonomous driving are rapidly improving, especially in closed-loop simulation environments like CARLA. Many such driving systems either do not consider uncertainty as part of the plan itself or obtain it by using specialized representations that do not generalize. In this paper, we propose EnDfuser, an end-to-end driving system that uses a diffusion model as the trajectory planner. EnDfuser effectively leverages complex perception information, such as fused camera and LiDAR features, by combining attention pooling and trajectory planning into a single diffusion transformer module. Instead of committing to a single plan, EnDfuser produces a distribution of candidate trajectories (128 in our case) from a single perception frame through ensemble diffusion. By observing the full set of candidate trajectories, EnDfuser provides interpretability for uncertain, multimodal future trajectory spaces. Using this information, we design a simple safety rule that improves the system's driving score by 1.7% on the LAV benchmark. Our findings suggest that ensemble diffusion, used as a drop-in replacement for traditional point-estimate trajectory planning modules, can contribute to an uncertainty-aware decision-making process in end-to-end driving policies by modeling the uncertainty of the posterior trajectory distribution.
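
To make the idea of a safety rule over an ensemble of candidate trajectories concrete, here is a minimal sketch. The abstract does not state the rule used in EnDfuser, so everything below is an assumption: trajectories are lists of `(x, y)` waypoints, uncertainty is measured as the spread of the final waypoints, and the vehicle slows down when that spread exceeds a hypothetical threshold.

```python
import math


def endpoint_spread(trajectories):
    """Mean distance of each trajectory's final waypoint from the endpoint centroid."""
    ends = [traj[-1] for traj in trajectories]
    cx = sum(x for x, _ in ends) / len(ends)
    cy = sum(y for _, y in ends) / len(ends)
    return sum(math.hypot(x - cx, y - cy) for x, y in ends) / len(ends)


def target_speed(trajectories, nominal_speed, spread_threshold, slow_factor=0.5):
    """Reduce the target speed when the candidate trajectories disagree."""
    if endpoint_spread(trajectories) > spread_threshold:
        return nominal_speed * slow_factor
    return nominal_speed


# Four candidates that agree, and four that split into two modes (left/right).
agreeing = [[(0.0, 0.0), (0.0, 10.0)] for _ in range(4)]
split = [[(0.0, 0.0), (-3.0, 10.0)], [(0.0, 0.0), (-3.0, 10.0)],
         [(0.0, 0.0), (3.0, 10.0)], [(0.0, 0.0), (3.0, 10.0)]]

speed_confident = target_speed(agreeing, nominal_speed=8.0, spread_threshold=1.0)
speed_uncertain = target_speed(split, nominal_speed=8.0, spread_threshold=1.0)
```

The same pattern would apply unchanged to 128 candidates. Only the spread measure and the threshold would need tuning.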

2505.21573 2026-05-25 cs.LG cs.AI

Spectral-inspired Operator Learning with Limited Data and Unknown Physics

谱方法启发的有限数据与未知物理条件下的算子学习

Han Wan, Rui Zhang, Hao Sun

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院)

AI总结 本文研究了在数据有限且物理机制未知的情况下学习偏微分方程(PDE)动力学的挑战。为此,提出了一种名为SINO的频谱启发神经算子,它仅需2到5条轨迹即可建模复杂系统,无需显式依赖PDE方程。SINO通过频率索引自动捕捉局部和全局空间导数,结合乘法操作块和低通滤波器处理非线性效应和混叠问题,在多个二维和三维PDE基准测试中表现出优异性能,尤其在少量数据和分布外场景下显著优于现有方法。

Comments To appear in KDD 2026

详情
AI中文摘要

从有限数据和未知物理中学习PDE动力学具有挑战性。现有的神经PDE求解器要么需要大型数据集,要么依赖已知物理(如PDE残差或手工设计的差分模板),导致适用性有限。为解决这些问题,我们提出谱启发神经算子(SINO),它仅需2-5条轨迹即可建模复杂系统,无需显式PDE项。具体而言,SINO从频率索引自动捕获局部和全局空间导数,从而在物理无关机制下实现底层微分算子的紧凑表示。为建模非线性效应,它采用Pi块对谱特征进行乘法运算,并辅以低通滤波器抑制混叠。在2D和3D PDE基准上的大量实验表明,SINO实现了最先进的性能,精度提升1-2个数量级。特别地,仅用5条训练轨迹,SINO就优于在1000条轨迹上训练的数据驱动方法,并在其他方法失败的高难度分布外案例中保持预测能力。

英文摘要

Learning PDE dynamics from limited data with unknown physics is challenging. Existing neural PDE solvers either require large datasets or rely on known physics (e.g., PDE residuals or handcrafted stencils), leading to limited applicability. To address these challenges, we propose Spectral-Inspired Neural Operator (SINO), which can model complex systems from just 2-5 trajectories, without requiring explicit PDE terms. Specifically, SINO automatically captures both local and global spatial derivatives from frequency indices, enabling a compact representation of the underlying differential operators in physics-agnostic regimes. To model nonlinear effects, it employs a Pi-block that performs multiplicative operations on spectral features, complemented by a low-pass filter to suppress aliasing. Extensive experiments on both 2D and 3D PDE benchmarks demonstrate that SINO achieves state-of-the-art performance, with improvements of 1-2 orders of magnitude in accuracy. Particularly, with only 5 training trajectories, SINO outperforms data-driven methods trained on 1000 trajectories and remains predictive on challenging out-of-distribution cases where other methods fail.
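
The link between frequency indices and spatial derivatives can be illustrated with a classical spectral derivative: for a periodic signal, differentiation in physical space corresponds to multiplying the Fourier coefficient of frequency index k by i·2πk/L. The sketch below is not SINO; it is a plain-Python illustration with a naive O(n²) DFT, and the function names are invented for this example.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform."""
    n = len(values)
    return [sum(values[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
            for k in range(n)]


def idft(coeffs):
    """Naive inverse discrete Fourier transform."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * k * j / n) for k in range(n)) / n
            for j in range(n)]


def spectral_derivative(samples, length):
    """Differentiate a periodic signal sampled uniformly on [0, length)."""
    n = len(samples)
    scaled = []
    for k, c in enumerate(dft(samples)):
        # Map the array index to a signed frequency index.
        freq = k if k < n / 2 else k - n
        if n % 2 == 0 and k == n // 2:
            freq = 0  # drop the Nyquist mode for an odd-order derivative
        scaled.append(1j * 2 * math.pi * freq / length * c)
    return [z.real for z in idft(scaled)]


n = 16
xs = [2 * math.pi * j / n for j in range(n)]
du = spectral_derivative([math.sin(x) for x in xs], 2 * math.pi)
max_err = max(abs(d - math.cos(x)) for d, x in zip(du, xs))
```

For this smooth periodic input, `max_err` is at the level of floating-point round-off. Products of spectral features create higher frequencies, which is why a low-pass filter is commonly paired with multiplicative operations to limit aliasing.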

2505.17354 2026-05-25 cs.LG stat.ML

CT-OT Flow: Estimating Continuous-Time Dynamics from Discrete Temporal Snapshots

CT-OT Flow:从离散时间快照估计连续时间动态

Keisuke Kawano, Takuro Kutsuna, Naoki Hayashi, Yasushi Esaki, Hidenori Tanaka

发表机构 * Toyota Central R&D Labs., Inc.(丰田中央研发实验室)

AI总结 本文研究如何从离散时间快照中估计连续时间动态,针对如单细胞RNA测序、移动感知等场景中数据仅以时间聚合快照形式存在、时间标签可能噪声或不确定的问题。提出了一种两阶段框架——连续时间最优传输流(CT-OT Flow),通过部分最优传输对齐相邻时间区间以推断高分辨率时间标签,并利用时间核平滑重建连续时间数据分布,从而训练标准的常微分方程或随机微分方程模型。该方法有效处理快照聚合和时间标签不确定性,并通过实用加速策略提升计算效率,在多个合成和真实数据集上表现出更优的分布和轨迹估计性能。

Comments https://github.com/ToyotaCRDL/CT-OT_Flow

详情
AI中文摘要

在许多现实场景中(例如单细胞RNA测序、移动感知和环境监测),数据仅作为在有限时间窗口内收集的时间聚合快照被观测到,通常带有噪声或不确定的时间戳,并且无法访问连续轨迹。我们研究从这类快照估计连续时间动态的问题。我们提出连续时间最优传输流(CT-OT Flow),这是一个两阶段框架:(i)通过部分最优传输(POT)对齐相邻区间来推断高分辨率时间标签,(ii)通过时间核平滑重建连续时间数据分布,从中采样邻近时间对以训练标准ODE/SDE模型。我们的公式明确考虑了快照聚合和时间标签不确定性,并使用实际加速(筛选和小批量POT),使其适用于大型数据集。在合成基准和两个真实数据集(scRNA-seq和台风轨迹)上,与OT-CFM、[SF]²M、TrajectoryNet、MFM和ENOT相比,CT-OT Flow减少了分布和轨迹误差。

英文摘要

In many real-world settings (e.g., single-cell RNA sequencing, mobility sensing, and environmental monitoring), data are observed only as temporally aggregated snapshots collected over finite time windows, often with noisy or uncertain timestamps, and without access to continuous trajectories. We study the problem of estimating continuous-time dynamics from such snapshots. We present Continuous-Time Optimal Transport Flow (CT-OT Flow), a two-stage framework that (i) infers high-resolution time labels by aligning neighboring intervals via partial optimal transport (POT) and (ii) reconstructs a continuous-time data distribution through temporal kernel smoothing, from which we sample pairs of nearby times to train standard ODE/SDE models. Our formulation explicitly accounts for snapshot aggregation and time-label uncertainty and uses practical accelerations (screening and mini-batch POT), making it applicable to large datasets. Across synthetic benchmarks and two real datasets (scRNA-seq and typhoon tracks), CT-OT Flow reduces distributional and trajectory errors compared with OT-CFM, [SF]²M, TrajectoryNet, MFM, and ENOT.
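
Stage (ii), temporal kernel smoothing, can be sketched as follows. This is a minimal illustration that assumes a Gaussian kernel and draws the two samples of a pair independently; the kernel choice, the bandwidth, and the names `kernel_weights` and `sample_pair` are assumptions of this example rather than details confirmed by the abstract.

```python
import math
import random


def kernel_weights(time_labels, t, bandwidth):
    """Normalized Gaussian weights of each sample for the smoothed distribution at time t."""
    raw = [math.exp(-0.5 * ((s - t) / bandwidth) ** 2) for s in time_labels]
    total = sum(raw)
    return [w / total for w in raw]


def sample_pair(samples, time_labels, t, dt, bandwidth, rng):
    """Draw one sample near time t and one near time t + dt."""
    x_t = rng.choices(samples, weights=kernel_weights(time_labels, t, bandwidth))[0]
    x_next = rng.choices(samples, weights=kernel_weights(time_labels, t + dt, bandwidth))[0]
    return x_t, x_next


# Three samples with inferred (high-resolution) time labels.
time_labels = [0.0, 0.5, 1.0]
samples = ["a", "b", "c"]

weights = kernel_weights(time_labels, t=0.5, bandwidth=0.25)
pair = sample_pair(samples, time_labels, t=0.5, dt=0.1, bandwidth=0.25,
                   rng=random.Random(0))
```

Samples whose inferred time label is close to t dominate the weights, so pairs drawn at t and t + dt approximate nearby marginals of the reconstructed continuous-time distribution and can be fed to a standard ODE/SDE training loop.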

2505.17015 2026-05-25 cs.CV cs.CL

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Multi-SpatialMLLM: 多模态大语言模型的多帧空间理解

Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang

发表机构 * FAIR, Meta(FAIR,Meta) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出了一种名为Multi-SpatialMLLM的多模态大语言模型框架,旨在增强模型对多帧场景的时空理解能力。通过引入深度感知、视觉对应和动态感知等基本空间技能,并构建包含2700多万个样本的MultiSPA数据集,该方法显著提升了模型在多帧空间任务中的表现。实验表明,Multi-SpatialMLLM在多种空间任务上优于现有基线模型和商业系统,展示了其在复杂场景下的泛化能力和多任务学习优势,并可应用于机器人领域的多帧奖励标注。

Comments CVPR 2026 Camera Ready. 27 pages. Project page: https://runsenxu.com/projects/Multi-SpatialMLLM

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉任务中取得了快速进展,但其空间理解仍局限于单张图像,使其不适合需要多帧推理的物理世界应用。在本文中,我们提出一个框架,通过整合基本空间技能(包括深度感知、视觉对应和动态感知)来赋予MLLMs多帧空间理解能力。我们设计了一个新颖的数据管道,并收集了包含超过2700万个样本的MultiSPA数据集,涵盖多样的3D和4D场景,以支持训练。除了MultiSPA,我们还引入了一个全面的基准测试,在统一的度量标准下测试广泛的空间任务。我们的最终模型Multi-SpatialMLLM在基线和专有系统上取得了显著提升,展示了可扩展和可泛化的多帧感知能力。我们进一步观察到多任务收益和在挑战性场景中的新兴空间能力,并展示了我们的模型如何作为机器人学的多帧奖励标注器。

英文摘要

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. We design a novel data pipeline and collect the MultiSPA dataset of more than 27 million samples spanning diverse 3D and 4D scenes to enable training. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable and generalizable multi-frame perception. We further observe multi-task benefits and emergent spatial capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

2505.13893 2026-05-25 cs.CL

InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

InfiGFusion:通过高效Gromov-Wasserstein进行图对逻辑蒸馏的模型融合

Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, Hongxia Yang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Zhejiang University(浙江大学) Daya Bay Technology and Innovation Research Institute(大亚湾技术与创新研究院)

AI总结 该论文提出了一种名为 InfiGFusion 的模型融合框架,旨在将不同异构开源模型的优势整合到统一系统中。其核心方法是通过引入基于图的对数概率蒸馏(GLD)损失,显式建模词汇维度间的语义依赖关系,并利用改进的格罗莫夫-瓦萨距离近似算法提升计算效率。实验表明,InfiGFusion 在多个基准测试中显著优于现有方法,尤其在复杂推理任务中表现出色。

详情
AI中文摘要

近期大语言模型(LLMs)的进展推动了将异构开源模型融合为统一系统的努力,以继承其互补优势。现有的基于逻辑值的融合方法保持了推理效率,但独立处理词汇维度,忽略了跨维度交互编码的语义依赖。这些依赖反映了模型内部推理下词元类型的交互方式,对于对齐具有不同生成行为的模型至关重要。为显式建模这些依赖,我们提出InfiGFusion,首个结构感知融合框架,带有新颖的图对逻辑蒸馏(Graph-on-Logits Distillation,GLD)损失。具体地,我们保留每个输出的top-$k$逻辑值,并在序列位置上聚合它们的外积以形成全局共激活图,其中节点表示词汇通道,边量化其联合激活。为确保可扩展性和效率,我们设计了一种基于排序的闭式近似,将Gromov-Wasserstein距离的原始$O(n^4)$成本降至$O(n \log n)$,并具有可证明的近似保证。在多种融合设置下的实验表明,GLD持续提高了融合质量和稳定性。InfiGFusion在涵盖推理、编程和数学的11个基准上优于SOTA模型和融合基线。它在复杂推理任务中表现出特别优势,在Multistep Arithmetic上比SFT提高+35.6,在Causal Judgement上提高+37.06,展示了卓越的多步和关系推理能力。

英文摘要

Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose InfiGFusion, the first structure-aware fusion framework with a novel Graph-on-Logits Distillation (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov-Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
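
The construction of the co-activation graph (keep the top-$k$ logits per position, then sum their outer products over positions) can be sketched directly. The example below works on raw logits and stores the graph as a sparse dictionary; whether the original method uses raw logits or normalized scores is not stated in the abstract, and the function names are invented for this illustration.

```python
def top_k_sparse(logits, k):
    """Return {vocab_index: logit} for the k largest logits at one position."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return {i: logits[i] for i in order[:k]}


def coactivation_graph(logit_rows, k):
    """Sum the outer products of top-k logit vectors over all sequence positions.

    Nodes are vocabulary indices; the weight of edge (i, j) accumulates
    logit_i * logit_j whenever both i and j are in the top-k at a position.
    """
    graph = {}
    for row in logit_rows:
        kept = top_k_sparse(row, k)
        for i, vi in kept.items():
            for j, vj in kept.items():
                graph[(i, j)] = graph.get((i, j), 0.0) + vi * vj
    return graph


# Two sequence positions over a toy vocabulary of four tokens.
logit_rows = [
    [2.0, 1.0, 0.0, -1.0],   # top-2: tokens 0 and 1
    [0.5, 3.0, 2.0, 0.0],    # top-2: tokens 1 and 2
]
graph = coactivation_graph(logit_rows, k=2)
```

Tokens 0 and 2 never appear together in a top-2 set, so no edge connects them, while token 1 accumulates weight from both positions. With k fixed, each position contributes k² entries, which keeps the graph sparse relative to the full vocabulary.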

2505.03784 2026-05-25 cs.LG

Insulin Resistance Prediction From Wearables and Routine Blood Biomarkers

从可穿戴设备和常规血液生物标志物预测胰岛素抵抗

Ahmed A. Metwally, A. Ali Heydari, Daniel McDuff, Alexandru Solot, Zeinab Esmaeilpour, Anthony Z Faranesh, Menglian Zhou, David B. Savage, Conor Heneghan, Shwetak Patel, Cathy Speed, Javier L. Prieto

发表机构 * Google Research(谷歌研究) Institute of Metabolic Science, University of Cambridge(剑桥大学代谢科学研究所)

AI总结 该研究旨在利用可穿戴设备数据和常规血液生物标志物预测胰岛素抵抗,以实现糖尿病的早期干预。研究构建了深度神经网络模型,结合多源数据进行预测,取得了较高的准确率和泛化能力。模型在肥胖和久坐人群中表现尤为突出,并展示了与大型语言模型结合用于解释预测结果的潜力,为个性化健康管理提供了新方法。

详情
AI中文摘要

胰岛素抵抗是2型糖尿病的前兆,其特征是组织中胰岛素作用受损。当前测量胰岛素抵抗的方法虽然有效,但昂贵、难以获取、不广泛可用,并阻碍了早期干预的机会。在这项研究中,我们在美国远程招募了迄今为止最大的数据集来研究胰岛素抵抗(N=1,165名参与者,中位BMI=28 kg/m²,年龄=45岁,HbA1c=5.4%),整合了可穿戴设备时间序列数据和血液生物标志物,包括胰岛素抵抗的金标准测量——稳态模型评估胰岛素抵抗(HOMA-IR)。我们开发了深度神经网络模型,基于易于获取的数字和血液生物标志物预测胰岛素抵抗。结果表明,我们的模型通过结合可穿戴数据和易于获取的血液生物标志物,能够比单独使用任一数据源更好地预测胰岛素抵抗(R²=0.5,auROC=0.80,灵敏度=76%,特异性=84%)。在肥胖和久坐参与者(最易患2型糖尿病且能从早期干预中最大受益的亚群)中,模型显示出93%的灵敏度和95%的调整后特异性。对模型性能的严格评估,包括可解释性和鲁棒性,促进了在更大队列中的泛化能力,这一点通过在独立验证队列(N=72名参与者)上复现预测性能得到证明。此外,我们展示了如何将预测的胰岛素抵抗集成到大语言模型代理中,以帮助理解和情境化HOMA-IR值,促进解释和安全的个性化推荐。这项工作为早期检测2型糖尿病风险人群提供了可能,从而促进预防策略的早期实施。

英文摘要

Insulin resistance, a precursor to type 2 diabetes, is characterized by impaired insulin action in tissues. Current methods for measuring insulin resistance, while effective, are expensive, inaccessible, and not widely available, which hinders opportunities for early intervention. In this study, we remotely recruited the largest dataset to date across the US to study insulin resistance (N=1,165 participants, with median BMI=28 kg/m², age=45 years, HbA1c=5.4%), incorporating wearable device time series data and blood biomarkers, including the ground-truth measure of insulin resistance, homeostatic model assessment for insulin resistance (HOMA-IR). We developed deep neural network models to predict insulin resistance based on readily available digital and blood biomarkers. Our results show that our models can predict insulin resistance by combining both wearable data and readily available blood biomarkers better than either of the two data sources separately (R²=0.5, auROC=0.80, sensitivity=76%, and specificity=84%). The model showed 93% sensitivity and 95% adjusted specificity in obese and sedentary participants, a subpopulation most vulnerable to developing type 2 diabetes and who could benefit most from early intervention. Rigorous evaluation of model performance, including interpretability and robustness, facilitates generalizability across larger cohorts, which is demonstrated by reproducing the prediction performance on an independent validation cohort (N=72 participants). Additionally, we demonstrated how the predicted insulin resistance can be integrated into a large language model agent to help understand and contextualize HOMA-IR values, facilitating interpretation and safe personalized recommendations. This work offers the potential for early detection of people at risk of type 2 diabetes, thereby facilitating earlier implementation of preventative strategies.
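
For readers unfamiliar with the quantities in the abstract, the sketch below shows the standard HOMA-IR formula (fasting glucose in mg/dL times fasting insulin in µU/mL, divided by 405) and how sensitivity and specificity are computed from binary labels. The example values are made up for illustration, and the abstract's "adjusted specificity" is not reproduced here because its definition is not given.

```python
def homa_ir(fasting_glucose_mg_dl, fasting_insulin_uu_ml):
    """Standard HOMA-IR estimate from fasting glucose (mg/dL) and insulin (µU/mL)."""
    return fasting_glucose_mg_dl * fasting_insulin_uu_ml / 405.0


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up example: glucose 90 mg/dL and insulin 9 µU/mL.
example_homa_ir = homa_ir(90.0, 9.0)

# Made-up labels: 1 = insulin resistant, 0 = not insulin resistant.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

In the example, two of three resistant participants are detected (sensitivity 2/3) and three of four non-resistant participants are correctly cleared (specificity 3/4).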

2504.09846 2026-05-25 cs.LG cs.AI cs.HC

GlyTwin: Digital Twin for Glucose Control in Type 1 Diabetes Through Optimal Behavioral Modifications Using Patient-Centric Counterfactuals

GlyTwin: 通过以患者为中心的反事实实现1型糖尿病血糖控制的最佳行为修改的数字孪生

Asiful Arefeen, Saman Khamesian, Maria Adela Grando, Bithika Thompson, Hassan Ghasemzadeh

发表机构 * College of Health Solutions, Arizona State University(亚利桑那州立大学健康解决方案学院) School of Computing and Augmented Intelligence, Arizona State University(亚利桑那州立大学计算与增强智能学院) Department of Endocrinology, Mayo Clinic Arizona(梅奥诊所亚利桑那分部内分泌科)

AI总结 该研究提出了一种名为GlyTwin的数字孪生框架,用于通过行为优化改善1型糖尿病患者的血糖控制。其核心方法是结合反事实解释,模拟最优行为干预方案,如调整碳水化合物摄入和胰岛素剂量,以减少高血糖事件的发生。研究还引入了利益相关者的偏好,使干预方案更具个性化和实用性。实验结果表明,GlyTwin在生成有效反事实解释和预防高血糖方面优于现有方法,具有较高的实用价值。

详情
AI中文摘要

频繁和长期暴露于高血糖会增加慢性并发症的风险,包括神经病变、肾病和心血管疾病。现有的连续皮下胰岛素输注(CSII)和连续血糖监测(CGM)技术仅模拟血糖调节的特定方面,例如预测低血糖和给予小剂量胰岛素推注。同样,当前糖尿病管理中的数字孪生方法主要侧重于预测血糖对人类行为和胰岛素治疗的反应。因此,这些技术缺乏提供替代治疗方案的能力,而这些方案可以指导主动行为干预以实现最佳糖尿病管理。为填补这一空白,我们提出GlyTwin,一种新颖的计算框架,通过整合反事实解释来增强数字孪生技术,以模拟血糖控制的最佳行为治疗。GlyTwin通过推荐行为选择(如碳水化合物摄入和胰岛素剂量)的调整来生成反事实治疗,以显著减少高血糖事件的发生和持续时间。此外,GlyTwin将利益相关者的偏好纳入其干预生成过程,确保工具个性化和以用户为中心。我们在AZT1D上评估GlyTwin,该数据集是通过收集50名使用自动胰岛素输送(AID)系统的1型糖尿病(T1D)患者的纵向数据构建的,每人监测26天。结果表明,与历史数据相比,GlyTwin在生成反事实解释方面优于现有方法,有效解释率为85.8%,预防高血糖的有效性为87.3%。

英文摘要

Frequent and long-term exposure to hyperglycemia increases the risk of chronic complications, including neuropathy, nephropathy, and cardiovascular disease. Existing continuous subcutaneous insulin infusion (CSII) and continuous glucose monitoring (CGM) technologies model only specific aspects of glycemic regulation, such as predicting hypoglycemia and administering small insulin boluses. Similarly, current digital twin approaches in diabetes management primarily focus on predicting glucose responses to human behavior and insulin therapy. As a result, these technologies lack the ability to provide alternative treatment scenarios that could guide proactive behavioral interventions for optimal diabetes management. To address this gap, we propose GlyTwin, a novel computational framework that enhances digital twin technologies by integrating counterfactual explanations to simulate optimal behavioral treatments for glucose control. GlyTwin generates counterfactual treatments by recommending adjustments to behavioral choices, such as carbohydrate intake and insulin dosing, to significantly reduce the occurrence and duration of hyperglycemic events. In addition, GlyTwin incorporates stakeholder preferences into its intervention-generation process, ensuring that the tool is personalized and user-centric. We evaluate GlyTwin on AZT1D, a new dataset constructed by collecting longitudinal data from 50 individuals living with type 1 diabetes (T1D) on automated insulin delivery (AID) systems, each monitored for 26 days. Results show that GlyTwin outperforms state-of-the-art methods for generating counterfactual explanations, with 85.8% valid explanations and 87.3% effectiveness in preventing hyperglycemia compared with historical data.

2504.01542 2026-05-25 cs.CL

Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation

语域始终重要:通过语言变异视角分析LLM预训练数据

Amanda Myntti, Erik Henriksson, Veronika Laippala, Sampo Pyysalo

发表机构 * University of Turku(图尔库大学)

AI总结 本文研究了语言变体(如语域或文体)对大语言模型预训练数据质量及模型性能的影响。作者首次将语域分类这一语料语言学中的标准应用于预训练数据的筛选,并通过实验发现不同语域的文本对模型表现有显著影响。研究显示,使用新闻类文本预训练会导致性能下降,而意见类文本(如评论和博客)则对模型性能提升有积极作用,合理组合多种高表现语域可进一步优化模型效果。这些发现表明,语域是解释模型性能差异的重要因素,有助于未来更精准地选择预训练数据。

详情
AI中文摘要

预训练数据整理是大语言模型(LLM)开发的基石,导致对大型网络语料库质量过滤的研究日益增多。从统计质量标志到基于LLM的标注系统,数据集被划分为不同类别,通常简化为二元分类:通过过滤器的被视为有价值的样本,其余则被丢弃为无用或有害。然而,对不同类型文本对模型性能贡献的更详细理解仍然缺乏。在本文中,我们首次利用语域或体裁——语料库语言学中广泛使用的语言变异建模标准——来整理预训练数据集,并研究语域对LLM性能的影响。我们使用语域分类的数据训练小型生成模型,并通过标准基准进行评估,表明预训练数据的语域显著影响模型性能。我们揭示了预训练材料与最终模型之间令人惊讶的关系:使用“新闻”语域会导致性能不佳,相反,包含“观点”类(涵盖评论和观点博客等文本)则非常有益。虽然在整个未过滤数据集上训练的模型优于在单一语域数据集上训练的模型,但将表现良好的语域(如“操作指南”、“信息描述”和“观点”)组合起来会带来重大改进。此外,对单个基准结果的分析揭示了特定语域类作为预训练数据的优势和缺点的关键差异。这些发现表明,语域是模型变异的重要解释因素,并可以促进未来更谨慎的数据选择实践。

英文摘要

Pretraining data curation is a cornerstone in Large Language Model (LLM) development, leading to growing research on quality filtering of large web corpora. From statistical quality flags to LLM-based labelling systems, datasets are divided into categories, frequently reducing to a binary: those passing the filters are deemed as valuable examples, others are discarded as useless or detrimental. However, a more detailed understanding of the contribution of different kinds of texts to model performance is still largely lacking. In this article, we present the first study utilising registers or genres - a widely used standard in corpus linguistics to model linguistic variation - to curate pretraining datasets and investigate the effect of register on the performance of LLMs. We train small generative models with register classified data and evaluate them using standard benchmarks, and show that the register of pretraining data substantially affects model performance. We uncover surprising relationships between the pretraining material and the resulting models: using the News register results in subpar performance, and on the contrary, including the Opinion class, covering texts such as reviews and opinion blogs, is highly beneficial. While a model trained on the entire unfiltered dataset outperforms those trained on datasets limited to a single register, combining well-performing registers like How-to-Instructions, Informational Description, and Opinion leads to major improvements. Furthermore, analysis of individual benchmark results reveals key differences in the strengths and drawbacks of specific register classes as pretraining data. These findings show that register is an important explainer of model variation and can facilitate more deliberate future data selection practices.

2503.12868 2026-05-25 cs.CV

UniReg: A Universal Model for Controllable CT Image Registration

Zi Li, Jianpeng Zhang, Tai Ma, Tony C. W. Mok, Yan-Jie Zhou, Zeli Chen, Xianghua Ye, Le Lu, Cheng Chen, Dakai Jin

Affiliations: The University of Hong Kong; DAMO Academy, Alibaba Group; The First Affiliated Hospital of Zhejiang University

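AI summary: This paper proposes UniReg, a universal, controllable CT image registration model that addresses the poor generalization of existing methods across clinical scenarios and the need to train a separate network for each task. UniReg combines the precision of task-specific learning methods with the generalization of traditional optimization methods in a unified registration framework that adaptively estimates deformation fields from anatomical structure priors, registration type constraints, and instance-specific features, enabling optimal registration across scenarios. Experiments show that UniReg achieves better average registration accuracy than existing state-of-the-art methods on multiple CT/MR registration datasets while significantly reducing model redundancy and training cost.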

English abstract

Learning-based medical image registration has matched the accuracy of conventional methods while offering superior computational efficiency. However, existing approaches suffer from poor generalization across diverse clinical scenarios, requiring the laborious development of multiple isolated networks for specific registration tasks, e.g., inter-/intra-subject registration or anatomical region-specific alignment, leading to cumbersome development pipelines. To overcome this limitation, we propose UniReg, the first conditional unified model for multi-scenario CT image registration, which combines the precision advantages of task-specific learning methods with the generalization of traditional optimization methods. Our key innovation is a unified registration framework that adaptively estimates deformation fields conditioned on: (1) anatomical structure priors, (2) registration type constraints (inter/intra-subject), and (3) instance-specific features, enabling optimal alignment across heterogeneous scenarios within a single model. Through comprehensive experiments on multiple CT/MR registration datasets, UniReg achieves superior average registration accuracy compared with current state-of-the-art learning-based methods while exhibiting strong cross-scenario generalization. Moreover, by replacing multiple isolated task-specific models with a compact unified model, UniReg substantially reduces the overall training burden in terms of total training cost and model redundancy.
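
For readers unfamiliar with the term, a deformation field assigns each output pixel a displacement that says where to sample the moving image. The sketch below applies such a field to a tiny 2D grid with nearest-neighbour sampling and border clamping. It is a generic illustration only: `warp_nearest` and the data are hypothetical, and nothing here reflects how UniReg estimates or applies its fields.

```python
def warp_nearest(moving, disp):
    """Warp `moving` with a per-pixel displacement field `disp`.

    disp[r][c] is a (d_row, d_col) offset; the output at (r, c) samples
    moving at (r + d_row, c + d_col), rounded and clamped to the grid.
    """
    h, w = len(moving), len(moving[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            d_row, d_col = disp[r][c]
            src_r = min(max(int(round(r + d_row)), 0), h - 1)
            src_c = min(max(int(round(c + d_col)), 0), w - 1)
            out[r][c] = moving[src_r][src_c]
    return out

moving = [[1, 2, 3],
          [4, 5, 6]]
# Uniform field: every output pixel samples one column to the right.
disp = [[(0, 1)] * 3 for _ in range(2)]
warped = warp_nearest(moving, disp)  # last column is clamped to the border
```

Real registration pipelines work on 3D volumes and use interpolation rather than nearest-neighbour lookup; the 2D integer case is kept here only to make the sampling rule visible.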

2503.06684 2026-05-25 cs.CV

PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation

Yanjie Pan, Qingdong He, Zhengkai Jiang, Pengcheng Xu, Chaoyi Wang, Jinlong Peng, Haoxuan Wang, Yun Cao, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang

Affiliations: Fudan University; Tencent Youtu Lab; Western University; University of Chinese Academy of Sciences

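AI summary: This paper proposes PixelPonder, a new unified control framework for multi-condition text-to-image generation that addresses the difficulty of reconciling semantic fidelity and visual quality across multiple heterogeneous control signals. The method uses a patch-based adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level, giving precise local guidance without global interference. It pairs this with a time-aware control injection strategy that adjusts condition influence according to the denoising timestep, moving gradually from structure preservation to texture refinement. Experiments show that PixelPonder outperforms existing methods on multiple benchmark datasets, with strong spatial alignment accuracy and text-semantic consistency.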

English abstract

Recent advances in diffusion-based text-to-image generation have demonstrated promising results through visual condition control. However, existing ControlNet-like methods struggle with compositional visual conditioning - simultaneously preserving semantic fidelity across multiple heterogeneous control signals while maintaining high visual quality, where they employ separate control branches that often introduce conflicting guidance during the denoising process, leading to structural distortions and artifacts in generated images. To address this issue, we present PixelPonder, a novel unified control framework, which allows for effective control of multiple visual conditions under a single control structure. Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level, enabling precise local guidance without global interference. Additionally, a time-aware control injection scheme is deployed to modulate condition influence according to denoising timesteps, progressively transitioning from structural preservation to texture refinement and fully utilizing the control information from different categories to promote more harmonious image generation. Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets, showing superior improvement in spatial alignment accuracy while maintaining high textual semantic consistency.
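
The two ideas in the abstract can be pictured with a toy sketch: pick, for each patch, the condition with the highest relevance score, and blend structure-oriented and texture-oriented influence as a function of the denoising timestep. Everything below is an assumption made for illustration: the linear schedule, the relevance scores, the condition names, and both helper functions are hypothetical and are not the paper's formulation.

```python
def select_per_patch(scores):
    """Pick the highest-scoring condition for each patch.

    scores maps a condition name to a list of per-patch relevance values
    (hypothetical numbers; the abstract does not say how they are computed).
    """
    names = list(scores)
    num_patches = len(scores[names[0]])
    return [max(names, key=lambda k: scores[k][i]) for i in range(num_patches)]

def condition_weights(t, num_steps):
    """Assumed linear schedule: structure dominates at noisy steps, texture later.

    t counts down from num_steps - 1 (noisiest) to 0 (final step).
    """
    s = t / (num_steps - 1)
    return {"structure": s, "texture": 1.0 - s}

scores = {"depth": [0.9, 0.2, 0.3], "canny": [0.1, 0.7, 0.6]}
chosen = select_per_patch(scores)      # one condition name per patch
early = condition_weights(49, 50)      # first (noisiest) step
late = condition_weights(0, 50)        # last step
```

A hard per-patch argmax and a linear ramp are the simplest possible choices; a learned or soft selection would be equally consistent with the abstract's wording.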

2503.05534 2026-05-25 cs.CV

S4M: 4-points to Segment Anything

Adrien Meyer, Lorenzo Arboit, Giuseppe Massimiani, Shih-Min Yin, Didier Mutter, Nicolas Padoy

Affiliations: University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; Department of General Surgery, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University College of Medicine, Kaohsiung, Taiwan

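AI summary: This paper proposes S4M, an improved method that addresses the limited segmentation accuracy of the Segment Anything Model (SAM) in medical image segmentation caused by ambiguous point prompts. It introduces a structured four-point prompting strategy that uses extreme points or major/minor axis endpoints as an instance-level shape description, making the prompts more expressive. By expanding the prompt space and adding an auxiliary "Canvas" pretext task, S4M improves the model's understanding of geometric structure. Experiments show that it significantly improves segmentation performance on multiple ultrasound and surgical endoscopy datasets and reduces clinical annotation effort.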

English abstract

Purpose: The Segment Anything Model (SAM) promises to ease the annotation bottleneck in medical segmentation, but overlapping anatomy and blurred boundaries make its point prompts ambiguous, leading to cycles of manual refinement to achieve precise masks. Better prompting strategies are needed. Methods: We propose a structured prompting strategy using 4 points as a compact instance-level shape description. We study two 4-point variants: extreme points and the proposed major/minor axis endpoints, inspired by ultrasound measurement practice. SAM cannot fully exploit such structured prompts because it treats all points identically and lacks geometry-aware reasoning. To address this, we introduce S4M (4-points to Segment Anything), which augments SAM to interpret 4 points as relational cues rather than isolated clicks. S4M expands the prompt space with role-specific embeddings and adds an auxiliary "Canvas" pretext task that sketches coarse masks directly from prompts, fostering geometry-aware reasoning. Results: Across eight datasets in ultrasound and surgical endoscopy, S4M improves segmentation by +3.42 mIoU over a strong SAM baseline at equal prompt budget. An annotation study with three clinicians further shows that major/minor prompts enable faster annotation. Conclusion: S4M increases performance, reduces annotation effort, and aligns prompting with clinical practice, enabling more scalable dataset development in medical imaging. We release our code and pretrained models at https://github.com/CAMMA-public/S4M.
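
To make the extreme-points variant concrete, the sketch below derives four extreme points from a binary mask and computes the intersection over union (IoU) between two masks, the per-mask quantity that mIoU averages. The helper names and the toy masks are hypothetical; this is not the S4M code, and it does not cover the major/minor axis variant.

```python
def extreme_points(mask):
    """Return the (left, right, top, bottom) foreground pixels as (row, col)."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask has no foreground pixels")
    left = min(pts, key=lambda p: p[1])
    right = max(pts, key=lambda p: p[1])
    top = min(pts, key=lambda p: p[0])
    bottom = max(pts, key=lambda p: p[0])
    return left, right, top, bottom

def iou(a, b):
    """Intersection over union of two equally sized binary masks."""
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    inter = sum(1 for x, y in pairs if x and y)
    union = sum(1 for x, y in pairs if x or y)
    return inter / union if union else 1.0

mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]
pred = [row[:] for row in mask]
pred[3][2] = 0  # prediction misses the bottom pixel

points = extreme_points(mask)
score = iou(mask, pred)
```

When several pixels tie for an extreme, this sketch returns the first one in row-major order; an annotation tool could pick any of them.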