arXivDaily: arXiv daily academic digest, updated Monday through Friday
2509.13805 2026-06-02 cs.LG cs.AI stat.ML

Towards a Physics Foundation Model

Florian Wiesner, Zoë J. Gray, Matthias Wessling, Stephen Baek

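Affiliations: University of Cambridge

AI summary: Proposes the General Physics Transformer (GPhyT), trained on large-scale, diverse simulation data, so that a single model achieves zero-shot generalization and stable long-term prediction across multiple physics domains (such as fluid-solid interaction, shock waves, thermal convection, and multiphase flow), outperforming specialized architectures by more than 7x.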

Comments ICML-AI4Physics 2026

Abstract

Foundation models have revolutionized natural language processing through a "train once, deploy anywhere" paradigm, where a single pre-trained model adapts to countless downstream tasks without retraining. Access to a Physics Foundation Model (PFM) would be transformative: it would democratize access to high-fidelity simulations, accelerate scientific discovery, and eliminate the need for specialized solver development. Yet current physics-aware machine learning approaches remain fundamentally limited to single, narrow domains and require retraining for each new system. We present the General Physics Transformer (GPhyT), trained on 1.8 TB of diverse simulation data, which demonstrates that foundation model capabilities are achievable for physics. Our key insight is that transformers can learn to infer governing dynamics from context, enabling a single model to simulate fluid-solid interactions, shock waves, thermal convection, and multi-phase dynamics without being told the underlying equations. GPhyT achieves three critical breakthroughs: (1) superior performance across multiple physics domains, outperforming specialized architectures by more than 7x, (2) plausible zero-shot generalization to entirely unseen physical systems through in-context learning, and (3) more stable long-term predictions through long-horizon rollouts. By establishing that a single model can learn generalizable physical principles from data alone, this work opens the path toward a universal PFM that could transform computational science and engineering.

2601.17952 2026-06-02 cs.CL cs.AI

A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Transformer-Based Language Models

Michail Mamalakis, Tiago Azevedo, Cristian Cosentino, Chiara D'Ercoli, Subati Abulikemu, Zhongtian Sun, Richard Bethlehem, Pietro Lio

Affiliations: Department of Computer Science and Technology; Cancer Research UK Cambridge Institute; University of Cambridge, United Kingdom; DIMES, University of Calabria, Italy; Department of Computer Automatic and Management Engineering (DIAG), Sapienza Università di Roma; Department of Psychiatry; School of Computing, University of Kent; Department of Psychology

AI summary: Proposes a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction, producing stable input-level importance scores and supporting the safe application of language models in cognitive health and neurodegenerative disease.

Abstract

Interpretability remains a key challenge for deploying language models (LMs) in clinical settings such as progression diagnosis of Alzheimer's disease, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations due to the polysemantic nature of Transformer-based LM and LLM representations, while mechanistic interpretability approaches lack direct alignment with model inputs and outputs and do not provide explicit importance scores. We introduce a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction. By constructing a monosemantic embedding space at the level of a Transformer-based LM layer and optimizing the framework to explicitly reduce inter-method variability, our approach produces stable input-level importance scores and highlights salient features via a decompressed representation of the layer of interest, advancing the safe and trustworthy application of LMs in cognitive health and neurodegenerative disease.

2601.17898 2026-06-02 cs.CL

Assessment of Generative Named Entity Recognition in the Era of Large Language Models

Qi Zhan, Yile Wang, Hui Huang

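Affiliations: College of Computer Science and Software Engineering, Shenzhen University

AI summary: Systematically evaluates open-source large language models on flat and nested named entity recognition, finding that parameter-efficient fine-tuning combined with structured output formats reaches performance competitive with traditional models, that NER capability stems from instruction-following and generative ability rather than memorization, and that fine-tuning has little effect on general capabilities.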

Abstract

Named entity recognition (NER) is evolving from a sequence labeling task into a generative paradigm with the rise of large language models (LLMs). We conduct a systematic evaluation of open-source LLMs on both flat and nested NER tasks. We investigate several research questions including the performance gap between generative NER and traditional NER models, the impact of output formats, whether LLMs rely on memorization, and the preservation of general capabilities after fine-tuning. Through experiments across eight LLMs of varying scales and four standard NER datasets, we find that: (1) With parameter-efficient fine-tuning and structured formats like inline bracketed or XML, open-source LLMs achieve performance competitive with traditional encoder-based models and surpass decoder-based LLMs with in-context learning techniques; (2) The NER capability of LLMs stems from instruction-following and generative power, not mere memorization of entity-label pairs; and (3) Applying NER instruction tuning has minimal impact on general capabilities of LLMs, even improving performance on datasets like DROP by 25.50 to 45.32 F1 points due to enhanced entity understanding. These findings demonstrate that generative NER with LLMs is a promising, user-friendly alternative to traditional methods. We release the data and code at https://github.com/szu-tera/LLMs4NER.
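
To make the "structured formats" finding concrete, the following is a minimal sketch of how entity spans could be read back out of an XML-style generative NER output. The tag names (`PER`, `LOC`), the example sentence, and the helper `parse_xml_entities` are assumptions for illustration only; the prompts and output formats actually used in the paper may differ, and this sketch handles flat entities only.

```python
import re

# Matches <LABEL>text</LABEL>. Flat entities only: nested tags are not handled.
TAG_PATTERN = re.compile(r"<([A-Za-z_]+)>(.*?)</\1>")


def parse_xml_entities(generated: str) -> list[tuple[str, str]]:
    """Return (entity_text, label) pairs found in an XML-tagged sentence."""
    return [(m.group(2), m.group(1)) for m in TAG_PATTERN.finditer(generated)]


# Hypothetical model output for the input "Marie Curie was born in Warsaw."
example = "<PER>Marie Curie</PER> was born in <LOC>Warsaw</LOC>."
entities = parse_xml_entities(example)
```

A structured format like this keeps the generated text aligned with the input sentence, which makes it straightforward to score against span-level gold annotations.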

2404.01356 2026-06-02 cs.LG cs.AI cs.CY

Perturbation Effects on Accuracy and Fairness among Similar Individuals

Xuran Li, Hao Xue, Peng Wu, Xingjun Ma, Zhen Zhang, Huaming Chen, Flora D. Salim

Affiliations: University of New South Wales; The Hong Kong University of Science and Technology; Key Laboratory of System Software, Institute of Software, Chinese Academy of Sciences; Fudan University; Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences; The University of Sydney

AI summary: Proposes the notion of Robust Individual Fairness (RIF) and develops RIFair, a black-box adversarial framework whose decoupled perturbation strategy exposes robustness and fairness defects that coexist under semantics-preserving perturbations.

Abstract

Deep neural networks are vulnerable to adversarial perturbations that can simultaneously degrade prediction robustness and individual fairness across diverse application settings. However, existing evaluation protocols typically assess these dimensions in isolation, thereby obscuring critical failure modes. To bridge this gap, we formalize Robust Individual Fairness (RIF): under semantic-preserving (truth-condition-preserving) perturbations, predictions should remain both correct with respect to the ground truth and invariant across semantically equivalent individuals. To surface RIF violations in practice, we introduce RIFair, a black-box adversarial framework that leverages a decoupled perturbation strategy to construct semantically preserved yet unrobust and/or unfair instance pairs. Experiments across multiple model architectures and real-world textual datasets show that robustness-only or fairness-only metrics often miss Robust Biased and Unrobust Fair behaviors. RIFair reliably exposes these hidden vulnerabilities, supporting RIF as a necessary criterion for trustworthy model assessment. The experimental code is publicly available at https://github.com/Xuran-LI/RIFair.

2601.16462 2026-06-02 cs.CL

Finding What Matters: Anchoring Context Knowledge with Evolving Indices for Iterative Retrieval

Mingyan Wu, Zhenghao Liu, Xinze Li, Yuqing Lan, Yukun Yan, Shuo Wang, Cheng Yang, Minghe Yu, Zheni Zeng, Maosong Sun

Affiliations: School of Computer Science and Engineering, Northeastern University, China; Department of Computer Science and Technology, Institute for AI, Tsinghua University, China; Beijing National Research Center for Information Science and Technology, China; School of Computer Science, Beijing University of Posts and Telecommunications, China

AI summary: Proposes KAIR, a framework that anchors key evidence with a knowledge index updated dynamically during iterative retrieval, guiding large language models to reason effectively in multi-hop question answering and reducing interference from noise.

Abstract

Retrieval-Augmented Generation (RAG) has become a dominant paradigm for mitigating hallucinations in Large Language Models (LLMs) by incorporating external knowledge. However, existing RAG systems often struggle to effectively integrate and reason over key evidence scattered across noisy retrieved documents, particularly in multi-hop scenarios. In this paper, we propose KAIR, a Knowledge Anchoring framework for Iterative Retrieval that anchors knowledge within retrieved knowledge to guide LLMs to locate the key information. During iterative retrieval, KAIR progressively updates the knowledge index to anchor salient evidence from retrieved documents. The evolving index serves as a navigational anchoring index that enables the LLM to assess knowledge sufficiency and formulate subsequent retrieval queries. Finally, KAIR generates answers by jointly leveraging the retrieved documents and the finalized anchoring index. Experiments on four multi-hop question answering benchmarks demonstrate that KAIR consistently outperforms strong RAG baselines. Further analysis shows that KAIR effectively anchors key knowledge and alleviates the context noise during iterative retrieval, improving the LLM's ability to associate and reason over dispersed evidence across retrieved documents. All code and data are available at https://github.com/NEUIR/KAIR.
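
The control flow described in the abstract (retrieve, update the anchoring index, check sufficiency, form the next query, then answer from documents plus index) can be sketched roughly as below. This is not the authors' implementation: every function name, the list-of-strings index, and the toy corpus are hypothetical stand-ins, and in a real system the callables would wrap a retriever and LLM prompts.

```python
from typing import Callable


def iterative_anchored_retrieval(
    question: str,
    retrieve: Callable[[str], list[str]],
    update_index: Callable[[str, list[str], list[str]], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    next_query: Callable[[str, list[str]], str],
    answer: Callable[[str, list[str], list[str]], str],
    max_rounds: int = 4,
) -> str:
    """Hypothetical loop: the index evolves each round and steers retrieval."""
    documents: list[str] = []
    index: list[str] = []
    query = question
    for _ in range(max_rounds):
        new_docs = retrieve(query)
        documents.extend(new_docs)
        # Anchor salient evidence from the newly retrieved documents.
        index = update_index(question, index, new_docs)
        if is_sufficient(question, index):
            break
        # The evolving index guides the follow-up query.
        query = next_query(question, index)
    # The final answer uses both the documents and the finalized index.
    return answer(question, documents, index)


# Toy two-hop example with stub components (all invented for illustration).
corpus = {
    "film x director": ["Film X was directed by Ada Example."],
    "ada example birthplace": ["Ada Example was born in Lyon."],
}
result = iterative_anchored_retrieval(
    question="film x director",
    retrieve=lambda q: corpus.get(q, []),
    update_index=lambda question, index, docs: index + docs,
    is_sufficient=lambda question, index: any("born in" in e for e in index),
    next_query=lambda question, index: "ada example birthplace",
    answer=lambda question, docs, index: index[-1].rsplit(" ", 1)[-1].rstrip("."),
)
```

Passing the components in as callables keeps the loop itself independent of any particular retriever or model, which is a design choice of this sketch rather than something stated in the paper.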

2601.14230 2026-06-02 cs.CL cs.AI cs.HC

MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

Yiyang Wang, Yiqiao Jin, Alex Cabral, Josiah Hester

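Affiliations: Georgia Institute of Technology

AI summary: To address persona collapse and social sycophancy in multi-agent systems, proposes the MASCOT framework, which improves role consistency and dialogue contribution through a bi-level optimization strategy (Persona-Aware Behavioral Alignment and Collaborative Dialogue Optimization).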

Comments 15 pages, 9 figures. https://hello-diana.github.io/MASCOT/

Abstract

Multi-agent systems (MAS) are emerging as promising socio-collaborative companions for emotional and cognitive support. However, existing systems frequently suffer from persona collapse, where agents revert to generic, homogenized assistant behaviors, and social sycophancy, where agents produce redundant, non-constructive dialogue. We propose MASCOT, a multi-agent framework for multi-perspective socio-collaborative companions. MASCOT introduces a novel bi-level optimization strategy to harmonize individual and collective behaviors: 1) Persona-Aware Behavioral Alignment, an RLAIF-driven pipeline that fine-tunes individual agents for agent-specific identities; and 2) Collaborative Dialogue Optimization, a group-level adaptation process that promotes complementary, diverse, and productive discourse. We evaluate MASCOT using human-grounded contexts drawn across both in-domain and out-of-domain (OOD) settings against state-of-the-art baselines. MASCOT improves persona consistency by up to +14.1 and social contribution by up to +10.6. A broad evaluation suite, including human evaluation, multiple LLM judges, three-way comparisons, and automatic metrics, further shows that MASCOT produces more role-consistent and less redundant multi-agent dialogue.

2601.13574 2026-06-02 cs.RO

Highly Deformable Proprioceptive Membrane for Real-Time 3D Shape Reconstruction

Guanyu Xu, Jiaqi Wang, Dezhong Tong, Xiaonan Huang

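Affiliations: arXiv.org cs.RO

AI summary: Proposes a flexible, stretchable proprioceptive silicone membrane based on optical waveguide sensing; a data-driven model decodes deformation-dependent light-intensity signals for real-time 3D shape reconstruction, reaching a 90 Hz update rate and a 1.307 mm average reconstruction error on a 140 mm square membrane.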

Comments 13 pages, 9 figures

Abstract

Reconstructing the three-dimensional (3D) geometry of object surfaces is essential for robot perception, yet vision-based approaches degrade under low illumination or occlusion. This limitation motivates the design of a proprioceptive membrane that conforms to the surface of interest and infers 3D geometry by reconstructing its own deformation. Conventional deformation-aware membranes typically rely on resistive, capacitive, or magneto-sensitive mechanisms, but can suffer from structural complexity, limited compliance during large-scale deformation, and susceptibility to electromagnetic interference. This work presents a soft, flexible, and stretchable proprioceptive silicone membrane based on optical waveguide sensing. The membrane integrates edge-mounted LEDs and centrally-distributed photodiodes (PDs) within a multilayer elastomeric composite. Rich deformation-dependent light-intensity signals are decoded by a data-driven model to recover the membrane geometry. Real-time reconstruction is demonstrated on a customized 140 mm square membrane at an end-to-end update rate of 90 Hz, achieving an average reconstruction error of 1.307 mm for out-of-plane deformation of up to 25 mm. The proposed sensor also demonstrates accurate reconstruction under large in-plane deformation, achieving reliable shape recovery up to 75% strain with an average Chamfer distance of 1.214 mm. The proposed framework provides a scalable, robust, and low-profile solution for global shape perception in deformable robotic systems.
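
For readers unfamiliar with the Chamfer distance quoted above, here is a small sketch of one common convention: the average of the two directed mean nearest-neighbour distances between point sets. The abstract does not say which normalization the authors use (summed, averaged, or squared variants all exist), so treat this as an assumption rather than their exact metric.

```python
import math

Point = tuple[float, float, float]


def chamfer_distance(a: list[Point], b: list[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both ways."""

    def mean_nearest(src: list[Point], dst: list[Point]) -> float:
        # For each source point, distance to its closest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(a, b) + mean_nearest(b, a))


# Two parallel two-point sets, offset by 1 unit along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reconstructed = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(reference, reconstructed)
```

This brute-force version is quadratic in the number of points; it is meant to show the definition, not to be used on dense meshes.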

2601.11460 2026-06-02 cs.RO cs.LG

Semantic-Geometric Task Representations for Bimanual Manipulation from Human Demonstrations to Robot Action Planning

Franziska Herbert, Vignesh Prasad, Han Liu, Dorothea Koert, Georgia Chalvatzaki

Affiliations: Interactive Robot Perception & Learning (PEARL) Lab, Computer Science Dept., TU Darmstadt, Germany; Hessian.AI, Darmstadt, Germany; Robotics Institute Germany (RIG); Interactive AI Algorithms & Cognitive Models for Human-AI Interaction (IKIDA), Computer Science Dept., TU Darmstadt, Germany; Center for Cognitive Science, TU Darmstadt, Germany

AI summary: Proposes a semantic-geometric graph task representation in which a message passing neural network encoder and a Transformer decoder jointly encode object identities, semantic relations, and motion histories, learning structured task representations from human demonstrations and supporting cross-embodiment transfer and bimanual manipulation planning.

Comments 9 pages, 7 figures, preprint

Abstract

Learning structured task representations from human demonstrations is essential for bimanual manipulation, where action ordering, object involvement, and interaction geometry vary significantly across executions. A key challenge lies in jointly capturing the discrete semantic task structure and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. We introduce a semantic-geometric graph-based task representation that jointly encodes object identities, inter-object semantic relations, and per-object motion histories, via a Message Passing Neural Network (MPNN) encoder and a Transformer-based decoder. The encoder operates solely on the temporal scene graph, producing structured representations decoupled from action labels. The decoder then conditions on action-context to forecast future actions, associated objects, and object motions. This decoupling learns task-agnostic representations, enabling encoder reuse across embodiments through decoder-only finetuning on a small robot dataset. Across eleven bimanual tasks from two datasets, we find that the benefit of structured semantic-geometric representations over simpler sequence-based models grows with task variability in action ordering and object involvement. At deployment, a planner couples the action and motion predictions with learned Probabilistic Movement Primitives, achieving full task success on two real-robot bimanual tasks and outperforming graph ablations, Transformer, decoder-only, and finetuned vision-language model baselines.

2508.06407 2026-06-02 cs.CV cs.AI eess.IV

A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus

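Affiliations: University of Malaya

AI summary: Proposes an algorithm that incorporates the classification objective into the super-resolution process, optimizing loss functions that account for both image quality and classification performance to raise SAR image resolution and improve classification accuracy.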

Journal ref
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 19, pp. 6614-6622, 2026

Abstract

High-resolution imagery plays a critical role in improving the performance of visual recognition tasks such as classification, detection, and segmentation. In many domains, including remote sensing and surveillance, low-resolution images can limit the accuracy of automated analysis. To address this, super-resolution (SR) techniques have been widely adopted to attempt to reconstruct high-resolution images from low-resolution inputs. Related traditional approaches focus solely on enhancing image quality based on pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance largely underexplored. This raises a key question: can integrating classification objectives directly into the super-resolution process further improve classification accuracy? In this paper, we try to respond to this question by investigating the relationship between super-resolution and classification through the deployment of a specialised algorithmic strategy. We propose a novel methodology that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance. Our approach improves image quality, as measured by scientifically ascertained image quality indicators, while also enhancing classification accuracy.
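
The idea of a loss that accounts for both image quality and classification performance can be sketched as a weighted sum of a pixel term and a classification term. The abstract does not name the specific losses or weights, so the choice of mean squared error, cross-entropy, and the weights `alpha` and `beta` below are assumptions made purely for illustration.

```python
import math


def mse(pred: list[float], target: list[float]) -> float:
    """Pixel-level mean squared error over flattened images."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits: list[float], label: int) -> float:
    """Softmax cross-entropy for one sample, via a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def classification_aware_loss(
    sr: list[float],
    hr: list[float],
    logits: list[float],
    label: int,
    alpha: float = 1.0,  # hypothetical weight on image quality
    beta: float = 0.1,   # hypothetical weight on classification
) -> float:
    """Weighted sum of an image-quality term and a classification term."""
    return alpha * mse(sr, hr) + beta * cross_entropy(logits, label)


# Perfect reconstruction, but an undecided two-class classifier.
loss = classification_aware_loss(
    sr=[0.2, 0.4], hr=[0.2, 0.4], logits=[0.0, 0.0], label=0
)
```

In this toy case the pixel term is zero, so the remaining loss comes entirely from the classifier's uncertainty, which is the kind of signal a pixel-only objective would ignore.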

2601.09696 2026-06-02 cs.CL

Empathy Applicability Modeling for General Health Queries

Shan Randhawa, Agha Ali Raza, Kentaro Toyama, Julie Hui, Mustafa Naseem

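Affiliations: University of Michigan; Lahore University of Management Sciences

AI summary: Proposes the Empathy Applicability Framework (EAF), which classifies patient queries by the applicability of emotional reactions and interpretations based on clinical, contextual, and linguistic cues, to support empathetic communication in asynchronous healthcare.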

Comments Accepted at Findings of ACL 2026

Abstract

LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by human annotators and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and supports empathetic communication in asynchronous healthcare.

2512.16401 2026-06-02 cs.CL

Navigating the Reality Gap: On-Device Continual Adaptation of ASR for Clinical Telephony

Darshil Chauhan, Adityasinh Solanki, Vansh Patel, Kanav Kapoor, Ritvik Jain, Aditya Bansal, Pratik Narang, Dhruv Kumar

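Affiliations: BITS Pilani, Pilani Campus, India; Qure.ai, India

AI summary: To address the severe degradation of ASR in clinical telephony, uses the Gram Vaani corpus as a proxy to study on-device continual adaptation and finds a critical interaction between experience replay and elastic weight consolidation.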

Comments 17 pages. Under review

Abstract

Automatic Speech Recognition (ASR) can significantly reduce documentation burden in clinical workflows, but standard models degrade sharply in real-world telephony settings where noisy audio, dialectal variation, and strict data residency constraints prevent cloud-based adaptation. We study this "reality gap" using Gram Vaani, a telephonic Hindi corpus spanning rural healthcare and agricultural helplines, as the closest available proxy for clinical speech under strict on-device constraints. We show that a robust multilingual model (IndicWav2Vec) degrades from 11.59% WER on standard clean Hindi to 41.71% WER on this proxy telephony data. We evaluate a progression of on-device adaptation regimes under realistic constraints, from full fine-tuning to parameter-efficient LoRA and stream-based continual learning, across multiple baselines, datasets, and seeds. Focusing on continual learning, our central finding highlights a critical interaction between Experience Replay (ER) and Elastic Weight Consolidation (EWC, parameterized by regularization strength λ). We show that standard positive EWC (λ > 0) can oppose replay-driven updates, limiting adaptation. Reversing EWC's strength (λ < 0) suggests that it can act as a directional control signal under ER-guided adaptation: negative λ reinforces replay-driven plasticity, while a scheduled λ enables phase-dependent control of stability and plasticity. Across evaluations on multiple datasets, we find that multi-domain replay provides a strong foundation for adaptation, while EWC modulates stability-plasticity dynamics without altering final performance. These results show that effective on-device adaptation depends on understanding how data-driven and parameter-level learning signals interact, rather than choosing methods in isolation.
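
As a reminder of what the sign of λ does, the sketch below writes out the standard EWC quadratic penalty, (λ/2) Σ F_i (θ_i − θ*_i)², and its gradient in plain Python. The parameter values and Fisher estimates are made up for illustration, and the function names are hypothetical; how the authors combine this term with replay and schedule λ is not specified in the abstract.

```python
def ewc_penalty(
    params: list[float],
    anchor_params: list[float],
    fisher: list[float],
    lam: float,
) -> float:
    """Standard EWC term: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def ewc_gradient(
    params: list[float],
    anchor_params: list[float],
    fisher: list[float],
    lam: float,
) -> list[float]:
    """Gradient of the penalty with respect to each parameter."""
    return [lam * f * (p - a) for p, a, f in zip(params, anchor_params, fisher)]


# Illustrative values only.
theta = [1.0, 2.0]
theta_anchor = [0.0, 0.0]
fisher_diag = [1.0, 0.5]

positive = ewc_penalty(theta, theta_anchor, fisher_diag, lam=2.0)
negative = ewc_penalty(theta, theta_anchor, fisher_diag, lam=-2.0)
grad = ewc_gradient(theta, theta_anchor, fisher_diag, lam=2.0)
```

With λ > 0 the gradient of this term points away from the anchor, so a descent step pulls parameters back toward it; flipping the sign of λ flips that gradient, which is consistent with the abstract's description of negative λ reinforcing plasticity.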

2512.07436 2026-06-02 cs.AI

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Hao Chen, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, Ting Su

Affiliations: Meituan, Beijing, China; East China Normal University Shanghai Innovation Institute; University of Science and Technology of China; Shanghai Jiaotong University; North China University of Technology, Beijing, China; East China Normal University, Shanghai, China

AI summary: For local life services, proposes LocalSearchBench, a benchmark with 1.3 million merchant entries and 900 multi-hop QA tasks, and develops the unified environment LocalPlayground; experiments show that existing large reasoning models perform poorly.

Comments 12 pages; accepted to KDD 2026

Journal ref
Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD 2026), August 9-13, 2026, Jeju Island, Republic of Korea. ACM, New York, NY, USA, 12 pages

Abstract

Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explore vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompasses diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, so they remain challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench comprises a database of over 1.3M merchant entries across 6 service categories and 9 major cities, and 900 multi-hop QA tasks from real user queries that require multi-step reasoning. We also developed LocalPlayground, a unified environment integrating multiple tools for LRM interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.2) achieves only 35.60% correctness, and most models have issues with completeness (average 60.32%) and faithfulness (average 30.72%). This highlights the need for specialized benchmarks and domain-specific agent training in local life services. Code, benchmark, and leaderboard are available at https://localsearchbench.github.io/.

2601.04946 2026-06-02 cs.CV cs.AI

Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

Subhadeep Roy, Gagan Bhatia, Steffen Eger

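Affiliations: University of Technology Nuremberg

AI summary: Using PROTOBIAS, a controlled diagnostic benchmark, identifies and verifies prototypicality bias in multimodal evaluation metrics, namely a tendency to prefer images that are visually or socially prototypical but semantically incorrect, and proposes PROTOSCORE, a lightweight contrastively trained evaluator, as a mitigation baseline.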

Abstract

Automatic metrics are widely used to evaluate text-to-image models, often replacing human judgment in benchmarking, model selection, and large-scale data filtering. Yet they may reward images that look plausible or prototypical rather than images that faithfully satisfy the prompt. We identify prototypicality bias as a systematic blindspot in multimodal evaluation: metrics can prefer a semantically incorrect but visually or socially prototypical image over a correct but less prototypical one. We introduce PROTOBIAS, a controlled diagnostic benchmark across Animals, Objects, and Demography, where semantically correct images are contrasted with plausible prototypical adversaries containing a single controlled semantic violation. Grounded in prototype theory and social-category prototypicality, PROTOBIAS is constructed with multiple prompt generators, image generators, and independent VLM filters, and validated through prompt-quality, human-annotation, and image-quality controls. Using PROTOBIAS, we show that widely used embedding, reward, VQA-based, and VLM-as-judge metrics frequently fail these contrasts, while human judgments remain more faithful to semantic correctness. We further introduce PROTOSCORE, a lightweight contrastively trained evaluator, as an initial mitigation baseline. PROTOBIAS provides a focused benchmark for measuring prototypicality-driven metric failures and developing more semantically faithful T2I evaluators.
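
One simple way to quantify "failing a contrast" is the fraction of pairs in which a metric scores the prototypical adversary at least as high as the semantically correct image. The sketch below assumes that reading; the function name, the treatment of ties as failures, and the scores are all hypothetical, and the benchmark's actual scoring protocol may differ.

```python
def contrast_failure_rate(pairs: list[tuple[float, float]]) -> float:
    """Share of (score_correct, score_adversary) pairs the metric gets wrong.

    A pair counts as a failure when the adversary scores at least as high as
    the correct image (ties counted as failures, by assumption).
    """
    failures = sum(1 for correct, adversary in pairs if adversary >= correct)
    return failures / len(pairs)


# Invented metric scores for four prompt-matched image pairs.
scores = [(0.8, 0.6), (0.5, 0.7), (0.4, 0.4), (0.9, 0.1)]
rate = contrast_failure_rate(scores)
```

Because each pair differs by a single controlled semantic violation, a high failure rate under this kind of count points at the metric rather than at image quality differences.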

2511.01938 2026-06-02 cs.LG cs.AI

The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

Tiberiu Musat

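Affiliations: ETH Zurich

AI summary: From a constrained-optimization perspective, proves that in the limit of infinitesimally small learning rate and weight decay coefficient, gradient descent minimizes the weight norm on the zero-loss manifold; introduces an approximation that decouples the learning dynamics of a subset of parameters; derives a closed-form expression for the post-memorization dynamics of the first layer of a two-layer network; and confirms experimentally that the framework reproduces the delayed generalization and representation learning characteristic of grokking.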

Abstract

Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.
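The claim that gradient descent with weight decay minimizes the weight norm on the zero-loss manifold can be illustrated on a toy problem. The sketch below is not the paper's setting or derivation: it assumes a hypothetical two-parameter loss `(w1*w2 - 1)**2`, whose zero-loss manifold is the hyperbola `w1*w2 = 1`, and the step size and weight decay values are arbitrary small choices. Starting from an unbalanced zero-loss point, the iterate should drift slowly along the manifold toward the balanced, minimum-norm point near (1, 1).

```python
def train(w1, w2, lr=0.01, weight_decay=0.01, steps=100_000):
    """Gradient descent with weight decay on the toy loss (w1*w2 - 1)**2.

    Toy illustration only; the loss and hyperparameters are assumptions.
    """
    for _ in range(steps):
        residual = w1 * w2 - 1.0
        # Gradient of the loss plus the gradient of (weight_decay / 2) * ||w||^2.
        g1 = 2.0 * residual * w2 + weight_decay * w1
        g2 = 2.0 * residual * w1 + weight_decay * w2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1, w2


# Start on the zero-loss manifold (4 * 0.25 = 1) with a large weight norm.
w1, w2 = train(4.0, 0.25)
print("final weights:", w1, w2)
print("squared norm:", w1 ** 2 + w2 ** 2)
```

Because the weight decay coefficient is small but nonzero, the final point sits slightly off the manifold; the abstract's formal statement concerns the limit where both the learning rate and the weight decay coefficient go to zero.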

2408.11266 2026-06-02 cs.LG cs.NA math.NA

Practical Aspects on Solving Differential Equations Using Deep Learning: A Primer

使用深度学习求解微分方程的实践方面:入门指南

Georgios Is. Detorakis

发表机构 * International Centre for Neuromorphic Systems(神经形态系统国际中心) Department of Computer Science(计算机科学系) University of Manchester(曼彻斯特大学)

AI总结 本文介绍深度学习核心概念,重点阐述如何利用神经网络求解偏微分方程,包括实现方法、超参数选择及精度提升技巧,并强调无需GPU即可复现。

Comments 34 pages, 13 figures, primer (tutorial)

AI中文摘要

深度学习如今在许多科学领域都很常见,包括偏微分方程的研究。本文简要、易懂地介绍了深度学习的核心概念,包括神经网络、反向传播和通用逼近定理。它主要涵盖了如何使用深度学习求解微分方程。本文旨在帮助数学、物理及相关领域的本科生和研究生学习如何使用深度学习求解偏微分方程。数学或物理教师也可以使用本文向学生介绍深度伽辽金方法和科学深度学习。我们关注关键问题:什么是深度学习,它如何帮助解决数学或物理问题?如何实现神经网络并选择正确的数值方法来求解微分方程?如何选择最佳超参数?如何提高精度并加速收敛?需要说明的是,本文中的所有问题都可以在没有GPU的机器上解决,因此任何学生都可以遵循所介绍的方法。

英文摘要

Deep learning is now common across many scientific fields, including the study of partial differential equations. This article provides a brief, accessible introduction to core deep learning concepts, including neural networks, backpropagation, and the universal approximation theorem. It mainly covers how to use deep learning in solving differential equations. The article aims to help undergraduate and graduate students in mathematics, physics, and related areas learn how to use Deep Learning to solve partial differential equations. Instructors in mathematics or physics can also use this article to introduce students to Deep Galerkin method and scientific deep learning. We focus on key questions: What is deep learning, and how can it help solve mathematical or physical problems? How can you implement a neural network and choose the right numerical method to solve differential equations? How do you select the best hyperparameters? How can you improve accuracy and speed up convergence? We should mention that all the problems in this article can be solved on a machine without a GPU, so any student can follow the presented methodology.

2601.03615 2026-06-02 cs.CL cs.SD eess.AS

SARA: Stress Test Reasoning in Audio Deepfake Detection

SARA: 音频深度伪造检测中的压力测试推理

Binh Nguyen, Charles Fleming, Thai Le

发表机构 * Indiana University(印第安纳大学) Cisco Research(思科研究)

AI总结 提出SARA框架,通过声学感知、推理-判决一致性与不和谐三个维度评估音频语言模型在对抗攻击下的推理可靠性,发现声学攻击降低一致性而语言攻击保持一致性但成功率更高,且推理轨迹的文本一致性可作为检测对抗样本的潜在指标。

Comments Preprint for ACL 2026 submission

AI中文摘要

音频语言模型(ALMs)通过提供推理轨迹来透明化其预测,为可解释的音频深度伪造检测(ADD)提供了有前景的转变,超越了黑盒分类器。然而,这种推理可能不支持模型预测,反映出一致性差,或者更糟的是,可能用看似合理但具有误导性的解释来合理化错误预测。此外,ALM推理在对抗攻击下的行为仍未得到充分探索,引发了关于这种解释能力实际可靠性的疑问。为填补这一空白,本研究引入了SARA(音频推理的移位分析),这是一个诊断框架,从三个维度评估ALM推理:声学感知、推理-判决一致性与不和谐。我们针对声学和语言对抗攻击测试了五个开源ALM。结果表明,声学攻击显著降低了推理-判决一致性(平均下降14.20%),经常引发内部逻辑冲突。相反,语言攻击在保持推理一致性的同时实现了更高的攻击成功率。我们进一步证明,生成的推理轨迹的文本一致性也可作为对抗输入的潜在指标,从而无需访问原始声学信号即可有效检测受扰音频(F1为0.78)。这些发现表明,即使最终分类输出受损,推理轨迹仍具有诊断效用。

英文摘要

Audio Language Models (ALMs) offer a promising shift towards explainable audio deepfake detection (ADD), moving beyond black-box classifiers by providing transparency to their predictions via reasoning traces. However, such reasoning may not support the model predictions, reflecting poor coherence, or, worse, may rationalize incorrect predictions with plausible but misleading explanations. Moreover, the behavior of ALM reasoning under adversarial attacks remains under-explored, raising questions about the practical reliability of such explanation capabilities. To address this gap, this study introduces SARA (Shift Analysis of Reasoning in Audio), a diagnostic framework that evaluates ALM reasoning across three dimensions: acoustic perception, reasoning-verdict coherence, and dissonance. We test five open-source ALMs against both acoustic and linguistic adversarial attacks. We show that acoustic attacks significantly degrade reasoning-verdict coherence (average decrease of 14.20%), frequently inducing internal logical conflicts. Conversely, linguistic attacks achieve higher attack success rates while maintaining reasoning coherence. We further demonstrate that the textual coherence of generated reasoning traces also serves as a latent indicator of adversarial inputs, enabling effective detection of perturbed audio (0.78 in F1) without accessing the raw acoustic signal. These findings suggest that reasoning traces provide diagnostic utility that persists even when final classification outputs are compromised.

2601.03309 2026-06-02 cs.CV cs.AI

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

VLM4VLA:重新审视视觉-语言-动作模型中的视觉-语言模型

Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University(清华大学交叉信息研究院) Qwen Team, Alibaba Inc.(阿里巴巴公司Qwen团队)

AI总结 本文通过VLM4VLA最小适配管道,系统研究视觉-语言模型(VLM)的选择和能力如何影响下游视觉-语言-动作(VLA)策略性能,发现VLM通用能力无法预测下游任务表现,且视觉模块是性能瓶颈。

AI中文摘要

视觉-语言-动作(VLA)模型将预训练的大型视觉-语言模型(VLM)集成到其策略主干中,因其有前景的泛化能力而受到广泛关注。本文重新审视了一个基本但很少被系统研究的问题:VLM的选择和能力如何转化为下游VLA策略的性能?我们引入了VLM4VLA,一个最小适配管道,仅使用少量新的可学习参数将通用VLM转换为VLA策略,以实现公平高效的比较。尽管简单,VLM4VLA被证明与更复杂的网络设计相比具有惊人的竞争力。通过在三个基准上的各种下游任务进行广泛的实证研究,我们发现虽然VLM初始化比从头训练提供了一致的优势,但VLM的通用能力并不能很好地预测其下游任务性能。这挑战了常见的假设,表明标准VLM能力对于有效的具身控制是必要但不充分的。我们进一步通过微调VLM在七个辅助具身任务(例如,具身问答、视觉指向、深度估计)上研究特定具身能力的影响。与直觉相反,提高VLM在特定具身技能上的性能并不能保证更好的下游控制性能。最后,模态级别的消融实验确定VLM中的视觉模块(而非语言组件)是主要的性能瓶颈。我们证明,即使在下游微调期间编码器保持冻结,向VLM的视觉编码器注入控制相关的监督也能带来一致的收益。这隔离了当前VLM预训练目标与具身动作规划需求之间持续的领域差距。

英文摘要

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in the VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action-planning.

2601.00664 2026-06-02 cs.LG cs.AI cs.CV cs.HC cs.MM

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

Avatar Forcing:用于自然对话的实时交互式头部化身生成

Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院) NTU Singapore(南洋理工大学) DeepAuto.ai

AI总结 提出Avatar Forcing框架,通过扩散强制实现实时交互式头部化身生成,利用直接偏好优化进行无标签学习,在低延迟(约500ms)下生成富有表现力的反应动作。

Comments CVPR 2026. Project page: https://taekyungki.github.io/AvatarForcing/

AI中文摘要

说话头部生成从静态肖像创建逼真的化身,用于虚拟通信和内容创作。然而,当前的模型尚未传达真正交互式通信的感觉,通常生成缺乏情感投入的单向响应。我们确定了实现真正交互式化身的两个关键挑战:在因果约束下实时生成运动,以及在没有额外标注数据的情况下学习富有表现力、生动的反应。为了解决这些挑战,我们提出了Avatar Forcing,一种新的交互式头部化身生成框架,通过扩散强制建模实时用户-化身交互。该设计允许化身处理实时多模态输入,包括用户的音频和运动,以低延迟即时响应语言和非语言线索,如言语、点头和笑声。此外,我们引入了一种直接偏好优化方法,利用通过丢弃用户条件构建的合成失败样本,实现无标签的富有表现力交互学习。实验结果表明,我们的框架能够实现低延迟(约500ms)的实时交互,相比基线加速6.8倍,并生成反应性和富有表现力的化身运动,在80%以上的情况下优于基线。

英文摘要

Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving a 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over the baseline in more than 80% of cases.

2601.00212 2026-06-02 cs.CV

IntraStyler: Intra-Domain Style Synthesis for Cross-Modality MRI Domain Adaptation

IntraStyler: 跨模态MRI域适应的域内风格合成

Han Liu, Yubo Fan, Hao Li, Dewei Hu, Daniel Moyer, Zhoubing Xu, Benoit M. Dawant, Ipek Oguz

发表机构 * Siemens Healthineers(西门子医疗) Princeton, NJ, USA(新泽西州普林斯顿) Vanderbilt University(范德比尔特大学) Mayo Clinic(梅奥诊所) Johnson & Johnson Innovative Medicine(强生创新医学)

AI总结 针对T2 MRI中前庭神经鞘瘤和耳蜗分割的域适应问题,提出IntraStyler方法,通过对比学习提取与解剖解耦的风格嵌入,自动发现并合成目标域内多样化的风格图像,提升下游分割模型的泛化性。

Comments Extension of our 1st place solution for the CrossMoDA 2023 challenge

AI中文摘要

从T2 MRI中分割前庭神经鞘瘤和耳蜗在临床上很重要,但需要大量标注。域适应(DA)已被广泛用于弥合标记的对比增强T1和未标记的T2数据集之间的差距。现有方法专注于跨域对齐,但目标域内的域内变异性在很大程度上被忽视。同一域的图像可能因不同的扫描仪、场强和采集协议而存在显著差异。忽略这种变异性会产生同质的合成图像,限制了下游分割模型的泛化能力。为了解决这个问题,我们提出了IntraStyler,一种3D非配对图像翻译方法,无需任何预定义的子域即可自动发现细粒度的域内风格,并使用每幅图像的风格参考合成多样化的目标域图像。为此,我们设计了一个3D风格编码器,通过新颖的对比学习目标进行训练,以提取与解剖解耦的纯风格嵌入。IntraStyler基于CrossMoDA挑战赛的第一名解决方案构建并进一步改进,生成更多样化的合成数据,实现更可靠的下游分割。代码可在https://github.com/MedICL-VU/IntraStyler获取。

英文摘要

Segmentation of vestibular schwannoma and cochlea from T2 MRI is clinically important yet annotation-intensive. Domain adaptation (DA) has been widely adopted to bridge the gap between labeled contrast-enhanced T1 and unlabeled T2 datasets. While existing methods focus on cross-domain alignment, intra-domain variability within the target domain remains largely overlooked. Images from the same domain may vary substantially due to different scanners, field strengths, and acquisition protocols. Ignoring this variability produces homogeneous synthetic images that limit the generalizability of downstream segmentation models. To address this, we propose IntraStyler, a 3D unpaired image translation method that automatically discovers fine-grained intra-domain styles without any predefined sub-domains, and synthesizes diverse target domain images using per-image style references. To this end, we design a 3D style encoder trained with a novel contrastive learning objective to extract style-only embeddings disentangled from anatomy. IntraStyler is built upon the 1st place CrossMoDA challenge solution and further advances it, generating more diverse synthetic data and achieving more reliable downstream segmentation. Code is available at https://github.com/MedICL-VU/IntraStyler.

2512.22702 2026-06-02 cs.LG

Position: Current Benchmarking Hinders Real Progress in Deep Learning for Time Series Forecasting

立场:当前基准测试阻碍了时间序列预测深度学习的真正进展

Valentina Moretti, Ivan Marisca, Cesare Alippi, Andrea Cini

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文指出当前基准测试实践未能识别影响性能的关键设计因素,尤其是全局性与局部性等被忽视的方面对预测方法类别和实验结果有重大影响,并提出辅助预测模型卡以改进架构比较。

Comments ICML 2026

AI中文摘要

深度学习模型在时间序列应用中越来越受欢迎。然而,大量新提出的架构和经常矛盾的实证结果使得评估哪种设计选择和模型组件驱动性能变得困难。在这篇立场论文中,我们认为当前的基准测试实践未能识别导致性能差异的因素,从而减缓了该领域的进展。特别是,在比较架构时,关键设计维度的差异被忽视,最终导致不一致的结果。为了支持我们的立场,我们展示了这些差异——通常被视为单纯的实现细节——可能比采用特定的序列建模层具有更大的影响。我们讨论了被忽视的方面(如全局性和局部性)如何(1)从根本上改变预测方法的类别,以及(2)极大地影响实证结果。我们的发现表明,需要重新思考我们的基准测试实践,并在设计和比较架构时关注预测问题的基本方面。作为具体步骤,我们提出了一个辅助预测模型卡,即一个包含一组字段的模板,用于根据关键设计选择来表征现有和新的预测架构。

英文摘要

Deep learning models have grown popular in time series applications. However, the large quantity of newly proposed architectures and the often contradictory empirical results make it difficult to assess which design choices and model components drive performance. In this position paper, we argue that current benchmarking practices fail to identify the factors responsible for performance differences, thus slowing down progress in the field. In particular, differences in crucial design dimensions are overlooked when comparing architectures, ultimately leading to inconsistent outcomes. To support our position, we show that such differences, often treated as mere implementation details, can have a greater impact than adopting specific sequence modeling layers. We discuss how overlooked aspects (such as globality and locality) can (1) fundamentally change the class of the forecasting method and (2) drastically affect empirical results. Our findings suggest rethinking our benchmarking practices and focusing on the foundational aspects of the forecasting problem when designing and comparing architectures. As a concrete step, we propose an auxiliary forecasting model card, i.e., a template with a set of fields to characterize existing and new forecasting architectures based on key design choices.

2512.21472 2026-06-02 cs.CV

IMA++: ISIC Archive Multi-Annotator Dermoscopic Skin Lesion Segmentation Dataset

IMA++: ISIC档案多标注者皮肤镜皮损分割数据集

Kumar Abhishek, Jeremy Kawahara, Ghassan Hamarneh

发表机构 * Medical Image Analysis Lab, School of Computing Science, Simon Fraser University(医学影像分析实验室,计算科学学院,西蒙弗雷泽大学) AIP Labs(AIP实验室)

AI总结 提出ISIC MultiAnnot++数据集,包含14,967张皮肤镜图像和17,684个分割掩码,其中2,394张图像有2-5个标注,并附带标注者技能水平和工具元数据,支持多标注者医学图像分割研究。

Comments Published in IEEE Data Descriptions, 12 pages, 7 figures

Journal ref
IEEE Data Descr. 3 (2026) 367-378
AI中文摘要

多标注者医学图像分割是一个重要的研究问题,但需要昂贵收集的标注数据集。皮肤镜皮损成像允许人类专家和AI系统观察在常规临床照片中无法辨别的形态结构。然而,目前没有大规模公开可用的、带有标注者标签的多标注者皮损分割(SLS)数据集用于皮肤镜皮损成像。我们引入了ISIC MultiAnnot++,一个大型公开的多标注者皮损分割数据集,图像来自ISIC档案。最终数据集包含14,967张皮肤镜图像的17,684个分割掩码,其中2,394张皮肤镜图像每张有2-5个分割,使其成为最大的公开SLS数据集。此外,还包括关于分割的元数据,包括标注者的技能水平和分割工具,支持诸如分割的标注者特定偏好建模和标注者元数据分析等研究主题。我们对该数据集的特征、策划的数据分区和共识分割掩码进行了分析。

英文摘要

Multi-annotator medical image segmentation is an important research problem, but requires annotated datasets that are expensive to collect. Dermoscopic skin lesion imaging allows human experts and AI systems to observe morphological structures otherwise not discernable from regular clinical photographs. However, currently there are no large-scale publicly available multi-annotator skin lesion segmentation (SLS) datasets with annotator-labels for dermoscopic skin lesion imaging. We introduce ISIC MultiAnnot++, a large public multi-annotator skin lesion segmentation dataset for images from the ISIC Archive. The final dataset contains 17,684 segmentation masks spanning 14,967 dermoscopic images, where 2,394 dermoscopic images have 2-5 segmentations per image, making it the largest publicly available SLS dataset. Further, metadata about the segmentation, including the annotators' skill level and segmentation tool, is included, enabling research on topics such as annotator-specific preference modeling for segmentation and annotator metadata analysis. We provide an analysis on the characteristics of this dataset, curated data partitions, and consensus segmentation masks.

2512.20251 2026-06-02 cs.CV eess.IV

Degradation-Aware Metric Prompting for Hyperspectral Image Restoration

退化感知度量提示用于高光谱图像恢复

Binfeng Wang, Di Wang, Haonan Guo, Ying Fu, Jing Zhang

AI总结 提出退化感知度量提示(DAMP)框架,通过可解释的空间-光谱度量作为退化提示,结合退化自适应混合专家(DAMoE)模块,实现多维度退化统一恢复,在自然和遥感高光谱数据集上达到最先进性能并展现零样本泛化能力。

Comments Accepted by ICML 2026

AI中文摘要

统一高光谱图像(HSI)恢复旨在单个模型中恢复多种退化。然而,当前方法通常依赖于不切实际的显式先验或过拟合训练分布的不透明黑盒表示,阻碍了对未见场景的泛化。为弥补这一差距,我们提出退化感知度量提示(DAMP),一种新颖框架,通过可解释的空间-光谱度量表征多维退化。这些度量作为退化提示(DP),使模型能够捕捉任务间的共享特征并适应未知损坏。我们框架的核心是退化自适应混合专家(DAMoE),其中空间-光谱自适应模块(SSAM)作为专家,利用可学习的融合系数专门处理不同的退化程度。通过使用DP作为门控路由器,DAMoE动态激活针对特定退化特征定制的专家。在自然和遥感HSI数据集上的大量实验表明,DAMP实现了最先进的性能,并在未见恢复任务上展现出卓越的零样本泛化能力。代码公开于 https://github.com/MiliLab/DAMP。

英文摘要

Unified hyperspectral image (HSI) restoration aims to recover diverse degradations within a single model. However, current methods often rely on impractical explicit priors or opaque black-box representations that overfit to training distributions, hampering generalization to unseen scenarios. To bridge this gap, we propose Degradation-Aware Metric Prompting (DAMP), a novel framework that characterizes multi-dimensional degradations through interpretable spatial-spectral metrics. These metrics serve as Degradation Prompts (DP), enabling the model to capture shared characteristics across tasks and adapt to unknown corruptions. Central to our framework is the Degradation-Adaptive Mixture-of-Experts (DAMoE), where Spatial-Spectral Adaptive Modules (SSAMs) serve as experts that utilize learnable fusion coefficients to specialize in distinct degradation degrees. By using DP as a gating router, DAMoE dynamically activates specialized experts tailored to the specific degradation profile. Extensive experiments on natural and remote sensing HSI datasets demonstrate that DAMP achieves state-of-the-art performance and exhibits exceptional zero-shot generalization on unseen restoration tasks. Code is publicly available at https://github.com/MiliLab/DAMP.

2512.20638 2026-06-02 cs.CL cs.AI cs.LG

Uncovering Competency Gaps in Large Language Models and Their Benchmarks

揭示大型语言模型及其基准测试中的能力差距

Maty Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Christoph Bregler, Stephanie C. Y. Chan

发表机构 * University of Zurich(苏黎世大学)

AI总结 提出一种基于稀疏自编码器概念激活的新方法,自动发现模型在细粒度概念上的弱点(模型差距)和基准测试覆盖不平衡(基准差距),并通过内部表示评估和跨基准比较进行验证。

Journal ref
ICML 2026
AI中文摘要

大型语言模型的评估严重依赖标准化基准测试。这些基准测试提供了有用的聚合指标,但可能掩盖(i)模型薄弱的特定子领域(“模型差距”)和(ii)基准测试本身的不平衡覆盖(“基准差距”)。为了自动揭示这两类差距,我们提出了一种简单的新方法,利用稀疏自编码器的概念激活,在逐概念基础上识别细粒度差距。该方法还受益于将评估基于模型的内部表示,以及易于跨基准测试进行比较。我们将该方法应用于五个流行的开源模型和十几个基准测试,作为示例说明。作为对该方法的验证,我们发现我们的自动无监督方法能够恢复文献中先前记录的模型差距(例如与谄媚相关的差距),并识别出新的模型差距。我们还能够自动揭示基准差距:应属于给定基准测试范围的核心概念。我们的“能力差距”方法可以通过提供模型行为的概念级分解,并帮助基准测试开发者迭代基准测试设计,来补充现有基准测试。代码可在 https://competency-gaps.github.io 获取。

英文摘要

The evaluation of large language models relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics, but can obscure (i) particular sub-areas where the models are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). To automatically uncover both types of gaps, we propose a simple new method using concept activations from sparse autoencoders, to identify fine-grained gaps on a per-concept basis. The method also benefits from grounding evaluation in the model's internal representations, as well as easy comparison across benchmarks. We applied the method to five popular open-source models and more than a dozen benchmarks, as illustrative examples. As validation of the approach, we found that our automatic, unsupervised method was able to recover model gaps that have been previously documented in the literature (e.g. relating to sycophancy), in addition to identifying novel model gaps. We were also able to automatically uncover benchmark gaps: core concepts that should fall within the scope of a given benchmark. Our "competency gaps" method can be used to complement existing benchmarks, by providing a concept-level decomposition of model behavior, and by helping benchmark developers iterate upon benchmark design. Code is available at https://competency-gaps.github.io.

2508.20072 2026-06-02 cs.CV cs.LG cs.RO

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

离散扩散VLA:将离散扩散引入视觉-语言-动作策略中的动作解码

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

发表机构 * arXiv.org University of Science and Technology of China(中国科学技术大学)

AI总结 提出离散扩散VLA,通过将动作块离散化并在统一Transformer骨干内使用离散扩散模式进行渐进细化,实现自适应解码顺序和错误纠正,在多个基准上取得高性能并保留预训练的视觉-语言先验。

Comments Accepted by ICML 2026. 17 pages

AI中文摘要

视觉-语言-动作(VLA)模型将大型视觉-语言骨干网络适配为将图像和指令映射为机器人动作。然而,当前的VLA要么以固定的从左到右顺序自回归生成动作,性能较差;要么在骨干网络外附加独立的扩散头,这会割裂信息通路并阻碍统一、可扩展的架构。相反,我们提出了离散扩散VLA,它将动作块离散化,并使用离散扩散模式在统一的Transformer骨干内保留渐进细化。我们的方法实现了自适应解码顺序,在解决较难的动作元素之前先解决高置信度的动作元素,并采用二次重掩码来重新审视不确定的预测,从而实现鲁棒的纠错。这种设计保留了预训练的视觉-语言先验,支持并行解码,并提高了效率。离散扩散VLA在LIBERO上达到96.4%的平均成功率,在SimplerEnv-Fractal上达到71.2%的视觉匹配,在SimplerEnv-Bridge上达到54.2%的整体性能。在LIBERO-Goal的分布外测试中,我们的方法仅表现出0.8%的语言退化(相比之下并行解码为8.0%),以及20.4%的视觉退化(相比之下连续扩散为29.0%),表明其很好地保留了预训练的视觉-语言能力。我们还在AgileX Cobot Magic平台上进行了两次真实机器人评估,以展示该方法的有效性。

英文摘要

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order with poor performance or attach separate diffusion heads outside the backbone, which fragments information pathways and hinders unified, scalable architectures. Instead, we present Discrete Diffusion VLA, which discretizes action chunks and models them with a discrete diffusion pattern, retaining progressive refinement inside the unified transformer backbone. Our method achieves an adaptive decoding order that resolves high-confidence action elements before harder ones and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision-language priors, supports parallel decoding, and improves efficiency. Discrete Diffusion VLA achieves 96.4% avg. success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution tests of LIBERO-Goal, our method exhibits only 0.8% language degradation versus 8.0% for parallel decoding, and 20.4% vision degradation versus 29.0% for continuous diffusion, demonstrating good retention of pretrained vision-language capabilities. We also conduct two real-robot evaluations on the AgileX Cobot Magic platform to show the method's effectiveness.
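The adaptive decoding order described above, where high-confidence action elements are resolved before harder ones, can be sketched as confidence-ordered parallel unmasking. This is a generic illustration rather than the authors' implementation: `predict_fn`, the `MASK` sentinel, and the even per-step schedule are all assumptions, and the secondary re-masking step mentioned in the abstract is deliberately left out.

```python
import numpy as np

MASK = -1  # hypothetical sentinel marking a not-yet-decoded action token


def decode(predict_fn, length, steps=4):
    """Fill `length` token slots over at most `steps` rounds, most confident first.

    `predict_fn(tokens)` is an assumed callable returning an array of shape
    (length, vocab) with per-position token probabilities.
    """
    tokens = np.full(length, MASK)
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = np.asarray(predict_fn(tokens))
        confidence = probs.max(axis=1)
        prediction = probs.argmax(axis=1)
        # Spread the remaining masked slots evenly over the remaining rounds.
        k = int(np.ceil(masked.size / (steps - step)))
        # Commit the k masked positions with the highest confidence.
        chosen = masked[np.argsort(-confidence[masked])[:k]]
        tokens[chosen] = prediction[chosen]
    return tokens


# Toy stand-in for a model: fixed probabilities regardless of context.
toy_probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
])
decoded = decode(lambda tokens: toy_probs, length=4, steps=2)
print(decoded)
```

In this toy run the two most confident positions (0 and 2) are committed in the first round and the remaining two in the second; a real model would recompute its probabilities from the partially decoded sequence at each round.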

2512.15647 2026-06-02 cs.CV

Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift

硬标签登场!重新思考硬标签在缓解局部语义漂移中的作用

Jiacheng Cui, Bingkui Tong, Xinyue Bi, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对软标签在稀疏监督下导致的局部语义漂移问题,提出混合硬标签与软标签的HALD训练范式,在数据集蒸馏和大规模分类任务中提升泛化性能。

Comments ICML 2026. Code at: https://github.com/Jiacheng8/HALD

AI中文摘要

来自教师模型的软标签是知识迁移和大规模数据集蒸馏(例如SRe2L、LPLD)的实际做法。然而,当我们限制每张图像的裁剪数量以减少存储预计算软标签的巨大成本时,这些方法会严重遭受局部语义漂移:视觉上模糊的裁剪可能导致软监督偏离图像级别的真实语义,导致持续错误和训练-测试分布不匹配。我们重新审视了硬标签被忽视的作用,并表明当适当整合时,它们可以作为内容不变的语义锚点来校准这种漂移。我们从理论上分析了稀疏软标签监督下漂移的出现,并证明混合硬标签和软标签可以恢复视觉内容与语义监督之间的对齐。基于这一见解,我们提出了一种新的训练范式——用于缓解局部语义漂移的硬标签(HALD),它使用硬标签作为中间校正信号,同时保留软标签的细粒度优势。在数据集蒸馏和大规模分类基准上的大量实验显示了一致的泛化改进。在ImageNet-1K上,我们的方法仅使用285M软标签存储(减少100倍)就达到了42.7%的准确率,优于先前最先进的LPLD 9.0%。

英文摘要

Soft labels from teacher models are a de facto practice for knowledge transfer and large-scale dataset distillation (e.g., SRe2L, LPLD). However, when we limit the number of crops per image to reduce the substantial cost of storing precomputed soft labels, these methods suffer severely from local semantic drift: visually ambiguous crops can cause soft supervision to deviate from the image-level ground-truth semantics, leading to persistent errors and a train-test distribution mismatch. We revisit the overlooked role of hard labels and show that, when properly integrated, they can act as a content-invariant semantic anchor that calibrates such drift. We theoretically analyze the emergence of drift under sparse soft-label supervision and demonstrate that hybridizing hard and soft labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which uses hard labels as intermediate corrective signals while preserving the fine-grained benefits of soft labels. Extensive experiments on dataset distillation and large-scale classification benchmarks show consistent generalization improvements. On ImageNet-1K, our method achieves 42.7% accuracy with only 285M soft-label storage (a 100X reduction), outperforming the prior state-of-the-art LPLD by 9.0%.

2506.13702 2026-06-02 cs.LG cs.AI

Value-Free Policy Optimization via Reward Partitioning

通过奖励划分实现无价值函数策略优化

Bilal Faye, Hanane Azzag, Mustapha Lebbah

发表机构 * LIPN, Université Paris 13(巴黎第十三大学LIPN实验室) Université Paris 13(巴黎第十三大学) Université de Versailles Saint-Quentin Paris(凡尔赛圣昆廷大学)

AI总结 提出Reward Partition Optimization (RPO)方法,通过基于划分的奖励归一化消除价值函数学习,实现稳定、高效的策略优化。

AI中文摘要

单轨迹偏好优化方法从(提示, 响应, 奖励)元组的数据集中学习,通过直接利用标量反馈为成对偏好学习提供了一种实用的替代方案。现有方法如直接奖励优化(DRO)已显示出有希望的结果,但依赖于价值函数估计,引入了额外的方差、优化复杂性和对离策略数据的敏感性。我们引入了奖励划分优化(RPO),一种简单且可扩展的奖励驱动目标,消除了对价值函数学习的需要。RPO通过直接从提示级奖励分布估计的基于划分的公式对奖励进行归一化,产生稳定的监督优化目标,无需辅助模型或强化学习循环。我们使用自动评估指标、LLM作为评判员的评估和优化稳定性分析,在多个编码器-解码器和仅解码器语言模型上评估RPO。实验结果表明,RPO在生成更对齐、更多样化和更少有毒内容的同时,始终优于强基线,包括SFT、KTO和DRO。

英文摘要

Single-trajectory preference optimization methods learn from datasets of (prompt, response, reward) tuples, offering a practical alternative to pairwise preference learning by directly leveraging scalar feedback. Existing approaches such as Direct Reward Optimization (DRO) have demonstrated promising results but rely on value function estimation, introducing additional variance, optimization complexity, and sensitivity to off-policy data. We introduce Reward Partition Optimization (RPO), a simple and scalable reward-driven objective that eliminates the need for value function learning. RPO normalizes rewards through a partition-based formulation estimated directly from prompt-level reward distributions, yielding a stable supervised optimization objective without auxiliary models or reinforcement learning loops. We evaluate RPO across multiple encoder-decoder and decoder-only language models using automatic metrics, LLM-as-a-judge evaluations, and optimization stability analyses. Experimental results show that RPO consistently outperforms strong baselines, including SFT, KTO, and DRO, while producing more aligned, diverse, and less toxic generations.

2505.08438 2026-06-02 cs.CV cs.AI

A Survey of 3D Reconstruction with Event Cameras

事件相机三维重建综述

Chuanzhi Xu, Haoxian Zhou, Langyi Chen, Haodong Chen, Zeke Zexi Hu, Zhicheng Lu, Ying Zhou, Vera Chung, Qiang Qu, Weidong Cai

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文首次全面综述了基于事件相机的三维重建方法,按输入模态(立体、单目、多模态)和重建技术(几何、深度学习、神经渲染如NeRF和3DGS)分类,并讨论了数据集、评估、表示和动态场景重建等挑战。

Comments This survey has been accepted for publication in the Computational Visual Media Journal

AI中文摘要

事件相机正迅速成为用于三维重建的强大视觉传感器,能够异步捕捉每个像素的亮度变化。与传统基于帧的相机相比,事件相机产生稀疏但时间密集的数据流,即使在高速运动、低光照和极端动态范围等挑战性条件下,也能实现鲁棒且准确的三维重建。这些能力为自动驾驶、机器人、空中导航和沉浸式虚拟现实等各个领域的变革性应用提供了巨大前景。在本文中,我们首次专门针对基于事件的三维重建进行了全面综述。现有方法根据输入模态系统地分为立体、单目和多模态系统,并根据重建方法进一步分类,包括基于几何的技术、深度学习方法以及神经渲染技术,如神经辐射场(NeRF)和3D高斯泼溅(3DGS)。在每个类别中,方法按时间顺序组织,以突出关键概念和进展的演变。此外,我们详细总结了专门适用于基于事件重建任务的公开数据集。最后,我们讨论了数据集可用性、标准化评估、有效表示和动态场景重建方面的重大开放挑战,并概述了未来研究的有见地的方向。本综述旨在作为重要参考,并为推进事件驱动三维重建的最新技术提供清晰且激励人心的路线图。

英文摘要

Event cameras are rapidly emerging as powerful vision sensors for 3D reconstruction, uniquely capable of asynchronously capturing per-pixel brightness changes. Compared to traditional frame-based cameras, event cameras produce sparse yet temporally dense data streams, enabling robust and accurate 3D reconstruction even under challenging conditions such as high-speed motion, low illumination, and extreme dynamic range scenarios. These capabilities offer substantial promise for transformative applications across various fields, including autonomous driving, robotics, aerial navigation, and immersive virtual reality. In this survey, we present the first comprehensive review exclusively dedicated to event-based 3D reconstruction. Existing approaches are systematically categorised based on input modality into stereo, monocular, and multimodal systems, and further classified according to reconstruction methodologies, including geometry-based techniques, deep learning approaches, and neural rendering techniques such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Within each category, methods are chronologically organised to highlight the evolution of key concepts and advancements. Furthermore, we provide a detailed summary of publicly available datasets specifically suited to event-based reconstruction tasks. Finally, we discuss significant open challenges in dataset availability, standardised evaluation, effective representation, and dynamic scene reconstruction, outlining insightful directions for future research. This survey aims to serve as an essential reference and provides a clear and motivating roadmap toward advancing the state of the art in event-driven 3D reconstruction.

2512.18336 2026-06-02 cs.RO cs.AI cs.LG

Dynamic Entropy Tuning in Reinforcement Learning Low-Level Quadcopter Control: Stochasticity vs Determinism

强化学习低层四旋翼控制中的动态熵调节:随机性与确定性

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

AI总结 研究在四旋翼控制中,通过动态熵调节训练随机策略的强化学习算法,并与确定性策略算法对比,发现动态熵调节可防止灾难性遗忘并提高探索效率。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

Journal ref
2024 IEEE 34th International Conference on Computer Theory and Applications (ICCTA)
AI中文摘要

本文探讨了在训练随机策略的强化学习算法中动态熵调节的影响,并将其性能与训练确定性策略的算法进行了比较。随机策略通过优化动作的概率分布来最大化奖励,而确定性策略则为每个状态选择一个确定的动作。本文研究了使用静态熵和动态熵训练随机策略,然后执行确定性动作来控制四旋翼的效果,并与训练确定性策略并执行确定性动作进行了对比。为此,随机算法选择了软演员-评论家(SAC)算法,确定性算法选择了双延迟深度确定性策略梯度(TD3)算法。训练和仿真结果表明,动态熵调节通过防止灾难性遗忘和提高探索效率,对控制四旋翼产生了积极影响。

英文摘要

This paper explores the impact of dynamic entropy tuning in Reinforcement Learning (RL) algorithms that train a stochastic policy. Its performance is compared against algorithms that train a deterministic one. Stochastic policies optimize a probability distribution over actions to maximize rewards, while deterministic policies select a single deterministic action per state. The effect of training a stochastic policy with both static entropy and dynamic entropy and then executing deterministic actions to control the quadcopter is explored. It is then compared against training a deterministic policy and executing deterministic actions. For the purpose of this research, the Soft Actor-Critic (SAC) algorithm was chosen for the stochastic algorithm while the Twin Delayed Deep Deterministic Policy Gradient (TD3) was chosen for the deterministic algorithm. The training and simulation results show the positive effect the dynamic entropy tuning has on controlling the quadcopter by preventing catastrophic forgetting and improving exploration efficiency.
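The abstract does not spell out how the entropy is tuned. As a point of reference, the sketch below shows the automatic temperature adjustment commonly used with SAC, where the entropy coefficient alpha is updated by gradient descent on J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)]; a static-entropy variant would simply keep alpha fixed. Whether the paper uses exactly this rule is an assumption, and the function name, learning rate, and sample values are hypothetical.

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=1e-3):
    """One gradient step on J(alpha) = E[-alpha * (log_pi + target_entropy)].

    alpha is parameterised as exp(log_alpha) so that it stays positive.
    `log_probs` holds log pi(a|s) for a batch of sampled actions.
    """
    alpha = math.exp(log_alpha)
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_gap  # dJ/d(log_alpha)
    return log_alpha - lr * grad


# Policy entropy estimate (0.5) is below the target (1.0): alpha should grow.
raised = update_log_alpha(0.0, [-0.5, -0.5], target_entropy=1.0)
# Policy entropy estimate (2.0) is above the target (1.0): alpha should shrink.
lowered = update_log_alpha(0.0, [-2.0, -2.0], target_entropy=1.0)
print(math.exp(raised), math.exp(lowered))
```

The design intent is that exploration pressure rises when the policy becomes too deterministic and falls once it is sufficiently random, instead of being fixed for the whole run.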

2512.18333 2026-06-02 cs.RO cs.AI cs.LG

Reinforcement Learning Position Control of a Quadrotor Using Soft Actor-Critic (SAC)

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

Affiliation * arXiv.org

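AI summary Proposes a reinforcement-learning-based thrust vector control architecture for quadrotors, trained with the Soft Actor-Critic algorithm; compared with conventional RPM controllers, it trains faster and follows paths more smoothly and accurately.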

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

Details
Journal ref
2024 IEEE 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES)
Abstract

This paper proposes a new Reinforcement Learning (RL) based control architecture for quadrotors. With the literature focusing on controlling the four rotors' RPMs directly, this paper aims to control the quadrotor's thrust vector. The RL agent computes the percentage of overall thrust along the quadrotor's z-axis along with the desired Roll ($ϕ$) and Pitch ($θ$) angles. The agent then sends the calculated control signals along with the current quadrotor's Yaw angle ($ψ$) to an attitude PID controller. The PID controller then maps the control signals to motor RPMs. The Soft Actor-Critic algorithm, a model-free off-policy stochastic RL algorithm, was used to train the RL agents. Training results show the faster training time of the proposed thrust vector controller in comparison to the conventional RPM controllers. Simulation results show smoother and more accurate path-following for the proposed thrust vector controller.

2512.17605 2026-06-02 cs.CV cs.AI

MGRegBench: A Novel Benchmark Dataset with Anatomical Landmarks for Mammography Image Registration

Svetlana Krasnova, Emiliya Starikova, Ilia Naletov, Andrey Krylov, Dmitry Sorokin

Affiliation * MSU (Moscow State University)

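AI summary To address the lack of public datasets and standardized benchmarks for mammography image registration, presents MGRegBench, which comprises more than 5,000 image pairs plus 100 pairs with manually annotated anatomical landmarks, and evaluates a range of registration methods.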

Details
Abstract

Robust mammography registration is essential for clinically relevant applications like tracking disease progression in breast tissue. However, progress has been limited by the absence of transparent public datasets and reproducible standardized benchmarks. Existing studies are often not directly comparable, as they use private data and inconsistent evaluation frameworks. To address this, we present MGRegBench, a patient-disjoint, leakage-controlled evaluation protocol for mammography registration, comprising over 5,000 image pairs, each with a breast segmentation mask, and 100 pairs with manually annotated anatomical landmarks, plus standardized train/evaluation splits and ready-to-run baselines. Using this resource, we benchmark diverse registration methods -- including classical (ANTs), learning-based (VoxelMorph, TransMorph), implicit neural representation (IDIR), a mammography-specific approach, and a recent deep learning method MammoRegNet, with implementations adapted to this modality, and validate generalization on the independent SDM-MCs dataset. Our contributions are: (1) the first public dataset of this scale with manual landmarks and masks for mammography registration; (2) a transparent, leakage-controlled benchmark enabling the first like-for-like comparison of diverse classical and machine learning-based methods; (3) external validation on SDM-MCs to test whether the main trend transfers beyond MGRegBench; and (4) an extensive analysis of deep learning-based registration. We publicly release our code and data to establish a foundational resource for fair, reproducible, and clinically relevant comparisons and catalyze future research in AI-driven medical imaging.
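To make the term "patient-disjoint" concrete, here is a minimal sketch of one way such a split could be built so that no patient contributes image pairs to both training and evaluation. It is not MGRegBench's actual procedure: the record layout (`patient_id`, `pair_id`), the hash-based assignment, and the 20% evaluation fraction are assumptions chosen for illustration.

```python
import hashlib

def patient_disjoint_split(pairs, eval_fraction=0.2):
    """Send every image pair of a given patient to the same split.

    The split is decided from a hash of the patient ID, so it is
    deterministic and independent of the order of the records.
    """
    train, evaluation = [], []
    for pair in pairs:
        digest = hashlib.sha256(pair["patient_id"].encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100  # stable bucket in [0, 99] per patient
        if bucket < eval_fraction * 100:
            evaluation.append(pair)
        else:
            train.append(pair)
    return train, evaluation

# Hypothetical records: several image pairs per patient.
pairs = [
    {"patient_id": f"P{patient:03d}", "pair_id": pair}
    for patient in range(50)
    for pair in range(3)
]
train, evaluation = patient_disjoint_split(pairs)
train_patients = {p["patient_id"] for p in train}
eval_patients = {p["patient_id"] for p in evaluation}
```

Splitting by patient rather than by image pair is what prevents leakage: a model never sees, during training, another pair from a patient it is later evaluated on.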