arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.05931 2026-06-05 cs.CL cs.AI cs.CV cs.IR cs.LG cs.MM eess.AS

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales

Affiliations: University of Cambridge; Queen's University Belfast; University of Surrey; Cisco; Southwest Jiaotong University; Teesside University

AI summary: Proposes a query-adaptive framework that detects active modalities via cross-modal score consistency, reaching 94.2% P@1 on the BBC Rewind corpus and outperforming unimodal and fixed-fusion methods.

Comments
INTERSPEECH 2026

Abstract

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both seen and heard. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
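
Illustrative sketch (not from the paper): the 64% figure is consistent with measuring the adaptive system's gain over fixed fusion as a fraction of the oracle's gain over fixed fusion. The code below checks that arithmetic and shows, under stated assumptions, how a query might be routed once its active modality has been detected. The function names, the `active` labels, and the equal-weight fusion are hypothetical and are not taken from the abstract.

```python
def gap_recovered(system: float, baseline: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap closed by `system`."""
    return (system - baseline) / (oracle - baseline)


def adaptive_score(speaker: float, face: float, active: str) -> float:
    """Route a query by its detected active modality (hypothetical labels)."""
    if active == "speaker":
        return speaker
    if active == "face":
        return face
    # Assumed equal-weight fusion when both modalities are active.
    return 0.5 * (speaker + face)


if __name__ == "__main__":
    # P@1 values quoted in the abstract, in percent.
    fixed_fusion, adaptive, oracle = 90.0, 94.2, 96.6
    print(round(100 * gap_recovered(adaptive, fixed_fusion, oracle)))
    print(adaptive_score(0.2, 0.8, "face"))
```

Measured against the face-only system (93.4%) instead, the same formula gives a much smaller fraction, so fixed fusion appears to be the baseline behind the quoted 64%.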

2606.05927 2026-06-05 cs.LG

Addressing Imbalance in Multi-Label Data via Label-Specific Distance-based Oversampling

Bin Liu, Jun Wu, Haoyu Peng, Ao Zhou, Jin Wang, QiaoSong Chen, Grigorios Tsoumakas

Affiliations: Key Laboratory of Data Engineering and Visual Computing, Chongqing University of Posts and Telecommunications, China; School of Computer Science and Technology, Chongqing University of Posts and Telecommunications, China; State Key Laboratory of Novel Software Technology, Nanjing University, China; School of Informatics, Aristotle University of Thessaloniki, Greece

AI summary: Addresses label imbalance in multi-label classification with LSDMLO, a label-specific distance-based oversampling method that identifies label-consistent neighbors in a weighted pertinent feature space to generate more effective synthetic instances; experiments show it outperforms existing methods.

Abstract

The complex imbalanced label distribution poses a crucial challenge to multi-label classification, as most classifiers are biased towards the majority class and high-frequency labels. Oversampling is an efficient and flexible solution that augments instances to provide a more balanced training dataset for multi-label classifiers. Most existing oversampling methods create synthetic instances in a heuristic way that essentially relies on neighborhood information retrieved using Euclidean distance within the entire feature space. However, they fail to consider the varying semantic relevance of features to different labels, leading to label inconsistency among proximate neighbors and further introducing label confusion and overfitting to synthetic instances. To overcome this issue, we propose a novel sampling approach called Label-Specific Distance-based Multi-Label Oversampling (LSDMLO) that creates more useful and well-labeled synthetic instances to address the imbalance in multi-label datasets. LSDMLO derives the label-specific distance to identify label-consistent neighbors based on the weighted pertinent feature space, which facilitates selecting seed instances that express more label correlations in boundary areas and generating synthetic instances aligned with the label distribution of original data. Comprehensive experiments verify that the proposed LSDMLO outperforms state-of-the-art multi-label sampling approaches under various base classifiers.
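
Illustrative sketch (not the authors' implementation): the contrast between plain Euclidean neighbors and neighbors found under a label-specific weighted distance can be shown with a toy example. The abstract does not say how the feature weights are derived, so the `weights` vector here is a hand-picked assumption, and the SMOTE-style interpolation is a generic stand-in for the paper's generation step.

```python
import math
import random


def label_distance(x, y, weights):
    """Weighted Euclidean distance; `weights` holds one label's assumed feature relevances."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def nearest_neighbors(seed, candidates, weights, k):
    """Return the k candidates closest to `seed` under the weighted distance."""
    return sorted(candidates, key=lambda c: label_distance(seed, c, weights))[:k]


def interpolate(seed, neighbor, rng):
    """SMOTE-style synthetic point on the segment between seed and neighbor."""
    t = rng.random()
    return [a + t * (b - a) for a, b in zip(seed, neighbor)]


if __name__ == "__main__":
    seed = [0.0, 0.0]
    pool = [[1.0, 5.0], [3.0, 0.1]]
    # Feature 0 is assumed relevant to this label, feature 1 nearly irrelevant.
    weights = [1.0, 0.01]
    print(nearest_neighbors(seed, pool, [1.0, 1.0], 1))  # unweighted distance
    print(nearest_neighbors(seed, pool, weights, 1))     # label-specific distance
    print(interpolate(seed, pool[0], random.Random(0)))
```

With uniform weights the second point is nearest; once the nearly irrelevant feature is down-weighted, the first point becomes the neighbor, which is the kind of change in neighborhood the abstract attributes to label-specific distances.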

2606.05925 2026-06-05 cs.AI

Towards World Models in Biomedical Research

Guangyu Wang, Jingkun Yue, Siqi Zhang, Yu Liu, Xiaoyu Wang, Mingyuan Meng, Changwei Ji, Zongbo Han, Yulin Wang, Yang Yue, Frank Fu, Ting Chen, Song Wu, Ziwei Liu, Jiangning Song, Ming Li, Gao Huang, Xiaohong Liu, Athanasios Vasilakos, Xingcai Zhang, Ping Zhang, Yong Li

Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; Department of Engineering Science, University of Oxford, Oxford, United Kingdom; Institute of Medical Artificial Intelligence, South China Hospital, Medical School, Shenzhen University, Shenzhen, Guangdong, China; Zhongguancun Academy & Zhongguancun Institute of Artificial Intelligence, Beijing, China; Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, 100084, Beijing, China; Department of Chemical and Nano Engineering, University of California, San Diego, La Jolla, CA, USA; Nanyang Technological University, Singapore; Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, Australia; David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada; Department of ICT and Center for AI Research, University of Agder (UiA), Jon Lilletuns vei 9, Grimstad, Norway; Department of Electronic Engineering, Tsinghua University, Beijing, China

AI summary: Proposes biomedical world models as a new paradigm for AI-driven discovery: they learn latent representations of molecular, cellular, tissue and clinical states together with intervention-conditioned dynamics in order to simulate future trajectories, and the paper discusses their potential in applications such as virtual cells, organoids, virtual patients and surgical simulation.

Abstract

A central goal of biomedicine is to understand, predict and ultimately control the dynamic mechanisms by which biological systems respond to perturbations, disease progression and therapeutic intervention. Although foundation models and large language models have accelerated biomedical data interpretation, most current systems remain focused on static pattern recognition rather than prospective simulation of biological futures. Here we propose biomedical world models as a paradigm for AI-driven discovery. These models learn latent representations of molecular, cellular, tissue and clinical states, together with intervention-conditioned dynamics that allow future trajectories to be simulated before actions are taken. We discuss how biomedical world models could function as data engines, environment simulators and scientific planning substrates across applications including virtual cells, organoids, virtual patients and surgical simulation. We outline the data infrastructure, evaluation benchmarks, safety constraints and governance frameworks required. Biomedical world models may provide a foundation for simulation-guided, closed-loop and experimentally actionable biomedical discovery.

2606.05924 2026-06-05 cs.CL cs.AI

Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach

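Zhihao Lin, Ziqi Zhu, Hao Huang, Guanghui Wang, Peiyang He

Affiliations: Amazon Web Services (AWS); Peking University

AI summary: Proposes a multi-aspect iterative refinement framework in which specialized LLMs generate high-quality translation references and preference data, then combines supervised fine-tuning with reinforcement learning (GRPO) to improve literary translation, reaching performance competitive with Claude Sonnet 4.5 on the MetaphorTrans English-to-Chinese literary translation benchmark.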

Comments
Accepted by ACL 2026 Industry

Abstract

Literary translation poses unique challenges due to the scarcity of high-quality annotated data and the need to balance expression fluency with literary effect. We present a multi-aspect iterative refinement framework that generates high-quality translation references and preference data through specialized LLM translators, each targeting a distinct quality dimension. We leverage the generated data for supervised fine-tuning and reinforcement learning. Experiments show that our generated references outperform the original ground truth for SFT by 8.65 CEA100 points. For reinforcement learning, we find that DPO leads to performance degradation in this setting, while leveraging an explicit reward model for GRPO yields an additional 1.51 point improvement. We attribute this to the stability of two-stage training and GRPO's online exploration capability. Our resulting models, LitMT-8B and LitMT-14B, achieve 67.25 and 69.07 CEA100 respectively on the MetaphorTrans English-to-Chinese literary translation benchmark, competitive with Claude Sonnet 4.5 at 68.43, and demonstrate strong generalization to out-of-domain literary work (i.e., O. Henry).
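
Illustrative sketch (not from the paper): GRPO is commonly described as scoring each sampled output relative to the other outputs drawn for the same prompt, by standardizing rewards within the group. The abstract does not give the exact normalization used for LitMT, so the population standard deviation and the `eps` term below are assumptions, and the reward values are invented.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumed: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical reward-model scores for four candidate translations of one passage.
    print(group_relative_advantages([0.2, 0.5, 0.5, 0.8]))
```

Candidates scoring above the group mean receive positive advantages and those below receive negative ones; a group with identical rewards yields all zeros and therefore no learning signal.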

2606.05917 2026-06-05 cs.CV cs.CL

MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering

Qing Yang, Pengcheng Huang, Xinze Li, Zhenghao Liu, Yukun Yan, Yu Gu, Ge Yu, Gang Li, Maosong Sun

Affiliations: School of Computer Science and Engineering, Northeastern University; Department of Computer Science and Technology, Tsinghua University; Digital China Group

AI summary: Proposes MemoryCard, a framework that segments long videos into topical event units and generates event-level gists and representative visual moments, packaging them as Memory Cards to strengthen VLMs on long-video question answering; accuracy improves by up to 21.8% (relative) under comparable visual-token budgets.

Comments
21 pages, 8 figures

Abstract

Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs' ability to effectively capture coherent event-level semantics. To address this limitation, we propose MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. Specifically, MemoryCard first performs a self-reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event-level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at https://github.com/NEUIR/MemoryCard.

2606.05916 2026-06-05 cs.CV

Unveiling the Unknown: Open Vocabulary Object Detection with Scene Graphs

Yi Chen, Yinghao Lu, Zhehao Li, Chenchen Yan, Jiafei Wu, Chong Wang, Jiangbo Qian

Affiliations: Faculty of Electrical Engineering and Computer Science, Ningbo University; Faculty of Computing, Georg-August-Universität Göttingen; Merchants’ Guild Economics and Cultural Intelligent Computing Laboratory, Ningbo University; School of Software Technology, Zhejiang University

AI summary: Proposes a Scene-guided Relational Modeling detection framework that uses scene graphs to capture structured semantic and spatial relationships between candidate regions and their contextual objects, and strengthens open-vocabulary object detection with a Relation Attention Module and a scene-based textual alignment branch.

Abstract

Open-vocabulary object detection seeks to identify novel object categories that were not part of the training data. Many knowledge distillation-based approaches have shown promising performance by transferring knowledge from pre-trained vision-language models to object detection. However, these methods often overlook structured, image-specific relationships between objects, such as interactions and spatial arrangements. This oversight can significantly restrict the effectiveness of detecting novel categories. To address this issue, we propose a Scene-guided Relational Modeling detection framework. This framework utilizes scene graphs to capture structured semantic and spatial relationships between candidate regions and their contextual objects. It explicitly models interactions among neighboring regions and incorporates a Relation Attention Module to implicitly amplify the key relational cues extracted from the scene graph. Furthermore, we present a scene-based textual alignment branch that distills category knowledge from captions to guide relational alignment. This approach facilitates a seamless integration of visual relations with semantic information for enhanced detection performance. Comprehensive experiments show that our model achieves superior performance compared to other OVOD methods, improving the AP for novel categories on COCO and LVIS datasets.

2606.05915 2026-06-05 cs.CV

CamFlow+: Hybrid Motion Bases for 2D Camera Motion Estimation with Stabilization Applications

Haipeng Li, Zhen Liu, Zhanglei Yang, Hai Jiang, Tianhao Zhou, Zhengzhe Liu, Ping Tan, Bing Zeng, Shuaicheng Liu

Affiliations: School of Information and Communication Engineering, University of Electronic Science and Technology of China; University of Electronic Science and Technology of China; School of Aeronautics and Astronautics, Sichuan University; YingCai Honors College, University of Electronic Science and Technology of China; Lingnan University; Hong Kong University of Science and Technology and Shenzhen Loop Area Institute

AI summary: Proposes CamFlow+, a hybrid-basis framework that estimates 2D camera motion directly in dense-flow space by combining homography-derived physical bases, stochastic bases and depth-translational bases with a depth-aware smoothness term; it handles translation, depth variation and local parallax, improving camera-motion estimation and video stabilization.

Abstract

Estimating 2D camera motion is fundamental to computer vision and computational photography. Existing homography-based methods work well for planar scenes or pure rotation, but struggle with camera translation, depth variation, and local parallax; local homography and mesh-based models improve flexibility but still rely on piecewise planar assumptions. We introduce CamFlow+, a hybrid-basis framework that represents 2D camera motion directly in dense-flow space. CamFlow+ combines homography-derived physical bases, stochastic bases sampled from homography flows, and depth-translational bases derived from depth and camera intrinsics, relaxing the single-plane constraint while preserving camera-motion regularity. A depth-aware smoothness term further regularizes translation-induced parallax in continuous-depth regions while preserving motion changes near depth boundaries. We evaluate CamFlow+ on GHOF-Cam, a camera-motion benchmark that masks out dynamic objects and ill-posed occlusion regions in an optical-flow benchmark to isolate camera-induced motion. Experiments show that CamFlow+ improves sparse and dense camera-motion estimation. In digital video stabilization, CamFlow+ also improves global and local stability, achieving the best top-1 preference rate in a blind user study. Code and datasets will be available on the project page: https://lhaippp.github.io/CamFlow+.
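
Illustrative sketch (not the authors' implementation): representing camera motion in dense-flow space can be read as writing the flow field as a weighted sum of basis flow fields. The toy code below combines a uniform shift, standing in for a homography-derived basis, with a depth-dependent basis whose magnitude follows the pinhole relation focal / depth for a unit lateral camera translation. The function names, the three-pixel "image", the coefficients, and the sign convention are all assumptions.

```python
def translation_basis(depths, focal):
    """Per-pixel flow for a unit lateral camera shift under a pinhole model: focal / depth along x."""
    return [(focal / z, 0.0) for z in depths]


def combine(coeffs, bases):
    """Flow field as a weighted sum of basis flow fields, one (dx, dy) pair per pixel."""
    n = len(bases[0])
    return [
        (
            sum(c * b[i][0] for c, b in zip(coeffs, bases)),
            sum(c * b[i][1] for c, b in zip(coeffs, bases)),
        )
        for i in range(n)
    ]


if __name__ == "__main__":
    depths = [1.0, 2.0, 4.0]               # hypothetical depths for three pixels
    global_shift = [(1.0, 0.0)] * 3        # stand-in for a homography-derived basis
    parallax = translation_basis(depths, focal=2.0)
    print(combine([0.5, 1.0], [global_shift, parallax]))
```

Nearer pixels receive larger flow from the depth-dependent basis, which is the parallax effect a single homography cannot express.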

2606.05912 2026-06-05 cs.CV

Self-Learning Expression Deformations for Data-Efficient Gaussian Avatars

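Jiahao Yang, Xiaohang Yang, Qing Wang, Yilan Dong, Gregory Slabaugh, Shanxin Yuan

Affiliations: Queen Mary University of London

AI summary: Proposes Self-Adaptive Gaussian Expression (SAGE), which learns expression-driven deformations through self-supervision and combines 2D Gaussian surfels with a signed distance field to reconstruct high-fidelity, animatable avatars from minimal input (a single multiview frame, a monocular capture, or a single image).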

Abstract

Modeling dynamic facial expressions using 3D Gaussian representations remains challenging due to their unstructured nature. Conventional Gaussian avatar pipelines require extensive multiview and sequential expression data, limiting scalability and accessibility. In this work, we introduce Self-Adaptive Gaussian Expression (SAGE), a framework for self-learning expression-induced Gaussian deformations that enables high-fidelity, animatable avatars from minimal input data. Our method jointly optimizes 2D Gaussian surfels and a Signed Distance Field (SDF) to enforce compact, surface-aligned Gaussian distributions, while a self-supervised expression learning phase replaces long training sequences with geometric and appearance consistency constraints. This design allows flexible deployment across multiple reconstruction regimes: in the multiview setting, only a single frame (timestep) is required instead of thousands; in the monocular setting, only head rotations are needed without expression sequences; and in the one-shot setting, no pretraining or priors are necessary. Experiments demonstrate that our approach achieves reconstruction and animation quality comparable to state-of-the-art methods, while reducing data requirements by several orders of magnitude. Our results highlight the potential of self-supervised Gaussian deformation learning as a step toward accessible, data-efficient avatar creation.

2606.05911 2026-06-05 cs.SD cs.LG eess.AS

DBHN-Net: Dual-Branch Hybrid Neural Network For Low-Complexity Monaural Speech Enhancement

Cunhang Fan, Enrui Liu, Jing Zhou, Jian Kang, Jie Li, Andong Li, Jian Zhou, Zhao Lv, Xuelong Li

Affiliations: State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology (School of Computer Science and Technology), Anhui University; China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd; Institute of Acoustics, University of Chinese Academy of Sciences; Institute of Artificial Intelligence (TeleAI), China Telecom, China

AI summary: Proposes a dual-branch hybrid neural network combining an ANN and an SNN; modules such as BandSplit and TF-Mamba lower computational complexity while interaction and fusion modules preserve performance, giving an average 7.5-fold complexity reduction across three public datasets.

Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI 2026)
Comments
This article has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

Abstract

Although artificial neural network (ANN) based speech enhancement (SE) methods demonstrate excellent performance, their high computational complexity and energy consumption hinder deployment in practical front-end processing tasks. Spiking neural networks (SNNs) have shown potential for reducing power consumption. However, the discrete binary activations and complex spatio-temporal dynamics of SNNs often result in information loss. The current challenge is therefore how to maintain performance while reducing computational complexity. To address this issue, this work proposes a Dual-Branch Hybrid Neural (DBHN) Network. 1) In terms of network architecture: a dual-branch network integrating an ANN and an SNN is designed, where the SNN branch reduces power consumption while the ANN branch addresses information loss; BandSplit and Time-Frequency (TF)-Mamba modules are developed to simultaneously reduce energy consumption and enhance model performance; and Spiking Feature Extraction Group (SFEG) and Information Transformation Block (ITB) components are implemented with residual connections to mitigate information loss while further refining feature representations. 2) To facilitate inter-branch information fusion: an Interaction module is designed to promote information exchange at various stages of the dual-branch network; and a TF-Cross Attention-Fusion module is designed to perform time-frequency domain fusion of dual-branch information while data-adaptively guiding the SNN branch to retain more critical information. Results show that the proposed model maintains superior performance across three public datasets while achieving an average 7.5-fold reduction in computational complexity compared to baseline models.
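
Illustrative sketch (not the paper's neuron model): the information loss attributed to binary spiking activations can be seen with a generic leaky integrate-and-fire (LIF) unit, where two different input sequences produce the same spike train. The threshold, leak factor, hard reset, and input values are assumptions chosen for the illustration.

```python
def lif_spikes(inputs, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire unit: emit 1 and reset when the membrane potential reaches the threshold."""
    potential = 0.0
    spikes = []
    for x in inputs:
        potential = leak * potential + x
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # assumed hard reset after a spike
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    a = [0.6, 0.6, 0.1, 0.1]
    b = [0.3, 0.9, 0.1, 0.1]
    # Different real-valued inputs, identical binary outputs.
    print(lif_spikes(a), lif_spikes(b))
```

Because the downstream layer only sees the spike train, the difference between the two inputs is lost, which is the kind of loss a parallel real-valued branch could compensate for.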

2606.05909 2026-06-05 cs.SD eess.AS

Beyond WER: A Paired Acoustic Stress Test for Ambient Clinical Scribes

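Xiao-Hang Jiang, Han-Jie Guo, Ying-Si Liang, Yang Ai, Zhen-Hua Ling, Lei Jiang, Zhi-Yang He

Affiliations: University of Science and Technology of China; iFLYTEK Co., Ltd.

AI summary: Proposes a paired acoustic stress test that injects noise while keeping the downstream model frozen, exposing the safety impact of noise on clinical reasoning; finds that minor acoustic perturbations can invert clinical meaning without substantially raising word error rate, and demonstrates a lightweight mitigation strategy.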

Comments
Accepted to INTERSPEECH 2026

Abstract

Ambient clinical scribes increasingly combine Automatic Speech Recognition with Large Language Models to automate documentation. However, traditional metrics like Word Error Rate mask systemic safety degradation. We present a paired acoustic stress test to isolate the causal impact of noise on clinical reasoning. For the same dialogues, we inject diverse noise types while keeping the downstream model configuration frozen. Crucially, we uncover a dangerous disconnect between signal fidelity and clinical safety. Stationary ambient noise increased the Word Error Rate by a negligible 0.71 percentage points yet nearly doubled the rate of unsafe outputs. Our analysis reveals that minor acoustic perturbations can invert clinical meaning without substantially inflating error rates. Furthermore, we demonstrate a lightweight mitigation strategy that reduces safety degradation under noisy conditions without requiring model fine-tuning.
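
Illustrative sketch (not from the paper): Word Error Rate is the word-level edit distance between a reference and a hypothesis transcript, divided by the reference length. The example sentences below are invented to show how a single dropped word can flip clinical meaning while leaving WER small; they are not data from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the patient has no known drug allergies"
    hypothesis = "the patient has known drug allergies"
    # One deleted word out of seven, yet the statement now means the opposite.
    print(word_error_rate(reference, hypothesis))
```

A metric that counts all word errors equally cannot distinguish this deletion from a harmless one, which is the gap a paired safety evaluation is meant to expose.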

2606.05906 2026-06-05 cs.CL

ACE-SQL: Adaptive Co-Optimization via Empirical Credit Assignment for Text-to-SQL

Xiaobing Chen, Ai Jian, Eryu Guo, Zhiqi Pang

Affiliations: Harbin Engineering University; Harbin Institute of Technology; Beijing University of Posts and Telecommunications

AI summary: Proposes ACE-SQL, a reinforcement learning framework that jointly optimizes schema retrieval and SQL generation through an online column-set pool and empirical credit assignment, reaching 65.3% greedy execution accuracy on BIRD Dev.

Abstract

Text-to-SQL maps natural language questions to executable SQL queries. Modern databases often contain large and complex schemas, making schema linking a critical step for accurate SQL generation. Existing methods either rely on full-schema generation, which leaves schema linking implicit within a large search space, or use a separate retriever trained with static gold-column supervision, whose targets may be suboptimal for the current generator policy. To address this issue, we propose Adaptive Co-optimization via Empirical Credit Assignment for Text-to-SQL (ACE-SQL), a reinforcement learning (RL) framework that jointly optimizes schema retrieval and SQL generation under execution feedback. ACE-SQL constructs an online column-set pool from generator rollouts and derives adaptive on-policy retrieval targets from the column set most frequently associated with execution-correct rollouts. This induces bidirectional adaptation, where the retriever adapts toward column sets that the generator can execute correctly, while the generator adapts to the retriever's evolving schema selections under execution feedback. With approximately 3k synthetic Text-to-SQL question-database pairs for RL training, ACE-SQL achieves 65.3% greedy execution accuracy on BIRD Dev while using 0.93k output tokens per query. The repository is available at https://github.com/xbchen1/ACE-SQL.
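
Illustrative sketch (not the released code): one way to read the retrieval-target step is as a frequency count over the column sets used by execution-correct rollouts. The rollout format, the function name, and the column names below are hypothetical; the actual implementation lives in the repository linked in the abstract and may differ.

```python
from collections import Counter


def retrieval_target(rollouts):
    """Return the column set seen most often among execution-correct rollouts, or None."""
    counts = Counter(frozenset(cols) for cols, correct in rollouts if correct)
    if not counts:
        return None  # no correct rollout, so no target for this question
    best, _ = counts.most_common(1)[0]
    return best


if __name__ == "__main__":
    # Each rollout: (columns referenced by the generated SQL, whether execution matched).
    rollouts = [
        (["orders.id", "orders.total"], True),
        (["orders.id"], False),
        (["orders.total", "orders.id"], True),
        (["users.id"], True),
    ]
    print(sorted(retrieval_target(rollouts)))
```

Using `frozenset` makes the count insensitive to column order, so two rollouts that reference the same columns in a different order vote for the same target.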

2606.05903 2026-06-05 cs.RO

A Novel Method with Encoder-Decoder for Cross-Sensor Adaptation in Surface Shape Sensing with Sparse Strain Sensors

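Shuo Wang, Heng Luo, Dian Jin, Xiaoming Tao

Affiliations: arXiv.org

AI summary: Proposes an encoder-decoder architecture combined with meta-learning and few-shot adaptation for cross-sensor adaptation between sensor arrays, sharply reducing the labeled data and adaptation time needed to deploy a new sensor and cutting sensing error from 23.0 mm to about 4.0 mm.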

Abstract

Performance variations in sensor arrays, caused by intrinsic differences or installation conditions, can lead to inconsistent results during shape sensing. To obtain accurate results, a large amount of data is usually required, and a separate model must be retrained for each sensor array, thereby increasing the cost and time of data acquisition, transmission, and computation. To address this issue, this work proposes an encoder-decoder architecture for surface shape sensing based on sparse strain sensors and further incorporates meta-learning and few-shot adaptation strategies to enable adaptation across different groups of sensor arrays. Experimental results demonstrate that, after cross-sensor adaptation, a newly deployed sensor array achieves a sensing error of approximately 4.0 mm using less than 5.0% newly labeled data and an adaptation time of under 1 second, a substantial improvement over the 23.0 mm error without adaptation and the 20 minutes of data collection required to train a new model. Moreover, the number of points with errors below 5.0 mm increased by more than 65.0%. These results indicate that the proposed method can substantially reduce the cost and training burden of surface shape sensing, and that it has broad potential applications in soft robotics and wearable devices.

2606.05901 2026-06-05 cs.CL cs.AI

Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version)

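Christopher J. Wedge, Joshua Stutter, Danny Dixon, Jacek Cała

Affiliations: National Innovation Centre for Data

AI summary: Proposes a retrieval-augmented generation system supported by a lightweight graph structure; combining vector search and graph query tools halves the number of hallucinated answers on complex question answering and significantly improves the precision and recall of factual correctness.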

Abstract

Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) systems have emerged as a common deployment scenario that seeks both to avoid the well-known risk of the LLM "hallucinating" information and to enable reasoning and question answering over proprietary information that the LLM did not have access to during training, without resorting to expensive model fine-tuning. In this work, we explore the idea of using a lightweight graph structure with a relatively simple graph schema to support the RAG subsystem via a dedicated toolset. We design an agentic system with a variety of vector search and graph query tools operating over a structured dataset based on a curated subset of English Wikipedia articles, and evaluate its performance on questions from MoNaCo, a challenging Wikipedia QA benchmark of complex query answering tasks. Our results show that the introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios. All this comes with only a modest increase in token usage.

2606.05899 2026-06-05 cs.LG cond-mat.dis-nn

High-Dimensional Theory of LoRA Fine-Tuning in a Solvable Attention Model

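O. Duranthon, F. Boncoraglio, L. Zdeborová

Affiliations: Statistical Physics of Computation Laboratory, École Polytechnique Fédérale de Lausanne (EPFL)

AI summary: Analyzes low-rank adaptation (LoRA) fine-tuning in an attention model using high-dimensional statistical theory, revealing the interplay between pre-training and fine-tuning and giving a sharp asymptotic characterization of test error and representation alignment.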

Abstract

We develop a high-dimensional statistical theory of low-rank adaptation (LoRA) in attention models, capturing the interplay between pre-training and fine-tuning. We introduce a solvable framework in which a single-head attention layer is first pre-trained on a data-abundant task and subsequently adapted via a rank-one LoRA update on limited data. In the high-dimensional limit, both stages admit a sharp asymptotic characterization in terms of a finite set of order parameters, yielding explicit predictions for test errors and representation alignment. Our analysis shows that the impact of pre-training on LoRA is summarized by an effective noise term, from which we derive prescriptions for the optimal pre-training procedure. We also demonstrate a regime with a mismatch between the value of the test error and representation quality, and propose an application of our theory to active fine-tuning.
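
Illustrative sketch (not the paper's model): a rank-one LoRA update, in its generic form, leaves the pre-trained weight matrix W frozen and adds an outer product of two trainable vectors, giving W + scale * u v^T. The abstract does not state the exact parametrization of the solvable attention model, so the matrix shapes, the `scale` argument, and the numbers below are assumptions.

```python
def rank_one_update(W, u, v, scale=1.0):
    """Return W + scale * outer(u, v) as a new matrix, leaving W unchanged."""
    return [[w + scale * ui * vj for w, vj in zip(row, v)] for row, ui in zip(W, u)]


def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for frozen pre-trained weights
    u, v = [1.0, 2.0], [3.0, 4.0]  # the only trainable parameters
    adapted = rank_one_update(W, u, v)
    print(adapted)
    print(matvec(adapted, [1.0, 1.0]))
```

For a d_out by d_in matrix the update trains d_out + d_in numbers rather than d_out * d_in, and the adapted output equals the original output plus u scaled by the inner product of v with the input.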

2606.05896 2026-06-05 cs.CV

Resonant Minds: Closed-Loop Social Avatars with Theory of Mind

共鸣心智:具备心智理论的闭环社交虚拟人

Jianxu Shangguan, Jing Xu, Hang Ye, Xiaoxuan Ma, Yizhou Wang, Wentao Zhu

发表机构 * University of Washington(华盛顿大学) Peking University(北京大学) Carnegie Mellon University(卡内基梅隆大学) Eastern Institute of Technology, Ningbo(宁波东方理工大学)

AI总结 提出一个闭环双智能体框架,通过整合感知、社会推理(基于心智理论)和多模态生成,实现具备社交智能的虚拟人,并在信息不对称数据集上取得优于全信息脚本模式的对话质量。

详情
AI中文摘要

创建具有真正社交智能的逼真数字人需要将认知推理和多模态生成统一在一个连贯的框架内。当前的方法将这些视为独立的任务:大型语言模型擅长对话但缺乏具身表达,而基于扩散的说话头模型实现了视觉保真度但忽略了社会认知。为了弥合这一差距,我们提出了一个闭环双智能体框架,将感知、社会推理和表达整合到一个连续的交互循环中。感知模块从视频中分析伙伴的多模态行为,而社会推理模块通过心智理论推断隐藏的心理状态,并通过集成机制选择响应。然后,表达模块生成情感可控的双智能体视频,合成说话者的言语和表情以及听者的反应行为,捕捉先前工作中缺失的双向动态。我们构建了一个分层的角色-场景数据集,包含基于心理学的角色和私人社交目标,以支持信息不对称下的评估。在该数据集上的实验表明,在对话质量和视频生成指标上均具有竞争性或优越的性能。值得注意的是,我们的方法在关键对话质量维度上甚至超过了全信息脚本模式,这表明在不确定性下显式的心理状态推断可以比无限制的信息访问引发更周到的对话。

英文摘要

Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners' multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable dual-agent videos synthesizing both speaker speech and expression alongside listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access.

2606.05894 2026-06-05 cs.CL

EMBER: Efficient Memory via Budgeted Evidence Retention for Long-Horizon Agents

EMBER: 通过预算化证据保留实现高效记忆的长时程智能体

Yilong Li, Suman Banerjee, Tong Che

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) NVIDIA Research(NVIDIA研究)

AI总结 针对长时程智能体在固定预算下保留证据的问题,提出EMBER学习型保留策略,通过存储证据胶囊(含原文摘录、检索键和更新元数据)并利用查询后反馈训练,在LongMemEval-RR上显著提升F1、保留召回和读取召回。

详情
AI中文摘要

长时程智能体可以存档大量历史记录,但未来的答案仍然会产生检索、重读和上下文成本。当保留的记忆缺少与答案相关的证据时,系统必须返回原始历史的大部分内容。我们研究预算化证据存留:在查询未知之前,应保留哪些源证据,以便在固定的保留源证据令牌预算下保持可恢复和可用?我们将此设置实例化为预算化预查询保留,其中记忆在摄取期间写入,随后在无法访问完整原始流的情况下读取。我们引入了EMBER,一种学习型保留策略,它构建了一个紧凑的、基于源的证据状态。EMBER存储证据胶囊:逐字源摘录,附带检索键和更新元数据,同时保留基础性和读取时间访问。查询后结果反馈训练写入器在摄取-检索-答案链中保留证据。在LongMemEval-RR(我们基于LongMemEval衍生的保留证据协议)上,EMBER-14B在8192令牌保留证据比较点达到0.3017 F1,而最强非EMBER预算化基线为0.1765。在不同的保留源证据预算下,EMBER提高了F1、保留召回和读取召回,表明长时程记忆依赖于在预算内保留证据,而不是重读更大的历史记录。

英文摘要

Long-horizon agents can archive large histories, but future answers still incur retrieval, rereading, and context costs. When retained memory misses answer-relevant evidence, the system must return to larger portions of the raw history. We study budgeted evidence survival: before the query is known, which source evidence should be retained so that it remains recoverable and usable under a fixed retained source-evidence token budget? We instantiate this setting as Budgeted Pre-Query Retention, where memory is written during ingestion and later read without access to the full raw stream. We introduce EMBER, a learned retention policy that constructs a compact, source-backed evidence state. EMBER stores evidence capsules: verbatim source excerpts paired with retrieval keys and update metadata, preserving both grounding and read-time access. Post-query outcome feedback trains the writer to preserve evidence across the ingestion-retrieval-answer chain. On LongMemEval-RR, our LongMemEval-derived retained-evidence protocol, EMBER-14B reaches 0.3017 F1 at the 8192-token retained-evidence comparison point, compared with 0.1765 for the strongest non-EMBER budgeted baseline. Across retained source-evidence budgets, EMBER improves F1, Retain-Recall, and Read-Recall, indicating that long-horizon memory depends on retaining evidence within the budget rather than rereading larger histories.

2606.05890 2026-06-05 cs.CL cs.AI

Staying with the Uncertainty: Uncertainty-Scaffolding Strategies for Artificial Moral Advisors in LLM-to-LLM Simulated Conversations

与不确定性共处:LLM对LLM模拟对话中人工道德顾问的不确定性支撑策略

Salvatore Greco, Hainiu Xu, Jacopo Domenicucci, Yulan He, Sylvie Delacroix

发表机构 * Centre for Data Futures, The Dickson Poon School of Law, King’s College London(数据未来中心、迪克森·普恩法学院、伦敦国王学院) Department of Informatics, King’s College London(信息学院、伦敦国王学院) LangAI, Center for Language AI Research, Tohoku University(LangAI、语言人工智能研究中心、东北大学) Neukom Institute for Computational Science, Dartmouth College(计算科学尼科姆研究所、达特茅斯学院)

AI总结 研究LLM作为人工道德顾问时,通过三种不确定性策略(视角倍增、张力保持、过程反思)与三种控制条件对比,在模拟对话中探讨如何帮助对话者“与不确定性共处”,发现不同策略在立场改变量上无差异但影响参与质量。

详情
AI中文摘要

LLM越来越多地被部署为各种背景下的人工道德顾问(AMA):它们应该展现什么样的对话模式?在本文中,我们研究AMA如何帮助其对话者“与不确定性共处”。我们提出了三种不确定性模式(视角倍增、张力保持、过程反思),并将它们与三种控制条件(基线、说服、谄媚)进行比较。用户代理LLM与遵循特定不确定性策略的AMA就伦理困境进行对话,并完成对话前和对话后的问卷调查。我们进一步考察了两种角色提示格式(陈述式和叙述式)的效果。我们发现:(1)没有一个单一模型作为模拟用户代理占主导地位,开放模型通过角色间分歧与人类模糊性对齐,而封闭模型通过角色内对冲对齐;(2)陈述式角色更好地捕捉初始立场多样性,而叙述式角色显示出更现实的信念修正;(3)所有六种AMA策略产生可区分的对话模式;(4)不确定性策略的不同不在于它们产生多少立场改变,而在于它们维持的参与质量。

英文摘要

LLMs are increasingly deployed as Artificial Moral Advisors (AMA) in a variety of contexts: what kind of conversational patterns should they display? In this paper, we study how AMA can help their interlocutors "stay with the uncertainty". We propose three modes of uncertainty (Perspective-Multiplying, Tension-Preserving, Process-Reflecting) and compare them against three control conditions (Baseline, Persuasive, Sycophantic). A user-agent LLM engages in a dialogue on an ethical dilemma with an AMA following a specific uncertainty strategy, and completes pre- and post-conversation questionnaires. We further examine the effect of two persona prompt formats (Declarative and Narrative). We found that (1) no single model dominates as a simulated user agent, with open models aligning with human ambiguity through between-persona divergence and closed models through within-persona hedging; (2) declarative personas better capture initial stance diversity while narrative personas show more realistic belief revision; (3) all six AMA strategies produce distinguishable conversational patterns; and (4) uncertainty strategies differ not in how much stance revision they produce, but in the quality of engagement they sustain.

2606.05889 2026-06-05 cs.SD cs.CL eess.AS

GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

GLASS: 基于GRPO训练的LoRA用于零样本文本转语音中的声学风格引导

Jaehoon Kang, Yejin Lee, Kyuhong Shim

发表机构 * Department of Artificial Intelligence, Sungkyunkwan University(人工智能系,成均馆大学)

AI总结 提出GLASS框架,通过GRPO训练轻量LoRA适配器实现零样本自回归TTS中可组合的声学风格控制,无需风格标签即可从奖励中学习控制。

详情
AI中文摘要

我们提出GLASS,一个用于零样本自回归文本转语音(TTS)中可组合声学风格控制的框架,该框架从生成后奖励而非风格标签中学习控制。在零样本TTS中,说话人提示通常将说话人身份与语速、音高等韵律属性纠缠在一起,使得在不改变提示本身的情况下难以改变风格。GLASS将每个声学属性视为一个由奖励定义的控制方向。对于每个控制轴,GLASS冻结TTS主干,并使用组相对策略优化(GRPO)训练一个轻量级LoRA适配器,以语音令牌长度和平均F0作为风格奖励,以WER作为可懂度锚点。由于每个控制表示为LoRA权重更新,独立训练的适配器可以通过线性LoRA算术进行交换、插值和组合,而无需重新训练主干。在语速和音高控制上的实验显示了目标风格偏移,同时保持了自然度、说话人相似性和可懂度,并展示了跨独立训练适配器的平滑插值和多轴组合。

英文摘要

We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control direction. For each control axis, GLASS freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length and mean F0 as style rewards and WER as an intelligibility anchor. Because each control is represented as a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. Experiments on speaking rate and pitch control show targeted style shifts while preserving naturalness, speaker similarity, and intelligibility, and demonstrate smooth interpolation and multi-axis composition across independently trained adapters.
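
The linear LoRA arithmetic mentioned in this abstract can be illustrated with a minimal sketch. It assumes each adapter is stored as a pair of low-rank factors (B, A) whose product is added to a frozen weight matrix. The helper names `lora_delta` and `compose`, the toy matrices, and the coefficients are hypothetical and are not taken from the paper.

```python
# Minimal sketch of linear LoRA arithmetic (illustrative; not the GLASS implementation).
# Assumption: each adapter is a pair of low-rank factors (B, A) and contributes
# a weight update delta = B @ A to a frozen backbone matrix W.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_delta(B, A):
    """Weight update of one adapter: delta = B @ A."""
    return matmul(B, A)

def compose(W, adapters, coeffs):
    """Return W + sum_i coeffs[i] * (B_i @ A_i) without modifying W."""
    out = [row[:] for row in W]
    for (B, A), c in zip(adapters, coeffs):
        delta = lora_delta(B, A)
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                out[i][j] += c * v
    return out

# Frozen 2x2 backbone weight and two hypothetical rank-1 adapters.
W = [[1.0, 0.0], [0.0, 1.0]]
rate_adapter = ([[1.0], [0.0]], [[0.5, 0.0]])    # stands in for a speaking-rate axis
pitch_adapter = ([[0.0], [1.0]], [[0.0, 0.25]])  # stands in for a pitch axis

# Interpolation: apply the rate adapter at half strength.
half_rate = compose(W, [rate_adapter], [0.5])
# Multi-axis composition: apply both adapters at full strength.
both = compose(W, [rate_adapter, pitch_adapter], [1.0, 1.0])
```

Because the backbone matrix is copied rather than modified, adapters can be swapped or rescaled by changing only the coefficient list, which is the property the abstract describes as composition without retraining the backbone.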

2606.05888 2026-06-05 cs.AI

Retry Policy Gradients in Continuous Action Spaces

连续动作空间中的重试策略梯度

Soichiro Nishimori, Paavo Parmas

发表机构 * The University of Tokyo, Japan(东京大学)

AI总结 本文提出重试目标(如pass@K和max@K)的路径导数估计器,将ReMax扩展到连续动作空间,通过重塑策略梯度景观促进随机探索,并引入ReMAC算法实现与SAC相当的性能。

详情
AI中文摘要

基于重试的目标(如pass@K和max@K)优化从多个采样轨迹中获得的最佳回报,最近的研究表明,它们可以在没有显式探索奖励的情况下促进探索。在离散动作空间中,ReMax被证明可以通过适应回报不确定性来实现这一点。在这项工作中,我们引入了重试目标的路径导数估计器,并用它们将ReMax扩展到连续动作空间。我们研究了由此产生的学习动态,并表明,即使使用确定性奖励,ReMax也可以通过重塑策略梯度景观来鼓励随机探索。特别地,它既改变了梯度的方向,使更新偏向于更高的策略熵,也改变了梯度的大小,抑制梯度并减缓收敛。我们进一步表明,Adam的自适应归一化可以缓解这种抑制,具体取决于其数值稳定化参数。在实验上,我们将该目标实例化为ReMax Actor-Critic(ReMAC),这是一种使用路径导数估计器优化ReMax目标的离策略actor-critic算法。我们的实验表明,ReMAC可以在没有熵正则化的情况下促进更高的策略熵,并实现与SAC相当的性能。

英文摘要

Retry-based objectives such as pass@K and max@K optimize the best return obtained from multiple sampled trajectories, and recent work has shown that they can promote exploration without explicit exploration bonuses. In discrete action spaces, ReMax was shown to do so by adapting to return uncertainty. In this work, we introduce pathwise derivative estimators for retry objectives and use them to extend ReMax to continuous action spaces. We study the resulting learning dynamics and show that, even with deterministic rewards, ReMax can encourage stochastic exploration by reshaping the policy-gradient landscape. In particular, it alters gradients both in direction, biasing updates toward higher policy entropy, and in magnitude, damping gradients and slowing convergence. We further show that Adam's adaptive normalization can mitigate this damping, depending on its numerical stabilization parameter. Empirically, we instantiate this objective as ReMax Actor-Critic (ReMAC), an off-policy actor-critic algorithm that optimizes the ReMax objective using a pathwise derivative estimator. Our experiments show that ReMAC can promote higher policy entropy without entropy regularization and achieves performance comparable to SAC.
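
The retry objectives named in this abstract can be made concrete with two small estimators of their values. This is a sketch under stated assumptions: it uses the standard combinatorial estimate of pass@K from n samples with c successes, and the exact average of the best return over all size-K subsets of the samples for max@K. It estimates objective values only and is not the pathwise gradient estimator proposed in the paper. The function names are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples, c of which succeeded."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no success.
    return 1.0 - comb(n - c, k) / comb(n, k)

def max_at_k(returns, k):
    """Average of the best return over all size-k subsets of the samples."""
    n = len(returns)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    ordered = sorted(returns)
    # The i-th smallest return is the maximum of comb(i - 1, k - 1) subsets.
    weighted = sum(comb(i - 1, k - 1) * r for i, r in enumerate(ordered, start=1))
    return weighted / comb(n, k)

# Four samples with one success, and three scalar returns.
p = pass_at_k(4, 1, 2)
m = max_at_k([1.0, 2.0, 3.0], 2)
```

With K = 1 the second function reduces to the ordinary sample mean, which shows how the retry objective generalizes the usual expected return.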

2606.05885 2026-06-05 cs.LG

When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training

当更密集的信用不足时:面向长周期LLM智能体训练的基于证据校准的策略优化

Yuanfan Li, Qi Zhou, Wenjing Duan, Lu Chen

发表机构 * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China(X-LANCE实验室,计算机科学学院,上海交通大学,上海,中国) Faculty of Electronic and Information Engineering, Xi’an Jiaotong University(电子与信息工程学院,西安交通大学)

AI总结 针对长周期LLM智能体在稀疏延迟奖励下的信用分配问题,提出一种无评论家的策略优化算法ECPO,通过证据校准的动作优势和方差门控信用加权来修正密集信用的统计不可靠性,在ALFWorld和WebShop上显著提升性能。

详情
AI中文摘要

长周期LLM智能体需要能够在稀疏和延迟奖励下为中间决策分配信用的强化学习方法。最近的基于分组的方法如GiGPO通过构建重复锚点状态下的步骤级优势来改进GRPO。然而,我们表明这种密集信用在统计上可能不可靠:在有限的轨迹采样下,罕见但幸运的动作可能获得过大的优势,产生发散锚点偏差和后期训练振荡。我们提出证据校准策略优化(ECPO),一种在策略更新前校准步骤级信用的无评论家策略优化算法。ECPO结合了证据校准动作优势(将轨迹按规范动作分组并收缩低计数估计)和方差门控信用加权(抑制由动作内噪声主导的锚点状态)。在ALFWorld和WebShop上使用Qwen2.5-1.5B/7B的实验表明,ECPO持续优于强基线,在Qwen2.5-1.5B上,ALFWorld/WebShop的成功点分别比GiGPO提高+5.2/+7.3,同时仅增加0.1%的额外优势计算开销。

英文摘要

Long-horizon LLM agents require reinforcement learning methods that can assign credit to intermediate decisions under sparse and delayed rewards. Recent group-based methods such as GiGPO improve over GRPO by constructing step-level advantages at repeated anchor states. However, we show that such dense credit can be statistically unreliable: under limited rollouts, rare but lucky actions may receive overly large advantages, producing divergent anchor bias and late-stage training oscillation. We propose Evidence-Calibrated Policy Optimization (ECPO), a critic-free policy optimization algorithm that calibrates step-level credit before policy updates. ECPO combines Evidence-Calibrated Action Advantage, which groups rollouts by canonical actions and shrinks low-count estimates, with Variance-Gated Credit Weighting, which suppresses anchor states dominated by within-action noise. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B show that ECPO consistently outperforms strong baselines, improving GiGPO by +5.2/+7.3 success points on ALFWorld/WebShop with Qwen2.5-1.5B while adding only 0.1% additional advantage-computation overhead.
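
The idea of shrinking low-count estimates can be illustrated with a generic count-based shrinkage rule. This is a sketch under stated assumptions, not ECPO itself: the abstract does not give the calibration formula, so the factor n / (n + kappa), the parameter `kappa`, and the function name are hypothetical choices made only for illustration.

```python
from collections import defaultdict

def shrunk_action_advantages(rollouts, kappa=2.0):
    """Per-action advantages at one anchor state, shrunk toward zero by count.

    rollouts: list of (action, return) pairs collected at the same anchor state.
    kappa: hypothetical pseudo-count; larger values shrink rare actions harder.
    """
    if not rollouts:
        return {}
    # Baseline: mean return over all rollouts at this anchor state.
    baseline = sum(ret for _, ret in rollouts) / len(rollouts)
    by_action = defaultdict(list)
    for action, ret in rollouts:
        by_action[action].append(ret)
    advantages = {}
    for action, rets in by_action.items():
        n = len(rets)
        raw = sum(rets) / n - baseline
        # An action seen once keeps only 1 / (1 + kappa) of its raw advantage.
        advantages[action] = (n / (n + kappa)) * raw
    return advantages

# One lucky rollout for "open" versus three unrewarded rollouts for "look".
rollouts = [("open", 1.0)] + [("look", 0.0)] * 3
adv = shrunk_action_advantages(rollouts, kappa=2.0)
```

In this toy case the rare action's raw advantage of 0.75 is reduced to 0.25, which shows how a single lucky rollout is kept from dominating the update.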

2606.05883 2026-06-05 cs.CV

Geometry-Aware Dataset Condensation for Diffusion Model Training

面向扩散模型训练的几何感知数据集压缩

Xiao Cui, Yulei Qin, Mo Zhu, Wengang Zhou, Hongsheng Li, Houqiang Li


AI总结 针对扩散模型训练,提出基于几何感知分布对齐的真实子集选择方法,利用单侧部分最优传输保持几何结构,并辅以轻量级特征统计与语义一致性正则化,通过两阶段离散优化实现高效压缩。

详情
Comments
ICML 2026
AI中文摘要

数据集压缩旨在通过合成或选择从真实数据中构建紧凑数据集。然而,现有方法不适用于扩散模型训练:合成数据生成通常产生不适合真实建模的低保真样本,而真实子集选择通常无法保留扩散似然目标所需的分布几何结构。为解决此问题,我们提出将真实子集选择重新表述为几何感知分布对齐问题。通过引入单侧部分最优传输,我们的方法选择性地将紧凑子集与完整数据分布对齐,同时允许低密度区域中的未匹配质量,确保保留扩散模型训练所需的有效几何结构。为进一步保证分布保真度,我们用轻量级特征统计和语义一致性正则化补充几何对齐。提出了一种高效的两阶段离散优化策略来实现该对齐目标。在扩散变体、子集大小、图像分辨率和训练轮次上的大量实验表明,我们的方法在扩散模型训练中实现了优越的保真度和分布覆盖。代码可在 https://github.com/2018cx/GADC 获取。

英文摘要

Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are ill-suited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we propose to reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, preserving the geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. An efficient two-stage discrete optimization strategy is proposed to achieve this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our method achieves superior fidelity and distributional coverage in diffusion model training. Code is available at https://github.com/2018cx/GADC.

2606.05880 2026-06-05 cs.RO

TAGA: Terrain-aware Active Gaze Learning for Generalizable Agile Humanoid Locomotion

TAGA:面向可泛化敏捷人形运动的地形感知主动注视学习

Peizhuo Li, Hongyi Li, Mingfeng Fan, Fangzhou Xu, Shuhao Liao, Yuxuan Ma, Zicheng Zeng, Ze Wang, Yongbin Jin, Yuhong Cao, Hongtao Wang, Guillaume Sartoretti

发表机构 * MarmotLab, National University of Singapore(马尔莫实验室,新加坡国立大学) Center of X-Mechanics, Zhejiang University(浙大X力学中心) South China University of Technology(华南理工大学)

AI总结 提出TAGA框架,通过融合视觉、本体感觉和运动命令,让模型学习主动注视地形关键区域,在有限计算资源下提高感知密度,实现鲁棒且可泛化的敏捷人形运动。

详情
AI中文摘要

在多样挑战性地形上的敏捷人形运动需要广泛的感知覆盖和精确的局部几何理解。受人类在运动中选择性注视相关地形的启发,我们提出了TAGA,一种用于基于注意力的人形控制的地形感知主动注视学习框架。通过融合视觉、本体感觉和运动命令,我们的框架引导模型学习预期线索并主动关注高度扫描的特定区域,选择性地将这些信息区域用于下游网络。这自适应地提高了在严格机载计算约束下观测的信息密度,从而在更大尺度地形上实现细粒度感知运动。我们发现,这种注视行为可以仅通过强化学习自然涌现,无需额外监督或显式指导,显著提高了训练效率。因此,训练后的策略在仿真和硬件上展示了鲁棒且可泛化的运动,包括可靠的地形感知落脚点选择、高台穿越、竞争性稀疏落脚点穿越,以及在感知人形运动系统中报告的最大实际间隙穿越距离1.2米,同时在严重感知干扰和环境干扰下保持稳定性。

英文摘要

Agile humanoid locomotion across diverse challenging terrain demands both wide perceptual coverage and precise local geometry understanding. Motivated by the way humans selectively look at relevant terrain during locomotion, we introduce TAGA, a Terrain-aware Active Gaze learning framework for Attention-based humanoid control. By fusing vision, proprioception, and motion commands, our framework guides the model to learn anticipatory cues and actively attend to specific areas of the height scan, selectively using these informative regions for the downstream network. This adaptively increases the information density of observations under tight onboard computational constraints, thus enabling fine-grained perceptive locomotion over larger-scale terrains. We find that such gaze behaviors can naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance, significantly improving training efficiency. As a result, the trained policy demonstrates robust and generalizable locomotion in simulation and on hardware, including reliable terrain-aware foothold selection, elevated-platform traversal, competitive sparse-foothold traversal, and the largest reported real-world gap traversal distance of 1.2 m among perceptive humanoid locomotion systems, while maintaining stability under severe perceptual disturbances and environmental interference.

2606.05878 2026-06-05 cs.LG

TS-ICL: A Flexible Time-Indexed Foundation Model for Time Series via In-Context Learning

TS-ICL: 一种基于上下文学习的灵活时间索引时间序列基础模型

Etienne Le Naour, Tahar Nabil, Adrien Petralia

发表机构 * EDF R&D(EDF研究与发展)

AI总结 提出TS-ICL,一种基于上下文学习的概率编码器-回归器Transformer,统一了时间序列预测与插值,并在插值任务上达到新最优,同时在部分观测回溯窗口预测中表现突出。

详情
AI中文摘要

基础模型标志着时间序列建模的深刻范式转变,任务特定模型正被通用零样本模型取代。然而,当前方法主要关注预测,而现实世界的时间序列通常是不规则和部分观测的,需要模型能够联合预测、插补缺失值并处理降采样条件。为应对这些挑战,我们引入了TS-ICL,一种新颖的基于概率上下文学习的编码器-回归器Transformer,统一了预测和插值。TS-ICL将时间序列任务表述为时间戳对齐的回归,并通过训练从新颖的因果数据先验生成的合成依赖结构自然地纳入协变量。实验上,TS-ICL在插值任务上达到了新的最优,同时在单变量和协变量感知基准上与领先的预测基础模型保持竞争力。它在部分观测回溯窗口的预测中表现出特别强的性能。

英文摘要

Foundation models mark a profound paradigm shift in time series modeling, with task-specific models being superseded by general-purpose zero-shot models. Yet, current approaches primarily focus on forecasting, while real-world time series are often irregularly and partially observed, requiring models that can jointly forecast, impute missing values, and handle degraded sampling conditions. To address these challenges, we introduce TS-ICL, a novel probabilistic In-Context Learning encoder-regressor Transformer that unifies forecasting and imputation. TS-ICL formulates time series tasks as timestamp-aligned regression and naturally incorporates covariates by training on synthetic dependency structures generated from a novel causal data prior. Empirically, TS-ICL achieves a new state-of-the-art in imputation, while remaining competitive with leading forecasting foundation models across both univariate and covariate-aware benchmarks. It shows particularly strong performance in forecasting with partially observed look-back windows.

2606.05875 2026-06-05 cs.AI cs.DB

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

QCFuse: 通过压缩视图的查询感知缓存融合实现高效RAG服务

Jianxin Yan, Wangze Ni, Zhenxin Li, Jiabao Jin, Zhitao Shen, Haoyang Li, Jia Zhu, Peng Cheng, Xuemin Lin, Lei Chen, Kui Ren

发表机构 * Zhejiang University(浙江大学) East China Normal University(华东师范大学) Ant Group(蚂蚁集团) The Hong Kong Polytechnic University(香港理工大学) Zhejiang Normal University(浙江师范大学) Tongji University(同济大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出QCFuse,一种基于压缩视图的查询感知选择器,通过块锚查询探测和关键层分析实现高效RAG缓存融合,在保持全预填充质量的同时平均加速1.7倍。

详情
AI中文摘要

检索增强生成(RAG)通过将生成过程基于外部证据来提高大语言模型(LLM)的答案质量,但处理检索到的上下文使得预填充阶段成为主要的服务成本。RAG缓存融合通过重用检索块的预计算键值(KV)缓存,并选择性地在当前提示下重新计算令牌来降低这一成本。然而,现有的选择器在质量和效率之间面临两难:快速的查询无关或最终层查询到上下文选择器可能遗漏与请求相关的证据,而全视图查询感知选择器在重新计算之前需要广泛的上下文和层可见性,因此会阻塞逐层缓存融合流水线。我们提出QCFuse,一种用于RAG缓存融合的压缩视图查询感知选择器。QCFuse使用块锚查询探测将用户查询状态条件化到紧凑的每块锚点上,并通过关键层分析识别重新计算令牌而无需检查所有层。我们在SGLang中实现QCFuse,并在六个数据集上对四个开放权重LLM进行评估。QCFuse达到了全预填充级别的质量。在匹配质量下,QCFuse相比全预填充实现了平均1.7倍的预填充加速,相比最强的保质量基线ProphetKV实现了1.5倍加速。

英文摘要

Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.

2606.05874 2026-06-05 cs.CL

Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models

评估多模态大语言模型中的随机坍缩与隐式偏差

Huiyuan Zheng, Houtao Zhang, Boyang Wang, Qingyi Si, Hongcheng Guo

发表机构 * Fudan University(复旦大学) Beihang University(北航) JD.com(京东)

AI总结 提出RandomBench基准测试,通过熵和分布偏差指标揭示多模态大语言模型在逻辑中性场景下存在随机坍缩现象,即无法维持均匀随机性。

详情
AI中文摘要

当前对多模态大语言模型(MLLMs)的评估绝大多数关注效用驱动目标,导致模型在逻辑中性场景下的行为在很大程度上未被探索。在多个行动同样有效的情况下(如推荐旅行路线或日常安排,多个选项具有相似效用),随机性是必要的。在此类设置中,确定性策略可能导致重复行为和有效替代方案的覆盖减少。为弥补这一空白,我们提出RandomBench,一个旨在评估MLLMs在选择等价选项时是否能维持分布中性行为的基准测试。我们进一步引入三个指标(RI、BCI和BII),以量化熵和分布偏差。实验揭示了一种普遍现象,称为随机坍缩,即MLLMs在明确的随机指令下无法维持均匀随机性,Claude Sonnet 4.6中top-1概率达到97%(理想为四分之一),RI降至0.068。广泛的消融研究进一步表明,这些偏差在不同语言和表示格式中持续存在,突显了逻辑中性决策设置中分布坍缩的鲁棒性。

英文摘要

Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where multiple options have similar utility. In such settings, deterministic policies may lead to repetitive behaviors and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics, RI, BCI, and BII, to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon termed Stochastic Collapse, where MLLMs fail to maintain uniform randomness under explicit random instructions, with top-1 probabilities reaching 97% against an ideal baseline of one quarter and RI dropping to 0.068 in Claude Sonnet 4.6. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, highlighting the robustness of distributional collapse in logic-neutral decision settings.
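
The abstract does not define RI, BCI, or BII, so the following is only a generic illustration of how departure from uniform randomness can be quantified. It computes the normalized Shannon entropy of an empirical choice distribution, which is 1 for a perfectly uniform spread over the options and 0 when a single option is always chosen. The function name is hypothetical, and the score is not claimed to match any of the paper's metrics.

```python
from collections import Counter
from math import log

def normalized_entropy(choices, num_options):
    """Shannon entropy of the observed choices, divided by log(num_options).

    choices: sequence of option labels picked across repeated trials.
    num_options: number of equally valid options offered (at least 2).
    """
    if num_options < 2:
        raise ValueError("need at least two options")
    if not choices:
        raise ValueError("need at least one observed choice")
    total = len(choices)
    counts = Counter(choices)
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    return entropy / log(num_options)

# A perfectly spread sample versus a heavily skewed one over four options.
uniform_score = normalized_entropy(["A", "B", "C", "D"] * 25, 4)
skewed_score = normalized_entropy(["A"] * 97 + ["B", "C", "D"], 4)
```

A model that follows a "choose at random" instruction faithfully would be expected to score near 1 on such a measure, while a collapsed model scores near 0.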

2606.05873 2026-06-05 cs.RO cs.AI cs.CV cs.LG

LadderMan: Learning Humanoid Perceptive Ladder Climbing

LadderMan: 学习人形机器人感知爬梯

Siheng Zhao, Yuanhang Zhang, Ziqi Lu, Pieter Abbeel, Rocky Duan, Koushil Sreenath, Yue Wang, C. Karen Liu, Guanya Shi

发表机构 * Amazon FAR(亚马逊FAR) USC(美国南加州大学) UC Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) CMU(卡内基梅隆大学)

AI总结 提出LadderMan系统,通过两阶段学习管道和视觉基础模型,使人形机器人能够鲁棒地攀爬多种梯子并在梯子上进行操控。

详情
AI中文摘要

人形机器人在以人为中心的环境中具有巨大潜力,但由于稀疏的立足点和手抓点、复杂的全身协调以及对感知和控制误差的敏感性,爬梯仍然是最具挑战性的任务之一。我们提出了LadderMan,一个统一的系统,使人形机器人能够鲁棒地攀爬多种梯子并在这种受限条件下进行操控。我们的攀爬策略基于一个可扩展的两阶段学习管道,其中我们使用混合运动跟踪从单个参考运动学习多个攀爬专家,并通过混合模仿和强化学习将这些专家蒸馏成一个统一的基于深度视觉的运动攀爬策略。为了实现真实世界部署,我们利用视觉基础模型来弥合深度感知中的模拟到现实差距。基于学习到的攀爬策略,我们进一步使用双智能体公式训练一个独立的操控策略,允许通过遥操作在梯子上进行稳定操控。实验表明,LadderMan在多种几何形状的梯子上实现了鲁棒的攀爬,以零样本方式成功迁移到真实世界硬件,并在具有挑战性的梯子约束下支持各种操控任务。视频结果见 https://ladderman-robot.github.io 。

英文摘要

Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at https://ladderman-robot.github.io.

2606.05868 2026-06-05 cs.CL

YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition

YouZhi:通过自适应GQA到MLA转换实现高并发金融大语言模型

PSBC LLM Team, Huawei LLM Team, Ruihan Long, Junjie Wu, Tianan Zhang, Duo Zhang, Yaozong Wu, Jinbin Fu, Chang Liu, Zhentao Tang, Wenshuang Yang, Xin Wang, Zhihao Song, Ning Huang, Wenjing Xu, Shuai Zong, Shupei Sun, Sen Wang, Jing Hu, Bin Wang, Xinyu Wang, Junkui Ju, Zequn Ding, Jie Ran, Man Luo, Shixiong Kai, Linkai Hou, Kaichao Liang, Hu Zhao, Yang Zhao, Shucheng Lin, Wei Yu, Chenghan Jiang, Jingjing Ding, Jiahui Zhang, Tian Jin, Yuhang Zhang, Dong Guo, Wei Sun, Jun Xie, Jianwei Li, Lei Cao, Pei Li, Jiabin Li, Jia Yuan, Rui Yuan, Jing Zhu, Mingxuan Yuan, Zhangcheng Lv, Xin Jiang, Xiuhong Fei, Xiaozhe Ren, Yulong Li, Zhipeng Zhang, Hang Wang, Zhaohui Xu, Rui Zhao, Yibo He, Xinzhuang Niu

发表机构 * Postal Savings Bank of China & Huawei LLM Team(中国邮政储蓄银行及华为LLM团队) Postal Savings Bank of China(中国邮政储蓄银行) Huawei Technologies(华为技术)

AI总结 提出YouZhi-LLM,通过层自适应GQA-to-MLA转换框架和基于昇腾的训练流水线,显著压缩KV缓存并提升金融领域高并发推理效率。

详情
AI中文摘要

大语言模型推动了重大金融创新,但其高并发部署受到KV缓存内存开销的严重瓶颈,这增加了基础设施成本并限制了可扩展性。为解决这一问题,我们提出YouZhi-LLM,一种高效金融大语言模型,通过基于华为昇腾生态系统的全面结构转换和训练流水线实现。在其算法核心,YouZhi-LLM采用层自适应GQA-to-MLA转换框架,动态分配每层的FreqFold大小,在最大化KV缓存压缩的同时最小化困惑度下降。为恢复表示能力并注入领域知识,基于昇腾的训练流水线无缝集成广义知识蒸馏与金融特定监督微调。评估表明该系统性方法的优越性,自适应转换相比均匀基线将困惑度下降减少高达35%。关键的是,在昇腾NPU上通过vLLM-Ascend评估时,大规模KV缓存减少直接转化为部署效率。与各自基础模型相比,YouZhi-7B在平均金融基准分数上提升12.3%,同时最大并发数提升2.69倍;类似地,YouZhi-14B实现7.0%的准确率提升和2.43倍的并发提升,为成本高效、高吞吐的金融推理建立了新范式。

英文摘要

Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69x increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43x concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.

2606.05864 2026-06-05 cs.CL

Analysis of the Neglect-Zero Effect in Large Language Models

大型语言模型中忽视零效应的分析

Jin Tanaka, Daiki Matsuoka, Ryoma Kumon, Hitomi Yanaka

发表机构 * The University of Tokyo(东京大学) RIKEN(理化学研究所) Tohoku University(东北大学)

AI总结 本研究通过结构启动范式,探究大型语言模型是否像人类一样存在忽视零效应,即忽略使命题因空集而空洞为真的零模型。

详情
Comments
14 pages (10 pages main text), 8 figures. To appear in the Proceedings of the ACL2026 Student Research Workshop (SRW)
AI中文摘要

我们研究了LLM的语言处理在多大程度上类似于人类的认知过程,重点关注一种称为忽视零效应的人类认知偏差。这种效应指的是人类倾向于忽略零模型,即那些因空集而使命题空洞为真的配置。我们关注由忽视零效应驱动的两种推理类型,并通过比较LLM在处理这些推理时的行为与不涉及忽视零效应的推理中的行为,来检验LLM如何处理这些推理。为此,我们采用基于结构启动的范式,其中先前接触一个前导句子(启动句)会因结构相似性而促进后续句子(目标句)的处理。我们准备启动句以迫使LLM考虑零模型,并分析它们是否也在目标句中考虑零模型。结果表明,在本研究分析的LLM中可能未出现忽视零效应。我们的代码可在 https://github.com/ynklab/neglect_zero 获取。

英文摘要

We investigate the extent to which the language processing of LLMs resembles human cognitive processes, focusing on a human cognitive bias called the neglect-zero effect. This effect refers to the human tendency to ignore zero-models, which are configurations that render a proposition vacuously true by virtue of an empty set. We focus on two types of inferences driven by the neglect-zero effect, and examine how LLMs process these inferences by comparing their behavior with that in an inference that does not involve the neglect-zero effect. For this purpose, we employ a paradigm based on structural priming, where recent exposure to a preceding sentence (the prime) facilitates the processing of a subsequent sentence (the target) due to their structural similarity. We prepare primes to force LLMs to consider the zero-model, and analyze whether they also consider it in the target. The results suggest that the neglect-zero effect may not occur in the LLMs analyzed in this study. Our code is available at https://github.com/ynklab/neglect_zero.

2606.05863 2026-06-05 cs.LG cs.AI

Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

通过深度线性网络理论与条件ReLU约简解读Grokking中的两个训练时钟

Hu Tan, Kuo Gai, Shihua Zhang

发表机构 * State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China(数学科学国家重点实验室,数学与系统科学研究院,中国科学院,北京100190,中国) School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China(中国科学院大学数学科学学院,北京100049,中国) Shanghai Institute for Mathematics and Interdisciplinary Sciences (SIMIS), Shanghai, China(上海数学与交叉科学研究所(SIMIS),上海,中国) Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China(浙江省系统健康科学重点实验室,生命科学学院,杭州先进研究院,中国科学院大学,中国科学院,杭州310024,中国)

AI总结 本文通过分离分类损失的快速衰减与表示学习的缓慢简化,定义了“两个训练时钟”形式化Grokking现象,并利用深度线性网络理论和条件ReLU约简机制解释了这一两阶段过程。

详情
AI中文摘要

Grokking表明,拟合训练数据和学习简单底层规则可能发生在不同的时间尺度上。我们通过将分类损失的快速衰减与学习表示的较慢简化分离来形式化这一现象,并将由此产生的停止时间对称为两个训练时钟。对于深度线性网络,我们证明后边际间隙增长或一步尾部收缩条件在对数时间尺度上将交叉熵损失降低到ε水平。相反,当存在逐层权重衰减时,端到端映射上的诱导正则化可以表示为Schatten型惩罚;在尖锐的晚期Kurdyka-Lojasiewicz尾部下,这种结构能量在多项式时间尺度上闭合。因此,两个时钟将拟合与表示简化分开。然后我们解释相同机制如何在ReLU MLP中出现。在训练集上的激活模式保持固定的区域中,网络简化为活动坐标上的线性模型。在两层ReLU嵌入模型中,链式法则估计进一步表明,在受控的下游范数下,分类器头可以比嵌入块接收更大的有效梯度。这支持了一个两阶段机制:分类器先拟合,而表示随后继续简化。我们以模加法作为主要实验设置。深度线性理论提供了分析的核心严格基础。但ReLU结果被表述为条件约简,以解释经验行为,而不声称对非线性训练动态的全局证明。

English abstract

Grokking suggests that fitting the training data and learning a simple underlying rule may occur on different time scales. We formalize this phenomenon by separating the fast decay of the classification loss from the slower simplification of the learned representation, and we call the resulting pair of stopping times two training clocks. For deep linear networks, we show that a post-margin gap-growth or one-step tail-contraction condition reduces the cross-entropy loss to level $\varepsilon$ on a logarithmic time scale. In contrast, when layerwise weight decay is present, the induced regularization on the end-to-end map can be expressed as a Schatten-type penalty; under a sharp late-time Kurdyka-Łojasiewicz tail, this structural energy closes on a polynomial time scale. The two clocks therefore separate fitting from representation simplification. We then explain how the same mechanism can appear in ReLU MLPs. In regions where the activation patterns on the training set remain fixed, the network reduces to a linear model in the active coordinates. In a two-layer ReLU embedding model, chain-rule estimates further show that the classifier head can receive larger effective gradients than the embedding block under controlled downstream norms. This supports a two-stage mechanism in which the classifier fits first, while the representation continues to simplify later. We use modular addition as the main experimental setting. The deep linear theory provides the rigorous core of the analysis, whereas the ReLU results are formulated as conditional reductions that account for empirical behavior without claiming a global proof for nonlinear training dynamics.
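The statement that a ReLU network reduces to a linear model wherever the activation pattern stays fixed can be checked numerically. The following is a minimal sketch under stated assumptions, not the paper's setup: the two-layer, bias-free architecture, the layer sizes, and all function names are chosen here for illustration only.

```python
import random


def pre_activations(W1, x):
    """Hidden-layer pre-activations W1 @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W1]


def relu_mlp(W1, w2, x):
    """Two-layer ReLU network without biases: w2 . relu(W1 x)."""
    return sum(a * max(0.0, z) for a, z in zip(w2, pre_activations(W1, x)))


def activation_pattern(W1, x):
    """Binary mask marking which hidden units are active at x."""
    return [1 if z > 0 else 0 for z in pre_activations(W1, x)]


def effective_linear_weights(W1, w2, mask):
    """Weights of the linear model obtained by freezing the pattern `mask`."""
    n_in = len(W1[0])
    return [
        sum(w2[j] * mask[j] * W1[j][i] for j in range(len(W1)))
        for i in range(n_in)
    ]


rng = random.Random(0)
W1 = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(5)]
w2 = [rng.gauss(0, 1) for _ in range(5)]
x = [rng.gauss(0, 1) for _ in range(3)]

mask = activation_pattern(W1, x)
v = effective_linear_weights(W1, w2, mask)

# Positive scaling keeps every pre-activation's sign, so the pattern is unchanged.
x_scaled = [2.0 * xi for xi in x]
linear_out = sum(vi * xi for vi, xi in zip(v, x_scaled))
print(relu_mlp(W1, w2, x_scaled), linear_out)
```

Both printed values agree up to floating-point rounding, because only the active hidden units contribute and they do so linearly. An input that crosses an activation boundary would change `mask` and therefore the effective weights.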

2606.05859 2026-06-05 cs.CL

TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization


Liting Zhang, Shiwan Zhao, Xuyang Zhao, Zichen Xu, Jianye Wang, Qicheng Li

Affiliations: TMCC, College of Computer Science, Nankai University, Tianjin, China

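AI summary: Proposes TARPO, a framework that uses action-routing policy optimization to switch adaptively at each step between discrete token generation and continuous latent reasoning, addressing the way continuous representations limit policy exploration in latent reasoning; experiments show that it outperforms existing explicit and latent reasoning baselines.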

Details
Comments
18 pages, 12 figures. Code available at https://github.com/NKU-LITI/TARPO-master

English abstract

Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama-3.1-8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching behaviors while maintaining stable training dynamics. Our code is available at https://github.com/NKU-LITI/TARPO-master.
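The two ingredients named in this abstract, a router that samples a binary mode from the hidden state and a group-relative advantage signal, can be sketched in a few lines. This is an illustration under assumptions rather than the TARPO implementation: the linear-plus-sigmoid router, the mean/standard-deviation normalization (in the style of GRPO), and every name below are hypothetical.

```python
import math
import random


def route(hidden, weights, bias, rng):
    """Sample a mode ("latent" or "explicit") from a linear router head.

    Assumption: the router is a single linear layer followed by a sigmoid.
    Sampling, rather than thresholding, keeps the decision stochastic.
    """
    logit = sum(w * h for w, h in zip(weights, hidden)) + bias
    p_latent = 1.0 / (1.0 + math.exp(-logit))
    mode = "latent" if rng.random() < p_latent else "explicit"
    return mode, p_latent


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)
mode, p_latent = route([0.2, -0.1, 0.4], [0.5, 0.3, -0.2], 0.0, rng)
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(mode, p_latent, advantages)
```

In a policy-gradient update, the same per-response advantage would weight the log-probabilities of both the token choices and the routing choices, which is one way to read "shared" in the abstract.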