arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.10449 2026-06-16 cs.RO New submission

GuideWalk: Learning Unified Autonomous Navigation and Locomotion for Humanoid Robots across Versatile Terrains

Haoxuan Han, Chen Chen, Linao Gong, Xin Yang, Hao Hu, Junhong Guo, Zhicheng He, Yao Su, Fenghua He

Affiliations: Harbin Institute of Technology; Leju Robotics

AI summary: Proposes the GuideWalk framework, which combines traversability-aware navigation guidance with distillation from terrain-adaptive locomotion teachers to achieve stable navigation and coordinated locomotion for humanoid robots on complex terrain.

Abstract

Humanoid robots have achieved strong locomotion capabilities, but reliable navigation on versatile terrains remains challenging because obstacle avoidance must be coordinated with dynamically feasible motion. In this work, we present GuideWalk, a unified end-to-end framework that integrates traversability-aware navigation guidance with a terrain-adaptive locomotion teacher for humanoid navigation. Specifically, we introduce a navigation module that provides explicit velocity guidance, decoupling obstacle avoidance from terrain conditions to enable robust planning across diverse environments. We propose a composite teacher distillation scheme, where goal-directed commands and dynamically consistent actions are aggregated and distilled into a single policy. To further improve robustness, the distilled policy is refined with reinforcement learning and an auxiliary behavior cloning objective, which promotes exploration while preserving desirable teacher behaviors. Experiments demonstrate that GuideWalk achieves effective navigation while maintaining stable humanoid locomotion.
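
Illustration (not from the paper): the refinement stage described above adds an auxiliary behavior cloning term to a reinforcement learning objective. The abstract does not give the loss form, so the sketch below assumes a mean-squared-error cloning term and a hypothetical weight `bc_weight`; `rl_loss` stands in for whatever RL loss the authors actually use.

```python
def bc_loss(student_actions, teacher_actions):
    """Mean squared error between student and teacher actions (assumed form)."""
    n = len(student_actions)
    return sum((s - t) ** 2 for s, t in zip(student_actions, teacher_actions)) / n


def combined_loss(rl_loss, student_actions, teacher_actions, bc_weight=0.1):
    """RL loss plus a weighted behavior cloning penalty.

    bc_weight is a hypothetical hyperparameter: larger values keep the
    policy closer to the teacher, smaller values leave more room to explore.
    """
    return rl_loss + bc_weight * bc_loss(student_actions, teacher_actions)


# Example: the student deviates from the teacher on one of two action dimensions.
example = combined_loss(1.0, [0.0, 1.0], [0.0, 0.0], bc_weight=0.5)
```

The single weight makes the trade-off named in the abstract explicit: the RL term rewards exploration, while the cloning term penalizes drifting away from teacher behavior.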

2606.10237 2026-06-16 cs.AI cs.LG New submission

Minimalist Genetic Programming

Leonardo Trujillo

Affiliations: Tecnológico Nacional de México/IT de Tijuana; LASIGE, Department of Informatics, Faculty of Sciences, University of Lisbon

AI summary: Proposes Minimalist Genetic Programming (MGP), which borrows from the Minimalist Program in linguistics and replaces evolutionary search with the MERGE operation; on symbolic regression tasks it avoids bloat and consistently finds exact solutions.

Abstract

Genetic programming (GP) is based on two important insights. First, any learning task can fundamentally be posed as a program induction problem, where the goal is to construct a symbolic hierarchical model that is expressed as a syntax tree. Second, this task can be posed as a search problem, with evolution used to locate the desired model. Since it was proposed, GP has produced notable results in a wide range of tasks and problem domains. This work presents an alternative view by modifying the second core insight of GP, posing the problem as a syntactic derivation task instead. In particular, this paper presents Minimalist Genetic Programming (MGP), an algorithm that, like GP, is biologically inspired, but instead of evolution it takes inspiration from the Minimalist Program for human language, in which syntax is understood as an optimal solution to the problem of linking two other mental systems. In minimalism, the core computational process is a binary set formation operator called MERGE, which can be used to incrementally construct complex syntactic structures using a simple Markovian process. MGP is able to discover the core building blocks of the symbolic expressions and to incrementally combine them using MERGE. The proposed system is benchmarked on symbolic regression tasks that are known to be difficult to solve with standard GP systems because of the propensity for bloat. Results show that when a proper lexicon of atomic syntactic objects is chosen, MGP is able to consistently produce the exact ground truth model on a set of symbolic regression tasks where standard GP struggles to do the same. The insights provided by minimalism are shown to be relevant to the problem of program induction, and should be explored further based on the potential exhibited by MGP in this work.
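
Illustration (not from the paper): in Minimalist syntax, MERGE is commonly described as taking two objects and forming the unordered set containing both. The sketch below shows only that set-formation step with a hypothetical lexicon of atoms; how MGP selects atoms or maps the resulting sets onto expression trees is not stated in the abstract.

```python
def merge(a, b):
    """Binary set formation: Merge(a, b) = {a, b} (unordered)."""
    return frozenset((a, b))


def depth(obj):
    """Nesting depth of a syntactic object; atoms have depth 0."""
    if isinstance(obj, frozenset):
        return 1 + max(depth(child) for child in obj)
    return 0


# Hypothetical atoms for a symbolic regression lexicon.
inner = merge("x", "1")     # {x, 1}
outer = merge("+", inner)   # {+, {x, 1}}
```

Each application adds one level of structure, which is the incremental construction the abstract refers to. Because the result is a set, `merge("x", "1")` and `merge("1", "x")` are the same object.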

2606.09777 2026-06-16 cs.RO New submission

AetheRock: An Arm-Worn Robot Teaching System for Force-Guided Vision-Tactile Learning

Hong Li, Yue Xu, Yihan Tang, Yankang Dong, Chenyuan Liu, Chenyang Yu, Xuyang Li, Siyuan Huang, Yujun Shen, Nan Xue, Yong-Lu Li

Affiliations: Shanghai Jiao Tong University; Ant Group; Shanghai Innovation Institute; Beijing Institute for General Artificial Intelligence (BIGAI)

AI summary: Proposes AetheRock, an arm-worn device that collects gripper-force, vision, and tactile data, and ForceVT, a framework that uses force and vision to guide tactile learning, addressing incompatible sensor assembly in force-aware robot learning.

Abstract

Force and tactile sensing are indispensable in contact-rich manipulation. However, force-aware robot learning faces critical challenges due to the incompatible assembly of tactile and force sensors in handheld or wearable devices. To address these limitations, we first introduce AetheRock for gripper-force, vision, and tactile data collection, which is an arm-worn device featuring a modular and easily manufactured visuo-tactile sensor, GelSlim-MiniFab, at the fingertip, a resistive pressure sensor at the human finger contact region, a customized PCB module, and a wearable kit for comfortable and robust collection. Building on this, we propose ForceVT, a representation learning framework that uses force and vision to guide fidelity-agnostic tactile learning, enabling robust inference in any tactile situation. Real-world experiments show that AetheRock achieves qualified data efficiency and that ForceVT effectively alleviates inefficiencies when visuo-tactile sensors exhibit manufacturing and utilization inconsistencies. Overall, our work mitigates the limitations of gripper-force vision-tactile robot learning through innovative hardware design and algorithms.

2606.09717 2026-06-16 cs.SD eess.AS New submission

What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study

Zhu Li, Shekhar Nayak, Matt Coler

Affiliations: University of Groningen

AI summary: Manipulates speech rate, pitch variation, and loudness with a controllable neural TTS system and finds that loudness primarily drives human sarcasm perception while the model relies more on speech rate, revealing different prosodic cue weightings.

Comments: Accepted to Interspeech 2026

Abstract

Prosody plays an important role in sarcasm perception, yet previous studies have relied on naturally produced speech that lacks fine-grained control over individual acoustic dimensions. As prosodic cues co-vary in natural data, isolating their independent contributions remains challenging. We introduce a controlled framework using neural text-to-speech (TTS) with prompt-based prosodic conditioning to manipulate speech rate, pitch variation, and loudness. An orthogonal stimulus set was constructed to enable causal testing of prosodic cue effects. Human listeners rated sarcasm and naturalness, and their judgments were compared with predictions from a foundation model capable of processing audio input. Results show that loudness primarily drives human sarcasm perception, whereas the model assigns greater weight to speech rate, leading to distinct cue-weighting patterns. This study shows how controllable neural TTS enables investigation of prosodic cue weighting in speech perception.
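
Illustration (not from the paper): one common way to build an orthogonal stimulus set is a fully crossed factorial design, in which every level of each cue is paired with every level of the others so the cues no longer co-vary. The three levels per cue below are hypothetical; the abstract does not state how many levels were used or whether the design was fully crossed.

```python
from itertools import product

# Hypothetical levels for the three manipulated prosodic cues.
LEVELS = {
    "speech_rate": ["slow", "neutral", "fast"],
    "pitch_variation": ["low", "neutral", "high"],
    "loudness": ["soft", "neutral", "loud"],
}


def orthogonal_stimuli(levels):
    """Return every combination of cue levels (full factorial design)."""
    names = list(levels)
    combos = product(*(levels[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


stimuli = orthogonal_stimuli(LEVELS)
```

In such a design each loudness level appears equally often with each speech rate and pitch level, so a difference in ratings between loudness levels cannot be attributed to the other two cues.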

2606.09669 2026-06-16 cs.AI cs.CL New submission

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan, Yinpeng Dong

Affiliations: Tsinghua University; Chongqing University; Peking University; ZenoMind AI; Xi'an Jiaotong University; Beijing Institute of Technology; Southeast University; Shanghai Jiao Tong University; Joy Future Academy; The University of Hong Kong

AI summary: Proposes the SpatialWorld benchmark, which integrates eight heterogeneous simulation backends and uses 760 human-annotated tasks to evaluate the interactive spatial understanding of multimodal agents under vision-only partial observability; the strongest model, GPT-5, reaches a task success rate of only 17.4%.

Abstract

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
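
Illustration (not from the paper): the abstract says each task bundles an initial state, a reference trajectory, and a terminal-state verifier, and that agents are scored by task success rate (TSR). The sketch below assumes a simple dictionary state and a boolean verifier; the `Task` class, its field names, and the two toy tasks are hypothetical and do not reflect the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    task_id: str
    initial_state: Dict[str, str]
    reference_trajectory: List[str]
    verify: Callable[[Dict[str, str]], bool]  # terminal-state verifier


def task_success_rate(tasks, final_states):
    """Fraction of tasks whose final state passes the task's verifier."""
    successes = sum(task.verify(final_states[task.task_id]) for task in tasks)
    return successes / len(tasks)


tasks = [
    Task("fetch_mug", {"mug": "kitchen"},
         ["go kitchen", "pick mug", "go desk", "place mug"],
         lambda state: state.get("mug") == "desk"),
    Task("close_door", {"door": "open"},
         ["go door", "close door"],
         lambda state: state.get("door") == "closed"),
]
final_states = {"fetch_mug": {"mug": "desk"}, "close_door": {"door": "open"}}
tsr = task_success_rate(tasks, final_states)
```

Checking only the terminal state means an agent is not required to follow the reference trajectory, which is consistent with the abstract reporting success and execution efficiency as separate quantities.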

2606.09150 2026-06-16 cs.CV New submission

Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions

Luxury, Jie Huang, Zihao Fan, Xiaoxiao Ma, Jun-hao Zhuang, Yuming Li, Zeyue Xue, Siming Fu, Haoran Li, Mingchen Zhong, Guohui Zhang, Shichen Ma, Yijun Liu, Jiaqi Shi, Yanwen Ma, Yaofeng Su, Haoyu Wang, Yaowei Li, Songchun Zhang, Weiyang Jin, Yuxuan Bian, Shiyi Zhang, Haojun Xu, Shuai Lu, Xin Han, Wei Tang, Haoyang Huang, Nan Duan

Affiliations: JD Explore Academy; USTC; PKU; THU; BUAA; FDU; HKUST; HKU; CUHK

AI summary: Proposes Ultra Flash, a cascaded framework that combines architecture-preserving super-resolution training, a causal streaming latent upsampler with a high-resolution decoder, and a cascade optimization scheme to reach real-time high-resolution streaming video generation at about 30 FPS at 1K and about 18 FPS at 2K on a single GPU.

Abstract

While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm coupled with an AIGC-oriented data degradation pipeline that effectively preserves the generative capability of the base model, enabling enhanced high-resolution detail when cascaded after mainstream low-resolution generative models; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling and precise high-resolution decoding with negligible computational overhead; and (3) a cascade high-resolution streaming video generation optimization scheme that first performs hybrid-reward-enhanced sparse causalization and single-step distillation of the super-resolution model, then introduces cascaded streaming self-forcing preference optimization with dynamic cache management, jointly enhancing overall coherence, improving quality, and enabling real-time high-resolution streaming video generation. Extensive experiments demonstrate that Ultra Flash reliably produces ultra-high-resolution streaming video while maintaining state-of-the-art visual quality and superior efficiency. Project Page: https://xin1u.github.io/UltraFlash/

2606.09076 2026-06-16 cs.CV New submission

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Xin Jin, Huanqia Cai, Zhen Li, Zechao Zhan, Dengyang Jiang, Aiming Hao, Yuming Jiang, Chunle Guo, Peng Gao, Ming-Ming Cheng, Steven C. H. Hoi

Affiliations: Alibaba Group; Nankai University

AI summary: Proposes the Z-Reward framework, which uses a teacher-student setup to internalize reasoning-based rewards into the score distribution of a compact VLM, enabling efficient and accurate text-to-image optimization.

Comments: Z-Image Team Technical Report

Abstract

Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
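
Illustration (not from the paper): a reward expressed as a distribution over rubric scores can be reduced to a scalar by taking its expectation, while its spread retains the uncertainty a single scalar would discard. The sketch below assumes a 1 to 5 rubric and a softmax over per-score logits; both are assumptions, since the abstract does not specify the score range or how the distribution is parameterized.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(logits, scores):
    """Expectation of the score under the predicted distribution."""
    return sum(p * s for p, s in zip(softmax(logits), scores))


def score_std(logits, scores):
    """Standard deviation of the score, a simple measure of rater disagreement."""
    probs = softmax(logits)
    mean = sum(p * s for p, s in zip(probs, scores))
    return math.sqrt(sum(p * (s - mean) ** 2 for p, s in zip(probs, scores)))


RUBRIC = [1, 2, 3, 4, 5]                # hypothetical rubric scale
uniform_logits = [0.0] * 5              # maximally uncertain prediction
peaked_logits = [-20.0, -20.0, 0.0, -20.0, -20.0]  # confident "3"
```

Both example predictions have an expected score of about 3, but very different spreads, which is the kind of information a deterministic scalar reward cannot carry.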

2606.09039 2026-06-16 cs.AI New submission

Agent Economics: An Entropy-Controlled Pluralistic Alignment Framework for Preventing Artificial Hivemind in Autonomous Agents

Cheonsu Jeong

Affiliations: AX Center, SAMSUNG SDS

AI summary: Proposes the Behavioral Protocol Framework (BPF), whose three modules (mentalizing-based social intelligence, pluralistic alignment, and a verifiable execution kernel) control entropy within a closed-loop architecture to preserve strategic diversity, with the goal of improving the stability, efficiency, and trustworthiness of autonomous agent economies.

Comments: 15 pages, 2 figures, 1 table

Abstract

This study proposes the Behavioral Protocol Framework (BPF), an entropy-controlled pluralistic alignment framework designed to address two critical challenges in autonomous agent economies: the hivemind effect arising from excessive strategic convergence among agents and the lack of transparency in autonomous decision-making processes. The proposed BPF consists of three core modules: Mentalizing-based Social Intelligence (MbSI) grounded in Theory of Mind (ToM), Pluralistic Alignment (PA), and a Verifiable Execution Kernel (VEK). These modules are organically integrated within a closed-loop architecture that governs the entire lifecycle of agent behavior, from decision-making and execution to verification and feedback. To evaluate the proposed framework, a simulation environment implemented in Python and a Streamlit-based user interface will be developed. Through empirical experimentation, the study aims to examine whether the entropy-control mechanism of the PA module can effectively preserve strategic diversity among agents and mitigate collective convergence, while the VEK module provides a comprehensive and transparent audit trail of the decision-making process. The anticipated results are expected to demonstrate that the proposed framework can simultaneously enhance the stability, efficiency, and trustworthiness of autonomous agent economies. Consequently, this research offers a practical approach for developing robust, transparent, and accountable agent-native economic systems.
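
Illustration (not from the paper): strategic diversity among agents can be quantified with the Shannon entropy of the strategies they currently play, and a controller can react when it falls below a floor. The abstract does not define the entropy-control mechanism of the PA module, so the threshold check and the name `min_entropy_bits` below are hypothetical.

```python
import math
from collections import Counter


def strategy_entropy(strategies):
    """Shannon entropy (in bits) of the empirical strategy distribution."""
    counts = Counter(strategies)
    n = len(strategies)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def needs_diversification(strategies, min_entropy_bits):
    """True when agents have converged more than the chosen floor allows."""
    return strategy_entropy(strategies) < min_entropy_bits


converged = ["bid_high"] * 4                                  # hivemind: one strategy
diverse = ["bid_high", "bid_low", "hold", "exit"]             # four distinct strategies
```

Entropy is zero when every agent plays the same strategy and reaches log2(k) bits when k strategies are used equally often, so a floor between those values marks how much convergence is tolerated.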

2606.08867 2026-06-16 cs.CL New submission

Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework

Aman Gupta, Kevin Rossell, Edesio Alcobaça, Jose Chrystian Lima Pacheco, Carolina Baptista de Lima, Shao Tang, Luiz Paulo Rabachini, Luis Moneda, Herbert Fei, Daniel Silva, Rohan Ramanath

Affiliations: Nubank

AI summary: Proposes a unified framework that bridges offline development and online impact for customer support AI agents at Nubank's scale of 100M+ users through evaluation-driven development, context engineering, human-in-the-loop prompt iteration, and consistency optimization of LLM judges; results span five production deployments, with a strong correlation between offline metrics and online outcomes reported for the card-delivery deployment.

Comments: 12 pages. Accepted to KDD '26 (32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining)

Abstract

The rapid rise in LLM capabilities has made AI agents increasingly viable across a broad range of tasks. Among the most promising applications is building production-ready customer-facing agents, a challenge that demands coordinated excellence in evaluation methodology, context engineering, training, and online measurement. Yet these critical pillars are typically developed in isolation, creating blind spots that only surface after deployment. In this paper, we present a unified framework that bridges offline development with online impact for customer support AI agents at Nubank, a company with 100M+ users. Our approach integrates several key components: (1) structured context engineering tailored to customer support agents, (2) systematic human-in-the-loop prompt iteration, (3) rigorous LLM judge evaluation with measured inter-rater agreement and GEPA optimization for consistency, and (4) ideation-to-production validation. A central insight is that evaluation-pipeline quality directly determines iteration velocity. We present results from five production deployments spanning distinct domains: card delivery, debt management, credit-limit support, card management, and product explanation. These deployments deliver consistent customer-satisfaction gains while substantially accelerating iteration. In our card-delivery deployment, large-scale A/B testing yields a 37 percentage-point improvement in AI transactional Net Promoter Score and a 29 percentage-point gain in self-service rate over prior agent variants, alongside a strong correlation between offline simulation metrics and online outcomes, demonstrating that eval-driven development reliably predicts production impact. On most use cases, AI satisfaction reaches within a few percentage points of expert human agents.
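
Illustration (not from the paper): agreement between an LLM judge and a human rater is often summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The abstract mentions measured inter-rater agreement but does not name the statistic, so kappa is an assumption here, and the labels are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


human = ["pass", "pass", "fail", "fail"]   # hypothetical human labels
judge = ["pass", "fail", "fail", "fail"]   # hypothetical LLM judge labels
kappa = cohen_kappa(human, judge)
```

Here the raters agree on three of four items, but half of that agreement is expected by chance, so kappa is lower than the raw agreement rate.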

2606.08781 2026-06-16 cs.CV New submission

DeepMine-Mamba: Mitigating Information Dilution in Mamba-Based State Space Models for Document Image Binarization

Sheng-Wei Chan, Yung-Che Wang, Hsin-Jui Pan, Chia-Min Lin, Jen-Shiun Chiang

Affiliations: Department of Electrical and Computer Engineering, Tamkang University

AI summary: Proposes DeepMine-Mamba, whose Anti-Dilution Gate selectively restores stroke-sensitive local responses and suppresses unnecessary background enhancement, addressing the dilution of weak foreground cues in Mamba state space models for document binarization.

Comments: Code will be released at https://github.com/henrychan0719/Deep-Mine-Mamba

Abstract

Document image binarization aims to separate foreground text from degraded backgrounds while preserving thin, broken, and low-contrast strokes. Although deep learning methods have improved binarization performance, most existing approaches rely on convolutional, transformer-based, or generative architectures, while Mamba-based state space models remain largely unexplored for this task. In this work, we investigate Mamba-based feature propagation and observe that direct state-space propagation may dilute weak foreground cues during long-range modeling, especially faint ink traces, fragmented characters, and boundary-sensitive stroke details. To address this problem, we propose DeepMine-Mamba, a Mamba-based binarization framework equipped with a novel Anti-Dilution Gate that estimates propagation-induced feature changes and selectively restores stroke-sensitive local responses while suppressing unnecessary background enhancement. Experiments on DIBCO/H-DIBCO benchmarks under a strict leave-one-year-out protocol show that DeepMine-Mamba achieves competitive overall performance, with strong average FM and Fps across benchmark years. Ablation results further show that the Anti-Dilution Gate is the key component for mitigating propagation-induced foreground dilution and improving stroke preservation.
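
Illustration (not from the paper): the FM reported above is the F-measure, the harmonic mean of pixel-level precision and recall on the text (foreground) class. The sketch below computes it as a fraction for small binary masks, whereas benchmark tables usually report a percentage. Fps, the pseudo F-measure, weights pixels differently and is not reproduced here.

```python
def f_measure(pred, truth):
    """F-measure for binary masks given as 2D lists (1 = text, 0 = background)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1          # text pixel correctly kept
            elif p == 1 and t == 0:
                fp += 1          # background marked as text
            elif p == 0 and t == 1:
                fn += 1          # text pixel lost, e.g. a faint stroke
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


pred = [[1, 1, 0],
        [0, 0, 0]]
truth = [[1, 0, 0],
         [1, 0, 0]]
score = f_measure(pred, truth)
```

Lost stroke pixels count as false negatives and lower recall, which is why diluted faint strokes show up directly in this metric.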

2606.08594 2026-06-16 cs.LG eess.SP New submission

How Much Capacity Does EEG Denoising Need? Ultra-Compact Networks reveal Benchmark Saturation and Metric-Utility Gap

Jasmeet Singh Bindra, Siddharth Panwar

Affiliations: Indian Knowledge Systems and Mental Health Applications (IKSMHA) Center, Indian Institute of Technology Mandi; School of Computing and Electrical Engineering, Indian Institute of Technology Mandi

AI summary: With the architecture fixed and only channel width varied (1.05K to 40.26K parameters), finds that EEG denoising reconstruction performance saturates at 3-6.5K parameters and that reconstruction metrics do not predict downstream BCI utility; ultra-compact models (33-46 KB) are suitable for edge deployment.

Comments: 17 pages, will be submitted to peer-reviewed journal

Abstract

Deep learning EEG denoising architectures have scaled from tens of thousands to tens of millions of parameters, yet no prior study has isolated model capacity as the experimental variable or tested whether reconstruction metrics predict downstream neural-signal utility. We address both gaps by fixing architecture, loss, data split, and training recipe while sweeping only channel width from 1.05K to 40.26K parameters in a minimal depthwise-separable convolutional U-Net. Models were evaluated on the EEGDenoiseNet benchmark, cross-dataset BCI transfer tests, controlled baseline retraining, and downstream motor-imagery classification with five decoder families across all nine BCI Competition IV-2a subjects. Reconstruction performance saturated by 3-6.5K parameters, with post-elbow gains of at most 0.015 correlation coefficient per log10-parameter unit. An 8.46M-parameter baseline retrained under the same pipeline matched the 40.26K compact variant on EOG--a 200x parameter gap yielding no advantage--while a Patch-Transformer control reproduced the same diminishing-return shape. Downstream evaluation exposed a classifier-dependent metric-utility gap: reconstruction-optimized denoising significantly degraded CSP+LDA classification across all nine subjects and three artifact types (best denoised accuracy 0.547 vs. 0.612 noisy baseline; Bonferroni p=0.0488), persisting on naturally recorded trials (Delta=-0.047; BH-FDR q=0.0049). End-to-end neural decoders showed variable or neutral effects. Standard EEG denoising benchmarks are saturated far below current model capacity, and reconstruction metrics do not predict BCI utility. Ultra-compact models at 33-46 KB and 1.27-2.61M FLOPs/segment are practical for edge deployment. These findings argue for capacity-controlled evaluation, harder task-aware benchmarks, and mandatory downstream validation.
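
Illustration (not from the paper): a depthwise-separable 1D convolution splits a standard convolution into a per-channel (depthwise) filter and a 1x1 (pointwise) channel mix, which is why widening the channels grows the parameter count far more slowly than in a standard convolution. The kernel size and widths below are hypothetical and are not meant to reproduce the 1.05K to 40.26K totals in the abstract.

```python
def standard_conv1d_params(c_in, c_out, kernel, bias=True):
    """Parameters of a standard 1D convolution layer."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def separable_conv1d_params(c_in, c_out, kernel, bias=True):
    """Parameters of a depthwise convolution followed by a pointwise (1x1) one."""
    depthwise = c_in * kernel + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Sweep a hypothetical channel width with a fixed kernel size of 7.
for width in (4, 8, 16, 32):
    print(width,
          separable_conv1d_params(width, width, kernel=7),
          standard_conv1d_params(width, width, kernel=7))
```

Sweeping only `width` while holding everything else fixed mirrors the experimental design described above, where capacity is the single variable that changes.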

2606.08592 2026-06-16 cs.LG quant-ph New submission

Quantum Global Variational Learning for Quantum Error Correction

Shun Ryuzaki, Hideo Mukai

Affiliations: Meiji University

AI summary: Proposes a quantum neural network with a global structure that reduces the number of unitary matrices in the quantum circuit, cutting training time by 97%, raising the training completion rate by up to 25% to reach a 100% training success rate, and surpassing the error correction performance of previous studies.

Comments: 24 pages, 22 figures

Abstract

Efficient quantum error correction is essential for the advancement of quantum computing. We propose a quantum neural network with a global structure that reduces the number of unitary matrices required in quantum circuits. This approach resulted in a 97% reduction in training time and up to a 25% improvement in the training completion rate, ultimately achieving a 100% success rate in training while surpassing the error correction performance reported in previous studies. In addition, we demonstrated the enhanced robustness of quantum error correction against internal network noise. Moreover, the fidelity of quantum error correction under internal network noise increased by up to 15% due to the reduced computational load.

2606.08583 2026-06-16 cs.LG eess.SP New submission

A spectral audit framework reveals task-dependent aperiodic reliance across EEG and ECG deep learning

Jasmeet Singh Bindra, Siddharth Panwar

Affiliations: Indian Knowledge Systems and Mental Health Applications (IKSMHA) Center, Indian Institute of Technology Mandi; School of Computing and Electrical Engineering, Indian Institute of Technology Mandi

AI summary: Proposes a spectral audit framework combining aperiodic/periodic decomposition, phase-preserving Fourier interventions, and related controls, and finds that deep learning models' reliance on the aperiodic component is task-dependent and architecture-general: large for sleep-wake classification, moderate for clinical abnormality detection, minimal for motor imagery, and also present in ECG.

Comments: 25 pages, being prepared for submission to peer-reviewed journal

Abstract

Deep learning on physiological time series is interpreted through domain-specific features -- oscillatory rhythms in EEG, morphological complexes in ECG -- yet these signals sit atop a broadband aperiodic 1/f-like envelope that covaries with arousal, age, and pathology. We introduce a spectral audit framework combining aperiodic/periodic decomposition, phase-preserving Fourier interventions, sham controls, and simulation validation. Aperiodic reliance was task-dependent and architecture-general: across six neural architectures, flattening drops exceeded 0.42 balanced-accuracy points for sleep-wake classification, reached 0.07-0.13 for clinical abnormality detection, and remained minimal for motor imagery. Six of seven EEG foundation models showed FDR-significant aperiodic reliance on clinical EEG; age/sex and recording-era controls reduced but did not eliminate the effect. Applying the audit to PTB-XL ECG revealed neural drops of 0.32--0.36 persisting after demographic matching, confirming this confound class extends beyond EEG. Aperiodic controls should become standard for interpretable physiological time-series deep learning.
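
Illustration (not from the paper): a phase-preserving Fourier intervention changes the amplitude of each frequency component while leaving its phase untouched. The sketch below flattens a 1/f-like spectrum by rescaling amplitudes with a known exponent. It is a simplified stand-in: the exponent is assumed to be given rather than fitted, the naive DFT is only suitable for short signals, and the authors' actual decomposition and intervention may differ.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for short signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of the reconstructed signal."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]


def flatten_aperiodic(signal, exponent):
    """Scale amplitude at frequency index f by f**(exponent/2), keeping phase.

    A power spectrum proportional to 1/f**exponent becomes roughly flat.
    The gain is a positive real number, so each component's phase is unchanged.
    """
    n = len(signal)
    out = []
    for k, coeff in enumerate(dft(signal)):
        f = min(k, n - k)  # mirror negative frequencies so the output stays real
        gain = 1.0 if f == 0 else f ** (exponent / 2.0)
        out.append(coeff * gain)
    return idft(out)


# Toy signal: three cosines whose amplitudes fall off as 1/f (power ~ 1/f^2).
N = 16
toy = [sum(math.cos(2 * math.pi * f * t / N) / f for f in (1, 2, 3)) for t in range(N)]
flat = flatten_aperiodic(toy, exponent=2.0)
```

After flattening, the three components of the toy signal have equal amplitude, so a classifier that had relied on the spectral slope would lose that cue while oscillation timing is preserved.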

2606.08525 2026-06-16 cs.CV New submission

DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving

Qimao Chen, Fang Li, Yuechen Luo, Zehan Zhang, Haiyang Sun, Fangzhen Li, Bing Wang, Guang Chen, Yang Ji, Jiong Deng, Hongwei Xie, Hangjun Ye, Long Chen, Yi Zhang

Affiliations: Tsinghua University; Xiaomi EV

AI summary: Proposes the DriveReward dataset and a specialized vision-language reward model that use counterfactual annotation and temporally grounded visual guidance to address the poor generalization of reward acquisition in autonomous driving, reaching performance comparable to rule-based methods in reinforcement learning and trajectory selection.

Abstract

Reward models play a pivotal role in reinforcement learning (RL) and multi-modal trajectory selection for autonomous driving. However, acquiring such rewards typically relies on hand-crafted rule-based objectives or perception ground truth, which hinders generalization as data scales. While Vision-Language Models (VLMs) have demonstrated feasibility as reward models in other domains, their effectiveness in driving tasks remains underexplored. In this work, we bridge this gap by (1) introducing DriveReward, a reasoning trajectory evaluation dataset rigorously labeled via temporally-grounded visual guidance and augmented with counterfactual driving behaviors, and (2) developing a specialized Vision-Language Reward Model. To address the scarcity of failure cases in conventional datasets, we propose a counterfactual data annotation scheme to construct cases encompassing diverse driving styles and erroneous behaviors. Evaluations on our proposed benchmark reveal that even leading open-source and proprietary VLMs fail to excel across all tasks, highlighting significant room for improvement in existing models. Building on these findings, we subsequently tailor a specialized 1B reward model that outperforms larger VLMs on task-specific reward alignment. Finally, we validate our reward model's effectiveness by integrating it into RL finetuning and multi-modal trajectory scoring across multiple baselines, achieving performance comparable to rule-based reward calculations in both open-loop and closed-loop evaluation.

2606.08281 2026-06-16 cs.RO cs.HC cs.SY eess.SY physics.med-ph 新提交

Impedance MPC for Physical Human-Robot Interaction: Predictive Disturbance Rejection with Joint-Limit Safety

阻抗MPC用于物理人机交互:具有关节极限安全性的预测性扰动抑制

Yongyan Cao, Jinshan Tang

发表机构 * Voryx Robotic LLC George Mason University(乔治梅森大学)

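AI summary To resolve the tension between trajectory accuracy and safety in physical human-robot interaction, proposes a two-layer Impedance MPC that analytically cancels the dynamics and estimates persistent disturbances with a Kalman filter to reach zero steady-state error, and that enforces joint-limit safety through a null-space barrier and a workspace projection.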

Comments 7 pages and 3 figures

详情
AI中文摘要

物理人机交互(pHRI)要求在非计划接触下同时实现轨迹精度和顺应性安全。经典阻抗控制在持续人力作用下会产生非零稳态位置误差(施加力除以任务刚度),积分作用仅在狭窄的稳定增益预算内减少该误差。我们提出一种双层阻抗MPC来解决这一矛盾。第一层解析抵消重力、科里奥利力和任务空间惯性,将剩余被控对象简化为具有恒定状态转移矩阵的构型无关双积分器。第二层以100 Hz求解30变量凸QP,利用该恒定结构使得自由响应矩阵仅需预计算一次;增广卡尔曼滤波器估计持续扰动状态,提供形式化的零稳态误差保证。零空间逆势垒和任务空间工作空间投影在测试工作空间内保证关节极限安全。在7自由度Franka FR3上,与经典阻抗在持续15 N力下的44.8 mm稳态误差相比,带卡尔曼增广的阻抗MPC达到亚0.05 mm稳态误差(降低超过800倍),在四个3-D圆上实现亚毫米跟踪,并对测量噪声和高达30%的惯性失配具有优雅鲁棒性。

英文摘要

Physical human-robot interaction (pHRI) demands simultaneous trajectory accuracy and compliant safety under unplanned contact. Classical impedance control incurs a nonzero steady-state position error under sustained human force (the applied force divided by the task stiffness), which integral action reduces only within a narrow stable-gain budget. We present a two-layer Impedance MPC that resolves this tension. Layer 1 analytically cancels gravity, Coriolis, and task-space inertia, reducing the residual plant to a configuration-independent double integrator with a constant state-transition matrix. Layer 2 solves a 30-variable convex QP at 100 Hz, exploiting this constant structure so the free-response matrix is precomputed once; an augmented Kalman filter estimates the persistent disturbance state, giving a formal zero-steady-state-error guarantee. A null-space inverse-barrier potential and a task-space workspace projection enforce joint-limit safety across the tested workspace. On a 7-DOF Franka FR3, Impedance MPC with Kalman augmentation attains sub-0.05 mm steady-state error versus 44.8 mm for classical impedance (a >800-fold reduction) under a sustained 15 N force, sub-millimeter tracking on four 3-D circles, and graceful robustness to measurement noise and inertial mismatch up to 30%.
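As a rough illustration of two quantities named in the abstract, the sketch below computes the classical impedance offset as force divided by stiffness and precomputes the stacked free-response blocks of a discrete double integrator. The 100 Hz rate comes from the abstract; the stiffness value, the horizon length, and every function name are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only: names, stiffness, and horizon are assumptions.

def steady_state_error_mm(force_n, stiffness_n_per_m):
    """Classical impedance offset under a constant force: e = F / K, in mm."""
    return 1000.0 * force_n / stiffness_n_per_m

def mat_mul(a, b):
    """Multiply two small dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def free_response_blocks(dt, horizon):
    """Return [A, A^2, ..., A^N] for a double integrator with state (pos, vel)."""
    a = [[1.0, dt], [0.0, 1.0]]
    blocks, power = [], [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(horizon):
        power = mat_mul(power, a)  # constant A, so this runs once offline
        blocks.append(power)
    return blocks

DT = 0.01  # 100 Hz control rate, as stated in the abstract
BLOCKS = free_response_blocks(DT, horizon=10)  # horizon is an assumed value
```

If the reported 44.8 mm offset under 15 N were purely force over stiffness, it would correspond to a task stiffness of roughly 335 N/m; the abstract does not state the stiffness, so that figure is only an inference.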

2606.08151 2026-06-16 cs.AI 新提交

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

决策感知记忆卡:用于工具使用LLM智能体的反事实启发式上下文选择与压缩

Xinyu Guan, Qianyang Zhao, Yuming Deng

发表机构 * Alibaba Group(阿里巴巴集团)

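AI summary Proposes CICL, a decision-aware context layer that builds a context graph, scores the utility of each unit, and packs the selected evidence into memory cards, improving evidence selection and compression at action time for tool-using LLM agents and raising retrieval hit rates on SWE-bench Verified.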

Comments 15 pages, 2 figures, 8 tables. Code is available at https://github.com/stephen-guan-researcher/CICL; Qwen-QLoRA adapter is available at https://huggingface.co/XinyuGuan/CICL

详情
AI中文摘要

使用工具的LLM智能体失败的原因往往不是缺少相关文本,而是在行动时未能选择、压缩或呈现决定性证据。我们提出CICL,一个决策感知上下文层,它将实例证据转化为上下文图,通过共享的八字段模式路由确定性、Opus辅助、Qwen、Codex/GPT-5.5和Qwen-QLoRA判断,根据行动偏移、结果提升、必要性和负迁移风险对单元评分,并将高效用证据打包为类型化记忆卡供预算有限的智能体使用。该设计将测量到的决策信号与判断模型分离,使得前沿标注、局部代理和轻量级排序器可以在一个可审计协议下进行比较。实验上,CICL在公开基准测试中取得了具体提升,同时暴露了其局限性。在50个SWE-bench Verified文件检索实例上,直接使用Qwen3.6-plus对BM25前50候选进行重排序,将hit@1从0.58提升至0.78,MRR@10从0.634提升至0.790,且所有2500个判断均可解析。受控诊断显示了行动关键性:在预算120时,CICL在v1上达到F1 0.620,在v3上达到F1 0.425,而移除最高效用的语义v3单元导致F1降至0.000。补充检查包括Qwen-QLoRA在710个候选上的一致性、一个小的200标签真实代码Opus辅助信号,以及一个三实例补丁烟雾测试验证检索到补丁的流程,但不声称官方SWE-bench成功。RepoBench-R摘要仍优于记忆卡,紧凑型排序器尚未取代启发式方法。CICL贡献了一个可复现的测量和选择层,用于决策关键上下文,而非端到端编码智能体修复声明。

英文摘要

Modern large language model (LLM) agents do not simply need longer contexts; they need decision-relevant evidence at the moment of action. We study decision-aware context selection: ranking retrieved files, tests, traces, rules, and memories by their expected effect on an agent's next action rather than by semantic similarity alone. We present the Counterfactual-Inspired Context Layer (CICL), which builds an instance context graph, estimates decision-oriented utility for candidate units, and compresses selected evidence into typed memory cards. The same schema can be instantiated with hosted LLM judges, local surrogates, or lightweight rankers, making the selection protocol auditable across model choices. On 50 SWE-bench Verified file-retrieval instances, Qwen3.6-Plus reranking of BM25 top-50 candidates improves hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show that CICL identifies action-critical evidence: removing the top-utility semantic unit reduces F1 from 0.245 to 0.000. In selected-then-compressed mode, memory cards save 44.93 tokens per query while preserving selected evidence. CICL provides a practical layer for measuring, ranking, and compressing decision-critical context for tool-using agents. Code is available at https://github.com/stephen-guan-researcher/CICL.
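The retrieval numbers above are reported as hit@1 and MRR@10. A minimal sketch of how those two metrics are commonly computed follows; the toy queries and file names are hypothetical, and the code is not taken from the CICL repository.

```python
def hit_at_k(ranked, relevant, k=1):
    """1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1/rank of the first relevant item within the top k, else 0.0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean(values):
    return sum(values) / len(values)

# Hypothetical toy queries: (ranked candidate files, gold files).
QUERIES = [
    (["a.py", "b.py", "c.py"], {"a.py"}),
    (["x.py", "y.py", "z.py"], {"y.py"}),
]
HIT1 = mean([hit_at_k(r, g, 1) for r, g in QUERIES])
MRR10 = mean([reciprocal_rank_at_k(r, g, 10) for r, g in QUERIES])
```

In this toy case the first query is a hit at rank 1 and the second at rank 2, so hit@1 is 0.5 and MRR@10 is 0.75.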

2606.08059 2026-06-16 cs.RO 新提交

Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain

感知行为基础模型:将人体运动先验适应到以机器人为中心的地形

Zifan Wang, Yizhao Li, Teli Ma, Qiang Zhang, Yudong Fan, Hao Xu, Shuo Yang, Junwei Liang

发表机构 * Mondo Robotics The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) Artificial General Intelligence Institute, University of Science and Technology of China(中国科学技术大学通用人工智能研究院)

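AI summary Proposes the Perceptive Behavior Foundation Model (Perceptive BFM), which uses terrain-conformal reference synthesis (TCRS) to adapt human motion priors to the robot's local terrain for terrain-aware humanoid control.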

详情
AI中文摘要

人形机器人行为基础模型旨在从广泛的人体运动先验中获取可复用的全身控制策略,使单一控制器能够产生多样且富有表现力的行为。然而,现有的以运动为中心的基础策略大多假设参考运动已经与机器人周围环境物理兼容。当演示者、操作者和机器人处于不同环境时,这一假设不再成立:人体运动可能指定了预期行为,但并未指定机器人局部地形所需的落脚点、间隙、身体高度或接触时机。我们引入了感知行为基础模型(Perceptive BFM),这是一种地形感知的人形机器人控制框架,将人体运动先验植根于以机器人为中心的感知。该模型保留原始运动学运动参考作为行为接口,同时利用局部地形观测来调整接触、姿态和时机。为了提供可扩展的地形监督,我们开发了地形一致参考合成(TCRS),通过接触感知的落脚点构建、足部几何感知的摆动优化、支撑感知的根部重建、碰撞修复和多点逆运动学,将面向运动的运动片段转换为地形一致的参考。然后,我们训练一个盲适应参考教师,并通过目标帧动作对齐将其地形一致行为迁移到部署的原始参考学生。学生是一个身份门控Transformer跟踪器,其地形特征通过残差路径进入,这些路径初始化为保留运动跟踪先验,并仅在需要时训练产生局部修正。

英文摘要

Humanoid behavior foundation models aim to acquire reusable whole-body control policies from broad human motion priors, enabling a single controller to produce diverse and expressive behaviors. However, existing motion-centric foundation policies largely assume that the reference motion is already physically compatible with the robot's surroundings. This assumption breaks when the demonstrator, operator, and robot inhabit different environments: a human motion may specify the intended behavior, but not the footholds, clearance, body height, or contact timing required by the robot's local terrain. We introduce the Perceptive Behavior Foundation Model (Perceptive BFM), a terrain-aware humanoid control framework that grounds human motion priors in robot-centric perception. The model preserves raw kinematic motion references as the behavioral interface, while using local terrain observations to adapt contacts, posture, and timing. To provide scalable terrain supervision, we develop terrain-conformal reference synthesis (TCRS), which converts locomotion-oriented human motion clips into terrain-consistent references through contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics. We then train a blind adapted-reference teacher and transfer its terrain-conformal behavior to a deployed raw-reference student through target-frame action alignment. The student is an identity-gated Transformer tracker whose terrain features enter through residual pathways initialized to preserve the motion-tracking prior and trained to produce local corrections only when needed.

2606.07678 2026-06-16 cs.LG cs.AI 新提交

DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

DOG-DPO:几何中的动态优化用于安全对齐

Yi Nian, Tiankai Yang, Yudi Zhang, Qi Pan, Zelong Xu, Shenzhe Zhu, Qingqing Luan, Yue Huang, Xiangliang Zhang, Yue Zhao

发表机构 * University of Southern California(南加州大学) Iowa State University(爱荷华州立大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) UT Austin(德克萨斯大学奥斯汀分校) Independent Researcher(独立研究员) University of Notre Dame(圣母大学)

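AI summary Proposes DOG-DPO, a framework that represents preference pairs as directions in model representation space and selects a subset through geometric decomposition and diversity-based coverage, recovering most of the safety gains with only 11% of the data.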

详情
AI中文摘要

大型语言模型的安全对齐依赖于偏好数据,但当前的流水线通常训练于大规模冗余数据集。现有的数据选择方法通常独立地对每个偏好对评分,将方向性偏好信息压缩为标量质量或多样性分数。这种以样本为中心的视角在多数据集设置中尤其受限,其中共享的安全方向与数据集特定的残余风险共存。我们提出DOG-DPO,一种无需训练的数据选择框架,将偏好对视为结构化几何信号。DOG-DPO首先将每个偏好对表示为模型表示空间中的一个方向。然后,它将多数据集偏好几何分解为全局锚点子空间和数据集特定的残余子空间。最后,它通过最大化基于多样性的覆盖来选择子集,鼓励在DPO训练前广泛、非冗余地覆盖对齐方向。在六个安全基准和两个模型骨干上,DOG-DPO仅使用11%的偏好对就实现了强大的效用-鲁棒性权衡。它恢复了全数据训练的大部分安全增益,同时完全无需教师、无需训练,并且比代表性选择基线快得多。

英文摘要

Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets. Existing data selection methods typically score each preference pair independently, collapsing directional preference information into scalar quality or diversity scores. This sample-centric view is especially limiting in multi-dataset settings, where shared safety directions coexist with dataset-specific residual risks. We propose DOG-DPO, a training-free data selection framework that treats preference pairs as structured geometric signals. DOG-DPO first represents each preference pair as a direction in model representation space. It then decomposes multi-dataset preference geometry into a global anchor subspace and dataset-specific residual subspaces. Finally, it selects subsets by maximizing diversity-based coverage, encouraging broad, non-redundant coverage of alignment directions before DPO training. Across six safety benchmarks and two model backbones, DOG-DPO achieves a strong utility-robustness trade-off using only 11% of the preference pairs. It recovers most of the safety gains of full-data training while remaining entirely teacher-free, training-free, and substantially faster than representative selection baselines.
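To make the idea of treating preference pairs as directions more concrete, here is a small sketch under stated assumptions: each pair is reduced to the unit vector from the rejected to the chosen representation, and a subset is picked greedily so that each new direction is as dissimilar as possible from those already chosen. This farthest-point heuristic is a stand-in for the paper's diversity-based coverage objective, and it omits the anchor and residual subspace decomposition entirely; all names are hypothetical.

```python
import math

def direction(chosen, rejected):
    """Unit vector from the rejected to the chosen representation.

    Assumes the two representations differ (nonzero difference).
    """
    diff = [c - r for c, r in zip(chosen, rejected)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def cosine(u, v):
    """Cosine similarity of two unit vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))

def greedy_diverse_subset(directions, budget):
    """Pick indices one at a time, minimizing the max similarity to prior picks."""
    budget = min(budget, len(directions))
    selected = [0]  # arbitrary seed choice for the sketch
    while len(selected) < budget:
        best, best_score = None, None
        for i in range(len(directions)):
            if i in selected:
                continue
            score = max(cosine(directions[i], directions[j]) for j in selected)
            if best_score is None or score < best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

With three toy directions where two are nearly parallel, a budget of two keeps one of the parallel pair plus the orthogonal one, which is the non-redundant coverage the abstract describes.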

2606.07334 2026-06-16 cs.SD cs.LG 新提交

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

和弦符号时间序列适应能承载多远流派身份?多流派和弦符号建模的能力与边界

Jinju Lee

发表机构 * PearlLeeStudio

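AI summary Evaluates five adaptation methods (LoRA, IA3, BitFit, prefix tuning, and full fine-tuning) for extending a pretrained pop-jazz chord model to 11 target genres, finding that every method improves chord prediction but that chord symbols alone cannot carry full genre identity.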

Comments v3: ft-pop80-v2, a selection-corrected, hash-distinct jazz base, exists, reproducing over 3 seeds (top-1 75.76 +/- 0.03), so the Sec. 8 base robustness ablation is now gated by effort, not checkpoint availability. Added a v3 changelog; corrected Sec. 5.2/6.3/6.9 stats for CSV fidelity (no qualitative changes). https://github.com/PearlLeeStudio/TheArtist | https://huggingface.co/PearlLeeStudio

详情
AI中文摘要

和声是一个紧凑的符号层,其中数学音高关系、声学协和与音乐惯例交汇。本报告并不将和弦符号序列视为音乐的完整表示,而是将其作为可解释、可控的时间序列,用于流派局部和声建模。从一个冻结的流行爵士音乐变换器检查点开始,我评估了小型适应接口能将模型扩展到11个目标流派的程度:布鲁斯、波萨诺瓦、巴赫众赞歌、乡村、电子、民谣、放克、福音、嘻哈、R&B/灵魂乐和摇滚。主要比较了LoRA、IA3、BitFit、前缀微调和全微调在11个流派和3个种子上的表现,构成完整的165个单元格网格。所有五种方法在保留和弦预测上都优于冻结基线,宏观增益从+2.89到+3.61分;LoRA和IA3得分最高,但经Holm和Benjamini-Hochberg校正的Wilcoxon检验不支持决定性优胜者。一个匹配数据量的对照实验进一步明确了这一点:当流派被子采样到共同语料库大小时,IA3保持领先,但LoRA的全数据优势消失并跌至最后,表明小差距部分由数据驱动。一个控制标记基线也很强,错误流派适配器通常优于冻结基线,表明大部分效果来自对可重用和声基底的轻量级条件化,而非特定适配器家族。额外的诊断(秩扫描、错误流派轮换、基础检查点消融、仅和弦流派分类、生成输出统计、真实歌曲评估和重复分析)支持一个有限的结论:和弦符号适应可靠地改进了流派局部和声预测,但仅靠和弦符号不能承载完整的流派身份。因此,本报告避免关于感知流派真实性或完整音乐质量的声明,这需要受控的听众或音乐家评估。

英文摘要

This revision updates an 11-genre chord-symbol adaptation report. The main 165-cell result is unchanged: all methods improve over the frozen pure-pop base, with no decisive method winner. v3 adds the ft-pop80-v2 multi-seed base-restoration note and corrects a few summary statistics for exact CSV faithfulness without changing conclusions.

2606.07086 2026-06-16 cs.CV cs.LG 新提交

An Adaptive Data cleaning Framework for Noisy Label Detection

自适应数据清洗框架用于噪声标签检测

Chen-Hsuan Fang, Wei-Hsinag Chen, Pin-Hsuan Yu, Jung-Hua Wang, Tsung-Wei Pan

发表机构 * Department of Electrical Eng(电子工程系) AI Research Center(人工智能研究中心)

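AI summary Proposes a self-adaptive data-cleaning framework that needs no manual thresholds, fuses local, global, and learning-dynamics metrics, and detects noisy labels through multi-metric clustering in feature space, improving recall and model accuracy on CIFAR-10, MNIST, and ImageNet-100.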

详情
AI中文摘要

深度神经网络(DNN)在给定大型标注数据集的计算机视觉任务中表现出色。然而,在实际应用中,标签常常因歧义、人为错误或动态环境而受到污染。过参数化的DNN在训练过程中容易记忆这些噪声标签,从而降低模型的准确性和泛化能力。现有的数据清洗和样本选择策略通常依赖于手动指定的阈值、噪声比率的先验知识或单一度量(学习动态或几何结构),这使得它们在复杂数据场景下不稳定。本文提出了一种自适应数据清洗框架,该框架整合了局部、全局和学习动态线索,用于鲁棒的噪声标签检测。通过模块化特征拼接范式,样本被映射到统一的低维特征空间。我们提供了两种实例化:一种二维度量,结合了基于类自适应KNN的局部不一致性和基于k-means的全局质心距离;另一种三维多度量,额外引入了z归一化分数。与传统的将一维高斯混合模型应用于单一标量度量的方法不同,我们的框架在特征空间上执行多度量聚类,以自适应地将样本划分为干净主导和噪声主导成分,无需手动阈值或噪声先验。在CIFAR-10、MNIST和ImageNet-100上,针对5%至40%的对称标签噪声进行的实验表明,该框架在所有设置下均实现了高召回率,包括在ImageNet-100上40%噪声时接近完美的召回率(≥98%)。后续训练在所有评估设置下均获得了精度提升,尤其是在ImageNet-100的严重污染情况下。这些发现表明,多度量整合为噪声标签检测提供了一种无阈值、实用且低调整的策略。

英文摘要

Deep neural networks (DNNs) excel in computer vision tasks given large annotated datasets. In real-world applications, however, labels are often corrupted by ambiguity, human error, or dynamic environments. Over-parameterized DNNs easily memorize these noisy labels during training, degrading model accuracy and generalization. Existing data-cleaning and sample-selection strategies often rely on manually specified thresholds, prior knowledge of the noise ratio, or a single metric (either learning dynamics or geometric structure), making them unstable in complex data regimes. This paper proposes a self-adaptive data-cleaning framework that integrates local, global, and learning dynamics cues for robust noisy-label detection. Samples are mapped into a unified low-dimensional feature space through a modular feature concatenation paradigm. We provide two instantiations: a 2D metric integrating class-adaptive KNN-based local disagreement with k-means-based global centroid distance, and a 3D multi-metric that additionally incorporates a z-normalized score. Unlike conventional 1D Gaussian Mixture Models applied to a single scalar metric, our framework performs multi-metric clustering on the feature space to adaptively partition samples into clean-dominant and noise-dominant components without requiring manual thresholds or noise priors. Experiments on CIFAR-10, MNIST, and ImageNet-100 with 5% to 40% symmetric label noise show high recall across settings, including near-perfect recall (>=98%) on ImageNet-100 at 40% noise. Subsequent training yields accuracy gains across evaluated settings, especially under severe corruption on ImageNet-100. These findings suggest that multi-metric integration provides a threshold-free, practical, and low-tuning strategy for noisy label detection.
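The 2D metric described above combines a local KNN disagreement score with a global centroid distance. The sketch below shows one simplified way to compute such a pair per sample. It uses a fixed k and plain per-class means, whereas the abstract describes class-adaptive KNN and k-means-based centroids, and it stops before the clustering step; the function names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_disagreement(features, labels, index, k=3):
    """Fraction of the k nearest neighbours whose label differs from the sample's."""
    others = [j for j in range(len(features)) if j != index]
    others.sort(key=lambda j: dist(features[index], features[j]))
    neighbours = others[:k]
    return sum(labels[j] != labels[index] for j in neighbours) / len(neighbours)

def centroid_distance(features, labels, index):
    """Distance from a sample to the mean of all samples sharing its given label."""
    same = [f for f, y in zip(features, labels) if y == labels[index]]
    centroid = [sum(col) / len(same) for col in zip(*same)]
    return dist(features[index], centroid)

def two_d_metric(features, labels, k=3):
    """One (local disagreement, global distance) pair per sample."""
    return [(knn_disagreement(features, labels, i, k),
             centroid_distance(features, labels, i))
            for i in range(len(features))]
```

A sample sitting inside one class's cluster while carrying another class's label scores high on both coordinates, which is what lets a later clustering step separate it from clean samples without a hand-set threshold.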

2606.07015 2026-06-16 cs.SD cs.AI 新提交

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

面向统一歌曲生成与带伴奏共生成的歌声转换

Ziyu Zhang, Chunyu Qiang, Xiaopeng Wang, Yuxin Guo, Kang Yin, Wenjie Tian, Jingbin Hu, Tianlun Zuo, Zhao Guo, Teng Ma, Yuzhe Liang, Chen Zhang, Lei Xie

发表机构 * Northwestern Polytechnical University(西北工业大学) Kuaishou Technology(快手科技) Beijing Institute of Technology(北京理工大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) University of Science and Technology of China(中国科学技术大学) Shanghai Jiao Tong University(上海交通大学)

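AI summary Proposes UniSinger, a framework built on a multimodal diffusion transformer that unifies zero-shot song generation with accompaniment co-generation singing voice conversion, using a shared speaker embedding and a curriculum learning strategy for cross-task timbre control and multi-task optimization.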

详情
AI中文摘要

尽管歌曲生成和歌声转换(SVC)已显著发展,但长期以来它们被孤立开发:前者缺乏零样本说话人克隆,而后者忽略了人声-伴奏协同。为弥合这一差距,我们提出UniSinger,这是首个统一说话人克隆歌曲生成与伴奏共生成SVC的端到端框架。基于多模态扩散Transformer,我们构建了一个统一的说话人嵌入空间,将说话人表示从SVC迁移到歌曲生成,从而实现细粒度的跨任务音色控制。为缓解多任务优化冲突,我们设计了一种课程学习策略,使用任务特定的模态掩码来引导模型逐步掌握语义内容、人声音色和伴奏之间的生成机制。实验表明,在两个任务上均达到最先进性能,并实现了互补优势,为智能音乐制作提供了新可能性。

英文摘要

While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker-cloning song generation and accompaniment co-generation SVC. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space that transfers speaker representations from SVC to song generation, enabling fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy using task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show state-of-the-art performance on both tasks and complementary benefits between them, offering new possibilities for intelligent music production.

2606.06834 2026-06-16 cs.CL q-bio.GN 新提交

The Dark Regulome: Disentangling Predictability from Regulation in Genomic Foundation Models

暗调控组:从基因组基础模型中分离可预测性与调控性

Chahat Baranwal, Aaditya Baranwal, Lakshya Nitin Tandon

发表机构 * IIT Jodhpur(印度理工学院焦特布尔分校) University of Central Florida(中央佛罗里达大学) Northeastern University(东北大学)

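AI summary Proposes a residualization-and-permutation diagnostic that separates sequence predictability from regulatory signal in the in-silico mutagenesis scores of genomic foundation models, reveals a 10kb proximal-regulatory horizon, and shows that cross-architecture decomposition distinguishes a predictability layer from a regulatory-output layer, offering a general tool for studying regulation in the dark genome.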

详情
AI中文摘要

高级别胶质瘤通过与神经元的突触整合到神经回路中,这引发了一个问题:哪些非编码元件塑造了肿瘤细胞中的突触形成基因表达。写在暗基因组上的调控程序,我们称之为“暗调控组”,是探索的自然底物,而序列基础模型通过计算机诱变(ISM)提供了一条零样本路径;然而,基于似然的评分与局部序列可预测性存在同义反复的耦合,使得调控解释不充分。在三个架构不同的基础模型(Caduceus-Ph、HyenaDNA、Enformer)和92个胶质瘤相关位点的30,448个暗基因组元件上,我们引入了一种残差化-置换诊断方法,以分离由可预测性驱动和由调控驱动的RIS方差。一个尖锐的10kb近端调控边界在我们应用的所有控制中仍然存在,但LM衍生的元件类别层次结构则不然:一个六特征线性基线在AUC=0.985时匹配Caduceus的十分位数成员。跨架构分解清晰地分离了序列可预测性层(两个语言模型共同对长且可预测的转座元件进行排序)和调控输出层(只有Enformer保留了区分cCRE的信号),两个前100列表之间完全没有重叠。然后,保守性、脑cis-eQTL和STRING-PPI交叉检查锚定了哪些生物学信息得以保留:所有三个模型的前100个元件在匹配脑eQTL方面每个模型富集了3.3倍($p_\mathrm{emp} < 5\times 10^{-3}$),而一个诱人的转座元件调控层和一个显著的NRXN1+NLGN1蛋白对收敛在构建适当的置换检验后均未通过。我们将该诊断方法作为任何基于ISM的调控研究的通用方法工具提供。

英文摘要

High-grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells. The regulatory program written across the dark genome, what we call the "dark regulome", is the natural substrate to probe, and sequence foundation models offer a zero-shot route through in-silico mutagenesis (ISM); yet likelihood-based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined. Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma-relevant loci, we introduce a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. A sharp 10kb proximal-regulatory horizon survives every control we apply, but the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC $= 0.985$. Cross-architecture decomposition cleanly separates a sequence-predictability layer (the two language models co-rank long well-predicted transposable elements) from a regulatory-output layer (Enformer alone retains residual cCRE-discriminative signal), with literally zero overlap between the two top-100 lists. Conservation, brain cis-eQTL, and STRING-PPI cross-checks then anchor what biology survives: top-100 elements across all three models are $3.3\times$ enriched per model for matching brain eQTLs ($p_\mathrm{emp} < 5\times 10^{-3}$), while a tempting transposable-element regulatory layer and a striking NRXN1+NLGN1 protein-pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM-based regulatory study.
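For readers unfamiliar with the two ingredients named in the abstract, the following sketch shows a generic version of each: removing a linear dependence on a covariate (residualization) and computing an empirical p-value by shuffling group membership (a permutation test). It is a textbook-style illustration with hypothetical names and a single covariate, not the authors' diagnostic, which may use different covariates, statistics, and matching schemes.

```python
import random

def residualize(y, x):
    """Residuals of y after removing an ordinary least-squares fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

def permutation_p_value(values, in_group, n_perm=2000, seed=0):
    """Empirical one-sided p-value: how often a random group of the same size
    has a mean at least as large as the observed group mean."""
    rng = random.Random(seed)
    k = sum(in_group)
    observed = sum(v for v, g in zip(values, in_group) if g) / k
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(values, k)
        if sum(sample) / k >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```

In this framing, a score would first be residualized against a predictability covariate, and the permutation test would then ask whether a candidate set still stands out in the residuals.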

2606.06646 2026-06-16 cs.CL cs.AI 新提交

CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures

CAF-Gen: 一种用于丰富论证结构的多智能体系统

Jakub Bąba, Jarosław A. Chudziak

发表机构 * Faculty of Electronics and Information Technology, Warsaw University of Technology(电子与信息技术学院,华沙技术大学)

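AI summary Proposes CAF-Gen, a multi-agent framework whose iterative Creator-Reviewer pipeline automatically converts shallow argument structures into rich models compliant with the Carneades Argumentation Framework, overcoming the structural instability of single-pass generation.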

Comments Accepted for publication in the proceedings of ICCCI 2026

详情
AI中文摘要

从自然文本中形式化复杂推理是计算语言学的核心挑战之一。它要求系统不仅理解关键词,还要理解文本中嵌入的上下文和复杂推理。当前的论证挖掘技术能够识别基本的主张和前提,但往往难以捕捉高级模式(如Carneades论证框架)所需的更丰富的结构信息,该框架包含前提类型、证明标准和论证模式等特征。我们通过引入CAF-Gen来解决这一局限性,这是一个自动化的多智能体框架,旨在将浅层论证结构丰富为符合CAF的论证模型。通过采用迭代的创建者-评审者流水线,创建者智能体的输出由批评智能体验证以确保结构完整性。这种多智能体协作对于缓解单次生成模型典型的结构不稳定性至关重要。我们的实验表明,迭代反馈循环提高了所得数据的质量,并与原始标注实现了强对齐,同时生成了结构更丰富的模型。我们的发现表明,多智能体系统可以克服单次生成的局限性,为自动建模形式论证提供了一种稳健的方法。

英文摘要

Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text. Current Argument Mining (AM) techniques identify basic claims and premises, yet they often struggle to capture the richer structural information required by advanced schemas such as the Carneades Argumentation Framework (CAF), which incorporates features such as premise types, proof standards, and argument schemes. We address this limitation by introducing CAF-Gen, an automated multi-agent framework designed to enrich shallow argument structures into CAF-compliant argument models. By employing an iterative Creator-Reviewer pipeline, a creator agent's output is validated by a critical agent to ensure structural integrity. This multi-agent collaboration is crucial for mitigating the structural instability typical of single-pass generative models. Our experiments demonstrate that the iterative feedback loop improves the quality of the resulting data and achieves strong alignment with the original annotations, while producing structurally richer models. Our findings show that the multi-agent system can overcome the limitations of single-pass generation, providing a robust methodology for the automated modeling of formal argumentation.

2603.04592 2026-06-16 cs.CL cs.CV 交叉投稿

From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

从静态推理到动态交互:流式大型语言模型综述

Junlong Tong, Zilong Wang, YuJie Ren, Peiran Yin, Hao Wu, Wei Zhang, Xiaoyu Shen

发表机构 * Shanghai Jiao Tong University(上海交通大学) Institute of Digital Twin, Eastern Institute of Technology(数字孪生研究院,东部技术研究院)

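AI summary Unifies the definition of streaming LLMs, proposes a systematic taxonomy, and surveys their methods, applications, and future directions.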

Comments Accepted by ACL 2026 Findings

详情
AI中文摘要

标准大型语言模型(LLM)主要设计用于预定义输入的静态推理,这限制了它们在动态实时场景中的适用性。为解决这一差距,流式LLM范式应运而生。然而,现有流式LLM的定义仍然零散,混淆了流式生成、流式输入和交互式流式架构,且缺乏系统分类法。本文对流式LLM进行了全面概述和分析。首先,我们基于数据流和动态交互建立了流式LLM的统一定义,以澄清现有歧义。基于这一定义,我们提出了当前流式LLM的系统分类法,并对其底层方法进行了深入讨论。此外,我们探讨了流式LLM在现实场景中的应用,并概述了有前景的研究方向,以支持流式智能的持续进展。我们在以下网址维护一个持续更新的相关论文仓库:https://github.com/EIT-NLP/Awesome-Streaming-LLMs。

英文摘要

Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.

2605.04998 2026-06-16 cs.SD cs.IR cs.LG 版本更新

Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation

流行与爵士混合比例对体裁自适应和弦生成的实证研究

Jinju Lee

发表机构 * PearlLeeStudio

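AI summary Studies chord generation with pop rehearsal at varying pop-to-jazz mix ratios, finds that modest pop rehearsal preserves pop accuracy while improving jazz prediction, and corrects a checkpoint-selection error from the earlier version.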

Comments Erratum: the released F1 checkpoint equals the Phase-0 pop baseline (full SHA-256 verified); min mixed validation loss selection kept the unadapted warmup epoch. Tables 4 and 5 are best epoch metrics; mix ratio conclusions hold. A corrected retrain (jazz only validation), ft-pop80-v2, reproduces across 3 seeds. v1 F2 row fixed. 3 figs, 5 tables. https://huggingface.co/PearlLeeStudio

详情
AI中文摘要

本修订更新了一项流行到爵士和弦生成的排练研究。最佳时期的指标仍然表明,适度的流行排练能在保持流行准确率的同时提高爵士预测性能,但v2版本修正了已发布检查点的选择:已发布的F1等于阶段0,F2存在转录错误,而ft-pop80-v2恢复了跨3个种子的哈希区分爵士适应F1。

英文摘要

This revision updates a pop-to-jazz chord-generation rehearsal study. Best-epoch metrics still show that modest pop rehearsal preserves pop accuracy while improving jazz prediction, but v2 corrects released-checkpoint selection: the released F1 equals Phase 0, F2 had a transcription error, and ft-pop80-v2 restores a hash-distinct jazz-adapted F1 across 3 seeds.

2605.04813 2026-06-16 cs.LG 版本更新

A Biased Nonnegative Block Term Tensor Decomposition Model for Dynamic QoS Prediction

一种用于动态QoS预测的有偏非负块项张量分解模型

Wenjing Liu, Yujia Lei, Qu Wang

发表机构 * GitHub

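AI summary Proposes the BNBT framework, which uses biased nonnegative block term tensor decomposition to strengthen representation capability, adds linear bias terms, and designs the SLF-NMUT algorithm, improving accuracy in dynamic QoS prediction.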

详情
AI中文摘要

随着云计算和Web服务的快速发展,服务质量(QoS)已成为服务选择与推荐的关键标准。张量潜在特征分析为建模多维QoS数据提供了有效途径,现有大多数QoS预测方法主要基于规范多元分解(CP分解)或Tucker分解。然而,受限于其固有结构特性,这些方法无法准确捕捉用户-服务交互中复杂且动态的依赖关系,从而限制了预测性能。为解决此问题,本文提出一种基于有偏非负块项张量分解模型的动态QoS预测框架,称为BNBT。具体而言,该框架从三个方面进行构建:(1)采用块项张量分解增强潜在特征学习的表示能力;(2)引入线性偏置项以进一步提高预测精度;(3)设计一种面向张量的单元素依赖非负乘性更新算法SLF-NMUT,用于高效参数估计。在真实QoS数据集上的大量实验表明,所提出的BNBT框架在预测精度上持续优于多种先进的QoS预测方法。

英文摘要

With the rapid development of cloud computing and Web services, Quality of Service (QoS) has become a key criterion for service selection and recommendation. Tensor latent feature analysis provides an effective way to model multidimensional QoS data, and most existing QoS prediction methods are mainly based on Canonical Polyadic (CP) decomposition or Tucker decomposition. However, constrained by their inherent structural properties, these methods cannot accurately capture the complex and dynamic dependencies in user-service interactions, which limits their prediction performance. To address this issue, this paper proposes a dynamic QoS prediction framework based on the Biased Nonnegative Block Term Tensor Decomposition Model, termed BNBT. Specifically, the proposed framework is developed from three aspects: (1) block term tensor decomposition is employed to enhance the representation capability of latent feature learning; (2) linear bias terms are incorporated to further improve prediction accuracy; and (3) a tensor-oriented single-element-dependent nonnegative multiplicative update algorithm, called SLF-NMUT, is designed for efficient parameter estimation. Extensive experiments on real-world QoS datasets demonstrate that the proposed BNBT framework consistently outperforms several state-of-the-art QoS prediction methods in terms of prediction accuracy.

2605.03297 2026-06-16 cs.SD cs.LG 版本更新

Contrastive Regularization for Accent-Robust ASR

对比正则化用于口音鲁棒的ASR

Van-Phat Thai, Aradhya Dhruv, Duc-Thinh Pham, Sameer Alam

发表机构 * Air Traffic Management Research Institute, Nanyang Technological University, Singapore(新加坡南洋理工大学航空交通管理研究所) Center of AI Research, VinUniversity, Vietnam(越南Vin大学人工智能研究中心)

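AI summary Proposes supervised contrastive learning as a lightweight accent-invariant auxiliary objective that regularizes encoder representations during CTC fine-tuning without architectural changes or explicit accent supervision, reducing word error rate on unseen accents by up to 25-29% (relative) on the L2-ARCTIC benchmark.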

Comments Accepted by Interspeech 2026

详情
AI中文摘要

基于自监督声学预训练和CTC微调的ASR系统在母语语音上表现强劲,但对口音变化仍然敏感。我们研究监督对比学习(SupCon)作为CTC微调的轻量级、口音不变辅助目标。一个话语级对比损失正则化编码器表示,无需架构修改或显式口音监督。在L2-ARCTIC基准上的实验表明,多个预训练编码器均实现一致的WER降低,在未见口音评估下相对降低高达25-29%。使用转录内余弦离散度分析表明,SupCon在口音变化下促进更紧凑和稳定的表示几何结构。总体而言,SupCon提供了一种有效且模型无关的正则化策略,用于提高口音鲁棒性。

英文摘要

ASR systems based on self-supervised acoustic pretraining and CTC fine-tuning achieve strong performance on native speech but remain sensitive to accent variability. We investigate supervised contrastive learning (SupCon) as a lightweight, accent-invariant auxiliary objective for CTC fine-tuning. An utterance-level contrastive loss regularizes encoder representations without architectural modification or explicit accent supervision. Experiments on the L2-ARCTIC benchmark show consistent WER reductions across multiple pretrained encoders, with up to 25-29% relative reduction under unseen-accent evaluation. Analysis using within-transcript cosine dispersion indicates that SupCon promotes more compact and stable representation geometry under accent variability. Overall, SupCon provides an effective and model-agnostic regularization strategy for improving accent robustness.
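As a hedged illustration of the auxiliary objective, the sketch below implements the standard supervised contrastive loss over a batch of utterance-level embeddings. How positives are defined in the paper is not stated in the abstract; here any two items with the same label count as positives, and the temperature value and function names are assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have a positive."""
    z = [normalize(e) for e in embeddings]
    total, anchors = 0.0, 0
    for i in range(len(z)):
        positives = [p for p in range(len(z))
                     if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Scaled similarities from the anchor to every other item.
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                  for a in range(len(z)) if a != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

In training, a loss of this kind would typically be added to the CTC loss with a weighting coefficient; the abstract does not give that coefficient, so it is left out here.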

2605.01961 2026-06-16 cs.LG 版本更新

Multi-User Dueling Bandits: A Fair Approach using Nash Social Welfare

多用户决斗式赌博机:一种基于纳什社会福利的公平方法

Maheed H. Ahmed, Mahsa Ghasemi

发表机构 * Electrical and Computer Engineering, Purdue University(电子与计算机工程系,普渡大学)

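AI summary For dueling bandits with heterogeneous preferences across multiple users, adopts the Nash Social Welfare objective, which maximizes the product of user utilities, and proposes the Fair-Explore-Then-Commit and Fair-ε-Greedy algorithms, with regret upper bounds that match the lower bound in their dependence on T up to logarithmic factors.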

详情
AI中文摘要

从人类偏好数据中学习正成为一种有用的工具,从微调大型语言模型到训练强化学习智能体。然而,在大多数场景中,模型是在所有人类评估者的平均偏好上训练的,这在偏好差异较大时可能对少数群体不公平。在这项工作中,我们考虑了决斗式赌博机中的公平性,这是一个从偏好数据中进行在线学习的标准框架。我们假设每个用户都有一个(可能不同的)康多塞赢家,即一个优于其他所有臂的臂。使用这些用户特定的康多塞赢家作为参考点,我们根据臂相对于相应赢家的表现来评估和评分。为了促进异质用户之间的公平性,我们采用了成熟的纳什社会福利目标,该目标最大化用户效用的乘积,从而固有地惩罚不平等并防止任何单个用户被边缘化。在此框架内,我们构建了一个困难实例,以建立时间范围$T$、$K$个臂和$D$个用户的遗憾下界$\Omega(T^{2/3}\min(K,D)^{1/3})$,据我们所知,这是第一个量化异质偏好决斗式赌博机中公平性成本的结果。然后,我们提出了带有康多塞赢家识别阶段的Fair-Explore-Then-Commit和Fair-$\epsilon$-Greedy算法。我们进一步推导了它们的遗憾上界,该上界在$T$的依赖关系上与下界匹配,仅相差对数因子。

英文摘要

Learning from human preference data is becoming a useful tool, from fine-tuning large language models to training reinforcement learning agents. However, in most scenarios, the model is trained on the average preference of all human evaluators, which, under large variations of preferences, can be unfair to minority groups. In this work, we consider fairness in dueling bandits, a standard framework for online learning from preference data. We assume that each user has a (potentially distinct) Condorcet winner, which is an arm preferred to every other arm. Using these user-specific Condorcet winners as reference points, we evaluate and score arms according to their performance relative to the corresponding winner. To promote fairness across heterogeneous users, we adopt the well-established Nash Social Welfare objective, which maximizes the product of user utilities, thereby inherently penalizing inequality and preventing the marginalization of any single user. Within this framework, we construct a hard instance to establish a regret lower bound of $\Omega(T^{2/3}\min(K,D)^{1/3})$ for a time horizon $T$, $K$ arms, and $D$ users, which, to the best of our knowledge, is the first result quantifying the cost of fairness in dueling bandits with heterogeneous preferences. We then present the Fair-Explore-Then-Commit and Fair-$\epsilon$-Greedy algorithms with a Condorcet winner identification phase. We further derive their regret upper bounds that match the lower-bound dependence on $T$ up to logarithmic factors.
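A small numeric sketch may help show why maximizing the product of user utilities differs from maximizing their average. The utility numbers below are invented for illustration and are not derived from Condorcet-winner comparisons as in the paper; the function names are hypothetical.

```python
import math

def nash_social_welfare(utilities):
    """Product of per-user utilities for one arm."""
    return math.prod(utilities)

def best_arm(utility_table, score):
    """Index of the arm whose per-user utility row maximizes `score`."""
    return max(range(len(utility_table)),
               key=lambda a: score(utility_table[a]))

# Hypothetical utilities: rows are arms, columns are users.
UTILITIES = [
    [1.0, 1.0, 0.1],  # great for two users, poor for the third
    [0.6, 0.6, 0.6],  # moderate for everyone
]
AVERAGE_CHOICE = best_arm(UTILITIES, lambda row: sum(row) / len(row))
NSW_CHOICE = best_arm(UTILITIES, nash_social_welfare)
```

The average favors the first arm (0.7 against 0.6), while the product favors the second (0.216 against 0.1), because a single poorly served user drags the product down sharply.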

2605.01702 2026-06-16 cs.LG 版本更新

Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients

具有自动微分的浮点网络可以表示几乎所有浮点函数及其梯度

Sejun Park, Yeachan Park, Geonho Hwang

发表机构 * Department of Artificial Intelligence, Korea University(人工智能系,高丽大学) Department of Mathematics and Statistics, Sejong University(数学与统计学系,世宗大学) Department of Mathematical Sciences, Gwangju Institute of Science and Technology(数学科学系,光州科学技术院)

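AI summary Proves that, under floating-point arithmetic, floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions together with their gradients, for common activation functions such as ReLU and ELU.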

详情
AI中文摘要

理论研究显示,对于紧致域上的任意可微函数,存在一个神经网络可以同时逼近函数值和梯度。然而,由于该结果假设实数参数和精确内部运算,无法在实际中使用。相反,实际实现仅使用实数的有限子集和带有舍入误差的机器运算。本文研究在浮点算术下,当输入梯度由自动微分算法$D^\mathtt{AD}$计算时,神经网络是否具有类似结果。我们首先证明,给定一个浮点函数$\phi$(例如损失函数),任意函数值和梯度可以分别由浮点网络$f$和$D^\mathtt{AD}(\phi\circ f)$表示。我们进一步推广该结果:在温和条件下,给定$\phi_1,\dots,\phi_n$,$D^\mathtt{AD}(\phi_i\circ f)$可以同时表示任意梯度,而$f$表示目标值。我们的结果适用于实际激活函数,例如$\mathrm{ReLU}$、$\mathrm{ELU}$、$\mathrm{GeLU}$、$\mathrm{Swish}$、$\mathrm{Sigmoid}$和$\mathrm{tanh}$。

英文摘要

Theoretical studies show that for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and gradients. However, such a result cannot be used in practice since it assumes real parameters and exact internal operations. In contrast, real implementations only use a finite subset of reals and machine operations with round-off errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm $D^\mathtt{AD}$. We first show that given a floating-point function $\phi$ (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network $f$ and $D^\mathtt{AD}(\phi\circ f)$, respectively. We further extend this result: given $\phi_1,\dots,\phi_n$, $D^\mathtt{AD}(\phi_i\circ f)$ can simultaneously represent arbitrary gradients while $f$ represents the target values, under mild conditions. Our results hold for practical activation functions, e.g., $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Sigmoid}$, and $\mathrm{tanh}$.
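The premise that real implementations use a finite subset of the reals with round-off can be seen directly in ordinary double-precision arithmetic. The snippet below only illustrates that premise, assuming IEEE 754 binary64 floats as used by CPython on common platforms; it says nothing about the paper's representation theorem itself.

```python
import math
import sys

# Floats form a finite set, so results round to the nearest representable value.
ROUNDING_VISIBLE = (0.1 + 0.2) != 0.3

# Adding half of machine epsilon to 1.0 is rounded away entirely.
ABSORBED = (1.0 + sys.float_info.epsilon / 2) == 1.0

# Spacing between 1.0 and the next larger float (requires Python 3.9+).
GAP_AT_ONE = math.ulp(1.0)
```

Because machine operations behave this way, exact-arithmetic approximation results do not transfer automatically, which is the gap the abstract sets out to address.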

2606.09500 2026-06-16 cs.AI cs.DL Updated version

Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture

Yoojin Nam, Jinhoon Jeong, Namkug Kim

Affiliations * University of Ulsan College of Medicine; Asan Medical Center; Aperivue AMIST, Asan Medical Center

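AI summary Proposes a deterministic integrity-gate architecture that decomposes the workflow into independently verifiable skills and places a deterministic check at every stage, addressing fabricated citations, data drift, and unmet reporting-guideline items in LLM-generated clinical manuscripts.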

Comments 28 pages, 3 figures, 4 tables; includes supplementary material (deterministic-detector inventory, per-class defect breakdown, worked example). Software (MIT): https://github.com/Aperivue/medsci-skills . Archived on Zenodo: concept DOI https://doi.org/10.5281/zenodo.20155321 and version DOI (v3.8.0) https://doi.org/10.5281/zenodo.20582972

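Details
AI Chinese abstract (translated)

Objective. Large language models (LLMs) increasingly draft clinical research manuscripts, but their fluency can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items. Existing tools generate text without verifying it, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture that pairs generation with verification. Methods. The design rests on three principles: decompose the workflow into self-contained skills, place a halt-on-failure gate at every stage transition, and resolve each integrity question with the cheapest sufficient mechanism: a deterministic, re-executable check where one applies, and a prose-level probe only where interpretation is required. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is implemented as MedSci Skills, an open-source toolkit of 43 skills coordinated by an orchestrator, whose deterministic tier comprises 21 standard-library detectors. We evaluate it on three reproducible public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Results. Across the three pipelines, every content-hash manifest verified clean and the gates surfaced real defects. On 27 identical injected defects, the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a generic single-prompt LLM reviewer detected 11, with its misses concentrated in generated code, bibliography internals, and style defects that the prose does not expose. Conclusion. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript: feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).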

English abstract

As autonomous research agents and AI co-scientist systems push large language models (LLMs) from drafting toward end-to-end manuscript production, the bottleneck shifts from generation to verification. Fluent LLM output can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items; existing tools generate without verifying, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture pairing generation with verification, resting on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism, a deterministic, re-executable check where one suffices and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills with a 21-detector deterministic tier, evaluated on three public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Across the three pipelines every content-hash manifest verified clean and the gates surfaced real defects; on 27 identical injected defects the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a single-prompt LLM reviewer detected 11, its misses in code, bibliography, and style defects the prose hides. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript: feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).
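The halt-on-failure gating and content-hash manifests described above can be sketched in a few lines. This is an illustrative sketch only, not the MedSci Skills implementation: the names `GateFailure`, `build_manifest`, `verify_manifest`, and `run_pipeline`, the use of SHA-256, and the sample table contents are all assumptions introduced here.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Artifacts = Dict[str, bytes]
Stage = Callable[[Artifacts], Artifacts]
Check = Callable[[Artifacts], List[str]]


class GateFailure(Exception):
    """Raised when a deterministic check fails, halting the pipeline."""


def content_hash(data: bytes) -> str:
    # SHA-256 is an assumed choice; any stable content hash would do.
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: Artifacts) -> Dict[str, str]:
    """Record one content hash per artifact name."""
    return {name: content_hash(data) for name, data in artifacts.items()}


def verify_manifest(artifacts: Artifacts, manifest: Dict[str, str]) -> List[str]:
    """Return the names that are missing or no longer match the manifest."""
    return sorted(
        name for name, digest in manifest.items()
        if name not in artifacts or content_hash(artifacts[name]) != digest
    )


def run_pipeline(artifacts: Artifacts,
                 stages: List[Tuple[str, Stage, Check]]) -> Artifacts:
    """Run each stage, then its gate; stop at the first gate that reports problems."""
    for name, stage, check in stages:
        artifacts = stage(artifacts)
        problems = check(artifacts)
        if problems:
            raise GateFailure(f"{name}: {problems}")
    return artifacts


if __name__ == "__main__":
    # Hypothetical source table; the numbers are made up for the example.
    src = {"table1.csv": b"n,auc\n120,0.91\n"}
    manifest = build_manifest(src)
    drifted = {"table1.csv": b"n,auc\n120,0.93\n"}
    print(verify_manifest(src, manifest))      # clean manifest
    print(verify_manifest(drifted, manifest))  # drifted number is flagged
```

Because the check is a pure function of the artifact bytes, rerunning it on the same inputs gives the same verdict, which is what makes the resulting trail re-executable and auditable.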