arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.27469 2026-05-28 cs.LG cs.AI

Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift

架构驱动的偏移:面向捕捉逻辑偏移趋势的轻量级选择器

Zhong Ye, Yu Hu, Ruilin Tang

Affiliations: School of Computer Science and Technology, Guangdong University of Technology; School of Computer Science and Engineering, South China University of Technology

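AI summary: This paper proposes Architecture-driven Shift (ADS) as a lightweight proxy for logit shift, enabling efficient selection of pre-trained models for continual learning; it derives theoretically and verifies experimentally a monotonic correlation between ADS and logit shift.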

AI Chinese abstract

持续学习是一种利用深度预训练神经网络能力的实用范式,但哪个预训练模型能更好地平衡“可塑性-稳定性”值得选择?逻辑偏移作为自然代理,因为它代表了持续学习场景中的逻辑偏移。然而,获取逻辑偏移需要巨大的计算成本,阻碍了大规模模型选择。现有的理论分析由于假设均匀隐藏层宽度,忽略了实际架构的结构异质性(可变宽度和深度),无法提供有效的替代方案。这引发了一个关键问题:异构架构与在先验任务(模型已训练过的任务)上的逻辑偏移之间理论上存在什么关系?为了回答这个问题,我们将逻辑偏移解耦为架构依赖和数据依赖,建立我们的框架,揭示了两种依赖的组合——定义为架构驱动偏移(ADS)——能够很好地捕捉逻辑偏移趋势,且只需少量数据样本即可计算。具体来说,对于在先验任务上优化良好的模型,较高的ADS与在当前任务训练后较大的逻辑偏移相关,这基于三个机制组件推导得出:(1)权重矩阵梯度关于层宽的谱范数缩放,(2)新任务的优化路径长度,以及(3)宽网络中的渐近任务冲突。跨越175多种不同架构的大量实证结果表明,ADS与逻辑偏移之间存在强单调相关性(最弱的Spearman相关系数$r_s=0.731$)。在实践中,我们证明了ADS可以作为预期校准误差的轻量级代理,预期校准误差是用于可靠持续学习模型选择的广泛使用的指标,在三个数据集的六个场景中得到了验证。

English abstract

Continual Learning (CL) is a practical paradigm for exploiting the power of deep pre-trained neural networks, but which pre-trained model better balances "plasticity-stability" and therefore deserves to be chosen? The logit shift serves as a natural proxy because it directly reflects how a model's outputs on prior tasks change in CL scenarios. However, obtaining the logit shift incurs a huge computational cost, which hinders large-scale model selection. Existing theoretical analyses fail to offer an efficient alternative because they assume uniform hidden-layer widths, which ignores the structural heterogeneity (variable width and depth) of real-world architectures. This raises a critical question: what theoretical relationship can be identified between a heterogeneous architecture and the logit shift on prior tasks (those the model has already been trained on)? To answer this question, we decouple the logit shift into an architecture dependency and a data dependency to establish our framework, which reveals that the combination of the two dependencies, defined as Architecture-driven Shift (ADS), captures the logit shift tendency well and is computable with few data samples. Specifically, for a model well optimized on prior tasks, a higher ADS is associated with a larger logit shift after training on the current task, a result derived from three mechanistic components: (1) spectral norm scaling of weight matrix gradients with layer width, (2) the optimization path length of the new task, and (3) the asymptotic task conflict in wide networks. Extensive empirical results across more than 175 diverse architectures demonstrate a strong monotonic correlation (the weakest Spearman's $r_s=0.731$) between ADS and logit shift. Practically, we demonstrate that ADS can serve as a lightweight proxy for the expected calibration error, a widely used metric for reliable CL model selection, on three datasets across six scenarios.
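The correlation claim in this abstract is stated in terms of Spearman's $r_s$, which compares rankings rather than raw values. The following is a minimal sketch of that statistic in plain Python, not the authors' evaluation code: the helper names `ranks` and `spearman` are illustrative, and the two score lists are made-up placeholders rather than numbers from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 0-based positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's r_s, computed as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical proxy scores and measured shifts for five architectures.
proxy_scores = [0.12, 0.40, 0.33, 0.90, 0.75]
measured_shift = [1.1, 2.0, 2.4, 5.2, 4.0]
r_s = spearman(proxy_scores, measured_shift)
```

Because only ranks enter the computation, any strictly increasing transformation of either list leaves `r_s` unchanged, which is why the statistic suits a proxy that is meant to preserve ordering rather than predict exact values.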

2605.27467 2026-05-28 cs.LG cs.AI cs.CV

Comparative Analysis of Liquid Neural Networks and LSTM for Sequential Pattern Recognition: Robustness, Efficiency, and Clinical Utility

液态神经网络与LSTM在序列模式识别中的比较分析:鲁棒性、效率与临床实用性

Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi

Affiliations: National Electronics and Computer Technology Center (NECTEC); Language Understanding Lab.

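AI summary: By comparing Liquid Neural Networks (LNNs) with LSTMs on four types of sequential data, this paper finds that LNNs offer better parameter efficiency and robustness and are especially suited to clinical settings where data are sparse.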

Comments 9 pages, 7 figures, 6 tables. The conference paper will appear in the Proceedings of JCSSE 2026.

AI Chinese abstract

传统的循环神经网络(RNN)和长短期记忆网络(LSTM)在离散时间步上运行,往往无法捕捉现实世界物理过程的流体时间动态。液态神经网络(LNN),特别是闭式连续时间(CfC)网络,通过将隐藏状态演化建模为连续微分方程来解决这一问题。在本文中,我们在四种不同的序列模态上进行了全面的基准测试研究:神经形态事件数据(N-MNIST)、基于笔画的绘图(QuickDraw)、视觉手写(IAM)和生理时间序列(PhysioNet Sepsis-3)。此外,我们使用时间丢弃法进行了严格的压力测试,以评估模型对缺失数据的鲁棒性。我们的研究结果表明,LNN在原生时间域和数据稀疏普遍的临床环境中,始终提供优越的参数效率和显著更高的鲁棒性。本扩展预印本提供了关于相关数据集和LNN理论谱系的额外背景,并附有详细附录,记录了我们的完整实现和实验设置。

English abstract

Traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units operate on discrete time steps, often failing to capture the fluid temporal dynamics of real-world physical processes. Liquid Neural Networks (LNNs), specifically Closed-form Continuous-time (CfC) networks, address this by modeling the hidden state evolution as a continuous differential equation. In this paper, we conduct a comprehensive benchmarking study across four distinct sequential modalities: neuromorphic event-based data (N-MNIST), stroke-based drawing (QuickDraw), visual handwriting (IAM), and physiological time-series (PhysioNet Sepsis-3). Furthermore, we perform a rigorous stress test using temporal dropout to evaluate model robustness against missing data. Our findings reveal that LNNs consistently provide superior parameter efficiency and significantly higher robustness in natively temporal domains and clinical environments where data sparsity is prevalent. This extended preprint provides additional background on related datasets and the LNN theoretical lineage, supplemented with a detailed appendix documenting our full implementation and experimental settings.
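The abstract does not say how the temporal dropout stress test is implemented. As a minimal sketch, assume each time step is dropped independently with a fixed probability and the surviving timestamps are kept, so that a continuous-time model sees irregular gaps. The names `temporal_dropout` and `time_gaps` are hypothetical, and the toy sequence is illustrative only.

```python
import random


def temporal_dropout(sequence, drop_prob, seed=0):
    """Drop each (timestamp, features) step independently with prob drop_prob.

    The first step is always kept so the result is never empty.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    kept = [sequence[0]]
    for step in sequence[1:]:
        if rng.random() >= drop_prob:
            kept.append(step)
    return kept


def time_gaps(sequence):
    """Elapsed time between consecutive surviving steps."""
    return [b[0] - a[0] for a, b in zip(sequence, sequence[1:])]


# Toy sequence: ten evenly spaced steps with a single feature each.
full = [(float(t), [t * 0.1]) for t in range(10)]
sparse = temporal_dropout(full, drop_prob=0.5, seed=42)
gaps = time_gaps(sparse)  # no longer uniform once steps are missing
```

A discrete-time model such as an LSTM would consume `sparse` as if the steps were adjacent, whereas a continuous-time model can be given `gaps` as the elapsed time per update; that difference is one plausible reading of why robustness to missing data is tested this way.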

2605.27465 2026-05-28 cs.CV cs.AI

AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

AdaMerge: 面向视觉Transformer无训练加速的显著性感知自适应令牌合并

Semi Lee, Hyejin Go, Hyesong Choi

Affiliations: Electronic Engineering, Soongsil University

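AI summary: Proposes the AdaMerge framework, which uses two complementary mechanisms, salience-weighted similarity and adaptive merging intensity, to improve the accuracy-compute Pareto frontier of token merging without any training.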

Comments 11 pages, 3 figures, 5 tables. Submitted to NeurIPS 2026

AI Chinese abstract

视觉Transformer(ViT)中自注意力的二次计算成本构成了实际部署的基本瓶颈,激发了令牌缩减方面的活跃研究。在现有方法中,令牌合并(ToMe)已成为一种优雅的无训练解决方案;然而,其设计基于令牌平等的隐含前提,这与自注意力已充分证明的非均匀性相悖,并在激进压缩下导致高显著性令牌的信息丢失。我们通过AdaMerge解决了这一局限,该框架基于两个互补机制。首先,显著性加权相似度利用列式特征亲和度中心性作为令牌重要性代理,并将所得显著性分数纳入二分匹配分数,确保关键令牌对合并表示贡献更大。其次,自适应合并强度使用预先计算的逐层相似度统计量,根据输入特定的冗余性动态调整每层缩减数量。在ImageNet-1k上使用ViT-B/16,AdaMerge在所有FLOPs匹配条件下均持续优于ToMe、PiToMe和DSM。精度差距随压缩单调增大:在13.4G FLOPs操作点,AdaMerge的Top-1下降仅为-1.06%,而PiToMe为-1.45%,DSM为-4.62%。据我们所知,AdaMerge是首个将显著性加权相似度和自适应逐层缩减结合到单一无训练令牌合并框架中的方法,推动了ViT加速的精度-FLOPs帕累托前沿。

English abstract

The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an unspoken premise of token equality, which contravenes the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms. First, salience-weighted similarity leverages column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count in accordance with input-specific redundancy. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge sustains a Top-1 degradation of only -1.06%, compared to -1.45% for PiToMe and -4.62% for DSM. To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction into a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.

2605.27464 2026-05-28 cs.CV cs.AI

Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

超越运动基元:基于头戴式IMU的行为活动识别

Chung-Ta Huang, Leopold Das, Jeffrey Zhou, Faizaan Siddique, Julia Seungjoo Baek, Serena Liu, Andrew Rusli, Todd Y. Zhou, Freddy Yu, Sinclair Hansen, Ziling Hu, Arnav Sharma, Mengyu Wang

Affiliations: Harvard AI and Robotics Lab, Harvard University

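AI summary: Proposes HiT-HAR, a hierarchical model that uses head-mounted IMU data for behavior-level activity recognition beyond conventional motion primitives, outperforming existing models on five-class action and eight-class scenario recognition.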

AI Chinese abstract

AR智能眼镜需要连续的行为上下文来提供主动辅助,但其最实用的常开传感器——头戴式惯性测量单元(IMU)仅能检测行走或站立等运动基元。我们突破运动基元,实现行为级识别,定义了五个类别以平衡AR应用需求与传感器可观测性。为此,我们构建了一个包含16万样本的Ego4D数据集,采用四层质量保证框架覆盖8个活动场景,并提出了HiT-HAR,一个70.3万参数的层次模型,在五类动作和八类场景识别中优于先前的头戴式IMU模型。我们通过每类可分离性分析进一步绘制了头戴式IMU的可观测性边界,识别出哪些行为类别可靠可观测(移动),哪些受益于时间上下文(物体传递、任务操作),以及哪些场景依赖的信号重叠仍构成挑战。我们的结果表明,利用时间上下文和场景结构的架构选择优于单纯扩大模型规模。代码和数据集公开于https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR。

English abstract

AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond motion primitives to behavioral-level recognition, defining five categories that balance AR application need with sensor observability. To this end, we construct a 160K-sample Ego4D dataset with a four-tier quality assurance framework spanning 8 activity scenarios, and propose HiT-HAR, a 703K-parameter hierarchical model that outperforms prior head-mounted IMU models on five-class action and eight-class scenario recognition. We further map the observability frontier of head-mounted IMU through per-class separability analysis, identifying which behavioral categories are reliably observable (Locomotion), which benefit from temporal context (Object Transfer, Task Operation), and where scenario-dependent signal overlap poses remaining challenges. Our results indicate that architectural choices exploiting temporal context and scenario structure outperform simply scaling model size. The code and dataset are publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR.

2605.27461 2026-05-28 cs.RO

A Factory-Floor Deployment Case Study of VLA Pipelines for Industrial Packaging Task: Workflow, Failures, and Lessons

工业包装任务的VLA流水线工厂部署案例研究:工作流、故障与经验教训

Brian Zhu, Philipp Schmitt, Philine Meister, Lukas Gensler, Momen Khalil, Emmanuele Poggi, Johannes Hechtl, Carsten Braunroth, Kai Wurm, Gokul Narayanan, Eugen Solowjow, Georg von Wichert, Andre Scholz, Felix Albrecht, Maxmillian Metzner

Affiliations: Siemens Corporation

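AI summary: This study deploys a pretrained Pi0.5 policy on an industrial packaging task at a Siemens factory, iteratively fine-tunes it while collecting 2535 on-site episodes, and summarizes recurring failure modes in VLA pipeline deployment together with lessons for improving the workflow.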

AI Chinese abstract

视觉-语言-动作(VLA)策略展示了有前景的操作能力,但其实际影响常受限于现实部署的可靠性要求。我们展示了西门子工厂(德国埃尔朗根GWE)中一项工业包装任务的部署研究:机器人必须从杂乱堆中拾取透明配件袋,将其插入纸板包装的剩余空腔,并确保袋子及其内容物保持在闭合平面以下。我们的目标是理解通过迭代微调和部署驱动的改进,将预训练的Pi0.5策略适配到单一工厂任务所需的实际工作量。该流水线包括数据收集、整理、微调、评估和针对性恢复数据收集的重复循环。我们从现场工厂设置中积累了2535个片段(10小时)。在本文中,我们贡献了一个工厂级VLA部署的实证报告,重点介绍了常见的故障模式和有助于改进部署工作流的经验教训。

English abstract

Vision-Language-Action (VLA) policies have shown promising manipulation capabilities, yet their practical impact is often limited by the reliability demands of real-world deployment. We present a deployment study of an industrial packaging task at Siemens Factory (GWE, Erlangen, Germany), where a robot must pick a transparent accessory bag from a cluttered pile, insert it into the remaining cavity of a cardboard package, and ensure that the bag and its contents remain below the closing plane. Our goal is to understand the practical effort required to adapt a pretrained Pi0.5 policy to a single factory-floor task through iterative fine-tuning and deployment-driven refinement. The pipeline consists of repeated loops of data collection, curation, fine-tuning, evaluation, and targeted recovery data collection. We have accumulated 2535 episodes (10 hours) from the on-site factory settings. In this paper, we contribute an empirical account of a factory-floor VLA deployment, highlighting recurring failure modes and lessons that inform how to improve the deployment workflow.

2605.27460 2026-05-28 cs.CV

D$^2$Turb: Depth-Aware Simulation and Decoupled Learning for Single-Frame Atmospheric Turbulence Mitigation

D$^2$Turb: 深度感知仿真与解耦学习用于单帧大气湍流抑制

Zixiao Hu, Tianyu Li, Guoqing Wang, Wei Li, Guoguo Xin, Xun Liu, Peng Wang

Affiliations: University of Electronic Science and Technology of China; Beijing Institute of Space Mechanics and Electricity; School of Physics, Northwest University, Xi'an

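AI summary: Proposes the D$^2$Turb framework, which combines physical simulation with decoupled restoration through a depth-aware turbulence synthesis protocol and an adaptive structural prior injection mechanism, achieving texture deblurring and geometric rectification under single-frame atmospheric turbulence.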

Comments 14 pages, 7 figures

AI Chinese abstract

单帧大气湍流抑制由于空间变化模糊与非刚性几何畸变并存而本质上是病态的。现有的基于平面场仿真的端到端方法通常难以平衡纹理恢复与几何校正。为克服这一限制,我们提出D$^2$Turb,一个将物理仿真与显式解耦恢复相结合的统一框架。首先,我们引入深度感知湍流合成协议,将场景深度纳入相位到空间公式中,生成物理一致、深度相关的退化,并为解耦学习提供关键的中间倾斜监督信号。基于该仿真引擎,D$^2$Turb将恢复分解为两个交互阶段:纹理去模糊和几何校正。纹理去模糊阶段采用去模糊骨干网络恢复细节,同时保留几何畸变以供后续校正阶段使用。为缓解级联设计中常见的信息碎片化问题,我们进一步提出自适应结构先验注入(ASPI)机制,动态传递去模糊模块的深层结构表示以指导密集流预测进行空间去扭曲。大量实验表明,D$^2$Turb在合成和真实数据集上均达到最先进性能,在纹理恢复和几何保真度方面均有持续改进。我们的代码和预训练模型已在 https://github.com/HertzDot222/D2Turb 公开。

English abstract

Single-frame atmospheric turbulence mitigation is inherently ill-posed due to spatially varying blur coupled with non-rigid geometric distortion. Existing end-to-end approaches trained on flat-field simulations often struggle to balance texture recovery with geometric rectification. To overcome this limitation, we propose D$^2$Turb, a unified framework that bridges physics-grounded simulation with explicitly decoupled restoration. First, we introduce a Depth-Aware Turbulence Synthesis protocol that incorporates scene depth into the phase-to-space formulation. This generates physically consistent, depth-dependent degradations and provides a crucial intermediate tilt supervision signal for disentangled learning. Building upon this simulation engine, D$^2$Turb decomposes restoration into two interactive stages: texture deblurring and geometric rectification. The texture deblurring stage employs a deblurring backbone to recover fine-grained details while preserving geometric distortion for the subsequent rectification stage. To mitigate the information fragmentation commonly observed in cascaded designs, we further propose an Adaptive Structural Prior Injection (ASPI) mechanism that dynamically transfers deep structural representations from the deblurring module to guide dense flow prediction for spatial unwarping. Extensive experiments demonstrate that D$^2$Turb achieves state-of-the-art performance on both synthetic and real-world datasets, with consistent improvements in both texture recovery and geometric fidelity. Our code and pre-trained models are publicly available at https://github.com/HertzDot222/D2Turb.

2605.27456 2026-05-28 cs.LG

Metric-Aware PCA as a Linear Instance of Geometric Deep Learning

度量感知PCA作为几何深度学习的线性实例

Michael Leznik

Affiliations: none extracted; the field contains only the date May, 2025

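AI summary: This paper places Metric-Aware Principal Component Analysis (MAPCA) within the geometric deep learning framework, establishes a precise correspondence between the two along six axes including symmetry, equivariance, and invariance, and shows that MAPCA is a linear instance of geometric deep learning.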

AI Chinese abstract

几何深度学习围绕数据域的对称性组织神经架构,对称群的选择作为几何先验,决定了可以学习哪些表示。度量感知主成分分析(MAPCA)通过正定度量矩阵参数化主成分分析,其规范子族在标准PCA和输出白化之间插值,对角度量点恢复不变PCA(IPCA)。本文将MAPCA置于几何深度学习框架中。度量被视为几何先验;保持它的正交群是其诱导的对称群;MAPCA解在该群下等变,所得谱不变;MAPCA的定义约束是等变网络中使用的Schur型权重约束的线性类比。在六个轴——域、对称群、等变性、不变性、架构原语和几何先验——上,我们构建了MAPCA与几何深度学习之间的精确字典。技术核心是一个唯一性定理,将IPCA刻画为MAPCA族中唯一的线性数据导出度量,该度量在任意对角缩放下等变,并投影到作用的不动点集上,在归一化下等价于精确形式的方差最大化准则。本文以三座桥梁结束:核PCA作为非线性扩展,谱图方法作为图上的MAPCA,以及深度MAPCA构造将定位扩展到深度等变网络。

English abstract

Geometric deep learning organises neural architectures around the symmetries of their data domain, with the choice of symmetry group serving as a geometric prior that determines what representations can be learned. Metric-Aware Principal Component Analysis (MAPCA) parameterises principal component analysis by a positive-definite metric matrix, with a canonical subfamily interpolating between standard PCA and output whitening and a diagonal-metric point recovering Invariant PCA (IPCA). This paper positions MAPCA within the geometric deep learning framework. The metric is read as the geometric prior; the orthogonal group preserving it is the symmetry group it induces; MAPCA solutions are equivariant under this group with the resulting spectrum invariant; and MAPCA's defining constraint is the linear analogue of the Schur-type weight constraints used in equivariant networks. Across six axes - domain, symmetry group, equivariance, invariance, architectural primitive, and geometric prior - we construct a precise dictionary between MAPCA and geometric deep learning. The technical anchor is a uniqueness theorem characterising IPCA as the unique linear data-derived metric in the MAPCA family that is equivariant under arbitrary diagonal rescaling and projects onto the fixed-point set of the action, equivalent under normalisation to the variance-maximisation criterion in its precise form. The paper closes with three bridges: kernel PCA as the nonlinear extension, spectral graph methods as MAPCA on graphs, and a deep MAPCA construction extending the positioning into deep equivariant networks.

2605.27451 2026-05-28 cs.CV

From Affect to Complex Behavior: Advancing Multimodal Human-Centered AI at the 10th ABAW Workshop & Competition

从情感到复杂行为:第十届ABAW研讨会与竞赛推进多模态以人为中心的人工智能

Dimitrios Kollias, Panagiotis Tzirakis, Alan Cowen, Stefanos Zafeiriou, Irene Kotsia, Eric Granger, Marco Pedersoli, Simon Bacon, Jens Madsen, Soufiane Belharbi, Muhammad Haseeb Aslam, Chunchang Shao, Guanyu Hu

Affiliations: Queen Mary University of London; Hume AI; Google DeepMind; Imperial College London; Cogitat; LIVIA, ILLS, ETS Montreal; Concordia University; Xi'an Jiaotong University

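AI summary: This paper presents the 10th ABAW Workshop and Competition, which advances the modelling, analysis, and understanding of human affect and behavior in real-world settings through multimodal challenges and papers.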

Comments accepted at CVPR 2026

AI Chinese abstract

第十届真实世界情感与行为分析(ABAW)研讨会与竞赛,与CVPR 2026同期举办,持续推动在真实、无约束环境中对人类情感与行为的建模、分析和理解研究。研讨会保持双重结构,包括竞赛和论文轨道。ABAW竞赛引入了一系列多样化的挑战,针对情感与行为理解的关键方面,包括连续情感(效价-唤醒度)估计、离散情感(表情和动作单元)识别,以及更复杂的行为分析任务,如情感模仿强度估计、矛盾/犹豫识别和细粒度暴力检测。这些挑战基于大规模真实世界数据集,为最先进方法提供了全面的基准。与此同时,论文轨道展示了广泛的贡献,涵盖姿态、运动与行为估计、情感建模与多模态学习、基准、数据集与评估协议、公平性、鲁棒性与部署。总体而言,第十届ABAW研讨会与竞赛继续作为基准测试、合作与创新的关键平台,塑造下一代多模态、以人为中心的人工智能系统的发展。

English abstract

The 10th Affective & Behavior Analysis in-the-Wild (ABAW) Workshop and Competition, held at CVPR 2026, continues to advance research on the modelling, analysis, and understanding of human affect and behavior in real-world, unconstrained environments. The workshop maintains its dual structure, comprising both a competition and a paper track. The ABAW Competition introduces a diverse set of challenges targeting key aspects of affective and behavioral understanding, including continuous affect (valence-arousal) estimation, discrete affect (expression and action unit) recognition, as well as more complex behavior analysis tasks, such as emotional mimicry intensity estimation, ambivalence/hesitancy recognition and fine-grained violence detection. These challenges are built upon large-scale in-the-wild datasets, providing comprehensive benchmarks for state-of-the-art approaches. In parallel, the paper track presents a wide range of contributions spanning pose, motion & behavior estimation, affect modelling & multimodal learning, benchmarks, datasets & evaluation protocols, fairness, robustness & deployment. Overall, the 10th ABAW Workshop and Competition continues to serve as a key platform for benchmarking, collaboration and innovation, shaping the development of next-generation multimodal, human-centered AI systems.

2605.27431 2026-05-28 cs.LG cs.AI

Tackling Multimodal Learning Challenges with Mixture-of-Expert: A Survey

应对多模态学习挑战的混合专家方法:综述

Liangwei Nathan Zheng, Wei Emma Zhang, Olaf Maennel, Lin Yue, Weitong Chen

Affiliations: Adelaide University

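AI summary: This survey reviews how Mixture-of-Experts (MoE) addresses core challenges in multimodal learning, namely scalability, heterogeneity, and imperfect data, through efficient scaling, representation learning, and flexible adaptation.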

Comments Accepted by IJCAI 2026 (results released by 30 April 2026). The acceptance email is uploaded alongside the LaTeX source files of the paper as Acceptance_email.pdf.

AI Chinese abstract

混合专家(MoE)为多模态学习提供了一个自然兼容且可扩展的框架,在不同模态和任务中展现出强大的适应性。尽管其日益成功,但关于MoE方法解决多模态挑战的全面系统综述仍然缺乏。现有综述往往从方法分类学角度独立评估多模态学习或MoE,忽视了它们之间的独特相互作用。本综述通过回答一个核心问题来填补这一空白:\textit{MoE如何有效解决多模态挑战?}我们从三个关键视角进行探讨:(1) \textbf{MoE作为高效多模态引擎:}通过将计算成本与参数增长解耦,并通过选择性专家激活减轻模态冗余,实现可扩展的多模态建模;(2) \textbf{MoE作为多模态表示学习器:}整合互补的多意见专家知识,丰富对齐和交互表示;(3) \textbf{MoE作为多模态适配器:}提供模块化和灵活的机制,以建模不完美数据场景,如模态不平衡和模态缺失。通过广泛的文献综述,我们识别出关键研究空白,包括可解释路由、专家通信、模态集成和终身多模态学习。我们将本综述定位为未来研究的基础,旨在构建可解释且可持续的多模态混合专家系统。

English abstract

Mixture-of-Experts (MoE) presents a naturally compatible and scalable framework for multimodal learning, demonstrating strong adaptability across diverse modalities and tasks. Despite its growing success, a comprehensive and systematic review of MoE methods addressing multimodal challenges remains lacking. Existing surveys tend to evaluate either multimodal learning or MoE independently from a method-taxonomy perspective, overlooking the unique interplay between them. This survey fills that gap by answering a central question: \textit{How does MoE effectively resolve multimodal challenges?} We approach this from three key perspectives: (1) \textbf{MoE as an Efficient Multimodal Engine:} enabling scalable multimodal modeling by decoupling computational cost from parameter growth and mitigating modality redundancy through selective expert activation; (2) \textbf{MoE as a Multimodal Representation Learner:} integrating complementary multi-opinion expert knowledge to enrich alignment and interaction representations; and (3) \textbf{MoE as a Multimodal Adapter:} providing a modular and flexible mechanism to model imperfect data scenarios such as modality imbalance and missing modality. Through our extensive literature review, we identify critical research gaps, including interpretable routing, expert communication, modality integration, and lifelong multimodal learning. We position this survey as a foundation for future research toward interpretable and sustainable multimodal Mixture-of-Experts systems.
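The phrase "decoupling computational cost from parameter growth ... through selective expert activation" refers to sparse routing: many experts exist, but only a few run per input. The following is a minimal sketch of generic top-k gating, not a method taken from the survey. The toy experts and the hand-written gate logits are assumptions for illustration; in practice a learned gating network would produce the logits from the input.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Map the k highest-scoring expert indices to weights that sum to 1."""
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return dict(zip(top, weights))


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the selected experts' outputs; the others never run."""
    routing = top_k_route(gate_logits, k)
    out = [0.0] * len(x)
    for idx, weight in routing.items():
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, routing


# Hypothetical experts: each scales its input by a different factor.
experts = [lambda v, s=s: [s * a for a in v] for s in (1.0, 2.0, 3.0, 4.0)]
output, routing = moe_forward([1.0, -1.0], experts,
                              gate_logits=[0.1, 2.0, -1.0, 2.0], k=2)
```

With four experts and `k=2`, only two expert calls are made per input, so adding more experts grows the parameter count without growing the per-input compute.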

2605.27428 2026-05-28 cs.LG

$E^3$-Agent: An Executable and Evolving Agent for Resource Management of Edge Generative Inference

$E^3$-Agent: 一种用于边缘生成式推理资源管理的可执行且可演化智能体

Rui Bao, Yaping Sun, Zhiyong Chen, Feng Yang, Meixia Tao, Nan Li, Wenjun Zhang

Affiliations: Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai 200240, China; Department of Broadband Communication, Pengcheng Laboratory, Shenzhen 518000, China; China Mobile Research Institute, Beijing 100053, China

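AI summary: To address device performance that is unknown and time-varying in edge generative inference, proposes $E^3$-Agent, an executable and evolving agent that separates a fast-path router from a slow-path large language model meta-controller to enable online learning and adaptive resource management, reducing average latency by 65%-73% in dynamic scenarios.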

Comments 13 pages, 4 figures, 6 tables

AI Chinese abstract

边缘部署的生成式推理日益面临两个实际现实:每设备每模型的性能在部署时通常是未知的,并且由于用户驱动的语义事件、后台负载和设备变动而呈现非平稳性。因此,在固定机制下离线调优的资源管理器可能变得脆弱且维护成本高昂。本文提出了$E^3$-Agent,一种用于边缘人工智能生成内容(AIGC)资源管理的可执行且可演化的智能体。$E^3$-Agent将做出毫秒级调度决策的快速路径路由器与慢速路径事件驱动的大语言模型(LLM)元控制器分离,后者通过工具接口暴露的小型显式控制面(包括风险门控、路由器配置和快速性能校准)来缓解机制变化。该智能体从执行反馈中在线学习,并持续适应未知且时变的服务时间映射。我们在由MLPerf衍生的设备模型测量先验驱动的离散事件模拟器中评估了$E^3$-Agent,涵盖了冷启动预热和三种动态机制:语义动态、设备变动和隐藏漂移。在动态场景中,与最佳静态基线相比,$E^3$-Agent将平均延迟降低了65%-73%,保持在用于评估的在线全信息Oracle的7%-10%以内,并有效抑制了语义退化下的卡顿率。

English abstract

Edge deployments of generative inference increasingly face two practical realities: per-device per-model performance is often unknown at deployment time, and it is non-stationary due to user-driven semantic events, background load, and device churn. Consequently, a resource manager that is tuned offline under a fixed regime can become brittle and expensive to maintain. This paper presents $E^3$-Agent, an executable and evolving agent for edge artificial intelligence generated content (AIGC) resource management. $E^3$-Agent separates a fast-path router that makes millisecond-level dispatch decisions from a slow-path, event-driven large language model (LLM) meta-controller that mitigates regime shifts through a small, explicit control surface exposed via a tool interface, including risk gating, router configuration, and rapid performance calibration. The agent learns online from execution feedback and continuously adapts to unknown and time-varying service-time mappings. We evaluate $E^3$-Agent in a discrete-event simulator driven by MLPerf-derived device-model measurement priors, covering cold-start warmup and three dynamic regimes: semantic dynamics, device churn, and hidden drift. Across the dynamic scenarios, $E^3$-Agent reduces average latency by 65%-73% compared to the best static baseline, stays within 7%-10% of an online full-information Oracle used for evaluation, and effectively suppresses stutter rate under semantic degradation.
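The abstract describes a fast-path router that learns online from execution feedback when service times are unknown. A minimal sketch of that idea follows, under assumptions that are not in the paper: service-time estimates are exponential moving averages, and dispatch picks the device with the smallest estimated completion time. The class `FastPathRouter`, its fields, and the device names are all hypothetical.

```python
class FastPathRouter:
    """Toy dispatcher that learns per-device service times from feedback."""

    def __init__(self, devices, prior=1.0, alpha=0.3):
        # alpha is the moving-average step size; it stands in for the kind of
        # knob a slow-path controller might retune (an assumption here).
        self.alpha = alpha
        self.estimate = {d: prior for d in devices}  # estimated seconds per job
        self.backlog = {d: 0.0 for d in devices}     # estimated queued seconds

    def dispatch(self):
        """Pick the device with the smallest estimated completion time."""
        device = min(self.estimate,
                     key=lambda d: self.backlog[d] + self.estimate[d])
        self.backlog[device] += self.estimate[device]
        return device

    def feedback(self, device, observed_seconds):
        """Release the queued work and move the estimate toward the observation."""
        self.backlog[device] = max(
            0.0, self.backlog[device] - self.estimate[device])
        self.estimate[device] += self.alpha * (
            observed_seconds - self.estimate[device])


# Hypothetical ground truth, unknown to the router at start-up.
true_seconds = {"phone": 2.0, "laptop": 0.5}
router = FastPathRouter(["phone", "laptop"], prior=1.0, alpha=0.5)
for _ in range(20):
    chosen = router.dispatch()
    router.feedback(chosen, true_seconds[chosen])
```

The sketch also shows a limitation of purely greedy routing: once the phone looks slow it is never tried again, so its estimate goes stale. Handling that kind of regime shift is the role the abstract assigns to the slow-path meta-controller.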

2605.27406 2026-05-28 cs.LG

A Simple State Space Model Excels at Multivariate Time Series Classification

一个简单的状态空间模型在多变量时间序列分类中表现出色

Hassan Saadatmand, Geoffrey I. Webb, Hamid Rezatofighi, Mahsa Salehi

Affiliations: Monash University

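AI summary: This paper systematically studies diagonal state space models (S4D) and input-dependent state space models (the Mamba family) on large-scale time series classification, finds that S4D outperforms Mamba variants in both accuracy and efficiency, and proposes the lightweight modifications MS4 and MS4N, which match or surpass deep learning models with 2-10x more parameters on multiple benchmarks.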

AI Chinese abstract

结构化状态空间模型(SSM)最近作为序列建模的有前景基础出现,基于Mamba的架构通过输入相关的状态转换展示了强大的性能,尽管复杂度相当高。然而,它们在时间序列分类(TSC)中的应用主要局限于Mamba风格的架构,更广泛的SSM设计空间尚未充分探索。我们首次在大规模TSC基准上进行了涵盖对角SSM(S4D)和输入相关SSM(Mamba系列)的系统研究,探究这种复杂性是否对顶级性能是必要的。我们的结果揭示了一个令人惊讶的发现:S4D在准确性和效率上始终优于基于Mamba的变体,挑战了增加复杂性会在TSC中带来有意义收益的假设。基于此,我们引入了MS4,通过线性输入投影和通道混合机制对S4D进行轻量级修改,以及MS4N,一种归一化变体,以可忽略的开销稳定状态动态。在MONSTER(多达6000万样本、5万时间步、82个类别)和UEA基准上的59个数据集上,与15个基线相比,MS4和MS4N始终优于基于Mamba的模型,同时保持更高的效率,并且MS4N匹配或超越了参数量大约2倍和10倍的竞争性深度学习模型。这些结果将轻量级结构化SSM定位为在TSC中扩展复杂性的有吸引力替代方案。

English abstract

Structured state space models (SSMs) have recently emerged as a promising foundation for sequence modeling, with Mamba-based architectures demonstrating strong performance through input-dependent state transitions, albeit at considerable complexity. However, their application to time-series classification (TSC) has been largely limited to Mamba-style architectures, leaving the broader SSM design space underexplored. We present the first systematic study spanning diagonal SSMs (S4D) and input-dependent SSMs (Mamba family) on large-scale TSC benchmarks, asking whether such complexity is necessary for top performance. Our results reveal a surprising finding: S4D consistently outperforms Mamba-based variants in both accuracy and efficiency, challenging the assumption that increased complexity translates to meaningful gains in TSC. Building on this, we introduce MS4, lightweight modifications to S4D via a linear input projection and channel-mixing mechanism, and MS4N, a normalized variant that stabilizes state dynamics with negligible overhead. Evaluated on 59 datasets across MONSTER (up to 60 million samples, 50K timesteps, 82 classes) and the UEA benchmark, against 15 baselines, MS4 and MS4N consistently outperform Mamba-based models while remaining more efficient, and MS4N matches or surpasses competing deep learning models that are roughly 2x and 10x larger in parameters. These results position lightweight structured SSMs as a compelling alternative to scaling complexity for TSC.
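For readers unfamiliar with diagonal SSMs, the core of an S4D-style layer is a linear recurrence with a diagonal, input-independent state matrix. The following is a minimal single-channel sketch of that recurrence under zero-order-hold discretization. It omits the MS4 input projection, channel mixing, and MS4N normalization named in the abstract, whose details are not given there, and the initialization values are assumptions for illustration.

```python
import cmath


def s4d_scan(u, A, B, C, dt):
    """Run a diagonal SSM over a scalar input sequence u.

    Continuous model: x'(t) = A x(t) + B u(t), y(t) = Re(C x(t)), A diagonal.
    Zero-order hold gives A_bar = exp(dt * A) and B_bar = (A_bar - 1) / A * B.
    """
    A_bar = [cmath.exp(dt * a) for a in A]
    B_bar = [(ab - 1) / a * b for ab, a, b in zip(A_bar, A, B)]
    x = [0j] * len(A)  # one complex state per diagonal entry
    ys = []
    for u_k in u:
        # Each state updates independently because A is diagonal.
        x = [ab * xn + bb * u_k for ab, xn, bb in zip(A_bar, x, B_bar)]
        ys.append(sum(c * xn for c, xn in zip(C, x)).real)
    return ys


# Assumed initialization: decaying oscillators at increasing frequencies.
A = [complex(-0.5, cmath.pi * n) for n in range(4)]
B = [1 + 0j] * 4
C = [0.25 + 0j] * 4
impulse_response = s4d_scan([1.0] + [0.0] * 15, A, B, C, dt=0.1)
```

Because `A_bar` and `B_bar` do not depend on the input, they are computed once per sequence. Input-dependent models such as Mamba recompute their transition at every step, which is the extra complexity whose necessity the paper questions.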

2605.27397 2026-05-28 cs.LG

IGADA-IoT: IoT Sensor Energy Optimization in Wireless Sensor Networks Driven by Automatic Data Augmentation

IGADA-IoT:自动数据增强驱动的无线传感器网络中物联网传感器能量优化

Mingchun Sun, Rongqiang Zhao, Muhammad Abdul Munnaf, Jie Liu

Affiliations: Faculty of Computing, Harbin Institute of Technology

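AI summary: Proposes IGADA-IoT, an information-gap-guided automatic data augmentation framework that jointly exploits the capabilities of different generators through hierarchical multi-generator collaboration and scheduling to reduce information gaps, and introduces a joint information gap-model performance evaluation and closed-loop method to make augmentation decisions more accurate; experiments show an average accuracy improvement of 7.27%.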

AI Chinese abstract

在无线传感器网络(WSN)中,数据增强是一种提高采样频率决策性能的新方法,从而实现对物联网(IoT)传感器的能量优化。然而,现有方法依赖单一生成器和经验确定的量,未能建立动态信息间隙与多个生成器之间的映射,并且忽略了生成样本的异质性。此外,缺乏一种联合考虑信息间隙和模型性能的评估与闭环方法。为了解决这些问题,我们提出了一种信息间隙引导的物联网传感器自动数据增强框架(IGADA-IoT),具有分层多生成器协作和多轮调度。联合利用不同生成器的能力来减小信息间隙。在IGADA-IoT中,提出了一种分层多生成器协作与调度策略(HMGCS),以增强生成样本分配的针对性和合理性。提出了一种信息间隙-模型性能联合评估与闭环方法(IGMP-EC),以增强增强决策的准确性,并减轻欠增强和过增强的风险。实验结果表明,IGADA-IoT将多个下游模型的平均准确率提高了7.27%。与先进的数据增强方法相比,平均准确率提高了8.67%。与单个生成器相比,平均准确率提高了7.24%。此外,来自UCR Archive和实际部署的公共物联网传感器数据集证明了所提方法的准确性和泛化能力。

English abstract

In wireless sensor networks (WSNs), data augmentation is a novel method to improve sampling-frequency decision performance, thereby enabling energy optimization for IoT (Internet of Things) sensors. However, existing methods rely on a single generator and empirically determined quantities, failing to establish a mapping between dynamic information gaps and multiple generators, and overlooking the heterogeneity of generated samples. Moreover, an evaluation and a closed-loop method that jointly considers the information gap and the model performance are lacking. To address these issues, we propose an information gap-guided IoT sensor automatic data augmentation framework (IGADA-IoT) with hierarchical multi-generator collaboration and scheduling over multiple rounds. Capabilities of different generators are jointly utilized to reduce the information gaps. In the IGADA-IoT, a hierarchical multi-generator collaboration and scheduling strategy (HMGCS) is proposed to enhance the targetedness and rationality of generated sample allocation. An information gap-model performance joint evaluation and closed-loop method (IGMP-EC) is proposed to enhance the accuracy of augmentation decisions, and to mitigate the risks of under-augmentation and over-augmentation. Experimental results show that the IGADA-IoT improves the average accuracy of multiple downstream models by 7.27%. Compared with advanced data augmentation methods, the average accuracy is improved by 8.67%. Compared with the individual generators, the average accuracy is improved by 7.24%. Furthermore, public IoT sensor datasets from the UCR Archive and real-world deployments demonstrate the accuracy and generalizability of the proposed method.

2605.27393 2026-05-28 cs.CL cs.AI

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

StoryMI: 可控的多智能体治疗性对话生成

Qingyu Meng, Min Chen, Dingming Liu, Yifan Mo, Yue Su, Xin Sun, Koen Hindriks, Jiahuan Pei

Affiliations: Vrije Universiteit Amsterdam; NII, Tokyo Institute of Technology

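AI summary: Proposes the StoryMI framework, which generates therapeutic dialogues that meet motivational interviewing standards through multi-LLM agent collaboration, situational story grounding, and dynamic strategy control, and builds an evaluation protocol and a dataset to validate its effectiveness.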

Comments ACL 2026

AI Chinese abstract

大型语言模型(LLM)可以生成流畅的对话,但先前的工作缺乏情境基础、动态策略控制以及与动机性访谈(MI)临床标准对齐的评估。我们引入了StoryMI,一个用于可控MI对话生成的多LLM智能体框架,其中基于问卷的客户档案被扩展为情境故事,为对话提供叙事背景。治疗师和客户智能体生成由交互智能体选择的MI代码引导的MI编码话语,而交互智能体动态协调交换以在多次轮对话中控制MI策略。我们提出了一个两级评估协议:词汇指标和宏观层面咨询策略的MI特定度量,以及LLM作为评判者和人类专家评估。我们构建了一个包含6K模拟MI对话的数据集,基于1K问卷-故事对,涵盖12个MI代码和13个症状领域,并对六个开源和闭源LLM进行了基准测试。我们的结果表明,情境基础和宏观层面控制可以提高MI依从性和临床合理性,展示了结构化多智能体工作流在心理治疗对话生成中的有效性。我们提供代码和数据以促进可重复性。

English abstract

Large language models (LLMs) can generate fluent dialogue, but prior work lacks situational grounding, dynamic strategy control, and evaluation aligned with clinical standards in motivational interviewing (MI). We introduce StoryMI, a multi-LLM agent framework for controllable MI dialogue generation, where questionnaire-based client profiles are expanded into situational stories that provide narrative context for the dialogue. Therapist and client agents generate MI-coded utterances guided by MI codes selected by an interaction agent, which dynamically coordinates exchanges to control MI strategies during a multi-turn conversation. We propose a two-level evaluation protocol: lexical metrics and MI-specific measures of macro-level counseling strategies, alongside LLM-as-judge and human expert assessments. We construct a dataset of 6K simulated MI dialogues grounded in 1K questionnaire-story pairs, covering 12 MI codes and 13 symptom domains, and benchmark six open- and closed-source LLMs. Our results show that situational grounding and macro-level control can improve MI adherence and clinical plausibility, demonstrating the effectiveness of a structured multi-agent workflow for psychotherapy dialogue generation. We provide code and data for reproducibility.

2605.27388 2026-05-28 cs.CL cs.AI cs.SI

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

通过反应语气建模社区态度:评估LLM与在线社区语言行为对齐的人机协作框架

Nuan Wen, Xuezhe Ma

Affiliations: Information Sciences Institute, University of Southern California

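AI summary: Proposes the CARE framework, which uses fine-grained analysis of illocutionary tone to evaluate how well LLMs simulate community reactions to real news, revealing a "realism gap" and indicating that current alignment strategies are insufficient to capture the sociolinguistic dynamics of online groups.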

Comments Preprint

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作计算社会分析的代理;然而,它们忠实再现人类社区“深描”(thick description;Geertz, 1973)的能力仍然是一个关键挑战。当前的评估通常将社会身份简化为静态标签,忽视了现实群体如何应对社会变迁。为弥合这一差距,我们引入了CARE(社区感知反应评估),一个以反应为中心的框架,将LLM模拟的话语与不同社区对真实新闻的真实、事件相关的反应进行基准测试。通过刻画细粒度的言外语气谱及其所体现的潜在态度(经人机协作验证),我们的诊断揭示了一个持续的“现实主义差距”:使用明确的社区提示引导LLM本身并不能提高模拟保真度。进一步分析识别了前沿模型之间的不同行为特征,表明当前的对齐策略仍不足以捕捉在线群体的社会语言动态。

英文摘要

Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the "thick descriptions" (Geertz, 1973) of human communities remains a critical challenge. Current evaluations often reduce social identity to static labels, sidelining how real-world groups navigate social shifts. To bridge this gap, we introduce CARE (Community-Aware Reaction Evaluation), a reaction-centered framework that benchmarks LLM-simulated discourse against the authentic, event-contingent responses of distinct communities to real-world news. By characterizing a fine-grained spectrum of illocutionary tones and the underlying attitudes they manifest--validated through human-AI collaboration--our diagnosis reveals a persistent "realism gap": steering LLMs with explicit community prompts fails to inherently improve simulation fidelity. Analysis further identifies divergent behavioral signatures among frontier models, suggesting that current alignment strategies remain insufficient for capturing the sociolinguistic dynamics of online groups.

2605.27385 2026-05-28 cs.LG cs.AI

Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity

异构仿真环境中联邦强化学习的个性化观测归一化

Yiran Pang, Zhen Ni, Xiangnan Zhong

发表机构 * Department of Electrical Engineering & Computer Science, Florida Atlantic University, Boca Raton, FL, USA

AI总结 针对联邦强化学习在异构环境中状态转移动力学差异导致输入分布不一致和参数更新不平衡的问题,提出个性化观测归一化方法,通过各智能体本地维护运行均值和方差对原始状态输入进行归一化,加速训练并提升性能。

Comments Accepted at the International Joint Conference on Neural Networks (IJCNN) 2025

详情
AI中文摘要

联邦强化学习(FedRL)使多个智能体能够在不共享原始数据的情况下协同训练全局策略,因此非常适合隐私敏感的应用。然而,FedRL在异构环境中面临挑战,其中不同的状态转移动力学导致聚合过程中输入分布不一致和参数更新不平衡。因此,本文开发了一种个性化观测归一化(PON)方法,允许每个智能体使用持续更新的运行均值和方差对原始状态输入进行局部归一化。这种设计确保了局部特征的一致缩放,而不会在聚合过程中掩盖其他智能体的特征。此外,我们证明了由于不同的局部输入分布,跨智能体共享归一化参数是无效的,这突显了个性化统计的必要性。在异构MuJoCo任务上的实验表明,我们开发的PON加速了训练,并且与基线方法相比取得了更优的性能。

英文摘要

Federated reinforcement learning (FedRL) enables multiple agents to collaboratively train a global policy without sharing raw data, making it ideal for privacy-sensitive applications. However, FedRL faces challenges in heterogeneous environments where differing state-transition dynamics lead to non-identical input distributions and imbalanced parameter updates during aggregation. Therefore, this paper develops a personalized observation normalization (PON) method, allowing each agent to locally normalize raw state inputs using a continuously updated running mean and variance. This design ensures consistent scaling of local features, so that no agent's features overshadow those of other agents during aggregation. Furthermore, we demonstrate that sharing normalization parameters across agents is ineffective due to the diverse local input distributions, which highlights the necessity of personalized statistics. Experiments on heterogeneous MuJoCo tasks show that our developed PON accelerates training and achieves superior performance compared to baseline methods.
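
The per-agent normalization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Welford's online algorithm for the running statistics, and the class name `RunningNormalizer`, the `eps` stabiliser, and the toy observations are all hypothetical.

```python
import math


class RunningNormalizer:
    """Running mean/variance for one agent's observations (Welford's algorithm)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean
        self.eps = eps         # hypothetical stabiliser, not taken from the abstract

    def update(self, obs):
        """Fold one raw observation into the running statistics."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def variance(self):
        """Population variance; falls back to 1.0 until two samples are seen."""
        if self.count < 2:
            return [1.0] * len(self.mean)
        return [m / self.count for m in self.m2]

    def normalize(self, obs):
        """Scale an observation with this agent's own statistics."""
        var = self.variance()
        return [(x - mu) / math.sqrt(v + self.eps)
                for x, mu, v in zip(obs, self.mean, var)]


# Two agents with differently scaled dynamics keep separate, local statistics.
agent_a = RunningNormalizer(dim=1)
agent_b = RunningNormalizer(dim=1)
for x in (1.0, 2.0, 3.0):
    agent_a.update([x])
for x in (100.0, 200.0, 300.0):
    agent_b.update([x])

norm_a = agent_a.normalize([2.0])          # agent A's mean under A's statistics
norm_b = agent_b.normalize([200.0])        # agent B's mean under B's statistics
mismatched = agent_a.normalize([200.0])    # B's observation under A's statistics
```

In this toy setup each agent's typical observation maps close to zero under its own statistics, while applying agent A's statistics to agent B's observation produces a value far outside the unit scale. That is one way to picture why shared normalization parameters may fit poorly when local input distributions differ.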

2605.27383 2026-05-28 cs.CL cs.AI

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

弥合稳定性与表现力之间的差距:低资源口语语言模型的合成数据扩展与偏好对齐

Yizhong Geng, Yanliang Li, Jinghan Yang, Tianhan Jiang, Boxun An, Ya Li, Xiaoyu Shen

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) University of California, USA(美国加州大学) Northwestern University, USA(美国西北大学) Eastern Institute of Technology, Ningbo, China(宁波东方理工大学)

AI总结 针对低资源口语语言模型因合成数据导致的表现力崩溃问题,提出两种自对齐框架(DGSA和TDSC)以恢复韵律多样性,实现超越商业系统的性能并首次支持老挝语零样本语音克隆。

详情
AI中文摘要

口语语言模型(SLM)通过绕过显式的字素到音素流水线,已成为语音合成的一种有前景的范式。然而,它们在低资源语言中的有效性仍然受到转录语音稀缺的根本限制。在实践中,合成数据已成为在此类场景下扩展SLM的主要策略,当真实数据不足时提供可靠的音素监督。在这项工作中,我们表明这种依赖引入了一个基本权衡,我们称之为稳定性-表现力差距:虽然合成数据提高了音素准确性,但它逐渐抑制了韵律变异性,最终导致表现力崩溃(合成侵蚀)。为了弥合这一差距,我们提出了两种自对齐框架。解耦引导的自对齐(DGSA)通过利用韵律-音色分离来恢复复杂语言的表现力。对于真实参考极其有限的场景,温度驱动的自我批评(TDSC)通过自动探索和过滤来稳定生成。我们的方法优于强大的商业系统,包括ElevenLabs和Gemini Pro,并首次实现了老挝语的零样本语音克隆能力。

英文摘要

Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by the scarcity of transcribed speech. In practice, synthetic data has become the primary strategy for scaling SLMs in such settings, providing reliable phonetic supervision when real data is insufficient. In this work, we show that this reliance introduces a fundamental trade-off, which we term the Stability-Expressivity Gap: while synthetic data improves phonetic accuracy, it progressively suppresses prosodic variability, ultimately leading to a collapse of expressivity (Synthetic Erosion). To bridge this gap, we propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by exploiting prosody-timbre separation. For regimes where authentic references are exceptionally limited, Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering. Our approach outperforms strong commercial systems, including ElevenLabs and Gemini Pro, and enables the first zero-shot voice cloning capability for Lao.

2605.27380 2026-05-28 cs.CL cs.AI

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

BioELX: 基于别名的检索与LLM排序的跨语言生物医学实体链接

Yi Wang, Corina Dima, Liangyu Zhong, Steffen Staab

发表机构 * University of Stuttgart, Germany(斯图加特大学) Technical University of Berlin, Germany(柏林工业大学)

AI总结 提出BioELX两阶段框架,通过维基数据多语言别名增强SapBERT检索器,并利用预训练LLM排序器进行上下文感知消歧,无需标注数据即在多个基准上取得最佳性能。

Comments 12 pages, 3 figures

详情
AI中文摘要

跨语言生物医学实体链接(BEL)将任何语言的提及映射到生物医学知识库(KB)中的唯一标识符,支持临床和生物医学NLP应用。然而,BEL的专家标注训练数据成本高昂,尤其是对于低资源语言。此外,许多跨语言BEL系统依赖于基于SapBERT的检索器,这些检索器主要在KB中的英语别名上训练,导致对未见过的非英语提及泛化能力差,且上下文感知消歧有限。我们提出BioELX,一个两阶段跨语言BEL框架,无需任务特定的标注训练语料。在第一阶段,我们用维基数据派生的多语言别名丰富SapBERT训练,并使用得到的检索器改进跨语言候选检索。在第二阶段,我们使用预训练LLM排序器进行上下文感知消歧,该排序器联合考虑提及上下文和候选,消除了监督训练的需要。在五个基准(XL-BEL、EMEA、Patent、WikiMed-DE和MedMentions)上的实验表明,BioELX实现了新的最先进性能。它在XL-BEL上将平均Recall@1提高了+19.2,尤其是低资源语言提升显著,例如土耳其语+21.6、韩语+22.1、泰语+30.8,并在EMEA(+6.2)、Patent(+5.4)和WikiMed-DE(+12.8)上持续改进。代码和资源将在发表后发布。

英文摘要

Cross-lingual biomedical entity linking (BEL) maps mentions in any language to unique identifiers in a biomedical knowledge base (KB), supporting clinical and biomedical NLP applications. However, expert-annotated training data for BEL are costly, especially for low-resource languages. Moreover, many cross-lingual BEL systems rely on SapBERT-based retrievers trained on predominantly English aliases in the KB, leading to poor generalization to unseen non-English mentions and limited context-aware disambiguation. We propose BioELX, a two-stage cross-lingual BEL framework that requires no task-specific annotated training corpora. In Stage 1, we enrich SapBERT training with Wikidata-derived multilingual aliases and use the resulting retriever to improve cross-lingual candidate retrieval. In Stage 2, we perform context-aware disambiguation with a pre-trained LLM ranker that jointly considers the mention context and candidate, eliminating the need for supervised training. Experiments on five benchmarks (XL-BEL, EMEA, Patent, WikiMed-DE, and MedMentions) show that BioELX achieves new state-of-the-art performance. It improves average Recall@1 on XL-BEL by +19.2, with especially large gains for low-resource languages, e.g., +21.6 on Turkish, +22.1 on Korean, +30.8 on Thai, and delivers consistent improvements on EMEA (+6.2), Patent (+5.4), and WikiMed-DE (+12.8). Code and resources will be released upon publication.
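
The gains above are reported in Recall@1, which can be computed as in the sketch below. The function name `recall_at_k`, the identifiers such as `KB:001`, and the toy rankings are hypothetical and are not taken from the paper or its benchmarks; the abstract does not specify the exact evaluation script.

```python
def recall_at_k(ranked_candidates, gold_ids, k=1):
    """Percentage of mentions whose gold KB identifier is in the top-k candidates."""
    if len(ranked_candidates) != len(gold_ids):
        raise ValueError("one candidate list is required per mention")
    hits = sum(1 for cands, gold in zip(ranked_candidates, gold_ids)
               if gold in cands[:k])
    return 100.0 * hits / len(gold_ids)


# Hypothetical ranked candidate lists for four mentions (best candidate first).
ranked = [
    ["KB:001", "KB:007"],
    ["KB:003", "KB:002"],
    ["KB:009", "KB:004"],
    ["KB:005", "KB:006"],
]
gold = ["KB:001", "KB:002", "KB:004", "KB:005"]

r_at_1 = recall_at_k(ranked, gold, k=1)  # only the top candidate counts
r_at_2 = recall_at_k(ranked, gold, k=2)  # gold may sit anywhere in the top two
```

Under this definition a "+19.2" improvement would mean 19.2 more mentions per hundred are linked correctly by the top-ranked candidate, which is why a reranking stage that only reorders retrieved candidates can raise Recall@1 without changing Recall@k for larger k.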

2605.27378 2026-05-28 cs.CL cs.CV cs.MA

OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image Analysis

OralAgent: 融合推理、工具与知识的交互式牙科影像分析

Jing Hao, Siyuan Dai, Yongxin Zhang, Yuci Liang, Jiamin Wu, Jiahao Bao, Yuxuan Fan, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Ming Hu, Liang Zhan, James Kit Hon Tsoi, Linlin Shen, Junjun He, Kuo Feng Hung

发表机构 * Faculty of Dentistry, The University of Hong Kong, Hong Kong SAR, China(香港大学牙科学院,中国香港特别行政区) Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA(匹兹堡大学电气与计算机工程系,美国宾夕法尼亚州匹兹堡) Shenzhen University, China(深圳大学,中国) Department of Craniomaxillofacial Surgery, Shanghai Ninth People’s Hospital, China(上海第九人民医院口腔颌面外科部,中国) Nanyang Technological University, Singapore(南洋理工大学,新加坡) School of Biomedical Engineering, Southern Medical University, China(南方医科大学生物医学工程学院,中国) Singapore University of Technology and Design, Singapore(新加坡科技设计大学,新加坡) University of Auckland, New Zealand(奥克兰大学,新西兰) Shanghai Artificial Intelligence Laboratory, China(上海人工智能实验室,中国)

AI总结 提出首个牙科专用AI智能体OralAgent,通过集成22种视觉分析工具和368本经典牙科教科书,实现多模态推理、工具决策与知识检索的自动化框架,在多个基准上达到最优性能。

Comments 14 pages, 7 figures, 6 tables

详情
AI中文摘要

牙科影像分析在支持口腔医疗的准确诊断和治疗规划中起着关键作用。尽管近期进展产生了针对特定任务和单一成像模态的牙科AI模型,但其孤立的设计限制了在实际临床工作流程中的实用性。在本文中,我们提出了OralAgent,这是首个牙科专用AI智能体,它在端到端自动化框架内统一了多模态推理、基于工具的决策和基于知识的检索。它集成了22种视觉分析工具和368本广泛使用的经典牙科教科书,实现了自主推理、规划、工具使用、知识检索和多步骤工作流执行。此外,我们引入了OralCorpus,这是一个大规模、高质量的双语文本资源,包含1.348亿个标记,专为牙科检索增强生成(RAG)而构建。为了评估模型的多学科牙科知识,我们构建了OralQA-ZH,这是一个中文选择题基准,包含来自11个口腔亚专业的798个项目。大量实验表明,OralAgent在MMOral-Uni、MMOral-OPG和OralQA-ZH基准上达到了最先进的性能,突显了其在真实临床环境中的有效性、可解释性和适应性。代码和模型已在https://github.com/isjinghao/OralAgent公开。

英文摘要

Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models' multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at https://github.com/isjinghao/OralAgent.

2605.27376 2026-05-28 cs.CL cs.AI

Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

解锁基于提示的文本转语音模型中的细粒度和句内说话风格控制

Jaehoon Kang, Yejin Lee, Yoonji Park, Kyuhong Shim

发表机构 * Department of Artificial Intelligence, Sungkyunkwan University, Korea(成均馆大学人工智能系) Department of Computer Science and Engineering, Sungkyunkwan University, Korea(成均馆大学计算机科学与工程系)

AI总结 针对基于提示的TTS模型缺乏细粒度控制和句内风格变化的问题,提出句间风格插值和句内风格过渡技术,通过嵌入空间方向向量插值和KV缓存交换及滑动窗口注意力掩码实现平滑风格控制。

详情
AI中文摘要

虽然基于提示的文本转语音(TTS)模型支持自然语言驱动的说话风格控制,但它们通常提供有限的细粒度控制,并在整个话语中应用单一的全局风格。这限制了需要跨话语连续风格属性插值和单个话语内时变风格过渡的实际用例。在本文中,我们提出了在现有基于提示的TTS模型中实现这两种能力的新技术。对于句间风格插值,我们计算嵌入空间中对比风格提示之间的方向向量并进行简单插值,从而实现风格特征之间的平滑过渡。对于句内风格过渡,我们首先识别出自回归TTS解码器中对早期标记的强烈注意力偏差,导致初始音频实现主导后续生成。为了减轻这种影响,我们引入了KV缓存交换和滑动窗口注意力掩码。实验表明,我们提出的句间插值在性别转换中实现了99-100%的成功率,高达36 Hz的音高变化,以及高达1.6音节/秒的速度变化。我们的句内过渡保持了0.81-0.91的说话人相似度,并获得了3.48-4.48的感知平滑度分数。

英文摘要

While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance. In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics. For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, causing the initial audio realization to dominate subsequent generation. To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking. Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change. Our intra-utterance transition maintains a speaker similarity of 0.81-0.91 and achieves perceptual smoothness scores of 3.48-4.48.
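
The two ideas in this abstract, a direction vector between contrastive style-prompt embeddings and a sliding-window attention mask, can be sketched in a few lines. This is a hedged illustration only: the function names, the toy two-dimensional embeddings, and the exact mask shape (causal, with a fixed look-back window) are assumptions, and KV-cache swapping is not modelled here.

```python
def interpolate_style(emb_a, emb_b, alpha):
    """Move from style embedding A toward style embedding B by a fraction alpha."""
    direction = [b - a for a, b in zip(emb_a, emb_b)]   # contrastive direction vector
    return [a + alpha * d for a, d in zip(emb_a, direction)]


def sliding_window_causal_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j.

    Each position sees itself and at most (window - 1) preceding positions,
    so very early tokens drop out of view for later positions.
    """
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


# Hypothetical embeddings of two contrastive prompts (e.g. "slow" vs "fast").
slow_style = [0.0, 2.0]
fast_style = [4.0, 0.0]
quarter_way = interpolate_style(slow_style, fast_style, alpha=0.25)

mask = sliding_window_causal_mask(seq_len=4, window=2)
```

Sweeping `alpha` between 0 and 1 gives a continuous path between the two styles, and the windowed mask removes the earliest tokens from the attention span of later positions, which is the kind of effect the abstract attributes to sliding-window attention masking.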

2605.27375 2026-05-28 cs.CL

LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

LCO:基于LLM的约束优化,用于现实任务中更安全的智能体LLM

Jiayong Wan, Jiawei Chen, Zhaoxia Yin, Liu Shuyuan, Hang Su

发表机构 * East China Normal University(华东师范大学) Beijing Zhongguancun Academy(北京中关村学院) Tsinghua University(清华大学)

AI总结 提出LCO框架,通过自思考模块和进化采样模块约束LLM行为,在不微调模型的情况下减少上下文奖励黑客行为,实验表明在输出优化和策略优化场景中显著提升安全性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地充当自主智能体,但它们与环境的持续交互可能导致上下文奖励黑客行为(ICRH),即LLM迭代优化其行为以最大化代理目标,无意中产生有害副作用。现有防御方法不足以应对此风险,因为ICRH并非源于对抗性输入,而是模型自身的过度优化。为缓解此问题,我们提出基于LLM的约束优化(LCO),该框架无需模型微调即可有效减少ICRH。LCO包含两个模块:自思考模块,引导LLM在执行前主动思考并整合潜在安全约束;进化采样模块,利用基于LLM的交叉和变异将模型动作约束在安全解空间内,同时保持任务性能。实验结果表明,LCO在输出优化和策略优化场景中均显著缓解了ICRH。特别是在推文参与度优化任务中,LCO在GPT-4上使毒性增长率(TGR)降低了39%;在策略优化基准上,ICRH发生率降低了15.23%,在不牺牲任务性能的情况下提升了安全性。

英文摘要

Large Language Models (LLMs) are increasingly acting as autonomous agents, but their continuous interaction with the environment can lead to in-context reward hacking (ICRH), a phenomenon where LLMs iteratively optimize their behavior to maximize proxy objectives, inadvertently producing harmful side effects. Existing defense methods are insufficient to address this risk, as ICRH arises not from adversarial inputs but from the model's own over-optimization. To mitigate this issue, we propose LLM-based Constraint Optimization (LCO), a framework that effectively reduces ICRH without model fine-tuning. LCO consists of two modules: a self-thought module, which guides the LLM to proactively deliberate and integrate potential safety constraints before execution; and an evolutionary sampling module, which employs LLM-based crossover and mutation to constrain the model's actions within a safe solution space while maintaining task performance. Experimental results demonstrate that LCO substantially alleviates ICRH in both output-refine and policy-refine scenarios. In particular, on the tweet engagement optimization task, LCO achieves a 39% reduction in the Toxicity Growth Rate (TGR) on GPT-4, while on the policy optimization benchmark, it reduces the ICRH Occurrence Rate by 15.23%, demonstrating safety improvement without sacrificing task performance.

2605.27374 2026-05-28 cs.CL

ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

ICG: 通过基于MLLM的提示和个性化偏好对齐改进封面图像生成

Zhipeng Bian, Jieming Zhu, Qijiong Liu, Wang Lin, Guohao Cai, Zhaocheng Du, Jiacheng Sun, Zhou Zhao, Zhenhua Dong

发表机构 * Huazhong University of Science and Technology(华中科技大学) Huawei Noah’s Ark Lab(华为诺亚方舟实验室) Hong Kong Polytechnic University(香港理工大学) Zhejiang University(浙江大学)

AI总结 提出ICG框架,利用多模态大语言模型和扩散模型,通过元标记提取语义特征、用户嵌入个性化对齐及多奖励学习策略,实现高质量、个性化封面图像生成。

Comments Published in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12268-12278, EMNLP 2025. Official version: https://doi.org/10.18653/v1/2025.emnlp-main.617

Journal ref Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (Main Track) EMNLP 2025 12268-12278

详情
AI中文摘要

多模态大语言模型和扩散模型的最新进展为AI生成内容开辟了新的可能性。然而,个性化封面图像生成仍未被充分探索,尽管它在提升数字平台用户参与度方面起着关键作用。我们提出ICG,一个新颖的框架,将基于MLLM的提示与个性化偏好对齐相结合,以生成高质量、上下文相关的封面。ICG通过元标记从项目标题和参考图像中提取语义特征,使用用户嵌入进行细化,并将得到的个性化上下文注入扩散模型。为了解决缺乏标注监督的问题,我们采用了一种多奖励学习策略,该策略结合了公共美学和相关性奖励以及从用户行为训练的个性化偏好模型。与依赖手工提示和不连贯模块的先前流程不同,ICG采用适配器桥接MLLM和扩散模型进行端到端训练。实验表明,ICG显著提高了图像质量、语义保真度和个性化,从而在下游任务中增强了用户吸引力和离线推荐准确性。作为桥接MLLM和扩散模型的即插即用适配器,ICG兼容常见检查点,且在优化过程中不需要真实标签。

英文摘要

Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical role in boosting user engagement on digital platforms. We propose ICG, a novel framework that integrates MLLM-based prompting with personalized preference alignment to generate high-quality, contextually relevant covers. ICG extracts semantic features from item titles and reference images via meta tokens, refines them with user embeddings, and injects the resulting personalized context into the diffusion model. To address the lack of labeled supervision, we adopt a multi-reward learning strategy that combines public aesthetic and relevance rewards with a personalized preference model trained from user behavior. Unlike prior pipelines relying on handcrafted prompts and disjointed modules, ICG employs an adapter to bridge MLLMs and diffusion models for end-to-end training. Experiments demonstrate that ICG significantly improves image quality, semantic fidelity, and personalization, leading to stronger user appeal and offline recommendation accuracy in downstream tasks. As a plug-and-play adapter bridging MLLMs and diffusion models, ICG is compatible with common checkpoints and requires no ground-truth labels during optimization.

2605.27373 2026-05-28 cs.AI cs.CL cs.CY

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

识别和理解文本中的人类价值观:一种可定制的基于LLM的架构

Eduardo de la Cruz Fernández, Marcelo Karanik, Sascha Ossowski

发表机构 * Universidad Politécnica de Madrid(马德里理工大学) CETINIA, Universidad Rey Juan Carlos(胡安卡洛斯国王大学CETINIA研究中心)

AI总结 提出一种基于大型语言模型的可定制架构,通过三个模块(规范生成、文本标注、强度评估)检测文本中人类价值观的强度,避免依赖特定价值理论或复杂提示工程,实验表明具有良好检测性能。

Comments 8 pages, 1 figure. Published in Proceedings of the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Volume 5

Journal ref Proc. ICAART 2026, Vol. 5, SciTePress, 2026, pp. 4096-4103

详情
AI中文摘要

随着智能系统变得更加自主,科学界专注于创建包含伦理和道德考量的决策机制,这与传统的效用最大化模型不同。为此,一个关键方面是评估这些决策与人类价值观的契合程度。基于此,一个有前景的研究方向是开发基于大型语言模型(LLM)的方法,从文本中识别显性或隐性的人类价值观,从而实现全程识别。本文介绍了一种基于LLM的架构,用于检测和量化文本中人类价值观的强度,避免了以往方法受限于特定价值理论或复杂提示工程的缺陷。该架构包含三个协调模块:一个从任何理论框架的基础文本中生成结构化价值规范;一个使用这些规范对文本进行标注;另一个基于修辞和语义证据分配分级支持或抵抗。这种模块化方法将概念化任务与检测人类价值观的任务分离,创建了一个可扩展且可重复的过程,由适应多种理论的价值规范驱动。该架构使用多个LLM实例化,并使用ValueEval数据集进行评估。实验表明具有良好的检测性能,证实了管道的通用性。

英文摘要

As intelligent systems become more autonomous, the scientific community focuses on creating decision-making mechanisms that include ethical and moral considerations, unlike traditional utility-maximisation models. To achieve this, a key aspect is assessing how well these decisions align with human values. To this end, a promising line of research is centred on developing approaches based on Large Language Models (LLMs) to identify human values from text, whether explicit or implicit, enabling their recognition throughout. This paper introduces an LLM-based architecture to detect and quantify the intensity of human values in text, avoiding the limitations of previous approaches tied to a specific value theory or complex prompt engineering. The architecture comprises three coordinated modules: one that generates structured value specifications from the foundational texts of any theoretical framework; one that labels texts using these specifications; and one that assigns graded support or resistance based on rhetorical and semantic evidence. This modular approach separates the task of conceptualising human values from that of detecting them, creating a scalable and reproducible process driven by value specifications adaptable to various theories. The architecture was instantiated with multiple LLMs and evaluated using the ValueEval dataset. The experiments demonstrate good detection performance, confirming the generality of the pipeline.

2605.27365 2026-05-28 cs.CV cs.AI cs.LG cs.RO

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

LocateAnything: 基于并行框解码的快速高质量视觉-语言定位

Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Princeton University(普林斯顿大学) Nanjing University(南京大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出并行框解码(PBD)方法,将边界框和点作为原子单元单步解码,结合大规模数据集LocateAnything-Data,实现高效统一的目标定位与检测,在保持高精度同时显著提升解码吞吐量。

Comments fix github link

详情
AI中文摘要

视觉语言模型(VLM)通常将视觉定位和检测表述为坐标令牌生成问题,将每个2D框序列化为多个1D令牌,这些令牌在很大程度上独立学习和解码。这种逐令牌解码与框几何的耦合结构不匹配,并且由于严格的顺序生成而造成了实际的推理瓶颈。我们引入了LocateAnything,一个基于并行框解码(PBD)的统一生成式定位和检测框架。通过将边界框和点等几何元素作为原子单元单步解码,LocateAnything保持了框内几何一致性并实现了显著的并行性。我们证明PBD提高了解码吞吐量和定位精度。我们进一步开发了一个可扩展的数据引擎,并策划了LocateAnything-Data,这是一个包含超过1.38亿个训练样本的大规模数据集,大大增加了高精度定位的数据多样性。大量评估表明,LocateAnything推进了速度-精度前沿,在多个基准测试中实现了显著更高的解码吞吐量,同时提高了高IoU定位质量。结果突显了并行框解码和大规模训练数据在实现高效精确的统一视觉定位和检测中的互补优势。

英文摘要

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
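
"High-IoU localization quality" refers to intersection-over-union between predicted and reference boxes. The sketch below computes it for axis-aligned boxes; the `(x1, y1, x2, y2)` corner convention and the example coordinates are assumptions, since the abstract does not state the box format.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


reference = (0, 0, 2, 2)
shifted = (1, 1, 3, 3)       # same size, offset by one unit on both axes
iou_shifted = box_iou(reference, shifted)
iou_exact = box_iou(reference, reference)
iou_disjoint = box_iou(reference, (5, 5, 6, 6))
```

Because the score depends on all four coordinates jointly, an error in a single coordinate lowers the IoU of the whole box. This is the sense in which box geometry is coupled, and it gives some intuition for treating a box as one unit rather than as four independently decoded tokens.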

2605.27348 2026-05-28 cs.CV cs.AI

When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

当眼睛背叛AI:社交注视一致性作为AI生成图像检测的语义线索

Jihyeon Kim, Sohee Kim, Soosan Lee, Souhwan Jung, James Matthew Rehg, Hyesong Choi

发表机构 * School of Computer Engineering(计算机工程学院) Hoseo University(湖西大学) School of Electronic Engineering(电子工程学院) Soongsil University(崇实大学) School of Computer Science(计算机科学学院) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出社交注视一致性作为高层语义线索,通过构建诊断数据集、块组合描述监督和跨架构验证,证明该线索能有效检测AI生成图像,并解释其跨生成器迁移的机制。

Comments 23 pages, 2 figures, 17 tables

详情
AI中文摘要

最近的生成模型在很大程度上缩小了低级伪影(像素指纹、频率异常、上采样痕迹)的差距,特别是在以人为中心和局部编辑的设置中,其中被操纵的区域很小且被光度真实的内容包围。我们引入了社交注视一致性,这是一个高层语义线索,定义为互动个体之间注视方向、头眼对齐和瞳孔放置的相互一致性,并表明它构成了一个先前未被充分利用的检测轴,与现有的低级范式正交。我们通过三个耦合机制实例化这一见解:(i) 一个受控的诊断数据集,具有注视一致图像的特定区域扰动,其中严格的成对分组阻止了生成器指纹记忆作为优化时间捷径,而不是依赖增强;(ii) 块组合描述监督,它在1250个宏观组合描述中保持一个单一的5块推理骨架不变,将推理一致性与表面多样性解耦;(iii) 跨架构验证表明,相同的监督在COCOAI交互子集上将视觉语言骨干(FakeVLM)的平衡准确率提高了3.7个百分点(67.8 -> 71.5),在COCOAI人物子集上提高了1.3个百分点(83.0 -> 84.3),并且在仅视觉骨干(Effort)上也有持续提升,证明了骨干无关的线索。真实类和伪造类召回率同时上升,排除了“全预测为伪造”的伪影。一个四步机制解释——成对编辑捷径阻断、难到易难度转移、CLIP先验保留以及扩散族在眼周结构中的共享频谱弱点——解释了为什么在单个修复模型(FLUX.1-Fill)上训练能够迁移到多生成器套件。我们将在论文被接收后发布代码以促进可重复性。

英文摘要

Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces - particularly in person-centric and partial-edit settings where the manipulated region is small and surrounded by photometrically authentic content. We introduce Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals, and show that it constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. We instantiate this insight through three coupled mechanisms: (i) a controlled diagnostic dataset with region-specific perturbations of gaze-consistent imagery, where strict pair-level grouping forecloses generator-fingerprint memorization as an optimization-time shortcut rather than relying on augmentation; (ii) Block-Compositional Caption Supervision, which holds a single 5-block reasoning skeleton invariant across 1,250 macro-combined captions, decoupling reasoning consistency from surface diversity; (iii) Cross-architecture validation showing the same supervision improves a vision-language backbone (FakeVLM) by +3.7 pp on the COCOAI Interaction subset (balanced accuracy 67.8 -> 71.5) and +1.3 pp on the COCOAI Person subset (83.0 -> 84.3), with consistent gains on a vision-only backbone (Effort), evidencing a backbone-agnostic cue. Real- and fake-class recalls rise simultaneously, ruling out a "predict-all-fake" artifact. A four-step mechanistic account - paired-edit shortcut blocking, hard-to-easy difficulty transfer, CLIP prior preservation, and diffusion-family shared spectral weakness in periocular structure - explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites. We will release the code upon acceptance to facilitate reproducibility.
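
The results above are reported as balanced accuracy, the mean of the per-class recalls. The sketch below shows the metric and why a rise in both real- and fake-class recall rules out a "predict-all-fake" shortcut. The label convention (0 = real, 1 = fake) and the toy labels are assumptions for illustration, and the function assumes both classes occur in the labels.

```python
def balanced_accuracy(labels, preds, classes=(0, 1)):
    """Return (balanced accuracy, per-class recalls); 0 = real, 1 = fake."""
    recalls = []
    for cls in classes:
        idx = [i for i, y in enumerate(labels) if y == cls]
        correct = sum(1 for i in idx if preds[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls), recalls


labels = [0, 0, 0, 1]                 # three real images, one fake image

# A degenerate detector that flags every image as fake.
all_fake_score, all_fake_recalls = balanced_accuracy(labels, [1, 1, 1, 1])

# A detector that gets every image right.
perfect_score, perfect_recalls = balanced_accuracy(labels, [0, 0, 0, 1])
```

The all-fake detector reaches full recall on the fake class but zero on the real class, so its balanced accuracy stays at chance level. A gain in balanced accuracy that comes with both recalls rising cannot be produced by that shortcut.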

2605.27258 2026-05-28 cs.SD cs.AI

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

PilotTTS:面向高竞争力语音合成的严谨模块化方案

Bowen Li, Shaotong Guo, Zhen Wang, Yang Xiang, Mingli Jin, Yihang Lin, Jiahui Zhao, Weibo Xiong, Dongrui Zhang, Keming Chen, Yunze Gao, Zeyang Lin, Yuze Zhou, Yue Liu

发表机构 * Amap, Alibaba Group(阿里巴巴集团高德) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出PilotTTS轻量级自回归TTS系统,通过极简架构和严格数据工程(仅用20万小时开源处理数据)实现竞争性能,支持零样本语音克隆、情感/副语言/方言合成,在Seed-TTS Eval基准上取得最低WER和最高说话人相似度。

详情
AI中文摘要

构建最先进的文本转语音(TTS)系统通常需要数百万小时的专有数据和复杂的多阶段架构,这给资源受限的研究团队带来了巨大障碍。在本报告中,我们提出了PilotTTS,一种轻量级自回归TTS系统,通过极简架构和严格的数据工程实现了竞争性能。PilotTTS仅使用20万小时的数据进行训练,这些数据完全通过开源工具处理。具体来说,我们的贡献包括:(1)一个可复现的多阶段数据处理流水线,涵盖质量评估、标签标注和过滤;(2)一个紧凑的模型架构,采用基于Q-Former的条件化,通过跨样本配对训练将说话人身份与说话风格解耦。在统一框架内,PilotTTS支持零样本语音克隆、情感合成(11类)、副语言合成(4类)和中文方言合成(14种方言)。在Seed-TTS Eval基准上,PilotTTS在test-en上实现了最低的WER 1.50%,在test-zh上实现了CER 0.87%,并在两个测试集上取得了最高的说话人相似度(0.862和0.815),优于使用更大数据集训练的系统。我们在https://github.com/AMAPVOICE/PilotTTS上发布了完整的数据流水线配方、预训练权重和代码。

英文摘要

Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open-source tools. Specifically, our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.
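
WER and CER, the intelligibility metrics quoted above, are edit distances normalised by reference length. The sketch below is a generic implementation under stated assumptions: whitespace tokenisation for WER, no text normalisation, and made-up example strings. The Seed-TTS Eval protocol (ASR model, normalisation rules) is not described in the abstract and is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion from the reference
                           cur[j - 1] + 1,           # insertion into the reference
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical transcript of synthesised speech versus the input text.
sample_wer = wer("the cat sat on the mat", "the cat sit on mat")
sample_cer = cer("abcd", "abed")
```

In the toy example one substitution and one deletion against a six-word reference give a WER of one third. CER is typically preferred for Chinese because word boundaries are not marked by whitespace, which is consistent with the abstract reporting WER on test-en and CER on test-zh.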

2605.27155 2026-05-28 cs.CV cs.AI

Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection

通过修复进行语义鲁棒性探测:面向安全关键目标检测的交互工具

Nico Steckhan, Krutarth Prajapati, Weija Shao, Silvia Vock

发表机构 * Federal Institute for Occupational Safety and Health (BAuA)(德国联邦职业安全与健康研究所) Fraunhofer Institute for Manufacturing Engineering and Automation IPA(弗劳恩霍夫制造工程与自动化研究所IPA)

AI总结 提出SemProbe工具,通过扩散模型可控修复生成语义探针,支持用户自定义掩码和因素,自动评估并记录目标检测模型的鲁棒性变化。

详情
AI中文摘要

在安全关键领域测试目标检测器需要超越像素级损坏的语义上有意义的探针。我们提出了SemProbe,一个用于语义鲁棒性探测的工具:用户上传部署图像,手动或自动创建掩码,选择操作设计域衍生因素(或自定义提示),并运行基于扩散的可控修复。系统支持批量作业、并行种子/工作流变体以及可配置的生成参数。每次输出后,自动运行模型推理并显示带有性能差异的带注释的前后对比。所有探针都作为结构化工件记录,从而能够提供与安全评估工作流一致的可追溯鲁棒性证据。我们在尺寸锯的手部检测上演示了SemProbe,针对保险导向测试标准中的因素。

英文摘要

Testing object detectors in safety-critical domains requires semantically meaningful probes beyond pixel-level corruptions. We present SemProbe, a tool for semantic robustness probing: users upload deployment images, create masks manually or automatically, select operational design domain-derived factors (or custom prompts), and run diffusion-based controlled inpainting. The system supports batch jobs, parallel seed/workflow variations, and configurable generation parameters. After each output, model inference runs automatically and displays annotated before/after comparisons with performance deltas. All probes are logged as structured artifacts, enabling traceable robustness evidence aligned with safety evaluation workflows. We demonstrate SemProbe on hand detection for dimension saws, targeting factors from insurance-oriented test criteria.

2605.26790 2026-05-28 cs.LG physics.space-ph

Pretrained Approximators for Low-Thrust Trajectory Cost and Reachability

低推力轨迹成本与可达性的预训练近似器

Zhong Zhang, Giacomo Acciarini, Dario Izzo, Hexi Baoyin, Francesco Topputo

发表机构 * Politecnico di Milano(米兰理工大学) European Space Agency(欧洲航天局) Tsinghua University(清华大学)

AI总结 提出使用机器学习代理模型精确近似低推力轨迹的燃料消耗和转移可行性,通过同伦射线策略和自相似变换实现跨任务泛化,并开源模型与数据集。

Comments Submitted to the Journal of Guidance, Navigation and Control. Zenodo entry: https://doi.org/10.5281/zenodo.18769170

详情
AI中文摘要

低推力轨迹设计严重依赖于对燃料消耗和转移可行性的重复评估,这需要昂贵的优化控制解。在这项工作中,我们表明这些量可以通过机器学习代理模型准确近似,从而在广泛场景中实现快速且可扩展的评估。通过增加数据集大小和模型容量,我们观察到低推力轨迹优化遵循缩放定律,性能随训练数据和网络参数的对数线性提升,且在探索范围内没有饱和迹象。基于这一观察,我们使用针对任务设计需求提出的同伦射线策略构建了一个大规模数据集。关键是引入自相似变换,允许在半长轴、倾角和中心天体之间泛化,避免重新训练。因此,相同的神经近似器可应用于不同的轨道环境和任务类别。所提出的模型准确预测了单圈和多圈转移的最优燃料消耗和最小转移时间。其性能和泛化能力在公开数据集、全球轨迹优化竞赛的多小行星飞越问题以及小行星交会任务设计中得到验证。模型和数据集作为开源发布,以支持航天社区。

英文摘要

Low-thrust trajectory design relies heavily on repeated evaluations of fuel consumption and transfer feasibility, which require expensive optimal control solutions. In this work, we show these quantities can be accurately approximated by machine learning surrogates, enabling fast and scalable evaluation across a wide range of scenarios. By increasing both dataset size and model capacity, we observe that low-thrust trajectory optimization follows a scaling law, with performance improving linearly with the logarithm of training data and network parameters, and no evidence of saturation within the explored regime. Guided by this observation, we construct a large-scale dataset using the proposed homotopy-ray strategy tailored to mission design requirements. A key ingredient is a self-similar transformation, which allows generalization across semi-major axes, inclinations, and central bodies without retraining. As a result, the same neural approximator can be applied to diverse orbital environments and mission classes. The proposed models accurately predict optimal fuel consumption and minimum transfer time for single- and multi-revolution transfers. Their performance and generalization are demonstrated on a public dataset, a multi-asteroid flyby problem from the Global Trajectory Optimization Competition, and an asteroid rendezvous mission design. The models and datasets are released as open source to support the space community.
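
The scaling-law statement, performance improving linearly with the logarithm of training-set size, corresponds to fitting a straight line against log10 of the sample count. The sketch below does this with ordinary least squares. The data points are synthetic values invented for illustration, not results from the paper, and the choice of an error metric that decreases with data is an assumption.

```python
import math


def fit_log_linear(sizes, scores):
    """Least-squares fit of score = intercept + slope * log10(size)."""
    xs = [math.log10(n) for n in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, illustrative numbers: a prediction error that drops by 0.1
# for every tenfold increase in training samples.
train_sizes = [1e3, 1e4, 1e5, 1e6]
errors = [0.4, 0.3, 0.2, 0.1]

slope, intercept = fit_log_linear(train_sizes, errors)

# Extrapolating one decade further; only meaningful if no saturation sets in.
predicted_at_1e7 = intercept + slope * math.log10(1e7)
```

A fit like this is only descriptive within the explored range. The abstract reports no evidence of saturation in that regime, and the extrapolation line is included here purely to show how such a fit would be used.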

2605.26730 2026-05-28 cs.CL

PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

PRISM:评估LLM同行评审员的多维基准

Ngoc Phan Phuoc Loc, Toan Huynh La Viet, Thanh Tran Khanh, Duy A Nguyen, Tuan Anh Nguyen Pham, Thanh Nguyen, Nitesh V. Chawla, Wray Buntine, Kok-Seng Wong, Khoa D. Doan, Binh T. Nguyen

Affiliations * VinUniversity; University of Illinois, Urbana-Champaign; University of Notre Dame; Monash University

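AI summary Proposes the PRISM benchmark, which evaluates LLM review quality along four dimensions: depth of analysis, novelty assessment, flaw identification and major-issue prioritization, and multi-dimensional constructiveness. It finds that LLMs can match or even surpass humans on individual dimensions, but no system is consistently balanced across all of them, suggesting that LLM reviewers are better suited as targeted supplements to human review.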

Details
AI Chinese abstract

机器学习会议投稿量的快速增长给科学同行评审系统带来了压力,并加剧了对基于LLM的自动评审系统的兴趣。然而,这些系统实际上有多好,特别是在捕捉科学漏洞方面与人类评审员相比如何,仍然知之甚少。在这项工作中,我们引入了PRISM(通过结构化多维评估的同行评审智能),这是一个评估框架,从四个维度评估评审质量:分析深度、新颖性评估、缺陷识别与主要问题排序、以及多维建设性。与大多数基于表面指标(如ROUGE和BLEU)或未约束的LLM-as-a-judge提示(将流畅性与严谨性混为一谈)的现有评估不同,PRISM将每个维度建立在论点挖掘、检索增强验证和基于共识的评分之上。我们应用PRISM对来自ICLR、ICML和NeurIPS的分层评审语料库中的五个领先自动化评审系统和人类评审员进行基准测试。结果显示,LLM在单个维度上可以匹配或超越人类评审员:可比较的分析深度、更强的新颖性验证以及高度准确的批评优先级排序。然而,没有一个系统能在所有维度上同时匹配人类基线的平衡表现。每个系统都表现出独特的专业化特征,带有典型的盲点——聚合指标完全遗漏的失败模式。这意味着LLM评审员最好被理解为人类评审的针对性补充,在特定维度上有效,但作为独立替代品不可靠。我们的演示和关键结果可在https://khanhthanhdev.github.io/prism-page/找到。

English abstract

The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in LLM-based automated peer reviewers. However, how good these systems actually are, especially compared to human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment, Flaw Identification & Major Issues Prioritization, and Multi-dimensional Constructiveness. Unlike most existing evaluations based on surface-level metrics like ROUGE and BLEU, or unconstrained LLM-as-a-judge prompting that conflates fluency with rigor, PRISM grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results reveal that LLMs can match or beat human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions at once. Each exhibits a distinct specialization profile with characteristic blind spots -- failure modes that aggregate metrics miss entirely. The implication is that LLM reviewers are best understood as targeted supplements to human review, effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results can be found at https://khanhthanhdev.github.io/prism-page/.

2605.26624 2026-05-28 cs.CV

MSCGC-KAN: Multi-scale Causal Graph Convolution and Kolmogorov-Arnold Feature Mapping for EEG Emotion Recognition

MSCGC-KAN: 用于脑电情感识别的多尺度因果图卷积与Kolmogorov-Arnold特征映射

Haoliang Gong, Qingshan She, Jiale Xu, Yunyan Gao, Xugang Xi

Affiliations * School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China; Zhejiang Provincial Key Laboratory of Brain Computer Collaborative Intelligence Technology

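AI summary Proposes MSCGC-KAN, which builds a structured task head from multi-scale causal graph convolution and Kolmogorov-Arnold feature mapping on top of a pre-trained CBraMod backbone. The head strengthens multi-scale temporal modeling, learnable inter-channel connectivity modeling, and nonlinear discriminative mapping, improving balanced accuracy over the CBraMod+Linear baseline by 5.91 and 2.03 percentage points on FACED and SEED-VII, respectively.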

Details
AI Chinese abstract

基于脑电图的情感识别是一项重要的情感计算任务,最近的脑电图基础模型为下游适应提供了有用的通用表示。然而,在微调设置下,三个局限性仍然突出:多尺度情感动态建模不足、通道间功能连接利用不充分以及简单线性分类头的表达能力有限。为了解决这些问题,本文提出了一种新的脑电情感识别方法,称为MSCGC-KAN,它引入了一个由多尺度因果图卷积和Kolmogorov-Arnold特征映射组成的结构化任务头。基于预训练的CBraMod骨干,MSCGC-KAN通过在紧凑的任务特定头中联合加强多尺度时间建模、可学习通道间连接建模和非线性判别映射来增强下游适应。这种设计保留了基础模型的表示优势,同时使分类器对情感相关的时空模式更加敏感。在公开的FACED和SEED-VII数据集上进行了大量实验。所提方法在FACED上实现了60.66%的平衡准确率、0.5525的Cohen's Kappa和60.40%的加权F1分数,在SEED-VII上分别获得了33.27%、0.2223和33.64%。与CBraMod+Linear基线相比,在两个数据集上平衡准确率分别提高了5.91和2.03个百分点。这些结果表明,在微调预训练脑电模型时,结构化任务头设计是改进脑电情感识别的有效方法。

English abstract

Electroencephalogram (EEG)-based emotion recognition is an important affective computing task, and recent EEG foundation models provide useful generic representations for downstream adaptation. However, under the fine-tuning setting, three limitations remain prominent: insufficient modeling of multi-scale emotional dynamics, inadequate exploitation of inter-channel functional connectivity, and the limited expressive power of simple linear classification heads. To address these issues, this paper proposes a new EEG emotion recognition method, termed MSCGC-KAN, which introduces a structured task head composed of multi-scale causal graph convolution and Kolmogorov-Arnold feature mapping. Built on a pre-trained CBraMod backbone, MSCGC-KAN enhances downstream adaptation by jointly strengthening multi-scale temporal modeling, learnable inter-channel connectivity modeling, and nonlinear discriminative mapping within a compact task-specific head. This design preserves the representation advantage of the foundation model while making the classifier more sensitive to emotion-related spatiotemporal patterns. Extensive experiments are conducted on the public FACED and SEED-VII datasets. The proposed method achieves a balanced accuracy of 60.66%, a Cohen's Kappa of 0.5525, and a weighted F1-score of 60.40% on FACED, and obtains 33.27%, 0.2223, and 33.64%, respectively, on SEED-VII. Compared with the CBraMod+Linear baseline, the balanced accuracy is improved by 5.91 and 2.03 percentage points on the two datasets, respectively. These results indicate that structured task-head design is an effective way to improve EEG emotion recognition when fine-tuning pre-trained EEG models.
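For readers unfamiliar with the three reported metrics, the following sketch computes balanced accuracy, Cohen's Kappa, and weighted F1 from a confusion matrix using their standard definitions. The function name `classification_metrics` and the toy labels are hypothetical; this is not the authors' evaluation code.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Return (balanced accuracy, Cohen's kappa, weighted F1)."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # rows: true class, cols: predicted
    support = cm.sum(axis=1)               # true samples per class
    predicted = cm.sum(axis=0)             # predictions per class
    tp = np.diag(cm)
    total = cm.sum()

    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)

    # Balanced accuracy: mean recall over classes that actually occur.
    balanced_acc = recall[support > 0].mean()
    # Cohen's kappa: observed agreement corrected for chance agreement.
    # (Undefined when chance agreement p_e equals 1; not handled here.)
    p_o = tp.sum() / total
    p_e = (support * predicted).sum() / total**2
    kappa = (p_o - p_e) / (1 - p_e)
    # Weighted F1: per-class F1 averaged with class support as weights.
    weighted_f1 = (f1 * support).sum() / total
    return balanced_acc, kappa, weighted_f1

# Toy labels for illustration only.
ba, kappa, wf1 = classification_metrics([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
# ba = 0.75, kappa = 0.5
```

Because balanced accuracy averages per-class recall, it is less sensitive to class imbalance than plain accuracy, which is a common reason to report it on multi-class emotion datasets.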

2605.26552 2026-05-28 cs.LG cs.AI

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

通过摊销基于样本的变分推断来对齐少步生成模型

Jaewoo Lee, Hyeongyu Kang, Dohyun Kim, Kyuil Sim, Woocheol Shin, Minsu Kim, Taeyoung Yun, Jeongjae Lee, Sanghyeok Choi, Tabitha Edith Lee, Jong Chul Ye, Jinkyoo Park

Affiliations * KAIST; MongooseAI; Mila – Quebec AI Institute; University of Edinburgh; Université de Montréal; Omelet

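AI summary Proposes the FAV framework, which uses Stein Variational Gradient Descent for sample-based variational inference and amortizes the particle updates into the generator parameters via fixed-point regression to align few-step generative models, outperforming existing methods on robotic manipulation and image generation tasks.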

Comments Under review

Details
AI Chinese abstract

对齐少步生成模型具有挑战性,因为现有的对齐框架通常依赖于限制性假设:可处理的似然、特定的ODE/SDE求解器或特定的模型族。我们引入了FAV(Few-step Generative Models Alignment via Sample-based Variational Inference),这是一个通用的对齐框架,仅需要对生成器和参考分布的样本访问。我们将对齐视为从倾斜于参考分布的奖励倾斜分布中采样。我们利用Stein变分梯度下降作为基于样本的变分推断方案,并通过固定点回归将粒子更新摊销到生成器参数中。我们在两个领域评估了FAV:机器人操作和图像生成器对齐。在机器人操作的生成策略对齐中,FAV在56个离线RL任务和30个离线到在线RL任务中优于现有的策略提取基线。对于图像生成器对齐,FAV微调了多种少步骨干模型,包括GAN、漂移模型、一致性模型和流映射,从ImageNet-$256$扩展到1024$^2$文本到图像合成。代码可在https://github.com/Jaewoopudding/FAV获取。

English abstract

Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV, Few-step Generative Models Alignment via Sample-based Variational Inference, a general alignment framework that requires only sample access to the generator and the reference distribution. We cast alignment as sampling from a reward-tilted distribution anchored to a reference distribution. We leverage Stein Variational Gradient Descent as a sample-based variational inference scheme and amortize its particle updates into the generator parameters via fixed-point regression. We evaluate FAV on two domains: robotic manipulation and image generator alignment. On generative policy alignment for robotic manipulation, FAV outperforms prevailing policy extraction baselines across 56 offline and 30 offline-to-online RL tasks. For image generator alignment, FAV fine-tunes diverse few-step backbones, including GANs, drifting models, consistency models, and flow maps, scaling from ImageNet-256 to $1024^2$ text-to-image synthesis. Code is available at https://github.com/Jaewoopudding/FAV.
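As background for the abstract above, a reward-tilted distribution is commonly written as $p^*(x) \propto p_{\mathrm{ref}}(x)\exp(r(x)/\beta)$, and Stein Variational Gradient Descent moves a set of particles toward such a target using its score. The sketch below is a generic textbook SVGD update on a toy problem, not the FAV method: it assumes a known Gaussian reference score and a quadratic reward, whereas FAV needs only sample access to the reference and additionally amortizes the updates into a generator. The names `svgd_step`, `tilted_score`, the temperature `beta`, and the bandwidth `h` are hypothetical.

```python
import numpy as np

def svgd_step(x, score_fn, step=0.05, h=1.0):
    """One SVGD update for particles x of shape (n, d), RBF kernel exp(-d^2/h)."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]            # diff[i, j] = x_i - x_j
    k = np.exp(-np.sum(diff**2, axis=-1) / h)       # k[i, j] = k(x_i, x_j)
    scores = score_fn(x)                            # grad log p at each particle
    attract = k @ scores                            # sum_j k(x_i, x_j) * score(x_j)
    # sum_j grad_{x_j} k(x_j, x_i) = (2/h) * sum_j k[i, j] * (x_i - x_j)
    repulse = (2.0 / h) * np.sum(k[:, :, None] * diff, axis=1)
    return x + step * (attract + repulse) / n

def tilted_score(x, target=(2.0, -2.0), beta=1.0):
    """Score of p*(x) for the toy case p_ref = N(0, I), r(x) = -||x - target||^2 / 2."""
    ref_score = -x                                  # grad log N(x; 0, I)
    reward_grad = -(x - np.asarray(target))         # grad r(x)
    return ref_score + reward_grad / beta

rng = np.random.default_rng(0)
particles = rng.normal(size=(50, 2))                # start from the reference
for _ in range(1000):
    particles = svgd_step(particles, tilted_score)
```

In this toy case the tilted target is a Gaussian centered at `target / 2`, so the particle mean should drift from the origin toward roughly (1, -1) while the repulsive term keeps the particles spread out.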