arXivDaily: daily arXiv academic digest, updated Monday through Friday
2606.11636 2026-06-11 cs.RO new submission

SAFER-Nav: Enhancing Safety for Visual Robot Navigation via Segmentation-Aware Fine-Tuning

Geonyeong Ko, Giung Lee, Changjoo Nam

Affiliations: Dept. of Electronic Engineering, Sogang University; Dept. of Computer Science, Rice University; Vertical Labs, Co., Ltd.

AI summary: Proposes SAFER-Nav, which uses segmentation-aware fine-tuning to build obstacle boundaries and traversable-space structure directly into the navigation policy, reducing collision frequency while maintaining goal-reaching performance.

Abstract

Vision-based navigation models, particularly foundation models, generate viable trajectories from RGB observations alone. However, even state-of-the-art transformer- and diffusion-based policies struggle to generalize in unfamiliar deployment environments containing unseen obstacles or shifted conditions. The resulting trajectories often remain goal-directed but unsafe. Existing efforts improve safety through external trajectory correction or internal geometric priors, yet the resulting policies are not trained to explicitly represent obstacle boundaries or traversable free-space structure. To address this, we propose a navigation model that incorporates these structures directly into the policy via fine-tuning and is designed to be compatible with diverse RGB-based backbones. Across multiple robot platforms, indoor environments, and static and dynamic obstacle scenarios, our method reduces collision frequency relative to ViNT, NoMaD, and their CARE-augmented variants while maintaining goal-reaching performance.

2606.11628 2026-06-11 cs.RO cs.AI new submission

LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition

Harsh Gupta, Guanya Shi, Wenzhen Yuan

Affiliations: University of Illinois Urbana-Champaign; Carnegie Mellon University

AI summary: Proposes LUCID, a two-stage framework that learns task intent from unstructured, internet-scale human videos and learns robot control in massively parallel simulation, enabling zero-shot transfer across embodiments and scenes.

Abstract

The most widely-adopted robot learning pipelines today learn skills from robot demonstrations or structured human data, which are expensive to collect and tied to specific embodiments. In contrast, unstructured human videos provide a scalable alternative. They contain diverse manipulation demonstrations across objects, scenes, and strategies, but are not directly connected to robot action. We propose LUCID, a two-stage framework that learns task intent from unstructured human videos drawn from internet-scale datasets and learns robot control in massively-parallel simulation. The intent model predicts short-horizon intent (what should happen next in the scene) from the current observation in closed loop. An embodiment-specific sensorimotor policy converts this intent into robot actions. The intent interface is shared across controllers, so the same intent model can be applied to different embodiments, from our primary dexterous hand to a parallel-jaw gripper. We evaluate LUCID on five real-world manipulation tasks: stirring, wiping, and binning supervised by only internet video, with zero-shot transfer to novel scenes and object instances; and push-T and cable routing supervised by 1 hr each of self-collected smartphone video. Project page: this https URL.

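2606.11627 2026-06-11 cs.LG cs.AI new submission

When Context Returns: Toward Robust Internalization in On-Policy Distillation

Xun Wang, Ruishuo Chen, Zhuoran Li, Yu Chen, Longbo Huang

Affiliations: IIIS, Tsinghua University

AI summary: Addresses the performance drop that occurs in on-policy distillation when context is reintroduced after being internalized. Proposes a lightweight consistency regularizer that anchors the no-context output and penalizes deviation from it, mitigating the degradation and improving robustness.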

Abstract

Recent work has shown that on-policy distillation can internalize privileged context, such as system prompts or task hints, into a student model so that the context is no longer needed at inference time. Although this approach successfully improves the student's no-context performance, we identify an interesting and previously unstudied phenomenon: in many settings, reintroducing the original privileged context to the distilled student actually degrades its performance, even on instances it already solves correctly without context. We term this context-induced degradation and argue that robust internalization demands not only matching the teacher's context-conditioned behavior, but also remaining stable when the context is reintroduced, a property we call context removability. Motivated by this observation, we propose a lightweight consistency regularizer that first anchors the student's no-context output via stop-gradient, then penalizes the context-conditioned output for deviating from it via forward KL divergence. This simple addition requires only one extra forward pass per training step, yet it effectively mitigates context-induced degradation and, in many cases, even improves no-context performance. Across 12 configurations spanning diverse domains and model families, our method improves context-conditioned accuracy in the majority of settings, reduces context-induced harm in 11 out of 12 settings, and effectively eliminates response-length inflation. A mechanistic case study further confirms that context removability is achieved at the representation level, with hidden states remaining nearly identical regardless of whether the context is present.
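
The regularizer described in this abstract can be illustrated with a minimal sketch. It assumes that "forward KL" means KL(anchor || context-conditioned), with the no-context distribution as the anchor; the function names are hypothetical, and plain Python has no autodiff, so the stop-gradient appears only as a comment.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(anchor_probs, ctx_probs, eps=1e-12):
    """KL(anchor || ctx) = sum_i p_i * log(p_i / q_i)."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(anchor_probs, ctx_probs) if p > 0.0)

def consistency_penalty(no_ctx_logits, ctx_logits):
    # The no-context distribution is the anchor. In an autodiff framework it
    # would be wrapped in a stop-gradient (assumption about placement), so
    # only the context-conditioned branch receives gradient from this term.
    anchor = softmax(no_ctx_logits)
    ctx = softmax(ctx_logits)
    return forward_kl(anchor, ctx)
```

The penalty is zero when the context leaves the output distribution unchanged and grows as the context-conditioned output drifts away from the anchor.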

2606.11626 2026-06-11 cs.CV new submission

Adapting Vision-Language Models from Iconic to Inclusive for Multi-Label Recognition Without Labels

Cheng Chen, Jingyu Zhou, Yifan Zhao, Jia Li

Affiliations: State Key Laboratory of Virtual Reality Technology and Systems, SCSE & QRI, Beihang University

AI summary: Proposes an unsupervised framework that adapts VLMs in two stages, "cutting" and "sewing", for label-free multi-label image recognition, outperforming existing unsupervised methods on four datasets.

Abstract

Understanding multi-label images remains a challenging task in computer vision. With the rapid progress of vision-language multimodal learning, vision-language models (VLMs) enable zero-shot recognition without labeled data. However, due to their intrinsic design, these models often prioritize the most iconic object and omit other contextual positives. This intrinsic bias conflicts with the nature of multi-label learning, thereby limiting their applicability. In this work, we propose an unsupervised framework that adapts VLMs from iconic recognition toward inclusive understanding, enabling label-free multi-label image recognition. Our approach consists of two key stages, "cutting" and "sewing". In the cutting stage, we present the multi-sampling response estimator to prevent the model from concentrating only on one single object. In the second sewing stage, the multi-object blend adaptation is introduced to adjust the labels to better conform to the multi-label distribution while preserving the intrinsic characteristics of the original model within only one epoch. Extensive experiments show that our framework significantly outperforms existing unsupervised approaches on four public datasets, even surpassing several representative weakly supervised baselines. These results demonstrate the potential of adapting pre-trained VLMs for more comprehensive visual understanding without manual annotations. Our code is publicly available at this https URL.

2606.11625 2026-06-11 cs.LG new submission

TimeRouter: Efficient and Adaptive Routing of Time-Series Foundation Models

Kanghui Ning, Yushan Jiang, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Dongjin Song

Affiliations: University of Connecticut; Salesforce AI Research; JP Morgan AI Research

AI summary: Proposes TimeRouter, a framework that adaptively selects among time-series foundation models through lightweight discriminative routing, selective gating, and ensemble fallback without LLM inference, achieving state-of-the-art performance on the GIFT-EVAL leaderboard.

Abstract

Time-series foundation models (TSFMs) are increasingly explored as predictive experts within emerging agentic time-series systems. However, TSFMs exhibit heterogeneous inductive biases, and no single model consistently dominates across forecasting regimes, making expert selection a critical challenge. Existing systems often delegate this decision to LLM-based controllers, incurring substantial inference overhead. We present TimeRouter, an efficient routing framework that leverages empirical complementarity across a pool of pretrained TSFMs through lightweight discriminative routing, selective gating, and ensemble fallback. Concretely, TimeRouter combines a learned routing head, a selective gate, and an ensemble fallback, enabling adaptive expert selection without invoking an LLM at inference time. TimeRouter achieves state-of-the-art performance on the GIFT-EVAL leaderboard, with an LB MASE of 0.6765. Beyond benchmark performance, our ablation studies provide empirical insights into TSFM routing design, highlighting the importance of pool composition and selective gating. Taken together, these results position TimeRouter as a modular and lightweight routing layer for future agentic time-series systems built upon foundation-model pools. Our code is available at this https URL.
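
The combination of routing head, selective gate, and ensemble fallback can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the gate is modeled as a confidence threshold on the routing probabilities, the fallback is an unweighted mean, and `route_forecast` and `gate_threshold` are hypothetical names.

```python
def route_forecast(routing_probs, expert_forecasts, gate_threshold=0.6):
    """Pick one expert's forecast, or fall back to the ensemble mean.

    routing_probs: one probability per expert (output of a routing head).
    expert_forecasts: one forecast (list of floats) per expert, equal lengths.
    Returns (forecast, chosen), where chosen is an expert index or "ensemble".
    """
    best = max(range(len(routing_probs)), key=lambda i: routing_probs[i])
    # Selective gate (assumed form): trust the router only when it is confident.
    if routing_probs[best] >= gate_threshold:
        return list(expert_forecasts[best]), best
    # Ensemble fallback (assumed form): average all experts per horizon step.
    n = len(expert_forecasts)
    horizon = len(expert_forecasts[0])
    mean = [sum(f[h] for f in expert_forecasts) / n for h in range(horizon)]
    return mean, "ensemble"
```

Routing of this kind needs only one small classifier pass per series, which is where the saving over an LLM-based controller would come from.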

2606.11619 2026-06-11 cs.CV new submission

Precision-Aware Illumination-Disentangled Vision Transformer for Spacecraft 6D Pose Estimation

Zongwu Xie, Yifan Yang, Yonglong Zhang, Guanghu Xie, Yang Liu, Shuo Zhang

Affiliations: School of Mechatronics Engineering, Harbin Institute of Technology

AI summary: Proposes PAID-ViT, which uses illumination disentanglement, reliability-aware token aggregation, and mask supervision for robust spacecraft 6D pose estimation under illumination changes and reflective interference.

Comments: 11 pages, 7 figures

Abstract

Vision sensors provide a lightweight solution for spacecraft proximity operations, but monocular spacecraft 6D pose estimation remains difficult under illumination variation, specular reflection, shadowing, weak texture, and background interference. These factors make local visual evidence spatially unreliable and can destabilize pose regression. This article proposes a Precision-Aware Illumination-Disentangled Vision Transformer (PAID-ViT) for robust spacecraft pose estimation. The proposed model separates pose-relevant structure tokens from illumination-sensitive appearance tokens, estimates patch reliability before pose aggregation, and uses foreground mask supervision to preserve silhouette cues. A parameter-free geometric recovery module converts normalized crop coordinates, log-depth, and a continuous 6D rotation representation into camera-frame rotation and translation. Experiments on SPEED+ V2, the SPEED+ validation/lightbox/sunlamp evaluation configuration used in this study, suggest that PAID-ViT reduces translation error and improves robustness in the challenging sunlamp domain, while ablation studies support the complementary roles of illumination disentanglement, reliability-aware token aggregation, mask supervision, and training-side regularization.

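2606.11616 2026-06-11 cs.LG cs.IR new submission

DeMix: Debugging Training Data with Mixed Data Error Types by Investigating Influence Vectors

Jiale Deng, Yanyan Shen, Xiaogang Shi, Chai Junjun

Affiliations: Shanghai Jiao Tong University; ByteDance Inc.; TikTok

AI summary: Proposes the DeMix framework, which uses influence vectors to capture the distinct patterns that different error types leave on model behavior, casts data debugging as multi-label classification, and introduces an intervention-based learning strategy. Across 11 tasks it substantially improves debugging F1-score and post-repair model performance.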

Abstract

High-quality training data is essential for the success of machine learning models. However, real-world datasets often contain mixed types of errors arising from systematic flaws in data preparation pipelines, including label errors, feature errors, and spurious correlations. Effective debugging of training data requires both detecting erroneous samples and identifying their specific error types to enable targeted repair, yet existing data cleaning and attribution methods fail to adequately address this dual requirement. In this paper, we propose DeMix, a novel framework that simultaneously diagnoses erroneous samples and their error types. Our key insight is that different error types produce distinct patterns on model behavior. DeMix captures such error-specific patterns by influence vectors that characterize how each training sample affects model predictions across all validation samples. We formulate training data debugging as a multi-label classification problem where a classifier is developed to predict error types directly from influence vectors. We further introduce an intervention-based learning strategy that guides the classifier to capture invariant rationales specific to each error type, ensuring the learned classifier generalizes effectively. Empirical evaluations on 11 tasks across tabular data prediction, recommendation systems, and LLM alignment demonstrate that DeMix significantly outperforms state-of-the-art approaches, achieving a 22.61% improvement in data debugging F1-score and a 9.32% gain in task model performance after data repair. Code is available at: this https URL.

2606.11615 2026-06-11 cs.CV cs.CR cs.LG new submission

Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

Omid Ahmadieh, Nima Karimian

Affiliations: University of South Florida, Bellini College of Artificial Intelligence, Cybersecurity and Computing

AI summary: Proposes the Adv-TGD framework, which uses Stable Diffusion with LoRA fine-tuning to generate photorealistic adversarial faces, achieving high-success-rate identity impersonation attacks (average ASR of 85.90%) while preserving visual quality.

Abstract

The widespread adoption of face recognition (FR) technologies raises serious privacy concerns, as facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces capable of impersonating target identities and deceiving face recognition systems. Built upon Stable Diffusion, Adv-TGD performs per-sample LoRA fine-tuning conditioned on concise textual prompts to generate natural yet adversarially manipulated identities. Unlike conventional identity-attack approaches, our method optimizes lightweight cross-attention adapters for each source-target pair within a single-step denoising process. Latent blending is constrained by a face-local heatmap mask to ensure spatially precise identity manipulation while preserving non-sensitive regions. We introduce a composite objective that integrates masked epsilon-MSE reconstruction, thresholded identity divergence in FR embedding space, directional feature alignment, and source-similarity suppression to balance adversarial attack and visual realism. Optionally, LLaVA-generated attribute prompts enhance fine-grained semantic details without reintroducing identity cues. Under the black-box evaluation protocol, Adv-TGD attains an average attack success rate (ASR) of 85.90% across IR152, IRSE50, MobileFace, and FaceNet, surpassing the semantic SOTA baseline Adv-CPG by +6.25 points, diffusion-based makeup method DiffAIM by +3 points, and noise-based P3-Mask by +16 points. Despite its strong attack efficacy, Adv-TGD preserves high visual fidelity (PSNR = 27.15 dB, SSIM = 0.981). Furthermore, we demonstrate the flexibility of our framework by successfully extending it to in-the-wild datasets (LADN), general object classification (ImageNet), and transformer-based diffusion models (FLUX.1).

2606.11611 2026-06-11 cs.SD new submission

SARA: A Dual-Stream VAE for High-Fidelity Speech Generation via Integrating Semantic and Acoustic Representations

Peijie Chen, Wenhao Guan, Weijie Wu, Kaidi Wang, Daiyu Huang, Zhuanling Zha, Junbo Li, Jun Fang, Qingyang Hong, Lin Li

Affiliations: School of Informatics, Xiamen University; School of Electronic Science and Engineering, Xiamen University; DiDi Global Inc.

AI summary: Proposes SARA, a dual-stream VAE that fuses a frozen SSL semantic anchor with a residual acoustic encoder to resolve the acoustic-semantic trade-off in speech tokenizers, achieving high-fidelity reconstruction and natural zero-shot TTS synthesis.

Comments: Accepted to Interspeech 2026

Abstract

Zero-shot text-to-speech (TTS) relies on robust speech representations. However, current speech tokenizers face a fundamental trade-off: acoustic codecs preserve high-fidelity audio but lack linguistic constraints, causing content errors during generation, whereas semantic tokens from self-supervised learning (SSL) models ensure precise text alignment but discard some acoustic information. To bridge this gap, we propose SARA, a dual-stream VAE that directly fuses a frozen SSL semantic anchor with a dedicated residual acoustic encoder. This effectively mitigates the dilemma, creating an efficient and compact latent space without relying on complex regularizers. SARA achieves superior reconstruction quality over strong baselines. Furthermore, in downstream zero-shot TTS tasks, it yields highly natural and expressive synthesis quality, and maintains robust generation performance even under accelerated inference, offering a favorable trade-off between synthesis speed and computational cost.

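2606.11609 2026-06-11 cs.CL new submission

Multi-Agent Reasoning with Adaptive Worker Allocation for Stance Detection

Meysam Sabbaghan, Arman Zareian Jahromi, Doina Caragea

Affiliations: Kansas State University

AI summary: Proposes a Manager-Worker multi-agent framework that adaptively allocates worker agents and aggregates by reasoning-level synthesis rather than label-level voting, yielding significant gains on implicit and context-dependent stance detection.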

Abstract

Stance detection requires identifying an author's position toward a target, often from short-form texts where stance is implicit, indirect, or rhetorically framed. Although large language models (LLMs) achieve strong performance on this task, single-pass prompting can be brittle when multiple interpretations are plausible. Existing aggregation strategies, such as majority voting or self-consistency, improve robustness by combining labels, but they discard the intermediate reasoning needed to resolve conflicting interpretations. We introduce a multi-agent reasoning framework with adaptive worker allocation for stance detection that shifts aggregation from label-level voting to reasoning-level synthesis. The framework employs a Manager-Worker architecture in which a Manager adaptively allocates a variable number of Worker agents based on input complexity. Each Worker analyzes the input from a distinct perspective and produces a reasoning-only explanation without emitting a stance label; the Manager then synthesizes these explanations to produce the final prediction. We evaluate the proposed framework on SemEval-2016, P-Stance, and COVID-19 Stance using Llama, Mistral, and Gemini. Results show that the framework yields the largest gains on implicit and context-dependent stance cases, achieving 86.07 Macro-F1 on COVID-19 and 82.90 on SemEval-2016, while remaining competitive on more explicit stance datasets such as P-Stance. These findings suggest that adaptive reasoning-level aggregation is most beneficial when stance cannot be reliably inferred from surface cues alone.

2606.11606 2026-06-11 cs.CV new submission

Frozen Foundation-Model Embeddings Discard Small-Lesion Signal in Chest Radiography: Implications for Pre-Deployment Evaluation

Raajitha Muthyala, Zhenan Yin, Alekhya Jilla, Frank Li, Theo Dapamede, Bardia Khosravi, Mohammadreza Chavoshi, Judy Gichoya, Saptarshi Purkayastha

Affiliations: Department of Biomedical Engineering and Informatics, Indiana University; Department of Radiology and Imaging Sciences, Emory University

AI summary: Systematically quantifies where five frozen vision-transformer foundation models retain or lose small-scale, low-contrast signal in chest radiography, finding that the global-aggregation step silently suppresses small-scale signal, which remains recoverable from patch tokens.

Abstract

Frozen vision-transformer (ViT) foundation-model embeddings increasingly serve as the substrate for downstream chest-radiography (CXR) pipelines, yet where small-scale, low-contrast signal is retained or lost in the frozen forward pass has not been systematically quantified across architectures, pretraining domains, and objectives. We probed five frozen ViTs (RAD-DINO, DINOv2-B/14, DINOv3 ViT-7B, BiomedCLIP, MedSigLIP) and a frozen DINO-pretrained ResNet-50 architectural control across three large CXR cohorts (NIH-CXR14, MIMIC-CXR, Emory-CXR; aggregate pool n=492,724) and ChestX-Det10 (n=3,543; 1,462 small-lesion bounding boxes across Calcification, Nodule, Mass). Each model was evaluated with a small-scale-perturbation panel and a region-aware bounding-box-stratified probe on real lesions, comparing three pooling modes from the same forward pass: classification token (CLS), patch-mean (mean over all final-layer patch tokens), and bounding-box-restricted patch-local. On the perturbation panel, CLS embeddings sat at the chance floor (area under the ROC curve [AUC] 0.500-0.524); patch-mean was indistinguishable from CLS on iso-blur and reticular-fine cells but rose with CLS on larger directional-blur footprints, while disease AUC on globally decided tasks ranged 0.642-0.913. Patch-local probes recovered AUC ~1.0 from the same forward pass (per-model mean improvement +0.412 to +0.488); the ResNet-50 control reproduced the chance floor. On ChestX-Det10, image-level CLS classification showed within-class small-versus-large stratum gaps up to +0.243 AUC; bounding-box-level patch-local pooling on the same forward pass recovered AUC >= 0.899 on every (model x class) cell. Frozen ViT embeddings silently suppress small-scale signal at the global-aggregation step; the signal is recoverable from patch tokens conditional on a region of interest.
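
The three pooling modes compared in this abstract can be sketched on a toy token list. The layout is an assumption (CLS token first, then patch tokens in row-major order over the patch grid), as are the name `pool_tokens` and the inclusive bounding-box convention in patch-grid units.

```python
def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def pool_tokens(tokens, grid_w, bbox):
    """Return (cls, patch_mean, patch_local) from one forward pass.

    tokens[0] is the CLS token; tokens[1:] are patch tokens, row-major.
    bbox = (row0, col0, row1, col1), inclusive, in patch-grid units.
    """
    cls = tokens[0]
    patches = tokens[1:]
    patch_mean = mean_vec(patches)            # mean over all patch tokens
    r0, c0, r1, c1 = bbox
    local = [patches[r * grid_w + c]          # patches inside the box only
             for r in range(r0, r1 + 1)
             for c in range(c0, c1 + 1)]
    return cls, patch_mean, mean_vec(local)
```

On a 2x2 grid where only one patch carries signal, the global mean divides that signal by four while the box-restricted mean keeps it intact, which is the dilution effect the abstract reports at scale.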

2606.11602 2026-06-11 cs.CV new submission

On Aligning Hierarchical Standardized Embedding for Audio-visual Generalized Zero-shot Learning

Zihan Zhang, Jie Hong, Siyuan Fan, Yanghao Zhou, Pengfei Fang

Affiliations: Southeast University; The University of Hong Kong; Beijing Institute of Technology; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education; School of Computer Science and Engineering, Southeast University

AI summary: Proposes AHSE, which uses Z-score standardization and a hierarchical alignment strategy at the semantic, class, and batch levels to address distributional and structural differences between the audio-visual and textual modalities, achieving competitive performance on three benchmark datasets.

Abstract

Audio-visual Generalized Zero-shot Learning (AV-GZSL) is a challenging task that aims to classify both seen and unseen objects or scenes by integrating data from audio and visual modalities. Recent studies primarily focus on fusing or aligning audio and visual features to generate more informative audio-visual embeddings. In addition, most existing methods rely solely on their optimization objectives to align audio-visual and textual features. However, these methods neglect the inherent distributional and structural differences between audio-visual and textual modalities. To address this limitation, we propose a method termed Aligning Hierarchical Standardized Embedding (AHSE), which enables hierarchical alignment of standardized audio-visual and textual embeddings within a shared embedding space. Specifically, we first apply Z-score standardization to the fused audio-visual and textual embeddings to reduce distributional mismatches. We then introduce a hierarchical alignment strategy that minimizes discrepancies at the semantic, class, and batch levels, thereby constructing a more robust and well-structured embedding space. This strategy not only preserves semantic and inter-class relationships but also maintains spatial consistency within each batch. Extensive experiments on three benchmark datasets (VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL) demonstrate that AHSE achieves competitive performance in zero-shot learning.
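
The standardization step can be illustrated with a small sketch. The abstract does not say along which axis the Z-score is computed, so this version assumes it is applied per embedding across feature dimensions; `zscore` and `eps` are illustrative names.

```python
import math

def zscore(vec, eps=1e-8):
    """Standardize one embedding to zero mean and (near) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    std = math.sqrt(var)
    # eps guards against division by zero for constant vectors.
    return [(x - mean) / (std + eps) for x in vec]
```

Two embeddings that differ only by scale and offset, such as `[1, 2, 3]` and `[10, 20, 30]`, map to nearly the same standardized vector, which is the sense in which the step reduces distributional mismatch between modalities before alignment.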

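2606.11601 2026-06-11 cs.CV new submission

Spatially Coupled Phase-to-Depth Calibration for Fringe Projection Profilometry

Sehoon Tak, Jae-Sang Hyun

Affiliations: Department of Mechanical Engineering, Yonsei University

AI summary: Proposes a spatially coupled phase-to-depth transformation in which all pixels share one mapping built from global phase scalars and affine spatial terms, replacing independent per-pixel calibration to improve spatial coherence and reduce surface artifacts.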

Abstract

In fringe projection profilometry (FPP), depth is commonly recovered by fitting a phase-to-depth relation independently at each camera pixel. Although such pixel-wise calibration achieves high local accuracy, neighboring pixels can acquire markedly different calibration functions even when they observe the same smooth surface, producing spatially inconsistent geometry and structured surface artifacts. We propose a spatially coupled phase-depth transformation in which all pixels share a single low-dimensional mapping (global phase scalars combined with affine spatial terms on the undistorted reference-camera grid) rather than independent per-pixel fits, optionally augmented by a bounded, spatially smooth correction field. We further introduce a native-grid pairing scheme that constructs phase-depth calibration pairs directly on the reference-camera grid: when depth supervision comes from a rectified active-stereo pipeline, planes are fitted in stereo 3D and sampled back onto the camera grid along native rays, so the phase maps are never rectified. On a dental target with high-resolution scanner ground truth, the proposed model attains point-to-surface RMSE comparable to an active-stereo reference (about 12 µm aggregate) while substantially improving spatial coherence over pixel-wise polynomial and rational calibration, and reduces the runtime mapping to a few element-wise operations per pixel with negligible parameter storage.

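2606.11599 2026-06-11 cs.CL cs.LG new submission

When is Your LLM Steerable?

Chenrui Fan, Yize Cheng, Ming Li, Soheil Feizi, Tianyi Zhou

Affiliations: University of Maryland, College Park; MBZUAI, UAE

AI summary: Proposes predicting whether activation steering will succeed from the model's internal states early in generation, and uses this predictor to guide the search over steering strength, reducing decoding cost.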

Abstract

Activation steering offers a lightweight approach to control language models' behavior at inference time, but whether it succeeds or fails heavily depends on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states at the beginning of the generation process, e.g., after generating the first few tokens, and how to leverage such a predictor to improve steering success rate. To this end, we first introduce ASTEER, a testbed including 1.4M steered generations, spanning 150 concepts with each steering success/failure labeled. Leveraging this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how steering's effects propagate along layers and token positions, which provide key information for steerability prediction. We then train a Gradient Boosting Decision Trees (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer without requiring full rollout. Our predictor achieves around 0.7 macro-F1 score on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further leverage this steerability predictor as guidance for steering strength searching, achieving near-optimal performance with a small fraction of decoding cost.

2606.11585 2026-06-11 cs.LG cs.CL nlin.AO 新提交

Kuramoto Attention: Synchronizing Self-Attention on the Torus

Kuramoto注意力:在环面上同步自注意力

Joshua Nunley

发表机构 * Department of Informatics, Luddy School of Informatics, Computing, and Engineering, Cognitive Science Program, Indiana University Bloomington(印第安纳大学伯明顿分校信息学系,卢迪信息学、计算与工程学院,认知科学项目)

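AI summary: Proposes a Kuramoto attention layer that treats each hidden coordinate as an angle and implements self-attention through gated cosine similarity and a circular-mean update, which is equivalent to the Kuramoto coupling term; on character-level language modeling it performs close to a strong baseline.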

详情
Comments
13 pages, 2 figures, 3 tables
AI中文摘要

我们引入了Kuramoto注意力,一种自注意力层,其中每个隐藏坐标是一个角度。该层通过门控余弦相似度对令牌进行评分,关注先前的相位状态,并通过注意力加权的环形均值的切线分量更新每个令牌。由于值是原始相位状态,该更新恰好是Kuramoto耦合项$\sum_u A_{t,u}\sin(\theta_u-\theta_t)$,其中注意力矩阵充当自适应、内容相关的耦合核。等价地,门控分数是环面上的学习度量,用于选择哪些令牌耦合,更新将每个令牌拉向其选择的令牌的环形均值,从而收紧它们的相位一致性。相同的两个成分,即不变相似度分数和流形上的均值,定义了任何紧致群上的此类层;环面是阿贝尔情形,两者都有闭式解。softmax权重解决了一个熵正则化的相位检索问题,旋转位置编码作为分数中与位置相关的相位漂移进入。在enwiki8字符级语言建模中,该层作为功能语言模型训练,其每字符比特数接近强匹配的RoPE+SwiGLU Transformer:在100万参数时相差0.02 BPC(1.637±0.010对比1.616±0.004),在500万参数时中位数持平(五个种子下1.448对比1.452),Transformer在均值上领先(1.468对比1.456)。这些实验表明,受约束的几何结构在此规模下是可行的语言模型;结构本身及其同步解释是贡献。消融实验隔离了承重组件,结果给出了自注意力和相位同步之间的紧凑桥梁。

英文摘要

We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent component of the attention-weighted circular mean. Because the values are the raw phase states, this update is exactly the Kuramoto coupling term $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$, with the attention matrix acting as an adaptive, content-dependent coupling kernel. Equivalently, the gated score is a learned metric on the torus that selects which tokens couple, and the update pulls each token toward the circular mean of the tokens it selects, tightening their phase agreement. The same two ingredients, an invariant similarity score and an on-manifold mean, define such a layer on any compact group; the torus is the abelian case, where both are closed-form. The softmax weights solve an entropy-regularized phase-retrieval problem, and rotary position enters as a position-dependent phase drift in the score. On enwiki8 character-level language modeling, the layer trains as a functional language model whose bits-per-character stays close to a strong matched RoPE+SwiGLU transformer: within $0.02$ BPC at one million parameters ($1.637\pm0.010$ versus $1.616\pm0.004$) and level on the median at five million ($1.448$ versus $1.452$ over five seeds) with the transformer ahead on the mean ($1.468$ versus $1.456$). These experiments establish that the constrained geometric structure is a viable language model at this scale; the structure itself, and its synchronization reading, is the contribution. Ablations isolate the load-bearing components, and the result gives a compact bridge between self-attention and phase synchronization.
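The update rule in the abstract can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' layer: one angle per token, a plain cosine score scaled by a hypothetical `sharpness` parameter in place of the learned gate, no rotary term, and a causal context that includes the token itself. It also checks numerically that the coupling term $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$ equals the tangent component of the attention-weighted mean of the unit phasors.

```python
import cmath
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coupling_term(phases, weights, theta_t):
    """Kuramoto coupling: sum_u A[t,u] * sin(theta_u - theta_t)."""
    return sum(a * math.sin(th - theta_t) for a, th in zip(weights, phases))


def circular_mean_tangent(phases, weights, theta_t):
    """Tangent component at theta_t of the weighted mean of unit phasors."""
    resultant = sum(a * cmath.exp(1j * th) for a, th in zip(weights, phases))
    return abs(resultant) * math.sin(cmath.phase(resultant) - theta_t)


def kuramoto_attention_step(theta, sharpness=1.0, step=1.0):
    """One causal update for a sequence with a single angle per token."""
    updated = []
    for t, theta_t in enumerate(theta):
        context = theta[: t + 1]  # token t attends to itself and earlier tokens
        scores = [sharpness * math.cos(th - theta_t) for th in context]
        weights = softmax(scores)
        updated.append(theta_t + step * coupling_term(context, weights, theta_t))
    return updated


theta = [0.0, 1.0]
new_theta = kuramoto_attention_step(theta)  # second phase is pulled toward the first
```

The first token has nothing earlier to couple with, so it stays put, while the second moves toward it; this is the "tightening of phase agreement" the abstract describes, in its smallest possible instance.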

2606.11583 2026-06-11 cs.LG 新提交

Beyond the Golden Teacher: Enhancing Graph Learning through LLM-GNN Co-teaching

超越黄金教师:通过LLM-GNN协同教学增强图学习

Zhuoyi Peng, Hanlin Gu, Lixin Fan, Yi Yang

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) WeBank(微众银行)

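AI summary: For few-shot learning on text-attributed graphs, proposes an LLM-GNN co-teaching framework that avoids a fixed teacher model and uses bidirectional pseudo-label exchange with round-based preference optimization, significantly improving graph learning performance.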

详情
Comments
Code: this https URL
AI中文摘要

文本属性图(TAGs)支撑着现实世界的应用,如引文网络、社交媒体和电子商务。TAGs上的少样本图学习是困难的:每类只有少量标签,其余图数据未标注,GNN和LLM都无法单独良好学习。GNN读取拓扑结构,在冷节点上失败;LLM读取文本,在文本模糊节点上失败。现有的LLM-GNN方法都遵循相同的模式:指定一个模型为黄金教师,并使用其输出(如特征或伪标签)来监督另一个模型。我们认为这种黄金教师假设在稀疏监督下会失效:没有一个模型是黄金的,将任何一个视为黄金教师会将其盲点转移到学生模型中。因此,我们提出:能否避免指定任一模型为黄金教师,仍然进行有效的图学习?我们的答案是LLM-GNN协同教学,一种双向协同教学框架,其中没有模型被固定为教师。GNN和LLM在特定架构的小损失准则下交换它们最自信的伪标签,并且每轮都更新。然后从轨迹中挖掘监督信息:每当一个节点从第t轮的跨模型矛盾变为第t+1轮的跨模型一致时,LLM在同一输入上的两个答案形成一个偏好对(旧的矛盾自我 < 新的同伴认可自我),用于DPO训练。我们称之为基于轮的伪标签偏好优化(RPL-PO)。在六个基准测试上,LLM-GNN协同教学始终优于GNN-as-Judge和所有先前方法,在Cora和ogbn-arxiv上的绝对3-shot增益分别为7.86%和7.73%;改进延续到5-shot和零样本跨数据集迁移。错误结构分析进一步表明,放弃黄金教师假设显著提高了LLM在困难样本上的图学习能力。

英文摘要

Text-attributed graphs (TAGs) underlie real-world applications such as citation networks, social media, and e-commerce. Few-shot graph learning on TAGs is hard: with only a handful of labels per class and the rest of the graph unannotated, neither GNNs nor LLMs can learn well on their own. GNNs read topology and fail on cold nodes; LLMs read text and fail on text-ambiguous nodes. Existing LLM-GNN methods all follow the same recipe: designate one model as the golden teacher and use its outputs (e.g., features or pseudo-labels) to supervise the other. We argue this golden-teacher assumption breaks under sparse supervision: neither model is golden, and treating either as such transfers its blind spots into the student. We therefore ask: can we avoid designating either model as the golden teacher, and still perform effective graph learning? We answer with LLM-GNN Co-Teaching, a bidirectional co-teaching framework in which neither model is fixed as teacher. The GNN and LLM exchange their most confident pseudo-labels under an architecture-specific small-loss criterion, and both update every round. Supervision is then mined from the trajectory: whenever a node moves from cross-model contradiction at round t to cross-model agreement at round t+1, the LLM's two answers on the same input form a preference pair (old contradicting self < new peer-endorsed self) for DPO training. We call this Round-based Pseudo-Label Preference Optimization (RPL-PO). On six benchmarks, LLM-GNN Co-Teaching consistently outperforms GNN-as-Judge and all prior methods, with absolute 3-shot gains of 7.86% on Cora and 7.73% on ogbn-arxiv; improvements carry over to 5-shot and to zero-shot cross-dataset transfer. Error-structure analysis further shows that abandoning the golden-teacher assumption substantially improves the LLM's graph learning capability on challenging samples.
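The preference-pair mining rule (contradiction at round t, agreement at round t+1) can be sketched as follows. The data layout, function name, and the decision to skip nodes whose LLM answer did not change are assumptions of this sketch, not details given in the abstract.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]  # node id -> predicted class label


def mine_preference_pairs(
    llm_rounds: List[Labels],
    gnn_rounds: List[Labels],
) -> List[Tuple[str, int, str, str]]:
    """Return (node, round_t, rejected, chosen) tuples for preference training."""
    pairs = []
    for t in range(len(llm_rounds) - 1):
        for node, old_llm in llm_rounds[t].items():
            new_llm = llm_rounds[t + 1][node]
            contradicted = old_llm != gnn_rounds[t][node]
            agrees_now = new_llm == gnn_rounds[t + 1][node]
            # A pair needs two distinct answers, so unchanged LLM answers
            # are skipped (a choice made for this sketch).
            if contradicted and agrees_now and new_llm != old_llm:
                pairs.append((node, t, old_llm, new_llm))
    return pairs


# Toy predictions over two rounds: n1 flips from contradiction to agreement.
llm_rounds = [{"n1": "ML", "n2": "DB"}, {"n1": "CV", "n2": "DB"}]
gnn_rounds = [{"n1": "CV", "n2": "DB"}, {"n1": "CV", "n2": "DB"}]
pairs = mine_preference_pairs(llm_rounds, gnn_rounds)
```

Each tuple pairs the old contradicting answer (rejected) with the new peer-endorsed answer (chosen) on the same input, which is the shape a DPO-style trainer expects.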

2606.11577 2026-06-11 cs.RO 新提交

Distortion-Resilient Robotic Imitation Learning for Autonomous Cable Routing

抗畸变机器人模仿学习用于自主电缆布线

Hao Wang, Fu-Zhao Ou, Shiqi Wang, Zhaolin Wan, Xiaopeng Fan

发表机构 * School of Artificial Intelligence, Harbin Institute of Technology(哈尔滨工业大学人工智能学院) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系) Pengcheng Laboratory(鹏城实验室) Suzhou Research Institute, Harbin Institute of Technology(哈尔滨工业大学苏州研究院)

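AI summary: Proposes a robotic imitation learning framework comprising an image quality assessment module, a confidence-based learning mechanism, and a decision-making module, which maintains high performance under image distortion; experiments validate its effectiveness.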

详情
AI中文摘要

智能控制方法的快速发展赋予了机器人强大的自主智能。电缆布线作为工业中的基础任务,为机器人灵巧性和序列决策提供了严格的基准。在这些实际场景中,图像观测畸变频繁发生。低质量图像观测的样本常常阻碍准确的模型训练,对智能控制系统的可靠性和准确性构成挑战。然而,目前尚未有针对图像信号畸变场景的专用智能控制解决方案。同时,图像质量信息未被充分利用以进一步提升智能控制方法的性能。为此,我们提出了一种新颖的机器人模仿学习框架,该框架包含图像质量评估模块、基于置信度的学习机制和决策模块,旨在即使在畸变图像观测下也能保持高性能。在所提出的框架中,图像质量评估模块与基于置信度的学习机制协同作用,以增强决策模块的有效性。具体来说,引入图像质量评估模块从图像观测中提取图像质量信息,而基于置信度的学习机制自适应地优先处理具有挑战性的样本以提高学习效果。决策模块确定适当的离散技能或连续动作。实验结果表明,我们提出的框架提升了决策模块的整体性能。

英文摘要

The rapid development of intelligent control methodologies has endowed robots with powerful autonomous intelligence. Cable routing, a ubiquitous foundational task in industry, provides a rigorous benchmark for robotic dexterity and sequential decision-making. In these practical scenarios, image observation distortion frequently occurs. Samples characterized by low-quality image observations often hinder accurate model training, posing challenges to the reliability and accuracy of intelligent control systems. Nevertheless, no dedicated intelligent control solution has been proposed for scenarios of image signal distortion. Meanwhile, image quality information has not been sufficiently exploited to further enhance the performance of intelligent control methodologies. To this end, we propose a novel robotic imitation learning framework that comprises an image quality assessment module, a confidence-based learning mechanism, and a decision-making module, which is designed to maintain high performance even under distorted image observations. In the proposed framework, the image quality assessment module synergizes with the confidence-based learning mechanism to enhance the efficacy of the decision-making module. Specifically, the image quality assessment module is incorporated to extract image quality information from image observations, while the confidence-based learning mechanism adaptively prioritizes challenging samples to improve learning effectiveness. The decision-making module determines appropriate discrete skills or continuous actions. Experimental results demonstrate that our formulated framework enhances the overall performance of the decision-making module.

2606.11576 2026-06-11 cs.CV cs.AI 新提交

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

AVIS: 视觉语言模型的自适应测试时缩放

Ahmadreza Jeddi, Minh Ngoc Le, Amirhossein Kazerouni, Hakki Can Karaimer, Hue Nguyen, Iqbal Mohomed, Michael Brudno, Alex Levinshtein, Konstantinos G. Derpanis, Babak Taati, Radek Grzeszczuk

发表机构 * AI Center-Toronto, Samsung Electronics(三星电子多伦多AI中心) University of Toronto(多伦多大学) Vector Institute(向量研究所) York University(约克大学)

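AI summary: Proposes AVIS, a lightweight policy that jointly adapts visual context scaling and reasoning scaling, using training-free key-diversity pruning and adaptive self-consistency to improve the accuracy-compute trade-off across multiple benchmarks.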

详情
Comments
Project page: this https URL
AI中文摘要

现代视觉语言模型(VLM)受益于思维链提示和测试时缩放,但这些增益通常因大视觉上下文和长解码链而带来高昂推理成本。我们将此成本通过两个耦合的轴来审视:视觉上下文缩放(VCS),控制传递给语言模型的视觉证据量;以及视觉推理缩放(VRS),控制推理时推理搜索的执行量。现有方法通常一次优化一个轴,而跨这些轴的联合计算分配尚未充分探索。我们引入自适应视觉推理缩放(AVIS),一种轻量策略,根据每个查询自适应调整VCS和VRS。AVIS通过关键多样性视觉(KDV)剪枝实现VCS,这是一种无训练的$O(N)$基于关键字的规则,用于在预填充前移除冗余视觉令牌;并通过自适应自一致性实现VRS,使用学习的难度预测器选择推理滚动的数量。AVIS易于部署,兼容共享预填充推理,其中所有滚动重用单个预填充过程和KV缓存。在多样化的图像和视频推理基准上,AVIS相对于仅VCS和仅VRS的基线改善了精度-计算权衡,并且在RL后训练的VLM上仍然有效,同时保持低计算和低延迟。

英文摘要

Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibitive inference cost due to large visual contexts and long decoding chains. We view this cost through two coupled axes: Visual Context Scaling (VCS), which controls how much visual evidence is passed to the language model, and Visual Reasoning Scaling (VRS), which controls how much inference-time reasoning search is performed. Existing methods typically optimize one axis at a time, leaving the joint allocation of compute across these axes underexplored. We introduce Adaptive Visual Inference Scaling (AVIS), a lightweight policy that adapts both VCS and VRS per query. AVIS realizes VCS through Key Diversity Visual (KDV) pruning, a training-free $O(N)$ key-based rule for removing redundant visual tokens before prefilling, and realizes VRS through adaptive self-consistency, using a learned difficulty predictor to select the number of reasoning rollouts. AVIS is deployment-friendly and compatible with shared-prefill inference, where all rollouts reuse a single prefilling pass and KV cache. Across diverse image and video reasoning benchmarks, AVIS improves the accuracy--compute trade-off relative to VCS-only and VRS-only baselines, and remains effective on top of RL post-trained VLMs while keeping compute and latency low.
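The VRS side, adaptive self-consistency, can be illustrated with a short sketch: a predicted difficulty sets the number of rollouts, and the final answer is the majority vote. The linear difficulty-to-budget rule, the budget limits, and all names here are hypothetical; the abstract only states that a learned difficulty predictor selects the number of rollouts.

```python
from collections import Counter


def rollouts_for(difficulty, min_k=1, max_k=8):
    """Map a predicted difficulty in [0, 1] to a rollout budget (linear rule)."""
    difficulty = min(1.0, max(0.0, difficulty))
    return min_k + round(difficulty * (max_k - min_k))


def adaptive_self_consistency(sample_answer, difficulty):
    """Draw a difficulty-dependent number of rollouts and majority-vote them.

    sample_answer(i) is a hypothetical callable returning the final answer
    of the i-th reasoning rollout.
    """
    k = rollouts_for(difficulty)
    answers = [sample_answer(i) for i in range(k)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, k


# Toy rollouts standing in for sampled model answers.
canned = ["A", "B", "A", "A", "C", "A", "B", "A"]
easy = adaptive_self_consistency(lambda i: canned[i], difficulty=0.0)
hard = adaptive_self_consistency(lambda i: canned[i], difficulty=1.0)
```

Easy queries spend a single rollout while hard ones spend the full budget, which is the compute reallocation the abstract motivates.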

2606.11574 2026-06-11 cs.LG cond-mat.mtrl-sci physics.chem-ph stat.ML 新提交

Range-Aware Bayesian Optimization for Discovering Diverse Designs within Target Property Windows

范围感知贝叶斯优化用于在目标属性窗口内发现多样化设计

Shengli Jiang, Jason Wu, Charles M. Schroeder, Michael A. Webb

发表机构 * Department of Chemical and Biological Engineering, Princeton University(普林斯顿大学化学与生物工程系)

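AI summary: Proposes a range-aware Bayesian optimization framework whose acquisition function directly scores the posterior probability that a candidate satisfies a target range; on benchmark tasks and practical case studies it finds more diverse valid designs than standard methods.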

详情
Comments
64 pages, 6 main text figures, 17 supporting figures, 6 supporting tables
AI中文摘要

在许多材料和产品设计问题中,理想的候选物表现出可接受范围内的属性,而非达到单一最优值。恢复满足此类规格的多个不同解也具有实际价值,因为某些候选物可能因成本、可加工性或鲁棒性等原因而更受青睐,而这些因素难以直接编码到目标函数中。在此,我们开发了一个范围感知贝叶斯优化(BO)框架,其中采集函数直接评分候选解满足目标范围的后验概率。该框架自然扩展到在共享候选空间上并行追求多个不同规格。在基准任务中,范围感知采集一致地比标准BO基线和最近的目标寻求方法恢复更大且更多样化的有效设计集。其效用进一步在两个实际动机的设计案例研究中得到证明,涉及优化聚合物合成的反应条件和发现指定光学吸收带的序列定义低聚物,并得到量子化学计算的支持。这些结果表明,范围感知BO可以为规格驱动设计提供实用且样本高效的基础,特别是当设计灵活性和解多样性是重要考虑因素时。

英文摘要

In many materials and product design problems, desirable candidates exhibit properties that fall within an acceptable range rather than achieve a single optimum. Recovering multiple, distinct solutions that satisfy such specifications is also practically valuable, as some candidates may be preferred for reasons of cost, processability, or robustness that are difficult to encode directly in an objective function. Here, we develop a range-aware Bayesian optimization (BO) framework in which the acquisition function directly scores the posterior probability that a candidate satisfies a target range. The framework naturally extends to parallel pursuit of multiple distinct specifications over a shared candidate space. Across benchmark tasks, range-aware acquisition consistently recovers larger and more diverse sets of valid designs than standard BO baselines and recent goal-seeking methods. Its utility is further demonstrated in two practically motivated design case studies involving optimizing reaction conditions for polymer synthesis and sequence-defined oligomer discovery for prescribed optical absorption bands, supported by quantum chemical calculations. These results suggest that range-aware BO can provide a practical and sample-efficient foundation for specification-driven design, particularly when design flexibility and solution diversity are important considerations.
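A range-based acquisition score of this kind can be sketched under the assumption that the surrogate posterior at each candidate is Gaussian with mean `mean` and standard deviation `std` (as with a Gaussian-process surrogate). In that case the probability of landing in the window [lower, upper] is a difference of two normal CDFs. The paper's actual acquisition function and its handling of diversity may differ; the names and numbers below are illustrative only.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_in_range(mean, std, lower, upper):
    """P(lower <= f(x) <= upper) for a Gaussian posterior N(mean, std^2)."""
    if std <= 0.0:
        return 1.0 if lower <= mean <= upper else 0.0
    return normal_cdf((upper - mean) / std) - normal_cdf((lower - mean) / std)


# Hypothetical candidates: name -> (posterior mean, posterior std).
candidates = {
    "a": (0.50, 0.05),  # confident and inside the window
    "b": (0.50, 0.50),  # centred but very uncertain
    "c": (0.90, 0.05),  # confident but outside the window
}
window = (0.4, 0.6)
scores = {name: prob_in_range(m, s, *window) for name, (m, s) in candidates.items()}
best = max(scores, key=scores.get)
```

Unlike an improvement-based criterion, this score does not reward pushing the property ever higher; it only rewards evidence that the property sits inside the window.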

2606.11573 2026-06-11 cs.CV 新提交

Understanding Cross-Sensor Feature Variations for Generalizable 3D Perception

理解跨传感器特征变化以实现可泛化的3D感知

Xin Qiu, Wenjie Liu, Fuyuan Ai, YuChen Tan, Zhiwei Xu, Chunyi Song

发表机构 * Zhejiang University(浙江大学)

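AI summary: To address the cross-dataset performance drop of radar-camera BEV perception, proposes a framework that models scene variations in the frequency domain, synthesizes diverse source-domain views, and regularizes the fused representation, improving 3D detector robustness without target-domain samples.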

详情
AI中文摘要

雷达-相机BEV感知在跨数据集评估时常常性能下降,因为驾驶场景、传感器配置和环境条件的变化会改变输入观测和内部融合表示。本文从源域变化建模的角度研究这一问题,旨在提高基于BEV的3D检测器的鲁棒性,而无需依赖目标域样本。我们引入一个框架,在频域中表征视觉场景变化,并利用这些变化合成多样的源域视图。通过比较生成的融合BEV表示,该框架进一步捕捉图像级变化如何影响多模态BEV特征。然后利用这些变化模式对检测器进行正则化,鼓励学习到的融合空间在潜在场景变化下保持稳定。所提出的方法仅在训练期间应用,推理流程保持不变。在View-of-Delft和TJ4DRadSet之间的跨数据集雷达-相机3D检测实验表明,该方法在多个BEV融合骨干网络上均有一致的改进,并且当少量目标域数据可用时,增益仍然有效。

英文摘要

Radar-camera BEV perception often suffers from degraded performance when evaluated across datasets, as changes in driving scenes, sensor configurations, and environmental conditions can alter both the input observations and the internal fused representations. This work studies this issue from the perspective of source-domain variation modeling, aiming to improve the robustness of BEV-based 3D detectors without relying on target-domain samples. We introduce a framework that characterizes visual scene variations in the frequency domain and uses them to synthesize diverse source-domain views. By comparing the resulting fused BEV representations, the framework further captures how image-level variations influence multi-modal BEV features. These variation patterns are then used to regularize the detector, encouraging the learned fusion space to remain stable under latent scene changes. The proposed method is applied only during training and leaves the inference pipeline unchanged. Experiments on cross-dataset radar-camera 3D detection between View-of-Delft and TJ4DRadSet demonstrate consistent improvements over multiple BEV fusion backbones, and the gains remain effective when a small amount of target-domain data is available.

2606.11572 2026-06-11 cs.CV 新提交

FreqKD: Frequency-Decoupled Cross-Modal Knowledge Distillation for Infrared Object Detection

FreqKD: 面向红外目标检测的频率解耦跨模态知识蒸馏

Keval Thaker, Venkatraman Narayanan, Abdalmalek Aburaddaha, Samir A. Rawashdeh

发表机构 * University of Michigan-Dearborn(密歇根大学迪尔伯恩分校)

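AI summary: To address the modality gap between RGB and infrared images, proposes FreqKD, a frequency-decoupled distillation framework that applies a strict MSE loss to low-frequency components and a relaxed log-MSE loss to high-frequency components, improving on the DINOv2 baseline by 2.4 mAP50 on the KAIST dataset.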

详情
AI中文摘要

通过知识蒸馏从大规模RGB基础模型迁移学习到红外图像,由于图像形成物理的根本差异仍然具有挑战性。我们研究了RGB-IR模态间隙的频谱结构,观察到特征差异在空间频率上并不均匀:低频分量(形状、布局)比高频分量(纹理、精细边缘)表现出更大的跨模态对齐,后者反映了模态特定特征。基于这一分析,我们提出了FreqKD,一种频率解耦蒸馏框架,对每个频带应用适应其跨模态一致性的非对称监督。该方法对低频带采用严格的均方误差(MSE)以保留共享的结构信息,对高频带采用松弛的log-MSE损失(权重为0.1)以提供边缘指导同时容忍纹理差异。对500个配对样本的频谱差异分析表明,在所有分析的Transformer层中,高频差异平均超过低频差异2.4倍。在KAIST多光谱行人检测上,FreqKD达到64.1 mAP50,比DINOv2基线提高2.4点。学到的表示可跨数据集(FLIR ADAS,+2.1 mAP50)、任务(MFNet分割,+1.85平均交并比)和架构(ResNet-50,+1.0 mAP50)迁移。代码见:this https URL

英文摘要

Transfer learning from large-scale RGB foundation models to infrared (IR) imagery through knowledge distillation (KD) remains challenging due to fundamental differences in image formation physics. We investigate the spectral structure of the RGB--IR modality gap and observe that feature divergence is not uniform across spatial frequencies: low-frequency components (shape, layout) show greater cross-modal alignment than high-frequency components (texture, fine edges), which reflect modality-specific characteristics. Based on this analysis, we propose FreqKD, a frequency-decoupled distillation framework that applies asymmetric supervision adapted to each band's cross-modal consistency. The method employs strict mean squared error (MSE) on the low-frequency band to preserve shared structural information and a relaxed log-MSE loss (weighted at 0.1) on the high-frequency band to provide edge guidance while tolerating texture differences. Spectral divergence analysis on 500 paired samples shows that high-frequency divergence exceeds low-frequency divergence by a factor of 2.4x on average across all analysed transformer layers. On KAIST multispectral pedestrian detection, FreqKD achieves 64.1 mAP50, improving 2.4 points over the DINOv2 baseline. The learned representation transfers across datasets (FLIR ADAS, +2.1 mAP50), tasks (MFNet segmentation, +1.85 mean intersection-over-union), and architectures (ResNet-50, +1.0 mAP50). Code is available at: this https URL
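The asymmetric supervision can be illustrated on a 1D toy feature vector. Several pieces here are assumptions rather than the paper's definitions: the band split uses a moving-average low-pass filter instead of a true spectral decomposition, "log-MSE" is taken to mean log(1 + MSE), and all function names are hypothetical. Only the 0.1 weight on the high-frequency term comes from the abstract.

```python
import math


def low_pass(signal, window=3):
    """Moving-average low-pass filter (stand-in for a spectral band split)."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        neighbours = signal[max(0, i - half): i + half + 1]
        out.append(sum(neighbours) / len(neighbours))
    return out


def split_bands(signal):
    low = low_pass(signal)
    high = [x - l for x, l in zip(signal, low)]  # residual = high-frequency part
    return low, high


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def freq_decoupled_loss(student, teacher, high_weight=0.1):
    s_low, s_high = split_bands(student)
    t_low, t_high = split_bands(teacher)
    strict = mse(s_low, t_low)                      # strict MSE on shared structure
    relaxed = math.log1p(mse(s_high, t_high))       # assumed form of log-MSE
    return strict + high_weight * relaxed


teacher = [0.0] * 8
shifted = [1.0] * 8          # error concentrated in the low-frequency band
textured = [1.0, -1.0] * 4   # error concentrated in the high-frequency band
loss_shifted = freq_decoupled_loss(shifted, teacher)
loss_textured = freq_decoupled_loss(textured, teacher)
```

With the same per-element error magnitude, the texture-like mismatch is penalized far less than the structural shift, which is the tolerance for modality-specific detail that the abstract argues for.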

2606.11569 2026-06-11 cs.RO cs.AI 新提交

ConsistencyPlanner: Real-time Planning with Fast-Sampling Consistency Models

ConsistencyPlanner: 基于快速采样一致性模型的实时规划

Qichao Zhang, Xing Fang, Jiaqi Fang, Zhenwen Cai, Jie Ling, Qiankun Yu, Dongbin Zhao

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所多模态人工智能系统国家重点实验室) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Guangzhou Zaofu Intelligent Technology Co., Ltd.(广州造父智能科技有限公司)

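AI summary: Proposes the Consistency Planner framework, which uses fast-sampling consistency models for efficient multimodal sampling and an attention-enhanced decoder to fuse heterogeneous features, significantly improving safety and real-time performance in the Waymax simulator.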

详情
AI中文摘要

在复杂真实驾驶场景中的闭环规划对自动驾驶系统构成了关键挑战。虽然传统的基于规则的方法是可解释的,但其预定义的启发式方法缺乏对动态交通环境的适应性。基于学习的方法已显示出巨大潜力。然而,基于学习的方法尽管有前景,但在建模多样化和多模态驾驶行为与实时规划之间难以平衡,常常导致犹豫不决或不安全的行动。为了解决这一限制,我们提出了Consistency Planner,一个具有快速采样一致性模型的实时规划框架。我们的方法基于两个关键技术贡献。高效多模态采样:我们采用快速采样一致性模型生成一组多样化的合理未来轨迹。这使得多模态行动的高效实时探索成为可能,克服了先前迭代生成方法的计算瓶颈。异构特征融合:我们引入了一个注意力增强解码器,将异构输入特征(包括场景特征和动作令牌)动态整合成一个连贯的表示,以实现稳健的规划。在Waymax模拟器中的广泛评估表明,与现有方法相比,在安全指标上具有优越性能,在具有挑战性的动态场景中尤其出色。

英文摘要

Closed-loop planning in complex, real-world driving scenarios presents a critical challenge for autonomous driving systems. While traditional rule-based methods are interpretable, their predefined heuristics lack the adaptability needed for dynamic traffic environments. Learning-based approaches have shown considerable promise, but they struggle to balance modeling diverse, multimodal driving behaviors against real-time planning, often leading to indecisive or unsafe actions. To address this limitation, we propose Consistency Planner, a real-time planning framework with fast-sampling consistency models. Our approach is built upon two key technical contributions. Efficient Multimodal Sampling: We employ fast-sampling consistency models to generate a diverse set of plausible future trajectories. This enables efficient, real-time exploration of multimodal actions, overcoming the computational bottlenecks of previous iterative generative methods. Heterogeneous Feature Fusion: We introduce an attention-enhanced decoder that dynamically integrates heterogeneous input features (including scene features and action tokens) into a cohesive representation for robust planning. Extensive evaluation in the Waymax simulator demonstrates superior performance in safety metrics compared to existing methods, with particularly strong results in challenging dynamic scenarios.

2606.11568 2026-06-11 cs.CV 新提交

4DP-QA: Scalable QA for 4D Perception in Vision Language Models

4DP-QA:面向视觉语言模型中4D感知的可扩展问答

Seokju Cho, Abhishek Badki, Hang Su, Jindong Jiang, Ziyao Zeng, Seungryong Kim, Sifei Liu, Orazio Gallo

发表机构 * NVIDIA(英伟达) Yale University(耶鲁大学) KAIST AI(韩国科学技术院人工智能学院)

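AI summary: To address the difficulty vision language models have in understanding dynamic scenes, proposes a QA generation pipeline focused on motion-related scene understanding that disentangles object and camera motion through True-Motion Tracking, producing the large-scale dataset 4DP-QA and the benchmark 4DP-QA-Bench; training existing models on it improves performance on an external benchmark.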

详情
Comments
Project page: this https URL
AI中文摘要

尽管近期取得了进展,视觉语言模型(VLM)仍然难以理解世界的动态。我们注意到,对4D场景进行推理的能力本身具有挑战性,且因两个因素而进一步复杂化。首先,VLM通过其投影到2D图像上间接观察运动。其次,现有数据集未能解耦物体和相机运动。为应对这些挑战,我们提出一个关注运动相关场景理解的问答生成流水线。我们特别关注相机与运动之间的纠缠,通过以传统方式以及一种新颖的固定参考系(称为真运动追踪)进行追踪,从而提供对运动的直观描述。通过该流水线,我们生成了一个包含40万样本的大规模训练数据集4DP-QA(4D感知问答)和一个包含2200样本的基准数据集4DP-QA-Bench。在我们的数据集上训练现有模型在外部基准上取得了性能提升,验证了我们方法的有效性。

英文摘要

Despite recent advances, Vision Language Models (VLMs) still struggle to grasp the dynamics of the world. We note that the ability to reason about a 4D scene, challenging in itself, is further complicated by two factors. First, VLMs observe motion indirectly via its projection onto 2D images. Second, existing datasets fail to disentangle object and camera motion. To address these challenges, we present a QA generation pipeline that focuses on motion-related scene understanding. We pay particular attention to the entanglement of camera and object motion by formulating tracking both in the traditional way and in a novel, fixed reference system, dubbed True-Motion Tracking, which provides an intuitive description of motion. From this pipeline, we generate a large-scale training dataset of 400K samples, 4DP-QA (4D Perception QA), and a 2.2K-sample benchmark, 4DP-QA-Bench. Training existing models on our dataset yields performance improvements on an external benchmark, validating the effectiveness of our method.

2606.11562 2026-06-11 cs.LG cs.CL 新提交

GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs

GraphInfer-Bench:评估LLM在图上的推理能力基准

Zhuoyi Peng, Jingzhou Jiang, Hanlin Gu, Lixin Fan, Yi Yang

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) WeBank(微众银行)

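AI summary: Proposes the GraphInfer-Bench benchmark, whose five tasks (description and comparison) test whether LLMs can infer, from a node together with its neighbourhood, answers that cannot be retrieved from a single node or along a path; all method families show a gap.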

详情
Comments
Code: this https URL; Dataset: this https URL
AI中文摘要

图分析支撑着许多应用,这些应用的答案无法从单个记录中查找或沿路径检索:洗钱团伙、药物重定位、用户偏好和科学主题都是从节点及其邻域推断出来的。我们引入GraphInfer-Bench,一个评估LLM是否能够执行这种图推理的基准:产生一个开放式的答案,该答案没有单个节点支持,也没有路径可检索。现有的图问答协议无法测试这种能力:算法模拟、节点分类、单节点描述、KG-QA和GraphRAG都允许从单个节点或沿路径检索答案。GraphInfer-Bench定义了五个任务,涵盖描述(区域是什么)和比较(区域如何不同),每个任务的设计使得真实答案不存在于任何单个节点中。发布版本包含42,000个样本,跨越六个真实世界图,自动生成并通过四层质量控制协议筛选。我们评估了四种方法族在相同任务上的表现:图-令牌对齐模型、零样本前沿闭源LLM、Graph2Text监督微调以及作为结构参考的普通GNN。没有方法族能够弥合差距。图-令牌对齐部分处理描述任务(关系、主题),但在比较任务上失败。前沿LLM在基于LLM的方法中在离群点检测和社区划分上领先,但在掩码节点预测上落后。Graph2Text SFT在描述方面是最强的基于LLM的方法,但在比较方面落后于前沿LLM。在每个任务上,普通GNN匹配或击败了最强的基于LLM的方法,在社区检测上差距最大。GraphInfer-Bench揭示了图推理是一个开放的能力差距,而不是任何单一架构的属性。

英文摘要

Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node together with its neighbourhood. We introduce GraphInfer-Bench, a benchmark for whether LLMs can perform this graph inference: producing an open-ended answer that no single node supports and no path retrieves. Existing graph-QA protocols cannot test this capability: algorithm simulation, node classification, single-node description, KG-QA, and GraphRAG all admit answers retrievable from one node or along a path. GraphInfer-Bench defines five tasks along Description (what a region is) and Comparison (how regions differ), each constructed so the ground truth lives in no single node. The release contains 42,000 samples across six real-world graphs, produced automatically and screened by a four-layer quality-control protocol. We evaluate four method families against the same tasks: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning, and plain GNNs as a structural reference. No method family closes the gap. Graph-token alignment partially handles description tasks (relational, theme) but collapses on comparison tasks. Frontier LLMs lead on outlier detection and community partition among LLM-based methods but lag on masked-node prediction. Graph2Text SFT is the strongest LLM-based method on the description side yet falls behind frontier LLMs on comparison. Across every task, plain GNNs match or beat the strongest LLM-based row, with the largest margin on community detection. GraphInfer-Bench surfaces graph inference as an open capability gap rather than a property of any one architecture.

2606.11559 2026-06-11 cs.AI 新提交

HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

HERO: 基于环境观察的后见增强反思的智能体自蒸馏

Haoran Liu, Yuwei Zhang, Xiyao Li, Bohan Lyu, Jingbo Shang

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Independent Researcher(独立研究员) University of California, Berkeley(加州大学伯克利分校)

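AI summary: Proposes HERO, a framework that uses environment observations as locally aligned feedback for self-distillation, addressing the performance degradation caused in multi-turn settings by misalignment between privileged feedback and the current decision context; it improves task success and reduces redundant turns on TauBench and WebShop.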

详情
AI中文摘要

强化学习通常通过轨迹的终端结果来提升多轮智能体能力,这使得难以确定每个中间轮的信用分配。最近的在线自蒸馏方法通过自教师将特权反馈转化为密集的令牌级监督,提供了一种有前景的替代方案。我们的研究动机是观察到当朴素地将此范式扩展到多轮设置时出现意外的性能下降,我们将其归因于特权反馈(如成功轨迹或终端结果)与学生当前决策上下文之间缺乏对齐。我们引入了HERO,一种后见增强的自蒸馏框架,它使用下一个环境观察作为局部对齐反馈。每次轨迹展开后,HERO反思完成的交互,将每个观察转化为紧凑的轮级诊断,捕获关于原始动作的可操作反馈,如其必要性、有效性或失败原因。在TauBench和WebShop上,HERO比仅环境反馈的自蒸馏和GRPO提高了任务成功率并减少了不必要的轮次。在训练轮次预算有限(成功轨迹稀少且GRPO提供弱奖励对比信号)的情况下,它尤其有效。

英文摘要

Reinforcement learning typically improves multi-turn agent capabilities through the terminal outcome of the trajectories, which makes it difficult to assign credit to each intermediate turn. Recent on-policy self-distillation methods offer a promising alternative by converting privileged feedback into dense token-level supervision through a self-teacher. Our study is motivated by the unexpected performance degradation observed when naively extending this paradigm to multi-turn settings, which we attribute to a lack of alignment between privileged feedback, such as successful trajectories or terminal outcomes, and the student's current decision context. We introduce HERO, a hindsight-enhanced self-distillation framework that uses next environment observations as locally aligned feedback. After each rollout, HERO reflects on the completed interaction to convert each observation into a compact turn-level diagnosis that captures actionable feedback about the original action, such as its necessity, validity, or failure cause. On TauBench and WebShop, HERO improves task success and reduces unnecessary turns over environment-feedback-only self-distillation and GRPO. It is especially effective under limited training turn budgets, where successful rollouts are rare and GRPO provides weak reward-contrast signals.

2606.11553 2026-06-11 cs.LG 新提交

APEX: A Network-Native Time-Series Foundation Model for Forecasting and Anomaly Detection for Wireless Edge Operations

APEX:面向无线边缘运维的预测与异常检测的网络原生时间序列基础模型

Swadhin Pradhan, Niloo Bahadori, Peiman Amini

发表机构 * Cisco Systems, USA(思科系统公司)

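AI summary: Proposes APEX, a network-native decoder-only transformer pre-trained on enterprise AP telemetry; on a DHCP degradation benchmark it reduces MAE by 18% relative to the strongest baseline, reaches anomaly-detection F1 = 0.93, and its edge version enables sub-second, privacy-preserving inference.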

详情
Comments
5 pages, 1 figure, 4 tables. Discusses a network-native time-series foundation model for wireless edge operations
AI中文摘要

通用时间序列基础模型对无线网络遥测数据的迁移效果较差,因为这些信号具有突发性、零膨胀性且跨协议层耦合。我们提出APEX,一个网络原生的、仅解码器的Transformer,用于预测企业AP遥测数据,并以DHCP退化作为代表性网络任务进行评估。APEX在来自约4500个生产无线网络的10通道多变量遥测数据(约10万AP时间序列,每个AP 34个指标)上预训练,并提供APEX-Large(269M参数,云端)和APEX-Edge(10.5M参数,边缘)两个版本。在192步(4天)的DHCP退化基准上,APEX-Large比最强的基础模型基线(Toto)MAE降低18%,比SARIMA降低38%,异常检测F1=0.93,而APEX-Edge能够在AP级边缘硬件上实现亚秒级、保护隐私的推理。这些结果表明,网络原生预训练是主动无线运维的实用基础。

英文摘要

Generic time-series foundation models transfer poorly to wireless network telemetry, whose signals are bursty, zero-inflated, and coupled across protocol layers. We present APEX, a network-native, decoder-only transformer for forecasting enterprise AP telemetry, and evaluate it on DHCP degradation as a representative network task. APEX is pre-trained on 10-channel multivariate telemetry from ~4,500 production wireless networks (~100K AP time series, 34 metrics per AP), and is available as APEX-Large (269M, cloud) and APEX-Edge (10.5M, edge). On a 192-step (4-day) DHCP degradation benchmark, APEX-Large reduces MAE by 18% over the strongest foundation-model baseline (Toto) and 38% over SARIMA, with anomaly-detection F1 = 0.93, while APEX-Edge enables sub-second, privacy-preserving inference on AP-class edge hardware. These results suggest network-native pre-training is a practical foundation for proactive wireless operations.
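Two of the quantities in the abstract are simple to compute, and a short sketch makes them concrete. If the 192 steps are evenly spaced over the 4-day horizon (an assumption; the abstract does not state the sampling interval), each step covers 30 minutes. The forecast values below are toy numbers chosen for illustration, not results from the paper.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_error, model_error):
    """Fraction by which the model lowers the baseline's error."""
    return (baseline_error - model_error) / baseline_error


# 4-day horizon split into 192 evenly spaced steps (assumed spacing).
minutes_per_step = 4 * 24 * 60 / 192

# Toy, zero-inflated series and two toy forecasts.
truth = [10.0, 0.0, 0.0, 40.0]
baseline_forecast = [12.0, 1.0, 1.0, 34.0]
model_forecast = [11.0, 0.0, 1.0, 37.0]
reduction = relative_reduction(
    mae(truth, baseline_forecast), mae(truth, model_forecast)
)
```

A reported "18% MAE reduction" corresponds to `relative_reduction(...)` returning 0.18 when the baseline and model errors are measured on the same benchmark.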

2606.11546 2026-06-11 cs.CV 新提交

VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection

VL-DINO: 利用CLIP视觉-语言知识进行开放词汇目标检测

Hao Zhang, Qinran Lin, Linqi Song, Yong Li

发表机构 * Chongqing University(重庆大学) City University of Hong Kong(香港城市大学)

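AI summary: Proposes VL-DINO, in which a QPSC module constructs high-quality positive samples to strengthen vision-language alignment, a VSE module distills CLIP visual knowledge, and an ORSA module aligns region features with text embeddings, reaching 36.3/38.1 AP for zero-shot detection on LVIS.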

详情
AI中文摘要

像CLIP这样的视觉-语言模型可以为开放词汇目标检测提供丰富的语义先验。然而,将文本和视觉知识联合集成到检测架构中仍然具有挑战性。在本文中,我们提出了VL-DINO,一种通过更有效地利用CLIP的视觉-语言知识来增强DINO的开放词汇检测器。具体来说,首先开发了一个查询引导的正样本构建(QPSC)模块,以构建额外的高质量正样本,使原始DINO框架能够更好地适应跨异构数据源的混合训练,同时提供更多的视觉-语言对齐信号,从而在训练过程中融入更丰富的文本知识。然后引入了一个视觉语义编码器(VSE)模块,将CLIP视觉知识蒸馏到骨干网络提取的特征中,生成用于后续编码器精炼的融合特征。基于融合特征,一个目标-区域语义对齐(ORSA)模块提取以目标为中心的区域特征,并将其与相应的文本嵌入对齐,进一步融入文本线索。在零样本设置下,VL-DINO-T和VL-DINO-L在LVIS基准上分别达到了36.3和38.1 AP,持续优于先前的高级方法。大量实验证明了所提出设计的有效性和竞争性能。

英文摘要

Vision-language models like CLIP can provide rich semantic priors for open-vocabulary object detection. However, jointly integrating both textual and visual knowledge into detection architectures remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO through more effective exploitation of CLIP's vision-language knowledge. Specifically, a Query-guided Positive Sample Construction (QPSC) module is first developed to construct additional high-quality positive samples, enabling the vanilla DINO framework to better accommodate mixed training across heterogeneous data sources while providing more vision-language alignment signals, thereby incorporating richer textual knowledge during training. A Visual Semantic Encoder (VSE) module is then introduced to distill CLIP visual knowledge into backbone-extracted features, producing fused features for subsequent encoder refinement. Based on the fused features, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior advanced approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.

2606.11543 2026-06-11 cs.AI cs.SE 新提交

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

SkillJuror:衡量智能体技能组织如何改变运行时行为

Zhiyu Chen, Zihan Guo, Bo Huang, Bingwei Lu, Jianghao Lin, Yuanjian Zhou, Weinan Zhang

发表机构 * Tongji University(同济大学) Shanghai Innovation Institute(上海创新研究院) Sun Yat-sen University(中山大学) Shanghai Jiao Tong University(上海交通大学)

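AI summary: Proposes the SkillJuror framework, which compares Progressive Disclosure against a flat baseline and finds that how Skills are organized changes how agents search for and apply procedural knowledge, raising the verifier pass rate by 4.1% across 82 tasks.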

详情
AI中文摘要

Agent技能在推理时为大语言模型(LLM)智能体提供程序性知识,但当前的基准测试很少区分技能的内容与其组织方式。我们通过渐进式披露(Progressive Disclosure)研究这种区别,其中简洁的根文件按需引导智能体访问支持资源,并将其与归一化的扁平基线进行比较。我们提出SkillJuror,一个通过语义控制变体、匹配的多试验评估和轨迹证据来评估技能编写范式的框架,同时保持任务知识固定。在82个任务的SkillsBench研究中,渐进式披露在总体结果之前改变了运行时行为:每个轨迹触及的不同技能资源从1.18增加到3.85,有效采纳事件从1.33增加到3.92。在410个匹配试验中,它还产生了17个额外的验证通过试验(比归一化扁平基线提高4.1%)。收益取决于任务。当支持资源指导实现、检查或修复时,渐进式披露有帮助,但当成功取决于精确的输出约定、数值阈值或长工件生成流水线时,效果较弱。这些结果表明,技能组织不仅仅是呈现方式:它可以改变智能体搜索和应用程序知识的方式,而结果收益取决于暴露的资源是否对任务可操作。代码见:https://this URL。

英文摘要

Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive Disclosure, where a concise root file points agents to supporting resources on demand, and compare it with a normalized flat baseline. We present SkillJuror, a framework for evaluating Skill writing paradigms through semantically controlled variants, matched multi-trial evaluations, and trajectory evidence while holding task knowledge fixed. In an 82-task SkillsBench study, Progressive Disclosure changes runtime behavior before aggregate outcomes: distinct Skill resources touched per trajectory rise from 1.18 to 3.85, and effective uptake events rise from 1.33 to 3.92. It also yields 17 additional verifier-passing trials out of 410 matched trials (+4.1%) over the normalized flat baseline. The benefit is task-dependent. Progressive Disclosure helps when supporting resources guide implementation, checking, or repair, but is weaker when success hinges on exact output conventions, numerical thresholds, or long artifact-generation pipelines. These results show that Skill organization is not mere presentation: it can change how agents search and apply procedural knowledge, while outcome gains depend on whether the exposed resources are actionable for the task. Code is available at this https URL.
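The headline numbers in the abstract can be cross-checked with a few lines of arithmetic. The sketch reads "+4.1%" as 17 extra passing trials divided by the 410 matched trials, which is an interpretation that happens to reproduce the reported figure; the ratio variables are added here for illustration and are not quantities the abstract states.

```python
extra_passes = 17
matched_trials = 410

# Gain expressed in percentage points of the matched trials.
gain_points = 100.0 * extra_passes / matched_trials

# Relative growth of the two behavioural metrics (flat -> Progressive Disclosure).
resources_ratio = 3.85 / 1.18  # distinct Skill resources touched per trajectory
uptake_ratio = 3.92 / 1.33     # effective uptake events

print(f"pass-rate gain: {gain_points:.1f} points")       # 4.1
print(f"resources touched: x{resources_ratio:.2f}")      # x3.26
print(f"uptake events: x{uptake_ratio:.2f}")             # x2.95
```

The behavioural metrics roughly triple while the pass rate moves by about four points, which matches the abstract's claim that runtime behavior changes before aggregate outcomes do.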

2606.11542 2026-06-11 cs.CL cs.AI New submission

Pretrained self-supervised speech models can recognize unseen consonants

预训练自监督语音模型能够识别未见过的辅音

Chihiro Taguchi, Éric Le Ferrand, Hirosi Nakagawa, Hitomi Ono, Kanji Kato, Emily Prud'hommeaux, David Chiang

Affiliations * University of Notre Dame; University at Buffalo; Tokyo University of Foreign Studies; Reitaku University; Boston College

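AI summary: Studies how well pretrained self-supervised speech models (Wav2Vec2 and HuBERT), fine-tuned on two click-rich Khoisan languages (G|ui and West !Xoon), recognize click consonants. The fine-tuned models recognize clicks more accurately than non-clicks, suggesting that self-supervision generalizes to rare phonemes.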

Details
Comments
6 pages, 3 figures, 3 tables, accepted at Interspeech 2026
AI Chinese abstract

现代预训练自监督自动语音识别模型在大规模音频数据上训练,将语音编码为上下文表示。然而,它们的训练数据严重偏向高资源语言,低资源语言数据很少,这引发了对类型学上不常见的语音(如主要出现在Khoisan语言中的搭嘴音,即click consonants)可能代表性不足的担忧。这引出了我们的核心研究问题:这些模型能否像识别其他语音一样准确地识别搭嘴音?为了回答这个问题,我们在两种富含搭嘴音的Khoisan语言(G|ui和West !Xoon)的数据上微调并比较了预训练自监督语音模型(Wav2Vec2和HuBERT)。我们的结果显示,微调后的模型对搭嘴音的识别始终比对非搭嘴音更准确,表明自监督学习能够泛化到包括稀有音素在内的各种人类语音。

English abstract

Modern pretrained self-supervised automatic speech recognition models are trained on large-scale audio data to encode speech into contextualized representations. However, their training data are heavily skewed toward high-resource languages with little data from low-resource languages, raising concerns about the potential underrepresentation of typologically uncommon speech sounds such as click consonants primarily found in Khoisan languages. This leads to our central research question: Can these models recognize click consonants as accurately as other speech sounds? To address this question, we fine-tune and compare pretrained self-supervised speech models (Wav2Vec2 and HuBERT) on data from two click-rich Khoisan languages (G|ui and West !Xoon). Our results reveal that the fine-tuned models consistently recognize clicks more accurately than non-clicks, suggesting that self-supervision enables generalization across human speech sounds including rare phonemes.
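The click versus non-click comparison can be illustrated with a per-class accuracy computation. This is a minimal sketch, not the paper's evaluation code: it assumes that reference and predicted phones have already been aligned one-to-one, it treats any phone containing one of the five basic IPA click letters as a click, and the function names and toy phone pairs are made up for illustration.

```python
from typing import Dict, Iterable, Tuple

# The five basic IPA click letters: bilabial, dental, (post)alveolar, palatal, lateral.
CLICK_LETTERS = frozenset("\u0298\u01c0\u01c3\u01c2\u01c1")


def is_click(phone: str) -> bool:
    """True if the phone contains an IPA click letter (e.g. a click with an accompaniment)."""
    return any(ch in CLICK_LETTERS for ch in phone)


def accuracy_by_class(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Accuracy over aligned (reference, predicted) phone pairs, split by reference class."""
    correct = {"click": 0, "non-click": 0}
    total = {"click": 0, "non-click": 0}
    for reference, predicted in pairs:
        key = "click" if is_click(reference) else "non-click"
        total[key] += 1
        correct[key] += int(reference == predicted)
    # Classes with no reference phones are left out rather than reported as 0.
    return {key: correct[key] / total[key] for key in total if total[key] > 0}


if __name__ == "__main__":
    # Toy aligned pairs, invented for this example only.
    toy_pairs = [
        ("\u01c0", "\u01c0"), ("\u01c3", "\u01c3"), ("\u01c2", "\u01c1"),
        ("k", "k"), ("a", "e"), ("m", "n"),
    ]
    print(accuracy_by_class(toy_pairs))
```

Splitting by the reference phone rather than the predicted one keeps the two classes fixed across models, so the accuracies of different fine-tuned models stay comparable.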

2606.11537 2026-06-11 cs.AI cs.CE New submission

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

MoCA-Agent: 一种用于金融和数值推理的声明市场代码智能体

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

Affiliations * University of Innsbruck; University of British Columbia; Toronto Metropolitan University

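AI summary: Proposes MoCA-Agent, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification and synthesizes an executable Python program to reduce numerical reasoning errors in financial and tabular question answering. With a fixed Qwen3.6-27B backbone it reports strong results on ten public benchmarks, including 78.3% on FinQA.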

Details
AI Chinese abstract

金融和表格问答不仅需要流畅的推理:答案必须基于支持它们的确切事实、公式、单位、符号和尺度。单个误读的单元格或错误操作可能会悄无声息地产生看似合理但错误的结果。我们引入了 MoCA-Agent,一种声明市场代码智能体,它用声明级验证取代了自由形式的多智能体辩论。该系统将每个问题分解为类型化的原子声明,要求专业交易智能体买入或卖出这些声明,将其订单清算为置信度加权的接受/拒绝决策,并从市场支持的证据中合成可执行的Python程序。然后,一个代码感知验证器检查程序的执行、结构一致性和常见的金融推理错误,最多进行一次市场感知修复轮次。在涵盖金融数值推理、通用表格推理、ESG问答和多模态图表推理的十个公开基准上,MoCA-Agent 使用固定的 Qwen3.6-27B 骨干网络实现了强劲性能,包括在 FinQA 上达到 78.3%,在 FinanceMath 上达到 76.0%,在 MultiHiertt 上达到 71.2%,在 ESGenius 上达到 86.9%,以及在 FinChart-Bench 上平均达到 85.6%。这些结果表明,在原子声明级别聚合证据,而不是整个答案,提高了高风险数值推理的鲁棒性。代码和数据可在以下网址获取:this https URL。

English abstract

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce MoCA-Agent, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, MoCA-Agent achieves strong performance using a fixed Qwen3.6-27B backbone, including 78.3% on FinQA, 76.0% on FinanceMath, 71.2% on MultiHiertt, 86.9% on ESGenius, and 85.6% average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning. The code and data are available at this https URL.
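The step that clears trader orders into confidence-weighted accept/reject decisions can be sketched as follows. The abstract does not give the clearing rule, so everything here is an assumption for illustration: the `Order` type, the "buy"/"sell" encoding, confidences in [0, 1], and the rule that a claim is accepted when its confidence-weighted net support exceeds a threshold. It is not MoCA-Agent's actual mechanism.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Order:
    trader: str        # hypothetical trader-agent name
    side: str          # "buy" supports the claim, "sell" disputes it
    confidence: float  # assumed to lie in [0, 1]


def net_support(orders: Iterable[Order]) -> float:
    """Confidence-weighted net support in [-1, 1]; 0.0 when no weight was placed."""
    signed_weight = 0.0
    total_weight = 0.0
    for order in orders:
        if order.side not in ("buy", "sell"):
            raise ValueError(f"unknown side: {order.side!r}")
        if not 0.0 <= order.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {order.confidence}")
        sign = 1.0 if order.side == "buy" else -1.0
        signed_weight += sign * order.confidence
        total_weight += order.confidence
    return signed_weight / total_weight if total_weight > 0 else 0.0


def clear_claim(orders: Iterable[Order], threshold: float = 0.0) -> bool:
    """Accept the claim only if net support is strictly above the (assumed) threshold."""
    return net_support(orders) > threshold


if __name__ == "__main__":
    orders = [
        Order("unit_checker", "buy", 0.9),
        Order("formula_checker", "buy", 0.6),
        Order("table_reader", "sell", 0.5),
    ]
    print(net_support(orders), clear_claim(orders))
```

Deciding each atomic claim separately is what lets a single disputed cell or unit be rejected without discarding the rest of the evidence, which is the contrast with whole-answer aggregation that the abstract draws.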