arXivDaily: daily arXiv digest, updated Monday to Friday
2606.08152 2026-06-09 cs.RO New submission

Vision-Guided Dual-Arm Humanoid Robotic Disassembly of End-of-Life 18650 Lithium-ion Battery Packs

Yile Chen, Zhihao Liu, Xi Vincent Wang, Lihui Wang

Affiliations: KTH Royal Institute of Technology

AI summary: Proposes a vision-guided dual-arm disassembly pipeline that uses general-purpose parallel-jaw grippers, RGB-D sensing, and a pre-trained grasp detector to disassemble a 21-cell 18650 battery pack from an arbitrary initial pose without fixtures, achieving an 80% (8/10) end-to-end success rate.

Abstract

The growing volume of retired lithium-ion battery packs from electric vehicles and portable electronics calls for automated disassembly that is safe, flexible, and selective down to the individual cell. Existing robotic systems, however, mostly assume known pack poses, external fixtures, or specialised tooling, leaving fixture-free cell-level disassembly under pose uncertainty largely unsolved. This paper presents a vision-guided dual-arm pipeline that disassembles a 21-cell 18650 pack from an arbitrary initial pose using only general-purpose parallel-jaw grippers, RGB-D sensing, and a pre-trained grasp detector. Pose uncertainty is absorbed by a learn-and-filter perception stack with discrete look-and-move wrist-camera corrections, while a mid-task support transfer between the two arms extends the effective workspace without any external clamp. The pipeline achieves an 8/10 end-to-end success rate, a cell-localisation root-mean-square error of 2.4 mm, and a mean cycle time of 6.0 minutes per pack, providing a practical, fixture-free building block for industrial battery recycling.

2606.08151 2026-06-09 cs.AI New submission

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

Xinyu Guan, Qianyang Zhao, Yuming Deng

Affiliations: Alibaba Group

AI summary: Proposes CICL, a decision-aware context layer that builds a context graph, scores the utility of each unit, and packs evidence into memory cards, improving how tool-using LLM agents select and compress evidence at action time; it raises the retrieval hit rate on SWE-bench Verified.

Comments
15 pages, 2 figures, 9 tables. Code and artifacts are available at https://github.com/stephen-guan-researcher/CICL; Qwen-QLoRA adapter is available at https://huggingface.co/XinyuGuan/CICL

Abstract

Tool-using LLM agents often fail not because relevant text is absent, but because decisive evidence is not selected, compressed, or surfaced at action time. We present CICL, a decision-aware context layer that turns instance evidence into a context graph, routes deterministic, Opus-assisted, Qwen, Codex/GPT-5.5, and Qwen-QLoRA judgments through a shared eight-field schema, scores units by action shift, outcome uplift, necessity, and negative-transfer risk, and packs high-utility evidence as typed memory cards for a budgeted agent. The design separates the measured decision signal from the judge model, so frontier annotation, local surrogates, and lightweight rankers can be compared under one auditable protocol. Empirically, CICL yields a concrete open-benchmark gain while exposing its limits. On 50 SWE-bench Verified file-retrieval instances, direct Qwen3.6-plus reranking of BM25 top-50 candidates raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show action-criticality: at budget 120, CICL reaches F1 0.620 on v1 and 0.425 on v3, and removing the top-utility semantic v3 unit collapses F1 to 0.000. Supplementary checks add Qwen-QLoRA agreement over 710 candidates, a small 200-label real-code Opus-assisted signal, and a three-instance patch smoke validating retrieval-to-patch plumbing without claiming official SWE-bench success. RepoBench-R summaries still beat cards, and compact rankers do not yet replace the heuristic. CICL contributes a reproducible measurement and selection layer for decision-critical context, not an end-to-end coding-agent repair claim.
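
The retrieval figures quoted above (hit@1 and MRR@10) follow standard definitions that fit in a few lines. The snippet below is an illustrative sketch, not the CICL code: the function names, file names, and toy rankings are made up for the example, and only the metric definitions themselves are standard.

```python
def hit_at_k(ranked_lists, gold, k=1):
    """Fraction of queries whose gold item appears among the top-k candidates."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)


def mrr_at_k(ranked_lists, gold, k=10):
    """Mean reciprocal rank, counting only gold items found within the top k."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        top = ranked[:k]
        if g in top:
            total += 1.0 / (top.index(g) + 1)  # ranks are 1-based
    return total / len(gold)


# Hypothetical reranker output for three queries (file names are placeholders).
rankings = [
    ["a.py", "b.py", "c.py"],  # gold file ranked first
    ["x.py", "y.py", "z.py"],  # gold file ranked second
    ["m.py", "n.py", "o.py"],  # gold file missing from the candidates
]
gold_files = ["a.py", "y.py", "q.py"]

hit1 = hit_at_k(rankings, gold_files, k=1)
mrr10 = mrr_at_k(rankings, gold_files, k=10)
```

On this toy input one of three gold files is ranked first and one is ranked second, so hit@1 is 1/3 and MRR@10 is (1 + 1/2 + 0) / 3 = 0.5.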

2606.08150 2026-06-09 cs.CV New submission

Property-Informed Diffusion-Based Text-to-Microstructure Generation

Bingxuan Dai, Hongsong Wang, Jie Gui

Affiliations: School of Cyber Science and Engineering, Southeast University; School of Computer Science and Engineering, Southeast University; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education; Purple Mountain Laboratories; Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education

AI summary: Proposes a property-informed diffusion network that generates 3D microstructures directly from text descriptions, using contrastive text-structure alignment and test-time reward-guided alignment to keep generated structures semantically consistent and physically feasible.

Comments
Published in CVPR 2026. Code: https://github.com/hongsong-wang/PropDiff-TMG

Abstract

Designing 3D metamaterial microstructures that meet the intended functions remains a major challenge, as it typically requires domain expertise, iterative simulations, and extensive manual tuning. Existing work on inverse design that automatically generates microstructures based on desired target properties often suffers from limited design diversity and faces challenges in ensuring the physical feasibility of the generated structures. To address this issue, a property-informed diffusion-based network is proposed that enables the generation of 3D microstructures directly from textual descriptions. Unlike traditional property conditioning methods, our approach leverages rich guidance in terms of semantics and physical properties in the text input to support diverse structure synthesis. To enforce consistency between the generated structures and the target textual prompts, a dual alignment strategy is adopted, including contrastive text-structure alignment and test-time reward-guided alignment. Experimental results show that the model is capable of generating semantically meaningful and physically plausible structures across a wide range of material categories. Our approach has good potential for interactive microstructure design and opens up new directions for combining language-based interfaces with inverse material discovery. Code is available at: https://github.com/hongsong-wang/PropDiff-TMG

2606.08146 2026-06-09 cs.AI New submission

SAGE: An LLM-driven Self Reflective Agentic Framework for Fraud Detection

Yichen Chen, Siying Li, Yuhang Liang, Lijun Wang, Renyang Liu

Affiliations: National University of Singapore; University of Chinese Academy of Sciences; China Mobile Communications Group

AI summary: Proposes SAGE, described as the first end-to-end LLM-driven multi-agent fraud detection framework; using a Data Diagnostic Tree and natural-language-gradient optimization, it improves F1 by an average of 40.86% across five datasets.

Abstract

Fraud detection in payment, e-commerce, and telecommunications systems requires accuracy at the individual level, robustness under severe class imbalance, and ease of understanding for risk managers. Existing methods fail at least one of these requirements: automated machine learning systems search a fixed numerical space without semantic awareness of the dataset; graph neural network-based methods require pre-defined relational graphs and remain opaque at the individual-decision level; and general-purpose large language model (LLM) agents are not designed around the recall and precision constraints specific to real-world fraud detection. In this paper, we propose SAGE, the first end-to-end LLM-driven multi-agent framework for fraud detection. SAGE coordinates three dedicated agents that make decisions based on a six-layer Data Diagnostic Tree (DDT) and a Markov decision process guided by natural-language gradients, automatically optimizing the model under a fraud-specific reward. On five fraud datasets and five LLM backbones, SAGE wins 96.00% of method-dataset comparisons and improves F1 by an average of 40.86% over baselines. The code is available at https://github.com/yichenC1c/SAGE.

2606.08144 2026-06-09 cs.CV New submission

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

Jiale Huang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Chunxiao Wang, Yupeng Hu

Affiliations: Shandong University; Qilu University of Technology (Shandong Academy of Sciences)

AI summary: To address the mismatch between the implicit semantics of modification texts and explicit visual content in composed video retrieval, proposes the adaptive schema-imagery enhanced compositional network (IMAGINE), which captures implicit concepts with dynamic multimodal prototypes and uses them to modulate visual features, reaching state-of-the-art performance on three benchmarks.

Comments
Accepted by ICMR 2026

Abstract

Composed Video Retrieval (CVR) is designed to retrieve a target video that matches a reference video modified by a modification text. While existing methods explore cross-modal correspondences, they often assume modified objects appear directly in videos. However, modification texts frequently describe concepts not explicitly presented but implicitly expressed through semantically related visual cues (e.g., "cake" implying "birthday party"). Current approaches typically rely on aligning explicit feature representations within the concrete space, neglecting critical latent associations. To address this, we propose an adaptIve scheMa-ImAGery enhanced composItional NEtwork (IMAGINE). Unlike standard explicit matching, IMAGINE materializes implicit semantics (termed schema imagery) via dynamic multimodal prototypes. These prototypes capture shared latent concepts to adaptively modulate visual features, effectively injecting implicit guidance into the retrieval process. By bridging the gap between explicit visual contents and implicit retrieval intentions, IMAGINE achieves state-of-the-art performance in both CVR and Composed Image Retrieval (CIR) across three widely used benchmarks.

2606.08140 2026-06-09 cs.LG New submission

TRUST-SCF: Transformer-based Risk Understanding and Scoring for Transactional Supply Chain Finance

Mohammadamin Davoodabadi, Amirabbas Shakeri

Affiliations: Department of Growth Barook Co.

AI summary: Proposes the TRUST-SCF framework, which models transaction sequences with a Transformer and achieves dynamic credit scoring through a financially aligned attention bias, continuous delay prediction, and a label-efficient scoring pipeline; experiments show it outperforms baselines.

Comments
15 pages, 13 figures, 3 tables

Abstract

Supply Chain Finance (SCF) and LendTech platforms need credit scoring systems that respond to evolving transaction behavior, repayment delays, and active exposure. We propose TRUST-SCF, a transformer-based framework for transaction-level risk prediction and dynamic credit scoring. Each user history is represented as a sequence of transaction tokens containing utilization, repayment delay, and transaction position. The main contributions are: (1) a financially aligned attention bias that combines utilization similarity and recency, enabling the model to compare repayment behavior under comparable exposure conditions; (2) continuous repayment-delay prediction in a log-transformed target space, reducing the influence of extreme delays while improving sensitivity to short-delay behavior; and (3) a label-efficient credit-scoring pipeline in which the final credit score is not trained using any explicit external credit-score label, but is instead derived from predicted delay, potential risk over simulated utilization, actual unpaid exposure, and nonlinear calibration. Experiments on real transaction data from more than 300,000 transactions show that TRUST-SCF improves delay prediction over sequential baselines and produces scores that are strongly associated with future repayment behavior. These results suggest that TRUST-SCF is a practical framework for adaptive credit scoring and transaction-level risk mitigation in SCF and LendTech environments.
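
Contribution (2) above predicts repayment delay in a log-transformed target space. A minimal sketch of why that helps, assuming a log(1 + delay) transform: the abstract does not state the exact transform, and the helper names here are hypothetical.

```python
import math


def to_log_space(delay_days: float) -> float:
    """Map a non-negative delay in days to the assumed target log(1 + delay)."""
    return math.log1p(delay_days)


def from_log_space(target: float) -> float:
    """Invert the assumed transform to recover a delay in days."""
    return math.expm1(target)


# The same 5-day difference looks very different at the two ends of the scale.
short_gap = to_log_space(5) - to_log_space(0)      # 0 days vs 5 days late
long_gap = to_log_space(105) - to_log_space(100)   # 100 days vs 105 days late
```

Under this assumed transform, a 5-day difference between short delays is far larger in target space than the same difference between extreme delays, which matches the stated goal of damping extreme delays while staying sensitive to short ones.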

2606.08136 2026-06-09 cs.RO New submission

Learning Predictive Control with Deep Koopman Operators for Autonomous Vehicle Motion Planning

Xinglong Zhang, Yongqian Xiao, Haotian Cao, Xing Zhou, Xin Yin, Xin Xu

Affiliations: National Natural Science Foundation of China; Science and Technology Innovation Program of Hunan Province

AI summary: Proposes a learning predictive control framework with deep Koopman operators that lifts nonlinear dynamics into a linear observable space and uses receding-horizon actor-critic learning to generate closed-loop state-feedback policies, enabling efficient, safe real-time motion planning under nonconvex constraints.

Abstract

Model Predictive Control (MPC) is widely used for autonomous-vehicle (AV) motion planning, but its real-time applicability is often limited by the need for accurate models and online solution of nonlinear, nonconvex optimization problems in dynamic road environments. Actor-critic reinforcement learning offers a promising alternative for online policy generation, yet its policy-learning process often lacks explicit control-theoretic structure. This article proposes a learning predictive control (LPC) framework with deep Koopman operators for efficient real-time motion planning under nonconvex constraints. To address nonlinear and uncertain vehicle dynamics, a deep-Koopman-based predictor is used to lift the system into an interpretable linear observable space in a data-driven manner. Unlike traditional MPC, which computes open-loop control sequences, the proposed LPC framework yields a closed-loop state-feedback policy within each prediction interval through receding-horizon actor-critic learning. To ensure safety under nonconvex environmental constraints, LPC constructs convex local surrogate representations of obstacles and defines corresponding potential-field functions. These functions and their gradients are directly embedded into the actor-critic structure, enabling efficient, safety-aware policy learning. Extensive simulations and real-world experiments on the HongQi-EHS3 platform demonstrate favorable performance in diverse obstacle-avoidance scenarios in terms of safety, computational efficiency, and driving comfort, compared with benchmark methods such as CBF-MPC and LMPCC.

2606.08133 2026-06-09 cs.CV New submission

Gravity-guided Contact Dynamics Estimation from 3D Human Motions

Cuong Le, Urs Waldmann, Bastian Wandt, Mårten Wadenbäck

Affiliations: Linköping University

AI summary: Proposes the GraCE model, which uses the body's center of mass and the effect of gravity to estimate ground contact forces and pressure distribution from 3D motion data, outperforming existing methods.

Comments
14 pages, under submission

Abstract

Ground contact forces acting on the human body are crucial for biomechanics studies and sport performance analysis. Prior methods rely on force plates or pressure mats to collect ground contact dynamics, limiting their applicability to carefully controlled settings. A more scalable solution is to estimate the dynamics directly from motion capture data. Recent approaches only roughly estimate the ground contact dynamics from the vertical distance between the body and the ground plane, which cannot capture the complex pressure distribution of all contact points. To this end, we propose GraCE (Gravity-guided Contact Dynamics Estimation), a novel full-body contact dynamics model for human motions that accounts for the realistic influence of body mass distribution and gravity. We use the human's center of gravity to estimate the ground contacts based on its relative distance to the human body. The applied force on each contact is estimated via the product of predicted contact probabilities and the total exterior force computed from the center of mass trajectory. We outperform related work on the GroundLink dataset for ground reaction force estimation, and on the MOYO dataset for detailed contact pressure prediction. The code will be published upon acceptance.
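
The force model described above (per-contact force as the product of a contact probability and the total external force derived from the center-of-mass trajectory) can be sketched as follows. This is a simplified illustration, not the GraCE implementation: it assumes Newton's second law for a point mass at the center of mass, a z-up frame, a central finite-difference acceleration, and contact probabilities normalized so that the per-contact forces sum to the total. All function names are hypothetical.

```python
GRAVITY = (0.0, 0.0, -9.81)  # m/s^2, z-up frame (assumption)


def com_acceleration(p_prev, p_curr, p_next, dt):
    """Central finite-difference acceleration of the center of mass."""
    return tuple((a - 2 * b + c) / dt ** 2 for a, b, c in zip(p_prev, p_curr, p_next))


def total_external_force(mass, acc):
    """Newton's second law: m * a = F_ext + m * g, so F_ext = m * (a - g)."""
    return tuple(mass * (a - g) for a, g in zip(acc, GRAVITY))


def distribute_force(total_force, contact_probs):
    """Split the total force across contacts in proportion to contact probability."""
    norm = sum(contact_probs)
    if norm == 0:
        return [(0.0, 0.0, 0.0) for _ in contact_probs]
    return [tuple(p / norm * f for f in total_force) for p in contact_probs]


# A 70 kg person standing still: the center of mass does not move between frames.
acc = com_acceleration((0, 0, 1.0), (0, 0, 1.0), (0, 0, 1.0), dt=0.01)
force = total_external_force(70.0, acc)
# Two feet likely in contact, one hand clearly not (probabilities are made up).
per_contact = distribute_force(force, [0.9, 0.9, 0.0])
```

In the static case the acceleration is zero, so the total external force simply balances body weight and is shared equally by the two feet.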

2606.08132 2026-06-09 cs.CV cs.LG New submission

Phase Marginalization for Patch-Grid Instability in Vision Transformers

Oğuzhan Ercan

Affiliations: Scientific and Technological Research Council of Türkiye

AI summary: Proposes Phase Marginalization, which evaluates structured patch-grid phases, inverse-aligns the dense outputs, and aggregates them in the original image coordinate system to mitigate prediction instability caused by patch-grid phase in Vision Transformers; it improves segmentation, depth, and matching performance without training.

Comments
13 pages, 1 figure, 9 tables

Abstract

Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability for dense prediction: changing the patch partition can change the token evidence available to a pixel, especially near boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc marginalization method that evaluates structured patch-grid phases, inverse-aligns dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical K = 1 baseline across measured segmentation, depth, and local matching settings. In a controlled Cityscapes experiment, Uniform Phase Marginalization provides a modest compute-matched advantage over generic shift-based four-forward test-time augmentation (TTA) (+0.31 mean Intersection-over-Union over the strongest tested generic row). A scaling study further shows that K = 4 is a practical cost-accuracy trade-off: K = 8 is essentially unchanged and K = 16 adds little accuracy at much higher latency. These results position patch-grid phase as a measurable nuisance variable and Phase Marginalization as a simple diagnostic and post-hoc marginalization baseline for dense ViT prediction.
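
The evaluate, inverse-align, aggregate loop described above can be illustrated with a toy stand-in for a ViT. The sketch below is not the paper's method in detail: the circular shift, the four half-patch offsets used for K = 4, and the patch-mean "model" are all assumptions chosen to keep the example self-contained in plain Python.

```python
def roll2d(img, dy, dx):
    """Circularly shift a 2D list so that out[y][x] = img[y - dy][x - dx]."""
    h, w = len(img), len(img[0])
    return [[img[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


def patch_mean_model(img, patch=2):
    """Toy dense predictor: every pixel is replaced by the mean of its patch."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            cells = [(y, x) for y in range(py, min(py + patch, h))
                     for x in range(px, min(px + patch, w))]
            mean = sum(img[y][x] for y, x in cells) / len(cells)
            for y, x in cells:
                out[y][x] = mean
    return out


def phase_marginalize(img, model, phases):
    """Average the model's dense output over patch-grid phases in image coordinates."""
    h, w = len(img), len(img[0])
    acc = [[0.0] * w for _ in range(h)]
    for dy, dx in phases:
        pred = model(roll2d(img, dy, dx))   # evaluate under a shifted patch grid
        aligned = roll2d(pred, -dy, -dx)    # inverse-align to the original frame
        for y in range(h):
            for x in range(w):
                acc[y][x] += aligned[y][x] / len(phases)
    return acc


PHASES_K4 = [(0, 0), (0, 1), (1, 0), (1, 1)]  # assumed offsets for patch size 2

# A vertical step edge that does not line up with the 2x2 patch grid.
edge = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(4)]
baseline = phase_marginalize(edge, patch_mean_model, [(0, 0)])  # K = 1
marginalized = phase_marginalize(edge, patch_mean_model, PHASES_K4)
```

With a single phase the result is just the model's blocky output; averaging over the four phases removes the dependence on where the patch boundaries happen to fall relative to the edge.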

2606.08129 2026-06-09 cs.AI New submission

Cross-LLM Consistency in Inference: Evidence from Shared Interactions

Siyu Lou, Yao Yan, Yuntian Chen, Quanshi Zhang

Affiliations: School of Computer Science, Shanghai Jiao Tong University; Ningbo Key Laboratory of Advanced Manufacturing Simulation, Eastern Institute of Technology, Ningbo; College of Computer and Information Science, Chongqing Normal University; SymtrustAI.com Eastern Institute of Technology, Ningbo

AI summary: Finds that different large language models often share interaction patterns when predicting the same target token from the same prompt, that this consistency is stronger among advanced models, and that shared interactions tend to be lower-order with weaker positive-negative cancellation.

Comments
20 pages, 8 figures

Abstract

Large language models (LLMs) differ in architecture, training data, and optimization procedures, yet they may still develop similar internal inference patterns. In this paper, we examine this hypothesis using interaction-based explanations. We find that LLMs often share interaction patterns when predicting the same target token from the same prompt. This consistency is more pronounced among advanced LLMs. Shared interactions also tend to be lower-order and show weaker positive-negative cancellation than non-shared interactions. These results suggest that advanced LLMs may be implicitly optimized toward common inference patterns, even though the mechanisms that give rise to such cross-model consistency remain open.

2606.08126 2026-06-09 cs.CV New submission

One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling

Qiyu Xu, Zhanxuan Hu, Yu Duan, Yonghang Tai, Huafeng Li, Quanxue Gao, Xiangyong Cao

Affiliations: Xi’an Jiaotong University; Yunnan Normal University; Xidian University; Kunming University of Science and Technology

AI summary: Proposes OSTB, a training-free framework that estimates a consensus sample-class structure via self-adaptive optimal transport and uses it to handle model selection, target-domain adaptation, and prediction ensembling across multiple VLMs.

Abstract

Vision-language models (VLMs) enable visual recognition from semantic class descriptions, which makes them attractive when target annotations are scarce or unavailable. Most deployment pipelines, however, first choose a single VLM and then adapt that model to the unlabeled target set. This single-backbone paradigm hides a critical assumption: the selected VLM is already compatible with the target domain. In realistic cross-domain deployment, several general-purpose and domain-specialized VLMs may be plausible, yet no instance-level target labels are available to identify the reliable ones. Deployment therefore requires a coupled solution for model selection, target adaptation, and prediction integration. We revisit this problem from a system-level multi-VLM perspective. Our central observation is that the three decisions above depend on the same latent object: a trustworthy sample-class structure in the target set. Different VLMs may encode different transfer biases and produce conflicting predictions, but their outputs can still provide complementary evidence for estimating this structure. We propose One Stone, Three Birds (OSTB), a training-free framework based on self-adaptive optimal transport. Given a pool of frozen candidate VLMs, OSTB estimates a consensus sample-to-class transport plan without updating VLM parameters. The learned transport structure is then reused for all deployment objectives: model selection is performed by ranking the combined semantic and visual reliability induced by the consensus plan; target adaptation is obtained by fitting transport-conditioned visual classifiers; and ensembling is implemented through reliability-aware probabilistic integration. Extensive experiments on natural-image, remote-sensing, and medical-pathology benchmarks show that OSTB improves model ranking, adaptation stability, and ensemble robustness under heterogeneous candidate pools.
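
To make the idea of a consensus sample-to-class transport plan concrete, the sketch below uses plain entropic optimal transport (Sinkhorn iterations), not the self-adaptive variant the paper proposes. It assumes uniform sample and class marginals, a cost equal to the negative log of the averaged VLM probabilities, and made-up predictions from two hypothetical frozen models.

```python
import math


def sinkhorn(cost, eps=1.0, iters=200):
    """Entropic OT plan between uniform sample and uniform class marginals."""
    n, k = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    row_mass, col_mass = 1.0 / n, 1.0 / k
    v = [1.0] * k
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


def argmax_rows(matrix):
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


# Made-up class probabilities from two frozen VLMs (4 samples, 2 classes).
vlm_a = [[0.95, 0.05], [0.85, 0.15], [0.70, 0.30], [0.40, 0.60]]
vlm_b = [[0.85, 0.15], [0.75, 0.25], [0.50, 0.50], [0.20, 0.80]]

consensus = [[(a + b) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(vlm_a, vlm_b)]
cost = [[-math.log(p) for p in row] for row in consensus]

plan = sinkhorn(cost)
naive_labels = argmax_rows(consensus)  # plain averaging, biased toward class 0
ot_labels = argmax_rows(plan)          # balanced by the class marginal
```

Both toy models lean toward class 0, so plain averaging assigns three of the four samples to it; the transport plan's uniform class marginal (an assumption that only makes sense when classes are roughly balanced) moves the borderline third sample to class 1.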

2606.08123 2026-06-09 cs.CV cs.AI New submission

Human-Centered Benchmarking of Driver Monitoring Models

Ruben Dario Florez-Zela

Affiliations: Universidad Nacional de San Agustin de Arequipa (UNSA)

AI summary: Because driver monitoring models are typically evaluated on classification accuracy alone, proposes the Human-Centered Benchmarking Framework (HCBF), which evaluates models along four dimensions (accuracy, explainability, efficiency, and robustness); each model leads in one dimension and all lie on the Pareto frontier, yet aggregate rankings can mask critical weaknesses.

Comments
9 pages, 3 figures, 7 tables. Code available at: https://github.com/rubendflorezzela/hcbf-driver-monitoring

Abstract

Vision-based driver monitoring systems are increasingly deployed in safety-critical intelligent transportation settings, yet they are almost always compared on classification accuracy alone. This paper argues that accuracy is insufficient to characterize a model's fitness for real-world deployment, and proposes the Human-Centered Benchmarking Framework (HCBF), which evaluates models across four dimensions: accuracy, explainability, efficiency, and robustness. The framework is applied to four representative lightweight architectures, MobileNetV3, ShuffleNetV2, EfficientNet-B0, and DeiT-Tiny, on the MRL Eye Dataset for eye-state classification. While the models are nearly indistinguishable on clean-set accuracy, each leads in exactly one dimension, and all four lie on the Pareto frontier. A Human-Centered Score computed under three deployment-oriented weighting scenarios ranks ShuffleNetV2 first throughout. However, this aggregate winner retains less than half of its performance under sensor noise and fails by classifying closed eyes as open, whereas the transformer remains robust. These findings show that aggregate ranking can mask dimension-specific vulnerabilities that are operationally decisive, underscoring the value of multi-dimensional, human-centered evaluation.
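
The two notions used above, a Pareto frontier over four dimensions and a weighted aggregate score, can be sketched as follows. The scores are invented placeholders for four anonymous models, not results from the paper, and the equal weighting is only one hypothetical scenario; the paper's actual Human-Centered Score formula is not given in the abstract.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Names of models that no other model dominates."""
    return [
        name for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    ]


def weighted_score(dims, weights):
    """Simple weighted sum over the four dimensions (an assumed aggregate)."""
    return sum(w * x for w, x in zip(weights, dims))


# Placeholder scores: (accuracy, explainability, efficiency, robustness).
scores = {
    "model_a": (0.99, 0.60, 0.70, 0.50),
    "model_b": (0.98, 0.80, 0.60, 0.55),
    "model_c": (0.98, 0.65, 0.95, 0.40),
    "model_d": (0.97, 0.62, 0.50, 0.85),
}
weights = (0.25, 0.25, 0.25, 0.25)  # hypothetical equal-weight scenario

front = pareto_front(scores)
winner = max(scores, key=lambda name: weighted_score(scores[name], weights))
robustness_of_winner = scores[winner][3]
```

In this toy setup every model leads one dimension, so all four sit on the frontier, while the aggregate winner is also the least robust model. That is the kind of masking effect the abstract warns about.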

2606.08122 2026-06-09 cs.AI New submission

Think Before You Act: Intention-Guided Reasoning for LLM-Based Location Prediction

Qingxiang Liu, Anqi Liang, Zhuoyang Jiang, Yutian Jiang, Sisuo Lyu, Yu Ji, Haomin Wen, Yuxuan Liang

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Shanghai Jiao Tong University; The Hong Kong University of Science and Technology; Fudan University; Shanghai Innovation Institute

AI summary: Proposes the IntentPOI framework, which uses two-stage intention-guided reasoning (first inferring the user's travel intention, then selecting a POI based on it) to recast location prediction from direct trajectory matching into intention reasoning, outperforming 11 baselines on three real-world datasets.

Abstract

Predicting a user's next Point-of-Interest (POI) based on their historical check-in records is a fundamental task in location-based services. While recent methods incorporating large language models have shown strong reasoning capabilities and promising results, they typically formulate the prediction task as a one-step trajectory-to-location mapping problem, making predictions prone to shallow trajectory correlations and historical frequency bias. We argue that users rarely choose locations directly; instead, they usually first form a travel intention and then select specific POIs accordingly. Motivated by this insight, we propose IntentPOI, a two-stage intention-guided reasoning framework. In the thinking stage, we infer users' intermediate intentions by incorporating historical mobility patterns, similar peer behaviors, and temporal context. In the acting stage, we first construct a compact candidate pool, and then perform intention-guided reasoning to identify locations that best align with the inferred intention. By explicitly decoupling intention inference from location prediction, IntentPOI transforms next-POI prediction from direct trajectory matching into intention-guided reasoning. Extensive experiments on three real-world datasets demonstrate that IntentPOI consistently outperforms eleven state-of-the-art baselines.

2606.08121 2026-06-09 cs.CV New submission

Trustworthy Visual Predicates for Robust Manipulation Understanding under Degradation

Fatemeh Ziaeetabar

Affiliations: University of Tehran

AI summary: Proposes a predicate-level reliability framework that uses a structured predicate vocabulary, confidence-aware estimation, and reliability metrics to analyze how degradations such as blur and occlusion affect visual predicates in manipulation understanding; experiments show that contact-sensitive and dynamic predicates are more fragile.

Abstract

Manipulation understanding requires reliable relational evidence, such as contact, support, containment, motion coupling, grasp, release, and active-hand involvement. Although these visual predicates are widely used in event-chain, graph-based, and neuro-symbolic models, their reliability under visual degradation is rarely analyzed directly. This paper introduces a predicate-level reliability framework for robust manipulation understanding under blur, occlusion, illumination change, low resolution, frame dropping, and detection noise. The framework defines a structured predicate vocabulary, confidence-aware predicate estimation, and reliability metrics for predicate preservation, degradation sensitivity, temporal consistency, confidence-weighted stability, and downstream impact. Experiments on controlled manipulation videos and public egocentric or bimanual datasets, including VISOR/EPIC-KITCHENS, H2O, and ARCTIC, show that predicate failures are structured rather than uniform. Static spatial predicates remain comparatively robust, whereas contact-sensitive, dynamic, and derived predicates such as grasp and release are more fragile. Under severe degradation, detection noise, occlusion, and frame dropping cause the strongest reliability losses. Downstream analysis shows that degraded predicates reduce manipulation-understanding accuracy from 0.89 to 0.58, while removing confidence weighting under moderate degradation reduces accuracy from 0.74 to 0.64. These results show that predicate reliability provides a diagnostic layer between visual perception and structured manipulation reasoning.

2606.08113 2026-06-09 cs.LG math.FA math.OC New submission

Conditional Random Ordered Transport Spaces

Lei Luo, Jian Yang

Affiliations * Nanjing University of Science and Technology; PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education; School of Computer Science and Engineering

AI summary: Proposes conditional random ordered transport spaces (CROTS), which introduce an ordered transport geometry and a conditional risk functional to address whether a transport direction is admissible in distributional learning, and establishes a stability theorem.

Details
Comments
24 pages, 1 figure, 2 tables
AI Chinese abstract

小的Wasserstein距离不能证明变换是可容许的。在证据约束、语义、因果、物理、单调或风险敏感学习中,不仅要问两个概率定律相距多远,还要问质量是否沿着可用信息允许的方向移动。我们引入了条件随机有序传输空间(CROTS),这是一类\(L^0\)值随机概率测度空间,配备了Wasserstein环境度量、闭随机序、硬和软有序传输差异,以及用于在证据sigma域下评估序违反的条件风险泛函。核心对象是随机测度值动力学的一个序可容许传输几何,区别于锥值度量、有序Kantorovich构造、单独的随机Wasserstein空间以及生成路径的模型特定残差。我们发展了CROTS作为可靠分布学习空间理论的基础。结果包括硬和软有序传输的适定性和对偶性、软到硬变分收敛、随机提升空间的可测性和完备性、约化到经典Wasserstein和有序几何、有序测地线、约束重心和投影、条件风险-传输对偶性以及序违反分布的分离。主要稳定性定理表明,随机学习动力学可以在环境Wasserstein度量中收敛,而其局部可容许性泄漏遵循一个独立的条件序-风险递归。由此产生的渐近序-风险下界为证据过度、有序分布偏移、鲁棒性失败和可容许分布动力学提供了数学语言。

English abstract

A small Wasserstein distance does not certify that a transformation is admissible. In evidence-constrained, semantic, causal, physical, monotone, or risk-sensitive learning, one must ask not only how far apart two probability laws are, but whether mass has moved in a direction allowed by available information. We introduce conditional random ordered transport spaces (CROTS), a class of \(L^0\)-valued spaces of random probability measures equipped with a Wasserstein ambient metric, a closed stochastic order, hard and soft ordered transport discrepancies, and a conditional risk functional for evaluating order violation under an evidence sigma-field. The central object is an order-admissible transport geometry for random measure-valued dynamics, distinct from cone-valued metrics, ordered Kantorovich constructions, random Wasserstein spaces alone, and model-specific residuals for generative paths. We develop the foundations of CROTS as a space theory for reliable distributional learning. The results include well-posedness and duality for hard and soft ordered transport, soft-to-hard variational convergence, measurability and completeness of the random lifted space, reductions to classical Wasserstein and ordered geometries, ordered geodesics, constrained barycenters and projections, conditional risk-transport duality, and separation of order-violating distributions. The main stability theorem shows that random learning dynamics may converge in the ambient Wasserstein metric while their local admissibility leakage follows a separate conditional order-risk recursion. The resulting asymptotic order-risk floor provides a mathematical language for evidence overreach, ordered distribution shift, robustness failure, and admissible distributional dynamics.

2606.08107 2026-06-09 cs.RO cs.AI New submission

Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data

Ego-Pi: 面向自我中心人类与机器人数据的VLA微调

Ji Woong Kim, Ke Wang, Zipeng Fu, Sirui Chen, Cong Zhao, Jeff Lai, Chelsea Finn

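Affiliations * Stanford University; Meta

AI summary: To address robot data scarcity, fine-tunes the π₀.₅ model on egocentric human data so that robots learn new task semantics and compose existing skills without corresponding robot data.

Details
AI Chinese abstract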

机器人技术面临数据稀缺的根本挑战。与语言或视觉研究不同,机器人操作没有互联网规模的数据集。一个有前景的途径是利用自我中心人类数据,这类数据更容易收集、范围更广且规模更大。为此,我们研究了跨人类和配备灵巧五指手的类人机器人实体学习的关键设计选择,以$π_{0.5}$模型为基础。我们的结果表明,人类数据使机器人能够学习新的任务语义,并将现有技能组合成新颖的行为,而无需相应的机器人数据。论文网站:https://egopipaper.github.io/

English abstract

Robotics faces a fundamental challenge of data scarcity. Unlike language or vision research, there is no internet-scale dataset for robotic manipulation. A promising path forward is to leverage egocentric human data, which can be collected more easily, with greater breadth, and at a larger scale. Towards this end, we investigate key design choices for learning across human and humanoid embodiments equipped with dexterous five-finger hands, using the $π_{0.5}$ model as a foundation. Our results show that human data enables robots to learn new task semantics and compose existing skills into novel behaviors without corresponding robot data. The paper website is here: https://egopipaper.github.io/

2606.08106 2026-06-09 cs.AI cs.MA New submission

PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

PACE: 自演化智能体的任意有效接受测试

Zayx Shawn

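Affiliations * Independent Researcher

AI summary: Proposes PACE, which recasts change acceptance for self-evolving agents as a sequential hypothesis test and controls the false-commit probability through paired anytime-valid commit evaluation, sharply reducing false commits and lowering evaluation cost on several benchmarks.

Details
AI Chinese abstract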

自演化智能体通过反复提出对其自身提示、技能或工作流程的更改,并保留那些在小型保留集上得分更高的更改来改进。几乎所有努力都集中在生成候选方案的提议者上;我们认为薄弱环节是接受者,即决定是否提交更改的规则。针对相同的噪声开发估计应用数百次,无处不在的“如果分数上升则保留”规则是未受控制的自适应多重测试:智能体有效地自我p-hack,累积虚假提交,导致其搅动和漂移而非改进。我们将提交重新定义为序贯假设检验,并提出PACE(配对任意有效提交评估),一种无需训练、任意有效的提交门控。每个候选方案与现有方案在相同实例上进行比较,仅当通过测试-下注的e过程积累决定性证据时才提交,提前停止以节省评估,并在可选停止下将每个候选方案的虚假提交概率控制在用户设定的水平(每决策保证)。在Qwen2.5智能体(0.5B-3B)于GSM8K、SVAMP和ARC-Challenge上在提示级别自演化时,贪婪接受在真实改进隐藏在噪声提议中时提交30-42%的虚假编辑和10-33%的有害编辑,而PACE提交真实改进且几乎无其他,匹配贪婪的保留集准确性,但方差显著降低且评估成本降低约18%。在没有真正增益可用时,贪婪每次运行提交13-21次虚假自我修改(72-100%虚假),并使最脆弱的智能体性能下降4.9个百分点,而PACE保持基线水平。自演化的可靠性取决于接受者,而不仅仅是提议者。

English abstract

Self-evolving agents improve by repeatedly proposing changes to their own prompts, skills, or workflows and keeping those that score higher on a small held-out set. Almost all effort has gone into the proposer that generates candidates; we argue the weak point is the acceptor, the rule that decides whether to commit a change. Applied hundreds of times against the same noisy dev estimate, the ubiquitous "keep it if the score went up" rule is uncontrolled adaptive multiple testing: the agent effectively p-hacks itself, accumulating false commits that make it churn and drift rather than improve. We recast committing as a sequential hypothesis test and propose PACE (Paired Anytime-valid Commit Evaluation), a training-free, anytime-valid commit gate. Each candidate is compared to the incumbent on identical instances and committed only when a testing-by-betting e-process accumulates decisive evidence, stopping early to save evaluations and controlling each candidate's false-commit probability at a user-set level even under optional stopping (a per-decision guarantee). On Qwen2.5 agents (0.5B-3B) self-evolving at the prompt level on GSM8K, SVAMP, and ARC-Challenge, greedy acceptance commits 30-42% false and 10-33% harmful edits when a genuine improvement is hidden among noisy proposals, while PACE commits the real one and essentially nothing else, matching greedy's held-out accuracy at sharply lower variance and about 18% lower evaluation cost. With no real gain available, greedy commits 13-21 spurious self-modifications per run (72-100% false) and degrades the most fragile agent by 4.9 points, while PACE holds at baseline. Reliability of self-evolution depends on the acceptor, not only on the proposer.
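
The commit gate described in the abstract can be sketched as a testing-by-betting wealth process over paired score differences. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `paired_commit_gate`, the fixed bet fraction `bet`, and the restriction of differences to [-1, 1] are all hypothetical choices made here. The sketch relies only on the standard fact that a nonnegative supermartingale starting at 1 exceeds 1/alpha with probability at most alpha (Ville's inequality), which is what makes stopping at any time safe.

```python
from typing import Iterable, Tuple


def paired_commit_gate(
    diffs: Iterable[float], alpha: float = 0.05, bet: float = 0.5
) -> Tuple[bool, int, float]:
    """Sequentially test H0: the candidate is no better than the incumbent.

    Each diff is (candidate score - incumbent score) on one shared instance
    and must lie in [-1, 1]. Returns (commit?, instances used, final wealth).
    The fixed bet fraction is an assumption of this sketch.
    """
    if not 0.0 < alpha < 1.0 or not 0.0 <= bet < 1.0:
        raise ValueError("need 0 < alpha < 1 and 0 <= bet < 1")
    wealth, used = 1.0, 0
    for d in diffs:
        if not -1.0 <= d <= 1.0:
            raise ValueError("paired differences must lie in [-1, 1]")
        # Under H0 (conditional mean of d <= 0) each factor has mean <= 1,
        # so the wealth is a nonnegative supermartingale.
        wealth *= 1.0 + bet * d
        used += 1
        if wealth >= 1.0 / alpha:
            # Stop early: enough evidence has accumulated to commit.
            return True, used, wealth
    return False, used, wealth


# A candidate that wins every paired instance is committed after 8 instances,
# because 1.5**8 is the first power of 1.5 that reaches 1/alpha = 20.
committed, used, wealth = paired_commit_gate([1.0] * 20)

# A candidate that wins and loses alternately never accumulates evidence.
noisy_committed, noisy_used, _ = paired_commit_gate([1.0, -1.0] * 10)
```

A greedy rule would have accepted the alternating candidate whenever its running score happened to be ahead; the gate above rejects it because its wealth shrinks by a factor of 0.75 per win-loss pair.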

2606.08105 2026-06-09 cs.LG New submission

A Unifying View of Attention Sinks: Two Algorithms, Two Solutions

注意力汇聚的统一视角:两种算法,两种解决方案

Lukas Fesser, Mozes Jacobs, Thomas Fel, Andy Keller, Sham Kakade

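Affiliations * Kempner Institute; Harvard University

AI summary: Shows that attention sinks can correspond to two distinct mechanisms, adaptive nop and broadcast, derives diagnostics from them, and demonstrates that interventions such as gating and registers target different mechanisms and work better in combination.

Details
AI Chinese abstract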

当注意力集中在一个单一标记(即汇聚)上时,模型实际上在计算什么?注意力汇聚在softmax transformer中普遍存在,然而这种共享的视觉特征可能隐藏着根本不同的算法。我们表明,视觉上相似的汇聚模式可以反映两种不同的机制:(i)自适应空操作,其中注意力头通过路由到空标记来抑制其更新;以及(ii)广播,其中汇聚聚合并重新分配全局信息。在这种情况下,汇聚扮演着类似的作用:当没有有用信息可计算时,作为一个安全的目的地。提出的干预措施如门控或寄存器之所以有效,是因为它们隐式地针对其中一种机制,揭示了方法与假设机制之间的对偶性:门控隐式假设空操作;寄存器隐式假设广播。每种机制都会留下不同的痕迹(空操作汇聚的值范数可忽略;广播汇聚导致低秩输出),我们在合成任务上形式化这些痕迹,并用于推导实用的诊断方法。应用于预训练视觉transformer时,这些诊断表明两种机制在大规模模型中均存在:汇聚从早期层的CLS标记过渡到深层层的块标记,并集中在专门的注意力头中。引人注目的是,为广播设计的寄存器标记被重新用于服务空操作,证实了单独任何一种干预都不足够。将门控与寄存器结合使用在稳定性和性能上带来互补的提升。总体而言,我们发现相同的注意力模式可以反映两种截然不同的计算,有效的干预需要首先询问模型实际在计算什么。

English abstract

When attention concentrates on a single token, a sink, what is the model actually computing? Attention sinks are ubiquitous in softmax transformers, yet this shared visual signature can hide fundamentally different algorithms. We show that visually similar sink patterns can reflect two distinct mechanisms: (i) adaptive nop, where a head suppresses its update by routing to a null token, and (ii) broadcast, where a sink aggregates and redistributes global information. In that case, sinks serve an analogous role: a safe destination when there is nothing useful to compute. Proposed interventions like gating or registers work because they implicitly target one or the other, revealing a duality between method and assumed mechanism: gating implicitly assumes nop; registers implicitly assume broadcast. Each mechanism leaves distinct traces (nop sinks exhibit negligible value norms; broadcast sinks induce low-rank outputs), which we formalize on synthetic tasks and use to derive practical diagnostics. Applied to pretrained vision transformers, these diagnostics reveal that both mechanisms exist at scale: sinks transition from CLS in early layers to patches in deeper layers, and concentrate in specialized heads. Strikingly, register tokens, designed for broadcast, are repurposed to also serve nop, confirming that neither intervention alone suffices. Combining gating with registers yields complementary gains in stability and performance. Overall, we find that the same attention pattern can reflect two very different computations and effective intervention requires first asking what the model is actually computing.
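
The "negligible value norm" trace of a nop sink follows directly from how a head combines values: its output for one query is the attention-weighted sum of value vectors. The toy below is an illustration only, with made-up numbers and hypothetical helper names (`head_output`, `norm`); it is not the paper's diagnostic. It shows that routing almost all attention to a token whose value vector is zero drives the head's update toward zero.

```python
import math
from typing import List


def head_output(attn_row: List[float], values: List[List[float]]) -> List[float]:
    """Output of one attention head for one query: sum_i attn[i] * value[i]."""
    dim = len(values[0])
    return [sum(a * v[j] for a, v in zip(attn_row, values)) for j in range(dim)]


def norm(vec: List[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vec))


# Hypothetical toy values: token 0 plays the sink and carries a zero value vector.
values = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]

sink_attn = [0.98, 0.01, 0.01]   # query routes nearly all mass to the sink
uniform_attn = [1 / 3] * 3       # query spreads attention evenly

sink_out = head_output(sink_attn, values)        # close to the zero vector
uniform_out = head_output(uniform_attn, values)  # a substantial update
```

Comparing `norm(sink_out)` with `norm(uniform_out)` gives a crude per-query indicator of whether a head is suppressing its update; the low-rank signature of broadcast sinks would need a separate check on the full output matrix.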

2606.08104 2026-06-09 cs.RO New submission

Reinforcement learning in linear embedding space unlocks generalizable control across soft robot configurations

线性嵌入空间中的强化学习解锁软体机器人配置的通用控制

Xinglong Zhang, Cong Li, Hangjie Mo, Yue Jiang, Xin Xu, Wei Jiang, Zhenshan Bing, Yihe Yang, Xiaojian Li, Yueneng Yang, Huimin Lu, Ling-li Zeng, Alois Knoll, Dewen Hu, Li Wen, Wei Pan

Affiliations * National University of Defense Technology; Hefei University of Technology; Nanjing University (Suzhou Campus); Technical University of Munich; Beihang University; Newcastle University

AI summary: Proposes a reinforcement learning framework in a shared linear Koopman embedding space that decouples control policies from robot morphology, enabling rapid transfer across 33 soft robot configurations with a 75-fold reduction in samples and robust control under high-speed motion, heavy payloads, and multi-actuator faults.

Details
Comments
An updated version of this paper has been accepted by Nature Communications
AI Chinese abstract

软体生物如章鱼和大象鼻子展现出显著的形态适应性,能够动态重构身体形状和刚度,并灵活调整控制策略以实现多功能行为。受这些生物系统启发,近几十年来出现了各种软体机器人,它们采用针对特定任务定制的不同材料、刚度和形态。尽管软体机器人的材料和结构设计取得了重大进展,但开发一个能够跨不同配置快速适应的通用控制框架仍然是一个长期挑战。现有控制器局限于固定配置,需要针对新配置进行费力的特定配置重新建模和策略重新设计。本文介绍了一种通用控制系统,通过共享线性Koopman嵌入空间中的强化学习,实现跨多种软体机器人配置的快速适应。通过将机器人动力学编码到该嵌入空间,我们的方法将控制策略与特定形态解耦,允许跨不同配置进行实时、无模型的策略适应,而无需从头重新训练。我们在33种不同的机器人配置上验证了该系统。该系统在跨配置的迁移样本量上减少了75倍,同时在高速运动、重负载和多执行器故障下保持鲁棒性能,并实现了软体机器人领域此前无法获得的现实技能。这项工作为多种软体机器人配置建立了一个统一且可适应的控制范式,弥合了机械可重构性与控制灵活性之间的差距,并可能为复杂物理系统中的通用控制提供更广泛的见解。

English abstract

Soft-bodied organisms such as octopuses and elephant trunks exhibit remarkable morphological adaptability, dynamically reconfiguring body shape and stiffness, and flexibly adjusting their control strategies to enable versatile behaviors. Inspired by these biological systems, various soft robots have emerged in recent decades, featuring diverse materials, stiffnesses, and morphologies tailored to specific tasks. Despite substantial advances in the materials and structural designs of soft robots, developing a generalizable control framework capable of rapid adaptation across diverse configurations remains a long-standing challenge. Existing controllers are limited to fixed configurations, demanding laborious configuration-specific remodelling and policy redesign for new configurations. Here, we introduce a generalizable control system that enables rapid adaptation across diverse soft robot configurations via reinforcement learning in a shared linear Koopman embedding space. By encoding robot dynamics into this embedding space, our method decouples control policies from specific morphologies, allowing real-time, model-free policy adaptation across diverse configurations without retraining from scratch. We validate our system across 33 distinct robot configurations. Our system achieves a 75-fold reduction in transfer samples across configurations, while sustaining robust performance under high-speed motion, heavy payloads, and multi-actuator faults, and achieving real-world skills previously unattainable in soft robotics. This work establishes a unified and adaptable control paradigm for diverse soft robot configurations, bridging mechanical reconfigurability with control flexibility, and may offer broader insights for generalizable control in complex physical systems.

2606.08103 2026-06-09 cs.RO cs.CV New submission

Revisiting Articulated Parts Perception in Robot Manipulation

重新审视机器人操作中的关节部件感知

Xiaoqian Wu, Yejie Guo, Xiaoyang Chen, Lixin Yang, Cewu Lu, Yong-Lu Li

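Affiliations * Shanghai Jiao Tong University

AI summary: Proposes Geometric Primary Structure (GPS) as a new representation of articulated parts, pairs it with a VR device for efficient annotation, and trains a generalizable model that reaches a 73% zero-shot manipulation success rate.

Details
Comments
CVPR 2026
AI Chinese abstract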

我们被各种带有可移动关节部件的物体所包围,例如盒子、把手、门。对关节部件的准确且可泛化的感知对于增强机器人操作能力至关重要。基于这一需求,近期在关节部件感知方面的工作遵循两个主要方向:一类工作使用基于姿态的表示,这需要高人力成本;与此同时,基于可供性的方法通过点跟踪提取未来物体运动,无需额外人工,但受限于低质量数据。在本文中,我们提出了一种新的关节部件表示——几何主结构(GPS),它是部件几何结构的抽象,以平衡可扩展性和质量。为了实现高效且可扩展的数据收集,GPS与便携式虚拟现实(VR)设备集成,只需一分钟即可标注一个物体序列。这种直接的人工标注比估计的可供性提供了更高质量。利用高效的VR-GPS系统,我们收集了6个部件类别下234个物体的41K帧数据,并训练了一个以单张RGB-D物体图像为输入的通用GPS模型。对于物体操作,我们基于GPS预测部署了一个启发式策略。无需任何领域内微调,我们的方法在9个物体的270个初始状态下达到了73%的成功率。我们的代码、数据和可复用工具可在 https://enlighten0707.github.io/gps 获取。

English abstract

We are surrounded by various objects with movable, articulated parts, e.g., boxes, handles, and doors. An accurate and generalizable perception of articulated parts is essential to enhance robotic manipulation capabilities. Building on this need, recent efforts in articulated parts perception have followed two main directions: one line of work uses pose-based representations, which incur a high manual cost; in parallel, affordance-based methods extract future object motion from point tracking without additional manual effort, but suffer from low-quality data. In this paper, we propose a new representation of articulated parts, Geometric Primary Structure (GPS), an abstraction of the part geometry structure to balance scalability and quality. For efficient and scalable data collection, GPS is integrated with a portable Virtual Reality (VR) device and requires only one minute to annotate one object sequence. This direct human annotation provides higher quality than the estimated affordance. With this efficient VR-GPS system, we collect 41K frames for 234 objects across six part classes, and train a generalizable GPS model with a single RGB-D object image as input. For object manipulation, we deploy a heuristic policy based on GPS prediction. Without any in-domain fine-tuning, our method achieves a 73% success rate, covering 270 initial states for 9 objects. Our code, data, and reusable tool are available at https://enlighten0707.github.io/gps.

2606.08100 2026-06-09 cs.LG New submission

Constraint-Aware Optimization for Robust Protein Stability Prediction

约束感知优化用于鲁棒蛋白质稳定性预测

A Shivram, Aneesh S. Chivukula, Manik Gupta, Sourav Chowdhury

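Affiliations * Birla Institute of Technology and Science Pilani, Hyderabad Campus

AI summary: Proposes a constraint-aware optimization framework combining Balanced Mean Squared Error, a Siamese anti-symmetric regularizer, and an OOD-margin consistency loss, improving the robustness of protein stability prediction without changing the SPURS architecture and yielding significant gains on multiple benchmarks.

Details
AI Chinese abstract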

多模态$\Delta\Delta G$预测器结合蛋白质语言模型与逆折叠表示,在Megascale数据集上实现了强分布内准确性,但在分布外蛋白质上鲁棒性有限,在配对突变基准上存在持续的正反向偏差,且对稀有稳定突变的代表性不足。现有方法主要通过额外的架构组件来解决这些局限性,而优化层面的干预相对未被充分探索。我们引入了一个约束感知优化框架,结合平衡均方误差、孪生反对称正则化器以及在每个位置特征表示上的新颖OOD边缘一致性损失,无需对SPURS主干进行架构更改。在十一个基准和三个随机种子上,该框架将S669上的Spearman相关性从0.486提高到0.540(种子间$\sigma=0.002$),在不修改架构的情况下匹配已发表的SPURS基线(0.50),并将S461上的相关性从0.653提高到0.711,在另外五个OOD数据集上取得一致的小幅提升。在Ssym上的受控诊断表明,反对称训练并未消除系统性的正反向偏差,表明增益是通过隐式正则化而非精确热力学约束强制执行来实现的。

English abstract

Multimodal $ΔΔG$ predictors integrating protein language models with inverse-folding representations achieve strong in-distribution accuracy on the Megascale dataset but exhibit limited robustness on out-of-distribution (OOD) proteins, persistent forward-reverse bias on paired-mutation benchmarks, and under-representation of rare stabilizing mutations. Existing approaches address these limitations primarily through additional architectural components, leaving optimization-level intervention comparatively underexplored. We introduce a constraint-aware optimization framework combining Balanced Mean Squared Error, a Siamese anti-symmetric regularizer, and a novel OOD-margin consistency loss on the per-position feature representation, requiring no architectural changes to the SPURS backbone. Across eleven benchmarks and three random seeds, the framework improves Spearman correlation on S669 from 0.486 to 0.540 ($σ=0.002$ across seeds), matching the published SPURS baseline (0.50) without architectural modification, and on S461 from 0.653 to 0.711, with consistent smaller gains on five additional OOD datasets. A controlled diagnostic on Ssym reveals that anti-symmetric training does not eliminate systematic forward-reverse bias, indicating that gains arise through implicit regularization rather than exact thermodynamic constraint enforcement.
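
The thermodynamic constraint behind the anti-symmetric regularizer is that reversing a mutation flips the sign of the stability change, so a forward prediction and its reverse should sum to zero. The abstract does not give the exact form of the Siamese regularizer; the sketch below assumes the simplest choice, a mean squared violation, and the function name `antisymmetry_penalty` is hypothetical.

```python
from typing import Sequence


def antisymmetry_penalty(forward: Sequence[float], reverse: Sequence[float]) -> float:
    """Mean squared violation of ddG(A->B) + ddG(B->A) = 0 over paired mutations.

    forward[i] and reverse[i] are predictions for a mutation and its reversal.
    The squared form is an assumption of this sketch, not the paper's loss.
    """
    if len(forward) != len(reverse) or not forward:
        raise ValueError("need equally sized, non-empty prediction lists")
    return sum((f + r) ** 2 for f, r in zip(forward, reverse)) / len(forward)


# Perfectly anti-symmetric predictions incur no penalty.
consistent = antisymmetry_penalty([1.0, -0.5], [-1.0, 0.5])

# A forward prediction of 1.0 paired with a reverse prediction of 0.0 is penalized.
biased = antisymmetry_penalty([1.0], [0.0])
```

A penalty of this kind only discourages violations during training; as the abstract notes for Ssym, it does not guarantee that the trained model is free of forward-reverse bias.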

2606.08099 2026-06-09 cs.RO New submission

Cybernetic Android Avatar "Yui": System Integration, Field Deployment, and Evaluation

赛博格安卓化身“Yui”:系统集成、现场部署与评估

Kaoruko Shinkawa, Mizuki Nakajima, Taisei Mogi, Yoshihiro Nakata

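Affiliations * The University of Electro-Communications; Tokyo Denki University

AI summary: Presents Yui, a full-body cybernetic android avatar that integrates operator-side immersive teleoperation with interlocutor-side human-like social signaling; real-world deployments, including a long-term Expo exhibition and a remote educational exchange, demonstrate feasibility, with positive ratings for co-presence and emotion transmission.

Details
Comments
47 pages, 20 figures, 10 tables. Submitted to International Journal of Social Robotics
AI Chinese abstract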

远程通信技术已广泛使用,但在许多社交互动场景中,支持共享物理空间感和传达丰富的非语言线索仍然具有挑战性。本研究介绍了“Yui”,一种全身赛博格安卓化身,旨在将操作者沉浸式遥操作与对话者类人社交信号相结合。Yui 结合了55自由度的全身机构与先前开发的安卓头部、面部表情和注视控制、上半身和手臂运动、手部驱动以及移动平台。它可以通过基于头戴显示器的沉浸式模式或基于网络摄像头的桌面模式进行操作。我们通过三个实际部署评估了系统:日本关西大阪2025年世博会的长期公共展览、小学生之间的远程教育交流以及与普通参与者的公共互动研究。在世博会部署期间,两个单元累计运行约1131小时,展示了操作可行性和维护挑战。在公共研究中,操作者和对话者均报告了对共在感的积极印象和使用意愿。对话者还在类人性和情绪及意图传达方面对化身给予了积极评价。结果表明对普通操作者具有可用性,同时在精确可控性方面存在改进空间。这些发现为可社交部署的全身安卓化身提供了现场证据和设计启示。

English abstract

Remote communication technologies have become widely used; however, supporting a sense of shared physical space and conveying rich non-verbal cues remain challenging in many social interaction scenarios. This study presents "Yui," a full-body cybernetic android avatar designed to integrate operator-side immersive teleoperation with interlocutor-side human-like social signaling. Yui combines a 55-degree-of-freedom full-body mechanism with a previously developed android head, facial expression and gaze control, upper-body and arm motion, hand actuation, and a mobile platform. It can be operated through either an immersive mode using a head-mounted-display-based interface or a desktop mode using a webcam-based interface. We evaluated the system through three real-world deployments: a long-term public exhibition at Expo 2025 in Osaka, Kansai, Japan; a remote educational exchange between elementary school students; and a public interaction study with general participants. During the Expo deployment, two units accumulated approximately 1131 h of operation, demonstrating both operational feasibility and maintenance challenges. In the public study, both operators and interlocutors reported positive impressions of co-presence and willingness to use the system. Interlocutors also rated the avatar positively in terms of human likeness and the transmission of emotions and intentions. The results indicate usability for general operators while suggesting room for improvement in precise controllability. These findings provide field-derived evidence and design implications for socially deployable full-body android avatars.

2606.08094 2026-06-09 cs.RO cs.AI cs.LG cs.SY eess.SY New submission

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

vla.cpp:视觉-语言-动作模型的统一推理运行时

Khanh D. Nguyen, Hung T. Ho, Chinh T. Nguyen, Thanh Q. Duong, Linh D. Le, Duy M. H. Nguyen, Vien A. Ngo, An T. Le

Affiliations * VinRobotics Center for AI Research, VinUniversity; Intelligent Autonomous Systems, TU Darmstadt; Max Planck Research School for Intelligent Systems; University of Stuttgart; German Research Center for Artificial Intelligence

AI summary: Presents vla.cpp, a portable C++ inference runtime built on llama.cpp that supports multiple VLA architectures, approaches state-of-the-art performance on LIBERO-Object using only 1.3 GiB of memory, and deploys across hardware tiers.

Details
Comments
17 pages, 3 figures, 12 tables
AI Chinese abstract

视觉-语言-动作(VLA)策略通常以Python/PyTorch堆栈形式提供,假设使用工作站级GPU,这与机器人实际运行的硬件不匹配。我们提出了vla.cpp,一个基于llama.cpp的便携式C++推理运行时。据我们所知,它是第一个原生支持流匹配和扩散VLA推理模式的ggml类引擎,其中缓存的视觉-语言前缀由交叉注意力动作专家在多个求解器步骤中消耗。单个运行时通过一个请求/响应协议服务于跨越五个骨干网络和四个动作头家族的七种架构,每个模型打包为自包含的捆绑包。在LIBERO-Object上,该引擎在200个回合中与最先进的检查点相差不到一个回合,并以1.3 GiB内存运行BitVLA达到100%成功率。相同的捆绑包在三个硬件层级上不变地运行,从消费级GPU到8 GB嵌入式模块。跨硬件屋顶线分析表明,批量大小为1的VLA推理受计算限制,因此利用率而非带宽是部署杠杆;由此分析得出的IMMA梯形GEMM将BitVLA每步延迟降低了4.5倍。然后,我们在ALOHA机械臂上设计了一个机载压力测试,隔离了学习型VLA必须在训练它的硬件上针对移动目标重新规划的延迟约束。代码、演示视频和可重复的基准测试框架可在https://fai-modelopt-tech.github.io/vla-cpp.github.io/获取。

English abstract

Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a cross-attending action expert integrated over several solver steps. A single runtime serves seven architectures spanning five backbone and four action-head families behind one request/response protocol, with each model packaged as a self-contained bundle. On LIBERO-Object, the engine matches a state-of-the-art checkpoint to within one episode out of 200, and runs BitVLA at 100% success in 1.3 GiB of memory. The same bundle runs unchanged across three hardware tiers, from a consumer GPU down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound, so utilization rather than bandwidth is the deployment lever; an IMMA ladder GEMM derived from this analysis cuts BitVLA per-step latency by 4.5x. We then frame an on-robot stress test on an ALOHA arm that isolates the latency constraint under which a learned VLA must replan against a moving target on the hardware it was trained for. Code, demo videos, and the reproducible benchmark scaffold are available at https://fai-modelopt-tech.github.io/vla-cpp.github.io/.
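
The roofline argument in the abstract can be made concrete with the standard roofline bound: attainable throughput is the smaller of the compute roof and arithmetic intensity times memory bandwidth, and a kernel is compute-bound once its intensity passes the ridge point (peak FLOP/s divided by bandwidth). The sketch below is illustrative only; the hardware numbers are hypothetical placeholders, not measurements from the paper.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound in FLOP/s for a kernel with the given FLOP-per-byte intensity."""
    return min(peak_flops, intensity * bandwidth)


def is_compute_bound(intensity: float, peak_flops: float, bandwidth: float) -> bool:
    """True when the kernel sits at or beyond the ridge point of the roofline."""
    return intensity >= peak_flops / bandwidth


# Hypothetical device: 10 TFLOP/s peak compute, 100 GB/s memory bandwidth,
# which puts the ridge point at 100 FLOP per byte.
PEAK = 10_000_000_000_000
BANDWIDTH = 100_000_000_000

high_intensity = attainable_flops(250, PEAK, BANDWIDTH)  # capped by the compute roof
low_intensity = attainable_flops(20, PEAK, BANDWIDTH)    # capped by memory bandwidth
```

For a compute-bound workload, raising memory bandwidth does not move the bound, which is why the abstract points to utilization as the deployment lever.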

2606.08091 2026-06-09 cs.CV New submission

VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation

VideoWeaver: 评估与进化智能体长视频生成技能

Jianhui Wei, Jie Tan, Hengchuan Zhu, Xiaotian Zhang, Yan Zhang, Ziyi Chen, Daoan Zhang, Wei Xu, Zuozhu Liu

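Affiliations * Zhejiang University; ByteDance

AI summary: Proposes the VideoWeaver framework, in which agents autonomously compose foundation skills to generate videos, designs an agent-as-judge to evaluate both process and outcome, and improves generation quality with a skill evolution algorithm.

Details
AI Chinese abstract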

最近的智能体框架如Claude Code、Codex和OpenClaw在工具使用和编排方面表现强劲,但它们能否处理长视频生成这一长时多模态任务仍待探索。与早期手工设计管线的视频智能体不同,这些框架可以构建和优化自己的工作流程。我们提出VideoWeaver,一个评估和进化长视频生成技能的智能体框架和基准测试,其中智能体通过将基础技能组合成自己的工作流程(而非遵循预定义管线)将单个指令转化为长视频。该基准测试包含16个任务类别和285个案例,参考信息涵盖文本、图像、音频、视频及其组合。由于错误可能出现在任何阶段而不仅仅是最终视频,我们提出一种智能体裁判,它检查执行轨迹和最终视频,并将其评分基于元数据和中间文件等证据。利用这一反馈,我们进一步设计了一种技能进化算法,用于优化和合并智能体的技能。在多个框架和模型上,我们发现显式的组合技能比单独使用基础技能更能改善生成过程,技能进化进一步提高了输出质量,并且不同框架和模型选择之间的性能差异显著。所提出的智能体裁判也与人类判断高度一致,尤其是在过程指标上。代码和数据集可在https://github.com/JianhuiWei7/VideoWeaver获取。

English abstract

Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but whether they can handle long video generation, a long-horizon multimodal task, remains underexplored. Unlike earlier video agents whose pipelines are handcrafted, these frameworks can build and refine their own workflows. We introduce VideoWeaver, an agent harness and benchmark that evaluates and evolves skills for long video generation, where an agent turns a single instruction into a long video by composing foundation skills into its own workflow rather than following a predefined pipeline. The benchmark has 16 task categories and 285 cases, with references spanning text, image, audio, video, and their combinations. Because errors can arise at any stage and not just in the final video, we propose an agent-as-judge that inspects both the execution trace and the final video, grounding its scores in evidence such as metadata and intermediate files. Using this feedback, we further design a skill evolution algorithm that refines and merges the agent's skills. Across multiple frameworks and models, we find that an explicit composition skill improves the generation process over using foundation skills alone, that skill evolution further improves output quality, and that performance varies notably across harness and model choices. The proposed agent-as-judge also aligns well with human judgments, especially on process metrics. Code and dataset are available at https://github.com/JianhuiWei7/VideoWeaver.

2606.08088 2026-06-09 cs.LG cs.CL New submission

ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning

ConSteer-RL:通过置信度感知强化学习引导大型语言模型的推理能力

Qing Miao, Yiming Zhao, Jing Yang, Chenxi Liu, Yuehai Chen, Yuewen Liu, Shaoyi Du, Badong Chen

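Affiliations * Xi'an Jiaotong University; University of Science and Technology of China

AI summary: Proposes ConSteer-RL, which integrates token-level confidence signals from model log-probabilities into GRPO; a confidence-aware reward shaping mechanism penalizes overconfident errors and reinforces correct, confident reasoning, yielding average improvements of 2.3%-4.0% across model scales.

Details
AI Chinese abstract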

基于可验证奖励的强化学习(RLVR)近期已成为提升大型语言模型(LLMs)推理能力的关键范式,但其仍受限于稀疏的二元奖励以及对模型内部不确定性的忽视。本文提出ConSteer-RL,一个简单而有效的框架,将源自模型log概率的token级置信度信号整合到RLVR训练中。具体而言,基于组相对策略优化(GRPO)框架,我们通过将每个token的概率聚合成标量置信度分数,并融入基于意识的奖励塑造机制,构建置信度感知奖励,该机制惩罚过度自信的错误,同时强化正确且自信的推理。实验结果表明,ConSteer-RL在不同模型规模上持续优于强GRPO基线,平均提升2.3%-4.0%。

English abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has recently become a key paradigm for improving the reasoning abilities of Large Language Models (LLMs), yet it remains limited by sparse binary rewards and its neglect of model-internal uncertainty. In this paper, we propose ConSteer-RL, a simple yet effective framework that integrates token-level confidence signals derived from model log-probabilities into RLVR training. Specifically, building upon the Group Relative Policy Optimization (GRPO) framework, we construct a confidence-aware reward by aggregating per-token probabilities into a scalar confidence score and incorporating it into an awareness-based reward shaping mechanism that penalizes overconfident errors while reinforcing correct and confident reasoning. Experimental results demonstrate that ConSteer-RL consistently outperforms strong GRPO baselines, achieving average improvements of 2.3%-4.0% across different model scales.
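
The reward construction described above can be sketched in three steps: aggregate token log-probabilities into one confidence score, shape the binary reward with that score, and normalize rewards within a sampled group as GRPO does. The abstract does not specify the aggregation or the shaping formula, so everything below is an assumption made for illustration: the geometric-mean confidence, the additive shaping with weight `beta`, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities, a scalar in (0, 1]."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def shaped_reward(correct: bool, confidence: float, beta: float = 0.5) -> float:
    """Bonus for confident correct answers, penalty for confident wrong ones."""
    return 1.0 + beta * confidence if correct else -beta * confidence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: rewards standardized within one sampled group."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


# One hypothetical group of four sampled responses: (is correct, token log-probs).
group = [
    (True, [math.log(0.9)] * 4),
    (True, [math.log(0.4)] * 4),
    (False, [math.log(0.9)] * 4),
    (False, [math.log(0.4)] * 4),
]
rewards = [shaped_reward(ok, sequence_confidence(lp)) for ok, lp in group]
advantages = group_advantages(rewards)
```

With this shaping, the confident correct response receives the largest advantage and the confident wrong response the smallest, so the binary reward is no longer flat within each correctness class.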

2606.08087 2026-06-09 cs.SD cs.CL New submission

Assessing the Energy and Carbon Emissions of Neural Speaker Verification Model in Training and Inference

评估神经说话人验证模型在训练和推理中的能耗与碳排放

Hugo Leguillier, Driss Matrouf, Guillaume Lechien, Mickael Rouvier

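Affiliations * LIA, UPR 4128, Avignon University; Aday

AI summary: Measures the energy consumption and carbon emissions of different ResNet architectures on VoxCeleb2 and finds that deeper or wider models bring marginal accuracy gains at sharply higher energy cost, whereas mid-sized networks such as ResNet-50 strike a good balance between performance and environmental impact.

Details
Comments
Accepted to Speaker Odyssey 2026 Lisbon
AI Chinese abstract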

深度学习说话人验证(SV)越来越依赖于深度神经网络骨干,但其环境影响仍缺乏记录。本文对在VoxCeleb2上训练的ResNet架构进行了评估,变化深度、通道宽度和阶段分布,并使用节点级传感器测量能耗和碳足迹。结果显示明显的收益递减点:更深或更宽的模型仅带来边际精度提升,而能耗急剧增长。相比之下,中等规模网络如ResNet-50和阶段集中变体在性能与环境影响之间实现了有利的权衡。这些发现为设计节能的SV系统提供了可操作的指导方针。

English abstract

Deep-learning speaker verification (SV) increasingly relies on deep neural network backbones, whose environmental impact remains largely undocumented. In this paper, we conduct an evaluation of ResNet architectures trained on VoxCeleb2, varying depth, channel width, and stage distribution, and measure energy consumption and carbon footprint using node-level sensors. Results show a clear point of diminishing returns: deeper or wider models bring only marginal accuracy gains while energy consumption grows steeply. In contrast, mid-sized networks such as ResNet-50 and stage-concentrated variants achieve favorable trade-offs between performance and environmental impact. These findings provide actionable guidelines for designing energy-efficient SV systems.

2606.08081 2026-06-09 cs.CL cs.AI New submission

Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

对齐但非伙伴特定:区分多模态LLM智能体在参考游戏中如何成功而无需类人惯例

Po-Ya Angela Wang, Chinmaya Mishra, Aslı Özyürek, Paula Rubio-Fernández, Esam Ghaleb

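Affiliations * National Taiwan University; Max Planck Institute for Psycholinguistics; Radboud University; Institut Jean Nicod

AI summary: Uses a constrained pseudo-dyad baseline to distinguish whether label alignment among multimodal LLM agents in reference games stems from partner-specific interaction or from a shared task vocabulary, finding that agents coordinate through verbose descriptions rather than compressed expressions.

Details
AI Chinese abstract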

重复参考游戏测试对话者是否用基于共享交互历史的更短、伙伴特定的惯例替换其初始长描述。先前工作表明,多模态LLM在轮次中未能变得更高效,尽管它们在使用的标签上对齐。我们如何确定这种对齐反映了伙伴特定的基础而非共享任务词汇?我们通过将有能力的多模态智能体对与来自KTH Tangrams语料库的人类对进行比较来解决这个问题。我们的新颖方法论贡献是一个受约束的伪对基线,它匹配原始指称任务结构,但打破了伙伴历史。该基线使我们能够测试观察到的标签对齐是否依赖于与特定伙伴的交互。在三个分析层面(任务能力、描述策略、对齐动态)上,我们发现了明显差异。人类通过适应减少努力,压缩描述并增加与伙伴的标签对齐。智能体反而保持固定的努力水平,从第一轮开始产生冗长的描述,标签重叠接近上限,在真实对和伪对之间统计上无法区分。因此,多模态LLM在没有惯例的情况下实现了协调,通过冗长描述而非形成人类对话特征的紧凑、依赖历史的指称表达来取得成功。

English abstract

Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align on the labels they use. How can we determine whether this alignment reflects partner-specific grounding rather than a shared task vocabulary? We address this question by comparing capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Our novel methodological contribution is a constrained pseudo-dyad baseline that matches the original referential task structure, but breaks partner history. This baseline enables us to test whether the observed label alignment depends on interaction with a specific partner. Across three analytic layers (task competence, description strategy, alignment dynamics), we find clear differences. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one, with near-ceiling label overlap that is statistically indistinguishable between real and pseudo dyads. MLLMs thus achieve coordination without convention, succeeding by verbose description rather than by forming the compact, history-dependent referring expressions characteristic of human dialogue.

2606.08078 2026-06-09 cs.SD cs.CL New submission

On Low-Bit Quantization Errors in Speaker Verification: Diagnostic and Mitigation

说话人验证中的低位量化误差:诊断与缓解

Hugo Leguillier, Driss Matrouf, Guillaume Lechien, Mickael Rouvier

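Affiliations * LIA, UPR 4128, Avignon University; Aday

AI summary: Diagnoses the effect of low-bit quantization on speaker verification through layer-wise and score-level analyses, identifies 2 bits as the critical knee point, and proposes a calibrated multi-precision cascade that approaches full-precision performance while retaining the efficiency of low-bit inference.

Details
Comments
Accepted at Speaker Odyssey 2026 Lisbon
AI Chinese abstract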

尽管低比特量化为在资源受限设备上部署说话人验证提供了实用手段,但其对说话人验证性能的影响仍知之甚少。本文通过联合逐层和得分级分析,研究了ResNet-36和ResNet-200的均匀K-means量化感知训练。我们的逐层分析突出了脆弱组件,并表明得分退化不能仅由权重失真完全解释。我们在2比特处识别出一个明显的拐点,较大的得分漂移和有害决策翻转集中在FP32阈值附近。我们的得分级分析揭示了在极端量化下得分误差产生的位置和方式。基于这些发现,我们提出了一种校准的多精度级联方法,该方法在2比特下解决大多数试验,仅升级模糊情况,实现了接近FP32的性能,同时以显著降低的计算和内存成本保留了低位推理的效率优势。

English abstract

Although low-bit quantization provides a practical means to deploy speaker verification on resource-constrained devices, its effects on verification performance remain poorly understood. In this paper, we study uniform K-means quantization-aware training of ResNet-36 and ResNet-200 through joint layer-wise and score-level analyses. Our layer-wise analysis highlights fragile components and shows that score degradation is not fully explained by weight distortion alone. We identify a clear knee point at 2 bits, with larger score drift and harmful decision flips concentrated near the FP32 threshold. Our score-level analysis reveals where and how score errors emerge under extreme quantization. Building on these findings, we propose a calibrated multi-precision cascade that resolves most trials at 2 bits and escalates only ambiguous cases, achieving performance close to FP32 while preserving the efficiency benefits of low-bit inference with substantially lower compute and memory costs.
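
The cascade idea, resolving clear trials with the cheap model and escalating only those near the decision threshold, can be sketched as follows. This is a simplified illustration: the function name `cascade_decision`, the symmetric `margin` band, and the toy scores are assumptions made here, and the score calibration step that the abstract mentions is not modeled.

```python
from typing import Callable, Tuple


def cascade_decision(
    trial: str,
    score_low: Callable[[str], float],
    score_high: Callable[[str], float],
    threshold: float,
    margin: float,
) -> Tuple[bool, str]:
    """Return (accept?, which model decided) for one verification trial."""
    low = score_low(trial)
    if abs(low - threshold) >= margin:
        # The low-bit score is far from the threshold: decide cheaply.
        return low >= threshold, "low"
    # Ambiguous trial: pay for the higher-precision model.
    return score_high(trial) >= threshold, "high"


# Hypothetical scores from a 2-bit model and a full-precision model.
low_scores = {"a": 0.90, "b": 0.52, "c": 0.10}
high_scores = {"a": 0.88, "b": 0.45, "c": 0.12}

results = {
    t: cascade_decision(t, low_scores.get, high_scores.get, threshold=0.5, margin=0.05)
    for t in low_scores
}
```

Trial `b` is the interesting case: the low-bit score would accept it, but it falls inside the ambiguity band, so the full-precision score decides and rejects it. The width of the band trades escalation cost against the risk of decision flips near the threshold.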

2606.08077 2026-06-09 cs.CL New submission

Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics

支持向量评分准则:弥合自生成与人工评分准则之间的差距

Mengyuan Sun, Yu Li, Zhuohao Yu, Shikun Zhang, Wei Ye

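Affiliations * National Engineering Research Center for Software Engineering, Peking University; University of Science and Technology of China

AI summary: To address self-generated rubrics lagging behind human-annotated ones on hard instances, proposes SVR, which recasts rubric construction as max-margin boundary learning over preference data; contrastive feature mining, a prompt-conditioned selector, and iterative refinement substantially narrow the gap to human rubrics and show broad reward modeling capability.

Details
AI Chinese abstract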

基于评分准则的评估是评判大语言模型(LLM)输出的一种有前景的范式,然而在困难实例上,自生成准则落后于人工标注的准则。我们认为这一判别差距反映了目标不匹配:自生成准则描述好的回答,而有效的准则必须区分相近的候选。为弥合这一差距,我们引入SVR(支持向量评分准则),一个将准则构建重新表述为偏好数据上的最大间隔边界学习的框架。SVR从偏好对中挖掘对比特征存入准则库,学习一个提示条件化的选择器以及全局准则权重,并通过支持对选择和对抗性探测困难负例来迭代优化准则库。在推理时,仅给定提示,SVR从库中检索顶级准则并对回答进行评分。在RubricBench上,SVR将差距从24.1分缩小到0.3分,并优于强自生成准则和评判基线,且学习到的准则库无需重新训练即可跨评判迁移。在RewardBench 1&2和RM-Bench上,它仍与专用奖励模型保持竞争力,展示了更广泛的奖励建模能力。总体而言,边界定义的准则为弥合LLM评估中的判别差距提供了一条原则性路径。

English abstract

Rubric-based evaluation is a promising paradigm for judging large language model (LLM) outputs, yet self-generated rubrics lag human-annotated criteria on hard instances. We argue this discriminative gap reflects an objective mismatch: self-generated rubrics describe good responses, whereas effective criteria must discriminate between close candidates. To close this gap, we introduce SVR (Support Vector Rubrics), a framework that recasts rubric construction as max-margin boundary learning over preference data. SVR mines contrastive features from preference pairs into a rubric bank, learns a prompt-conditioned selector together with global rubric weights, and iteratively refines the bank through support-pair selection and adversarial probing of hard negatives. At inference, given only the prompt, SVR retrieves the top-$k$ rubrics from the bank and scores responses. On RubricBench, SVR narrows the gap to human reference rubrics from 24.1 to 0.3 points and outperforms strong self-rubric and judge baselines, and the learned bank transfers across judges without retraining. On RewardBench 1&2 and RM-Bench, it remains competitive with dedicated reward models, demonstrating broader reward modeling capability. Overall, boundary-defining rubrics offer a principled route to closing the discriminative gap in LLM evaluation.
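As a rough illustration of what max-margin learning over preference pairs can look like, the sketch below uses a generic pairwise hinge loss on weighted rubric scores. It is not the SVR formulation, which the abstract does not specify: the per-rubric feature vectors, the `margin` of 1.0, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def rubric_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Weighted sum of per-rubric satisfaction features for one response."""
    return sum(w * f for w, f in zip(weights, features))


def pairwise_hinge(
    weights: Sequence[float],
    chosen: Sequence[float],
    rejected: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Hinge loss: zero once the chosen response beats the rejected one by `margin`."""
    gap = rubric_score(weights, chosen) - rubric_score(weights, rejected)
    return max(0.0, margin - gap)


def violating_pairs(
    weights: Sequence[float],
    pairs: Sequence[Tuple[Sequence[float], Sequence[float]]],
    margin: float = 1.0,
) -> List[int]:
    """Indices of preference pairs with non-zero hinge loss (margin violations)."""
    return [
        i
        for i, (chosen, rejected) in enumerate(pairs)
        if pairwise_hinge(weights, chosen, rejected, margin) > 0.0
    ]
```

In this generic setting, the pairs returned by `violating_pairs` are the ones that still shape the decision boundary, which is one plausible reading of selecting support pairs.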

2606.08071 2026-06-09 cs.CL New submission

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

SurgiQ: 用于评估大语言模型手术理解的大规模多领域基准

Ayah Al-Naji, Edoardo Fazzari, Saif Alkindi, Hamdan Alhadhrami, Preslav Nakov, Cesare Stefanini

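Affiliations * Mohamed bin Zayed University of Artificial Intelligence

AI summary Proposes the SurgiQ benchmark of 13,055 multiple-choice questions covering six surgical domains and four question formats for evaluating the surgical reasoning of LLMs. Experiments show that the best model reaches only 68.1% accuracy and that general-purpose models outperform most biomedical models, indicating that current medical specialization does not adequately cover surgical knowledge.

Details
AI Chinese abstract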

大语言模型在外科领域的可靠评估仍不成熟。广泛的医学基准测试临床知识,而手术需要程序性推理、管理权衡、否定处理以及在合理手术决策中的选择。我们提出SurgiQ,一个纯文本、基于来源的基准,包含13,055道四选一多选题,涵盖六个外科领域和四种题型:基于案例、推理、最佳选项和否定题。SurgiQ通过多阶段生成、验证和专家审核流程,从外科教科书、开放获取论文和考试材料构建。我们在统一的log-likelihood协议下评估了35个开源权重LLM。结果显示仍有很大提升空间:较小模型通常接近25%的随机基线,而最佳模型达到68.1%的准确率。通用模型,尤其是Qwen2.5,优于大多数生物医学模型,表明当前的医学专业化尚未提供足够广泛的外科覆盖。校准和错误分析进一步表明,即使是强模型也会在临床合理的干扰项上犯自信的错误,这促使进行更可靠和更广泛的外科LLM评估。

English abstract

Reliable evaluation of large language models in surgery remains underdeveloped. Broad medical benchmarks test clinical knowledge, while surgery requires procedural reasoning, management trade-offs, negation handling, and selection among plausible operative decisions. We present SurgiQ, a text-only, source-grounded benchmark of 13,055 four-option multiple-choice questions spanning six surgical domains and four question formats: case-based, reasoning, best-option, and negative. SurgiQ is constructed from surgical textbooks, open-access papers, and examination material using a multi-stage generation, verification, and expert-audit pipeline. We evaluate 35 open-weight LLMs under a unified log-likelihood protocol. Our results show substantial remaining headroom: smaller models often remain near the 25% random baseline, while the best model reaches 68.1% accuracy. General-purpose models, especially Qwen2.5, outperform most biomedical models, suggesting that current medical specialization does not yet provide sufficiently broad surgical coverage. Calibration and error analysis further show that even strong models make confident mistakes on clinically plausible distractors, motivating more reliable and broader surgical LLM evaluation.
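A log-likelihood protocol for four-option questions is commonly scored by picking the option to which the model assigns the highest log-likelihood. The sketch below assumes that convention. The abstract does not state whether SurgiQ normalizes by length or how ties are handled, so those details and the function names are assumptions.

```python
from typing import Sequence


def pick_option(option_logliks: Sequence[float]) -> int:
    """Index of the answer option with the highest log-likelihood (first one on ties)."""
    return max(range(len(option_logliks)), key=lambda i: option_logliks[i])


def accuracy(all_logliks: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of questions whose top-scoring option matches the gold index."""
    correct = sum(
        pick_option(logliks) == answer for logliks, answer in zip(all_logliks, gold)
    )
    return correct / len(gold)
```

With four options per question, uniform guessing is correct with probability 1/4, which is the 25% random baseline quoted in the abstract.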