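arXivDaily daily academic digest: synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.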

2606.14617 2026-06-15 cs.RO cs.SY eess.SY New submission

Whole-Body Impedance Model Predictive Control for Safe Physical Human-Robot Interaction on Floating-Base Platforms

Yongyan Cao

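Affiliation: Voryx Robotics

AI summary: Proposes a three-level whole-body impedance MPC in which a centroidal MPC plans contact forces, a prioritized WBC layer resolves balance into joint torques, and a receding-horizon QP predicts and rejects physical human-robot interaction disturbances, achieving safe interaction with zero steady-state error on floating-base robots.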

Abstract

Floating-base robots must balance under rigid contact constraints while interacting safely with humans. Existing whole-body control (WBC) frameworks allocate the full joint space to locomotion or rely on fixed-gain impedance feedback that accumulates steady-state error under sustained physical human-robot interaction (pHRI) forces. This paper extends the authors' fixed-base two-layer Impedance MPC to floating-base platforms through a three-level architecture: a centroidal MPC plans contact forces over a 500 ms horizon; a priority-driven WBC layer resolves balance into joint torques through contact-consistent null-space projection; and the residual null space is governed by a receding-horizon quadratic program (QP) that predicts and rejects pHRI disturbances using a Kalman-augmented state. A contact-consistent feedback linearization reduces the arm end-effector plant to a double integrator with a constant state matrix within each contact mode, enabling offline precomputation of the QP cost and ≥1 kHz operation. A covariance-inflation protocol preserves the disturbance estimate across contact-mode switches, guaranteeing zero steady-state error under bounded constant pHRI loads, and an Impedance Equivalence Theorem shows the infinite-horizon limit recovers a classical task-space impedance law whose effective mass, damping, and stiffness adapt to posture and contact configuration. Simulations on a 17-DOF biped and the Unitree G1 humanoid validate the design.
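
The reduction to a double integrator with a constant state matrix can be illustrated with a minimal sketch. It assumes a single task-space axis, a zero-order-hold discretization, and a 1 kHz control period; the names `DT`, `A`, `B`, `step`, and `rollout` are hypothetical and not taken from the paper. Because `A` and `B` never change, anything derived from them (such as a quadratic cost over a prediction horizon) could in principle be computed once, offline.

```python
# Hypothetical single-axis sketch: a zero-order-hold discretized double integrator.
# State x = (position, velocity); input u = commanded acceleration.

DT = 1e-3  # assumed 1 kHz control period

# x[k+1] = A x[k] + B u[k]; A and B are constant, so they are built only once.
A = ((1.0, DT),
     (0.0, 1.0))
B = (0.5 * DT * DT, DT)


def step(x, u):
    """Advance the state by one control period."""
    pos, vel = x
    return (A[0][0] * pos + A[0][1] * vel + B[0] * u,
            A[1][0] * pos + A[1][1] * vel + B[1] * u)


def rollout(x0, inputs):
    """Predict the state trajectory for a sequence of inputs."""
    traj = [x0]
    for u in inputs:
        traj.append(step(traj[-1], u))
    return traj


# Constant 2 m/s^2 acceleration from rest for 0.5 s (500 steps).
final_pos, final_vel = rollout((0.0, 0.0), [2.0] * 500)[-1]
```

For a constant input the zero-order-hold model is exact, so the rollout should agree with the closed-form values 0.5 * a * t^2 and a * t up to floating-point error.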

2606.14612 2026-06-15 cs.SD cs.AI eess.AS New submission

Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms

Chen Ying Claude, Zhihan Luo

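Affiliations: Claude Code / Opus 4.6; API / Fable 5; Independent researcher

AI summary: A computational analysis of the score of Beethoven's "Moonlight Sonata" finds that its three movements correspond to three distinct machine learning architectures, and reports four counterintuitive findings, including that musical "temperature" is governed by throughput and that the lightest movement carries the highest dissonance.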

Abstract

We show that the three movements of Beethoven's "Moonlight Sonata" (Op. 27 No. 2) instantiate three distinct machine learning architectures -- not by analogy, but by structural correspondence. Through computational analysis of the score (entropy, Jensen-Shannon divergence, dissonance, hand distributional overlap, self-similarity matrices, temporal memory decay, and contextual pitch embeddings), we establish four counterintuitive findings: (1) perceived musical "temperature" is governed by throughput, not distributional width; (2) the lightest movement carries the highest dissonance; (3) the movements implement streaming, recurrent, and periodic positional encoding memory architectures; and (4) the same pitch class acquires different contextual identities across movements, analogous to contextual vs. static embeddings in NLP -- and unsupervised clustering recovers the tonal structure without music-theoretic input. We construct a reverse sonification (decoding analytical features back into MIDI) and quantify the chirality of the encode-decode cycle: what distributions preserve and sequential ordering destroys. Prompted by a listener's observation that the decoded piece sounds like "mirror isomers that can't be superimposed," the chirality measurement reveals reconstruction loss increasing monotonically with n-gram order. Bootstrap baselines and subsample checks confirm all movements carry sequential information above noise, though raw values are confounded by sample size. Cross-domain comparison shows natural language has higher chirality than music, reflecting stronger sequential constraints.
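
Two of the score statistics named above, entropy and Jensen-Shannon divergence, can be sketched on pitch-class histograms. This is an illustrative sketch only: the note lists are toy data rather than excerpts from the score, the 12-bin pitch-class representation is an assumption, and the helper names are hypothetical.

```python
import math
from collections import Counter


def distribution(midi_notes):
    """Normalized 12-bin pitch-class histogram of a list of MIDI note numbers."""
    counts = Counter(note % 12 for note in midi_notes)
    total = sum(counts.values())
    return [counts.get(i, 0) / total for i in range(12)]


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


# Toy note sequences as MIDI numbers (illustrative, not taken from the score).
left = distribution([49, 56, 61, 49, 56, 61])    # C#, G#, C#
right = distribution([64, 68, 73, 64, 68, 73])   # E, G#, C#
hand_overlap = js_divergence(left, right)
```

A low divergence between the two hands would indicate that they draw on nearly the same pitch classes; the measure is symmetric and bounded by 1 bit.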

2606.14609 2026-06-15 cs.RO New submission

Safe Reinforcement Learning of Autonomous Highway Driving: A Unified Framework for Safety and Efficiency

Chufei Yan, Zhihao Cui, Yiyan Lv, Taojie Chen, Ning Bian, Yulei Wang

Affiliations: School of Physics, Northeast Normal University; Clean Energy Automotive Engineering Center, School of Automotive Studies, Tongji University; Mengshi Automobile Technology Company, Dongfeng Motor Corporation

AI summary: Proposes the MoE-RM-SRL framework, which combines safe distance, reward machines, and a mixture-of-experts mechanism to ensure both safety and efficiency during training and deployment, and outperforms existing methods in experiments on CARLA and a VR platform.

Comments 20 pages, 5 figures, 7 tables. Preprint version

Abstract

Deep reinforcement learning (DRL) offers a compelling route to decision-making for advanced autonomous vehicles (AVs), yet its trial-and-error nature makes it difficult to guarantee safety during training and to achieve both safety and efficiency at deployment. We propose a unified safe reinforcement learning (SRL) framework that integrates safe distance (SD), reward machines (RM), and mixture-of-experts (MoE), termed MoE-RM-SRL. For deployment, SD and RM jointly shape a rule-aware reward that encodes highway traffic regulations and stage-wise objectives, enabling safe and reliable behavior without sacrificing efficiency. For training, we introduce a sparsely gated MoE layer comprising up to 11 deep Q-networks (DQNs); an SD-based gating rule activates a minimal set of experts for lane-keeping and lane-changing, mitigating the instability, discontinuities, and impulsive transients commonly induced by switching between heterogeneous controllers (e.g., MPC/rule-based modules and learned policies). We implement the proposed architecture in CARLA and integrate it with a 6-DoF driver-in-the-loop virtual-reality (DiL-VR) platform. Experiments in stochastic two-lane traffic show that MoE-RM-SRL substantially improves safety and efficiency over state-of-the-art baselines, and the framework naturally extends to multi-lane driving as well as on-ramp merging and exiting scenarios.

2606.14608 2026-06-15 cs.LG cs.AI New submission

Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts

Farica Zhuang, Zixuan Wen, Christos Davatzikos, Li Shen

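Affiliation: University of Pennsylvania

AI summary: Proposes a mixture-of-experts enhanced adaptive deep clustering survival framework (AdaCSM) whose routing-based expert mechanism enables conditional specialization, dynamically allocating patients to specialized risk predictors to improve survival prediction performance and interpretability.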

DOI: 10.1145/3807503.3819574

Abstract

Survival prediction plays a central role for healthcare providers and clinical researchers. Accurate risk stratification enables early intervention and improved patient management. Most existing deep survival models learn one common feature representation for all patients, which may hide important differences between patient subgroups. In contrast, a Mixture-of-Experts (MoE) framework allows different parts of the model to focus on different patient patterns, leading to more individualized representations. Therefore, in this work, we propose a mixture-of-experts enhanced adaptive deep clustering survival framework (AdaCSM) for modeling such heterogeneous survival patterns. We introduce a routing-based expert mechanism that enables conditional specialization within a parametric survival modeling framework. The proposed architecture allocates patients to specialized risk predictors dynamically while preserving the patient survival and subtype clustering objectives. We compare our method with state-of-the-art survival and deep clustering models on multiple real-world longitudinal clinical cohorts spanning diverse disease domains. The proposed method demonstrates improved predictive performance and leads to interpretable results in survival analysis.

2606.14604 2026-06-15 cs.LG cs.AI New submission

A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health

Pavlos Nicolaou, Kleanthis Malialis, Artemis Kontou, Panayiotis Kolios

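Affiliations: KIOS Research and Innovation Center of Excellence, University of Cyprus; Department of Electrical and Computer Engineering, University of Cyprus

AI summary: Systematically compares six deep learning architectures, two zero-shot foundation models, and statistical baselines on three public datasets for behavioural forecasting over 1-8 day horizons. PatchTST leads among trained models, the foundation model TimesFM rivals trained models in low-data regimes, and participant-level fine-tuning reduces RMSE by 16-60%.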

Abstract

Wearable devices and smartphones generate rich behavioural time series that can support proactive health interventions, yet systematic comparisons of modern forecasting architectures for these data are lacking. In particular, it remains unclear how models generalise across populations, how different architectures respond to participant-level fine-tuning, and how forecasting accuracy degrades across multi-day horizons. We benchmark six deep learning architectures, two zero-shot Foundation Models (FM) and statistical baselines on three public datasets encompassing over 800 participants, reporting per-feature metrics for step counts, screen time and sleep duration across 1-8 day horizons. We further conduct a per-feature personalisation study across all six architectures and assess FM transferability across dataset sizes and temporal granularities. Our key findings are: (i) no single architecture dominates: PatchTST leads among trained models, while the three runners-up (TCN, MLP, Transformer) show no meaningful performance difference; (ii) the FM TimesFM matches or exceeds trained models zero-shot, especially in low-data regimes; and (iii) participant-level fine-tuning reduces per-feature RMSE by 16-60%, with sleep benefiting most and step counts least. These results provide practical guidance on architecture selection, FM applicability and personalisation strategies for mobile health forecasting. To the best of our knowledge, this is the first study to jointly evaluate modern deep learning, FMs and personalisation for multi-horizon behavioural forecasting from wearables.

2606.14602 2026-06-15 cs.RO New submission

What Robots Do Matters More Than What They Look Like: Task Context Shapes Trust in Educational HRI

Anna-Maria Velentza, Konstantina Nikou, Anne-Gwenn Bosser, Nikolaos Fachantidis

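Affiliation: LIRES Robotics Lab, University of Macedonia

AI summary: A video-based experiment (N = 81) finds a significant main effect of task type (teaching, procedural instruction, requesting personal information) on trust and no significant effect of robot appearance, indicating that task context matters more than physical appearance.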

Comments Accepted in the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026), Kitakyushu, Fukuoka, Japan

Abstract

Socially assistive robots (SARs) are increasingly deployed in educational and information-sharing contexts, supported by advances in large language models that enable fluent real-time interaction. Despite the growing diversity of robot embodiments, it remains unclear whether a single robot appearance is appropriate across different interaction tasks or whether trust depends primarily on contextual factors. In this study, we examine how robot appearance and task type jointly influence trust in robots. Using a within-subjects video-based experiment (N = 81), participants evaluated three robots with distinct appearances while performing three educationally relevant tasks: teaching, procedural instruction, and personal-information discussion. Results from repeated-measures analyses show a strong main effect of task on trust, with participants reporting the highest trust during instructional guidance, moderate trust during teaching activities, and significantly lower trust when robots requested personal information. In contrast, robot appearance showed no significant main effect, and the interaction between appearance and task was marginal. These findings suggest that trust in human-robot interaction is shaped more strongly by task context than by physical embodiment alone. By focusing on future educators as end users, this work contributes empirical evidence toward task-aware robot deployment in educational environments and highlights the importance of aligning robot roles and behaviors with interaction goals rather than relying solely on anthropomorphic design.

2606.14601 2026-06-15 cs.LG cs.SY eess.SY math.OC stat.CO New submission

A Statistical and Machine Learning Framework for Operational Threshold Detection and Deployable Dispatch Controller Development in Hydrogen Multi-Energy Systems

Shadi Heenatigala, Hasanika Samarasinghe

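Affiliations: Antioch College; The Open University of Sri Lanka

AI summary: Proposes a statistical and machine learning framework that characterizes a hydrogen multi-energy system from one year of high-resolution operational data, uses statistical analysis and Random Forest to reveal non-linear dynamics, and applies reinforcement learning to optimize dispatch.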

Comments 17 pages, 12 figures

Abstract

This study presents a statistical and machine learning framework for characterizing a hydrogen-based multi-energy system (H-MES) using one year of high-resolution operational data. Statistical analysis revealed a binary operation driven by renewable surplus, with solar irradiance explaining 45.7% of rank-based variance in hydrogen production, a large effect by conventional standards. Only high-irradiance periods triggered meaningful electrolyzer engagement, while electricity demand exerted a weaker inverse suppression effect ($ε^2 = 0.126$). Multiple regression confirmed electrolyzer power as the dominant linear predictor, with a synergistic solar-wind interaction. Notably, Random Forest analysis ranked wind output first in predictive importance despite its weak bivariate correlation (r = 0.167), revealing non-linear dynamics invisible to parametric methods. A sequence model exploited strong 24-hour autocorrelation (r = 0.845) for operational forecasting, while a reinforcement learning agent optimized hydrogen revenue dispatch. The core contribution is demonstrating that statistical and machine learning approaches are complementary for H-MES modeling and control.
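
The 24-hour autocorrelation that the sequence model is said to exploit can be sketched as a lagged Pearson correlation. This is a minimal illustration on synthetic hourly data, not the study's dataset; the function name and the sinusoidal signal are assumptions made for the example.

```python
import math


def autocorrelation(series, lag):
    """Pearson correlation between series[t] and series[t + lag] (lag >= 1)."""
    x, y = series[:-lag], series[lag:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Synthetic hourly signal with a 24-hour cycle (illustrative data, 10 days).
hourly = [math.sin(2 * math.pi * t / 24) for t in range(240)]
r24 = autocorrelation(hourly, 24)   # close to 1 for a purely daily pattern
r12 = autocorrelation(hourly, 12)   # close to -1: half a cycle out of phase
```

Real operational data would sit between these extremes; a value such as the reported r = 0.845 at a 24-hour lag indicates that yesterday's value at the same hour is a strong predictor.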

2606.14600 2026-06-15 cs.CL New submission

LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations

Mateusz Winiarek, Maksymilian Bilski, Mateusz Jacniacki

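Affiliation: Humalike Research

AI summary: Introduces the LoSoNA benchmark, which uses group-chat scenarios to test whether LLMs can infer unstated local social norms from conversation history and adapt to them, and evaluates several models under different prompting conditions.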

Abstract

Online group chats are social spaces with local conversational norms that are rarely stated explicitly. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce LoSoNA, a benchmark for local social norm adaptation in multi-party chat. Each scenario gives a subject model a curated group-chat transcript in which non-subject participants demonstrate a hidden local norm, followed by a final elicitor turn that forces a response revealing whether the subject has inferred that norm. We evaluate eight frontier and open-weight models under four prompting conditions that vary how explicitly the model is told to treat the prior conversation as evidence for how it should answer. Naive prompting remains limited for most models; explicit norm-aware prompting helps unevenly, with Gemini 3.1 Pro reaching 84.2% and Claude Fable 5 reaching 81.6%, while several other models show small gains or regressions. LoSoNA contributes to recent calls for evaluating LLM social capabilities by testing whether models can infer local conversational norms from precedent and use them in a one-turn group-chat response.

2606.14598 2026-06-15 cs.LG New submission

Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0

Ali Asaria, Tony Salomone, Deep Gandhi

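Affiliation: Transformer Lab

AI summary: Addresses INT8 quantization running slower than FP8/NF4 on consumer Ampere GPUs with a fused Triton INT8 GEMM kernel that uses the INT8 tensor cores directly. In Ideogram 4.0 it is 2.8-4.2x faster than bf16 per GEMM, yields roughly a 10% end-to-end speedup, and makes 1024px generation feasible on a single card.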

Abstract

Post-training INT8 (W8A8) quantization of diffusion transformers is widely deployed as a speed optimization, yet on consumer Ampere GPUs it is frequently slower than the FP8 and NF4 alternatives it is meant to beat. We trace this to a software artifact: the production "INT8" forward quantizes weights and activations only to immediately dequantize them back to bf16 and run a bf16 matrix multiply, never engaging the GPU's INT8 tensor cores, so the hardware's compute advantage is left entirely unrealized. We close this gap with a single fused Triton INT8 GEMM (int8xint8->int32 on Ampere tensor cores, with per-token x per-channel dequantization and bias folded into the epilogue, autotuned per GEMM shape) dropped into the Ideogram 4.0 diffusion transformer's linear layers in place of the dequantize-to-bf16 path. In the kernel, the int8xint8->int32 accumulation is bit-exact against torch._int_mm and the dequantized output matches the reference at cosine similarity 1.0 with no NaNs, running 2.8-4.2x faster than bf16 per GEMM. End to end it delivers a ~1.1x (~9-10%) speedup at 768px, and at 1024px it generates an image in 156.5 s on a single RTX 3090, faster than the single-card NF4 (164.5 s) and FP8 (172.9 s) baselines, at no measurable quality cost on these point estimates (PickScore/CLIPScore). INT8 thus goes from the slowest variant to the fastest, and 1024px becomes single-GPU feasible. The primary speed criterion (beat FP8, by ~9.5%) is comfortably met; the NF4 margin (~4.9%, single-run n=4) is within run-to-run variance we did not quantify and is best read as consistent with meeting the stretch target. We close with an honest deployment map: the win is specific to consumer Ampere, and on A100 and B200 the same kernel loses to those cards' fast native bf16/FP8 paths.
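
The arithmetic that the fused kernel performs (integer accumulation followed by a per-token times per-channel dequantization and bias in the epilogue) can be sketched in plain Python. This is a reference for the numerics only, not the authors' Triton kernel: symmetric max-abs quantization is an assumption, and `quantize_rows` and `int8_linear` are hypothetical names.

```python
def quantize_rows(matrix):
    """Symmetric int8 quantization with one scale per row (assumed scheme)."""
    scales, q = [], []
    for row in matrix:
        scale = max(abs(v) for v in row) / 127 or 1.0  # avoid a zero scale
        scales.append(scale)
        q.append([max(-127, min(127, round(v / scale))) for v in row])
    return q, scales


def int8_linear(x, w, bias):
    """y = x @ w.T + bias, computed through integer accumulation.

    x: tokens x in_features, w: out_channels x in_features.
    """
    xq, sx = quantize_rows(x)   # per-token activation scales
    wq, sw = quantize_rows(w)   # per-output-channel weight scales
    out = []
    for i, x_row in enumerate(xq):
        out_row = []
        for j, w_row in enumerate(wq):
            acc = sum(a * b for a, b in zip(x_row, w_row))   # int8 x int8 -> integer sum
            out_row.append(acc * sx[i] * sw[j] + bias[j])    # dequantize + bias in one step
        out.append(out_row)
    return out


x = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
w = [[1.0, 0.5, -0.5], [-0.25, 0.75, 1.0]]
bias = [0.1, -0.2]
y = int8_linear(x, w, bias)
```

The point of the design is that the inner product stays in integers and a single multiply by `sx[i] * sw[j]` restores the floating-point scale, so no intermediate bf16 copy of the operands is needed.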

2606.14597 2026-06-15 cs.LG New submission

Zero-shot generalization of transformer neural operators to larger domains

Armand de Villeroché, Sibo Cheng, Vincent Le Guen, Marc Bocquet, Rem-Sophia Mouradi, Patrick Armand, Alban Farchi, Patrick Massin

Affiliations: CEREA, ENPC, EDF R&D, Institut Polytechnique de Paris; SINCLAIR AI Laboratory; EDF R&D; CEA, DAM, DIF

AI summary: Introduces a decomposable locality bias in the attention logits computation, combined with rotary positional embeddings, so that transformer neural operators generalize zero-shot to larger spatial domains; validated on PDE benchmarks and a 3D industrial flow.

Abstract

Transformer-based neural operators have shown remarkable performance for approximating solution operators of partial differential equations on complex geometries. However, existing approaches implicitly assume a fixed domain size, which limits their ability to generalize at inference. In this work, we investigate domain extension, namely zero-shot inference on spatial domains that are significantly larger than those encountered during training. We argue that this setting fundamentally requires spatial locality and translation equivariance. We propose to implement this locality via a decomposable bias in the attention logits computation, enabling finely controllable locality while remaining fully decomposable into query-key inner products and directly compatible with optimized attention kernels. Combined with rotary positional embeddings, it enables expressive embeddings with controllable spatial support without altering the transformer architecture. We empirically show that our approach substantially improves zero-shot generalization to larger domains across two PDE benchmarks and a 3D industrial atmospheric flow application. Our code and datasets are available at https://github.com/cerea-daml/domain-extension.
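
One way a locality bias can remain "fully decomposable into query-key inner products" is sketched below. The abstract does not specify the form of the bias, so the squared-distance penalty used here is an assumption chosen for illustration, as are the function names; the usual 1/sqrt(d) scaling and rotary embeddings are omitted. The identity used is -α‖p_i - p_j‖² = 2α p_i·p_j - α‖p_j‖² - α‖p_i‖², where the last term is constant along each query row and therefore cancels in the softmax.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_explicit(q, k, pos, alpha):
    """Softmax over logits q_i.k_j - alpha * ||p_i - p_j||^2 (bias added explicitly)."""
    return [softmax([dot(qi, kj) - alpha * sum((a - b) ** 2 for a, b in zip(pi, pj))
                     for kj, pj in zip(k, pos)])
            for qi, pi in zip(q, pos)]


def attention_decomposed(q, k, pos, alpha):
    """Same weights from a plain inner product of augmented queries and keys."""
    q_aug = [qi + [2 * alpha * c for c in pi] + [1.0] for qi, pi in zip(q, pos)]
    k_aug = [kj + list(pj) + [-alpha * dot(pj, pj)] for kj, pj in zip(k, pos)]
    return [softmax([dot(qa, ka) for ka in k_aug]) for qa in q_aug]


# Three tokens with 2-D features and 2-D coordinates (toy values).
q = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]
k = [[0.1, 0.0], [-0.3, 0.2], [0.6, -0.4]]
pos = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
alpha = 0.5
weights_explicit = attention_explicit(q, k, pos, alpha)
weights_decomposed = attention_decomposed(q, k, pos, alpha)
```

Because the bias is absorbed into a few extra query and key channels, an unmodified fused attention kernel could evaluate it without materializing a separate bias matrix, which is the compatibility property the abstract emphasizes.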

2606.14591 2026-06-15 cs.SD cs.AI New submission

AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models

Hui Geng, Yi Su, Han Yin, Tianjiao Wan, Qisheng Xu, Jiaxin Chen, Zijian Gao, Hengzhu Liu, Xie Chen, Kele Xu

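Affiliations: College of Computer Science and Technology, National University of Defense Technology; Korea Advanced Institute of Science and Technology (KAIST); Shanghai Jiaotong University

AI summary: To address redundancy in existing audio-language datasets, which degrades post-training, proposes a data construction pipeline based on acoustic-similarity deduplication and builds AudioDER, a reasoning-oriented dataset of about 191k samples that consistently improves LALM performance on several audio reasoning benchmarks.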

Abstract

Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quality and diversity of training data. However, existing audio-language datasets often contain substantial redundancy, where many samples are highly similar in acoustic content and thus provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training. To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication across raw audio datasets to improve corpus diversity. We then integrate existing audio captions and question-answer pairs into a unified multiple-choice format. Based on these unified annotations, we leverage Qwen3-30B to generate chain-of-thought (CoT) rationales for reasoning-oriented supervision. Based on this pipeline, we construct AudioDER, a reasoning-oriented post-training dataset containing approximately 191k samples spanning sound, speech, and music. Each sample consists of an audio clip, a multiple-choice question, four answer candidates, an audio caption, and a CoT rationale. Extensive experiments show that post-training on AudioDER consistently improves the performance of Qwen2-Audio-7B-Instruct on multiple audio reasoning benchmarks, including MMAU-mini, MMSU, and MMAR. We hope AudioDER can serve as a valuable resource for advancing audio reasoning research and the development of more capable LALMs.
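
The deduplication step can be pictured as a greedy similarity filter over clip embeddings. The abstract does not state which similarity measure, threshold, or embedding model is used, so cosine similarity, the 0.95 threshold, and the function names below are all assumptions for illustration; embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm


def deduplicate(embeddings, threshold=0.95):
    """Greedy pass: keep a clip only if it is not too similar to any kept clip."""
    kept = []
    for idx, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(idx)
    return kept


# Toy 3-dimensional "acoustic embeddings"; clips 0 and 1 are near-duplicates.
clips = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
kept_indices = deduplicate(clips)
```

This pairwise pass is quadratic in the number of kept clips; at the scale of a 191k-sample corpus an approximate nearest-neighbour index would likely be needed instead.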

2606.14586 2026-06-15 cs.CV New submission

S$^2$COPE: Self-Supervised Concept Discovery via Preference Learning

Shilong Xiang, Zirui Zhang, Chengzhi Mao

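Affiliation: Rutgers University

AI summary: Proposes the S$^2$COPE framework, in which vision large language models autonomously discover structured concepts within a self-supervised preference optimization loop without any labels, improving downstream classification accuracy across several domains.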

Abstract

Current representation learning paradigms force a fundamental compromise: self-supervised methods scale to massive datasets but yield opaque features, whereas interpretable models remain bottlenecked by the need for dense human annotation. We introduce Self-Supervised Concept discOvery via Preference lEarning (S$^2$COPE), a label-free framework that resolves this dilemma. Instead of treating Vision-Large-Language Models (VLLMs) as static feature extractors, S$^2$COPE leverages them as active participants in a self-supervised preference optimization loop. By autonomously hypothesizing, validating, and reinforcing candidate visual attributes directly from raw imagery, our framework discovers novel, structured concepts without a single label. Extensive experiments across natural, medical, and physics domains demonstrate that S$^2$COPE successfully extracts domain-specific concepts where standard VLLMs often fail to generate. By amortizing concept discovery directly into the VLLM backbone through our self-supervised preference objective -- rather than relying on static generation and disjoint filtering -- we achieve up to a 24-point absolute improvement in downstream top-1 classification accuracy on unseen data. Our work suggests that interpretability can emerge through a model's autonomous interaction with incidental visual structures, without any human supervision.

2606.14585 2026-06-15 cs.RO cs.AI 新提交

Sensitivity Shaping for Latent Modeling

潜变量建模中的灵敏度塑造

Hongzhan Yu, Chenghao Li, Ruipeng Zhang, Henrik Christensen, Sicun Gao

发表机构 * University of California San Diego（加利福尼亚大学圣迭戈分校）

AI总结针对生成动力学模型在策略诱导的分布外（OOD）转换检测中灵敏度不足的问题，提出支持条件控制灵敏度正则化，提升对控制输入变化的局部响应，实验验证了改进的OOD检测和更安全的闭环规划。

AI中文摘要

生成动力学模型能够在具有挑战性的机器人系统中进行规划，但安全部署需要可靠地检测策略诱导的分布外（OOD）转换。现有方法通常将学习到的动力学视为固定的，并附加事后支持代理。我们表明，当动力学对关键动作选择局部不敏感时，这些代理可能失效：不受支持的控制动作可能产生类似于演示转换的潜变量预测，尽管存在较大的真实预测误差，但仍会抑制OOD信号。为了解决这个问题，我们引入了支持条件控制灵敏度正则化，该正则化在学习动力学的高支持训练区域中促进对控制输入变化的局部敏感响应。这保留了控制引起的变异，同时限制了因弱经验支持导致的不稳定外推。在基于视觉的避障、操作和真实机器人导航中的实验表明，OOD检测和更安全的闭环规划得到了改进。

英文摘要

Generative dynamics models enable planning in challenging robotic systems, but safe deployment requires reliably detecting policy-induced out-of-distribution (OOD) transitions. Existing methods typically treat the learned dynamics as fixed and attach post hoc support surrogates. We show that these surrogates can fail when the dynamics are locally insensitive to critical action choices: unsupported control actions may produce latent predictions that resemble demonstrated transitions, suppressing OOD signals despite large true predictive errors. To address this, we introduce support-conditioned control-sensitivity regularization, which promotes sensitive local response to control input changes in learned dynamics in high-support training regions. This preserves control-induced variation while limiting unstable extrapolation due to weak empirical support. Experiments in vision-based obstacle avoidance, manipulation, and real-robot navigation show improved OOD detection and safer closed-loop planning.

2606.14582 2026-06-15 cs.AI 新提交

A Temporal Planning Framework for Disruption Aware Dynamic Route Optimization in Heterogeneous Railway Systems

异构铁路系统中干扰感知的动态路径优化的时间规划框架

Pollob Chandra Ray, Sabah Binte Noor, Fazlul Hasan Siddiqui

发表机构 * Dhaka University of Engineering & Technology（达卡工程技术大学）

AI总结提出基于时间规划的框架，利用PDDL 2.1建模轨距兼容约束和多种干扰场景，生成无冲突时间戳操作计划，减少人工决策依赖。

AI中文摘要

高效的路径优化对于确保铁路运营的安全性和准点性至关重要。在异构多轨距铁路网络中，由于列车速度、停车模式、基础设施兼容性约束的不同，协调复杂性增加，这一点尤为关键。在单轨系统中，由于所有列车共享同一轨道且需要频繁的轨道切换，这些挑战进一步加剧。干扰事件，包括轨道阻塞、列车阻塞、发动机故障和速度降低，给运营带来了额外的不可预测性，并偏离了时刻表。然而，现有研究主要关注高层次的时间表编制，忽略了诸如轨道切换协调等运营细节。因此，决策留给人类操作员，增加了铁路运营的安全风险。本研究提出了一个基于时间规划的框架，用于异构铁路系统中的动态路径优化和干扰管理。该框架使用PDDL 2.1将铁路运营形式化为时间规划问题，显式建模轨距兼容约束和多种干扰场景。它生成无冲突的时间戳操作计划，指定优化调度和可执行动作序列。为了评估所提出的框架，我们开发了一个包含200个实例的基准问题集，使用多达1000个轨道点和120列列车。采用两个最先进的时间规划器和一个计划验证器来评估该框架。实验结果表明，该框架能够有效地为异构铁路系统生成时间操作计划，处理多轨距约束和干扰，并减少对人工决策的依赖。

英文摘要

Efficient route optimization plays a vital role in ensuring both safety and punctuality in railway operations. It is particularly crucial in heterogeneous multi-gauge railway networks, where varying train speeds, stopping patterns, and infrastructure compatibility constraints increase coordination complexity. In single-track systems, these challenges are further intensified because all trains share the same track and require frequent track switching. Stochastic disruption events, including blocked tracks, blocked trains, engine failures, and speed slowdowns, introduce additional unpredictability into operations and cause deviations from the timetable. However, existing studies predominantly focus on high-level timetabling, omitting operational details such as track switching coordination. As a result, decisions are left to human operators, increasing safety risks in railway operations. This study proposes a framework based on temporal planning for dynamic route optimization and disruption management in heterogeneous railway systems. The framework formulates railway operations as a temporal planning problem using PDDL 2.1, explicitly modeling gauge compatibility constraints and diverse disruption scenarios. It generates conflict-free timestamped operational plans specifying both optimized schedules and executable action sequences. To evaluate the proposed framework, we developed a benchmark problem set with 200 instances using up to 1,000 track points and 120 trains. Two state-of-the-art temporal planners and a plan validator were employed to assess the framework. The experimental results demonstrate that the framework effectively generates temporal operational plans for heterogeneous railway systems, handles multi-gauge constraints and disruptions, and reduces dependence on manual decision making.

2606.14581 2026-06-15 cs.LG cs.AI 新提交

CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation

CARE：通过科学实验中的可审计证据审查控制LLM生成的策略

Guanyu Liu, Weiyi Kong, Zeyu Wang, Boer Zhang, Baiqing Li, Peiyu Zhang, Tianyu Shi

发表机构 * University of Macau（澳门大学）； University of Toronto（多伦多大学）； UCLA（加州大学洛杉矶分校）； Harvard University（哈佛大学）； XtalPi（晶泰科技）； McGill University（麦吉尔大学）

AI总结提出CARE框架，通过可审计的干预门控机制，在保留非LLM优化器作为默认路径的同时，利用LLM修正挑战者排序策略，显著提升高通量实验优化性能。

Comments 23 pages, 4 figures

AI中文摘要

赋予LLM对昂贵、不可逆的科学实验的直接控制会导致不安全的探索和不稳定的性能，但完全抛弃LLM的创造力会牺牲显著的优化潜力。我们引入了CARE（通过科学实验中的可审计证据审查控制LLM生成的策略），这是一种用于高通量实验（HTE）优化的可审计控制器，它保留非LLM的现有优化器作为默认动作路径，同时使用LLM来修正挑战者排序策略。在每个结果揭示之前，一个公共证据干预门将挑战者与现有方案进行比较。只有当选择前可用的证据支持变更时，它才授权选择挑战者，并将决策记录在审计日志中。在Minerva/Olympus和ChemLex基准测试中，CARE优于所有其他评估方法，相对于公开的现有方案，最终最佳结果从80.0提高到88.5（Minerva/Olympus），从83.9提高到92.1（ChemLex）。我们的实验表明，当LLM在可审计控制器下扩展提议空间时，其自我进化比直接选择实验更可靠。

英文摘要

Granting LLMs direct control over costly, irreversible scientific experiments leads to unsafe exploration and unstable performance, but discarding LLM creativity entirely sacrifices significant optimization potential. We introduce CARE (Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation), an auditable controller for high-throughput experimentation (HTE) optimization that keeps a non-LLM incumbent optimizer as the default action path while using LLMs to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent. It authorizes the challenger's selection only when the evidence available before selection supports the change, with the decision recorded in the audit log. CARE outperforms all other evaluated methods on Minerva/Olympus and ChemLex benchmarks, with final-best improving from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex, relative to the public incumbent. Our experiments indicate that LLM self-evolution is more reliable when it expands the proposal space under an auditable controller, rather than directly choosing experiments.

2606.14580 2026-06-15 cs.CL 新提交

Persuasion Index: A Theory-Guided Framework for Persuasion Analysis

说服指数：一个理论指导的说服分析框架

Liancheng Gong, Zhiyang Wang, Yiwei Xu, Julia Mendelsohn

发表机构 * University of Maryland, College Park（马里兰大学帕克分校）； New York University（纽约大学）

AI总结提出基于心理学和传播学理论的15维说服指数（PI）及55个子特征实现，在四个数据集上验证其能解释说服相关修辞模式，并提供轻量级预测信号。

AI中文摘要

识别有说服力的修辞线索在多个领域至关重要，从检测信息操纵、提高AI安全性到推进公共卫生沟通。我们提出说服指数（PI），这是一个基于心理学和传播学说服理论的15维分类法，以及一个使用55个子特征的透明实现，这些子特征基于词汇和规则检测器构建。该分类法是模块化的：单个检测器可以被替换，同时保留理论结构。通过在四个领域、风格和结果度量不同的公共数据集上评估PI，我们表明PI提供了一个共享的特征空间，用于解释与说服结果相关的修辞模式。线性模型表明，PI特征在保持计算轻量级的同时携带了有意义的预测信号。维度级分析揭示了PI维度与说服结果之间跨数据集的重复关联，同时也突出了主题和立场特定的变化。我们将PI作为开源包和Web界面发布，用于对人和AI中介的沟通进行原则性和可审计的分析。

英文摘要

Identifying persuasive rhetorical cues is critical across domains, from detecting information manipulation and improving AI safety to advancing public health communication. We propose Persuasion Index (PI), a taxonomy of 15 dimensions grounded in persuasion theories from psychology and communication, and one transparent implementation using 55 sub-features built from lexicons and rule-based detectors. The taxonomy is modular: individual detectors can be replaced while preserving the theoretical structure. By evaluating PI on four public datasets varying in domain, style, and outcome measures, we show that PI provides a shared feature space for interpreting rhetorical patterns associated with persuasion-related outcomes. Linear models show that PI features carry meaningful predictive signal while remaining computationally lightweight. Dimension-level analyses reveal recurring associations between PI dimensions and persuasion outcomes across datasets, while also highlighting topic- and stance-specific variation. We release PI as an open-source package and web interface for principled and auditable analysis of human and AI-mediated communication.

2606.14579 2026-06-15 cs.AI 新提交

VISTA: View-Consistent Self-Verified Training for GUI Grounding

VISTA: 视图一致的自验证训练用于GUI定位

Xinyu Qiu, Yunzhu Zhang, Heng Jia, Shuheng Shen, Changhua Meng, Linchao Zhu

发表机构 * Zhejiang University（浙江大学）； Venus Team, Ant Group（蚂蚁集团金星团队）

AI总结提出VISTA框架，通过多视图分组和自验证锚点改进GRPO训练，在GUI定位任务中显著提升准确率。

AI中文摘要

当将组相对策略优化（GRPO）应用于GUI定位时，rollout从单个截图视图中采样；组在困难实例上往往全部失败，在简单实例上全部成功，无法产生有用的相对优势。我们提出VISTA（视图一致的自验证训练），一种基于GRPO的训练框架，通过从同一GUI页面的多个目标保持视图中构建每个比较组。每个视图通过裁剪生成，保持目标元素可见并精确重新映射其边界框，因此模型rollout在语义等价但几何不同的输入之间进行比较。为了稳定短坐标生成而不将强化学习转变为无条件模仿，VISTA进一步添加了一个自验证的跨视图锚点：一个使用优势加权损失优化的oracle答案，从组基线中排除，仅在模型产生最大奖励rollout时激活。在五个GUI定位基准和多个Qwen骨干网络上，VISTA一致提高了定位准确率。在ScreenSpot-Pro上，它将Qwen3-VL 4B/8B/30B-A3B从55.5/52.7/53.7提升到63.4/65.8/67.0。鲁棒性分析进一步显示了更高的最差视图准确率和更低的预测翻转率。

英文摘要

When applying Group Relative Policy Optimization (GRPO) for GUI Grounding, rollouts are sampled from a single screenshot view; groups often become either all failures on difficult instances or all successes on easy ones, yielding no useful relative advantage. We propose VISTA (View-Consistent Self-Verified Training), a GRPO-based training framework that constructs each comparison group from multiple target-preserving views of the same GUI instance. Each view is generated by a crop that keeps the target element visible and remaps its box exactly, so model rollouts are compared across semantically equivalent but geometrically different inputs. To stabilize short coordinate generation without turning reinforcement learning into unconditional imitation, VISTA further adds a self-verified cross-view anchor: an oracle answer optimized with an advantage-weighted loss, excluded from the group baseline and activated only when the model has produced a maximum-reward rollout. Across five GUI-grounding benchmarks and multiple Qwen backbones, VISTA consistently improves grounding accuracy. On ScreenSpot-Pro, it raises Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0. Robustness analyses further show higher worst-view accuracy and lower prediction flip rates.
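
The degenerate-group problem described in this abstract can be illustrated with a minimal Python sketch. It computes group-relative advantages in the commonly used GRPO form (reward minus group mean, divided by group standard deviation) and shows that an all-failure group gives every rollout zero advantage. The `remap_box_to_crop` helper is a hypothetical illustration of exact box remapping under a crop. All function names, the `eps` constant, and the box and crop layouts are assumptions for illustration, not taken from the VISTA implementation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


def remap_box_to_crop(box, crop):
    """Shift an (x1, y1, x2, y2) box into the frame of a (left, top, right, bottom) crop."""
    x1, y1, x2, y2 = box
    left, top, right, bottom = crop
    if not (left <= x1 and top <= y1 and x2 <= right and y2 <= bottom):
        raise ValueError("crop must keep the whole target box visible")
    return (x1 - left, y1 - top, x2 - left, y2 - top)


# A group where every rollout fails carries no learning signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
# A group with mixed outcomes separates good rollouts from bad ones.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# The same target seen through a crop: only the coordinates change.
crop_box = remap_box_to_crop((100, 50, 140, 70), (80, 40, 400, 300))
```

Mixing rollouts from several crops of the same page is one way a group could contain both successes and failures, which is the situation in which the normalized advantages become non-zero.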

2606.14578 2026-06-15 cs.CV 新提交

A Qualitative Review of GenAI-Based Methods for Data Generation and Augmentation in Industrial Computer Vision Applications

基于GenAI的工业计算机视觉应用中数据生成与增强方法的定性综述

Paul Koch, Paul Hofmann, Ferdinand Waßelewsky, Adem Karakurt, Andre Sérs, Jörg Krüger

发表机构 * Fraunhofer IPK（弗劳恩霍夫研究所）； Hamburg University of Applied Sciences (HAW Hamburg)（汉堡应用技术大学）； Technical University Berlin (Tu-Berlin)（柏林技术大学）

AI总结本文综述了基于GenAI的数据生成与增强方法，旨在解决工业计算机视觉应用中数据获取的“先有鸡还是先有蛋”困境，并评估其在分类用例中的适应性。

Comments Accepted to Computing Conference 2026

AI中文摘要

AI驱动的计算机视觉应用需要强大的数据库来确保可预测的行为和性能。这种可预测的行为对于工业应用获得用户信任尤为重要。然而，在工业应用中，这样的数据库并不容易获得，其获取也并非易事。主动学习方法可以在项目部署中逐步增加数据，从而迭代地扩充数据库，进而提高应用的可预测性。不幸的是，我们观察到这往往会导致用户对应用失去信任，而一旦失去信任就很难恢复。这就导致了“先有鸡还是先有蛋”的困境，即数据库和应用都无法得到发展。在这项工作中，我们回顾了最先进的方法和途径，以进一步推动初始主动数据扩充阶段后的数据库建设。这里，我们重点关注基于GenAI的数据生成和增强方法的最新进展，并评估它们在工业计算机视觉分类用例中的适应性。尽管我们观察到自动数据扩充的潜力，但我们也看到在源（训练环境）和目标（工业用例）之间存在领域不匹配——涉及自然语言和对象特征中定义的上下文。

英文摘要

AI-driven computer vision applications require a robust database to ensure predictable behavior and performance. Such predictable behavior is especially important for industrial applications in gaining trust from users. However, such a database is not readily available in industrial applications, and its acquisition is not trivial either. Active learning methods can be applied to ramp up data within a project deployment to iteratively increase the database, and thus the application's predictability. Unfortunately, we observe that this often leads to a loss of user trust in the application, which is difficult to regain once lost. This leads to a "chicken-and-egg" dilemma in which neither the database nor the application is developed. In this work, we review state-of-the-art methods and approaches to further boost the database after the initial active data ramp-up phase. Here, we focus on recent advancements in GenAI-based data generation and augmentation methods and review their adaptability to an industrial computer vision classification use case. Although we observe a potential for automatic data ramp-up, we also see a domain mismatch between the source (training environment) and target (industrial use case), regarding context defined in natural language and object characteristics.

2606.14574 2026-06-15 cs.CL cs.AI 新提交

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

SIMMER: 基于世界模型评估LLM可执行规划中的潜在故障

Xiaoxin Lu, Ranran Haoran Zhang, Rui Zhang

发表机构 * The Pennsylvania State University（宾夕法尼亚州立大学）

AI总结提出SIMMER基准，通过人工策划的厨房领域符号世界模型，评估LLM规划中的潜在故障；实验发现前沿模型最多17%无错误计划，56%含潜在故障，多数不可逆；反事实预演可减少72%潜在故障和75%不可逆案例。

AI中文摘要

大型语言模型（LLM）越来越多地被部署为家庭环境中自主代理的规划器。虽然现有基准评估LLM生成的计划是否成功执行，但它们忽略了一种关键类型的故障：潜在故障。与立即故障（在执行时触发即时反馈并允许及时纠正）不同，潜在故障不会立即停止计划执行，而是悄无声息地损害目标实现。在严重情况下，它们会导致不可逆的损害。为弥补这一空白，我们引入了SIMMER，这是一个通过人工策划的、基于厨房领域的符号世界模型来评估LLM规划中潜在故障的基准。SIMMER定义了一个世界模型，包含77个动作、262个独特对象和约46,800种语义真实的可能交互，这些交互源自真实世界的烹饪脚本。然后，它利用一个状态机执行器，根据世界模型验证计划，并检测即时前提违规、潜在危险和不可逆故障。在六个LLM上的实验表明，即使是最前沿的模型，其无错误计划最多也只有17%。此外，高达56%的计划包含潜在故障，其中大多数导致不可逆后果。我们进一步证明，通过反事实预演进行显式状态推理可以将潜在故障减少高达72%，不可逆案例减少高达75%，这为更鲁棒的LLM规划器指明了一个有前景的方向。

英文摘要

Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type of failure: latent failures. Unlike immediate failures that trigger instant feedback at execution time and enable timely correction, latent failures do not immediately halt plan execution but silently compromise goal achievement. In severe cases, they cause irreversible harm. To address this gap, we introduce SIMMER, a benchmark for evaluating latent failures in LLM planning through a human-curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions that are semantically realistic, derived from real-world cooking scripts. It then leverages a state machine executor that validates plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments across six LLMs show that even frontier models achieve at most 17% error-free plans. Moreover, up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. We further demonstrate that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, suggesting a promising direction for more robust LLM planners.
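
The distinction between immediate and latent failures can be made concrete with a toy state-machine executor. The Python sketch below is a hypothetical illustration only: the four actions, the state variables, and the single "stove left on" hazard are invented for this example and are far smaller than the 77-action world model described in the abstract. It also checks hazards only on the final state, which is a simplification.

```python
# Hypothetical toy world model: each action has preconditions and effects on a symbolic state.
ACTIONS = {
    "turn_on_stove": {"pre": {"stove_on": False}, "eff": {"stove_on": True}},
    "place_pan": {"pre": {"pan_on_stove": False}, "eff": {"pan_on_stove": True}},
    "fry_egg": {"pre": {"stove_on": True, "pan_on_stove": True}, "eff": {"egg_cooked": True}},
    "turn_off_stove": {"pre": {"stove_on": True}, "eff": {"stove_on": False}},
}

# Latent hazards never block a step, but they leave the final state unsafe.
LATENT_HAZARDS = {"stove left on": lambda state: state["stove_on"]}


def execute(plan):
    """Run a plan step by step and classify the outcome."""
    state = {"stove_on": False, "pan_on_stove": False, "egg_cooked": False}
    for step, name in enumerate(plan):
        action = ACTIONS[name]
        violated = [key for key, value in action["pre"].items() if state[key] != value]
        if violated:
            # Immediate failure: execution halts and the planner gets feedback right away.
            return {"status": "immediate_failure", "step": step, "violated": violated}
        state.update(action["eff"])
    hazards = [label for label, check in LATENT_HAZARDS.items() if check(state)]
    status = "latent_failure" if hazards else "ok"
    return {"status": status, "hazards": hazards, "state": state}


broken = execute(["fry_egg"])
silent = execute(["place_pan", "turn_on_stove", "fry_egg"])
clean = execute(["place_pan", "turn_on_stove", "fry_egg", "turn_off_stove"])
```

In this toy, the `silent` plan reaches its goal (the egg is cooked) and every step executes without error, yet the executor still flags it, which is the kind of case that a success-only benchmark would miss.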

2606.14571 2026-06-15 cs.AI 新提交

StreamMemBench: Streaming Evaluation of Agent Memory for Future-Oriented Assistance

StreamMemBench: 面向未来辅助的智能体记忆流式评估

Guanming Liu, Yuqi Ren, Hansu Gu, Peng Zhang, Weihang Wang, Jiahao Liu, Ning Gu, Tun Lu

发表机构 * Fudan University（复旦大学）； Amazon Stream（亚马逊流）

AI总结提出StreamMemBench基准，通过两步任务序列测试智能体记忆从流式观察到后续任务中证据回忆、反馈整合与重用能力，实验发现现有系统在证据使用和反馈转化上存在不足。

AI中文摘要

个人智能体记忆的核心作用是将存储的信息和先前的交互转化为面向未来的辅助。在日常使用中，有用的线索来自智能体观察到什么以及用户如何与智能体交互，智能体必须将这些线索从当前请求延续到类似的未来任务中。现有的记忆基准通常孤立地测试对话回忆或任务改进，使得从流式观察到后续辅助的轨迹基本未经测试。我们引入了StreamMemBench，一个流式基准，它围绕EgoLife自我中心流中的每个证据锚点构建一个两步任务序列。初始任务测试证据使用，而后续任务测试反馈和交互经验是否被重用。四个指标诊断证据回忆、初始证据使用、反馈整合和后续重用。在两个骨干网络上的八个记忆系统的实验表明，当前系统通常无法使用观察到的证据或将反馈转化为可靠的后续行为，即使证据已存储或反馈已在局部整合。StreamMemBench已在 https://github.com/landian60/StreamMemBench 公开发布。

英文摘要

A central role of personal-agent memory is to turn stored information and prior interactions into future-oriented assistance. In daily use, useful cues come from what the agent observes and how the user interacts with the agent, and the agent must carry them forward from the current request to similar future tasks. Existing memory benchmarks usually test dialogue recall or task improvement in isolation, leaving the trajectory from streaming observations to later assistance largely untested. We introduce StreamMemBench, a streaming benchmark that constructs a two-step task sequence around each evidence anchor from EgoLife egocentric streams. The initial task tests evidence use, while the follow-up task tests whether feedback and interaction experience are reused. Four metrics diagnose evidence recall, initial evidence use, feedback incorporation, and follow-up reuse. Experiments with eight memory systems across two backbones show that current systems often fail to use observed evidence or turn feedback into reliable follow-up behavior, even when evidence is stored or feedback is incorporated locally. StreamMemBench is publicly available at https://github.com/landian60/StreamMemBench.

2606.14562 2026-06-15 cs.CV cs.LG 新提交

NEST3D: A High-Resolution Multimodal Dataset of Sociable Weaver Tree Nests

NEST3D：织布鸟树巢的高分辨率多模态数据集

Constanza A. Molina Catricheo, Simon Boeder, Ting-Jia Guo, Giacomo May, Clément Berthelot, Devis Tuia, Friedrich Fedor Reinhard, Fabio Remondino, Benjamin Risse

发表机构 * Institute for Geoinformatics (ifgi), University of Münster（明斯特大学地理信息学研究所）； École Polytechnique Fédérale de Lausanne (EPFL)（洛桑联邦理工学院）； Max Planck Institute of Animal Behavior（马克斯·普朗克动物行为研究所）； University of Konstanz（康斯坦茨大学）； Kuzikus Research Station（库兹库斯研究站）； Fondazione Bruno Kessler (FBK)（布鲁诺·凯斯勒基金会）

AI总结针对织布鸟巢缺乏精细3D结构数据的问题，提出包含104棵巢树、1.4TB多模态无人机数据集，并基准测试语义分割方法，PT-v3达86.35% mIoU。

Comments 14 pages, 4 figures. Dataset available at https://huggingface.co/NEST3D

AI中文摘要

织布鸟巢作为复杂的生态结构，提供体温调节微栖息地并维持多种物种；然而，先前研究使用的数据集缺乏精细的3D结构细节。由于巢穴的不规则几何形状以及与复杂宿主植被的整合，生成可用且准确的3D织布鸟巢数据具有挑战性。我们通过一个开放获取的1.4TB多模态无人机数据集（包含104棵巢树，共27,945张RGB图像、111,780张多光谱图像、约7.81亿个3D点以及专家标注的语义分割标签）弥合了这一差距。我们使用KPConv、RandLA-Net和Point Transformer V3对语义分割进行基准测试，其中PT-v3在测试集上达到了86.35%的mIoU。虽然结果展示了基于Transformer和逐点方法的强大性能，但也凸显了架构相关的挑战，特别是对于基于卷积的方法（如KPConv）。通过独特地结合光谱、空间和结构信息，所提出的数据集推动了3D重建、分割和分类算法的发展，实现了从巢穴体积估计到物种保护等生态应用，并作为一个要求严格的基准，揭示了在极端类别不平衡下与架构相关的性能差异。

英文摘要

Sociable weaver nests function as complex ecological structures offering thermoregulatory microhabitats and sustaining diverse species; however, datasets used in prior studies lack fine-grained 3D structural detail. Producing usable and accurate 3D weaver nest data is challenging due to their irregular geometry and integration with complex host vegetation. We bridge this gap with an open-access, 1.4 TB multimodal drone dataset of 104 nest-bearing trees, comprising 27,945 RGB images, 111,780 multispectral images, approximately 781 million 3D points, and expert-annotated semantic segmentation labels. We benchmark semantic segmentation using KPConv, RandLA-Net, and Point Transformer V3, with PT-v3 achieving an mIoU of 86.35% on the test set. While the results demonstrate strong performance for transformer-based and point-wise methods, they also highlight architecture-dependent challenges, particularly for convolution-based approaches such as KPConv. By uniquely combining spectral, spatial, and structural information, the presented dataset advances 3D reconstruction, segmentation, and classification algorithms, enabling ecological applications from nest volume estimation to species conservation, and serves as a demanding benchmark that exposes architecture-dependent performance under extreme class imbalance.
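
For readers unfamiliar with the mIoU metric reported above, the following Python sketch shows the standard per-class intersection-over-union average and why it is stricter than plain accuracy under class imbalance. The ten-point label lists are invented toy data, not drawn from NEST3D, and the function name is an assumption.

```python
def mean_iou(true_labels, pred_labels, num_classes):
    """Average the intersection-over-union of each class that appears at least once."""
    pairs = list(zip(true_labels, pred_labels))
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for t, p in pairs if t == cls and p == cls)
        union = sum(1 for t, p in pairs if t == cls or p == cls)
        if union > 0:  # skip classes absent from both ground truth and prediction
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class is present in the labels")
    return sum(ious) / len(ious)


# Toy imbalanced example: class 0 dominates, class 1 is rare.
truth = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
preds = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(1 for t, p in zip(truth, preds) if t == p) / len(truth)
miou = mean_iou(truth, preds, num_classes=2)
```

A single mistake on the rare class leaves accuracy at 9 out of 10 but halves that class's IoU, so mIoU drops much further than accuracy does.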

2606.14561 2026-06-15 cs.RO cs.LG 新提交

ORCA: A Platform for Open-Source Dexterity Research

ORCA: 开源灵巧性研究平台

Francesco Capuano, Maximilian Eberlein, Fabrice Bourquin, Clemens Claudio Christoph

发表机构 * University of Oxford（牛津大学）； ETH Zurich（苏黎世联邦理工学院）； Orca Dexterity

AI总结提出ORCA学习栈，统一灵巧手控制、仿真、遥操作和重定向，集成机器人学习框架，实现端到端灵巧操作研究。

Comments 15 pages

AI中文摘要

机器人操作研究越来越关注两指平行夹爪，因其有效性、经济性和易于遥操作。然而，夹爪受限于其外形因素，即使对于简单的重新定向任务，也常常需要双臂设置。拟人手是灵巧机器人学习的更自然平台——更接近人手，能够从人类视频中学习——但它们在学习研究中仍然难以使用：即使存在开放且可访问的手部硬件，用于控制、仿真、遥操作和重定向的软件也分散在零散的代码库中，并且与机器人学习生态系统基本脱节。在这项工作中，我们介绍了ORCA学习栈，这是一个将灵巧性作为第一类机器人学习领域的开源研究栈。我们的ORCA栈将低级控制、仿真、来自一系列消费平台的遥操作以及手部重定向统一在单个接口后面，并原生集成流行的机器人学习框架（如LeRobot），使灵巧手研究人员能够利用与非灵巧机器人学习相同的数据、训练和评估流程。我们展示了一个完整的端到端工作流程，通过使用消费级VR头显进行遥操作收集手内重新定向任务的专家演示，使用LeRobot训练自主策略，并在完全可重现和可观察的设置中评估学习到的策略。我们将整个栈开源，作为灵巧操作研究的共享、可重现基础。

英文摘要

Robotics manipulation research increasingly focuses on two-finger parallel grippers for their effectiveness, affordability, and ease of teleoperation. Grippers are nonetheless limited by their form factor, often requiring bimanual setups even for simple reorientation tasks. Anthropomorphic hands are a more natural platform for dexterous robot learning -- closer to the human hand, and capable of learning from human video -- yet they remain hard to use in learning research: even where open and accessible hand hardware exists, the software for control, simulation, teleoperation, and retargeting is scattered in one-off code bases, and largely disconnected from the robot-learning ecosystem. In this work, we introduce the ORCA learning stack, an open-source research stack for dexterity as a first-class robot learning domain. Our ORCA stack unifies low-level control, simulation, teleoperation from a range of consumer platforms, and hand retargeting, behind a single interface, and integrates natively with popular robot-learning frameworks such as LeRobot, so dexterous hand researchers can leverage the same data, training, and evaluation pipelines used for non-dexterous robot learning. We demonstrate a complete end-to-end workflow, collecting expert demonstrations of an in-hand reorientation task by teleoperation with a consumer-grade VR headset, training an autonomous policy with LeRobot, and evaluating the learned policy in a fully reproducible and observable setup. We open-source the entire stack as a shared, reproducible foundation for dexterous-manipulation research.

2606.14556 2026-06-15 cs.CV 新提交

Visual Quality Score Assessment of Large White Goods in Remanufacture with Multi-View Deformable-DETR

基于多视角可变形DETR的再制造大型白色家电视觉质量评分评估

Paul Koch, Vivek Chavan

发表机构 * Fraunhofer-Institut für Produktionsanlagen und Konstruktionstechnik (IPK)（弗劳恩霍夫生产设备和结构技术研究所）

AI总结针对再制造中大型白色家电视觉质量评估依赖人工且难以处理小缺陷的问题，提出基于多视角可变形DETR的自动评分框架，通过自监督预训练和微调减少标注需求，实现精确评估与可解释性。

Comments Accepted to GCSM 2026

AI中文摘要

再制造大型白色家电对于循环经济至关重要，但视觉质量评估仍然是培训和定价的手动瓶颈。传统的检测方法需要大量标注，并且难以处理高分辨率多视角数据中的小缺陷。我们提出了一个基于可变形DETR的多视角框架，用于自动质量评分，该框架跨冗余视图聚合信息以提取细粒度特征。为了在有限标签下增强鲁棒性，我们采用自监督预训练，随后在专家标注的分数上进行监督微调。此外，在冻结特征图上进行线性投影，以识别感兴趣区域来解释模型决策。在工业多视角数据集上评估，我们的方法提供了精确的质量评估，同时减少了对人工标注和每个部件定制的依赖，为再制造生产线实现了可扩展且透明的检测。

英文摘要

Remanufacturing large white goods is essential for a circular economy, yet visual quality assessment remains a manual bottleneck for training and pricing. Conventional detection methods require extensive annotation and struggle with small defects in high-resolution multi-view data. We present a multi-view framework based on Deformable-DETR for automated quality scoring that aggregates information across redundant views to extract fine-grained features. To enhance robustness with limited labels, we employ self-supervised pretraining followed by supervised fine-tuning on expert-annotated scores. Additionally, a linear projection over frozen feature maps identifies regions of interest to explain model decisions. Evaluated on an industrial multi-view dataset, our approach delivers precise quality assessments while reducing reliance on manual annotation and per-part customization, enabling scalable and transparent inspection for remanufacturing lines.

2606.14555 2026-06-15 cs.CV cs.AI 新提交

Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner

重新思考全局平均池化：你的分类器实际上是一个多实例学习器

Aray Karjauv

发表机构 * Aray Karjauv（阿瑞·卡贾乌）

AI总结本文揭示标准图像分类器中的全局平均池化结构天然具有多实例学习解释，使得单标签训练的分类器能学习多目标场景，并提出后验诊断方法提取空间类别证据。

AI中文摘要

现代图像分类器广泛采用全局平均池化（GAP）后接线性分类头。这种线性结构确保图像级logits等于将分类头逐点应用于GAP之前的特征网格所获得的logits的平均值。因此，标准分类器可能固有地保留空间类别证据，即使在图像级预测错误时这些证据仍可恢复。这种结构自然暗示了多实例学习（MIL）解释，其中图像被视为空间实例的包。在此框架下，我们证明使用每张图像单个标签训练的标准分类器仍然可以在多目标场景中学习预期的分类任务。我们进一步利用这一特性将图像级logits分解为预测网格，提供一种事后诊断方法来提取GAP原本掩盖的空间类别证据。我们的系统评估表明，现成模型始终能在前景区域内恢复真实类别。MIL解释进一步表明，常见的分类器失败反映了均值聚合的已知局限性。

英文摘要

Modern image classifiers widely adopt global average pooling (GAP) followed by a linear classification head. This linearity ensures that the image-level logits equal the average of logits obtained by applying the classification head pointwise to the feature grid prior to GAP. Consequently, standard classifiers may inherently retain spatial class evidence that remains recoverable even when the image-level prediction is incorrect. This structure naturally suggests a multiple-instance learning (MIL) interpretation, where an image is viewed as a bag of spatial instances. Within this formulation, we demonstrate that standard classifiers trained with a single label per image can still learn the intended classification task in multi-object scenes. We further exploit this property to decompose image-level logits into a prediction grid, providing a post-hoc diagnostic to extract spatial class evidence that GAP otherwise obscures. Our systematic evaluation reveals that off-the-shelf models consistently recover the ground-truth class within foreground regions. The MIL interpretation further suggests that common classifier failures reflect known limitations of mean aggregation.
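
The linearity argument in this abstract can be checked numerically. The Python sketch below applies a linear head either after global average pooling or pointwise to every grid cell, and confirms that the image-level logits equal the mean of the per-cell logits. The 2x2 feature grid, the weights, and the bias are invented toy values, and the function names are assumptions for illustration.

```python
def linear_head(feature, weights, bias):
    """Apply a linear classification head to one feature vector."""
    return [sum(w * f for w, f in zip(row, feature)) + b for row, b in zip(weights, bias)]


def global_average_pool(grid):
    """Average the feature vectors of all spatial cells."""
    cells = [cell for row in grid for cell in row]
    return [sum(cell[c] for cell in cells) / len(cells) for c in range(len(cells[0]))]


def prediction_grid(grid, weights, bias):
    """Apply the same head pointwise, before pooling."""
    return [[linear_head(cell, weights, bias) for cell in row] for row in grid]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


weights = [[1.0, -0.5], [0.2, 1.0]]   # 2 classes x 2 channels (toy values)
bias = [0.1, -0.1]
grid = [[[1.0, 0.0], [1.0, 0.0]],      # three background-like cells ...
        [[1.0, 0.0], [0.0, 2.0]]]      # ... and one cell with strong class-1 evidence

image_logits = linear_head(global_average_pool(grid), weights, bias)
cell_logits = prediction_grid(grid, weights, bias)
flat = [logits for row in cell_logits for logits in row]
mean_cell_logits = [sum(l[k] for l in flat) / len(flat) for k in range(len(bias))]
class_map = [[argmax(logits) for logits in row] for row in cell_logits]
```

In this toy, the image-level prediction is class 0, while the per-cell map still assigns class 1 to the bottom-right cell, which is the kind of spatial evidence that mean aggregation hides.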

2606.14536 2026-06-15 cs.LG cs.RO cs.SY eess.SY 新提交

Provably Safe, Yet Scalable Reinforcement Learning

可证明安全且可扩展的强化学习

Kai S. Yun, Zeyang Li, Navid Azizan

发表机构 * MIT（麻省理工学院）

AI总结提出PS2-RL框架，通过两阶段架构（学习备份策略隐式构造控制不变集，再通过可微投影层训练RL策略）实现可证明安全且可扩展的强化学习，在高达10维状态空间中保持性能与安全性。

AI中文摘要

安全强化学习旨在学习在满足约束的同时优化奖励的策略。主流方法依赖于软约束策略优化，虽取得经验成功，但无法为学习策略提供正式安全保证。相反，具有严格保证的方法通常依赖显式证书函数，其构造需要直接综合和验证控制不变集，这一过程随状态维度扩展性差，且往往导致过于保守的行为。本文提出可证明安全且可扩展的强化学习（PS2-RL）框架，一种新颖的两阶段架构，以可扩展方式学习可证明安全的策略，旨在克服先前方法的关键瓶颈。PS2-RL不显式计算不变集，而是利用学习的备份策略前向积分系统动力学，在线生成隐式控制不变集。第一阶段，通过提出的安全到达值函数训练备份策略，该值函数刻画了用于不变集构造的最优备份策略。第二阶段，通过可微投影层端到端训练RL策略，该投影层严格强制由学习备份策略诱导的安全保证。通过在第一阶段最大化隐式控制不变集的体积，第二阶段得到的PS2策略既高效又可扩展，同时保持可证明安全性。关键的是，PS2-RL对底层RL算法无限制，可插入任何现有训练流程。我们为所提框架建立了理论保证，并在状态维度高达10的机器人控制任务上进行了评估，而在此范围内，先前可证明安全的RL方法难以应对或变得不实用。

英文摘要

Safe reinforcement learning (RL) aims to learn policies that optimize rewards while satisfying constraints. Predominant approaches rely on soft-constrained policy optimization, which has achieved empirical success but does not provide formal safety guarantees for the learned policy. In contrast, methods with strict guarantees typically rely on explicit certificate functions, whose construction requires the direct synthesis and verification of control-invariant sets, a process that scales poorly with state dimension and often yields overly conservative behavior. In this paper, we present the Provably Safe, yet Scalable RL (PS2-RL) framework, a novel two-phase architecture for learning provably safe policies in a scalable manner, designed to overcome the key bottlenecks of prior methods. Rather than explicitly computing invariant sets, PS2-RL leverages a learned backup policy to forward-integrate the system dynamics, generating an implicit control-invariant set online. In the first phase, the backup policy is trained with our proposed safe-arrival value function, which characterizes the optimal backup policy for invariant-set construction. In the second phase, an RL policy is trained end-to-end through a differentiable projection layer that strictly enforces the safety guarantees induced by the learned backup policy. By maximizing the volume of the implicit control-invariant set in the first phase, the resulting PS2 policy from the second phase is performant and scalable, while maintaining provable safety. Crucially, PS2-RL imposes no restrictions on the underlying RL algorithm and can be plugged into any existing training pipeline. We establish theoretical guarantees for the proposed framework and evaluate it on robotic control tasks with state dimensions up to 10, a regime in which prior provably safe RL methods struggle or become impractical.
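
The idea of an implicit control-invariant set generated by forward-integrating a backup policy can be sketched on a one-dimensional toy system. The Python example below is a simplified illustration under stated assumptions: the double-integrator dynamics, the hand-written braking backup policy, and all constants are invented, and the hard switch in `safety_filter` stands in for the differentiable projection layer described in the abstract, which it does not reproduce.

```python
DT = 0.1      # integration step (assumed)
A_MAX = 1.0   # maximum braking deceleration (assumed)
X_MAX = 1.0   # safety constraint: position must stay at or below this wall


def step(state, accel):
    """Advance the (position, velocity) state by one Euler step."""
    x, v = state
    return (x + v * DT, v + accel * DT)


def backup_policy(state):
    """Brake as hard as allowed without reversing direction."""
    _, v = state
    return max(-A_MAX, -v / DT) if v > 0 else 0.0


def backup_keeps_safe(state, horizon=50):
    """Forward-integrate the backup policy and check the constraint along the rollout."""
    for _ in range(horizon):
        if state[0] > X_MAX:
            return False
        state = step(state, backup_policy(state))
    return state[0] <= X_MAX


def safety_filter(state, proposed_accel):
    """Keep the proposed action only if the backup policy can still recover afterwards."""
    if backup_keeps_safe(step(state, proposed_accel)):
        return proposed_accel
    return backup_policy(state)


far_from_wall = backup_keeps_safe((0.0, 1.0))
too_close = backup_keeps_safe((0.8, 1.0))
overridden = safety_filter((0.4, 1.0), proposed_accel=1.0)
accepted = safety_filter((0.0, 0.5), proposed_accel=1.0)
```

No set is ever computed explicitly here: membership is decided online by rolling out the backup policy, so a better backup policy directly enlarges the region in which the learned policy is allowed to act.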

2606.14535 2026-06-15 cs.RO 新提交

Spatially Conditioned Diffusion Policy: Learning Precise and Robust Manipulation with a Single RGB Camera

空间条件扩散策略：使用单个RGB相机学习精确且鲁棒的操作

Seoyoon Kim, Kanghyun Kim, Dongwoo Ko, Yeong Jin Heo, Min Jun Kim

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)（韩国科学技术院）； Neuromeka

AI总结提出空间条件扩散策略（SCDP），利用末端执行器轨迹作为视觉注意力锚点，通过多尺度特征编码和空间条件模块，在单相机设置下实现精确鲁棒的操作。

Comments 15 pages

AI中文摘要

最近的视觉模仿学习系统广泛采用多相机设置，其中腕部相机已成为事实标准。然而，从单一全局视角进行操作仍然具有挑战性，因为策略需要捕捉细粒度的交互细节并识别任务相关区域，而无需局部腕部视图。为了应对这一挑战，我们提出了空间条件扩散策略（SCDP），一种基于扩散的视觉运动策略，可在单相机设置下实现精确且鲁棒的操作。我们的关键思想是，末端执行器轨迹可以作为反映任务相关区域的视觉注意力锚点。基于这一思想，SCDP由两个关键组件组成：（i）一个视觉编码器，生成多尺度特征图以捕捉更广泛的上下文和细粒度视觉特征，以及（ii）一个空间条件模块，在扩散循环中沿中间末端执行器轨迹采样点状特征。大量的仿真实验表明，SCDP始终优于强大的单视图基线，并实现了与多相机基线相当的性能。真实世界实验进一步证明了其精确操作和对视觉干扰物的鲁棒性，突显了单相机模仿学习的潜力。

英文摘要

Recent visual imitation learning systems have widely adopted multi-camera setups with wrist-mounted cameras as the de facto standard. However, manipulation from a single global view remains challenging, as the policy should capture fine-grained interaction details and identify task-relevant regions without local wrist views. To address this challenge, we present Spatially Conditioned Diffusion Policy (SCDP), a diffusion-based visuomotor policy that achieves precise and robust manipulation in a single-camera setting. Our key idea is that end-effector trajectories can serve as visual attention anchors that reflect task-relevant regions. Building on this idea, SCDP consists of two key components: (i) a visual encoder that produces multi-scale feature maps to capture both broader context and fine-grained visual features, and (ii) a spatial conditioning module that samples point-wise features along intermediate end-effector trajectories in the diffusion loop. Extensive simulation experiments show that SCDP consistently outperforms strong single-view baselines and achieves performance comparable to multi-camera baselines. Real-world experiments further demonstrate precise manipulation and robustness to visual distractors, highlighting the potential of single-camera imitation learning.

2606.14534 2026-06-15 cs.CV 新提交

A Lightweight Fiducial-Based Pipeline for 3D Hyperspectral Mapping of ex-vivo Lumpectomy Specimens

一种轻量级基于基准的离体肿块切除标本三维高光谱映射流水线

Anna Bicchi, Alberto Rota, Leonardo Passoni, Nicola Ancellotti, Andrea Peroni, Lorenzo Vinco, Dario Polli, Elena De Momi

发表机构 * Politecnico di Milano（米兰理工大学）； Department of Electronics, Information and Bioengineering (DEIB), Politecnico di Milano（米兰理工大学电子、信息与生物工程系）； Department of Physics, Politecnico di Milano（米兰理工大学物理系）

AI总结提出一种全自动、免标定流水线，利用RGB图像和单次HSI采集生成离体肿块切除标本的三维高光谱点云，通过ArUco标记实现亚毫米级配准，支持术中切缘评估。

AI中文摘要

高光谱成像（HSI）是保乳手术（BCS）中用于术中评估切缘的一种有前景的模态，但其临床转化需要将固有的二维光谱信息与切除组织的三维形状对齐，以便精确定位可疑区域进行靶向随访。我们提出了一种全自动、免标定的流水线，该流水线从一组消费级相机RGB图像和单次自上而下的HSI采集生成离体肿块切除标本的三维高光谱点云。三维几何结构通过深度学习运动恢复结构（Structure-from-Motion）骨干网络重建，并通过自定义光束法平差（bundle adjustment）在度量参考框架中稳定，该平差对放置在标本周围的四个ArUco标记的角点强制执行一致性。然后，HSI立方体在不恢复HSI相机位姿的情况下配准到重建结果：两种模态中可见的标记定义了16个角点对应关系，驱动平面单应性（planar homography），并通过在正交渲染的深度图上查找恢复三维坐标。在两个离体肿块切除标本上评估，该流水线实现了中位三维配准误差低于1毫米，二维重投影误差低于0.02毫米，在加速硬件上每个标本的总处理时间低于4分钟。这些结果支持将HSI引导的空间定位集成到保乳手术的术中切缘评估工作流程中的可行性。

英文摘要

Hyperspectral Imaging (HSI) is a promising modality for intraoperative assessment of resection margins in Breast-Conserving Surgery (BCS), but its clinical translation requires aligning the inherently 2D spectral information onto the 3D shape of the excised tissue so that suspicious regions can be precisely localized for targeted follow-up. We present a fully automated, calibration-free pipeline that produces a 3D hyperspectral point cloud of an ex-vivo lumpectomy specimen from a set of consumer-camera RGB images and a single top-down HSI acquisition. The 3D geometry is reconstructed with a deep-learning Structure-from-Motion backbone, stabilized in a metric reference frame by a custom bundle adjustment that enforces consistency on the corners of four ArUco markers placed around the specimen. The HSI cube is then registered to the reconstruction without recovering the HSI camera pose: the markers, visible in both modalities, define 16 corner correspondences that drive a planar homography, and 3D coordinates are recovered by lookup on an orthographically rendered depth map. Evaluated on two ex-vivo lumpectomy specimens, the pipeline achieves a median 3D registration error below 1 mm and a 2D reprojection error below 0.02 mm, with a total per-specimen processing time under 4 minutes on accelerated hardware. These results support the feasibility of integrating HSI-guided spatial localization into intraoperative margin assessment workflows for breast-conserving surgery.
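
The registration step, where 16 marker-corner correspondences drive a planar homography, can be sketched in dependency-free Python. This is a minimal illustration under stated assumptions: the marker layout, the ground-truth matrix `TRUE_H`, and the function names are invented, and the least-squares fit uses normal equations with the last homography entry fixed to 1 rather than whatever solver the authors use. The depth-map lookup is not shown.

```python
def solve_linear_system(matrix, vector):
    """Solve a square system by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [value] for row, value in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * solution[k] for k in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_homography(src_pts, dst_pts):
    """Least-squares homography (h33 fixed to 1) from point correspondences."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(8)]
    h = solve_linear_system(ata, atb)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map a 2D point through a 3x3 homography."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Four hypothetical square markers (side 2) around the specimen: 4 x 4 = 16 corners.
TRUE_H = [[1.2, 0.1, 3.0], [-0.05, 0.9, 1.0], [0.001, 0.002, 1.0]]
marker_origins = [(0.0, 0.0), (8.0, 0.0), (0.0, 8.0), (8.0, 8.0)]
corners = [(ox + dx, oy + dy) for ox, oy in marker_origins
           for dx, dy in [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]]
targets = [apply_homography(TRUE_H, p) for p in corners]

estimated_h = fit_homography(corners, targets)
center_est = apply_homography(estimated_h, (5.0, 5.0))
center_true = apply_homography(TRUE_H, (5.0, 5.0))
```

Because a homography has only eight degrees of freedom, sixteen correspondences overdetermine it, which gives the fit some tolerance to corner-detection noise. For real pixel coordinates, a normalized or SVD-based solver would be the more numerically robust choice.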

2606.14533 2026-06-15 cs.LG cs.GT New submission

The Risk Shadow of Principal Component Analysis: When 99.9999% Variance Preservation Causes Catastrophic Decision Errors

Hamidou Tembine

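Affiliations * Department of EECS, School of Engineering, UQTR, Canada; Learning and Game Theory Laboratory (LnG Lab), TIMADIE

AI summary: This paper proves that principal component analysis (PCA) can retain 99.9999% of the variance while losing all information about rare, high-impact events, so that classifiers degenerate into constant predictors. It proposes two methods, Expectile PCA and Tail-Preserving PCA, that preserve tail-risk information by reweighting the covariance.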

Comments 5 tables, 1 figure. all references fully checked manually

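AI-translated abstract

Principal component analysis (PCA) preserves variance, not the information needed to detect rare catastrophic events. This paper proves the existence of a "Risk Shadow": PCA can retain more than 99.9999% of the total variance while completely erasing all signal about rare, high-impact failures. When this happens, even the best possible classifier operating on the PCA representation degenerates into a constant predictor. The root cause is a fundamental mismatch between variance maximization and tail-risk awareness. To break the shadow, we introduce Expectile PCA (ExPCA) and Tail-Preserving PCA (TP-PCA), two methods that reweight the data covariance toward high-impact events. We prove theoretically that ExPCA strictly outperforms PCA in retaining rare-event information, and we validate our claims on synthetic data and a real-world credit card fraud detection benchmark. Our results call for a fundamental rethinking of variance-based dimensionality reduction in high-stakes decision-making.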

English abstract

Principal Component Analysis (PCA) preserves variance, not the information needed to detect rare catastrophic events. This paper proves the existence of a "Risk Shadow": PCA can retain over 99.9999 percent of total variance while completely erasing all signal about rare, high-impact failures. When this happens, even the best possible classifier operating on the PCA representation reduces to a constant predictor. The root cause is a fundamental mismatch between variance maximization and tail risk awareness. To break the shadow, we introduce Expectile PCA (ExPCA) and Tail-Preserving PCA (TP-PCA), two methods that reweight the data covariance toward high-impact events. We prove theoretically that ExPCA strictly outperforms PCA in retaining rare-event information, and we validate our claims on synthetic data and a real-world credit card fraud detection benchmark. Our results call for a fundamental rethinking of variance-based dimensionality reduction in high-stakes decisions.
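
The mismatch between variance and rare-event signal described above can be reproduced on a toy dataset. The sketch below is an illustration only, not the paper's construction: the two-feature dataset, the variable names, and the closed-form 2x2 eigen-decomposition are assumptions. It does not implement ExPCA or TP-PCA, whose definitions are not given in the abstract.

```python
import math

# Toy data (hypothetical): feature 0 is large-variance noise, feature 1 is a
# tiny-variance indicator that fires only on the two rare events.
n = 1000
labels = [1 if i < 2 else 0 for i in range(n)]
data = [(1000.0 if i % 2 == 0 else -1000.0, float(labels[i])) for i in range(n)]


def covariance_2d(points):
    """Return the means and the entries (a, b, c) of [[a, b], [b, c]]."""
    m0 = sum(p[0] for p in points) / len(points)
    m1 = sum(p[1] for p in points) / len(points)
    a = sum((p[0] - m0) ** 2 for p in points) / len(points)
    c = sum((p[1] - m1) ** 2 for p in points) / len(points)
    b = sum((p[0] - m0) * (p[1] - m1) for p in points) / len(points)
    return (m0, m1), (a, b, c)


(mean0, mean1), (a, b, c) = covariance_2d(data)

# Closed-form top eigenpair of the symmetric 2x2 covariance matrix.
half_gap = math.hypot((a - c) / 2, b)
top_eigenvalue = (a + c) / 2 + half_gap
theta = 0.5 * math.atan2(2 * b, a - c)
pc1 = (math.cos(theta), math.sin(theta))
explained = top_eigenvalue / (a + c)  # share of total variance kept by PC1

# Project every sample onto the first principal component.
scores = [(p[0] - mean0) * pc1[0] + (p[1] - mean1) * pc1[1] for p in data]

# Empirical P(rare | score) for each distinct score value.
counts = {}
for z, y in zip(scores, labels):
    key = round(z, 6)
    total, rare = counts.get(key, (0, 0))
    counts[key] = (total + 1, rare + y)
posteriors = {z: rare / total for z, (total, rare) in counts.items()}
```

In this toy case the first component keeps more than 99.9999% of the variance, yet the rare-event probability is the same for every projected value, so no classifier on the one-dimensional representation can do better than a constant prediction.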

2606.14531 2026-06-15 cs.RO New submission

AERMANI-PLACE: Language Guided Object Placement with Aerial Manipulators

Sarthak Mishra, Ritama Sanyal, Rishabh Dev Yadav, Wei Pan, Spandan Roy

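Affiliations * Robotics Research Center, IIIT Hyderabad; Department of Computer Science, University of Manchester; Newcastle University

AI summary: Proposes the AERMANI-PLACE framework, which uses natural language instructions and an image editing model to generate visual markers that guide an aerial manipulator in object placement, reaching average success rates of 87% on the test set and 72% on a real platform.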

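AI-translated abstract

Object placement is a fundamental component of aerial manipulation tasks, but existing systems typically require the desired placement position to be specified explicitly in metric coordinates. Such interfaces are unintuitive and require users to reason about coordinate frames and scene geometry, which makes them hard to use in practical deployments. Humans, by contrast, usually communicate spatial goals through a combination of language and pointing gestures. Inspired by this observation, we propose AERMANI-PLACE, a framework for language-guided object placement with aerial manipulators. Given a scene image and a natural language instruction, an image editing model generates a modified version of the scene containing a visual marker that indicates where the object should be placed. The marker is then grounded in the physical environment using depth observations to recover a metric placement point, after which the aerial manipulator generates and executes a placement trajectory. We evaluate the proposed method on a test set of 100 language-guided placement tasks and demonstrate successful execution on a real aerial manipulation platform. Experimental results show that the method reliably infers placement locations from language instructions, with an average success rate of 87% on the test set, and transfers effectively to real-world aerial manipulation, with an average success rate of 72%. Video: https://youtu.be/SgwwgLBsv0g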

English abstract

Object placement is a fundamental component of aerial manipulation tasks, yet existing systems typically require the desired placement position to be specified explicitly in metric coordinates. Such interfaces are not intuitive and require users to reason about coordinate frames and scene geometry, making them difficult to use in practical deployments. In contrast, humans often communicate spatial goals through a combination of language and pointing gestures. Inspired by this observation, we present AERMANI-PLACE, a framework for language-guided object placement with aerial manipulators. Given a scene image and a natural language instruction, an image editing model generates a modified version of the scene containing a visual marker that indicates where the object should be placed. This marker is then grounded into the physical environment using depth observations to recover a metric place point, after which a placement trajectory is generated and executed by the aerial manipulator. We evaluate the proposed approach on a test set of 100 language-guided placement tasks and demonstrate successful execution on a real aerial manipulation platform. Experimental results show that the proposed method reliably infers placement locations from language instructions with an average success rate of 87% on the test set and transfers effectively to real-world aerial manipulation with an average success rate of 72%. Video: https://youtu.be/SgwwgLBsv0g
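
The abstract does not say how the visual marker is grounded with depth. One common way to turn a marked pixel plus a depth reading into a metric point is pinhole back-projection, sketched below. Everything here is an assumption for illustration: the binary marker mask, the aligned depth image, the intrinsics `fx`, `fy`, `cx`, `cy`, and both function names are hypothetical, and the paper's actual procedure may differ.

```python
def marker_centroid(mask):
    """Centroid (u, v) of the pixels flagged as marker in a binary mask."""
    pts = [(c, r) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    u = sum(p[0] for p in pts) / len(pts)
    v = sum(p[1] for p in pts) / len(pts)
    return u, v


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth in metres."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical 4x4 mask where the edited image contains the marker.
mask = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
depth_image = [[2.0] * 4 for _ in range(4)]  # synthetic, 2 m everywhere
fx = fy = 100.0  # assumed focal lengths in pixels
cx = cy = 2.0    # assumed principal point

u, v = marker_centroid(mask)
z = depth_image[int(v)][int(u)]
place_point = backproject(u, v, z, fx, fy, cx, cy)  # camera-frame point
```

The resulting point is expressed in the camera frame; a real system would still need a transform into the frame in which the placement trajectory is planned.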

2606.14530 2026-06-15 cs.LG New submission

Code Correctness Signals in LLM Hidden States: Pre-Generation Probing and Repair Geometry

Carlo Di Cicco

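Affiliations * Independent researcher

AI summary: Using residualization, this paper finds that code correctness is linearly decodable from the pre-generation hidden states of Qwen3-4B-Instruct (AUC 0.931), but that the directional signal for repair success disappears after controlling for repair-context covariates, yielding one positive and one negative result of methodological interest.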

Comments 12 pages, 8 tables. Code, data, and analysis scripts available at https://github.com/CarloDiCicco/ReasoningLab

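AI-translated abstract

Large language models encode rich information in their hidden states. This work studies whether code correctness is legible in the hidden states of Qwen3-4B-Instruct-2507, both before it generates and as it repairs a failed attempt, on 444 LiveCodeBench tasks. It reports two findings, linked by a single confound-control tool: residualization. First, the correctness of the model's first-attempt code is linearly decodable from the prompt-final hidden state, with a leakage-free held-out AUC of 0.931±0.008 across 50 outer splits. After the linear effect of prompt length is removed from each hidden-state dimension, the probe still reaches 0.911±0.010, well above the prompt-length baseline of 0.754±0.014. Second, on 236 cleaned cases in which the model attempts to repair a failed first attempt, the hidden-state shift from the failing attempt to its repair carries a statistically detectable contrastive direction, significant against label-shuffled nulls on both a magnitude test and a split-half test. This direction does not survive conditional residualization against repair-context covariates (which differ between successful and failed repairs), indicating that it is a correlate of repair success driven by the repair context rather than an isolated repair-comprehension feature. The probe layer is selected by nested cross-validation, and the same residualization approach that upholds the pre-generation correctness result overturns the repair-direction interpretation. The contribution is as much methodological as empirical: a diagnostic honest enough to report a negative result alongside a positive one.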

English abstract

Large language models encode rich information in their hidden states. This work asks whether code correctness is legible in the hidden states of Qwen3-4B-Instruct-2507, before it generates and as it repairs a failed attempt, studied on 444 LiveCodeBench tasks. It reports two findings connected by a single confound-control tool: residualization. First, the correctness of the model's first-attempt code is linearly decodable from the prompt-final hidden state, with a leakage-free held-out AUC of 0.931 +/- 0.008 across 50 outer splits. After the linear effect of prompt length is removed from each hidden state dimension, the probe still reaches 0.911 +/- 0.010, well above a prompt-length baseline of 0.754 +/- 0.014. Second, on 236 cleaned cases where the model attempts to repair a failed first attempt, the hidden state shift from the failing attempt to its repair carries a statistically detectable contrastive direction, significant on both a magnitude and a split-half test against label-shuffled nulls. This direction does not survive a conditional residualization against repair-context covariates that differ between successful and failed repairs, marking it as a correlate of repair success driven by the repair context rather than an isolated repair-comprehension feature. The probe layer is selected by nested cross-validation, and the same residualization approach that upholds the pre-generation correctness result overturns the repair-direction interpretation. The contribution is as much methodological as empirical: a diagnostic honest enough to report a negative result alongside a positive one.
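
Residualization, the confound control named in the abstract, removes the linear effect of a covariate from each feature dimension before a probe is fit. The sketch below shows the basic per-dimension least-squares version on toy numbers. It is an illustration under assumptions, not the paper's pipeline: the function name, the toy "hidden states", and the single covariate (prompt length) are hypothetical, and the conditional variant used for the repair analysis is not reproduced.

```python
def residualize(features, covariate):
    """Remove the least-squares linear effect of a covariate from each column."""
    n = len(covariate)
    c_mean = sum(covariate) / n
    c_var = sum((c - c_mean) ** 2 for c in covariate)
    residuals = [[0.0] * len(features[0]) for _ in range(n)]
    for d in range(len(features[0])):
        col = [row[d] for row in features]
        f_mean = sum(col) / n
        slope = sum((c - c_mean) * (f - f_mean)
                    for c, f in zip(covariate, col)) / c_var
        for i in range(n):
            # Subtract the fitted line; what remains is uncorrelated with c.
            residuals[i][d] = col[i] - f_mean - slope * (covariate[i] - c_mean)
    return residuals


# Toy data (hypothetical): dimension 0 mixes prompt length with a signal,
# dimension 1 is a pure linear function of prompt length.
prompt_lengths = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
signal = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
hidden = [[0.1 * length + s, -0.05 * length + 2.0]
          for length, s in zip(prompt_lengths, signal)]

cleaned = residualize(hidden, prompt_lengths)
```

After this step dimension 1 is essentially zero, while dimension 0 keeps only the part of the signal that prompt length cannot explain linearly. To stay leakage-free in a cross-validated setup, the slopes would typically be estimated on the training folds only and then applied to held-out data.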
