arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.19245 2026-06-18 cs.AI cs.LG new submission

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman

Affiliations: LatchBio

AI summary: Introduces TxBench-PP, a benchmark that evaluates whether AI agents can recover preclinical pharmacology conclusions from real experimental data; the strongest configuration tested, Claude Opus 4.8 / Pi, passed only 59.3% of endpoint attempts.

Abstract

Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).
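
The headline pass rates follow directly from the reported counts. The sketch below recomputes them and, purely as a reference point, adds a naive Wald interval that treats all 300 attempts as independent. The abstract does not say how its own interval was computed, so `wald_interval` is an illustrative assumption, not the authors' procedure, and both function names are hypothetical.

```python
import math

def pass_rate(passed: int, attempts: int) -> float:
    """Fraction of endpoint attempts that passed."""
    return passed / attempts

def wald_interval(passed: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """Naive normal-approximation interval (assumes independent attempts)."""
    p = pass_rate(passed, attempts)
    half_width = z * math.sqrt(p * (1.0 - p) / attempts)
    return p - half_width, p + half_width

# Counts as reported in the abstract.
for name, passed in [("Claude Opus 4.8 / Pi", 178), ("GPT-5.5 / Pi", 166)]:
    low, high = wald_interval(passed, 300)
    print(f"{name}: {100 * pass_rate(passed, 300):.1f}% "
          f"(naive 95% interval {100 * low:.1f}-{100 * high:.1f})")
```

The naive interval comes out narrower than the reported one. That would be consistent with attempts being correlated, for example repeated attempts on the same evaluation, but this reading is an inference and is not stated in the abstract.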

2606.19240 2026-06-18 cs.RO cs.CV cs.HC cs.SY eess.SY new submission

Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation

Thomas M. Kwok, Nicholas Koenig, Yue Hu

Affiliations: University of Waterloo

AI summary: Proposes an Arm Kinematic Correction method that deterministically reconstructs occluded joint depths using a constant-arm-length geometric constraint and the Pythagorean theorem, without complex modeling; the method is validated against Vicon and applied successfully to teleoperation.

Abstract

Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagorean theorem, thereby avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system demonstrates reliable performance for both static and dynamic joint motions, evaluated using root-mean-square error (RMSE) and Pearson correlation. Furthermore, motion-mapping teleoperation is successfully demonstrated in both simulated and physical robot environments. The results show that AKC enhances robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.
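
The abstract gives no formula, so the following is only a minimal sketch of how a constant segment length can fix an occluded depth through the Pythagorean theorem. It assumes the occluded joint's lateral coordinates are already known in the metric camera frame and that the sign of the depth offset is chosen by a flag; `occluded_depth` and `joint_behind_wrist` are hypothetical names, not taken from the paper.

```python
import math

def occluded_depth(wrist, joint_xy, segment_length, joint_behind_wrist=True):
    """Recover the depth (z) of an occluded joint from a fixed segment length.

    wrist: (x, y, z) of the visible wrist, in metres.
    joint_xy: (x, y) of the occluded joint in the same camera frame.
    segment_length: predefined wrist-to-joint length, in metres.
    joint_behind_wrist: hypothetical flag that picks the sign of the depth offset.
    """
    dx = joint_xy[0] - wrist[0]
    dy = joint_xy[1] - wrist[1]
    # Pythagorean theorem: length^2 = dx^2 + dy^2 + dz^2.
    dz_squared = segment_length ** 2 - dx ** 2 - dy ** 2
    # Clamp small negatives caused by noisy lateral measurements.
    dz = math.sqrt(max(dz_squared, 0.0))
    return wrist[2] + dz if joint_behind_wrist else wrist[2] - dz
```

Because the result is a closed-form expression, there is nothing to tune beyond the segment length itself, which matches the abstract's point about avoiding probabilistic modeling and parameter tuning.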

2606.19236 2026-06-18 cs.LG cs.AI cs.CL new submission

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng, Han Hu, Yansong Tang

Affiliations: Shenzhen International Graduate School, Tsinghua University; Tencent Hunyuan

AI summary: To address policy entropy collapse in RL algorithms such as GRPO, proposes STARE, which identifies entropy-critical tokens via surprisal quantiles and reweights their advantages, and stabilizes entropy with a target-entropy closed-loop gate; it trains stably on 1.5B-32B models across several tasks and improves AIME24/25 accuracy by 4%-8%.

Comments: LLM, Reinforcement Learning

Abstract

Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating a sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.
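
The three ingredients named in the abstract (batch-internal surprisal quantiles, selective advantage reweighting, and a target-entropy gate) can be illustrated with a toy sketch. Everything specific here is an assumption: the quantile `q`, the constant `scale`, the rule that the gate opens whenever entropy leaves the target band, and all function names. The sketch also ignores the advantage-surprisal quadrant structure, so it should not be read as the STARE update rule; the linked repository is the reference for that.

```python
import math

def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, q in [0, 1]."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight

def reweight_advantages(token_probs, advantage, current_entropy,
                        target_low, target_high, q=0.8, scale=0.5):
    """Return per-token advantages for one batch of sampled tokens.

    token_probs: probability the policy assigned to each sampled token.
    advantage: trajectory-level advantage shared by all tokens (as in GRPO).
    The gate, the quantile q, and the scale are illustrative assumptions.
    """
    surprisals = [-math.log(p) for p in token_probs]
    threshold = quantile(surprisals, q)
    # Closed-loop gate: only intervene when entropy leaves the target band.
    gate_open = not (target_low <= current_entropy <= target_high)
    weights = [scale if gate_open and s >= threshold else 1.0 for s in surprisals]
    return [w * advantage for w in weights]
```

With the gate closed, every token keeps the shared trajectory-level advantage, which is the plain GRPO behaviour the abstract starts from.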

2606.19233 2026-06-18 cs.RO new submission

Mobile Pedipulation for Object Sliding via Hierarchical Control on a Wheeled Bipedal Robot

Yue Qin, Yulun Zhuang, Zelin Shen, Yanran Ding

Affiliations: University of Michigan

AI summary: Proposes a hierarchical control framework that lets a wheeled bipedal robot slide planar objects with its legs, using a reduced-order three-rigid-body dynamics model and a trajectory-optimization motion planner; experiments demonstrate retrieval of a 1 kg object and sliding of a 4 kg object.

Comments: 8 pages, 7 figures

Abstract

In this letter, we present a hierarchical control framework that enables wheeled bipedal robots to perform planar object sliding tasks with their wheeled legs. The proposed approach formulates a nonlinear model predictive controller (NMPC) based on a reduced-order three rigid bodies (TRB) dynamical model that explicitly accounts for the hip roll degree of freedom and multiple wheel-environment contact modes, which is essential for lateral stepping and pedipulation tasks. Within this framework, the NMPC simultaneously regulates robot locomotion and interaction forces, allowing the robot to stably execute both rolling and object manipulation behaviors. A trajectory-optimization-based robot-object motion planner is developed to generate reference motions that incorporate stick-slip transitions in ground-object contact. Two representative pedipulation motions, namely scooting and lateral sliding, are validated through real-world hardware experiments, in which the robot successfully retrieves a 1 kg object from under a desk and slides a 4 kg object over a distance of 0.228 m via scooting.

2606.19230 2026-06-18 cs.LG cs.HC stat.ML new submission

A Human-in-the-Loop Bayesian Optimization Framework for Constraint-Aware Bioprocess Development

Samuel Stricker, Claus Wirnsperger, Alessandro Butté, Laura Helleckes, Gonzalo Guillén Gosálbez, Antonio del Rio Chanona, Mehmet Mercangöz

Affiliations: Imperial College London; DataHow AG; ETH Zurich

AI summary: Proposes an extended Pareto Front Guided Sampling framework that treats the constraint-satisfaction probability and robustness derived from a Gaussian process surrogate as objectives of a multi-objective optimization, combined with an interactive dashboard for human-in-the-loop, constraint-aware bioprocess optimization.

Abstract

This work presents an extension to Pareto Front Guided Sampling (PFGS), a Human-in-the-Loop (HitL) Bayesian Optimization (BO) framework in which Gaussian process (GP) surrogate-derived quantities are reformulated as objectives of a multi-objective optimization problem, and the resulting Pareto front is exposed to a domain expert for interactive candidate selection rather than returning a single automated recommendation. The framework is extended in two directions: constrained optimization is addressed by incorporating the posterior probability of satisfying output specification limits as an explicit Pareto objective, computed analytically from the GP posterior distribution; robust optimization is addressed by a Monte Carlo sampling strategy that estimates expected lower-confidence performance over a user-defined variability of input perturbations, capturing performance degradation under likely implementation deviations. The resulting multi-dimensional Pareto representation renders trade-offs between predicted performance, model uncertainty, probabilistic constraint satisfaction, and input robustness simultaneously visible through pairwise two-dimensional projections on an interactive dashboard, enabling selection criteria to be iteratively refined as the surrogate model improves and development objectives evolve. The framework is showcased on an eight-dimensional fed-batch Chinese Hamster Ovary (CHO) cell culture simulator demonstrating systematic identification of high-performing, feasibility-compliant, and perturbation-resilient operating conditions, and illustrating how expert-defined requirements provide a principled stopping criterion and support informed allocation of experimental resources.
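
For a single output whose GP posterior at a candidate input is Gaussian, the probability of meeting lower and upper specification limits has a closed form in terms of the normal CDF. The sketch below shows that calculation under this single-output assumption; the function names are hypothetical, and the paper's exact formulation (for example with several constrained outputs) may differ.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_within_spec(mean: float, std: float, lower: float, upper: float) -> float:
    """P(lower <= y <= upper) for y ~ Normal(mean, std^2)."""
    return normal_cdf((upper - mean) / std) - normal_cdf((lower - mean) / std)

# Hypothetical candidate: posterior mean 5.2, posterior std 0.4, spec limits [4.5, 6.0].
print(prob_within_spec(5.2, 0.4, 4.5, 6.0))
```

A one-sided limit can be expressed by passing `float("-inf")` or `float("inf")` for the missing bound. Used as a Pareto objective, this value lets an expert trade predicted performance against the chance of violating a specification.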

2606.19227 2026-06-18 cs.RO new submission

Constant Time-Delay Leader Following with Neural Networks and Invariant Extended Kalman Filters for Arbitrary Trajectories

Luka Antonyshyn, Paulo Ricardo Marques de Araujo, Sidney Givigi

Affiliations: University of Toronto Institute for Aerospace Studies; School of Computing, Queen’s University

AI summary: Proposes a constant time-delay trajectory tracking method that combines a probabilistic Seq2Seq neural network with an invariant extended Kalman filter for convoys without communication or global coordinates, accurately estimating the leader's trajectory on the SE(2) manifold and using geometric model predictive control to improve performance.

Comments: 9 pages, 6 figures

Abstract

This paper proposes a constant time-delay trajectory tracking method for vehicle convoys operating without inter-vehicle communication, a common coordinate system, or global positioning. The method integrates a probabilistic sequence-to-sequence (Seq2Seq) neural network with an invariant extended Kalman filter (IEKF) to warm-start the prediction process, allowing accurate estimation of a leader vehicle's relative trajectory on the SE(2) manifold. A geometric model predictive controller is further incorporated to fully exploit the manifold-based trajectory predictions for improved control performance. The system can handle arbitrary nonlinear trajectories with varying speeds and motion profiles while reducing the need for expert-based domain knowledge for the design of trajectory following systems, even under long trajectory delays. The effectiveness of the method is validated through comparisons with a pure IEKF baseline, learning-based methods, and the ground-truth trajectory in kinematic simulations, as well as in experiments using real robotic vehicles.
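
As background for the phrase "relative trajectory on the SE(2) manifold", the sketch below shows the two group operations involved: inverting a planar pose and composing two poses, which together give the leader's pose in the follower's frame. In the paper's setting the follower has no global frame and estimates this quantity from onboard sensing, so the world-frame inputs here are only an illustrative assumption, as are the function names.

```python
import math

def se2_inverse(pose):
    """Inverse of an SE(2) pose given as (x, y, theta)."""
    x, y, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    # Inverse rotation is R^T; inverse translation is -R^T t.
    return (-(c * x + s * y), -(-s * x + c * y), -theta)

def se2_compose(a, b):
    """Composition a * b: apply b in the frame defined by a."""
    ax, ay, at = a
    bx, by, bt = b
    c, s = math.cos(at), math.sin(at)
    theta = math.atan2(math.sin(at + bt), math.cos(at + bt))  # wrap the angle
    return (ax + c * bx - s * by, ay + s * bx + c * by, theta)

def relative_pose(follower, leader):
    """Leader pose expressed in the follower's body frame."""
    return se2_compose(se2_inverse(follower), leader)
```

For example, a leader two metres ahead of a follower with the same heading has relative pose `(2, 0, 0)` regardless of where the pair sits in the world, which is why a relative formulation needs no common coordinate system.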

2606.19222 2026-06-18 cs.LG cs.AI new submission

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou

Affiliations: School of Engineering, Institute of Science Tokyo, Japan; College of Control Science and Engineering, Zhejiang University, China; Department of Electrical and Computer Engineering, National University of Singapore, Singapore

AI summary: Proposes MAST, which selectively updates parameters under mechanistic guidance and substantially reduces collateral damage to retained performance when unlearning RLVR-induced reasoning behavior.

Comments: 15 pages, 4 figures, 7 tables

Abstract

We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning with substantially lower collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoints on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability, and full-parameter gradient ascent forgets only by damaging retain MATH and GSM8K. MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset. On the primary model, MAST induces statistically significant target forgetting (MATH forget 45/150 to 37/150; McNemar p=0.0078) while preserving GSM8K (+0.8 pp) and MATH retain (-0.5 pp). The advantage reproduces across seeds, NPO/SimNPO objectives, and Qwen3, where MAST preserves GSM8K while full-parameter unlearning collapses it.
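
The reported McNemar p-value can be sanity-checked from the counts. The drop from 45 to 37 correct items is a net change of 8, and an exact two-sided McNemar test gives p = 0.0078 if those 8 items are the only discordant pairs, all flipping from correct to incorrect. That split is an inference from the numbers, not something the abstract states, and `mcnemar_exact` is a hypothetical helper name.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Assumed split: 8 items flipped from correct to incorrect, none the other way.
print(mcnemar_exact(8, 0))  # 2 * 0.5**8 = 0.0078125
```

Any split with more discordant pairs in both directions (say 10 versus 2) would also yield a net change of 8 but a different p-value, so the match with 8 versus 0 is suggestive, not conclusive.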

2606.19220 2026-06-18 cs.LG cs.AI new submission

Machine Unlearning for the XGBoost Model with Network Intrusion Datasets

Diana Magalhães, Eva Maia, João Vitorino, Isabel Praça

Affiliations: GECAD, ISEP, Polytechnic of Porto

AI summary: Proposes XGBoost-Forget, an unlearning method for XGBoost models that unlearns efficiently on tabular network intrusion datasets, maintaining model performance while significantly speeding up unlearning.

Comments: 12 pages, 7 tables, WorldCist'26 Conference

Abstract

Machine Unlearning (MU) has emerged as an important technique for removing specific data points from trained models without requiring full retraining. However, most existing MU research focuses on deep learning and image data, leaving a gap in the domain of network intrusion detection, which relies heavily on tabular data. This work introduces XGBoost-Forget, an unlearning approach for the XGBoost model, to address this gap. The approach is evaluated on two tabular Network Intrusion (NI) datasets, IoT-23 and GeNIS, using multiple metrics to assess model performance, unlearning efficiency, and forgetting quality. The results show that XGBoost-Forget maintains predictive performance close to the original model while providing significantly faster unlearning, demonstrating its potential for MU in tabular NI settings.

2606.19218 2026-06-18 cs.CL new submission

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee

Affiliations: University of Alabama Huntsville; University of North Alabama; Stanford University; Meta AI; Amazon GenAI

AI summary: Introduces the RECOM dataset and finds that automatic metrics for open-ended question answering cannot deliver validity and discriminative power at once: cosine similarity has high validity but poor discrimination, while BERTScore's discrimination is driven by response length and its validity is weak.

Abstract

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7-10B) against every reply, with each metric paired with a random-derangement noise floor, we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity-discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0
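
The validity measurement described above (matched pairs versus a random-derangement noise floor, summarized by Cohen's d) can be sketched in a few lines. This is a generic illustration, not the RECOM evaluation code: the pooled-variance form of Cohen's d, the rejection-sampling derangement, and the names `validity_effect` and `score` are all assumptions.

```python
import random
import statistics

def random_derangement(n: int, rng: random.Random) -> list[int]:
    """Permutation of range(n) with no fixed points (rejection sampling), n >= 2."""
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d with a pooled standard deviation."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled = (((len(a) - 1) * var_a + (len(b) - 1) * var_b)
              / (len(a) + len(b) - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled

def validity_effect(score, outputs, references, seed=0):
    """Effect size separating matched pairs from deranged (mismatched) pairs."""
    perm = random_derangement(len(outputs), random.Random(seed))
    matched = [score(o, r) for o, r in zip(outputs, references)]
    floor = [score(o, references[j]) for o, j in zip(outputs, perm)]
    return cohens_d(matched, floor)
```

A derangement guarantees that no output is ever scored against its own reference, so the floor reflects what the metric assigns to unrelated text. Discriminative power would use the same `cohens_d` but compare the matched scores of two different models.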

2606.19215 2026-06-18 cs.CV new submission

GUMP-Net: An interpretable model-data-driven intelligent algorithm for multi-class pelvic segmentation

Liheng Wang, Yinghui Zhang, Licheng Zhang, Hailin Xu, Qiyong Cao, Chong Chen

Affiliations: State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Department of Orthopedics, The Fourth Medical Center of Chinese PLA General Hospital; National Clinical Research Center for Orthopedics, Sports Medicine and Rehabilitation; Department of Trauma and Orthopedics, People’s Hospital Peking University; Department of Orthopedics and Traumatology, Beijing Jishuitan Hospital, Capital Medical University

AI summary: Proposes GUMP-Net, which combines an improved geodesic active contour model with deep neural networks for multi-class pelvic segmentation, performs better with small training sets, and offers an interpretable geometric perspective.

Comments: 26 pages, 8 figures, 3 tables

Abstract

Pelvic segmentation is one of the most important and fundamental research problems in precise and intelligent diagnosis and treatment, as well as surgical planning and navigation, for pelvic fractures. By combining an improved geodesic active contour model with deep neural networks, we propose GUMP-Net, an interpretable model-data-driven intelligent algorithm for multi-class pelvic segmentation. Three network modules together constitute the overall segmentation framework: an object detection module for automatic level set initialization, an edge detector module for learning an anatomy-aware edge detector function, and an iteration module for deep level set evolution. Leveraging the advantages of level set representation and deep learning, GUMP-Net shows more accurate, robust, and consistent segmentation performance than state-of-the-art methods, especially when training data are scarce. Extensive experiments on pelvic datasets demonstrate the rationality and effectiveness of the proposed algorithm. Further experiments on an ankle dataset indicate broader applicability to other anatomies. The proposed algorithm not only provides an efficient segmentation method for complex fracture reduction, but also gives an interpretable geometric perspective for understanding deep learning segmentation.

2606.19209 2026-06-18 cs.SD new submission

FineCombo-TTS: Collaborative and Precise Controllable Speech Synthesis Using Text Descriptions and Reference Speech

Shuoyi Zhou, Yixuan Zhou, Peiji Yang, Yifan Hu, Yicheng Zhong, Zhisheng Wang, Zhiyong Wu

Affiliations: Shenzhen International Graduate School, Tsinghua University; Inner Mongolia University; Tencent

AI summary: Proposes FineCombo-TTS, a unified framework that uses a conditional-flow-matching Speech Variance Predictor to model fine-grained, text-guided reference-to-target transformations, enabling flexible and precise control of acoustic attributes.

Comments: Accepted by Interspeech 2026

Abstract

Controllable text-to-speech (TTS) has become a key research focus. However, methods based on either reference speech or text descriptions lack flexibility and precise control, and recent joint approaches remain loosely coupled, with speech modeling timbre and text controlling global style. We propose FineCombo-TTS, a unified framework for speech synthesis grounded in reference speech and guided by text descriptions, enabling flexible and precise control over acoustic attributes. Instead of explicit attribute disentanglement, we learn a unified acoustic representation and introduce a Conditional Flow Matching (CFM)-based Speech Variance Predictor to model fine-grained reference-to-target transformations guided by text descriptions. To support relative attribute control, we construct FineEdit, a structured paired dataset that explicitly encodes source-to-target attribute variations. Experiments demonstrate that our approach achieves flexible, precise, and expressive controllable TTS.

2606.19204 2026-06-18 cs.CV new submission

ROSA-TFormer: A Radar-Optical Sensor-Aware Temporal Transformer for Pinus sylvestris Plantation Classification in Northern Shaanxi Using GEE-Derived Sentinel-1/2 Time Series

Nengbo Zhang, Chang sheng

Affiliations: Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences (AIRCAS)

AI summary: Proposes ROSA-TFormer, which integrates SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling to classify Pinus sylvestris plantations from Sentinel-1/2 time series, reaching 99.67% overall accuracy.

Comments: journal in tree classification

Abstract

Accurate identification of Pinus sylvestris var. mongolica plantations is important for monitoring afforestation quality and ecological restoration in northern Shaanxi. This paper proposes ROSA-TFormer, a radar-optical sensor-aware temporal Transformer for P. sylvestris classification using Sentinel-1/2 time-series data generated on Google Earth Engine. The model integrates separate SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling to capture multi-source seasonal features. Experiments on monthly and half-month point-level datasets show that ROSA-TFormer achieves strong classification performance, with 99.67% overall accuracy, 99.56% macro F1, and 98.91% P. sylvestris F1 on the HalfMonth-dataBig dataset. Spatial block validation and ablation results further indicate the effectiveness of radar-optical temporal fusion and sensor-aware modeling. The results demonstrate the potential of ROSA-TFormer for point-level P. sylvestris plantation classification, while broader wall-to-wall validation remains necessary.
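
The abstract reports three different scores (overall accuracy, macro F1, and the F1 of the P. sylvestris class), and they differ because each weights the classes differently. The sketch below computes all three from a confusion matrix using the standard definitions; the helper names and the two-class example matrix are hypothetical and are not the paper's data.

```python
def per_class_f1(confusion, k):
    """F1 for class k from a square confusion matrix (rows = true, cols = predicted)."""
    tp = confusion[k][k]
    fp = sum(row[k] for row in confusion) - tp
    fn = sum(confusion[k]) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def overall_accuracy(confusion):
    """Share of all samples on the diagonal."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total

def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    n = len(confusion)
    return sum(per_class_f1(confusion, k) for k in range(n)) / n

# Hypothetical imbalanced example: class 0 is common, class 1 is rare.
example = [[90, 10],
           [5, 5]]
print(overall_accuracy(example), macro_f1(example), per_class_f1(example, 1))
```

In the hypothetical example the accuracy looks far better than the rare-class F1, which is why reporting the single-class F1 alongside overall accuracy is informative when the target class is a minority.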

2606.19199 2026-06-18 cs.LG cs.AI new submission

Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times

Giuseppe Gabriele, Fabio Pavirani, Seyed Soroush Karimi Madahi, Chris Develder

Affiliations: Ghent University - imec

AI summary: To address the poor performance of RL charging policies when EV departure times are unknown, proposes a decision-focused RL framework that jointly trains the forecaster and the controller end to end, improving total reward by up to 14% and reducing unsupplied energy by 55%.

Comments: ACM e-Energy 2026; 5 pages, 1 figure, 1 table

Abstract

The recent growth of EV adoption poses challenges for power systems, including increased peak demand and potential grid instability. Smart control of EV charging, e.g., based on reinforcement learning (RL), can alleviate these issues by learning temporal and contextual patterns from historical data. Yet, in real-world scenarios, key features such as departure time are often unavailable. This, in turn, makes it harder for an RL agent to learn and execute an effective charging policy. To mitigate this uncertainty, a trained forecaster can approximate the unknown features from available data. However, since these forecasting models are typically trained for accuracy (rather than for their impact on a downstream agent's decision quality), their errors may propagate and hinder the overall performance of a controller that uses the forecasts. To avoid this, we propose a decision-focused RL (DF-RL) framework in which the forecaster is trained end-to-end, i.e., with feedback from the charging policy actions taken by the RL agent. Such joint training of both the forecaster and controller ultimately results in higher-quality actions: our proposed DF-RL method yields superior charging decisions compared to other baselines, achieving up to a 14% improvement in total reward and a 55% reduction of unsupplied energy (i.e., charging that failed to happen because the EV had already left), relative to the RL method without departure time forecasting.

2606.19195 2026-06-18 cs.CV new submission

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Kangsheng Duan, Ziyang Xu, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang

Affiliations: Huazhong University of Science and Technology; VIVO AI Lab

AI summary: Proposes Moebius, a lightweight image inpainting framework that uses a Local-λ Mix Interaction block and an adaptive multi-granularity distillation strategy to match or surpass the generation quality of the 10B-level model FLUX.1-Fill-Dev with 0.22B parameters, with inference more than 15x faster.

Abstract

While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To overcome this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-$λ$ Mix Interaction ($LλMI$) block. Comprising Local-$λ$ and Interactive-$λ$ modules, it summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this combination enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2% of the parameters (0.22B vs. 11.9B) while delivering a $>15\times$ acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page at https://hustvl.github.io/Moebius.

2606.19190 2026-06-18 cs.RO new submission

FAST-LIVGO: A Degeneracy-Robust LiDAR-Inertial-Visual-GNSS Fusion Odometry

Zhiyu Chen, Chunran Zheng, Jiayu Wen, XiaoLei Zhang, Jiaming Xu, Feng Pan, Yukang Cui

Affiliations: College of Mechatronics and Control Engineering, Shenzhen University; Department of Mechanical Engineering, The University of Hong Kong; College of Automation, Harbin Engineering University

AI summary: Proposes a tightly coupled LiDAR-Inertial-Visual-GNSS fusion framework based on an error-state iterated Kalman filter, with a Dynamic Time Warping spatiotemporal alignment module, Doppler and time-differenced carrier phase observation models, and a degeneracy-aware dual-mode outlier rejection strategy, achieving accurate and robust state estimation in long-term, large-scale, dynamic environments.

Comments: Accepted for presentation at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Abstract

Robust state estimation and mapping in long-term, large-scale, and highly dynamic environments remains a key challenge in robotics. Existing LiDAR-Inertial-Visual Odometry (LIVO) systems achieve strong local accuracy but suffer from accumulated drift over long distances and may fail in geometrically degraded or textureless scenes. Meanwhile, GNSS-aided fusion frameworks often rely on LiDAR or visual odometry for state prediction and outlier rejection, making them vulnerable when odometry degenerates. To address these limitations, we propose a tightly coupled LiDAR-Inertial-Visual-GNSS fusion framework based on an Error-State Iterated Kalman Filter. An online spatiotemporal alignment module using Dynamic Time Warping is introduced for highly dynamic conditions. To better exploit GNSS precision, we develop observation models based on Doppler shifts and fixed-anchor Time-Differenced Carrier Phase, providing millimeter-level relative constraints without augmenting historical anchor states. We further design a degeneracy-aware dual-mode outlier rejection strategy that switches between LIVO-prior-guided rejection and GNSS-aided recovery according to the LIVO degeneracy level. Experiments on the public M3DGR dataset and a custom 20 m/s fixed-wing UAV dataset demonstrate that our system reduces accumulated drift and map ghosting, outperforming state-of-the-art methods in accuracy and robustness.
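
Dynamic Time Warping, which the abstract names as the basis of the alignment module, aligns two sequences that may be shifted or stretched in time. The sketch below is the textbook offline algorithm on scalar sequences, for instance speed magnitudes from two sensors; it is not the paper's online spatiotemporal module, whose details the abstract does not give, and `dtw_distance` is a hypothetical name.

```python
def dtw_distance(a, b):
    """Classic dynamic-time-warping cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] and b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances, b sample reused
                                    cost[i][j - 1],      # b advances, a sample reused
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

A signal and a delayed copy of it align at zero cost, which is the property that makes the method useful for estimating time offsets between sensor streams. The quadratic table is fine for short windows; an online variant would restrict the warping band or update the table incrementally.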

2606.19186 2026-06-18 cs.RO cs.LG New submission

Learning to Annotate Delayed and False AEB Events: A Practical System for Extreme Class Imbalance and Asymmetric Label Noise

学习标注延迟和误报AEB事件:针对极端类别不平衡和非对称标签噪声的实用系统

Mengxiang Hao, Xin Jiang, Xinghao Huang, Wenliang Su, Zhiteng Wang, Junjie Rao, Xiaotian Yang, Wei Liao, Chengyu Han, Gen Liang, Yulun Song, Zhitao Xu, Xianpeng Lang

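Affiliations * Li Auto(理想汽车)

AI summary Proposes the first automated AEB annotation framework, which uses targeted data augmentation and noise suppression to handle extreme class imbalance and asymmetric label noise, improving recall of delayed/false triggers by 80% and cutting manual workload by 50%.

Comments 8 pages, 5 figures, accepted by IEEE International Conference on Robotics and Automation (ICRA)

Details
Journal ref
2026 IEEE International Conference on Robotics and Automation (ICRA)
AI Chinese abstract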

自主紧急制动(AEB)优化依赖于准确标注的真实世界触发事件,特别是揭示系统缺陷的罕见但关键的延迟和误报AEB触发事件。然而,这些少数样本在每天数千次触发事件中占比不到5%,使得大规模人工标注成本过高。我们提出了首个自动化AEB标注框架来解决这一问题。在开发过程中,我们识别出两个严重损害延迟/误报触发标注准确性的基本挑战:(1)极端类别不平衡,其中延迟/误报触发被真实触发淹没;(2)非对称标签噪声,其中误标注的多数样本(真实触发)抑制了少数样本(延迟/误报触发)的学习。为克服这些挑战,我们提出两项关键创新:(1)特定数据增强,通过操纵焦点目标属性、移植自车动态和掩蔽非焦点代理来合成逼真样本;(2)噪声抑制,使用稳定硬度估计和探针引导的自适应阈值来清理误标注的真实触发样本。关键的是,我们将模型部署为具有全栈架构的实用标注系统,从每天数千个AEB事件中高效识别关键的延迟/误报触发。生产结果表明,延迟/误报触发的召回率提高了80%,人工工作量减少了50%。除了直接收益,该系统通过积累高质量标注实现持续自我改进,为车载AEB系统优化奠定了必要的数据基础。

English abstract

Autonomous Emergency Braking (AEB) optimization relies on accurately annotated real-world trigger events, particularly rare but critical delayed and false AEB triggers that expose system deficiencies. However, these minority samples comprise less than 5% of thousands of daily triggers, making manual annotation prohibitively expensive at scale. We present the first automated AEB annotation framework to address this problem. During development, we identified two fundamental challenges that severely impair delayed/false trigger annotation accuracy: (1) extreme class imbalance, where delayed/false triggers are overwhelmed by true triggers; (2) asymmetric label noise, where mislabeled majority samples (true triggers) suppress the learning of minority samples (delayed/false triggers). To overcome these challenges, we propose two key innovations: (1) specific data augmentation that synthesizes realistic samples by manipulating focal target attributes, transplanting ego-vehicle dynamics, and masking non-focal agents; (2) noise suppression using stable hardness estimation and a probe-guided adaptive threshold to clean mislabeled true trigger samples. Crucially, we deploy our model as a practical annotation system with full-stack architecture, efficiently identifying critical delayed/false triggers from thousands of daily AEB events. Production results demonstrate 80% improvement in recall of delayed/false triggers and 50% reduction in manual workload. Beyond immediate gains, the system enables continuous self-improvement through accumulated high-quality annotations, establishing a necessary data foundation for on-vehicle AEB system optimization.

2606.19185 2026-06-18 cs.LG New submission

AGDN: Learning to Solve Traveling Salesman Problem with Anisotropic Graph Diffusion Network

AGDN:利用各向异性图扩散网络学习求解旅行商问题

Bolin Shen, Ziwei Huang, Zhiguang Cao, Yushun Dong

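Affiliations * Florida State University(佛罗里达州立大学) Singapore Management University(新加坡管理大学)

AI summary Proposes the Anisotropic Graph Diffusion Network (AGDN), which uses a MixScore transition matrix and an anisotropic diffusion strategy to exploit graph structure when solving the Traveling Salesman Problem, outperforming existing methods across instance sizes and distributions.

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Details
AI Chinese abstract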

旅行商问题(TSP)是组合优化的基石,出现在许多实际场景中。尽管基于图的学习方法已被探索用于TSP,但如何更有效地利用图结构的问题仍然悬而未决。我们提出了各向异性图扩散网络(AGDN),一种新的图神经网络框架,旨在求解TSP。我们的方法解决了两个核心难点:(1)完全连接TSP图中缺乏信息丰富的拓扑先验,以及(2)在常用的图稀疏化技术后,最优解中丢失连接节点。为了克服这些问题,我们构建了一个MixScore转移矩阵,将节点相似性与成对距离相结合,并开发了一种各向异性图扩散策略,支持跨多跳的高效信息交换。涵盖不同实例规模和节点分布的全面实验表明,AGDN在保持计算时间竞争力的同时,始终优于现有方法。此外,AGDN能够很好地泛化到训练期间未见的问题规模和分布。实现代码已公开在:this https URL。

English abstract

The Traveling Salesman Problem (TSP) is a cornerstone of combinatorial optimization and arises in many practical scenarios. Although graph-based learning approaches have been explored for TSP, the question of how to exploit graph structure more effectively remains open. We present the Anisotropic Graph Diffusion Network (AGDN), a new Graph Neural Network framework designed to solve TSP. Our method tackles two central difficulties: (1) the lack of an informative topological prior in fully connected TSP graphs, and (2) the loss of connections between nodes that are adjacent in the optimal solution once commonly used graph sparsification techniques are applied. To overcome these issues, we construct a MixScore transition matrix that merges node similarity with pairwise distance, and we develop an anisotropic graph diffusion strategy that supports efficient information exchange across multiple hops. Comprehensive experiments spanning diverse instance sizes and node distributions show that AGDN consistently outperforms existing methods while keeping computation time competitive. Furthermore, AGDN generalizes well to problem sizes and distributions beyond those seen during training. The implementation is publicly available at: https://github.com/LabRAI/AGDN.

2606.19184 2026-06-18 cs.CV cs.LG New submission

When AUC Misleads: Polarization-Aware Evaluation of Deepfake Detectors under Domain Shift

当AUC误导:域偏移下深度伪造检测器的极化感知评估

Dat Nguyen, Cosmin Radoi, Romain Hermary, Marcella Astrid, Nesryne Mejri, Enjie Ghorbel, Djamila Aouada

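Affiliations * Cristal Laboratory, National School of Computer Sciences, University of Manouba(马努巴大学国家计算机科学学院Cristal实验室)

AI summary Argues that per-dataset AUC fails to reflect real-world settings with mixed data sources and varying artifact types, and proposes Cross-dataset AUC (Cross-AUC), which averages per-domain AUCs and adds a prediction polarization measure (Wasserstein distance) to assess robustness to domain shift; experiments demonstrate its usefulness.

Details
AI Chinese abstract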

生成式AI的最新进展,如扩散模型和换脸工具,使得创建高度逼真的深度伪造成为可能,导致了包括金融欺诈和非自愿色情内容在内的现实危害。为此,深度伪造检测成为一个活跃的研究领域,近期方法越来越关注提高对未见操作的泛化能力。这通常通过跨多个数据集分别测量的ROC曲线下面积(AUC)来评估。然而,这种评估未能反映检测器面对混合数据源和不同伪影类型的真实场景。为解决这一局限,我们引入一种新指标——跨数据集AUC(Cross-AUC),该指标平均每域AUC并加入预测极化度量,以考虑对域偏移的鲁棒性。极化程度通过类别分数分布之间的Wasserstein距离量化。Cross-AUC不仅更真实地评估深度伪造检测器在域偏移下的泛化能力,而且具有可解释性,因为它能更好地解释性能下降的原因。在七个基准数据集上的实验证明了其实用性。

English abstract

Recent advances in generative AI, such as diffusion models and face-swapping tools, have enabled the creation of highly realistic deepfakes, leading to real-world harms including financial fraud and non-consensual explicit content. In response, deepfake detection has become an active research area, with recent methods increasingly focusing on improving generalization to unseen manipulations. This is typically evaluated using the Area Under the ROC Curve (AUC) measured separately across multiple datasets. However, such an evaluation fails to reflect real-world scenarios where detectors face a mixture of data sources and varying artifact types. To address this limitation, we introduce a novel metric, Cross-dataset AUC (Cross-AUC), which averages per-domain AUCs and incorporates a measure of prediction polarization to account for robustness to domain shift. The extent of polarization is quantified by the Wasserstein distance between class score distributions. Cross-AUC not only assesses the generalization capabilities of deepfake detectors under domain shifts more realistically, but is also interpretable, as it better explains the reason behind a drop in performance. Experiments performed on seven benchmark datasets demonstrate its practical relevance.
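
To make the motivation concrete, the following is a minimal sketch of the two ingredients the abstract names: per-domain AUC and the Wasserstein distance between the real and fake score distributions. The abstract does not state how Cross-AUC combines them, so the sketch only computes each quantity and contrasts per-domain AUC with AUC on the pooled data. The function names, the toy scores, and the dataset labels are illustrative assumptions, not the authors' code.

```python
from bisect import bisect_right


def auc(real_scores, fake_scores):
    """Probability that a random fake outscores a random real (ties count half)."""
    wins = sum((f > r) + 0.5 * (f == r) for f in fake_scores for r in real_scores)
    return wins / (len(fake_scores) * len(real_scores))


def wasserstein_1d(u, v):
    """1-Wasserstein distance between two empirical 1D distributions,
    computed as the area between their empirical CDFs."""
    u, v = sorted(u), sorted(v)
    points = sorted(set(u) | set(v))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_u = bisect_right(u, left) / len(u)
        cdf_v = bisect_right(v, left) / len(v)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


# Hypothetical detector scores: each domain is perfectly separable on its own,
# but the score ranges are shifted between domains.
domains = {
    "dataset_a": {"real": [0.1, 0.2], "fake": [0.3, 0.4]},
    "dataset_b": {"real": [0.6, 0.7], "fake": [0.8, 0.9]},
}

per_domain_auc = {name: auc(d["real"], d["fake"]) for name, d in domains.items()}
pooled_real = [s for d in domains.values() for s in d["real"]]
pooled_fake = [s for d in domains.values() for s in d["fake"]]
pooled_auc = auc(pooled_real, pooled_fake)
class_gap = {name: wasserstein_1d(d["real"], d["fake"]) for name, d in domains.items()}

print(per_domain_auc, pooled_auc, class_gap)
```

In this toy case both per-domain AUCs are perfect, yet the pooled AUC drops because a single threshold cannot separate real from fake across both domains, which is the kind of gap a cross-dataset metric is meant to expose.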

2606.19183 2026-06-18 cs.CL cs.AI New submission

Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

语言模型作为接口而非预言机:用于小儿阑尾炎的混合LLM-ML系统

Soheyl Bateni, Maryam Abdolali

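Affiliations * K. N. Toosi University of Technology(K. N. 图西理工大学)

AI summary Proposes ClaMPAPP, a hybrid system in which an LLM extracts structured features from free text and an XGBoost classifier makes the diagnosis; it outperforms end-to-end LLMs on two independent cohorts and improves diagnostic stability and auditability.

Details
AI Chinese abstract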

大型语言模型(LLM)通过解释自由文本记录可使临床决策支持更易获取,但直接作为诊断引擎使用时,受提示敏感性、信息顺序以及看似合理但错误的输出限制。结构化机器学习模型提供更稳定的风险预测,但需要难以与叙事性临床工作流集成的表格输入。我们提出ClaMPAPP(临床语言辅助机器学习阑尾炎诊断流程),这是一个混合系统,将LLM用作接口而非最终决策者。ClaMPAPP从类似笔记的叙述中提取模式约束的临床特征,应用确定性合理性检查,并将验证后的特征传递给基于临床、实验室和超声变量训练的XGBoost分类器。我们在来自德国医院的两个独立小儿阑尾炎队列上评估了ClaMPAPP,并将其与端到端LLM基线(包括开源和专有模型)进行比较。为在测试自由文本输入时保留真实标签,通过模板渲染和约束LLM重写从结构化电子健康记录生成叙述,并附加句子顺序排列以评估位置鲁棒性。ClaMPAPP在内部和外部验证中均达到最强的整体诊断性能,同时最小化漏诊阑尾炎病例(急性分诊中的关键安全问题)。端到端LLM表现出不稳定的灵敏度-特异性权衡,且在叙述重排下性能下降更严重。这些结果支持LLM作为接口、ML作为预测器的设计,将自然语言可用性与预测推理分离,并为临床决策支持提供更可审计的路径。

English abstract

Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but incorrect outputs. Structured machine-learning models offer more stable risk prediction, yet they require tabular inputs that are difficult to integrate with narrative clinical workflows. We present ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that uses an LLM as an interface rather than as the final decision-maker. ClaMPAPP extracts schema-constrained clinical features from note-like narratives, applies deterministic plausibility checks, and passes validated features to an XGBoost classifier trained on clinical, laboratory, and ultrasound variables. We evaluated ClaMPAPP on two independent pediatric appendicitis cohorts from German hospitals and compared it with end-to-end LLM baselines, including open-source and proprietary models. To preserve ground truth while testing free-text input, narratives were generated from structured electronic health records through template rendering and constrained LLM rewriting, with additional sentence-order permutation to assess positional robustness. ClaMPAPP achieved the strongest overall diagnostic performance in both internal and external validation while minimizing missed appendicitis cases, the key safety concern in acute triage. End-to-end LLMs showed unstable sensitivity-specificity trade-offs and greater degradation under narrative reordering. These results support an LLM-as-interface, ML-as-predictor design that separates natural-language usability from predictive inference and provides a more auditable pathway for clinical decision support.

2606.19176 2026-06-18 cs.RO cs.AI cs.SY eess.SY New submission

Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight

用于自主海上无人机飞行的深度单目位姿估计的硬件与视觉在环验证

Maneesha Wickramasuriya, Beomyeol Yu, Jaden Shin, Mason Huslig, Taeyoung Lee, Murray Snyder

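Affiliations * George Washington University(乔治华盛顿大学)

AI summary Proposes a hardware-validated vision-in-the-loop framework that combines a deep transformer-based monocular pose estimator with a delayed Kalman filter to achieve autonomous indoor flight while emulating photorealistic maritime environments, capturing embedded effects such as perception latency.

Comments 6 pages, 9 figures

Details
AI Chinese abstract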

船舶上的自主无人机操作需要可靠的基于视觉的相对位姿估计,然而海上验证成本高、依赖天气且风险大。本文提出一个硬件验证的视觉在环框架,能够在模拟逼真海上环境的同时实现完全自主的室内飞行。渲染的海上视图由板载的基于深度变换器的单目位姿估计器处理。延迟的视觉测量与高频率IMU数据通过延迟卡尔曼滤波器融合,为几何控制提供一致的状态估计。该系统捕捉了纯仿真中缺失的关键嵌入式效应,包括感知延迟、异步更新和计算约束。自主起飞、轨迹跟踪和着陆实验证明了稳定的闭环飞行。结果建立了一个安全且硬件真实的中间阶段,用于在船上部署之前开发海上无人机自主性。

English abstract

Autonomous UAV operations on ships require reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents a hardware-validated vision-in-the-loop framework that enables fully autonomous indoor flight while emulating photorealistic maritime environments. Rendered maritime views are processed onboard by a deep transformer-based monocular pose estimator. Delayed vision measurements are fused with high-rate IMU data using a delayed Kalman filter to provide consistent state estimates for geometric control. The system captures critical embedded effects, including perception latency, asynchronous updates, and computational constraints, that are absent in pure simulation. Autonomous takeoff, trajectory tracking, and landing experiments demonstrate stable closed-loop flight. The results establish a safe and hardware-realistic intermediate stage for developing maritime UAV autonomy prior to shipboard deployment.
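
The abstract does not describe how its delayed Kalman filter is built. One common way to fuse a late-arriving measurement is to buffer past estimates and inputs, apply the correction at the step where the image was captured, and then replay the buffered inputs. The scalar sketch below illustrates that idea only; the class and function names, the 1D position state, and the velocity input standing in for IMU propagation are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    mean: float
    var: float


def predict(state, velocity, dt, process_var):
    # Dead-reckon with a high-rate velocity input (stand-in for IMU propagation).
    return Gaussian(state.mean + velocity * dt, state.var + process_var)


def update(state, measurement, meas_var):
    # Standard scalar Kalman correction with a direct position measurement.
    gain = state.var / (state.var + meas_var)
    return Gaussian(state.mean + gain * (measurement - state.mean),
                    (1.0 - gain) * state.var)


class DelayedFilter:
    """Hypothetical replay-based filter: states[k] is the estimate at step k."""

    def __init__(self, initial, dt, process_var):
        self.dt = dt
        self.process_var = process_var
        self.states = [initial]
        self.inputs = []  # inputs[k] drives the transition from step k to k + 1

    def propagate(self, velocity):
        self.inputs.append(velocity)
        self.states.append(
            predict(self.states[-1], velocity, self.dt, self.process_var))

    def fuse_delayed(self, step, measurement, meas_var):
        # Correct the stored estimate at the capture step, then replay inputs.
        state = update(self.states[step], measurement, meas_var)
        self.states[step] = state
        for k in range(step, len(self.inputs)):
            state = predict(state, self.inputs[k], self.dt, self.process_var)
            self.states[k + 1] = state

    @property
    def current(self):
        return self.states[-1]


velocities = [1.0, 1.0, 2.0, 2.0]

# Measurement captured at step 2 but only available after step 4.
delayed = DelayedFilter(Gaussian(0.0, 1.0), dt=0.1, process_var=0.01)
for v in velocities:
    delayed.propagate(v)
delayed.fuse_delayed(step=2, measurement=0.25, meas_var=0.04)

# Reference: the same measurement applied in order, with no latency.
ref = Gaussian(0.0, 1.0)
for k, v in enumerate(velocities):
    if k == 2:
        ref = update(ref, 0.25, 0.04)
    ref = predict(ref, v, 0.1, 0.01)

print(delayed.current, ref)
```

The replayed estimate matches the in-order reference, which is the consistency property such a scheme aims for. A practical version would bound the history buffer to the maximum expected perception latency.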

2606.19172 2026-06-18 cs.AI New submission

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

用户作为印迹:将每用户记忆内化为局部参数编辑

Bojie Li

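Affiliations * Pine AI

AI summary Proposes User as Engram, which stores user facts as local edits to the hash-keyed memory table of an Engram model and carries the reasoning skill in one shared adapter, achieving high indirect-reasoning accuracy with a very small memory footprint.

Details
AI Chinese abstract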

语言模型中的个人记忆涉及两个问题:内容和推理技能。大脑将两者分开(每个情节在海马体中有一个稀疏的局部印迹,解释它的共享技能在缓慢的新皮层中),因此新事实不必覆盖其他一切。如今大多数个性化方法将用户事实保存在权重之外,存储在自然语言记忆文件或检索索引中。当事实被写入模型时,标准方法是每用户的LoRA适配器,这与大脑相反,将内容和技能折叠成一个全局权重增量。将用户事实写为LoRA会污染与它们无关的文本;将相同事实写为局部Engram行则数学上保持不变,导致内存占用大约减少33,000倍。因此,我们提出User as Engram:将用户内容存储为对Engram模型的哈希键控记忆表的手术式编辑,并将推理技能携带在一个共享适配器中。这种分层设计匹配了每用户LoRA的直接召回,同时平均提供5.6倍更高的间接推理准确性,并且从未使单个用户在推理方面比未触及的基座更差。编辑是一个玻璃盒:写入一个事实会在精确触发时打开其查找,添加答案所需的值,保持其他每个位置不变到最后一位,如果写入错误层则失败。由于不同用户的事实落在不相交的哈希槽中,它们的编辑可组合:许多用户同时共享一个表,可加性且无损地堆叠,而每用户LoRA(一个全局权重增量)只允许一个。在检索时,每用户Engram表不会随着检索器必须搜索的群体增长,因此在大约100个事实后,它超越了在2.5倍更大模型上的检索流水线。

English abstract

Personal memory in a language model is two problems: content and reasoning skill. The brain keeps the two apart (a sparse, local engram in the hippocampus for each episode, a slow neocortex for the shared skills that interpret it), so a new fact need not overwrite everything else. Most personalization today keeps a user's facts outside the weights, in a natural-language memory file or a retrieval index. When facts are written into the model instead, the standard recipe is the per-user LoRA adapter, which does the opposite of the brain, folding content and skill into one global weight delta. Writing a user's facts as a LoRA contaminates text unrelated to them; writing the same facts as local Engram rows leaves it mathematically untouched, resulting in a roughly 33,000x smaller memory footprint. We therefore propose User as Engram: store a user's content as surgical edits to the hash-keyed memory table of an Engram model, and carry the reasoning skill in one shared adapter. This layered design matches per-user LoRA's direct recall while delivering 5.6x higher indirect-reasoning accuracy on average, and never makes a single user worse at reasoning than the untouched base. The edit is a glass box: writing a fact switches on its lookup at exactly the trigger, adds the value the answer needs, leaves every other position unchanged to the last bit, and fails if written into the wrong layer. Because different users' facts land in disjoint hash slots, their edits compose: many users live in one shared table at once, stacking additively and losslessly, where a per-user LoRA, a single global weight delta, admits only one. Upon retrieval, a per-user Engram table does not grow with the population the retriever must search, so past ~100 facts it overtakes a retrieval pipeline on a 2.5x larger model.

2606.19170 2026-06-18 cs.CL New submission

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

Dango:一个严格仅L1的大型语言模型,用于研究第二语言习得

Shiho Matta, Yin Jou Huang, Fei Cheng, Takashi Kodama, Hirokazu Kiyomaru, Yugo Murawaki

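Affiliations * Kyoto University(京都大学) NII-LLMC(国立信息学研究所-大规模语言模型中心)

AI summary Proposes Dango, a 1.8B-parameter model that filters L2 contamination and is fine-tuned on L2-learning lessons to simulate human L2 production patterns, outperforming unfiltered and multilingual baselines.

Comments 8 pages main text, 20 pages total including references and appendices

Details
AI Chinese abstract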

我们介绍了Dango,一个1.8B参数的大型语言模型,旨在用于第二语言习得(SLA)中L1到L2(日语到英语)迁移的受控研究。虽然先前的研究已经探索了语言模型中的SLA,但它们主要依赖于较小的或非解码器模型,限制了它们生成开放式文本的能力,并降低了它们作为实用L2模拟器的适用性。我们发现了将模型扩展到该规模时的一个关键挑战:用于L1习得的“单语”预训练语料库中的L2污染。为了解决这个问题,我们提出了一种过滤方法,以减少对英语的过早暴露,同时保留现实的最小暴露。然后,我们在LLM生成的L2学习课程上对模型进行微调,以模拟L2习得过程。我们的评估证实,Dango发展了类似人类的L2产出模式,优于未过滤和标准的多语言基线。我们发布了模型、数据和代码,以促进可重复的计算SLA研究和面向学习者的应用。

English abstract

We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they have predominantly relied on smaller or non-decoder models, limiting their ability to generate open-ended text and reducing their suitability as practical L2 simulators. We identify a key challenge when scaling models to this size: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To address this, we propose a filtering method to reduce premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.

2606.19168 2026-06-18 cs.AI cs.LG New submission

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

超越安全数据:具有正则安全反射的预训练阶段对齐

Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu

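Affiliations * Institute for Interdisciplinary Information Sciences, Tsinghua University(清华大学交叉信息研究院)

AI summary Proposes Safety Reflection Pretraining, which inserts safety reflections into the pretraining corpus to give the model a self-monitoring ability; experiments show it reduces the success rates of inference-stage and finetuning attacks.

Details
AI Chinese abstract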

为了实现大型语言模型(LLMs)更深层次的安全对齐,最近的研究探讨了如何将安全干预措施提前到预训练阶段,主要通过过滤不安全数据或将其改写为更安全的形式。我们认为,预训练阶段的对齐应超越使数据安全:LLMs可能将看似良性的知识和能力组合成不安全的行为。为此,我们提出了安全反射预训练,一种预训练阶段的对齐方法,该方法定期在预训练语料中插入简短的安全反思,将自我监控直接集成到语言建模中,建立一种基础能力,随后通过兼容的后训练加以强化。我们在FineWeb-Edu上预训练的1.7B模型上的实验表明,安全反射预训练提高了安全分类准确性,并显著降低了推理阶段和微调攻击的成功率。除了真实世界实验,我们还引入了一个完全受控的合成环境MedSafetyWorld,其中包含清晰的安全定义和推理结构,模型可以轻松地从安全数据中泛化出不安全行为。在MedSafetyWorld中的消融实验进一步表明,与数据过滤和改写相比,安全反射预训练在防止模型根据安全数据泛化出的不安全行为方面具有明显优势。综合来看,我们的发现表明,预训练对齐不仅应使训练数据安全,还应塑造模型可能从安全数据中习得的行为。

English abstract

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

2606.19164 2026-06-18 cs.LG cs.AI New submission

Essential Subspace Merging for Multi-Task Learning

多任务学习的本质子空间合并

Longhua Li, Lei Qi, Xin Geng, Qi Tian

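Affiliations * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education(教育部新一代人工智能技术及其跨学科应用重点实验室(东南大学)) Huawei Inc.(华为公司)

AI summary Proposes Essential Subspace Decomposition (ESD) and Essential Subspace Merging (ESM/ESM++), which orthogonalize the principal components of task updates to reduce interference in multi-task merging, enabling training-free multi-task learning.

Details
AI Chinese abstract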

模型合并旨在通过将多个从同一预训练检查点微调得到的模型的能力集成到一个单一模型中,从而实现多任务学习。其核心挑战是任务特定参数更新之间的任务间干扰。在本文中,我们分析了任务更新引起的输出偏移,并观察到它们的能量集中在少数主方向上。我们将这些方向张成的子空间称为本质子空间。相比之下,大多数剩余方向携带的任务相关能量很少,但它们在多个任务更新中的累积会在合并过程中引起严重干扰。受此观察启发,我们提出了本质子空间分解(ESD),它根据激活偏移的主成分分解每个任务更新。基于ESD,我们引入了本质子空间合并(ESM),一种无需训练的静态合并方法,它将本质成分正交化并融合成一个紧凑的多任务模型。我们进一步将ESM扩展到ESM++,一种无需训练的动态合并方法,它将任务特定残差分解为低秩专家,并在前向推理过程中通过基于原型的路由选择最相关的专家。跨多个任务集和模型规模的大量实验表明,ESM和ESM++在减少任务间干扰的同时有效保留了任务知识。

English abstract

Model merging aims to enable multi-task learning by integrating the capabilities of multiple models fine-tuned from the same pre-trained checkpoint into a single model. Its core challenge is inter-task interference among task-specific parameter updates. In this paper, we analyze the output shifts induced by task updates and observe that their energy is concentrated in a small number of principal directions. We call the subspace spanned by these directions the essential subspace. In contrast, most remaining directions carry little task-relevant energy, but their accumulation across multiple task updates can cause severe interference during merging. Motivated by this observation, we propose Essential Subspace Decomposition (ESD), which decomposes each task update according to the principal components of its activation shift. Based on ESD, we introduce Essential Subspace Merging (ESM), a training-free static merging method that orthogonalizes and fuses essential components into one compact multi-task model. We further extend ESM to ESM++, a training-free dynamic merging method that decomposes task-specific residuals into low-rank experts and selects the most relevant expert through prototype-based routing during forward inference. Extensive experiments across multiple task sets and model scales demonstrate that ESM and ESM++ effectively preserve task knowledge while reducing inter-task interference.

2606.19162 2026-06-18 cs.LG cs.CV New submission

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

奖励一直就在你的数据中:用判别器引导的强化学习纠正流匹配

Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal

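Affiliations * FAIR at Meta(Meta FAIR) Columbia University(哥伦比亚大学) McGill University(麦吉尔大学) Canada CIFAR AI Chair(加拿大CIFAR人工智能主席)

AI summary To address visual defects in flow-matching models caused by the mismatch between the loss function and sample quality, proposes Discriminator-Guided RL (DRL), which uses the logit of a discriminator in a pretrained representation space as the reward, substantially improving guidance-free FID and semantic FD as well as preference alignment.

Comments 84 pages, including appendices

Details
AI Chinese abstract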

得分匹配和流匹配模型通常依赖基于偏好的强化学习来实现两个目的:与主观偏好对齐,以及令人惊讶地恢复视觉真实性和连贯对象结构等属性——而这些属性本应通过匹配训练从数据本身学习。我们认为这反映了结构上的不匹配。匹配损失衡量训练时边缘分布下速度或得分场的$\ell_2$回归误差,这一代理指标与决定推理时样本质量的视觉和语义属性对齐不良。给定一个与这些属性对齐的奖励,强化学习通过评估模型自身生成的样本并直接遵循奖励景观来规避不匹配。挑战在于如何在不依赖人类偏好的情况下获得这样的奖励,因为人类偏好昂贵且会将数据真实性与标注者倾向混为一谈。我们提出判别器引导的强化学习(DRL)。DRL训练一个判别器,在预训练表示空间中区分数据样本和基础模型样本,并将其logit作为KL正则化强化学习中的奖励。预训练空间将判别器限制在感知有意义的方向上,而logit估计数据与模型之间的对数似然比,这是针对数据分布的最优奖励。在SiT、JiT、REPA和RAE上,DRL降低了无引导FID(例如,SiT上从9.38降至2.62)和语义空间FD(例如,SiT上DINOv3从88.2降至19.3),在所有骨干网络上均有一致提升,并且在没有经过偏好奖励训练的情况下改善了人类偏好奖励。在后续基于偏好的后训练中,DRL还在偏好奖励与图像保真度之间产生了更好的帕累托前沿,在提高对齐度的同时减少了过饱和和过亮等低级伪影。

English abstract

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
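
The claim that the discriminator logit is the optimal reward for targeting the data distribution can be checked with two standard results, sketched below. The notation ($p_{\mathrm{data}}$, $p_{\mathrm{base}}$, $\beta$, $\ell^{*}$) is assumed here for illustration and is not taken from the paper; the densities are understood as living in the pretrained representation space, and the discriminator is assumed to be trained with binary cross-entropy on balanced classes.

```latex
% Optimal discriminator for binary cross-entropy with balanced classes:
\[
D^{*}(x) = \sigma\bigl(\ell^{*}(x)\bigr), \qquad
\ell^{*}(x) = \log \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{base}}(x)} .
\]
% KL-regularized objective with reward r and temperature \beta:
\[
\max_{\pi}\; \mathbb{E}_{x \sim \pi}\bigl[r(x)\bigr]
  - \beta\, \mathrm{KL}\bigl(\pi \,\|\, p_{\mathrm{base}}\bigr)
\quad\Longrightarrow\quad
\pi^{*}(x) \propto p_{\mathrm{base}}(x)\, \exp\bigl(r(x)/\beta\bigr).
\]
% Substituting r = \ell^{*} and \beta = 1:
\[
\pi^{*}(x) \propto p_{\mathrm{base}}(x)\,
  \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{base}}(x)} = p_{\mathrm{data}}(x).
\]
```

Under these idealized assumptions the KL-regularized optimum is exactly the data distribution; in practice the learned logit is only an estimate of the log ratio.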

2606.19161 2026-06-18 cs.RO New submission

HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision

HT-Bench:基于自我中心视觉的灵巧全手触觉表示基准与学习

Yuzhe Huang, Jiaping Wu, Jiaming Jiang, Hezhe Lin, Aikebaier Aierken, Yunlong Wang, Kun Cheng, Ziyuan Jiao, Yuanxin Zhong

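Affiliations * Beihang University(北京航空航天大学) Rimbot BUPT(北京邮电大学) ShanghaiTech University(上海科技大学) Tsinghua University(清华大学) CAS(中国科学院)

AI summary Proposes the HT-Bench multi-task benchmark and the HandTouch encoder, using large-scale egocentric vision and full-hand tactile data to validate tactile representations on tasks including tactile similarity retrieval, masked inpainting, and vision-to-tactile synthesis.

Comments 9 pages, 4 figures

Details
AI Chinese abstract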

由于触觉传感器设计、数据格式和机器人形态的多样性,为机器人操作中的触觉表示学习建立通用基准仍然具有挑战性。我们并未试图建立这样的基准,而是探索了一个可扩展且有前景的未来发展方向:将自我中心视觉与全手触觉数据配对。为此,我们引入了HT-Bench,一个用于灵巧全手触觉感知的大规模多任务基准,包含在226个任务中收集的1000万RGB帧和780万触觉帧。HT-Bench从三个关键角度评估触觉表示:它们是否编码有意义的接触几何、是否能够将触觉观测与视觉信息对齐、以及是否能够泛化到未见任务。为评估这些能力,HT-Bench包含四个任务:细粒度触觉相似性检索、掩码触觉修复、视觉到触觉合成以及多模态触觉帧预测。我们进一步提出了HandTouch,一个矢量量化视觉-触觉编码器,通过渐进的空间、跨模态和时间训练学习触觉表示。在HT-Bench上,HandTouch始终优于代表性的触觉编码器基线,将细粒度触觉相似性检索的Recall@5从74.65%提高到85.23%,将掩码触觉修复的RMSE从0.022降低到0.010,并将视觉到触觉合成的OOD cIoU从0.628提高到0.705。这些结果证明了HandTouch的有效性,并表明大规模自我中心全手触觉数据为评估和推进灵巧操作中的触觉表示学习提供了可扩展的基础。

English abstract

Establishing a universal benchmark for tactile representation learning in robotic manipulation remains challenging due to the diversity of tactile sensor designs, data formats, and robot embodiments. Rather than attempting to establish such a benchmark, we explore a scalable and promising direction for future development: egocentric vision paired with full-hand tactile data. To this end, we introduce HT-Bench, a large-scale multi-task benchmark for dexterous full-hand tactile sensing, comprising 10M RGB frames and 7.8M tactile frames collected across 226 tasks. HT-Bench evaluates tactile representations from three key perspectives: whether they encode meaningful contact geometry, whether they can align tactile observations with visual information, and whether they generalize to unseen tasks. To assess these capabilities, HT-Bench includes four tasks: fine-grained tactile similarity retrieval, masked tactile inpainting, vision-to-tactile synthesis, and multimodal tactile frame prediction. We further propose HandTouch, a vector-quantized vision-tactile encoder that learns tactile representations through progressive spatial, cross-modal, and temporal training. Across HT-Bench, HandTouch consistently outperforms representative tactile encoder baselines, improving Recall@5 on fine-grained tactile similarity retrieval from 74.65% to 85.23%, reducing RMSE on masked tactile inpainting from 0.022 to 0.010, and increasing OOD cIoU on vision-to-tactile synthesis from 0.628 to 0.705. These results demonstrate the effectiveness of HandTouch and suggest that large-scale egocentric full-hand tactile data provides a scalable basis for evaluating and advancing tactile representation learning in dexterous manipulation.

2606.19156 2026-06-18 cs.CV New submission

Hand-4DGS: Feed-Forward 3D Gaussian Splatting for 4D Hand Reconstruction from Egocentric Videos

Hand-4DGS: 用于从第一人称视频进行4D手部重建的前馈3D高斯泼溅方法

Jeongmin Bae, Seoha Kim, Marc Pollefeys, Mahdi Rad, Youngjung Uh, Taein Kwon

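Affiliations * Yonsei University(延世大学) Electronics and Telecommunications Research Institute(韩国电子通信研究院) ETH Zurich(苏黎世联邦理工学院) Microsoft Spatial AI Lab(微软空间人工智能实验室) VGG, University of Oxford(牛津大学VGG实验室)

AI summary Proposes Hand-4DGS, the first feed-forward framework that reconstructs dynamic 4D hands directly from egocentric videos, using a mesh-guided representation and temporal convolutions for fast inference and strong generalization without 3D ground-truth annotations.

Comments Project page: https://jeongminb.github.io/hand-4dgs/

Details
AI Chinese abstract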

从第一人称视频进行动态3D手部重建对于下一代计算平台(如AR/VR和AI眼镜)至关重要。尽管其重要性,大多数先前工作要么关注多视角3D手部重建,要么关注4D人体重建。由于头部快速运动、手部快速动态、严重遮挡以及单视角观察固有的模糊性,第一人称4D手部重建仍然具有挑战性。为了解决这些挑战,我们引入了Hand-4DGS,这是第一个直接从第一人称视频重建动态4D手部的前馈框架,实现了快速(约60 FPS)推理和强泛化。我们的方法结合了用于结构先验的网格引导表示和用于建模动态运动的时间卷积。我们在两个具有挑战性的第一人称数据集H2O和ARCTIC上评估了我们的框架,并展示了相对于基线的显著改进。我们的方法受益于前馈网络的泛化能力以及通过高斯泼溅的有效2D图像监督,无需昂贵的3D手部姿态真值标注。

English abstract

Dynamic 3D hand reconstruction from egocentric videos is essential for next-generation computing platforms such as AR/VR and AI glasses. Despite its importance, most prior works focus either on multi-view 3D hand reconstruction or on 4D human body reconstruction. Egocentric 4D hand reconstruction remains challenging due to fast head motion, rapid hand dynamics, severe occlusions, and inherent ambiguity from single-view observations. To address these challenges, we introduce Hand-4DGS, the first feed-forward framework for reconstructing dynamic 4D hands directly from egocentric videos, enabling both fast (~60 FPS) inference and strong generalization. Our approach incorporates a mesh-guided representation for structural priors and temporal convolutions to model dynamic motion. We evaluate our framework on two challenging egocentric datasets, H2O and ARCTIC, and demonstrate significant improvements over baselines. Our method benefits from the generalization capability of feed-forward networks and effective 2D image supervision through Gaussian splatting, without requiring expensive 3D hand pose ground-truth annotations.

2606.19154 2026-06-18 cs.RO New submission

Viking Hill Dataset: A Lidar-Radar-Camera Dataset for Detection and Segmentation in Forest Scenes

Viking Hill数据集:用于森林场景检测与分割的激光雷达-雷达-相机数据集

Vladimír Kubelka, Oleksandr Kotlyar, Unal Artan, Martin Magnusson

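Affiliations * Örebro University, AASS research centre, Robot Navigation and Perception Lab(厄勒布鲁大学,AASS研究中心,机器人导航与感知实验室)

AI summary Presents the first multi-sensor forest dataset that includes 4D imaging radar, provides MinkowskiUNet semantic segmentation baselines on radar and lidar point clouds, and evaluates how trunk segmentation quality relates to tree size.

Comments 33 pages, 11 figures

Details
AI Chinese abstract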

在森林冠层下运行的自主机器人需要对树木及周围植被在不同季节条件下进行稳健感知。现有的林业数据集提供带有单棵树标注的激光雷达或相机数据,但均未包含共配准的4D成像雷达——这一模态因其对视觉退化、表面污染和植被遮挡的鲁棒性而日益受到关注。我们介绍了一个由移动机器人收集的多传感器森林数据集,该机器人配备了高分辨率FMCW成像雷达、激光雷达、RGB相机、IMU和RTK-GNSS。该场地在两个不同植被状态的会话中记录,3D立方体标注(包括每棵树的直径估计)为所有三种感知模态提供了共享语义标签。此外,我们提供了使用MinkowskiUNet对雷达和激光雷达点云进行语义分割的基线结果。雷达在主要类别(地面91%,冠层86%)上取得了与激光雷达竞争性的IoU分数,但在几何精细结构(如树干)上落后(56%对74%)。跨模态分析进一步比较了激光雷达和雷达的树干分割与RGB检测模型,而按直径分层的评估揭示了树干分割质量如何随树木尺寸变化。除了分割,共配准的多模态数据和RTK-GNSS辅助参考定位支持冠层下地图构建、定位和传感器融合的研究。数据集和标注工具已公开。

English abstract

Autonomous robots operating under forest canopies need robust perception of trees and surrounding vegetation across varying seasonal conditions. Existing forestry datasets provide lidar or camera data with per-tree annotations, but none include co-registered 4D imaging radar -- a modality of growing interest for its resilience to visual degradation, surface contamination, and vegetation occlusion. We introduce a multi-sensor forest dataset collected by a mobile robot equipped with a high-resolution FMCW imaging radar, lidar, RGB camera, IMU, and RTK-GNSS. The site was recorded in two sessions under contrasting vegetation states, and 3D cuboid annotations -- including per-tree diameter estimates -- provide shared semantic labels across all three perception modalities. Furthermore, we provide baseline results for semantic segmentation of the radar and lidar point clouds using MinkowskiUNet. Radar achieves IoU scores competitive with lidar for dominant classes (ground 91%, canopy 86%) while lagging on geometrically fine structures such as tree trunks (56% vs. 74%). A cross-modality analysis further compares lidar and radar trunk segmentation against an RGB detection model, and a diameter-stratified evaluation reveals how trunk segmentation quality varies with tree size. Beyond segmentation, the co-registered multi-modal data and RTK-GNSS-aided reference positioning support research in mapping, localization, and sensor fusion under canopy. The dataset and annotation tools are publicly available.

2606.19150 2026-06-18 cs.LG New submission

Complementary Attention Head Pruning for Efficient Transformers

互补注意力头剪枝用于高效Transformer

Yaniv Livertovsky, Shahar Somin, Gonen Singer

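Affiliations * Bar-Ilan University(巴伊兰大学)

AI summary Proposes CAHP, a framework that casts attention-head selection as a global graph-theoretic problem, preserves complementary heads via graph clustering and information-theoretic distances, determines the number of retained heads automatically, and outperforms existing methods on SST-5 and MNLI.

Comments 9 pages, 4 figures, 3 tables. Accepted for presentation at the International Joint Conference on Neural Networks (IJCNN) 2026

Details
AI Chinese abstract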

基于Transformer的模型在自然语言处理中的显著成功源于架构的规模化,这导致大量参数并阻碍了在资源受限环境中的部署。虽然结构化剪枝提供了一条压缩路径,但现有的最先进方法通常依赖于基于梯度的重要性排序或随机门控,这些方法存在不稳定性、结构退化以及需要大量手动超参数调整的问题。在本文中,我们引入了CAHP(互补注意力头剪枝),一种新颖的事后框架,将头选择重新定义为全局图论问题。CAHP不是孤立地评估头,而是利用基于图的聚类结合信息论距离度量来识别并保留一组拓扑多样化的互补注意力头。无需预定义稀疏度或剪枝比例,该框架通过识别递减的边际性能曲线自动确定各层中保留的注意力头数量,其中根据所选多项式次数,剪除额外头会导致性能急剧下降。在SST-5和MNLI基准上跨不同Transformer模型规模的广泛评估表明,CAHP始终优于竞争基线,特别是在高压缩率情况下。此外,我们的结构分析表明,CAHP避免了基于梯度的剪枝方法的“邻近偏差”(倾向于主要保留靠近输出层的头),而是保留了模型中间层中功能关键的注意力头集合。

English abstract

The remarkable success of Transformer-based models in natural language processing stems from architectural scaling, which leads to a large number of parameters and hinders deployment in resource-constrained environments. While structured pruning offers a pathway to compression, existing state-of-the-art methods often rely on gradient-based importance ranking or stochastic gating, which suffer from instability, structural degeneration, and the need for extensive manual hyperparameter tuning. In this paper, we introduce CAHP (Complementary Attention Head Pruning), a novel post-hoc framework that redefines head selection as a global graph-theoretical problem. Rather than evaluating heads in isolation, CAHP utilizes graph-based clustering combined with information-theoretic distance measures to identify and preserve a topologically diverse subset of complementary attention heads. Without requiring a predefined sparsity level or pruning ratio, the framework automatically determines the number of selected attention heads across layers by identifying a diminishing marginal performance curve, where pruning additional heads leads to a sharp degradation in performance, as determined by the chosen polynomial degree. Extensive evaluations on the SST-5 and MNLI benchmarks, across different Transformer model scales, demonstrate that CAHP consistently outperforms competitive baselines, particularly in high-compression regimes. Furthermore, our structural analysis shows that CAHP avoids the "proximity bias" of gradient-based pruning methods, which tend to preserve heads mainly in layers close to the output, and instead retains a functionally critical set of attention heads in the model's intermediate layers.
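The abstract describes selecting heads by clustering them on a graph built from information-theoretic distances. The sketch below illustrates that general idea only, under loud assumptions: it uses the Jensen-Shannon distance between each head's average attention distribution, links heads whose distance falls below a hand-picked `threshold`, and keeps one head per connected component. The function names, the toy distributions, and the fixed threshold are all hypothetical; the paper's actual clustering and its automatic choice of head count are not reproduced here.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logs, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

def select_complementary(heads, threshold):
    """Link heads closer than `threshold`, then keep one head per component."""
    names = list(heads)
    parent = {n: n for n in names}  # union-find forest over head names

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if js_distance(heads[a], heads[b]) < threshold:
                parent[find(b)] = find(a)  # merge the two components

    # The root of each component serves as its representative.
    return [n for n in names if find(n) == n]

# Toy average attention distributions over four token positions (illustrative).
heads = {
    "h0": [0.7, 0.1, 0.1, 0.1],
    "h1": [0.6, 0.2, 0.1, 0.1],   # nearly redundant with h0
    "h2": [0.1, 0.1, 0.1, 0.7],
    "h3": [0.25, 0.25, 0.25, 0.25],
}
kept = select_complementary(heads, threshold=0.2)
```

Here `h0` and `h1` attend to almost the same positions, so they fall into one component and only one of them is kept, while `h2` and `h3` survive as distinct behaviors.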

2606.19145 2026-06-18 cs.LG cs.AI cs.SY eess.SY New submission

OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems

Till Richter, Niki Kilbertus

Affiliation * Technical University of Munich, Helmholtz Munich

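AI summary Addresses the problem that, in hybrid modeling, the neural part may relearn the symbolic structure and make the model redundant. Proposes OrthoReg, an orthogonal regularization method that directly penalizes overlap between the symbolic and neural components, yielding a complementary decomposition and improving symbolic recovery and out-of-distribution behavior.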

Details

English abstract

Dynamical systems are fundamental to modeling the natural world, yet modeling them involves a persistent trade-off: manually prescribed mechanistic models are interpretable by design but often overly simplistic and misspecified; in contrast, flexible data-driven neural methods lack physical insight. Hybrid modeling aims for the best of both worlds by combining a prescribed or symbolic, physics-based component with a flexible neural network. A critical challenge, however, is that the neural component may relearn mechanistic parts, yielding redundant and uninterpretable models, especially when the symbolic structure itself is discovered from data. Existing methods based on standard $L^2$ regularization rely on a projection argument that breaks when the symbolic component is learned through sparse discovery, allowing the neural augmentation to overlap with symbolic structure. We introduce OrthoReg (Orthogonal Regularization), which directly penalizes overlap between the symbolic and neural components, preventing symbolic structure from being absorbed by the neural residual. This yields a complementary decomposition: the symbolic part captures what the library can express, and the neural part captures what remains. On benchmark dynamical systems with partial library mismatch, OrthoReg improves symbolic recovery and out-of-distribution behavior.
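The abstract does not give the exact form of the penalty, so the following is only a sketch of what penalizing overlap between a symbolic and a neural component could look like. It assumes the overlap is measured as the squared cosine similarity between the two components' outputs on shared sample points. The function `overlap_penalty`, the `eps` stabilizer, and the toy components are hypothetical and should not be read as the authors' formulation.

```python
def overlap_penalty(symbolic_out, neural_out, eps=1e-12):
    """Squared cosine similarity between two component outputs.

    Near 1 when the neural output is a rescaled copy of the symbolic output,
    0 when the two outputs are orthogonal on the given samples.
    """
    dot = sum(s * n for s, n in zip(symbolic_out, neural_out))
    norm_s = sum(s * s for s in symbolic_out)
    norm_n = sum(n * n for n in neural_out)
    return dot * dot / (norm_s * norm_n + eps)  # eps guards against zero norms

# Sample points and a toy symbolic term f_sym(x) = x (illustrative only).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
symbolic = [x for x in xs]

# A "redundant" neural residual that just rescales the symbolic term.
redundant = [0.3 * x for x in xs]
# A "complementary" residual that is orthogonal to x on these samples.
complementary = [x * x for x in xs]

penalty_redundant = overlap_penalty(symbolic, redundant)
penalty_complementary = overlap_penalty(symbolic, complementary)
```

In a training loop, a term like this would typically be added to the data-fitting loss with a weighting coefficient, so that a neural residual that duplicates the symbolic part is discouraged while an orthogonal residual is left unpenalized.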