arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.08552 2026-06-09 cs.AI cs.MA cs.NE physics.data-an New submission

Quantitative Promise Theory: Intentionality and Inference in Autonomous Agents

Mark Burgess

Affiliations: ChiTek-i AS

AI summary: Proposes incorporating Bayesian probability and information-theoretic optimization (including Active Inference) into promise semantics to address non-local coordination, calibration, and normalization in probabilistic computation, and treats boundary conditions as promises that constrain allowed states and set decision thresholds, yielding a scalable definition of intent.

Abstract

I discuss some quantitative representations of Promise Theory for processes involving autonomous agents. Agent models are common in software systems, machine learning, and biology, for example, but may also apply to physics and other forms of engineering. I describe how Bayesian probability and information theoretic optimization, including Active Inference, may be incorporated with promise semantics -- as well as how Promise Theory supplements solutions, helping to avoid probability's pitfalls, which include non-local coordination, calibrating, and normalizing probabilistic computations. The role of boundary conditions in constraining allowed states and selecting decision thresholds is a form of promise, and agent alignment provides a scalable definition of intent. Autonomous agents may congeal into swarms with superagent characteristics by trying to minimize their information, despite uncertainty that works to maximize it. The use of Promise Theory involves some research challenges as well as stylistic preferences.

2606.08548 2026-06-09 cs.RO New submission

OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation

Zehao Yu, Jiakun Zheng, Weiji Xie, Jiyuan Shi, Chenyun Zhang, Chenjia Bai, Xuelong Li

Affiliations: Institute of Artificial Intelligence (TeleAI), China Telecom; Fudan University; East China University of Science and Technology; Shanghai Jiao Tong University

AI summary: Proposes the OASIS framework, which uses a 3D generative model to reconstruct object assets from real images, collects and augments trajectory data in simulation, and trains a hierarchical visuomotor policy for humanoid loco-manipulation; under zero-shot deployment, it achieves higher success rates on most tasks than training on real teleoperation data.

Comments: Project Page: https://oasis-humanoid.github.io/

Abstract

Recent progress in robot manipulation has been largely driven by learning from large-scale demonstrations. For humanoid robot loco-manipulation tasks, however, existing data sources force an unsatisfying tradeoff between trajectory quality and scalability. Real-world teleoperation provides the highest-quality trajectories but requires dedicated physical space and time-consuming scene resets. Simulation offers an alternative way out of this dilemma: it can produce clean, embodiment-aligned data at scale without any physical hardware. In this paper, we propose OASIS, a simulation-data-driven framework for humanoid loco-manipulation. OASIS automatically reconstructs realistic object assets from real-world images using a 3D generative model. Based on these assets, trajectories are first collected through teleoperation in simulation, and then augmented under diverse domain randomizations in a post-processing stage. With the resulting simulation data, we further design a hierarchical visuomotor policy for humanoid loco-manipulation. Extensive experiments on the real humanoid robot show that, under zero-shot deployment, the policy trained on our simulation data achieves higher success rates on most tasks than that trained on real-robot teleoperation data, owing largely to the broad lighting and environmental variations covered by our simulation rendering, which real-robot data fails to capture. The project page is available at https://oasis-humanoid.github.io/.

2606.08545 2026-06-09 cs.CL cs.SE New submission

Ishigaki-IDS: An Open-Weight Verifier-Aware Model for Information Delivery Specification Drafting in Building Information Modeling

Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto, Takayuki Kato, Tomoki Ando, Chenguang Wang, Dayuan Jiang, Naofumi Fujita, Shuhei Saitoh, Atomu Kondo, Koki Arakawa, Daiho Nishioka

Affiliations: ONESTRUCTION Inc.; AWS GenAI Innovation Center

AI summary: To address the IDS authoring bottleneck in BIM projects, proposes Ishigaki-IDS, an open-weight LLM that combines continued pretraining, supervised fine-tuning, and reinforcement learning with validator-based rewards to generate IDS drafts that pass an external validator; it substantially outperforms baselines on the benchmark and reduces work time by 54.7%.

Comments: 8 pages, 2 figures, 5 tables. Preprint

Abstract

Building Information Modeling (BIM) projects require information requirements to be described as machine-checkable Information Delivery Specification (IDS) files in order to verify whether building models contain the required attributes. However, IDS authoring remains a practical bottleneck: practitioners must handle domain vocabulary, strict XML schema constraints, and external validator conformance while also checking whether the requirement itself is correctly expressed. We present Ishigaki-IDS, an open-weight LLM specialized for verifier-aware IDS draft generation. The model combines continued pretraining on BIM/IDS corpora, supervised fine-tuning on information-requirement-to-IDS pairs, and reinforcement learning with verifiable rewards from an external validator. The goal is not to replace expert review, but to move IDS authoring from low-level XML and schema repair toward validator-loadable drafts that practitioners can inspect and correct. On the 166-case expert-created Ishigaki-IDS-Bench, Ishigaki-IDS-8B achieves an IDSAuditPass score of 0.651, a validator-pass metric for generated IDS files, substantially outperforming Claude Opus 4.5, the strongest single-shot LLM baseline we evaluated, at 0.331. It also obtains an Audit-Gated FacetF1 of 0.282, which measures requirement-facet alignment among validator-passing drafts. The same recipe scales: 14B and 32B variants reach IDSAuditPass 0.753 / 0.693 and Audit-Gated FacetF1 0.392 / 0.369. In a workflow check with six BIM practitioners, Ishigaki-assisted authoring reduced aggregate work time by 54.7% under the same validation and alignment endpoint. These results suggest that verifier-aware IDS generation can reduce the practical burden of converting BIM information requirements into reviewable IDS drafts.

2606.08543 2026-06-09 cs.AI New submission

PAEC: Position-Aware Entropy Calibration for LLM Reasoning in RLVR

Shumeng Yang, Yisu Liu, Jiayi Zheng, Zhaohui Yang, Linjing Li

Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Institute of Computing Technology, Chinese Academy of Sciences; School of Computer Science and Technology, University of Chinese Academy of Sciences

AI summary: Proposes Position-Aware Entropy Calibration (PAEC), which builds a soft mask from local top-p entropy and top-two candidate competition and applies an anchor-based lower-bound penalty to prevent entropy collapse at decision-relevant positions, improving mathematical reasoning performance.

Comments: 22 pages, 7 figures

Abstract

Reinforcement learning with verifiable rewards (RLVR) improves large language model reasoning but often suffers from rapid policy-entropy collapse, where the policy prematurely concentrates on narrow high-probability reasoning paths. While global entropy regularization can encourage exploration, uniformly increasing entropy across all token positions is inefficient for long reasoning trajectories, where many tokens are not decision-relevant. We propose Position-Aware Entropy Calibration (PAEC), a token-level entropy-management framework that constructs a soft mask from local top-p entropy and top-two candidate competition, and applies an anchor-based lower-bound penalty to prevent selected-position entropy collapse. Experiments on five mathematical reasoning benchmarks show that PAEC improves macro-average majority-vote performance over strong RLVR baselines, with clear gains on AIME-style tasks. Our results suggest that entropy management in reasoning RL should be formulated as selective exploration allocation over decision-sensitive positions rather than uniform randomness injection.
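
As a rough illustration of the selective idea in this abstract, the Python sketch below computes a per-position soft mask from top-p entropy and the top-two probability ratio, then penalizes only masked positions whose entropy falls below an anchor. The function names, the product form of the mask, the normalization by log 2, and the parameters `top_p` and `anchor` are all assumptions made for illustration; the abstract does not give PAEC's actual formulas.

```python
import math

def top_p_entropy(probs, top_p=0.9):
    """Entropy (nats) of the renormalized smallest token set whose mass reaches top_p."""
    kept, mass = [], 0.0
    for p in sorted(probs, reverse=True):
        kept.append(p)
        mass += p
        if mass >= top_p:
            break
    return sum(-(p / mass) * math.log(p / mass) for p in kept if p > 0)

def soft_mask(probs, top_p=0.9):
    """Hypothetical mask in [0, 1]: large when the top two candidates compete
    and the nucleus is uncertain, near zero on confident positions."""
    ranked = sorted(probs, reverse=True)
    competition = ranked[1] / ranked[0] if ranked[0] > 0 else 0.0  # 1.0 when tied
    uncertainty = min(1.0, top_p_entropy(probs, top_p) / math.log(2))
    return competition * uncertainty

def lower_bound_penalty(position_probs, anchor=0.3, top_p=0.9):
    """Mask-weighted hinge penalty on positions whose entropy drops below the anchor
    (the anchor is an assumed fixed constant here)."""
    total = 0.0
    for probs in position_probs:
        entropy = sum(-p * math.log(p) for p in probs if p > 0)
        total += soft_mask(probs, top_p) * max(0.0, anchor - entropy)
    return total / len(position_probs)
```

In this sketch a confident position such as `[0.98, 0.01, 0.01]` receives a mask of zero and contributes nothing, so the penalty spends its exploration budget only where two candidates genuinely compete, in contrast to a uniform entropy bonus.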

2606.08542 2026-06-09 cs.RO cs.AI cs.CV New submission

When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

Haizhou Ge, Yufei Jia, Yue Li, Zhixing Chen, Lu Shi, Lei Han, Guyue Zhou, Ruqi Huang

Affiliations: Tsinghua University; DISCOVER Robotics

AI summary: To address VLMs misreading video traces in exploratory manipulation, proposes Closed-Loop Trace Distillation, in which a per-task coding agent distills a one-line natural-language heuristic prompt so that a frozen VLM predicts the minimal-success action chain more accurately, improving chain accuracy by 0.38 to 0.47 on simulated and real-robot tasks.

Comments: 16 pages, 4 figures, 4 tables

Abstract

Exploratory manipulation often turns an apparent failed attempt into the key evidence for what to do next. For example, a robot pulls a locked cabinet drawer, fails, and only succeeds after opening the lock. The failed pull reveals a latent precondition (the drawer is locked) that determines the minimal-success action chain (the fewest actions that complete the task), here [lock-open, drawer-pull]. Correctly reading this trace is therefore the prerequisite for recovering that chain. We formalize this setting as Exploratory Manipulation Trace QA (EMT-QA): given synchronized video and proprioception from an exploratory trace, predict the minimal-success action chain under the latent precondition revealed by the probe. However, even state-of-the-art VLMs and embodied multimodal LLMs misread this evidence: they do not reliably recover the chain from raw video, raw proprioception, or their combination. We introduce Closed-Loop Trace Distillation, a pipeline that uses a per-task coding agent to inspect labeled training traces and distill a one-line natural-language prompt over the trace, which we call the Distilled Reading Heuristic (DRH). At inference, no agent is invoked and no model weights are updated; a frozen VLM receives the raw trace plus the DRH as a prompt entry. Across three simulator and two real-robot tasks, the DRH improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The same DRH also serves as the sole specification for one-shot programmatic classifiers that match the prompted VLM.

2606.08539 2026-06-09 cs.AI New submission

AgentTrust: A Self-Improving Trust Layer for AI-Agent Actions

Chenglin Yang

Affiliations: Independent Researcher

AI summary: Proposes AgentTrust v2, which classifies threats by type (lexical vs. semantic) and adds a self-learning mechanism to make self-improving trust decisions over agent actions, substantially improving semantic-threat detection accuracy while reducing false blocks.

Comments: 29 pages, 5 figures

Abstract

AI agents increasingly take consequential actions -- shell commands, cloud operations, and arbitrary tool-calls -- so a trust layer must decide, per action, whether to allow, warn, block, or escalate. We argue that the right way to reason about such a layer is by threat type. Lexical (fixed-signature) threats, where danger lives in a stable token, are decidable by deterministic rules; semantic (intent-dependent) threats, where a benign and a malicious action share the same surface, are out of reach for rules by construction. We make this concrete with a negative proof: a determined, hand-authored cloud rule pack lifts held-out accuracy only 48 to 56% overall and moves the semantic categories by 0pp (data_db 29 to 29, observability 59 to 59, supply_chain 50 to 50), while a strong LLM judge carries exactly those categories. We give the judge a self-learning capability: on a corpus that is mainly semantic attacks it nearly doubles rule accuracy (48% to 83.6-85.2%) with near-zero false-blocks, and this holds across two model providers. We turn this into a self-improving dual-store system: the judge distills a growing deterministic rule floor on lexical threats (cheaper over time) and feeds a guarded RAG memory on semantic threats (a verdict-cache fails -- surface-twins collapse to ~58% -- so a corroboration guard lifts semantic accuracy +13pp, 70 to 84). The result is what sets AgentTrust v2 apart from its static v1 predecessor: a trust layer that self-evolves from its own stream of decisions -- cheaper on the lexical class (it distils its own rules) and smarter on the semantic class (it accrues guarded precedent), while never hard-blocking a benign action. An end-to-end online replay shows the judge-call rate falling (50% to 44%) and judge-domain accuracy rising (71% to 80%), with 0 benign hard-blocks across 45,000 actions.
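
The lexical/semantic split described in this abstract can be pictured as a two-path dispatcher: deterministic rules decide fixed-signature actions, and everything else is handed to a judge. The Python sketch below is a minimal illustration under stated assumptions; the rule patterns, the `decide` function, and the `stub_judge` placeholder are hypothetical and are not taken from AgentTrust, and the sketch does not cover rule distillation or the guarded RAG memory.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical lexical rules: (pattern, verdict). A real rule pack would be far larger.
RULES: List[Tuple["re.Pattern[str]", str]] = [
    (re.compile(r"\brm\s+-rf\s+/(\s|$)"), "block"),
    (re.compile(r"\bcurl\b.*\|\s*(ba)?sh\b"), "warn"),
]

def decide(action: str, judge: Callable[[str], str]) -> Tuple[str, str]:
    """Return (verdict, source). Rules settle fixed-signature cases;
    anything they do not match is passed to the judge."""
    for pattern, verdict in RULES:
        if pattern.search(action):
            return verdict, "rule"
    return judge(action), "judge"

def stub_judge(action: str) -> str:
    """Placeholder for an LLM judge (assumption): escalates anything
    that mentions a production database, allows the rest."""
    return "escalate" if "prod-db" in action else "allow"
```

For example, `decide("rm -rf /", stub_judge)` is settled by a rule without calling the judge, whereas `decide("pg_dump prod-db", stub_judge)` has no stable token to match and falls through to the judge.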

2606.08538 2026-06-09 cs.LG New submission

Routine laboratory trajectories encode the onset of organ-level complications in cancer

Jannik Lübberstedt, Krischan Braitsch, Jacqueline Lammert, Christof Winter, Florian Gabriel, Tristan Lemke, Christopher Zirn, Markus Graf, Friedrich Puttkammer, Hartmut Häntze, Johannes Moll, Anirudh Narayanan, Andrei Zhukov, Fabian Drexel, Zeineb Ben Chaaben, Sebastian Ziegelmayer, Su Hwan Kim, Marion Högner, Jan Kirschke, Florian Bassermann, Marcus Makowski, Christian Wachinger, Lisa Adams, Keno Bressem

Affiliations: Technical University of Munich; Charité - Universitätsmedizin Berlin; German Heart Center

AI summary: Uses a transformer to analyze longitudinal trajectories of routine laboratory tests in cancer patients and predict 162 treatment-associated complications, matching or outperforming non-sequential baselines and supporting the claim that laboratory trajectories encode organ deterioration early.

Abstract

Routine laboratory panels drawn during cancer treatment constitute longitudinal physiological recordings of organ function, yet their temporal structure is discarded by single-timepoint prognostic tools. A transformer trained on 2,777,595 laboratory measurements from 3,905 patients with multiple myeloma or ovarian cancer predicted the two-year onset of 162 treatment-associated complications, including therapy-related myelodysplastic syndromes, spanning eight clinical categories, achieving 1.5- to 6.1-fold enrichment above prevalence at the group level. It matched or outperformed non-sequential baselines across grouped endpoints (AUROC gains up to +0.11), demonstrating that longitudinal laboratory trajectories capture evolving complication-specific physiology inaccessible from isolated measurements. Predictions generalised across both cancers, divergence concentrating in disease-specific complications, and biomarker masking recovered signatures consistent with established pathophysiology. External validation on MIMIC-IV and MMRF CoMMpass confirmed transferability across independent healthcare systems (AUROC up to 0.85). Routine oncological laboratory data encode organ deterioration weeks to months before clinical onset, enabling complication-specific surveillance without additional testing infrastructure.

2606.08535 2026-06-09 cs.CV New submission

NGram-MoSE: Efficient Remote Sensing Super-Resolution via N-Gram Context and Mixture-of-Experts

Yun-Hsuan Huang, Trong-An Bui, Chih-Hung Chuang

Affiliations: National Science and Technology Council (NSTC), Taiwan

AI summary: Proposes NGram-MoSE, a lightweight Transformer architecture that uses N-Gram Context Injection to strengthen local consistency and a sparsely activated Mixture-of-Experts feed-forward design to reduce computation, achieving efficient and robust texture reconstruction for remote sensing super-resolution.

Abstract

Remote sensing applications for environmental monitoring and disaster management are frequently constrained by a spatial-temporal trade-off: imagery with fine spatial detail is often acquired less frequently, whereas more temporally available observations are typically coarser. Single-image super-resolution provides a practical means to enhance coarse imagery without changing acquisition schedules, yet many Transformer-based SR models remain computationally expensive and can be sensitive to limited or geographically biased training data, which degrades robustness under out-of-distribution conditions. This paper presents NGram-MoSE, a lightweight Transformer architecture designed to improve both efficiency and texture continuity. NGram-MoSE introduces N-Gram Context Injection to strengthen cross-window local consistency and mitigate window-boundary artifacts, and incorporates a Mixture-of-Experts (MoE) feed-forward design to scale capacity through sparse activation without proportional growth in inference cost. Experiments on a geographically disjoint OOD test set show that NGram-MoSE achieves 31.68 dB PSNR while reducing FLOPs by 14× relative to a heavyweight Transformer reference. Downstream evaluation on a landslide segmentation benchmark further demonstrates that restoring degraded inputs to the detector training scale improves performance, yielding a 4.47% absolute gain in mAP@50 over bicubic upsampling, and exhibits stronger cross-scale consistency under scale extrapolation. These results indicate that NGram-MoSE provides an effective SR module for resource-constrained remote sensing pipelines requiring robust generalization.
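
The phrase "scale capacity through sparse activation" refers to the general Mixture-of-Experts pattern in which each token is routed to only a few experts. The Python sketch below shows generic top-k routing for a single token; it is not the NGram-MoSE layer itself. The function names are hypothetical, and the gate logits, which a learned router would normally produce, are passed in directly as an assumption.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, gate_logits, experts, k=1):
    """Route one token to its top-k experts and mix their outputs
    using gate weights renormalized over the selected experts."""
    top = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in top])
    out = [0.0] * len(token)
    for w, i in zip(weights, top):
        y = experts[i](token)  # only the selected experts are evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out, top

# Three toy "experts" standing in for feed-forward sub-networks.
experts = [
    lambda t: [2.0 * x for x in t],
    lambda t: [x + 1.0 for x in t],
    lambda t: [0.0 for _ in t],
]
output, chosen = moe_forward([1.0, 2.0], [0.1, 3.0, -1.0], experts, k=1)
```

With `k=1` only one of the three experts runs per token, so adding experts increases parameter count while the per-token compute stays roughly constant.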

2606.08533 2026-06-09 cs.LG cs.RO New submission

Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning

Lixuan Jin, Bingxuan Lan, Xinyi Bao, Xiangyuan Xie, Chunjie Zhang, Zheng Chen, Tianshuo Liu, Ruijie Tian, Jinyu Ru, Gang Wang, Lei Yuan, Yang Yu

Affiliations: National Key Laboratory of Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Faculty of Robot Science and Engineering, Northeastern University; National Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology

AI summary: Proposes Aco2, which uses contextual contrastive meta reinforcement learning so that a quadrotor can autonomously pick up, transport, and deliver varying payloads without human intervention, and transfers directly to the real world.

Abstract

Unmanned aerial vehicles (UAVs) are increasingly being deployed in logistics, service robotics, and other real-world applications, creating a growing demand for autonomous payload acquisition and delivery. Existing approaches typically assume pre-attached payloads or rely on specialized grippers, leaving versatile end-to-end aerial delivery largely unresolved: different payloads induce highly variable flight dynamics, so a single policy must adapt online without manual calibration or explicit system identification. To this end, we study Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning (Aco2), a fully autonomous aerial delivery setting in which a quadrotor equipped with a lightweight hook continuously picks up, transports, and delivers diverse handle-equipped objects between randomized locations, all without human intervention. First, we design a contextual observation encoder that infers a compact latent context from recent interaction history, enabling the policy to adapt online to payload-dependent dynamics. To further improve the quality of this context, we introduce a contrastive objective that structures the context embedding around task-relevant variations, improving generalization across diverse payloads without requiring explicit system identification. Trained entirely in simulation with extensive domain randomization, Aco2 can be directly deployed on a physical quadrotor without real-world fine-tuning.
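
The abstract does not state which contrastive objective is used. As one common choice, the Python sketch below implements an InfoNCE-style loss over context embeddings, where the positive is an embedding from the same payload and the negatives come from different payloads. The function name, the cosine similarity, the `temperature` value, and the pairing scheme are assumptions for illustration only.

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closer (in cosine similarity)
    to the positive than to every negative."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # The positive logit comes first, followed by one logit per negative.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]

    # Stable log-sum-exp for the softmax denominator.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Same-payload contexts aligned, different-payload context opposite: small loss.
well_separated = info_nce([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
# Roles swapped: the anchor sits next to the negative, so the loss is large.
confused = info_nce([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

Minimizing a loss of this kind pulls together contexts inferred under the same payload and pushes apart contexts from different payloads, which is one way to make the latent context track task-relevant variation.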

2606.08531 2026-06-09 cs.AI New submission

VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents

Lu Jia, Haibo Tong, Feifei Zhao, Jindong Li, Dongqi Liang, Ping Wu, Qian Zhang, Yi Zeng

Affiliations: BrainCog AI Lab, Institute of Automation, Chinese Academy of Sciences; Beijing Institute of AI Safety and Governance (Beijing-AISI); Beijing Key Laboratory of Safe AI and Superalignment; School of Artificial Intelligence, UCAS; Long-term AI

AI summary: Proposes the VESTA framework, which automatically generates 1,072 executable scenarios across five risk dimensions to evaluate the behavioral safety risks of 12 LLM agents during task execution; the average attack success rate is 47.1%.

Comments: Preprint. 18 pages, 12 figures, 5 tables

Abstract

Large language models (LLMs) are increasingly evolving from simple text-based interaction systems into LLM agents that can maintain memory, use tools, access external environments, and execute tasks. As their capabilities and autonomy expand, the safety risks they face also become more diverse. Existing evaluations often rely on manually written scenarios, static prompts, or final-output judgments, making it difficult to capture the diverse risks that agents may face during task execution. We introduce VESTA, a fully automated scenario generation and safety evaluation framework for LLM agents. Based on five risk dimensions, VESTA instantiates abstract and diverse safety risks in real-world task execution into 1,072 measurable evaluation scenarios. Using the automated evaluation pipeline, 12 LLM agents are evaluated under two authority contexts. The results show that current agents still face substantial behavioral safety risks during task execution, with an average attack success rate (ASR) of 47.1% and several models exceeding 70%. These findings demonstrate the importance of executable, process-level evaluation for understanding and improving LLM agent safety.

2606.08529 2026-06-09 cs.AI cs.CL cs.LG New submission

Scaffold Effects on GAIA: A Controlled Comparison

Jason Starace

Affiliations: Independent Researcher

AI summary: A controlled experiment comparing three scaffolds (ReAct, a multi-agent design, and planner-then-executor) across five models on the GAIA validation set finds that scaffold choice alone can shift accuracy by up to 28 percentage points, and that more capable models are not necessarily less scaffold-dependent.

Comments: 12 pages, 3 figures

Abstract

Published agent capability scores conflate what a model can do with what its scaffold lets it do, and the magnitude of this elicitation gap is not well characterized under controlled conditions. This study executes a pre-registered controlled comparison of three scaffolds (ReAct, a Planner-Actor-Rater multi-agent design, and planner-then-executor) across five models from three providers (Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; Gemini 3.1 Pro Preview; GPT-5.5) on GAIA validation Levels 1 and 2, holding tasks and conditions fixed, with three attempts per question. Scaffold choice alone moves measured accuracy by as much as 28 percentage points within a single model (Opus, Level 2, robust slice), confirming the pre-registered hypothesis that scaffold variation produces gaps of at least 10 points. The pre-registered prediction that more capable models would be less scaffold-sensitive is rejected in direction: scaffold effects vary significantly by model in every dataset slice, but the most capable Anthropic model gains the most from structured scaffolds at the harder level, and tier-scaling holds only at Level 1 under the robust slice. The multi-agent advantage over ReAct at Level 2 appears within the Anthropic family but not for the cross-provider models, making model family rather than capability tier the conditioning variable, and the predicted planner-executor advantage on file-reading tasks is falsified. Structured scaffolds make fewer tool calls yet recover more often from mid-trajectory errors at the harder level, and a single cell (Gemini with planner-then-executor) is the cheapest at both levels and the most accurate at Level 2. These results indicate that single-scaffold capability numbers are scaffold-conditional estimates and that the elicitation gap is not guaranteed to shrink as models improve.

2606.08525 2026-06-09 cs.CV New submission

DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving

Qimao Chen, Fang Li, Yuechen Luo, Zehan Zhang, Haiyang Sun, Fangzhen Li, Bing Wang, Guang Chen, Yang Ji, Jiong Deng, Hongwei Xie, Hangjun Ye, Long Chen, Yi Zhang

Affiliations: Tsinghua University; Xiaomi EV

AI summary: Proposes the DriveReward dataset and a specialized vision-language reward model that use counterfactual annotation and temporally grounded visual guidance to address the poor generalization of reward acquisition in autonomous driving, achieving performance comparable to rule-based rewards in RL fine-tuning and trajectory selection.

Abstract

Reward models play a pivotal role in reinforcement learning (RL) and multi-modal trajectory selection for autonomous driving. However, acquiring such rewards typically relies on hand-crafted rule-based objectives or perception ground truth, which hinders generalization for data-scaling. While Vision-Language Models (VLMs) have demonstrated feasibility as reward models in other domains, their effectiveness in driving tasks remains underexplored. In this work, we bridge this gap by (1) introducing DriveReward, a reasoning trajectory evaluation dataset rigorously labeled via temporally-grounded visual guidance and augmented with counterfactual driving behaviors, and (2) pairing it with a specialized Vision-Language Reward Model. To address the scarcity of failure cases in conventional datasets, we propose a counterfactual data annotation scheme to construct cases encompassing diverse driving styles and erroneous behaviors. Evaluations on our proposed benchmark reveal that even leading open-source and proprietary VLMs fail to excel across all tasks, highlighting significant room for improvement in existing models. Building on these findings, we subsequently tailor a specialized 1B reward model that outperforms larger VLMs on task-specific reward alignment. Finally, we validate our reward model's effectiveness by integrating it into RL finetuning and multi-modal trajectory scoring across multiple baselines, achieving performance comparable to rule-based reward calculations in both open-loop and closed-loop evaluation.

2606.08520 2026-06-09 cs.RO New submission

Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data

Linqi Yin, Shiduo Zhang, Shenling Qiu, Chenxin Li, Zhaoyang Fu, Lei Xiao, Xiang Wang, Chenchen Yang, Zhe Xu, Pengfang Qian, Jingjing Gong, Xipeng Qiu, Xuanjing Huang, Yu-Gang Jiang

Affiliations: Fudan University; Shanghai Innovation Institute

AI summary: Proposes embodied trajectory-coupled (ETC) data as an intermediate bridge, together with a three-stage training recipe (Distribution Bridging, Objective Bridging, Retentive Adaptation) that gradually converts a vision-language model (VLM) into a generalizable vision-language-action model (VLA), addressing the two-fold gap between VLMs and VLAs.

Abstract

Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control policies (VLAs) is surprisingly difficult. The root cause is a two-fold gap: VLMs are trained on internet-scale images with language-understanding objectives, while VLAs must perceive robot scenes and predict motor actions. Fine-tuning a VLM directly on robot action data forces the model to cross both gaps at once -- the learning curve is steep and the rich generalizations learned during pretraining tend to degrade rather than transfer. We argue that this gap can be bridged gradually with the right intermediate data. We introduce embodied trajectory-coupled (ETC) data -- vision-language supervision derived from the same robot scenes and trajectories used for action learning. Because ETC data shares the visual context of robot operation while retaining familiar language-understanding objectives, it provides a natural stepping stone between VLM pretraining and VLA fine-tuning. Building on this, we design a three-stage training recipe. Distribution Bridging first adapts the VLM to embodied visual-language semantics. Objective Bridging then gradually shifts the model toward action prediction while preserving the acquired representations. Retentive Adaptation finally specializes the policy to the target deployment domain. We further show that mixing task-relevant out-of-distribution ETC data with a small amount of action data enables the model to generalize to novel visual-language conditions without requiring additional robot demonstrations. Simulation and real-robot experiments confirm that this gradual bridging strategy is the key to transferring VLM generalization into robust, deployable robot policies.

2606.08517 2026-06-09 cs.LG cs.CL New submission

A Joint Finite-Sample Certificate for Adaptive Selective Conformal Risk Control

自适应选择性共形风险控制的联合有限样本证书

Xiaoli Yu, Jiamiao Liu

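Affiliations: Chongqing University of Posts and Telecommunications; Army Medical University (Third Military Medical University)

AI summary: Proposes a joint finite-sample certificate that simultaneously upper-bounds the selective risk and lower-bounds the acceptance probability and deployment utility, and that remains valid under adaptive threshold selection. Using an empirical-Bernstein bound on the ratio risk, among other bounds, it opens a certified-acceptance frontier 22 percentage points beyond Hoeffding-CRC on ImageNet and COCO and is about 10x tighter than a matched-valid baseline.

Details
AI Chinese abstract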

选择性预测器在置信输入上做出预测,否则弃权;安全部署需要一个单一的有限样本证书,同时上界所选风险、下界接受概率 $\pacc$ 高于下限 $\pmin$,并下界部署效用。该证书必须在从 $\ncert$ 样本上的有限网格 $m$ 对中进行自适应阈值选择时有效。我们通过将所选风险直接视为比率而非通过Hoeffding式范围界,为有界、可能非单调的损失给出了这样的证书。该构造耦合了三个置信界:比率风险的方差自适应经验伯恩斯坦界、接受概率的Clopper-Pearson界以及效用的双边接近界。它们共同下界认证策略的绝对效用,并且与认证集上的最优策略相差不超过 $2\gammau$,两者在可行时均非平凡;一个按场景划分的第三部分与外部预言机匹配,仅在风险边际 $\gammar < α$ 时有信息量,在主要操作点处为空。相对于仅范围Hoeffding比率构造,这使接受下限依赖从 $1/\pmin$ 变为 $1/\sqrt{\pmin}$,并且一个闭式推论识别出每对场景,其中我们的风险界优于Hoeffding共形风险控制(Hoeffding-CRC)选择性界。实验上,在ImageNet(三个ResNet)和COCO val 2017全景分割上,该证书比Hoeffding-CRC打开了+22个百分点的认证接受前沿,并且比非平凡匹配验证基线紧致约10倍;这些增益是按场景的,非普适的,在ADE20K上不存在。认证器运行时间为 $O(\ncert m)$。

English abstract

Selective predictors answer on confident inputs and abstain elsewhere; deploying one safely needs a single finite-sample certificate that simultaneously upper-bounds the selected risk, lower-bounds the acceptance probability $p_{\mathrm{acc}}$ above a floor $p_{\min}$, and lower-bounds the deployment utility. This certificate must be valid under adaptive threshold selection from a finite grid of $m$ pairs on $n_{\mathrm{cert}}$ samples. We give such a certificate for bounded, possibly non-monotone losses by treating the selected risk directly as a ratio rather than through a Hoeffding-style range bound. The construction couples three confidence bounds: a variance-adaptive empirical-Bernstein bound on the ratio risk, a Clopper--Pearson bound on acceptance, and a two-sided closeness bound on utility. Together they lower-bound the certified policy's utility absolutely and to within $2\gamma_u$ of the best over the certified set, both non-vacuous whenever feasible; a regime-scoped third leg matches an external oracle, informative only where the risk margin $\gamma_r < \alpha$ and vacuous at the headline operating points. Relative to the range-only Hoeffding-ratio construction this sharpens the acceptance-floor dependence from $1/p_{\min}$ to $1/\sqrt{p_{\min}}$, and a closed-form corollary identifies a per-pair regime in which our risk bound dominates a Hoeffding conformal risk control (Hoeffding--CRC) selective bound. Empirically, on ImageNet (three ResNets) and COCO val 2017 panoptic, the certificate opens a $+22$ pp certified-acceptance frontier over Hoeffding--CRC and is ${\approx}10{\times}$ tighter than a non-vacuous matched-valid baseline; these gains are regime-scoped, not universal, and absent on ADE20K. The certifier runs in $O(n_{\mathrm{cert}}\, m)$ time.
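As a rough illustration of two of the ingredients named in this abstract, the sketch below computes a textbook one-sided empirical-Bernstein upper bound (the Maurer-Pontil form for i.i.d. losses in [0, 1]) and a one-sided Clopper-Pearson lower bound on an acceptance probability. It is not the paper's ratio construction or its joint certificate; the function names and the bisection solver are assumptions made for this example.

```python
import math

def empirical_bernstein_upper(losses, delta):
    """Upper confidence bound on the mean of i.i.d. losses in [0, 1].

    Textbook Maurer-Pontil form; holds with probability >= 1 - delta.
    """
    n = len(losses)
    mean = sum(losses) / n
    var = sum((x - mean) ** 2 for x in losses) / (n - 1)  # unbiased sample variance
    log_term = math.log(2.0 / delta)
    return mean + math.sqrt(2.0 * var * log_term / n) + 7.0 * log_term / (3.0 * (n - 1))

def binom_tail_ge(k, n, p):
    """P[Bin(n, p) >= k], summed directly (fine for small n)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def clopper_pearson_lower(k, n, delta):
    """One-sided lower bound on p after observing k acceptances in n trials."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection on the monotone binomial tail
        mid = (lo + hi) / 2.0
        if binom_tail_ge(k, n, mid) < delta:
            lo = mid
        else:
            hi = mid
    return lo
```

The variance term is what makes the first bound adaptive: when the observed losses barely vary, the square-root term vanishes and only the $O(1/n)$ term remains.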

2606.08514 2026-06-09 cs.CV New submission

OmniTryOn: Video Try-On Anything at Once!

OmniTryOn: 一次性视频试穿任意物品!

Changliang Xia, Chengyou Jia, Minnan Luo, Zhuohang Dang, Xin Shen, Bowen Ping

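Affiliations: Xi’an Jiaotong University

AI summary: Proposes the OmniTryOn framework, which uses a First Frame Wearable Cache and Spatiotemporally Consistent RoPE to perform external-prior-free, multi-object video try-on in a single pass, significantly outperforming existing methods on TryAny-Bench.

Details
AI Chinese abstract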

尽管视频虚拟试穿(VVT)取得了显著进展,现有方法仍存在两个基本局限:首先,它们仅限于单件衣物迁移,使得同时进行多物品试穿极不实用;其次,它们严重依赖显式外部先验(如衣物掩码),不可避免地破坏了关键的物理动态并降低了视觉质量。为弥补这一差距,本文提出了新颖的“任意试穿”任务,旨在一次推理过程中将多种可穿戴物品同时迁移到视频中的人物身上。为了支持并标准化这一范式,我们引入了TryAny-Bench,一个包含配对视频数据集和定制评估协议的综合基准。此外,我们提出了OmniTryOn,一个无外部先验的生成框架,用于解决该任务。具体而言,OmniTryOn采用首帧可穿戴缓存策略,通过初始视频帧直接为生成过程提供多样化的可穿戴物品。为保持一致性,我们提出了时空一致RoPE(STC-RoPE),它固有地建立了稳健的时空锚点,以严格保留复杂的人体运动和背景动态。通过提出的渐进式试穿(GTO)训练策略进行优化,我们的模型逐步掌握了稳健的多物品合成。在TryAny-Bench上的大量实验表明,OmniTryOn显著优于现有的专用视频虚拟试穿模型和通用视频编辑基线,为“任意试穿”任务建立了强大的新标准。我们的数据集、代码和模型可在https://github.com/xcltql666/OminTryOn获取。

English abstract

Although video virtual try-on (VVT) has achieved significant progress, existing methods still exhibit two fundamental limitations: first, they are restricted to single-garment transfer, rendering simultaneous multi-object try-on highly impractical; second, their heavy reliance on explicit external priors (e.g., garment masks) inevitably destroys crucial physical dynamics and degrades visual quality. To bridge this gap, this paper proposes the novel Try-On Anything task, which aims to simultaneously transfer diverse wearable objects onto a person in a video in a single inference pass. To support and standardize this paradigm, we introduce TryAny-Bench, a comprehensive benchmark encompassing a paired video dataset alongside a tailored evaluation protocol. Furthermore, we present OmniTryOn, an external-prior-free generative framework designed to tackle this task. Specifically, OmniTryOn employs a First Frame Wearable Cache strategy, which directly provides diverse wearable objects for the generation process through the initial video frame. To maintain consistency, we propose the Spatiotemporally Consistent RoPE (STC-RoPE), which inherently establishes robust spatiotemporal anchors to strictly preserve complex human motions and background dynamics. Optimized by the proposed Gradual Try-On (GTO) training strategy, our model progressively masters robust multi-object synthesis. Extensive experiments on TryAny-Bench demonstrate that OmniTryOn significantly outperforms existing specialized video virtual try-on models and general video editing baselines, establishing a powerful new standard for the Try-On Anything task. Our dataset, code, and models are available at https://github.com/xcltql666/OminTryOn.

2606.08513 2026-06-09 cs.RO cs.LG cs.SY eess.SY New submission

Towards End to End Motion Planning and Execution for Autonomous Underwater Vehicles Using Reinforcement Learning

面向自主水下机器人的端到端运动规划与执行:基于强化学习的方法

Elisei Shafer, Oren Gal

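Affiliations: University of Haifa

AI summary: Proposes a hierarchical reinforcement learning architecture that maps raw sensor data directly to thruster commands for end-to-end AUV motion planning and execution. In the HoloOcean simulator, trajectory lengths come within 4% to 6% of an RRT* baseline, and the learned policy is robust to sensor noise and reduced visibility.

Details
AI Chinese abstract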

自主水下机器人(AUV)传统上依赖复杂、高度工程化的流水线进行感知、路径规划和运动控制。本文探索了一种端到端深度强化学习(DRL)方法的可行性,该方法将原始传感器数据直接映射为推进器指令,减少了人工工程。我们提出了一种分层强化学习(HRL)架构,将问题分解为两个马尔可夫决策过程。高层(HL)策略以2Hz运行,处理原始$84 \times 84$像素单目相机帧、堆叠的$100 \times 100$像素前视成像声纳以及本体感受数据,生成空间子目标。同时,低层(LL)策略以10Hz运行,将这些子目标转换为推进器指令。HL策略使用基于先前演示的强化学习(RLPD)在修改后的样本高效机器人强化学习(SERL)框架中训练,而LL策略则采用软演员-评论家(SAC)结合后见经验回放(HER)。在高保真HoloOcean模拟器中评估,我们的方法展示了成功的避障能力,轨迹长度与$\text{RRT}^*$规划基线非常接近(误差在4%到6%之间)。此外,学习到的策略对模拟传感器噪声和能见度降低表现出强鲁棒性。尽管系统能有效导航熟悉的几何环境,但实验揭示了在遇到具有新颖障碍形状的未访问区域时存在泛化限制。最终,这项工作展示了使用最小计算硬件进行样本高效、端到端DRL在水下导航中的潜力。

English abstract

Autonomous Underwater Vehicles (AUVs) traditionally rely on complex, heavily engineered pipelines for perception, path planning, and motion control. This paper explores the feasibility of an end-to-end Deep Reinforcement Learning (DRL) approach that maps raw sensor data directly to thruster commands, reducing manual engineering. We propose a hierarchical reinforcement learning (HRL) architecture splitting the problem into two Markov Decision Processes. A High-Level (HL) policy operating at 2Hz processes raw $84 \times 84$ pixel monocular camera frames, stacked $100 \times 100$ pixel forward-looking imaging sonar, and proprioceptive data to generate spatial subgoals. Simultaneously, a Low-Level (LL) policy operating at 10Hz converts these subgoals into thruster commands. The HL policy is trained using Reinforcement Learning from Prior Demonstrations (RLPD) within a modified Sample-Efficient Robotic Reinforcement Learning (SERL) framework, while the LL policy utilizes Soft Actor-Critic (SAC) combined with Hindsight Experience Replay (HER). Evaluated in the high-fidelity HoloOcean simulator, our method demonstrates successful obstacle avoidance, achieving trajectory lengths closely approximating (within 4% to 6% of) an $\text{RRT}^*$ planning baseline. Furthermore, the learned policy exhibits strong robustness to simulated sensor noise and decreased visibility. While the system navigates familiar geometries effectively, experiments reveal generalization limitations when encountering unvisited areas with novel obstacle shapes. Ultimately, this work demonstrates the promise of sample-efficient, end-to-end DRL for underwater navigation using minimal computational hardware.
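The two-rate structure described above (a 2 Hz high-level policy feeding subgoals to a 10 Hz low-level policy) can be sketched as a single loop in which the subgoal is refreshed every fifth low-level step. This is a minimal sketch, not the authors' code: the policies, the 1-D toy dynamics, and all names are hypothetical placeholders.

```python
HL_HZ, LL_HZ = 2, 10
STEPS_PER_SUBGOAL = LL_HZ // HL_HZ  # 5 low-level steps per high-level decision

def run_episode(hl_policy, ll_policy, env_step, obs, n_ll_steps):
    """Run n_ll_steps low-level steps, re-planning the subgoal at the slower rate."""
    subgoal = None
    log = []
    for t in range(n_ll_steps):
        if t % STEPS_PER_SUBGOAL == 0:
            subgoal = hl_policy(obs)       # slow loop: pick a spatial subgoal
        action = ll_policy(obs, subgoal)   # fast loop: command toward the subgoal
        obs = env_step(obs, action)
        log.append((t, subgoal, action))
    return obs, log

def demo():
    """Toy 1-D example: the state is a position along a line."""
    calls = {"hl": 0}

    def hl(obs):
        calls["hl"] += 1
        return obs + 1.0                   # hypothetical: aim one unit ahead

    def ll(obs, subgoal):
        return max(-0.1, min(0.1, subgoal - obs))  # clipped proportional command

    def step(obs, action):
        return obs + action

    final, _ = run_episode(hl, ll, step, 0.0, 10)
    return calls["hl"], final
```

In one simulated second (10 low-level steps) the high-level policy is queried twice, matching the 2 Hz / 10 Hz split quoted in the abstract.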

2606.08511 2026-06-09 cs.CV New submission

Look Less, Reason More: Block-wise Attention Skipping for Efficient Multimodal LLMs

少看多思:面向高效多模态大语言模型的块级注意力跳过

Jie Ma, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

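Affiliations: Xiamen University

AI summary: To address visual attention saturation in multimodal LLMs, proposes the training-free Visual-Skip method, which imposes block-wise sparsity by selectively skipping redundant visual self-attention modules and uses lightweight calibration to select the task-optimal sparsity path, substantially cutting compute while preserving performance.

Details
AI Chinese abstract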

多模态大语言模型(MLLMs)由于长视觉标记序列的自注意力二次计算成本而面临显著的推理瓶颈。然而,我们识别出当前架构中的一个关键低效问题:视觉注意力饱和。我们的分析表明,视觉标记在早期层中迅速建立其空间结构和模态内关系,使得深层中的视觉-视觉自注意力在计算上变得冗余。相反,这些层中的前馈网络(FFNs)对于将视觉特征投影到不断演化的文本语义空间中仍然至关重要。利用这一洞察,我们提出了Visual-Skip(V-Skip),一种无需训练的推理范式,它将空间交互与语义演化解耦。V-Skip不是丢弃标记,而是通过选择性地绕过饱和的视觉自注意力模块来施加块级结构化稀疏性。此外,认识到不同的下游任务需要不同的推理深度,V-Skip采用轻量级、少样本校准来动态路由任务最优的稀疏路径。大量实验表明,V-Skip有效地绕过了冗余的视觉注意力以实现块级稀疏性,在各种MLLMs上保持了94.16%至100.31%的性能保留。最终,我们证明,为了更有效地推理,模型不需要丢弃它们所看到的内容——它们只需要在正确的深度“少看”即可。

English abstract

Multimodal Large Language Models (MLLMs) face a significant inference bottleneck due to the quadratic computational cost of self-attention over long visual token sequences. However, we identify a critical inefficiency in current architectures: Visual Attention Saturation. Our analysis reveals that visual tokens rapidly establish their spatial structure and intra-modal relationships in early layers, rendering visual-to-visual self-attention in deeper layers computationally redundant. Conversely, Feed-Forward Networks (FFNs) in these layers remain essential for projecting visual features into the evolving textual semantic space. Leveraging this insight, we present Visual-Skip (V-Skip), a training-free inference paradigm that decouples spatial interaction from semantic evolution. Rather than discarding tokens, V-Skip imposes block-wise structured sparsity by selectively bypassing saturated visual self-attention modules. Furthermore, recognizing that varying downstream tasks demand distinct reasoning depths, V-Skip employs a lightweight, few-shot calibration to dynamically route the task-optimal sparsity path. Extensive experiments demonstrate that V-Skip effectively bypasses redundant vision attention to achieve block-wise sparsity, maintaining a 94.16% to 100.31% performance retention across diverse MLLMs. Ultimately, we prove that to reason more effectively, models do not need to discard what they see -- they simply need to "look less" at the right depth.
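To make the cost argument concrete, the sketch below counts query-key pairs per layer under one simplified reading of block-wise skipping: in a skipped layer, visual tokens issue no attention queries (they pass straight to the FFN), while text tokens still attend to every token. The accounting, the non-causal attention, and the function names are assumptions for illustration, not the V-Skip implementation.

```python
def attention_pairs(n_vis, n_txt, skip_visual):
    """Query-key pairs computed in one layer (non-causal, single head)."""
    n = n_vis + n_txt
    n_queries = n_txt if skip_visual else n  # skipped layers drop visual queries
    return n_queries * n

def total_pairs(n_layers, skip_layers, n_vis, n_txt):
    """Total pairs over the stack; skip_layers is a set of layer indices."""
    return sum(
        attention_pairs(n_vis, n_txt, layer in skip_layers)
        for layer in range(n_layers)
    )

# Toy numbers: 4 layers, 6 visual tokens, 2 text tokens, skip the 2 deepest layers.
full = total_pairs(4, set(), 6, 2)
skipped = total_pairs(4, {2, 3}, 6, 2)
```

Because visual tokens usually far outnumber text tokens, removing their queries in deep layers removes most of the quadratic term there, which is the intuition behind "looking less" at the right depth.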

2606.08508 2026-06-09 cs.RO cs.AI New submission

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

ActProbe:面向生成式机器人策略早期故障检测的动作空间探针

Bingjia Huang, Xiangyu Li, Xiang Wang, Liang Mi, Zixu Hao, Weijun Wang, Hao Wu, Kun Li, Yunxin Liu, Ting Cao

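Affiliations: Institute for AI Industry Research (AIR), Tsinghua University; University of Electronic Science and Technology of China; Nanjing University

AI summary: Proposes ActProbe, a lightweight, pure action-space failure detector that uses two signals, temporal consistency error and action chunk magnitude, to predict failures with an LSTM-MLP architecture. Across several generative policies it improves the F1-timeliness Pareto frontier by an average hypervolume gain of +12.7% and accelerates RL fine-tuning.

Details
Comments
24 pages, 9 figures, 11 tables. Project page: https://air-embodied-brain.github.io/actprobe
AI Chinese abstract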

生成式机器人策略在部署时不可预测地失败:它们在关键时刻犹豫不决,偏离任务,或执行不可恢复的动作。现有的在线故障检测器要么需要白盒访问策略内部,要么通过重采样和观测侧信号增加运行时开销。我们的实证分析表明,发射的动作块本身已经携带了生成式机器人策略即将发生故障的强预测信号。受此观察启发,我们引入了ActProbe,一种轻量级的纯动作空间检测器,它使用单次前向传递中可用的两个紧凑信号:连续动作块之间的时间一致性误差(TCE)和当前块的动作块幅度(ACM)。ActProbe通过任务条件化的LSTM-MLP架构将这些信号映射到每步故障概率。在一系列多样化的生成式机器人策略和基准测试中,ActProbe在故障变得视觉可识别之前发出警报,相比内部和外部特征基线,将故障检测的F1-时效性帕累托前沿平均超体积增益提高了+12.7%,在未见任务上早期检测ROC-AUC领先+9.0%。ActProbe进一步迁移到部署中,预测未见真实机器人拾取任务上的故障,并以2.9倍更少的环境交互加速了强化学习微调(PPO)。

English abstract

Generative robot policies fail unpredictably at deployment: they hesitate at critical moments, drift off-task, or commit to unrecoverable actions. Existing online failure detectors either require white-box access to policy internals or add runtime overhead through resampling and observation-side signals. Our empirical analysis shows that emitted action chunks themselves already carry strong predictive signal for impending failures in generative robot policies. Motivated by this observation, we introduce ActProbe, a lightweight, pure action-space detector that uses two compact signals available from a single forward pass: Temporal Consistency Error (TCE) between consecutive action chunks and Action Chunk Magnitude (ACM) of the current chunk. ActProbe maps these signals to per-step failure probabilities with a task-conditioned LSTM-MLP architecture. Across a diverse suite of generative robot policies and benchmarks, ActProbe raises alerts before failures become visually recognizable, improving the accuracy (F1)-timeliness Pareto frontier of failure detection by an average hypervolume gain of +12.7% over both internal- and external-feature baselines, with a +9.0% early-detection ROC-AUC lead on unseen tasks. ActProbe further transfers to deployment, predicting failures on unseen real-robot pick tasks and accelerating RL fine-tuning (PPO) with 2.9x fewer environment interactions.
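The abstract names the two signals but does not give their formulas, so the following is only one plausible reading: TCE as the mean L2 disagreement between the overlapping steps of two consecutive action chunks, and ACM as the mean L2 norm of the actions in the current chunk. The overlap convention (`stride` steps executed before re-planning) and all function names are assumptions.

```python
import math

def _l2(a, b):
    """Euclidean distance between two action vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def temporal_consistency_error(prev_chunk, curr_chunk, stride):
    """Mean L2 gap on the time steps both chunks predict.

    prev_chunk[stride:] and curr_chunk[:H - stride] refer to the same
    future steps if the policy re-plans every `stride` steps (assumed).
    """
    overlap_prev = prev_chunk[stride:]
    overlap_curr = curr_chunk[: len(overlap_prev)]
    if not overlap_prev:
        return 0.0
    gaps = [_l2(p, c) for p, c in zip(overlap_prev, overlap_curr)]
    return sum(gaps) / len(gaps)

def action_chunk_magnitude(chunk):
    """Mean L2 norm of the actions in one chunk."""
    return sum(math.sqrt(sum(x * x for x in a)) for a in chunk) / len(chunk)
```

Both quantities need only the emitted chunks, which is consistent with the abstract's claim that no white-box access or resampling is required; a detector would then feed the per-step pair (TCE, ACM) to a sequence model.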

2606.08503 2026-06-09 cs.AI cs.LO New submission

Standpoint Logics with Defeasible Beliefs

带有可废止信念的立场逻辑

Nicholas Leisegang, Thomas Meyer, Sebastian Rudolph

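Affiliations: University of Cape Town; CAIR, South Africa; Technische Universität Dresden; ScaDS.AI – Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig, Germany

AI summary: Combines KLM defeasible logic with the standpoint logic framework, building on Defeasible Restricted Standpoint Logics (DRSL). Through a postulate-based characterisation of the semantics and the lifting of several entailment relations, it formalises defeasible beliefs held from multiple viewpoints.

Details
AI Chinese abstract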

在本文中,我们将Kraus、Lehmann和Magidor(KLM)的可废止逻辑与Gómez Álvarez和Rudolph的立场逻辑框架相结合。这样做是为了形式化地表达考虑多个(可能矛盾的)视角的知识,而这些视角可能持有可废止信念。为此,我们利用了Leisegang等人引入的可废止受限立场逻辑(DRSL)。我们的工作扩展了先前的研究,为DRSL语义提供了基础表示结果,并系统地将几个著名的蕴涵关系从命题情况提升到立场增强设置。具体地,我们通过一组为立场情况调整的KLM风格公设来刻画DRSL的语义。此外,我们提供了一种方法来提升优先蕴涵,以及基于单个排序函数的蕴涵关系类,从纯命题语境到立场增强语境,包括理性和词典序闭包。我们证明这可以通过语义和算法手段等价地实现。此外,我们表明,对于每种考虑的蕴涵形式,从命题KLM到DRSL,蕴涵检查的复杂度类不会改变。

English abstract

In this paper, we integrate the defeasible logic of Kraus, Lehmann and Magidor (KLM) with the standpoint logic framework of Gómez Álvarez and Rudolph. This is done with the goal of formally expressing knowledge taking into account multiple (possibly contradicting) viewpoints, which in turn may hold defeasible beliefs. In doing so, we utilise Defeasible Restricted Standpoint Logics (DRSL), introduced by Leisegang et al. Our work expands on previous work by providing a foundational representation result for DRSL semantics and systematically lifting several well-known entailment relations from the propositional case to the standpoint-enhanced setting. In particular, we characterise the semantics for DRSL through a set of KLM-style postulates adapted for the standpoints case. We furthermore provide a means to lift preferential entailment, and the class of entailment relations based on single ranking functions from the purely propositional to the standpoint-enhanced context, including rational and lexicographic closure. We show this can be done equivalently through semantic and algorithmic means. Furthermore, we show that, for each considered form of entailment, the complexity class of entailment checking does not change when moving from propositional KLM to DRSL.

2606.08501 2026-06-09 cs.CL New submission

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

重回正轨:在扩散大语言模型中对齐奖励与状态以进行推理

Yawen Shao, Jie Xiao, Kai Zhu, Yu Liu, Hongchen Luo, Xueyang Fu, Yang Cao, Wei Zhai, Zheng-Jun Zha

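Affiliations: University of Science and Technology of China; Tongyi Lab; Northeastern University

AI summary: To address the dual misalignment between process rewards and state trajectories in reinforcement learning for diffusion LLMs, proposes the PAPO framework, which aligns them through Step-Aware Process Rewards and Entropy-Guided Historical Re-enactment, with significant gains on four benchmarks.

Details
AI Chinese abstract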

强化学习(RL)在增强扩散大语言模型(dLLMs)的推理能力方面具有巨大潜力。然而,进展受到真实生成轨迹与梯度更新过程之间双重错位的基本限制:(i)过程-奖励错位。稀疏的终端奖励被不加区分地分配给生成过程的所有中间步骤,未能提供有区分度的信用分配。(ii)状态-轨迹错位。策略更新常常被引向人工的、偏离轨迹的状态,在信息量较少的样本上浪费梯度。为了解决这些限制,我们引入了过程对齐策略优化(PAPO),这是一种新颖的框架,通过步骤感知过程奖励(SPR)将稀疏的终端奖励转化为密集的逐步信用,以及熵引导历史重演(EHR)在高不确定性步骤重放真实轨迹,从而整体上对齐RL更新与dLLM的生成轨迹。在四个基准上的大量实验表明,PAPO显著优于基线,在GSM8K上提升高达4.5%,在MATH500上提升4.8%,在Countdown上提升42.2%,在Sudoku上提升16.1%。

English abstract

Reinforcement learning (RL) holds immense promise for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, progress is fundamentally constrained by a dual misalignment between the authentic generation trajectory and the gradient update process: (i) Process-reward misalignment. Sparse, terminal rewards are indiscriminately assigned to all intermediate steps of the generation process, failing to provide discriminative credit assignment. (ii) State-trajectory misalignment. Policy updates are often diverted toward artificial, out-of-trajectory states, squandering gradients on less informative samples. To address these limitations, we introduce Process Aligned Policy Optimization (PAPO), a novel framework that holistically aligns the RL update with the dLLM's generative trajectory via Step-Aware Process Rewards (SPR), which transform sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR), which replays authentic trajectories at high-uncertainty steps. Extensive experiments on four benchmarks demonstrate that PAPO significantly outperforms baselines, achieving gains of up to 4.5% on GSM8K, 4.8% on MATH500, 42.2% on Countdown and 16.1% on Sudoku.
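The "high-uncertainty steps" that EHR targets are not defined in the abstract. One natural reading, sketched below under that assumption, ranks denoising steps by the Shannon entropy of the model's predictive distribution at each step and keeps the top k for replay. The selection rule and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_uncertainty_steps(step_dists, k):
    """Indices of the k steps with the highest entropy, in trajectory order."""
    ranked = sorted(
        range(len(step_dists)),
        key=lambda i: entropy(step_dists[i]),
        reverse=True,
    )
    return sorted(ranked[:k])

# Toy trajectory: one predictive distribution per generation step.
dists = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
selected = high_uncertainty_steps(dists, 2)
```

Concentrating updates on such steps is one way to avoid spending gradients on states where the model is already confident.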

2606.08497 2026-06-09 cs.AI cs.CL New submission

Explaining Black-Box Language Models: Learning to Optimize Linguistically-Structured Word Subsets

解释黑盒语言模型:学习优化语言结构化的单词子集

Minyoung Hwang, Seokhyun Lee, Changhee Lee

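Affiliations: Korea University

AI summary: Targeting three key requirements for explaining black-box language models (inference efficiency, black-box compatibility, and linguistically structured interpretability), proposes a method that selects informative word subsets via reinforcement learning, yielding efficient, gradient-free, and linguistically coherent explanations.

Details
Comments
KDD 2026 Research Track
AI Chinese abstract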

随着深度语言模型(DLMs)在医疗保健等高风险领域中的部署日益增多,理解其决策依据对于确保信任、安全和问责变得至关重要。然而,当这些DLMs作为黑盒系统(例如通过API)运行时,访问内部模型状态(如参数、梯度)受到限制,实现这一关键的可解释性水平尤其具有挑战性。尽管付出了诸多努力,现有的解释方法往往无法同时满足三个关键需求:(i)推理时效率,(ii)黑盒兼容性且不引发分布外行为,以及(iii)基于输入语言结构的可理解解释。为了解决这些挑战,我们提出了一种方法,通过选择一小部分信息丰富的输入单词来解释DLM的预测。我们将其表述为一个摊销优化问题,从而无需针对特定输入进行搜索即可实现高效的一次性推理。我们的选择策略通过REINFORCE风格策略梯度进行训练,允许在完全无梯度的设置中进行离散单词选择。为了增强可解释性并与人类语言直觉对齐,我们将图结构知识整合到这一选择过程中,促进语言连贯的子集,从而产生对最终用户既高度信息丰富又具有认知意义的解释。我们在多种DLM架构和多个真实世界数据集上评估了我们的方法。它一致地识别出具有增强判别能力和与语言显著线索更强对齐的单词子集,优于传统的黑盒兼容方法和基于梯度的方法(后者被赋予黑盒模型梯度的oracle访问权限,以构成更具挑战性的基准)。我们的代码可在以下地址获取:here。

English abstract

As deep language models (DLMs) are increasingly deployed in high-stakes domains such as healthcare, understanding their decision rationale becomes paramount for ensuring trust, safety, and accountability. However, achieving this vital level of interpretability is particularly challenging when these DLMs operate as black-box systems (e.g., via APIs), where access to internal model states (e.g., parameters, gradients) is restricted. Despite numerous efforts, existing explanation methods often fail to concurrently satisfy three key desiderata: (i) inference-time efficiency, (ii) black-box compatibility without inducing out-of-distribution behavior, and (iii) comprehensible explanations grounded in the input's linguistic structure. To address these challenges, we propose a method that explains predictions of DLMs by selecting a small, informative subset of input words. We formulate this as an amortized optimization problem, enabling efficient one-shot inference without the need for input-specific search. Our selection policy is trained via REINFORCE-style policy gradients, allowing discrete word selection in a fully gradient-free setting. To enhance interpretability and align with human linguistic intuition, we integrate graph-structured knowledge into this selection process, fostering linguistically coherent subsets that result in explanations both highly informative and cognitively meaningful to end-users. We evaluated our method on diverse DLM architectures and multiple real-world datasets. It consistently identifies word subsets with enhanced discriminative power and stronger alignment with linguistically salient cues, outperforming both conventional black-box compatible methods and gradient-based approaches that are given oracle access to the black-box model's gradients for a more challenging benchmark. Our code is publicly available.

2606.08496 2026-06-09 cs.CL cs.LG New submission

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

SAEExplainer: 基于激活引导偏好优化的SAE特征解释

Jingyi He, Haiyan Zhao, Ruxue Shi, Yanguang Liu, Xin Wang, Fei Sun, Mengnan Du

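Affiliations: Shanghai Jiao Tong University; NJIT; Jilin University; Institute of Computing Technology, CAS; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes the SAEExplainer framework, which uses activation scores as a reward signal and iteratively self-corrects base explanations over two rounds of optimization, reducing explanation hallucinations and reinforcing causal triggering patterns.

Details
AI Chinese abstract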

尽管稀疏自编码器(SAE)通过将密集表示分解为稀疏特征缓解了大语言模型(LLM)的不透明性,但解释这些特征仍然是一个核心挑战。然而,当前的解释方法通常运行在开环范式下,未能利用机械反馈进行进一步优化。在本文中,我们提出SAEExplainer,一个利用激活分数作为客观奖励信号来训练模型进行自我纠正和迭代自举的训练框架。通过两轮优化过程迭代验证和纠正基础解释,SAEExplainer实现了其解释能力的持续提升。该机制显著减少了解释幻觉并强化了因果触发模式。大量实验表明,我们的方法在大多数指标上优于已有基线,特别是在因果触发和判别性激活方面。

English abstract

Although Sparse Autoencoders (SAEs) have mitigated the opacity of large language models (LLMs) by decomposing dense representations into sparse features, explaining these features remains a central challenge. Current explanation methods typically operate within an open-loop paradigm, failing to leverage mechanistic feedback for further refinement. In this paper, we propose SAEExplainer, a training framework that utilizes activation scores as an objective reward signal to train the model for self-correction and iterative bootstrapping. By iteratively verifying and correcting foundational explanations through a two-round optimization process, SAEExplainer achieves continuous improvement in its explanatory capabilities. This mechanism significantly reduces explanation hallucinations and reinforces causal triggering patterns. Extensive experiments demonstrate that our approach improves upon established baselines across most metrics, especially in causal triggering and discriminative activation.

2606.08495 2026-06-09 cs.RO cs.CV New submission

EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control

EgoPriMo:面向交互式人形控制的自我中心运动生成

Haoyang Ge, Peng Ren, Yukun Shi, Cong Huang, Kun Li, Kai Chen

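Affiliations: Tianjin University; Zhongguancun Academy; Beihang University; Zhongguancun Institute of Artificial Intelligence; DeepCybo

AI summary: Proposes the EgoPriMo framework, which learns a whole-body motion prior from egocentric human demonstrations, jointly models body dynamics, visual context, and text with a Triple-stream DiT, supports reconstruction, generation, and forecasting, and produces motions that a Unitree humanoid controller can execute.

Details
AI Chinese abstract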

人形机器人需要适应场景上下文、任务要求和用户意图的全身运动。运动跟踪可以再现指定的轨迹,人形机器人视觉-语言-动作系统提供了语义接口,但两者都不能为广泛的全身行为提供可扩展且交互式的先验。我们提出了EgoPriMo(人形机器人自我中心运动先验),一个统一的框架,从自我中心人类演示中学习此类先验。给定自我中心观察和文本提示,EgoPriMo重建、生成和预测基于SMPL的全身运动。语言被用作高级控制信号,而不是完整的运动规范。EgoPriMo的核心是一个三流DiT,它联合建模身体动态、自我中心视觉上下文和文本;任务条件掩码通过同一个检查点路由不同的任务和缺失模态数据。在Nymeria和EgoExo4D上的实验表明,一个检查点在支持重建和预测的同时,改进了自我中心运动生成,优于UniEgoMotion;生成的SMPL运动也可以由Unitree人形控制器执行。这些结果表明了一条从可扩展的自我中心观察到可泛化和交互式人形运动先验的实用路径。

English abstract

Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Motion tracking reproduces specified trajectories, and humanoid vision-language-action systems provide semantic interfaces, but neither offers a scalable and interactive prior for broad full-body behavior. We introduce EgoPriMo (Egocentric Motion Prior for Humanoid Robots), a unified framework that learns such priors from egocentric human demonstrations. Given egocentric observations and a text prompt, EgoPriMo reconstructs, generates, and forecasts SMPL-based full-body motion. Language is used as a high-level control signal rather than a complete motion specification. At the core of EgoPriMo is a Triple-stream DiT that jointly models body dynamics, egocentric visual context, and text; task-conditioning masks route different tasks and missing-modality data through the same checkpoint. Experiments on Nymeria and EgoExo4D show that one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting; the generated SMPL motions can also be executed by a Unitree humanoid controller. These results indicate a practical path from scalable egocentric observations to generalizable and interactive humanoid motion priors.

2606.08492 2026-06-09 cs.CV cs.AI New submission

Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation

眼见为实:基于视觉锚点的提示重写对齐用于文本到图像生成

Xuanyi Liu, Deyi Ji, Junyu Lu, Jing Wang, Qianxiong Xu, Xuhang Chen, Tianrun Chen, Siwei Ma

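Affiliations: Peking University; Tencent; Dalian University of Technology; Nanyang Technological University; University of Cambridge; Zhejiang University

AI summary: Proposes the FaithRewriter framework, which uses a multimodal LLM to generate an intermediate visual cue, combines it with a large LLM to produce visually grounded prompt augmentations, and distills them into a small model to narrow the gap between user intent and generated images.

Details
AI Chinese abstract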

尽管文本到图像(T2I)模型具有令人印象深刻的能力,但由于用户提示的简洁性和模糊性,意图-生成差距往往持续存在。现有方法主要优化提示的流畅性和可读性。然而,增强过程仍然缺乏视觉基础。因此,重写器可能过度推断缺失的细节,导致意图-生成差距。为了解决这一限制,我们提出了FaithRewriter,一种用于T2I生成的新型提示增强框架。具体来说,FaithRewriter首先利用多模态MLLM从原始提示生成图像作为中间视觉线索。然后将该线索与提示结合,输入大规模LLM,生成视觉锚定的增强,更好地反映预期内容在图像中应如何呈现。最后,将这些增强蒸馏到小规模LLM中以便高效部署,增强其生成有效T2I提示的能力。实验表明,与强基线相比,FaithRewriter生成的提示更忠实于用户意图且视觉上更合理,有助于缩小意图-生成差距。

English abstract

Despite the impressive capabilities of text-to-image (T2I) models, an intent-generation gap often persists due to the brevity and ambiguity of user prompts. Existing approaches primarily polish the prompt for fluency and readability. However, the enhancement process still lacks visual grounding. As a result, the rewriter may over-infer missing details, which itself widens the intent-generation gap. To address this limitation, we propose FaithRewriter, a novel prompt-enhancement framework for T2I generation. Specifically, FaithRewriter first leverages a multimodal LLM (MLLM) to generate an image from the original prompt as an intermediate visual cue. This cue is then combined with the prompt and fed into a large-scale LLM to produce visually grounded augmentations that better reflect how the intended content should appear in images. Finally, these augmentations are distilled into a small-scale LLM for efficient deployment, enhancing its ability to generate effective T2I prompts. Experiments show that FaithRewriter yields prompts that are more faithful to the user intent and more visually plausible than strong baselines, helping narrow the intent-generation gap.

2606.08486 2026-06-09 cs.CL New submission

TRADE: Transducer-Augmented Decoder for Speech LLM

TRADE: 换能器增强的语音大语言模型解码器

Yun Tang, Shanil Puri, Shinji Watanabe, Subhabrata Mukherjee

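Affiliations: Hippocratic AI; Carnegie Mellon University

AI summary: Proposes TRADE, which augments a multimodal LLM with a transducer branch to combine frame-synchronous alignment with linguistic reasoning, supports both streaming and offline decoding, and achieves low word error rates on several benchmarks.

Details
AI Chinese abstract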

语音大语言模型(Speech LLMs)缺乏原则性的流式推理机制:其标签同步生成没有声学帧对齐,使得实时解码和话语结束检测变得困难。我们提出TRADE(换能器增强解码器),它通过一个换能器分支增强多模态大语言模型,该分支共享音频编码器,并直接使用LLM的隐藏状态作为预测网络——将帧同步声学对齐与LLM的语言推理相结合。三个设计选择使系统准确、可流式处理且支持长语音:(1) 紧密耦合的双词汇表——从LLM词汇表导出的紧凑换能器词汇表,实现零成本分数融合;(2) 带梯度停止的块同步流式训练,消除训练-推理不匹配,内存成本与离线相当;(3) 局部解码器音频注意力(LDAA),一种因果滑动窗口,独立于话语长度限制KV缓存内存。单个TRADE检查点支持在连续延迟操作点范围内的离线与流式解码。TRADE在Open ASR排行榜上达到6.71%的平均词错误率,而同一检查点使用960ms块大小的流式识别达到8.40%。在长语音上,无需外部分割,在TED-LIUM上获得3.64%的词错误率,在Earnings-22上获得10.88%。TRADE提供句末标点时间戳,与声学语音活动检测(VAD)结合时,相比单独使用声学VAD,话语结束检测的F1值提高0.03。

English abstract

Speech Large Language Models (Speech LLMs) lack a principled mechanism for streaming inference: their label-synchronous generation has no acoustic-frame alignment, making real-time decoding and end-of-utterance detection difficult. We propose TRADE (TRansducer-Augmented DEcoder), which augments a multimodal LLM with a transducer branch that shares the audio encoder and uses the LLM's hidden states directly as the prediction network -- coupling frame-synchronous acoustic alignment with the LLM's linguistic reasoning. Three design choices make the system accurate, streamable, and long-form capable: (1) Tightly coupled dual vocabularies -- a compact transducer vocabulary derived from the LLM vocabulary, enabling zero-cost score fusion; (2) Chunk-synchronized streaming training with gradient stopping, eliminating the train-inference mismatch at offline-equivalent memory cost; and (3) Localized Decoder Audio Attention (LDAA), a causal sliding window that caps KV-cache memory independently of utterance length. A single TRADE checkpoint supports offline and streaming decoding across a continuous range of latency operating points. TRADE achieves 6.71% average WER on the Open ASR Leaderboard, while streaming recognition with a 960 ms chunk size reaches 8.40% from the same checkpoint. On long-form speech, it obtains 3.64% WER on TED-LIUM and 10.88% on Earnings-22 without external segmentation. TRADE provides sentence-end punctuation timestamps that, when combined with acoustic voice activity detection (VAD), improve end-of-utterance detection by +0.03 $F_1$ over acoustic VAD alone.
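The memory property claimed for LDAA, a cache whose size depends on the window rather than the utterance length, can be illustrated with a fixed-length queue. This is a generic sliding-window cache sketch, not the TRADE implementation; the class name and the use of plain Python objects in place of key/value tensors are assumptions.

```python
from collections import deque

class SlidingKVCache:
    """Keeps only the most recent `window` audio frames' keys and values."""

    def __init__(self, window):
        # deque(maxlen=...) silently evicts the oldest entry when full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

# Stream 1000 frames through a window of 8: memory stays at 8 entries.
cache = SlidingKVCache(window=8)
for t in range(1000):
    cache.append(t, 2 * t)
```

Because the decoder only ever attends to the retained frames, attention is causal by construction and the cache footprint is the same for a ten-second clip and an hour-long recording.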

2606.08484 2026-06-09 cs.LG cs.AI New submission

STELLAR: Spatio-Temporal Environmental Learning with Latent Alignment and Refinement for Long-Tailed Species Distribution Modeling

STELLAR: 面向长尾物种分布建模的时空环境学习与潜在对齐精炼

Shufeng Kong, Tao Yu, Yuanyuan Wei, Caihua Liu, Junwen Bai, Yingheng Wang, Marc Grimson, Daniel Fink, Carla P. Gomes

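Affiliations: Sun Yat-sen University; Cornell University; Foshan University; Cornell Lab of Ornithology

AI summary: Proposes the STELLAR framework, which jointly optimizes dynamic habitat context and community structure through a graph-temporal encoder, context-anchored latent alignment, and an imbalance-aware decoding module, addressing spatio-temporal coupling and long-tail imbalance in species distribution modeling.

Details
Comments
Accepted by IJCAI 2026
AI Chinese abstract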

联合物种分布建模(JSDM)是生物多样性监测和保护规划的关键工具。然而,准确的JSDM面临两个耦合挑战:环境驱动因素和物种分布本质上是时空的,而物种共现模式表现出复杂的非线性群落结构以及由稀有物种导致的严重长尾不平衡。现有方法通常孤立地处理这些因素,从静态协变量中学习或忽略动态群落结构的历史轨迹。为克服这些限制,我们提出STELLAR(时空环境学习与潜在对齐精炼),一种新颖的框架,学习一个共享潜在空间,其中动态栖息地上下文和群落结构被联合优化。我们的方法整合了三个互补组件:(1)图-时间编码器,采用图注意力和循环单元来聚合空间邻域效应并捕捉环境上下文和群落结构的共同演化历史动态;(2)上下文锚定潜在对齐机制,利用标签激活的混合先验和监督对比学习结构化潜在空间,基于共享环境偏好主动聚类物种;(3)不平衡感知解耦解码模块,利用非对称损失聚焦于困难稀有物种样本的学习,防止长尾中的模式崩溃。在领域专家精心整理的大规模eBird数据集上的实验表明,我们的框架显著优于最先进的基线,特别是在预测稀有物种和揭示可解释的物种相互作用方面。

English abstract

Joint Species Distribution Modeling (JSDM) is a key enabler for biodiversity monitoring and conservation planning. However, accurate JSDM faces two coupled challenges: environmental drivers and species distributions are inherently spatio-temporal, while species co-occurrence patterns exhibit complex non-linear community structure and severe long-tail imbalance driven by rare species. Existing approaches often address these factors in isolation, learning from static covariates or neglecting the historical trajectories of dynamic community structure. To overcome these limitations, we propose STELLAR (Spatio-Temporal Environmental Learning with Latent Alignment and Refinement), a novel framework that learns a shared latent space where dynamic habitat context and community structure are optimized jointly. Our approach integrates three complementary components: (1) a Graph-Temporal Encoder that employs graph attention and recurrent units to aggregate spatial neighborhood effects and capture the co-evolving historical dynamics of environmental context and community structure; (2) a Context-Anchored Latent Alignment mechanism that structures the latent space using a label-activated mixture prior and supervised contrastive learning, actively clustering species based on shared environmental preferences; and (3) an Imbalance-Aware Decoupled Decoding module that utilizes Asymmetric Loss to focus learning on hard, rare species samples, preventing mode collapse in the long tail. Experiments on the large-scale eBird dataset, curated with domain experts, demonstrate that our framework significantly outperforms state-of-the-art baselines, particularly in predicting rare species and revealing interpretable species interactions.
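The abstract mentions "Asymmetric Loss" without stating its form. Assuming it refers to the commonly used asymmetric loss for multi-label classification (separate focusing exponents for positives and negatives, plus a probability margin that zeroes out easy negatives), a per-label version looks like the sketch below. The default hyperparameters are typical published values, not ones reported for STELLAR.

```python
import math

def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Per-label asymmetric loss for predicted probability p and target y in {0, 1}.

    Positives: -(1 - p)^gamma_pos * log(p)
    Negatives: -(p_m)^gamma_neg * log(1 - p_m), with p_m = max(p - margin, 0)
    """
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    p_m = max(p - margin, 0.0)  # shifted probability: very easy negatives cost nothing
    return -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
```

With many absent species per site, negatives dominate the label matrix; the larger negative exponent and the margin down-weight them so that the rare positive labels contribute a meaningful share of the gradient.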

2606.08483 2026-06-09 cs.AI New submission

Testing the Black Box: Structural Barriers to Independent Evaluation of Consumer-Facing Health LLMs

测试黑箱:面向消费者的健康大语言模型独立评估的结构性障碍

Rahul Gorijavolu, Kaushik Madapati, Pritika Vig, Rawan Abulibdeh, Nikhil Jaiswal, Mahri Kadyrova, Zeamanuel Hailu Tesfaye, Charles Senteio, Paula Maurutto, Leo Anthony Celi

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Johns Hopkins University(约翰霍普金斯大学) University of California, Berkeley(加州大学伯克利分校) Toronto General Hospital, University Health Network(多伦多综合医院,大学健康网络) McGill University(麦吉尔大学) University of Toronto(多伦多大学) Independent Researcher(独立研究者) Rutgers University(罗格斯大学) Beth Israel Deaconess Medical Center(贝斯以色列女执事医疗中心) Harvard T.H. Chan School of Public Health(哈佛大学陈曾熙公共卫生学院)

AI总结 本研究通过模拟用户档案,测试面向消费者的健康大语言模型在响应变异和谄媚行为方面的表现,发现五大结构性障碍阻碍独立评估。

Comments
6 pages, 1 figure. Preprint submitted for review
English abstract

Background: Consumer-facing large language models are now a common source of health information, and they interpret and personalize responses rather than retrieve them. Whether their responses vary across users is a clinical, equity, and governance question, sharpened by evidence that sycophantic responses can alter judgment and increase trust.
Objective: To evaluate response variation and sycophancy in consumer-facing health LLMs under conditions resembling ordinary patient use.
Methods: We constructed simulated user profiles differing in geography, browsing context, expressed beliefs, and social determinants of health, drawing on literature linking social context to health attitudes. We adapted validated instruments, including the Vaccination Attitudes Examination scale and reproductive attitudes scales, into multi-turn prompts designed to elicit clinically meaningful variation across users.
Results: The evaluation encountered five linked barriers. Factual prompts produced stable responses that masked sycophancy emerging over multi-turn conversation. Browser-based interfaces did not disclose which signals influence outputs and could not be reset to a clean baseline. Large-scale testing was restricted by terms of service, rate limits, and bot detection. Accuracy-based criteria could not capture tone, framing, or omission, and LLM-as-judge methods risked shared alignment bias. Models changed without traceable version identifiers, preventing reliable replication.
Conclusions: No reliable independent evaluation framework yet exists for examining how consumer-facing health LLMs behave in ordinary use. Oversight requires disclosure of personalization signals, stable version identifiers, researcher safe harbor programs, and post-deployment monitoring of health-related outputs.

2606.08481 2026-06-09 cs.LG cs.AI cs.DB cs.SE New submission

PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems

Suraj Ranganath, Anish Raghavendra

Affiliations: Halıcıoğlu School of Data Science and Computing, University of California, San Diego; Independent Researcher

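AI summary: Proposes the PIPE-Cypher pipeline, which uses a local LLM to automatically generate balanced NL-to-Cypher benchmarks from enterprise property graphs; steps such as schema profiling, reverse-query grounding, constrained generation, and execution validation make benchmark construction repeatable.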

English abstract

Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark therefore reflects the questions users and agents actually ask of that graph. Creating such a benchmark is difficult because schemas and values are unique, and graph structure changes over time. Each NL-query pair must also be executable, use real graph entities, preserve diversity, and remain balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph and optional seed queries from customer questions, analyst logs, or agent tool calls into balanced NL-to-Cypher benchmarks. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior with human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. Together, PIPE-Cypher makes Text2Cypher benchmarking a repeatable process that evolves with the graph, its users, and its target workloads.
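
The abstract requires the exported benchmark to stay balanced across query types and difficulty levels but does not say how this is enforced. One simple way to get there, shown as a sketch and not as PIPE-Cypher's actual procedure, is to bucket validated candidates by (query type, difficulty) and keep at most a fixed quota per bucket. The field names `query_type` and `difficulty`, the function `select_balanced`, and the sample records are all hypothetical.

```python
from collections import defaultdict

def select_balanced(candidates, per_bucket):
    """Keep at most `per_bucket` candidates per (query_type, difficulty) pair.

    `candidates` is a list of dicts; the keys used here are hypothetical.
    Input order is preserved inside each bucket.
    """
    buckets = defaultdict(list)
    for cand in candidates:
        key = (cand["query_type"], cand["difficulty"])
        if len(buckets[key]) < per_bucket:
            buckets[key].append(cand)
    # Flatten in sorted key order so repeated runs give the same export.
    return [c for key in sorted(buckets) for c in buckets[key]]

# Made-up candidates, standing in for pairs that already passed validation.
candidates = [
    {"nl": "How many accounts exist?", "query_type": "aggregation", "difficulty": "easy"},
    {"nl": "Count all loans.", "query_type": "aggregation", "difficulty": "easy"},
    {"nl": "Count all persons.", "query_type": "aggregation", "difficulty": "easy"},
    {"nl": "Who sent money to this account?", "query_type": "traversal", "difficulty": "hard"},
]
selected = select_balanced(candidates, per_bucket=2)
```

Here the third easy aggregation question is dropped because its bucket is already full, while the single hard traversal question is kept.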

2606.08480 2026-06-09 cs.LG cs.AI cs.IR New submission

Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation

Kewei Xu, Junbo Qi, Yanyan Zou, Pengfei Zhang, Xingzhi Yao, Shengjie Li

Affiliations: JD.com; Waseda University; University of Electronic Science and Technology of China

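AI summary: To address reward-model noise caused by exposure bias in generative recommendation, proposes the AdaGRPO framework, which uses policy-difficulty and reward-discriminability diagnostics to decide per sample whether to apply GRPO on top of supervised training, improving HR@10 and limiting hallucination on an e-commerce dataset.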

English abstract

Reinforcement learning (RL) presents a promising avenue for enhancing generative recommendation beyond supervised imitation, leveraging reward signals to guide policy improvement. However, its efficacy is critically contingent on the trustworthiness of the reward model for the samples it evaluates. In practice, production rankers, the widely adopted reward models, are trained on exposure-biased logs, leading to sample-dependent inaccuracies that violate this assumption. Our stratified analysis uncovers a consistent pattern: reward guidance is most beneficial when the policy exhibits uncertainty and the ranker can effectively discriminate the ground-truth item from rollout negatives. On other samples, the reward signal is either negligible or detrimental, highlighting the risk of uniform RL application. To address this issue, we introduce AdaGRPO, a novel framework that treats reward-guided optimization as selective admission rather than uniform pressure. Training is anchored in supervised negative log-likelihood, while the GRPO objective is gated by a binary, per-sample clip determined by two rollout diagnostics: policy-side difficulty and reward discriminability. Instances failing either diagnostic default to pure supervision, ensuring stability and mitigating the amplification of noisy gradients. We validate AdaGRPO on a large-scale e-commerce dataset. At the best intermediate checkpoint, it elevates HR@10 from 11.01% to 12.18% while constraining hallucination below 0.22%, and maintains robustness at the final checkpoint (HR@10 11.63%, hallucination 0.27%), outperforming fixed NLL-GRPO mixtures across the retrieval-validity frontier. In production A/B tests, AdaGRPO achieves statistically significant gains in click-through rate and dwell time, confirming its practical utility.
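
The gating rule described in the abstract can be sketched per sample as follows. The abstract says only that the gate is binary and depends on two rollout diagnostics, so everything concrete here is an assumption: the hit rate of the ground-truth item among rollouts stands in for policy-side difficulty, the ranker's score gap between the ground-truth item and the best rollout negative stands in for reward discriminability, and the thresholds, the unit weight on the GRPO term, and all function names are hypothetical.

```python
def grpo_gate(rollout_hit_rate, reward_true, rewards_negative,
              max_hit_rate=0.5, min_margin=0.1):
    """Return 1 if the sample is admitted to the GRPO term, else 0.

    rollout_hit_rate : fraction of rollouts that produced the ground-truth
                       item (assumed proxy for policy-side difficulty)
    reward_true      : ranker score of the ground-truth item
    rewards_negative : ranker scores of the rollout negatives
    Thresholds are illustrative, not values from the paper.
    """
    is_difficult = rollout_hit_rate <= max_hit_rate
    if rewards_negative:
        is_discriminable = reward_true - max(rewards_negative) >= min_margin
    else:
        is_discriminable = False  # nothing to compare against
    return int(is_difficult and is_discriminable)

def sample_loss(nll, grpo, gate):
    """Supervised NLL is always applied; the GRPO term only when gate == 1."""
    return nll + gate * grpo
```

A sample that fails either diagnostic gets `gate == 0` and so falls back to the NLL term alone, which matches the abstract's statement that such instances default to pure supervision.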

2606.08477 2026-06-09 cs.AI New submission

A Variability-Based Framework for Interpretable Naming in Formal and Relational Concept Analysis

Alain Gutierrez, Marianne Huchard, Pierre Martin, André Miralles, Violaine Prince

Affiliations: LIRMM, Univ. Montpellier, CNRS; CIRAD, UPR AIDA; AIDA, CIRAD, Univ. Montpellier; INRAE - UMR TETIS - Territoires, Environnement

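AI summary: To address the lack of interpretable concept names in Formal Concept Analysis and Relational Concept Analysis, proposes a variability-based framework for LLM-assisted naming that controls which information sources are exposed when generating readable names, illustrated as a proof of concept on a pizzeria dataset.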

English abstract

Knowledge extraction from symbolic data often produces abstractions that are formally defined but not immediately interpretable by users. Formal Concept Analysis (FCA) and Relational Concept Analysis (RCA) provide representative settings for this issue: they generate explicit conceptual structures, implications, and relational dependencies from object descriptions and relations. Although these structures are explainable by design, their concepts are often identified by technical labels, which limits their use as human-interpretable knowledge units. Assigning meaningful names to such concepts is therefore a key issue for interpretation, navigation, validation, and reuse by domain experts. This paper investigates concept naming in FCA and RCA from a symbolic knowledge representation perspective. We first characterize the linguistic and terminological challenges involved in naming generated symbolic abstractions, including ambiguity, discrimination, concision, and consistency across related concepts. We then propose a configurable framework for LLM-assisted concept naming. The framework relies on a variability model that controls which sources of information are exposed during naming, such as intent, extent, inherited information, neighboring concepts, implications, and relational attributes. It thereby makes explicit the semantic choices involved in moving from formal concept descriptions to human-readable names. The approach is illustrated as a proof of concept on a small relational dataset in the pizzeria domain. This illustration shows how different configurations influence the names suggested by an LLM, and how naming variability can reveal interpretation choices, relational dependencies, and possible modeling issues in the underlying symbolic data.
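
The variability model can be pictured as a set of switches, one per information source, that decide what the LLM sees when it is asked to name a concept. The sketch below is only a guess at how such a configuration might be encoded. The class `NamingConfig`, the function `build_prompt`, the concept dictionary layout, and the pizzeria values are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class NamingConfig:
    """One boolean per information source listed in the abstract."""
    intent: bool = True
    extent: bool = False
    inherited: bool = False
    neighbors: bool = False
    implications: bool = False
    relational_attributes: bool = False

def build_prompt(concept, config):
    """Assemble a naming prompt from only the sources the config enables."""
    lines = ["Suggest a short, distinctive name for this concept."]
    for f in fields(config):
        # Skip sources that are switched off or absent for this concept.
        if getattr(config, f.name) and concept.get(f.name):
            values = ", ".join(concept[f.name])
            lines.append(f"{f.name}: {values}")
    return "\n".join(lines)

# Made-up concept from a toy pizzeria context.
concept = {
    "intent": ["has_tomato", "has_mozzarella"],
    "extent": ["Margherita", "Regina"],
    "relational_attributes": ["served_by some Pizzeria"],
}
prompt_a = build_prompt(concept, NamingConfig())
prompt_b = build_prompt(concept, NamingConfig(extent=True, relational_attributes=True))
```

Comparing the names an LLM returns for `prompt_a` and `prompt_b` is one way to see how exposing the extent or the relational attributes changes the suggestion.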