arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.10725 2026-06-11 cs.LG cs.CL Version update

Pre-AF 13: An Interpretable Atrial Fibrillation Risk Score Mined from Discharge Reports

Olga Shakhmatova, Dmitrii Kriukov, Daniil Larionov, Nikita Khromov, Iaroslav Bespalov, Alexander Zolotarev, Kirill Grishchenkov, Ekaterina Ivanova, Miron Kuznetsov, Ilya Sochenkov, Elizaveta Panchenko, Artem Shelmanov, Dmitry V. Dylov

Affiliations: National Medical Research Center of Cardiology named after Academician E.I. Chazov; Skolkovo Institute of Science and Technology (Skoltech); Artificial Intelligence Research Institute (AIRI); University of Mannheim; Russian Center for Scientific Information (RCSI); Institute of Cyber Intelligence Systems, National Research Nuclear University MEPhI; M.V. Lomonosov Moscow State University; Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute); Ivannikov Institute for System Programming of the Russian Academy of Sciences (ISP RAS); Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences (FRC CSC RAS); Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)

AI summary: Uses NLP to extract features from discharge reports and builds interpretable ML models to predict atrial fibrillation risk in patients with cardiovascular disease; the Pre-AF 13 model outperforms existing clinical scores.

Comments
O. Shakhmatova and D. Kriukov contributed equally (co-first authors). E. Panchenko, A. Shelmanov, and D. V. Dylov are co-senior authors. Correspondence to: Olga Shakhmatova and Dmitry V. Dylov.

Abstract

Background. Atrial fibrillation (AF) is the most prevalent cardiac arrhythmia and a major determinant of prognosis. Established AF risk scores rely on factors (older age, hypertension) nearly ubiquitous among patients with cardiovascular disease (CVD), offering limited stratification in this high-risk group. Most target long-term (5-10 year) rather than medium-term prediction. We developed interpretable ML models predicting AF risk over a 24-month and entire follow-up horizon in CVD patients using routinely collected hospital data. Methods. Single-center retrospective study of electronic health records from the National Research Cardiology Center (Russia) for patients aged >=18 with CVD but without pre-existing AF, hospitalized more than once between January 2012 and May 2019. A custom NLP pipeline transformed unstructured discharge reports into 73 structured features, combining a rule-based parser with transformer-based NER. Using LightAutoML we built a full model (73 features), a simple model (reduced subset), and a linear model for a bedside risk score. Performance was assessed by ROC AUC, compared with CHARGE-AF, C2HEST, MHS, and HAVOC, and interpreted via SHAP. Results. Of 80,576 records from 45,000 patients, 17,562 met inclusion criteria; 1,438 (8.19%) developed AF. The full model reached ROC AUC 0.735 (24-month) and 0.696 (entire follow-up); the simple model was nearly identical (0.725, 0.696). All non-linear models outperformed the four clinical risk scores (ROC AUC 0.53-0.64). The simple model uses 13 features and is named Pre-AF 13. SHAP identified age and left atrial volume as dominant predictors. A linear risk score (Pre-AF 9) stratified observed 24-month AF incidence from ~7% to 36%. Conclusion. Interpretable ML models built from routinely collected EHR data identify high-AF-risk CVD patients, outperforming established clinical risk scores.
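
The ROC AUC values in this abstract can be read as the probability that a randomly chosen patient who developed AF receives a higher risk score than one who did not. The sketch below computes that pairwise quantity on made-up labels and scores (not data from the paper) and re-derives the reported incidence from the two counts given in the abstract. The function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
auc = roc_auc(labels, scores)

# Incidence from the counts stated in the abstract: 1,438 of 17,562 patients.
incidence_percent = 100 * 1438 / 17562
```

The quadratic loop is fine for a toy example; a rank-based formulation would be the usual choice for cohorts of this size.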

2606.10639 2026-06-11 cs.RO Version update

Planar-Sector LOS Guidance for Interception of Agile Targets with Lifting-Wing Quadcopters

Linkai Liu, Kun Yang, Han Zou, Chen Min, Shuli Lv, Shuai Wang, Quan Quan

Affiliations: School of Automation Science and Electrical Engineering, Beihang University; Research and Development Department, China Academy of Launch Vehicle Technology

AI summary: Proposes a Planar-Sector Line-of-Sight (PS-LOS) guidance framework whose asymmetric constraint frees up maneuverability, enabling a lifting-wing quadcopter with only a monocular camera to autonomously intercept agile targets at long range; experiments verify successful interceptions at ranges up to 138 m.

Comments
Accepted to the IEEE International Conference on Robotics and Automation (ICRA 2026). Recipient of the ICRA 2026 Best Paper Award in Field and Service Robotics.

Abstract

Autonomous visual interception of agile aerial targets is challenging due to unpredictable target motion, limited sensing, and the strong coupling between target visibility and interceptor maneuverability. Most existing strapdown-camera interception methods preserve visibility using conic line-of-sight (LOS) constraints that keep the target near the image center. While safe, such symmetric constraints unnecessarily restrict maneuverability and can significantly reduce the usable thrust for pursuit. Motivated by the observation that aggressive FPV pilots do not maintain equal visibility margins in all image directions, this paper proposes a Planar-Sector Line-of-Sight (PS-LOS) guidance framework for autonomous interception using a lifting-wing quadcopter equipped with only a strapdown monocular camera. PS-LOS tightly constrains lateral image error while relaxing longitudinal image error within a safe field-of-view margin, preserving visibility while releasing maneuverability for acceleration-intensive pursuit. Under the lifting-wing quadcopter model, PS-LOS provides nearly 50% more available thrust near the LOS direction than conventional conic LOS constraints. To realize LOS-only interception without direct depth measurements, a delay-compensated state-estimation framework and a nonlinear guidance-and-control architecture are developed for lifting-wing quadcopters. Extensive outdoor flight experiments demonstrate autonomous interception of agile targets exhibiting large-amplitude, high-frequency, and unpredictable motion under real wind disturbances. The proposed system achieves successful interceptions at ranges up to 138 m while maintaining continuous visual tracking throughout the engagement. The results validate PS-LOS as a visibility-preserving, maneuverability-aware guidance framework for long-range visual interception of agile aerial targets.
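
The contrast between a symmetric (conic) visibility constraint and an asymmetric planar-sector one can be illustrated on normalized image errors. This is only a sketch: the axis convention (`ex` lateral, `ey` longitudinal), the threshold values, and both function names are assumptions for illustration, not the paper's formulation.

```python
import math


def conic_ok(ex, ey, radius):
    """Symmetric constraint: target must stay within a circle around the image center."""
    return math.hypot(ex, ey) <= radius


def planar_sector_ok(ex, ey, lateral_tol, longitudinal_margin):
    """Asymmetric constraint: tight lateral bound, looser longitudinal bound."""
    return abs(ex) <= lateral_tol and abs(ey) <= longitudinal_margin


# A target that is laterally centered but far off-center longitudinally.
ex, ey = 0.05, 0.6
conic_result = conic_ok(ex, ey, radius=0.3)
sector_result = planar_sector_ok(ex, ey, lateral_tol=0.1, longitudinal_margin=0.8)
```

With these assumed thresholds the conic check rejects the configuration while the planar-sector check accepts it, which is the kind of extra freedom the abstract attributes to relaxing the longitudinal error.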

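2606.10401 2026-06-11 cs.CV Version update

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

Yiming Zhang, Ruoxuan Cao, Zhihang Zhong

Affiliations: Shanghai Jiao Tong University; Cornell University

AI summary: Proposes a plug-and-play multi-agent framework that collaboratively builds structured cognitive maps as spatial memory, improving the spatial understanding of pretrained multimodal large language models without architectural changes or additional training.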

Abstract

Spatial intelligence is a key frontier for multimodal large language models (MLLMs), enabling them to reason about the physical world from visual experience. Inspired by human spatial cognition, recent approaches construct grid-based cognitive maps from multi-frame visual inputs to maintain coherent spatial representations over time. However, limited context lengths still challenge spatial understanding, while existing methods, such as long-context modeling and external memory, often require architectural changes, memory modules, or finetuning, limiting their applicability to off-the-shelf pretrained MLLMs. This motivates a lightweight, model-agnostic method for preserving spatial information beyond the native context window. To this end, we propose a plug-and-play multi-agent framework that collaboratively constructs cognitive maps as structured spatial memory, enhancing the spatial understanding of arbitrary pretrained MLLMs without architectural modification or additional training. Our framework features local-global agent coordination, cognitive map construction with atomic commits, and cross-agent verification. Extensive experiments demonstrate that our method achieves superior performance on spatial understanding tasks while remaining fully training-free. Code will be released.
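
As a rough illustration of what "cognitive map construction with atomic commits" could look like, the sketch below keeps a grid map and applies a batch of cell updates either in full or not at all. The class, its conflict rule, and the cell labels are hypothetical; the abstract does not specify the paper's actual data structure or commit protocol.

```python
class CognitiveMap:
    """Grid-based map where each cell holds at most one object label."""

    def __init__(self, size):
        self.size = size
        self.cells = {}  # (row, col) -> label

    def commit(self, updates):
        """Apply every update or none of them.

        `updates` maps (row, col) to a label. The batch is rejected if any
        cell is out of bounds or already holds a different label.
        """
        for (row, col), label in updates.items():
            if not (0 <= row < self.size and 0 <= col < self.size):
                return False
            existing = self.cells.get((row, col))
            if existing is not None and existing != label:
                return False  # conflicting observation: reject the whole batch
        self.cells.update(updates)
        return True


grid = CognitiveMap(size=10)
first = grid.commit({(1, 2): "chair", (3, 4): "table"})
second = grid.commit({(5, 5): "lamp", (1, 2): "sofa"})  # conflicts on (1, 2)
```

Validating the whole batch before writing is what keeps a partially applied update from leaving the shared map in an inconsistent state.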

2606.09426 2026-06-11 cs.AI Version update

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu, Yifan Yang, Dongsheng Li, Caihua Shan

Affiliations: Zhejiang University; Microsoft Research Asia; Tsinghua University

AI summary: Introduces the WeaveBench benchmark of 114 long-horizon hybrid-interface tasks across 8 real-world work domains, requiring agents to combine GUI and CLI/code operations; the best PassRate is only 41.2%, exposing shortcomings in existing evaluations.

Abstract

Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations/actions with CLI/code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed to measure whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.

2606.09365 2026-06-11 cs.AI cs.CL Version update

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin, Fanrui Zhang, Kaitao Chen, Xingqi He, Yichen Li, Mianxin Liu, Lei Liu, Yankai Jiang

Affiliations: Fudan University; Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute; Huazhong University of Science and Technology

AI summary: Proposes the SkeMex framework, which lets medical agents self-evolve after deployment through skill memory without updating model weights, and outperforms existing memory-based agents on clinical tasks.

Abstract

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop "Read-Write-Assess-Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.
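
A read, write, assess, and govern loop over a skill store might be sketched as follows. Everything here is an assumption made for illustration: the class name, the exponential-moving-average utility update, the pruning threshold, and the fact that utility is tracked per skill rather than per context (the abstract describes context-dependent utility, which this toy version does not model).

```python
class SkillMemory:
    """Toy skill store with a scalar utility per skill."""

    def __init__(self, learning_rate=0.5, prune_below=-0.5):
        self.learning_rate = learning_rate
        self.prune_below = prune_below
        self.utility = {}  # skill name -> running utility estimate

    def write(self, name):
        """Add a new skill with neutral utility (no-op if it already exists)."""
        self.utility.setdefault(name, 0.0)

    def assess(self, name, feedback):
        """Move the skill's utility toward the feedback, assumed to lie in [-1, 1]."""
        current = self.utility[name]
        self.utility[name] = current + self.learning_rate * (feedback - current)

    def govern(self):
        """Remove skills whose utility fell below the pruning threshold."""
        for name in [n for n, u in self.utility.items() if u < self.prune_below]:
            del self.utility[name]

    def read(self, k):
        """Return the names of the k highest-utility skills."""
        return sorted(self.utility, key=self.utility.get, reverse=True)[:k]


memory = SkillMemory()
memory.write("check-contraindications")
memory.write("guess-dosage")
memory.assess("check-contraindications", 1.0)
memory.assess("guess-dosage", -1.0)
memory.assess("guess-dosage", -1.0)
memory.govern()
top = memory.read(1)
```

The skill names are invented; the point is only that retrieval and pruning both read from the same feedback-driven utility estimate.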

2606.09337 2026-06-11 cs.RO Version update

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

Huaihang Zheng, Yi Yang, Kai Ma, Shenglin Xu, Tian Xie, Guozheng Li, Xiangyu Wang, Yiren Ma, Si Liu, Yinian Mao, Baoxu Liu

Affiliations: Meituan; Beijing Institute of Technology; Beihang University; State Key Lab of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS; China University of Mining and Technology (Beijing)

AI summary: Proposes the TORL-VLA framework, which combines tactile feedback with online reinforcement learning: a tactile-derived wrench-aware VLA predicts reference actions and a lightweight online RL module refines them, addressing policy adaptation when contact conditions change and improving success rate and execution efficiency on long-horizon contact-rich tasks.

Abstract

Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent studies have introduced tactile or force feedback into VLAs to address contact-rich tasks. However, these models are typically deployed as offline policies. When contact conditions shift from the training distribution, the policy cannot perform online adaptation, leading to problems such as inappropriate contact forces and inefficient retries. Therefore, we propose TORL-VLA, a tactile-guided online reinforcement learning framework that couples tactile feedback with policy refinement for contact-rich manipulation. Our method introduces a tactile-derived wrench-aware VLA to predict reference actions and future wrench sequences, while a lightweight online RL module is used to refine the reference actions. To stabilize learning from mixed exploratory policy-generated and human-intervention data, we introduce an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention. Real-robot experiments on long-horizon contact-rich tasks, including latch manipulation, coffee-cup placement, and egg handling, show that TORL-VLA improves success rates at both subtask and full-task levels, as well as time-bounded execution efficiency over strong baselines.

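2606.09289 2026-06-11 cs.LG Version update

Intention Driven Identification of In-Possession Match Phases in Association Football through Temporal Graph Learning

Yuesen Li, Daniel Link

Affiliations: Technical University of Munich

AI summary: Proposes a framework based on a Temporal Graph Attention Network (T-GAN) that identifies in-possession phases of football matches from spatiotemporal tracking data, classifying three tactical intentions (Invade Opponent Space, Keep Possession, Scoring) and six phases, with F1 scores of 0.87 (intention level) and 0.79 (scoring phases).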

Comments
27 pages, 10 figures

Abstract

Understanding tactical organisation of association football, hereafter referred to as football, requires identifying distinct match phases. Yet in-possession phases are rarely directly observable and are shaped by evolving tactical intentions, rather than spatial patterns alone. This study proposes a data-driven framework for identifying in-possession match phases from spatiotemporal tracking data. Seven German Bundesliga matches recorded at 25 Hz with TRACAB were analysed. A hierarchical phase model was defined with three tactical intentions (Invade Opponent Space, Keep Possession, Scoring) and six phases (Build Up, Progression, Counter Attack, Maintenance, Sustained Threat, Finishing). A Temporal Graph Attention Network (T-GAN) was developed to combine frame-level player-interaction graphs, contextual features, and Transformer-based temporal modelling. Performance was evaluated using frame-level F1 and a sequence-aware Intersection over Truth-Dominance (IoT-D) metric. T-GAN achieved macro-average frame-level F1 scores of 0.87 at the intention level, 0.76 for invasion-related phases, and 0.79 for scoring phases. At the sequence level, mean diagonal IoT-D F1 increased from 0.68 to 0.79 for intentions and from 0.61 to 0.71 for phases after post-processing, indicating improved temporal coherence. Model comparisons showed that sequence modelling was the main driver of segmentation quality, while graph-based relational modelling was particularly beneficial for Counter Attack recognition. Exploratory player attention analysis further suggested that wide and midfield positional groups contributed strongly to phase discrimination. Overall, the framework translates continuous tracking data into tactically interpretable in-possession phase representations, with potential applications in automated match annotation, tactical analysis, and playing-style profiling.
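
The frame-level macro-average F1 reported here averages per-class F1 scores with equal weight, so rare phases count as much as common ones. A minimal sketch on invented frame labels (not data from the study), with a hypothetical function name:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Four toy frames labelled with intention-level classes.
truth = ["keep", "keep", "invade", "score"]
pred = ["keep", "invade", "invade", "score"]
score = macro_f1(truth, pred)
```

Here "keep" and "invade" each score 2/3 and "score" scores 1, giving a macro average of 7/9.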

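2606.09287 2026-06-11 cs.LG Version update

Trajectory Geometry of Transformer Representations Across Layers

Vishal Pandey, Gopal Singh, Yacine Mahdid

Affiliations: MetriQual London, UK; Athens, GR

AI summary: By computing geometric metrics such as trajectory length and curvature, finds that semantically related prompts converge in middle-to-late layers, reasoning tasks produce greater curvature, and ambiguous tokens show trajectory bifurcation, and reveals a three-phase structure.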

Comments
18 pages, 9 figures

Abstract

Understanding how transformer representations evolve across layers, not merely what they encode, remains an open problem in mechanistic interpretability. We recast the transformer forward pass as a discrete population trajectory through a high-dimensional representation manifold, drawing on geometric tools from computational neuroscience. Rather than probing for pre-specified features, we characterize trajectory geometry using five metrics computed directly in the ambient space: trajectory length, curvature, a semantic convergence index, layerwise cosine similarity, and representational stability. Across three model families (GPT-2, TinyLlama, Qwen2.5) and five controlled prompt families, we report four findings. First, semantically related prompts converge significantly in middle-to-late layers (peak CI 0.41-0.58, p<0.001, Mann-Whitney U), consistent with attractor-like dynamics. Second, reasoning tasks produce trajectories of greater curvature than lexical variations (0.71-0.83 rad vs. 0.27-0.31 rad), suggesting curvature encodes computational complexity. Third, ambiguous tokens exhibit trajectory bifurcation with up to 5.6x representational separation by the final layer, absent in unambiguous controls. Fourth, layerwise cosine similarity reveals a universal three-phase structure: encoding, elaboration, and output preparation, consistent across all three architectures. All four effects vanish under shuffled-layer and random-embedding controls. We release a fully open-source, model-agnostic pipeline and argue that trajectory geometry constitutes a principled, probe-free lens for mechanistic interpretability.
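
Two of the five metrics, trajectory length and curvature, have a natural discrete form when the per-layer hidden states are treated as points on a path. The sketch below assumes length is the sum of step norms and curvature is the mean turning angle (in radians) between consecutive steps; the abstract does not give the paper's exact definitions, so both are assumptions, as are the function names.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def trajectory_length(states):
    """Sum of Euclidean distances between consecutive layer representations."""
    return sum(_norm(_sub(b, a)) for a, b in zip(states, states[1:]))


def mean_curvature(states):
    """Mean angle (radians) between consecutive displacement vectors."""
    steps = [_sub(b, a) for a, b in zip(states, states[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        nu, nv = _norm(u), _norm(v)
        if nu == 0 or nv == 0:
            continue  # skip degenerate steps where the representation did not move
        cosine = sum(x * y for x, y in zip(u, v)) / (nu * nv)
        angles.append(math.acos(max(-1.0, min(1.0, cosine))))
    return sum(angles) / len(angles) if angles else 0.0


# Toy 2-D "hidden states" for three layers: one right-angle turn.
states = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
length = trajectory_length(states)
curvature = mean_curvature(states)
```

In practice `states` would hold one vector per layer for a given token position, for example taken from a model's per-layer hidden-state outputs.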

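2606.09105 2026-06-11 cs.AI Version update

Graph2Idea: Retrieval-Augmented Scientific Idea Generation with Graph-Structured Contexts

Xu Li, Hanzhe Tu, Xun Han

Affiliations: Southwest Petroleum University; Sichuan Police College

AI summary: Proposes the Graph2Idea framework, which uses a knowledge graph to turn retrieved literature into structured triples, extracts graph-derived contexts, and applies a two-stage generation process to improve the novelty, quality, and feasibility of scientific ideas.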

Abstract

Generating novel, feasible, and high-quality research ideas is an important yet challenging task in scientific discovery. Recent Large Language Model (LLM)-based methods often ground idea generation with retrieved literature, but the retrieved evidence is usually provided as flat text, such as titles, abstracts, or summaries. Such flat contexts may contain redundant or weakly relevant information, while making cross-paper relations among problems, methods, mechanisms, and findings difficult to identify and trace. To address this challenge, we propose Graph2Idea, a knowledge graph-guided framework for retrieval-augmented scientific idea generation. Graph2Idea first retrieves papers according to the input topic, transforms them into structured knowledge triples, and dynamically constructs a target-centered knowledge graph to make literature relations explicit. It then extracts compact graph-derived contexts that retain target-relevant relational evidence while reducing noisy textual input. Based on these contexts, a two-stage generation process first identifies promising research directions and then guides the LLM to synthesize candidate ideas from graph-grounded evidence. Experiments on a scientific idea generation benchmark show that Graph2Idea outperforms representative baselines under the automatic evaluation protocol. Compared with the strongest baseline scores, it improves Novelty from 0.45 to 0.52, Quality from 0.24 to 0.29, and Feasibility from 0.22 to 0.28. These results suggest that graph-structured evidence helps LLMs generate research ideas through more explicit, compact, and traceable recombination of prior scientific knowledge.
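
One simple way to obtain a compact, target-centered context from knowledge triples is to keep only the triples whose endpoints lie within a few hops of the target node. The sketch below does this with a breadth-first search. The hop-based selection rule, the function name, and the example triples are all assumptions; the abstract does not say how Graph2Idea selects its graph-derived contexts.

```python
from collections import defaultdict, deque


def target_context(triples, target, max_hops):
    """Return the triples whose head and tail are both within max_hops of target."""
    neighbors = defaultdict(set)
    for head, _, tail in triples:
        neighbors[head].add(tail)
        neighbors[tail].add(head)  # treat edges as undirected for reachability

    distance = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    return [t for t in triples if t[0] in distance and t[2] in distance]


# Invented triples for illustration.
triples = [
    ("graph retrieval", "improves", "idea generation"),
    ("idea generation", "evaluated_by", "novelty score"),
    ("novelty score", "correlates_with", "citation count"),
    ("protein folding", "uses", "attention"),
]
context = target_context(triples, "graph retrieval", max_hops=2)
```

The unrelated triple and the one three hops away are dropped, which is the sort of noise reduction the abstract contrasts with flat text contexts.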

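2606.08956 2026-06-11 cs.LG Version update

From inverse problems to neural operators: prediction, mechanism, and generalization of data-driven models

Conor Rowan

Affiliations: University of Colorado Boulder

AI summary: Unifies data-driven modeling strategies such as inverse problems, sparse identification, neural ordinary differential equations, and neural operators from a philosophical perspective, noting that they differ only in the assumed model class of the input-output relation, and argues that only certain models can discover mechanisms and thereby generalize.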

Abstract

Scientists have historically relied on mathematical models based on differential equations to relate system inputs (forces, fluxes, or heat sources) to outputs such as displacement, velocity, concentration, and temperature. These models rely on deep domain knowledge to determine the form of the governing differential equation, which is then calibrated with data by solving an inverse problem. In recent years, the field of Scientific Machine Learning has introduced a variety of alternative modeling strategies for physical systems. A method called Sparse Identification of Nonlinear Dynamics learns the governing equation as a sparse linear combination of terms in a user-defined library. Neural Ordinary Differential Equations construct the governing equation by taking in the state and its derivatives at the input layer of a neural network. Entirely forgoing the modeling framework of differential equations, neural operators directly learn a non-linear mapping between the system inputs and outputs. From inverse problems to neural operators, all of these modeling strategies can be conceptualized as data-driven machinery to predict a system's response over a range of inputs. It is then natural to wonder how exactly these various strategies relate to each other, and whether they can be neatly taxonomized. Drawing from the philosophical literature on scientific models, we argue that many model types have a common structure, differing only in the assumed model class of the input-output relation they define. Connecting to philosophical ideas on mechanism, and arguing that data from physical systems arises from solutions to parsimonious differential equations, we propose that only certain models are capable of mechanism discovery, and thus generalization. Our analysis is intended to unite apparently disparate modeling strategies and provide insight into their appropriate use cases.
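
One hedged way to write down the shared structure this abstract describes: each strategy fits parameters of a map from inputs f to outputs u, and the strategies differ in what that map is allowed to be. The notation below (the model map, the candidate library Θ, the coefficient vector ξ, the differential operator N, and the learned operator G) is assumed for illustration and is not taken from the paper.

```latex
\begin{align*}
\text{Shared fitting problem:}\quad
  & \theta^\star = \arg\min_{\theta} \sum_{i} \big\| u_i - \mathcal{M}_\theta(f_i) \big\|^2 \\
\text{Inverse problem:}\quad
  & \mathcal{M}_\theta(f) \text{ solves a known equation } \mathcal{N}(u;\theta) = f \\
\text{Sparse identification:}\quad
  & \dot{x} = \Theta(x)\,\xi, \qquad \xi \text{ sparse} \\
\text{Neural ODE:}\quad
  & \dot{x} = \mathrm{NN}_\theta(x) \\
\text{Neural operator:}\quad
  & u = \mathcal{G}_\theta(f)
\end{align*}
```

Read this way, the model class shrinks from "any learned operator" at the bottom to "a known equation with a few unknown coefficients" at the top, which is the axis along which the abstract separates prediction from mechanism discovery.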

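2606.08744 2026-06-11 cs.CV Version update

MB-Loc: Multi-planar Bird's-eye-view Localization in outdoor LiDAR scenes

Ayaan Choudhury, Preet Savalia, Anirudh Pydah, Avinash Sharma

Affiliations: Indian Institute of Technology Jodhpur

AI summary: Proposes the MB-Loc framework, which projects LiDAR scans into a 2.5D multi-planar bird's-eye-view representation and combines a KL-regularized latent bottleneck with 3D spatial augmentations for lightweight, viewpoint-robust scene coordinate regression localization, reaching real-time inference on the NCLT dataset and outperforming existing methods.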

Abstract

Global LiDAR localization is a fundamental task for autonomous navigation systems. Recent methods perform Scene Coordinate Regression (SCR) and achieve superior accuracy over Absolute Pose Regression (APR) solutions by predicting dense 3D world coordinates. However, SCR approaches introduce two major bottlenecks: severe computational inefficiency from processing raw 3D geometries and significant performance degradation under varying sensor viewpoints. To address these limitations, we present MB-Loc, a lightweight and viewpoint-robust SCR framework. Instead of relying on heavy 3D convolutions, we project the input LiDAR scan into a 2.5D Multi-planar Bird's-Eye View (BEV) representation. By slicing the point-cloud along the Z-axis and mapping signed depths into discrete 2D planes, MB-Loc retains essential 3D geometric structures while exploiting the computational tractability of standard 2D CNNs. To handle the inherent sparsity of outdoor LiDAR, we introduce a KL-regularized latent bottleneck that explicitly models spatial uncertainty without injecting stochastic noise. Finally, to ensure rotation robustness, we apply 3D spatial augmentations prior to planar projection, forcing the network to implicitly learn viewpoint-invariant features. We perform extensive experiments on the publicly available NCLT dataset and demonstrate that our proposed method outperforms the current state-of-the-art. Operating at real-time inference speeds, MB-Loc significantly outperforms traditional 3D-SCR architectures in computational efficiency.
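
The multi-planar projection can be pictured as binning points by height and then rasterizing each height band into its own 2D grid. The sketch below is one possible reading of "slicing the point-cloud along the Z-axis and mapping signed depths into discrete 2D planes": it stores, per cell, the point's height relative to the center of its slice. The function name, the grid layout, the choice of signed value, and the last-point-wins rule are all assumptions rather than the paper's method.

```python
def multiplanar_bev(points, z_edges, cell, grid):
    """Rasterize (x, y, z) points into one grid-by-grid plane per Z slice.

    z_edges: sorted slice boundaries, giving len(z_edges) - 1 slices.
    cell:    side length of one grid cell, in the same units as x and y.
    grid:    number of cells per side; the grid is centered on the sensor.
    """
    n_slices = len(z_edges) - 1
    planes = [[[0.0] * grid for _ in range(grid)] for _ in range(n_slices)]
    half_extent = grid * cell / 2.0
    for x, y, z in points:
        col = int((x + half_extent) // cell)
        row = int((y + half_extent) // cell)
        if not (0 <= row < grid and 0 <= col < grid):
            continue  # point falls outside the rasterized area
        for k in range(n_slices):
            if z_edges[k] <= z < z_edges[k + 1]:
                center = 0.5 * (z_edges[k] + z_edges[k + 1])
                planes[k][row][col] = z - center  # last point in a cell wins
                break
    return planes


# Three toy points; the last one lies outside the 4 x 4 grid.
points = [(0.5, 0.5, 0.25), (-1.5, 0.5, 1.75), (100.0, 0.0, 0.0)]
planes = multiplanar_bev(points, z_edges=[0.0, 1.0, 2.0], cell=1.0, grid=4)
```

The resulting stack of planes can be treated as a multi-channel image, which is what lets a standard 2D CNN consume it.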

2606.08530 2026-06-11 cs.RO cs.AI Version update

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Yuan Zhang, Shiqi Zhang, Yedong Shen, Shuai Dong, Jiajun Deng, Xin Zhang, Yuxuan Gao, Jiajia Wu, Xin Nie, Zhiyuan Cheng, Jianmin Ji, Yanyong Zhang, Xingyi Zhang, Jia Pan

Affiliations: Anhui University; University of Science and Technology of China; iFLYTEK

AI summary: Proposes the GEAR-VLA framework, which learns unified geometry-aware action representations through coarse-to-fine action learning, semantic-aligned 3D integration, and embodiment canonicalization, enabling manipulation that generalizes across objects, backgrounds, and robots.

Abstract

Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.

2606.08415 2026-06-11 cs.CV cs.AI updated version

CoVEBench: Can Video Editing Models Handle Complex Instructions?

CoVEBench: 视频编辑模型能处理复杂指令吗?

Jiangtao Wu, Jiaming Wang, Yiwen He, Yuanxing Zhang, Shihao Li, Dunyuan Liu, Xuedong Zhao, Jialu Chen, Zekun Moore Wang, Jiaheng Liu

Affiliations: Nanjing University; Kuaishou Technology

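AI summary: Introduces the CoVEBench benchmark, with 416 source videos and 626 multi-point editing instructions; it uses MLLM judges to score instruction compliance and fidelity, and shows that current models often omit edits or violate preservation constraints in compositional editing.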

详情
Comments
34 pages, 11 figures, 9 tables
AI中文摘要

虽然近期基于文本引导的视频编辑模型在基础任务(如风格迁移、物体插入)上表现出色,但现实用户请求具有高度组合性。单个提示通常要求多个耦合编辑,例如同时修改主体、动作和相机视角,同时严格保留无关的时空内容。现有基准受限于孤立编辑和粗粒度全局指标,无法诊断模型如何处理此类复杂工作流。为弥补这一空白,我们引入CoVEBench,一个组合视频编辑基准,包含416个精心策划的源视频、626条多点编辑指令和9,990个细粒度检查项。CoVEBench覆盖多样化的编辑维度,通过MLLM评判的指令遵循度和视频保真度,以及视频质量的自动指标来评估模型。大量实验表明,组合编辑仍然是一个深层次的挑战:当前模型在处理多个操作同时进行时,经常遗漏编辑、违反保留约束或引入伪影。CoVEBench为推进视频编辑向现实用户工作流发展提供了一个具有挑战性的诊断测试平台。

英文摘要

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.

2606.08343 2026-06-11 cs.LG updated version

GENERIC-FNO: Embedding Energy Conservation and Entropy Production into Fourier Neural Operators

GENERIC-FNO:将能量守恒和熵产生嵌入傅里叶神经算子

Jason Sulskis, Sathya Ravi

Affiliations: University of Illinois at Chicago; Georgia Tech Research Institute

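AI summary: Proposes GENERIC-FNO, the first neural operator to embed the full GENERIC structure of nonequilibrium thermodynamics directly in function space; rank-one projections satisfy the degeneracy conditions exactly, giving energy conservation and entropy production, and the structural guarantees persist under super-resolution.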

详情
Comments
Under review at TMLR
AI中文摘要

我们引入了GENERIC-FNO,这是第一个将非平衡热力学的完整GENERIC(度量-辛)结构——可逆、能量守恒动力学和不可逆、熵产生动力学通过退化条件耦合——直接嵌入函数空间的神经算子。现有的保结构神经算子最多强制执行单一守恒律或可逆(哈密顿)结构,而热力学一致的学习仅限于有限维、图或粒子系统。GENERIC-FNO填补了这一空白:它将能量和熵泛函学习为神经算子,并将泊松和摩擦算子参数化为对角傅里叶乘子,夹在秩一投影之间,通过构造精确满足退化条件,无需惩罚项、更新投影或残差。退化恒等式对任何初始化、维度或分辨率都达到机器精度(残差~10^-13),因此连续时间动力学守恒学习的能量并精确产生熵;显式时间步进仅增加小的O(dt^2)漂移(每步残差~10^-6)。我们进一步指出,给定流的(E,S,L,M)分解并不唯一,并引入了一个规范不变的耗散诊断,独立于学习的泛函分离可逆和耗散动力学。在三个算子主干(1D/2D FNO和DeepONet)和四个涵盖可逆、耗散和混合机制的PDE上,GENERIC-FNO在4倍超分辨率范围(64到256)内零样本保持其精确结构保证,恢复物理耗散的真实顺序,并与强无约束和能量惩罚基线竞争,在相当或更少参数的情况下在多个耗散和混合问题上优于它们。

英文摘要

We introduce GENERIC-FNO, the first neural operator to embed the full GENERIC (metriplectic) structure of nonequilibrium thermodynamics -- reversible, energy-conserving dynamics and irreversible, entropy-producing dynamics coupled through the degeneracy conditions -- directly in function space. Existing structure-preserving neural operators enforce at most a single conservation law or reversible (Hamiltonian) structure, while thermodynamically consistent learning has been confined to finite-dimensional, graph, or particle systems. GENERIC-FNO closes this gap: it learns the energy and entropy functionals as neural operators and parameterizes the Poisson and friction operators as diagonal Fourier multipliers sandwiched between rank-one projections that enforce the degeneracy conditions exactly, by construction, with no penalty term, update projection, or residual. The degeneracy identities hold to machine precision (residuals ~10^-13) for any initialization, dimension, or resolution, so the continuous-time dynamics conserve the learned energy and produce entropy exactly; the explicit time stepping adds only a small O(dt^2) drift (per-step residual ~10^-6). We further note that the (E,S,L,M) decomposition of a given flow is not unique, and introduce a gauge-invariant dissipation diagnostic separating reversible from dissipative dynamics independently of the learned functionals. Across three operator backbones (1D/2D FNOs and DeepONet) and four PDEs spanning reversible, dissipative, and mixed regimes, GENERIC-FNO preserves its exact structural guarantees zero-shot across a 4x super-resolution range (64 to 256), recovers the ground-truth ordering of physical dissipation, and is competitive with strong unconstrained and energy-penalized baselines, outperforming them on several dissipative and mixed problems at comparable or fewer parameters.
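
The projection idea in the abstract can be illustrated in a finite-dimensional setting. The sketch below is not the paper's function-space implementation: it assumes plain vectors, fixed gradients `grad_E` and `grad_S`, a generic skew-symmetric matrix `A`, and a nonnegative diagonal `d` standing in for the learned Fourier multipliers (all of these names are hypothetical). It shows why sandwiching an operator between rank-one projections makes the degeneracy conditions hold by construction, so that energy is conserved and entropy does not decrease.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def project_out(g, v):
    """Apply the rank-one projection P_g = I - g g^T / (g . g) to v."""
    c = dot(g, v) / dot(g, g)
    return [vi - c * gi for vi, gi in zip(v, g)]

def generic_rhs(grad_E, grad_S, A, d):
    """Return the two parts of dz/dt = L grad_E + M grad_S.

    Assumed construction (illustrative only):
      L = P_S A P_S        with A skew-symmetric, so L grad_S = 0
      M = P_E diag(d) P_E  with d >= 0,          so M grad_E = 0
    """
    # Reversible part: conserves E and leaves S untouched.
    rev = project_out(grad_S, matvec(A, project_out(grad_S, grad_E)))
    # Irreversible part: produces S and leaves E untouched.
    w = project_out(grad_E, grad_S)
    irr = project_out(grad_E, [di * wi for di, wi in zip(d, w)])
    return rev, irr

grad_E = [1.0, 2.0, -1.0]
grad_S = [0.5, -1.0, 3.0]
A = [[0.0, 1.0, -2.0], [-1.0, 0.0, 0.5], [2.0, -0.5, 0.0]]  # skew-symmetric
d = [0.3, 1.0, 2.0]                                          # nonnegative

rev, irr = generic_rhs(grad_E, grad_S, A, d)
dE_dt = dot(grad_E, rev) + dot(grad_E, irr)  # zero up to round-off
dS_dt = dot(grad_S, rev) + dot(grad_S, irr)  # nonnegative
```

Because the projections are applied on both sides, the identities hold for any `A` and `d` with the stated symmetry and sign, which mirrors the abstract's claim that no penalty term is needed.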

2606.08011 2026-06-11 cs.CL cs.AI updated version

Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation

改写以翻译,翻译以奖励:机器翻译中源端改写的强化学习

Boxuan Lyu, Haiyue Song, Zhi Qu, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura

Affiliations: Institute of Science Tokyo; Preferred Networks Inc; Nara Institute of Science and Technology

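AI summary: Proposes RLSR, a framework that trains a source-rewriting model with reinforcement learning, using the gain in translation quality as the reward and requiring no per-MT-model prompt tuning; across 6 MT models and 16 language pairs it beats the no-rewriting and same-scale prompting baselines and is competitive with a 235B-LLM prompting baseline.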

详情
AI中文摘要

尽管直接提示现成的大语言模型(LLM)生成保留意义的源端改写可以有效提升机器翻译(MT)质量,但这样做需要为不同的MT模型手动调整提示。在这项工作中,我们提出了RLSR(用于源端改写的强化学习),一种新颖的基于强化学习的框架,用于训练源端改写模型,而无需为每个MT模型调整提示。RLSR通过直接使用每个改写源端所带来的下游翻译质量的提升作为奖励来优化改写模型。跨六个MT模型和16个语言对的广泛实验表明,我们通过RLSR训练的4B改写模型显著优于无改写基线和现有的同规模基于提示的改写基线,同时与基于235B LLM的提示基线相比取得了具有竞争力的性能。

英文摘要

Rewriting source text with large language models (LLMs) before translation has been shown to improve machine translation (MT) quality. However, we find that prompt-based rewriting can degrade translation quality rather than improve it, particularly when smaller LLMs, such as 4B-parameter models, are used. We argue that this limitation stems from the difficulty of controlling rewriting behavior through natural-language prompts alone: a rewrite is useful only if it improves downstream translation, yet existing prompt-based methods do not explicitly optimize for this signal. To address this issue, we propose RLSR (Reinforcement Learning for Source Rewriting), a reinforcement learning framework that trains the rewriting model with a reward based on the downstream translation-quality improvement produced by each rewrite. Experiments across six MT systems and 16 language pairs show that our 4B RLSR-trained rewriting models significantly outperform both the no-rewriting baseline and prompt-based rewriting baselines at the same model scale, while remaining competitive with baselines that use a 235B LLM.
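
The reward described above can be sketched as the difference in translation quality with and without the rewrite. This is a minimal illustration under stated assumptions, not the authors' implementation: `rewrite_reward`, `toy_translate`, and `toy_quality` are hypothetical names, the "MT system" is a three-word lexicon, and the quality metric is simple reference-token overlap rather than whatever metric the paper uses.

```python
def rewrite_reward(source, rewrite, reference, translate, quality):
    """Reward = quality of translating the rewrite minus quality of translating the source."""
    baseline = quality(translate(source), reference)
    improved = quality(translate(rewrite), reference)
    return improved - baseline

# Toy stand-ins (assumptions for illustration only).
TOY_LEXICON = {"it": "il", "rains": "pleut", "heavily": "fortement"}

def toy_translate(text):
    # Unknown words become <unk>, mimicking an MT system that fails on hard input.
    return " ".join(TOY_LEXICON.get(w, "<unk>") for w in text.lower().split())

def toy_quality(hypothesis, reference):
    # Fraction of reference tokens that appear in the hypothesis.
    hyp = set(hypothesis.split())
    ref = reference.split()
    return sum(t in hyp for t in ref) / len(ref)

reward = rewrite_reward(
    source="it pours",
    rewrite="it rains heavily",
    reference="il pleut fortement",
    translate=toy_translate,
    quality=toy_quality,
)
```

A rewrite that leaves the translation unchanged gets a reward of zero, and one that makes it worse gets a negative reward, which is the signal the abstract says prompt-only methods do not optimize.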

2606.07909 2026-06-11 cs.AI cs.CL updated version

MemToolAgent: Leveraging Memory for Tool Using Agents Based on Environment and User Feedback

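MemToolAgent overview: a simple restaurant-booking scenario in which the agent retrieves similar memories, receives feedback about an invalid time format, and generates a reflection to update its memory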

Suleyman Armagan Er, Danilo Ribeiro, Yogesh Virkar, Surafel Lakew, Adi Kalyanpur, James Gung, Thomas Delteil, Arshit Gupta

Affiliations: AWS AI; University of Washington

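AI summary: Proposes MemToolAgent, a framework that improves LLM agents' tool use through memory management, with memory-extraction and dynamic-retrieval modules; it reports relative improvements of 29%, 80%, and 17% on three benchmarks.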

详情
Comments
8 pages, 5 figures
AI中文摘要

现代大语言模型(LLM)代理可以使用外部工具帮助用户解决复杂任务。然而,对于需要从长期历史事件或先前的代理-环境交互中学习的问题,LLM代理需要使用记忆机制来存储和检索经验。尽管对话代理存在复杂的记忆系统,但很少有研究实证检验如何通过过去的用户-代理对话来提升代理的工具使用能力。我们提出MemToolAgent,一个通过记忆管理改善工具使用的框架。我们的方法包含一个记忆提取模块,将过去的经验处理成结构化的记忆条目,以及一个检索模块,动态选择存储记忆条目的子集。这使得无需LLM微调即可实现更个性化和准确的响应,与用户偏好和反馈保持一致。总之,本工作有三个主要贡献:(1)统一的记忆条目格式,无需LLM微调即可改善通用和个性化工具使用;(2)基于反思的记忆提取,利用环境和用户反馈将错误执行提炼为批评并存储;(3)一个检索模块,根据记忆相似度分布选择使用多少过去经验。MemToolAgent在WorkBench、NESTFUL和PEToolBench基准上相比强基线分别实现了29%、80%和17%的相对改进。

英文摘要

Modern large language model (LLM) agents can use external tools to help users solve complex tasks. However, for problems that require learning from long-term historical events or from previous agent-environment interactions, LLM agents are required to use memory mechanisms to store and retrieve experiences. While sophisticated memory systems exist for dialogue agents, few studies have empirically examined how to improve agents' tool-using capabilities through past user-agent conversations. We propose MemToolAgent, a framework that improves tool use through memory management. Our approach contains a memory extraction module that processes past experiences into structured memory entries, and a retrieval module that dynamically selects a subset of the stored memory entries. This enables more personalized and accurate responses aligned with user preferences and feedback without requiring LLM fine-tuning. In summary, this work has three main contributions: (1) a unified memory entry format that improves both general-purpose and personalized tool use without LLM fine-tuning, (2) a reflection-based memory extraction that uses environment and user feedback to distill wrong executions into critiques to store, and (3) a retrieval module that chooses how many past experiences to use based on the memory similarity distribution. MemToolAgent achieves 29%, 80%, and 17% relative improvements compared to strong baselines on the WorkBench, NESTFUL, and PEToolBench benchmarks, respectively.
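
The abstract says the retrieval module picks how many past experiences to use from the memory similarity distribution, but does not give the rule. The sketch below shows one plausible rule under that description, purely as an assumption: rank memories by similarity and cut at the largest drop between consecutive scores. The function name `select_memories` and the `max_k` cap are hypothetical.

```python
def select_memories(scored, max_k=5):
    """Pick a variable number of memories from (entry, similarity) pairs.

    Assumed rule (not from the paper): keep the top-ranked entries up to the
    largest gap between consecutive similarity scores.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)[:max_k]
    if len(ranked) <= 1:
        return [entry for entry, _ in ranked]
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps)) + 1  # number of entries before the biggest drop
    return [entry for entry, _ in ranked[:cut]]

memories = [
    ("use HH:MM for booking times", 0.91),
    ("user prefers window seats", 0.88),
    ("calendar API needs ISO dates", 0.52),
    ("unrelated shopping note", 0.50),
]
selected = select_memories(memories)
```

With these scores the largest drop is between the second and third entries, so two memories are kept; a fixed top-k would have returned the same number regardless of how the scores were distributed.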

2606.07591 2026-06-11 cs.LG cs.AI cs.CL updated version

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

ResearchClawBench: 端到端自主科学研究基准

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

Affiliations: Shanghai Artificial Intelligence Laboratory

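AI summary: Introduces the ResearchClawBench benchmark, with 40 tasks across 10 domains, evaluating autonomous research ability through multimodal rubrics; the strongest agent scores only 21.5, exposing weaknesses in experimental protocol, evidence matching, and scientific core.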

详情
AI中文摘要

AI编码智能体越来越多地用于科学工作,但其端到端自主研究能力仍然难以验证。我们提出了ResearchClawBench,一个用于评估自主科学研究的基准,涵盖来自10个科学领域的40个任务。每个任务基于一篇真实发表论文,提供相关文献和原始数据,并在评估期间隐藏目标论文。专家策划的多模态评分标准将目标科学制品分解为加权标准,从而能够评估目标论文级别的重新发现,同时为新发现留出空间。我们在统一协议下评估了七个自主研究(auto-research)智能体,并通过轻量级ResearchHarness评估了十七个原生LLM。当前系统远未达到可靠的重新发现:最强的自主智能体Claude Code平均得分为21.5,最强的ResearchHarness LLM Claude-Opus-4.7平均得分为20.7,LLM前沿均值仅为26.5。错误分析表明,失败集中在实验协议不匹配、证据不匹配和缺失科学核心。ResearchClawBench为衡量自主科学研究进展提供了一个可复现的评估前沿。

英文摘要

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

2606.07226 2026-06-11 cs.LG cs.AI cs.CL updated version

DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

DEFINED: 辩论场景中细粒度创造力评估的数据高效计算框架

Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang, Yaoyu Jiang, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

Affiliations: Nanjing University; Shanghai Innovation Institute; East China Normal University

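AI summary: Proposes DEFINED, a framework combining a hierarchical eight-dimensional metric system, a pre-trained language model, and a mixed-granularity training strategy for data-efficient, fine-grained automatic creativity assessment in debate scenarios, outperforming existing methods.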

详情
Comments
Accepted by KDD 2026
AI中文摘要

人类创造力已成为大语言模型时代的关键能力。在复杂、开放环境中评估创造力是数据挖掘领域的一大挑战,目前受限于对标准化简单任务的依赖以及细粒度专家数据的稀缺。作为生态有效的评估场景,辩论反映了创造力的多个维度,涵盖发散思维和收敛思维。此外,辩论是一个数据丰富的领域,拥有大量公开可获取的材料。当前主流的自动评分方法难以适应辩论等复杂场景,因此仍然依赖昂贵的人工评估。为此,本文提出DEFINED,一种数据高效的计算框架,用于辩论场景中的细粒度创造力评估。DEFINED通过层次化的八维指标体系操作化辩论创造力,采用预训练自回归语言模型,并配备支持细粒度和粗粒度评估的层次化评分头。从真实辩论比赛中获取陈述及其相关专家评分,并采用约束数据增强策略以解决原始数据中的精英偏差。DEFINED采用混合粒度训练策略,能够从训练有素的研究生专家提供的有限细粒度监督中实现鲁棒学习。为严格验证超越合成基准的生态效度,我们纳入了一项针对辩论新手参与者的实证研究,利用这些真实数据作为中低水平人群的定性案例研究。在我们的评估协议中,评分模型实现了准确且稳定的评分,优于基于提示的大语言模型评估器和现有的辩论评分方法。

英文摘要

Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.

2606.07082 2026-06-11 cs.LG cs.AI updated version

On the Geometry of On-Policy Distillation

论在线策略蒸馏的几何结构

Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung

Affiliations: HKUST; UT Austin; Zhejiang University; Hong Kong PolyU; USTC; BUPT; Nankai University; BIT

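AI summary: Using parameter-space diagnostics, the paper shows that on-policy distillation (OPD) update trajectories have distinctive geometric properties, including a relaxed off-principal regime and subspace locking, indicating that OPD is not merely an intermediate method between SFT and RLVR.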

详情
Comments
17 pages, 8 figures
AI中文摘要

在线策略蒸馏(OPD)越来越多地被用于改进大型语言模型的推理能力,但其训练动态仍鲜为人知。我们刻画了OPD更新在参数空间中的轨迹,并将其与监督微调(SFT)和可验证奖励强化学习(RLVR)进行了比较。一套参数空间诊断一致地将OPD置于松弛的离主成分区域:与SFT相比,其更新影响更少的权重,并更强烈地避开主方向;而与RLVR相比,其约束更宽松。除了这种静态定位外,OPD还表现出子空间锁定:其累积更新迅速进入一个狭窄的低维通道。将训练限制在早期形成的更新子空间内能保持OPD的性能,但会严重降低SFT,表明该锁定子空间对OPD在功能上是充分的。控制实验进一步表明,稀疏化更新令牌和将rollout生成移至离策略能保持秩动态,而将OPD目标与RLVR混合则会改变它们。总体而言,这些结果表明OPD不仅仅是SFT和RLVR之间的中间点,而是在参数空间中诱导出自身独特的更新几何结构。

英文摘要

On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.

2606.06921 2026-06-11 cs.SD updated version

Towards Event-Robust Acoustic Scene Classification

面向事件鲁棒的声学场景分类

Yiqiang Cai, Bohan Hu, Yu Yang, Pengwei Lu, Shengchen Li, Xi Shao

Affiliations: Xi'an Jiaotong-Liverpool University; Zhongdian Zhiheng Information Technology Service Co., Ltd; China Telecom Jiangsu Branch; Nanjing University of Posts and Telecommunications

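AI summary: To address the performance drop of existing acoustic scene classification systems under unknown sound events, introduces the Event-Shifted Acoustic Scene (ESAS) dataset, which uses large language models to inject foreground events and simulate real environments, for evaluating and advancing event-robust ASC research.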

详情
Comments
Accepted to Interspeech 2026. The ESAS dataset is available at: this https URL
AI中文摘要

本文介绍了事件移位声学场景(ESAS)数据集,这是一个用于评估声学场景分类(ASC)系统对未知声音事件鲁棒性的新型基准。现有的ASC数据集通常包含干净且一致的音频记录,而现实环境往往包含多样且意外的事件。为弥合这一差距,ESAS通过借助大语言模型将前景声音事件注入背景场景来模拟现实世界的声学变化。本文介绍了构建方法、数据集统计和评估协议。此外,使用ESAS基准对最先进的ASC系统进行了全面评估。实验结果表明,现有的ASC模型在面对事件移位挑战时性能显著下降。ESAS数据集的引入旨在推动未来研究朝向事件鲁棒的ASC发展。

英文摘要

This paper introduces the Event-Shifted Acoustic Scene (ESAS) dataset, a novel benchmark for evaluating the robustness of Acoustic Scene Classification (ASC) systems against unknown sound events. Existing ASC datasets typically contain recordings of clean and consistent audio, while real-world environments often include diverse and unexpected sound events. To bridge this gap, ESAS simulates real-world acoustic variability by injecting foreground sound events into background scenes with the assistance of large language models. In this work, we present the construction methodology, dataset statistics, and evaluation protocols. Furthermore, a comprehensive evaluation of state-of-the-art ASC systems is conducted using the ESAS benchmark. Experimental results reveal that existing ASC models suffer significant performance degradation when facing the event-shift challenge. The introduction of the ESAS dataset aims to drive future research toward event-robust ASC.

2606.06904 2026-06-11 cs.RO cs.CV updated version

ActionMap: Robot Policy Learning via Voxel Action Heatmap

ActionMap: 基于体素动作热图的机器人策略学习

Pei Yang, Hai Ci, Yanzhe Chen, Qi Lv, Han Cai, Mike Zheng Shou

Affiliations: National University of Singapore; NVIDIA

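AI summary: Proposes ActionMap, an action decoder that models the action space as a voxel heatmap and replaces the single-point predictor in existing VLA models, improving performance and data efficiency in LIBERO simulation and real-world Franka manipulation.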

详情
AI中文摘要

视觉-语言-动作(VLA)模型在骨干网络、训练方法和数据规模方面快速发展,但将骨干网络隐藏状态转换为连续控制信号的动作解码器几乎没有变化,在大多数现有VLA中仍然是单点预测器。无论是通过自回归词元箱、L1回归还是流匹配去噪实现,所得解码器都将动作空间视为无结构的,在训练期间未利用相邻动作的几何邻近性。为了改进这一点,我们引入了ActionMap,一种体素热图动作头,可以插入现有VLA中替换其原生动作解码器。对于每个新动作,该头预测动作空间上的体素热图,其中每个体素直接存储对应动作的概率。在LIBERO仿真和真实Franka操作中,我们的热图头在匹配训练步数下超越了两种架构不同的骨干网络(例如,在LIBERO四套件平均上比OpenVLA-OFT的L1回归头高出8.2%),在两种骨干网络上以相当或更快的速度收敛,并且在低训练数据下保持显著更高的数据效率。跨骨干网络的一致性表明,动作表示是VLA性能的一个真正杠杆,与进一步的骨干网络或方法缩放不同。项目页面:此 https URL。

英文摘要

Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone's hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To advance this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head surpasses two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT's L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient at low training data. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: this https URL.
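
To make the "voxel heatmap over the action space" concrete, the sketch below decodes an action from per-voxel scores. It is an illustration under assumptions, not the ActionMap head itself: it assumes a 3D action range split into a uniform grid, a softmax over voxel logits, and argmax decoding to the voxel centre. The names `voxel_center` and `decode_action` and the grid bounds are hypothetical.

```python
import math

def voxel_center(index, lo, hi, bins):
    """Map an integer voxel index to the continuous action at the voxel's centre."""
    return tuple(l + (i + 0.5) * (h - l) / bins for i, l, h in zip(index, lo, hi))

def decode_action(logits, lo, hi, bins):
    """logits: dict mapping (i, j, k) voxel index -> unnormalised score.

    Returns the centre of the most probable voxel and its probability.
    """
    m = max(logits.values())
    exp = {idx: math.exp(v - m) for idx, v in logits.items()}  # stable softmax
    z = sum(exp.values())
    probs = {idx: e / z for idx, e in exp.items()}
    best = max(probs, key=probs.get)
    return voxel_center(best, lo, hi, bins), probs[best]

bins = 4
lo, hi = (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)
logits = {(i, j, k): 0.0 for i in range(bins) for j in range(bins) for k in range(bins)}
logits[(3, 0, 2)] = 5.0  # one voxel clearly preferred

action, confidence = decode_action(logits, lo, hi, bins)
```

Unlike a single-point regressor, the heatmap keeps a full distribution over the grid, so neighbouring voxels can share probability mass during training; how ActionMap exploits that is described in the paper, not in this sketch.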

2606.12387 2026-06-11 cs.DB cs.AI new submission

TAHOE: Text-to-SQL with Automated Hint Optimization from Experience

TAHOE: 基于经验的自动提示优化文本到SQL系统

Zhiyi Chen, Jie Song, Peng Li

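AI summary: Proposes the TAHOE system, whose error-driven hint learning pipeline turns debugging traces into a structured Hint Bank, combined with a Strategy Layer that models user intents; it substantially improves Text-to-SQL performance on Spider 2.0-Snow without updating model parameters.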

详情
AI中文摘要

大型语言模型(LLM)通过Text-to-SQL使数据库访问民主化,但从原型到生产部署仍然困难。实际部署必须处理严格的SQL方言、大规模模式和不断变化的用户偏好,而有监督微调成本高且僵化,代理测试时扩展昂贵。我们提出Tahoe,一个将提示优化视为动态数据管理问题的系统。Tahoe在开发和部署阶段使用错误驱动的提示学习管道,将调试痕迹整合到结构化的提示库中。编译器反馈被提炼为可重用的语法提示(针对方言特定规则),而执行和用户反馈被转换为语义提示(针对模式和用户特定逻辑)。Tahoe进一步引入策略层,将冲突的用户意图建模为共享自然语言触发下的竞争策略,并利用近期信号和学习后归因统计来总结经验成功、危害、惰性和支持。在推理时,Tahoe检索相关提示,并通过逻辑规划后接SQL合成引导LLM。我们实现并评估了开发阶段的工作流,将部署时的人类反馈更新留作未来工作。在Spider 2.0-Snow上,Tahoe在不更新模型参数的情况下显著改进了Text-to-SQL。在113个有监督的Spider 2.0-Snow-0212示例上使用GPT-5.5,Tahoe将通过率从61.95%提高到79.42%,pass-at-4从72.57%提高到87.61%,实现了100%的Snowflake语法通过率,并将每个采样候选的平均编译器反馈批评轮次从2.79降低到0.12。相同的提示库也迁移到较弱的骨干模型,包括在Doubao-2.0-lite上获得19.7个百分点的通过率提升。

英文摘要

Large Language Models (LLMs) have democratized database access through Text-to-SQL, but moving from prototypes to production remains difficult. Real deployments must handle strict SQL dialects, massive schemas, and evolving user preferences, while supervised fine-tuning is costly and rigid and agentic test-time scaling is expensive. We present Tahoe, a system that treats prompt optimization as a dynamic data management problem. Tahoe uses an error-driven hint learning pipeline across Development and Deployment to consolidate debugging traces into a structured Hint Bank. Compiler feedback is distilled into reusable Syntax Hints for dialect-specific rules, while execution and user feedback are converted into Semantic Hints for schema- and user-specific logic. Tahoe further introduces a Strategy Layer that models conflicting user intents as competing strategies under shared natural-language triggers, with recency signals and post-learning attribution statistics that summarize empirical success, harm, inertness, and support. At inference time, Tahoe retrieves relevant hints and guides the LLM through Logic Planning followed by SQL Synthesis. We implement and evaluate the development-phase workflow, leaving deployment-time human-feedback updates for future work. On Spider 2.0-Snow, Tahoe substantially improves Text-to-SQL without updating model parameters. On 113 supervised Spider 2.0-Snow-0212 examples using GPT-5.5, Tahoe raises pass rate from 61.95 percent to 79.42 percent and pass-at-4 from 72.57 percent to 87.61 percent, achieves 100 percent Snowflake syntax pass rate, and reduces average compiler-feedback critic rounds from 2.79 to 0.12 per sampled candidate. The same Hint Bank also transfers to weaker backbones, including a 19.7 percentage-point pass-rate gain on Doubao-2.0-lite.

2606.12382 2026-06-11 cs.NE cs.AI new submission

SPEA2$^+$: Improved Density Estimation in SPEA2 with Provable Runtime Guarantees

SPEA2$^+$:具有可证明运行时间保证的改进SPEA2密度估计

Duc-Cuong Dang, Andre Opris, Dirk Sudholt

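AI summary: To address SPEA2's insufficient diversity when handling dominated solutions, proposes SPEA2$^+$, which improves density estimation by using all pairwise distances and achieves the same performance guarantees as other prominent algorithms on the OneTrapZeroTrap benchmark.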

详情
Comments
To appear in the Proceedings of PPSN 2026
AI中文摘要

强度帕累托进化算法2(SPEA2)是解决多目标优化问题的流行且著名的进化算法。尽管其受欢迎,但SPEA2的理论分析直到最近才出现。此外,这些分析仅关注SPEA2如何处理非支配解,而忽略了处理支配解的算法组件。我们首次对SPEA2进行了运行时分析,其中分析了这些组件。我们证明,与其他主流算法(包括相同设置下具有恒定种群大小和重复消除的NSGA-II、NSGA-III和SMS-EMOA)不同,SPEA2无法有效覆盖OneTrapZeroTrap基准的帕累托前沿。我们的结果表明,在适应度分配中使用k近邻距离提供的信号不足以维持支配个体间的多样性。为了解决这个问题,我们提出了一种改进的变体SPEA2$^+$,它考虑了所有成对距离。新算法在OneTrapZeroTrap上实现了与其他主流算法相同的性能保证,同时在更简单的问题上匹配原始SPEA2的性能。实验结果补充了我们的理论发现。

英文摘要

The Strength Pareto Evolutionary Algorithm 2 (SPEA2) is a popular and prominent evolutionary algorithm for solving multi-objective optimisation problems. Despite its popularity, theoretical analyses of SPEA2 have only appeared recently. Moreover, these analyses focus exclusively on how SPEA2 handles non-dominated solutions and disregard the algorithmic components responsible for handling dominated solutions. We conduct a first runtime analysis of SPEA2 for which these components are analysed. We prove that, unlike other prominent algorithms, including NSGA-II, NSGA-III and SMS-EMOA under the same setting of constant population size and duplicate elimination, SPEA2 is unable to cover the Pareto front of the OneTrapZeroTrap benchmark efficiently. Our results indicate that using k-th nearest-neighbour distance in the fitness assignment provides an insufficient signal to maintain diversity among dominated individuals. To address this issue, we propose an improved variant, SPEA2$^+$, that considers all pairwise distances. The new algorithm achieves the same performance guarantees as the other prominent algorithms on OneTrapZeroTrap, while matching the performance of the original SPEA2 on simpler problems. Experimental results complement our theoretical findings.
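
The weakness of a single k-th nearest-neighbour distance can be seen on a tiny example. In the original SPEA2, the density of an individual is 1 / (sigma_k + 2), where sigma_k is its distance to the k-th nearest neighbour. The sketch below computes that quantity and contrasts it with the full sorted distance vector. The second function, `all_pairs_key`, is only an illustration of "considering all pairwise distances" and is an assumption; the exact SPEA2$^+$ rule is defined in the paper.

```python
import math

def sorted_distances(i, points):
    """Distances from points[i] to every other point, in ascending order."""
    return sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)

def kth_density(i, points, k):
    """SPEA2 density estimate: 1 / (sigma_k + 2)."""
    return 1.0 / (sorted_distances(i, points)[k - 1] + 2.0)

def all_pairs_key(i, points):
    """Hypothetical richer signal: the whole sorted distance vector.

    Comparing these vectors lexicographically separates individuals that the
    k-th distance alone cannot tell apart.
    """
    return sorted_distances(i, points)

# Four individuals on a line in objective space (toy data).
points = [(0.0,), (1.0,), (3.0,), (6.0,)]

same_density = kth_density(0, points, k=2) == kth_density(2, points, k=2)
different_keys = all_pairs_key(0, points) != all_pairs_key(2, points)
```

Here the first and third individuals share the same second-nearest distance (3.0) and therefore the same density, even though one has a neighbour at distance 1.0 and the other at 2.0; the full distance vectors differ.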

2606.12373 2026-06-11 cs.CL new submission

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

可验证环境是乐高积木:递归组合实现推理泛化

Hao Xiang, Qiaoyu Tang, Le Yu, Yaojie Lu, Xianpei Han, Ben He, Le Sun, Bowen Yu, Peng Wang, Hongyu Lin, Dayiheng Liu

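AI summary: Proposes RACES, a framework that treats verifiable environments as recursively composable building blocks and generates composite environments automatically through four composition operators; it improves DeepSeek-R1-Distill-Qwen-14B by 3.1 points on average across six unseen benchmarks, and with only 50 base environments matches the performance of training on 300 environments.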

详情
AI中文摘要

基于可验证环境的强化学习已成为增强大语言模型推理能力的有效方法。虽然先前研究表明扩展环境数量可提升强化学习性能,但现有手动或单独构建方法受限于线性扩展瓶颈,阻碍了可扩展的推理泛化。本文提出RACES(递归自动组合环境扩展)框架,将可验证环境视为可递归组装的可组合构建块。关键洞察是:当一个环境的余域(输出类型)与另一个环境的定义域(输入类型)匹配时,它们可以自动融合为新的可验证环境,从而实现递归组合。RACES使用300个独立环境实现,并定义了四种组合算子(SEQUENTIAL、PARALLEL、SORT和SELECT),诱导出多样化的推理模式。大量实验表明,在这些复合环境上进行强化学习训练持续提升了推理泛化能力。具体而言,RACES在六个未见基准上平均提升DeepSeek-R1-Distill-Qwen-14B 3.1分(从48.2到51.3),并将Qwen3-14B的性能从58.8提升至61.1。此外,RACES仅使用50个基础环境即可达到与使用300个独立环境训练相当的性能,展现了显著的环境利用效率。

英文摘要

Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing manual or individual construction methods suffer from linear scaling limits, thereby hindering scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that conceptualizes verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, they can be automatically fused into a new verifiable environment, enabling recursive composition. RACES is implemented with 300 individual environments and defines a set of composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently enhances reasoning generalization. Specifically, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and boosts Qwen3-14B performance from 58.8 to 61.1 on six benchmarks, which are unseen during the construction of training environments. Moreover, RACES achieves performance comparable to training on 300 individual environments using only 50 base environments, demonstrating significant efficiency in environment utilization.
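
The type-matching insight can be sketched with a small sequential-composition operator. This is an illustration under assumptions rather than the RACES code: the `Env` class, its fields, and the `sequential` function are hypothetical, domains and codomains are represented by plain Python types, and a ground-truth solver doubles as the verifier.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Env:
    name: str
    domain: type                  # input type of the task
    codomain: type                # output type of the task
    solve: Callable[[Any], Any]   # ground-truth solver used for verification

    def verify(self, x, answer):
        return self.solve(x) == answer

def sequential(first, second):
    """Fuse two environments when first's output type matches second's input type."""
    if first.codomain is not second.domain:
        raise TypeError(f"cannot compose {first.name} -> {second.name}")
    return Env(
        name=f"{second.name}({first.name})",
        domain=first.domain,
        codomain=second.codomain,
        solve=lambda x: second.solve(first.solve(x)),
    )

total = Env("sum", list, int, sum)
parity = Env("is_even", int, bool, lambda n: n % 2 == 0)
negate = Env("not", bool, bool, lambda b: not b)

# The result of a composition is itself an Env, so composition can recurse.
composite = sequential(sequential(total, parity), negate)
ok = composite.verify([1, 2, 3], False)  # sum is 6, which is even, negated gives False
```

Because every composite is again a typed, verifiable environment, a modest pool of base environments can yield many distinct training tasks, which is consistent with the abstract's 50-versus-300 comparison.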

2606.12360 2026-06-11 cs.LG new submission

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

后训练的解剖:利用可解释性表征数据并塑造学习信号

Leon Bergen, Usha Bhalla, Sidharth Baskaran, Max Loeffler, Raphael Sarfati, Dhruvil Gala, Ryan Panwar, Santiago Aranguri, Thomas Fel, Atticus Geiger, Matthew Kowal, Siddharth Boppana, Daniel Balsam, Owen Lewis, Jack Merullo, Thomas McGrath, Ekdeep Singh Lubana

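AI summary: Proposes an interpretability-based, data-centric post-training pipeline that uses statistical hypotheses to identify latent concepts in preference data, enabling fine-grained feedback and reducing spurious correlations and undesirable behaviors.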

详情
AI中文摘要

语言模型后训练是塑造模型行为的主要阶段,但它仍然主要涉及优化总结多样需求的标量奖励。这种抽象使从业者几乎无法了解数据实际教会了模型什么,导致模型学习虚假关联,并引发过度风格化和谄媚等不良行为。为了解决这个问题,我们提出:能否在优化之前检查偏好数据集,并在概念层面决定模型应该被允许学习哪些行为?受此启发,我们引入了一个以数据为中心的后训练流程,该流程使用可解释性协议来开发统计假设,以区分偏好和非偏好生成的潜在概念,使其明确以供细粒度用户反馈。基于这一观点,我们将几种基于可解释性的训练协议统一为通过特征或数据干预来塑造奖励的方式。实验上,我们表明我们的流程诊断了现有偏好数据中的不良信号,减轻了脱靶学习,并且还可以帮助放大或塑造期望的属性,如安全防护和模型个性。更广泛地说,我们的结果表明,可解释性可以将后训练从优化不透明的代理奖励转变为审计和塑造学习信号本身的过程。

英文摘要

Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? Motivated by this, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.

2606.12350 2026-06-11 cs.AI new submission

Nonslop: A Gamified Experiment in Human-AI Collaborative Writing

Nonslop: 人机协作写作中的游戏化实验

Maria Edwards, Julian Togelius

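AI summary: Through a gamified writing experiment, studies when users maintain creative autonomy in the presence of AI suggestions, revealing the tension between efficiency and authenticity.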

详情
Comments
Accepted at the 2026 IEEE Conference on Games (CoG 2026); to be published in the conference proceedings. Camera-ready version
AI中文摘要

大型语言模型(LLM)的快速普及引发了关于人类创造力和个体表达在AI辅助创作时代的关键问题。人类何时采纳AI建议?这对个体声音有何影响?本研究通过一项游戏化写作练习来探讨这些问题,74名参与者(214份回复)在写作时,AI生成的单词建议可供使用。该游戏模拟了一个反乌托邦的未来,其中AI试图从残存的人类个性中学习,并抑制类似AI的写作。通过这种方式,它试图创造能够揭示真实用户偏好而非默认行为(例如接受现成的AI生成建议)的条件。请注意,这是对“有帮助的助手”设计模式的刻意反转;系统明确禁止你接受AI建议。我们分析了不同任务类型、用户行为和回复特征下的用户行为模式,以理解创造性任务中人机交互的影响因素。研究重点关注用户何时选择保持创意自主性,而非违反游戏规则接受AI帮助。此外,还探讨了这些选择如何与回复模式、任务特征和用户行为相关联。这种游戏化方法既为研究真实的人机交互提供了一个框架,也为理解AI增强创造力中效率与真实性之间的张力提供了一个发人深省的视角。

英文摘要

The rapid proliferation of large language models (LLMs) raises critical questions about human creativity and individual expression in an era of AI-assisted creation. When do humans adopt AI suggestions, and what are the implications for individual voice? This study examines these questions through a gamified writing exercise where 74 participants (214 responses) replied to prompts while AI-generated word suggestions were available as they wrote. The game simulates a dystopian future in which an AI is attempting to learn from what remains of human individuality, and disincentivizes AI-like writing. In doing so, it attempts to create conditions that reveal authentic user preferences rather than default behaviors, such as accepting a readily available AI-generated suggestion. Note that this is a deliberate inversion of the "helpful assistant" design pattern; the system is explicitly forbidding you from accepting AI suggestions. We analyze user behavior patterns across different task types, user behaviors, and response characteristics to understand the factors influencing human-AI interaction in creative tasks. The study focuses on when users choose to maintain creative autonomy versus violating the rules of the game and accepting AI assistance. It also explores how these choices relate to response patterns, task characteristics, and user behavior. This gamified approach offers both a framework for studying authentic human-AI interaction and a provocative lens for understanding the tension between efficiency and authenticity in AI-augmented creativity.

2606.12337 2026-06-11 math.NA cs.LG New submission

Adjoint Method versus Physics-Informed Neural Networks in PDE-Constrained Inverse Problems

A Comparison of the Adjoint Method and Physics-Informed Neural Networks in PDE-Constrained Inverse Problems

Zhen Zhang, Alessandro Alla, George Em Karniadakis

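AI summary: A fair comparison of adjoint optimization and PINNs for PDE-constrained inverse problems finds that the representation of the unknown parameter determines the method of choice: grid-based fields suit the adjoint, while neural representations suit PINNs. PINNs cost less on time-dependent problems and can warm-start the adjoint.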

Details
Comments
35 pages, 10 figures
AI Chinese abstract

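Inverse problems governed by partial differential equations (PDEs) are central to computational mechanics and are usually solved by adjoint-based optimization, while physics-informed neural networks (PINNs) have become a flexible alternative. Because the two approaches are typically compared under different formulations, parameterizations, optimizers, and regularization choices, their relative performance is hard to assess. We carry out a fair comparison of adjoint optimization and PINNs for PDE-constrained inverse problems. Starting from a common abstract formulation, we instantiate both methods on the same domains, governing equations, observation models, and regularization terms, and match the optimizer, the parameterization of the unknown, and the arithmetic precision where applicable. The benchmarks include the unsteady Burgers equation, noisy Darcy permeability inversion, three-dimensional Allen-Cahn reaction identification, and unsteady Navier-Stokes viscosity identification. The results show that the representation of the unknown largely determines the preferred method: grid-based fields favor the discrete adjoint, whereas neural representations are native to PINNs and suited to closure and constitutive modeling. For time-dependent problems, adjoint inversion can be expensive because of trajectory storage and differentiation, while PINNs deliver satisfactory reconstructions at lower cost. A PINN-warm-started adjoint strategy then recovers adjoint-level accuracy at substantially lower cost.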

English abstract

Inverse problems governed by partial differential equations (PDEs) are central to computational mechanics and are commonly solved by adjoint-based optimization, while physics-informed neural networks (PINNs) have emerged as a flexible alternative. Their relative performance remains difficult to assess because the two approaches are often compared under different formulations, parameterizations, optimizers, and regularization choices. We present a fair comparison of adjoint optimization and PINNs for PDE-constrained inverse problems. From a common abstract formulation, we instantiate both methods on identical domains, governing equations, observation models, and regularization terms, while matching the optimizer, unknown parameterization, and arithmetic precision wherever applicable. The benchmarks include unsteady Burgers, noisy Darcy permeability inversion, three-dimensional Allen–Cahn reaction identification, and unsteady Navier–Stokes viscosity identification. The results show that the representation of the unknown largely determines the preferred method: grid-based fields favor the discrete adjoint, whereas neural representations are native to PINNs and relevant for closure and constitutive modeling. For time-dependent problems, adjoint inversion can be dominated by trajectory storage and differentiation, while PINNs provide satisfactory reconstructions at lower cost. A PINN-warm-started adjoint strategy then recovers adjoint-level accuracy at substantially reduced cost.
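
To illustrate the adjoint side of this comparison, here is a minimal Python sketch of a discrete adjoint gradient. It is not the paper's code and does not reproduce any of its benchmarks: the toy state equation (L + diag(theta)) u = f, the least-squares misfit, and every function name are assumptions chosen for illustration. The point it shows is that one extra linear solve with the transposed operator yields the gradient with respect to all entries of a grid-based unknown.

```python
import numpy as np


def laplacian_1d(n):
    """Tridiagonal 1D Laplacian with Dirichlet boundaries (unit grid spacing)."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)


def solve_forward(theta, f):
    """Solve the assumed toy state equation (L + diag(theta)) u = f."""
    A = laplacian_1d(theta.size) + np.diag(theta)
    return A, np.linalg.solve(A, f)


def misfit_and_adjoint_gradient(theta, f, d):
    """Return J = 0.5 * ||u - d||^2 and dJ/dtheta from a single adjoint solve."""
    A, u = solve_forward(theta, f)
    residual = u - d
    # Adjoint equation: A^T lam = dJ/du = u - d.
    lam = np.linalg.solve(A.T, residual)
    # dA/dtheta_i = e_i e_i^T, so dJ/dtheta_i = -lam_i * u_i.
    return 0.5 * residual @ residual, -lam * u


def finite_difference_gradient(theta, f, d, eps=1e-6):
    """Central-difference check: needs two forward solves per parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        j_plus, _ = misfit_and_adjoint_gradient(theta + step, f, d)
        j_minus, _ = misfit_and_adjoint_gradient(theta - step, f, d)
        grad[i] = (j_plus - j_minus) / (2.0 * eps)
    return grad
```

In this steady toy problem the adjoint costs one additional solve regardless of the number of unknowns. For time-dependent problems the adjoint solve runs backward over the stored forward trajectory, which is the storage cost the abstract refers to.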

2606.12306 2026-06-11 cs.RO New submission

UGV-Conditioned Multi-UAV Informative Planning on a Shared Exposure Belief

UGV-Conditioned Multi-UAV Informative Planning Based on a Shared Exposure Belief

Lars Oerlemans, Moji Shi, Marija Popovic

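AI summary: Proposes a method for coordinating a UAV team to reduce the risk a ground vehicle faces when navigating unknown threat zones; a shared exposure belief guides sensing and reduces redundant coverage. In simulation, cumulative exposure falls by 38% and redundant coverage drops from 38.8% to 3.7%.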

Details
Comments
8 pages, 6 figures
AI Chinese abstract

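Safe ground navigation in large, threat-augmented environments requires aerial support that actively reduces the risk a ground vehicle faces along its route. Existing aerial reconnaissance systems focus on mapping or covering the environment, but do not steer sensing toward the regions most relevant to ground-vehicle safety. In this paper, we address the problem of coordinating a team of unmanned aerial vehicles (UAVs) to improve the safety of an unmanned ground vehicle (UGV) navigating through unknown threat zones. A key element of our approach is a shared exposure belief, which is updated online from aerial observations and used jointly by the UAV team and the ground vehicle. This lets us steer aerial sensing toward route-relevant regions while allowing the UGV to replan around newly discovered threats. We coordinate the UAV team through spatial region assignment to avoid redundant sensing. Simulation experiments show that, compared with a system that does not account for hazard levels, our approach reduces cumulative UGV exposure by 38%, and that our multi-UAV coordination scheme reduces redundant aerial coverage from 38.8% to 3.7%.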

English abstract

Safe ground navigation in large, threat-augmented environments requires aerial support that actively reduces the risks that a ground vehicle faces along its route. Existing aerial reconnaissance systems focus on mapping or covering the environment, but do not direct sensing toward regions that are most relevant for ground vehicle safety. In this paper, we address the problem of coordinating a team of unmanned aerial vehicles (UAVs) to improve the safety of an unmanned ground vehicle (UGV) navigating through unknown threat zones. A key aspect of our approach is a shared exposure belief that is updated online from aerial observations and used jointly by the UAV team and the ground vehicle. This enables us to direct aerial sensing towards route-relevant regions while allowing the UGV to replan around newly revealed threats. We coordinate the UAV team through spatial region assignment to avoid redundant sensing. Simulation experiments show that our approach reduces cumulative UGV exposure by 38% compared to a system that does not account for hazard levels, and reduces redundant aerial coverage from 38.8% to 3.7% under our multi-UAV coordination scheme.
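
As a rough illustration of what a belief "updated online from aerial observations" could look like, the Python sketch below keeps a per-cell threat probability on a grid and applies a Bayes update to the cells a UAV has observed. The abstract does not specify the belief model, so everything here is an assumption made for illustration: the grid representation, the binary detection sensor with rates `p_hit` and `p_false`, the function names, and the definition of route exposure as the sum of beliefs along the route.

```python
import numpy as np


def update_belief(belief, observed_mask, detections, p_hit=0.9, p_false=0.1):
    """Bayes update of per-cell threat probability for the observed cells.

    p_hit   : assumed probability of a detection when a threat is present
    p_false : assumed probability of a detection when no threat is present
    """
    prior = belief[observed_mask]
    seen = detections[observed_mask]
    like_threat = np.where(seen, p_hit, 1.0 - p_hit)
    like_clear = np.where(seen, p_false, 1.0 - p_false)
    posterior = like_threat * prior / (
        like_threat * prior + like_clear * (1.0 - prior)
    )
    updated = belief.copy()  # leave the caller's belief untouched
    updated[observed_mask] = posterior
    return updated


def route_exposure(belief, route):
    """Assumed exposure measure: sum of threat beliefs over the route cells."""
    rows, cols = zip(*route)
    return float(belief[list(rows), list(cols)].sum())
```

Under these assumptions, a ground planner would recompute `route_exposure` for candidate routes after each update and replan when a newly observed cell raises the exposure of the current route.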

2606.12303 2026-06-11 cs.CV New submission

From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

From 2D Grids to 1D Tokens: Reshaping Shared Representations for Multimodal Image Fusion

Yuchen Xian, Yunqiu Xu, Yang He, Yi Yang

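AI summary: Proposes a compact 1D token interface built on a frozen pretrained image tokenizer. Selective Token Editing (STE) sparsely updates critical tokens, steering global appearance coherence while leaving the fusion backbone unchanged, and achieves the best balance between global coherence and local fidelity.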

Details
Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
AI Chinese abstract

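Multimodal image fusion aims to integrate complementary information from different modalities into a fused image that retains rich local detail while keeping a globally consistent appearance. Existing methods build shared representations on 2D feature grids, which are good at modeling local structure but make limited use of image-level global appearance factors. To balance these goals, we introduce a compact 1D token interface, based on a frozen pretrained image tokenizer, for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design uses the 1D token space as a global carrier while retaining the 2D spatial pathway for local structure restoration. Specifically, we introduce Selective Token Editing (STE), which sparsely updates/replaces a small set of critical tokens, providing a lightweight mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding extra losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent multi-metric improvements in both global coherence and local fidelity. Project page: this https URL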

English abstract

Multimodal image fusion aims to integrate complementary information from different modalities into a fused image that preserves rich local details while maintaining globally consistent appearance. Existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over image-level global appearance factors. To balance these objectives, we introduce a compact 1D token interface based on a frozen pretrained image tokenizer for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design uses the 1D token space as a global carrier while retaining the 2D spatial pathway for local structure restoration. Specifically, we introduce Selective Token Editing (STE), which sparsely updates/replaces a small set of critical tokens, providing a lightweight mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding extra losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent, multi-metric improvements in both global coherence and local fidelity. Project page: this https URL
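
The idea of sparsely updating or replacing a small set of tokens can be sketched in a few lines of Python. This is only an illustration of sparse top-k replacement, not the paper's STE: the abstract does not say how critical tokens are identified or where replacements come from, so the per-token `scores`, the `proposals` array, and the function name are all hypothetical.

```python
import numpy as np


def selective_token_edit(tokens, proposals, scores, k):
    """Replace the k highest-scoring tokens with their proposed values.

    tokens, proposals : arrays of shape (num_tokens, dim)
    scores            : hypothetical per-token importance, shape (num_tokens,)
    """
    num_tokens = scores.size
    if not 0 <= k <= num_tokens:
        raise ValueError("k must be between 0 and the number of tokens")
    # Indices of the k largest scores (an empty selection when k == 0).
    selected = np.argsort(scores)[num_tokens - k:]
    edited = tokens.copy()  # every unselected token passes through unchanged
    edited[selected] = proposals[selected]
    return edited, np.sort(selected)
```

Because only `k` rows change, the rest of the 1D token sequence is preserved exactly, which is the sense in which such an edit is lightweight.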

2606.12289 2026-06-11 cs.LG cs.AI cs.NE New submission

The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics

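The Standard Interpretable Model: A General Theory of Interpretable Machine Learning, Based on Lagrangian Mechanics, for Deductively Designing Interpretable Methods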

Pietro Barbiero, Giovanni De Felice, Mateo Espinosa Zarlenga, Francesco Giannini, Filippo Bonchi, Mateja Jamnik, Giuseppe Marra, Ruggero Noris

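AI summary: Proposes the Standard Interpretable Model (SIM), which uses Lagrangian mechanics to deduce interpretability symmetries and constraints from a set of premises and obtains optimal interpretable models by minimizing a Lagrangian, addressing limitations of existing methods and guiding the design of new ones.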

详情
AI中文摘要

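As artificial intelligence models grow more complex, interpretability has become an indispensable tool for understanding, debugging, and controlling their computations. However, interpretability lacks a general theory for deductively designing interpretable methods. This gap between theory and methods has led to a fragmented literature and inconsistent evaluation protocols. To fill this gap, we introduce the Standard Interpretable Model (SIM), a general theory based on Lagrangian mechanics that enables the deductive design of interpretable methods. Specifically, the SIM summarizes, in a set of premises, what interpretability means for a target user. Starting from these premises, the SIM systematically derives interpretability symmetries and the corresponding constraints, which shape the landscape of a Lagrangian whose minima correspond to optimal interpretable models. To reach the minima, one can either update the parameter values of an opaque model to make it more interpretable or compile the constraints into an interpretable architecture. We show experimentally that the SIM identifies and resolves limitations of existing methods (including traditional, concept-based, and mechanistic interpretability), highlights underexplored research directions, and guides the design of core programming interfaces. Beyond serving as a research method, the deductive nature of the SIM provides a pedagogical foundation for interpretability curricula and may change how the scientific community views this long-fragmented discipline.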

English abstract

As Artificial Intelligence models grow in complexity, interpretability has become an indispensable tool for understanding, debugging, and controlling their computations. However, interpretability lacks general theories to deductively design interpretable methods. This gap between theories and methods results in a fragmented literature and inconsistent evaluation protocols. To fill this gap, we introduce the Standard Interpretable Model (SIM), a general theory grounded in Lagrangian mechanics that enables the deductive design of interpretable methods. Specifically, the SIM summarises, in a set of premises, what interpretability is for a target user. From these premises, the SIM systematically derives interpretability symmetries and corresponding constraints, which shape the landscape of a Lagrangian whose minima correspond to optimal interpretable models. To reach the minima, one can either update the parameter values of an opaque model to make it more interpretable or compile constraints into an interpretable architecture. We empirically show that the SIM identifies and solves limitations of existing methods (including traditional, concept-based, and mechanistic interpretability), highlights underexplored research directions, and informs the design of core programming interfaces. Beyond being a research method, the deductive nature of the SIM offers pedagogical grounding for interpretability curricula and may shift the scientific community's perspective of a discipline that has long been fragmented.