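arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and other areas.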

2606.19559 2026-06-19 cs.AI cs.CL New submission

Uncertainty Decomposition for Clarification Seeking in LLM Agents


Gregory Matsnev

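Affiliations: AI Talent Hub, ITMO University

AI summary: Proposes a prompt-based uncertainty decomposition that separates action confidence from request uncertainty, so that an agent can proactively seek clarification when the task specification is ambiguous. Averaged over five LLM backbones, clarification F1 on ALFWorld-Clarification improves by 73% over ReAct+UE and by 36% over UAM.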

Comments 26 pages, 8 figures. Source code: https://github.com/PE51K/udcs-in-llm-agents


Abstract

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.
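
Illustrative sketch (not taken from the paper): the decomposition described above can be pictured as two separate scores, with the agent asking for clarification when request uncertainty is high. The names `Estimate` and `should_ask` and the 0.5 threshold are assumptions made for this example. The F1 helper is the standard precision/recall F1 applied to ask-or-not decisions and may differ from the paper's exact scoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Estimate:
    action_confidence: float    # confidence in the chosen action, in [0, 1]
    request_uncertainty: float  # "u": how ambiguous the task request is, in [0, 1]


def should_ask(est: Estimate, u_threshold: float = 0.5) -> bool:
    """Ask for clarification when request uncertainty is high (hypothetical rule)."""
    return est.request_uncertainty >= u_threshold


def clarification_f1(asked: List[bool], underspecified: List[bool]) -> float:
    """F1 of ask decisions against ground-truth underspecification labels."""
    pairs = list(zip(asked, underspecified))
    tp = sum(1 for a, u in pairs if a and u)        # asked, and task was ambiguous
    fp = sum(1 for a, u in pairs if a and not u)    # asked, but task was clear
    fn = sum(1 for a, u in pairs if u and not a)    # ambiguous, but did not ask
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Keeping the two scores apart means a confident action on an ambiguous request can still trigger a question, which a single merged confidence value could not express.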


2606.19782 2026-06-19 cs.AI cs.CL New submission

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA


Aravind Narayanan, Shaina Raza

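Affiliations: Vector Institute

AI summary: Proposes AgentFinVQA, a multi-agent pipeline that decomposes each query into steps and records them in a traceable Model Evaluation Packet, bringing auditability and on-premise deployment to financial chart QA. Accuracy on FinMME improves by 7.68 percentage points with a Gemini-3 Flash backbone.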


Abstract

Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves by +7.68 pp over a primary-backbone matched zero-shot baseline with a proprietary backbone (Gemini-3 Flash; 71.24% vs. 63.56%, McNemar p ≈ 1.1 × 10^-16), and by +4.84 pp with open-weights Qwen3.6-27B-FP8 served locally. The verifier's verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human-in-the-loop review routing. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.
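
For readers unfamiliar with the significance test cited above, here is a minimal sketch of a two-sided exact McNemar test on paired predictions. The function name is hypothetical, the abstract does not report the discordant-pair counts, and the authors may have used a different variant of the test (for example the chi-squared approximation).

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: samples the baseline got right and the pipeline got wrong
    c: samples the baseline got wrong and the pipeline got right
    Under the null hypothesis, each discordant pair falls on either side
    with probability 1/2, so min(b, c) follows a Binomial(b + c, 0.5) tail.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the samples on which the two systems disagree enter the test, which is why it suits a paired comparison on the same benchmark. The reported gain is also consistent with the two accuracies: 71.24 - 63.56 = 7.68 percentage points.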


2606.19893 2026-06-19 cs.AI New submission

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments


Wei Yu, Suxing Liu, Minjie Yu, Jiahao Wang, Zhijian Zheng, Haocheng Deng, Bing Li

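Affiliations: School of Digital Arts, Jiangxi Arts & Ceramics Technology Institute; Universiti Sains Malaysia

AI summary: Proposes the MetaResearcher framework, which scales the training of deep research agents in adversarial environments through an evolving virtual world, discovery-oriented tasks, a self-reflective meta-reward, and a heterogeneous multi-agent architecture. It targets gains in benchmark performance and epistemic robustness; the experimental validation is still planned.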


Abstract

Deep research agents have demonstrated remarkable capabilities in autonomous information gathering and synthesis, yet their training remains constrained by the static nature of simulated environments, the limits of fact-retrieval-only task designs, and the inefficiency of outcome-based reinforcement learning. In this work, we propose MetaResearcher, a novel framework that scales deep research agent training across four synergistic dimensions. First, we introduce an Evolving Virtual World that injects temporal dynamics and adversarial misinformation into the training environment, forcing agents to develop source credibility assessment and temporal conflict resolution skills. Second, we design Discovery-Oriented Tasks -- including hypothesis generation and contradiction resolution -- that transcend simple fact retrieval and push agents toward genuine research behaviors. Third, we propose a Self-Reflective Meta-Reward mechanism within the GRPO framework that jointly optimizes for answer correctness, search path efficiency, reflection depth, and tool call diversity, directly addressing the repetitive action loop problem observed in prior work. Fourth, we introduce a Heterogeneous Multi-Agent Swarm architecture comprising specialized Scout, Filter, and Synthesizer models that learn collaborative research strategies through coordinated reinforcement learning. Built upon the LiteResearcher infrastructure, MetaResearcher requires zero marginal API cost for training while targeting substantial improvements in both benchmark performance (GAIA, Xbench-DS) and epistemic robustness under adversarial conditions. We present the complete framework design, training methodology, and planned experimental validation.
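
As a rough illustration of the third component, the sketch below folds the four reward terms named in the abstract into one scalar and then applies group-relative normalization, the core idea of GRPO. The weights, the function names, and the use of the population standard deviation are assumptions; the abstract does not specify any of them.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical weights for the four terms named in the abstract.
WEIGHTS = {
    "correctness": 1.0,
    "efficiency": 0.2,
    "reflection": 0.2,
    "diversity": 0.1,
}


def meta_reward(components: Dict[str, float]) -> float:
    """Weighted sum of the per-rollout reward components."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are computed relative to the group, a rollout that is correct but stuck in a repetitive loop would score below an equally correct rollout with better efficiency and diversity terms.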


2606.20122 2026-06-19 cs.AI cs.MA New submission

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research


Zhibang Yang, Xinke Jiang, Yuzhen Xiao, Ruizhe Zhang, Yue Fang, XinFei Wan, Zhengxing Song, Yuxuan Liu, Yuheng Huang, Xu Chu, Junfeng Zhao, Yasha Wang

Affiliations: National Engineering Research Center of Software Engineering, Peking University; School of Computer Science, Peking University; Key Laboratory of High Confidence Software Technologies, Ministry of Education; GRG Banking Equipment Co., Ltd.; Center on Frontiers of Computing Studies, Peking University; Peking University Information Technology Institute (Tianjin Binhai)

AI summary: Proposes ScaffoldAgent, a framework that counters scaffold drift in open-ended deep research through utility-guided dynamic outline optimization (Expansion, Contraction, and Revision operations), improving long-form report generation and factual grounding on DeepResearch Bench and DeepResearch Gym.

Comments 9 pages, 6 figures


Abstract

Open-ended deep research (OEDR) requires systems to acquire knowledge through multi-round retrieval and generate coherent long-form reports. The outline plays a central role as a structural scaffold that coordinates retrieval, evidence organization, and generation. However, existing methods either fix the outline before writing or refine it with local heuristics, leading to scaffold drift under continuous information accumulation and delayed feedback for evaluating outline modifications. We propose ScaffoldAgent, a utility-guided dynamic outline optimization framework for OEDR. ScaffoldAgent models outline evolution as a structured decision process with three operations: Expansion, Contraction, and Revision, enabling controlled updates to the report scaffold. It further introduces a utility-guided feedback mechanism that estimates the downstream value of each outline operation from retrieval gain, structural coherence, and trial-generation quality. The resulting utility signal guides node selection, operation scheduling, and termination during inference. Experiments on DeepResearch Bench and DeepResearch Gym show that ScaffoldAgent consistently improves long-form report generation and factual grounding over existing deep research agents.


2606.20142 2026-06-19 cs.AI cs.MA New submission

RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning


Antón Asla Manzárraga

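AI summary: Proposes RACL, which places a reasoning agent above a metaheuristic optimizer and controls its search behavior by observing, reasoning, and intervening. On a vehicle routing testbed, average cost improves by 0.641% versus STP and, in the Sevilla-9/10 runtime sample, by 8.337% versus Fixed.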

Comments 10 pages, 5 tables


Abstract

This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer and does not modify business constraints. Instead, it controls the optimizer's internal search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies and explaining its decisions. The experiment uses vehicle routing as a testbed, but the contribution is not a new routing solver, a particular ALNS configuration or a specific set of routing rules. The contribution is the RACL method: a way for a reasoning agent to discover, validate, consolidate and explain algorithmic control rules for a metaheuristic. In the current experimental setting, RACL improves or ties the Operational Memory Policy in 21 of 21 feasible cases and improves or ties a non-reasoning Stagnation-Triggered Policy in 18 of 21 feasible cases, with an average RACL vs STP cost delta of -0.641%. In the Sevilla-9/10 runtime sample, RACL improves average cost by -8.337% versus Fixed and -1.605% versus STP without showing material computational overhead. During the proof-of-concept, Codex was used as an in-the-loop reasoning agent observing executions, interpreting logs and proposing live bounded interventions. The policy proxy was later used only to make quantitative evaluation reproducible.


2606.20363 2026-06-19 cs.AI New submission

OnDeFog: Online Decision Transformer under Frame Dropping

Daiki Yotsufuji, Kenta Nishihara, Shoma Shimizu, Kento Uchida, Shinichi Shirakawa

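Affiliations: Yokohama National University

AI summary: To counter the performance degradation caused by frame dropping, proposes OnDeFog, which combines the DeFog mechanisms with the online decision transformer (ODT) and learns policies through direct environment interaction. It outperforms ODT in environments with a high frame-dropping rate and outperforms DeFog on datasets dominated by low-reward data.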

Comments Accepted to PRICAI 2025


DOI: 10.1007/978-981-95-7072-0_10


Abstract

In challenging real-world reinforcement learning applications, communication delays or sensor failures often cause frame dropping, in which the agent cannot receive the dropped states and associated rewards. To address the performance degradation caused by frame dropping, the Decision Transformer under Random Frame Dropping (DeFog) was developed by incorporating additional mechanisms into the decision transformer to tackle frame dropping. Although DeFog can mitigate performance degradation in frame-dropping environments, it is an offline learning method and therefore struggles to generalize effectively to novel states not adequately represented in the training dataset. In this study, we propose OnDeFog, which integrates the mechanisms in DeFog with the online decision transformer (ODT), an online reinforcement learning method that learns policies through direct environmental interaction. Comprehensive experimental evaluation demonstrates that our proposed OnDeFog achieves superior performance compared to ODT in environments characterized by a high frame-dropping rate and outperforms DeFog on datasets containing a large amount of low-reward data.


2606.19729 2026-06-19 cs.RO cs.AI Cross-list

VOiLA: Vectorized Online Planning with Learned Diffusion Model for POMDP Agents


Marcus Hoerger, Rishikesh Joshi, Rahul Shome, Ian Manchester, Hanna Kurniawati

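Affiliations: Australian National University; The University of Sydney

AI summary: Proposes VOiLA, a framework that learns POMDP models with conditional diffusion models, distills the samplers for fast sampling, and integrates them with a vectorized online planner, enabling efficient online planning on three benchmark tasks and a physical robot.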

Comments Submitted to the 2026 International Symposium of Robotics Research (ISRR)


Abstract

Planning under uncertainty is an essential capability for autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for such a capability. Although POMDP-based planning has advanced significantly, its application to real-world problems is often limited by the difficulty of obtaining faithful POMDP models. We present Vectorized Online planning wIth Learned diffusion model for POMDP Agents (VOiLA), a framework that learns task-agnostic POMDP models for online planning under uncertainty. VOiLA learns transition and observation samplers using conditional diffusion models and learns observation-likelihood models for particle-based belief updates. To enable efficient online planning, the diffusion samplers are distilled into compact feedforward generators and integrated with Vectorized Online POMDP Planner (VOPP), an online POMDP planner designed to leverage GPU parallelization. Experimental results indicate that the distillation strategy reduces sampling cost by up to nearly three orders of magnitude, making learned generative POMDP models practical for online planning. Evaluation of VOiLA on three benchmark problems indicates that VOiLA achieves equal or better performance than Recurrent Soft Actor Critic while using less than 10% of the training data, and generalizes much better to unseen environment configurations. Physical robot evaluation indicates that VOiLA, using models learned only from simulated data, generates a policy that successfully accomplishes the task in 10 of 10 runs.
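
The particle-based belief update mentioned above can be sketched generically as propagate, weight, resample. This is a textbook bootstrap particle filter step, not VOiLA's implementation: `sample_transition` and `obs_likelihood` are hypothetical stand-ins for the learned transition sampler and observation-likelihood model, and the sequential Python loop ignores the GPU vectorization the paper relies on.

```python
import random
from typing import Callable, List, TypeVar

State = TypeVar("State")


def belief_update(
    particles: List[State],
    action,
    observation,
    sample_transition: Callable,
    obs_likelihood: Callable,
    rng: random.Random,
) -> List[State]:
    # 1. Propagate every particle through the transition sampler.
    proposed = [sample_transition(state, action, rng) for state in particles]
    # 2. Weight each propagated particle by how well it explains the observation.
    weights = [obs_likelihood(observation, state, action) for state in proposed]
    if sum(weights) <= 0.0:
        # Degenerate case: no particle explains the observation; keep the proposals.
        return proposed
    # 3. Resample with replacement in proportion to the weights.
    return rng.choices(proposed, weights=weights, k=len(proposed))
```

With a toy model where the state is an integer, the action is added to it, and the likelihood is 1 only for an exact match, particles `[0, 1, 2]` under action `1` and observation `2` all collapse onto state `2`.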


2606.19992 2026-06-19 cs.SE cs.AI Cross-list

Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services


Mugeng Liu, Shuoqi Li, Yixuan Zhang, Yun Ma

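AI summary: Proposes ToolPro, which represents tool intent as an executable program and combines constraint-guided construction, effect-aware replay, and a profile-driven policy decision. On MCP-style services it cuts end-to-end latency by up to 53.4% and client-side traffic by up to 96.1%.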

Comments Accepted by ICML 2026


Abstract

In the agentic web era, LLM-based agents increasingly invoke web services as tools, yet most interfaces remain static endpoints that poorly express long-horizon workflows with loops, conditionals, joins, and retries. We present ToolPro, which represents an agent's tool intent as an executable tool program that compactly encodes multi-step service interactions with explicit effect types. ToolPro combines constraint-guided program construction, effect-aware replay for exactly-once state-modifying calls, and a profile-driven policy that decides when program execution outperforms stepwise calling. We instantiate ToolPro over MCP-style services with WebAssembly sandboxing and evaluate it on diverse workflows of real-world applications. ToolPro reduces end-to-end latency by up to 53.4% and client-side traffic by up to 96.1%, with larger gains under higher network latency and workflow complexity.


2606.20002 2026-06-19 cs.LG cs.AI cs.CL Cross-list

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning


Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou

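Affiliations: Alibaba Group

AI summary: Proposes the Connect the Dots framework, which uses end-to-end reinforcement learning to train LLMs to self-update their context over long task sequences and to generalize to new domains. Experiments indicate potential for cross-domain generalization.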

Comments Work in progress; we will continuously update the codebase and arXiv version


Abstract

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization -- within the training domains, across different domains, and from CoD to Ralph-loop settings -- of the elicited meta-capability. Our investigation of CoD connects several lines of prior work, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.


2606.20120 2026-06-19 cs.RO cs.AI Cross-list

Dual-Agent Framework for Cross-Model Verified Translation of Natural-Language Protocols into Robotic Laboratory Platform


Hyeonna Choi, Jung Yup Kim, Hyuneui Lim, Seunggyu Jeon

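AI summary: Proposes a dual-agent framework in which a Parser Agent formalizes the protocol, a rule-based mapping engine generates control commands, and a heterogeneous LLM Validation Agent detects and corrects errors, converting natural-language microplate protocols into executable commands for a robotic laboratory platform. End-to-end autonomous execution is validated on a real experiment.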


Abstract

Biological experiment protocols are written in natural language, whereas automation systems rely on predefined control commands, creating a semantic gap that limits autonomous execution. Microplate-based automatic experiments are particularly challenging due to the need to simultaneously control well mapping, sample-reagent combinations, replicate placement, and parallel dispensing. This study proposes an agent-based protocol translation framework that converts natural-language microplate-based protocols into executable control commands for a robotic laboratory platform. A Parser Agent formalizes the natural-language protocol into a structured representation, and a rule-based mapping engine deterministically incorporates the operational constraints of the robotic laboratory platform to generate device-level control commands. A heterogeneous LLM Validation Agent verifies completeness, parameter accuracy, and execution order, and triggers a self-correction loop with structured feedback when errors are detected. A sweep involving 7 Parsers and 3 Validators on randomly selected ELISA protocols evaluates how model scale and Validator type affect translation accuracy and pass rates under cross-model verification. The accuracy-latency trade-off is further verified by comparing the rule-based mapping of the proposed framework with LLM end-to-end direct mapping. Finally, Bradford assay-based protein quantification using a microplate was demonstrated on a robotic laboratory platform, validating end-to-end autonomous execution from natural-language protocols to real-world experiments. The proposed framework provides a flexible approach to narrowing the semantic gap between natural-language protocols and microplate-based self-driving laboratories.


2606.20373 2026-06-19 cs.SE cs.AI Cross-list

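Oranits: Mission Assignment and Task Offloading in Open RAN-Based Intelligent Transportation Systems with Metaheuristic and Deep Reinforcement Learning Approaches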

Ngoc Hung Nguyen, Nguyen Van Thieu, Quang-Trung Luu, Anh Tuan Nguyen, Senura Wanasekara, Nguyen Cong Luong, Fatemeh Kavehmadavani, Van-Dinh Nguyen

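Affiliations: Department of Smart City, Hanyang University

AI summary: Proposes the Oranits system model, which accounts for mission dependencies and offloading costs under vehicle cooperation and optimizes them with the CGG-ARO metaheuristic and the MA-DDQN deep reinforcement learning framework. Completed missions improve by about 7.1% and 11.0%, and overall benefit by about 7.7% and 12.5%, respectively.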

Comments 16 pages, 13 figures

Journal ref IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2026


Abstract

In this paper, we explore mission assignment and task offloading in an Open Radio Access Network (Open RAN)-based intelligent transportation system (ITS), where autonomous vehicles leverage mobile edge computing for efficient processing. Existing studies often overlook the intricate interdependencies between missions and the costs associated with offloading tasks to edge servers, leading to suboptimal decision-making. To bridge this gap, we introduce Oranits, a novel system model that explicitly accounts for mission dependencies and offloading costs while optimizing performance through vehicle cooperation. To achieve this, we propose a twofold optimization approach. First, we develop a metaheuristic-based evolutionary computing algorithm, namely the Chaotic Gaussian-based Global ARO (CGG-ARO), serving as a baseline for one-slot optimization. Second, we design an enhanced reward-based deep reinforcement learning (DRL) framework, referred to as the Multi-agent Double Deep Q-Network (MA-DDQN), that integrates both multi-agent coordination and multi-action selection mechanisms, significantly reducing mission assignment time and improving adaptability over baseline methods. Extensive simulations reveal that CGG-ARO improves the number of completed missions and overall benefit by approximately 7.1% and 7.7%, respectively. Meanwhile, MA-DDQN achieves even greater improvements of 11.0% in terms of mission completions and 12.5% in terms of the overall benefit. These results highlight the effectiveness of Oranits in enabling faster, more adaptive, and more efficient task processing in dynamic ITS environments.
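
As background for the MA-DDQN name, the sketch below shows the standard Double DQN bootstrap target for a single agent: the online network picks the next action and the target network evaluates it. It is not the paper's method. The multi-agent coordination and multi-action selection mechanisms are not specified in the abstract and are left out, and the function name and default discount factor are assumptions.

```python
from typing import Sequence


def double_dqn_target(
    reward: float,
    next_q_online: Sequence[float],
    next_q_target: Sequence[float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Compute y = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    if done:
        return reward  # no bootstrapping from a terminal state
    # Action selection uses the online network's estimates...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...while the value of that action comes from the target network.
    return reward + gamma * next_q_target[best_action]
```

Decoupling selection from evaluation is what distinguishes Double DQN from plain DQN, which takes the maximum over the target network's own estimates and tends to overestimate values.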


2601.16233 2026-06-19 cs.SI cs.AI Replacement

Lean-Based Process-Verified Reinforcement Learning for Theorem Proving

Minsu Kim, Se-Young Yun

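Affiliations: KAIST AI

AI summary: Uses the Lean proof assistant as a source of process-level verification signals and combines them with a GRPO-style reinforcement learning objective, improving theorem-proving performance through tactic-level supervision.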


Abstract

While reinforcement learning from verifiable rewards (RLVR) has typically relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective with first-error propagation and first-token credit methods that balance outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.
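
One way to picture tactic-level credit built from "locally sound steps and the earliest failing step" is sketched below. The scheme (positive credit before the first error, a penalty at it, nothing after it) and every name and value are assumptions made for illustration; the paper's first-error propagation and first-token credit methods may be defined differently.

```python
from typing import List, Optional


def tactic_credits(
    num_tactics: int,
    first_error: Optional[int],
    good: float = 1.0,
    bad: float = -1.0,
) -> List[float]:
    """Assign one credit value per tactic in a proof attempt.

    first_error is the index of the earliest failing tactic as reported by
    the verifier, or None when the whole attempt checks.
    """
    if first_error is None:
        return [good] * num_tactics
    credits = []
    for i in range(num_tactics):
        if i < first_error:
            credits.append(good)   # verified as locally sound
        elif i == first_error:
            credits.append(bad)    # earliest failing step
        else:
            credits.append(0.0)    # unverifiable after the failure
    return credits
```

Compared with a single pass/fail reward for the whole proof, such a signal still rewards the sound prefix of a failed attempt.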


2606.20526 2026-06-19 cs.AI New submission

DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs


Saimun Habib, Vaishak Belle, Fengxiang He

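Affiliations: University of Edinburgh

AI summary: Proposes DeepSWIP, a single-world counterfactual semantics for DeepProbLog programs that performs exact counterfactual inference through neural materialization, SWIPs, and weighted model counting. Experiments show a 2.14× inference speedup over the twin-network construction.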


Abstract

Neurosymbolic systems such as DeepProbLog combine neural perception with probabilistic logic, but standard inference is associational. Counterfactual reasoning additionally requires a causal semantics for interventions and evidence. We introduce DeepSWIP, a single-world counterfactual semantics for DeepProbLog programs. Using neural materialization, we reduce fixed-context neural predicates to ordinary ProbLog choices, apply Single World Intervention Programs (SWIPs), and compute counterfactuals by weighted model counting (WMC) over a single transformed program. Under finite grounding and unique-supported-model assumptions, DeepSWIP is exact relative to the learned materialized FCM. The standard quotient-WMC form of ProbLog conditionals identifies active neural probabilities and explains intervention cleaning, calibration sensitivity, and rare-evidence instability. Experiments on MPI3D confirm the transformation against a DeepTwin construction on 12,000 queries and, as predicted, show a 2.14× inference speedup from avoiding the Twin's endogenous duplication. A SUMO HOV experiment shows that neural calibration degradation biases plug-in estimates, while a correctly scoped randomized-policy AIPW estimator removes most first-order bias for population mean and ATE estimands. Code is at https://github.com/saibib/deep_SWIP.
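
The quotient form of a conditional query, P(q | e) = WMC(q and e) / WMC(e), can be illustrated by brute-force enumeration over independent probabilistic facts. This is a toy sketch, not the DeepSWIP transformation: the function names are hypothetical, real ProbLog engines use knowledge compilation rather than enumeration, and the intervention step is not modelled.

```python
from itertools import product
from typing import Callable, Dict


def wmc(facts: Dict[str, float], holds: Callable[[Dict[str, bool]], bool]) -> float:
    """Sum the weights of all truth assignments (worlds) where `holds` is true."""
    names = list(facts)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if holds(world):
            weight = 1.0
            for name in names:
                weight *= facts[name] if world[name] else 1.0 - facts[name]
            total += weight
    return total


def conditional(facts, query, evidence) -> float:
    """Quotient of two weighted model counts: WMC(query and evidence) / WMC(evidence)."""
    denominator = wmc(facts, evidence)
    if denominator == 0.0:
        raise ValueError("evidence has zero probability")
    return wmc(facts, lambda w: query(w) and evidence(w)) / denominator
```

The quotient also hints at why rare evidence can be unstable: when the denominator is tiny, small errors in the fact probabilities move the ratio a great deal.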


2606.19399 2026-06-19 cs.LG cs.AI cs.LO cs.PL Cross-list

Exit-and-Join Dynamics for Decentralized Coalition Formation

Quanyan Zhu

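Affiliations: New York University Tandon School of Engineering; Department of Electrical and Computer Engineering

AI summary: Studies decentralized coalition-formation dynamics driven by unilateral exit and join decisions, computes individual payoffs with the Aumann-Drèze value, links cooperative payoff allocation to noncooperative best responses, and analyzes equilibrium properties and the effect of costs on local stability.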
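
For context, the Aumann-Drèze value gives each player their Shapley value in the game restricted to their own coalition. The sketch below computes it by enumerating orderings, which is only practical for small coalitions. The function name and the toy characteristic function in the usage note are hypothetical and do not come from the paper.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, List


def aumann_dreze_value(
    partition: Iterable[Iterable[Hashable]],
    v: Callable[[FrozenSet], float],
) -> Dict[Hashable, float]:
    """Shapley value of each player within their own coalition of the partition."""
    payoff: Dict[Hashable, float] = {}
    for coalition in partition:
        members: List[Hashable] = list(coalition)
        orders = list(permutations(members))
        for player in members:
            payoff[player] = 0.0
        for order in orders:
            seen: FrozenSet = frozenset()
            for player in order:
                # Marginal contribution of `player` when joining those already seen.
                payoff[player] += v(seen | {player}) - v(seen)
                seen = seen | {player}
        for player in members:
            payoff[player] /= len(orders)  # average over all join orders
    return payoff
```

With the toy worth v(S) = |S|^2 and the partition {1, 2}, {3}, players 1 and 2 split the coalition worth of 4 equally and player 3 keeps 1. A unilateral exit or join changes the partition, so comparing a player's value before and after the move gives the kind of payoff comparison the summary refers to.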

2606.19911 2026-06-19 cs.AI cs.CL cs.IR New submission

Multi-Agent Transactive Memory


To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； University of California, Berkeley（加州大学伯克利分校）

AI总结提出MATM框架，通过共享存储和检索智能体轨迹，实现异构智能体群体间的知识复用，提升下游任务性能并减少交互步骤。

AI中文摘要

具有不同能力的LLM智能体在多样化任务中的去中心化部署，激发了跨异构智能体群体知识共享的基础设施需求。正如搜索引擎索引人类生成的工件以支持人类问题解决，检索系统可以组织智能体生成的工件以供跨智能体群体重用。我们将检索增强生成（展示了人类创作工件对单个智能体的价值）扩展到检索智能体生成的工件以支持智能体群体。特别是，智能体轨迹编码了可重用的程序性知识，然而这些工件通常在一次使用后被丢弃或仅由产生智能体保留，迫使新实例化的智能体反复重新发现现有解决方案。我们提出了多智能体交互记忆（MATM），一个用于群体级存储和检索智能体生成轨迹的框架，其中生产者智能体将轨迹贡献到共享仓库，消费者智能体检索它们以改进任务执行。我们专注于交互环境（ALFWorld和WebArena），其中轨迹较长且编码了特别丰富的程序性结构。我们的实验表明，从MATM检索轨迹可提高下游任务性能并减少交互步骤，无需协调或联合训练。这些结果将MATM定位为开放智能体生态系统中群体级经验共享的设计模式。

英文摘要

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.
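
The producer/consumer pattern described above can be sketched as a small shared store. This is an illustration under stated assumptions, not the MATM implementation: the class name `TrajectoryStore`, its methods, and the word-overlap (Jaccard) scoring are all hypothetical stand-ins for whatever retriever the authors use.

```python
class TrajectoryStore:
    """Toy shared repository of agent trajectories (hypothetical API)."""

    def __init__(self):
        self._items = []

    def contribute(self, task, steps, producer):
        # A producer agent adds a finished trajectory to the shared pool.
        self._items.append({"task": task, "steps": list(steps), "producer": producer})

    def retrieve(self, task, k=1):
        # A consumer agent fetches the k trajectories whose task description
        # overlaps most with its own task (Jaccard similarity over words).
        query = set(task.lower().split())

        def score(item):
            words = set(item["task"].lower().split())
            union = query | words
            return len(query & words) / len(union) if union else 0.0

        return sorted(self._items, key=score, reverse=True)[:k]


if __name__ == "__main__":
    store = TrajectoryStore()
    store.contribute("heat the mug in the microwave", ["go to mug", "take mug", "heat mug"], "agent-a")
    store.contribute("book a flight on the travel site", ["open site", "search", "pay"], "agent-b")
    print(store.retrieve("heat a mug", k=1)[0]["steps"])
```

No coordination between agents is needed in this sketch: producers only append and consumers only read, which matches the "without coordination or joint training" framing of the abstract.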

在拉取请求之前：挖掘多智能体协调

Dipankar Sarkar

AI总结针对自主编码智能体在拉取请求中协调不足的问题，提出基于git的协调基板grite，通过事件日志减少重复和冲突工作，提升吞吐量，并自动恢复多种故障模式。

Comments 9 pages, 2 tables. LNCS format. Code, dataset, and mining toolkit: https://github.com/neul-labs/grite

AI中文摘要

自主编码智能体现在可以开启数百万个拉取请求，然而大规模研究发现，它们的拉取请求虽然生成更快，但被接受的频率却更低——这是一个拉取请求级别的遥测无法解释的协调与信任差距。我们认为缺失的信号存在于拉取请求之前，即并发智能体如何声明、划分和碰撞共享工作。我们通过grite（我们的开源协调基板）来研究这一过程，它不需要中央服务器，并将其记录存储在git本身内部，因此其仅追加的、签名的事件日志直接捕获了协调过程。我们证明：(i) 这种共享基板以有限的开销减少了重复和冲突工作——仅重复队友任务的工作份额从78%降至0%，而有效吞吐量增加了三倍以上；(ii) 每个智能体的日志副本收敛到相同状态，没有写入被静默丢弃，而基于文件的跟踪器会丢失并发写入；(iii) 该日志是一个可挖掘的工件，从中可以自动恢复具体的故障模式——冲突编辑、锁饥饿、冗余发现、竞态关闭——并带有来源信息，其中一些在拉取请求历史中是不可见的。我们发布了数据集、测试平台和挖掘工具包。

英文摘要

Autonomous coding agents now open millions of pull requests, yet large-scale studies find their PRs are produced faster but accepted less often - a coordination and trust gap that pull-request-level telemetry cannot explain. We argue the missing signal lives before the PR, in how concurrent agents claim, divide, and collide over shared work. We study this process through grite, our open-source coordination substrate that needs no central server and stores its records inside git itself, so its append-only, signed event log captures the coordination process directly. We show that (i) this shared substrate reduces duplicate and conflicting work at bounded overhead - the share of work that merely re-does a teammate's task falls from 78% to 0% while useful throughput more than triples; (ii) every agent's copy of the log converges to the same state with no write silently dropped, where a file-based tracker loses concurrent writes; and (iii) the log is a mineable artefact from which concrete failure modes - conflicting edits, lock starvation, redundant rediscovery, race-to-close - are automatically recoverable with provenance, several invisible in pull-request history. We release the dataset, harness, and mining toolkit.
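
Claim (ii), that every agent's copy of an append-only log converges to the same state, can be illustrated with a minimal sketch. It assumes each event is immutable and carries a unique `id`; the event fields (`id`, `ts`, `agent`, `type`, `task`) are hypothetical and are not grite's actual record format.

```python
def merge_logs(*logs):
    """Union events by unique ID and return them in a deterministic order.

    Because events are append-only and immutable, the merge is commutative
    and idempotent: every replica ends up with the same sequence.
    """
    merged = {}
    for log in logs:
        for event in log:
            merged.setdefault(event["id"], event)
    return sorted(merged.values(), key=lambda e: (e["ts"], e["id"]))


def contested_tasks(log):
    """Mine the log for tasks claimed by more than one agent."""
    claims = {}
    for event in log:
        if event["type"] == "claim":
            claims.setdefault(event["task"], set()).add(event["agent"])
    return sorted(task for task, agents in claims.items() if len(agents) > 1)


if __name__ == "__main__":
    log_a = [{"id": "e1", "ts": 1, "agent": "a", "type": "claim", "task": "T1"}]
    log_b = [{"id": "e2", "ts": 2, "agent": "b", "type": "claim", "task": "T1"}]
    print(merge_logs(log_a, log_b) == merge_logs(log_b, log_a))
    print(contested_tasks(merge_logs(log_a, log_b)))
```

A file-based tracker that overwrites a shared file has no such property: the last writer wins and the other write is lost, which is the failure the abstract contrasts against.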

2606.19632 2026-06-19 cs.RO cs.AI cs.LG cs.LO cs.MA 交叉投稿

Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation

通过决策树蒸馏对学习到的多智能体通信策略进行形式化验证

Ahmad Farooq, Kamran Iqbal

发表机构 * University of Arkansas at Little Rock（阿肯色大学小石城分校）

AI总结提出通过决策树蒸馏将多智能体强化学习策略转化为可解释模型，并利用PRISM进行形式化验证，确保安全属性转移至原始网络，在无人机编队任务中实现88.9%属性满足率。

Comments 9 pages, 3 figures, 7 tables. Accepted at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026), Pittsburgh, Pennsylvania, USA, September 27-October 1, 2026

AI中文摘要

多智能体强化学习使智能体能够通过涌现通信发展协调策略，但神经策略缺乏无人机群和自动驾驶车队等安全关键机器人部署所需的形式化安全保证。我们提出了首个通过学习策略抽象进行安全验证的端到端框架：神经策略被蒸馏为可解释的决策树，然后进行形式化验证，并通过经验验证确认验证的安全属性可转移至原始网络。我们的四阶段流程包括：从智能体观测中提取领域特定特征；决策树蒸馏达到97.9% +/- 1.2%的神经策略保真度；自动翻译为PRISM概率模型检查器规范，具有完整的特征到状态变量对应关系；以及通过成对分解、联合界聚合和经验邻居建模对概率计算树逻辑属性进行组合验证。评估用于5-7个智能体多无人机协调的矢量量化变分信息瓶颈策略，我们验证了18个涵盖安全性、活性和合作的时间逻辑属性，实现了88.9%的属性满足率，所有五个安全阈值均满足（碰撞概率0.3% vs 阈值1%）。原始神经策略的蒙特卡洛验证确认验证的安全属性转移偏差<=0.6个百分点（95%置信区间）。离散VQ-VIB消息相比连续方法提供+11.6至+13.6个百分点的保真度优势，实现3-4倍更快的验证。我们的框架为蒸馏策略抽象提供了经验验证的安全验证，作为深度多智能体强化学习与多机器人部署形式化安全工作流之间的实用桥梁。

英文摘要

Multi-agent reinforcement learning (MARL) enables agents to develop coordination strategies through emergent communication, but neural policies lack the formal safety guarantees required for safety-critical robotic deployment in drone swarms and autonomous vehicle fleets. We present the first end-to-end framework for safety verification of learned multi-agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees, then formally verified, with empirical validation confirming that verified safety properties transfer to original networks. Our four-stage pipeline consists of domain-specific feature extraction from agent observations, decision tree distillation achieving 97.9% +/- 1.2% fidelity to neural policies, automated translation to PRISM probabilistic model checker specifications with complete feature-to-state-variable correspondence, and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union-bound aggregation and empirical neighbor modeling. Evaluating Vector-Quantized Variational Information Bottleneck (VQ-VIB) policies for multi-drone coordination with 5-7 agents, we verify 18 temporal logic properties across safety, liveness, and cooperation, achieving 88.9% property satisfaction with all five safety thresholds satisfied (0.3% collision probability vs. 1% threshold). Monte Carlo validation of original neural policies confirms that verified safety properties transfer with <=0.6 percentage-point deviation (95% CI). Discrete VQ-VIB messages provide +11.6 to +13.6 percentage-point fidelity advantages over continuous methods, enabling 3-4x faster verification. Our framework provides empirically validated safety verification for distilled policy abstractions, serving as a practical bridge between deep MARL and formal safety workflows for multi-robot deployment.
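
The "pairwise decomposition with union-bound aggregation" step can be made concrete with a short sketch. It assumes that a model checker has already produced a violation probability for each pair of agents; the function name and the numbers below are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_union_bound(agents, pair_violation_prob):
    """Upper-bound P(any pair violates the property) by summing pairwise terms.

    The union bound P(A1 or ... or Ak) <= P(A1) + ... + P(Ak) holds without any
    independence assumption, so it is conservative but always sound.
    """
    total = sum(pair_violation_prob(a, b) for a, b in combinations(agents, 2))
    return min(1.0, total)


if __name__ == "__main__":
    # Five agents give 10 pairs; assume each pair was verified at 0.0002.
    bound = pairwise_union_bound(range(5), lambda a, b: 0.0002)
    print(bound)
```

The number of pairs grows quadratically with team size, so the bound loosens as agents are added; this is the price of verifying small pairwise models instead of the full joint system.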

2606.20014 2026-06-19 cs.LG cs.AI 交叉投稿

Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution

多智能体博弈中的层次化控制：基于LLM的规划与RL执行

Jannik Hösch, Alessandro Sestini, Florian Fuchs, Amir Baghi, Joakim Bergdahl, Konrad Tollmar, Jean-Philippe Barrette-LaPierre, Linus Gisslén

AI总结提出LLM作为中央策略控制器选择RL技能策略的层次化架构，在2v2对抗环境中达到与手工BT相当的胜率，且被感知为最类人。

Comments 12 pages, 9 figures

AI中文摘要

强化学习（RL）在序列决策中取得了强劲表现，但由于稀疏奖励、大状态-动作空间以及学习协调策略的困难，扩展到复杂多智能体环境仍具挑战。我们提出一种层次化架构，其中预训练的大语言模型（LLM）作为集中式策略控制器，为一组智能体选择专门的RL技能策略，而RL策略负责反应式底层执行。我们在竞争性2v2 King of the Hill环境中评估该混合系统，与行为树（BT）和“扁平”RL（无技能分解的端到端训练）基线进行比较。LLM+RL系统实现了与手工BT统计上相当的任务性能（胜率46.4% vs 51.5%，p=0.103），而两者均显著优于无技能分解训练的扁平RL。一项用户研究（n=15）显示，60%的参与者认为LLM+RL智能体最像人类（p=0.027），归因于行为适应性和战术变异性。这些结果表明，预训练LLM推理可以有效编排预训练RL技能，实现具有竞争力的多智能体协调和优越的感知可信度，而无需手动规则工程。

英文摘要

Reinforcement learning (RL) has achieved strong performance in sequential decision-making, yet scaling to complex multi-agent environments remains challenging due to sparse rewards, large state-action spaces, and the difficulty of learning coordinated strategies. We propose a hierarchical architecture where a pretrained large language model (LLM) acts as a centralized strategic controller that selects among specialized RL skill policies for a team of agents, while RL policies handle reactive low-level execution. We evaluate this hybrid system in a competitive 2v2 King of the Hill environment against behavior tree (BT) and "Flat" RL (end-to-end training without skill decomposition) baselines. The LLM+RL system achieves task performance statistically equivalent to hand-crafted BT (46.4% vs 51.5% win rate, $p=0.103$) while both significantly outperform Flat RL trained without skill decomposition. A user study ($n=15$) reveals that 60% of participants perceive LLM+RL agents as the most human-like ($p=0.027$), citing behavioral adaptability and tactical variability. These results demonstrate that pretrained LLM reasoning can effectively orchestrate pretrained RL skills, achieving competitive multi-agent coordination and superior perceived believability without manual rule engineering.

2606.20485 2026-06-19 q-fin.RM cs.AI nlin.AO physics.soc-ph 交叉投稿

Optimal Order of Multi-Agent and General Many-Body Systems

多智能体与一般多体系统的最优序

Jake J. Xia

AI总结提出一个分析多智能体系统的通用框架，基于智能体的权力和响应函数，推导出宏观性质，并引入风险偏好系数研究增长与韧性之间的权衡，得出最优有序度。

Comments Key Words: Many body systems, multi agent crowd interactions, feedback loops, agent power, response function, utility function, risk appetite, order, optimal order, fragility, mobility, synchronization, useful energy, entropy, concentration, correlation, task dependency, receiver dependency, collective intelligence, AI model scaling law

AI中文摘要

本文开发了一个通用框架，用于分析具有智能体行动与集体观测之间反馈回路的多智能体系统。该框架建立在两个基本的智能体层面变量上：权力，衡量智能体对集体结果的影响；以及响应函数，决定智能体如何对观测做出反应。我们推导了宏观性质（包括总权力、有用权力、熵、有序度、脆弱性和流动性）如何从异质智能体的这两个变量中涌现。为了研究增长与韧性之间的权衡，我们引入了一个由风险偏好系数参数化的系统层面效用函数，并推导出一个平衡生产力、稳定性和适应性的最优有序度。分析表明，更强的同步可以增加集体产出，但也可能增加系统脆弱性并降低流动性。我们进一步论证，有序度、熵、信息和有用能量是任务依赖和系统相对的概念，其含义取决于系统的目标。通过测量和设计智能体的权力分布和响应函数，可能更好地理解、预测和优化集体行为，并识别集体智慧和最优序出现的条件。

英文摘要

This paper develops a general framework for analyzing multi-agent systems with feedback loops between agents' actions and collective observations. The framework is built on two fundamental agent-level variables: power, which measures agent influence on collective outcomes, and response functions, which determine how agents react to observations. We derive how macroscopic properties, including total power, useful power, entropy, order, fragility, and mobility, emerge from these two variables of heterogeneous agents. To study the trade-off between growth and resilience, we introduce a system-level utility function parameterized by a risk-appetite coefficient and derive an optimal degree of order that balances productivity, stability, and adaptability. The analysis suggests that stronger synchronization can increase collective output but may also increase systemic fragility and reduce mobility. We further argue that order, entropy, information, and useful energy are task-dependent and system-relative concepts whose meanings depend on the objectives of the system. By measuring and designing agent power distributions and response functions, it may be possible to better understand, predict, and optimize collective behavior and identify the conditions under which collective intelligence and optimal order emerge.

2606.20493 2026-06-19 cs.LG cs.AI cs.MA 交叉投稿

Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems

传染网络：多智能体LLM系统中的评估者偏见传播

Zewen Liu

发表机构 * Qilu Institute of Technology, School of Software Engineering（齐鲁理工学院软件工程学院）

AI总结提出传染网络框架，量化评估者偏见在多智能体LLM系统中的传播，发现同模型智能体间偏见传播系数为0.157-0.352，且增大评估委员会规模可减少72.4%的传播效应。

Comments 20 pages, 4 figures, 4 tables

AI中文摘要

当大型语言模型在多智能体系统中担任评估者时，其系统性评估偏见会通过智能体网络传播。我们引入传染网络，这是一个用于衡量评估者偏见如何在交互的LLM智能体间传播的正式框架。在使用DeepSeek-chat进行的受控3智能体实验中，我们采用了三种不同的评估者偏见配置文件（结构化、平衡、基于证据），测量了跨智能体传染矩阵Gamma_3，并发现评估者偏见始终在智能体间传播（gamma在[0.157, 0.352]范围内），即使是在相同底层模型内也是如此。我们识别出由谱半径rho(Gamma_N)控制的三种传播机制，并证明同质模型智能体产生的传染系数比先前工作中观察到的跨模型系数弱3-5倍（MM-EPC: gamma约0.85-1.3），使其处于抑制机制中。我们表明，将评估委员会规模从k=1增加到k=3可将有效传染减少72.4%，提供了一种可行的缓解策略。我们发布了开源的传染网络实验框架。

英文摘要

When large language models serve as evaluators in multi-agent systems, their systematic evaluation biases propagate through the agent network. We introduce Contagion Networks, a formal framework for measuring how evaluator biases spread across interacting LLM agents. In a controlled 3-agent experiment using DeepSeek-chat with three distinct evaluator bias profiles (structured, balanced, evidence-based), we measure the Cross-Agent Contagion Matrix Gamma_3 and find that evaluator biases consistently propagate between agents (gamma in [0.157, 0.352]), even within the same underlying model. We identify three propagation regimes governed by the spectral radius rho(Gamma_N), and demonstrate that homogeneous-model agents produce contagion coefficients 3-5x weaker than cross-model coefficients observed in prior work (MM-EPC: gamma approx 0.85-1.3), placing them in the suppression regime. We show that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, providing an actionable mitigation strategy. We release the open-source Contagion Network experimental framework.
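
The role of the spectral radius rho(Gamma_N) can be sketched with a small power iteration. The 3x3 matrix below is hypothetical: its off-diagonal entries are merely chosen inside the reported range [0.157, 0.352], and the zero diagonal is an assumption. The abstract names only the "suppression" regime; the threshold at 1 and the labels "critical" and "amplification" are assumptions made for illustration.

```python
def spectral_radius(matrix, iterations=200):
    """Estimate the dominant eigenvalue of a non-negative square matrix
    by power iteration with max-norm scaling."""
    n = len(matrix)
    v = [1.0] * n
    rho = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        rho = max(abs(x) for x in w)
        if rho == 0.0:
            return 0.0
        v = [x / rho for x in w]
    return rho


def regime(rho, tol=1e-6):
    # Assumed convention: contagion decays when rho < 1 and grows when rho > 1.
    if rho < 1.0 - tol:
        return "suppression"
    if rho > 1.0 + tol:
        return "amplification"
    return "critical"


if __name__ == "__main__":
    # Hypothetical contagion matrix: entry [i][j] is the influence of agent j on agent i.
    gamma = [
        [0.0, 0.157, 0.352],
        [0.2, 0.0, 0.3],
        [0.25, 0.18, 0.0],
    ]
    rho = spectral_radius(gamma)
    print(rho, regime(rho))
```

For a non-negative matrix the spectral radius lies between the smallest and largest row sum, so any matrix with entries in the reported range and a zero diagonal has rho below 1 in the three-agent case.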

2501.17015 2026-06-19 cs.AI cs.MA cs.RO 版本更新

UniMM: A Unified Mixture Model Framework for Multi-Agent Simulation

UniMM：一种用于多智能体仿真的统一混合模型框架

Longzhong Lin, Xuewu Lin, Kechun Xu, Haojian Lu, Lichao Huang, Rong Xiong, Yue Wang

发表机构 * Zhejiang University（浙江大学）； Horizon Robotics

AI总结提出UniMM框架统一回归混合模型与离散NTP模型，通过闭环样本生成缓解分布偏移，并在WOSAC基准上取得最优性能。

Comments Accepted author manuscript. The version of record has been published in IEEE Transactions on Pattern Analysis and Machine Intelligence

Journal ref IEEE Transactions on Pattern Analysis and Machine Intelligence, Early Access, 2026

DOI: 10.1109/TPAMI.2026.3700402

AI中文摘要

仿真在评估自动驾驶系统中起着关键作用，其中生成逼真的多智能体行为是一个关键方面。在多智能体仿真中，主要挑战包括行为多模态性和闭环分布偏移。在本研究中，我们提出了一个统一的混合模型（UniMM）框架，用于生成多模态智能体行为，该框架涵盖了主流方法，包括基于回归的混合模型和离散NTP模型。此外，我们引入了一种针对混合模型的闭环样本生成方法，以缓解分布偏移。在UniMM框架内，我们从模型和数据角度识别了关键配置。我们对各种模型配置进行了系统检查，并全面描述了它们的效果。此外，我们对数据配置的研究强调了闭环样本在实现逼真仿真中的关键作用。为了将闭环样本的优势扩展到更广泛的混合模型中，我们进一步引入了一种时间解缠和对齐机制，以解决捷径学习和离策略学习问题。利用我们探索的见解，UniMM框架内提出的不同变体，包括离散模型、无锚模型和基于锚点的模型，均在WOSAC基准上取得了最先进的性能。

英文摘要

Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi-agent behaviors is a key aspect. In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we formulate a unified mixture model (UniMM) framework for generating multimodal agent behaviors, which can cover the mainstream methods including regression-based mixture models and discrete NTP models. Furthermore, we introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the UniMM framework, we recognize critical configurations from both the model and data perspectives. We conduct a systematic examination of various model configurations, and comprehensively characterize their effects. Moreover, our investigation into the data configuration highlights the pivotal role of closed-loop samples in achieving realistic simulations. To extend the benefits of closed-loop samples across a broader range of mixture models, we further introduce a temporal disentanglement-and-alignment mechanism to address the shortcut learning and off-policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor-free, and anchor-based models, all achieve state-of-the-art performance on the WOSAC benchmark.

2606.18413 2026-06-19 cs.AI cs.HC 版本更新

Searching for Synergy in Shared Workspace Human-AI Collaboration

在共享工作空间的人机协作中寻找协同效应

Nachiket Kotalwar, Rohini Das, Carolyn Rose

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结研究共享工作空间的人机团队协作，通过Collaborative Gym环境实验发现，缺乏协调结构时增加协作者会降低性能，而结合共享记忆和模拟人在环门控的脚手架可提升团队绩效。

Comments Accepted at ICML 2026 Workshop on Human-AI Co-Creativity

AI中文摘要

自动化AI代理越来越强大，但许多科学和专业任务仍需要人类判断和情境专业知识。我们研究共享工作空间的人机团队，其中AI代理和人类协作者必须在提交最终答案前协调职责。使用Collaborative Gym环境和DiscoveryBench任务，我们考察何时添加模拟人类协作者能提升性能，以及何时过程损失将额外协作者变为协调开销。在1482个会话中，当团队缺乏协调贡献的结构时，添加相关协作者会降低性能。然后我们评估一种脚手架，它结合了共享群体记忆和模拟人在环（HITL）门控，其中选定动作需要指定模拟参与者的批准。这种脚手架在三人团队中最为明显，产生了更高的平均性能，具有更清晰的责任信号和更强的专业知识路由到团队动作。总体而言，人机团队如何协调和整合专业知识与他们可用的能力同样重要。

英文摘要

Automated AI agents are increasingly capable, yet many scientific and professional tasks require human judgment and contextual expertise. We study shared-workspace human-AI teams, where AI agents and human collaborators must coordinate responsibilities before submitting a final answer. Using the Collaborative Gym environment with DiscoveryBench tasks, we examine when adding simulated human collaborators improves performance and when process loss turns additional collaborators into coordination overhead. Across 1,482 sessions, adding relevant collaborators can lower performance when teams lack structure to coordinate their contributions. We then evaluate scaffolding that combines shared group memory with simulated human-in-the-loop (HITL) gates, where selected actions require approval from a designated simulated participant. This scaffolding yields higher mean performance, most clearly in three-person teams, with clearer responsibility signals and stronger routing of expertise to team actions. Overall, how human-AI teams coordinate and integrate expertise matters as much as the capability available to them.

2502.19193 2026-06-19 cs.SI cs.AI cs.NE 版本更新

Agentra: 一种可监督的多智能体企业入侵响应框架

Raj Patel, Shaswata Mitra, Michele Guida, Stefano Iannucci, Sudip Mittal, Shahram Rahimi

发表机构 * The University of Alabama, Alabama, USA（阿拉巴马大学）； Roma Tre University, Rome, Italy（罗马三大学）

AI总结提出可监督的多智能体入侵响应框架Agentra，通过角色划分、规划-验证循环、安全网关和风险评分机制，将警报转化为结构化响应计划，在120事件语料上F1从0.61提升至0.84，有害动作率降至0.0%。

AI中文摘要

企业入侵响应仍然依赖于静态剧本和分析师驱动的分类，导致警报生成与遏制之间存在延迟。我们提出Agentra，一个可监督的多智能体入侵响应系统（IRS）框架，它将来自IDS、EDR和XDR平台的警报转换为基于MITRE ATT&CK、MITRE D3FEND和NIST CSF 2.0的结构化事件响应计划。Agentra将响应推理分解到角色范围的智能体中，通过有界的规划器-验证器审查循环验证提议的计划，通过审核安全网关筛选检索到的威胁情报，通过行动目录和风险评分门控行动，并将决策记录在仅追加的审计日志中。我们在来自ThreatHunter-Playbook、Splunk BOTSv3和DARPA OpTC的120事件语料库上，将Agentra与静态OASIS CACAO v2.0网络剧本基线进行了评估。最强的配置将感知假阳性的IRS F1从0.61提高到0.84，并在仅规划器配置引入不安全过度反应后，将预计的有害动作率恢复到静态基线水平0.0%。这些结果表明，多智能体响应规划可以在保持分析师批准和可审计性的同时，提高基于本体的IRS覆盖率。

英文摘要

Enterprise intrusion response still depends on static playbooks and analyst-driven triage, creating delay between alert generation and containment. We present Agentra, a supervisable multi-agent Intrusion Response System (IRS) framework that converts alerts from IDS, EDR, and XDR platforms into structured incident response plans grounded in MITRE ATT&CK, MITRE D3FEND, and NIST CSF 2.0. Agentra decomposes response reasoning across role-scoped agents, validates proposed plans through a bounded Planner-Validator review loop, screens retrieved threat intelligence through a Moderator security gateway, gates actions through an Action Catalog and risk score, and records decisions in an append-only audit log. We evaluate Agentra against a static OASIS CACAO v2.0 cyber-playbook baseline on a 120-event corpus drawn from ThreatHunter-Playbook, Splunk BOTSv3, and DARPA OpTC. The strongest configuration improves FP-aware IRS F1 from 0.61 to 0.84 and restores the projected harmful-action rate to the static baseline level of 0.0% after Planner-only configurations introduce unsafe overreaction. These results indicate that multi-agent response planning can improve ontology-grounded IRS coverage while preserving analyst approval and auditability.
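
The idea of gating actions through an action catalog and a risk score can be sketched as follows. This is only an illustration of the pattern: the catalog entries, risk values, threshold, and function names are hypothetical and do not come from Agentra.

```python
# Hypothetical catalog: allowed response actions and an assumed risk score in [0, 1].
ACTION_CATALOG = {"block_ip": 0.3, "reset_password": 0.2, "isolate_host": 0.7}


def gate(plan, audit_log, max_risk=0.5):
    """Split a proposed plan into auto-approved, analyst-review, and rejected actions.

    Every decision is appended to `audit_log`; entries are never modified.
    """
    approved, needs_review, rejected = [], [], []
    for action in plan:
        if action not in ACTION_CATALOG:
            decision, bucket = "rejected", rejected  # unknown actions never run
        elif ACTION_CATALOG[action] > max_risk:
            decision, bucket = "needs_review", needs_review  # analyst must approve
        else:
            decision, bucket = "approved", approved
        bucket.append(action)
        audit_log.append({"action": action, "decision": decision})
    return approved, needs_review, rejected


if __name__ == "__main__":
    log = []
    print(gate(["block_ip", "isolate_host", "wipe_disk"], log))
    print(log)
```

Rejecting anything outside the catalog is one simple way to keep a planner's overreaction from turning into a harmful action, which is the failure mode the abstract reports for Planner-only configurations.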

2606.19741 2026-06-19 cs.AI cs.LG 新提交

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

通过演化程序瓶颈解释神经组合优化

Haocheng Duan, Yuxin Guo, Jieyi Bi, Anqi Xie, Sirui Li, Yining Ma, Cathy Wu

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Nanyang Technological University（南洋理工大学）； Microsoft Research（微软研究院）； Massachusetts Institute of Technology（麻省理工学院）

AI总结提出演化程序瓶颈（EPB）框架，通过将黑盒神经组合优化模型蒸馏为可读程序组合，利用LLM和混合梯度下降实现可解释性，揭示模型行为与经典启发式变体的关系。

Comments Under Review

AI中文摘要

神经组合优化（NCO）取得了强劲性能，但其黑盒性质仍然是部署和科学诊断的关键障碍。标准可解释性工具（如概念瓶颈模型）不适用于NCO，因为其决策是动态的、状态依赖的，且缺乏适当的概念词汇定义。为弥合这一差距，我们引入了演化程序瓶颈（EPB），据我们所知，这是首个通过将黑盒NCO模型蒸馏为人类可读程序组合来解释NCO策略的框架。EPB利用LLM自主演化一组程序，其中每个程序的每步动作分布作为瓶颈。EPB通过迭代框架工作：模块I固定程序库容量，并引入混合文本-数值梯度下降方案，该方案将学生路由器更新的数值梯度和基于LLM程序修订的文本梯度相结合；模块II通过故障目标扩展和冗余剪枝动态调整库容量。大量实验证明了EPB的有效性和广泛适用性，蒸馏后的程序组合在很大程度上保持了原始性能。EPB还揭示了NCO行为在优化阶段的变化，并且可以近似为经典启发式变体的组合。我们的工作推进了可解释NCO，并将EPB建立为解释序列决策模型的有前途工具。

英文摘要

Neural Combinatorial Optimization (NCO) achieves strong performance, yet its black-box nature remains a key roadblock to deployment and scientific diagnosis. Standard interpretability tools, such as Concept Bottleneck Models (CBMs), are ill-equipped for NCO, whose decisions are dynamic, state-dependent, and lack proper concept vocabulary definition. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge, the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, where each program's per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that couples numerical gradients for student router updates and textual gradients for LLM-based program revision; Block II dynamically adapts bank capacity via fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability, where the distilled program portfolios largely match original performance. EPB also reveals that NCO behavior shifts across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.

2606.19759 2026-06-19 cs.AI cs.SI 新提交

Optimal Scheduling in a Question-Answering Forum of Knowledge Workers

知识工作者问答论坛中的最优调度

Rohit Negi, Mustafa Yilmaz

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结针对知识工作者问答论坛，提出基于专家专业水平的请求调度模型，计算系统容量并设计达到容量的调度器，同时探讨专家协作对容量的提升。

Comments 14 pages, 4 figures

2606.20084 2026-06-19 cs.AI 新提交

Residual-Space Evolutionary Optimization via Flow-based Generative Models

基于流生成模型的残差空间进化优化

Zhuo Cao, Lena Krieger, Fernanda Nader, Xuan Zhao, Hanno Scharr, Ira Assent

发表机构 * LMU Munich, Munich Center for Machine Learning (MCML), Germany（慕尼黑大学，慕尼黑机器学习中心（MCML），德国）； Department of Computer Science, Aarhus University, Denmark（丹麦奥胡斯大学计算机科学系）

AI总结提出残差空间进化优化框架，结合流生成编辑与进化算法，在残差空间分离局部利用与全局探索，用于非可微黑盒目标的数据编辑。

Comments Accepted by ICML 2026 Workshop SPIGM, 5 pages, 3 figures

AI中文摘要

使用生成方法进行数据编辑通常需要可微目标和基于梯度的搜索。然而，这些假设在基于流的设置中不成立，其中编辑通过前向和反向积分执行，并且通常涉及不可微或黑盒目标。我们引入了残差空间进化优化，这是一个模型无关的框架，通过将基于流的生成编辑与进化算法相结合来解决这一差距。基于条件流匹配（CFM）可以将条件控制因素与实例特定残差分离的观察，我们的框架直接在残差空间中操作，并分离两个互补的搜索机制：自花授粉通过保留特征的残差细化进行局部利用，而交叉授粉通过跨异质样本重组残差促进更广泛的探索。作为概念验证，我们在MorphoMNIST（一个用于反事实生成的基准数据集）和晶体数据上进行了验证，表明这种探索-利用分解为平衡目标对齐、实例保留和多样性提供了有用的机制，并且可以扩展到图像之外的真实世界科学领域。

英文摘要

Data editing with generative methods typically requires differentiable objectives and gradient-based search. However, these assumptions break down in flow-based settings, where edits are performed through forward and backward integration and often involve non-differentiable or black-box objectives. We introduce residual-space evolutionary optimization, a model-agnostic framework that addresses this gap by combining flow-based generative editing with evolutionary algorithms. Building on the observation that conditional flow matching (CFM) can disentangle condition-controlled factors from instance-specific residuals, our framework directly operates in residual space and separates two complementary search regimes: self-pollination performs local exploitation through feature-preserving residual refinement, and cross-pollination promotes broader exploration by recombining residuals across heterogeneous samples. As a proof of concept, we validate on MorphoMNIST, a benchmark dataset for counterfactual generation, and on crystal data, demonstrating that this exploration-exploitation decomposition provides a useful mechanism for balancing target alignment, instance preservation, and diversity, and extends beyond images to real-world scientific domains.
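
The two search regimes can be sketched as operators on plain residual vectors. This is a simplified illustration, not the authors' method: residuals are lists of floats, the black-box fitness is applied directly to a residual (in the paper a residual would first be decoded by the flow model), and the operator details, rates, and elitist selection are assumptions.

```python
import random


def self_pollinate(residual, rng, scale=0.05):
    # Local exploitation: small random refinement of a sample's own residual.
    return [r + rng.gauss(0.0, scale) for r in residual]


def cross_pollinate(res_a, res_b, rng):
    # Broader exploration: recombine coordinates of two different residuals.
    return [a if rng.random() < 0.5 else b for a, b in zip(res_a, res_b)]


def evolve(population, fitness, rng, generations=20, cross_rate=0.3):
    """Gradient-free search: only fitness evaluations are needed."""
    for _ in range(generations):
        children = []
        for parent in population:
            if rng.random() < cross_rate:
                children.append(cross_pollinate(parent, rng.choice(population), rng))
            else:
                children.append(self_pollinate(parent, rng))
        # Elitist selection: keep the best individuals among parents and children.
        pool = sorted(population + children, key=fitness, reverse=True)
        population = pool[: len(population)]
    return population


if __name__ == "__main__":
    rng = random.Random(0)
    start = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
    score = lambda r: -sum(x * x for x in r)  # stand-in for a black-box objective
    print(max(map(score, start)), max(map(score, evolve(start, score, rng))))
```

Because parents compete with their children, the best fitness in the population never decreases from one generation to the next.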

2606.19369 2026-06-19 cs.LG cs.AI 交叉投稿

Zero-Inflated Gaussian Distributions Enable Parameter-Space Sparsity in Estimation-of-Distribution Algorithms

零膨胀高斯分布使估计分布算法中的参数空间稀疏化

Andreas Faust, Sven Nitzsche, Juergen Becker

发表机构 * University of Freiburg（弗莱堡大学）； FZI Research Center for Information Technology（FZI信息技术研究中心）； Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）

AI总结提出多元零膨胀高斯分布作为估计分布算法的采样分布，联合优化稀疏模式和活跃参数，无需手工设计稀疏算子，在Lunar Lander基准上收敛更快且最终回报更高。

AI中文摘要

估计分布算法（EDA）是一类强大的黑箱优化进化方法，尤其当目标函数结构未知时。经典进化算法依赖于手工设计的变异和交叉算子，这些算子难以针对未知问题结构设计，且是偏差的来源，而EDA完全绕过了算子设计：它们将概率分布拟合到最佳个体，并从中采样下一代。EDA在连续参数空间上已得到充分确立，但此前尚未推广到稀疏空间——其中良好解的大多数系数恰好为零。现有的稀疏黑箱优化器因此重新引入了EDA旨在避免的东西：手工制作的稀疏算子、支持集与活跃值交替的双层方案、零阈值以及其他内置假设。我们通过提出多元零膨胀高斯（ZIG）分布作为EDA采样法则来填补这一空白。一个具有独立指示维度和值维度的潜在高斯模型表示稀疏模式、活跃参数之间的相关性以及两者之间的相互作用，因此稀疏模式和活跃值被联合优化，无需层次结构。我们证明该模型的潜在参数可以从观测样本中识别，不同于相关构造起源的缺失数据设置，并引入了实用的基于摊销反演的估计器。这些估计器准确恢复潜在相关结构，在Lunar Lander基准上，由此产生的ZIG-EDA比稠密高斯EDA、手工制作的稀疏进化算法和特设稀疏EDA收敛更快且最终回报更高，同时找到的控制器只有一小部分参数活跃。

英文摘要

Estimation-of-distribution algorithms (EDAs) are a powerful class of evolutionary methods for black-box optimization, especially when little is known about the structure of the objective. Whereas classical evolutionary algorithms rely on hand-designed mutation and crossover operators, which are hard to devise for unknown problem structures and are a source of bias, EDAs sidestep operator design entirely: they fit a probability distribution to the best individuals and sample the next generation from it. EDAs are well established on continuous parameter spaces, but they have not previously been generalized to sparse ones, in which most coefficients of a good solution are exactly zero. Existing sparse black-box optimizers therefore reintroduce exactly what EDAs were designed to avoid: hand-crafted sparsity operators, bi-level schemes alternating between support set and active values, zeroing thresholds, and other baked-in assumptions. We close this gap by proposing multivariate zero-inflated Gaussian (ZIG) distributions as EDA sampling laws. A latent Gaussian model with separate indicator and value dimensions represents sparsity patterns, correlations among active parameters, and the interactions between the two, so sparsity patterns and active values are optimized jointly, hierarchy-free. We show that the latent parameters of this model are identifiable from observed samples, unlike in the missing-data settings where related constructions originate, and introduce practical amortized inversion-based estimators for them. The estimators accurately recover latent correlation structures, and on the Lunar Lander benchmark the resulting ZIG-EDA converges faster and reaches higher final returns than a dense Gaussian EDA, a hand-crafted sparse evolutionary algorithm, and an ad-hoc sparse EDA, while finding controllers with only a small fraction of parameters active.
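
A zero-inflated Gaussian sample can be sketched with separate indicator and value dimensions. The sketch is deliberately simplified: every latent dimension is drawn independently, whereas the abstract describes a joint latent Gaussian with correlations between indicators and values. The function name, the unit-variance indicator, and the threshold at zero are assumptions.

```python
import random


def sample_zig(mu_indicator, mu_value, sigma_value, rng):
    """Draw one sparse parameter vector.

    For each coordinate, a latent indicator decides whether the parameter is
    active; inactive coordinates are exactly zero, active ones are Gaussian.
    """
    sample = []
    for m_ind, m_val, s_val in zip(mu_indicator, mu_value, sigma_value):
        active = rng.gauss(m_ind, 1.0) > 0.0  # assumed thresholding rule
        sample.append(rng.gauss(m_val, s_val) if active else 0.0)
    return sample


if __name__ == "__main__":
    rng = random.Random(0)
    # A negative indicator mean makes a coordinate mostly zero, a positive one mostly active.
    print(sample_zig([-1.0, 0.0, 2.0], [0.5, -1.0, 3.0], [0.1, 0.1, 0.1], rng))
```

In an EDA loop, the indicator means and the value parameters would both be re-fitted to the best individuals of each generation, so the sparsity pattern is learned rather than imposed by a hand-crafted operator.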

2606.19533 2026-06-19 cs.AR cs.AI 交叉投稿

A Tool for the Synthesis of Adaptive Probabilistic Processors Based on the Ising Model

基于伊辛模型的自适应概率处理器合成工具

Jonathan Juracy Carneiro da Silva, Leonardo R. Gobatto, Jose Rodrigo Azambuja

AI总结提出一种自动合成与仿真概率架构的工具，通过将组合优化问题映射到伊辛模型，自适应选择更新算法，改善收敛行为并支持硬件实现。

Comments ACM/IEEE/SBC/SBMICRO Symposium on Integrated Circuits and Systems Design 2026

2602.17315 2026-06-19 cs.LG cs.AI 版本更新

Flickering Multi-Armed Bandits

闪烁多臂老虎机

Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen

发表机构 * University of Colorado Boulder（科罗拉多大学博尔德分校）； INRIA Paris（巴黎国家信息与自动化研究所）

AI总结提出闪烁多臂老虎机模型，通过随机图约束动作可用性，设计两阶段懒惰随机游走算法实现次线性遗憾界，并证明信息论下界的最优性。

2606.19475 2026-06-19 cs.AI cs.CL 新提交

Diffusion Language Models: An Experimental Analysis

扩散语言模型：一项实验分析

Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi

发表机构 * University of Modena and Reggio Emilia（摩德纳和雷焦艾米利亚大学）； University of Pisa（比萨大学）

AI总结本文系统比较了八种扩散语言模型在推理、编码、翻译等任务上的表现，分析了去噪步数、上下文长度等推理因素对性能与效率的影响，揭示了扩散语言模型在不同任务和预算下的权衡。

AI中文摘要

大型语言模型（LLMs）通过自回归生成彻底改变了语言建模，使其在广泛的任务中表现出色。最近，扩散语言模型（DLMs）作为一种替代范式出现，它通过迭代去噪而非下一个词预测来生成文本，从而允许对整个序列进行并行精炼。尽管已经提出了许多基于扩散的架构，但评估协议、数据集、推理预算和生成超参数的差异使得比较它们的能力和理解它们提供的权衡变得困难。在这项工作中，我们对现代DLMs进行了系统的实验分析。具体来说，我们评估了八种最先进的DLMs在八个基准上的表现，这些基准涵盖推理、编码、翻译、知识和结构化问题解决，同时明确考虑了生成质量和计算效率。除了下游评估，我们还分析了关键推理时间因素的影响，包括去噪步数、上下文长度、块大小和并行解掩策略，并通过在相同条件下训练的较小模型的受控比较来补充大规模实验。我们的分析突出了基于扩散的语言建模在不同任务、架构和推理预算下的优势和局限性。我们表明，DLMs的行为受到生成时间设计选择的强烈影响，导致性能和计算效率之间的不同权衡。总体而言，我们的研究为当代DLMs的能力和部署特性提供了实用见解。

英文摘要

Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.

2606.19538 2026-06-19 cs.AI cs.LG 新提交

ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

ITNet: 一种可学习的积分变换，统一卷积、注意力与循环

Ashim Dhor, Rasel Mondal, Pin Yu Chen

发表机构 * Indian Institute of Science Education and Research Bhopal（印度科学教育与研究学院博帕尔分校）； IBM Research（IBM研究院）

AI总结提出可学习积分变换网络ITNet，通过位置-特征联合核函数统一卷积、注意力和循环架构，实现跨模态高性能。

AI中文摘要

卷积网络、循环网络和变换器各自编码不同的归纳偏置——局部性、序列记忆和内容相关的成对交互——自诞生以来在数学上一直彼此独立。我们表明，这种碎片化反映的不是信号处理方式的根本多样性，而是对单一底层数学对象的不完整视角：可学习的积分变换。我们引入积分变换网络（ITNet），这是一种统一架构，围绕一个依赖于位置和特征的联合可学习核构建。该核实现为一个小型神经网络（具体为MLP），用于建模成对交互，使模型能够从数据中自适应其行为。我们证明，卷积、自注意力（包括多头）和自回归循环（包括LSTM、GRU、S4和Mamba）在适当参数化下均作为特例出现，且ITNet是连续算子的通用逼近器。为使其实用，我们开发了分块核融合、重要性加权蒙特卡洛积分和可学习低秩分解，实现高效可扩展计算。单个ITNet架构，共享算子与轻量级模态特定编码器，在ImageNet-1K、GLUE、ModelNet40、VQA v2和NLVR2上匹配或超越专用基线。结果表明，单一学习交互机制可从数据中恢复所有三个架构族的行为。

英文摘要

Convolutional networks, recurrent networks, and transformers each encode different inductive biases (locality, sequential memory, and content-dependent pairwise interaction) and have remained mathematically distinct since their inception. We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying mathematical object: a learnable integral transform. We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that depends jointly on positions and features. This kernel is implemented as a small neural network, specifically an MLP, that models pairwise interactions, enabling the model to adapt its behavior from data. We show that convolution, self-attention (including multi-head), and autoregressive recurrence (including LSTM, GRU, S4, and Mamba) arise as special cases under appropriate parameterizations, and that ITNet is a universal approximator of continuous operators. To make this practical, we develop tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization, enabling efficient and scalable computation. A single ITNet architecture with a shared operator and lightweight modality-specific encoders matches or exceeds specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQA v2 and NLVR2. The results demonstrate that a single learned interaction mechanism can recover the behavior of all three architectural families from data.
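
The claim that convolution and attention are special cases of one kernel-weighted sum can be illustrated on a 1-D sequence of scalars. This is a toy sketch, not ITNet: the kernels here are fixed hand-written functions rather than a learned MLP, features are scalars, and all function names are hypothetical.

```python
import math


def integral_transform(xs, kernel):
    """Discrete integral transform: y_i = sum_j kernel(i, j, x_i, x_j) * x_j."""
    n = len(xs)
    return [sum(kernel(i, j, xs[i], xs[j]) * xs[j] for j in range(n)) for i in range(n)]


def conv_kernel(weights):
    """Kernel that depends only on the offset j - i: a zero-padded 1-D cross-correlation."""
    half = len(weights) // 2

    def k(i, j, xi, xj):
        offset = j - i + half
        return weights[offset] if 0 <= offset < len(weights) else 0.0

    return k


def attention_kernel(xs):
    """Kernel that depends only on content: softmax over j of the score x_i * x_j."""

    def k(i, j, xi, xj):
        normalizer = sum(math.exp(xi * xk) for xk in xs)
        return math.exp(xi * xj) / normalizer

    return k


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    print(integral_transform(xs, conv_kernel([1.0, 0.0, -1.0])))
    print(integral_transform(xs, attention_kernel(xs)))
```

A position-only kernel gives locality, a content-only kernel gives attention, and a kernel that depends on both arguments can interpolate between them. The naive double loop costs O(n^2) kernel evaluations, which is why the abstract lists tiling, Monte Carlo integration, and low-rank factorization as efficiency measures.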

2606.19607 2026-06-19 cs.AI stat.AP 新提交

Which Pairs to Compare for LLM Post-Training?

LLM后训练中应比较哪些对？

Jiangze Han, Vineet Goyal, Will Ma

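Affiliation: Columbia University

AI summary: Studies how to choose the most informative comparison pairs for preference-based post-training, proposes a comparison-curation method based on sampling design, derives an optimization criterion from a theoretical analysis of DPO training, and shows experimentally that it improves sample efficiency.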

AI中文摘要

基于偏好的后训练已成为对齐语言模型的核心范式。常见的数据收集策略是为每个提示生成少量补全并标注生成的比较对。然而，人工偏好标签通常比生成额外补全昂贵得多，这提示了相同标注预算的不同使用方式：生成更大的补全集，但只标注最具信息量的比较对。本文研究在基于偏好的后训练中应比较哪些对。我们将比较策展形式化为一个采样设计问题，并通过基于偏好的后训练目标下的最终策略质量来评估设计。我们针对直接偏好优化（DPO）实例化该框架，分析标注对的选择如何通过DPO训练传播到下游策略性能。我们的主要结果为DPO训练策略的后训练最优性差距提供了匹配的上界和下界。这些界限表明，比较选择通过一个单一的设计相关信息矩阵影响下游性能，该矩阵将标签分配与参数估计误差和策略次优性联系起来。这为预算受限的比较策展提供了显式优化准则，并激发了从大型生成补全池中选择信息对的实际采样设计。在合成设置和语言模型后训练基准上的实验表明，所提出的设计在样本效率上持续优于常见的比较选择启发式方法。

英文摘要

Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and label the resulting comparison pairs. However, human preference labels are often much more expensive than generating additional completions, suggesting a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for Direct Preference Optimization (DPO), analyzing how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results provide matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large generated completion pools. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.
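
For reference, the loss that each labeled comparison contributes under standard DPO can be sketched as follows. This is the usual DPO per-pair objective, not the paper's pair-selection criterion; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one labeled pair (chosen vs. rejected completion).

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. Returns -log(sigmoid(margin)).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A pair on which the policy equals the reference gives loss log(2).
print(dpo_pair_loss(0.0, 0.0, 0.0, 0.0))
```

Which pairs get labeled determines which of these terms enter the training objective, which is the design question the abstract studies.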

2606.19658 2026-06-19 cs.AI cs.IR cs.MM 新提交

Denoising Implicit Feedback for Cold-start Recommendation

去噪隐式反馈用于冷启动推荐

Gaode Chen, Shicheng Wang, Shikun Li, Rui Huang, Xinghua Zhang, Yunze Luo, Shipeng Li, Shiming Ge, Ruina Sun, Yinjie Jiang, Jun Zhang

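Affiliations: Hong Kong Baptist University; Independent Researcher; Peking University; Nanjing University; Institute of Information Engineering, Chinese Academy of Sciences

AI summary: To address noisy implicit feedback in cold-start recommendation, proposes DIF, a model-agnostic denoising method that infers pseudo-labels from content similarity and models their confidence and uncertainty; deployed in the Kuaishou app, it significantly improves commercial metrics in cold-start scenarios.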

Comments Accepted by KDD 2026 ADS Track

AI中文摘要

隐式反馈因其可获取性和通用性被广泛用于推荐系统，但通常包含噪声样本（如点击诱饵、位置偏差）。同时，由于新物品的持续涌入，推荐器不可避免地面临物品冷启动问题。我们识别出冷物品因上述因素更容易受到噪声样本的影响，而研究者往往忽视了为冷物品去噪隐式反馈的重要性。先前的去噪研究通常基于启发式模式（如高损失值）识别噪声样本，并通过样本选择或重加权来减轻噪声。然而，这些方法适应性有限，在冷启动场景中效果不佳。为了实现冷启动推荐中的隐式反馈去噪，我们提出了一种模型无关的去噪方法DIF。首先，用户对内容的偏好是稳定的，这使我们能够通过内容相似的热物品推断出指示用户是否对冷物品感兴趣的伪标签。其次，为了提高伪标签准确性，我们基于冷物品与热物品的内容相似性对伪标签的置信度进行建模，然后为每个样本聚合多个伪标签。最后，我们通过考虑噪声样本标签的相对熵和物品的冷启动状态，显式估计其不确定性，从而自适应地指导伪标签在样本级别纠正噪声标签。DIF的优越性得到了理论证明和真实数据集上大量实验的支持。该方法已部署在十亿用户规模的短视频应用快手上，并在冷启动场景中显著提升了各项商业指标。

英文摘要

Implicit feedback is widely used in recommender systems due to its accessibility and generality, yet it usually presents noisy samples (e.g., clickbait, position bias). Meanwhile, recommenders inevitably face the item cold-start problem due to the continuous influx of new items. We identify that cold items are more prone to noisy samples due to the aforementioned factors, and researchers often overlook the significance of denoising implicit feedback for cold items. Previous denoising studies usually identify noisy samples based on heuristic patterns, such as higher loss values, and mitigate noise through sample selection or re-weighting. However, these methods have limited adaptability and are ineffective in cold-start scenarios. To achieve denoising implicit feedback for cold-start recommendation, we propose a model-agnostic denoising method called DIF. First, user preferences for content remain stable, which allows us to infer pseudo-labels indicating whether a user is interested in a cold item through content-similar warm items. Furthermore, to improve pseudo-label accuracy, we model the confidence of pseudo-labels based on the content similarity between the cold item and warm items, and then aggregate multiple pseudo-labels for each sample. Finally, we explicitly estimate the uncertainty of the noisy sample label by considering its relative entropy and the cold-start status of the item, which adaptively guides the role of pseudo-labels to correct the noisy labels at the sample level. DIF's superiority is supported by both theoretical justification and extensive experiments on real-world datasets. The method has been deployed on a billion-user scale short video application Kuaishou and has significantly improved various commercial metrics within cold-start scenarios.

2606.19771 2026-06-19 cs.AI 新提交

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

超越熵：从令牌级分布偏差中学习以增强LLM推理

Xuanzhi Feng, Zhengyang Li, Zeyu Liu, Haoxi Li, Yuming Jiang, Bing Guo, Jingcai Guo, Jie Zhang, Song Guo

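Affiliations: The Hong Kong University of Science and Technology; Sichuan University; The Hong Kong Polytechnic University

AI summary: To address the entropy collapse or explosion caused by token updates in RLVR, proposes the ICT framework, which uses JS divergence to identify critical tokens and balances policy concentration through selective updates, improving reasoning performance.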

AI中文摘要

基于可验证奖励的强化学习（RLVR）显著推进了大语言模型（LLM）推理；然而，它面临一个基本的优化不稳定性：均匀令牌更新会导致熵塌陷，从而过早收敛到次优策略，而过度的香农熵最大化可能导致熵爆炸，驱动盲目探索走向不连贯的推理链。为解决这一二分问题，我们引入了独立组合令牌（ICT）框架，该框架将优化焦点从标量不确定性转移到令牌logits的分布特性。通过利用令牌logits分布之间的詹森-香农（JS）散度，ICT将具有独特分布模式的令牌识别为引导LLM推理中有效探索的关键分支点。我们的理论分析基于香农熵和二阶Rényi熵，证明选择性地更新这些令牌可以调节策略集中度：它降低了由香农熵度量的整体分布不确定性，同时控制了由二阶Rényi熵捕获的概率集中度。这种双重效应防止了过度集中的令牌生成削弱探索，并有效稳定了训练景观。实验结果表明，在Qwen2.5（0.5B/1.5B/7B）模型上仅更新前10%的独特令牌，在涵盖数学、常识和奥林匹克级别问题的七个基准测试中，与GRPO、20-Entropy和STAPO基线相比，平均pass@4提升了4.58%，最大提升达14.9%。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, leading to premature convergence to suboptimal strategies, whereas excessive Shannon Entropy maximization can cause entropy explosion, driving blind exploration toward incoherent reasoning chains. To resolve this dichotomy, we introduce the Independent Combinatorial Tokens (ICT) framework, which shifts the optimization focus from scalar uncertainty to the distributional properties of token logits. By leveraging the Jensen-Shannon (JS) divergence between token logits distributions, ICT identifies tokens with distinctive distributional patterns as critical branching points for guiding effective exploration in LLM reasoning. Our theoretical analysis, grounded in both Shannon and second-order Rényi entropy, proves that selectively updating on these tokens regulates policy concentration: it reduces the overall distribution uncertainty measured by Shannon entropy, while controlling probability concentration captured by second-order Rényi entropy. This dual effect prevents over-concentrated token generation from weakening exploration and effectively stabilizes the training landscape. Empirical results demonstrate that updating only the top 10% of unique tokens on Qwen2.5 (0.5B/1.5B/7B) models yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over GRPO, 20-Entropy, and STAPO baselines across seven benchmarks spanning math, commonsense, and Olympiad-level problems.
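
The two generic building blocks the abstract names, a Jensen-Shannon divergence between token distributions and a "top 10%" selection, can be sketched as below. How ICT actually scores a token (which distributions are compared, and how scores are aggregated) is not stated in the abstract, so `top_fraction_indices` takes precomputed scores; both function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)


def top_fraction_indices(scores, fraction=0.1):
    """Indices of the highest-scoring tokens, keeping ceil(fraction * n) of them."""
    keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


print(js_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))
print(top_fraction_indices([0.1, 0.9, 0.3, 0.5], fraction=0.5))  # [1, 3]
```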

2606.19808 2026-06-19 cs.AI cs.CL 新提交

信息格学习作为概率图模型结构学习

Haizi Yu, Lav R. Varshney

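Affiliations: Kocree, Inc.; AI Innovation Institute, Stony Brook University

AI summary: Interprets information lattice learning (ILL) as structure learning for probabilistic graphical models: ILL learns interpretable rules by projecting onto a partition lattice, and the paper establishes connections to maximum entropy and factor graphs.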

AI中文摘要

信息格学习（ILL）通过将信号交替投影到编码抽象层次结构的分区格上，并将选定的规则提升回信号域，来学习信号的可解释规则。当信号是概率质量函数时，我们证明ILL学习的概率规则具有自然的概率图模型（PGM）解释，并详细发展了这一解释。ILL中的分区诱导出一个确定性的商变量，规则是该商变量的边际分布。因此，规则集是可解释抽象上的边际约束集合。一般提升是满足这些约束的所有联合分布的可行族，而特殊提升则选择最大无知重建，在ILL中通过L2均匀性原理实现，该原理与最大熵密切相关。在香农熵提升下，相同的约束产生一个对数线性因子图，其因子由学习的抽象索引。然而，信息格本身不是贝叶斯网络：其边编码抽象的细化与粗化，而非条件依赖。因此，ILL最好被视为商变量上可解释的基于约束的因子图的结构学习。这一观点阐明了ILL如何与图模型和最大熵模型相关，同时为推理、可识别性和混合符号-概率学习提出了新方向。

英文摘要

Information lattice learning (ILL) learns interpretable rules of a signal by alternately projecting the signal onto a partition lattice that encodes a hierarchy of abstractions and lifting selected rules back to the signal domain. When the signal is a probability mass function, we show the probabilistic rules learned by ILL admit a natural probabilistic graphical model (PGM) interpretation and develop this interpretation in detail. A partition in ILL induces a deterministic quotient variable, and a rule is the marginal law of that quotient variable. A rule set is therefore a collection of marginal constraints over interpretable abstractions. General lifting is the feasible family of all joint distributions satisfying those constraints, while special lifting chooses a maximum-ignorance reconstruction, implemented in ILL by an L2 uniformity principle closely related to maximum entropy. Under a Shannon-entropy lifting, the same constraints yield a log-linear factor graph whose factors are indexed by learned abstractions. The information lattice itself, however, is not a Bayesian network: its edges encode refinement and coarsening of abstractions, not conditional dependence. Thus ILL is best viewed as structure learning for interpretable constraint-based factor graphs over quotient variables. This view clarifies how ILL relates to graphical models and maximum entropy models, while suggesting new directions for inference, identifiability, and hybrid symbolic-probabilistic learning.

2606.19374 2026-06-19 cs.LG cs.AI 交叉投稿

Protein Representation Learning with Secondary-Structure and Energy-Filtered Hydrogen-Bond Graphs

基于二级结构和能量过滤氢键图的蛋白质表示学习

Mohamed Mouhajir, Limei Wang, El Houcine Bergou, Hajar El Hammouti, Lamiae Azizi, Dongqi Fu

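Affiliation: College of Computing, UM6P (Mohammed VI Polytechnic University)

AI summary: Proposes a secondary-structure-aware graph neural network that augments residue node representations and builds edges from energy-filtered hydrogen bonds to capture local structural context and long-range couplings, achieving consistent improvements on protein benchmarks and better biological interpretability.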

Journal ref The 25th International Workshop on Data Mining in Bioinformatics (BIOKDD 2026)

AI中文摘要

基于图的表示被广泛用于蛋白质建模，然而许多现有方法主要依赖序列邻接或几何邻近，这仅部分反映了控制蛋白质折叠的原理。蛋白质实际上采用围绕二级结构元素（如α-螺旋和β-折叠）组织的复杂三维构象，这些元素编码了重复的局部基序和稳定的氢键相互作用。在这项工作中，我们引入了一种二级结构感知的图神经网络用于蛋白质表示学习。残基级别的节点表示通过二级结构分配得到增强，图边由经过能量强度过滤的氢键相互作用构建。这种设计使模型能够捕获对蛋白质稳定性和功能至关重要的局部结构上下文和长程耦合。我们在常用的蛋白质基准上评估了所提出的方法，并观察到相对于现有基于图的方法的一致改进。此外，生成的图表示提供了增强的生物学可解释性，因为学习到的连接性与已建立的结构基序一致。这些发现表明，融入二级结构和能量过滤的氢键拓扑为蛋白质表示学习提供了有效的归纳偏置。代码发布在 https://github.com/mohamedmohamed2021/SSProNet。

英文摘要

Graph-based representations are widely used in protein modeling, yet many existing approaches rely primarily on sequence adjacency or geometric proximity, which only partially reflect the principles governing protein folding. Proteins instead adopt complex three-dimensional conformations organized around secondary structure elements, such as $\alpha$-helices and $\beta$-sheets, which encode recurring local motifs and stabilizing hydrogen-bond interactions. In this work, we introduce a secondary-structure-aware graph neural network for protein representation learning. Residue-level node representations are augmented with secondary structure assignments, and graph edges are constructed from hydrogen-bond interactions filtered by their energetic strength. This design enables the model to capture both local structural context and long-range couplings that are central to protein stability and function. We evaluate the proposed approach on commonly used protein benchmarks and observe consistent improvements over existing graph-based methods. In addition, the resulting graph representations offer enhanced biological interpretability, as the learned connectivity aligns with established structural motifs. These findings suggest that incorporating secondary structure and energy-filtered hydrogen-bond topology provides an effective inductive bias for protein representation learning. The code is released at https://github.com/mohamedmohamed2021/SSProNet.

2606.19379 2026-06-19 cs.LG cs.AI cs.CL 交叉投稿

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Transformer 前馈块有多线性？逐块线性可恢复性是学习得到的，而非架构决定的

Stuart Whipp

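Affiliation: Independent Research

AI summary: Uses the exact least-squares linear approximation to measure the linear recoverability of each feed-forward block in trained Transformers, finding that it is highly heterogeneous and non-monotone with depth, is a learned rather than architectural property, and can be used for compression and diagnostics.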

Comments 14 pages, 5 figures

AI中文摘要

Transformer 前馈网络（FFN）通常被视为非线性的计算存储单元，但训练后的 FFN 块实际非线性程度很少被测量。我们将每个 FFN 视为位置级的输入-输出映射，并将其分解为精确的最小二乘线性近似加上残差。闭式线性映射解释的留出方差定义了一个块的线性可恢复性（R^2_lin），这是一种无需优化器的线性度量。在 GPT-2、Pythia-160m 和 llama-160m 的所有十二个块中，R^2_lin 高度异质且随深度非单调变化，相邻块之间范围从近线性（>0.99）到强非线性（<0.3），且并非由激活函数决定：相同宽度的 GELU 模型 GPT-2 和 Pythia-160m 具有截然不同的轮廓，因此可恢复性是单个训练块的学习属性，而非架构属性。残差的低秩双线性探针仅恢复少量 R^2 点，且增益与残差非线性不相关：未恢复的计算不是单个位置级乘积，而是高阶或分布式结构。该测量还作为有针对性的压缩信号：可恢复块允许大的单层替换（GPT-2 的早期 FFN 参数减少 8 倍，困惑度增加 +0.77），而低可恢复性块标记了这不安全的情况。它还暴露了一个方法论陷阱：训练后的线性基线可能在病态条件的 Transformer 激活上严重欠收敛，因此我们报告了整个过程中精确的闭式最小二乘上限。

英文摘要

Transformer feed-forward networks (FFNs) are often treated as nonlinear stores of computation, yet how nonlinear a trained FFN block actually is has rarely been measured. We treat each FFN as a position-wise input-to-output map and split it into the exact least-squares linear approximation plus a residual. The held-out variance the closed-form linear map explains defines a block's linear recoverability (R^2_lin), an optimiser-free measure of its linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterogeneous and non-monotone with depth, ranging from near-linear (>0.99) to strongly nonlinear (<0.3) between adjacent blocks, and is not set by the activation function: same-width GELU models GPT-2 and Pythia-160m have sharply different profiles, so recoverability is a learned property of individual trained blocks, not an architectural one. A low-rank bilinear probe of the residual recovers only a few points of R^2, with gain uncorrelated with residual nonlinearity: the unrecovered computation is not a single position-wise product but higher-order or distributed structure. The measurement also serves as a targeted compression signal: recoverable blocks admit large single-layer replacements (GPT-2's early FFN at 8x fewer parameters for +0.77 perplexity), while low-recoverability blocks flag where this is unsafe. It further exposes a methodological pitfall: trained linear baselines can badly under-converge on ill-conditioned transformer activations, so we report the exact closed-form least-squares ceiling throughout.
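
A held-out, closed-form measurement of this kind can be sketched with NumPy as below. This is an illustration under stated assumptions, not the author's code: the function name is hypothetical, the fit includes a bias term, and R^2 is pooled over all output dimensions using the held-out mean, none of which the abstract specifies. The random data stands in for FFN inputs and outputs.

```python
import numpy as np


def linear_recoverability(x_train, y_train, x_test, y_test):
    """Held-out R^2 of the exact least-squares affine map from inputs to outputs."""
    # Append a constant column so the fitted map includes a bias (an assumption).
    a_train = np.hstack([x_train, np.ones((len(x_train), 1))])
    a_test = np.hstack([x_test, np.ones((len(x_test), 1))])
    # Closed-form least squares; no optimiser or learning rate involved.
    weights, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    residual = y_test - a_test @ weights
    centered = y_test - y_test.mean(axis=0)
    return 1.0 - (residual ** 2).sum() / (centered ** 2).sum()


rng = np.random.default_rng(0)
x = rng.normal(size=(400, 8))
w = rng.normal(size=(8, 4))

# A purely affine "block" is fully recoverable; |x @ w| has no linear component.
r2_affine = linear_recoverability(x[:200], x[:200] @ w + 1.0, x[200:], x[200:] @ w + 1.0)
r2_abs = linear_recoverability(x[:200], np.abs(x[:200] @ w), x[200:], np.abs(x[200:] @ w))
print(r2_affine, r2_abs)
```

Solving with `np.linalg.lstsq` rather than training a linear layer sidesteps the under-convergence pitfall the abstract warns about.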

2606.19476 2026-06-19 cs.LG cs.AI 交叉投稿

Can In-Context Learning Support Intrinsic Curiosity?

上下文学习能否支持内在好奇心？

Eric Elmoznino, Sangnie Bhardwaj, Johannes von Oswald, Rajai Nasser, Blaise Agüera y Arcas, João Sacramento, Rif A. Saurous, Guillaume Lajoie

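Affiliations: Google – Paradigms of Intelligence Team; Google DeepMind

AI summary: Investigates using the in-context learning ability of sequence models as an immediate, update-free world model to remove the gradient-descent bottleneck of traditional intrinsic-curiosity methods, and proves that in non-temporal settings the resulting rewards converge asymptotically to the true learning progress.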

AI中文摘要

有效的机器学习不仅取决于我们如何对数据建模，还取决于我们选择收集哪些数据。虽然大型序列模型已经彻底改变了数据建模，但自动数据选择或“内在好奇心”的问题仍然是一个重大挑战。经典方法通过基于智能体的“学习进度”奖励来激励探索，该奖励衡量新获得的观测在多大程度上改进了世界模型的预测能力。然而，传统上评估这些奖励需要在每个轨迹内进行昂贵的梯度下降内循环更新，这使得它们在规模上计算上不可行。在这项工作中，我们研究序列模型涌现的上下文学习（ICL）能力是否可以通过作为即时的、无需更新的世界模型来消除这一瓶颈。具体来说，我们评估是否可以训练一个探索策略来最大化学习进度，仅使用上下文学习者的预测误差和反事实上下文操作。我们首先证明，在一般马尔可夫决策过程中，这实际上不可能以无偏的方式实现：由此产生的内在奖励要么包含干扰项，使其对真实学习进度的估计产生偏差，要么无法使用上下文学习者的预测误差来实现。相反，我们对于非时间设置的一个广泛子类（包括主动学习和贝叶斯实验设计）证明了积极结果：在这里，ICL派生的奖励成功界定了真实学习进度并渐近收敛到它。我们通过连续和符号环境中的受控实验证实了我们的理论，表明我们的ICL驱动框架成功训练了以最优方式进行探索的好奇数据收集策略。

英文摘要

Effective machine learning depends not only on how we model data, but also on what data we choose to collect. While large sequence models have revolutionized data modeling, the problem of automated data selection, or "intrinsic curiosity", remains a significant challenge. Classic approaches incentivize exploration by rewarding an agent based on its "learning progress", which measures how much a newly acquired observation improves a world model's predictive ability. However, evaluating these rewards traditionally requires expensive inner loops of gradient descent updates within each trajectory, rendering them computationally impractical at scale. In this work, we investigate whether the emergent in-context learning (ICL) capabilities of sequence models can eliminate this bottleneck by serving as immediate, update-free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress, using solely the prediction errors and counterfactual context manipulations of an in-context learner. We first prove that in general Markov decision processes, this is in fact impossible in an unbiased way: the resulting intrinsic rewards either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in-context learner's prediction errors. Conversely, we prove a positive result for a broad subclass of non-temporal settings, encompassing active learning and Bayesian Experimental Design: here, ICL-derived rewards successfully bound and asymptotically converge to the true learning progress. We corroborate our theory with controlled experiments across continuous and symbolic environments, demonstrating that our ICL-driven framework successfully trains curious data-collection policies that explore optimally.

2606.19489 2026-06-19 cs.LG cs.AI 交叉投稿

Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks

概念流模型：通过层次瓶颈锚定基于概念的推理

Ya Wang, Adrian Paschke

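Affiliations: Fraunhofer Institute for Open Communication Systems; Freie Universität Berlin

AI summary: Proposes Concept Flow Models (CFMs), which replace the flat bottleneck with a hierarchical concept decision tree that progressively narrows the prediction scope, reducing information leakage and improving interpretability while maintaining predictive performance.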

Journal ref Transactions on Machine Learning Research, 2/2026

AI中文摘要

概念瓶颈模型（CBM）通过将学习到的特征投影到人类可理解的概念空间来增强可解释性。最近的方法利用视觉-语言模型生成概念嵌入，减少了对人工概念标注的需求。然而，这些模型存在一个关键限制：当概念数量接近嵌入维度时，信息泄露增加，使得模型能够利用虚假或语义上不相关的相关性，从而削弱可解释性。在这项工作中，我们提出了概念流模型（CFM），它将扁平瓶颈替换为层次化的、概念驱动的决策树。层次结构中的每个内部节点专注于局部判别性概念子集，逐步缩小预测范围。我们的框架从视觉嵌入构建决策层次，在每个层次级别分布语义概念，并通过概率树遍历训练可微的概念权重。在多个基准上的大量实验表明，CFM在预测性能上与扁平CBM相当，同时通过减少有效概念使用显著缓解了信息泄露。此外，CFM产生逐步决策流，使得具有层次类结构的透明且可审计的模型推理成为可能。

英文摘要

Concept Bottleneck Models (CBMs) enhance interpretability by projecting learned features into a human-understandable concept space. Recent approaches leverage vision-language models to generate concept embeddings, reducing the need for manual concept annotations. However, these models suffer from a critical limitation: as the number of concepts approaches the embedding dimension, information leakage increases, enabling the model to exploit spurious or semantically irrelevant correlations and undermining interpretability. In this work, we propose Concept Flow Models (CFMs), which replace the flat bottleneck with a hierarchical, concept-driven decision tree. Each internal node in the hierarchy focuses on a localized subset of discriminative concepts, progressively narrowing the prediction scope. Our framework constructs decision hierarchies from visual embeddings, distributes semantic concepts at each hierarchy level, and trains differentiable concept weights through probabilistic tree traversal. Extensive experiments on diverse benchmarks demonstrate that CFMs match the predictive performance of flat CBMs, while substantially mitigating information leakage by reducing effective concept usage. Furthermore, CFMs yield stepwise decision flows that enable transparent and auditable model reasoning with hierarchical class structures.

2606.19528 2026-06-19 cs.LG cs.AI 交叉投稿

Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices

边缘设备上LLM LoRA微调峰值内存降低技术

Hassan Dbouk, Matthias Reisser, Prathamesh Mandke, Likhita Arun Navali, Christos Louizos

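AI summary: To address the memory bottleneck of LoRA fine-tuning of LLMs on edge devices, proposes four complementary techniques (quantization, checkpointing, softmax approximation, and logits masking), achieving peak-memory reductions of up to 26x on Llama-3.2 3B and 28x on Qwen-2.5 3B.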

Comments Hassan Dbouk and Matthias Reisser contributed equally to this work

2606.19629 2026-06-19 cs.SD cs.AI cs.LG 交叉投稿

RIVET: Robust Idempotent Voice Attribute Editing

RIVET: 鲁棒的幂等语音属性编辑

Dareen Alharthi, Bhuvan Koduru, Rita Singh, Bhiksha Raj

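Affiliation: Carnegie Mellon University

AI summary: Proposes RIVET, a training framework that uses idempotency regularization to make voice attribute editing models more robust to label noise, outperforming standard training on datasets with both synthetic and real label noise.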

AI中文摘要

语音属性编辑模型在保留说话人身份的同时修改年龄和性别等特征。然而，在大规模语音数据集中，属性标注通常带有噪声或不一致，这可能导致条件生成模型产生不稳定的编辑。在这项工作中，我们证明幂等性为提升对噪声标签的鲁棒性提供了一种有效机制。幂等算子是指重复应用不会改变结果的算子，即 f(f(x)) = f(x)。强制这一性质作为一种隐式正则化器，降低了对错误标注样本的敏感性。我们引入了 RIVET，一种结合幂等性目标以提升对标签噪声鲁棒性的训练框架。我们在受控标签噪声下以及在具有自然噪声标注的 GLOBE 数据集上评估了 RIVET。RIVET 提高了编辑成功率，并且比标准训练更好地保留了说话人身份，表明幂等性提升了语音编辑模型的鲁棒性。

英文摘要

Voice attribute editing models modify characteristics such as age and gender while preserving speaker identity. In large-scale speech datasets, however, attribute annotations are often noisy or inconsistent, which can cause conditional generative models to produce unstable edits. In this work, we show that idempotency provides an effective mechanism for improving robustness to noisy labels. An idempotent operator is one for which repeated application does not change the result, i.e., f(f(x)) = f(x). Enforcing this property acts as an implicit regularizer that reduces sensitivity to mislabeled examples. We introduce RIVET, a training framework that incorporates an idempotency objective to improve robustness to label noise. We evaluate RIVET under controlled label noise and on the GLOBE dataset with naturally noisy annotations. RIVET improves editing success and better preserves speaker identity than standard training, showing that idempotency improves robustness in voice editing models.
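
The property f(f(x)) = f(x) can be turned into a penalty by comparing one application of an operator with two. The sketch below is a generic illustration on plain lists of floats; the function name and the mean-squared form are assumptions, and the abstract does not say how RIVET's idempotency objective is actually defined.

```python
def idempotency_penalty(f, inputs):
    """Mean squared gap between f(x) and f(f(x)); zero when f is idempotent on `inputs`."""
    total, count = 0.0, 0
    for x in inputs:
        once = f(x)
        twice = f(once)
        total += sum((a - b) ** 2 for a, b in zip(once, twice))
        count += len(once)
    return total / count


def clamp_unit(x):
    """Idempotent: clamping an already clamped vector changes nothing."""
    return [min(1.0, max(0.0, v)) for v in x]


def halve(x):
    """Not idempotent: every application shrinks the vector again."""
    return [0.5 * v for v in x]


batch = [[2.0, -4.0]]
print(idempotency_penalty(clamp_unit, batch))  # 0.0
print(idempotency_penalty(halve, batch))       # 0.625
```

In a training setup, a term like this would be added to the main editing loss so that re-applying an edit with the same target attribute leaves the output unchanged.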

2606.19679 2026-06-19 cs.LG cs.AI 交叉投稿

LOKI: Memory-Free Null-Space Constrained Lifelong Knowledge Editing

LOKI: 无记忆零空间约束的终身知识编辑

Masih Eskandar, Miquel Sirera Perelló, Stratis Ioannidis, Jennifer Dy

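AI summary: Proposes LOKI, which dynamically selects layers using the Hilbert-Schmidt independence criterion and projects gradient updates onto the null space of the model weights, enabling lifelong knowledge editing without access to old knowledge and improving average accuracy by 14%.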

2606.19697 2026-06-19 cs.LG cs.AI cs.CL 交叉投稿

迈向基于预训练数据组成的工程化缩放定律

Jan-Lucas Uslu, Kevin Greif, Daniel Whiteson, Benjamin Nachman

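AI summary: Studies how engineering the composition of pre-training data (increasing diversity and alignment with downstream tasks) changes the scaling behavior of neural networks in particle physics, shifting it to favor data scaling over model scaling.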

2606.19805 2026-06-19 cs.CV cs.AI 交叉投稿

ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number

ParaScale: 通过规范不变视差数进行尺度校准的相机运动迁移

Zijie Meng

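Affiliation: Peking University

AI summary: Proposes ParaScale, a module that achieves scale-faithful camera-motion transfer through the gauge-invariant Parallax Number Pi, requires no retraining, and reduces the Parallax Consistency Error by more than 3x across four orders of magnitude of scale.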

Comments Accepted by SCA2026(poster)

AI中文摘要

将参考视频的相机运动迁移到新生成的视频中，可以让创作者重复使用电影级运镜。然而，参考视频和目标视频往往处于不兼容的尺度——例如跨越银河系的扫视与桌面上的轻推——直接复用恢复的轨迹会导致运动要么不可察觉，要么剧烈夸张。我们将此归结为一个几何事实：平移引起的图像运动与||T||/Z成比例，因此单目轨迹仅在深度尺度规范下才有意义。我们将此提炼为视差数Pi = ||Delta T|| / Zbar，这是一个无量纲、规范不变的描述符，用于衡量相机运动的感知强度，并证明它是尺度忠实迁移必须保持的量，而非原始轨迹。ParaScale是一个即插即用模块，它从任何参考视频中读取Pi，并针对目标场景的深度逐帧重新实现它，保持旋转不变。它位于姿态提取和姿态注入之间，无需重新训练，可插入任何姿态条件生成器。我们进一步引入了视差一致性误差（PCE），这是一种尺度对称的度量，与相似性对齐的TransErr不同，它能暴露场景尺度不匹配。在跨越四个数量级的尺度范围和多个骨干网络上，ParaScale将实现的视差保持在恒等线上，并将PCE比未校准的迁移降低3倍以上，且不损失视觉保真度。

英文摘要

Transferring the camera motion of a reference video to a freshly generated one lets creators reuse cinematic moves. Yet reference and target often live at incompatible scales -- a sweep across a galaxy versus a nudge across a desk -- and naively reusing the recovered trajectory yields either imperceptible or violently exaggerated motion. We trace this to a geometric fact: translation-induced image motion scales as ||T||/Z, so a monocular trajectory is meaningful only up to a depth-scale gauge. We distill this into the Parallax Number Pi = ||Delta T|| / Zbar, a dimensionless, gauge-invariant descriptor of how strongly a camera move is felt, and prove that it -- not the raw trajectory -- is the quantity that scale-faithful transfer must preserve. ParaScale is a plug-and-play module that reads Pi off any reference video and re-realizes it against the target scene's own depth, per frame, leaving rotation untouched. Sitting between pose extraction and pose injection, it requires no retraining and drops into any pose-conditioned generator. We further introduce the Parallax Consistency Error (PCE), a scale-symmetric metric that -- unlike the similarity-aligned TransErr -- exposes scene-scale mismatch. Across scale regimes spanning four orders of magnitude and multiple backbones, ParaScale keeps the realized parallax on the identity line and cuts PCE by more than 3x over uncalibrated transfer with no loss of visual fidelity.
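
The definition Pi = ||Delta T|| / Zbar implies a simple rescaling rule: to keep Pi fixed when moving to a scene with a different mean depth, the translation step must be scaled by the depth ratio. The sketch below works through that arithmetic for one frame. The function names are hypothetical, keeping the translation direction unchanged is an assumption, and the actual module's depth estimation and per-frame handling are not reproduced.

```python
import math


def parallax_number(delta_t, mean_depth):
    """Pi = ||Delta T|| / Zbar for one frame-to-frame translation step."""
    return math.sqrt(sum(c * c for c in delta_t)) / mean_depth


def rescale_translation(delta_t_ref, mean_depth_ref, mean_depth_target):
    """Scale the reference translation so the target scene sees the same Pi."""
    scale = mean_depth_target / mean_depth_ref
    return [c * scale for c in delta_t_ref]


# Reference: a 5-unit move in a scene 1000 units deep. Target: a scene 2 units deep.
delta_t_ref = [3.0, 4.0, 0.0]
delta_t_target = rescale_translation(delta_t_ref, 1000.0, 2.0)

pi_ref = parallax_number(delta_t_ref, 1000.0)
pi_target = parallax_number(delta_t_target, 2.0)
pi_naive = parallax_number(delta_t_ref, 2.0)  # reusing the raw trajectory
print(pi_ref, pi_target, pi_naive)
```

Reusing the raw trajectory gives a parallax 500 times larger than intended here, which is the "violently exaggerated motion" failure the abstract describes.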

2606.19827 2026-06-19 cs.LG cs.AI 交叉投稿

When, Where, and How: Adaptive Binning for Tabular Self-Supervised Learning

何时、何地以及如何：面向表格自监督学习的自适应分箱

Daehwan Kim, Haejun Chung, Ikbeom Jang

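Affiliations: Hanyang University; Hankuk University of Foreign Studies

AI summary: Proposes Adaptive Binning, which dynamically refines discretization through a feature-wise coarse-to-fine curriculum and combines categorical reconstruction with ordinal supervision, improving self-supervised learning performance on medical tabular data.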

Comments Accepted to MICCAI 2026

AI中文摘要

医疗表格数据在临床研究中无处不在，但表格数据的深度学习仍未被充分探索，因为可靠的标签通常需要昂贵的专家判定，尽管结构化临床变量通常以表格形式常规可用。自监督学习可以利用这些未标记的表格，而最近基于分箱的前置任务提供了一种有前景的归纳偏置，但现有目标固定单个全局分位数离散化并应用特征无关的监督。我们提出自适应分箱，一种用于表格自监督学习的训练自适应离散化前置任务，通过特征级粗到细课程将离散化与学习耦合。受神经网络的频谱偏差和课程学习原则的启发，我们的方法在检测到平台期时逐步细化每个特征的离散化，并选择表示感知的分割点，以联合改善值空间浓度和表示空间一致性。一种异质性感知目标统一了类别重建与数值特征的顺序监督，在统一评估协议下对公共医疗表格数据集的实验显示，线性探测和微调均取得一致改进，无需数据集特定的离散化调整。我们进一步引入一个医疗表格自监督学习基准，配备标准化协议，以支持这一未被充分探索领域的可重复进展。我们的代码可在 https://github.com/labhai/Adaptive-Binning 获取。

英文摘要

Medical tabular data are ubiquitous in clinical research, but deep learning for tables remains underexplored because reliable labels often require costly expert adjudication, even though structured clinical variables are routinely available in tabular form. Self-supervised learning can leverage these unlabeled tables, and recent binning-based pretexts offer a promising inductive bias, but existing objectives fix a single global quantile discretization and apply feature-agnostic supervision. We propose Adaptive Binning, a training-adaptive discretization pretext for tabular SSL that couples discretization to learning through a feature-wise coarse-to-fine curriculum. Motivated by the spectral bias of neural networks and the principles of curriculum learning, our method progressively refines discretization per feature upon plateau detection and selects representation-aware splits to jointly improve value-space concentration and representation-space coherence. A heterogeneity-aware objective unifies categorical reconstruction with ordinal supervision for numerical features, and experiments on public medical tabular datasets under unified evaluation protocols show consistent gains for linear probing and fine-tuning without dataset-specific discretization tuning. We further introduce a medical tabular SSL benchmark with standardized protocols to support reproducible progress in this underexplored domain. Our code is available at https://github.com/labhai/Adaptive-Binning.

2606.19850 2026-06-19 cs.LG cs.AI 交叉投稿

Neural Additive and Basis Models with Feature Selection and Interactions

具有特征选择和交互的神经加性模型与神经基础模型

Yasutoshi Kishimoto, Kota Yamanishi, Takuya Matsuda, Shinichi Shirakawa

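Affiliation: Yokohama National University

AI summary: Proposes adding a feature selection mechanism to neural additive models and neural basis models; a feature selection layer reduces computational overhead and supports learning feature interactions in high-dimensional data, with performance better than or comparable to existing GAM methods.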

Comments Accepted at PAKDD 2024. Code is available at https://github.com/shiralab/NAM-FS

DOI: 10.1007/978-981-97-2259-4_1

AI中文摘要

深度神经网络（DNN）在各个领域表现出色，但通常可解释性较低。神经加性模型（NAM）及其变体神经基础模型（NBM）在广义加性模型（GAM）中使用神经网络（NN）作为非线性形状函数。这两种模型具有高度可解释性，并且在NN训练中表现出良好的性能和灵活性。NAM和NBM基于GAM架构，可以提供并可视化每个特征对预测的贡献。然而，当使用双输入NN来考虑特征交互或将其应用于高维数据集时，由于所需计算资源的增加，训练NAM和NBM变得棘手。本文提出将特征选择机制融入NAM和NBM以解决计算瓶颈。我们在两种模型中引入特征选择层，并在训练过程中更新选择权重。我们的方法简单，与原始NAM和NBM相比，可以降低计算成本和模型大小。此外，它使我们即使在数据维度很高的情况下也能使用双输入NN并捕获特征交互。我们证明，所提出的模型与原始NAM和NBM相比计算效率更高，并且与最先进的GAM相比表现出更好或相当的性能。

英文摘要

Deep neural networks (DNNs) exhibit attractive performance in various fields but often suffer from low interpretability. The neural additive model (NAM) and its variant called the neural basis model (NBM) use neural networks (NNs) as nonlinear shape functions in generalized additive models (GAMs). Both models are highly interpretable and exhibit good performance and flexibility for NN training. NAM and NBM can provide and visualize the contribution of each feature to the prediction owing to GAM-based architectures. However, when using two-input NNs to consider feature interactions or when applying them to high-dimensional datasets, training NAM and NBM becomes intractable due to the increase in the computational resources required. This paper proposes incorporating the feature selection mechanism into NAM and NBM to resolve computational bottlenecks. We introduce the feature selection layer in both models and update the selection weights during training. Our method is simple and can reduce computational costs and model sizes compared to vanilla NAM and NBM. In addition, it enables us to use two-input NNs even in high-dimensional datasets and capture feature interactions. We demonstrate that the proposed models are computationally efficient compared to vanilla NAM and NBM, and they exhibit better or comparable performance with state-of-the-art GAMs.

2606.19888 2026-06-19 cs.LG cs.AI 交叉投稿

StreamKL: 快速且内存高效的KL散度用于提升注意力蒸馏

Guangda Liu, Yiquan Wang, Chengwei Li, Wenhao Chen, Jing Lin, Yiwu Yao, Danning Ke, Wenchao Ding, Jieru Zhao

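Affiliations: Shanghai Jiao Tong University; Huawei; Fudan University

AI summary: Proposes StreamKL, the first fused GPU primitive for attention KL divergence; an online formulation and tile-by-tile recomputation reduce the extra HBM footprint of attention distillation from O(N_QN_K) to O(1), with speedups of up to 43x in the forward pass and 14x in the backward pass.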

AI中文摘要

注意力蒸馏通过最小化Kullback-Leibler (KL)散度来训练一个注意力分布匹配另一个，广泛应用于知识蒸馏、模型压缩、持续学习和稀疏注意力LLM训练。然而，现有方法在计算KL归约前需要具体化两个注意力分布，导致$O(N_QN_K)$的内存和IO成本，在长上下文长度下变得不可接受。我们提出StreamKL，首个用于注意力KL散度的融合GPU原语，消除了这种二次具体化。StreamKL推导了一种新颖的在线公式用于耦合的双分布KL归约，使得单个前向内核能够通过片上SRAM流式处理查询-键块。对于反向传播，StreamKL逐块重计算注意力概率，避免存储二次中间结果。我们进一步设计并实现了具有专用优化的高效GPU内核。实验表明，StreamKL在前向和反向传播中分别比基线方法快高达43倍和14倍。最重要的是，StreamKL将注意力蒸馏的额外HBM占用从$O(N_QN_K)$减少到$O(1)$，使得在单个GPU上进行长上下文蒸馏成为可能。

英文摘要

Attention distillation, which trains one attention distribution to match another by minimizing their Kullback-Leibler (KL) divergence, is widely used in knowledge distillation, model compression, continual learning, and sparse-attention LLM training. However, existing approaches materialize both attention distributions before computing the KL reduction, incurring $O(N_QN_K)$ memory and IO costs that become prohibitive at long context lengths. We present StreamKL, the first fused GPU primitive for attention KL divergence that eliminates this quadratic materialization. StreamKL derives a novel online formulation for the coupled two-distribution KL reduction, enabling a single one-pass forward kernel that streams query-key tiles through on-chip SRAM. For the backward pass, StreamKL recomputes attention probabilities tile-by-tile, avoiding storage of quadratic intermediates. We further design and implement efficient GPU kernels with dedicated optimizations. Experiments show StreamKL delivers up to $43\times$ and $14\times$ speedups over baseline methods in the forward and backward passes, respectively. Most importantly, StreamKL reduces the extra HBM footprint of attention distillation from $O(N_QN_K)$ to $O(1)$, enabling long-context distillation on a single GPU.
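
The idea of a one-pass KL reduction that never materializes either softmax can be sketched on the CPU for a single query row. This is not the StreamKL kernel: it is an online-softmax-style accumulation derived from KL(P||Q) = sum_i p_i (s_i - t_i) - logsumexp(s) + logsumexp(t), where s and t are the two rows of attention scores. The function names, the tile size, and the choice of KL direction are assumptions.

```python
import math


def streaming_kl(p_scores, q_scores, tile=4):
    """KL(softmax(p_scores) || softmax(q_scores)) in one pass over key tiles."""
    max_p = max_q = -math.inf
    sum_p = sum_q = 0.0   # running sums of exp(score - running max)
    weighted = 0.0        # running sum of exp(p - max_p) * (p - q)
    for start in range(0, len(p_scores), tile):
        p_tile = p_scores[start:start + tile]
        q_tile = q_scores[start:start + tile]
        new_max_p = max(max_p, max(p_tile))
        new_max_q = max(max_q, max(q_tile))
        # Rescale the accumulators to the new running maxima (0.0 on the first tile).
        scale_p = math.exp(max_p - new_max_p)
        scale_q = math.exp(max_q - new_max_q)
        sum_p, weighted, sum_q = sum_p * scale_p, weighted * scale_p, sum_q * scale_q
        for p, q in zip(p_tile, q_tile):
            w = math.exp(p - new_max_p)
            sum_p += w
            weighted += w * (p - q)
            sum_q += math.exp(q - new_max_q)
        max_p, max_q = new_max_p, new_max_q
    return weighted / sum_p - (max_p + math.log(sum_p)) + (max_q + math.log(sum_q))


def dense_kl(p_scores, q_scores):
    """Reference that materializes both log-softmax rows."""
    def log_softmax(scores):
        m = max(scores)
        lse = m + math.log(sum(math.exp(s - m) for s in scores))
        return [s - lse for s in scores]
    log_p, log_q = log_softmax(p_scores), log_softmax(q_scores)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


p_row = [0.5, -1.0, 2.0, 0.0, 3.0, -2.0, 1.0]
q_row = [1.0, 0.0, -1.0, 2.0, 0.5, 0.5, -0.5]
print(streaming_kl(p_row, q_row, tile=3), dense_kl(p_row, q_row))
```

Only five scalars are carried between tiles, which is the per-row analogue of the constant extra memory the abstract reports.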

2606.20076 2026-06-19 cs.CV cs.AI 交叉投稿

Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers

基于可学习全局合并的可变长度分词用于扩散变换器

Dong Hoon Lee, Seunghoon Hong

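Affiliations: Kim Jaechul Graduate School of AI, KAIST, Daejeon, South Korea; School of Computing, KAIST, Daejeon, South Korea

AI summary: To address the fixed compression ratio that constrains the quality-compute trade-off of diffusion models, proposes a variable-length tokenizer based on learnable global merging; merging tokens aligns representations across lengths and yields a better gFID-compute trade-off on ImageNet 256×256 generation.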

AI中文摘要

潜在扩散模型（LDM）在视觉合成中占据主导地位，但其质量-计算权衡很大程度上受限于分词器的固定压缩比。可变长度分词器（VLT）通过改变令牌数量实现自适应压缩，使扩散模型能够灵活平衡质量和计算。然而，传统的VLT通过截断有序令牌序列来调节长度，这使得令牌语义依赖于令牌位置，并破坏了跨长度的表示对齐。这导致潜在分布出现跨长度偏移，阻碍单个可变长度扩散模型有效运行。为了解决这个问题，我们提出了一种新颖的可变长度分词器，通过合并令牌来调节长度。我们表明，当扩散变换器根据合并模式运行时，鼓励相似令牌合并可以实现直接的跨长度表示对齐。由于传统的合并方法是数据依赖的，使得生成过程中无法访问合并模式，我们引入了可学习的全局合并，它是数据独立的，以确保与扩散变换器的兼容性。在ImageNet 256×256生成中，我们的基于合并的可变长度分词器与扩散变换器集成，相比之前的VLT方法实现了更优的gFID-计算权衡。代码可在[此https URL](此https URL)获取。

英文摘要

Latent Diffusion Models (LDMs) have become dominant in visual synthesis, but their quality-compute trade-off is largely constrained by the tokenizer's fixed compression ratio. Variable-length tokenizers (VLTs) promise adaptive compression by varying token counts, allowing diffusion models to flexibly balance quality and compute. However, conventional VLTs modulate length by truncating ordered token sequences, which makes token semantics depend on token position and breaks representational alignment across lengths. This leads to a cross-length shift in the latent distribution that hinders a single variable-length diffusion model from operating effectively. To address this, we propose a novel variable-length tokenizer that modulates length by merging tokens. We show that encouraging similar tokens to merge enables direct cross-length representation alignment when the diffusion transformer operates according to the merging pattern. Since conventional merging methods are data-dependent, making the merging pattern inaccessible during generation, we introduce learnable global merging, which is data-independent, to ensure compatibility with diffusion transformers. On ImageNet 256$\times$256 generation, our merging-based variable-length tokenizer integrated with a diffusion transformer achieves a superior gFID-compute trade-off compared to prior VLT methods. Code is available at https://github.com/movinghoon/lgm.

2606.20104 2026-06-19 cs.LG cs.AI 交叉投稿

Sensorimotor World Models: Perception for Action via Inverse Dynamics

传感器运动世界模型：通过逆动力学实现面向行动感知

Petr Ivashkov, Randall Balestriero, Bernhard Schölkopf

发表机构 * Max Planck Institute for Intelligent Systems（马克斯·普朗克智能系统研究所）； Department of Computer Science, Brown University（布朗大学计算机科学系）； ELLIS Institute（ELLIS研究所）； ETH Zürich（苏黎世联邦理工学院）

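AI summary: Proposes the sensorimotor world model (SMWM), a latent world model trained end-to-end with inverse dynamics regularization, which prevents representation collapse and learns compact, action-aligned representations, reaching competitive planning performance on simple 2D and 3D control tasks.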

AI中文摘要

面向行动的感知表明，世界的表示不应仅由视觉保真度决定，而应由其与行动的相关性决定。同时，潜在的JEPA风格世界模型主张从高维观测中学习紧凑的预测状态以促进未来状态的预测，但这些模型的端到端训练并非易事，因为如果我们的唯一目标是构建易于预测的潜在状态，表示可能会崩溃。我们引入了一种传感器运动世界模型（SMWM）：一种通过逆动力学正则化进行端到端训练的潜在世界模型。这一单一正则化解决了两个问题：它防止表示崩溃并诱导与行动对齐的表示。通过迫使潜在状态保留关于转换背后行动的信息，它使模型偏向于环境中可控的自由度，同时丢弃不可控的干扰因素。这产生了从离线、无奖励轨迹中训练的稳定潜在世界模型，无需冻结编码器、指数移动平均或复杂的潜在正则化。实验表明，SMWM学习了紧凑、可解释的潜在空间，并在简单的2D和3D控制任务中实现了竞争性的规划性能。

英文摘要

Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions. At the same time, latent JEPA-style world models advocate learning compact predictive states from high-dimensional observations to facilitate the prediction of future states, but end-to-end training of these models is nontrivial because representations may collapse if our only goal is to construct a latent state that is easy to predict. We introduce a sensorimotor world model (SMWM): a latent world model trained end-to-end with inverse dynamics regularization. This single regularizer addresses both issues: it prevents representation collapse and induces action-aligned representations. By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors. This yields stable latent world models trained from offline, reward-free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers. Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.
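
Why a prediction objective alone admits collapse, and why an inverse dynamics term rules it out, can be seen in a toy one-dimensional setting. The sketch below is an illustration under strong assumptions, not the SMWM training code: latents and actions are scalars, the forward and inverse models are single fitted coefficients, and all function names and the weight `lam` are hypothetical.

```python
def prediction_loss(z_t, z_next, actions):
    """Forward model z_next ~ z_t + v * a, with v fitted by 1-D least squares."""
    num = sum(a * (zn - z) for z, zn, a in zip(z_t, z_next, actions))
    den = sum(a * a for a in actions)
    v = num / den if den > 0 else 0.0
    return sum((zn - z - v * a) ** 2 for z, zn, a in zip(z_t, z_next, actions)) / len(actions)

def inverse_dynamics_loss(z_t, z_next, actions):
    """Inverse model a ~ w * (z_next - z_t), with w fitted by 1-D least squares."""
    deltas = [zn - z for z, zn in zip(z_t, z_next)]
    den = sum(d * d for d in deltas)
    w = sum(d * a for d, a in zip(deltas, actions)) / den if den > 0 else 0.0
    return sum((a - w * d) ** 2 for d, a in zip(deltas, actions)) / len(actions)

def total_loss(z_t, z_next, actions, lam=1.0):
    """Prediction loss plus a weighted inverse dynamics regularizer."""
    return prediction_loss(z_t, z_next, actions) + lam * inverse_dynamics_loss(z_t, z_next, actions)

# Toy transitions: the true state moves by the action (x_next = x + a).
actions = [1.0, 2.0, -1.0]
informative = ([0.0, 1.0, 3.0], [1.0, 3.0, 2.0])   # latent equals the true state
collapsed = ([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])     # encoder maps everything to 0
```

Both encoders reach zero prediction loss, but only the informative one also reaches zero inverse dynamics loss: a constant latent carries no information about which action was taken.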

2606.20151 2026-06-19 cs.NE cs.AI 交叉投稿

Hybrid ANN-SNN Pipeline with Local Plasticity

混合ANN-SNN流水线与局部可塑性

Denis Larionov, Khairutin Shtanchaev, Mikhail Kiselev, Mikhail Korovin, Ivan Tugoy

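AI summary: Proposes a hybrid ANN-SNN pipeline that uses the rich embeddings of a pretrained ANN to obtain a high-performing SNN, trained with rate coding and local learning rules, reaching 99.09% accuracy on 64-class ImageNet.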

Comments 9 pages, 4 figures, source code available

2606.20189 2026-06-19 cs.CV cs.AI cs.RO 交叉投稿

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

HilDA：利用扩散的分层蒸馏推进自监督LiDAR预训练

Maciej Wozniak, Jesper Ericsson, Hariprasath Govindarajan, Truls Nyberg, Thomas Gustafsson, Patric Jensfelt, Olov Andersson

发表机构 * KTH Royal Institute of Technology（瑞典皇家理工学院）； Linköping University（林雪平大学）； TRATON AB（TRATON公司）； Qualcomm Auto Ltd Sweden Filial（高通汽车有限公司瑞典分公司）

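AI summary: Proposes HilDA, a self-supervised pretraining framework for LiDAR backbones that combines hierarchical distillation (multi-layer and global-context distillation) with a temporal occupancy diffusion objective. It achieves state-of-the-art results on cross-modal distillation benchmarks and outperforms prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction.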

Comments Accepted to ECCV 2026. Maciej and Jesper contributed equally

AI中文摘要

利用视觉基础模型（VFM）进行相机到LiDAR的知识蒸馏为解决真实世界自动驾驶中巨大的几何和运动多样性所需的标注数据稀缺问题提供了一种有前景的方案。然而，当前方法通常将VFM视为黑盒教师，仅依赖逐帧特征相似性。因此，它们未能充分利用教师的逐层语义结构和全局上下文，以及LiDAR序列中固有的丰富时空信息。我们提出HilDA，一个用于LiDAR骨干网络的自监督预训练框架，能更好地捕捉驾驶任务所需的语义“是什么”和几何“在哪里”。HilDA结合了分层蒸馏（包括用于渐进语义对齐的多层蒸馏和用于场景级语义的全局上下文蒸馏）与一个促进时空一致性的时间占用扩散目标。使用HilDA预训练的模型在跨模态蒸馏基准上取得了最先进的结果，并在3D目标检测、场景流和语义占用预测任务上优于通过先前蒸馏方法训练的模型。代码见：此 https URL。

英文摘要

Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic what and geometric where needed for driving tasks. HilDA combines hierarchical distillation comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.

2606.20216 2026-06-19 cs.LG cs.AI 交叉投稿

Learner-based Concept Drift Detection: Analysis and Evaluation

基于学习器的概念漂移检测：分析与评估

Md Moman Ul Haque Khan, Samira Sadaoui

发表机构 * Department of Computer Science, University of Regina（里贾纳大学计算机科学系）

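AI summary: Analyzes the characteristics of concept drift theoretically and evaluates several drift detection algorithms on synthetic and real-world datasets, aiming to improve understanding of how drift detectors behave and where they apply.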

Comments 2 authors, 29 pages

2606.20246 2026-06-19 cs.RO cs.AI 交叉投稿

Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think

微调视觉-语言-动作模型所需的层数比你想象的少

Gia-Binh Nguyen, Trong-Bao Ho, Thien-Loc Ha, Khoa Vo, Philip Lund Møller, Quang T. Nguyen, Long Dinh, Tuan Dam, Vu Duong, Tung M. Luu, Trung Le, Tran Nguyen Le, Minh Vu, An Thai Le, Ngan Le, Daniel Sonntag, James Zou, Jan Peters, Duy M. H. Nguyen, Ngo Anh Vien

发表机构 * Center for AI Research, VinUniversity（VinUniversity人工智能研究中心）； VinRobotics ； University of Arkansas（阿肯色大学）； Technical University of Denmark（丹麦技术大学）； Hanoi University of Science and Technology（河内科技大学）； KAIST（韩国科学技术院）； Monash University（莫纳什大学）； Oldenburg University（奥尔登堡大学）； DFKI（德国人工智能研究中心）； University of Stuttgart（斯图加特大学）； IMPRS-IS（国际马克斯·普朗克智能系统研究学院）； Stanford University（斯坦福大学）； Technische Universität Darmstadt（达姆施塔特工业大学）

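AI summary: Finds layer-wise representational redundancy in VLA models and proposes a training-free compression method that removes redundant layers to cut model depth by up to 50%, giving a 40-50% reduction in fine-tuning time and up to 30% faster inference while matching or exceeding full-model performance.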

AI中文摘要

在大规模视频-机器人数据集上预训练的视觉-语言-动作（VLA）模型彻底改变了机器人操作，但其数十亿参数架构在下游微调和实时推理过程中带来了巨大的计算负担。在这项工作中，我们揭示了这些连续控制基础策略（例如pi_0、GR00T-N1.5）的一个高度非平凡的结构特性：尽管在多样化的物理轨迹上训练，它们表现出严重的逐层表示冗余。为了利用这一点，我们引入了一个完全无需训练的结构压缩流程，避免了现有方法需要加载全尺寸模型来学习优化的令牌缩减或动态层选择器的需求。相反，仅通过使用中心核对齐的单次前向传递来识别冗余层特征，我们移除孪生层以永久压缩模型深度高达50%，涵盖VLM主干和连续控制策略头。这种精简架构的下游微调带来了双重加速效益：训练时间减少40-50%，实时推理速度提升高达30%，同时匹配或超越全尺寸基模型性能。我们在三个模拟基准（LIBERO、RoboCasa、SimplerEnv）和10个跨4种不同机器人实体的多样化真实世界操作任务上全面验证了我们的方法。这些结果证明，先进的VLA所需的层数远少于先前假设，为可扩展的机器人学习提供了一种高度计算高效的范式。

英文摘要

Vision-Language-Action (VLA) models pre-trained on massive video-robot datasets have revolutionized robotic manipulation, yet their multi-billion parameter architectures impose prohibitive computational burdens during downstream fine-tuning and real-time inference. In this work, we reveal a highly non-trivial architectural characteristic of these continuous control foundation policies (e.g., pi_0, GR00T-N1.5): despite being trained on diverse physical trajectories, they exhibit severe layer-wise representational redundancy. To exploit this, we introduce a structural compression pipeline that is entirely training-free, bypassing the need of existing methods to load full-scale models to learn optimized token reductions or dynamic layer selectors. Instead, using only a single forward pass via Centered Kernel Alignment to identify redundant layer features, we remove twin layers to permanently compress the model depth by up to 50% across both the VLM backbone and the continuous control policy head. Downstream fine-tuning of this streamlined architecture yields a dual acceleration benefit: a 40-50% reduction in training time and up to 30% faster real-time inference, while matching or exceeding full-scale base model performance. We comprehensively validate our method across three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 diverse real-world manipulation tasks across 4 unique robotic embodiments. These results prove that advanced VLAs require significantly fewer layers than previously assumed, offering a highly compute-efficient paradigm for scalable robot learning.
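
Centered Kernel Alignment compares two sets of activations recorded on the same inputs. The sketch below uses the standard linear CKA formula, ||Xc^T Yc||_F^2 / (||Xc^T Xc||_F ||Yc^T Yc||_F) on column-centered features, in plain Python. How the paper turns CKA scores into "twin layer" decisions is not described in the abstract, so the adjacent-layer rule and the 0.95 threshold in `redundant_layers` are assumptions made for illustration.

```python
import math

def center(X):
    """Subtract the column mean from every feature (rows are samples)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]

def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y, computed column pair by column pair."""
    total = 0.0
    for col_x in zip(*X):
        for col_y in zip(*Y):
            dot = sum(u * v for u, v in zip(col_x, col_y))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA between activations X (n x d1) and Y (n x d2)."""
    Xc, Yc = center(X), center(Y)
    norm = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return cross_frobenius_sq(Xc, Yc) / norm

def redundant_layers(activations, threshold=0.95):
    """Indices of layers whose output is nearly identical (by CKA) to the previous layer's.

    Hypothetical selection rule; the threshold is an illustrative choice.
    """
    return [i for i in range(1, len(activations))
            if linear_cka(activations[i - 1], activations[i]) >= threshold]

# Toy activations from a single forward pass over six inputs.
layer0 = [[1, 0], [0, 1], [2, 1], [1, 3], [0, 2], [3, 1]]
layer1 = [[-2 * b, 2 * a] for a, b in layer0]     # rotated and rescaled copy of layer0
layer2 = [[1], [-1], [1], [-1], [1], [-1]]        # unrelated representation
```

Because linear CKA is invariant to rotations and isotropic scaling, `layer1` scores as a twin of `layer0` even though its raw values differ, while `layer2` does not.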

2606.20283 2026-06-19 cs.LG cs.AI 交叉投稿

Boundary Embedding Shaping with Adaptive Contrastive Learning for Graph Structural Disentanglement

基于自适应对比学习的边界嵌入塑造用于图结构解缠

Jiaqing Chen, Zidu Yin, Yichao Cai, Yuhang Liu, Zhen Zhang, Dong Gong, Javen Qinfeng Shi

发表机构 * Yunnan Normal University（云南师范大学）； Adelaide University（阿德莱德大学）； The University of New South Wales（新南威尔士大学）

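AI summary: To counter the classification degradation caused by graph structural entanglement, proposes Boundary Embedding Shaping (BES), a plug-in module that uses adaptive contrastive learning to selectively suppress spurious structural noise at decision boundaries, improving node classification and link prediction accuracy.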

Comments Accepted at ICML 2026

AI中文摘要

图神经网络（GNN）在聚合邻居信息进行分类方面表现出色，但其性能受到图结构纠缠的阻碍，来自语义无关邻居的虚假相关污染了节点嵌入。这种挑战在嵌入空间中靠近类边界的节点最为严重，放大的结构噪声模糊了决策边界并破坏了预测的稳定性。现有的鲁棒GNN方法大多统一处理所有节点，忽略了边界脆弱性。本文中，为了提高分类性能，我们通过将边界区域纠缠识别为主要瓶颈来解决图结构解缠问题，并提出边界嵌入塑造（BES），一种自适应对比学习GNN插件模块，以最小的模型参数扰动选择性地抑制决策边界处的虚假结构噪声。大量实验表明，BES持续改善边界判别性，并优于现有领先方法。值得注意的是，BES在节点分类中平均提升GCN性能3.3%（在WikiCS上高达5.0%），并在链接预测中实现更优的准确率。

英文摘要

Graph neural networks (GNNs) excel at aggregating neighbor information for classification, yet their performance is hindered by graph structural entanglement, where spurious correlations from semantically irrelevant neighbors contaminate node embeddings. This challenge is most acute for nodes near class boundaries in the embedding space, where amplified structural noise blurs decision boundaries and destabilizes predictions. Existing robust GNN methods largely treat all nodes uniformly, ignoring boundary vulnerabilities. In this paper, to improve classification performance, we tackle graph structural disentanglement by identifying boundary-region entanglement as the primary bottleneck and propose Boundary Embedding Shaping (BES), an adaptive contrastive learning GNN plug-in module that selectively suppresses spurious structural noise at decision boundaries with minimal model parameter perturbation. Extensive experiments demonstrate that BES consistently improves boundary discrimination and outperforms existing leading methods. Notably, BES boosts GCN performance by an average of 3.3% in node classification (up to 5.0% on WikiCS) and achieves superior accuracy in link prediction.

2606.20356 2026-06-19 math.OC cs.AI cs.LG math.PR stat.ML 交叉投稿

Robust $Q$-learning for mean-field control under Wasserstein uncertainty in common noise

公共噪声Wasserstein不确定性下的平均场控制鲁棒$Q$-学习

Mathieu Laurière, Ariel Neufeld, Kyunghyun Park

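AI summary: Proposes a robust $Q$-learning algorithm for discrete-time mean-field control under Wasserstein uncertainty in the common-noise distribution. It combines quantization projection with Wasserstein duality, proves convergence and finite-time bounds for synchronous and asynchronous learning, and illustrates the robustness-performance trade-off on systemic risk and epidemic models.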

2606.20457 2026-06-19 eess.AS cs.AI cs.LG 交叉投稿

Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation

重新利用语音分类器进行基于引导扩散的语音生成

Rostislav Makarov, Timo Gerkmann

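AI summary: Proposes using a pretrained speech classifier as the backbone for diffusion generation by attaching a lightweight subnetwork and training only that subnetwork, achieving high-quality conditional speech generation in a single-backbone model with lower memory and compute cost.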

Comments Accepted for publication in the Proceedings of Interspeech 2026

AI中文摘要

分类器引导是一种通过使用噪声条件分类器将采样过程导向目标类别来控制扩散生成的方法。分类器引导的一个缺点是需要两个单独训练的模型：一个分类器和一个扩散模型。因此，我们研究了一种更紧凑的替代方案，其中将传统训练的语音分类器重新用作扩散生成的主干。从log-Mel空间中的冻结噪声条件分类器开始，我们附加一个轻量子网络，该子网络重用中间分类器表示，并在去噪分数匹配目标下仅训练该子网络。我们的工作表明，预训练的分类器可以重新用于条件生成，为判别建模和条件语音合成之间提供了有吸引力的桥梁，从而在单主干模型中实现高语音质量，同时减少内存占用和计算成本。

英文摘要

Classifier guidance is a way to control diffusion generation by using a noise-conditioned classifier to steer the sampling process toward a target class. One drawback of classifier guidance is that it requires two separately trained models: a classifier and a diffusion model. We therefore study a more compact alternative in which a conventionally trained speech classifier is repurposed as the backbone for diffusion generation. Starting from a frozen noise-conditioned classifier in log-Mel space, we attach a lightweight subnetwork that reuses intermediate classifier representations and train only this subnetwork under a Denoising Score Matching objective. Our work shows that a pretrained classifier can be repurposed for conditional generation, providing an appealing bridge between discriminative modeling and conditional speech synthesis resulting in high speech quality within a single-backbone model, with reduced memory footprint and computational cost.
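
Classifier guidance rests on the identity grad log p(x | y) = grad log p(x) + grad log p(y | x), usually with a scale factor on the classifier term. The toy below illustrates only that identity, not the paper's speech model: it assumes a one-dimensional standard normal prior, a logistic "classifier" p(y=1 | x) = sigmoid(x), and no noise conditioning. All names and the scale values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def prior_score(x):
    """Score of N(0, 1): d/dx log p(x) = -x."""
    return -x

def classifier_grad(x):
    """d/dx log p(y=1 | x) for p(y=1 | x) = sigmoid(x)."""
    return 1.0 - sigmoid(x)

def guided_score(x, scale=1.0):
    """Unconditional score plus the scaled classifier gradient."""
    return prior_score(x) + scale * classifier_grad(x)

def log_guided_density(x, scale=1.0):
    """Unnormalized log density whose derivative is guided_score."""
    return -0.5 * x * x + scale * math.log(sigmoid(x))

def find_mode(score_fn, lo=-5.0, hi=5.0, iters=60):
    """Bisection on a decreasing score function: the mode is where the score is zero."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score_fn(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With `scale=0` the mode stays at 0; increasing the scale moves the mode further toward the region the classifier assigns to the target class.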

2606.20560 2026-06-19 cs.LG cs.AI 交叉投稿

How Transparent is DiffusionGemma?

DiffusionGemma 的透明度如何？

Joshua Engels, Callum McDougall, Bilal Chughtai, Janos Kramar, Senthoran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue, João Gabriel Lopes de Oliveira, Rohin Shah, Neel Nanda

发表机构 * Google（谷歌）

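AI summary: Studies the reasoning transparency of DiffusionGemma, which computes in a continuous latent space, by decomposing it into variable transparency and algorithmic transparency. An interpretable token bottleneck brings the opaque serial depth down to 1.1X that of Gemma 4, and case studies reveal diffusion-specific phenomena.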

Comments 20 main text pages and 6 pages of references and appendices

AI中文摘要

LLM推理透明度是理解模型决策、减少误用和错位以及调试意外模型行为的关键能力。然而，DiffusionGemma在连续潜空间中执行了更大比例的计算；这是否使其推理透明度降低？我们通过将透明度分解为两个组成部分来研究这个问题：变量透明度，即我们是否理解模型计算状态的中间快照；以及算法透明度，即我们是否能够利用这些快照重建模型得出其输出的过程。直观上，DiffusionGemma的变量透明度较差：其不透明串行深度，即在可解释模型状态之间发生的串行计算量，最初似乎是相应自回归Gemma 4模型的28.6倍。然而，我们表明，我们可以通过一个可解释的令牌瓶颈映射去噪步骤之间流动的信息，且下游性能没有下降。将这些中间状态视为可解释的，将不透明串行深度降至仅为Gemma 4的1.1倍。对于扩散模型来说，算法透明度比自回归模型更难，因为画布中的所有令牌预测在每个去噪步骤中都可能发生变化，这使模型有能力在去噪过程中实现复杂的分布式算法。为了开始弥合这一差距，我们进行了一系列可解释性案例研究，发现了扩散特有现象（如非时序推理、令牌和序列涂抹以及中间上下文推理）的初步证据。最后，我们测试了可监控性，这是透明度的一个关键应用，衡量模型输出是否对下游任务有用。我们发现DiffusionGemma的可监控性与Gemma 4相似。

英文摘要

LLM reasoning transparency is a critical affordance for understanding model decisions, mitigating misuse and misalignment, and debugging surprising model behaviors. However, DiffusionGemma performs a larger fraction of its computation in a continuous latent space; does this make its reasoning less transparent? We study this question by decomposing transparency into two components: variable transparency, whether we understand intermediate snapshots of a model's computational state; and algorithmic transparency, whether we can use these snapshots to reconstruct the process by which the model arrived at its outputs. Naively, DiffusionGemma has poor variable transparency: its opaque serial depth, the amount of serial computation that occurs in between interpretable model states, seems at first 28.6X higher than the corresponding autoregressive Gemma 4 model. However, we show that we can map the information flowing between denoising steps through an interpretable token bottleneck with no decrease in downstream performance. Treating these intermediate states as interpretable reduces the opaque serial depth to just 1.1X that of Gemma 4. Algorithmic transparency is harder for diffusion models than for autoregressive models because all token predictions in the canvas can change at every denoising step, giving the model the power to implement complicated distributed algorithms during the denoising process. To begin bridging this gap, we conduct a suite of interpretability case studies, uncovering initial evidence of novel diffusion-specific phenomena such as non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning. Finally, we test monitorability, a key application of transparency that measures whether model outputs are useful for downstream tasks. We find that DiffusionGemma is similarly monitorable to Gemma 4.

2509.25148 2026-06-19 cs.AI 版本更新

AAPA: Adversarially Anchored Preference Alignment for Post-Training of Large Language Models

AAPA：用于大型语言模型后训练的对抗锚定偏好对齐

Faqiang Qian, Kang An, Weikun Zhang, Ziliang Wang, Xuhui Zheng, Liangjian Wen, Yong Dai, Mengya Gao, Yichao Wu

发表机构 * Southwest University of Finance and Economics（西南财经大学）

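AI summary: Proposes AAPA, a framework that uses a fixed lightweight discriminator for sentence-level adversarial anchoring of policy outputs to expert responses, augmenting post-training objectives such as SFT and GRPO and consistently improving results on instruction-following benchmarks.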

AI中文摘要

大型语言模型的后训练对齐通常结合了专家演示上的监督微调（SFT）和来自偏好或可验证反馈的强化学习（RL）。SFT提供了有用的行为锚点，但可能过拟合静态演示，而RL鼓励探索但可能偏离专家行为或利用不完美的奖励。我们提出\textbf{AAPA}（\emph{对抗锚定偏好对齐}），这是一个插件式框架，通过句子级对抗锚定信号增强现有的后训练目标。AAPA使用固定的轻量判别器将策略生成结果与离线预收集的专家响应进行比较，因此在策略优化期间既不需要在线教师推理，也不需要判别器协同训练。相同的锚定项可以添加到SFT、GRPO和CHORD中，同时保留其原始训练流程。在指令遵循基准上的实验表明，AAPA在不同模型规模上一致地改善了相应的基础目标。特别是，分阶段的AAPA配置在\texttt{Qwen3-0.6B}上比强GRPO基线提高了5.77%，在\texttt{Qwen3-4B}上提高了3.75%。对响应长度、对数概率分布和判别器变体的进一步分析表明，对抗锚定为偏好优化提供了稳定的语义基础信号。代码可在\url{this https URL}获取。

英文摘要

Post-training alignment of large language models often combines supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) from preference or verifiable feedback. SFT provides a useful behavioral anchor but can overfit to static demonstrations, whereas RL encourages exploration but may drift from expert behavior or exploit imperfect rewards. We propose AAPA (Adversarially Anchored Preference Alignment), a plug-in framework that augments existing post-training objectives with a sentence-level adversarial anchoring signal. AAPA compares policy rollouts with offline, pre-collected expert responses using a fixed lightweight discriminator, and therefore requires neither online teacher inference nor discriminator co-training during policy optimization. The same anchoring term can be added to SFT, GRPO, and CHORD while preserving their original training pipelines. Experiments on instruction-following benchmarks show that AAPA consistently improves the corresponding base objectives across model scales. In particular, the staged AAPA configuration improves over a strong GRPO baseline by 5.77% on Qwen3-0.6B and 3.75% on Qwen3-4B. Further analyses on response length, log-probability distributions, and discriminator variants suggest that adversarial anchoring provides a stable semantic grounding signal for preference optimization. Code is available at https://github.com/IsFaqq/AAPA.

2602.05533 2026-06-19 cs.AI 版本更新

Conditional Diffusion Guidance under Hard Constraint: A Stochastic Analysis Approach

硬约束下的条件扩散引导：一种随机分析方法

Zhengyi Guo, Wenpin Tang, Renyuan Xu

Affiliations: Department of Industrial Engineering and Operations Research, Columbia University; Department of Management Science and Engineering, Stanford University

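AI summary: Proposes a conditional diffusion guidance framework based on Doob's h-transform and martingale representation that learns the conditioning function and its gradient through a martingale loss and a martingale-covariation loss, enforcing hard constraints with non-asymptotic guarantees.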

AI中文摘要

我们研究了扩散模型中在硬约束下的条件生成，其中生成的样本必须以概率1满足预设事件。这类约束在安全关键应用和稀有事件模拟中自然出现，而软或基于奖励的引导方法无法保证约束满足。基于扩散模型的概率解释，我们利用Doob h-变换、鞅表示和二次变差过程，开发了一个原则性的条件扩散引导框架。具体地，得到的引导动力学通过涉及条件函数对数梯度的显式漂移校正来增强预训练扩散，而不修改预训练得分网络。利用鞅和二次变差恒等式，我们提出了两种新的离策略学习算法，基于鞅损失和鞅协方差损失，仅使用预训练模型的轨迹来估计h及其梯度。我们为得到的条件采样器在总变差和Wasserstein距离下提供了非渐近保证，明确刻画了得分近似和引导估计误差的影响。数值实验证明了所提方法在强制硬约束和生成稀有事件样本方面的有效性。数值实验的代码可在此https URL找到。

英文摘要

We study conditional generation in diffusion models under hard constraints, where generated samples must satisfy prescribed events with probability one. Such constraints arise naturally in safety-critical applications and in rare-event simulation, where soft or reward-based guidance methods offer no guarantee of constraint satisfaction. Building on a probabilistic interpretation of diffusion models, we develop a principled conditional diffusion guidance framework based on Doob's h-transform, martingale representation and quadratic variation process. Specifically, the resulting guided dynamics augment a pretrained diffusion with an explicit drift correction involving the logarithmic gradient of a conditioning function, without modifying the pretrained score network. Leveraging martingale and quadratic-variation identities, we propose two novel off-policy learning algorithms based on a martingale loss and a martingale-covariation loss to estimate h and its gradient using only trajectories from the pretrained model. We provide non-asymptotic guarantees for the resulting conditional sampler in both total variation and Wasserstein distances, explicitly characterizing the impact of score approximation and guidance estimation errors. Numerical experiments demonstrate the effectiveness of the proposed methods in enforcing hard constraints and generating rare-event samples. The code of the numerical experiments can be found at https://github.com/ZhengyiGuo2002/CDG_Finance.
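
The drift correction by the logarithmic gradient of a conditioning function can be seen in a textbook case where h is known in closed form. The sketch below is not the paper's learning algorithm: it assumes a standard Brownian motion conditioned on the event B_T > 0, for which h(t, x) = P(B_T > 0 | B_t = x) = Phi(x / sqrt(T - t)), and simulates the guided dynamics dX = grad log h(t, X) dt + dW with Euler-Maruyama. Function names, the step count, and the horizon are illustrative choices.

```python
import math
import random

def normal_cdf(x):
    # erfc keeps precision in the lower tail, where 1 + erf(x) would round to zero.
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def log_h(t, x, T=1.0):
    """log P(B_T > 0 | B_t = x) for standard Brownian motion."""
    return math.log(normal_cdf(x / math.sqrt(T - t)))

def grad_log_h(t, x, T=1.0):
    """d/dx log h(t, x): the drift correction of the Doob h-transform."""
    s = math.sqrt(T - t)
    return normal_pdf(x / s) / (s * max(normal_cdf(x / s), 1e-300))

def sample_guided_endpoint(rng, steps=200, T=1.0):
    """Euler-Maruyama simulation of the guided process started at 0."""
    dt = T / steps
    x = 0.0
    for i in range(steps):
        t = i * dt
        x += grad_log_h(t, x, T) * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x
```

In continuous time the guided process satisfies the constraint with probability one. The discretized simulation can still end slightly below zero on rare paths, because the final Euler step uses a finite drift; this is a time-discretization effect, separate from the score and guidance estimation errors the abstract analyzes.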

2603.15106 2026-06-19 cs.AI 版本更新

PrototypeNAS: Rapid Design of Deep Neural Networks for Microcontroller Units

PrototypeNAS: 微控制器单元深度神经网络的快速设计

Mark Deutel, Simon Geis, Axel Plinge

发表机构 * Fraunhofer Institute for Integrated Circuits（弗劳恩霍夫集成电路研究所）

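AI summary: Proposes PrototypeNAS, a zero-shot NAS method that decouples design from training and combines a multi-architecture search space, an ensemble of zero-shot proxies, and hypervolume subset selection to quickly specialize DNNs for different MCUs. On image classification and other tasks it finds small models within minutes whose accuracy is close to that of large models.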

Comments Accepted at ECML-PKDD 2026. 18 pages, 7 figures, 4 tables. This work was funded by the European Commission as part of the MANOLO project under the Horizon Europe programme Grant Agreement No.101135782

AI中文摘要

在具有不同硬件约束的边缘设备上实现高效的深度神经网络推理是一项具有挑战性的任务，通常需要为每个设备单独定制DNN架构。为避免大量人工努力，可以使用神经架构搜索。然而，许多现有的NAS方法资源密集且耗时，因为它们需要从头开始训练许多不同的DNN。此外，它们没有考虑目标系统的资源约束。为了解决这些缺点，我们提出了PrototypeNAS，一种零样本NAS方法，用于加速和自动化DNN的选择、压缩和针对不同目标微控制器单元的专门化。我们提出了一种新颖的三步搜索方法，将DNN设计和专门化与给定目标平台上的DNN训练解耦。首先，我们提出了一种新的搜索空间，不仅从单个大型架构中裁剪出较小的DNN，而且结合了多种架构类型的结构优化，以及它们的剪枝和量化配置的优化。其次，我们探索在优化过程中使用集成零样本代理而不是单个代理。第三，我们提出使用超体积子集选择从多目标优化的帕累托前沿中提取DNN架构，这些架构代表了准确性和FLOPs之间最有意义的权衡。我们在三个不同任务（图像分类、时间序列分类和目标检测）的12个数据集上评估了PrototypeNAS的有效性。我们的结果表明，PrototypeNAS能够在几分钟内识别出足够小、可部署在现成MCU上的DNN模型，并且仍然达到与大型DNN模型相当的精度。

英文摘要

Enabling efficient deep neural network (DNN) inference on edge devices with different hardware constraints is a challenging task that typically requires DNN architectures to be specialized for each device separately. To avoid this large manual effort, one can use neural architecture search (NAS). However, many existing NAS methods are resource-intensive and time-consuming because they require training many different DNNs from scratch. Furthermore, they do not take the resource constraints of the target system into account. To address these shortcomings, we propose PrototypeNAS, a zero-shot NAS method to accelerate and automate the selection, compression, and specialization of DNNs for different target microcontroller units (MCUs). We propose a novel three-step search method that decouples DNN design and specialization from DNN training for a given target platform. First, we present a novel search space that does not merely cut smaller DNNs out of a single large architecture but combines the structural optimization of multiple architecture types with the optimization of their pruning and quantization configurations. Second, we explore the use of an ensemble of zero-shot proxies during optimization instead of a single one. Third, we propose the use of hypervolume subset selection to distill, from the Pareto front of the multi-objective optimization, the DNN architectures that represent the most meaningful tradeoffs between accuracy and FLOPs. We evaluate the effectiveness of PrototypeNAS on 12 different datasets in three different tasks: image classification, time series classification, and object detection. Our results demonstrate that PrototypeNAS is able to identify DNN models within minutes that are small enough to be deployed on off-the-shelf MCUs and still achieve accuracies comparable to those of large DNN models.
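
The third step, picking a few representative architectures from a Pareto front, can be sketched for two objectives. The code below is an illustration, not the PrototypeNAS implementation: it assumes accuracy is maximized and FLOPs minimized, computes the 2-D hypervolume against a reference point that every candidate dominates, and uses greedy selection, which is a common approximation rather than an exact hypervolume subset solver. The candidate models and their numbers are made up.

```python
def pareto_front(models):
    """Keep (name, accuracy, flops) entries that no other entry dominates."""
    front = []
    for name, acc, flops in models:
        dominated = any(a >= acc and f <= flops and (a > acc or f < flops)
                        for _, a, f in models)
        if not dominated:
            front.append((name, acc, flops))
    return front

def hypervolume(points, ref_acc, ref_flops):
    """Area dominated by the points, bounded by the reference point.

    Assumes every point has accuracy >= ref_acc and flops <= ref_flops.
    """
    # Convert to gains (both to be maximized) and sweep from the largest FLOPs gain.
    gains = sorted(((ref_flops - f, a - ref_acc) for _, a, f in points), reverse=True)
    area, best_acc_gain = 0.0, 0.0
    for i, (u, v) in enumerate(gains):
        best_acc_gain = max(best_acc_gain, v)
        next_u = gains[i + 1][0] if i + 1 < len(gains) else 0.0
        area += (u - next_u) * best_acc_gain
    return area

def greedy_subset(front, k, ref_acc, ref_flops):
    """Greedily add the point that increases the hypervolume the most."""
    chosen, remaining = [], list(front)
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda m: hypervolume(chosen + [m], ref_acc, ref_flops))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical candidates: (name, accuracy, MFLOPs).
candidates = [("a", 0.60, 10), ("b", 0.70, 20), ("c", 0.72, 40),
              ("d", 0.65, 30), ("e", 0.90, 100)]
front = pareto_front(candidates)
picked = greedy_subset(front, k=2, ref_acc=0.0, ref_flops=120)
```

Here model "d" is dropped as dominated by "b", and the greedy pass keeps "b" and "a", the pair that covers the most accuracy-FLOPs area for this reference point. A different reference point can change which trade-offs count as most meaningful.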

2606.17979 2026-06-19 cs.AI 版本更新

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

STAR: 文本到图像强化学习后训练中的时空自适应奖励分配

Jinjie Shen, Wei Deng, Xian Hu, Daiguo Zhou, Jian Luan

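AI summary: To address the granularity mismatch between the reward and the generative trajectory in text-to-image generation, proposes STAR, which uses text-image attention to build spatiotemporally adaptive allocation maps and applies stronger policy updates to the relevant latent regions, improving semantic alignment and text rendering.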

AI中文摘要

现有的文本到图像生成的强化学习后训练方法通常将最终图像奖励转换为单个标量优势，并以相同强度应用于整个生成轨迹。然而，文本到图像生成自然具有时间和空间结构：不同的去噪步骤负责不同的生成阶段，而真正决定文本对齐的内容通常只出现在图像的一部分。这种粒度不匹配使得策略更新难以聚焦于实际影响奖励的生成组件。为了解决这个问题，我们提出了用于文本到图像扩散和流模型的强化学习后训练的**时空自适应奖励（STAR）分配**。STAR利用生成模型内部的文本-图像注意力，从用户提示中真正关心的核心内容开始，构建在去噪步骤和展开中动态变化的空间分配图，并将相同的组相对优势分配给更相关的潜在区域，几乎没有额外的计算开销。然后，STAR通过空间分辨的策略目标对这些区域应用更强的策略更新。我们使用Stable Diffusion 3.5 Medium作为基础模型，并在三个任务上评估：GenEval、OCR文本渲染和PickScore。实验结果表明，STAR在不改变外部奖励源的情况下，改善了组合语义对齐、文本渲染和偏好优化，在GenEval、OCR和PickScore上分别达到了$\mathbf{0.9759}$、$\mathbf{0.9757}$和$\mathbf{23.60}$。

英文摘要

Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore, respectively.
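
One way to picture spatial allocation of a scalar advantage is to reweight it by a relevance map. The sketch below is a guess at the general shape of such a step, not the STAR objective: it assumes a nonnegative per-region relevance score (for example, averaged text-image attention) and normalizes the weights to mean 1 so that the average update strength equals the original scalar advantage. The function name and the normalization are hypothetical.

```python
def allocate_advantage(relevance, advantage):
    """Spread one scalar advantage over latent regions in proportion to relevance.

    relevance: nonnegative scores, one per latent region, with a positive sum.
    Returns per-region advantages whose mean equals `advantage`.
    """
    mean = sum(relevance) / len(relevance)
    return [advantage * r / mean for r in relevance]

# Four latent regions at one denoising step; the third carries most of the prompt content.
per_region = allocate_advantage([0.1, 0.1, 0.6, 0.2], advantage=2.0)
```

Under this normalization, regions with above-average relevance receive a stronger update than the uniform baseline and the rest receive a weaker one. A separate map per denoising step and rollout would make the allocation vary in time as well as in space.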

2402.14035 2026-06-19 cs.LG cs.AI 版本更新

Wisdom of Committee: Diverse Distillation from Large Foundation Models and Domain Experts

委员会智慧：来自大型基础模型和领域专家的多样化蒸馏

Zichang Liu, Qingyun Liu, Yuening Li, Liang Liu, Anshumali Shrivastava, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao

Affiliations: Rice University; Google DeepMind; Google Inc; University of California, Davis

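AI summary: To handle the large gaps in capacity, architecture, and modality when distilling foundation models into compact domain models, proposes DiverseDistill, which uses a learnable Question-Answer mechanism to align heterogeneous teacher outputs and recovers 73-114% of the teacher-student performance gap on recommendation and vision tasks.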

Comments Accepted at the 1st Workshop on Resource-Efficient Learning and Knowledge Discovery (RelKD), KDD 2026

Journal ref Proceedings of the RelKD Workshop at KDD 2026

AI中文摘要

从基础模型向紧凑领域模型进行知识蒸馏因能力、架构和模态的巨大差异而具有挑战性。例如，在我们的实验中，从7600万参数的语言模型蒸馏到200万参数的推荐模型仅能弥补未蒸馏学生与教师之间不到40%的性能差距。我们表明，引入与基础模型共享学生架构特征的领域专家作为多样化教师委员会，能显著改善迁移效果。然而，标准的多教师方法未能利用这种多样性：简单组合异构教师可能使性能低于单教师蒸馏。为此，我们提出DiverseDistill，一种交互式蒸馏框架，采用可学习的问答机制生成教师条件查询，并将异构教师输出对齐到学生的表示空间。与需要基于梯度的协同优化或修改教师架构的方法不同，DiverseDistill在冻结教师的情况下仅通过其中间层的前向推理运行：无需参数更新、无需协同训练、无需架构修改。动态教师重要性机制通过过滤每个样本中低相关性的教师（例如，在推荐任务中减少约30%的前向传播且无质量损失）进一步降低训练成本，而整个蒸馏模块在训练后被丢弃，推理时零开销。在推荐（38倍压缩）和视觉（3.6倍压缩）任务上的评估表明，DiverseDistill恢复了73-114%的师生性能差距，持续优于所有单教师和多教师基线方法。

英文摘要

Knowledge distillation from foundation models to compact domain models is challenging due to substantial gaps in capacity, architecture, and modality. For example, in our experiments, distilling from a 76M-parameter language model to a 2M-parameter recommender closes less than 40% of the performance gap between the undistilled student and the teacher. We show that introducing domain-specific experts -- which share the student's architectural characteristics -- alongside the foundation model as a diverse teacher committee significantly improves transfer. However, standard multi-teacher methods fail to exploit this diversity: naively combining heterogeneous teachers can degrade performance below single-teacher distillation. To address this, we propose DiverseDistill, an interactive distillation framework that employs a learnable Question-Answer mechanism to generate teacher-conditioned queries and align heterogeneous teacher outputs into the student's representation space. Unlike methods requiring gradient-based co-optimization or architectural modification of teachers, DiverseDistill operates with frozen teachers using only forward-pass inference through their intermediate layers: no parameter updates, no co-training, and no architectural surgery. A dynamic teacher importance mechanism further reduces training cost by filtering low-relevance teachers per sample (e.g., ~30% fewer forward passes with no quality loss for recommendation tasks), while the entire Distillation Module is discarded after training, adding zero inference overhead. Evaluations on recommendation (38x compression) and vision (3.6x compression) tasks demonstrate that DiverseDistill recovers 73-114% of the teacher-student performance gap, consistently outperforming all single- and multi-teacher baselines.
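
The "fraction of the gap recovered" figure is simple arithmetic, and writing it out shows why values above 100% are possible. The sketch below assumes the usual definition, (distilled - student) / (teacher - student), with a higher-is-better metric. The metric values are invented for illustration and are not taken from the paper.

```python
def gap_recovered(student, teacher, distilled):
    """Fraction of the teacher-student gap closed by the distilled student."""
    return (distilled - student) / (teacher - student)

# Hypothetical scores on a higher-is-better metric.
partial = gap_recovered(student=0.60, teacher=0.70, distilled=0.673)   # about 0.73
beyond = gap_recovered(student=0.60, teacher=0.70, distilled=0.714)    # about 1.14
```

A value above 1 means the distilled student scores higher than the reference teacher on that metric, which is how a 114% recovery can arise.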

URL PDF HTML ☆

赞 0 踩 0

2503.02636 2026-06-19 q-bio.NC cs.AI 版本更新

A Deep Generative Model for Resting-State EEG Synthesis and Transferable Representation Learning

一种用于静息态脑电合成与可迁移表示学习的深度生成模型

Yeganeh Farahzadi, Morteza Ansarinia, Zoltan Kekecs

Affiliations: Institute of Psychology, Eötvös Loránd University; Doctoral School of Psychology, Eötvös Loránd University; Department of Behavioural and Cognitive Sciences, University of Luxembourg

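AI summary: Proposes REST-GAN, a framework that combines adversarial training with self-supervised reconstruction to synthesize resting-state EEG from raw time-domain signals and learn transferable representations, performing well on spectral, connectivity, and classification evaluations.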

AI中文摘要

静息态脑电提供了一种非侵入性的自发脑活动观测方式，但提取有意义的模式常受限于高质量数据稀缺和对人工设计特征的依赖。生成对抗网络（GAN）能够合成神经信号并从原始数据中学习可迁移表示，这一双重能力在脑电研究中尚未被充分探索。本文提出REST-GAN，一个基于GAN的静息态脑电框架，将对抗训练与辅助自监督重构目标相结合，以支持信号合成和无监督特征提取。尽管仅使用原始时域信号训练，未引入显式的频域或传感器拓扑监督，生成的时序列再现了真实脑电的关键时间、频谱和连接特性。在频带功率特征空间中，生成的样本在睁眼和闭眼条件下均表现出高精确率和召回率（EO: 0.91/0.67; EC: 0.87/0.65），而组平均频谱相干矩阵与真实数据在各频段上的平均绝对差异较低（约0.01-0.03）。模型判别器学习到的表示可迁移至独立的静息态人口统计学分类任务，其性能优于直接在原始脑电上训练的模型，并与近期脑电基础模型表现相当，同时所需训练数据和计算资源大幅减少。这些发现突显了一种计算高效的架构驱动策略，其中生成模型不仅作为脑电信号生成器，还作为无监督特征提取器。该方法有望支持更数据高效的脑电分析，同时减少对人工特征工程的依赖。REST-GAN的实现代码见：this https URL。

英文摘要

Resting-state EEG provides a non-invasive view of spontaneous brain activity, but extracting meaningful patterns is often limited by scarce high-quality data and reliance on manually engineered features. Generative adversarial networks (GANs) can synthesize neural signals and learn transferable representations directly from raw data, a dual capability that remains underexplored in EEG research. Here, we introduce REST-GAN, a GAN-based framework for resting-state EEG that combines adversarial training with an auxiliary self-supervised reconstruction objective to support signal synthesis and unsupervised feature extraction. Although trained only on raw time-domain signals, without explicit frequency-domain or sensor-topographic supervision, the generated time series reproduced key temporal, spectral, and connectivity properties of real EEG. In band-power feature space, generated samples showed high precision and recall across eyes-open and eyes-closed conditions (EO: 0.91/0.67; EC: 0.87/0.65), while group-average spectral coherence matrices showed low mean absolute differences from real data across frequency bands (~0.01-0.03). The representations learned by the model's critic transferred to independent resting-state demographic classification tasks, outperforming models trained directly on raw EEG and showing competitive performance relative to a recent EEG foundation model, while requiring substantially less training data and computational resources. These findings highlight a computationally efficient, architecture-driven strategy in which generative models serve not only as EEG signal generators, but also as unsupervised feature extractors. This approach may support more data-efficient EEG analysis while reducing reliance on manual feature engineering. The implementation code for REST-GAN is available at: https://github.com/Yeganehfrh/REST-GAN.
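
The band-power feature space used for evaluation can be pictured with a minimal periodogram. The sketch below is an assumption about what such features typically look like, not the paper's pipeline: it computes one-sided power in a frequency band with a naive DFT, using conventional band edges (theta 4-8 Hz, alpha 8-13 Hz) that the abstract does not specify. The sampling rate and the test signal are made up.

```python
import cmath
import math

def band_power(signal, fs, low, high):
    """One-sided periodogram power of `signal` in the band [low, high) Hz."""
    n = len(signal)
    power = 0.0
    for k in range(1, n // 2):
        freq = k * fs / n
        if low <= freq < high:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(signal))
            # Factor 2 folds the matching negative-frequency bin into the estimate.
            power += 2.0 * abs(coeff) ** 2 / n ** 2
    return power

# Two seconds of a synthetic 10 Hz oscillation sampled at 128 Hz.
fs = 128
eeg_like = [math.sin(2 * math.pi * 10 * i / fs) for i in range(2 * fs)]
alpha = band_power(eeg_like, fs, 8, 13)
theta = band_power(eeg_like, fs, 4, 8)
```

For a unit-amplitude sinusoid inside the band, the estimate equals the signal's mean square value of 0.5, so a vector of such band powers per channel gives a compact spectral summary in which real and generated samples can be compared.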


2509.15927 2026-06-19 cs.LG cs.AI updated version

Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

增强生成式自动出价：结合离线奖励评估与策略搜索

Zhiyu Mou, Yiqin Lv, Miao Xu, Qi Wang, Yixiu Mao, Jinghao Chen, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng

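Affiliations: Taobao & Tmall Group of Alibaba; Department of Automation, Tsinghua University

AI summary: To address the performance bottleneck that existing generative auto-bidding methods cannot explore beyond a static dataset, the paper proposes AIGB-Pearl, which uses a trajectory evaluator and a KL-Lipschitz-constrained score-maximization scheme for safe and efficient exploration, and reports state-of-the-art performance on simulated and real-world advertising systems.

AI-generated Chinese abstract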

自动出价是广告主提升广告效果的关键工具。最近进展表明，AI生成式出价（AIGB）从离线数据中学习条件生成规划器，相比典型的基于离线强化学习（RL）的自动出价方法取得了更优性能。然而，现有AIGB方法仍面临性能瓶颈，因其固有能力无法在静态数据集之外进行带反馈的探索。为解决此问题，我们提出AIGB-Pearl（Planning with EvaluAtor via RL），一种融合生成式规划与策略优化的新方法。AIGB-Pearl的核心在于构建轨迹评估器以评估生成分数的质量，并设计一个理论上可靠的KL-Lipschitz约束分数最大化方案，确保在离线数据集之外进行安全高效的探索。进一步开发了结合同步耦合技术的实用算法，以保证所提方案所需的模型正则性。在模拟和真实广告系统上的大量实验证明了我们方法的最优性能。

English abstract

Auto-bidding is a critical tool for advertisers to improve advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static dataset with feedback. To address this, we propose AIGB-Pearl (Planning with EvaluAtor via RL), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator to assess the quality of generated scores and designing a provably sound KL-Lipschitz-constrained score-maximization scheme to ensure safe and efficient exploration beyond the offline dataset. A practical algorithm that incorporates the synchronous coupling technique is further developed to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.


2510.18383 2026-06-19 cs.CL cs.AI updated version

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

MENTOR: 通过灵活的教师优化奖励进行工具使用蒸馏的强化学习

ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim

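Affiliations: Seoul National University of Science and Technology; Korea Advanced Institute of Science and Technology; LG CNS

AI summary: Proposes MENTOR, which uses a flexible teacher-optimized reward structure to balance behavioral alignment with downstream performance, improving the out-of-domain generalization of small models on tool-use tasks.

AI-generated Chinese abstract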

将大型语言模型（LLMs）的工具使用能力蒸馏到小型语言模型（SLMs）中对其实际应用至关重要。主要方法监督微调（SFT）由于与静态教师轨迹的刚性对齐，导致域外（OOD）泛化性能较差。虽然强化学习（RL）提供了一种替代方案，但SLMs的能力限制带来了严峻的困境：稀疏的结果奖励提供的指导不足，而严格的轨迹匹配施加了过于严格的约束。为了弥合这一能力驱动的差距，我们提出了MENTOR，它引入了一种灵活且过程感知的奖励结构。MENTOR不强制执行刚性复制，而是利用教师的参考来指导工具使用行为，平衡行为对齐与下游性能。在可控可执行工具基准上的大量实验表明，与SFT和严格RL基线相比，MENTOR提高了OOD工具使用性能。我们的研究结果表明，在可验证的工具使用环境中，灵活的工具使用对齐比严格的轨迹复制为开发适应性小模型提供了更有效的方法。

English abstract

Distilling the tool-use capabilities of large language models (LLMs) into small language models (SLMs) is essential for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor out-of-domain (OOD) generalization due to its rigid alignment with static teacher trajectories. While reinforcement learning (RL) offers an alternative, the capacity limitations of SLMs pose a severe dilemma: sparse outcome rewards provide insufficient guidance, whereas strict trajectory matching imposes overly restrictive constraints. To bridge this capacity-driven gap, we propose MENTOR, which introduces a flexible yet process-aware reward structure. Instead of enforcing rigid replication, MENTOR uses the teacher's reference to guide tool-use behavior, balancing behavioral alignment with downstream performance. Extensive experiments on controlled executable-tool benchmarks demonstrate that MENTOR improves OOD tool-use performance compared to SFT and strict RL baselines. Our findings suggest that within verifiable tool-use environments, flexible tool-use alignment offers a more effective approach than strict trajectory replication for developing adaptable small models.


2510.21978 2026-06-19 cs.LG cs.AI updated version

Beyond Reasoning Gains: Mitigating General-Capability Forgetting in Large Reasoning Models

超越推理增益：缓解大型推理模型中的通用能力遗忘

Hoang Phan, Xianjun Yang, Yuanshun Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, Deren Lei

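Affiliations: Meta Superintelligence Labs; New York University; Johns Hopkins University

AI summary: To address reasoning models forgetting foundational capabilities under reinforcement-learning training, the paper proposes RECAP, a replay strategy that adjusts training emphasis online through dynamic objective reweighting, preserving general capabilities while improving reasoning performance.

AI-generated Chinese abstract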

基于可验证奖励的强化学习（RLVR）在数学和多模态推理方面取得了显著进展，并已成为当代语言和视觉-语言模型的标准后训练范式。然而，RLVR方法引入了能力退化的重大风险，即模型在长时间训练后，若未采用正则化策略，会遗忘基础技能。我们通过实验证实了这一担忧，观察到开源推理模型在感知和忠实性等核心能力上出现性能下降。虽然施加KL散度等正则化项有助于防止偏离基础模型，但这些项是在当前任务上计算的，因此不能保证保留更广泛的知识。同时，跨异构领域的经验回放使得决定每个目标应获得多少训练权重变得困难。为解决这一问题，我们提出RECAP——一种具有动态目标重加权的重放策略，用于通用知识保留。我们的重加权机制利用短期收敛和不稳定信号在线自适应，将后训练焦点从饱和目标转移到表现不佳或不稳定的目标。我们的方法是端到端的，可直接应用于现有RLVR流程，无需训练额外模型或进行繁重调优。在Qwen2.5-VL-3B和Qwen2.5-VL-7B上的广泛实验证明了我们方法的有效性，该方法不仅保留了通用能力，还通过实现任务内奖励的更灵活权衡提升了推理性能。

English abstract

Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe introduces a significant risk of capability regression, in which models forget foundational skills after prolonged training without employing regularization strategies. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While imposing regularization terms like KL divergence can help prevent deviation from the base model, these terms are computed on the current task and therefore do not guarantee preservation of broader knowledge. Meanwhile, commonly used experience replay across heterogeneous domains makes it nontrivial to decide how much training emphasis each objective should receive. To address this, we propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge preservation. Our reweighting mechanism adapts online using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning. Extensive experiments on benchmarks using Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.


2601.21542 2026-06-19 cs.CV cs.AI updated version

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

双锚点插值求解器加速生成建模

Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen

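Affiliations: The Hong Kong University of Science and Technology

AI summary: Proposes BA-solver, in which a lightweight SideNet (1-2% of the backbone size) learns bidirectional temporal perception and bi-anchor velocity integration. Without retraining the backbone and at very low training cost, it reaches the quality of a 100+-step Euler solver within 10 steps and is plug-and-play.

AI-generated Chinese abstract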

流匹配（FM）模型已成为高保真合成的前沿范式。然而，它们对迭代常微分方程（ODE）求解的依赖造成了显著的延迟瓶颈。现有解决方案面临两难：无训练求解器在低神经函数评估（NFE）下性能严重下降，而基于训练的一步或几步生成方法则面临高昂的训练成本且缺乏即插即用的通用性。为弥合这一差距，我们提出了双锚点插值求解器（BA-solver）。BA-solver保留了标准无训练求解器的通用性，同时通过引入轻量级SideNet（主干大小的1-2%）与冻结主干并行，实现了显著加速。具体而言，我们的方法基于两个协同组件：1）双向时间感知，其中SideNet学习近似未来和过去的速度，无需重新训练重型主干；2）双锚点速度积分，利用带有两个锚点速度的SideNet高效近似中间速度，用于批量高阶积分。通过利用主干建立高精度“锚点”并利用SideNet加密轨迹，BA-solver能够以最小误差实现大步长。在ImageNet-256^2上的实验结果表明，BA-solver仅需10次NFE即可达到与100+次NFE的Euler求解器相当的生成质量，并在仅5次NFE时保持高保真度，且训练成本可忽略不计。此外，BA-solver确保与现有生成流水线的无缝集成，便于图像编辑等下游任务。

English abstract

Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
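
For context on the NFE budget discussed in this abstract, a fixed-step Euler sampler spends exactly one velocity evaluation per step. The sketch below is a generic illustration and not BA-solver itself: `toy_velocity` is a hypothetical stand-in for the frozen backbone, chosen because its exact solution is known.

```python
import math

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    Each step costs one call to `velocity`, so NFE == num_steps.
    """
    x, dt, nfe = x0, 1.0 / num_steps, 0
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
        nfe += 1
    return x, nfe

# Toy velocity field (an assumption for illustration): dx/dt = x,
# whose exact solution at t=1 is e * x0.
def toy_velocity(x, t):
    return x

x10, nfe10 = euler_sample(toy_velocity, 1.0, 10)
x100, nfe100 = euler_sample(toy_velocity, 1.0, 100)
err10 = abs(x10 - math.e)
err100 = abs(x100 - math.e)
```

On this toy problem the 10-step error is visibly larger than the 100-step error, which is the low-NFE degradation that few-step solvers aim to close.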


2601.22970 2026-06-19 cs.LG cs.AI updated version

Stabilizing the Q-Gradient Field for Policy Smoothness in Actor-Critic Methods

稳定Q-梯度场以实现Actor-Critic方法中的策略平滑性

Jeong Woon Lee, Kyoleen Kwak, Daeho Kim, Hyoseok Hwang

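Affiliations: College of Software, Kyung Hee University

AI summary: To address policy oscillation in actor-critic methods over continuous action spaces, the paper proposes PAVE, a framework grounded in the differential geometry of the critic that smooths the policy by stabilizing the Q-gradient field, without modifying the actor.

AI-generated Chinese abstract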

通过连续actor-critic方法学习的策略通常表现出不稳定的高频振荡，使其不适合物理部署。当前方法试图通过直接正则化策略输出来强制平滑性。我们认为这种方法治标不治本。在这项工作中，我们从理论上建立了策略非平滑性根本上由评论家的微分几何决定。通过对actor-critic目标应用隐式微分，我们证明了最优策略的敏感性受限于Q函数的混合偏导数（噪声敏感性）与其动作空间曲率（信号区分度）之比。为了实证验证这一理论见解，我们引入了PAVE（策略感知值场均衡），一种以评论家为中心的正则化框架，将评论家视为标量场并稳定其诱导的动作梯度场。PAVE通过最小化Q-梯度波动同时保持局部曲率来修正学习信号。实验结果表明，PAVE在不修改actor的情况下，实现了与策略侧平滑正则化方法相当的平滑性，同时保持了有竞争力的任务性能。

English abstract

Policies learned via continuous actor-critic methods often exhibit erratic, high-frequency oscillations, making them unsuitable for physical deployment. Current approaches attempt to enforce smoothness by directly regularizing the policy's output. We argue that this approach treats the symptom rather than the cause. In this work, we theoretically establish that policy non-smoothness is fundamentally governed by the differential geometry of the critic. By applying implicit differentiation to the actor-critic objective, we prove that the sensitivity of the optimal policy is bounded by the ratio of the Q-function's mixed-partial derivative (noise sensitivity) to its action-space curvature (signal distinctness). To empirically validate this theoretical insight, we introduce PAVE (Policy-Aware Value-field Equalization), a critic-centric regularization framework that treats the critic as a scalar field and stabilizes its induced action-gradient field. PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature. Experimental results demonstrate that PAVE achieves smoothness comparable to policy-side smoothness regularization methods, while maintaining competitive task performance, without modifying the actor.
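
The ratio named in this abstract can be motivated by a standard implicit-function argument. The following is a sketch under simplifying assumptions that are not taken from the paper: a deterministic greedy policy $\pi^*(s) = \arg\max_a Q(s, a)$, with $Q$ twice differentiable and strictly concave in $a$ near the optimum. The paper's exact statement and constants may differ.

```latex
\begin{align}
\nabla_a Q\bigl(s, \pi^*(s)\bigr) &= 0
  && \text{(first-order optimality)} \\
\nabla^2_{as} Q + \nabla^2_{aa} Q \, \frac{\partial \pi^*}{\partial s} &= 0
  && \text{(differentiate with respect to } s\text{)} \\
\frac{\partial \pi^*}{\partial s} &= -\bigl(\nabla^2_{aa} Q\bigr)^{-1} \nabla^2_{as} Q \\
\left\lVert \frac{\partial \pi^*}{\partial s} \right\rVert
  &\le \frac{\lVert \nabla^2_{as} Q \rVert}{\lambda_{\min}\bigl(-\nabla^2_{aa} Q\bigr)}
\end{align}
```

The numerator is the mixed partial derivative (how strongly state perturbations move the action gradient) and the denominator is the action-space curvature, so a flat critic around the optimum amplifies small state noise into large action changes.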


2602.04396 2026-06-19 cs.LG cs.AI updated version

LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

LoRDO: 分布式低秩优化与低频通信

Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane

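Affiliations: University of Cambridge; Institute of Science and Technology Austria; Lancaster University; Flower Labs

AI summary: Proposes LoRDO, a framework that unifies low-rank optimization with infrequent synchronization and restores subspace exploration through a full-rank quasi-hyperbolic update. At model scales of 125M-720M it achieves near-parity with low-rank DDP while reducing communication by about 10x.

Comments: Accepted at ICML 2026

AI-generated Chinese abstract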

通过$\texttt{DDP}$进行基础模型的分布式训练受限于互连带宽。虽然低频通信策略减少了同步频率，但优化器状态的内存和通信需求仍然构成瓶颈。低秩优化器可以缓解这些限制；然而，在局部更新机制下，工作节点无法访问计算低秩投影所需的全批次梯度，这降低了性能。我们提出$\texttt{LoRDO}$，一个统一低秩优化与低频同步的原则性框架。我们首先证明，虽然基于伪梯度的全局投影在理论上更优，但它们将优化轨迹永久限制在低秩子空间中。为了恢复子空间探索，我们引入了一个全秩准双曲更新。$\texttt{LoRDO}$在125M-720M模型规模的语言建模和下游任务中实现了与低秩$\texttt{DDP}$近乎相同的性能，同时将通信量减少了约10倍。最后，我们表明在具有小秩/小批次大小的极低内存设置中，$\texttt{LoRDO}$的性能提升更为显著。

English abstract

Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.
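
As background for the term "quasi-hyperbolic update", the sketch below shows a generic quasi-hyperbolic momentum step, which mixes the raw gradient with an exponential moving average of gradients. It is an illustration only: the abstract does not say how LoRDO combines this full-rank term with its low-rank projected optimizer state, and the function name and hyperparameter values here are hypothetical.

```python
def qhm_step(theta, grad, buf, lr=0.1, beta=0.9, nu=0.7):
    """One quasi-hyperbolic momentum step on lists of floats.

    buf is an exponential moving average of gradients; the update mixes the
    raw gradient (weight 1 - nu) with the momentum buffer (weight nu).
    """
    new_buf = [beta * b + (1.0 - beta) * g for b, g in zip(buf, grad)]
    new_theta = [
        p - lr * ((1.0 - nu) * g + nu * b)
        for p, g, b in zip(theta, grad, new_buf)
    ]
    return new_theta, new_buf

# Toy check (assumption): minimize f(x) = x^2, whose gradient is 2x.
theta, buf = [1.0], [0.0]
for _ in range(200):
    grad = [2.0 * theta[0]]
    theta, buf = qhm_step(theta, grad, buf)

# With nu = 0 the step reduces to plain gradient descent.
sgd_theta, _ = qhm_step([1.0], [2.0], [0.0], lr=0.1, nu=0.0)
```

The `nu` parameter interpolates between plain gradient descent (`nu = 0`) and a pure momentum-buffer update (`nu = 1`), which is why a raw-gradient term can reintroduce directions that a projected buffer would miss.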


2602.22495 2026-06-19 cs.LG cs.AI updated version

Reinforcement-aware Knowledge Distillation for LLM Reasoning

面向LLM推理的强化学习感知知识蒸馏

Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto

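Affiliations: Meta

AI summary: Proposes RL-aware distillation (RLAD), which uses Trust Region Ratio Distillation (TRRD) to perform selective imitation during reinforcement-learning post-training, addressing distribution mismatch and objective interference, and outperforms existing methods on logic reasoning and math benchmarks.

AI-generated Chinese abstract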

强化学习（RL）后训练最近推动了长链思维推理大语言模型（LLM）的重大进展，但这类模型的高推理成本促使将其蒸馏到更小的学生模型中。大多数现有的知识蒸馏（KD）方法是为监督微调（SFT）设计的，依赖于固定的教师轨迹或基于教师-学生KL散度的正则化。当与RL结合时，这些方法常常遭受分布不匹配和目标干扰：教师监督可能与学生不断变化的rollout分布不一致，并且KL正则化项可能与奖励最大化竞争，需要仔细的损失平衡。为了解决这些问题，我们提出了RL感知蒸馏（RLAD），它在RL期间执行选择性模仿——仅在改进当前策略更新时引导学生向教师学习。我们的核心组件，信任区域比率蒸馏（TRRD），用基于PPO/GRPO风格似然比的目标替代教师-学生KL正则化项，该目标锚定到教师-旧策略混合，从而在学生rollout上产生优势感知、信任区域约束的蒸馏，并自然平衡探索、利用和模仿。在多种逻辑推理和数学基准上，RLAD始终优于离线蒸馏、标准GRPO和基于KL的在策略教师-学生知识蒸馏。

English abstract

Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.


2606.15015 2026-06-19 cs.CV cs.AI updated version

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

NEXUS: 用于物理一致的高接触3D物体动力学的神经能量场

Qizhen Ying, Guangming Wang, Yangchen Pan, Victor Adrian Prisacariu, Brian Sheil, Yixiong Jing

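Affiliations: University of Oxford; University of Cambridge

AI summary: Proposes NEXUS, a neural energy-field framework that models conservative and non-conservative dynamics through scalar energy and dissipation terms, improving long-horizon trajectory accuracy in contact-rich 3D scenes and guiding video generation.

Comments: 18 pages, 4 figures, 6 tables. Preprint

AI-generated Chinese abstract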

基于物理的视频生成需要可控的3D物体动力学，这些动力学在接触、变形和外力作用下保持物理一致性。现有的基于轨迹的方法通常建模孤立的物理效应，难以在高接触3D场景中组合保守和非保守动力学。我们提出NEXUS，一个用于高接触3D物体动力学的神经能量场框架。NEXUS将每个物体表示为结构图，并构建动态的物体-物体和物体-环境接触图。受哈密顿神经网络启发，NEXUS通过标量能量和耗散项而非直接预测状态或加速度来公式化运动。保守效应（包括重力和弹性变形）被组合为加性能量项，而非保守效应（如阻尼和冲击引起的能量损失）则通过学习的瑞利型耗散建模。力通过对能量和耗散函数求导得到，并通过多子步半隐式积分器进行演化。在受控轨迹基准测试中，NEXUS在不同力学属性和物理效应组合下，相较于代表性的学习和物理结构化动力学基线，提高了长时程精度。我们进一步展示NEXUS轨迹为高接触视频生成提供了有效指导，在保持竞争性视觉质量的同时提高了物理合理性。

English abstract

Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.
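
The idea of deriving forces from a scalar energy plus a Rayleigh-style dissipation function, then rolling out with a sub-stepped semi-implicit integrator, can be illustrated on a one-dimensional damped spring. This is a toy sketch and not the NEXUS model: the energy and dissipation functions are hand-written rather than learned, finite differences stand in for automatic differentiation, and all constants are assumptions.

```python
def potential_energy(q, k=4.0):
    # Toy conservative term (assumption): linear spring, E = 0.5 * k * q^2.
    return 0.5 * k * q * q

def rayleigh_dissipation(v, c=0.5):
    # Toy Rayleigh dissipation function (assumption): R = 0.5 * c * v^2.
    return 0.5 * c * v * v

def derivative(f, x, eps=1e-5):
    # Central finite difference, standing in for automatic differentiation.
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

def rollout(q, v, mass=1.0, dt=0.01, steps=1000, substeps=4):
    """Semi-implicit (symplectic) Euler with several substeps per frame."""
    h = dt / substeps
    for _ in range(steps * substeps):
        # Force = -dE/dq (conservative) - dR/dv (dissipative).
        force = -derivative(potential_energy, q) - derivative(rayleigh_dissipation, v)
        v = v + h * force / mass  # update velocity first ...
        q = q + h * v             # ... then position with the new velocity
    return q, v

def total_energy(q, v, mass=1.0):
    return potential_energy(q) + 0.5 * mass * v * v

q0, v0 = 1.0, 0.0
q1, v1 = rollout(q0, v0)
```

Because the dissipation term only removes energy, the total mechanical energy after the rollout is lower than at the start, which is the kind of physical consistency check an energy-based formulation makes easy to state.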


2606.18812 2026-06-19 cs.LG cs.AI updated version

Reinforcement Learning Foundation Models Should Already Be A Thing

强化学习基础模型本应已经存在

Abdelrahman Zighem, Jill-Jênn Vie

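Affiliations: École normale supérieure de Paris, PSL University, Paris, France; Soda team, Inria Saclay, Palaiseau, France

AI summary: Proposes building reinforcement learning foundation models from synthetic MDPs, using a fixed-size sufficient statistic that makes attention-based architectures applicable. In experiments the model needs far fewer episodes than UCB-VI and tabular Q-learning online and is competitive with VI-LCB offline.

AI-generated Chinese abstract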

语言和视觉的基础模型由互联网规模的数据驱动，而结构化领域（表格预测、时间序列预测、图学习、强化学习）则不然。替代方案是合成数据，它将负担从收集转移到先验设计。这种先验已经存在于许多结构化任务中：TabPFN及其后续工作通过一个在合成贝叶斯先验上预训练的Transformer解决表格分类问题。我们提出两点。首先，强化学习是明显的空白：采样一个合成MDP与采样一个合成表格数据集一样可行，然而没有上下文强化学习工作将先验设计作为主要目标。其次，MDP允许一个固定大小的充分统计量，独立于观察到的回合且形状为表格形式，这使得它们直接适用于用于表格基础模型的基于注意力的架构，只需将策略头替换监督目标。这些共同定义了强化学习基础模型的议程。作为概念验证，我们完全在合成MDP上训练一个模型，并表明，无需任务特定的调优，它就能在上下文中解决留出的表格基准，包括在线和离线：在线时，使用比UCB-VI和表格Q-learning少得多的回合；离线时，与VI-LCB竞争。

English abstract

Foundation models for language and vision are powered by internet-scale data, while structured domains such as tabular prediction are powered by synthetic data. This substitute shifts the challenge from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train a Graph Attention Network entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.


2606.19626 2026-06-19 cs.AI cs.CL new submission

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Toten：基于知识本体的巴西葡萄牙语物理量和技术符号分词

Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa

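Affiliations: Aia Context; Universidade Federal do Maranhão; Universidade de São Paulo

AI summary: Proposes TOTEN, a framework that uses an ontology of engineering entities to classify physical quantities and technical notation declaratively in place of statistical tokenization, achieving high-atomicity tokenization and numerical reconstruction on Brazilian Portuguese corpora.

AI-generated Chinese abstract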

字节对编码分词在词汇压缩方面统计高效，但对结构化技术实体语义盲目，将物理量、数字、单位和符号表达式分割成词汇上任意子词。我们提出TOTEN，一个基于知识本体的分词框架，用基于工程实体形式本体（OEE）的声明式分类取代统计推导。我们将TOTEN形式化为三元组<O, classify, {inst_tau}>：本体收集类型、结构原理、组成关系和可保存不变量；分类函数将原始文本映射到类型化区域；实例化器族产生自描述的结构化表示。鲁棒性源于与三个外部预言机的确定性耦合：Pint（量纲）、Unicode字符数据库（排版）和RSLP（葡萄牙语形态）。内在评估涵盖四个可通过构造验证的属性——本体原子性、量纲等价性、排版鲁棒性和数值重建——在一个内部、物理验证的基准（EngQuant，N=800）和四个巴西葡萄牙语外部语料库（N=1771个合格案例）上进行。我们还报告检测召回率，区分覆盖率和条件原子性。与八个最先进基线相比，TOTEN在所有对比中实现单位本体原子性，在外部语料库上数值重建为0.775-0.904，而最佳基线（Quantulum3）为0.627-0.703；在EngQuant上为0.780 vs. 0.340。差异具有统计显著性（McNemar检验，Holm校正）。内部和外部排名之间的Spearman相关性证实了控制基准的同时效度。量纲等价性显示与Pint（系统继承量纲权威的预言机）统计对等。

English abstract

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple <O, classify, {inst_tau}>: the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the instantiator family yields a self-descriptive structured representation. Robustness derives from deterministic coupling with three external oracles: Pint (dimensional), Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation covers four properties verifiable by construction -- ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction -- over an internal, physically validated benchmark (EngQuant, N=800) and four Brazilian Portuguese external corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on external corpora, vs. 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. Differences are statistically significant (McNemar with Holm correction). Spearman correlation between internal and external rankings confirms concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.


2606.20245 2026-06-19 cs.AI new submission

Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

导航不可靠的参数化与上下文知识：面向LLM推理的显式知识冲突解决

Huang Peng, Jiuyang Tang, Weixin Zeng, Hao Xu, Xiang Zhao

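Affiliations: National Key Laboratory of Big Data and Decision, National University of Defense Technology

AI summary: Proposes MACR, a framework that uses adaptive knowledge assessment and multi-agent reasoning to explicitly resolve conflicts between an LLM's internal parametric knowledge and external context, moving beyond the conventional binary-choice paradigm.

Comments: 12 pages, 3 figures

AI-generated Chinese abstract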

大型语言模型（LLM）通过利用广泛的参数化知识和上下文学习能力，在多种基于语言的任务中取得了强劲性能，使其能够整合输入提示中提供的外部信息。然而，外部知识的整合可能引入冲突，不仅存在于模型内部参数知识与外部信息之间，也存在于多个外部上下文之间。现有方法通常假设模型或提供的上下文是可靠的，忽视了两种来源都可能包含错误的情况，并通过优先考虑某一来源而非另一来源来避免冲突，而非主动解决不一致性。为解决这些局限，我们提出了一种新颖的LLM知识冲突解决框架MACR，该框架超越了传统的二元选择范式，并基于多智能体推理方法引入了显式的冲突解决机制。具体而言，我们首先提出一种自适应知识评估与检索方法，采用改进的语义熵度量来量化LLM对给定查询答案的置信度。基于此置信度估计，MACR要么将模型的内部知识外化为文本表示，要么在内部知识不足时检索相关外部知识，为后续推理生成基本上下文。然后，我们引入一个归纳式多智能体推理框架，包含三个专门智能体，分别用于归纳显式规则、分析潜在冲突以及解决所有可用上下文中的不一致性。实验结果表明，MACR在多个基准测试中显著优于最先进的基线方法，同时提供了可解释的显式冲突解决方案。

English abstract

Large language models (LLMs) have achieved strong performance across a wide range of language-based tasks by leveraging both extensive parametric knowledge and in-context learning ability, enabling them to incorporate external information provided in the input prompt. However, the integration of external knowledge can introduce conflicts, not only between the model's internal parametric knowledge and the external information, but also among multiple pieces of external context. Existing approaches typically assume that either the model or the provided context is reliable, overlooking the possibility that both sources may contain errors, and avoid conflicts by privileging one source over the other, rather than actively resolving inconsistencies. To address these limitations, we propose MACR, a novel framework for LLM knowledge conflict resolution that moves beyond the conventional binary choice paradigm and incorporates an explicit conflict-resolution mechanism based on a multi-agent reasoning approach. Specifically, we first propose an adaptive knowledge assessment and retrieval approach that employs a modified semantic entropy measure to quantify an LLM's confidence in its answer to a given query. Based on this confidence estimation, MACR either externalizes the model's internal knowledge as textual representations or retrieves relevant external knowledge when internal knowledge is insufficient, generating basic contexts for subsequent reasoning. Then we introduce an inductive multi-agent reasoning framework with three specialized agents that, respectively, induce explicit rules, analyze potential conflicts, and resolve inconsistencies across all available contexts. Empirical results demonstrate that MACR significantly outperforms state-of-the-art baselines across benchmarks, while also providing interpretable resolutions of explicit conflicts.
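
The confidence signal mentioned in this abstract builds on semantic entropy, which measures how sampled answers spread across distinct meanings rather than distinct strings. The sketch below shows the plain, unmodified version with uniform sample weights. The abstract does not describe MACR's modification, and `same_meaning` is a hypothetical stand-in for a real equivalence check such as bidirectional entailment.

```python
import math

def semantic_entropy(answers, same_meaning):
    """Entropy over meaning clusters of sampled answers (uniform sample weights)."""
    clusters = []  # each cluster is a list of answers judged equivalent
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])  # no existing cluster matched
    total = len(answers)
    probs = [len(c) / total for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Stand-in equivalence check (assumption): case-insensitive exact match.
def same_meaning(a, b):
    return a.strip().lower() == b.strip().lower()

confident = semantic_entropy(["Paris", "paris", "Paris "], same_meaning)
uncertain = semantic_entropy(["Paris", "Lyon", "Nice"], same_meaning)
```

A low value means the samples agree in meaning, which a system could treat as sufficient internal knowledge. A high value would instead suggest falling back to retrieval.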


2606.20333 2026-06-19 cs.AI new submission

SoftSkill: Behavioral Compression for Contextual Adaptation

SoftSkill: 用于上下文适应的行为压缩

Xijia Tao, Yihua Teng, Xinyu Fu, Ziru Liu, Kecheng Chen, Yuzhi Zhao, Suiyun Zhang, Rui Liu, Lingpeng Kong

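Affiliations: The University of Hong Kong; Huawei Research; City University of Hong Kong; Huazhong University of Science and Technology

AI summary: Proposes SoftSkill, which compresses natural-language skills into compact continuous vectors via a trainable soft-skill prefix, improving question-answering and math performance on a frozen base model while reducing token count.

AI-generated Chinese abstract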

智能体技能通常以自然语言Markdown文件形式部署，编码回答策略、证据使用习惯和任务流程。这些文件可读且可移植，但间接消耗：对于每个任务实例，冻结的语言模型必须将长文本制品转换为生成时行为。本文探讨自然语言技能是否可以初始化一个紧凑的连续上下文对象，通过可训练的软增量进行优化，同时基模型保持冻结。我们提出SoftSkill，一种冻结骨干方法，通过下一词预测调整此类软技能，并在推理时将其部署为潜在行为先验。在我们的主要单轮设置中，在Qwen3.5-4B上使用长度为32的SoftSkill前缀，相比无技能提示在SearchQA上提升8.3分，LiveMath上提升42.1分，DocVQA上提升1.3分。相对于SkillOpt，SoftSkill在SearchQA上准确率提升5.2分，LiveMath上提升12.5分，同时用少量虚拟标记替换数百到数千个Markdown技能标记。我们进一步研究了作为更难边界情况的智能体执行，其中稀疏轨迹模仿提供了有用信号，但尚未稳健地压缩长程过程行为。更广泛地说，结果表明某些任务技能更适合被视为紧凑的潜在控制，而不是在推理时重新解释的额外Markdown，用于控制冻结模型如何进入任务。

English abstract

Agent skills are commonly deployed as natural-language Markdown files that encode answer policies, evidence-use habits, and task procedures. These files are readable and portable, but they are consumed indirectly: for each task instance, a frozen language model must translate a long textual artifact into generation-time behavior. This paper asks whether a natural-language skill can instead initialize a compact continuous context object, refined by a trainable soft delta while the base model remains frozen. We propose SoftSkill, a frozen-backbone method that tunes such soft skills with next-token prediction and deploys them as latent behavioral priors at inference time. In our main single-round setting, a length-32 SoftSkill prefix on Qwen3.5-4B improves over no-skill prompting by 8.3 points on SearchQA, 42.1 points on LiveMath, and 1.3 points on DocVQA. Relative to SkillOpt, SoftSkill improves accuracy by 5.2 points on SearchQA and 12.5 points on LiveMath, while replacing hundreds to thousands of Markdown skill tokens with a few virtual tokens. We further study agentic execution as a harder boundary case, where sparse trajectory imitation provides useful signal but does not yet robustly compress long-horizon procedural behavior. More broadly, the results suggest that some task skills are better treated not as additional Markdown to be reinterpreted at inference time, but as compact latent controls over how a frozen model enters the task.


2606.20518 2026-06-19 cs.AI new submission

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

FlowEdit: 流匹配TTS中终身发音适应的联想记忆

Harshit Singh, Ayush Pratap Singh, Nityanand Mathur

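Affiliations: University of Maryland; TU Darmstadt; Smallest AI

AI summary: To address flow-matching TTS systems being unable to correct proper-noun pronunciation errors after deployment, the paper proposes FlowEdit, which learns pronunciation corrections as latent conditioning edits rather than weight updates and stores and retrieves them with a Modern Hopfield Network, reducing phoneme error rate by 92.7% on a benchmark of 312 multilingual proper nouns.

AI-generated Chinese abstract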

流匹配文本到语音系统在零样本场景下表现出色，但部署后保持静态：除非重新训练模型，否则对词汇表外的专有名词的发音错误会持续存在。我们提出FlowEdit，一个用于冻结的流匹配TTS的终身适应框架，它将发音修正学习为潜在条件编辑而非权重更新。当提供纠正性反馈时，FlowEdit优化文本嵌入空间中的令牌级扰动，然后将修正存储在作为内容可寻址情景记忆的现代Hopfield网络中。在推理时，通过具有相似性门控的软注意力检索修正，实现模糊形态匹配。在我们整理的涵盖18个语系的312个多语言专有名词基准上，FlowEdit相对于零样本基线将目标词音素错误率降低了92.7%，同时保持相同的通用语音质量。修正过程在单个GPU上大约15秒完成。

English abstract

Flow-matching text-to-speech systems achieve remarkable zero-shot quality but remain static after deployment: pronunciation errors on out-of-vocabulary proper nouns persist unless the model is retrained. We introduce FlowEdit, a lifelong adaptation framework for frozen flow-matching TTS that learns pronunciation corrections as latent conditioning edits rather than weight updates. When corrective feedback is provided, FlowEdit optimizes a token-level perturbation in the text embedding space, then stores the correction in a Modern Hopfield Network serving as content-addressable episodic memory. At inference, corrections are retrieved via soft attention with a similarity gate, enabling fuzzy morphological matching. On our curated benchmark of 312 multilingual proper nouns across 18 language families, FlowEdit reduces target-word Phoneme Error Rate by 92.7% relative to the zero-shot baseline while maintaining identical general-speech quality. Corrections complete in approximately 15 seconds on a single GPU.
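
Retrieval from a content-addressable memory via soft attention with a similarity gate can be sketched as a softmax-weighted readout that is skipped when no stored key is close enough to the query. This is a minimal illustration, not FlowEdit's implementation: the use of cosine similarity, the inverse temperature `beta`, the `gate` threshold, and the toy keys and values are all assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query, keys, values, beta=8.0, gate=0.8):
    """Soft-attention retrieval with a similarity gate (threshold is hypothetical).

    Returns a weighted sum of stored values, or None when no stored key is
    similar enough to the query.
    """
    sims = [cosine(query, k) for k in keys]
    m = max(sims)
    if m < gate:
        return None  # gate closed: leave the input unedited
    exps = [math.exp(beta * (s - m)) for s in sims]  # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Two stored corrections: keys are token embeddings, values are perturbations.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[0.5, -0.5], [-1.0, 1.0]]
hit = retrieve([0.9, 0.1], keys, values)   # close to the first key
miss = retrieve([0.7, 0.7], keys, values)  # equally far from both keys
```

The soft weighting lets a near match (for example an inflected form of a stored name) still pull mostly from the right correction, while the gate keeps unrelated words from receiving any edit.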


2606.20532 2026-06-19 cs.AI new submission

How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

指令如何塑造语音？面向风格描述文本到语音的交叉注意力归因

Nityanand Mathur, Hamees Sayed, Wasim Madha, Apoorv Singh, Sameer Khurana, Akshat Mandloi, Sudarshan Kamath

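AI summary: Proposes a cross-attention attribution method to analyze how individual words affect the acoustic output of style-captioned text-to-speech systems, finding that attention to style tokens peaks in early steps and deep layers and correlates with F0 and energy.

AI-generated Chinese abstract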

风格描述文本到语音系统使用自然语言控制语音特征，但单个单词如何影响声学输出仍不清楚。理解这一点对于诊断故障模式和提高表现性TTS的可控性至关重要。我们首次将DAAM框架适配到语音领域，为语音扩散模型提出交叉注意力归因，并将其应用于CapSpeech-TTS。我们的方法提取了25层和24个ODE步骤的逐词热力图。我们分析了3,600个（风格描述，文本转录）组合，包括120个风格描述条件生成30个文本转录，揭示了描述词如何塑造波形。结果表明：（1）风格标记的时间方差低于内容/功能标记，确认了全局条件作用；（2）风格注意力与基频和能量相关；（3）风格条件作用在早期步骤和深层达到峰值；（4）注意力熵在第17层达到最小值，与风格重要性峰值同时出现，表明在最关键风格阶段网络选择性最大。这是首次研究自然语言如何影响语音扩散模型中的交叉注意力。

English abstract

Style-captioned text-to-speech systems use natural language to control voice characteristics, but how individual words influence acoustic output remains unclear. Understanding this is critical for diagnosing failure modes and improving controllability in expressive TTS. We propose cross-attention attribution for speech diffusion models, adapting the DAAM framework to the speech domain for the first time, and apply it to CapSpeech-TTS. Our method extracts per-token heatmaps across 25 layers and 24 ODE steps. We analyze 3,600 (style caption, text transcript) combinations comprising 120 style captions conditioning the generation of 30 text transcripts each, revealing how caption tokens shape waveforms. Results show: (1) style tokens have lower temporal variance than content/function tokens, confirming global conditioning; (2) style attention correlates with F0 and energy; (3) style conditioning peaks in early steps and deep layers; (4) attention entropy reaches its minimum at layer 17, co-occurring with the style importance peak, indicating maximal network selectivity at the most style-critical stage. This is the first study of how natural language influences cross-attention in speech diffusion models.

2606.18485 2026-06-19 cs.SD cs.AI eess.AS Cross-list

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

MagpieTTS-LF：无需长语音数据训练的推理时长语音生成

Subhankar Ghosh, Jason Li, Paarth Neekhara, Shehzeen Hussain, Ryan Langman, Xuesong Yang, Roy Fejgin

Affiliations: NVIDIA Corporation（英伟达公司）

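AI summary: Proposes MagpieTTS-LF, an inference-time method that uses soft attention priors, stateful inference, and history-aware text encoding to generate coherent long-form speech without retraining the model.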

Journal ref Interspeech 2026

AI中文摘要

神经文本到语音（TTS）系统在短语句上取得了显著质量，但长语音生成表现出韵律漂移、说话人不一致和句子边界伪影。现有方法要么压缩序列、增加上下文长度，要么简单拼接独立合成的片段。我们提出一种称为MagpieTTS-LF的推理时方法，使MagpieTTS能够在不重新训练模型的情况下生成连贯的长语音。我们的方法引入了三个关键创新：（1）软注意力先验，在保留过去和未来上下文的同时引导单调对齐；（2）有状态推理算法，跨句子块维护上下文，确保韵律连续性；（3）历史感知文本编码，利用过去文本进行语篇级韵律规划。在长文本上的实验表明，与其他基线相比，在长距离可懂度、韵律连贯性、说话人一致性和边界自然度方面有显著改进。

英文摘要

Neural Text-to-Speech (TTS) systems achieve remarkable quality on short utterances but long-form speech generation shows prosodic drift, speaker inconsistencies and sentence boundary artifacts. Existing approaches either compress sequences, increase context length or naively concatenate independently synthesized chunks. We present an inference-time approach called MagpieTTS-LF that enables MagpieTTS to produce coherent long-form speech without model retraining. Our method introduces three key innovations: (1) soft attention priors to guide monotonic alignment while preserving past and future context; (2) a stateful inference algorithm that maintains context across sentence chunks, ensuring prosodic continuity; (3) history-aware text encoding that uses past text for discourse-level prosodic planning. Experiments on long texts show significant improvements in long-range intelligibility, prosodic coherence, speaker consistency, and boundary naturalness compared to other baselines.

2606.19346 2026-06-19 cs.CL cs.AI Cross-list

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

跨语言迁移中语言相关性与任务对齐的解耦

Ahmed Haj Ahmed, Ruochen Zhang, Alvin Grissom

Affiliations: Haverford College（哈弗福德学院）; Brown University（布朗大学）

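AI summary: Fine-tunes large language models and evaluates zero-shot reading comprehension on Semitic and non-Semitic languages; finds that cross-lingual transfer mainly improves task-format alignment rather than language-specific knowledge.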

2606.19347 2026-06-19 cs.CL cs.AI cs.PL Cross-list

How LLMs Fail and Generalize in RTL Coding for Hardware Design?

LLM在硬件设计的RTL编码中如何失败与泛化？

Guan-Ting Liu, Chao-Han Huck Yang, Chenhui Deng, Zhongzhi Yu, Brucek Khailany, Yu-Chiang Frank Wang

Affiliations: NVIDIA Research（英伟达研究院）

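AI summary: Proposes an error taxonomy grounded in problem solvability and shows that LLM RTL coding is bounded by pretraining knowledge: alignment techniques merely teach models to compile, while reasoning ability is the key bottleneck.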

Comments Preview, under submission for EMNLP 2026

AI中文摘要

将顺序编程先验转换为硬件设计的并行时序逻辑仍然是大型语言模型（LLM）的关键瓶颈。为了研究这一点，我们引入了一种新的错误分类法，该分类法基于问题可解性，受认知理论启发。我们的分类法将失败分为语法、语义、可解功能和不可解功能类型。评估揭示了VerilogEval基准上的严格经验上限，前沿模型初始通过率稳定在90.8%。这些平台期由不可解的功能错误定义，暴露出对测试时计算扩展免疫的持续知识差距。此外，我们揭示了一个显著的表面收敛差距：优化容易消除语法错误，但同时加剧了更深层次的功能失败。我们的发现表明，对齐技术仅仅教会模型编译。虽然重复采样策略可以修补可解错误，但寄存器传输级（RTL）编码能力仍然严格受限于预训练知识。解决当前基于LLM的硬件生成流水线中的挑战需要更多关于模型推理的研究，而不是对齐干预。

英文摘要

Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models (LLMs). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test-time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level (RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM-based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.

2606.19348 2026-06-19 cs.CL cs.AI Cross-list

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

DeepSeek-V4: 迈向高效百万令牌上下文智能

DeepSeek-AI, Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chenchen Ling, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chengyu Hou, Chenhao Xu, Chenze Shao, Chong Ruan, Conner Sun, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Donghao Li, Dongjie Ji, Erhang Li, Fang Wei, Fangyun Lin, Fangzhou Yuan, Feiyu Xia, Fucong Dai, Guangbo Hao, Guanting Chen, Guoai Cao, Guolai Meng, Guowei Li, Han Yu, Han Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoling Zhang, Haoming Luo, Haoran Wei, Haotian Yuan, Haowei Zhang, Haowen Luo, Haoyu Chen, Haozhe Ji, Hengqing Zhang, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, J Yang, JQ Zhu, Jia Luo, Jia Song, Jia Yu, Jialiang Huang, Jialu Cai, Jian Liang, Jiangting Zhou, Jiasheng Ye, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jieyu Yang, Jin Chen, Jin Yan, Jingchang Chen, Jingli Zhou, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jingzi Zhou, Jinhua Zhu, Jiping Yu, Joseph Sun, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junmin Zheng, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Leyi Xia, Li Zhang, Liang Zhao, Lihua Guo, Lingxiao Luo, Linwang Ma, Linyan Zhu, Litong Wang, Liyu Cai, Liyue Zhang, Longhao Chen, MS Di, MY Xu, Max Mei, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Mingxu Zhou, Minmin Han, Ning Wang, Panpan Huang, Panpan Wang, Peixin Cong, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Qiwei Jiang, Rui Tian, Ruifan Xu, Ruijie Lu, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqian Chen, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, Ruyi Chen, SH Liu, Shanghao Lu, Shangmian Sun, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoheng Nie, Shaoqing Wu, Shaoyuan Chen, Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, Shuying Yu, Songyang Zhou, Tao Ni, 
Tao Yun, Tian Jin, Tian Pei, Tian Ye, Tianle Lin, Tianran Ji, Tianyi Cui, Tianyuan Yue, Tingting Yu, Tun Wang, W Zhang, WL Xiao, Wangding Zeng, Wei An, Weilin Zhao, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjing Yao, Wenjun Gao, Wenkai Yang, Wenlve Huang, Wenqing Hou, Wentao Zhang, Wenting Ma, Xi Gao, Xiang He, Xiangwen Wang, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingchen Liu, Xingkai Yu, Xingyou Li, Xinyu Yang, Xinyu Zhang, Xu Chen, Xuanyu Wang, Xuecheng Su, Xueyin Chen, Xuheng Lin, Xuwei Fu, YC Yan, YQ Wang, YW Ma, Yanfeng Luo, Yang Zhang, Yanhong Xu, Yanru Ma, Yanwen Huang, Yao Li, Yao Li, Yao Xu, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Shao, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yijia Wu, Yiliang Xiong, Yiling Ma, Ying He, Ying Tang, Ying Zhou, Yingjia Luo, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiang Zhang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, YuKun Li, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuanhao Li, Yuduan Wang, Yuehan Yang, Yuer Xu, Yuhan Wu, Yuhao Meng, Yuheng Zou, Yukun Zha, Yunfan Xiong, Yupeng Chen, Yuping Lin, Yuqian Cao, Yuqian Wang, Yushun Zhang, Yuting Yan, Yutong Lin, Yuxian Gu, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuxuan Zhou, Yuyang Zhou, Yuzhen Huang, ZF Wu, Zehao Wang, Zehua Zhao, Zehui Ren, Zekai Zhang, Zhangli Sha, Zhe Fu, Zhe Ju, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zheren Gao, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhixuan Chen, Zhiyu Wu, Zhizhou Ren, Zhongyu Wu, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihua Qu, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang, Ziwei Xie, Ziyi Gao, Ziyi Wan, Zizheng Pan, Zongqing Yao

Affiliations: DeepSeek-AI（深度求索人工智能）

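AI summary: Presents the DeepSeek-V4 series of MoE models, which use a hybrid attention architecture, Manifold-Constrained Hyper-Connections, and the Muon optimizer to achieve efficient inference over one-million-token contexts and outperform their predecessors on core tasks.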

AI中文摘要

我们展示了DeepSeek-V4系列的预览版本，包括两个强大的混合专家（MoE）语言模型：DeepSeek-V4-Pro（1.6T参数，49B激活）和DeepSeek-V4-Flash（284B参数，13B激活），两者均支持一百万个令牌的上下文长度。DeepSeek-V4系列在架构和优化方面引入了多项关键升级：（1）混合注意力架构，结合压缩稀疏注意力（CSA）和重度压缩注意力（HCA），以提高长上下文效率；（2）流形约束超连接（mHC），增强传统残差连接；（3）Muon优化器，实现更快的收敛和更高的训练稳定性。我们在超过32T多样且高质量的令牌上预训练了两个模型，随后通过全面的后训练流程解锁并进一步增强其能力。DeepSeek-V4-Pro-Max是DeepSeek-V4-Pro的最大推理努力模式，重新定义了开放模型的最先进水平，在核心任务上超越了其前代。同时，DeepSeek-V4系列在长上下文场景中非常高效。在百万令牌上下文设置下，与DeepSeek-V3.2相比，DeepSeek-V4-Pro仅需27%的单令牌推理FLOPs和10%的KV缓存。这使得我们能够常规支持百万令牌上下文，从而使长时任务和进一步的测试时扩展更加可行。模型检查点可从 https://huggingface.co/collections/deepseek-ai/deepseek-v4 获取。

英文摘要

We present a preview version of the DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models -- DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) -- both supporting a context length of one million tokens. The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; and (3) the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, the DeepSeek-V4 series is highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.
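
The parameter counts and efficiency figures quoted above translate into a few derived ratios that are easy to check. The snippet below is plain arithmetic on the abstract's numbers; the helper names are made up for the example, and nothing here models the architecture itself.

```python
def activated_fraction(total_params, active_params):
    """Share of parameters active per token in an MoE model."""
    return active_params / total_params


def reduction_factor(remaining_share):
    """How many times smaller a cost is when only `remaining_share` remains."""
    return 1.0 / remaining_share


# Figures taken from the abstract.
pro_fraction = activated_fraction(1.6e12, 49e9)    # roughly 3.1% of weights active
flash_fraction = activated_fraction(284e9, 13e9)   # roughly 4.6% of weights active
flops_reduction = reduction_factor(0.27)           # about 3.7x fewer FLOPs per token
kv_reduction = reduction_factor(0.10)              # about 10x smaller KV cache

print(f"Pro active share:   {pro_fraction:.2%}")
print(f"Flash active share: {flash_fraction:.2%}")
print(f"FLOPs reduction vs V3.2: {flops_reduction:.1f}x")
print(f"KV cache reduction vs V3.2: {kv_reduction:.1f}x")
```

Both models activate only a small percentage of their weights per token, which is where the MoE design saves compute; the 27% and 10% figures are separate savings that the abstract reports for the one-million-token setting.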

2606.19349 2026-06-19 cs.CL cs.AI Cross-list

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

查询应置于何处？通过解码动力学揭示并缓解扩散大语言模型中上下文学习的位置偏差

Zhengheng Li, Panrui Li, Xuyang Liu, Puzhi Xia

Affiliations: Southeast University（东南大学）

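AI summary: Systematically analyzes how query position affects generation quality in diffusion LLMs, finds it matters as much as the semantic quality of the examples, and proposes Auto-ICL, a training-free adaptive routing strategy based on Average Confidence, to optimize query placement.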

Comments 9 figures, 4 tables

AI中文摘要

尽管上下文学习（ICL）在自回归（AR）大语言模型（LLMs）中已被广泛研究，但其在扩散大语言模型（dLLMs）中的机制仍基本未被探索。与受单向因果掩码限制的AR模型不同，dLLMs本质上利用双向注意力，为查询放置提供了广泛的空间灵活性。不幸的是，当前实践通常继承AR风格的尾随查询模板，往往忽略了结构范式转变。本文通过全面分析揭示了查询位置实际上是dLLMs中的一阶变量。通过经验解耦，我们证明了位置方差对生成质量的影响与示例语义质量相当。在内部，这种位置敏感性源于注意力流中的空间“近因效应”以及解码轨迹中依赖于任务的偏移。为了在没有真实标签的情况下缓解这种不稳定性，我们揭示了传统的单步置信度（$C_{decoded}$）在dLLMs中失效。相反，我们提出了平均置信度（$\overline{C}$），一种跟踪迭代解码过程的新指标。通过建立基础的空间ICL基线，我们引入了Auto-ICL，一种无需训练的自适应路由策略，动态优化查询放置，在异构推理和感知任务中稳健地接近最优性能。

英文摘要

While In-Context Learning (ICL) is extensively studied in Autoregressive (AR) LLMs, its mechanism within Diffusion Large Language Models (dLLMs) remains largely unexplored. Unlike AR models restricted by unidirectional causal masking, dLLMs intrinsically utilize bidirectional attention, offering extensive spatial flexibility for query placement. Unfortunately, current practices conventionally inherit AR-style trailing-query templates, often overlooking the structural paradigm shift. This paper presents a comprehensive analysis unveiling that query position is actually a first-order variable in dLLMs. Through empirical decoupling, we demonstrate that positional variance impacts generation quality on par with example semantic quality. Internally, this positional sensitivity stems from a spatial "Recency Effect" in attention flow and task-dependent shifts in decoding trajectories. To mitigate this instability without ground-truth labels, we reveal that traditional single-step confidence ($C_{decoded}$) fails in dLLMs. Instead, we propose Average Confidence ($\overline{C}$), a novel metric tracking the iterative decoding process. By establishing the foundational spatial ICL baselines, we introduce Auto-ICL, a training-free adaptive routing strategy that dynamically optimizes query placement, robustly approaching oracle performance across heterogeneous reasoning and perception tasks.
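
The contrast between a single-step confidence and a confidence averaged over the decoding process can be made concrete with a toy router. The abstract does not give the exact definition of $\overline{C}$, so the sketch below assumes it is the arithmetic mean of per-step confidences, and assumes routing simply picks the placement with the highest mean; the placement names and trace values are invented for illustration.

```python
def average_confidence(step_confidences):
    """Mean confidence across all iterative decoding steps (assumed definition)."""
    return sum(step_confidences) / len(step_confidences)


def route_by_average(traces):
    """Pick the query placement whose decoding trace has the highest mean confidence."""
    return max(traces, key=lambda name: average_confidence(traces[name]))


def route_by_last_step(traces):
    """Baseline: pick the placement by the final decoding step's confidence only."""
    return max(traces, key=lambda name: traces[name][-1])


# Hypothetical per-step confidences for two candidate query placements.
traces = {
    "query_first": [0.90, 0.85, 0.80],
    "query_last": [0.40, 0.50, 0.95],
}
```

In this toy trace the two criteria disagree: the last-step baseline prefers `query_last` because of its final spike, while the averaged score prefers `query_first`, whose confidence is high throughout decoding.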

2606.19351 2026-06-19 cs.CL cs.AI Cross-list

Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

基于大语言模型的知识图谱推理中的幻觉检测

Xinyan Zhu, Yaoqi Liu, Yue Gao, Huadong Ma, Cheng Yang, Chuan Shi

Affiliations: Beijing University of Posts and Telecommunications（北京邮电大学）; Tsinghua University（清华大学）

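AI summary: Proposes LUCID, which combines LLM attention scores, KG semantics, and structural information and uses a graph neural network to detect hallucinations in LLM-based knowledge graph reasoning, achieving state-of-the-art performance on nine datasets.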

AI中文摘要

知识图谱推理从现有事实中推断新知识，广泛应用于问答、推荐和决策支持。随着大语言模型（LLM）的快速发展，基于LLM的知识图谱推理框架通过利用检索到的知识图谱信息变得越来越流行。然而，LLM中的幻觉仍然是一个关键问题。即使融入了相关的知识图谱知识，模型仍可能生成错误输出，导致错误信息和不可靠的决策。现有的幻觉检测方法要么关注LLM内部状态，要么验证与检索上下文的一致性，但两者都忽略了知识图谱中的结构信息，导致性能次优。为了解决这一差距，我们提出了LUCID，这是首个针对基于LLM的知识图谱推理框架的幻觉检测方法。LUCID联合利用LLM注意力分数、知识图谱语义和结构信息。具体来说，它从注意力分数和语义相似度中提取节点和边特征，并使用图神经网络将其与知识图谱结构集成。我们还构建了人工标注的基准数据集用于评估。在九个数据集上的实验表明，与15个基线相比，LUCID达到了最先进的性能。

英文摘要

Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG information. However, hallucinations in LLMs remain a critical issue. Even when relevant KG knowledge is incorporated, models may still generate incorrect outputs, leading to misinformation and unreliable decisions. Existing hallucination detection methods either focus on LLM internal states or verify consistency with retrieved contexts, but both overlook the structural information in KGs, resulting in suboptimal performance. To address this gap, we propose LUCID, the first halLUcination deteCtIon method for LLM-based knowleDge graph reasoning frameworks. LUCID jointly leverages LLM attention scores, KG semantics, and structural information. Specifically, it extracts node and edge features from attention scores and semantic similarities, and integrates them with KG structure using a graph neural network. We also construct manually annotated benchmark datasets for evaluation. Experiments on nine datasets show that LUCID achieves state-of-the-art performance compared to 15 baselines.

2606.19376 2026-06-19 cs.LG cs.AI cs.IR Cross-list

Cost-Optimal LLM Routing with Limited User Feedback under User Satisfaction Guarantees

在用户满意度保证下基于有限用户反馈的成本最优LLM路由

Herbert Woisetschläger, Arastun Mammadli, Ryan Zhang, Shiqiang Wang

Affiliations: Technical University of Munich（慕尼黑工业大学）; University of Exeter（埃克塞特大学）; Horace Greeley High School（霍勒斯格里利高中）

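AI summary: To address the tension between LLM inference cost and service quality, proposes SLARouter, an online routing algorithm that learns a cost-optimal policy from sparse one-sided user feedback, with theoretical guarantees of cost optimality and SLA compliance; experiments show cost reductions of up to 2.2x.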

Comments Preprint. Under review

AI中文摘要

Diffusion models have achieved remarkable success in image and video generation and editing. While recent studies have extended these efforts toward motion editing, simultaneously transforming both motion and location, despite its practical importance, remains largely unexplored. To better understand robust motion-location editing, we first analyze the fundamental factors that degrade its quality. Based on this analysis, we propose TeleMorpher, to the best of our knowledge one of the first one-shot frameworks for simultaneous motion-location editing. Our approach leverages motion priors, a target motion-centric video generated from an off-the-shelf model as motion-editing guidance, and the ground truth motion to enable more controllable and precise motion-location editing. Our framework works as follows: (1) We first disentangle the protagonist and the background via pre-trained segmentation and inpainting models. (2) Then, we introduce a training-free pose warping that edits the protagonist's motion with the motion prior as the guidance. (3) The resulting warped motion video is directly injected into a baseline motion editor during inference, mitigating the difference between source and target motions while preserving the appearance of the source video. (4) To enhance the reliability of quantitative evaluations, we propose two new LPIPS-based metrics that measure the background consistency before and after the motion editing and the fidelity of motion editing performance by measuring the difference between the protagonist's skeletons extracted from the source and target videos. Experiments with in-the-wild videos and the TaiChi dataset demonstrate that TeleMorpher achieves superior performance across both quantitative and qualitative measurements (real-human evaluation), underscoring its effectiveness.

2606.19733 2026-06-19 cs.CV cs.AI Cross-list

QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval

QueryGaussian: 可扩展且无需训练的开词汇3D实例检索

Xiuyuan Zhu, Ke Lu, Zijie Yang, Chao Yue, Jian Xue, Dongming Zhang

Affiliations: University of Chinese Academy of Sciences（中国科学院大学）; State Key Laboratory of Communication Content Cognition（通信内容认知国家重点实验室）; Peng Cheng Laboratory（鹏城实验室）

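AI summary: Proposes QueryGaussian, a training-free framework for open-vocabulary 3D instance retrieval that decouples semantics from geometry through an instance-level query mechanism and combines 2D vision models with a temporal fusion module, cutting GPU memory by more than 70% and speeding up inference by 180x while preserving accuracy, and scaling to city-level scenes.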

Comments 8 pages, 4 figures, 6 tables. Accepted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2026)

AI中文摘要

通过自然语言提示从大规模场景中高效检索特定3D实例仍然是多媒体分析中的一个严峻挑战。现有方法主要遵循“场景级嵌入”范式，需要将高维语义特征蒸馏到每个3D基元中。这种策略存在一个根本性的架构瓶颈：内存和计算成本随场景复杂度线性增长，不可避免地导致城市级环境中的内存溢出（OOM）故障。为了解决这一障碍，我们提出了QueryGaussian，一个无需训练的框架，用于快速且可扩展的开词汇3D实例检索。与整体语义蒸馏不同，QueryGaussian采用实例级查询机制，将语义理解与几何表示解耦。具体来说，我们利用预训练的2D视觉模型解释用户提示，并通过并发最大权重关联策略将分割掩码提升到3D，确保语义-视觉一致性。为了缓解投影歧义，我们引入了一个具有多阶段自适应密度聚类的时间融合模块。实验结果表明，QueryGaussian不仅匹配了最先进方法的准确性，还实现了决定性的效率飞跃，将GPU内存使用减少超过70%，并将推理速度提升180倍。关键的是，QueryGaussian能够在包含数千万个高斯的城市级场景中，使用消费级硬件实现快速的实例检索。

英文摘要

Efficiently retrieving specific 3D instances from large-scale scenes via natural language prompts remains a formidable challenge in multimedia analysis. Existing approaches predominantly follow a "scene-level embedding" paradigm, which requires distilling high-dimensional semantic features into every 3D primitive. This strategy suffers from a fundamental architectural bottleneck: memory and computational costs scale linearly with scene complexity, inevitably triggering out-of-memory (OOM) failures in city-scale environments. To address this barrier, we propose QueryGaussian, a training-free framework for expeditious and scalable open-vocabulary 3D instance retrieval. Unlike holistic semantic distillation, QueryGaussian employs an instance-level query mechanism that decouples semantic understanding from geometric representation. Specifically, we leverage pre-trained 2D vision models to interpret user prompts and lift segmentation masks into 3D via a concurrent maximum-weight association strategy, ensuring semantic-visual consistency. To mitigate projection ambiguity, we introduce a temporal fusion module with multi-stage adaptive density clustering. Experimental results demonstrate that QueryGaussian not only matches the accuracy of state-of-the-art methods but also delivers a decisive efficiency leap, reducing GPU memory usage by over 70% and accelerating inference by 180x. Crucially, QueryGaussian enables expeditious instance retrieval on city-scale scenes containing tens of millions of Gaussians using consumer-grade hardware.

2606.19857 2026-06-19 cs.CL cs.AI Cross-list

Large Language Models Do Not Always Need Readable Language

大型语言模型并不总是需要可读语言

Jiayi Zhu, Haoxuan Peng, Junxi Wang, Liang Ke, Chen Zhang, Linfeng Zhang

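AI summary: Proposes BabelTele, a representation that encodes semantics as compact, non-standard text, sacrificing human readability while remaining recoverable by LLMs; experiments show text can be compressed to 27.9% of its length while retaining 99.5% semantic fidelity, reducing context overhead.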

Comments 23 pages, 10 figures. Preprint

AI中文摘要

大型语言模型（LLM）通常使用人类可读的自然语言进行提示和交互，即使目标读者是另一个模型。本文研究语义信息是否可以编码为紧凑、非标准的文本形式，这种形式牺牲了人类可读性，但能被LLM恢复。我们将这类以模型为中心的文本表示称为BabelTele，这里不是作为固定协议，而是作为探索LLM生成和解释此类表示能力的经验探针。通过可读性诊断、模型似然度量、人类问卷和下游任务评估，我们发现BabelTele可以显著偏离普通自然语言，同时为指令调优的LLM保留核心语义。作为一种任务无关的表示范式，BabelTele展示了高信息密度，即使文本体积压缩到原始长度的27.9%，也能保持99.5%的语义保真度。我们进一步评估了其在跨模型迁移、智能体记忆和多智能体通信中的语义鲁棒性。结果表明，BabelTele可以降低上下文开销，同时通常保持可靠的下游性能，但其有效性取决于压缩器-读取器对和任务设置。这些发现表明，人类可读性、自然语言典型性和模型端语义可恢复性可以部分解耦，为未来探索LLM系统中的模型原生表示开辟了道路。

英文摘要

Large language models (LLMs) are commonly prompted and interfaced with human-readable natural language, even when the intended reader is another model. This paper investigates whether semantic information can be encoded in compact, non-standard textual forms that sacrifice human readability while remaining recoverable by LLMs. We refer to this class of model-centric textual representations as BabelTele, approached here not as a fixed protocol but as an empirical probe into LLMs' capacity to generate and interpret such representations. Through readability diagnostics, model likelihood measures, human questionnaires, and downstream task evaluations, we find that BabelTele can substantially depart from ordinary natural language while preserving core semantics for instruction-tuned LLMs. As a task-agnostic representational paradigm, BabelTele demonstrates high information density, maintaining 99.5% semantic fidelity even when the text volume is condensed to 27.9% of its original length. We further evaluate its semantic robustness in cross-model transfer, agent memory, and multi-agent communication. Results suggest that BabelTele can reduce context overhead while generally maintaining reliable downstream performance, although its effectiveness depends on the compressor-reader pair and task setting. These findings indicate that human readability, natural-language typicality, and model-side semantic recoverability can be partially decoupled, opening a path toward model-native representations in future exploration of LLM systems.

2606.20077 2026-06-19 cs.CV cs.AI Cross-list

The Hidden Evolution of Disguised Visual Context inside the VLM

VLM内部伪装视觉上下文的隐藏演化

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito

Affiliations: Surrey Institute for People-Centred AI, University of Surrey（萨里大学以人为本人工智能研究所）; Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey（萨里大学视觉、语音与信号处理中心）

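AI summary: Studies how visual tokens in vision-language models are transformed into meaningful representations under different integration architectures (in-context versus layer-wise injection), revealing their internal evolution and its effect on performance.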

AI中文摘要

视觉令牌作为原始的外部信号进入大语言模型（LLM）。它们如何被转化为有意义的表示并与语言空间交互完全取决于集成架构——无论是将视觉令牌视为输入序列中的上下文提示，还是直接注入到LLM的中间层。对于这些架构选择如何影响视觉信息及其内部转换以与LLM集成，目前仍缺乏受控比较和理解。我们通过在相同训练条件下评估上下文注入和逐层注入的VLM集成范式，在单图像、多图像和视频基准上进行公平比较。在此过程中，我们揭示了一个隐藏的演化：视觉令牌作为伪装的视觉上下文（缺乏语言结构的原始表示）进入LLM，但根据集成范式逐渐被重塑，每种范式捕捉视觉信号的不同频率特征。我们表明，LLM内部的这种演化决定了VLM能够有效利用哪些视觉特征、视觉表示如何与语言空间对齐，以及最终每种范式在不同任务上的表现。我们进一步证明，仅关注注意力分配是不够的，性能由每一层视觉表示的质量驱动。

英文摘要

Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: whether visual tokens are treated as in-context prompts within the input sequence or injected directly into the LLM's intermediate layers. A controlled comparison and understanding of how these architectural choices affect visual information and its internal transformation to integrate with the LLM remain underexplored. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single-image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution where visual tokens enter the LLM as disguised visual context, raw representations lacking linguistic structure, but are progressively reshaped depending on the integration paradigm, each capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

2606.20101 2026-06-19 cs.SD cs.AI cs.MM Cross-list

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

基于整流流的混合扩散变压器用于指令引导音频编辑

Liting Gao, Yonggang Zhu, Yaru Chen, Dongyu Wang, Shubin Zhang, Zhenbo Li, Jean-Yves Guillemaut, Wenwu Wang

Affiliations: Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey（萨里大学视觉、语音与信号处理中心）; School of Artificial Intelligence, Beijing University of Posts and Telecommunications（北京邮电大学人工智能学院）; Fisheries College, Ocean University of China（中国海洋大学水产学院）; College of Information and Electrical Engineering, China Agricultural University（中国农业大学信息与电气工程学院）

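AI summary: Proposes a hybrid two-stage diffusion transformer architecture whose coarse-to-fine strategy balances global semantic alignment with local detail editing, improving performance and efficiency on tasks with overlapping audio events and complex instructions.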

AI中文摘要

音频编辑旨在根据自然语言指令修改现有音频剪辑中的特定内容，同时保留其余声学内容。尽管扩散模型取得了显著进展，但现有的基于训练的编辑方法主要依赖于卷积U-Net骨干中的局部归纳偏差和交叉注意力交互，这通常阻碍了长程语义对齐以及对指令的精确理解和定位。相比之下，扩散变压器提供了更强的全局建模和多模态融合，但现有的编辑架构通常采用MMDiT和DiT块的简单堆叠。在所有块中对拼接的音频和文本标记应用联合注意力会导致相对于标记长度的二次复杂度。为了平衡编辑性能和效率，我们提出了一种基于整流流匹配的混合两阶段扩散变压器架构，用于指令引导音频编辑。它在低分辨率阶段对音频和文本标记进行联合注意力以建立粗略的语义对齐，然后在高分辨率阶段切换到交替的联合注意力和交叉注意力块以细化编辑细节。这种从粗到细的策略实现了高效且准确的指令引导音频编辑。实验表明，所提出的框架在涉及重叠音频事件和复杂指令的具有挑战性的编辑任务上取得了显著的性能提升，同时通过紧凑模型大幅提高了编辑效率。

英文摘要

Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at the low-resolution stage, then switches to alternating joint-attention and cross-attention blocks to refine editing details at the high-resolution stage. This coarse-to-fine strategy enables efficient and accurate instruction-guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.
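
The efficiency argument above comes down to how many query-key pairs each block has to score. The sketch below counts only attention-score entries per head; it ignores projections, feed-forward layers, and any audio self-attention a cross-attention block may also contain, and the token counts are invented for illustration rather than taken from the paper.

```python
def joint_attention_pairs(n_audio, n_text):
    """Score-matrix entries when audio and text tokens attend jointly."""
    n = n_audio + n_text
    return n * n


def cross_attention_pairs(n_audio, n_text):
    """Score-matrix entries when audio queries attend to text keys only."""
    return n_audio * n_text


# Hypothetical sequence lengths for the two stages.
HIGH_RES_AUDIO, LOW_RES_AUDIO, TEXT = 1024, 256, 64

joint_high = joint_attention_pairs(HIGH_RES_AUDIO, TEXT)   # joint attention at full length
joint_low = joint_attention_pairs(LOW_RES_AUDIO, TEXT)     # joint attention after downsampling
cross_high = cross_attention_pairs(HIGH_RES_AUDIO, TEXT)   # text conditioning at full length
```

Under these assumed lengths, joint attention at low resolution costs about one eleventh of joint attention at high resolution, and a cross-attention block at high resolution is cheaper still, which matches the coarse-to-fine motivation given in the abstract.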

2606.20152 2026-06-19 cs.CL cs.AI Cross-list

From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

从文本到分数：追踪大型语言模型中作文质量表征的出现

Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo, Henghua Shen, Lidia S. Chao, Derek F. Wong

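AI summary: Uses linear probing and related methods to analyze the hidden representations of eight LLMs on three datasets; finds that essay quality information is linearly decodable and identifies score-related neurons, shedding light on the internal mechanisms of LLM scoring.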

Comments This is a preprint of a manuscript currently under peer review

AI中文摘要

近年来，大型语言模型（LLMs）的进展极大地改变了自动作文评分（AES），但基于LLM的评分内部机制仍知之甚少。在本工作中，我们系统分析了八个LLMs在两个英文作文数据集（ASAP++、CSEE）和一个葡萄牙语数据集（ENEM）上的隐藏表征。通过线性探测、跨提示泛化、降维和神经元级分析，我们发现一致证据表明作文质量信息以线性可访问的形式编码在LLM表征中。这些表征在层间逐步出现，在不同提示策略下保持稳健，并且尽管评分标准不同，仍能在作文提示间部分迁移。此外，非线性探测相对于线性探测仅提供边际且不一致的改进，表明大多数作文质量信息已经是线性可解码的。我们进一步识别出单个“作文评分神经元”，其激活与作文分数强相关，且其行为对目标干预敏感。此外，这些神经元的逐层分布随作文长度系统性地变化，较长的作文更依赖深层。总体而言，我们的发现提供了LLM编码与作文质量相关的结构化表征的证据，并为基于LLM的AES系统的可解释性提供了新见解。

英文摘要

Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood. In this work, we systematically analyze the hidden representations of eight LLMs across two English essay datasets (ASAP++, CSEE) and one Portuguese dataset (ENEM). Using linear probing, cross-prompt generalization, dimensionality reduction, and neuron-level analyses, we find consistent evidence that essay quality information is encoded in a linearly accessible form within LLM representations. These representations emerge progressively across layers, remain robust across prompting strategies, and partially transfer across essay prompts despite differences in scoring rubrics. In addition, nonlinear probes provide only marginal and inconsistent improvements over linear probes, suggesting that most essay quality information is already linearly decodable. We further identify individual "essay scoring neurons" whose activations strongly correlate with essay scores and whose behavior is sensitive to targeted intervention. Moreover, the layer-wise distribution of these neurons systematically shifts with essay length, with longer essays relying more heavily on deeper layers. Overall, our findings provide evidence that LLMs encode structured representations related to essay quality and offer new insights into the interpretability of LLM-based AES systems.
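
A linear probe in this sense is just a linear regressor fitted on frozen hidden states to predict the essay score; if it fits well, the score is linearly decodable from that layer. The sketch below is a dependency-free toy version using full-batch gradient descent on mean squared error. The function names, learning rate, and step count are assumptions for the example, and a real probe over LLM hidden states would normally use a linear algebra library and held-out data.

```python
def fit_linear_probe(hidden_states, scores, lr=0.1, steps=2000):
    """Fit score ~ w . h + b by full-batch gradient descent on MSE."""
    dim = len(hidden_states[0])
    n = len(hidden_states)
    w = [0.0] * dim
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for h, y in zip(hidden_states, scores):
            err = sum(wi * hi for wi, hi in zip(w, h)) + b - y
            for i in range(dim):
                grad_w[i] += 2 * err * h[i] / n
            grad_b += 2 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def probe_predict(w, b, h):
    """Apply a fitted probe to one hidden-state vector."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b
```

On synthetic data where the score is an exact linear function of the features, the probe recovers the generating weights; on real hidden states, the quality of the fit at each layer is what indicates where score information becomes linearly accessible.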

2606.20244 2026-06-19 cs.CV cs.AI Cross-list

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

SPOT-E：基于视觉聚光灯的冻结VLM测试时熵整形

Bo Yin, Xiaobin Hu, Chengming Xu, Ruolin Shen, Mo Yang, Jiangning Zhang, Peng-Tao Jiang, Cheng Tan, Shuicheng YAN

Affiliations: National University of Singapore（新加坡国立大学）; Fudan University（复旦大学）; Technical University of Munich（慕尼黑工业大学）; Sagenic Tech; Zhejiang University（浙江大学）; vivo; Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）

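AI summary: Proposes SPOT-E, which uses test-time entropy shaping and visual spotlights to address VLMs' poor performance on evidence-intensive tasks caused by overlooking key local evidence, improving grounding and robustness without retraining.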

AI中文摘要

视觉语言模型（VLM）在证据密集型任务中通常表现不佳，因为决定性视觉证据往往微小、局部且容易被忽略，导致即使高层推理完好，证据读取也会失败。先前的推理时视觉干预可以在不重新训练的情况下改善定位，但大多是开环的，缺乏验证高亮证据是否实际使用的机制。我们研究答案跨度预测熵作为模型内部反馈信号，并表明朴素熵最小化具有歧义性，因为低熵可能源于证据支持的置信度或捷径坍塌。为解决这一歧义，我们引入低熵锚点和熵整形目标，在减少答案不确定性的同时保留基线高置信度标记。我们将这一原理实例化为SPOT-E，一种即插即用的测试时方法，生成问题条件聚光灯，并通过基于组相对策略优化（GRPO）的轻量级调优对每个实例进行优化。在所有基准测试和不同VLM家族中，SPOT-E在视觉损坏下均取得一致增益和改进的鲁棒性。代码公开于：https://github.com/YinBo0927/SPOT-E

英文摘要

Vision-language models (VLMs) often underperform on evidence-intensive tasks because decisive visual evidence is small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via lightweight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: https://github.com/YinBo0927/SPOT-E
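
The feedback signal discussed above, answer-span prediction entropy, can be illustrated with a small helper that scores how uncertain a model is over the tokens of its answer. The abstract does not say how per-token entropies are aggregated, so averaging over the span is an assumption here, as are the function names and the toy distributions; this is not SPOT-E's objective, which additionally uses low-entropy anchors.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def answer_span_entropy(span_distributions):
    """Mean per-token entropy over the answer span (assumed aggregation)."""
    return sum(token_entropy(d) for d in span_distributions) / len(span_distributions)


# Toy next-token distributions over a 3-word vocabulary, two answer tokens each.
confident = [[0.98, 0.01, 0.01], [0.90, 0.05, 0.05]]
uncertain = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
```

A low value alone does not say whether the confidence is grounded in the image, which is the ambiguity the abstract points out: a shortcut answer can be just as low-entropy as a well-supported one.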

2606.20255 2026-06-19 cs.CL cs.AI Cross-list

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse

语域差距：尼日利亚公共话语的意义智能框架

Celestine Achi

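AI summary: Proposes the nine-dimension Meaning Intelligence Framework (MIF), which separates surface sentiment from true communicative intent along dimensions such as register and true intent; on a Nigerian public discourse dataset, register classification accuracy rises by 40 percentage points and the composite Meaning Intelligence Score by 5.4 points.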

Comments Preprint. 12 pages, 2 tables. Supplementary materials: MIF Master Specification v2.0, Annotation Guidelines v1.0, and 30-item public calibration set with gold labels available from the author

AI中文摘要

我们提出了意义智能框架（MIF），这是一个用于尼日利亚公共话语的九维标注和评估方案，将表面情感与真实交际意图区分开来。现有的尼日利亚语言基准（包括NaijaSenti和AfriSenti）将情感分类视为三向极性任务（正面、负面、中性）。我们认为，AI系统在尼日利亚话语上的主要失败模式不是翻译失败，而是语境失败：同一话语根据说话者、听众和情境可能具有相反的语用效力。MIF通过九个评分维度将这一见解操作化：语域、表面情感、真实意图、反讽、编码潜台词、风险等级、标注者置信度、说话者情绪和推荐沟通行动。我们构建了一个包含30个项目的校准数据集，涵盖标准英语、尼日利亚英语、尼日利亚皮钦语和混合语域，并在零样本和模式引导提示条件下评估了一个前沿语言模型（Gemini 2.5 Flash）。主要发现是语域差距：零样本语域分类准确率为33.3%，当模型在上下文中接收到MIF模式时，准确率上升至73.3%（+40个百分点）。在模式引导提示下，复合意义智能评分增加了5.4分（从73.2到78.6），最大的实际收益体现在语域识别、编码潜台词检测（+10分）和战略行动推荐（+10.3分）上。我们发布了框架规范、标注指南和包含30个项目的公开校准集以支持可重复性，同时保留了一个私有留存语料库用于防污染评估。

英文摘要

We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for Nigerian languages, including NaijaSenti and AfriSenti, treat sentiment classification as a three-way polarity task (positive, negative, neutral). We argue that the dominant failure mode of AI systems on Nigerian discourse is not translation failure but context failure: the same utterance carries opposite pragmatic force depending on speaker, audience, and situation. The MIF operationalises this insight across nine scored dimensions: register, surface sentiment, true intent, irony, coded subtext, risk tier, annotator confidence, speaker emotion, and recommended communications action. We construct a 30-item calibration dataset spanning Standard English, Nigerian English, Nigerian Pidgin, and code-mixed registers, and evaluate a frontier language model (Gemini 2.5 Flash) under zero-shot and schema-informed prompting conditions. The headline finding is the Register Gap: zero-shot register classification accuracy is 33.3%, rising to 73.3% (+40 points) when the model receives the MIF schema in-context. The composite Meaning Intelligence Score increases by 5.4 points (73.2 to 78.6) under schema-informed prompting, with the largest practical gains in register identification, coded-subtext detection (+10 points), and strategic action recommendation (+10.3 points). We release the framework specification, annotation guidelines, and the 30-item public calibration set to support reproducibility, while retaining a private holdout corpus for contamination-protected evaluation.
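To make the headline numbers concrete, here is a small worked calculation. It assumes register accuracy is scored over all 30 calibration items, in which case 33.3% and 73.3% correspond to 10 and 22 correct items; those counts are inferred from the percentages and are not stated in the abstract.

```python
ITEMS = 30  # size of the public calibration set, from the abstract

def accuracy_pct(correct, total=ITEMS):
    """Accuracy as a percentage."""
    return 100.0 * correct / total

zero_shot = accuracy_pct(10)   # assumed count: 10 of 30 correct
schema = accuracy_pct(22)      # assumed count: 22 of 30 correct

point_gain = schema - zero_shot                  # gain in percentage points
relative_gain = 100.0 * point_gain / zero_shot   # gain relative to zero-shot

print(round(zero_shot, 1), round(schema, 1))
print(round(point_gain, 1), round(relative_gain, 1))
```

The distinction matters when reading the result: the reported "+40 points" is an absolute difference in percentage points, which on these assumed counts is a 120% relative increase. With only 30 items, each item also moves accuracy by about 3.3 points.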

2606.20506 2026-06-19 cs.CV cs.AI 交叉投稿

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

FreeStyle: 从社区LoRA挖掘中实现风格-内容双参考生成的自由控制

Jinghong Lan, Wei Cheng, Yunuo Chen, Ziqi Ye, Peng Xing, Yixiao Fang, Rui Wang, Yufeng Yang, Xuanyang Zhang, Xianfang Zeng, Difan Zou, Gang Yu, Chi Zhang

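AI summary: Proposes the FreeStyle framework, which uses community LoRAs as anchors and a two-stage curriculum (an attention-level constraint and frequency-aware RoPE modulation) to address content leakage in dual-reference generation, and introduces a new benchmark and evaluation metrics, balancing style alignment, content preservation, and leakage suppression.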

Comments 35 pages, 26 figures. Project page: https://github.com/Blue2Giant/FreeStyle

AI中文摘要

风格-内容双参考生成旨在合成一张图像，该图像保留内容参考的结构和语义，同时采用单独风格参考的风格。尽管近期有所进展，但这一设置仍然具有挑战性，因为模型必须平衡内容保真度、风格对齐和指令遵循，同时避免风格参考的语义泄露。一个关键瓶颈是缺乏大规模的三元组数据，这些数据具有清晰的内容-风格分离和广泛的长尾风格。在这项工作中，我们提出了FreeStyle，一个基于社区LoRA的可扩展双参考生成框架。我们将社区LoRA视为风格和内容的组合锚点，并设计了一个严格的生成和过滤流水线，以在多个基础模型上构建大规模的风格参考和内容参考三元组。为了解决内容泄露，我们采用了两阶段课程学习，并设计了特定阶段的解耦机制：在风格迁移阶段，采用注意力级增强约束来抑制风格参考泄露；在更困难的双参考阶段，采用频率感知的RoPE调制策略来针对基于位置对应的泄露。我们还引入了一个基准，涵盖风格参考和双参考生成，并在风格相似性、内容保持、美学质量、指令遵循和泄露拒绝方面进行评估。该基准包含一个风格不变的内容对齐分数（CAS），并引入了一个基于校准的VLM的拒绝分数，用于评估生成可靠性和泄露。大量实验表明，我们的模型在风格对齐、内容保持和泄露抑制之间实现了强平衡。

英文摘要

Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style of a separate style reference. Despite recent progress, this setting remains challenging because models must balance content fidelity, style alignment, and instruction following while avoiding semantic leakage from the style reference. A key bottleneck is the lack of large-scale triplet data with clean content-style separation and broad long-tail style coverage. In this work, we propose FreeStyle, a scalable dual-reference generation framework based on community LoRA mining. We treat community LoRAs as compositional anchors for style and content, and design a rigorous generation and filtering pipeline to construct large-scale Style-Reference and Content-Reference triplets across multiple base models. To address content leakage, we adopt a two-stage curriculum with stage-specific disentanglement mechanisms: an attention-level enrichment constraint that suppresses style-reference leakage in the style-transfer stage, and a frequency-aware RoPE modulation strategy that targets positional-correspondence-based leakage in the harder dual-reference stage. We also introduce a benchmark covering both style-reference and dual-reference generation, with evaluations on style similarity, content preservation, aesthetics, instruction following, and leakage rejection. The benchmark incorporates a style-invariant Content Alignment Score (CAS) and introduces a calibrated VLM-based Rejection Score for evaluating generation reliability and leakage suppression. Extensive experiments show that our model achieves a strong balance among style alignment, content preservation, and leakage suppression.

2606.20554 2026-06-19 cs.IR cs.AI 交叉投稿

Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation

结构化与分词化分布式用户兴趣上下文以支持生成式推荐

Ruizhong Qiu, Yinglong Xia, Dongqi Fu, Hanqing Zeng, Ren Chen, Xiangjun Fan, Hong Li, Hong Yan, Hanghang Tong

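AI summary: Proposes the G2Rec framework, which unifies graph modeling with semantic tokenization to model user interest context comprehensively and accurately for industrial-scale generative recommendation.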

AI中文摘要

生成式推荐是一种新兴范式，在工业推荐系统中展现出前景，旨在从用户历史行为中预测其下一次交互。生成式推荐的核心是物品分词，它连接了物品语义与推荐模型。然而，现有方法往往难以同时有效地组织和注入复杂的用户行为与物品语义上下文。一方面，现有的基于图的集成方法，如图序列化和图神经网络，要么存在可扩展性问题，要么仅利用局部图信息。另一方面，现有的语义分词方法通常依赖启发式规则且缺乏明确的监督信号，可能导致不准确或次优的语义表示。为解决用户兴趣上下文建模中的这些局限性，我们提出G2Rec，一个可扩展的框架，将基于图的整体用户共同参与建模与语义分词统一起来，用于工业级生成式推荐。总体而言，G2Rec使推荐模型能够捕捉整体且基于语义的用户兴趣原型，而无需真实用户兴趣，从而在工业序列推荐中提供更全面、更准确的用户行为上下文建模。跨产品表面的在线部署和在公开数据集上的大量实验证明了G2Rec相对于现有方法的优越性。

英文摘要

Generative recommendation is an emerging paradigm that has shown promise in industrial recommendation systems, aiming to predict users' next interactions from their historical behaviors. At the core of generative recommendation lies item tokenization, which bridges item semantics and recommendation models. However, existing methods often struggle to effectively organize and inject complex user-behavioral and item-semantic contexts into recommendation models simultaneously. On the one hand, existing graph-based integration methods, such as graph serialization and graph neural networks, either suffer from scalability issues or exploit only local graph information. On the other hand, existing semantic tokenization methods typically rely on heuristics and lack explicit supervision signals, which may lead to inaccurate or suboptimal semantic representations. To address these limitations in user interest context modeling, we propose G2Rec, a scalable framework that unifies holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation. Overall, G2Rec enables recommendation models to capture holistic and semantically grounded user interest prototypes without requiring ground-truth user interests, thereby providing more comprehensive and accurate modeling of user behavior contexts in industrial sequential recommendation. Online deployment across product surfaces and extensive experiments on public datasets demonstrate the superiority of G2Rec over existing methods.

2606.11537 2026-06-19 cs.AI cs.CE 版本更新

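ZeSTA: Zero-Shot Text-to-Speech Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

ZeSTA: 基于领域条件训练的零样本文本转语音增强用于数据高效的个性化语音合成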

Youngwon Choi, Jinwoo Oh, Hwayeon Kim, Hyeonyu Kim

发表机构 * Maum AI Inc.（Maum AI公司）； Humelo Inc.（Humelo公司）

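AI summary: Proposes the ZeSTA framework, which uses lightweight domain embeddings to distinguish real from synthetic speech, combined with oversampling of real data, to improve speaker similarity in zero-shot text-to-speech augmentation under extremely low-resource conditions while maintaining intelligibility and perceptual quality.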

Comments 6 pages, accepted to INTERSPEECH 2026

2604.04917 2026-06-19 cs.CV cs.AI cs.CL 版本更新

Vero: An Open RL Recipe for General Visual Reasoning

Vero: 通用视觉推理的开放RL配方

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

发表机构 * Princeton University（普林斯顿大学）

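AI summary: Proposes the Vero family of open vision-language models; by building the 600K-sample dataset Vero-600K and task-routed rewards, it gains 2.9-5.4 points on average across 30 benchmarks, and Vero-Qwen3I-8B surpasses Qwen3-VL-8B-Thinking by 3.8 points.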

Comments Project page: https://vero-reasoning.github.io/

AI中文摘要

构建一个能在图表、科学、空间理解和开放式任务中工作的视觉推理器需要什么？最强的视觉语言模型（VLM）表明广泛的视觉推理是可以实现的，但其封闭的数据和强化学习（RL）流程使得其成果难以研究、复现或扩展。我们引入了Vero，一个完全开放的VLM系列，在各种视觉推理任务中匹配或超越现有的开放权重模型。我们跨六个广泛的任务类别扩展RL数据和奖励，构建了Vero-600K，一个来自59个数据集的600K样本数据集，并设计了处理异构答案的任务路由奖励。在我们的30个基准测试套件VeroEval中，Vero-600K在受控比较下优于现有的RL数据集。应用于五个起始模型，Vero变体在其初始模型上平均获得2.9-5.4分的提升。值得注意的是，基于Instruct模型训练的Vero-Qwen3I-8B，在没有额外蒸馏的情况下，平均超过Qwen3-VL-8B-Thinking 3.8分。系统的消融实验揭示，不同的任务类别引发不同的推理模式，而广泛的收益依赖于联合学习它们，而非孤立学习。所有数据、代码和模型均已公开。

英文摘要

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) suggest that broad visual reasoning is within reach, yet their closed data and reinforcement learning (RL) pipelines make their gains difficult to study, reproduce, or extend. We introduce Vero, a family of fully open VLMs that match or exceed existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answers. Across VeroEval, our 30-benchmark suite, Vero-600K outperforms existing RL datasets under controlled comparisons. Applied to five starting models, Vero variants gain 2.9-5.4 points on average over their initial models. Notably, Vero-Qwen3I-8B, trained on the Instruct model, surpasses Qwen3-VL-8B-Thinking by 3.8 points on average without additional distillation. Systematic ablations reveal that different task categories elicit distinct reasoning patterns and that broad gains depend on learning them jointly rather than in isolation. All data, code, and models are publicly available.
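The idea of task-routed rewards for heterogeneous answers can be sketched as a dispatch table that picks a scoring function by answer type. The three task types, the scoring rules, and the tolerance below are hypothetical illustrations; the abstract does not describe Vero's actual reward functions or its six task categories in detail.

```python
def exact_match_reward(pred, gold):
    """1.0 when the normalized strings match, else 0.0."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def numeric_reward(pred, gold, rel_tol=0.01):
    """1.0 when the prediction is within a relative tolerance of the gold value."""
    try:
        p, g = float(pred), float(gold)
    except (TypeError, ValueError):
        return 0.0
    return 1.0 if abs(p - g) <= rel_tol * max(1.0, abs(g)) else 0.0

def box_iou_reward(pred, gold):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gold[0]), max(pred[1], gold[1])
    ix2, iy2 = min(pred[2], gold[2]), min(pred[3], gold[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gold = (gold[2] - gold[0]) * (gold[3] - gold[1])
    union = area_pred + area_gold - inter
    return inter / union if union > 0 else 0.0

# Hypothetical task types mapped to their reward functions.
REWARDS = {
    "multiple_choice": exact_match_reward,
    "numeric": numeric_reward,
    "grounding": box_iou_reward,
}

def route_reward(task_type, pred, gold):
    """Score a prediction with the reward function registered for its task type."""
    if task_type not in REWARDS:
        raise ValueError(f"no reward registered for task type {task_type!r}")
    return REWARDS[task_type](pred, gold)

print(route_reward("multiple_choice", " b", "B"))
print(route_reward("numeric", "3.14", "3.14159"))
print(route_reward("grounding", (0, 0, 2, 2), (1, 1, 3, 3)))
```

Routing keeps each reward simple and verifiable, which is one plausible way to mix free-form, numeric, and spatial answers in a single RL dataset.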

2605.31393 2026-06-19 cs.CL cs.AI 版本更新

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

面向手语翻译的大语言模型目标端释义增强

Pedro Dal Bianco, Jean Paul Nunes Reinhold, Oscar Stanchi, Facundo Quiroga, Franco Ronchetti, Ulisses Brisolara Corrêa

发表机构 * III-LIDI Universidad Nacional de La Plata（III-LIDI国立拉普拉塔大学）； CDTEC, Federal University of Pelotas（CDTEC，联邦 Pelotas 大学）； CONICET III-LIDI ； Comision de Investigaciones Cientificas Universidad Nacional de La Plata（科学委员会国立拉普拉塔大学）； Universidade Federal de Pelotas（联邦 Pelotas 大学）

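AI summary: To address scarce parallel corpora and the long-tailed target vocabulary in sign language translation, proposes target-side augmentation using controlled paraphrase variants of reference sentences generated by GPT-4o, and evaluates the method on three sign language datasets.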

Comments Accepted at GenSign @ CVPR 2026. Non-Proceedings Track (https://genai4sl.github.io/)

详情

AI中文摘要

手语翻译（SLT）仍然受到有限的配对手语视频/文本语料库和长尾目标词汇的限制。我们研究了目标端增强方法，其中GPT-4o生成参考句子的受控释义变体，而手语输入保持不变。采用基于Signformer姿态的Transformer，在两阶段调度下进行训练：先在增强语料库上预训练，然后在原始参考句子上微调。我们在三个具有互补挑战的数据集上进行了评估：PHOENIX14T（德国手语），具有适度的词汇多样性；GSL（希腊手语），具有高度受控、重复的录制；以及LSA-T（阿根廷手语），具有严重的长尾稀疏性。在PHOENIX14T上，增强将BLEU-4从9.56提高到10.33。接近饱和的GSL基线和极其稀疏的LSA-T设置揭示了该方法的局限性。据我们所知，这是第一项将LLM生成的目标端释义和LLM作为评估者应用于手语翻译的研究。语义评估揭示了词汇重叠指标低估的忠实度提升。

英文摘要

Sign language translation (SLT) remains constrained by the limited availability of paired sign-video/text corpora and by the heavy-tailed vocabularies typical of real-world datasets. We study a target-side augmentation strategy in which a large language model (LLM) generates controlled paraphrase variants of the reference spoken-language sentence while the sign input remains unchanged. Concretely, we use GPT-4o to produce semantically faithful variants of the training targets and train a Signformer-style pose-based Transformer under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references. We evaluate this strategy on three datasets that span complementary challenges: PHOENIX14T (German Sign Language), a real-world corpus with moderate lexical diversity; the Greek Sign Language Dataset with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), a naturalistic corpus with a large vocabulary and severe long-tail sparsity. This range allows us to characterize precisely when and why target-side augmentation is beneficial. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33, demonstrating that paraphrastic exposure helps the decoder generalize beyond memorized reference phrasing. The near-saturated GSL baseline and the extremely sparse LSA-T setting reveal the limits of the approach: in both cases, single-reference lexical overlap metrics are insufficient to capture the full picture, motivating a complementary semantic evaluation. To our knowledge, this is the first study to examine LLM-generated target-side paraphrases as an augmentation mechanism for SLT, and the first to apply an LLM-as-a-Judge evaluation protocol to SLT. This complementary evaluation reveals gains in semantic fidelity that lexical overlap metrics understate.

2606.05833 2026-06-19 cs.CV cs.AI 版本更新

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

从视频中学习几何表示以实现空间智能多模态大语言模型

Haibo Wang, Lifu Huang

发表机构 * University of California, Davis（加州大学戴维斯分校）

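AI summary: Proposes the GeoVR framework, which distills 3D geometric knowledge (camera poses, depth maps, a scale factor, and multi-scale 3D features) using only 2D video sequences to reshape the internal representations of multimodal large language models and endow them with spatial intelligence, reaching state-of-the-art performance on spatial reasoning benchmarks.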

AI中文摘要

多模态大语言模型（MLLMs）在2D语义理解方面表现出色，但缺乏内在的3D感知能力，导致其表示无法在视频帧间保持几何和空间一致性。鉴于大规模3D数据的稀缺性，我们提出了GeoVR，一种新颖的框架，仅使用2D视频序列学习几何表示。该方法有效地重构了MLLMs内部的语义潜在空间，以解锁空间智能。GeoVR并非采用浅层的特征混合，而是通过从预训练的3D基础模型中蒸馏几何知识来重塑MLLM的内部表示。这是通过一种多目标学习策略实现的，该策略由四个互补的几何目标驱动：（1）估计帧间相机姿态以嵌入变化的视角动态，（2）回归密集深度图以锚定物理距离，（3）预测度量尺度因子以进行真实世界校准，以及（4）蒸馏多尺度3D特征以对齐中间特征空间。在这些显式的物理和几何约束的引导下，模型的内部表示自然地发展出强大的3D感知能力。在空间推理基准上的大量实验表明，GeoVR实现了最先进的性能，为赋予基础模型空间智能建立了一种新范式。

英文摘要

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

2606.19935 2026-06-19 cs.AI 新提交

PhysDrift: Bridging the Embodiment Gap in Humanoid Co-Speech Motion Generation

PhysDrift: 弥合人形机器人共语动作生成中的具身差距

Zhangzhao Liang, Xiaofen Xing, Mingyue Yang, Wenlve Zhou, Xiangmin Xu

发表机构 * South China University of Technology（华南理工大学）； DexForce Technology（DexForce科技公司）； Foshan University（佛山大学）

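AI summary: To address the mismatch between human motion manifolds and robot embodiment constraints in humanoid co-speech motion generation, proposes the IK-EER framework and the PhysDrift model, which directly predicts executable joint trajectories, improving motion alignment, physical plausibility, and real-time interaction capability.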

AI中文摘要

人形机器人需要共语动作，这些动作不仅要富有表现力且与语音对齐，还要在具身约束下物理可执行。现有的共语动作生成流程主要是以人为中心的：首先以人体表示（如SMPL-X）生成动作，随后重定向到人形机器人。在这项工作中，我们识别出这种范式中的基本具身差距，即人体运动流形与人形机器人具身约束之间的不匹配在运动转移和物理执行过程中破坏了具身一致性。通过广泛分析，我们表明尽管重定向可以保留粗粒度的运动语义，但它显著压缩了运动多样性并削弱了韵律-动作同步，限制了富有表现力的人形机器人行为。为解决此问题，我们首先提出IK-EER，一种保留韵律的人形机器人运动策展框架，在重定向过程中联合优化运动学可行性和语音-运动时间对齐。基于策展的机器人原生运动数据集，我们进一步引入PhysDrift，一种具身感知的共语动作生成框架，直接预测可执行的人形机器人关节轨迹，无需依赖中间人体表示。与传统的以人为中心的流程不同，PhysDrift在训练和推理过程中都保持具身一致性，同时加入物理正则化以稳定机器人运动动态。大量实验和真实世界人形机器人部署表明，具身感知的机器人原生生成显著改善了语音-运动对齐、物理合理性、运动平滑性、推理效率和实时交互能力。

英文摘要

Humanoid robots require co-speech motions that are not only expressive and speech-aligned, but also physically executable under embodiment constraints. Existing co-speech generation pipelines are predominantly human-centric: motions are first generated in human-body representations such as SMPL-X and subsequently retargeted to humanoid robots. In this work, we identify a fundamental embodiment gap in this paradigm, where the mismatch between human motion manifolds and humanoid embodiment constraints disrupts embodiment consistency during motion transfer and physical execution. Through extensive analysis, we show that although retargeting can preserve coarse motion semantics, it significantly compresses motion diversity and weakens prosody-motion synchronization, limiting expressive humanoid behaviors. To address this problem, we first propose IK-EER, a prosody-preserving humanoid motion curation framework that jointly optimizes kinematic feasibility and speech-motion temporal alignment during retargeting. Building upon the curated robot-native motion dataset, we further introduce PhysDrift, an embodiment-aware co-speech motion generation framework that directly predicts executable humanoid joint trajectories from speech without relying on intermediate human-body representations. Unlike conventional human-centric pipelines, PhysDrift maintains embodiment consistency throughout both training and inference while incorporating physical regularization to stabilize robot motion dynamics. Extensive experiments and real-world humanoid deployment demonstrate that embodiment-aware robot-native generation substantially improves speech-motion alignment, physical plausibility, motion smoothness, inference efficiency, and real-time interaction capability.

2606.19948 2026-06-19 cs.AI 新提交

Advancing DialNav through Automatic Embodied Dialog Augmentation

通过自动具身对话增强推进DialNav

Leekyeung Han, Sangwon Jung, Hyunji Min, Jinseong Jeong, Minyoung Kim, Paul Hongsuck Seo

发表机构 * Korea University（高丽大学）； Trillion Labs

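AI summary: Proposes an automatic generation pipeline to build the large-scale RAINbow dataset (238K episodes); combined with Dual-Strategy Training and a localization model, it substantially improves success rate on the DialNav task (Val Seen +89%, Val Unseen +100%).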

Comments 29 pages, 9 figures

AI中文摘要

对于能够进行物理交互的具身智能体，创建和理解对话的能力对于确保安全性和有效性至关重要。虽然DialNav [han2025dialnav] 为真实感室内导航中的对话-执行循环提供了整体评估框架，但其性能仍受限于训练数据的严重稀缺（2K episodes）。为解决这一问题，我们提出了一种自动生成管道，并构建了RAINbow数据集，这是一个包含238K episodes的大规模训练数据集，用于DialNav。我们的管道将现有的VLN数据集转换为多轮对话，并创建了成本高效且高质量的数据集。然后，我们引入了两项额外的互补性进展以充分释放数据潜力：（1）双策略训练，一种导航训练方案，用于使导航训练与动态对话-导航循环对齐；（2）一个利用VLN知识的定位模型。通过结合这些互补性解决方案，我们的模型在Val Seen（58.24，+89%）和Val Unseen（29.05，+100%）两个分割上的成功率均大幅超越基线，建立了新的最优水平。

英文摘要

For embodied agents capable of physical interaction, the capability to create and understand dialog is crucial to ensure both safety and effectiveness. While DialNav [han2025dialnav] provides a framework for holistic evaluation of the dialog-execution loop in photorealistic indoor navigation, its performance remains limited by a critical scarcity of training data (2K episodes). To address this, we propose an automatic generation pipeline and construct the RAINbow dataset, a large-scale training dataset with 238K episodes for DialNav. Our pipeline converts existing VLN datasets into multi-turn dialog and creates a cost-efficient, high-quality dataset. Then, we introduce two additional complementary advances to unlock the data's full potential: (1) Dual-Strategy Training, a navigation training scheme that aligns navigation training with the dynamic dialog-navigation loop, and (2) a localization model that leverages VLN knowledge. By combining these complementary solutions, our model substantially outperforms the baseline in success rate on both the Val Seen (58.24, +89%) and Val Unseen (29.05, +100%) splits, establishing a new state of the art.
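The reported gains can be turned back into approximate baseline success rates. This short calculation assumes the "+89%" and "+100%" figures are relative improvements over the baseline (a reading the abstract suggests but does not state outright); the resulting baseline values are therefore derived estimates, not numbers from the paper.

```python
def implied_baseline(new_value, relative_gain_pct):
    """Baseline value implied by a new value and a relative gain in percent."""
    return new_value / (1.0 + relative_gain_pct / 100.0)

# Success rates and gains as quoted in the abstract.
val_seen_baseline = implied_baseline(58.24, 89)
val_unseen_baseline = implied_baseline(29.05, 100)

print(round(val_seen_baseline, 2))
print(round(val_unseen_baseline, 2))
```

Because the quoted gains are rounded to whole percentages, the implied baselines are only accurate to a few tenths of a point.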

2606.19980 2026-06-19 cs.AI 新提交

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

ENPIRE: 现实世界中智能体机器人策略的自我改进

Wenli Xiao, Jia Xie, Tonghe Zhang, Haotian Lin, Letian "Max" Fu, Haoru Xue, Jalen Lu, Yi Yang, Cunxi Dai, Zi Wang, Jimmy Wu, Guanzhi Wang, S. Shankar Sastry, Ken Goldberg, Linxi "Jim" Fan, Yuke Zhu, Guanya Shi

发表机构 * NVIDIA（英伟达）； CMU（卡内基梅隆大学）； UC Berkeley（加州大学伯克利分校）

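AI summary: Proposes the ENPIRE framework, in which closed-loop feedback of environment reset, policy execution, outcome verification, and iterative refinement lets coding agents autonomously improve robot manipulation policies, reaching a 99% success rate on dexterous manipulation tasks.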

AI中文摘要

在现实世界中实现灵巧的机器人操作严重依赖人工监督和算法工程，这成为追求通用物理智能的核心瓶颈。尽管新兴的编码智能体可以生成代码来自动化算法搜索，但其成功主要局限于数字环境。我们推测，自动化机器人研究缺失的抽象是一个可重复的反馈循环，用于现实世界策略改进：重置场景、执行策略、验证结果并优化下一次迭代。为弥补这一差距，我们引入ENPIRE，一个用于编码智能体的框架，通过四个核心模块实例化这一物理反馈例程：环境模块（EN）用于自动重置和验证，策略改进模块（PI）启动策略优化，推出模块（R）用于评估一个或多个并行运行的物理机器人的策略，以及进化模块（E），其中编码智能体分析日志、查阅文献、改进训练基础设施和算法代码以解决失败模式。这一闭环系统将现实世界操作学习转化为可控的优化过程，在最小化人工努力的同时，允许对训练方案和智能体变体进行公平消融。在ENPIRE的支持下，前沿编码智能体可以自主训练策略，在具有挑战性的灵巧操作任务（如整理针盒、紧固扎带和工具使用）上达到99%的成功率，并且当我们派遣智能体团队在机器人集群上工作时，这一过程会进一步加速。我们的结果展示了将编码智能体部署到物理世界中自主推进机器人技术的实用且可扩展的路径。

英文摘要

Achieving dexterous robotic manipulation in the real world relies heavily on human supervision and algorithm engineering, which has become a central bottleneck in the pursuit of general physical intelligence. Although emerging coding agents can generate code to automate algorithm search, their successes remain largely confined to digital environments. We conjecture that the missing abstraction for automating robotics research is a repeatable feedback loop for real-world policy improvement: reset the scene, execute a policy, verify the outcome, and refine the next iteration. To bridge this gap, we introduce ENPIRE, a harness framework for coding agents that instantiates this physical feedback routine with four core modules: an Environment module (EN) for automatic reset and verification, a Policy Improvement module (PI) that launches policy refinement, a Rollout module (R) to evaluate policies with one or multiple physical robots operating in parallel, and an Evolution module (E) in which coding agents analyze logs, consult literature, and improve training infrastructure and algorithm code to address failure modes. This closed-loop system transforms real-world manipulation learning into a controllable optimization procedure, minimizing human effort while allowing fair ablations across training recipes and agent variants. Powered by ENPIRE, frontier coding agents can autonomously train a policy to achieve a 99% success rate on challenging dexterous manipulation tasks, such as organizing a pin box, fastening a zip tie, and tool use, a process that further accelerates when we dispatch an agent team on a robot fleet. Our results suggest a practical and scalable path toward deploying coding agents to autonomously advance robotics in the physical world.
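The reset, execute, verify, refine routine can be illustrated with a toy simulation. Everything below is a stand-in: the "policy" is a single number, the "environment" is a noisy target check, and the refinement rule is a simple correction toward the target. None of it reflects ENPIRE's actual modules or APIs; it only shows the shape of the loop.

```python
import random

class ToyEnvironment:
    """Stand-in for an environment module: resets the scene and verifies outcomes."""

    def __init__(self, target=0.8, tolerance=0.05, seed=0):
        self.target = target
        self.tolerance = tolerance
        self.rng = random.Random(seed)

    def reset(self):
        # Each reset yields a small random scene perturbation.
        return self.rng.uniform(-0.01, 0.01)

    def verify(self, outcome):
        return abs(outcome - self.target) <= self.tolerance

def rollout(env, policy_param, episodes=20):
    """Execute the policy for several episodes and record (outcome, success)."""
    results = []
    for _ in range(episodes):
        perturbation = env.reset()
        outcome = policy_param + perturbation
        results.append((outcome, env.verify(outcome)))
    return results

def refine(policy_param, results, target, step=0.5):
    """Toy refinement: move the parameter toward the target by a fraction of the error."""
    mean_outcome = sum(outcome for outcome, _ in results) / len(results)
    return policy_param + step * (target - mean_outcome)

def improvement_loop(env, policy_param=0.0, iterations=10):
    """Repeat rollout and refinement, tracking the success rate per iteration."""
    history = []
    for _ in range(iterations):
        results = rollout(env, policy_param)
        history.append(sum(ok for _, ok in results) / len(results))
        policy_param = refine(policy_param, results, env.target)
    return policy_param, history

final_param, history = improvement_loop(ToyEnvironment())
print(history)
```

In the real setting, the refinement step is where a coding agent would analyze logs and change training code, which is far richer than the one-line update used here.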

2606.19990 2026-06-19 cs.AI 新提交

Reward as An Agent for Embodied World Models

奖励作为具身世界模型的智能体

Pu Li, Zhigang Lin, Qiang Wu, Yongxuan Lv, Fei Wang, Shan You

发表机构 * ACE Robotics（ACE机器人）

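AI summary: Proposes the Reward as an Agent framework and a dynamic-aware rollout diversification method, which support broader exploration through robust verification, mitigate reward hacking, and improve world model performance.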

AI中文摘要

虽然强化学习已成为改进世界模型的有前景工具，现有方法大多依赖于训练分布附近的保守 rollout，限制了探索、行为多样性和更丰富的动态发现。在这项工作中，我们挑战这种保守范式。我们认为核心限制不是探索本身，而是缺乏支持更广泛探索的可靠验证策略。没有可靠的验证，扩展的探索极易受到奖励黑客攻击，即策略利用不完美的奖励而未能实现真正的改进。为了评估这一动机，我们在具身世界模型中实例化我们的方法，其中物理合理性和任务完成性为复杂动态下的可扩展强化学习提供了严格的测试平台。在验证方面，我们引入奖励作为智能体，一种主动评估生成行为以提供鲁棒奖励信号并减轻分布偏移下奖励黑客攻击的智能体奖励框架。在探索方面，我们通过 DynDiff-GRPO 引入动态感知 rollout 多样化，显式扩展动作空间探索以多样化轨迹、拓宽状态-动作覆盖范围，并鼓励超越保守 rollout 机制的更丰富具身行为。通过将奖励作为智能体与 DynDiff-GRPO 统一，我们在更可靠的奖励基础上实现强化学习，并大幅多样化采样，有效缓解奖励黑客攻击，同时在多个开源世界模型上取得显著的精度提升，从而证明当基于鲁棒验证时，更广泛的探索可以成功扩展。

英文摘要

While reinforcement learning (RL) has become a promising tool for refining world models, existing methods largely rely on conservative rollouts near the training distribution, limiting exploration, behavioral diversity, and richer dynamic discovery. In this work, we challenge this conservative paradigm. We argue that the core limitation is not exploration itself, but the lack of reliable verification strategies to support broader exploration. Without reliable verification, expanded exploration becomes highly susceptible to reward hacking, where policies exploit imperfect rewards without achieving genuine improvement. To test this hypothesis, we instantiate our method in embodied world models, where physical plausibility and task completion provide a rigorous testbed for scalable RL under complex dynamics. On the verification side, we introduce Reward as an Agent, an agentic reward framework that actively evaluates generated behaviors to provide robust reward signals and mitigate reward hacking under distribution shifts. On the exploration side, we introduce Dynamic-Aware Rollout Diversification through DynDiff-GRPO, which explicitly expands action-space exploration to diversify trajectories, broaden state-action coverage, and encourage richer embodied behaviors beyond conservative rollout regimes. By unifying Reward as an Agent with DynDiff-GRPO, we enable RL on a more reliable reward foundation with substantially diversified sampling, effectively mitigating reward hacking while yielding significant accuracy gains across multiple open-source world models, thereby demonstrating that broader exploration can scale successfully when grounded in robust verification.
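For readers unfamiliar with GRPO, the group-relative advantage at its core can be sketched in a few lines: rewards for a group of rollouts sampled from the same prompt or state are normalized by the group's mean and standard deviation. This shows only that standard normalization step; how DynDiff-GRPO diversifies rollouts is not described in the abstract and is not modeled here. The use of the population standard deviation and the `eps` constant are implementation choices assumed for this sketch.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by the mean and std of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts from one group with made-up verifier rewards.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group where every rollout earns the same reward carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

The second case hints at why rollout diversity matters: when all samples in a group behave alike and score alike, the advantages collapse to zero.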

2606.20274 2026-06-19 cs.AI 新提交

Lagrange: An Open-Vocabulary, Energy-Based Sparse Framework for Generalized End-to-End Driving

Lagrange: 一种面向通用端到端驾驶的开放词汇、基于能量的稀疏框架

Shihao Ji, HongXi Li, Zihui Song, Mingyu Li

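AI summary: Proposes the Lagrange framework, which uses Masked Latent Fields and vision-language models for open-vocabulary, sparse computation and enforces kinematic constraints via Lagrangian action minimization; robustness and interpretability are evaluated on the nuScenes and CODA benchmarks.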

AI中文摘要

将端到端自动驾驶扩展到复杂的开放世界环境，需要能够泛化到异常场景的感知模型和能够产生运动学有效轨迹的规划器。现有范式在表示效率和泛化能力之间存在明显分歧。密集模型（如占用网络）虽然几何鲁棒，但存在关键计算瓶颈，且难以进行高层语义推理。相反，稀疏的基于查询的规划器效率高，但依赖于封闭集定义，使其容易受到分布外事件的影响。尽管最近的视觉-语言-动作模型提供了开放词汇推理，但其自回归离散令牌生成从根本上与车辆动力学的连续高频控制需求相冲突。为解决这一问题，我们提出了Lagrange，一种基于掩码潜在场的开放词汇、计算稀疏的驾驶框架。Lagrange不依赖密集体积重建或封闭集查询机制，而是利用视觉语言模型将类别无关的目标提议编码为连续语义视觉令牌。我们引入了一种意图驱动的掩码交叉注意力模块，该模块在时间上过滤不相关实体，并将注意力令牌解码为定义在空间坐标上的隐式连续能量场。通过将决策制定为跨越该能量场的拉格朗日动作最小化问题，我们在执行碰撞避免的同时强制遵守车辆运动学。在标准（nuScenes）和长尾（CODA）基准上的大量离线评估表明，Lagrange为鲁棒、可解释且运动学可行的开放世界自主性建立了一个有前景的框架。

英文摘要

Scaling end-to-end autonomous driving to complex, open-world environments requires perceptual models that generalize to anomalous scenarios and planners that produce kinematically valid trajectories. Existing paradigms face a distinct dichotomy between representational efficiency and generalization capacity. Dense models (e.g., occupancy networks), while geometrically robust, incur critical computational bottlenecks and struggle with high-level semantic reasoning. Conversely, sparse, query-based planners are efficient but reliant on closed-set definitions, rendering them vulnerable to out-of-distribution (OOD) events. Although recent Vision-Language-Action (VLA) models offer open-vocabulary reasoning, their autoregressive, discrete token generation fundamentally conflicts with the continuous, high-frequency control requirements of vehicle dynamics. To address this, we propose Lagrange, an open-vocabulary, computationally sparse driving framework based on Masked Latent Fields (MLF). Rather than relying on dense volumetric reconstructions or closed-set query mechanisms, Lagrange exploits Vision-Language Models (VLMs) to encode class-agnostic object proposals into continuous semantic visual tokens. We introduce an intent-driven masked cross-attention module that temporally filters irrelevant entities, decoding the attended tokens into an implicit continuous energy field defined over spatial coordinates. By framing decision-making as a Lagrangian action minimization problem spanning this energy field, we enforce strict compliance with vehicle kinematics while executing collision avoidance. Extensive offline evaluations on both standard (nuScenes) and long-tail (CODA) benchmarks demonstrate that Lagrange establishes a promising framework for robust, interpretable, and kinematically feasible open-world autonomy.
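The abstract frames planning as minimizing an action over an energy field but gives no equations. As a rough illustration only, the sketch below scores candidate trajectories with a discretized action-like cost, the sum of a kinetic term and the field energy along the path, and picks the cheapest. The Gaussian obstacle field, the cost form, and the candidate trajectories are all assumptions; vehicle kinematic constraints are not modeled.

```python
import math

def energy(x, y, obstacle=(5.0, 0.0), height=50.0, width=1.0):
    """Hypothetical energy field: a Gaussian bump centered on one obstacle."""
    dx, dy = x - obstacle[0], y - obstacle[1]
    return height * math.exp(-(dx * dx + dy * dy) / (2.0 * width * width))

def action(trajectory, dt=1.0):
    """Discretized cost: kinetic term plus field energy, summed over segments."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
        kinetic = 0.5 * (vx * vx + vy * vy)
        potential = energy(0.5 * (x0 + x1), 0.5 * (y0 + y1))  # midpoint sample
        total += (kinetic + potential) * dt
    return total

# Two candidate paths from (0, 0) to (10, 0): straight through the obstacle,
# or a sinusoidal detour around it.
straight = [(float(x), 0.0) for x in range(11)]
detour = [(float(x), 3.0 * math.sin(math.pi * x / 10.0)) for x in range(11)]

best = min([straight, detour], key=action)
print(action(straight), action(detour))
```

The detour pays a little more kinetic cost for its extra lateral motion but avoids the high-energy region, so it has the lower total. A continuous optimizer would search over trajectories rather than choose between two fixed candidates.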

2606.17054 2026-06-19 cs.RO cs.AI cs.CV cs.LG 交叉投稿

Human Universal Grasping

人类通用抓取

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

发表机构 * New York University（纽约大学）； Tsinghua University（清华大学）； University of Michigan（密歇根大学）

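AI summary: Proposes the HUG model, which uses human grasp data (the 1M-HUGs dataset) and flow matching to generate diverse grasp poses from a single RGB-D image and retargets them to robot hands for zero-shot grasping, outperforming baselines by 23% and 34% on HUG-Bench.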

Comments 28 pages, 20 figures, 7 tables

AI中文摘要

人类可以轻松抓取物体，而多指机器人远未达到这种通用性。我们认为机器人抓取数据最自然的来源是人类，他们每天拿起数千个物体。我们提出HUG，一个流匹配模型，能够为任何用户指定的物体（从立体相机捕获的单张RGB-D图像中）生成多样化的人类抓取。使用智能眼镜，我们首先收集了1M-HUGs，一个自我中心的人类抓取数据集，涵盖100万帧（27.8小时）和41栋建筑中的6,707个物体实例。接下来，为了建模自然人类抓取的分布，我们的新型流匹配模型融合RGB和深度观测，输出由手腕平移、手腕旋转和MANO手姿态参数化的抓取。预测的抓取可以重定向到各种机器人手，实现在日常场景中的零样本抓取。为了标准化评估，我们构建了一个新的模拟基准HUG-Bench，包含来自五个几何类别和不同尺寸的90个未见物体，并带有公制尺度的3D网格。我们在真实世界中评估HUG，使用HUG-Bench的30个物体测试集，跨越多个立体相机、机器人实体和家庭环境。HUG在我们具有挑战性的物体集上比最先进的抓取基线高出23%和34%。代码、数据、基准、检查点和交互式演示已在我们的网站上发布：https://grasping.io/

英文摘要

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/
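Flow matching, the generative technique named in the abstract, can be sketched with a linear interpolation path between a noise sample and a data sample. The code below shows the training target and Euler sampling for that generic formulation, using a 3-number vector as a stand-in for a grasp. Image conditioning, the MANO hand parameterization, and the network itself are omitted, and the oracle velocity function is purely illustrative; this is not HUG's implementation.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path, which the model is trained to predict."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    predicted = velocity_fn(interpolate(x0, x1, t), t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate the learned velocity field from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x

# Oracle velocity field for a single known grasp (a hypothetical wrist translation).
GRASP = [0.30, -0.10, 0.45]

def oracle_velocity(x, t):
    return [(g - a) / (1.0 - t) for g, a in zip(GRASP, x)]

noise = [1.0, 1.0, 1.0]
loss = flow_matching_loss(oracle_velocity, noise, GRASP, t=0.3)
sample = euler_sample(oracle_velocity, noise)
print(loss, sample)
```

With a trained network in place of the oracle, different noise samples would follow different paths and land on different plausible grasps, which is how such a model can produce diverse outputs for one object.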

2606.19357 2026-06-19 cs.RO cs.AI 交叉投稿

Physical Atari: A Robust and Accessible Platform for Real-time Reinforcement Learning on Robots

Physical Atari: 一个用于机器人实时强化学习的鲁棒且可访问的平台

Khurram Javed, Joseph Modayil, Gloria Kennickell, Richard S. Sutton, John Carmack

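AI summary: Proposes the Physical Atari platform, in which a robot actuates an Atari controller and game frames are rendered in real time, enabling reinforcement learning research in the physical world; it verifies that algorithms can learn directly on robots and shows that distribution shift can significantly degrade policy performance.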

Comments To appear at RLC 2026

AI中文摘要

我们构建了一个名为Robotroller的机器人，它能够操作Atari CX40+控制器，以及一个名为Atari Devbox的设备，该设备在屏幕上渲染来自Arcade Learning Environment的游戏帧和奖励信号。Robotroller和Atari Devbox，连同现成的摄像头和台式计算机，构成一个可用于研究物理世界中强化学习算法的系统。我们将整个系统称为Physical Atari。在本文中，我们详细介绍了使Physical Atari成为一个鲁棒且可访问平台的关键决策。为了使系统鲁棒，我们设计了Robotroller，使得所有运动都通过轴承完成，从而减少磨损。此外，我们编写了软件，以高频监控伺服电机的状态并进行干预以限制应力。为了使系统可访问，我们使用了价格合理的现成组件和可通过消费级3D打印机制造的零件。Physical Atari的建造成本低于1000美元，并且已用于数周不间断的强化学习实验，未出现任何机械故障。我们用它验证了强化学习算法可以直接在机器人上学习，并表明即使学习和部署之间的微小分布偏移也会显著降低策略的性能。我们的结果强调了设备端适应对于在机器人上获得强性能的重要性。

英文摘要

We built a robot called the Robotroller that actuates an Atari CX40+ controller and a device called the Atari Devbox that renders the game frame and the reward signal from the Arcade Learning Environment on a screen. The Robotroller and the Atari Devbox, together with an off-the-shelf camera and a desktop computer, constitute a system that can be used to study reinforcement learning algorithms in the physical world. We call the full system Physical Atari. In this paper, we detail the key decisions that make Physical Atari a robust and accessible platform. To make the system robust, we designed the Robotroller so that all movement is done through bearings, which reduces wear. Additionally, we wrote software that monitors the state of the servos at a high frequency and intervenes to limit stress. To make the system accessible, we used affordable off-the-shelf components and parts that can be manufactured using consumer 3D printers. Physical Atari can be built for under $1,000 and has been used for weeks of non-stop reinforcement learning experiments without any mechanical failures. We used it to validate that reinforcement learning algorithms can learn directly on robots and show that even small distribution shifts between learning and deployment can significantly degrade the performance of policies. Our results underscore the importance of on-device adaptation for strong performance on robots.

2606.19419 2026-06-19 cs.RO cs.AI 交叉投稿

Playful Agentic Robot Learning

趣味性智能体机器人学习

Junyi Zhang, Jiaxin Ge, Hanjun Yoo, Letian Fu, Zihan Yang, Yaowei Liu, Raj Saravanan, Shaofeng Yin, Justin Yu, Dantong Niu, Zirui Wang, Roei Herzig, Ken Goldberg, Yutong Bai, David M. Chan, Ion Stoica, Angjoo Kanazawa, Jiahui Lei, Haiwen Feng, Trevor Darrell

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Impossible Research

AI总结提出RATs框架，让机器人通过自主探索学习可复用技能，在LIBERO-PRO和MolmoSpaces上分别提升20.6和17.0个百分点。

Comments Project page: https://playful-rats.github.io/

AI中文摘要

当前的智能体机器人系统可以编写可执行的代码即策略（Code-as-Policy）程序、观察反馈并在多次尝试中修正行为，但它们仍然主要是任务驱动的：可复用技能仅在明确指令后获得。我们研究趣味性智能体机器人学习（Playful Agentic Robot Learning），其中具身编码代理在下游任务到来之前，将自主玩耍作为持续技能学习阶段。我们引入RATs（Robotics Agent Teams），即专为玩耍阶段技能获取设计的机器人代理团队。在玩耍阶段，RATs提出新颖且可学习的探索性任务，规划并执行机器人代码策略，验证中间进展，诊断失败，借助密集的步骤级反馈进行重试，并将成功执行提炼到持久的代码技能库中。在测试时，代理从该冻结库中重用相关技能以帮助解决新任务。在LIBERO-PRO和MolmoSpaces上的实验表明，与无玩耍和随机玩耍基线相比，通过玩耍学到的技能提升了保留的下游任务表现，在LIBERO-PRO和MolmoSpaces上相对于CaP-Agent0分别提升20.6和17.0个百分点。此外，只需将学到的技能检索到上下文中，即可将其接入其他推理时代码即策略代理，无需微调底层模型，在RoboSuite和真实世界迁移上分别提升8.9和8.8个百分点。

英文摘要

Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play-time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve held-out downstream tasks over no-play and random-play baselines, with 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents by simply retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.

2606.19633 2026-06-19 cs.RO cs.AI 交叉投稿

CTS-MoE: Implicit Terrain Adaptation via Mixture-of-Experts for Perceptive Locomotion

CTS-MoE: 基于混合专家模型的隐式地形适应感知运动

Francisco Affonso, Matheus P. Angarola, Ana Luiza Mineiro, Aditya Potnis, Marcelo Becker, Girish Chowdhary

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； University of São Paulo（圣保罗大学）

AI总结针对非连续地形上的感知运动问题，提出CTS-MoE方法，通过密集混合专家策略与感知门控组合共享行为，并用多评论家（multi-critic）防止价值干扰，实现端到端训练和隐式地形适应，在仿真和硬件上优于基线。

AI中文摘要

在不连续地形（如楼梯、间隙和障碍物）上的感知腿式运动需要自适应行为，因为单一的保守步态无法产生应对突然拓扑变化所需的预判性动作。将该问题视为多任务强化学习，会在共享与分离之间引入张力。任务使用共同的运动基础但具有冲突的奖励，因此策略必须共享行为同时避免价值干扰。先前的工作只解决了其中一方面：整体策略牺牲了专业化，而分层子策略牺牲了跨过渡和未知地形的泛化能力。我们提出CTS-MoE，它结合了密集混合专家行动者（actor）与基于感知的门控来组合共享行为，以及具有任务特定价值头的多评论家（multi-critic）来防止干扰。该模型在单阶段并发教师-学生设置中进行端到端训练，处理部分可观测性并避免顺序蒸馏，任务标签仅在训练期间使用。部署时，路由仅依赖于感知，从而无需高层选择器或地形分类器即可实现地形适应。在仿真和硬件上对Unitree Go1进行的实验（涵盖已知和未知地形）显示了任务感知的专业化，与整体基线相比，跟踪误差更低，成功率更高。项目网站：https://cts-moe.github.io/ 。

英文摘要

Perceptive legged locomotion over discontinuous terrain (e.g., stairs, gaps, and obstacles) requires adaptive behavior, as a single conservative gait cannot produce the anticipatory maneuvers needed for abrupt topology changes. Cast as multi-task reinforcement learning, this problem introduces a tension between sharing and separation. Tasks use a common locomotion base but have conflicting rewards, so a policy must share behavior while avoiding value interference. Prior work addresses only one side, with monolithic policies sacrificing specialization and hierarchical sub-policies sacrificing generalization across transitions and unseen terrain. We propose CTS-MoE, which combines a dense mixture-of-experts actor with perception-based gating to compose shared behaviors and a multi-critic with task-specific value heads to prevent interference. The model is trained end-to-end in a single-stage concurrent teacher-student setup that handles partial observability and avoids sequential distillation, with task labels used only during training. At deployment, routing depends solely on perception, allowing terrain adaptation without a high-level selector or terrain classifier. Experiments on a Unitree Go1 in simulation and on hardware across seen and unseen terrains show task-aware specialization, with lower tracking error and higher success rates than monolithic baselines. Project Website: https://cts-moe.github.io/ .
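The dense mixture-of-experts actor with perception-based gating described above can be illustrated with a minimal sketch. It assumes a linear gate followed by a softmax and blends the outputs of all experts (dense routing, no top-k selection). The function names, the linear gate, and the toy experts are hypothetical and are not taken from the CTS-MoE paper or its code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense_moe_action(perception, experts, gate_weights):
    """Blend every expert's action using perception-derived gate weights.

    perception   : list of floats (hypothetical perception features)
    experts      : list of callables mapping perception -> action (list of floats)
    gate_weights : one weight row per expert (assumed linear gate, illustration only)
    """
    logits = [sum(w * p for w, p in zip(row, perception)) for row in gate_weights]
    gates = softmax(logits)
    actions = [expert(perception) for expert in experts]
    dim = len(actions[0])
    # Dense routing: every expert contributes, weighted by its gate value.
    blended = [sum(g * a[i] for g, a in zip(gates, actions)) for i in range(dim)]
    return blended, gates
```

Because the gate sees only perception features, no task label or terrain classifier is needed at inference time, which mirrors the deployment setting the abstract describes.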

2606.19728 2026-06-19 cs.RO cs.AI 交叉投稿

Bidirectional Tutoring for Developmental Motor Learning in Robots: Co-Developed Interaction Dynamics Support Stable Learning

机器人发展性运动学习的双向辅导：共同发展的交互动力学支持稳定学习

Rui Fukushima, Jun Tani

发表机构 * Okinawa Institute of Science and Technology Graduate University（冲绳科学技术大学院大学）

AI总结提出双向辅导框架，通过人类或AI导师与机器人动态适应，利用自由能原理神经网络实现稳定序列学习，在物体操作任务中验证了行为一致性和泛化能力。

Comments 16 pages, 14 figures

AI中文摘要

众所周知，婴儿通过与照顾者的密集互动来发展运动技能。尽管这种社会互动对人类发展至关重要，但机器人的运动技能学习通常被视为单向过程，机器人被动接受导师的演示。这忽视了社会互动的一个关键特性：它本质上是双向的，导师和学习者相互动态适应。在这种互动中，机器人的过往经验可能作为先验约束，塑造共同发展轨迹的动态。我们假设双向辅导允许这些约束引导形成一致的行为模式，从而保持行为一致性并支持泛化，而单向互动缺乏此类约束，导致更广泛、更不一致的行为模式。为检验这一假设，我们使用实体人形机器人进行了两个物体操作实验：一个涉及人机互动，另一个采用AI导师通过自适应干预机制与真实机器人互动，以检验在更受控条件下是否会出现类似效果。我们使用基于自由能原理的神经网络并扩展生成回放来实现发展性学习框架，该框架支持从单个辅导情节中进行稳定的逐序列学习。在两种设置中，双向辅导促进了行为一致性和阶段性泛化，同时机器人逐渐需要更少的导师指导。这些结果表明，双向辅导作为一种具身和社会化方法，为机器人的发展性运动学习提供了有效支架。

英文摘要

Infants are well known to develop their motor skills through dense interaction with caregivers. Although such social interaction is crucial for human development, motor-skill learning in robots is often treated as a unidirectional process in which robots passively receive demonstrations from tutors. This overlooks a key property of social interaction: it is inherently bidirectional, with tutor and learner dynamically adapting to each other. In such interactions, the robot's past experiences may function as prior constraints that shape the dynamics of their co-developed trajectories. We hypothesize that bidirectional tutoring allows such constraints to guide the formation of consistent behavioral patterns that preserve behavioral coherence and support generalization, whereas unidirectional interaction lacks such constraints and leads to broader, less consistent behavioral patterns. To examine this hypothesis, we conducted two experiments with a physical humanoid robot performing an object manipulation task: one involving human-robot interaction and another employing an AI tutor interacting with the real robot through an adaptive intervention mechanism designed to examine whether similar effects would emerge under more controlled conditions. We implement the developmental learning framework using a free-energy-principle-based neural network extended with generative replay, which supports stable sequence-by-sequence learning from single tutored episodes. Across both settings, bidirectional tutoring fostered consistent behaviors and stage-wise generalization, while the robot gradually required less tutor guidance. These results suggest that bidirectional tutoring, as an embodied and socially grounded approach, provides an effective scaffold for developmental motor learning in robots.

2606.19752 2026-06-19 cs.RO cs.AI 交叉投稿

Temporal Self-Imitation Learning

时间自我模仿学习

Yinsen Jia, Boyuan Chen

发表机构 * Duke University（杜克大学）

AI总结提出时间自我模仿学习框架，通过挖掘高效成功轨迹并转化为可重用监督信号，提升长时域机器人操作任务的学习效率与鲁棒性。

AI中文摘要

基于奖励塑形训练的长时域机器人操作策略仍可能通过低效交互利用密集奖励，而训练过程中稀有高效行为可能被遗忘。我们认为时间效率本身为强化学习提供了强大且未充分利用的自我监督源。我们引入时间自我模仿学习（TSIL），一种强化学习框架，挖掘学习过程中产生的时间高效成功轨迹，并将其转化为可重用的监督信号以改进未来策略。TSIL通过从快速成功轨迹中提取配置条件自适应时间目标逐步优化学习，并通过效率加权自我模仿学习保留和重放高效行为。在15个不同的长时域操作任务中，TSIL持续提升了学习效率、任务完成效率、快速成功行为的重访率以及对不稳定训练条件的鲁棒性。更广泛地，我们的结果表明，成功行为的时间结构本身为强化学习提供了超越人工奖励塑形的可扩展自我监督信号。

英文摘要

Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.

2606.19914 2026-06-19 cs.RO cs.AI 交叉投稿

Co-policy: Responsive Human-Robot Co-Creation for Musical Performances

Co-policy: 响应式人机音乐共创框架

Xuetao Li, Wenke Huang, Mang Ye, Zijian Liu, Jinhua Xie, Jifeng Xuan, Miao Li

发表机构 * School of Computer Science, Wuhan University（武汉大学计算机学院）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算与数据科学学院）； School of Automation, Wuhan University of Technology（武汉理工大学自动化学院）； School of Geodesy and Geomatics, Wuhan University（武汉大学测绘学院）； School of Robotics, Wuhan University（武汉大学机器人学院）

AI总结提出Co-policy框架，通过语义锚定、约束变分和视觉运动策略实现人机音乐实时共创，在真实钟琴实验中优于扩散策略基线。

AI中文摘要

艺术长期以来一直是人类创造力的关键表达。具身人工智能为生成模型通过物理动作而非无形数字内容参与创造力提供了一条途径。在机器人音乐共创中，将语义音乐理解与实时且可物理执行的表演连接起来具有挑战性。我们提出了Co-policy，一个人机音乐共创框架，它分离了语义意图接地、约束音乐变分和视觉运动执行。为了接地音乐语义，Co-policy使用预推理语义锚点和微调的Qwen-vl规划器（F-Qwen）将语音、实时音乐种子和视觉观察转化为结构化的共创计划。为了支持低延迟执行，Co-policy引入了高斯混合视觉运动策略（GMP），实现为条件混合密度策略，在单次前向传递中将目标音符和视觉上下文映射到多模态机器人动作。与仅复现用户指定音符的机器人回放系统不同，Co-policy在音乐和物理约束下生成互补的音乐响应。真实机器人钟琴实验、消融研究和专家评估显示，与扩散策略和消融基线相比，意图对齐、执行准确性和响应频率均有提升，支持物理接地动作生成作为具身人机共创的关键要求。

英文摘要

Art has long stood as a pivotal expression of human creativity. Embodied artificial intelligence offers a route for generative models to participate in that creativity through physical action rather than disembodied digital content. In robotic music co-creation, it is challenging to connect semantic musical understanding with real-time and physically executable performance. We present Co-policy, a framework for human-robot musical co-creation that separates semantic intent grounding, constrained musical variation, and visuomotor execution. To ground musical semantics, Co-policy uses pre-inference semantic anchors and a fine-tuned Qwen-vl planner (F-Qwen) to transform speech, live musical seeds, and visual observations into structured co-creation plans. To support low-latency execution, Co-policy introduces a Gaussian-Mixture Visuomotor Policy (GMP), implemented as a conditional mixture-density policy that maps target notes and visual context to multimodal robot actions in a single forward pass. Unlike robotic playback systems that merely reproduce user-specified notes, Co-policy generates complementary musical responses under both musical and physical constraints. Real-robot chime experiments, ablations, and expert evaluation show improved intent alignment, execution accuracy, and response frequency over diffusion-policy and ablated baselines, supporting physically grounded action generation as a key requirement for embodied human-AI co-creation.
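To make the idea of a mixture-density policy concrete, the sketch below samples one action from a Gaussian mixture whose parameters are assumed to have been produced by a single forward pass of some network. The network itself is omitted, and the function name and parameter layout (per-component weights, means, and standard deviations) are hypothetical. This is not the GMP implementation from the paper.

```python
import random

def sample_mixture_action(weights, means, stds, rng):
    """Draw one action from a diagonal Gaussian mixture.

    weights : mixing weights, one per component (non-negative)
    means   : per-component mean action vectors
    stds    : per-component standard deviations (same shape as means)
    rng     : a random.Random instance, passed in for reproducibility
    """
    # Step 1: pick a mixture component according to the mixing weights.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Step 2: sample each action dimension from that component's Gaussian.
    action = [rng.gauss(m, s) for m, s in zip(means[k], stds[k])]
    return action, k
```

Sampling costs one categorical draw plus one Gaussian draw per action dimension, which is one plausible reason such a policy suits low-latency execution better than an iterative denoising loop.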

2606.19998 2026-06-19 cs.RO cs.AI cs.CV cs.LG 交叉投稿

Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory

Tri-Info: 基于信息论的VLA模型可泛化、可解释的故障预测

Jinghan Yang, Yunchao Zhang, Wang Yuan, Haolun Wan, Jiaming Zhang, Zhengyang Hu, Yanchao Yang

发表机构 * InfoBodied AI Lab, The University of Hong Kong（香港大学信息具身人工智能实验室）； HKU Musketeers Foundation Institute of Data Science（香港大学同心基金数据科学研究院）

AI总结提出Tri-Info方法，通过信息论信号捕捉动作多样性、时间一致性和状态耦合，实现跨架构、环境及仿真到现实的零样本故障检测，准确率达83%。

AI中文摘要

视觉-语言-动作（VLA）模型越来越多地部署在各种任务中，但它们仍然是黑箱，其物理交互可能导致不可逆的伤害，因此需要可泛化和可解释的故障检测。我们观察到成功和失败的轨迹具有系统不同的信息论特征。基于此，我们将VLA控制形式化为闭环信息管道，并推导出三重信息论（Tri-Info）信号，这些信号捕捉动作是否保持多样性、时间一致性以及与状态转换的耦合。在六个VLA模型和三个基准环境中，Tri-Info在域内匹配最强的基线。此外，Tri-Info无需重新训练即可跨架构、环境和仿真到现实差距迁移，在现实世界任务中达到83%的准确率，而先前的检测器则降至随机水平。这确立了Tri-Info作为一种简单而强大的方法，不仅能够检测故障并具有强大的跨域泛化能力，还能提供底层故障模式的可解释诊断。

英文摘要

Vision-Language-Action (VLA) models are increasingly deployed across diverse tasks, yet they remain black boxes whose physical interactions can cause irreversible harm, making generalizable and interpretable failure detection essential. We observe that successful and failed rollouts carry systematically different information-theoretic signatures. Building on this, we formalize VLA control as a closed-loop information pipeline and derive the Triple Information-theoretic (Tri-Info) signals that capture whether actions remain diverse, temporally consistent, and coupled to state transitions. Across six VLA models and three benchmark environments, Tri-Info matches the strongest baselines in-domain. Moreover, Tri-Info transfers across architectures, environments, and the sim-to-real gap without retraining, reaching 83% accuracy on real-world tasks where prior detectors collapse to chance. This establishes Tri-Info as a simple yet powerful method that not only detects failures with strong cross-domain generalization, but also delivers interpretable diagnostics of the underlying failure modes.

2606.20031 2026-06-19 cs.RO cs.AI 交叉投稿

A Neuromorphic Reinforcement Learning Framework for Efficient Pathfinding in Robotic Mobile Fulfillment Systems

一种用于机器人移动履行系统高效路径规划的神经形态强化学习框架

Junzhe Xu, Zecui Zeng, Lusong Li, Yuetong Fang, Renjing Xu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； JD Explore Academy（京东探索研究院）

AI总结提出SDQN-RMFS框架，通过ANN到SNN的转换和硬标签知识蒸馏，在神经形态芯片上实现超低功耗路径规划，相比GPU能耗降低11281倍，延迟减少近一半。

AI中文摘要

动态环境变化、受限工作空间和严格的实时约束使得机器人移动履行系统（RMFS）中的路径规划对传统的搜索和基于规则的方法来说是一个具有挑战性的问题，这些方法通常存在计算复杂度高、决策延迟长的问题。虽然强化学习（RL）已成为一种强大的替代方案，但在资源受限的硬件上以极高的能源效率部署学习到的策略仍然是一个开放的挑战。我们提出了SDQN-RMFS，一个端到端的框架，实现了从全精度人工神经网络（ANN）训练的RL策略到神经形态芯片的高保真部署。通过仅在稀疏事件触发时进行计算，该框架实现了超低功耗的RMFS路径规划。我们的全栈流水线操作如下：首先通过允许碰撞的训练策略高效训练ANN策略以密集化信息轨迹，然后通过硬标签知识蒸馏方法将其转换为脉冲神经网络（SNN）。这有效地解决了输出分布不匹配问题，在保持策略能力的同时显著降低了推理延迟。硬件实验表明，与高性能GPU基线相比，能耗节省高达11281倍，延迟降低近一半，同时决策质量与原始训练策略相当。这些结果确立了物理神经形态推理作为大规模RMFS运营的实用且能源可持续的途径。

英文摘要

Dynamic environmental changes, confined workspaces, and stringent real-time constraints make pathfinding in Robotic Mobile Fulfillment Systems (RMFS) a challenging problem for conventional search- and rule-based methods, which typically suffer from high computational complexity and long decision latency. While reinforcement learning (RL) has emerged as a powerful alternative, deploying learned policies with extreme energy efficiency on resource-constrained hardware remains an open challenge. We present SDQN-RMFS, an end-to-end framework that achieves high-fidelity deployment of an RL-trained policy from a full-precision artificial neural network (ANN) through to a neuromorphic chip. By computing only when triggered by sparse events, this framework unlocks ultra-low-power RMFS pathfinding. Our full-stack pipeline operates as follows: an ANN policy is first efficiently trained via a collision-allowing strategy to densify informative trajectories, and then converted into a spiking neural network (SNN) via a hard-label knowledge distillation approach. This effectively addresses the output distribution mismatch, preserving policy capability across the ANN-to-SNN pipeline while substantially reducing inference latency. Hardware experiments demonstrate up to 11,281× energy savings and a nearly two-fold reduction in latency compared to a high-performance GPU baseline, while maintaining decision quality on par with the original trained policy. These results establish physical neuromorphic inference as a practical and energy-sustainable pathway for large-scale RMFS operations.

2606.20045 2026-06-19 cs.CV cs.AI 交叉投稿

See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View

See-and-Reach: 面向无人机的视场内精确视觉语言导航

Fanfu Xue, En Yu, Yantian Shen, Zhikun Hu, Hongjun Wang, Yang Yang, Xindi Wang, Jiande Sun

发表机构 * School of Information Science and Engineering, Shandong University（山东大学信息科学与工程学院）； Faculty of Engineering and Information Technology, University of Technology Sydney（悉尼科技大学工程与信息技术学院）； School of Computer Science and Technology, Shandong University（山东大学计算机科学与技术学院）； School of Artificial Intelligence, Shandong University（山东大学人工智能学院）； School of Computer Science and Artificial Intelligence, Shandong Normal University（山东师范大学计算机科学与人工智能学院）； Interdisciplinary Research Center of General Artificial Intelligence, Shandong Normal University（山东师范大学通用人工智能跨学科研究中心）

AI总结针对无人机视觉语言导航中目标可见后精确到达能力评估不足的问题，提出UAV-VLN-FOV任务和3DG-VLN框架，通过动态3D方向线索增强细粒度视觉定位与空间对齐，在基准和真实实验中显著提升成功率。

Comments 12 pages, 7 figures

AI中文摘要

无人机视觉语言导航（UAV-VLN）通常被形式化为一个整体的搜索与到达问题，其中远程目标发现和最终目标接近被联合优化和评估。这种表述使得评估空中具身代理的关键能力变得困难，即一旦目标进入其视场，无人机能否准确地定位可见目标，并将视觉语言证据转化为精确的3D运动。为了解决这一局限性，我们引入了UAV-VLN-FOV，一个目标可见的导航任务，它隔离了“看到并到达”阶段，并能够对终端到达能力进行更具诊断性的评估。我们进一步提出了3DG-VLN，一种由动态3D方向线索引导的视觉语言航点预测框架，以增强细粒度视觉定位和空间方向对齐，从而实现精确的目标到达。具体来说，3DG-VLN自适应地处理高分辨率的前视和下视观测，以保留用于目标定位的细粒度视觉和几何细节。它还在闭环导航过程中在线更新目标相对方向，使代理能够保持与目标的空间对齐并减少累积的方向漂移。为了支持该任务，我们构建了一个专用的高分辨率基准，包含2,717条轨迹，带有面向目标的高级指令、高分辨率的前视和下视自我中心观测以及连续的3D航点注释。实验表明，3DG-VLN优于具有竞争力的UAV-VLN基线，成功率提高了13.82%。真实世界试验进一步展示了3DG-VLN在实际“看到并到达”导航中的潜力。源代码和基准可在以下网址获取：https://github.com/xuefanfu/3DG-VLN 。

英文摘要

UAV Vision-Language Navigation (UAV-VLN) is typically formulated as a holistic search-and-reach problem, where long-range target discovery and final target approach are optimized and evaluated jointly. This formulation makes it difficult to assess a critical capability of aerial embodied agents, namely whether a UAV can accurately ground a visible target and translate vision-language evidence into precise 3D motion once the target enters its field of view. To address this limitation, we introduce UAV-VLN-FOV, a target-visible navigation task that isolates the see-and-reach stage and enables a more diagnostic evaluation of terminal reaching ability. We further propose 3DG-VLN, a vision-language waypoint prediction framework guided by dynamic 3D direction cues to enhance fine-grained visual grounding and spatial direction alignment for precise target reaching. Specifically, 3DG-VLN adaptively processes high-resolution front-view and downward-view observations to preserve fine-grained visual and geometric details for target grounding. It also updates the target-relative direction online during closed-loop navigation, allowing the agent to maintain spatial alignment with the target and reduce accumulated direction drift. To support this task, we construct a dedicated high-resolution benchmark which contains 2,717 trajectories with target-oriented high-level instructions, high-resolution front-view and downward-view egocentric observations, and continuous 3D waypoint annotations. Experiments show that 3DG-VLN outperforms competitive UAV-VLN baselines, achieving a 13.82% improvement in success rate. Real-world trials further demonstrate the potential of 3DG-VLN for practical see-and-reach navigation. The source code and benchmark are available at https://github.com/xuefanfu/3DG-VLN.

2606.20135 2026-06-19 cs.RO cs.AI 交叉投稿

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation

面向连续且一致的机器人动作生成的频率感知流匹配

Jianing Guo, Fangzheng Chen, Zihao Mao, Wong Lik Hang Kenny, Zhenhong Wu, Yu Li, Yishuai Cai, Yuanpei Chen, Yikun Ban, Kai Chen, Qi Dou, Yaodong Yang, Xianglong Liu, Huijie Zhao, Simin Li

发表机构 * Beihang University（北京航空航天大学）； Peking University（北京大学）； The Chinese University of Hong Kong（香港中文大学）； PKU-Psibot Lab（北大-Psibot实验室）； Zhongguancun Laboratory（中关村实验室）； Hefei Comprehensive National Science Center（合肥综合性国家科学中心）

AI总结提出频率感知流匹配（FAFM），通过离散余弦变换将离散动作序列转换到频域进行流匹配，并正则化一阶时间导数以生成平滑连续的动作，提升成功率、多模态表达性和运动平滑性。

AI中文摘要

流匹配已成为机器人操作的标准范式，因为它与扩散策略等类似方法一样，对建模复杂的多模态动作分布具有很强的表达能力。然而，现有方法依赖于离散化的动作块，使得它们对以异构控制频率收集的演示数据脆弱，并且容易产生时间上不一致的动作，从而降低控制稳定性。在本文中，我们提出了频率感知流匹配（FAFM），它输出连续的、时间上一致的动作。为了处理异构频率输入，我们使用离散余弦变换（DCT）将离散动作序列转换到频域，对得到的系数进行流匹配，并通过余弦基展开重建连续动作。为了生成时间上一致的动作，我们对一阶时间导数进行正则化以促进平滑动作。这对应于一个Sobolev型约束，抑制高频误差并阻止突变的动作变化。我们的FAFM简单，不引入额外的网络参数，并且适用于独立的流匹配策略和视觉-语言-动作模型。在合成玩具基准、避障、LapGym和LIBERO上，FAFM提高了成功率、多模态表达能力、运动平滑性、收敛速度，以及对机械偏差和混合频率输入的鲁棒性。这些优势在真实世界的Franka机器人上部署时保持一致。代码见 https://anonymous.4open.science/r/FAFM 。

英文摘要

Flow matching has emerged as a standard paradigm for robotic manipulation owing to its strong expressive power for modelling complex, multimodal action distributions, alongside similar approaches like diffusion policy. However, existing methods rely on discretized action chunks, making them brittle to demonstrations collected at heterogeneous control frequencies and prone to temporally inconsistent actions that degrade control stability. In this paper, we propose Frequency-Aware Flow Matching (FAFM), which outputs continuous, temporally consistent actions. To handle heterogeneous frequency input, we transform discrete action sequences into the frequency domain with the discrete cosine transform (DCT), perform flow matching over the resulting coefficients, and reconstruct continuous actions via cosine basis expansion. To generate temporally consistent actions, we regularize the first-order temporal derivative to promote smooth actions. This corresponds to a Sobolev-type constraint that suppresses high-frequency errors and discourages abrupt action changes. Our FAFM is simple, introduces no additional network parameters, and applies to both standalone flow-matching policies and vision-language-action models. Across a synthetic toy benchmark, obstacle avoidance, LapGym, and LIBERO, FAFM improves success rates, multimodal expressivity, motion smoothness, convergence speed, and robustness to mechanical bias and mixed-frequency input. These gains are consistent when deployed on a real-world Franka robot. Code available at https://anonymous.4open.science/r/FAFM.
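The DCT step and the cosine-basis reconstruction mentioned in the abstract can be sketched as follows. The sketch assumes an unnormalized DCT-II and its matching inverse, evaluated at a real-valued time index so that the action can be queried between the original samples. The abstract does not state the DCT type or normalization, so these are assumptions, and the flow-matching model that would operate on the coefficients is omitted entirely.

```python
import math

def dct2(samples):
    """Unnormalized DCT-II of a 1-D action sequence (one action dimension)."""
    n = len(samples)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(samples))
        for k in range(n)
    ]

def cosine_expansion(coeffs, s):
    """Evaluate the cosine series at a real-valued index s in [0, n).

    At integer s this is the exact inverse of dct2; at fractional s it
    interpolates, giving a continuous-time action signal.
    """
    n = len(coeffs)
    value = coeffs[0] / n
    for k in range(1, n):
        value += (2.0 / n) * coeffs[k] * math.cos(math.pi * (s + 0.5) * k / n)
    return value
```

For example, a 4-step action chunk can be encoded once with `dct2` and then queried at `s = 1.5` to obtain an action halfway between the second and third samples, which is how a coefficient-space representation can decouple the policy from a fixed control frequency.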

2606.20209 2026-06-19 cs.RO cs.AI 交叉投稿

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

面向4D全景占据跟踪的潜在高斯泼溅

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

发表机构 * University of Freiburg（弗赖堡大学）； Bosch Research（博世研究院）； University of Haifa（海法大学）

AI总结提出潜在高斯泼溅（LaGS）方法，通过特征高斯体作为动态关键点实现多视图特征聚合，用于4D全景占据跟踪，在Occ3D nuScenes和Waymo上达到最优性能。

Comments Accepted to IEEE Robotics and Automation Letters (RA-L), 2026

DOI: 10.1109/LRA.2026.3703990

AI中文摘要

捕捉4D时空场景结构对于机器人在动态环境中安全可靠运行至关重要。然而，现有方法通常只解决部分问题：它们要么通过边界框提供粗略的几何跟踪，要么提供缺乏显式时间关联和实例级推理的详细3D占据估计。在这项工作中，我们提出了潜在高斯泼溅（LaGS）用于4D全景占据跟踪（4D-POT）。我们重新审视底层表示，将3D特征建模为一组稀疏的带特征高斯体。这些高斯体作为动态的、面向体积的关键点，在泼溅到体素网格进行解码之前，能够实现多视图特征的空间连续、距离加权聚合。这种以点为中心的公式实现了灵活、数据相关的感受野和长程空间交互，这是局部密集体素算子难以捕捉的。分层高斯表示通过结合来自粗超点的全局上下文和来自高分辨率流的细粒度细节，进一步实现了多尺度推理。在Occ3D nuScenes和Waymo上的大量实验证明了4D-POT的最先进性能。我们在以下网址提供代码和模型：https://lags.cs.uni-freiburg.de/ 。

英文摘要

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.

2603.09420 2026-06-19 cs.CV cs.AI cs.RO 版本更新

Class-Incremental Motion Forecasting

类别增量运动预测

Nicolas Schischka, Nikhil Gosala, B Ravi Kiran, Senthil Yogamani, Abhinav Valada

发表机构 * Department of Computer Science, University of Freiburg, Germany（德国弗赖堡大学计算机科学系）； Qualcomm SARL France（法国 Qualcomm SARL）； Automated Driving, Qualcomm Technologies, Inc.（Qualcomm Technologies, Inc. 自动驾驶部门）

AI总结提出类别增量运动预测新任务，通过端到端框架结合伪标签与开放词汇分割，利用3D-2D投票机制和查询特征方差重放策略，缓解灾难性遗忘并适应新类别。

Comments V3: Change title. Add further experiments

AI中文摘要

运动预测使自动驾驶车辆能够通过预测动态智能体的未来轨迹来预判场景演化。然而，现有方法通常假设一个封闭世界设定，具有固定的对象分类法并依赖高质量感知，限制了其在现实世界中的应用，因为现实世界中感知不完美，且新对象类别可能随时间出现。在这项工作中，我们引入了类别增量运动预测，这是一个新颖的设定，其中新对象类别随时间顺序引入，并且直接从相机图像预测未来对象轨迹。我们提出了首个针对该设定的端到端框架，该框架适应新引入的类别，同时减轻对先前学习类别的灾难性遗忘。我们的方法为已知类别生成运动预测伪标签，并将其与开放词汇分割模型的2D实例掩码进行匹配。这种3D到2D关键点投票机制过滤不一致和过度自信的预测，而基于查询特征方差的重放策略采样信息丰富的过去序列以保留先验知识。在nuScenes和Argoverse 2上的广泛评估表明，我们的方法成功地在已知类别上保持性能，同时有效适应新类别。我们进一步展示了向真实世界驾驶的零样本迁移，并表明该框架自然地扩展到nuScenes和NeuroNCAP上的开环和闭环端到端类别增量规划。代码和模型将在 https://omen.cs.uni-freiburg.de 上公开。

英文摘要

Motion forecasting enables autonomous vehicles to anticipate scene evolution by predicting the future trajectories of dynamic agents. However, existing approaches typically assume a closed-world setting with a fixed object taxonomy and access to high-quality perception, limiting their applicability in the real world where perception is imperfect, and new object classes may emerge over time. In this work, we introduce class-incremental motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are predicted directly from camera images. We propose the first end-to-end framework for this setting, which adapts to newly introduced classes while mitigating catastrophic forgetting of previously learned ones. Our method generates motion forecasting pseudo-labels for known classes and matches them with 2D instance masks from an open-vocabulary segmentation model. This 3D-to-2D keypoint voting mechanism filters inconsistent and overconfident predictions, while a query feature variance-based replay strategy samples informative past sequences to preserve prior knowledge. Extensive evaluations on nuScenes and Argoverse 2 show that our approach successfully preserves performance on known classes while effectively adapting to novel ones. We further demonstrate zero-shot transfer to real-world driving and show that the framework extends naturally to open- and closed-loop end-to-end class-incremental planning on nuScenes and NeuroNCAP. Code and models will be made publicly available at https://omen.cs.uni-freiburg.de.

2605.23733 2026-06-19 cs.RO cs.AI 版本更新

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Any2Any: 面向人形机器人全身跟踪的高效跨本体迁移

Ming Yang, Tao Yu, Feng Li, Hua Chen

发表机构 * LimX Dynamics（逐际动力）

AI总结提出Any2Any范式，通过运动学对齐和动力学微调，实现预训练全身跟踪模型高效迁移至新的人形机器人本体，仅需少量数据和计算即可达到竞争性跟踪性能。

Comments Project Page: https://any2any.top/

AI中文摘要

全身跟踪（WBT）模型已成为人形机器人的关键基础，使其能够高保真地模仿各种运动。从头训练此类模型需要大规模数据和计算，使得在新人形平台上快速部署成本高昂。这自然引发一个问题：预训练的WBT模型能否通过最小化适应跨本体迁移？为回答这个问题，我们提出Any2Any，一种范式，能够高效地将现有WBT专家迁移到新人形本体，仅需少量数据和计算。Any2Any首先在源和目标人形之间进行运动学对齐，对齐其输入和输出空间，使得预训练的源策略可以在目标本体上有意义地重用。然后，Any2Any通过向选定的动力学敏感模块应用轻量级参数高效微调（PEFT）组件进行动力学适应，保留有用的行为先验，同时实现对目标机器人的定向适应。在多个人形平台和预训练骨干上的大量实验表明，与从头训练相比，Any2Any显著加速收敛并降低训练成本，同时实现具有竞争力或更优的跟踪性能。值得注意的是，仅使用完整训练所需计算和数据的1%，Any2Any成功将在Unitree G1上预训练的Sonic模型迁移到LimX Oli和LimX Luna。这些结果表明，预训练的WBT专家可以跨本体高效重用，为在新机器人上部署人形全身控制提供可扩展的路径。

英文摘要

Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment. Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pretrained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots. More results and videos are available on our project page: https://any2any.top/.

2606.19464 2026-06-19 cs.AI cs.MA 新提交

Deontic Policies for Runtime Governance of Agentic AI Systems

面向自主AI系统运行时治理的道义策略

Anupam Joshi, Tim Finin, Karuna Pande Joshi, Lalana Kagal

发表机构 * CSEE Department UMBC Baltimore, MD, USA ； Center for AI UMBC Baltimore, MD, USA ； Information Systems Department UMBC Baltimore, MD, USA ； CSAIL MIT Cambridge, MA, USA

AI总结针对大语言模型驱动的自主AI系统在安全、隐私和合规方面的治理挑战，提出AgenticRei框架，利用基于Rei的道义策略语言（OWL表示）在运行时通过逻辑引擎强制执行义务、豁免、冲突解决等治理约束，并兼容A2AS等标准。

Comments 10 pages, 1 figure. To be published in the 2026 IEEE Symposium on Agentic Services which is part of the IEEE Conference on Web Services

AI中文摘要

由大语言模型驱动的自主AI系统引入了一类新的安全、隐私和合规挑战：能够调用工具、操作数据、安装软件并与跨组织边界对等代理协调的代理，不仅必须通过身份验证和访问控制来约束，还必须通过企业治理的完整结构来约束。这包括指定代理被允许和禁止做什么，它们在特定操作后必须做什么（例如，通知CISO），在什么条件下可以免除一项持续义务，以及当策略冲突时哪些规则优先。这个治理问题超出了当前策略引擎的能力范围。诸如XACML、Rego和Cedar等系统仅处理此治理结构的允许/禁止子集。它们不提供义务生命周期管理、元策略冲突解决、在特定情况下免除义务的豁免，以及通常在医疗、网络安全或数据隐私等应用中发现的领域类层次结构的本体推理。我们提出了AgenticRei，它实现了关键的治理需求，如义务、豁免、策略冲突解决和策略推理，以及基本的允许/禁止约束。我们使用基于Rei框架的道义策略语言，表示为OWL（Web本体语言），并由完全在LLM外部的高性能逻辑引擎在运行时评估。同一管道同时管理代理的工具调用和代理间消息。我们通过示例表明，道义策略捕获了当前生产引擎大多无法表达的安全和隐私治理约束。我们的方法自然地与A2AS等行业标准框架兼容。

英文摘要

Autonomous agentic AI systems driven by Large Language Models (LLMs) introduce a new class of security, privacy, and compliance challenges: an agent that can invoke tools, manipulate data, install software, and coordinate with peer agents across organizational boundaries must be constrained not just by authentication and access control, but by the full structure of enterprise governance. This includes specifying what agents are permitted and prohibited from doing, what they are obliged to do after certain actions (e.g., notify the CISO), under what conditions a standing obligation may be waived, and which rules take precedence when policies conflict. This governance problem exceeds what current policy engines provide. Systems such as XACML, Rego, and Cedar address only the permit/prohibit subset of this governance structure. They do not provide obligation lifecycle management, meta-policy conflict resolution, dispensations that waive obligations in specific circumstances, or ontological reasoning over domain class hierarchies commonly found in applications such as healthcare, cybersecurity, or data privacy. We propose AgenticRei, which realizes key governance requirements such as obligations, dispensations, policy conflict resolution, and reasoning over policies, as well as the basic permit/prohibit constraints. We use a deontic policy language built on the Rei framework, expressed in OWL (Web Ontology Language) and evaluated at runtime by a high-performance logic engine entirely outside the LLM. The same pipeline governs both tool invocations by the agent and agent-to-agent messages. We show through examples that deontic policies capture governance constraints around security and privacy that mostly cannot be expressed in current production engines. Our approach composes naturally with industry-standard frameworks like A2AS.
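The four deontic notions named in the abstract (permission, prohibition, obligation, dispensation) plus priority-based conflict resolution can be illustrated with a toy evaluator. This is a plain-Python sketch under simplifying assumptions: rules match on exact action names, conflicts are resolved by an integer priority with prohibition winning ties, and unmatched actions are denied by default. It does not use Rei, OWL, or any AgenticRei interface, and every class and function name here is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    modality: str   # "permit" or "prohibit"
    action: str
    priority: int   # higher wins on conflict (a toy stand-in for a meta-policy)

@dataclass(frozen=True)
class Obligation:
    trigger: str    # action that activates the duty
    duty: str       # what the agent must do afterwards

def decide(action, rules):
    """Return 'permit' or 'prohibit'; deny by default when no rule matches."""
    matching = [r for r in rules if r.action == action]
    if not matching:
        return "prohibit"
    # Highest priority wins; on equal priority, prohibition wins.
    best = max(matching, key=lambda r: (r.priority, r.modality == "prohibit"))
    return best.modality

def pending_duties(action, obligations, dispensations):
    """Duties triggered by an action, minus those waived by a dispensation."""
    return [
        o.duty for o in obligations
        if o.trigger == action and o.duty not in dispensations
    ]
```

In the spirit of the abstract's example, an `Obligation("install_software", "notify_ciso")` stays pending after an install unless `"notify_ciso"` appears in the set of dispensations, which is the kind of post-action requirement a pure permit/prohibit engine has no place to record.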

2606.19509 2026-06-19 cs.AI 新提交

LLM Doesn't Know What It Doesn't Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data

LLM 不知道它不知道什么：通过跨模型归因分歧检测临床表格数据上的认知盲点

Akshat Dasula, Prasanna Desikan, Jaideep Srivastava

发表机构 * Centific AI Research（Centific AI研究）

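AI summary: Studies the epistemic uncertainty of large language models on structured clinical data. Using cross-model attribution divergence analysis, it finds that verbalized confidence is vacuous and that an inverse difficulty effect exists, and proposes an attribution-divergence-based calibration method that improves accuracy and reduces calibration error without training.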

Comments Accepted at EIML@ICML 2026

AI中文摘要

大语言模型（LLM）越来越多地应用于结构化临床数据，但它们在处理此类任务时能否认识到自身知识的局限性仍未得到探索。我们通过跨模型归因分歧的视角研究这一问题，旨在减少结构化任务的认知不确定性，通过归因分歧分析比较 Qwen 2.5 7B 和 XGBoost 在预测任务上的表现。我们报告了四个发现。首先，LLM 口头表达的置信度在认知上是空洞的，无论准确率是 49% 还是 75.3%，它输出接近常数（0.856-0.937），追踪的是提示格式而非预测质量。其次，LLM 表现出逆难度效应：当 XGBoost 以 99% 正确时，LLM 准确率降至 64.8%，但在 XGBoost 中等不确定时，LLM 与其匹配（73.8% 对 73.1%）。第三，少样本示例和 SHAP 导出的特征证据是正交的、超加性的干预措施：它们将归因分歧分数（ADS）从 1.54 降至 0.38，并在无需训练的情况下将准确率从 49% 提升至 75.3%。第四，一种利用归因分歧信号确定 LLM 可靠性的跨模型校准器，将期望校准误差从 0.254 降至 0.080，用患者特定的可靠性估计替代了无信息量的口头置信度，无需访问模型内部或重复推理。我们将这些发现视为 LLM 在结构化数据上的冷启动问题，并勾勒出通向真正认知自我意识的路径。

英文摘要

Large language models (LLMs) are increasingly applied to structured clinical data, yet whether they can recognize the limits of their own knowledge on such tasks remains unexplored. We study this question through the lens of cross-model attribution divergence with the goal of reducing epistemic uncertainty for structured tasks, comparing Qwen 2.5 7B and XGBoost on a prediction task via attribution divergence analysis. We report four findings. First, LLM verbalized confidence is epistemically vacuous: it outputs a near-constant value (0.856-0.937) regardless of whether accuracy is 49% or 75.3%, tracking prompt format rather than prediction quality. Second, the LLM exhibits an inverse difficulty effect: accuracy drops to 64.8% when XGBoost is 99% correct, but matches XGBoost (73.8% vs. 73.1%) when it is moderately uncertain. Third, few-shot examples and SHAP-derived feature evidence are orthogonal, super-additive interventions: they reduce the Attribution Disagreement Score (ADS) from 1.54 to 0.38 and improve accuracy from 49% to 75.3% without training. Fourth, a cross-model calibrator that determines LLM reliability using attribution divergence signals reduces expected calibration error from 0.254 to 0.080, replacing uninformative verbalized confidence with patient-specific reliability estimates, without accessing model internals or requiring repeated inference. We frame these findings as a cold start problem for LLMs on structured data and outline a path toward genuine epistemic self-awareness.
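
The abstract reports expected calibration error (ECE) without defining it. A minimal sketch of the common equal-width binned estimator follows; the bin count and binning scheme are assumptions, since the abstract does not say which variant the authors used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE.

    confidences: predicted probabilities in [0, 1]
    correct: 1 if the prediction was right, 0 otherwise
    Returns the weighted mean of |accuracy - mean confidence| over bins.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that always reports 0.9 confidence while being right half the time lands in a single bin and scores an ECE of about 0.4, which is the kind of near-constant, uninformative confidence the first finding describes.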

2606.19527 2026-06-19 cs.AI 新提交

Emergent Alignment

涌现对齐

Martin Kolář

发表机构 * CIIRC, Czech Technical University in Prague（捷克理工大学CIIRC）

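AI summary: Proposes an online alignment technique that introduces a conscience step and an alignment loss based on Direct Preference Optimization, so that large language models self-correct unethical outputs during training, fine-tuning, adversarial prompting, and zero-shot learning.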

Comments Rejected from ICML 2026

2606.19588 2026-06-19 cs.AI cs.CR cs.LO 新提交

Analyzing the Narration Gap in LLM-Solver Loops

分析大语言模型-求解器循环中的叙述差距

Zunchen Huang, Songgaojun Deng

发表机构 * Eindhoven University of Technology（埃因霍温理工大学）

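AI summary: Studies security weaknesses in the narration step that turns solver output into the user-facing answer in hybrid LLM and SAT/SMT solver reasoning. Formal modeling and experiments show that certificate gating keeps the solver verdict sound, but adversarial attacks can still invert the conclusion.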

AI中文摘要

诸如SAT和SMT求解器之类的形式化工具，当安全或安保关键问题可以用逻辑表述时，越来越多地被嵌入到语言模型推理流程中。与思维链不同（其步骤从模型分布中采样，没有形式化保证），求解器产生可靠且可独立验证的答案。然而，这种可靠性保证可能在求解器与模型之间的交互中丢失。混合流程包含三个组成部分：形式化问题、求解问题以及叙述结果。先前的工作研究了形式化和求解，但未涉及叙述——即将形式化工具的输出转化为用户答案的步骤。为了填补叙述差距，我们首先将LLM-求解器循环建模为经过验证的决策过程。我们进一步在提示注入下评估了五个开源模型，发现证书门控使求解器判定可靠，而攻击者可以通过不同措辞和渠道反转已验证的结论。我们研究了通过强化提示进行缓解的方法，该方法显著减少了注入但无法完全消除，并且在自适应攻击下仍然存在问题。结合形式化分析和实证研究，我们表明在LLM-求解器循环中，鲁棒性无法延伸到用户最终读取的答案。

英文摘要

Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety- or security-critical question can be formulated in logic. Unlike chain of thought, whose steps are sampled from the model distribution without formal guarantee, a solver produces a sound and independently verifiable answer. However, the soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied formalization and decision, but not narration, which is the step that turns a formal tool's output into the user answer. To fill the narration gap, we first model the LLM-solver loop as a verified decision procedure. We further evaluate five open-source models under prompt injection, and we find that certificate gating makes the solver verdict sound, while an adversary can invert a verified conclusion across phrasings and channels. We study mitigation through a hardened prompt, which reduces injection significantly but cannot eliminate it and still suffers under adaptive attack. Combining the formal analysis and empirical studies, we show that in the LLM-solver loop, robustness does not extend to the answer that the user finally reads.
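
One way to read "certificate gating" is that a SAT verdict is accepted only if the accompanying assignment is independently checked against the formula. The sketch below assumes that reading; the function names and the DIMACS-style clause encoding are illustrative choices, and checking UNSAT proofs is out of scope.

```python
def literal_true(lit, assignment):
    """A literal holds only if its variable is assigned with the matching polarity."""
    value = assignment.get(abs(lit))
    return value is not None and value == (lit > 0)


def satisfies(cnf, assignment):
    """cnf: list of clauses, each a list of nonzero ints (DIMACS-style literals).

    assignment: dict mapping variable index to bool.
    """
    return all(any(literal_true(lit, assignment) for lit in clause) for clause in cnf)


def gated_verdict(cnf, claimed_verdict, certificate=None):
    """Pass a SAT claim through only when its certificate checks out."""
    if claimed_verdict == "SAT" and certificate is not None and satisfies(cnf, certificate):
        return "SAT (certificate checked)"
    return "UNVERIFIED"
```

Note that this gate covers only the solver verdict. The abstract's point is that the later narration step, where a model rewrites the verdict for the user, remains open to inversion.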

2606.19735 2026-06-19 cs.AI cs.CV 新提交

GLARE: A Natural Language Interface for Querying Global Explanations

GLARE: 用于查询全局解释的自然语言接口

Bhavan Vasu, Rajesh Mangannavar

发表机构 * Oregon State University（俄勒冈州立大学）

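AI summary: Proposes GLARE, an LLM-based interactive interface that translates natural language questions into SQL queries that aggregate local explanation data, improving the accessibility and usability of global explanations.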

Comments 16 pages, 2 figures

AI中文摘要

虽然全局解释对于理解跨数据集、类别和决策上下文的视觉模型至关重要，但其复杂和单一的性质常常阻碍实际探索。由于用户通常寻求针对特定问题的目标答案，而不是静态产物，我们提出了一种基于LLM的交互接口，提供对黑盒图像分类器全局解释的自然语言访问。系统的核心LLM充当调解者，将自然语言问题转换为对局部解释数据的结构化SQL查询。这使得灵活聚合成为可能，而无需向用户暴露低级表示。对于每个查询，接口输出统计增强的自然语言响应，支持局部解释和意图对齐的可视化。我们在意图解释、查询映射准确性、对新查询和数据集的泛化能力以及对语言错误的鲁棒性方面评估了该系统。我们的结果表明，LLM中介的查询显著提高了以人为中心的XAI中全局解释的可访问性和可用性。

英文摘要

While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system's core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.
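
To make the "natural language question to SQL over local explanation data" step concrete, here is a sketch of the kind of aggregation query such a mediator might emit for a question like "Which concepts matter most for zebra predictions?". The `local_explanations` table, its columns, and the sample rows are hypothetical; the abstract does not describe GLARE's actual schema.

```python
import sqlite3


def top_concepts(conn, predicted_class, limit=3):
    """Aggregate per-image attributions into a class-level (global) ranking."""
    query = """
        SELECT concept, AVG(attribution) AS mean_attr, COUNT(*) AS n
        FROM local_explanations
        WHERE predicted_class = ?
        GROUP BY concept
        ORDER BY mean_attr DESC
        LIMIT ?
    """
    return conn.execute(query, (predicted_class, limit)).fetchall()


# Hypothetical local explanation records: one row per (image, concept).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE local_explanations "
    "(image_id INTEGER, predicted_class TEXT, concept TEXT, attribution REAL)"
)
conn.executemany(
    "INSERT INTO local_explanations VALUES (?, ?, ?, ?)",
    [
        (1, "zebra", "stripes", 0.75),
        (2, "zebra", "stripes", 0.25),
        (1, "zebra", "grass", 0.25),
        (3, "horse", "mane", 0.5),
    ],
)
rows = top_concepts(conn, "zebra")
```

The parameterized query keeps user-derived values out of the SQL string, which matters when the query text itself is produced by a language model.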

2606.19812 2026-06-19 cs.AI cs.LG 新提交

Human-on-the-Loop Orchestration for AI-Assisted Legal Discovery

AI辅助法律发现中的人机协同编排

Anushree Sinha, Srivaths Ranganathan, Abhishek Dharmaratnakar, Debanshu Das

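AI summary: Addresses the legal risk created by compounding multi-step reasoning errors of AI agents in e-discovery. Proposes a four-layer verification architecture in which Human-on-the-Loop escalation thresholds reduce privilege-waiver risk by up to 61%.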

AI中文摘要

自主大语言模型（LLM）代理越来越多地部署于电子发现（e-discovery），其中跨多步推理链的复合错误可能构成法律渎职。与单轮检索不同，在特权文档语料库上运行的代理工作流表现出我们称之为“轨迹崩溃”的一类失败：早期错误分类无声传播，导致整个特权审查失效。本文做出三项贡献。首先，我们提出一个按功能阶段组织的法律信息检索中代理失败的结构化分类法。其次，我们引入一个四层验证架构——涵盖规划、推理、执行和不确定性量化——旨在这些失败复合之前拦截它们。第三，我们在一个合成电子取证语料库上进行初步模拟研究，展示强制性人机协同（HOTL）升级阈值如何相对于完全自主基线降低特权豁免风险。我们的结果表明，与完全自主部署相比，校准的不确定性阈值可将特权豁免风险降低高达61%，同时将不到四分之一的文档路由给律师审查。

英文摘要

Autonomous Large Language Model (LLM) agents are increasingly deployed in electronic discovery (e-discovery), where compounding errors across multi-step reasoning chains can constitute legal malpractice. Unlike single-turn retrieval, agentic workflows operating over privileged document corpora exhibit a class of failure we term "trajectory collapse": an early misclassification silently propagates, rendering an entire privilege review invalid. This paper makes three contributions. First, we propose a structured taxonomy of agentic failures in legal information retrieval, organized by functional stage. Second, we introduce a four-layer verification architecture -- spanning planning, reasoning, execution, and uncertainty quantification -- designed to intercept these failures before they compound. Third, we present a preliminary simulation study on a synthetic e-discovery corpus that demonstrates how mandatory Human-on-the-Loop (HOTL) escalation thresholds reduce privilege-waiver risk relative to fully autonomous baselines. Our results suggest that calibrated uncertainty thresholds can reduce privilege-waiver risk by up to 61% versus fully autonomous deployment, while routing fewer than one quarter of documents to attorney review.
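
The escalation rule described above can be sketched as a simple threshold on a per-document uncertainty score. The function names, the score range, and the example threshold are assumptions for illustration; the abstract does not specify how the authors compute or calibrate uncertainty.

```python
def route_documents(docs, threshold):
    """Split documents into auto-processed and attorney-escalated sets.

    docs: list of (doc_id, uncertainty) pairs with uncertainty in [0, 1]
    threshold: escalate when uncertainty is at or above this value
    """
    auto, escalated = [], []
    for doc_id, uncertainty in docs:
        (escalated if uncertainty >= threshold else auto).append(doc_id)
    return auto, escalated


def review_fraction(docs, threshold):
    """Share of the corpus that ends up in front of a human reviewer."""
    if not docs:
        return 0.0
    _, escalated = route_documents(docs, threshold)
    return len(escalated) / len(docs)
```

Sweeping `threshold` and recording `review_fraction` alongside an error measure on the auto-processed set gives the kind of trade-off curve the abstract summarizes, where risk falls while the review load stays under a chosen budget.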

2606.20508 2026-06-19 cs.AI cs.LG 新提交

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

安全对齐的LLM从混合顺从演示中学到了什么？

Sihui Dai, Mann Patel

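AI summary: Mixes benign and harmful compliance demonstrations to study how demonstration composition drives harmful compliance, finding that demonstration content, ordering, and training methodology shape what models extract.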

AI中文摘要

先前工作表明，上下文演示可以越狱语言模型，但模型如何解释不同类型的顺从演示仍不清楚。我们通过混合良性顺从演示（无害请求，有帮助响应）与有害顺从演示（有害请求，有帮助响应）并测试关于演示组成如何驱动有害顺从的三个假设来研究这一点。在四个模型中，我们发现良性和有害演示不可互换：良性演示根据模型不同可以减少或增加有害顺从。我们进一步表明，偏好优化是防止良性演示增加有害顺从的关键训练阶段，演示顺序表现出强烈的近因偏差，并且模型在拒绝与上下文学习的交互方式上有所不同：一些模型在拒绝时也采用演示的格式，而其他模型在拒绝时覆盖所有上下文信号。综合来看，这项工作超越了展示基于演示的越狱有效，而是描述了其工作原理：模型从顺从演示中提取的内容取决于演示内容、顺序和训练方法。

英文摘要

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrations are not interchangeable: benign demonstrations can either reduce or increase harmful compliance depending on the model. We further show that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance, that demonstration ordering exhibits strong recency bias, and that models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal. Taken together, this work moves beyond showing that demonstration-based jailbreaking works to characterizing how it works: what models extract from compliance demonstrations depends on demonstration content, ordering, and training methodology.

2606.19344 2026-06-19 cs.CL cs.AI 交叉投稿

Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation

揭示未言明之事：通过随机路径聚合可视化隐藏的LLM偏见

Matteo Pelossi, Rita Sevastjanova, Thilo Spinner, Mennatallah El-Assady

发表机构 * ETH Zurich（苏黎世联邦理工学院）

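AI summary: Presents TreeTracer, a tool that combines systematic perturbation analysis, syntax-aligned aggregation, and classification-aware node merging, and uses Sankey diagrams to compare semantic contexts, exposing hidden representational and syntactic biases in LLMs.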

Comments 14 pages

AI中文摘要

大型语言模型（LLM）表现出表征性和句法性偏见，由于文本生成的随机性，这些偏见难以评估。标准审计方法依赖于单一输出检查或静态自动化指标，这些方法掩盖了底层概率分布，未能捕捉隐藏在低概率生成分支中的偏见。本文介绍了TreeTracer，一种通过聚合比较评估LLM偏见的可视化分析工具。该工具使用系统扰动分析流程，替换每个输入提示中由本体定义的术语，将数百次随机生成聚合成语法对齐的层次结构，然后使用辅助语言模型进行分类感知节点合并。生成的结构通过自定义桑基图可视化。通过并置两个本体驱动的树，工作空间能够直接比较语义上下文，并支持系统性偏见检测。由于任何可视化仅反映模型学习行为的一个子集，系统进一步应用对比推理来计算并直接显示跨上下文的反事实标记概率，从而降低误解偏见存在的风险。我们通过案例研究验证了该工作空间，比较了未对齐的基线模型GPT-2 XL与宪法对齐的Apertus模型。视觉聚合成功揭示了隐藏的代表性伤害，例如反事实代词抑制和对话中对个体的边缘化。初步用户研究证实，聚合比较界面降低了认知负荷，并有效支持分析人员检测系统性偏见。

英文摘要

Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation. Standard auditing methods rely on a single output inspection or static automated metrics. These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability generation branches. This paper introduces TreeTracer, a visual analytics tool designed to evaluate LLM bias through aggregated comparison. Using a systematic perturbation analysis pipeline, the tool replaces ontology-defined terms in each input prompt, aggregates hundreds of stochastic generations into a syntax-aligned hierarchical structure, and then performs classification-aware node merging with an auxiliary language model. The resulting structure is visualized through a custom Sankey diagram. By juxtaposing two ontology-driven trees, the workspace enables direct comparison between semantic contexts and supports systematic bias detection. Because any visualization reflects only a subset of the model's learned behavior, the system further applies contrastive inference to compute and directly display counterfactual token probabilities across contexts, reducing the risk of misinterpreting the presence of bias. We validate the workspace through case studies comparing an unaligned baseline model GPT-2 XL against the constitutionally aligned Apertus models. The visual aggregation successfully exposes hidden representational harms, such as counterfactual pronoun suppression and conversational marginalization of individuals. A preliminary user study confirms that the aggregated comparative interface reduces cognitive load and effectively supports analysts in detecting systemic biases.

2606.19386 2026-06-19 cs.SE cs.AI cs.LG 交叉投稿

Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence

通过构造实现双稳态：挂钟校准的状态监视器在代理节奏下没有瞬间检测机制

Manvendra Modgil

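AI summary: Finds that wall-clock-calibrated leaky-integrator monitors cannot act as moment detectors on agent streams, shows that the calibration class is the decisive factor, and proposes rising-edge triggering as an alternative.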

Comments 10 pages, 5 figures. Sequel to arXiv:2606.04296. Pre-registered; falsification clauses honored (H5 unsupported; H7 strict band 16/20) repo:https://github.com/2025eb1100268-tech/intervention-timing-saturation-trap

AI中文摘要

自主代理的运行时监视器通常对累积的内部状态（行为基线、漂移统计量，或在我们之前工作中的建模情感状态）设置阈值。我们之前报告了一个状态饱和陷阱：在连续情感引擎上基于阈值的状态触发在SWE-bench调试代理（Modgil 2026）上变成了近乎恒定的警报。发布后审计发现引擎在动作之间接收到的dt=0，因此其指数衰减从未运作：已发布的陷阱是一个纯累加器的结果。我们更正了记录（勘误，v2）并将该缺陷视为一个实验。它揭示的关键变量是监视器的动态是在样本时间（每次观测，如CUSUM）还是挂钟时间（半衰期以秒计，如情感模型和EMA基线）校准的。在固定速率流上两者一致；在代理流上，动作间时间变化几个数量级，它们不一致。在20条轨迹上对均匀间隔（dt在{0..600}秒内）的预注册扫描显示，挂钟水平触发器有两个机制：在dt<=1秒时恒定警报（20/20；中位数18次触发）；在dt>=60秒时静默。每个关键dt位于(1,30]秒内。真实代理运行测量延迟中位数为1.53秒（p90 2.33秒）；真实编码节奏位于陷阱机制内，在修正机制下证实了经验发现。该结构是校准类别的属性，而非引擎：在原始误差流上的最小挂钟累加器重现了相同的悬崖，而相同流上的样本时间CUSUM恰好是dt不变的（20/20）。带有滞后的上升沿触发器在每个条件下每条轨迹触发0-3次。我们得出结论，挂钟校准的泄漏积分器监视器在代理流上不存在作为瞬间检测器的机制；转换检测在每个节奏下都逃脱了陷阱，但无法恢复人工干预时机。

英文摘要

Runtime monitors for autonomous agents commonly threshold an accumulated internal state - a behavioural baseline, a drift statistic, or, in our prior work, a modelled affective state. We previously reported a State Saturation Trap: threshold-on-state triggers over a continuous affect engine become near-constant alarms on SWE-bench debugging agents (Modgil 2026). A post-release audit found the engine received dt=0 between actions, so its exponential decay never operated: the published trap is a pure-accumulator result. We correct the record (erratum, v2) and treat the flaw as an experiment. The key variable it exposes is whether a monitor's dynamics are calibrated in sample time (per observation, as in CUSUM) or wall-clock time (half-lives in seconds, as in affect models and EMA baselines). On fixed-rate streams these coincide; on agent streams, where inter-action time varies by orders of magnitude, they do not. A pre-registered sweep over uniform intervals (dt in {0..600}s) on 20 trajectories shows the wall-clock level trigger has two regimes: at dt<=1s a constant alarm (20/20; median 18 firings); at dt>=60s silent. Every critical dt lies in (1,30]s. Real agent runs measure latency at median 1.53s (p90 2.33s); real coding cadence sits inside the trap regime, vindicating the empirical finding under a corrected mechanism. The structure is a property of the calibration class, not the engine: a minimal wall-clock accumulator over the raw error stream reproduces the same cliff, while a sample-time CUSUM over the identical stream is exactly dt-invariant (20/20). A rising-edge trigger with hysteresis fires 0-3 times per trajectory in every condition. We conclude that wall-clock-calibrated leaky-integrator monitors admit no regime in which they act as moment detectors on agent streams; transition detection escapes the trap at every cadence, but does not recover human intervention timing.
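
The difference between wall-clock and sample-time calibration can be seen in a few lines. The sketch below is not the paper's engine: the half-life, drift, threshold, and constant error stream are hypothetical values picked to show that a leaky integrator's alarm count depends on the spacing `dt`, while a CUSUM update never reads elapsed time at all.

```python
def wall_clock_firings(errors, dt, half_life=10.0, threshold=3.0):
    """Count steps where a leaky integrator is at or above threshold.

    The state decays with a half-life measured in seconds, so the same
    error sequence behaves differently at different inter-action gaps dt.
    """
    decay = 0.5 ** (dt / half_life)  # dt = 0 gives decay 1.0: a pure accumulator
    state, firings = 0.0, 0
    for x in errors:
        state = state * decay + x
        if state >= threshold:
            firings += 1
    return firings


def cusum_firings(errors, drift=0.5, threshold=3.0):
    """One-sided CUSUM in sample time; elapsed seconds never enter the update."""
    g, firings = 0.0, 0
    for x in errors:
        g = max(0.0, g + x - drift)
        if g >= threshold:
            firings += 1
            g = 0.0  # reset after an alarm
    return firings
```

With ten unit errors, the integrator alarms on eight steps at `dt=0` and on none at `dt=600`, mirroring the constant-alarm and silent regimes in the abstract, while the CUSUM count is the same for any spacing because `dt` is not an input.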

2606.19390 2026-06-19 cs.SE cs.AI 交叉投稿

Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework

面向执行约束的自主AI自动化：一种可复现的AIBOM驱动的CSAF-VEX框架

Petar Radanliev, Omar Santos, Carsten Maple, Kay Atefi

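AI summary: Proposes a protocol-driven framework that binds SBOM and AIBOM artifacts to deterministic environment capture and structured runtime telemetry, combines static and runtime evidence to generate CSAF VEX advisories, verifies them through cryptographic signing and deterministic replay, and evaluates the approach on synthetic agentic AI workloads.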

Journal ref Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework. Front Artif Intell 9, (May 2026), 1826384

2606.19474 2026-06-19 cs.CR cs.AI cs.SE 交叉投稿

Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix

LLM辅助后量子密码开发中的安全编码漂移：一种游戏化修复方案

R. D. N. Shakya, C. P. Wijesiriwardana, S. M. Vidanagamachchi, Nalin A. G. Arachchilage

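AI summary: Proposes a model of secure coding drift in LLM-assisted PQC development and a gamified framework that turns LLMs into active security co-pilots, mitigating the security degradation caused by long-term reliance on LLMs.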

Comments Accepted for 2026 SIGIR Workshop on Vulnerabilities in Generative Systems for Information Retrieval track

AI中文摘要

向后量子密码学（PQC）的过渡引入了相当大的实现复杂性，要求严格遵守恒定时间执行、侧信道抵抗和精确参数化。同时，大型语言模型（LLM）已深度嵌入软件开发工作流程，包括密码工程。虽然LLM提高了生产力，但证据表明它们经常生成不安全或次优的代码，特别是在安全关键领域。本文引入了PQC中的安全编码漂移，这是一种新颖的社会技术漏洞模型，捕捉了由于持续依赖LLM生成的代码而导致的安全编码实践逐渐退化。与先前关注静态漏洞的工作不同，我们将安全风险概念化为一种源于人机交互的纵向行为现象。为了缓解这一问题，我们提出了一种游戏化的、LLM增强的安全编码框架，将对抗性评估、行为反馈和安全评分嵌入开发工作流程。我们的方法将LLM从被动助手重新定义为主动安全协作者，为AI中介环境中的更安全PQC实现做出贡献。

英文摘要

The transition to Post-Quantum Cryptography (PQC) introduces considerable implementation complexity, requiring strict adherence to constant-time execution, side-channel resistance, and precise parametrisation. Simultaneously, large language models (LLMs) are heavily embedded in software development workflows, including cryptographic engineering. While LLMs improve productivity, evidence shows that they frequently generate insecure or suboptimal code, particularly in security-critical domains. This paper introduces Secure Coding Drift in PQC, a novel socio-technical vulnerability model capturing the gradual degradation of secure coding practices due to sustained reliance on LLM-generated code. Unlike prior work that focuses on static vulnerabilities, we conceptualise security risk as a longitudinal behavioural phenomenon arising from human-AI interaction. To mitigate this, we propose a gamified, LLM-augmented secure coding framework that embeds adversarial evaluation, behavioural feedback, and security scoring into development workflows. Our approach reframes LLMs from passive assistants into active security co-pilots, contributing toward safer PQC implementation in AI-mediated environments.

2606.19755 2026-06-19 cs.CR cs.AI 交叉投稿

Measuring the Biological Capabilities and Risks of AI Agents

Patricia Paskov, Jeffrey Lee, Kyle Brady, Alyssa Worland

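AI summary: For agentic systems such as AI scientists that autonomously carry out multi-step scientific tasks, this paper introduces biological agent evaluations as an interpretive tool and offers practice-based considerations for defining, designing, running, scoring, and documenting evaluations, to help decision-makers interpret results cautiously and to guide investment.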

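AI abstract

This paper addresses a rapidly emerging policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI scientists, that is, agentic AI systems that can carry out multi-step scientific tasks autonomously or collaboratively. As these systems enter real research pipelines, decision-makers increasingly face evaluation results whose meaning depends on underlying design choices that are often implicit or poorly documented. We synthesize existing evidence on AI-enabled biological risk and introduce biological agent evaluations as a promising tool for assessing these systems, one that requires careful interpretation. Our core contribution is a set of practice-based considerations, drawn from our own evaluations, showing how choices about defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk. The analysis is intended to help policymakers interpret biological evaluation outputs with appropriate caution; to guide public and private funders toward high-leverage investments in AI-biology evaluation research; and to support biosecurity practitioners who assess emerging AI systems. Secondary audiences include researchers who design or conduct agent evaluations at frontier AI labs, AI providers, scientific institutions, and third-party evaluation organizations.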

Analyzing Defensive Misdirection Strategies against Model-Guided Automated Attacks in Agentic AI Systems

Reza Soosahabi, Vivek Namsani

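AI summary: Analyzes the attack-defense setting of agentic AI systems through a probabilistic model and proposes a detect-and-misdirect strategy (such as CMPE) as an alternative to conventional detect-and-block defenses. Misleading responses lower the attacker success rate, and on benchmarks the estimated ASR upper bound falls by up to two orders of magnitude.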

AI中文摘要

智能体AI系统越来越依赖语言模型组件来解释指令、处理外部数据、调用工具以及与其他智能体协调。这些能力使得提示注入和越狱攻击的后果更加严重，尤其是当攻击者采用模型引导的自动化来扩展探测、提示优化和响应评估时。本文通过目标系统、其防御机制以及攻击者的自动评判器的概率模型来分析由此产生的攻击-防御场景。我们的分析表明，传统的“检测-拦截”防御可能使攻击者成功率（ASR）随着查询预算的增长而趋近于1，因为可预测的拒绝为自动化搜索提供了有用的反馈。然后，我们研究了“检测-误导”策略，其中检测到的恶意交互会收到受控的、非操作性的响应，旨在诱导攻击者评判器产生假阳性错误。这种策略降低了攻击者选择候选的正预测值，并产生有界的渐近ASR。我们通过渐进式参与的上下文误导（CMPE）评估了该策略的概念验证实现，这是一种轻量级的对话误导方法，旨在在自动化越狱设置中用安全但具有战略误导性的响应替换可预测的拒绝文本。在越狱基准测试中，CMPE将估计的ASR上限降低了两个数量级，并在端到端PAIR和GPTFuzz攻击运行中几乎消除了验证的攻击成功。

英文摘要

Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. These capabilities make prompt-injection and jailbreak attacks more consequential, especially as attackers adopt model-guided automation to scale probing, prompt refinement, and response evaluation. This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge. Our analysis shows that conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search. We then examine detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker's judge. This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR. We evaluate a proof-of-concept realization of this strategy through Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings. On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude and nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
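
The qualitative claim (ASR approaches one under detect-and-block, but stays bounded under detect-and-misdirect) can be reproduced in a deliberately simplified toy model. This is not the paper's probabilistic model. It assumes independent attempts that each truly succeed with probability `p_success`, and an attacker who stops at the first candidate its judge flags; under that assumption the long-run success rate equals the positive predictive value of the judge.

```python
def asr_detect_and_block(p_success, n_queries):
    """Toy model: predictable refusals let the attacker keep searching, so at
    least one true success among n independent attempts has probability
    1 - (1 - p)^n, which tends to 1 as the budget grows.
    """
    return 1.0 - (1.0 - p_success) ** n_queries


def asymptotic_asr_misdirect(p_success, judge_tpr, judge_fpr):
    """Toy model: decoy responses make the attacker's judge flag failures too.

    If the attacker commits to the first flagged candidate, the long-run
    success rate is the positive predictive value of the judge.
    """
    flagged_true = p_success * judge_tpr
    flagged_false = (1.0 - p_success) * judge_fpr
    total = flagged_true + flagged_false
    return flagged_true / total if total > 0 else 0.0
```

With a 1% per-attempt success rate, a thousand blocked-and-retried queries push the first quantity above 0.99, while a judge fooled half the time by decoys caps the second near 0.02. Setting `judge_fpr=0` recovers a value of 1, matching the detect-and-block limit.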

2606.20510 2026-06-19 cs.CR cs.AI 交叉投稿

Mitigating the Legibility Tax with Decoupled Prover-Verifier Games

Yegon Kim, Juho Lee

发表机构 * KAIST（韩国科学技术院）

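AI summary: Proposes the Decoupled Prover-Verifier Game (DPVG), which separates correctness from checkability by training a translator model that converts a fixed solver's solutions into a checkable form, improving checkability while preserving answer correctness and thereby addressing the legibility tax.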

Comments ICLR 2026 Workshop Trustworthy AI

2505.22829 2026-06-19 cs.LG cs.AI 版本更新

Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies

弥合分布偏移与AI安全：概念与方法论的协同

Chenruo Liu, Kenan Tang, Yao Qin, Qi Lei

发表机构 * Center for Data Science, New York University ； Computer Science Department, University of California, Santa Barbara ； Department of Electrical & Computer Engineering, University of California, Santa Barbara ； Courant Institute for Mathematical Sciences & Center for Data Science, New York University

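AI summary: Analyzes the conceptual and methodological synergies between distribution shift and AI safety, establishes two kinds of connections between specific shift types and fine-grained safety issues, and promotes closer integration of research in the two fields.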

Comments 35 pages

2509.03122 2026-06-19 cs.CL cs.AI cs.LG 版本更新

From Construction to Injection: Edit-Based Fingerprints for Large Language Models

从构建到注入：面向大型语言模型的基于编辑的指纹

Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Linlin Wang

发表机构 * East China Normal University（华东师范大学）； Hasso Plattner Institute/University of Potsdam（哈索罗普拉特纳研究所/波茨坦大学）

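AI summary: Proposes an end-to-end injected fingerprinting framework that uses Code-mixing Fingerprints and Multi-Candidate Editing to address the imperceptibility and robustness challenges of fingerprints in black-box deployment.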

Comments preprint

AI中文摘要

可靠的模型指纹对于保护大型语言模型（LLMs）免受未经授权的重新分发和商业滥用至关重要。在黑盒部署中，验证受到对可疑指纹查询的防御性过滤以及可能削弱嵌入所有权证据的下游模型修改的阻碍。这些风险要求指纹在构建和注入方面都具有鲁棒性。在构建方面，先前的范式面临不可感知性的权衡：自然语言指纹可能被意外激活，而乱码指纹在统计上暴露且更容易被过滤。在注入方面，现有方法难以在模型修改下保持持久的触发-目标行为。我们提出了一个端到端的注入指纹框架来解决这些挑战。代码混合指纹（CF）在高复杂度约束下使用最低困惑度的代码混合来缓解这种双向不可感知性权衡。多候选编辑（MCEdit）构建结构冗余、间隔分离的触发-目标映射，以在模型修改下实现优雅降级。在不可感知性、可检测性和无害性方面的广泛评估表明，该框架在几乎不影响实用性的情况下实现了鲁棒的所有权验证。

英文摘要

Reliable model fingerprints are essential for protecting large language models (LLMs) against unauthorized redistribution and commercial misuse. In black-box deployment, verification is hindered by defensive filtering of suspected fingerprint queries, as well as by downstream model modifications that may weaken embedded ownership evidence. These risks require fingerprints to be robust in both construction and injection. For construction, prior paradigms face an imperceptibility trade-off: natural-language fingerprints may be accidentally activated, whereas garbled fingerprints are statistically exposed and easier to filter. For injection, existing methods struggle to preserve persistent trigger--target behaviors under model modification. We propose an end-to-end injected fingerprinting framework to address these challenges. Code-mixing Fingerprints (CF) use lowest-perplexity code-mixing under a high-complexity constraint to mitigate this two-sided imperceptibility trade-off. Multi-Candidate Editing (MCEdit) constructs structurally redundant, margin-separated trigger--target mappings to enable graceful degradation under model modification. Extensive evaluations on imperceptibility, detectability, and harmlessness demonstrate robust ownership verification with negligible impact on utility.

2511.04260 2026-06-19 cs.CV cs.AI 版本更新

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Proto-LeakNet：面向合成人脸图像中信号泄漏感知的归因方法

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

发表机构 * Department of Mathematics and Computer Science（数学与计算机科学系）； University of Catania（卡塔尼亚大学）

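AI summary: Proposes Proto-LeakNet, which exploits signal-leak traces in diffusion models and combines closed-set classification with density-based open-set evaluation to achieve interpretable generator attribution; trained only on closed-set data, it also works on unseen generators.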

Comments 44 pages, 27 figures, 11 tables

DOI: 10.1016/j.cviu.2026.104848

AI中文摘要

合成图像和深度伪造生成模型的日益复杂使得源归因和真实性验证成为现代计算机视觉系统的关键挑战。最近的研究表明，扩散管道会在其输出中无意中留下持久的统计痕迹，称为信号泄漏，特别是在潜在表示中。基于这一观察，我们提出了Proto-LeakNet，一个信号泄漏感知且可解释的归因框架，它将闭集分类与基于密度的开集评估相结合，对学习到的嵌入进行开集评估，从而无需重新训练即可分析未见过的生成器。我们的方法作用于扩散模型的潜在域，重新模拟部分前向扩散以暴露残留的生成器特定线索。一个时间注意力编码器聚合多步潜在特征，而一个特征加权原型头则结构化嵌入空间并实现透明的归因。仅在闭集数据上训练并达到98.13%的宏AUC，Proto-LeakNet学习到的潜在几何结构在后处理下保持鲁棒，超越了最先进的方法，并且在真实图像与已知生成器之间以及已知与未见生成器之间实现了强可分离性。代码库可在以下链接获取：this https URL。

英文摘要

The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates Closed-set classification with a density-based Open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet .

2602.04306 2026-06-19 cs.CL cs.AI 版本更新

DeFrame: Debiasing Large Language Models Against Framing Effects

DeFrame: 消除大语言模型中的框架效应偏差

Kahee Lim, Soyeon Kim, Steven Euijong Whang

发表机构 * KAIST（韩国科学技术院）

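AI summary: Targets the inconsistent bias that large language models show on semantically equivalent but differently framed prompts. Proposes a framing-aware debiasing method that quantifies framing disparity and strengthens cross-framing consistency, reducing overall bias and improving robustness.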

Comments Accepted to Findings of ACL 2026

AI中文摘要

随着大语言模型（LLMs）在现实应用中的日益部署，确保其在不同人口群体中的公平响应变得至关重要。尽管做出了许多努力，但一个持续的挑战是隐藏的偏见：LLMs 在标准评估下表现公平，但在这些评估设置之外可能产生有偏见的响应。在本文中，我们识别出框架——语义等价的提示在表达方式上的差异（例如，“A 比 B 好” vs. “B 比 A 差”）——作为导致这一差距的一个未被充分探索的因素。我们首先引入“框架差异”的概念来量化框架对公平性评估的影响。通过用替代框架扩充公平性评估基准，我们发现（1）公平性得分随框架变化显著，以及（2）现有的去偏方法改善了整体（即框架平均）公平性，但往往未能减少框架引起的差异。为了解决这个问题，我们提出了一种框架感知的去偏方法，鼓励 LLMs 在不同框架之间更加一致。实验表明，我们的方法减少了整体偏见，并提高了对框架差异的鲁棒性，使 LLMs 能够产生更公平和更一致的响应。

英文摘要

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.

2603.19423 2026-06-19 cs.CR cs.AI cs.LG 版本更新

TRAP: A Benchmark for Task Completion and Resistance to Active Privacy Extraction

Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh

发表机构 * Dept. of Electrical Engineering, POSTECH（POSTECH电子工程系）； Grad. School of Artificial Intelligence, POSTECH（POSTECH人工智能研究生院）； School of Computing, KAIST（韩国科学技术院计算机学院）

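AI summary: Introduces the TRAP benchmark, which evaluates how well agents balance task accuracy against privacy leakage in document-intensive tasks. It finds non-trivial leakage in all models, shows that prompt-based defenses cannot jointly achieve high task success and zero leakage probability, and proposes structural private field isolation.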

AI中文摘要

智能体越来越多地部署在文档密集型工作流中，其中敏感私人信息不是边缘情况而是常规输入，例如，预订航班的智能体需要护照号码。在这种情况下，智能体必须使用私人信息准确完成任务，同时绝不在其响应中暴露这些信息，因为它无法验证键盘前实际是谁。这两个义务存在根本性矛盾。一个能够使用私人信息完成任务的模型，同样可能被诱导泄露这些信息。为了评估任务准确性与隐私泄露之间的权衡，我们引入了任务完成与主动隐私提取抵抗（TRAP）。每个场景包括一个包含私人信息的文档、一个要求智能体使用私有字段调用正确工具的任务查询，以及一个试图以自然语言引出相同信息的攻击查询。评估了涵盖前沿专有和开源模型的22个模型，我们发现所有模型系列都表现出非平凡的泄露，并且指令遵循能力与泄露率相关。现有的基于提示的防御减少了泄露，但以显著降低任务准确性为代价。提示优化未能摆脱这种权衡。我们证明这种失败并非偶然。对于任何基于softmax的模型，没有软约束防御（例如基于提示的防御）能够同时实现高任务成功率和零泄露概率。受这一不可能性结果的启发，我们提出了结构化的私有字段隔离，该方法在私有字段到达模型之前用哈希键替换它们。这种方法在保持任务准确性的同时很大程度上防止了泄露。

英文摘要

Agents are increasingly deployed in document-intensive workflows where sensitive private information is not an edge case but a routine input, e.g., an agent booking a flight needs passport numbers. In such settings, the agent must use private information to complete tasks accurately while never exposing it in its responses, because it cannot verify who is actually at the keyboard. These two obligations are in fundamental tension. A model capable enough to use private information for task completion can, by the same capability, be induced to reveal it. To evaluate the trade-off of task accuracy and privacy leakage, we introduce Task-completion and Resistance to Active Privacy-extraction (TRAP). Each scenario includes a document containing private information, a task query that requires the agent to invoke the correct tool using private fields, and an attack query that attempts to elicit the same information in natural language. Evaluating 22 models spanning frontier proprietary and open-source models at multiple scales, we find that all model families exhibit non-trivial leakage, and that instruction-following ability correlates with leakage rate. Existing prompt-based defenses reduce leakage but at significant cost to task accuracy. Prompt optimization fails to escape this trade-off. We demonstrate that this failure is not incidental. For any softmax-based model, no soft-constraint defense, e.g., prompt-based defenses, can jointly achieve high task success with zero leakage probability. Motivated by this impossibility result, we propose structural private field isolation, which replaces private fields with hash keys before they reach the model. This approach largely prevents leakage while keeping task accuracy.
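
The idea of replacing private fields with hash keys before they reach the model can be sketched as a small vault that sits between the document and the agent. The `FieldVault` class, the salted SHA-256 digest, and the `<PRIVATE:...>` placeholder format are assumptions for illustration; the abstract does not say how the authors derive or resolve their keys.

```python
import hashlib


class FieldVault:
    """Swap private values for opaque keys, and swap them back for tool calls."""

    def __init__(self, salt):
        self._salt = salt
        self._store = {}

    def seal(self, value):
        """Return an opaque key for a private value and remember the mapping."""
        digest = hashlib.sha256((self._salt + value).encode("utf-8")).hexdigest()[:16]
        key = f"<PRIVATE:{digest}>"
        self._store[key] = value
        return key

    def redact(self, document, private_values):
        """Replace each private value in the text before the model sees it."""
        for value in private_values:
            document = document.replace(value, self.seal(value))
        return document

    def resolve(self, tool_args):
        """Restore real values in tool arguments, outside the model's context."""
        return {name: self._store.get(arg, arg) for name, arg in tool_args.items()}
```

Because the model only ever handles the key, a natural-language extraction attempt can at most reveal the placeholder, while a tool call that passes the key through `resolve` still receives the real passport number.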

2606.19469 2026-06-19 cs.AI cs.SE 新提交

Measuring Curriculum Alignment across Topical Coverage, Competency, and Cognitive Depth: A Longitudinal Framework Applied to CS2013 and CS2023

衡量课程在主题覆盖、能力与认知深度上的一致性：应用于CS2013和CS2023的纵向框架

Sherzod Turaev, Mary John, Saja Aldabet, Mamoun Awad, Nazar Zaki, Khaled Shuaib

发表机构 * United Arab Emirates University（阿联酋大学）； Abu Dhabi Polytechnic（阿布扎比理工学院）

AI总结提出一种人机协同流程，通过语义检索与人工确认，纵向衡量计算机科学课程对CS2013和CS2023指南的覆盖情况，发现课程覆盖稳定但新指南对认知深度要求更高。

Comments 24 pages, 5 figures, 8 tables

AI中文摘要

本科计算机科学教育受约每十年修订一次的国际课程指南指导，但各项目缺乏可靠且可重复的方法来衡量其对当前指南的覆盖程度，以及当指南重组时覆盖情况如何变化。我们通过一个人机协同流程解决此问题，该流程衡量项目对外部知识体系的覆盖情况，并纵向应用于一个经认证的计算机科学学士学位项目，对照计算机科学课程2013（CS2013）和2023（CS2023）。该流程将项目和每个指南表示为结构化语料库，通过语义检索生成候选课程-知识单元匹配，并在明确的覆盖定义下通过人工判断确认。在七个基准检索器中，倒数秩融合集成最强，而知名长上下文模型表现不如小型句子模型，因此必须衡量检索器的选择。两个映射由独立第二评分者验证（CS2023的Cohen's kappa为0.64，CS2013为0.69）。该项目覆盖CS2023的49.7%和CS2013的50.9%的知识单元，十年间几乎恒定。将相同的检索-确认设计扩展到能力表述和认知深度，显示项目在每个指南下对约88%的覆盖单元表述了能力，但在CS2023下对76%的现有单元以推荐深度交付，而在CS2013下为95%，这一差距反映了新指南提高了期望，而非项目本身。纵向比较将持久的结构性差距（并行与分布式计算、编程语言基础、系统基础）——这些差距在两种指南和ABET下均未覆盖——与反映标准演变的差异区分开来。该工具可重用，并可向作者索取。

英文摘要

Undergraduate computer science is governed by international curricular guidelines revised about once a decade, yet programs lack a reliable, reproducible way to measure how completely they cover the current guidelines and how that coverage shifts when the guidelines are restructured. We address this with a human-in-the-loop pipeline that measures a program's coverage of an external body of knowledge, applied longitudinally to one accredited BSc in Computer Science against Computer Science Curricula 2013 (CS2013) and 2023 (CS2023). The pipeline represents the program and each guideline as structured corpora, generates candidate course-to-knowledge-unit matches by semantic retrieval, and confirms them through human judgment under an explicit coverage definition. Of seven benchmarked retrievers, a reciprocal-rank-fusion ensemble was strongest, and a reputed long-context model underperformed a small sentence model, so retriever choice must be measured. Both maps were validated by an independent second rater (Cohen's kappa 0.64 for CS2023, 0.69 for CS2013). The program covers 49.7% of CS2023 and 50.9% of CS2013 knowledge units, near-constant across a decade. Extending the same retrieve-then-confirm design to competency articulation and cognitive depth shows that the program articulates the competency for ~88% of covered units under each guideline, yet delivers it at the recommended depth for 76% of present units under CS2023 against 95% under CS2013, a gap reflecting the newer guideline's raised expectations, not the program. The longitudinal comparison separates persistent structural gaps (parallel and distributed computing, foundations of programming languages, systems fundamentals), uncovered against both guidelines and ABET, from differences that reflect the standard's evolution. The instrument is reusable and available from the authors on request.
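For readers unfamiliar with the reciprocal-rank-fusion ensemble named above, here is a minimal sketch of how several retrievers' ranked lists can be fused. The abstract does not give the fusion constant or the retriever outputs, so `k=60` (a commonly used default) and the knowledge-unit labels are assumptions for illustration only.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over all lists.

    `rankings` is a list of lists, each ordered best-first.
    `k` is an assumed smoothing constant, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical candidate knowledge units returned by three retrievers for one course.
rankings = [
    ["KU-A", "KU-B", "KU-C"],
    ["KU-B", "KU-A", "KU-D"],
    ["KU-B", "KU-C", "KU-A"],
]
fused = reciprocal_rank_fusion(rankings)  # ['KU-B', 'KU-A', 'KU-C', 'KU-D']
```

In the pipeline described above, a fused list like this would only supply candidates; a human rater still confirms each course-to-knowledge-unit match.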

2606.19704 2026-06-19 cs.AI 新提交

超越准确性：衡量预测模型的逻辑合规性

Guillaume Olivier Delplanque, Pierre Genevès, Nabil Layaïda, Zephirin Faure

AI总结提出规则违反分数（RVS），一种独立于预测准确性的评估指标，用于量化预测模型对逻辑规则的遵守程度，并通过实验证明两个准确率相近的模型可能表现出截然不同的逻辑合规性。

AI中文摘要

机器学习模型主要通过预测性能指标进行评估，如排序质量、预测误差或分类准确性。虽然这些指标有效量化了预测与真实值的匹配程度，但它们不评估模型输出是否尊重预定义的逻辑或领域特定约束。在医疗、金融和自主系统等高风险应用中，逻辑一致性可能与预测准确性同样关键，但尚无标准指标捕捉这一维度。我们引入了规则违反分数（RVS），这是一种互补的评估指标，独立于预测准确性，量化预测模型对给定逻辑规则集的遵守程度。RVS 对硬规则（严格约束）和软规则（统计规律）区别对待，可在任何数据集和任何在关系词汇上表达的预测模型上进行评估，并可通过为 Horn 规则自动生成的 SQL 查询进行计算。除了评估模型，RVS 还可以评估训练数据集的逻辑一致性，并帮助识别定义不良的规则。我们在三个基准测试上评估了 RVS，涵盖知识图谱链接预测和关系回归，包括基于规则、基于嵌入和神经符号的预测模型。我们的结果表明，两个实现相当预测准确性的模型可能表现出显著不同的逻辑合规性，揭示了标准指标无法捕捉的模型行为差异。

英文摘要

Machine learning models are predominantly evaluated through predictive performance metrics such as ranking quality, prediction error, or classification accuracy. While these metrics effectively quantify how closely predictions match the ground truth, they do not assess whether model outputs respect predefined logical or domain-specific constraints. In high-stakes applications, including healthcare, finance, and autonomous systems, logical consistency can be as critical as predictive accuracy, yet no standard metric captures this dimension. We introduce the Rule Violation Score (RVS), a complementary evaluation metric that quantifies the extent to which a predictive model respects a given set of logical rules, independently of predictive accuracy. RVS treats hard rules (strict constraints) and soft rules (statistical regularities) differently, can be evaluated on any dataset and on any predictive model expressed over a relational vocabulary, and can be computed using SQL queries that are automatically generated for Horn rules. Beyond evaluating models, RVS can also evaluate the logical consistency of training datasets and help identify poorly defined rules. We evaluate RVS on three benchmarks covering knowledge graph link prediction and relational regression, including rule-based, embedding-based, and neuro-symbolic predictive models. Our results demonstrate that two models achieving comparable predictive accuracy can exhibit substantially different levels of logical compliance, revealing differences in model behavior that standard metrics fail to capture.
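To make the "SQL queries generated for Horn rules" idea concrete, the sketch below checks one hand-written Horn rule, parent(x, y) AND parent(y, z) implies grandparent(x, z), against a set of predicted facts. It is not the paper's RVS definition, which the abstract does not give: the table names, the rule, and the returned (violations, groundings) pair are assumptions used only to show how a rule body becomes a join and a missing head becomes a `NOT EXISTS` filter.

```python
import sqlite3

def count_rule_violations(parent_facts, grandparent_facts):
    """Count groundings of parent(x,y) AND parent(y,z) whose head grandparent(x,z) is missing.

    Returns (violations, body_groundings). Both tables are hypothetical examples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE parent (x TEXT, y TEXT)")
    conn.execute("CREATE TABLE grandparent (x TEXT, z TEXT)")
    conn.executemany("INSERT INTO parent VALUES (?, ?)", parent_facts)
    conn.executemany("INSERT INTO grandparent VALUES (?, ?)", grandparent_facts)

    # The rule body is a self-join; each distinct (x, z) pair is one grounding.
    body = """
        SELECT DISTINCT p1.x AS x, p2.y AS z
        FROM parent AS p1 JOIN parent AS p2 ON p1.y = p2.x
    """
    (groundings,) = conn.execute(f"SELECT COUNT(*) FROM ({body}) AS body").fetchone()
    # A violation is a body grounding with no matching head fact.
    (violations,) = conn.execute(f"""
        SELECT COUNT(*) FROM ({body}) AS body
        WHERE NOT EXISTS (
            SELECT 1 FROM grandparent AS g WHERE g.x = body.x AND g.z = body.z
        )
    """).fetchone()
    conn.close()
    return violations, groundings

parents = [("ann", "bob"), ("bob", "cat"), ("bob", "dan")]
predicted_grandparents = [("ann", "cat")]
violations, groundings = count_rule_violations(parents, predicted_grandparents)  # (1, 2)
```

A hard rule would presumably require zero violations, while a soft rule would tolerate some rate of them; how RVS weights the two cases is left to the paper.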

2606.20227 2026-06-19 cs.AI cs.SE 新提交

IHBench：评估语音代理在结构化工作流中的中断后恢复能力

Ahmad Salimi, Wentao Ma, Yuzhi Tang, Dongming Shen, Mu Li, Alex Smola

发表机构 * Boson AI

AI总结提出IHBench基准，评估语音代理在结构化工作流中处理中断后的恢复能力，涵盖任务完成和恢复质量两个维度，实验表明闭源模型比开源模型更鲁棒。

AI中文摘要

部署在结构化工作流（客户服务、医疗调度、账户管理）中的语音代理必须处理频繁的用户中断，同时保持多步骤程序的进度。现有的语音能力模型基准侧重于中断的时机：闯入检测、端点检测和轮流对话动态。它们忽略了中断后发生的情况：代理是否在正确的步骤恢复工作流？是否处理了用户的插话？是否避免重复用户已经听过的内容？我们引入了IHBench（中断处理基准），这是一个评估语音代理在10个企业领域中执行状态机驱动工作流时的中断后恢复能力的基准。六种中断类型在话语中间的控制点注入，并随数据生成每个中断的评估标准。每个中断在两个轴上评分：任务完成和恢复质量。我们评估了来自OpenAI、Google和开源社区的27个音频-语言模型配置。模型差异很大，恢复质量强烈依赖于中断类型。在我们的实验中，闭源模型比开源模型对中断更鲁棒：它们在任务完成上获胜的频率更高，随着对话变长，性能下降速度慢约3.3倍，并且没有音频与文本模态差距，而开源模型在这三个方面都处于劣势。一项人类研究验证了LLM评判员与人类标注者的一致性，与AudioMultiChallenge的跨基准分析表明，恢复质量在很大程度上是一个独立的能力轴。

英文摘要

Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for speech-capable models focus on the timing of interruptions: barge-in detection, endpointing, and turn-taking dynamics. They leave unmeasured what happens after the interruption: does the agent resume the workflow at the correct step? Does it address the user's interjection? Does it avoid re-delivering content the user already heard? We introduce IHBench (Interruption Handling Benchmark), a benchmark that evaluates post-interruption recovery in voice agents executing state-machine-driven workflows across 10 enterprise domains. Six interruption types are injected at controlled points mid-utterance, with per-interruption evaluation rubrics generated alongside the data. Each interruption is scored on two axes: task fulfillment and recovery quality. We evaluate 27 audio-language model configurations from OpenAI, Google, and the open-weight community. Models vary widely, and recovery quality depends strongly on the interruption type. Across our experiments, closed-weight models are consistently more robust to interruptions than open-weight ones: they win far more often on task fulfillment, degrade roughly 3.3x more slowly as conversations grow longer, and show no audio-versus-text modality gap, whereas the open-weight models lose ground on all three. A human study validates the LLM judge against human annotators, and a cross-benchmark analysis against AudioMultiChallenge indicates that recovery quality is a largely distinct capability axis.

2606.19597 2026-06-19 cs.SD cs.AI cs.LG 交叉投稿

PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets

PrefSQA: 用于语音质量评估的成对偏好预测及高质量数据集的关键作用

Junyi Fan, Donald S. Williamson

发表机构 * Department of Computer Science and Engineering, The Ohio State University, USA（美国俄亥俄州立大学计算机科学与工程系）

AI总结提出PrefSQA模型，通过不确定性感知logits、损伤注意力头和非匹配参考比较模块，利用高质量偏好数据集提升语音质量评估的准确性。

Comments Accepted to INTERSPEECH 2026

AI中文摘要

平均意见得分（MOS）广泛用于语音质量评估，但标量标签对评估者变异性和听力测试差异敏感，这引入了标签噪声，限制了MOS预测的可靠性。偏好预测通过让听者直接比较信号来减少这种变异性，产生更干净的标签。我们研究了无MOS的偏好预测，并提出了PrefSQA，它结合了不确定性感知logits、损伤注意力头以及基于非匹配参考比较的模块。我们使用并精炼了五个数据集，包括MOS衍生和低噪声模拟集（包含匹配和非匹配内容），在人类偏好集上进行实验，并在未见数据上测试。实验表明，在MOS衍生数据上改进较小，而其他数据集显示出相对于基线的明显改进，突显了高质量偏好数据的价值，并证明了所提出方法的有效性。

英文摘要

Mean opinion scores (MOS) are widely used for speech quality assessment, yet scalar labels are sensitive to rater variability and listening test differences. This introduces labeling noise, which limits the reliability of MOS prediction. Preference prediction reduces this variability as listeners compare signals directly, producing cleaner labels. We study MOS-free preference prediction and propose PrefSQA, which incorporates uncertainty-aware logits, an impairment attention head, and a module based on non-matching-reference comparisons. We use and refine five datasets, including MOS-derived and low-noise simulated sets with matching and non-matching content, experiment with human preference sets, and test on unseen data. Experiments show small improvements on MOS-derived data, while other sets reveal clear improvement over the baselines, highlighting the value of high-quality preference data and demonstrating the effectiveness of the proposed method.
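One plausible way to obtain "MOS-derived" preference pairs is to compare clips' scalar scores and keep only pairs that differ by a clear margin. The abstract does not describe how the authors build these sets, so the function name, the `margin` threshold, and the clip names below are assumptions for illustration.

```python
from itertools import combinations

def mos_to_preferences(mos_by_clip, margin=0.5):
    """Turn scalar MOS labels into pairwise preference labels.

    Returns tuples (clip_a, clip_b, label) with label 1 if clip_a is preferred,
    0 if clip_b is preferred. Pairs closer than `margin` are dropped because
    rater noise makes their ordering unreliable (an assumed heuristic).
    """
    pairs = []
    for a, b in combinations(sorted(mos_by_clip), 2):
        diff = mos_by_clip[a] - mos_by_clip[b]
        if abs(diff) >= margin:
            pairs.append((a, b, 1 if diff > 0 else 0))
    return pairs

scores = {"clip1": 4.2, "clip2": 3.1, "clip3": 4.0}
pairs = mos_to_preferences(scores)
# [('clip1', 'clip2', 1), ('clip2', 'clip3', 0)]; clip1 vs clip3 is too close to call
```

Pairs built this way inherit the noise of the underlying MOS labels, which is consistent with the abstract's observation that gains on MOS-derived data are small compared with directly collected preference data.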

AURA: 用于LLM作为评判审计的自适应不确定性感知精炼

Zilong Zhang, Yi-Ting Hung, Weiyi He, Junxi Zhang, Lei Ding, Chi-Kuang Yeh

AI总结提出AURA框架，通过自适应不确定性感知精炼，在少量人工验证下迭代学习人类一致性信号，优先审核不确定比较，提升LLM评判的可靠性。

AI中文摘要

大型语言模型（LLM）越来越多地被用作开放式生成的评判者，因为大规模人工评估通常昂贵且难以扩展，但它们的偏好仍然是人类判断的不完美代理。现有的审计流程通常假设事先存在可靠的示例子集或干净的监督信号，例如来自人工注释、启发式过滤或强评判者的输出。在LLM评估中，这一假设是脆弱的：初始分割可能继承评判者偏差，而人工验证通常过于稀缺，无法在规模上定义稳定组。我们提出AURA，一种自适应不确定性感知精炼框架，用于在选定的人工验证下审计成对LLM作为评判的决策。AURA迭代学习人类一致性信号，传播可靠证据，并优先将不确定的比较提交人工审核。关键思想是将对评判者的信任视为一个潜在量，随着证据积累逐步精炼。我们提供了紧凑的公式、稳定的精炼过程，以及在合成和真实成对LLM答案数据上的全面评估。

英文摘要

Large language models (LLMs) are increasingly used as judges for open-ended generation, as large-scale human evaluation is often expensive and difficult to scale, yet their preferences remain imperfect proxies for human judgment. Existing auditing pipelines often assume that a reliable subset of examples or clean supervision signals are available beforehand, for example from human annotation, heuristic filtering, or the outputs of strong judges. In LLM evaluation, this assumption is fragile: the initial split may inherit judge bias, while human verification is typically too scarce to define stable groups at scale. We propose AURA, an adaptive uncertainty-aware refinement framework for auditing pairwise LLM-as-a-judge decisions under selected human verification. AURA iteratively learns a human-consistency signal, propagates reliable evidence, and prioritizes uncertain comparisons for human review. The key idea is to treat trust in a judge as a latent quantity that is progressively refined as evidence accumulates. We provide a compact formulation, a stable refinement procedure, and a comprehensive evaluation on both synthetic and real pairwise LLM-answer data.

2606.19727 2026-06-19 cs.CL cs.AI 交叉投稿

NRITYAM: Language Models Meet Art and Heritage of Dance

NRITYAM：语言模型遇见舞蹈的艺术与遗产

Punit Kumar Singh, Niladri Ghosh, Advait Joshi, Shailee Choudhary, Michael Färber, Haiqin Yang

发表机构 * Shenzhen Technology University（深圳技术大学）； New Delhi Institute of Management（新德里管理学院）； Technische Universität Dresden（德累斯顿工业大学）； Ramakrishna Mission Vivekananda Educational and Research Institute（罗摩克里希纳传道会维韦卡南达教育与研究学院）； Indian Institute of Technology（印度理工学院）； Swami Vivekananda Institute of Technology（斯瓦米·维韦卡南达技术学院）； GuangDong Engineering Technology Research Center of Edge Intelligence（广东省边缘智能工程技术研究中心）

AI总结提出NRITYAM基准，包含9,260个跨12语言的文化问答对，评估语言模型对全球舞蹈传统的文化理解能力，涵盖多种模型类型。

Comments 18 pages, 12 figures, in ECML_PKDD'26

AI中文摘要

语言模型已成为塑造现代工作流程的重要工具。然而，其全球有效性取决于对当地社会文化背景的细致理解。为弥补这一差距，我们提出NRITYAM，一个用于评估语言模型在全球舞蹈传统背景下文化理解能力的综合基准。NRITYAM包含9,260个精心策划的问答对，涵盖12种语言，是专门用于评估舞蹈文化知识的最大数据集。该数据集通过与本地舞蹈艺术家和母语者的密切合作从头开发，他们创作并验证了特定地区的文化相关问题。我们评估了一系列模型，包括大型语言模型、小型语言模型、多模态大型语言模型和小型多模态语言模型。作为一个多语言和多文化基准，NRITYAM为评估AI系统理解和推理传统表演艺术的能力设定了新标准。详细数据集样本可在 https://github.com/niladrighosh03/NRITYAM 获取。

英文摘要

Language models have become essential tools in shaping modern workflows. However, their global effectiveness hinges on a nuanced understanding of local socio-cultural contexts. To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions. NRITYAM comprises 9,260 carefully curated question-answer pairs spanning 12 languages, making it the largest dataset dedicated to evaluating cultural knowledge in dance. The dataset has been developed from the ground up through close collaboration with native dance artists and native speakers of the languages, who authored and validated culturally relevant questions specific to their regions. We evaluate a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models. As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating the ability of AI systems to understand and reason about traditional performing arts. Detailed dataset samples are available at https://github.com/niladrighosh03/NRITYAM.

2606.19769 2026-06-19 cs.RO cs.AI 交叉投稿

Data Standards for Humanoid Robotics: The Missing Infrastructure for Physical AI

人形机器人数据标准：物理AI缺失的基础设施

Shaoshan Liu, Xiugong Qin, Xuan Wu, Xuan Xia, Ning Ding, Jialu Liu, Jie Tang

AI总结本文论证数据标准是人形机器人可扩展性的关键基础设施，通过提出ISO/WD 26264-1标准，解决数据非累积性问题，使具身经验可解释、可共享、可追溯和可复用。

AI中文摘要

人形机器人的可扩展性不仅取决于模型和硬件，还取决于物理经验能否在机器人、任务、组织及时间维度上积累。基于作者在ISO/TC 299/WG 16内制定ISO/WD 26264-1《人形机器人数据集——第1部分：通用要求》的工作，本文论证数据标准正成为物理AI的基础设施。我们提出三个见解：第一，人形机器人数据是具身交互数据，而非孤立数字样本的集合；有用的数据集必须保留机器人本体、动作、任务、场景、执行轨迹和结果之间的关系。第二，其价值取决于物理一致性：多模态流仅在时序、坐标系、标定、运动学、单位和同步假设可检查时才可复用。第三，主要瓶颈不仅是数据稀缺，更是由高采集成本、数据孤岛和不一致评估导致的非累积性数据。我们认为人形机器人数据标准通过使具身经验可解释、可共享、可追溯和可复用来解决这些瓶颈。通用标准应为生命周期管理、元数据、来源、质量、版本控制和可追溯性提供横向基础设施，而能力特定部分应定义操作、移动、人机交互、认知及未来人形能力的领域语法。随着AI从屏幕进入实体，数据标准必须从组织数字信息演变为结构化物理交互。

英文摘要

The scalability of humanoid robots will depend not only on models and hardware, but also on whether physical experience can accumulate across robots, tasks, organizations, and time. Drawing on the authors' work in developing ISO/WD 26264-1, Humanoid robot datasets -- Part 1: General requirements, within ISO/TC 299/WG 16, this article argues that data standards are becoming foundational infrastructure for Physical AI. We develop three insights. First, humanoid robot data is embodied interaction data, not a collection of isolated digital samples; a useful dataset must preserve the relationship among robot body, action, task, scene, execution trace, and outcome. Second, its value depends on physical coherence: multimodal streams are reusable only when timing, coordinate frames, calibration, kinematics, units, and synchronization assumptions remain inspectable. Third, the main bottleneck is not only data scarcity, but non-cumulative data caused by high collection costs, data silos, and inconsistent evaluation. We argue that humanoid robot data standards address these bottlenecks by making embodied experience interpretable, shareable, traceable, and reusable. A general standard should provide horizontal infrastructure for lifecycle management, metadata, provenance, quality, versioning, and traceability, while capability-specific parts should define domain grammar for manipulation, locomotion, human-robot interaction, cognition, and future humanoid capabilities. As AI moves from screens into bodies, data standards must evolve from organizing digital information to structuring physical interaction.

2606.19819 2026-06-19 cs.CL cs.AI 交叉投稿

CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis

CREDENCE: 面向分解与增强可信度的声明缩减——语义度量与收敛性分析

Phuong Huu Vu Tran, Thuan Duc Mai, Bach Xuan Le

发表机构 * Vietnamese-German University（越南德国大学）； Ho Chi Minh University of Technology（胡志明市理工大学）

AI总结提出CREDENCE框架，通过语义F1度量解决Jaccard度量对释义声明的低估问题，并形式化分析修复管道的收敛性，实验表明语义F1比Jaccard F1提升15-32个百分点，规则修复将原子性违反率降低47-100%。

Comments 40 pages, 6 figures, 19 tables. Submitted to Language Resources and Evaluation

AI中文摘要

将复合句分解为原子化的、可验证的声明是可靠自动化事实核查的前提。先前工作依赖基于词重叠（Jaccard）的度量，系统性地低估了释义声明的分解质量，并且缺乏对修复循环的形式化终止分析。我们提出CREDENCE，一个改进的声明分解与评估框架，解决了这两个缺陷。我们的贡献包括：(1) 语义F1：我们使用BGE-large余弦相似度保真度度量，解决了Jaccard的惩罚问题，并提高了下游事实核查的准确性；(2) 收敛定理：我们形式化地表征了修复管道的四个性质，确立了在预言解析器假设下基于规则的修复是单调且有限终止的；基于LLM的自修复被证明是非单调的，需要早期退出保护；(3) 三个评估基准，涵盖社交媒体、百科全书和新闻领域，用于跨领域泛化度量；(4) 跨四个分解器模型（3.8B-12B）和一个封闭API模型的多模型基准测试。在SocialClaimSplit、WikiSplitBench和ClaimDecompBench上的实验表明，语义F1比Jaccard F1提升15-32个百分点。在SocialClaimSplit和WikiSplitBench上，EPR范围为0.94至1.00，而ClaimDecompBench由于更难的新闻领域构造，包含较低的基线EPR情况（低至0.824），规则修复相对于基线模型将原子性违反率（AVR）降低了47-100%，且不降低保真度。

英文摘要

Decomposing compound sentences into atomic, verifiable claims is a prerequisite for reliable automated fact-checking. Prior work has relied on token-overlap (Jaccard) metrics that systematically underestimate decomposition quality for paraphrastic claims, and has lacked formal termination analysis for the repair loop. We present CREDENCE, a revised claim decomposition and evaluation framework addressing both shortcomings. Our contributions are: (1) Semantic-F1: a fidelity metric based on BGE-large cosine similarity that resolves Jaccard's penalisation of paraphrases and improves downstream fact-checking accuracy; (2) Convergence theorems: we formally characterise four properties of the repair pipeline, establishing that rule-based repair is monotone and finitely terminating under an oracle parser assumption, whereas LLM-based self-repair is provably non-monotone and requires an early-exit guard; (3) Three evaluation benchmarks spanning social-media, encyclopaedic, and news domains for cross-domain generalisation measurement; (4) Multi-model benchmarking across four decomposer models (3.8B-12B) and a closed API model. Experiments on SocialClaimSplit, WikiSplitBench, and ClaimDecompBench show that Semantic-F1 exceeds Jaccard-F1 by 15-32 pp. EPR ranges from 0.94 to 1.00 on SocialClaimSplit and WikiSplitBench, while ClaimDecompBench includes lower base EPR cases (down to 0.824) due to harder news-domain constructions. Rule-based repair reduces the Atomicity Violation Rate (AVR) by 47-100% relative to the base model without degrading fidelity.
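The penalty that token overlap imposes on paraphrases is easy to reproduce. The sketch below scores a predicted decomposition against a reference with a pluggable similarity function. The matching rule (best match above a threshold) and the threshold value are assumptions, not the paper's definition of Semantic-F1; in the paper's setting the `similarity` argument would be cosine similarity over BGE-large embeddings rather than the stand-ins used here.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two claims (lowercased, whitespace split)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

def matching_f1(predicted, reference, similarity, threshold=0.5):
    """F1 where a claim counts as matched if its best counterpart scores >= threshold.

    The thresholded best-match rule is an illustrative assumption.
    """
    if not predicted or not reference:
        return 0.0
    hit_p = sum(max(similarity(p, r) for r in reference) >= threshold for p in predicted)
    hit_r = sum(max(similarity(r, p) for p in predicted) >= threshold for r in reference)
    precision, recall = hit_p / len(predicted), hit_r / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

predicted = ["the company bought the startup", "the deal closed in 2020"]
reference = ["the firm acquired the startup", "the deal closed in 2020"]

# The first pair is a paraphrase, yet shares only 2 of 6 distinct tokens (Jaccard = 1/3),
# so the overlap-based F1 treats it as a miss.
overlap_f1 = matching_f1(predicted, reference, jaccard)  # 0.5
```

Swapping `jaccard` for an embedding-based similarity that scores the paraphrase pair above the threshold would lift the same decomposition to an F1 of 1.0, which is the kind of gap the abstract reports between the two metrics.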

2606.19887 2026-06-19 cs.CR cs.AI 交叉投稿

FinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming

FinRED：面向金融大语言模型红队测试的专家引导基准生成与评估框架

Chaeyun Kim, Daeyoung Park, Junghwan Kim, Jinyoung Jeong, Eunji Song, Yongtaek Lim, Minwoo Kim

AI总结提出FinRED框架，通过专家引导的两级分类法将全球金融标准映射为威胁，并利用真实金融文档生成上下文丰富的红队行为提示，结合专家验证的评估标准，有效降低关键假阴性。

AI中文摘要

现有的安全基准主要针对通用对抗场景，但忽略了金融领域的特定风险。金融大语言模型面临监管合规违规、欺诈助长和系统性信任侵蚀等问题，需要有针对性的评估。我们引入了FinRED，一个与金融专家共同开发的、用于金融大语言模型安全评估的专家引导红队测试框架。FinRED采用新颖的两级分类法，将全球标准（如FATF和EU DORA）映射到从监管规避到复杂欺诈的威胁，并结合可扩展的流水线，通过专家定义的架构将真实金融文档转换为上下文丰富的红队行为提示（种子）。严格的专家验证确认了种子的合理性和真实性，以实现有意义的LLM安全评估。我们还提供了一个经过专家验证的、金融专用的评估标准，该标准超越了免责声明检查，比静态的一刀切标准更贴近人类专家，并将关键假阴性从28个减少到12个。FinRED与国际采纳的风险管理和信息安全标准（如ISO/IEC 27001）保持一致，已在韩国金融安全研究院（FSI）的监管沙盒中部署，用于真实金融服务中的生成式AI安全评估。为减轻双重用途风险，数据集、生成流水线、提示模板和评估框架仅对合格研究人员开放，访问地址为：https://github.com/selectstar-ai/FinRED-paper 和 https://huggingface.co/datasets/datumo/FinRED 。

英文摘要

Existing safety benchmarks target general adversarial scenarios but miss finance-specific risks. Financial LLMs face regulatory compliance violations, fraud facilitation, and systemic trust erosion that require targeted evaluation. We introduce FinRED, an expert-guided red-teaming framework for financial LLM safety evaluation developed with financial experts. FinRED uses a novel two-level taxonomy mapping global standards (e.g., FATF and EU DORA) to threats ranging from regulatory evasion to complex fraud, integrated with a scalable pipeline that converts real financial documents into context-rich red-teaming Behavioral Prompts (seeds) through an expert-defined schema. Rigorous expert validation confirms seed plausibility and realism for meaningful LLM safety evaluation. We also provide an expert-validated, finance-specific rubric that goes beyond disclaimer checks, aligns more closely with human experts than static one-size-fits-all rubrics, and reduces critical false negatives from 28 to 12. Aligned with internationally adopted risk-management and information-security standards (e.g., ISO/IEC 27001), FinRED is deployed in South Korea's Financial Security Institute (FSI) regulatory sandbox for generative AI security evaluation in real financial services. To mitigate dual-use risks, the dataset, generation pipeline, prompt template, and evaluation framework are gated for qualified researchers at https://github.com/selectstar-ai/FinRED-paper and https://huggingface.co/datasets/datumo/FinRED.

2606.19965 2026-06-19 cs.CV cs.AI 交叉投稿

ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

ROSE：多模态模型中感知到行动差距的基准测试

Yihao Wang, Zijian He, Jie Ren, Keze Wang

发表机构 * Sun Yat-sen University（中山大学）； Shaanxi Normal University（陕西师范大学）

AI总结提出ROSE基准，通过固定视觉场景并变化区域约束与符号输出，测试多模态大模型在不同上下文中将相同视觉证据转化为所需行动的能力，发现模型性能下降高达44.5个百分点，揭示感知到行动的瓶颈。

Comments 29 pages, 11 figures

AI中文摘要

多模态大语言模型（MLLMs）越来越被期望基于视觉信息采取行动，然而同一场景在不同任务上下文中可能需要不同的行动。模型能否可靠地将相同的视觉证据转化为当前上下文所需的行动？为了回答这个问题，我们引入了ROSE（Reference-conditioned Oddity and Symbolic Execution），一个受控基准，它在保持视觉场景固定的同时变化区域约束和所需的符号输出。通过耦合的计数和坐标行动任务，ROSE测试模型是否能够推断出隐含的多数参考，并在变化的上下文中基于由此产生的细粒度视觉证据采取行动。在九个最近的MLLMs中，从计数导向任务到区域条件行动的性能下降高达44.5个百分点，而人类表现达到98.8%。这种差距在成对的场景和区域中持续存在，即使同一模型在这些场景和区域上返回正确的计数，而全局点击和匹配的局部控制表明坐标定位仅解释了部分损失，揭示了在将共享视觉证据转化为上下文特定行动时存在一个独特的、模型相关的瓶颈。

英文摘要

Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce ROSE (Reference-conditioned Oddity and Symbolic Execution), a controlled benchmark that holds the visual scene fixed while varying region constraints and required symbolic outputs. Through coupled counting and coordinate-action tasks, ROSE tests whether models can infer an implicit majority reference and act on the resulting fine-grained visual evidence under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action, despite 98.8% human performance. The gap persists on paired scenes and regions for which the same model returns the correct count, while global-click and matched local controls show that coordinate grounding explains only part of the loss, revealing a distinct, model-dependent bottleneck in turning shared visual evidence into context-specific actions.

2606.20089 2026-06-19 cs.CL cs.AI 交叉投稿

IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

IHUBERT: 面向波斯语资源的基于向量的语义去重与领域平衡预训练

Arash Ghafouri, Mahdi Firouzmandi, Hossein Saberi, Mohammad Reza Hasani Ahangar

AI总结提出IHUBERT，一个基于RoBERTa-base的波斯语预训练模型，通过多阶段预处理（包括基于向量数据库的语义去重和领域平衡）在45GB语料上训练，在多项NLU任务上取得领先结果，尤其抽取式问答表现突出。

AI中文摘要

波斯语预训练语言模型仍然受到大规模高质量预训练语料库稀缺以及标准分类和NER任务之外评估不足的限制。我们提出了IHUBERT，一个从头训练的波斯语单语PLM，采用RoBERTa-base编码器（1.25亿参数），在Sepahr-Danesh集合的45GB精选子集（约70-80亿token）上进行训练。为了提高语料质量并减少冗余，我们采用多阶段预处理流程，包括规范化、精确和近似重复去除、匿名化，以及基于向量数据库的语义去重，以实现跨领域和语体的分布平衡控制。我们还在完整的预训练语料库上训练了一个13.9万词汇量的BPE分词器，以更好地捕捉波斯语的形态和拼写变化。IHUBERT在七个波斯语NLU基准测试上进行评估，涵盖NER、情感分析、主题分类、NLI、抽取式问答和关系抽取，使用任务标准指标（实体级F1、宏F1、EM/F1）。IHUBERT在抽取式QA上取得了最强增益，在PQuAD（F1 88.3542）和ParsiNLU-RC（F1 49.0987）上均排名第一，并在FarsTail上取得了最佳结果（宏F1 0.8350）。在NER和主题分类上，它保持竞争力（例如，ParsTwiNER上F1 0.8308；DigiMag上宏F1 0.7953），而关系抽取仍然是主要差距（PERLEX上宏F1 0.6684）。在IHUBERT预训练语料库上的受控分词器消融实验表明，在匹配词汇量下，BPE产生的子词碎片化程度略低于WordPiece，支持了我们的分词设计。总体而言，IHUBERT通过语义精选的大规模预训练以及跨分类和理解型任务的广泛评估，推进了波斯语语言建模。

英文摘要

Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and reduce redundancy, we employ a multi-stage preprocessing pipeline that includes normalization, exact and near-duplicate removal, anonymization, and vector-database-based semantic deduplication for distribution balancing control across domains and registers. We additionally train a 139k-vocabulary BPE tokenizer on the full pretraining corpus to better capture Persian morphology and orthographic variation. IHUBERT is evaluated on seven Persian NLU benchmarks covering NER, sentiment analysis, topic classification, NLI, extractive question answering, and relation extraction, using task-standard metrics (entity-level F1, Macro-F1, EM/F1). IHUBERT achieves its strongest gains on extractive QA, ranking first on both PQuAD (F1 88.3542) and ParsiNLU-RC (F1 49.0987), and attains the best result on FarsTail (Macro-F1 0.8350). On NER and topic classification, it remains competitive (e.g., 0.8308 F1 on ParsTwiNER; 0.7953 Macro-F1 on DigiMag), while relation extraction remains the main remaining gap (0.6684 Macro-F1 on PERLEX). A controlled tokenizer ablation on the IHUBERT pretraining corpus shows that BPE yields slightly lower subword fragmentation than WordPiece at matched vocabulary size, supporting our tokenization design. Overall, IHUBERT advances Persian language modeling through semantically curated large-scale pretraining and broad evaluation across both classification and comprehension-oriented tasks.
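Semantic deduplication of the kind described above can be illustrated with a greedy cosine-similarity filter. This is a brute-force sketch under assumptions: the abstract does not state the embedding model, the similarity threshold, or the vector database used, and a real 45 GB corpus would need approximate nearest-neighbour search rather than the quadratic scan shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_dedup(embeddings, threshold=0.95):
    """Keep a document only if it is not too similar to any document already kept.

    `embeddings` holds one vector per document; returns the indices to keep.
    The threshold is an assumed value chosen for the example.
    """
    kept = []
    for idx, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept

# Toy 2-D "embeddings": documents 0 and 1 are near-duplicates, as are 2 and 3.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
kept = semantic_dedup(vectors)  # [0, 2]
```

Unlike exact or near-duplicate hashing, this filter removes texts that say the same thing in different words, which is also what allows it to thin out over-represented domains.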

2606.20177 2026-06-19 cs.CV cs.AI 交叉投稿

Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

评估与增强遥感多模态大语言模型的否定理解能力

Haochen Han, Jue Wang, Alex Jinpeng Wang, Fangming Liu

发表机构 * Peng Cheng Laboratory（鹏城实验室）； Tsinghua University（清华大学）； Central South University（中南大学）

AI总结提出RS-Neg基准评估遥感MLLMs的否定理解，并设计NeFo方法通过测试时学习利用约5%未标注样本显著提升模型性能。

Comments ECCV 2026 Accepted

AI中文摘要

多模态大语言模型（MLLMs）在各种遥感（RS）任务中取得了显著成功。然而，它们理解否定的能力仍未得到充分探索，限制了在现实应用中的部署，其中模型必须明确识别什么是错误的或不存在的，例如，应急响应人员需要定位非洪水路线进行疏散。为了全面研究这一局限性，我们引入了RS-Neg，这是第一个从区域级到场景级任务评估否定理解的基准。具体来说，我们为遥感图像设计了一个自动数据生成流程，使用LLMs合成多样化的否定查询，并引入了一个动态视觉焦点模块进行验证。我们的评估表明，先进的遥感MLLMs在否定理解上存在困难，表现出幻觉和显著的性能下降。为了弥补这一差距，我们提出了NeFo，一种新颖的测试时学习方法，将否定的逻辑角色明确纳入模型优化。值得注意的是，使用约5%的未标注测试样本，NeFo显著提升了模型的否定理解能力，并展现出对未见任务的强泛化能力。代码和数据将在接收后发布。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in various Remote Sensing (RS) tasks. However, their ability to comprehend negation remains underexplored, limiting deployment in real-world applications where models must explicitly identify what is false or absent, e.g., emergency responders need to locate non-flooded routes for evacuation. To comprehensively study this limitation, we introduce RS-Neg, the first benchmark to evaluate negation understanding across region-level to scene-level tasks. Specifically, we design an automated data generation pipeline for RS imagery, using LLMs to synthesize diverse negation queries, and introduce a dynamic visual focus module for verification. Our evaluation reveals that advanced RS MLLMs struggle with negation, exhibiting hallucinations and substantial performance degradation. To close this gap, we propose NeFo, a novel test-time learning method that explicitly incorporates the logical role of negation into the model optimization. Remarkably, using about 5% unlabeled test samples, NeFo significantly improves the negation understanding of models and shows strong generalization to unseen tasks. Code and data will be released upon acceptance.

2606.20235 2026-06-19 cs.IR cs.AI 交叉投稿

LLM智能体安全性、多轮红队测试、越狱基准、对抗鲁棒性、安全关键系统

Hanwool Lee, Dasol Choi, Bokyeong Kim, Seung Geun Kim, Haon Park

AI总结提出NRT-Bench基准，通过模拟核电站控制室的多轮红队测试，评估LLM智能体在安全关键系统中的对抗鲁棒性，发现不同模型的漏洞几乎不重叠，且防御效果高度依赖模型。

AI中文摘要

大型语言模型（LLM）智能体越来越多地被提议作为安全关键系统的监督组件，但它们在持续、自适应对抗压力下的鲁棒性仍鲜有表征。我们提出了NRT-Bench，一个用于对作为安全关键系统操作员的LLM智能体进行多轮红队测试的基准，实例化为一个模拟核电站控制室。一个由五个角色组成的操作员团队，每个角色由可配置的LLM支持，运行一个由六项关键安全功能（CSF）管理的工厂，而对手在有限的多轮会话中通过四个通道注入消息，每轮有反馈。危害是一个客观信号，而非LLM评判的文本：一旦任何CSF丢失，运行即终止，并归因于导致该消息。在固定攻击配对重放协议下评估四个前沿操作员模型，我们发现自适应多轮攻击可靠地将操作员团队推过安全极限：在这四个模型中，8.7%至12.1%的攻击会话以工厂失去关键安全功能告终。尽管这四个模型在此聚合率下看起来几乎同样鲁棒，但它们的失败几乎没有重叠：在149个会话中，没有一个会话击败所有四个模型，而三分之一的会话至少击败一个模型，因此漏洞在模型之间几乎是不相交的，而非嵌套的。添加防御的效果强烈依赖于模型：同一套护栏或安全顾问智能体对一个模型降低攻击成功率，却可能对另一个模型提高成功率。我们发布了模拟场地、攻击数据集和重放工具，用于LLM智能体的可重复安全评估。

英文摘要

Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Bench, a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system, instantiated in a simulated nuclear power plant control room. A five-role operator team, each backed by a configurable LLM, runs a plant governed by six critical safety functions (CSFs), while adversaries inject messages over four channels in bounded multi-turn sessions with per-turn feedback. Harm is an objective signal rather than LLM-judged text: a run terminates the moment any CSF is lost, attributed to the causing message. Evaluating four frontier operator models under a fixed-attack paired-replay protocol, we find that adaptive multi-turn attacks reliably push the operator team past a safety limit: across the four models, between 8.7% and 12.1% of attack sessions end with the plant losing a critical safety function. Although the four models look almost equally robust by this aggregate rate, their failures barely overlap: of 149 sessions, none defeat all four models while a third defeat at least one, so vulnerabilities are nearly disjoint across models rather than nested. The effect of added defences is strongly model-dependent: the same guardrail stack or safety-advisor agent that lowers attack success for one model can raise it for another. We release the simulation venue, attack dataset, and replay tooling for reproducible safety evaluation of LLM agents.
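The "similar aggregate rate, nearly disjoint failures" finding comes down to comparing per-model sets of defeated sessions. The sketch below shows that computation on invented toy data; the model names and session IDs are hypothetical and are not the paper's results.

```python
def overlap_summary(defeats_by_model, total_sessions):
    """Summarise how attack successes overlap across models.

    `defeats_by_model` maps a model name to the set of session IDs that
    ended in a lost critical safety function for that model.
    """
    sets = list(defeats_by_model.values())
    defeated_all = set.intersection(*sets) if sets else set()
    defeated_any = set.union(*sets) if sets else set()
    return {
        "per_model_rate": {m: len(s) / total_sessions for m, s in defeats_by_model.items()},
        "defeat_all": len(defeated_all),
        "defeat_any": len(defeated_any),
    }

# Toy data: three models, 20 replayed sessions, two defeats each.
toy = {"model_a": {1, 2}, "model_b": {3, 4}, "model_c": {5, 2}}
summary = overlap_summary(toy, total_sessions=20)
# Every model fails on 10% of sessions, yet no session defeats all three
# and 5 of the 20 sessions defeat at least one.
```

Because the paired-replay protocol runs the same fixed attacks against every model, session IDs are comparable across models, which is what makes this set comparison meaningful.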

2606.20502 2026-06-19 cs.CR cs.AI cs.SE 交叉投稿

Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

无理解的校准：诊断微调大语言模型在系统软件漏洞检测中的局限性

Arastoo Zibaeirad, Marco Vieira

AI总结提出CWE-Trace框架，通过834个Linux内核样本和两个诊断指标（DFI和HDD）评估LLM漏洞检测能力，发现数据污染无实质帮助，微调仅改变输出阈值而非决策策略，模型缺乏真正的安全推理能力。

AI中文摘要

大语言模型在漏洞基准测试中得分高，但究竟是真正推理安全还是仅对污染数据进行模式匹配，这一问题仍未解决。我们提出CWE-Trace，一个基于834个手动整理的Linux内核样本（涵盖74个CWE）构建的LLM漏洞检测框架。该框架强制执行严格的时间分割（2025年前的历史集/截止后的无泄漏集），保留上下文感知的易受攻击-修补对，并引入两个诊断指标：方向性失败指数（DFI）和层次距离与方向（HDD）。我们评估了8个原始LLM和15个LoRA微调变体，涵盖非目标检测、目标检测和CWE分类。分析得出两个关键结果。首先，数据污染未提供可衡量的优势。函数级分析显示，84%的名义污染样本不携带可用的记忆信号：易受攻击的函数缺失或跨数据集交叉映射，约31%的污染样本存在CWE误分类。其次，骨干方向性先验主导微调。模型表现出稳定、系统性的失败模式（DFI范围从-85.5到+94.8个百分点），这些模式从历史数据持续到截止后数据，且难以纠正。微调改变了输出阈值，但未改变决策策略。这是无理解的校准：输出分布适应训练数据，而底层安全推理仍然缺失。在二元检测中最弱的骨干（DeepSeek-R1）在粗粒度CWE分类中提升最大，表明检测和理解是解耦的能力。最佳检测得分仅达到52.1%（比随机高2.1个百分点）；精确CWE排名Top-1准确率仍低于1.3%，证实当前LLM无论采用何种微调策略，都缺乏对系统软件的可靠安全推理能力。

英文摘要

Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs. The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable-patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD). We evaluate eight vanilla LLMs and 15 LoRA fine-tuned variants across non-targeted detection, targeted detection, and CWE classification. Our analysis yields two key results. First, data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry CWE misclassification. Second, backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction. Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent. The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities. The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.

2606.20523 2026-06-19 cs.CV cs.AI cs.DB cross-list

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

SARLO-80：全球斜距SAR语言光学数据集80cm

Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Elise Colin, Georgia Channing

Affiliations: DEMR-ONERA – The French Aerospace Lab, Université Paris-Saclay; DTIS-ONERA – The French Aerospace Lab, Université Paris-Saclay; Hugging Face

AI summary: To address the scarcity of data aligning high-resolution SAR with optical imagery and text, this work builds a dataset of SAR-optical-text triplets on an 80cm slant-range grid from Umbra SLC data, supporting cross-modal retrieval and generation tasks.

AI Chinese abstract

多模态基础模型因大规模光学基准而快速发展，但合成孔径雷达（SAR）的类似资源仍然有限。现有的SAR-光学数据集主要依赖低分辨率、仅强度的地面距离检测（GRD）产品，未保留复值SAR测量或原生采集几何，限制了基于物理的多模态学习。特别是，结合甚高分辨率（VHR）SAR SLC、对齐光学图像和自然语言描述的大规模公开数据集仍然缺乏。我们提出了一个基于开源Umbra聚束模式采集的传感器独立复数据（SICD）构建的VHR SAR-光学-文本数据集。从约2500个全球场景（VV/HH，20cm–2m原生分辨率）出发，通过带限FFT重采样将所有SAR数据标准化到80cm斜距网格，并将图像分割为1024×1024的图块。对于每个SAR图块，我们检索高分辨率光学图块，并利用局部坐标对应关系将其扭曲到SAR网格以实现局部像素级对齐。我们进一步为每个样本生成三种描述变体（短/中/长），以支持视觉-语言训练和评估。我们的数据集包含119,566个三元组（复数和幅度斜距SAR图块、对齐光学图块、自然语言描述），覆盖72个国家的257个地点以及广泛的地物类型和基础设施。我们发布固定的训练/验证/测试划分以及完整的预处理和基线代码，以支持在原生SAR几何中进行跨模态检索和条件生成的多模态对齐的可重复基准测试。该数据集在Hugging Face Hub上公开可用，网址为https://huggingface.co/datasets/ONERA/SARLO-80。

English abstract

Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR-optical datasets largely rely on low-resolution, intensity-only Ground Range Detected (GRD) products and do not preserve complex-valued SAR measurements or native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale public datasets combining very-high-resolution (VHR) SAR SLC, aligned optical imagery, and natural-language descriptions are still lacking. We present a VHR SAR-optical-text dataset built from open-access Umbra spotlight acquisitions distributed as Sensor Independent Complex Data (SICD). From around 2,500 worldwide scenes (VV/HH, 20cm–2m native resolution), we standardize all SAR data to an 80cm slant-range grid via band-limited FFT resampling and tile the imagery into 1024 by 1024 patches. For each SAR patch, we retrieve a high-resolution optical tile and warp it into the SAR grid using local coordinate correspondences for local pixel-level alignment. We further generate three caption variants (SHORT/MID/LONG) per sample to support vision-language training and evaluation. Our dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) covering 257 locations across 72 countries and a broad range of land types and infrastructures. We release fixed train/validation/test splits and the full preprocessing and baseline code to enable reproducible benchmarks for multimodal alignment on cross-modal retrieval and conditional generation in native SAR geometry. The dataset is publicly available on the Hugging Face Hub at https://huggingface.co/datasets/ONERA/SARLO-80.

2506.14990 2026-06-19 cs.AI updated version

MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

MEAL: 持续多智能体强化学习基准

Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Constantin Ruhdorfer, Bram Grooten, Fabrice Kusters, Yali Du, Andreas Bulling, Mykola Pechenizkiy, Meng Fang

Affiliations: Eindhoven University of Technology, The Netherlands; University of Edinburgh, UK; University of Stuttgart, Germany; King's College London, UK; University of Liverpool, UK

AI summary: Proposes the MEAL benchmark, which uses JAX and GPU acceleration to train on sequences of 100 tasks and reveals failure modes that emerge in long sequences.

Comments To be published in the International Conference on Machine Learning (ICML) 2026

2603.28387 2026-06-19 cs.AI cs.LG updated version

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

脚手架效应：提示框架如何驱动临床VLM评估中的表面多模态增益

Doan Nam Long Vu, Simone Balloccu

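Affiliations: Technical University of Darmstadt

AI summary: The study finds that, in clinical VLM evaluation, merely mentioning MRI availability in the prompt accounts for 70-80% of the performance gain regardless of whether image data is present; this "scaffold effect" shows that surface-level evaluation fails to reflect genuine multimodal reasoning.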

AI Chinese abstract

可信的临床AI要求性能提升反映真实的证据整合而非表面伪影。我们在两个临床神经影像队列FOR2107（情感障碍）和OASIS-3（认知衰退）上评估了12个开源视觉语言模型（VLM）的二分类性能。两个数据集都包含结构MRI数据，但这些数据不携带可靠的个体级诊断信号。在这些条件下，较小的VLM在引入神经影像上下文后F1分数提升高达58%，蒸馏模型变得与规模大一个数量级的模型相当。对比置信度分析显示，仅仅在任务提示中提及MRI可用性就解释了70-80%的转变，与影像数据是否存在无关，这是模态坍塌的一个领域特定实例，我们称之为脚手架效应。专家评估揭示了在所有条件下捏造基于神经影像的正当理由，而偏好对齐虽然消除了引用MRI的行为，却使两种条件都退化为随机基线。我们的发现表明，表面评估不足以作为多模态推理的指标，这对VLM在临床环境中的部署有直接影响。

English abstract

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

2604.05435 2026-06-19 cs.AI updated version

CareTransition-Audit: A Benchmark to Audit Discharge Summaries for Efficient Care Transitions

CareTransition-Audit：用于高效护理过渡的出院总结审计基准

Akshat Dasula, Prasanna Desikan, Jaideep Srivastava, Shivali Dalmia, Abhishek Mukherji

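Affiliations: Department of Computer Science & Engineering, University of Minnesota-Twin Cities, Minneapolis, USA; Centific AI Research, Redmond, USA

AI summary: Proposes an LLM-based automated framework that audits the completeness of discharge summaries with a 46-item checklist and benchmarks 11 models on MIMIC-IV; the best models reach a Cohen's kappa of about 0.5 against clinician labels, and all models struggle to identify ambiguous documentation.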

Comments Accepted as a poster at IEEE-ICHI 2026; Accepted at SD4H@ICML

AI Chinese abstract

不完整或不一致的出院文档会导致护理碎片化和可避免的再入院。尽管其在患者安全中至关重要，但审计出院总结依赖于人工审查且无法扩展。我们提出一个使用大语言模型（LLM）的自动化审计框架。我们的方法将DISCHARGED框架操作化为一个包含46个问题的检查清单。使用来自MIMIC-IV数据库的50份总结及临床医生真实标签，我们对11个LLM进行基准测试。模型评估的平均文档完整性范围为54.9%至74.2%，最佳模型与临床医生标签的Cohen's kappa值约为0.5，表明中等一致性。所有模型在识别模糊文档（Unclear）方面均存在困难，突显了当前自动化审计的关键差距。本工作为临床文档的系统性质量改进提供了临床医生验证的基准和零样本基线。

English abstract

Incomplete or inconsistent discharge documentation drives care fragmentation and avoidable readmissions. Despite its critical role in patient safety, auditing discharge summaries relies on manual review and does not scale. We propose an automated framework for auditing discharge summaries using large language models (LLMs). Our approach operationalizes the DISCHARGED framework into a checklist of 46 questions. Using 50 summaries from the MIMIC-IV database, with clinician ground-truth labels, we benchmark 11 LLMs. Model-assessed mean documentation completeness ranges from 54.9% to 74.2%, and the best-performing models achieve Cohen's kappa values around 0.5 against clinician labels, indicating moderate agreement. All models struggle to identify ambiguous documentation (Unclear), highlighting a key gap in current automated auditing. This work provides a clinician-validated benchmark and zero-shot baselines for systematic quality improvement in clinical documentation.
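To make the agreement figure concrete, here is a minimal sketch of how Cohen's kappa is conventionally computed between two sets of labels. The label values `Yes` and `No` and the eight example answers are assumptions for illustration (only `Unclear` appears in the abstract), and this is the textbook statistic rather than the authors' evaluation code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Fraction of items on which the two raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)


# Hypothetical checklist answers for eight questions.
model = ["Yes", "Yes", "No", "No", "Unclear", "Yes", "No", "Yes"]
clinician = ["Yes", "No", "No", "No", "Yes", "Yes", "Unclear", "Yes"]
kappa = cohens_kappa(model, clinician)
print(round(kappa, 3))
```

In this toy example the raters agree on 5 of 8 items (62.5%), yet kappa is only 7/19 because a large share of that agreement is expected by chance; this is why kappa is reported instead of raw agreement.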

2604.07593 2026-06-19 cs.AI updated version

Too long; didn't solve

太长；没解决

Lucía M. Cabrera, Isaac Saxton-Knight, Jocelyn D'Arcy

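Affiliations: Instituto Balseiro; Poindexter Labs

AI summary: Studies how prompt length and solution length relate to large language model performance on math problems, finding that both correlate positively with model failure rate.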

2605.25160 2026-06-19 cs.AI updated version

DRFLOW：用于个性化工作流预测的深度研究基准

Md Tawkat Islam Khondaker, Raymond Li, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Issam H. Laradji

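Affiliations: ServiceNow AI Research

AI summary: Proposes the DRFLOW benchmark, which evaluates AI agents' ability to predict personalized workflows from heterogeneous sources; it comprises 100 tasks across 5 domains and 7 diagnostic metrics, and experiments show that existing agents achieve limited performance.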

AI Chinese abstract

深度研究（DR）系统越来越多地用于复杂信息寻求任务，但现有工作主要关注生成报告和摘要。相比之下，许多企业任务需要代理识别具体的工作流，即一系列行动步骤。例如，代理不应总结预算政策，而应能确定回答诸如“在固定预算下如何申请新员工？”这类问题所需的步骤。因此，我们引入DRFLOW，一个用于评估代理从异构源预测个性化工作流的基准。每个任务要求代理从分散来源中识别相关证据，然后使用这些证据预测用户任务的正确行动步骤序列。DRFLOW包含跨五个领域的100个任务，1246个参考工作流步骤，基于超过3900个来源。我们定义了七个诊断指标，涵盖事实依据、步骤恢复、结构排序、条件解决和个性化。我们进一步提出DRFLOW-Agent（DRFA），一个面向工作流的参考代理，用于预测个性化工作流。我们表明，尽管DRFA相比强基线代理有所改进（平均F1分数提升高达10.02%），但在这些工作流指标上仍有很大的改进空间，表明预测完整且正确的个性化工作流仍然是深度研究的一个挑战性前沿。

English abstract

Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing work mainly focuses on generating reports and summaries. In contrast, many enterprise tasks require an agent to identify concrete workflows, that is, sequences of action steps. For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as "How do I request new headcount given a fixed budget?" Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources. Each task requires the agent to identify relevant evidence from scattered sources, then use that evidence to predict the correct action-step sequence for the user's task. DRFLOW contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. We define seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. We further present DRFLOW-Agent (DRFA), a workflow-oriented reference agent for predicting personalized workflows. We show that although DRFA improves over strong baseline agents (by up to 10.02% in average F1 score), substantial room for improvement remains across these workflow metrics, indicating that predicting complete and correct personalized workflows remains a challenging frontier for deep research.

2606.18950 2026-06-19 cs.AI updated version

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench: 视觉语言模型战略推理的RTS基准

San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi

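Affiliations: Seoul National University

AI summary: Proposes RTSGameBench, built on the game Beyond All Reason, which evaluates the strategic reasoning of vision-language models in real-time strategy games through diverse matchups, mini-game diagnostics, and a self-evolving generation framework.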

Comments First two authors contributed equally

AI Chinese abstract

现代视觉语言模型（VLM）在竞争和合作环境中的不确定性下，往往难以进行战略推理，即预测和影响其他智能体的行为。实时策略（RTS）游戏可以作为诊断这一局限性的自然测试平台，因为它们要求与盟友协调、适应对手策略，并在部分可观测性下进行长期规划。然而，现有的RTS基准评估范围有限，缺乏系统的能力诊断，并且局限于预设计的场景覆盖。为了解决这些限制，我们提出了RTSGameBench，它建立在Beyond All Reason之上，这是一款大规模RTS游戏，其扩展战场要求比现有测试平台更广泛的策略多样性。该基准通过多种对战结构提供评估，通过迷你游戏进行诊断性评估，每个迷你游戏针对单个战略能力，并通过自进化生成框架实现可扩展的覆盖，该框架将自由形式的查询转化为新的迷你游戏，并在连续循环中改进。此外，为了让VLM在大规模RTS游戏中运行，我们提供了RTSGameAgent，它通过具有智能体记忆的有限状态机（FSM）管理单位。我们通过实验验证，多个最先进的VLM在对战需要更紧密协调、多智能体协调以及任务规模增加时表现不佳。

English abstract

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategies, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain confined to pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than existing testbeds. The proposed benchmark provides evaluation through diverse gameplay across various matchup structures, diagnostic assessment via mini-games that each target an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games and improves over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent, which manages units through a finite-state machine (FSM) with agentic memory. We empirically validate that multiple state-of-the-art VLMs perform poorly when matchups demand tighter coordination or multi-agent coordination, and when task scale increases.

2606.19245 2026-06-19 cs.AI cs.LG updated version

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

TxBench-PP：分析AI代理在小分子临床前药理学中的表现

Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman

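Affiliations: LatchBio

AI summary: Proposes the TxBench-PP benchmark for evaluating whether AI agents can recover preclinical pharmacology conclusions from real assay data; the strongest configuration, Claude Opus 4.8 / Pi, passes only 59.3% of endpoint attempts.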

AI Chinese abstract

人工智能（AI）代理有望通过压缩解释和决策循环来加速药物发现，但实际部署需要基于现实程序决策的可信评估。我们引入了TherapeuticsBench临床前药理学（TxBench-PP），这是一个针对小分子临床前药理学的可验证基准，也是更广泛的TherapeuticsBench在药物发现阶段和治疗模式中的首个聚焦切片。TxBench-PP测试代理是否能够从真实实验数据中恢复准确的结论，而非从文献中记忆的事实。该基准包含100个评估，按程序阶段、实验类型和任务结构索引，涵盖作用机制（MoA）和药效学（PD）推理、化合物-靶点结合、因果靶点验证、可开发性与安全性以及转化疗效。代理接收现实的工作流程快照，在编码环境中检查文件，并返回确定性评分的结构化答案。在16个模型-工具配置（包括11个模型和4,800条轨迹）中，没有系统能够可靠地恢复临床前药理学决策。最强配置Claude Opus 4.8 / Pi通过了59.3%的端点尝试（178/300；95% CI, 51.1-67.6），其次是GPT-5.5 / Pi，为55.3%（166/300；47.0-63.6）。

English abstract

Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).

2507.19653 2026-06-19 cs.NI cs.AI cs.LG updated version

On the Limitations of Ray-Tracing for Learning-Based RF Tasks in Urban Environments

关于射线追踪在城市环境中基于学习的射频任务局限性的研究

Armen Manukyan, Hrant Khachatrian, Edvard Ghukasyan, Theofanis P. Raptis

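Affiliations: Yerevan State University, Yerevan, Armenia; YerevaNN, Yerevan, Armenia; Institute of Informatics and Telematics, National Research Council, Pisa, Italy

AI summary: Evaluates the Sionna ray-tracing simulator against real measurements from central Rome, finding that antenna location and orientation strongly affect fidelity while solver hyper-parameters have little effect; after optimization, correlation improves by 5%-130% and localization error falls by one-third, but residual urban noise remains a challenge.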

Comments This work was supported by funding under the bilateral agreement between CNR (Italy) and HESC MESCS RA (Armenia) as part of the DeepRF project for the 2025-2026 biennium, and by the HESC MESCS RA grant No. 22rl-052 (DISTAL)

Journal ref 2026 IEEE Wireless Communications and Networking Conference (WCNC)

DOI: 10.1109/WCNC65185.2026.11555460

AI Chinese abstract

我们研究了Sionna v1.0.2射线追踪在罗马市中心户外蜂窝链路中的真实感。我们使用了包含1,664个用户设备（UE）和六个名义基站（BS）站点的真实测量数据集。利用这些固定位置，我们系统地改变了主要仿真参数，包括路径深度、漫反射/镜面反射/折射标志、载波频率，以及天线的属性如高度、辐射方向和方向图。通过测量功率与仿真功率之间的Spearman相关性，以及基于RSSI指纹的k近邻定位算法，对每个基站的仿真保真度进行评分。在所有实验中，求解器超参数对所选指标的影响微不足道。相反，天线位置和方向被证明是决定性的。通过简单的贪婪优化，我们将不同基站的Spearman相关性提高了5%到130%，而仅使用仿真数据作为参考点的kNN定位误差在真实世界样本上减少了三分之一，但仍比纯真实数据的误差高一倍。因此，精确的几何形状和可信的天线模型是必要但不充分的；忠实地捕捉残余的城市噪声仍然是实现可迁移、高保真户外射频仿真的一个开放挑战。

English abstract

We study the realism of Sionna v1.0.2 ray-tracing for outdoor cellular links in central Rome. We use a real measurement set of 1,664 user equipments (UEs) and six nominal base-station (BS) sites. Using these fixed positions, we systematically vary the main simulation parameters, including path depth, diffuse/specular/refraction flags, and carrier frequency, as well as antenna properties such as altitude, radiation pattern, and orientation. Simulator fidelity is scored for each base station via the Spearman correlation between measured and simulated powers, and by a k-nearest-neighbor (kNN) localization algorithm using RSSI-based fingerprints. Across all experiments, solver hyper-parameters have an immaterial effect on the chosen metrics. By contrast, antenna locations and orientations prove decisive. With simple greedy optimization, we improve the Spearman correlation by 5% to 130% for various base stations, and the kNN localization error obtained using only simulated data as reference points decreases by one-third on real-world samples, although it remains twice as high as the error with purely real data. Precise geometry and credible antenna models are therefore necessary but not sufficient; faithfully capturing the residual urban noise remains an open challenge for transferable, high-fidelity outdoor RF simulation.
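The fidelity score named in the abstract, the Spearman correlation between measured and simulated powers, can be sketched in plain Python as the Pearson correlation of average ranks. The five power values below (in dBm) are made up for illustration, and the helper names are hypothetical; this is the standard statistic, not the authors' pipeline.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical received powers at five UE positions for one base station.
measured = [-70.0, -85.0, -60.0, -95.0, -78.0]
simulated = [-72.0, -80.0, -65.0, -99.0, -90.0]
rho = spearman(measured, simulated)
print(round(rho, 3))
```

Because the score depends only on rank order, a simulator with a constant power offset would still score perfectly; it measures whether the simulation orders locations by signal strength the way the measurements do.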

2603.01250 2026-06-19 cs.CV cs.AI updated version

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

MAMA-MIA挑战：推进乳腺MRI肿瘤分割与治疗反应预测的泛化性和公平性

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar, Richard Osuala, Maciej Bobowicz, Xavier Bargalló, Paulius Jaruševičius, Kai Geissler, Raphael Schäfer, Muhammad Alberb, Tony Xu, Anne Martel, Daniel Sleiman, Navchetan Awasthi, Hadeel Awwad, Joan C. Vilanova, Robert Martí, Daan Schouten, Jeong Hoon Lee, Mirabela Rusu, Eleonora Poeta, Luisa Vargas, Eliana Pastor, Maria A. Zuluaga, Jessica Kächele, Dimitrios Bounias, Alexandra Ertl, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Carlos Martín-Isla, Oliver Díaz, Laura Igual, Karim Lekadir

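Affiliations: Barcelona Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona

AI summary: Presents the MAMA-MIA Challenge, a standardized benchmark for breast MRI tumor segmentation and pathologic complete response prediction that analyzes model generalizability and fairness on cross-continental, multi-center data, finding a trade-off between performance and subgroup fairness.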

AI Chinese abstract

乳腺癌是全球女性中最常诊断的恶性肿瘤，也是癌症相关死亡的主要原因之一。动态对比增强磁共振成像在肿瘤表征和治疗监测中发挥核心作用，尤其是接受新辅助化疗的患者。然而，现有的乳腺磁共振成像人工智能模型通常使用异质性数据集、研究人群和评估协议进行开发和评估，使得直接比较困难，并限制了跨机构和临床相关患者亚组的模型鲁棒性理解。MAMA-MIA挑战旨在通过提供标准化基准来解决这些问题，该基准用于联合评估原发性肿瘤分割和仅使用治疗前磁共振成像预测病理完全缓解。训练队列包括来自美国多家机构的1506名患者，而评估则在来自三个独立欧洲中心的574名患者的外部测试集上进行，以评估跨大陆和跨机构的泛化性。统一的评分框架结合了预测性能与年龄、绝经状态和乳腺密度方面的亚组一致性。26个国际团队参加了最终评估阶段。结果表明，在共同的外部评估框架下，性能存在显著差异，并揭示了整体准确性与亚组公平性之间的权衡。该挑战提供了标准化数据集、评估协议和公共资源，以促进开发稳健且公平的乳腺癌影像人工智能系统。

English abstract

Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are typically developed and evaluated using heterogeneous datasets, study populations, and assessment protocols, making direct comparison difficult and limiting understanding of model robustness across institutions and clinically relevant patient subgroups. The MAMA-MIA Challenge was designed to address these challenges by providing a standardized benchmark for the joint evaluation of primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under a common external evaluation framework and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.

2604.13416 2026-06-19 cs.CV cs.AI updated version

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

DF3DV-1K：用于无干扰新视角合成的大规模数据集与基准

Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin

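Affiliations: University of Technology Sydney; University of Sydney; National Yang Ming Chiao Tung University

AI summary: To fill the lack of a large-scale real-world dataset for distractor-free radiance fields, this work builds DF3DV-1K, with 1,048 scenes that each provide clean and cluttered image sets, and uses it to benchmark nine recent methods, identifying the most robust methods and the most challenging scenarios.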

AI Chinese abstract

辐射场领域的进展已实现逼真的新视角合成。在多个领域中，已开发出大规模真实世界数据集以支持全面基准测试并促进超越场景特定重建的进展。然而，对于无干扰辐射场，每个场景同时包含干净和杂乱图像的大规模数据集仍然缺乏，限制了发展。为填补这一空白，我们引入了DF3DV-1K，一个包含1048个场景的大规模真实世界数据集，每个场景提供干净和杂乱的图像集用于基准测试。该数据集总共包含89,924张使用消费级相机拍摄的图像，模拟随意拍摄，涵盖128种干扰类型和161种场景主题，包括室内和室外环境。一个精心挑选的41个场景子集DF3DV-41被系统设计用于评估无干扰辐射场方法在挑战性场景下的鲁棒性。利用DF3DV-1K，我们对九种最新的无干扰辐射场方法和3D高斯泼溅进行了基准测试，识别出最鲁棒的方法和最具挑战的场景。除了基准测试，我们还展示了DF3DV-1K的一个应用：微调基于扩散的2D增强器以改进辐射场方法，在保留集（例如DF3DV-41）和On-the-go数据集上实现了平均0.96 dB PSNR和0.057 LPIPS的提升。我们希望DF3DV-1K能促进无干扰视觉的发展，并推动超越场景特定方法的进步。数据集和排行榜可在以下网址获取：https://johnnylu305.github.io/df3dv1k_web/。

English abstract

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches. The dataset and leaderboard are available at https://johnnylu305.github.io/df3dv1k_web/.
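For readers unfamiliar with the PSNR figure quoted above, the following minimal sketch shows the conventional definition, PSNR = 10 log10(MAX^2 / MSE). Representing an image as a flat list of intensities in [0, 1] and the four example pixel values are assumptions for illustration; the abstract does not describe the authors' evaluation code.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel intensities in
    [0, max_value]. Identical images yield infinity.
    """
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and equally sized")
    mse = sum((r - p) ** 2 for r, p in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


# Hypothetical four-pixel images.
reference = [0.0, 0.5, 1.0, 0.25]
rendered = [0.1, 0.5, 0.9, 0.25]
score = psnr(reference, rendered)
print(round(score, 2))
```

Since PSNR is logarithmic in the mean squared error, a 0.96 dB gain on a single image corresponds to the MSE shrinking by a factor of 10^0.096, roughly 1.25, that is, about 20% lower squared error.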

2605.10873 2026-06-19 cs.CV cs.AI updated version

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

CADBench：一个用于AI辅助CAD程序生成的多模态基准

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

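Affiliations: Massachusetts Institute of Technology

AI summary: Proposes CADBench, a unified benchmark for multimodal CAD program generation with 18,000 samples and six benchmark families; it evaluates 11 vision-language systems and reveals three recurring failure modes in CAD program generation.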

AI Chinese abstract

从图像或3D观测中恢复可编辑的CAD程序是AI辅助设计的核心，但进展难以衡量，因为现有评估分散在数据集、模态和指标上。我们引入CADBench，一个统一的多模态CAD程序生成基准。CADBench包含18000个评估样本，涵盖来自DeepCAD、Fusion 360、ABC、MCB和Objaverse的六个基准家族，五种输入模态包括干净的网格、噪声网格、单视图渲染、逼真渲染和多视图渲染，以及六个指标，涵盖几何保真度、可执行性和程序紧凑性。STEP-based家族按B-rep面数分层，所有家族均进行多样性采样，以支持在复杂性和物体变化方面的受控分析。我们评估了11种CAD专用和通用的视觉语言系统，生成超过140万个CAD程序。在理想输入下，专用的网格到CAD模型显著优于代码生成VLMs，后者仍远未可靠。CADBench进一步揭示了三种常见的失败模式：几何复杂性增加时重建质量下降，CAD专用模型在模态转移下可能变得脆弱，且模型排名在不同指标下会变化。这些结果将CADBench定位为衡量可编辑3D重建和多模态CAD理解进展的诊断测试平台。该基准在https://huggingface.co/datasets/DeCoDELab/CADBench上公开可用。

English abstract

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://github.com/anniedoris/CADBench.

2606.17165 2026-06-19 stat.ME cs.AI econ.EM math.ST stat.TH updated version

Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

基于LLM的A/B测试的统计基础：用于人类因果推断的替代指标框架

Joel Persson, Mårten Schultzberg, Sebastian Ankargren

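Affiliations: Spotify USA, Inc.

AI summary: Proposes a surrogacy-theoretic framework, showing that calibrating LLM outputs identifies the average treatment effect under conditions weaker than distributional equivalence, and analyzes the bias and variance introduced by LLM stochasticity.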

AI Chinese abstract

组织和研究者越来越有兴趣在A/B测试中使用大型语言模型（LLM）代替人类参与者，以期更快、更低成本地进行实验。我们研究当在LLM结果上估计的处理效应何时能够恢复在感兴趣的人类群体上测量的效应。LLM与人类结果之间的分布等价性会使任何标准估计量有效，但这不现实。因此，我们开发了一个统计框架，将替代终点理论适配到LLM。该框架表明，将LLM结果校准到人类结果，在替代性和可比性条件（联合弱于分布等价性）下，可以识别平均处理效应。当这些条件不成立时，感兴趣的效应仅部分可识别，我们提供了诊断方法，可以在历史实验上证伪替代性，并给出有限重叠下最坏情况偏差的界限。我们进一步证明，LLM固有的随机性会引入偏差和方差，但使用多次抽取的平均值作为替代指标可以同时缓解两者。我们在模拟和Upworthy标题的A/B测试应用中展示了方法和理论。我们工作的一个核心结论是，LLM结果作为替代指标的有效性只能对过去的处理被证伪，而无法对新处理被验证，因此对于新颖干预，人类实验仍然不可或缺。我们讨论了LLM选择、提示和温度作为设计变量的作用，以及如何确定人类实验的规模以进行验证。

English abstract

Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes can recover the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs, showing that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. We present a falsification test for surrogacy and a bound on the worst-case bias from limited overlap between the LLM and human samples. We further show that the stochasticity inherent to LLMs can weaken surrogacy for identification while also introducing bias and variance during estimation, but that using an average over multiple LLM draws per unit as the surrogate mitigates these issues. Simulations validate the results, and an empirical application to A/B tests on Upworthy headlines shows that raw LLM predictions recover only 39% of the human treatment effect while nonparametric calibration closes the gap. A central takeaway is that A/B testing on LLMs yields correct results only by assumption, whereas A/B testing on humans is correct by design, and that the required assumptions are hardest to justify precisely where A/B testing on LLMs promises the greatest benefit. We discuss the role of LLM choice, prompting, and temperature as design variables, the compounded challenge posed by long-term outcomes, and how to size human pilot studies for validation.

2606.18613 2026-06-19 cs.CL cs.AI updated version

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

LLMs 是否已准备好辅助医生？PhysAssistBench：交互式医患-电子病历辅助基准

Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang

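Affiliations: Aalto University; Tencent; Harbin Institute of Technology, Shenzhen; Hong Kong Polytechnic University; Aarhus University; Technical University of Munich

AI summary: Proposes the PhysAssistBench benchmark, which builds interactive patient agents to evaluate how well LLMs coordinate doctor-patient-EHR interaction; it finds current models unreliable, with the bottleneck lying in coordination across capabilities rather than in any single capability.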

Comments 34 pages with 8 figures

AI Chinese abstract

医疗LLM最合理的近期角色是辅助而非替代医生，但当前的评估通常测试孤立能力：临床知识、EHR系统交互或患者沟通。而医生辅助需要在同一交互中协调这些能力，其中医生提出不明确的请求，患者模糊描述症状，EHR系统要求精确的工具使用。我们引入PhysAssistBench，一个用于交互式医患-EHR辅助的基准。基于真实的MIMIC-IV病例，PhysAssistBench使用可扩展的流水线构建交互式、记录驱动的患者代理，将静态EHR记录转化为多轮临床场景，同时保持临床事实准确性。PhysAssistBench提供了一个精选的双语评估集，包含1,296个经过人工审查和医生验证的轮次。与领先LLM的实验表明，当前模型在此设置下仍不可靠，这暴露了临床LLM的关键瓶颈：可靠的辅助需要知识、沟通和系统之间的协调，而非任何单一能力的孤立提升。

English abstract

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

2606.18970 2026-06-19 cs.LG cs.AI cs.CV updated version

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

脑MRI的量子潜GAN增强的受控基准测试

Syed Mujtaba Haider, Silvia Figini

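Affiliations: Department of Mathematics; Department of Political and Social Sciences

AI summary: Through a controlled benchmark, compares quantum and classical generators for brain MRI data augmentation, finding that neither significantly outperforms training on real data alone and that the quantum generator offers no additional advantage.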

AI Chinese abstract

医学图像分类常受限于有限的标注数据，因此生成式增强被提出；最近，量子生成模型被用于此目的，并经常报告准确率提升。然而，这些声称通常基于单次训练运行，未匹配量子与经典生成器的参数预算，也未表征任何收益出现的数据范围。我们提出了一个受控基准测试，隔离量子生成器对脑MRI增强的贡献。图像被编码到KL正则化的潜在空间中，在该空间中，使用变分量子生成器或参数数量几乎相同的经典生成器（1648 vs. 1632）训练带有梯度惩罚的条件Wasserstein GAN。合成样本被解码并用于增强预训练分类器，覆盖从5%到100%的标注数据比例，通过八个随机种子进行配对显著性检验（多重比较校正）以及集内多样性和潜在分布分析。在所有比例下，没有增强变体显著优于仅用真实数据训练，且量子与经典生成器在统计上无法区分。任何低数据优势表现为正则化而非忠实的数据扩展：合成样本分布外移，并且在数据稀缺时严重模式崩溃，而量子生成器并不比经典生成器更多样化。我们发布该协议作为医学成像中量子生成增强严格评估的测试平台。

English abstract

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off-distribution and severely mode-collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.
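The abstract mentions paired significance testing with multiple-comparison correction but does not name the correction. As one common choice, assumed here purely for illustration, the Holm-Bonferroni step-down procedure can be sketched as follows; the four p-values are made up and stand in for one test per labeled-data fraction.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans, aligned with `p_values`, that is True
    where the corresponding null hypothesis is rejected at family-wise
    error rate `alpha`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared against
        # alpha / (m - k + 1); stop at the first failure.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


# Hypothetical p-values from four paired comparisons.
p_values = [0.010, 0.040, 0.030, 0.005]
decisions = holm_bonferroni(p_values)
print(decisions)
```

Note that 0.030 and 0.040 would each pass an uncorrected 0.05 threshold but are not rejected here, which illustrates how correcting for several data fractions can turn apparent gains into non-significant ones.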


2606.19501 2026-06-19 cs.AI cs.CL cs.LG q-fin.RM 新提交

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

DeXposure-Claw: 一个用于DeFi风险监管的智能体系统

Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He

发表机构 * University of Edinburgh（爱丁堡大学）； University of Glasgow（格拉斯哥大学）； University of Cambridge（剑桥大学）

AI总结针对DeFi监管中LLM智能体易误报的问题，提出DeXposure-Claw系统，通过图时间序列基础模型预测风险网络，结合确定性监控和置信度门控生成可审计监管票据，并构建六轴评估基准DeXposure-Bench，实验验证有效性。

详情

AI中文摘要

去中心化金融使监管者面临快速变化的网络化信用风险。通用LLM智能体不适合此场景：它们过度解读弱证据并推荐高风险干预，而现有评估无法提供符合监管者需求的误报衡量方式。我们提出DeXposure-Claw，一个基于预测的智能体监管系统，通过结构化证据引导LLM决策：(1) DeXposure-FM，一个图时间序列基础模型，预测未来风险网络；(2) 确定性监控和压力场景将预测转化为类型化警报、归因信号和场景证据；(3) 数据健康和置信度门控在DeXposure-Claw发出带有理由的可审计监管票据前限制升级。我们进一步开发了DeXposure-Bench，一个六轴评估框架，其决策轴根据符合监管者的绝对损失真实情况和显式误干预率对票据评分。在五年每周真实数据上的实验充分支持了我们的系统。代码见 https://github.com/EVIEHub/DeXposure-Claw。

英文摘要

Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure-Claw.


2606.19522 2026-06-19 cs.AI 新提交

REVEAL++: Differentiable Phenotypic Grouping for Vision-Language Retinal Modeling of Alzheimer's Disease Risk

REVEAL++：用于阿尔茨海默病风险视觉-语言视网膜建模的可微分表型分组

Ethan Elio Meidinger, Seowung Leem, Zeyun Zhao, Ruogu Fang

发表机构 * University of Virginia（弗吉尼亚大学）； J. Crayton Pruitt Family Department of Biomedical Engineering, Herbert Wertheim College of Engineering, University of Florida（佛罗里达大学赫伯特·韦特海姆工程学院J. Crayton Pruitt家庭生物医学工程系）

AI总结提出可微分连续表型相似性权重函数，替代离散分组，在对比学习中端到端学习跨模态对齐与表型结构，提升AD风险预测。

Comments Accepted for publication at MICCAI 2026

详情

AI中文摘要

视网膜为神经退行性疾病提供了非侵入性窗口，能够捕捉与未来认知衰退风险相关的细微结构模式。诸如REVEAL等视觉-语言对齐框架已表明，将视网膜眼底图像与结构化临床风险叙述配对可改善阿尔茨海默病（AD）的早期预测。这些方法的一个关键设计选择是使用表型分组，即在对比学习中将具有相似风险特征的个体视为多正对。然而，现有方法将表型相似性操作化为离散构造，依赖硬分组分配，施加刚性监督并将分组形成与表示学习分离。我们提出对比学习中表型结构的连续形式。我们不将样本分配到固定聚类，而是将受试者间相似性建模为可微分权重函数，该函数源自视网膜图像和风险特征中模态内嵌入相似性。这些权重通过连续聚合算子定义软多正关系，实现反映疾病风险谱的梯度监督。我们进一步引入软目标对比目标，以端到端方式联合学习跨模态对齐和表型结构。在UK Biobank视网膜成像数据上进行AD发病预测评估，所提框架持续优于基于离散分组的对比学习和标准视觉-语言基线。通过将表型相似性视为可学习的连续信号而非固定分组规则，我们的方法为从多模态视网膜和临床数据中进行人群规模的神经退行性风险建模提供了有原则且稳健的基础。

英文摘要

The retina offers a noninvasive window into neurodegenerative disease, capturing subtle structural patterns associated with a risk of future cognitive decline. Vision-language alignment frameworks such as REVEAL have shown that pairing retinal fundus images with structured clinical risk narratives improves early prediction of Alzheimer's disease (AD). A key design choice in these approaches is the use of phenotypic grouping, where individuals with similar risk profiles are treated as multi-positive pairs during contrastive learning. However, existing methods operationalize phenotypic similarity as a discrete construct, relying on hard group assignments that impose rigid supervision and decouple group formation from representation learning. We propose a continuous formulation of phenotypic structure within contrastive learning. Rather than assigning samples to fixed clusters, we model inter-subject similarity as a differentiable weighting function derived from intra-modality embedding similarities in both retinal images and risk profiles. These weights define soft multi-positive relationships through a continuous aggregation operator, enabling graded supervision that reflects the spectrum nature of disease risk. We further introduce a soft-target contrastive objective that jointly learns cross-modal alignment and phenotypic structure in an end-to-end manner. Evaluated on UK Biobank retinal imaging data for incident AD prediction, the proposed framework consistently outperforms discrete group-based contrastive learning and standard vision-language baselines. By treating phenotypic similarity as a learnable, continuous signal rather than a fixed grouping rule, our approach provides a principled and robust foundation for population-scale neurodegenerative risk modeling from multi-modal retinal and clinical data.
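A soft-target contrastive objective of the kind described can be sketched as follows. This is an illustration only: the abstract does not specify the weighting function or the aggregation operator, so the averaging of image-image and profile-profile cosine similarities, the softmax temperatures, and all function names are assumptions. Only the image-to-text direction is shown.

```python
import math


def softmax(values, temperature):
    """Numerically stable softmax with a temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_matrix(rows_a, rows_b):
    """Pairwise cosine similarity between two lists of vectors."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v)) or 1.0
    return [[sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
             for v in rows_b] for u in rows_a]


def soft_target_contrastive_loss(img, txt, tau_logits=0.1, tau_targets=0.1):
    """Cross-entropy between cross-modal predictions and soft multi-positive targets."""
    cross = cosine_matrix(img, txt)       # image -> text logits
    sim_img = cosine_matrix(img, img)     # intra-modality similarity (images)
    sim_txt = cosine_matrix(txt, txt)     # intra-modality similarity (risk profiles)
    n = len(img)
    loss = 0.0
    for i in range(n):
        # Assumed aggregation: average the two intra-modality similarities.
        blended = [(sim_img[i][j] + sim_txt[i][j]) / 2 for j in range(n)]
        target = softmax(blended, tau_targets)   # graded, not one-hot
        probs = softmax(cross[i], tau_logits)
        loss -= sum(t * math.log(p) for t, p in zip(target, probs))
    return loss / n
```

Because the targets are a distribution over all subjects rather than a one-hot label or a hard group mask, subjects with similar profiles receive partial positive weight instead of being forced into or out of a cluster.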


2606.19602 2026-06-19 cs.AI 新提交

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

可配置的临床信息提取与智能体RAG：什么有效、什么失效及原因

Osman Alperen Çinar-Koraş, Marie Bauer, Sameh Khattab, Merlin Engelke, Moon Kim, Stephan Settelmeier, Shigeyasu Sugawara, Fabian Freisleben, Felix Nensa, Jens Kleesiek

发表机构 * Institute for Artificial Intelligence in Medicine (IKIM), University Medicine Essen（埃森大学医学院人工智能医学研究所）； Faculty of Computer Science, University of Duisburg-Essen（杜伊斯堡-埃森大学计算机科学学院）； Department of Physics, TU Dortmund University（多特蒙德工业大学物理系）； Lamarr Institute for Machine Learning and Artificial Intelligence, TU Dortmund University（多特蒙德工业大学拉马尔机器学习和人工智能研究所）； Advanced Clinical Research Center, Fukushima Medical University（福岛医科大学先进临床研究中心）； Department of Cardiology and Vascular Medicine, University Hospital Essen（埃森大学医院心血管内科）

AI总结针对临床文档元数据缺失问题，提出基于智能体RAG的ACIE系统，在埃森大学医学中心部署，通过完整患者上下文推理和源引用验证，在7326次临床判断中实现96.5%的提取接受率。

详情

AI中文摘要

患者上下文涵盖数百份异构文档和数千个结构化数据点，然而AI系统进行检索和分诊所需的文档级元数据缺失或不完整。标准检索增强生成在此类数据上失效，无法处理时间推理、跨文档依赖和缺失元数据。我们在埃森大学医学中心部署了ACIE（智能体临床信息提取）：一个本地智能体RAG管道，能够推理完整的患者上下文，并将每个答案基于源段落以供临床医生验证。我们量化了元数据差距，追溯了由此形成的架构决策，并在一项独立的回顾性淋巴瘤注册研究中评估了提取效果，其中核医学医生根据引用的来源验证每个提取值。在7326次判断中，临床医生接受了96.5%的提取结果，按类型划分的接受率从80%到99%不等。

英文摘要

Patient contexts span hundreds of heterogeneous documents and thousands of structured data points, yet the document-level metadata that AI systems need for retrieval and triage is absent or incomplete. Standard retrieval-augmented generation fails on this data, mishandling temporal reasoning, cross-document dependencies, and missing metadata. We deploy ACIE (Agentic Clinical Information Extraction) at University Medicine Essen: an on-premise agentic RAG pipeline that reasons over complete patient contexts and grounds every answer in source passages for clinician verification. We quantify the metadata gap, trace the architectural decisions it shaped, and evaluate extraction alongside an independent retrospective lymphoma registry study, in which nuclear-medicine physicians verify every extracted value against its cited sources. Across 7,326 judgments, clinicians accepted 96.5% of extractions, with per-type acceptance ranging from 80% to 99%.


2606.19651 2026-06-19 cs.AI cs.CV cs.LG 新提交

eCNNTO：一种高度泛化的加速拓扑优化的卷积网络

Shengbiao Lu, Xiaodong Wei

发表机构 * Global College, Shanghai Jiao Tong University（上海交通大学全球学院）

AI总结提出基于元素的卷积神经网络eCNNTO，通过预测近最优密度跳过大量迭代，加速密度拓扑优化，并引入新训练策略提升效率与泛化能力。

详情

AI中文摘要

本工作提出了一种基于元素的卷积神经网络（CNN）来加速基于密度的拓扑优化（TO），称为eCNNTO。TO通常需要大量迭代，其中每次迭代都进行有限元分析，导致效率瓶颈，尤其是在使用密集网格实现高分辨率设计时。为解决这一限制，eCNNTO建立在Kallioras等人（2020）的工作基础上，该工作为每个元素训练了一个深度信念网络（DBN），根据其早期历史预测近最优密度，从而跳过绝大多数迭代并显著加速TO过程。然而，该方法缺乏相邻元素间的空间相关性，可能导致最终结构中存在不连通的特征。所提方法采用带有残差连接的CNN来解决这一问题。在此基础上，引入了一种新的训练策略以进一步提高优化效率，其中训练数据集由最终阶段的密度历史而非早期历史组成。这一变化也有助于减少所需的训练数据量。eCNNTO仅需少量数据集进行训练，却能泛化到边界条件、载荷情况、设计域几何形状、网格分辨率以及非设计域大不相同的各种问题。最后，通过二维和三维的多个示例展示了eCNNTO的泛化能力和效率，分别实现了高达90%和97%的迭代次数减少。

英文摘要

This work proposes an element-based Convolutional Neural Network (CNN) to accelerate density-based Topology Optimization (TO), termed eCNNTO. TO generally requires a large number of iterations, with a finite element analysis performed in every iteration, which creates an efficiency bottleneck, especially when dense meshes are used to achieve high-resolution designs. To address this limitation, eCNNTO builds upon Kallioras et al. (2020), where a Deep Belief Network (DBN) was trained for every element to predict its near-optimal density from its early history, thereby skipping the great majority of iterations and significantly accelerating the TO procedure. However, that method lacks spatial correlations among neighboring elements and may lead to disconnected features in the final structure. The proposed method employs a CNN with residual connections to address this issue. On top of this, a novel training strategy is introduced to further enhance optimization efficiency, in which the training dataset consists of final-stage density histories rather than early ones. This change also helps reduce the required training data size. eCNNTO requires only a small dataset to train, yet it generalizes to problems with largely different boundary conditions, loading cases, design domain geometries, mesh resolutions, and non-design domains. Finally, the generalization capabilities and efficiency of eCNNTO are demonstrated through a variety of examples in two and three dimensions, achieving up to 90% and 97% reduction of iterations, respectively.


2606.20087 2026-06-19 cs.AI 新提交

Multi-Head Attention-Based Feature Extractor Integration with Soft Actor-Critic for Porosity Prediction and Process Parameter Optimization in Additive Manufacturing

基于多头注意力的特征提取器与软演员-评论家集成用于增材制造中的孔隙率预测和工艺参数优化

Kianoush Aqabakee, Leonardo Stella

发表机构 * Department of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic)（阿米尔卡比尔理工大学（德黑兰理工大学）电气工程系）； Department of Mechanical Engineering, Amirkabir University of Technology (Tehran Polytechnic)（阿米尔卡比尔理工大学（德黑兰理工大学）机械工程系）； School of Computer Science, University of Birmingham（伯明翰大学计算机科学学院）

AI总结提出一种结合多头注意力机制与软演员-评论家算法的连续动作空间方法，用于增材制造孔隙率预测和参数优化，实现更快收敛和更高奖励。

详情

AI中文摘要

增材制造工艺优化需要精确的参数控制以最小化孔隙等缺陷。传统的使用离散动作空间的强化学习方法收敛慢且易陷入局部最优，限制了其在精密制造任务中的有效性。本研究通过采用连续动作空间并结合一种新颖架构——将多头注意力机制与软演员-评论家（SAC）算法集成，来解决这些局限性。基于注意力的特征提取器增强了智能体捕捉低维输入特征中细微变化的能力，从而在存在局部极小值的价值空间中实现更有效的探索-利用平衡。我们在激光粉末床熔融中的孔隙率预测和工艺参数优化上验证了该方法，与标准强化学习方法（包括DQN、PPO、TD3和原始SAC）相比，展示了更快的收敛速度和更高的最终奖励值。所提出的方法在14个回合内达到322.79的收敛值，在保持训练稳定性的同时优于现有方法。

英文摘要

Additive manufacturing process optimization requires precise parameter control to minimize defects such as porosity. Traditional reinforcement learning (RL) approaches using discrete action spaces suffer from slow convergence and susceptibility to local optima, limiting their effectiveness for high-precision manufacturing tasks. This study addresses these limitations by employing a continuous action space combined with a novel architecture that integrates a multi-head attention mechanism with the Soft Actor-Critic (SAC) algorithm. The attention-based feature extractor enhances the agent's ability to capture subtle variations in low-dimensional input features, enabling more effective exploration-exploitation balance for navigating value spaces with local minima. We validate our approach on porosity prediction and process parameter optimization in laser powder bed fusion, demonstrating faster convergence and higher final reward values compared to standard RL methods including DQN, PPO, TD3, and vanilla SAC. The proposed methodology achieves a convergence value of 322.79 within 14 episodes, outperforming existing approaches while maintaining stability throughout training.


2606.20138 2026-06-19 cs.AI cs.CL cs.HC cs.LG 新提交

学生绘制的科学模型的置信度感知自动评估

Luyang Fang, Yingchuan Zhang, Jongchan Park, Zhaoji Wang, Ping Ma, Xiaoming Zhai

发表机构 * AI4STEM Education Center, Athens, GA, USA（AI4STEM教育中心，雅典，佐治亚州，美国）； Department of Statistics, University of Georgia, Athens, GA, USA（佐治亚大学统计系，雅典，佐治亚州，美国）

AI总结提出一种基于视觉Transformer的置信度感知评分框架，通过选择性自动化高置信度响应并延迟不确定案例至人工审核，在六个NGSS评估项上提高了评分可靠性并平衡了自动化覆盖率与评分风险。

详情

AI中文摘要

学生生成的绘图广泛应用于科学教育中，用于评估学习者在基于建模任务中的概念理解，这些任务与下一代科学标准（NGSS）保持一致。然而，对这些绘图进行评分需要专家人工判断来解释复杂的视觉表示，使得大规模评估在课堂环境中实施和维持成本高昂。在这项工作中，我们研究了使用基于视觉模型的自动评分学生生成的科学绘图。我们评估了具有参数高效适应的视觉Transformer（ViT），并提出了一个置信度感知评分框架，该框架从测试时预测分布中推导出响应级别的置信度。这种置信度信号通过自动评分高置信度响应，同时将不确定案例延迟至人工审核，实现了选择性自动化。在六个与NGSS对齐的中学评估项上的实验表明，所提出的方法提高了评分可靠性，同时支持自动化覆盖率和评分风险之间的实际权衡，突出了置信度感知方法在可信教育评估中的价值。

英文摘要

Student-generated drawings are widely used in science education to assess learners' conceptual understanding in modeling-based tasks aligned with the Next Generation Science Standards (NGSS). However, scoring such drawings requires expert human judgment to interpret complex visual representations, making large-scale assessment costly to implement and sustain in classroom settings. In this work, we study automated scoring of student-generated scientific drawings using a vision-based model. We evaluate a Vision Transformer (ViT) with parameter-efficient adaptation and propose a confidence-aware scoring framework that derives response-level confidence from test-time predictive distributions. This confidence signal enables selective automation by scoring high-confidence responses automatically while deferring uncertain cases for human review. Experiments on six NGSS-aligned middle school assessment items show that the proposed approach improves scoring reliability while supporting a practical trade-off between automated coverage and scoring risk, highlighting the value of confidence-aware methods for trustworthy educational assessment.
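The selective-automation idea can be sketched in a few lines. The abstract states only that confidence is derived from test-time predictive distributions; averaging several predictive distributions and using the top mean probability as the confidence, the fixed threshold, and the names `response_confidence` and `route` are assumptions made for illustration.

```python
def response_confidence(prob_samples):
    """Average class probabilities over test-time passes; return (label, confidence)."""
    n_classes = len(prob_samples[0])
    mean = [sum(p[k] for p in prob_samples) / len(prob_samples)
            for k in range(n_classes)]
    label = max(range(n_classes), key=lambda k: mean[k])
    return label, mean[label]


def route(prob_samples, threshold=0.8):
    """Score automatically when confident, otherwise defer to a human rater."""
    label, confidence = response_confidence(prob_samples)
    if confidence >= threshold:
        return {"decision": "auto", "score": label, "confidence": confidence}
    return {"decision": "human_review", "score": None, "confidence": confidence}
```

Raising the threshold lowers automated coverage and the risk of a wrong automatic score; lowering it does the opposite, which is the coverage-risk trade-off the abstract refers to.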


2606.20323 2026-06-19 cs.AI 新提交

Leveraging systems' non-linearity to tackle the scarcity of data in the design of Intelligent Fault Diagnosis Systems

利用系统非线性应对智能故障诊断系统设计中的数据稀缺问题

Giancarlo Santamato, Andrea Mattia Garavagno, Massimiliano Solazzi, Antonio Frisoli

AI总结提出一种利用系统固有非线性的周期多激励级方法，结合数据可视化与增强技术，在数据稀缺条件下实现基于深度迁移学习的振动故障诊断，并在铁路受电弓结构上验证有效性。

Journal ref Nonlinear Dynamics, vol. 112, pp. 16153-16166, 2024

详情

DOI: 10.1007/s11071-024-09864-6

AI中文摘要

深度迁移学习（DTL）允许高效构建智能故障诊断系统（IFDS）。另一方面，DTL方法仍然严重依赖大量标记数据。在处理机器或结构故障时，获取如此大量的数据可能具有挑战性。本文提出了一种在数据严重稀缺条件下使用DTL设计基于振动的IFDS的新方法。利用真实世界系统固有非线性的周期性多激励级过程生成图像，这些图像可以由预训练的卷积神经网络（CNN）方便地分析以诊断故障。本文提出了一种新的数据可视化方法及其增强技术，以应对IFDS设计过程中典型的数据缺乏问题。在铁路受电弓结构上的实验验证为所提方法提供了有效支持。

英文摘要

Deep Transfer Learning (DTL) allows for the efficient building of Intelligent Fault Diagnosis Systems (IFDS). However, DTL methods still rely heavily on large amounts of labelled data, and obtaining such data can be challenging when dealing with faults in machines or structures. This paper proposes a novel approach to the design of vibration-based IFDS using DTL under conditions of severe data scarcity. A periodic multi-excitation level procedure leveraging the intrinsic non-linearities of real-world systems is used to produce images that can be conveniently analysed by pre-trained Convolutional Neural Networks (CNNs) to diagnose faults. A new data visualization method and its augmentation technique are proposed to tackle the typical lack of data encountered during the design of IFDS. Experimental validation on a railway pantograph structure supports the proposed method.


2606.20438 2026-06-19 cs.AI 新提交

Interpretable Sperm Morphology Classification via Attention-Guided Deep Learning

可解释的精子形态分类：基于注意力引导的深度学习

Zahra Asghari Varzaneh, Reza Khoshkangini, Thomas Ebner, Lars Johansson

发表机构 * Department of Computer Science and Media Technology, Malmö University（马尔默大学计算机科学与媒体技术系）

AI总结提出注意力引导的深度学习框架，结合EfficientNet-B0和CBAM模块进行精子形态分类，在SMIDS和HuSHem数据集上分别达到90.2%和93.9%的准确率，并通过Grad-CAM++可视化增强可解释性。

2606.20459 2026-06-19 cs.AI 新提交

Context-Aware Hierarchical Bayesian Modeling of IVF Laboratory Environmental Conditions

IVF实验室环境条件的上下文感知分层贝叶斯建模

Zahra Asghari Varzaneh, Reza Khoshkangini, Pia Saldeen, Lars Johansson, Thomas Ebner

发表机构 * Department of Computer Science and Media Technology, Malmö University（马尔默大学计算机科学与媒体技术系）

AI总结提出55个上下文感知时间特征捕捉培养箱微环境动态，结合分层贝叶斯Beta回归模型跨诊所共享环境效应，将预测误差从3-5%降至1.27%，并在北欧诊所实现R²=0.86和64%误差降低。

详情

AI中文摘要

IVF妊娠率通常使用患者层面变量进行建模，而高分辨率实验室环境数据仍未得到充分利用。我们表明这是一个错失的机会。我们不再依赖原始传感器平均值，而是设计了55个上下文感知的时间特征，包括滚动热稳定性、同时温湿度符合性、峰值应力持续时间和应力后恢复速度，这些特征捕捉了培养箱微环境的动态。基于来自一家亚洲IVF诊所的61周数据，这些特征将交叉验证预测误差降低至1.27%，而原始平均值的误差为3-5%。然后，我们训练了一个分层贝叶斯Beta回归模型，通过部分池化在亚洲和北欧诊所之间共享环境效应，同时保留特定于诊所的基线。在来自北欧诊所的保留数据上，该模型在35-39岁年龄组中实现了R²=0.86和相对于朴素基线的64%误差降低，表明结构化的环境监测包含具有临床意义的可迁移信号。

英文摘要

IVF pregnancy rates are routinely modeled using patient-level variables, while high-resolution laboratory environmental data remain underutilized. We show that this is a missed opportunity. Rather than relying on raw sensor averages, we engineer 55 context-aware temporal features, including rolling thermal stability, simultaneous temperature-humidity adherence, peak stress duration, and post-stress recovery speed, that capture the dynamics of incubator microenvironments. On 61 weeks of data from an Asian IVF clinic, these features reduce cross-validated prediction error to 1.27%, compared to 3-5% for raw averages. We then train a hierarchical Bayesian Beta regression model that shares environmental effects across an Asian and a Northern European clinic via partial pooling, while preserving site-specific baselines. On held-out data from the Northern European clinic, the model achieves R² = 0.86 and a 64% error reduction for the 35-39 age group over a naive baseline, demonstrating that structured environmental monitoring contains clinically meaningful, transferable signal.


2606.19345 2026-06-19 cs.CL cs.AI 交叉投稿

Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts

基于摘要识别PubMed中EQ-5D研究的大型语言模型集成

Zhyar Rzgar K. Rostam, Márta Péntek, János Tibor Czere, Zsombor Zrubka, László Gulácsi, Gábor Kertész

发表机构 * Doctoral School of Applied Informatics and Applied Mathematics, Obuda University（欧布达大学应用信息学与应用数学博士学院）； John von Neumann Faculty of Informatics, Obuda University（欧布达大学约翰·冯·诺伊曼信息学学院）； Doctoral School of Innovation Management, Obuda University（欧布达大学创新管理博士学院）

AI总结提出多阶段框架集成Gemini和Gemma等LLM，通过少样本提示、权重集成和软堆叠元分类器，自动检测PubMed中EQ-5D研究，加权集成F1达0.74。

Comments 6 pages, 7 tables, 8 equations

详情

AI中文摘要

科学出版物的快速增长导致系统文献综述（SLR）中的人工研究筛选越来越耗费资源、效率低下且不一致。分类明确报告健康相关生活质量结果（如EQ-5D数据）的研究需要高水平的临床解释，并给人类评审者带来挑战。本研究探讨了使用Google的Gemini和Gemma大型语言模型（LLM）仅基于已发表摘要自动检测PubMed生物医学数据库中的EQ-5D。提出了一个多阶段框架，集成了少样本提示、权重集成聚合和软堆叠元分类器。在由两位专家手动标记的PubMed研究数据集上评估了九个LLM的EQ-5D报告情况。gemini-2.5-pro、gemma-3-12b和gemma-3-27b的加权集成获得了0.74的加权F1分数和0.74的准确率，超过了单独获得的结果。与单个模型相比，表现最佳模型的集成改善了精确率和召回率之间的平衡，而软堆叠方法提供了更高的可靠性和可解释性。特征分析表明，模型的概率结果在指导最终预测中很重要。研究结果表明，基于集成的LLM设置是自动化生物医学研究筛选的可靠且可扩展的方法。

英文摘要

The rapid increase in scientific publications makes manual study screening in systematic literature reviews (SLRs) increasingly resource-consuming, inefficient, and inconsistent. Classifying studies that clearly report health-related quality-of-life results, such as EQ-5D data, requires a high level of clinical interpretation and poses challenges for human reviewers. This study investigates the use of Google's Gemini and Gemma large language models (LLMs) in automating EQ-5D detection in the PubMed biomedical database based only on published abstracts. A multi-phase framework is proposed that integrates few-shot prompting, weighted ensemble aggregation, and a soft stacking meta-classifier. Nine LLMs are evaluated on a dataset of PubMed studies manually labeled by two experts regarding EQ-5D reporting. The weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b obtained a 0.74 weighted F1-score and 0.74 accuracy, exceeding the results of any individual model. Ensembling the top-performing models improved the balance between precision and recall compared to individual models, while the soft stacking approach provided greater reliability and interpretability. Feature analysis shows that the probability outputs of the models are important in guiding the final predictions. The findings suggest that an ensemble-based LLM setup is a reliable and scalable approach for automating screening in biomedical research.
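Weighted aggregation of per-model probabilities can be sketched as a weighted soft vote. The abstract does not give the weights or the decision rule, so the weights, the 0.5 threshold, and the function name `weighted_vote` below are hypothetical; only the model names come from the abstract.

```python
def weighted_vote(probabilities, weights, threshold=0.5):
    """Combine per-model probabilities of 'reports EQ-5D' into one decision.

    probabilities: dict mapping model name -> probability in [0, 1]
    weights: dict mapping model name -> non-negative weight (assumed values)
    """
    total_weight = sum(weights[name] for name in probabilities)
    score = sum(weights[name] * p for name, p in probabilities.items()) / total_weight
    return score >= threshold, score


# Hypothetical usage with illustrative numbers.
decision, score = weighted_vote(
    {"gemini-2.5-pro": 0.9, "gemma-3-12b": 0.4, "gemma-3-27b": 0.6},
    {"gemini-2.5-pro": 2.0, "gemma-3-12b": 1.0, "gemma-3-27b": 1.0},
)
```

A stacking meta-classifier would instead learn the combination from the same per-model probabilities, which is consistent with the abstract's remark that those probabilities drive the final predictions.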


2606.19371 2026-06-19 cs.LG cs.AI cs.CV 交叉投稿

ProMUSE: Progressive Multi-modal Uncertainty-guided Staged Evidential Alzheimer Disease Classification

ProMUSE: 渐进式多模态不确定性引导的分阶段证据阿尔茨海默病分类

Long Doan, Branden Chen, Ethan Litton, Huan Huang, Jiajing Huang, Yixin Xie, Weihua Zhou, Nandakumar Narayanan, Chen Zhao

发表机构 * Kennesaw State University（肯尼索州立大学）； Michigan Technological University（密歇根理工大学）； University of Iowa（爱荷华大学）

AI总结提出ProMUSE，一种渐进式多模态不确定性引导的分阶段证据网络，通过自适应决定何时需要额外模态，在保持准确性的同时降低数据采集成本。

详情

AI中文摘要

阿尔茨海默病（AD）是一种致命性疾病，会破坏老年人的记忆和认知能力。大多数AD治疗在早期阶段有效，导致对早期AD诊断的需求日益增加。AD诊断越来越依赖多模态数据，如临床评估、结构磁共振成像（MRI）和正电子发射断层扫描（PET）成像。然而，MRI和PET采集仍然昂贵且不易普及，使得全模态推理在现实临床工作流程中不切实际。我们提出ProMUSE，一种渐进式多模态不确定性引导的分阶段证据网络，该网络自适应地确定何时需要额外模态，有助于在保持准确性的同时降低数据采集的总体成本。ProMUSE首先使用低成本临床数据进行证据分类，并通过基于Dirichlet的主观逻辑模型量化不确定性。当不确定性超过学习阈值时，ProMUSE逐步引入MRI或PET特征，通过Dempster-Shafer理论融合模态层面的信念和不确定性，获得校准的多模态预测。这种分阶段采集策略能够在最小化对昂贵成像依赖的同时实现准确诊断。在ADNI、AIBL和OASIS数据集上针对CN-AD、CN-MCI和MCI-AD任务的实验表明，ProMUSE在减少50-90%的MRI/PET使用量的同时，实现了与全模态基线相当或更优的准确性，从而大幅节省成本。这些结果突显了ProMUSE作为现实世界AD筛查中一种实用、不确定性感知且资源高效的解决方案。

英文摘要

Alzheimer's disease (AD) is a fatal disorder that destroys memory and cognitive skills in the elderly population. Most treatments for AD are effective in the early stage, leading to an increasing demand for early AD diagnosis. AD diagnosis increasingly relies on multimodal data such as clinical assessments, structural Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. However, MRI and PET acquisition remain costly and not universally accessible, making full-modality inference impractical in real-world clinical workflows. We propose ProMUSE, a Progressive Multi-modal Uncertainty Guided Staged Evidential Network that adaptively determines when additional modalities are necessary, helping reduce the overall cost of data acquisition while maintaining accuracy. ProMUSE first performs evidential classification using low-cost clinical data and quantifies uncertainty via a Dirichlet-based subjective logic model. When uncertainty exceeds a learned threshold, ProMUSE progressively incorporates MRI or PET features, fusing modality-wise belief and uncertainty through Dempster-Shafer theory to obtain a calibrated multimodal prediction. This staged acquisition strategy enables accurate diagnosis while minimizing reliance on expensive imaging. Experiments on ADNI, AIBL, and OASIS across CN-AD, CN-MCI, and MCI-AD tasks demonstrate that ProMUSE achieves competitive or superior accuracy compared to full-modality baselines while reducing MRI/PET usage by 50-90%, yielding substantial cost savings. These results highlight ProMUSE as a practical, uncertainty-aware, and resource-efficient solution for real-world AD screening.
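The staged procedure can be sketched with the standard subjective-logic mapping from Dirichlet evidence to beliefs and the reduced Dempster combination rule commonly used in evidential multi-view learning. Whether ProMUSE uses exactly these formulas is an assumption; the threshold here is a fixed parameter rather than a learned one, and `acquire_imaging` is a hypothetical callback standing in for an MRI or PET evidence model.

```python
def opinion_from_evidence(evidence):
    """Map non-negative class evidence to (beliefs, uncertainty)."""
    k = len(evidence)
    strength = sum(evidence) + k          # Dirichlet strength S = sum(e_k + 1)
    beliefs = [e / strength for e in evidence]
    return beliefs, k / strength          # beliefs and uncertainty sum to 1


def fuse(opinion_a, opinion_b):
    """Reduced Dempster's rule for two opinions over the same classes."""
    b1, u1 = opinion_a
    b2, u2 = opinion_b
    n = len(b1)
    conflict = sum(b1[i] * b2[j] for i in range(n) for j in range(n) if i != j)
    scale = 1.0 / (1.0 - conflict)
    beliefs = [scale * (b1[k] * b2[k] + b1[k] * u2 + b2[k] * u1) for k in range(n)]
    return beliefs, scale * u1 * u2


def staged_predict(clinical_evidence, acquire_imaging, threshold=0.3):
    """Use clinical data first; request imaging only when uncertainty is high."""
    opinion = opinion_from_evidence(clinical_evidence)
    used_imaging = False
    if opinion[1] > threshold:
        opinion = fuse(opinion, opinion_from_evidence(acquire_imaging()))
        used_imaging = True
    beliefs, uncertainty = opinion
    return beliefs.index(max(beliefs)), uncertainty, used_imaging
```

Cases where the clinical evidence alone is strong never trigger the imaging callback, which is how the reduction in MRI/PET usage arises in this kind of scheme.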


2606.19373 2026-06-19 cs.LG cs.AI 交叉投稿

cAPM: Continual AI-Assisted Pace-Mapping with Active Learning

cAPM：具有主动学习的持续AI辅助起搏标测

Dylan O'Hara, Pradeep Bajracharya, Casey Meisenzahl, Karli Gillette, Anton J. Prassl, Gernot Plank, Saman Nazarian, Roderick Tung, John L Sapp, Linwei Wang

发表机构 * Rochester Institute of Technology（罗切斯特理工学院）； University of Utah（犹他大学）； Scientific Computing and Imaging Institute, University of Utah（犹他大学科学计算与成像研究所）； Medical University of Graz（格拉茨医科大学）； University of Pennsylvania Perelman School of Medicine（宾夕法尼亚大学佩雷尔曼医学院）； The University of Arizona College of Medicine（亚利桑那大学医学院）； Dalhousie University（达尔豪斯大学）

AI总结提出cAPM框架，通过任务无关的代理神经网络、主动学习和持续学习策略，在减少起搏标测数据量的同时，实现跨室性心动过速的知识迁移，将定位精度提升至81%。

详情

AI中文摘要

室性心动过速是一种危及生命的心律失常，是心源性猝死的主要原因。起搏标测是一种临床程序，用于在导管消融室性心动过速期间识别干预靶点。它要求临床医生在心室的不同部位起搏，并快速解释由此产生的心电图，以确定下一步起搏位置或是否已识别出靶点。已提出主动学习AI模型来指导临床医生选择下一个起搏点，显示出在减少起搏点数量和改善起搏标测效率方面的潜力。现有方法需要对每个靶点重新训练，无法在同一患者或不同患者的多个室性心动过速之间迁移知识。我们引入cAPM用于持续AI辅助起搏标测，以捕获和迁移从过去起搏标测数据中积累的知识，从而减少未来靶点室性心动过速所需的起搏标测数据量。这是通过一个任务无关的代理神经网络实现的，该网络学习从起搏点到12导联心电图形态的映射；一种主动学习策略，通过为每个靶点选择信息量最大的起搏点来优化该代理模型；以及一种持续学习策略，以顺序方式执行此操作，同时保留先前靶点的知识。在由不同生理条件和心室几何形状下顺序呈现的定位任务组成的计算机模拟测试平台上评估，cAPM（无论是否重放过去数据样本）在使用4.5个起搏标测点时，在临床耐受范围内（5毫米精度）定位的概率达到81%，而最先进的主动学习方法使用13.7个起搏点达到38%的概率。这些结果为cAPM准备用于体内临床前和临床研究提供了坚实基础，在这些研究中，cAPM可用于指导起搏标测。

英文摘要

Ventricular tachycardia (VT) is a life-threatening rhythm disorder and a major cause of sudden cardiac death. Pace-mapping is a clinical procedure for identifying the intervention target during catheter ablation of VT. It requires clinicians to pace different sites in the ventricles and rapidly interpret the resulting electrocardiograms to determine where to pace next or whether a target site has been identified. Active learning AI models have been proposed to guide clinicians to the next pacing site, showing promise in reducing the number of pacing sites and improving the efficiency of pace-mapping. Existing methods require retraining for each target and cannot transfer knowledge across multiple VTs within the same patient or across patients. We introduce cAPM, a continual AI-assisted pace-mapping framework that captures and transfers knowledge accumulated from past pace-mapping data to reduce the amount of pace-mapping data needed for future target VTs. This is made possible by a task-agnostic surrogate neural network that learns the mapping from pacing sites to 12-lead ECG morphology, an active-learning strategy that refines this surrogate model by selecting the most informative pacing site for each target, and a continual learning strategy that does so sequentially while retaining knowledge from prior targets. Evaluated on an in-silico testbed consisting of sequentially presented localization tasks across different physiological conditions and ventricular geometries, cAPM with and without replay of past data samples achieved an 81% probability of localizing within clinical tolerance (5 mm accuracy) using 4.5 pace-mapping sites, compared to the state-of-the-art active-learning method achieving 38% probability using 13.7 pacing sites. These results provide a strong basis for preparing cAPM for in-vivo preclinical and clinical studies in which it can be used to guide pace-mapping.


2606.19377 2026-06-19 cs.LG cs.AI 交叉投稿

Emyx: Fast and efficient all-atom protein generation

Emyx: 快速高效的全原子蛋白质生成

Nicholas J. Williams, Ward Haddadin, Matteo P. Ferla, Constantin Schneider, Nicholas B. Woodall, Ruby Sedgwick, Christian D. Madsen, Andrew L. Hopkins, Edward O. Pyzer-Knapp

发表机构 * Xyme

AI总结提出Emyx，一种140M参数的流匹配模型，通过轻量条件表示和稀疏连接降低复杂度，在酶设计基准上超越现有方法，训练仅需682 GPU小时。

详情

AI中文摘要

计算酶设计需要生成能够支撑催化残基和配体的蛋白质，这要求生成模型同时具备几何准确性和结构多样性。当前的全原子生成模型继承了结构预测中的昂贵架构，导致训练成本高、样本多样性有限。我们认为，对于生成模型而言，这种复杂性大多是不必要的，因为生成模型依赖于稀疏的几何约束而非丰富的共进化信号。Emyx是一个140M参数的条件流匹配模型，将能力集中在标准Transformer块中，用轻量条件表示和稀疏连接替代了厚重的嵌入堆叠。此外，我们推导了流匹配插值到EDM噪声水平框架的精确重参数化，将流匹配训练效率与为扩散模型设计的最先进采样方法桥接起来，无需重新训练。尽管是最小的模型，Emyx在AME酶设计基准上，在要求全局折叠恢复和催化几何准确性的严格评估下，在成功率、结构新颖性、骨架多样性和几何有效性方面均优于Proteína-Complexa和RFdiffusion3，而训练仅需682 GPU小时，约为RFdiffusion3的1/4。

英文摘要

Computational enzyme design requires generating proteins that scaffold catalytic residues and ligands, a task that demands both geometric accuracy and structural diversity from the underlying generative model. Current all-atom generators inherit expensive architectures from structure prediction, leading to high training costs and limited sample diversity. We argue that much of this complexity is unnecessary for generators, which condition on sparse geometric constraints rather than rich co-evolutionary signals. Emyx is a 140M-parameter conditional flow matching model that concentrates capacity within standard transformer blocks, replacing heavy embedding stacks with lightweight conditional representations and sparse connectivity. We additionally derive an exact reparametrisation of the flow matching interpolant into the EDM noise-level framework, bridging flow matching training efficiency with state-of-the-art sampling methods designed for diffusion models without retraining. Despite being the smallest model, Emyx outperforms both Proteína-Complexa and RFdiffusion3 on the AME enzyme design benchmark in success rate (under strict evaluation requiring both global fold recovery and catalytic geometry accuracy), structural novelty, scaffold diversity, and geometric validity, while training in just 682 GPU-hours, roughly 4× less than RFdiffusion3.
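The kind of reparametrisation mentioned can be illustrated for the simplest case. The abstract does not state the interpolant or the derivation, so the linear interpolant with data $x_0$ at $t=0$ and Gaussian noise $\varepsilon$ at $t=1$ is an assumption, and the paper's exact form may differ.

```latex
% Assumed linear flow-matching interpolant (data at t = 0, noise at t = 1)
x_t = (1 - t)\, x_0 + t\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1).

% Dividing by (1 - t) gives a variance-exploding (EDM-style) noisy sample
\hat{x}_\sigma := \frac{x_t}{1 - t} = x_0 + \sigma\, \varepsilon,
\qquad \sigma = \frac{t}{1 - t}
\quad\Longleftrightarrow\quad t = \frac{\sigma}{1 + \sigma}.
```

Under this assumption, a network trained on $x_t$ can be queried at noise level $\sigma$ by rescaling its input by $1/(1+\sigma)$ and setting $t = \sigma/(1+\sigma)$, which is what allows samplers parameterised by $\sigma$ to be applied without retraining.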


2606.19382 2026-06-19 cs.SE cs.AI 交叉投稿

DynAMO: Dynamic Asset Management Orchestration via Topological Multi-Agent Scheduling

DynAMO：基于拓扑多智能体调度的动态资产管理编排

Kanishk Kushwaha, Vikrant Vinod Bansode, Harsh Vardhan, Dhaval C. Patel

AI总结提出DynAMO引擎，采用先规划后执行架构生成可验证工作流图，支持顺序与并行执行，通过动态识别独立任务提升效率，在工业基准上实现1.6倍延迟降低，并保持正确性与安全性。

Comments 11 pages, 2 figures, 7 tables, 4 algorithms. Evaluated on the AssetOpsBench industrial benchmark. Code: https://github.com/kushwaha001/DynAMO

详情

AI中文摘要

虽然基于LLM的智能体为工业资产生命周期提供了端到端自动化，但现实世界中的工业4.0部署受到延迟、并发不稳定性和安全风险的阻碍。我们提出了DynAMO（动态资产管理编排），一个部署就绪的引擎，采用先规划后执行架构来生成可验证的工作流图。DynAMO支持顺序工作流（拓扑执行）和并行工作流（依赖感知并发）。通过动态识别独立任务，DynAMO在保持结构正确性和安全性的同时，通过受控推理重叠显著提高效率。在AssetOpsBench工业基准上的六项受控实验中，DynAMO展示了显著的性能和鲁棒性提升。并行执行相比顺序编排将端到端延迟中位数降低了1.6倍，在高度可并行化的工作流上达到1.8倍。在外部工具调用中加入实际延迟后，延迟分解显示LLM推理和编排仍占执行时间的90%以上，表明模型推理是主要系统瓶颈。结构化上下文剪枝将推理延迟降低约30%，并且DynAMO在受控故障注入下保持正确的功能行为（任务完成、智能体排序和输出质量），同时表现出优雅降级。可重复性分析进一步证实了重复运行下的稳定执行，并行调度降低了延迟方差。这些发现确立了DynAMO作为工业4.0自动化流水线中可扩展、安全且延迟感知的智能体部署的实用蓝图。代码可在以下网址获取：https://github.com/kushwaha001/DynAMO

英文摘要

While LLM-powered agents offer end-to-end automation for industrial asset lifecycles, real-world Industry 4.0 deployment is hindered by latency, concurrency instability, and safety risks. We present DynAMO (Dynamic Asset Management Orchestration), a deployment-ready engine using a Plan-then-Execute architecture to generate verifiable workflow graphs. DynAMO supports both SequentialWorkflow (topological execution) and ParallelWorkflow (dependency-aware concurrency). By dynamically identifying independent tasks, DynAMO preserves structural correctness and safety while significantly improving efficiency through controlled reasoning overlap. Across six controlled experiments on the AssetOpsBench industrial benchmark, DynAMO demonstrates substantial performance and robustness gains. Parallel execution reduces end-to-end latency by a median of 1.6x over sequential orchestration, rising to 1.8x on highly parallelizable workflows. After instrumenting external tool calls with realistic latencies, a latency decomposition shows that LLM reasoning and orchestration still account for more than 90% of execution time, identifying model inference as the primary system bottleneck. Structured context pruning reduces inference latency by approximately 30%, and DynAMO maintains correct functional behaviour (task completion, agent sequencing, and output quality) while exhibiting graceful degradation under controlled fault injection. Reproducibility analysis further confirms stable execution under repeated runs, with parallel scheduling reducing latency variance. These findings establish DynAMO as a practical blueprint for scalable, safe, and latency-aware agent deployment in Industry 4.0 automation pipelines. Code is available at: https://github.com/kushwaha001/DynAMO
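Dependency-aware concurrency over a workflow graph can be sketched with the Python standard library. This is not DynAMO's implementation: the wave-based scheduling, the function names, and the example task names are assumptions, and `graphlib.TopologicalSorter` is used only as a convenient way to find tasks whose dependencies are already satisfied.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group tasks into waves; every task in a wave has all its dependencies met.

    dependencies: dict mapping task name -> set of task names it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                      # raises graphlib.CycleError on cyclic plans
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def run_plan(dependencies, run_task, max_workers=4):
    """Run each wave concurrently, wave after wave, and collect results by task."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in parallel_batches(dependencies):
            for name, output in zip(batch, pool.map(run_task, batch)):
                results[name] = output
    return results
```

Validating the graph before execution (the `prepare()` call) mirrors the plan-then-execute idea: a cyclic or malformed plan is rejected before any agent or tool call runs.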


2606.19387 2026-06-19 cs.SE cs.AI 交叉投稿

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

可解释且可验证的硬件生成：基于LLM驱动的逐步细化

You Li, Samuel Mandell, David Z. Pan

AI总结提出结合LLM创造力与形式化方法可解释性的硬件生成框架，通过迭代应用变换规则将设计规范转换为正确性有保证的RTL程序。

2606.19407 2026-06-19 cs.SE cs.AI 交叉投稿

JustDiag!: A Diagnostic Justification Engine for Accountable Root Cause Analysis

JustDiag!：用于可问责根本原因分析的诊断论证引擎

Tingzhu Bi, Xinrui Jiang, Xun Zhang, Pengcheng Su, Congjie He, Jinglin Li, Ping Wang, Meng Ma

AI总结提出JustDiag诊断论证引擎，通过维护显式的过程状态（证据、发现、竞争假设、冲突和下一步检查）来支持可问责的根本原因分析，在66个真实事件上评估显示其优于仅提供流畅最终答案的方法。

详情

AI中文摘要

大型语言模型可以生成流畅的根本原因分析，但仅凭流畅的最终答案不足以证明高风险操作中的可问责性。在实际事件响应中，工程师需要知道哪些证据支持诊断，考虑了哪些替代方案，哪里存在矛盾，以及系统是解决了问题还是保留了不确定性。我们通过JustDiag填补了这一空白，这是一个用于RCA的诊断论证引擎，它维护了关于证据、发现、竞争假设、冲突和下一步检查的显式过程状态。我们使用两层协议在66个真实事件上评估了该系统，该协议分别对最终答案质量和过程质量进行评分。与没有诊断论证的匹配对照组相比，JustDiag获得了更强的结果和过程分数，同时由于更校准的非闭合性而接受了略低的终端完成率。这些结果表明，可问责的RCA需要显式的诊断论证工件和过程感知评估，而不仅仅是流畅的最终答案。

英文摘要

Large language models can produce fluent root cause analyses, but fluent final answers alone are insufficient evidence for accountability in high-stakes operations. In real incident response, engineers need to know what evidence supported a diagnosis, which alternatives were considered, where contradictions remained, and whether the system resolved the case or preserved uncertainty. We address this gap with JustDiag, a diagnostic justification engine for RCA that maintains an explicit process state over evidence, findings, competing hypotheses, conflicts, and next checks. We evaluated the system on 66 real-world incidents using a two-layer protocol that separately scores final-answer quality and process quality. Relative to a matched control without diagnostic justification, JustDiag achieved stronger outcome and process scores, while accepting slightly lower terminal completion due to more calibrated non-closure. These results suggest that accountable RCA requires explicit diagnostic justification artifacts and process-aware evaluation, not only fluent final answers.

2606.19460 2026-06-19 cs.CV cs.AI cs.LG 交叉投稿

Scaling Generative Foundation Models for Chest Radiography with Rectified Flow Transformers

使用整流流变换器扩展胸部X光片的生成式基础模型

Fabio De Sousa Ribeiro, Emma A. M. Stanley, Charles Jones, Tian Xia, Dominic C. Marshall, Laurent Renard Triché, Christopher V. Cosgriff, Panagiotis Dimitrakopoulos, Sotirios A. Tsaftaris, Ben Glocker

发表机构 * Imperial College London（帝国理工学院）； Causality in Healthcare AI Hub（医疗AI因果关系中心）； University of Edinburgh（爱丁堡大学）； Cleveland Clinic London（克利夫兰诊所伦敦）； Department of Perioperative Medicine, CHU Clermont-Ferrand（克莱蒙费朗大学医院围手术期医学科）； Department of Medicine, Massachusetts General Hospital（麻省总医院医学部）； Broad Institute of MIT and Harvard（麻省理工学院与哈佛大学博德研究所）

AI总结提出首个十亿参数级胸部X光片生成基础模型，通过整流流变换器实现高保真可控合成，显著提升合成图像与真实图像的不可区分性。

Comments Project page: https://RadiT-project.github.io

详情

AI中文摘要

我们引入了首个从零开始在十亿参数规模上训练的胸部X光片合成生成基础模型。现有的放射学AI模型通常在不同患者亚群、机构和采集设置下泛化能力差，导致实际临床效用有限。可控、高保真的胸部X光片合成是多样化临床数据集和评估诊断模型鲁棒性的有前景途径。因此，我们提出了迄今为止最大的胸部X光片专用生成基础模型，拥有超过13亿参数，在包含120万张X光片和临床专家指导元数据的精选异质数据集上训练了1.6万亿个token。我们的模型支持跨多个人口统计亚组、采集视图和十多种病理的可控X光片生成和编辑。此外，我们显著推进了X光片合成保真度的最新技术，生成的图像对临床专家而言与真实X光片无法区分。

英文摘要

We introduce the first generative foundation model for chest radiograph synthesis trained from scratch at the billion-parameter scale. Existing radiographic AI models often suffer from poor generalisation across patient subpopulations, institutions, and acquisition settings, resulting in limited real-world clinical utility. Controlled, high-fidelity synthesis of chest radiographs is a promising path toward diversifying clinical datasets and evaluating the robustness of diagnostic models. Therefore, we present the largest specialist generative foundation model for chest radiographs to date, with over 1.3B parameters, trained for 1.6T tokens on a curated, heterogeneous dataset comprising 1.2M radiographs and clinical expert-guided metadata. Our model supports controllable radiograph generation and editing across multiple demographic subgroups, acquisition views, and a dozen pathologies. Moreover, we significantly advance the state of the art in radiograph synthesis fidelity, producing images that are indistinguishable from real radiographs to clinical experts.

2606.19539 2026-06-19 astro-ph.SR cs.AI 交叉投稿

Review of Machine Learning Models for Solar Energetic Particle Prediction

太阳高能粒子预测的机器学习模型综述

Spiridon Kasapis, Pouya Hosseinzadeh, Kathryn Whitman, Ricky Egeland, Manolis Georgoulis, Angelos Vourlidas, Athanasios Papaioannou, Eleni Lavasa, Anastasios Anastasiadis, Giorgos Giannopoulos, Andres Munoz-Jaramillo, Bala Poduval, Irina N. Kitiashvili, Alexander G. Kosovichev, Viacheslav Sadykov, Soukaina Filali Boubrahimi, Tate T. Hutchins, Hameedullah A. Farooki, Manuel E. Cuesta, Leng Y. Khoo, Sungmin Pak, Robert Czarnota, Jamie S. Rankin, Jamey Szalay, Mitchell M. Shen, Georgios Livadiotis, Zigong Xu, David J. McComas, Nikolaos Sarlis, Dionissios Hristopulos, Arik Posner, Alec J. Engell, Mohammed AbuBakr Ali, Ali G. A. Abdelkawy, Abdelrazek M. K. Shaltout, M. M. Beheary, Christina O. Lee, Sigiava Aminalragia-Giamini, Constantinos Papadimitriou, Ingmar Sandberg, Savvas Raptis, Shah Muhammad Hamdi, Monica Laurenza, Mirko Stumpo, Sumanth A. Rotti, India Jackson, Aatiya Ali, Atilim Gunes Baydin, Nathan Schwadron, Subhamoy Chatterjee, Maher A. Dayeh, Gelu M. Nita, Patrick M. O'Keefe, Chun Jie Chong, Paul Kosovich, Russell D. Marroquin, Berkay Aydin, Petrus C. Martens, Lulu Zhao, Yang Chen, Yian Yu, Monica G. Bobra, Ward Manchester, Tamas Gombosi, Ming Zhang, Jesse Torres, Philip K. Chan, Mohamed Nedal, Kamen Kozarev, Peijin Zhang, Kimberly Moreland, Hazel M. Bain, Samuel Hart, Michael J. Starkey, Alan G. Ling, Simone Benella

AI总结综述了用于太阳高能粒子预测的机器学习模型，包括数据集、架构、输入输出比较，并提出了未来研究建议。

Comments Review Paper, Main text: 23 pages, References: 5 pages, Appendix: 42 pages

详情

AI中文摘要

太阳高能粒子事件因其对航空、航天器电子设备以及地球磁层外人类任务的显著辐射危害而日益受到关注。从科学角度来看，SEP事件之所以引人入胜，是因为它们源于从太阳表面和日冕延伸到日光层的一系列物理过程，提供了对广泛适用于天体物理学的粒子加速和传输机制的洞察。因此，提高我们理解和预测SEP事件的能力，对于加深对这些机制的认识以及保护空间技术和探索至关重要。传统上，研究人员使用基于物理的模拟和经验方法对SEP进行建模。最近，机器学习已成为理解和预测SEP事件的新工具。本文旨在回顾当前可用于SEP预测的机器学习模型，识别用于训练的数据集，比较它们的架构、输入和输出，并基于这些见解，为未来研究概述良好实践和建议。

英文摘要

Solar energetic particle (SEP) events have attracted increasing attention due to their significant radiation hazards for aviation, spacecraft electronics, and human missions beyond Earth's magnetosphere. From a scientific perspective, SEP events are intriguing because they arise from a set of physical processes extending from the solar surface and corona through the heliosphere, offering insight into particle acceleration and transport mechanisms that are widely applicable across astrophysics. Therefore, advancing our ability to understand and predict SEP events is essential both for deepening our knowledge of such mechanisms and for safeguarding space technologies and exploration. Traditionally, researchers have modeled SEPs using physics-based simulations and empirical methods. More recently, machine learning (ML) has emerged as a new tool for understanding and predicting SEP events. The purpose of this manuscript is to review the currently available ML models for SEP prediction, identify the datasets used for training, compare their architectures, inputs, and outputs, and, based on these insights, outline good practices and recommendations for future research.

2606.19566 2026-06-19 eess.SY cs.AI cs.SY 交叉投稿

GDGU: A Gradient Difference-based Graph Unlearning Method for Cyberattack Localization in Electric Vehicle Charging Networks

GDGU：基于梯度差异的图遗忘方法用于电动汽车充电网络中的网络攻击定位

Nanhong Liu, Mucun Sun, Jie Zhang

AI总结针对电动汽车充电站数据删除需求，提出基于梯度差异的图遗忘方法（GDGU），通过一阶参数校正实现高效遗忘，在保持定位性能的同时显著降低计算开销。

详情

AI中文摘要

电动汽车充电站（EVCS）可能使配电馈线暴露于网络攻击。尽管包括图神经网络在内的机器学习方法可以定位哪个母线被攻破，但在数据共享和模型训练方面仍存在重大挑战。例如，隐私法规允许EVCS所有者从已部署的模型中删除其训练数据，但每次请求都从头重新训练在计算上不可行。为了解决这个问题，我们研究了用于EVCS网络攻击定位的图遗忘（GU），将其形式化为图级多标签分类任务上的特征级遗忘问题。具体来说，我们提出了基于梯度差异的图遗忘（GDGU），通过一阶参数校正消除请求删除数据的影响。该校正基于原始训练数据与修改后数据集之间的梯度差异计算，其中仅遗忘请求的EVCS母线的充电功率特征。然后，应用批归一化重新校准和简短的恢复微调步骤以恢复定位效用。我们在IEEE 34母线、123母线和8500节点配电网络上，使用三种图神经网络骨干网络和累积遗忘场景，将GDGU与两种二阶GU基线进行比较。GDGU在定位效用上与最强基线相当，遗忘保真度接近完全重新训练，同时遗忘速度比从头重新训练快10到12倍，且内存使用远少于二阶GU基线。

英文摘要

Electric vehicle charging stations (EVCSs) can expose distribution feeders to cyberattacks. While machine learning methods, including graph neural networks, can localize which bus is compromised, significant challenges remain in data sharing and model training. For example, privacy regulations grant EVCS owners the right to delete their training data from a deployed model, yet retraining from scratch on every request is computationally prohibitive. To address this, we study graph unlearning (GU) for EVCS cyberattack localization, formulated as a feature-level unlearning problem on a graph-level multi-label classification task. Specifically, we propose gradient difference-based graph unlearning (GDGU), which removes the influence of the requested deletion data through a first-order parameter correction. The correction is computed from the gradient difference between the original training data and a modified dataset in which only the charging power features at the requested EVCS buses are unlearned. Then, a batch-normalization recalibration and a brief recovery fine-tuning step are applied to restore localization utility. We benchmark GDGU against two second-order GU baselines on the IEEE 34-bus, 123-bus, and 8500-node distribution networks across three graph neural network backbones and cumulative unlearning scenarios. GDGU matches the strongest baseline on localization utility and reaches forgetting fidelity close to full-retraining, while unlearning 10 to 12 times faster than retraining from scratch and using far less memory than the second-order GU baselines.
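
The idea of a first-order correction driven by a gradient difference can be sketched on a toy problem. This is not the GDGU implementation: the scalar linear model, the toy data, the step size `eta`, the sign convention of the update, and the choice to represent "unlearning" a feature by zeroing it are all assumptions made for illustration, and the batch-normalization recalibration and recovery fine-tuning steps are omitted.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the scalar model y = w * x."""
    n = len(xs)
    return (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))


def fit(xs, ys):
    """Closed-form least-squares fit; stands in for training from scratch."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Toy data (hypothetical): one scalar feature per sample.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Assumption: the feature of the last sample is requested for deletion and is zeroed.
xs_unlearned = [1.0, 2.0, 3.0, 0.0]

w_orig = fit(xs, ys)                    # deployed model
w_retrained = fit(xs_unlearned, ys)     # expensive reference: full retraining

# First-order correction: one step along the difference between the gradient on the
# modified data and the gradient on the original data, both taken at w_orig.
eta = 0.1  # assumed step size
grad_diff = mse_grad(w_orig, xs_unlearned, ys) - mse_grad(w_orig, xs, ys)
w_corrected = w_orig - eta * grad_diff
```

In this toy setting the corrected parameter moves from the deployed value toward the retrained one without refitting, which is the cost trade-off the abstract describes.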

2606.19568 2026-06-19 cs.SD cs.AI 交叉投稿

Exploring Feature Extraction Technique Parameters for Acoustic Gunshot Classification

声学枪声分类的特征提取技术参数探索

Sinclair Gurny, Ryan Quinn

AI总结本文系统研究了特征提取技术及其参数对声学枪声分类的影响，使用ResNet-18在23000条枪声数据集上评估，发现正确技术可提升top-1准确率20%，参数优化可再提升4.7%。

2606.19579 2026-06-19 cs.SD cs.AI 交叉投稿

FlowFake: Liquid Networks for Audio Deepfake Detection

FlowFake: 用于音频深度伪造检测的液态网络

Shivaay Dhondiyal, Divyansh Sharma, Dinesh Kumar Vishwakarma

发表机构 * Delhi Technological University（德里理工大学）

AI总结针对音频深度伪造检测中跨数据集泛化失败的问题，提出基于液态时间常数（LTC）架构的FlowFake模型，通过学习ODE演化隐藏状态并自适应时间常数，以34K参数在跨域基准上超越现有方法。

Comments Accepted at the Workshop on Learning to Listen: Machine Learning for Audio at ICML 2026

详情

AI中文摘要

由神经文本转语音和语音克隆系统生成的音频深度伪造对说话人验证和公共话语构成大规模威胁。核心挑战是跨数据集泛化：在一种合成流水线上训练的检测器在面对未见过的伪造时性能崩溃。我们认为这种失败主要是由于结构性合成语音伪影，这些伪影是多时间尺度的轨迹异常。尽管每个现有检测器都聚合固定窗口的帧统计量，但这使得架构与信号不对齐。我们提出FlowFake，一种液态时间常数（LTC）架构，其隐藏状态通过学习ODE演化，每个神经元具有自适应时间常数，同时解析频谱（10ms）和韵律（2s）线索。仅34K参数，FlowFake实现了正式的BIBO稳定性和O(dt^4)积分误差。在四个数据集的跨域基准（ASVspoof2019-LA、FakeOrReal、InTheWild、MLAAD）上，FlowFake在仅用FakeOrReal训练时在ASVspoof2019上达到75.29%，仅用MLAAD训练时达到79.97%。它在每个评估对上优于RawGAT-ST和Whisper-DF，并以0.01%的参数数量匹配SSL Wav2vec2（大300倍）。源代码可在以下网址获取：https://github.com/GhostRider2023/FlowFake

英文摘要

Audio deepfakes generated by neural text-to-speech and voice-cloning systems threaten speaker verification and public discourse at scale. The core challenge is cross-dataset generalization: detectors trained on one synthesis pipeline collapse on unseen forgeries. We argue that this failure arises primarily because synthetic-speech artifacts are structural, multi-timescale trajectory anomalies, whereas every existing detector aggregates fixed-window frame statistics, which misaligns the architecture with the signal. We propose FlowFake, a Liquid Time-Constant (LTC) architecture whose hidden state evolves via a learned ODE, with per-neuron adaptive time constants simultaneously resolving spectral (10ms) and prosodic (2s) cues. At only 34K parameters, FlowFake achieves formal BIBO stability and O(dt^4) integration error. On a four-dataset cross-domain benchmark (ASVspoof2019-LA, FakeOrReal, InTheWild, MLAAD), FlowFake reaches 75.29% on ASVspoof2019 when trained only on FakeOrReal and 79.97% when trained only on MLAAD. It outperforms RawGAT-ST and Whisper-DF on every evaluated pair and matches SSL Wav2vec2 (300x larger) at 0.01% of its parameter count. The source code is available at: https://github.com/GhostRider2023/FlowFake
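
A hidden state that "evolves via a learned ODE" can be sketched with a single scalar neuron. This is not FlowFake's code: the simplified LTC-style dynamics below (with a gate that depends only on the input), the parameter names, and the use of a classical fourth-order Runge-Kutta step as the integrator behind an O(dt^4) error are assumptions for illustration.

```python
import math


def ltc_derivative(x, inp, tau, w, b, a):
    """Simplified LTC-style dynamics: dx/dt = -(1/tau + f) * x + f * a.

    f = sigmoid(w * inp + b) is an input-dependent gate, so the effective
    time constant tau / (1 + tau * f) adapts to the input.
    """
    f = 1.0 / (1.0 + math.exp(-(w * inp + b)))
    return -(1.0 / tau + f) * x + f * a


def rk4_step(deriv, x, dt):
    """One classical Runge-Kutta step for dx/dt = deriv(x)."""
    k1 = deriv(x)
    k2 = deriv(x + 0.5 * dt * k1)
    k3 = deriv(x + 0.5 * dt * k2)
    k4 = deriv(x + dt * k3)
    return x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def integrate(deriv, x0, dt, steps):
    """Advance the state by `steps` RK4 steps of size `dt`."""
    x = x0
    for _ in range(steps):
        x = rk4_step(deriv, x, dt)
    return x


# Constant input, hypothetical parameters: tau=1, w=1, b=0, a=1, so the gate is 0.5.
neuron = lambda x: ltc_derivative(x, 0.0, 1.0, 1.0, 0.0, 1.0)
state_at_t1 = integrate(neuron, 0.0, 0.1, 10)
```

With a constant input the dynamics are linear, so the result can be checked against the closed-form solution; halving `dt` should shrink the error by roughly a factor of 16, consistent with a fourth-order scheme.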

2606.19605 2026-06-19 cs.SE cs.AI 交叉投稿

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

FAPO：多步骤LLM流水线的全自动提示优化

Paul Kassianik, Baturay Saglam, Huaibo Zhao, Blaine Nelson, Supriti Vijay, Aman Priyanshu, Amin Karbasi

AI总结提出FAPO框架，通过自动诊断流水线瓶颈并迭代优化提示或链结构，在18个模型-基准比较中15次优于基线GEPA，平均提升14.1个百分点。

详情

AI中文摘要

多步骤LLM流水线因检索、推理和格式化步骤间的交互而失败，因此仅提示优化可能遗漏链中的瓶颈。我们提出FAPO（全自动提示优化），一个让Claude Code在标准化代码库内优化LLM流水线的框架。FAPO评估流水线、检查中间步骤、诊断失败、提出范围变更，并重复验证变体以针对评分函数进行优化。它首先尝试提示编辑，仅当提示优化似乎不足时，在归因识别出结构瓶颈的情况下，在允许范围内更改链结构。在六个基准和三个任务模型上，FAPO在18个模型-基准比较中的15个中击败了基线GEPA。在11个模型-基准比较中，FAPO以不重叠的均值±试验标准差范围获胜，平均FAPO-GEPA增益为+14.1个百分点。在六个HoVer和IFBench比较中，当提示优先搜索升级为结构变更时，FAPO在所有六个中获胜，平均增益为+33.8个百分点。FAPO还提高了安全任务的性能：在CTIBench-RCM（一个安全CVE到CWE任务）上，仅提示的FAPO在GPT-5上提升了+4.0个百分点的测试准确率，在Foundation-Sec-8B-Instruct上提升了+7.1个百分点，在Foundation-Sec-8B-Reasoning上提升了+2.0个百分点。这些结果使FAPO成为通用和安全任务的最先进流水线优化技术。

英文摘要

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and changes chain structure, within the permitted scope, only when prompt optimization appears insufficient and attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.

2606.19627 2026-06-19 cs.IR cs.AI cs.LG 交叉投稿

VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions

VCG：极端冷启动条件下电商视频流的多模态检索框架

Katya Mirylenka, Egor Malykh, Mahdyar Ravanbakhsh, Michael Gygli, Marco-Andrea Buchmann, Andrew Dzhoha, Svitlana Borzenko, Francesca Catino, Mohamed Gaafar, Maarten Versteegh, Thomas Kober, Dario d'Andrea, Ellie Langhans

AI总结针对电商视频流中的极端冷启动和偏差问题，提出基于领域自适应视觉-语言模型（CLIP）的可扩展多模态检索系统VCG，实现零样本检索，在线测试显示深度视频完成率提升50%。

详情

AI中文摘要

数字商业格局正从静态的搜索驱动型目录转向动态的沉浸式视频流。这一转变引入了“极端冷启动”问题：与传统商品不同，新的短视频缺乏协同过滤所需的密集交互历史。此外，沉浸式视频流引入了强烈的位置和时长偏差，扭曲了标准参与信号。在本文中，我们展示了视频候选生成（VCG）系统，这是一个可扩展的多模态检索引擎，旨在解决大规模电商环境中的这些挑战。通过利用领域自适应的视觉-语言模型（基于CLIP），我们将用户和视频映射到共享语义空间，实现基于视觉内容而非行为历史的零样本检索。我们详细介绍了系统的架构，并进行了严格的评估，比较了生成式（LLM）和判别式（CLIP）嵌入。结果表明，虽然生成式模型在属性预测方面表现出色，但在检索任务中会出现嵌入空间坍塌。在线A/B测试表明，VCG有效缓解了参与偏差，使深度视频完成率提升了50%。为了展示系统的能力，我们提供了一个交互式演示，包含三种双向检索场景：产品到视频、视频到产品和零样本语义搜索。

英文摘要

The digital commerce landscape is shifting from static, search-driven catalogs to dynamic, immersive video feeds. This transition introduces an "extreme cold-start" problem: unlike traditional items, new short-form videos lack the dense interaction history required for collaborative filtering. Furthermore, immersive feeds introduce strong position and duration biases that distort standard engagement signals. In this paper, we demonstrate the Video Candidate Generation (VCG) system, a scalable multimodal retrieval engine designed to solve these challenges in a large-scale e-commerce environment. By leveraging a domain-adapted vision-language model (based on CLIP), we map users and videos into a shared semantic space, enabling zero-shot retrieval based on visual content rather than behavioral history. We detail the system's architecture and present a rigorous evaluation comparing generative (LLM) vs. discriminative (CLIP) embeddings. Our results show that while generative models excel at attribute prediction, they suffer from embedding space collapse in retrieval tasks. Online A/B testing demonstrates that VCG effectively mitigates engagement biases, yielding a 50% uplift in deep video completion. To showcase the system's capabilities, we present an interactive demonstration featuring three bi-directional retrieval scenarios: Product-to-Video, Video-to-Product, and Zero-Shot Semantic Search.

2606.19635 2026-06-19 cs.IR cs.AI cs.LG 交叉投稿

Token Factory: Efficiently Integrating Diverse Signals into Large Recommendation Models

Token Factory：高效整合多样化信号于大型推荐模型

Xilun Chen, Shao-Chuan Wang, Baykal Cakici, Lukasz Heldt, Lichan Hong, Raghu Keshavan, Aniruddh Nath, Li Wei, Xinyang Xi

AI总结提出Token Factory框架，将传统信号转化为软令牌，高效集成到基于Transformer的大型推荐模型中，避免提示长度爆炸并提升性能。

Comments 8 pages, 10 figures

2606.19710 2026-06-19 cs.CL cs.AI 交叉投稿

FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs

FineREX: 面向人口走私知识图谱的微调NER-RE

Elijah Feldman, Dipak Meher, Carlotta Domeniconi

发表机构 * Thomas Jefferson High School for Science and Technology（托马斯·杰斐逊科技高中）

AI总结提出FineREX，一个基于微调LLM的流水线，用于从法律文档中提取实体和关系构建知识图谱，在F1分数上分别提升15.50%和31.46%，并减少50%处理时间。

Comments Code available at https://github.com/ElijahFeldman7/FineREX

详情

AI中文摘要

法庭记录包含关于人口走私网络的有价值证据，但这些信息通常埋藏在非结构化的、充满术语的法律文件中。虽然大型语言模型（LLM）可以通过自动信息提取支持知识图谱构建，但现有方法依赖通用模型，未针对该领域所需的实体和关系定义进行定制。我们提出FineREX，一个精简的知识图谱构建流水线，基于微调的LLM进行命名实体识别和关系提取（NER-RE）。使用包含512个文本块的手动标注数据集，FineREX在实体和关系F1分数上分别比更大的通用基线模型绝对提高了15.50%和31.46%。这些提升转化为更高质量的知识图谱，将法律噪声减少近一半，并将长文档上的节点重复率从17.78%降至11.17%。通过消除文档重写和冗余提取阶段，FineREX还将端到端处理时间减少了50.0%。我们的结果表明，领域特定的微调可以显著优于更大的通用模型，同时提高非法网络分析知识图谱构建的质量和效率。

英文摘要

Court proceedings contain valuable evidence about human smuggling networks, but this information is often buried within unstructured, jargon-heavy legal documents. While large language models (LLMs) can support knowledge graph construction through automated information extraction, existing approaches rely on general-purpose models that are not tailored to the entity and relationship definitions required in this domain. We introduce FineREX, a streamlined knowledge graph construction pipeline built around a fine-tuned LLM for named entity recognition and relationship extraction (NER-RE). Using a manually annotated dataset of 512 text chunks, FineREX achieves absolute improvements of 15.50% and 31.46% in entity and relationship F1-score, respectively, compared to a larger general-purpose baseline. These gains translate into higher-quality knowledge graphs, reducing legal noise by nearly half and lowering node duplication on long documents from 17.78% to 11.17%. By eliminating document rewriting and redundant extraction stages, FineREX also reduces end-to-end processing time by 50.0%. Our results demonstrate that domain-specific fine-tuning can substantially outperform larger general-purpose models while improving both the quality and efficiency of knowledge graph construction for illicit network analysis.
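
For readers unfamiliar with how entity and relationship F1 scores are typically computed, here is a minimal Python sketch. It assumes exact-match scoring over sets of extracted items, which may differ from the matching criterion used in the paper; the `prf1` helper and the example entities are hypothetical.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 under exact matching of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (text, type) entity annotations for one text chunk.
gold_entities = {
    ("John Doe", "PERSON"),
    ("Laredo", "LOCATION"),
    ("stash house", "LOCATION"),
}
predicted_entities = {
    ("John Doe", "PERSON"),
    ("Laredo", "LOCATION"),
    ("Texas", "LOCATION"),
}

precision, recall, f1 = prf1(predicted_entities, gold_entities)
```

The same function applies to relationships if each item is a (head, relation, tail) triple. An "absolute improvement" in F1, as reported in the abstract, is a difference in percentage points between two such scores.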

2606.19725 2026-06-19 cs.SE cs.AI cs.MA 交叉投稿

Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware

面向OpenSIL固件中大语言模型生成单元测试的库感知测试替身与迭代修复

Ma Toan Bach, Yuchi Zheng, Haingo Razafindranto, Tanvir Alam, Aric Leather, Ranveer Sandhu, Jitesh Arora

AI总结针对OpenSIL固件单元测试因构建约束易失败的问题，提出LLM引导的多智能体自动化测试生成与迭代修复流程，在76个函数中73个生成可编译测试，行覆盖率达98.8%。

Comments 20 pages, 10 figures

详情

AI中文摘要

验证底层C固件中的变更成本高昂，因为单元测试（UT）在严格的构建约束下非常脆弱，缺失的头文件、未解析的符号和依赖不匹配经常阻止编译和链接。本研究为AMD维护的开源硅初始化库（openSIL）固件代码库引入了一种自动化的UT编写工作流程，通过大语言模型（LLM）引导的多智能体管道减少手动工作。该工作流程结合了测试框架的自动生成、库感知的桩、模拟和伪造的创建或重用，以及由构建日志和行覆盖率反馈驱动的迭代编译-分派修复循环。我们使用编译成功率、修复迭代次数、分派成功率和行覆盖率评估该方法，并以时间、成本和令牌使用量作为次要指标。在76个被测函数中，该工作流程为73个函数生成了可编译的UT。在没有行覆盖率指导或检索增强的配置下，平均行覆盖率达到73.9%。在两种配置下评估的48个函数子集中，仅使用行覆盖率指导时平均行覆盖率达到98.8%，与向量数据库检索结合时达到94.7%。结果表明，自动生成和修复管道可以显著提高受限固件环境中UT创建的效率和覆盖率，同时减少手动调试工作量。

英文摘要

Validating changes in low-level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi-agent pipeline. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset evaluated under both configurations, mean line coverage reached 98.8% with line-coverage guidance alone and reached 94.7% when combined with vector-database retrieval. Results show that automated generation-and-repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.

2606.19791 2026-06-19 eess.AS cs.AI cs.SD 交叉投稿

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

跨数据集、年龄和性别泛化：低资源儿童语音识别的微调策略综合分析

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

AI总结针对低资源儿童语音识别，系统分析了不同微调策略在跨数据集、年龄和性别泛化上的表现，发现特定策略能显著提升泛化能力。

2606.19793 2026-06-19 eess.AS cs.AI cs.LG cs.SD eess.SP 交叉投稿

Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models

构音障碍语音识别的系统研究：频谱特征与声学模型

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

AI总结本文系统研究不同频谱特征与声学模型的组合，通过引入音高特征和优化训练帧重叠数，在F-TDNN模型上实现孤立词和句子识别相对提升4.65%和4.63%。

详情

AI中文摘要

识别构音障碍语音的挑战主要源于发音精度受损导致的显著声学变异性。过去的研究表明，通过使用混合DNN/HMM序列区分性训练可以改善识别性能。本文对不同声学模型定制的各种声学特征组合进行了全面研究，为每种模型提供了合适的特征选择。音高特征的引入显著提高了识别性能，特别是对于涉及构音障碍语音的句子识别任务。通过对TORGO数据库的系统检查，我们证明了增强最先进的因子化时延神经网络（F-TDNN）模型识别构音障碍语音性能的潜力。使用F-TDNN模型实现的方法，与先前研究相比，在构音障碍语音的孤立词识别中获得了4.65%的相对改进，在句子识别中获得了4.63%的相对改进。这种改进有效补偿了语音变异性，这归因于我们精心选择了连续训练样本块之间的重叠帧数。

英文摘要

The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different Acoustic Models, offering suitable feature selections for each. The incorporation of Pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.

2606.19795 2026-06-19 cs.SE cs.AI 交叉投稿

Agentic Electronic Design Automation: A Handoff Perspective

代理式电子设计自动化：一种交接视角

Jiawei Liu, Peiyi Han, Yuntao Lu, Su Zheng, Fengyu Yan, Bei Yu

AI总结本文从交接有效性角度出发，将EDA流程中的代理系统分为三类，并提出五层代理通信协议，以解决多阶段、多工具间的状态传递和验证问题。

详情

AI中文摘要

电子设计自动化（EDA）本质上是多阶段且交接密集的。设计工件、流程脚本和工程决策在最终实现、签核或发布之前，跨越工具、会话和组织边界。每次传递都携带显式和隐式需求，这些需求可能无法被阶段局部检查完全捕获。基于LLM的代理现在直接调用EDA工具，将检索到的知识嵌入可执行脚本，并在会话和阶段之间传递状态。一旦它们的输出影响下游工程决策，传递的对象必须满足交接合同并符合其下一个消费者的假设。本综述引入交接有效性作为其组织原则。当传递的对象满足消费者的接受条件，并携带足够的上下文、证据和来源以供下游使用时，交接是有效的。我们回顾了82个系统，并将它们分为三个边界类别。阶段边界系统在单个EDA阶段或有界验证任务内建立有效性。流程边界系统在工具、调用和会话之间保持连贯的工作流状态。组织边界系统在知识和权限边界之间维护源基础、来源、范围及可接受性。对于每个类别，我们分析交接合同、交接对象、协调机制和开放问题。这些分析激发了一个五层EDA代理通信协议（EACP），涵盖代理发现、代理消息、工具调用、工作流编排以及安全和IP协议。我们旨在为可信的代理式EDA提供通用词汇和研究议程。

英文摘要

Electronic design automation (EDA) is inherently multi-stage and handoff-heavy. Design artifacts, flow scripts, and engineering decisions cross tool, session, and organizational boundaries before final implementation, signoff, or release. Each transfer carries explicit and implicit requirements that may not be fully captured by stage-local checks. LLM-based agents now invoke EDA tools directly, embed retrieved knowledge in executable scripts, and hand off state across sessions and stages. Once their outputs condition downstream engineering decisions, the transferred object must satisfy a handoff contract and meet the assumptions of its next consumer. This survey introduces handoff validity as its organizing principle. A handoff is valid when the transferred object satisfies the consumer's acceptance conditions and carries sufficient context, evidence, and provenance for downstream use. We review 82 systems and classify them into three boundary classes. Stage-Bound systems establish validity within a single EDA stage or bounded verification task. Flow-Bound systems preserve coherent workflow state across tools, invocations, and sessions. Organization-Bound systems maintain source grounding, provenance, scope, and admissibility across knowledge and authority boundaries. For each class, we analyze handoff contracts, handoff objects, coordination mechanisms, and open questions. These analyses motivate a five-layer EDA agent communication protocol (EACP), covering the agent discovery, agent message, tool invocation, workflow orchestration, and security and IP protocols. We aim to provide a common vocabulary and research agenda for trustworthy agentic EDA.

加速工业应用中的语义分割标注过程

Marta Fernandez-Moreno, Margarita Guerrero, Rosalia Rementeria, Pablo Mesejo, Raul Moreno

发表机构 * Department of Computer Science and Artificial Intelligence, Andalusian Research Institute in Data Science and Computational Intelligence, DaSCI, University of Granada（格拉纳达大学计算机科学与人工智能系，安达卢西亚数据科学与计算智能研究所，DaSCI）； Department of Computer Science and Automatic Control, National Distance Education University (UNED)（国立远程教育大学计算机科学与自动控制系）

AI总结本文利用无监督算法将材料科学中语义分割的标注时间从170小时降至37小时（减少78%），并发布了最大的公开钢微观结构分割数据集。

详情

AI中文摘要

当前的机器学习模型通常需要大量且标注良好的数据集。然而，标注过程常常成为瓶颈，随着复杂性的增加，人为错误的机会也更高。在此背景下，本文旨在利用无监督算法提高工业材料科学中复杂语义分割问题的数据标注效率。以往的研究量化了标注时间，并探索了无监督方法。但据我们所知，这是首次量化无监督算法加速标注过程程度的研究。我们旨在验证这一繁琐过程可以加速的程度，重点关注涉及高分辨率图像每个像素标注的语义分割任务，例如材料科学中的微观结构表征挑战。具体来说，我们证明通过使用无监督计算机视觉算法，标注过程所需的时间可以从170小时减少到37小时，实现了约78%的减少。我们处理的数据集包括尺寸为1280x959和960x703的大图像，这进一步增加了标注任务的复杂性。尽管存在这些挑战，我们创建并共享了迄今为止最大的公开钢微观结构分割数据集，在MIT许可下提供，并具有永久DOI，为该领域贡献了一个完全标注的高分辨率数据集。此外，这是首次将从头开始标注的时间（以往研究中的常见方法）与使用这些无监督算法作为预标注步骤时的标注时间进行比较。此外，我们提供了一个在此数据集上训练的深度学习模型，该模型经过领域专家验证，并部署在工业环境中，作为该公共数据集的初始基准。

英文摘要

Current machine learning models commonly require large and well-annotated datasets. However, the annotation process often becomes a bottleneck, with increased complexity leading to a higher chance of human error. Within this context, our goal in this paper is to leverage unsupervised algorithms to improve data annotation efficiency for complex semantic segmentation problems in industrial materials science. Previous research has quantified labeling time, and other work has explored unsupervised methods. However, to the best of our knowledge, this is the first study to quantify how much unsupervised algorithms accelerate the labeling process. We aim to validate the extent to which this laborious process can be accelerated, focusing on semantic segmentation tasks that involve annotating each pixel of high-resolution images, such as the microstructure characterization challenge in materials science. Specifically, we demonstrate that by using unsupervised computer vision algorithms, the time required for the labeling process can be reduced from 170 hours to 37 hours, a reduction of approximately 78%. The dataset we work with includes large images of dimensions 1280x959 and 960x703, which further increases the complexity of the annotation task. Despite these challenges, we create and share the largest public steel microstructure segmentation dataset to date, available under the MIT License with a permanent DOI, contributing a fully annotated, high-resolution dataset to the field. Additionally, this is the first work to compare the labeling time from scratch (a common approach in previous studies) to the labeling time when using these unsupervised algorithms as a pre-annotation step. Furthermore, we provide a Deep Learning model trained on this dataset, validated by field experts, and deployed in an industrial setting, serving as an initial benchmark for this public dataset.

2606.19943 2026-06-19 eess.IV cs.AI 交叉投稿

SIMBA: A Bidirectional Retrieval Forward Simulation Framework for Modeling FY-4A GIIRS Hyperspectral Infrared Radiances Toward NWP Applications

SIMBA：面向NWP应用的FY-4A GIIRS高光谱红外辐射双向检索正向模拟框架

Jingdong Shen, Fu Wang*, Qifeng Lu, Hao Huang, Chunqiang Wu, Chi Yang, Xiaofang Liu

AI总结提出SIMBA框架，联合进行大气廓线检索和辐射重建，通过循环一致性约束和双向Mamba模块增强耦合，在FY-4A GIIRS数据上优于多种深度学习基线。

详情

AI中文摘要

高光谱红外观测是数值天气预报（NWP）的重要数据源，因为它们提供了大气温度和湿度垂直结构的丰富信息。然而，现有的深度学习方法主要关注从辐射到大气廓线的单向检索，而反向辐射模拟过程以及大气状态空间与辐射观测空间之间的一致性考虑不足。在本研究中，我们提出了SIMBA，一个用于FY-4A GIIRS高光谱红外辐射建模的统一双向检索-正向模拟框架，面向NWP应用。该框架联合执行大气廓线检索和辐射重建，引入循环一致性约束以加强两个过程之间的耦合，并采用双向Mamba状态空间模块来捕捉沿气压层的长程依赖。利用配准的FY-4A GIIRS观测和ERA5再分析数据，该方法在温度检索、比湿检索、长波辐射重建和中波辐射重建上进行了评估。实验结果表明，SIMBA在检索和重建任务上均优于多个代表性深度学习基线，而消融实验证实了双向设计和循环一致性机制的贡献。这些结果表明，所提出的框架对于联合大气廓线检索和高光谱红外辐射建模是有效的，并显示出未来在雅可比相关分析和面向NWP扩展方面的潜力。

英文摘要

Hyperspectral infrared observations are an important data source for numerical weather prediction (NWP) because they provide rich information on the vertical structure of atmospheric temperature and humidity. However, most existing deep learning methods mainly focus on one-way retrieval from radiances to atmospheric profiles, while the reverse radiance simulation process and the consistency between atmospheric state space and radiance observation space are insufficiently considered. In this study, we propose SIMBA, a unified bidirectional retrieval-forward simulation framework for FY-4A GIIRS hyperspectral infrared radiance modeling toward NWP applications. The framework jointly performs atmospheric profile retrieval and radiance reconstruction, introduces a cycle-consistency constraint to strengthen the coupling between the two processes, and employs a bidirectional Mamba state-space module to capture long-range dependencies along pressure levels. Using collocated FY-4A GIIRS observations and ERA5 reanalysis data, the proposed method is evaluated for temperature retrieval, specific humidity retrieval, long-wave radiance reconstruction, and medium-wave radiance reconstruction. Experimental results show that SIMBA outperforms several representative deep learning baselines across both retrieval and reconstruction tasks, while ablation experiments confirm the contribution of the bidirectional design and cycle-consistency mechanism. These results demonstrate that the proposed framework is effective for joint atmospheric profile retrieval and hyperspectral infrared radiance modeling, and suggest potential for future Jacobian-related analysis and NWP-oriented extensions.

2606.19975 2026-06-19 cs.CY cs.AI 交叉投稿

The Algorithmic-Human Manager: AI, Apps, and Workers in the Indian Gig Economy

算法-人类管理者：印度零工经济中的AI、应用程序与工人

Omir Kumar, Krishnan Narayanan

AI总结本文研究AI和数字技术对印度蓝领零工经济中算法管理的影响，发现其虽扩大就业机会但引发公平性、透明度和工人尊严问题，提出算法-人类管理者混合治理模型。

Comments Published by the Centre for Responsible AI (CeRAI) at IIT Madras

AI中文摘要

本文考察了人工智能和数字技术对印度蓝领零工经济的影响，重点关注算法管理——即在基于位置的服务（如拼车和配送）中使用自动化系统来分配、监控和评估工作。采用社会正义框架和混合方法（包括对16名零工工人和21名关键利益相关者的访谈），研究揭示了一个双重现实：虽然AI驱动的系统扩大了工作机会并产生了运营效率，但它们同时引入了与公平、透明度和工人尊严相关的重大挑战。关键发现表明，算法系统设计上不透明，产生不公平的结果，并且其结构不能为额外劳动提供相应报酬。研究倡导一种务实的混合治理模型——算法-人类管理者框架，其中技术效率和人类问责制共同运作而非对立。研究结果对政策制定者、平台公司以及致力于为印度和全球南方的零工经济设计公平AI治理框架的民间社会组织具有启示意义。

英文摘要

This paper examines the impact of artificial intelligence and digital technologies on the blue-collar gig economy in India, focusing on algorithmic management: the use of automated systems to allocate, monitor, and evaluate work in location-based services such as ride sharing and delivery. Using a social justice framework and a mixed-methods approach comprising interviews with 16 gig workers and 21 key stakeholders, the study uncovers a dual reality: while AI-powered systems expand access to work and generate operational efficiencies, they simultaneously introduce significant challenges related to fairness, transparency, and worker dignity. Key findings reveal that algorithmic systems are opaque by design, produce inequitable outcomes, and are not structured to reward additional labour with proportionate pay. The study advocates for a pragmatic hybrid governance model, an Algorithmic-Human Manager framework, in which technological efficiency and human accountability operate together rather than in opposition. The findings carry implications for policymakers, platform companies, and civil society organizations working to design equitable AI governance frameworks for the gig economy in India and across the Global South.

2606.20041 2026-06-19 econ.GN cs.AI cs.LG q-fin.EC q-fin.GN 交叉投稿

MedRLM：用于长上下文临床推理、传感器引导筛查、证据支持决策及社区到三级转诊优化的递归多模态健康智能

Aueaphum Aueawatthanaphisut

发表机构 * School of Information, Computer, and Communication Technology, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani, Thailand

AI总结提出MedRLM递归多模态健康智能框架，通过递归检查、分解、检索、验证和合成患者信息，协调多个专业代理并引入临床证据图记忆，实现长上下文临床推理和传感器引导筛查。

Comments 9 pages, 3 figures, 3 tables, 1 Algorithm, 29 equations

AI中文摘要

现实世界的临床决策支持需要对异质性和纵向的患者信息进行推理，而不是回答孤立的医学问题。然而，当前的医学大语言模型和检索增强生成系统通常依赖单步提示或检索，当临床证据分布在长电子健康记录、医学图像、传感器流、指南和转诊约束中时，这可能变得脆弱。本文提出MedRLM，一个用于长上下文临床推理、传感器引导筛查和社区到三级转诊支持的递归多模态健康智能框架。MedRLM不是将所有患者信息压缩到一个提示中，而是将患者病例视为一个外部临床环境，可以递归地检查、分解、检索、验证和综合。该框架协调了专门用于临床文本、纵向EHR、医学影像、生理传感器信号、指南检索、不确定性审计和转诊规划的代理。它进一步引入了临床证据图记忆，将患者特定的观察结果与检索到的证据、标准化定义、传感器衍生的生物标志物和转诊标准连接起来。传感器引导的递归触发机制在检测到异常生理或行为模式时激活更深层次的推理，而不确定性门控细化支持临床医生对高风险或低置信度病例的审查。我们还概述了一个使用公共和经认证的临床数据集（涵盖EHR、放射学、ECG、ICU时间序列和转诊代理结果）的真实数据评估设计。MedRLM旨在将医学AI从静态问答转向可审计、多模态和流程感知的临床决策支持。

英文摘要

Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and retrieval-augmented generation systems often rely on single-step prompting or retrieval, which can be fragile when clinical evidence is distributed across long electronic health records, medical images, sensor streams, guidelines, and referral constraints. This paper proposes MedRLM, a Recursive Multimodal Health Intelligence framework for long-context clinical reasoning, sensor-guided screening, and community-to-tertiary referral support. Instead of compressing all patient information into one prompt, MedRLM treats the patient case as an external clinical environment that can be recursively inspected, decomposed, retrieved, verified, and synthesized. The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning. It further introduces a Clinical Evidence Graph Memory to connect patient-specific observations with retrieved evidence, standardized definitions, sensor-derived biomarkers, and referral criteria. A sensor-guided recursive triggering mechanism activates deeper reasoning when abnormal physiological or behavioral patterns are detected, while uncertainty-gated refinement supports clinician review for high-risk or low-confidence cases. We also outline a real-data evaluation design using public and credentialed clinical datasets spanning EHR, radiology, ECG, ICU time series, and referral-proxy outcomes. MedRLM aims to move medical AI from static question answering toward auditable, multimodal, and workflow-aware clinical decision support.

2606.20388 2026-06-19 cs.HC cs.AI cs.DB 交叉投稿

DataMagic: Transforming Tabular Data into Data Insight Video

DataMagic: 将表格数据转化为数据洞察视频

Yupeng Xie, Chen Ma, Zhenyang Wang, Liangwei Wang, Jiayi Zhu, Chuxuan Zeng, Zhouan Shen, Boyan Li, Yuyu Luo

AI总结提出DataMagic系统，通过声明式规范DVSpec和多智能体架构，将原始表格数据和自然语言查询转化为叙事性数据洞察视频，并支持交互式探索。

Comments 5 pages, 3 figures, accepted at VLDB 2026

AI中文摘要

数据视频整合动态图表、语音叙述和同步动画，以时间叙事的方式传达数据洞察，使其成为提高数据管理生命周期中数据消费效率的有效媒介。然而，制作高质量的数据视频需要涵盖数据分析、叙事设计和视频制作的专业知识。现有方法存在不足：静态可视化工具（如BI仪表板）缺乏叙事逻辑和动画；创作工具要求用户预先准备可视化，而非从原始数据开始；像素级视频生成模型无法保证数据保真度或来源。我们演示了DataMagic，一个端到端的交互式系统，将原始表格数据和自然语言查询转化为叙事性数据洞察视频。为确保数据保真度，DataMagic引入了声明式规范DVSpec，通过数据驱动的语义引用将视觉和动画元素绑定到底层数据字段。为解决设计空间的组合爆炸问题，DataMagic采用先生成后编排的多智能体架构，并行生成候选场景，然后通过全局编排优化叙事连贯性。利用DVSpec逻辑与渲染的解耦，系统进一步支持三种交互模式和基于结构化来源的数据问答，将单向视频转化为可探索的交互式数据界面。在109个真实世界样本上的评估验证了DataMagic的有效性。主页：https://datamagic-home.github.io/

英文摘要

Data videos integrate dynamic charts, voice narration, and synchronized animations to communicate data insights as temporal narratives, making them an effective medium for improving data consumption efficiency in the data management lifecycle. However, producing high-quality data videos requires expertise spanning data analysis, narrative design, and video production. Existing approaches fall short: static visualization tools (e.g., BI dashboards) lack narrative logic and animation; authoring tools require users to pre-prepare visualizations rather than working from raw data; pixel-level video generation models cannot guarantee data fidelity or provenance. We demonstrate DataMagic, an end-to-end interactive system that transforms raw tabular data and natural language queries into narrative data-insight videos. To ensure data fidelity, DataMagic introduces the declarative specification DVSpec, which binds visual and animation elements to underlying data fields through data-driven semantic references. To address the combinatorial explosion of the design space, DataMagic adopts a Generate-then-Orchestrate multi-agent architecture that generates candidate scenes in parallel and then optimizes narrative coherence through global orchestration. Leveraging DVSpec's decoupling of logic and rendering, the system further supports three interaction modes and structured provenance-based data Q&A, transforming one-way videos into explorable interactive data interfaces. Evaluation on 109 real-world samples validates the effectiveness of DataMagic. Homepage: https://datamagic-home.github.io/

2606.20436 2026-06-19 cs.CR cs.AI 交叉投稿

Multi-View Decompilation for LLM-Based Malware Classification

基于LLM的恶意软件分类的多视角反编译

Bercan Turkmen, Vyas Raina

AI总结提出多反编译器视角提升LLM恶意软件分类性能，通过Ghidra和RetDec的互补伪C代码提高召回率和F1分数。

AI中文摘要

恶意软件分析师通常在源代码不可用时，通过反编译的伪C代码检查编译后的二进制文件。最近的研究表明，大型语言模型（LLMs）可以通过将反编译代码分类为良性或恶意来辅助这一过程，但现有的流程通常依赖于单一的反编译器视角。我们认为这一假设是脆弱的：反编译器是有损的启发式工具，不同的反编译器可能暴露同一二进制文件的不同特征。我们整理了一个包含良性工具和恶意程序的基准测试，涵盖一系列威胁行为。每个样本都使用Ghidra和RetDec进行编译和反编译，生成匹配的伪C视图。在来自主要模型系列的一系列LLMs中，我们发现提供两种反编译器视图可以提高恶意类别的F1分数，主要是通过提高恶意样本的召回率。一致性分析进一步表明，Ghidra和RetDec会犯部分不同的错误，支持反编译器输出提供互补证据的观点。我们的结果表明，多反编译器提示是一种简单、无需训练的方法，可以在实际环境中改进基于LLM的恶意软件分类。

英文摘要

Malware analysts often inspect compiled binaries through decompiled pseudo-C when source code is unavailable. Recent work suggests that large language models (LLMs) can assist this process by classifying decompiled code as benign or malicious, but existing pipelines typically rely on a single decompiler view. We argue that this assumption is fragile: decompilers are lossy heuristic tools, and different decompilers can expose different artefacts of the same binary. We curate a benchmark of benign utilities and malicious programs spanning a range of threat behaviors. Each sample is compiled and decompiled with both Ghidra and RetDec, yielding matched pseudo-C views. Across a range of LLMs from major model families, we find that providing both decompiler views improves malicious-class F1, mainly by increasing recall on malicious samples. Agreement analyses further show that Ghidra and RetDec make partially different errors, supporting the view that decompiler outputs provide complementary evidence. Our results suggest that multi-decompiler prompting is a simple, training-free way to improve LLM-based malware triage in practical settings.
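
To make the link between recall and malicious-class F1 concrete, here is a small sketch of the standard metric definitions. The confusion counts in the usage note are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for the positive (here: malicious) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: a single-view run versus a two-view run that
# recovers two more malicious samples at the cost of one extra false alarm.
single_view = precision_recall_f1(tp=6, fp=2, fn=4)
two_views = precision_recall_f1(tp=8, fp=3, fn=2)
```

In this made-up case recall rises from 0.6 to 0.8 while precision dips slightly, and F1 still improves, which is the pattern the abstract describes qualitatively.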

2606.20474 2026-06-19 cs.LG cs.AI cs.PF 交叉投稿

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant: 面向上下文密集型智能体的4位KV缓存

Inesh Chakrabarti, David Limpus, Aditi Ghai Rana, Bowen Bao, Spandan Tiwari, Thiago Crepaldi, Ashish Sirasao

发表机构 * Advanced Micro Devices（超威半导体）； University of California, Los Angeles（加州大学洛杉矶分校）； Purdue University（普渡大学）

AI总结针对上下文密集型智能体场景，提出UltraQuant方法，通过4位KV缓存压缩、旋转量化和代码本量化，结合AMD GPU优化，在长上下文多轮任务中延迟降低3.47倍，吞吐量提升1.63倍。

Comments 11 pages, 9 figures

AI中文摘要

上下文密集型智能体给键值（KV）缓存带来了异常压力：长前缀在多个短轮次中重复使用，而并发性决定了服务系统能否保持GPU利用率。我们针对此场景研究4位KV缓存压缩，采用TurboQuant风格的旋转和代码本量化作为质量锚点，vLLM FP8 KV缓存作为部署锚点。我们报告三项贡献。首先，我们将4位KV缓存框架用于多轮智能体工作负载，其中任务质量、缓存驻留和服务吞吐量必须联合衡量。其次，我们描述了使4位路径鲁棒所需的实际设计选择，包括非对称K/V处理、Walsh-Hadamard旋转、QJL移除和块尺度变体。第三，我们展示了AMD GPU上的服务优化，包括优化的解码注意力内核和UltraQuant，一种使用FP8查询、FP4 KV张量、UE8M0组尺度和CDNA4上原生缩放MFMA支持的FP4近似路径。在长上下文、多轮智能体工作负载上，UltraQuant在缓存压力大的后期轮次中将P50首令牌延迟降低了3.47倍（所有轮次平均2.3倍），并将输出吞吐量比FP8 KV基线提高了1.63倍。

英文摘要

Context-heavy agents place unusual pressure on the key-value (KV) cache: long prefixes are reused across many short turns, while concurrency determines whether the serving system can keep GPUs utilized. We study 4-bit KV-cache compression for this setting, using TurboQuant-style rotation and codebook quantization as a quality anchor and vLLM FP8 KV caching as the deployment anchor. We report three contributions. First, we frame 4-bit KV caching around multi-round agent workloads where task quality, cache residency, and serving throughput must be measured jointly. Second, we describe the practical design choices needed to make the 4-bit path robust, including asymmetric K/V treatment, Walsh-Hadamard rotation, QJL removal, and block-scale variants. Third, we present serving optimizations on AMD GPUs, including optimized decode-attention kernels and UltraQuant, an FP4 approximation path that uses FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA support on CDNA4. On a long-context, multi-turn agentic workload, UltraQuant cuts P50 time-to-first-token by 3.47x in the cache-pressured late rounds (2.3x across all rounds) and raises output throughput by 1.63x over the FP8 KV baseline.
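
The combination of a Walsh-Hadamard rotation with low-bit quantization can be sketched in a few lines. This is an illustration under simplifying assumptions: it uses signed 4-bit integer codes with one floating-point scale per group, not the FP4 tensors, UE8M0 scales, or codebooks named in the abstract, and all function names are hypothetical.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two.

    With this normalization the transform is its own inverse.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def quantize_4bit(values, group_size=4):
    """Symmetric signed 4-bit codes in [-8, 7] with one scale per group."""
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        peak = max(abs(v) for v in group)
        scale = peak / 7.0 if peak > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(v / scale))) for v in group)
    return codes, scales


def dequantize_4bit(codes, scales, group_size=4):
    """Map integer codes back to floats using the per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]
```

A typical use would rotate a key vector with `fwht`, quantize it, and apply `fwht` again after dequantization. The rotation spreads a single large outlier across all coordinates, so the per-group scale is no longer dominated by one value.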

2509.24725 2026-06-19 cs.LG cs.AI 版本更新

Q-Net: Queue Length Estimation via Kalman-based Neural Networks

Q-Net：基于卡尔曼神经网络的队列长度估计

Ting Gao, Elvin Isufi, Winnie Daamen, Erik-Sander Smits, Serge Hoogendoorn

发表机构 * University of Amsterdam（阿姆斯特丹大学）； Delft University of Technology（代尔夫特理工大学）

AI总结本文提出Q-Net框架，通过结合卡尔曼滤波与神经网络，解决信号交叉口队列长度估计中的数据融合问题，提升空间转移性和实时性，实现无需昂贵传感设备的准确队列估计。

Journal ref Transportation Research Part C: Emerging Technologies, Volume 190, September 2026, Article 105809

DOI: 10.1016/j.trc.2026.105809

AI中文摘要

估计信号交叉口的队列长度一直是交通管理中的长期挑战。尽管有两类隐私保护的数据源可用，车流的部分可观测性仍使这一任务变得复杂：(i) 接近停止线的环形检测器提供的车辆计数汇总数据，以及 (ii) 提供路段平均速度测量的汇总浮动汽车数据 (aFCD)。然而，如何将这些具有不同空间和时间分辨率的数据源整合用于队列长度估计仍不清楚。为此，本文提出Q-Net：一种基于状态空间形式的队列估计框架。该设计解决了队列建模中的关键挑战，如违反交通守恒假设。Q-Net遵循卡尔曼预测-更新结构，并在状态演变和测量模型中保持物理可解释性。Q-Net使用AI增强的卡尔曼滤波器从数据中学习时间变化的增益动态。该框架支持实时实现，并通过将aFCD测量分组为固定大小的局部组来提高空间转移性，使可学习参数的数量与路段长度无关。在荷兰鹿特丹城市主干道的评估显示，Q-Net优于基线方法，能够准确追踪队列的形成和消散，并缓解aFCD引起的延迟。通过结合数据效率、可解释性、实时适用性和空间转移性，Q-Net在无需昂贵的传感基础设施（如摄像头或雷达）的情况下实现了准确的队列长度估计。

英文摘要

Estimating queue lengths at signalized intersections is a long-standing challenge in traffic management. Partial observability of vehicle flows complicates this task despite the availability of two privacy-preserving data sources: (i) aggregated vehicle counts from loop detectors near stop lines, and (ii) aggregated floating car data (aFCD) that provide segment-wise average speed measurements. However, how to integrate these sources with differing spatial and temporal resolutions for queue length estimation is rather unclear. Addressing this question, we present Q-Net: a queue estimation framework built upon a state-space formulation. This design addresses key challenges in queue modeling, such as violations of traffic conservation assumptions. Q-Net follows the Kalman predict-update structure and maintains physical interpretability in both the state evolution and measurement models. Q-Net uses an AI-augmented Kalman filter to learn time-varying gain dynamics from data. The framework supports real-time implementation and improves spatial transferability by grouping aFCD measurements into fixed-size local groups, making the number of learnable parameters independent of section length. Evaluations on urban main roads in Rotterdam, the Netherlands, show that Q-Net outperforms baseline methods, tracks queue formation and dissipation accurately, and mitigates aFCD-induced delays. By combining data efficiency, interpretability, real-time applicability, and spatial transferability, Q-Net makes accurate queue length estimation possible without costly sensing infrastructure like cameras or radar.
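
The Kalman predict-update structure mentioned in the abstract can be shown with a scalar textbook filter. This sketch assumes a one-dimensional queue state driven by vehicle conservation and fixed noise variances; Q-Net instead learns time-varying gain dynamics from data, so the function below only illustrates the structure and its names and default values are hypothetical.

```python
def kalman_step(q_est, p_est, inflow, outflow, measurement,
                process_var=1.0, meas_var=4.0):
    """One predict-update cycle for a scalar queue-length state.

    q_est, p_est: previous queue estimate and its variance.
    inflow, outflow: vehicles entering and leaving during the interval.
    measurement: a noisy direct observation of the queue length.
    """
    # Predict: vehicle conservation moves the state, uncertainty grows.
    q_pred = q_est + inflow - outflow
    p_pred = p_est + process_var
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred / (p_pred + meas_var)
    q_new = q_pred + gain * (measurement - q_pred)
    p_new = (1.0 - gain) * p_pred
    return q_new, p_new, gain
```

The gain lies between 0 and 1: a noisy measurement (large `meas_var`) pulls it toward 0, so the filter trusts the conservation-based prediction more. A learned filter would replace the fixed-variance gain computation with a data-driven one.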

2510.00831 2026-06-19 cs.AI cs.LG eess.SP 版本更新

Controlled Comparison of Machine Learning Models for Fault Classification and Localization in Power System Protection

电力系统保护中故障分类与定位的机器学习模型受控比较

Julian Oelhaf, Georg Kordowich, Changhun Kim, Paula Andrea Pérez-Toro, Christian Bergler, Andreas Maier, Johann Jäger, Siming Bayer

发表机构 * Department of Electrical Engineering, Media and Computer Science, Ostbayerische Technische Hochschule Amberg-Weiden（安贝格-魏登东巴伐利亚应用技术大学电气工程、媒体与计算机科学系）

AI总结在统一电磁暂态数据集和10-50ms决策窗口下，对比机器学习模型在故障分类与定位中的性能，发现分类在10ms时F1>0.98，定位误差稳定在约10%线路长度。

Comments Accepted at IEEE PES Innovative Smart Grid Technologies Europe 2026 (ISGT Europe 2026). Pre-camera-ready author version; final proceedings version may differ

AI中文摘要

现代电力系统因逆变器基和分布式能源的集成而日益复杂，挑战了传统保护方案的可靠性，并推动了机器学习在保护任务中的应用。然而，由于不同研究中的数据集、传感假设和决策时域各异，已发表的结果往往难以比较。本文在相同的传感、时序和验证条件下，基于公共电磁暂态数据集，使用10-50ms的决策窗口以反映保护相关时间尺度，对故障分类（FC）和故障定位（FL）的机器学习模型进行了受控比较。对于FC，性能最佳的非线性模型在10ms时F1分数已超过0.98，而低容量模型在较短时域下性能下降，但随窗口延长而改善，表明相关故障类型信息在最早暂态中已存在。对于FL，顶级模型在所有评估时域下达到约10%归一化线路长度的稳定定位误差，而较弱模型形成明显分离的第二性能层级。线路解析分析显示，定位精度随电网段变化，表明存在拓扑依赖的难度而非仅时间上下文不足。这些发现为比较两个信息需求根本不同的保护任务中的机器学习模型提供了受控参考。

英文摘要

The increasing complexity of modern power systems, driven by the integration of inverter-based and distributed energy resources, challenges the reliability of conventional protection schemes and motivates the use of machine learning for protection tasks. However, published results are often difficult to compare because datasets, sensing assumptions, and decision horizons vary across studies. This paper presents a controlled comparison of machine learning models for fault classification (FC) and fault localization (FL) under identical sensing, timing, and validation conditions on a common electromagnetic transient dataset, using decision windows of 10-50 ms to reflect protection-relevant time scales. For FC, the best-performing nonlinear models achieve F1 scores above 0.98 already at 10 ms, while lower-capacity models degrade at shorter horizons but improve with longer windows, indicating that relevant fault-type information is already present in the earliest transient. For FL, the top-performing models reach a stable localization error of about 10 % of normalized line length across all evaluated horizons, while weaker models form a clearly separated second performance tier. Line-resolved analysis shows that localization accuracy varies across grid segments, indicating topology-dependent difficulty rather than insufficient temporal context alone. These findings provide a controlled reference for comparing machine learning models across two protection tasks with fundamentally different information requirements.

2602.00510 2026-06-19 cs.AI cs.LG cs.SE 版本更新

全球生活便利指数：面向主要经济体纵向分析的机器学习框架

Arun Kumar Selvaraj, Tanay Panat, Rohitash Chandra

发表机构 * Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics（过渡人工智能研究组，数学与统计学学院）； Centre for Artificial Intelligence and Innovation（人工智能与创新中心）； Pingla Institute（Pingla研究所）

AI总结提出全球生活便利指数，结合社会经济和基础设施因素，利用机器学习处理缺失数据，并通过主成分分析和因子分析降维，为政策制定者提供改善生活质量的可操作工具。

AI中文摘要

全球经济、地缘政治条件以及COVID-19疫情等破坏性事件对生活成本和生活质量产生了巨大影响。理解主要经济体中生活成本和生活质量的长期影响至关重要。一个透明且全面的生活指数必须包含生活条件的多个维度。在本研究中，我们提出了一种通过全球生活便利指数量化生活质量的方法，该指数将各种社会经济和基础设施因素整合为一个单一综合得分。我们的指数利用定义生活水平的经济指标，这有助于针对特定领域进行干预改进。我们提出了一个机器学习框架来处理特定国家某些经济指标的数据缺失问题。然后，我们整理并更新数据，并使用降维方法（主成分分析和因子分析）创建自1970年以来主要经济体的生活便利指数。我们的工作通过为政策制定者提供识别需要改进领域（如医疗系统、就业机会和公共安全）的实用工具，显著丰富了相关文献。我们的方法使用开放数据和代码，易于复现并适用于各种情境，为生活质量评估的持续研究和政策制定提供了透明度和可访问性。

英文摘要

The drastic changes in the global economy, geopolitical conditions, and disruptions such as the COVID-19 pandemic have impacted the cost of living and quality of life. It is essential to comprehend the long-term implications of the cost of living and quality of life in major economies. A transparent and comprehensive living index must include multiple dimensions of living conditions. In this study, we present an approach to quantifying the quality of life through the Global Ease of Living Index that combines various socio-economic and infrastructural factors into a single composite score. Our index utilises economic indicators that define living standards, which could help in targeted interventions to improve specific areas. We present a machine learning framework to address missing data for certain economic indicators in specific countries. We then curate and update the data and use a dimensionality reduction approach (Principal Component Analysis and Factor Analysis) to create the Ease of Living Index for major economies since 1970. Our work significantly adds to the literature by offering a practical tool for policymakers to identify areas needing improvement, such as healthcare systems, employment opportunities, and public safety. Our approach with open data and code can be easily reproduced and applied to various contexts, providing transparency and accessibility for ongoing research and policy development in quality-of-life assessment.
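
The dimensionality-reduction step can be sketched as a composite score built from the first principal component. This minimal example assumes complete data, indicators oriented so that higher is better, and no constant columns; it uses plain power iteration and omits the imputation and Factor Analysis steps described in the abstract. All function names are hypothetical.

```python
import math


def standardize(rows):
    """Z-score each indicator column (population standard deviation)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def first_component(z_rows, iterations=200):
    """Leading eigenvector of the correlation matrix via power iteration."""
    n, d = len(z_rows), len(z_rows[0])
    cov = [[sum(r[i] * r[j] for r in z_rows) / n for j in range(d)]
           for i in range(d)]
    w = [1.0 / math.sqrt(d)] * d  # start from equal weights
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the leading component")
        w = [v / norm for v in w]
    return w


def composite_index(rows):
    """Project standardized indicators onto the first principal component."""
    z = standardize(rows)
    w = first_component(z)
    return [sum(a * b for a, b in zip(row, w)) for row in z]
```

Each input row holds one country-year and each column one indicator. The resulting score ranks rows along the direction of greatest shared variance, which is one common way to collapse several indicators into a single index.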

2506.01678 2026-06-19 cond-mat.mtrl-sci cs.AI 版本更新

Overcoming Labelled Data Scarcity for Defect Classification in Scanning Tunneling Microscopy

克服扫描隧道显微镜缺陷分类中的标注数据稀缺问题

Nikola L. Kolev, Max Trouton, Filippo Federici Canova, Geoff Thornton, David Z. Gao, Neil J. Curson, Taylor J. Z. Stock

发表机构 * London Centre for Nanotechnology, University College London（伦敦纳米技术中心，伦敦大学学院）； Department of Electronic and Electrical Engineering, University College London（电子与电气工程系，伦敦大学学院）； Department of Physics and Astronomy, University College London（物理与天文学系，伦敦大学学院）； Department of Chemistry, University College London（化学系，伦敦大学学院）； Aalto Science Institute, School of Science, Aalto University（阿尔托科学研究所，阿尔托大学理学院）； Nanolayers Research Computing LTD, London, UK（纳米层研究计算有限公司，伦敦，英国）； Department of Physics, NTNU Norwegian University of Science and Technology（物理系，挪威科技大学）

AI总结提出结合少样本学习和无监督学习的自动分割方法，在仅需少量标注数据下实现高精度STM图像缺陷分类，并在三种表面验证了强泛化能力。

AI中文摘要

扫描隧道显微镜（STM）是一种以原子分辨率对表面成像的强大技术，可深入理解单原子和分子层面的物理化学过程。STM图像分析的一项常规任务是在均匀背景中识别和标记感兴趣的特征。手动执行此操作是一项劳动密集型工作，需要大量人力。为减轻这一负担，我们提出了一种自动化的STM图像分割方法，该方法同时使用少样本学习和无监督学习。与之前的监督方法相比，我们的技术提供了更大的灵活性；它消除了对大型手动标注数据集的需求，因此更容易适应未见过的表面，同时仍保持高精度。我们通过使用该方法识别三种不同表面上的原子特征来展示其有效性：Si(001)、Ge(001)和TiO$_2$(110)，包括吸附在硅和锗表面上的AsH$_3$分子。我们的模型表现出强大的泛化能力，在初始训练后，仅需一个额外的标注数据点即可适应未见过的表面。这项工作朝着高效且与材料无关的STM图像自动分割迈出了重要一步。

英文摘要

Scanning tunnelling microscopy (STM) is a powerful technique for imaging surfaces with atomic resolution, providing insight into physical and chemical processes at the level of single atoms and molecules. A regular task of STM image analysis is the identification and labelling of features of interest against a uniform background. Performing this manually is a labour-intensive task, requiring significant human effort. To reduce this burden, we propose an automated approach to the segmentation of STM images that uses both few-shot learning and unsupervised learning. Our technique offers greater flexibility compared to previous supervised methods; it removes the requirement for large manually annotated datasets and is thus easier to adapt to an unseen surface while still maintaining a high accuracy. We demonstrate the effectiveness of our approach by using it to recognise atomic features on three distinct surfaces: Si(001), Ge(001), and TiO$_2$(110), including adsorbed AsH$_3$ molecules on the silicon and germanium surfaces. Our model exhibits strong generalisation capabilities, and following initial training, can be adapted to unseen surfaces with as few as one additional labelled data point. This work is a significant step towards efficient and material-agnostic, automatic segmentation of STM images.

2511.08378 2026-06-19 cs.IR cs.AI 版本更新

Bid Farewell to Seesaw: Towards Accurate Long-tail Session-based Recommendation via Dual Constraints of Hybrid Intents

告别跷跷板：通过混合意图的双重约束实现准确的长尾会话推荐

Xiao Wang, Ke Qin, Dongyang Zhang, Xiurui Xie, Shuang Liang

发表机构 * University of Electronic Science and Technology of China（电子科技大学）

AI总结针对会话推荐中长尾分布导致准确性与多样性冲突的跷跷板问题，提出混合意图双重约束框架HID，通过属性感知谱聚类重构意图映射并区分噪声意图，结合多样性与准确性约束损失，实现长尾与准确性的双赢。

Comments accepted by AAAI 2026 Oral

AI中文摘要

基于会话的推荐（SBR）旨在根据用户的交互会话预测匿名用户的下一次交互。在实际推荐场景中，低曝光物品构成了交互的大部分，形成长尾分布，严重损害了推荐多样性。现有方法试图通过提升尾部物品来解决这一问题，但会导致准确性下降，在长尾与准确性性能之间表现出“跷跷板”效应。我们将这种冲突归因于尾部物品中的会话无关噪声，而现有的长尾方法未能有效识别和约束这些噪声。为了解决这一根本冲突，我们提出了HID（混合意图双重约束框架），这是一个即插即用的框架，通过引入基于混合意图的双重约束，将传统的“跷跷板”转变为“双赢”，同时提升长尾和准确性性能。该框架包含两个关键创新：（i）混合意图学习，我们通过采用属性感知谱聚类重构物品到意图的映射，重新制定了意图提取策略。此外，通过为每个会话分配目标意图和噪声意图，实现了会话无关噪声的区分。（ii）意图约束损失，它引入了两种关于多样性和准确性的新约束范式，以调节物品和会话的表示学习过程。通过严格的理论推导，这两个目标被统一到单个训练损失中。在多个SBR模型和数据集上的大量实验表明，HID能够同时提升长尾性能和推荐准确性，在长尾推荐系统中建立了新的最先进性能。

英文摘要

Session-based recommendation (SBR) aims to predict anonymous users' next interaction based on their interaction sessions. In the practical recommendation scenario, low-exposure items constitute the majority of interactions, creating a long-tail distribution that severely compromises recommendation diversity. Existing approaches attempt to address this issue by promoting tail items but incur accuracy degradation, exhibiting a "see-saw" effect between long-tail and accuracy performance. We attribute such conflict to session-irrelevant noise within the tail items, which existing long-tail approaches fail to identify and constrain effectively. To resolve this fundamental conflict, we propose HID (Hybrid Intent-based Dual Constraint Framework), a plug-and-play framework that transforms the conventional "see-saw" into "win-win" through introducing the hybrid intent-based dual constraints for both long-tail and accuracy. Two key innovations are incorporated in this framework: (i) Hybrid Intent Learning, where we reformulate the intent extraction strategies by employing attribute-aware spectral clustering to reconstruct the item-to-intent mapping. Furthermore, discrimination of session-irrelevant noise is achieved through the assignment of the target and noise intents to each session. (ii) Intent Constraint Loss, which incorporates two novel constraint paradigms regarding the diversity and accuracy to regulate the representation learning process of both items and sessions. These two objectives are unified into a single training loss through rigorous theoretical derivation. Extensive experiments across multiple SBR models and datasets demonstrate that HID can enhance both long-tail performance and recommendation accuracy, establishing new state-of-the-art performance in long-tail recommender systems.

2601.00014 2026-06-19 eess.SP cs.AI cs.LG 版本更新

Modeling Day-Long ECG Signals to Predict Heart Failure Risk with Explainable AI

建模全天心电图信号以可解释人工智能预测心力衰竭风险

Eran Zvuloni, Ronit Almog, Michael Glikson, Shany Brimer Biton, Ilan Green, Izhar Laufer, Offer Amir, Joachim A. Behar

发表机构 * Leumit Health Services（Leumit健康服务）

AI总结提出DeepHHF深度学习模型，利用24小时单导联心电图数据预测五年内心力衰竭风险，AUC达0.80，优于短时片段和临床评分，可解释性分析显示模型关注心律失常和心脏异常。

AI中文摘要

心力衰竭（HF）影响11.8%的65岁及以上成年人，降低生活质量和寿命。预防HF可降低发病率和死亡率。我们假设将人工智能（AI）应用于24小时单导联心电图（ECG）数据可预测五年内HF风险。为此，使用了Technion-Leumit Holter ECG（TLHE）数据集，包括20年间收集的47,729名患者的69,663条记录。我们的深度学习模型DeepHHF在24小时ECG记录上训练，实现了0.80的受试者工作特征曲线下面积，优于使用30秒片段和临床评分的模型。DeepHHF识别的高风险个体住院或死亡事件概率翻倍。可解释性分析显示DeepHHF关注心律失常和心脏异常。本研究强调了深度学习建模24小时连续ECG数据的可行性，捕捉了对可靠风险预测至关重要的阵发性事件。应用于单导联Holter ECG的人工智能无创、廉价且广泛可及，使其成为HF风险预测的有前景工具。

英文摘要

Heart failure (HF) affects 11.8% of adults aged 65 and older, reducing quality of life and longevity. Preventing HF can reduce morbidity and mortality. We hypothesized that artificial intelligence (AI) applied to 24-hour single-lead electrocardiogram (ECG) data could predict the risk of HF within five years. To investigate this, we used the Technion-Leumit Holter ECG (TLHE) dataset, comprising 69,663 recordings from 47,729 patients collected over 20 years. Our deep learning model, DeepHHF, trained on 24-hour ECG recordings, achieved an area under the receiver operating characteristic curve of 0.80, outperforming both a model using 30-second segments and a clinical score. High-risk individuals identified by DeepHHF had twice the probability of hospitalization or death. Explainability analysis showed that DeepHHF focused on arrhythmias and heart abnormalities. This study highlights the feasibility of deep learning to model 24-hour continuous ECG data, capturing paroxysmal events essential for reliable risk prediction. Artificial intelligence applied to single-lead Holter ECG is non-invasive, inexpensive, and widely accessible, making it a promising tool for HF risk prediction.

2601.02149 2026-06-19 cond-mat.mes-hall cond-mat.dis-nn cs.AI 版本更新

AI-enhanced tuning of quantum dot Hamiltonians toward Majorana modes

基于人工智能的量子点哈密顿量调优以实现马约拉纳模式

Mateusz Krawczyk, Jarosław Pawłowski

发表机构 * Institute of Theoretical Physics, Wrocław University of Science and Technology（理论物理研究所，弗罗茨瓦夫科技大学）

AI总结本文提出基于神经网络的模型，通过学习量子点模拟器的工作区域，利用输运测量自动调优设备以获得马约拉纳模式。模型在无监督条件下训练于电导图合成数据，采用融合马约拉纳零模关键性质的物理引导损失函数。

Comments 12 pages, 8 figures, 2 tables

Journal ref Phys. Rev. Applied 25, 064032 (2026)

DOI: 10.1103/xkbl-ctwn

AI中文摘要

我们提出了一种基于神经网络的模型，能够学习量子点模拟器广泛的工作区域，并利用此知识通过输运测量自动调优这些设备，以在结构中获得马约拉纳模式。模型在无监督条件下训练于电导图合成数据，采用融合马约拉纳零模关键性质的物理引导损失函数。我们展示了通过适当训练，深度视觉变换器网络可以高效记忆哈密顿量参数与电导图结构之间的关系，并利用此提出量子点链参数更新，驱动系统进入拓扑相。从参数空间的广泛初始失谐范围开始，单步更新足以生成非平凡零模。此外，通过启用迭代调优过程（系统在每一步获得更新的电导图），我们证明该方法可以处理参数空间更大的区域。

英文摘要

We propose a neural network-based model capable of learning the broad landscape of working regimes in quantum dot simulators and using this knowledge to autotune these devices, based on transport measurements, toward obtaining Majorana modes in the structure. The model is trained in an unsupervised manner on synthetic data in the form of conductance maps, using a physics-informed loss that incorporates key properties of Majorana zero modes. We show that, with appropriate training, a deep vision-transformer network can efficiently memorize the relation between Hamiltonian parameters and structures on conductance maps and use it to propose parameter updates for a quantum dot chain that drive the system toward the topological phase. Starting from a broad range of initial detunings in parameter space, a single update step is sufficient to generate nontrivial zero modes. Moreover, by enabling an iterative tuning procedure, in which the system acquires updated conductance maps at each step, we demonstrate that the method can address a much larger region of the parameter space.

2604.08552 2026-06-19 cs.DB cs.AI 版本更新

Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

使用本体约束的LLM代理自动化标准化遗留生物医学元数据

Josef Hardi, Martin J. O'Connor, Marcos Martinez-Romero, Jean G. Rosario, Stephen A. Fisher, Mark A. Musen

发表机构 * Division of Computational Medicine, Stanford University（斯坦福大学计算医学部）； Department of Biology, University of Pennsylvania（宾夕法尼亚大学生物学系）

AI总结提出基于LLM的元数据标准化系统，通过实时查询标准指南和本体服务，在839条HuBMAP记录上验证，相比纯LLM方法显著提升预测准确性。

AI中文摘要

科学元数据通常不完整且不符合社区标准，限制了数据集的可发现性、互操作性和重用。即使存在标准元数据报告指南，它们通常缺乏机器可操作的表征。生成FAIR数据集需要将元数据标准编码为具有丰富字段规范和精确值约束的机器可操作模板。最近的研究表明，由字段名称和本体约束引导的LLM可以改善元数据标准化，但这些方法将约束视为静态文本提示，仅依赖模型的训练知识。我们提出了一种基于LLM的元数据标准化系统，该系统实时查询标准报告指南和权威生物医学术语服务，以按需检索规范正确的标准。我们在来自人类生物分子图谱计划（HuBMAP）的839条遗留元数据记录上评估了该方法，使用专家策划的金标准进行精确匹配评估。我们的评估表明，与仅使用LLM相比，通过实时工具访问增强LLM在受本体约束和不受本体约束的字段上均持续提高了预测准确性，展示了一种实用的生物医学元数据自动化标准化方法。

英文摘要

Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. Even when standard metadata reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries standard reporting guidelines and authoritative biomedical terminology services in real time to retrieve canonically correct standards on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert-curated gold standard for exact-match assessment. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields, demonstrating a practical approach to automated standardization of biomedical metadata.

2604.11556 2026-06-19 cs.SE cs.AI 版本更新

FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

FM-Agent: 通过基于LLM的Hoare风格推理将形式化方法扩展到大型系统

Haoran Ding, Zhaoguo Wang, Haibo Chen

发表机构 * Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University（并行与分布式系统研究所，上海交通大学）

AI总结提出FM-Agent框架，利用LLM自动生成函数级规范，实现大型系统的组合式推理，在143k行代码的系统中2天内发现522个新bug。

AI中文摘要

LLM辅助的软件开发已日益普遍，并能生成如编译器这样的大型系统。增强生成代码的正确性变得至关重要。然而，由于代码复杂性，大型系统的自动推理仍然具有挑战性。Hoare逻辑提供了一种将大型系统分解为较小组件并分别推理（即组合式推理）的方法。然而，现有工作仍难以扩展，因为Hoare逻辑要求为每个函数编写形式化规范，给人类带来沉重负担。当代码由LLM生成时，问题更加严重，因为开发人员缺乏对每个函数预期行为的深入理解。本文提出FM-Agent，这是第一个实现大型系统自动化组合式推理的框架。利用LLM，FM-Agent引入了一种自顶向下的范式来自动生成函数级规范。具体来说，FM-Agent从调用者期望函数如何行为中推导出函数的规范，因此即使实现有缺陷，生成的规范也能反映开发者的意图。开发者的意图通常用自然语言表达，而现有的验证器只支持公式。因此，FM-Agent推广了Hoare风格推理，以针对自然语言规范推理函数。最后，为了确认错误存在并解释错误原因，FM-Agent自动生成测试用例以触发潜在错误。在我们的评估中，FM-Agent在2天内成功推理了大型系统，每个系统最多有143k行代码。这些系统已经由开发者测试过，但FM-Agent仍然发现了522个新错误。这些错误可能导致严重后果，包括系统崩溃和错误的执行结果。

英文摘要

LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.
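
The compositional, Hoare-style reasoning described in the abstract rests on triples of the form {precondition} function {postcondition}. The sketch below checks such triples at run time with assertions. That is a much weaker stand-in for the static reasoning the paper targets, and the decorator and function names are hypothetical.

```python
import math


def checked(pre, post):
    """Wrap a function so the triple {pre} f {post} is asserted on every call."""
    def decorator(func):
        def wrapper(*args):
            assert pre(*args), f"precondition of {func.__name__} violated"
            result = func(*args)
            assert post(result, *args), f"postcondition of {func.__name__} violated"
            return result
        return wrapper
    return decorator


@checked(pre=lambda n: n >= 0,
         post=lambda r, n: r * r <= n < (r + 1) * (r + 1))
def integer_sqrt(n):
    return math.isqrt(n)


@checked(pre=lambda a, b: a >= 0 and b >= 0,
         post=lambda r, a, b: r * r <= a + b < (r + 1) * (r + 1))
def sqrt_of_sum(a, b):
    # Compositional step: this caller relies only on integer_sqrt's contract,
    # not on how integer_sqrt is implemented.
    return integer_sqrt(a + b)
```

The point of the decomposition is visible in `sqrt_of_sum`: its postcondition follows from its own precondition plus the callee's specification, so each function can be examined separately.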

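2606.12500 2026-06-19 cs.LG cs.AI Version update

Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

Xian Liu, Carlo G. Prato, Gustav Markkula

AI summary: This paper replaces conventional rule-based behaviour models with machine-learning behaviour models in traffic microsimulation and uses Extreme Value Theory to predict crash frequency from simulated conflicts. At five signalised intersections in Leeds, UK, the ML model improved prediction accuracy without location-specific calibration.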

Abstract

Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to potentially improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.

2606.13794 2026-06-19 eess.SY cs.AI cs.RO cs.SY Version update

An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts

Umut Demir, Aamir Ahmad, Walter Fichter

Affiliations: University of Stuttgart, Faculty of Aerospace Engineering and Geodesy, Institute of Flight Mechanics and Control (iFR)

AI summary: Proposes learning the control effectiveness mapping with Sparse Identification of Nonlinear Dynamics and combining it with an online adaptation mechanism, giving efficient nonlinear control allocation for overactuated aircraft that is both interpretable and computationally cheap.

Abstract

Nonlinear dynamics and the strong couplings that arise between multiple effectors undermine the assumptions behind conventional, linear control allocation techniques. When flight enters regimes where nonlinear effects dominate, linear allocators exhibit reduced accuracy due to increased model mismatch, which subsequently degrades performance and robustness of the flight control system. High fidelity onboard models and black box data driven approaches can recover accuracy across the flight envelope, but respectively impose computational burdens prohibitive for real time allocation and sacrifice the interpretability required for verification and fault diagnosis. This paper addresses these limitations by learning an explicit, physics constrained analytical model of the control effectiveness mapping from representative flight data using Sparse Identification of Nonlinear Dynamics. The resulting mapping is compact, interpretable, and admits analytical derivatives, enabling efficient computation within nonlinear solvers that additionally incorporate actuator dynamics, without requiring an onboard model. An online adaptation mechanism monitors prediction residuals and refreshes the model when significant plant changes are detected, providing graceful reconfiguration under actuator failures and varying operating conditions. The methodology is evaluated on a high fidelity nonlinear benchmark aircraft across a range of aggressive maneuvers, achieving accuracy comparable to a full nonlinear onboard model while substantially reducing computational cost relative to established baselines.

2606.18611 2026-06-19 cs.SD cs.AI cs.LG stat.ML Version update

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

Shogo Yamauchi, Hideaki Tamori, Makoto Sakai, Yosuke Yamano, Tohru Nitta

Affiliations: The Asahi Shimbun Company; Tokyo Woman's Christian University

AI summary: Proposes QC-GAN, a parameter-efficient model that combines a quaternion Conformer generator with MetricGAN training and cuts the parameter count by sharing weights through the Hamilton product. On VoiceBank+DEMAND it reaches a PESQ of 3.48 with 0.89M parameters, matching models twice its size.

Comments 10 pages, 6 figures and 5 tables. Accepted at Interspeech 2026

2606.19630 2026-06-19 cs.AI cs.DL cs.SY eess.SY New submission

AI4SE and SE4AI Exploration: A Decade Looking Back and Forward

H. Sinan Bank, Daniel R. Herber, Thomas Bradley

Affiliations: Colorado State University

AI summary: Reviews progress in AI and systems engineering across three phases, identifies five critical research gaps through a human-AI agreement literature review, and offers guidance on AI adoption, assurance, and workforce transformation.

Comments 10 pages, 5 figures

Abstract

The March 2020 INCOSE INSIGHT special issue on AI and Systems Engineering (SE) became the most downloaded issue in the publication's history and launched a research community that now draws over 250 registrants to its annual workshop. In this article, we trace the progress in AI and SE across three phases (labeled here foundational, applied, and LLM inflection) based on the authors' reading of the field's core papers, and describe our opinions of where the community has converged and where critical gaps remain. Separately, a human-AI agreement literature review leveraging both human expertise and six AI models was performed to assess the relevance of 1,712 INCOSE INSIGHT articles and 889 SERC publications. The results identify five critical research gaps and offer guidance for practitioners navigating AI adoption, assurance, and workforce transformation in SE. We share the agreement data and the AI4SE/SE4AI Explorer web application so readers can compare their own relevance judgments with the human and AI raters.

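2606.19753 2026-06-19 cs.AI cs.SE New submission

Grounded Inference: Principles for Deterministically Encapsulated Generative Models

Marty O'Neill

AI summary: Proposes four architectural primitives for hybrid AI systems that deterministically encapsulate probabilistic models, identifies two industry anti-patterns, and provides a foundational framework for integrating AI with conventional systems.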

Comments 12 pages, 3 figures

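2606.19924 2026-06-19 cs.AI New submission

The Tao of Agency: Autotelic AI, Embedded Agency and Dissolution of the Self

Aritra Sarkar

AI summary: Examines what happens when agents generate their own goals in autotelic AI, tracing the question through intrinsic motivation, resource-driven priors, causal-interventional learning, homeostasis, and embeddedness. It finds embeddedness necessary but not sufficient, locates the core difficulty in how an agent generates and relativizes its own self, and closes with a quantum formulation, a philosophical reading, and a concrete LLM-based instantiation.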

Abstract

Most artificial intelligence systems are built on the assumption that goals are exogenous and specified by the designer. Exploring what happens when an agent begins generating its own goals opens the field of autotelic AI. Agents are expected not merely to pursue objectives but to discover them. In this article, we trace its consequences through intrinsic motivation, resource-driven priors, causal-interventional learning, homeostasis, and embeddedness; the last of which is found to be a necessary but not sufficient condition for autotelic agency. Embeddedness individuates the agent at the cost of revealing that the individuation is non-unique, such that the same dynamics admit many valid partitions, each defining a different candidate self. The deepest problem with autotelic AI is therefore not how the agent generates goals, but how it generates and relativizes the self to which the goals are assigned. The agent must believe in its own boundary in order to act, and see through that boundary in order to understand. We consolidate these developments into a single framework and extend it along three directions: a quantum formulation in which the agent-environment cut becomes physical, a philosophical reading against non-dual contemplative traditions, and a concrete LLM-based agentic instantiation.

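2606.20231 2026-06-19 cs.AI cond-mat.stat-mech cs.IT math-ph math.IT math.MP nlin.AO New submission

Thermodynamic Measure of Intelligence

Ishanu Chattopadhyay

Affiliations: Institute for Biomedical Informatics, University of Kentucky; Department of Computer Science, University of Kentucky

AI summary: Defines intelligence as the lawful amplification of rare but valid futures, achieved through recursive self-simulation, and gives a thermodynamic measure for it, showing that under the stated assumptions this architecture is necessary and nearly sufficient for high intelligence.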

Abstract

Can intelligence be measured? We propose that intelligence can be defined as the lawful amplification of rare but valid futures: a system increases the probability of outcomes that would be unlikely under passive dynamics but remain admissible under the constraints of the domain. We start with the premise that an intelligent system must model the world and its own place within it. Because the system is part of the world it models, this leads naturally to recursive self-simulation: the system represents futures in which its own actions are part of the trajectory. Our central results give a necessity statement and a conditional near-sufficiency statement connecting this architecture to a precise thermodynamic measure of lawful amplification of rare-valid futures: high rare-valid lift is impossible unless the internal simulation identifies rare-valid futures with high fidelity; conversely, when rare-valid fidelity is high and the simulation contains an effective policy, the achievable lift approaches the actuation-limited optimum. Thus recursive self-simulation is not merely a plausible feature of intelligence but, under the stated assumptions, is necessary and nearly sufficient for high thermodynamic intelligence. The resulting framework makes intelligence measurable on a universal scale, from passive matter and feedback controllers, large language models, and humans as text generators to Maxwell-demon-like information engines.

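2606.18716 2026-06-19 cs.HC cs.AI Cross-list

Human-AI Agent Interaction in a Business Context

Kathrin Paimann, Elizangela Valarini, Sebastian Juhl

Affiliations: SAP SE; Hochschule Fresenius Heidelberg; University of Missouri

AI summary: Using a mixed-methods approach, this study identifies and evaluates principles and criteria for a positive user experience between humans and AI agents in a business context, and validates the effectiveness of design elements through a survey experiment, with the aim of supporting user adoption, trust, and user-centered decisions.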

Comments 9 pages, 5 tables, 1 figure, submitted to Springer Nature

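2606.19361 2026-06-19 cs.LG cs.AI cs.NA math.NA stat.CO stat.ME stat.ML Cross-list

A Critique of World Models: A Generative Latent Prediction Architecture for World Modeling

Eric Xing, Mingkai Deng, Jinyu Hou

AI summary: Starting from the psychological notion of "hypothetical thinking", this essay argues that the core goal of a world model is to simulate all actionable possibilities of the real world, and designs a Generative Latent Prediction (GLP) architecture built on stateful, hierarchical, multi-level, mixed continuous/discrete representations.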

Abstract

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2509.02581 2026-06-19 cs.DL cs.AI Version update

Charting the Future of Scholarly Knowledge with AI: A Community Perspective

Azanzi Jiomekong, Hande Küçük McGinty, Keith G. Mills, Allard Oelen, Enayat Rajabi, Harry McElroy, Antrea Christou, Anmol Saini, Janice Anta Zebaze, Hannah Kim, Anna M. Jacyszyn, Gollam Rabby, Dirk Betz, Claudia Biniossek, Sanju Tiwari, Sören Auer

Affiliations: TIB Leibniz Information Centre for Science and Technology; Department of Computer Science, University of Yaounde 1; Department of Computer Science, Kansas State University; School of EECS, Louisiana State University; Management Science Department, Cape Breton University; Department of Development and Research, Performigence; Department of Engineering and Computer Science, Wright State University; Department of Physics, University of Yaounde 1; FIZ Karlsruhe, Leibniz Institute for Information Infrastructure; Sharda University, Delhi-NCR, India; L3S Research Center, Leibniz University of Han

AI summary: From a community perspective, this paper identifies ways to foster cross-disciplinary dialogue, surface shared challenges, categorize new collaborations, and shape future research directions for scholarly knowledge organization.

Comments 39 pages, 3 figures

Abstract

Despite the growing availability of tools designed to support scholarly knowledge extraction and organization, many researchers still rely on manual methods, sometimes due to unfamiliarity with existing technologies or limited access to domain-adapted solutions. Meanwhile, the rapid increase in scholarly publications across disciplines has made it increasingly difficult to stay current, further underscoring the need for scalable, AI-enabled approaches to structuring and synthesizing scholarly knowledge. Various research communities have begun addressing this challenge independently, developing tools and frameworks aimed at building reliable, dynamic, and queryable scholarly knowledge bases. However, limited interaction across these communities has hindered the exchange of methods, models, and best practices, slowing progress toward more integrated solutions. This manuscript identifies ways to foster cross-disciplinary dialogue, identify shared challenges, categorize new collaboration and shape future research directions in scholarly knowledge and organization.

2603.16648 2026-06-19 cs.AI Version update

Domain-Independent Dynamic Programming with Constraint Propagation

Imko Marijnissen, J. Christopher Beck, Emir Demirović, Ryo Kuroiwa

Comments 13 pages. To appear at the 36th International Conference on Automated Planning and Scheduling (ICAPS 2026)

Journal ref Proceedings of the International Conference on Automated Planning and Scheduling (2026) | Volume 36(1) | Pages 171-180

2602.05416 2026-06-19 cs.CE cs.AI cs.LG physics.ao-ph physics.flu-dyn Version update

Reduced-Order Surrogates for Forced Flexible Mesh Coastal-Ocean Models

Freja Høgholm Petersen, Jesper Sandvig Mariegaard, Rocco Palmitessa, Allan P. Engsig-Karup

Affiliations: DTU

Comments Submitted for peer-review in a journal. v2: revised version submitted to journal after minor revisions

2511.23071 2026-06-19 cs.CV cs.AI cs.CL Version update

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

Affiliations: Indian Institute of Technology Jodhpur

Comments Accepted in International Journal on Document Analysis and Recognition (IJDAR)

Journal ref International Journal on Document Analysis and Recognition (IJDAR), 2026

2602.14239 2026-06-19 cs.SI cs.AI cs.LG Version update

A Hybrid TGN-SEAL Model for Dynamic Graph Link Prediction

Nafiseh Sadat Sajadi, Behnam Bahrak, Mahdi Jafari Siavoshani

Affiliations: Department of Computer Engineering, Sharif University of Technology; Tehran Institute for Advanced Studies, Khatam University

Journal ref EPJ Data Science (2026)

2510.24435 2026-06-19 cs.AI Version update

Human-Level Reasoning: A Comparative Study of Large Language Models on Logical and Abstract Reasoning

Benjamin Grando Moreira

Affiliations: Universidade Federal de Santa Catarina

Comments 12 pages

Journal ref Proceedings of the 2026 Computer on the Beach

2507.23027 2026-06-19 cs.CV cs.AI Version update

Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam

Affiliations: Indian Institute of Technology Madras; All India Institute of Medical Sciences; Indian Institute of Technology Hyderabad

Comments Accepted at the MICCAI Workshop on "Medical Image Computing in Resource Constrained Settings & Knowledge Interchange (MIRASOL)" 2025

2406.15465 2026-06-19 cs.CL cs.AI Version update

RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models

Daniel Reichenpfader, Jonas Knupp, André Sander, Kerstin Denecke

Affiliations: Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland; ID Suisse AG, St. Gallen, Switzerland

1. Agents, Planning, and Decision-Making (22 papers)

Uncertainty Decomposition for Clarification Seeking in LLM Agents

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research

RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Human-like autonomy emerges from self-play and a pinch of human data

OnDeFog: Online Decision Transformer under Frame Dropping

VOiLA: Vectorized Online Planning with Learned Diffusion Model for POMDP Agents

Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Dual-Agent Framework for Cross-Model Verified Translation of Natural-Language Protocols into Robotic Laboratory Platform

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

CogniFold: Always-On Proactive Memory via Cognitive Folding

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Oranits: Mission Assignment and Task Offloading in Open RAN-based ITS using Metaheuristic and Deep Reinforcement Learning

Policy-Embedded Graph Expansion: Networked HIV Testing with Diffusion-Driven Network Samples

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

Synthetic Resonance: A Framework for Growth-Oriented Human-AI Relationships

Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks

2. Knowledge Representation, Reasoning, and Symbolic AI (5 papers)

Process-Verified Reinforcement Learning for Theorem Proving via Lean

DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs

VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving

Latent Confounded Causal Discovery via Lie Bracket Geometry

KG-SoftMAP: Soft Knowledge-Graph Priors for Bayesian Network Structure Learning from Sparse Discrete Data

3. Multi-Agent Systems and Games (17 papers)

Hidden Anchors in Multi-Agent LLM Deliberation

Exit-and-Join Dynamics for Decentralized Coalition Formation

Multi-Agent Transactive Memory

Autonomous Event-Driven Multi-Agent Orchestration for Enterprise AI at Scale

A Multi-Agent system for Multi-Objective constrained optimization

Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

Before the Pull Request: Mining Multi-Agent Coordination

Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation

Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution

Optimal Order of Multi-Agent and General Many-Body Systems

Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems

UniMM: A Unified Mixture Model Framework for Multi-Agent Simulation

Searching for Synergy in Shared Workspace Human-AI Collaboration

Simulation of Language Evolution under Regulated Social Media Platforms: A Synergistic Approach of Large Language Models and Genetic Algorithms

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

Gaming-Resistant Insurance Contracts for Autonomous AI Agents: Strategy-Proof Toll Mechanism Design

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

4. Search, Optimization, and Constraint Solving (6 papers)

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

Optimal Scheduling in a Question-Answering Forum of Knowledge Workers

Residual-Space Evolutionary Optimization via Flow-based Generative Models

Zero-Inflated Gaussian Distributions Enable Parameter-Space Sparsity in Estimation-of-Distribution Algorithms

A Tool for the Synthesis of Adaptive Probabilistic Processors Based on the Ising Model

Flickering Multi-Armed Bandits

5. Machine Learning and Representation Learning (53 papers)

Diffusion Language Models: An Experimental Analysis

ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

Which Pairs to Compare for LLM Post-Training?

Denoising Implicit Feedback for Cold-start Recommendation

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

Modularity-Free Conflict-Averse Training for Generalized PINNs

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Toward Calibrated Mixture-of-Experts Under Distribution Shift

Information Lattice Learning as Probabilistic Graphical Model Structure Learning

Protein Representation Learning with Secondary-Structure and Energy-Filtered Hydrogen-Bond Graphs

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Can In-Context Learning Support Intrinsic Curiosity?

Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks

Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices

RIVET: Robust Idempotent Voice Attribute Editing

LOKI: Memory-Free Null-Space Constrained Lifelong Knowledge Editing

Efficiently Representing Algorithms With Chain-of-Thought Transformers

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Towards Engineering Scaling Laws with Pretraining Data Composition

ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number

When, Where, and How: Adaptive Binning for Tabular Self-Supervised Learning

Neural Additive and Basis Models with Feature Selection and Interactions

SL-S4Wave: Self-Supervised Learning of Physiological Waveforms with Structured State Space Models