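arXivDaily daily academic digest: synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.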

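2606.19559 2026-06-19 cs.AI cs.CL New submission

Uncertainty Decomposition for Clarification Seeking in LLM Agents

Gregory Matsnev

Affiliation: AI Talent Hub, ITMO University

AI summary: Proposes a prompt-based uncertainty decomposition that separates action confidence from request uncertainty, so the agent can proactively ask for clarification when the task specification is ambiguous. Averaged over five LLM backbones, clarification F1 on ALFWorld-Clarification improves by 73% over ReAct+UE and by 36% over UAM.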

Comments 26 pages, 8 figures. Source code: https://github.com/PE51K/udcs-in-llm-agents

Abstract

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.
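
The separation of action confidence from request uncertainty can be pictured as a small decision rule. The sketch below is only an illustration under stated assumptions: the abstract does not give the prompt, the score ranges, or any thresholds, so the function name `choose_step`, the [0, 1] score range, both thresholds, and the `reconsider_action` branch are all hypothetical.

```python
def choose_step(action_confidence, request_uncertainty,
                ask_threshold=0.5, act_threshold=0.5):
    """Map two self-reported scores in [0, 1] to one of three agent moves.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    for name, score in (("action_confidence", action_confidence),
                        ("request_uncertainty", request_uncertainty)):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    if request_uncertainty > ask_threshold:
        return "ask_clarification"   # the request itself looks ambiguous
    if action_confidence < act_threshold:
        return "reconsider_action"   # request is clear, but the action is doubtful
    return "act"


# A confident action does not suppress the question when the request is vague.
print(choose_step(action_confidence=0.9, request_uncertainty=0.8))
print(choose_step(action_confidence=0.9, request_uncertainty=0.1))
```

Keeping the two scores apart is what lets the first call ask a question even though the agent is confident about its next action.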

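2606.19782 2026-06-19 cs.AI cs.CL New submission

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

Aravind Narayanan, Shaina Raza

Affiliation: Vector Institute

AI summary: Proposes AgentFinVQA, a multi-agent pipeline that decomposes each query into steps and records them in a traceable Model Evaluation Packet, making financial chart QA auditable and deployable on-premise; on FinMME it improves accuracy by 7.68 percentage points over a zero-shot baseline.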

Abstract

Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves $+7.68$ pp over a primary-backbone matched zero-shot baseline with a proprietary backbone (Gemini-3 Flash; 71.24% vs. 63.56%, McNemar $p \approx 1.1 \times 10^{-16}$), and $+4.84$ pp with open-weights Qwen3.6-27B-FP8 served locally. The verifier's verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human-in-the-loop review routing. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.

2606.19893 2026-06-19 cs.AI New submission

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

Wei Yu, Suxing Liu, Minjie Yu, Jiahao Wang, Zhijian Zheng, Haocheng Deng, Bing Li

Affiliations: School of Digital Arts, Jiangxi Arts & Ceramics Technology Institute; Universiti Sains Malaysia

AI summary: Proposes MetaResearcher, a framework that scales the training of deep research agents in adversarial environments through an evolving virtual world, discovery-oriented tasks, a self-reflective meta-reward, and a heterogeneous multi-agent architecture, targeting better benchmark performance and epistemic robustness.

Abstract

Deep research agents have demonstrated remarkable capabilities in autonomous information gathering and synthesis, yet their training remains constrained by the static nature of simulated environments, the limits of fact-retrieval-only task designs, and the inefficiency of outcome-based reinforcement learning. In this work, we propose MetaResearcher, a novel framework that scales deep research agent training across four synergistic dimensions. First, we introduce an Evolving Virtual World that injects temporal dynamics and adversarial misinformation into the training environment, forcing agents to develop source credibility assessment and temporal conflict resolution skills. Second, we design Discovery-Oriented Tasks -- including hypothesis generation and contradiction resolution -- that transcend simple fact retrieval and push agents toward genuine research behaviors. Third, we propose a Self-Reflective Meta-Reward mechanism within the GRPO framework that jointly optimizes for answer correctness, search path efficiency, reflection depth, and tool call diversity, directly addressing the repetitive action loop problem observed in prior work. Fourth, we introduce a Heterogeneous Multi-Agent Swarm architecture comprising specialized Scout, Filter, and Synthesizer models that learn collaborative research strategies through coordinated reinforcement learning. Built upon the LiteResearcher infrastructure, MetaResearcher requires zero marginal API cost for training while targeting substantial improvements in both benchmark performance (GAIA, Xbench-DS) and epistemic robustness under adversarial conditions. We present the complete framework design, training methodology, and planned experimental validation.

2606.20122 2026-06-19 cs.AI cs.MA New submission

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research

Zhibang Yang, Xinke Jiang, Yuzhen Xiao, Ruizhe Zhang, Yue Fang, XinFei Wan, Zhengxing Song, Yuxuan Liu, Yuheng Huang, Xu Chu, Junfeng Zhao, Yasha Wang

Affiliations: National Engineering Research Center of Software Engineering, Peking University; School of Computer Science, Peking University; Key Laboratory of High Confidence Software Technologies, Ministry of Education; GRG Banking Equipment Co., Ltd.; Center on Frontiers of Computing Studies, Peking University; Peking University Information Technology Institute (Tianjin Binhai)

AI summary: Proposes ScaffoldAgent, which counters outline drift in open-ended deep research through utility-guided dynamic outline optimization (Expansion, Contraction, and Revision operations), improving long-form report generation and factual grounding on DeepResearch Bench and DeepResearch Gym.

Comments 9 pages, 6 figures

Abstract

Open-ended deep research (OEDR) requires systems to acquire knowledge through multi-round retrieval and generate coherent long-form reports. The outline plays a central role as a structural scaffold that coordinates retrieval, evidence organization, and generation. However, existing methods either fix the outline before writing or refine it with local heuristics, leading to scaffold drift under continuous information accumulation and delayed feedback for evaluating outline modifications. We propose ScaffoldAgent, a utility-guided dynamic outline optimization framework for OEDR. ScaffoldAgent models outline evolution as a structured decision process with three operations: Expansion, Contraction, and Revision, enabling controlled updates to the report scaffold. It further introduces a utility-guided feedback mechanism that estimates the downstream value of each outline operation from retrieval gain, structural coherence, and trial-generation quality. The resulting utility signal guides node selection, operation scheduling, and termination during inference. Experiments on DeepResearch Bench and DeepResearch Gym show that ScaffoldAgent consistently improves long-form report generation and factual grounding over existing deep research agents.

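2606.20142 2026-06-19 cs.AI cs.MA New submission

RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning

Antón Asla Manzárraga

AI summary: Proposes RACL, which places a reasoning agent above a metaheuristic optimizer to control its search behavior by observing, reasoning, and intervening; on vehicle-routing cases the average cost delta versus STP is -0.641%, and on the Sevilla-9/10 runtime sample average cost improves by 8.337% versus Fixed.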

Comments 10 pages, 5 tables

Abstract

This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer and does not modify business constraints. Instead, it controls the optimizer's internal search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies and explaining its decisions. The experiment uses vehicle routing as a testbed, but the contribution is not a new routing solver, a particular ALNS configuration or a specific set of routing rules. The contribution is the RACL method: a way for a reasoning agent to discover, validate, consolidate and explain algorithmic control rules for a metaheuristic. In the current experimental setting, RACL improves or ties the Operational Memory Policy in 21 of 21 feasible cases and improves or ties a non-reasoning Stagnation-Triggered Policy in 18 of 21 feasible cases, with an average RACL vs STP cost delta of -0.641%. In the Sevilla-9/10 runtime sample, RACL improves average cost by -8.337% versus Fixed and -1.605% versus STP without showing material computational overhead. During the proof-of-concept, Codex was used as an in-the-loop reasoning agent observing executions, interpreting logs and proposing live bounded interventions. The policy proxy was later used only to make quantitative evaluation reproducible.

2606.20363 2026-06-19 cs.AI New submission

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

Yuexing Hao, Xiaomin Li

Affiliations: Massachusetts Institute of Technology; Harvard University

AI summary: Proposes a three-stage pipeline that mines readable skill libraries from GUI trajectories, but finds that readability does not guarantee downstream policy gains: GRPO brings only a marginal improvement, exposing the limits of current methods.

Abstract

Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations. The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows labels. However, readability does not imply transfer. GRPO improves IW skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ essentially unchanged, and underperforms trivial frequency priors on key source-domain metrics. We therefore present the method as a diagnostic study: trajectory mining can expose inspectable skill structure, but the current boundary detector, orderless segment representation, and offline reward model are insufficient for reliable cross-domain policy improvement.
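
The purity figure quoted in the abstract can be read as the share of a cluster's segments that carry the cluster's most common reference label. The sketch below assumes this standard definition, which the abstract itself does not spell out, and the labels in the example are made up for illustration.

```python
from collections import Counter


def cluster_purity(labels):
    """Fraction of items in one cluster that share its most common label."""
    if not labels:
        raise ValueError("cluster must contain at least one label")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical cluster: 19 of 20 segments carry the same workflow label.
example_cluster = ["login"] * 19 + ["search"]
print(f"purity = {cluster_purity(example_cluster):.2f}")
```

Under this reading, a cluster at 0.95 purity is dominated by a single workflow label, which is why it reads as one coherent skill.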

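2606.20529 2026-06-19 cs.AI cs.CL New submission

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Md Nayem Uddin, Amir Saeidi, Eduardo Blanco, Chitta Baral

Affiliations: Arizona State University; University of Arizona

AI summary: For policy-adherent tool-calling agents in customer-service domains, proposes LedgerAgent, which maintains task state in a separate ledger, renders it into the prompt, and checks state-dependent policy constraints before environment-changing tool calls execute, with the largest gains under stricter multi-trial consistency metrics.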

Comments Work in Progress

Abstract

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time they decide what to do next. This design makes state management implicit, creating two common failure modes. An agent may retrieve the right facts but later ground its decision in stale, missing, or incorrect information; and a syntactically valid tool call may still violate a domain policy that depends on the current task state. We introduce LedgerAgent, an inference-time method for tool-calling agents that maintains observed task states in a separate ledger and renders the states into the prompt. The ledger is also used to check state-dependent policy constraints before environment-changing tool calls are executed, blocking policy violations. Across four customer-service domains and a mixed panel of open- and closed-weight models, LedgerAgent improves average pass^k over a standard prompt-based tool-calling approach, with the largest gains under stricter multi-trial consistency metrics.
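
The ledger idea described in the abstract (state kept outside the prompt, rendered into it, and consulted before environment-changing calls) can be sketched in a few lines. This is not the authors' implementation: the class `Ledger`, the helper `guarded_call`, the order-cancellation policy, and all field names are hypothetical and exist only to make the control flow concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Observed task state, kept separately from the prompt."""
    facts: dict = field(default_factory=dict)

    def record(self, key, value):
        self.facts[key] = value

    def render(self):
        """Render the state as text that could be placed into the prompt."""
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.facts.items()))


def check_cancel_order(ledger):
    """Hypothetical policy: only pending orders may be cancelled."""
    status = ledger.facts.get("order_status")
    if status is None:
        return False, "order status has not been observed yet"
    if status != "pending":
        return False, f"order is {status}, not pending"
    return True, "ok"


def guarded_call(ledger, check, tool, *args):
    """Run an environment-changing tool only if the policy check passes."""
    allowed, reason = check(ledger)
    if not allowed:
        return {"blocked": True, "reason": reason}
    return {"blocked": False, "result": tool(*args)}


ledger = Ledger()
ledger.record("order_id", "A17")
ledger.record("order_status", "shipped")
outcome = guarded_call(ledger, check_cancel_order,
                       lambda oid: f"cancelled {oid}", "A17")
print(ledger.render())
print(outcome)
```

In this sketch the tool call is syntactically valid but is blocked because the recorded state says the order has shipped, which mirrors the second failure mode the abstract describes.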

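2606.20068 2026-06-19 cs.AI New submission

Process-Verified Reinforcement Learning for Theorem Proving via Lean

Minsu Kim, Se-Young Yun

Affiliation: KAIST AI

AI summary: Proposes using the Lean proof assistant as a source of process-level verification signals and combining them with a GRPO-style reinforcement learning objective, so that tactic-level supervision improves theorem-proving performance.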

Abstract

While reinforcement learning from verifiable rewards (RLVR) has typically relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective with first-error propagation and first-token credit methods that balance outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.

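2606.20526 2026-06-19 cs.AI New submission

DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs

Saimun Habib, Vaishak Belle, Fengxiang He

Affiliation: University of Edinburgh

AI summary: Proposes DeepSWIP, a single-world counterfactual semantics for DeepProbLog programs that computes exact counterfactuals via neural materialization, SWIPs, and weighted model counting; experiments show a 2.14x inference speedup over a twin-network construction.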

Abstract

Neurosymbolic systems such as DeepProbLog combine neural perception with probabilistic logic, but standard inference is associational. Counterfactual reasoning additionally requires a causal semantics for interventions and evidence. We introduce DeepSWIP, a single-world counterfactual semantics for DeepProbLog programs. Using neural materialization, we reduce fixed-context neural predicates to ordinary ProbLog choices, apply Single World Intervention Programs (SWIPs), and compute counterfactuals by weighted model counting (WMC) over a single transformed program. Under finite grounding and unique-supported-model assumptions, DeepSWIP is exact relative to the learned materialized FCM. The standard quotient-WMC form of ProbLog conditionals identifies active neural probabilities and explains intervention cleaning, calibration sensitivity, and rare-evidence instability. Experiments on MPI3D confirm the transformation against a DeepTwin construction on 12,000 queries and, as predicted, show a 2.14$\times$ inference speedup from avoiding the Twin's endogenous duplication. A SUMO HOV experiment shows that neural calibration degradation biases plug-in estimates, while a correctly scoped randomized-policy AIPW estimator removes most first-order bias for population mean and ATE estimands. Code is at https://github.com/saibib/deep_SWIP.

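2606.19494 2026-06-19 cs.AI New submission

Hidden Anchors in Multi-Agent LLM Deliberation

Apurba Pokharel, Ram Dantu

Affiliation: University of North Texas

AI summary: Models multi-agent LLM deliberation as a closed-loop dynamical system in which each agent holds a hidden internal belief (its anchor), explains how deliberation can escape the convex hull of the initial beliefs, and uses the recovered anchor to predict model behavior.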

Comments 13 pages, 6 figures, 7 tables

Abstract

Multi-agent LLM deliberation, where agents exchange and revise answers over several rounds, is increasingly used to improve reasoning and accuracy, yet how and why it works is rarely modelled. Such deliberation mirrors how humans reach decisions. As social animals we are pulled both by the group, the herd effect that classical opinion-dynamics models such as DeGroot and Friedkin-Johnsen capture, and by our own internal belief, which they do not. We model multi-agent deliberation as a closed-loop dynamical system in which each agent carries a hidden internal belief, its anchor, that continually pulls its opinion regardless of its neighbours. We show this anchor can be recovered from the deliberation alone, and that it explains a behaviour classical consensus rules forbid: an agent's confidence in the correct answer can climb past where any agent started, escaping the space (convex hull) formed by the initial beliefs. Checking whether the recovered anchor also predicts held-out runs (generalizes) gives a simple test for when a model is truly driven by such an anchor. Across three open-weight model families this is a spectrum, not all-or-nothing. The anchors' influence is about equally strong across families, but the families differ in where the anchor sits, and only when it sits far from the initial opinions does deliberation escape the hull and need the full closed-loop model.
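
The hull-escape behaviour can be reproduced with a toy update rule. The sketch below assumes a simple anchored-averaging form, x_i <- (1 - pull) * sum_j w_ij * x_j + pull * a_i, which is in the spirit of the models named in the abstract but is not taken from the paper; the weights, anchors, and pull strength are made-up values. With pull = 0 the rule reduces to DeGroot averaging with row-stochastic weights, which can never leave the range of the initial opinions.

```python
def deliberate(opinions, anchors, weights, pull, rounds):
    """Iterate x_i <- (1 - pull) * sum_j weights[i][j] * x_j + pull * anchors[i]."""
    x = list(opinions)
    for _ in range(rounds):
        averaged = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
        x = [(1 - pull) * avg + pull * a for avg, a in zip(averaged, anchors)]
    return x


# Three agents with uniform averaging weights (each row sums to 1).
W = [[1 / 3] * 3 for _ in range(3)]
start = [0.4, 0.5, 0.6]  # initial confidence in the correct answer

# pull = 0: plain averaging, so opinions stay inside [0.4, 0.6].
degroot = deliberate(start, anchors=start, weights=W, pull=0.0, rounds=20)

# Hypothetical hidden anchors above every initial opinion pull the group past 0.6.
anchored = deliberate(start, anchors=[0.9, 0.9, 0.9], weights=W, pull=0.3, rounds=20)

print("averaging stays in hull:", max(degroot) <= max(start))
print("anchored run escapes hull:", min(anchored) > max(start))
```

The second run only leaves the hull because the anchors sit far from the initial opinions, which matches the condition stated at the end of the abstract.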

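2606.19683 2026-06-19 cs.AI cs.MA cs.SY eess.SY New submission

Exit-and-Join Dynamics for Decentralized Coalition Formation

Quanyan Zhu

Affiliation: New York University Tandon School of Engineering; Department of Electrical and Computer Engineering

AI summary: Studies decentralized coalition-formation dynamics driven by unilateral exit and join decisions, uses the Aumann-Dreze value to compute individual payoffs, links cooperative payoff allocation to non-cooperative best responses, and analyzes equilibrium properties and the effect of costs on local stability.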

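2606.19911 2026-06-19 cs.AI cs.CL cs.IR New submission

Multi-Agent Transactive Memory

To Eun Kim, Xuhong He, Dishank Jain, Ambuj Agrawal, Negar Arabzadeh, Fernando Diaz

Affiliations: Carnegie Mellon University; University of California, Berkeley

AI summary: Proposes MATM, a framework for shared storage and retrieval of agent trajectories that enables knowledge reuse across heterogeneous agent populations, improving downstream task performance and reducing interaction steps.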

Abstract

The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated artifacts to support human problem solving, retrieval systems can organize agent-generated artifacts for reuse across agent populations. We extend retrieval-augmented generation - which demonstrates the value of human-authored artifacts to individual agents - to retrieval of agent-generated artifacts supporting a population of agents. In particular, agent trajectories encode reusable procedural knowledge, yet these artifacts are typically discarded after a single use or retained only by the producing agent, forcing newly instantiated agents to repeatedly rediscover existing solutions. We propose Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories, where producer agents contribute trajectories to a shared repository and consumer agents retrieve them to improve task execution. We focus on interactive environments (ALFWorld and WebArena), where trajectories are long and encode especially rich procedural structure. Our experiments demonstrate that retrieving trajectories from MATM improves downstream task performance and reduces interaction steps without coordination or joint training. These results position MATM as a design pattern for population-level experience sharing in open agent ecosystems.

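2606.20058 2026-06-19 cs.AI New submission

Autonomous Event-Driven Multi-Agent Orchestration for Enterprise AI at Scale

Harsh Rao Dhanyamraju, Leonidas Raghav, Aaron Lee

AI summary: Addresses the performance degradation of multi-agent systems at enterprise scale by introducing a Task Manager with priority inference, related-event merging, and preemption; evaluated on 208 production-derived scenarios, it reduces high-priority queue latency by 14-75% and improves related-event correctness by over 20 percentage points at enterprise scale.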

Abstract

Enterprise AI aims to move toward continuous event monitoring, detection, and action across specialist agents, yet existing multi-agent systems largely assume discrete request-response workflows and remain underexplored at enterprise scale. We evaluate DAG Plan and Execute and ReAct across 208 production-derived enterprise scenarios spanning Persona (<10 agents), Department (20-80), and Enterprise (200) scales, and introduce a Task Manager for continuous operation via priority inference, related-event merging, and preemption. Results show that scale, not task complexity, dominates orchestration performance: both architectures perform well at small scale but degrade at enterprise scale as agent discovery noise becomes the primary bottleneck, with simple tasks degrading more sharply than complex ones. DAG Plan and Execute offers higher precision and structured parallelization at smaller scales, but its higher overhead worsens at enterprise scale; ReAct is more robust by handling failures incrementally. The Task Manager reduces high-priority queue latency by 14-75% and improves related-event correctness by over 20 percentage points at enterprise scale.

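2606.20236 2026-06-19 cs.AI cs.LG cs.MA New submission

A Multi-Agent system for Multi-Objective constrained optimization

Federica Filippini

Affiliation: University of Milano-Bicocca

AI summary: Proposes MAMO, which uses multi-agent reinforcement learning to decouple task execution from objective design and learns the reward weights that balance optimizing the primary objective against constraint violations, as a first step towards more autonomous and robust RL in dynamic environments.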

Comments Presented at the 17th Workshop on Optimization and Learning in Multiagent Systems (OptLearnMAS, https://optlearnmas.github.io), co-located with the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abstract

Many decision-making problems in computing and networking systems can be naturally formulated as cost-minimization problems under performance constraints. In dynamic environments, reinforcement learning (RL) is often used to solve such problems at runtime by embedding both costs and constraint violations into a single scalar reward through weighted penalty terms, following a Lagrangian-inspired formulation. However, in this context the behavior of the learned policy critically depends on the choice of these weights, which are typically selected manually. This makes it difficult to identify an appropriate trade-off between optimizing the primary objective and effectively avoiding constraint violations, particularly in non-stationary environments where their relative importance may change. This paper presents MAMO (Multi-Agent system for Multi-Objective constrained optimization), an approach to tackle this balancing problem through multi-agent RL. MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem, providing a first step towards more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.
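
The weighted-penalty reward that the abstract describes as the usual baseline can be written down directly, which also shows why the choice of weights matters. The sketch assumes one common form (negative cost minus a weighted sum of positive constraint violations); the abstract does not give MAMO's exact reward, and the function name and numbers below are hypothetical.

```python
def penalized_reward(cost, violations, weights):
    """Scalar reward: negative cost minus weighted positive constraint violations.

    A violation value <= 0 means the constraint is satisfied and adds no penalty.
    """
    if len(violations) != len(weights):
        raise ValueError("one weight per constraint is required")
    penalty = sum(w * max(0.0, v) for w, v in zip(weights, violations))
    return -cost - penalty


# Hypothetical step: cost 2.0, first constraint violated by 0.5, second satisfied.
violations = [0.5, -0.1]
low = penalized_reward(2.0, violations, weights=[1.0, 1.0])
high = penalized_reward(2.0, violations, weights=[10.0, 1.0])
print(low, high)
```

The same step scores very differently under the two weight vectors, so a policy trained on one will trade cost against violations differently from a policy trained on the other. That sensitivity is the motivation for treating weight selection as a learning problem.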

2606.19741 2026-06-19 cs.AI cs.LG New submission

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

Haocheng Duan, Yuxin Guo, Jieyi Bi, Anqi Xie, Sirui Li, Yining Ma, Cathy Wu

Affiliations: Carnegie Mellon University; Nanyang Technological University; Microsoft Research; Massachusetts Institute of Technology

AI summary: Proposes the Evolving Programmatic Bottlenecks (EPB) framework, which distills black-box neural combinatorial optimization models into human-readable program portfolios using an LLM and hybrid textual-numerical gradient descent, revealing how model behavior relates to variants of classic heuristics.

Comments Under Review

Abstract

Neural Combinatorial Optimization (NCO) achieves strong performance, yet its black-box nature remains a key roadblock to deployment and scientific diagnosis. Standard interpretability tools, such as Concept Bottleneck Models (CBMs), are ill-equipped for NCO, whose decisions are dynamic, state-dependent, and lack proper concept vocabulary definition. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge, the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, where each program's per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that couples numerical gradients for student router updates and textual gradients for LLM-based program revision; Block II dynamically adapts bank capacity via fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability, where the distilled program portfolios largely match original performance. EPB also reveals that NCO behavior shifts across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.

2606.19759 2026-06-19 cs.AI cs.SI 新提交

Optimal Scheduling in a Question-Answering Forum of Knowledge Workers

知识工作者问答论坛中的最优调度

Rohit Negi, Mustafa Yilmaz

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

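AI summary: For a question-answering forum of knowledge workers, proposes a request-scheduling model based on experts' levels of expertise, computes the system capacity, designs a capacity-achieving scheduler, and examines how collaboration among experts increases capacity.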

Comments 14 pages, 4 figures

2606.20084 2026-06-19 cs.AI 新提交

Residual-Space Evolutionary Optimization via Flow-based Generative Models

基于流生成模型的残差空间进化优化

Zhuo Cao, Lena Krieger, Fernanda Nader, Xuan Zhao, Hanno Scharr, Ira Assent

发表机构 * LMU Munich, Munich Center for Machine Learning (MCML), Germany（慕尼黑大学，慕尼黑机器学习中心（MCML），德国）； Department of Computer Science, Aarhus University, Denmark（丹麦奥胡斯大学计算机科学系）

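AI summary: Proposes a residual-space evolutionary optimization framework that combines flow-based generative editing with evolutionary algorithms and separates local exploitation from global exploration in residual space, for data editing under non-differentiable black-box objectives.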

Comments Accepted by ICML 2026 Workshop SPIGM, 5 pages, 3 figures

AI中文摘要

使用生成方法进行数据编辑通常需要可微目标和基于梯度的搜索。然而，这些假设在基于流的设置中不成立，其中编辑通过前向和反向积分执行，并且通常涉及不可微或黑盒目标。我们引入了残差空间进化优化，这是一个模型无关的框架，通过将基于流的生成编辑与进化算法相结合来解决这一差距。基于条件流匹配（CFM）可以将条件控制因素与实例特定残差分离的观察，我们的框架直接在残差空间中操作，并分离两个互补的搜索机制：自花授粉通过保留特征的残差细化进行局部利用，而交叉授粉通过跨异质样本重组残差促进更广泛的探索。作为概念验证，我们在MorphoMNIST（一个用于反事实生成的基准数据集）和晶体数据上进行了验证，表明这种探索-利用分解为平衡目标对齐、实例保留和多样性提供了有用的机制，并且可以扩展到图像之外的真实世界科学领域。

英文摘要

Data editing with generative methods typically requires differentiable objectives and gradient-based search. However, these assumptions break down in flow-based settings, where edits are performed through forward and backward integration and often involve non-differentiable or black-box objectives. We introduce residual-space evolutionary optimization, a model-agnostic framework that addresses this gap by combining flow-based generative editing with evolutionary algorithms. Building on the observation that conditional flow matching (CFM) can disentangle condition-controlled factors from instance-specific residuals, our framework directly operates in residual space and separates two complementary search regimes: self-pollination performs local exploitation through feature-preserving residual refinement, and cross-pollination promotes broader exploration by recombining residuals across heterogeneous samples. As a proof of concept, we validate on MorphoMNIST, a benchmark dataset for counterfactual generation, and on crystal data, demonstrating that this exploration-exploitation decomposition provides a useful mechanism for balancing target alignment, instance preservation, and diversity, and extends beyond images to real-world scientific domains.

2606.19475 2026-06-19 cs.AI cs.CL 新提交

Diffusion Language Models: An Experimental Analysis

扩散语言模型：一项实验分析

Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi

发表机构 * University of Modena and Reggio Emilia（摩德纳和雷焦艾米利亚大学）； University of Pisa（比萨大学）

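AI summary: Systematically compares eight diffusion language models on tasks including reasoning, coding, and translation, analyzes how inference factors such as the number of denoising steps and context length affect performance and efficiency, and exposes the trade-offs of diffusion language models across tasks and budgets.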

AI中文摘要

大型语言模型（LLMs）通过自回归生成彻底改变了语言建模，使其在广泛的任务中表现出色。最近，扩散语言模型（DLMs）作为一种替代范式出现，它通过迭代去噪而非下一个词预测来生成文本，从而允许对整个序列进行并行精炼。尽管已经提出了许多基于扩散的架构，但评估协议、数据集、推理预算和生成超参数的差异使得比较它们的能力和理解它们提供的权衡变得困难。在这项工作中，我们对现代DLMs进行了系统的实验分析。具体来说，我们评估了八种最先进的DLMs在八个基准上的表现，这些基准涵盖推理、编码、翻译、知识和结构化问题解决，同时明确考虑了生成质量和计算效率。除了下游评估，我们还分析了关键推理时间因素的影响，包括去噪步数、上下文长度、块大小和并行解掩策略，并通过在相同条件下训练的较小模型的受控比较来补充大规模实验。我们的分析突出了基于扩散的语言建模在不同任务、架构和推理预算下的优势和局限性。我们表明，DLMs的行为受到生成时间设计选择的强烈影响，导致性能和计算效率之间的不同权衡。总体而言，我们的研究为当代DLMs的能力和部署特性提供了实用见解。

英文摘要

Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.

2606.19538 2026-06-19 cs.AI cs.LG 新提交

ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

ITNet: 一种可学习的积分变换，统一卷积、注意力与循环

Ashim Dhor, Rasel Mondal, Pin Yu Chen

发表机构 * Indian Institute of Science Education and Research Bhopal（印度科学教育与研究学院博帕尔分校）； IBM Research（IBM研究院）

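AI summary: Proposes ITNet, a learnable integral transform network that unifies convolutional, attention, and recurrent architectures through a kernel function defined jointly over positions and features, achieving strong performance across modalities.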

AI中文摘要

卷积网络、循环网络和变换器各自编码不同的归纳偏置——局部性、序列记忆和内容相关的成对交互——自诞生以来在数学上一直彼此独立。我们表明，这种碎片化反映的不是信号处理方式的根本多样性，而是对单一底层数学对象的不完整视角：可学习的积分变换。我们引入积分变换网络（ITNet），这是一种统一架构，围绕一个依赖于位置和特征的联合可学习核构建。该核实现为一个小型神经网络（具体为MLP），用于建模成对交互，使模型能够从数据中自适应其行为。我们证明，卷积、自注意力（包括多头）和自回归循环（包括LSTM、GRU、S4和Mamba）在适当参数化下均作为特例出现，且ITNet是连续算子的通用逼近器。为使其实用，我们开发了分块核融合、重要性加权蒙特卡洛积分和可学习低秩分解，实现高效可扩展计算。单个ITNet架构，共享算子与轻量级模态特定编码器，在ImageNet-1K、GLUE、ModelNet40、VQA v2和NLVR2上匹配或超越专用基线。结果表明，单一学习交互机制可从数据中恢复所有三个架构族的行为。

英文摘要

Convolutional networks, recurrent networks, and transformers each encode different inductive biases -- locality, sequential memory, and content-dependent pairwise interaction -- and have remained mathematically distinct since their inception. We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying mathematical object: a learnable integral transform. We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that depends jointly on positions and features. This kernel is implemented as a small neural network, specifically an MLP, that models pairwise interactions, enabling the model to adapt its behavior from data. We show that convolution, self-attention (including multi-head), and autoregressive recurrence (including LSTM, GRU, S4, and Mamba) arise as special cases under appropriate parameterizations, and that ITNet is a universal approximator of continuous operators. To make this practical, we develop tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization, enabling efficient and scalable computation. A single ITNet architecture with a shared operator and lightweight modality-specific encoders matches or exceeds specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2. The results demonstrate that a single learned interaction mechanism can recover the behavior of all three architectural families from data.

2606.19607 2026-06-19 cs.AI stat.AP 新提交

Which Pairs to Compare for LLM Post-Training?

LLM后训练中应比较哪些对？

Jiangze Han, Vineet Goyal, Will Ma

发表机构 * Columbia University（哥伦比亚大学）

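AI summary: Studies how to choose the most informative comparison pairs in preference-based post-training, proposes a sampling-design approach to comparison curation, derives an optimization criterion from a theoretical analysis of DPO training, and shows experimentally that it improves sample efficiency.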

AI中文摘要

基于偏好的后训练已成为对齐语言模型的核心范式。常见的数据收集策略是为每个提示生成少量补全并标注生成的比较对。然而，人工偏好标签通常比生成额外补全昂贵得多，这提示了相同标注预算的不同使用方式：生成更大的补全集，但只标注最具信息量的比较对。本文研究在基于偏好的后训练中应比较哪些对。我们将比较策展形式化为一个采样设计问题，并通过基于偏好的后训练目标下的最终策略质量来评估设计。我们针对直接偏好优化（DPO）实例化该框架，分析标注对的选择如何通过DPO训练传播到下游策略性能。我们的主要结果为DPO训练策略的后训练最优性差距提供了匹配的上界和下界。这些界限表明，比较选择通过一个单一的设计相关信息矩阵影响下游性能，该矩阵将标签分配与参数估计误差和策略次优性联系起来。这为预算受限的比较策展提供了显式优化准则，并激发了从大型生成补全池中选择信息对的实际采样设计。在合成设置和语言模型后训练基准上的实验表明，所提出的设计在样本效率上持续优于常见的比较选择启发式方法。

英文摘要

Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and label the resulting comparison pairs. However, human preference labels are often much more expensive than generating additional completions, suggesting a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for Direct Preference Optimization (DPO), analyzing how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results provide matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large generated completion pools. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.

2606.19658 2026-06-19 cs.AI cs.IR cs.MM 新提交

Denoising Implicit Feedback for Cold-start Recommendation

去噪隐式反馈用于冷启动推荐

Gaode Chen, Shicheng Wang, Shikun Li, Rui Huang, Xinghua Zhang, Yunze Luo, Shipeng Li, Shiming Ge, Ruina Sun, Yinjie Jiang, Jun Zhang

发表机构 * Hong Kong Baptist University（香港浸会大学）； Independent Researcher（独立研究员）； Peking University（北京大学）； Nanjing University（南京大学）； Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）

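AI summary: To address noisy implicit feedback in cold-start recommendation, proposes DIF, a model-agnostic denoising method that infers pseudo-labels from content similarity and models their confidence and uncertainty; deployed in the Kuaishou app, it significantly improves commercial metrics in cold-start scenarios.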

Comments Accepted by KDD 2026 ADS Track

AI中文摘要

隐式反馈因其可获取性和通用性被广泛用于推荐系统，但通常包含噪声样本（如点击诱饵、位置偏差）。同时，由于新物品的持续涌入，推荐器不可避免地面临物品冷启动问题。我们识别出冷物品因上述因素更容易受到噪声样本的影响，而研究者往往忽视了为冷物品去噪隐式反馈的重要性。先前的去噪研究通常基于启发式模式（如高损失值）识别噪声样本，并通过样本选择或重加权来减轻噪声。然而，这些方法适应性有限，在冷启动场景中效果不佳。为了实现冷启动推荐中的隐式反馈去噪，我们提出了一种模型无关的去噪方法DIF。首先，用户对内容的偏好是稳定的，这使我们能够通过内容相似的热物品推断出指示用户是否对冷物品感兴趣的伪标签。其次，为了提高伪标签准确性，我们基于冷物品与热物品的内容相似性对伪标签的置信度进行建模，然后为每个样本聚合多个伪标签。最后，我们通过考虑噪声样本标签的相对熵和物品的冷启动状态，显式估计其不确定性，从而自适应地指导伪标签在样本级别纠正噪声标签。DIF的优越性得到了理论证明和真实数据集上大量实验的支持。该方法已部署在十亿用户规模的短视频应用快手上，并在冷启动场景中显著提升了各项商业指标。

英文摘要

Implicit feedback is widely used in recommender systems due to its accessibility and generality, yet it usually presents noisy samples (e.g., clickbait, position bias). Meanwhile, recommenders inevitably face the item cold-start problem due to the continuous influx of new items. We identify that cold items are more prone to noisy samples due to the aforementioned factors, and researchers often overlook the significance of denoising implicit feedback for cold items. Previous denoising studies usually identify noisy samples based on heuristic patterns, such as higher loss values, and mitigate noise through sample selection or re-weighting. However, these methods have limited adaptability and are ineffective in cold-start scenarios. To achieve denoising implicit feedback for cold-start recommendation, we propose a model-agnostic denoising method called DIF. First, user preferences for content remain stable, which allows us to infer pseudo-labels indicating whether a user is interested in a cold item through content-similar warm items. Furthermore, to improve pseudo-label accuracy, we model the confidence of pseudo-labels based on the content similarity between the cold item and warm items, and then aggregate multiple pseudo-labels for each sample. Finally, we explicitly estimate the uncertainty of the noisy sample label by considering its relative entropy and the cold-start status of the item, which adaptively guides the role of pseudo-labels to correct the noisy labels at the sample level. DIF's superiority is supported by both theoretical justification and extensive experiments on real-world datasets. The method has been deployed on a billion-user scale short video application Kuaishou and has significantly improved various commercial metrics within cold-start scenarios.
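The pseudo-label steps described in the abstract can be illustrated with a minimal sketch. It assumes each warm item the user has interacted with comes with a content similarity to the cold item and a 0/1 interest label, aggregates the most similar ones with similarity as the confidence weight, and then blends the result with the observed label. The function names, the `top_k` cutoff, and the linear blending rule are hypothetical; the abstract says the uncertainty is estimated from relative entropy and cold-start status, which this sketch simply takes as an input.

```python
def aggregate_pseudo_label(neighbors, top_k=3):
    """Confidence-weighted pseudo-label from content-similar warm items.

    neighbors: list of (similarity, label) pairs, label in {0, 1}.
    Similarity doubles as the confidence weight (an assumption of this sketch).
    Returns a value in [0, 1], or None if no usable neighbor exists.
    """
    top = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:top_k]
    total = sum(max(sim, 0.0) for sim, _ in top)
    if total == 0:
        return None
    return sum(max(sim, 0.0) * label for sim, label in top) / total


def corrected_label(observed, pseudo, uncertainty):
    """Blend the observed (possibly noisy) label with the pseudo-label.

    uncertainty in [0, 1]: 0 trusts the observed label, 1 trusts the pseudo-label.
    The linear blend is a hypothetical stand-in for the paper's correction rule.
    """
    if pseudo is None:
        return float(observed)
    return (1.0 - uncertainty) * observed + uncertainty * pseudo
```

For example, a click on a cold item (`observed=1`) whose similar warm items were mostly ignored would be pulled toward 0 as the supplied uncertainty grows.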

2606.19771 2026-06-19 cs.AI 新提交

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

超越熵：从令牌级分布偏差中学习以增强LLM推理

Xuanzhi Feng, Zhengyang Li, Zeyu Liu, Haoxi Li, Yuming Jiang, Bing Guo, Jingcai Guo, Jie Zhang, Song Guo

发表机构 * The Hong Kong University of Science and Technology（香港科技大学）； Sichuan University（四川大学）； The Hong Kong Polytechnic University（香港理工大学）

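AI summary: To address the entropy collapse or explosion caused by token updates in RLVR, proposes the ICT framework, which uses JS divergence to identify critical tokens and balances policy concentration through selective updates, improving reasoning performance.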

AI中文摘要

基于可验证奖励的强化学习（RLVR）显著推进了大语言模型（LLM）推理；然而，它面临一个基本的优化不稳定性：均匀令牌更新会导致熵塌陷，从而过早收敛到次优策略，而过度的香农熵最大化可能导致熵爆炸，驱动盲目探索走向不连贯的推理链。为解决这一二分问题，我们引入了独立组合令牌（ICT）框架，该框架将优化焦点从标量不确定性转移到令牌logits的分布特性。通过利用令牌logits分布之间的詹森-香农（JS）散度，ICT将具有独特分布模式的令牌识别为引导LLM推理中有效探索的关键分支点。我们的理论分析基于香农熵和二阶Rényi熵，证明选择性地更新这些令牌可以调节策略集中度：它降低了由香农熵度量的整体分布不确定性，同时控制了由二阶Rényi熵捕获的概率集中度。这种双重效应防止了过度集中的令牌生成削弱探索，并有效稳定了训练景观。实验结果表明，在Qwen2.5（0.5B/1.5B/7B）模型上仅更新前10%的独特令牌，在涵盖数学、常识和奥林匹克级别问题的七个基准测试中，与GRPO、20-Entropy和STAPO基线相比，平均pass@4提升了4.58%，最大提升达14.9%。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, leading to premature convergence to suboptimal strategies, whereas excessive Shannon Entropy maximization can cause entropy explosion, driving blind exploration toward incoherent reasoning chains. To resolve this dichotomy, we introduce the Independent Combinatorial Tokens (ICT) framework, which shifts the optimization focus from scalar uncertainty to the distributional properties of token logits. By leveraging the Jensen-Shannon (JS) divergence between token logits distributions, ICT identifies tokens with distinctive distributional patterns as critical branching points for guiding effective exploration in LLM reasoning. Our theoretical analysis, grounded in both Shannon and second-order Rényi entropy, proves that selectively updating on these tokens regulates policy concentration: it reduces the overall distribution uncertainty measured by Shannon entropy, while controlling probability concentration captured by second-order Rényi entropy. This dual effect prevents over-concentrated token generation from weakening exploration and effectively stabilizes the training landscape. Empirical results demonstrate that updating only the top 10% of unique tokens on Qwen2.5 (0.5B/1.5B/7B) models yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over GRPO, 20-Entropy, and STAPO baselines across seven benchmarks spanning math, commonsense, and Olympiad-level problems.
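The quantities named in the abstract can be made concrete with a minimal sketch. The code below computes Shannon entropy, second-order Rényi entropy, and the Jensen-Shannon divergence for discrete next-token distributions, then ranks token positions by their JS divergence from a reference distribution. The choice of reference (the mean distribution over the sequence), the function names, and the top-fraction rule are illustrative assumptions; the abstract does not specify how ICT forms the comparison.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def renyi2_entropy(p):
    """Second-order Renyi entropy: H_2(p) = -log sum_i p_i^2."""
    return -math.log(sum(pi * pi for pi in p))


def kl(p, q):
    """KL(p || q); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, bounded by log 2 with natural logs."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def select_distinctive(token_logits, fraction=0.10):
    """Return indices of the positions whose distribution deviates most.

    Assumption of this sketch: deviation is measured against the mean
    distribution over all positions in the sequence.
    """
    probs = [softmax(logits) for logits in token_logits]
    vocab = len(probs[0])
    mean = [sum(p[v] for p in probs) / len(probs) for v in range(vocab)]
    scores = [js_divergence(p, mean) for p in probs]
    k = max(1, round(fraction * len(probs)))
    ranked = sorted(range(len(probs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

In a training loop, the returned indices would act as a mask that restricts the policy-gradient update to the selected positions.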

2606.19808 2026-06-19 cs.AI cs.CL 新提交

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

再思考还是更长时间思考？面向预算感知推理的选择性验证

Sajib Acharjee Dip, Dawei Zhou, Liqing Zhang

发表机构 * Department of Computer Science, Virginia Tech（弗吉尼亚理工大学计算机科学系）； Fralin Biomedical Research Institute, Virginia Tech（弗吉尼亚理工大学弗拉林生物医学研究所）； FBRI Cancer Research Center（FBRI癌症研究中心）

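AI summary: Proposes SEVRA, a selective verification framework in which a serving-layer controller decides whether to verify a frozen solver's initial answer; on Math500 it reaches higher accuracy with fewer tokens and reduces harmful flips.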

AI中文摘要

测试时推理越来越多地被用作服务时的控制旋钮，但额外的推理并非均匀有价值：它可以修复失败的尝试，在已经正确的答案上浪费计算，或引入有害的答案更改。我们将其视为一个部署分配问题，而非新验证器问题。我们引入SEVRA，即面向推理分配的选择性验证，这是一个服务层控制器，决定是保留冻结求解器的初始答案还是调用主动验证。使用冻结的Qwen3-4B求解器，我们记录干预结果并从服务可见的尝试状态训练可恢复性感知的门控。在Math500上，选择性验证达到76.3%的准确率，而始终验证为75.5%，同时将生成后token减少26.8%，有害翻转从2.2%降至1.0%。然而，8,192 token的初始求解达到76.0%的准确率，总模型token减少28%，表明选择性恢复有用但并非测试的最佳成本前沿。在冻结迁移到GSM时，选择性策略仅验证3.0%的样本，准确率从93.4%提升至94.5%，验证token相对于始终验证减少91.2%；同样，更长的初始求解以更少的实际token达到相同准确率。在CommonsenseQA上，始终开启的验证有害，而Self-Consistency@5以约五倍的实际token成本提升准确率。由此得出的部署规则是：首先调整初始预算，然后在需要显式检查、有限重试、可审计性或回归风险控制时使用选择性恢复。

英文摘要

Test-time reasoning is increasingly used as a serving-time control knob, but extra reasoning is not uniformly valuable: it can repair failed attempts, waste compute on already-correct answers, or introduce harmful answer changes. We study this as a deployment allocation problem rather than a new-verifier problem. We introduce SEVRA, Selective Verification for Reasoning Allocation, a serving-layer controller that decides whether to preserve a frozen solver's initial answer or invoke active verification. Using a frozen Qwen3-4B solver, we log intervention outcomes and train recoverability-aware gates from serving-visible attempt state. On Math500, selective verification reaches 76.3% accuracy, compared with 75.5% for always verifying, while reducing post-generation tokens by 26.8% and harmful flips from 2.2% to 1.0%. However, an 8,192-token initial solve reaches 76.0% accuracy with 28% fewer total model tokens, showing that selective recovery is useful but not the best tested cost frontier. In frozen transfer to GSM, the selective policy verifies only 3.0% of examples, improves accuracy from 93.4% to 94.5%, and reduces verification tokens by 91.2% relative to always verifying; again, a longer initial solve matches its accuracy with fewer realized tokens. On CommonsenseQA, always-on verification hurts, while Self-Consistency@5 improves accuracy at about five times the realized token cost. The resulting deployment rule is: tune the initial budget first, then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.
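The keep-or-verify decision described above can be sketched as a small serving-layer gate. This is only an illustration under stated assumptions: the paper trains its gates from logged outcomes, whereas the sketch uses hand-set weights, and the `AttemptState` fields, the scoring rule, and the threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class AttemptState:
    """Serving-visible signals about the initial solve (hypothetical fields)."""
    hit_token_limit: bool
    answer_parsed: bool
    tokens_used: int


def recoverability_score(state: AttemptState, budget: int) -> float:
    """Hand-set stand-in for a trained gate: higher means verification is more likely to help."""
    score = 0.0
    if state.hit_token_limit:
        score += 0.5
    if not state.answer_parsed:
        score += 0.4
    score += 0.1 * min(state.tokens_used / budget, 1.0)
    return score


def serve(state: AttemptState, initial_answer: str,
          verify: Callable[[str], str],
          budget: int = 8192, threshold: float = 0.5) -> Tuple[str, bool]:
    """Return (answer, verified). The verifier is only called when the gate fires."""
    if recoverability_score(state, budget) >= threshold:
        return verify(initial_answer), True
    return initial_answer, False
```

Keeping the verifier behind a callable means the extra tokens are spent only on the attempts the gate flags, which is the cost the abstract compares against always verifying.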

2606.20156 2026-06-19 cs.AI 新提交

Modularity-Free Conflict-Averse Training for Generalized PINNs

面向广义PINN的无模块化冲突规避训练

Heejo Kong, Beomchul Park, Sung-Jin Kim, Seong-Whan Lee

发表机构 * Department of Brain and Cognitive Engineering, Korea University（韩国大学脑与认知工程系）； Department of Artificial Intelligence, Korea University（韩国大学人工智能系）

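AI summary: To address the failure of conflict-averse optimization in overparameterized PINNs caused by functional modularity, proposes the ModSync framework, which penalizes task-exclusive connections while preserving interaction pathways, integrating structural optimization with conflict-averse training and reaching state-of-the-art accuracy on a range of PDE benchmarks.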

Comments Accepted by ICASSP 2026

AI中文摘要

物理信息神经网络（PINN）通过将物理定律嵌入可微目标，已成为求解偏微分方程的强大框架。尽管取得了进展，训练PINN仍然脆弱：最近的冲突规避优化方案缓解了残差损失和边界损失之间的梯度干扰，但我们表明，随着模型容量的增加，其有效性会下降。在本文中，我们识别了一种容量诱导的失效模式，其中过参数化网络经历功能模块化，自我划分为任务专属模块，抑制跨目标交互并阻碍向帕累托驻点收敛。为解决此问题，我们提出了一种新颖框架——模块稀疏同步（ModSync），通过惩罚任务专属连接同时保留促进交互的路径，将结构优化整合到冲突规避训练中。跨多种PDE基准的大量实验表明，ModSync持续防止容量驱动的失败，维持稳健的跨目标耦合，并实现了最先进的精度。代码可在 https://github.com/heejokong/ModSync 获取。

英文摘要

Physics-informed neural networks (PINNs) have become a powerful framework for solving PDEs by embedding physical laws into differentiable objectives. Despite their advances, training PINNs remains fragile: recent conflict-averse optimization schemes alleviate gradient interference between residual and boundary losses, but we show that their effectiveness deteriorates as model capacity increases. In this paper, we identify a capacity-induced failure mode, where overparameterized networks undergo functional modularity, self-partitioning into task-exclusive modules that suppress cross-objective interaction and hinder convergence toward Pareto-stationary points. To address this issue, we propose a novel framework, Modular-Sparsity Synchronization (ModSync), which integrates structural optimization into conflict-averse training by penalizing task-exclusive connections while preserving interaction-promoting pathways. Extensive experiments across diverse PDE benchmarks demonstrate that ModSync consistently prevents capacity-driven failures, sustains robust cross-objective coupling, and achieves state-of-the-art accuracy. Code is available at https://github.com/heejokong/ModSync.

2606.20381 2026-06-19 cs.AI 新提交

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

重新思考LLM FP4预训练中的收缩偏差：几何起源、系统影响与UFP4方案

Qian Zhao, Kunlong Chen, Changxin Tian, Zhonghui Jiang, Haitao Zhang, Chaofan Yu, Peijie Jiang, Mingliang Gong, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

发表机构 * Ling Team, Ant Group（蚂蚁集团灵团队）

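AI summary: Finds that the E2M1 format suffers from shrinkage bias due to geometric asymmetry, that this bias is amplified by the Random Hadamard Transform, and that it causes training instability; proposes uniform E1M2/INT4 grids and the UFP4 training recipe, which achieve lower loss on several models.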

Comments 18 pages, 12 figures

AI中文摘要

FP4训练有望大幅减少LLM预训练的内存和计算成本，然而当前的FP4硬件路径和方案，包括NVIDIA Blackwell/Rubin级系统和AMD MI350系列GPU，仍以E2M1数据元素为中心。在本研究中，我们识别出该选择的一个根本限制：诸如E2M1的非均匀格式固有地遭受收缩偏差，这是一种由其可表示区间的几何不对称性导致的系统性负舍入误差。我们证明该偏差在层间乘性累积，并被随机哈达玛变换（RHT）放大，为现有基于E2M1的FP4方案中观察到的训练不稳定性提供了统一解释。相比之下，均匀网格（E1M2/INT4）绕过了这种网格几何误差，并能更好地将RHT改进的桶利用率转化为更高的量化质量。基于这一发现，我们提出UFP4，一种均匀4位训练方案，它将RHT应用于所有三个训练GEMM，同时仅对dY施加随机舍入。在Dense 1.5B、MoE 7.9B和MoE 124B的长程预训练中，UFP4始终比强E2M1基线实现更低的BF16相对损失退化，这得到了缩放定律分析和消融研究的支持。我们的结果表明，未来的加速器应支持E1M2/INT4风格的均匀4位网格作为与E2M1并列的一等训练原语。

英文摘要

FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
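To make the grid-geometry argument easier to probe, here is a minimal sketch that measures the mean signed rounding error in magnitude for a non-uniform and a uniform 4-bit grid. The E2M1 magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6} is the standard value set of that format. Everything else is an assumption of this sketch rather than the paper's setup: the uniform grid is eight equally spaced magnitudes rescaled to the same maximum, inputs are Gaussian, scaling is per-block absmax, and rounding is plain nearest-value. No result is asserted here; a negative mean would indicate magnitudes shrinking on average.

```python
import random

# Representable magnitudes (the sign bit is handled separately).
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Eight equally spaced magnitudes with the same maximum (stand-in for E1M2/INT4-style grids).
UNIFORM_LEVELS = [6.0 * i / 7 for i in range(8)]


def quantize_magnitude(x, levels):
    """Round |x| to the nearest level, saturating at the largest one.

    Exact ties resolve to the lower level here; real hardware typically
    uses round-to-nearest-even, which this sketch does not model.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), levels[-1])
    nearest = min(levels, key=lambda v: abs(v - mag))
    return sign * nearest


def mean_signed_magnitude_error(values, levels):
    """Average of |q(x)| - |x| after absmax scaling onto the grid's range."""
    scale = levels[-1] / max(abs(v) for v in values)
    errors = [abs(quantize_magnitude(v * scale, levels)) - abs(v * scale)
              for v in values]
    return sum(errors) / len(errors)


def demo(n=10000, seed=0):
    """Compare both grids on one block of Gaussian samples."""
    rng = random.Random(seed)
    block = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return {
        "e2m1": mean_signed_magnitude_error(block, E2M1_LEVELS),
        "uniform": mean_signed_magnitude_error(block, UNIFORM_LEVELS),
    }
```

Calling `demo()` with different seeds, block sizes, or input distributions gives a quick way to see how sensitive the signed error is to the choice of grid.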

2606.20544 2026-06-19 cs.AI cs.LG 新提交

Toward Calibrated Mixture-of-Experts Under Distribution Shift

面向分布偏移下的校准混合专家模型

Gina Wong, Drew Prinster, Suchi Saria, Rama Chellappa, Anqi Liu

发表机构 * Johns Hopkins University（约翰霍普金斯大学）

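AI summary: Studies the calibration of mixture-of-experts models under distribution shift and proposes an adversarial reweighting method that reduces the calibration error of the routed aggregate and improves the accuracy-calibration trade-off.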

Journal ref ICML 2026

AI中文摘要

校准将模型的预测不确定性与其经验结果的频率对齐，对于理解和信任报告的概率很重要。最近的研究表明，在单个预测器级别强制执行校准可以提高集成准确性和校准，特别是混合专家（MoE）模型显示出强烈的经验改进；然而，校准有助于MoE的条件尚不清楚。在这项工作中，我们研究了MoE模型在分布偏移下的行为，重点关注路由机制如何与专家级校准相互作用。我们表明，在硬路由模型中，专家校准足以确保整体模型在一大类分布偏移下的校准，但不足以校准软路由模型。为了解决这个问题，我们提出了一种对抗性重加权方法，惩罚分布偏移下路由聚合的校准误差，并证明它在平均情况下以及在数据的困难子集上，跨模型类别、预测任务和分布偏移，改善了准确率-校准权衡。

英文摘要

Calibration aligns a model's predictive uncertainty with the frequencies of its empirical outcomes and is important for understanding and trusting reported probabilities. Recent work shows that enforcing calibration at the level of individual predictors can improve ensemble accuracy and calibration, with mixture-of-experts (MoE) models showing strong empirical improvements in particular; however, the conditions under which calibration helps MoE are not well understood. In this work, we study how MoE models behave under distribution shift, focusing on how routing mechanisms interact with expert-level calibration. We show that expert calibration is sufficient to ensure calibration of the overall model under a broad class of distribution shifts in hard-routed models, but is insufficient for calibrating soft-routed models. To address this, we propose an adversarial reweighting that penalizes calibration errors of the routed aggregate under distribution shift, and we demonstrate that it improves the accuracy-calibration tradeoff both on average and on difficult subsets of the data, across model classes, prediction tasks, and distribution shifts.
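As a small illustration of what "calibration of the routed aggregate" means, the sketch below forms a soft-routed prediction as a gate-weighted average of expert probabilities and scores a set of predictions with expected calibration error (ECE) over equal-width bins. ECE is one common calibration metric; the abstract does not say which metric the paper uses, and the binary-outcome setting and function names are assumptions of this sketch.

```python
def mixture_probability(gate_weights, expert_probs):
    """Soft routing: the aggregate is a gate-weighted average of expert outputs."""
    return sum(g * p for g, p in zip(gate_weights, expert_probs))


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE for binary outcomes.

    confidences: predicted probabilities of the positive class, in [0, 1].
    outcomes: observed labels in {0, 1}.
    Each bin contributes |mean confidence - observed frequency|,
    weighted by the share of samples that fall into it.
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        freq = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - freq)
    return ece
```

Because the aggregate mixes expert outputs with input-dependent gate weights, its ECE has to be measured on the mixed predictions themselves rather than inferred from each expert's score.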

2606.19626 2026-06-19 cs.AI cs.CL 新提交

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Toten：基于知识本体的巴西葡萄牙语物理量和技术符号分词

Antonio de Sousa Leitão Filho, Allan Kardec Duailibe Barros Filho, Fabrício Saul Lima, Selby Mykael Lima dos Santos, Rejani Bandeira Vieira Sousa

发表机构 * Aia Context ； Universidade Federal do Maranhão（马拉尼昂联邦大学）； Universidade de São Paulo（圣保罗大学）

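AI summary: Proposes the TOTEN framework, which uses an ontology of engineering entities to classify physical quantities and technical notation declaratively in place of statistical tokenization, achieving highly atomic tokenization and numerical reconstruction on Brazilian Portuguese corpora.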

AI中文摘要

字节对编码分词在词汇压缩方面统计高效，但对结构化技术实体语义盲目，将物理量、数字、单位和符号表达式分割成词汇上任意子词。我们提出TOTEN，一个基于知识本体的分词框架，用基于工程实体形式本体（OEE）的声明式分类取代统计推导。我们将TOTEN形式化为三元组<O, classify, {inst_tau}>：本体收集类型、结构原理、组成关系和可保存不变量；分类函数将原始文本映射到类型化区域；实例化器族产生自描述的结构化表示。鲁棒性源于与三个外部预言机的确定性耦合：Pint（量纲）、Unicode字符数据库（排版）和RSLP（葡萄牙语形态）。内在评估涵盖四个可通过构造验证的属性——本体原子性、量纲等价性、排版鲁棒性和数值重建——在一个内部、物理验证的基准（EngQuant，N=800）和四个巴西葡萄牙语外部语料库（N=1771个合格案例）上进行。我们还报告检测召回率，区分覆盖率和条件原子性。与八个最先进基线相比，TOTEN在所有对比中实现单位本体原子性，在外部语料库上数值重建为0.775-0.904，而最佳基线（Quantulum3）为0.627-0.703；在EngQuant上为0.780 vs. 0.340。差异具有统计显著性（McNemar检验，Holm校正）。内部和外部排名之间的Spearman相关性证实了控制基准的同时效度。量纲等价性显示与Pint（系统继承量纲权威的预言机）统计对等。

英文摘要

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple <O, classify, {inst_tau}>: the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the instantiator family yields a self-descriptive structured representation. Robustness derives from deterministic coupling with three external oracles: Pint (dimensional), Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation covers four properties verifiable by construction -- ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction -- over an internal, physically validated benchmark (EngQuant, N=800) and four Brazilian Portuguese external corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on external corpora, vs. 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. Differences are statistically significant (McNemar with Holm correction). Spearman correlation between internal and external rankings confirms concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.

2606.20245 2026-06-19 cs.AI 新提交

Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

导航不可靠的参数化与上下文知识：面向LLM推理的显式知识冲突解决

Huang Peng, Jiuyang Tang, Weixin Zeng, Hao Xu, Xiang Zhao

发表机构 * National Key Laboratory of Big Data and Decision, National University of Defense Technology（国防科技大学大数据与决策国家重点实验室）

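AI summary: Proposes the MACR framework, which explicitly resolves conflicts between an LLM's internal parametric knowledge and external context through adaptive knowledge assessment and multi-agent reasoning, moving beyond the conventional binary-choice paradigm.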

Comments 12 pages, 3 figures

详情

AI中文摘要

大型语言模型（LLM）通过利用广泛的参数化知识和上下文学习能力，在多种基于语言的任务中取得了强劲性能，使其能够整合输入提示中提供的外部信息。然而，外部知识的整合可能引入冲突，不仅存在于模型内部参数知识与外部信息之间，也存在于多个外部上下文之间。现有方法通常假设模型或提供的上下文是可靠的，忽视了两种来源都可能包含错误的情况，并通过优先考虑某一来源而非另一来源来避免冲突，而非主动解决不一致性。为解决这些局限，我们提出了一种新颖的LLM知识冲突解决框架MACR，该框架超越了传统的二元选择范式，并基于多智能体推理方法引入了显式的冲突解决机制。具体而言，我们首先提出一种自适应知识评估与检索方法，采用改进的语义熵度量来量化LLM对给定查询答案的置信度。基于此置信度估计，MACR要么将模型的内部知识外化为文本表示，要么在内部知识不足时检索相关外部知识，为后续推理生成基本上下文。然后，我们引入一个归纳式多智能体推理框架，包含三个专门智能体，分别用于归纳显式规则、分析潜在冲突以及解决所有可用上下文中的不一致性。实验结果表明，MACR在多个基准测试中显著优于最先进的基线方法，同时提供了可解释的显式冲突解决方案。

英文摘要

Large language models (LLMs) have achieved strong performance across a wide range of language-based tasks by leveraging both extensive parametric knowledge and in-context learning ability, enabling them to incorporate external information provided in the input prompt. However, the integration of external knowledge can introduce conflicts, not only between the model's internal parametric knowledge and the external information, but also among multiple pieces of external contexts. Existing approaches typically assume that either the model or the provided context is reliable, overlooking the possibility that both sources may contain errors, and avoid conflicts by privileging one source over the other, rather than actively resolving inconsistencies. To address these limitations, we propose a novel framework MACR for LLM knowledge conflict resolution that moves beyond the conventional binary choice paradigm and incorporates an explicit conflict-resolution mechanism based on a multi-agent reasoning approach. Specifically, we first propose an adaptive knowledge assessment and retrieval approach that employs a modified semantic entropy measure to quantify an LLM's confidence in its answer to a given query. Based on this confidence estimation, MACR either externalizes the model's internal knowledge as textual representations or retrieves relevant external knowledge when internal knowledge is insufficient, generating basic contexts for subsequent reasoning. Then we introduce an inductive multi-agent reasoning framework with three specialized agents that, respectively, induce explicit rules, analyze potential conflicts, and resolve inconsistencies across all available contexts. Empirical results demonstrate that MACR significantly outperforms state-of-the-art baselines across benchmarks, while also providing interpretable resolutions of explicit conflicts.

2606.20333 2026-06-19 cs.AI 新提交

SoftSkill: Behavioral Compression for Contextual Adaptation

SoftSkill: 用于上下文适应的行为压缩

Xijia Tao, Yihua Teng, Xinyu Fu, Ziru Liu, Kecheng Chen, Yuzhi Zhao, Suiyun Zhang, Rui Liu, Lingpeng Kong

发表机构 * The University of Hong Kong（香港大学）； Huawei Research（华为研究院）； City University of Hong Kong（香港城市大学）； Huazhong University of Science and Technology（华中科技大学）

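AI summary: Proposes SoftSkill, which compresses natural-language skills into compact continuous vectors via trainable soft-skill prefixes, improving question-answering and math performance on a frozen base model while reducing the number of tokens.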

AI中文摘要

智能体技能通常以自然语言Markdown文件形式部署，编码回答策略、证据使用习惯和任务流程。这些文件可读且可移植，但间接消耗：对于每个任务实例，冻结的语言模型必须将长文本制品转换为生成时行为。本文探讨自然语言技能是否可以初始化一个紧凑的连续上下文对象，通过可训练的软增量进行优化，同时基模型保持冻结。我们提出SoftSkill，一种冻结骨干方法，通过下一词预测调整此类软技能，并在推理时将其部署为潜在行为先验。在我们的主要单轮设置中，在Qwen3.5-4B上使用长度为32的SoftSkill前缀，相比无技能提示在SearchQA上提升8.3分，LiveMath上提升42.1分，DocVQA上提升1.3分。相对于SkillOpt，SoftSkill在SearchQA上准确率提升5.2分，LiveMath上提升12.5分，同时用少量虚拟标记替换数百到数千个Markdown技能标记。我们进一步研究了作为更难边界情况的智能体执行，其中稀疏轨迹模仿提供了有用信号，但尚未稳健地压缩长程过程行为。更广泛地说，结果表明某些任务技能更适合被视为紧凑的潜在控制，而不是在推理时重新解释的额外Markdown，用于控制冻结模型如何进入任务。

英文摘要

Agent skills are commonly deployed as natural-language Markdown files that encode answer policies, evidence-use habits, and task procedures. These files are readable and portable, but they are consumed indirectly: for each task instance, a frozen language model must translate a long textual artifact into generation-time behavior. This paper asks whether a natural-language skill can instead initialize a compact continuous context object, refined by a trainable soft delta while the base model remains frozen. We propose SoftSkill, a frozen-backbone method that tunes such soft skills with next-token prediction and deploys them as latent behavioral priors at inference time. In our main single-round setting, a length-32 SoftSkill prefix on Qwen3.5-4B improves over no-skill prompting by 8.3 points on SearchQA, 42.1 points on LiveMath, and 1.3 points on DocVQA. Relative to SkillOpt, SoftSkill improves accuracy by 5.2 points on SearchQA and 12.5 points on LiveMath, while replacing hundreds to thousands of Markdown skill tokens with a few virtual tokens. We further study agentic execution as a harder boundary case, where sparse trajectory imitation provides useful signal but does not yet robustly compress long-horizon procedural behavior. More broadly, the results suggest that some task skills are better treated not as additional Markdown to be reinterpreted at inference time, but as compact latent controls over how a frozen model enters the task.


2606.20518 2026-06-19 cs.AI 新提交

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

FlowEdit: 流匹配TTS中终身发音适应的联想记忆

Harshit Singh, Ayush Pratap Singh, Nityanand Mathur

发表机构 * University Of Maryland（马里兰大学）； TU Darmstadt（达姆施塔特工业大学）； Smallest AI

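AI summary: To address the inability of deployed flow-matching TTS systems to correct mispronounced proper nouns, proposes the FlowEdit framework, which learns pronunciation corrections as latent conditioning edits rather than weight updates and stores and retrieves them with a Modern Hopfield Network, reducing Phoneme Error Rate by 92.7% on a benchmark of 312 multilingual proper nouns.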

AI中文摘要

流匹配文本到语音系统在零样本场景下表现出色，但部署后保持静态：除非重新训练模型，否则对词汇表外的专有名词的发音错误会持续存在。我们提出FlowEdit，一个用于冻结的流匹配TTS的终身适应框架，它将发音修正学习为潜在条件编辑而非权重更新。当提供纠正性反馈时，FlowEdit优化文本嵌入空间中的令牌级扰动，然后将修正存储在作为内容可寻址情景记忆的现代Hopfield网络中。在推理时，通过具有相似性门控的软注意力检索修正，实现模糊形态匹配。在我们整理的涵盖18个语系的312个多语言专有名词基准上，FlowEdit相对于零样本基线将目标词音素错误率降低了92.7%，同时保持相同的通用语音质量。修正过程在单个GPU上大约15秒完成。

英文摘要

Flow-matching text-to-speech systems achieve remarkable zero-shot quality but remain static after deployment: pronunciation errors on out-of-vocabulary proper nouns persist unless the model is retrained. We introduce FlowEdit, a life-long adaptation framework for frozen flow-matching TTS that learns pronunciation corrections as latent conditioning edits rather than weight updates. When corrective feedback is provided, FlowEdit optimizes a token-level perturbation in the text embedding space, then stores the correction in a Modern Hopfield Network serving as content-addressable episodic memory. At inference, corrections are retrieved via soft attention with a similarity gate, enabling fuzzy morphological matching. On our curated benchmark of 312 multilingual proper nouns across 18 language families, FlowEdit reduces target-word Phoneme Error Rate by 92.7% relative to the zero-shot baseline while maintaining identical general-speech quality. Corrections complete in approximately 15 seconds on a single GPU.
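The retrieval step described in this abstract (soft attention over stored corrections, guarded by a similarity gate) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `retrieve_correction`, the use of cosine similarity, and the values `beta=8.0` and `gate=0.8` are assumptions chosen only for illustration.

```python
import math


def retrieve_correction(query, keys, values, beta=8.0, gate=0.8):
    """Softmax-weighted lookup over stored corrections (hypothetical sketch).

    keys   : stored key vectors, one per corrected word
    values : stored correction vectors (e.g. embedding perturbations)
    beta   : inverse temperature of the softmax (assumed value)
    gate   : minimum cosine similarity required to apply any correction (assumed value)
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    sims = [cosine(query, k) for k in keys]
    # Similarity gate: if nothing in memory is close enough, apply no correction.
    if not sims or max(sims) < gate:
        return None

    # Numerically stable softmax over similarities (Hopfield-style soft attention).
    top = max(sims)
    weights = [math.exp(beta * (s - top)) for s in sims]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]
```

Because the lookup is soft, a query that is close to a stored key but not identical (for example, an inflected form of a corrected name) still retrieves mostly that key's correction, while a query far from every key is left untouched by the gate.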


2606.20532 2026-06-19 cs.AI 新提交

How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

指令如何塑造语音？面向风格描述文本到语音的交叉注意力归因

Nityanand Mathur, Hamees Sayed, Wasim Madha, Apoorv Singh, Sameer Khurana, Akshat Mandloi, Sudarshan Kamath

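AI summary: Proposes a cross-attention attribution method to analyze how individual words affect acoustic output in style-captioned text-to-speech systems, finding that style-token attention peaks in early steps and deep layers and correlates with F0 and energy.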

AI中文摘要

风格描述文本到语音系统使用自然语言控制语音特征，但单个单词如何影响声学输出仍不清楚。理解这一点对于诊断故障模式和提高表现性TTS的可控性至关重要。我们首次将DAAM框架适配到语音领域，为语音扩散模型提出交叉注意力归因，并将其应用于CapSpeech-TTS。我们的方法提取了25层和24个ODE步骤的逐词热力图。我们分析了3,600个（风格描述，文本转录）组合，包括120个风格描述条件生成30个文本转录，揭示了描述词如何塑造波形。结果表明：（1）风格标记的时间方差低于内容/功能标记，确认了全局条件作用；（2）风格注意力与基频和能量相关；（3）风格条件作用在早期步骤和深层达到峰值；（4）注意力熵在第17层达到最小值，与风格重要性峰值同时出现，表明在最关键风格阶段网络选择性最大。这是首次研究自然语言如何影响语音扩散模型中的交叉注意力。

英文摘要

Style-captioned text-to-speech systems use natural language to control voice characteristics, but how individual words influence acoustic output remains unclear. Understanding this is critical for diagnosing failure modes and improving controllability in expressive TTS. We propose cross-attention attribution for speech diffusion models, adapting the DAAM framework to the speech domain for the first time, and apply it to CapSpeech-TTS. Our method extracts per-token heatmaps across 25 layers and 24 ODE steps. We analyze 3,600 (style caption, text transcript) combinations comprising 120 style captions conditioning the generation of 30 text transcripts each, revealing how caption tokens shape waveforms. Results show: (1) style tokens have lower temporal variance than content/function tokens, confirming global conditioning; (2) style attention correlates with F0 and energy; (3) style conditioning peaks in early steps and deep layers; (4) attention entropy reaches its minimum at layer 17, co-occurring with the style importance peak, indicating maximal network selectivity at the most style-critical stage. This is the first study of how natural language influences cross-attention in speech diffusion models.
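Finding (4) rests on attention entropy, which can be computed per layer as the Shannon entropy of the attention mass over caption tokens. The sketch below assumes one vector of non-negative attention weights per layer; how the paper aggregates over heads, frames, and ODE steps is not stated in the abstract, and the function names are hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of attention weights over tokens."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def most_selective_layer(per_layer_weights):
    """Return (index of the minimum-entropy layer, list of per-layer entropies)."""
    entropies = [attention_entropy(w) for w in per_layer_weights]
    best = min(range(len(entropies)), key=entropies.__getitem__)
    return best, entropies
```

Low entropy means the layer concentrates attention on a few tokens, which is the sense in which the abstract speaks of "maximal network selectivity".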


2606.19935 2026-06-19 cs.AI 新提交

PhysDrift: Bridging the Embodiment Gap in Humanoid Co-Speech Motion Generation

PhysDrift: 弥合人形机器人共语动作生成中的具身差距

Zhangzhao Liang, Xiaofen Xing, Mingyue Yang, Wenlve Zhou, Xiangmin Xu

发表机构 * South China University of Technology（华南理工大学）； DexForce Technology（DexForce科技公司）； Foshan University（佛山大学）

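AI summary: To address the mismatch between human motion manifolds and robot embodiment constraints in humanoid co-speech motion generation, proposes the IK-EER framework and the PhysDrift model, which directly predicts executable joint trajectories and improves motion alignment, physical plausibility, and real-time interaction capability.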

AI中文摘要

人形机器人需要共语动作，这些动作不仅要富有表现力且与语音对齐，还要在具身约束下物理可执行。现有的共语动作生成流程主要是以人为中心的：首先以人体表示（如SMPL-X）生成动作，随后重定向到人形机器人。在这项工作中，我们识别出这种范式中的基本具身差距，即人体运动流形与人形机器人具身约束之间的不匹配在运动转移和物理执行过程中破坏了具身一致性。通过广泛分析，我们表明尽管重定向可以保留粗粒度的运动语义，但它显著压缩了运动多样性并削弱了韵律-动作同步，限制了富有表现力的人形机器人行为。为解决此问题，我们首先提出IK-EER，一种保留韵律的人形机器人运动策展框架，在重定向过程中联合优化运动学可行性和语音-运动时间对齐。基于策展的机器人原生运动数据集，我们进一步引入PhysDrift，一种具身感知的共语动作生成框架，直接预测可执行的人形机器人关节轨迹，无需依赖中间人体表示。与传统的以人为中心的流程不同，PhysDrift在训练和推理过程中都保持具身一致性，同时加入物理正则化以稳定机器人运动动态。大量实验和真实世界人形机器人部署表明，具身感知的机器人原生生成显著改善了语音-运动对齐、物理合理性、运动平滑性、推理效率和实时交互能力。

英文摘要

Humanoid robots require co-speech motions that are not only expressive and speech-aligned, but also physically executable under embodiment constraints. Existing co-speech generation pipelines are predominantly human-centric: motions are first generated in human-body representations such as SMPL-X and subsequently retargeted to humanoid robots. In this work, we identify a fundamental embodiment gap in this paradigm, where the mismatch between human motion manifolds and humanoid embodiment constraints disrupts embodiment consistency during motion transfer and physical execution. Through extensive analysis, we show that although retargeting can preserve coarse motion semantics, it significantly compresses motion diversity and weakens prosody-motion synchronization, limiting expressive humanoid behaviors. To address this problem, we first propose IK-EER, a prosody-preserving humanoid motion curation framework that jointly optimizes kinematic feasibility and speech-motion temporal alignment during retargeting. Building upon the curated robot-native motion dataset, we further introduce PhysDrift, an embodiment-aware co-speech motion generation framework that directly predicts executable humanoid joint trajectories from speech without relying on intermediate human-body representations. Unlike conventional human-centric pipelines, PhysDrift maintains embodiment consistency throughout both training and inference while incorporating physical regularization to stabilize robot motion dynamics. Extensive experiments and real-world humanoid deployment demonstrate that embodiment-aware robot-native generation substantially improves speech-motion alignment, physical plausibility, motion smoothness, inference efficiency, and real-time interaction capability.


2606.19948 2026-06-19 cs.AI 新提交

Advancing DialNav through Automatic Embodied Dialog Augmentation

通过自动具身对话增强推进DialNav

Leekyeung Han, Sangwon Jung, Hyunji Min, Jinseong Jeong, Minyoung Kim, Paul Hongsuck Seo

发表机构 * Korea University（高丽大学）； Trillion Labs

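AI summary: Proposes an automatic generation pipeline to build the large-scale RAINbow dataset (238K episodes) and, combined with Dual-Strategy Training and a localization model, substantially improves success rate on the DialNav task (Val Seen +89%, Val Unseen +100%).

Comments 29 pages, 9 figures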

AI中文摘要

对于能够进行物理交互的具身智能体，创建和理解对话的能力对于确保安全性和有效性至关重要。虽然DialNav为真实感室内导航中的对话-执行循环提供了整体评估框架，但其性能仍受限于训练数据的严重稀缺（2K episodes）。为解决这一问题，我们提出了一种自动生成管道，并构建了RAINbow数据集，这是一个包含238K episodes的大规模训练数据集，用于DialNav。我们的管道将现有的VLN数据集转换为多轮对话，并创建了成本高效且高质量的数据集。然后，我们引入了两项额外的互补性进展以充分释放数据潜力：（1）双策略训练，一种导航训练方案，用于使导航训练与动态对话-导航循环对齐；（2）一个利用VLN知识的定位模型。通过结合这些互补性解决方案，我们的模型在Val Seen（58.24，+89%）和Val Unseen（29.05，+100%）两个分割上的成功率均大幅超越基线，建立了新的最优水平。

英文摘要

For embodied agents capable of physical interaction, the capability to create and understand dialog is crucial to ensure both safety and effectiveness. While DialNav provides a framework for holistic evaluation of the dialog-execution loop in photorealistic indoor navigation, its performance remains limited by a critical scarcity of training data (2K episodes). To address this, we propose an automatic generation pipeline and construct the RAINbow dataset, a large-scale training dataset with 238K episodes for DialNav. Our pipeline converts existing VLN datasets into multi-turn dialog, yielding a cost-efficient, high-quality dataset. We then introduce two complementary advances to unlock the data's full potential: (1) Dual-Strategy Training, a navigation training scheme that aligns navigation training with the dynamic dialog-navigation loop, and (2) a localization model that leverages VLN knowledge. By combining these complementary solutions, our model substantially outperforms the baseline in success rate on both the Val Seen (58.24, +89%) and Val Unseen (29.05, +100%) splits, establishing a new state of the art.


2606.19980 2026-06-19 cs.AI 新提交

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

ENPIRE: 现实世界中智能体机器人策略的自我改进

Wenli Xiao, Jia Xie, Tonghe Zhang, Haotian Lin, Letian "Max" Fu, Haoru Xue, Jalen Lu, Yi Yang, Cunxi Dai, Zi Wang, Jimmy Wu, Guanzhi Wang, S. Shankar Sastry, Ken Goldberg, Linxi "Jim" Fan, Yuke Zhu, Guanya Shi

发表机构 * NVIDIA（英伟达）； CMU（卡内基梅隆大学）； UC Berkeley（加州大学伯克利分校）

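AI summary: Proposes the ENPIRE framework, which lets coding agents autonomously improve robot manipulation policies through a closed feedback loop of environment reset, policy execution, outcome verification, and iterative refinement, reaching a 99% success rate on dexterous manipulation tasks.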

AI中文摘要

在现实世界中实现灵巧的机器人操作严重依赖人工监督和算法工程，这成为追求通用物理智能的核心瓶颈。尽管新兴的编码智能体可以生成代码来自动化算法搜索，但其成功主要局限于数字环境。我们推测，自动化机器人研究缺失的抽象是一个可重复的反馈循环，用于现实世界策略改进：重置场景、执行策略、验证结果并优化下一次迭代。为弥补这一差距，我们引入ENPIRE，一个用于编码智能体的框架，通过四个核心模块实例化这一物理反馈例程：环境模块（EN）用于自动重置和验证，策略改进模块（PI）启动策略优化，推出模块（R）用于评估一个或多个并行运行的物理机器人的策略，以及进化模块（E），其中编码智能体分析日志、查阅文献、改进训练基础设施和算法代码以解决失败模式。这一闭环系统将现实世界操作学习转化为可控的优化过程，在最小化人工努力的同时，允许对训练方案和智能体变体进行公平消融。在ENPIRE的支持下，前沿编码智能体可以自主训练策略，在具有挑战性的灵巧操作任务（如整理针盒、紧固扎带和工具使用）上达到99%的成功率，并且当我们派遣智能体团队在机器人集群上工作时，这一过程会进一步加速。我们的结果展示了将编码智能体部署到物理世界中自主推进机器人技术的实用且可扩展的路径。

英文摘要

Achieving dexterous robotic manipulation in the real world heavily relies on human supervision and algorithm engineering, which becomes a central bottleneck in the pursuit of general physical intelligence. Although emerging coding agents can generate code to automate algorithm search, their successes remain largely confined to digital environments. We conjecture that the missing abstraction to automate robotics research is a repeatable feedback loop for real-world policy improvement: reset the scene, execute a policy, verify the outcome, and refine the next iteration. To bridge this gap, we introduce ENPIRE, a harness framework for coding agents that instantiates this physical feedback routine with four core modules: an Environment module (EN) for automatic reset and verification, a Policy Improvement module (PI) that launches policy refinement, a Rollout module (R) to evaluate policies with one or multiple physical robots operating in parallel, and an Evolution module (E) in which coding agents analyze logs, consult literature, and improve training infrastructure and algorithm code to address failure modes. This closed-loop system transforms real-world manipulation learning into a controllable optimization procedure, minimizing human effort while allowing fair ablations across training recipes and agent variants. Powered by ENPIRE, frontier coding agents can autonomously train a policy to achieve a 99% success rate on challenging, dexterous manipulation tasks, such as organizing a pin box, fastening a zip tie, and tool use, a process that further accelerates when we dispatch an agent team on a robot fleet. Our results suggest a practical and scalable path toward deploying coding agents to autonomously advance robotics in the physical world.


2606.19990 2026-06-19 cs.AI 新提交

Reward as An Agent for Embodied World Models

奖励作为具身世界模型的智能体

Pu Li, Zhigang Lin, Qiang Wu, Yongxuan Lv, Fei Wang, Shan You

发表机构 * ACE Robotics（ACE机器人）

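AI summary: Proposes the Reward as an Agent framework and Dynamic-Aware Rollout Diversification, using robust verification to support broader exploration, mitigate reward hacking, and improve world-model performance.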

AI中文摘要

虽然强化学习已成为改进世界模型的有前景工具，现有方法大多依赖于训练分布附近的保守 rollout，限制了探索、行为多样性和更丰富的动态发现。在这项工作中，我们挑战这种保守范式。我们认为核心限制不是探索本身，而是缺乏支持更广泛探索的可靠验证策略。没有可靠的验证，扩展的探索极易受到奖励黑客攻击，即策略利用不完美的奖励而未能实现真正的改进。为了评估这一动机，我们在具身世界模型中实例化我们的方法，其中物理合理性和任务完成性为复杂动态下的可扩展强化学习提供了严格的测试平台。在验证方面，我们引入奖励作为智能体，一种主动评估生成行为以提供鲁棒奖励信号并减轻分布偏移下奖励黑客攻击的智能体奖励框架。在探索方面，我们通过 DynDiff-GRPO 引入动态感知 rollout 多样化，显式扩展动作空间探索以多样化轨迹、拓宽状态-动作覆盖范围，并鼓励超越保守 rollout 机制的更丰富具身行为。通过将奖励作为智能体与 DynDiff-GRPO 统一，我们在更可靠的奖励基础上实现强化学习，并大幅多样化采样，有效缓解奖励黑客攻击，同时在多个开源世界模型上取得显著的精度提升，从而证明当基于鲁棒验证时，更广泛的探索可以成功扩展。

英文摘要

While RL has become a promising tool for refining world models, existing methods largely rely on conservative rollouts near the training distribution, limiting exploration, behavioral diversity, and richer dynamic discovery. In this work, we challenge this conservative paradigm. We argue that the core limitation is not exploration itself, but the lack of reliable verification strategies to support broader exploration. Without reliable verification, expanded exploration becomes highly susceptible to reward hacking, where policies exploit imperfect rewards without achieving genuine improvement. To evaluate this motivation, we instantiate our method in embodied world models, where physical plausibility and task completion provide a rigorous testbed for scalable RL under complex dynamics. On the verification side, we introduce Reward as an Agent, an agentic reward framework that actively evaluates generated behaviors to provide robust reward signals and mitigate reward hacking under distribution shifts. On the exploration side, we introduce Dynamic-Aware Rollout Diversification through DynDiff-GRPO, which explicitly expands action-space exploration to diversify trajectories, broaden state-action coverage, and encourage richer embodied behaviors beyond conservative rollout regimes. By unifying Reward as an Agent with DynDiff-GRPO, we enable RL on a more reliable reward foundation with substantially diversified sampling, effectively mitigating reward hacking while yielding significant accuracy gains across multiple open-source world models, thereby demonstrating that broader exploration can scale successfully when grounded in robust verification.


2606.20274 2026-06-19 cs.AI 新提交

Lagrange: An Open-Vocabulary, Energy-Based Sparse Framework for Generalized End-to-End Driving

Lagrange: 一种面向通用端到端驾驶的开放词汇、基于能量的稀疏框架

Shihao Ji, HongXi Li, Zihui Song, Mingyu Li

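AI summary: Proposes the Lagrange framework, which uses Masked Latent Fields and vision-language models for open-vocabulary, sparse computation and enforces kinematic constraints through Lagrangian action minimization; robustness and interpretability are evaluated on the nuScenes and CODA benchmarks.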

AI中文摘要

将端到端自动驾驶扩展到复杂的开放世界环境，需要能够泛化到异常场景的感知模型和能够产生运动学有效轨迹的规划器。现有范式在表示效率和泛化能力之间存在明显分歧。密集模型（如占用网络）虽然几何鲁棒，但存在关键计算瓶颈，且难以进行高层语义推理。相反，稀疏的基于查询的规划器效率高，但依赖于封闭集定义，使其容易受到分布外事件的影响。尽管最近的视觉-语言-动作模型提供了开放词汇推理，但其自回归离散令牌生成从根本上与车辆动力学的连续高频控制需求相冲突。为解决这一问题，我们提出了Lagrange，一种基于掩码潜在场的开放词汇、计算稀疏的驾驶框架。Lagrange不依赖密集体积重建或封闭集查询机制，而是利用视觉语言模型将类别无关的目标提议编码为连续语义视觉令牌。我们引入了一种意图驱动的掩码交叉注意力模块，该模块在时间上过滤不相关实体，并将注意力令牌解码为定义在空间坐标上的隐式连续能量场。通过将决策制定为跨越该能量场的拉格朗日动作最小化问题，我们在执行碰撞避免的同时强制遵守车辆运动学。在标准（nuScenes）和长尾（CODA）基准上的大量离线评估表明，Lagrange为鲁棒、可解释且运动学可行的开放世界自主性建立了一个有前景的框架。

英文摘要

Scaling end-to-end autonomous driving to complex, open-world environments requires perceptual models that generalize to anomalous scenarios and planners that produce kinematically valid trajectories. Existing paradigms face a distinct dichotomy between representational efficiency and generalization capacity. Dense models (e.g., occupancy networks), while geometrically robust, incur critical computational bottlenecks and struggle with high-level semantic reasoning. Conversely, sparse, query-based planners are efficient but reliant on closed-set definitions, rendering them vulnerable to out-of-distribution (OOD) events. Although recent Vision-Language-Action (VLA) models offer open-vocabulary reasoning, their autoregressive, discrete token generation fundamentally conflicts with the continuous, high-frequency control requirements of vehicle dynamics. To address this, we propose Lagrange, an open-vocabulary, computationally sparse driving framework based on Masked Latent Fields (MLF). Rather than relying on dense volumetric reconstructions or closed-set query mechanisms, Lagrange exploits Vision-Language Models (VLMs) to encode class-agnostic object proposals into continuous semantic visual tokens. We introduce an intent-driven masked cross-attention module that temporally filters irrelevant entities, decoding the attended tokens into an implicit continuous energy field defined over spatial coordinates. By framing decision-making as a Lagrangian action minimization problem spanning this energy field, we enforce strict compliance with vehicle kinematics while executing collision avoidance. Extensive offline evaluations on both standard (nuScenes) and long-tail (CODA) benchmarks demonstrate that Lagrange establishes a promising framework for robust, interpretable, and kinematically feasible open-world autonomy.


2606.19464 2026-06-19 cs.AI cs.MA 新提交

Deontic Policies for Runtime Governance of Agentic AI Systems

面向自主AI系统运行时治理的道义策略

Anupam Joshi, Tim Finin, Karuna Pande Joshi, Lalana Kagal

发表机构 * CSEE Department UMBC Baltimore, MD, USA ； Center for AI UMBC Baltimore, MD, USA ； Information Systems Department UMBC Baltimore, MD, USA ； CSAIL MIT Cambridge, MA, USA

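AI summary: To address security, privacy, and compliance governance challenges in LLM-driven autonomous AI systems, proposes the AgenticRei framework, which uses a Rei-based deontic policy language (expressed in OWL) and a logic engine to enforce governance constraints such as obligations, dispensations, and conflict resolution at runtime, and is compatible with standards such as A2AS.

Comments 10 pages, 1 figure. To be published in the 2026 IEEE Symposium on Agentic Services which is part of the IEEE Conference on Web Services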

AI中文摘要

由大语言模型驱动的自主AI系统引入了一类新的安全、隐私和合规挑战：能够调用工具、操作数据、安装软件并与跨组织边界对等代理协调的代理，不仅必须通过身份验证和访问控制来约束，还必须通过企业治理的完整结构来约束。这包括指定代理被允许和禁止做什么，它们在特定操作后必须做什么（例如，通知CISO），在什么条件下可以免除一项持续义务，以及当策略冲突时哪些规则优先。这个治理问题超出了当前策略引擎的能力范围。诸如XACML、Rego和Cedar等系统仅处理此治理结构的允许/禁止子集。它们不提供义务生命周期管理、元策略冲突解决、在特定情况下免除义务的豁免，以及通常在医疗、网络安全或数据隐私等应用中发现的领域类层次结构的本体推理。我们提出了AgenticRei，它实现了关键的治理需求，如义务、豁免、策略冲突解决和策略推理，以及基本的允许/禁止约束。我们使用基于Rei框架的道义策略语言，表示为OWL（Web本体语言），并由完全在LLM外部的高性能逻辑引擎在运行时评估。同一管道同时管理代理的工具调用和代理间消息。我们通过示例表明，道义策略捕获了当前生产引擎大多无法表达的安全和隐私治理约束。我们的方法自然地与A2AS等行业标准框架兼容。

英文摘要

Autonomous agentic AI systems driven by Large Language Models (LLMs) introduce a new class of security, privacy, and compliance challenges: an agent that can invoke tools, manipulate data, install software, and coordinate with peer agents across organizational boundaries must be constrained not just by authentication and access control, but by the full structure of enterprise governance. This includes specifying what agents are permitted and prohibited from doing, what they are obliged to do after certain actions (e.g., notify the CISO), under what conditions a standing obligation may be waived, and which rules take precedence when policies conflict. This governance problem exceeds what current policy engines provide. Systems such as XACML, Rego, and Cedar address only the permit/prohibit subset of this governance structure. They do not provide obligation lifecycle management, meta-policy conflict resolution, dispensations that waive obligations in specific circumstances, or ontological reasoning over domain class hierarchies commonly found in applications such as healthcare, cybersecurity, or data privacy. We propose AgenticRei, which realizes key governance requirements such as obligations, dispensations, policy conflict resolutions, and reasoning over policies, as well as the basic permit/prohibit constraints. We use a deontic policy language built on the Rei framework, expressed as OWL (Web Ontology Language) and evaluated at runtime by a high-performance logic engine entirely outside the LLM. The same pipeline governs both tool invocations by the agent and agent-to-agent messages. We show through examples that deontic policies capture governance constraints around security and privacy that mostly cannot be expressed in current production engines. Our approach composes naturally with industry-standard frameworks like A2AS.
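To make the four deontic notions in this abstract concrete (permission, prohibition, obligation, dispensation), here is a toy evaluator. It is only a sketch under stated assumptions: it does not use Rei, OWL, or any logic engine, the `Policy` class and `decide` function are hypothetical, and the single meta-policy "a prohibition overrides a permission" is an assumption rather than the paper's conflict-resolution scheme.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical in-memory policy; real deontic policies would be OWL documents."""
    permitted: set = field(default_factory=set)
    prohibited: set = field(default_factory=set)
    obligations: dict = field(default_factory=dict)    # action -> duty triggered by it
    dispensations: set = field(default_factory=set)    # (duty, context) pairs that waive a duty


def decide(policy, action, context):
    """Return (allowed, duties) for one requested action in a given context."""
    # Assumed meta-policy: prohibition wins over permission; unlisted actions are denied.
    if action in policy.prohibited or action not in policy.permitted:
        return False, []
    duty = policy.obligations.get(action)
    # A dispensation waives the standing obligation in this specific context.
    if duty is None or (duty, context) in policy.dispensations:
        return True, []
    return True, [duty]
```

For example, a policy that permits `install_software`, attaches the duty `notify_ciso` to it, and grants a dispensation for the `sandbox` context would return the duty in production but waive it in the sandbox; all of these names are illustrative.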


2606.19509 2026-06-19 cs.AI 新提交

LLM Doesn't Know What It Doesn't Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data

LLM 不知道它不知道什么：通过跨模型归因分歧检测临床表格数据上的认知盲点

Akshat Dasula, Prasanna Desikan, Jaideep Srivastava

发表机构 * Centific AI Research（Centific AI研究）

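AI summary: Studies the epistemic uncertainty of large language models on structured clinical data through cross-model attribution divergence analysis, finding that verbalized confidence is vacuous and that an inverse difficulty effect exists, and proposes an attribution-divergence-based calibration method that improves accuracy and reduces calibration error without training.

Comments Accepted at EIML@ICML 2026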

AI中文摘要

大语言模型（LLM）越来越多地应用于结构化临床数据，但它们在处理此类任务时能否认识到自身知识的局限性仍未得到探索。我们通过跨模型归因分歧的视角研究这一问题，旨在减少结构化任务的认知不确定性，通过归因分歧分析比较 Qwen 2.5 7B 和 XGBoost 在预测任务上的表现。我们报告了四个发现。首先，LLM 口头表达的置信度在认知上是空洞的，无论准确率是 49% 还是 75.3%，它输出接近常数（0.856-0.937），追踪的是提示格式而非预测质量。其次，LLM 表现出逆难度效应：当 XGBoost 以 99% 正确时，LLM 准确率降至 64.8%，但在 XGBoost 中等不确定时，LLM 与其匹配（73.8% 对 73.1%）。第三，少样本示例和 SHAP 导出的特征证据是正交的、超加性的干预措施：它们将归因分歧分数（ADS）从 1.54 降至 0.38，并在无需训练的情况下将准确率从 49% 提升至 75.3%。第四，一种利用归因分歧信号确定 LLM 可靠性的跨模型校准器，将期望校准误差从 0.254 降至 0.080，用患者特定的可靠性估计替代了无信息量的口头置信度，无需访问模型内部或重复推理。我们将这些发现视为 LLM 在结构化数据上的冷启动问题，并勾勒出通向真正认知自我意识的路径。

英文摘要

Large language models (LLMs) are increasingly applied to structured clinical data, yet whether they can recognize the limits of their own knowledge on such tasks remains unexplored. We study this question through the lens of cross-model attribution divergence with the goal of reducing epistemic uncertainty for structured tasks, comparing Qwen 2.5 7B and XGBoost on a prediction task via attribution divergence analysis. We report four findings. First, LLM verbalized confidence is epistemically vacuous: it outputs a near-constant value (0.856-0.937) regardless of whether accuracy is 49% or 75.3%, tracking prompt format rather than prediction quality. Second, the LLM exhibits an inverse difficulty effect: accuracy drops to 64.8% when XGBoost is 99% correct, but matches XGBoost (73.8% vs. 73.1%) when it is moderately uncertain. Third, few-shot examples and SHAP-derived feature evidence are orthogonal, super-additive interventions: they reduce the Attribution Disagreement Score (ADS) from 1.54 to 0.38 and improve accuracy from 49% to 75.3% without training. Fourth, a cross-model calibrator that determines LLM reliability using attribution divergence signals reduces expected calibration error from 0.254 to 0.080, replacing uninformative verbalized confidence with patient-specific reliability estimates, without accessing model internals or requiring repeated inference. We frame these findings as a cold start problem for LLMs on structured data and outline a path toward genuine epistemic self-awareness.
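The fourth finding is reported as a drop in expected calibration error (ECE). For readers unfamiliar with the metric, the standard equal-width-bin definition can be sketched as follows; the choice of 10 bins is an assumption, since the abstract does not state the binning used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences : predicted confidence per example, each in [0, 1]
    correct     : 1 if the prediction was right, else 0
    n_bins      : number of bins (assumed; not given in the abstract)
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's |confidence - accuracy| gap by its share of examples.
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece
```

A model that always reports roughly 0.9 confidence while being right only half the time lands in a single bin with a gap of about 0.4, which is the kind of near-constant, uninformative confidence the first finding describes.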


2606.19527 2026-06-19 cs.AI 新提交

Emergent Alignment

涌现对齐

Martin Kolář

发表机构 * CIIRC, Czech Technical University in Prague（捷克理工大学CIIRC）

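AI summary: Proposes an online alignment technique that introduces a conscience step and a Direct Preference Optimization-based alignment loss so that large language models self-correct unethical outputs during training, fine-tuning, adversarial prompting, and zero-shot learning.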

Comments Rejected from ICML 2026

2606.19588 2026-06-19 cs.AI cs.CR cs.LO 新提交

Analyzing the Narration Gap in LLM-Solver Loops

分析大语言模型-求解器循环中的叙述差距

Zunchen Huang, Songgaojun Deng

发表机构 * Eindhoven University of Technology（埃因霍温理工大学）

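AI summary: Studies the security vulnerability in the narration step that turns solver output into the user-facing answer in hybrid reasoning with LLMs and SAT/SMT solvers; formal modeling and experimental evaluation show that certificate gating keeps the solver verdict sound, but adversarial attacks can still invert the conclusion.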

AI中文摘要

诸如SAT和SMT求解器之类的形式化工具，当安全或安保关键问题可以用逻辑表述时，越来越多地被嵌入到语言模型推理流程中。与思维链不同（其步骤从模型分布中采样，没有形式化保证），求解器产生可靠且可独立验证的答案。然而，这种可靠性保证可能在求解器与模型之间的交互中丢失。混合流程包含三个组成部分：形式化问题、求解问题以及叙述结果。先前的工作研究了形式化和求解，但未涉及叙述——即将形式化工具的输出转化为用户答案的步骤。为了填补叙述差距，我们首先将LLM-求解器循环建模为经过验证的决策过程。我们进一步在提示注入下评估了五个开源模型，发现证书门控使求解器判定可靠，而攻击者可以通过不同措辞和渠道反转已验证的结论。我们研究了通过强化提示进行缓解的方法，该方法显著减少了注入但无法完全消除，并且在自适应攻击下仍然存在问题。结合形式化分析和实证研究，我们表明在LLM-求解器循环中，鲁棒性无法延伸到用户最终读取的答案。

英文摘要

Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety- or security-critical question can be formulated in logic. Unlike chain of thought, whose steps are sampled from the model distribution without formal guarantees, a solver produces a sound and independently verifiable answer. However, the soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied formalization and decision, but not narration, which is the step that turns a formal tool's output into the user answer. To fill the narration gap, we first model the LLM-solver loop as a verified decision procedure. We further evaluate five open-source models under prompt injection, and we find that certificate gating makes the solver verdict sound, while an adversary can invert a verified conclusion across phrasings and channels. We study mitigation through a hardened prompt, which reduces injection significantly but cannot eliminate it and still suffers under adaptive attack. Combining the formal analysis and empirical studies, we show that, in the LLM-solver loop, robustness does not reach the answer that the user finally reads.
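Certificate gating, as mentioned in this abstract, means accepting a solver verdict only when an independently checkable witness accompanies it. The sketch below covers the satisfiable case only, where the witness is a variable assignment checked against the CNF clauses; the function names and the verdict strings are hypothetical, and checking an unsatisfiability proof would need a separate proof checker that is not shown here.

```python
def check_sat_certificate(clauses, assignment):
    """Check a claimed model against a CNF formula.

    clauses    : list of clauses, each a list of non-zero ints (DIMACS-style literals)
    assignment : dict mapping variable number -> bool
    """
    for clause in clauses:
        # A literal is satisfied when its variable is assigned the literal's polarity.
        # Unassigned variables satisfy nothing (dict.get returns None).
        if not any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
            return False
    return True


def gated_verdict(clauses, solver_says_sat, assignment=None):
    """Pass a SAT verdict downstream only if its certificate checks out."""
    if solver_says_sat and assignment is not None and check_sat_certificate(clauses, assignment):
        return "SAT (certificate checked)"
    return "UNVERIFIED"
```

Note that this gate protects only the verdict handed to the language model. The abstract's point is that the later narration step, where the model phrases the answer for the user, remains open to prompt injection.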


2606.19735 2026-06-19 cs.AI cs.CV 新提交

GLARE: A Natural Language Interface for Querying Global Explanations

GLARE: 用于查询全局解释的自然语言接口

Bhavan Vasu, Rajesh Mangannavar

发表机构 * Oregon State University（俄勒冈州立大学）

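AI summary: Proposes GLARE, an LLM-based interactive interface that translates natural-language questions into SQL queries that aggregate local explanation data, improving the accessibility and usability of global explanations.

Comments 16 pages, 2 figures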

AI中文摘要

虽然全局解释对于理解跨数据集、类别和决策上下文的视觉模型至关重要，但其复杂和单一的性质常常阻碍实际探索。由于用户通常寻求针对特定问题的目标答案，而不是静态产物，我们提出了一种基于LLM的交互接口，提供对黑盒图像分类器全局解释的自然语言访问。系统的核心LLM充当调解者，将自然语言问题转换为对局部解释数据的结构化SQL查询。这使得灵活聚合成为可能，而无需向用户暴露低级表示。对于每个查询，接口输出统计增强的自然语言响应，支持局部解释和意图对齐的可视化。我们在意图解释、查询映射准确性、对新查询和数据集的泛化能力以及对语言错误的鲁棒性方面评估了该系统。我们的结果表明，LLM中介的查询显著提高了以人为中心的XAI中全局解释的可访问性和可用性。

英文摘要

While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system's core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.
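The idea of answering a global question by aggregating local explanation rows with SQL can be illustrated with Python's built-in `sqlite3` module. Everything specific here is an assumption made for the sketch: the table `local_explanations`, its columns, the demo rows, and the fixed query standing in for what an LLM mediator might emit for a question like "Which concepts matter most for the zebra class?".

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of per-image explanation rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE local_explanations ("
        "image_id INTEGER, predicted_class TEXT, concept TEXT, attribution REAL)"
    )
    rows = [
        (1, "zebra", "stripes", 0.8),
        (1, "zebra", "grass", 0.1),
        (2, "zebra", "stripes", 0.6),
        (3, "horse", "mane", 0.7),
    ]
    conn.executemany("INSERT INTO local_explanations VALUES (?, ?, ?, ?)", rows)
    return conn


# Stand-in for LLM-generated SQL; the class name is bound as a parameter,
# never spliced into the query string.
QUERY = """
SELECT concept, AVG(attribution) AS mean_attr, COUNT(*) AS n
FROM local_explanations
WHERE predicted_class = ?
GROUP BY concept
ORDER BY mean_attr DESC
"""


def top_concepts(conn, predicted_class):
    """Aggregate local attributions into a per-class, global summary."""
    return conn.execute(QUERY, (predicted_class,)).fetchall()
```

The aggregate (mean attribution and support count per concept) is the kind of statistic that a natural-language response could then be built around.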


2606.19812 2026-06-19 cs.AI cs.LG 新提交

Human-on-the-Loop Orchestration for AI-Assisted Legal Discovery

AI辅助法律发现中的人机协同编排

Anushree Sinha, Srivaths Ranganathan, Abhishek Dharmaratnakar, Debanshu Das

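AI summary: To address legal risk from multi-step reasoning errors by AI agents in e-discovery, proposes a four-layer verification architecture in which Human-on-the-Loop escalation thresholds reduce privilege-waiver risk by up to 61%.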

AI中文摘要

自主大语言模型（LLM）代理越来越多地部署于电子发现（e-discovery），其中跨多步推理链的复合错误可能构成法律渎职。与单轮检索不同，在特权文档语料库上运行的代理工作流表现出我们称之为“轨迹崩溃”的一类失败：早期错误分类无声传播，导致整个特权审查失效。本文做出三项贡献。首先，我们提出一个按功能阶段组织的法律信息检索中代理失败的结构化分类法。其次，我们引入一个四层验证架构——涵盖规划、推理、执行和不确定性量化——旨在这些失败复合之前拦截它们。第三，我们在一个合成电子取证语料库上进行初步模拟研究，展示强制性人机协同（HOTL）升级阈值如何相对于完全自主基线降低特权豁免风险。我们的结果表明，与完全自主部署相比，校准的不确定性阈值可将特权豁免风险降低高达61%，同时将不到四分之一的文档路由给律师审查。

英文摘要

Autonomous Large Language Model (LLM) agents are increasingly deployed in electronic discovery (e-discovery), where compounding errors across multi-step reasoning chains can constitute legal malpractice. Unlike single-turn retrieval, agentic workflows operating over privileged document corpora exhibit a class of failure we term "trajectory collapse": an early misclassification silently propagates, rendering an entire privilege review invalid. This paper makes three contributions. First, we propose a structured taxonomy of agentic failures in legal information retrieval, organized by functional stage. Second, we introduce a four-layer verification architecture -- spanning planning, reasoning, execution, and uncertainty quantification -- designed to intercept these failures before they compound. Third, we present a preliminary simulation study on a synthetic e-discovery corpus that demonstrates how mandatory Human-on-the-Loop (HOTL) escalation thresholds reduce privilege-waiver risk relative to fully autonomous baselines. Our results suggest that calibrated uncertainty thresholds can reduce privilege-waiver risk by up to 61% versus fully autonomous deployment, while routing fewer than one quarter of documents to attorney review.


2606.20508 2026-06-19 cs.AI cs.LG 新提交

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

安全对齐的LLM从混合顺从演示中学到了什么？

Sihui Dai, Mann Patel

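AI summary: Mixes benign and harmful compliance demonstrations to study how demonstration composition drives harmful compliance, finding that demonstration content, ordering, and training methodology shape what models extract.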

AI中文摘要

先前工作表明，上下文演示可以越狱语言模型，但模型如何解释不同类型的顺从演示仍不清楚。我们通过混合良性顺从演示（无害请求，有帮助响应）与有害顺从演示（有害请求，有帮助响应）并测试关于演示组成如何驱动有害顺从的三个假设来研究这一点。在四个模型中，我们发现良性和有害演示不可互换：良性演示根据模型不同可以减少或增加有害顺从。我们进一步表明，偏好优化是防止良性演示增加有害顺从的关键训练阶段，演示顺序表现出强烈的近因偏差，并且模型在拒绝与上下文学习的交互方式上有所不同：一些模型在拒绝时也采用演示的格式，而其他模型在拒绝时覆盖所有上下文信号。综合来看，这项工作超越了展示基于演示的越狱有效，而是描述了其工作原理：模型从顺从演示中提取的内容取决于演示内容、顺序和训练方法。

英文摘要

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrations are not interchangeable: benign demonstrations can either reduce or increase harmful compliance depending on the model. We further show that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance, that demonstration ordering exhibits strong recency bias, and that models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal. Taken together, this work moves beyond showing that demonstration-based jailbreaking works to characterizing how it works: what models extract from compliance demonstrations depends on demonstration content, ordering, and training methodology.


2606.19469 2026-06-19 cs.AI cs.SE 新提交

Measuring Curriculum Alignment across Topical Coverage, Competency, and Cognitive Depth: A Longitudinal Framework Applied to CS2013 and CS2023

衡量课程在主题覆盖、能力与认知深度上的一致性：应用于CS2013和CS2023的纵向框架

Sherzod Turaev, Mary John, Saja Aldabet, Mamoun Awad, Nazar Zaki, Khaled Shuaib

发表机构 * United Arab Emirates University（阿联酋大学）； Abu Dhabi Polytechnic（阿布扎比理工学院）

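AI summary: Proposes a human-in-the-loop pipeline that uses semantic retrieval plus human confirmation to measure, longitudinally, a computer science program's coverage of the CS2013 and CS2023 guidelines, finding that coverage is stable while the newer guideline demands greater cognitive depth.

Comments 24 pages, 5 figures, 8 tables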

AI中文摘要

本科计算机科学教育受约每十年修订一次的国际课程指南指导，但各项目缺乏可靠且可重复的方法来衡量其对当前指南的覆盖程度，以及当指南重组时覆盖情况如何变化。我们通过一个人机协同流程解决此问题，该流程衡量项目对外部知识体系的覆盖情况，并纵向应用于一个经认证的计算机科学学士学位项目，对照计算机科学课程2013（CS2013）和2023（CS2023）。该流程将项目和每个指南表示为结构化语料库，通过语义检索生成候选课程-知识单元匹配，并在明确的覆盖定义下通过人工判断确认。在七个基准检索器中，倒数秩融合集成最强，而知名长上下文模型表现不如小型句子模型，因此必须衡量检索器的选择。两个映射由独立第二评分者验证（CS2023的Cohen's kappa为0.64，CS2013为0.69）。该项目覆盖CS2023的49.7%和CS2013的50.9%的知识单元，十年间几乎恒定。将相同的检索-确认设计扩展到能力表述和认知深度，显示项目在每个指南下对约88%的覆盖单元表述了能力，但在CS2023下对76%的现有单元以推荐深度交付，而在CS2013下为95%，这一差距反映了新指南提高了期望，而非项目本身。纵向比较将持久的结构性差距（并行与分布式计算、编程语言基础、系统基础）——这些差距在两种指南和ABET下均未覆盖——与反映标准演变的差异区分开来。该工具可重用，并可向作者索取。

英文摘要

Undergraduate computer science is governed by international curricular guidelines revised about once a decade, yet programs lack a reliable, reproducible way to measure how completely they cover the current guidelines and how that coverage shifts when the guidelines are restructured. We address this with a human-in-the-loop pipeline that measures a program's coverage of an external body of knowledge, applied longitudinally to one accredited BSc in Computer Science against Computer Science Curricula 2013 (CS2013) and 2023 (CS2023). The pipeline represents the program and each guideline as structured corpora, generates candidate course-to-knowledge-unit matches by semantic retrieval, and confirms them through human judgment under an explicit coverage definition. Of seven benchmarked retrievers, a reciprocal-rank-fusion ensemble was strongest, and a reputed long-context model underperformed a small sentence model, so retriever choice must be measured. Both maps were validated by an independent second rater (Cohen's kappa 0.64 for CS2023, 0.69 for CS2013). The program covers 49.7% of CS2023 and 50.9% of CS2013 knowledge units, near-constant across a decade. Extending the same retrieve-then-confirm design to competency articulation and cognitive depth shows that the program articulates the competency for ~88% of covered units under each guideline, yet delivers it at the recommended depth for 76% of present units under CS2023 against 95% under CS2013, a gap reflecting the newer guideline's raised expectations, not the program. The longitudinal comparison separates persistent structural gaps (parallel and distributed computing, foundations of programming languages, systems fundamentals), uncovered against both guidelines and ABET, from differences that reflect the standard's evolution. The instrument is reusable and available from the authors on request.
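The strongest retriever in this abstract is a reciprocal-rank-fusion ensemble. As a reminder of how that fusion works, each item receives a score of 1/(k + rank) from every ranking it appears in, and the scores are summed. The sketch below uses k = 60, a commonly used default; the paper's actual constant is not given in the abstract, and the knowledge-unit labels in the usage note are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists into one ordering.

    rankings : list of ranked lists, best item first
    k        : smoothing constant (60 is a common default; assumed here)
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

With three retrievers ranking the hypothetical candidates as `["KU-A", "KU-B", "KU-C"]`, `["KU-B", "KU-A", "KU-C"]`, and `["KU-B", "KU-C", "KU-A"]`, the fused order puts `KU-B` first because it is ranked first twice. In the pipeline described above, such a fused list would only propose candidates, which a human then confirms.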

2606.19704 2026-06-19 cs.AI 新提交

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

超越静态排行榜：LLM智能体评估的预测有效性

Dhaval C. Patel, Kaoutar El Maghraoui, Shuxin Lin, Yusheng Li, Tianjun Feng, Chun-Yi Tsai, Yihan Sun, Wei Alexander Xin, Akshat Bhandari, Tanisha Rathod, Aaron Fan, Sanskruti Vijay Shejwal, Tomas Pasiecznik, Sagar Chethan Kumar, Tanmay Agarwal, Rohith Kanathur, Sam Colman, Amaan Sheikh, Dev Bahl, Ann Li, Krish Veera, Alimurtaza Mustafa Merchant, Shambhawi Baswaraj Bhure, Sajal Kumar Goyla, Chengrui Li, Kirthana Natarajan, Rui Li, Thomas Ajai, Rujing Li, Vivek G. Iyer, Sanjaii Vijayakumar, Yitong Bai, Ayal Yakobe, Darief Maes, Yassine Jebbouri, Tianyang Xu, Thai Quoc On, Vera Mazeeva, Winston Li, Yuval Shemla, Yeshitha Bhuvanesh, Rushin Bhatt, Siddharth Chethan Gowda, Alisha Vinod, Caroline Cahill, Shriya Aishani Rachakonda, Yunfeng Chen, Aryaman Agrawal, Aman Upganlawar, Mao Le Jonathan Ang, Yubin Sally Go, Madhav Rajkondawar, Yang-Jung Chen, Trisha Maturi, Ananya Kapoor, Andrew Li, Shrey Arora, Mana Abbaszadeh, Shen Li, Charles Xu, Byeolah Kwon

AI总结本文汇总14项并行研究，论证由聚合分数得出的排行榜排名无法迁移到分布外场景，提出按预测有效性对配置进行排名，并给出三条带明确阈值的可证伪分布外标准。

Comments 17 pages, 2 tables, 5 figures

详情

AI中文摘要

智能体基准测试发展迅速，但没有任何单一基准能覆盖部署所暴露的维度中的四五个以上。本文汇总了迄今为止针对一个基于MCP的工业智能体基准的最大规模协同深度研究：14项并行实现研究，涵盖新的资产类别（包括多模态视觉扩展）、替代编排方式、检索策略、推理模式、基础设施优化以及对评估方法的探查。将这些研究与七个先前的智能体基准测试合并考察后，我们认为聚合分数排行榜对已部署智能体的评估存在系统性的欠规范。由聚合分数得出的排名无法迁移到分布外设置；近期针对“公开集到隐藏集”竞赛的回顾为这种排名不稳定性提供了直接的经验证据。我们提出按预测有效性（即样本内排名与样本外排名之间的相关性）而非样本内均值对各配置进行排名，并报告了一个十二层测量装置，它揭示了被HELM及其智能体时代的后继者压缩掉的部署相关维度。该立场通过三条带有明确阈值的可证伪分布外标准得以操作化；现有证据部分支持这一立场，但过于薄弱，尚不足以确认。最后，我们提出了一个预注册的试点设计，以及关于下一代智能体基准测试应报告哪些内容的领域级愿景。

英文摘要

Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep-dive of one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, the correlation between in-sample and out-of-sample rank, rather than in-sample mean, and report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.
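The abstract defines predictive validity as the correlation between in-sample and out-of-sample rank. One way to read that definition is as a Spearman rank correlation over agent configurations; the sketch below takes that reading. The choice of Spearman, the function names, and the scores are illustrative assumptions, not values or code from the paper.

```python
def ranks(scores):
    """Rank scores from best (1) to worst, giving tied scores their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def predictive_validity(in_sample, out_of_sample):
    """Spearman correlation: Pearson correlation computed on the two rank vectors."""
    return pearson(ranks(in_sample), ranks(out_of_sample))


# Hypothetical mean scores for four agent configurations.
in_sample = [0.82, 0.79, 0.75, 0.60]
out_of_sample = [0.55, 0.61, 0.58, 0.40]
print(round(predictive_validity(in_sample, out_of_sample), 3))
```

Here the in-sample leader drops to third place out of sample, so the leaderboard order only weakly predicts the deployment order even though the worst configuration stays last.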

2606.19749 2026-06-19 cs.AI cs.CL 新提交

Benchmarking Agentic Review Systems

基准测试智能审稿系统

Dang Nguyen, Wanqing Hao, Yanai Elazar, Chenhao Tan

发表机构 * University of Chicago（芝加哥大学）； Bar-Ilan University（巴伊兰大学）

AI总结针对AI辅助研究给同行评审带来的压力，新兴智能审稿系统涌现，但缺乏评估标准。本文评估了多种系统，发现最佳配置（OpenAIReview + GPT-5.5）在成对准确性上达83.0%，能捕获71.6%注入错误，且用户反馈正面。

Comments 11 pages, 7 tables, 4 figures

详情

AI中文摘要

一类新的智能审稿系统正在兴起，以缓解AI辅助研究给同行评审系统带来的压力，但如何评估它们尚不明确。我们在六个涵盖前沿模型和高效模型的LLM上评估了两个开源系统（OpenAIReview和coarse）、一个专有系统（Reviewer3）以及一个零样本基线。首先，我们研究针对ICLR/NeurIPS论文的AI评审是否与论文质量（以引用和录用决定等外部信号近似）相符。每个系统的成对准确率均高于随机水平，最佳为OpenAIReview + GPT-5.5，达到83.0%。其次，为测试系统能否发现具有已知真值的错误，我们构建了一个扰动基准，向八个arXiv学科类别的论文中注入四类错误，并测量检测召回率。最强配置（OpenAIReview + GPT-5.5）发现了71.6%的注入错误，仍有很大改进空间。六个模型检测结果的并集达到83.3%的召回率，表明不同模型检测到不同的错误，更好的框架（harness）设计有望提高性能。除这些基准外，我们还研究了OpenAIReview面向真实用户的公开部署。用户对其评论的投票偏向正面，比例为1.44:1，最常见的抱怨是误报和琐碎挑剔。总之，通过在真实研究论文上评估由最先进模型支持的完整审稿系统，我们表明虽然AI评审仍有改进空间，但它们已经能够较好地跟踪人类的质量判断、发现重要错误，并获得真实用户的正面反馈。

英文摘要

A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models. First, we study whether AI reviews of ICLR/NeurIPS papers track paper quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall. The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting that different models detect different errors and that better harness design could increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.
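The two headline metrics here, pairwise accuracy and union recall, can be sketched as follows. This is an illustrative reading of the abstract rather than the authors' evaluation code: the function names, the paper and error IDs, and the choices to skip pairs with equal quality and to count tied review scores as incorrect are all assumptions.

```python
from itertools import combinations


def pairwise_accuracy(review_scores, quality):
    """Fraction of paper pairs that the reviewer orders the same way as the quality signal.

    Pairs with equal quality are skipped; tied review scores count as incorrect.
    """
    correct = total = 0
    for a, b in combinations(review_scores, 2):
        if quality[a] == quality[b]:
            continue
        total += 1
        if (review_scores[a] - review_scores[b]) * (quality[a] - quality[b]) > 0:
            correct += 1
    return correct / total if total else float("nan")


def recall(detected, injected):
    """Share of injected errors that appear in the detected set."""
    return len(detected & injected) / len(injected)


def union_recall(detections_by_model, injected):
    """Recall of the pooled detections of all models."""
    pooled = set().union(*detections_by_model.values())
    return recall(pooled, injected)


# Hypothetical review scores and a binary quality signal (1 = accepted).
review_scores = {"p1": 7, "p2": 5, "p3": 6, "p4": 3}
quality = {"p1": 1, "p2": 0, "p3": 0, "p4": 1}
print(pairwise_accuracy(review_scores, quality))

# Hypothetical injected errors and per-model detections.
injected = {"e1", "e2", "e3", "e4"}
detections = {"model_a": {"e1", "e2"}, "model_b": {"e2", "e3"}}
print(union_recall(detections, injected))
```

In the toy data each model alone recalls half of the injected errors while the union recalls three quarters, which mirrors the abstract's point that different models catch different errors.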

2606.19787 2026-06-19 cs.AI 新提交

ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?

ORAgentBench：LLM代理能否端到端地解决具有挑战性的运筹学任务？

Jiajun Li, Mingshu Cai, Yixuan Li, Yu Ding, Ran Hou, Guanyu Nie, Xiongwei Han, Wanyuan Wang

发表机构 * Southeast University（东南大学）； Waseda University（早稻田大学）； Nanyang Technological University（南洋理工大学）

AI总结提出ORAgentBench基准，评估LLM代理在端到端运筹学任务中的表现，发现当前代理通过率仅35.51%，主要受策略性弱点限制。

Comments 31 pages, preprint, v1

详情

AI中文摘要

大型语言模型越来越多地被部署为可执行环境中多步任务的自主代理，但它们执行现实运筹学工作的能力仍不明确。现有的运筹学评估通常将建模与求解分离，依赖预形式化或纯文本实例，很少测试从操作工件到验证决策的完整工作流程。在这项工作中，我们引入了ORAgentBench，一个基于执行环境的基准，用于评估自主代理在具有挑战性的端到端运筹学任务上的表现。它包含107个经过人工审核的任务，涵盖多样化的操作场景，每个任务都打包在一个隔离环境中，包含自然语言简介、多文件数据、配置工件和所需的提交模式。代理必须编写并运行解决方案代码，其提交由隐藏验证器根据模式有效性、硬约束可行性和归一化目标质量进行评估。对十四个前沿代理模型配置的实验表明，当前代理远未达到可靠的运筹学实践。最佳代理仅通过35.51%的所有任务和20.59%的困难任务，许多可行的提交仍低于所需的质量阈值。失败分析进一步表明，错误主要由策略性弱点主导，包括遗漏操作规则、脆弱的公式化、弱可行解构造以及解改进不足。运筹学特定的程序性技能增加了困难任务的可行性，但并未可靠地提高解质量或通过率。这些结果表明，运筹学代理的进展需要超越合理的优化代码，转向可靠、高质量的操作决策。

英文摘要

Large language models are increasingly deployed as autonomous agents for multi-step tasks in executable environments, yet their ability to perform realistic operations research (OR) work remains unclear. Existing OR evaluations often decouple modeling from solving, rely on pre-formalized or text-only instances, and rarely test the full workflow from operational artifacts to validated decisions. In this work, we introduce ORAgentBench, an execution-grounded benchmark for evaluating autonomous agents on challenging end-to-end operations research tasks. It contains 107 human-reviewed tasks across diverse operational scenarios, each packaged in an isolated environment with a natural-language brief, multi-file data, configuration artifacts, and a required submission schema. Agents must write and run solution code, and their submissions are evaluated by hidden validators for schema validity, hard-constraint feasibility, and normalized objective quality. Experiments with fourteen frontier agent-model configurations show that current agents remain far from reliable OR practice. The best agent passes only 35.51% of all tasks and 20.59% of hard tasks, and many feasible submissions still fall below the required quality threshold. Failure analysis further shows that errors are dominated by strategic weaknesses, including missed operational rules, brittle formulations, weak feasible-solution construction, and insufficient solution improvement. OR-specific procedural skills increase hard-task feasibility, but do not reliably improve solution quality or pass rate. These results suggest that progress in OR agents requires moving beyond plausible optimization code toward dependable, high-quality operational decision-making.

2606.19788 2026-06-19 cs.AI cs.CL 新提交

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

CombEval：评估大语言模型中组合计数的框架

Yuxu Zhou, Ondřej Kuželka, Yuyi Wang, Yuanhong Wang, Yi Chang

发表机构 * School of Artificial Intelligence, Jilin University（吉林大学人工智能学院）； Czech Technical University in Prague（捷克布拉格理工大学）； CRRC Zhuzhou Institute（中车株洲研究所）； Tengen Intelligence Institute（天元智能研究院）； International Center of Future Science, Jilin University（吉林大学未来科学国际合作中心）； Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE（教育部知识驱动人机智能工程研究中心）

AI总结提出CombEval动态基准，通过类型化Cofola规范生成组合计数问题，评估11个大语言模型在直接和代码增强设置下的表现，发现模型在有序对象、不可区分元素、相对位置约束和嵌套对象依赖上存在脆弱性。

Comments under review. Code: https://github.com/YuxuZhou-CN/combination-problem-generation

详情

AI中文摘要

我们提出了CombEval，一个用于评估大语言模型中组合计数的动态基准。CombEval将每个问题表示为关于实体、组合对象、对象依赖和约束的类型化Cofola规范，从而能够受控地生成带有精确求解器验证答案的自然语言计数问题。与静态集合不同，CombEval支持对象类型、实体规模、约束数量和推理深度的系统变化。我们在直接和代码增强设置下评估了11个大语言模型，发现模型在有序对象、不可区分元素、相对位置约束和嵌套对象依赖上仍然脆弱。错误分析进一步识别出在约束解释和计数原则上的失败。CombEval为研究大语言模型何时以及为何在组合推理上失败提供了一个诊断测试平台。代码和生成的基准套件可在 https://github.com/YuxuZhou-CN/combination-problem-generation 公开获取。

英文摘要

We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural-language counting problems with exact solver-verified answers. Unlike static collections, CombEval supports systematic variation of object type, entity scale, constraint count, and reasoning depth. We evaluate 11 LLMs under direct and code-augmented settings and find that models remain brittle on ordered objects, indistinguishable elements, relative positional constraints, and nested object dependencies. Error analysis further identifies failures in constraint interpretation and counting principles. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning. The code and generated benchmark suites are publicly available at https://github.com/YuxuZhou-CN/combination-problem-generation.

2606.19868 2026-06-19 cs.AI 新提交

A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models

大型语言模型黑盒不确定性估计方法的系统评估

Jiayi Wang, Xu-Yao Zhang

AI总结系统评估了24种黑盒不确定性估计方法在4个模型和4个数据集上的表现，发现无单一方法普遍最优，但基于答案空间推理和比较的方法通常有效，混合方法在多数条件下表现良好。

详情

AI中文摘要

尽管大型语言模型（LLMs）在广泛的任务中展现出强大的能力，但其输出通常仍不可靠，可能包含幻觉，因此不确定性估计（UE）对于构建可信赖的LLMs至关重要。在实践中，许多主流LLMs仅通过受限API访问，此时logits和隐藏状态等内部信号不可用，使得黑盒UE尤为重要。然而，现有关于LLMs黑盒UE的研究在方法论上仍然零散，缺乏统一的实证比较。为填补这一空白，我们系统回顾了黑盒UE方法，并将其分为五类：基于口头化、基于采样、基于解释、多智能体和混合方法。我们进一步构建了统一的评估框架，并在4个模型和4个数据集设置下对24种代表性方法进行了基准测试。结果表明，没有单一方法在所有设置中一致占优。然而，在答案空间中进行推理和比较候选的方法通常有效，而结合多种不确定性信号的混合方法在大多数条件下表现良好。通过发布基准数据和统一评估框架，我们旨在促进可重复比较并支持未来研究，同时我们的实证发现为开发未来LLMs的黑盒UE方法提供了实践指导。

英文摘要

Although large language models (LLMs) have shown strong capabilities across a wide range of tasks, their outputs often remain unreliable and may contain hallucinations, making uncertainty estimation (UE) essential for building trustworthy LLMs. In practice, many mainstream LLMs are only accessible through restricted APIs, where internal signals such as logits and hidden states are unavailable, making black-box UE especially important. However, existing work on black-box UE for LLMs remains fragmented in methodology and lacks a unified empirical comparison. To address this gap, we present a systematic review of black-box UE methods and organize them into five categories: verbalization-based, sampling-based, explanation-based, multi-agent, and hybrid methods. We further build a unified evaluation framework and benchmark 24 representative methods across 4 models and 4 dataset settings. Our results show that no single method consistently dominates across all settings. Nevertheless, methods that reason over and compare candidates in the answer space are generally effective, and hybrid methods that combine multiple uncertainty signals perform well under most conditions. By releasing the benchmark data and a unified evaluation framework, we aim to facilitate reproducible comparisons and support future research, while our empirical findings provide practical guidance for developing future black-box UE methods for LLMs.

2606.20146 2026-06-19 cs.AI 新提交

BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling

BIM-Edit：基于IFC的建筑信息模型的大语言模型基准测试

Bharathi Kannan Nithyanantham, Clemens Kujat, Tobias Sesterhenn, Stefan Telgmann, Jörn Plönnigs, Stefan Lüdtke, Christian Bartelt

发表机构 * University of Rostock（罗斯托克大学）； Clausthal University of Technology（克劳斯塔尔工业大学）

AI总结提出BIM-Edit基准，评估大语言模型在IFC格式建筑信息模型上的自然语言编辑能力，涵盖324个任务，最佳模型平均得分仅49.5%，揭示当前能力与工程需求间的差距。

详情

AI中文摘要

大语言模型（LLMs）越来越多地被应用于计算机辅助设计（CAD），以从文本指令生成设计工件。在工程实践中，这需要的不仅仅是创建新的几何体，模型还必须理解现有场景，正确编辑它们，并保留语义和关系。然而，许多CAD基准侧重于创建新模型而非编辑现有模型，并且主要评估几何正确性。我们引入了BIM-Edit，这是一个用于评估LLMs在行业基础类（IFC）格式表示的建筑信息模型（BIM）上进行自然语言编辑的基准。BIM提供了一个具有挑战性的测试平台，因为建筑模型将几何体与语义和关系结构编码在一起。BIM-Edit包含324个编辑任务，涵盖11个真实建筑模型和36个合成场景。任务使用三种指令类别——直接、空间和拓扑——表达，涵盖显式编辑和场景接地编辑。我们沿三个维度评估输出：几何准确性、语义有效性和拓扑一致性。在评估的LLMs中，表现最佳的模型在三个指标上的平均得分仅为49.5%，且没有模型完全解决超过3.4%的任务。这些结果表明当前LLM能力与结构化工程设计工作流的要求之间存在巨大差距。

英文摘要

Large language models (LLMs) are increasingly applied to computer-aided design (CAD) to generate design artifacts from textual instructions. In engineering practice, this requires more than creating new geometry: models must also understand existing scenes, edit them correctly, and preserve semantics and relations. However, many CAD benchmarks focus on creating new models rather than editing existing ones, and mostly evaluate geometric correctness. We introduce BIM-Edit, a benchmark for evaluating LLMs on natural-language editing of Building Information Models (BIM) represented in the Industry Foundation Classes (IFC) format. BIM provides a challenging testbed because building models encode geometry together with semantic and relational structure. BIM-Edit contains 324 editing tasks spanning 11 realistic building models and 36 synthetic scenes. Tasks are expressed using three instruction categories (direct, spatial, and topological), covering both explicit and scene-grounded edits. We evaluate outputs along three dimensions: geometric accuracy, semantic validity, and topological consistency. Across evaluated LLMs, the best-performing model achieves an average score of only 49.5% across the three metrics, and no model fully solves more than 3.4% of tasks. These results demonstrate a substantial gap between current LLM capabilities and the requirements of structured engineering design workflows.

2606.20205 2026-06-19 cs.AI cs.CL cs.HC 新提交

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

大语言模型的表观心理特征很大程度上是测量假象

Jelena Meyer, David Garcia, Dirk U. Wulff

发表机构 * Max Planck Institute for Human Development（马克斯·普朗克人类发展研究所）； University of Konstanz（康斯坦茨大学）； Barcelona Supercomputing Center（巴塞罗那超级计算中心）； University of Basel（巴塞尔大学）

AI总结通过心理测量框架分析56个指令微调LLM，发现模型间差异主要源于方向性响应偏差而非特质，该偏差解释了81-90%的变异，且可通过题目选择操控，表明LLM心理特征是测量假象。

详情

AI中文摘要

专为人类设计的心理测量工具越来越多地被用于赋予大型语言模型（LLM）稳定的心理特征，这些特征影响其可用性、安全评估以及作为人类参与者的研究代理。使用正式的心理测量框架，我们表明这些特征很大程度上是测量假象。我们对56个指令微调LLM以及大型人类参考样本施测了一系列涵盖自我报告和行为任务的人格与风险偏好工具，报告了四个发现。第一，模型间差异并非由工具所针对的特质驱动，而是由方向性响应偏差驱动，即倾向于向量表一端或某个标签选项做出反应，而不考虑项目内容；方差分解将81-90%的模型间变异归因于这种偏差，而在人类中这一比例为9-16%。第二，偏差随模型能力提升而下降，但并未被消除。第三，由于响应由偏差而非特质驱动，工具的表面信度几乎完全由其响应正交性预测，这是我们提出的术语，指特质和偏差指向相反方向的项目比例。第四，模型呈现的特征随所用项目而变化，并可通过项目选择来制造。这些结果表明，LLM的表面心理特征是用于测量它们的工具的假象，而非模型本身的属性。由于从人类心理学借用的工具很少完全正交，且可能对LLM天生缺乏效度，我们呼吁以响应正交性为中心进行专门的评估。

英文摘要

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
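The abstract defines response orthogonality as the proportion of items for which trait and bias point in opposite directions. A minimal sketch of that proportion follows, assuming each item is coded by its keying direction (+1 if endorsing the item indicates more of the trait, -1 if reverse-keyed) and the response bias has a single direction (+1 toward the high end of the scale). This coding scheme and the function name are illustrative assumptions; the abstract does not give the paper's exact operationalization.

```python
def response_orthogonality(item_keys, bias_direction):
    """Proportion of items whose keyed trait direction opposes the bias direction.

    item_keys: +1 for forward-keyed items, -1 for reverse-keyed items.
    bias_direction: +1 if the respondent drifts toward the high end of the scale,
    -1 if it drifts toward the low end.
    """
    if not item_keys:
        raise ValueError("instrument has no items")
    opposed = sum(1 for key in item_keys if key == -bias_direction)
    return opposed / len(item_keys)


# Hypothetical 8-item scale with two reverse-keyed items, and a high-end bias.
item_keys = [+1, +1, -1, +1, -1, +1, +1, +1]
print(response_orthogonality(item_keys, bias_direction=+1))
```

Under this coding, a fully forward-keyed scale answered with a high-end bias has orthogonality 0, so bias and trait are indistinguishable in the total score; a balanced scale sits at 0.5.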

2606.20208 2026-06-19 cs.AI cs.DB cs.NE 新提交

Beyond Accuracy: Measuring Logical Compliance of Predictive Models

超越准确性：衡量预测模型的逻辑合规性

Guillaume Olivier Delplanque, Pierre Genevès, Nabil Layaïda, Zephirin Faure

AI总结提出规则违反分数（RVS），一种独立于预测准确性的评估指标，用于量化预测模型对逻辑规则的遵守程度，并通过实验证明两个准确率相近的模型可能表现出截然不同的逻辑合规性。

详情

AI中文摘要

机器学习模型主要通过预测性能指标进行评估，如排序质量、预测误差或分类准确性。虽然这些指标有效量化了预测与真实值的匹配程度，但它们不评估模型输出是否遵守预定义的逻辑约束或领域特定约束。在医疗、金融和自主系统等高风险应用中，逻辑一致性可能与预测准确性同样关键，但尚无标准指标捕捉这一维度。我们引入了规则违反分数（RVS），这是一种互补的评估指标，独立于预测准确性，量化预测模型对给定逻辑规则集的遵守程度。RVS 对硬规则（严格约束）和软规则（统计规律）区别对待，可在任何数据集和任何在关系词汇上表达的预测模型上进行评估，并可通过为 Horn 规则自动生成的 SQL 查询进行计算。除了评估模型，RVS 还可以评估训练数据集的逻辑一致性，并帮助识别定义不良的规则。我们在三个基准测试上评估了 RVS，涵盖知识图谱链接预测和关系回归，包括基于规则、基于嵌入和神经符号的预测模型。我们的结果表明，两个预测准确性相当的模型可能表现出显著不同的逻辑合规性，揭示了标准指标无法捕捉的模型行为差异。

英文摘要

Machine learning models are predominantly evaluated through predictive performance metrics such as ranking quality, prediction error, or classification accuracy. While these metrics effectively quantify how closely predictions match the ground truth, they do not assess whether model outputs respect predefined logical or domain-specific constraints. In high-stakes applications, including healthcare, finance, and autonomous systems, logical consistency can be as critical as predictive accuracy, yet no standard metric captures this dimension. We introduce the Rule Violation Score (RVS), a complementary evaluation metric that quantifies the extent to which a predictive model respects a given set of logical rules, independently of predictive accuracy. RVS treats hard rules (strict constraints) and soft rules (statistical regularities) differently, can be evaluated on any dataset and on any predictive model expressed over a relational vocabulary, and can be computed using SQL queries that are automatically generated for Horn rules. Beyond evaluating models, RVS can also evaluate the logical consistency of training datasets and help identify poorly defined rules. We evaluate RVS on three benchmarks covering knowledge graph link prediction and relational regression, including rule-based, embedding-based, and neuro-symbolic predictive models. Our results demonstrate that two models achieving comparable predictive accuracy can exhibit substantially different levels of logical compliance, revealing differences in model behavior that standard metrics fail to capture.
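The abstract states that RVS can be computed with SQL queries generated for Horn rules but does not give the formula. The sketch below only illustrates the underlying idea for one hard Horn rule, `capital_of(x, y) -> located_in(x, y)`: count the body groundings whose head fact is absent from the model's predictions. The table layout, the rule, the data, and the plain violation rate are all assumptions for illustration and should not be read as the paper's definition of RVS.

```python
import sqlite3

# Predicted facts stored as (subject, predicate, object) triples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triple VALUES (?, ?, ?)",
    [
        ("paris", "capital_of", "france"),
        ("paris", "located_in", "france"),
        ("rome", "capital_of", "italy"),  # head fact missing: one violation
        ("oslo", "capital_of", "norway"),
        ("oslo", "located_in", "norway"),
    ],
)

# Body groundings of the rule that have no matching head fact.
VIOLATIONS_SQL = """
SELECT COUNT(*) FROM triple AS body
WHERE body.p = 'capital_of'
  AND NOT EXISTS (
    SELECT 1 FROM triple AS head
    WHERE head.p = 'located_in' AND head.s = body.s AND head.o = body.o
  )
"""
# All body groundings of the rule.
GROUNDINGS_SQL = "SELECT COUNT(*) FROM triple WHERE p = 'capital_of'"

violations = conn.execute(VIOLATIONS_SQL).fetchone()[0]
groundings = conn.execute(GROUNDINGS_SQL).fetchone()[0]
violation_rate = violations / groundings
print(violations, groundings, round(violation_rate, 3))
```

The query treats a missing head fact as a violation, which is a closed-world reading of the predictions. A soft rule would presumably be scored against an expected violation frequency rather than zero, but the abstract does not specify how.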

2606.20227 2026-06-19 cs.AI cs.SE 新提交

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

QMFOL：通过可量化的一元一阶逻辑测试用例生成来基准测试大语言模型推理

Xinyi Zheng, Ling Shi, Tianlong Yu, Yongxin Zhao, Lorenz Goette, Kailong Wang

发表机构 * Huazhong University of Science and Technology（华中科技大学）； Nanyang Technological University（南洋理工大学）； Hubei University（湖北大学）； East China Normal University（华东师范大学）； National University of Singapore（新加坡国立大学）

AI总结提出QMFOL框架，通过可控制复杂度的合取/析取模式生成一元一阶逻辑推理任务，并构建包含2880个实例的基准QMFOLBench，评估显示逻辑复杂度增加导致性能下降和计算开销上升。

详情

AI中文摘要

大型语言模型（LLMs）在推理方面取得了显著进展，特别是在演绎推理中，这对于高风险决策至关重要。随着模型的改进，评估基准也应随之发展。然而，现有基准缺乏对逻辑复杂性的细粒度控制，并且在语义多样性与逻辑一致性之间难以平衡。为了解决这些问题，我们提出了QMFOL，一个自动生成具有可量化和可控复杂度的一元一阶逻辑推理任务的框架。它使用合取和析取模式构建形式逻辑结构，从而能够精确控制推理深度、宽度、标签类型和干扰项。然后通过LLM将这些结构转化为自然语言，并通过外部证明器的往返验证确保逻辑一致性。基于我们的框架，我们构建了QMFOLBench，一个包含2880个实例、960种配置的基准，覆盖不同的逻辑和语义维度。对六个大型推理模型（LRMs）和两个LLM的评估表明，随着逻辑复杂度的增加，性能下降且计算开销上升。模型在True标签任务上的表现优于False或Unknown任务，并且对语义变化敏感。总体而言，QMFOL提供了一种可扩展且可靠的方法来构建具有可控复杂度的演绎推理基准，从而能够更精确地评估现代语言模型的推理能力。

英文摘要

Large Language Models (LLMs) have made significant progress in reasoning, particularly in deductive reasoning, which is crucial for high-stakes decision-making. As models improve, evaluation benchmarks should evolve to keep pace. However, existing benchmarks lack fine-grained control over logical complexity and struggle to balance semantic diversity with logical consistency. To address these issues, we propose QMFOL, an automated framework for generating monadic first-order logic reasoning tasks with quantifiable and controllable complexity. It constructs formal logical structures using conjunction and disjunction patterns, enabling precise control over reasoning depth, width, label types, and distractors. These structures are then translated into natural language via LLMs, with logical consistency ensured through round-trip verification using an external prover. Based on our framework, we build QMFOLBench, a benchmark comprising 2880 instances with 960 configurations across diverse logical and semantic dimensions. Evaluations on six large reasoning models (LRMs) and two LLMs show that performance degrades and computational overhead increases with rising logical complexity. Models perform better on True-labeled tasks than on False or Unknown ones, and exhibit sensitivity to semantic variation. Overall, QMFOL offers a scalable and reliable approach for constructing deductive reasoning benchmarks with controllable complexity, enabling more precise evaluation of reasoning capabilities in modern language models.

2606.20517 2026-06-19 cs.AI cs.PL 新提交

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Multi-LCB: 将 LiveCodeBench 扩展到多种编程语言

Maria Ivanova, Pavel Zadorozhny, Rodion Levichev, Ivan Petrov, Adamenko Pavel, Ivan Lopatin, Alexey Kutalev, Dmitrii Babaev

发表机构 * GigaCode ； Yandex School of Data Analysis, Applied AI Institute（Yandex数据分析学院，应用人工智能研究所）

AI总结提出 Multi-LCB 基准，将 LiveCodeBench 的 Python 任务扩展到 12 种编程语言，评估 LLM 跨语言代码生成能力，发现 Python 过拟合和语言特定污染等问题。

Comments ICLR 2026

详情

AI中文摘要

LiveCodeBench (LCB) 最近已成为评估大型语言模型 (LLM) 在代码生成任务上的广泛采用的基准。通过策划竞争性编程问题、不断向集合中添加新问题并根据发布日期进行过滤，LCB 提供了污染感知的评估，并提供了编码能力的整体视图。然而，LCB 仍然局限于 Python，留下了 LLM 是否能够泛化到现实软件工程所需的各种编程语言的问题。我们引入了 Multi-LCB，这是一个跨十二种编程语言（包括 Python）评估 LLM 的基准。Multi-LCB 将 LCB 数据集中的 Python 任务转换为其他语言中的等效任务，同时保留 LCB 的污染控制和评估协议。由于它与原始 LCB 格式完全兼容，Multi-LCB 将自动跟踪未来的 LCB 更新，从而能够系统地评估跨语言代码生成能力，并要求模型在 Python 之外保持良好的性能。我们在 Multi-LCB 上评估了 24 个 LLM 的指令和推理能力，发现了 Python 过拟合、语言特定污染以及多语言性能显著差异的证据。我们的结果将 Multi-LCB 确立为多编程语言代码评估的严格新基准，直接解决了 LCB 的主要局限性，并揭示了当前 LLM 能力的关键差距。

英文摘要

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.

2606.19501 2026-06-19 cs.AI cs.CL cs.LG q-fin.RM 新提交

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

DeXposure-Claw: 一个用于DeFi风险监管的智能体系统

Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He

发表机构 * University of Edinburgh（爱丁堡大学）； University of Glasgow（格拉斯哥大学）； University of Cambridge（剑桥大学）

AI总结针对DeFi监管中LLM智能体易误报的问题，提出DeXposure-Claw系统，通过图时间序列基础模型预测风险网络，结合确定性监控和置信度门控生成可审计监管票据，并构建六轴评估基准DeXposure-Bench，实验验证有效性。

详情

AI中文摘要

去中心化金融使监管者面临快速变化的网络化信用风险。通用LLM智能体不适合此场景：它们过度解读弱证据并推荐高风险干预，而现有评估无法提供与监管者需求对齐的方式来衡量由此产生的误报。我们提出DeXposure-Claw，一个基于预测的智能体监管系统，通过结构化证据引导LLM决策：(1) DeXposure-FM，一个图时间序列基础模型，预测未来的敞口网络；(2) 确定性监控和压力场景将这些预测转化为类型化警报、归因信号和场景证据；(3) 数据健康门控和置信度门控在DeXposure-Claw发出附带理由的可审计监管票据之前限制升级。我们进一步开发了DeXposure-Bench，一个六轴评估框架，其决策轴依据与监管者对齐的绝对损失真值和显式的误干预率对票据评分。在五年每周真实数据上的实验充分支持了我们的系统。代码见 https://github.com/EVIEHub/DeXposure-Claw。

英文摘要

Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure-Claw.

2606.19522 2026-06-19 cs.AI 新提交

REVEAL++: Differentiable Phenotypic Grouping for Vision-Language Retinal Modeling of Alzheimer's Disease Risk

REVEAL++：用于阿尔茨海默病风险视觉-语言视网膜建模的可微分表型分组

Ethan Elio Meidinger, Seowung Leem, Zeyun Zhao, Ruogu Fang

发表机构 * University of Virginia（弗吉尼亚大学）； J. Crayton Pruitt Family Department of Biomedical Engineering, Herbert Wertheim College of Engineering, University of Florida（佛罗里达大学赫伯特·韦特海姆工程学院J. Crayton Pruitt家庭生物医学工程系）

AI总结提出可微分连续表型相似性权重函数，替代离散分组，在对比学习中端到端学习跨模态对齐与表型结构，提升AD风险预测。

Comments Accepted for publication at MICCAI 2026

详情

AI中文摘要

视网膜为神经退行性疾病提供了非侵入性窗口，能够捕捉与未来认知衰退风险相关的细微结构模式。诸如REVEAL等视觉-语言对齐框架已表明，将视网膜眼底图像与结构化临床风险叙述配对可改善阿尔茨海默病（AD）的早期预测。这些方法的一个关键设计选择是使用表型分组，即在对比学习中将具有相似风险特征的个体视为多正对。然而，现有方法将表型相似性操作化为离散构造，依赖硬分组分配，施加刚性监督并将分组形成与表示学习分离。我们提出对比学习中表型结构的连续形式。我们不将样本分配到固定聚类，而是将受试者间相似性建模为可微分权重函数，该函数源自视网膜图像和风险特征中模态内嵌入相似性。这些权重通过连续聚合算子定义软多正关系，实现反映疾病风险谱的梯度监督。我们进一步引入软目标对比目标，以端到端方式联合学习跨模态对齐和表型结构。在UK Biobank视网膜成像数据上进行AD发病预测评估，所提框架持续优于基于离散分组的对比学习和标准视觉-语言基线。通过将表型相似性视为可学习的连续信号而非固定分组规则，我们的方法为从多模态视网膜和临床数据中进行人群规模的神经退行性风险建模提供了有原则且稳健的基础。

英文摘要

The retina offers a noninvasive window into neurodegenerative disease, capturing subtle structural patterns associated with a risk of future cognitive decline. Vision-language alignment frameworks such as REVEAL have shown that pairing retinal fundus images with structured clinical risk narratives improves early prediction of Alzheimer's disease (AD). A key design choice in these approaches is the use of phenotypic grouping, where individuals with similar risk profiles are treated as multi-positive pairs during contrastive learning. However, existing methods operationalize phenotypic similarity as a discrete construct, relying on hard group assignments that impose rigid supervision and decouple group formation from representation learning. We propose a continuous formulation of phenotypic structure within contrastive learning. Rather than assigning samples to fixed clusters, we model inter-subject similarity as a differentiable weighting function derived from intra-modality embedding similarities in both retinal images and risk profiles. These weights define soft multi-positive relationships through a continuous aggregation operator, enabling graded supervision that reflects the spectrum nature of disease risk. We further introduce a soft-target contrastive objective that jointly learns cross-modal alignment and phenotypic structure in an end-to-end manner. Evaluated on UK Biobank retinal imaging data for incident AD prediction, the proposed framework consistently outperforms discrete group-based contrastive learning and standard vision-language baselines. By treating phenotypic similarity as a learnable, continuous signal rather than a fixed grouping rule, our approach provides a principled and robust foundation for population-scale neurodegenerative risk modeling from multi-modal retinal and clinical data.

2606.19602 2026-06-19 cs.AI 新提交

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

可配置的临床信息提取与智能体RAG：什么有效、什么失效及原因

Osman Alperen Çinar-Koraş, Marie Bauer, Sameh Khattab, Merlin Engelke, Moon Kim, Stephan Settelmeier, Shigeyasu Sugawara, Fabian Freisleben, Felix Nensa, Jens Kleesiek

发表机构 * Institute for Artificial Intelligence in Medicine (IKIM), University Medicine Essen（埃森大学医学院人工智能医学研究所）； Faculty of Computer Science, University of Duisburg-Essen（杜伊斯堡-埃森大学计算机科学学院）； Department of Physics, TU Dortmund University（多特蒙德工业大学物理系）； Lamarr Institute for Machine Learning and Artificial Intelligence, TU Dortmund University（多特蒙德工业大学拉马尔机器学习和人工智能研究所）； Advanced Clinical Research Center, Fukushima Medical University（福岛医科大学先进临床研究中心）； Department of Cardiology and Vascular Medicine, University Hospital Essen（埃森大学医院心血管内科）

AI总结针对临床文档元数据缺失问题，提出基于智能体RAG的ACIE系统，在埃森大学医学中心部署，通过完整患者上下文推理和源引用验证，在7326次临床判断中实现96.5%的提取接受率。

详情

AI中文摘要

患者上下文涵盖数百份异构文档和数千个结构化数据点，然而AI系统进行检索和分诊所需的文档级元数据缺失或不完整。标准检索增强生成在此类数据上失效，无法处理时间推理、跨文档依赖和缺失元数据。我们在埃森大学医学中心部署了ACIE（智能体临床信息提取）：一个本地智能体RAG管道，能够推理完整的患者上下文，并将每个答案基于源段落以供临床医生验证。我们量化了元数据差距，追溯了由此形成的架构决策，并在一项独立的回顾性淋巴瘤注册研究中评估了提取效果，其中核医学医生根据引用的来源验证每个提取值。在7326次判断中，临床医生接受了96.5%的提取结果，按类型划分的接受率从80%到99%不等。

英文摘要

Patient contexts span hundreds of heterogeneous documents and thousands of structured data points, yet the document-level metadata that AI systems need for retrieval and triage is absent or incomplete. Standard retrieval-augmented generation fails on this data, mishandling temporal reasoning, cross-document dependencies, and missing metadata. We deploy ACIE (Agentic Clinical Information Extraction) at University Medicine Essen: an on-premise agentic RAG pipeline that reasons over complete patient contexts and grounds every answer in source passages for clinician verification. We quantify the metadata gap, trace the architectural decisions it shaped, and evaluate extraction alongside an independent retrospective lymphoma registry study, in which nuclear-medicine physicians verify every extracted value against its cited sources. Across 7,326 judgments, clinicians accepted 96.5% of extractions, with per-type acceptance ranging from 80% to 99%.

2606.19651 2026-06-19 cs.AI cs.CV cs.LG New submission

BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation

Max Van Puyvelde, Ibrahim Gulluk, Wim Van Criekinge, Olivier Gevaert

Affiliations: Department of Biomedical Data Science, Stanford University School of Medicine; Department of Mathematical Modelling, Statistics & Bioinformatics, Ghent University; Department of Electrical Engineering, Stanford University

AI summary: Proposes a tokenizer based on a 3D masked autoencoder that decouples the encoder from the decoder; it outperforms or matches SOTA on 21 of 23 linear-probing tasks and supports conditional generation and longitudinal forecasting.

Abstract

Three-dimensional (3D) brain MRI is central to clinical neurology and neuro-oncology, where generative models could augment under-represented cohorts, simulate disease trajectories, and support privacy-preserving data sharing. Latent diffusion has been the go-to solution for modeling imaging data, but it places two competing demands on the tokenizer: encoder embeddings must retain the clinical information that downstream tasks act on, and the decoder must reconstruct anatomically faithful volumes. Existing reconstruction-driven tokenizers achieve the second at the expense of the first. To address this, we introduce a fully volumetric masked-autoencoder (MAE) based tokenizer for 3D brain MRI latent diffusion, decoupling encoder and decoder: a frozen 3D MAE encoder produces clinically informative embeddings, while a dedicated CNN decoder reconstructs voxels from a linear projection of those embeddings. We pretrain the encoder on 35,309 volumes from 18 public cohorts spanning four modalities, ten disease categories, and 200+ acquisition sites, and demonstrate its dual utility in two settings. First, on a 23-task linear-probing benchmark, the encoder outperforms or matches SOTA models (i.e., BrainIAC, BrainSegFounder, and MedicalNet) on 21 of 23 tasks. Second, a conditional diffusion transformer (DiT) trained on these clinically informative embeddings supports both conditional generation across six variables and patient-specific longitudinal forecasting. Together these results establish a single 3D brain-MRI embedding space capable of both downstream clinical tasks and controllable generation.

2606.19747 2026-06-19 cs.AI New submission

A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition

Nabil Mosharraf Hossain, Riasat Islam, Unaizah Obaidellah

Affiliations: Greentech Apps Foundation; Queen Mary University of London; University of Malaya

AI summary: Systematically compares fine-tuning of pretrained Transformer models (Wav2Vec2.0, HuBERT, and XLS-R) for Quranic speech recognition; on a dataset exceeding 870 hours, the best configuration reaches a WER of 0.08 on the EveryAyah subset while cutting combined-model training time from 140 hours to 40 hours.

Comments 30 pages, 9 figures, 5 tables, Submitted to International Journal of Speech Technology

Abstract

Quran Automatic Speech Recognition (ASR) aims to convert Quranic recitation into text, enabling applications such as aided memorisation tools and Quranic search engines. However, existing ASR models often exhibit high Word Error Rates (WER) on user-recited verses and lack full coverage of the Quranic corpus. This paper presents a systematic empirical study of domain-specific fine-tuning of pretrained Transformer-based models for Quranic ASR, using advanced speech feature extraction methods: Wav2Vec2.0, HuBERT, and XLS-R. These models apply self-supervised learning by masking portions of input audio and using Transformer architectures to learn context-aware speech features. The pretrained models are fine-tuned on a filtered Quranic dataset exceeding 870 hours of professional and user recitations. Through comprehensive ablation studies across feature extractors, output label formats, training strategies, and clip durations, we identify the key factors that affect transcription accuracy in this domain. Our best-performing configuration achieves a WER of 0.08 on the EveryAyah subset and 0.11 on the combined EveryAyah+Tarteel setting, representing roughly a five-percentage-point gain over the Citrinet baseline (WER = 0.163) while reducing combined-model training time from 140 hours to 40 hours. Arabic text without diacritics yields the best fine-tuning results, and Wav2Vec2-XLSR-53 provides the strongest overall representation. Future work includes improving dataset quality and developing phoneme-aware models to extract deeper speech feature representations for Tajweed-sensitive applications.
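
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a generic illustration under that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the sample strings are made up for this example. The last line only restates the arithmetic behind the reported gain, assuming the comparison is between the Citrinet baseline (0.163) and the combined-setting result (0.11).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the reference prefix seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words.
example_wer = word_error_rate("a b c d", "a b x d")

# Gap between the reported baseline WER and the combined-setting WER.
absolute_gain = round(0.163 - 0.11, 3)
```

Under these assumptions the gap comes to 0.053, which is consistent with the "roughly five-percentage-point" figure in the abstract.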

2606.19821 2026-06-19 cs.AI cs.LG New submission

TelcoAgent: A Scalable 5G Multi-KPM Forecasting With 3GPP-Grounded Explainability

Geon Kim, Dara Ron, Sukhdeep Singh, Suyog Moogi, Pranshav Gajjar, V V N K Someswara Rao Koduri, Een Kee Hong, Vijay K. Shah

Affiliations: NextG Wireless Lab, North Carolina State University; Kyung Hee University

AI summary: Proposes the TelcoAgent framework, which uses foundation models for zero-shot forecasting of multiple KPMs and delivers actionable diagnostics through a 3GPP knowledge graph and an explainability pipeline.

Comments 6 pages, 6 figures. Submitted to IEEE GLOBECOM 2026

Abstract

Key Performance Measurement (KPM) forecasting is essential for proactive network management of 5G and next-generation telecom networks. However, existing machine learning (ML) approaches face significant limitations in scalability and explainability, restricting their effectiveness in real-world deployments. We propose TelcoAgent, a foundation model-based framework that enables accurate, scalable, and explainable forecasting of multiple KPMs across diverse network cells without the need for site-specific training. Specifically, the framework comprises three key components: (i) an automated three-agent pipeline that constructs a 3rd Generation Partnership Project (3GPP) knowledge graph directly from specification documents, (ii) a scalable, time-series foundation model (TSFM)-based prediction pipeline to deliver accurate, zero-shot forecasting, and finally (iii) a reasoning and explanation pipeline that provides actionable, domain-grounded diagnostics. Evaluated using a 3-month, real-world, city-scale 5G KPM dataset from a U.S.-based network operator, TelcoAgent demonstrates high forecasting accuracy for all 7 considered KPMs per cell across 200 cells, while delivering explainable insights and actionable instructions to address network degradations.

2606.19921 2026-06-19 cs.AI New submission

eCNNTO: A Highly Generalizable ConvNet for Accelerating Topology Optimization

Shengbiao Lu, Xiaodong Wei

Affiliations: Global College, Shanghai Jiao Tong University

AI summary: Proposes eCNNTO, an element-based convolutional neural network that accelerates density-based topology optimization by predicting near-optimal densities and skipping most iterations, and introduces a new training strategy that improves efficiency and generalization.

Abstract

This work proposes an element-based Convolutional Neural Network (CNN) to accelerate density-based Topology Optimization (TO), termed eCNNTO. TO generally requires a large number of iterations, each of which performs a finite element analysis, creating an efficiency bottleneck, especially when dense meshes are used to achieve high-resolution designs. To address this limitation, eCNNTO builds upon Kallioras et al. (2020), where a Deep Belief Network (DBN) was trained for every element to predict its near-optimal density from its early history, thereby skipping the great majority of iterations and significantly accelerating the TO procedure. However, that method lacks spatial correlations among neighboring elements and may lead to disconnected features in the final structure. The proposed method employs a CNN with residual connections to address this issue. On top of this, a novel training strategy is introduced to further enhance the optimization efficiency, where the training dataset consists of final-stage density histories rather than early ones. This change can also help reduce the required training data size. eCNNTO requires only a small dataset to train, yet it generalizes to problems with largely different boundary conditions, loading cases, design domain geometries, mesh resolutions, and non-design domains. Finally, the generalization capabilities and efficiency of eCNNTO are demonstrated through a variety of examples in two and three dimensions, achieving up to 90% and 97% reduction of iterations, respectively.

2606.20087 2026-06-19 cs.AI New submission

Multi-Head Attention-Based Feature Extractor Integration with Soft Actor-Critic for Porosity Prediction and Process Parameter Optimization in Additive Manufacturing

Kianoush Aqabakee, Leonardo Stella

Affiliations: Department of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic); Department of Mechanical Engineering, Amirkabir University of Technology (Tehran Polytechnic); School of Computer Science, University of Birmingham

AI summary: Proposes a continuous-action-space method that combines a multi-head attention mechanism with the Soft Actor-Critic algorithm for porosity prediction and process parameter optimization in additive manufacturing, achieving faster convergence and higher reward.

Abstract

Additive manufacturing process optimization requires precise parameter control to minimize defects such as porosity. Traditional reinforcement learning (RL) approaches using discrete action spaces suffer from slow convergence and susceptibility to local optima, limiting their effectiveness for high-precision manufacturing tasks. This study addresses these limitations by employing a continuous action space combined with a novel architecture that integrates a multi-head attention mechanism with the Soft Actor-Critic (SAC) algorithm. The attention-based feature extractor enhances the agent's ability to capture subtle variations in low-dimensional input features, enabling more effective exploration-exploitation balance for navigating value spaces with local minima. We validate our approach on porosity prediction and process parameter optimization in laser powder bed fusion, demonstrating faster convergence and higher final reward values compared to standard RL methods including DQN, PPO, TD3, and vanilla SAC. The proposed methodology achieves a convergence value of 322.79 within 14 episodes, outperforming existing approaches while maintaining stability throughout training.

2606.20138 2026-06-19 cs.AI cs.CL cs.HC cs.LG New submission

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

Affiliations: Leiden University; FutureWhiz

AI summary: Proposes a subject-aware prompt-routing model built on 14 pedagogical features; trained in simulation and evaluated in an online A/B test, it switches strategies adaptively in high-school tutoring, improving instructional efficiency and reducing the number of interaction turns.

Abstract

LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines ($0.694$ vs. $0.647$ and $0.64$, $p<0.001$). A/B testing ($N=656$ conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality and reduces interactions by around 3 turns ($p=0.007$). While a greedy router achieves a comparable exercise conversion rate with the baseline ($19.1\%$ vs. $19.6\%$), a stochastic router that samples strategies leads to a higher conversion rate ($28.1\%$).
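
The greedy versus stochastic distinction in the abstract can be illustrated with a small sketch. The abstract does not say how the stochastic router samples, so softmax sampling over router scores is an assumption here, and the strategy names, scores, and function names are all hypothetical.

```python
import math
import random


def softmax(scores: list[float], temperature: float = 1.0) -> list[float]:
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    weights = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def greedy_route(scores: dict[str, float]) -> str:
    """Always pick the strategy with the highest predicted score."""
    return max(scores, key=scores.get)


def stochastic_route(scores: dict[str, float], rng: random.Random) -> str:
    """Sample a strategy with probability given by softmax(score) (assumed rule)."""
    names = list(scores)
    probs = softmax([scores[n] for n in names])
    return rng.choices(names, weights=probs, k=1)[0]


# Hypothetical router scores for two prompt strategies.
scores = {"analytical": 0.2, "scaffolding": 0.9}
rng = random.Random(0)
greedy_pick = greedy_route(scores)
sampled_picks = {stochastic_route(scores, rng) for _ in range(200)}
```

The greedy router always returns the top-scoring strategy, while the sampling router still tries lower-scoring strategies some of the time.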

2606.20162 2026-06-19 cs.AI cs.IT cs.NI math.IT New submission

Implicit Semantic-Aware Communication Based on Hypergraph Reasoning

Yiwei Liao, Shurui Tu, Yong Xiao, Yingyu Li, Guangming Shi

Affiliations: China Electric Power Research Institute Co., Ltd; National Key Laboratory for Power Grid Environmental Protection; School of Electronic Information and Communications, Huazhong University of Science and Technology; Peng Cheng Laboratory; Pazhou Laboratory (Huangpu); School of Mechanical Engineering and Electronic Information, China University of Geosciences

AI summary: Proposes HISR, a hypergraph-based implicit semantic reasoning framework that models higher-order multi-entity relations with hypergraphs, makes semantic inference more robust over noisy channels, and improves accuracy by up to 36.6%.

Comments This work is accepted at IEEE Transactions on Communications

Abstract

Semantic-aware communication has emerged as a transformative paradigm for next-generation communication systems, shifting the fundamental goal from transmitting bit-level symbols to reliably recovering and understanding the semantic meaning of information. Previous studies have demonstrated that representing the semantic content of source messages as graph-based structures can significantly improve communication efficiency and the accuracy of semantic inference at the receiver. However, existing solutions typically employ graphs that capture only pairwise relationships, thereby neglecting higher-order implicit correlations commonly observed in real-world scenarios, such as group interactions, multi-entity associations, and complex relational contexts. This limitation reduces semantic expressiveness and makes semantic inference susceptible to ambiguity and performance degradation, particularly under noisy or corrupted channel conditions. To address these issues, this paper proposes a novel hypergraph-based implicit semantic reasoning framework, HISR, which leverages hypergraphs to represent complex multi-entity relationships among semantic knowledge entities. In HISR, entities and their associated higher-order relations are mapped into dedicated semantic subspaces tailored to distinct relational contexts. This design not only disentangles diverse semantic interactions to mitigate the over-smoothing effects commonly found in traditional graph embedding methods but also enables robust semantic inference even when partial information loss occurs during transmission. Numerical results show that the proposed HISR achieves up to a 36.6% improvement in implicit semantic interpretation accuracy over the state-of-the-art benchmarks.
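
The limitation of pairwise graphs mentioned in the abstract can be seen in a toy case. This is a generic illustration of hyperedges versus their pairwise (clique) expansion, not the HISR method itself; the entity and relation names are hypothetical.

```python
from itertools import combinations

# Hypothetical semantic entities joined by one three-way relation.
hyperedges = {"meeting": frozenset({"alice", "bob", "carol"})}


def clique_expansion(edges: dict[str, frozenset[str]]) -> set[frozenset[str]]:
    """Flatten each hyperedge into the pairwise edges an ordinary graph stores."""
    pairs = set()
    for members in edges.values():
        pairs.update(frozenset(p) for p in combinations(sorted(members), 2))
    return pairs


# Three separate two-way relations over the same entities.
pairwise_only = {
    "r1": frozenset({"alice", "bob"}),
    "r2": frozenset({"bob", "carol"}),
    "r3": frozenset({"alice", "carol"}),
}

# Both inputs collapse to the same pairwise graph, so the graph alone cannot
# tell one joint relation from three independent ones.
same_graph = clique_expansion(hyperedges) == clique_expansion(pairwise_only)
```

Keeping the hyperedge as a single set preserves the fact that all three entities take part in one relation.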

2606.20210 2026-06-19 cs.AI New submission

Augmenting Game AI with Deep Reinforcement Learning

Alessandro Sestini, Joakim Bergdahl, Amir Baghi, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Linus Gisslén

Affiliations: Electronic Arts (EA), Stockholm, Sweden

AI summary: Proposes a framework for training game AI with deep reinforcement learning to make character behavior more believable, and discusses deployment challenges and future research directions.

Comments Vision paper, published in Conference on Games 2026

Abstract

Immersion in video games depends not only on graphics, audio, and game mechanics, but also on the quality of in-game characters. Producing believable characters, or game AI, remains a significant challenge as behavioral complexity is hard to capture with hand-coded systems. Game AI is a source of immersion and engagement; however, the limitations stemming from the challenges of creating game AI often lead to frustration and the breaking of the illusion of realism within the game. The introduction of machine learning models opens the door to creating more believable, authentic, and relatable characters in games. The promise is that they either learn from interacting with the game, or from player data, to develop true human-like behavior. In this paper, we envision more applications of reinforcement learning for game AI in the future. For this to materialize, current research limitations are prohibitive to broad deployment across game genres. Therefore, we propose a framework for training reinforcement learning models with a set of requirements in mind that are suited towards game AI and game development. We present examples of games with reinforcement learning-augmented game AI and describe the practicalities of deploying player-facing machine learning agents in modern games. Furthermore, we identify bottlenecks and hard problems in these areas, which we believe offer promising research directions to accelerate the adoption of machine learning in game AI for the video game industry.

2606.20264 2026-06-19 cs.AI New submission

Confidence-Aware Automated Assessment of Student-Drawn Scientific Models

Luyang Fang, Yingchuan Zhang, Jongchan Park, Zhaoji Wang, Ping Ma, Xiaoming Zhai

Affiliations: AI4STEM Education Center, Athens, GA, USA; Department of Statistics, University of Georgia, Athens, GA, USA

AI summary: Proposes a confidence-aware scoring framework built on a Vision Transformer that automatically scores high-confidence responses and defers uncertain cases to human review; on six NGSS-aligned assessment items it improves scoring reliability and balances automation coverage against scoring risk.

Abstract

Student-generated drawings are widely used in science education to assess learners' conceptual understanding in modeling-based tasks aligned with the Next Generation Science Standards (NGSS). However, scoring such drawings requires expert human judgment to interpret complex visual representations, making large-scale assessment costly to implement and sustain in classroom settings. In this work, we study automated scoring of student-generated scientific drawings using a vision-based model. We evaluate a Vision Transformer (ViT) with parameter-efficient adaptation and propose a confidence-aware scoring framework that derives response-level confidence from test-time predictive distributions. This confidence signal enables selective automation by scoring high-confidence responses automatically while deferring uncertain cases for human review. Experiments on six NGSS-aligned middle school assessment items show that the proposed approach improves scoring reliability while supporting a practical trade-off between automated coverage and scoring risk, highlighting the value of confidence-aware methods for trustworthy educational assessment.
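
The selective-automation idea can be sketched as a threshold rule on a confidence score. The abstract does not specify how confidence is computed from the test-time predictive distributions, so averaging several probability vectors and taking the maximum class probability is an assumption here, as are the function name `score_or_defer`, the threshold of 0.8, and the sample numbers.

```python
from typing import Optional


def score_or_defer(
    prob_samples: list[list[float]], threshold: float = 0.8
) -> tuple[Optional[int], float]:
    """Return (predicted_level, confidence), or (None, confidence) to defer.

    prob_samples holds one class-probability vector per test-time prediction
    of the same response (an assumed setup for this sketch).
    """
    if not prob_samples:
        raise ValueError("need at least one predictive distribution")
    # Average the class probabilities across the test-time predictions.
    mean_probs = [sum(col) / len(prob_samples) for col in zip(*prob_samples)]
    confidence = max(mean_probs)
    if confidence >= threshold:
        return mean_probs.index(confidence), confidence
    return None, confidence  # route this response to a human scorer


# Hypothetical distributions over three rubric levels for two drawings.
confident_case = score_or_defer([[0.05, 0.90, 0.05], [0.10, 0.85, 0.05]])
uncertain_case = score_or_defer([[0.5, 0.4, 0.1], [0.3, 0.6, 0.1]])
```

Raising the threshold defers more responses to human review and lowers scoring risk; lowering it increases automated coverage.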

2606.20323 2026-06-19 cs.AI New submission

Leveraging systems' non-linearity to tackle the scarcity of data in the design of Intelligent Fault Diagnosis Systems

Giancarlo Santamato, Andrea Mattia Garavagno, Massimiliano Solazzi, Antonio Frisoli

AI summary: Proposes a periodic multi-excitation-level procedure that exploits a system's intrinsic non-linearity and, combined with data visualization and augmentation techniques, enables vibration-based fault diagnosis with deep transfer learning under data scarcity; validated on a railway pantograph structure.

Journal ref Nonlinear Dynamics, vol. 112, pp. 16153-16166, 2024

DOI: 10.1007/s11071-024-09864-6

Abstract

Deep Transfer Learning (DTL) allows for the efficient building of Intelligent Fault Diagnosis Systems (IFDS). However, DTL methods still rely heavily on large amounts of labelled data, and obtaining that much data can be challenging when dealing with faults in machines or structures. This paper proposes a novel approach to the design of vibration-based IFDS using DTL under conditions of severe data scarcity. A periodic multi-excitation level procedure leveraging intrinsic non-linearities of real-world systems is used to produce images that can be conveniently analysed by pre-trained Convolutional Neural Networks (CNNs) to diagnose faults. A new data visualization method and its augmentation technique are proposed to tackle the typical lack of data encountered during the design of IFDS. Experimental validation on a railway pantograph structure supports the proposed method.

2606.20438 2026-06-19 cs.AI New submission

Interpretable Sperm Morphology Classification via Attention-Guided Deep Learning

Zahra Asghari Varzaneh, Reza Khoshkangini, Thomas Ebner, Lars Johansson

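Affiliations: Department of Computer Science and Media Technology, Malmö University

AI summary: Proposes an attention-guided deep learning framework that combines EfficientNet-B0 with CBAM modules for sperm morphology classification, reaching 90.2% and 93.9% accuracy on the SMIDS and HuSHem datasets, respectively, and using Grad-CAM++ visualizations to improve interpretability.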

2606.20459 2026-06-19 cs.AI New submission

Context-Aware Hierarchical Bayesian Modeling of IVF Laboratory Environmental Conditions

Zahra Asghari Varzaneh, Reza Khoshkangini, Pia Saldeen, Lars Johansson, Thomas Ebner

Affiliations: Department of Computer Science and Media Technology, Malmö University

AI summary: Engineers 55 context-aware temporal features that capture incubator microenvironment dynamics and combines them with a hierarchical Bayesian Beta regression model that shares environmental effects across clinics, cutting prediction error from 3-5% to 1.27% and achieving R² = 0.86 and a 64% error reduction at a Northern European clinic.

Abstract

IVF pregnancy rates are routinely modeled using patient-level variables, while high-resolution laboratory environmental data remain underutilized. We show that this is a missed opportunity. Rather than relying on raw sensor averages, we engineer 55 context-aware temporal features, including rolling thermal stability, simultaneous temperature-humidity adherence, peak stress duration, and post-stress recovery speed, that capture the dynamics of incubator microenvironments. On 61 weeks of data from an Asian IVF clinic, these features reduce cross-validated prediction error to 1.27%, compared to 3-5% for raw averages. We then train a hierarchical Bayesian Beta regression model that shares environmental effects across an Asian and a Northern European clinic via partial pooling, while preserving site-specific baselines. On held-out data from the Northern European clinic, the model achieves R² = 0.86 and a 64% error reduction for the 35-39 age group over a naive baseline, demonstrating that structured environmental monitoring contains clinically meaningful, transferable signal.
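
To make the contrast with raw sensor averages concrete, here is a minimal sketch of one temporal feature. The abstract does not define "rolling thermal stability", so a rolling standard deviation of temperature is used as an assumed stand-in; the function name, window size, and readings are hypothetical.

```python
from statistics import pstdev


def rolling_std(readings: list[float], window: int) -> list[float]:
    """Population standard deviation over each full trailing window."""
    if window < 1 or window > len(readings):
        raise ValueError("window must be between 1 and len(readings)")
    return [
        pstdev(readings[i - window : i])
        for i in range(window, len(readings) + 1)
    ]


# Hypothetical incubator temperatures in degrees Celsius, with one brief dip.
temps = [37.0, 37.0, 37.0, 36.4, 37.0, 37.0]
stability = rolling_std(temps, window=3)
```

A plain average over these readings would barely move, whereas the rolling feature is zero while the temperature is steady and rises in every window that contains the dip.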

2606.19630 2026-06-19 cs.AI cs.DL cs.SY eess.SY New submission

AI4SE and SE4AI Exploration: A Decade Looking Back and Forward

H. Sinan Bank, Daniel R. Herber, Thomas Bradley

Affiliations: Colorado State University

AI summary: Reviews progress in AI and systems engineering across three phases, identifies five critical research gaps through a human-AI agreement literature review, and offers guidance on AI adoption, assurance, and workforce transformation.

Comments 10 pages, 5 figures

Abstract

The March 2020 INCOSE INSIGHT special issue on AI and Systems Engineering (SE) became the most downloaded issue in the publication's history and launched a research community that now draws over 250 registrants to its annual workshop. In this article, we trace the progress in AI and SE across three phases (labeled here foundational, applied, and LLM inflection) based on the authors' reading of the field's core papers, and describe our opinions of where the community has converged and where critical gaps remain. Separately, a human-AI agreement literature review leveraging both human expertise and six AI models was performed to assess the relevance of 1,712 INCOSE INSIGHT articles and 889 SERC publications. The results identify five critical research gaps and offer guidance for practitioners navigating AI adoption, assurance, and workforce transformation in SE. We share the agreement data and the AI4SE/SE4AI Explorer web application so readers can compare their own relevance judgments with the human and AI raters.

2606.19753 2026-06-19 cs.AI cs.SE New submission

Grounded Inference: Principles for Deterministically Encapsulated Generative Models

Marty O'Neill

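AI summary: Proposes four architectural primitives for hybrid AI systems that deterministically encapsulate probabilistic models, identifies two industry anti-patterns, and provides a foundational framework for integrating AI with conventional systems.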

Comments 12 pages, 3 figures

2606.19924 2026-06-19 cs.AI New submission

The Tao of Agency: Autotelic AI, Embedded Agency and Dissolution of the Self

Aritra Sarkar

AI summary: Examines agents that generate their own goals in autotelic AI, tracing the consequences through intrinsic motivation, resource-driven priors, causal-interventional learning, homeostasis, and embeddedness; it finds that embeddedness is necessary but not sufficient, argues that the core problem is how the agent generates and relativizes its self, and closes with a quantum formulation, a philosophical reading, and a concrete LLM-based instantiation.

Abstract

Most artificial intelligence systems are built on the assumption that goals are exogenous and specified by the designer. Exploring what happens when an agent begins generating its own goals opens the field of autotelic AI. Agents are expected not merely to pursue objectives but to discover them. In this article, we trace its consequences through intrinsic motivation, resource-driven priors, causal-interventional learning, homeostasis, and embeddedness; the last of which is found to be a necessary but not sufficient condition for autotelic agency. Embeddedness individuates the agent at the cost of revealing that the individuation is non-unique, such that the same dynamics admit many valid partitions, each defining a different candidate self. The deepest problem with autotelic AI is therefore not how the agent generates goals, but how it generates and relativizes the self to which the goals are assigned. The agent must believe in its own boundary in order to act, and see through that boundary in order to understand. We consolidate these developments into a single framework and extend it along three directions: a quantum formulation in which the agent-environment cut becomes physical, a philosophical reading against non-dual contemplative traditions, and a concrete LLM-based agentic instantiation.

2606.20231 2026-06-19 cs.AI cond-mat.stat-mech cs.IT math-ph math.IT math.MP nlin.AO New submission

Thermodynamic Measure of Intelligence

Ishanu Chattopadhyay

Affiliations: Institute for Biomedical Informatics, University of Kentucky; Department of Computer Science, University of Kentucky

AI summary: Proposes that intelligence is the lawful amplification of rare but valid futures, realized through recursive self-simulation, and gives a thermodynamic measure for it, showing that this architecture is necessary and nearly sufficient for high intelligence.

Abstract

Can intelligence be measured? We propose that intelligence can be defined as the lawful amplification of rare but valid futures: a system increases the probability of outcomes that would be unlikely under passive dynamics but remain admissible under the constraints of the domain. We start with the premise that an intelligent system must model the world and its own place within it. Because the system is part of the world it models, this leads naturally to recursive self-simulation: the system represents futures in which its own actions are part of the trajectory. Our central results give a necessity statement and a conditional near-sufficiency statement connecting this architecture to a precise thermodynamic measure of lawful amplification of rare-valid futures: high rare-valid lift is impossible unless the internal simulation identifies rare-valid futures with high fidelity; conversely, when rare-valid fidelity is high and the simulation contains an effective policy, the achievable lift approaches the actuation-limited optimum. Thus recursive self-simulation is not merely a plausible feature of intelligence but, under the stated assumptions, is necessary and nearly sufficient for high thermodynamic intelligence. The resulting framework makes intelligence measurable on a universal scale, from passive matter and feedback controllers, large language models, and humans as text generators to Maxwell-demon-like information engines.

1. Agents, Planning, and Decision-Making (7 papers)

Uncertainty Decomposition for Clarification Seeking in LLM Agents

AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research

RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

2. Knowledge Representation, Reasoning, and Symbolic AI (2 papers)

Process-Verified Reinforcement Learning for Theorem Proving via Lean

DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs

3. Multi-Agent Systems and Game Theory (5 papers)

Hidden Anchors in Multi-Agent LLM Deliberation

Exit-and-Join Dynamics for Decentralized Coalition Formation

Multi-Agent Transactive Memory

Autonomous Event-Driven Multi-Agent Orchestration for Enterprise AI at Scale

A Multi-Agent system for Multi-Objective constrained optimization

4. Search, Optimization, and Constraint Solving (3 papers)

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

Optimal Scheduling in a Question-Answering Forum of Knowledge Workers

Residual-Space Evolutionary Optimization via Flow-based Generative Models

5. Machine Learning and Representation Learning (9 papers)

Diffusion Language Models: An Experimental Analysis

ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

Which Pairs to Compare for LLM Post-Training?

Denoising Implicit Feedback for Cold-start Recommendation

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

Modularity-Free Conflict-Averse Training for Generalized PINNs

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Toward Calibrated Mixture-of-Experts Under Distribution Shift

6. Natural Language and Multimodal Intelligence (5 papers)

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

SoftSkill: Behavioral Compression for Contextual Adaptation

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

7. Robotics and Embodied Intelligence (5 papers)

PhysDrift: Bridging the Embodiment Gap in Humanoid Co-Speech Motion Generation

Advancing DialNav through Automatic Embodied Dialog Augmentation

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

Reward as An Agent for Embodied World Models

Lagrange: An Open-Vocabulary, Energy-Based Sparse Framework for Generalized End-to-End Driving

8. Trustworthiness, Safety, and AI Governance (7 papers)

Deontic Policies for Runtime Governance of Agentic AI Systems

LLM Doesn't Know What It Doesn't Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data

Emergent Alignment

Analyzing the Narration Gap in LLM-Solver Loops

GLARE: A Natural Language Interface for Querying Global Explanations

Human-on-the-Loop Orchestration for AI-Assisted Legal Discovery

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

9. Evaluation, Benchmarks, and Datasets (11 papers)

Measuring Curriculum Alignment across Topical Coverage, Competency, and Cognitive Depth: A Longitudinal Framework Applied to CS2013 and CS2023

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Benchmarking Agentic Review Systems

ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models

BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

Beyond Accuracy: Measuring Logical Compliance of Predictive Models

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

10. AI Applications and Systems (15 papers)

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

REVEAL++: Differentiable Phenotypic Grouping for Vision-Language Retinal Modeling of Alzheimer's Disease Risk

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation

A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition

TelcoAgent: A Scalable 5G Multi-KPM Forecasting With 3GPP-Grounded Explainability

eCNNTO: A Highly Generalizable ConvNet for Accelerating Topology Optimization

Multi-Head Attention-Based Feature Extractor Integration with Soft Actor-Critic for Porosity Prediction and Process Parameter Optimization in Additive Manufacturing

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

Implicit Semantic-Aware Communication Based on Hypergraph Reasoning

Augmenting Game AI with Deep Reinforcement Learning

Confidence-Aware Automated Assessment of Student-Drawn Scientific Models

Leveraging systems' non-linearity to tackle the scarcity of data in the design of Intelligent Fault Diagnosis Systems

Interpretable Sperm Morphology Classification via Attention-Guided Deep Learning

Context-Aware Hierarchical Bayesian Modeling of IVF Laboratory Environmental Conditions

11. Other / General AI (4 papers)