arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.23942 2026-05-26 cs.AI

A Dynamical Framework for Cognitive Processes Based on Transformations and Semantic Equivalence

Carlo Cattani, Dioneia Motta Monte-Serrat

AI summary: Proposes a dynamical framework based on transformations and semantic equivalence that models cognitive processes through an iterative update rule, uses fixed-point arguments and contraction conditions to ensure stability, and illustrates trajectories of context-dependent interpretation in a linguistic application.

Abstract

This paper proposes a structural and dynamical framework for modeling cognitive processes within a cybernetic perspective. Cognitive states are represented as elements of a state space evolving through an iterative update rule of the form \[ X_{t+1} = \pi\big(F(f(X_t))\big), \] where $f$ describes internal transformations, $F$ represents interpretative mappings, and $\pi$ enforces semantic equivalence. The model is interpreted as a feedback system integrating transformation, observation, and stabilization. A categorical formulation is introduced to capture compositional structure, while the associated dynamics are analyzed through fixed-point arguments and contraction conditions ensuring stability. To demonstrate the operational character of the framework, a computational illustration is provided, together with a qualitative analysis of the induced dynamics. A concrete linguistic application shows how context-dependent interpretation can be modeled as a trajectory toward a stable semantic class. The proposed approach connects dynamical systems, category theory, and cognitive modeling, and provides a unified representation of cognition as a feedback-driven process evolving toward invariant interpretations.
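The update rule in this abstract can be illustrated with a toy iteration. The sketch below is not the authors' computational illustration: the concrete choices of `f` (an affine contraction), `F` (the identity), and `pi` (a quantizer that maps each state to a class representative) are all assumptions made only to show a trajectory settling on a fixed point. The quantizer is not itself a contraction, so this demonstrates convergence from one starting state rather than the paper's stability conditions.

```python
import math

def f(x: float) -> float:
    """Hypothetical internal transformation: an affine contraction."""
    return 0.5 * x + 1.0

def F(x: float) -> float:
    """Hypothetical interpretative mapping: the identity, for simplicity."""
    return x

def pi(x: float, step: float = 0.25) -> float:
    """Hypothetical semantic projection: send x to the lower grid point
    of its equivalence class of width `step`."""
    return math.floor(x / step) * step

def iterate(x0: float, max_steps: int = 100) -> list:
    """Apply X_{t+1} = pi(F(f(X_t))) until the state stops changing."""
    trajectory = [x0]
    for _ in range(max_steps):
        x_next = pi(F(f(trajectory[-1])))
        if x_next == trajectory[-1]:  # fixed point reached
            break
        trajectory.append(x_next)
    return trajectory

trajectory = iterate(10.0)
print(trajectory)
```

With these toy maps the last element of `trajectory` is a state that the composite map leaves unchanged, which is the sense of "stable semantic class" the abstract describes.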

2605.23941 2026-05-26 cs.AI cs.RO

MEMOR-E: In-Context and Fine-Tuned LLM Personalization for Alzheimer's Assistive Robotics

Maissa Abir Smaili, Eren Sadikoglu, Ransalu Senanayake

AI summary: Presents MEMOR-E, a mobile quadruped robot that combines fine-tuned and in-context-learning LLMs to provide personalized cognitive support for Alzheimer's patients and explainable human-robot interaction.

Comments 8 pages, 14 figures

Abstract

Alzheimer's disease is a neurodegenerative disorder marked by progressive declines in memory and language that reduce independence in daily life, motivating socially assistive robotic support. This paper presents MEMOR-E, a mobile quadruped robot with an interactive tablet interface that assists patients and caregivers through medication reminders, routine guidance, memory-oriented interactions, and companionship. We evaluated the feasibility of fine-tuning large language models (LLMs) to emulate stage-consistent cognitive behavior and interpret responses across standard neuropsychological language tasks, using audio transcriptions from 235 Alzheimer's patients and synthetically generated healthy controls. We also report findings on using in-context learning (ICL) in LLMs, where a second LLM produced domain-level and severity-level cognitive error summaries. Our results show that MEMOR-E can generate stage-aware, non-diagnostic cognitive summaries that support personalized assistive interactions, while explainable AI mechanisms translate model outputs into transparent, human-readable evidence to enable caregiver oversight and trustworthy human-robot interaction.

2605.23940 2026-05-26 cs.AI cs.CL

Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning

Sebastien Kawada

AI summary: Using the DRIFT-Bench benchmark and the MUS-Repair method, finds that the main failure mode of multi-turn reasoning systems is satisfiable drift rather than logical contradiction; 98-100% of residual errors are satisfiable drift.

Comments Published at ICLR 2026 Workshop on Reasoning and Planning for LLMs. 18 pages. ICLR page: https://iclr.cc/virtual/2026/10017484 Code: https://github.com/kaons-research/drift-bench

Abstract

How do multi-turn reasoning systems fail? The expected answer is logical contradiction, in which the system's maintained state becomes unsatisfiable. We show that the dominant mode is instead satisfiable drift, where the internal state stays consistent while the returned answer silently violates prior commitments. We build DRIFT-Bench (Decomposing Reasoning Into Failure Types), a solver-instrumented benchmark of 816 test problems across three constraint domains, and evaluate four methods on it across four open-weight models (8B-120B parameters). MUS-Repair, which feeds minimal unsatisfiable subsets back to the generator, is strongest in every setting (+1.8 to +15.0 pp over the best non-MUS baseline). But the central finding is what repair leaves behind. After structured feedback, models rarely contradict themselves. They forget. Residual errors are 98-100% satisfiable drift across all settings, while contradiction drops to near zero. Reliable multi-turn systems must separately validate that the returned answer respects the maintained state. Code is available at https://github.com/kaons-research/drift-bench.
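The distinction between contradiction and satisfiable drift can be made concrete with a small checker. This is a minimal sketch, not DRIFT-Bench's solver instrumentation: it assumes constraints are Python predicates over a small finite domain and tests satisfiability by brute force. The function name `classify` and the example constraints are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

Assignment = Dict[str, int]
Constraint = Callable[[Assignment], bool]

def classify(constraints: List[Constraint], variables: Sequence[str],
             domain: Sequence[int], answer: Assignment) -> str:
    """Label one turn as 'contradiction', 'satisfiable drift', or 'ok'."""
    # The maintained state is satisfiable if some assignment meets every constraint.
    satisfiable = any(
        all(c(dict(zip(variables, values))) for c in constraints)
        for values in product(domain, repeat=len(variables))
    )
    if not satisfiable:
        return "contradiction"
    # The state is consistent, but the returned answer may still break a commitment.
    if not all(c(answer) for c in constraints):
        return "satisfiable drift"
    return "ok"

commitments = [lambda s: s["x"] < s["y"], lambda s: s["y"] <= 2]
drift_label = classify(commitments, ["x", "y"], range(4), {"x": 2, "y": 1})
ok_label = classify(commitments, ["x", "y"], range(4), {"x": 0, "y": 1})
clash_label = classify(commitments + [lambda s: s["x"] > s["y"]],
                       ["x", "y"], range(4), {"x": 0, "y": 1})
print(drift_label, ok_label, clash_label)
```

The second check is the one the abstract argues for: validating the returned answer against the maintained state is separate from checking that the state itself is consistent.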

2605.23939 2026-05-26 cs.AI cs.LG

DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

Xirui Liu, Sihang Zhou, Yanning Hou, Rong Zhou, Haoyuan Chen, Maolin He, Siwei Wang, Hao Chen, Jian Huang

AI summary: Proposes the DRIVE framework, which separates historical experience into natural-language reasoning skills and programmatic interaction skills and uses a scene-aware coordination mechanism to address the entanglement of reasoning and interaction knowledge in continually learning web agents, improving average task success rate on WebArena by 7.3 percentage points.

Comments 35 pages, 5 figures

Abstract

Web agents require both high-level reasoning (for task decomposition) and low-level interaction (for manipulating page elements) to carry out different tasks. However, these knowledge types differ fundamentally: reasoning knowledge (e.g., booking a flight requires first searching for routes) is abstract and transferable across websites, while interaction knowledge (e.g., clicking the Search button at a specific coordinate on Site A) depends heavily on page-specific context. Existing methods store experiences uniformly. This creates a dilemma: abstract representations lose executability on concrete pages, while concrete representations fail to generalize across domains. This entanglement limits capability accumulation: on new websites, agents either fail to recognize reusable task logic because of surface-level differences or attempt infeasible actions based on outdated page structures. To disentangle them, we propose DRIVE, a dual-level skill modeling framework that separates historical experience into natural-language reasoning skills, which capture transferable task logic, and programmatic interaction skills, which ground abstract actions in executable operations. A scene-aware coordination mechanism adaptively retrieves and invokes these dual-level skills based on task semantics. DRIVE also uses skill-level reflection to identify hierarchy-specific failure modes, enabling targeted skill library expansion and refinement. Experiments across five WebArena domains show DRIVE attains an average task success rate of 52.8%, exceeding the skill-free baseline by 7.3 percentage points. Further ablations show reasoning and interaction skills provide distinct, complementary benefits, supporting separation of transferable task logic from executable page-level operations.

2605.23938 2026-05-26 cs.AI cs.CY cs.LG

Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors

Long Zhang, Zi-bo Qin, Wei-neng Chen

AI summary: Shows that when LLMs fuse conflicting sensor and user information, format dependence lets natural-language user claims dominate numerical sensor data (authority inversion), and proposes a geometric framework, audit metrics (CIR and AAI), and an inference-time layer-level intervention (GAC) to diagnose and mitigate the problem.

Abstract

Large language models (LLMs) increasingly fuse heterogeneous inputs in ubiquitous systems. Yet, how LLMs implicitly allocate authority when sensor measurements and user claims conflict remains unexamined, raising critical reliability concerns for deployments where physical sensing must retain priority. Unlike explicit traditional fusion, LLMs bury authority allocation within learned representations. We discover this allocation is severely format-dependent: numerical sensor data fails to integrate into answer-relevant model directions, allowing natural-language claims to dominate the final decision, a phenomenon we term Authority Inversion. To diagnose and mitigate this, we develop a geometric framework of context integration, introduce two computable audit metrics, specifically the Context Integration Ratio (CIR) and Authority Alignment Index (AAI), and propose Geometric Authority Calibration (GAC), an inference-time layer-level intervention to suppress misplaced user authority. Evaluating four models (4B to 35B parameters, three architectures) across four datasets totaling 576 conflict instances reveals extreme inversion: on numerical tasks, models exhibit near-zero sensor trust (AAI = -0.805, Cohen's d = -2.14), unaffected by model capacity. Validating our geometric framework, theory-guided causal injection flips 80.2% of incorrect decisions (vs. <0.4% for random controls). Practically, GAC improves HAR accuracy from 0-1.6% to 21.9-27.5%, outperforming prompting baselines. Ultimately, authority allocation in LLM-mediated systems must be explicitly audited and application-specifically configured rather than left implicit.

2605.23936 2026-05-26 cs.AI cs.LG

Fuzzy, Neutrosophic, and Uncertain Graph Theory: Properties and Applications

Takaaki Fujita, Florentin Smarandache

AI summary: This book systematically surveys graph theory under uncertainty, centered on the uncertain graph framework; it unifies fuzzy, neutrosophic, and related models and covers extended graph classes and applications in molecular graphs, decision-making systems, graph neural networks, and other areas.

Comments 326 pages. Publisher: Neutrosophic Science International Association (NSIA) Publishing House. ISBN: 978-197250204-4

Abstract

This book presents a comprehensive and systematic survey of graph theory under uncertainty, with particular emphasis on the unifying role of the uncertain graph framework. It reviews fundamental concepts, structural properties, graph classes, and graph parameters within fuzzy, neutrosophic, and related models, while also introducing a wide range of extensions such as uncertain digraphs, hypergraphs, superhypergraphs, and dynamic graphs. In addition to theoretical developments, the book explores practical applications, including uncertain molecular graphs, decision-making systems, graph neural networks, knowledge graphs, and cognitive maps. By organizing diverse uncertainty-aware graph models within a common perspective, this work provides a coherent framework for understanding their relationships, capabilities, and applications in complex systems.

2605.23935 2026-05-26 cs.AI cs.CY cs.MA cs.SE cs.SY eess.SY

Operationalizing Reconstructive Authority: Runtime Construction, Dependency Resolution, and Execution Gating in Autonomous Agent Systems

Marcelo Fernandez - TraslaIA

AI summary: Proposes a runtime execution model that uses dynamic dependency resolution and a recovery loop to ensure actions execute only when authority can be constructed from the current state, guaranteeing safety and conditional liveness.

Comments Agent Governance Series, Paper P6. Companion papers on arXiv: P0 (2604.17511), P1 (2603.18829), P2 (2604.17517). P3/4 and P5 submitted concurrently (pending arXiv IDs). Zenodo: 10.5281/zenodo.19699460

Abstract

Autonomous agent systems fail not only due to incorrect decisions, but due to executing decisions whose authority no longer holds at runtime. Prior work defined Reconstructive Authority (RAM) as a condition for valid execution: actions are permitted only if authority can be constructed from current state. This paper addresses enforcement at runtime: how to enforce this condition in a running system. We introduce a runtime execution model in which authority is evaluated at action time and execution is conditioned on its constructibility. This extends the execution state space beyond admit/deny with a third state, halt, representing cases where authority is undefined due to incomplete or uncertain observability. We define a concrete execution protocol including dynamic dependency resolution, authority reconstruction, and explicit decision semantics. We further introduce a Recovery Loop that integrates drift detection (IML) with execution control (ACP), allowing the system to suspend execution, acquire missing information, and re-attempt authority reconstruction. We show that this model guarantees safety -- no action is executed without constructible authority -- and conditional liveness: execution resumes when authority-defining variables become observable. This work operationalizes reconstructive authority as a runtime enforcement mechanism, providing the execution semantics required to apply RAM in real systems.
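The three-state gate and recovery loop described in this abstract might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's protocol: authority is modeled as a set of named boolean dependencies, an unobservable dependency is represented by `None`, and the names `gate`, `recovery_loop`, and `acquire` are hypothetical.

```python
from enum import Enum
from typing import Callable, Dict, Optional, Sequence

class Decision(Enum):
    ADMIT = "admit"
    DENY = "deny"
    HALT = "halt"

def gate(required: Sequence[str], state: Dict[str, Optional[bool]]) -> Decision:
    """Evaluate authority at action time.

    HALT  if any required variable is unobservable (missing or None),
    DENY  if all are observable but at least one is False,
    ADMIT only if every required variable is observably True.
    """
    values = [state.get(name) for name in required]
    if any(v is None for v in values):
        return Decision.HALT
    return Decision.ADMIT if all(values) else Decision.DENY

def recovery_loop(required: Sequence[str], state: Dict[str, Optional[bool]],
                  acquire: Callable[[str], Optional[bool]],
                  max_attempts: int = 3) -> Decision:
    """On HALT, suspend, try to observe missing variables, then re-evaluate."""
    decision = gate(required, state)
    attempts = 0
    while decision is Decision.HALT and attempts < max_attempts:
        for name in required:
            if state.get(name) is None:
                state[name] = acquire(name)  # may still return None
        decision = gate(required, state)
        attempts += 1
    return decision

state = {"token_valid": True, "scope_ok": None}
first = gate(["token_valid", "scope_ok"], state)
observations = {"scope_ok": True}
final = recovery_loop(["token_valid", "scope_ok"], state, observations.get)
print(first, final)
```

In this toy version the safety property is visible by construction: `ADMIT` is returned only when every dependency is observed and true, and a halted action resumes only after the missing variable becomes observable.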

2605.23934 2026-05-26 cs.AI quant-ph

Practical Quantum CIM Empowerment via All-Domestic-Core Agentic Large Model

Wang Rui, Lu Diannan

AI summary: Integrates a femtosecond laser-pumped coherent Ising machine with an LLM-driven agentic system to perform QUBO/Ising model calibration, constraint weight decision iteration, and rapid validation of literature-reported schemes, entirely on domestic large models and hardware, and reports a new paradigm in which agent-assisted quantum computing iterations in turn enhance the agent's problem-solving capability.

Comments 21 pages, 7 figures

Abstract

Quantum computing devices are recognized as powerful tools for solving NP-complete problems. However, the intricacy of their modeling presents notable barriers for non-specialists, while the tedious iteration of constraint weights and modeling methodologies also consumes substantial expert effort. To address these challenges, this study integrates a femtosecond laser-pumped Coherent Ising Machine (CIM) with an LLM-driven agentic system by leveraging the LangGraph and LangChain frameworks. Comprehensive investigations demonstrate that large language models (LLMs) can effectively perform modeling tasks such as QUBO/Ising model calibration, constraint weight decision iteration, and rapid validation of literature-reported schemes. Notably, all these tasks can be fully implemented with domestic large models; combined with domestically developed CIM hardware, this achieves practical empowerment of the quantum CIM that relies entirely on all-domestic agentic large models and hardware. This work realizes robust technological integration, laying a solid foundation for subsequent research. Nevertheless, it also identifies the persisting challenges in the two cutting-edge fields of large models and quantum computing at the current stage. Encouragingly, we unexpectedly discover a promising new paradigm where accumulated knowledge from agent-assisted quantum computing iterations reciprocally enhances the agent's own problem-solving capability, thereby addressing these challenges.

2605.23932 2026-05-26 cs.AI cs.CL cs.CY cs.LG

When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure

Boyu Xiao, Xiuqi Tian, Xuwen Song, Haochun Wang, Guanchun Song, Sendong Zhao, Bing Qin

AI summary: Studies the belief stability of LLMs under escalating pressure in clinical dialogue, proposes the Med-Stress stress-test framework, identifies a knowledge-resilience gap, and designs RBED and R-FT to improve robustness.

Comments ACL 2026

Abstract

Despite strong medical benchmark accuracy, LLMs can exhibit severe multi-turn sycophancy in clinical dialogue, abandoning an initially correct diagnosis under escalating pressure. We propose Med-Stress, a targeted stress-test framework that evaluates belief stability under escalating pressure. Across nine frontier large language models (LLMs), we find a clear dissociation between medical knowledge and robustness: high initial diagnostic capability does not imply high belief stability, yielding large knowledge-robustness gaps for several LLMs. To mitigate this failure mode, we propose a lightweight inference-time defense, RBED (Role-Based Epistemic Defense), and R-FT (Resilience-oriented Fine-Tuning), a training-time approach that internalizes evidence-based resistance to pressure. Experiments show that R-FT nearly eliminates belief change and substantially improves robustness.

2605.23931 2026-05-26 cs.AI cs.PL cs.SE

BODHI: Precise OS Kernel Specification Inference

Zhiming Chang, Ziyang Li

AI summary: Proposes BODHI, a domain knowledge prompting method that augments few-shot prompting with a structured C-to-Python translation guide, raising Pass@1 on OSV-Bench from 55.10% to 96.73% and narrowing the gap between general-purpose code generation and formal specification synthesis.

Abstract

The formal verification of operating system kernels requires precise specifications that capture the intended behavior of system calls. Writing these specifications manually demands deep domain expertise, motivating the use of large language models (LLMs) to automate the process. However, in OSV-Bench, a benchmark of 245 specification generation tasks derived from the Hyperkernel OS kernel, the best reported Pass@1 is 55.10%. We propose a domain knowledge prompting method (BODHI), which augments the standard few-shot prompt with a structured C-to-Python translation guide covering 15 categories of domain-specific translation patterns. Inspired by Structured Chain-of-Thought (SCoT) prompting, the guide organizes translation by separation of concerns, addressing pre-condition extraction and post-condition generation as distinct categories. Evaluated on nine models from six providers (Anthropic, Mistral, Amazon, DeepSeek, Meta, Alibaba), covering dense, mixture-of-experts and reasoning architectures, BODHI improves every model tested, with gains ranging from +11% to +32%. The best configuration (Claude Opus 4.6 + BODHI) reaches 96.73% Pass@1. BODHI reduces both syntax and semantic errors, with the strongest effect on models that have sufficient instruction-following capability to utilize structured reference material. These results demonstrate that domain knowledge injection is a model-agnostic technique that substantially bridges the gap between general-purpose code generation and formal specification synthesis.

2605.23930 2026-05-26 cs.AI cs.LG cs.MA

Quantum Frog: Emergent Cooperation and Difficulty Scaling in a Quantized-Time Cooperative Game

Saad Mankarious

AI summary: Analyzes the quantized-time cooperative game Quantum Frog with reinforcement learning, finding that synchronised rushing is the optimal strategy and that cooperative training substantially raises success rate and shortens episodes.

Abstract

We introduce Quantum Frog, a two-player cooperative game built on a novel quantized-time mechanic in which the environment advances only when a player acts. Inspired by the classic arcade game Frogger, Quantum Frog requires two frogs to cross an 8×8 grid of traffic and reach the far side together. We use reinforcement learning (RL) as an analytical lens to answer four design questions: (1) how does game difficulty scale with traffic density, (2) what is the optimal single-agent policy and why, (3) how large is the cooperation gap between independent and cooperative two-agent play, and (4) what joint strategy emerges when agents are incentivised to cooperate? We train agents through five escalating stages, Tabular Q-Learning, Deep Q-Network (DQN), Independent DQN (IDQN), and Multi-Agent Proximal Policy Optimisation (MAPPO with a centralised critic), evaluating each against traffic densities of one to six cars. Our key findings are: (i) the quantized-time mechanic makes a rush strategy (moving directly upward at every step) universally optimal, as time exposure to traffic is minimised; (ii) adding an uncoordinated second player is harder than sextupling the traffic for a single expert player; (iii) cooperative training recovers +32-34 percentage points of joint success rate relative to independent agents and reduces episode length from ~90 to ~6 steps; and (iv) the emergent cooperative strategy is synchronised rushing, not complex positional coordination, illustrating that shared incentives alone suffice to align agents in time-critical cooperative tasks. These findings provide concrete, empirically grounded guidance for the commercial design of Quantum Frog and offer broader insights into the role of environment mechanics in shaping multi-agent learning dynamics.

2605.23929 2026-05-26 cs.AI cs.SE

Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs

Ya-Ting Yang, Quanyan Zhu

AI summary: Models the performance of LLM and non-LLM agents with a parametric exponential reliability function, proposes a water-filling token allocation policy, and characterizes optimal workflow reliability in terms of shadow prices to address tradeoffs among latency, reliability, and cost.

Abstract

Modern AI systems increasingly rely on workflows composed of multiple interacting agents, some powered by large language models (LLMs) and others by conventional computational modules. This paper analyzes the fundamental tradeoffs between latency, reliability, and cost in LLM-enabled agentic workflows. We introduce performance models for both LLM and non-LLM agents that capture the relationship between computational effort and output quality, incorporating the impact of reasoning and output tokens for LLM agents using a parametric exponential reliability function. Then, we study the design of sequential workflows under latency and cost constraints. Main results include a water-filling token allocation policy and characterizations of optimal workflow reliability in terms of shadow prices.
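To give a feel for how a water-filling allocation and a shadow price arise together, the sketch below solves one simple instance. Everything specific here is an assumption rather than the paper's model: agent i is given reliability r_i(t) = 1 - b_i * exp(-a_i * t) for t tokens (with 0 < b_i < 1), the sequential workflow's reliability is the product of the r_i, and the only constraint is a total token budget. Under those assumptions the optimality conditions give t_i = max(0, ln(b_i * (1 + a_i / lam)) / a_i), where the multiplier `lam` plays the role of the shadow price and is found by bisection.

```python
import math
from typing import List, Tuple

def allocate(a: List[float], b: List[float], budget: float,
             iters: int = 200) -> Tuple[List[float], float]:
    """Maximize sum_i log(1 - b[i]*exp(-a[i]*t_i)) s.t. sum_i t_i <= budget, t_i >= 0."""

    def tokens(lam: float) -> List[float]:
        # Stationarity: a / (exp(a*t)/b - 1) = lam, clipped at zero tokens.
        return [max(0.0, math.log(bi * (1.0 + ai / lam)) / ai)
                for ai, bi in zip(a, b)]

    # At the largest zero-token marginal gain nobody receives tokens (feasible).
    hi = max(ai * bi / (1.0 - bi) for ai, bi in zip(a, b))
    lo = 1e-12  # a very small price makes every agent ask for many tokens
    for _ in range(iters):
        lam = math.sqrt(lo * hi)  # bisection on a log scale
        if sum(tokens(lam)) > budget:
            lo = lam  # over budget: raise the price
        else:
            hi = lam  # within budget: try a lower price
    return tokens(hi), hi

a = [0.02, 0.01]  # hypothetical per-token learning rates
b = [0.5, 0.9]    # hypothetical failure probabilities at zero tokens
alloc, price = allocate(a, b, budget=300.0)
print(alloc, price)
```

At the returned allocation every agent that receives tokens has the same marginal gain in log-reliability per token, equal to `price`; agents whose zero-token marginal gain is already below the price receive nothing, which is the threshold behavior that gives water-filling its name.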

2605.23928 2026-05-26 cs.AI cs.CL cs.DC cs.MA cs.PL cs.SE

Context: Proactive Goal-Directed Intelligence via Composable Sandboxed Programs, Declarative Wiring, and Structured Interaction

Gregory Magarshak

AI summary: Proposes the Context architecture, which achieves proactive goal-directed intelligence through composable sandboxed programs, declarative wiring, and structured interaction, and proves its advantages in cost, correctness, and efficiency.

Comments 7 pages; third in a series with arXiv:2501.XXXXX (Magarshak Machine / SPACER) and arXiv:2502.XXXXX (Grokers)

Abstract

We present Context, the intelligence layer of the Magarshak Architecture, which replaces reactive query-response chatbots with proactive goal-directed agents that advance shared tasks without waiting for user prompts. The architecture rests on three mutually reinforcing mechanisms. Write-time context assembly precomputes enriched typed attributes via Groker agents, assembling interaction context as a deterministic pure function of graph state; context blocks are byte-identical across turns between semantic changes, enabling near-100% KV-cache reuse. Composable sandboxed wisdom programs form a governed library of LM-generated imperative programs declaratively wired to goal types via typed stream relations, composed via phase ordering, and executed at interaction time without further LM calls. Proactive goal stream state machines drive conversations toward terminal states by inspecting graph state and emitting structured interaction content (option arrays, governance affordances, clarification prompts) without awaiting user input. We prove six formal results: the Context Stability Theorem, bounding per-turn LM cost as a function of semantic change rate; a Program Composition Correctness Theorem; a Declarative Wiring Soundness Theorem; the Proactive Dominance Theorem, proving proactive agents weakly dominate reactive agents on expected turns-to-terminal-state; Coordination Overhead Elimination and Quality Preservation, establishing Pareto improvements in multi-participant goal chats; and a Cross-Platform Vote Consistency Theorem. Implemented in the open-source Qbix / Safebox / Safebots stack.

2605.23926 2026-05-26 cs.AI cs.LG

How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning

Zhiyuan Zhai, Xinkai You, Wenjing Yan, Xin Wang

AI summary: Formalizes a measure of reasoning redundancy, quantifies step-level redundancy of 61%-93% for frontier reasoning models on mathematical benchmarks, and proves that this redundancy is a structural consequence of length-agnostic outcome rewards rather than a model-specific artefact.

Abstract

Reasoning-capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and energy. Casual inspection of their traces reveals extensive reformulation, verification, and circular self-reflection, yet how much of this deliberation is actually necessary has never been measured at scale or explained from first principles. This paper closes both gaps. We formalise reasoning redundancy directly in terms of the reasoning model itself: the redundancy of a correct trace is the largest fraction of its trailing segmented steps that can be truncated while the model $\pi$, forced to terminate thinking and emit a final answer, still produces the correct answer. A large-scale quantification across four frontier reasoning models and two mathematical benchmarks shows that step-level redundancy is consistently high (between 61% and 93% across the 8 (model, benchmark) conditions we study, with the median critical prefix equal to a single segmented step in six of the eight conditions), that the finding is robust to the choice of judge family, and that although the redundancy $\rho$ decreases with problem difficulty on MATH-500, all four models remain substantially redundant ($\rho \in [46\%, 85\%]$) even on the hardest Level-5 problems. We then prove that this redundancy is a structural consequence of length-agnostic outcome rewards, not a model-specific artefact: under any such reward, no finite expected stopping time is optimal. The result holds regardless of RL algorithm, base model, data distribution, or whether the policy is obtained via RL or distillation; over-thinking is therefore not a bug to be patched in individual models but a structural property of how current reasoning models are trained. Code: https://github.com/zhiyuanZhai20/how-much-thinking-is-enough
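The redundancy measure defined in this abstract can be sketched in a few lines. This is a hedged illustration, not the authors' released code: the callable `is_correct_after` is a hypothetical stand-in for "force the model to stop thinking after this prefix and check its final answer", and the sketch assumes at least one step is always kept.

```python
from typing import Callable, Sequence

def redundancy(steps: Sequence[str],
               is_correct_after: Callable[[Sequence[str]], bool]) -> float:
    """Largest fraction of trailing steps that can be cut with the answer still correct."""
    n = len(steps)
    if n == 0:
        raise ValueError("trace has no steps")
    # The shortest prefix that still yields a correct answer is the critical prefix.
    for k in range(1, n + 1):
        if is_correct_after(steps[:k]):
            return (n - k) / n
    raise ValueError("trace is not correct even when kept in full")

# Toy trace: the answer is reachable once the computation step is present.
trace = ["restate the problem", "compute 6 * 7 = 42", "verify the product", "re-check again"]
toy_oracle = lambda prefix: any("42" in step for step in prefix)
rho = redundancy(trace, toy_oracle)
print(rho)
```

A linear scan is used rather than binary search because correctness of a forced answer need not be monotone in prefix length; in practice each oracle call would be a model query, so the scan order matters for cost.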

2605.23924 2026-05-26 cs.CL cs.IR q-fin.GN

Improving the Completeness and Comparability of Segment Disclosures: A Large Language Model Approach

Yue Liu, Zhiyuan Cheng, Longying Lai

AI summary: Develops an LLM-based framework that extracts segment disclosures directly from Form 10-K filings, preserves reportable and nested segment information, and adds a retrieval-augmented system to support cross-firm and cross-period comparability, addressing completeness and comparability problems with segment data in structured databases.

Comments 39 pages, 4 figures, submitted to Accounting Horizons

Abstract

Segment-level disclosures are a central component of financial reporting, providing insight into firms' internal organization and the allocation of economic activities across operating units. However, segment information is often presented in both qualitative and quantitative forms, dispersed across tables and narrative sections of Form 10-K filings. Empirical research relying on structured databases faces both completeness and comparability challenges, as some firm-year observations may be missing, nested segment disclosures are not captured, and support for longitudinal and cross-firm comparability is limited. This study develops a large language model-based framework to extract segment disclosures directly from Form 10-K filings and to preserve both reportable and nested segment information. We further design a retrieval-augmented system that incorporates information across multiple filings to support comparability. We use two representative settings to demonstrate its application: longitudinal analysis within a firm to interpret segment changes over time, and cross-firm alignment of geographic segments for firms with different reporting structures. The results indicate that the artifact accurately extracts segment-level information and effectively addresses questions that require cross-period knowledge, demonstrating the potential of LLM-based approaches to enhance the measurement and interpretation of segment disclosures.

2605.23917 2026-05-26 cs.CL

Multi-Persona Debate System for Automated Scientific Hypothesis Generation

用于自动科学假设生成的多角色辩论系统

Jaeha Oh, Byungchan Kim, Ju Li, Yang Jeong Park, Jin-Sung Park

AI总结 提出多角色辩论系统(MPDS),结合文献检索、长上下文大语言模型推理、语料驱动角色归纳和结构化多智能体辩论,自动生成科学假设,在电池材料研究中验证其有效性。

Comments 31 pages with 7 main figures, 4 supplementary figures and 1 supplementary table

详情
AI中文摘要

现代科学发现的瓶颈不在于数据稀缺,而在于无法将碎片化知识综合为可操作的假设。这一挑战在电池材料研究中尤为突出,因为电化学性能、界面行为和制造可行性必须同时优化。在此,我们提出多角色辩论系统(MPDS),这是一个基于文献的自动科学假设生成框架,结合了文献检索、长上下文大语言模型推理、语料驱动角色归纳和结构化多智能体辩论。MPDS构建最多500篇论文的文献快照,将智能体基于角色特定的证据池,并进行三轮引文感知辩论,随后由主持人综合,从而在保持证据可追溯性的同时实现角色间的协商。我们使用时间控制协议评估MPDS,排除对目标论文的直接访问,包括两个留出的电池材料案例研究和30个匹配案例的盲比较。在钠离子负极和全固态电池正极设计任务中,MPDS恢复了与实验验证解空间一致的设计逻辑,并生成了比简单基线机制更明确、更具工艺意识的提案。为了评估角色和辩论的影响,我们引入了综合假设质量评分。在消融研究中,MPDS在五种条件下获得了最高平均分,其最大优势在于跨视角整合。实验室后续表明其作为识别工作流程中实际瓶颈的诊断辅助工具的实用性。这些结果表明,在耦合工程约束下,对文献快照的结构化辩论改善了假设形成,并为文本密集型科学发现提供了可重用工作流程。

英文摘要

Modern scientific discovery is bottlenecked not by data scarcity, but by the inability to synthesize fragmented knowledge into actionable hypotheses. This challenge is especially acute in battery materials research, where electrochemical performance, interfacial behavior, and manufacturing feasibility must be optimized simultaneously. Here, we present the Multi-Persona Debate System (MPDS), a literature-grounded framework for automated scientific hypothesis generation that combines literature retrieval, long-context large language model reasoning, corpus-driven persona induction, and structured multi-agent debate. MPDS constructs literature snapshots of up to 500 papers, grounds agents in role-specific evidence pools, and conducts a three-round citation-aware debate followed by moderator synthesis, enabling negotiation between personas while preserving evidence traceability. We evaluate MPDS using a temporally controlled protocol excluding direct access to target papers, including two held-out battery-materials case studies and a blinded comparison across 30 matched cases. In sodium-ion anode and all-solid-state battery cathode design tasks, MPDS recovered design logics aligned with experimentally validated solution spaces and generated more mechanistically explicit, process-aware proposals than simpler baselines. To assess the impact of personas and debate, we introduce Integrative Hypothesis Quality scoring. In ablation studies, MPDS achieved the highest mean score among five conditions, with its largest advantage in cross-perspective integration. A laboratory follow-up suggests utility as a diagnostic aid for identifying practical bottlenecks in workflows. These results indicate that structured debate over literature snapshots improves hypothesis formation under coupled engineering constraints and provides a reusable workflow for text-intensive scientific discovery.

2605.23912 2026-05-26 cs.CL cs.AI cs.SD

Raon-Speech Technical Report

Raon-Speech 技术报告

Beomsoo Kim, Changho Choi, Dohyun Kim, Dongki Lee, Ethan Ewer, Eunchong Kim, Gyeongman Kim, Haechan Kim, Hyeonghwan Kim, Inkyu Park, Jihun Yun, Jihwan Moon, Jiyun Kim, Joonghyun Bae, Junhyuck Kim, Minkyu Kim, Sehun Lee, Seungjun Chung, Sungwoo Cho, Dongmin Park, Dongwon Kim, Hara Kang, Jonghyun Lee, Keon Lee, Kangwook Lee, Jaewoong Cho

AI总结 本文提出 Raon-Speech,一个 9B 参数的语音语言模型,通过多阶段训练实现英语和韩语的语音理解、回答与生成,并扩展为全双工对话模型 Raon-SpeechChat,在语音任务上超越同类模型。

详情
AI中文摘要

我们提出了 Raon-Speech,一个在英语和韩语语音理解、回答和生成方面表现优异的 9B 参数语音语言模型(SpeechLM),以及 Raon-SpeechChat,一个用于自然实时对话的高性能全双工扩展。Raon-Speech 成功地将预训练的大语言模型(LLM)转换为既能理解又能生成语音的 SpeechLM,同时保留了强大的文本能力。它在 138 万小时精心策划的英语和韩语语音及文本数据集上训练,训练阶段包括:(1) 语音模块对齐,(2) 基于知识蒸馏的端到端 SpeechLM 预训练,以及 (3) 基于多任务偏好优化的后训练。在 42 个英语和韩语语音及文本基准测试中,与包括 Qwen2.5-Omni 和 Fun-Audio-Chat 在内的八个近期类似规模的音频基础模型相比,Raon-Speech 在语音中心任务上建立了最强的整体表现,同时保留了强大的文本问答性能。在此基础上,Raon-SpeechChat 通过在 119K 小时的时间对齐的真实和合成对话数据上进行持续训练,实现了自然的全双工对话。它通过三个互补的训练阶段进行:(1) 因果编码器适应,(2) 全双工预训练,(3) 用于语音和角色控制的全双工微调。在多个全双工基准测试中,Raon-SpeechChat 在 FDB v1.0 涵盖的轮流发言和中断敏感行为上显示出最明显的优势,并在更广泛的全双工评估套件中保持竞争力。我们开源了所有模型检查点、训练和推理流程以及交互式演示。

英文摘要

We present Raon-Speech, a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answering, and generation, and Raon-SpeechChat, a high-performing full-duplex extension for natural real-time conversation. Raon-Speech successfully transforms a pre-trained LLM into a SpeechLM that both understands and generates speech while preserving strong text capabilities. It trains on 1.38M hours of highly curated English and Korean speech and text datasets with the following training stages: (1) speech modules alignment, (2) end-to-end SpeechLM pre-training with knowledge distillation, and (3) multi-task preference optimization-based post-training. Across 42 English and Korean speech and text benchmarks, Raon-Speech establishes the strongest overall profile on speech-centric tasks in our comparison against eight similarly sized recent audio foundation models, including Qwen2.5-Omni and Fun-Audio-Chat, while preserving strong text question answering performance. Building upon it, Raon-SpeechChat enables natural full-duplex conversation by continual training on 119K hours of time-aligned real and synthetic dialogue data. It proceeds through three complementary training stages: (1) causal encoder adaptation, (2) full-duplex pre-training, (3) full-duplex fine-tuning for voice and role-control. On multiple full-duplex benchmarks, Raon-SpeechChat shows its clearest strengths on the turn-taking and interruption-sensitive behaviors covered by FDB v1.0, and remains competitive across the broader full-duplex evaluation suite. We open-source all model checkpoints, the training and inference pipeline, and an interactive demo.

2605.23909 2026-05-26 cs.AI cs.LG

Confidence Calibration in Large Language Models

大型语言模型中的置信度校准

Noam Michael, Daniel BenShushan, Jacob Bien, Don A. Moore

AI总结 通过预注册研究,发现大型语言模型(LLMs)的置信度普遍高于准确率,且存在显著的难易效应:困难测试中过度自信,简单测试中信心不足,并提出了LifeEval测试用于评估不同难度下的模型校准。

详情
AI中文摘要

我们研究了大型语言模型(LLMs)在不同任务上的置信度校准情况。预注册研究的结果表明,当前一批LLMs与人类一样,过于确信自己是正确的:平均而言,置信度超过了准确率。然而,重要的是,这种趋势受到强大的难易效应的调节,即在困难测试中过度自信最为严重;相比之下,简单测试实际上显示出明显的信心不足。我们开发了LifeEval,一个用于评估不同难度水平下模型校准的测试。

英文摘要

We investigate the calibration of large language models' (LLMs') confidence across diverse tasks. The results of our preregistered study show that the current crop of LLMs are, like people, too sure they are right: confidence exceeds accuracy, on average. Importantly, however, this tendency is moderated by a powerful hard-easy effect, wherein overconfidence is greatest on difficult tests; by contrast, easy tests actually show substantial underconfidence. We develop LifeEval, a test for evaluating model calibration across levels of difficulty.
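
The overconfidence measure described above (confidence minus accuracy, compared between hard and easy tests) can be illustrated with a minimal sketch. The records below are invented for illustration and are not data from the paper; the function name `overconfidence` is likewise hypothetical.

```python
def overconfidence(records):
    """Mean stated confidence minus accuracy; positive means overconfident."""
    mean_conf = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(1 for _, correct in records if correct) / len(records)
    return mean_conf - accuracy


# Hypothetical (confidence, was_correct) pairs, invented for illustration only.
hard_test = [(0.9, False), (0.8, True), (0.7, False), (0.8, False)]
easy_test = [(0.8, True), (0.7, True), (0.9, True), (0.8, True)]

hard_gap = overconfidence(hard_test)  # positive: overconfident on the hard test
easy_gap = overconfidence(easy_test)  # negative: underconfident on the easy test
print(round(hard_gap, 2), round(easy_gap, 2))
```

In this toy data the gap changes sign between the two tests, which is the pattern the abstract calls the hard-easy effect.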

2605.22809 2026-05-26 cs.CV

Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

Sensor2Sensor: 自动驾驶的跨本体传感器转换

Jiahao Wang, Bo Sun, Yijing Bai, Vincent Casser, Songyou Peng, Zehao Zhu, Meng-Li Shih, Xander Masotto, Shih-Yang Su, Kanaad V Parvate, Tiancheng Ge, Linn Bieske, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang

AI总结 提出Sensor2Sensor生成模型,将单目行车记录仪视频转换为多模态传感器数据(多视角相机图像和LiDAR点云),通过4D高斯泼溅重建和扩散架构解决无配对数据问题,为自动驾驶开发解锁外部数据源。

Comments Accepted by CVPR 2026

详情
AI中文摘要

自动驾驶系统(ADS)的鲁棒训练和验证需要大规模、多样化的数据集。自动驾驶车队收集的专有数据虽然高保真,但在规模、传感器配置多样性以及地理和长尾行为覆盖方面有限。相比之下,来自行车记录仪等来源的野外数据提供了巨大的规模和多样性,捕获了关键的长尾场景和新环境。然而,这种非结构化的野外视频数据与期望结构化多模态传感器输入进行验证和训练的ADS不兼容。为了弥合这一数据差距,我们提出了Sensor2Sensor,一种新颖的生成建模范式,将野外的单目行车记录仪视频转换为高保真的多模态传感器套件(AV日志),包括多视角相机图像和LiDAR点云。一个核心挑战是缺乏配对训练数据。我们通过4D高斯泼溅(4DGS)重建和新视角渲染将真实的AV日志转换为行车记录仪风格的视频来解决这一问题。然后,Sensor2Sensor利用扩散架构进行生成转换。我们对生成的传感器数据的保真度和真实感进行了全面的定量评估。我们通过将具有挑战性的野外互联网和行车记录仪镜头转换为逼真的多模态数据格式,展示了Sensor2Sensor的实际效用,进一步为AV开发解锁了巨大的外部数据源。

英文摘要

Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. In contrast, in-the-wild data from sources like dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured, in-the-wild video data is incompatible with ADS expecting structured, multi-modal sensor inputs for validation and training. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite (AV logs) comprising multi-view camera images and LiDAR point clouds. A core challenge is the lack of paired training data. We address this by converting real AV logs into dashcam-style videos via 4D Gaussian Splatting (4DGS) reconstruction and novel-view rendering. Sensor2Sensor then utilizes a diffusion architecture to perform the generative conversion. We perform comprehensive quantitative evaluations on the fidelity and realism of the generated sensor data. We demonstrate Sensor2Sensor's practical utility by converting challenging in-the-wild internet and dashcam footage into realistic, multi-modal data formats, further unlocking vast external data sources for AV development.

2605.22800 2026-05-26 cs.LG cs.AI stat.ML

The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning

匹配原则:面向干扰鲁棒表示学习的损失函数几何理论

Vishal Rajput

AI总结 提出匹配原则,通过估计任务协方差矩阵并匹配惩罚矩阵的像空间,统一了多种鲁棒性方法,并在线性高斯模型中证明最优性。

Comments 58 pages, 13 pre-specified empirical blocks. v2: partial-pass framing, geometry-task dissociation, T2B protocol v3, layout/figure fixes; core theorems unchanged. Code: matching-pmh (PyPI). Related note: arXiv:2604.21395

详情
AI中文摘要

鲁棒性、领域自适应、光度/遮挡不变性、传感器漂移和对齐风格被视为独立的文献领域,拥有各自独立的方法族。在标签保持的部署偏移下,它们共享一个几何对象:协方差 Sigma_task = Cov_{Q_n}(n),即输入在标签不变的情况下可以变化的方式。CORAL、对抗训练、数据增强、度量学习、雅可比惩罚和对齐约束并非独立的技巧——它们都是 Sigma_task 的估计量。固定该对象后,雅可比惩罚由一个矩阵 Sigma' 确定,其像空间必须覆盖 range(Sigma_task)——即匹配原则。我们在线性高斯模型中证明了最优性(定理A),证明了任何能够消除部署漂移的二次惩罚都需要像空间覆盖(定理G),并在全局最小值处证明了相同的二分性(定理A*_global)。错误方向/信号对齐控制(引理C;推论E/E*)以及七个估计量(引理D1-D7),加上无标签TDI,为需要学习 Sigma_task 的情况提供了可证伪的配方。在十三个模块(从ML到Qwen2.5-7B)上,测试了匹配的、各向同性的和错误方向的惩罚对几何和部署漂移的影响。其中十二个模块与可识别性成立的理论一致;Office-31是一个命名的特征间隙失败案例。部分通过:几何可以在不改善每个头条任务指标的情况下提升。一次初步的7B DPO运行(一个epoch,240对):匹配风格-PMH保持了风格TDI,而标准DPO则使其退化。我们不声称标准训练达到全局最小值(假设(O)是开放的),不声称估计的 Sigma_task 总是可识别的,也不声称在每个排行榜上占优。我们提出一个可证伪的设计配方:估计 Sigma_task,匹配 Sigma',运行控制,分别报告任务和几何指标。

英文摘要

Robustness, domain adaptation, photometric/occlusion invariance, sensor drift, and alignment style are treated as separate literatures with separate method families. Under label-preserving deployment shift they share one geometric object: the covariance Sigma_task = Cov_{Q_n}(n) of ways inputs can change without changing the label. CORAL, adversarial training, augmentation, metric learning, Jacobian penalties, and alignment constraints are not independent tricks--they are estimators of Sigma_task. Fix that object and the Jacobian penalty is pinned by a matrix Sigma' whose range must cover range(Sigma_task)--the matching principle. We prove optimality in a linear-Gaussian model (Thm. A), necessity of range coverage for any quadratic penalty that zeros deployment drift (Thm. G), and the same dichotomy at global minima (Thm. A*_global). Wrong-direction/signal-aligned controls (Lemma C; Cor. E/E*) and seven estimators (Lemmas D1--D7), plus label-free TDI, yield a falsifiable recipe when Sigma_task must be learned. Thirteen blocks (ML through Qwen2.5-7B) test matched vs isotropic vs wrong-direction penalties on geometry and deployment drift. Twelve match theory where identifiability holds; Office-31 is a named eigengap failure. Partial passes: geometry can improve without every headline task metric moving. A pilot 7B DPO run (one epoch, 240 pairs): matched style-PMH preserves Style TDI where standard DPO degrades it. We do not claim standard training reaches global minima (assumption (O) is open), that estimated Sigma_task is always identifiable, or dominance on every leaderboard. We claim a falsifiable design recipe: estimate Sigma_task, match Sigma', run the controls, report task and geometry separately.
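
The range-coverage idea can be sketched for a linear predictor f(x) = w·x, whose Jacobian is simply w. The sketch assumes the penalty takes the quadratic form w^T Sigma' w and that deployment drift is measured as w^T Sigma_task w; the abstract does not give these formulas, so both are assumptions, and all matrices are toy values.

```python
def quad_form(w, sigma):
    """Return w^T sigma w for a vector w and a square matrix sigma (nested lists)."""
    n = len(w)
    return sum(w[i] * sigma[i][j] * w[j] for i in range(n) for j in range(n))


# Toy Sigma_task: the nuisance varies only along the first input axis.
sigma_task = [[1.0, 0.0], [0.0, 0.0]]
# Matched penalty: its range covers range(Sigma_task).
sigma_matched = [[1.0, 0.0], [0.0, 0.0]]
# Wrong-direction penalty: its range is orthogonal to range(Sigma_task).
sigma_wrong = [[0.0, 0.0], [0.0, 1.0]]

w_sensitive = [1.0, 0.0]  # predictor that reads the nuisance axis
w_invariant = [0.0, 1.0]  # predictor that ignores the nuisance axis

# Drift of each predictor under the nuisance covariance.
drift_sensitive = quad_form(w_sensitive, sigma_task)
drift_invariant = quad_form(w_invariant, sigma_task)

# The matched penalty flags the sensitive predictor; the wrong one does not.
matched_penalty = quad_form(w_sensitive, sigma_matched)
wrong_penalty = quad_form(w_sensitive, sigma_wrong)
print(drift_sensitive, drift_invariant, matched_penalty, wrong_penalty)
```

In this toy case a zero wrong-direction penalty coexists with nonzero drift, whereas a zero matched penalty forces zero drift, which is the intuition behind requiring the range of Sigma' to cover range(Sigma_task).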

2605.22542 2026-05-26 cs.CL

Scene Abstraction for Lexical Semantics: Structured Representations of Situated Meaning

词汇语义的场景抽象:情境意义的结构化表示

Yejin Cho, Katrin Erk

AI总结 提出场景抽象框架,通过少样本提示大语言模型构建词汇使用情境的结构化表示,实验证明场景可可靠识别且优于基线方法。

详情
AI中文摘要

咖啡和茶共享许多属性,但它们唤起截然不同的情境、氛围和情感联想。这些词汇意义的情境维度是真实且系统的,但在大多数词汇意义的计算表示中仍然隐含。我们提出场景抽象,一个构建词汇在不同使用语境中参与的解释性场景的结构化表示框架。每个场景由情境场景(事件、实体、设置)和以表达为中心的表达轮廓(参与事件、可概括属性、唤起情感)组成,通过大语言模型的少样本提示实现。我们的贡献有三方面:(1)情境词汇意义的结构化表示框架;(2)COCA-Scenes,一个包含26个关键词的520个使用实例的数据集,用于区分场景识别;(3)来自两个实验的经验证据表明,场景在人类观察者中可靠识别(准确率82.4%,比纯文本嵌入高11.8个百分点),并且我们的场景轮廓比基于ATOMIC的替代方案更符合人类对语境中词汇的解释(在三个语义维度上偏好86.4%)。

英文摘要

Coffee and tea share many properties, yet they evoke strikingly different situations, atmospheres, and affective associations. These situated dimensions of word meaning are real and systematic, but they remain implicit in most computational representations of lexical meaning. We propose Scene Abstraction, a framework for constructing structured representations of the interpretive scenes that words participate in across usage contexts. Each scene consists of a Contextual Scene (Events, Entities, Setting) and an expression-centered Expression Profile (Engaged events, Generalizable properties, Evoked emotions), operationalized through few-shot prompting of a large language model. Our contributions are three-fold: (1) a structured representation framework for situated lexical meaning; (2) COCA-Scenes, a dataset of 520 usage instances across 26 keywords for distinct scene identification; and (3) empirical evidence from two experiments suggesting that scenes are reliably identifiable across human observers (82.4% accuracy, +11.8 pp over text-only embeddings) and that our scene profiles more closely align with human interpretation of words in context than ATOMIC-based alternatives (86.4% preference across three semantic dimensions).

2605.22532 2026-05-26 cs.LG

Relational Linear Properties in Language Models: An Empirical Investigation

语言模型中的关系线性性质:一项实证研究

Giovanni Valer, Luigi Gresele, Marco Bronzini, Emanuele Marconato

AI总结 本文提出基于KL散度的探针方法,实证检验语言模型中关系线性假设(即固定关系下对象解嵌入可由主体嵌入线性映射预测),发现其随模型、层和关系表述变化。

详情
AI中文摘要

线性性质在语言模型的表示中普遍存在;然而,实验性地测试它们仍然是一项具有挑战性的任务。本文聚焦于关系线性:即对于固定关系(例如“演奏”),对象的解嵌入(例如“小号”)可以通过线性映射从其主体(例如“迈尔斯·戴维斯”)的嵌入预测。我们提出了一种实验方法,用于测试Marconato等人(2025)提出的关系线性公式。具体而言,我们引入了一种基于KL散度的探针方法来评估这一性质,并考察其在不同层和不同表述的关系查询中的变化。该方法也比先前工作更高效;例如,它避免了Hernandez等人(2024)在线性关系嵌入中使用的粗略雅可比近似。我们在四个数据集上的发现表明,关系线性在不同模型间存在差异,展现出与先前关于模型表示中语言信息的观察一致的逐层模式,并且受关系表述方式变化的影响不同。

英文摘要

Linear properties are ubiquitous in the representations of language models; however, testing them experimentally remains a challenging task. This work focuses on relational linearity: the hypothesis that, for a fixed relation (e.g., "plays"), the unembedding of an object (e.g., "trumpet") can be predicted from the embedding of its subject (e.g., "Miles Davis") by a linear map. We present an experimental method to test the formulation of relational linearity by Marconato et al. (2025). Specifically, we introduce a probing method, based on Kullback-Leibler divergence, to evaluate this property and examine its variation across layers and paraphrased relational queries. It is also more efficient than previous work; for example, it avoids the crude Jacobian approximations used in Linear Relational Embeddings by Hernandez et al. (2024). Our findings across four datasets show that relational linearity varies across models, exhibits layer-wise patterns consistent with prior observations about linguistic information in model representations, and is differently affected by changes in how the relation is phrased.
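
A KL-based check of relational linearity might look like the sketch below. The abstract does not specify the probe, so everything here is an assumption: an affine map (`A`, `b`) sends the subject embedding to a vector that is scored against toy unembeddings, and the resulting distribution is compared with the model's next-token distribution using KL(model || probe). All names and numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def linear_probe_logits(subject_emb, A, b, unembed):
    """Apply z = A s + b, then score z against each token's unembedding row."""
    dim = len(subject_emb)
    z = [sum(A[i][j] * subject_emb[j] for j in range(dim)) + b[i]
         for i in range(len(A))]
    return [sum(row[i] * z[i] for i in range(len(z))) for row in unembed]


# Toy setup: 2-d embeddings and a 3-token vocabulary (all values invented).
subject = [1.0, 0.0]
unembed = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
model_probs = softmax([2.0, 0.0, -1.0])  # stand-in for the model's own output

A_good = [[1.0, 0.0], [0.0, 1.0]]  # map that reproduces the model's logits
A_bad = [[0.0, 1.0], [1.0, 0.0]]   # map that swaps the two directions
bias = [0.0, 0.0]

kl_good = kl_divergence(model_probs, softmax(linear_probe_logits(subject, A_good, bias, unembed)))
kl_bad = kl_divergence(model_probs, softmax(linear_probe_logits(subject, A_bad, bias, unembed)))
print(kl_good, kl_bad)
```

A small divergence indicates that the affine map accounts for the model's prediction for that subject; a large one indicates that it does not.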

2605.22093 2026-05-26 cs.AI

Knowledge Graph Re-engineering Along the Ontological Continuum (extended version)

知识图谱沿本体论连续体的重工程(扩展版)

Enrico Daga, Valentina Tamma, Terry Payne

AI总结 本文提出本体论连续体作为概念框架,通过语义与语用、属性与可供性两个正交维度描述、比较和转换知识图谱,以解决不同建模实践间的集成与重用问题,并通过案例研究验证其有效性。

详情
AI中文摘要

知识图谱已成为数据集成的主要载体,对现代AI的成功至关重要,但KG建模实践的多样性(从轻量级词汇表到丰富公理化的本体论)使得集成和重用成本高昂且脆弱。这一挑战在神经符号AI中尤为突出,其中桥接神经和符号组件依赖于重新设计KG以适应新需求的能力;生成式AI现在提供了前所未有的自动化能力,但如果没有对KG空间的原则性理解,这种自动化在概念上仍然缺乏基础。我们将本体论连续体引入为缺失的概念化,这是一个理论构造,其特征框架由两个正交区分定义:语义与语用,以及属性与可供性;这些共同定义了一个词汇表,用于描述、比较、导航和转换跨越全部建模实践的KG。方法论立场是经验性的:连续体并非规定KG应如何建模,而是旨在定义一种存在理论,源于对现实世界KG工程实践的观察,其结构可以形式化地明确表达,例如通过形式概念分析(FCA)。我们通过一个关于溯源知识的案例研究来夯实这一愿景,展示单一关注点如何在连续体上以不同方式体现。我们阐述了五个开放的研究挑战,并邀请社区将本体论连续体发展为一个共享的研究议程。

英文摘要

Knowledge graphs have become the primary vehicle for data integration and are critical to the success of modern AI, but the diversity of KG modelling practices, from lightweight vocabularies to richly axiomatised ontologies, makes integration and reuse expensive and brittle. This challenge is particularly acute in neuro-symbolic AI, where bridging neural and symbolic components depends on the ability to reengineer KGs to fit new requirements; GenAI now offers unprecedented automation capability, but without a principled understanding of the KG space, such automation remains conceptually ungrounded. We introduce the ontological continuum as that missing conceptualisation, a theoretical construct whose characterisation framework is defined by two orthogonal distinctions: semantics vs pragmatics, and properties vs affordances; together these define a vocabulary to describe, compare, navigate, and transform KGs across the full range of modelling practices. The methodological stance is empirical: rather than prescribing how KGs should be modelled, the continuum aims to define a theory of the existent, derived from observation of real-world KG engineering practices and whose structure can be made formally explicit, for example, through Formal Concept Analysis (FCA). We ground the vision through a case study on provenance knowledge, showing how a single concern manifests differently across the continuum. We articulate five open research challenges and invite the community to develop the ontological continuum as a shared research agenda.

2605.22005 2026-05-26 cs.LG cs.AI cs.CL

Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

检查你的大语言模型的秘密词典!五行代码揭示你的大语言模型学到了什么(包括它不应该学到的)

Hisashi Miyashita

AI总结 通过对lm_head权重矩阵进行奇异值分解(仅需五行PyTorch代码且无需模型推理),直接从模型权重中揭示可解释的语义子空间,并发现模型训练数据组成和策展哲学。

详情
AI中文摘要

我们展示了基于Transformer的大语言模型的lm_head权重矩阵的奇异值分解——仅需五行PyTorch代码且无需模型推理——直接从模型权重中揭示可解释的语义子空间。每个左奇异向量识别出当隐藏状态与相应奇异方向对齐时最容易被选中的词汇标记;检查这些聚类揭示了模型的训练数据组成和策展哲学。 分析GPT-OSS-120B、Gemma-2-2B和Qwen2.5-1.5B,我们发现奇异值谱和词汇聚类结构在不同模型间存在系统性差异:GPT呈现出功能分化子空间的渐进层次;Gemma以19世纪前的英语正字法为主,形成阶梯式聚类结构,这可能有助于高输出可控性;Qwen展现出广泛的多语言覆盖,同时其子空间的词汇被作者认为在伦理上不适合直接发表。 基础-指令对比表明,伦理上令人担忧的子空间源自预训练,并且不会被后训练对齐移除。我们引入词汇聚类得分(VCS)来量化子空间一致性,以及加权投影得分(WPS)作为静态故障标记检测器;将WPS应用于GPT-OSS-120B,无需任何模型推理即可恢复shokubutsu-hyakka-tsu(ID 137606),这是CJK语言社区中广泛报道的一个著名故障标记。我们提出了问题词汇内容根本原因的分类法,并呼吁将lm_head SVD分析作为标准发布前安全审计步骤。我们的发现进一步指出了SVD引导的分词器优化和更可控的大语言模型设计方向。

英文摘要

We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model -- requiring only five lines of PyTorch and no model inference -- reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication. Base-instruct comparison reveals that ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. We introduce the Vocabulary Cluster Score (VCS) to quantify subspace coherence, and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. We propose a taxonomy of root causes for problematic vocabulary content and call for lm_head SVD analysis to be adopted as a standard pre-release safety auditing step. Our findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design.
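
The idea of reading token clusters off the left singular vectors can be sketched on a toy matrix. This is not the authors' five-line PyTorch snippet, which the abstract does not reproduce. It is a dependency-free stand-in that recovers only the top singular triplet by power iteration; the vocabulary and weights are invented. With PyTorch installed, `torch.linalg.svd` on the real `lm_head` weights would return all singular vectors at once.

```python
import math


def top_singular_triplet(W, iters=200):
    """Power iteration on W^T W; returns (sigma, u, v) for the largest singular value."""
    rows, cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        Wv = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        WtWv = [sum(W[i][j] * Wv[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in WtWv))
        v = [x / norm for x in WtWv]
    Wv = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in Wv))
    u = [x / sigma for x in Wv]  # left singular vector, one entry per token
    return sigma, u, v


# Toy "lm_head": 4 vocabulary tokens by 2 hidden dimensions (values invented).
vocab = ["cat", "dog", "the", "of"]
lm_head = [[3.0, 0.0], [2.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

sigma, u, v = top_singular_triplet(lm_head)
# Tokens with the largest entries in u are the ones this direction favours.
ranked = sorted(range(len(vocab)), key=lambda i: abs(u[i]), reverse=True)
top_tokens = [vocab[i] for i in ranked[:2]]
print(top_tokens)
```

The ranking uses absolute values because singular vectors are only defined up to sign.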

2605.20670 2026-05-26 cs.LG

LT2: Linear-Time Looped Transformers

LT2: 线性时间循环Transformer

Chunyuan Deng, Yizhe Zhang, Rui-Jie Zhu, Yuanyuan Xu, Jiarui Liu, T. S. Eugene Ng, Hanjie Chen

AI总结 提出LT2系列架构,用次二次线性时间注意力替代二次softmax注意力,通过循环实现线性注意力中的迭代记忆精炼和稀疏注意力中的有效感受野扩展,在召回、状态跟踪和语言建模任务上取得一致提升,并展示了混合变体在效率和性能上的优势。

详情
AI中文摘要

循环Transformer(LT)通过在解码最终token之前多次迭代其层,已成为一种强大的架构。然而,将其与全注意力配对会保留二次复杂度,使其计算昂贵且速度慢。我们引入了LT2(线性时间循环Transformer),这是一系列循环架构,用次二次、线性时间注意力替代二次softmax注意力。我们研究了两种变体:具有线性注意力的LT2-linear和具有稀疏注意力的LT2-sparse。我们发现循环与这些变体独特地协同作用:它在线性注意力中实现迭代记忆精炼,并在稀疏注意力中逐步扩展有效感受野。我们从理论上形式化了这些优势,并在受控的召回、状态跟踪和语言建模任务中展示了一致的经验提升。然后我们探索了LT2-hybrid,它在循环设置中结合了不同的注意力变体。两种变体尤其有前景:LT2-hybrid (GDN+DSA),它交错使用线性和稀疏注意力以最大化效率,并以完全线性时间成本匹配标准循环Transformer的质量;以及LT2-hybrid (Full+GDN),它将GDN与一小部分全注意力层交错使用以最大化质量,在性能和效率上都超过了标准循环Transformer。我们还展示了如何将预训练的LT转换为LT2-hybrid模型。经过约10亿token的训练,我们的转换模型Ouro-hybrid-1.4B在性能上优于行业级别的10亿参数模型,并与行业级别的40亿参数模型竞争,同时保留了线性时间注意力的速度优势。这些结果共同展示了使循环Transformer更具可扩展性并推进高效、有能力的小型语言模型的清晰路径。

英文摘要

Looped Transformers (LT) have emerged as a powerful architecture by iterating their layers multiple times before decoding the final token. However, pairing them with full attention retains quadratic complexity, making them computationally expensive and slow. We introduce LT2 (Linear-Time Looped Transformers), a family of looped architectures that replace quadratic softmax attention with subquadratic, linear-time attention. We study two variants: LT2-linear with linear attention and LT2-sparse with sparse attention. We find that looping uniquely synergizes with these variants: it enables iterative memory refinement in linear attention and progressively expands the effective receptive field in sparse attention. We formalize these benefits theoretically and demonstrate consistent empirical gains across controlled recall, state-tracking, and language modeling tasks. We then explore LT2-hybrid, which combines different attention variants in a looped setting. Two variants are especially promising: LT2-hybrid (GDN+DSA), which interleaves linear and sparse attention to maximize efficiency and matches the standard looped transformer's quality at fully linear-time cost; and LT2-hybrid (Full+GDN), which interleaves GDN with a small fraction of full attention layers to maximize quality, surpassing the standard looped transformer in both performance and efficiency. We also show how to convert a pre-trained LT into an LT2-hybrid model. With about 1B tokens of training, our converted model, Ouro-hybrid-1.4B, outperforms industry-level 1B models and is competitive with industry-level 4B models while retaining the speed benefits of linear-time attention. Together, these results show a clear path toward making looped transformers more scalable and advancing efficient, capable small language models.

2605.20490 2026-05-26 cs.AI cs.LG

ECUAS$_n$: A family of metrics for principled evaluation of uncertainty-augmented systems

ECUAS$_n$: 一种用于原则性评估不确定性增强系统的度量族

Lautaro Estienne, Erik Ernst, Matías Vera, Pablo Piantanida, Luciana Ferrer

AI总结 针对高风险自动决策中不确定性增强系统的评估问题,提出一种基于适当评分规则的度量族 ECUAS$_n$,通过参数 $n$ 平衡错误预测成本与不确定性质量,并在分类和生成数据集上验证其理论优势与实证效果。

Comments pre-print, 9-pages paper, 25 pages total

详情
AI中文摘要

在高风险自动决策中,获取预测不确定性对于使用户(人类或下游系统)能够根据应用特定的成本权衡接受或拒绝预测至关重要。这种不确定性增强(UA)系统——即同时输出预测和不确定性分数的系统——目前在文献中以多种方式被评估,包括使用单独的指标评估预测和不确定性分数、设置固定拒绝成本的成本函数或对覆盖-风险曲线进行积分。我们认为这些评估方法不足以评估UA系统在不确定性下决策的整体性能,并提出了一种新的度量族ECUAS$_n$,将其表述为感兴趣任务的适当评分规则。参数$n$根据用例需求控制错误预测成本与不完美不确定性之间的权衡。我们通过在不同分类和生成数据集(包括TriviaQA的手动注释子集)上的实验,从理论和实证两方面展示了ECUAS$_n$度量的优势。

英文摘要

In high-stakes automated decision-making, access to predictive uncertainty is essential for enabling users -- human or downstream systems -- to accept or reject predictions based on application-specific cost trade-offs. Such uncertainty-augmented (UA) systems -- i.e., systems that output both predictions and uncertainty scores -- are currently being assessed in the literature in a variety of ways, using separate metrics to evaluate the predictions and the uncertainty scores, setting a cost function with a fixed rejection cost or integrating over a coverage-risk curve. We argue that these evaluation approaches are inadequate for assessing overall performance of the UA system for decision making under uncertainty and propose a novel family of metrics, ECUAS$_n$, formulated as proper scoring rules for the task of interest. The parameter $n$ controls the trade-off between the cost of incorrect predictions and imperfect uncertainties depending on the needs of the use-case. We demonstrate the advantages of the ECUAS$_n$ metrics both theoretically and empirically, through experiments on diverse classification and generation datasets, including a manually annotated subset of TriviaQA.
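
One of the existing evaluation styles mentioned above, a cost function with a fixed rejection cost, can be sketched as follows. This is not ECUAS$_n$, whose definition the abstract does not give. The unit cost per accepted error, the function name `selective_cost`, and the data are all assumptions made for illustration.

```python
def selective_cost(records, threshold, rejection_cost):
    """Average cost: 1 per accepted error, rejection_cost per rejected prediction."""
    total = 0.0
    for uncertainty, correct in records:
        if uncertainty > threshold:
            total += rejection_cost  # prediction is rejected
        elif not correct:
            total += 1.0             # prediction is accepted but wrong
    return total / len(records)


# Hypothetical (uncertainty, was_correct) pairs, invented for illustration only.
records = [(0.1, True), (0.2, True), (0.6, False), (0.9, False)]

cost_reject_uncertain = selective_cost(records, threshold=0.5, rejection_cost=0.3)
cost_accept_all = selective_cost(records, threshold=1.0, rejection_cost=0.3)
cost_reject_all = selective_cost(records, threshold=0.0, rejection_cost=0.3)
print(cost_reject_uncertain, cost_accept_all, cost_reject_all)
```

The result depends on both the chosen threshold and the fixed rejection cost, so a single number from this kind of metric reflects those choices as well as the quality of the uncertainty scores.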

2605.19848 2026-05-26 cs.CL

CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

CLIF:用于透明瓶颈模型的概念级影响函数

Yike Sun, Mingkun Xu, Mu You, Zhongzhi He, Henghua Shen, Zehan Tan, Derek F. Wong, Tao Fang

AI总结 提出概念级影响函数方法,在样本和概念层面增强NLP模型可解释性,通过调整关键样本和概念验证了数据调试和决策透明化的有效性。

Comments A critical theoretical error invalidates the main results. The independence assumption on concept representations and gradients (Section 3.2, Eq.7) is incorrect, breaking the influence estimation in nonlinear bottleneck layers. This flaw undermines all empirical claims in Sections 4-5. The authors withdraw to prevent dissemination of incorrect findings

详情
AI中文摘要

近年来,深度学习模型的黑箱特性限制了其在医疗诊断和金融等高风险领域的应用,而这些领域对可解释性至关重要。为解决这一问题,我们提出了一种新颖的方法,利用影响函数在样本和概念层面增强NLP模型的可解释性。在CEBaB和Yelp数据集上的实验表明,影响函数能有效识别对模型预测最有影响的训练样本(包括有益和有害的)。通过调整这些样本的标签和权重,我们证明无需重新训练即可将模型性能恢复到基线水平,证实了影响函数在高效数据调试中的价值。此外,我们的概念级分析识别了概念瓶颈模型(CBM)中对预测有显著影响的关键概念。修改这些概念会明显改变模型行为,为决策过程提供了清晰的洞察。

英文摘要

In recent years, the black-box nature of deep learning models has limited their application in high-stakes domains such as medical diagnosis and finance, where interpretability is essential. To address this, we propose a novel approach using influence functions to enhance interpretability in NLP models at both the sample and concept levels. Experiments on CEBaB and Yelp datasets show that influence functions effectively identify the most impactful training samples, both helpful and harmful, on model predictions. By adjusting the labels and weights of these samples, we demonstrate that model performance can be restored to baseline levels without retraining, confirming the value of influence functions for efficient data debugging. Furthermore, our concept-level analysis identifies key concepts within Concept Bottleneck Models (CBM) that significantly affect predictions. Modifying these concepts alters model behavior observably, providing clear insights into the decision process.

2605.19846 2026-05-26 cs.CV cs.AI cs.CL

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

FineBench: 细粒度人类活动理解的视觉-语言模型基准测试与增强

Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu

AI总结 针对视觉-语言模型在细粒度人类活动理解上的不足,提出包含密集标注的长视频问答基准FineBench和增强框架FineAgent。

Comments CVPR'26 (Workshop on Video Large Language Models). Project Page: https://joslefaure.github.io/assets/html/finebench.html

详情
AI中文摘要

视觉-语言模型(VLM)在通用视频理解方面表现出色,但在需要细致理解人类动作和交互的真实世界应用中,它们常常难以进行细粒度理解。虽然最近一些以人为中心的基准测试评估了模型行为的公平性/伦理、情感感知等维度,但它们没有结合长视频、密集的问答覆盖以及大规模的帧级空间/时间定位。为弥补这一差距,我们引入了FineBench,一个专门设计用于评估细粒度理解的以人为中心的视频问答(VQA)基准。FineBench包含199,420个多项选择问答对,密集标注在64个长视频(每个15分钟)上,重点关注详细的人物运动、人物交互和物体操作,包括组合动作。我们的广泛评估显示,虽然像GPT-5这样的专有模型取得了不错的性能,但当前的开源VLM明显表现不佳,特别是在多人场景的空间推理以及区分人类运动和交互的细微差异方面。为了解决这些已识别的弱点,我们提出了FineAgent,一个模块化框架,通过利用定位器和描述器来增强VLM。实验表明,FineAgent在FineBench上持续提高了各种开源VLM的性能。FineBench为未来细粒度以人为中心的视频理解研究提供了严格的测试平台,而FineAgent则为增强当前VLM中的此类推理提供了一种实用方法。项目页面和代码:https://joslefaure.github.io/assets/html/finebench.html。

英文摘要

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.

2605.19027 2026-05-26 cs.CV

MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

MedFM-Robust:医学基础模型的鲁棒性基准测试

Xiangxiang Cui, Tianjin Huang, Yifang Wang, Lijie Hu, Lu Yin

AI总结 本文提出了一个包含40种扰动类型(12种基础、28种医学特定)的鲁棒性基准,评估了多种医学基础模型在VQA、视觉定位、图像描述和分割任务上的表现,发现微调策略主导鲁棒性、医学特定扰动对分割影响大、零样本VQA鲁棒性依赖模型等关键结论。

Comments MICCAI2026

详情
AI中文摘要

医学基础模型在临床任务中取得了显著性能,但它们在现实世界扰动下的鲁棒性仍未得到充分探索。我们提出了一个鲁棒性基准,包含8种成像模态下的40种扰动类型(12种基础、28种医学特定),评估了五个视觉语言模型(LLaVA-Med、MedGemma、MedGemma-1.5、Gemini-2.5-flash和GPT-4o-mini)在VQA、视觉定位和图像描述任务上的表现,以及两个分割模型(MedSAM、SAM-Med2D)在五种微调策略下的性能。我们的发现表明:(1)微调策略主导鲁棒性,LoRA的退化程度几乎是全微调的两倍,而SAM-Med2D的Adapter提供了有利的效率-鲁棒性权衡。(2)医学特定扰动对分割造成不成比例的损害,15个最严重扰动中有9个是领域特定的。(3)LoRA微调的视觉定位性能下降超过40个百分点,而零样本图像描述保持稳定(下降<7%)。零样本VQA表现出模型依赖的鲁棒性——医学模型下降不到20%,而Gemini-2.5-flash下降54%。通用视觉语言模型在VQA上准确率更高,但在视觉定位上失败;在医学视觉语言模型中,MedGemma表现出最佳的整体稳定性。这些结果提供了部署指南,并强调了医学AI领域特定鲁棒性评估的必要性。我们的代码可在 https://abnerai.github.io/MedFM-Robust 获取。

英文摘要

Medical foundation models have achieved remarkable clinical performance, yet their robustness under real-world perturbations remains underexplored. We present a robustness benchmark comprising 40 perturbation types (12 base, 28 medical-specific) across eight imaging modalities, evaluating five VLMs (LLaVA-Med, MedGemma, MedGemma-1.5, Gemini-2.5-flash and GPT-4o-mini) on VQA, visual grounding, and captioning, alongside two segmentation models (MedSAM, SAM-Med2D) with five fine-tuning strategies. Our findings reveal: (1) Fine-tuning strategy dominates robustness, with LoRA exhibiting nearly double the degradation of full fine-tuning, while SAM-Med2D's Adapter offers a favorable efficiency-robustness trade-off. (2) Medical-specific perturbations disproportionately damage segmentation, with 9 of 15 top corruptions being domain-specific. (3) LoRA-tuned visual grounding drops over 40 points, whereas zero-shot captioning remains stable (<7% drop). Zero-shot VQA shows model-dependent robustness--medical models drop under 20% while Gemini-2.5-flash drops 54%. General-purpose VLMs achieve higher VQA accuracy but fail on grounding; among medical VLMs, MedGemma demonstrates the best overall stability. These results provide deployment guidelines and underscore the necessity of domain-specific robustness evaluation for medical AI. Our code is available at: https://abnerai.github.io/MedFM-Robust.

2605.19021 2026-05-26 cs.LG

Deep Neural Sheaf Diffusion


Rémi Bourgerie, Šarūnas Girdzijauskas, Viktoria Fodor

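AI summary To address the representation collapse caused by stacking graph neural network layers deeply, the paper replaces the sheaf Laplacian with a sheaf adjacency operator and combines it with normalization, odd nonlinearities, and gating, markedly improving deep-network performance on synthetic and real-world datasets.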

Comments Accepted at the ICML 2026 Workshop on Graph Foundation Models (GFM@ICML 2026). Code available at https://github.com/remibourgerie/deep-neural-sheaf-diffusion

Details
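AI abstract (translated from Chinese)

Deep graph neural networks are essential for capturing complex dependencies in graph-structured data. However, scaling GNNs to depth remains challenging, because stacking layers leads to representation collapse and to reduced sensitivity caused by repeated aggregation. Although Neural Sheaf Diffusion (NSD) offers strong theoretical guarantees against this collapse, the guarantees are not realized in practice: as depth increases, the disagreement signal of the sheaf Laplacian vanishes, limiting the contribution of deeper layers. We identify the mechanisms that hinder NSD's effectiveness at depth and propose Deep Neural Sheaf Diffusion (DNSD), which replaces the sheaf Laplacian with a sheaf adjacency operator to preserve informative signals across layers. This is complemented by normalization, odd nonlinearities, and gating. To give a principled explanation of the expected performance improvement, we contrast sheaf diffusion with graph attention mechanisms, emphasizing that DNSD replaces scalar attention scores with matrix-valued edge functions and normalizes node representations rather than attention scores. We show experimentally that DNSD makes effective use of deep aggregation in graph tasks, outperforming GNN and NSD baselines by up to 30 percentage points in accuracy on synthetic long-range datasets and consistently outperforming them on real-world benchmarks. These results position sheaf-based architectures as promising building blocks for graph foundation models, since they support effective deep architectures.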

English abstract

Deep Graph Neural Networks (GNNs) are essential for capturing complex dependencies in graph-structured data. However, scaling GNNs to depth remains challenging, as stacking layers leads to representation collapse and diminishing sensitivity due to repeated aggregation. While Neural Sheaf Diffusion (NSD) provides strong theoretical guarantees against such collapse, these guarantees do not translate to practice: as depth increases, the disagreement signal of the sheaf Laplacian vanishes, limiting the contribution of deeper layers. We identify mechanisms that hinder NSD effectiveness at depth and propose Deep Neural Sheaf Diffusion (DNSD), which replaces the sheaf Laplacian with a sheaf adjacency operator to maintain informative signals across layers. This is complemented by normalization, odd nonlinearities, and gating. To provide a principled explanation of the expected performance improvement, we contrast sheaf diffusion with graph attention mechanisms, highlighting that DNSD replaces scalar attention scores with matrix-valued edge functions and normalizes node representations rather than attention scores. We demonstrate empirically that DNSD effectively utilizes deep aggregation in graph tasks, outperforming GNN and NSD baselines by up to 30 percentage points in accuracy on synthetic long-range datasets, and consistently outperforming them on real-world benchmarks. These results position sheaf-based architectures as a promising building block for graph foundation models by supporting effective deep architectures.
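
The contrast between the sheaf Laplacian and a sheaf adjacency operator can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define the adjacency operator, so the code assumes it is the off-diagonal part of the unnormalized sheaf Laplacian with the sign flipped. The toy graph, the rotation-valued restriction maps, and the helper names (`rot`, `matvec`, `transpose`, `sheaf_operators`) are all hypothetical, and normalization, odd nonlinearities, and gating are omitted.

```python
import math


def rot(theta):
    """2x2 rotation matrix, used here as a toy restriction map."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def sheaf_operators(edges, maps, x):
    """Apply the unnormalized sheaf Laplacian and an ASSUMED sheaf adjacency to x.

    maps[(v, e)] is the restriction map of node v on edge index e.
    Returns (L x, A x) as lists of per-node vectors.
    """
    dim = len(x[0])
    lap = [[0.0] * dim for _ in x]
    adj = [[0.0] * dim for _ in x]
    for e, (u, v) in enumerate(edges):
        fu, fv = maps[(u, e)], maps[(v, e)]
        yu, yv = matvec(fu, x[u]), matvec(fv, x[v])  # features in the edge space
        for node, f, own, other in ((u, fu, yu, yv), (v, fv, yv, yu)):
            ft = transpose(f)
            diff = [a - b for a, b in zip(own, other)]
            # Laplacian: pull back the disagreement measured on this edge.
            lap[node] = [p + q for p, q in zip(lap[node], matvec(ft, diff))]
            # Assumed adjacency: pull back the neighbor's transported feature.
            adj[node] = [p + q for p, q in zip(adj[node], matvec(ft, other))]
    return lap, adj


# Path graph 0 - 1 - 2 with one rotation per node (hypothetical toy sheaf).
edges = [(0, 1), (1, 2)]
angles = [0.3, 1.1, -0.7]
maps = {(v, e): rot(angles[v]) for e, edge in enumerate(edges) for v in edge}

# Fully "agreed" features: every node maps to the same edge-space vector z.
z = [1.0, 2.0]
x = [matvec(transpose(rot(a)), z) for a in angles]

lap, adj = sheaf_operators(edges, maps, x)
print("max |L x| =", max(abs(c) for row in lap for c in row))
print("A x =", adj)
```

On features that already agree across every edge, the Laplacian output is zero up to floating-point error, while the assumed adjacency returns each node's feature scaled by its degree. Under these assumptions, that is one way to see why a Laplacian-based update can stop contributing once representations agree, whereas an adjacency-based update still carries a signal.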