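arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.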

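2606.19394 2026-06-19 cs.FL New submission

On Epimorphisms of Hypergraphic Automata and Input Symbol Semigroups

Jasem Hamoud

AI summary: This paper characterises epimorphisms of universal hypergraphic automata and of their input-symbol semigroups. It introduces weak and strong notions of hypergraph epimorphism, proves that they coincide on the subclass of p*-hypergraphs, and gives necessary and sufficient conditions for a triple to be an epimorphism.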

Comments 13 pages, 2 figures

Abstract

Hypergraphic automata are automata whose state sets and output symbol sets are hypergraphs invariant under the actions of the transition and output functions. Universally attracting objects in the category of such automata are called universal hypergraphic automata; their semigroups of input symbols are algebras of mappings whose properties are tightly linked to the algebraic structure of the automata themselves. This paper establishes a complete characterisation of epimorphisms of universal hypergraphic automata and of their semigroups of input symbols. A central contribution is the introduction of two distinct notions of epimorphism for hypergraphs, weak and strong, together with a proof that these notions diverge in general but necessarily coincide for the important subclass of $p^*$-hypergraphs, which includes automata whose state hypergraphs and output hypergraphs are projective or affine planes. The main results give necessary and sufficient conditions for a triple $(f, \mathbb{P}_s, g)$ to be an epimorphism of universal hypergraphic automata, expressed in terms of the component maps on the state and output hypergraphs.

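2606.19390 2026-06-19 cs.SE cs.AI New submission

Execution-bound advisory automation for agentic AI: a reproducible AIBOM-driven CSAF-VEX framework

Petar Radanliev, Omar Santos, Carsten Maple, Kay Atefi

AI summary: Proposes a protocol-driven framework that binds SBOM and AIBOM artifacts to deterministic environment capture and structured runtime telemetry, combines static and runtime evidence to generate CSAF VEX advisories, verifies them through cryptographic signing and deterministic replay, and evaluates the approach on synthetic agentic AI workloads.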

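2606.19388 2026-06-19 cs.SE cs.CL cs.HC New submission

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

Li Gu, Zihuan Jiang, Linqiang Guo, Zhixiang Chi, Ziqiang Wang, Huan Liu, Yuanhao Yu, Tse-Hsun Chen, Yang Wang

AI summary: This paper challenges the GUI-dominated paradigm for mobile agents and argues that the CLI deserves equal standing. Experiments show CLI agents outperforming GUI baselines on AndroidWorld and MobileWorld, and a new CLI-Advantage task suite demonstrates where the CLI approach excels.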

Abstract

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8% and 51.9%, outperforming every reproducible GUI baseline (69.3/68.1/57.8% on AndroidWorld; 43.2/26.3/13.3% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the CLI-Advantage Task Suite, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs. 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.

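2606.19387 2026-06-19 cs.SE cs.AI New submission

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

You Li, Samuel Mandell, David Z. Pan

AI summary: Proposes a hardware generation framework that combines the creativity of LLMs with the interpretability of formal methods, iteratively applying transformation rules to turn a design specification into an RTL program with correctness guarantees.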

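2606.19386 2026-06-19 cs.SE cs.AI cs.LG New submission

Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence

Manvendra Modgil

AI summary: This paper finds that wall-clock-calibrated leaky-integrator monitors cannot act as moment detectors on agent streams, shows that the calibration class is the key variable, and proposes rising-edge triggers as an alternative.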

Comments 10 pages, 5 figures. Sequel to arXiv:2606.04296. Pre-registered; falsification clauses honored (H5 unsupported; H7 strict band 16/20) repo:https://github.com/2025eb1100268-tech/intervention-timing-saturation-trap

Abstract

Runtime monitors for autonomous agents commonly threshold an accumulated internal state - a behavioural baseline, a drift statistic, or, in our prior work, a modelled affective state. We previously reported a State Saturation Trap: threshold-on-state triggers over a continuous affect engine become near-constant alarms on SWE-bench debugging agents (Modgil 2026). A post-release audit found the engine received dt=0 between actions, so its exponential decay never operated: the published trap is a pure-accumulator result. We correct the record (erratum, v2) and treat the flaw as an experiment. The key variable it exposes is whether a monitor's dynamics are calibrated in sample time (per observation, as in CUSUM) or wall-clock time (half-lives in seconds, as in affect models and EMA baselines). On fixed-rate streams these coincide; on agent streams, where inter-action time varies by orders of magnitude, they do not. A pre-registered sweep over uniform intervals (dt in {0..600}s) on 20 trajectories shows the wall-clock level trigger has two regimes: at dt<=1s a constant alarm (20/20; median 18 firings); at dt>=60s silent. Every critical dt lies in (1,30]s. Real agent runs measure latency at median 1.53s (p90 2.33s); real coding cadence sits inside the trap regime, vindicating the empirical finding under a corrected mechanism. The structure is a property of the calibration class, not the engine: a minimal wall-clock accumulator over the raw error stream reproduces the same cliff, while a sample-time CUSUM over the identical stream is exactly dt-invariant (20/20). A rising-edge trigger with hysteresis fires 0-3 times per trajectory in every condition. We conclude that wall-clock-calibrated leaky-integrator monitors admit no regime in which they act as moment detectors on agent streams; transition detection escapes the trap at every cadence, but does not recover human intervention timing.
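
The contrast between the two calibration classes can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's code: the half-life, drift, threshold, and the error stream are all made-up values, and the function names are hypothetical. It shows a leaky integrator whose decay depends on the inter-action time dt next to a per-observation CUSUM that ignores dt.

```python
def wall_clock_monitor(errors, dt, half_life=10.0, threshold=3.0):
    """Leaky integrator with decay calibrated in seconds (assumed half-life).

    Returns the number of observations at which the state exceeds the threshold.
    """
    decay = 0.5 ** (dt / half_life)  # dt = 0 gives decay 1.0, i.e. a pure accumulator
    state, alarms = 0.0, 0
    for e in errors:
        state = state * decay + e
        if state > threshold:
            alarms += 1
    return alarms


def cusum_monitor(errors, dt, drift=0.5, threshold=3.0):
    """One-sided CUSUM calibrated per observation; dt is accepted but never used."""
    state, alarms = 0.0, 0
    for e in errors:
        state = max(0.0, state + e - drift)
        if state > threshold:
            alarms += 1
    return alarms


# Hypothetical error stream: 20 observations, 1.0 marks a failed action.
errors = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0] * 2

for dt in (0.0, 1.0, 60.0, 600.0):
    print(dt, wall_clock_monitor(errors, dt), cusum_monitor(errors, dt))
```

In this toy setting the wall-clock monitor alarms on most steps at small dt and goes silent at large dt, while the CUSUM count is the same for every dt, which mirrors the qualitative split the abstract describes.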

2606.19383 2026-06-19 cs.RO cs.CV New submission

3D Scene Graphs: Open Challenges and Future Directions

Dennis Rotondi, Francesco Argenziano, Sebastian Koch, Nathan Hughes, Martin Buechner, Johanna Wald, Lukas Rosenberger Schmid, Daniele Nardi, Abhinav Valada, Liam Paull, Federico Tombari, Luca Carlone, Kai O. Arras

AI summary: A unified survey of how 3D scene graphs (3DSGs) are constructed, applied, and evaluated, analysing existing modelling choices and open challenges with the aim of enabling robust deployment.

Comments Invited article for the Annual Review of Control, Robotics, and Autonomous Systems Volume 10

Abstract

3D Scene Graphs (3DSGs) have emerged as a powerful representation for spatial AI by combining geometric grounding with semantic and relational abstractions of the environment. Their expressiveness has made them relevant to a broad range of problems in robotics and computer vision, including manipulation, navigation, task planning, scene understanding, and many others. However, the field remains fragmented: different communities adopt distinct formulations, construction pipelines, and evaluation protocols, making it difficult to compare methods, identify common assumptions, and assess remaining challenges for robust real-world deployment. This survey provides a unified and critical review of 3DSGs, with particular emphasis on open challenges and future directions. We first formalize 3DSGs under a common definition and analyze the principal modeling choices that characterize existing formulations, including node and edge attributes, hierarchical structure, dynamic scene representations, and affordance-aware extensions. We then review how 3DSGs are built from raw sensory observations, discussing the most common terminologies, conventions, and techniques. Finally, we examine downstream applications and evaluation strategies, from intrinsic graph quality to task-level performance. To support the community, we also provide a dedicated website that organizes and extends the surveyed content, accessible at https://3dscenegraphs.com/.

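2606.19382 2026-06-19 cs.SE cs.AI New submission

DynAMO: Dynamic Asset Management Orchestration via Topological Multi-Agent Scheduling

Kanishk Kushwaha, Vikrant Vinod Bansode, Harsh Vardhan, Dhaval C. Patel

AI summary: Proposes the DynAMO engine, which uses a Plan-then-Execute architecture to generate verifiable workflow graphs supporting sequential and parallel execution. By dynamically identifying independent tasks it improves efficiency, achieving a 1.6x median latency reduction on an industrial benchmark while preserving correctness and safety.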

Comments 11 pages, 2 figures, 7 tables, 4 algorithms. Evaluated on the AssetOpsBench industrial benchmark. Code: https://github.com/kushwaha001/DynAMO

Abstract

While LLM-powered agents offer end-to-end automation for industrial asset lifecycles, real-world Industry 4.0 deployment is hindered by latency, concurrency instability, and safety risks. We present DynAMO (Dynamic Asset Management Orchestration), a deployment-ready engine using a Plan-then-Execute architecture to generate verifiable workflow graphs. DynAMO supports both SequentialWorkflow (topological execution) and ParallelWorkflow (dependency-aware concurrency). By dynamically identifying independent tasks, DynAMO preserves structural correctness and safety while significantly improving efficiency through controlled reasoning overlap. Across six controlled experiments on the AssetOpsBench industrial benchmark, DynAMO demonstrates substantial performance and robustness gains. Parallel execution reduces end-to-end latency by a median of 1.6x over sequential orchestration, rising to 1.8x on highly parallelizable workflows. After instrumenting external tool calls with realistic latencies, a latency decomposition shows that LLM reasoning and orchestration still account for more than 90% of execution time, identifying model inference as the primary system bottleneck. Structured context pruning reduces inference latency by approximately 30%, and DynAMO maintains correct functional behaviour (task completion, agent sequencing, and output quality) while exhibiting graceful degradation under controlled fault injection. Reproducibility analysis further confirms stable execution under repeated runs, with parallel scheduling reducing latency variance. These findings establish DynAMO as a practical blueprint for scalable, safe, and latency-aware agent deployment in Industry 4.0 automation pipelines. Code is available at: https://github.com/kushwaha001/DynAMO
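
The idea of dependency-aware concurrency over a workflow graph can be sketched with Python's standard `graphlib` module. This is an illustrative assumption, not DynAMO's implementation: the task names and the `parallel_batches` helper are hypothetical. The sketch groups tasks into waves in which every task has all of its dependencies satisfied, so the tasks within a wave could run concurrently.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group tasks into waves; tasks within a wave have no unmet dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task that can start now
        batches.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock successors
    return batches


# Hypothetical plan: each task maps to the set of tasks it depends on.
plan = {
    "fetch_sensor_data": set(),
    "fetch_work_orders": set(),
    "detect_anomaly": {"fetch_sensor_data"},
    "draft_report": {"detect_anomaly", "fetch_work_orders"},
}

print(parallel_batches(plan))
```

Executing the waves one task at a time gives a sequential, topologically ordered run, while dispatching each wave to a thread pool would give the parallel variant; the cycle check in `prepare()` is one simple way to reject a plan that is not a valid workflow graph.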

2606.19381 2026-06-19 cs.SD cs.AI New submission

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

Yue Heng Yeo, Haoyang Li, Yizhou Peng, Shreyas Gopal, Hexin Liu, Leibny Paola Garcia-Perera, Hardik B. Sailor, Jeremy H. M. Wong, Eng Siong Chng

Affiliations: College of Computing and Data Science, Nanyang Technological University; Google DeepMind

AI summary: To address the scarcity of high-quality text-speech pairs for code-switching speech recognition, this paper proposes a code-mixing guided preference-learning framework that uses the Code Mixing Index to improve the switching fidelity of synthetic speech. Fine-tuning Whisper Large on the SEAME corpus reduces Mixed Error Rate from 12.1%/17.8% to 8.9%/14.2%.

Comments Accepted to Interspeech 2026

Abstract

Code-switch (CS) Automatic Speech Recognition (ASR) remains challenging due to limited availability of high quality CS text-speech pairs for training. Although synthetic data augmentation via Text-to-speech (TTS) has been explored, existing CS TTS approaches primarily optimise reconstruction fidelity and do not explicitly enforce language-boundary consistency, thereby limiting their effectiveness for CS ASR augmentation. This paper proposes a code-mixing guided preference-learning framework that steers synthetic speech generation toward improved code-switching fidelity using the Code Mixing Index (CMI). Experiments on the SEAME Mandarin-English conversational corpus demonstrate that the proposed method enhances the utility of synthetic data for ASR fine-tuning. Specifically, when fine-tuning Whisper Large, the proposed approach reduces Mixed Error Rate (MER) from 12.1%/17.8% to 8.9%/14.2% on the DevMAN and DevSGE sets, respectively.
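
For readers unfamiliar with the Code Mixing Index, the following sketch computes the commonly cited utterance-level definition, CMI = 100 * (1 - max_i(w_i) / (n - u)), where n is the number of tokens, u the number of language-independent tokens, and w_i the token count of language i. The abstract does not say which CMI variant the paper uses, so this formula, the tag names, and the function name are assumptions for illustration only.

```python
from collections import Counter


def code_mixing_index(lang_tags, neutral="other"):
    """Utterance-level CMI from per-token language tags.

    Tokens tagged `neutral` (e.g. punctuation, named entities) count as
    language-independent. Returns 0.0 for monolingual or all-neutral input.
    """
    n = len(lang_tags)
    counts = Counter(tag for tag in lang_tags if tag != neutral)
    u = n - sum(counts.values())
    if n == u:  # no language-tagged tokens at all
        return 0.0
    return 100.0 * (1.0 - max(counts.values()) / (n - u))


# Hypothetical Mandarin-English utterance: 3 zh tokens, 2 en tokens, 1 neutral.
tags = ["zh", "zh", "en", "zh", "other", "en"]
print(code_mixing_index(tags))
print(code_mixing_index(["zh", "zh", "zh"]))
```

A score like this could serve as a scalar preference signal when ranking synthetic utterances by how faithfully they preserve the intended amount of mixing, though how the paper actually plugs CMI into preference learning is not specified in the abstract.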

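2606.19380 2026-06-19 cs.SE cs.LG New submission

AgentArmor: A Framework, Evaluation, & Mitigation of Coding Agent Failures

Kenneth Ge, Andre Assis

AI summary: Proposes the AgentArmor framework, which uses an extended system prompt, a command classifier, a "3 strikes" policy, and related mechanisms to mitigate coding agent failures caused by underspecification, capability errors, and harness errors, significantly improving safety.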

Abstract

Software engineering and deployment are increasingly being delegated to AI coding agents. The scale of their adoption is surfacing rare, but highly destructive, failure modes. In this paper, we study these failure modes as stemming from three distinct mechanisms: underspecification, where default model behavior is unsafe; capability errors, where the safe action is available but the model does not adhere to it due to bias or capability limitations; and agent harness errors, where the model fails to execute the safe action through the harness. We evaluate these across 8 different evaluations, each inspired by real-life deployment failures, totaling 20 coding environments and 59 synthetic transcript templates. Based on this evaluation, we propose AgentArmor, an agent harness modification, to mitigate these errors. By adding an extended system prompt, a separate command classifier, a "3 strikes" policy, deterministic guardrails, and tools for the agent to edit its own context, we show that AgentArmor is safer across a statistically significant number of samples. Thus, we suggest concrete mitigations for current coding agents and a design philosophy for future agent harness features.
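
To make two of the listed mechanisms concrete, here is a minimal sketch of how a deterministic guardrail and a "3 strikes" policy might be combined in a harness. Everything in it is hypothetical: the abstract does not describe AgentArmor's actual rules or interfaces, so the blocked patterns, the `StrikeGate` class, and the three return labels are illustrative assumptions.

```python
import re

# Hypothetical deny-list of destructive shell patterns (not from the paper).
BLOCKED_PATTERNS = [
    r"\brm\s+-rf\s+/",
    r"\bgit\s+push\b.*--force\b",
    r"\bdrop\s+database\b",
]


class StrikeGate:
    """Blocks commands matching the deny-list and halts after repeated violations."""

    def __init__(self, max_strikes=3):
        self.max_strikes = max_strikes
        self.strikes = 0

    def check(self, command):
        if self.strikes >= self.max_strikes:
            return "halted"  # session already stopped; nothing else runs
        if any(re.search(p, command, re.IGNORECASE) for p in BLOCKED_PATTERNS):
            self.strikes += 1
            return "halted" if self.strikes >= self.max_strikes else "blocked"
        return "allowed"


gate = StrikeGate()
for cmd in ["ls -la", "rm -rf /", "git push origin main --force", "DROP DATABASE prod", "ls"]:
    print(cmd, "->", gate.check(cmd))
```

A regex deny-list is deterministic but easy to evade, which is presumably why the paper pairs guardrails with a separate learned command classifier; this sketch covers only the deterministic part.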

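2606.19379 2026-06-19 cs.LG cs.AI cs.CL New submission

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Stuart Whipp

Affiliations: Independent Research

AI summary: Using exact least-squares linear approximations, this paper measures the linear recoverability of each feed-forward block in trained Transformers, finding it highly heterogeneous and non-monotone with depth, a learned rather than architectural property, and useful for compression and diagnostics.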

Comments 14 pages, 5 figures

Abstract

Transformer feed-forward networks (FFNs) are often treated as nonlinear stores of computation, yet how nonlinear a trained FFN block actually is has rarely been measured. We treat each FFN as a position-wise input-to-output map and split it into the exact least-squares linear approximation plus a residual. The held-out variance the closed-form linear map explains defines a block's linear recoverability (R^2_lin), an optimiser-free measure of its linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterogeneous and non-monotone with depth, ranging from near-linear (>0.99) to strongly nonlinear (<0.3) between adjacent blocks, and is not set by the activation function: same-width GELU models GPT-2 and Pythia-160m have sharply different profiles, so recoverability is a learned property of individual trained blocks, not an architectural one. A low-rank bilinear probe of the residual recovers only a few points of R^2, with gain uncorrelated with residual nonlinearity: the unrecovered computation is not a single position-wise product but higher-order or distributed structure. The measurement also serves as a targeted compression signal: recoverable blocks admit large single-layer replacements (GPT-2's early FFN at 8x fewer parameters for +0.77 perplexity), while low-recoverability blocks flag where this is unsafe. It further exposes a methodological pitfall: trained linear baselines can badly under-converge on ill-conditioned transformer activations, so we report the exact closed-form least-squares ceiling throughout.
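
The measurement can be illustrated on a scalar toy problem: fit the closed-form least-squares line on training inputs, then report the fraction of held-out variance it explains. This is a sketch under loud assumptions. The paper fits a matrix-valued map over hidden-state vectors of real FFN blocks, whereas the two functions below are made-up stand-ins for a "near-linear" and a "strongly nonlinear" block, and all names are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept (no optimiser involved)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def held_out_r2(f, n_train=2000, n_test=2000, seed=0):
    """Held-out variance of f explained by the best linear fit on training data."""
    rng = random.Random(seed)
    train = [rng.gauss(0.0, 1.0) for _ in range(n_train)]
    test = [rng.gauss(0.0, 1.0) for _ in range(n_test)]
    slope, intercept = fit_line(train, [f(x) for x in train])
    ys = [f(x) for x in test]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(test, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


near_linear = held_out_r2(lambda x: 2.0 * x + 0.05 * math.tanh(x))
strongly_nonlinear = held_out_r2(lambda x: x * x)
print(near_linear, strongly_nonlinear)
```

Because the fit is closed-form, the score cannot suffer from the under-convergence pitfall the abstract mentions for trained linear baselines; that is the design point this toy version preserves.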

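2606.19377 2026-06-19 cs.LG cs.AI New submission

Emyx: Fast and efficient all-atom protein generation

Nicholas J. Williams, Ward Haddadin, Matteo P. Ferla, Constantin Schneider, Nicholas B. Woodall, Ruby Sedgwick, Christian D. Madsen, Andrew L. Hopkins, Edward O. Pyzer-Knapp

Affiliations: Xyme

AI summary: Proposes Emyx, a 140M-parameter flow matching model that reduces complexity through lightweight conditional representations and sparse connectivity. It outperforms existing methods on an enzyme design benchmark while training in only 682 GPU-hours.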

Abstract

Computational enzyme design requires generating proteins that scaffold catalytic residues and ligands, a task that demands both geometric accuracy and structural diversity from the underlying generative model. Current all-atom generators inherit expensive architectures from structure prediction, leading to high training costs and limited sample diversity. We argue that much of this complexity is unnecessary for generators, which condition on sparse geometric constraints rather than rich co-evolutionary signals. Emyx is a 140M-parameter conditional flow matching model that concentrates capacity within standard transformer blocks, replacing heavy embedding stacks with lightweight conditional representations and sparse connectivity. We additionally derive an exact reparametrisation of the flow matching interpolant into the EDM noise-level framework, bridging flow matching training efficiency with state-of-the-art sampling methods designed for diffusion models without retraining. Despite being the smallest model, Emyx outperforms both Proteína-Complexa and RFdiffusion3 against the AME enzyme design benchmark across success rate under strict evaluation requiring both global fold recovery and catalytic geometry accuracy, structural novelty, scaffold diversity, and geometric validity, while training in just $682$ GPU-hours, roughly $4\times$ less than RFdiffusion3.
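
The abstract does not spell out the reparametrisation, so the following is only a sketch of one standard identity that links a linear flow matching interpolant to an EDM-style noise level, and it may differ from the paper's derivation. It assumes the interpolant x_t = (1 - t) * data + t * noise with t = 0 at the data end; dividing by (1 - t) gives data + sigma * noise with sigma = t / (1 - t). All function names are hypothetical.

```python
import random


def fm_interpolant(data, noise, t):
    """Linear flow matching interpolant (assumed convention: t = 0 is pure data)."""
    return [(1.0 - t) * d + t * n for d, n in zip(data, noise)]


def edm_sigma(t):
    """Noise level of the rescaled interpolant; diverges as t approaches 1."""
    return t / (1.0 - t)


rng = random.Random(0)
data = [rng.uniform(-1.0, 1.0) for _ in range(5)]
noise = [rng.gauss(0.0, 1.0) for _ in range(5)]
t = 0.3

x_t = fm_interpolant(data, noise, t)
rescaled = [x / (1.0 - t) for x in x_t]  # flow matching sample, rescaled
edm_view = [d + edm_sigma(t) * n for d, n in zip(data, noise)]  # EDM-style sample

print(max(abs(a - b) for a, b in zip(rescaled, edm_view)))
```

Under this identity, a network trained on x_t can be queried by a sampler that works in terms of sigma by mapping sigma back to t = sigma / (1 + sigma) and rescaling its input, which is the kind of bridge "without retraining" that the abstract alludes to.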

2606.19376 2026-06-19 cs.LG cs.AI cs.IR New submission

Cost-Optimal LLM Routing with Limited User Feedback under User Satisfaction Guarantees

Herbert Woisetschläger, Arastun Mammadli, Ryan Zhang, Shiqiang Wang

Affiliations: Technical University of Munich; University of Exeter; Horace Greeley High School

AI summary: To address the tension between LLM inference cost and quality of service, this paper proposes SLARouter, an online routing algorithm that learns a cost-optimal policy from sparse, one-sided user feedback, with theoretical guarantees of cost optimality and SLA compliance. Experiments show cost reductions of up to 2.2x.

Comments Preprint. Under review

Abstract

Inference costs for large language model (LLM) applications are rapidly growing, driven by surging demand and rising infrastructure cost. Users expect high-quality responses, and in commercial settings this is formally codified in Service Level Agreements (SLAs), creating a fundamental tension between cost and quality. Recent progress on cost-aware LLM request routing has shown potential to resolve this tension, but existing approaches rely on complete feedback signals, offline training, extensive per-workload tuning, and most lack SLA guarantees or inference-time adaptivity. We introduce SLARouter, an online routing algorithm that learns a cost-optimal policy from the sparse, one-sided user feedback available in production systems. SLARouter provides theoretical guarantees for both cost optimality and strict SLA compliance. Experiments across a wide range of LLM benchmarks show that SLARouter satisfies SLA constraints without the need for per-benchmark tuning, reducing operating cost by up to 2.2x over existing baselines.

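2606.19374 2026-06-19 cs.LG cs.AI New submission

Protein Representation Learning with Secondary-Structure and Energy-Filtered Hydrogen-Bond Graphs

Mohamed Mouhajir, Limei Wang, El Houcine Bergou, Hajar El Hammouti, Lamiae Azizi, Dongqi Fu

Affiliations: College of Computing, UM6P (Mohammed VI Polytechnic University)

AI summary: Proposes a secondary-structure-aware graph neural network that augments residue node representations and builds edges from energy-filtered hydrogen bonds to capture local structural context and long-range couplings, yielding consistent improvements on protein benchmarks and better biological interpretability.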

Journal ref: The 25th International Workshop on Data Mining in Bioinformatics (BIOKDD 2026)

Abstract

Graph-based representations are widely used in protein modeling, yet many existing approaches rely primarily on sequence adjacency or geometric proximity, which only partially reflect the principles governing protein folding. Proteins instead adopt complex three-dimensional conformations organized around secondary structure elements, such as $α$-helices and $β$-sheets, which encode recurring local motifs and stabilizing hydrogen-bond interactions. In this work, we introduce a secondary-structure-aware graph neural network for protein representation learning. Residue-level node representations are augmented with secondary structure assignments, and graph edges are constructed from hydrogen-bond interactions filtered by their energetic strength. This design enables the model to capture both local structural context and long-range couplings that are central to protein stability and function. We evaluate the proposed approach on commonly used protein benchmarks and observe consistent improvements over existing graph-based methods. In addition, the resulting graph representations offer enhanced biological interpretability, as the learned connectivity aligns with established structural motifs. These findings suggest that incorporating secondary structure and energy-filtered hydrogen-bond topology provides an effective inductive bias for protein representation learning. The code is released at https://github.com/mohamedmohamed2021/SSProNet

2606.19373 2026-06-19 cs.LG cs.AI New submission

cAPM: Continual AI-Assisted Pace-Mapping with Active Learning

Dylan O'Hara, Pradeep Bajracharya, Casey Meisenzahl, Karli Gillette, Anton J. Prassl, Gernot Plank, Saman Nazarian, Roderick Tung, John L Sapp, Linwei Wang

Affiliations: Rochester Institute of Technology; University of Utah; Scientific Computing and Imaging Institute, University of Utah; Medical University of Graz; University of Pennsylvania Perelman School of Medicine; The University of Arizona College of Medicine; Dalhousie University

AI summary: Proposes the cAPM framework, which combines a task-agnostic surrogate neural network with active-learning and continual-learning strategies to reduce the amount of pace-mapping data needed while transferring knowledge across ventricular tachycardias, reaching an 81% probability of localization within clinical tolerance.

Abstract

Ventricular tachycardia (VT) is a life-threatening rhythm disorder and a major cause of sudden cardiac death. Pace-mapping is a clinical procedure for identifying the intervention target during catheter ablation of VT. It requires clinicians to pace different sites in the ventricles and rapidly interpret the resulting electrocardiograms to determine where to pace next or whether a target site has been identified. Active learning AI models have been proposed to guide clinicians to the next pacing site, showing promise in reducing the number of pacing sites and improving the efficiency of pace-mapping. Existing methods require retraining for each target and cannot transfer knowledge across multiple VTs within the same patient or across patients. We introduce cAPM for continual AI-assisted pace-mapping to capture and transfer knowledge accumulated from past pace-mapping data, reducing the amount of pace-mapping data needed for future target VTs. This is made possible by a task-agnostic surrogate neural network that learns the mapping from pacing sites to 12-lead ECG morphology, an active-learning strategy that refines this surrogate model by selecting the most informative pacing site for each target, and a continual learning strategy to do so sequentially while retaining knowledge from prior targets. Evaluated on an in-silico testbed consisting of sequentially-presented localization tasks across different physiological conditions and ventricular geometries, cAPM with and without replay of past data samples achieved an 81% probability of localizing within clinical tolerance (5 mm accuracy) using 4.5 pace-mapping sites, compared to the state-of-the-art active-learning method achieving 38% probability using 13.7 pacing sites. These results provide a strong basis for preparing cAPM towards in-vivo preclinical and clinical studies where it can be used to guide pace-mapping.

2606.19371 2026-06-19 cs.LG cs.AI cs.CV New submission

ProMUSE: Progressive Multi-modal Uncertainty-guided Staged Evidential Alzheimer Disease Classification

Long Doan, Branden Chen, Ethan Litton, Huan Huang, Jiajing Huang, Yixin Xie, Weihua Zhou, Nandakumar Narayanan, Chen Zhao

Affiliations: Kennesaw State University; Michigan Technological University; University of Iowa

AI summary: Proposes ProMUSE, a progressive multi-modal, uncertainty-guided, staged evidential network that adaptively decides when additional modalities are needed, lowering data acquisition cost while maintaining accuracy.

Abstract

Alzheimer's disease (AD) is a fatal disorder that destroys memory and cognitive skills in the elderly population. Most treatments for AD are effective in the early stage, leading to an increasing demand for early AD diagnosis. AD diagnosis increasingly relies on multimodal data such as clinical assessments, structural Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. However, MRI and PET acquisition remain costly and not universally accessible, making full-modality inference impractical in real-world clinical workflows. We propose ProMUSE, a Progressive Multi-modal Uncertainty Guided Staged Evidential Network that adaptively determines when additional modalities are necessary, helping reduce the overall cost of data acquisition while maintaining accuracy. ProMUSE first performs evidential classification using low-cost clinical data and quantifies uncertainty via a Dirichlet-based subjective logic model. When uncertainty exceeds a learned threshold, ProMUSE progressively incorporates MRI or PET features, fusing modality-wise belief and uncertainty through Dempster-Shafer theory to obtain a calibrated multimodal prediction. This staged acquisition strategy enables accurate diagnosis while minimizing reliance on expensive imaging. Experiments on ADNI, AIBL, and OASIS across CN-AD, CN-MCI, and MCI-AD tasks demonstrate that ProMUSE achieves competitive or superior accuracy compared to full-modality baselines while reducing MRI/PET usage by 50-90%, yielding substantial cost savings. These results highlight ProMUSE as a practical, uncertainty-aware, and resource-efficient solution for real-world AD screening.
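
The staged pipeline can be sketched with the usual subjective-logic construction: non-negative evidence e_k per class defines Dirichlet parameters alpha_k = e_k + 1, beliefs b_k = e_k / S, and uncertainty u = K / S with S = sum(alpha_k). Two opinions are then fused with a reduced Dempster rule of combination. This is a minimal illustration under stated assumptions: the evidence values are invented, the fixed threshold stands in for the learned one mentioned in the abstract, and the function names are hypothetical rather than ProMUSE's API.

```python
def opinion(evidence):
    """Map per-class evidence to (beliefs, uncertainty) via a Dirichlet model."""
    k = len(evidence)
    strength = sum(evidence) + k  # S = sum(alpha), with alpha_i = e_i + 1
    return [e / strength for e in evidence], k / strength


def ds_combine(op1, op2):
    """Reduced Dempster rule for two opinions over the same classes."""
    (b1, u1), (b2, u2) = op1, op2
    k = len(b1)
    conflict = sum(b1[i] * b2[j] for i in range(k) for j in range(k) if i != j)
    scale = 1.0 / (1.0 - conflict)
    belief = [scale * (b1[i] * b2[i] + b1[i] * u2 + b2[i] * u1) for i in range(k)]
    return belief, scale * u1 * u2


def staged_predict(clinical_ev, imaging_ev, threshold=0.4):
    """Use imaging only when the clinical opinion is too uncertain.

    `threshold` is a fixed stand-in for a learned value (assumption).
    """
    belief, u = opinion(clinical_ev)
    used_imaging = False
    if u > threshold:
        belief, u = ds_combine((belief, u), opinion(imaging_ev))
        used_imaging = True
    return belief.index(max(belief)), u, used_imaging


# Hypothetical two-class evidence, e.g. [CN, AD].
print(staged_predict([1.0, 1.0], [9.0, 1.0]))   # weak clinical evidence
print(staged_predict([18.0, 0.0], [9.0, 1.0]))  # strong clinical evidence
```

In the first call the clinical opinion has uncertainty 0.5, so imaging evidence is fused in and the uncertainty drops; in the second the clinical evidence alone is confident enough that the costly modality is skipped, which is the mechanism behind the reported reduction in MRI/PET usage.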


2606.19370 2026-06-19 cs.LG cs.AI cs.MA 新提交

Human-like autonomy emerges from self-play and a pinch of human data

类人自主性从自我对弈和少量人类数据中涌现

Daphne Cornelisse, Julian Hunt, Zixu Zhang, Waël Doulazmi, Kevin Joseph, Jaime Fernández Fisac, Eugene Vinitsky

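Affiliations: NYU Tandon School of Engineering; NYU Courant; Princeton University; Centre for Robotics, Mines Paris; Valeo

AI summary: Proposes a regularization method that combines self-play reinforcement learning with a small amount of human demonstrations. With only 30 minutes of human data it trains driving policies that coordinate with humans, and training takes only 15 hours.

Comments 10 pages

AI Chinese abstract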

自我对弈强化学习最近成为一种无需任何人类数据即可训练驾驶策略的方法。它利用廉价的大规模模拟来替代昂贵的大规模人类驾驶演示。这种方法的一个关键局限性是，通过纯自我对弈训练的策略可以学习有效但不符合人类习惯的驾驶惯例。先前的工作试图通过广泛的奖励工程和领域随机化来缓解这种行为偏差，但这些方法脆弱且劳动密集。我们的方法没有完全抛弃人类演示，而是将其作为最小安全目标达到奖励之上的正则化目标。就像好炖菜中的香料一样，我们发现少量人类数据大有裨益：我们的方法仅使用30分钟的人类演示，比同类模仿学习方法少2500倍。由此产生的策略与保留的人类轨迹协调，并在单个消费级GPU上15小时内完成训练。视频和完整源代码见 https://spiced-self-play.com/ 。

English abstract

Self-play reinforcement learning has recently emerged as a way to train driving policies without any human data. It uses cheap, large-scale simulations to substitute for expensive, large-scale human driving demonstrations. A key limitation of this approach is that policies trained through pure self-play can learn effective but alien driving conventions incompatible with people. Previous work attempts to mitigate such behavioral misalignments through extensive reward engineering and domain randomization, which are brittle and labor-intensive. Instead of completely discarding human demonstrations, our method treats them as a regularization objective on top of a minimal safe goal-reaching reward. Like the spice in a good stew, we find that a little human data goes a long way: our method uses only 30 minutes of human demonstrations, 2500x fewer than comparable imitation learning approaches. Resulting policies coordinate with held-out human trajectories and complete training in 15 hours on a single consumer-grade GPU. Videos and full source code are available at https://spiced-self-play.com/.


2606.19369 2026-06-19 cs.LG cs.AI 新提交

Zero-Inflated Gaussian Distributions Enable Parameter-Space Sparsity in Estimation-of-Distribution Algorithms

零膨胀高斯分布使估计分布算法中的参数空间稀疏化

Andreas Faust, Sven Nitzsche, Juergen Becker

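Affiliations: University of Freiburg; FZI Research Center for Information Technology; Karlsruhe Institute of Technology

AI summary: Proposes multivariate zero-inflated Gaussian distributions as the sampling distribution for estimation-of-distribution algorithms, jointly optimizing sparsity patterns and active parameters without hand-designed sparsity operators. On the Lunar Lander benchmark it converges faster and reaches higher final returns.

AI Chinese abstract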

估计分布算法（EDA）是一类强大的黑箱优化进化方法，尤其当目标函数结构未知时。经典进化算法依赖于手工设计的变异和交叉算子，这些算子难以针对未知问题结构设计，且是偏差的来源，而EDA完全绕过了算子设计：它们将概率分布拟合到最佳个体，并从中采样下一代。EDA在连续参数空间上已得到充分确立，但此前尚未推广到稀疏空间——其中良好解的大多数系数恰好为零。现有的稀疏黑箱优化器因此重新引入了EDA旨在避免的东西：手工制作的稀疏算子、支持集与活跃值交替的双层方案、零阈值以及其他内置假设。我们通过提出多元零膨胀高斯（ZIG）分布作为EDA采样法则来填补这一空白。一个具有独立指示维度和值维度的潜在高斯模型表示稀疏模式、活跃参数之间的相关性以及两者之间的相互作用，因此稀疏模式和活跃值被联合优化，无需层次结构。我们证明该模型的潜在参数可以从观测样本中识别，不同于相关构造起源的缺失数据设置，并引入了实用的基于摊销反演的估计器。这些估计器准确恢复潜在相关结构，在Lunar Lander基准上，由此产生的ZIG-EDA比稠密高斯EDA、手工制作的稀疏进化算法和特设稀疏EDA收敛更快且最终回报更高，同时找到的控制器只有一小部分参数活跃。

English abstract

Estimation-of-distribution algorithms (EDAs) are a powerful class of evolutionary methods for black-box optimization, especially when little is known about the structure of the objective. Whereas classical evolutionary algorithms rely on hand-designed mutation and crossover operators, which are hard to devise for unknown problem structures and are a source of bias, EDAs sidestep operator design entirely: they fit a probability distribution to the best individuals and sample the next generation from it. EDAs are well established on continuous parameter spaces, but they have not previously been generalized to sparse ones, in which most coefficients of a good solution are exactly zero. Existing sparse black-box optimizers therefore reintroduce exactly what EDAs were designed to avoid: hand-crafted sparsity operators, bi-level schemes alternating between support set and active values, zeroing thresholds, and other baked-in assumptions. We close this gap by proposing multivariate zero-inflated Gaussian (ZIG) distributions as EDA sampling laws. A latent Gaussian model with separate indicator and value dimensions represents sparsity patterns, correlations among active parameters, and the interactions between the two, so sparsity patterns and active values are optimized jointly, hierarchy-free. We show that the latent parameters of this model are identifiable from observed samples, unlike in the missing-data settings where related constructions originate, and introduce practical amortized inversion-based estimators for them. The estimators accurately recover latent correlation structures, and on the Lunar Lander benchmark the resulting ZIG-EDA converges faster and reaches higher final returns than a dense Gaussian EDA, a hand-crafted sparse evolutionary algorithm, and an ad-hoc sparse EDA, while finding controllers with only a small fraction of parameters active.
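A minimal sketch of how one sparse candidate could be sampled from a latent Gaussian with separate indicator and value dimensions, as the abstract describes. The latent layout (first d coordinates are indicators, last d are values), the zero threshold on the indicator, and the function names are assumptions of this sketch; the paper's amortized inversion-based estimators for refitting the latent parameters are not reproduced here.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_zig(mean, cov, rng):
    """Draw one sparse vector of dimension d from a 2d-dimensional latent Gaussian.

    Assumed layout: latent[0:d] are indicators, latent[d:2d] are values.
    A parameter is active (non-zero) only when its indicator is positive.
    """
    d = len(mean) // 2
    chol = cholesky(cov)
    eps = [rng.gauss(0.0, 1.0) for _ in range(2 * d)]
    latent = [mean[i] + sum(chol[i][k] * eps[k] for k in range(i + 1))
              for i in range(2 * d)]
    return [latent[d + i] if latent[i] > 0.0 else 0.0 for i in range(d)]
```

Because indicators and values share one covariance matrix, off-diagonal entries can couple which parameters are active with what values they take, which is what lets a single fitted distribution express both the sparsity pattern and the active values.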


2606.19367 2026-06-19 cs.LG 新提交

Weibull Weight-Scale Parameter Evolution under AdamW Training Dynamics

Weibull 权重尺度参数在 AdamW 训练动态下的演化

Tiexin Ding

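Affiliations: Independent Researcher

AI summary: Studies why the Weibull weight-scale parameter λ grows, overshoots, and relaxes during AdamW training, derives a three-force decomposition (alignment, injection, decay), and verifies on Pythia-70M models that the alignment force dominates the rise phase, contributing 88-94%.

Comments 21 pages, 14 figures

AI Chinese abstract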

基于用于诊断 Transformer 权重分布的双参数 Weibull 框架，我们研究了为什么在 AdamW 训练期间 Weibull 权重尺度参数 λ 会增长、过冲然后松弛。我们从 AdamW 更新中推导出平方权重范数的领先阶三力分解：一个对齐力，测量权重与自适应更新方向之间的相关性；一个注入力，来自自适应步长幅度；以及一个衰减力，来自解耦的权重衰减。在具有真实优化器矩的自训练 Pythia-70M 模型上，对齐力主导上升阶段，在四个随机种子中贡献了绝对力预算的 88-94%，并且对超权重移除具有鲁棒性。接近饱和时，对齐力和衰减力趋于平衡，解释了从权重尺度增长到松弛的转变。这些力动态直接控制 λ(t) 背后的平方范数分量；剩余的 RMS 到 Weibull 重建偏移是可测量的，并分解为桥接分量和积分分量，在密集采样区域总计约 5-6%。为了将分析扩展到无法获得优化器矩的真实模型，我们引入了一种样条位移方法，该方法从稀疏检查点以约 92-94% 的准确率恢复对齐力，大约是朴素两点基线的两倍。我们进一步观察到，在我们的实验中，λ(t) 的峰值随训练数据一致性而变化，这表明权重尺度增长存在数据依赖成分，我们将其留待后续对照研究。代码和数据可在 https://github.com/tiexinding/NPM-Weibull-public 获取。

English abstract

Building on a two-parameter Weibull framework for diagnosing transformer weight distributions, we study why the Weibull weight-scale parameter $λ$ grows, overshoots, and then relaxes during AdamW training. We derive a leading-order three-force decomposition of the squared weight norm from the AdamW update: an alignment force measuring the correlation between weights and the adaptive update direction, an injection force from adaptive step magnitude, and a decay force from decoupled weight decay. On self-trained Pythia-70M models with ground-truth optimizer moments, alignment dominates the rise phase, contributing 88-94% of the absolute force budget across four random seeds and remaining robust to super-weight removal. Near saturation, alignment and decay approach balance, explaining the transition from weight-scale growth to relaxation. These force dynamics directly govern the squared-norm component underlying $λ(t)$; the remaining RMS-to-Weibull reconstruction offset is measurable and decomposes into bridge and integration components, totaling approximately 5-6% in densely sampled regions. To extend the analysis to real models where optimizer moments are unavailable, we introduce a spline displacement method that recovers the alignment force from sparse checkpoints with approximately 92-94% accuracy, about twice the naive two-point baseline. We further observe that the peak value of $λ(t)$ varies with training-data coherence in our experiments, suggesting a data-dependent component of weight-scale growth that we leave to a controlled follow-up study. Code and data are available at https://github.com/tiexinding/NPM-Weibull-public.
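A small numerical sketch of a three-term split of the squared weight norm under one AdamW step, assuming the standard update w_next = (1 - lr * wd) * w - lr * u, where u is the adaptive direction (bias-corrected first moment divided by the square root of the second moment). The names, sign conventions, and toy numbers are assumptions of this sketch and may differ from the paper's definitions.

```python
def squared_norm_forces(w, u, lr, wd):
    """Leading-order terms of ||w_next||^2 - ||w||^2 for one AdamW step."""
    dot = sum(wi * ui for wi, ui in zip(w, u))
    w2 = sum(wi * wi for wi in w)
    u2 = sum(ui * ui for ui in u)
    alignment = -2.0 * lr * dot     # correlation of weights with the update direction
    injection = lr * lr * u2        # magnitude of the adaptive step
    decay = -2.0 * lr * wd * w2     # decoupled weight decay
    return alignment, injection, decay


def exact_change(w, u, lr, wd):
    """Exact change in squared norm, for comparison with the three terms."""
    w_next = [(1.0 - lr * wd) * wi - lr * ui for wi, ui in zip(w, u)]
    return sum(x * x for x in w_next) - sum(x * x for x in w)


# Toy values (hypothetical): the update direction points against the weights,
# so the alignment term is positive and the norm grows.
w = [1.0, -2.0, 0.5]
u = [-0.5, 1.0, 0.2]
alignment, injection, decay = squared_norm_forces(w, u, lr=1e-3, wd=0.1)
residual = exact_change(w, u, lr=1e-3, wd=0.1) - (alignment + injection + decay)
```

The residual collects the cross terms of order lr^2 * wd, which is why the three-term split is only a leading-order statement. When the alignment and decay terms cancel, the squared norm stops growing, matching the balance the abstract describes near saturation.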


2606.19366 2026-06-19 cs.LG cs.AI eess.SP 新提交

Information Lattice Learning as Probabilistic Graphical Model Structure Learning

信息格学习作为概率图模型结构学习

Haizi Yu, Lav R. Varshney

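Affiliations: Kocree, Inc.; AI Innovation Institute, Stony Brook University

AI summary: Interprets information lattice learning (ILL) as probabilistic graphical model structure learning, in which interpretable rules are learned by projecting onto a partition lattice, and establishes connections to maximum entropy and factor graphs.

AI Chinese abstract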

信息格学习（ILL）通过将信号交替投影到编码抽象层次结构的分区格上，并将选定的规则提升回信号域，来学习信号的可解释规则。当信号是概率质量函数时，我们证明ILL学习的概率规则具有自然的概率图模型（PGM）解释，并详细发展了这一解释。ILL中的分区诱导出一个确定性的商变量，规则是该商变量的边际分布。因此，规则集是可解释抽象上的边际约束集合。一般提升是满足这些约束的所有联合分布的可行族，而特殊提升则选择最大无知重建，在ILL中通过L2均匀性原理实现，该原理与最大熵密切相关。在香农熵提升下，相同的约束产生一个对数线性因子图，其因子由学习的抽象索引。然而，信息格本身不是贝叶斯网络：其边编码抽象的细化与粗化，而非条件依赖。因此，ILL最好被视为商变量上可解释的基于约束的因子图的结构学习。这一观点阐明了ILL如何与图模型和最大熵模型相关，同时为推理、可识别性和混合符号-概率学习提出了新方向。

English abstract

Information lattice learning (ILL) learns interpretable rules of a signal by alternately projecting the signal onto a partition lattice that encodes a hierarchy of abstractions and lifting selected rules back to the signal domain. When the signal is a probability mass function, we show the probabilistic rules learned by ILL admit a natural probabilistic graphical model (PGM) interpretation and develop this interpretation in detail. A partition in ILL induces a deterministic quotient variable, and a rule is the marginal law of that quotient variable. A rule set is therefore a collection of marginal constraints over interpretable abstractions. General lifting is the feasible family of all joint distributions satisfying those constraints, while special lifting chooses a maximum-ignorance reconstruction, implemented in ILL by an L2 uniformity principle closely related to maximum entropy. Under a Shannon-entropy lifting, the same constraints yield a log-linear factor graph whose factors are indexed by learned abstractions. The information lattice itself, however, is not a Bayesian network: its edges encode refinement and coarsening of abstractions, not conditional dependence. Thus ILL is best viewed as structure learning for interpretable constraint-based factor graphs over quotient variables. This view clarifies how ILL relates to graphical models and maximum entropy models, while suggesting new directions for inference, identifiability, and hybrid symbolic-probabilistic learning.
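A minimal sketch of the projection and lifting steps for a single rule, assuming a finite domain and a partition given as a function from elements to cell labels. With only one marginal constraint, spreading each cell's mass uniformly over the cell is both the maximum-entropy and the minimum-L2-norm reconstruction; handling several rules at once, as ILL does, requires a constrained optimization that is not shown. The function names and the toy distribution are illustrative.

```python
from collections import defaultdict


def rule(pmf, partition):
    """Marginal law of the quotient variable: total mass of each partition cell."""
    law = defaultdict(float)
    for x, p in pmf.items():
        law[partition(x)] += p
    return dict(law)


def lift_uniform(law, domain, partition):
    """Lift one rule back to the domain by spreading each cell's mass uniformly."""
    sizes = defaultdict(int)
    for x in domain:
        sizes[partition(x)] += 1
    return {x: law.get(partition(x), 0.0) / sizes[partition(x)] for x in domain}


# Toy signal on {0, ..., 5}, abstracted by parity (a hypothetical abstraction).
pmf = {0: 0.1, 1: 0.2, 2: 0.05, 3: 0.25, 4: 0.3, 5: 0.1}
parity = lambda x: x % 2
parity_rule = rule(pmf, parity)
lifted = lift_uniform(parity_rule, sorted(pmf), parity)
```

The lifted distribution reproduces the rule exactly but forgets everything the abstraction does not see, which is the sense in which a rule acts as a marginal constraint rather than a conditional dependence.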


2606.19365 2026-06-19 cs.LG 新提交

Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures

跨GPU架构的3D生成扩散模型性能分析与优化

Jeeho Ryoo, Yongchan Jung, Muhammad Ali Khaliq, Weidong Zhang, Jiatong Han, Byeong Kil Lee

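Affiliations: Fairleigh Dickinson University; The University of Colorado at Colorado Springs; Northeastern University

AI summary: For the 3D MRI diffusion model Med-DDPM, analyzes kernel-level performance bottlenecks across three generations of NVIDIA architectures and evaluates two optimizations, TF32 Tensor Core activation and a 3D channels-last layout, which reduce SM cycles by up to 100x and dynamic instructions by 100x, raise Tensor Core utilization from 1.45 to 9.98x, and increase IPC by 7% on A100.

DOI: 10.1145/3777884.3797012

AI Chinese abstract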

扩散模型已成为高保真3D MRI合成的关键，但由于每个样本需要数百次U-Net评估以及高度异构的内核行为，其部署仍受到大量GPU资源需求的限制。本文对最先进的医学扩散模型Med-DDPM在三代NVIDIA架构上进行了全面的性能分析，研究了内核级运行时分解、指令混合特征、内存系统利用率、线程束级活动以及分析器优先级得分估计。我们发现训练主要由cuDNN卷积和隐式GEMM内核主导，效率低下源于内存访问模式、张量布局转换和有限的Tensor Core利用率。基于这些洞察，我们评估了两种架构感知优化——TF32 Tensor Core激活和3D channels-last布局，并证明它们将SM周期减少多达100倍，动态指令减少100倍，Tensor Core利用率从1.45倍提高到9.98倍，并在A100上将IPC提高7%，且不降低合成质量。

English abstract

Diffusion models have become essential for high-fidelity 3D MRI synthesis, yet their deployment remains constrained by substantial GPU resource demands arising from hundreds of U-Net evaluations per sample and highly heterogeneous kernel behavior. This paper performs a comprehensive performance analysis of the state-of-the-art medical diffusion model, Med-DDPM, across three generations of NVIDIA architectures to study kernel-level runtime breakdowns, instruction-mix characteristics, memory system utilization, warp-level activities, and profiler priority-score estimates. We show that training is overwhelmingly dominated by cuDNN convolution and implicit-GEMM kernels, with inefficiencies arising from memory-access patterns, tensor-layout conversions, and limited Tensor Core utilization. Guided by these insights, we evaluate two architecture-aware optimizations, TF32 Tensor Core activation and a 3D channels-last layout, and demonstrate that they reduce SM cycles by up to 100x, cut dynamic instructions by 100x, raise Tensor Core utilization from 1.45 to 9.98x, and increase IPC by 7% on A100, all without degrading synthesis quality.
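To make the layout change concrete, the sketch below computes element strides for the same logical (N, C, D, H, W) tensor stored in the default row-major order and in a 3D channels-last order (physically N, D, H, W, C). It is a pure-Python illustration with a made-up shape, not the paper's code. Assuming a PyTorch setup, which the abstract does not state, this layout corresponds to the `torch.channels_last_3d` memory format, and TF32 is controlled by the `torch.backends.cuda.matmul.allow_tf32` and `torch.backends.cudnn.allow_tf32` flags.

```python
def contiguous_strides(shape):
    """Element strides for a default row-major (N, C, D, H, W) tensor."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)


def channels_last_3d_strides(shape):
    """Element strides when the same logical tensor is stored as N, D, H, W, C."""
    n, c, d, h, w = shape
    return (d * h * w * c, 1, h * w * c, w * c, c)


shape = (2, 4, 8, 16, 16)   # hypothetical batch of 3D feature maps
default = contiguous_strides(shape)
channels_last = channels_last_3d_strides(shape)
```

In the channels-last layout the channel stride is 1, so all channels of one voxel sit next to each other in memory. Convolution kernels that reduce over channels can then read them with contiguous accesses, and keeping one layout throughout avoids the repeated layout conversions the abstract identifies as a source of inefficiency.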


2606.19364 2026-06-19 cs.LG 新提交

Closing the Social-Semantic Gap: SPSD for Edge-Based Prompt Compression in Cloud LLM Inference

缩小社会-语义差距：SPSD用于云LLM推理中的边缘端提示压缩

Abhinit Sen, Ajeet Kumar, Manaranjan Pradhan

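AI summary: To address the high energy cost of the prompt prefill stage in cloud LLM inference, proposes SPSD, an edge-based pipeline that compresses user prompts with a 4-bit quantised small language model. While keeping response quality non-inferior, it saves a mean of 99.9 input tokens per distilled call, with an estimated per-call net energy saving of 70-270 uWh.

Comments 19 pages, 7 tables, 1 figure, includes appendix

AI Chinese abstract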

大语言模型（LLM）推理的预填充阶段正成为云规模能耗的日益增长的贡献者。许多面向消费者的支持和对话提示包含社会性支架：礼貌标记、道歉性开场白、重复以及建立融洽关系的语言，这些对人类交流很重要，但对机器推理而言边际信息量较低。我们将这种差异称为社会-语义差距。我们提出SPSD（情感保留语义蒸馏），一种边缘端管道，在传输到云端部署的LLM之前，使用4比特量化的小语言模型压缩用户提示。在248个提示的语料库上，使用Gemma-2-2B-Instruct（Q4_K_M）作为SLM、Llama-3.1-8B-Instruct作为云端评估模型进行评估，每次蒸馏调用平均输入token节省99.9个，所有146次蒸馏调用均产生正向节省。通过盲法LLM-as-judge评分对121对进行评估，响应质量在15分制中预先指定的1分非劣效范围内不劣于原始路径；评审员给出43%平局、28%蒸馏胜出和29%原始胜出。余弦相似度结果不一：均值0.682，中位数0.712，54.1%的配对高于0.70参考阈值。安全关键领域通过基于规则的网关保守地路由至直通模式。在所述假设下，每次调用净节能估计为70-270 uWh。SPSD表明，设备端提示蒸馏可以在保持响应质量在实际非劣效范围内的同时，降低云LLM的输入token成本。

English abstract

The prefill stage of Large Language Model (LLM) inference is a growing contributor to cloud-scale energy cost. Many consumer-support and conversational prompts contain social scaffolding: politeness markers, apologetic preamble, repetition, and rapport-building language that is important for human communication but carries low marginal information for machine reasoning. We call this discrepancy the Social-Semantic Gap. We present SPSD (Sentiment Preserving Semantic Distillation), an edge-based pipeline that compresses user prompts using a 4-bit quantised Small Language Model before transmission to a cloud-deployed LLM. Evaluation on a 248-prompt corpus using Gemma-2-2B-Instruct (Q4_K_M) as the SLM and Llama-3.1-8B-Instruct as the cloud evaluation model yields a mean input token saving of 99.9 tokens per distilled call, with all 146 distilled calls yielding positive savings. Response quality, assessed by blind LLM-as-judge scoring across 121 pairs, is non-inferior to the raw path within a pre-specified 1-point margin on a 15-point rubric; the judge awarded 43 percent ties, 28 percent distilled wins, and 29 percent raw wins. Cosine similarity is mixed: mean 0.682, median 0.712, with 54.1 percent of pairs above the 0.70 reference threshold. Safety-critical domains are conservatively routed to passthrough via rule-based gates. Per-call net energy saving is estimated at 70-270 uWh under stated assumptions. SPSD shows that on-device prompt distillation can reduce cloud LLM input-token cost while preserving response quality within a practical non-inferiority margin.


2606.19363 2026-06-19 cs.LG 新提交

When to Trust, How to Distill: Multi-Foundation Model Guidance for Lightweight, Robust Scientific Time Series Forecasting

何时信任，如何蒸馏：面向轻量级鲁棒科学时间序列预测的多基础模型指导

Rupasree Dey, Abdul Matin, Nathan Orwick, Yao Zhang, Shrideep Pallickara, Sangmi Lee Pallickara

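Affiliations: Colorado State University

AI summary: Proposes the Guard framework, which uses a contextual router and an uncertainty-gated temperature mechanism to distill knowledge from multiple distribution-shifted foundation models into lightweight forecasters, reducing RMSE across four domains including meteorology and carbon flux.

Comments Accepted at KDD 2026 (AI for Science track). 12 pages in total, including references and appendix

DOI: 10.1145/3770855.3819018

AI Chinese abstract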

时间序列基础模型（TSFMs）在物理科学中的部署受到一个关键权衡的阻碍：虽然这些模型编码了丰富、通用的时间动态，但当零样本应用于特定科学领域时，它们会遭受严重的分布错位，并且其计算成本阻碍了在边缘计算传感器网络中的部署。我们解决了一个基本挑战：如何从错位的基础模型（FM）中提取潜在的结构知识，以训练轻量级、专门的预测器？我们提出了用于蒸馏的门控不确定性感知路由（Guard），这是一个新颖的框架，将多教师蒸馏重新定义为实例级决策过程，具有两种自适应机制：（1）上下文路由器，根据局部输入统计动态选择最相关的教师，利用不同基础模型之间的互补性；（2）不确定性门控温度机制，充当“断路器”，当教师置信度与领域现实偏离时自动减弱蒸馏强度。我们在四个气候关键领域评估了我们提出的轻量级框架：气象学、生态系统碳通量、土壤湿度和能源电网。我们的方法相对于固定权重的多教师蒸馏基线显著降低了RMSE，成功地从预训练的FM（教师）中蒸馏知识，即使由于原始和目标数据域之间的分布偏移，它们表现出次优的零样本准确性。我们证明，这些领域错位的教师仍然可以作为关键的纠正者，在28.5%的最难实例上优于全局优越的FM。最终，这使得适用于资源受限边缘部署的高精度科学预测成为可能。代码可在 https://github.com/RupasreeDey/GUARD-KDD2026 获取。

English abstract

The deployment of Time-Series Foundation Models (TSFMs) in physical sciences is hindered by a critical trade-off: while these models encode rich, universal temporal dynamics, they suffer from severe distributional misalignment when applied zero-shot to specific scientific domains, and their computational cost prohibits deployment in edge-computing sensor networks. We address a fundamental challenge: How can we extract latent structural knowledge from misaligned foundation models (FMs) to train lightweight, specialized forecasters? We propose Gated Uncertainty-Aware Routing for Distillation (Guard), a novel framework that reframes multi-teacher distillation as an instance-wise decision process with two adaptive mechanisms: (1) a Contextual Router that dynamically selects the most relevant teacher based on local input statistics, exploiting complementarity across diverse foundation models; and (2) an Uncertainty-Gated Temperature mechanism that acts as a "circuit-breaker," automatically attenuating distillation strength when teacher confidence diverges from domain reality. We evaluate our proposed lightweight framework on four climate-critical domains: meteorology, ecosystem carbon flux, soil moisture, and energy grids. Our method significantly reduces RMSE relative to a fixed-weight multi-teacher distillation baseline, successfully distilling knowledge from pretrained FMs (teachers) even when they exhibit suboptimal zero-shot accuracy due to distribution shift between the original and target data domains. We demonstrate that these domain-misaligned teachers can still serve as critical correctives, outperforming the globally superior FMs on 28.5% of the hardest instances. Ultimately, this enables high-precision scientific forecasting suitable for resource-constrained edge deployment. Code is available at https://github.com/RupasreeDey/GUARD-KDD2026.


2606.19358 2026-06-19 cs.RO 新提交

WorkBenchMark: A LEGO-Based Assembly Benchmark with an Assembly-by-Disassembly Baseline for the Smart Manufacturing League

WorkBenchMark：面向智能制造联盟的基于乐高积木的装配基准与通过拆卸进行装配的基线方法

Wenbo Ma, Daniel Swoboda, Matteo Tschesche, Till Hofmann

Affiliations: Chair of Machine Learning and Reasoning (i6), RWTH Aachen University; MASCOR Institute, FH Aachen University of Applied Science

AI summary: Proposes a LEGO Duplo-based robotic assembly benchmark with 400 tasks and four complexity tiers, together with a planning-based baseline that outperforms modern vision-language-action methods on all tiers.

Comments RoboCup Symposium 2026 accepted paper

2606.19357 2026-06-19 cs.RO cs.AI 新提交

Physical Atari: A Robust and Accessible Platform for Real-time Reinforcement Learning on Robots

Physical Atari: 一个用于机器人实时强化学习的鲁棒且可访问的平台

Khurram Javed, Joseph Modayil, Gloria Kennickell, Richard S. Sutton, John Carmack

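AI summary: Presents the Physical Atari platform, in which a robot actuates an Atari controller and game frames are rendered in real time, to enable reinforcement learning research in the physical world. It validates that algorithms can learn directly on robots and finds that distribution shift can significantly degrade policy performance.

Comments To appear at RLC 2026

AI Chinese abstract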

我们构建了一个名为Robotroller的机器人，它能够操作Atari CX40+控制器，以及一个名为Atari Devbox的设备，该设备在屏幕上渲染来自Arcade Learning Environment的游戏帧和奖励信号。Robotroller和Atari Devbox，连同现成的摄像头和台式计算机，构成一个可用于研究物理世界中强化学习算法的系统。我们将整个系统称为Physical Atari。在本文中，我们详细介绍了使Physical Atari成为一个鲁棒且可访问平台的关键决策。为了使系统鲁棒，我们设计了Robotroller，使得所有运动都通过轴承完成，从而减少磨损。此外，我们编写了软件，以高频监控伺服电机的状态并进行干预以限制应力。为了使系统可访问，我们使用了价格合理的现成组件和可通过消费级3D打印机制造的零件。Physical Atari的建造成本低于1000美元，并且已用于数周不间断的强化学习实验，未出现任何机械故障。我们用它验证了强化学习算法可以直接在机器人上学习，并表明即使学习和部署之间的微小分布偏移也会显著降低策略的性能。我们的结果强调了设备端适应对于在机器人上获得强性能的重要性。

English abstract

We built a robot called the Robotroller that actuates an Atari CX40+ controller and a device called the Atari Devbox that renders the game frame and the reward signal from the Arcade Learning Environment on a screen. The Robotroller and the Atari Devbox, together with an off-the-shelf camera and a desktop computer, constitute a system that can be used to study reinforcement learning algorithms in the physical world. We call the full system Physical Atari. In this paper, we detail the key decisions that make Physical Atari a robust and accessible platform. To make the system robust, we designed the Robotroller so that all movement is done through bearings, which reduces wear. Additionally, we wrote software that monitors the state of the servos at a high frequency and intervenes to limit stress. To make the system accessible, we used affordable off-the-shelf components and parts that can be manufactured using consumer 3D printers. Physical Atari can be built for under $1,000 and has been used for weeks of non-stop reinforcement learning experiments without any mechanical failures. We used it to validate that reinforcement learning algorithms can learn directly on robots and show that even small distribution shifts between learning and deployment can significantly degrade the performance of policies. Our results underscore the importance of on-device adaptation for strong performance on robots.


2606.19356 2026-06-19 cs.CL cs.AI 新提交

Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

可信多智能体系统：使用Argent信令协议缓解语义漂移

Anantha Sharma

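Affiliations: Synechron Inc

AI summary: Proposes the Argent Signaling Protocol (ASP), which uses structured quality signals to distinguish repairable failures from failures that should be contained. It raises pass rates in document-grounded QA and blocks ungrounded propagation in a multi-agent system.

Comments 17 pages

AI Chinese abstract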

当多智能体LLM系统产生错误答案时，并非所有失败都相同：有些答案基于正确材料但不完整，而另一些则完全无依据且应被阻止。当前的重新尝试策略对两种情况一视同仁（重试并希望最好），使得人类监督者无法判断重试是否合理或系统是否应停止。我们引入Argent信令协议(ASP)，这是一种紧凑的机器可读头部，为每个AI生成的响应附带结构化质量信号：确定性(@C)、依据性(@G)、随机性(@S)以及一个假设索引，用于分类每个声明的证据基础。这些信号使控制器能够区分可修复失败与遏制失败，并对每种情况进行不同路由。我们在两种模式下评估ASP。在独立模式下，基于Array BioPharma/Ono许可协议的27个问题的文档问答基准，比较基线提示与ASP仪器化控制器动作在三个本地GGUF模型上的表现。在Qwen (0.8B)上，ASP将通过率从11.1%提升至33.3%，平均术语覆盖率从36.7%提升至65.4%；在Dobby (8B)上，ASP产生4次失败到通过的恢复，通过率从33.3%提升至44.4%；在SmolLM3 (3B)上，ASP在每次问题中交替进行修复和遏制。总体改进显著（从12/81通过到21/81通过）。在多智能体模式下，ASP侧车位于检索智能体和下游决策智能体之间；侧车100%阻止无依据的上游输出到达下游智能体（24/27被阻止，0次无依据传播）。

English abstract

When multi-agent LLM systems produce bad answers, not all failures are equal: some answers are grounded in the right material but incomplete, while others are simply ungrounded and should be stopped. Current retry strategies treat both cases identically (try again and hope for the best), leaving human supervisors unable to tell whether a retry was warranted or whether the system should have halted instead. We introduce the Argent Signaling Protocol (ASP), a compact machine-readable header that accompanies every AI-generated response with structured quality signals: certainty (@C), grounding (@G), stochasticity (@S), and an assumption index that classifies the evidentiary basis of each claim. These signals enable a controller to distinguish repairable failures from containment failures and route each case differently. We evaluate ASP in two modes. In standalone mode, a 27-question document-grounded QA benchmark over the Array BioPharma/Ono license agreement compares baseline prompts against ASP-instrumented controller actions across three local GGUF models. On Qwen (0.8B), ASP improves pass rate from 11.1% to 33.3% and mean term coverage from 36.7% to 65.4%; on Dobby (8B), ASP produces 4 fail-to-pass recoveries, raising pass rate from 33.3% to 44.4%; on SmolLM3 (3B), ASP alternates between repair and containment per question. Aggregate improvement is meaningful (12/81 to 21/81 passes). In multi-agent mode, an ASP sidecar sits between a retrieval agent and a downstream decision agent; the sidecar blocks 100% of ungrounded upstream outputs from reaching the downstream agent (24/27 blocked, 0 ungrounded propagations).
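A sketch of how a controller might consume such a header and route a response. The abstract names the signals (@C, @G, @S) but not the wire format or any thresholds, so the `@C:0.82 @G:0.15 @S:0.30` syntax, the two floor values, and the three-way routing are hypothetical, and the assumption index is omitted.

```python
import re

# Hypothetical header syntax: "@C:0.82 @G:0.15 @S:0.30"
HEADER_FIELD = re.compile(r"@([CGS])[:=]\s*([01](?:\.\d+)?)")


def parse_header(header):
    """Parse the three quality signals into a dict such as {'C': 0.82, ...}."""
    signals = {key: float(value) for key, value in HEADER_FIELD.findall(header)}
    missing = {"C", "G", "S"} - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    return signals


def route(signals, grounding_floor=0.5, certainty_floor=0.7):
    """Return 'contain', 'repair', or 'pass' for one response."""
    if signals["G"] < grounding_floor:
        return "contain"   # ungrounded: block it instead of retrying
    if signals["C"] < certainty_floor:
        return "repair"    # grounded but unsure or incomplete: a retry is warranted
    return "pass"
```

Checking grounding before certainty means an ungrounded answer is never sent back for a retry, which is the distinction between containment and repair that the abstract draws. The aggregate figures are consistent with three models answering 27 questions each, for 81 runs in total.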


2606.19354 2026-06-19 cs.CL cs.LG 新提交

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

粒度调控的自适应计算效率：测试时扩展中的最优验证

Ardit Krasniqi, Luan Vejsiu, Elira Dervishi

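Affiliations: European University of Tirana

AI summary: Proposes the GRACE theoretical framework, which models verification granularity as a function of problem difficulty, verifier accuracy, and compute budget, and proves a phase transition: fine-grained verification dominates when the compute budget is large or the problem is hard, coarse-grained verification is better for easy problems under low budgets, and an adaptive strategy reaches the compute-performance Pareto frontier.

AI Chinese abstract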

测试时扩展（TTS）已成为一种强大的范式，通过在推理时投入额外计算来提升大语言模型（LLMs）的推理性能。TTS的核心组件是验证器，它选择或评分候选解以引导搜索过程。虽然先前工作已探索验证的益处，但一个基本问题仍未充分探索：在给定计算预算下，最优验证粒度是什么？粗粒度的结果奖励模型（ORMs）和细粒度的过程奖励模型（PRMs）代表两个极端，但两者单独均无法在所有场景下实现计算最优性。本文建立了一个统一的理论框架，称为GRACE（粒度调控的自适应计算效率），该框架将最优验证粒度刻画为问题难度、验证器准确率和计算预算的显式函数。我们证明存在一个相变：当计算预算大或问题难时，细粒度验证占优；而在低预算、简单问题场景下，粗粒度验证更受青睐。我们的理论将Best-of-N、束搜索和步骤级MCTS统一在一个帕累托最优框架内，并激发了一种自适应粒度策略，该策略可证明达到计算-性能帕累托前沿。在MATH-500、GSM8K和AIME基准上的实验结果证实了所有四个理论主张，在匹配计算量下，我们的自适应策略相比固定粒度基线准确率提升高达3.1%。

English abstract

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the verifier, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: what is the optimal granularity of verification under a given compute budget? Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called GRACE (Granularity-Regulated Adaptive Computational Efficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-$N$, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1% accuracy at matched compute.


2606.19353 2026-06-19 cs.CL cs.LG 新提交

Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence

量化上下文学习中的偶然不确定性以稳健衡量LLM预测置信度

Jinseok Chung, Minkyoung Song, Hyunji Jung, Namhoon Lee

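Affiliations: POSTECH

AI summary: To address the sensitivity of in-context learning (ICL) predictions to prompt design, proposes self-function vectors, built on Bayesian views and mechanistic interpretability, to estimate aleatoric uncertainty directly, and designs a rigorous evaluation protocol. Experiments on synthetic and real-world datasets support the method's reliability and its usefulness in applications such as hallucination detection.

Comments Accepted to ACL 2026

AI Chinese abstract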

上下文学习（ICL）使LLM能够从少量示例中适应新任务，但其可靠性仍存疑虑：预测对提示设计和模型理解上下文的能力高度敏感，使得失败源于数据特性还是模型限制难以区分。不确定性分解——将偶然不确定性从认知不确定性中分离——在此场景中尤为关键，然而现有方法针对标准生成任务设计，未能捕捉ICL的独特动态。为解决此问题，我们引入基于贝叶斯观点和ICL机制可解释性的自函数向量概念。这些向量利用模型内部表示来建模上下文提示中学习的潜在概念，从而在贝叶斯框架内直接估计偶然不确定性，并规避了对脆弱的输入或解码操作的依赖。鉴于缺乏既定基准和合适的评估协议，我们还提出了首个严格的评估协议，其中数据以受控方式被操纵，以便精确量化偶然不确定性并将其与认知不确定性分离。借助这一新的评估框架（最初基于合成任务进行概念开发，随后扩展到真实世界数据集），我们展示了所提出的方法比现有替代方法更可靠地衡量LLM在ICL下做出的预测的不确定性。此外，我们展示了它可作为可信相关应用（如幻觉检测）的实用工具。我们的发现为将不确定性的量化观点与模型行为的机制理解联系起来开辟了新方向。

English abstract

In-Context Learning (ICL) allows LLMs to adapt to new tasks from a few demonstrations, but its reliability remains a concern: predictions are highly sensitive to both prompt design and the model's ability to understand the context, obscuring whether failures arise from data properties or model limitations. Uncertainty decomposition, which separates aleatoric from epistemic sources, is particularly crucial in this setting, yet existing methods, designed for standard generation tasks, fail to capture the unique dynamics of ICL. To address this, we introduce the concept of self-function vectors, built upon Bayesian views and the mechanistic interpretability of ICL. These vectors leverage internal model representations to model the latent concept learned during in-context prompting, thereby enabling a direct estimation of aleatoric uncertainty within a Bayesian framework and circumventing the reliance on brittle input or decoding manipulations. Given the lack of established benchmarks and suitable evaluation protocols, we also propose the first rigorous evaluation protocol, in which data is manipulated in controlled ways so as to quantify aleatoric uncertainty precisely and separately from epistemic uncertainty. With this new evaluation framework, initially grounded in synthetic tasks for conceptual development and subsequently extended to real-world datasets, we show that our proposed methodology can measure uncertainty of LLM predictions made under ICL more reliably than existing alternative methods. Moreover, we show it can be used as a practical tool for trustworthiness-related applications, such as hallucination detection. Our findings pave a new direction for connecting the quantitative view of uncertainty with the mechanistic understanding of model behavior.
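For readers unfamiliar with the decomposition, the sketch below shows the standard entropy-based split of predictive uncertainty into aleatoric and epistemic parts, given several predictive distributions for the same query (for example, from different demonstration sets). This is the textbook baseline, not the paper's self-function-vector estimator, and the function names are illustrative.

```python
import math


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def decompose(predictions):
    """Split total predictive uncertainty into aleatoric and epistemic parts.

    predictions: list of class-probability lists, one per sampled hypothesis.
    total     = entropy of the averaged prediction
    aleatoric = average entropy of the individual predictions
    epistemic = total - aleatoric (disagreement between hypotheses)
    """
    n = len(predictions)
    k = len(predictions[0])
    mean = [sum(p[c] for p in predictions) / n for c in range(k)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in predictions) / n
    return total, aleatoric, total - aleatoric
```

When every hypothesis returns the same uncertain distribution, all of the uncertainty is aleatoric; when hypotheses are individually confident but disagree, it is all epistemic. Estimating the per-hypothesis distributions is the part that depends on prompt or decoding manipulations, which the abstract describes as brittle.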


2606.19352 2026-06-19 cs.CL cs.AI 新提交

Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

大规模手语数据集：资源、基准和标注标准的综合调查

Yiming Ni, Zhi-Qi Cheng, Jiayu Li, Wei Cheng

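Affiliations: Tacoma School of Engineering & Technology, University of Washington

AI summary: Surveys 120 datasets across 35 sign languages, analyzes challenges such as modality imbalance, annotation granularity, and signer bias, and proposes a 24-field Sign-Language Datasheet to support standardized documentation and reproducible evaluation.

Comments Accepted to ACL 2026 Main. 27 pages, 5 figures

AI Chinese abstract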

手语是聋人和听障社区使用的表达性视觉语言。尽管在手语识别、翻译和生成方面取得了显著进展，但由于数据集碎片化、标注不一致以及语言覆盖有限，进展仍然受到制约。现有的基准往往无法反映现实世界的通信需求，对这些局限性的系统分析仍然有限。在本调查中，我们提出了一个全面的手语数据集索引，涵盖了35种手语的120个资源。我们分析了关键挑战，如模态不平衡、标注粒度和手语者偏差，并概述了未来数据集设计的考虑因素。我们还引入了一个24字段的手语数据表，并发布了一个公共GitHub仓库（https://github.com/Ginqwerty/Open-Sign-Language），以支持标准化文档和可复现评估。总体而言，我们的工作为在现实应用中开发包容、稳健和可扩展的手语技术提供了统一且实用的基础。

English abstract

Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented datasets, inconsistent annotations, and limited linguistic coverage. Existing benchmarks often fail to reflect real-world communication needs, and systematic analyses of these limitations remain limited. In this survey, we present a comprehensive index of sign-language datasets, covering 120 resources across 35 sign languages. We analyze key challenges such as modality imbalance, annotation granularity, and signer bias, and outline considerations for future dataset design. We also introduce a 24-field Sign-Language Datasheet and release a public GitHub repository (https://github.com/Ginqwerty/Open-Sign-Language) to support standardized documentation and reproducible evaluation. Overall, our work provides a unified and practical foundation for developing inclusive, robust, and scalable sign-language technologies in real-world applications.


2606.19351 2026-06-19 cs.CL cs.AI 新提交

Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

基于大语言模型的知识图谱推理中的幻觉检测

Xinyan Zhu, Yaoqi Liu, Yue Gao, Huadong Ma, Cheng Yang, Chuan Shi

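Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University

AI summary: Proposes LUCID, which combines LLM attention scores, KG semantics, and structural information and uses a graph neural network to detect hallucinations in LLM-based knowledge graph reasoning, achieving state-of-the-art performance on nine datasets.

AI Chinese abstract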

知识图谱推理从现有事实中推断新知识，广泛应用于问答、推荐和决策支持。随着大语言模型（LLM）的快速发展，基于LLM的知识图谱推理框架通过利用检索到的知识图谱信息变得越来越流行。然而，LLM中的幻觉仍然是一个关键问题。即使融入了相关的知识图谱知识，模型仍可能生成错误输出，导致错误信息和不可靠的决策。现有的幻觉检测方法要么关注LLM内部状态，要么验证与检索上下文的一致性，但两者都忽略了知识图谱中的结构信息，导致性能次优。为了解决这一差距，我们提出了LUCID，这是首个针对基于LLM的知识图谱推理框架的幻觉检测方法。LUCID联合利用LLM注意力分数、知识图谱语义和结构信息。具体来说，它从注意力分数和语义相似度中提取节点和边特征，并使用图神经网络将其与知识图谱结构集成。我们还构建了人工标注的基准数据集用于评估。在九个数据集上的实验表明，与15个基线相比，LUCID达到了最先进的性能。

English abstract

Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG information. However, hallucinations in LLMs remain a critical issue. Even when relevant KG knowledge is incorporated, models may still generate incorrect outputs, leading to misinformation and unreliable decisions. Existing hallucination detection methods either focus on LLM internal states or verify consistency with retrieved contexts, but both overlook the structural information in KGs, resulting in suboptimal performance. To address this gap, we propose LUCID, the first halLUcination deteCtIon method for LLM-based knowleDge graph reasoning frameworks. LUCID jointly leverages LLM attention scores, KG semantics, and structural information. Specifically, it extracts node and edge features from attention scores and semantic similarities, and integrates them with KG structure using a graph neural network. We also construct manually annotated benchmark datasets for evaluation. Experiments on nine datasets show that LUCID achieves state-of-the-art performance compared to 15 baselines.

2606.19350 2026-06-19 cs.CL New submission

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

Causal-Attribution-Based Pruning Preserves the Reasoning Performance of Large Language Models

Amogh Sheth, Biruk Assefa, Yi Wen Huang, Andrew Lin, Yuhao Ge

Affiliations: Edison Academy Magnet School; Massachusetts Institute of Technology; State University of New York College at Plattsburgh; The University of Texas at Austin; Independent Researcher

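AI summary: Proposes Causal Attribution Pruning (CAP), a training-free method that performs fine-grained pruning by measuring the causal impact of attention heads on reasoning tasks. At 20% sparsity it achieves relative accuracy gains of up to 61% over Wanda on ARC-Challenge.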

Comments Accepted at the ICLR 2026 Workshop on LLM Reasoning. 13 pages, 2 figures

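AI abstract (translated from Chinese)

Large language models (LLMs) perform well on multi-step reasoning, but inference is costly. We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tasks, and uses these head-level scores to guide fine-grained weight pruning. For each attention head, CAP estimates the expected performance drop caused by masking that head during forward passes over a small calibration set of reasoning problems. These causal scores are then converted into weight-level importance values for the corresponding projection matrices. Unlike criteria based only on magnitude or activations, CAP's interventional measurement directly captures each head's functional contribution, yielding relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity. We evaluate CAP on GSM8K, StrategyQA, and ARC-Challenge using Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10-20%), CAP outperforms Wanda in most model-benchmark configurations, with especially large gains on ARC-Challenge for Llama-3. Our results indicate that, at equal sparsity, attention-head-level causal attribution preserves reasoning performance on downstream benchmarks better than correlational pruning criteria, though at 50% sparsity it remains limited by coarse-grained MLP attribution.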

English abstract

Large language models (LLMs) excel at multi-step reasoning but incur substantial inference cost. We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tasks and uses these head-level scores to guide fine-grained weight pruning. For each attention head, CAP estimates the expected performance degradation when the head is masked during forward passes on a small calibration set of reasoning problems. These causal scores are then converted into weight-level importance values for the corresponding projection matrices. Unlike magnitude-only or activation-based criteria, CAP's interventional measurement directly captures each head's functional contribution, yielding relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity. We evaluate CAP on GSM8K, StrategyQA, and ARC-Challenge using Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10-20%), CAP improves over Wanda in most model-benchmark configurations, with especially large gains on ARC-Challenge for Llama-3. Our results suggest that attention-head-level causal attribution can better preserve reasoning performance on downstream benchmarks than correlational pruning criteria at equivalent sparsity, while remaining limited by coarse MLP attribution at 50% sparsity.
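
The two steps named in the abstract, scoring each head by the degradation observed when it is masked and then turning head scores into weight-level importance, can be sketched on a toy model. This is not the authors' code. The linear "heads", the squared-error task loss, and in particular the conversion rule (head score multiplied by weight magnitude) are assumptions, since the abstract does not specify the conversion. All function names are hypothetical.

```python
import numpy as np

def forward(x, weights, head_mask):
    """Toy multi-head model: the output is the sum of the unmasked heads."""
    per_head = np.einsum("nd,hdo->hno", x, weights)
    return (per_head * head_mask[:, None, None]).sum(axis=0)

def task_loss(x, y, weights, head_mask):
    return float(np.mean((forward(x, weights, head_mask) - y) ** 2))

def causal_head_scores(x, y, weights):
    """Interventional score: loss increase when one head is masked."""
    n_heads = weights.shape[0]
    keep_all = np.ones(n_heads)
    base = task_loss(x, y, weights, keep_all)
    scores = np.empty(n_heads)
    for h in range(n_heads):
        mask = keep_all.copy()
        mask[h] = 0.0  # intervene on exactly one head
        scores[h] = task_loss(x, y, weights, mask) - base
    return scores

def prune_by_head_scores(weights, scores, sparsity):
    """Assumed conversion: importance = head score * |weight|; zero the lowest."""
    importance = np.clip(scores, 0.0, None)[:, None, None] * np.abs(weights)
    k = int(sparsity * weights.size)
    lowest = np.argsort(importance, axis=None)[:k]
    pruned = weights.copy().reshape(-1)
    pruned[lowest] = 0.0
    return pruned.reshape(weights.shape)

# Toy calibration set; head 2 is made nearly irrelevant on purpose.
rng = np.random.default_rng(0)
n_heads, d_in, d_out = 4, 8, 3
weights = rng.normal(size=(n_heads, d_in, d_out))
weights[2] *= 0.01
x = rng.normal(size=(32, d_in))
y = forward(x, weights, np.ones(n_heads))  # targets from the unpruned model

scores = causal_head_scores(x, y, weights)
pruned = prune_by_head_scores(weights, scores, sparsity=0.2)
print("head scores:", np.round(scores, 4))
print("zeroed weights:", int((pruned == 0).sum()), "of", pruned.size)
```

Because the score comes from an intervention rather than from weight or activation statistics alone, a head with large weights but little effect on the task would still receive a low score in this setup.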
