arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.24760 2026-05-26 cs.RO

Geometric Workspace Analysis and Transmission-Aware Dynamics of a Serial Spherical Tool for Microsurgery

Anestis Mablekos-Alexiou, Lyndon da Cruz, Christos Bergeles

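AI summary: Proposes a kinematic and transmission-aware design framework for a serial spherical mechanism (with an additional translational degree of freedom) for microsurgery, enabling rapid design evaluation through an analytical workspace formulation and a transmission-aware dynamics methodology.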

Abstract

We present a kinematic and transmission-aware design framework for a serial spherical mechanism with an additional translational degree of freedom for microsurgery. The first contribution is an analytical workspace formulation that provides geometric insight into reachable motion and enables rapid selection of rotation axis orientations without numerical optimization. The second contribution is a dynamics-informed methodology for mechanisms driven by self-locking transmissions, supporting evaluation of torque requirements for a prescribed workspace geometry. The framework is accompanied by an open-source software package for friction identification and inverse dynamics analysis. Experiments on a purpose-built robotic tool for vitreoretinal surgery validate the predictive capability of the models and demonstrate their practical utility for engineering design.

2605.24759 2026-05-26 cs.LG

A Contractive Feedback Semantics for Reinforcement Learning

Zuyuan Zhang

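AI summary: Establishes a compositional semantics for reinforcement learning by treating a one-step decision process as an open stochastic component and obtaining infinite-horizon policy evaluation through a contractive feedback loop, and derives theoretical results on approximate equivalence, state abstraction, and contract specifications.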

Abstract

Discounted reinforcement learning is usually presented through Bellman equations on closed Markov decision processes. This paper develops a compositional view: a one-step decision process is treated as an open stochastic component, and infinite-horizon policy evaluation is obtained by closing a contractive feedback loop. The resulting semantics assigns typed Bellman transformers to open components, interprets series and parallel wiring as composition and tensoring of transformers, and interprets feedback as an admissible guarded Banach trace realized by a unique fixed point. This perspective yields three theoretical consequences. First, approximate component equivalence is a contextual congruence for admitted well-typed guarded one-hole contexts: local operator error remains controlled after plugging the component into a surrounding circuit that uses the hole once and whose feedback nodes have certified uniform guardedness. Second, exact and approximate state abstractions become commuting or near-commuting coalgebraic diagrams, giving value-preservation and explicit sup-norm distortion bounds. Third, under monotone $ω$-continuous contract-transformer semantics, safety, risk, and resource specifications can be represented as quantale-valued contracts, where local inductive bounds lift through wiring and feedback by least-fixed-point reasoning. Its central claim is not that all RL morphisms form a global traced monoidal category, but that discounted Bellman evaluation admits a contractive feedback semantics on the admissible class of guarded circuits.
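As a minimal illustration of the fixed-point idea in this abstract (not the paper's typed semantics), the sketch below evaluates a fixed policy on a hypothetical two-state chain by iterating the discounted Bellman operator. That operator is a contraction in the sup norm with modulus `gamma`, so the iteration converges to its unique fixed point. The transition matrix `P`, rewards `r`, and `gamma` are made-up values chosen for illustration.

```python
def bellman_operator(v, P, r, gamma):
    """One application of the policy-evaluation operator T(v) = r + gamma * P v."""
    n = len(v)
    return [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]


def evaluate_policy(P, r, gamma, tol=1e-10, max_iters=10_000):
    """Iterate T from v = 0 until successive iterates differ by less than tol (sup norm)."""
    v = [0.0] * len(r)
    for _ in range(max_iters):
        v_next = bellman_operator(v, P, r, gamma)
        if max(abs(a - b) for a, b in zip(v, v_next)) < tol:
            return v_next
        v = v_next
    return v


# Hypothetical two-state chain under a fixed policy (illustrative numbers only).
P = [[0.9, 0.1],
     [0.2, 0.8]]
r = [1.0, 0.0]
gamma = 0.9

v_star = evaluate_policy(P, r, gamma)
# Residual of the fixed-point equation v = T(v); it should be close to zero.
residual = max(abs(a - b) for a, b in zip(v_star, bellman_operator(v_star, P, r, gamma)))
print(v_star, residual)
```

Closing the loop here simply means feeding the output value vector back in as the next input; the discount factor below 1 is what guarantees a unique fixed point.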

2605.24756 2026-05-26 cs.AI

Proper Scoring Rules for Agentic Uncertainty Quantification

Suresh Raghu, Satwik Pandey, Shashwat Pandey

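AI summary: For the uncertainty signals emitted along language-model agent trajectories, proposes TPS, a strictly proper trajectory scoring rule for evaluating the per-step success-probability process, including under censored data.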

Comments 38 pages, 2 figures

Abstract

Language-model agents increasingly emit uncertainty signals throughout a trajectory, but existing agentic UQ evaluations often conflate ranking usefulness with probabilistic truthfulness. AUROC, AUPRC, risk-coverage, Trajectory ECE, and scalarized trajectory scores evaluate discrimination, binwise calibration, or collapsed summaries, but do not strictly elicit the full prefix-conditioned success-probability trace $q_t = P^π(Y=1 | H_t)$. Building on prequential proper scoring, we introduce the Trajectory Proper Score (TPS), a predictor-agnostic family of strictly proper trajectory-level scoring rules for any per-step uncertainty signal calibrated into a probability of eventual success. We prove that TPS strictly elicits the success-probability process under complete observation, within the chosen score family and weight schedule. We extend the construction to administratively censored trajectories by projecting the complete-data score onto the observable stopped prefix, yielding an exact $q_Z$-weighted reduced score and a tractable approximation when $q_Z$ is unestimated. We further show that common trajectory evaluators target weaker objects than the full prefix-conditioned probability process: Trajectory ECE is resolution-blind, while scalarized Trajectory Brier elicits only the collapsed scalar, not the full trace. Experiments on StrategyQA, Tau2-Bench, HotpotQA, and WebShop show that these theoretical distinctions are operationally visible: probability recalibration can substantially change TPS while leaving rank metrics nearly unchanged, and the tractable censored approximation can change the verdict relative to complete-only evaluation.
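The distinction between ranking usefulness and probabilistic truthfulness can be sketched with a small example. The code below is not the paper's TPS; it uses a step-averaged Brier score as a stand-in for a proper trajectory-level score, with uniform weights assumed, and a pairwise AUROC on the final-step confidence as the rank metric. The traces, outcomes, and the squaring recalibration map are all hypothetical.

```python
def step_averaged_brier(trace, outcome, weights=None):
    """Weighted mean of (q_t - y)^2 over the steps of one trajectory (lower is better)."""
    if weights is None:
        weights = [1.0] * len(trace)
    total = sum(w * (q - outcome) ** 2 for w, q in zip(weights, trace))
    return total / sum(weights)


def pairwise_auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical per-step success probabilities for four trajectories and their outcomes.
traces = [[0.6, 0.7, 0.9], [0.5, 0.4, 0.3], [0.7, 0.8, 0.8], [0.6, 0.5, 0.2]]
outcomes = [1, 0, 1, 0]

# A strictly increasing map on [0, 1] (squaring) changes probabilities but not rankings.
recalibrated = [[q ** 2 for q in trace] for trace in traces]


def mean_score(all_traces):
    """Average the per-trajectory score over the whole evaluation set."""
    return sum(step_averaged_brier(t, y) for t, y in zip(all_traces, outcomes)) / len(outcomes)


auroc_before = pairwise_auroc([t[-1] for t in traces], outcomes)
auroc_after = pairwise_auroc([t[-1] for t in recalibrated], outcomes)
brier_before = mean_score(traces)
brier_after = mean_score(recalibrated)
print(auroc_before, auroc_after, brier_before, brier_after)
```

The rank metric is identical before and after the monotone map, while the probability-sensitive score moves, which is the kind of gap the abstract describes.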

2605.24755 2026-05-26 cs.AI cs.CL

Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language Models

Feng Chen, Justin Tauscher, Changye Li, Meliha Yetisgen, Alex Cohen, Adam Kuczynski, Angelina Pei-Tzu Tsai, Benjamin Buck, Dror Ben-Zeev, Trevor Cohen

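AI summary: Proposes a multi-agent LLM pipeline that automatically detects and classifies delusional beliefs, affective responses, and behavioral responses in naturalistic audio diaries, with majority voting giving robust performance.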

Comments Accepted by CLPsych 2026

Abstract

Speech monologues recorded in naturalistic settings provide opportunities to characterize mental illness phenomenology and detect symptom exacerbation. Large language models (LLMs) offer new possibilities for automating this process, as they require annotated data primarily for evaluation rather than training. In this paper, we present a novel automated, multi-agent LLM pipeline for the fine-grained, multi-label extraction of language suggestive of delusional beliefs, associated affective responses, and behavioral responses from transcripts of naturalistic audio diaries collected from people with moderate persecutory ideation. Evaluating an ensemble of three foundation models, we demonstrate that detailed diagnostic prompt instructions successfully reduce false positives for delusional theme classification, but also constrain the interpretation of affective or behavioral responses. Furthermore, comparing multi-agent adjudication frameworks shows that complex conversational debate between agents diminishes accuracy on clinically ambiguous text by inducing premature consensus. Instead, majority voting establishes robust performance (Micro F1 of 0.872 and 0.779 for delusion detection and classification respectively). This work provides a validated and scalable pipeline for the automated detection and characterization of content suggesting delusional beliefs in naturalistic speech.
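Majority voting over multi-label outputs, the adjudication scheme the abstract reports as most robust, can be sketched as follows. This is an assumed reading in which each label is kept when more than half of the models assign it; the label names and the three model outputs are hypothetical, and the paper's exact voting rule may differ.

```python
from collections import Counter


def majority_vote(label_sets):
    """Keep each label that a strict majority of models assigned."""
    counts = Counter(label for labels in label_sets for label in set(labels))
    threshold = len(label_sets) / 2
    return {label for label, n in counts.items() if n > threshold}


# Hypothetical outputs from three models for one transcript segment.
votes = [
    {"persecutory", "fear"},
    {"persecutory", "avoidance"},
    {"persecutory", "fear", "referential"},
]
print(sorted(majority_vote(votes)))
```

Because each label is decided independently, no model's output can pull the others toward agreement, unlike a conversational debate.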

2605.24754 2026-05-26 cs.CV cs.AI cs.LG

Motion-Compensated Weight Compression

Ismail Lamaakal

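AI summary: Proposes Motion-Compensated Weight Compression (MCWC), which compresses neural network weights by aligning permutation-symmetric blocks and applying layer-sequential prediction and entropy coding, improving the rate-accuracy Pareto frontier on Transformer language modeling and vision classification.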

Comments 54 pages, 17 tables, 6 Figures

Abstract

Neural network weights are increasingly a bottleneck for deployment, yet most compression pipelines treat layers independently and overlook cross-layer redundancy induced by function-preserving symmetries. We propose Motion-Compensated Weight Compression (MCWC), a weight-only codec that aligns permutation-symmetric blocks (e.g., hidden units and attention heads) to maximize cross-layer correspondence, turning depth into a predictable sequence. In the aligned coordinate system, MCWC uses a lightweight layer-sequential predictor with periodic keyframes and encodes only quantized prediction residuals using a learned entropy model trained under a rate-distortion objective. A simple decoder reconstructs deployable weights by entropy decoding, dequantization, predictor-driven reconstruction, and inverse alignment, enabling fast weight materialization for inference. Across Transformer language modeling and vision classification, MCWC improves the rate-accuracy Pareto frontier over strong quantization and learned weight-codec baselines, while maintaining competitive decode time. Ablations confirm that alignment, prediction, entropy modeling, and keyframe scheduling are each necessary for the full gains. Our code is available via https://github.com/Ism-ail11/MCWC.
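The keyframe-plus-residual structure borrowed from video coding can be sketched in a few lines. This is not the MCWC implementation: alignment and entropy coding are omitted, the predictor is assumed to be "copy the previously reconstructed layer", and the quantizer is a uniform scalar one. The layer values, `step`, and `keyframe_every` are hypothetical.

```python
def quantize(values, step):
    """Uniform scalar quantization to integer symbols."""
    return [round(v / step) for v in values]


def dequantize(symbols, step):
    """Map integer symbols back to reconstructed values."""
    return [s * step for s in symbols]


def encode(layers, step=0.05, keyframe_every=3):
    """Code every keyframe_every-th layer directly, the rest as quantized residuals."""
    stream, prev = [], None
    for i, layer in enumerate(layers):
        if i % keyframe_every == 0:
            symbols = quantize(layer, step)
            prev = dequantize(symbols, step)
            stream.append(("key", symbols))
        else:
            # Predict from the decoder-side reconstruction so errors do not accumulate.
            residual = [w - p for w, p in zip(layer, prev)]
            symbols = quantize(residual, step)
            prev = [p + r for p, r in zip(prev, dequantize(symbols, step))]
            stream.append(("res", symbols))
    return stream


def decode(stream, step=0.05):
    """Rebuild all layers from the keyframe and residual symbols."""
    layers, prev = [], None
    for kind, symbols in stream:
        values = dequantize(symbols, step)
        prev = values if kind == "key" else [p + v for p, v in zip(prev, values)]
        layers.append(prev)
    return layers


# Hypothetical aligned layers: the first three are similar, the fourth is not.
layers = [[0.10, -0.20, 0.30], [0.12, -0.18, 0.33], [0.15, -0.15, 0.31], [0.50, 0.40, -0.60]]
stream = encode(layers)
decoded = decode(stream)
max_err = max(abs(a - b) for la, lb in zip(layers, decoded) for a, b in zip(la, lb))
print([kind for kind, _ in stream], max_err)
```

When consecutive layers are similar, residual symbols cluster near zero, which is what an entropy coder would exploit; the periodic keyframes bound how far a decoder must roll forward to materialize a given layer.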

2605.24753 2026-05-26 cs.CV

Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain

Avery Gump, Connor Henley, Sungjin Cheong, Akarsh Prabhakara, Mohit Gupta

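AI summary: To address artifacts caused by internal multipath glare in solid-state LiDAR, proposes a physical model based on the Transient Glare Spread Function (TGSF) and a training-free algorithm that suppresses glare before point-cloud formation while preserving true scene structure.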

Comments CVPR 2026

Abstract

Modern LiDARs are rapidly transitioning from bulky, mechanically scanned systems to ultra-compact, low-cost, solid-state arrays. This miniaturization, while enabling scalability, affordability, and camera-like data structures, introduces a new and severe failure mode: internal-multipath glare. When light from a bright or retroreflective surface reflects and scatters within the LiDAR, light that should reach a single pixel spreads across the pixel array. The resulting artifacts create phantom objects, obscure real ones, and produce safety-critical "ghosts in the point clouds." This paper introduces a physically grounded sensing model and algorithmic techniques for addressing this effect. We show that internal glare can be represented as a linear, scene-independent operator, the Transient Glare Spread Function (TGSF), acting on the transient measurements. Building on this model, we develop a training-free approach that operates on low-level LiDAR detections (or echoes) prior to point-cloud formation, leveraging knowledge of the glare spread function to reason about the likelihood of each detection arising from glare. The resulting approach is compatible with existing LiDAR signal-processing pipelines, and deployable on unmodified commercial sensors. Using experiments with real single-photon LiDAR hardware, we demonstrate substantial suppression of severe glare artifacts while preserving true scene structure.

2605.24752 2026-05-26 cs.LG cs.CC cs.DS math.PR

A computational phase transition for learning-to-sample from Ising models

Andrej Risteski, Thuy-Duong Vuong

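AI summary: Constructs a family of bounded-width Ising models just above the spectral threshold and proves that learning-to-sample is computationally hard under standard cryptographic assumptions, establishing a sharp computational phase transition at the spectral threshold.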

Abstract

We study learning-to-sample, a basic algorithmic task underlying generative modeling, for Ising models, a standard testbed for algorithmic ideas in both theoretical computer science and machine learning. Given i.i.d. samples of an unknown target distribution, the goal of learning-to-sample is to learn a computationally efficient generation procedure that produces new samples following approximately the same distribution. We construct a family of Ising models of constant bounded width that lie just beyond the spectral threshold $λ_{\max}(J)-λ_{\min}(J)=1$, and show that learning-to-sample for this family is computationally hard under standard cryptographic assumptions, even when the learner is given both polynomially many i.i.d. samples from the model and explicit access to its parameters. Combined with results of [AJKPV24,KLV25] showing tractability of learning-to-sample below the spectral threshold, this establishes a sharp computational phase transition at the spectral threshold. Moreover, combined with prior results on parameter learning for bounded-width Ising models [KM17,WSD19,VML20], this shows that learning-to-sample can be more difficult than parameter learning. Finally, we show that any efficient learner for these hard instances exhibits a natural memorization-hallucination dichotomy: the learner must either output configurations that, after a simple transformation, match the (transformed) training data or place substantial mass on configurations of negligible probability under the target distribution.

2605.24743 2026-05-26 cs.LG cs.AI

Bilevel Optimization of Synthetic Trajectories for Multi-Turn LLM Fine-Tuning

Shresth Verma, Mauricio Tec, Cheol Woo Kim, Kai Wang, Milind Tambe

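AI summary: Proposes BOOST, a bilevel optimization framework that combines inner-level weighted training with an outer-level lightweight reweighting head to counter the degradation in multi-turn LLM performance caused by heterogeneous synthetic-trajectory quality.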

Abstract

While LLMs excel at single-turn generation, they struggle with long-horizon, multi-turn interactions. Offline reinforcement learning (RL) offers a scalable approach, yet its performance hinges on the availability and quality of multi-turn trajectory data. A common remedy is to augment training with synthetic trajectories generated by LLMs or simulators, but synthetic data is highly heterogeneous in quality, and naively treating all trajectories as equally informative can degrade performance. We propose BOOST, a bilevel optimization framework where the inner level trains the LLM on reweighted data and the outer level trains a lightweight reweighting head on held-out real validation tasks, assigning continuous trajectory-level weights without requiring an external judge. To ground this approach, we derive a PAC-Bayesian bound revealing a three-way trade-off: synthetic data increases diversity but risks task-shift, while concentrating weight on high-quality trajectories improves empirical performance at the cost of effective sample size. Empirically, our method consistently outperforms multiple baselines. Analysis reveals it upweights synthetic trajectories that align with the real data distribution and exhibit higher qualitative merit.
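The cost of concentrating weight on a few trajectories can be made concrete with an effective-sample-size calculation. The sketch below assumes Kish's formula, a common choice; the paper's bound may define effective sample size differently, and the weight vectors are hypothetical.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Eight trajectories weighted equally versus weight concentrated on two of them.
uniform = [1.0] * 8
concentrated = [4.0, 4.0, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25]

print(effective_sample_size(uniform), effective_sample_size(concentrated))
```

Uniform weights give an effective sample size equal to the number of trajectories, while the concentrated weights behave like roughly three samples even though eight are present.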

2605.24742 2026-05-26 cs.LG

Aligning Molecular Graph Explanations with Chemical Identity via InChIfied Invariants

Emanuele Guidotti, Sara Puglioli

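AI summary: Proposes InChI-based node, edge, and graph features (InChIfied Invariants) that give chemically equivalent molecular graphs consistent representations, improving the consistency of predictions and explanations.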

Abstract

Obtaining consistent explanations for machine learning on molecular graphs requires predictions and attributions to be aligned with chemical identity. However, chemically equivalent drawings of the same molecule can induce different molecular representations, leading to inconsistent predictions and explanations. Here, we introduce InChIfied Invariants, a class of node, edge, and graph features based on the International Chemical Identifier (InChI) and designed to be invariant under transformations that preserve chemical identity. Using one million molecular graphs from PubChem Substances, we show that InChIfied Invariants produce identical representations for chemically equivalent graphs in 99.62% of cases, whereas standard Daylight invariants do so in only 0.35% of cases. Across MoleculeNet tasks, InChIfied Invariants preserve predictive performance while significantly improving prediction consistency across alternative graph depictions of the same molecules. We further perform a quantitative attribution analysis and show that explanations produced with standard molecular featurization methods vary substantially across chemically equivalent graphs, while InChIfied Invariants enforce consistent attributions by construction. We release open-source software implementing InChIfied Invariants, which can be used as a drop-in replacement for standard molecular graph features.

2605.24740 2026-05-26 cs.LG cs.GT

Reinforcement Learning for Reachability: Guaranteeing Asymptotic Optimality

Amogh Palasamudram, Jakub Svoboda, Suguman Bansal, Krishnendu Chatterjee

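AI summary: For reinforcement learning with reachability specifications, proposes an iterative PAC-learning-based approach that attains asymptotically optimal policies without knowing the internal MDP parameters, and validates the convergence dynamics experimentally.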

Comments Main text and appendix of work accepted in ICML 2026

Abstract

Reinforcement learning (RL) for reachability specifications is fundamental in sequential decision-making, yet theoretical guarantees remain less explored. A recent work achieves asymptotic convergence to optimal policies. However, this approach provides limited insight into convergence dynamics. In this work, we present an alternative approach that provides deeper theoretical insights into convergence. Our approach builds on PAC learning with assumptions. PAC learning guarantees near-optimal policies with high confidence in finite time but requires knowing internal MDP parameters like minimum transition probability. We argue that while these parameters are unknown in RL, they can be iteratively refined and estimated with increasing accuracy. By iteratively satisfying PAC conditions, we show that exact optimality can be achieved in the limit. Empirical evaluations on standard benchmarks validate our theoretical insights into convergence dynamics.

2605.24737 2026-05-26 cs.CL cs.AI cs.CY

Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring

Jehanne Dussert

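AI summary: Observing that AI compliance is treated as a binary audit-time verdict rather than a continuously measurable property of production systems, proposes the governance-from-metrics principle and the open-source framework govllm, which monitors compliance continuously from runtime observability signals, and validates the effectiveness of a multi-model jury design for regulatory evaluation.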

Comments 41 pages, 8 figures, preprint

Abstract

Current approaches to AI compliance treat conformity as a binary, audit-time verdict rather than a continuous, measurable property of production systems. We argue that this compliance fiction is structurally ill-suited to the requirements of the EU AI Act, which demands ongoing human oversight and the detection of emergent behavioural drift in deployed systems. We introduce governance from metrics, a principle whereby regulatory compliance is derived as a continuous signal from runtime observability rather than from static assessments. Building on this principle, we present govllm, an open-source framework implementing a governance-driven routing architecture in which model selection is determined by accumulated compliance scores rather than by latency or cost alone. Central to our approach is a panel of regulatory judges, LLM evaluators specialised per criterion (EU AI Act, GDPR, ANSSI, accessibility), whose inter-judge disagreement we reframe not as noise but as a regulatory uncertainty signal warranting human arbitration. We validate this approach through a ground truth corpus of 49 annotated prompt/response pairs across five regulatory criteria, evaluated by four small language models (SLMs, 1.7B-7B parameters) running fully on-premise. Agreement rates range from 51.5% (mistral:7b) to 69.1% (phi4-mini), with no single model dominating across all criteria, which empirically motivates the Profile-as-jury design. We further document three structural failure modes in small regulatory judges and a judge-specific position bias that degrades agreement by up to 25 percentage points across three question-order conditions (original, reversed, permuted). govllm is released as open-source software to support reproducible AI governance research.

2605.24733 2026-05-26 cs.CL

StepGap: A Hybrid NLI-LLM Checker for Step-Level Evidence-Gap Detection in Multi-Hop Question Answering

Yuelyu Ji, Zhuochun Li, Hui Ji, Daqing He

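AI summary: Proposes StepGap, a hybrid NLI-LLM decision tree that detects step-level evidence gaps in multi-hop question answering and outputs one of three labels, reaching sF1 = 72.0 on 82 questions and improving model exact match when used as a GRPO process reward.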

Abstract

We present StepGap, a hybrid NLI-LLM decision tree that detects step-level evidence gaps in multi-hop QA and emits one of three typed labels: Contradicted Claim (CC), Irrelevant Evidence (IE), or Missing Bridge (MB), each tied to a concrete repair action. On 82 multi-hop questions (181 annotated steps, $κ = 0.704$), StepGap reaches sF1 = 72.0, within the bootstrap confidence interval of an LLM-only baseline (70.1) but with a more decomposable structure: every StepGap stage hurts F1 when removed, while three of four LLM-only removals improve F1, a sign of competing-error cancellation, where internal stages mask each other's errors. We further expose a Q-F1 trap: question-level F1 is mechanically inflated by checkers that flag every step, making step-level F1 the necessary diagnostic. Used as a typed GRPO process reward, StepGap improves Qwen2.5-7B-Instruct Exact Match from $32.1 \pm 0.3$ to $35.4 \pm 0.9$ across three seeds, with the single-run comparison showing a $+5.6$ Avg EM gain over the matched Search-R1 GRPO reproduction.

2605.24726 2026-05-26 cs.CV

From Full Boards to Tiny Defects: Scale-Aware Tile Inference with Topology-Aware Merging for High-Resolution PCB Defect Detection

Mohammad Alijanpour Shalmani, Alale Rezvani Boroujeni, Ali Amini, Jiann Shiun Yuan

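AI summary: To address the loss of tiny defects when high-resolution PCB images are downscaled, proposes a tile-inference-based scale-consistent training strategy and a topology-aware merging method, significantly improving defect detection accuracy without retraining.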

Abstract

High-resolution printed circuit board (PCB) inspection suffers from resolution collapse when full-board images are resized to standard detector inputs: micro-scale defects shrink to a few pixels and are missed. Tile-based inference preserves local detail but introduces boundary artefacts at tile edges, causing split detections and false negatives. We present a systematic comparison of five inference strategies evaluated on two high-resolution PCB defect datasets, PCB-Defect (230 images, 1704 annotations) and HRIPCB (693 images, 2953 annotations), spanning six defect classes. We show that training-inference scale consistency is critical: a detector trained on full images collapses to mAP@50 = 0.01 under tile inference, while the same architecture trained on 640×640 tile crops achieves 0.72 and 0.94 on the two datasets respectively. We further exploit Topology-Aware Tile Merging (TA-TM), a training-free post-processing method that builds a tile-adjacency graph and adjusts boundary-sensitive detection scores using neighbour-tile agreement before global NMS. Across both datasets, adding 128 px tile overlap raises boundary-zone recall from ~26-63% to ~70-100%, TA-TM achieves the best mAP@50 on both benchmarks, and tile inference recovers 46-100% of small defects missed entirely by full-image methods. Results are consistent across datasets, confirming the generalizability of the proposed strategy. TA-TM requires no retraining and is architecture-agnostic, making it directly applicable to existing PCB inspection pipelines.
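The overlapping tile grid behind tile-based inference can be sketched as below, using the 640-pixel tile size and 128-pixel overlap quoted in the abstract. This covers only the tiling step, not TA-TM or detection; the function names and the choice to place the last tile flush with the image edge are assumptions made for illustration.

```python
def tile_origins(length, tile=640, overlap=128):
    """Start offsets of tiles along one axis so that tiles overlap and cover [0, length)."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # final tile sits flush with the image edge
    return origins


def tile_boxes(width, height, tile=640, overlap=128):
    """All tile rectangles (x0, y0, x1, y1) in row-major order."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]


# Hypothetical 2000 x 1000 board image.
print(tile_origins(2000))
print(len(tile_boxes(2000, 1000)))
```

With these settings, neighbouring tiles share at least 128 pixels, so a defect lying on one tile's border appears whole in an adjacent tile. Images smaller than one tile would need padding, which this sketch does not handle.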

2605.24722 2026-05-26 cs.CV

Calibrating Probabilistic Object Detectors with Annotator Disagreement

Zhi Qin Tan, Owen Addison, Yunpeng Li

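AI summary: To address annotator disagreement caused by object ambiguity in object detection, proposes a method for calibrating probabilistic object detectors without ground-truth annotations, designing classification and localization calibration-error metrics together with train-time and post-hoc calibrators so that predictive uncertainty matches the annotation distribution.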

Abstract

High degrees of disagreement among annotators can exist for ambiguous objects, e.g. in medical images, underscoring the challenges of establishing ground truth annotations in object detection tasks. Despite this, all existing object detectors implicitly require access to ground truth annotations for either training or evaluation. The fundamental questions we target are: How can we learn an object detector with multiple annotators' annotations but without objective ground truth annotations due to object ambiguity, and how can we enable the learned detector to express meaningful model predictive uncertainties in detecting ambiguous objects? To answer these questions, we present an interpretable approach to calibrate probabilistic object detectors, where the calibration goal is to align the class confidence and bounding box variance estimates to the annotators' annotation distribution. We introduce an efficient yet effective framework to calibrate probabilistic object detectors by designing four evaluation metrics to measure calibration errors regarding classification and localization, and proposing a train-time calibration and post-hoc calibrator, all without the need to access any ground truth. This framework is generalizable to many existing probabilistic object detectors, such as the YOLO families and two-stage detectors. Empirical results with real-world and synthetic datasets of medical and natural images demonstrate the superior performance of the proposed framework with three popular object detectors.

2605.24721 2026-05-26 cs.CL

ROC Analysis for Evaluating Translation Quality Estimation Systems

Evelyn Y. Garland, Carola F. Berger

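AI summary: Proposes Receiver Operating Characteristic (ROC) analysis for evaluating automated translation quality estimation (QE) systems; the approach yields results consistent with existing methods and provides actionable performance insights for business decisions.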

Comments 16 pages, 8 PNG figures, 3 tables, uses acl.sty

Abstract

The increasing use of automated translation quality estimation (QE) systems calls for practical, decision-oriented methods for evaluating their performance. We propose that Receiver Operating Characteristic (ROC) analysis is a useful approach for this purpose. Our study shows that ROC analysis not only produces results consistent with currently prevalent methods, but also offers several important advantages, including actionable performance insights that support business decision-making.
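For readers unfamiliar with ROC analysis, the sketch below computes ROC points and the area under the curve for a QE system treated as a binary classifier. The scores, the labels, and the convention that 1 means "translation is acceptable" are hypothetical and are not taken from the paper.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores.

    labels: 1 = translation is acceptable, 0 = not acceptable (assumed convention).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical QE scores and human acceptability judgments.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(curve, auc(curve))
```

Each point on the curve corresponds to one operating threshold, so a team can read off the false-positive rate it would accept for a given share of correctly passed translations, which is the decision-oriented view the abstract refers to.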

2605.24719 2026-05-26 cs.CL cs.AI

World-State Transformations for Neuro-symbolic Interactive Storytelling

Santiago Góngora, Luis Chiruzzo, Gonzalo Méndez, Pablo Gervás

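AI summary: Explores using LLMs within a neuro-symbolic architecture to predict world-state transformations in a rule-based system, addressing the story-coherence problems of purely LLM-based approaches; experiments suggest the approach maintains world-state consistency and encourages creative player input.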

Comments To be presented at the 17th International Conference on Computational Creativity (ICCC'26)

Abstract

Large Language Models (LLMs) have changed the possibilities of Interactive Storytelling systems that process free-text user input. However, as more of these systems are built, evidence continues to mount regarding the story coherence problems that arise when relying solely on them. Recent research suggests that LLMs can effectively predict state changes within rule-based Interactive Storytelling systems, triggering pre-programmed world-state transformations. In this paper, we conduct an exploratory evaluation of whether such transformations can serve as a catalyst for player expression while aiming to address the incoherence issues typical of purely LLM-based approaches. Building upon a neuro-symbolic architecture, we conducted experiments using an open-source model (Llama 3 70B) and a closed-source model (Gemini 1.5 Flash), with testing conducted in both English and Spanish. Eight participants played two scenarios, carefully designed to assess different evaluation objectives. Our observations suggest that transformations offer a way to maintain world-state consistency while encouraging players to interact creatively through their written inputs.

2605.24718 2026-05-26 cs.CL

The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty

Volodymyr Ovcharov

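AI summary The study measures tokenizer fertility for 10 foundation models across 25 European languages, reveals the cost gap between English and other languages, and finds that Ukrainian pays an extra cost because it is underrepresented in pre-training data.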

Comments 16 pages, 3 figures, 8 tables. Dataset: https://huggingface.co/datasets/overthelex/tokenizer-fertility-map

English abstract

Tokenizer fertility, the number of tokens per word, imposes a hidden cost on non-English NLP. We measure fertility for ten foundation models across 25 European languages on parallel text, producing the first controlled tokenizer tax map for the continent. The tax spans 2.5x from English (1.2 tokens/word) to Greek/Maltese (~3.1), following a clear hierarchy: Romance (1.5-1.7), Germanic (1.7-1.9), Slavic (2.2-2.5), Uralic/Baltic (2.7-3.0). Ukrainian (2.7) pays 15-18% more than cognate Slavic languages, reflecting underrepresentation in pre-training data. Fertility rankings are domain-invariant across three text registers (rho > 0.97). A subword analysis reveals that high-fertility tokenizers fragment morphological boundaries rather than preserving them. Cross-lingual few-shot evaluation on four Slavic languages shows that few-shot effects are model-intrinsic, not language-dependent. We release all measurements as a public dataset.
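
The fertility metric defined in the abstract (tokens per word) can be sketched as follows. This is a minimal illustration, assuming words are whitespace-delimited and using a toy tokenizer; `toy_tokenize`, `fertility`, and `relative_tax` are hypothetical names, and the paper's actual word segmentation and tokenizers are not specified here.

```python
def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: splits every word into 3-character pieces."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return tokens


def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word over a corpus."""
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("corpus contains no words")
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_tokens / n_words


def relative_tax(fertility_lang, fertility_english):
    """How many times more tokens per word a language needs compared with English."""
    return fertility_lang / fertility_english
```

In practice `tokenize` would wrap a real model tokenizer and `texts` would be the parallel sentences for one language. With the abstract's rounded figures, `relative_tax(3.1, 1.2)` comes out at roughly 2.6, in line with the reported span of about 2.5x.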

2605.24712 2026-05-26 cs.LG cs.HC

Hardware-Aware Federated Learning for Speech Emotion Recognition

Beyazit Bestami Yuksel, Emrah Dikbiyik

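AI summary Proposes a hardware-aware federated learning framework that combines hardware profiling, top-K client selection, and adaptive local epochs for emotion recognition on the IEMOCAP dataset, reducing training time by about 36.5% and communication cost by 40% compared to FedAvg.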

Comments 4 pages, 3 figures, 4 tables

English abstract

Federated learning (FL) enables privacy-preserving collaborative training across distributed edge devices, but real deployments involve heterogeneous clients with different processing power, memory capacity, and communication latency, which often increase round duration and system cost. This paper proposes a hardware-aware federated learning framework for emotion recognition on session-partitioned IEMOCAP that integrates hardware profiling, top-K client selection, and adaptive local epochs within a unified training loop. We compare the method against FedAvg, FedProx, and random top-K selection under a non-IID setup and show that, across 50 federated rounds and 5 independent trials, the proposed approach achieves competitive validation accuracy (0.352), reduces total training time by about 36.5% compared to FedAvg, and lowers cumulative communication cost by 40%.
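
One way the combination of hardware profiling, top-K client selection, and adaptive local epochs could fit together is sketched below. The abstract does not give the scoring rule or the epoch schedule, so everything here is an assumption for illustration: the `ClientProfile` fields, ranking by estimated epoch time plus latency, and the per-round time budget.

```python
from dataclasses import dataclass


@dataclass
class ClientProfile:
    """Hypothetical hardware profile collected for one client."""
    client_id: str
    samples_per_sec: float  # measured local training throughput
    latency_sec: float      # communication latency per round
    num_samples: int        # size of the local dataset


def epoch_time(client):
    """Estimated wall-clock time for one local epoch."""
    return client.num_samples / client.samples_per_sec


def select_top_k(profiles, k):
    """Pick the k clients with the lowest estimated epoch-plus-communication time."""
    ranked = sorted(profiles, key=lambda c: epoch_time(c) + c.latency_sec)
    return ranked[:k]


def adaptive_epochs(client, time_budget_sec, max_epochs=5):
    """Give faster clients more local epochs, within a per-round time budget."""
    usable = time_budget_sec - client.latency_sec
    fitting = int(usable // epoch_time(client))
    return max(1, min(max_epochs, fitting))
```

Under these assumptions, slow clients are either left out of the round or limited to a single local epoch, which is one plausible route to shorter rounds and less communication.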

2605.24710 2026-05-26 cs.LG math.PR math.ST stat.ML stat.TH

Feature Learning in Wide Neural Networks under $μ$P: Identifiability and Sparse-Dictionary Decomposition of the Mean-Field Limit

Akmal Xodarev

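AI summary Under the Maximal Update Parametrization (μP), this paper establishes four structural results for feature learning in wide two-layer neural networks: global existence and uniqueness of the mean-field limit, a characterization of identifiability, a sparse-dictionary decomposition, and a decomposition of the total feature-learning error. It also identifies the natural learning cell of the architecture-data pair.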

Comments 86 pages

English abstract

We establish four structural results for feature learning in wide two-layer neural networks under the Maximal Update Parametrization ($μ$P). First, we prove global existence and uniqueness of the mean-field limit of noisy gradient descent under $μ$P, identifying the maximal admissible weight $w^*$ on the moment sequence of the initialization as the reciprocal parameter-moment-growth boundary, and hence the largest weighted moment class propagated by the flow. The finite-particle approximation has uniform-in-time squared-Wasserstein rate $O(N^{-1})$. Second, we characterize identifiability of the mean-field limit: two admissible parameter measures induce the same network function in $L^2$ exactly when their active components agree modulo the finite-rank realization symmetry of the architecture. The orbit depth $D^*_{\mathrm{orb}}$ is separated from the moment-variety depth $D^*_{\mathrm{var}}$. Third, under the Barron-Hermite target condition the active support of the long-time limit measure admits a sparse-dictionary decomposition: it is supported on at most $S^*$ atoms modulo finite-rank realization symmetry, with $S^*$ bounded by an explicit coefficient-threshold number. Fourth, we derive the total feature-learning-error decomposition into statistical, optimization, propagation-of-chaos, and sparse-residual components, with a target-dependent Hermite/Barron tail replacing any initialization-only residual. The four results are tied together by an architectural identity: the triple $(w^*, D^*_{\mathrm{orb}}, S^*)$ -- the maximal admissible weight, the orbit identifiability depth, and the sparse-dictionary depth at which the target is realizable -- is the natural learning cell of the architecture-data pair $(σ, ρ)$. The proofs are self-contained except for standard results from $μ$P and mean-field Langevin theory.

2605.24709 2026-05-26 cs.LG

Streaming Reinforcement Learning under Partial Observability with Real-Time Recurrent Learning

Noah Farr, Aryaman Reddi, Carlo D'Eramo, Jan Peters

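AI summary Proposes using recurrent trace units (RTUs) to perform exact real-time recurrent learning (RTRL) with time and memory complexity linear in the parameter count, addressing the gradient-computation bottleneck of streaming reinforcement learning under partial observability while sustaining performance on discrete and continuous control tasks.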

Comments 16 pages, 4 figures

English abstract

Streaming reinforcement learning has emerged as an online learning paradigm that conforms to the restrictions of natural learning agents that process data incrementally, i.e., with a batch size of 1 and no replay buffer. While streaming RL has recently been shown to scale with deep function approximation under full observability, partially observable settings have remained out of reach. Truncated backpropagation through time collapses to a one-step gradient horizon under the streaming setting, and exact real-time recurrent learning is prohibitively expensive. We close this gap using recurrent trace units, a diagonal recurrent architecture that enables exact RTRL with linear time and memory complexity in the parameter count, and show that they integrate cleanly into existing streaming algorithms across both discrete and continuous control. On a MemoryChain diagnostic with chain lengths from 2 to 128, our method sustains performance where streaming TBPTT(1) baselines using feedforward, GRU, and RTU networks collapse. On five POPGym tasks and on partially observable MuJoCo continuous control, the streaming approach is competitive with batched PPO on POPGym and recovers a substantial fraction of batched performance on masked MuJoCo, despite using no replay buffer or batched updates.

2605.24703 2026-05-26 cs.CL cs.AI

TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

Liying Han, Kang Yang, Oliver Wang, Jason Wu, Pengrui Quan, Gaofeng Dong, Ozan Baris Mulayim, Sizhe Ma, Yuyang Yuan, Dezhi Hong, Mario Berges, Mani Srivastava

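AI summary Proposes the TS-Skill benchmark, which diagnoses the signal-level capabilities of models in time-series question answering through three composable analytical skills (temporal scale selection, temporal localization, and cross-interval integration), and develops the SKEvol framework to construct the benchmark automatically; experiments reveal capability gaps across the skills.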

English abstract

Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unlike text-only QA, TSQA requires models to ground answers in temporal signals whose patterns may occur at different scales, specific time locations, or across separated intervals. However, existing benchmarks are typically organized by task types or high-level reasoning categories, making it difficult to diagnose the underlying signal-level capabilities driving model performance. We introduce TS-Skill, a controlled benchmark for evaluating three composable analytical skills in TSQA: temporal scale selection (SK1), temporal localization (SK2), and cross-interval integration (SK3). TS-Skill provides timestamp-aware questions, broad domain coverage, and human-validated QA quality. To construct the benchmark at scale, we develop SKEvol, a skill-guided agentic framework that combines domain-aware time-series seed generation, skill-controlled question generation, metadata- and code-assisted answer construction, multi-phase signal-grounded verification, and human-in-the-loop curation. Experiments on ten state-of-the-art LLMs and TSLMs reveal substantial and uneven capability gaps across SK1-SK3. In particular, SK3 remains consistently challenging for non-agent models, whereas tool-augmented agents show a selective advantage on standalone SK3. These findings demonstrate that skill-level evaluation can uncover temporal reasoning failures that are obscured by aggregate TSQA scores.

2605.24702 2026-05-26 cs.CV

Do Image-Text Metrics Respect Semantic Invariances?

Amit Agarwal, Hitesh Laxmichand Patel, Meizhu Liu, Jyotika Singh, Karan Dua, Hansa Meghwani, Matthew Rowe, Michael Avendi, Yassi Abbasi, Tao Sheng, Sujith Ravi, Dan Roth

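AI summary Systematically evaluates the semantic invariance of five popular image-text evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) using semantics-preserving perturbations along three axes (spatial, object, and socio-linguistic framing), finds that they are sensitive to non-semantic changes, and proposes invariance-calibrated scoring as a post-hoc adjustment.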

English abstract

Reference-free image-to-text evaluators are now standard for scoring image-caption alignment, yet it is unclear whether they respect semantic invariances. We present an invariance probe on five popular evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) under semantics-preserving perturbations along three axes: spatial (flips, context-preserving repositioning, light rotations), object (scale, category), and socio-linguistic framing (cultural/economic adjectives with neutral and length-matched controls). Across curated slices of three detection datasets and three caption evaluation suites, we find consistent non-semantic sensitivities: benign spatial edits and simple phrasing changes shift scores by approximately 6-9% on average, and for systems separated by just 0.7%, these shifts can cause ranking flips in up to about 37% of cases, particularly under spatial changes. A small human study also supports this finding and confirms that annotators generally judge perturbed pairs as equally correct, so these shifts reflect metric behavior rather than semantic change. We further propose invariance-calibrated scoring, a post-hoc adjustment that roughly halves median absolute sensitivity while retaining correlation with learned caption evaluators.

2605.24699 2026-05-26 cs.AI cs.LG

MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

Roberto Cruz, David Rey-Blanco

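AI summary Proposes MDIA, a multi-agent diagnostic system built as a 7-node specialty-routed clinical reasoning graph, which improves HealthBench Professional performance by 3.72 percentage points on a non-fine-tuned LLM; the gain is attributed to system architecture rather than prompt engineering.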

Comments 33 pages, 10 figures

English abstract

Reported gains on agentic-LLM clinical benchmarks are often attributed to prompt engineering, yet our results suggest that larger improvements can come from architectural and engine-level design. We present MDIA, a Multi-agent Diagnostic Intelligence Agent implemented as a 7-node specialty-routed clinical reasoning graph, and evaluate it on the full HealthBench Professional benchmark (n = 525) using a non-fine-tuned LLM. MDIA achieves 0.6272 under OpenAI's GPT-5.4-2026-03-05, which is 3.72 percentage points above the performance of OpenAI's ChatGPT for Clinicians. The experimental work shows that the performance lift is attributable to system architecture: specialty routing, multi-turn context preservation, drug-state safety gating, site-filtered search, length-aware synthesis, and engine-level reliability. These findings support the view that agentic clinical benchmark performance is shaped by both the underlying foundation model and the orchestration architecture. Nevertheless, we also observed notable differences when using other models as a grader; in particular, when using Gemini 2.5 Pro, MDIA scored 0.6585, which suggests that the choice of grader is a source of variability. Robust evaluation of LLMs would therefore require assessment across several independent grader models.

2605.24697 2026-05-26 cs.CL cs.AI

The Path Matters: Learning a Token-Commitment Policy for Diffusion Language Models

Bohang Sun, Max Zhu, Francesco Caso, Jindong Gu, Junchi Yu, Philip Torr, Pietro Liò, Jialin Yu

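AI summary This paper proposes TraceLock, a lightweight plug-in controller that learns a reusable trace-state policy to optimize token-commitment decisions in diffusion language models, improving the quality-step tradeoff.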

English abstract

Diffusion large language models promise faster generation by refining many token positions in parallel, but this parallelism introduces a hidden control problem: which proposed tokens should be transferred into the partially decoded sequence at each step? We refer to this decision as token commitment. Existing frozen-generator decoders largely rely on hand-designed confidence rules or block-specific acceptance filters. We argue that token commitment can instead be learned as a reusable trace-state policy. We introduce TraceLock, a lightweight plug-in controller that instantiates this policy for a frozen diffusion language model. Since oracle commitment times are unavailable, TraceLock derives self-supervision from future stability: at decoding step t, a proposed token for position i is labeled stable if it matches the final token at position i after the full decoding trace completes. The controller scores variable-length trace states and decides which active token proposals should be committed to the partially decoded sequence. Once trained for a given frozen backbone, the controller can be deployed across local-window widths, generation lengths, and step budgets without retraining or per-setting calibration. Experiments on question answering, mathematical reasoning, and code generation show that TraceLock improves the quality-step tradeoff over heuristic and learned baselines, with particularly stable behavior under cross-setting deployment. Diagnostic analyses show that its decisions are not reducible to scalar confidence, suggesting that frozen diffusion language models expose a learnable space of commitment trajectories beyond confidence-based decoding. Code is available at https://github.com/BobSun98/TraceLock.
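
The future-stability labeling rule stated in the abstract can be illustrated with a short sketch. It assumes a decoding trace is stored as a list of per-step proposal lists of equal length, with the last step holding the final sequence; this data layout and the function name are assumptions for illustration, not TraceLock's actual implementation.

```python
def future_stability_labels(trace):
    """Label each proposed token by whether it matches the final decoded token.

    trace[t][i] is the token proposed for position i at decoding step t;
    trace[-1] is taken as the final sequence (assumed layout).
    Returns labels[t][i] = 1 if the proposal is stable, else 0.
    """
    if not trace:
        raise ValueError("trace must contain at least one decoding step")
    final = trace[-1]
    if any(len(step) != len(final) for step in trace):
        raise ValueError("every step must propose a token for every position")
    return [
        [int(token == final[i]) for i, token in enumerate(step)]
        for step in trace
    ]
```

These labels need no external annotation, since they are derived entirely from the completed trace, which is what makes the supervision self-generated.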

2605.24693 2026-05-26 cs.CL

CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming

Peisong Wang, Bowen Liu, Zehua Li, Yuyao Wang, Zhiwei Ma, Yuhan Li, Jia Li

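AI summary Proposes CP-Agent, which models feedback-driven solving as a calibrated stopped process and combines Dual-Granularity Verification, Test Augmentation, and Experience-Driven Self-Evolving to substantially improve competitive programming performance without updating any parameters.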

Comments Code: https://github.com/NineAbyss/CP-Agent

English abstract

Large language models still struggle with contest-level programming, while many agentic remedies rely on massive inference-time sampling or expensive multi-stage post-training. We study when execution feedback reliably helps an LLM competitive-programming (CP) solver and which mechanisms govern the gains. We model feedback-driven solving as a calibrated stopped process and identify three quantities: false-admission risk, program-level evidence against bad programs, and the active-state success hazard. Under held-out trace calibration and selection from a pre-declared finite controller manifest, the resulting structural certificate lower-bounds the clean success probability before false admission. We instantiate mechanisms targeting these quantities as Dual-Granularity Verification, Test Augmentation, and Experience-Driven Self-Evolving, yielding CP-Agent. Without updating any parameters, CP-Agent raises Pass@1 from 25.8% to 48.5% on LiveCodeBench Pro and improves Refine@5 by 11.0% on ICPC-Eval. Across three LLM backbones, CP-Agent lies on the cost-accuracy efficiency frontier, and ablations show that each component primarily affects its corresponding certificate quantity.

2605.24691 2026-05-26 cs.CV

AdaFuse-Det: Adaptive Cross-Modal Fusion of Event Cameras for Robust Object Detection in Low-Light RGB Imagery

Raju Imandi, Chethana B, Bharatesh Chakravarthi, Yong-Guk Kim, Manipriya S, Pavan Kumar B N

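AI summary Proposes AdaFuse-Det, a dual-stream framework that fuses CLAHE-enhanced RGB with event data through an adaptive cross-modal fusion module grounded in minimum-variance linear estimation, enabling robust object detection in low light; on the LLE-VOS benchmark it reaches 65.54% recall, 53.85% precision, and a 59.12% F1 score.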

English abstract

Detecting objects reliably under extreme low-light conditions is an open problem in computer vision, with practical urgency in applications ranging from nighttime surveillance to search-and-rescue robotics. Conventional RGB cameras degrade sharply at low photon flux, while event cameras, which record asynchronous per-pixel brightness changes at microsecond resolution and with high dynamic range, provide complementary structural cues that are largely illumination-invariant. We present AdaFuse-Det, a dual-stream framework that fuses CLAHE-enhanced RGB frames with voxelized event tensors through an Adaptive Cross-Modal Fusion (ACMF) module grounded in minimum-variance linear estimation theory. We formally show that the learned attention map asymptotically recovers the Gauss-Markov optimal fusion weights, and establish event conservation and temporal resolution bounds for the voxelization stage. On the LLE-VOS benchmark, AdaFuse-Det achieves a Recall of 65.54%, Precision of 53.85%, and F1-Score of 59.12% under severe illumination degradation, outperforming single-modality detectors in recall by a margin that reflects the theoretically predicted illumination-adaptation behavior.
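
For intuition about the minimum-variance (Gauss-Markov) fusion weights mentioned in the abstract, here is a scalar sketch of the textbook two-estimator case. It assumes two unbiased estimates with uncorrelated errors and known variances; the function and argument names are illustrative, and the paper's ACMF module works on learned feature maps rather than scalars.

```python
def min_variance_fusion(x_rgb, var_rgb, x_evt, var_evt):
    """Fuse two unbiased, uncorrelated estimates with inverse-variance weights.

    Returns (w_rgb, fused_value, fused_variance), where the event-stream
    weight is 1 - w_rgb.
    """
    if var_rgb <= 0 or var_evt <= 0:
        raise ValueError("variances must be positive")
    # The noisier a source is, the less weight it receives.
    w_rgb = var_evt / (var_rgb + var_evt)
    fused = w_rgb * x_rgb + (1.0 - w_rgb) * x_evt
    fused_var = (var_rgb * var_evt) / (var_rgb + var_evt)
    return w_rgb, fused, fused_var
```

As the RGB variance grows, which is the scalar analogue of darker scenes, `w_rgb` shrinks toward zero and the fused estimate leans on the event stream. The fused variance is never larger than either input variance.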

2605.24690 2026-05-26 cs.RO cs.LG

Sum of Costs Diffusion with Dynamic Guidance for Motion Planning

Aysu Aylin Kaplan, Özgür Erkent

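AI summary Proposes a diffusion-based motion planning method with high generalization capability that guides the denoising process with the gradient of the total collision cost and dynamically chooses the guidance start step, achieving the best performance among the compared methods on the Mπnets dataset.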

Comments Accepted at the Frontiers of Optimization for Robotics Workshop at the IEEE International Conference on Robotics and Automation (ICRA), 2026

English abstract

The motion planning problem for robotic manipulation can be addressed through classical or deep learning approaches. Existing methods face significant challenges in generalizing to diverse settings. In this study, we present a method with high generalization capability that generates collision-free trajectories using diffusion models in which the denoising process is guided by the gradient of the total collision cost. We also present a dynamic approach for choosing the start step of the gradient guidance. Experimental results demonstrate that guiding the diffusion model dynamically with the sum of collision costs offers more robust performance by overcoming the generalization issues faced by competing methods. The proposed model demonstrates its effectiveness by achieving the highest performance among the compared methods on diverse test settings in the Mπnets dataset.

2605.24687 2026-05-26 cs.CV cs.AI

HoloFair: Unified T2I Fairness Evaluation and Fair-GRPO Debiasing

Ruyi Chen, Lu Zhou, Xiaogang Xu, Chiyu Zhang, Jiafei Wu, Liming Fang

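AI summary Proposes the HoloFair benchmark framework, which evaluates the fairness of text-to-image models with the Multi-attribute, Group-wise Bias Index (MGBI), and introduces Fair-GRPO, a reinforcement-learning-based debiasing method that significantly improves multidimensional fairness on SD3.5-Medium while maintaining image quality.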

Comments Accepted to ICML 2026. Code and dataset are available at https://github.com/1059684669/HoloFair

English abstract

Text-to-Image (T2I) models have made significant strides in visual realism and semantic consistency, yet they often perpetuate and amplify societal biases. Existing evaluation methods typically address only single-dimensional biases and lack the perspective needed to uncover model biases at deeper, socially relevant semantic levels. We introduce HoloFair, a comprehensive benchmark framework for multidimensional demographic bias analysis. Built upon our large-scale fairness-oriented dataset and the SpaFreq (Spatial-Frequency) attribute classifier, this framework proposes the Multi-attribute, Group-wise Bias Index (MGBI) metric, designed to assess both intrinsic diversity and conditional biases. Beyond evaluation, we further introduce Fair-GRPO, a reinforcement-learning-based debiasing method that alters the distribution of generative models through a designed multi-objective reward function. Experiments on the SD3.5-Medium model demonstrate that Fair-GRPO significantly improves multidimensional fairness while maintaining high image quality. We also analyze potential reward hacking phenomena and provide corresponding mitigation strategies. Code and dataset are available at https://github.com/1059684669/HoloFair

2605.24686 2026-05-26 cs.AI

Emotional intelligence in large language models is fragmented across perception, cognition, and interaction

Minghao Lv, Lu Chen, Enchang Zhang, Anji Zhou, Xiaoran Xue, Hanyi Zhang, Fenghua Tang, Zhuo Rachel Han, Mengyue Wu

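AI summary This paper proposes FACET, a framework grounded in the Mayer-Salovey-Caruso four-branch ability model for evaluating the emotional intelligence of large language models, and finds that emotional intelligence is not a single capability but is fragmented across cognitive and interactive dimensions, with hidden emotion recognition as a universal bottleneck.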

English abstract

As large language models (LLMs) are increasingly integrated into emotionally sensitive domains, the structural integrity of their emotional intelligence (EI) becomes a critical frontier for safety and alignment. Current benchmarks often conflate superficial politeness with deep affective reasoning, failing to distinguish between perceptual accuracy and interactive efficacy. Here, we introduce FACET (Functional Affective Competence and Empathy Test), a psychometrically grounded framework comprising 480 expert-crafted items. Unlike previous metrics, FACET is theoretically anchored in the Mayer-Salovey-Caruso four-branch ability model, operationalizing EI through perception, facilitation, understanding, and management of emotions. Through an evaluation of nine frontier models (including GPT-5, Claude-Sonnet-4), we demonstrate that emotional intelligence is not a monolithic capability but is fragmented across cognitive and interactive dimensions. While frontier models demonstrate robust proficiency in objective emotion recognition and social reasoning, this does not consistently translate to interactive success. We categorize these discrepancies into three distinct performance profiles: cognitive-dominant, interactive-dominant, and context-dependent. These typologies indicate that emotional skills do not scale uniformly with general intelligence or model size; rather, they are shaped by specific alignment paradigms. Notably, we identify hidden emotion recognition as a universal performance bottleneck across all architectures. Our results suggest that current RLHF processes may optimize for "stochastic empathy", a statistical mimicry of emotional syntax, at the expense of integrated affective reasoning. These findings challenge the assumption of linear emotional scaling and provide a rigorous roadmap for developing socially aware agents capable of genuine clinical resonance.

2605.24684 2026-05-26 cs.LG cs.AI

Beyond the Aggregation Dilemma: Prior-Retaining Decoupled Learning for Multimodal Graphs

Hao Yan, Xuanru Wang, Jun Yin, Shirui Pan, Senzhang Wang, Chengqi Zhang

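AI summary To address the aggregation dilemma in multimodal attributed graph learning, where mandatory aggregation causes a performance inversion, proposes SUPRA, a decoupled dual-pathway architecture that keeps prior features independent and captures structural synergy with a lightweight shared GNN, with deep supervision to mitigate gradient starvation; it achieves state-of-the-art performance at substantially lower computational cost.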

English abstract

Multimodal Attributed Graph Learning (MAGL) integrates intrinsic node attributes with structural topology via graph aggregation. However, as pretrained encoders evolve into Large Foundation Models (LFMs), the landscape of MAGL fundamentally shifts: under high-confidence LFM priors, mandatory aggregation introduces topological noise that overwhelms discriminative signals, triggering a counter-intuitive performance inversion where sophisticated MAGL architectures underperform simple topology-agnostic MLPs. Through systematic empirical and theoretical analysis, we identify that this inversion stems from a fundamental aggregation dilemma characterized by two concurrent pathologies: (1) Representational Pathology (SNR Degradation) - mandatory aggregation dilutes robust intrinsic features with topological noise, causing the noise penalty to outweigh its collaborative benefit; and (2) Optimization Pathology (Gradient Starvation) - topological aggregation attenuates gradient flow, while a shared task loss causes dominant modalities to prematurely suppress weaker ones. To resolve this dilemma, we propose SUPRA (Shared-Unique Prior-Retaining Architecture), a decoupled dual-pathway paradigm. SUPRA processes modality-specific features through topology-agnostic MLPs while capturing structural synergy via a lightweight shared GNN, with auxiliary deep supervision counteracting gradient starvation. Extensive evaluations demonstrate that SUPRA achieves state-of-the-art performance while requiring 3.5x lower peak GPU memory and up to 4.4x faster training time than Multimodal Graph Transformers.