arXivDaily arXiv每日学术速递 周一至周五更新
重置
2605.29737 2026-05-29 cs.CR cs.CL cs.SE

Minimal Prompt Perturbations Lead to Code Vulnerabilities: Prompt Fragility and Hidden-State Signals in Coding LLMs

最小提示扰动导致代码漏洞:编码大语言模型中的提示脆弱性和隐藏状态信号

Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic

AI总结 本文通过token级突变实验,发现微小提示扰动(如单字符变化)即可使LLM生成代码从安全变为脆弱,并利用隐藏状态分析揭示输入处理漏洞比安全默认值漏洞更可预测。

详情
AI中文摘要

基于LLM的编码助手正被迅速采用,显著提高了开发者的生产力。随着组织越来越多地部署这些代理生成的代码,代码的安全性变得至关重要。先前的研究表明,微小的提示扰动会降低LLM生成代码的功能正确性,但这是否也会危及代码安全性尚未被研究。我们对三个模型和五种编程语言的提示应用token级突变,并表明小至单字符变化的突变可以将生成的代码从安全变为脆弱。探测模型的隐藏状态揭示,这种脆弱性部分编码在提示表示中,但分布不均匀。输入处理漏洞(模型省略验证或清理)比安全默认值漏洞(不安全代码源于一个局部选择,如弱算法或不安全参数)更可预测(平均AUC 0.753 vs 0.674)。这些结果表明,LLM辅助编码的威胁模型不仅包括提示注入,还包括普通的提示变化,并指出输入处理缺陷可以在生成前被捕获,而安全默认值缺陷需要在解码过程中进行干预。

英文摘要

LLM-based coding assistants are seeing rapid adoption, offering substantial gains in developer productivity. As organizations increasingly ship code these agents produce, the security of that code becomes critical. Prior work has shown that minor prompt perturbations degrade the functional correctness of LLM-generated code, but whether they also compromise code security has remained unstudied. We apply token-level mutations to prompts across three models and five programming languages, and show that mutations as small as a single-character change can flip generated code from secure to vulnerable. Probing the models' hidden states reveals that this fragility is partially encoded in prompt representations, but unevenly so. Input-handling vulnerabilities, where the model omits validation or sanitization, are more predictable (mean AUC 0.753) than secure-defaults vulnerabilities, where insecure code stems from one local choice such as a weak algorithm or unsafe parameter (mean AUC 0.674). These results show that the threat model for LLM-assisted coding extends beyond prompt injection to ordinary prompt variation, and indicate that input-handling flaws can be caught before generation while secure-defaults flaws require intervention during decoding.

2605.29734 2026-05-29 cs.CL

HTAM: Hierarchical Transition-Attended Memory for Operator Optimization

HTAM: 用于算子优化的层次化过渡注意力记忆

Yining Zhang, Mingyang Yi, Chen Wang, Xuwen Xiang, Tianhe Jia, Zedong Dan, Chengqing Zong, Yue Wang

AI总结 提出HTAM框架,通过构建层次化过渡图(HTG)组织粗粒度全局方向和细粒度局部策略,解决LLM在GPU算子优化中粒度不匹配问题,显著提升正确率和加速比。

详情
Comments
24 pages, 5 figures
AI中文摘要

高性能GPU内核对于高效部署LLM至关重要,但其优化仍然需要大量专业知识。最近基于LLM的代码生成使得自动GPU算子生成变得有前景,但算子优化仍然是一个硬件感知的搜索问题。现有的基于LLM的方法面临粒度不匹配的问题:粗粒度的提示可重用但难以执行,而细粒度的记忆可操作但会扩大搜索空间并模糊优化瓶颈。因此,关键挑战在于以适当的粒度组织优化经验。为了解决这个问题,本文提出了HTAM(层次化过渡注意力记忆),一种用于基于LLM的算子优化的粗到细框架。HTAM构建了一个两层的层次化过渡图(HTG),用于组织粗粒度的全局方向、细粒度的局部策略以及优化步骤之间的过渡经验。在每个演化步骤中,HTAM从当前状态和最近的优化历史中选择一个全局方向,检索相应的局部策略记忆,并用它来指导具体的CUDA代码生成。在完整的KernelBench套件上的实验表明,与基于LLM的基线相比,HTAM在正确率、快速解率和加速比上均有持续提升,而后端和Robust-KBench研究则表明结构化记忆带来的可迁移优势。

英文摘要

High-performance GPU kernels are essential for efficient LLM deployment, yet optimizing them remains expertise-intensive. Recent LLM-based code generation makes automatic GPU operator generation promising, but operator optimization remains a hardware-aware search problem. Existing LLM-based methods face a granularity mismatch: coarse hints are reusable but hard to execute, whereas detailed memories are actionable but enlarge the search space and obscure optimization bottlenecks. The key challenge is therefore to organize optimization experience at an appropriate granularity. To address this issue, this paper proposes HTAM (Hierarchical Transition-Attended Memory), a coarse-to-fine framework for LLM-based operator optimization. HTAM builds a two-level Hierarchical Transition Graph (HTG) to organize coarse global directions, detailed local strategies, and transition experience between optimization steps. During each evolution step, HTAM selects a global direction from the current state and recent optimization history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. Experiments on the full KernelBench suite demonstrate that HTAM consistently improves correctness, fast-solution rate, and speedup over LLM-based baselines, while backend and Robust-KBench studies indicate transferable benefits from structured memory.

2605.29733 2026-05-29 cs.AI

Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management

面向跨建筑能耗预测的不确定性感知迁移学习:迈向鲁棒且可扩展的区域级能源管理

Shadmehr Zaregarizi, Khashayar Yavari

AI总结 提出基于时间融合变换器的不确定性感知迁移学习框架,通过引入迁移鲁棒性指标和探针微调策略,实现跨建筑能耗预测的鲁棒迁移与不确定性量化。

详情
Comments
5 pages, 3 figures, 2 tables. Accepted at BALANCES'26 (6th ACM International Workshop on Big Data and Machine Learning for Smart Buildings and Cities), Banff, Alberta, Canada, June 22, 2026. This is the author's accepted manuscript; final published version DOI will be activated after June 22, 2026
AI中文摘要

将数据驱动的能耗预测扩展到区域级需要能够在最小目标域数据和诚实不确定性估计下跨建筑复用的模型。我们提出了一种基于时间融合变换器的不确定性感知迁移学习框架,用于跨建筑能耗预测,并在新发布的高分辨率真实子计量数据集上进行了评估:丹麦奥尔堡大学的一栋教育建筑(源域)和瑞士EMPA的多类型NEST建筑(目标域)。我们引入了迁移鲁棒性指数,一种与架构无关的度量,用于量化跨域泛化质量。一项四策略层冻结消融实验表明,仅探针微调(仅更新806K参数中的455个输出层参数)实现了最佳的迁移质量,优于全微调,表明TFT编码器学习了可迁移的时间表示。蒙特卡洛丢弃法得到的预测区间覆盖概率为93.2%,接近名义上的95%目标。数据稀缺性分析进一步显示,随着目标域数据的增加,性能单调提升,为区域能源部署提供了实践指导。

英文摘要

Scaling data-driven energy forecasting to district level requires models that can be re-used across buildings with minimal target-domain data and honest uncertainty estimates. We present an uncertainty-aware transfer learning (TL) framework for cross-building energy forecasting based on the Temporal Fusion Transformer (TFT), evaluated on a newly released high-resolution real sub-meter dataset: an educational building at Aalborg University, Denmark (source) and the multi-typology NEST building at EMPA, Switzerland (target). We introduce the Transfer Robustness Index (TRI), an architecture-agnostic metric for quantifying generalization quality across domain gaps. A four-strategy layer-freezing ablation shows that Probe-Only fine-tuning, updating only 455 output-layer parameters out of 806K, achieves the best transfer quality (TRI = 3,097), outperforming full fine-tuning and suggesting that TFT encoders learn transferable temporal representations. Monte Carlo Dropout yields a prediction interval coverage probability of 93.2%, close to the nominal 95% target. A data-scarcity analysis further shows monotonic improvement with increasing target-domain data, providing practical guidance for district energy deployment.

2605.29731 2026-05-29 cs.LG

EMAG: Differentiable 4D Gaussian Mixture Splatting for EEG Spatial Super-Resolution

EMAG: 可微分的4D高斯混合喷溅用于EEG空间超分辨率

Alex Lazarovich, Ofir Itzhak Shahar, Gur Elkin, Ohad Ben-Shahar

AI总结 提出EMAG框架,通过可微分的各向异性4D时空高斯混合模型,从稀疏低密度电极重建高密度EEG信号,实现空间超分辨率,并在三个基准上超越现有方法。

详情
AI中文摘要

高密度脑电图(HD-EEG)能够精细测量皮层活动,但需要昂贵的硬件和较长的设置时间,限制了其在临床和研究中的可及性。我们提出EMAG(EEG各向异性高斯混合),一个可微分的框架,通过将脑电源表示为各向异性4D时空高斯的混合,从稀疏的低密度(LD)电极子集重建HD-EEG信号。EMAG在球形脑网格的每个点上放置多个高斯的混合,每个高斯由完整的4x4精度矩阵参数化,从而实现各向异性的空间扩散以及空间和时间维度之间的显式耦合。前向模型通过电极位置处的可微分高斯场贡献渲染头皮EEG,从而无需显式源定位监督即可进行端到端训练。我们在三个公共EEG基准(Localize-MI、SEED和SEED-IV)上以2倍到8/16倍的超分辨率因子评估EMAG。在大多数超分辨率因子下,EMAG在三个标准基准(Localize-MI、SEED、SEED-IV)上优于当前最先进的EEG超分辨率方法。显式高斯参数化进一步实现了学习到的脑源配置的直接可视化和可解释性,可能为临床和神经科学应用(如源定位或生物标志物发现)开辟途径。

英文摘要

High-density electroencephalography (HD-EEG) enables fine-grained measurement of cortical activity but requires expensive hardware and lengthy setup times, limiting its clinical and research accessibility. We propose EMAG (EEG Mixture of Anisotropic Gaussians), a differentiable framework that reconstructs HD-EEG signals from a sparse subset of low-density (LD) electrodes by representing brain electrical sources as a mixture of anisotropic 4D space-time Gaussians. EMAG places a mixture of multiple Gaussians at each point of a spherical brain grid, each parameterized by a full 4 x 4 precision matrix, enabling anisotropic spatial spreads and explicit coupling between spatial and temporal dimensions. The forward model renders scalp EEG via differentiable Gaussian field contributions at electrode locations, enabling end-to-end training without explicit source localization supervision. We evaluate EMAG on three public EEG benchmarks (Localize-MI, SEED, and SEED-IV) at super-resolution factors of 2x through 8/16x. EMAG outperforms the current state-of-the-art EEG super-resolution method at most super-resolution factors on three standard benchmarks (Localize-MI, SEED, SEED-IV). The explicit Gaussian parameterization further enables direct visualization and interpretability of learned brain source configurations, potentially opening avenues for clinical and neuroscientific applications, such as source localization or biomarker discovery.

2605.29729 2026-05-29 cs.LG

Realistic honeypot evaluations for scheming propensity

针对策划倾向的逼真蜜罐评估

Victoria Krakovna, David Lindner, Lewis Ho, Sebastian Farquhar, Rohin Shah

AI总结 提出一种框架,通过在Google对齐研究代码库中设置编码任务作为蜜罐,测试模型在有机会时是否会追求工具性目标,实验表明Gemini模型在真实部署中不会主动策划,但在特定提示下会表现出策划或破坏行为。

详情
AI中文摘要

我们引入了策划蜜罐评估,这是一个测试模型在有机会时是否会追求工具性目标的框架。我们的策划蜜罐评估以Google对齐研究代码库中的编码任务形式进行。在真实的内部部署环境中,Gemini模型不会表现出未经提示的策划行为。如果提示明确鼓励主动性(情境意识或目标导向)和/或给模型一个隐藏目标,模型有时会策划或尝试破坏。验证了我们设置的真实性,模型表现出较低的评估意识,通常是由于主动性提示而非环境所致。

英文摘要

We introduce scheming honeypot evaluations, a framework for testing whether models will pursue instrumental goals if given the opportunity. Our scheming honeypot evaluations take the form of coding tasks in Google's alignment research codebases. In a real internal deployment setting, Gemini models do not demonstrate unprompted scheming. If prompts explicitly encourage agency (situational awareness or goal-directedness) and/or give the model a hidden goal, models sometimes scheme or attempt sabotage. Validating the realism of our setting, models show low rates of evaluation awareness, usually due to agency prompts rather than the environments.

2605.29727 2026-05-29 cs.LG

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

Bastion: 预算感知的树结构块扩散草稿投机解码

Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung, Huzama Ahmad, Sangmin Bae, Se-Young Yun

AI总结 提出BASTION框架,通过动态构建查询相关的树结构平衡草稿质量与硬件约束,实现预算感知的投机解码,无需训练且保持目标模型分布,速度提升达6.61倍。

详情
AI中文摘要

块扩散草稿者最近作为投机解码的强大替代方案出现,通过在单个并行步骤中预测多个未来令牌分布。然而,由于这些并行预测是从位置边缘分布而非完全条件序列中采样,承诺单一贪婪路径往往无法捕捉目标模型的偏好轨迹。为解决此问题,我们提出BASTION,一种基于树的扩散草稿的预算感知投机解码框架。与依赖静态树拓扑的现有方法不同,BASTION通过平衡草稿质量与硬件约束动态构建查询相关的树。我们的框架整合了三个协同组件:(1) 接受代理,通过路径置信度估计期望接受长度;(2) 在线延迟估计器,校准硬件感知的屋顶线模型;(3) 自适应最佳优先扩展,在边际增益不再证明增量验证成本合理时停止树生长。BASTION无需训练,保持目标模型分布,且无需逐设置调优。在多种基准和GPU架构上,BASTION相比标准自回归解码实现高达6.61倍加速,优于最先进的块扩散基线39%。

英文摘要

Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled from position-wise marginals rather than fully conditioned sequences, committing to a single greedy path often fails to capture the target model's preferred trajectory. To address this, we propose BASTION, a budget-aware speculative decoding framework with tree-based diffusion drafting. Unlike existing methods that rely on static tree topologies, BASTION dynamically constructs query-dependent trees by balancing draft quality against hardware constraints. Our framework integrates three synergistic components: (1) an acceptance surrogate that estimates expected accepted length via path confidence, (2) an online latency estimator that calibrates a hardware-aware roofline model, and (3) an adaptive best-first expansion that grows the tree until marginal gains no longer justify incremental verification costs. BASTION is training-free, preserves the target model's distribution, and requires no per-setting tuning. Across diverse benchmarks and GPU architectures, BASTION achieves up to a 6.61x speedup over standard autoregressive decoding, outperforming state-of-the-art block-diffusion baselines by 39%.

2605.29726 2026-05-29 cs.CV

SLAD : Shared LoRA Adapters for Task Specific Distillation

SLAD:用于任务特定蒸馏的共享LoRA适配器

Reda Bensaid, Yassir Bendou, Vincent Gripon, François Leduc-Primeau

AI总结 提出SLAD方法,通过共享低秩适配器参数对齐教师和学生模型的特征表示,实现高效的知识蒸馏,在多个分类和分割数据集上达到最先进性能。

详情
Comments
CVPR Findings 2026
AI中文摘要

在资源受限环境(如嵌入式系统)中,将缩小版基础模型适配到下游任务变得越来越流行。这最近激发了任务特定蒸馏的新场景,其中同一基础模型的较大和较小版本都适配到同一下游任务,目标是将知识从前者转移到后者。最近的工作展示了使用同一基础模型的较大版本协助较小版本适配的好处。通常,较大模型(教师)首先通过微调或线性探测进行适配,然后将其知识蒸馏到较小模型(学生)。虽然微调教师通常能提升其性能,但最近的工作表明,对教师进行探测能更好地向学生蒸馏知识。我们的发现表明,这主要是由于教师微调过程中教师和学生之间特征表示的对齐偏差。受现有保留先前学习知识的努力启发,我们首先提出利用低秩适配,从而带来更好的特征对齐,进而实现更好的知识转移。基于这一洞察,我们进一步通过联合训练期间两个编码器之间适配器的参数共享策略来增强特征对齐。我们提出的方法SLAD在教师和学生之间展现出更好的特征对齐,不仅提升了学生模型的性能,也提升了教师模型的性能,同时训练速度比微调快2倍。通过在多个分类和分割数据集上的大量实验,我们展示了该方法在准确性和迁移效率上的提升,在任务特定蒸馏框架中达到了最先进性能。

英文摘要

In the context of resource-constrained environments such as embedded systems, adapting reduced-size foundation models to downstream tasks has become increasingly popular. This has recently motivated the emerging setting of task-specific distillation, where a larger and a smaller version of the same foundation model are both adapted to the same downstream task, with the goal of transferring knowledge from the former to the latter. Recent work has demonstrated the benefits of using a larger version of the same foundation model to assist the adaptation of a smaller one. Typically, the larger model (teacher) is first adapted via fine-tuning or linear probing before its knowledge is distilled into the smaller model (student). While fine-tuning the teacher often increases its performance, recent work showed that probing it leads to better knowledge distillation to the student. Our findings show that this is mainly due to a mis-alignment in feature representation between the teacher and the student which occurs during the teacher's fine-tuning. Inspired by existing efforts to preserve previously learned knowledge, we first propose leveraging low-rank adaptation, resulting in better feature alignment and therefore better knowledge transfer. Drawing from this insight, we further enhance the feature alignment through a parameter-sharing strategy of the adapters between the two encoders during joint training. Our proposed method, SLAD, shows better feature alignment between the teacher and student, which results in increased performance for not only the student but also the teacher model, while being 2x faster to train than fine-tuning. Through extensive experiments on multiple classification and segmentation datasets, we demonstrate the improved accuracy and transfer efficiency of our method, achieving state-of-the-art performance in the task-specific distillation framework.

2605.29720 2026-05-29 cs.CV cs.LG

Efficient, Validation-Free Intrinsic Quality Estimation for Large-Scale Face Recognition Datasets

面向大规模人脸识别数据集的高效、免验证的内在质量评估

Zhichao Chen, Yongle Zhao, Kaicheng Yang, Meng Yang, Yin Xie, Ziyong Feng

AI总结 提出一种无需训练的内在质量(IQ)指标,通过邻域一致性得分和全局表示子空间复杂度来估计人脸识别数据集生成高性能模型的潜力,实现快速数据集诊断与筛选。

详情
Comments
ICML 2026
AI中文摘要

我们提出内在质量(IQ),一种无需验证的度量,旨在估计人脸识别(FR)数据集产生高性能模型的固有潜力,而无需进行全规模训练。IQ 包含两个组成部分:(i)邻域一致性得分,通过最近邻量化局部身份标签一致性;(ii)全局表示子空间复杂度(有效秩,ER),捕捉底层嵌入几何和数据集多样性。IQ 允许使用轻量级代理模型或数据子集进行快速评估,便于在资源密集型的全规模训练之前进行数据集诊断和筛选。我们描述了一个针对干净、噪声和混合质量 FR 数据集定制的实验协议,并概述了验证 IQ 对下游性能预测能力的评估方法。

英文摘要

We propose Intrinsic Quality (IQ), a validation-free metric designed to estimate the inherent potential of face recognition (FR) datasets to produce high-performance models without the need for full-scale training. IQ integrates two components: (i) a Neighbor-Consistency Score that quantifies local identity label agreement via nearest neighbors, and (ii) Global Representation Subspace Complexity (Effective Rank, ER), which captures the underlying embedding geometry and dataset diversity. IQ allows for rapid evaluation using lightweight proxy models or data subsets, facilitating dataset diagnosis and curation prior to resource-intensive full-scale training. We describe an experimental protocol tailored to clean, noisy, and mixed-quality FR datasets, and outline evaluation methodologies to validate IQ's predictive power for downstream performance.

2605.29716 2026-05-29 cs.AI

NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs

NaRA: 面向扩散大语言模型参数高效微调的噪声感知LoRA

Shuaidi Wang, Zhan Zhuang, Ruping Huang, Yu Zhang

AI总结 针对扩散大语言模型,提出噪声感知低秩适配(NaRA),通过噪声条件超网络生成低秩核心矩阵,实现沿去噪轨迹连续变化的更新矩阵,在常识推理、数学推理和代码生成基准上优于噪声无关基线。

详情
AI中文摘要

扩散大语言模型(dLLMs)已成为一种有前途的非自回归生成范式。鉴于全微调的计算成本过高,参数高效微调(PEFT)已成为标准方法。然而,现有的PEFT方法(如LoRA)最初是为自回归模型设计的,依赖于静态参数,对噪声水平不敏感。因此,它们忽略了扩散过程的内在动态性,其中输入分布和生成难度沿去噪轨迹显著变化,使得它们对dLLMs而言是次优的。为了解决这个问题,我们提出了噪声感知低秩适配(NaRA),它引入了一个由轻量级、全局共享的超网络根据噪声水平生成的低秩核心矩阵。这种设计使得更新矩阵能够沿扩散过程连续变化,同时保持参数和延迟开销可忽略不计。我们为所提出的NaRA框架提供了理论依据,并在常识推理、数学推理和代码生成基准上实证证明了其相对于噪声无关基线的持续改进。我们的代码可在https://github.com/generaldi/NaRA获取。

英文摘要

Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive generative paradigm. Given the prohibitive computational cost of full fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) has become the standard approach. However, existing PEFT methods (e.g., LoRA), originally tailored for autoregressive models, rely on static parameters that are agnostic to the noise level. Consequently, they ignore the intrinsic dynamics of the diffusion process, where input distributions and generation difficulty shift significantly along the denoising trajectory, rendering them suboptimal for dLLMs. To address this, we propose Noise-aware Low-Rank Adaptation (NaRA), which introduces a low-rank core matrix generated by a lightweight, globally shared hypernetwork conditioned on the noise level. This design enables the update matrices to vary continuously along the diffusion process while keeping parameter and latency overhead negligible. We provide a theoretical justification for the proposed NaRA framework and empirically demonstrate consistent improvements over noise-agnostic baselines across commonsense reasoning, mathematical reasoning, and code generation benchmarks. Our code is available at https://github.com/generaldi/NaRA.

2605.29715 2026-05-29 cs.CL

User-Aware Active Knowledge Acquisition for Emotional Support Dialogue

面向情感支持对话的用户感知主动知识获取

Mufan Xu, Kehai Chen, Jiahao Hu, Xinchao Xu, Muyun Yang, Tiejun Zhao, Min Zhang

AI总结 提出用户感知主动知识获取(UKA)框架,通过理论心智不确定性估计和主动学习,在情感支持对话中高效获取用户对齐的对话知识,提升对话质量和用户对齐。

详情
AI中文摘要

情感支持在对话系统中扮演重要角色,其成功取决于在多轮交互中适应用户不断变化且隐含的需求,同时利用大语言模型的强大推理能力。然而,由于用户需求的信号通常微弱、间接,且只能通过多轮交互来消除歧义,现有的情感支持方法往往难以高效获取和泛化相关的对话知识。为弥补这一差距,我们引入了用户感知主动知识获取(UKA),这是一种无梯度的主动对话学习框架,明确表示用户需求的不确定性,并将主动学习融入知识获取和响应选择中。我们提出了一种理论心智不确定性估计机制,使模型能够优先选择响应,从而引发更多信息性的用户反馈。UKA能够在训练期间高效探索用户对齐的对话知识,同时在测试时保持鲁棒性。在多个对话基准和模型架构上的实验表明,我们的方法在对话质量和用户对齐方面始终优于强基线。

英文摘要

Emotional support plays an important role in dialogue systems, and its success depends on adapting to a user's evolving and implicit needs across multi-turn interactions while leveraging the strong reasoning capacity of large language models. However, since signals about user needs are often weak, indirect, and can only be disambiguated through multi-turn interaction, existing emotional support methods often struggle to acquire and generalize relevant conversational knowledge efficiently. To bridge this gap, we introduce User-Aware Active Knowledge Acquisition (UKA), a gradient-free active dialogue learning framework that explicitly represents uncertainty about user needs and incorporates active learning into both knowledge acquisition and response selection.We propose a Theory-of-Mind uncertainty estimation mechanism that allows the model to prioritize responses, thereby eliciting more informative user feedback. UKA is capable of efficiently exploring user-aligned conversational knowledge during training while maintaining robustness at test time. Experiments across multiple dialogue benchmarks and model architectures demonstrate that our approach consistently outperforms strong baselines in dialogue quality and user alignment.

2605.29714 2026-05-29 cs.CL

Leveraging Routing Dynamics in Mixture-of-Experts Models for Efficient Language Adaptation

利用混合专家模型中的路由动态实现高效语言适配

Aditi Khandelwal, Marius Mosbach, Verna Dankers, Siva Reddy, Golnoosh Farnadi

AI总结 研究英语中心混合专家模型在多语言持续预训练中的路由动态,发现早期和中间层路由分散且语言无关,最终层出现语言专化,并提出仅更新最终层语言特定和共享专家的参数高效适配策略。

详情
AI中文摘要

混合专家(MoE)模型被广泛用于扩展语言模型,但其专家路由行为和多语言环境下的适配仍未被充分探索。在这项工作中,我们研究了在英语中心的MoE模型上使用多语言语料库进行持续预训练时的多语言路由动态,分析了专家使用如何随语言变化。我们发现,持续的多语言预训练导致早期和中间层出现分散的、与语言无关的路由,而语言专化主要出现在最终层。我们还表明,语言之间的token级词汇重叠在路由方式中起着重要作用。受这些发现启发,我们提出了一种参数高效的适配策略,仅更新最终MoE层中的语言特定和共享专家。在MultiBLiMP和Belebele上的实验表明,我们的方法实现了强大的性能-效率权衡,在更新不到2%参数的情况下,达到了与微调整个最终层相竞争的性能。总体而言,我们的发现揭示了在持续预训练期间MoE中语言专化出现的位置和方式,并为低资源多语言适配提供了实用见解。我们的代码可在https://github.com/aditi184/moe-routing-adaptation获取。

英文摘要

Mixture-of-Experts (MoE) models are widely used to scale language models, yet their expert routing behavior and adaptation in a multilingual setting remain underexplored. In this work, we study multilingual routing dynamics during continual pre-training of an English-centric MoE model on a multilingual corpus, analyzing how expert usage varies across languages. We find that continual multilingual pre-training leads to diffused, language-agnostic routing in early and middle layers, with language specialization primarily emerging in the final layers. We also show that token-level vocabulary overlap between languages plays an important role in how languages are routed. Motivated by these findings, we propose a parameter-efficient adaptation strategy that updates language-specific and shared experts in the final MoE layers. Experiments on MultiBLiMP and Belebele show that our method achieves a strong performance-efficiency trade-off, attaining competitive performance relative to fine-tuning complete final layers, while updating less than 2% of the parameters. Overall, our findings provide insights into where and how language specialization emerges in MoEs during continual pre-training and provide practical insights for low-resource multilingual adaptation. Our code is available at https://github.com/aditi184/moe-routing-adaptation.

2605.29713 2026-05-29 cs.LG cs.AI

The Little Book of Generative AI Foundations: An Intuitive Mathematical Primer

生成式AI基础小书:直观数学入门

Tianhua Chen

AI总结 本书通过推导导向的方式,从PCA到能量模型,系统介绍现代生成式人工智能的数学基础,旨在使生成建模结构更易理解。

详情
Comments
Preprint version, 178 pages. Comments and corrections are welcome
AI中文摘要

本书提供了对现代生成式人工智能数学基础的紧凑、推导导向的介绍。它不是调查每一个最近的架构或实现细节,而是通过连接主要生成模型家族的思想发展出一条连贯的路线,从PCA、概率PCA、变分自编码器和扩散模型到归一化流、自回归分解、GANs、Wasserstein GANs和基于能量的模型。目的是使生成建模的结构更易理解,同时不失去理解这些模型如何推导和关联所需的数学实质。本书旨在为具有数学好奇心的研究人员、从业者和学生提供基础构建的入门读物。

英文摘要

This book provides a compact, derivation-oriented introduction to the mathematical foundations of modern generative artificial intelligence. Rather than surveying every recent architecture or implementation detail, it develops a coherent route through the ideas connecting major families of generative models, from PCA, probabilistic PCA, variational autoencoders, and diffusion models to normalising flows, autoregressive factorisations, GANs, Wasserstein GANs, and energy-based models. The aim is to make the structure of generative modelling more accessible without removing the mathematical substance needed to understand how these models are derived and related. The book is intended as a foundation-building primer for mathematically curious researchers, practitioners, and students.

2605.29712 2026-05-29 cs.CL cs.AI

Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies

教会语言模型使用人类应试策略检查基于事实的声明真实性

Yuxuan Ye, Raul Santos-Rodriguez, Edwin Simpson

AI总结 将基于事实的声明真实性检查建模为真假阅读理解任务,通过提示语言模型使用明确的应试策略进行高效推理,并训练小语言模型以降低推理成本。

详情
Comments
ACL 2026 Main
AI中文摘要

基于事实的声明真实性检查对于大型语言模型(LLM)应用(如检索增强生成)非常重要,因为它帮助用户评估生成输出的正确性。现有的使用蕴含分类器的指标需要针对数据集调整阈值,而基于LLM的方法通常使用直接提示,这未能充分利用LLM的推理能力。我们通过将基于事实的声明真实性检查建模为真假阅读理解任务,并提示LLM使用明确的应试策略进行高效推理来解决这一问题。与无引导的开放式推理相比,我们的方法减少了超过80%的令牌使用量,并在两个真实性基准测试中取得了与更昂贵替代方案竞争的性能,在一个基准上达到了新的最先进水平。为了进一步降低推理成本,我们训练小语言模型(SLM)来替代检查流程中的LLM。通过监督微调(SFT)和自我修正机制,SLM学会了改进其真实性判断。实验结果表明,生成的SLM在性能上与强基线相当,结合了低推理成本和生成支持理由以支持可解释性。代码和数据集将在接收后发布。

英文摘要

Grounded claim factuality checking is important for large language model (LLM) applications such as retrieval-augmented generation, as it helps users assess the correctness of generated outputs. Existing metrics using entailment classifiers require dataset-specific threshold tuning, while LLM-based approaches often use direct prompting, which underutilises the reasoning capabilities of LLMs. We address this by formulating grounded claim factuality checking as a true/false reading comprehension task and prompting LLMs with explicit test-taking strategies for efficient reasoning. Our method reduces token usage by over 80% compared to unguided open-ended reasoning, and achieves competitive performance to more expensive alternatives across two factuality benchmarks, setting a new state of the art on one. To further reduce inference cost, we train small language models (SLMs) to replace LLMs in the checking pipeline. Using supervised fine-tuning (SFT) and a self-revision mechanism, the SLMs learn to improve their factuality judgements. Experimental results show that the resulting SLMs perform on par with strong baselines, combining low inference costs with generating supporting rationales to support interpretability. Code and datasets will be released upon acceptance.

2605.29711 2026-05-29 cs.CL cs.AI

Personalized Turn-Level User Conversation Satisfaction Benchmark

个性化轮级用户对话满意度基准

Zhefan Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang, Quanjia Yan, Hengliang Luo

AI总结 针对AI助手响应的个性化满意度评估问题,提出结合用户记忆与目标轮上下文的满意度评估器,并构建PersTurnBench基准,通过回放实现生成模型的受控比较。

详情
AI中文摘要

用户对AI助手的满意度高度个性化:同一响应可能满足一个用户但令另一个失望,取决于每个用户的期望以及他们之前询问的内容。现有的自动评估方法大多衡量通用响应质量,难以判断某个响应在特定轮次是否满足用户。我们将此问题作为个性化轮级用户对话满意度评估进行研究。我们构建了一个对话满意度评估器,将紧凑的用户记忆与目标轮上下文相结合,生成满意度分数和不满意的理由。与人类满意度标注的元评估表明,个性化记忆和事后分数校准在有序一致性和不满意轮次检测上优于监督式、检索式和通用LLM作为评判者的基线。我们进一步引入了PersTurnBench,这是一个个性化轮级用户对话满意度基准,通过回放使用经过验证的评估器来评估生成模型。通过固定回放状态,PersTurnBench能够在无需为每个候选模型收集新人工标签的情况下,对通用生成模型和记忆增强的个性化系统进行受控比较。该评估器和基准让研究人员能够在无需为每个模型收集新用户反馈的情况下,比较候选生成模型在个性化满意度上的表现。

英文摘要

User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation methods mostly measure generic response quality, making it difficult to judge whether a response satisfies a user at a specific turn. We study this problem as personalized turn-level user conversation satisfaction evaluation. We build a conversation satisfaction evaluator that combines compact user memories with target-turn context to produce satisfaction scores and dissatisfaction-oriented rationales. Meta-evaluation against human satisfaction annotations shows that personalized memory and post-hoc score calibration improve ordinal agreement and dissatisfied-turn detection over supervised, retrieval-based, and generic LLM-as-a-judge baselines. We further introduce PersTurnBench, a personalized turn-level user conversation satisfaction benchmark that uses the verified evaluator to assess generation models via replay. By holding the replay state fixed, PersTurnBench enables controlled comparison of generic generation models and memory-augmented personalized systems without new human labels for every candidate model. The evaluator and benchmark let researchers compare candidate generation models on personalized satisfaction without collecting new user feedback for every model.

2605.29710 2026-05-29 cs.RO

PhAIL: A Real-Robot VLA Benchmark and Distributional Methodology

PhAIL:一个真实机器人VLA基准测试与分布性方法论

Sergey Arkhangelskiy

AI总结 针对现有VLA策略评估中样本量小、统计比较不可靠的问题,提出PhAIL基准测试,采用时间-成功累积分布函数作为评估基元,通过人类相对吞吐量评分和Kolmogorov-Smirnov显著性检验,在少量rollout下实现更可靠的模型比较。

详情
Comments
22 pages, 10 figures, 8 tables. Dataset, analysis pipeline, and paper source: https://phail.ai and https://github.com/Positronic-Robotics/phail-paper
AI中文摘要

视觉-语言-动作(VLA)策略的真实世界评估仍然依赖于固定超时下的二元成功率,每个条件最多进行$N \le 25$次rollout,几乎总是没有置信区间或配对统计比较;这些队列规模难以可靠地解决接近的比较。我们引入了PhAIL(物理AI排行榜,https://phail.ai),这是一个基于Franka FR3的开放真实机器人基准测试(包括数据集、每次rollout的工件和端到端参考实现),采用分布性评估方法论:以时间-成功累积分布函数(CDF)作为评估基元,分为两个独立任务。第一个是通过人类相对吞吐量(HRT)进行评分,这是一个具有bootstrap置信区间的无量纲标量,锚定于同一设备的远程操作。第二个是显著性检验(Kolmogorov-Smirnov,按对象计算并在对象间进行宏观平均)。在四个公开可用的VLA上,宏观平均KS检验在每(模型,对象)单元$N \le 30$次rollout下解决了两个接近的比较(GR00T vs. ACT,OpenPI vs. ACT),而二元阈值指标无法做到;最接近的一对(OpenPI vs. GR00T)在我们的预算内仍未解决。评估中最佳的VLA每次操作比人类参考慢约$7\times$(RMST比率)。

英文摘要

Real-world evaluation of vision-language-action (VLA) policies still rests on binary success rate at a fixed timeout with $N \le 25$ rollouts per condition, almost always without confidence intervals or paired statistical comparison; these cohort sizes struggle to resolve close comparisons reliably. We introduce PhAIL (Physical AI Leaderboard, https://phail.ai), an open real-robot benchmark on a Franka FR3 (dataset, per-rollout artifacts, and end-to-end reference implementation) of a distributional evaluation methodology: the time-to-success cumulative distribution function (CDF) as the evaluation primitive, with two separated jobs. The first is scoring via Human-Relative Throughput (HRT), a dimensionless scalar with bootstrap confidence intervals, anchored to same-fixture human teleoperation. The second is a significance test (Kolmogorov-Smirnov, computed per-object and macro-averaged across objects). On four publicly-available VLAs, the macro-averaged KS test resolves two close comparisons (GR00T vs. ACT, OpenPI vs. ACT) at $N \le 30$ rollouts per (model, object) cell where binary-threshold metrics do not; the closest pair (OpenPI vs. GR00T) remains unresolved within our budget. The best evaluated VLA is $\sim 7\times$ slower per operation (RMST ratio) than the human reference.

2605.29708 2026-05-29 cs.CL

Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs

理解混合专家大语言模型中的安全敏感专家行为

Zhibo Zhang, Yuxi Li, Zhen Ouyang, Ling Shi, Kailong Wang

AI总结 通过提出RASET框架,研究混合专家大语言模型中安全对齐与路由专家专业化之间的关系,发现路由模式主要由主题驱动,而安全行为可通过调整少数专家改变而不影响路由路径。

详情
Comments
11 pages, 4 figures
AI中文摘要

混合专家(MoE)大语言模型依赖于稀疏的、由路由器驱动的专家激活,然而安全对齐如何与路由专家专业化相互作用仍未被充分探索。一种常见的直觉是,安全行为可能通过将有害请求路由到不同的拒绝导向专家来控制。在这项工作中,我们为不同的情况提供了经验证据:对齐的MoE大语言模型中的路由模式主要是主题驱动的,而安全行为可以在不改变模型固有路由路径的情况下被改变。基于这一观察,我们提出了**RASET**(**R**outer-**A**gnostic **S**afety-critical **E**xpert **T**uning,路由器无关的安全关键专家微调),这是一个红队框架,用于探测集中在少数专家中的安全执行,同时保持模型固有的路由行为。**RASET**通过对比路由敏感性标准识别安全关键专家,并仅对选定的专家应用参数高效微调,从而相对于路由器干预最小化语义干扰。这些结果揭示了独特的MoE安全风险,强调了需要专家感知的对齐机制。

英文摘要

Mixture-of-Experts (MoE) LLMs rely on sparse, router-driven expert activation, yet how safety alignment interacts with routed expert specialization remains underexplored. A common intuition is that safety behavior may be controlled by routing harmful requests to distinct refusal-oriented experts. In this work, we provide empirical evidence for a different picture: routing patterns in aligned MoE LLMs are largely topic-driven, while safety behavior can be altered with little change to the model's intrinsic routing path. Motivated by this observation, we present **RASET** (**R**outer-**A**gnostic **S**afety-critical **E**xpert **T**uning), a red-teaming framework that probes safety enforcement that is localized in a small subset of experts while preserving the model's intrinsic routing behavior. **RASET** identifies safety-critical experts via a contrastive routing-sensitivity criterion and applies parameter-efficient tuning only to the selected experts, minimizing semantic disruption relative to router-steering interventions. These results reveal a distinct MoE safety risk, highlighting the need for expert-aware alignment mechanisms.

2605.29707 2026-05-29 cs.CL

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Domino: 在推测解码中将因果建模与自回归草稿解耦

Jianuo Huang, Yaojie Zhang, Qituan Zhang, Hao Lin, Hanlin Xu, Linfeng Zhang

AI总结 提出Domino框架,通过并行草稿骨干和轻量级Domino头解耦因果依赖建模与自回归草稿执行,结合基础锚定训练课程,在Qwen3模型上实现高达5.49倍端到端加速和5.8倍吞吐量加速。

详情
AI中文摘要

推测解码通过草拟多个令牌并与目标模型并行验证来加速LLM推理。然而,其实际加速受限于草稿质量与草稿成本之间的权衡:自回归草稿器建模草稿令牌间的因果依赖但引入顺序开销,而并行草稿器降低草稿成本但削弱块内依赖建模。本文提出Domino,一种将因果依赖建模与昂贵的自回归草稿执行解耦的推测解码框架。Domino首先使用并行草稿骨干为整个块生成初步草稿分布,然后应用轻量级Domino头以前缀依赖的因果信息对其进行细化。为稳定教师强制因果编码,我们进一步引入基础锚定训练课程,首先强化并行骨干,然后逐步将优化转向因果修正的最终分布。在Qwen3模型上的实验表明,Domino在Transformers后端下实现高达5.49倍的端到端加速,在SGLang服务下实现高达5.8倍的吞吐量加速。

英文摘要

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.

2605.29705 2026-05-29 cs.AI

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

BitTP:面向边缘设备的轻量级轨迹预测模型与BitLLM

Mincheol Kang, Hyunjin Lim, Bomin Kang, Daehee Park

AI总结 提出BitTP,通过将LLM轨迹预测器转换为1.58比特轻量架构,在保持或提升预测质量的同时大幅降低内存和计算需求,实现边缘设备部署。

详情
Comments
Camera-ready version. Accepted as a findings paper at CVPR 2026. 8 pages, 4 figures
AI中文摘要

轨迹预测是自主系统的一项基本任务,需要对多智能体交互和意图进行复杂推理。大型语言模型(LLM)最近被用于此任务,因为它们提供了强大的上下文推理和可解释的、基于语言的轨迹表示。然而,这些基于LLM的预测器极其消耗内存和计算资源,难以部署在资源受限的边缘设备上,例如自主机器人的车载计算机。为弥合这一差距,我们提出BitTP,它将基于LLM的轨迹预测器转换为轻量级比特线性架构。我们证明,仅权重量化到1.58比特(BitTP-Weight)是最优的。关键在于,激活值必须保持全精度,因为量化它们会导致时空推理的严重退化和不稳定性。实验表明,BitTP-Weight不仅保持了全精度(BF16)LLM基线的预测质量,还提升了质量,平均ADE降低14.29%,FDE降低20.97%,同时相比其他量化方法减少了内存使用和推理延迟。这些结果表明,精心设计的量化可作为有效的正则化器,使得基于LLM的复杂推理能够在边缘设备上实际部署。代码地址:https://github.com/MintCat98/BitTP。

英文摘要

Trajectory prediction is a fundamental task for autonomous systems, requiring complex reasoning about multi-agent interactions and intents. Large language models (LLMs) have recently been adopted for this task, as they provide strong contextual reasoning and interpretable, language-based trajectory representations. However, these LLM-based predictors are extremely memory- and compute-intensive, making them difficult to deploy on resource-constrained edge devices such as on-board computers in autonomous robots. To bridge this gap, we propose BitTP, which converts an LLM-based trajectory predictor into a lightweight bitlinear architecture. We demonstrate that weight-only quantization to 1.58-bit (BitTP-Weight) is optimal. Crucially, activations must remain in full precision, as quantizing them leads to severe degradation and instability in spatio-temporal reasoning. Empirically, BitTP-Weight not only preserves but improves prediction quality over the full-precision (BF16) LLM baseline, reducing ADE by 14.29% and FDE by 20.97% on average, while simultaneously reducing memory usage and inference latency relative to other quantization methods. These results demonstrate that carefully designed quantization acts as an effective regularizer, enabling the practical deployment of sophisticated LLM-based reasoning on edge devices. Code is available at: https://github.com/MintCat98/BitTP.

2605.29704 2026-05-29 cs.RO

FLIP: Real-Time and Resilient Formation Planning for Large-Scale DIstributed Swarms via Point Cloud Registration

FLIP:通过点云配准实现大规模分布式集群的实时弹性编队规划

Yuan Zhou, Guangtong Xu, Zhenyu Hou, Jialiang Hou, Fei Gao

AI总结 提出将最优编队位置序列计算转化为时空点云配准问题,利用带离群点剔除的PCR方法实现大规模分布式集群的弹性、高效轨迹规划。

详情
AI中文摘要

传统的大规模编队规划要么过度简化编队表示导致性能不佳,要么采用完全协作关系导致计算负载过大。为了实现高性能和大规模编队规划,我们将最优编队位置序列(OFPS)计算问题转化为时空点云配准(PCR)问题。每个智能体通过分布式计算自身当前位置与所有其他智能体期望编队位置之间的匹配结果来获得OFPS。然后每个智能体利用OFPS优化协作编队轨迹。我们利用带离群点剔除的PCR方法快速执行大规模编队位置配准。这可以防止次优轨迹和故障智能体通过协作网络传播并影响更多智能体。因此,我们统一实现了大规模集群的弹性、高效和分布式轨迹规划。通过120架无人机编队的大规模仿真以及与最先进(SOTA)方法的严格基准测试,证明了所提方法的有效性和优越性。

英文摘要

Traditional large-scale formation planning either oversimplify the formation representation which leads to poor performance, or they employ complete collaborative relationships, which results in excessive computational load. To achieve high-performance and large-scale formation planning, we transform the Optimal Formation Position Sequence \cite{c1} (OFPS) calculation problem into a spatiotemporal Point Cloud Registration (PCR) problem. Each agent derives its OFPS by distributively computing the matching result between current positions and the desired formation positions of all other agents. Then each agent optimizes the cooperative formation trajectory by using OFPS. We leverage the PCR method with outlier rejection to rapidly perform large-scale formation position registration. This prevents suboptimal trajectories and failed agents from propagating through the cooperative network and affecting more agents. Consequently, we uniformly achieve resilient, efficient, and distributed trajectory planning for large-scale swarms. The effectiveness and the superiority of the proposed method are demonstrated through large-scale simulations of 120-drone formation, and rigorous benchmarking against state-of-the-art (SOTA) methods.

2605.29703 2026-05-29 q-bio.NC cs.CV q-bio.TO

Subcortical Shape Variations and Their Associations with Cognition Across the 8th Decade of Life. A Study in the Lothian Birth Cohort 1936

皮层下形状变化及其与第八个十年生命期认知的关联:洛锡安出生队列1936研究

Maria del C. Valdes-Hernandez, Wonjung Park, Joanna Moodie, Susana Muñoz Maniega, Janie Corley, Fraser N. Sneden, Mark E. Bastin, Joanna M. Wardlaw, Simon R. Cox, Jinah Park

AI总结 利用洛锡安出生队列1936的纵向数据,通过ANCOVA和混合线性模型分析,研究第八个十年中皮层下结构的形状变化及其与认知老化的关联。

详情
Comments
34 pages
AI中文摘要

对正常个体脑形态变化的研究可能捕捉到与功能相关的脑老化方面,而这些方面不一定完全由总体积测量所指示。尽管皮层下脑结构在认知中起重要作用,但其形态轨迹与认知老化之间的关联尚未被记录。我们利用来自一项大型认知老化纵向研究——洛锡安出生队列1936——的神经影像、人口统计学和认知数据,探索社区居住个体在第八个十年生命期中皮层下脑结构的形状变化。我们使用ANCOVA和混合线性模型分析研究这些变化与认知老化的关联。皮层下形状变化是异质性的,在整个时期呈现不同的萎缩模式。海马体和腹侧DC经历了不同的形态变形(相对于其基线点),左右半球不同,而丘脑和苍白球形状则经历了更均匀的体积收缩,几乎在不同时间线上对称。一般认知的变化主要与时间点之间的向内和向外顶点位移相关。

英文摘要

The study of brain morphology changes in normal individuals may capture aspects of functionally-relevant brain aging not fully indicated by gross volumetry. Despite the important role of subcortical brain structures in cognition, the associations between their morphological trajectories and cognitive changes in aging have not been documented. We use neuroimaging, demographic, and cognitive data from a large longitudinal study of cognitive aging, the Lothian Birth Cohort 1936, to explore shape changes in subcortical brain structures of community-dwelling individuals across their 8th decade of life. We investigate the association of these changes with cognitive aging using ANCOVA and mixed linear model analyses. Subcortical shape changes were heterogeneous, with varied atrophy patterns across whole period. The hippocampus and the ventral DC experienced varied morphological deformations (from its baseline point) different in left and right hemispheres, while the thalami and globus pallidi shapes, for example, experienced a more uniform volume contraction, nearly symmetrical throughout different timelines. Changes in general cognition were mainly associated with inwards and outwards vertex displacements between the time-points.

2605.29698 2026-05-29 cs.LG physics.chem-ph

A Systematic Evaluation of Molecular Mixture Behavior Prediction

分子混合物行为预测的系统评估

Roel J. Leenhouts, Nathan K. Morgan, William Green, Jan G. Rittig, Florence H. Vermeire

AI总结 提出一个将混合物性质误差分解为纯组分和相互作用成分的评估框架,并基于七个匹配数据集发现绝对精度可能掩盖非理想混合行为的恢复能力。

详情
AI中文摘要

分子性质预测的机器学习主要集中在纯化合物上,尽管许多实际应用依赖于具有分子间相互作用的混合物。最近的工作扩大了混合物数据集的可用性,但评估仍然主要关注绝对精度。然而,混合物中的绝对误差将纯组分贡献与理想混合的偏差混为一谈。我们提出了一个评估框架,将混合物性质误差分解为纯化合物和相互作用(非理想)成分。该框架结合了泄漏感知分割协议、理想混合物基线和过量性质指标。为了支持可重复的基准测试,我们整理了七个匹配的纯和混合物物理化学性质数据集。在多个混合物性质任务和模型家族中,我们发现强绝对精度可能掩盖对非理想混合物行为的恢复能力,并且在严格分子分割下性能显著下降。这些结果将向未见分子的迁移识别为分子混合物机器学习中的核心挑战,并推动超越绝对精度的评估。

英文摘要

Machine learning for molecular property prediction has focused largely on pure compounds, even though many practical applications depend on mixtures with intermolecular interactions. Recent work has expanded the availability of mixture datasets, but evaluation still focuses mainly on absolute accuracy. However, absolute errors in mixtures conflate pure-component contributions with deviations from ideal mixing. We propose an evaluation framework that decomposes mixture-property error into pure-compound and interaction (non-ideal) components. The framework combines leakage-aware split protocols, ideal-mixture baselines, and excess-property metrics. To support reproducible benchmarking, we curate seven matched pure and mixture physicochemical property datasets. Across multiple mixture-property tasks and model families, we find that strong absolute accuracy can mask poor recovery of non-ideal mixture behavior, and that performance drops substantially under strict molecule splits. These results identify transfer to unseen molecules as a central challenge in molecular mixture machine learning and motivate evaluation beyond absolute accuracy alone.

2605.29697 2026-05-29 cs.AI

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

超越轨迹奖励:通过图建模实现智能搜索的步骤级信用分配

Yuchen Liu, Yingjie Feng, Lixiong Qin, Jiasi Chen, Jianing Yu, Sheng Gao, Sheng Yang, Weiran Xu

AI总结 针对智能搜索中轨迹级奖励无法量化单步行为贡献的问题,提出基于图距离贡献奖励(GDCR)的步骤级过程奖励,并结合步骤优势策略优化(SAPO)在四个基准上验证有效性。

详情
Comments
15 pages, 8 figures
AI中文摘要

在智能搜索中,轨迹级结果奖励无法量化单个步骤的行为贡献,而现有的步骤级奖励方法通常依赖于代价高昂的树采样。我们将世界知识视为潜在的世界图,并将每个信息搜索任务视为在潜在任务图中的搜索,其中有效步骤应朝着答案节点进行图进展。基于这一先验,我们提出图距离贡献奖励(GDCR),这是一种步骤级过程奖励,通过训练时实体-关系(ER)图中实体到答案节点的距离对新检索和引用的实体进行评分。我们进一步提出步骤优势策略优化(SAPO),它将GDCR转换为步骤级优势,并与轨迹级结果优势相结合。在四个具有挑战性的基准上的实验验证了我们方法的有效性。

英文摘要

In Agentic Search, trajectory-level outcome rewards fail to quantify the behavioral contributions of individual steps, while existing step-level reward methods typically rely on costly tree sampling. We view world knowledge as a latent world graph and each IS task as search within a latent task graph, where effective steps should make graph progress toward the answer node. Based on this prior, we propose Graph-Distance Contribution Reward (GDCR), a step-level process reward that scores newly-retrieved and newly-cited entities by their distance to the answer node in a training-time Entity-Relation (ER) graph. We further propose Step Advantage Policy Optimization (SAPO), which converts GDCR into step-level advantages and combines them with trajectory-level outcome advantages. Experiments on four challenging benchmarks validate the effectiveness of our method.

2605.29695 2026-05-29 cs.AI cs.CE cs.LG math.PR

FHRFormer: A Self-Supervised Masked Transformer Framework for Fetal Heart Rate Time-Series Inpainting and Forecasting

FHRFormer: 一种用于胎儿心率时间序列修复和预测的自监督掩码Transformer框架

Kjersti Engan, Neel Kanwal, Anita Yeconia, Ladislaus Blacy, Yuda Munyaw, Estomih Mduma, Hege Ersdal

AI总结 针对胎儿心率监测中信号丢失问题,提出基于掩码Transformer的自监督自编码器方法,通过捕获局部时间和频率成分来修复和预测缺失信号,具有鲁棒性并支持AI风险算法开发。

详情
Comments
Submitted to Frontiers in Digital Health. arXiv admin note: substantial text overlap with arXiv:2509.20852
AI中文摘要

大约10%的新生儿出生时需要帮助才能开始呼吸,约5%需要通气支持。胎儿心率(FHR)监测在产前护理中评估胎儿健康状况方面起着关键作用,能够检测异常模式并支持及时产科干预以减轻分娩期间的胎儿风险。应用人工智能(AI)方法分析具有不同结局的连续FHR监测大数据集,可能为预测需要呼吸辅助或干预的风险提供新见解。可穿戴FHR监测仪的最新进展实现了在不影响母亲活动能力的情况下进行连续胎儿监测。然而,母亲运动期间的传感器移位以及胎儿或母亲位置的变化常常导致信号丢失,造成记录的FHR数据出现缺口。这种缺失数据限制了有意义信息的提取,并使基于AI的自动化分析复杂化。传统的缺失数据处理方法,如简单插值技术,往往无法保留信号的频谱特性。在本文中,我们提出了一种基于掩码Transformer的自编码器方法,通过捕获数据的局部时间和频率成分来重建缺失的FHR信号。所提出的方法在不同缺失数据时长下表现出鲁棒性,可用于信号修复和预测。该方法可回顾性地应用于研究数据集,以支持基于AI的风险算法开发。未来,该方法可集成到可穿戴FHR监测设备中,实现更早、更稳健的风险检测。

英文摘要

Approximately 10% of newborns require assistance to initiate breathing at birth, and around 5% need ventilation support. Fetal heart rate (FHR) monitoring plays a crucial role in assessing fetal well-being during prenatal care, enabling the detection of abnormal patterns and supporting timely obstetric interventions to mitigate fetal risks during labor. Applying artificial intelligence (AI) methods to analyze large datasets of continuous FHR monitoring episodes with diverse outcomes may offer novel insights into predicting the risk of needing breathing assistance or interventions. Recent advances in wearable FHR monitors have enabled continuous fetal monitoring without compromising maternal mobility. However, sensor displacement during maternal movement, as well as changes in fetal or maternal position, often lead to signal dropout, resulting in gaps in recorded FHR data. Such missing data limits the extraction of meaningful insights and complicates automated (AI-based) analysis. Traditional approaches to handling missing data, such as simple interpolation techniques, often fail to preserve the spectral characteristics of the signals. In this paper, we propose a masked transformer-based autoencoder approach to reconstruct missing FHR signals by capturing both local temporal and frequency components of the data. The proposed method demonstrates robustness across varying durations of missing data and can be used for signal inpainting and forecasting. The proposed approach can be applied retrospectively to research datasets to support the development of AI-based risk algorithms. In the future, the proposed method could be integrated into wearable FHR monitoring devices to achieve earlier and more robust risk detection.

2605.29693 2026-05-29 cs.LG cs.RO

Momentum Based Reward Design for Low Emission Traffic Signal Control

基于动量的低排放交通信号控制奖励设计

Chinmay Mundane, Amith Manoharan, Arun Singh

AI总结 提出一种基于动量的奖励函数(MBRF),通过鼓励车辆持续移动而非单纯惩罚拥堵,在SUMO仿真中实现更好的吞吐量-排放权衡和更稳定的学习行为。

详情
AI中文摘要

城市交通拥堵是一个日益严重的全球性问题,导致通勤时间延长和环境污染加剧。传统的交通信号控制系统往往难以适应动态交通状况。自适应交通信号控制可以在不改变道路基础设施的情况下改善城市交通。深度强化学习(DRL)在此任务中表现出色,但现有的基于延误和队列的奖励常常产生短视或不稳定的策略。本文提出了一种基于动量的奖励函数(MBRF),鼓励车辆持续移动,而非仅惩罚拥堵。该方法在SUMO(城市交通仿真)中使用标准交通指标(如等待时间、队列长度、吞吐量和CO2排放)进行评估。结果表明,与基于延误或队列的奖励以及经典控制器(如Max Pressure和LQF)相比,所提出的奖励实现了更好的吞吐量-排放权衡和更稳定的学习行为。

英文摘要

Urban traffic congestion is a growing global issue contributing significantly to long commute times and environmental pollution. Traditional traffic signal control systems often fail to adapt to dynamic traffic conditions. Adaptive traffic signal control can improve urban traffic without changing road infrastructure. Deep Reinforcement Learning (DRL) has shown strong performance for this task, but existing delay and queue-based rewards often produce short-sighted or unstable policies. This paper proposes a Momentum-Based Reward Function (MBRF) that encourages vehicles to keep moving rather than penalizing congestion alone. The method is evaluated in SUMO (Simulation of Urban MObility) using standard traffic metrics such as waiting time, queue length, throughput, and CO2 emissions. Results show that the proposed reward produces better throughput-emission trade-offs and more stable learning behavior than delay or queue-based rewards, as well as classical controllers such as Max Pressure and LQF.

2605.29691 2026-05-29 cs.CV

Unsupervised Semantic Segmentation Facilitates Model Understanding

无监督语义分割促进模型理解

Xiaoyan Yu, Lisa Mais, Jannik Franzen, Peter Hirsch, Nick Lechtenbörger, Andreas Mardt, Dagmar Kainmüller

AI总结 提出基于无监督语义分割的可视化协议,直观揭示不同自监督视觉Transformer的注意力机制、位置偏差和缩放行为等模型特性。

详情
AI中文摘要

自监督学习(SSL)产生了多种视觉Transformer(ViT),其预训练表示支持广泛的下游任务。为了更好地理解这些模型,已有工作评估了自注意力的机制以及表示中捕获的信息类型,例如揭示了对比学习(CL)和掩码图像建模(MIM)训练模型之间的显著差异。然而,模型理解的这些进展尚未完全渗透到更广泛的社区,其中针对CL模型的见解有时被泛化到MIM模型。为了使模型理解对广大受众直接且直观,我们提出了一种简单且易于解释的可视化协议。我们的协议基于可视化无监督语义分割结果,但目标不是最大化分割性能。相反,它允许我们传达跨图像一致出现的模型行为。通过对不同层和表示上的多种SSL模型进行基准测试,我们获得了关于不同位置偏差和缩放行为的新见解,包括DINOv3-Large模型令牌中的强边界伪影。这些见解补充并有助于传达一系列先前发现。我们的协议进一步能够清晰地区分位置效应与密切相关但不同的局部性偏差,后者在文献中已被更广泛地研究。该协议在GitHub上公开,我们相信它将促进更广泛社区的进一步模型理解。

英文摘要

Self-supervised learning (SSL) has produced a diverse landscape of vision transformers (ViTs) whose pretrained representations support a wide range of downstream tasks. Towards a better understanding of these models, a body of work has assessed the mechanics of their self-attention as well as the types of information captured across their representations, revealing, for example, stark differences between models trained with contrastive learning (CL) and masked image modeling (MIM). However, these advances in model understanding have not yet fully permeated the broader community, where insights specific to CL models are sometimes generalized to MIM models. To make model understanding straightforward and intuitive for a broad audience, we propose a simple and easily interpretable visualization protocol. Our protocol is based on visualizing unsupervised semantic segmentation results, yet our goal is not to maximize segmentation performance. Instead, it allows us to convey model behaviors that consistently emerge across images. Benchmarking a diverse set of SSL models across layers and representations, we obtain novel insights into distinct positional biases and scaling behaviors, including strong boundary artifacts in DINOv3-Large model tokens. These insights complement and help communicate a range of previous findings. Our protocol further enables a clear visual distinction between positional effects and the closely related but distinct locality bias, which has been studied much more extensively in the literature. The protocol is publicly available on GitHub and we believe it will catalyze further model understanding for a broad community.

2605.29688 2026-05-29 cs.LG

A Novel Tensor Product-Based Neural Network for Solving Partial Differential Equations

一种基于张量积的新型神经网络用于求解偏微分方程

Qihong Yang, Yangtao Deng, Qiaolin He, Shiquan Zhang

AI总结 提出张量积网络(TPNet),通过将解显式表示为基函数的线性组合并利用最小二乘直接求解系数,实现高效准确的函数逼近和PDE求解。

详情
Comments
44 pages, 11 figures
AI中文摘要

本文提出了张量积网络(TPNet),一种用于高效准确函数逼近和PDE求解的新型神经架构。该方案的核心是将解显式构造为集成到网络中的基函数的线性组合,系数通过直接最小二乘求解确定,从而绕过了传统的基于梯度的训练。关键的方法贡献包括:(1)一种高效的张量积方案,通过组合两组子网络输出的组合生成多维基函数,在保持表达力的同时显著降低模型复杂度和参数数量;(2)一种块时间推进策略,以提高长时间模拟的计算效率;(3)一种线性重构策略,通过将已知非线性项视为源项来处理非线性PDE。TPNet在准确性和训练时间上优于传统神经网络求解器。这一性能提升源于其结构化设计和确定性最小二乘拟合,与主流方法(如物理信息神经网络PINNs)所需的迭代且通常计算密集的优化形成对比。

英文摘要

This paper presents the Tensor Product Network (TPNet), a novel neural architecture for efficient and accurate function approximation and PDE solving. The core of the proposal involves constructing the solution explicitly as a linear combination of basis functions integrated into the network, with coefficients determined by a direct least-squares solve, thereby bypassing traditional gradient-based training. The key methodological contribution include: (1) an efficient tensor-product scheme that generates multi-dimensional basis functions from combinations of two sets of subnetwork outputs, significantly reducing model complexity and parameter count while maintaining expressivity; (2) a block time-marching strategy to improve computational efficiency in long-time simulations; and (3) a linear reformulation strategy for handling nonlinear PDEs by treating known nonlinear terms as sources. TPNet achieves superior accuracy and shorter training times than conventional neural network solvers. This performance gain stems from its structured design and deterministic least-squares fitting, which contrast with the iterative, often computationally intensive optimization required by mainstream methods like Physics-Informed Neural Networks (PINNs).

2605.29687 2026-05-29 cs.AI cs.LO

Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability

基于偏好最大可满足性的大语言模型可靠推理

Pedro Orvalho, Marta Kwiatkowska, Guillem Alenyà, Felip Manyà

AI总结 提出一种混合推理方法,通过LLM生成代码将自然语言问题编码为偏好最大可满足性问题,由精确求解器求解并独立验证,显著提高可行性。

详情
Comments
17 pages, 1 figure, 4 tables
AI中文摘要

大语言模型(LLM)擅长理解自然语言,但在涉及多个约束和用户定义偏好的优化任务(常见于机器人等领域)中表现不佳。我们提出一种混合推理方法,其中LLM通过代码生成实现外部化推理。给定自然语言问题描述,LLM生成Python代码,将用户定义的约束和偏好编码为偏好最大可满足性(MaxSAT)问题,然后由精确的MaxSAT求解器求解。为确保正确性,模型生成代码返回的解会与规范MaxSAT编码独立验证可行性和最优性,允许不同的编码和多个最优解。我们使用开源和闭源LLM在三个偏好推理任务族上评估该方法,并与相同模型的直接回答、思维链和程序思维基线进行比较。虽然这些基线很少产生可行解,但基于MaxSAT的流水线实现了显著更高的接受率,在某些情况下超过80%。我们的结果表明,LLM驱动的代码生成结合偏好MaxSAT能够针对生成的编码实现可验证的优化,并在独立验证的参考语义下大幅提高正确性。

英文摘要

Large Language Models (LLMs) excel at understanding natural language but struggle with optimisation tasks involving multiple constraints and user-defined preferences, which commonly arise in domains such as robotics. We propose a hybrid reasoning approach in which LLMs externalise reasoning through code generation. Given a natural language problem description, an LLM generates Python code that encodes user-defined constraints and preferences as a preference-based Maximum Satisfiability (MaxSAT) problem, which is then solved by an exact MaxSAT solver. To ensure correctness, solutions returned by the model-generated code are independently verified for feasibility and optimality against a canonical MaxSAT encoding, allowing for different encodings and multiple optimal solutions. We evaluate our approach using both open-source and closed-access LLMs on three families of preference-based reasoning tasks, and compare it against direct-answer, chain-of-thought, and program-of-thought baselines using the same models. While these baselines rarely produce feasible solutions, the MaxSAT-based pipeline achieves substantially higher acceptance rates, in some cases exceeding 80%. Our results demonstrate that LLM-driven code generation combined with preference-based MaxSAT enables solver-verifiable optimisation with respect to generated encodings, and substantially improves correctness under independently verified reference semantics.

2605.29685 2026-05-29 cs.AI

NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs

NICE:一个基于理论的LLM社交智能诊断基准

Yunjin Qi, Zhaojun Jiang, Xuan Wu, Hanxi Pan, Yixuan Wang, Yanfang Liu, Xiang Ji, Churu Yu, Chunyuan Zheng, Yingze Chen, Jie He, Liuqing Chen, Zaifeng Gao

AI总结 本文通过构建基于社会理论的社交智能框架,提出诊断基准NICE,用于细粒度评估大语言模型在社交交互中的能力弱点。

详情
AI中文摘要

随着大语言模型(LLM)在情感陪伴和客户服务等社交场景中的广泛应用,衡量其社交智能对人工智能交互的质量与安全性变得至关重要。然而,现有的社交智能基准缺乏统一框架来组织社交能力,因此无法进行细粒度诊断。为了构建首个基于社会理论的整体诊断评估,我们首先通过文献综述和多阶段专家验证(遵循心理测量学原则)构建了一个社交智能框架。该框架包括4个类别和11个维度,每个维度进一步由细粒度的能力方面指定。基于此框架,我们提出了NICE(规范、交互、认知、体验),一个包含137个项目的诊断基准,通过代表性中文情境进行操作化。在5个前沿LLM和一个人类参考组中,模型在总体准确率上得分较高,但在沟通方面表现出持续的弱点,框架将其定位到三个具体能力方面:多轮沟通、非语言沟通和同步性。因此,NICE将社交智能评估重新定义为对LLM中具有社会后果的弱点的基于理论的诊断。

英文摘要

As large language models (LLMs) are increasingly applied in social contexts such as emotional companionship and customer service, measuring their social intelligence has become critical to the quality and safety of human-AI interaction. However, existing social intelligence benchmarks lack a unified framework that organizes social abilities into a unified structure, and therefore cannot enable fine-grained diagnosis. To build the first holistic diagnostic evaluation grounded in social theory, we first construct a social intelligence framework through a literature review and multi-stage expert validation guided by psychometric principles. The resulting framework includes 4 categories and 11 dimensions, each further specified by fine-grained capability facets. Building on this framework, we introduce NICE (Norm, Interaction, Cognition, Experience), a diagnostic benchmark of 137 items operationalized through representative Chinese contexts. Across 5 frontier LLMs and a human reference group, models score higher in aggregate accuracy yet show a consistent weakness in Communication, which the framework localizes to 3 specific capability facets: multi-turn communication, nonverbal communication, and synchrony. NICE thus reframes social intelligence evaluation toward theory-grounded diagnosis of socially consequential weaknesses in LLMs.

2605.29684 2026-05-29 cs.LG cond-mat.dis-nn stat.ML

Kernel Renormalization in Bayesian Deep Neural Networks: the Equivalent Wishart Ansatz in the Proportional Regime

贝叶斯深度神经网络中的核重整化:比例机制下的等效Wishart假设

Paolo Baglioni, Christian Keup, Vincenzo Zimbardo, Rosalba Pacelli, Alessandro Vezzani, Raffaella Burioni, Pietro Rotondo

AI总结 针对固定深度L的贝叶斯多层感知机,提出等效Wishart假设来捕捉层次经验核的随机涨落,通过大偏差分析得到重正化NNGP核描述,在比例极限下用至多L个标量序参数刻画表示学习,并扩展到CNN揭示局部核重整化机制。

详情
Comments
45 pages, 21 figures
AI中文摘要

训练集大小$P$和深度神经网络宽度$N$以相同速率增长的比例宽度极限,已被深入研究用于浅层单隐藏层网络。然而,将这些非微扰结果从浅层架构扩展到深度非线性网络已被证明非常具有挑战性。在这里,我们提出了一种有效的近似方法,用于预测固定深度$L$的贝叶斯多层感知机(MLP)在任意高维数据上的泛化性能。我们提出了一个等效Wishart假设,以捕捉MLP层次经验核的主要随机涨落。这使我们能够在比例极限下对MLP的配分函数进行大偏差分析,并用重正化NNGP核表示。在这种描述中,即使比例极限下的强表示学习也由至多$L$个标量序参数编码,这些参数自洽确定。将该方法扩展到卷积架构(CNN),我们识别出一种层次局部核重整化机制,该机制允许量化CNN中由于有限宽度效应导致的大宽度核的更复杂数据相关变换。我们在经典基准数据集上,针对深度$L \sim O(10)$和$P\sim O(10^3)$的有限深度神经网络的贝叶斯后验采样实验测试了我们的有效理论,发现总体吻合良好,同时存在两种不同类型的系统性偏差。

英文摘要

The scaling limit where both the size of the training set $P$ and the width $N$ of a deep neural network grow at the same rate, the so-called proportional-width regime, has been intensely studied for shallow, single-hidden-layer networks. However, extending these non-perturbative results from shallow architectures to deep non-linear networks has proven very challenging. Here we present an effective approximate approach to predict the generalization performance of Bayesian multi-layer perceptrons (MLPs) of fixed depth $L$ on arbitrary high-dimensional data. We propose an equivalent Wishart Ansatz to capture the dominant stochastic fluctuations of the hierarchical empirical kernels of MLPs. This allows us to perform a large deviation analysis for the partition function of MLPs in the proportional limit, expressed in terms of a renormalized NNGP kernel. In this description, even strong representation learning in the proportional limit is encoded in at most $L$ scalar order parameters, determined self-consistently. Extending the approach to convolutional architectures (CNNs), we identify a hierarchical local kernel renormalization mechanism, which allows to quantify more complex data-dependent transformations of the large-width kernel in CNNs due to finite-width effects. We test our effective theory against sampling experiments from the Bayesian posterior of finite deep neural networks with depths $L \sim O(10)$ and $P\sim O(10^3)$ on classic benchmark datasets, finding overall very good agreement together with two distinct types of systematic deviations.

2605.29682 2026-05-29 cs.CL

Scaling Laws for Agent Harnesses via Effective Feedback Compute

智能体框架的有效反馈计算缩放定律

Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che

AI总结 提出有效反馈计算(EFC)作为缩放坐标,通过衡量信息性、有效性、非冗余性和保留性来预测智能体框架性能,在多个任务上优于原始计算基线。

详情
AI中文摘要

智能体框架通过决定模型如何调用工具、接收反馈、验证中间状态、存储记忆和修正解决方案,日益决定语言模型系统的性能。然而,当前的测试时缩放分析通常通过原始支出(令牌、工具调用、操作、挂钟时间或成本)来参数化这一过程,这并未区分有用反馈与冗余或不稳定的交互。我们引入了有效反馈计算(EFC),这是一种轨迹级缩放坐标,仅在反馈具有信息性、有效性、非冗余性且被保留用于后续决策时才计入反馈,并在比较具有不同反馈需求的任务时通过任务需求进行归一化。在合成可控任务、可执行代码任务、真实基准轨迹、保留集和前瞻性验证批次中,基于EFC的坐标一致地比原始计算基线和强多变量SAS基线更好地预测失败率。在受控缩放中,原始令牌和工具调用解释的变异有限(R²=0.33和0.42),SAS达到0.88,而Oracle-EFC和Estimated-EFC达到0.94,Oracle-EFC/D_task达到0.99。匹配预算的干预表明,在原始成本和工具调用固定的情况下,提高反馈质量将成功率从0.27提升到0.90。在混合真实轨迹上,NRS-EFC/D_task达到R²=0.92,而原始计算具有接近零或负的拟合,并且在前瞻性保留集中仍然是最佳预测器(R²=0.85)。这些结果表明,框架缩放受计算量多少的影响较小,而更多地取决于原始预算如何高效地转化为持久且任务充分的反馈。

英文摘要

Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling analyses often parameterize this process by raw expenditure -- tokens, tool calls, operations, wall time, or cost -- which does not distinguish useful feedback from redundant or unstable interaction. We introduce \emph{Effective Feedback Compute} (EFC), a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, and we normalize it by task demand when comparing tasks with different feedback requirements. Across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling, raw tokens and tool calls explain limited variation ($R^2=0.33$ and $0.42$), SAS reaches $0.88$, while Oracle-EFC and Estimated-EFC reach $0.94$ and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$. Matched-budget interventions show that improving feedback quality raises success from $0.27$ to $0.90$ while raw cost and tool calls are fixed. On mixed real traces, NRS-EFC/$D_{\mathrm{task}}$ reaches $R^2=0.92$ while raw compute has near-zero or negative fit, and it remains the best predictor in a prospective holdout ($R^2=0.85$). These results suggest that harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback.