arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26795 2026-05-27 cs.AI

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

Xiang Wang, Wei Wei

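AI summary: Investigates why chain-of-thought prompting improves language-model accuracy at probe time, finding that the gain comes mainly from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.

Details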
English abstract

Chain-of-thought (CoT) prompting reliably improves language-model accuracy, but which properties of a rationale text drive the improvement is poorly understood. Prior work has largely studied generation-time behavior. We instead ask a probe-time question: given a fixed rationale in context, what in that text changes the answer? We identify two complementary sources of the gain. First, even a globally word-shuffled rationale substantially outperforms the no-rationale baseline, indicating a strong lexical activation effect. More importantly, the additional gain from structured text appears to arise less from sentence-level logical ordering and more from short-range token adjacency. Preserving contiguous windows of just $n^\star{=}2$--$3$ tokens recovers most of the remaining gain toward full CoT performance. Supporting experiments rule out copying of explicit answer declarations or answer values, as well as full grammatical realization, as primary drivers. Further generalization experiments show that the qualitative pattern remains stable across multiple model families, parameter scales, and datasets. These results support a local co-occurrence activation (LCA) account of probe-time CoT, in which the observed gains appear to arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.
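
The window-preserving manipulation described in this abstract can be illustrated with a short sketch. It assumes whitespace tokenization and non-overlapping windows, and the name `window_shuffle` is hypothetical; the paper's actual tokenizer and shuffling procedure may differ.

```python
import random

def window_shuffle(tokens, n, seed=0):
    """Shuffle tokens while keeping contiguous windows of n tokens intact.

    Assumption: windows are non-overlapping and cut from the start of the text.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    # Cut the sequence into consecutive windows of length n (the last may be shorter).
    windows = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    rng = random.Random(seed)
    rng.shuffle(windows)  # destroys global order, keeps adjacency inside each window
    return [tok for window in windows for tok in window]

rationale = "first add 3 and 4 to get 7 then double it to get 14".split()
fully_shuffled = window_shuffle(rationale, n=1)  # global word shuffle
bigram_windows = window_shuffle(rationale, n=2)  # preserves 2-token adjacency
```

With `n=1` this reduces to the globally word-shuffled condition; raising `n` to 2 or 3 restores short-range adjacency without restoring sentence-level order.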

2605.26789 2026-05-27 cs.AI

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

Zhe Yu, Wenpeng Xing, Yunzhao Wei, Jie Chen, Hongzhi Wang, Xuyang Teng, Meng Han

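AI summary: Introduces composition collapse, in which a model with stable command of atomic facts still fails to assemble them into reasoning chains, and uses a double-gate protocol to decompose post-training gains, revealing changes in composition capability that aggregate metrics mask.

Details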
English abstract

Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability -- as if a model that answers more questions correctly must be better at assembling facts. We show that this assumption can be misleading: recipes with statistically indistinguishable atomic knowledge produce composition behaviour separated by over 40 percentage points, a phenomenon we call composition collapse: the systematic failure to assemble stably-known facts into chains, invisible to aggregate metrics. We introduce a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access, decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2--11 across four post-training recipes, this decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask, and suggests that claims about multi-hop reasoning improvement should be accompanied by atomic-gate-controlled composition metrics. Diagnostic probes further show that a substantial share of measured composition failure reflects generation-time computation constraints rather than permanent inability to compose.
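
The change of estimand can be made concrete with a toy calculation. The sketch below is a simplified stand-in for the gating idea, not the paper's protocol: it assumes each chain records repeated atomic queries per fact plus one composed answer, and it treats a chain as gated only when every atomic query succeeded. All field and function names are hypothetical.

```python
def aggregate_failure(chains):
    """Ungated failure rate over all chains."""
    return sum(1 for c in chains if not c["composed_correct"]) / len(chains)

def residual_composition_failure(chains):
    """Failure rate restricted to chains whose atomic facts are stably known."""
    # Assumed gate: every repeated query of every atomic fact was answered correctly.
    gated = [c for c in chains
             if all(all(trials) for trials in c["atomic_trials"])]
    if not gated:
        return None  # nothing passes the gate, so the estimand is undefined
    return sum(1 for c in gated if not c["composed_correct"]) / len(gated)

chains = [
    {"atomic_trials": [[True, True], [True, True]], "composed_correct": True},
    {"atomic_trials": [[True, True], [True, True]], "composed_correct": False},
    {"atomic_trials": [[True, False], [True, True]], "composed_correct": False},
    {"atomic_trials": [[False, False], [True, True]], "composed_correct": False},
]
ungated = aggregate_failure(chains)
gated = residual_composition_failure(chains)
```

In this toy data the ungated failure rate is 0.75, but only half of the chains with stable atomic access fail, which is the portion attributable to composition rather than to missing facts.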

2605.26788 2026-05-27 cs.CL cs.AI

SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability

Ramakrishna Vamsi Setti, Jagadeesh Rachapudi, Sachin Chaudhary, Praful Hambarde, Amit Shukla

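AI summary: To address the performance drop of large language models in multi-turn conversation, proposes SeDT, an inference-time method that needs no training and no extra data. It imports return-to-go conditioning from offline reinforcement learning, computes cumulative relevance scores from semantic, lexical, and positional signals, and annotates the conversation history with them, substantially improving performance and reducing unreliability.

Details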
English abstract

Large language models (LLMs) achieve impressive performance when a task is fully specified in a single turn, yet the same models lose up to 39% of that performance when the identical task is revealed incrementally across multiple turns, a phenomenon documented at scale as Lost in Conversation. Crucially, this collapse is almost entirely a reliability failure: in the best case, aptitude falls by only 16%, while unreliability more than doubles (+112%). We argue that the root cause is structural: a flat conversation history assigns equal implicit weight to every prior turn, giving the model no signal to distinguish a critical constraint from incidental dialog. We present SeDT (Sentence-transformer Decision-Transformer), a training-free inference-time method that resolves this by importing return-to-go conditioning from offline reinforcement learning. SeDT annotates each conversation shard with a cumulative relevance score derived from three complementary signals (semantic, lexical, and positional) and presents the full annotated history to the model at the final turn, without weight changes, without training data, and without discarding context. Evaluated on the Lost-in-Conversation benchmark across three LLMs and three generation tasks, SeDT outperforms the sharded baseline in all nine model-task combinations, with gains of up to +37.7% in mean performance P and simultaneous reductions in unreliability in seven of the nine combinations. In short, telling the model which past turns matter is sufficient to substantially recover the performance lost in conversation.
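
A rough sketch of return-to-go style annotation follows. In Decision Transformer, the return-to-go at a step is the sum of rewards from that step onward; here per-shard relevance plays the role of reward. The scoring is an assumption for illustration only: Jaccard word overlap stands in for the lexical signal, a linear recency weight for the positional signal, and the semantic signal is left as an optional callable. None of these names or weights come from the paper.

```python
def lexical_overlap(a, b):
    """Jaccard overlap between the word sets of two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def annotate_history(shards, final_turn, semantic=None, w_pos=0.2):
    """Prefix each shard with the relevance accumulated from it to the end."""
    n = len(shards)
    scores = []
    for i, shard in enumerate(shards):
        lex = lexical_overlap(shard, final_turn)
        # Hypothetical hook: a sentence-embedding similarity could be plugged in here.
        sem = semantic(shard, final_turn) if semantic else 0.0
        pos = w_pos * (i + 1) / n  # assumed recency weight, later turns score higher
        scores.append(lex + sem + pos)
    # Return-to-go: suffix sums of the per-shard scores.
    rtg, running = [0.0] * n, 0.0
    for i in range(n - 1, -1, -1):
        running += scores[i]
        rtg[i] = running
    return [f"[relevance-to-go={r:.2f}] {s}" for r, s in zip(rtg, shards)]

shards = [
    "Write a function that parses dates.",
    "By the way, nice weather today.",
    "The function must reject dates before 1970.",
]
final_turn = "Now give me the final function that parses dates."
annotated = annotate_history(shards, final_turn)
```

The annotated list would then be joined into the prompt for the final turn, so the model sees every shard along with its score rather than a flat, unweighted history.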

2605.26786 2026-05-27 cs.CY cs.AI cs.LG

Implementation of Big Data Analytics for Diabetes Management: Needs Assessment in the Rwanda Healthcare System

Silas Majyambere, Tony Lindgren, Workneh Y. Ayele, Celestin Twizere

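AI summary: Uses a stakeholder workshop to assess the readiness of Rwanda's healthcare system to adopt big data analytics for diabetes management, and proposes a practical framework based on explainable machine learning models.

Details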
English abstract

Diabetes is a chronic metabolic disease that can lead to serious health problems if not diagnosed and managed early. Big Data Analytics (BDA) and machine learning offer practical tools for analyzing large health datasets and supporting early detection and better treatment decisions. However, their use in routine clinical practice is still limited. This study examines the readiness of Rwanda's healthcare system to adopt big data analytics for diabetes management. As the country continues to expand its use of electronic medical records and health information systems, new opportunities arise for improving prediction, monitoring, and clinical decision-making. A five-day workshop involving 25 key stakeholders, including clinicians, data managers, policymakers, medical researchers, nutritionists, and technology providers, was conducted to assess preparedness and identify existing gaps. The findings highlight both the potential and the main challenges of BDA implementation. Based on these results, the paper proposes a practical BDA framework to support diabetes management strategies using explainable machine learning models.

2605.26785 2026-05-27 cs.CL cs.AI

EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

Yunbo Long, Haolang Zhao, Lukas Beckenbauer, Liming Xu, Alexandra Brintrup

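AI summary: Proposes EmoDistill, an offline framework that distills emotional negotiation skills into language model agents by selecting emotions with Implicit Q-Learning and expressing them with a Low-Rank Adaptation policy, achieving the highest utility across four high-stakes negotiation domains.

Details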
English abstract

Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability: emotionally framed language may steer agents toward the counterparty's interests. Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style. Thus, we introduce EmoDistill, an offline framework for distilling emotional negotiation skills into language model agents. EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns which emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns how to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO). Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection. Ablations show that emotion conditioning is essential, and transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments. Overall, EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.

2605.26784 2026-05-27 cs.LG cs.AI

Ratio-Variance Regularized Policy Optimization

Yu Luo, Shuo Han, Yihan Hu, Lei Lv, Huaping Liu, Fuchun Sun, Jianye Hao, Dong Li

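AI summary: Proposes R²VPO, which constrains the variance of the policy ratio as a local approximation to a trust region in place of heuristic clipping, improving performance and sample efficiency on LLM and robotic control tasks.

Details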
English abstract

Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. By acting as a distributional "soft brake", this approach preserves critical gradient signals from novel discoveries while naturally down-weighting and enabling the reuse of stale, off-policy data. We introduce R$^2$VPO (Ratio-Variance Regularized Policy Optimization), which implements this constraint via a primal-dual optimization framework. Extensive evaluations across 7 LLM scales, spanning both fast and slow reasoning paradigms, and 10 robotic control tasks demonstrate the generality of the proposed approach. R$^2$VPO achieves substantial performance gains on mathematical reasoning benchmarks, with particularly pronounced improvements on smaller models, while significantly improving sample efficiency. Furthermore, it consistently outperforms PPO baselines in continuous control domains, particularly in sparse-reward and dynamic environments. Together, these findings establish ratio-variance regularization as a principled foundation for stable and data-efficient policy optimization.
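
The idea of a variance constraint handled by a primal-dual scheme can be sketched in a few lines. This is a generic Lagrangian illustration under stated assumptions, not the R$^2$VPO objective itself: the surrogate is the unclipped mean of ratio times advantage, the constraint is that the population variance of the ratios stays under a budget, and the multiplier follows plain dual ascent. The function names, `budget`, and `step_size` are hypothetical.

```python
from statistics import fmean, pvariance

def penalized_objective(ratios, advantages, lam):
    """Lagrangian to maximize: unclipped surrogate minus lam * ratio variance."""
    surrogate = fmean([r * a for r, a in zip(ratios, advantages)])
    # Soft penalty on the spread of the importance ratios (no hard clipping).
    return surrogate - lam * pvariance(ratios)

def dual_step(lam, ratios, budget, step_size=0.5):
    """Dual ascent: raise lam when variance exceeds the budget, keep lam >= 0."""
    return max(0.0, lam + step_size * (pvariance(ratios) - budget))

ratios = [0.5, 1.5]        # pi_new / pi_old for two sampled actions
advantages = [1.0, 1.0]
objective = penalized_objective(ratios, advantages, lam=1.0)
new_lam = dual_step(0.0, ratios, budget=0.05)
```

Unlike clipping, the penalty never zeroes the gradient of an individual sample; it only grows as the batch of ratios spreads out, which is one way to read the "soft brake" description.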

2605.26782 2026-05-27 cs.RO cs.HC

Manipulating Tangible Virtual Object Dynamics to Promote Learning of Precision Force Generation

Alberto Garzás-Villar, Alba Riera-Cardona, Alexis Derumigny, J. Micah Prendergast, Jane Murray Cramm, Laura Marchal-Crespo

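AI summary: Proposes training precise force control by manipulating the dynamics of tangible virtual objects (linear, Gaussian, or antisymmetric Gaussian spring models). Experiments show the antisymmetric Gaussian group achieved the highest force accuracy during training, but long-term retention did not differ significantly, and participants relied mainly on learned target elongation rather than target force.

Details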
English abstract

Robotic haptic devices combined with virtual reality offer novel opportunities to train fine force generation, an essential yet overlooked component of post-stroke rehabilitation. This study proposes that manipulating the rendered dynamics of tangible virtual objects can be leveraged to train precise force control while engaging the somatosensory system. We conducted an experiment with fifty healthy participants who performed a curling-inspired task in which they had to stretch a virtual spring to generate a target release force to propel the stone to a predefined location on the ice sheet. During training, the spring's force-elongation relationship was modeled as either a linear or non-linear function, i.e., a Gaussian or antisymmetric Gaussian (AS-Gaussian) function with zero derivative at the release target force. Results indicate that the AS-Gaussian group consistently achieved higher force accuracy during training than the linear group, while the Gaussian group only outperformed the linear group toward the end of training. Analysis of personality traits revealed that higher Free Spirit scores were associated with poorer performance and reduced task exploration under Gaussian dynamics, whereas higher Transform-of-Challenge scores correlated with increased exploration. Despite these training effects, no significant differences in long-term retention were found across spring types or personality traits. Participants primarily relied on learned target elongation rather than target force, as evidenced by performance in a transfer task with a different stiffness but the same target force. While promising for somatosensory neurorehabilitation, these methods require refinement to reduce reliance on proprioceptive cues before testing with neurological patients.
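
To show what a zero derivative at the target means for a force-elongation curve, the sketch below compares a linear spring with a Gaussian-shaped one that peaks at the target elongation. The parameterization and all numeric values are assumptions made for illustration; the paper's exact functions, including the antisymmetric variant, are not reproduced here.

```python
import math

def linear_force(x, k):
    """Hookean spring: force grows in proportion to elongation x."""
    return k * x

def gaussian_force(x, x_target, f_target, sigma):
    """Assumed Gaussian profile that reaches f_target exactly at x_target."""
    return f_target * math.exp(-((x - x_target) ** 2) / (2 * sigma ** 2))

def slope(f, x, h=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

X_TARGET, F_TARGET = 0.10, 5.0   # hypothetical target elongation (m) and force (N)
K = F_TARGET / X_TARGET          # linear stiffness giving the same target force
SIGMA = 0.03                     # hypothetical width of the Gaussian profile

linear_slope = slope(lambda x: linear_force(x, K), X_TARGET)
gaussian_slope = slope(lambda x: gaussian_force(x, X_TARGET, F_TARGET, SIGMA), X_TARGET)
```

Both models produce the same force at the target, but the Gaussian profile is locally flat there, so small elongation errors near the target change the released force far less than they would with the linear spring.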

2605.26781 2026-05-27 cs.AI cs.MM

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

Xiaohan Wang, Mingze Yin, Yilin Zhao, Gang Liu, Dian Li

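AI summary: Introduces LiveK12Bench, a dynamic multi-disciplinary benchmark with an automated pipeline and a novel mock-exam evaluation scheme, revealing that large multimodal models degrade markedly in realistic exam scenarios and are especially sensitive to complex visual layouts.

Details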
English abstract

Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and often confined to restricted modalities, disciplines, and evaluation criteria. To address these issues, we introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises 2K+ verified questions spanning Mathematics, Physics, Chemistry, and Biology, sourced from the latest real-world exam papers and designed to grow over time. Our framework features several core innovations: 1) an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage; and 2) a novel "Mock Exam" evaluation scheme, which assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths. Extensive experiments on 12 LMMs reveal that advanced models suffer substantial performance degradation under exam-realistic constraints: GPT-5's score drops from 79 to 53 (out of 100) when process rigor and efficiency are jointly evaluated. Our findings expose critical vulnerabilities, such as sensitivity to complex visual layouts, highlighting the gap between idealized reasoning capabilities and true educational readiness. Both the code and the dataset are publicly available.

2605.26778 2026-05-27 cs.AI

The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

Zhe Yu, Wenpeng Xing, Yunzhao Wei, Bo Yang, Chen Ye, Gaolei Li, Meng Han

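AI summary: Proposes Computational Reality Monitoring (CRM), which compares internal representations with and without context to detect whether a language model generates from pretraining memory rather than retrieved context, addressing the attribution blind spot that output-level monitoring cannot identify.

Details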
English abstract

Retrieval-augmented generation promises to ground language model outputs in external evidence, yet the field has no reliable way to verify whether retrieved context actually governs generation -- a prerequisite for any high-stakes deployment. The standard assumption, that context-consistent output implies context-governed output, breaks when the retrieved document overlaps with the model's pretraining data: the model can produce faithful-looking text entirely from parametric memory, and both pathways yield indistinguishable output. We name this failure the attribution blind spot and introduce Computational Reality Monitoring (CRM) to address it. CRM operationalizes a principle adapted from cognitive science's reality monitoring framework: comparing internal representations with and without context reveals membership-conditioned representational divergence that output-level monitors systematically miss. CRM does not certify which source an individual generation used; it detects whether pretraining exposure leaves a measurable internal trajectory signature, establishing a necessary substrate for source attribution. Across nine model variants spanning three families, this divergence concentrates in architecture-specific layer patterns, receives converging support from block-level noise intervention, and generalizes across tasks and datasets while collapsing on domain-confounded benchmarks. The attribution blind spot is measurable and partially addressable: internal representations carry a diagnostic signal invisible at the output level, establishing a foundation for systems whose internal awareness of evidence provenance governs their external behavior.
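
The comparison of internal representations with and without context can be sketched as a per-layer distance profile. The sketch assumes hidden states have already been extracted as one vector per layer for the same token position in both runs, and it uses cosine distance as the divergence measure; both the metric and the function names are assumptions, not the paper's definition of CRM.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def layerwise_divergence(with_ctx, without_ctx):
    """Distance between the two runs at each layer, shallowest layer first."""
    return [cosine_distance(a, b) for a, b in zip(with_ctx, without_ctx)]

# Toy hidden states: two layers, two dimensions each (illustrative values only).
with_ctx = [[1.0, 0.0], [1.0, 0.0]]
without_ctx = [[1.0, 0.0], [0.0, 1.0]]
profile = layerwise_divergence(with_ctx, without_ctx)
```

A profile like this is a property of internal states, so it can differ between two runs even when their generated text is identical, which is the gap that output-level monitors cannot see.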

2605.26776 2026-05-27 cs.LG cs.AI

Towards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

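AI summary: Proposes Residual Refined Experts with Instance-level Gating (R2E-IG), built on a mixture-of-experts architecture, which improves generalization on vehicle routing problems under distribution shift through a modular policy network and training with dynamic weight adaptation.

Details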
English abstract

In recent years, Deep Reinforcement Learning (DRL) has achieved substantial progress on Vehicle Routing Problems (VRPs). However, existing DRL-based methods are typically trained on instances generated from a uniform distribution, which limits their performance under real-world distribution shifts. In this paper, we aim to develop a generalization-oriented model that partitions the policy network into multiple modules and adaptively recombines modules to form specific policies during inference. Specifically, we propose Residual Refined Experts with Instance-level Gating (R2E-IG) to improve cross-distribution generalization. Our contributions are threefold: (1) We introduce a Residual Refined Expert (R2E) architecture that enhances expert expressiveness via residual refinement; (2) We design an instance-level gating mechanism that learns distribution-aware instance representations and routes inputs to suitable modules; (3) We propose a mixed-distribution training mechanism equipped with Dynamic Weight Adaption (DWA), which dynamically reweights training data from different distributions to emphasize more informative ones. Extensive experiments show that R2E-IG achieves competitive performance against state-of-the-art baselines on both in-distribution and out-of-distribution instances across synthetic and benchmark datasets. Moreover, R2E-IG is generic and can be easily integrated into existing DRL-based methods to further improve performance.

2605.26772 2026-05-27 cs.AI

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Kia-Jüng Yang, Dominik Meier, Jiachen Zhao, Terry Ruas, Bela Gipp

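AI summary: Studies the mechanism of refusal in large reasoning models (LRMs), finding that the chain of thought (CoT) and activations jointly encode the refusal signal, so activation steering alone rarely reverses refusal, whereas a two-stage intervention (regenerating the CoT under activation steering) raises the reversal rate substantially.

Details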
English abstract

Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in LRMs additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70%, indicating that the CoT actively reinforces refusal. In a two-stage intervention where the model regenerates its CoT under activation steering, refusal is reversed in 94% of cases, while the resulting CoT alone retains 48% of this effect even after steering is removed. This suggests that the CoT can carry and reconstruct the compliance signal independently. These findings indicate that refusal in LRMs is jointly encoded in residual stream activations and the CoT. This joint encoding makes LRMs more robust against activation-level interventions alone, but exposes the CoT as a possible alternative attack surface.

2605.26770 2026-05-27 cs.CL

Quality Without Usefulness: LLM-Generated XAI Narratives as Trust Heuristics Rather Than Decision Aids

Fabian Lukassen, Jan Herrmann, Christoph Weisser, Alexander Silbersdorff, Benjamin Saefken, Thomas Kneib

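AI summary: Through five controlled experiments, examines whether high-quality LLM-generated natural language explanations improve task accuracy in time-series energy forecasting, finding that the explanations do not improve accuracy but inflate confidence, a quality-usefulness gap.

Details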
English abstract

Prior work shows that Large Language Models (LLMs) can transform Explainable AI (XAI) outputs into Natural Language Explanations (NLEs) that score highly on quality metrics such as plausibility, coherence, and comprehensibility. But does explanation quality translate to practical usefulness? We investigate this question in a time-series energy forecasting domain through five controlled experiments (2,730 judgments across 60 test instances), each operationalising a distinct facet of usefulness studied in the XAI literature. Holding NLE quality constant at the high levels established by a prior factorial study, we find that NLEs do not improve task accuracy on any of the five tasks, while inflating self-reported confidence. A placebic control shows that this confidence boost is driven by text presence rather than content. In an out-of-distribution detection task, NLEs reduce the LLM judge's ability to flag unreliable predictions, providing false reassurance that masks model failure. We characterise these findings as the Quality-Usefulness Gap and argue that evaluation of the XAI-to-NLE pipeline must extend beyond text-quality metrics to downstream task performance.

2605.26769 2026-05-27 cs.CY cs.AI

Generative artificial intelligence and the marginalization of minoritized knowledges in higher education: the case of disability

Fatiha Tali-Otmani

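AI summary: Drawing on educational sciences, critical technology studies, and disability studies, shows how generative AI reinforces epistemic coloniality through Anglophone, Western-centric training datasets, leading to a double marginalization of persons with disabilities, and examines whether hybridization between researcher and machine can preserve epistemic plurality, along with its structural limits.

Details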
English abstract

Generative artificial intelligence redefines higher education by restructuring the processes through which scientific knowledge is produced and validated. These systems are not neutral; they actively contribute to the marginalization of non-hegemonic epistemologies. This research draws upon educational sciences, critical technology studies, and disability studies to demonstrate that training datasets, which remain predominantly Anglophone and Western-centric, reinforce epistemic coloniality. The situation of persons with disabilities provides a particularly clear illustration of this phenomenon. Technological architectures frequently confine these individuals to reductive stereotypes or exclude them from the design process, leading to a double marginalization. This article examines whether a hybridization between the researcher and the machine might preserve epistemic plurality, while acknowledging the structural limitations inherent in algorithmic correction when used as a purely palliative strategy.

2605.26763 2026-05-27 cs.LG cs.AI

Adversarial Training for Robust Coverage Network under Worst-case Facility Losses

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

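AI summary: For the maximal covering location-interdiction problem, proposes a dual-agent deep reinforcement learning framework based on adversarial learning that delivers efficient solving and robust decisions.

Details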
English abstract

The Maximal Covering Location-Interdiction Problem (MCLIP) is a classic bi-level optimization problem, which is fundamental to resilient infrastructure planning yet remains computationally intractable. Specifically, the upper level determines facility locations to maximize coverage, while the lower level executes worst-case interdiction to minimize the coverage. The strong coupling between the upper and lower levels, combined with their respective high combinatorial complexity, renders traditional methods ineffective. To bridge this gap, we propose a Dual-Agent Deep Reinforcement Learning (DADRL) framework based on adversarial learning, comprising a location agent corresponding to the upper level and an interdiction agent corresponding to the lower level. Our contributions are threefold: (1) The location agent is trained simultaneously against an evolving interdiction agent, enabling it to capture the dynamic competitive interplay between the upper and lower levels; (2) To fully exploit the learned capabilities of the interdiction agent, we propose a Surrogate-based Ensemble Inference Strategy that utilizes the trained interdiction agent as a high-fidelity surrogate to guide the decisions of the location agent; (3) Extensive experiments on synthetic and real-world datasets demonstrate that our approach achieves superior computational efficiency while maintaining highly competitive solution quality compared to other baselines. Furthermore, our DADRL framework is model-agnostic to network structures, while its underlying adversarial learning paradigm demonstrates strong potential for solving other bi-level optimization problems.
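
The bi-level structure is easiest to see on a tiny instance solved by exhaustive search. The sketch below is not the DADRL method; it enumerates every placement and every interdiction, which is only feasible for toy sizes and is exactly the combinatorial blow-up that motivates learned agents. The instance data and function names are made up for illustration.

```python
from itertools import combinations

def coverage(open_sites, covers, demand):
    """Total demand weight covered by at least one open facility."""
    covered = set()
    for site in open_sites:
        covered |= covers[site]
    return sum(demand[d] for d in covered)

def worst_case_coverage(open_sites, covers, demand, r):
    """Lower level: the interdictor removes r open facilities to minimize coverage."""
    return min(coverage(set(open_sites) - set(hit), covers, demand)
               for hit in combinations(open_sites, r))

def solve_mclip(covers, demand, p, r):
    """Upper level: choose p sites that maximize coverage after interdiction."""
    return max(combinations(sorted(covers), p),
               key=lambda sites: worst_case_coverage(sites, covers, demand, r))

# Hypothetical instance: candidate site -> demand points it covers, plus weights.
covers = {"A": {1, 2, 3}, "B": {3, 4}, "C": {2}, "D": {4}}
demand = {1: 5, 2: 1, 3: 1, 4: 5}
best = solve_mclip(covers, demand, p=2, r=1)
```

On this instance the placements (A, B) and (A, D) both cover all demand when nothing is attacked, yet after one facility is lost (A, B) still retains 6 units while (A, D) retains only 5, so the robust choice differs from what nominal coverage alone would suggest.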

2605.26759 2026-05-27 cs.LG

Time Series Causal Discovery via Context-Conditioned and Causality-Augmented Pretraining

Biao Ouyang, Tengxue Zhang, Zhihao Zhuang, Yang Shu, Chenjuan Guo, Bin Yang

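AI summary: Proposes the PTCD framework, a pretraining paradigm with context-conditioned modeling and transferable causal augmentation that improves cross-task generalization in time-series causal discovery, performing strongly in causal discovery and root cause identification on multiple real-world OOD datasets.

Details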
Comments
Submitted to the 40th Conference on Neural Information Processing Systems (NeurIPS 2026). 27 pages
English abstract

Causal discovery from time series is critical for many real-world applications, such as tracing the root causes of anomalies. Existing approaches typically rely on dataset-specific optimization, making it difficult to transfer their causal discovery capabilities to new time series governed by diverse causal mechanisms. In this paper, we propose PTCD, a novel Pretraining framework for Time-series Causal Discovery, which improves cross-task generalization through context-conditioned modeling and transferable causal augmentation. To model complex temporal causal dependencies, PTCD employs a dual-scale iterative attention mechanism to capture window-level causal relationships, and a Gaussian mixture with a context-level routing mechanism to handle heterogeneous exogenous distributions. To further address distribution shifts across causal graphs, PTCD adopts a pretraining paradigm on synthetic datasets that integrates intervention-based learning and a causal mixup strategy, promoting stable causal discovery and stronger generalization. Extensive experiments on multiple real-world out-of-distribution (OOD) datasets demonstrate that PTCD excels in both causal discovery and root cause identification.

2605.26754 2026-05-27 cs.CR cs.AI

Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control

Cordon-MAS:通过信息流控制防御 RAG 的知识投毒

Zhe Yu, Wenpeng Xing, Gaolei Li, Shuguang Xiong, Hongzhi Wang, Xuyang Teng, Meng Han

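AI summary: To counter Confundo-style poisoning attacks on retrieval-augmented generation (RAG), proposes the Cordon-MAS framework, which separates evidence extraction, cross-source audit, and answer synthesis into agents with asymmetric memory privileges, reduces attack success rate by 92.4% relative to undefended RAG, and reframes poisoning from a detection problem to an information-flow control problem.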

AI中文摘要

检索增强生成(RAG)日益支撑着高风险应用,但仍易受到 Confundo 式投毒攻击,其中对抗性优化的文档操纵生成的输出。现有防御假设检测到中毒证据即可防止危害。我们证明这一假设不正确:模型存在监控-控制差距——它们可以检测到检索证据中的矛盾,但仍会依据中毒声明行动。我们引入 Cordon 原则——任何能够进行最终合成的智能体都不得访问不可信的自然语言证据——并通过 CORDON-MAS 实现该原则,这是一个隔离框架,通过将证据提取、跨源审计和答案合成分离到具有非对称内存权限的智能体中,在架构上强制执行该原则。在五个 BEIR 数据集上,CORDON-MAS 相对于未防御的 RAG 将攻击成功率降低了 92.4%。这将 RAG 投毒问题从检测问题重新定义为信息流控制问题。

英文摘要

Retrieval-augmented generation (RAG) increasingly underpins high-stakes applications, yet remains vulnerable to Confundo-style poisoning where adversarially optimized documents manipulate generated outputs. Existing defenses assume that detecting poisoned evidence prevents harm. We show this assumption is incorrect: models exhibit a monitoring-control gap -- they can detect contradictions in retrieved evidence yet still act on poisoned claims. We introduce the Cordon Principle -- no agent capable of final synthesis may access untrusted natural-language evidence -- and realize it through CORDON-MAS, a compartmentalized framework that enforces this principle architecturally by separating evidence extraction, cross-source audit, and answer synthesis into agents with asymmetric memory privileges. Across five BEIR datasets, CORDON-MAS reduces attack success rate by 92.4% relative to undefended RAG. This reframes RAG poisoning from a detection problem to an information-flow control problem.

2605.26747 2026-05-27 cs.AI

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

面向口语处理任务的机器人-患者与医生-患者医疗对话数据集

Heriberto Cuayahuitl, Grace Jang

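AI summary: Presents MeDial-Speech, a dataset of real robot-patient and doctor-patient medical dialogue speech for training and evaluating medical AI, and evaluates three large language models on a sentence-selection benchmark.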

Journal ref
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2026)
AI中文摘要

大型语言模型(LLM)为人工智能(AI)带来了巨大改进,可应用于通用任务。然而,它们在文本或口语医疗咨询中的应用仍是一个开放的研究问题。本文提出MeDial-Speech,这是一个新颖的语音数据集,用于训练和评估能够与患者进行咨询的医疗AI。该数据集在真实环境中从机器人-患者和医生-患者对话中收集,包含111小时以上的语音数据(无数据增强),涵盖四种健康状况:路易体痴呆、心力衰竭、肩痛和心绞痛。此外,我们通过句子选择(20个选项)提出了一个对话基准,用于评估三个最先进的LLM:GPT-5 mini、DeepSeek-V3和Claude Sonnet 4。实验结果显示,Claude Sonnet 4在句子选择中表现最佳,使用人工转录的准确率为71.1%,使用自动转录的准确率为74.7%,并且所有LLM在其概率预测中高度过度自信,无论选择医疗对话中的正确或错误句子。该数据集对非商业用途免费提供,网址为:https://huggingface.co/datasets/hcuayahu/MeDial-Speech

英文摘要

Large Language Models (LLMs) have brought huge improvements to Artificial Intelligence (AI), which can be applied to general-purpose tasks. However, their application to textual or spoken medical consultations is still an open research problem. This paper proposes MeDial-Speech, a novel speech dataset for training and evaluating Med-AIs that can carry out consultations with patients. It was collected in realistic environments from robot-patient and doctor-patient dialogues, contains 111+ hours of speech data (without data augmentation), and covers four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. In addition, we propose a dialogue benchmark via sentence selection (with 20 options) to evaluate three state-of-the-art LLMs: GPT-5 mini, DeepSeek-V3, and Claude Sonnet 4. Experimental results reveal that Claude Sonnet 4 is the best in sentence selection, with 71.1% accuracy using manual transcriptions and 74.7% using automatic transcriptions, and that all LLMs are highly overconfident in their probabilistic predictions, regardless of selecting correct or incorrect sentences in medical dialogues. This dataset is free of charge for non-commercial purposes at: https://huggingface.co/datasets/hcuayahu/MeDial-Speech

2605.26744 2026-05-27 cs.CV

Self-Intersection-Aware 3D Human Motion Generation Using an Efficient Human Sphere Proxy

基于高效人体球代理的自交感知3D人体运动生成

Pascal Herrmann, Maarten Bieshaar, Dennis Mack, Robert Herzog, Juergen Gall

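AI summary: Proposes a self-intersection loss based on a human sphere proxy for training human motion generation models, reducing self-intersections by up to 49% while improving other evaluation metrics.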

Comments
Accepted to BMVC 2025
AI中文摘要

近年来,人体运动生成取得了巨大进展,最先进的方法在领先的评估基准上超越了真实数据。然而,对生成运动的视觉检查揭示了不同情况:即使是最先进的方法也经常生成包含自交(即身体部位相互穿透)的运动,这些强烈的伪影严重限制了感知到的运动质量。我们引入了一种新的损失函数,明确惩罚自交,用于人体运动生成方法的训练。我们的损失基于人体几何的球代理,与基于三角网格的类似方法相比,计算自交损失的速度快98%,内存使用减少83%。该损失与具体方法无关,我们将其添加到最近的人体运动生成方法(人体运动扩散模型MDM和MoMask)的训练中。大量实验表明,生成运动中的自交减少了高达49%,同时改善了其他评估指标。代码可在https://github.com/boschresearch/humansphereproxy获取。

英文摘要

Human motion generation has made tremendous progress in recent years, with state-of-the-art approaches surpassing ground truth data in leading evaluation benchmarks. However, visual inspection of the generated motions paints a different picture. Even state-of-the-art approaches generate motions frequently containing self-intersections, i.e., body parts interpenetrating, which are strong artifacts, severely limiting the perceived motion quality. We introduce a novel loss, which explicitly penalizes self-intersections, to the training of human motion generation methods. We base our loss on a sphere proxy of human geometry, which allows us to calculate a self-intersection loss 98% faster and uses 83% less memory than comparable methods based on triangular meshes. The loss is agnostic to the specific approach, and we add it to the training of the recent human motion generation methods human motion diffusion model (MDM) and MoMask. Our extensive experiments show a reduction of self-intersections in generated motions of up to 49% while improving other evaluation metrics. The code is available at https://github.com/boschresearch/humansphereproxy .
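As a rough illustration of why a sphere proxy makes self-intersection cheap to penalize, the sketch below sums squared penetration depths over sphere pairs. It is a minimal sketch under assumptions, not the authors' loss: the function name, the squared-depth penalty, and the `ignore_pairs` argument (for spheres that legitimately overlap, such as neighbors on the same limb) are all hypothetical.

```python
import math


def self_intersection_loss(spheres, ignore_pairs=frozenset()):
    """Sum of squared penetration depths over all sphere pairs.

    spheres: list of (center_xyz, radius) tuples approximating body parts.
    ignore_pairs: index pairs (i, j) with i < j that are allowed to overlap.
    Hypothetical formulation; the paper's exact loss may differ.
    """
    loss = 0.0
    for i in range(len(spheres)):
        for j in range(i + 1, len(spheres)):
            if (i, j) in ignore_pairs:
                continue
            (center_i, radius_i), (center_j, radius_j) = spheres[i], spheres[j]
            distance = math.dist(center_i, center_j)
            # Two spheres intersect when the center distance is smaller
            # than the sum of their radii; the shortfall is the depth.
            depth = max(0.0, radius_i + radius_j - distance)
            loss += depth ** 2
    return loss


# Toy body: two overlapping spheres and one far away.
toy_spheres = [((0, 0, 0), 0.6), ((1, 0, 0), 0.6), ((5, 0, 0), 0.5)]
toy_loss = self_intersection_loss(toy_spheres)
```

Each pair test needs only one distance and one comparison, which is the intuition behind the speed and memory advantage over triangle-mesh intersection tests reported in the abstract.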

2605.26741 2026-05-27 cond-mat.mtrl-sci cs.AI

MatFormBench: A Benchmarking Evaluation Framework for Target-Driven Materials Formulation

MatFormBench: 一个面向目标驱动材料配方的基准评估框架

Linhan Wu, Chenxi Wang, Chuhan Yang, Zhengwei Yang, Yuyang Liu

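AI summary: Existing materials machine learning benchmarks focus only on forward property prediction and lack evaluation of inverse optimization; proposes MatFormBench, a benchmark framework that integrates a physics-driven formulation generation scheme with a multi-dimensional scoring metric and systematically evaluates 39 inverse design algorithms.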

Comments
26 pages
AI中文摘要

材料的逆向设计显著推进了目标驱动的配方优化,然而现有的材料机器学习基准仍局限于正向属性预测,未能系统评估逆向优化和生成算法,这一关键差距阻碍了目标驱动材料设计的进展。为解决这一局限性,我们提出了MatFormBench,一个新颖的基准评估生态系统,专门用于评估和指导目标驱动配方的生成策略。MatFormBench集成了一个物理驱动的配方生成方案,用于生成忠实模拟真实材料结构-属性响应关系的合成样本,并辅以五个递增难度级别来量化这些关系的复杂性。为了严格评估算法性能,我们进一步提出了MatFormScore,一个多维指标,全面量化五个关键轴上的性能:目标成功率、搜索效率、探索能力、鲁棒性和稳定性。我们通过评估39种不同的逆向设计算法来验证MatFormBench,涵盖经典的代理辅助黑箱搜索、最先进的深度生成模型以及日益流行的基于大语言模型(LLM)的推荐策略。在1170次标准化算法-任务评估中,基于扩散的模型展现出最强的整体性能,而基于变分自编码器(VAE)和遗传算法(GA)的方法在特定场景中表现出独特优势。通过为目标驱动材料配方建立统一的评估标准,MatFormBench实现了可重复的基准测试、原则性的算法比较和逆向设计策略的诊断分析,为推进材料逆向设计提供了基础工具。

英文摘要

Inverse design of materials has significantly advanced target-driven formulation optimization, yet existing materials machine learning benchmarks remain limited to forward property prediction, failing to systematically evaluate inverse optimization and generation algorithms, a critical gap that hinders the progress of target-driven materials design. To address this limitation, we propose MatFormBench, a novel benchmarking ecosystem tailored to evaluate and guide generative strategies for target-driven formulation. MatFormBench integrates a physics-driven formulation generation scheme to generate synthetic samples that faithfully emulate realistic materials structure-property response relationships, complemented by five escalating difficulty levels to quantify the complexity of these relationships. To rigorously assess algorithm performance, we further propose MatFormScore, a multi-dimensional metric that comprehensively quantifies performance across five critical axes: target success, search efficiency, exploratory capacity, robustness, and stability. We validate MatFormBench by evaluating 39 diverse inverse design algorithms, covering classical surrogate-assisted black-box search, state-of-the-art deep generative models, and increasingly popular Large Language Model (LLM)-based recommendation strategies. Across 1170 standardized algorithm-task evaluations, diffusion-based models demonstrate the strongest overall performance, while Variational Autoencoder (VAE)-based and Genetic Algorithm (GA)-based methods exhibit distinct advantages in specific scenarios. By establishing a unified evaluation standard for target-driven materials formulation, MatFormBench enables reproducible benchmarking, principled algorithm comparison, and diagnostic analysis of inverse design strategies, providing a foundational tool for advancing materials inverse design.

2605.26738 2026-05-27 cs.CL

KARMA: Karma-Aligned Reward Model Adaptation

KARMA:基于Karma对齐的奖励模型适应

Jared Scott, Jesse Roberts

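AI summary: Proposes KARMA, a framework that trains a reward model on Reddit conversation data to predict context-dependent response valuation and fine-tunes language models via reinforcement learning to improve pragmatic ability; finds that the best reward model does not necessarily yield the best alignment, and that KARMA reduces factuality.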

AI中文摘要

人类交流依赖于隐含的社会信号,其有效性由语气、语境和对话规范塑造,而不仅仅是语义内容。我们引入了KARMA(Karma对齐的奖励模型适应),这是一个让LLM从大规模社交互动数据中学习语境敏感对话行为的框架。KARMA在Reddit对话上训练奖励模型,以预测基于语境的回应价值,并利用该信号通过强化学习微调语言模型,以提升语用中介任务的表现。关键的是,我们发现表现最好的奖励模型并未带来更好的下游模型对齐:一个完全依赖对话语境的奖励模型在预测Reddit karma方面表现更差,但产生了显著更好的下游性能。我们评估了KARMA应用于下游模型的效果,无论该模型是否直接接触社交媒体数据。得到的模型显示出改进的语用中介行为,同时很大程度上减轻了不良副作用。在所有条件下,KARMA都持续降低了事实性,包括下游模型未直接接触Reddit数据的情况,这表明这种张力嵌入在奖励信号本身中,而非由噪声训练数据引入。

英文摘要

Human communication depends on implicit social signals where effectiveness is shaped by tone, context, and conversational norms rather than semantic content alone. We introduce KARMA (Karma-Aligned Reward Model Adaptation), a framework for LLM learning of context-sensitive conversational behavior from large-scale social interaction data. KARMA trains a reward model on Reddit conversations to predict response valuation conditioned on context, and uses this signal to fine-tune language models via reinforcement learning to improve performance on pragmatics-mediated tasks. Critically, we find that the highest performing reward model does not lead to better downstream model alignment: a reward model relying exclusively on conversational context was a worse predictor of Reddit karma but yielded substantially better downstream performance. We evaluate the effects of KARMA applied to a downstream model with and without direct exposure to the social media data. The resulting models show improved pragmatics-mediated behaviors with largely mitigated undesirable side effects. Factuality is consistently diminished by KARMA across all conditions, including when the downstream model has no direct exposure to Reddit data, suggesting that this tension is embedded in the reward signal itself rather than introduced by noisy training data.

2605.26735 2026-05-27 cs.CL

Rethinking the Multilingual Reasoning Gap with Layer Swap

通过层交换重新思考多语言推理差距

Maxence Lasbordes, Amélie Chatelain, Djamé Seddah

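AI summary: Builds multilingual reasoning datasets and fine-tunes specialist models, finding that the performance gap between native reasoning and English-pivoted reasoning is much smaller than previously reported, and proposes Layer Swap, which transfers the English specialist's reasoning mid-layers into the native specialists to narrow the gap.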

AI中文摘要

最近的推理大语言模型在生成思维链(CoT)时主要使用英语,即使提示使用非英语语言。先前的研究表明,强制CoT保持输入语言(本地推理)会显著降低性能,而允许模型用英语推理然后用输入语言回答(英语枢轴推理)则表现更好。然而,大多数关于这种本地推理差距的研究依赖于推理时的干预或有限的本地语言训练数据。我们在更大规模且可比监督下重新审视这一比较。我们构建了涵盖六种语言(英语、法语、德语、西班牙语、中文和斯瓦希里语)的长篇多语言推理数据集;在Qwen/Qwen3-8B-Base基础上微调本地和英语枢轴两种模式的专家模型,并在数学、科学、通用知识和代码任务上进行评估。在此设置下,五种非英语语言的平均本地推理差距缩小至1.9-3.5%,远小于先前报道。对本地专家的权重空间分析显示,中间层的微调更新具有对齐性,而外层则存在分歧。这表明存在一个很大程度上与语言无关的推理核心,周围环绕着特定语言的层。利用这一结构,我们引入了层交换方法:将英语专家更强的推理中间层迁移到每个本地专家中,从而在保留目标语言CoT的同时,几乎消除了五种非英语语言的本地推理差距。我们发布了所有模型和数据集。

英文摘要

Recent reasoning Large Language Models produce a chain-of-thought (CoT) predominantly in English, even when prompted in non-English languages. Prior work suggests that forcing the CoT to remain in the input language (native reasoning) substantially degrades performance relative to allowing the model to reason in English before answering in the input language (English-pivoted reasoning). However, most studies of this native reasoning gap rely on inference-time interventions or limited native-language training data. We revisit this comparison at a larger scale and under comparable supervision. We construct long multilingual reasoning datasets across six languages (English, French, German, Spanish, Chinese and Swahili), fine-tune specialists in both native and English-pivoted regimes on top of Qwen/Qwen3-8B-Base, and evaluate across mathematics, science, general knowledge, and code. In this setting, the average native reasoning gap shrinks to 1.9-3.5% across the five non-English languages, considerably smaller than previously reported. Weight-space analysis of the native specialists reveals aligned fine-tuning updates in the middle layers and divergence in the outer layers. This points to a largely language-agnostic reasoning core surrounded by language-specific layers. Exploiting this structure, we introduce a Layer Swap: transferring the English specialist's stronger reasoning mid-layers into each native specialist, closing most of the native reasoning gap across the five non-English languages while preserving CoT in the target language. We release all models and datasets.
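The Layer Swap idea can be pictured as a merge of two checkpoints that share an architecture. The sketch below is a hedged illustration only: the `layers.<index>.<param>` key naming, the plain-dict checkpoints, and the choice of which layer range counts as the middle are assumptions, not details given in the abstract.

```python
def layer_swap(native, english, start, end):
    """Return a copy of `native` whose layers start..end-1 come from `english`.

    native, english: dicts mapping parameter names to weights, assumed to
    share the same keys. Keys are assumed to look like "layers.<i>.<name>";
    any other key (embeddings, output head) is kept from `native`.
    """
    merged = dict(native)
    for name, value in english.items():
        parts = name.split(".")
        is_layer_param = len(parts) >= 2 and parts[0] == "layers" and parts[1].isdigit()
        if is_layer_param and start <= int(parts[1]) < end:
            merged[name] = value
    return merged


# Toy checkpoints with four layers; strings stand in for weight tensors.
native_ckpt = {"embed.weight": "fr-embed"}
english_ckpt = {"embed.weight": "en-embed"}
for index in range(4):
    native_ckpt[f"layers.{index}.weight"] = f"fr-{index}"
    english_ckpt[f"layers.{index}.weight"] = f"en-{index}"

# Take the two middle layers from the English specialist.
swapped = layer_swap(native_ckpt, english_ckpt, start=1, end=3)
```

In this toy case the outer layers and the embedding stay with the native specialist, mirroring the abstract's picture of a shared reasoning core surrounded by language-specific layers.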

2605.26734 2026-05-27 cs.CV

CIRCLED: A Multi-turn CIR Dataset with Consistent Dialogues across Domains

CIRCLED:跨领域一致对话的多轮CIR数据集

Tomohisa Takeda, Yu-Chieh Lin, Yuji Nozawa, Youyang Ng, Osamu Torii, Yusuke Matsui

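AI summary: To address the lack of dialogue-history consistency and the domain restriction of existing MTCIR datasets, constructs CIRCLED by extending FashionIQ, CIRR, and CIRCO, generating multi-turn sessions with a CIReVL-based retrieval pipeline and applying multiple filters to ensure quality; the result is 22,608 multi-turn sessions across nine subsets, a substantial gain in scale and generality.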

AI中文摘要

现有的多轮组合图像检索(MTCIR)数据集缺乏对话历史一致性,且仅限于时尚领域。为解决这些限制,我们通过扩展FashionIQ、CIRR和CIRCO构建了CIRCLED。在CIRCLED中,每一轮的查询逐步逼近目标图像。数据通过基于CIReVL的检索流水线生成,并经过检索成功、轮次长度、一致性和信息冗余等多重过滤以确保质量。我们总共收集了涵盖九个子集的22,608个多轮会话,在规模和通用性上显著超过Multi-turn FashionIQ(11,505个会话)。我们进一步应用了多种基线方法,并在CIRCLED上定量评估了检索准确性。我们的工作提供了一个实用、高质量的基准,以促进未来多轮CIR的研究。数据集和代码公开于https://huggingface.co/datasets/tk1441/CIRCLED和https://github.com/mti-lab/circled。

英文摘要

Existing Multi-Turn Composed Image Retrieval (MTCIR) datasets lack dialogue-history consistency and are restricted to the fashion domain. To address these limitations, we construct CIRCLED by extending FashionIQ, CIRR, and CIRCO. In CIRCLED, the query at each turn progressively approaches the target image. Data are generated via a CIReVL-based retrieval pipeline and curated with multiple filters on retrieval success, turn length, consistency, and information redundancy to ensure quality. In total, we collect 22,608 multi-turn sessions across nine subsets, substantially exceeding Multi-turn FashionIQ (11,505 sessions) in both scale and generality. We further apply multiple baseline methods and quantitatively assess retrieval accuracy on CIRCLED. Our work provides a practical, high-quality benchmark to facilitate future research on multi-turn CIR. The dataset and code are publicly available at https://huggingface.co/datasets/tk1441/CIRCLED and https://github.com/mti-lab/circled.

2605.26733 2026-05-27 cs.LG cs.AI

Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

循环语言模型中测试时可扩展潜在推理的稳定循环动力学

Xiao-Wen Yang, Ziyu Han, Xi-Hua Zhang, Wen-Da Wei, Jie-Jing Shao, Lan-Zhe Guo, Yu-Feng Li

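AI summary: Proposes STARS, a training framework that uses Jacobian spectral radius regularization to constrain latent states to approach asymptotically stable fixed points, addressing the performance collapse of looped language models under deep recurrence and enabling reliable test-time scaling with improved peak performance.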

Comments
ICML 2026
AI中文摘要

循环语言模型(LoopLMs)通过深度递归实现高效的潜在推理,但表现出不可靠的测试时缩放行为:性能通常在某个迭代深度达到峰值,然后随着进一步递归而崩溃。通过潜在动力学分析,我们发现现有架构和策略在稳定性和有效性之间存在固有的权衡。通过将推理概念化为不确定性减少,我们提出收敛到稳定不动点同时保持有效性是一种有前景的方法。为此,我们提出了STARS(稳定性驱动的递归缩放),一种训练框架,约束潜在状态趋近渐近稳定不动点。这通过高效的雅可比谱半径正则化和随机循环采样实现,使STARS能够在确保严格稳定性的同时最大化有效性。在算术任务上的实验表明,STARS实现了可靠的测试时缩放,在复杂数学推理中,它显著减轻了随着递归深度增加而出现的性能退化,同时提高了峰值性能。

英文摘要

Looped Language Models (LoopLMs) enable efficient latent reasoning through depth recurrence, yet exhibit unreliable test-time scaling behavior: performance often peaks at a certain iteration depth and then collapses with further recurrence. Through latent dynamics analysis, we find an inherent trade-off between stability and effectiveness in existing architectures and strategies. By conceptualizing reasoning as uncertainty reduction, we propose that convergence toward stable fixed points while preserving effectiveness is a promising approach. To this end, we propose STARS (STAbility-driven Recurrent Scaling), a training framework that constrains latent states to approach asymptotically stable fixed points. This is realized via efficient Jacobian Spectral Radius Regularization with random loop sampling, enabling STARS to maximize effectiveness while ensuring rigorous stability. Experiments on arithmetic tasks show that STARS achieves reliable test-time scaling, and on complex mathematical reasoning it substantially mitigates performance degradation as recurrence depth increases while also improving peak performance.
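A fixed point of an iterated map is asymptotically stable when the spectral radius of the map's Jacobian at that point is below 1, which is presumably why the regularizer targets that quantity. The sketch below estimates the spectral radius by power iteration on a small explicit matrix and penalizes any excess over 1. It is an illustration under assumptions: the function names and the squared-hinge penalty are hypothetical, the abstract does not say how STARS computes the estimate, and plain power iteration only converges when the dominant eigenvalue is real and unique.

```python
import math


def spectral_radius(jacobian, iters=200):
    """Estimate the largest eigenvalue magnitude by power iteration.

    jacobian: square matrix as a list of rows. Assumes a unique, real
    dominant eigenvalue and a start vector not orthogonal to its
    eigenvector.
    """
    size = len(jacobian)
    vector = [1.0 / math.sqrt(size)] * size
    rho = 0.0
    for _ in range(iters):
        product = [sum(jacobian[i][k] * vector[k] for k in range(size)) for i in range(size)]
        rho = math.sqrt(sum(x * x for x in product))
        if rho == 0.0:
            return 0.0
        vector = [x / rho for x in product]
    return rho


def stability_penalty(jacobian, target=1.0):
    """Squared hinge on the spectral radius; zero when the map is contractive."""
    return max(0.0, spectral_radius(jacobian) - target) ** 2


stable_map = [[0.5, 0.1], [0.0, 0.3]]     # eigenvalues 0.5 and 0.3
unstable_map = [[1.5, 0.0], [0.0, 0.2]]   # eigenvalues 1.5 and 0.2
penalties = (stability_penalty(stable_map), stability_penalty(unstable_map))
```

For a real model the Jacobian would not be materialized; Jacobian-vector products from automatic differentiation would replace the explicit matrix multiplication.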

2605.26732 2026-05-27 cs.LG

APEX: Amplitude Anchors and Phase Priors for Target-Scarce Higher-Frequency Wave Prediction

APEX: 针对稀缺目标的高频波预测的幅度锚定与相位先验

Yifan Sun, Lei Cheng, Sijie Chen, Ting Zhang, Jianlong Li, Shikai Fang

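AI summary: Proposes APEX, a framework that uses the amplitude predicted by a lower-frequency neural operator as an anchor, combined with a Green's-function-inspired phase prior and a conditional flow-matching enhancer, to predict higher-frequency wave fields when target data are scarce, outperforming direct extrapolation and joint generative baselines on multiple benchmarks.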

AI中文摘要

基于学习的替代模型在波场预测中日益有效,特别是神经算子在观测频率范围内表现出色。然而,在目标监督稀缺的情况下,高频预测仍相对未被充分探索,尤其是在高频数据模拟或测量成本远高于低频数据的波动问题中。一个核心困难是跨频率迁移本质上是不对称的:粗粒度幅度结构在不同频率间保持相对稳定,而相位敏感的振荡结构随着频率增加而迅速恶化。受此不对称性启发,我们提出APEX(从外推粗预测中进行的幅度锚定和相位先验引导增强),一个针对目标稀缺高频波场预测的框架。低频神经算子首先在目标频率范围内提供粗预测,我们仅保留幅度作为可迁移的结构锚点。然后,条件流匹配增强器在格林函数启发的相位先验指导下重建目标高频场。在SimpleWave、Helmholtz和Maxwell基准上的实验表明,在有限的目标频率监督下,APEX始终优于直接的低频到高频外推、目标自适应算子和联合生成基线。我们的结果表明,振荡波场的可靠高频预测不应依赖于完整复数场的直接端到端迁移,而应显式重用可迁移的粗粒度结构,同时单独恢复缺失的振荡细节。

英文摘要

Learning-based surrogates have become increasingly effective for wave-field prediction, and neural operators in particular have shown strong performance within observed frequency regimes. However, higher-frequency prediction under scarce target supervision remains comparatively underexplored, especially in wave problems where higher-frequency data are substantially more expensive to simulate or measure than lower-frequency data. A central difficulty is that cross-frequency transfer is inherently asymmetric: coarse amplitude structure remains relatively stable across frequencies, whereas phase-sensitive oscillatory structure deteriorates much more rapidly as frequency increases. Motivated by this asymmetry, we propose APEX, Amplitude-anchored and Phase-prior-guided Enhancement from eXtrapolated coarse predictions, a framework for target-scarce higher-frequency wave-field prediction. A lower-frequency neural operator first provides a coarse prediction in the target-frequency regime, from which we retain only the amplitude as a transferable structural anchor. A conditional flow-matching enhancer then reconstructs the target higher-frequency field under the guidance of a Green's-function-inspired phase prior. Experiments on SimpleWave, Helmholtz, and Maxwell benchmarks show that APEX consistently outperforms direct lower-to-higher extrapolation, target-adapted operator, and joint generative baselines under limited target-frequency supervision. Our results suggest that reliable higher-frequency prediction of oscillatory wave fields should not rely on direct end-to-end transfer of the full complex field, but instead on explicitly reusing transferable coarse structure while separately recovering the missing oscillatory detail.

2605.26731 2026-05-27 cs.AI cs.CL

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

不是能力问题:LLM 智能体层级间的驾驭敏感性非单调

Yong-eun Cho

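AI summary: Across 432 runs, finds that the harness sensitivity of LLM agents varies non-monotonically with model tier and depends on model type (chat vs. reasoning), refuting the assumption that higher-capability models need less structural guidance.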

Comments
9 pages, 3 figures
AI中文摘要

LLM 智能体部署中的一个普遍假设是,更结构化的驾驭方式普遍能提高可靠性,并且能力更强的模型需要成比例地减少结构指导——这共同暗示了模型能力层级与最优驾驭复杂度之间存在单调反比关系。我们通过一个受控的 432 次实验来检验这一假设,实验跨越了四个能力层级的六个模型,在 HEAT-24(一个基于 git 工作区验证的 24 任务合成基准)上采用了三种驾驭条件(轻量、平衡、严格)。我们的结果从两个方面反驳了单调反比关系。首先,对于评估的前沿聊天模型(Gemini 2.5 Flash),增加驾驭冗长度使 VTSR 降低 29-38 个百分点——这是一个驾驭复杂度悖论。其次,对于评估的前沿推理模型(Qwen3.5-122B,启用扩展思考),严格驾驭实现了最高的 VTSR(91.7%)和最低的延迟,与预测相反。在受限层级内,一个 2B 模型(Gemma4:e2B)在所有驾驭条件下均以 91.7% 的 VTSR 达到了强开放层级的稳定性。由于本研究中每个层级仅由一个模型代表,这些结果应解释为模型特定的观察;驾驭敏感性在所评估的模型中呈现非单调性,并且关键依赖于模型类型(聊天 vs. 推理)。我们引入了一个六标签失败分类法,显示格式违规主导了能力强的模型失败,而错误文件主导了低能力失败,并推导出了实用的层级感知驾驭选择指南。

英文摘要

A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance -- together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29-38 percentage points -- a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B, extended thinking enabled), strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong-open-tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be interpreted as model-specific observations; harness sensitivity appears non-monotone across the models evaluated, and depends critically on model type (chat vs. reasoning). We introduce a six-label failure taxonomy showing that format_violation dominates capable-model failures while wrong_file dominates low-capability failures, and we derive practical tier-aware harness selection guidelines.
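The 432-run figure is consistent with a full crossing of the factors named in the abstract, with one run per combination: 6 models x 3 harness conditions x 24 tasks = 432. That reading is an inference from the numbers, not a design stated in the abstract, and the model and task identifiers below are placeholders.

```python
from itertools import product

# Placeholder identifiers; only the counts come from the abstract.
models = [f"model_{i}" for i in range(6)]
harnesses = ["light", "balanced", "strict"]
tasks = [f"task_{i:02d}" for i in range(24)]

# One run per (model, harness, task) combination is an assumption.
runs = list(product(models, harnesses, tasks))
run_count = len(runs)
```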

2605.26729 2026-05-27 cs.CV

Learning Reference-Guided Exposure Correction with Hybrid Illumination Characteristics

基于混合光照特性的参考引导曝光校正

Hao Ren, Zetong Bi, Zhaoliang Wan, Hui Cheng

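AI summary: Proposes HICNet, a reference-guided exposure correction framework that extracts illumination embeddings with a lightweight encoder and combines FiLM-based global adjustment with Photometric Channel Rebalancing for fine-grained exposure matching; trained without ground truth or intrinsic decomposition, it attains better accuracy on benchmarks and generalizes to unseen scenes.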

Comments
ICASSP2026
AI中文摘要

我们提出了HICNet,一个参考引导的曝光校正框架。一个轻量级、内容无关的编码器将每张图像蒸馏成一个紧凑的光照嵌入,捕获区域亮度、边缘对比度和高阶亮度矩。源图像与其参考图像之间的嵌入差异驱动一个多尺度调制网络,该网络结合基于FiLM的全局调整和光度通道重平衡,实现细粒度的、光照感知的光谱门控,产生曝光匹配的输出,同时忠实保留场景细节。跨批次对比损失对光照流形进行排序,增强了对不同光照条件的鲁棒性。在没有真值或内在分解的情况下训练,HICNet在公共基准测试上达到了更好的精度,并且能够很好地泛化到完全未见过的场景。

英文摘要

We present HICNet, a reference-guided exposure correction framework. A lightweight, content-agnostic encoder distills each image into a compact illumination embedding capturing regional brightness, edge contrast, and higher-order luminance moments. The embedding difference between a source and its reference drives a multi-scale modulation network that combines FiLM-based global adjustment with Photometric Channel Rebalancing for fine-grained, illumination-aware spectral gating, producing exposure-matched outputs while faithfully preserving scene details. A cross-batch contrastive loss orders the illumination manifold, bolstering robustness to diverse lighting conditions. Trained without ground truth or intrinsic decomposition, HICNet attains better accuracy on public benchmarks and generalizes well to entirely unseen scenes.

2605.26726 2026-05-27 eess.IV cs.AI cs.CV

Measuring Prediction Uncertainty in Neural Cellular Automata

神经细胞自动机中的预测不确定性测量

Ario Sadafi, Michael Deutges, Nassir Navab, Carsten Marr

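AI summary: Proposes an uncertainty measure based on dynamical-system convergence that perturbs the automaton state and observes prediction stability to assess the trustworthiness of neural cellular automata in medical image segmentation.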

Comments
Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026
AI中文摘要

神经细胞自动机(NCA)为编码器-解码器分割网络提供了一种轻量级替代方案。然而,决定何时应信任预测可能很困难。在这里,我们研究基于NCA的医学图像分割的不确定性估计,无需修改底层架构或重新训练模型。我们的方法通过将NCA视为一个动态系统来激发,其中收敛吸引子对应于可信预测。具体地,我们提出了弹性(resilience),这是一种简单的度量,通过探测在自动机状态微小扰动下最终预测的稳定性来利用NCA固有的迭代结构。返回相同解的预测被认为是可信的,而显著变化的预测被标记为不确定。我们使用选择性预测指标($\Delta$Dice@90和AURC)和排序指标(AUROC和AUPRC)通过其预测分割质量的能力来评估不确定性。在多个医学分割基准测试中,弹性比基线更可靠地识别失败案例,提高了基于NCA模型的信任度和安全性。

英文摘要

Neural cellular automata (NCA) provide a lightweight alternative to encoder-decoder segmentation networks. However, it can be difficult to decide when a prediction should be trusted. Here, we study uncertainty estimation for NCA-based medical image segmentation without modifying the underlying architecture or retraining the model. Our approach is motivated by viewing the NCA as a dynamical system where convergent attractors correspond to confident predictions. Concretely, we propose resilience, a simple measure that leverages the intrinsic iterative structure of NCAs by probing the stability of the final prediction under small perturbations of the automaton state. Predictions that return to the same solution are deemed confident, while those that change substantially are flagged as uncertain. We evaluate uncertainty by its ability to predict segmentation quality using selective prediction metrics ($\Delta$Dice@90 and AURC) and ranking metrics (AUROC and AUPRC). Across multiple medical segmentation benchmarks, resilience identifies failure cases more reliably than baselines, improving trust and safety in NCA-based models.
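The perturb-and-rerun idea behind resilience can be shown on a toy iterative system. The sketch below is not the paper's definition: the noise model, the number of trials, the 0.5 threshold, and the use of Dice agreement with the unperturbed prediction are assumptions, and `toy_step` merely stands in for a trained NCA update rule.

```python
import random


def run(step, state, n_steps):
    """Apply the update rule `step` to `state` n_steps times."""
    for _ in range(n_steps):
        state = step(state)
    return state


def dice(mask_a, mask_b):
    """Dice overlap of two boolean masks; 1.0 when both are empty."""
    intersection = sum(a and b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 1.0 if total == 0 else 2.0 * intersection / total


def resilience(step, final_state, n_steps=20, eps=0.05, trials=8, seed=0):
    """Mean Dice between the original prediction and predictions obtained
    after perturbing the final state and iterating again."""
    rng = random.Random(seed)
    baseline = [s > 0.5 for s in final_state]
    scores = []
    for _ in range(trials):
        noisy = [s + rng.uniform(-eps, eps) for s in final_state]
        prediction = [s > 0.5 for s in run(step, noisy, n_steps)]
        scores.append(dice(baseline, prediction))
    return sum(scores) / trials


def toy_step(state):
    """Each cell moves halfway toward 0 or 1, whichever side of 0.5 it is on."""
    return [s + 0.5 * ((1.0 if s > 0.5 else 0.0) - s) for s in state]


confident = resilience(toy_step, [0.0, 1.0, 1.0, 0.0])   # cells sit on attractors
uncertain = resilience(toy_step, [0.5, 0.5, 0.5, 0.5])   # cells sit on the boundary
```

Cells resting on an attractor return to the same prediction after a small perturbation, while cells on the decision boundary can flip, which lowers the score.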

2605.26725 2026-05-27 cs.CV

Joint 2D-3D Segmentation and Association in Street-level Imaging

街景成像中的联合2D-3D分割与关联

Amir Melnikov, Masayuki Tanaka, Yusuke Monno, Masatoshi Okutomi

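AI summary: Proposes a unified framework that combines zero-shot detection and segmentation with structure-from-motion, replacing traditional 2D multi-object tracking with a 3D-driven geometric-consistency mechanism to achieve stable cross-view segmentation and identity association in street-level imagery, with a 22% performance gain in challenging urban scenarios.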

Comments
15 pages, 6 image figures, 1 in-body table, 1 in-body algorithm, 2 indexes with tables
AI中文摘要

准确解读街景图像对于大规模城市地图绘制和创建空间数字孪生环境至关重要。本文提出了一个用于联合2D-3D分割与关联的统一框架,该框架将视觉语义与多视图几何推理相结合。与依赖时序帧进行跟踪的传统方法不同,我们的方法利用零样本检测和分割,结合运动恢复结构重建,建立稳定的跨视图对应关系。3D驱动的关联机制取代了传统的2D多目标跟踪,利用几何一致性指导宽基线视角和不同成像条件下的身份保持。通过结合2D纹理线索和全局3D上下文,所提出的管道非常适合可扩展的街景处理,并可适用于多种对象类型。实验表明,与最先进的纯2D跟踪方法相比,我们的方法显著提高了对真实序列的覆盖率和更鲁棒的身份保持,在挑战性城市场景中实现了22%的性能提升。

英文摘要

Accurate interpretation of street-level imagery is essential for large-scale urban mapping and the creation of Spatial Digital Twin (SDT) environments. This work presents a unified framework for joint 2D-3D segmentation and association that integrates visual semantics with multi-view geometric reasoning. Unlike conventional approaches that rely heavily on sequential frames for temporal tracking, our method leverages zero-shot detection and segmentation together with structure-from-motion reconstruction to establish stable cross-view correspondences. A 3D-driven association mechanism replaces traditional 2D multi-object tracking, using geometric consistency to guide identity preservation across wide-baseline viewpoints and varying imaging conditions. By combining 2D texture cues with global 3D context, the proposed pipeline is well-suited for scalable street-level processing and can be used for a variety of object types. Experiments demonstrate substantially improved coverage of ground-truth sequences and more robust identity retention compared to state-of-the-art 2D-only tracking methods, achieving a 22% performance gain in challenging urban scenarios.

2605.26720 2026-05-27 cs.AI

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

面向CUDA内核生成中自进化LLM代理的反馈到计划决策

Yee Hin Chong, Jiaming Wu, Youhui Zhang, Peng Qu

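AI summary: Using trajectory freezing and selective feedback injection, proposes the CUDAnalyst framework to attribute planning decisions to feedback components, showing that explicit planning helps only when feedback is aligned and that effective planning emerges from structured multi-feedback interactions.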

Comments
ICML 2026 accpeted, camera-ready in progress
AI中文摘要

大型语言模型(LLMs)作为自进化代理在CUDA内核生成中展现出强大的实证收益,这得益于跨代际的反馈条件规划。然而,规划决策如何归因并组合异构反馈信号仍不透明。标准的端到端消融无法解决这一问题,因为迭代规划放大了早期扰动,并将反馈效应与轨迹依赖漂移混为一谈。我们引入CUDAnalyst,一个统一的分析层,通过轨迹冻结和选择性反馈注入,实现对规划决策到反馈组件的受控、代际级归因。CUDAnalyst支持稳定的代际级评估和原则性的联盟式反馈效应及交互归因。我们的结果表明,显式规划仅在反馈对齐时有益,有效规划源于结构化的多反馈交互,且来自更强推理模型的高级规划可部分迁移至较弱模型。这些趋势在参考骨干网络、代表性工作负载和参考归纳机制中保持一致,表明在所研究的受控轴内,识别出的反馈到规划结构是稳健的。

英文摘要

Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. However, how planning decisions attribute and combine heterogeneous feedback signals remains opaque. Standard end-to-end ablations fail to resolve this question, as iterative planning amplifies early perturbations and conflates feedback effects with trajectory-dependent drift. We introduce CUDAnalyst, a unified analysis layer for controlled, generation-level attribution of planning decisions to feedback components via trajectory freezing and selective feedback injection. CUDAnalyst enables stable generation-level evaluation and principled coalitional-style attribution of feedback effects and interactions. Our results show that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones. These trends hold across reference backbones, representative workloads, and reference induction regimes, indicating that the identified feedback-to-plan structure is robust within the controlled axes studied.

2605.26718 2026-05-27 cs.LG

MTL-FNO: A Lightweight Multi-Task Fourier Neural Operator for Sparse Field Reconstruction

MTL-FNO:一种用于稀疏场重建的轻量级多任务傅里叶神经算子

Siyu Ye, Shihang Li, Zhiqiang Gong, Benrong Zhang, Weien Zhou, Yiyong Huang, Wen Yao

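AI summary: To address the large model size and the difficulty of exploiting cross-field correlations in multi-field sparse reconstruction for aerospace vehicles, proposes MTL-FNO, a lightweight multi-task Fourier neural operator based on hard parameter sharing, which uses polar-decomposition-based decoupled optimization and the Cayley transform for efficient joint training, reducing model size by 76% and 60% under few-shot conditions with comparable or better accuracy.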

AI中文摘要

高效的星载多场稀疏重建对于航空航天飞行器的自主运行至关重要。虽然现有的深度学习模型在单场重建中表现出潜力,但部署多个独立模型会导致模型尺寸急剧增长,并且无法利用跨场相关性,尤其是在少样本条件下。为了解决这些挑战,我们首先提出了一种轻量级多任务傅里叶神经算子(MTL-FNO),这是一种基于硬参数共享的端到端联合训练框架。在每一层中,参数被分为共享部分和任务特定部分,以捕获各场之间的共同特征,同时保留任务特定特征。此外,任务特定的微调参数被实现为低秩项,实现了显著的模型压缩。其次,为了解决共享参数和任务特定参数及其实部和虚部联合优化的困难,我们从极坐标形式的角度重新审视了FNO的谱权重,并设计了一种具有物理意义的解耦优化方案。具体地,我们应用极分解将谱权重逐片解耦为编码相位信息的酉张量和表征振幅的半正定张量。通过解耦相位和振幅的优化,我们的方法可以有效缓解任务冲突。同时,为了在训练过程中保持酉几何保真度,引入Cayley变换对酉张量进行重参数化,将约束优化问题转化为无约束优化问题。最后,在两个代表性工程案例上验证了所提方法在少样本条件下的有效性。结果表明,MTL-FNO达到了与标准FNO相当甚至更优的精度,同时分别将总模型大小减少了76%和60%。

英文摘要

Efficient onboard multi-field sparse reconstruction is essential for the autonomous operation of aerospace vehicles. While existing deep learning models exhibit promise for single-field reconstruction, deploying multiple independent models leads to prohibitive model size growth and fails to exploit cross-field correlations, particularly under few-shot conditions. To address these challenges, we first propose a lightweight multi-task Fourier neural operator (MTL-FNO), an end-to-end joint training framework based on hard parameter sharing. In each layer, the parameters are divided into shared and task-specific components to capture common features across fields while preserving task-specific characteristics. Moreover, the task-specific fine-tuning parameters are implemented as low-rank terms, achieving substantial model compression. Second, to address the difficulty of co-optimizing shared and task-specific parameters along with their real and imaginary parts, we revisit the FNO's spectral weight from a polar-form perspective and devise a physically meaningful decoupled optimization scheme. Specifically, we apply polar decomposition to slice-wise disentangle the spectral weight into a unitary tensor encoding phase information and a positive semi-definite tensor characterizing amplitude. By decoupling the optimization of phase and amplitude, our method can effectively mitigate task conflicts. Meanwhile, to preserve unitary geometric fidelity during training, the Cayley transform is introduced to reparameterize the unitary tensor, converting the constrained optimization problem to an unconstrained one. Finally, the effectiveness of the proposed method under few-shot conditions is validated on two representative engineering cases. Results show that MTL-FNO achieves accuracy comparable to or even surpassing that of standard FNO, while reducing total model size by 76% and 60%, respectively.
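The polar-form view and the Cayley reparameterization are easiest to see in the scalar (1x1) case, where a complex weight splits into a unit-modulus phase factor and a non-negative amplitude. The sketch below covers only that special case and is an illustration, not the paper's implementation: the function names are hypothetical, and the sign convention used for the Cayley transform is one common choice that the abstract does not specify.

```python
def polar_decompose(weight):
    """Split a complex scalar into (phase, amplitude) with weight = phase * amplitude.

    This is the 1x1 analogue of the matrix polar decomposition W = U P,
    where U is unitary and P is positive semi-definite.
    """
    amplitude = abs(weight)
    phase = weight / amplitude if amplitude > 0 else 1 + 0j
    return phase, amplitude


def cayley(theta):
    """Map any real `theta` to a unit-modulus complex number.

    With a = i * theta (skew-Hermitian in the 1x1 case), the Cayley
    transform (1 - a) / (1 + a) always has modulus 1, so `theta` can be
    optimized without constraints. Assumed convention.
    """
    a = 1j * theta
    return (1 - a) / (1 + a)


phase, amplitude = polar_decompose(3 + 4j)   # amplitude is 5.0
unit_phase = cayley(0.7)                     # lies on the unit circle
rebuilt_weight = unit_phase * amplitude      # same amplitude, new phase
```

In the matrix case the scalar division becomes a matrix inverse and `theta` becomes a skew-Hermitian matrix, but the principle is the same: any unconstrained parameter value yields a valid unitary factor.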