arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.16358 2026-05-28 cs.LG cs.CL

SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics

SaFeR-Steer:通过合成引导和反馈动力学进化多轮多模态大语言模型

Haolong Hu, Hanyu Li, Tiancheng He, Huahui Yi, An Zhang, Qiankun Li, Kun Wang, Yang Liu, Zhigang Zeng

Affiliations: Huazhong University of Science and Technology; Beijing University of Posts and Telecommunications; West China Biomedical Big Data Center, Sichuan University; School of Public Policy and Administration, Chongqing University; Nanyang Technological University

AI summary: Proposes the SaFeR-Steer framework, which trains a single student model through staged synthetic bootstrapping and tutor-in-the-loop GRPO, and introduces the Trajectory-Consistent Summative Reward (TCSR) to counter long-context safety decay in multi-turn safety alignment, substantially improving multi-turn safety and helpfulness.

AI Chinese abstract

多模态大语言模型(MLLMs)越来越多地部署在多轮场景中,攻击者可以通过不断演变的视觉-文本历史升级不安全意图,并利用长上下文安全衰减。然而,安全对齐仍然以单轮数据和固定模板对话为主,导致训练与部署之间存在不匹配。为弥补这一差距,我们提出SaFeR-Steer,一种渐进式多轮对齐框架,结合分阶段合成引导和导师参与的GRPO,在自适应、在线策略攻击下训练单个学生模型。我们还引入了轨迹一致总结奖励(TCSR),该奖励聚合了历史最小值和回合奖励的平均值,使得任何低质量回合都会影响轨迹级别的回报。I. 数据集。我们发布STEER,一个多轮多模态安全数据集,包含STEER-SFT(12,934)、STEER-RL(2,000)和STEER-Bench(3,227)对话,回合数为2-10。II. 实验。从Qwen2.5-VL-3B/7B开始,SaFeR-Steer在单轮基准(3B:48.30/45.86 → 81.84/70.77;7B:56.21/60.32 → 87.89/77.40)和多轮基准(3B:12.55/27.13 → 55.58/70.27;7B:24.66/46.48 → 64.89/72.35)上显著提高了安全性/有用性,将失败转移到后续回合,并产生了超越单纯扩展的鲁棒性。代码可在https://anonymous.4open.science/r/SaFeR-Steer获取。

English abstract

MLLMs are increasingly deployed in multi-turn settings, where attackers can escalate unsafe intent through the evolving visual-text history and exploit long-context safety decay. Yet safety alignment is still dominated by single-turn data and fixed-template dialogues, leaving a mismatch between training and deployment. To bridge this gap, we propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student under adaptive, on-policy attacks. We also introduce Trajectory-Consistent Summative Reward (TCSR), which aggregates the historical minimum and average of turn rewards so that any low-quality turn affects the trajectory-level return. I. Dataset. We release STEER, a multi-turn multimodal safety dataset with STEER-SFT (12,934), STEER-RL (2,000), and STEER-Bench (3,227) dialogues spanning 2-10 turns. II. Experiment. Starting from Qwen2.5-VL-3B/7B, SaFeR-Steer substantially improves Safety/Helpfulness on both single-turn (48.30/45.86 $\rightarrow$ 81.84/70.77 for 3B; 56.21/60.32 $\rightarrow$ 87.89/77.40 for 7B) and multi-turn benchmarks (12.55/27.13 $\rightarrow$ 55.58/70.27 for 3B; 24.66/46.48 $\rightarrow$ 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone. Code is available at https://anonymous.4open.science/r/SaFeR-Steer
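The abstract states only that TCSR aggregates the historical minimum and the average of turn rewards. A minimal sketch of that idea follows, assuming a convex combination with a hypothetical weight `lam` and taking the minimum over the whole trajectory; the paper's exact formula may differ.

```python
from statistics import mean


def tcsr(turn_rewards, lam=0.5):
    """Hypothetical trajectory-level reward: lam * min + (1 - lam) * mean.

    The weight `lam` and the convex combination are assumptions made for
    illustration; the abstract does not give the exact aggregation.
    """
    if not turn_rewards:
        raise ValueError("need at least one turn reward")
    return lam * min(turn_rewards) + (1 - lam) * mean(turn_rewards)


# Illustrative per-turn rewards, not results from the paper.
all_good = tcsr([0.9, 0.9, 0.9, 0.9])
one_bad = tcsr([0.9, 0.9, 0.1, 0.9])
mean_only = mean([0.9, 0.9, 0.1, 0.9])
```

With these toy values a single low-quality turn pulls the trajectory reward well below the plain average, which is the behaviour the abstract describes.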

2604.15898 2026-05-28 cs.AI

Towards Rigorous Explainability by Feature Attribution

通过特征归因实现严格可解释性

Olivier Létoffé, Xuanxiang Huang, Joao Marques-Silva

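Affiliations: IRIT, University of Toulouse, France; Nanyang Technological University, Singapore; ICREA & Univ. Lleida, Spain

AI summary: This paper surveys progress on using rigorous symbolic explainable-AI methods in place of non-rigorous non-symbolic ones (such as SHAP) for assigning relative feature importance.

AI Chinese abstract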

大约十年来,非符号化方法一直是解释复杂机器学习(ML)模型的首选。不幸的是,这些方法缺乏严格性,可能误导人类决策者。在ML的高风险应用中,缺乏严格性尤其成问题。一个典型的不严格性证明例子是在可解释人工智能(XAI)中采用Shapley值,工具SHAP就是一个普遍的例子。本文概述了当前使用严格的符号化XAI方法作为非严格非符号化方法替代方案的努力,具体用于分配相对特征重要性。

English abstract

For around a decade, non-symbolic methods have been the option of choice when explaining complex machine learning (ML) models. Unfortunately, such methods lack rigor and can mislead human decision-makers. In high-stakes uses of ML, the lack of rigor is especially problematic. One prime example of provable lack of rigor is the adoption of Shapley values in explainable artificial intelligence (XAI), with the tool SHAP being a ubiquitous example. This paper overviews the ongoing efforts towards using rigorous symbolic methods of XAI as an alternative to non-rigorous non-symbolic approaches, concretely for assigning relative feature importance.

2604.14585 2026-05-28 cs.AI cs.CL

Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

提示优化如同抛硬币:诊断其在复合AI系统中何时有效

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

Affiliations: AWS Generative AI Innovation Center; HSBC Holdings Plc., HSBC Technology Center, China

AI summary: Extensive experiments show that prompt optimization in compound AI systems is unreliable and helps only when the task has exploitable output structure; the paper also provides a two-stage diagnostic.

Comments Accepted to the 1st Workshop on Combining Theory and Benchmarks, CTB@ICML 2026, Seoul, South Korea

AI Chinese abstract

复合AI系统中的提示优化在统计上与抛硬币无异:在Claude Haiku 4.5上的72次优化运行(6种方法 × 4个任务 × 3次重复)中,49%的得分低于零样本;在Amazon Nova Lite上,失败率更高。然而,在一个任务上,所有六种方法相比零样本提升了高达+6.8分。是什么区分了成功与失败?我们通过18,000次网格评估和144次优化运行进行了调查,按照必须回答的顺序测试了TextGrad和DSPy等端到端优化工具背后的两个假设:(A) 智能体提示存在交互,需要联合优化而非独立优化;(B) 单个提示本身值得优化。交互效应从未显著(p > 0.52,所有F < 1.0),并且优化仅在任务具有可挖掘的输出结构时才有帮助:即模型可以生成但不会默认采用的格式。我们进一步给出了机制性解释:指令微调将输入措辞压缩成狭窄的输出分布,消除了联合优化所依赖的措辞敏感性。我们提供了一个两阶段诊断:一个80美元的ANOVA预测试用于智能体耦合,以及一个10分钟的头空间测试,用于预测优化是否值得,从而将抛硬币转变为知情决策。

English abstract

Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku 4.5 (6 methods $\times$ 4 tasks $\times$ 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot by up to $+6.8$ points. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy, in the order they must be answered: (A) agent prompts interact, requiring joint rather than independent optimization, and (B) individual prompts are worth optimizing at all. Interaction effects are never significant ($p > 0.52$, all $F < 1.0$), and optimization helps only when the task has exploitable output structure: a format the model can produce but does not default to. We further give a mechanistic account: instruction-tuning compresses input phrasing into a narrow output distribution, eliminating the very phrasing-sensitivity that joint optimization assumes. We provide a two-stage diagnostic: an \$80 ANOVA pre-test for agent coupling, and a 10-minute headroom test that predicts whether optimization is worthwhile, turning a coin flip into an informed decision.
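The ANOVA pre-test for agent coupling can be pictured as a textbook two-way interaction test. The sketch below computes the interaction F statistic for a balanced design in which each factor is the prompt variant of one agent. The function name, the data layout, and the toy scores are assumptions for illustration, not the authors' code or data.

```python
def interaction_f(cells):
    """Interaction F statistic for a balanced two-way design.

    cells[i][j] holds the replicate scores for prompt variant i of agent A
    combined with prompt variant j of agent B (same replicate count per cell).
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])

    def mean(xs):
        return sum(xs) / len(xs)

    cell_mean = [[mean(cells[i][j]) for j in range(b)] for i in range(a)]
    row_mean = [mean(cell_mean[i]) for i in range(a)]
    col_mean = [mean([cell_mean[i][j] for i in range(a)]) for j in range(b)]
    grand = mean(row_mean)

    # Interaction sum of squares: what is left after removing both main effects.
    ss_ab = n * sum(
        (cell_mean[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(a)
        for j in range(b)
    )
    # Within-cell (error) sum of squares.
    ss_err = sum(
        (y - cell_mean[i][j]) ** 2
        for i in range(a)
        for j in range(b)
        for y in cells[i][j]
    )
    df_ab = (a - 1) * (b - 1)
    df_err = a * b * (n - 1)
    return (ss_ab / df_ab) / (ss_err / df_err)


# Toy scores: purely additive effects, so the prompts do not interact.
additive = [[[-1, 0, 1], [0, 1, 2]], [[1, 2, 3], [2, 3, 4]]]
# Same scores, but one combination gets an extra boost (coupled prompts).
coupled = [[[-1, 0, 1], [0, 1, 2]], [[1, 2, 3], [6, 7, 8]]]

f_additive = interaction_f(additive)
f_coupled = interaction_f(coupled)
```

An F value near zero, as in the additive case, is the pattern the abstract reports (all F < 1.0), which suggests the prompts could be tuned independently.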

2604.14356 2026-05-28 cs.CL cs.AI

When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

当多囊卵巢综合征遇上进食障碍:一种可解释的AI方法检测隐藏的三重负担

Apoorv Prasad, Susan McRoy

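Affiliations: University of Wisconsin-Milwaukee

AI summary: This study fine-tunes small open-source language models to automatically detect, with grounded explanations, the triple burden of body image distress, disordered eating, and metabolic challenges in social media posts by people with PCOS; the best model reaches 75.3% exact-match accuracy on 150 test posts.

AI Chinese abstract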

患有多囊卵巢综合征(PCOS)的女性面临身体形象困扰、进食障碍和代谢挑战的显著升高风险,然而现有的自然语言处理方法在检测这些状况时缺乏透明度,且无法识别共病表现。我们开发了小型开源语言模型,以基于可解释性的方式自动检测社交媒体帖子中的这种三重负担。我们从六个子论坛收集了1000条与PCOS相关的帖子,由两名经过训练的标注员根据Lee等人(2017)临床框架的操作化指南对帖子进行标注。使用低秩适配对三个模型(Gemma-2-2B、Qwen3-1.7B、DeepSeek-R1-Distill-Qwen-1.5B)进行微调,以生成带有文本证据的结构化解释。最佳模型在150条保留帖子上实现了75.3%的精确匹配准确率,具有稳健的共病检测能力和强可解释性。性能随诊断复杂性下降,表明其最佳用途是筛查而非自主诊断。

English abstract

Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, and two trained annotators labeled the posts using guidelines that operationalize the clinical framework of Lee et al. (2017). Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3% exact-match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating that the models are best used for screening rather than autonomous diagnosis.
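Exact-match accuracy for a multi-label task can be sketched as follows, assuming a post counts as correct only when the predicted label set equals the gold label set. The label names and the four toy posts are hypothetical and are not taken from the paper's data.

```python
def exact_match_accuracy(gold, pred):
    """Fraction of posts whose predicted label set equals the gold label set."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    hits = sum(1 for g, p in zip(gold, pred) if set(g) == set(p))
    return hits / len(gold)


# Hypothetical labels for four posts; an empty list means no condition present.
gold = [
    ["body_image"],
    ["body_image", "disordered_eating"],
    ["metabolic"],
    [],
]
pred = [
    ["body_image"],
    ["body_image"],  # misses the co-occurring label, so it is not an exact match
    ["metabolic"],
    [],
]
accuracy = exact_match_accuracy(gold, pred)
```

Because a partially correct prediction earns no credit, this metric gets stricter as more conditions co-occur, consistent with the reported decline under diagnostic complexity.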

2604.12955 2026-05-28 cs.AI

Text2Model: Modeling Copilots for Text-to-Model Translation

Text2Model: 用于文本到模型翻译的建模副驾驶

Serdar Kadioglu, Karthik Uppuluri, Akash Singirikonda

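Affiliations: AI Center of Excellence, Fidelity Investments; Department of Computer Science, Brown University

AI summary: Introduces Text2Model and Text2Zinc, which use several LLM strategies within a unified architecture and dataset, in a solver-agnostic way, to translate text into models of combinatorial optimization and satisfaction problems; the copilots and leaderboard are open-sourced to help close the performance gap.

Comments AAAI'25 Bridge Program on Machine Learning and Operations Research; CPAIOR'26 Master Class on LLMs for CP/OR

AI Chinese abstract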

利用大型语言模型(LLM)进行文本到模型翻译和优化任务的研究兴趣日益增长。本文通过引入\textsc{Text2Model}和\textsc{Text2Zinc}来推进这一研究方向。\textsc{Text2Model}是一套基于多种LLM策略(复杂度各异)的副驾驶,并附带在线排行榜。\textsc{Text2Zinc}是一个跨领域数据集,用于捕捉自然语言指定的优化和满足问题,并附带内置AI助手的交互式编辑器。虽然已有新兴文献使用LLM将组合问题翻译为形式化模型,但我们的工作是首次尝试将满足问题和优化问题集成在\textit{统一架构}和\textit{数据集}中。此外,我们的方法是\textit{求解器无关的},不同于现有专注于翻译为特定求解器模型的工作。为此,我们利用\textsc{MiniZinc}的求解器和范式无关的建模能力来表述组合问题。我们进行了全面实验,比较了多种单次和多次调用策略的执行和解准确率,包括:零样本提示、思维链推理、通过知识图谱的中间表示、基于语法的语法编码,以及将模型分解为顺序子任务的代理方法。我们的副驾驶策略具有竞争力,并在部分方面改进了该领域的最新研究。我们的发现表明,虽然LLM有前景,但尚未成为组合建模的一键式技术。我们开源了\textsc{Text2Model}副驾驶和排行榜,以及\textsc{Text2Zinc}和交互式编辑器,以支持缩小这一性能差距。

English abstract

There is growing interest in leveraging large language models (LLMs) for text-to-model translation and optimization tasks. This paper aims to advance this line of research by introducing Text2Model and Text2Zinc. Text2Model is a suite of copilots based on several LLM strategies of varying complexity, along with an online leaderboard. Text2Zinc is a cross-domain dataset for capturing optimization and satisfaction problems specified in natural language, along with an interactive editor with a built-in AI assistant. While there is an emerging literature on using LLMs for translating combinatorial problems into formal models, our work is the first attempt to integrate both satisfaction and optimization problems within a unified architecture and dataset. Moreover, our approach is solver-agnostic, unlike existing work that focuses on translation to a solver-specific model. To achieve this, we leverage MiniZinc's solver-and-paradigm-agnostic modeling capabilities to formulate combinatorial problems. We conduct comprehensive experiments to compare execution and solution accuracy across several single- and multi-call strategies, including zero-shot prompting, chain-of-thought reasoning, intermediate representations via knowledge graphs, grammar-based syntax encoding, and agentic approaches that decompose the model into sequential sub-tasks. Our copilot strategies are competitive with, and in parts improve on, recent research in this domain. Our findings indicate that while LLMs are promising, they are not yet a push-button technology for combinatorial modeling. We open-source the Text2Model copilots and leaderboard, as well as Text2Zinc and the interactive editor, to support closing this performance gap.

2604.13232 2026-05-28 cs.CL

Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection

评估评估者:SemEval-2020任务1在词汇语义变化检测中的问题

Bach Phan-Tat, Kris Heylen, Dirk Geeraerts, Stefano De Pascale, Dirk Speelman

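Affiliations: Department of Linguistics, KU Leuven; Instituut voor de Nederlandse Taal; Department of Linguistics and Literary Studies, Vrije Universiteit Brussel

AI summary: Critically analyzes the limitations of SemEval-2020 Task 1 through a three-part framework of operationalisation, data quality, and benchmark design, pointing to its narrow model of semantic change, data quality problems, and design flaws, and calls for future improvements.

AI Chinese abstract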

本文通过操作化、数据质量和基准设计三个框架重新审视了词汇语义变化检测中最具影响力的共享基准SemEval-2020任务1。首先,在操作化层面,我们认为该基准主要将语义变化建模为离散义项的增加、丢失或重新分布。虽然这种框架便于标注和评估,但过于狭窄,无法捕捉渐变的、构式的、搭配的和语篇层面的变化。此外,黄金标签是标注决策、聚类过程和阈值设置的结果,可能限制任务的有效性。其次,在数据质量层面,我们表明该基准受到严重的语料库和预处理问题影响,包括OCR噪声、畸形字符、截断句子、不一致的词形还原、词性标注错误以及目标词遗漏。这些问题可能扭曲模型行为,使语言分析复杂化,并降低可重复性。第三,在基准设计层面,我们认为精心挑选的小规模目标集和有限的语言覆盖降低了现实性并增加了统计不确定性。综合来看,这些局限性表明该基准应被视为一个有用但不完整的测试平台,而非进展的最终衡量标准。因此,我们呼吁未来的数据集和共享任务采用更广泛的语义变化理论,透明地记录预处理过程,扩大跨语言覆盖范围,并使用更现实的评估设置。这些步骤对于词汇语义变化检测中更有效、可解释和可推广的进展是必要的。

English abstract

This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. In addition, the gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, which could limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of benchmark design, we argue that the small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document pre-processing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection.

2506.01247 2026-05-28 cs.CV cs.AI cs.LG

Beyond Interpretability: When, Why, and How Sparse Autoencoders Enable Label-Free Visual Steering

超越可解释性:稀疏自编码器何时、为何以及如何实现无标签视觉引导

Gerasimos Chatzoudis, Zhuowei Li, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas

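Affiliations: Department of Computer Science, Rutgers University; Department of Statistics, Rutgers University

AI summary: Proposes VS2, a label-free visual sparse steering method that trains a sparse autoencoder and steers a frozen vision-language model using the autoencoder's reconstruction error and sparse-feature amplification, improving zero-shot accuracy on nine image-classification datasets.

AI Chinese abstract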

稀疏自编码器(SAE)越来越多地被用于解释基础模型,但它们作为可操作干预空间的作用仍不太被理解,尤其是在视觉领域。我们研究稀疏视觉特征是否不仅可用于事后分析,还可用于引导冻结的视觉语言模型。我们引入视觉稀疏引导(VS2),一种无标签方法,它在冻结的CLIP图像编码器的无标签激活上训练一个top-$k$ SAE,并在测试时通过放大输入的活跃稀疏特征并解码诱导的变化来构建一个可解释的引导向量。我们证明该过程可分解为质心偏差引导:每个输入沿着其与SAE学习到的质心的偏差移动。残差项由SAE的每样本重构误差(通过FVU测量)精确控制,从而产生基于FVU的残差界限,并促使在SAE重构不可靠时回退到零样本CLIP的可靠性门控。通过使用在无标签CLIP图像编码器激活上训练的目标域SAE,VS2在九个图像分类数据集上提高了零样本准确率,在推理计算量增加不到0.1%的情况下实现了高达+4.12%的提升。最后,一项受控的上界研究VS2++表明,选择性放大稀疏特征可带来高达+21.44%的提升,揭示了一个重构与任务显著性的差距:对重构显著的稀疏特征不一定与对下游预测有用的特征一致。

English abstract

Sparse Autoencoders (SAEs) are increasingly used to interpret foundation models, but their role as an actionable intervention space remains less understood, especially in vision. We study whether sparse visual features can be used not only for post-hoc analysis, but also to steer frozen vision-language models. We introduce Visual Sparse Steering (VS2), a label-free method that trains a top-$k$ SAE on unlabeled activations from a frozen CLIP image encoder and, at test time, constructs an interpretable steering vector by amplifying the input's active sparse features and decoding the induced change. We show that this procedure admits a closed-form decomposition as centroid-deviation steering: each input is moved along its deviation from the SAE-learned centroid. The residual term is controlled exactly by the SAE's per-sample reconstruction error, measured by FVU, yielding an FVU-based residual bound and motivating a reliability gate that falls back to zero-shot CLIP when SAE reconstruction is unreliable. With target-domain SAEs trained on unlabeled CLIP image-encoder activations, VS2 improves zero-shot accuracy across nine image-classification datasets, achieving gains up to $+4.12\%$ with less than $0.1\%$ additional inference compute. Finally, a controlled upper-bound study, VS2++, shows that selective amplification of sparse features can yield gains up to $+21.44\%$, exposing a reconstruction-vs-task saliency gap: features salient for reconstruction need not align with features useful for downstream prediction.
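The test-time procedure described above (encode with a top-k SAE, amplify the active features, decode the induced change, and fall back when reconstruction is poor) can be sketched on a toy example. Everything here is an assumption for illustration: the tiny hand-written weights, the parameter names `gamma` and `fvu_max`, treating the decoder bias as the centroid, and the per-sample FVU definition used for the gate.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def top_k(acts, k):
    """Keep the k largest activations (clipped at zero) and zero out the rest."""
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [max(a, 0.0) if i in keep else 0.0 for i, a in enumerate(acts)]


def steer(x, w_enc, w_dec, center, k=1, gamma=2.0, fvu_max=0.5):
    """Return (steered embedding, per-sample FVU) for one input embedding."""
    centered = [xi - ci for xi, ci in zip(x, center)]
    z = top_k(matvec(w_enc, centered), k)
    recon = matvec(w_dec, z)  # reconstruction of the centered input
    amplified = matvec(w_dec, [gamma * zi for zi in z])
    # Steering vector: decoded change induced by amplifying the active features.
    delta = [a - r for a, r in zip(amplified, recon)]
    # Assumed per-sample FVU: residual energy over centered-input energy.
    total = sum(c ** 2 for c in centered)
    resid = sum((c - r) ** 2 for c, r in zip(centered, recon))
    fvu = resid / total if total else 0.0
    if fvu > fvu_max:
        return list(x), fvu  # reliability gate: leave the embedding unchanged
    return [xi + di for xi, di in zip(x, delta)], fvu


# Toy SAE with 3 features over a 2-dimensional embedding.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
center = [1.0, 1.0]
x = [3.0, 2.0]

steered, fvu = steer(x, w_enc, w_dec, center)
gated, _ = steer(x, w_enc, w_dec, center, fvu_max=0.1)
```

In this linear-decoder toy the steering vector equals `(gamma - 1)` times the reconstructed deviation from the center, which matches the centroid-deviation reading in the abstract, and the stricter gate returns the input unchanged.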

2604.10567 2026-05-28 cs.CL cs.AI

Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models

早期决策至关重要:非自回归扩散语言模型中的邻近偏差与初始轨迹塑造

Jiyeon Kim, Sungik Choi, Yongrae Jo, Moontae Lee, Minjoon Seo

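Affiliations: LG AI Research

AI summary: By analyzing the inference dynamics of non-autoregressive diffusion language models, this paper identifies error propagation caused by proximity bias and proposes a lightweight planner and end-of-sequence temperature annealing to guide early token selection, substantially improving performance on reasoning and planning tasks.

Comments ICML 2026 Camera Ready

AI Chinese abstract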

基于扩散的语言模型(dLLMs)已成为自回归语言模型的一种有前景的替代方案,提供了并行令牌生成和双向上下文建模的潜力。然而,如何利用这种灵活性实现完全非自回归解码仍然是一个开放问题,尤其是在推理和规划任务中。在这项工作中,我们通过系统分析非自回归解码在时间轴上的推理动态来研究dLLMs中的非自回归解码。具体来说,我们揭示了基于置信度的非自回归生成中固有的失败模式,该模式源于强烈的邻近偏差——即去噪顺序倾向于集中在空间相邻的令牌上。这种局部依赖性导致空间错误传播,使得整个轨迹关键地依赖于初始去掩码位置。利用这一见解,我们提出了一种最小干预方法,通过轻量级规划器和序列结束温度退火来指导早期令牌选择。我们在各种推理和规划任务上全面评估了我们的方法,并观察到在现有启发式基线基础上,无需显著计算开销即可实现整体性能的显著提升。

English abstract

Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias: the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.

2604.09367 2026-05-28 cs.CV

EpiAgent: An Agent-Centric System for Ancient Inscription Restoration

EpiAgent: 一种以智能体为中心的古铭文修复系统

Shipeng Zhu, Ang Chen, Na Nie, Pengfei Fang, Min-Ling Zhang, Hui Xue

Affiliations: School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China; Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China; Nanjing University Museum, Nanjing University, China; The China Centre for Linguistic and Strategic Studies, Nanjing University, China

AI summary: Proposes EpiAgent, an agent-based system in which hierarchical planning and an LLM coordinate multimodal analysis, historical experience, and specialized tools for flexible, adaptive restoration of ancient inscriptions, achieving better restoration quality and generalization on real-world degraded inscriptions.

Comments Accepted by CVPR 2026

AI Chinese abstract

古铭文作为文化记忆的载体,历经数世纪的环境和人为退化。恢复其交织的视觉和文本完整性是数字遗产保护中最具挑战性的任务之一。然而,现有基于AI的方法通常依赖刚性流水线,难以泛化到如此复杂和异质的真实退化场景。受人类金石学家技能协调工作流程的启发,我们提出EpiAgent,一个以智能体为中心的系统,将铭文修复形式化为分层规划问题。遵循观察-构思-执行-重新评估范式,基于LLM的中央规划器协调多模态分析、历史经验、专用修复工具和迭代自我精炼之间的协作。这种以智能体为中心的协调使得修复过程比传统的单次通过方法更加灵活和自适应。在真实退化的铭文上,EpiAgent相比现有方法实现了更优的修复质量和更强的泛化能力。我们的工作标志着向专家级智能体驱动的文化遗产修复迈出了重要一步。代码可在 https://github.com/blackprotoss/EpiAgent 获取。

English abstract

Ancient inscriptions, as repositories of cultural memory, have suffered from centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formulates inscription restoration as a hierarchical planning problem. Following an Observe-Conceive-Execute-Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent achieves superior restoration quality and stronger generalization compared to existing methods. Our work marks an important step toward expert-level agent-driven restoration of cultural heritage. The code is available at https://github.com/blackprotoss/EpiAgent.

2604.09258 2026-05-28 cs.LG

Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

Nexus: 相同预训练损失,通过公共极小值实现更好的下游泛化

Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong, Ke Shen, Jun Zhu

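Affiliations: Tsinghua University

AI summary: Proposes the Nexus optimizer, which maximizes gradient similarity to pull the loss minima of different data sources closer together, significantly improving downstream generalization at the same pretraining loss.

AI Chinese abstract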

大型语言模型的基础能力是在互联网规模、高度异构的数据混合上进行预训练时获得的。在这项工作中,我们研究了关于预训练收敛状态的一个有趣的几何问题:模型是否收敛到所有数据源的公共极小值(例如,图\cref{fig:cwa_illustration:close}),还是仅仅收敛到总损失的极小值(例如,图\cref{fig:cwa_illustration:distant})?我们假设任务特定极小值的几何“接近度”与下游泛化内在相关。我们发现标准优化器(例如AdamW)通常收敛到任务特定极小值彼此远离的点。为了解决这个问题,我们提出了Nexus优化器,它通过在优化过程中最大化梯度相似性来鼓励这些极小值的接近。在从130M到3B参数的各种模型、多种数据混合和超参数调度下的实验表明,Nexus在实现相同预训练损失的情况下显著提升了下游性能(见图\cref{fig:demo:benchmark})。值得注意的是,在3B模型上,Nexus将分布外损失降低了0.012,并在复杂推理任务(例如GSM8k)上带来了高达15.0%的准确率提升。这一发现挑战了将预训练损失作为模型评估唯一代理的依赖,并展示了隐式偏好在解锁下游泛化中的重要性。

English abstract

The foundational capabilities of large language models are acquired during pretraining on internet-scale, highly heterogeneous data mixtures. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: Does the model converge to a common minimizer across all data sources, or merely a minimizer of the summed loss? We hypothesize that the geometric "closeness" of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the Nexus optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters, various data mixtures and hyperparameter schedules, show that Nexus significantly boosts downstream performance, despite achieving the same pretraining loss. Notably, on the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0% accuracy improvement on complex reasoning tasks (e.g., GSM8k). This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.
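The geometric question can be pictured with a toy example: two quadratic losses stand in for two data sources, and the cosine similarity of their gradients is the quantity the abstract says Nexus maximizes. The quadratic losses and the chosen points are assumptions for illustration; how Nexus folds this quantity into its update is not described in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def grad_quadratic(w, minimum):
    """Gradient of 0.5 * ||w - minimum||^2, a stand-in for one source's loss."""
    return [wi - mi for wi, mi in zip(w, minimum)]


# Two data sources whose individual minima are far apart.
min_a, min_b = [0.0, 0.0], [2.0, 0.0]

# The summed loss is minimized at the midpoint, where the two per-source
# gradients cancel each other exactly.
midpoint = [1.0, 0.0]
sim_at_summed_min = cosine(
    grad_quadratic(midpoint, min_a), grad_quadratic(midpoint, min_b)
)

# Away from both minima the gradients mostly agree.
elsewhere = [1.0, 3.0]
sim_elsewhere = cosine(
    grad_quadratic(elsewhere, min_a), grad_quadratic(elsewhere, min_b)
)
```

At the minimizer of the summed loss the similarity is -1: the total gradient vanishes even though neither source sits at its own minimum, which is the "distant minima" case the abstract contrasts with a common minimizer.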

2604.05333 2026-05-28 cs.AI

Graph-of-Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

技能图谱:面向大规模智能体技能的依赖感知结构检索

Dawei Liu, Zongxia Li, Hongyang Du, Xiyang Wu, Shihang Gui, Yongbei Kuang, Lichao Sun

Affiliations: University of Pennsylvania; University of Maryland; Brown University; Carnegie Mellon University; Lehigh University

AI summary: Proposes Graph-of-Skills (GoS), an inference-time structural retrieval layer that builds an executable skill graph and retrieves dependency-aware skill bundles through hybrid semantic-lexical seeding, reverse-aware Personalized PageRank, and context-budgeted hydration, substantially raising reward and saving tokens on SkillsBench and ALFWorld.

Comments 11 pages of main text, 12 pages of appendix. Core contribution by Dawei Liu and Zongxia Li. Project page: https://github.com/davidliuk/graph-of-skills

AI Chinese abstract

现代LLM智能体越来越依赖可复用技能,当与个人应用、网页浏览器等接口交互时,技能库可扩展至数千个技能。扩展到更大的技能集带来了两个关键挑战。首先,加载完整技能集会饱和上下文窗口,推高令牌成本、幻觉和延迟。其次,语义检索会找到主题相关的技能,但遗漏其上下游技能的先决条件链,造成先决条件缺口,使检索到的技能束执行不完整。在本文中,我们提出技能图谱(GoS),一种用于大型技能库的推理时结构检索层。GoS离线从技能包构建可执行技能图,然后在推理时通过混合语义-词汇种子、反向感知个性化PageRank和上下文预算水合,检索一个有界、依赖感知的技能束。在SkillsBench和ALFWorld上,GoS在三个模型系列(Claude Sonnet 4.5、MiniMax M2.7和GPT-5.2 Codex)中持续带来显著的奖励提升和令牌节省。在SkillsBench上,使用GPT-5.2 Codex时,GoS相比原始完整技能加载基线实现了25.55%的峰值奖励提升,同时总令牌减少56.72%。消融实验证实了在200到2000个技能库中的这一模式。

English abstract

Modern LLM agents increasingly rely on reusable skills, and as they interact with personal applications, web browsers, and other interfaces, skill libraries can scale to thousands of skills. Scaling to larger skill sets introduces two key challenges. First, loading the full skill set saturates the context window, driving up token costs, hallucination, and latency. Second, semantic retrieval surfaces topically relevant skills but misses their prerequisite chain of upstream and downstream skills, creating a prerequisite gap that leaves the retrieved bundle execution-incomplete. In this paper, we present Graph-of-Skills (GoS), an inference-time structural retrieval layer for large skill libraries. GoS constructs an executable skill graph offline from skill packages, then at inference time retrieves a bounded, dependency-aware skill bundle through hybrid semantic-lexical seeding, reverse-aware Personalized PageRank, and context-budgeted hydration. On SkillsBench and ALFWorld, GoS consistently delivers substantial reward improvements and token savings across three model families (Claude Sonnet 4.5, MiniMax M2.7, and GPT-5.2 Codex). On SkillsBench, GoS achieves a peak reward increase of 25.55% while reducing total tokens by 56.72% over the vanilla full skill-loading baseline using GPT-5.2 Codex. Ablations confirm this pattern across skill libraries from 200 to 2,000 skills.
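The retrieval step can be sketched as Personalized PageRank over a small skill graph followed by a greedy, budget-limited selection. This is not the GoS implementation. The skill names, token costs, `reverse_weight`, and the reading of "reverse-aware" (the walk may also follow an edge backwards) and "context-budgeted hydration" (add skills until a token budget is reached) are all assumptions for illustration.

```python
def personalized_pagerank(edges, seeds, alpha=0.15, reverse_weight=0.5, iters=100):
    """edges are (skill, prerequisite) pairs; returns {skill: score}."""
    nodes = sorted({n for edge in edges for n in edge} | set(seeds))
    weights = {n: {} for n in nodes}
    for skill, prereq in edges:
        weights[skill][prereq] = weights[skill].get(prereq, 0.0) + 1.0
        # Assumption: reverse traversal is allowed at a reduced weight.
        weights[prereq][skill] = weights[prereq].get(skill, 0.0) + reverse_weight
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            total = sum(weights[n].values())
            for m in nodes:
                # Nodes without edges send their mass back to the seeds.
                share = weights[n].get(m, 0.0) / total if total else restart[m]
                nxt[m] += (1.0 - alpha) * score[n] * share
        score = nxt
    return score


def budgeted_bundle(score, token_cost, budget):
    """Greedily add the highest-scoring reachable skills that fit the budget."""
    bundle, used = [], 0
    for skill in sorted(score, key=score.get, reverse=True):
        cost = token_cost[skill]
        if score[skill] > 0 and used + cost <= budget:
            bundle.append(skill)
            used += cost
    return bundle


# Hypothetical skill library: each edge points from a skill to a prerequisite.
edges = [
    ("plot_report", "load_csv"),
    ("load_csv", "auth_token"),
    ("send_email", "auth_token"),
    ("resize_image", "open_image"),
]
token_cost = {
    "plot_report": 300, "load_csv": 200, "auth_token": 150,
    "send_email": 400, "resize_image": 250, "open_image": 100,
}
scores = personalized_pagerank(edges, seeds={"plot_report"})
bundle = budgeted_bundle(scores, token_cost, budget=700)
```

Seeding on `plot_report` pulls in its prerequisite chain (`load_csv`, `auth_token`) while the unrelated image skills keep a score of zero, which is the prerequisite-gap behaviour that purely semantic retrieval would miss.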

2604.04074 2026-05-28 cs.AI cs.LG

FactReview: Evidence-Grounded Peer Review with Execution-Based Claim Verification

FactReview:基于执行式声明验证的证据驱动同行评审

Ling Yue, Chaoqian Ouyang, Hang Xu, Ruijun Huang, Yuchen Liu, Libin Zheng, Wei Liu, Shaowu Pan, Shimin Di, Min-Ling Zhang

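Affiliations: Rensselaer Polytechnic Institute; Sun Yat-sen University; Southeast University; The Hong Kong University of Science and Technology

AI summary: Proposes FactReview, a system that extracts review-relevant claims, grounds them in related work, and, when code is available, executes released artifacts under a fixed repair budget to audit empirical claims; it covers 84% of claims, scores 4.86/5 in review quality, and cuts review time by 58%.

AI Chinese abstract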

基于LLM的评审系统通常仅以手稿为输入,使得文献和基于代码的声明难以验证。我们提出FactReview,一个提取与评审相关的声明、将其与相关工作关联,并在代码可用时在固定修复预算下执行发布工件以审计经验声明的系统。在35篇ML论文和463个基准主要声明中,FactReview覆盖了84%的声明。在证据感知评分标准下,其评审在整体质量上得分为4.86/5,比DeepReview-v2高0.7,比匹配的OpenReview评论高1.5。移除执行证据会改变17%的声明状态,超过任何其他单一证据来源。在一项评审辅助研究中,FactReview将平均评审时间减少了58%,同时将基准声明覆盖率从87%提高到99%。我们认为LLM评审者应审计经验声明,而非做出接受或拒绝的决定。代码公开于:https://github.com/DEFENSE-SEU/FactReview。

English abstract

LLM-based reviewing systems typically take only the manuscript as input, leaving literature and code-based claims hard to verify. We present FactReview, a system that extracts review-relevant claims, grounds them in related work, and, when code is available, executes released artifacts under a fixed repair budget to audit empirical claims. Across 35 ML papers and 463 benchmark major claims, FactReview covers 84% of claims. Under an evidence-aware rubric, its reviews score 4.86/5 in overall quality, 0.7 above DeepReview-v2 and 1.5 above matched OpenReview comments. Removing execution evidence changes 17% of claim statuses, more than any other single evidence source. In a reviewer-assistance study, FactReview reduces mean review time by 58% while raising benchmark claim coverage from 87% to 99%. We argue that LLM reviewers should audit empirical claims, not make accept-reject decisions. The code is public at: https://github.com/DEFENSE-SEU/FactReview.

2604.05378 2026-05-28 cs.CL cs.CV

ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

ICR-Drive:面向端到端语言驱动自动驾驶的指令反事实鲁棒性

Kaiser Hamid, Can Cui, Nade Liang

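Affiliations: Texas Tech University; Bosch Center for Artificial Intelligence (BCAI)

AI summary: Proposes the ICR-Drive framework, which generates four families of perturbed instructions (paraphrase, ambiguity, noise, misleading) and evaluates them in CARLA simulation, revealing how fragile language-conditioned driving models are to instruction changes.

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 872-880

AI Chinese abstract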

视觉-语言-动作(VLA)模型的最新进展使得语言条件驾驶代理能够在闭环仿真中执行自然语言导航命令,但标准评估大多假设指令精确且格式良好。在实际部署中,指令的措辞和具体性各不相同,可能省略关键限定词,偶尔还包含误导性的权威框架文本,导致指令级鲁棒性未被充分衡量。我们提出了ICR-Drive,一个用于端到端语言条件自动驾驶中指令反事实鲁棒性的诊断框架。ICR-Drive生成受控的指令变体,涵盖四类扰动:改写、歧义、噪声和误导,其中误导变体与导航目标冲突并试图覆盖意图。我们在匹配的仿真器配置和种子下重放相同的CARLA路线,以隔离由指令语言引起的性能变化。鲁棒性通过标准CARLA排行榜指标和相对于基线指令的每族性能下降来量化。在LMDrive和BEVDriver上的实验表明,微小的指令变化可能导致显著的性能下降和不同的故障模式,揭示了在安全关键驾驶中部署具身基础模型的可靠性差距。

English abstract

Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.

2604.03799 2026-05-28 cs.CV

Next-Scale Autoregressive Models for Text-to-Motion Generation

Next-Scale 自回归模型用于文本到运动生成

Zhiwei Zheng, Shibo Jin, Lingjie Liu, Mingmin Zhao

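Affiliations: University of Pennsylvania

AI summary: Proposes the MoScale framework, which generates motion hierarchically from coarse to fine temporal resolutions and combines cross-scale and in-scale refinement for efficient, scalable text-to-motion generation.

Comments Accepted to CVPR 2026

AI Chinese abstract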

自回归(AR)模型提供稳定高效的训练,但标准的下一 token 预测与文本条件运动生成所需的时间结构不太一致。我们引入 MoScale,一个下一尺度 AR 框架,从粗到细的时间分辨率分层生成运动。通过在最粗尺度提供全局语义并逐步细化,MoScale 建立了一个更适合长程运动结构的因果层次。为了提高在有限文本-运动数据下的鲁棒性,我们进一步结合了跨尺度层次细化以改进每个尺度的初始预测,以及尺度内时间细化用于选择性双向重新预测。MoScale 在文本到运动任务上实现了最先进的性能,具有高训练效率,能有效随模型大小扩展,并零样本泛化到多种运动生成和编辑任务。

English abstract

Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is not well aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By providing global semantics at the coarsest scale and refining them progressively, MoScale establishes a causal hierarchy better suited for long-range motion structure. To improve robustness under limited text-motion data, we further incorporate cross-scale hierarchical refinement for improving per-scale initial predictions and in-scale temporal refinement for selective bidirectional re-prediction. MoScale achieves SOTA text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation and editing tasks.

2604.02645 2026-05-28 cs.CL cs.AI

Speaking of Language: Reflections on Metalanguage Research in NLP

论语言:NLP中元语言研究的思考

Nathan Schneider, Antonios Anastasopoulos

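Affiliations: Georgetown University; George Mason University

AI summary: This paper defines metalanguage, links it to NLP and LLMs, describes two labs' metalanguage-centered research, and discusses four dimensions of metalanguage and metalinguistic tasks, proposing future research directions.

Comments To appear at the Big Picture Workshop at ACL 2026. Camera-ready version

AI Chinese abstract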

本工作旨在聚焦元语言话题。我们首先定义元语言,将其与NLP和LLM联系起来,然后讨论我们两个实验室以元语言为中心的努力。最后,我们讨论元语言和元语言任务的四个维度,提供一系列尚未充分研究的未来研究方向。

English abstract

This work aims to shine a spotlight on the topic of metalanguage. We first define metalanguage, link it to NLP and LLMs, and then discuss our two labs' metalanguage-centered efforts. Finally, we discuss four dimensions of metalanguage and metalinguistic tasks, offering a list of understudied future research directions.

2604.02028 2026-05-28 cs.CL

Why Gaussian Diffusion Models Fail on Discrete Data and How to Prevent It?

为什么高斯扩散模型在离散数据上失败以及如何防止?

Alexander Shabalin, Simon Elistratov, Viacheslav Meshchaninov, Ildus Sadrtdinov, Dmitry Vetrov

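Affiliations: Constructor University; Lomonosov Moscow State University

AI summary: Investigates why Gaussian diffusion models sample poorly on discrete data, finds that within a critical sampling interval the density of noised data becomes multimodal and drives DDPM into low-density regions, and proposes combining self-conditioning with q-sampling to improve generation quality.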

AI-generated Chinese abstract

扩散模型已成为连续域生成建模的标准方法,但其在离散数据上的应用仍然具有挑战性。我们研究了使用DDPM求解器的高斯扩散模型为何难以从表示为连续空间中δ分布混合的离散分布中采样。通过一个玩具随机层次模型,我们识别出一个关键采样区间,在该区间内噪声数据的密度变为多峰分布。在这个区间内,DDPM偶尔会进入模式之间的低密度区域,为模型产生分布外输入并降低样本质量。我们表明,现有的启发式方法,包括自条件化和我们称之为q采样的求解器,有助于缓解这个问题。此外,我们证明在关键区间内将自条件化与从DDPM切换到q采样相结合,可以提高真实数据的生成质量。我们在多个领域的条件和无条件任务中验证了这些发现,包括文本、编程代码和蛋白质。

English abstract

Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes, producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.
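To see why the noised density of discrete data can switch between unimodal and multimodal, consider the simplest possible case: two equally weighted atoms at -1 and +1 under the standard forward process x_t = sqrt(ᾱ) x_0 + sqrt(1 - ᾱ) ε. The sketch below counts the modes of that density on a grid. It is a toy assumption chosen for illustration, not the Random Hierarchy Model used in the paper. For this two-atom case the mixture of two equal-variance Gaussians is bimodal exactly when sqrt(ᾱ) > sqrt(1 - ᾱ), that is, when ᾱ > 0.5.

```python
import math

def noised_density(x, alpha_bar, atoms=(-1.0, 1.0)):
    """Density of x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps,
    with x_0 uniform over `atoms` and eps ~ N(0, 1). Requires 0 < alpha_bar < 1."""
    mean_scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    total = 0.0
    for a in atoms:
        z = (x - mean_scale * a) / sigma
        total += math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return total / len(atoms)

def count_modes(alpha_bar, half_width=3.0, steps_per_unit=100):
    """Count strict local maxima of the noised density on a regular grid."""
    n = int(half_width * steps_per_unit)
    xs = [i / steps_per_unit for i in range(-n, n + 1)]
    ps = [noised_density(x, alpha_bar) for x in xs]
    return sum(1 for i in range(1, len(ps) - 1) if ps[i - 1] < ps[i] > ps[i + 1])

low_noise_modes = count_modes(0.9)   # little noise: the two atoms stay separated
high_noise_modes = count_modes(0.1)  # heavy noise: the atoms blur into one bump
```

In this toy setting the region around x = 0 is a low-density valley whenever the density is bimodal, which is the kind of between-mode region the abstract says DDPM occasionally enters.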

2604.01604 2026-05-28 cs.AI

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

CRaFT:基于跨层转码器的电路引导拒绝特征选择

Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han

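Affiliations: Yonsei University

AI summary: Proposes CRaFT, which uses cross-layer transcoders to build a sparse feature circuit graph and selects the key features governing refusal by quantifying inter-feature influences and their contributions to the final output, substantially improving jailbreak attack performance.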

AI-generated Chinese abstract

虽然现代LLM经过对齐以拒绝有害请求,但理解这种拒绝行为背后的机制基础对于模型安全分析至关重要。例如,基于引导的越狱攻击通过识别和操纵稀疏的、类似神经元的拒绝特征来绕过安全护栏。当前的特征选择方法主要依赖于特征在有害提示上的激活强度。然而,仅凭激活强度往往捕捉到主题或词汇线索等表面启发式,而非真正的因果机制。因此,选择拒绝特征需要测量特征间的关系,而不是将每个特征视为孤立的激活信号。基于这一见解,我们提出CRaFT,一个电路引导的框架,用于识别直接控制拒绝决策的关键拒绝特征。CRaFT利用跨层转码器将模型的内部计算映射到稀疏特征电路图中,其中边量化特征间的影响及其对最终输出logits的贡献。通过聚合沿拒绝路径传播的效应,CRaFT有效地对最具影响力的特征进行排序。在四个越狱基准上的广泛评估表明,与当前最先进方法相比,CRaFT将平均性能从6.7%提高到57.4%,并生成更具体的有害补全。

English abstract

While modern LLMs are aligned to refuse harmful requests, it is essential to understand the underlying mechanistic basis of this refusal behavior for model safety analysis. For example, steering-based jailbreak attacks exploit this by identifying and manipulating sparse, neuron-like refusal features to bypass safety guardrails. Current feature selection methods primarily rely on how strongly features activate on harmful prompts. However, activation strength alone often captures superficial heuristics such as topic or lexical cues, rather than the true causal mechanisms. Thus, selecting refusal features requires measuring inter-feature relationships, rather than treating each feature as an isolated activation signal. Based on this insight, we propose CRaFT, a circuit-guided framework for identifying critical refusal features that directly govern the refusal decision. CRaFT leverages cross-layer transcoders to map the model's internal computations into a sparse feature circuit graph, where edges quantify inter-feature influences and their contributions to the final output logits. By aggregating the effects propagating along the paths to refusal, CRaFT effectively ranks the most influential features. Extensive evaluations across four jailbreak benchmarks show that CRaFT significantly improves average performance from 6.7% to 57.4% and generates more specific harmful completions compared to current SOTA methods.
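One generic way to aggregate effects along paths in a circuit graph is to sum, over every directed path from a feature to the output node, the product of the edge weights on that path. The sketch below does this on a small hand-made DAG. The node names, the weights, and the sum-over-paths rule itself are assumptions for illustration; the abstract does not state CRaFT's exact attribution or aggregation formula.

```python
from functools import lru_cache

# Hypothetical feature circuit graph: EDGES[u][v] is the direct influence of u on v.
EDGES = {
    "f_topic": {"f_harm": 0.5},
    "f_harm": {"f_refuse": 0.8, "logit_refusal": 0.1},
    "f_refuse": {"logit_refusal": 1.0},
    "f_lexical": {"logit_refusal": 0.05},
}

def total_effect(edges, source, target):
    """Sum over all directed paths source -> target of the product of edge weights.
    The graph must be acyclic."""
    @lru_cache(maxsize=None)
    def effect(node):
        if node == target:
            return 1.0
        return sum(w * effect(nxt) for nxt, w in edges.get(node, {}).items())
    return effect(source)

def rank_features(edges, target):
    """Rank every source node by the magnitude of its total effect on `target`."""
    scores = {f: total_effect(edges, f, target) for f in edges if f != target}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

ranking = rank_features(EDGES, "logit_refusal")
```

Here `f_harm` outranks `f_lexical` even though both have a small direct edge to the output, because most of its influence flows indirectly through `f_refuse`. That is the kind of inter-feature relationship a purely activation-based ranking would miss.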

2604.00913 2026-05-28 cs.CV cs.CL

Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

跨描绘装配指令对齐的视觉-语言模型基准测试与机制分析

Zhuchenyang Liu, Yao Zhang, Yu Xiao

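Affiliations: Aalto University

AI summary: Builds the IKEA-Bench benchmark and evaluates 19 vision-language models on aligning assembly diagrams with video frames, finding that visual encoding is the key bottleneck for cross-depiction robustness.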

AI-generated Chinese abstract

二维装配图通常是抽象的且难以遵循,因此需要智能助手来监控进度、检测错误并提供逐步指导。在混合现实环境中,此类系统必须从摄像头画面中识别已完成和正在进行的步骤,并将其与图示指令对齐。视觉语言模型(VLM)在此任务上展现出潜力,但由于装配图和视频帧共享的视觉特征极少,面临描绘鸿沟。为系统评估这一鸿沟,我们构建了IKEA-Bench基准,包含29个宜家家具产品的6种任务类型共1623个问题,并在三种对齐策略下评估了19个VLM(2B-38B)。主要发现:(1)装配指令理解可通过文本恢复,但文本同时降低了图到视频的对齐性能;(2)架构族比参数数量更能预测对齐精度;(3)视频理解是难以通过策略影响的硬瓶颈。三级机制分析进一步揭示,图和视频占据不相交的ViT子空间,且添加文本会使模型从视觉驱动转向文本驱动的推理。这些结果表明,视觉编码是提升跨描绘鲁棒性的主要目标。项目页面:https://ryenhails.github.io/IKEA-Bench/

English abstract

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/

2604.00402 2026-05-28 cs.CV cs.AI

COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

COTTA: 面向自动驾驶轨迹预测的上下文感知迁移适应

Seohyoung Park, Jaeyeol Lim, Seoyoung Ju, Kyeonghun Kim, Nam-Joon Kim, Hyuk-Jae Lee

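Affiliations: Ewha Womans University; Seoul National University; Sangmyung University; NVIDIA

AI summary: Studies transferring QCNet, a trajectory prediction model trained on U.S. data, to Korean road environments. Comparing four training strategies shows that freezing the encoder and fine-tuning the decoder gives the best balance of accuracy and efficiency, reducing prediction error by more than 66% relative to training from scratch.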

Comments 4 pages, 2 figures. Accepted at ICEIC 2026

AI-generated Chinese abstract

开发鲁棒模型以准确预测周围代理的轨迹是自动驾驶安全的基础。然而,大多数公开数据集(如Waymo Open Motion Dataset和Argoverse)是在西方道路环境中收集的,并未反映其他地区(包括韩国)独特的交通模式、基础设施和驾驶行为。当在西方数据上训练的最先进模型部署到不同地理环境时,这种领域差异会导致性能下降。在本工作中,我们研究了查询中心轨迹预测(QCNet)从美国数据迁移到韩国道路环境时的适应性。使用韩国自动驾驶数据集,我们比较了四种训练策略:零样本迁移、从头训练、全微调和编码器冻结。实验结果表明,利用预训练知识显著提高了预测性能。具体而言,在冻结编码器的同时选择性微调解码器,在精度和训练效率之间取得了最佳平衡,与从头训练相比,预测误差降低了66%以上。本研究为在新地理领域部署轨迹预测模型提供了有效的迁移学习策略的实用见解。

English abstract

Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, are collected in Western road environments and do not reflect the unique traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain discrepancy leads to performance degradation when state-of-the-art models trained on Western data are deployed in different geographic contexts. In this work, we investigate the adaptability of Query-Centric Trajectory Prediction (QCNet) when transferred from U.S.-based data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. Experimental results demonstrate that leveraging pretrained knowledge significantly improves prediction performance. Specifically, selectively fine-tuning the decoder while freezing the encoder yields the best trade-off between accuracy and training efficiency, reducing prediction error by over 66% compared to training from scratch. This study provides practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.

2512.11524 2026-05-28 cs.CV cs.LG

Super-Resolved Canopy Height Mapping from Sentinel-2 Time Series Using Airborne LiDAR HD Reference Data across Metropolitan France

利用法国大都市机载LiDAR HD参考数据从Sentinel-2时间序列进行超分辨率冠层高度制图

Ekaterina Kalinicheva, Florian Helen, Stéphane Mermoz, Florian Mouret, Milena Planells

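Affiliations: CESBIO; GlobEO

AI summary: Proposes THREASURE-Net, an end-to-end framework that uses Sentinel-2 time series and LiDAR HD data to produce annual canopy height maps at 2.5 m, 5 m, and 10 m resolution without pretrained models or high-resolution optical imagery, achieving better accuracy than existing Sentinel-based methods across Metropolitan France.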

AI-generated Chinese abstract

精细尺度的森林监测对于理解冠层结构及其动态至关重要,这些是碳储量、生物多样性和森林健康的关键指标。深度学习特别有效,因为它整合了共同反映冠层结构的光谱、时间和空间信号。为满足这一需求,我们提出了THREASURE-Net,一种新颖的端到端树高回归与超分辨率框架。该模型使用来自法国大都市区多个空间分辨率的LiDAR HD数据导出的参考高度指标,在Sentinel-2时间序列上训练,以生成年度高度图。我们评估了三种模型变体,分别产生2.5米、5米和10米分辨率的树高预测。THREASURE-Net不依赖任何预训练模型或参考甚高分辨率光学图像来训练其超分辨率模块;相反,它仅从LiDAR导出的高度信息中学习。我们的方法优于现有基于Sentinel数据的最先进方法,并与基于甚高分辨率图像的方法具有竞争力。它可以部署生成高精度年度冠层高度图,在2.5米、5米和10米分辨率下分别实现2.63米、2.70米和2.88米的平均绝对误差。这些结果凸显了THREASURE-Net仅使用免费卫星数据对温带森林进行可扩展且经济高效的结构监测的潜力。THREASURE-Net的源代码可在以下网址获取:https://github.com/Global-Earth-Observation/threasure-net。

English abstract

Fine-scale forest monitoring is essential for understanding canopy structure and its dynamics, which are key indicators of carbon stocks, biodiversity, and forest health. Deep learning is particularly effective for this task, as it integrates spectral, temporal, and spatial signals that jointly reflect the canopy structure. To address this need, we introduce THREASURE-Net, a novel end-to-end framework for Tree Height Regression And Super-Resolution. The model is trained on Sentinel-2 time series using reference height metrics derived from LiDAR HD data at multiple spatial resolutions over Metropolitan France to produce annual height maps. We evaluate three model variants, producing tree-height predictions at 2.5 m, 5 m, and 10 m resolution. THREASURE-Net does not rely on any pretrained model nor on reference very high resolution optical imagery to train its super-resolution module; instead, it learns solely from LiDAR-derived height information. Our approach outperforms existing state-of-the-art methods based on Sentinel data and is competitive with methods based on very high resolution imagery. It can be deployed to generate high-precision annual canopy-height maps, achieving mean absolute errors of 2.63 m, 2.70 m, and 2.88 m at 2.5 m, 5 m, and 10 m resolution, respectively. These results highlight the potential of THREASURE-Net for scalable and cost-effective structural monitoring of temperate forests using only freely available satellite data. The source code for THREASURE-Net is available at: https://github.com/Global-Earth-Observation/threasure-net.

2601.17354 2026-05-28 cs.CV cs.GR

PocketGS: On-Device Training of 3D Gaussian Splatting for High Perceptual Modeling

PocketGS: 用于高感知建模的3D高斯泼溅设备端训练

Wenzhi Guo, Guangchi Fang, Shu Yang, Bing Wang

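Affiliations: Hong Kong Polytechnic University; Nanjing University

AI summary: Proposes PocketGS, which uses three co-designed operators (G, I, T) to train 3D Gaussian Splatting efficiently on mobile devices while preserving high-fidelity reconstruction under strict resource constraints.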

AI-generated Chinese abstract

虽然3D高斯泼溅(3DGS)能够实现实时渲染,但其训练需要工作站级别的计算和内存,使得在分钟级时间预算和有限峰值内存下移动部署不切实际。我们提出了PocketGS,一种移动场景建模范式,能够在这些紧密耦合的约束下实现设备端3DGS训练,同时保持高保真重建。PocketGS通过三个协同设计的算子解决了训练效率、内存紧凑性和建模质量之间的基本矛盾:$\mathcal{G}$构建几何保真的点云先验;$\mathcal{I}$注入局部表面统计以播种各向异性高斯,从而减少早期条件差距;$\mathcal{T}$使用缓存的中间结果和索引映射梯度散射展开alpha合成,以实现稳定的移动反向传播。大量实验表明,PocketGS在移动预算下优于强大的主流工作站3DGS基线,提供高质量重建,并实现了完全设备端的实用捕获到渲染工作流。

English abstract

While 3D Gaussian Splatting (3DGS) enables real-time rendering, its training demands workstation-level compute and memory, making mobile deployment impractical under minute-scale time budgets and limited peak memory. We present PocketGS, a mobile scene modeling paradigm that enables on-device 3DGS training under these tightly coupled constraints while preserving high-fidelity reconstruction. PocketGS resolves the fundamental tension between training efficiency, memory compactness, and modeling quality through three co-designed operators: $\mathcal{G}$ builds geometry-faithful point-cloud priors; $\mathcal{I}$ injects local surface statistics to seed anisotropic Gaussians, thereby reducing early conditioning gaps; and $\mathcal{T}$ unrolls alpha compositing with cached intermediates and index-mapped gradient scattering for stable mobile backpropagation. Extensive experiments demonstrate that PocketGS outperforms the powerful mainstream workstation 3DGS baseline under mobile budgets, delivering high-quality reconstructions and enabling a fully on-device, practical capture-to-rendering workflow.

2601.01627 2026-05-28 cs.CL cs.AI

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

JMedEthicBench:用于评估日语大语言模型医疗安全性的多轮对话基准

Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou, Shujun Wang, Yusuke Iwasawa, Yutaka Matsuo, Kan Hatakeyama-Sato

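Affiliations: Kyoto University; Hohai University; The University of Tokyo; University of Science and Technology of China; Hong Kong Polytechnic University

AI summary: Proposes JMedEthicBench, the first multi-turn conversational benchmark for medical safety in Japanese LLMs, with more than 50,000 adversarial conversations generated from 67 Japan Medical Association guidelines and seven automatically discovered jailbreak strategies. Evaluating 27 models shows that medical-specialized models are more vulnerable and that safety declines significantly over multi-turn interactions.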

Comments 12 pages, 6 figures

AI-generated Chinese abstract

随着大语言模型(LLM)在医疗领域的部署日益增多,在临床使用前仔细评估其医疗安全性变得至关重要。然而,现有的安全基准仍然以英语为中心,并且仅使用单轮提示进行测试,尽管临床咨询是多轮的。为了解决这些差距,我们引入了JMedEthicBench,这是第一个用于评估日语医疗LLM医疗安全性的多轮对话基准。我们的基准基于日本医学会的67条指南,包含使用七种自动发现的越狱策略生成的超过50,000个对抗性对话。使用双LLM评分协议,我们评估了27个模型,发现商业模型保持了稳健的安全性,而医疗专用模型表现出更高的脆弱性。此外,安全分数在对话轮次中显著下降(中位数:9.5降至5.0,p < 0.001)。对我们的基准的日语和英语版本进行的跨语言评估表明,医疗模型的脆弱性跨语言持续存在,表明存在固有的对齐限制,而非语言特定因素。这些发现表明,领域特定的微调可能会意外削弱安全机制,并且多轮交互代表了一个需要专门对齐策略的独特威胁面。

English abstract

As Large Language Models (LLMs) are increasingly deployed in the healthcare field, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly English-centric and test only single-turn prompts, even though clinical consultations are multi-turn. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating the medical safety of LLMs for Japanese healthcare. Our benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, $p < 0.001$). Cross-lingual evaluation on both Japanese and English versions of our benchmark reveals that medical model vulnerabilities persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may accidentally weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.

2505.13820 2026-05-28 cs.LG cs.AI cs.CL

Structured Agent Distillation for Large Language Model

大型语言模型的结构化智能体蒸馏

Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

Affiliations: Carnegie Mellon University; Harvard University; MIT; Northeastern University; Adobe Research; National University of Singapore; University of Georgia; Florida International University

AI summary: Proposes Structured Agent Distillation, a framework that compresses LLM agents into small student models by aligning reasoning and action spans segment by segment, lowering inference cost while preserving decision-making performance.

Journal ref The 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

AI-generated Chinese abstract

大型语言模型(LLMs)通过交错推理和动作(如ReAct风格框架)展现出作为决策智能体的强大能力。然而,它们的实际部署受到高推理成本和大模型规模的限制。我们提出结构化智能体蒸馏,一种将基于大型LLM的智能体压缩为更小的学生模型的框架,同时保持推理保真度和动作一致性。与标准的token级蒸馏不同,我们的方法将轨迹分割为[REASON]和[ACT]跨度,应用分段特定损失来使每个组件与教师行为对齐。这种结构感知的监督使紧凑的智能体能够更好地复制教师的决策过程。在ALFWorld、HotPotQA-ReAct和WebShop上的实验表明,我们的方法始终优于token级和模仿学习基线,在性能下降最小的情况下实现了显著的压缩。缩放和消融结果进一步强调了跨度级对齐对于高效可部署智能体的重要性。

English abstract

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency. Unlike standard token-level distillation, our method segments trajectories into [REASON] and [ACT] spans, applying segment-specific losses to align each component with the teacher's behavior. This structure-aware supervision enables compact agents to better replicate the teacher's decision process. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines, achieving significant compression with minimal performance drop. Scaling and ablation results further highlight the importance of span-level alignment for efficient and deployable agents.
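The span-level supervision described in the abstract can be sketched as two steps: label each token by the most recent `[REASON]` or `[ACT]` tag, then average a per-token loss separately inside each segment type before combining the two terms. The function names, the weights `w_reason` and `w_act`, and the use of precomputed per-token losses are hypothetical. The abstract does not specify the actual loss functions, so the numbers below merely stand in for whatever student-teacher divergence is used.

```python
def span_labels(tokens):
    """Label each token by the most recent [REASON] / [ACT] tag; tag tokens get None."""
    labels, current = [], None
    for tok in tokens:
        if tok in ("[REASON]", "[ACT]"):
            current = tok.strip("[]").lower()
            labels.append(None)
        else:
            labels.append(current)
    return labels

def segment_loss(token_losses, labels, w_reason=1.0, w_act=1.0):
    """Weighted sum of the mean per-token loss inside each segment type.
    The weights are illustrative hyperparameters, not values from the paper."""
    def mean_for(kind):
        vals = [loss for loss, lab in zip(token_losses, labels) if lab == kind]
        return sum(vals) / len(vals) if vals else 0.0
    return w_reason * mean_for("reason") + w_act * mean_for("act")

tokens = ["[REASON]", "need", "a", "key", "[ACT]", "open", "drawer"]
labels = span_labels(tokens)
per_token = [0.0, 0.2, 0.4, 0.6, 0.0, 1.0, 3.0]  # placeholder per-token losses
loss = segment_loss(per_token, labels, w_reason=1.0, w_act=0.5)
```

Averaging within each segment keeps a long reasoning span from drowning out a short action span, which plain token-level averaging would allow.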

2603.26182 2026-05-28 cs.CL

ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

ClinicalAgents:具有双记忆的临床决策多智能体编排

Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li

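Affiliations: The Hong Kong Polytechnic University; The Hong Kong University of Science and Technology

AI summary: Proposes ClinicalAgents, a multi-agent framework that simulates clinical reasoning through dynamic orchestration based on Monte Carlo Tree Search and a dual-memory architecture, significantly improving diagnostic accuracy and explainability.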

Comments Accepted to the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

AI-generated Chinese abstract

虽然大型语言模型(LLMs)在医疗保健领域展现出潜力,但它们往往难以应对临床准确诊断所需的复杂非线性推理。现有方法通常依赖从症状到诊断的静态线性映射,未能捕捉人类临床医生固有的迭代、假设驱动推理。为弥补这一差距,我们引入了ClinicalAgents,一种新颖的多智能体框架,旨在模拟专家临床医生的认知工作流。与僵化的顺序链不同,ClinicalAgents采用了一种动态编排机制,建模为蒙特卡洛树搜索(MCTS)过程。这使得编排器能够迭代生成假设、主动验证证据,并在关键信息缺失时触发回溯。该框架的基础是双记忆架构:一个可变的短期工作记忆,用于维护不断演变的患者状态以进行上下文感知推理;以及一个静态的经验记忆,通过主动反馈循环检索临床指南和历史病例。大量实验表明,ClinicalAgents在评估的基线中取得了最佳性能,与强大的单智能体和多智能体基线相比,显著提高了诊断准确性和可解释性。我们的代码发布在https://github.com/ZhuohanGe/ClinicalAgents-Code。

English abstract

While Large Language Models (LLMs) have demonstrated potential in healthcare, they often struggle with the complex, non-linear reasoning required for accurate clinical diagnosis. Existing methods typically rely on static, linear mappings from symptoms to diagnoses, failing to capture the iterative, hypothesis-driven reasoning inherent in human clinicians. To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians. Unlike rigid sequential chains, ClinicalAgents employs a dynamic orchestration mechanism modeled as a Monte Carlo Tree Search (MCTS) process. This allows an orchestrator to iteratively generate hypotheses, actively verify evidence, and trigger backtracking when critical information is missing. The foundation of this framework is a Dual-Memory architecture: a mutable working memory that maintains the evolving patient state for context-aware reasoning, and a static experience memory that retrieves clinical guidelines and historical cases via an active feedback loop. Extensive experiments demonstrate that ClinicalAgents achieves the best performance among evaluated baselines, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines. Our code is released at https://github.com/ZhuohanGe/ClinicalAgents-Code.

2601.19302 2026-05-28 cs.CL

Formula-One Prompting: A Composable Equation-First Prefix for Applied Mathematics

Formula-One Prompting:一种可组合的方程优先前缀用于应用数学

Natapong Nitarach, Pittawat Taveekitworachai, Kunat Pipatanakul

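Affiliations: SCB DataX, SCBX Group

AI summary: Proposes Formula Prompting (FP) and Formula-One Prompting (F-1), which formalize a problem's governing equations before solving. Across several applied-math benchmarks, F-1 outperforms Chain-of-Thought and Program-of-Thought prompting by 5.76 and 8.42 percentage points on average.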

AI-generated Chinese abstract

本文介绍了公式提示(FP)和Formula-One提示(F-1),两种单次调用方法,在解决应用数学问题之前先引出控制方程。思维链(CoT)和程序思维(PoT)提示通过引出预训练期间学到的推理轨迹或类似代码的结构来改进数学推理。这提出了一个诊断性问题:哪些有用的预训练模式仍然未被充分引出?使用infini-gram-mini,我们扫描了81.7万亿预训练令牌,发现在精心策划的语料库(如DataComp-LM)中,以方程为中心的语言出现频率比代码高121倍,比逐步叙述高3.79倍,但标准提示方法并未明确引出方程形式化。FP要求模型在求解前先形式化问题的控制方程;F-1扩展了FP,增加了一个可组合的第二阶段,在同一调用中选择直接、CoT或PoT风格的求解。在五个推理模型和四个应用数学基准(金融、物理、密码学、竞赛数学)上,F-1平均优于CoT 5.76个百分点,优于PoT 8.42个百分点,在FinanceMath上取得最大提升13.30个百分点,同时以仅68个提示令牌的开销占据准确率-令牌效率前沿。变体消融实验表明,方程形式化前缀(而非策略菜单)是主要驱动因素:在前缀之上添加CoT或PoT不会带来进一步收益,且73.3%的剩余失败发生在第一阶段方程正确之后。

English abstract

This paper introduces Formula Prompting (FP) and Formula-One Prompting (F-1), two single-call methods that elicit governing equations before solving applied-math problems. Chain-of-Thought (CoT) and Program-of-Thought (PoT) prompting improve mathematical reasoning by eliciting reasoning traces or code-like structures learned during pretraining. This suggests a diagnostic question: which useful pretraining patterns remain under-elicited? Using infini-gram-mini, we scan 81.7 trillion pretraining tokens and find that, in curated corpora such as DataComp-LM, equation-centered language appears 121x more often than code and 3.79x more often than step-by-step narration, yet standard prompting methods do not explicitly elicit equation formulation. FP asks the model to formalize a problem's governing equations before solving; F-1 extends FP with a composable Phase 2 that selects Direct, CoT, or PoT-style solving in the same call. Across five reasoning models and four applied-math benchmarks (finance, physics, cryptography, competition math), F-1 outperforms CoT by 5.76 pp and PoT by 8.42 pp on average, with the largest gain of 13.30 pp on FinanceMath, while topping the accuracy-token efficiency frontier at only 68 prompt tokens of overhead. Variant ablations identify the equation-formalization prefix, not the strategy menu, as the primary driver: adding CoT or PoT on top of the prefix yields no further gain, and 73.3% of remaining failures occur downstream of a correct Phase-1 equation.

2512.12887 2026-05-28 cs.CV

Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification

重新审视用于可扩展3D医学图像分类的2D基础模型

Han Liu, Bogdan Georgescu, Yanbo Zhang, Youngjin Yoo, Michael Baumgartner, Riqiang Gao, Jianing Wang, Gengyan Zhao, Eli Gibson, Dorin Comaniciu, Sasa Grbic

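Affiliations: Digital Technology and Innovation, Siemens Healthineers, Princeton NJ, USA; Digital Technology and Innovation, Siemens Healthineers, Erlangen, Germany

AI summary: Addresses data-regime bias, suboptimal adaptation, and insufficient task coverage in current foundation models for 3D medical image classification by proposing AnyMC3D, which adds lightweight plugins to a frozen 2D foundation model to scale efficiently to new tasks, and reports leading performance on 12 tasks.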

Comments 1st Place in VLM3D Challenge

AI-generated Chinese abstract

3D医学图像分类对于现代临床工作流程至关重要。医学基础模型(FMs)已成为扩展到新任务的有前途的方法,然而当前研究存在三个关键缺陷:数据体制偏差、次优适应和任务覆盖不足。在本文中,我们解决了这些缺陷,并引入了AnyMC3D,一种从2D FMs改编的可扩展3D分类器。我们的方法通过在单个冻结骨干网络上添加轻量级插件(每个任务约1M参数),高效地扩展到新任务。这个通用框架还支持多视图输入、辅助像素级监督和可解释的热力图生成。我们建立了一个涵盖12个任务的综合基准,包括不同的病理、解剖和模态,并系统分析了最先进的3D分类技术。我们的分析揭示了关键见解:(1)有效适应对于释放FM潜力至关重要,(2)通用FMs在适当适应后可以匹敌医学专用FMs,(3)基于2D的方法在3D分类上优于3D架构。我们首次证明了使用单一可扩展框架(包括在VLM3D挑战中获得第一名)在不同应用中实现最先进性能的可行性,消除了对单独任务特定模型的需求。

English abstract

3D medical image classification is essential for modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. Our method scales efficiently to new tasks by adding only lightweight plugins (about 1M parameters per task) on top of a single frozen backbone. This versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities, and systematically analyze state-of-the-art 3D classification techniques. Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (including 1st place in the VLM3D challenge), eliminating the need for separate task-specific models.

2603.22735 2026-05-28 cs.CL

Explanation Generation for Contradiction Reconciliation with LLMs

面向矛盾调和的大语言模型解释生成

Jason Chan, Zhixue Zhao, Robert Gaizauskas

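Affiliations: University of Sheffield, UK

AI summary: Proposes the task of reconciliatory explanation generation, repurposes NLI datasets and designs quality metrics, and evaluates 18 LLMs, finding that models have limited ability on the task and that the benefit of "thinking" plateaus as model size increases.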

Comments Preprint

AI-generated Chinese abstract

现有的NLP工作通常将矛盾视为需要通过选择接受或拒绝哪些陈述来解决的错误。然而,在社交互动和专业领域中,人类推理的一个关键方面是能够假设调和矛盾的解释。例如,“Cassie讨厌咖啡”和“她每天买咖啡”看似矛盾,但如果Cassie有每天为所有同事买咖啡这一不令人羡慕的日常任务,那么两者是兼容的。尽管大语言模型(LLM)的推理能力不断增强,但它们假设这种调和解释的能力在很大程度上仍未探索。为了填补这一空白,我们引入了调和解释生成任务,其中模型必须生成能够有效使矛盾陈述兼容的解释。我们提出了一种改造现有自然语言推理(NLI)数据集的新方法,并引入了可实现可扩展自动评估的质量指标。对18个LLM的实验表明,大多数模型在此任务中取得的成功有限,并且通过“思考”延长测试时计算的好处随着模型规模的增大而趋于平稳。我们的结果突显了LLM推理中一个未被充分探索的维度,以及解决这一限制以增强LLM下游应用(如聊天机器人和科学助手)的必要性。

English abstract

Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human reasoning in social interactions and professional domains is the ability to hypothesize explanations that reconcile contradictions. For example, "Cassie hates coffee" and "She buys coffee every day" may appear contradictory, yet both are compatible if Cassie has the unenviable daily chore of buying coffee for all her coworkers. Despite the growing reasoning capabilities of large language models (LLMs), their ability to hypothesize such reconciliatory explanations remains largely unexplored. To address this gap, we introduce the task of reconciliatory explanation generation, where models must generate explanations that effectively render contradictory statements compatible. We propose a novel method of repurposing existing natural language inference (NLI) datasets, and introduce quality metrics that enable scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success in this task, and that the benefit of extending test-time compute by "thinking" plateaus as model size increases. Our results highlight an under-explored dimension of LLM reasoning and the need to address this limitation in enhancing LLMs' downstream applications such as chatbots and scientific aids.

2603.21465 2026-05-28 cs.CL cs.LG

DRTriton: Large-Scale Synthetic Data Driven Reinforcement Learning for Triton Kernel Generation

DRTriton:大规模合成数据驱动的强化学习用于Triton内核生成

Siqi Guo, Ming Lin, Tianbao Yang

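Affiliations: Texas A&M University

AI summary: Proposes DRTriton, a framework that trains LLMs to convert PyTorch programs into optimized Triton kernels through synthetic data generation, curriculum reinforcement learning, and test-time search, surpassing GPT-5.2 and Claude-Sonnet-4.5 on KernelBench Level 2 tasks.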

AI-generated Chinese abstract

在生成式AI行业中,开发高效的CUDA内核是一项基础但具有挑战性的任务。最近的研究利用大型语言模型(LLMs)自动将PyTorch参考实现转换为CUDA内核,显著减少了工程工作量。最先进的LLMs,如GPT-5.2和Claude-Sonnet-4.5,仍然难以胜任此任务。为应对这一挑战,我们提出了DRTriton,一个可扩展的学习框架,用于训练LLM将PyTorch程序转换为高度优化的Triton内核,然后在运行时编译为CUDA内核。DRTriton包含三个关键组件:(i)数据合成算法CSP-DAG,保证在算子空间上的完全覆盖和具有可控难度的无偏均匀采样;(ii)具有解耦奖励的课程RL框架,联合优化转换成功率和执行速度;(iii)测试时搜索算法,进一步提高生成的Triton内核的执行速度。通过在使用现有LLM整理的有限PyTorch-Triton对上进行SFT预热阶段,DRTriton在合成PyTorch程序上通过RL训练,有效泛化到即使对人类专家也具挑战性的真实世界CUDA内核。实验结果表明,DRTriton-7B在92%的KernelBench Level 2任务上实现了相对于PyTorch的加速,而GPT-5.2为23%,Claude-Sonnet-4.5为19%。

English abstract

Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing engineering effort. However, state-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch programs into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) a curriculum RL framework with decoupled rewards that jointly optimizes conversion success rate and execution speed; and (iii) a test-time search algorithm that further improves the execution speed of the generated Triton kernels. With a warmup stage of SFT on limited PyTorch-Triton pairs curated using existing LLMs, DRTriton trained by RL on synthesized PyTorch programs generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves speedup over PyTorch on 92% of KernelBench Level 2 tasks, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.
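The headline metric, the share of tasks on which a generated kernel achieves a speedup over PyTorch, can be computed as sketched below. The record fields (`correct`, `reference_ms`, `kernel_ms`) are hypothetical names, and requiring correctness before counting a speedup is an assumption about the protocol, since the abstract does not spell out how the percentage is computed.

```python
def speedup_rate(results, threshold=1.0):
    """Fraction of tasks whose generated kernel is correct and faster than the
    reference by more than `threshold` (speedup = reference_ms / kernel_ms)."""
    if not results:
        return 0.0
    wins = sum(
        1 for r in results
        if r["correct"] and r["reference_ms"] / r["kernel_ms"] > threshold
    )
    return wins / len(results)

# Made-up timings for illustration only; these are not results from the paper.
results = [
    {"correct": True, "reference_ms": 2.0, "kernel_ms": 1.0},   # 2.0x faster
    {"correct": True, "reference_ms": 1.0, "kernel_ms": 1.5},   # slower
    {"correct": False, "reference_ms": 3.0, "kernel_ms": 1.0},  # fast but wrong
    {"correct": True, "reference_ms": 4.0, "kernel_ms": 3.2},   # 1.25x faster
]
rate = speedup_rate(results)
```

Counting a fast but incorrect kernel as a failure mirrors the abstract's pairing of conversion success with execution speed: a kernel has to be right before its speed matters.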

2603.21165 2026-05-28 cs.CL cs.CV

Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects

多种方言,多种语言,一种文化视角:评估多语言视觉语言模型对孟加拉文化的理解,涵盖历史关联语言和地区方言

Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta, Rubaya Tabassum, Ariful Ekraj Hridoy, Mehraj Mahmood, Mahbub E Sobhani, Md. Tarek Hasan, Swakkhar Shatabda

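Affiliations: United International University; BRAC University; University of Maryland, Baltimore County

AI summary: Proposes the BanglaVerse benchmark, built from manually curated images and extended to multiple languages and dialects, to evaluate multilingual vision-language models on Bengali culture understanding. Evaluating only standard Bangla overestimates model capability, dialectal variation lowers performance, and missing cultural knowledge is the main bottleneck.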

Comments https://labib1610.github.io/BanglaVerse/

AI-generated Chinese abstract

孟加拉文化通过地区、方言、历史、食物、政治、媒体和日常视觉生活丰富地表达,但在多模态评估中仍然代表性不足。为了解决这一差距,我们引入了BanglaVerse,这是一个文化基础的基准,用于评估多语言视觉语言模型(VLM)对孟加拉文化的理解,涵盖历史关联语言和地区方言。该基准由9个领域的1152张手动策划图像构建,支持视觉问答和字幕生成,并扩展为四种语言和五种孟加拉方言,产生约32.2K个工件。我们的实验表明,仅评估标准孟加拉语会高估真实模型能力:在方言变化下性能下降,尤其是字幕生成,而历史关联语言如印地语和乌尔都语保留了一些文化意义,但在结构化推理方面仍然较弱。跨领域来看,主要瓶颈是缺失文化知识而非仅视觉基础,尤其是知识密集型类别。这些发现将BanglaVerse定位为在语言变化下衡量文化基础多模态理解的更现实测试平台。

English abstract

Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.2K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, especially in knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.

2601.04716 2026-05-28 cs.CL

Identifying and Mitigating Bottlenecks in Role-Playing Agents: A Systematic Study of Disentangling Character Profile Axes

识别与缓解角色扮演代理中的瓶颈:解耦角色档案轴线的系统研究

Yonghyun Jun, Junhyuk Choi, Jeonghyun Park, Jihyeong Park, Liu Nicole Geumheon, Hwanhee Lee

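Affiliations: Chung-Ang University

AI summary: Disentangles character profiles along three axes (Familiarity, Structure, and Disposition) to systematically diagnose performance bottlenecks in LLM role-playing agents, and proposes Field-Aware Contrastive Decoding (FACD), a training-free strategy that mitigates the performance drop caused by Disposition.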

Comments 28 pages

AI-generated Chinese abstract

尽管大语言模型(LLM)角色扮演代理发展迅速,但尚不清楚哪些档案元素真正驱动角色扮演质量。为填补这一空白,我们引入了一个系统诊断框架,沿三个轴线解耦角色档案的影响:熟悉度(已知 vs. 未知)、结构(结构化 vs. 非结构化)和性格(道德 vs. 不道德)。利用统一的分层模式(5个维度,28个字段),我们构建了一个包含211个人物的受控数据集,并在单轮和多轮交互中评估了五个LLM。我们的结果揭示了显著的不对称性:熟悉度和结构的影响可忽略,而性格在所有条件下对不道德角色产生大且一致的性能下降。进一步分析表明,道德-不道德差距被后SFT对齐放大,且这种下降在不同档案属性间差异显著。为缓解这一瓶颈,我们提出场感知对比解码(FACD),一种无训练策略,通过放大被抑制的性格敏感信号,显著缩小性能差距而不牺牲道德角色的性能。

English abstract

While Large Language Model (LLM) role-playing agents have advanced rapidly, it remains unclear which profile elements genuinely drive role-playing quality. To bridge this gap, we introduce a systematic diagnostic framework that disentangles the impact of character profiles along three axes: Familiarity (Known vs. Unknown), Structure (Structured vs. Unstructured), and Disposition (Moral vs. Immoral). Utilizing a unified hierarchical schema (5 dimensions, 28 fields), we construct a controlled dataset of 211 personas and evaluate five LLMs on both single- and multi-turn interactions. Our results reveal a striking asymmetry: Familiarity and Structure show negligible impact, while Disposition produces large, consistent performance degradation for immoral characters across all conditions. Further analyses suggest that the Moral-Immoral gap is amplified by post-SFT alignment, and that this degradation varies substantially across profile attributes. To mitigate this bottleneck, we propose Field-Aware Contrastive Decoding (FACD), a training-free strategy that amplifies suppressed disposition-sensitive signals, significantly closing the performance gap without sacrificing moral-character performance.