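arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.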

2604.10567 2026-05-28 cs.CL cs.AI

Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models

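Jiyeon Kim, Sungik Choi, Yongrae Jo, Moontae Lee, Minjoon Seo

AI summary: By analyzing the inference dynamics of non-autoregressive diffusion language models, this paper finds that proximity bias causes error propagation, and proposes a lightweight planner with end-of-sequence temperature annealing to guide early token selection, substantially improving performance on reasoning and planning tasks.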

Comments ICML 2026 Camera Ready

Abstract

Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential for parallel token generation and bidirectional context modeling. However, harnessing this flexibility for fully non-autoregressive decoding remains an open question, particularly for reasoning and planning tasks. In this work, we investigate non-autoregressive decoding in dLLMs by systematically analyzing its inference dynamics along the temporal axis. Specifically, we uncover an inherent failure mode in confidence-based non-autoregressive generation stemming from a strong proximity bias: the tendency for the denoising order to concentrate on spatially adjacent tokens. This local dependency leads to spatial error propagation, rendering the entire trajectory critically contingent on the initial unmasking position. Leveraging this insight, we present a minimal-intervention approach that guides early token selection, employing a lightweight planner and end-of-sequence temperature annealing. We thoroughly evaluate our method on various reasoning and planning tasks and observe substantial overall improvement over existing heuristic baselines without significant computational overhead.
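
One way to make the proximity bias described above tangible is to score an unmasking order by how close each newly unmasked position is to the positions already revealed. The sketch below is only an illustration: the metric `mean_nearest_distance` and both example orders are hypothetical and are not taken from the paper.

```python
def mean_nearest_distance(order):
    # For each newly unmasked position, measure the distance to the
    # closest position that was unmasked at an earlier step.
    dists = []
    for step in range(1, len(order)):
        dists.append(min(abs(order[step] - prev) for prev in order[:step]))
    return sum(dists) / len(dists)

# Hypothetical unmasking orders over a 10-token sequence.
local_order = [5, 6, 4, 7, 3, 8, 2, 1, 0, 9]      # grows outward from position 5
scattered_order = [0, 9, 4, 2, 7, 5, 1, 8, 3, 6]  # jumps around the sequence

local_score = mean_nearest_distance(local_order)
scattered_score = mean_nearest_distance(scattered_order)
```

Under this toy metric, a low score means every new token sits next to an existing one, so the whole order is determined by where decoding started, which is the dependence on the initial unmasking position that the abstract points to.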


2604.09367 2026-05-28 cs.CV

EpiAgent: An Agent-Centric System for Ancient Inscription Restoration

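Shipeng Zhu, Ang Chen, Na Nie, Pengfei Fang, Min-Ling Zhang, Hui Xue

AI summary: Proposes EpiAgent, an agent-based system that uses hierarchical planning and an LLM to coordinate multimodal analysis, historical experience, and specialized tools for flexible, adaptive restoration of ancient inscriptions, achieving better restoration quality and generalization on real-world degraded inscriptions.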

Comments Accepted by CVPR 2026

Abstract

Ancient inscriptions, as repositories of cultural memory, have suffered from centuries of environmental and human-induced degradation. Restoring their intertwined visual and textual integrity poses one of the most demanding challenges in digital heritage preservation. However, existing AI-based approaches often rely on rigid pipelines, struggling to generalize across such complex and heterogeneous real-world degradations. Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formulates inscription restoration as a hierarchical planning problem. Following an Observe-Conceive-Execute-Reevaluate paradigm, an LLM-based central planner orchestrates collaboration among multimodal analysis, historical experience, specialized restoration tools, and iterative self-refinement. This agent-centric coordination enables a flexible and adaptive restoration process beyond conventional single-pass methods. Across real-world degraded inscriptions, EpiAgent achieves superior restoration quality and stronger generalization compared to existing methods. Our work marks an important step toward expert-level agent-driven restoration of cultural heritage. The code is available at https://github.com/blackprotoss/EpiAgent.


2604.09258 2026-05-28 cs.LG

Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

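Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong, Ke Shen, Jun Zhu

AI summary: This paper proposes the Nexus optimizer, which maximizes gradient similarity to pull the loss minima of different data sources closer together, significantly improving downstream generalization at the same pretraining loss.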

Abstract

The foundational capabilities of large language models are acquired during pretraining on internet-scale, highly heterogeneous data mixtures. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: Does the model converge to a common minimizer across all data sources, or merely a minimizer of the summed loss? We hypothesize that the geometric "closeness" of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the Nexus optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters, various data mixtures, and hyperparameter schedules show that Nexus significantly boosts downstream performance, despite achieving the same pretraining loss. Notably, on the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0% accuracy improvement on complex reasoning tasks (e.g., GSM8k). This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.
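
The difference between a common minimizer and a minimizer of the summed loss can be seen in a small toy case. The sketch below is not the Nexus optimizer. It assumes two hypothetical quadratic losses, one per data source, and measures the cosine similarity of their gradients at the point where the summed loss is minimized. All function names and numbers are illustrative assumptions.

```python
import math

def grad_quadratic(theta, center):
    # Gradient of the toy loss 0.5 * ||theta - center||^2 (hypothetical per-source loss).
    return [t - c for t, c in zip(theta, center)]

def cosine_similarity(u, v):
    # Cosine of the angle between two gradient vectors; 0.0 if either is zero.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def summed_loss_minimizer(center_a, center_b):
    # For two equal-weight quadratics, the summed loss is minimized at the midpoint.
    return [(a + b) / 2 for a, b in zip(center_a, center_b)]

# Two sources whose individual minima are far apart.
center_a, center_b = [2.0, 0.0], [-2.0, 0.0]
theta = summed_loss_minimizer(center_a, center_b)
sim_at_sum_min = cosine_similarity(grad_quadratic(theta, center_a),
                                   grad_quadratic(theta, center_b))

# Far from both minima, the two per-source gradients largely agree.
far_point = [0.0, 10.0]
sim_far = cosine_similarity(grad_quadratic(far_point, center_a),
                            grad_quadratic(far_point, center_b))
```

In this toy case the summed loss is stationary at the midpoint while the per-source gradients point in opposite directions, so a low summed loss alone says nothing about whether the sources agree. Gradient similarity is one signal that separates the two situations.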


2604.05333 2026-05-28 cs.AI

Graph-of-Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

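Dawei Liu, Zongxia Li, Hongyang Du, Xiyang Wu, Shihang Gui, Yongbei Kuang, Lichao Sun

AI summary: Proposes Graph-of-Skills (GoS), an inference-time structural retrieval layer that builds an executable skill graph and retrieves dependency-aware skill bundles via hybrid semantic-lexical seeding, reverse-aware Personalized PageRank, and context-budgeted hydration, yielding significant reward gains and token savings on SkillsBench and ALFWorld.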

Comments 11 pages of main text, 12 pages of appendix. Core contribution by Dawei Liu and Zongxia Li. Project page: https://github.com/davidliuk/graph-of-skills

Abstract

Modern LLM agents increasingly rely on reusable skills, and as they interact with personal applications, web browsers, and other interfaces, skill libraries can scale to thousands of skills. Scaling to larger skill sets introduces two key challenges. First, loading the full skill set saturates the context window, driving up token costs, hallucination, and latency. Second, semantic retrieval surfaces topically relevant skills but misses their prerequisite chain of upstream and downstream skills, creating a prerequisite gap that leaves the retrieved bundle execution-incomplete. In this paper, we present Graph-of-Skills (GoS), an inference-time structural retrieval layer for large skill libraries. GoS constructs an executable skill graph offline from skill packages, then at inference time retrieves a bounded, dependency-aware skill bundle through hybrid semantic-lexical seeding, reverse-aware Personalized PageRank, and context-budgeted hydration. On SkillsBench and ALFWorld, GoS consistently delivers substantial reward improvements and token savings across three model families (Claude Sonnet 4.5, MiniMax M2.7, and GPT-5.2 Codex). On SkillsBench, GoS achieves a peak reward increase of 25.55% while reducing total tokens by 56.72% over the vanilla full skill-loading baseline using GPT-5.2 Codex. Ablations confirm this pattern across skill libraries from 200 to 2,000 skills.
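
The retrieval step described above can be sketched with a plain power-iteration Personalized PageRank over a tiny skill graph. This is a minimal illustration, not the GoS implementation. The `reverse_weight` parameter is one guess at what "reverse-aware" could mean (walking edges backwards at reduced weight), the top-k cut is a stand-in for context-budgeted hydration, and the skill names are invented.

```python
def personalized_pagerank(edges, seeds, alpha=0.15, reverse_weight=0.5, iters=100):
    """Power-iteration Personalized PageRank on a small skill graph.

    edges: list of (skill, prerequisite) pairs.
    seeds: dict mapping seed skill -> nonnegative seed weight.
    alpha: restart probability (mass returned to the seeds each step).
    reverse_weight: hypothetical knob; each edge is also walked backwards
        with this weight.
    """
    nodes = sorted({n for e in edges for n in e} | set(seeds))
    out = {n: {} for n in nodes}
    for src, dst in edges:
        out[src][dst] = out[src].get(dst, 0.0) + 1.0
        if reverse_weight > 0:
            out[dst][src] = out[dst].get(src, 0.0) + reverse_weight
    total_seed = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total_seed for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            weight_sum = sum(out[n].values())
            if weight_sum == 0:
                # Dangling node: return its mass to the seed distribution.
                for m in nodes:
                    nxt[m] += (1 - alpha) * rank[n] * restart[m]
                continue
            for m, w in out[n].items():
                nxt[m] += (1 - alpha) * rank[n] * w / weight_sum
        rank = nxt
    return rank

def retrieve_bundle(rank, budget):
    # Keep the top-`budget` skills by score (ties broken by name).
    return sorted(rank, key=lambda n: (-rank[n], n))[:budget]

# Invented skills: a deployment chain plus an unrelated email chain.
edges = [("deploy_app", "build_image"), ("build_image", "install_docker"),
         ("send_email", "auth_mail")]
rank = personalized_pagerank(edges, seeds={"deploy_app": 1.0})
bundle = retrieve_bundle(rank, budget=3)
```

Seeding on `deploy_app` pulls in its prerequisite chain while the unrelated email skills receive no score, which is the behavior a purely semantic top-k lookup would not guarantee.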


2604.06196 2026-05-28 cs.CL cs.AI cs.LO

Compositional Consistency-Guided Decoding for Three-Way Logical Question Answering

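Tianyi Huang, Ming Hou, Jiaheng Su, Yutong Zhang, Ziling Zhang

AI summary: To address negation inconsistency and epistemic Unknown in LLM three-way logical question answering, proposes CGD-PD, a lightweight test-time decoding layer that combines neural three-way classification, symbolic negation-consistency projection, and targeted binary entailment probes, improving accuracy on FOLIO by 4.4 to 6.8 points and reducing Unknown predictions.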

Comments Accepted at the ICML 2026 Workshop on Compositional Learning: Safety, Interpretability, and Agents

Abstract

Three-way logical question answering (QA) assigns one of $\text{True}$, $\text{False}$, or $\text{Unknown}$ to a hypothesis $H$ given a premise set $S$. We study this task as a compact compositional inference problem: predictions for $H$ and for a mechanically negated hypothesis $\neg H$ should agree under a deterministic negation map. Despite this simple structure, large language models (LLMs) can exhibit two practical failure modes: (i) negation inconsistency, where answers to $H$ and $\neg H$ violate the required label mapping, and (ii) epistemic $\text{Unknown}$, where the model abstains even when one side is entailed. We introduce CGD-PD, a lightweight, training-free test-time layer that combines neural 3-way classification, symbolic negation-consistency projection, and targeted binary entailment probes. On one validation split of FOLIO's first-order logic fields, CGD-PD improves accuracy by 4.4 points on GPT-5.2 and 6.8 points on Claude Sonnet 4.5, while reducing $\text{Unknown}$ predictions and epistemic abstention. These results provide a controlled proof of concept that simple logical composition at inference time can help evaluate and improve LLM reasoning reliability; they do not, by themselves, establish robustness beyond this formal benchmark setting.
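
The deterministic negation map mentioned above is small enough to write down directly: True and False swap, and Unknown maps to itself. The sketch below checks a pair of predictions against that map and, when they disagree, projects onto the best consistent pair. The product scoring rule in `project` and the probability values are assumptions made for illustration; the abstract does not specify how CGD-PD performs its projection.

```python
NEGATION_MAP = {"True": "False", "False": "True", "Unknown": "Unknown"}

def is_consistent(label_h, label_not_h):
    # Consistent iff the label for not-H is the image of the label for H.
    return NEGATION_MAP[label_h] == label_not_h

def project(probs_h, probs_not_h):
    """Pick the consistent label pair with the highest joint score.

    probs_h / probs_not_h: dicts over {"True", "False", "Unknown"} from a
    hypothetical three-way classifier. Scoring a pair by the product of its
    two probabilities is an assumption of this sketch.
    """
    best = max(NEGATION_MAP,
               key=lambda lab: probs_h[lab] * probs_not_h[NEGATION_MAP[lab]])
    return best, NEGATION_MAP[best]

# Made-up classifier outputs for H and for its mechanical negation.
probs_h = {"True": 0.50, "False": 0.10, "Unknown": 0.40}
probs_not_h = {"True": 0.05, "False": 0.45, "Unknown": 0.50}

raw = (max(probs_h, key=probs_h.get), max(probs_not_h, key=probs_not_h.get))
fixed = project(probs_h, probs_not_h)
```

Here the independent argmax answers are True for H and Unknown for its negation, which violates the map. The projection resolves the pair to True and False because that consistent pair has the highest joint score.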


2604.04074 2026-05-28 cs.AI cs.LG

FactReview: Evidence-Grounded Peer Review with Execution-Based Claim Verification

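Ling Yue, Chaoqian Ouyang, Hang Xu, Ruijun Huang, Yuchen Liu, Libin Zheng, Wei Liu, Shaowu Pan, Shimin Di, Min-Ling Zhang

AI summary: Proposes FactReview, a system that extracts review-relevant claims, grounds them in related work, and, when code is available, executes released artifacts under a fixed repair budget to audit empirical claims; it covers 84% of claims, reaches a review quality score of 4.86/5, and cuts review time by 58%.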

Abstract

LLM-based reviewing systems typically take only the manuscript as input, leaving literature and code-based claims hard to verify. We present FactReview, a system that extracts review-relevant claims, grounds them in related work, and, when code is available, executes released artifacts under a fixed repair budget to audit empirical claims. Across 35 ML papers and 463 benchmark major claims, FactReview covers 84% of claims. Under an evidence-aware rubric, its reviews score 4.86/5 in overall quality, 0.7 above DeepReview-v2 and 1.5 above matched OpenReview comments. Removing execution evidence changes 17% of claim statuses, more than any other single evidence source. In a reviewer-assistance study, FactReview reduces mean review time by 58% while raising benchmark claim coverage from 87% to 99%. We argue that LLM reviewers should audit empirical claims, not make accept-reject decisions. The code is public at: https://github.com/DEFENSE-SEU/FactReview.


2604.05378 2026-05-28 cs.CL cs.CV

ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

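Kaiser Hamid, Can Cui, Nade Liang

AI summary: Proposes the ICR-Drive framework, which generates four families of perturbed instructions (paraphrase, ambiguity, noise, misleading) and evaluates them in CARLA simulation, revealing the fragility of language-conditioned driving models under instruction changes.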

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 872-880

Abstract

Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.
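
Per-family degradation relative to the baseline instruction can be computed as a simple relative drop. The sketch below assumes that definition, which the abstract does not spell out, and uses invented scores purely to show the arithmetic. They are not results from the paper.

```python
def per_family_degradation(baseline_score, family_scores):
    # Relative drop of each family's mean score versus the baseline instruction.
    out = {}
    for family, scores in family_scores.items():
        mean = sum(scores) / len(scores)
        out[family] = (baseline_score - mean) / baseline_score
    return out

# Invented driving scores on the same routes and seeds (illustration only).
baseline = 80.0
scores = {
    "Paraphrase": [78.0, 76.0],
    "Misleading": [40.0, 44.0],
}
degradation = per_family_degradation(baseline, scores)
```

Because routes, simulator configuration, and seeds are held fixed, any nonzero value in `degradation` can be attributed to the change in instruction wording.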


2604.03799 2026-05-28 cs.CV

Next-Scale Autoregressive Models for Text-to-Motion Generation

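Zhiwei Zheng, Shibo Jin, Lingjie Liu, Mingmin Zhao

AI summary: Proposes the MoScale framework, which generates motion hierarchically from coarse to fine temporal resolution and combines cross-scale and intra-scale refinement for efficient, scalable text-to-motion generation.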

Comments Accepted to CVPR 2026

2604.02645 2026-05-28 cs.CL cs.AI

Speaking of Language: Reflections on Metalanguage Research in NLP

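Nathan Schneider, Antonios Anastasopoulos

AI summary: This paper defines the concept of metalanguage, relates it to NLP and LLMs, presents metalanguage-centered research from two labs, discusses four dimensions of metalanguage and metalinguistic tasks, and proposes directions for future research.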

Comments To appear at the Big Picture Workshop at ACL 2026. Camera-ready version

2604.02028 2026-05-28 cs.CL

Why Gaussian Diffusion Models Fail on Discrete Data and How to Prevent It?

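Alexander Shabalin, Simon Elistratov, Viacheslav Meshchaninov, Ildus Sadrtdinov, Dmitry Vetrov

AI summary: This paper studies why Gaussian diffusion models sample poorly on discrete data, finds that within a critical sampling interval the density of noisified data becomes multimodal and drives DDPM into low-density regions, and proposes combining self-conditioning with q-sampling to improve generation quality.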

Abstract

Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes, producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.
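
The shift from a single-mode to a multimodal noisy density can be reproduced in one dimension. The sketch below is a deliberately simpler setting than the paper's Random Hierarchy Model. It assumes data uniform on the two points -1 and +1, noised as x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, and counts the local maxima of the resulting two-Gaussian mixture on a grid. For this toy, the density has two modes exactly when sqrt(alpha_bar) exceeds sqrt(1 - alpha_bar).

```python
import math

def noisy_density(x, alpha_bar):
    # Density of x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps,
    # with x0 uniform on {-1, +1} and eps standard normal (toy assumption).
    mean = math.sqrt(alpha_bar)
    var = 1.0 - alpha_bar
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)
    return 0.5 * norm * (math.exp(-(x - mean) ** 2 / (2 * var))
                         + math.exp(-(x + mean) ** 2 / (2 * var)))

def count_modes(alpha_bar, lo=-4.0, hi=4.0, steps=8001):
    # Count strict local maxima of the density on a uniform grid.
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    ps = [noisy_density(x, alpha_bar) for x in xs]
    return sum(1 for i in range(1, steps - 1) if ps[i - 1] < ps[i] > ps[i + 1])

modes_low_noise = count_modes(0.9)   # little noise: two separated modes
modes_high_noise = count_modes(0.2)  # heavy noise: modes merged into one
```

Between the two modes in the low-noise case lies a valley of low density, and a sampler that steps into that valley is handing the denoiser an input it rarely saw during training.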


2604.01604 2026-05-28 cs.AI

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

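Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han

AI summary: Proposes the CRaFT framework, which uses cross-layer transcoders to build a sparse feature circuit graph and, by quantifying inter-feature influences and their contributions to the final output, selects the key features governing refusal behavior, significantly improving jailbreak attack performance.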

Abstract

While modern LLMs are aligned to refuse harmful requests, it is essential to understand the underlying mechanistic basis of this refusal behavior for model safety analysis. For example, steering-based jailbreak attacks exploit this by identifying and manipulating sparse, neuron-like refusal features to bypass safety guardrails. Current feature selection methods primarily rely on how strongly features activate on harmful prompts. However, activation strength alone often captures superficial heuristics such as topic or lexical cues, rather than the true causal mechanisms. Thus, selecting refusal features requires measuring inter-feature relationships, rather than treating each feature as an isolated activation signal. Based on this insight, we propose CRaFT, a circuit-guided framework for identifying critical refusal features that directly govern the refusal decision. CRaFT leverages cross-layer transcoders to map the model's internal computations into a sparse feature circuit graph, where edges quantify inter-feature influences and their contributions to the final output logits. By aggregating the effects propagating along the paths to refusal, CRaFT effectively ranks the most influential features. Extensive evaluations across four jailbreak benchmarks show that CRaFT significantly improves average performance from 6.7% to 57.4% and generates more specific harmful completions compared to current SOTA methods.


2604.00913 2026-05-28 cs.CV cs.CL

Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

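Zhuchenyang Liu, Yao Zhang, Yu Xiao

AI summary: Builds the IKEA-Bench benchmark and evaluates 19 vision-language models on aligning assembly diagrams with video frames, finding that visual encoding is the key bottleneck for improving cross-depiction robustness.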

Abstract

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/


2604.00402 2026-05-28 cs.CV cs.AI

COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

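Seohyoung Park, Jaeyeol Lim, Seoyoung Ju, Kyeonghun Kim, Nam-Joon Kim, Hyuk-Jae Lee

AI summary: This paper studies transferring QCNet, a trajectory prediction model trained on U.S. data, to Korean road environments; comparing four training strategies, it finds that freezing the encoder and fine-tuning the decoder gives the best balance of accuracy and efficiency, reducing prediction error by over 66%.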

Comments 4 pages, 2 figures. Accepted at ICEIC 2026

Abstract

Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, are collected in Western road environments and do not reflect the unique traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain discrepancy leads to performance degradation when state-of-the-art models trained on Western data are deployed in different geographic contexts. In this work, we investigate the adaptability of Query-Centric Trajectory Prediction (QCNet) when transferred from U.S.-based data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. Experimental results demonstrate that leveraging pretrained knowledge significantly improves prediction performance. Specifically, selectively fine-tuning the decoder while freezing the encoder yields the best trade-off between accuracy and training efficiency, reducing prediction error by over 66% compared to training from scratch. This study provides practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.


2512.11524 2026-05-28 cs.CV cs.LG

Super-Resolved Canopy Height Mapping from Sentinel-2 Time Series Using Airborne LiDAR HD Reference Data across Metropolitan France

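Ekaterina Kalinicheva, Florian Helen, Stéphane Mermoz, Florian Mouret, Milena Planells

AI summary: Proposes THREASURE-Net, an end-to-end framework that uses Sentinel-2 time series and LiDAR HD data to produce annual canopy height maps at 2.5 m, 5 m, and 10 m resolution without pretrained models or high-resolution optical imagery, achieving better accuracy than existing methods across Metropolitan France.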

Abstract

Fine-scale forest monitoring is essential for understanding canopy structure and its dynamics, which are key indicators of carbon stocks, biodiversity, and forest health. Deep learning is particularly effective for this task, as it integrates spectral, temporal, and spatial signals that jointly reflect the canopy structure. To address this need, we introduce THREASURE-Net, a novel end-to-end framework for Tree Height Regression And Super-Resolution. The model is trained on Sentinel-2 time series using reference height metrics derived from LiDAR HD data at multiple spatial resolutions over Metropolitan France to produce annual height maps. We evaluate three model variants, producing tree-height predictions at 2.5 m, 5 m, and 10 m resolution. THREASURE-Net does not rely on any pretrained model nor on reference very high resolution optical imagery to train its super-resolution module; instead, it learns solely from LiDAR-derived height information. Our approach outperforms existing state-of-the-art methods based on Sentinel data and is competitive with methods based on very high resolution imagery. It can be deployed to generate high-precision annual canopy-height maps, achieving mean absolute errors of 2.63 m, 2.70 m, and 2.88 m at 2.5 m, 5 m, and 10 m resolution, respectively. These results highlight the potential of THREASURE-Net for scalable and cost-effective structural monitoring of temperate forests using only freely available satellite data. The source code for THREASURE-Net is available at: https://github.com/Global-Earth-Observation/threasure-net.


2601.17354 2026-05-28 cs.CV cs.GR

PocketGS: On-Device Training of 3D Gaussian Splatting for High Perceptual Modeling

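Wenzhi Guo, Guangchi Fang, Shu Yang, Bing Wang

AI summary: Proposes PocketGS, which uses three co-designed operators (G, I, T) to train 3D Gaussian Splatting efficiently on mobile devices while preserving high-fidelity reconstruction under strict resource constraints.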

Abstract

While 3D Gaussian Splatting (3DGS) enables real-time rendering, its training demands workstation-level compute and memory, making mobile deployment impractical under minute-scale time budgets and limited peak memory. We present PocketGS, a mobile scene modeling paradigm that enables on-device 3DGS training under these tightly coupled constraints while preserving high-fidelity reconstruction. PocketGS resolves the fundamental tension between training efficiency, memory compactness, and modeling quality through three co-designed operators: $\mathcal{G}$ builds geometry-faithful point-cloud priors; $\mathcal{I}$ injects local surface statistics to seed anisotropic Gaussians, thereby reducing early conditioning gaps; and $\mathcal{T}$ unrolls alpha compositing with cached intermediates and index-mapped gradient scattering for stable mobile backpropagation. Extensive experiments demonstrate that PocketGS outperforms the powerful mainstream workstation 3DGS baseline under mobile budgets, delivering high-quality reconstructions and enabling a fully on-device, practical capture-to-rendering workflow.


2601.01627 2026-05-28 cs.CL cs.AI

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

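Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou, Shujun Wang, Yusuke Iwasawa, Yutaka Matsuo, Kan Hatakeyama-Sato

AI summary: Proposes JMedEthicBench, the first multi-turn conversational benchmark of its kind, which builds more than 50,000 adversarial conversations from 67 Japan Medical Association guidelines and seven automatic jailbreak strategies; evaluating 27 models shows that medical-specialized models are fragile in safety and that safety drops significantly over multi-turn interactions.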

Comments 12 pages, 6 figures

Abstract

As Large Language Models (LLMs) are increasingly deployed in the healthcare field, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly English-centric and test only single-turn prompts, even though clinical consultations are multi-turn. To address these gaps, we introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating medical safety of LLMs for Japanese healthcare. Our benchmark is based on 67 guidelines from the Japan Medical Association and contains over 50,000 adversarial conversations generated using seven automatically discovered jailbreak strategies. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability. Furthermore, safety scores decline significantly across conversation turns (median: 9.5 to 5.0, $p < 0.001$). Cross-lingual evaluation on both Japanese and English versions of our benchmark reveals that medical model vulnerabilities persist across languages, indicating inherent alignment limitations rather than language-specific factors. These findings suggest that domain-specific fine-tuning may accidentally weaken safety mechanisms and that multi-turn interactions represent a distinct threat surface requiring dedicated alignment strategies.


2505.13820 2026-05-28 cs.LG cs.AI cs.CL

Structured Agent Distillation for Large Language Model

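Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

AI summary: Proposes a Structured Agent Distillation framework that aligns reasoning and action spans segment by segment to compress large language model agents into small student models, lowering inference cost while preserving decision-making performance.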

Journal ref: The 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abstract

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency. Unlike standard token-level distillation, our method segments trajectories into [REASON] and [ACT] spans, applying segment-specific losses to align each component with the teacher's behavior. This structure-aware supervision enables compact agents to better replicate the teacher's decision process. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines, achieving significant compression with minimal performance drop. Scaling and ablation results further highlight the importance of span-level alignment for efficient and deployable agents.
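
The span-level supervision described above can be sketched in two steps: label each token by the span it falls in, then average a distillation loss separately per span type. The code below is an illustrative sketch only. The use of KL divergence, the span weights, and the toy two-word vocabulary are assumptions, since the abstract does not give the exact losses.

```python
import math

def span_masks(tokens):
    """Label each token as 'reason', 'act', or None from [REASON]/[ACT] markers."""
    current, labels = None, []
    for tok in tokens:
        if tok == "[REASON]":
            current = "reason"
            labels.append(None)      # the marker itself carries no loss
        elif tok == "[ACT]":
            current = "act"
            labels.append(None)
        else:
            labels.append(current)
    return labels

def segment_losses(labels, teacher_probs, student_probs):
    """Mean per-token KL(teacher || student), averaged separately per span type."""
    sums = {"reason": 0.0, "act": 0.0}
    counts = {"reason": 0, "act": 0}
    for lab, p, q in zip(labels, teacher_probs, student_probs):
        if lab is None:
            continue
        kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
        sums[lab] += kl
        counts[lab] += 1
    return {k: (sums[k] / counts[k] if counts[k] else 0.0) for k in sums}

# Toy trajectory with made-up next-token distributions over a 2-word vocabulary.
tokens = ["[REASON]", "need", "key", "[ACT]", "open", "drawer"]
labels = span_masks(tokens)
teacher = [[0.5, 0.5]] * 6
student = [[0.5, 0.5]] * 3 + [[0.9, 0.1]] * 3
losses = segment_losses(labels, teacher, student)
total = 1.0 * losses["reason"] + 2.0 * losses["act"]  # hypothetical span weights
```

Keeping the two averages separate lets a short action span count as much as a long reasoning span, which a single token-level average would not do.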

2603.26182 2026-05-28 cs.CL

ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

ClinicalAgents：具有双记忆的临床决策多智能体编排

Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li

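AI summary: Proposes ClinicalAgents, a multi-agent framework that simulates clinical reasoning through dynamic orchestration based on Monte Carlo Tree Search and a dual-memory architecture, significantly improving diagnostic accuracy and explainability.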

Comments Accepted to the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

DOI: 10.1145/3770855.3818931

AI中文摘要

虽然大型语言模型（LLMs）在医疗保健领域展现出潜力，但它们往往难以应对临床准确诊断所需的复杂非线性推理。现有方法通常依赖从症状到诊断的静态线性映射，未能捕捉人类临床医生固有的迭代、假设驱动推理。为弥补这一差距，我们引入了ClinicalAgents，一种新颖的多智能体框架，旨在模拟专家临床医生的认知工作流。与僵化的顺序链不同，ClinicalAgents采用了一种动态编排机制，建模为蒙特卡洛树搜索（MCTS）过程。这使得编排器能够迭代生成假设、主动验证证据，并在关键信息缺失时触发回溯。该框架的基础是双记忆架构：一个可变的短期工作记忆，用于维护不断演变的患者状态以进行上下文感知推理；以及一个静态的经验记忆，通过主动反馈循环检索临床指南和历史病例。大量实验表明，ClinicalAgents在评估的基线中取得了最佳性能，与强大的单智能体和多智能体基线相比，显著提高了诊断准确性和可解释性。我们的代码发布在https://github.com/ZhuohanGe/ClinicalAgents-Code。

英文摘要

While Large Language Models (LLMs) have demonstrated potential in healthcare, they often struggle with the complex, non-linear reasoning required for accurate clinical diagnosis. Existing methods typically rely on static, linear mappings from symptoms to diagnoses, failing to capture the iterative, hypothesis-driven reasoning inherent in human clinicians. To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians. Unlike rigid sequential chains, ClinicalAgents employs a dynamic orchestration mechanism modeled as a Monte Carlo Tree Search (MCTS) process. This allows an orchestrator to iteratively generate hypotheses, actively verify evidence, and trigger backtracking when critical information is missing. The foundation of this framework is a Dual-Memory architecture: a mutable working memory that maintains the evolving patient state for context-aware reasoning, and a static experience memory that retrieves clinical guidelines and historical cases via an active feedback loop. Extensive experiments demonstrate that ClinicalAgents achieves the best performance among evaluated baselines, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines. Our code is released at https://github.com/ZhuohanGe/ClinicalAgents-Code.

2601.19302 2026-05-28 cs.CL

Formula-One Prompting: A Composable Equation-First Prefix for Applied Mathematics

Formula-One Prompting：一种可组合的方程优先前缀用于应用数学

Natapong Nitarach, Pittawat Taveekitworachai, Kunat Pipatanakul

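AI summary: Proposes Formula Prompting (FP) and Formula-One Prompting (F-1), which formalize a problem's governing equations before solving; on several applied-math benchmarks, F-1 outperforms Chain-of-Thought and Program-of-Thought prompting by 5.76 and 8.42 percentage points on average, respectively.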

AI中文摘要

本文介绍了公式提示（FP）和Formula-One提示（F-1），两种单次调用方法，在解决应用数学问题之前先引出控制方程。思维链（CoT）和程序思维（PoT）提示通过引出预训练期间学到的推理轨迹或类似代码的结构来改进数学推理。这提出了一个诊断性问题：哪些有用的预训练模式仍然未被充分引出？使用infini-gram-mini，我们扫描了81.7万亿预训练令牌，发现在精心策划的语料库（如DataComp-LM）中，以方程为中心的语言出现频率比代码高121倍，比逐步叙述高3.79倍，但标准提示方法并未明确引出方程形式化。FP要求模型在求解前先形式化问题的控制方程；F-1扩展了FP，增加了一个可组合的第二阶段，在同一调用中选择直接、CoT或PoT风格的求解。在五个推理模型和四个应用数学基准（金融、物理、密码学、竞赛数学）上，F-1平均优于CoT 5.76个百分点，优于PoT 8.42个百分点，在FinanceMath上取得最大提升13.30个百分点，同时以仅68个提示令牌的开销占据准确率-令牌效率前沿。变体消融实验表明，方程形式化前缀（而非策略菜单）是主要驱动因素：在前缀之上添加CoT或PoT不会带来进一步收益，且73.3%的剩余失败发生在第一阶段方程正确之后。

英文摘要

This paper introduces Formula Prompting (FP) and Formula-One Prompting (F-1), two single-call methods that elicit governing equations before solving applied-math problems. Chain-of-Thought (CoT) and Program-of-Thought (PoT) prompting improve mathematical reasoning by eliciting reasoning traces or code-like structures learned during pretraining. This suggests a diagnostic question: which useful pretraining patterns remain under-elicited? Using infini-gram-mini, we scan 81.7 trillion pretraining tokens and find that, in curated corpora such as DataComp-LM, equation-centered language appears 121x more often than code and 3.79x more often than step-by-step narration, yet standard prompting methods do not explicitly elicit equation formulation. FP asks the model to formalize a problem's governing equations before solving; F-1 extends FP with a composable Phase 2 that selects Direct, CoT, or PoT-style solving in the same call. Across five reasoning models and four applied-math benchmarks (finance, physics, cryptography, competition math), F-1 outperforms CoT by 5.76 pp and PoT by 8.42 pp on average, with the largest gain of 13.30 pp on FinanceMath, while topping the accuracy-token efficiency frontier at only 68 prompt tokens of overhead. Variant ablations identify the equation-formalization prefix, not the strategy menu, as the primary driver: adding CoT or PoT on top of the prefix yields no further gain, and 73.3% of remaining failures occur downstream of a correct Phase-1 equation.
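
An equation-first prefix of this kind might be assembled as follows. The abstract does not reproduce the actual 68-token prompt, so the wording of `FP_PREFIX` and `F1_PHASE2` and the helper `build_prompt` are hypothetical stand-ins that only illustrate the two-phase structure in a single call.

```python
# Hypothetical wording: the paper's actual prompt text is not given in the abstract.
FP_PREFIX = (
    "Before solving, write down the governing equations of the problem "
    "and define every symbol you use."
)
F1_PHASE2 = (
    "Then solve in whichever way fits best: a direct calculation, "
    "step-by-step reasoning, or a short program."
)

def build_prompt(problem: str, compose_phase2: bool = True) -> str:
    """Return an FP-style prompt, or an F-1-style prompt when compose_phase2 is True."""
    parts = [FP_PREFIX]
    if compose_phase2:
        parts.append(F1_PHASE2)
    instructions = " ".join(parts)
    return f"{instructions}\n\nProblem: {problem.strip()}\n\nEquations:"

if __name__ == "__main__":
    print(build_prompt(
        "A bond pays a 5% annual coupon on a face value of 1000 for 3 years "
        "at a 4% yield. What is its price?"
    ))
```

Ending the prompt with an `Equations:` cue is a design choice of this sketch; it nudges the model to start with the formalization rather than the answer.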

2512.12887 2026-05-28 cs.CV

Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification

重新审视用于可扩展3D医学图像分类的2D基础模型

Han Liu, Bogdan Georgescu, Yanbo Zhang, Youngjin Yoo, Michael Baumgartner, Riqiang Gao, Jianing Wang, Gengyan Zhao, Eli Gibson, Dorin Comaniciu, Sasa Grbic

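AI summary: Addresses data-regime bias, suboptimal adaptation, and insufficient task coverage in current foundation-model research on 3D medical image classification by proposing AnyMC3D, which freezes a 2D foundation model and adds lightweight plugins to scale efficiently across tasks, achieving leading performance on 12 tasks.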

Comments 1st Place in VLM3D Challenge

AI中文摘要

3D医学图像分类对于现代临床工作流程至关重要。医学基础模型（FMs）已成为扩展到新任务的有前途的方法，然而当前研究存在三个关键缺陷：数据体制偏差、次优适应和任务覆盖不足。在本文中，我们解决了这些缺陷，并引入了AnyMC3D，一种从2D FMs改编的可扩展3D分类器。我们的方法通过在单个冻结骨干网络上添加轻量级插件（每个任务约1M参数），高效地扩展到新任务。这个通用框架还支持多视图输入、辅助像素级监督和可解释的热力图生成。我们建立了一个涵盖12个任务的综合基准，包括不同的病理、解剖和模态，并系统分析了最先进的3D分类技术。我们的分析揭示了关键见解：（1）有效适应对于释放FM潜力至关重要，（2）通用FMs在适当适应后可以匹敌医学专用FMs，（3）基于2D的方法在3D分类上优于3D架构。我们首次证明了使用单一可扩展框架（包括在VLM3D挑战中获得第一名）在不同应用中实现最先进性能的可行性，消除了对单独任务特定模型的需求。

英文摘要

3D medical image classification is essential for modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. Our method scales efficiently to new tasks by adding only lightweight plugins (about 1M parameters per task) on top of a single frozen backbone. This versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities, and systematically analyze state-of-the-art 3D classification techniques. Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (including 1st place in the VLM3D challenge), eliminating the need for separate task-specific models.

2603.22735 2026-05-28 cs.CL

Explanation Generation for Contradiction Reconciliation with LLMs

面向矛盾调和的大语言模型解释生成

Jason Chan, Zhixue Zhao, Robert Gaizauskas

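AI summary: Introduces the task of reconciliatory explanation generation for contradictions; by repurposing NLI datasets and designing quality metrics, the authors evaluate 18 LLMs and find that most have limited ability and that the benefit of "thinking" plateaus as model size increases.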

Comments Preprint

AI中文摘要

现有的NLP工作通常将矛盾视为需要通过选择接受或拒绝哪些陈述来解决的错误。然而，在社交互动和专业领域中，人类推理的一个关键方面是能够假设调和矛盾的解释。例如，“Cassie讨厌咖啡”和“她每天买咖啡”看似矛盾，但如果Cassie有每天为所有同事买咖啡这一不令人羡慕的日常任务，那么两者是兼容的。尽管大语言模型（LLM）的推理能力不断增强，但它们假设这种调和解释的能力在很大程度上仍未探索。为了填补这一空白，我们引入了调和解释生成任务，其中模型必须生成能够有效使矛盾陈述兼容的解释。我们提出了一种改造现有自然语言推理（NLI）数据集的新方法，并引入了可实现可扩展自动评估的质量指标。对18个LLM的实验表明，大多数模型在此任务中取得的成功有限，并且通过“思考”延长测试时计算的好处随着模型规模的增大而趋于平稳。我们的结果突显了LLM推理中一个未被充分探索的维度，以及解决这一限制以增强LLM下游应用（如聊天机器人和科学助手）的必要性。

英文摘要

Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human reasoning in social interactions and professional domains is the ability to hypothesize explanations that reconcile contradictions. For example, "Cassie hates coffee" and "She buys coffee every day" may appear contradictory, yet both are compatible if Cassie has the unenviable daily chore of buying coffee for all her coworkers. Despite the growing reasoning capabilities of large language models (LLMs), their ability to hypothesize such reconciliatory explanations remains largely unexplored. To address this gap, we introduce the task of reconciliatory explanation generation, where models must generate explanations that effectively render contradictory statements compatible. We propose a novel method of repurposing existing natural language inference (NLI) datasets, and introduce quality metrics that enable scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success in this task, and that the benefit of extending test-time compute by "thinking" plateaus as model size increases. Our results highlight an under-explored dimension of LLM reasoning and the need to address this limitation in enhancing LLMs' downstream applications such as chatbots and scientific aids.

2603.21465 2026-05-28 cs.CL cs.LG

DRTriton: Large-Scale Synthetic Data Driven Reinforcement Learning for Triton Kernel Generation

DRTriton：大规模合成数据驱动的强化学习用于Triton内核生成

Siqi Guo, Ming Lin, Tianbao Yang

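AI summary: Proposes DRTriton, a framework that trains LLMs to convert PyTorch programs into optimized Triton kernels through synthetic data generation, curriculum reinforcement learning, and test-time search, outperforming GPT-5.2 and Claude-Sonnet-4.5 on KernelBench Level 2 tasks.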

AI中文摘要

在生成式AI行业中，开发高效的CUDA内核是一项基础但具有挑战性的任务。最近的研究利用大型语言模型（LLMs）自动将PyTorch参考实现转换为CUDA内核，显著减少了工程工作量。最先进的LLMs，如GPT-5.2和Claude-Sonnet-4.5，仍然难以胜任此任务。为应对这一挑战，我们提出了DRTriton，一个可扩展的学习框架，用于训练LLM将PyTorch程序转换为高度优化的Triton内核，然后在运行时编译为CUDA内核。DRTriton包含三个关键组件：（i）数据合成算法CSP-DAG，保证在算子空间上的完全覆盖和具有可控难度的无偏均匀采样；（ii）具有解耦奖励的课程RL框架，联合优化转换成功率和执行速度；（iii）测试时搜索算法，进一步提高生成的Triton内核的执行速度。通过在使用现有LLM整理的有限PyTorch-Triton对上进行SFT预热阶段，DRTriton在合成PyTorch程序上通过RL训练，有效泛化到即使对人类专家也具挑战性的真实世界CUDA内核。实验结果表明，DRTriton-7B在92%的KernelBench Level 2任务上实现了相对于PyTorch的加速，而GPT-5.2为23%，Claude-Sonnet-4.5为19%。

英文摘要

Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle with this task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch programs into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) a curriculum RL framework with decoupled rewards that jointly optimizes conversion success rate and execution speed; and (iii) a test-time search algorithm that further improves the execution speed of the generated Triton kernels. With a warmup stage of SFT on limited PyTorch-Triton pairs curated using existing LLMs, DRTriton trained by RL on synthesized PyTorch programs generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves speedup over PyTorch on 92% of KernelBench Level 2 tasks, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.

2603.21165 2026-05-28 cs.CL cs.CV

Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects

多种方言，多种语言，一种文化视角：评估多语言视觉语言模型对孟加拉文化的理解，涵盖历史关联语言和地区方言

Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta, Rubaya Tabassum, Ariful Ekraj Hridoy, Mehraj Mahmood, Mahbub E Sobhani, Md. Tarek Hasan, Swakkhar Shatabda

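AI summary: Proposes the BanglaVerse benchmark, built from manually curated images and expanded into multiple languages and dialects, to evaluate multilingual vision-language models on Bengali culture understanding; evaluation in standard Bangla alone overestimates model capability, dialectal variation degrades performance, and missing cultural knowledge is the main bottleneck.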

Comments https://labib1610.github.io/BanglaVerse/

AI中文摘要

孟加拉文化通过地区、方言、历史、食物、政治、媒体和日常视觉生活丰富地表达，但在多模态评估中仍然代表性不足。为了解决这一差距，我们引入了BanglaVerse，这是一个文化基础的基准，用于评估多语言视觉语言模型（VLM）对孟加拉文化的理解，涵盖历史关联语言和地区方言。该基准由9个领域的1152张手动策划图像构建，支持视觉问答和字幕生成，并扩展为四种语言和五种孟加拉方言，产生约32.2K个工件。我们的实验表明，仅评估标准孟加拉语会高估真实模型能力：在方言变化下性能下降，尤其是字幕生成，而历史关联语言如印地语和乌尔都语保留了一些文化意义，但在结构化推理方面仍然较弱。跨领域来看，主要瓶颈是缺失文化知识而非仅视觉基础，尤其是知识密集型类别。这些发现将BanglaVerse定位为在语言变化下衡量文化基础多模态理解的更现实测试平台。

英文摘要

Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.2K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, especially in knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.

2601.04716 2026-05-28 cs.CL

Identifying and Mitigating Bottlenecks in Role-Playing Agents: A Systematic Study of Disentangling Character Profile Axes

识别与缓解角色扮演代理中的瓶颈：解耦角色档案轴线的系统研究

Yonghyun Jun, Junhyuk Choi, Jeonghyun Park, Jihyeong Park, Liu Nicole Geumheon, Hwanhee Lee

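AI summary: Systematically diagnoses performance bottlenecks in LLM role-playing agents by disentangling character profiles along three axes (Familiarity, Structure, and Disposition), and proposes Field-Aware Contrastive Decoding (FACD), a training-free strategy that mitigates the performance degradation associated with Disposition.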

Comments 28 pages

AI中文摘要

尽管大语言模型（LLM）角色扮演代理发展迅速，但尚不清楚哪些档案元素真正驱动角色扮演质量。为填补这一空白，我们引入了一个系统诊断框架，沿三个轴线解耦角色档案的影响：熟悉度（已知 vs. 未知）、结构（结构化 vs. 非结构化）和性格（道德 vs. 不道德）。利用统一的分层模式（5个维度，28个字段），我们构建了一个包含211个人物的受控数据集，并在单轮和多轮交互中评估了五个LLM。我们的结果揭示了显著的不对称性：熟悉度和结构的影响可忽略，而性格在所有条件下对不道德角色产生大且一致的性能下降。进一步分析表明，道德-不道德差距被后SFT对齐放大，且这种下降在不同档案属性间差异显著。为缓解这一瓶颈，我们提出场感知对比解码（FACD），一种无训练策略，通过放大被抑制的性格敏感信号，显著缩小性能差距而不牺牲道德角色的性能。

英文摘要

While Large Language Model (LLM) role-playing agents have advanced rapidly, it remains unclear which profile elements genuinely drive role-playing quality. To bridge this gap, we introduce a systematic diagnostic framework that disentangles the impact of character profiles along three axes: Familiarity (Known vs. Unknown), Structure (Structured vs. Unstructured), and Disposition (Moral vs. Immoral). Utilizing a unified hierarchical schema (5 dimensions, 28 fields), we construct a controlled dataset of 211 personas and evaluate five LLMs on both single- and multi-turn interactions. Our results reveal a striking asymmetry: Familiarity and Structure show negligible impact, while Disposition produces large, consistent performance degradation for immoral characters across all conditions. Further analyses suggest that the Moral-Immoral gap is amplified by post-SFT alignment, and that this degradation varies substantially across profile attributes. To mitigate this bottleneck, we propose Field-Aware Contrastive Decoding (FACD), a training-free strategy that amplifies suppressed disposition-sensitive signals, significantly closing the performance gap without sacrificing moral-character performance.

2601.21207 2026-05-28 cs.LG cs.AI math.AT

A Sheaf-Theoretic and Topological Perspective on Complex Network Modeling and Attention Mechanisms in Graph Neural Models

图神经模型中复杂网络建模与注意力机制的层论与拓扑视角

Chuan-Shen Hu

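AI summary: Proposes a cellular sheaf-theoretic framework for analyzing the local consistency and harmonicity of node features and edge weights in graph neural models, and introduces a multiscale extension inspired by topological data analysis to capture hierarchical feature interactions.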

AI中文摘要

组合与拓扑结构，如图、单纯复形和胞腔复形，构成了几何与拓扑深度学习（GDL和TDL）架构的基础。这些模型在此类域上聚合信号、整合局部特征，并为多样化的实际应用生成表示。然而，训练过程中GDL和TDL特征的分布与扩散行为仍是一个开放且未充分探索的问题。受此空白启发，我们引入了一个细胞层论框架，用于建模和分析基于图的架构中节点特征与边权重的局部一致性与调和性。通过层结构追踪局部特征对齐与一致性，该框架提供了特征扩散与聚合的拓扑视角。此外，受拓扑数据分析（TDA）启发，提出了一个多尺度扩展，以捕获图模型中层次化的特征交互。该方法基于GDL和TDL架构的底层几何与拓扑结构以及其上定义的学习信号，实现了对它们的联合刻画，为节点分类、子结构检测和社区检测等传统任务的未来研究提供了见解。

英文摘要

Combinatorial and topological structures, such as graphs, simplicial complexes, and cell complexes, form the foundation of geometric and topological deep learning (GDL and TDL) architectures. These models aggregate signals over such domains, integrate local features, and generate representations for diverse real-world applications. However, the distribution and diffusion behavior of GDL and TDL features during training remains an open and underexplored problem. Motivated by this gap, we introduce a cellular sheaf theoretic framework for modeling and analyzing the local consistency and harmonicity of node features and edge weights in graph-based architectures. By tracking local feature alignments and agreements through sheaf structures, the framework offers a topological perspective on feature diffusion and aggregation. Furthermore, a multiscale extension inspired by topological data analysis (TDA) is proposed to capture hierarchical feature interactions in graph models. This approach enables a joint characterization of GDL and TDL architectures based on their underlying geometric and topological structures and the learned signals defined on them, providing insights for future studies on conventional tasks such as node classification, substructure detection, and community detection.
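
For readers unfamiliar with the terms, the notions of local consistency and harmonicity can be written down using the standard definition of a cellular sheaf on a graph. The notation below follows the common textbook convention and is an assumption; the paper's own symbols may differ.

```latex
% A cellular sheaf \mathcal{F} on a graph G = (V, E) assigns a vector space
% \mathcal{F}(v) to each node, a vector space \mathcal{F}(e) to each edge, and a
% linear restriction map \mathcal{F}_{v \trianglelefteq e} : \mathcal{F}(v) \to \mathcal{F}(e)
% to each incident node-edge pair.
\begin{align}
  % Coboundary on an oriented edge e = (u \to v):
  (\delta x)_e &= \mathcal{F}_{v \trianglelefteq e}\, x_v - \mathcal{F}_{u \trianglelefteq e}\, x_u, \\
  % Sheaf Laplacian and its quadratic form (total local disagreement):
  L_{\mathcal{F}} &= \delta^{\top} \delta, \qquad
  x^{\top} L_{\mathcal{F}}\, x = \sum_{e = (u \to v) \in E}
    \bigl\| \mathcal{F}_{v \trianglelefteq e}\, x_v - \mathcal{F}_{u \trianglelefteq e}\, x_u \bigr\|^2, \\
  % Harmonic (globally consistent) signals are exactly the kernel:
  L_{\mathcal{F}}\, x = 0 \;&\Longleftrightarrow\; \delta x = 0 .
\end{align}
```

When every stalk is one-dimensional and every restriction map is the identity, $L_{\mathcal{F}}$ reduces to the ordinary graph Laplacian, so the sheaf view strictly generalizes the usual smoothness measure on node features.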

2603.02097 2026-05-28 cs.CL

ClinConsensus: A Physician-Calibrated Benchmark for Evaluating Clinical Rubric Coverage in Chinese Medical LLMs

ClinConsensus：一个用于评估中文医疗大模型临床评分标准覆盖率的医师校准基准

Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Xue Yang, Kailuan Wu, Ruyi Xu, Tianyun Lu, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, Sen Yang, Lin Qu, Bing Zhao, Hu Wei

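AI summary: To address the weak grounding of open-ended medical LLM evaluation in physician-calibrated coverage of clinical response criteria, proposes ClinConsensus, a benchmark of 2,500 expert-curated cases, together with the Clinician-Anchored Coverage Score (CACS) and a dual-judge framework; frontier models show a coverage gap of 19.2 to 21.9 points.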

AI中文摘要

开放域医疗大模型评估在医师校准的临床相关响应标准覆盖率方面仍然薄弱，尤其是在本地化临床环境中。我们引入了 ClinConsensus，一个中文医疗基准，包含 2,500 个专家精选病例，涵盖 36 个专科、12 个任务主题、多个难度级别以及面向非专业与专业人员的场景。每个病例配有 30 个病例特定的二元评分标准。为了评估响应是否满足足够多的医师撰写的标准，我们提出了医师锚定覆盖率评分（CACS），一个在 $k=10$ 实例化的医师校准阈值度量，并开发了一个双裁判框架，结合 GPT-5.1 评分器与一个医师监督的 Qwen3-8B 裁判。评估 11 个前沿大模型，我们发现存在持续的覆盖率差距：评分准确率在 39.6% 到 52.1% 之间，而 CACS@10 在 17.8% 到 32.9% 之间，模型间存在 19.2-21.9 个百分点的差距。分层分析进一步揭示了在推理、证据使用、结构化提取、用药说明、随访和对话语域方面的显著差异。这些结果表明，医疗大模型评估应衡量阈值化的、基于评分标准的临床覆盖率，而非平均部分正确性。

英文摘要

Open-ended medical LLM evaluation remains weakly grounded in physician-calibrated coverage of clinically relevant response criteria, especially in localized clinical settings. We introduce ClinConsensus, a Chinese medical benchmark of 2,500 expert-curated cases spanning 36 specialties, 12 task themes, multiple difficulty levels, and lay-facing versus professional-facing settings. Each case is paired with 30 case-specific binary rubric criteria. To evaluate whether responses satisfy enough physician-authored criteria, we propose Clinician-Anchored Coverage Score (CACS), a physician-calibrated threshold metric instantiated at $k=10$, and develop a dual-judge framework combining a GPT-5.1 grader with a physician-supervised Qwen3-8B judge. Evaluating 11 frontier LLMs, we find a persistent coverage gap: Rubric Accuracy ranges from 39.6% to 52.1%, whereas CACS@10 ranges from 17.8% to 32.9%, leaving a gap of 19.2 to 21.9 points across models. Stratified analyses further reveal substantial variation across reasoning, evidence use, structured extraction, medication instructions, follow-up, and dialogue register. These results suggest that medical LLM evaluation should measure thresholded, rubric-grounded clinical coverage rather than average partial correctness.

2603.16985 2026-05-28 cs.LG

Integrating Inductive Biases in Transformers via Distillation for Financial Time Series Forecasting

通过蒸馏将归纳偏置整合到Transformer中用于金融时间序列预测

Yu-Chen Den, Kuan-Yu Chen, Kendro Vincent, Darby Tien-Hao Chang

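AI summary: Proposes TIPS, a knowledge distillation framework that integrates inductive biases such as causality, locality, and periodicity into a unified Transformer; across four major equity markets it outperforms strong ensemble baselines by 55%, 9%, and 16% in annual return, Sharpe ratio, and Calmar ratio, respectively, while requiring only 38% of the inference-time computation.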

Comments KDD 2026

AI中文摘要

基于Transformer的模型因其高表示能力和架构灵活性而被广泛用于时间序列预测。然而，许多Transformer变体隐含地假设平稳性和稳定的时间动态——这些假设在具有制度转换和非平稳性的金融市场中经常被违反。经验上，最先进的时间序列Transformer在金融任务上甚至常常不如普通Transformer，而具有不同归纳偏置的简单架构（如CNN和RNN）能以更低的复杂度实现更强的性能。同时，没有单一的归纳偏置在所有市场或制度中占主导地位，这表明稳健的金融预测需要整合互补的时间先验。我们提出TIPS（Transformer with Inductive Prior Synthesis），一个知识蒸馏框架，将多样化的归纳偏置——因果性、局部性和周期性——综合到统一的Transformer中。TIPS通过注意力掩码训练偏置专用的Transformer教师，然后通过跨归纳偏置的制度依赖对齐，将它们的知识蒸馏到单个学生模型中。在四个主要股票市场上，TIPS实现了最先进的性能，在年化收益、夏普比率和卡尔玛比率上分别超过强集成基线55%、9%和16%，同时仅需要38%的推理计算量。进一步分析表明，TIPS产生了统计上显著的超额收益，超过普通Transformer及其教师集成，并在其盈利期间表现出与经典架构的制度依赖行为对齐。这些结果强调了在非平稳金融时间序列中，制度依赖的归纳偏置利用对于稳健泛化的重要性。

英文摘要

Transformer-based models have been widely adopted for time-series forecasting due to their high representational capacity and architectural flexibility. However, many Transformer variants implicitly assume stationarity and stable temporal dynamics, assumptions that are routinely violated in financial markets characterized by regime shifts and non-stationarity. Empirically, state-of-the-art time-series Transformers often underperform even vanilla Transformers on financial tasks, while simpler architectures with distinct inductive biases, such as CNNs and RNNs, can achieve stronger performance with substantially lower complexity. At the same time, no single inductive bias dominates across markets or regimes, suggesting that robust financial forecasting requires integrating complementary temporal priors. We propose TIPS (Transformer with Inductive Prior Synthesis), a knowledge distillation framework that synthesizes diverse inductive biases (causality, locality, and periodicity) within a unified Transformer. TIPS trains bias-specialized Transformer teachers via attention masking, then distills their knowledge into a single student model with regime-dependent alignment across inductive biases. Across four major equity markets, TIPS achieves state-of-the-art performance, outperforming strong ensemble baselines by 55%, 9%, and 16% in annual return, Sharpe ratio, and Calmar ratio, while requiring only 38% of the inference-time computation. Further analyses show that TIPS generates statistically significant excess returns beyond both vanilla Transformers and its teacher ensembles, and exhibits regime-dependent behavioral alignment with classical architectures during their profitable periods. These results highlight the importance of regime-dependent inductive bias utilization for robust generalization in non-stationary financial time series.
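
To make the idea of bias-specialized teachers via attention masking concrete, here is a minimal sketch of three masks, one per inductive bias named in the abstract. The exact masks used by TIPS are not specified there, so the band width `window`, the `period`, and every function name below are assumptions; these are simply common ways of encoding causality, locality, and periodicity.

```python
import numpy as np

def causal_mask(n):
    """Position i may attend to positions j <= i (no look-ahead)."""
    return np.tril(np.ones((n, n), dtype=bool))

def local_mask(n, window):
    """Position i may attend to positions within `window` steps of i."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def periodic_mask(n, period):
    """Position i may attend to positions a whole number of periods away."""
    idx = np.arange(n)
    return (idx[:, None] - idx[None, :]) % period == 0

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # disallowed pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
    for name, mask in [("causal", causal_mask(6)),
                       ("local", local_mask(6, window=1)),
                       ("periodic", periodic_mask(6, period=3))]:
        print(name, masked_attention(q, k, v, mask).shape)
```

Each mask keeps the diagonal, so every position can always attend to itself and the softmax never sees a fully masked row.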

2601.04505 2026-05-28 cs.AI cs.CL cs.SY eess.SY

CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

CircuitLM: 一种基于多智能体的大语言模型辅助设计框架，用于从自然语言提示生成电路原理图

Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik

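AI summary: Proposes CircuitLM, a multi-agent pipeline that uses an embedding-powered component knowledge base and a five-stage process to translate natural language prompts into structured CircuitJSON schematics, with dual verification by deterministic Electrical Rule Checking and an LLM-as-a-judge meta-evaluator, targeting hallucination and physical-constraint violations by LLMs in circuit design.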

Comments Accepted at the 2026 IEEE International Conference on LLM-Aided Design (ICLAD), 10 pages, 8 figures, 6 tables

AI中文摘要

从高层自然语言描述生成准确的电路原理图仍然是电子设计自动化（EDA）中的一个持久挑战，因为大语言模型（LLM）经常产生组件幻觉、违反严格的物理约束并输出非机器可读的结果。为解决此问题，我们提出CircuitLM，一个多智能体流水线，将用户提示转化为结构化的、视觉可解释的 CircuitJSON 原理图。该框架通过五个顺序阶段：(i) 组件识别，(ii) 规范引脚输出检索，(iii) 思维链推理，(iv) JSON原理图合成，以及(v) 交互式力导向可视化，基于一个精心策划的、嵌入驱动的组件知识库进行生成，从而减轻幻觉并确保物理可行性。我们在一个包含100个独特电路设计提示的数据集上，使用五个最先进的大语言模型评估了该系统。为系统评估性能，我们部署了严格的双层评估方法：一个确定性电气规则检查（ERC）引擎按严格严重性（关键、主要、次要、警告）对拓扑故障进行分类，同时一个LLM作为评判的元评估器识别复杂的、上下文感知的设计缺陷，这些缺陷绕过了标准的基于规则的检查器。最终，这项工作展示了目标检索与确定性和语义验证相结合如何将自然语言转化为结构可行的、原理图就绪的硬件和安全电路原型。我们的代码和数据公开在 https://github.com/Khandakar227/CircuitLM。

英文摘要

Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronic design automation (EDA), as large language models (LLMs) frequently hallucinate components, violate strict physical constraints, and produce non-machine-readable outputs. To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics. The framework mitigates hallucination and ensures physical viability by grounding generation in a curated, embedding-powered component knowledge base through five sequential stages: (i) component identification, (ii) canonical pinout retrieval, (iii) chain-of-thought reasoning, (iv) JSON schematic synthesis, and (v) interactive force-directed visualization. We evaluate the system on a dataset of 100 unique circuit-design prompts using five state-of-the-art LLMs. To systematically assess performance, we deploy a rigorous dual-layered evaluation methodology: a deterministic Electrical Rule Checking (ERC) engine categorizes topological faults by strict severity (Critical, Major, Minor, Warning), while an LLM-as-a-judge meta-evaluator identifies complex, context-aware design flaws that bypass standard rule-based checkers. Ultimately, this work demonstrates how targeted retrieval combined with deterministic and semantic verification can bridge natural language to structurally viable, schematic-ready hardware and safe circuit prototyping. Our code and data are publicly available at https://github.com/Khandakar227/CircuitLM.

2512.20780 2026-05-28 cs.CL cs.CY

Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles

大型语言模型在数学辅导中接近专家教学质量，但在教学和语言特征上存在差异

Ramatu Oiza Abdulsalam, Segun Aroyehun

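AI summary: Analyzes a dataset of math tutoring dialogues to compare the pedagogical quality of expert tutors, novice tutors, and seven LLMs, finding that LLMs approach expert-level quality on average but differ systematically in instructional strategies and linguistic characteristics.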

AI中文摘要

最近的工作探索了使用大型语言模型（LLMs）生成数学辅导回应，但尚不清楚其教学行为与人类专家实践的接近程度。我们分析了一个数学补救对话数据集，其中专家教师、新手教师和七种不同规模的大型语言模型（包括开放权重和商业模型）对相同的学生错误做出回应。我们检查了教学策略和辅导回应的语言特征，包括吸收（重述和转述）、追问准确性和推理、词汇多样性、可读性、礼貌性和能动性。我们发现专家教师产生的回应质量高于新手教师，并且较大的LLMs通常比较小的模型获得更高的教学质量评分，平均接近专家表现。然而，LLMs在教学特征上表现出系统性差异：它们较少使用专家教师特有的讨论策略，同时生成更长、词汇更丰富、更礼貌的回应。回归分析表明，追问准确性和推理、重述和转述以及词汇多样性与感知教学质量正相关，而更高水平的能动性和礼貌性语言则负相关。这些发现强调了在评估人类教师和智能辅导系统的辅导回应时分析教学策略和语言特征的重要性。

英文摘要

Recent work has explored the use of large language models (LLMs) to generate tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We analyze a dataset of math remediation dialogues in which expert tutors, novice tutors, and seven LLMs of varying sizes, comprising both open-weight and commercial models, respond to the same student errors. We examine instructional strategies and linguistic characteristics of tutoring responses, including uptake (restating and revoicing), pressing for accuracy and reasoning, lexical diversity, readability, politeness, and agency. We find that expert tutors produce higher-quality responses than novices, and that larger LLMs generally receive higher pedagogical quality ratings than smaller models, approaching expert performance on average. However, LLMs exhibit systematic differences in their instructional profiles: they underuse discursive strategies characteristic of expert tutors while generating longer, more lexically diverse, and more polite responses. Regression analyses show that pressing for accuracy and reasoning, restating and revoicing, and lexical diversity are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. These findings highlight the importance of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.

2502.08938 2026-05-28 cs.LG

Reevaluating Policy Gradient Methods for Imperfect-Information Games

重新评估不完美信息博弈的策略梯度方法

Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J. Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, Samuel Sokota

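AI summary: By implementing exact exploitability computations for five large games, the authors compare deep reinforcement learning algorithms and find that approaches based on fictitious play, double oracle, and counterfactual regret minimization fail to outperform generic policy gradient methods such as PPO.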

Comments International Conference on Learning Representations (ICLR) 2026

AI中文摘要

在过去十年中，受对抗性不完美信息博弈中朴素自我对弈深度强化学习（DRL）所谓失败的驱动，研究人员基于虚构博弈（FP）、双预言机（DO）和反事实遗憾最小化（CFR）开发了大量DRL算法。鉴于最近磁镜下降算法的结果，我们假设更简单的通用策略梯度方法（如PPO）与这些基于FP、DO和CFR的DRL方法相比具有竞争力或更优。为了验证这一假设，我们实现并发布了五个大型博弈的首次广泛可访问的精确可剥削性计算。利用这些博弈，我们进行了不完美信息博弈中DRL算法有史以来最大规模的可剥削性比较。在超过7000次训练运行中，我们发现基于FP、DO和CFR的方法未能超越通用策略梯度方法。代码可在https://github.com/nathanlct/IIG-RL-Benchmark 和 https://github.com/gabrfarina/exp-a-spiel 获取。

英文摘要

In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results of the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP-, DO-, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for five large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 7000 training runs, we find that FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods. Code is available at https://github.com/nathanlct/IIG-RL-Benchmark and https://github.com/gabrfarina/exp-a-spiel .
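
As a toy illustration of what an exact exploitability computation measures, the sketch below evaluates it for a two-player zero-sum matrix game. The paper's games are large imperfect-information games and its released code is far more involved; the rock-paper-scissors payoff matrix, the function name, and the convention of dividing the summed best-response gains by two are assumptions chosen for illustration.

```python
import numpy as np

def exploitability(payoff, x, y):
    """Average gain each player could obtain by best-responding to the other.

    payoff[i, j] is the row player's payoff (the column player receives the
    negative); x and y are the row and column mixed strategies. The expected
    value x @ payoff @ y cancels when the two best-response gains are summed.
    """
    payoff = np.asarray(payoff, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_row_value = (payoff @ y).max()      # row player's best response to y
    best_col_value = (-(x @ payoff)).max()   # column player's best response to x
    return (best_row_value + best_col_value) / 2.0

# Rock-paper-scissors from the row player's perspective (rows/cols: R, P, S).
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

if __name__ == "__main__":
    uniform = [1 / 3, 1 / 3, 1 / 3]
    print(exploitability(RPS, uniform, uniform))    # equilibrium: 0.0
    print(exploitability(RPS, [1, 0, 0], uniform))  # always-rock is exploitable: 0.5
```

An exploitability of zero means neither player can gain by deviating, which is why it serves as a distance-to-equilibrium measure when comparing learned policies.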
