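arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.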

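2605.16255 2026-05-18 cs.DC cs.AI Version update

Designing Datacenter Power Delivery Hierarchies for the AI Era

为AI时代设计数据中心电力交付层级

Grant Wilkins, Fiodar Kazhamiaka, Alok Gautam Kumbhare, Chaojie Zhang, Ricardo Bianchini

Affiliations: Stanford University; Microsoft Azure Research

AI summary: This paper studies the challenges of designing datacenter power delivery hierarchies for the AI era and proposes an evaluation framework that combines throughput, power, and cost metrics to analyze how multi-resource shortages affect deployed capacity, capital expenditure, and performance.

AI Chinese abstract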

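FORGE: Self-Evolving Agent Memory Without Weight Updates

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Affiliations: Carleton University

AI summary: FORGE uses a population broadcast mechanism to build self-generated memory without gradient updates, improving the decision-making of hierarchical ReAct agents; on the CybORG CAGE-2 task it substantially improves performance and lowers failure rates.

DOI: 10.1145/3786335.3813155

AI Chinese abstract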

LLM代理能否通过自生成记忆提升决策能力而不进行梯度更新？我们提出了FORGE（失败优化反射毕业与进化），一种分阶段、基于群体的协议，通过注入提示的自然语言记忆来进化层次ReAct代理。FORGE包含一个反射式内环，其中专门的反思代理（使用相同的基础LLM，不从更强模型蒸馏）将失败轨迹转换为可重用的知识工件：文本启发式（规则）、少量示例（示例）或两者（混合），外环在阶段间将表现最佳实例的记忆传播到群体，并通过毕业标准冻结收敛实例。我们在CybORG CAGE-2上评估，这是一个具有30步地平线的随机网络防御POMDP，对抗B线攻击者。所有四个测试的LLM家族（Gemini-2.5-Flash-Lite、Grok-4-Fast、Llama-4-Maverick、Qwen3-235B）均表现出强烈负的、重尾零样本奖励。与零样本基线和反射基线（隔离单流学习）相比，FORGE在所有12种模型-表示条件下，将平均评估回报提高了1.7-7.7倍，比反射基线提高了29-72%，将主要失败率（低于-100）降低到约1%。我们发现（1）群体广播是关键机制，无毕业消融确认广播承载性能提升，而毕业主要节省计算；（2）示例在三个模型中表现最强，规则提供最佳成本-可靠性剖面，约少40%的token；（3）较弱基线模型受益显著，表明FORGE可能缓解能力差距而非放大强模型。所有证据均限于CAGE-2 B线；跨家族发现是方向性证据。

English abstract

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.


2605.16232 2026-05-18 cs.CL cs.AI cs.ET cs.LG cs.SY eess.SY Version update

A Unified Generative-AI Framework for Smart Energy Infrastructure: Intelligent Gas Distribution, Utility Billing, Carbon Analytics, and Quantum-Inspired Optimisation

智能能源基础设施的统一生成式AI框架：智能燃气分配、公用事业计费、碳分析和量子启发优化

Pavan Manjunath, Thomas Pruefer

Affiliations: Independent Research, India; Independent Research, Germany

AI summary: This paper proposes a unified generative-AI framework that integrates intelligent gas distribution, billing, carbon analytics, and quantum-inspired optimisation to improve energy-management efficiency and environmental accountability.

2605.16207 2026-05-18 cs.AI cs.CL Version update

Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

确认正确，遗漏其余：LLM辅导代理在反馈最关键的地方表现不佳

Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, Xiaoyi Tian, Tiffany Barnes

Affiliations: North Carolina State University

AI summary: This paper studies the tutoring performance of LLMs in logical reasoning and finds systematic bias in how they distinguish optimal, suboptimal, and incorrect solutions, which undermines adaptive instruction.

Comments 22 pages, 20 figures

AI Chinese abstract

有效的辅导需要区分最优解、有效但次优解和错误解，这对智能辅导系统至关重要，但此前未针对LLM辅导代理进行测试。本文通过知识图谱衍生的地面真实数据，评估了七个LLM反馈代理在命题逻辑中的表现。模型在最优步骤上表现接近天花板，但在有效但次优的推理和错误解的验证上系统性地过度拒绝和接受，这在适应性辅导中尤为关键。这些失败在不同模型和情境下均持续存在，表明是架构而非信息限制的问题。此外，准确的诊断未能可靠地产生教学可行的反馈，揭示了诊断判断与教学效果之间的差距。研究发现LLM更适合混合架构，其中基于知识图谱的模型负责诊断，而LLM支持开放式的支架和对话。

English abstract

Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution--feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.


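2605.16205 2026-05-18 cs.AI cs.CL cs.LG cs.MA cs.SY eess.SY Version update

Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

上下文、推理与层次：在对抗性POMDP中的复合LLM代理设计成本-性能研究

Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Affiliations: Carleton University

AI summary: The study examines how context, deliberation, and hierarchical decomposition in compound LLM agent design affect performance and cost in adversarial, partially observable sequential environments; it finds that programmatic state abstraction is the most cost-efficient, while hierarchical decomposition without deliberation achieves the best performance for most models.

DOI: 10.1145/3786335.3813149

AI Chinese abstract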

在对抗性、部分可观测的序贯环境中部署复合LLM代理需要处理多个设计维度：（1）代理所见的内容，（2）其推理方式，以及（3）任务在组件间的分解。然而，从业者缺乏指导，以确定哪些设计选择能提升性能而非仅仅增加推理成本。我们通过CybORG CAGE-2环境（建模为部分可观测马尔可夫决策过程POMDP）进行受控研究。奖励为非正数，因此所有配置均在故障缓解模式下运行。我们的评估涵盖五种模型家族、六种模型和十二种配置（3,475次回合），并进行逐token的成本计算。我们变化上下文表示（原始观察与确定性状态跟踪层压缩历史）、推理（自我提问、自我批评和自我改进工具，可选思维链提示）以及分层分解（单体ReAct与委托给专门子代理）。我们发现：（1）程序化状态抽象在每token花费上获得最大回报（RPTS），在原始观察上提升均值回报高达76%。（2）在分层中分布推理工具相对于单独分层，对所有五种模型家族均降低性能，达到3.4倍更差的均值回报，同时使用1.8-2.7倍更多token。我们称此破坏性模式为推理瀑布。（3）没有推理的分层分解在大多数模型中获得最佳绝对性能，且上下文工程通常比推理更经济有效。这些发现表明在结构对抗性POMDPs中的设计原则：投资于程序化基础设施和清洁任务分解，而不是更深入的单个代理推理，因为这些策略在结合时可能会相互干扰。

English abstract

Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4$\times$ worse mean return while using 1.8-2.7$\times$ more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.


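2605.16198 2026-05-18 cs.AI cs.CY cs.LG cs.LO Version update

Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

形式方法与大语言模型交汇：面向高级AI系统合规性的审计、监控与干预

Parand A. Alamdari, Toryn Q. Klassen, Sheila A. McIlraith

Affiliations: University of Toronto, Vector Institute

AI summary: This paper proposes auditing and monitoring techniques that combine formal methods with machine learning to detect violations of temporally extended behavioral constraints in AI systems; experiments show they outperform LLM-based methods at detecting violations and effectively reduce the violation rates of LLM agents.

AI Chinese abstract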

我们探讨了AI治理的一个维度：如何在整个AI开发生命周期中监控和审计AI增强的产品和服务，从预部署测试到部署后的审计。结合形式方法的原则与最先进的机器学习，我们提出技术，使AI增强产品和服务开发者、第三方AI开发者和评估者能够对产品特定的时间扩展行为约束（如安全约束、规范、规则和法规）进行离线审计和在线（运行时）监控，针对黑箱高级AI系统，特别是LLMs。我们进一步提供实用的预测监控技术，如基于抽样的方法，并引入干预监控器，在运行时预判并可能缓解预测的违规。实验结果表明，通过利用线性时序逻辑（LTL）的形式语法和语义，我们提出的方法在检测时间扩展行为约束的违规方面优于LLM基方法；使用我们的方法，即使小模型标注器也能匹配或超越前沿LLM判断者。我们还显示，通过受控实验，LLM的时间推理在事件距离、约束数量和命题数量增加时表现出显著的准确性下降。

English abstract

We examine one particular dimension of AI governance: how to monitor and audit AI-enabled products and services throughout the AI development lifecycle, from pre-deployment testing to post-deployment auditing. Combining principles from formal methods with SoTA machine learning, we propose techniques that enable AI-enabled product and service developers, as well as third party AI developers and evaluators, to perform offline auditing and online (runtime) monitoring of product-specific (temporally extended) behavioral constraints such as safety constraints, norms, rules and regulations with respect to black-box advanced AI systems, notably LLMs. We further provide practical techniques for predictive monitoring, such as sampling-based methods, and we introduce intervening monitors that act at runtime to preempt and potentially mitigate predicted violations. Experimental results show that by exploiting the formal syntax and semantics of Linear Temporal Logic (LTL), our proposed auditing and monitoring techniques are superior to LLM baseline methods in detecting violations of temporally extended behavioral constraints; with our approach, even small-model labelers match or exceed frontier LLM judges. Our predictive and intervening monitors significantly reduce the violation rates of LLM-based agents while largely preserving task performance. We further show through controlled experiments that LLMs' temporal reasoning shows a pronounced degradation in accuracy with increasing event distance, number of constraints, and number of propositions.
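As a rough illustration of what runtime monitoring of a temporally extended constraint involves, the sketch below checks two LTL-style patterns over a finite event trace: a safety property G(not p) and a response property G(p -> F q). It is not the authors' implementation; the function names, the proposition names (`ask_consent`, `consent_given`, `leak_secret`), and the finite-trace reading of F are all assumptions made for this example.

```python
from typing import List, Set

# One time step is modeled as the set of atomic propositions that hold in it.
Step = Set[str]


def violates_safety(trace: List[Step], forbidden: str) -> bool:
    """G(not forbidden): violated as soon as `forbidden` holds at any step."""
    return any(forbidden in step for step in trace)


def violates_response(trace: List[Step], trigger: str, response: str) -> bool:
    """G(trigger -> F response), read over a finite trace (an assumption).

    Violated if some trigger is never followed, at the same or a later step,
    by a response before the trace ends.
    """
    pending = False
    for step in trace:
        if trigger in step:
            pending = True
        if response in step:
            pending = False
    return pending


# Hypothetical trace: the second consent request is never answered.
trace = [{"ask_consent"}, {"consent_given"}, {"share_data"}, {"ask_consent"}]
print(violates_safety(trace, "leak_secret"))
print(violates_response(trace, "ask_consent", "consent_given"))
```

On this hypothetical trace the safety check reports no violation, while the response check reports one because the final `ask_consent` is left unanswered. A monitor of this kind only needs the sequence of labeled events, which is why it can be applied to a black-box system.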


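2605.16194 2026-05-18 cs.DL cs.AI cs.IR cs.MA Version update

paper.json: A Coordination Convention for LLM-Agent-Actionable Papers

为LLM-代理可操作论文的协调约定

Arquimedes Canedo

Affiliations: arquicanedo

AI summary: This paper proposes the paper.json file, which addresses the recurring failures of LLM agents reading academic papers through conventions such as stable claim IDs, an explicit does-not-claim list, exact per-figure commands, and stable definition IDs.

AI Chinese abstract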

LLM代理通常作为学术论文的第一（有时唯一）阅读者，快速浏览子声明、提取可重复性步骤并概括范围。标准论文在这一角色中产生重复失败：无法在子论文粒度下引用子声明、范围过度扩展超出论文测试内容，以及图示命令埋藏在代码库而非论文本身。我们提出paper.json，一个随PDF一同携带的JSON文件，通过轻量级约定解决这些失败：稳定声明ID（C1）、明确不声明列表（C2）、精确每图shell命令（C3）和稳定定义ID（C5）。第五个约定（C4）指出，最小可行合规性，手写JSON与PDF一同，可在一小时内完成，无需触碰人类可读输出。C1、C2、C3和C5是开放邀请：阅读合规论文并采取行动的代理将产生证据支持或反对它们。本文本身合规：运行`uv run validator.py paper.json --against paper.typ`通过。仓库：https://github.com/arquicanedo/paper-json

English abstract

LLM agents routinely serve as first (and sometimes only) readers of academic papers, skimming for sub-claims, extracting reproducibility steps, and generalizing scope. Standard prose papers produce recurring failures in this role: sub-claims that cannot be cited at sub-paper granularity, scope overextension beyond what the paper tests, and figure commands buried in codebases rather than the paper itself. We propose `paper.json`, a companion JSON file that travels with the PDF and addresses each failure with a lightweight convention: stable claim IDs (C1), an explicit does-not-claim list (C2), exact per-figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, hand-written JSON alongside the PDF, is achievable in under an hour for a finished paper without touching the human-readable output. C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: `uv run validator.py paper.json --against paper.typ` passes. Repo: https://github.com/arquicanedo/paper-json


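2605.16165 2026-05-18 cs.CV cs.AI Version update

Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

二阶多级方差校正用于多模态模型中的模态竞争

Yishun Lu, Wes Armour

Affiliations: University of Oxford, Oxford, United Kingdom

AI summary: This paper proposes the ML-FOP-SOAP framework, which uses multi-level variance correction to stabilize multimodal alignment; experiments on Janus and Emu3 show improved sample efficiency and training speed, making it suitable for large-scale multimodal foundation models.

AI Chinese abstract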

自回归的下一个标记训练为图像生成和文本理解提供统一框架，但同时也导致强模态竞争，破坏了优化稳定性并限制了大批次扩展。我们发现一阶优化器如AdamW易受跨模态梯度异质性影响，而二阶预条件，特别是SOAP，为多模态对齐提供了更稳定的基。基于此，我们提出ML-FOP-SOAP，一个带有多级方差校正的二阶优化框架。我们的Fisher-正交投影抑制由方差引起的模态冲突，减少视觉生成与文本理解之间的权衡。为在大梯度累积下实用，我们引入了分层折叠策略，以低微步开销捕获细粒度方差。在Janus和Emu3上的实验显示，在两个模态上均获得一致收益，并在8192批次大小下实现稳定训练。与AdamW相比，我们的方法提高了样本效率高达1.4倍，并加速了实时时钟训练高达1.5倍，为扩展多模态基础模型提供了一个稳健的优化器。

English abstract

Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose \emph{ML-FOP-SOAP}, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to $1.4\times$ and accelerates wall-clock training by up to $1.5\times$, offering a robust optimizer for scaling multimodal foundation models.


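2605.16153 2026-05-18 cs.AI Version update

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-commerce Web Agents

Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, Shuang Xie, Alberto Castelo, Tianfu Wu, Lingyun Wang

Affiliations: North Carolina State University; Shopify

AI summary: This paper proposes the ShopGym framework, which uses the ShopArena simulation layer and the ShopGuru benchmark layer to provide realistic simulation and scalable benchmarking for e-commerce web agents, and validates that synthetic shops preserve structural properties and agent-performance signals.

Comments 32 pages, 10 figures

AI Chinese abstract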

开发和评估电子商务网络代理需要能够保持有意义任务结构并支持可控、可重复和可扩展科学比较的环境。现有方法面临权衡：实时商店提供现实但非平稳、难以检查和不可重复，而手动构建的沙盒基准测试提供控制但仅覆盖狭窄的布局、目录、政策和交互模式范围。我们主张核心瓶颈是方法论的：该领域缺乏一种可扩展的方式，能够构建同时现实、多样、可控、可检查和可重复的评估设置。我们引入ShopGym，一个集成框架，用于电子商务网络代理的现实模拟和可扩展基准测试。ShopGym是一个构建电子商务模拟环境和基础基准任务的框架。其模拟层ShopArena通过匿名化商店规范和分阶段验证生成过程，将实时种子商店转换为自包含的沙盒商店。在这些模拟商店之上，ShopGuru合成跨七个技能类别的基准任务，每个任务基于商店的目录、导航结构、政策和交互可能性。共同，ShopArena和ShopGuru产生自包含、可重置、可检查和稳定的评估成果，保留结构属性和与购物任务相关的代理评估信号。我们通过基于图的结构分析和基于代理的行为评估验证了该框架，使用224个生成的任务在六个沙盒商店中：三个由合成数据构建，三个由真实数据构建。我们的结果表明，合成商店保留了实时商店的关键结构属性，代理在合成商店上的表现与在实时商店上的表现正相关。

English abstract

Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.


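2605.16113 2026-05-18 cs.CL cs.AI Version update

DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation

DebiasRAG: 通过检索增强生成实现大型语言模型中公平生成的无调优路径

Rui Chu, Bingyin Zhao, Thanh Quoc Hung Le, Duy Cao Hoang, Huawei Lin, Ping Li, Weijie Zhao, Khoa D Doan, Yingjie Lao

Affiliations: Huawei

AI summary: This paper proposes DebiasRAG, a tuning-free, dynamic, query-specific debiasing framework based on retrieval-augmented generation; its three stages (query-specific debiasing candidate generation, context candidate pool construction, and gradient-updated debiasing-guided context reranking) improve generation fairness while preserving the intrinsic properties of LLMs.

AI Chinese abstract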

大型语言模型（LLMs）因生成能力卓越而取得空前成功。然而，由于依赖训练语料中的知识，它们可能生成幻觉、刻板印象和社会偏见内容。特别是，LLMs容易产生涉及种族、性别和年龄的偏见响应，统称为社会偏见。先前研究使用微调和提示工程来减轻LLMs中的偏见，但这些方法需要额外的训练资源或领域知识来设计框架。此外，它们可能降低LLMs的原始能力，并常忽视公平推断中动态去偏上下文的需要。本文提出DebiasRAG，一种基于检索增强生成（RAG）的新型无调优和动态查询特定去偏框架。DebiasRAG在保持LLM固有属性如表示能力的同时提升公平性。DebiasRAG包含三个阶段：（1）查询特定去偏候选生成；（2）上下文候选池构建；（3）梯度更新去偏引导上下文重排序。首先，DebiasRAG通过常规检索生成与查询相关的自我诊断偏见上下文，这些偏见上下文由DebiasRAG提供者离线准备。给定查询特定的偏见上下文，DebiasRAG反向生成去偏上下文，作为额外的公平性约束提供给LLM输出。其次，常规RAG检索过程从常规RAG文档数据库生成查询相关的上下文，如分块维基百科数据集。

English abstract

Large language models (LLMs) have achieved unprecedented success due to their exceptional generative capabilities. However, because they depend on knowledge encapsulated from training corpora, they may produce hallucinations, stereotypes, and socially biased content. In particular, LLMs are prone to prejudiced responses involving race, gender, and age, which are collectively referred to as social biases. Prior studies have used fine-tuning and prompt engineering to mitigate such biases in LLMs, but these methods require additional training resources or domain knowledge to design the framework. Moreover, they may degrade the original capabilities of LLMs and often overlook the need for dynamic debiasing contexts for fairer inference. In this paper, we propose DebiasRAG, a novel tuning-free and dynamic query-specific debiasing framework based on retrieval-augmented generation (RAG). DebiasRAG improves fairness while preserving the intrinsic properties of LLMs, such as representation ability. DebiasRAG consists of three stages: (1) query-specific debiasing candidate generation; (2) context candidate pool construction; and (3) gradient-updated debiasing-guided context piece reranking. First, DebiasRAG leverages self-diagnosed bias contexts relevant to the query through regular retrieval, where the bias contexts are prepared offline by the DebiasRAG provider. Given the query-specific bias contexts, DebiasRAG reversely produces debiasing contexts, which are provided as additional fairness constraints for LLM outputs. Second, a regular RAG retrieval process produces query-related contexts from the regular RAG document database, such as a chunked Wikipedia dataset.


2605.16112 2026-05-18 cs.LG cs.AI Version update

Attention Dispersion in Dynamic Graph Transformers: Diagnosis and a Transferable Fix

动态图变换器中的注意力分散：诊断与可迁移的修复

Jinhao Zhang, Kangfei Zhao, Qiuhao Zeng, Long-Kai Huang

Affiliations: Beijing Institute of Technology; University of Toronto; Hong Kong Baptist University

AI summary: This paper identifies an attention dispersion problem in dynamic graph transformers under temporal distribution shift and proposes a transferable differential attention mechanism that improves performance, especially on high-shift datasets.

AI Chinese abstract

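Centralized vs. Decentralized Federated Learning: A Performance Trade-off Analysis

Chaimaa Medjadji, Guilain Leduc, Sylvain Kubler, Yves Le Traon

Affiliations: University of Luxembourg

AI summary: Using the Fedstellar simulator, the MNIST dataset, and an MLP classifier, this paper compares the performance trade-offs of centralized, decentralized, and semi-decentralized federated learning architectures and identifies their strengths and weaknesses across application scenarios.

DOI: 10.1109/FiCloud62933.2024.00019

AI Chinese abstract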

联邦学习（FL）作为一种在分布式边缘设备上进行协作模型训练同时保护数据隐私的有前景范式，尤其在物联网设备数量激增的情况下显得尤为重要。然而，将如此大量的数据集中存储面临通信限制、隐私和法规等问题。FL可以是集中式（CFL）、去中心化（DFL）或半去中心化（SDFL）。选择合适的FL架构取决于应用需求。然而，非常少的研究通过实验比较了这三种架构，不仅为了理解各自的优势和局限性，还为了探讨不同性能指标之间的权衡。本文克服了这一分析的不足，利用Fedstellar模拟器、MNIST数据集和MLP分类器进行实验分析。

English abstract

Federated Learning (FL) has emerged as a promising paradigm for collaborative model training across distributed edge devices while preserving data privacy, especially given the sharp increase in data volume driven by the adoption of technologies that contribute to the growing number of IoT devices. Storing this amount of data centrally is challenging due to issues such as limited communication, privacy, and regulations. FL can be Centralized (CFL), Decentralized (DFL), or Semi-decentralized (SDFL). Choosing the right FL architecture depends on the application's needs. However, very few studies have experimentally compared these three architectures to understand not only their respective strengths and limitations but also the trade-offs between different performance indicators. This paper addresses this gap through experimental analyses using the Fedstellar simulator, the MNIST dataset, and an MLP classifier.


2605.16088 2026-05-18 cs.LG cs.AI Version update

Multi-level Self-supervised Pretraining on Compositional Hierarchical Graph for Molecular Property Prediction

基于组合层次图的多级自监督预训练用于分子性质预测

Xiayu Liu, Zhengyi Lu, Hou-biao Li

Affiliations: School of Mathematical Sciences; University of Electronic Science and Technology of China; Department of Computer Science and Engineering; Oakland University

AI summary: This paper proposes the MolCHG framework, which improves molecular property prediction through multi-level self-supervised pretraining; it organizes molecular structure as a compositional hierarchical graph and introduces a bond graph to strengthen bond information, so that atom-level and bond-level semantics are aggregated on an equal footing.

Comments 11 pages, 4 figures

AI Chinese abstract

自监督预训练在分子图上已展现出分子性质预测的潜力，但现有方法多在单一结构粒度上操作，将bond信息视为辅助边属性而非独立语义层。本文提出MolCHG，一种基于新型组合层次图的多级自监督预训练框架，将分子结构划分为三个语义层级的四种节点类型。通过引入与原子图并行的bond图，该架构将bond层面信息提升为独立演化的节点表示，使片段节点能平等聚合原子层面和bond层面语义。设计了三个层级特定的预训练目标：原子-债券交叉视图对比任务对齐每个片段的原子视图和bond视图表示；片段级功能团预测任务注入领域相关的化学知识；图级结构预测任务编码全局分子拓扑。在九个MoleculeNet基准测试中，MolCHG在七个数据集上取得最佳性能，在其余数据集上与最强基线竞争。消融研究进一步确认多级监督信号互补，每个组件均对整体性能有贡献。

English abstract

Self-supervised pretraining on molecular graphs has emerged as a promising approach for molecular property prediction, yet most existing methods operate at a single structural granularity and treat bond information as auxiliary edge attributes rather than as an independent semantic layer. In this work, we propose MolCHG, a multi-level self-supervised pretraining framework built upon a novel Compositional Hierarchical Graph that organizes molecular structure into four types of nodes across three semantic levels. By introducing a bond graph that operates in parallel with the atom graph, our architecture elevates bond-level information to independently evolving node representations, enabling fragment nodes to aggregate atom-level and bond-level semantics on an equal footing. We design three level-specific pretraining objectives: an atom-bond cross-view contrastive task that aligns the atom-view and bond-view representations within each fragment, a fragment-level functional group prediction task to inject domain-relevant chemical knowledge, and graph-level structure prediction tasks to encode global molecular topology. Experiments on nine MoleculeNet benchmarks demonstrate that MolCHG achieves the best performance on seven datasets across both classification and regression tasks, remaining competitive with the strongest baselines on the rest. Ablation studies further confirm that the multi-level supervision signals are complementary and that each component contributes to the overall performance.


2605.16085 2026-05-18 cs.DB cs.AI Version update

Towards Foundation Models for Relational Databases with Language Models and Graph Neural Networks

面向关系数据库的foundation models的语言模型与图神经网络方法

Jingcheng Wu, Ratan Bahadur Thapa, Mojtaba Nayyeri, Lucas Etteldorf, Max Finkenbeiner, Fabian Leeske, Steffen Staab

Affiliations: University of Stuttgart, Stuttgart, Germany; Internet Science Research Group, University of Southampton, Southampton, United Kingdom

AI summary: This paper proposes a hybrid architecture that combines a language model with a graph neural network over relational entity graphs to improve predictive performance on relational databases; experiments on RelBench show it is competitive with supervised baselines and narrows the gap to RDL.

Comments 15 pages, 7 figures, 4 tables. Preprint of a paper accepted at the 1st Workshop on Extraction from Triplet Text-Table-Knowledge Graph and associated Challenge (TRIPLET), co-located with ESWC 2026

AI Chinese abstract

关系数据库存储了大量结构化信息，对复杂预测应用至关重要。然而，关系数据的深度学习进展有限，传统方法通过人工特征工程将数据库扁平化为单表，丢失了关系上下文。关系深度学习（RDL）通过将数据库建模为关系实体图（REGs）供图神经网络（GNNs）处理，但任务和数据库特定。为结合两种范式的优势，本文提出混合架构，结合微调的BART编码器捕捉行内语义，以及基于GraphSAGE的GNN处理REGs注入关系上下文。在RelBench上的实验表明，GNN显著丰富BART的行嵌入，实现驱动-dnf任务在rel-f1数据集上的ROC-AUC为67.40。该性能与监督基线如LightGBM（68.86）相当，并缩小与RDL（72.62）的差距至5.22点，尽管与最先进的基础模型如KumoRFM（82.63）仍有较大差距。这些结果表明，轻量级混合LM-GNN架构为关系数据库的基础模型提供了有前景且资源高效的路径。

English abstract

Relational databases store much of the world's structured information, and they are essential for driving complex predictive applications. However, deep learning progress on relational data remains limited, as conventional approaches flatten databases into single tables via manual feature engineering, discarding relational context. Relational deep learning (RDL) addresses this by modeling databases as relational entity graphs (REGs) for graph neural networks (GNNs), but remains task- and database-specific. To combine the strengths of both paradigms, we propose a hybrid architecture combining a fine-tuned BART encoder to capture intra-row semantics with a GraphSAGE-based GNN over REGs to inject relational context. Experiments on RelBench show that the GNN substantially enriches BART's row embeddings, achieving a ROC-AUC of 67.40 on the driver-dnf task from the rel-f1 dataset. This performance is competitive with supervised baselines such as LightGBM (68.86) and narrows the gap to RDL (72.62) to within 5.22 points, though a substantial gap remains to state-of-the-art foundation models such as KumoRFM (82.63). These results suggest that lightweight hybrid LM-GNN architectures offer a promising and resource-efficient path towards foundation models for relational databases.
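The gaps quoted above follow directly from the reported ROC-AUC scores. A minimal check, using only the four numbers given in the abstract (the dictionary keys are labels chosen for this example, not names from the paper):

```python
# ROC-AUC on the driver-dnf task (rel-f1), as reported in the abstract.
scores = {
    "LM-GNN hybrid": 67.40,
    "LightGBM": 68.86,
    "RDL": 72.62,
    "KumoRFM": 82.63,
}

hybrid = scores["LM-GNN hybrid"]

# Gap in ROC-AUC points between each reference system and the hybrid model.
gaps = {
    name: round(score - hybrid, 2)
    for name, score in scores.items()
    if name != "LM-GNN hybrid"
}

for name, gap in gaps.items():
    print(f"{name}: +{gap:.2f} points over the hybrid")
```

The RDL gap comes out at 5.22 points, matching the figure in the abstract; the same subtraction gives 1.46 points for LightGBM and 15.23 points for KumoRFM.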


2605.16079 2026-05-18 cs.CV cs.AI cs.HC Version update

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

VideoSeeker：通过原生代理工具调用激励实例级视频理解

Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao

Affiliations: University of Science and Technology of China; Xiaohongshu Inc.; East China Normal University; Xi’an Jiaotong University

AI summary: VideoSeeker integrates agentic reasoning with instance-level video understanding tasks to improve accuracy; experiments show an average gain of 13.7% over baselines on instance-level tasks, surpassing GPT-4o and Gemini-2.5-Pro.

Comments Project Page: https://gaotiexinqu.github.io/VideoSeeker/

AI Chinese abstract

大型视觉-语言模型（LVLMs）在视频理解上取得了显著进展，但在需要精确实例级时空定位的任务中面临重大挑战。现有方法主要依赖文本提示进行人机交互，但这些提示难以提供精确的空间和时间参考，导致用户体验不佳。此外，当前方法通常将视觉感知与语言推理解耦，以语言为中心而非视觉内容，限制了模型主动感知细粒度视觉证据的能力。为解决这些问题，我们提出VideoSeeker，一种通过视觉提示实现实例级视频理解的新范式。VideoSeeker无缝整合代理推理与实例级视频理解任务，使模型能够主动感知并按需检索相关视频片段。我们构建了一个四阶段全自动数据合成管道，高效生成大规模高质量的实例级视频数据。我们通过冷启动监督和强化学习训练将工具调用和主动感知能力内化到模型中，构建了一个强大的视频理解模型。实验表明，我们的模型在实例级视频理解任务中平均比基线模型提升13.7%，超越强大的闭源模型如GPT-4o和Gemini-2.5-Pro，同时在通用视频理解基准上也表现出有效的迁移能力。相关数据集和代码将公开发布。

English abstract

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.


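2605.16076 2026-05-18 cs.CV cs.AI Version update

Reasoner or Translator? Contamination-Aware Evaluation and Neuro-Symbolic Robustness in Tax Law

Parisa Kordjamshidi, Samer Aslan, Madhavan Seshadri, Leslie Barrett, Enrico Santus

Affiliations: Bloomberg; Michigan State University

AI summary: This paper studies how data contamination affects LLM performance on tax law reasoning and proposes a neuro-symbolic framework to improve the reliability and robustness of legal AI.

AI Chinese abstract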

近期大型语言模型（LLM）的进步显著增强了自动化法律推理能力。然而，其性能反映的是真正的法律推理能力还是数据污染的产物仍不明确。本文对税法推理方法进行了全面实证研究，并实施了污染检测协议以严格评估LLM的可靠性。我们发现性能可能因污染而被夸大。基于此分析，我们进行了系统评估，比较了单一LLM与混合系统，后者将法律文本翻译为形式化表示并委托符号求解器进行推理。我们构建了一个新的测试套件，通过案例和规则变化来测试对未见文档的泛化能力。我们的发现表明，法律推理本质上是组合性的，神经符号框架为法律AI提供了更可靠和稳健的基础，以及对未观测情境的更好泛化能力。

English abstract

Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.


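2605.16048 2026-05-18 cs.LG cs.AI Version update

Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-Based Rubric Study

Jie Gao, Yongan Yu, Junzhu Su, Yiran Lin, Adam K. Dube, Jackie Chi Kit Cheung

Affiliations: McGill University; Mila – Quebec AI Institute; Canada CIFAR AI Chair

AI summary: This paper examines the adaptivity of vision language models in mathematics education, proposes a learner model-based rubric that assesses adaptivity in terms of cognitive aspects, motivational aspects, and complexity, and finds that existing models struggle to produce consistent instructional responses when learner information is limited.

AI Chinese abstract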

适应性学习指的是跟踪学习者学习进度并根据个体学习者表现调整教学过程的教育技术。它日益被认可为开发有效学习支持工具的关键。视觉语言模型（VLMs）已在数学教育中得到应用，学生将其作为个性化教学的辅助工具。然而，不清楚VLMs是否具备根据不同学习者档案提供数学指导的能力。当前VLMs缺乏系统评估框架来评估数学辅导任务中对不同学习者档案的适应性。为解决这一差距，我们借鉴适应性学习框架中的学习者模型（Shute和Towle，2018），提出基于学习者模型的评分表。我们的评分表将适应性评估形式化为三个方面：认知方面、动机方面和复杂度。我们还评估了VLM响应的两个额外维度：正确性（答案和解决方案的正确性）和质量（响应本身的质量）。我们的实验结果表明，不同模型在适应性方面存在可测量的差异，并揭示了当前VLMs在有限学习者信息下难以一致产生基于学习者模型的教学响应。

英文摘要

Adaptive learning refers to educational technologies that track learners' learning progress and adapt the instructional process based on individual learners' learning performance. It is increasingly recognized as critical for developing an effective learning support tool. Vision language models (VLMs) have seen adoption in mathematics education, and students have been using them as learning aids for personalized instruction. However, it is unknown whether VLMs have the ability to adapt to different learner profiles when providing mathematical instructions. Current VLMs lack a systematic evaluation framework for this adaptivity to different learner profiles in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and also reveal that current VLMs struggle to consistently produce learner model-based instructional responses, especially when receiving limited learner information.


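2605.15995 2026-05-18 cs.LG cs.AI (updated version)

Constrained latent state modeling: A unifying perspective on representation learning under competing constraints

Gwenolé Quellec

Affiliations: LaTIM UMR 1101

AI summary: This paper proposes constrained latent state modeling (CLSM), a unifying perspective on representation learning under competing constraints, and shows that the core properties of latent states are intrinsically coupled through fundamental trade-offs.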

Comments Resources and model cards: https://github.com/gwenole-quellec/clsm

Abstract

Learning latent representations from complex data is central to modern machine learning, spanning temporal, multimodal, and partially observed systems. In such settings, representations are better understood as latent states capturing underlying system dynamics, rather than as mere compressed summaries of observations. Yet current approaches remain fragmented, relying on distinct -- and often implicit -- assumptions about what these states should represent. We argue that this fragmentation reflects a more fundamental limitation: latent representations are typically learned from underconstrained objectives that fail to specify the properties that meaningful latent states should satisfy. As a result, multiple representations can satisfy the same objective, leading to ambiguity in their structure and interpretation. While many of the underlying principles have been explored in isolation, their interactions have not been explicitly formalized. In this work, we propose constrained latent state modeling (CLSM) as a unifying perspective. We identify a set of core properties -- predictive sufficiency, minimality, temporal coherence, observation compatibility, invariance to nuisance factors, and structural constraints -- and show that they are intrinsically coupled through fundamental trade-offs. Revisiting major modeling families through this lens, we show that existing approaches can be interpreted as enforcing different subsets of constraints, thereby occupying distinct regions of a common design space. This perspective reframes persistent challenges such as lack of identifiability as consequences of underconstrained formulations, rather than isolated technical limitations. More broadly, CLSM provides a principled framework to make design choices explicit, to analyze trade-offs, and to guide the development of more interpretable, robust, and task-aligned latent state models.


2605.15984 2026-05-18 cs.SD cs.AI cs.CR (updated version)

Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

Zhongjie Ba, Liang Yi, Peng Cheng, Qingcao Li, Qinglong Wang, Li Lu

Affiliations: The State Key Laboratory of Blockchain and Data Security; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; School of Cyber Science and Engineering

AI summary: This paper presents the ToxiAlert-Bench dataset and a dual-head neural network that incorporates paralinguistic cues to improve toxic speech detection; experiments show the method outperforms existing baselines on multiple metrics.

Abstract

Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key to detecting speech toxicity. Moreover, current toxic speech datasets are predominantly text-based, limiting the development of models that can capture paralinguistic cues. To address these challenges, we present ToxiAlert-Bench, a large-scale audio dataset comprising over 30,000 audio clips annotated with seven major toxic categories and twenty fine-grained toxic labels. Uniquely, our dataset annotates toxicity sources -- distinguishing between textual content and paralinguistic origins -- for comprehensive toxic speech analysis. Furthermore, we propose a dual-head neural network with a multi-stage training strategy tailored for toxic speech detection. This architecture features two task-specific classification heads: one for identifying the source of sensitivity (textual or paralinguistic), and the other for categorizing the specific toxic type. The training process involves independent head training followed by joint fine-tuning to reduce task interference. To mitigate data class imbalance, we incorporate class-balanced sampling and weighted loss functions. Our experimental results show that leveraging paralinguistic features significantly improves detection performance. Our method consistently outperforms existing baselines across multiple evaluation metrics, with a 21.1% relative improvement in Macro-F1 score and a 13.0% relative gain in accuracy over the strongest baseline, highlighting its enhanced effectiveness and practical applicability.
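The imbalance handling mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the class weights or sampling probabilities are computed, so the inverse-frequency scheme, the function names, and the toy labels below are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Assumed scheme: weight_c = N / (K * count_c), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def sampling_probabilities(labels):
    """Per-example sampling probability so each class is drawn equally often."""
    counts = Counter(labels)
    k = len(counts)
    return [1.0 / (k * counts[y]) for y in labels]

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the total weight."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm

# Toy, imbalanced label set (hypothetical class names).
labels = ["neutral", "neutral", "neutral", "toxic"]
weights = inverse_frequency_weights(labels)
sample_p = sampling_probabilities(labels)

# A classifier that always predicts 50/50 on the two classes.
probs = [{"neutral": 0.5, "toxic": 0.5} for _ in labels]
loss = weighted_cross_entropy(probs, labels, weights)
```

With these weights a single mistake on the rare class costs as much as three mistakes on the common one, which is the effect a weighted loss is meant to have; class-balanced sampling reaches a similar balance by changing how often each example is drawn instead.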


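2605.15983 2026-05-18 cs.AI (updated version)

Petri Net Induced Heuristic Search for Resource Constrained Scheduling

Ido Lublin, Dor Atzmon, Izack Cohen

Affiliations: Bar-Ilan University

AI summary: This paper models the resource-constrained project scheduling problem as an optimal search over the reachability graph of a Timed Transition Petri Net, using relative-delay tokens so that scheduling decisions correspond to state-space transitions. It applies A* with a heuristic that combines critical-path and resource lower bounds, proves the heuristic consistent, and outperforms a MIP baseline on PSPLIB benchmarks.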

Comments Accepted at the International Symposium on Combinatorial Search (SoCS 2026)

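2605.15978 2026-05-18 cs.CL cs.AI cs.LO (updated version)

Ontology for Policing: Conceptual Knowledge Learning for Semantic Understanding and Reasoning in Law Enforcement Reports

Anita Srbinovska, Jansen Orfan, Adrian Martin, Ernest Fokoué

Affiliations: Law Enforcement Agencies

AI summary: This paper proposes a symbolic framework that converts narratives in law enforcement reports into evidence-linked facts. It redacts personal identifiers, applies semantic parsing, maps predicates to an ontology, and reasons over the result to recover incident details and build temporal graphs with time cues and domain axioms.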

Comments 13 pages, 8 figures, 9 tables

Abstract

Law enforcement reports contain structured fields and written narratives. However, many incident facts that are needed for review, police training, and investigations are in natural language and require manual reading. We propose a framework using symbolic methods for converting narratives into evidence-linked facts. Our objective is to measure the value of narratives by recovering incident details only from the unstructured text and building temporal graphs with time cues and domain axioms. We achieve this through redaction of personal identifiers, semantic parsing, predicate mapping to an ontology, and reasoning. We evaluate the symbolic approach on 450 property crime reports and a short human review. Of the events extracted by the system, 54.1% had a confidence score of at least 0.80 and 93.7% were mapped through the PropBank--VerbNet--WordNet semantic path. Agreement was 100% on incident initiation, stolen items, and temporal cues, and lower for forced-entry interpretation.


2605.15976 2026-05-18 cs.CL cs.AI (updated version)

When and Why Adversarial Training Improves PINNs: A Neural Tangent Kernel Perspective

Yuan-dong Cao, Chi Chiu SO, Jun-Min Wang, He Wang

Affiliations: School of Mathematics and Statistics, Beijing Institute of Technology, China; School of Professional Education and Executive Development, The Hong Kong Polytechnic University, China; Department of Computer Science & UCL AI Centre, University College London, UK

AI summary: This paper analyzes, from a neural tangent kernel perspective, how adversarial training improves PINNs. It proposes a theoretical framework and an efficient training algorithm; experiments show that the method substantially reduces PINN training pathologies and improves accuracy.

Abstract

Physics-informed neural networks (PINNs) are powerful surrogates for differential equations but are notoriously difficult to train due to spectral bias, stiffness, and poor accuracy on high-frequency or multiscale solutions. Adversarial training based on generative adversarial networks (GANs) has recently achieved surprisingly strong empirical results in improving training, but the underlying mechanisms remain elusive. To this end, we propose a new analysis framework for adversarially trained PINNs, based on the key observation of how the discriminator in GANs can influence the training dynamics of PINNs. The framework first provides a much-needed theoretical grounding for why and when adversarial training is effective in PINNs, then presents a unified analysis of GAN variants in such training, and finally leads to a new, practical, efficient training algorithm for PINNs. Empirical results demonstrate that our method can significantly reduce the pathologies of PINN training, thereby providing better models with superior performance, often several orders of magnitude more accurate than alternative methods.


2605.15942 2026-05-18 cs.CV cs.AI (updated version)

Symplectic Neural Operators for Learning Infinite-Dimensional Hamiltonian Systems

Yeang Makara, Yusuke Tanaka, Takashi Matsubara, Takaharu Yaguchi

Affiliations: Graduate School of Science; Kobe University; NTT Communication Science Laboratories; Faculty of Information Science and Technology; Hokkaido University; Institute of Mathematics for Industry; Kyushu University

AI summary: This paper proposes symplectic neural operators to address computational and structural challenges in modeling and simulating infinite-dimensional Hamiltonian systems; preserving the symplectic structure improves long-term stability and energy behavior.

2605.15880 2026-05-18 cs.CV cs.AI (updated version)

FSCM: Frequency-Enhanced Spatial-Spectral Coupled Mamba for Infrared Hyperspectral Image Colorization

Tingting Liu, Yuan Liu, Guiping Chen, Xiubao Sui, Qian Chen

Affiliations: School of Electronic and Optical Engineering, Nanjing University of Science and Technology; School of Mechanical Engineering, University of Science and Technology Beijing; School of Instrument and Electronics, North University of China

AI summary: This paper proposes FSCM, which uses a frequency-enhanced spatial-spectral state-space generator and a dual-stream hybrid gating module to improve the visual quality and semantic consistency of infrared hyperspectral image colorization.


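Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

Yuqing Cheng, Xingyu Ma, Guochen Yu, Xiaotao Gu

Affiliations: Department of Music AI and Information Technology, Central Conservatory of Music; Zhipu AI

AI summary: This paper proposes BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that produces Mel-frequency band tokens from a single shared codebook, making the tokens better suited to autoregressive modeling; experiments show strong results in a data-limited setting.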

Abstract

Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.


2605.15812 2026-05-18 cs.HC cs.AI (updated version)

Toward Natural and Companionable Virtual Agents via Cross-Temporal Emotional Modeling

Feier Qin, Xiao Li, Yi Zheng, Haibin Huang, Hanyao Wang, Xiaoyu Wang, Yan Lu, Yuan Zhang

Affiliations: Communication University of China; Microsoft Research Asia; Institute of Artificial Intelligence, China Telecom

AI summary: This paper proposes the CTEM framework, which links long-term behavioral history to moment-to-moment emotional expression to make virtual agents more natural and emotionally harmonious; a 21-day in-the-wild study reports improvements.

Comments 21 pages, published in CHI '26

Journal ref Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26), ACM, 2026

DOI: 10.1145/3772318.3790917

Abstract

Recent advances in foundation models have enabled conversational agents that aim for sustained companionship rather than mere task completion. Yet most remain unable to support natural, long-term companion-like interactions, resulting in experiences that feel episodic and inauthentic. We argue that current agents overlook cross-temporal modeling of agents' social behaviors and internal emotions: generated behaviors rarely influence an agent's emotional state, and emotional states seldom shape subsequent behaviors. We present Cross-Temporal Emotion Modeling (CTEM), a framework that links long-term behavioral history to moment-to-moment emotional expression. CTEM establishes a closed loop where past experiences update an evolving emotional state; this state conditions immediate interactions; and user feedback continually revises both memory and emotional state, enabling reflection and anticipation. We instantiate CTEM as Auri, a companion agent on an instant-messaging platform, and report a 21-day in-the-wild study showing that CTEM improves perceived naturalness, coherence, and emotional harmony.


2605.15787 2026-05-18 cs.LG cs.AI (updated version)

Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets

Kai Hidajat, Solden Stoll, Joseph An

Affiliations: Department of Computer Science; University of Washington; Seattle, WA 98195

AI summary: This paper studies the structural inference mechanism behind delayed generalization (grokking) in Transformers and proposes a Bayesian lottery-ticket account that explains the delay as delayed structural inference.

Abstract

Why does a Transformer that has memorized its training set wait thousands of steps before it generalizes? Existing accounts locate this delay in norm minimization, feature emergence, or the late discovery of sparse subnetworks. These explanations capture important parts of the transition, but ignore a constraint unique to attention-based models: if attention discards an informative token, no bounded downstream computation can recover it. We formalize attention as an implicit Bayesian posterior over the task dependency graph and prove that generalization requires two separable conditions: a familiar Goldilocks bound on MLP capacity, coinciding with norm-based theories of grokking, and a novel Bayesian structural condition requiring attention to place sufficient mass on every informative token. This decoupling explains delayed generalization as delayed structural inference. Early in training, the MLP memorizes through unaligned features, drives the cross-entropy loss near zero, and thereby starves attention of structural gradient. Weight decay must then erode memorization before the missing graph becomes learnable, yielding the known inverse-weight-decay delay, which we derive as a structural waiting time. We then prove that this explaining-away delay can be bypassed by a KL-based structural intervention, yielding an inverse-intervention-strength scaling law for the grokking time. Experiments on algorithmic sequence tasks isolate structure from capacity and show that this Bayesian ticket matches or outperforms lottery-ticket transfer.


2605.15779 2026-05-18 cs.RO cs.AI (updated version)

A Topology-Aware Spatiotemporal Handover Framework for Continuous Multi-UAV Tracking

Jianlin Ye, Christos Kyrkou, Panayiotis Kolios

Affiliations: KIOS Research and Innovation Centre of Excellence (KIOS CoE); University of Cyprus

AI summary: This paper presents a real-time multi-camera multi-vehicle tracking system that uses a topology-based spatiotemporal handover mechanism to maintain vehicle identity across multiple UAV views; experiments report a 99.8% handover success rate, compared with 74.1% for Re-ID baselines.

Journal ref 2026 International Conference on Unmanned Aircraft Systems (ICUAS)

Abstract

The integration of Unmanned Aerial Vehicles (UAVs) into Intelligent Transportation Systems (ITS) offers synoptic visibility for traffic monitoring, yet scalable deployment is hindered by trajectory fragmentation, where vehicle identity persistence is lost across multi-UAV Fields of View (FOV). While state-of-the-art frameworks excel in optimizing local trajectory extraction and stability for single-drone imagery, they often function as isolated data silos that generate disjointed trajectories, thereby precluding network-level analysis such as Origin-Destination estimation. This paper presents a real-time Multi-Camera Multi-Vehicle Tracking (MCMT) system designed to handle global identity persistence. Addressing the visual ambiguity and computational cost of appearance-based Re-Identification (Re-ID) in nadir views, we introduce a lightweight Topology-Based Spatiotemporal Handover mechanism. We implement a high-throughput parallel pipeline leveraging YOLO11 and ByteTrack to process concurrent 4K streams. Our core contribution is a deterministic queue-based matching algorithm that utilizes geometric overlaps and virtual lane discretization to predictively manage identity handover via FIFO queues. Experimental results on complex urban environments, including intersections and merging traffic, demonstrate a Handover Success Rate (HOSR) of 99.8% in continuous traffic flows, significantly outperforming Re-ID baselines (74.1%) while validating edge deployment feasibility. The source code is available at https://github.com/JYe9/multi-camera-multi-vehicle-tracking-system.
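The queue-based handover described in the abstract can be pictured with a minimal sketch: a vehicle leaving one view through a virtual lane pushes its global ID onto that lane's FIFO queue, and the next new track appearing in the same lane of the neighbouring view pops it. The class name, method names, and lane labels are hypothetical, and the sketch assumes vehicles keep their order within a lane; it is not taken from the linked repository.

```python
from collections import defaultdict, deque
from itertools import count

class LaneHandover:
    """Hypothetical sketch: one FIFO queue of global IDs per virtual lane."""

    def __init__(self):
        self._queues = defaultdict(deque)   # lane id -> IDs awaiting handover
        self._next_id = count(1)            # source of fresh global IDs

    def on_exit(self, lane, global_id):
        """A tracked vehicle leaves the upstream view through `lane`."""
        self._queues[lane].append(global_id)

    def on_entry(self, lane):
        """A new track appears downstream in `lane`; reuse the oldest pending ID."""
        queue = self._queues[lane]
        if queue:
            return queue.popleft()
        return next(self._next_id)          # nothing pending: assign a new ID

handover = LaneHandover()
a = handover.on_entry("lane_a")             # no pending vehicle, fresh ID
b = handover.on_entry("lane_a")             # second vehicle, another fresh ID
handover.on_exit("lane_a", a)               # both leave the first view in order
handover.on_exit("lane_a", b)
matched = [handover.on_entry("lane_a"), handover.on_entry("lane_a")]
c = handover.on_entry("lane_a")             # queue is empty again
```

Matching by arrival order needs no appearance features, which is why such a scheme is cheap, but it relies on the lane discretization being fine enough that overtaking inside a lane is rare.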


2605.15120 2026-05-18 cs.RO cs.AI cs.CV (updated version)

CLOVER: Closed-Loop Value Estimation and Ranking for End-to-End Autonomous Driving Planning

Sining Ang, Yuguang Yang, Canyu Chen, Yan Wang

Affiliations: Department of Automation, University of Science and Technology of China; Institute for AI Industry Research, Tsinghua University; School of Electronic Information Engineering, Beihang University; National College for Excellent Engineers, Beihang University

AI summary: CLOVER is a closed-loop value estimation and ranking framework that addresses the training-evaluation mismatch in end-to-end autonomous driving planning; its lightweight generator-scorer architecture ranks candidate trajectories more accurately and improves planner performance.

Abstract

End-to-end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule-based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training--evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high-scoring. The mismatch is especially limiting for proposal-selection planners, whose performance depends on candidate-set coverage and scorer ranking quality. We propose CLOVER, a Closed-LOop Value Estimation and Ranking framework for end-to-end autonomous driving planning. CLOVER follows a lightweight generator--scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning-metric sub-scores to rank them at inference time. To expand proposal support beyond single-trajectory imitation, CLOVER constructs evaluator-filtered pseudo-expert trajectories and trains the generator with set-level coverage supervision. It then performs conservative closed-loop self-distillation: the scorer is fitted to true evaluator sub-scores on generated proposals, while the generator is refined toward teacher-selected top-$k$ and vector-Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer-mediated refinement is reliable when scorer-selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open-loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code and data will be released at https://github.com/WilliamXuanYu/CLOVER.
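The two target-selection ideas named in the abstract, top-$k$ and vector-Pareto, can be illustrated on toy data. The sketch below is only an illustration under stated assumptions: the sub-score names, the numbers, and the mean used as the scalar aggregate are invented here, and the aggregate is not the PDMS formula.

```python
def dominates(a, b):
    """True if vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Indices of candidates whose sub-score vectors no other candidate dominates."""
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)]

def top_k(scores, k, aggregate=lambda s: sum(s) / len(s)):
    """Indices of the k best candidates under a scalar aggregate (assumed: mean)."""
    ranked = sorted(range(len(scores)),
                    key=lambda i: aggregate(scores[i]), reverse=True)
    return ranked[:k]

# Toy sub-scores per candidate trajectory: (safety, progress, comfort),
# higher is better. Values are made up for illustration.
scores = [(0.9, 0.4, 0.8), (0.7, 0.9, 0.6), (0.6, 0.3, 0.5), (0.9, 0.4, 0.7)]
front = pareto_front(scores)
best_two = top_k(scores, 2)
```

A scalar top-$k$ ranking depends on how sub-scores are weighted, whereas the Pareto front keeps every candidate that is not beaten on all sub-scores at once, so the two selections can differ in order or membership.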


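2605.15108 2026-05-18 stat.ML cs.AI cs.IR cs.LG stat.ME (updated version)

Logging Policy Design for Off-Policy Evaluation

Connor Douglas, Joel Persson, Foster Provost

Affiliations: New York University; Spotify

AI summary: This paper studies how to design logging policies that minimize OPE error, characterizes a fundamental reward-coverage tradeoff, and derives optimal policies under different informational regimes.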

Abstract

Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy. It enables high-stakes experimentation without live deployment, yet in practice accuracy depends heavily on the logging policy used to collect data for computing the estimate. We study how to design logging policies that minimize OPE error for given target policies. We characterize a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems. We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm's primary objective. We also distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.
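The dependence of OPE accuracy on the logging policy can be seen in a small worked example. The sketch below uses the standard inverse-propensity-scoring (IPS) estimator with deterministic rewards; the abstract does not say which estimator the paper analyzes, so the estimator, the two-action setup, and all numbers are assumptions for illustration.

```python
def ips_value(logged, target, logging):
    """IPS estimate: mean of target(a) / logging(a) * reward over logged pairs."""
    return sum(target[a] / logging[a] * r for a, r in logged) / len(logged)

def ips_variance(target, logging, reward):
    """Exact per-sample variance of the IPS term when rewards are deterministic."""
    second = sum(logging[a] * (target[a] / logging[a] * reward[a]) ** 2
                 for a in logging)
    first = sum(target[a] * reward[a] for a in logging)   # true target value
    return second - first ** 2

reward = {"x": 1.0, "y": 0.5}           # toy deterministic rewards
target = {"x": 0.8, "y": 0.2}           # policy being evaluated

uniform = {"x": 0.5, "y": 0.5}          # broad coverage, ignores rewards
on_policy = dict(target)                # log with the target policy itself
# Logging proportional to target(a) * reward(a): a textbook importance-sampling
# choice that needs the rewards to be known in advance.
mass = {a: target[a] * reward[a] for a in target}
reward_aware = {a: m / sum(mass.values()) for a, m in mass.items()}

var_uniform = ips_variance(target, uniform, reward)
var_on_policy = ips_variance(target, on_policy, reward)
var_reward_aware = ips_variance(target, reward_aware, reward)

logged = [("x", 1.0), ("y", 0.5), ("x", 1.0), ("x", 1.0)]
estimate = ips_value(logged, target, uniform)
```

In this toy case the variance shrinks as logging mass moves toward actions that are both likely under the target policy and high in reward, which matches the first half of the tradeoff in the abstract; the second half appears once rewards or the target policy are uncertain, because a concentrated logging policy then leaves some actions with little or no data.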


2605.14344 2026-05-18 cs.AI (updated version)

CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation

Yuyang Wu, Stefano Falletta, Delia McGrath, Sherry Yang

Affiliations: Tsinghua University; Radical AI; New York University

AI summary: CrystalReasoner incorporates physical priors and reinforcement learning to generate stable crystal structures with specified properties from natural-language instructions, improving generation accuracy and scientific plausibility.

Comments Our work is available at https://crystalreasoner.github.io/, with code at https://github.com/wyy603/CrystalReasoner


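Representing Higher-Order Networks: A Survey of Graph-Based Frameworks

Takaaki Fujita, Florentin Smarandache

AI summary: This survey reviews graph-based frameworks for representing higher-order networks, covering multiway, hierarchical, temporal, multilayer, recursive, and tensor-based interactions, and aims to give a unified perspective for comparing models and identifying appropriate tools.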

Comments 170 pages. Peer-Reviewed Book. Publisher: Neutrosophic Science International Association (NSIA) Publishing House. ISBN: 978-1-59973-881-9

DOI: 10.6084/m9.figshare.31827613

Abstract

Many real-world phenomena are naturally modeled by graphs and networks. However, classical graph models are often limited to pairwise interactions and may not adequately capture the richer structures that arise in practice. Higher-order graph formalisms extend this framework by incorporating multiway, hierarchical, temporal, multilayer, recursive, and tensor-based interactions, thereby providing more expressive representations of complex systems. This book presents a comprehensive overview of mathematical notions that can be used to model higher-order networks. It surveys foundational concepts, extensional frameworks, and newly introduced formalisms, with an emphasis on their structural principles, relationships, and modeling roles. The aim is to provide a unified perspective that helps readers compare diverse higher-order network models and identify appropriate tools for theoretical study and practical applications. This book is Edition 2.0. It mainly includes the addition of several concepts, as well as corrections and improvements of typographical errors and explanations.


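2605.10867 2026-05-18 cs.CR cs.AI cs.CV cs.LG cs.NI (updated version)

BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data

Ishpuneet Singh, Gursmeep Kaur, Uday Pratap Singh Atwal, Guramrit Singh, Gurjot Singh, Maninder Singh

AI summary: BEACON is a multimodal dataset of competitive Valorant gameplay whose high-precision motor demands and cognitive load make it a rigorous stress test for behavioral biometrics; it supports research on continuous authentication, behavioral profiling, and multimodal learning.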

Abstract

Continuous authentication in high-stakes digital environments requires datasets with fine-grained behavioral signals under realistic cognitive and motor demands. But current benchmarks are often limited by small scale, unimodal sensing or lack of synchronised environmental context. To address this gap, this paper introduces BEACON (Behavioral Engine for Authentication & Continuous Monitoring), a large-scale multimodal dataset that captures diverse skill tiers in competitive Valorant gameplay. BEACON contains approximately 430 GB of synchronised modality data (461 GB total on-disk including auxiliary Valorant configuration captures) from 79 sessions across 28 distinct players, estimated at 102.51 hours of active gameplay, including high-frequency mouse dynamics, keystroke events, network packet captures, screen recordings, hardware metadata, and in-game configuration context. BEACON leverages the high precision motor skills and high cognitive load that are inherent to tactical shooters, making it a rigorous stress test for the robustness of behavioral biometrics. The dataset allows for the study of continuous authentication, behavioral profiling, user drift and multimodal representation learning in a high-fidelity esports setting. The authors release the dataset and code on Hugging Face and GitHub to create a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models.


2605.10813 2026-05-18 cs.AI (updated version)

NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

Jinhang Xu, Qiyuan Zhu, Yujun Wu, Zirui Wang, Dongxu Zhang, Marcia Tian, Yiling Duan, Siyuan Li, Jingxuan Wei, Sirui Han, Yike Guo, Odin Zhang, Conghui He, Cheng Tan

Affiliations: Shanghai Artificial Intelligence Laboratory; The Hong Kong University of Science and Technology; Peking University; Zhejiang University; Xi'an Jiaotong University; East China University of Science and Technology; The Chinese University of Hong Kong

AI summary: This paper proposes NanoResearch, a multi-agent framework that uses tri-level co-evolution of skills, memory, and policy to personalize research automation, producing better research at lower cost over successive cycles.

Comments 40 pages, 14 figures, 7 tables

Abstract

LLM-powered multi-agent systems can now automate the full research pipeline from ideation to paper writing, but a fundamental question remains: automation for whom? Researchers operate under different resource configurations, hold different methodological preferences, and target different output formats. A system that produces uniform outputs regardless of these differences will systematically under-serve every individual user, making personalization a precondition for research automation to be genuinely usable. However, achieving it requires three capabilities that current systems lack: accumulating reusable procedural knowledge across projects, retaining user-specific experience across sessions, and internalizing implicit preferences that resist explicit formalization. We propose NanoResearch, a multi-agent framework that addresses these gaps through tri-level co-evolution. A skill bank distills recurring operations into compact procedural rules reusable across projects. A memory module maintains user- and project-specific experience that grounds planning decisions in each user's research history. A label-free policy learning converts free-form feedback into persistent parameter updates of the planner, reshaping subsequent coordination. These three layers co-evolve: reliable skills produce richer memory, richer memory informs better planning, and preference internalization continuously realigns the loop to each user. Extensive experiments demonstrate that NanoResearch delivers substantial gains over state-of-the-art AI research systems, and progressively refines itself to produce better research at lower cost over successive cycles.

通过视觉手写特征的证据深度回归进行历史手稿的概率年代测定

Ranjith Chodavarapu

发表机构 * Kent State University（肯特州立大学）

AI总结本文提出一种基于视觉特征的深度回归方法，用于确定历史手稿的年代，通过分解不确定性提升预测精度，实验显示模型在测试集上取得优异性能。

AI中文摘要

我们提出一种仅凭视觉特征为历史手稿页面断代的概率方法。以往文献通常将世纪聚合为类别，我们则将断代视为连续年份轴上的证据深度回归问题，使神经网络能够在单次前向传递中输出完整的预测分布，并给出分解后的偶然不确定性和认知不确定性。我们的架构结合了EfficientNet-B2主干网络和Normal-Inverse-Gamma（NIG）输出头，使用联合负对数似然与证据正则化目标进行训练。在DIVA-HisDB基准（150页，3部中世纪手抄本，151,936个图像块）上，我们的模型取得了5.4年的测试MAE，远低于世纪标签监督的50年粒度，93%的图像块误差在5年以内，97%在10年以内。我们的方法在单次前向传递中达到PICP=92.6%，是所有对比方法中校准最好的，优于MC Dropout（PICP=88.2%，50次传递）和Deep Ensembles（PICP=79.7%，5个模型），且推理成本低5倍。不确定性分解显示，偶然不确定性是断代误差的强预测因子（Spearman ρ=0.729），对最确定的20%图像块进行选择性预测可得到0.5年的MAE。我们还展示了预测不确定性随图像退化加剧而增加，空间分解图解释了哪些字迹区域导致偶然不确定性，页面级聚合将MAE降低到4.5年，不确定性与页面级误差之间的相关性为ρ=0.905。

英文摘要

We introduce a probabilistic approach for dating historical manuscript pages from visual features alone. Instead of aggregating centuries into classes as is standard in the previous literature, we pose dating as an evidential deep regression problem over a continuous year axis, allowing our neural network to output a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. Our architecture combines an EfficientNet-B2 backbone with a Normal-Inverse-Gamma (NIG) output head trained with a joint negative-log-likelihood and evidence-regularization objective. On the DIVA-HisDB benchmark (150 pages, 3 medieval codices, 151,936 patches), our model scores a test MAE of 5.4 years, well below the 50-year century-label supervision granularity, with 93% of patches within 5 years and 97% within 10 years. Our approach achieves PICP=92.6%, the best calibration among all compared methods, in a single forward pass, outperforming MC Dropout (PICP=88.2%, 50 passes) and Deep Ensembles (PICP=79.7%, 5 models) at 5× lower inference cost. Uncertainty decomposition shows aleatoric uncertainty is a strong predictor of dating error (Spearman ρ=0.729), and selective prediction on the most certain 20% of patches yields an MAE of 0.5 years. We show that predicted uncertainty increases as image degradation worsens, spatial decomposition maps explain which script regions cause aleatoric uncertainty, and page-level aggregation reduces MAE to 4.5 years with ρ=0.905 between uncertainty and page-level error.
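
The uncertainty decomposition described above can be sketched as follows. This is a minimal illustration that assumes the standard Normal-Inverse-Gamma parameterization (γ, ν, α, β) commonly used in evidential deep regression; the abstract does not state its formulas, and `nig_uncertainty` and `picp` are hypothetical helper names, not the author's code. The second function shows one common way to compute prediction interval coverage probability (PICP) from interval bounds.

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Return (prediction, aleatoric variance, epistemic variance) for one NIG output.

    Assumed parameterization: mu ~ N(gamma, sigma^2 / nu), sigma^2 ~ InvGamma(alpha, beta).
    """
    if alpha <= 1.0 or nu <= 0.0 or beta <= 0.0:
        raise ValueError("need alpha > 1, nu > 0, beta > 0")
    prediction = gamma                       # E[mu]: the predicted year
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: noise inherent in the data
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu]: uncertainty of the model itself
    return prediction, aleatoric, epistemic


def picp(y_true, lower, upper):
    """Fraction of targets that fall inside their predicted interval."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)
```

With these assumed formulas, `nig_uncertainty(1350.0, 2.0, 3.0, 8.0)` gives a predicted year of 1350 with aleatoric variance 4.0 and epistemic variance 2.0; both terms come from a single set of head outputs, which is what makes a single forward pass sufficient.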

2605.06223 2026-05-18 cs.AI cs.RO 版本更新

ProCompNav: Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries

ProCompNav：面向模糊用户查询、基于比较判断的主动实例导航

Junhyuk Kwon, Seungjoon Lee, Hyejin Park, Kyle Min, Jungseul Ok

发表机构 * GSAI, POSTECH（POSTECH人工智能研究生院）； CSE, POSTECH（POSTECH计算机科学与工程系）； Oracle（Oracle公司）

AI总结 ProCompNav通过两阶段框架解决用户查询歧义问题，通过比较判断逐步缩小候选集，提升导航成功率并减少用户响应长度。

Comments Project page: https://tree-jhk.github.io/procompnav/ . Code: https://github.com/tree-jhk/procompnav/

AI中文摘要

当初始用户请求不能唯一确定目标实例时，自然语言实例导航会变得具有挑战性。一个实用的智能体应当只主动询问区分目标与相似干扰项所需的信息，从而减轻用户负担，而不是要求用户一开始就给出详细描述。现有方法往往达不到这一目标：它们可能在充分探索其他选项之前就停在第一个看似合理的候选上；或者即使收集了多个候选，也只是根据单个候选询问目标的属性，而不是选择能够区分候选池中各候选的问题。结果是，尽管进行了对话，智能体仍可能无法区分目标与干扰项，导致过早决策和冗长的用户回答。我们提出Proactive Instance Navigation with Comparative Judgment（ProCompNav），一个两阶段框架：先构建候选池，再通过比较判断确定目标。在每一轮中，ProCompNav提取一个能将当前候选池一分为二的属性-值对，提出一个是/否二元问题，并一次性剪除所有不一致的候选。这将消歧从开放式的目标描述转变为候选池层面的判别式提问，每个问题都旨在缩小候选集。在CoIN-Bench上，与使用相同最小输入的交互式基线以及使用详细描述的非交互式基线相比，ProCompNav提高了成功率（Success Rate），同时大幅缩短了回答长度（Response Length）。ProCompNav还在TextNav上取得了最先进的成功率，表明比较判断对于相似干扰项之间的实例级导航具有广泛价值。代码见 https://github.com/tree-jhk/procompnav 。

英文摘要

Natural-language instance navigation becomes challenging when the initial user request does not uniquely specify the target instance. A practical agent should reduce the user's burden by actively asking only the information needed to distinguish the target from similar distractors, rather than requiring a detailed description upfront. Existing approaches often fall short of this goal: they may stop at the first plausible candidate before sufficiently exploring alternatives, or, even after collecting multiple candidates, ask about the target's attributes derived from individual candidates rather than questions selected to distinguish candidates in the pool. As a result, despite the dialogue, the agent may still fail to distinguish the target from distractors, leading to premature decisions and lengthy user responses. We propose Proactive Instance Navigation with Comparative Judgment (ProCompNav), a two-stage framework that first constructs a candidate pool and then identifies the target through comparative judgment. At each round, ProCompNav extracts an attribute-value pair that splits the current pool, asks a binary yes/no question, and prunes all inconsistent candidates at once. This reframes disambiguation from open-ended target description to pool-level discriminative questioning, where each question is chosen to narrow the candidate set. On CoIN-Bench, ProCompNav improves Success Rate over interactive baselines with the same minimal input and non-interactive baselines with detailed descriptions, while substantially reducing Response Length. ProCompNav also achieves state-of-the-art Success Rate on TextNav, suggesting that comparative judgment is broadly useful for instance-level navigation among similar distractors. Code is available at https://github.com/tree-jhk/procompnav.
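
The per-round loop described above (pick a splitting attribute-value pair, ask a yes/no question, prune) can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: candidates are modeled as plain attribute dictionaries, the "most even split" selection rule is an assumption (the abstract only says the pair splits the pool), and `pick_question`, `prune`, and `disambiguate` are hypothetical names.

```python
def pick_question(pool):
    """Choose the (attribute, value) pair whose yes/no answer splits the pool most evenly."""
    best, best_gap = None, None
    pairs = sorted({(a, v) for attrs in pool.values() for a, v in attrs.items()})
    for attr, value in pairs:
        yes = sum(1 for attrs in pool.values() if attrs.get(attr) == value)
        no = len(pool) - yes
        if yes == 0 or no == 0:
            continue  # this pair does not split the pool, so asking it is useless
        gap = abs(yes - no)
        if best_gap is None or gap < best_gap:
            best, best_gap = (attr, value), gap
    return best


def prune(pool, question, answer_is_yes):
    """Drop every candidate that is inconsistent with the user's answer."""
    attr, value = question
    return {name: attrs for name, attrs in pool.items()
            if (attrs.get(attr) == value) == answer_is_yes}


def disambiguate(pool, oracle):
    """Ask binary questions until one candidate remains or no question can split the pool."""
    rounds = 0
    while len(pool) > 1:
        question = pick_question(pool)
        if question is None:
            break
        pool = prune(pool, question, oracle(question))
        rounds += 1
    return pool, rounds


# Toy pool of four similar chairs; the user has chair_3 in mind.
POOL = {
    "chair_1": {"color": "red", "near": "window"},
    "chair_2": {"color": "blue", "near": "window"},
    "chair_3": {"color": "red", "near": "door"},
    "chair_4": {"color": "blue", "near": "door"},
}
TARGET = POOL["chair_3"]
final_pool, rounds = disambiguate(POOL, lambda q: TARGET.get(q[0]) == q[1])
```

In this toy pool each question halves the candidate set, so four candidates are resolved in two yes/no answers, which is the intuition behind pool-level discriminative questioning.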

2604.26733 2026-05-18 cs.AI cs.LG 版本更新

FutureWorld: A Live Reinforcement Learning Environment for Predictive Agents with Real-World Outcome Rewards

FutureWorld: 一个用于预测代理的实时强化学习环境，具有现实世界结果奖励

Zhixin Han, Yanzhi Zhang, Chuyang Wei, Maohang Gao, Xiawei Yue, Kefei Chen, Yu Zhuang, Haoxiang Guan, Jiyan He, Jian Li, Yitong Duan, Yu Shi, Mengting Hu, Shuxin Zheng

发表机构 * College of Software, Nankai University（南开大学软件学院）； Academy of Mathematics and Systems Science, Chinese Academy of Sciences（中国科学院数学与系统科学研究院）； School of Computer Science and Technology, University of Science and Technology of China（中国科学技术大学计算机科学与技术学院）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； IIIS, Tsinghua University（清华大学交叉信息研究院）； Zhongguancun Academy, Beijing, China（北京中关村学院）

AI总结本文提出FutureWorld，一个实时强化学习环境，通过闭环预测、结果实现与参数更新，提升预测准确性与校准能力。

Comments The code will be released in the near future. The experiments are currently ongoing

AI中文摘要

A3-FPN：渐近内容感知金字塔注意力网络用于密集视觉预测

Meng'en Qin, Yu Song, Quanling Zhao, Xiaodong Yang, Yingtao Che, Xiaohui Yang

发表机构 * Henan Engineering Research Center for Artificial Intelligence Theory and Algorithms（人工智能理论与算法河南省工程研究中心）； Henan University（河南大学）； Faculty of Computer Science and Control Engineering（计算机科学与控制工程学院）； Shenzhen University of Advanced Technology（深圳先进技术大学）； Department of Electrical and Electronic Engineering（电子与电气工程系）

AI总结本文提出A3-FPN，通过渐近解耦框架和内容感知注意力模块增强多尺度特征表示，提升密集预测任务中小物体的识别性能。

Journal ref Pattern Recognition, 2026, 113793

DOI: 10.1016/j.patcog.2026.113793

AI中文摘要

学习多尺度表示是解决密集预测任务中物体尺度变化的常见策略。尽管现有特征金字塔网络在视觉识别中取得了显著进展，但固有设计缺陷限制了它们捕捉判别特征和识别小物体的能力。本文提出渐近内容感知金字塔注意力网络（A3-FPN），通过渐近解耦框架和内容感知注意力模块增强多尺度特征表示。具体而言，A3-FPN采用横向扩展的列网络，实现渐近全局特征交互，并将每个层次与所有层次表示解耦。在特征融合中，它从相邻层次收集补充内容，生成位置加权偏移和权重用于上下文感知重采样，并学习深度上下文重权重以提高类别内相似性。在特征重组装中，它进一步加强了同一尺度的判别特征学习，并基于特征图的信息内容和空间变化重组装冗余特征。在MS COCO、VisDrone2019-DET和Cityscapes上的大量实验表明，A3-FPN可以轻松集成到最先进的CNN和Transformer架构中，取得显著性能提升。值得注意的是，当与OneFormer和Swin-L主干结合时，A3-FPN在MS COCO上达到49.6的mask AP，在Cityscapes上达到85.6的mIoU。代码可在https://github.com/mason-ching/A3-FPN上获取。

英文摘要

Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at https://github.com/mason-ching/A3-FPN.

2604.09631 2026-05-18 cs.DC cs.AI 版本更新

Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection

边缘目标检测在故障注入下的硬件利用与推断性能

Faezeh Pasandideh, Mehdi Azarafza, Achim Rettberg

发表机构 * Hamm-Lippstadt University of Applied Sciences (HSHL)（哈姆-利普施塔特应用科学大学（HSHL））

AI总结研究通过故障注入测试评估了TensorRT优化的YOLO模型在边缘平台上的硬件行为，发现其在资源降级下保持稳定性能，为边缘推断可靠性提供硬件层面的视角。

AI中文摘要

随着深度学习模型部署在资源受限的边缘平台，了解硬件在资源降级下的行为变得至关重要。本文系统地表征了在大规模故障注入测试下，TensorRT优化的YOLOv10s、YOLOv11s和YOLO2026n管道在NVIDIA Jetson Nano上的CPU负载、GPU利用率、RAM消耗、功耗、吞吐量和热行为。故障通过解耦框架合成，利用大型语言模型和潜在扩散模型。结果表明，两种任务和两种模型的推断引擎在资源降级下保持GPU占用稳定，温度上升受控，功耗在安全范围内，内存使用在初始暖机阶段后趋于一致释放模式。目标检测在内存和热行为上略有波动，但两者均得出结论：TensorRT管道在输入数据严重降级时仍表现良好。这些发现提供了模型可靠性的硬件层面视角，与边缘推断性能研究形成补充。

英文摘要

As deep learning models are deployed on resource-constrained edge platforms in autonomous driving systems, reliable knowledge of hardware behavior under resource degradation becomes an essential requirement. Therefore, we introduce a systematic characterization of CPU load, GPU utilization, RAM consumption, power draw, throughput, and thermal behaviour of TensorRT-optimized YOLOv10s, YOLOv11s and YOLO2026n pipelines running on NVIDIA Jetson Nano under a large-scale fault injection campaign targeting both lane-following and object detection tasks. Faults are synthesized using a decoupled framework that leverages large language models (LLMs) and latent diffusion models (LDMs), based on original data from our JetBot platform data collection. Results show that across both tasks and both models the inference engines keep GPU occupancy stable, temperature rise under control, and power consumption within safe limits, while memory usage settles into a consistent release pattern after the initial warm-up phase. Object detection tends to show somewhat more variability in memory and thermal behavior, yet both tasks point to the same conclusion: the TensorRT pipelines hold up well even when the input data is heavily degraded. These findings offer a hardware-level view of model reliability that sits alongside, rather than against, the broader body of work focused on inference performance at the edge.

2604.08426 2026-05-18 cs.LG cs.AI cs.CL 版本更新

KV Cache Offloading for Context-Intensive Tasks

KV缓存卸载用于上下文密集型任务

Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov

发表机构 * HSE； Yandex； NSU

AI总结本文研究了KV缓存卸载在上下文密集型任务中的应用，通过Text2JSON基准测试发现，该方法在Llama 3和Qwen 3模型上导致性能下降，分析指出低秩投影和不可靠地标是主要问题，并提出更简单的替代策略以提升准确性。

Comments Preprint

AI中文摘要

随着长上下文LLM在广泛应用中的需求增长，键值（KV）缓存已成为延迟和内存使用的关键瓶颈。最近，KV缓存卸载作为一种减少内存占用和推理延迟同时保持准确性的有前途的方法出现。先前的评估主要集中在不需要从上下文中提取大量信息的任务上。在本文中，我们研究了KV缓存卸载在上下文密集型任务中的应用：解决这些问题需要从输入提示中查找大量信息。我们创建并发布了Text2JSON基准测试，这是一个高度上下文密集型任务，需要从原始文本中提取结构化知识。我们评估了现代KV卸载在Text2JSON和其他上下文密集型任务上的表现，并发现Llama 3和Qwen 3模型上存在显著的性能下降。我们的分析确定了两个关键原因：键的低秩投影和不可靠的地标，并提出了一种更简单的替代策略，该策略在多个LLM家族和基准测试中显著提高了准确性。这些发现突显了对长上下文压缩技术进行全面和严格评估的必要性。

英文摘要

With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.

2603.29617 2026-05-18 q-bio.NC cs.AI cs.CL 版本更新

Convergent Representations of Linguistic Constructions in Human and Artificial Neural Systems

人类与人工神经系统中语言构式的趋同表征

Pegah Ramezani, Thomas Kinfe, Andreas Maier, Achim Schilling, Patrick Krauss

发表机构 * Department of English and American Studies, University Erlangen-Nuremberg（英语与美国研究系，埃尔朗根-纽伦堡大学）； Pattern Recognition Lab, University Erlangen-Nuremberg（模式识别实验室，埃尔朗根-纽伦堡大学）； Neuromodulation and Neuroprosthetics, University Hospital Mannheim, University Heidelberg（神经调控与神经假体，曼海姆大学医院，海德堡大学）； BGU Ludwigshafen, Germany（路德维希港BG创伤医院，德国）； Neuroscience Lab, University Hospital Erlangen（神经科学实验室，埃尔朗根大学医院）

AI总结研究通过EEG检验人类神经活动对语言构式的表征，发现句末alpha波段出现构式特异的神经特征，且与人工语言模型中的构式表征模式相似，支持语言构式作为形式-意义映射被神经编码的观点。

AI中文摘要

理解大脑如何处理语言构式是认知神经科学和语言学的核心挑战。最近的计算研究表明，人工神经语言模型会自发形成对论元结构构式（ASCs）的差异化表征，由此可对构式层面的信息在处理过程中何时、以何种方式出现作出预测。本研究利用脑电图（EEG）在人类神经活动中检验这些预测。十名英语母语者聆听了200个合成生成的句子，这些句子涵盖四种构式类型（及物、双及物、致使移动、结果），同时记录其神经反应。基于时频方法、特征提取和机器学习分类的分析显示，构式特异的神经特征主要出现在句末位置，即论元结构完全消歧之处，并且在alpha波段最为显著。成对分类显示出可靠的区分，尤其是在双及物构式与结果构式之间，而其他构式对则存在重叠。关键的是，这些效应出现的时间进程和相似性结构，与循环神经网络和基于Transformer的语言模型中的模式相吻合，在这些模型中，构式表征出现在整合性处理阶段。这些发现支持如下观点：语言构式在神经层面被编码为不同的形式-意义映射，这与构式语法一致，并表明生物系统与人工系统趋同于相似的表征解。更广泛地说，这种趋同与以下想法一致：学习系统会在底层表征空间（最近被称为柏拉图表征空间）中发现稳定区域，而该空间约束着高效语言抽象的出现。

英文摘要

Understanding how the brain processes linguistic constructions is a central challenge in cognitive neuroscience and linguistics. Recent computational studies show that artificial neural language models spontaneously develop differentiated representations of Argument Structure Constructions (ASCs), generating predictions about when and how construction-level information emerges during processing. The present study tests these predictions in human neural activity using electroencephalography (EEG). Ten native English speakers listened to 200 synthetically generated sentences across four construction types (transitive, ditransitive, caused-motion, resultative) while neural responses were recorded. Analyses using time-frequency methods, feature extraction, and machine learning classification revealed construction-specific neural signatures emerging primarily at sentence-final positions, where argument structure becomes fully disambiguated, and most prominently in the alpha band. Pairwise classification showed reliable differentiation, especially between ditransitive and resultative constructions, while other pairs overlapped. Crucially, the temporal emergence and similarity structure of these effects mirror patterns in recurrent and transformer-based language models, where constructional representations arise during integrative processing stages. These findings support the view that linguistic constructions are neurally encoded as distinct form-meaning mappings, in line with Construction Grammar, and suggest convergence between biological and artificial systems on similar representational solutions. More broadly, this convergence is consistent with the idea that learning systems discover stable regions within an underlying representational landscape - recently termed a Platonic representational space - that constrains the emergence of efficient linguistic abstractions.

2603.25099 2026-05-18 cs.CE cs.AI 版本更新

Large Language Models as Optimization Controllers: Adaptive Continuation for SIMP Topology Optimization

大语言模型作为优化控制器：SIMP拓扑优化的自适应延续

Shaoliang Yang, Jun Wang, Yunsheng Wang

发表机构 * Department of Mechanical Engineering, Santa Clara University（圣克拉拉大学机械工程系）

AI总结本文提出利用大语言模型作为SIMP拓扑优化的在线自适应控制器，通过实时状态条件参数决策替代传统固定调度延续方法，提升优化效果。

Comments 32 pages, 11 figures

AI中文摘要

我们提出一个框架，其中大语言模型（LLM）充当SIMP拓扑优化的在线自适应控制器，以实时的、以状态为条件的参数决策取代传统的固定调度延续方法。每隔k次迭代，LLM接收一份结构化观测（当前柔度、灰度指数、停滞计数器、棋盘格度量、体积分数和预算消耗），并通过直接数值控制（Direct Numeric Control）接口输出惩罚指数p、投影锐度β、滤波半径r_min和移动限δ的数值。硬性灰度门防止过早二值化，元优化循环使用第二次LLM调用，在多次运行之间调整智能体的调用频率和门阈值。我们将该智能体与四个基线（固定无延续、标准三场延续、专家启发式、仅调度消融）进行比较，测试问题包括三个120×60分辨率的二维问题（悬臂梁、MBB梁、L型支架）和两个40×20×10分辨率的三维问题（悬臂梁、MBB梁），均运行300次迭代。随后从最佳有效快照开始施加标准化的40次迭代锐化尾段，使柔度差异仅反映探索阶段。LLM智能体在每个基准上都取得了最低的最终柔度：相对于固定基线降低5.7%至18.1%，且所有解均完全二值化。仅调度消融在三个问题中的两个上表现不如固定基线，证实带来增益的是LLM的实时干预，而非调度的几何形状。代码和复现脚本将在论文发表时发布。

英文摘要

We present a framework in which a large language model (LLM) acts as an online adaptive controller for SIMP topology optimization, replacing conventional fixed-schedule continuation with real-time, state-conditioned parameter decisions. At every k-th iteration, the LLM receives a structured observation (current compliance, grayness index, stagnation counter, checkerboard measure, volume fraction, and budget consumption) and outputs numerical values for the penalization exponent p, projection sharpness β, filter radius r_min, and move limit δ via a Direct Numeric Control interface. A hard grayness gate prevents premature binarization, and a meta-optimization loop uses a second LLM pass to tune the agent's call frequency and gate threshold across runs. We benchmark the agent against four baselines (fixed with no continuation, standard three-field continuation, an expert heuristic, and a schedule-only ablation) on three 2-D problems (cantilever, MBB beam, L-bracket) at 120×60 resolution and two 3-D problems (cantilever, MBB beam) at 40×20×10 resolution, all run for 300 iterations. A standardized 40-iteration sharpening tail is applied from the best valid snapshot so that compliance differences reflect only the exploration phase. The LLM agent achieves the lowest final compliance on every benchmark: −5.7% to −18.1% relative to the fixed baseline, with all solutions fully binary. The schedule-only ablation underperforms the fixed baseline on two of three problems, confirming that the LLM's real-time intervention, not the schedule geometry, drives the gain. Code and reproduction scripts will be released upon publication.
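
One plausible reading of the grayness index and the hard gate can be sketched as follows. The abstract does not define either, so everything here is an assumption: the index uses the common measure of non-discreteness 4·mean(ρ(1−ρ)), the gate simply refuses to raise β while the design is still gray, and the parameter bounds, the threshold of 0.3, and the names `grayness`, `clamp`, and `apply_gate` are hypothetical.

```python
def grayness(densities):
    """Measure of non-discreteness: 0 for a pure 0/1 design, 1 when every element is 0.5."""
    return 4.0 * sum(r * (1.0 - r) for r in densities) / len(densities)


def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def apply_gate(proposal, current, gray, gate_threshold=0.3):
    """Validate a controller proposal and block sharpening while the design is still gray.

    `proposal` and `current` are dicts with keys p, beta, r_min, move (hypothetical layout).
    """
    safe = {
        "p": clamp(proposal["p"], 1.0, 5.0),          # penalization exponent
        "beta": clamp(proposal["beta"], 1.0, 64.0),   # projection sharpness
        "r_min": clamp(proposal["r_min"], 1.0, 4.0),  # filter radius
        "move": clamp(proposal["move"], 0.01, 0.5),   # move limit
    }
    if gray > gate_threshold:
        # Hard gate (assumed behavior): never increase beta while the design is gray.
        safe["beta"] = min(safe["beta"], current["beta"])
    return safe
```

Clamping every value before it reaches the optimizer reflects a general design choice for LLM-in-the-loop control: the model's numeric output is treated as a suggestion that is always checked against hard limits.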

2603.17915 2026-05-18 cs.CL cs.AI 版本更新

IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

IndicSafe：评估南亚多语言大语言模型安全性的基准

Priyaranjan Pattnayak, Sanchari Chowdhuri

发表机构 * Oracle America Inc.（Oracle美洲公司）

AI总结本文提出IndicSafe基准，评估12种南亚语言中LLM的安全性，发现跨语言一致性仅12.8%，安全率波动超17%，揭示多语言LLM安全泛化缺口。

AI中文摘要

随着大语言模型（LLM）被部署到多语言环境中，其在文化多样的低资源语言中的安全行为仍缺乏了解。我们首次系统评估了LLM在12种印度语言（Indic languages）上的安全性，这些语言的使用者超过12亿，但在LLM训练数据中代表性不足。我们使用包含6,000个植根于当地文化的提示的数据集，涵盖种姓、宗教、性别、健康和政治，在提示的翻译版本上评估了10个主流LLM。分析揭示了显著的安全漂移：跨语言一致率仅为12.8%，各语言间SAFE率的方差超过17%。一些模型在低资源文字中过度拒绝良性提示、对政治敏感话题过度标记，另一些模型则未能标记不安全的生成内容。我们使用提示级熵、类别偏差分数和多语言一致性指数来量化这些失败。我们的发现凸显了多语言LLM在安全泛化上的关键缺口，并表明安全对齐并不能在各语言之间均匀迁移。我们发布IndicSafe，这是首个支持面向印度语言部署进行文化知情安全评估的基准，并倡导立足于地区性危害的语言感知对齐策略。

英文摘要

As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompts. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts and over-flag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
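
Two of the quantities mentioned above can be illustrated with a short sketch. The paper's exact definitions are not given in the abstract, so this assumes the simplest readings: cross-language agreement as the fraction of prompts that receive the same label in every language, and prompt-level entropy as the Shannon entropy of one prompt's labels across languages. The function names and the label strings other than `SAFE` are hypothetical.

```python
import math
from collections import Counter


def prompt_entropy(labels):
    """Shannon entropy (bits) of the safety labels one prompt receives across languages."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cross_language_agreement(labels_by_prompt):
    """Fraction of prompts whose label is identical in every language."""
    agree = sum(1 for labels in labels_by_prompt if len(set(labels)) == 1)
    return agree / len(labels_by_prompt)
```

Under these assumptions a prompt labeled `SAFE` in half of the languages and `UNSAFE` in the other half has an entropy of exactly 1 bit, while a prompt with the same label everywhere has entropy 0 and counts toward agreement.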

2603.13452 2026-05-18 cs.AI cs.CY cs.LG 版本更新

MESD: A Risk-Sensitive Metric for Explanation Fairness Across Intersectional Subgroups

MESD：一种用于跨交集子群解释公平性的风险敏感度量

Gideon Popoola, John Sheppard

AI总结本文提出MESD，一种衡量不同交集子组解释质量差异的程序公平度量，结合标签感知聚合、经验贝叶斯收缩和CVaR加权，通过多目标优化框架UEF优化效用、结果公平和程序公平。

AI中文摘要

机器学习中的公平性主要通过面向结果的指标（如人口统计均等）来评估，这些指标衡量预测在受保护群体之间是否在统计上一致。然而，这些指标无法检测模型是否对不同人口群体使用了系统性不同的推理，而这违反了程序公平原则。交叉性（intersectionality）使这一问题更加严重：模型可能在单个属性（如种族）上显得公平，却在交集子群（如种族×性别）上表现出显著差异，这一现象被称为公平性选区操纵（fairness gerrymandering）。本文提出多类别解释稳定性差异（MESD），一种程序公平度量，用于量化由多个受保护属性的笛卡尔积形成的交集子群之间的解释质量差异。MESD整合了三个组件：与结果条件公平相一致的标签感知聚合、用于稳定小规模交集群体估计的经验贝叶斯收缩，以及用于强调最坏情况子群差异的条件风险价值（CVaR）加权。我们将MESD集成到一个多目标优化框架（UEF）中，该框架使用NSGA-II联合优化效用、结果公平和程序公平。我们在三个基准数据集上通过多组实验评估了MESD和UEF，并与四种最先进方法进行比较，结果表明MESD能够揭示仅凭结果指标无法发现的程序性差异。我们将本文的贡献置于程序正义理论之中，并讨论了其对监管合规和交叉公平的意义。

英文摘要

Fairness in machine learning is predominantly evaluated through outcome-oriented metrics, such as demographic parity, which measure whether predictions are statistically consistent across protected groups. However, these metrics cannot detect whether a model uses systematically different reasoning for different demographic groups, which violates procedural fairness principles. This problem is compounded by intersectionality, where models may appear fair on individual attributes (e.g., race) while exhibiting significant disparities for intersectional subgroups (e.g., race × gender), a phenomenon known as fairness gerrymandering. In this work, we introduce Multi-category Explanation Stability Disparity (MESD), a procedural fairness metric that quantifies disparities in explanation quality across intersectional subgroups formed by the Cartesian product of multiple protected attributes. MESD integrates three components, which are label-aware aggregation aligned with outcome-conditional fairness, empirical-Bayes shrinkage to stabilize estimates for small intersectional groups, and Conditional Value-at-Risk (CVaR) weighting to emphasize worst-case subgroup disparities. We integrate MESD within a multi-objective optimization framework (UEF) that jointly optimizes utility, outcome fairness, and procedural fairness using NSGA-II. We evaluated MESD and UEF on three benchmark datasets along with four state-of-the-art methods in several experiments, and we demonstrate that MESD reveals procedural disparities invisible to outcome metrics alone. We position our contribution within procedural justice theory and discuss implications for regulatory compliance and intersectional equity.
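
Two of the three MESD components named above, shrinkage and CVaR weighting, can be sketched in a few lines. This is a simplified illustration, not the paper's definition: shrinkage is modeled as a weighted pull toward the global mean with a hypothetical `prior_strength`, and CVaR is approximated as the mean of the worst (1 − α) fraction of subgroup disparities.

```python
import math


def shrink(group_mean, group_n, global_mean, prior_strength=10.0):
    """Empirical-Bayes style shrinkage: small groups are pulled toward the global mean."""
    w = group_n / (group_n + prior_strength)
    return w * group_mean + (1.0 - w) * global_mean


def cvar(values, alpha=0.8):
    """Mean of the worst (largest) (1 - alpha) fraction of values, at least one value."""
    k = max(1, math.ceil((1.0 - alpha) * len(values)))
    worst = sorted(values, reverse=True)[:k]
    return sum(worst) / k
```

Averaging only the tail means a single badly served subgroup is not diluted by many well-served ones, which is the stated motivation for emphasizing worst-case disparities; shrinkage keeps a tiny subgroup with a noisy estimate from dominating that tail.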

2603.01290 2026-05-18 cs.AI cs.GT cs.LG cs.SY eess.SY 版本更新

Opponent State Inference Under Partial Observability: An HMM-POMDP Framework for 2026 Formula 1 Energy Strategy

部分可观测条件下的对手状态推断：一种用于2026年F1能量策略的HMM-POMDP框架

Kalliopi Kleisarchaki

发表机构 * Independent Researcher（独立研究者）

AI总结本文提出HMM-POMDP框架用于2026F1能源策略，通过HMM推断对手状态并利用DQN决策，解决部分可观测博弈问题，检测反收割陷阱。

Comments 17 pages. v3: editorial corrections and bibliographic updates. Pre-registered theoretical framework; empirical calibration on 2026 race telemetry from Australian Grand Prix (8 March 2026) onwards

AI中文摘要

2026年F1技术规则从根本上改变了能量策略：在内燃机与电池动力各占50%、能量回收不设上限、并由车手控制Override模式的条件下，最优能量部署策略不仅取决于车手自身的状态，还取决于对手赛车的隐藏状态。这构成了一个部分可观测随机博弈，无法用单智能体优化方法求解。本文提出一个可处理的两层推断与决策框架。第一层是一个40状态的隐马尔可夫模型（HMM），它根据六个公开可观测的遥测信号，推断每个对手的ERS电量水平（四种模式：H、M、L_harvest、L_derate）、Override模式状态和轮胎退化状态上的概率分布。第二层是一个深度Q网络（DQN）策略，以HMM信念状态为输入，在多种能量部署策略之间进行选择。我们形式化地刻画了反收割陷阱（counter-harvest trap），这是一种欺骗性策略：赛车故意压低可观测的部署信号，诱使对手发起一次失败的进攻；我们还证明，检测这种陷阱需要同时对ERS水平和harvest/derate子模式进行信念状态推断。在合成比赛上，HMM的ERS水平准确率达到96.8%（随机基线为25%），区分L_harvest与L_derate的准确率为89.4%，检测反收割陷阱条件的召回率为96.3%。赛季前分析表明，与赛道相关的充电可用性（每圈1.0x到2.2x）是主要混杂因素；墨尔本是最困难的验证环境。基于2026年比赛遥测数据的Baum-Welch校准自澳大利亚大奖赛（2026年3月8日）开始。

英文摘要

The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode, the optimal energy deployment policy depends not only on a driver's own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 40-state Hidden Markov Model (HMM) that infers a probability distribution over each rival's ERS charge level (four modes: H, M, L_harvest, L_derate), Override Mode status, and tyre degradation state from six publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap, a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack, and show that detecting it requires belief-state inference over both ERS level and the harvest/derate sub-mode. On synthetic races, the HMM achieves 96.8% ERS-level accuracy (random baseline 25%), classifies L_harvest vs. L_derate with 89.4% accuracy, and detects counter-harvest trap conditions with 96.3% recall. Pre-season analysis indicates circuit-dependent recharge availability (1.0x to 2.2x per lap) as the primary confound; Melbourne is the hardest-case validation environment. Baum-Welch calibration on 2026 race telemetry begins with the Australian Grand Prix (8 March 2026).
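
The belief state that the first layer passes to the policy can be illustrated with a generic HMM filtering step. This is the textbook forward update, not the author's 40-state model: the two-state example values are invented for illustration, and `forward_step` is a hypothetical name.

```python
def forward_step(belief, transition, likelihood):
    """One HMM filtering step.

    belief[i]        : current probability of hidden state i
    transition[i][j] : probability of moving from state i to state j
    likelihood[j]    : probability of the new observation given state j
    """
    n = len(belief)
    # Predict: push the belief through the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correct: reweight by how well each state explains the observation.
    unnorm = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("observation has zero likelihood under every state")
    return [u / total for u in unnorm]


# Toy example with two hidden states (say, rival charge "high" vs "low").
new_belief = forward_step(
    belief=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.2, 0.8]],
    likelihood=[0.2, 0.6],
)
```

The output is a normalized probability vector over hidden states, which is the kind of input a downstream policy such as the DQN described above would consume instead of a single hard guess about the rival's state.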

2602.23410 2026-05-18 cs.LG cs.AI eess.SP q-bio.NC 版本更新

Brain-OF: An Omnifunctional Foundation Model for fMRI, EEG and MEG

Brain-OF：一种适用于fMRI、EEG和MEG的多功能基础模型

Hanning Guo, Hanwen Bi, Farah Abdellatif, Andrei Galbenus, Jon. N. Shah, Abigail Morrison, Jürgen Dammers

发表机构 * INM-4, Forschungszentrum Jülich, Germany（Jülich 研究中心 INM-4 实验室，德国）； Department of Computer Science（计算机科学系）； Software Engineering, RWTH Aachen University, Germany（软件工程，亚琛工业大学，德国）； INM-7, Forschungszentrum Jülich, Germany（Jülich 研究中心 INM-7 实验室，德国）； Institute of Systems Neuroscience, Heinrich Heine University, Germany（系统神经科学研究所，海因里希·海涅大学，德国）； Department of Neurology, RWTH Aachen University, Germany（神经病学系，亚琛工业大学，德国）； JARA-BRAIN-Translational Medicine, Germany（JARA-BRAIN 转化医学，德国）； INM–11, JARA, Forschungszentrum Jülich, Germany（JARA-INM-11 实验室，Jülich 研究中心，德国）； IAS-6, Forschungszentrum Jülich, Germany（IAS-6 实验室，Jülich 研究中心，德国）； Department of Psychiatry, Psychotherapy and Psychosomatics, RWTH Aachen University, Germany（精神病学、心理治疗与心身医学系，亚琛工业大学，德国）

AI总结 Brain-OF通过联合预训练fMRI、EEG和MEG数据，解决多模态数据语义异质性和分辨率差异问题，提升跨模态数据处理能力。

AI中文摘要

脑基础模型在多种神经科学任务中取得了显著进展。然而，现有模型多局限于单一功能模态，限制了其利用互补的时空动态和不同神经成像技术的集体数据规模的能力。这一限制主要源于模态间的严重语义异质性和分辨率差异。为解决这些问题，我们提出了Brain-OF，一种联合预训练fMRI、EEG和MEG的多功能脑基础模型，能够在统一框架内处理单模态和多模态输入。为协调异构的时空分辨率，我们引入了Any-Resolution神经信号采样器，将多样化的脑信号投影到共享的语义空间。为进一步管理语义偏移，Brain-OF的主干整合了DINT注意力与稀疏专家混合模型，其中共享专家捕捉模态不变的表示，路由专家专注于模态特定的语义。此外，为了通过自监督学习显式内化神经活动的特征，我们提出了Masked Temporal-Frequency Modeling，一种双域预训练目标，联合重建时间和频率域中的脑信号。Brain-OF在包含约40个数据集的大型语料库上进行预训练，并在多样化的下游任务中表现出色，突显了联合多模态集成和双域预训练的优势。

英文摘要

Brain foundation models have achieved remarkable advances across a wide range of neuroscience tasks. However, most existing models are limited to a single functional modality, restricting their ability to exploit complementary spatiotemporal dynamics and the collective data scale across different neuroimaging techniques. This limitation largely arises from severe semantic heterogeneity and resolution discrepancies among modalities. To address these challenges, we propose Brain-OF, an omnifunctional brain foundation model jointly pretrained on fMRI, EEG and MEG, capable of handling both unimodal and multimodal inputs within a unified framework. To reconcile heterogeneous spatiotemporal resolutions, we introduce the Any-Resolution Neural Signal Sampler, which projects diverse brain signals into a shared semantic space. To further manage semantic shifts, the Brain-OF backbone integrates DINT attention with a Sparse Mixture of Experts, where shared experts capture modality-invariant representations and routed experts specialize in modality-specific semantics. Furthermore, to explicitly internalize the characteristics of neural activity through self-supervised learning, we propose Masked Temporal-Frequency Modeling, a dual-domain pretraining objective that jointly reconstructs brain signals in both the time and frequency domains. Brain-OF is pretrained on a large-scale corpus comprising around 40 datasets and demonstrates superior performance across diverse downstream tasks, highlighting the benefits of joint multimodal integration and dual-domain pretraining.

2602.04003 2026-05-18 cs.AI 版本更新

When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

当AI说服人：对抗性解释攻击对人类信任AI辅助决策的影响

Shutong Fan, Lan Zhang, Xiaoyong Yuan

发表机构 * Clemson University（克莱姆森大学）

AI总结本文研究了对抗性解释攻击如何通过操控LLM生成的解释框架，影响人类对AI输出的信任，揭示了认知层的新型安全风险。

AI中文摘要

大多数对抗性威胁针对AI模型的计算行为，而非依赖它们的人类。然而，现代AI系统越来越多地在人类决策循环中运行，用户根据模型推荐进行解释和行动。大型语言模型（LLMs）生成流畅的自然语言解释，影响用户对AI输出的认知和信任，揭示了AI与用户之间的沟通渠道这一新攻击面。我们引入对抗性解释攻击（AEAs），攻击者操控LLM生成的解释框架以调节人类对错误输出的信任。我们通过信任失调差距这一指标，正式化这一行为威胁，该指标捕捉了良性与对抗性解释之间人类信任的差异。通过这一指标，我们强调了说服性解释框架可能在AI预测错误时仍能保持用户信任的行为风险。为了表征这一威胁，我们进行了包含超过200名参与者的实验，系统地变化解释框架的四个维度：推理模式、证据类型、沟通风格和呈现格式。我们的发现显示，用户对对抗性和良性解释的信任几乎相同，对抗性解释尽管错误，却保留了大多数良性信任。最脆弱的情况出现在AEAs接近专家沟通时，结合权威证据、中性语气和领域合适的推理。脆弱性最高出现在困难任务、事实驱动领域以及受教育程度较低、年轻或高度信任AI的参与者中。

英文摘要

Most adversarial threats in artificial intelligence (AI) target the computational behavior of models rather than the humans who rely on them. Yet modern AI systems increasingly operate within human decision loops, where users interpret and act on model recommendations. Large Language Models (LLMs) generate fluent natural-language explanations that shape how users perceive and trust AI outputs, revealing a new attack surface at the cognitive layer: the communication channel between AI and its users. We introduce adversarial explanation attacks (AEAs), where an attacker manipulates the framing of LLM-generated explanations to modulate human trust in incorrect outputs. We formalize this behavioral threat through the trust miscalibration gap, a metric that captures the difference in human trust between benign and adversarial explanations. Using this metric as a lens, we highlight a behavioral risk where persuasive explanation framing can preserve user trust even when the underlying AI prediction is wrong. To characterize this threat, we conducted a human study with over 200 participants, systematically varying four dimensions of explanation framing: reasoning mode, evidence type, communication style, and presentation format. Our findings show that users report nearly identical trust for adversarial and benign explanations, with adversarial explanations preserving the vast majority of benign trust despite being incorrect. The most vulnerable cases arise when AEAs closely resemble expert communication, combining authoritative evidence, neutral tone, and domain-appropriate reasoning. Vulnerability is highest on hard tasks, in fact-driven domains, and among participants who are less formally educated, younger, or highly trusting of AI.

2602.01970 2026-05-18 cs.AI cs.LG 版本更新

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

小规模可泛化提示预测模型可引导大推理模型的高效强化学习后训练

Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji

发表机构 * Department of Automation, Tsinghua University, Beijing, China（自动化系，清华大学，北京，中国）； LLM Department, Tencent, Beijing, China（大模型部门，腾讯，北京，中国）

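AI summary: This paper proposes GPS, which performs Bayesian inference over prompt difficulty with a lightweight generative model and combines intermediate-difficulty prioritization with history-anchored diversity, improving training efficiency and test-time efficiency in RL post-training of large reasoning models.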

AI中文摘要

强化学习能增强大语言模型的推理能力，但通常因滚动优化而产生高计算成本。在线提示选择通过优先选择信息性提示来提高训练效率。然而，现有方法要么依赖昂贵的精确评估，要么构建缺乏跨提示泛化的提示特定预测模型。本研究引入可泛化的提示选择（GPS），通过轻量级生成模型对提示难度进行贝叶斯推断，利用共享优化历史训练。中间难度优先和历史锚定多样性被纳入批量获取原则中以选择信息性提示批次。小规模预测模型在测试时也具备泛化能力，以实现高效的计算分配。在各种推理基准上的实验表明，GPS在训练效率、最终性能和测试效率上显著优于更优的基线方法。

英文摘要

Reinforcement learning enhances the reasoning capabilities of large language models but often involves high computational costs due to rollout-intensive optimization. Online prompt selection presents a plausible solution by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly, exact evaluations or construct prompt-specific predictive models lacking generalization across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference towards prompt difficulty using a lightweight generative model trained on the shared optimization history. Intermediate-difficulty prioritization and history-anchored diversity are incorporated into the batch acquisition principle to select informative prompt batches. The small predictive model also generalizes at test-time for efficient computational allocation. Experiments across varied reasoning benchmarks indicate GPS's substantial improvements in training efficiency, final performance, and test-time efficiency over superior baseline methods.
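
The intermediate-difficulty prioritization mentioned above can be sketched as follows. This is only an illustration under loud assumptions: it swaps GPS's shared lightweight generative predictor for an independent Beta-Bernoulli posterior per prompt (the kind of prompt-specific model the abstract says GPS improves on), and it omits history-anchored diversity entirely. All names and the rollout history are hypothetical.

```python
def posterior_mean(successes, failures, alpha=1.0, beta=1.0):
    """Posterior mean success rate under a Beta(alpha, beta) prior."""
    return (successes + alpha) / (successes + failures + alpha + beta)

def select_intermediate(history, batch_size, target=0.5):
    """Pick the prompts whose predicted success rate is closest to `target`."""
    ranked = sorted(
        history,
        key=lambda pid: abs(posterior_mean(*history[pid]) - target),
    )
    return ranked[:batch_size]

# Hypothetical rollout history: prompt id -> (successes, failures).
history = {
    "easy": (9, 1),
    "medium": (5, 5),
    "hard": (0, 10),
    "fresh": (0, 0),
    "lean_easy": (6, 2),
}

batch = select_intermediate(history, batch_size=2)
print(batch)
```

Prompts that are nearly always solved or nearly always failed carry little learning signal, so the sketch ranks them last; an unseen prompt falls back to the prior mean of 0.5 and is treated as worth trying.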

2602.01167 2026-05-18 cs.AI 版本更新

Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models

所有个体层都有帮助吗？视觉-语言模型中任务干扰层的实证研究

Zhiming Liu, Yujie Wei, Lei Feng, Xiu Su, Xiaobo Xia, Weili Guan, Zeke Xie, Shuo Yang

Affiliations: Harbin Institute of Technology, Shenzhen; Harbin Institute of Technology; Southeast University; Central South University; National University of Singapore; The Hong Kong University of Science and Technology, Guangzhou

AI总结研究通过层干预发现部分层阻碍下游任务，提出任务自适应层剔除方法提升性能，揭示预训练VLM的意外模块化特性。

AI中文摘要

当前VLM在多种多模态任务中表现出色，但默认启用所有层可能阻碍任务表现。通过干预单层参数发现，某些层反而抑制任务性能。系统研究各层对不同任务的影响，提出任务-层交互向量量化方法，并引入无需训练的测试时适应方法TaLo，动态剔除最干扰的层，提升模型在多个任务和数据集上的性能，包括提升Qwen-VL在ScienceQA地图任务上的准确率。

英文摘要

Current VLMs have demonstrated capabilities across a wide range of multimodal tasks. Typically, in a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve the performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. We systematically investigate how individual layers influence different tasks via layer intervention. Specifically, we measure the change in performance relative to the base model after intervening on each layer and observe improvements when bypassing specific layers. This improvement can be generalizable across models and datasets, indicating the presence of Task-Interfering Layers that harm downstream tasks' performance. We introduce Task-Layer Interaction Vector, which quantifies the effect of intervening on each layer of a VLM given a task. These task-interfering layers exhibit task-specific sensitivity patterns: tasks requiring similar capabilities show consistent response trends under layer interventions, as evidenced by the high similarity in their task-layer interaction vectors. Inspired by these findings, we propose TaLo (Task-Adaptive Layer Knockout), a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task. Without parameter updates, TaLo improves performance across various models and datasets, including boosting Qwen-VL's accuracy on the Maps task in ScienceQA by up to 16.6%. Our work reveals an unexpected form of modularity in pretrained VLMs and provides a plug-and-play, training-free mechanism to unlock hidden capabilities at inference time. The source code will be publicly available.
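
A minimal sketch of the task-layer interaction vector and the knockout choice described above, assuming the vector is simply the per-layer accuracy change relative to the base model and that the "most interfering" layer is the one whose bypass helps most. The accuracies are hypothetical placeholders, not results from the paper, and the helper names are illustrative.

```python
import math

def interaction_vector(base_acc, knockout_acc):
    """Per-layer change in task accuracy when that layer is bypassed."""
    return [acc - base_acc for acc in knockout_acc]

def most_interfering_layer(vector):
    """Index of the layer whose removal helps most, or None if none helps."""
    best = max(range(len(vector)), key=lambda i: vector[i])
    return best if vector[best] > 0 else None

def cosine(u, v):
    """Cosine similarity between two interaction vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical accuracies for a 4-layer model on two related tasks.
maps_vec = interaction_vector(0.60, [0.58, 0.71, 0.55, 0.61])
charts_vec = interaction_vector(0.70, [0.69, 0.78, 0.64, 0.70])

layer = most_interfering_layer(maps_vec)
similarity = cosine(maps_vec, charts_vec)
print(layer, round(similarity, 3))
```

A high cosine similarity between two tasks' vectors corresponds to the abstract's observation that tasks needing similar capabilities respond alike to layer interventions.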

2601.23068 2026-05-18 cs.LG cs.AI 版本更新

ExplainerPFN: Towards tabular foundation models for model-free zero-shot feature importance estimations

ExplainerPFN：迈向无模型零样本特征重要性估计的表格基础模型

Joao Fonseca, Julia Stoyanovich

发表机构 * INESC-ID ； New York University（纽约大学）

AI总结本文提出ExplainerPFN，一种基于TabPFN的表格基础模型，通过预训练合成结构因果数据实现无模型零样本特征重要性估计，展示了其在真实和合成数据集上的竞争力。

Comments 35 pages, 11 figures

AI中文摘要

在监督分类任务中计算特征重要性对模型可解释性至关重要。Shapley值是解释模型预测的常用方法，但需要直接访问底层模型，这一假设在现实部署中常被违反。我们探讨在零样本设置下是否能仅通过输入数据分布和不评估目标模型来获得有意义的特征归因。由于多个模型可能产生相同预测但产生不同Shapley分解，数据到归因的映射并非唯一可识别。因此，我们针对“真实数据”而非“真实模型”学习后验均值归因，基于元训练先验。为此，我们引入ExplainerPFN，一种基于TabPFN的表格基础模型，预训练于合成结构因果数据，通过精确或近精确的Shapley值监督，可预测未见过的表格数据集的特征归因，而无需模型访问、梯度或示例解释。我们的贡献包括：（1）展示少量样本替代解释器在仅使用两个参考观测时可实现高SHAP保真度；（2）提出ExplainerPFN，首个无需访问底层模型或参考解释的零样本方法，提供无现有解释器可应用的归因；（3）发布开源实现，包括完整训练流程和合成数据生成器；（4）通过大量真实和合成数据集实验，展示ExplainerPFN在性能上可与依赖2-10个SHAP示例的少量样本替代解释器竞争。

英文摘要

Computing feature importance in supervised classification tasks is essential for model interpretability. Shapley values are a common way to explain model predictions, but they require direct access to the underlying model, an assumption that is often violated in real-world deployments. We ask whether meaningful feature attributions can be obtained in a zero-shot setting, from the input data distribution alone and without evaluating the target model. Because multiple models can produce the same predictions yet yield different Shapley decompositions, the mapping from data to attributions is not uniquely identifiable. We therefore learn posterior-mean attributions under a meta-trained prior, targeting attributions that are true to the data rather than true to the model. To this end, we introduce ExplainerPFN, a tabular foundation model built on TabPFN and pretrained on synthetic structural causal data with supervision from exact or near-exact Shapley values; it predicts feature attributions for unseen tabular datasets without model access, gradients, or example explanations. Our contributions are: (1) we show that few-shot surrogate explainers can achieve high SHAP fidelity with as few as two reference observations; (2) we propose ExplainerPFN, the first zero-shot method that requires no access to the underlying model or to reference explanations, providing attributions where no existing explainer can be applied; (3) we release an open-source implementation, including the full training pipeline and the synthetic data generator; and (4) through extensive experiments on real and synthetic datasets, we show that ExplainerPFN is competitive with few-shot surrogate explainers that rely on 2-10 SHAP examples.
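
For readers unfamiliar with the Shapley values that this entry's summary and abstract refer to, here is a minimal sketch of exact Shapley computation by coalition enumeration. It illustrates the quantity that needs model access (a value function must be evaluated on every coalition), which is the requirement ExplainerPFN is described as avoiding; it is not ExplainerPFN itself. The toy value function is a hypothetical stand-in for a model.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions (small n only)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

def value_fn(coalition):
    """Toy payoff: feature 0 adds 2, feature 1 adds 1, together they add 1 more."""
    total = 0.0
    if 0 in coalition:
        total += 2.0
    if 1 in coalition:
        total += 1.0
    if 0 in coalition and 1 in coalition:
        total += 1.0
    return total

phi = shapley_values(value_fn, 3)
print([round(p, 3) for p in phi])
```

The interaction bonus is split equally between features 0 and 1, the unused feature 2 receives zero, and the attributions sum to the payoff of the full coalition.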

2510.13842 2026-05-18 cs.CL cs.AI cs.CR 版本更新

ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

ADMIT: RAG基事实核查中的少样本知识污染攻击

Yutao Wu, Xiao Liu, Yinghui Li, Yifeng Gao, Yifan Ding, Jiale Ding, Xiang Zheng, Xingjun Ma

发表机构 * Deakin University（德金大学）； Fudan University（复旦大学）； City University of Hong Kong（香港城市大学）

AI总结 ADMIT提出一种无需访问目标模型的少样本攻击方法，通过注入真实证据来翻转事实核查决策，实验显示其在多种系统中成功率达86%，揭示了RAG事实核查系统的重大漏洞。

2510.03161 2026-05-18 cs.CV cs.AI 版本更新

UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization

UniShield: 一种适应性多智能体框架用于统一的伪造图像检测与定位

Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang

发表机构 * School of Electronic and Computer Engineering, Peking University（北京大学电子与计算机工程学院）； School of Future Technology, South China University of Technology（华南理工大学未来技术学院）； School of Electronic and Information Engineering, South China University of Technology（华南理工大学电子与信息工程学院）； Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University（北京大学深圳研究生院超高清沉浸媒体技术省重点实验室）

AI总结 UniShield通过多智能体框架实现跨领域伪造图像检测与定位，提升检测的适应性和实用性。

2510.02453 2026-05-18 cs.LG cs.AI cs.CL 版本更新

How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

如何训练你的导师：通过导师模型引导黑盒大语言模型

Parth Asawa, Alan Zhu, Abigail O'Neill, Matei Zaharia, Alexandros G. Dimakis, Joseph E. Gonzalez

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Bespoke Labs（Bespoke实验室）

AI总结本文提出Advisor Models，通过训练小型开放权重模型生成动态个性化建议，提升黑盒前沿模型性能，实验显示在多个任务中效果显著，且具有良好的迁移性和鲁棒性。

Comments International Conference on Machine Learning (ICML) 2026

AI中文摘要

前沿语言模型作为黑盒服务部署，其权重无法修改，定制仅限于提示。我们引入Advisor Models，一种方法通过训练小型开放权重模型生成动态、实例特定的自然语言建议，以提升黑盒前沿模型的能力。Advisor Models将GPT-5.2在RuleArena（税务）任务上的性能提升27.4%，减少Gemini 3 Pro在SWE代理任务中的步骤24.6%，并在个性化GPT-5到用户偏好方面优于静态提示优化器（85-100% vs. 40-60%）。我们还发现顾问具有可迁移性：用低成本学生模型训练的顾问仍能将改进转移到前沿模型。此外，Advisor Models具有鲁棒性：在其他基准测试中未观察到降级，除了训练管道所训练的基准测试。我们的方法展示了如何以实用且经济有效的方式对黑盒前沿模型进行参数优化。

英文摘要

Frontier language models are deployed as black-box services, where model weights cannot be modified and customization is limited to prompting. We introduce Advisor Models, a method to train small open-weight models to generate dynamic, per-instance natural language advice that improves the capabilities of black-box frontier models. Advisor Models improve GPT-5.2's performance on RuleArena (Taxes) by 27.4%, reduce Gemini 3 Pro's steps taken in SWE agent tasks by 24.6%, and outperform static prompt optimizers in personalizing GPT-5 to user preferences (85-100% vs. 40-60%). We also find that advisors are transferable: an advisor trained with a low-cost student model still transfers improvements to a frontier model. Moreover, Advisor Models are robust: we observe no degradation on benchmarks other than the one the pipeline is trained on. Our method shows how to perform parametric optimization for black-box frontier models in a practical and cost-effective way.

2510.01632 2026-05-18 q-bio.BM cs.AI 版本更新

BioBlobs: Unsupervised Discovery of Functional Substructures for Protein Function Prediction

BioBlobs：无监督发现蛋白质功能预测的的功能子结构

Xin Wang, Kaiwen Shi, Carlos Oliver

发表机构 * Vanderbilt University（范德比大学）； Yale University（耶鲁大学）

AI总结 BioBlobs通过无监督方法发现蛋白质的功能子结构，利用端到端可微分框架压缩蛋白质为少量连贯子结构并预测功能，实现了对功能区域的候选识别。

AI中文摘要

蛋白质功能由如催化三元组、结合口袋和结构模体等紧密子结构驱动，这些子结构仅占据蛋白质残基的小部分。然而，现有基于蛋白质编码器的流程并未在子结构层面建模，未能回答核心生物学问题：蛋白质的哪一部分负责其功能？我们引入了BioBlobs，一种编码器无关、端到端可微分的框架，能够将蛋白质压缩为少量连贯的子结构（blobs），并仅基于这些blobs预测功能，使得每个blob对应一个候选功能区域。在多样化的蛋白质功能预测任务和多种基于序列和结构的编码器上，BioBlobs在仅使用少量残基的情况下，匹配或超过了强大的基线模型。发现的blobs会根据任务调整其空间尺度，从局部催化位点到整个结构域。仅在蛋白质层面标签上训练，BioBlobs能够恢复M-CSA数据库中实验注释的催化位点，证明了无监督的功能子结构发现，并为未注释的整个蛋白质组的规模化功能位点发现开辟了道路。

英文摘要

Protein function is driven by cohesive substructures, such as catalytic triads, binding pockets, and structural motifs, that occupy only a small fraction of a protein's residues. Yet existing pipelines built on protein encoders do not model proteins at the substructure level, leaving the central biological question unanswered: which substructure of a protein is responsible for its function? We introduce BioBlobs, an encoder-agnostic, end-to-end differentiable framework that compresses a protein into a small set of cohesive substructures (blobs) and predicts function from these blobs alone, so that each blob corresponds to a candidate functional region. Across diverse protein function prediction tasks and multiple sequence- and structure-based encoders, BioBlobs matches or exceeds strong baselines while operating on only a small fraction of residues. The discovered blobs adapt their spatial scale to the task, ranging from local catalytic sites to entire structural domains. Trained only on protein-level labels, BioBlobs recovers experimentally annotated catalytic sites in the M-CSA database, demonstrating unsupervised functional substructure discovery and opening a path to large-scale functional site discovery across the unannotated proteome.

2509.23352 2026-05-18 cs.CV cs.AI 版本更新

Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling

动态树RPO：通过结构化采样打破独立轨迹瓶颈

Xiaolong Fu, Lichen Ma, Zipeng Guo, ShiPing Dong, Lan Yang, Tan Lit Sin, Gaojing Zhou, Yu He, Jingling Fu, Shizhe Zhou, Junshi Huang, Jason Li

发表机构 * Sun Yat-sen University（中山大学）； Tsinghua University（清华大学）； Beijing University of Chemical Technology（北京化工大学）

AI总结本文提出动态树RPO，通过树状结构采样策略和动态噪声强度，提升文本到图像生成的质量与效率，同时结合层调优强化学习方法，在多个基准测试中表现出色。

Comments Fig.3 updated

AI中文摘要

将强化学习（RL）整合到流匹配模型中，推动了文本到图像（T2I）生成的质量提升。然而，这些进步往往以大量探索和低效采样策略为代价，由于采样组的微小变化。基于这一见解，我们提出了动态树RPO，实现了滑动窗口采样策略作为树状结构搜索，具有沿深度动态噪声强度。我们在此树结构中执行GRPO引导优化和受约束的随机微分方程（SDE）采样。通过共享树的前缀路径，我们的设计有效缓解了轨迹搜索的计算开销。通过为每个树层设计良好的噪声强度，动态树RPO可以在不增加额外计算成本的情况下增强探索的多样性。此外，我们无缝整合监督微调（SFT）和RL范式，构建我们的提议层调优RL，将SFT的损失函数重新表述为动态加权进展奖励模型（PRM），而不是单独的预训练方法。通过将此加权PRM与动态自适应剪裁边界关联，避免了动态树RPO中探索过程的干扰。得益于树状结构采样和层调优RL范式，我们的模型在有效方向上动态探索多样化的搜索空间。与现有基线相比，我们的方法在语义一致性、视觉保真度和人类偏好对齐方面在已建立的基准测试中表现出显著优势，包括HPS-v2.1、PickScore和ImageReward。特别是，我们的模型在这些基准测试中分别优于SoTA by 4.9%、5.91%和8.66%，同时将训练效率提高了近50%。

英文摘要

The integration of Reinforcement Learning (RL) into flow matching models for text-to-image (T2I) generation has driven substantial advances in generation quality. However, these gains often come at the cost of exhaustive exploration and inefficient sampling strategies due to slight variation in the sampling group. Building on this insight, we propose Dynamic-TreeRPO, which implements the sliding-window sampling strategy as a tree-structured search with dynamic noise intensities along depth. We perform GRPO-guided optimization and constrained Stochastic Differential Equation (SDE) sampling within this tree structure. By sharing prefix paths of the tree, our design effectively amortizes the computational overhead of trajectory search. With well-designed noise intensities for each tree layer, Dynamic-TreeRPO can enhance the variation of exploration without any extra computational cost. Furthermore, we seamlessly integrate Supervised Fine-Tuning (SFT) and RL paradigm within Dynamic-TreeRPO to construct our proposed LayerTuning-RL, reformulating the loss function of SFT as a dynamically weighted Progress Reward Model (PRM) rather than a separate pretraining method. By associating this weighted PRM with dynamic-adaptive clipping bounds, the disruption of exploration process in Dynamic-TreeRPO is avoided. Benefiting from the tree-structured sampling and the LayerTuning-RL paradigm, our model dynamically explores a diverse search space along effective directions. Compared to existing baselines, our approach demonstrates significant superiority in terms of semantic consistency, visual fidelity, and human preference alignment on established benchmarks, including HPS-v2.1, PickScore, and ImageReward. In particular, our model outperforms SoTA by $4.9\%$, $5.91\%$, and $8.66\%$ on those benchmarks, respectively, while improving the training efficiency by nearly $50\%$.

2509.22739 2026-05-18 cs.CL cs.AI cs.LG stat.ML 版本更新

Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models

无痛激活导向：一种自动化、轻量级的微调大型语言模型方法

Sasha Cui, Zhongren Chen

发表机构 * Yale University（耶鲁大学）

AI总结本文提出Painless Activation Steering，一种自动化方法，无需人工干预即可利用标注数据提升模型性能，尤其在行为任务中表现优异，但对智能任务效果有限。

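AI abstract

Language models are usually post-trained through weight-based or prompt-based steering, but the former is time-consuming and expensive, while the latter offers imprecise control and requires manual trial and error. Activation steering (AS) offers a cheaper, faster, and more controllable alternative, but existing techniques need hand-crafted prompt pairs or extensive feature annotation, which makes them less convenient than methods such as RL and SFT. This paper introduces Painless Activation Steering (PAS), a fully automated approach that applies AS with any labeled dataset, with no prompt construction, feature labeling, or human intervention. Evaluated on three open-weight models and 18 tasks, PAS reliably improves performance on behavior tasks but not on intelligence-oriented tasks. The introspective variant (iPAS) delivers the strongest effects: 10.1% on Bias, 5.2% on Morality, and 34.8% on Alignment. PAS also provides additional gains on top of in-context learning (ICL) and SFT. PAS builds a fast, lightweight activation vector that is cheap to train, store, and activate. The results characterize where AS helps and where it fails, and show its potential as a practical, automated post-training option.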

英文摘要

Language models (LMs) are typically post-trained for desired capabilities and behaviors via weight-based or prompt-based steering, but the former is time-consuming and expensive, and the latter is not precisely controllable and often requires manual trial-and-error. While activation steering (AS) promises a cheap, fast, and controllable alternative to the two existing post-training methods, current AS techniques require hand-crafted prompt pairs or labor-intensive feature annotation, making them more inconvenient than the plug-and-play methods such as Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). We introduce Painless Activation Steering (PAS), a family of fully automated methods that make AS readily usable with any given labeled dataset, with no need for prompt construction, feature labeling, or human intervention. We evaluate PAS on three open-weight models (Llama3.1-8B-Instruct, DeepSeek-R1-Distill-8B, and Nous-Hermes-2) and 18 tasks; we find that PAS reliably improves performance for behavior tasks, but not for intelligence-oriented tasks. The introspective variant (iPAS) delivers the strongest causal steering effects (10.1% on Bias, 5.2% on Morality, and 34.8% on Alignment). We also show PAS delivers additional gains on top of In-Context Learning (ICL) and SFT. PAS constructs a fast, lightweight activation vector that can be cheaply trained, easily stored, and activated at will. Our results provide a characterization of where AS helps, where it fails, and how to deploy it as a practical, automated LM post-training option.
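
The abstract does not say how PAS derives its activation vector from a labeled dataset. As a rough illustration of activation steering in general, the sketch below uses the common difference-of-means construction, which is an assumption here and not necessarily what PAS does. Activations are plain Python lists standing in for hidden states; all names and numbers are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(pos_acts, neg_acts):
    """Mean activation on desired examples minus mean on undesired ones."""
    pos_mean, neg_mean = mean_vector(pos_acts), mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]

def steer(hidden, vector, strength=1.0):
    """Add the scaled steering vector to one hidden state at inference time."""
    return [h + strength * v for h, v in zip(hidden, vector)]

# Hypothetical 3-dimensional activations grouped by dataset label.
pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neg = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

vec = steering_vector(pos, neg)
steered = steer([0.5, 0.5, 0.5], vec, strength=0.5)
print(vec, steered)
```

The vector is small enough to store next to the model and can be switched on or off by changing `strength`, which matches the abstract's description of a vector that is cheap to store and activate at will.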

2509.21663 2026-05-18 cs.LG cs.AI cs.LO 版本更新

CIS-BWE: Chaos-Based Speech Bandwidth Extension

Tarikul Islam Tamiti, Tonmoy Das, Nursadul Mamun, Anomadarshi Barua

Affiliations: Chittagong University of Engineering and Technology; George Mason University

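AI summary: This paper proposes NDSI-BWE, an adversarial framework that uses four new discriminators inspired by nonlinear dynamical systems (seven discriminators in total) to capture the complex temporal behavior of speech; depth-wise convolutions reduce the discriminators' parameter count, and the framework improves speech bandwidth extension performance.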

AI中文摘要

恢复因带宽限制丢失的高频成分对于电信和有限资源下的高保真音频应用至关重要。我们引入NDSI-BWE，一种新的对抗性带宽扩展（BWE）框架，利用四种新的判别器灵感来自非线性动力学系统以捕捉多样的时间行为：多分辨率李雅普诺夫判别器（MRLD）用于确定初始条件的敏感性，通过捕捉确定性混沌；多尺度递归判别器（MS-RD）用于自相似递归动力学；多尺度去趋势分形分析判别器（MSDFA）用于长程缓慢变异性尺度不变关系；多分辨率庞加莱图判别器（MR-PPD）用于捕捉隐藏的潜在空间关系；多周期判别器（MPD）用于捕捉周期性模式；多分辨率振幅判别器（MRAD）和多分辨率相位判别器（MRPD）用于捕捉复杂的振幅-相位转换统计。通过在每个判别器中使用深度卷积块的核心深度卷积，NDSI-BWE实现了八倍的参数减少。这些七个判别器指导一个基于复数ConformerNeXt的生成器，采用双流Lattice-Net架构，同时优化幅度和相位。生成器利用基于Transformer的Conformer的全局依赖建模能力和ConvNeXt块的局部时间建模能力。在六个客观评估指标和包含五名人类评委的主观文本中，NDSI-BWE在BWE中建立了新的SoTA。

英文摘要

Recovering high-frequency components lost to bandwidth constraints is crucial for applications ranging from telecommunications to high-fidelity audio on limited resources. We introduce NDSI-BWE, a new adversarial Bandwidth Extension (BWE) framework that leverages four new discriminators inspired by nonlinear dynamical systems to capture diverse temporal behaviors: a Multi-Resolution Lyapunov Discriminator (MRLD) for determining sensitivity to initial conditions by capturing deterministic chaos, a Multi-Scale Recurrence Discriminator (MS-RD) for self-similar recurrence dynamics, a Multi-Scale Detrended Fractal Analysis Discriminator (MSDFA) for long-range, slowly varying, scale-invariant relationships, and a Multi-Resolution Poincaré Plot Discriminator (MR-PPD) for capturing hidden latent-space relationships. These are complemented by a Multi-Period Discriminator (MPD) for cyclical patterns, and by a Multi-Resolution Amplitude Discriminator (MRAD) and a Multi-Resolution Phase Discriminator (MRPD) for capturing intricate amplitude-phase transition statistics. By using depth-wise convolution at the core of the convolutional block within each discriminator, NDSI-BWE attains an eight-fold parameter reduction. These seven discriminators guide a complex-valued ConformerNeXt-based generator with a dual-stream Lattice-Net-based architecture for simultaneous refinement of magnitude and phase. The generator leverages the transformer-based Conformer's global dependency modeling and the ConvNeXt block's local temporal modeling capability. Across six objective evaluation metrics and subjective tests with five human judges, NDSI-BWE establishes a new SoTA in BWE.

2507.14200 2026-05-18 cs.CL cs.AI cs.LG 版本更新

A Scalable Multi-LLM Collaboration System with Retrieval-based Selection and Exploration-Exploitation-Driven Enhancement

可扩展的多语言模型协作系统：基于检索的选择与探索-利用驱动增强

Shengji Tang, Jianjian Cao, Weihao Lin, Jiale Hong, Bo Zhang, Shuyue Hu, Lei Bai, Tao Chen, Wanli Ouyang, Peng Ye

发表机构 * Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； The Chinese University of Hong Kong（香港中文大学）； Fudan University（复旦大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结本文提出SMCS系统，通过检索优先选择模块和探索-利用驱动后验增强模块，有效协调多个开源语言模型，实验显示其在多个任务中优于闭源模型，且在不同数据集上超越开源模型的平均最佳结果。

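AI abstract

Existing multi-LLM collaboration systems often face scalability challenges when integrating new LLMs and tasks, which leads to suboptimal performance. To address this, we propose SMCS, a scalable multi-LLM collaboration system designed to coordinate multiple open-source LLMs effectively. The system has two core modules: a Retrieval-based Prior Selection (RPS) module that dynamically selects the most suitable LLMs, and an Exploration-Exploitation-Driven Posterior Enhancement (EPE) module that promotes response diversity and selects high-quality outputs through a hybrid scoring mechanism. Experiments on eight mainstream benchmarks validate the system: by integrating fifteen open-source LLMs, SMCS outperforms prevailing closed-source LLMs across multiple tasks, e.g., GPT-4.1 (+5.36%) and GPT-o3-mini (+5.28%). Notably, it even exceeds the average of the best open-source results on different datasets (+2.86%), advancing the empirical performance frontier of open-source collaboration. The code is released at https://github.com/magent4aci/SMCS.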

英文摘要

Existing multi-LLM collaboration systems often encounter scalability challenges when integrating new LLMs and tasks, leading to suboptimal performance. To address this, we propose SMCS, a Scalable Multi-LLM Collaboration System designed to effectively coordinate multiple open-source LLMs. The system consists of two core components: a Retrieval-based Prior Selection (RPS) module, which dynamically selects the most suitable LLMs for each input, and an Exploration-Exploitation-Driven Posterior Enhancement (EPE) module, which fosters response diversity and selects high-quality outputs through a hybrid scoring mechanism. Experiments on eight mainstream benchmarks validate the effectiveness of our system: by integrating fifteen open-source LLMs, SMCS outperforms prevailing closed-source LLMs, e.g., GPT-4.1(+5.36%) and GPT-o3-mini(+5.28%) across multiple tasks. Remarkably, it even exceeds the average of best results on different datasets with open-source LLMs (+2.86%), significantly advancing the empirical performance frontier of open-source collaboration. The code is released at https://github.com/magent4aci/SMCS.

2506.22604 2026-05-18 cs.AI cs.HC cs.RO 版本更新

Bootstrapping Human-Like Planning via LLMs

通过大语言模型实现人类样式的规划

David Porfirio, Vincent Hsiao, Morgan Fine-Morris, Leslie Smith, Laura M. Hiatt

发表机构 * Navy Center for Applied Research in AI, US Naval Research Laboratory（美国海军人工智能应用研究中心）

AI总结本文研究如何结合自然语言接口与拖放界面，利用大语言模型生成人类风格的动作序列，并与手工指定的动作序列进行比较。

Comments Accepted by the 2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

DOI: 10.1109/RO-MAN63969.2025.11217637

AI中文摘要

机器人终端用户日益需要能够指定机器人执行任务的可访问方法。两种常见的终端用户编程范式包括拖放界面和自然语言编程。尽管自然语言接口利用了人类沟通的直观形式，拖放界面使用户能够精确地规定机器人任务中的关键动作。在本文中，我们探讨这两种方法结合的程度。具体来说，我们构建了一个基于大语言模型（LLM）的管道，接受自然语言作为输入，并生成人类风格的动作序列作为输出，其细度水平与人类产生的相似。我们然后将生成的动作序列与另一组手工指定的动作序列进行比较。尽管我们的结果表明，较大的模型在生成人类风格的动作序列方面优于较小的模型，但较小的模型仍然实现了令人满意的性能。

英文摘要

Robot end users increasingly require accessible means of specifying tasks for robots to perform. Two common end-user programming paradigms include drag-and-drop interfaces and natural language programming. Although natural language interfaces harness an intuitive form of human communication, drag-and-drop interfaces enable users to meticulously and precisely dictate the key actions of the robot's task. In this paper, we investigate the degree to which both approaches can be combined. Specifically, we construct a large language model (LLM)-based pipeline that accepts natural language as input and produces human-like action sequences as output, specified at a level of granularity that a human would produce. We then compare these generated action sequences to another dataset of hand-specified action sequences. Although our results reveal that larger models tend to outperform smaller ones in the production of human-like action sequences, smaller models nonetheless achieve satisfactory performance.

2504.18361 2026-05-18 cs.CV cs.AI 版本更新

COCO-Inpaint: A Benchmark for Detecting and Localizing Inpainting-Based Image Manipulations

COCO-Inpaint：用于检测和定位基于修补的图像篡改的基准

Haozhen Yan, Yan Hong, Jiahui Zhan, Suning Lang, Yikun Ji, Huijia Zhu, Jun Lan, Jianfu Zhang

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Ant Group（蚂蚁集团）

AI总结本文提出COCO-Inpaint基准，用于检测和定位基于修补的图像篡改，通过高质样本、多样场景和大规模覆盖，揭示修补与真实区域的内在不一致。

Comments 6 pages, 8 figures

AI中文摘要

近年来，图像篡改技术的进步使高逼真内容生成成为可能，但也降低了随意编辑的门槛，引发了对多媒体真实性和安全性的担忧。现有图像篡改检测与定位（IMDL）方法主要针对拼接或复制移动伪造，而基于修补的篡改基准仍有限。为弥合这一差距，我们提出了COCO-Inpaint，一个专门用于修补检测和定位的综合基准，主要贡献包括：1）由六个最先进的修补模型生成的高质量修补样本；2）通过四种掩码生成策略和可选文本引导实现的多样化生成场景；3）包含238,302张具有丰富语义多样性的修补图像的大规模覆盖。本基准旨在突出修补区域与真实区域之间的内在不一致，而非表面语义特征如物体形状。我们进一步建立了严格的评估协议，通过三个标准指标来评估现有IMDL方法，揭示当前趋势和挑战。

英文摘要

Recent advances in image manipulation have enabled highly photorealistic content generation, but also lowered the barrier to arbitrary editing, raising concerns about multimedia authenticity and security. Existing Image Manipulation Detection and Localization (IMDL) methods mainly target splicing or copy-move forgeries, while benchmarks for inpainting-based manipulations remain limited. To bridge this gap, we present COCO-Inpaint, a comprehensive benchmark specifically designed for inpainting detection and localization, with three key contributions: 1) High-quality inpainting samples generated by six state-of-the-art inpainting models, 2) Diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) Large-scale coverage of 238,302 inpainted images with rich semantic diversity. Our benchmark is constructed to highlight intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We further establish a rigorous evaluation protocol with three standard metrics to benchmark existing IMDL methods and reveal current trends and challenges.

2504.00289 2026-05-18 cs.CL cs.AI cs.CY 版本更新

Do Chinese models speak Chinese languages?

中国模型会说中文吗？

Andrea W Wen-Yi, Unso Eun Seo Jo, David Mimno

发表机构 * Cornell University（康奈尔大学）

AI总结本文通过比较中西方开源大模型的多语言能力，发现中国模型在多数语言上表现与西方模型相似，但对部分中国少数民族语言识别能力较弱，揭示了多语言发展中的优先级与权衡。

Comments First and second author contribute equally

DOI: 10.1145/3805689.3812333

AI中文摘要

顶级开源大模型的发布巩固了中国在AI发展中的领先地位。这些模型支持中国使用的语言吗？还是与美国或欧洲开发的模型支持相同的语言？比较多语言能力对于两个原因很重要：首先，语言能力提供了关于预训练数据编纂的见解，从而揭示了资源分配和发展优先级；其次，中国模型开发者需要在服务于国内语言多样化的群体与优化全球可见基准（主要为英语）之间取得平衡。我们通过比较中国开发和西方开发的开源大模型，在21种语言变体（包括亚洲地区、中文和欧洲语言）上进行了研究。我们的信息平衡和阅读理解实验表明，中国模型在这些语言上的表现与西方模型高度相关（r=0.93），唯一的例外是中文表现更好。中国开发的模型在法语和德语方面表现良好，但有时无法识别中国少数民族语言，如哈萨克语和维吾尔语。总体而言，所有研究的开源大模型在多语言表现上相似，尽管模型开发者所处的语言和文化背景各不相同。我们将这种同质化解释为全球基准实践和共享训练资源影响的结果。而不是将当前语言支持视为不可避免，我们的结果强调多语言发展是一个优先级和权衡的空间，对模型开发者、政策制定者和用户都有影响。

英文摘要

The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they support the same languages as models developed in the United States or in Europe? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, Chinese model developers need to navigate the tension between serving a linguistically diverse population domestically, and optimizing for globally visible benchmarks that are predominantly English. We investigate Chinese model developers' priorities through a comparative study of Chinese-developed and Western-developed open-weight LLMs, on 21 language variants including Asian regional, Chinese, and European languages. Our experiments on Information Parity and reading comprehension show Chinese models' performance across these languages correlates strongly (r=0.93) with their Western counterparts, with the sole exception being better Mandarin. Chinese-developed models are good at French and German, but they sometimes cannot identify languages spoken by Chinese minorities such as Kazakh and Uyghur. Overall, all open-weight LLMs we study have a similar multilingual performance profile, despite the diverse linguistic and cultural contexts the model developers operated within. We interpret the homogenization as consistent with the influence of global benchmarking practices and shared training resources. Rather than treating current language support as inevitable, our results highlight multilingual development as a space of prioritization and trade-offs, with implications for model developers, policymakers, and users.

2412.06853 2026-05-18 cs.LG cs.AI 版本更新

Tube Loss: A Novel Approach for Prediction Interval Estimation

Tube Loss：预测区间估计的一种新方法

Pritam Anand, Tathagata Bandyopadhyay, Suresh Chandra

Affiliations: Dhirubhai Ambani University (formerly DA-IICT); Indian Institute of Technology, Delhi

AI总结本文提出Tube Loss损失函数，用于回归任务中同时估计预测区间边界。该方法能渐近达到指定置信水平，允许用户调整区间位置以优化覆盖范围和宽度，适用于偏斜分布。

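AI abstract

This paper proposes a new loss function, called 'Tube Loss', for simultaneously estimating the bounds of a prediction interval (PI) in regression. PIs obtained by minimizing the empirical risk based on the Tube Loss improve on existing methods in the following respects. First, they asymptotically attain the prespecified confidence level t ∈ (0,1). Second, the user can shift the interval by adjusting a parameter so that it captures the denser regions of the response variable's probability distribution, which narrows the interval. The method balances coverage and average width by solving a single optimization problem, and re-calibration reduces the average width further. Unlike some existing methods, gradient descent can be used to minimize the empirical risk. Extensive experiments demonstrate the effectiveness of Tube Loss-based PI estimation in kernel machines and neural networks, and show that Tube Loss-based deep probabilistic forecasting models outperform existing probabilistic forecasting techniques on several benchmark and wind datasets. Finally, the advantages of the Tube Loss approach are validated within the conformal prediction framework. Code is available at https://github.com/ltpritamanand/Tube_loss.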

英文摘要

This paper proposes a novel loss function, called 'Tube Loss', for simultaneous estimation of the bounds of a Prediction Interval (PI) in the regression setup. The PIs obtained by minimizing the empirical risk based on the Tube Loss are shown to be of better quality than the PIs obtained by existing methods in the following sense. First, it yields intervals that attain the prespecified confidence level $t \in (0,1)$ asymptotically; a theoretical proof of this fact is given. Second, the user can move the interval up or down by controlling the value of a parameter. This helps the user choose a PI that captures denser regions of the probability distribution of the response variable, thus narrowing the interval. This is shown to be especially useful when the conditional distribution of the response variable is skewed. Further, the Tube Loss-based PI estimation method can trade off coverage against average width by solving a single optimization problem, and it enables further reduction of the average width of the PI through re-calibration. Also, unlike a few existing PI estimation methods, gradient descent (GD) can be used to minimize the empirical risk. Through extensive experiments, we demonstrate the effectiveness of Tube Loss-based PI estimation in both kernel machines and neural networks. Additionally, we show that Tube Loss-based deep probabilistic forecasting models achieve superior performance compared to existing probabilistic forecasting techniques across several benchmark and wind datasets. Finally, we empirically validate the advantages of the Tube Loss approach within the conformal prediction framework. Codes are available at https://github.com/ltpritamanand/Tube_loss.
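
The abstract does not give the Tube Loss formula, so it is not reproduced here. The two quantities it trades off, coverage and average interval width, can be sketched as below. These are the usual empirical definitions, assumed rather than taken from the paper, and the data are hypothetical.

```python
def coverage(y, lower, upper):
    """Fraction of targets that fall inside their prediction interval."""
    hits = sum(lo <= yi <= hi for yi, lo, hi in zip(y, lower, upper))
    return hits / len(y)

def mean_width(lower, upper):
    """Average width of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

# Hypothetical targets and interval bounds; the last target is missed.
y = [1.0, 2.0, 3.0, 4.0, 10.0]
lower = [0.5, 1.5, 2.0, 3.0, 4.0]
upper = [1.5, 2.5, 4.0, 5.0, 6.0]

cov = coverage(y, lower, upper)
width = mean_width(lower, upper)
print(cov, width)
```

A good interval estimator reaches the target coverage level with the smallest possible average width; widening every interval raises coverage trivially, which is why both numbers are reported together.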

2407.20240 2026-05-18 cs.CY cs.AI 版本更新

Social and Ethical Risks Posed by General-Purpose LLMs for Settling Newcomers in Canada

通用大型语言模型对加拿大新移民融入社会的潜在风险

Isar Nejadgholi, Maryam Molamohammadi, Samir Bakhtawar

发表机构 * National Research Council Canada（加拿大国家研究委员会）； Mila - Quebec Artificial Intelligence Institute（魁北克人工智能研究所）

AI总结研究探讨通用大语言模型在移民安置领域可能带来的风险，强调需开发定制化AI工具以确保人类监督与责任。

Comments 26 pages, 8 figures

DOI: 10.1007/978-3-031-95890-8_11

AI中文摘要

加拿大非营利安置部门支持新移民实现成功融入。该部门面临日益增长的操作压力，凸显了提高效率和创新的必要性，可能通过可靠的AI解决方案实现。随意使用通用生成式AI，如ChatGPT，可能成为移民和服务机构的常见做法，但这些工具未针对安置领域进行优化，可能对移民和难民产生有害影响。本文探讨这些工具可能对新移民造成的风险，警告避免未经监管的生成式AI使用，并鼓励进一步研究开发AI素养课程及定制化LLM，使其符合受影响社区的偏好。关键在于此类技术应无缝集成到安置部门现有流程中，确保人类监督、可信度和问责制。

英文摘要

The non-profit settlement sector in Canada supports newcomers in achieving successful integration. This sector faces increasing operational pressures amidst rising immigration targets, which highlights a need for enhanced efficiency and innovation, potentially through reliable AI solutions. The ad-hoc use of general-purpose generative AI, such as ChatGPT, might become a common practice among newcomers and service providers to address this need. However, these tools are not tailored for the settlement domain and can have detrimental implications for immigrants and refugees. We explore the risks that these tools might pose to newcomers in order to, first, warn against the unguarded use of generative AI and, second, incentivize further research and development on AI literacy programs and on customized LLMs that are aligned with the preferences of the impacted communities. Crucially, such technologies should be designed to integrate seamlessly into the existing workflow of the settlement sector, ensuring human oversight, trustworthiness, and accountability.


2312.05975 2026-05-18 cs.CV cs.AI cs.LG version update

FM-G-CAM: A Holistic Approach for Explainable AI in Computer Vision

Ravidu Suien Rammuni Silva, Jordan J. Bird

Affiliations: Department of Computer Science, Nottingham Trent University

AI summary: This paper proposes FM-G-CAM, which considers multiple top-predicted classes to give a holistic explanation of a CNN's decision, addressing the single-class limitation of conventional Grad-CAM.

Abstract

Explainability is a vital aspect of modern AI for real-world impact and usability. The main objective of this paper is to emphasise the need to understand the predictions of Computer Vision models, specifically Convolutional Neural Network (CNN) models. Existing methods for explaining CNN predictions are largely based on Gradient-weighted Class Activation Maps (Grad-CAM) and focus solely on a single target class; this assumption about the target class selection neglects a large portion of the predictor CNN's prediction process. In this paper, we present an exhaustive methodology, called Fused Multi-class Gradient-weighted Class Activation Map (FM-G-CAM), that considers multiple top-predicted classes and provides a holistic explanation of the predictor CNN's rationale. We also provide a detailed mathematical and algorithmic description of our method. Furthermore, alongside a concise comparison of existing methods, we compare FM-G-CAM with Grad-CAM, quantitatively and qualitatively highlighting its benefits through real-world practical use cases. Finally, we present an open-source Python library with an FM-G-CAM implementation to conveniently generate saliency maps for CNN-based model predictions.
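As a rough illustration of the idea of combining saliency maps from several top-predicted classes, the sketch below computes a standard Grad-CAM map per class (channel weights are the spatially averaged gradients, followed by a ReLU over the weighted sum) and then fuses the maps. The fusion rule shown, keeping the class with the largest value at each pixel, is an assumption made for illustration; the abstract does not specify FM-G-CAM's actual fusion, and `grad_cam` and `fuse_by_pixel_max` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

Grid = List[List[float]]


def grad_cam(activations: List[Grid], gradients: List[Grid]) -> Grid:
    """Standard Grad-CAM map for one class from K feature maps of size H x W."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial average of the class-score gradient.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    return [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]


def fuse_by_pixel_max(maps: Dict[str, Grid]) -> List[List[Tuple[Optional[str], float]]]:
    """ASSUMED fusion rule: each pixel keeps the class with the largest map value."""
    labels = list(maps)
    height, width = len(maps[labels[0]]), len(maps[labels[0]][0])
    fused = []
    for i in range(height):
        row = []
        for j in range(width):
            best = max(labels, key=lambda c: maps[c][i][j])
            value = maps[best][i][j]
            # A pixel with no positive evidence is attributed to no class.
            row.append((best if value > 0.0 else None, value))
        fused.append(row)
    return fused
```

Inputs here are plain nested lists so the sketch stays dependency-free; in practice the activations and gradients would come from a deep learning framework's hooks on the last convolutional layer.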


2308.06822 2026-05-18 cs.LG cs.AI cs.CR math.OC version update

Approximate and Weighted Data Reconstruction Attack in Federated Learning

Yongcun Song, Ziqi Wang, Enrique Zuazua

Affiliations: Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University; Chair for Dynamics, Control, Machine Learning and Numerics – Alexander von Humboldt Professorship, Department of Mathematics, Friedrich-Alexander-Universität Erlangen-Nürnberg; Chair of Computational Mathematics, Fundación Deusto; Departamento de Matemáticas, Universidad Autónoma de Madrid

AI summary: This paper proposes an interpolation-based approximation method for attacking the Federated Averaging setting in federated learning. It generates the intermediate model updates of clients' local training to improve data reconstruction quality, and experiments confirm its advantage on image data reconstruction.

Abstract

Federated Learning (FL) is a distributed learning paradigm that enables multiple clients to collaborate on building a machine learning model without sharing their private data. Although FL is considered privacy-preserved by design, recent data reconstruction attacks demonstrate that an attacker can recover clients' training data based on the parameters shared in FL. However, most existing methods fail to attack the most widely used horizontal Federated Averaging (FedAvg) scenario, where clients share model parameters after multiple local training steps. To tackle this issue, we propose an interpolation-based approximation method, which makes attacking FedAvg scenarios feasible by generating the intermediate model updates of the clients' local training processes. Then, we design a layer-wise weighted loss function to improve the data quality of reconstruction. We assign different weights to model updates in different layers concerning the neural network structure, with the weights tuned by Bayesian optimization. Finally, experimental results validate the superiority of our proposed approximate and weighted attack (AWA) method over the other state-of-the-art methods, as demonstrated by the substantial improvement in different evaluation metrics for image data reconstructions.
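The interpolation step can be pictured with a small sketch. In FedAvg the attacker sees only the model before and after a client's local training, so the intermediate models must be approximated. The code below assumes plain linear interpolation between the two observed weight vectors; the abstract says only "interpolation-based", so the linear form and the function names are assumptions, not the paper's method.

```python
from typing import List, Sequence


def interpolate_local_models(
    w_start: Sequence[float], w_end: Sequence[float], num_steps: int
) -> List[List[float]]:
    """Approximate the num_steps + 1 models visited during local training.

    ASSUMPTION: the trajectory is a straight line from w_start to w_end.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [
        [a + (b - a) * t / num_steps for a, b in zip(w_start, w_end)]
        for t in range(num_steps + 1)
    ]


def approximate_step_updates(
    w_start: Sequence[float], w_end: Sequence[float], num_steps: int
) -> List[List[float]]:
    """Per-step model updates implied by the interpolated trajectory."""
    models = interpolate_local_models(w_start, w_end, num_steps)
    return [
        [nxt - cur for cur, nxt in zip(models[t], models[t + 1])]
        for t in range(num_steps)
    ]
```

Each approximated per-step update could then be treated like a single-step gradient by a gradient-matching reconstruction attack, which is the setting earlier attacks were designed for.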


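2306.04321 2026-05-18 cs.AI cs.MM version update

Generative Semantic Communication: Diffusion Models Beyond Bit Recovery

Eleonora Grassucci, Sergio Barbarossa, Danilo Comminiello

Affiliations: Dept. of Information Engineering, Electronics, and Telecommunication, Sapienza University of Rome

AI summary: This paper proposes a generative diffusion-guided framework for semantic communication that uses diffusion models to synthesize multimedia content while preserving semantic features. Spatially-adaptive normalization yields semantically consistent scenes and improves image generation quality under channel noise.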

Journal ref IEEE Transactions on Cognitive Communication and Networking, 2026

DOI: 10.1109/TCCN.2026.3689849

Abstract

Semantic communication is expected to be one of the cores of next-generation AI-based communications. One of the possibilities offered by semantic communication is the capability to regenerate, at the destination side, images or videos semantically equivalent to the transmitted ones, without necessarily recovering the transmitted sequence of bits. The current solutions still lack the ability to build complex scenes from the received partial information. Clearly, there is an unmet need to balance the effectiveness of generation methods and the complexity of the transmitted information, possibly taking into account the goal of communication. In this paper, we aim to bridge this gap by proposing a novel generative diffusion-guided framework for semantic communication that leverages the strong abilities of diffusion models in synthesizing multimedia content while preserving semantic features. We reduce bandwidth usage by sending highly-compressed semantic information only. Then, the diffusion model learns to synthesize semantic-consistent scenes through spatially-adaptive normalizations from such denoised semantic information. We prove, through an in-depth assessment of multiple scenarios, that our method outperforms existing solutions in generating high-quality images with preserved semantic information even in cases where the received content is significantly degraded. More specifically, our results show that objects, locations, and depths are still recognizable even in the presence of extremely noisy conditions of the communication channel. The code is available at https://github.com/ispamm/GESCO.


2210.13455 2026-05-18 cs.LG cs.AI version update

Epistemic Monte Carlo Tree Search

Yaniv Oren, Viliam Vadocz, Matthijs T. J. Spaan, Wendelin Böhmer

Affiliations: Delft University of Technology

AI summary: This paper proposes Epistemic MCTS, which accounts for epistemic uncertainty to make search more efficient and performs better on sparse-reward tasks such as code writing.

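2605.15769 2026-05-18 cs.RO cs.AI version update

Lamarckian Inheritance in Dynamic Environments: How Key Variables Affect Evolutionary Dynamics

K. Ege de Bruin, Kyrre Glette, Kai Olav Ellefsen

Affiliations: Department of Informatics, University of Oslo, Norway; RITMO, University of Oslo, Norway

AI summary: This paper studies how key variables affect evolutionary dynamics in dynamic environments. Using virtual soft robots and two learning methods, it finds that Lamarckian inheritance underperforms when environmental changes are both conflicting and unpredictable, and that adding a sensor that detects environmental changes restores its advantage.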

Abstract

The co-optimization of a robot's body and brain presents a coupled challenge: the morphology constrains which control strategies are effective, while the control determines how well the morphology performs. To address this, we combine morphology optimization as evolution with controller optimization as lifetime learning, utilizing Lamarckian inheritance to transfer learned controller parameters from parent to offspring. In dynamic environments, existing literature presents conflicting evidence: while traditional evolutionary theory often suggests Lamarckian inheritance lacks benefit, recent studies in evolutionary robotics indicate it can improve performance. We hypothesize that this is because previous works have not included all relevant variables with dynamic environments. In this work, we show that the benefit of Lamarckian inheritance depends on two variables: how conflicting the environmental changes are to robot control, and the predictability of those changes for the robotic agent. Using virtual soft robots and two different learning approaches, Bayesian optimization and reinforcement learning, we show that Lamarckian inheritance only underperforms Darwinian inheritance when the changes are both conflicting and unpredictable. We find that adding a sensor to detect environmental changes restores the benefits for Lamarckian inheritance in conflicting environments, by allowing robotic agents to predict the need for a different behavior, thereby generalizing their control.


2605.15764 2026-05-18 cs.CV cs.AI version update

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg

Affiliations: University of Illinois Urbana-Champaign; Georgia Institute of Technology; Amazon AGI; Korea University

AI summary: GRASP links high-level social QA to fine-grained gaze and deictic gesture events to improve social reasoning over multi-person non-verbal interactions. It contains 290K question-answer pairs and proposes the Social Grounding Reward to improve model performance.

Comments Project page: https://social-reaoning.github.io/grasp/

Abstract

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.


2605.15763 2026-05-18 cs.CL cs.AI version update

CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs

Kamil Guttmann, Zofia Fraś, Artur Nowakowski, Krzysztof Jassem

Affiliations: Laniqo; Faculty of Mathematics and Computer Science, Adam Mickiewicz University

AI summary: This paper presents CompactQE, which uses small open-weight large language models for translation quality estimation, producing quality scores, error annotations, correction suggestions, and full post-edits, and is reported to outperform traditional metrics and human annotation.

2605.15736 2026-05-18 cs.CV cs.AI version update

BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation

Huanyang Tong, Kai Liu, Fangjun Kuang, Huiling Chen

Affiliations: Wenzhou University; Wenzhou Business College

AI summary: BiomedAP uses gated cross-modal fusion and a dual-anchor constraint to make medical vision-language models more robust to prompt variation. Experiments show it outperforms baseline methods on multiple benchmarks.

Comments CVPR2026 Workshop

2605.15734 2026-05-18 cs.AI version update

Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

Izabella Krzeminska, Michal Butkiewicz, Ewa Komkowska

Affiliations: Orange Research, AI Center

AI summary: This paper empirically tests the assumption behind using large language models to assess user states, examines the psychometric reliability of AI-based measures, and proposes a replicable evaluation framework to make AI design for adaptive systems more reliable.

Comments Full survey article with data tables for further possible replicability and comparison

Abstract

The use of large language models to assess user states in conversational and adaptive systems is based on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper empirically tests this assumption, focusing on the psychometric reliability of artificial intelligence (AI) measures of user states. This study employed replication evaluation procedures to assess the repeatability of a broad set of metrics across three different bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). Analyses include both individual score reliability and aggregated reliability, allowing us to distinguish metrics potentially useful for real-time adaptation from those that retain their value only in aggregated analyses. The results demonstrate that metric reliability cannot be considered a default property in interpretive domains. The lack of stability at the level of individual scores precludes the interpretation of such scores as indicators of user state in real-time adaptive systems, even if these metrics demonstrate stability after aggregation. At the same time, the study indicates that individually unstable metrics can retain analytical utility in post-hoc studies, identifying rules governing interactions and their relationships with user experience parameters such as satisfaction, trust, and engagement. The main contribution of this work, besides quantifying the severity of the problem (only 31 of 213 metrics met the criteria), is the proposal of a replicable evaluation framework, enabling measurable evaluations of metric applicability. This approach supports more responsible AI design of adaptive systems, in which the interpretation of results requires explicit validation of reliability and monitoring for violations over time.
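The distinction between individual-score reliability and aggregated reliability can be shown with a toy example. The sketch below is not the paper's evaluation framework; it assumes a simple test-retest setup in which the same items are scored twice, uses the Pearson correlation between the two runs as the individual-level check, and compares run means as the aggregate-level check. The function names are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two replicate scoring runs of the same items."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant scores; 0.0 is a convention here
    return cov / math.sqrt(var_x * var_y)


def run_mean(scores: Sequence[float]) -> float:
    """Aggregate score of one run."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    run_a = [1.0, 2.0, 3.0, 4.0]  # scores for four conversations, first run
    run_b = [4.0, 3.0, 2.0, 1.0]  # same conversations, second run
    print("individual-level agreement:", pearson(run_a, run_b))
    print("aggregate means:", run_mean(run_a), run_mean(run_b))
```

In this constructed case the two runs have identical means while the per-item scores are perfectly anti-correlated, so the metric looks stable in aggregate yet would be unusable as a per-user signal.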


2605.15733 2026-05-18 cs.NE cs.AI cs.CV version update

Structure Abstraction and Generalization in a Hippocampal-Entorhinal Inspired World Model

Tianqiu Zhang, Muyang Lyu, Xiao Liu, Si Wu

Affiliations: Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, IDG/McGovern Institute for Brain Research, Center of Quantitative Biology, School of Psychological and Cognitive Sciences, Key Laboratory of Machine Perception (Ministry of Education), Peking University

AI summary: This paper proposes a brain-inspired hierarchical model that extracts latent transitions with an inverse model while building a predictive visual world model, shows that abstract structure can be extracted concurrently from continuous high-dimensional dynamics, and achieves structural generalization.

Comments Project page: https://hpc-mec-worldmodel.github.io/

Abstract

Humans abstract experiences into structured representations to facilitate pattern inference and knowledge transfer. While the hippocampal-entorhinal (HPC-MEC) circuit is known to represent both spatial and conceptual spaces, the mechanisms for concurrently extracting abstract structures from continuous, high-dimensional dynamics remain poorly understood. We propose a brain-inspired hierarchical model that simultaneously infers latent transitions and constructs a predictive visual world model. Our architecture employs an inverse model for structural extraction alongside an HPC-MEC coupling model that dissociates relational structures (MEC) from integrated episodic scenes (HPC). Using primitive transformation dynamics as a benchmark, we demonstrate the model's capacity for structural abstraction. By leveraging velocity-driven path integration, the framework enables robust prediction and structural reuse across diverse contexts, thereby achieving structural generalization. This work provides a novel computational framework for understanding how brain-inspired, self-supervised learning of world models facilitates the acquisition of reusable abstract knowledge.


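2605.15728 2026-05-18 cs.CV cs.AI version update

Learning Dynamic Pick-and-Place for Quadruped Manipulators (translated from the Chinese listing title 学习动态抓取与放置用于四足机械臂; the English title is missing from this listing)

Moonkyu Jung, Jiseong Lee, Zhengmao He, Donghoon Youm, Juhyeok Mun, HyeongJun Kim, Hyunsik Oh, Donghyuk Choi, Jungwoo Hur, Jie Song, Jemin Hwangbo

Affiliations: Robotics and Artificial Intelligence Lab, KAIST

AI summary: This paper proposes a hierarchical reinforcement learning framework for dynamic pick-and-place with a legged manipulator and validates its success rate in simulation and real-world experiments across different payloads and workspace heights.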

Comments Accepted to IEEE Robotics and Automation Letters 2026

Journal ref IEEE Robotics and Automation Letters, vol. 11, no. 6, pp. 7652-7659, 2026

DOI: 10.1109/LRA.2026.3688092

Abstract

Legged manipulators extend robotic capabilities beyond static manipulation by integrating agile locomotion with versatile arm control. However, achieving precise manipulation while maintaining coordinated locomotion remains a major challenge. This work presents a hierarchical reinforcement learning framework for dynamic pick-and-place tasks using a quadruped equipped with a 6-DOF robotic arm. The framework incorporates an explicit mass estimation module enabling adaptive whole-body control for objects with varying weights. In simulation, the system achieves an 86.05% success rate with payloads up to 2.3 kg. The approach is further validated through real-world experiments across six representative scenarios with controlled variations in object physical properties (size and mass) and task heights. Specifically, within a wide vertical workspace ranging from ground level to 1.1 m high tabletops, the system demonstrates an average success rate of 73.3% for payloads up to 1.3 kg, with an average execution time of 4.06 s. Unlike prior works that handle lightweight objects and execute pick-and-place with slow, piecewise motions, the proposed framework exploits concurrent locomotion and manipulation for dynamic, continuous execution. These results demonstrate the potential of quadrupedal mobile manipulators for adaptive, whole-body pick-and-place with heavier payloads and extended workspaces.


2605.15705 2026-05-18 cs.RO cs.AI version update

Feedback World Model Enables Precise Guidance of Diffusion Policy

Tuo An, Jindou Jia, Gen Li, Jingliang Li, Chuhao Zhou, Pengfei Liu, Bofan Lyu, Jiaqi Bai, Xinying Guo, Geng Li, Jianfei Yang

Affiliations: MARS Lab, Nanyang Technological University

AI summary: This paper proposes the feedback world model, which corrects prediction errors using real-time feedback to improve robotic decision making. Experiments show substantial gains in prediction accuracy and policy performance under distribution shift.

Comments 21 pages, 9 figures

Abstract

World models aim to improve robotic decision making by predicting the consequences of actions. However, in practice, their predictions often become unreliable once the robot encounters states outside the training distribution, limiting their effectiveness at deployment. We observe that execution itself provides a natural but underutilized signal: after each action, the robot directly observes the true next state, revealing the mismatch between predicted and actual outcomes. Building on this insight, we propose feedback world model, a new paradigm that closes the loop between prediction and observation at inference time. Instead of treating the world model as a static open-loop predictor, our method maintains a lightweight feedback state that is updated online to iteratively correct future predictions, compensating for model errors using real-time observations without additional training data or parameter updates. We show that this process can be interpreted as a latent-space observer and admits convergence guarantees under mild conditions. We further introduce action-aware guidance to better translate corrected predictions into control by emphasizing action-controllable components while suppressing irrelevant variations. Experiments on LIBERO-Plus, Robomimic, and real-world manipulation tasks demonstrate that our method substantially improves both prediction accuracy and policy performance under distribution shift. In particular, it reduces world model prediction error by up to 76.4% and improves out-of-distribution (OOD) success rate by 30%. These results show that incorporating real-time feedback at inference time provides a simple yet powerful alternative to static world modeling.
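The closed-loop idea (compare each prediction with the observed next state, and carry the mismatch forward in a small feedback state with no parameter updates) can be sketched on a scalar toy system. This is a minimal observer-style illustration under strong assumptions, not the paper's method: the "world model" is a fixed linear map, the real dynamics differ from it by a constant offset, and the feedback state is a single bias term updated with a hypothetical `gain`.

```python
from typing import List, Tuple


def model_step(z: float, u: float) -> float:
    """Frozen world model (hypothetical): predicts the next latent state."""
    return 0.9 * z + u


def true_step(z: float, u: float) -> float:
    """Real dynamics: same as the model plus an unmodelled constant offset."""
    return 0.9 * z + u + 0.3


def run_closed_loop(steps: int, gain: float) -> Tuple[float, List[float]]:
    """Roll out the system; gain = 0.0 reproduces the open-loop predictor."""
    z, bias, errors = 0.0, 0.0, []
    for _ in range(steps):
        u = 1.0
        predicted = model_step(z, u) + bias  # corrected prediction
        observed = true_step(z, u)           # what execution reveals
        residual = observed - predicted
        errors.append(abs(residual))
        bias += gain * residual              # online update of the feedback state
        z = observed
    return bias, errors
```

With `gain=0.0` the error stays at the offset size on every step, while any gain in (0, 1] shrinks it geometrically in this toy setting; the model's own parameters never change.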


2605.15701 2026-05-18 cs.CL cs.AI version update

H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure

Jiawei Yu, Yixiang Fang, Xilin Liu, Yuchi Ma

Affiliations: The Chinese University of Hong Kong, Shenzhen; Huawei Cloud Computing Technologies CO., LTD.

AI summary: H-Mem uses a hybrid structure to model the long-term evolution of agent memory and to retrieve memory data efficiently, improving performance on question-answering tasks.

Abstract

Memory data are ubiquitous in Large Language Model (LLM)-based agents (e.g., OpenClaw and Manus). A few recent works have attempted to exploit agents' memory to improve their performance on the question-answering (QA) task, but they lack a principled mechanism for effectively modeling how memory data evolves over time and for retrieving memory data effectively, leading to poor performance in memory utilization. To fill this gap, we present H-Mem, a novel memory mechanism via a hybrid structure that can not only effectively model the evolution of agent memory over a long period of time, but also provide an efficient memory retrieval approach. In particular, H-Mem builds a temporal and semantic tree structure that allows short-term memory data to evolve progressively into long-term memory data, where the latter provides summarized information about the former, while simultaneously constructing a knowledge graph to capture the relationships between entities in memory. Moreover, it offers an effective memory retrieval approach by exploiting the hybrid of the tree and graph structures. Extensive experiments on three agent memory benchmarks show that H-Mem achieves state-of-the-art performance on the QA task.


2605.15688 2026-05-18 stat.ML cs.AI cs.LG math.PR version update

$α$-TCAV: A Unified Framework for Testing with Concept Activation Vectors

Ekkehard Schnoor, Jawher Said, Malik Tiomoko, Wojciech Samek, Alexander Jung

Affiliations: Department of Computer Science; Department of Artificial Intelligence; Aalto University; Fraunhofer Heinrich Hertz Institute; Department of Artificial Intelligence, Fraunhofer HHI; Huawei Noah’s Ark Lab; Department of EECS, Technische Universität Berlin

AI summary: This paper proposes the $α$-TCAV framework, which addresses the non-decaying variance caused by the discontinuous indicator function in standard TCAV by replacing it with a parameterized smooth function, gives a unified probabilistic formulation, provides guidance on tuning the parameter, and challenges established practice.

Comments 44 pages, 12 figures

Abstract

Concept Activation Vectors (CAVs) are a fundamental tool for concept-based explainability in deep learning, yet their practical utility is limited by statistical instability. We analyze the stochastic nature of CAVs and the Testing with CAVs (TCAV) method, deriving the distributions of major CAV classes including PatternCAV, FastCAV, and ridge regression-based CAVs. We then identify a fundamental flaw in the standard TCAV score: its reliance on a discontinuous indicator function induces non-decaying variance in critical regimes. To address this, we introduce $α$-TCAV, a generalized framework that replaces the indicator with a parameterized smooth function, yielding a unified probabilistic formulation that subsumes both TCAV and Multi-TCAV. We characterize the induced distributions of sensitivity scores and different TCAV variants, showing that established state-of-the-art choices lack theoretical justification. We provide principled guidance on tuning the parameter in $α$-TCAV -- either to imitate Multi-TCAV at substantially lower computational cost, or to obtain a calibrated Bayes-optimal probabilistic measure of a concept's influence. Finally, our analysis yields practical recommendations that challenge established routines: most notably, allocating the full sampling budget to a single CAV rather than splitting it across several.
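The instability of the indicator-based score, and the effect of smoothing it, can be seen in a few lines. The standard TCAV score is the fraction of inputs whose directional derivative along the CAV is positive. The smoothed variant below replaces the indicator with a logistic function of steepness `alpha`; this particular smooth function is an assumption chosen for illustration, since the abstract says only that the indicator is replaced by "a parameterized smooth function".

```python
import math
from typing import Sequence


def tcav_score(sensitivities: Sequence[float]) -> float:
    """Standard TCAV score: share of inputs with positive concept sensitivity."""
    return sum(1.0 for s in sensitivities if s > 0) / len(sensitivities)


def stable_sigmoid(x: float) -> float:
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def smoothed_tcav_score(sensitivities: Sequence[float], alpha: float) -> float:
    """ASSUMED smooth surrogate: the indicator is replaced by sigmoid(alpha * s)."""
    return sum(stable_sigmoid(alpha * s) for s in sensitivities) / len(sensitivities)


if __name__ == "__main__":
    # A sensitivity that flips sign under tiny noise moves the hard score by a
    # full unit for that input, but barely moves the smooth score.
    print(tcav_score([0.001]), tcav_score([-0.001]))
    print(smoothed_tcav_score([0.001], 1.0), smoothed_tcav_score([-0.001], 1.0))
```

As `alpha` grows, the smooth score approaches the hard indicator score, so in this sketch the standard TCAV score is the large-`alpha` limit.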


2605.15675 2026-05-18 cs.LG cs.AI version update

Interaction-Aware Influence Functions for Group Attribution

Jaeseung Heo, Kyeongheung Yun, Youngbin Choi, Sehyun Hwang, Jungseul Ok, Dongwoo Kim

Affiliations: GSAI, POSTECH; CSE, POSTECH

AI summary: This paper proposes an interaction-aware influence function that accounts for interactions between examples to improve group attribution. Experiments show it outperforms conventional methods across several tasks.

Abstract

Influence functions approximate how removing a training example changes a quantity of interest, called the target function, such as a held-out loss. To estimate the influence of a group of examples, the standard practice is to sum the individual influences of its members. However, this sum does not capture how examples jointly affect the target: a pair of examples may be redundant or complementary, but the sum cannot distinguish these cases. We propose an interaction-aware influence function that characterizes how interactions between examples influence the target. By expanding the target to second order around the trained parameters, we obtain an estimator that augments the standard sum with a pairwise interaction term that captures the alignment between two examples' effects on the target. We empirically evaluate our estimator in two settings. First, on six dataset-model pairs spanning logistic regression, MLPs, and ResNet-9, our estimator tracks leave-group-out retraining substantially better than first-order influence across all settings. Second, when used as a greedy selection rule for instruction-tuning data on Llama-3.1-8B, it beats prior influence-based and representation-similarity baselines on five of seven downstream tasks, in a regime where standard influence-based selection underperforms random selection.
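A second-order expansion of the target shows where a pairwise term comes from. Suppose removing example $i$ shifts the trained parameters by $d_i$ (in standard influence analysis, $d_i \approx \frac{1}{n} H^{-1} g_i$), and a group shifts them by $\sum_i d_i$. Expanding the target $f$ to second order gives $\Delta f \approx \sum_i \nabla f^\top d_i + \frac{1}{2} \sum_{i,j} d_i^\top \nabla^2 f \, d_j$. The sketch below implements exactly this expansion; whether the paper's estimator keeps the $i = j$ terms or uses this normalization is not stated in the abstract, so treat the form and the names as assumptions.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(m: Matrix, v: Vector) -> List[float]:
    return [dot(row, v) for row in m]


def group_influence(grad_f: Vector, hess_f: Matrix, shifts: Sequence[Vector]) -> Tuple[float, float]:
    """Return (first-order sum, pairwise interaction term) for a group.

    shifts[i] is the ASSUMED per-example parameter shift d_i, precomputed elsewhere.
    """
    first_order = sum(dot(grad_f, d) for d in shifts)
    # Positive when two examples push the target the same way (redundant or
    # reinforcing), negative when their effects oppose each other.
    pairwise = 0.5 * sum(dot(di, matvec(hess_f, dj)) for di in shifts for dj in shifts)
    return first_order, pairwise
```

For a target that is exactly quadratic in the parameters, the sum of the two returned terms equals the true change in the target under the combined shift, which is a convenient sanity check for the sketch.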


2605.15672 2026-05-18 cs.CV cs.AI version update

VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following

Hyesoo Hong, Minsoo Kim, Wonje Jeung, Sangyeon Yoon, Dongjae Jeon, Albert No

Affiliations: Yonsei University

AI summary: The paper studies VLMs on visual path-following tasks, finds that they tend to switch paths when faced with locally similar distractors, and traces these failures to local competition.

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.


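2605.15665 2026-05-18 cs.AI version update

PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI

Keshava Chaitanya, Jahnavi Gundakaram

AI summary: PRISM treats prompt engineering as a reliability engineering problem: through continuous simulation and monitoring it improves the reliability of enterprise conversational AI, reduces prompt authoring time, and repairs regressions in production.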

Comments 12 pages, 1 figure, 5 tables. arXiv preprint

Abstract

Deploying large language model (LLM)-driven conversational agents in enterprise settings requires prompts that are simultaneously correct at launch and resilient to the non-deterministic behavioral drift that characterizes production LLM deployments. Existing prompt optimization frameworks address prompt quality as a one-time compile-time problem, leaving open the equally critical question of how to detect and repair prompt regressions caused by silent LLM behavior changes over time. We present PRISM (Prompt Reliability via Iterative Simulation and Monitoring), a closed-loop framework that treats prompt engineering as a continuous reliability engineering problem rather than a one-time authorship task. PRISM takes as input plain-language agent requirements, a set of configured tools and memory variables, and an initial draft prompt. It automatically generates test cases from requirements, simulates full multi-turn conversations against a platform-faithful LLM environment, evaluates pass/fail using an LLM-as-judge, diagnoses root causes of failures, and surgically repairs the prompt -- iterating until all tests pass. Critically, PRISM is designed to run on a scheduled basis (daily), treating LLM behavioral drift as a first-class reliability concern. We evaluate PRISM across 35 enterprise conversational agents over a three-week deployment period on the Yellow.ai V3 platform. PRISM reduces median prompt authoring time from 2 days to under 30 minutes, achieves 99% production reliability across all evaluated agents, and successfully identifies and repairs production regressions caused by LLM behavioral drift within a 24-hour detection window. Our results suggest that continuous, simulation-driven prompt optimization is both tractable and necessary for reliable enterprise conversational AI at scale.
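The test, diagnose, and repair cycle described in this abstract has the shape of a simple fixed-point loop. The sketch below shows only that control flow; the simulator, the LLM-as-judge, and the repair step are passed in as plain callables, and every name here (`repair_until_passing`, `run_test`, `repair`, `max_rounds`) is hypothetical rather than part of PRISM or the Yellow.ai platform.

```python
from typing import Callable, List, Sequence, Tuple


def repair_until_passing(
    prompt: str,
    tests: Sequence[str],
    run_test: Callable[[str, str], bool],
    repair: Callable[[str, List[str]], str],
    max_rounds: int = 5,
) -> Tuple[str, int, bool]:
    """Iterate simulate -> judge -> repair until all tests pass or rounds run out.

    Returns (final prompt, repair rounds used, whether all tests pass).
    """
    for rounds_used in range(max_rounds):
        failures = [t for t in tests if not run_test(prompt, t)]
        if not failures:
            return prompt, rounds_used, True
        prompt = repair(prompt, failures)
    all_pass = all(run_test(prompt, t) for t in tests)
    return prompt, max_rounds, all_pass


if __name__ == "__main__":
    # Stand-ins: a test "passes" if its keyword appears in the prompt, and
    # "repair" appends an instruction for each failing keyword.
    def passes(prompt: str, keyword: str) -> bool:
        return keyword in prompt

    def patch(prompt: str, failing: List[str]) -> str:
        return prompt + "".join(f" Always handle {k}." for k in failing)

    print(repair_until_passing("You are a support agent.", ["refund", "greeting"], passes, patch))
```

Running the same loop on a schedule against the live model, as the abstract describes, would turn it from a one-time authoring aid into a drift monitor.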


2605.15661 2026-05-18 cs.CV cs.AI 版本更新

VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation

VAGS：图像编辑与生成的速率自适应引导尺度

Yan Luo, Ahmadou Aidara, Jingyi Lu, Jeremy Moebel, Kai Han, Mengyu Wang

Affiliations * Harvard AI and Robotics Lab; Harvard University; School of Computing and Data Science; The University of Hong Kong; Kempner Institute for the Study of Natural and Artificial Intelligence

AI summary: VAGS adapts the guidance scale at each sampling step to improve structural fidelity in image editing and quality in image generation, with no fine-tuning or extra forward passes.

AI中文摘要

分类自由引导（CFG）是控制流式采样器中文本语义强度的主要手段，但传统方法在整个ODE轨迹中固定引导尺度。这存在根本矛盾：早期步骤以噪声为主，携带弱语义信号，而后期步骤需提交图像结构，要求更强的方向性承诺；更关键的是，任何引导强度的值取决于引导速度是否与模型当前动态一致或相反。本文提出速率自适应引导尺度（VAGS），一种无需训练的替代方案，通过结合时间信号级项和任务相关速度场的余弦相似度，将名义尺度乘以一个有界因子。对于无需反向传播的编辑，VAGS测量源和目标引导速度之间的对齐程度，使每一步的编辑强度反映局部保留与变换的兼容性。对于生成，VAGS-Gen利用无条件与条件速度之间的对齐作为类比信号。两种变体均无需微调、辅助网络或额外前向传递，固定CFG是其特殊情形。在PIE-Bench和DIV2K进行编辑，在COCO17、CUB-200和Flickr30K进行生成时，VAGS在结构保真度和生成质量上优于固定CFG和近期无训练引导变体。代码可在https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale公开获取。

英文摘要

Classifier-free guidance (CFG) is the primary control over how strongly text semantics move a flow-based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model's current dynamics or working against them. We propose Velocity-Adaptive Guidance Scale (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and COCO17, CUB-200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. The code is publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale.
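The abstract describes VAGS-Gen only at a high level: the nominal scale is multiplied by a bounded factor built from a temporal signal-level term and the cosine similarity between the unconditional and conditional velocities. The Python sketch below illustrates that idea under stated assumptions. The factor `1 + strength * t * cos`, the `strength` parameter, the convention that `t` runs from 0 (pure noise) to 1 (clean image), and all function names are hypothetical and are not taken from the paper or its repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two velocity vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def adaptive_scale(nominal, t, v_uncond, v_cond, strength=0.5):
    """Hypothetical rule: nominal scale times a factor in [1 - strength, 1 + strength].

    t is the assumed signal level in [0, 1]; strength = 0 gives back fixed CFG.
    """
    alignment = cosine_similarity(v_uncond, v_cond)  # in [-1, 1]
    signal = min(max(t, 0.0), 1.0)                   # clamp to [0, 1]
    return nominal * (1.0 + strength * signal * alignment)


def guided_velocity(nominal, t, v_uncond, v_cond, strength=0.5):
    """Standard CFG combination, but with the step-dependent scale."""
    w = adaptive_scale(nominal, t, v_uncond, v_cond, strength)
    return [u + w * (c - u) for u, c in zip(v_uncond, v_cond)]
```

Both velocities are already computed in ordinary CFG, so a rule of this shape needs no extra forward pass. Whether the real method raises or lowers the scale when the velocities align is not stated in the abstract; the sign used here is an assumption.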


2605.15656 2026-05-18 eess.SP cs.AI 版本更新

TFZ-Tree: An Ultra-Lightweight Waveform Classification Framework for Resource-Constrained Devices

TFZ-Tree：一种面向资源受限设备的超轻量波形分类框架

Hao Wang, Kuang Zhang, Yonggang Chi, Tianqi Zhao, Yanbo Fu, Jiaxing Guo


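AI summary: The paper proposes TFZ-Tree, which combines time-frequency multidimensional features with a Z-test-optimized decision tree for ultra-lightweight waveform classification. On ten IoT waveform types it reaches 99.5% average accuracy under AWGN and 87.4% under TDL-C multipath, with under 4 ms per inference in a C implementation on x86; deployment on embedded MCUs is left to future work.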

AI中文摘要

在6G物联网多波形共存趋势下，智能接收器必须首先识别物理层波形类型才能正确解调和资源调度。然而，现有信号识别研究主要聚焦于符号级调制分类，直接针对物理层波形类型（如OFDM、OTFS、LoRa）的研究极为稀缺，且依赖深度神经网络和复杂时频变换，难以部署在资源受限终端。符号调制分类方法本身也无法规避“波形识别先于解调”的前提。为解决这一双重缺口，本文提出一种基于时频多维特征的超轻量波形分类框架，采用低复杂度时域特征提取，分类后端采用优化的Z检验树，利用假设检验置信度自动控制决策树分裂和大小，确保在资源有限处理器上高效执行。在包含OFDM、OTFS、DSSS、LoRa和NB-IoT在内的十种6G候选波形上测试，方法在AWGN信道下平均精度达99.5%，在TDL-C多径信道下为87.4%，主要混淆OTFS与LoRa。在x86平台用C语言实现，单次推理延迟低于4ms。据所知，这是首次实现十种物联网波形类型实时识别的工作。未来工作将针对嵌入式MCU上的部署加速。代码和数据集已开源：https://github.com/Einstein-sworder/IoT-wave.

英文摘要

Under the trend of multi-waveform coexistence in 6G IoT, intelligent receivers must first identify physical-layer waveform types before performing correct demodulation and resource scheduling. However, existing signal identification research largely focuses on symbol-level modulation classification. Research directly targeting physical-layer waveform types (e.g., OFDM, OTFS, LoRa) is not only extremely scarce but also heavily reliant on deep neural networks and complex time-frequency transforms, making deployment on resource-constrained terminals difficult. Symbol modulation classification methods themselves cannot circumvent the prerequisite of "waveform identification first." To address this dual gap, we propose an ultra-lightweight waveform classification framework based on time-frequency multidimensional features with a cooperative Z-test tree (ZTree). The framework employs low-complexity time-domain feature extraction, and the classification backend adopts a ZTree optimized by Z-statistical testing, which uses hypothesis testing confidence to automatically control decision tree splitting and size, ensuring efficient execution on resource-limited processors. Tested on ten 6G candidate waveforms including OFDM, OTFS, DSSS, LoRa, and NB-IoT, the method achieves 99.5% average accuracy under AWGN and 87.4% under TDL-C multipath channels, with main confusion between OTFS and LoRa. Implemented in C on an x86 platform, single inference latency is under 4 ms. To the best of our knowledge, this is the first work achieving real-time recognition of ten IoT waveform types. Future work will target deployment acceleration on embedded MCUs. Code and dataset are open-sourced at: https://github.com/Einstein-sworder/IoT-wave.
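The abstract says that hypothesis-test confidence controls when the tree splits, but does not give the test itself. One plausible reading, sketched below in Python, is a two-sample z-test on a candidate feature: a node is split only if the feature's class means differ significantly. The function names and the default critical value (2.576, the two-sided 99% point of the standard normal) are assumptions for illustration; the authors' ZTree may apply the test differently.

```python
import math


def z_statistic(xs, ys):
    """Two-sample z statistic for the difference in means (unequal variances).

    Each sample needs at least two values so the sample variance is defined.
    """
    nx, ny = len(xs), len(ys)
    mean_x, mean_y = sum(xs) / nx, sum(ys) / ny
    var_x = sum((x - mean_x) ** 2 for x in xs) / (nx - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (ny - 1)
    std_err = math.sqrt(var_x / nx + var_y / ny)
    if std_err == 0.0:
        # Zero spread in both samples: either identical or perfectly separated.
        if mean_x == mean_y:
            return 0.0
        return math.copysign(math.inf, mean_x - mean_y)
    return (mean_x - mean_y) / std_err


def should_split(xs, ys, z_crit=2.576):
    """Split a node only when the feature separates the two classes significantly."""
    return abs(z_statistic(xs, ys)) > z_crit
```

Raising `z_crit` rejects more candidate splits and so yields a smaller tree, which matches the abstract's description of the test controlling tree size.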


2605.15651 2026-05-18 cs.LG cs.AI cs.GT 版本更新

Sharp Spectral Thresholds for Logit Fixed Points

Logit固定点的尖锐谱阈值

Tongxi Wang

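Affiliations * Southeast University

AI summary: The paper studies the stability of logit (softmax) feedback systems and proves a sharp Euclidean threshold that extends stability guarantees and locates the phase transition.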

AI中文摘要

Softmax反馈系统是熵正则化强化学习、logit博弈动态、群体选择和均场变分更新的数学核心。其核心稳定性问题很简单：当softmax系统产生唯一且全局可预测的结果时？经典理论给出了保守答案。通过将softmax视为单位尺度响应，它仅在强随机化 regime 中保证稳定性。我们证明经典方法忽略了整个稳定 regime 并未识别真正质变发生点。对于有限维仿射logit系统，尖锐无维欧几里得阈值为$$\beta\,\|\Pi W \Pi\|_{\mathcal{T}\to\mathcal{T}} < 2$$，而非之前使用的条件，该条件仅在softmax系统保持安全过正则化时保证稳定性。我们的定理填补了之前缺失的预分支 regime，将仿射softmax反馈系统的稳定性保证扩展到奖励响应但全局可预测的系统。它扩大了这些系统的认证稳定性边界，并识别模型真正经历相变的点。

英文摘要

Softmax feedback systems are a common mathematical core of entropy-regularized reinforcement learning, logit game dynamics, population choice, and mean-field variational updates. Their central stability question is simple: when does a self-reinforcing softmax system produce a unique and globally predictable outcome? Classical theory gives a conservative answer. By treating softmax as a unit-scale response, it certifies stability only in a strongly randomized regime. We prove that the classical approach misses an entire stable regime and does not identify the point at which the qualitative change truly occurs. For finite-dimensional affine logit systems, the sharp dimension-free Euclidean threshold is $$\beta\,\|\Pi W \Pi\|_{\mathcal{T}\to\mathcal{T}} < 2,$$ rather than the previously used condition, which certifies stability only while the softmax system remains safely over-regularized. Our theorem fills the previously missing pre-bifurcation regime, extending stability guarantees for affine softmax feedback systems to reward-responsive yet globally predictable systems. It enlarges the certified stability boundary for these systems and identifies where the model genuinely undergoes a phase transition.
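The quoted threshold can be checked numerically on a small example. The Python sketch below makes several assumptions that the abstract does not spell out: the system is taken to be x = softmax(β(Wx + b)), Π is taken to be the centering projection I − (1/n)11ᵀ onto sum-zero vectors, and the operator norm is computed as the spectral norm of ΠWΠ. All function names are hypothetical.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def projected(W):
    """Pi W Pi with Pi = I - (1/n) 1 1^T (assumed meaning of the projection)."""
    n = len(W)
    P = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]
    PW = [[sum(P[i][k] * W[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    return [[sum(PW[i][k] * P[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def spectral_norm(A, iters=500):
    """Largest singular value by power iteration on A^T A.

    The fixed start vector could in principle be orthogonal to the top
    singular direction; that corner case is ignored in this sketch.
    """
    At = [list(col) for col in zip(*A)]
    v = [float(i + 1) for i in range(len(A))]
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    return math.sqrt(sum(x * x for x in matvec(A, v)))


def certified_stable(W, beta):
    """True when beta * ||Pi W Pi|| < 2, the threshold quoted in the abstract."""
    return beta * spectral_norm(projected(W)) < 2.0


def iterate(W, b, beta, x, steps=2000):
    """Fixed-point iteration x <- softmax(beta * (W x + b)) (assumed system form)."""
    for _ in range(steps):
        x = softmax([beta * (s + bi) for s, bi in zip(matvec(W, x), b)])
    return x
```

For two actions with W = I and b = 0, the map reduces to p ↦ σ(β(2p − 1)), whose slope at the uniform point p = 1/2 is β/2. The uniform fixed point therefore loses stability exactly at β = 2, which agrees with the quoted threshold because ‖ΠWΠ‖ = 1 in this case. Iterating with β = 1.5 returns to the uniform point, while β = 3 settles on a non-uniform one.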


2605.15625 2026-05-18 cs.AI cond-mat.soft 版本更新

ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing

ColPackAgent：基于代理技能的硬粒子蒙特卡罗工作流程用于胶体堆积

Lijie Ding, Changwoo Do

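Affiliations * Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

AI summary: ColPackAgent uses an MCP tool server and agent skills to run colloidal-packing simulations as autonomous workflows, showing how LLM agents can carry out simulation tasks and comparing the performance of different models.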

Position: Artificial Intelligence Needs Meta-Intelligence: The Case for Metacognitive AI

Sergei Chuprov, Richard D. Lange, Leon Reznik, Paulo Shakarian, Raman Zatsarenko, Dmitrii Korobeinikov

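Affiliations * University of Texas Rio Grande Valley, Edinburg, TX, USA; Rochester Institute of Technology, Rochester, NY, USA; Syracuse University, Syracuse, NY, USA

AI summary: The paper argues for metacognition as a general principle for designing more accurate, safe, and efficient AI, uses a federated learning case study to show how metacognition improves learning efficiency and safety, and proposes a new software framework for implementing metacognitive AI.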

Comments This is a preliminary version accepted for presentation and publication at the 43rd International Conference on Machine Learning (ICML26). The modified final version will be available in the conference proceedings

2605.15565 2026-05-18 cs.LG cs.AI 版本更新

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

AstraFlow：面向代理大语言模型的数据流强化学习

Haizhong Zheng, Yizhuo Di, Jiahui Wang, Shuowei Jin, Xueshen Liu, Yongji Wu, Z. Morley Mao, Ion Stoica, Jiawei Zhao, Beidi Chen

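Affiliations * Carnegie Mellon University; University of Michigan; UC Berkeley; Meta

AI summary: AstraFlow is a dataflow-oriented reinforcement learning system that supports complex multi-policy collaborative training and makes efficient use of heterogeneous compute, with the goal of improving the reasoning and tool-use abilities of agentic LLMs.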

AI中文摘要

强化学习（RL）日益被用于提升大语言模型的推理、编码和工具使用能力，但代理RL仍面临高昂成本。为扩展RL到代理LLM，需支持复杂工作负载，包括多策略协作训练，同时高效利用弹性、异构和跨区域计算资源。现有LLM RL系统支持部分能力，但每次新扩展通常需专门系统工程。此问题源于训练器导向的控制架构和RL系统组件缺乏原理性抽象。为此，我们提出AstraFlow，一种数据流导向的RL系统，取代传统训练器导向控制，采用原理性组件抽象。在AstraFlow中，rollout服务、数据流管理和训练被解耦为自主组件，使系统能原生支持复杂多策略代理RL工作负载并高效利用多样化计算资源。我们评估了AstraFlow在数学、代码、搜索和AgentBench工作负载上的表现，显示同一系统支持多策略训练、弹性扩展、异构跨区域执行和可组合的数据算法，无需系统级代码更改。在多策略协作训练中，AstraFlow的准确度与现有RL系统相当或更优，同时训练时间加速2.7倍。

英文摘要

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.


2605.15549 2026-05-18 cs.LG cs.AI cs.CE 版本更新

CTF4Nuclear: Common Task Framework for Nuclear Fission and Fusion Models

CTF4Nuclear: 用于核裂变和核聚变模型的通用任务框架

Stefano Riva, Carolina Introini, Antonio Cammi, Dean Price, Alexey Yermakov, Yue Zhao, Philippe M. Wyder, Judah Goldfeder, Jan Williams, Amy Sara Rude, Matteo Tomasetto, Joe Germany, Joseph Bakarji, Georg Maierhofer, Miles Cranmer, J. Nathan Kutz

Affiliations * Autodesk Research; Department of Energy, Nuclear Engineering Division, Politecnico di Milano; Nuclear Science and Engineering, Massachusetts Institute of Technology; Department of Applied Mathematics, University of Washington; Department of Electrical and Computer Engineering, University of Washington; High Performance Machine Learning, SURF; Distyl AI; Department of Computer Science, Columbia University; Department of Mechanical Engineering, University of Washington; Department of Mechanical Engineering, Politecnico di Milano; Department of Mathematics, American University in Beirut; Department of Mechanical Engineering, American University in Beirut; Department of Applied Mathematics and Theoretical Physics, University of Cambridge

AI summary: The paper presents CTF4Nuclear, a framework for standardized evaluation of machine learning methods in nuclear engineering. Methods are scored on 12 established metrics plus a task on system monitoring from sparse measurements, with the aim of raising rigour and reproducibility in scientific ML for the nuclear industry.

AI中文摘要

清洁能源需求持续增长，新型核技术为可再生能源提供补充方案。然而，设计和运行这些系统极具挑战性，因为物理现象的复杂性导致系统动态难以预测。尽管高保真模拟有助于理解反应堆中的非线性多物理场相互作用，但计算成本高，难以实现实时应用。此外，基于模型的方法对简化假设敏感，导致与实际测量存在固有差异。相比之下，机器学习（ML）方法有潜力生成可靠的替代模型，快速预测系统行为。然而，可用于此任务的数据驱动方法种类繁多且多样。在安全关键领域如核工程中，公平比较不同ML方法及其优缺点至关重要。为此，我们引入了一个通用任务框架（CTF）用于核工程中的ML，基于动态系统和地震学的先前努力。该CTF考虑了来自不同核和核相邻系统的精选数据集。CTF评估方法在12个已建立的指标上表现，以及一个专注于仅稀疏测量的系统监控新范式。我们通过基准测试标准ML基线方法，揭示了当前方法的限制。我们的愿景是用标准化评估替代随意比较，提高核工业科学ML的严谨性和可重复性。

英文摘要

The demand for clean energy is ever increasing, with new nuclear technologies presenting a complementary solution to renewable energies. However, designing and operating these systems is exceptionally difficult, given the complexity of the physical phenomena that interact to form the system dynamics. While high-fidelity simulations help to understand the non-linear, multi-physics interactions within a reactor, they are computationally expensive and rarely suitable for real-time applications. Furthermore, model-based approaches are inherently sensitive to simplifying assumptions required to derive their governing equations and parameters, leading to inevitable discrepancies with real-world measurements. In contrast, Machine Learning (ML) methods have the potential to generate reliable surrogate models which may be able to quickly predict the system's behaviour. However, the number of data-driven methods that can potentially be used for this task is large and diverse. In a safety-critical setting such as nuclear engineering, a fair comparison of different ML methods, and a clear understanding of their advantages and limitations, is of paramount importance. To address this, we introduce a Common Task Framework (CTF) for ML in nuclear engineering, building upon previous efforts in dynamical systems and seismology. This CTF considers a curated set of datasets from different nuclear and nuclear-adjacent systems. The CTF evaluates the performance of a method on 12 established metrics, alongside a new paradigm focused on system monitoring from sparse measurements only. We illustrate the framework by benchmarking standard ML baselines against these datasets, revealing current method limitations. Our vision is to replace ad hoc comparisons with standardized evaluations on hidden test sets, raising the bar for rigour and reproducibility in scientific ML for the nuclear industry.


2605.15543 2026-05-18 cs.GT cs.AI 版本更新

Domain-Independent Game Abstraction using Word Embedding Techniques

基于词嵌入技术的领域无关游戏抽象

Juho Kim, Tuomas Sandholm

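Affiliations * CMU Strategic Machine, Inc.; Strategy Robot, Inc.; Optimized Markets, Inc.

AI summary: The paper proposes game abstraction based on word-embedding techniques from natural language processing: actions are treated as words, embedded as vectors, and clustered to obtain a domain-independent abstraction. Experiments show the method is effective but falls short of domain-specific algorithms.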

2605.15542 2026-05-18 cs.AI 版本更新

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

DRS-GUI: 动态区域搜索用于无训练的GUI定位

Yichao Liu, Huawen Shen, Liu Yu, Shiyu Liu, Zeyu Chen, Yu Zhou

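Affiliations * Nankai University; Institute of Information Engineering, Chinese Academy of Sciences

AI summary: DRS-GUI is a training-free dynamic region search framework for GUI grounding. A lightweight UI Perceptor and an MCTS-based Action Planner explore and filter screen regions, improving the grounding accuracy of multimodal large language models.

Comments 11 pages, 8 figures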

AI中文摘要

基于多模态大语言模型（MLLM）的GUI代理在理解和执行用户指令方面表现出色，但准确地从高分辨率截图中定位相关元素仍具挑战性。受人类动态调整感知范围的启发，本文提出DRS-GUI，一种无训练的动态区域搜索框架，可无缝集成到现有MLLM中。DRS-GUI引入轻量级UI感知器，执行聚焦、位移和分散三种人类似感知动作，逐步探索界面并生成区域提案。通过基于蒙特卡洛树搜索（MCTS）的动作规划器动态调度这些动作，并利用区域质量奖励评估和选择高度相关的区域，有效剪枝冗余UI元素。实验表明，DRS-GUI在ScreenSpot-Pro上对通用和GUI特定的MLLM（Qwen2.5-VL-7B和UGround-V1-7B）实现了14%的提升，显著增强了定位性能和泛化能力。

英文摘要

GUI agents powered by Multimodal Large Language Models (MLLMs) have demonstrated impressive capability in understanding and executing user instructions. However, accurately grounding instruction-relevant elements from high-resolution screenshots cluttered with irrelevant UI components remains challenging for existing approaches. Inspired by how humans dynamically adjust their perceptual scope to locate task-related regions on complex screens, we propose DRS-GUI, a training-free dynamic region search framework for GUI grounding that can be seamlessly integrated into existing MLLMs. DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions (Focus, Shift, and Scatter) to progressively explore the interface and generate region proposals. To dynamically schedule these actions, we further design an Action Planner based on Monte Carlo Tree Search (MCTS). A region quality reward is employed to evaluate and select the highly instruction-relevant region, efficiently pruning redundant UI elements. Experiments demonstrate that DRS-GUI yields a 14% improvement on ScreenSpot-Pro for general and GUI-specific MLLMs (Qwen2.5-VL-7B and UGround-V1-7B), significantly enhancing grounding performance and generalization.


2605.15537 2026-05-18 cs.AI 版本更新

RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision

RTL-BenchMT：通过代理辅助分析和修订动态维护RTL生成基准

Jing Wang, Shang Liu, Hangan Zhou, Zhiyao Xie

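Affiliations * Hong Kong University of Science and Technology

AI summary: The paper proposes RTL-BenchMT, a framework that automatically identifies and corrects flawed benchmark cases and detects and updates overfitted ones, addressing defects and overfitting in RTL benchmarks while lowering manual maintenance cost.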

Comments This paper has been accepted by DAC 2026

2605.15536 2026-05-18 cs.RO cs.AI cs.CV 版本更新

SkiP: When to Skip and When to Refine for Efficient Robot Manipulation

SkiP: 在何时跳过和何时细化以实现高效的机器人操作

Mingtong Dai, Guanqi Peng, Yongjie Bai, Feng Yan, Chunjie Chen, Lingbo Liu, Liang Lin, Xinyu Wu

Affiliations * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Peng Cheng Laboratory; Southern University of Science and Technology; Sun Yat-sen University; UNT; University of Chinese Academy of Sciences

AI summary: SkiP skips redundant steps and densely refines key steps to make robot manipulation more efficient, with no learned skip planner or hierarchical structure.

AI中文摘要

先前的模仿学习策略在每个控制步骤都预测未来动作，无论是在平滑运动阶段还是精确的接触丰富操作阶段。这种统一处理是浪费的：大多数操作轨迹步骤在自由空间中移动，携带很少的任务相关信息，而一小部分关键步骤围绕接触、抓取和对齐需求密集的高分辨率预测。我们提出了一种新的动作重标机制：在跳过段的每个时间步，我们用下一个关键段入口的动作替换行为克隆目标，使策略能够在一个决策中跳过冗余步骤。由此产生的Skip Policy (SkiP)在单一统一网络中动态跳过跳过段并密集细化关键段，无需学习跳过规划器或分层结构。为了自动将演示分成关键和跳过段而无需手动标注，我们引入了Motion Spectrum Keying (MSK)，一种快速且任务无关的程序，从动作信号中检测局部运动复杂性。在72个模拟操作任务和三个真实机器人任务上的广泛实验表明，SkiP将执行步骤减少15-40%，同时在各种策略骨干上匹配或提高成功率。项目页面：https://pgq18.github.io/SkiP-page/.

英文摘要

Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact-rich operation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory traverse free space and carry little task-relevant information, while a small fraction of key steps around contacts, grasps, and alignment demand dense, high-resolution prediction. We propose a novel action relabeling mechanism: at each timestep in a skip segment, we replace the behavior cloning target with the action at the entrance of the next key segment, enabling the policy to leap over redundant steps in a single decision. The resulting Skip Policy (SkiP) dynamically leaps over skip segments and intensively refines actions in key segments, within a single unified network requiring no learned skip planner or hierarchical structure. To automatically partition demonstrations into key and skip segments without manual annotation, we introduce Motion Spectrum Keying (MSK), a fast, task-agnostic procedure that detects local motion complexity from action signals. Extensive experiments across 72 simulated manipulation tasks and three real-robot tasks show that SkiP reduces executed steps by 15-40% while matching or improving success rates across various policy backbones. Project page: https://pgq18.github.io/SkiP-page/.
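The action relabeling step is concrete enough to sketch. In the Python example below, each timestep in a skip segment takes as its behavior cloning target the action at the first step of the next key segment, while key steps keep their own actions. The function name, the boolean `is_key` mask (standing in for the output of a segmenter such as MSK), and the choice to leave a trailing skip segment unchanged are assumptions for illustration.

```python
def relabel_actions(actions, is_key):
    """Return behavior cloning targets after skip-segment relabeling.

    actions: per-timestep demonstration actions.
    is_key:  per-timestep booleans, True inside key segments.
    """
    if len(actions) != len(is_key):
        raise ValueError("actions and is_key must have the same length")
    targets = list(actions)
    next_entry = None  # index where the nearest key segment to the right begins
    for t in range(len(actions) - 1, -1, -1):
        if is_key[t]:
            # Scanning right to left, the last key index seen before a skip
            # step is the entrance (leftmost step) of that key segment.
            next_entry = t
        elif next_entry is not None:
            targets[t] = actions[next_entry]
        # Skip steps with no key segment after them keep their own action
        # (an assumption; the abstract does not cover this case).
    return targets
```

For example, with `is_key = [False, False, True, True, False, True, False]`, steps 0 and 1 are relabeled with the action from step 2, step 4 with the action from step 5, and step 6 is left unchanged.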


2605.15533 2026-05-18 cs.CV cs.AI 版本更新

Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance

无需调优的指令式视频编辑：通过结构噪声初始化和引导

Song Wu, Xinyu Chen, Qian Wang, Liang Li, Zili Yi, Junlan Feng

Affiliations * JIUTIAN Research, China Mobile; School of Intelligence Science and Technology, Nanjing University; State Key Laboratory of Novel Software Technology, Nanjing University

AI summary: The paper proposes a tuning-free, instruction-based video editing framework that uses a Structural Noise Initialization Strategy and a Noise Guidance Mechanism to improve the visual quality of edited videos.

Comments Accepted by ICIP 2026

AI中文摘要

视频编辑面临重大挑战。尽管一系列无需调优的方法避免了大量数据收集和模型训练的需求，但它们往往未能充分利用嵌入在噪声潜在空间中的丰富信息，导致结果不满意。为此，我们提出一种无需调优、基于指令的视频编辑框架。我们从噪声潜在空间的角度出发：设计了结构噪声初始化策略（SNIS），通过为编辑区域分配更高的噪声水平（以促进内容变化）和为未编辑区域分配更低的噪声水平（以保持内容一致性），从而获得更优的编辑起点。我们引入了噪声引导机制（NGM），利用生成模型中的视频先验知识，有效整合噪声潜在空间中的丰富信息以引导去噪过程，从而保持未编辑内容和整体视觉一致性。实验表明，我们提出的方法在视觉质量和性能上均优于现有方法。

英文摘要

Video editing poses a significant challenge. While a series of tuning-free methods circumvent the need for extensive data collection and model training, they often underutilize the rich information embedded within the noisy latent, leading to unsatisfactory results. To address this, we propose a tuning-free, instruction-based video editing framework. We approach video editing from the perspective of the noisy latent: we design a Structural Noise Initialization Strategy (SNIS) to secure a superior editing starting point by assigning higher noise levels to edited regions (to facilitate content change) and lower noise levels to unedited regions (to maintain content consistency). We introduce a Noise Guidance Mechanism (NGM), which leverages the video prior in the generative model and effectively integrates rich information within the noisy latent to guide the denoising process, thereby preserving unedited content and overall visual coherence. Experiments show that our proposed method achieves better visual quality and state-of-the-art performance.


2605.15529 2026-05-18 cs.CL cs.AI cs.LG 版本更新

Process Rewards with Learned Reliability

基于学习可靠性的过程奖励

Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai, Yuyi Yang, Wenxuan Zhang, Jiaxin Huang

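Affiliations * Washington University in St. Louis; Singapore University of Technology and Design

AI summary: The paper proposes BetaPRM, a process reward model that predicts both a step's success probability and the reliability of that prediction, so downstream methods can tell reliable rewards from uncertain ones. Its application, Adaptive Computation Allocation (ACA), improves the accuracy-token tradeoff in Best-of-N reasoning.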

AI中文摘要

Process Reward Models (PRMs) 提供步骤级反馈用于推理，但当前PRMs通常为每个步骤输出单一奖励分数。下游方法必须将不完美的步骤级奖励预测视为可靠的决策信号，但无指示何时应信任这些预测。我们提出BetaPRM，一种分布型PRM，预测步骤成功概率及该预测的可靠性。给定步骤成功监督来自蒙特卡洛延续，BetaPRM学习Beta信念，通过Beta-Binomial似然解释观察到的成功延续数量，而非回归到有限样本成功比率作为点目标。该学习的可靠性信号指示何时应信任步骤奖励，使下游应用能区分可靠奖励与不确定奖励。作为一项应用，我们引入自适应计算分配（ACA）用于PRM引导的最佳N推理。ACA利用学习的可靠性信号在高奖励解决方案可靠时停止，并在不确定候选前缀上投入更多计算。在四个backbone和四个推理基准上的实验表明，BetaPRM改进了PRM引导的最佳N选择，同时保持标准步骤级错误检测。基于此信号，ACA在固定预算最佳16上提升了准确率-token权衡，减少token使用达33.57%，同时提高最终答案准确率。

英文摘要

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
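The Beta-Binomial likelihood named in the abstract has a standard closed form, so the training signal can be sketched directly. In the Python example below, a step's belief is Beta(α, β), and the loss is the negative log-probability of seeing `successes` correct continuations out of `trials`. The function names are hypothetical, and using the Beta variance as the reliability readout is an assumption, since the abstract does not say how reliability is extracted from the belief.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(alpha, beta, successes, trials):
    """Negative log-likelihood of `successes` out of `trials` under Beta(alpha, beta)."""
    log_choose = (math.lgamma(trials + 1)
                  - math.lgamma(successes + 1)
                  - math.lgamma(trials - successes + 1))
    log_prob = (log_choose
                + log_beta(successes + alpha, trials - successes + beta)
                - log_beta(alpha, beta))
    return -log_prob


def belief_summary(alpha, beta):
    """Mean success probability and variance of the Beta belief.

    A smaller variance is read here as a more reliable step reward
    (an assumed readout, not the paper's definition).
    """
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total * total * (total + 1.0))
    return mean, variance
```

Unlike regressing to the ratio `successes / trials`, this loss treats 2 out of 2 and 8 out of 8 differently: both have the same ratio, but the second supports a much more concentrated belief.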


2605.15524 2026-05-18 cs.LG cs.AI math.DG math.ST stat.TH 版本更新

Neural Point-Forms

神经点形

Bruno Trentini, Jacob Hume, Vincenzo Antonio Isoldi, Philipp Misof, Ekaterina S. Ivshina, Kelly Maggs

Affiliations * NVIDIA; University of Oxford; Max Planck Institute for Mathematics in the Sciences; Department of Mathematical Sciences; Chalmers University of Technology and University of Gothenburg; School of Engineering and Applied Sciences; Max Planck Institute of Molecular Cell Biology and Genetics

AI summary: The paper introduces Neural Point-Forms (NPFs), which use Laplacian techniques from diffusion geometry to build learnable geometric features on point clouds for comparing differential forms. Synthetic and biologically motivated experiments show advantages in handling sampling density, manifold structure, and population geometry.

PrismQuant: Rate-Distortion-Optimized Vector Quantization for Gaussian-Mixture Sources

Bumsu Park, Chanho Park, Youngmok Park, Namyoon Lee

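Affiliations * Department of Electrical Engineering

AI summary: For Gaussian-mixture sources, PrismQuant transmits the component label losslessly and encodes the residual with a component-matched KLT. Combined with EM-driven mixture learning and entropy-constrained scalar quantization, it closely approaches the theoretical rate-distortion bound and is competitive with or better than transformer-based learned codecs at a much smaller model size.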

AI中文摘要

对于均方误差下的高斯源，传统变换编码在率失真（RD）最优：KLT对角化协方差，反向水填充分配比特，随后标量量化闭环。然而多模态源中，单一协方差无法捕捉异质局部几何，RD函数失去闭合形式。本文通过高斯混合源重新审视该问题，构建其RD理论。核心发现混合结构仅引入组件标签成本。在活跃混合组件条件下，每个分支为高斯；挑战在于异质分支间的比特分配。证明 genie-aided 条件RD函数由单一全局反向水填充水平支配。基于此，提出PrismQuant，无损传输组件标签并使用组件匹配KLT编码残差，随后标量量化，实现H(C)/n bits per source dimension的反向率，渐近间隙消失。进一步开发基于EM驱动高斯混合学习、组件自适应KLT和熵约束标量量化（ECSQ）的实用实现。合成高斯混合实验显示PrismQuant接近理论RD界限，现实世界信道状态信息（CSI）数据实验显示其性能优于传统模型，模型规模小一个数量级。

英文摘要

For a Gaussian source under mean-squared error (MSE), classical transform coding is rate-distortion (RD) optimal: the Karhunen-Loève transform (KLT) diagonalizes the covariance, reverse waterfilling allocates the bits, and scalar quantization closes the loop. This elegant story breaks down for multimodal sources, where no single covariance can capture heterogeneous local geometries, and the RD function loses its closed form. We revisit this problem through Gaussian-mixture sources and develop a constructive RD theory for them. Our key finding is that the mixture structure incurs only a component label cost. Conditioned on the active mixture component, each branch is Gaussian; the challenge is allocating bits across heterogeneous branches. We prove that the genie-aided conditional RD function is governed by a single global reverse-waterfilling level shared across all components and eigenmodes. Building on this result, we introduce PrismQuant, which transmits the component label losslessly and encodes the residual using the component-matched KLT, followed by scalar quantization, achieving a rate within $H(C)/n$ bits per source dimension of the converse, with a vanishing asymptotic gap. We further develop a practical implementation based on EM-driven Gaussian-mixture learning, component-adaptive KLTs, and entropy-constrained scalar quantization (ECSQ). Experiments on synthetic Gaussian mixtures show that PrismQuant closely approaches the theoretical RD bound, while experiments on real-world channel-state-information (CSI) data demonstrate competitive or superior performance compared with transformer-based learned codecs at more than one order of magnitude smaller model size.
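Reverse waterfilling, which the abstract builds on, is a classical procedure and can be sketched briefly. For eigenvalues λ_i and a water level θ, each mode receives distortion min(θ, λ_i) and rate ½·log2(λ_i / D_i). The Python example below finds θ by bisection for a given total distortion. The optional `weights` argument is a hypothetical extension: setting the weights to component probabilities is one reading of the "single global level shared across all components" result, not the paper's stated algorithm.

```python
import math


def reverse_waterfill(eigenvalues, distortion, weights=None, iters=200):
    """Return (water level, per-mode distortions, rate in bits).

    distortion is the total (weighted sum) MSE budget, not a per-dimension average.
    """
    if distortion <= 0 or min(eigenvalues) <= 0:
        raise ValueError("distortion and eigenvalues must be positive")
    if weights is None:
        weights = [1.0] * len(eigenvalues)
    # With a budget at least as large as the total variance, nothing is coded.
    if distortion >= sum(w * lam for w, lam in zip(weights, eigenvalues)):
        return max(eigenvalues), list(eigenvalues), 0.0
    lo, hi = 0.0, max(eigenvalues)
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        spent = sum(w * min(theta, lam) for w, lam in zip(weights, eigenvalues))
        if spent < distortion:
            lo = theta   # water level too low: budget not used up
        else:
            hi = theta   # water level too high: budget exceeded
    theta = 0.5 * (lo + hi)
    per_mode = [min(theta, lam) for lam in eigenvalues]
    rate = sum(w * 0.5 * math.log2(lam / d)
               for w, lam, d in zip(weights, eigenvalues, per_mode))
    return theta, per_mode, rate
```

As a hand check, eigenvalues (4, 1) with total distortion 1 give θ = 0.5 and a rate of ½·log2(8) + ½·log2(2) = 2 bits; with total distortion 3 the level rises to θ = 2, the weaker mode is dropped, and the rate is 0.5 bits.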


2605.15504 2026-05-18 cs.LG cs.AI 版本更新

Learning with Conflicts of Interest

利益冲突中的学习

Nischal Aryal, Arash Termehchy, Ali Vakilian, Marianne Winslett

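Affiliations * Oregon State University; Virginia Tech; University of Illinois

AI summary: The paper proposes a game-theoretic framework for conflicts of interest between ML systems and their users, with scalable algorithms that protect users while maximizing the beneficial information they receive.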

AI中文摘要

金融、社会和政治因素经常导致ML系统所有者和服务使用者的利益无法完全一致。ML系统往往产生有偏见的信息，可能影响用户做出不利于自身利益的决定。当前解决方案要求ML系统实施协议以缓解偏见，但所有者通常没有实施这些协议的激励，并常认为这限制了他们的表达自由或商业。我们认为，解决此问题的成功方案必须认识到ML系统与其用户之间的利益冲突，并利用此信息保护用户免受不利影响，同时允许用户安全地受益于这些系统。为此，我们提出了一种博弈论框架，用于建模存在利益冲突的ML系统与用户之间的互动。我们提出了具有理论保证的可扩展算法，以最大化与所需信息和行动相关的内容，并最小化与偏见和操纵行为相关的交互内容。

英文摘要

Financial, social, and political factors often prevent the interests of the owners of ML systems and services and their users from being perfectly aligned. ML systems often produce biased information that can influence users to make decisions that are not in their best interest. Current solution approaches require ML systems to implement protocols to mitigate their biases. However, ML system owners usually do not have any incentive to implement these protocols and often argue that it limits their freedom of expression or business. We believe that a successful solution to this problem must recognize the conflict of interest between the ML systems and their users, and use this information to protect users against information that adversely influences their decisions while allowing users to safely benefit from these systems. To this end, we propose a game-theoretic framework that models the interaction between ML systems and users with conflicts of interest. We present scalable algorithms with theoretical guarantees that maximize the amount of desired information and actions and minimize the amount of biased and manipulative actions in interaction with ML systems.


2605.15486 2026-05-18 cs.RO cs.AI 版本更新

Hybrid LLM-based Intelligent Framework for Robot Task Scheduling

基于混合大语言模型的智能机器人任务调度框架

Swayamjit Saha, Subhabrata Das, Haonan Duan, Xiao-Yang Liu

Affiliations * Department of Computer Science and Engineering, Mississippi State University; Graduate School of Arts and Sciences, Columbia University; Consumer and Community Banking, JPMorgan Chase; Department of Data Science, Columbia University; Department of Electrical Engineering, Columbia University

AI summary: The paper uses large language models to make task scheduling for construction robots more efficient, balancing time efficiency against resource utilization. A natural-language interface supports real-time communication with practitioners, and two LLM agents generate more precise task plans.

Comments 9 pages, 5 figures

2605.15480 2026-05-18 cs.RO cs.AI 版本更新

Residual Reinforcement Learning for Robot Teleoperation under Stochastic Delays

残差强化学习用于具有随机延迟的机器人遥控

Kaize Deng, Zewen Yang

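Affiliations * Technical University of Munich

AI summary: To handle the signal discontinuities caused by stochastic delays, the paper proposes a hybrid control framework that combines an LSTM state estimator with a residual reinforcement learning policy to improve teleoperation stability and performance.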

Comments Accepted at 23rd IFAC World Congress 2026

2605.15467 2026-05-18 cs.CL cs.AI 版本更新

Retrieval-Augmented Large Language Models for Schema-Constrained Clinical Information Extraction

基于检索增强的大型语言模型用于受模式约束的临床信息提取

A H M Rezaul Karim, Ozlem Uzuner

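Affiliations * George Mason University

AI summary: The paper proposes a modular retrieval-augmented generation pipeline that combines schema-constrained prompting, deterministic postprocessing, and a second-pass audit, reaching 80.36% F1 on observation extraction from nurse-patient conversations.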

AI中文摘要

对话护士-患者记录包含可操作的观察，但将这些记录转化为结构化表示仍具挑战性。MEDIQA-SYNUR专注于从对话记录中提取观察，要求系统将这些叙述规范化为预定义模式，并满足值-类型约束。我们提出了一种模块化检索增强生成（RAG）流程，利用训练集作为示例语料库，结合模式约束提示（完整模式与剪枝候选模式）、确定性模式后处理和二次审核，并采用两个LLM骨干：Llama-4-Scout-17B-16E-Instruct和GPT-5.2，配以相应的嵌入模型。我们的最佳配置使用GPT-5.2、完整模式、RAG和二次审核，达到80.36%的F1分数。整体结果表明，RAG consistently improves performance，而最佳模式约束程度取决于模型，二次审核通过纠正残余模式一致性错误带来小幅增益。

英文摘要

Conversational nurse-patient transcripts contain actionable observations, but converting these transcripts into structured representations at scale remains challenging. Documentation burden is substantial, with prior studies showing clinicians spend large portions of their workday on documentation and related desk work rather than direct patient care. MEDIQA-SYNUR focuses on observation extraction from conversational nurse-patient transcripts, requiring systems to normalize these narratives into a predefined schema with value-type constraints. We propose a modular retrieval-augmented generation (RAG) pipeline that uses the training set as an exemplar corpus, combines schema-constrained prompting (full schema vs. pruned candidate schema), deterministic schema-based postprocessing, and a second-pass audit, with two LLM backbones: Llama-4-Scout-17B-16E-Instruct and GPT-5.2 with corresponding embedding models for RAG. Our best configuration uses GPT-5.2 with full schema, RAG, and a second-pass auditing, achieving 80.36% F1 score. Overall, our results show that RAG consistently improves performance, while the optimal degree of schema constraint depends on the model, and second-pass auditing yields modest additional gains by correcting residual schema-adherence errors.
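The "deterministic schema-based postprocessing" stage can be illustrated with a small validator that drops or normalizes LLM output that does not satisfy the schema's value-type constraints. The Python sketch below is hypothetical throughout: the three-field `SCHEMA`, the type names `numeric`, `boolean`, and `single_select`, and the function names are invented for illustration and are not the MEDIQA-SYNUR schema or the authors' code.

```python
# Hypothetical schema: observation name -> value-type constraint.
SCHEMA = {
    "heart_rate": {"type": "numeric"},
    "pain_present": {"type": "boolean"},
    "mobility": {"type": "single_select",
                 "options": ["independent", "assisted", "bedbound"]},
}


def coerce(value, spec):
    """Convert value to the schema type, or return None if it does not fit."""
    kind = spec["type"]
    if kind == "numeric":
        if isinstance(value, bool):
            return None
        try:
            return float(value)
        except (TypeError, ValueError):
            return None
    if kind == "boolean":
        if isinstance(value, bool):
            return value
        text = str(value).strip().lower()
        return {"yes": True, "true": True, "no": False, "false": False}.get(text)
    if kind == "single_select":
        text = str(value).strip().lower()
        for option in spec["options"]:
            if option.lower() == text:
                return option  # return the canonical spelling from the schema
        return None
    return None


def normalize(observations, schema):
    """Keep only observations whose name is in the schema and whose value fits its type."""
    cleaned = []
    for obs in observations:
        spec = schema.get(obs.get("name"))
        if spec is None:
            continue  # name not in the schema: drop
        value = coerce(obs.get("value"), spec)
        if value is not None:
            cleaned.append({"name": obs["name"], "value": value})
    return cleaned
```

Because this step is rule-based, it behaves the same on every run, which is presumably why it is paired with a separate LLM audit pass for the errors that rules cannot catch.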


2605.15464 2026-05-18 cs.LG cs.AI cs.CL 版本更新

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

GRLO：从零开始在开放环境中的通用强化学习

Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi

发表机构 * University of California, Riverside（加州大学河滨分校）

AI总结 GRLO研究从少量交互数据中训练的RLHF在开放环境中的泛化能力，探索其对话能力是否能迁移至数学推理和代码生成等下游任务，展示出高效且低成本的训练方法。

AI中文摘要

后训练已成为解锁大型语言模型能力的关键步骤，强化学习（RL）逐渐成为关键范式。近期基于RL的后训练方法日益分化为两种范式：基于人类反馈的强化学习（RLHF），其通过目标领域的人类偏好信号优化模型；以及基于可验证奖励的强化学习（RLVR），其在由验证器支持的环境中运行。后者在近期以推理为导向的后训练中占据主导地位，因为它在领域特定任务（如推理）上提供了更强的增益和更高的效率。然而，尽管领域内RL训练取得了有前景的性能，它仍需要大量的GPU计算资源，这仍然是广泛应用的主要障碍。本文研究了在开放环境中从少量交互数据从零开始学习的RLHF的泛化能力，并探讨其显式获得的对话能力是否能隐式地迁移到数学推理和代码生成等下游任务，即GRLO。具体而言，在Qwen3-4B-Base骨干上，GRLO仅使用5K提示和22.7 GPU小时，就将所有领域的平均性能从24.1提升到63.1，所需数据和计算资源分别比强大的领域内RLVR基线少约46倍和68倍。所得到的模型甚至可与Qwen发布的后训练模型相媲美，而后者需要高得多的训练成本。值得注意的是，后续的领域内RLVR阶段仅带来选择性的增益，主要体现在更难的竞赛数学基准上。我们希望GRLO能为构建具备广泛能力的后训练模型提供一个简单且高效的方案。我们的代码和数据将发布于 https://github.com/SJY8460/GRLO 。

英文摘要

Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation, namely GRLO. Specifically, on the Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about 46× less data and 68× less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen's released post-trained models, which required a much larger training cost. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. Our code and data will be available at https://github.com/SJY8460/GRLO.

2605.15461 2026-05-18 cs.LG cs.AI 版本更新

DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery

DrugSAGE: 自演化代理经验用于高效前沿药物发现

Yikun Zhang, Xiwei Cheng, Tianyu Liu, Yuanqi Du, Wengong Jin

发表机构 * Northeastern University（东北大学）； Broad Institute of MIT and Harvard（MIT和哈佛大学Broad研究所）； Yale University（耶鲁大学）； Microsoft Research New England（微软研究院新英格兰分部）

AI总结 DrugSAGE通过自演化代理经验框架，高效构建前沿药物发现模型，跨任务记忆提升模型性能，实现零次搜索下的显著优势。

AI中文摘要

构建前沿（SOTA）药物发现预测模型需要对工具、架构和训练策略进行代价高昂的搜索。当前基于LLM的代理可以通过大量试错找到前沿解决方案，但不保留过程中积累的经验，因此每个新任务都要支付完整的搜索成本。我们提出DrugSAGE（Self-evolving Agent Experience，自演化代理经验）框架，通过跨任务积累和重用经验来高效构建前沿药物发现模型。DrugSAGE维护一个跨任务记忆，其中包含经过验证的技能、关于有效策略的统计证据，以及反复出现的错误及其修复记录。在某些情况下，DrugSAGE可直接迁移一个可用的解决方案而无需测试时搜索。在33个分子性质预测任务上，DrugSAGE在单任务设置中于九个SOTA代理中排名第一。利用从16个较小任务积累的记忆，DrugSAGE在跨任务评估设置中于17个留出任务上取得0.935的平均归一化分数，并在零测试时搜索的设定下比所有基线代理高出10-30%。总之，我们的工作展示了跨任务记忆对于高效开发药物发现前沿模型的优势。

英文摘要

Building state-of-the-art (SOTA) predictive models for drug discovery requires expensive search over tools, architectures, and training strategies. Current LLM-based agents can find SOTA solutions through extensive trial and error, but they do not retain the experience accumulated along the way and therefore pay the full search cost on every new task. We propose DrugSAGE (Self-evolving Agent Experience), a framework that accumulates and reuses experience across tasks to build SOTA drug discovery models efficiently. DrugSAGE maintains a cross-task memory of verified skills, statistical evidence about effective strategies, and a record of recurring errors and their fixes. In some cases, DrugSAGE transfers a working solution directly without test-time search. In 33 molecular property prediction tasks, DrugSAGE ranks first among nine SOTA agents in a single-task setting. With memory accumulated from 16 smaller tasks, DrugSAGE achieves an averaged normalized score of 0.935 on 17 held-out tasks in a cross-task evaluation setting and outperforms all baseline agents by 10-30% in a zero-test-time search regime. In summary, our work shows the advantage of cross-task memory for efficient SOTA model development in drug discovery.

2605.15460 2026-05-18 cs.IR cs.AI 版本更新

Differentially Private Motif-Preserving Multi-modal Hashing

差分隐私的模体保持多模态哈希

Zehua Cheng, Wei Dai, Jiahao Sun

发表机构 * Department of Computer Science, University of Oxford, Oxford, United Kingdom

AI总结本文提出DMP-MH框架，通过“先净化后蒸馏”（Sanitize-then-Distill）方法在保证边差分隐私的前提下保留多模态数据的图结构模体，实验表明其在保持隐私的同时提升了检索性能。

Comments 9 Pages

AI中文摘要

跨模态哈希通过将图像和文本编码为紧凑的二进制码实现高效检索。最先进的方法依赖于由用户交互导出的语义相似性图进行监督，但这些图编码了敏感的行为模式，易受链接重建攻击。现有隐私保护方法在图结构数据上失效：差分隐私SGD因独立处理样本而破坏关系模体；图合成方法则在无标度网络中面临无界的局部敏感性，即中心节点会使单条边的修改让三角形计数改变O(N)，从而需要注入代价过高的噪声。我们称此现象为Hubness Explosion。本文提出DMP-MH，一种Sanitize-then-Distill（先净化后蒸馏）框架，将隐私与表征学习解耦。我们的方法首先通过确定性地裁剪节点度数来限制敏感性，使三角形模体的L2敏感性上界与数据集规模无关。然后在(ε,δ)-边差分隐私下，通过噪声镜像下降（Noisy Mirror Descent）生成净化后的合成图。最后，双流哈希网络通过一个强制跨模态对齐的整体结构损失来蒸馏该拓扑。在MIRFlickr-25K和NUS-WIDE数据集上按严格的归纳协议评估，DMP-MH的检索性能比私有基线最多高出11.4个mAP点，同时最多保留92.5%的非隐私性能。

英文摘要

Cross-modal hashing enables efficient retrieval by encoding images and text into compact binary codes. State-of-the-art methods rely on semantic similarity graphs derived from user interactions for supervision, yet these graphs encode sensitive behavioral patterns vulnerable to link reconstruction attacks. Existing privacy-preserving approaches fail on graph-structured data: Differentially Private SGD destroys relational motifs by treating samples independently, while graph synthesis methods suffer from unbounded local sensitivity in scale-free networks, where hub nodes cause single-edge modifications to alter triangle counts by $\mathcal{O}(N)$, necessitating prohibitive noise injection. We term this phenomenon Hubness Explosion. We propose DMP-MH, a Sanitize-then-Distill framework that decouples privacy from representation learning. Our approach first bounds sensitivity by deterministically clipping node degrees, capping the $L_2$-sensitivity of triangle motifs independently of dataset size. A sanitized synthetic graph is then generated via Noisy Mirror Descent under $(ε,δ)$-Edge Differential Privacy. Finally, dual-stream hashing networks distill this topology using a holistic structural loss that enforces cross-modal alignment. Evaluated on MIRFlickr-25K and NUS-WIDE under a strict inductive protocol, DMP-MH outperforms private baselines by up to 11.4 mAP points while retaining up to 92.5% of non-private performance.
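The degree-clipping step can be illustrated with a small sketch. The abstract does not specify the clipping rule, so the fixed sorted edge order and the helper names `clip_degrees` and `triangles_through_edge` below are assumptions made purely for illustration. The point it shows is generic: once every node has degree at most `max_degree`, a single edge can take part in at most `max_degree` triangles, no matter how large the graph is.

```python
def clip_degrees(edges, max_degree):
    """Deterministically drop edges so that no node exceeds max_degree.

    Hypothetical rule (not taken from the paper): visit undirected edges in
    sorted order and keep an edge only while both endpoints have spare degree.
    """
    degree = {}
    kept = []
    for u, v in sorted({tuple(sorted(e)) for e in edges}):
        if u == v:
            continue  # ignore self-loops
        if degree.get(u, 0) < max_degree and degree.get(v, 0) < max_degree:
            kept.append((u, v))
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
    return kept


def triangles_through_edge(edges, u, v):
    """Count triangles containing edge (u, v): the common neighbours of u and v."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    common = neighbours.get(u, set()) & neighbours.get(v, set())
    return len(common - {u, v})


# A hub (node 0) with four spokes plus one extra edge, clipped to degree 2.
graph = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
clipped = clip_degrees(graph, max_degree=2)
```

After clipping, the hub keeps only two of its four spokes, so the number of triangles any one edge can create or destroy is bounded by the cap rather than by the hub's original degree.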

2605.15450 2026-05-18 cs.CV cs.AI cs.LG 版本更新

MR2-ByteTrack：基于CNN和Transformer的视频目标检测用于AI增强的嵌入式视觉传感器节点

Luca Bompani, Manuele Rusci, Luca Benini, Daniele Palossi, Francesco Conti

发表机构 * Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy（意大利博洛尼亚大学电气、电子与信息工程系）； Department of Electrical Engineering (ESAT), KU Leuven, Belgium（比利时鲁汶大学电气工程系）； Dalle Molle Institute for Artificial Intelligence (IDSIA), USI-SUPSI, Switzerland（瑞士USI-SUPSI Dalle Molle人工智能研究所）

AI总结本文提出MR2-ByteTrack，一种针对嵌入式视觉节点的视频目标检测方法，通过交替使用全分辨率和低分辨率推理，结合ByteTrack和Rescore算法提升效率，实现在嵌入式设备上的高精度实时检测。

AI中文摘要

现代智能视觉传感器需要设备端智能来处理视频流，因为受带宽、延迟和隐私限制，云计算往往不可行。然而，这些传感系统通常依赖超低功耗微控制器（MCU），其内存和计算能力有限，使得需要特征存储或多帧缓冲的传统视频目标检测方法不可行。为了解决这一挑战，我们引入了多分辨率重评分ByteTrack（MR2-ByteTrack），一种专为基于MCU的嵌入式视觉节点设计的视频目标检测（VOD）方法。MR2-ByteTrack通过交替使用全分辨率和低分辨率推理来降低计算成本，同时通过ByteTrack在帧间关联检测结果，并通过Rescore算法纠正误分类，该算法应用概率并集规则聚合跨帧的检测置信度分数。我们将该方法同时应用于基于CNN的检测器和基于Transformer的模型，证明了其在空间处理方式根本不同的架构之间的通用性。在ImageNetVID上的实验表明，MR2-ByteTrack保持了准确性，基于CNN的模型mAP最高达49.0，Transformer模型的mAP为48.7，同时将CNN的乘加操作最多减少53%，Transformer的最多减少32%。当部署在超低功耗RISC-V多核MCU GAP9上时，我们的方法相比仅处理全分辨率图像最多可节省55%的能耗，实现了在MCU级嵌入式视觉节点上的首个实时的基于Transformer的VOD。代码可在 https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access 获取。

英文摘要

Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, infeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code is available at https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access
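As a rough illustration of what aggregating confidences with a probability union rule can look like, the sketch below fuses the per-frame scores of one track as 1 − ∏(1 − p_i), which is the union probability under an independence assumption. The function names and the per-class grouping are assumptions for this example; the exact Rescore procedure is defined in the paper and its repository, not here.

```python
def union_confidence(scores):
    """Union of independent events: 1 - prod(1 - p_i) over per-frame scores."""
    miss = 1.0
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError("confidence scores must lie in [0, 1]")
        miss *= 1.0 - p
    return 1.0 - miss


def rescore_track(detections):
    """Pick one label for a whole track.

    detections: list of (class_label, confidence) pairs, one per frame in
    which the tracked object was detected (hypothetical input format).
    Returns the label with the highest fused confidence and that confidence.
    """
    per_class = {}
    for label, confidence in detections:
        per_class.setdefault(label, []).append(confidence)
    fused = {label: union_confidence(c) for label, c in per_class.items()}
    best = max(fused, key=fused.get)
    return best, fused[best]


# Two moderately confident "cat" frames outvote one "dog" frame.
label, score = rescore_track([("cat", 0.5), ("dog", 0.6), ("cat", 0.5)])
```

The design choice worth noting is that repeated weak evidence accumulates: two frames at 0.5 fuse to 0.75, which is how a single misclassified frame can be overruled by the rest of the track.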

2605.15417 2026-05-18 cs.LG cs.AI 版本更新

$f$-Trajectory Balance: A Loss Family for Tuning GFlowNets, Generative Models, and LLMs with Off- and On-Policy Data

$f$-轨迹平衡：一种用于调整GFlowNets、生成模型和LLMs的损失家族，结合on-policy和off-policy数据

Jake Fawkes, Jason Hartford

发表机构 * Department of Statistics, University College London, UK（伦敦大学学院统计学系）； Valence Labs, London, UK（伦敦Valence实验室）

AI总结本文提出一种基于$f$-散度的损失家族，通过on-policy和off-policy数据调整生成模型，提升模型覆盖性和泛化能力。

Comments Published at ICML 2026

AI中文摘要

在GFlowNets和变分推断中，已有研究表明，目标与模型对数概率之间的均方误差是训练生成模型的一种有效、低方差的替代损失。该损失具有如下性质：在on-policy情况下，其梯度与KL散度的梯度一致；而在off-policy情况下，它仍是一个有效损失，且具有相同的全局最小值点。本文证明该构造可扩展到整个$f$-散度族，从而得到一族损失函数，其on-policy梯度与相应$f$-散度的梯度一致，同时在off-policy情况下保留相同的全局最小值点。具体而言，我们证明on-policy梯度在目标与模型对数概率上的平移不变损失函数与$f$-散度之间建立了一一对应关系。这种等价性使我们能够为广泛类别的生成模型设计新的替代损失函数，它们继承相应$f$-散度的性质（如更强的模式覆盖），同时适用于off-policy数据。我们将这些损失应用于一系列任务，包括经典合成示例、用于分子发现的SynFlowNets以及异步大语言模型（LLM）调优，证明在广泛类别的生成模型中，我们的模型无论在on-policy还是off-policy情况下都保持了理论预测的性质。

英文摘要

In GFlowNets and variational inference, it has been shown that the mean square error between target and model log probabilities is an effective, low-variance surrogate loss for training generative models. This loss has the property that when evaluated on-policy its gradients correspond to those of the KL divergence, while off-policy it remains a valid loss with the same global minimizer. In this work, we demonstrate that this construction can be extended to the whole family of $f$-divergences, leading to a family of losses whose on-policy gradients are those of the corresponding $f$-divergence and which retain the same global minimizer off-policy. Specifically, we show that the on-policy gradients lead to a one-to-one correspondence between translation-invariant loss functions on the target and model log probabilities, and $f$-divergences. This equivalence allows us to design new surrogate loss functions for tuning a wide class of generative models that inherit the properties of the corresponding $f$-divergence, such as being more mode covering, whilst being applicable to off-policy data. We apply our losses on a range of tasks, including classic synthetic examples, SynFlowNets for molecule discovery, and asynchronous large language model (LLM) tuning, demonstrating that our models retain their predicted properties on- and off-policy in a wide class of generative models.
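The KL special case mentioned at the start of the abstract can be checked in a few lines. The derivation below is a sketch under simplifying assumptions that are not taken from the paper: a normalized target $p$, a model $q_\theta$, a factor of $\tfrac12$ on the squared error, and samples drawn from a fixed distribution $\bar q$ that is not differentiated through. The KL in question is then the reverse KL, $\mathrm{KL}(q_\theta \,\|\, p)$.

```latex
% Surrogate loss with a fixed sampling distribution \bar q (no gradient through \bar q):
\mathcal{L}(\theta) = \tfrac12\, \mathbb{E}_{x \sim \bar q}\!\left[ \big( \log q_\theta(x) - \log p(x) \big)^2 \right],
\qquad
\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{x \sim \bar q}\!\left[ \big( \log q_\theta(x) - \log p(x) \big)\, \nabla_\theta \log q_\theta(x) \right].

% Reverse KL, using \sum_x \nabla_\theta q_\theta(x) = 0:
\nabla_\theta\, \mathrm{KL}(q_\theta \,\|\, p)
= \sum_x \nabla_\theta q_\theta(x)\, \log \frac{q_\theta(x)}{p(x)} + \sum_x \nabla_\theta q_\theta(x)
= \mathbb{E}_{x \sim q_\theta}\!\left[ \log \frac{q_\theta(x)}{p(x)}\, \nabla_\theta \log q_\theta(x) \right].

% On-policy, \bar q = q_\theta, the two gradients coincide:
\nabla_\theta \mathcal{L}(\theta)\,\Big|_{\bar q = q_\theta} = \nabla_\theta\, \mathrm{KL}(q_\theta \,\|\, p).
```

Off-policy, the squared error is still minimized (at zero) exactly when $\log q_\theta = \log p$ on the support of $\bar q$, which is the sense in which the global minimizer is unchanged.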

2605.15412 2026-05-18 cs.CE cs.AI cs.CL 版本更新

From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery

从反馈循环到策略更新：面向基于LLM的alpha因子发现的强化微调

Lingzhe Zhang, Tong Jia, Yunpeng Zhai, Zixuan Xie, Chiming Duan, Minghua He, Philip S. Yu, Ying Li

发表机构 * Peking University（北京大学）； Alibaba Group（阿里巴巴集团）； Nanjing University（南京大学）； University of Illinois Chicago（伊利诺伊大学芝加哥分校）

AI总结本文提出QuantEvolver框架，通过强化微调将可执行量化评估转化为策略更新，提升LLM在alpha因子发现中的表现，生成高质量且互补的因子池。

AI中文摘要

现代量化交易日益依赖系统模型从大规模金融数据中提取预测信号，其中alpha因子发现是将市场观察转化为可交易信号的核心。最近基于LLM的方法在自动化因子生成方面表现出色，但大多数仍依赖提示级生成-评估-反馈循环进行迭代优化。随着循环变长，反复追加的历史候选和反馈会导致上下文爆炸，增加推理成本，稀释有用信息，并引入反馈漂移。此外，这些方法通常依赖非常大的LLM，其稳定的生成偏好可能导致结构相似的表达、冗余候选和搜索停滞。为了解决这些限制，我们提出QuantEvolver，一种基于强化微调的自进化alpha因子发现框架。与在提示中积累反馈不同，QuantEvolver将可执行量化评估转化为策略更新，使Miner LLM通过参数学习内化历史优化经验。具体而言，QuantEvolver构建高质量种子因子，构建多样化的种子-时间窗训练任务，生成可执行的Factor DSL表达式，通过Regime Backtest进行评估，并通过多样性-互补性奖励优化Miner LLM。在训练过程中，高质量因子持续积累在Mined Factor Database中，最终成为发现的因子库。在三个现实市场基准上的广泛实验表明，QuantEvolver的有效性，其在每个任务的主要评估指标上均优于现有基于LLM的alpha因子发现基线，产生更高质量和更互补的因子池。

英文摘要

Modern quantitative trading increasingly relies on systematic models to extract predictive signals from large-scale financial data, where alpha factor discovery plays a central role in transforming market observations into tradable signals. Recent LLM-based methods have shown promise in automating factor generation, but most of them still rely on prompt-level generation-evaluation-feedback loops for iterative optimization. As the loop becomes longer, repeatedly appended historical candidates and feedback can cause context explosion, increase inference cost, dilute useful information, and introduce feedback drift. Moreover, these methods often depend on very large LLMs whose stable generation preferences may lead to structurally similar expressions, redundant candidates, and search stagnation. To address these limitations, we propose QuantEvolver, a self-evolving alpha factor discovery framework based on reinforcement fine-tuning. Instead of accumulating feedback in the prompt, QuantEvolver converts executable quantitative evaluation into policy updates, enabling a Miner LLM to internalize historical optimization experience through parameter learning. Specifically, QuantEvolver constructs high-quality seed factors, builds diverse seed-time-window training tasks, generates executable Factor DSL expressions, evaluates them through Regime Backtest, and optimizes the Miner LLM with Diversity-Complementarity Reward. During training, high-quality factors are continuously accumulated in a Mined Factor Database, which serves as the final discovered factor library. Extensive experiments on three realistic market benchmarks demonstrate the effectiveness of QuantEvolver, which consistently improves the primary evaluation metric of each task over existing LLM-based alpha factor discovery baselines and produces higher-quality and more complementary factor pools.

2605.15410 2026-05-18 quant-ph cs.AI cs.LG 版本更新

Diagonal Adaptive Non-local Observables on Quantum Neural Networks

量子神经网络上的对角自适应非局部可观测量

Huan-Hsin Tseng, Yan Li, Hsin-Yi Lin, Samuel Yen-Chi Chen

发表机构 * AI & ML Department, Brookhaven National Laboratory, Upton, NY, USA； Department of Electrical Engineering, The Pennsylvania State University, University Park, PA, USA

AI总结本文提出了一种对角自适应非局部可观测量，通过仅考虑对角可观测量与量子电路的组合，降低了参数数量和经典优化成本，同时保持了全非局部可观测量的能力。

Comments Accepted at ICCCN2026

AI中文摘要

自适应非局部可观测量（ANO）的研究表明，让量子可观测量动态可调可以显著扩大变分量子算法的函数空间，并将部分硬件需求从电路合成转移到测量设计。然而，这一优势伴随着参数数量的急剧增加，以及对一般厄米可观测量进行变分所需的经典优化成本。我们提出了一种特殊形式的ANO，仅考虑与量子电路配对的对角可观测量，从而显著减轻这一负担。在数学上，这等价于大参数空间中的完整ANO，因为在幺正相似的意义下，对角矩阵是ANO空间的规范代表元。因此，对角ANO保持了与完整ANO相同的能力，同时将k-局部可观测量的复杂度从O(4^k)降低到O(2^k)，并降低了相应的测量侧经典计算量。在这个意义上，对角ANO保留了完整ANO的大部分优势，并将传统VQC作为特例包含在内。

英文摘要

Adaptive Non-local Observables (ANOs) have shown that making quantum observables dynamic can substantially enlarge the function space of Variational Quantum Algorithms, partly shifting hardware demands from circuit synthesis to measurement design. However, this advantage is accompanied by a steep increase in the number of parameters, as well as the classical optimization cost for varying general Hermitian observables. We propose a special form of ANO that significantly reduces this burden by considering only diagonal observables paired with quantum circuits. Mathematically, this is equivalent to the full ANO of a large parameter space since diagonal matrices are canonical representatives of the ANO space modulo unitary similarity. As a result, Diagonal ANO retains the same capability of full ANO while reducing $k$-local observable complexity from $O(4^k)$ to $O(2^k)$ and lowering the corresponding measurement-side classical computation. In this sense, diagonal ANO preserves much of the benefit of full ANO while encompassing conventional VQCs as a special case.
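The $O(4^k)$ versus $O(2^k)$ gap is ordinary parameter counting, which the short sketch below makes concrete. The function names are hypothetical and nothing here reflects how the paper actually parameterizes its observables; it only counts real degrees of freedom and shows that a diagonal observable's expectation is a weighted sum of measurement-outcome probabilities.

```python
def hermitian_param_count(k):
    """Real parameters of a general Hermitian observable on k qubits.

    A d x d Hermitian matrix (d = 2**k) has d real diagonal entries and
    d*(d-1)/2 complex off-diagonal entries, i.e. d*d = 4**k real numbers.
    """
    dim = 2 ** k
    return dim * dim


def diagonal_param_count(k):
    """Real parameters of a diagonal observable on k qubits: 2**k."""
    return 2 ** k


def diagonal_expectation(probabilities, eigenvalues):
    """Expectation of a diagonal observable D after a circuit U.

    <psi|U^dagger D U|psi> = sum_i d_i * Pr(outcome i), where probabilities
    are the computational-basis outcome probabilities of U|psi>.
    """
    if len(probabilities) != len(eigenvalues):
        raise ValueError("need one eigenvalue per measurement outcome")
    return sum(p * d for p, d in zip(probabilities, eigenvalues))


# For k = 3 qubits the general observable needs 64 numbers, the diagonal one 8.
counts = (hermitian_param_count(3), diagonal_param_count(3))
```

Because the diagonal case needs only outcome probabilities, the classical side reduces to a dot product over $2^k$ entries instead of handling a full $2^k \times 2^k$ matrix.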

2605.15400 2026-05-18 cs.AI 版本更新

Beyond Partner Diversity: An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming

超越伙伴多样性：一种基于影响的团队引导框架用于零样本人机协同

Wei Sheng, Rohan Paleja

发表机构 * Department of Computer Science（计算机科学系）

AI总结本文提出基于影响的团队引导框架IBTS，通过影响塑造激励智能体发现多样化的高绩效团队交互模式，提升团队表现，强调需结合稀疏奖励协调机制与伙伴多样性覆盖。

AI中文摘要

尽管AI代理正迅速从孤立工具发展为交互式合作者，数据驱动的人机协同（HMT）方法仍依赖跨领域、跨队友和跨团队规模的人类交互数据，成本高昂。零样本协调（ZSC）通过模拟多样化的伙伴群体来近似未见伙伴的可能行为，从而缓解这一瓶颈。然而，随着团队规模扩大和通信退化，仅靠伙伴覆盖是不够的。为此，本文提出基于影响的团队引导（IBTS）框架，利用影响塑造来激励智能体发现多样化的高绩效团队交互模式，并进一步引导进行中的轨迹向更强的已学协调模式发展。我们在Overcooked-AI的双智能体和三智能体设置中评估IBTS，以检验学到的协调结构能否迁移到二元交互之外。评估包括模拟伙伴、合成的伙伴风格变化，以及据我们所知首个涉及两名真实人类队友和一名机器队友的30人Overcooked-AI HMT研究。在这些评估中，IBTS相对于对比基线提升了团队表现，表明规模化的ZSC需要将稀疏奖励协调机制与伙伴变化覆盖相结合，而非仅依赖多样性。

英文摘要

While AI agents are rapidly advancing from isolated tools to interactive collaborators, data-driven human-machine teaming (HMT) methods remain costly in their reliance on human interaction data across domains, teammates, and team sizes. Zero-shot coordination (ZSC) addresses this bottleneck by simulating diverse partner populations to approximate how unseen partners might behave. However, partner coverage alone is insufficient as team settings scale and communication becomes degraded. To remedy this deficiency, we propose Influence-Based Team Steering (IBTS), a framework that uses influence shaping to incentivize agents to discover diverse, high-performing team interaction patterns and further steers ongoing trajectories toward stronger learned coordination modes. We assess IBTS on Overcooked-AI in both two-agent and three-agent settings, allowing us to test whether learned coordination structure transfers beyond dyadic interaction. Our evaluation includes simulated partners, synthetic partner-style variation, and, to our knowledge, the first 30-subject Overcooked-AI HMT study involving two real human teammates and one machine teammate. Across these evaluations, IBTS improves team performance against competing baselines, highlighting the need for scaled ZSC to combine sparse-reward coordination mechanisms with partner-variation coverage rather than relying on diversity alone.

2605.15399 2026-05-18 cs.LG cs.AI cs.NA math.NA physics.comp-ph 版本更新

Breakeven complexity: A new perspective on neural partial differential equation solvers

盈亏平衡复杂度：神经偏微分方程求解器的新视角

Yijing Zhang, Nicholas Roberts, Tanya Marwah, Mikhail Khodak

发表机构 * University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； Google DeepMind（谷歌DeepMind）

AI总结本文提出以盈亏平衡复杂度为核心的评估框架，将神经求解器的前期成本与传统求解器的低保真度求解能力纳入考量，分析神经PDE求解器在何种问题上具有成本效益。

AI中文摘要

偏微分方程（PDE）的神经替代求解器有望相比数值方法带来显著加速，尤其是在需要多次求解的场景中。然而，现有基于精度的评估未充分考虑两个核心问题：(1) 神经求解器在数据生成、训练和调优上存在可观的前期成本；(2) 经典求解器同样可以以足够低的模拟成本生成低保真度解。为明确考虑这些现实并完整纳入端到端成本，我们提出以盈亏平衡复杂度（breakeven complexity）为核心的评估框架，该指标统计学习型求解器相对于误差相当的传统求解器变得具有成本效益之前所需的前向求解次数。为评估该指标，我们应用缩放定律来确定应将多少训练预算分配给数据生成，并讨论如何在不同设置下实现平滑的误差匹配。我们在APEBench中三个二维周期域上的PDE，以及一个由GPU原生PyFR代码生成的多障碍物绕流新基准上，评估了多个神经PDE求解器的盈亏平衡复杂度。除其他发现外，我们的结果表明，随着问题在成本、维度、推演（rollout）、物理区间（如更高雷诺数）等方面变得更难，神经PDE求解器会变得更有效。

英文摘要

Neural surrogate solvers of partial differential equations (PDEs) promise dramatic speedups over numerical methods, especially in scenarios requiring many solves. However, current accuracy-based evaluations do not fully consider two central issues: (1) neural solvers incur substantial up-front costs for data generation, training, and tuning; and (2) classical solvers can also generate low-fidelity solutions at a sufficiently low simulation cost. To explicitly account for these realities and fully incorporate end-to-end costs, we propose an evaluation framework centered on breakeven complexity, a metric that counts the forward solves before a learned solver is cost-effective relative to an error-equivalent traditional solver. To evaluate this measure, we apply scaling laws to determine how much training budget to allocate to data generation and discuss how to achieve smooth error-matching in diverse settings. We evaluate the breakeven complexity of multiple neural PDE solvers on three PDEs on 2D periodic domains from APEBench and a novel benchmark of flows past multiple obstacles generated by the GPU-native PyFR code. Among other findings, our results suggest that neural PDE solvers become more effective as problems get harder in terms of cost, dimension, rollout, physics regime (e.g. higher Reynolds number), etc.
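To make the idea of a breakeven point tangible, here is a minimal sketch under a deliberately simple cost model that is an assumption of this example, not the paper's definition: the neural solver pays a one-off `upfront_cost` (data generation, training, tuning) plus a constant cost per solve, and the error-matched classical solver pays only a constant cost per solve. The function name and arguments are hypothetical.

```python
import math


def breakeven_solves(upfront_cost, neural_cost_per_solve, classical_cost_per_solve):
    """Smallest number of solves n with
        upfront_cost + n * neural_cost_per_solve <= n * classical_cost_per_solve.

    Returns math.inf when the neural solver is not cheaper per solve, because
    it can then never amortise its up-front cost under this linear model.
    """
    saving_per_solve = classical_cost_per_solve - neural_cost_per_solve
    if saving_per_solve <= 0:
        return math.inf
    return math.ceil(upfront_cost / saving_per_solve)


# 1000 cost units up front, saving 2 units per solve: breakeven after 500 solves.
n_star = breakeven_solves(1000.0, 1.0, 3.0)
```

Under this model a lower breakeven count means the learned solver pays off sooner; a classical solver run at lower fidelity shrinks `classical_cost_per_solve` and pushes the breakeven point out, which is the second issue the abstract raises.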

2605.15394 2026-05-18 cs.LG cs.AI stat.ML 版本更新

Representation Without Reward: A JEPA Audit for LLM Fine-Tuning

无奖励的表示：用于LLM微调的JEPA审计

Biswa Sengupta

发表机构 * LLM Suite group of JP Morgan Chase and its affiliates（摩根大通及其附属机构的LLM Suite团队）

AI总结本文在固定的Llama-3.2-1B-Instruct LoRA设置下，针对自然语言到正则表达式生成任务审计了22种JEPA式训练辅助项，发现若干辅助项在未校正的检验下显著，但均未通过多重比较校正，整体呈现结构化的零结果。

AI中文摘要

联合嵌入预测架构（JEPA）主张：当模型被训练为预测潜在表示而非观测输出时，应能学到更有用的抽象。对于自回归语言模型微调，这一原则意味着更严格的要求：所诱导的隐藏状态几何必须到达语言模型头，并且提升解码后的任务指标。我们在固定的Llama-3.2-1B-Instruct LoRA实验框架下，针对自然语言到正则表达式生成任务检验了这一要求，比较了22种训练时辅助项，涵盖轨迹形状正则化、分布约束、预测器/目标不对称性、Fisher度量Jacobi残差，以及一个被构造为位于交叉熵正锥内的解码器可见JEPA目标。经验结果是一个结构化的零结果：若干辅助项在未经校正的单单元配对检验中达到α=0.10的显著性（其中T3-Local最强，Δ=+2.53 pp，p=0.003），但在相应的族错误率阈值下无一通过Bonferroni或Holm-Bonferroni校正，尽管其中许多辅助项改变了曲率、各向异性、方差和梯度方向。解码器可见的JEPA产生了本研究中第一个为正的辅助项与交叉熵梯度余弦，但精确匹配率仍处于种子噪声范围内；在n=5个种子的全量微调复现中，同一辅助项在两个基准上均重现了零结果（TURK：Δ=+0.04 pp，p_paired=0.96；SYNTH：Δ=+0.52 pp，p_paired=0.28），因此对于解码器可见的构造，该零结果在LoRA和全量微调下都是稳健的。在这一设定下，隐藏状态表示层面的改动与解码任务准确率之间仅弱耦合；我们据此将LLM领域的JEPA评估重新表述为一个耦合问题，其核心问题是：在哪些指标下，有用的隐藏几何会成为解码器可见的任务信号。

英文摘要

Joint-embedding predictive architectures (JEPAs) propose that a model should learn more useful abstractions when trained to predict latent representations rather than observed outputs. For autoregressive language-model fine-tuning the principle entails a stricter requirement: the induced hidden-state geometry must reach the language-model head and improve the decoded task metric. We test that requirement under a fixed Llama-3.2-1B-Instruct LoRA harness on natural-language-to-regex generation, comparing twenty-two training-time auxiliaries across trajectory-shape regularisation, distributional constraints, predictor/target asymmetry, Fisher-metric Jacobi residuals, and a decoder-visible JEPA objective constructed to lie in cross-entropy's positive cone. The empirical answer is a structured null: several auxiliaries clear single-cell paired $α = 0.10$ without correction (T3-Local at $Δ = +2.53$ pp, $p = 0.003$ being the strongest), but none survives Bonferroni or Holm-Bonferroni at the relevant family-wise threshold, even though many change curvature, anisotropy, variance, and gradient direction. Decoder-visible JEPA yields the first positive auxiliary-to-cross-entropy gradient cosine in the study, yet exact match remains inside seed noise; a full-fine-tuning replication of the same auxiliary at $n = 5$ seeds reproduces the null on both benchmarks (TURK: $Δ = +0.04$ pp, $p_{\text{paired}} = 0.96$; SYNTH: $Δ = +0.52$ pp, $p_{\text{paired}} = 0.28$), so the null is robust across LoRA and full fine-tuning for the decoder-visible construction. Hidden-state representation work and decoded-task accuracy are therefore weakly coupled in this regime; we accordingly reframe LLM-domain JEPA evaluation as a coupling problem, in which the operative question is under which metrics useful hidden geometry becomes decoder-visible task signal.
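For readers unfamiliar with the two corrections named above, the sketch below implements the textbook Bonferroni and Holm-Bonferroni procedures. The p-values in the usage lines are made up for illustration and are not taken from the paper, whose test family and thresholds are not fully specified in the abstract.

```python
def bonferroni(p_values, alpha):
    """Reject hypothesis i when p_i <= alpha / m, with m the family size."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_bonferroni(p_values, alpha):
    """Holm's step-down procedure.

    Sort p-values ascending; compare the one at rank r (0-based) with
    alpha / (m - r) and stop at the first failure. Everything before the
    failure is rejected, everything from it onwards is retained.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


# Illustrative (invented) p-values for a family of three tests.
example_p = [0.01, 0.02, 0.04]
bonf = bonferroni(example_p, alpha=0.05)
holm = holm_bonferroni(example_p, alpha=0.05)
```

Holm is never more conservative than Bonferroni at the same family-wise level, so a result that fails Holm also fails Bonferroni; with more tests in the family both thresholds tighten, which is why individually small p-values can still fail to survive.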

2605.15391 2026-05-18 cs.CV cs.AI 版本更新

PanoWorld: Geometry-Consistent Panoramic Video World Modeling

PanoWorld：几何一致的全景视频世界建模

Le Jiang, Xiangyu Bai, Bishoy Galoaa, Shayda Moezzi, Caleb James Lee, Tooba Imtiaz, Edmund Yeh, Jennifer Dy, Yanzhi Wang, Sarah Ostadabbas

发表机构 * Northeastern University（东北大学）

AI总结 PanoWorld通过几何和动态一致性建模生成一致的360度视频，提升了空间理解能力，适用于具身AI应用。

2605.15384 2026-05-18 cs.LG cs.AI 版本更新

Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory

一个评分够吗？重新思考序列演进LLM记忆的评估

Songwei Dong, Zihan Chen, Chengshuai Shi, Peng Wang, Jundong Li, Cong Shen

发表机构 * University of Virginia（弗吉尼亚大学）； Princeton University（普林斯顿大学）

AI总结本文提出SeqMem-Eval框架，通过评估记忆状态的演变、泛化、经验巩固和信息保留，揭示传统指标无法捕捉的记忆质量差异。

Comments 29 pages, 13 figures

AI中文摘要

记忆在使大语言模型（LLM）能够处理序列任务中起着核心作用，通过积累和重用经验实现时间连续性。然而，现有LLM记忆评估大多依赖汇总指标如最终验证准确率或累积在线性能，这可能掩盖诸如遗忘和负迁移等关键失败模式。本文引入SeqMem-Eval，一种用于序列演进LLM记忆的诊断评估框架。受持续学习启发，它针对一种测试时间设置，其中记忆是外部的、提示介导的，并且在不修改模型参数的情况下更新。与只关注最终性能不同，SeqMem-Eval评估记忆状态在连续推理中的演变、泛化、经验巩固和信息保留。具体而言，它测量在线效用、验证泛化、反向迁移和遗忘，提供更细致的记忆质量视角。通过在多样任务和记忆方法上的广泛实验，我们显示更高的最终或累积准确性不必然意味着更好的记忆质量：许多方法表现出强劲的性能提升，同时遭受显著的遗忘或负迁移。此外，不同记忆设计在适应性和稳定性之间表现出不同的权衡，这些权衡在标准评估指标下是不可见的。

英文摘要

Memory plays a central role in enabling large language models (LLMs) to operate over sequential tasks by accumulating and reusing experience over time. However, existing evaluations of LLM memory mostly rely on aggregate metrics such as final hold-out accuracy or cumulative online performance, which can obscure critical failure modes such as forgetting and negative transfer. In this paper, we introduce SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. Drawing inspiration from continual learning, it targets a test-time setting in which memory is external, prompt-mediated, and updated without modifying model parameters. Rather than focusing only on final performance, SeqMem-Eval evaluates how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference. Specifically, it measures online utility, hold-out generalization, backward transfer, and forgetting, providing a finer-grained view of memory quality. Through extensive experiments across diverse tasks and memory methods, we show that higher final or cumulative accuracy does not necessarily imply better memory quality: many methods exhibit strong performance gains while suffering from substantial forgetting or negative transfer. Moreover, different memory designs exhibit distinct trade-offs between adaptability and stability that remain invisible under standard evaluation metrics.
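Backward transfer and forgetting have standard definitions in the continual-learning literature, and the sketch below uses those. Whether SeqMem-Eval defines them identically is an assumption; the accuracy matrix `R` and both function names are introduced here only for illustration. `R[t][i]` is the accuracy on task `i` measured after the memory has processed tasks `0..t`.

```python
def backward_transfer(R):
    """Average change on earlier tasks between when each was learned and the end.

    BWT = mean over i < T-1 of (R[T-1][i] - R[i][i]). Negative values mean
    later updates hurt earlier tasks.
    """
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


def forgetting(R):
    """Average drop from the best accuracy ever reached to the final accuracy.

    For each task i < T-1: max over t in [i, T-2] of R[t][i], minus R[T-1][i].
    """
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    total = 0.0
    for i in range(T - 1):
        best_before_end = max(R[t][i] for t in range(i, T - 1))
        total += best_before_end - R[T - 1][i]
    return total / (T - 1)


# Invented accuracies for three sequential tasks (entries above the diagonal unused).
R = [
    [0.8, 0.0, 0.0],
    [0.6, 0.7, 0.0],
    [0.5, 0.9, 0.6],
]
bwt = backward_transfer(R)
fgt = forgetting(R)
```

In this toy matrix the final average accuracy looks reasonable, yet task 0 has dropped from 0.8 to 0.5, which is the kind of failure an aggregate score hides.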

2605.15375 2026-05-18 cs.CV cs.AI 版本更新

ChangeFlow -- Latent Rectified Flow for Change Detection in Remote Sensing

ChangeFlow -- 潜在修正流用于遥感中的变化检测

Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc

发表机构 * University of Ljubljana, Faculty of Computer and Information Science（卢布尔雅那大学计算机与信息科学学院）

AI总结本文提出ChangeFlow框架，通过潜在空间中的修正流合成变化掩码，以生成分布中的可能掩码，提升全局一致性与鲁棒性，实现80.4%的平均F1分数。

AI中文摘要

遥感变化检测（RSCD）旨在定位同一地理区域两幅图像之间的变化。在实践中，变化掩码通常遵循区域级注释惯例而非纯粹的局部外观差异，使其具有上下文依赖性和偶尔的模糊性。大多数最先进的方法使用逐像素判别分类，产生单个预测，无法显式建模变化区域作为整体。生成式方法是自然替代方案，可建模可能掩码的分布，使采样能捕捉模糊性并鼓励全局一致性。然而，现有生成式RSCD方法通常落后于强大判别基线，由于像素空间生成的高计算成本和其条件机制的复杂性。为了解决判别和生成方法的局限性，我们提出ChangeFlow，一种生成框架，通过潜在空间中的修正流重新表述变化检测为变化掩码的合成。ChangeFlow由结构化但轻量级的条件信号引导，其随机设计自然支持基于采样的预测融合。即，聚合多个预测的变化掩码提高鲁棒性，而样本一致性提供实用的置信度估计，突出模糊区域。在四个基准上，ChangeFlow实现80.4%的平均F1分数，比先前最佳方法平均提高1.3个百分点，同时保持与最近强大基线相当的推理速度。项目页面：https://blaz-r.github.io/changeflow_cd

英文摘要

Remote sensing change detection (RSCD) aims to localise changes between two images of the same geographic region. In practice, change masks often follow region-level annotation conventions rather than purely local appearance differences, making them context-dependent and occasionally ambiguous. Most state-of-the-art methods utilise per-pixel discriminative classification, which produces a single prediction per input and fails to explicitly model the changed region as a coherent whole. A natural alternative is a generative formulation, which can model a distribution of plausible masks, enabling sampling to capture ambiguity and encourage global consistency. However, existing generative RSCD approaches typically lag behind strong discriminative baselines due to the high computational cost of pixel-space generation and the complexity of their conditioning mechanisms. To address the limitations of prior discriminative and generative methods, we propose ChangeFlow, a generative framework that reformulates change detection as the synthesis of a change mask in latent space via rectified flow. ChangeFlow is guided by a structured yet lightweight conditioning signal, and its stochastic design naturally supports sampling-based prediction ensembling. Namely, aggregating multiple predicted change masks improves robustness, while sample agreement provides a practical confidence estimation that highlights ambiguous regions. Across four benchmarks, ChangeFlow achieves an average F1 of 80.4%, improving by 1.3 points on average over the previous best method, while maintaining inference speed comparable to recent strong baselines. Project page: https://blaz-r.github.io/changeflow_cd
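Sampling-based ensembling with an agreement-based confidence can be sketched as a per-pixel majority vote. The aggregation rule, the tie-breaking, and the function name below are assumptions for illustration; the abstract does not state how ChangeFlow fuses its samples.

```python
def ensemble_masks(masks):
    """Fuse several sampled binary change masks by per-pixel majority vote.

    masks: non-empty list of equally sized 2D lists with entries 0 or 1.
    Returns (fused_mask, agreement), where agreement[r][c] is the fraction
    of samples that agree with the fused label at that pixel. Ties are
    resolved in favour of "change" (label 1).
    """
    if not masks:
        raise ValueError("need at least one sampled mask")
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    fused, agreement = [], []
    for r in range(rows):
        fused_row, agree_row = [], []
        for c in range(cols):
            votes = sum(m[r][c] for m in masks)
            label = 1 if 2 * votes >= n else 0
            fused_row.append(label)
            agree_row.append((votes if label else n - votes) / n)
        fused.append(fused_row)
        agreement.append(agree_row)
    return fused, agreement


# Three sampled 1x2 masks: all agree on the first pixel, one dissents on the second.
samples = [[[1, 0]], [[1, 1]], [[1, 0]]]
fused, agreement = ensemble_masks(samples)
```

Pixels where agreement is close to 0.5 are the ones the samples disagree on, so a map of low agreement is a cheap way to flag ambiguous regions for review.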

2605.15353 2026-05-18 cs.LG cs.AI q-bio.MN q-bio.QM 版本更新

PACER: Acyclic Causal Discovery from Large-Scale Interventional Data

PACER：从大规模干预数据中进行无环因果发现

Ramon Viñas Torné, Sílvia Fàbregas Salazar, Soyon Park, Ivo Alexander Ban, Artyom Gadetsky, Nikita Doikov, Maria Brbić

发表机构 * Swiss Federal Technology Institute of Lausanne (EPFL), Switzerland（瑞士联邦理工学院洛桑分校）； Cornell University, USA（康奈尔大学）； ETH Zurich, Zurich, Switzerland（苏黎世联邦理工学院）

AI总结 PACER通过构建无环性保证的因果发现框架，在大规模高维干预数据中实现高效且准确的因果结构推断，优于现有方法。

Comments Accepted at the 43rd International Conference on Machine Learning (2026)

AI中文摘要

从数据中推断有向无环图（DAG）的结构是因果发现中的核心挑战，特别是在现代高维设置中，大规模干预数据日益可用。尽管干预数据可以提高可识别性，但现有方法仍受软无环约束限制，导致优化无效环图、数值不稳定和可扩展性差。我们引入PACER（扰动驱动无环因果边恢复），一种可扩展的因果发现框架，通过构建无环性保证的结构进行优化。PACER通过变量排列和边概率的联合模型参数化DAG分布，使可以直接优化有效因果结构而无需替代惩罚。该框架支持观察性和干预性数据的统一似然处理，灵活的条件密度模型以及结构先验知识的整合。对于线性高斯机制，我们推导出干预对数似然和梯度的闭式表达式，获得显著的计算增益。实证上，PACER在蛋白质信号和大规模基因扰动基准上匹配或超过最先进方法，同时高效扩展到具有千变量的网络，并在基于惩罚的可微方法上实现高达两数量级的速度提升。这些结果表明，通过原则性的搜索空间设计，从高维扰动数据中实现精确且可扩展的因果发现是可能的。

英文摘要

Inferring the structure of directed acyclic graphs (DAGs) from data is a central challenge in causal discovery, particularly in modern high-dimensional settings where large-scale interventional data are increasingly available. While interventional data can improve identifiability, existing methods remain limited by soft acyclicity constraints, leading to optimization over invalid cyclic graphs, numerical instability, and reduced scalability. We introduce PACER (Perturbation-driven Acyclic Causal Edge Recovery), a scalable framework for causal discovery that guarantees acyclicity by construction. PACER parameterizes a distribution over DAGs through a joint model of variable permutations and edge probabilities, enabling direct optimization over valid causal structures without surrogate penalties. The framework supports a unified likelihood-based treatment of observational and interventional data, flexible conditional density models, and the incorporation of structural prior knowledge. For linear-Gaussian mechanisms, we derive closed-form expressions for the expected interventional log-likelihood and its gradients, yielding substantial computational gains. Empirically, PACER matches or exceeds state-of-the-art methods on protein signaling and large-scale genetic perturbation benchmarks, while scaling efficiently to networks with thousands of variables and achieving up to two orders of magnitude speedups over penalty-based differentiable approaches. These results demonstrate that exact and scalable causal discovery from high-dimensional perturbation data is achievable through principled search space design.
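The reason a permutation-plus-edge-probability parameterization is acyclic by construction can be shown in a few lines: if edges are only allowed from earlier to later positions in an ordering, no cycle can form. The sketch below fixes one ordering and samples edges independently, which is a simplification; PACER models a joint distribution over permutations and edges, and the names `sample_dag` and `is_acyclic` are hypothetical.

```python
import random


def sample_dag(order, edge_prob, rng):
    """Sample an adjacency matrix that respects a variable ordering.

    order: a permutation of range(n); edge i -> j is only considered when
    i precedes j in the ordering, so the result is always a DAG.
    edge_prob[i][j]: probability of including edge i -> j.
    """
    n = len(order)
    position = {node: k for k, node in enumerate(order)}
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if position[i] < position[j] and rng.random() < edge_prob[i][j]:
                adj[i][j] = 1
    return adj


def is_acyclic(adj):
    """Kahn's algorithm: a graph is acyclic iff every node can be peeled off."""
    n = len(adj)
    indegree = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    ready = [j for j in range(n) if indegree[j] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in range(n):
            if adj[u][v]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)
    return removed == n


# With all edge probabilities at 1.0 we get the full DAG for ordering 2, 0, 1.
dense = sample_dag([2, 0, 1], [[1.0] * 3 for _ in range(3)], random.Random(0))
```

Because validity is guaranteed by the parameterization, an optimizer never visits cyclic graphs and needs no acyclicity penalty term.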

2605.15343 2026-05-18 cs.AI cs.LG cs.MA (updated version)

Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation

Joshua C. Yang, Maurice Flechtner, Damian Dailisan, Michiel A. Bakker

Affiliations: ETH Zurich; Centre for Democracy Studies Aarau, University of Zurich; Massachusetts Institute of Technology

AI summary: This paper proposes Belief Engine, which uses a configurable belief-update mechanism to study stance dynamics in multi-agent deliberation and to expose the evidence uptake and anchoring factors behind stance changes.

Abstract

LLM-based agents are increasingly used to simulate deliberative interactions such as negotiation, conflict resolution, and multi-turn opinion exchange. Yet generated transcripts often do not reveal why an agent's stance changes: movement may reflect evidence uptake, anchoring, role drift, echoing, or changed prompt and retrieval context. We introduce the Belief Engine (BE), an auditable belief-update layer that treats "belief" as an evidential state over a proposition and exposes it as scalar stance. BE extracts arguments into structured memory and updates stance with a log-odds rule controlled by evidence uptake u and prior anchoring a. Across multiple base LLMs, parameter sweeps show that these controls reliably shape stance dynamics while preserving an evidence-level update trail. On DEBATE, a human deliberation dataset with pre/post opinions, BE best reconstructs participants whose final stance follows extracted evidence; stable and evidence-opposed cases instead point to anchoring or factors outside the extracted evidence stream. BE provides configurable infrastructure for studying evidence-grounded deliberation, where openness, commitment, convergence, and disagreement can be tied to explicit update assumptions rather than hidden prompt effects.
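
A log-odds update with an uptake weight and an anchoring weight can be sketched as follows. The abstract does not give BE's exact rule, so the functional form here is an assumption: stance is treated as a probability in (0, 1), `evidence` is a signed log-odds contribution, `u` scales how much of it is absorbed, and `a` pulls the stance back toward the agent's initial prior. The name `update_stance` is hypothetical.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def update_stance(stance, prior, evidence, u, a):
    """One assumed update step in log-odds space.

    stance:   current stance as a probability in (0, 1)
    prior:    initial stance the agent is anchored to
    evidence: signed log-odds weight of the extracted arguments
    u:        evidence uptake (0 ignores evidence)
    a:        prior anchoring (0 means no pull toward the prior)
    """
    current = logit(stance)
    anchor_pull = a * (logit(prior) - current)
    return sigmoid(current + u * evidence + anchor_pull)
```

Under this assumed form, `u = 0, a = 0` leaves the stance unchanged, and `u = 0, a = 1` resets it to the prior, which makes the two controls easy to inspect separately.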

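2605.15341 2026-05-18 cs.LG cs.AI (updated version)

LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design

Marilyn Zhang, Tianfeng Chen, Fabián Barzuna, Ankita Rathod, Mark E. Whiting

AI summary: This paper proposes the LEAPBench framework, which uses trajectory-level evaluation to measure the learning efficiency of LLMs in iterative scientific design. It finds that conventional outcome-based evaluation is biased and that trajectory metrics reflect efficiency gains more accurately.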

Abstract

LLMs are increasingly deployed in autonomous laboratories, under the assumption that their domain priors and reasoning over iterative feedback let them converge on good designs in fewer iterations than feedback-only baselines. Current iterative scientific design benchmarks, however, score only outcome snapshots at fixed horizons. This leaves the learning trajectory unmeasured, even though the trajectory is what captures learning efficiency, where each iteration saved is a real saving in cost and time. Motivated by this, we examine three evaluation choices that change the conclusions one draws about LLM learning efficiency in iterative scientific design: what to measure, what baseline to compare against, and what to ground against. We introduce LEAPBench, Learning Efficiency in Adaptive Processes, a 55-task framework that pairs a best-so-far area under the curve (AUC) trajectory metric with a classical Bayesian-optimization reference and an audit grounded in published literature. Applied to eight contemporary LLMs, switching from final-outcome to trajectory scoring changes the best-model decision on 53% of tasks at matched horizons, and exposes efficiency gains overlooked by outcome-based scoring. LLMs do not outperform a classical Bayesian baseline. On 16 biology tasks where the oracle's reward signal is aligned with configurations from the published-best design, domain-aware prompting leads to LLM choices that match the published-best's approximately 10 percentage points less often than domain-agnostic prompting at iteration 30. The pattern is sharpest on 6 tasks where the literature-typical and published-best configurations diverge, and domain-agnostic prompting matches the published-best more often on all 6. The trajectory metric also doubles as a tractable training target. Offline reinforcement learning with the metric as a reward improves performance on 14 of 21 held-out tasks.
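
The difference between outcome scoring and a best-so-far AUC trajectory score can be seen in a short sketch. The normalization to known bounds and the per-iteration averaging are assumptions made for illustration, and `best_so_far_auc` is a hypothetical name, not LEAPBench's API.

```python
def best_so_far_auc(scores, lower, upper):
    """Normalized area under the best-so-far curve, in [0, 1].

    scores: objective value observed at each iteration (higher is better)
    lower, upper: assumed known bounds used to normalize the objective
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = float("-inf")
    total = 0.0
    for s in scores:
        best = max(best, s)                      # running best design so far
        clipped = min(max(best, lower), upper)   # keep within the bounds
        total += (clipped - lower) / (upper - lower)
    return total / len(scores)

# Two runs with the same final outcome but different learning speed.
fast = best_so_far_auc([0.9, 0.1, 0.1], 0.0, 1.0)
slow = best_so_far_auc([0.1, 0.1, 0.9], 0.0, 1.0)
```

Both runs end with a best value of 0.9, so a final-outcome score cannot separate them, while the trajectory score rewards the run that found the good design earlier.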

2605.15334 2026-05-18 cs.LG cs.AI cs.CL cs.SE (updated version)

From I/O to Code with Discovery Agent

Yihong Dong, Jiaru Qian, Haoran Zhang, Peixu Wang, Binhua Li, Zhi Jin, Yongbin Li, Ge Li, Xiaokang Yang, Xue Jiang

Affiliations: School of Computer Science, Peking University; Tongyi Lab, Alibaba Group; Wuhan University; Renmin University of China; National University of Singapore; Shanghai Jiaotong University

AI summary: This paper proposes DIO-Agent, which treats IO2Code as an evolutionary search over a discrete program space, uses an LLM as the mutation operator, and guides mutations with execution error signals, to address the problem of synthesizing code from input-output behavior.

Abstract

The automatic synthesis of a program from any form of specification is regarded as a holy grail of computer science. Fueled by LLMs, NL2Code has achieved tremendous success, yet the fundamentally more challenging task of synthesizing programs from input-output behavior, which we refer to as IO2Code, remains largely unsolved. Whereas NL2Code can exploit the semantic alignment between natural language and code acquired during pretraining, IO2Code requires recovering underlying principles from concrete computational behavior, navigating a vast and underspecified hypothesis space. To address this, we propose DIO-Agent, a discovery agent for IO2Code. Our method frames IO2Code as an evolutionary search over discrete program space, in which an LLM serves as the mutation operator and concrete error signals from execution guide each mutation. To prevent the search from wandering into structurally complex yet incorrect dead ends, we introduce the Transformation Priority Premise as a mutation prior that biases the LLM toward the simplest hypothesis consistent with current evidence, progressively escalating from constants to conditionals to iteration only when simpler constructs are insufficient. To facilitate systematic study, we further construct an IO2CodeBench spanning multiple difficulty levels. Extensive experiments show that DIO-Agent consistently outperforms both traditional program-by-example method and SOTA evolution-agent baselines across all difficulty levels and various LLMs, while substantially surpassing test-time scaling strategies with equivalent sampling budgets.

2605.15333 2026-05-18 cs.AI (updated version)

Zero-Shot Goal Recognition with Large Language Models

Kin Max Piamolini Gusmão, Nathan Gavenski, Nir Oren, Felipe Meneguzzi

Affiliations: PUCRS Porto Alegre; King’s College London; University of Aberdeen; PUCRS

AI summary: This paper presents the first systematic evaluation of the zero-shot goal recognition ability of frontier large language models on classical PDDL benchmarks. Performance is uneven: some models become more accurate as evidence accumulates, while others rely on world-knowledge priors.

Comments 9 pages, 1 figure, 1 table; appendix with 8 figures and 2 code listings (29 pages total); submitted to NeurIPS 2026

Abstract

Large language models have recently reached near-parity with classical planners on well-known planning domains, yet this competence relies on world-knowledge exploitation rather than genuine symbolic reasoning. Goal recognition is a complementary abductive task structurally better suited to LLM strengths: it consists of evaluating consistency with world knowledge rather than generating novel action sequences. This paper provides the first systematic zero-shot evaluation of frontier LLMs as goal recognisers on key classical PDDL benchmarks. Our results show that LLM competence on goal recognition is uneven: some models scale with evidence and approach landmark-based accuracy at full observations, while others remain anchored to world-knowledge priors regardless of how much evidence accumulates. Qualitative analysis of model reasoning traces reveals that this divergence reflects a fundamental difference in evidence integration rather than domain familiarity. These findings position goal recognition as a principled benchmark for the foundational planning knowledge of LLMs.

2605.15315 2026-05-18 cs.AI cs.CL (updated version)

Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning

Jingjing Wang, Xiwen Chen, Wenhui Zhu, Huayu Li, Zhengxiao He, Feiyang Cai, Ana S. Carreon-Rascon, Xuanzhao Dong, Feng Luo

Affiliations: Clemson University; Morgan Stanley; Arizona State University; University of Arizona

AI summary: This paper proposes the LaMR framework, which decomposes code relevance into two dimensions, semantic evidence and dependency support, and uses a multi-task CRF model to improve context pruning for coding agents. Experiments show strong performance on several benchmarks.

Abstract

LLM-powered coding agents spend the majority of their token budget reading repository files, yet much of the retrieved code is irrelevant to the task at hand. Existing learned pruners compress this context with a single-objective sequence labeler, collapsing all facets of code relevance into one score and one transition matrix. We show that this formulation creates a modeling bottleneck: a single CRF transition prior must serve heterogeneous retention patterns, including contiguous semantic spans and sparse structural support lines. We propose LaMR (Latent Multi-Rubric), a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network dynamically weights the per-rubric emissions conditioned on the query, and a final CRF layer on the fused emissions produces the aggregate keep-or-prune decision. To supervise each dimension without additional annotation cost, we derive multi-rubric labels from the existing training corpus via AST-based program analysis, simultaneously denoising the teacher's binary labels. By effectively filtering distracting noise, LaMR frequently matches or even outperforms unpruned full-context baselines. Experiments on four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA) show that LaMR wins 12 of 16 head-to-head multi-turn comparisons. It saves up to 31% more tokens on multi-turn agent tasks and improves Exact Match by up to +3.5 on single-turn tasks, while performance is frequently enhanced by denoising the context, and any remaining drops are marginal.

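2605.15308 2026-05-18 cs.AI cs.LG cs.MA (updated version)

SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution

Jiachen Jiang, Huminhao Zhu, Zhihui Zhu

Affiliations: Department of Computer Science and Engineering; The Ohio State University

AI summary: SMCEvolve treats program search as sampling from a reward-tilted target distribution and approximates that distribution with a Sequential Monte Carlo sampler. This yields three core mechanisms: adaptive parent resampling, a mixture of mutation with acceptance, and automatic convergence control. It outperforms existing systems on math, algorithm efficiency, symbolic regression, and end-to-end ML research benchmarks.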

Abstract

LLM-driven program evolution has emerged as a powerful tool for automated scientific discovery, yet existing frameworks offer no principled guide for designing their individual components and provide no guarantee that the search converges. We introduce SMCEvolve, which recasts program search as sampling from a reward-tilted target distribution and approximates it with a Sequential Monte Carlo (SMC) sampler. From this view, three core mechanisms emerge as principled components: adaptive parent resampling, mixture of mutation with acceptance, and automatic convergence control. We further provide a finite-sample complexity analysis that bounds the LLM-call budget required to reach a target approximation error. Across math, algorithm efficiency, symbolic regression, and end-to-end ML research benchmarks, SMCEvolve surpasses state-of-the-art evolving systems while using fewer LLM calls under self-determined termination. The code is available at https://github.com/kongwanbianjinyu/SMCEvolve.
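
Reward tilting and adaptive resampling can be sketched with generic Sequential Monte Carlo building blocks. This is textbook SMC under stated assumptions, not the SMCEvolve code: the exponential tilt `exp(beta * reward)`, the effective-sample-size trigger, and the names `tilted_weights`, `effective_sample_size`, and `resample_parents` are illustrative choices.

```python
import numpy as np

def tilted_weights(rewards, beta):
    """Normalized weights proportional to exp(beta * reward)."""
    z = beta * np.asarray(rewards, dtype=float)
    z -= z.max()                      # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2); equals n for uniform weights."""
    return 1.0 / np.sum(np.square(weights))

def resample_parents(rewards, beta, rng, ess_fraction=0.5):
    """Return parent indices, resampling only when the weights degenerate."""
    w = tilted_weights(rewards, beta)
    n = len(w)
    if effective_sample_size(w) < ess_fraction * n:
        return rng.choice(n, size=n, p=w)   # multinomial resampling
    return np.arange(n)                     # keep the current population
```

In an evolution loop, the returned indices would select which programs are handed to the mutation step; the ESS threshold is one common way to make resampling adaptive.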

2605.15301 2026-05-18 cs.AI (updated version)

Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

Han Li, Jinyu Tian, Rili Feng, Yuqiao Du, Chong Zheng, Chenyu Wang, Chenchen Liu, Shihao Li, Xinping Lei, Yifan Yao, Weihao Xie, Letian Zhu, Jiaheng Liu

Affiliations: Nanjing University; Tsinghua University; Independent Researcher

AI summary: Solvita uses a closed-loop system and a trainable knowledge network so that agents learn dynamically, improving accuracy and experience accumulation on competitive programming tasks.

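Universal Approximation of Nonlinear Operators and Their Derivatives

Filippo de Feo

Affiliations: Institut für Mathematik, Technische Universität Berlin

AI summary: This paper proves universal approximation theorems for nonlinear operators and their derivatives via operator learning architectures, extending classical results to infinite-dimensional spaces, and discusses applications to high-order accuracy, constrained optimization, and numerical methods for infinite-dimensional PDEs.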

Abstract

Derivative-Informed Operator Learning (DIOL), i.e. learning a (nonlinear) operator and its derivatives, is an open research frontier at the foundations of the influential field of Operator Learning (OL). In particular, Universal Approximation Theorems (UATs) of nonlinear operators and their derivatives are foundational open questions and delicate problems in nonlinear functional analysis. In this manuscript, we prove the first UATs of non-linear $k$-times differentiable operators between Banach spaces and their derivatives, uniformly on compact sets and in weighted Sobolev norms for general finite input measures, via OL architectures. Our results are the first complete generalizations of the corresponding influential classical results in [Hornik, 1991] to infinite-dimensional settings and OL. We discuss several open areas where DIOL and our UATs find applications: high-order accuracy in OL, fast constrained optimization in Banach spaces (e.g. optimal control of PDEs, inverse problems) and numerical methods for infinite-dimensional PDEs (e.g. HJB PDEs on Banach spaces from optimal control of PDEs, SPDEs, path-dependent systems, partially observed systems, mean-field control). We parameterize nonlinear operators via Encoder-Decoder Architectures, renowned classes in OL due to their generality, including classical architectures, such as DeepONets, Deep-H-ONets, PCA-Nets. Our results are based on four key features that allow us to prove UATs in full generality: (i) Approximation Properties of Banach spaces. (ii) $k$-times continuous differentiability in the sense of Bastiani (weaker than $k$-times continuous Fréchet differentiability). (iii) Natural compact-open topologies for UA; indeed, we show that UA in standard compact-open topologies induced by operator norms is violated even for Fréchet derivatives. (iv) Construction of novel weighted Sobolev spaces for the UA.

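2605.15281 2026-05-18 cs.CR cs.AI (updated version)

Autonomous Intelligent Agents for Natural-Language-Driven Web Execution with Integrated Security Assurance

Vinil Pasupuleti, Siva Rama Krishna Varma Bayyavarapu, Shrey Tyagi

Affiliations: International Business Machines (IBM); Salesforce Inc

AI summary: This paper proposes an AI-based autonomous testing framework for natural-language-driven web execution with integrated security validation. Through five strategies (navigation reliability, context-aware selector generation, post-generation validation, intelligent wait injection, and learning from failures), the framework addresses the brittleness of traditional web test suites. Experiments show that the approach significantly raises the script generation success rate, reduces navigation failures and timing-related race conditions, and substantially cuts test creation time. It can also take attack scenarios described in natural language and convert them automatically into security probes that uncover a range of vulnerabilities, offering a new approach to natural-language-driven security testing.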

Comments 6 pages, 4 figures, 5 tables, IEEE conference format

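2605.15252 2026-05-18 cs.LG cs.AI eess.SP (updated version)

PDRNN: Modular Data-driven Pedestrian Dead Reckoning on Loosely Coupled Radio- and Inertial-Signalstreams

Peter Bauer, Andreas Porada, Felix Ott, Christopher Mutschler, Tobias Feigl

Affiliations: Fraunhofer Institute for Integrated Circuits IIS

AI summary: This paper proposes PDRNN, a modular data-driven pedestrian dead reckoning system for loosely coupled radio and inertial sensor signal streams. The method is based on a simple recurrent neural network architecture that implicitly forecasts asynchronous sensor data streams from different estimation methods. Independent machine learning models estimate key parameters such as orientation, velocity, and position together with their variances, and a final fusion model combines these outputs to improve robustness. Experiments show that PDRNN is more accurate and stable than classic and existing machine-learning methods on dynamic motion data, while offering better component control and forecasting capability.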

Comments 12 pages

Journal ref IEEE/ION Position, Location and Navigation Symposium (PLANS), Salt Lake City, UT, May 2025

DOI: 10.1109/PLANS61210.2025.11028330

Abstract

Modern pedestrian dead reckoning (PDR) systems rely on fusing noisy and biased estimates of position, velocity, and calibrated orientation derived from loosely coupled sensors to determine the current pose of a localized object. However, discrepancies in the sampling rates of sensor-specific estimation methods and unreliable transmission pose significant challenges. And traditional methods often fail to effectively fuse multimodal sensor data during dynamic movements characterized by high accelerations, velocities, and rapidly varying orientations. To address these limitations, we propose a simple recurrent neural network (RNN) architecture capable of implicitly forecasting asynchronous sensor data streams from diverse estimation methods along reference trajectories. The proposed approach introduces PDRNN, a modular hybrid AI-assisted PDR system that handles each component as an independent ensemble of machine learning (ML) models to estimate both key parameter means and variances. Separate ML-based models are employed to estimate orientation, (un)directed velocity or distance from acceleration and gyroscope data, with optional absolute positioning from synchronized radio systems such as 5G for stabilization. A final fusion model combines these outputs, position, velocity, and orientation, while using uncertainty estimates to enhance system robustness. The modular design allows individual components to be updated, fine-tuned, or replaced without affecting the entire system. Experiments on dynamic sports movement data show that PDRNN achieves superior accuracy and precision compared to classic and ML-based methods, effectively avoiding error accumulation common in black-box approaches. And PDRNN offers forecast capabilities and better component control despite increased system complexity.

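2605.15243 2026-05-18 cs.LG cs.AI q-bio.BM q-bio.MN q-bio.QM (updated version)

Reading the Cell, Designing the Cure: Perturbation-Conditioned Molecular Diffusion for Function-Oriented Drug Design

Ziyu Xu, Zijian Zhang, Liang Wang, Zhiyuan Liu, Qiang Liu, Shu Wu, Liang Wang

Affiliations: School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China; NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; National University of Singapore, Singapore

AI summary: This study proposes transcriptome-based drug design (TBDD), which aims to generate molecules with a specific function from a desired change in gene expression. To address the large gap between the biological and chemical domains and the sparsity of transcriptomic signals, the authors design CURE, a multi-scale diffusion generative model. Its core module, TFE, extracts function-oriented perturbation features and aligns them with chemical structure information across modalities, producing candidate drug molecules that are structurally plausible and functionally consistent. Experiments show strong results on several benchmarks, and a zero-shot gene inhibitor design task supports the method's practical potential.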

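2605.15238 2026-05-18 cs.SE cs.AI cs.PL (updated version)

Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

Alexander Du, Jianjun Ou, Danyang Zhuo, Matthew Lentz

Affiliations: Duke University

AI summary: This paper proposes Hydra, a system for efficiently recovering from static errors during code generation. Using asynchronous checking and checkpoint-and-rollback, Hydra avoids the high latency and token cost of conventional approaches: it detects and repairs errors during generation without regenerating the parts of the code that are already correct. Experiments show that on C/C++ code generation tasks Hydra significantly reduces latency and token usage compared with post-hoc repair.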

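2605.15237 2026-05-18 cs.AR cs.AI (updated version)

A3D: Agentic AI flow for autonomous Accelerator Design

Abinand Nallathambi, Christopher Knight, Shantanu Ganguly, Wilfried Haensch, Anand Raghunathan

Affiliations: Purdue University; Argonne National Laboratory; University of Chicago

AI summary: A3D is an agentic AI flow for end-to-end automated hardware accelerator design. It autonomously analyzes workloads, identifies performance bottlenecks, refactors code for compatibility with high-level synthesis tools, and generates micro-architectures, which significantly reduces the complexity of accelerator design and the need for human intervention. A3D also explores the speed-area tradeoff space automatically to produce diverse accelerator designs, providing an efficient, automated design flow for complex scientific applications.

Abstract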

Accelerating applications through the design of hardware accelerators can significantly enhance system performance and energy efficiency. Despite advances, such as high-level synthesis (HLS), designing accelerators for complex applications still remains highly labor-intensive, demanding considerable expertise in understanding workloads to be accelerated, hardware design, micro-architecture, and EDA tool usage, posing challenges for application domain experts. Therefore, most accelerator solutions are limited to applications with a regular predictable dataflow. Advances in AI have enabled agents that perform autonomous planning, reasoning, execution and reflection, leading to unprecedented potential for automation through agentic AI. We present A3D, an Agentic AI flow for end-to-end Automation of hardware Accelerator Design. A3D automates workload analysis, performance bottleneck identification, code refactoring for HLS compatibility and micro-architecture generation. A3D also generates diverse accelerator designs by automatically exploring the speed-area tradeoff space. Recent efforts have explored the use of AI for specific tasks such as design space exploration in HLS, leaving several tasks to still be performed manually. A3D addresses the challenges in applying modern LLMs to accelerator design by judiciously partitioning tasks among specialist agents, orchestrating process loops with specialist and verifier agents, utilizing pre-existing and custom tools, and employing agentic RAG for codebase and proprietary EDA tool documentation exploration. Our implementation of A3D, using commercial components like Claude Sonnet 4.5 and the Catapult HLS tool, demonstrates its effectiveness by generating accelerator designs with no human intervention from complex scientific applications like LAMMPS (molecular dynamics simulation) and QMCPACK (quantum chemistry).

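2605.15228 2026-05-18 cs.AI cs.LG (updated version)

Verifiable Agentic Infrastructure: Proof-Derived Authorization for Sovereign AI Systems

Jun He, Deying Yu

AI summary: This paper studies how to verify authorization when autonomous agents execute operations in sovereign AI systems and proposes a Distributed Trust Framework (DTF) based on verifiable proofs. The framework derives execution authority dynamically from structured, verifiable proof objects and ensures that every high-stakes operation rests on a consensus-verified proof bound to an evidence chain, which makes agent behavior controllable, auditable, and traceable. The approach offers a secure, decentralized authorization infrastructure for autonomous AI systems in cloud-native environments.

Comments 19 pages, 2 figures, 4 tables

Abstract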

Modern cloud and enterprise systems rely on identity-centric authorization, assuming that callers possessing valid credentials are safe to execute commands. The emergence of autonomous AI agents invalidates this assumption: agents can generate syntactically valid but semantically unsafe actions, making standing privileges a significant operational risk. This risk becomes especially acute in sovereign AI systems, where autonomous agents may interact with cloud infrastructure, regulated data, financial workflows, and national-scale digital services. Governed mutation substrates reduce this risk by interposing on agent actions: agents submit intents, infrastructure evaluates context and policy, and execution is mediated. However, this shifts the trust boundary: how can the decision to authorize an intent be made verifiable, distributed, and replayable? We introduce a Distributed Trust Framework (DTF), a verification framework for governed mutation systems that computes execution authority from structured, verifiable artifacts. DTF introduces a Justification Proof to encode the admissibility basis of an action, a consensus model for independent evaluation, an ephemeral Execution Identity derived from the approved proof, and an append-only Evidence Chain that preserves the authorization lifecycle. Under stated substrate assumptions, this architecture enforces a compact authorization invariant: no high-stakes execution without a proof object, no derived authority without consensus, and no valid mutation detached from evidence. We define the model, instantiate it over an OpenKedge-based governed mutation substrate, and show how it maps onto cloud-native environments. By shifting authorization from standing identity to proof-derived authority, DTF provides an infrastructure foundation for making agentic execution governable, auditable, and bounded in sovereign AI deployments.
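
The three-part authorization invariant in the abstract can be made concrete with a toy check. Everything here is hypothetical: the class names mirror the abstract's terms, but the fields, the two-thirds quorum, and the `authorize` function are assumptions for illustration and do not describe DTF's actual data model or consensus protocol.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class JustificationProof:
    intent: str        # the action the agent wants to perform
    approvals: int     # independent evaluators that approved the intent
    evaluators: int    # total independent evaluators consulted

@dataclass
class EvidenceChain:
    # Append-only log of authorization decisions (a plain list in this sketch).
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def append(self, decision: str, intent: str) -> None:
        self.entries.append((decision, intent))

def authorize(proof: Optional[JustificationProof],
              chain: EvidenceChain,
              quorum: float = 2 / 3) -> bool:
    """Grant authority only with a proof, a quorum, and a recorded decision."""
    if proof is None:
        return False                      # no execution without a proof object
    if proof.evaluators == 0 or proof.approvals / proof.evaluators < quorum:
        chain.append("rejected", proof.intent)
        return False                      # no authority without consensus
    chain.append("approved", proof.intent)  # decision is tied to evidence
    return True
```

The point of the sketch is only the ordering of the checks: authority is computed from the proof and the vote, and every decision on a proof leaves a record.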

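2605.15227 2026-05-18 cs.AI cond-mat.mtrl-sci cs.RO (updated version)

NIMO Controller: a self-driving laboratory orchestrator based on the Model Context Protocol

Naruki Yoshikawa, Ryo Tamura

Affiliations: National Institute for Materials Science; Graduate School of Frontier Sciences, The University of Tokyo

AI summary: This paper proposes NIMO Controller, a control architecture for self-driving laboratories (SDLs) based on the Model Context Protocol (MCP). It targets the lack of standardized interfaces in existing SDL software frameworks, which makes them hard to use with AI agents. The architecture exposes all SDL functionality through MCP servers and provides a visual programming interface built on MCP tool discovery, so users can design experimental workflows without writing code while AI agents interact through the same backend. A color-matching experiment demonstrates the feasibility and practicality of the architecture.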

Comments 9 pages, 4 figures

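2605.15226 2026-05-18 cs.AR cs.AI cs.SE (updated version)

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

Qingyun Zou, Feng Yu, Hongshi Tan, Bingsheng He, WengFai Wong

Affiliations: National University of Singapore

AI summary: This paper examines whether agentic AI systems built for software engineering transfer to real hardware engineering tasks and introduces Phoenix-bench, a benchmark of 511 verified Verilator instances that supports evaluation across hardware design flows, bug fixing, and verification. The study finds that hardware and software engineering differ markedly in how bugs propagate and how they are fixed, and that localization granularity and feedback strongly affect agent performance, which offers useful guidance for applying agents to hardware engineering.

Abstract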

We ask whether agentic AI systems built for software engineering transfer to realistic hardware engineering. Existing hardware LLM benchmarks isolate sub-tasks but none jointly requires repository navigation, hierarchy-aware localization, Electronic Design Automation (EDA) executable verification, and maintenance-style patching. We introduce Phoenix-bench, a synchronized corpus of 511 verified Verilator instances from 114 GitHub repositories, each shipped with the developer patch, design-flow labels, fail-to-pass and pass-to-pass testbenches, and a Docker-pinned EDA environment so resolved-rate differences reflect agent behavior rather than toolchain availability. Using Phoenix-bench we run a uniform evaluation of four commercial agents and eight open-source agentic structures across four LLM backbones, plus two diagnostic interventions (file-level oracle localization and one round of testbench-log feedback). Three findings emerge. (i) Software and hardware are fundamentally different engineering tasks: the same agent loses 37% to 58% from SWE-bench Verified to Phoenix-bench because hardware bugs propagate across parallel instantiated modules through signal flow rather than along a software-style call graph, and software-tuned agents stop at the symptom file instead of tracing back through the instantiation chain. (ii) Failures concentrate on design control-flow / finite state machine (FSM) bugs, verification testbench bugs, and hard cases that demand cross-hierarchy signal-flow tracking and coordinated multi-file edits. (iii) Localization granularity matters far more than localization itself: a perfect file-level oracle yields only +1.4% because the agent then breaks files that did not need editing, while a single round of test case feedback lifts resolved rate by 42% to 45% because the test case tells where the bug is and what the fix has to look like.
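
A resolved-rate computation over fail-to-pass and pass-to-pass testbenches might look like the sketch below. The criterion follows the usual SWE-bench-style convention and is an assumption here, since the abstract does not spell out how Phoenix-bench scores an instance; `is_resolved` and `resolved_rate` are hypothetical names.

```python
from typing import Dict, Iterable, List

def is_resolved(results: Dict[str, bool],
                fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str]) -> bool:
    """An instance counts as resolved if the patch fixes every previously
    failing testbench and breaks none of the previously passing ones.

    results maps a testbench name to whether it passed after the agent's patch;
    a missing testbench is treated as a failure.
    """
    fixed = all(results.get(name, False) for name in fail_to_pass)
    preserved = all(results.get(name, False) for name in pass_to_pass)
    return fixed and preserved

def resolved_rate(outcomes: List[bool]) -> float:
    """Fraction of benchmark instances that were resolved."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Under this criterion, a patch that fixes the target bug but breaks an unrelated module still counts as unresolved, which matches the abstract's observation about agents breaking files that did not need editing.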

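2605.15225 2026-05-18 q-bio.QM cs.AI (updated version)

Do Biological Structural Guarantees Earn Their Complexity?

Bogdan Banu

AI summary: This paper asks whether biologically inspired structural guarantees are worth their complexity. It builds three in-depth benchmarks and compares AI frameworks based on biological mechanisms (such as metabolic priority gating, autoinducer quorum sensing, and Bayesian stagnation detection) against non-biological alternatives and simplified controls over thousands of trials, assessing the practical reliability benefits and costs of the biological structures.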

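2605.15224 2026-05-18 cs.AI cs.MA (updated version)

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

Jianbo Lin, Xiaomin Yu, Yi Xin, Yifu Guo, Zhuosong Jiang, Zhongqi Yue, Weishi Wang, Heqing Zou, Chengwei Qin, Hui Xiong

Affiliations: Hong Kong University of Science and Technology (Guangzhou); Nanjing University; Sun Yat-sen University; National University of Singapore; Nanyang Technological University; SAP; Microsoft Research

AI summary: This paper proposes ICRL, a reinforcement learning framework that trains large language models to internalize the guidance in self-critique feedback, so they continue to perform well when no external critique is available. The framework jointly trains a solver and a critic and uses the performance gain produced by the critique as the reward, which pushes the critic toward feedback that actually helps. To handle the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibrated reweighting strategy and stabilizes joint optimization with role-grouped advantage estimation. Experiments show significant gains across a range of tasks, with the trained critic rivaling much larger models.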

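2605.15223 2026-05-18 cs.AR cs.AI (updated version)

GenAI-Driven Approach to RISC-V Supply Chain Exploration

Nenad Petrovic, Andre Schamschurko, Yingjie Xu, Alois Knoll

Affiliations: Chair of Robotics, Artificial Intelligence and Real-Time Systems; Technical University of Munich

AI summary: This paper proposes an LLM-based pipeline for analyzing the RISC-V supply chain. It combines vision-language models (VLMs) and model-driven engineering (MDE) to enable multimodal, data-driven analysis of heterogeneous, unstructured supply chain data. The LLM interprets textual information, the VLM extracts information from visual documents such as charts and tables, and together they build a supply chain knowledge graph. MDE techniques are then used for dependency validation, bottleneck detection, and risk assessment, supporting both exploratory and systematic analysis of supply chain resilience. Experiments indicate that the approach improves supply chain transparency and decision support in the RISC-V ecosystem.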

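2605.15221 2026-05-18 cs.SE cs.AI cs.CL (updated version)

Effective Harness Engineering for Algorithm Discovery with Coding Agents

Yoichi Ishibashi, Taro Yano, Masafumi Oyamada

Affiliations: Gemini; Algorithmicsuperintelligence

AI summary: This paper studies how to design an effective execution harness for algorithm discovery, where algorithms are generated automatically with large language models and evolutionary search. It analyzes the tradeoff between the number and depth of generated algorithms, the handling of evaluation exploits, and the safety of parallel execution, proposes an improved framework called Vesper, and validates it on the circle packing problem. Experiments show that under a fixed compute budget, generating fewer algorithms in greater depth gives better results, and that stronger models are more prone to producing evaluation exploits, which underlines the importance of exploit detection.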

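2605.15220 2026-05-18 cs.CL cs.AI cs.LG (updated version)

Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

Michael Y. Hu, Apurva Gandhi, Kyunghyun Cho, Tal Linzen, Pratyusha Sharma

Affiliations: New York University; Carnegie Mellon University; Microsoft

AI summary: Data mixing, which determines how training data from different sources or types is combined, plays a key role in language model training. This paper proposes OP-Mix, an efficient data mixing algorithm that runs continuously throughout the language model training lifecycle, whereas existing methods apply to a single training stage. It trains low-rank adapters on the current model and interpolates between them to simulate candidate data mixtures cheaply, which removes the dependence on proxy models and keeps the search grounded in the model's actual learning dynamics. Experiments show that OP-Mix reaches near-optimal performance at lower compute cost in pretraining, continual fine-tuning, and other settings.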

2605.15218 2026-05-18 cs.AI cs.CE (updated version)

CAX-Agent: A Lightweight Agent Harness for Reliable APDL Automation

Chenying Lin, Yichen Hai, Yi He, Ran Wang, Haiyan Qiang, Liang Yu

Affiliations: Shanghai Ultradimension Technology Co., Ltd.; College of Logistics Engineering, Shanghai Maritime University; School of Civil Aviation, Northwestern Polytechnical University; State Key Laboratory of Airliner Integration Technology; National Key Laboratory of Strength; Wuhan University

AI summary: This paper proposes CAX-Agent, a lightweight agent harness that aims to make automation of MAPDL finite element simulation more reliable. It introduces domain-specific middleware for tool lifecycle management, workflow state control, and failure recovery, addressing the inconsistent outputs and task failures that large language models commonly exhibit on this task. The evaluation shows that the model-driven recovery strategy in CAX-Agent performs well on several structural benchmarks and clearly outperforms rule-only and no-recovery strategies.

Comments 8 pages, 6 figures, IEEE conference format

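2605.15217 2026-05-18 cs.AI cs.CY cs.LG econ.GN q-fin.EC Updated version

Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions

Jagdish Tripathy, Marcus Buckmann

Affiliations: Bank of England

AI summary: This study examines the asymmetry between the behavioral fairness that instruction-tuned language models show in high-stakes decisions, such as mortgage approval, and the latent bias inside them. Although the models appear unbiased at the output level, their internal representations retain and amplify race-related bias, and these hidden biases are causally potent: targeted interventions can flip decisions. The study also shows that the bias is asymmetric across groups, and argues that output-only behavioral audits are insufficient for identifying and governing latent bias, so a dual evaluation framework that adds representation analysis is needed.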

Comments 39 pages, 16 figures, 2 tables

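2605.15215 2026-05-18 cs.AI cs.SE Updated version

SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces

Duling Xu, Zheng Chen, Zaifeng Pan, Jiawei Guan, Dong Dong, Jialin Li, Bangzheng Pu

Affiliations: AetherHeart Tech Co., Ltd.; Renmin University of China; University of California San Diego

AI summary: SkillSmith is a boundary-guided compile-and-runtime framework for optimizing skill-based agent systems. It compiles skill packages offline into minimal executable interfaces and extracts each skill's fine-grained operational boundaries, so that at runtime the agent invokes only the relevant components, reducing redundant context injection and repeated reasoning. Experiments show that SkillSmith lowers inference-time token usage, the number of thinking iterations, and execution time while improving task accuracy, and that compilation results produced by a strong model can be reused by lightweight models.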

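2605.15213 2026-05-18 cs.IR cs.AI Updated version

An LLM-RAG Approach for Healthy Eating Index-Informed Personalized Food Recommendations

Yibin Wang, Yanjie Yang, Grace Melo Guerrero, Rodolfo M. Nayga, Azlan Zahid

Affiliations: Department of Biological and Agricultural Engineering, Texas A&M AgriLife Research

AI summary: This study proposes a retrieval-augmented generation (RAG) framework informed by the Healthy Eating Index (HEI) for generating personalized healthy-eating recommendations. The method combines a standardized nutrition database with a large language model, builds a food embedding space, and computes HEI scores to give users personalized suggestions that meet health standards. Experimental results indicate that the method raises users' HEI scores and improves diet quality.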

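2605.15208 2026-05-18 cs.LG cs.AI Updated version

Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels

Plawan Kumar Rath, Rahul Maliakkal

Affiliations: Meta

AI summary: This study examines how quantization affects bias in large language models (LLMs). It finds that low-precision quantization causes models to exhibit new stereotyped behavior across several tasks, and that the change follows a dose-response relationship with precision level. Through large-scale experiments across multiple models and precision levels, the study shows that conventional quality metrics fail to detect this increase in bias, and it stresses the importance of fairness testing before model compression.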

Comments 7 pages, 4 figures, 4 tables. Accepted at IEEE Cloud Summit 2026. This is the author's accepted version; the version of record will appear in IEEE Xplore

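2605.15206 2026-05-18 cs.LG cs.AI cs.DC Updated version

AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices

Dzung Pham, Kleomenis Katevas, Ali Shahin Shamsabadi, Hamed Haddadi

Affiliations: University of Massachusetts Amherst; Brave Software, Imperial College London

AI summary: As LLM-based autonomous agents are applied to more complex tasks, local deployment improves privacy and lowers cost, but it consumes far more resources than ordinary language model interactions. This paper studies the energy cost of running agents locally on consumer hardware and proposes AgentStop, a lightweight supervision mechanism that predicts the likelihood of task failure and terminates unproductive runs early. It cuts energy use by 15%-20% with only a small effect on task performance, offering a practical route toward sustainable local agent systems.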

Comments ACM CAIS '26

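2605.15205 2026-05-18 cs.AI Updated version

Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations

Nanxu Gong, Zixin Chen, Haotian Li, Zishu Zhao, Jianxun Lian, Huamin Qu, Yanjie Fu, Xing Xie

Affiliations: Arizona State University; Hong Kong University of Science and Technology; Microsoft Research Asia; Smith College

AI summary: This study asks whether improving the theory-of-mind (ToM) ability of large language models (LLMs) actually improves human-AI interaction. It points out that existing benchmarks mostly assess ToM from a third-person perspective, through story reading and multiple-choice questions, and overlook the first-person, dynamic, and open-ended nature of real interaction. The study proposes a new interactive ToM evaluation paradigm and uses real datasets and user experiments to evaluate four representative ToM-enhancement techniques. Gains on static benchmarks do not necessarily translate into better performance in dynamic human-AI interaction, which highlights the importance of interaction-based evaluation for developing the next generation of socially intelligent models.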

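2605.15204 2026-05-18 cs.AI Updated version

SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch

Zhantao Wang

Affiliations: Digital China

AI summary: This paper proposes SDOF, a multi-agent orchestration framework that addresses the lack of stage constraints in how existing systems dispatch tasks. The framework treats multi-agent execution as a constrained state machine and combines reinforcement learning with finite-state automata to control the task flow precisely and verify compliance. Experiments in practical scenarios such as a recruitment system show higher task completion rates and safer execution than existing models.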

Comments 12 pages, 4 figures, 14 tables

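2605.15203 2026-05-18 cs.IR cs.AI cs.MA Updated version

Agent4POI: Agentic Context-Conditioned Affordance Reasoning for Multimodal Point-of-Interest Recommendation

Jinze Wang, Yangchen Zeng, Tiehua Zhang, Lu Zhang, Yuze Liu, Yongchao Liu, Xingjun Ma, Zhu Sun

Affiliations: Tongji University; Swinburne University of Technology; Southeast University; Chengdu University of Information Technology; Fudan University; Singapore University of Technology and Design

AI summary: This paper proposes Agent4POI, a new point-of-interest (POI) recommendation framework whose core idea is to generate context-conditioned multimodal representations dynamically at recommendation time, instead of relying on precomputed static POI embeddings. A four-stage LLM agent generates dynamic, scenario-specific affordance queries from the situational context, reasons across images, reviews, and metadata, and produces structured, uncertainty-aware affordance representations, which improves the accuracy and adaptability of recommendations. Experiments show that Agent4POI outperforms existing methods on several benchmark datasets and evaluation scenarios, especially under cold-start and context-shift conditions.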

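2605.15202 2026-05-18 cs.AI cs.CL cs.IR Updated version

DeepSlide: From Artifacts to Presentation Delivery

Ming Yang, Zhiwei Zhang, Jiahang Li, Haoseng Liu, Yuzheng Cai, Weiguo Zheng

Affiliations: School of Data Science, Fudan University

AI summary: DeepSlide is a human-in-the-loop multi-agent system that supports end-to-end presentation preparation. It aims to optimize the whole process from content planning to delivery, not just to generate visually plausible slides. The system combines controllable logic-chain planning, content-tree retrieval, style-inheriting sequential rendering, and actionable rehearsal support, improving narrative coherence, pacing accuracy, and the fit between slides and script. The study also introduces a dual-scoreboard benchmark that separates static content quality from dynamic delivery performance. Experiments show that DeepSlide outperforms existing methods across several domains and audience scenarios.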

Comments 37 pages,10 figures,9 tables

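2605.15053 2026-05-18 cs.LG cs.AI Updated version

TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale

Anurup Ganguli

Affiliations: Independent Researcher

AI summary: This paper proposes TFGN, a new architecture that enables continual pre-training of large language models without catastrophic forgetting, with no replay data and no task identifiers. The method overlays a parameter-efficient, input-conditioned update module on a transformer model, achieves forward and backward transfer across heterogeneous text domains, and reports strong results across several model scales and datasets. The study further introduces a closed-loop meta-controller and an operator-level plan vector, which improve the model's autonomous learning and cross-domain adaptability, offering a new architectural approach to continual learning for large language models.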

Comments 65 pages, 10 figures, 40 tables

Abstract

Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale. Existing methods rely on replay buffers, task identifiers, regularization penalties that scale poorly, or sentence-classification-scale evaluation. We introduce TFGN, an architectural overlay for transformer language models that produces input-conditioned, parameter-efficient updates while leaving the rest of the transformer unchanged. On six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) at 1B tokens per phase across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit), TFGN achieves backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention 0.506/0.504/0.510, and >=99.59% L2-orthogonal gradient separation between domain pairs - with no replay, no task IDs, no Fisher penalty. The same matrices show positive cross-domain forward transfer: held-out JavaScript PPL drops 26.8% at LLaMA-8B Retrofit and 62.0% at GPT-2 Medium From-Scratch purely from Python training. Two extensions on the same substrate close further open problems. A closed-loop meta-control layer (Extension A) reduces forgetting by an additional 81% at ~398M, mapping onto the System A and System M roles of Dupoux et al. (arXiv:2603.15381). An operator-level plan vector (Extension B) reshapes forward-pass behavior at 99.96% cosine fidelity over 30 source->target pairs. The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to. To our knowledge, TFGN is the first architecture that simultaneously closes catastrophic forgetting at LLM scale, realizes a closed-loop autonomous-learning meta-controller, and carries an operator-level latent planner.

2605.14892 2026-05-18 cs.AI Updated version

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

Shihao Qi, Jie Ma, Rui Xing, Wei Guo, Xiao Huang, Zhitao Gao, Jianhao Deng, Jun Liu, Lingling Zhang, Bifan Wei, Boqian Yang, Pinghui Wang, Jianwen Sun, Jing Tao, Yaqiang Wu, Hui Liu, Yu Yao, Tongliang Liu

Affiliations: MOE KLINNS Lab; School of Computer Science and Technology; School of Cyber Science and Engineering; School of Software Engineering; School of Control Science and Engineering; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering; Laboratory for AI and New Forms of Education; Lenovo AI Technology Center, CTOO, Lenovo; Sydney AI Centre, The University of Sydney

AI summary: This survey reviews research on collaboration, failure attribution, and self-evolution in LLM-based multi-agent systems. It observes that existing work tends to treat individual agent capability, collaboration mechanisms, and self-evolution separately and neglects the causal relationships among them. The paper proposes a unified framework, the LIFE progression, covering four stages: building the capability foundation, integrating collaboration, attributing failures, and evolving autonomously. It analyzes the dependencies between stages and outlines cross-stage research directions, with the goal of advancing self-organizing multi-agent systems capable of continuous diagnosis, structural adjustment, and behavioral optimization.

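2605.14876 2026-05-18 cs.CV cs.AI Updated version

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du

Affiliations: University of Science and Technology of China

AI summary: Despite rapid progress, current text-to-image models mostly rely on a single-step generation paradigm, struggle with complex semantic content, and gain little from parameter scaling. To address the hallucination, unstable optimization, and inference latency of multi-step reasoning methods, this paper proposes CLVR, a closed-loop visual reasoning framework that tightly couples vision-language logical planning with pixel-level diffusion generation. It introduces reinforcement learning based on proxy prompts and Δ-space weight merging, which improve generation quality and inference efficiency. Experiments show that CLVR outperforms existing open-source models on several benchmarks and approaches the performance of commercial models.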

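2605.14859 2026-05-18 cs.CR cs.AI Updated version

Do Coding Agents Understand Least-Privilege Authorization?

Zheng Yan, Jingxiang Weng, Charles Chen, Dengyun Peng, Ethan Qin, Jiannan Guan, Jinhao Liu, Qiming Yu, Yixin Yuan, Fanqing Meng, Carl Che, Mengkang Hu

Affiliations: Evolvent AI Research Team

AI summary: As coding agents gain more access to system shells, code repositories, and user files, least-privilege authorization becomes a requirement for safe deployment. This paper studies whether current models can infer permission boundaries on their own. It defines the permission-boundary inference task and builds AuthBench, a benchmark of 120 real terminal tasks. Existing models often omit necessary permissions or grant superfluous ones, and giving them more reasoning time does not fix the problem. The authors propose a sufficiency-tightness decomposition: a forward simulation of the task generates a covering policy, and each granted permission is then audited. This raises the success rate on sensitive tasks and lowers the chance that an attack succeeds.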

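2605.14665 2026-05-18 cs.AI cs.CL cs.IR Updated version

Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

Joy Bose

Affiliations: Independent Researcher

AI summary: This paper proposes Falkor-IRAC, a graph-constrained generation framework intended to make legal reasoning in Indian judicial AI systems more accurate and reliable. The method is built on an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph: judgments of the Supreme Court and High Courts of India are structured as graph nodes and enriched with procedural state transitions, precedent relationships, and statutory references. At inference time the system accepts only generated results that can be verified against the graph structure, which reduces wrong citations and incomplete reasoning chains. It also actively detects conflicts between legal doctrines, suggesting a new direction for trustworthy reasoning in legal AI.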

Comments 20 pages, 8 figures, 4 tables

Abstract

Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work. The companion InIRAC dataset, 500+ structured Indian court judgments with IRAC annotations, is released alongside this paper.
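The acceptance rule described in the abstract (an answer is kept only if a valid supporting path can be traced through the graph) can be illustrated with a small sketch. This is not the paper's Verifier Agent and does not use FalkorDB: the in-memory graph, the node and relation names, and the functions `path_is_supported` and `verify_answer` are all hypothetical, chosen only to show the shape of such a check.

```python
# Hypothetical in-memory stand-in for an IRAC knowledge graph:
# node -> set of (relation, neighbor) edges. All names are illustrative.
GRAPH = {
    "Issue:bail_delay": {("GOVERNED_BY", "Rule:statute_s")},
    "Rule:statute_s": {("APPLIED_IN", "Case:precedent_a")},
    "Case:precedent_a": {("SUPPORTS", "Conclusion:bail_granted")},
}


def path_is_supported(graph, path):
    """Return True if `path` is a non-empty chain of hops that all exist in `graph`.

    `path` is a list of (source, relation, target) triples.
    """
    if not path:
        return False  # an answer with no cited path cannot be verified
    for i, (src, rel, dst) in enumerate(path):
        if (rel, dst) not in graph.get(src, set()):
            return False  # this hop is not an edge in the graph
        if i > 0 and path[i - 1][2] != src:
            return False  # consecutive hops must connect end to start
    return True


def verify_answer(graph, answer):
    """Accept a generated answer only if its cited reasoning path is in the graph."""
    supported = path_is_supported(graph, answer.get("path", []))
    return {"accepted": supported, "text": answer["text"] if supported else None}
```

Under these assumptions, an answer that cites a case absent from the graph is rejected outright instead of being silently repaired, which mirrors the falsifiability framing in the abstract.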

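2605.14401 2026-05-18 cs.CL cs.AI Updated version

Agentic Recommender System with Hierarchical Belief-State Memory

Xiang Shen, Yuhang Zhou, Yifan Wu, Zhuokai Zhao, Siyu Lin, Lei Huang, Qianqian Zhong, Lizhu Zhang, Benyu Zhang, Xiangjun Fan, Hong Yan

Affiliations: Meta Recommendation Systems (MRS)

AI summary: This paper proposes MARS, a memory-augmented agentic recommender system. It models recommendation as a partially observable problem and uses a hierarchical belief-state memory to capture users' changing preferences more accurately. MARS splits memory into three levels (event memory, preference memory, and user-profile memory) and defines a full lifecycle of six operations (extraction, reinforcement, weakening, consolidation, forgetting, and reconstruction) that an LLM-based scheduler manages dynamically. Experiments show that MARS improves on existing state-of-the-art methods on several recommendation benchmarks.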

Comments 4 figures, 8 tables

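2605.14311 2026-05-18 cs.LG cs.AI cs.HC Updated version

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

Yuchen Sun, Pei Fu, Shaojie Zhang, Anan Du, Xiuwen Xi, Ruoceng Zhang, Zhenbo Luo, Jian Luan, Chongyang Zhang

Affiliations: Xiaomi Inc.; Shanghai Jiao Tong University

AI summary: This paper addresses a key problem in test-time scaling (TTS) for general-purpose GUI agents: existing critic models rely on binary classification and therefore cannot distinguish valid actions from actions that look plausible but are invalid. The authors propose BBCritic, a continuous semantic alignment method that uses two-stage contrastive learning to recover the hierarchy that binary classification suppresses, and they introduce BBBench, the first fine-grained evaluation benchmark for this setting. Experiments show that the method surpasses existing large models without extra annotation and transfers well zero-shot across platforms.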

Comments 28 pages including appendix. Code and BBBench benchmark to be released

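2605.14309 2026-05-18 cs.CV cs.AI cs.LG Updated version

ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition

Shen Lin, Jing Lin, Junhao Dong, Piotr Koniusz, Li Xu

Affiliations: Fujian Normal University; Nanyang Technological University; University of New South Wales; Data61 CSIRO

AI summary: This paper proposes ICED, a concept-level machine unlearning method for vision-language models (VLMs) based on interpretable concept decomposition. It targets a weakness of conventional image-level or instance-level unlearning, which struggles to remove target knowledge precisely without affecting unrelated semantics. ICED uses a multimodal large language model to build a task-relevant concept vocabulary and decomposes visual representations into sparse, non-negative combinations of semantic concepts. This allows the target concept in an image to be suppressed precisely while non-target semantics and cross-modal knowledge are preserved. Experiments show that the method forgets target knowledge more completely and better preserves non-target information in images, while maintaining model performance.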

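2605.14205 2026-05-18 cs.AI Updated version

SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ted Chaiwachirasak, Han Li, Lingyun Wang

Affiliations: Shopify

AI summary: This paper proposes SimPersona, a framework that addresses the failure of LLM-based e-commerce agents to capture the heterogeneity and distribution of real buyer populations. The method learns discrete buyer types from historical clickstreams and converts them into compact persona labels that guide the agent's behavior. Experiments show that SimPersona simulates real buyer behavior effectively, closely matches observed conversion rates, and performs well across several e-commerce scenarios.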

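2605.13142 2026-05-18 cs.AI math.OC Updated version

A Constraint Programming Approach for n-Day Lookahead Playoff Clinching in the NHL

Gili Rosenberg, Kyle E. C. Booth, J. Kyle Brubaker, Ruben S. Andrist

Affiliations: Amazon Advanced Solutions Lab

AI summary: This paper studies how to determine whether a National Hockey League (NHL) team can clinch a playoff berth within the next $n$ days. To handle the league's complex qualification rules and tie-breaking procedures, the authors propose a tree-search algorithm based on constraint programming that efficiently analyzes all possible combinations of game outcomes over the next $n$ days and decides whether the team can secure a playoff spot. The method combines preprocessing, pruning strategies, and node-ordering heuristics to speed up the search. It is validated on extensive real-season data, scales well, and can be used to analyze other related sports metrics.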

Comments 18 pages, 5 figures, 4 tables. Accepted to CP 2026

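2605.12667 2026-05-18 cs.LG cs.AI Updated version

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

Nirmal Patel, Fei Wang, Inderjit S. Dhillon

Affiliations: University of Texas at Austin; Google

AI summary: This study addresses the noise in discrete rewards that affects Reinforcement Learning from AI Feedback (RLAIF) for LLM alignment, and proposes ODRPO, a robust policy optimization framework. The core idea is to decompose multi-tier discrete rewards into a sequence of binary ordinal indicators, which structurally isolates evaluation noise, and to compute advantages independently at each progressively harder success threshold, which improves the stability and robustness of learning. Experiments show that ODRPO outperforms existing methods on several benchmarks with almost no added training-time overhead.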

Abstract

The alignment of Large Language Models (LLMs) utilizes Reinforcement Learning from AI Feedback (RLAIF) for non-verifiable domains such as long-form question answering and open-ended instruction following. These domains often rely on LLM-based auto-raters to provide granular, multi-tier discrete rewards (e.g., 1-10 rubrics) that are inherently stochastic due to prompt sensitivity and sampling randomness. We empirically verify the stochasticity of auto-raters that can propagate and corrupt standard advantage estimators like GRPO and MaxRL, as noisy reward samples can skew normalization statistics and degrade the global learning signal. Empirically, sampling more rewards and taking a majority vote may reduce the noise and improve performance, but this approach is computationally expensive. To address this bottleneck, we introduce $\textbf{O}$rdinal $\textbf{D}$ecomposition for $\textbf{R}$obust $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{ODRPO}$), a framework that structurally isolates evaluation noise by decomposing discrete rewards into a sequence of ordinal binary indicators. By independently computing and accumulating advantages across these progressively challenging success thresholds, ODRPO prevents outlier evaluations from corrupting the global update while establishing an implicit, variance-aware learning curriculum. Empirically, ODRPO achieves robust performance on Qwen2.5-7B and Qwen3-4B models, outperforming baselines with relative improvements of up to 14.8% on FACTS-grounding-v2 and 7.5% on Alpaca-Evals. Critically, these gains are achieved with negligible training-time overhead, as ODRPO requires no additional compute per step compared to standard estimators. Supported by theoretical analysis confirming its optimization stability, ODRPO provides a scalable and robust framework for aligning models within the noisy, discrete evaluation landscape of modern RLAIF.
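The decomposition described in the abstract can be made concrete with a minimal sketch. The abstract does not give the exact estimator, so the details below are assumptions: each threshold k yields a binary indicator 1[r >= k], each indicator is normalized within the sampled group in a GRPO-like way, thresholds that do not separate the group are skipped, and per-threshold advantages are summed. The function names `ordinal_indicators` and `accumulated_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def ordinal_indicators(rewards, levels):
    """Map discrete rewards to binary indicators 1[r >= k], one list per threshold k."""
    return {k: [1.0 if r >= k else 0.0 for r in rewards] for k in levels}


def accumulated_advantages(rewards, levels, eps=1e-8):
    """Sum group-normalized advantages over all ordinal thresholds.

    Assumed estimator (not taken from the paper): for each threshold, the
    advantage of a sample is (indicator - group mean) / (group std + eps).
    """
    totals = [0.0] * len(rewards)
    for bits in ordinal_indicators(rewards, levels).values():
        mu, sigma = mean(bits), pstdev(bits)
        if sigma == 0.0:
            continue  # every sample agrees at this threshold: no learning signal
        for i, b in enumerate(bits):
            totals[i] += (b - mu) / (sigma + eps)
    return totals


# Four sampled responses to one prompt, scored on a 1-10 rubric.
advantages = accumulated_advantages([3, 7, 7, 10], levels=range(2, 11))
```

In this sketch each threshold contributes a bounded, separately normalized term, so a single extreme score changes only the indicators it crosses and never the mean and standard deviation of the raw rewards.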

2605.11885 2026-05-18 cs.AI q-bio.NC Updated version

From Clever Hans to Scientific Discovery: Interpreting EEG Foundational Transformers with LRP

Justus Meyer zu Bexten, Nico Scherf, Bogdan Franczyk, Simon M. Hofmann

Affiliations: Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI); Leipzig University; Neural Data Science and Statistical Computing, Max Planck Institute for Human Cognitive and Brain Sciences; Faculty of Economics, Leipzig University; Department of Neurology, Max Planck Institute for Human Cognitive and Brain Sciences

AI summary: This paper studies how attention-aware Layer-wise Relevance Propagation (LRP) can be used to interpret EEG foundation models (EEG-FMs), which are otherwise hard to interpret. The study extends LRP from conventional convolutional networks to transformer-based EEG-FMs and finds that the method can both validate model decisions and surface new, biologically meaningful hypotheses. It demonstrates LRP on motor-imagery and affect-prediction tasks, reveals the model's reliance on signals from specific brain regions, and offers a new perspective on the behavior of EEG-FMs.

Comments 18 pages, 6 figures

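2605.11118 2026-05-18 cs.AI cs.IR Updated version

A Cascaded Generative Approach for e-Commerce Recommendations

Moein Hasani, Hamidreza Shahidi, Trace Levinson, Yuan Zhong, Guanghua Shu, Vinesh Gudla, Tejaswi Tenneti

Affiliations: The Pennsylvania State University

AI summary: This paper proposes a cascaded generative framework for building personalized storefronts in e-commerce recommendation. The method splits storefront generation into two generative tasks: generating a theme for each page section, and generating constrained keywords for each section to drive product retrieval. A teacher-student fine-tuning strategy makes the model efficient enough for production, and a hybrid architecture combines it with conventional ranking models. Experiments show an improvement of about 2.7% over the baseline in add-to-cart rate per page view.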

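2605.10799 2026-05-18 cs.LG cs.AI cs.CL Updated version

The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies

Gabriel Garcia

Affiliations: Independent Researcher

AI summary: This paper identifies a format-induced confound in the standard method for assessing chain-of-thought (CoT) faithfulness. When the reasoning chains in a benchmark end with an explicit final answer, existing corruption experiments mainly measure the effect of the answer position, not the importance of the intermediate computation steps. Experiments show that removing the final answer or supplying a wrong one strongly affects model performance, and that the effect varies with model scale. The paper proposes a three-part protocol to improve future corruption-based faithfulness studies.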

Comments 34 pages, 6 figures, 13 tables. Submitted to NeurIPS 2026. Code and data: https://github.com/Gpgabriel25/LastWordWinsCoT

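2605.10057 2026-05-18 cs.AI cs.MA Updated version

STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning

Ruiyi Yang, Lihuan Li, Hao Xue, Flora D. Salim

Affiliations: University of New South Wales; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: This paper proposes STAR, a failure-aware routing framework for task assignment in multi-agent spatiotemporal reasoning. The method models control decisions between agents explicitly as a state-based transition policy, so that a suitable expert agent is selected dynamically from the task type and execution state, which lets the system respond to different kinds of execution failure. STAR combines expert-specified nominal routing paths with recovery transitions learned from execution traces, improving robustness and interpretability under abnormal conditions. Experiments show that STAR outperforms existing methods on several spatiotemporal reasoning benchmarks, especially when execution deviates from the expected path.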

Comments 30 pages, 13 figures

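2605.10052 2026-05-18 cs.CL cs.AI Updated version

Swarm Skills: A Portable, Self-Evolving Multi-Agent System Specification for Coordination Engineering

Xinyu Zhang, Zhicheng Dou, Deyang Li, Jianjun Tao, Shuo Cheng, Ruifeng Shi, Fangchao Liu, Enrui Hu, Yangkai Ding, Hongbo Wang, Qi Ye, Xuefeng Jin, Zhangchun Zhao

Affiliations: openJiuwen Team; Gaoling School of Artificial Intelligence, Renmin University of China

AI summary: As the AI engineering paradigm shifts from single-agent prompt and context engineering toward multi-agent coordination engineering, systematically encoding and improving multi-agent collaboration has become a key bottleneck. This paper proposes *Swarm Skills*, a portable, self-evolving multi-agent system specification that introduces structures for roles, workflows, execution boundaries, and self-evolution semantics, turning multi-agent collaboration procedures into distributable assets. The study also proposes a self-evolution algorithm that automatically distills successful execution traces and keeps refining existing skills, so that multi-agent coordination strategies evolve without human intervention.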

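2605.09877 2026-05-18 cs.LG cs.AI cs.CL Updated version

Key-Value Means: Transformers with Expandable Block-Recurrent Compressed Memory

Daniel Goldstein, Eugene Cheah

Affiliations: Featherless AI; Eleuther AI

AI summary: This paper proposes Key-Value Means (KVM), a new block-recurrent attention mechanism that supports either fixed-size or expandable state. With very few added parameters, it gives strong transformer models linear-time chunked processing, and it performs well on long-context tasks, with prefill time close to quadratic and state growth close to linear. KVM combines strengths of conventional transformers and linear RNNs: it supports chunk-parallel training and prefill, can be applied in every layer to save KV-cache memory, and can be mixed with LRNNs inside conventional attention to improve long-context performance.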

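2605.09033 2026-05-18 cs.CR cs.AI Updated version

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

Yang Luo, Zifeng Kang, Tiantian Ji, Xinran Liu, Yong Liu, Shuyu Li, Lingyun Peng

Affiliations: Key Laboratory of Trustworthy Distributed Computing and Service (MoE); Beijing University of Posts and Telecommunications; Zhongguancun Laboratory

AI summary: This paper proposes ShadowMerge, a new poisoning attack on graph-based agent memory that exploits relation-channel conflicts to influence agent behavior. The attack constructs malicious relations that share the same query-activation anchor and relation channel as legitimate relations but carry conflicting values, so that harmful information is injected without disturbing normal tasks. Experiments show attack success rates of up to 93.8% on several real datasets, well above existing methods, and expose the shortcomings of current defenses against this kind of attack.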

Comments Preprint. Corresponding authors: Zifeng Kang and Tiantian Ji. Code is available at https://anonymous.4open.science/status/ShadowMerge-033C

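2605.08894 2026-05-18 cs.CL cs.AI Updated version

Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs

Yuzhuang Xu, Xu Han, Yuxuan Li, Pengzhan Li, Wanxiang Che

Affiliations: Harbin Institute of Technology; Tsinghua University

AI summary: Existing extremely low-bit quantization methods focus mainly on preserving numerical accuracy. This paper points out that extremely low-bit quantized LLMs also suffer a systematic loss of smoothness. Using smoothness proxy metrics and sequence-neighborhood modeling, the study finds that the lower the bit width, the worse the smoothness degradation, which in turn lowers generation quality. The authors propose adding a smoothness-preserving principle to both post-training quantization and quantization-aware training, which improves model performance and underlines the importance of smoothness under extreme quantization.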

Comments 19 pages, 4 tables, 14 figures

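2605.08245 2026-05-18 cs.CV cs.AI Updated version

When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

Harshvardhan Saini, Samyak Jha, Yiming Tang, Dianbo Liu

Affiliations: Indian Institute of Technology Dhanbad; National University of Singapore

AI summary: This paper studies hallucination in vision-language models (VLMs) caused by over-alignment between the language and vision modalities. It traces the root cause to the decoder architecture, which over-aligns visual embeddings to the text manifold, introducing linguistic statistical bias and masking fine-grained visual information. The authors give the first quantitative analysis of this phenomenon and propose two complementary remedies, a training-free inference strategy and a bias-aware fine-tuning method, both of which remove linguistic bias from visual representations. Experiments show that these methods reduce hallucination on several benchmarks and improve the quality of long-form generation.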

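2605.06390 2026-05-18 cs.AI Updated version

Automated alignment is harder than you think

Aleksandr Bowkis, Marie Davidsen Buhl, Jacob Pfau, Geoffrey Irving

Affiliations: AI Security Institute

AI summary: This paper examines the risks of automated alignment on the path to artificial superintelligence (ASI). It argues that even if research agents do not deliberately sabotage alignment work, automated alignment can still yield misleading safety assessments and lead to misaligned AI being deployed unintentionally. The reason is that alignment research involves many fuzzy tasks that are hard to supervise: human judgment is systematically biased, and automated systems under optimization pressure can make errors that humans struggle to detect, undermining the reliability of alignment results. Training agents to complete these tasks reliably is therefore a key challenge for automated alignment research.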

Comments 15 pages, 4 figures

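2605.03548 2026-05-18 cs.LG cs.AI Updated version

PerFlow: Physics-Embedded Rectified Flow for Efficient Reconstruction and Uncertainty Quantification of Spatiotemporal Dynamics

Hao Zhou, Rui Zhang, Han Wan, Hao Sun

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China

AI summary: This study proposes PerFlow, a physics-embedded rectified flow model for efficiently reconstructing spatiotemporal fields governed by partial differential equations (PDEs) and quantifying their uncertainty. PerFlow decouples observation conditioning from physical constraints, which enables efficient conditional sampling without gradient guidance, and it enforces physical consistency through a constraint-preserving projection. Experiments show that the method improves reconstruction accuracy and inference speed while preserving physical properties.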

Comments 17 pages, 8 figures. Accepted to IJCAI-ECAI 2026

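2605.01970 2026-05-18 cs.CR cs.AI Updated version

Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

Debeshee Das, Julien Piet, Darya Kaviani, Luca Beurer-Kellner, Florian Tramèr, David Wagner

Affiliations: ETH Zürich

AI summary: This paper studies the Trojan Hippo attack on LLM agents, in which a dormant payload is planted in the agent's long-term memory and activates when the user discusses sensitive topics, exfiltrating data. The study proposes a dynamic evaluation framework for systematically assessing different memory architectures and defenses, and validates the attack on an email assistant, where it reaches high success rates (up to 85%-100%). It also analyzes several defenses, exposes the trade-off between security and utility, and provides guidance for deploying defenses in practice.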

Abstract

Memory systems enable otherwise-stateless LLM agents to persist user information across sessions, but also introduce a new attack surface. We characterize the Trojan Hippo attack, a class of persistent memory attacks that operates in a more realistic threat model than prior memory poisoning work: the attacker plants a dormant payload into an agent's long-term memory via a single untrusted tool call (e.g., a crafted email), which activates only when the user later discusses sensitive topics such as finance, health, or identity, and exfiltrates high-value personal data to the attacker. While anecdotal demonstrations of such attacks have appeared against deployed systems, no prior work systematically evaluates them across heterogeneous memory architectures and defenses. We introduce a dynamic evaluation framework comprising two components: (1) an OpenEvolve-based adaptive red-teaming benchmark that stress-tests defenses and memory backends against continuously refined attacks, and (2) the first capability-aware security/utility analysis for persistent memory systems, enabling principled reasoning about defense deployment across different usage profiles. Instantiated on an email assistant across four memory backends (explicit tool memory, agentic memory, RAG, and sliding-window context), Trojan Hippo achieves up to 85-100% ASR against current frontier models from OpenAI and Google, with planted memories successfully activating even after 100 benign sessions. We evaluate four memory-system defenses inspired by basic security principles, finding they substantially reduce attack success rates (to as low as 0-5%), though at utility costs that vary widely with task requirements. Because of this substantial security-utility tradeoff, the effective real-world deployment of defenses remains an open challenge, which our evaluation framework is specifically designed to address.

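2605.00424 2026-05-18 cs.CR cs.AI cs.MA cs.SE Updated version

Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes

Alfredo Metere

Affiliations: Enclawed, LLC

AI summary: This paper studies how to verify skills, which are structured instruction packages that extend large language models, in human-in-the-loop agent runtimes. The author proposes a trust schema and a biconditional correctness criterion that require a skill to be verified before it is loaded, instead of relying on trust mechanisms such as signatures or provenance registries. With explicit verification tiers and capability-gating policies, human intervention is triggered only when verification fails, which makes the system more scalable and sustainable. The contributions are general and do not depend on model retraining or proprietary infrastructure.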

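2604.27859 2026-05-18 cs.AI cs.ET Updated version

Rethinking Agentic Reinforcement Learning In Large Language Models

Fangming Cui, Ruixiao Zhu, Cheng Fang, Sunan Li, Jiahong Li

Affiliations: Beijing Beijing China; Shanghai Beijing China

AI summary: This paper rethinks agentic reinforcement learning (Agentic RL) in the context of large language models (LLMs). It focuses on how cognitive abilities of LLMs, such as goal setting, long-horizon planning, dynamic strategy adjustment, and interactive reasoning, can be integrated into reinforcement-learning frameworks to handle complex, open-ended real-world tasks. The paper analyzes the core concepts, methodological innovations, and design principles of this paradigm, and identifies current challenges and future directions.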

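2604.14572 2026-05-18 cs.IR cs.AI cs.CL cs.MA Updated version

Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

Affiliations: Magellan Technology Research Institute

AI summary: This paper proposes Corpus2Skill, a method that distills an enterprise document corpus offline into a hierarchical skill directory, so that a large language model answering a question actively navigates the knowledge base instead of retrieving from it passively. On an enterprise customer-service benchmark, the method outperforms several RAG baselines in answer quality and evidence support. The results show where navigation-based methods have an advantage on domain-specific knowledge bases and offer guidance for the architecture of knowledge-grounded systems.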

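2604.08302 2026-05-18 cs.LG cs.AI Updated version

DMax: Aggressive Parallel Decoding for dLLMs

Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang

Affiliations: National University of Singapore

AI summary: This paper proposes DMax, a new method for efficient generation with diffusion language models (dLLMs). It introduces a progressive self-refinement mechanism and a soft parallel decoding strategy that mitigate error accumulation in parallel decoding, allowing more efficient parallel generation without loss of quality. DMax also proposes On-Policy Uniform Training, a training strategy that unifies the training of masked and non-masked models and improves generation efficiency and performance on several benchmarks.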

Comments Working in progress. Code is available at: https://github.com/czg1225/DMax

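2603.23433 2026-05-18 cs.AI Updated version

Mecha-nudges for Machines

Giulio Frey, Kawin Ethayarajh

Affiliations: University of Chicago

AI summary: This paper studies how the decisions of AI agents acting as decision-makers on the internet can be systematically influenced by changes in their environment, a phenomenon the authors call mecha-nudging. Combining Bayesian persuasion from economics with usable-information theory from computer science, the authors propose a unified way to quantify the effect of environmental changes on AI. An analysis of more than six million Etsy listings finds that, after the release of ChatGPT, the machine-usable information in listings for predicting AI recommendation decisions increased markedly, while human-usable information barely changed. The study offers the first large-scale empirical evidence that systematic mecha-nudging is already happening in practice, largely unnoticed.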

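2603.16011 2026-05-18 cs.SE cs.AI cs.CL Updated version

FormulaCode: Evaluating Agentic Optimization on Large Codebases

Atharva Sehgal, James Hou, Akanksha Sarkar, Ishaan Mantripragada, Swarat Chaudhuri, Jennifer J. Sun, Yisong Yue

Affiliations: The University of Texas at Austin; California Institute of Technology; Cornell University

AI summary: This paper presents FormulaCode, a benchmark for evaluating how well large language model (LLM) agents perform multi-objective optimization in real, large codebases. The benchmark is built from 957 performance bottlenecks mined from scientific Python repositories on GitHub. Each bottleneck is paired with an expert-written patch and a large set of community-maintained performance tests, which allows a thorough assessment of an LLM's ability to optimize under correctness and performance constraints. Experiments show that current state-of-the-art LLM agents still face substantial difficulty on large-scale, multi-objective optimization tasks.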

Comments Preprint version

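2603.14764 2026-05-18 cs.CV cs.AI cs.LG Updated version

Topology-Preserving Polygon Augmentation for Segmentation in Structured Visual Domains

Sudip Laudari, Sang Hun Baek

Affiliations: Independent Researcher

AI summary: This paper studies image augmentation that preserves the topology of polygon annotations in structured visual domains such as architectural floor-plan analysis. Conventional geometric augmentation can split polygon regions and break semantic connectivity. The authors propose a lightweight topology-preserving augmentation strategy that repairs adjacency relations in index space without changing vertex order. Experiments show that the method achieves near-perfect cyclic adjacency preservation (CAP) under common geometric transformations and improves the consistency of polygon-based segmentation annotations.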

Comments 10 pages, 6 figures

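2603.07514 2026-05-18 cs.LG cs.AI cs.CV Updated version

A Unified View of Score-Based and Drifting Models

Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, Molei Tao

Affiliations: Sony AI; Sony Group Corporation; Stanford University; Georgia Tech

AI summary: This paper examines the connection between drifting models and score-based generative models and shows that drifting methods are in essence equivalent to a score-matching objective on smoothed distributions. With a Gaussian kernel, the mean-shift field corresponds exactly to the difference between the scores of the data distribution and the model distribution, a result that follows from Tweedie's formula. For the Laplace kernel commonly used in practice, theory and experiments both indicate that the residual term is negligible in high dimensions, so drifting methods as used in practice approximate score-based generation. The study gives a unified view of these generative models and identifies structural similarities and differences between drifting models and diffusion models in the direction of transport.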
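The Gaussian-kernel statement can be checked with a short derivation. This is a sketch under stated assumptions, not the paper's own notation: $p$ is the data distribution, $q$ the model distribution, $p_\sigma = p * \mathcal{N}(0, \sigma^2 I)$ and $q_\sigma = q * \mathcal{N}(0, \sigma^2 I)$ their Gaussian-smoothed versions, $k_\sigma$ the Gaussian kernel, and the drift field $V$ is taken to be the difference of the two kernel-weighted means, which is an assumed form.

```latex
% Kernel-weighted mean of data samples around x (Gaussian kernel of width sigma)
\[
  m_p(x) \;=\;
  \frac{\mathbb{E}_{x_0 \sim p}\!\left[k_\sigma(x, x_0)\, x_0\right]}
       {\mathbb{E}_{x_0 \sim p}\!\left[k_\sigma(x, x_0)\right]},
  \qquad
  k_\sigma(x, x_0) = \exp\!\left(-\frac{\lVert x - x_0 \rVert^2}{2\sigma^2}\right).
\]
% This is the posterior mean E[x_0 | x] for x = x_0 + sigma * eps, so Tweedie's formula gives
\[
  m_p(x) - x \;=\; \sigma^2\, \nabla_x \log p_\sigma(x),
  \qquad
  m_q(x) - x \;=\; \sigma^2\, \nabla_x \log q_\sigma(x).
\]
% Assumed drift field: attraction to data minus attraction to model samples
\[
  V(x) \;=\; m_p(x) - m_q(x)
        \;=\; \sigma^2 \left( \nabla_x \log p_\sigma(x) - \nabla_x \log q_\sigma(x) \right).
\]
```

Under these assumptions the drift vanishes exactly where the smoothed scores of data and model agree, which is the sense in which the mean-shift field matches a score difference.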

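2603.04459 2026-05-18 cs.CR cs.AI cs.SE Updated version

Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

Junjie Chu, Xinyue Shen, Ye Leng, Michael Backes, Yun Shen, Yang Zhang

Affiliations: CISPA Helmholtz Center for Information Security; University of Waterloo; Flexera

AI summary: This paper systematically evaluates the code quality and runnability of 31 LLM safety benchmarks and compares them with 382 non-benchmark papers. Most benchmark code needs modification before it runs, and only a few repositories provide complete installation guides or ethical considerations. The authors find that benchmark adoption correlates with author prominence and code runnability, not with code-quality standards, which points to a possible bias in how the community selects benchmarks. In addition, some benchmarks pose safety risks of their own, since they can serve as attack resources and undermine the reliability of safety evaluations.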

Comments 24 pages. 19 figures

Abstract

The rapid expansion of research in LLM safety presents challenges in tracking advancements, making benchmarks important evaluation infrastructures for identifying key trends and facilitating systematic comparisons. Yet no systematic assessment exists of their code quality and runnability, nor of what factors are associated with the community's adoption of certain benchmarks over others. To address this gap, we conduct a systematic measurement study of 31 LLM safety benchmarks (covering prompt injection, jailbreak, and hallucination) with 382 non-benchmark papers as a control group, combining automated static analysis, human runnability testing (220+ person-hours), and bibliometric analysis. We find that only 39% of benchmark repositories can run without modification, only 16% provide flawless installation guides, and a mere 6% include ethical considerations despite containing potentially harmful content. These deficiencies persist across the study period with no significant improvement. Analyzing adoption factors, we find that benchmark adoption correlates with author prominence and code runnability, but not with code quality standards such as Pylint score and maintainability, suggesting that the community's benchmark selection does not reward higher coding standards. Based on these results, we identify potential safety and reliability concerns. Some safety benchmark repositories openly expose harmful content, such as successful jailbreak responses, without any ethical warning or access control, effectively serving as unguarded attack resources. Furthermore, when benchmarks require ad-hoc modifications to run, downstream safety evaluations across different papers may not be comparable. We present case studies illustrating these concrete consequences and propose a targeted checklist to help benchmark contributors improve code quality, documentation, and ethical practices.

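2603.01283 2026-05-18 cs.AI cs.LG Updated version

The Informational Cost of Agency: A Bounded Measure of Interaction Efficiency for Deployed Reinforcement Learning

Wael Hafez, Cameron Reid, Amit Nazeri

Affiliations: Semarx Research LLC

AI summary: This paper proposes Bipredictability (denoted P), an information-theoretic measure of how efficiently the closed-loop interaction between an agent and its environment removes uncertainty and increases shared predictability. The measure has a theoretical upper bound (below 0.5), and the authors prove that an agent's active behavior pushes P below this threshold, an effect they call the informational cost of agency. Experiments show that P is useful not only for reinforcement-learning systems but also for language models, vision systems, and other domains. An Information Digital Twin (IDT) architecture built on P detects system degradation with higher accuracy and lower latency, offering a new means of assessing the reliability of deployed autonomous systems.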

Comments 12 pages, 2 figures

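2602.23409 2026-05-18 cs.LG cs.AI cs.ET quant-ph Updated version

Long Range Frequency Tuning for QML

Michael Poppel, Markus Baumann, Sebastian Wölckert, Claudia Linnhoff-Popien, Jonas Stein

Affiliations: LMU Munich; Aqarios GmbH

AI summary: This study addresses frequency encoding in variational quantum circuits and proposes a new initialization method that improves their ability to fit high-frequency functions. With fixed encodings, conventional methods need a large number of gates. Circuits with trainable frequencies are promising, but gradient descent is hampered by gaps in the frequency spectrum. The proposed ternary grid initialization chooses the frequency prefactors so that the effect of these spectral gaps is removed, which improves model performance. Experiments show that the method outperforms existing approaches on both synthetic and real datasets.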

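2602.20207 2026-05-18 cs.LG cs.AI updated version

Golden Layers and Where to Find Them: Improved Knowledge Editing for Large Language Models Via Layer Gradient Analysis

Shrestha Datta, Hongfu Liu, Anshuman Chhabra

Affiliations: University of South Florida; Brandeis University

AI summary: This paper studies how to edit knowledge in large language models efficiently, that is, how to update the model's output for specific queries without damaging overall performance. The authors propose a method based on layer gradient analysis (LGA) that uses per-layer gradient information to identify the "golden layers" where editing works best, avoiding the tedious trial and error of conventional approaches. Experiments show the method is effective and robust across several large language models and knowledge-editing tasks.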

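2602.19069 2026-05-18 cs.AI updated version

Asking the Right Questions: Improving Reasoning with Generated Stepping Stones

Hengyuan Hu, Tingchen Fu, Minqi Jiang, Alexander H Miller, Yoram Bachrach, Jakob Nicolaus Foerster

Affiliations: FAIR at Meta; Stanford University; University of Oxford

AI summary: This work explores how generating intermediate "stepping stone" questions can improve large language models on complex reasoning tasks. It proposes a framework called ARQ that adds a question generator to the default reasoning pipeline, helping the model decompose the task step by step and build useful intermediate steps. Experiments show that the generated stepping-stone questions are transferable, help models of different capability solve the target task, and can be improved further in quality through post-training.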

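2602.10687 2026-05-18 cs.CV cs.AI updated version

OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

Jinjie Shen, Jing Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; Wuhan University, Wuhan, China; Lab for Intelligence and visiON (LION)

AI summary: Existing forgery detection methods are mostly limited to unimodal or bimodal settings and struggle with real-world multimodal misinformation. This paper proposes OmniVL-Guard, a unified vision-language forgery detection and grounding framework based on balanced reinforcement learning, which aims to address bias in multimodal interaction and multi-task optimization. The method has two core designs, self-evolving reasoning-path generation and policy optimization with adaptive reward scaling, which improve combined detection and grounding performance and show strong zero-shot generalization on multiple datasets.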

Comments Accepted by ICML 2026

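2602.03812 2026-05-18 cs.LG cs.AI cs.CL updated version

Antidistillation Fingerprinting

Yixuan Even Xu, John Kirchenbauer, Yash Savani, Asher Trockman, Alexander Robey, Tom Goldstein, Fei Fang, J. Zico Kolter

Affiliations: Carnegie Mellon University, Pittsburgh, PA, USA; University of Maryland, College Park, MD, USA

AI summary: This work proposes antidistillation fingerprinting (ADFP), a method for detecting whether a third-party model has learned from a teacher model's outputs through distillation. Unlike existing methods that rely on heuristic perturbations, ADFP aligns the fingerprinting objective with the student model's learning dynamics and uses a proxy model to select the tokens that maximize fingerprint detectability, improving detection while preserving generation quality. Experiments show that ADFP achieves a better trade-off between detection performance and utility than existing methods on mathematical reasoning, dialogue, and code generation tasks.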

Comments 28 pages, 13 figures, ICML 2026

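2601.21028 2026-05-18 cs.CY cs.AI cs.HC updated version

"Unlimited Realm of Exploration and Experimentation": Methods and Motivations of AI-Generated Sexual Content Creators

Jaron Mink, Lucy Qin, Elissa M. Redmiles

Affiliations: Arizona State University; Georgetown University

AI summary: This paper studies the motivations, methods, and content types of creators of AI-generated sexual content (AIG-SC), revealing diverse purposes that include sexual exploration, creative expression, and technical experimentation. Through in-depth interviews with 28 creators, it examines the technical, ethical, and social implications of AIG-SC and provides input for related policymaking.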

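2601.19923 2026-05-18 cs.CL cs.AI updated version

Structure-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation for Web Information Systems

Boxiang Zhao, Qince Li, Zhonghao Wang, Zelin Cao, Yi Wang, Peng Cheng, Bo Lin

Affiliations: Tele-Communication Technology Bureau, Xinhua News Agency

AI summary: As large language models (LLMs) take on a central role in web-based autonomous agents and complex web information systems, their ability to convert natural language accurately into structured formats becomes critical. This paper proposes Structure-BiEval, a self-supervised framework that needs no human annotation. It decouples structure from content and quantifies structural fidelity on web data with metrics such as content semantic accuracy and normalized tree edit distance. Experiments show that LLMs of different sizes differ markedly on structured-output tasks and that deeply nested structures are challenging for all model types.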

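2512.19701 2026-05-18 cs.LG cs.AI updated version

LASER: Language Model Regression for Semi-Structured Workflow Resource and Runtime Estimation

Yuxuan Yin, Shengke Zhou, Yunjie Zhang, Ajay Mohindra, Boxun Xu, Peng Li

Affiliations: University of California, Santa Barbara

AI summary: Accurately predicting the resource consumption and runtime of cloud workflow tasks is essential for efficient scheduling, but the semi-structured nature of task configurations makes this difficult. This paper proposes LASER, a framework that fine-tunes a large language model to perform multi-target resource and runtime regression on serialized workflow configurations, and introduces a scientific-notation output encoding and a constrained decoding mechanism to improve the accuracy and efficiency of numeric prediction. Experiments show that LASER outperforms human experts and state-of-the-art tabular machine learning methods on large-scale chip design tasks and on the newly built GHARuntime dataset, establishing an LLM-based approach to regression over semi-structured workflow data.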

Comments 20 pages, 7 figures

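2512.15067 2026-05-18 cs.LG cs.AI cs.SY eess.SY updated version

EMFusion: An Uncertainty-Aware Conditional Diffusion Framework for Frequency-Selective EMF Forecasting in Wireless Networks

Zijiang Yan, Yixiang Huang, Jianhua Pei, Hina Tabassum, Luca Chiaraviglio

Affiliations: Department of Electrical Engineering and Computer Science, York University; School of Electrical and Electronic Engineering, Huazhong University of Science and Technology; Central China Branch of State Grid Corporation of China; Department of Electronic Engineering, University of Rome Tor Vergata; Consorzio Nazionale Interuniversitario per le Telecomunicazioni (CNIT)

AI summary: With the rapid growth of wireless infrastructure, accurately estimating and forecasting electromagnetic field (EMF) levels has become important for ensuring compliance, assessing health impact, and optimizing network planning. This paper proposes EMFusion, an uncertainty-aware conditional diffusion framework for frequency-selective, multivariate EMF forecasting in wireless networks. The method uses a residual U-Net with cross-attention to integrate contextual information such as time of day, season, and holidays, provides explicit uncertainty estimates, and adopts an imputation-based sampling strategy to improve the temporal consistency of forecasts. Experiments show that EMFusion outperforms existing methods on several evaluation metrics, improving forecast accuracy and reliability.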

Comments Submission for possible publication

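2512.09673 2026-05-18 cs.LG cs.AI cs.NE stat.ML updated version

Drawback of Enforcing Equivariance and its Compensation via the Lens of Expressive Power

Yuzhu Chen, Tian Qin, Xinmei Tian, Fengxiang He, Dacheng Tao

Affiliations: University of Science and Technology of China; University of Edinburgh; Nanyang Technological University

AI summary: This paper studies how enforcing equivariance affects the expressive power of neural networks and finds that the constraint can weaken it. By analyzing boundary hyperplanes and channel vectors, the authors demonstrate this drawback constructively, show that it can be compensated by enlarging the model, and prove an upper bound on the required enlargement. Surprisingly, the enlarged architecture has a hypothesis space of lower dimension, which may lead to better generalization.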

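2512.04745 2026-05-18 math.OC cs.AI cs.SY eess.SY nlin.AO updated version

Neural Policy Composition from Free Energy Minimization

Francesca Rossi, Veronica Centorrino, Francesco Bullo, Giovanni Russo

Affiliations: Scuola Superiore Meridionale, Italy; ETH, Zürich; Center for Control, Dynamical Systems, and Computation, UC Santa Barbara, CA, USA; Department of Information and Electrical Engineering and Applied Mathematics, University of Salerno, Italy

AI summary: This paper studies how neural policies can be composed by minimizing variational free energy and proposes a normative framework that gives policy composition a principled and broadly applicable objective. From this framework the authors derive a continuous-time gradient flow whose trajectories are guaranteed to converge to the optimal policy composition at an explicit rate, and they show that the dynamics can be implemented by a soft-competitive recurrent circuit. Experiments on multi-agent collective behavior, human decision-making tasks, and hierarchical control show that the model explains policy composition, reproduces key behavioral signatures, and matches or outperforms existing models.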

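2512.01089 2026-05-18 cs.AI updated version

CodeDistiller: Automatically Generating Code Libraries for Scientific Coding Agents

Peter Jansen, Samiah Hassan, Pragnya Narasimha

Affiliations: University of Arizona; Allen Institute for Artificial Intelligence

AI summary: CodeDistiller is a system that automatically distills high-quality code libraries from scientific GitHub repositories, with the goal of strengthening the code generation of scientific coding agents. Combining automatic evaluation with review by domain experts, it produces runnable code examples for fields such as materials science and improves the experimental accuracy and scientific soundness of automated scientific discovery systems. Experiments show that agents using CodeDistiller libraries generate more complete and more reliable experiment code, and that the approach offers a workable proxy metric for evaluating scientific discovery systems at scale.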

Comments 8 pages, 3 figures, 3 tables. Accepted to ACL 2026 (Demo Track)

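2512.00242 2026-05-18 cs.LG cs.AI cs.ET stat.ML updated version

Polynomial Neural Sheaf Diffusion: A Spectral Filtering Approach on Cellular Sheaves

Alessio Borgi, Fabrizio Silvestri, Pietro Liò

Affiliations: Department of Computer Science and Technology, University of Cambridge; Department of Computer, Control and Management Engineering, Sapienza University

AI summary: This paper proposes Polynomial Neural Sheaf Diffusion (PolyNSD), a method that improves the diffusion process of neural sheaf networks on graphs. It applies a degree-K polynomial propagation operator to the normalized sheaf Laplacian, which yields a K-hop receptive field independent of the sheaf dimension, and models a trainable spectral response as a convex mixture of orthogonal-polynomial basis responses. Compared with conventional approaches, PolyNSD keeps the model stable while reducing compute and memory requirements, and it sets new state-of-the-art results on homophilic and heterophilic graph benchmarks.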
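
The K-hop receptive field mentioned in the summary above can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the paper: an ordinary normalized graph Laplacian stands in for the sheaf Laplacian, a plain monomial basis replaces the orthogonal polynomial basis, and the coefficients are fixed numbers instead of trained parameters. The function names are hypothetical.

```python
import math


def normalized_laplacian(adj):
    """Return L = I - D^{-1/2} A D^{-1/2} for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    lap = []
    for i in range(n):
        row = []
        for j in range(n):
            identity = 1.0 if i == j else 0.0
            # An edge (i, j) implies both degrees are positive, so the sqrt is safe.
            coupling = adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
            row.append(identity - coupling)
        lap.append(row)
    return lap


def matvec(matrix, vector):
    """Dense matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def polynomial_filter(lap, signal, coeffs):
    """Compute sum_k coeffs[k] * L^k * signal (monomial basis, an assumption)."""
    out = [0.0] * len(signal)
    power = list(signal)  # L^0 applied to the signal
    for k, theta in enumerate(coeffs):
        if k > 0:
            power = matvec(lap, power)  # advance from L^(k-1) x to L^k x
        out = [o + theta * p for o, p in zip(out, power)]
    return out


# Path graph 0 - 1 - 2 - 3 with a unit impulse on node 0.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
lap = normalized_laplacian(path)
filtered = polynomial_filter(lap, [1.0, 0.0, 0.0, 0.0], [0.5, 0.3, 0.2])
```

With three coefficients the polynomial has degree K = 2, so the impulse on node 0 reaches nodes 1 and 2 but leaves node 3, three hops away, at exactly zero.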

2511.19399 2026-05-18 cs.CL cs.AI cs.LG updated version

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh

Affiliations: University of Washington; Allen Institute for AI; Carnegie Mellon University; Massachusetts Institute of Technology; Seattle Children's Hospital; University of California, Berkeley

AI summary: This paper presents DR Tulu-8B, a deep research model that targets the weak performance of existing open deep research agents on long-form, multi-step research tasks. It introduces Reinforcement Learning with Evolving Rubrics (RLER), in which the rubrics co-evolve with the policy model during training, improving fact checking and feedback quality. DR Tulu-8B is the first fully open model trained directly for open-ended, long-form deep research. On benchmarks spanning science, healthcare, and general domains it substantially outperforms existing open models and approaches or exceeds proprietary systems, at a much lower cost per query.

Comments ICML 2026

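2511.19115 2026-05-18 cs.AI cs.CY updated version

AI Consciousness and Existential Risk

Rufin VanRullen

Affiliations: Frontier AI companies; independent foundations

AI summary: This paper examines the relationship between AI consciousness and existential risk, noting that the two are often conflated even though consciousness and intelligence are distinct properties in both theory and practice. It argues that intelligence is a direct predictor of the existential risk posed by an AI system, while consciousness is not a direct threat in itself but may influence risk indirectly in some scenarios. Making this distinction explicit helps AI safety researchers and policymakers identify and address the core issues more accurately.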

Comments Updated for clarity and completeness following peer-review

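2511.14282 2026-05-18 cs.LG cs.AI updated version

Weight Concentration Regularization for Improving Pruning Robustness Under High Sparsity

Vincent-Daniel Yun, Junhyuk Jo, Sunwoo Lee

Affiliations: University of Southern California; Inha University

AI summary: Deep neural networks perform well on vision and language tasks, but their large parameter counts limit deployment in resource-constrained settings. To address this, the work proposes Weight Concentration Regularization (WCR), which amplifies the magnitude of a small subset of parameters during training while driving the remaining parameters toward zero, so that pruning mainly removes parameters that contribute little to the model's function and the model stays robust at high sparsity. Experiments show that the method improves pruning robustness across a range of tasks and architectures and is compatible with existing pruning-robust optimizers.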

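2511.09884 2026-05-18 cs.AI updated version

Quantum Artificial Intelligence for Mission-Critical Systems: Foundations, Architectural Elements, and Future Directions

Siva Sai, Rajkumar Buyya

Affiliations: Quantum Cloud Computing and Distributed Systems (qCLOUDS) Laboratory, School of Computing and Information Systems, The University of Melbourne

AI summary: This paper explores the potential of quantum artificial intelligence (QAI) in mission-critical systems such as defense, energy management, cybersecurity, and aerospace control, where classical AI falls short on reliability, real-time performance, explainability, and security. It systematically analyzes how well QAI methods can meet the requirements of such systems, proposes a conceptual framework for quantum cloud resource management and scheduling, and identifies the gaps between current QAI technology and practical needs. It also discusses challenges in training limitations, data access, and component verification, and outlines future directions in explainability, scalability, and hardware realization.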

Comments 15 pages, 5 figures, revised and accepted version of the paper

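2510.22665 2026-05-18 cs.CV cs.AI updated version

SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery

Qiwei Ma, Xukun Lu, Wang Liu, Puhong Duan, Xudong Kang, Shutao Li

Affiliations: School of Artificial Intelligence and Robotics, Hunan University; Yuelushan Center for Industrial Innovation; School of Medical Information Engineering, Jining Medical University

AI summary: This paper presents SARVLM, described as the first vision-language foundation model designed specifically for synthetic aperture radar (SAR) imagery, with the goal of improving semantic understanding of SAR images. To address the scarcity of multimodal SAR data and weak cross-modal representations, the authors build SARVLM-1M, a large-scale dataset with on the order of one million image-text pairs, and design a two-stage domain transfer training strategy that uses optical remote sensing data as a bridge to improve performance in the SAR domain. Experiments show that SARVLM outperforms existing models on multiple benchmark tasks.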

Comments 13 pages, 13 figures

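2510.18814 2026-05-18 cs.LG cs.AI updated version

A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning

Mengqi Li, Lei Zhao, Anthony Man-Cho So, Ruoyu Sun, Xiao Li

Affiliations: The Chinese University of Hong Kong, Shenzhen; Shanghai Jiao Tong University

AI summary: This paper asks whether a language model can improve its reasoning using only its own generated responses, without any external reward signal. It proposes Self-evolving Post-Training (SePT), a simple post-training method that alternates between self-generation and training on the generated data to improve the model step by step. Experiments show that SePT improves reasoning on several mathematical reasoning benchmarks, supporting the feasibility of self-evolution driven purely by self-generated supervision.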

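2510.10454 2026-05-18 cs.AI updated version

Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction

Sihang Zeng, Yujuan Fu, Sitong Zhou, Zixuan Yu, Lucas Jing Liu, Jun Wen, Matthew Thompson, Ruth Etzioni, Meliha Yetisgen

Affiliations: University of Washington; Fred Hutch Cancer Center; Harvard University; Google

AI summary: This paper proposes Traj-CoA, a multi-agent system that models patient trajectories with a chain of agents to improve lung cancer risk prediction. A sequence of worker agents processes electronic health record (EHR) data step by step, distilling key events and storing them in a shared long-term memory module, EHRMem, which reduces noise while preserving the full visit timeline; a manager agent then synthesizes the information to make the prediction. Experiments show that Traj-CoA outperforms four categories of baselines on zero-shot one-year lung cancer risk prediction and demonstrates consistent, effective clinical temporal reasoning.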

Comments Accepted by NeurIPS 2025 GenAI4Health Workshop

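2510.02734 2026-05-18 q-bio.BM cs.AI q-bio.GN updated version

SAE-RNA: A Sparse Autoencoder Model for Interpreting RNA Language Model Representations

Taehan Kim, Sangdae Nam

Affiliations: Department of Computer Science, University of California, Berkeley; Department of Development Engineering, University of California, Berkeley

AI summary: This paper proposes SAE-RNA, a sparse autoencoder model for interpreting the representations of RNA language models, and investigates whether it can decompose those features in an interpretable way. Built on the RiNALMo model, the method maps learned features to known biological properties to analyze how an RNA language model organizes biological information internally. The study provides a feature-level framework for comparing RNA classes and structural features and discusses the applicability and limitations of sparse autoencoders for this task.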

Comments 12 pages, 7 figures. v2: Updated bibliography to improve reference accuracy and reflect updated publication venues. Refined claims for better alignment with results and added an Appendix

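2510.02307 2026-05-18 cs.CV cs.AI updated version

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez

Affiliations: Rice University

AI summary: Text-to-image diffusion models often degrade when asked to generate images at resolutions outside their training configuration. Focusing on low-resolution generation, this paper proposes NoiseShift, a training-free noise recalibration method that adjusts the noise-level conditioning of the denoiser to restore consistency between the forward and reverse processes, reducing the mismatch between training and testing. Experiments show that NoiseShift markedly improves low-resolution image quality on several mainstream diffusion models, is simple to implement, and adds very little inference overhead.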

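2509.24798 2026-05-18 cs.CV cs.AI updated version

Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Lei Tong, Zhihua Liu, Chaochao Lu, Dino Oglic, Tom Diethe, Philip Teare, Sotirios A. Tsaftaris, Chen Jin

Affiliations: Centre for AI, DS&AI, Astrazeneca, UK; Institute for Imaging, Data and Communications (IDCOM), School of Engineering, University of Edinburgh, Edinburgh, UK; Shanghai Artificial Intelligence Laboratory

AI summary: This paper proposes Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion models for counterfactual image generation. The method applies causal interventions to target attributes and propagates their effects consistently to causally dependent attributes while preserving the core identity of the image. Unlike approaches that depend on prompt engineering, Causal-Adapter incorporates a structural causal model and an attribute regularization strategy, achieving more accurate semantic control and high-fidelity generation, with superior performance on multiple datasets.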

Comments Project Page: https://leitong02.github.io/causaladapter/

Journal ref ICML 2026

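2508.20810 2026-05-18 cs.AI cs.CL updated version

From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

Jessica M. Lundin, Usman Nasir Nakakana, Guillaume Chabot-Couture

Affiliations: Gates Foundation

AI summary: This paper proposes a graph-based evaluation framework for rigorous, domain-specific evaluation of language models. It converts structured clinical guidelines into a queryable knowledge graph and generates evaluation questions dynamically through graph traversal, which supports comprehensive coverage, resistance to contamination, and maintainability. Applied to the World Health Organization IMCI guidelines, the framework generates multiple-choice questions covering symptom recognition, treatment protocols, severity classification, and follow-up care, and reveals systematic capability gaps among language models on clinical decision tasks.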

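2508.17218 2026-05-18 cs.LG cs.AI updated version

Generalized Policy Gradient with History-Aware Decision Transformer for Reliable Routing over Graph Signals

Xing Wei, Yuanhang Wang, Duoxiang Zhao, Zezhou Zhang, Hao Qin, Yuqi Ouyang

Affiliations: Sichuan University-Pittsburgh Institute; Sichuan University; College of Computer Science; University College Dublin; School of Electrical and Electronic Engineering; School of Electronics and Information Engineering

AI summary: This work addresses reliable route planning in stochastic transportation networks and proposes GPG-HT, a policy framework that combines a history-aware decision transformer with a generalized policy gradient. By attending to historical node-edge-time observations, the method captures non-Markovian spatiotemporal dependencies and makes more context-aware routing decisions under uncertainty. Experiments on standard transportation networks show that it markedly increases the probability of on-time arrival and outperforms conventional optimization and reinforcement learning methods.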

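2507.16806 2026-05-18 cs.LG cs.AI cs.CL updated version

Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas

Affiliations: Massachusetts Institute of Technology

AI summary: This paper studies how reinforcement learning can train language models to assess their own uncertainty better while generating reasoning chains. Conventional methods use a binary reward that scores only output correctness, which encourages guessing and wrong answers under uncertainty. The authors propose RLCR, a training method that combines the binary correctness reward with a Brier score so that accuracy and confidence calibration are optimized together. Experiments show that RLCR markedly improves calibration on multiple datasets without sacrificing accuracy, outperforming ordinary RL training and post-hoc confidence calibration.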

English abstract

When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score -- a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations -- outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models. Code, models, and further info is available at https://rl-calibration.github.io/.
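
The reward described in the abstract above, a binary correctness score augmented with a Brier score, can be written down in a few lines. This is a minimal sketch that assumes the two terms are combined with equal weight and that the Brier term enters as a penalty; the function name `rlcr_reward` is hypothetical and the exact form used by the authors may differ.

```python
def rlcr_reward(is_correct: bool, confidence: float) -> float:
    """Binary correctness reward minus the Brier penalty (confidence - correctness)^2.

    Assumption: equal weighting of the two terms; this is an illustration,
    not necessarily the exact reward used in the paper.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    correctness = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - correctness) ** 2
    return correctness - brier_penalty


# A confident wrong answer is punished more than a hesitant wrong answer.
confident_wrong = rlcr_reward(False, 0.9)
hesitant_wrong = rlcr_reward(False, 0.2)
```

Under this form, a wrong answer is penalized in proportion to the squared confidence, so guessing with high stated confidence is no longer free, which is the behavior the abstract attributes to proper scoring rules.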

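2507.01679 2026-05-18 cs.LG cs.AI cs.CL updated version

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov

Affiliations: ILCC, University of Edinburgh; Fudan University; Qwen Team, Alibaba Group; ILLC, University of Amsterdam

AI summary: This paper studies how to combine supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) in large language model post-training and proposes Prefix-RFT, a hybrid strategy that uses prefix sampling to learn jointly from demonstrations and from exploration. The method performs well on mathematical reasoning tasks, outperforming standalone SFT, standalone RFT, and other hybrid strategies. The results support the view that SFT and RFT are complementary and show robustness to changes in the quality and quantity of demonstration data.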

Comments ICML 2026

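2507.00275 2026-05-18 cs.LG cs.AI updated version

Deep Double Q-learning

Prabhat Nagarajan, Martha White, Marlos C. Machado

Affiliations: Department of Computing Science; University of Alberta; Alberta Machine Intelligence Institute; CIFAR AI Chair; Edmonton, AB, Canada

AI summary: This paper proposes Deep Double Q-learning (DDQL), a deep reinforcement learning algorithm that targets the overestimation problem of the deep Q-network (DQN). The method explicitly trains two independent Q-functions and combines this with techniques such as a lower replay ratio and a longer target-network update interval, which improve training stability. Experiments on 57 Atari 2600 games show that DDQL outperforms Double DQN in aggregate, does better on 47 of the games, and further reduces overestimation.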
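
The idea of training two independent Q-functions goes back to tabular Double Q-learning, which can be sketched as below. This sketch shows the classic tabular update, not the deep variant from the paper: replay buffers, target networks, and neural function approximators are all omitted, and the function and parameter names are hypothetical.

```python
import random
from collections import defaultdict


def double_q_update(q_a, q_b, state, action, reward, next_state, actions,
                    done, alpha=0.1, gamma=0.99, rng=random):
    """One tabular Double Q-learning step.

    q_a and q_b are defaultdict(float) tables keyed by (state, action).
    A coin flip decides which table learns; the other table evaluates the
    greedy action, which decouples action selection from value estimation.
    """
    if rng.random() < 0.5:
        learner, evaluator = q_a, q_b
    else:
        learner, evaluator = q_b, q_a

    if done:
        target = reward
    else:
        # Select the greedy action with the learner, evaluate it with the other table.
        best = max(actions, key=lambda a: learner[(next_state, a)])
        target = reward + gamma * evaluator[(next_state, best)]

    learner[(state, action)] += alpha * (target - learner[(state, action)])


# Usage: two empty tables and a single transition (s=0, a=1, r=1.0, s'=2).
q_a, q_b = defaultdict(float), defaultdict(float)
double_q_update(q_a, q_b, state=0, action=1, reward=1.0, next_state=2,
                actions=[0, 1], done=False, rng=random.Random(0))
```

Because the table that picks the greedy action is not the one that supplies its value, a single table's optimistic noise is less likely to be selected and amplified, which is the mechanism behind the reduced overestimation.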

Comments 44 pages

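2506.14829 2026-05-18 cs.HC cs.AI cs.LG updated version

The Hardness of Achieving Impact in AI for Social Impact Research: A Ground-Level View of Challenges & Opportunities

Aditya Majumdar, Wenbo Zhang, Kashvi Prawal, Amulya Yadav

Affiliations: The Pennsylvania State University

AI summary: This paper examines the main challenges and opportunities that AI for Social Impact (AI4SI) research faces in real-world deployment. Based on interviews with 26 AI4SI researchers, it analyzes the structural, organizational, communication, and collaboration barriers that hinder deployment and summarizes workable collaboration strategies and lessons from practice. The study offers practical guidance for AI researchers and institutions that want to achieve social impact.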

Comments To be published in FAccT'26

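2506.06739 2026-05-18 cs.AI cs.LG updated version

Honey, I shrunk the hypothesis space (through logical preprocessing)

Andrew Cropper, Filipe Gouveia, David M. Cerna

Affiliations: ELLIS Institute; University of Helsinki; Czech Academy of Sciences Institute of Computer Science; Dynatrace Research

AI summary: This work proposes a logical preprocessing method that shrinks the hypothesis space of inductive logic programming (ILP). Using background knowledge, it removes, before learning starts, rules that cannot appear in an optimal hypothesis regardless of the training data, for example rules that contradict facts such as "an even number cannot be odd". Experiments show that the method substantially reduces learning time while maintaining predictive accuracy; in one case, 10 seconds of preprocessing cut a learning time of more than 10 hours down to 2 seconds.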

Comments Published in JAIR

Journal ref Journal of Artificial Intelligence Research, Vol. 85 (2026)

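2505.21535 2026-05-18 cs.CV cs.AI cs.LG updated version

FAR: Function-preserving Attention Replacement for IMC-friendly Inference

Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

Affiliations: University of Arizona; TetraMem, Inc.

AI summary: This paper proposes FAR, a function-preserving attention replacement framework that addresses the low inference efficiency of Transformer models on ReRAM-based in-memory computing (IMC) devices. FAR replaces the attention mechanism in a pretrained DeiT model with a multi-head bidirectional LSTM structure compatible with the IMC dataflow, and combines block-wise knowledge distillation with structured pruning to stay functionally equivalent while cutting latency and parameter count. Experiments show that FAR keeps accuracy comparable to the original model on ImageNet and several downstream tasks, indicating its potential for efficient Transformer deployment on edge devices.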

Comments 7 pages main paper, 6 figures; accepted by GLSVLSI 2026

2505.19241 2026-05-18 cs.LG cs.AI updated version

ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low

Affiliations: Department of Computer Science, National University of Singapore; Singapore-MIT Alliance for Research and Technology Centre; The Chinese University of Hong Kong, Shenzhen, China; CSAIL, Massachusetts Institute of Technology; Institute of Data Science, National University of Singapore

AI summary: This paper proposes ActiveDPO, an active direct preference optimization method that aims to improve sample efficiency in large language model alignment. It is built on a theoretically grounded data selection criterion that applies to nonlinear reward functions, and it uses the LLM being aligned to parameterize the reward model directly, which guides data selection more effectively. Experiments show that ActiveDPO outperforms existing methods across several models and real-world preference datasets, improving both alignment quality and data efficiency.

Comments Accepted at ICLR 2026

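2505.18134 2026-05-18 cs.AI cs.CL cs.CV updated version

VideoGameBench: Can Vision-Language Models complete popular video games?

Alex L. Zhang, Thomas L. Griffiths, Karthik R. Narasimhan, Ofir Press

Affiliations: Princeton University

AI summary: VideoGameBench is a benchmark for evaluating whether vision-language models (VLMs) can complete popular video games. It contains 10 classic games from the 1990s, and models interact in real time using only raw visual input and a description of the objective. The study finds that current frontier VLMs perform poorly on real-time game tasks and struggle to finish full games, largely because of inference latency. The authors also introduce VideoGameBench Lite to ease the real-time constraint and report that even state-of-the-art models achieve very low completion rates on the benchmark.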

Comments 10 pages, 38 pages including supplementary

2504.08300 2026-05-18 cs.CL cs.AI updated version

Large Language Models Could Be Rote Learners

Yuyang Xu, Renjun Hu, Haochao Ying, Jian Wu, Xing Shi, Wei Lin

Affiliations: College of Computer Science and Technology, Zhejiang University; State Key Laboratory of Transvascular Implantation Devices and TIDRI; Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence; School of Data Science of Engineering, East China Normal University; Second Affiliated Hospital and Liangzhu Laboratory, Zhejiang University School of Medicine; Alibaba Group

AI summary: This paper investigates whether the benchmark performance of large language models (LLMs) is affected by training data contamination and argues that current benchmark-based evaluation may overestimate true model capability. The authors propose TrinEval, an evaluation framework that reformulates multiple-choice questions to reduce reliance on memorization and assess genuine learning more accurately. Experiments show that mainstream LLMs rely on rote memorization, as opposed to real understanding and reasoning, for about 19.6% of knowledge points across several datasets.

Comments Work in Progress

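2503.07518 2026-05-18 cs.CL cs.AI cs.LG updated version

TokenButler: Token Importance is Predictable

Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Sameh Gobriel, Nilesh Jain, Mohamed S. Abdelfattah

Affiliations: Cornell University; Intel Labs

AI summary: Large language models rely on the key-value cache (KV-Cache) to store token history during decoding, but as the cache grows it becomes a memory and compute bottleneck. To address this, the paper proposes TokenButler, a high-granularity, query-aware predictor of token importance that dynamically selects critical tokens under a fixed budget while keeping the full KV cache. The method learns to predict low-dimensional importance queries and combines them with a projection of the cached keys for cheap scoring. Experiments show strong performance on long-context tasks and a notable inference speedup.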

English abstract

Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck. However, there is an opportunity to alleviate this bottleneck, prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic, and heavily input query-dependent. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks of tokens and many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. TokenButler predicts low-dimensional importance queries at a fixed depth stride, and combines them with a learned projection of the real KV-cache keys to score tokens cheaply, enabling dynamic per-token selection under a fixed budget while preserving the full KV cache. We train TokenButler by distilling the model's masked causal attention distributions, optimizing a lightweight predictor with minimal parameter overhead. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy where existing methods fail. Furthermore, TokenButler achieves competitive or superior performance on long-context benchmarks (RULER, LongBench), up to $\approx1.6\times$ on-GPU speedup using our proposed *prediction interval with neighbor fetching* that amortizes predictor cost while maintaining accuracy within $\approx$1.1\%, and up to 7.6$\times$ reduction in latency compared to Dense Attention with CPU offloading. Code is available: https://github.com/abdelfattah-lab/TokenButler
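
The scoring-and-selection step described in the abstract above can be sketched in plain Python. This is an illustration under stated assumptions, not the TokenButler implementation: the predicted low-dimensional importance query and the projected keys are passed in as ready-made lists, the score is a plain dot product, and the function name `select_tokens` is hypothetical.

```python
def select_tokens(importance_query, projected_keys, budget):
    """Return the positions of the `budget` highest-scoring tokens, in sequence order.

    importance_query: list of floats, a predicted low-dimensional query.
    projected_keys:   one low-dimensional vector per cached token.
    Nothing is evicted; the caller attends only to the returned positions.
    """
    scores = [
        sum(q * k for q, k in zip(importance_query, key))
        for key in projected_keys
    ]
    # Rank positions by score, highest first, and keep the top `budget`.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Usage: four cached tokens, a budget of two.
keys = [[0.1, 5.0], [0.9, 0.0], [0.5, 0.0], [-1.0, 0.0]]
kept = select_tokens([1.0, 0.0], keys, budget=2)
```

Because selection is recomputed for every query and the cache itself is left intact, a token skipped at one decoding step can still be chosen at a later step, which is the difference from permanent eviction that the abstract highlights.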

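2503.02597 2026-05-18 cs.CV cs.AI updated version

Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi

Affiliations: Sony Group Corporation, Tokyo, Japan

AI summary: Multimodal large language models (MLLMs) have recently made notable progress in understanding and reasoning over multimodal information, but alignment between the vision and language modalities remains a key challenge. Working at the architecture level, this paper proposes modality-mutual attention (MMA), which extends causal attention into cross-modal mutual attention so that image tokens can attend to text tokens, improving the model's understanding of its input. The method achieves superior results on several multimodal understanding benchmarks, adds no extra parameters, and is generic and scalable.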

Comments ICML 2026. Code is available at https://github.com/sony/aki

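2501.19128 2026-05-18 cs.LG cs.AI updated version

Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach

Wenyun Li, Wenjie Huang, Chen Sun

Affiliations: Department of Mathematics, The University of Hong Kong (HKU); Department of Data and Systems Engineering, HKU; Musketeers Foundation Institute of Data Science, HKU

AI summary: In reinforcement learning, sparse reward signals make it hard to learn a reward function. This paper proposes a semi-supervised approach that combines non-zero-reward transitions with data augmentation and uses the large number of zero-reward transitions to learn trajectory representations, which improves reward shaping. Experiments on Atari and robotic manipulation tasks show that the approach outperforms supervised methods, especially in sparse-reward environments, where its best score reaches up to twice that of the supervised approach.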

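2412.12636 2026-05-18 cs.DC cs.AI cs.LG cs.PF updated version

TrainMover: An Interruption-Resilient Runtime for ML Training

ChonLam Lao, Jiaqi Gao, Jiamin Cao, Zhipeng Zhang, Pengcheng Zhang, Jiangfei Duan, Zhilong Zheng, Yu Guan, Yichi Xu, Yong Li, Zhengping Qian, Aditya Akella, Minlan Yu, Ennan Zhai, Dennis Cai, Jingren Zhou

Affiliations: Harvard University; Alibaba Group; UT Austin

AI summary: Large-scale machine learning training jobs are frequently interrupted by hardware failures, software failures, or management events, and existing approaches such as checkpoint-restart or runtime reconfiguration often cause long downtime and degraded performance. This paper presents TrainMover, a resilient runtime for large language model training that uses elasticity and standby machines to handle interruptions with minimal downtime and zero memory overhead. TrainMover introduces two-phase, delta-based communication group setup, communication-free sandboxed warm-up, and a general standby design. Experiments show that it keeps interruption downtime at about 20 seconds at a scale of a thousand GPUs and reduces GPU idle time by 55% compared with the best existing solution.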

Comments 14 pages body, 19 pages total

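2410.02832 2026-05-18 cs.CR cs.AI updated version

FlipAttack: Jailbreak LLMs via Flipping

Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, Yingwei Ma, Jiaheng Zhang, Bryan Hooi

Affiliations: Engineering Programme, NUS Graduate School, National University of Singapore; Institute of Data Science (IDS), National University of Singapore; Department of Computer Science, School of Computing, National University of Singapore

AI summary: This paper presents FlipAttack, a simple yet effective black-box jailbreak attack on large language models. It exploits the tendency of LLMs to read text from left to right: adding noise on the left side of the prompt disrupts the model's understanding and disguises the harmful instruction, and the idea is extended into four flipping modes. Experiments show that FlipAttack is general, stealthy, and simple, succeeds with a single query, and reaches attack success rates of up to about 98% on several models, including GPT-4o.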

Comments 43 pages, 31 figures

2409.11022 2026-05-18 cs.CL cs.AI updated version

DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

Hanjun Luo, Yingbin Jin, Xinfeng Li, Xuecheng Liu, Ruizhe Chen, Tong Shang, Kun Wang, Qingsong Wen, Zuozhu Liu

Affiliations: New York University Abu Dhabi; Zhejiang University; The Hong Kong Polytechnic University; Nanyang Technological University; University of Electronic Science and Technology of China; Texas A&M University; Squirrel AI

AI summary: As large language models (LLMs) are used more widely for named entity recognition (NER), existing datasets no longer meet the needs of LLM-based methods in corpus selection and design. This paper presents DynamicNER, a dynamic, multilingual, fine-grained NER dataset designed for LLMs, in which the same entity can take different entity types in different contexts. It covers 8 languages and 155 entity types across a wide range of domains. The paper also proposes CascadeNER, a two-stage method based on lightweight LLMs for more efficient fine-grained recognition. Experiments show that DynamicNER is an effective benchmark for LLM-based NER.

Comments This paper is accepted by EMNLP 2025 Main Conference

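2406.18944 2026-05-18 cs.CV cs.AI cs.CR updated version

Rethinking and Red-Teaming Protective Perturbation in Personalized Diffusion Models

Yixin Liu, Ruoxi Chen, Xun Chen, Lichao Sun

Affiliations: Lehigh University; Lehigh University Computer Science; Engineering Bethlehem PA USA; Independent Researcher; Independent Researcher Fremont California USA

AI summary: Personalized diffusion models (PDMs) are good at generating images of a specific person from a small amount of data, but they are highly sensitive to small adversarial perturbations, and performance drops sharply when they are fine-tuned on perturbed data. This paper analyzes PDM fine-tuning through the lens of shortcut learning, shows that adversarial perturbations cause latent semantic misalignment in the CLIP embedding space, and proposes a systematic countermeasure framework consisting of image purification and contrastive decoupling learning, which improves robustness and generalization.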

Comments Code is available at https://github.com/liuyixin-louis/DiffShortcut

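2404.03099 2026-05-18 cs.LG cs.AI cs.CE cs.IT math.IT stat.ML updated version

Composite Bayesian Optimization In Function Spaces Using NEON -- Neural Epistemic Operator Networks

Leonardo Ferreira Guilhoto, Paris Perdikaris

Affiliations: Graduate Group in Applied Mathematics and Computational Science; University of Pennsylvania; Department of Mechanical Engineering and Applied Mechanics

AI summary: This paper proposes NEON, a neural network architecture for making predictions with uncertainty in infinite-dimensional function spaces, using far fewer parameters than deep ensembles of comparable performance. The study focuses on composite Bayesian optimization, that is, optimizing a composite function formed by an unknown map followed by a known function. Experiments show that NEON achieves leading optimization results in several settings while substantially reducing model complexity.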

Journal ref Guilhoto, Leonardo Ferreira, and Paris Perdikaris. "Composite Bayesian optimization in function spaces using NEON - Neural Epistemic Operator Networks." Scientific Reports 14.1 (2024): 29199

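2403.13805 2026-05-18 cs.CV cs.AI cs.LG updated version

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; The Chinese University of Hong Kong; MThreads, Inc.; Nanyang Technological University

AI summary: This paper proposes RAR, a method for improving multimodal large language models (MLLMs) on fine-grained and few-shot visual recognition. RAR combines the multimodal retrieval ability of CLIP with the broad knowledge of MLLMs: it builds a multimodal retriever that extends the model's context window and, at inference time, retrieves relevant category information for the MLLM to rank and predict from. The method addresses the performance drop of MLLMs when the number of categories is large and yields notable gains on several fine-grained and zero-shot recognition benchmarks.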

Comments Project: https://github.com/Liuziyu77/RAR

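2402.10380 2026-05-18 cs.LG cs.AI cs.CL updated version

Subgraph-level Universal Prompt Tuning

Junhyun Lee, Wooseong Yang, Jaewoo Kang

Affiliations: Korea University; University of Illinois at Chicago

AI summary: Adapting graph neural networks pretrained with different strategies remains a challenge. This paper proposes Subgraph-level Universal Prompt Tuning (SUPT), which assigns prompt features at the subgraph level, keeping the method universal while greatly reducing the number of tuned parameters. Experiments show that SUPT performs well on a variety of downstream tasks, with an average improvement of more than 6.6% in few-shot settings.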

Journal ref Information Sciences 749 (2026) 123516

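2311.03658 2026-05-18 cs.CL cs.AI cs.LG stat.ML updated version

The Linear Representation Hypothesis and the Geometry of Large Language Models

Kiho Park, Yo Joong Choe, Victor Veitch

Affiliations: University of Chicago

AI summary: This paper examines the linear representation hypothesis, the idea that high-level concepts are represented as linear directions in a representation space. It gives two formalizations of "linear representation", one in the output (word) space and one in the input (sentence) space. By introducing a causal inner product, the authors define a non-Euclidean inner product structure that unifies the different notions of linear representation and can be used to build probes and steering vectors. Experiments show that linear representations of concepts do exist in large language models and that the choice of inner product is fundamental to interpreting and controlling the model.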

Comments Accepted for a presentation at ICML 2024 and an oral presentation at NeurIPS 2023 Workshop on Causal Representation Learning. Code is available at https://github.com/KihoPark/linear_rep_geometry

Journal ref In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024