arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2605.25046 2026-05-27 cs.CV cs.AI 版本更新

面向检索代理的自然语言查询到配置

Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob, Fiodar Kazhamiaka, Esha Choukse, Matei Zaharia

发表机构 * UC Berkeley（加州大学伯克利分校）； University of Washington（华盛顿大学）； Microsoft Azure Research - Systems（微软Azure研究 - 系统）

AI总结提出BRANE方法，利用LLM将查询转换为工作负载特征，并训练轻量级预测器选择最优配置，在多个基准上实现成本-质量帕累托前沿的优化。

详情

AI中文摘要

现代检索代理暴露了许多配置选择——LLM、检索器、文档数量、跳数和合成策略——每个都影响答案质量和服务成本。目前，这些流水线通常针对每个工作负载手动调整一次，留下了大量每查询优化的空间。我们形式化了这个问题：给定一个自然语言查询以及一个准确性或预算目标，从预定义的流水线目录中选择在推理时最小化成本或最大化准确性的配置。我们提出了**BRANE**，它使用LLM将每个查询转换为工作负载特定的特征，然后训练一个轻量级的每配置预测器，估计流水线是否能正确回答查询。在推理时，**BRANE**选择最大化预测正确性（经成本惩罚）的配置，无需重新训练即可暴露可调的成本-质量权衡。在MuSiQue、BrowseComp-Plus和FinanceBench上，**BRANE**持续推动成本-质量帕累托前沿，以高达89%的成本降低匹配最佳固定配置的准确性，并优于LLM路由、基于规则和微调的Qwen3-4B基线。这些结果表明，对整个检索流水线进行每查询配置是静态工作负载级调优的实用替代方案。

英文摘要

Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.

URL PDF HTML ☆

赞 0 踩 0

2605.27360 2026-05-27 cs.NI cs.AI 版本更新

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

GENESIS: 利用AI智能体实现自主6G RAN合成、研究与测试

Tamerlan Aghayev, Maxime Elkael, Michele Polese, Minh Dat Nguyen, Gabriele Gemmi, Andrea Lacava, Ali Saeizadeh, Reshma Prasad, Paolo Testolina, Angelo Feraudo, Soumendra Nanda, Pedram Johari, Salvatore D'Oro, Tommaso Melodia

发表机构 * Institute for Intelligent Networked Systems（智能网络系统研究所）

AI总结提出GENESIS框架，通过智能体、技能和钩子三种可组合原语及知识层SYNAPSE，将意图转化为经空口实验验证的解决方案，以加速6G无线接入网研发。

Comments 18 pages, 16 figures

详情

AI中文摘要

蜂窝研究与开发受制于六个结构性流程，每个流程每次迭代需要数月的体力工程工作：(i) 将标准或研究论文中的新特性综合为生产代码；(ii) 一致性测试和互操作性测试；(iii) 针对现场异常和多样化部署环境进行加固；(iv) 网络功能的数据驱动优化；(v) 发现并原型化未来标准的新波形、功能及能力；(vi) 保护协议栈免受漏洞攻击。尽管大型语言模型已将通用软件工程中类似的研发工作从数天压缩至数分钟，但其已知缺陷在无线接入网用例中更为严重：它们会幻觉应用程序编程接口并误读规范，导致RAN组件在第一次错误时即失去互操作性，并且它们严重依赖仿真来设计算法，而仿真在迁移到真实硬件时往往失效。为应对这些挑战，我们提出GENESIS，一个智能体人工智能框架，将意图（如规范条款、遥测异常或研究假设）转化为经空口实验验证的解决方案，并反馈到持久知识库中。GENESIS建立在三种可组合原语（智能体、技能、钩子）和一个知识层（SYNAPSE）之上，该知识层既作为事实来源，也作为框架产生的所有工件的接收者，使能力在多次运行中累积。

英文摘要

Cellular research and development (R&D) is throttled by six structural processes that each consume months of manual engineering work per iteration: (i) synthesizing new features from standards or research papers into production code; (ii) conformance and interoperability testing; (iii) hardening against field anomalies and diverse deployment environments; (iv) data-driven optimization of network functionalities; (v) discovering and prototyping novel waveforms, functionalities, and capabilities for future standards; and (vi) securing the stack against vulnerabilities. Although Large Language Models (LLMs) have compressed comparable R&D work in general software engineering from days to minutes, their known pitfalls worsen on Radio Access Network (RAN) use cases: they hallucinate Application Programming Interfaces (APIs) and mis-read specifications, which kills interoperability of RAN components at the first mistake, and they heavily rely on simulations for designing algorithms, which is notorious for breaking when transferred to real hardware. To address these challenges, we present GENESIS, an agentic Artificial Intelligence (AI) framework that converts intents (e.g., a specification clause, a telemetry anomaly, or a research hypothesis) into solutions validated with over-the-air experiments, fed back into a persistent knowledge base. GENESIS is built on three composable primitives (agents, skills, hooks) and a knowledge layer (SYNAPSE) that doubles as the source of ground truth and the recipient of every artifact the framework produces, making capabilities compound across runs.

URL PDF HTML ☆

赞 0 踩 0

2605.27358 2026-05-27 cs.LG cs.AI cs.CL 版本更新

MobileMoE: Scaling On-Device Mixture of Experts

MobileMoE: 扩展设备端混合专家模型

Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka, Digant Desai, Zechun Liu, Vikas Chandra, Raghuraman Krishnamoorthi

发表机构 * Meta AI

AI总结针对设备端部署，提出MobileMoE系列子十亿参数MoE语言模型，通过联合优化架构和四阶段训练，在14个基准上匹配或超越领先的密集模型和MoE模型，并在智能手机上实现高效推理。

详情

AI中文摘要

混合专家（MoE）已成为千亿参数语言模型的事实标准架构，但其在十亿以下规模用于设备端部署的优势尚未得到充分探索。为弥补这一差距，我们提出MobileMoE，一系列设备端MoE语言模型，具有子十亿激活参数（0.3-0.9B激活，1.3-5.3B总参数），为设备端LLM建立了新的帕累托前沿。我们首先制定了一个设备端MoE缩放定律，在移动内存和计算约束下联合优化MoE架构，识别出一个设备端最佳点——具有细粒度和共享专家的适度稀疏性——同时实现内存和计算最优。基于推导出的架构，我们采用四阶段方案训练MobileMoE，包括预训练、中期训练、指令微调和量化感知训练，全部使用开源数据集。在14个基准上，MobileMoE匹配或超越领先的设备端密集LLM，推理FLOPs减少2-4倍，并以最多60%的参数匹配或超越最先进的MoE模型OLMoE-1B-7B。为弥合移动部署的最后一步，我们提供了首个在商用智能手机上的高效MoE推理，并进行了全面的设备端性能分析。在相当的INT4权重内存下，MobileMoE-S的预填充速度比密集基线MobileLLM-Pro快1.8-3.8倍，解码速度快2.2-3.4倍。

英文摘要

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4$\times$ fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers $1.8$-$3.8\times$ faster prefill and $2.2$-$3.4\times$ faster decode than the dense baseline MobileLLM-Pro.

URL PDF HTML ☆

赞 0 踩 0

2605.27354 2026-05-27 cs.LG cs.AI cs.CL 版本更新

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

利用稀疏自编码器的模型内部状态指导LLM后训练数据工程

Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang

发表机构 * Tsinghua University（清华大学）

AI总结提出SAERL框架，通过稀疏自编码器提取模型内部状态，建模数据多样性、难度和质量，用于强化学习数据工程，提升准确率并减少训练步数。

详情

建模代理技术债务与随机税：一个用于测量、模拟和仪表盘展示的独立框架

Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu

发表机构 * School of Business, University of Pittsburgh（匹兹堡大学商学院）

AI总结本文提出一个形式化且可管理的框架，区分代理技术债务（累积的设计与治理负债存量）与随机税（使用随机代理时产生的运营负担流），并通过应付账款模拟和电子表格说明其应用。

2605.27299 2026-05-27 cs.CR cs.AI cs.HC cs.LG cs.SY eess.SY 版本更新

Risk Averse Alert Prioritization for IDS Using Subnormal Gaussian Fuzzy Models

使用次正态高斯模糊模型的IDS风险规避警报优先级排序

Murat Moran

AI总结提出基于次正态高斯模糊数的警报优先级排序框架，通过建模威胁严重性、检测置信度和组织风险态度三种不确定性，利用排序指数实现可调安全姿态，实验证明在检测器退化下比基线方法更鲁棒。

详情

AI中文摘要

现代入侵检测系统每天生成数千条警报，但由于误报或低影响事件过多，警报疲劳严重限制了安全运营的有效性。我们通过提出一个基于次正态高斯模糊数的原则性警报优先级排序框架来解决这个问题，该框架明确建模了三种不确定性来源：威胁严重性、检测置信度和组织风险态度。每个警报被表示为一个模糊数，其核心表示严重性，展度表示不确定性，高度反映检测可靠性。我们应用排序指数对警报进行优先级排序，允许组织通过风险态度参数调整安全姿态。在CIC-IDS2017和NSL-KDD上的实验验证表明，在检测器退化下，该方法比基线方法具有更强的鲁棒性（NDCGrel@100为0.9963对比0.8215），在中等置信度警报中具有明显区分度，在稳健检测器下与基线方法接近。该框架具有理论基础、计算效率高、提供可解释推理，并且在检测器系列和校准错误场景下保持鲁棒性。

英文摘要

Modern intrusion detection systems generate thousands of alerts daily, but alert fatigue severely limits security operations effectiveness due to too many false positives or low-impact events. We address this by proposing a principled framework for alert prioritization based on subnormal Gaussian fuzzy numbers, explicitly modeling three sources of uncertainty: threat severity, detection confidence, and organizational risk attitude. Each alert is represented as a fuzzy number with the core indicating severity, spread indicating uncertainty, and height reflecting detection reliability. We apply ranking indices to prioritize alerts, allowing organizations to tune security posture through a risk-attitude parameter. Experimental validation on CIC-IDS2017 and NSL-KDD demonstrates greater robustness than baselines under detector degradation (0.9963 vs 0.8215 NDCGrel@100), with distinct differentiation in mid-confidence alerts and near-parity with baselines under robust detectors. The framework is theoretically grounded, computationally efficient, provides interpretable reasoning, and remains robust across detector families and miscalibration scenarios.

URL PDF HTML ☆

赞 0 踩 0

2605.27288 2026-05-27 cs.CL cs.AI cs.LG 版本更新

生成式动画：面向提示驱动运动合成的多模型流水线

Mannat Khurana, Sanyam Jain, Rishav Agarwal

发表机构 * Canva ； Adobe

AI总结提出一种结合大语言模型和分割模型的流水线，将自然语言提示自动转换为符合场景几何、深度遮挡和3D透视变换的动画运动路径。

Comments 5 pages, 6 figures

2605.27190 2026-05-27 cs.CL cs.AI cs.LG cs.SD 版本更新

Learning When to Think While Listening in Large Audio-Language Models

在大音频语言模型中学习何时在聆听时思考

Zhiyuan Song, Weici Zhao, Yang Xiao, Suhao Yu, Cheng Zhu, Jiatao Gu

发表机构 * University of Pennsylvania（宾夕法尼亚大学）

AI总结提出一种可学习的等待-思考-回答控制机制，通过多奖励强化学习优化大音频语言模型在流式语音交互中的推理时机，在提升准确率的同时减少响应延迟。

Comments 19 pages, 4 figures, 6 tables

详情

AI中文摘要

近期大音频语言模型（LALMs）的进展使得实时、流式的语音交互越来越实用。在这种场景下，推理质量和响应速度紧密耦合：将推理延迟到语音端点可以提高答案质量，但会将思考时间转移到用户可见的响应延迟中，而过早回答则可能在决定性证据到达之前做出承诺。我们为LALMs引入了一种可学习的等待-思考-回答控制公式。受人类对话渐进性启发，控制器在部分音频证据下决定何时等待、何时外化紧凑的推理更新、以及何时回答。以Qwen2.5-Omni-7B为基础模型，我们从语音推理数据中构建对齐的等待-思考-回答轨迹，使用监督微调（SFT）训练控制器，然后应用解耦裁剪和动态采样策略优化（DAPO）。奖励结合了答案正确性、动作有效性、更新时机、延迟同步、推理质量和链一致性，优化完整的等待-思考-回答轨迹，而不仅仅是最终答案。在一个六任务合成语音推理问答（SRQA）基准上，六奖励DAPO控制器将行加权准确率从67.6%提升到70.3%，同时在相同Qwen部署环境下将端点后最终思考长度减少14%。在一个包含186个人类录音的真实音频基准（Real Audio Bench）上，作为超越文本转语音（TTS）渲染语音的迁移检查，控制器家族仍然有效：SFT实现了最强的准确率，而六奖励DAPO控制器是唯一最终思考长度低于基础模型的学习变体。这些结果表明，流式模型应该学习在音频流中何时使中间推理显式化。

英文摘要

Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides under partial audio evidence when to wait, when to externalize a compact reasoning update, and when to answer. Using Qwen2.5-Omni-7B as the base model, we construct aligned wait-think-answer traces from spoken reasoning data, train the controller with supervised fine-tuning (SFT), and then apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency, optimizing the complete wait-think-answer trajectory and not the final answer alone. On a six-task synthetic spoken reasoning question answering (SRQA) benchmark, the six-reward DAPO controller improves the row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14% under the same Qwen deployment harness. On a 186-item human-recorded Real Audio Bench, a transfer check beyond text-to-speech (TTS)-rendered speech, the controller family remains functional: SFT achieves the strongest accuracy, while the six-reward DAPO controller is the only learned variant whose final-think length falls below the base. These results suggest that a streaming model should learn when to make intermediate reasoning explicit during the audio stream.

URL PDF HTML ☆

赞 0 踩 0

2605.27178 2026-05-27 cs.CV cs.AI cs.LG cs.RO 版本更新

FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation

FoundObj: 自监督基础模型作为无标签3D物体分割的奖励

Zihui Zhang, Zhixuan Sun, Yafei Yang, Jinxi Li, Jiahao Chen, Bo Yang

发表机构 * Shenzhen Research Institute, The Hong Kong Polytechnic University（深圳研究院，香港理工大学）； vLAR Group, The Hong Kong Polytechnic University（vLAR小组，香港理工大学）

AI总结提出FoundObj框架，利用自监督2D/3D基础模型的语义和几何先验作为奖励，通过强化学习引导超点合并，实现无标注复杂场景3D物体分割。

Comments ICML 2026. Zihui and Zhixuan are co-first authors. Code and data are available at: https://github.com/vLAR-group/FoundObj

详情

AI中文摘要

我们解决了在训练过程中不依赖任何场景级人类标注的复杂场景点云中3D物体分割的挑战性任务。现有方法通常局限于识别简单物体，这主要是由于学习过程中物体先验不足。在本文中，我们提出了FoundObj，一个新颖的框架，其特点是基于超点的物体发现代理，该代理在我们的创新语义和几何奖励模块的指导下逐步合并合适的相邻超点。这些模块协同利用自监督2D/3D基础模型中的语义和几何先验，为物体发现代理提供互补反馈，并通过强化学习实现对多类物体的鲁棒识别。在多个基准上的大量实验表明，我们的方法始终优于现有基线。值得注意的是，我们的方法在零样本和长尾场景中表现出强大的泛化能力，突显了其在可扩展、无标签3D物体分割方面的潜力。

英文摘要

We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint-based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self-supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi-class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero-shot and long-tail scenarios, underscoring its potential for scalable, label-free 3D object segmentation.

URL PDF HTML ☆

赞 0 踩 0

2605.27174 2026-05-27 cs.SD cs.AI cs.CY 版本更新

An investigation of AI integration in sound designer workflows and experiences

AI在声音设计师工作流程与体验中的整合研究

Nelly Garcia, Joshua Reiss

发表机构 * Queen Mary University of London（伦敦大学玛丽女王学院）

AI总结通过混合方法研究（76人调查+20人访谈），发现当前AI工具在快速消费媒体中表现良好，但缺乏高端声音设计所需的叙事复杂性，从业者偏好辅助性、任务特定的应用，而非端到端生成系统。

详情

AI中文摘要

人工智能正越来越多地被整合到专业音频制作工作流程中，然而开发者生产的工具与实际声音设计师的需求之间仍存在差距。本文通过一项混合方法研究调查了这一差距，包括对76名从业者的调查以及对20名行业专业人士的后续半结构化访谈。使用描述性统计分析和主题分析对结果进行分析，以识别两个数据集中的模式。我们的分析得出了五个主题：上下文、工作流程、潜力、风险和正确使用。我们的工作表明，当前的AI工具在快速消费媒体环境中表现良好，但缺乏高端声音设计（电影、沉浸式体验等）所需的叙事复杂性。从业者表现出对辅助性、任务特定应用的偏好，特别是在音频修复和库管理方面，而不是端到端生成系统。这项工作为创意产业中AI及AI增强工具的使用正在进行的讨论做出了贡献。我们从声音设计师和创意音频从业者的角度报告了该领域的当前状况，并根据我们的发现为声音技术专家和开发者提供了一系列建议，以指导开发更明智的AI声音设计工具。

英文摘要

Artificial intelligence is increasingly being integrated into professional audio production workflows, yet a gap persists between the tools developers produce and the requirements of practising sound designers. This paper investigates this gap through a mixed-methods study comprising a survey of 76 practitioners and follow-up semi-structured interviews with 20 industry professionals. Results were analysed using descriptive statistical analysis and thematic analysis to identify patterns across both datasets. Five themes emerged from our analysis: Context, Workflow, Potential, Risks, and Right Use. Our work indicates that current AI tools perform adequately in fast-consumption media contexts but lack the narrative sophistication required for high-end sound design (films, immersive experiences etc). Practitioners demonstrate a preference for assistive, task-specific applications, particularly in audio restoration and library management, over end-to-end generative systems. This work contributes to the on-going discussion on the use of AI and AI-enhanced tools in the creative industries. We report on the current status of the field from the point of view of sound designers and creative audio practitioners, and offer a set of recommendation for sound technologist and developers based on our findings to guide the development of more informed AI tools for sound design.

URL PDF HTML ☆

赞 0 踩 0

2605.27168 2026-05-27 cs.CL cs.AI cs.CY 版本更新

Grounding Text Embeddings in Stakeholder Associations

将文本嵌入与利益相关者关联对齐

Jonathan Rystrøm, Sofie Burgos-Thorsen, Zihao Fu, Johan Irving Søltoft, Kenneth C. Enevoldsen, Chris Russell

发表机构 * University of Oxford（牛津大学）； Institute for Wicked Problems（复杂问题研究所）； The Chinese University of Hong Kong（香港中文大学）； Danish Technical University（丹麦技术大学）； Aarhus University（奥胡斯大学）

AI总结提出利益相关者对齐练习方法，通过评估嵌入模型与人类专家的语义距离一致性，发现神经文本嵌入在丹麦政策案例中可靠性显著低于专家（差距19-26个百分点），且该差距在美国联邦AI用例中复现（16个百分点）。

详情

AI中文摘要

文本嵌入被广泛用于分析大型复杂文本语料库。然而，尚不清楚这些嵌入是否捕捉到与使用它们的人类专家相同的语义距离。确保嵌入表示与人类意图一致对于有效分析至关重要。我们提出了利益相关者对齐练习，这是一种使专家关联显式化并将嵌入模型结果扎根于人类理解的方法。在我们关于丹麦政策问题的主要案例研究中，我们发现神经文本嵌入的可靠性远低于人类专家（差距19-26个百分点），并且这种不对齐会传播到下游聚类性能（练习排名与聚类质量之间的Spearman $ρ=0.9$）。一项关于美国联邦AI用例的二次研究使用数字协议和不同的专家社区在英语中复现了该差距（16个百分点）——表明该差距并非单一工具或领域的产物。利益相关者对齐练习提供了一种实用方法，用于评估嵌入模型是否捕捉到对领域专家最重要的语义区分。

英文摘要

Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding representations and human intentions is essential for valid analyses. We present the Stakeholder Grounding Exercise, a method for making expert associations explicit and grounding embedding model results in human understanding. In our primary case study on Danish policy issues, we find that neural text embeddings are substantially less reliable than human experts (19-26 pp gap), and that this misalignment propagates to downstream clustering performance (Spearman $ρ=0.9$ between exercise ranking and cluster quality). A secondary study on US Federal AI use cases replicates the gap (16pp) in English, using a digital protocol and a different community of experts -- demonstrating that the gap is not an artefact of a single instrument or domain. The Stakeholder Grounding Exercise offers a practical method for assessing whether embedding models capture the semantic distinctions that matter most to domain experts.

URL PDF HTML ☆

赞 0 踩 0

2605.27164 2026-05-27 cs.AI 版本更新

Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering

符号查询还是语义检索？面向半结构化问答的数据集与方法

Mateusz Czyżnikiewicz, Ryszard Tuora, Adam Kozakiewicz, Tomasz Ziętkiewicz, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Timothy Hospedales, Cristina Cornelio

发表机构 * Samsung AI Warsaw（三星AI华沙实验室）； Samsung AI Cambridge（三星AI剑桥实验室）

AI总结提出 DualGraph 框架，通过文本知识图谱和符号知识图谱双视图实现半结构化文档的语义检索与符号查询结合，并在 SpecsQA 基准上超越现有方法。

详情

AI中文摘要

检索增强生成（RAG）系统通常通过查询与文档块之间的语义相似性来检索证据。虽然这种方法对非结构化文本有效，但在半结构化语料库上可靠性较低，因为回答可能需要跨多个文档的结构化属性进行精确过滤、聚合或穷举检索。符号方法支持此类操作，但在嘈杂的自然语言语料库上往往脆弱。我们通过 DualGraph 解决了这一差距，这是一个 RAG 框架，通过两种互补视图表示文档：用于语义检索的文本知识图谱和用于对类型化主语-谓语-宾语三元组进行符号查询的符号知识图谱。基于这两个组件，我们提供了多种策略来选择或组合语义和符号证据。我们还引入了 SpecsQA，这是一个来自商业购物网站的基准测试，包含半结构化产品文档和人工策划的问题，涵盖开放式和面向规格的检索。实验表明，DualGraph 在各种问题类型上始终优于最先进的密集检索、GraphRAG、符号和基于表格的基线。代码和数据可在 https://github.com/corneliocristina/DualGraphRAG 获取。

英文摘要

Retrieval-Augmented Generation (RAG) systems for question answering typically retrieve evidence by semantic similarity between the query and document chunks. While effective for unstructured text, this approach is less reliable on semi-structured corpora where answering may require exact filtering, aggregation, or exhaustive retrieval over structured attributes across multiple documents. Symbolic approaches support such operations, but they are often brittle on noisy natural-language corpora. We address this gap with DualGraph, a RAG framework that represents documents through two complementary views: a Textual Knowledge Graph for semantic retrieval and a Symbolic Knowledge Graph for symbolic querying over typed subject--predicate--object triples. Building on these two components, we provide multiple strategies for selecting or combining semantic and symbolic evidence.We also introduce SpecsQA, a benchmark from a commercial shopping website with semi-structured product documents and manually curated questions spanning open-ended and specification-oriented retrieval. Experiments show that DualGraph consistently outperforms state-of-the-art dense-retrieval, GraphRAG, symbolic, and table-oriented baselines across question types.Code and data are available at https://github.com/corneliocristina/DualGraphRAG.

URL PDF HTML ☆

赞 0 踩 0

2605.27157 2026-05-27 cs.AI 版本更新

Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

检测不等于解决：检索增强型大语言模型中的监控控制差距

Zhe Yu, Wenpeng Xing, Chen Ye, Xuyang Teng, Bo Yang, Changting Lin, Meng Han

发表机构 * Zhejiang University（浙江大学）； Binjiang Institute of Zhejiang University（浙江大学滨江研究院）； Hangzhou Dianzi University（杭州电子科技大学）； National Fintech Evaluation Center（国家金融科技评估中心）

AI总结本文通过多轮文档累积协议发现检索增强型大语言模型存在监控控制差距，即模型能识别矛盾证据但无法安全约束最终建议，并揭示其机制在于行动选择缺陷。

详情

AI中文摘要

检索增强型大语言模型被部署用于证据质量决定行动安全的任务，但评估协议假设单轮鲁棒性能够预测证据跨轮累积时的鲁棒性。我们证明这一假设根本错误。模型存在监控-控制差距：它们容易承认矛盾证据，但这种意识无法约束最终建议——检测认知冲突并不意味着安全解决它。通过跨四个模型家族（1.5B-32B参数）和超过50,000次轮次级评估的多轮文档累积协议，我们证明单轮诊断系统性地高估了RAG安全性，矛盾承认与安全解决不相关（这一模式得到针对性人工验证的证实），并且不存在通用的提示修复方法。汇聚的机制证据——隐藏状态探测、注意力分析和响应策略分类——指向行动选择作为最可能的缺陷所在：危险相关信息被内部表示并在不安全生成期间获得增强的注意力，但未能约束输出行为。在检索增强系统可被信任用于高风险场景之前，必须测量并弥合模型识别与行动之间的差距。

英文摘要

Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that single-turn robustness predicts robustness when evidence accumulates across turns. We show this assumption is fundamentally incorrect. Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, yet this awareness fails to constrain their final recommendations - detecting epistemic conflict does not imply resolving it safely. Through a multi-turn document accumulation protocol across four model families (1.5B-32B parameters) and over 50,000 turn-level evaluations, we demonstrate that single-turn diagnostics systematically overestimate RAG safety, that contradiction acknowledgement is uncorrelated with safe resolution, a pattern corroborated by targeted human validation, and that no universal prompt fix exists. Converging mechanism evidence - hidden-state probing, attention analysis, and response-strategy taxonomy - points to action selection as the most plausible locus of the deficit: danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior. The gap between what models recognize and what they do must be measured and closed before retrieval-augmented systems can be trusted in high-stakes settings.

URL PDF HTML ☆

赞 0 踩 0

2605.27156 2026-05-27 cs.CL cs.AI 版本更新

LitSeg: Narrative-Aware Document Segmentation for Literary RAG

LitSeg: 面向文学RAG的叙事感知文档分割

Ruikang Zhang, Zhanni Chen, Yiqiao Cai, Qi Su

发表机构 * Peking University（北京大学）

AI总结提出LitSeg，一种基于叙事理论引导的文档分割框架，通过多阶段提示提取事件、梳理叙事线索并定位转折点，以解决现有分割方法忽视文学叙事结构导致检索与生成性能下降的问题，并引入轻量版LitSeg-Lite通过数据蒸馏降低计算开销。

详情

AI中文摘要

检索增强生成（RAG）通过引入外部知识增强了大型语言模型（LLMs），特别是在文学作品等长尾领域。然而，RAG中关键的文档分割步骤仍未得到充分探索。现有策略通常语义盲目，忽视了文学作品复杂的叙事结构，常常导致情节碎片化和指代不清，严重阻碍了检索和生成性能。为了解决这一问题，我们提出了LitSeg，一种新颖的叙事理论引导的分割框架。通过采用多阶段提示，LitSeg明确提取有效事件，梳理叙事线索，阐明叙事结构，并定位转折点以指导分割。为了减轻大规模模型多阶段推理的计算开销，我们进一步引入了LitSeg-Lite，一种轻量级的单遍分块器，通过两阶段训练策略在LitSeg生成的数据上进行微调，将复杂过程蒸馏为单次推理。大量实验表明，通过结构独立的文本块，我们的方法在检索准确性和上下文相关性上显著优于基线，最终提升了下游问答性能，而消融研究验证了叙事学指导和数据蒸馏的有效性。

英文摘要

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, particularly for long-tail domains such as literary works. However, the critical step of document segmentation in RAG remains largely underexplored. Existing strategies are typically semantically blind and overlook the complicated narrative structures of literary works, often resulting in fragmented plots and unclear references that severely hinder retrieval and generation performance. To address this, we propose LitSeg, a novel narrative-theory-guided segmentation framework. By employing multi-stage prompting, LitSeg explicitly extracts valid events, untangles narrative threads, clarifies narrative structures, and locates turning points to inform segmentation. To alleviate the computational overhead of multi-stage inference with large-scale models, we further introduce LitSeg-Lite, a lightweight single-pass chunker fine-tuned on LitSeg-generated data via a two-stage training strategy, distilling the complex process into a single inference pass. Extensive experiments demonstrate that with structurally independent text chunks, our methods significantly improve retrieval accuracy and context relevance over baselines, ultimately enhancing downstream QA performance, while ablation studies validate the efficacy of narratological guidance and data distillation.

URL PDF HTML ☆

赞 0 踩 0

2605.27141 2026-05-27 cs.AI 版本更新

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

VitaBench 2.0：评估长期用户交互中的个性化与主动型代理

Yuxin Chen, Yi Zhang, Zhengzhou Cai, Yaorui Shi, Zhiyuan Yao, Chenhang Cui, Jingnan Zheng, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua

发表机构 * National University of Singapore（新加坡国立大学）； Meituan（美团）； University of Science and Technology of China（中国科学技术大学）； Beijing University of Posts and Telecommunications（北京邮电大学）； Zhejiang University（浙江大学）

AI总结针对现有代理基准忽视用户偏好推断与利用的问题，提出VitaBench 2.0基准，通过时间序列任务和可扩展记忆接口评估代理在长期交互中的个性化与主动性，实验表明最先进模型仍面临挑战。

详情

AI中文摘要

大型语言模型已演变为交互式代理，与用户在现实任务中协作。在这种设置下，有效协作越来越依赖于理解用户未明确表达的内容，因为用户意图往往反映在碎片化的日常交互中，需要个性化建模和主动交互。然而，现有的代理基准主要评估推理和工具使用，在很大程度上忽视了在现实场景中推断和利用用户偏好的挑战。为解决这一差距，我们引入了VitaBench 2.0，这是一个用于评估长期用户交互中个性化与主动代理行为的基准。在VitaBench 2.0中，任务被组织为单个用户的时间顺序序列，其中偏好嵌入在碎片化和异构的交互中。成功完成任务要求代理从这些交互中持续提取、利用和更新用户偏好。我们进一步通过要求代理识别缺失信息并在决策前主动从用户或环境中获取信息的任务来评估主动性。为了支持系统分析，我们提供了一个可扩展的记忆接口，使得不同记忆架构之间的受控比较成为可能。我们对一系列前沿专有和开源LLM进行了基准测试。结果表明，即使对于最先进的模型，现实世界的个性化仍然极具挑战性，揭示了当前能力与实际需求之间的巨大差距。广泛的分析进一步揭示了当前代理在现实世界个性化决策中的失败模式和能力瓶颈，为未来的模型改进提供了见解。

英文摘要

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.

URL PDF HTML ☆

赞 0 踩 0

2605.27140 2026-05-27 cs.AI 版本更新

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

StepOPSD: 面向智能体强化学习的步骤感知在线偏好蒸馏

Yanfei Zhang, Xu Lin, Chenglin Wu

发表机构 * Independent Researcher（独立研究者）； Tencent（腾讯）； DeepWisdom（深智沃）

AI总结提出StepOPSD框架，以智能体步骤为信用分配单元，通过事后增强教师上下文重新评分步骤段，并在GRPO更新前进行归一化每步信用预算的优势塑造，解决多轮智能体强化学习中的信用分配不匹配问题。

详情

AI中文摘要

多轮智能体的强化学习存在信用分配不匹配问题：奖励稀疏且基于轨迹，而成功往往取决于少数局部决策。现有的在线策略蒸馏（OPD）提供了更密集的令牌级监督，但通常将异质的智能体轨迹视为整体字符串而非因果交互单元。我们提出StepOPSD，一种事后回放偏好自蒸馏框架，以智能体步骤作为信用重分配的单位。StepOPSD将轨迹分解为以动作中心的步骤段，在事后增强的教师上下文中重新评分，并将令牌级对数概率差距转化为符号保持的优势塑造，在GRPO更新前进行归一化的每步信用预算。在ALFWorld和Search-QA上使用Qwen3-1.7B和Qwen2.5-3B-Instruct的实验中，StepOPSD在对局部因果错误最敏感的子集上取得了最佳或次佳结果，包括ALFWorld Heat（79.1%）、PickTwo（95.0%）、Search-QA TriviaQA（61.6%）的第一名，以及HotpotQA（40.4%）的并列最佳。结果进一步揭示了一致的双旋钮定律：较小的α_clip作为广泛稳定的局部信任区域，而最优全局混合强度λ_mix依赖于任务。这些发现表明，当轨迹级奖励与决定下游成功的局部动作弱对齐时，步骤感知蒸馏最为有用。

英文摘要

Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically treats heterogeneous agent trajectories as monolithic strings rather than causal interaction units. We present StepOPSD, a post-rollout preference self-distillation framework that takes the agent step as the unit of credit redistribution. StepOPSD decomposes trajectories into action-centered step segments, rescoring them under hindsight-enriched teacher contexts and converting token-level log-probability gaps into sign-preserving advantage shaping with a normalized per-step credit budget before the GRPO update. Across ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD attains best or second-best results on subsets most sensitive to local causal errors, including first-place performance on ALFWorld Heat (79.1%), PickTwo (95.0%), Search-QA TriviaQA (61.6%), and tied-best performance on HotpotQA (40.4%). The results further reveal a consistent two-knob law: smaller α_clip acts as a broadly stabilizing local trust region, whereas the optimal global mixing strength λ_mix remains task-dependent. These findings suggest that step-aware distillation is most useful when trajectory-level rewards are weakly aligned with the local action that determines downstream success.

URL PDF HTML ☆

赞 0 踩 0

2605.27138 2026-05-27 cs.AI 版本更新

DEI：质量-多样性搜索中的进化推理多样性

John Donaghy, Shikhar Rastogi

AI总结提出DEI框架，通过异构大语言模型作为变异算子进行分布式质量-多样性搜索，实验表明模型多样性比并行性更能提升搜索性能。

Comments Accepted to ICML 2026 Workshop Scalable Learning and Optimization for Efficient Multimodal AI Agents (SCALE)

详情

AI中文摘要

我们提出DEI：进化推理中的多样性，一个分布式质量-多样性（QD）搜索框架，该框架将异构大语言模型（LLM）分配为变异算子，在通过非阻塞集合操作通信的对等节点间运行。与同质并行搜索（在所有工作节点上复制单一模型的归纳偏差）不同，DEI将每个LLM独特的创造性先验视为行为新颖性的互补来源。通过DEI扩展数字红皇后框架，节点在每轮结束时共享局部最优解，以播种下一轮的种群。这产生了跨模型的对抗压力，推动了超越模型内自博弈的鲁棒性。在Core War领域（一个竞争性编程基准，其中Redcode战士程序在模拟机器中战斗）上评估，一个四节点异构集成（GPT-5.4-mini、Claude Sonnet 4.6、GPT-5.2和Claude Haiku 4.5）在相等的总LLM调用预算下，相比单节点基线，实现了124%更高的合并存档QD分数（45.90 vs. 20.46）和28%更高的覆盖率（80.6% vs. 63.0%的单元格）。异构集成还在QD分数、覆盖率和所有四个模型家族的保留解泛化性上优于同等预算的同质集成。这些结果首次提供了经验证据，表明模型多样性（而非仅仅是并行性）是分布式基于LLM的QD搜索中增益的关键驱动因素。

英文摘要

We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework that assigns heterogeneous large language models (LLMs) as mutation operators across peer nodes communicating with non-blocking collective operations. Unlike homogeneous parallel search, which replicates a single model's inductive biases across all workers, DEI treats each LLM's distinct creative prior as a complementary source of behavioral novelty. Extending the Digital Red Queen framework with DEI, nodes share local optimal solutions at the end of each round to seed the next round's population. This creates cross-model adversarial pressure that drives robustness beyond intra-model self-play. Evaluated on the Core War domain, a competitive programming benchmark in which Redcode warrior programs battle inside a simulated machine, a four-node heterogeneous ensemble (GPT-5.4-mini, Claude Sonnet 4.6, GPT-5.2, and Claude Haiku 4.5) achieves 124 percent higher merged-archive QD-Score (45.90 vs. 20.46) and 28 percent higher coverage (80.6 percent vs. 63.0 percent of cells) than a single-node baseline at equal total LLM-call budget. The heterogeneous ensemble also outperforms an equally-budgeted homogeneous ensemble on QD-Score, coverage, and held-out solution generality across all four model families. These results provide the first empirical evidence that model diversity, not merely parallelism, is the key driver of gain in distributed LLM-based QD search.

URL PDF HTML ☆

赞 0 踩 0

2605.27117 2026-05-27 cs.AI 版本更新

Position: AI Safety Requires Effective Controllability

立场：AI安全需要有效可控性

Yige Li, Yunhao Feng, Jun Sun

发表机构 * Singapore Management University（新加坡管理大学）； Ant Group（蚂蚁集团）

AI总结本文提出AI安全应将可控性作为首要目标，通过定义可控性、引入基准测试ControlBench并分析现有对齐机制的不足，提出以控制为中心的架构框架。

Comments 23 pages

详情

AI中文摘要

AI安全在很大程度上仍被框定为对齐：训练模型遵循人类偏好、安全策略和规范约束。这种框架改善了现代语言模型的行为，但对齐行为本身并不能保证部署的智能体在开放、交互和使用工具的环境中能够被停止、覆盖或约束。一个系统可能在期望上是安全的，但在冲突指令、长期执行、对抗性输入或高风险工具使用下，仍可能无法服从明确的运行时权威。这篇立场论文认为，AI安全因此需要将可控性作为第一类目标。我们将\emph{可控性}定义为AI系统在运行时能够可靠地被显式控制信号中断、覆盖、重定向和约束的能力，同时在没有此类信号时保持普通效用。为了研究这一差距，我们引入了\controlbench{}，一个用于评估高风险智能体场景中可控性失败的基准测试。基于OpenClaw的智能体实验表明，当前的对齐和防护机制降低了风险，但往往无法提供持久、权威和可执行的运行时控制。因此，我们提出了一个以控制为中心的架构框架，强调显式控制平面、运行时干预路径、持久控制状态和可审计决策接口，作为未来可控AI系统的关键设计原则。

英文摘要

AI safety is still largely framed as alignment: training models to follow human preferences, safety policies, and normative constraints. That framing has improved the behavior of modern language models, but aligned behavior does not by itself guarantee that a deployed agent can be stopped, overridden, or constrained once it operates in open-ended, interactive, and tool-using environments. A system may be safe in expectation and still fail to yield to explicit runtime authority under conflicting instructions, long-horizon execution, adversarial inputs, or risky tool use. This position paper argues that AI safety therefore requires controllability as a first-class objective. We define \emph{controllability} as the ability of an AI system to remain reliably interruptible, overridable, redirectable, and constrainable by explicit control signals at runtime while preserving ordinary utility when such signals are absent. To study this gap, we introduce \controlbench{}, a benchmark for evaluating controllability failures in high-risk agentic scenarios. Experiments with OpenClaw-based agents show that current alignment and guardrail mechanisms reduce risk, but often fail to provide persistent, authoritative, and enforceable runtime control. We therefore propose a control-centric architectural framework that highlights explicit control planes, runtime intervention pathways, persistent control states, and auditable decision interfaces as key design principles for future controllable AI systems.

URL PDF HTML ☆

赞 0 踩 0

2605.27115 2026-05-27 cs.AI 版本更新

Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation

基于对抗感知的多教师同策略蒸馏以实现领域保留下的通用能力恢复

Tianlei Chen, Jiao Ou, Ziyuan Liu, Ruiming Tang, Jian Liang, Han Li

发表机构 * Kuaishou Technology, Beijing, China（快手科技，北京，中国）

AI总结针对多教师同策略蒸馏在提示覆盖不完全时出现的恢复-保留对抗和弱信号平坦化问题，提出CaMOPD方法，通过解耦交替训练和基于差距的样本选择，在保持领域性能的同时有效恢复通用能力。

详情

AI中文摘要

领域专业化可以改善LLM在垂直领域的行为，但往往会削弱从原始模型继承的通用能力。最近的多教师同策略蒸馏（MOPD）流程通过教师反馈监督学生生成的轨迹来恢复模型能力，但通常假设教师对齐的提示覆盖，即提示需要匹配教师的训练分布。当通用教师是开源模型且其训练后数据未知时，这一假设难以满足。我们不是试图重建这种隐藏分布，而是研究使用现成的代理通用提示来恢复通用能力。我们识别了在这种不完全覆盖情况下原始MOPD的两种失败模式：混合冲突的恢复和保留梯度导致的恢复-保留对抗，以及均匀平均具有不等校正需求的样本导致的弱信号平坦化。我们提出了对抗感知的多教师同策略蒸馏（CaMOPD），通过解耦交替训练和基于差距的样本选择来解决这些问题。CaMOPD为通用恢复提供专用更新，定期审查领域提示以进行保留，并选择具有较大平均词级教师-学生对数概率差距的样本以集中校正信号。在角色扮演对话和医学推理问答场景中，CaMOPD在保持领域特定行为的同时，在通用恢复方面表现优于基线。梯度一致性分析进一步支持了CaMOPD在产生更一致的校正信号方面的预期效果。

英文摘要

Domain specialization can improve LLM behavior in vertical domains, but often weakens the general capabilities inherited from the original model. Recent Multi-Teacher On-Policy Distillation (MOPD) pipelines recover model capabilities by supervising student-generated trajectories with teacher feedback, but typically assume teacher-aligned prompt coverage, requiring prompts to match the teachers' training distributions. This assumption is difficult to satisfy when the general teacher is an open-source model whose post-training data are unknown. Instead of attempting to reconstruct this hidden distribution, we study general capability recovery with readily available proxy general prompts. We identify two failure modes of vanilla MOPD in this incomplete-coverage situation: recovery-preservation counteraction from mixing conflicting recovery and preservation gradients, and weak-signal flattening from uniformly averaging samples with unequal correction demand. We propose Counteraction-Aware Multi-Teacher On-Policy Distillation (CaMOPD), which addresses these issues with decoupled alternating training and gap-based sample selection. CaMOPD gives general recovery dedicated updates, periodically reviews domain prompts for preservation, and selects samples with larger averaged token-level teacher-student log-probability gaps to concentrate correction signals. Across role-play dialogue and medical reasoning QA scenarios, CaMOPD performs best in general recovery over baselines while maintaining domain-specific behavior. Gradient coherence analyses further support the intended effect of CaMOPD in producing more coherent correction signals.

URL PDF HTML ☆

赞 0 踩 0

2605.27113 2026-05-27 cs.LG cs.AI 版本更新

High-Quality Synthetic Financial Time-Series using a GAN-Diffusion Framework

使用GAN-扩散框架的高质量合成金融时间序列

Giuseppe Masi, Andrea Coletta, Novella Bartolini

发表机构 * Sapienza University of Rome（罗马大学）

AI总结提出一种结合GAN和扩散模型的质量感知生成框架，通过GAN的Critic引导扩散过程，生成更真实且保留金融时间序列典型事实和资产间相关结构的合成数据。

详情

AI中文摘要

近年来，金融机构和公司越来越多地采用合成数据来解决数据稀缺问题并生成反事实市场情景。然而，再现金融时间序列的所有统计特性（通常称为典型事实）对于许多现有的通用架构来说仍然是一个开放的挑战。在本文中，我们提出了一种质量感知生成框架，该框架结合了两类生成方法，展示了它们的集成如何解决现有局限性，同时增强合成数据的真实性。具体来说，我们首先引入CoMeTS-GAN（相关多变量时间序列GAN），这是一种条件生成对抗网络（C-GAN），旨在联合生成相关股票的中价和成交量时间序列。然后，我们展示了如何将我们的GAN架构整合到最先进的扩散模型中，以提高生成的相关结构的质量。具体来说，GAN的Critic作为一个质量评估模块，指导扩散过程，在生成的时间序列中强制执行学习到的相关结构。我们的框架为真实的股票市场模拟提供了一种轻量级且响应迅速的解决方案，明确建模了资产间的相关结构。我们通过实验将我们的框架与领先的生成架构进行了比较，表明它更有效地捕捉了股票市场的典型事实并建模了资产间的相关性。

英文摘要

In recent years, financial institutions and firms have increasingly adopted synthetic data to address data scarcity and to generate counterfactual market scenarios. However, reproducing all the statistical properties of financial time series, commonly known as stylized facts, remains an open challenge for many existing general-purpose architectures. In this paper, we present a quality-aware generative framework that combines two classes of generative methods, demonstrating how their integration addresses existing limitations while enhancing the realism of synthetic data. Specifically, we first introduce CoMeTS-GAN (Correlated Multivariate Time Series GAN), a Conditional Generative Adversarial Network (C-GAN) designed to jointly generate mid-price and volume time-series for correlated stocks. We then show how our GAN architecture can be incorporated into state-of-the-art diffusion models to enhance the quality of generated correlation structures. Specifically, the GAN's Critic serves as a quality evaluation module that guides the diffusion process, enforcing learned correlation structures in the generated time-series. Our framework offers a lightweight and responsive solution for realistic stock market simulation, explicitly modeling inter-asset correlation structures. We experimentally validate our framework against leading generative architectures, showing that it more effectively captures the stylized facts of stock markets and models inter-asset correlations.

URL PDF HTML ☆

赞 0 踩 0

2605.27091 2026-05-27 cs.CL cs.AI 版本更新

MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition

MiRD：通过误覆盖风险分解实现开放式问答的可靠集值预测

Anqi Hu, Zhiyuan Wang, Zijun Jia, Bo Fu

发表机构 * University of Electronic Science and Technology of China（电子科学与技术大学）； Beihang University（北航）

AI总结提出MiRD两阶段框架，通过将整体误覆盖分解为采样失败和条件选择失败，在开放式问答中实现可靠的集值预测，控制采样风险和条件选择风险，并产生更紧的边界和更自适应的预测集。

详情

AI中文摘要

可靠的集值预测为缓解开放式问答中的幻觉提供了一种原则性方法，但现有的共形方法通常依赖于一个脆弱的假设：有限采样必须已经产生至少一个可接受的候选，或者违反此条件的校准示例被丢弃。在本文中，我们介绍了MiRD，一个两阶段框架，将整体误覆盖分解为采样失败和条件选择失败。在第一阶段，MiRD在固定预算下，对有限采样不产生可接受答案的概率建立了一个期望水平的边际上界。在第二阶段，基于采样成功，MiRD使用在整个校准集上定义的与接受性相关的非一致性分数来校准共形选择阈值，从而保持校准集的完整性。在三个开放式问答数据集和八个模型上，MiRD控制了采样风险、条件选择风险和整体误覆盖，同时产生了比PAC风格替代方案更紧的第一阶段边界，以及比仅成功校准更自适应的预测集。

英文摘要

Reliable set-valued prediction provides a principled way to mitigate hallucinations in open-ended question answering (QA), yet existing conformal approaches typically rely on a fragile premise: finite sampling must already produce at least one admissible candidate, or calibration examples violating this condition are discarded. In this paper, we introduce MiRD, a two-stage framework that decomposes overall miscoverage into sampling failure and conditional selection failure. In Stage I, MiRD establishes an expectation-level marginal upper bound on the probability that finite sampling produces no admissible answer under a fixed budget. In Stage II, conditioned on sampling success, MiRD calibrates a conformal selection threshold using admission-correlated nonconformity scores defined over the full calibration set, thereby preserving calibration-set integrity. Across three open-ended QA datasets and eight models, MiRD controls sampling risk, conditional selection risk, and overall miscoverage, while yielding tighter first-stage bounds than PAC-style alternatives and more adaptive prediction sets than successful-only calibration.

URL PDF HTML ☆

赞 0 踩 0

2605.27082 2026-05-27 cs.AI 版本更新

Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?

广泛的生物医学知识能否被情境化为基于场景的命题？

Qingyuan Zeng, Ziyang Chen, Pengxiang Cai, Zixin Guan, Anglin Liu, Lang Qin, Xinyao Lai, Jintai Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Guangzhou University of Chinese Medicine（广州中医药大学）

AI总结提出SCENE双层多智能体框架，通过迭代搜索将广泛生物医学知识转化为证据支持的场景化命题，并在临床试验和LINCS L1000研究中验证其有效性。

详情

AI中文摘要

生物医学发现通常需要将广泛的生物医学知识与特定的实验或临床数据联系起来。背景知识提示相关机制，但通常过于泛化，无法直接映射到数据集变量；而数据驱动模式可能具有数据集特异性且难以从机制上解释。我们将这一缺失环节研究为知识情境化：将广泛的生物医学知识转化为有证据支持的、基于场景的命题，供领域专家检查、重现和验证。我们提出SCENE，一个双层多智能体框架，将知识情境化视为迭代搜索。上层将广泛知识转化为搜索方向，并将其锚定在数据集模式中。下层通过多目标优化执行这些方向，以识别在证据强度和数据支持之间取得平衡的具体命题。两层之间的反馈逐步细化搜索。我们在两个场景中评估SCENE：在临床试验场景中发现具有异质性治疗益处的患者亚组，以及在LINCS L1000研究中识别特定情境下的生物学反应。在临床试验中，SCENE发现了具体且支持充分的亚组，并优于现有基线。在L1000研究中，SCENE识别出具有强靶标-响应匹配和高阳性率的扰动情境。这些结果表明，SCENE弥合了广泛知识与场景特定证据之间的差距，为后续验证生成了可追溯、可检查的假设。

英文摘要

Biomedical discovery often requires connecting broad biomedical knowledge with specific experimental or clinical data. Background knowledge suggests relevant mechanisms but is usually too general to map directly onto dataset variables, while data-driven patterns can be dataset-specific and hard to interpret mechanistically. We study this missing link as knowledge contextualization: transforming broad biomedical knowledge into evidence-supported, scenario-grounded propositions that domain experts can inspect, replay, and validate. We propose SCENE, a bi-level multi-agent framework that treats knowledge contextualization as iterative search. The upper level converts broad knowledge into search directions and grounds them in the dataset schema. The lower level executes these directions through multi-objective optimization to identify concrete propositions that balance evidential strength and data support. Feedback between the two levels progressively refines the search. We evaluate SCENE in two settings: discovering patient subgroups with heterogeneous treatment benefits in clinical trial scenarios, and identifying context-specific biological responses in LINCS L1000 studies. In clinical trials, SCENE discovers specific, well-supported subgroups and outperforms existing baselines. In L1000 studies, SCENE identifies perturbational contexts with strong target-response matching and high positive rates. These results show that SCENE bridges broad knowledge and scenario-specific evidence, producing traceable, inspectable hypotheses for follow-up validation.

URL PDF HTML ☆

赞 0 踩 0

2605.27081 2026-05-27 cs.LG cs.AI cs.DC 版本更新

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

ReMoE: 在内存受限的MoE大模型推理中通过路由器微调提升专家重用

Xiongwei Zhu, Xiaojian Liao, Tianyang Jiang, Yusen Zhang, Liang Wang, Limin Xiao

发表机构 * School of Computer Science and Engineering, Beihang University, Beijing 100191, China（北京航空航天大学计算机科学与工程学院）； Huawei Technologies Ltd（华为技术有限公司）

AI总结提出ReMoE路由器微调框架，通过偏向近期选中的专家实现时间稳定的路由，减少专家从外部存储的获取次数，在保持下游任务性能的同时提升专家重用26%，并在实际系统中实现8.4%的吞吐量提升和1.77-1.99倍的解码加速。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

细粒度混合专家（MoE）模型对每个token仅稀疏激活一部分专家，在保持高模型容量的同时减少激活计算。然而，在内存受限的推理场景中，只能缓存少量专家。未缓存的专家必须从慢速外部存储（如UFS）获取，导致频繁的驱逐和大量的I/O开销。我们提出ReMoE，一个路由器微调框架，旨在提升token级别的专家重用。ReMoE使路由器偏向近期选中的专家，产生时间稳定的路由，更好地匹配缓存局部性约束。通过增加短时专家重用，ReMoE减少了从存储中获取专家，且不增加推理计算开销。在DeepSeek和Qwen模型上的实验表明，ReMoE在保持下游任务性能的同时将专家重用提升了26%。实际系统评估进一步证实了这些优势：在vLLM GPU-CPU专家卸载下，输出吞吐量提升8.4%；在Jetson Orin NX上的llama.cpp中，TPOT降低43.6-49.8%，对应不同工作负载下1.77-1.99倍的解码加速。检查点和使用说明见https://github.com/BUAA-OSCAR/ReMoE。

英文摘要

Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short-horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference-time computation. Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. Real-system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU-CPU expert offloading and reducing TPOT by 43.6-49.8% under llama.cpp on Jetson Orin NX, corresponding to a 1.77-1.99$\times$ decode speedup across diverse workloads. Checkpoints and usage instructions are available at https://github.com/BUAA-OSCAR/ReMoE.

URL PDF HTML ☆

赞 0 踩 0

2605.27079 2026-05-27 cs.LG cs.AI cs.RO 版本更新

Trust Region Q Adjoint Matching

信任区域Q伴随匹配

Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin

发表机构 * KAIST AI（韩国科学技术院人工智能）； Seoul National University（首尔国立大学）； RLWRLD

AI总结针对预训练流策略的离策略强化学习不稳定性，提出信任区域Q伴随匹配方法，通过投影对偶下降自适应控制路径空间KL散度，实现稳定微调，在50个OGBench任务中离线RL成功率达68%。

详情

AI中文摘要

由于多步采样过程带来的优化不稳定性，预训练流策略的离策略强化学习仍然具有挑战性。最近，带有伴随匹配的Q学习（QAM）通过将问题重新表述为一个具有学习评论家的无记忆随机最优控制（SOC）问题来解决这一问题。然而，QAM继承了评论家引导改进的根本脆弱性：当评论家病态时，小的评论家误差会被放大，通常导致模型崩溃。本文引入了信任区域Q伴随匹配（TRQAM），一种稳定的离策略微调算法，通过投影对偶下降自适应地控制与预训练流策略的路径空间KL散度。具体来说，我们优化SOC动力学中的信任区域参数$λ$，并从理论上证明路径空间KL可以用$λ$的闭式函数表示。因此，我们的方法可以精确控制与预训练流策略的精确偏差，实现稳定的离策略强化学习。通过在50个OGBench任务上的实验，TRQAM在离线强化学习和离线到在线强化学习中都持续优于先前的方法。特别是，TRQAM在离线强化学习中实现了68%的总体成功率，显著提高了最强基线的46%。

英文摘要

Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of optimization arising from the multi-step sampling process. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating into a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL with pretrained flow policies through projected dual descent. Specifically, we optimize the trust-region parameter $λ$ in SOC dynamics, and theoretically show that the path-space KL can be represented by a closed-form function of $λ$. As a result, our method can precisely control the exact deviation from pretrained flow policies, achieving stable off-policy RL. Through experiments on 50 OGBench tasks, TRQAM consistently outperforms prior arts in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, substantially improves the strongest baseline at 46%.

URL PDF HTML ☆

赞 0 踩 0

2605.27072 2026-05-27 cs.CL cs.AI 版本更新

ConVer：使用合约和循环不变式合成实现可扩展的形式化软件验证

Muhammad A. A. Pirzada, Weiqi Wang, Yiannis Charalambous, Konstantin Korovin, Lucas C. Cordeiro

发表机构 * The University of Manchester（曼彻斯特大学）

AI总结提出一种自上而下的组合验证工具ConVer，利用大语言模型合成函数合约，并通过CEGAR-CEGIS循环迭代精炼合约，以解决大规模C程序形式化验证中的状态空间爆炸问题。

Comments 12 pages; 6 figures

详情

AI中文摘要

大型C程序的形式化验证受到状态空间爆炸的阻碍：有界模型检验（BMC）工具必须通过展开所有嵌套结构来编码整个状态空间直至预定边界。我们提出了ConVer，一种自上而下的组合验证工具。给定一个带有顶层断言的C程序，ConVer自上而下地分解验证：它使用大语言模型（LLM）从系统属性中合成函数合约，然后在CEGAR-CEGIS循环中交替进行系统级和函数级检查，每当检查失败时通过SMART ICE学习精炼合约。我们在四个难度递增的基准测试套件上评估了ConVer，并与其他最先进（SOTA）工具进行了比较。在包含45个简单C程序的Frama-C基准测试中，ConVer在三个LLM后端上实现了82-96%的验证成功率，其中93-95%的收敛程序仅需一次CEGAR-CEGIS迭代。在X.509解析器基准测试（6个程序）和LF2C-Simple套件（17个程序）上，ConVer分别实现了33-50%和82-88%的成功率。在包含11个递归和循环密集型程序的VerifyThis套件上，预抽象策略实现了55-64%的成功率。此外，我们提出了ESBMC-LF，一个预处理工具，它将LF模型转换为C语言，同时保留LF文件的属性，使ConVer能够验证它们。我们使用ESBMC-LF将LF验证器基准测试转换为C语言；我们将这些称为LF-Hard。我们表明，ConVer总体上成功验证了67%的LF-Hard基准测试。

英文摘要

Formal verification of large C programs is impeded by state-space explosion: Bounded Model Checking (BMC) tools must encode the entire state space up to the predetermined bound by unrolling all nested constructs. We present ConVer, a top-down compositional verification tool. Given a C program with a top-level assertion, ConVer decomposes verification top-down: it uses a large language model (LLM) to synthesise function contracts from the system property, then alternates system-level and function-level checks in a CEGAR-CEGIS loop, refining contracts whenever a check fails via SMART ICE learning. We evaluate ConVer on four benchmark suites of increasing difficulty and against other state-of-the-art (SOTA) tools. On the Frama-C benchmark of 45 simple C programs, ConVer achieves 82-96% verification success across three LLM backends, with 93-95% of converged programs requiring only a single CEGAR-CEGIS iteration. On the X.509 parser benchmark (6~programs) and LF2C-Simple suite (17 programs), ConVer achieves 33-50% and 82-88% success respectively. On the VerifyThis suite of 11 recursive and loop-intensive programs, the Pre-Abstraction strategy achieves 55-64% success. In addition, we present ESBMC-LF a preprocessor tool that converts LF models to C while preserving the properties of the LF files, enabling ConVer to verify them. We transpile the LF Verifier Benchmarks using ESBMC-LF to C; we denote those LF-Hard. We show that ConVer successfully verifies 67% of LF-Hard benchmarks overall.

URL PDF HTML ☆

赞 0 踩 0

2605.27042 2026-05-27 cs.CR cs.AI 版本更新

Lessons from Penetration Tests on Large-Scale Agent Systems

大规模智能体系统渗透测试的经验教训

Kevin Eykholt, Dhilung Kirat, Xiaokui Shu, Jiyong Jang, Frederico Araujo, Ian Molloy

发表机构 * IBM Research（IBM研究院）

AI总结本文通过对2025年专有智能体产品的两次渗透测试，评估了AI智能体的安全态势是否有所改善，并指出许多安全漏洞并非全新，而是反映了先前计算系统中长期存在的重复性弱点类别。

Comments Accepted at SAGAI 2026

详情

AI中文摘要

随着AI系统获得越来越多的自主性和执行能力，发现的安全漏洞数量持续上升。然而，许多这些漏洞并非根本上的新颖，而是反映了先前计算系统中长期观察到的重复性弱点类别。具有执行能力的AI智能体实际上是无限的自修改程序，与计算栈的多个层进行广泛交互。这种广泛的交互表面给开发者带来了显著的安全负担，他们必须推理并保护复杂的跨层行为。先前的研究主要集中在开源智能体和智能体框架中的漏洞。相比之下，专有智能体系统——在更严格的编码标准和正式审查流程下开发——是否表现出类似的安全弱点仍不清楚。在本文中，我们展示了2025年对专有智能体产品进行的两次渗透测试的结果，并评估了自这些评估以来AI智能体的安全态势是否有所改善。

英文摘要

As AI systems gain increasing autonomy and execution capability, the number of discovered security vulnerabilities continues to rise. However, many of these vulnerabilities are not fundamentally novel, but instead reflect recurring classes of weaknesses long observed in prior computing systems. Execution-capable AI agents are effectively unbounded, self-modifying programs that interact extensively with multiple layers of the computing stack. This broad interaction surface imposes a significant security burden on developers, who must reason about and secure complex cross-layer behaviors. Prior research has primarily focused on vulnerabilities in open-source agents and agent frameworks. In contrast, it remains unclear whether proprietary agent systems -- developed under stricter coding standards and formal review processes -- exhibit similar security weaknesses. In this paper, we present findings from two penetration tests conducted in 2025 against proprietary agent products and evaluate whether the security posture of AI agents has improved since these assessments.

URL PDF HTML ☆

赞 0 踩 0

2605.27033 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Tracing Computation Density in LLMs

追踪LLMs中的计算密度

Corentin Kervadec, Iuliia Lysova, Iuri Macocco, Marco Baroni, Gemma Boleda

发表机构 * Universitat Pompeu Fabra（庞培法布拉大学）； ICREA

AI总结提出s-Trace方法估计最优子图，发现LLM计算分为早期稀疏核心和后期密集细化两个阶段，且计算量与模型不确定性相关。

详情

AI中文摘要

基于Transformer的大型语言模型（LLMs）由数十亿个参数组成，这些参数排列在深度和宽度都很大的计算图中，但尚不清楚它们是否对所有输入都充分利用了全部容量。我们引入了s-Trace方法，以有效估计最能近似完整模型输出的大小为s的子图。通过这种方法，我们发现各种LLM中的计算组织成两个不同的阶段。一个主要由早期层节点组成的小子图可以重建完整模型输出分布的头部。添加更多节点（主要位于后期层，且越来越多地由注意力头组成）会导致近似完整输出分布的逐步细化。此外，我们发现每个输入所需的计算量与模型不确定性相关，并且更稀疏的子图编码浅层统计信息，例如单字频率。总体而言，我们的结果表明，有效的LLM计算中存在一致的模块化组织，其中稀疏的早期层核心提供粗略预测，然后通过后期层中更密集的计算进一步细化。

英文摘要

Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs, but it is not clear that they exploit their full capacity for all inputs. We introduce the s-Trace method to efficiently estimate the subgraph of size s that best approximates a full model output. With this method, we find the computation in a variety of LLMs to be organized in two distinct phases. A small subgraph mostly composed of early-layer nodes can reconstruct the head of the full model output distribution. Adding further nodes, mostly located in later layers and increasingly consisting of attention heads, leads to incremental refinements in approximating the full output distribution. We find moreover that the amount of necessary computation per input correlates with model uncertainty, and that sparser subgraphs encode shallow statistics, such as unigram frequency. Overall, our results suggest a consistent modular organization in effective LLM computation, with a sparse early-layer core providing a rough prediction that is further refined through denser computations in later layers.

URL PDF HTML ☆

赞 0 踩 0

2605.27028 2026-05-27 cs.LG cs.AI 版本更新

Less is More: Early Stopping Rollout for On-Policy Distillation

少即是多：用于在线策略蒸馏的早停展开

Zhou Ziheng, Jiaqi Li, Huacong Tang, Ying Nian Wu, Demetri Terzopoulos

发表机构 * University of California, Los Angeles（加州大学洛杉矶分校）； Beijing Institute of General Artificial Intelligence（北京通用人工智能研究院）

AI总结针对在线策略蒸馏中存在的“离策略教师衰减”问题，提出早停展开（ESR）方法，通过限制响应生成的前几个token来提升性能、GPU效率和训练稳定性。

详情

AI中文摘要

在线策略蒸馏最近成为标准序列级模仿的有前途的替代方案，通过使用教师模型对学生自身的展开进行评分来训练学生。然而，我们观察到这种范式中的“离策略教师衰减”问题：对于后面的token，由于学生的早期轨迹作为上下文对于教师来说是离策略的，教师产生纠正性分数的能力会衰减，并可能退回到预训练阶段学习的token补全行为。我们通过实验验证了这个问题，并提出了早停展开（ESR）来解决它：一种简单而有效的蒸馏策略，仅限制展开生成到前几个响应token。我们表明，ESR在模型大小、家族、任务和训练制度上均超越了全展开在线策略蒸馏的性能，并且在跨模型家族场景下表现出更高的GPU效率和训练稳定性。我们进一步研究了这一惊人性能背后的机制，发现了ESR的“级联对齐”和“子模式承诺”效应，这可能解释其为何有效，甚至有时超过教师模型性能。此外，我们表明这种基于位置的token选择策略不能完全由KL散度和熵信号解释。

英文摘要

On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe ``Off-policy Teacher Decay'' problem in this paradigm: for the later tokens, with student's earlier trajectory as context that is off-policy to the teacher, the teacher's ability to produce a corrective score would decay, and may fall back to token-completion behavior learned in the pre-training stage. We empirically verify this problem, and we propose Early Stopping Rollout (ESR) to fix it: a simple yet effective distillation strategy that simply restricts the rollout generation to the first response tokens. We show that ESR both surpasses the full rollout OPD performance across model size, family, tasks and training regime, and exhibit much higher GPU efficiency and training stability, especially under cross model family scenarios. We further investigate the mechanism behind this surprising performance and discovered "Cascading Alignment" and "Sub-mode Commitment" effect of ESR that may explain why it works effectively and even sometimes exceeding the teacher model performance. Besides, we show that this position-based token selection strategy cannot be fully explainable by KL divergence and entropy signals.

URL PDF HTML ☆

赞 0 踩 0

2605.27022 2026-05-27 cs.AI 版本更新

时间步感知的 SVDQuant-GPTQ 用于 Wan2.2-I2V 的 W4A4 量化

Junhao Wu, Dezhong Yao, Hai Jin

发表机构 * National Engineering Research Center for Big Data Technology and System（大数据技术与系统国家工程研究中心）； Services Computing Technology and System Lab（服务计算技术与系统实验室）； Cluster and Grid Computing Lab（集群与网格计算实验室）； School of Computer Science and Technology（计算机科学与技术学院）； Huazhong University of Science and Technology（华中科技大学）

AI总结针对 Wan2.2-I2V 视频扩散 Transformer 的 W4A4 量化，提出结合 SVDQuant 低秩异常补偿、GPTQ 重建感知残差权重量化和时间步分箱逐层激活裁剪比搜索的后训练量化框架，在 OpenS2V-Eval 上降低 59.3% 峰值显存且仅损失 0.9% VBench 平均分。

详情

AI中文摘要

大型视频扩散 Transformer 的 W4A4 量化提供了显著的内存节省，但面临两个主要挑战：稀疏的大幅度激活异常值，以及跨多步去噪轨迹的强时间步依赖的激活分布。这些困难因 Wan2.2-I2V 的双专家混合专家 DiT 设计而加剧，其高噪声和低噪声专家表现出不同的量化敏感性，单一全局校准策略无法捕捉。我们提出了一种后训练量化框架，结合基于 SVDQuant 的低秩异常补偿、基于 GPTQ 的重建感知残差权重量化，以及针对每个专家独立进行的时间步分箱逐层激活裁剪比搜索。在 OpenS2V-Eval 基准上，我们的方法相对于 BF16 基线将峰值 GPU 内存降低了 59.3%，同时仅导致 VBench 平均分数下降 0.9%，成像质量下降 2.3%，表明专家和时间步感知的校准对于 MoE 视频 DiT 的高保真 W4A4 推理至关重要。

英文摘要

W4A4 quantization of large video diffusion Transformers offers substantial memory savings but is hindered by two main challenges: sparse large-magnitude activation outliers, and strongly timestep-dependent activation distributions across the multi-step denoising trajectory. These difficulties are compounded by Wan2.2-I2V's two-expert Mixture-of-Experts DiT design, whose high-noise and low-noise experts exhibit distinct quantization sensitivities that a single global calibration policy cannot capture. We propose a post-training quantization framework combining SVDQuant-based low-rank outlier compensation, GPTQ-based reconstruction-aware residual weight quantization, and timestep-bin-wise per-layer activation clipping-ratio search conducted independently for each expert. On the OpenS2V-Eval benchmark, our method reduces peak GPU memory by 59.3\% relative to the BF16 baseline while incurring only a 0.9\% drop in VBench average score and a 2.3\% drop in Imaging Quality, demonstrating that expert- and timestep-aware calibration is essential for high-fidelity W4A4 inference on MoE video DiTs.

URL PDF HTML ☆

赞 0 踩 0

2605.26969 2026-05-27 cs.CL cs.AI 版本更新

Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling

Recon：基于重建指导的推理合成用于用户建模

Alan Zhu, Mihran Miroyan, Carolyn Wang, Andrew Zhou, Lisa Dunlap, Narges Norouzi, Joseph E. Gonzalez

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出Recon方法，通过动作重建分数评估推理轨迹的预测能力，以改进用户建模中的推理合成，在多个领域优于事后合理化基线。

详情

AI中文摘要

用户建模旨在使用语言模型（LM）从过去的上下文-动作对（例如对话轮次）语料库中模拟个体的行为，从而在行为科学、人机协作和市场研究等环境中模拟用户。最近的方法通过合成推理轨迹来扩充这些语料库，通常通过同时以上下文和动作为条件生成。然而，这种条件构成事后合理化而非推理：轨迹保证证明动作的合理性，但可能不编码潜在的潜在因果决策路径。我们提出Recon，它使用动作重建通过预测能力对推理轨迹进行评分：给定上下文和候选推理，重建模型预测动作，重建保真度决定推理质量。在四个领域，Recon相对于标准事后合理化基线Backward Synthesis实现了54.7%的胜率。此外，我们发现使用来自Recon的奖励训练推理合成模型可提高下游用户建模性能，相对于基线实现了高达70.0%的胜率。我们进一步表明，Recon合成的推理可跨模型迁移，并改善重建模型之外的用户建模。我们的工作表明，事后合理化对于推理合成是不够的，有用且可解释的推理应自然地从上下文中引出动作。

英文摘要

User modeling aims to use language models (LMs) to mimic an individual's behavior from a corpus of past context-action pairs (e.g., conversation turns), enabling the simulation of users in settings like behavioral science, human-AI collaboration, and market research. Recent approaches augment these corpora with synthesized reasoning traces, typically generated by conditioning on both context and action. However, such conditioning constitutes post-hoc rationalization rather than reasoning: the trace is guaranteed to justify the action, but may not encode the underlying latent causal decision paths. We propose Recon, which uses action reconstruction to score reasoning traces by their predictive power: given a context and candidate reasoning, a reconstruction model predicts the action, and reconstruction fidelity determines reasoning quality. Across four domains, Recon achieves a 54.7% win rate over Backward Synthesis, a standard post-hoc rationalization baseline. Further, we find that training a reasoning synthesis model with rewards derived from Recon improves downstream user modeling performance, achieving a win rate of up to 70.0% over baselines. We further show that Recon-synthesized reasoning transfers across models, and improves user modeling beyond the reconstruction model. Our work demonstrates that post-hoc rationalization is insufficient for reasoning synthesis, and that useful and interpretable reasoning should naturally elicit the action from the context.

URL PDF HTML ☆

赞 0 踩 0

2605.26958 2026-05-27 cs.CL cs.AI 版本更新

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

Tournament-GRPO：面向开放式长文本生成强化学习的群组锦标赛奖励

Zixuan Yang, Yiqun Chen, Wei Yang, Erhan Zhang, Zihan Shen, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

发表机构 * Renmin University of China（中国人民大学）； University of Southern California（南加州大学）； Zhejiang University（浙江大学）； Xiaohongshu Inc.（小红书公司）

AI总结针对开放式长文本生成中缺乏可靠参考答案和自动评估指标的问题，提出Tournament-GRPO框架，通过同一查询生成结果间的多轮锦标赛比较将基于规则的LLM评判转化为相对奖励，在Deep Research Bench上取得4.52分提升。

详情

推理深度与环境复杂度：逻辑推理任务中RLVR数据分配的受控研究

Yihua Zhu, Qianying Liu, Fei Cheng, Jiaxin Wang, Akiko Aizawa, Sadao Kurohashi, Hidetoshi Shimodaira

发表机构 * Kyoto University（京都大学）； University of Tokyo（东京大学）； NII LLMC（日本信息处理学会LLMC）； RIKEN（理化学研究所）

AI总结通过将推理空间划分为深度和复杂度两个维度，并考虑四种推理形式，在合成知识图谱环境中进行受控实验，发现联合深度-复杂度覆盖优于单轴策略，不同推理家族对RLVR覆盖的反应非均匀，且均匀混合优于分阶段课程。

Comments Pre-print

详情

AI中文摘要

基于可验证奖励的强化学习（RLVR）已成为后训练推理模型的核心，但现有研究的一个关键局限在于对推理空间的狭隘视角：难度仅被视为推理深度，奖励集中在正向演绎状态追踪。相反，我们沿两个维度刻画推理空间。难度：除了推理深度，我们研究环境复杂度，即模型必须在干扰项和交互结构中识别正确路径。奖励推理形式：我们考虑现实世界推理核心的四种能力：演绎状态追踪、对隐藏事件或事实的溯因恢复、归纳规则归纳以及类比迁移。为解耦这些因素，我们构建了一个合成知识图谱环境，具有受控的预训练和后训练分布，其中每个实例在深度、复杂度和任务家族上变化。三个发现：联合深度-复杂度覆盖优于单轴策略；推理家族反应非均匀，溯因推理在RL覆盖区域外退化，任务相关性聚类为演绎-溯因对和归纳-类比对；在固定预算下，均匀混合优于分阶段课程。我们还发现，最近的现成模型表现出相同的演绎-溯因不对称性，表明这一差距并非我们受控设置的假象。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existing studies is their narrow view of the reasoning space: difficulty is treated as reasoning depth alone, and reward is concentrated on forward deductive state tracking. We instead characterize the reasoning space along two dimensions. Difficulty. Beyond reasoning depth, we study environment complexity, where models must identify the correct path amid distractors and interacting structures. Rewarded reasoning form. We consider four abilities core to real-world reasoning: deductive state tracking, abductive recovery of hidden events or facts, inductive rule induction, and analogical transfer. To disentangle these factors, we construct a synthetic knowledge-graph environment with controlled pre- and post-training distributions, where each instance varies along depth, complexity, and task family. Three findings emerge: joint depth-complexity coverage outperforms single-axis recipes; reasoning families respond non-uniformly, with abductive reasoning degrading outside the RL-covered region and task correlations clustering into deductive-abductive and inductive-analogy pairs; and uniform mixing outperforms staged curricula under a fixed budget. We also find that recent off-the-shelf models exhibit the same deductive-over-abductive asymmetry, suggesting that this gap is not merely an artifact of our controlled setup.

URL PDF HTML ☆

赞 0 踩 0

2605.26926 2026-05-27 cs.AI 版本更新

From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation

从规范到指标 (N2I-RAG): 一种用于法律指标计算的智能检索增强生成框架

Youssef Al Mouatamid, Marie Bonnin, Jihad Zahir

发表机构 * LISI Laboratory（LISI实验室）； Cadi Ayyad University（卡迪·阿亚德大学）； Univ Brest（布列塔尼大学）； IRD, Univ Brest, CNRS, Ifremer, LEMAR（IRD、布列塔尼大学、CNRS、Ifremer、LEMAR）

AI总结提出N2I-RAG框架，通过自适应检索、基于LLM的智能体和验证机制，实现从法律文本到指标的透明、可追溯的自动计算，在法国海洋环境法语料库上优于基线方法。

详情

AI中文摘要

从规范文本计算法律指标是法律监测和政策评估中的关键任务，但由于法律语言的复杂性、规模、解释性以及可用文档质量的差异，这一任务面临重大挑战。现有的自然语言处理技术和生成模型可以辅助法律分析，但往往存在较高的幻觉风险，且缺乏可靠指标计算所需的可解释性和证据基础。本文提出N2I-RAG（从规范到指标），一种智能检索增强生成框架，旨在以透明且可追溯的方式自动化法律指标的计算。我们将自适应检索、基于LLM的智能体和验证机制集成到一个模块化流水线中，其中每个组件在过滤、检索和评估证据，以及生成与可识别法律条款相关的二元法律结果方面执行定义明确的角色。该框架通过要求对中间决策和最终指标分配进行明确解释来强调可追溯性。我们使用内部构建的包含扫描和数字两种来源的法国海洋环境法律语料库评估N2I-RAG。与多个语言模型家族的对比实验表明，所提出的方法始终优于基线系统，并且在两种不同禁令的测试中具有良好的泛化能力。结果表明，智能检索增强生成可以桥接开放文本法律语言和标准化指标计算，为透明且可扩展的法律观测站奠定基础。

英文摘要

Computing legal indicators from normative texts is a key task in legal monitoring and policy evaluation, but presents significant challenges due to the complexity, scale, and interpretive nature of legal language, as well as the variability in available document quality. Existing natural language processing techniques and generative models can assist in legal analysis, but often suffer from high risk of hallucinations and lack the interpretability and evidence grounding required for reliable indicator computation. This paper presents N2I-RAG (From Norms to Indicators), an agentic retrieval-augmented generation framework designed to automate the computation of legal indicators in a transparent and traceable way. We integrate adaptive retrieval, llm-based agents, and validation mechanisms in a modular pipeline, where each component performs a defined role in filtering, retrieving, and assessing evidence, and in producing binary legal outcomes linked to identifiable legal provisions. The framework emphasizes traceability by requiring explicit explanations of intermediate decisions and final indicator assignments. We evaluate N2I-RAG using an in-house constructed French marine environmental law corpus that includes both scanned and digital sources. Comparative experiments with multiple language model families demonstrate that the proposed approach consistently outperforms baseline systems, and generalizes well when tested on 2 different bans. The results indicate that agentic retrieval-augmented generation can bridge open-text legal language and standardized indicator computation, offering a foundation for transparent and scalable legal observatories.

URL PDF HTML ☆

赞 0 踩 0

2605.26911 2026-05-27 cs.AI 版本更新

TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

TADDLE: 一种用于检测有缺陷的LLM生成同行评审的工具增强型代理

Hanqi Duan, Xiang Li

发表机构 * East China Normal University（东华大学）

AI总结针对LLM生成的同行评审难以检测缺陷的问题，提出TADDLE工具增强型代理，通过四个专用分析工具和两阶段半监督学习，在二元检测和多标签分类任务上表现优异。

详情

AI中文摘要

LLM生成的同行评审在主要会议中越来越常见，但由于它们语言流畅、结构良好，其缺陷难以检测。现有工作要么仅分类作者身份而不评判质量，要么使用为人类撰写的评审设计的特征来评分质量；没有先前系统能在单个缺陷类型级别检测LLM生成评审中的缺陷。为弥补这一空白，我们引入了TADDLE，一种用于检测有缺陷的LLM生成同行评审的工具增强型代理，以及首个针对此任务的专家标注基准。我们的基准包含对50篇ICLR 2025论文的1800条评审，由18位领域专家根据六个缺陷类别（加上一个无缺陷标签）的分类法进行多标签标注。TADDLE将检测分解为四个专用分析工具——验证、纠正、完善和转换——由一个代理协调；一个集成器通过两阶段半监督学习将其输出综合为二元和多标签分类。大量实验表明，TADDLE在二元检测和多标签分类任务上均表现强劲。我们在https://github.com/AquariusAQ/TADDLE发布基准和代码。

英文摘要

LLM-generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because they are uniformly fluent and well-structured. Existing work either classifies authorship without judging quality, or scores quality with features designed for human-written reviews; no prior system detects deficiencies in LLM-generated reviews at the level of individual defect types. To bridge the gap, we introduce TADDLE, a Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews, together with the first expert-annotated benchmark for this task. Our benchmark comprises 1,800 reviews on 50 ICLR 2025 papers, multi-label-annotated by 18 domain experts against a taxonomy of six defect categories (plus a non-deficient label). TADDLE decomposes detection into four specialized analysis tools -- Verify, Correct, Complete, and Transform -- orchestrated by an agent; an integrator synthesizes their outputs into binary and multi-label classifications via two-stage semi-supervised learning. Extensive experiments show that TADDLE performs strongly on both binary detection and the multi-label classification task. We release the benchmark and code at https://github.com/AquariusAQ/TADDLE.

URL PDF HTML ☆

赞 0 踩 0

2605.26908 2026-05-27 cs.AI cs.DS cs.LG 版本更新

On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions

关于因子图中可交换因子检测的充要条件

Malte Luttermann, Ralf Möller, Marcel Gehrke

发表机构 * Institute for Humanities-Centered Artificial Intelligence, University of Hamburg, Germany（人文导向人工智能研究所，汉堡大学，德国）

AI总结本文重新审视了因子图中可交换因子检测的理论基础，指出现有算法依赖的定理仅为必要条件而非充分条件，并提出了修正算法以保证正确性和效率。

详情

AI中文摘要

利用概率图模型（如因子图）中对象的不可区分性是提升概率推理算法的关键，并允许对领域规模进行可处理的概率推理问题。在因子图中利用不可区分对象的核心是识别可交换因子，即其输出值在分配给其部分参数的输入值的排列下保持不变的因子。本文重新审视了检测可交换因子的最先进算法的理论基础。具体而言，我们表明，在其当前形式下，最先进算法依赖于一个中心定理，该定理被错误地视为识别可交换因子的充分条件，而实际上它仅意味着必要条件。因此，正如我们在本文中所展示的，最先进算法可能会产生错误结果。为了修复当前最先进算法中存在的缺陷，我们证明了上述定理的一个略微修改版本，该版本作为识别可交换因子的必要条件。此外，我们提出了最先进算法的修正版本，在保持其效率的同时确保正确性，并引入了一种具有更严格最坏情况边界的补充算法。

英文摘要

Exploiting the indistinguishability of objects in a probabilistic graphical model such as a factor graph is key to lifted probabilistic inference algorithms and allows for tractable probabilistic inference problems with respect to domain sizes. A central building block for the exploitation of indistinguishable objects in factor graphs is the identification of commutative factors, i.e., factors whose output values are invariant under permutations of input values assigned to a subset of their arguments. In this paper, we revisit the theoretical foundations underlying the state-of-the-art algorithm to detect commutative factors. Specifically, we show that in its current form, the state-of-the-art algorithm relies on a central theorem that is mistakenly regarded as a sufficient condition to identify commutative factors, while it actually only implies necessary condition. Consequently, the state of the art might, as we show in this paper, deliver incorrect results. To fix the flaws currently present in the state of the art, we prove a slightly modified version of the aforementioned theorem, which serves as a necessary condition to identify commutative factors. Moreover, we present a corrected version of the state-of-the-art algorithm, which keeps its efficiency while ensuring correctness and introduce a complementary algorithm with tighter worst-case bounds.

URL PDF HTML ☆

赞 0 踩 0

2605.26898 2026-05-27 cs.SE cs.AI 版本更新

Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton

引导LLM使用软件设计模式的策略：以单例模式为例

Viktor Kjellberg, Farnaz Fotrousi, Miroslaw Staron

发表机构 * University of Gothenburg and Chalmers University of Technology（哥德堡大学和查尔姆斯理工大学）

AI总结通过实验比较四种提示策略（指令、二元自动反馈、详细自动反馈、少样本详细反馈），评估13个LLM在164个Java编码挑战中生成遵循单例模式的代码的能力，发现迭代二元反馈在保持或提升功能性的同时最佳地实现了单例模式对齐。

Comments Accepted at PROMISE 2026

详情

DOI: 10.1145/3803846.3807469

AI中文摘要

大型语言模型（LLM）可以从自然语言提示生成功能性源代码，但往往无法一致地遵循更高级别的架构结构或设计模式。由于LLM在软件工程中的应用日益增多，它们将既定设计原则应用于生成代码的能力对于软件产品的长期成功至关重要。因此，本文的目标是确定引导LLM将设计模式融入生成源代码的策略。我们设计了一个计算实验，评估13个LLM生成遵循单例设计模式的代码的能力，使用了四种提示策略：指令、二元自动反馈、详细自动反馈以及带少样本提示的详细反馈，在HumanEval-X的164个Java编码挑战中进行。我们的结果表明，引导LLM包含设计模式的最佳策略在很大程度上取决于模型类型。尽管如此，总体而言，迭代二元反馈在保持或改善代码功能性的同时，提供了与单例模式的最佳对齐。通过指令引导，Llama 3.3在100%的情况下生成了单例类，并改善了代码功能性，使通过的测试数量增加了34.1个百分点。通过指令和二元反馈引导，它取得了类似的结果。Qwen 3（8B）使用二元反馈将单例模式对齐度提高到99.2%，功能性提高到58.6%。我们的结果表明，即使是简单的策略也可以用于引导LLM使用设计模式。

英文摘要

Large Language Models (LLMs) can generate functional source code from natural-language prompts, but often fail to consistently follow higher-level architectural structures or design patterns. Since LLMs are increasingly used in software engineering, their ability to apply established design principles to generated code is crucial to the long-term success of software products. Therefore, the goal of this paper is to identify strategies for guiding LLMs to incorporate design patterns into the generated source code. We designed a computational experiment to evaluate the ability of 13 LLMs to generate code that follows the Singleton design pattern, using four prompting strategies: instructions, binary automated feedback, extensive automated feedback, and extensive feedback with few-shot prompts, in 164 Java coding challenges from HumanEval-X. Our results shows that the optimal strategy to guide LLMs to include design patterns depends heavily on the type of model. Still, overall, iterative binary feedback provides the best alignment with Singleton while preserving or improving the code's functionality. With guiding with instructions, Llama 3.3 generated Singleton classes in 100% of cases and improved code functionality, increasing the number of tests passed by 34.1 percentage points. It achieved a similar result with guidance through instructions and binary feedback. Qwen 3 (8B) increased the alignment with Singleton to 99.2% and the functionality to 58.6% using binary feedback. Our result suggests that even simple strategies can be used to guide LLMs to use design patterns.

URL PDF HTML ☆

赞 0 踩 0

2605.26895 2026-05-27 cs.LG cs.AI stat.ML 版本更新

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

微不足道的大小，显著的效果：大型语言模型中的尺度向量

Mingze Wang, Shuchen Zhu, Yuxin Fang, Binghui Li, Kai Shen, Shu Zhong

发表机构 * Peking University（北京大学）

AI总结本文系统研究了大型语言模型中的尺度向量，发现其虽参数占比极小但对预训练至关重要，通过自放大预条件效应优化优化过程，并提出了三种轻量级改进策略，在多种模型规模上一致提升性能。

Comments 36 pages

详情

AI中文摘要

现代大型语言模型（LLM）中的归一化层由确定性归一化操作和可学习的尺度向量组成。尽管归一化操作已被广泛研究，但尺度向量尽管被普遍使用，其作用仍未被充分理解。在这项工作中，我们从表达能力、优化和架构结构的角度对LLM中的尺度向量进行了系统研究。首先，我们通过实验表明，虽然尺度向量仅占模型参数的极小部分，但移除它们会显著降低LLM的预训练效果。我们的理论进一步表明，在Pre-Norm架构中，尺度向量并不增加表达能力；相反，它们通过对后续线性映射产生自放大预条件效应来改善优化。其次，我们研究了权重衰减对尺度向量的作用。通过区分Input-Norm和Output-Norm层，我们从理论上证明，由于它们在优化和表达能力中的不同作用，权重衰减对前者有益但对后者有害。第三，受此理解的启发，我们提出了三种轻量级且互补的尺度向量改进方法：分支特异性异质性、线性映射周围的改进放置以及幅度-方向重参数化。理论和实验均表明，每种改进都能带来一致的收益。最后，我们将这些改进整合为一个统一的尺度向量策略，并通过在0.12B到2B参数的密集和混合专家模型上进行大规模LLM预训练实验，使用多种优化器和学习率调度，在工业级token预算下进行评估。该统一策略始终比精心调整的基线获得更低的终端损失，并展现出更有利的扩展行为，同时增加可忽略的参数和计算开销。

英文摘要

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.

URL PDF HTML ☆

赞 0 踩 0

2605.26893 2026-05-27 cs.CL cs.AI 版本更新

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

GeoFaith: 时空双视角下的忠实思维链

Weijiang Lv, Wentong Zhao, Jiayu Wang, Yuhao Wu, Jiaheng Wei, Xiaobo Xia

发表机构 * Xidian University（西安电子科技大学）； Xi’an Jiaotong University（西安交通大学）； Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科学与技术大学（广州））； University of Science and Technology of China（中国科学技术大学）

AI总结针对思维链推理中的事后合理化问题，提出基于潜在几何结构和熵动力学的时空框架GeoFaith，通过可扩展的引导流水线构建忠实性检测器并联合优化结果正确性、过程忠实性和轨迹一致性。

详情

Helicase: 不确定性引导的供应链知识图谱构建与自主多智能体大语言模型

Yunbo Long, Haolang Zhao, Ge Zheng, Alexandra Brintrup

发表机构 * Department of Engineering, University of Cambridge（剑桥大学工程系）； The Alan Turing Institute（艾伦·图灵研究所）

AI总结提出Helicase，一种基于多智能体大语言模型的自主系统，通过不确定性引导的迭代验证和知识图谱构建，解决供应链中需要多跳推理的结构化推断问题，并引入SCQA基准评估。

详情

AI中文摘要

基于大语言模型的多智能体系统已被广泛用于知识检索和报告生成，通过网页搜索和文本推理综合已知信息。然而，供应链中的许多关键信息任务并非简单的一次性查询：它们是结构化推断问题，需要在复杂、碎片化的网络资源中进行多跳推理。诸如“特斯拉哪些组件使用了来自澳大利亚矿山的锂？”之类的问题在任何单一文档中都没有答案；答案必须通过自主构建和分析从碎片化、异构来源中组装起来的动态知识图谱，以计算方式合成。此外，这种发现过程必须具有不确定性意识：决策不仅依赖于答案，还依赖于对其可靠性的校准置信度，该置信度可追溯到来源质量和推理一致性。为了解决这一能力差距，我们提出了Helicase，一种用于不确定性引导的供应链知识图谱构建的自主多智能体大语言模型系统。Helicase将高层供应链查询分解为可执行的调查计划，通过迭代验证循环协调专门的网页搜索、推理和编码智能体，并逐步构建带有每个事实不确定性注释的查询特定供应链知识图谱。其三层不确定性框架在行动、轨迹和记忆层跟踪不确定性，从而实现结构化推断和校准置信度评估。为了评估整个复杂性谱系中的自主推理，我们引入了SCQA（供应链查询评估），这是一个包含80个供应链查询的基准，这些查询组织成四个象限，涵盖单跳到多跳推理，在高低数据可见性下进行。

英文摘要

LLM-based multi-agent systems have been widely adopted for knowledge retrieval and report generation, synthesizing known information through web search and textual reasoning. However, many critical information tasks in supply chains are not simple one-shot queries: they are structural inference problems requiring multi-hop reasoning across complex, fragmented web resources. Questions such as \textit{``Which Tesla components use lithium from Australian mines?''} have no answer in any single document; answers must be computationally synthesized through the autonomous construction and analysis of dynamic knowledge graphs assembled from fragmented, heterogeneous sources. Moreover, such discovery processes must be uncertainty-aware: decisions depend not only on answers but on calibrated confidence in their reliability, traceable to source quality and reasoning consistency. To address this capability gap, we propose \textit{Helicase}, an autonomous multi-agent LLM system for uncertainty-guided supply chain knowledge graph construction. \textit{Helicase} decomposes high-level supply-chain queries into executable investigation plans, coordinates specialized web-search, reasoning, and coding agents through iterative verification loops, and incrementally constructs query-specific supply chain knowledge graphs with per-fact uncertainty annotations. Its three-layer uncertainty framework tracks uncertainty at the action, trajectory, and memory layers, enabling both structural inference and calibrated confidence assessment. To evaluate autonomous reasoning across the full complexity spectrum, we introduce SCQA (Supply Chain Query Assessment), a benchmark of 80 supply chain queries organized into four quadrants spanning single-hop to multi-hop inference under both high and low data visibility.

URL PDF HTML ☆

赞 0 踩 0

2605.26833 2026-05-27 cs.LG cs.AI 版本更新

Periodic Topological Deep Learning for Polymer Design and Discovery

周期性拓扑深度学习用于聚合物设计与发现

Yasharth Yadav, Tze Kwang Gerald Er, Atsushi Goto, Kelin Xia

发表机构 * School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371（新加坡南洋理工大学物理与数学科学学院）； School of Chemistry, Chemical Engineering and Biotechnology (CCEB), Nanyang Technological University, Singapore 637371（新加坡南洋理工大学化学、化工与生物技术学院）

AI总结提出基于周期性Vietoris-Rips复形和层次单纯形消息传递的深度学习框架Periodic-TDL，通过捕捉多体相互作用和长程信息，在聚合物性质预测任务上超越现有模型，并验证了酯到酰胺取代和α-甲基化对热稳定性的提升。

Comments 19 pages, 3 figures, 3 tables

详情

AI中文摘要

聚合物支撑着能源、医疗和材料科学领域的应用，但其广阔的化学空间使得系统性发现充满挑战。大多数机器学习方法将聚合物表示为单个重复单元的分子图，从而忽略了聚合物链的周期性和超越成对键的多体相互作用。我们提出了Periodic-TDL，一个基于周期性Vietoris-Rips复形的深度学习框架，该复形捕捉跨多个空间尺度的多体相互作用，随后通过层次单纯形消息传递（HSMP）编码器将信息从长程相互作用传播到共价键，产生由高阶拓扑特征增强的表征。Periodic-TDL在涵盖电子、光学、物理和热学目标的聚合物性质预测任务中优于所有最先进的模型。此外，我们定量验证了酯到酰胺取代和α-甲基化如何增强热稳定性。使用通过系统取代丙烯酸酯和丙烯酰胺聚合物生成的计算合成数据集（48,208个结构），我们观察到在匹配的聚合物对中，酯到酰胺取代的平均$T_g$增加约$55^\circ$C，主链α-甲基化的平均$T_g$增加约$14^\circ$C。为了验证这些预测趋势，我们使用Periodic-TDL模型分析了来自独立实验测量的六对新型聚合物，包括三篇文献中未报道的新合成聚合物。实验数据成功证实了模型的预测。最终，这些发现表明Periodic-TDL捕捉了特定官能团修饰的潜在物理效应，而不仅仅是优化基准数据集上的预测性能。

英文摘要

Polymers underpin applications across energy, healthcare, and materials science, yet their vast chemical space makes systematic discovery challenging. Most machine learning approaches represent polymers as molecular graphs of a single repeating unit, thereby missing both the periodicity of polymer chains and many-body interactions beyond pairwise bonds. We introduce Periodic-TDL, a deep learning framework built on periodic Vietoris-Rips complexes that capture many-body interactions across multiple spatial scales, followed by a hierarchical simplicial message-passing (HSMP) encoder that propagates information from long-range interactions to covalent bonds, yielding representations enriched by higher-order topological features. Periodic-TDL outperforms all state-of-the-art models across polymer property prediction tasks spanning electronic, optical, physical, and thermal targets. Furthermore, we quantitatively validate how ester-to-amide substitution and $α$-methylation enhance thermal stability. Using a computationally synthesized dataset of 48,208 structures-generated via systematic substitution of acrylate and acrylamide polymers-we observed a mean $T_g$ increase of $\sim 55^\circ$C for ester-to-amide substitutions and $\sim 14^\circ$C for backbone $α$-methylation across matched polymer pairs. To verify these predicted trends, we use our Periodic-TDL model to analyze six novel polymer pairs from independent experimental measurements, including three newly synthesized polymers previously unreported in the literature. The experimental data successfully confirmed the model's predictions. Ultimately, these findings demonstrate that Periodic-TDL captures the underlying physical effects of specific functional group modifications, rather than merely optimizing predictive performance on benchmark datasets.

URL PDF HTML ☆

赞 0 踩 0

2605.26830 2026-05-27 cs.LG cs.AI cs.CV 版本更新

The Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery

卡尔曼演化：通过可解释算法发现缩小卡尔曼滤波的差距

Vasileios Saketos, Ming Xiao

发表机构 * KTH Royal Institute of Technology（皇家理工学院）

AI总结针对非线性传感场景下卡尔曼滤波性能下降的问题，提出Kalman Evolve框架，联合优化噪声参数与更新结构，利用大语言模型生成可解释的非仿射修改，在多个基准上实现高达12%的RMSE降低。

详情

AI中文摘要

HTMLCure：将浏览器体验转化为面向交互式HTML的状态引导修复

Jiajun Wu, Jian Yang, Tuney Zheng, Wei Zhang, Haowen Wang, Yihang Lou, Xianglong Liu

发表机构 * Beihang University（北京航空航天大学）； IQuest Research（IQuest研究院）； Peking University（北京大学）

AI总结提出HTMLCure框架，通过浏览器交互执行、状态感知诊断和闭环修复引擎，从大规模HTML页面中筛选并修复可修复页面，显著提升SFT数据质量和模型性能。

Comments 27 pages, 11 figures. Code: https://github.com/wuyuVerse/HTMLCure

详情

AI中文摘要

LLM现在可以生成完整的HTML页面，但其中许多页面仅在表面上正确：它们渲染一次，然后在滚动、悬停、点击、调整大小或游戏过程中失败。基于截图的评估可能遗漏这些失败，而过滤会丢弃许多仍然可修复的页面。我们引入了HTMLCure，一个浏览器体验框架，在系统与页面交互后评估HTML。评估器跨视口和交互状态执行页面，记录确定性的浏览器证据，并向VLM提供来自执行轨迹的精选关键帧，而非孤立截图。相同的状态信号驱动闭环修复引擎：HTMLCure诊断当前页面，选择特定状态的修复家族，再次运行每个候选页面，并导出质量清理后的页面用于SFT。在97K提示语料库上，这将直接可用的种子扩展为63703个质量清理页面的候选池，从中我们构建了最终的40K页面精炼SFT集。在相同骨干和训练方案下，HTMLCure-27B-Refined在HTMLBench-400上达到50.6分，确定性测试用例通过率为45.2%，与Kimi-K2.6和GPT-5.4等强参考行处于相同性能区间。在发布的MiniAppBench验证集上，它达到81.2的平均分，比原始27B SFT提高15.3分，接近强参考系统的水平。

英文摘要

LLMs can now produce full HTML pages, but many of those pages are only superficially correct: they render once, then fail under scroll, hover, click, resize, or gameplay. Evaluation from screenshots can miss these failures, and filtering discards many pages that are still repairable. We introduce HTMLCure, a browser experience framework that evaluates HTML after the system has interacted with it. The evaluator executes the page across viewports and interaction states, records deterministic browser evidence, and gives the VLM curated keyframes from the executed trajectory rather than isolated screenshots. The same state signal drives a closed loop repair engine: HTMLCure diagnoses the current page, chooses a state specific repair family, runs each candidate again, and exports quality cleared pages for SFT. On a 97K prompt corpus, this expands the directly usable seed into a candidate pool of 63703 quality cleared pages, from which we construct the final refined SFT set of 40K pages. Under the same backbone and training recipe, HTMLCure-27B-Refined reaches 50.6 on HTMLBench-400 with 45.2% deterministic test case pass, placing it in the same performance band as strong reference rows such as Kimi-K2.6 and GPT-5.4. On the released MiniAppBench validation split, it reaches 81.2 average, improving raw 27B SFT by 15.3 points and approaching the level of strong reference systems.

URL PDF HTML ☆

赞 0 踩 0

2605.26795 2026-05-27 cs.AI 版本更新

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

链式思维在探测时为何有效？局部共现而非全局推导

Xiang Wang, Wei Wei

发表机构 * Huazhong University of Science and Technology（华中科技大学）

AI总结研究链式思维提示在探测时提升语言模型准确率的原因，发现增益主要来自词汇激活和短距离标记共现，而非句子级逻辑推导。

详情

AI中文摘要

链式思维提示可靠地提高了语言模型的准确性，但推理文本的哪些属性驱动了这种改进尚不清楚。先前的工作主要研究生成本身的行为。我们转而提出一个探测时问题：给定上下文中的固定推理文本，该文本中的什么改变了答案？我们确定了增益的两个互补来源。首先，即使是全局词序打乱的推理文本也显著优于无推理基线，表明存在强烈的词汇激活效应。更重要的是，结构化文本带来的额外增益似乎较少来自句子级的逻辑排序，而更多来自短距离标记邻接。保留仅$n^\star{=}2$--$3$个标记的连续窗口即可恢复向完整链式思维性能的大部分剩余增益。支持性实验排除了显式答案声明或答案值的复制以及完整的语法实现作为主要驱动因素。进一步的泛化实验表明，这种定性模式在多个模型家族、参数规模和数据集上保持稳定。这些结果支持探测时链式思维的局部共现激活解释，其中观察到的增益主要来自词汇激活和短距离标记共现，而非句子级逻辑推导。

英文摘要

Chain-of-thought (CoT) prompting reliably improves language-model accuracy, but which properties of a rationale text drive the improvement is poorly understood. Prior work has largely studied generation-time behavior. We instead ask a probe-time question: given a fixed rationale in context, what in that text changes the answer? We identify two complementary sources of the gain. First, even a globally word-shuffled rationale substantially outperforms the no-rationale baseline, indicating a strong lexical activation effect. More importantly, the additional gain from structured text appears to arise less from sentence-level logical ordering and more from short-range token adjacency. Preserving contiguous windows of just $n^\star{=}2$--$3$ tokens recovers most of the remaining gain toward full CoT performance. Supporting experiments rule out copying of explicit answer declarations or answer values, as well as full grammatical realization, as primary drivers. Further generalization experiments show that the qualitative pattern remains stable across multiple model families, parameter scales, and datasets. These results support a local co-occurrence activation (LCA) account of probe-time CoT, in which the observed gains appear to arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.

URL PDF HTML ☆

赞 0 踩 0

2605.26789 2026-05-27 cs.AI 版本更新

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

组合崩溃：稳定的事实知识并不意味着组合推理

Zhe Yu, Wenpeng Xing, Yunzhao Wei, Jie Chen, Hongzhi Wang, Xuyang Teng, Meng Han

发表机构 * Zhejiang University（浙江大学）； Binjiang Institute of Zhejiang University（浙江大学滨江研究院）； Hong Kong Baptist University（香港 Baptist大学）； Harbin Institute of Technology（哈尔滨工业大学）； Hangzhou Dianzi University（杭州电子科技大学）

AI总结本文提出组合崩溃现象，即模型在稳定掌握原子事实的情况下仍无法将其组合成链式推理，并通过双门控协议分解后训练增益，揭示聚合指标掩盖的组合能力变化。

详情

AI中文摘要

后训练通常通过聚合基准分数来评估，这些分数将多跳推理视为单一能力——仿佛回答更多问题的模型必然更擅长组合事实。我们表明这种假设可能具有误导性：在统计上无法区分的原子知识配方下，组合行为差异超过40个百分点，我们将这种现象称为组合崩溃：即系统性地无法将稳定已知的事实组合成链，而这种失败对聚合指标不可见。我们引入双门控协议，将估计量从聚合组合性差距转变为基于稳定原子访问的残差组合失败，将后训练收益分解为三个独立通道：原子稳定性、残差组合和关键深度。在一个涵盖深度2-11的时序事实链基准上，对四种后训练配方进行分解，揭示了后训练目标以聚合指标掩盖的方式改变组合能力，并表明关于多跳推理改进的主张应伴随原子门控控制的组合指标。诊断探针进一步显示，测量到的组合失败中相当一部分反映了生成时的计算约束，而非永久性的组合能力缺失。

英文摘要

Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability -- as if a model that answers more questions correctly must be better at assembling facts. We show that this assumption can be misleading: recipes with statistically indistinguishable atomic knowledge produce composition behaviour separated by over 40 percentage points, a phenomenon we call composition collapse: the systematic failure to assemble stably-known facts into chains, invisible to aggregate metrics. We introduce a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access, decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2--11 across four post-training recipes, this decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask, and suggests that claims about multi-hop reasoning improvement should be accompanied by atomic-gate-controlled composition metrics. Diagnostic probes further show that a substantial share of measured composition failure reflects generation-time computation constraints rather than permanent inability to compose.

URL PDF HTML ☆

赞 0 踩 0

2605.26788 2026-05-27 cs.CL cs.AI 版本更新

SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability

SeDT: 基于句子变换器的决策变换器条件化用于多轮对话可靠性

Ramakrishna Vamsi Setti, Jagadeesh Rachapudi, Sachin Chaudhary, Praful Hambarde, Amit Shukla

发表机构 * Independent Researcher（独立研究者）； Drone Lab, IIT Mandi（IIT曼迪无人机实验室）； UPES, Dehradun（德里敦UPES）

AI总结针对大语言模型在多轮对话中性能下降的问题，提出一种无需训练和额外数据的推理方法SeDT，通过引入离线强化学习中的return-to-go条件化，利用语义、词汇和位置信号计算累积相关性得分并注释对话历史，显著提升模型性能并降低不可靠性。

详情

AI中文摘要

大语言模型（LLMs）在单轮任务完全指定时表现令人印象深刻，但当相同任务在多轮中逐步揭示时，同一模型性能下降高达39%，这一现象在规模上被记录为“迷失在对话中”。关键的是，这种崩溃几乎完全是可靠性失败；最佳情况下，能力仅下降16%，而不可靠性增加超过一倍（+112%）。我们认为根本原因是结构性的：扁平化的对话历史对每个先前轮次赋予相等隐式权重，使模型无法区分关键约束与无关对话。我们提出SeDT（句子变换器-决策变换器），一种无需训练的推理时方法，通过从离线强化学习中引入return-to-go条件化来解决此问题。SeDT使用来自三种互补信号（语义、词汇和位置）的累积相关性得分注释每个对话片段，并在最后一轮向模型呈现完整的注释历史，无需权重更改、无需训练数据、无需丢弃上下文。在三个LLM和三个生成任务的Lost-in-Conversation基准上评估，SeDT在所有九个模型-任务组合中均优于分片基线，平均性能P提升高达+37.7%，同时在九个组合中的七个中降低了不可靠性。简而言之，告诉模型哪些过去的轮次重要足以显著恢复对话中丢失的性能。

英文摘要

Large language models (LLMs) achieve impressive performance when a task is fully specified in a single turn, yet the same models lose up to 39% of that performance when the identical task is revealed incrementally across multiple turns, a phenomenon documented at scale as Lost in Conversation. Crucially, this collapse is almost entirely a reliability failure; the best case, the aptitude only falls 16%, while the unreliability more than doubles (+112%). We argue that the root cause is structural, a flat conversation history assigns equal implicit weight to every prior turn, giving the model no signal to distinguish a critical constraint from incidental dialog. We present SeDT Sentence-transformer Decision-Transformer, a training-free inference-time method that resolves this by importing return-to-go conditioning from offline reinforcement learning. SeDT annotates each conversation shard with a cumulative relevance score derived from three complementary semantic, lexical, and positional signals and presents the full annotated history to the model at the final turn, without weight changes, without training data, and without discarding context. Evaluated on the Lost-in-Conversation benchmark in three LLMs and three generation tasks, SeDT outperforms the sharded baseline in all nine model-task combinations, with gains up to +37.7% in mean performance P and simultaneous reductions in unreliability in seven of the nine combinations. In short, telling the model which past turns matter is sufficient to substantially recover the performance lost in conversation.

URL PDF HTML ☆

赞 0 踩 0

2605.26786 2026-05-27 cs.CY cs.AI cs.LG 版本更新

Implementation of Big Data Analytics for Diabetes Management: Needs Assessment in the Rwanda Healthcare System

大数据分析在糖尿病管理中的应用：卢旺达医疗系统需求评估

Silas Majyambere, Tony Lindgren, Workneh Y. Ayele, Celestin Twizere

发表机构 * University of Rwanda（卢旺达大学）

AI总结本研究通过利益相关者研讨会评估卢旺达医疗系统采用大数据分析管理糖尿病的准备情况，并提出了一个基于可解释机器学习模型的实用框架。

详情

AI中文摘要

糖尿病是一种慢性代谢疾病，如果不及早诊断和管理，可能导致严重的健康问题。大数据分析和机器学习为分析大型健康数据集、支持早期发现和更好的治疗决策提供了实用工具。然而，它们在常规临床实践中的使用仍然有限。本研究考察了卢旺达医疗系统采用大数据分析管理糖尿病的准备情况。随着该国不断扩大电子病历和健康信息系统的使用，改善预测、监测和临床决策的新机遇随之出现。我们举办了一个为期五天的研讨会，涉及25名关键利益相关者，包括临床医生、数据管理员、政策制定者、医学研究人员、营养学家和技术提供商，以评估准备情况并识别现有差距。研究结果突出了大数据分析实施的潜力和主要挑战。基于这些结果，本文提出了一个实用的大数据分析框架，利用可解释的机器学习模型支持糖尿病管理策略。

英文摘要

Diabetes is a chronic metabolic disease that can lead to serious health problems if not diagnosed and managed early. Big Data Analytics (BDA) and machine learning offer practical tools for analyzing large health datasets and supporting early detection and better treatment decisions. However, their use in routine clinical practice is still limited. This study examines the readiness of Rwanda's healthcare system to adopt big data analytics for diabetes management. As the country continues to expand its use of electronic medical records and health information systems, new opportunities arise for improving prediction, monitoring, and clinical decision-making. A five-day workshop involving 25 key stakeholders, including clinicians, data managers, policymakers, medical researchers, nutritionists, and technology providers, was conducted to assess preparedness and identify existing gaps. The findings highlight both the potential and the main challenges of BDA implementation. Based on these results, the paper proposes a practical BDA framework to support diabetes management strategies using explainable machine learning models.

URL PDF HTML ☆

赞 0 踩 0

2605.26785 2026-05-27 cs.CL cs.AI 版本更新

EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

EmoDistill: 对抗性谈判中语言模型代理的离线情感技能蒸馏

Yunbo Long, Haolang Zhao, Lukas Beckenbauer, Liming Xu, Alexandra Brintrup

发表机构 * University of Cambridge（剑桥大学）； Technical University of Munich（慕尼黑技术大学）； Exiger LLC ； The Alan Turing Institute（艾伦·图灵研究所）

AI总结提出EmoDistill离线框架，通过隐式Q学习选择情感和低秩适应策略表达情感，蒸馏情感谈判技能到语言模型代理，在四个高风险谈判领域取得最高效用。

详情

AI中文摘要

后训练的LLM通常被优化以对齐响应与人类偏好，使其安全、礼貌且适合对话。然而，在对抗性谈判中，这种对齐可能成为漏洞：情感框架语言可能引导代理朝向对手方利益。使用基于GoEmotions的情感提示，我们表明情感显著改变谈判结果，表明情感是战略行动渠道而非表面风格。因此，我们引入 extbf{EmoDistill}，一个用于将情感谈判技能蒸馏到语言模型代理中的离线框架。EmoDistill将情感策略分解为情感选择和情感表达：隐式Q学习（IQL）选择器学习表达\emph{哪种}情感，而基于低秩适应（LoRA）的策略通过监督微调（SFT）和裁判策略优化（JPO）学习\emph{如何}表达它。在四个情感敏感、高风险的谈判领域，在EmoDistill框架下训练的SLM策略实现了最高效用，优于普通SLM/LLM基线和仅IQL情感选择。消融实验表明情感条件化是必要的，迁移研究展示了跨领域、未见对手和训练对训练锦标赛的泛化能力。总体而言，EmoDistill从离线代理间交互中学习技能，避免了训练期间昂贵的在线谈判。

英文摘要

Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability: emotionally framed language may steer agents toward the counterparty's interests. Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style. Thus, we introduce \textbf{EmoDistill}, an offline framework for distilling emotional negotiation skills into language model agents. EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns \emph{which} emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns \emph{how} to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO). Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection. Ablations show that emotion conditioning is essential, and transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments. Overall, EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.

URL PDF HTML ☆

赞 0 踩 0

2605.26784 2026-05-27 cs.LG cs.AI 版本更新

Ratio-Variance Regularized Policy Optimization

比率方差正则化策略优化

Yu Luo, Shuo Han, Yihan Hu, Lei Lv, Huaping Liu, Fuchun Sun, Jianye Hao, Dong Li

发表机构 * Department of Foundation Model, 2012 Labs, Huawei（华为基础模型部门，2012实验室）； Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University（上海智能自主系统研究院，同济大学）； Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）； College of Intelligence and Computing, Tianjin University（天津大学智能与计算学院）

AI总结提出R²VPO方法，通过约束策略比率方差作为信任区域的局部近似，替代启发式裁剪，在LLM和机器人控制任务中提升性能与样本效率。

详情

AI中文摘要

标准的同策略强化学习依赖启发式裁剪来强制信任区域，但这种机制通过不加区分地截断高回报但高散度的更新而施加了严重代价。我们证明，显式约束策略比率方差为信任区域约束提供了原则性的局部近似，消除了二元硬裁剪的需要。通过作为分布式的“软刹车”，这种方法保留了来自新颖发现的关键梯度信号，同时自然降低权重并允许重用陈旧的离策略数据。我们引入了${\bf R}^2{\bf VPO}$（比率方差正则化策略优化），它通过原始-对偶优化框架实现这一约束。在跨越快速和慢速推理范式的$7$个LLM规模以及$10$个机器人控制任务上的广泛评估证明了所提出方法的通用性。R$^2$VPO在数学推理基准上取得了显著的性能提升，特别是在较小模型上改进尤为明显，同时显著提高了样本效率。此外，它在连续控制领域（特别是稀疏奖励和动态环境）中始终优于PPO基线。这些发现共同确立了比率方差正则化作为稳定且数据高效策略优化的原则性基础。

英文摘要

Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. By acting as a distributional ``soft brake'', this approach preserves critical gradient signals from novel discoveries while naturally down-weighting and enabling the reuse of stale, off-policy data. We introduce ${\bf R}^2{\bf VPO}$ (Ratio-Variance Regularized Policy Optimization), which implements this constraint via a primal-dual optimization framework. Extensive evaluations across $7$ LLM scales, spanning both fast and slow reasoning paradigms, and $10$ robotic control tasks demonstrate the generality of the proposed approach. R$^2$VPO achieves substantial performance gains on mathematical reasoning benchmarks, with particularly pronounced improvements on smaller models, while significantly improving sample efficiency. Furthermore, it consistently outperforms PPO baselines in continuous control domains, particularly in sparse-reward and dynamic environments. Together, these findings establish ratio-variance regularization as a principled foundation for stable and data-efficient policy optimization.

URL PDF HTML ☆

赞 0 踩 0

2605.26781 2026-05-27 cs.AI cs.MM 版本更新

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

LiveK12Bench: 大型多模态模型真的征服了高中水平的考试吗？

Xiaohan Wang, Mingze Yin, Yilin Zhao, Gang Liu, Dian Li

发表机构 * Tencent PCG（腾讯PCG）； College of Computer Science and Technology, Zhejiang University（浙江大学计算机科学与技术学院）

AI总结本文提出动态多学科基准LiveK12Bench，通过自动化流水线和新颖的模拟考试评估方案，揭示大型多模态模型在真实考试场景下性能显著下降，尤其对复杂视觉布局敏感。

详情

AI中文摘要

先进的大型多模态模型（LMMs）在K-12推理任务中展示了令人印象深刻的表现，展现出作为智能导师的巨大潜力。实现这一潜力需要模型有效应对真实世界的考试，但大多数现有基准未能捕捉真实考试环境的复杂性。具体来说，大多数数据集是静态的，容易受到数据污染，并且通常局限于受限的模态、学科和评估标准。为了解决这些问题，我们引入了LiveK12Bench，这是一个动态、全面、多学科的基准，旨在评估LMMs在真实考试场景中的推理能力。LiveK12Bench包含2000多道经过验证的题目，涵盖数学、物理、化学和生物，来源于最新的真实考试试卷，并设计为随时间增长。我们的框架具有几个核心创新：1）采用自动化流水线，持续摄取和解析最新考试试卷以减轻数据泄露；2）提出一种新颖的“模拟考试”评估方案，评估模型自主完成端到端考试并具有准确高效推理路径的能力。在12个LMMs上的大量实验表明，先进模型在考试真实约束下性能大幅下降：当过程严谨性和效率共同评估时，GPT-5的分数从79降至53（满分100）。我们的发现暴露了关键漏洞，例如对复杂视觉布局的敏感性，凸显了理想化推理能力与真正教育准备之间的差距。代码和数据集均已公开。

英文摘要

Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and are often confined to restricted modalities, disciplines, and evaluation criteria. To address these issues, we introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises 2K+ verified questions spanning Mathematics, Physics, Chemistry, and Biology, sourced from the latest real-world exam papers and designed to grow over time. Our framework features several core innovations: 1) featuring an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage; and 2) proposing a novel `Mock Exam' evaluation scheme, which assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths. Extensive experiments on 12 LMMs reveal that advanced models suffer substantial performance degradation under exam-realistic constraints: GPT-5's score drops from 79 to 53 (out of 100) when process rigor and efficiency are jointly evaluated. Our findings expose critical vulnerabilities, such as sensitivity to complex visual layouts, highlighting the gap between idealized reasoning capabilities and true educational readiness. Both code and dataset are publicly available.

URL PDF HTML ☆

赞 0 踩 0

2605.26778 2026-05-27 cs.AI 版本更新

The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

归因盲点：检测语言模型何时依赖记忆而非检索到的上下文

Zhe Yu, Wenpeng Xing, Yunzhao Wei, Bo Yang, Chen Ye, Gaolei Li, Meng Han

发表机构 * Zhejiang University（浙江大学）； Binjiang Institute of Zhejiang University（浙江大学滨江研究院）； National Fintech Evaluation Center（国家金融科技评估中心）； Hangzhou Dianzi University（杭州电子科技大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结本文提出计算现实监控（CRM）方法，通过比较有无上下文时的内部表征差异，检测语言模型是否依赖预训练记忆而非检索到的上下文进行生成，解决了输出级监控无法识别的归因盲点问题。

详情

AI中文摘要

检索增强生成承诺将语言模型输出锚定于外部证据，然而该领域缺乏可靠方法来验证检索到的上下文是否实际主导了生成——这是任何高风险部署的前提。标准假设（上下文一致的输出意味着上下文主导的输出）在检索到的文档与模型预训练数据重叠时失效：模型可以完全从参数化记忆中生成看似忠实的文本，且两种途径产生无法区分的输出。我们将此失败命名为归因盲点，并引入计算现实监控（CRM）来解决它。CRM 操作化了源自认知科学现实监控框架的一个原则：比较有上下文和无上下文时的内部表征，揭示了输出级监控系统系统性遗漏的基于成员条件的表征分歧。CRM 并不证明单个生成使用了哪个来源；它检测预训练暴露是否留下可测量的内部轨迹特征，从而为来源归因建立必要的基础。在跨越三个系列的九个模型变体中，这种分歧集中在架构特定的层模式中，得到块级噪声干预的汇聚支持，并在任务和数据集上泛化，而在领域混淆的基准上消失。归因盲点是可以测量且部分可解决的：内部表征携带输出级不可见的诊断信号，为系统建立基础，使其对证据来源的内部意识支配其外部行为。

英文摘要

Retrieval-augmented generation promises to ground language model outputs in external evidence, yet the field has no reliable way to verify whether retrieved context actually governs generation -- a prerequisite for any high-stakes deployment. The standard assumption, that context-consistent output implies context-governed output, breaks when the retrieved document overlaps with the model's pretraining data: the model can produce faithful-looking text entirely from parametric memory, and both pathways yield indistinguishable output. We name this failure the attribution blind spot and introduce Computational Reality Monitoring (CRM) to address it. CRM operationalizes a principle adapted from cognitive science's reality monitoring framework: comparing internal representations with and without context reveals membership-conditioned representational divergence that output-level monitors systematically miss. CRM does not certify which source an individual generation used; it detects whether pretraining exposure leaves a measurable internal trajectory signature, establishing a necessary substrate for source attribution. Across nine model variants spanning three families, this divergence concentrates in architecture-specific layer patterns, receives converging support from block-level noise intervention, and generalizes across tasks and datasets while collapsing on domain-confounded benchmarks. The attribution blind spot is measurable and partially addressable: internal representations carry a diagnostic signal invisible at the output level, establishing a foundation for systems whose internal awareness of evidence provenance governs their external behavior.

URL PDF HTML ☆

赞 0 踩 0

2605.26776 2026-05-27 cs.LG cs.AI 版本更新

Towards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts

面向泛化的混合专家车辆路径问题模型

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

发表机构 * State Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology（自主智能无人系统国家重点实验室，北京理工大学）； School of AI, Beijing Institute of Technology（北京理工大学人工智能学院）

AI总结提出基于混合专家架构的残差细化专家与实例级门控机制（R2E-IG），通过模块化策略网络和动态权重适应训练，提升车辆路径问题在分布偏移下的泛化能力。

详情

AI中文摘要

近年来，深度强化学习（DRL）在车辆路径问题（VRPs）上取得了显著进展。然而，现有的基于DRL的方法通常是在均匀分布生成的实例上训练的，这限制了它们在真实世界分布偏移下的性能。在本文中，我们旨在开发一个面向泛化的模型，该模型将策略网络划分为多个模块，并在推理过程中自适应地重组模块以形成特定策略。具体来说，我们提出了具有实例级门控的残差细化专家（R2E-IG）以改进跨分布泛化。我们的贡献有三方面：（1）我们引入了一种残差细化专家（R2E）架构，通过残差细化增强专家表达能力；（2）我们设计了一种实例级门控机制，学习分布感知的实例表示并将输入路由到合适的模块；（3）我们提出了一种配备动态权重适应（DWA）的混合分布训练机制，该机制动态地重新加权来自不同分布的训练数据，以强调更具信息量的数据。大量实验表明，R2E-IG在合成和基准数据集的分布内和分布外实例上均取得了与最先进基线相竞争的性能。此外，R2E-IG是通用的，可以轻松集成到现有的基于DRL的方法中，以进一步提高性能。

英文摘要

In recent years, Deep Reinforcement Learning (DRL) has achieved substantial progress on Vehicle Routing Problems (VRPs). However, existing DRL-based methods are typically trained on instances generated from a uniform distribution, which limits their performance under real-world distribution shifts. In this paper, we aim to develop a generalization-oriented model that partitions the policy network into multiple modules and adaptively recombines modules to form specific policies during inference. Specifically, we propose Residual Refined Experts with Instance-level Gating (R2E-IG) to improve cross-distribution generalization. Our contributions are threefold: (1) We introduce a Residual Refined Expert (R2E) architecture that enhance expert expressiveness via residual refinement; (2) We design an instance-level gating mechanism that learns distribution-aware instance representations and routes inputs to suitable modules; (3) We propose a mixed-distribution training mechanism equipped with Dynamic Weight Adaption (DWA), which dynamically reweights training data from different distributions to emphasize more informative ones. Extensive experiments show that R2E-IG achieves competitive performance against state-of-the-art baselines on both in-distribution and out-of-distribution instances across synthetic and benchmark datasets. Moreover, R2E-IG is generic and can be easily integrated into existing DRL-based methods to further improve performance.

URL PDF HTML ☆

赞 0 踩 0

2605.26772 2026-05-27 cs.AI 版本更新

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

超越单一方向：思维链破坏简单的拒绝引导

Kia-Jüng Yang, Dominik Meier, Jiachen Zhao, Terry Ruas, Bela Gipp

发表机构 * University of Göttingen, Germany（哥廷根大学，德国）； Northeastern University, Boston, MA, USA（东北大学，波士顿，马萨诸塞州，美国）

AI总结本文研究大型推理模型（LRM）中拒绝行为的机制，发现思维链（CoT）与激活共同编码拒绝信号，使得仅通过激活引导难以逆转拒绝，但通过两阶段干预（激活引导下重新生成CoT）可显著提高逆转率。

详情

AI中文摘要

面向口语处理任务的机器人-患者与医生-患者医疗对话数据集

Heriberto Cuayahuitl, Grace Jang

发表机构 * UK’s NHS（英国国家医疗服务体系）

AI总结提出MeDial-Speech数据集，包含机器人-患者和医生-患者的真实医疗对话语音数据，用于训练和评估医疗AI，并通过句子选择基准测试评估三个大语言模型。

详情

Journal ref: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2026)

AI中文摘要

大型语言模型（LLM）为人工智能（AI）带来了巨大改进，可应用于通用任务。然而，它们在文本或口语医疗咨询中的应用仍是一个开放的研究问题。本文提出MeDial-Speech，这是一个新颖的语音数据集，用于训练和评估能够与患者进行咨询的医疗AI。该数据集在真实环境中从机器人-患者和医生-患者对话中收集，包含111小时以上的语音数据（无数据增强），涵盖四种健康状况：路易体痴呆、心力衰竭、肩痛和心绞痛。此外，我们通过句子选择（20个选项）提出了一个对话基准，用于评估三个最先进的LLM：GPT-5 mini、DeepSeek-V3和Claude Sonnet 4。实验结果显示，Claude Sonnet 4在句子选择中表现最佳，使用人工转录的准确率为71.1%，使用自动转录的准确率为74.7%，并且所有LLM在其概率预测中高度过度自信，无论选择医疗对话中的正确或错误句子。该数据集对非商业用途免费提供，网址为：https://huggingface.co/datasets/hcuayahu/MeDial-Speech

英文摘要

Large Language Models (LLMs) have brought huge improvements to Artificial Intelligence (AI), which can be applied to general-purpose tasks. However, their application to textual or spoken medical consultations is still an open research problem. This paper proposes MeDial-Speech, a novel speech dataset for training and evaluating Med-AIs that can carry out consultations with patients. It was collected in realistic environments from robot-patient and doctor-patient dialogues, contains 111+ hours of speech data (without data augmentation), and covers four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. In addition, we propose a dialogue benchmark via sentence selection (with 20 options) to evaluate three state-of-the-art LLMs: GPT-5 mini, DeepSeek-V3, and Claude Sonnet 4. Experimental results reveal that Claude Sonnet 4 is the best in sentence selection, with 71.1% accuracy using manual transcriptions and 74.7% using automatic transcriptions, and that all LLMs are highly overconfident in their probabilistic predictions, regardless of selecting correct or incorrect sentences in medical dialogues. This dataset is free of charge for non-commercial purposes at: https://huggingface.co/datasets/hcuayahu/MeDial-Speech

URL PDF HTML ☆

赞 0 踩 0

2605.26741 2026-05-27 cond-mat.mtrl-sci cs.AI 版本更新

MatFormBench: A Benchmarking Evaluation Framework for Target-Driven Materials Formulation

MatFormBench: 一个面向目标驱动材料配方的基准评估框架

Linhan Wu, Chenxi Wang, Chuhan Yang, Zhengwei Yang, Yuyang Liu

发表机构 * DeepVerse

AI总结针对现有材料机器学习基准仅关注正向属性预测而缺乏逆向优化评估的问题，提出MatFormBench基准框架，集成物理驱动配方生成方案与多维度评分指标，系统评估39种逆向设计算法。

Comments 26 pages

详情

AI中文摘要

材料的逆向设计显著推进了目标驱动的配方优化，然而现有的材料机器学习基准仍局限于正向属性预测，未能系统评估逆向优化和生成算法，这一关键差距阻碍了目标驱动材料设计的进展。为解决这一局限性，我们提出了MatFormBench，一个新颖的基准评估生态系统，专门用于评估和指导目标驱动配方的生成策略。MatFormBench集成了一个物理驱动的配方生成方案，用于生成忠实模拟真实材料结构-属性响应关系的合成样本，并辅以五个递增难度级别来量化这些关系的复杂性。为了严格评估算法性能，我们进一步提出了MatFormScore，一个多维指标，全面量化五个关键轴上的性能：目标成功率、搜索效率、探索能力、鲁棒性和稳定性。我们通过评估39种不同的逆向设计算法来验证MatFormBench，涵盖经典的代理辅助黑箱搜索、最先进的深度生成模型以及日益流行的基于大语言模型（LLM）的推荐策略。在1170次标准化算法-任务评估中，基于扩散的模型展现出最强的整体性能，而基于变分自编码器（VAE）和遗传算法（GA）的方法在特定场景中表现出独特优势。通过为目标驱动材料配方建立统一的评估标准，MatFormBench实现了可重复的基准测试、原则性的算法比较和逆向设计策略的诊断分析，为推进材料逆向设计提供了基础工具。

英文摘要

Inverse design of materials has significantly advanced target-driven formulation optimization, yet existing materials machine learning benchmarks remain limited to forward property prediction, failing to systematically evaluate inverse optimization and generation algorithms, a critical gap that hinders the progress of target-driven materials design. To address this limitation, we propose MatFormBench, a novel benchmarking ecosystem tailored to evaluate and guide generative strategies for target-driven formulation. MatFormBench integrates a physics-driven formulation generation scheme to generate synthetic samples that faithfully emulate realistic materials structure-property response relationships, complemented by five escalating difficulty levels to quantify the complexity of these relationships. To rigorously assess algorithm performance, we further propose MatFormScore, a multi-dimensional metric that comprehensively quantifies performance across five critical axes: target success, search efficiency, exploratory capacity, robustness, and stability. We validate MatFormBench by evaluating 39 diverse inverse design algorithms, covering classical surrogate-assisted black-box search, state-of-the-art deep generative models, and increasingly popular Large Language Model (LLM)-based recommendation strategies. Across 1170 standardized algorithm-task evaluations, diffusion-based models demonstrate the strongest overall performance, while Variational Autoencoder (VAE)-based and Genetic Algorithm (GA)-based methods exhibit distinct advantages in specific scenarios. By establishing a unified evaluation standard for target-driven materials formulation, MatFormBench enables reproducible benchmarking, principled algorithm comparison, and diagnostic analysis of inverse design strategies, providing a foundational tool for advancing materials inverse design.

URL PDF HTML ☆

赞 0 踩 0

2605.26733 2026-05-27 cs.LG cs.AI 版本更新

Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

循环语言模型中测试时可扩展潜在推理的稳定循环动力学

Xiao-Wen Yang, Ziyu Han, Xi-Hua Zhang, Wen-Da Wei, Jie-Jing Shao, Lan-Zhe Guo, Yu-Feng Li

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China（新型软件技术国家重点实验室，南京大学，南京，中国）； School of Artificial Intelligence, Nanjing University, Nanjing, China（人工智能学院，南京大学，南京，中国）； School of Intelligence Science and Technology, Nanjing University, Nanjing, China（智能科学与技术学院，南京大学，南京，中国）

AI总结提出STARS训练框架，通过雅可比谱半径正则化约束潜在状态趋近渐近稳定不动点，解决循环语言模型深度递归时性能崩溃问题，实现可靠的测试时扩展并提升峰值性能。

Comments ICML 2026

详情

AI中文摘要

循环语言模型（LoopLMs）通过深度递归实现高效的潜在推理，但表现出不可靠的测试时缩放行为：性能通常在某个迭代深度达到峰值，然后随着进一步递归而崩溃。通过潜在动力学分析，我们发现现有架构和策略在稳定性和有效性之间存在固有的权衡。通过将推理概念化为不确定性减少，我们提出收敛到稳定不动点同时保持有效性是一种有前景的方法。为此，我们提出了STARS（稳定性驱动的递归缩放），一种训练框架，约束潜在状态趋近渐近稳定不动点。这通过高效的雅可比谱半径正则化和随机循环采样实现，使STARS能够在确保严格稳定性的同时最大化有效性。在算术任务上的实验表明，STARS实现了可靠的测试时缩放，在复杂数学推理中，它显著减轻了随着递归深度增加而出现的性能退化，同时提高了峰值性能。

英文摘要

Looped Language Models (LoopLMs) enable efficient latent reasoning through depth recurrence, yet exhibit unreliable test-time scaling behavior: performance often peaks at a certain iteration depth and then collapses with further recurrence. Through latent dynamics analysis, we find an inherent trade-off between stability and effectiveness in existing architectures and strategies. By conceptualizing reasoning as uncertainty reduction, we propose that convergence toward stable fixed points while preserving effectiveness represents a promising way. To this end, we propose STARS (STAbility-driven Recurrent Scaling), a training framework that constrains latent states to approach asymptotically stable fixed points. This is realized via efficient Jacobian Spectral Radius Regularization with random loop sampling, enabling STARS to maximize effectiveness while ensuring rigorous stability. Experiments on arithmetic tasks show that STARS achieves reliable test-time scaling, and on complex mathematical reasoning it substantially mitigates performance degradation as recurrence depth increases while also improving peak performance.

URL PDF HTML ☆

赞 0 踩 0

2605.26731 2026-05-27 cs.AI cs.CL 版本更新

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

不是能力问题：LLM 智能体层级间的驾驭敏感性非单调

Yong-eun Cho

发表机构 * KailosLab（凯罗斯实验室）

AI总结通过 432 次实验，发现 LLM 智能体的驾驭敏感性随模型层级非单调变化，且依赖模型类型（聊天 vs. 推理），推翻了“更高能力模型需要更少结构指导”的假设。

Comments 9 pages, 3 figures

详情

AI中文摘要

LLM 智能体部署中的一个普遍假设是，更结构化的驾驭方式普遍能提高可靠性，并且能力更强的模型需要成比例地减少结构指导——这共同暗示了模型能力层级与最优驾驭复杂度之间存在单调反比关系。我们通过一个受控的 432 次实验来检验这一假设，实验跨越了四个能力层级的六个模型，在 HEAT-24（一个基于 git 工作区验证的 24 任务合成基准）上采用了三种驾驭条件（轻量、平衡、严格）。我们的结果从两个方面反驳了单调反比关系。首先，对于评估的前沿聊天模型（Gemini 2.5 Flash），增加驾驭冗长度使 VTSR 降低 29-38 个百分点——这是一个驾驭复杂度悖论。其次，对于评估的前沿推理模型（Qwen3.5-122B，启用扩展思考），严格驾驭实现了最高的 VTSR（91.7%）和最低的延迟，与预测相反。在受限层级内，一个 2B 模型（Gemma4:e2B）在所有驾驭条件下均以 91.7% 的 VTSR 达到了强开放层级的稳定性。由于本研究中每个层级仅由一个模型代表，这些结果应解释为模型特定的观察；驾驭敏感性在所评估的模型中呈现非单调性，并且关键依赖于模型类型（聊天 vs. 推理）。我们引入了一个六标签失败分类法，显示格式违规主导了能力强的模型失败，而错误文件主导了低能力失败，并推导出了实用的层级感知驾驭选择指南。

英文摘要

A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance -- together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29-38 percentage points -- a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B, extended thinking enabled), strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong-open-tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be interpreted as model-specific observations; harness sensitivity appears non-monotone across the models evaluated, and depends critically on model type (chat vs. reasoning). We introduce a six-label failure taxonomy showing that format_violation dominates capable-model failures while wrong_file dominates low-capability failures, and we derive practical tier-aware harness selection guidelines.

URL PDF HTML ☆

赞 0 踩 0

2605.26726 2026-05-27 eess.IV cs.AI cs.CV 版本更新

Measuring Prediction Uncertainty in Neural Cellular Automata

神经细胞自动机中的预测不确定性测量

Ario Sadafi, Michael Deutges, Nassir Navab, Carsten Marr

发表机构 * Computational Health Center, Helmholtz Munich, Neuherberg, Germany（赫尔姆霍茨慕尼黑计算健康中心）； Helmholtz AI, Helmholtz Munich, Neuherberg, Germany（赫尔姆霍茨慕尼黑人工智能研究所）； Computer Aided Medical Procedures, Technical University of Munich, Munich, Germany（慕尼黑技术大学计算机辅助医疗程序研究所）； Munich Center for Machine Learning, Munich, Germany（慕尼黑机器学习中心）； Department of Medicine III, Ludwig-Maximilian-University Hospital, Munich, Germany（慕尼黑路德维希-马克西米利安大学医院第三医学部）； Department of Physics, University of Munich, Munich, Germany（慕尼黑大学物理系）； German Cancer Consortium (DKTK), partner site Munich, Germany（德国癌症研究中心（DKTK）慕尼黑分部）

AI总结提出一种基于动态系统收敛性的不确定性度量方法，通过扰动自动机状态并观察预测稳定性来评估神经细胞自动机在医学图像分割中的可信度。

Comments Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026

详情

AI中文摘要

神经细胞自动机（NCA）为编码器-解码器分割网络提供了一种轻量级替代方案。然而，决定何时应信任预测可能很困难。在这里，我们研究基于NCA的医学图像分割的不确定性估计，无需修改底层架构或重新训练模型。我们的方法通过将NCA视为一个动态系统来激发，其中收敛吸引子对应于可信预测。具体地，我们提出了弹性（resilience），这是一种简单的度量，通过探测在自动机状态微小扰动下最终预测的稳定性来利用NCA固有的迭代结构。返回相同解的预测被认为是可信的，而显著变化的预测被标记为不确定。我们使用选择性预测指标（$\Delta$Dice@90和AURC）和排序指标（AUROC和AUPRC）通过其预测分割质量的能力来评估不确定性。在多个医学分割基准测试中，弹性比基线更可靠地识别失败案例，提高了基于NCA模型的信任度和安全性。

英文摘要

Neural cellular automata (NCA) provide a lightweight alternative to encoder-decoder segmentation networks. However, it can be difficult to decide when a prediction should be trusted. Here, we study uncertainty estimation for NCA-based medical image segmentation without modifying the underlying architecture or retraining the model. Our approach is motivated by viewing the NCA as a dynamical system where convergent attractors correspond to confident predictions. Concretely, we propose resilience, a simple measure that leverages the intrinsic iterative structure of NCAs by probing the stability of the final prediction under small perturbations of the automaton state. Predictions that return to the same solution are deemed confident, while those that change substantially are flagged as uncertain. We evaluate uncertainty by its ability to predict segmentation quality using selective prediction metrics ($Δ$Dice@90 and AURC) and ranking metrics (AUROC and AUPRC). Across multiple medical segmentation benchmarks, resilience identifies failure cases more reliably than baselines, improving trust and safety in NCA-based models.

URL PDF HTML ☆

赞 0 踩 0

2605.26720 2026-05-27 cs.AI 版本更新

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

面向CUDA内核生成中自进化LLM代理的反馈到计划决策

Yee Hin Chong, Jiaming Wu, Youhui Zhang, Peng Qu

发表机构 * Department of Computer Science and Technology, Tsinghua University, Beijing, China（清华大学计算机科学与技术系）； Beijing National Research Center for Information Science and Technology, Beijing, China（北京信息科学国家研究中心）

AI总结通过轨迹冻结和选择性反馈注入，提出CUDAnalyst框架以归因规划决策对反馈组件的贡献，揭示显式规划仅在反馈对齐时有效，且有效规划源于结构化多反馈交互。

Comments ICML 2026 accpeted, camera-ready in progress

详情

AI中文摘要

大型语言模型（LLMs）作为自进化代理在CUDA内核生成中展现出强大的实证收益，这得益于跨代际的反馈条件规划。然而，规划决策如何归因并组合异构反馈信号仍不透明。标准的端到端消融无法解决这一问题，因为迭代规划放大了早期扰动，并将反馈效应与轨迹依赖漂移混为一谈。我们引入 exttt{CUDAnalyst}，一个统一的分析层，通过轨迹冻结和选择性反馈注入，实现对规划决策到反馈组件的受控、代际级归因。 exttt{CUDAnalyst}支持稳定的代际级评估和原则性的联盟式反馈效应及交互归因。我们的结果表明，显式规划仅在反馈对齐时有益，有效规划源于结构化的多反馈交互，且来自更强推理模型的高级规划可部分迁移至较弱模型。这些趋势在参考骨干网络、代表性工作负载和参考归纳机制中保持一致，表明在所研究的受控轴内，识别出的反馈到规划结构是稳健的。

英文摘要

Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. However, how planning decisions attribute and combine heterogeneous feedback signals remains opaque. Standard end-to-end ablations fail to resolve this question, as iterative planning amplifies early perturbations and conflates feedback effects with trajectory-dependent drift. We introduce \texttt{CUDAnalyst}, a unified analysis layer for controlled, generation-level attribution of planning decisions to feedback components via trajectory freezing and selective feedback injection. \texttt{CUDAnalyst} enables stable generation-level evaluation and principled coalitional-style attribution of feedback effects and interactions. Our results show that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones. These trends hold across reference backbones, representative workloads, and reference induction regimes, indicating that the identified feedback-to-plan structure is robust within the controlled axes studied.

URL PDF HTML ☆

赞 0 踩 0

2605.26717 2026-05-27 cs.IR cs.AI 版本更新

L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation

L2Rec：面向个性化推荐的LLM双视图理解

Pingjun Pan, Tingting Zhou, Peiyao Lu, Tingting Fei, Hongxiang Chen, Chuanjiang Luo

发表机构 * Netease Cloud Music（网易云音乐）

AI总结提出L2Rec方法，通过双视图个性化混合专家机制在参数层面统一行为与语义理解，实现端到端个性化推荐，实验证明优于现有方法。

Comments Accepted at SIGIR 2026

详情

DOI: 10.1145/3805712.3809943

AI中文摘要

将大型语言模型（LLM）适配于个性化推荐需要将其通用能力与用户特定偏好对齐，同时有效利用行为信号和语义信号。现有方法通常在输入层（例如，将行为嵌入注入令牌空间）或输出层（例如，独立编码器的对比对齐）整合这些信号，存在分布差距或缺乏端到端任务监督。在这项工作中，我们引入了L2Rec，它在LLM的参数层面统一了行为和语义理解。我们的关键洞察是，同一组Transformer参数可以作为两个视图的共享媒介：通过双视图个性化混合专家（DPMoE）机制应用视图特定的个性化低秩扰动，L2Rec使得单个LLM主干能够为每个用户产生互补的行为和语义适应，且表示层面的不对齐最小化。一个自适应跨视图融合模块进一步将双视图输出整合为统一的用户偏好。在四个数据集上的实验表明，L2Rec持续优于最先进的基线方法，并且在大型工业平台上的在线A/B测试验证了关键参与指标的显著改进。

英文摘要

Adapting large language models (LLMs) for personalized recommendation requires aligning their general-purpose capabilities with user-specific preferences while effectively leveraging both behavioral and semantic signals. Existing approaches typically integrate these signals at either the input level (e.g., injecting behavioral embeddings into the token space) or the output level (e.g., contrastive alignment of separate encoders), suffering from distribution gaps or lack of end-to-end task supervision. In this work, we introduce L2Rec, which unifies behavioral and semantic understanding at the parameter level of LLMs. Our key insight is that the same set of Transformer parameters can serve as a shared medium for both views: by applying view-specific, personalized low-rank perturbations via a Dual-view Personalized Mixture-of-Experts (DPMoE) mechanism, L2Rec enables a single LLM backbone to produce complementary behavioral and semantic adaptations for each user with minimal representation-level misalignment. An adaptive cross-view fusion module further integrates the dual-view outputs into a unified user preference. Experiments on four datasets show that L2Rec consistently outperforms state-of-the-art baselines, and online A/B testing on a large-scale industrial platform validates significant improvements in key engagement metrics.

URL PDF HTML ☆

赞 0 踩 0

2605.25507 2026-05-27 cs.AI 版本更新

Credit Assignment with Resets in Language Model Reasoning

语言模型推理中带有重置的信用分配

Ankur Samanta, Akshayaa Magesh, Ayush Jain, Youliang Yu, Daniel Jiang, Kavosh Asadi, Kaveh Hassani, Paul Sajda, Jalaj Bhandari, Yonathan Efroni

发表机构 * Meta AI ； Columbia University（哥伦比亚大学）； Meta Superintelligence Labs（Meta超智能实验室）； Tel Aviv University（特拉维夫大学）

AI总结提出随机重置策略优化（RRPO）和自重置策略优化（SRPO）两种方法，通过重置到中间状态并重新采样反事实延续来改进语言模型多步推理中的信用分配，SRPO在多个推理基准上优于标准GRPO和RRPO。

详情

AI中文摘要

迭代精化神经算子：一种学习型不动点求解器——频谱偏差缓解的原则性方法

Xiaotian Liu, Shuyuan Shang, Xiaopeng Wang, Pu Ren, Yaoqing Yang

发表机构 * Dartmouth College（达特茅斯学院）； CUHK Shenzhen（香港大学深圳分校）； Lawrence Berkeley National Lab（伯克利国家实验室）

AI总结提出迭代精化神经算子（IRNO），通过固定点迭代应用学习精化模块，结合渐进频谱损失，有效缓解神经算子的频谱偏差，在湍流和活性物质等物理系统中显著降低高频误差。

Comments 47 pages; accepted to ICML 2026 as a Spotlight

详情

AI中文摘要

神经算子作为科学建模的快速数据驱动替代方法，通常依赖于单一前向推理过程，难以解析高频细节，这一局限性称为频谱偏差。我们引入迭代精化神经算子（IRNO），通过固定点迭代反复应用学习精化模块来增强预训练算子。IRNO将预测分解为粗初始化及随后的残差校正，类似于经典数值求解器。在局部假设下，我们建立了诱导算子的收缩性，确保收敛到唯一不动点。为明确针对高频误差，我们提出渐进频谱损失，在训练过程中自适应地增加对高频分量的惩罚。在物理系统中，IRNO持续降低误差，在湍流中提升高达56.05%。在活性物质中，频谱分析显示，相对于基础算子，归一化误差比在低频降至27.72-36.10%，中频降至5.07-6.68%，高频降至1.48-2.04%，且在训练迭代次数之外保持稳定。代码见 https://github.com/xiaotianliu-dartmouth/Iterative_Refinement_Neural_Operator。

英文摘要

Neural operators serve as fast, data-driven surrogates for scientific modeling but typically rely on a monolithic, single-pass inference procedure that struggles to resolve high-frequency details, a limitation known as spectral bias. We introduce the Iterative Refinement Neural Operator (IRNO), which augments pre-trained operators with a learned refinement module iteratively applied via fixed-point iteration. IRNO decomposes the prediction into a coarse initialization followed by successive residual corrections, paralleling classical numerical solvers. Under local assumptions, we establish contraction of the induced operator, ensuring convergence to a unique fixed point. To explicitly target high-frequency errors, we propose a progressive spectral loss that adaptively increases penalty on high-frequency components over refinement steps during training. Across physical systems, IRNO consistently lowers error, with up to 56.05% improvement on turbulent flow. On Active Matter, spectral analysis reveals that, relative to base operator, the normalized error ratios decrease to 27.72-36.10% in low-, 5.07-6.68% in mid-, and 1.48-2.04% in high-frequencies, remaining stable beyond the trained iteration count. Code is available at https://github.com/xiaotianliu-dartmouth/Iterative_Refinement_Neural_Operator

URL PDF HTML ☆

赞 0 踩 0

2605.23910 2026-05-27 cs.CL cs.AI 版本更新

Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches

基于信息融合的文档分类模式识别：多模态与多视角表示方法的系统综述

Marcin Michał Mirończuk

发表机构 * National Information Processing Institute（国家信息处理研究所）

AI总结本文通过系统综述和元分析，提出了统一框架，量化了多模态和多视角融合在文档分类中的性能提升，并揭示了方法学严谨性不足的问题。

详情

DOI: 10.1016/j.inffus.2026.104247
Journal ref: Information Fusion, 132, 2026, 104247

AI中文摘要

信息融合被广泛用于通过整合多数据源（多模态）或多表示（多视角）来改进文档分类。然而，该领域缺乏统一框架、对其有效性的定量综合以及给实践者的明确指导。本系统综述通过分析139项主要研究来填补这些空白。它引入了一个正式框架来结构化该领域，呈现了定性分析结果以识别关键趋势，并进行了随机效应元分析（据我们所知，这是首次专注于文档分类的元分析）以量化性能提升。我们的元分析显示，多模态融合显著提高了准确率（平均提升+5.28个百分点，$p=0.0016$）——F1分数效应方向为正，但在我们的主要模型中统计上不显著。多视角融合在准确率（+4.67%）、F1分数（+3.08%）和召回率（均$p<0.05$）上提供了一致但适度的提升。关键的是，我们的定性综合揭示了方法学严谨性方面的可重复性挑战：只有11.8%（多模态）和23.3%（多视角）的研究使用统计检验来验证其发现，这削弱了许多结果的可靠性。本综述的主要贡献是一个统一框架、首个定量证据基础以及数据驱动的指南。本综述得出结论，成功的信息融合不依赖于算法复杂性，而在于融合方法与任务上下文的战略对齐以及对更严格验证的承诺。

英文摘要

Information fusion is used widely to improve document classification by the integration of multiple data sources (multimodal) or representations (multiview). However, the field lacks a unified framework, a quantitative synthesis of its effectiveness, and clear guidance for practitioners. This systematic review addresses these gaps by analysing 139 primary studies. It introduces a formal framework to structure the field, presents the results of a qualitative analysis to identify key trends, and performs a random-effects meta-analysis (to our knowledge, the first focused on document classification) to quantify performance gains. Our meta-analysis reveals that multimodal fusion improves accuracy (mean gain of +5.28 percentage points, $p=0.0016$) significantly -- the F1-score effect is directionally positive but statistically non-significant in our primary model. Multiview fusion provides consistent but modest gains for accuracy (+4.67\%), F1-score (+3.08\%), and recall (all $p<0.05$). Critically, our qualitative synthesis uncovers challenges in reproducibility in methodological rigour: only 11.8\% (multimodal) and 23.3\% (multiview) of the studies use statistical tests to validate their findings, which undermines the reliability of many of their results. This review's primary contributions are a unifying framework, the first quantitative evidence base, and data-driven guidelines. This review concludes that successful information fusion depends not on algorithmic complexity, but on the strategic alignment of the fusion method with the task context and a commitment to more rigorous validation.

URL PDF HTML ☆

赞 0 踩 0

2605.22511 2026-05-27 cs.AI cs.CL cs.IR 版本更新

ProcCtrlBench: 评估LLM编码智能体中的过程级缺陷与控制保持

Jiawei He, Jie Jia, Chenbo Liu, Chaoyi Xue, Yapeng Song, Xikai Yang, Dong Sun

发表机构 * Amap, Alibaba Group（阿里云）

AI总结提出ProcCtrlBench基准，通过可复用的缺陷本体和标准化轨迹表示，从过程证据而非仅最终结果评估LLM编码智能体的执行质量，并引入控制保持量化执行过程的可解释性、可中断性等属性。

Comments 22 pages, 8 figures

详情

AI中文摘要

现有的LLM编码智能体基准主要评估最终结果。虽然有助于衡量整体能力，但这些指标提供的可见性有限，常常遗漏执行过程中出现的缺陷。我们提出了ProcCtrlBench，一个用于LLM编码智能体执行过程评估的基准。ProcCtrlBench将重复出现的执行缺陷组织成一个可复用的本体，涵盖4类11种缺陷类型，并通过标准化的过程证据而非仅最终结果来评估智能体轨迹。为了支持异构智能体之间的比较，ProcCtrlBench将原始日志标准化为统一的轨迹表示，并报告基于过程发现的校准评分卡。此外，ProcCtrlBench使用控制保持作为量化执行过程质量的方式，捕获执行是否保持可解释、可中断、可纠正、可逆，并在需要时能够交还控制权。我们在从三个基准（AndroidBench、TerminalBench和SWE-bench-Verified）中采样的200个案例上评估了ProcCtrlBench。结果表明，ProcCtrlBench可以以有用的可靠性实例化，提供比直接阈值化更稳定的语义，并揭示了传统基于结果的评估常常忽略的执行质量的有意义差异。

英文摘要

Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics provide limited visibility and often miss defects that arise during execution. We present ProcCtrlBench, a benchmark for execution-process evaluation in LLM coding agents. ProcCtrlBench organizes recurrent execution defects into a reusable ontology covering 11 defect types in 4 categories, and evaluates agent trajectories through standardized process evidence rather than final outcomes alone. To support comparison across heterogeneous agents, ProcCtrlBench standardizes raw logs into a unified trajectory representation and reports calibrated scorecards over process-level findings. In addition, ProcCtrlBench uses control preservation as a way to quantify execution-process quality, capturing whether execution remains interpretable, interruptible, correctable, reversible, and able to hand back authority when needed. We evaluate ProcCtrlBench on 200 cases sampled from three benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Results show that ProcCtrlBench can be instantiated with useful reliability, provides more stable semantics than direct thresholding, and reveals meaningful differences in execution quality that are often overlooked by conventional outcome-based evaluation.

URL PDF HTML ☆

赞 0 踩 0

2605.03309 2026-05-27 cs.CR cs.AI cs.SE 版本更新

Cryptographic Registry Provenance: Structural Defense Against Dependency Confusion in AI Package Ecosystems

加密注册表溯源：针对AI包生态系统中依赖混淆的结构性防御

Alan L. McCann

发表机构 * Mashin, Inc.（Mashin公司）

AI总结提出加密注册表溯源系统，通过注册表身份签名、双重签名模型和权威命名空间绑定三层结构防御依赖混淆攻击。

Comments 15 pages, 1 figure, 4 tables. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Updated license

详情

AI中文摘要

依赖混淆攻击利用了软件分发中的结构性缺陷：一旦包被安装，就没有加密证据证明是哪个注册表分发的。所有现有防御都是基于配置的，并且在配置错误时会静默失败。我们提出一个加密分发溯源系统，包含三个组件：(1) 加密注册表身份，每个注册表持有一个Ed25519密钥对，并对其分发的每个工件进行签名；(2) 双重签名模型，发布者在打包时签名，注册表在发布时副署；(3) 权威命名空间绑定，消费者固定注册表指纹，解析器从加密上拒绝来自未授权注册表的工件。这些创建了三层防御，需要同时攻破才能成功攻击。对八个生态系统（npm、Cargo、Hex.pm、PyPI、Go模块、Docker/OCI、NuGet、Maven）的比较显示，没有现有生态系统结合了强制发布者签名、加密注册表身份、强制注册表副署和消费者端加密执行。该系统扩展到AI生成溯源作为签名属性，以及治理强制依赖解析。一个案例研究将分发溯源与三层运行时治理架构集成，创建了一个无加密间隙的四阶段生命周期链。

英文摘要

Dependency confusion attacks exploit a structural gap in software distribution: once a package is installed, there is no cryptographic proof of which registry distributed it. Every existing defense is configuration-based and fails silently when misconfigured. We present a cryptographic distribution provenance system comprising three components: (1) cryptographic registry identity, where every registry holds an Ed25519 keypair and signs every artifact it distributes; (2) a dual-signature model, where the publisher signs at packaging time and the registry countersigns at publication time; and (3) authoritative namespace binding, where consumers pin registry fingerprints and the resolver cryptographically rejects artifacts from unauthorized registries. These create three defense layers requiring simultaneous compromise for a successful attack. A comparison across eight ecosystems (npm, Cargo, Hex.pm, PyPI, Go modules, Docker/OCI, NuGet, Maven) shows no existing ecosystem combines mandatory publisher signing, cryptographic registry identity, mandatory registry countersigning, and consumer-side cryptographic enforcement. The system extends to AI-generation provenance as a signed attribute and governance-enforced dependency resolution. A case study integrates distribution provenance with a three-layer runtime governance architecture, creating a four-phase lifecycle chain with no cryptographic gaps.

URL PDF HTML ☆

赞 0 踩 0

2605.02958 2026-05-27 cs.CR cs.AI cs.CL cs.LG 版本更新

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

追踪拒绝的动态：利用潜在拒绝轨迹进行鲁棒越狱检测

Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao, Zhong Chen

发表机构 * Peking University（北京大学）； Nanyang Technological University（南洋理工大学）； Beijing Jiaotong University（北京交通大学）

AI总结通过因果追踪识别出稀疏的“拒绝轨迹”激活模式，并提出轻量级白盒检测器SALO，基于隐藏状态窗口实现鲁棒越狱检测。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

详情

AI中文摘要

表征工程分析通常使用从终端或池化表示中提取的静态方向来描述拒绝。我们质疑这种观点是否忽略了拒绝是如何在层-标记位置上构建的。通过因果追踪，我们识别出一个 extit{拒绝轨迹}：一种稀疏的上游激活模式，即使当诸如GCG的攻击抑制终端拒绝信号时，该模式也常常持续存在。基于这一观察，我们提出了SALO（稀疏激活定位算子），一种轻量级白盒检测器，它在选定层窗口的原始隐藏状态体积上操作。在Qwen、Llama和Mistral模型上，SALO在固定的XSTest校准工作点下，改进了多个攻击家族的越狱检测。我们进一步分析了静态RepE风格基线、ROI敏感性、自适应GCG攻击和编码输入边界情况，阐明了拒绝轨迹监测的前景和局限性。

英文摘要

Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. We ask whether this view misses how refusal is constructed across layer-token positions. Using causal tracing, we identify a \textit{Refusal Trajectory}: a sparse upstream activation pattern that often persists even when attacks such as GCG suppress terminal refusal signals. Based on this observation, we propose SALO (Sparse Activation Localization Operator), a lightweight white-box detector that operates on raw hidden-state volumes from a selected layer window. Across Qwen, Llama, and Mistral models, SALO improves jailbreak detection on several attack families under a fixed XSTest-calibrated operating point. We further analyze static RepE-style baselines, ROI sensitivity, adaptive GCG attacks, and encoded-input boundary cases, clarifying both the promise and limitations of refusal-trajectory monitoring.

URL PDF HTML ☆

赞 0 踩 0

2605.02035 2026-05-27 cs.CL cs.AI 版本更新

注意工具故障：实现医疗智能体的协同工具增益

Yunhui Gan, Tan Pan, Kaiyu Guo, Limei Han, Weimiao Yu, Guangnan Ye, Chen Jiang, Yuan Cheng

发表机构 * Fudan University（复旦大学）； Shanghai Academy of Artificial Intelligence for Science（上海人工智能科学研究院）； Shanghai Innovation Institute（上海创新研究院）； The University of Queensland（昆士兰大学）； Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR)（生物信息研究所（BII），科技研究局（A*STAR））

AI总结针对医疗AI智能体在真实临床环境中工具可能失败的问题，提出基于GRPO的强化学习框架，通过实例级工具选择和分歧感知协同学习，实现错误工具共识的纠正，提升系统鲁棒性。

详情

AI中文摘要

医疗AI智能体越来越多地使用外部工具进行诊断、治疗建议和证据检索，但大多数现有方法假设任务合适的工具在其预期范围内是可靠的。这一假设在真实临床环境中是脆弱的，因为即使相关工具也可能在具有挑战性的实例上失败，并导致不安全的后续决策。为了解决这个问题，我们研究了不完美工具设置下的医疗工具使用，以纠正单个工具遗漏的失败实例。实例相关的失败模式在最佳固定单一工具和理想的实例级选择器之间产生了差距，我们称之为单一预言风险差距。核心挑战在于，传统的任务级工具选择无法实现这一差距，因为它本质上受限于最佳单一工具的性能。受此观察启发，我们考虑了实例级异质性，并将工具使用建模为实例级选择问题。特别地，我们提出了一个基于GRPO的强化学习框架，其奖励函数用于概率风险最小化和分歧感知协同学习，促进错误工具共识的实例级纠正。此外，采用熵引导的采样策略来提升高分歧实例的权重，这些实例为学习实例特定的工具协同提供了更强的信号。这两个组件相互补充，以减轻实例级异质性并改善工具协同。在两个任务和七个医疗基准上的实验表明，我们的方法在广泛的基线上持续实现了稳健且稳定的改进，突显了协同感知工具使用对于可靠医疗智能体系统的重要性。

英文摘要

Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, yet most existing approaches assume that task-appropriate tools are reliable within their intended scope. This assumption is fragile in real clinical settings, where even relevant tools may fail on challenging instances and lead to unsafe downstream decisions. To address this issue, we study medical tool use under imperfect-tool settings to correct failure instances missed by individual tools. Instance-dependent failure patterns create a gap between the best fixed single tool and an ideal instance-wise selector, which we refer to as the Single-Oracle risk gap. The core challenge is that conventional task-level tool selection cannot realize this gap, as it is inherently bounded by the performance of the best single tool. Motivated by this observation, we therefore account for instance-level heterogeneity and formulate tool use as an instance-level selection problem. Particularly, we propose a GRPO-based reinforcement learning framework with rewards for probabilistic risk minimization and disagreement-aware synergy learning, which promotes instance-level correction of erroneous tool consensus. Furthermore, an entropy-guided sampling strategy is adopted to upweight high-disagreement instances, which provide stronger signals for learning instance-specific tool synergy. These two components complement each other in mitigating instance-level heterogeneity and improving tool synergy. Experiments on two tasks and seven medical benchmarks show that our method consistently achieves robust and stable improvements over a broad range of baselines, highlighting the importance of synergy-aware tool use for reliable medical agentic systems.

URL PDF HTML ☆

赞 0 踩 0

2605.26690 2026-05-27 cs.LG cs.AI q-bio.QM 版本更新

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

SILO：基于生物引导搜索的自改进模仿用于预算约束下的蛋白质设计

Ashima Khanna, Dominik Grimm

发表机构 * Technical University of Munich（慕尼黑技术大学）； University of Applied Sciences Weihenstephan-Triesdorf（魏因斯坦-特里斯多夫应用科学大学）

AI总结提出SILO框架，通过层次化编辑策略、增量随机束搜索和UCB代理集成，在有限oracle预算下实现蛋白质序列优化，在8个蛋白质适应度景观上达到最优性能。

详情

AI中文摘要

在严格的oracle预算下进行蛋白质序列优化需要探索巨大的组合空间，同时使每次评估都具有信息量。现有的强化学习和离策略生成方法在代理噪声下性能下降，且位置无关的突变提议可能破坏功能关键残基。我们提出了SILO，一个用于oracle预算蛋白质设计的轨迹级自改进模仿框架。SILO使用层次化编辑策略，将每个突变分解为位置选择后跟残基选择。在每个主动学习轮次中，策略通过增量随机无放回束搜索（SBS）采样候选轨迹，结合基于UCB的代理集成和丙氨酸扫描适应度分数（AFS），选择具有功能相关编辑的候选进行计算机oracle评估。然后，通过在轮次中最佳oracle标记轨迹上的下一动作交叉熵模仿来更新策略，避免值函数估计。在八个复现的蛋白质适应度景观和来自先前工作的五个强基线上，SILO在我们的评估中在8/8的景观上实现了最高的最大和top-100平均适应度，通常表现出更快的早期改进。在每种设置两个景观的低数据和噪声代理压力测试中，当多个基线退化时，SILO保持竞争力或最佳。消融实验表明，SBS与AFS贡献了大部分增益，迭代模仿提供了额外改进。代码可在：https://github.com/grimmlab/SILO.git 获取。

英文摘要

Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off-policy generative approaches often degrade under surrogate noise, and position-agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory-level self-improvement imitation framework for oracle-budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice. In each active-learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB-based proxy ensemble, combined with an alanine-scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next-action cross-entropy imitation on the round's best oracle-labeled trajectories, avoiding value-function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top-100 mean fitness on 8 of 8 landscapes within our evaluations, often exhibiting faster early-stage improvement. In low-data and noisy-proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS with AFS account for much of the gains, with iterative imitation providing additional improvement. Code is available at: https://github.com/grimmlab/SILO.git

URL PDF HTML ☆

赞 0 踩 0

2605.26683 2026-05-27 cs.CL cs.AI 版本更新

An In-Vitro Study on Cross-Lingual Generalization in Language Models

语言模型中跨语言泛化的体外研究

Adrian Cosma

发表机构 * Dalle Molle Institute for Artificial Intelligence (IDSIA)（达勒莫利人工智能研究所（IDSIA））

AI总结通过构建两种程序生成的语言，独立控制词汇距离、少数语言比例等变量，研究语言模型跨语言迁移的机制，发现迁移主要取决于分词是否保留可复用的跨语言子结构，且词汇量越小越有利于掩码迁移。

Comments 16 Figures, 1 Table

详情

AI中文摘要

在自然语料中，语言模型的跨语言迁移难以研究，因为词汇重叠、形态、数据不平衡和分词相互纠缠。我们引入了一个体外框架，使用两种程序生成的语言，它们共享相同的本体、类型化语法和组合结构，但表面实现不同。这使我们能够独立改变词汇距离、少数语言比例、分词器训练制度和词汇量大小，同时评估在掩码少数语言条件下的迁移，该条件的词汇形式在训练中从未被观察到。在700次受控运行中，我们发现迁移受分词器平衡或原始词汇相似性的影响较小，而更多地取决于分词是否保留可复用的跨语言子结构。较小的词汇量通常通过保持单词可分解为共享片段来改善掩码迁移，而较大的词汇量可能将形式转化为特定语言的原子。我们进一步表明，迁移是一个阶段性过程：语法和类型级能力先于掩码词汇泛化。最后，我们尝试通过分词器桥梁解释这一机制，并表明桥梁强度与掩码可达性密切相关。

英文摘要

Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and tokenization are entangled. We introduce an in-vitro framework with two procedurally generated languages that share the same ontology, typed grammar, and compositional structure, but differ in surface realization. This lets us independently vary lexical distance, minority-language proportion, tokenizer training regime, and vocabulary size, while evaluating transfer on a masked minority-language condition whose lexical forms are never observed during training. Across 700 controlled runs, we find that transfer is governed less by tokenizer balance or raw lexical similarity than by whether tokenization preserves reusable cross-lingual substructure. Smaller vocabularies often improve masked transfer by keeping words decomposable into shared fragments, whereas larger vocabularies can turn forms into language-specific atoms. We further show that transfer emerges as a staged process: grammatical and type-level competence precede masked lexical generalization. Finally, we attempt to explain this mechanism through tokenizer bridges and show that bridge strength correlates strongly with masked reachability.

URL PDF HTML ☆

赞 0 踩 0

2605.26680 2026-05-27 cs.CV cs.AI 版本更新

MemFail: LLM记忆系统的故障模式压力测试

Ishir Garg, Neel Kolhe, Dawn Song, Xuandong Zhao

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出MemFail基准测试，通过形式化记忆系统为摘要、存储和检索三个操作并构建对抗性数据集，系统性地评估和诊断LLM记忆系统的故障模式。

详情

AI中文摘要

大型语言模型（LLM）代理越来越依赖外部记忆系统以在长程交互中保持一致性，但关于这些系统具体故障模式和设计选择的实证研究很少。现有基准报告聚合的问答准确率，将记忆系统视为黑箱，无法将错误答案归因于系统的特定故障模式。我们引入MemFail，一个诊断性基准，用于隔离现代LLM记忆系统的故障模式。我们首先将记忆系统形式化为三个规范操作的组合——摘要、存储和检索——并识别每个操作可能引发的故障模式。基于这些假设的故障模式，我们构建了跨越四个任务的五个数据集，每个数据集都经过对抗性设计以测试记忆系统的特定操作。使用这些数据集，我们在MemFail上评估了四种最先进的记忆系统，并展示了MemFail如何用于实证理解记忆系统架构差异带来的权衡。

英文摘要

Large language model (LLM) agents increasingly rely on external memory systems to remain consistent across long-horizon interactions, but little empirical work has been done to understand the specific failure modes and design choices that these systems present. Existing benchmarks report aggregate question-answering accuracy and treat memory systems as black boxes, making it impossible to attribute an incorrect answer to a particular failure mode of the system. We introduce MemFail, a diagnostic benchmark that isolates the failure modes of modern LLM memory systems. We begin by formalizing memory systems as the composition of three canonical operations -- summarization, storage, and retrieval -- and identify the potential failure modes induced by each. Based on these hypothesized failure modes, we construct five datasets spanning four tasks, each adversarially designed to test a specific operation of a memory system. Using these datasets, we evaluate four state-of-the-art memory systems on MemFail and demonstrate how MemFail can be used to empirically understand the tradeoffs induced by differences in memory system architectures.

URL PDF HTML ☆

赞 0 踩 0

2605.26662 2026-05-27 cs.CL cs.AI econ.GN q-fin.EC 版本更新

AI evaluation may bias perceptions: The importance of context in interpreting academic writing

AI评估可能扭曲认知：语境在解读学术写作中的重要性

Shang Wu, Randol Yao

发表机构 * UC Irvine（加州大学欧文分校）； MIT（麻省理工学院）

AI总结本文通过构建AI相似度基准，发现忽略国家和领域差异的评估方法会系统性高估或低估某些群体中的AI使用，提出基于具体语境的基准以更准确评估科学写作中的AI使用。

详情

AI中文摘要

本文研究了当评估方法忽略国家和领域的语境差异时，科学写作中AI使用估计可能产生的偏差。利用Dimensions中期刊论文的大规模数据，我们基于人类撰写和LLM重写的摘要之间的差异构建了AI相似度基准。我们表明，合并基准可能混淆已有的风格差异与AI生成的文本，即使在LLM之前的出版物中也会在跨国家-领域组中产生显著扭曲。相比之下，特定国家-领域的基准减轻了这种扭曲，并提供了更可信的比较基线。将这些方法应用于2025年的出版物，结果显示合并基准系统性高估了某些国家和领域的AI使用，同时低估了其他国家和领域的AI使用。这些发现强调了语境感知测量对于准确和公平评估科学中AI使用的重要性。

英文摘要

This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences across countries and fields. Using large-scale data on journal publications from Dimensions, we construct AI-likeness benchmarks based on differences between human-written and LLM-rephrased abstracts. We show that a pooled benchmark may confound pre-existing stylistic variation with AI-generated text, producing substantial distortions across country-field groups even in pre-LLM publications. In contrast, country-field-specific benchmarks attenuate such distortions and provide a more credible baseline for comparison. Applying these methods to publications in 2025 reveals that the pooled benchmark systematically overestimates AI use in certain countries and fields while underestimating it in others. These findings highlight the importance of context-aware measurement for accurate and equitable evaluation of AI use in science.

URL PDF HTML ☆

赞 0 踩 0

2605.26661 2026-05-27 cs.CV cs.AI 版本更新

Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models

在预训练视觉语言模型的后验分布外检测中尊重模态差距

Yuanwei Hu, Bo Peng, Yadan Luo, Zhen Fang, Ling Chen, Jie Lu

发表机构 * The University of Queensland（昆士兰大学）； University of Technology Sydney（悉尼科技大学）

AI总结针对预训练视觉语言模型在后验分布外检测中文本原型与视觉原型存在模态差距的问题，提出在线伪监督框架直接在视觉特征空间学习类原型，实现新最优性能。

详情

AI中文摘要

分布外（OOD）检测已成为一种流行的技术，通过识别来自未知类别的意外输入来增强机器学习模型的可靠性。预训练视觉语言模型（VLM）的最新进展使得无需访问分布内（ID）训练数据即可进行零样本OOD检测；在这种设置下，现有方法通常将类名的文本嵌入视为类原型。在本文中，我们通过理论证明现成的文本原型通常与最优视觉原型不对齐，从而产生无法通过提示工程单独消除的内在模态差距，来挑战广泛采用的文本即原型范式。为了在后验约束下缓解这一差距，本文提出了一种在线伪监督框架，该框架使用未标记的测试时数据流和预训练VLM的软预测，直接在视觉特征空间中学习类原型。我们为在线优化过程的收敛性提供了理论保证。大量实验经验证明，我们的方法在各种OOD检测设置中达到了新的最优水平。

英文摘要

Out-of-distribution (OOD) detection has emerged as a popular technique to enhance the reliability of machine learning models by identifying unexpected inputs from unknown classes. Recent progress in pre-trained vision-language models (VLMs) has enabled zero-shot OOD detection without access to in-distribution (ID) training data; in this setting, existing methods commonly treat text embeddings of class names as class prototypes. In this paper, we challenge the widely adopted text-as-prototype paradigm by theoretically showing that off-the-shelf textual prototypes are generally misaligned with the optimal visual prototypes, yielding an intrinsic modality gap that cannot be eliminated by prompt engineering alone. To mitigate this gap under the post-hoc constraint, this paper presents an online pseudo-supervised framework that directly learns class prototypes in the visual feature space using unlabeled test-time data streams and soft predictions from the pre-trained VLMs. We provide theoretical guarantees for the convergence of the online optimization procedure. Extensive experiments empirically demonstrate that our method achieves a new state of the art across a variety of OOD detection setups.

URL PDF HTML ☆

赞 0 踩 0

2605.26657 2026-05-27 cs.AI 版本更新

UnityMAS-O: 基于LLM的多智能体系统的通用强化学习优化框架

Yiqun Chen, Wei Yang, Erhan Zhang, Shijie Wang, Qi Liu, Zechun Niu, Bin Zhang, Haitao Li, Rui Li, Lingyong Yan, Jinyuan Feng, Biqing Qi, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

发表机构 * Renmin University of China（中国人民大学）； Xiaohongshu Inc.（小红书公司）

AI总结提出UnityMAS-O框架，将多智能体工作流作为优化单元，通过逻辑角色、图轨迹、用户定义奖励和智能体-模型映射四个核心对象解耦逻辑与物理参数，支持灵活的参数共享和奖励分配，在检索增强问答、迭代搜索和反思代码生成任务上验证了多智能体RL对手动工作流的提升效果。

详情

AI中文摘要

基于LLM的多智能体系统将复杂任务分解为交互角色，但大多数仍通过提示、工具和控制规则手动编排，智能体很少通过统一的强化学习接口进行优化。现有的RL后训练框架主要针对单策略优化，缺乏对用户定义的多智能体工作流、结构化交互、角色特定信用分配和可配置参数共享的抽象。我们提出了UnityMAS-O，一个用于基于LLM的多智能体系统的通用RL优化框架。UnityMAS-O将完整工作流视为优化单元，而非单个响应或策略轨迹。它通过四个核心对象表示工作流：逻辑智能体角色、图轨迹、用户定义奖励和智能体-模型映射。这将逻辑智能体与物理模型参数解耦，支持完全共享、完全分离和部分共享，奖励在角色、轮次和轨迹级别分配。UnityMAS-O通过基于Ray的星形拓扑运行时扩展了verl。中央控制器执行工作流、调用工具、记录结构化轨迹并组装奖励；模型本地工作器组负责轨迹生成、缓冲、优势计算和分布式PPO风格更新。用户可以定义智能体、工作流、模型映射和奖励，而无需重写优化基础设施。我们在检索增强问答、迭代智能体搜索和反思代码生成上实例化了UnityMAS-O。在Natural Questions、HotpotQA和保留代码任务上，多智能体RL在优化后改进了手动指定的工作流，对于较小模型和严格代码全通过指标尤其有较大提升。这些结果表明，UnityMAS-O可以作为可复用基础，将多样化的基于LLM的多智能体工作流转化为可训练的多智能体RL系统。

英文摘要

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent--model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.

URL PDF HTML ☆

赞 0 踩 0

2605.26636 2026-05-27 cs.CV cs.AI 版本更新

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

JetViT: 高效高分辨率视觉Transformer与训练后注意力搜索

Dongyun Zou, Zhuoyang Zhang, Junyu Chen, Wenkun He, Qinhe Peng, Hanrong Ye, Yao Lu, Hongxu Yin, Yu Wang, Song Han, Han Cai

发表机构 * MIT（麻省理工学院）； University of Pennsylvania（宾夕法尼亚大学）； NVIDIA（NVIDIA公司）； Physical Intelligence（物理智能）

AI总结提出JetViT混合架构视觉Transformer，通过训练后注意力搜索将预训练全注意力ViT转换为高效混合注意力变体，在高分辨率图像上实现更高推理效率且不损失精度。

Comments Accepted to CVPR 2026 Findings

详情

AI中文摘要

我们介绍了JetViT，一种新颖的混合架构视觉Transformer（ViT）模型系列，它在匹配最先进的全注意力视觉基础模型精度的同时，在高分辨率图像上实现了显著更高的推理效率。我们方法的核心是训练后注意力搜索，这是一种训练后加速框架，通过识别并将冗余的全注意力块替换为线性注意力或窗口注意力块，将预训练的全注意力ViT转换为高效的混合注意力变体。通过继承基础模型的MLP和注意力权重，训练后注意力搜索通过三个关键步骤高效探索架构设计空间：（1）优化线性注意力块设计；（2）找到线性注意力块和窗口注意力块的最佳组合；（3）识别并保留关键的全注意力块。我们在两个代表性的高分辨率视觉基础模型DINOv3和DepthAnythingV2上评估了JetViT。在NVIDIA H100 GPU上，JetViT在不牺牲精度的情况下实现了高达1.79倍的吞吐量提升和高达44.81%的延迟降低。我们将很快发布我们的代码和加速后的ViT模型。

英文摘要

We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our approach is Post-Training Attention Search, a post-training acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying and replacing redundant full-attention blocks with linear or window-attention blocks. By inheriting the MLP and attention weights from the base model, Post-Training Attention Search efficiently explores the architectural design space through three key steps: (1) optimizing the linear-attention block design; (2) finding the best combination of linear-attention and window-attention blocks; and (3) identifying and preserving critical full-attention blocks. We evaluate JetViT on two representative high-resolution vision foundation models, DINOv3 and DepthAnythingV2. On the NVIDIA H100 GPU, JetViT achieves up to 1.79x higher throughput and up to 44.81% lower latency without sacrificing accuracy. We will release our code and accelerated ViT models soon.

URL PDF HTML ☆

赞 0 踩 0

2605.26628 2026-05-27 cs.AI 版本更新

Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

Tail-Aware HiFloat4: 面向Wan2.2的W4A4训练后量化

Zhanfeng Feng, Shuai Guo, Xin Di, Long Peng, Yang Cao, Zhengjun Zha

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出Tail-Aware HiFloat4方法，通过感知激活尾部的百分位校准和紧凑PTQ状态恢复，在HiFloat4数值格式下对Wan2.2进行W4A4训练后量化，减少罕见校准异常值的影响。

2605.26621 2026-05-27 cs.CV cs.AI 版本更新

MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation

MedVol-R1：基于奖励驱动的证据基础用于体积推理分割

Zichun Wang, Hairong Shi, Bingzheng Wei, Yan Xu, Zihua Wang

发表机构 * School of Biological Science and Medical Engineering, Beihang University, Beijing, China（生物科学与医学工程学院，北京航空航天大学）； Center for Information and Computer Science, School of Science for Open and Environmental Systems, Graduate School of Science and Technology, Keio University, Kanagawa, Japan（信息与计算机科学中心，开放与环境系统科学学院，科技研究生学校，东京大学，神奈川，日本）； Bytedance Inc., China（字节跳动公司，中国）； Tsinghua University, Beijing, China（清华大学，北京，中国）

AI总结提出MedVol-R1框架，通过强化学习将临床推理解耦为可验证的2D证据锚点，再传播为3D掩膜，实现体积推理分割，在多个基准上达到最优性能。

详情

AI中文摘要

体积推理分割（VRS）旨在根据自由形式的临床查询在3D医学扫描中分割目标区域，其中所指对象通常是隐含的，需要医学知识和体积基础推理。现有方法通常依赖专门的分割标记将语言与掩膜解码连接起来，但这种耦合将决策过程压缩为不透明的潜在表示，限制了可解释性和对多样化叙述表达的泛化能力。在本文中，我们提出MedVol-R1，一种基于强化学习的VRS框架，明确地将证据基础与体积描绘解耦：LVLM将临床推理定位到可验证的2D证据锚点（关键轴向切片和2D边界框），然后由冻结的MedSAM2模块将其传播为连贯的3D掩膜。我们使用冷启动监督微调后接GRPO来训练MedVol-R1，并由多组件奖励引导，该奖励鼓励信息性证据选择、准确的2D空间定位和跨切片体积连贯性，无需昂贵的思维链注释。在M3D-Seg基准的CT-ORG、AbdomenCT-1K和KiTS23上的实验表明，MedVol-R1一致优于强基线并达到最先进性能，强化学习相比纯监督微调提供了明显增益。

英文摘要

Volumetric Reasoning Segmentation (VRS) aims to segment a target region in a 3D medical scan from a free-form clinical query, where the referent is often implicit and requires both medical knowledge and volume-grounded reasoning. Existing methods typically rely on specialized segmentation tokens to connect language with mask decoding, but this coupling collapses the decision process into opaque latent representations, limiting interpretability and generalization to diverse narrative expressions. In this paper, we present MedVol-R1, a reinforcement learning-based framework for VRS that explicitly decouples evidence grounding from volumetric delineation: the LVLM grounds clinical reasoning to a verifiable 2D evidence anchor (key axial slice and 2D bounding boxes), which is then propagated into a coherent 3D mask by a frozen MedSAM2 module. We train MedVol-R1 with cold-start supervised fine-tuning followed by GRPO, guided by a multi-component reward that encourages informative evidence selection, accurate 2D spatial grounding, and cross-slice volumetric coherence, without requiring costly chain-of-thought annotations. Experiments on CT-ORG, AbdomenCT-1K, and KiTS23 from the M3D-Seg benchmark demonstrate that MedVol-R1 consistently outperforms strong baselines and achieves state-of-the-art performance, with reinforcement learning providing clear gains over pure supervised fine-tuning.

URL PDF HTML ☆

赞 0 踩 0

2605.26615 2026-05-27 cs.AI 版本更新

FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

FAST-GOAL: 快速高效的全局-局部对象对齐学习

Hyungyu Choi, Young Kyun Jang, Chanho Eom

发表机构 * Department of Virtual Convergence, Graduate School of Advanced Imaging Science, Multimedia & Films (GSAIM), Chung-Ang University（虚拟融合系，高级影像科学研究生院，多媒体与电影系（GSAIM）， Chung-Ang 大学）

AI总结提出FAST-GOAL微调方法，通过全局-局部语义对齐增强CLIP处理长文本的能力，包括快速局部图像-句子匹配和基于token相似性的学习，并在GLIT100k数据集上训练，在长/短描述数据集上均取得显著提升。

Comments 21 pages, 8 figures, IEEE/TIP 2026 accepted

详情

AI中文摘要

视觉-语言模型如CLIP在图像和文本对齐方面表现出色，但由于在简短标题上预训练，它们通常难以处理冗长详细的文本描述。我们提出FAST-GOAL（快速高效的全局-局部对象对齐学习），一种高效的微调方法，通过全局-局部语义对齐增强CLIP处理长文本的能力。我们的方法包含两个关键组件。首先，快速局部图像-句子匹配（FLISM）通过目标检测和空间划分高效提取局部图像区域，然后将其与对应句子匹配。其次，基于token相似性的学习（TSL）最大化图像中特定区域的patch token与其对应区域嵌入之间的相似性，并将相同原理应用于文本，从而增强模型捕获细节对应关系的能力。此外，我们引入了GLIT100k数据集，该数据集提供全局图像-长描述对和上下文派生的局部对，其中局部描述从全局描述中提取以保持语义连贯性。通过在长描述数据集（DOCCI, DCI）和短描述数据集（MSCOCO, Flickr30k）上的大量实验，我们证明FAST-GOAL相比基线取得了显著改进，使CLIP能够有效适应详细文本描述，同时保持计算效率。

英文摘要

Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions due to pre-training on short and concise captions. We present FAST-GOAL (Fast and Efficient Global-local Object Alignment Learning), an efficient fine-tuning method that enhances ability of CLIP to handle lengthy text through global-local semantic alignment. Our method consists of two key components. First, Fast Local Image-Sentence Matching (FLISM) efficiently extracts local image regions through object detection and spatial division, then matches them with corresponding sentences. Second, Token Similarity-based Learning (TSL) maximizes the similarity between patch tokens from specific regions in the image and their corresponding region embeddings, applying the same principle to text, which enhances the ability of the model to capture detailed correspondences. Additionally, we introduce GLIT100k, a dataset that provides both global image-lengthy caption pairs and context-derived local pairs, where local descriptions are extracted from global captions to maintain semantic coherence. Through extensive experiments on long caption datasets (DOCCI, DCI) and short caption datasets (MSCOCO, Flickr30k), we demonstrate that FAST-GOAL achieves significant improvements over baselines, enabling effective adaptation of CLIP to detailed textual descriptions while maintaining computational efficiency.

URL PDF HTML ☆

赞 0 踩 0

2605.26606 2026-05-27 cs.LG cs.AI 版本更新

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

将你的展开用在关键处：基于组强化学习后训练的展开分配

Woojeong Kim, Ziyi Yang, Jing Nathan Yan, Jialu Liu

发表机构 * Cornell University（康奈尔大学）

AI总结提出 Pilot-Commit 框架，通过预算感知的展开分配策略，优先将计算资源分配给高信息量的提示，从而在组策略优化中减少采样成本并加速收敛。

详情

AI中文摘要

强化学习（RL）是后训练大型语言模型的主要范式。然而，在在线、在策略设置中，展开生成主导了训练的计算成本。基于组的策略优化方法对每个提示计算多个展开的优势，但它们不加区分地将预算分配给奖励分布崩溃的提示，将昂贵的展开浪费在可忽略的学习信号上。我们证明，基于组的更新在高奖励方差区域最为有效。由于策略在整个训练过程中演变，提示的信息量必须在线估计而非预先计算，但穷举评估每个提示在计算上不可行。我们引入了 Pilot-Commit，一个用于基于组 RL 后训练的预算感知展开分配框架。Pilot-Commit 将提示评估与利用解耦：一个试点阶段使用预算的一部分估计每个提示的信息量，然后将剩余的展开分配给高杠杆提示，同时跳过低信号提示。在多个数学推理基准和从 1.5B 到 14B 参数的模型规模上，Pilot-Commit 以显著更低的采样成本匹配基线准确率，在累积展开中达到目标准确率的速度比 GRPO 快高达 $1.9 imes$，比 DAPO 快高达 $4.0 imes$。

英文摘要

Reinforcement learning (RL) is the dominant paradigm for post-training large language models. However, in the online, on-policy setting, rollout generation dominates the computational cost of training. Group-based policy optimization methods compute advantages from multiple rollouts per prompt, yet they indiscriminately allocate budget to prompts with collapsed reward distributions, wasting expensive rollouts on negligible learning signals. We demonstrate that group-based updates are most effective in regimes of high reward variance. Since the policy evolves throughout training, prompt informativeness must be estimated online rather than precomputed, but exhaustively evaluating every prompt is computationally prohibitive. We introduce Pilot-Commit, a budget-aware rollout allocation framework for group-based RL post-training. Pilot-Commit decouples prompt evaluation from exploitation: a pilot stage estimates per-prompt informativeness using a fraction of the budget, and the remaining rollouts are allocated to high-leverage prompts while low-signal prompts are skipped. Across multiple math reasoning benchmarks and model scales from 1.5B to 14B parameters, Pilot-Commit matches baseline accuracy with significantly lower sampling costs, reaching target accuracy up to $1.9\times$ faster than GRPO and $4.0\times$ faster than DAPO in cumulative rollouts.

URL PDF HTML ☆

赞 0 踩 0

2605.26600 2026-05-27 cs.LG cs.AI 版本更新

Geometry-Aware Contrastive Learning for Few-Shot Automatic Modulation Recognition

几何感知对比学习用于少样本自动调制识别

Guanqun Zhao, Yitong Liu, Jiaxuan Fang, Yufei Mao, Hongwen Yang

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）

AI总结提出动态一致性对比学习框架，通过虚拟对抗增强和语义一致性损失解决自监督学习中的各向同性增强、频谱不稳定和语义漂移问题，在少样本设置下提升自动调制识别准确率。

2605.26596 2026-05-27 cs.AI 版本更新

AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

AGORA: 基于适配器接地观察-动作保留的LLM智能体无推理提示压缩

Haoran Zhang, Zhaohua Sun

发表机构 * AI Agent Technologies (Hong Kong) Limited（人工智能代理技术（香港）有限公司）； Department of Mechanical Engineering, The University of Hong Kong（香港大学机械工程系）

AI总结针对LLM智能体，提出AGORA无推理步骤级压缩器，通过结构提示解析器、格式关键内容保留和125M参数相关性评分器，在9个测试单元中8个保持≥75%的无压缩性能。

Comments 10 pages, 2 figures. Code and data: https://github.com/ranranrannervous/agoracompression

详情

AI中文摘要

广泛用于通用LM上下文的token级抽取式压缩器在结构上不适合LLM智能体：在跨越两个独立token级方法家族的17个（环境、骨干、方法）单元中，尽管实现了1.3-13.3倍的压缩，每个单元的均值奖励≤0.05。我们将这种失败模式命名为动作语法破坏——携带动作语义的token（标识符、括号、动作动词）正是那些自信息排名最低的token，因此通用压缩器可靠地移除它们，环境拒绝剩余部分。诊断指向步骤粒度压缩。我们引入AGORA，一种无推理的步骤级压缩器，结合结构提示解析器、格式和时效关键内容的始终保留底线，以及一个在反事实下一步动作变化标签上训练的125M参数相关性评分器（约2ms/步，零每步LLM开销）。在比较的无推理和基于LLM的方法中，AGORA是唯一在9个单元中的8个中保持≥75%无压缩性能的方法（唯一的例外为73%）；四路组件消融将结构底线隔离为主要的性能杠杆，而学习到的评分器是单一固定保留比率下实现1.0-11.5倍自适应端到端压缩的来源。

英文摘要

The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two independent token-level method families, every cell collapses to mean reward <= 0.05 despite 1.3-13.3x realized compression. We name and characterize this failure mode as action-grammar destruction -- the tokens carrying action semantics (identifiers, brackets, action verbs) are exactly those self-information ranks lowest, so a general-purpose compressor reliably removes them and the environment rejects the residual. The diagnosis points to step-granularity compression. We introduce AGORA, an inference-free step-level compressor combining a structural prompt parser, an always-keep floor for format- and recency-critical content, and a 125M-parameter relevance scorer trained on counterfactual next-action-change labels (~2ms/step, zero per-step LLM toll). Across the compared inference-free and LLM-based methods, AGORA is the only one retaining >= 75% uncompressed performance in 8 of 9 cells (with the lone exception at 73%); a four-way component ablation isolates the structural floor as the dominant quality lever and the learned scorer as the source of 1.0-11.5x adaptive end-to-end compression from a single fixed keep ratio.

URL PDF HTML ☆

赞 0 踩 0

2605.26590 2026-05-27 cs.CY cs.AI 版本更新

Examining the Challenges of Intellectual Property in AI-Generated Productions

审视人工智能生成作品中的知识产权挑战

Ali Mazhar, Mohammad Zare, Marjan Veysi

AI总结本文通过比较伊朗、欧盟、英国和美国的法律框架，分析人工智能生成作品在知识产权保护中的所有权归属与法律挑战，并提出修订法律或引入新型权利的建议。

详情

Journal ref: New Researches in the Smart City, Vol. 3, No. 4, Summer 2025

AI中文摘要

随着能够自主生成艺术、文学、音乐作品甚至发明而无需直接人工干预的人工智能系统的进步，知识产权制度面临前所未有的问题和挑战。最关键的问题涉及在缺乏人类创作者的情况下道德和经济权利的所有权，以及如何为这些产出提供法律保护。本文首先回顾了这一领域的理论基础和现有文献，然后比较研究了伊朗的法律框架，如1969年《作者、作曲家和艺术家权利保护法》和《专利和商标注册法》，以及其他法律体系，包括欧盟、英国和美国。此外，还分析了关于人工智能生成作品知识产权的现有法律观点及相关执法挑战。研究结果揭示了当前伊朗法律框架内的重大监管空白。为了在促进创新与保护人类创造力之间取得平衡，修订现有法律并引入新方法，例如为人工智能生成作品定义特定的知识产权或指定相关人类代理人之间的所有权，似乎是必要的。

英文摘要

With the advancement of artificial intelligence systems capable of autonomously generating artistic, literary, musical works, and even inventions without direct human intervention, the intellectual property (IP) regime faces unprecedented questions and challenges. The most critical issue concerns the ownership of moral and economic rights in the absence of a human creator, and how such outputs can be granted legal protection. This paper first reviews the theoretical foundations and existing literature in this domain, then comparatively examines Iranian legal frameworks such as the 1969 Law for the Protection of Authors, Composers, and Artists Rights and the Patent and Trademark Registration Law-alongside other legal systems, including the European Union, the United Kingdom, and the United States. Furthermore, existing legal perspectives on the intellectual property of AI-generated works and the related enforcement challenges are analyzed. The findings reveal significant regulatory gaps within the current Iranian legal framework. To balance the promotion of innovation with the preservation of human creativity, revising existing laws and introducing novel approaches such as defining a specific intellectual property right for AI-generated works or designating ownership among associated human agents appears to be essential.

URL PDF HTML ☆

赞 0 踩 0

2605.26589 2026-05-27 cs.LG cs.AI stat.ML 版本更新

Few-shot Cross-country Generalization of Tabular Machine Learning and Foundation Models for Childhood Anemia Prediction under Distribution Shift

分布漂移下儿童贫血预测的表格机器学习与基础模型的少样本跨国家泛化

Yusuf Brima, Marcellin Atemkeng, Lansana Hassim Kallon, David Niyukuri, Antoine Vacavant, Samuel Saidu, Ding-Geng Chen

发表机构 * Department of Mathematics, Rhodes University, South Africa（数学系，罗德斯大学，南非）； National Institute for Theoretical and computational Sciences (NITheCS), Stellenbosch, 7600, South Africa（理论与计算科学国家研究所（NITheCS），斯泰伦博斯，7600，南非）； Interdisciplinary Research Program in Public Health, University of Burundi, Burundi（公共卫生跨学科研究计划，布恩迪大学，布恩迪）； Universite Clermont Auvergne, Clermont Auvergne INP, CNRS, Institut Pascal, Clermont–Ferrand, France（克莱蒙特-奥弗涅大学，克莱蒙特-奥弗涅INP，CNRS，帕西尔研究所，克莱蒙特-费尔南，法国）； Department of International Public Health, Liverpool School of Tropical Medicine, Liverpool, UK（国际公共卫生系，利物浦热带医学学校，利物浦，英国）； College of Health Solutions, Arizona State University, Phoenix, USA（健康解决方案学院，亚利桑那州立大学，凤凰城，美国）； Department of Statistics, University of Pretoria, Pretoria, South Africa（统计系，普里特oria大学，普里特oria，南非）

AI总结本研究评估了基于Transformer的表格基础模型TabPFN在跨国家、数据稀缺环境下预测儿童贫血的性能，发现其优于经典监督方法，尤其在低数据场景下表现出更好的区分度和校准能力。

详情

AI中文摘要

儿童贫血影响全球约40%的6-59个月儿童，且由异质性因素引起，限制了模型的泛化能力。我们在跨国家和数据稀缺环境下，评估了基于Transformer的表格基础模型与经典监督方法。我们使用了来自非洲、亚洲、拉丁美洲、高加索和中东16个国家的DHS数据（n=68,856）。比较了逻辑回归、XGBoost、LightGBM和TabPFN v2.6。性能通过AUC-ROC、Brier评分和ECE评估。泛化性通过留一国家法（LOCO）、反向LOCO和少样本设置评估。亚组分析包括性别、年龄、居住地、母亲教育和财富。特征重要性通过SHAP估计。TabPFN在低数据场景（<200样本）中优于经典模型，显示出更高的区分度和更好的校准。在各国中，它实现了最低的Brier评分（0.042）和ECE（0.203）。在全数据设置下，AUC-ROC范围为0.59-0.76，模型间差异较小（≤0.05）。LOCO性能稳定（0.58-0.69），受国家背景驱动。反向LOCO显示出不对称的可转移性。亚组性能一致，无系统性人口统计偏差。SHAP识别出儿童年龄、海拔和年龄别身高Z分数为主要预测因子，其次是财富和母亲教育。儿童贫血预测的性能更多由人群变异驱动而非模型选择。TabPFN在低资源环境中通过改进的区分度和校准提供了优势，突显了基础模型作为数据稀缺全球健康预测的有前景工具。

英文摘要

Childhood anemia affects around 40% of children aged 6-59 months globally and arises from heterogeneous factors, limiting model generalizability. We evaluate a transformer-based tabular foundation model against classical supervised methods under cross-country and data-scarce settings. We used DHS data from 16 countries across Africa, Asia, Latin America, the Caucasus, and the Middle East (n=68,856). We compared Logistic Regression, XGBoost, LightGBM, and TabPFN v2.6. Performance was assessed using AUC-ROC, Brier score, and ECE. Generalization was evaluated using leave-one-country-out (LOCO), reverse-LOCO, and few-shot settings. Subgroup analyses included sex, age, residence, maternal education, and wealth. Feature importance was estimated using SHAP. TabPFN outperformed classical models in low-data regimes (<200 samples), showing higher discrimination and better calibration. Across countries, it achieved the lowest Brier score (0.042) and ECE (0.203). Under full-data settings, AUC-ROC ranged from 0.59-0.76 with small between-model differences ($\leq 0.05$). LOCO performance was stable (0.58-0.69), driven by country context. Reverse-LOCO showed asymmetric transferability. Subgroup performance was consistent with no systematic demographic bias. SHAP identified child age, altitude, and height-for-age z-score as dominant predictors, followed by wealth and maternal education. Performance in childhood anemia prediction is driven more by population variation than model choice. TabPFN provides advantages in low-resource settings through improved discrimination and calibration, highlighting foundation models as promising tools for data-scarce global health prediction.

URL PDF HTML ☆

赞 0 踩 0

2605.26582 2026-05-27 cs.LG cs.AI 版本更新

On the Error-Correcting Effects of Stochasticity in Discrete Diffusion

离散扩散中随机性的纠错效应

William Yuan, Sungwon Jeong, Amirali Aghazadeh

发表机构 * Georgia Institute of Technology（佐治亚理工学院）

AI总结本文系统研究离散扩散模型中马尔可夫转移随机性程度对采样效率与质量的权衡，提出离散搅动与重启采样（DCRS）算法，通过交替正向和反向扩散过程注入受控随机性，在低函数评估次数下改善速度-质量权衡。

详情

AI中文摘要

离散扩散模型在文本和图像生成中取得了强劲性能，但其推理仍然缓慢，且必须内在平衡采样效率与样本质量。在这项工作中，我们系统研究了马尔可夫转移中随机性程度如何主导采样权衡。我们表明，高度确定性的转移收敛迅速但遭受误差累积，而更随机的转移收敛更慢但能达到更高的最终样本质量。通过信息论分析，我们识别出潜在机制为一种由对称地在状态间交换质量的冗余转移诱导的纠错效应，并表明这些转移可证明地收缩采样误差。受此分析启发，我们提出离散搅动与重启采样（DCRS），一种新颖的推理算法，通过交替正向和反向扩散过程注入受控随机性。在合成数据集和大规模基准上的实验表明，DCRS在低函数评估次数下改善了速度-质量权衡。在图像数据集上，与标准采样器相比，DCRS在保持竞争性样本质量的同时，实现了高达10倍的采样步数减少；而在语言基准上，我们观察到更细微的行为，取决于损坏过程和采样程序。

英文摘要

Discrete diffusion models achieve strong performance in text and image generation, but their inference remains slow and must inherently balance sampling efficiency and sample quality. In this work, we present a systematic study of how the \emph{degree of stochasticity} in Markov transitions governs the sampling tradeoff. We show that highly deterministic transitions converge rapidly but suffer from error accumulation, while more stochastic transitions converge more slowly yet can achieve higher final sample quality. Using an information-theoretic analysis, we identify the underlying mechanism as an error-correcting effect induced by \emph{redundant transitions} that symmetrically exchange mass between states, and show that these transitions can provably contract sampling errors. Motivated by this analysis, we propose \emph{Discrete Churn and Restart Sampling} (DCRS), a novel inference algorithm that injects controlled stochasticity by alternating between forward and reverse diffusion processes. Experiments on synthetic datasets and large-scale benchmarks show that DCRS improves the speed-quality tradeoff in the low number of function evaluations regime. On image datasets, DCRS achieves up to a $10\times$ reduction in sampling steps compared to standard samplers while maintaining competitive sample quality, whereas on language benchmarks, we observe more nuanced behavior depending on the corruption process and sampling procedure.

URL PDF HTML ☆

赞 0 踩 0

2605.26577 2026-05-27 eess.SY cs.AI cs.LG cs.SY math.OC 版本更新

Bridging Control with Neural Network Verifier alpha-beta-CROWN: A Tutorial

桥接控制与神经网络验证器 alpha-beta-CROWN：教程

Haoyu Li, Xiangru Zhong, Hao Cheng, Bin Hu, Huan Zhang

发表机构 * Department of Computer Science（计算机科学系）； Department of Electrical and Computer Engineering（电气与计算机工程系）

AI总结本教程提出一个统一框架，通过将控制问题与神经网络验证器 α,β-CROWN 桥接，实现控制器属性的可扩展形式验证。

Comments ACC 2026 Tutorial

详情

AI中文摘要

基于学习的控制器合成方法因其高表达力和强经验性能而受到欢迎。然而，在自动驾驶、机器人技术和电力系统等安全关键场景中，仅凭经验性能是不够的，对控制器的稳定性、安全性等属性进行形式验证是非常可取的。不幸的是，许多先前的验证方法要么依赖于系统或证书的特定结构假设，难以在不同设置间迁移，要么在高维神经网络系统上可扩展性差。在本教程中，我们提出了一个统一框架，旨在通过将控制与最先进的神经网络验证器 $α,\!β$-CROWN（alpha-beta-CROWN）桥接来弥合这一差距。其核心是，$α,\!β$-CROWN 是一个通用的边界引擎，用于表示为计算图的非线性函数：给定一个输入域，它可以产生认证边界和非线性函数的显式线性松弛。这些认证边界本身对于可达性分析等任务很有用，并且它们为执行可满足性检查和优化的更复杂例程提供了基础。更具体地说，许多控制问题归结为验证状态域上的实值不等式（例如，李雅普诺夫理论）。因此，$α,\!β$-CROWN 通过计算紧边界并基于边界递归划分和剪枝子域，实现了这些条件的可扩展验证。得益于 GPU 并行化，该流程在对传统方法具有挑战性的验证和优化问题上展示了卓越的可扩展性。在本教程中，我们讨论了 $α,\!β$-CROWN 的基础知识，并介绍了其在各种控制相关任务中的应用。

英文摘要

Learning-based methods for synthesizing controllers have gained popularity due to their high expressiveness and strong empirical performance. However, in safety-critical scenarios such as autonomous driving, robotics, and power systems, empirical performance alone is insufficient, and formal verification of controller properties such as stability and safety is highly desirable. Unfortunately, many prior verification approaches are either tied to specific structural assumptions on the system or the certificate, making them difficult to transfer across settings, or suffer from poor scalability on higher-dimensional neural network systems. In this tutorial, we present a unified framework that aims to mitigate this gap via bridging control with the state-of-the-art neural network verifier $α,\!β$-CROWN (alpha-beta-CROWN). At its core, $α,\!β$-CROWN is a general-purpose bounding engine for nonlinear functions represented as computation graphs: given an input domain, it can produce certified bounds and explicit linear relaxation of the nonlinear function. These certified bounds are useful on their own for tasks such as reachability analysis, and they also provide the foundation for more complex routines that perform satisfiability checking and optimization. More specifically, many control problems reduce to verifying real-valued inequalities over a state domain (e.g., Lyapunov theory). Consequently, $α,\!β$-CROWN enables scalable verification of such conditions by computing tight bounds and recursively partitioning and pruning subdomains based on the bounds. Thanks to GPU parallelization, this pipeline demonstrates superior scalability on verification and optimization problems that are challenging for traditional approaches. In this tutorial, we discuss the basics of $α,\!β$-CROWN and introduce its application to various control-related tasks.

URL PDF HTML ☆

赞 0 踩 0

2605.26567 2026-05-27 cs.AI 版本更新

线性与神经延迟反馈的对抗性赌博机

Xiangyi Wang, Pingchen Lu, Jie Mao, Mingze Kong, Zhi Hong, Zhiyong Wang, Zhongxiang Dai

发表机构 * The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； The Chinese University of Hong Kong（香港中文大学）

AI总结针对随机延迟反馈下的上下文对抗性赌博机问题，提出线性（LDB-DF）和神经（NDB-DF）两种算法，通过将逆概率加权（IPW）机制直接融入损失函数实现无偏校正，并给出线性设置下O(d*sqrt(T))的遗憾界和神经设置下的次线性保证。

详情

AI中文摘要

上下文对抗性赌博机构成了基于偏好的决策制定的基石，在推荐系统和大语言模型对齐中有关键应用。然而，标准算法依赖于即时反馈的理想化假设，这一条件在现实场景（如提示优化）中经常被违反。这种设置带来了独特的理论挑战：与线性赌博机不同，对抗性赌博机估计量缺乏闭式解，使得标准加权技术的朴素适应产生偏差。为解决这一问题，我们形式化了具有随机延迟反馈的上下文对抗性赌博机问题，并提出了两种新颖算法：线性延迟反馈对抗性赌博机（LDB-DF）和神经延迟反馈对抗性赌博机（NDB-DF）。我们方法的核心是一种新颖的估计量，它将逆概率加权（IPW）机制直接集成到损失函数中，确保对延迟或缺失反馈的无偏校正。我们提供了全面的理论分析，为线性设置建立了O(d*sqrt(T))的遗憾界，并为神经设置建立了次线性保证。在模拟和真实数据集上的大量实验证明了我们提出方法的有效性。

英文摘要

Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption of immediate feedback, a condition frequently violated in real-world scenarios such as prompt optimization. This setting introduces a unique theoretical challenge: unlike linear bandits, dueling bandit estimators lack closed-form solutions, rendering naive adaptations of standard weighting techniques biased. To address this, we formalize the problem of Contextual Dueling Bandits with Stochastic Delayed Feedback and propose two novel algorithms: Linear (LDB-DF) and Neural (NDB-DF) Dueling Bandits with Delayed Feedback. Central to our approach is a novel estimator that integrates an Inverse Probability Weighting (IPW) mechanism directly into the loss function, ensuring unbiased correction for delayed or missing feedback. We provide comprehensive theoretical analysis, establishing an O(d*sqrt(T)) regret bound for the linear setting and sub-linear guarantees for the neural setting. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of our propose.

URL PDF HTML ☆

赞 0 踩 0

2605.26546 2026-05-27 cs.AI 版本更新

哪些变化重要？通过相关性敏感评估和求解器基础推理实现可信赖的法律AI

Chen Linze, Cai Yufan, Hou Zhe, Dong Jin Song

发表机构 * National University of Singapore（新加坡国立大学）； Griffith University（格里菲斯大学）

AI总结提出法律相关性敏感评估问题，引入统一评估套件，并设计基于形式推理的对抗多智能体框架LexGuard，以提高法律AI对法律相关变化的校准敏感性。

详情

AI中文摘要

法律推理需要区分重要的变化和不重要的变化。法律AI应在法律无关的扰动下保持稳定，但当扰动改变法律实质要点时应发生变化。我们将这一要求形式化为法律相关性敏感评估问题：LLM应仅对法律相关的变化敏感。我们引入了一个统一的评估套件，涵盖司法公平性、鲁棒性和法规混淆场景中的应变化和不应变化评估。我们的评估表明，现有的法律LLM系统性地对法律无关的变化敏感，并且常常无法区分相关的法律要素和法规规则。为了缓解这些失败，我们提出了LexGuard，一个基于形式推理的对抗多智能体框架。LexGuard将法规形式化为可执行约束，使用对抗智能体提取竞争的事实-法规论点，并调用SMT求解器验证法律满足性和逻辑一致性。实验表明，LexGuard通过减少对操纵性框架的脆弱性、改善相似法规之间的区分、限制法律无关属性的影响以及增加良性重述下的一致性，提高了法律推理的可靠性。我们表明，法律可信赖性不仅需要准确性，还需要对法律实质性变化的校准敏感性。

英文摘要

Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change when perturbations alter legally material points. We formulate this requirement as a legal-relevance-sensitive evaluation problem: LLMs should only be sensitive to the legally relevant change. We introduce a unified evaluation suite covering should-change and should-not-change evaluation across judicial fairness, robustness, and statute-confusion scenarios. Our evaluation shows that existing legal LLMs are systematically sensitive to legally irrelevant variations and often fail to distinguish related legal elements and statutory rules. To mitigate these failures, we present LexGuard, an adversarial multi-agent framework grounded in formal reasoning. LexGuard formalizes statutes into executable constraints, uses adversarial agents to extract competing fact-statute arguments, and invokes SMT solvers to verify legal satisfaction and logical consistency. Experiments show that LexGuard improves legal reasoning reliability by reducing vulnerability to manipulative framing, improving disambiguation among similar statutes, limiting the influence of legally irrelevant attributes, and increasing consistency under benign reformulations. We show that legal trustworthiness requires not only accuracy, but calibrated sensitivity to legally material changes.

URL PDF HTML ☆

赞 0 踩 0

2605.26525 2026-05-27 cs.CV cs.AI 版本更新

ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation

ReCA: 通过递归上下文分配实现多镜头长视频外推

Akide Liu, Jinbo Xing, Chaojie Mao, Ye Li, Zeyu Zhang, Yefei He, Weijie Wang, Zihan Wang, Yu Liu, Gholamreza Haffari, Bohan Zhuang

发表机构 * Monash University（墨尔本大学）； Tongyi Lab, Alibaba Group（通义实验室，阿里集团）； Zhejiang University（浙江大学）； University of Queensland（昆士兰大学）

AI总结针对多镜头视频外推任务中上下文分配瓶颈，提出递归上下文分配框架，通过层次化分解和结构化状态传播提升长视频生成的一致性和质量。

Comments Project Page: https://reca.vmv.re , Code: https://github.com/ali-vilab/ReCA

详情

AI中文摘要

分钟级电影式视频生成是生成式视频模型的核心挑战。现有范式仅解决该挑战的片段：单镜头外推保留锚点但缺乏电影结构，而多镜头叙事施加结构却可自由创造视觉状态而非延续观察到的状态。我们定义多镜头视频外推（MSVE）任务，该任务将观察到的帧或片段扩展为一系列具有电影结构的镜头，同时保留锚点状态并推进叙事意图。该设置受限于短视频模型的每次调用生成预算。我们识别出三个耦合瓶颈：（1）全局规划器从完整剧本中过度指定不支持的细节；（2）镜头级提示在携带完整故事时稀释任务相关状态；（3）时间链将生成帧转变为有损记忆，其中身份、场景、对象和动作状态衰减。MSVE揭示长视频失败不仅是上下文长度的限制，更是上下文分配失败。我们提出递归上下文分配（ReCA），一种推理时框架，在规划和生成之间分层分配上下文。ReCA递归地将MSVE分解为上下文有界子问题，在叶节点调用冻结生成器，并跨时间传播结构化状态更新。为评估该设置，我们进一步提出MSVE-Bench和NB-Q，一种源接地协议，带有专为3至5分钟长视频生成设计的提示，该场景未被现有短视频基准覆盖。与先前方法相比，ReCA在最强竞争控制器上将平均归一化分数提高8%至16%，并将多镜头一致性指标提高28%至43%。查看项目页面：https://reca.vmv.re。

英文摘要

Minute-scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of this challenge: single-shot extrapolation preserves an anchor but lacks cinematic structure, while multi-shot storytelling imposes structure yet remains free to invent its visual states rather than continue an observed one. We define Multi-Shot Video Extrapolation (MSVE), a task that extends an observed frame or clip into a sequence of cinematically structured shots while preserving anchor state and advancing narrative intent. This setting operates under the finite per-call generation budget of short-video models. We identify three coupled bottlenecks: (1) global planners over-specify unsupported details from full screenplays; (2) shot-level prompts dilute task-relevant state when carrying the complete story; and (3) temporal chaining turns generated frames into a lossy memory in which identity, scene, object, and action state decay. MSVE reveals that long-video failure is not merely a limitation of context length, but a failure of context allocation. We propose Recursive Context Allocation (ReCA), an inference-time framework that allocates context hierarchically across planning and generation. ReCA recursively decomposes MSVE into context-bounded subproblems, invokes frozen generators at leaf nodes, and propagates structured state updates across time. To evaluate this setting, we further propose MSVE-Bench and NB-Q, a source-grounded protocol with prompts purpose-built for 3 to 5 minute long-video generation, a regime not addressed by existing short-clip benchmarks. Compared to previous methods, ReCA improves average normalized score by 8 to 16 percent over the strongest competing controller and improves multi-shot consistency metrics by 28 to 43 percent. View the project page at https://reca.vmv.re.

URL PDF HTML ☆

赞 0 踩 0

2605.26524 2026-05-27 cs.CV cs.AI 版本更新

CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence

CmIVTP：面向海事智能的基于跨模态交互的船舶轨迹预测

Yuxu Lu, Dong Yang, Xiaoyu Li, Mengwei Bao, Congcong Zhao

发表机构 * Department of Logistics and Maritime Studies, the Hong Kong Polytechnic University（物流及海运研究系，香港理工大学）； Research Centre for ESG Advancement (RCESGA), the Hong Kong Polytechnic University（ESG进步研究中心（RCESGA），香港理工大学）； School of Navigation, Wuhan University of Technology（航海学院，武汉理工大学）

AI总结针对单一数据源局限导致船舶轨迹预测不准的问题，提出跨模态交互框架CmIVTP，融合AIS和CCTV数据，利用目标感知场景编码器和跨模态交互Transformer实现高精度预测。

详情

AI中文摘要

海事智能交通系统（MITS）对于确保繁忙水域的航行安全和效率至关重要。然而，由于单源数据的局限性，准确的船舶轨迹预测仍然具有挑战性。自动识别系统（AIS）数据对于小型船舶通常稀疏或不可用，而仅靠闭路电视（CCTV）数据无法完全捕捉动态船舶行为。为缓解这些挑战，我们提出了一种基于跨模态交互的船舶轨迹预测（称为CmIVTP）框架，以建模船舶动力学与环境约束之间的复杂交互。具体地，我们引入了一个目标感知场景编码器来提取场景语义特征，有效捕捉船舶-环境交互并提高轨迹预测精度。此外，我们提出了一个跨模态交互变换器，它集成了AIS衍生的运动特征、基于CCTV的环境特征和场景表示。它利用跨模态注意力机制同时捕捉模态内语义和模态间交互，确保动态一致且环境可行的预测。此外，我们通过将历史AIS轨迹聚类为代表性运动模式构建了船舶群体轨迹库，为候选轨迹生成提供了一种高效且可扩展的方法。另外，我们引入了海事多模态数据集增强版（名为Maritime-MmD$^+$），这是一个同步AIS数据和CCTV视频数据的大规模数据集，为多模态轨迹预测研究提供了有力支持。大量实验表明，CmIVTP在多模态驱动的船舶轨迹预测基准上取得了更好的性能。本工作的代码资源可在https://github.com/LouisYxLu/CmIVTP获取。

英文摘要

Maritime intelligent transportation systems (MITS) are essential for ensuring navigation safety and efficiency in busy waterways. However, accurate vessel trajectory prediction remains challenging due to the limitations of single-source data. Automatic identification system (AIS) data is often sparse or unavailable for small vessels, while closed-circuit television (CCTV) data alone cannot fully capture dynamic vessel behavior. To mitigate these challenges, we propose a cross-modal interaction-based vessel trajectory prediction (named CmIVTP) framework to model the intricate interactions between vessel dynamics and environmental constraints. Specifically, we introduce a target-aware scene encoder to extract scene semantic features, effectively capturing vessel-environment interactions and enhancing trajectory prediction accuracy. In addition, we propose a cross-modal interaction transformer, which integrates AIS-derived motion features, CCTV-based environmental features, and scene representations. It leverages cross-modal attention mechanisms to simultaneously capture intra-modal semantics and inter-modal interactions, ensuring dynamically consistent and environmentally feasible predictions. Furthermore, we construct a vessel group trajectory bank by clustering historical AIS trajectories into representative motion patterns, providing an efficient and scalable approach for candidate trajectory generation. Additionally, we introduce the maritime multimodal dataset plus (named Maritime-MmD$^+$), a large-scale dataset that synchronizes AIS data and CCTV video data, providing robust support for multimodal trajectory prediction research. Extensive experiments demonstrate that CmIVTP achieves better performance on multimodal-driven vessel trajectory prediction benchmarks. The code resources for this work can be available at https://github.com/LouisYxLu/CmIVTP.

URL PDF HTML ☆

赞 0 踩 0

2605.26523 2026-05-27 cs.DC cs.AI cs.LG 版本更新

StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

StreamSplit: 通过不确定性引导的自适应分割实现连续音频表示学习

Minh K. Quan, Pubudu N. Pathirana

发表机构 * School of Engineering, Deakin University（德肯大学工程学院）

AI总结提出StreamSplit框架，通过分布式的混合损失和强化学习策略实现边缘设备上的流式对比学习，在降低延迟、带宽和能耗的同时保持高精度。

Comments Accepted at ACM MobiSys 2026

详情

AI中文摘要

大批量对比学习（CL）是现代表示学习的基础，但与边缘设备波动的资源约束根本不相容。这种冲突造成了一个困境：设备上的小批量会降低模型保真度，而将计算卸载到云端则会导致不可接受的延迟和带宽成本。现有解决方案通常采用静态模型压缩，无法适应边缘环境的运行时波动。为弥合这一差距，我们提出了StreamSplit，一种新颖的框架，使得流式对比学习在异构ARM客户端平台上变得实用。StreamSplit解决了环境音频的连续性与CLAP和COLA等模型的离散批量需求之间的冲突。我们引入：（1）一种基于分布的流式框架，将表示质量与本地批量大小解耦，使用易于处理的混合损失在稀疏更新的情况下保持保真度；（2）一种不确定性引导的自适应分割器，使用轻量级强化学习（RL）策略动态划分计算。独特的是，该策略将实时资源监控与嵌入歧义性相结合，以动态优化准确率-延迟权衡。我们在从资源受限的Raspberry Pi 4到高性能Apple M2的多种硬件上评估了StreamSplit。结果表明，与以服务器为中心的基线相比，StreamSplit将每样本延迟降低了高达4.7倍，带宽减少了77.1%，能耗减少了52.3%。关键的是，它保持了与服务器中心模型相差2.2%以内的准确率，证明了自适应分布式学习是现代边缘生态系统的一条可行路径。

英文摘要

Large-batch Contrastive Learning (CL), the foundation of modern representation learning, is fundamentally incompatible with the volatile resource constraints of edge devices. This conflict creates a dilemma: small on-device batches degrade model fidelity, while offloading to the cloud incurs unacceptable latency and bandwidth costs. Existing solutions often resort to static model compression, which fails to adapt to the runtime volatility of edge environments. To bridge this gap, we present StreamSplit, a novel framework that makes streaming CL practical across heterogeneous ARM client platforms. StreamSplit resolves the conflict between the continuous nature of ambient audio and the discrete batch requirements of models like CLAP and COLA. We introduce: (1) A distribution-based streaming framework that decouples representation quality from local batch size, using a tractable Hybrid Loss to maintain fidelity despite sparse updates; and (2) An Uncertainty-Guided Adaptive Splitter that uses a lightweight Reinforcement Learning (RL) policy to dynamically partition computation. Uniquely, this policy integrates real-time resource monitoring with embedding ambiguity to optimize the accuracy-latency trade-off on the fly. We evaluate StreamSplit on diverse hardware, from the resource-constrained Raspberry Pi 4 to the high-performance Apple M2. Results demonstrate that StreamSplit reduces per-sample latency by up to 4.7x and cuts bandwidth by 77.1% and energy by 52.3% compared to server-centric baselines. Crucially, it maintains accuracy within 2.2% of server-centric models, proving that adaptive, distributed learning is a viable path for the modern edge ecosystem.

URL PDF HTML ☆

赞 0 踩 0

2605.26520 2026-05-27 cs.CV cs.AI 版本更新

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

InterSketch: 一种具有自校正视觉草图和逐步奖励的交错推理模型

Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma, Ziyi Shang, Jingcheng Ni, Tao Hu, Yong Xien Chng, Jixuan Ying, Zehuan Wu, Hanming Deng, Jie Yang, Yuanjie Zheng, Wei Liu, Lewei Lu

发表机构 * Shanghai Jiao Tong University（上海交通大学）； SenseTime Research（商汤研究院）； Shandong Normal University（山东师范大学）

AI总结针对视觉-语言模型在长程视觉推理中文本中心范式局限性的问题，提出InterSketch模型，通过自校正和逐步奖励机制增强交错视觉-文本思维链能力，在视觉推理基准上超越Gemini-3-Pro等专有模型。

详情

AI中文摘要

尽管视觉-语言模型（VLM）已展现出多轮视觉推理能力，但其推理轨迹仍相对浅层且以文本为中心，限制了其在复杂视觉挑战中的适用性。相比之下，人类思维通常涉及长程推理，并伴有交错的视觉-文本思维链（VT-CoT）。为弥合这一差距，我们引入InterSketch，一种交错推理模型，通过自校正和逐步奖励机制增强VT-CoT能力。InterSketch使用外部工具动态生成中间视觉草图，并将其与文本推理交错进行，从而在长程视觉理解任务中实现有效的感知和逻辑推理。具体而言，在第一个冷启动阶段，我们提出了一个合成的高质量交错VT-CoT数据集，并引入反思机制，使模型具备多轮交错推理和自校正能力。在后续的强化学习（RL）阶段，我们设计了一种逐步奖励机制，以缓解长程推理中仅端到端监督固有的奖励信号稀疏性问题。在视觉推理基准上的大量实验证明了InterSketch的有效性，其性能甚至超越了Gemini-3-Pro等专有模型。

英文摘要

While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model's capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.

URL PDF HTML ☆

赞 0 踩 0

2605.26514 2026-05-27 cs.CV cs.AI cs.LG 版本更新

AI 编码智能体越来越多地用于编写真实世界的软件，但确保其输出正确性仍然是一个基本挑战。形式化验证提供了一条有希望的路径：智能体生成代码的同时生成机器检查的证明，保证代码满足形式规范。然而，无法保证形式规范本身与用户意图一致。在这项工作中，我们研究规范自动形式化：LLM 智能体能否将非正式编程问题转化为忠实的形式规范。我们引入了 Verus-SpecBench，一个包含 581 个规范编写任务的基准，这些任务源自针对 Rust 验证器 Verus 的 Codeforces 问题，以及 Verus-SpecGym，一个智能体环境，模型在其中与 Verus、bash 和文件系统交互以开发这些规范。核心挑战在于评估：专家编写的参考规范编写成本高昂，LLM 评判者可能遗漏细微错误。我们通过以下方式解决这一问题：(a) 扩展 Verus 的 exec_spec 机制，使生成的规范可以作为 Rust 代码执行；(b) 针对官方 Codeforces 测试和从 Codeforces "hacks"（即竞争对手编写的用于破解不正确解决方案的边缘情况）中提取的对抗性案例进行测试。在 Verus-SpecBench 上，最强的模型 Gemini 3.1 Pro 解决了 77.8% 的任务，其他前沿模型解决了 51.1-57.8%，而开源模型仅达到 21.5-25.5%。我们对失败模式的分析表明，模型生成的规范可能遗漏重要的输入假设、接受不正确的输出以及拒绝有效的输出。我们还发现，LLM 作为评判者的评估遗漏了我们评估者捕获的 26% 的失败。总体而言，我们的结果表明，规范自动形式化对于前沿智能体来说是可行的，但即使在它们已经能够生成正确代码的问题上仍然脆弱。代码、数据和日志可在 https://github.com/formal-verif-is-cool/verus-spec-gym 获取。

英文摘要

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, & the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, & LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, & (b) testing them against official Codeforces tests & adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1--57.8% & OSS models reach only 21.5--25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, & reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, & logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym

URL PDF HTML ☆

赞 0 踩 0

2605.26449 2026-05-27 cs.CV cs.AI 版本更新

Cross-scale Aligned Supervision for Training GANs

跨尺度对齐监督用于训练生成对抗网络

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

发表机构 * Sungkyunkwan University（全北大学）

AI总结针对GAN多尺度生成中跨尺度轨迹未对齐问题，提出CAT（跨尺度对齐Transformer），通过生成器侧一致性正则化对齐中间输出与最终输出，在ImageNet-256上实现FID-50K为1.56。

Comments Preprint

详情

AI中文摘要

现代GAN通常在中间生成器输出上引入对抗性监督，并将由此产生的多阶段合成解释为从粗到细的分层生成。在这项工作中，我们挑战了这一解释。我们认为标准的尺度级对抗监督并未构建适当的从粗到细的层次结构：每个中间图像被独立地推向其自身分辨率下的真实分布，但这种尺度级的真实性并不能确保各阶段的输出代表相同的生成样本。此外，每个阶段产生的特定尺度图像并未用作后续阶段的明确细化目标。因此，其对抗性损失可以改善特定尺度的输出，而不约束后续阶段保持相同的样本轨迹，允许它们转向不同的样本而不是细化先前的输出。我们将此问题称为跨尺度轨迹未对齐问题。为了解决这个问题，我们提出了CAT，一种用于多尺度对抗生成的跨尺度对齐Transformer。CAT保持判别器尺度级，因此每个中间输出在其自身分辨率下被评估，同时添加一个简单的生成器侧一致性正则化，以对齐中间输出与最终输出。在类别条件ImageNet-256上，CAT-H/2在仅60个训练周期后，通过一步推理实现了1.56的FID-50K，优于强大的单步GAN和扩散/流基线。

英文摘要

Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fine hierarchical generation. In this work, we challenge this interpretation. We argue that standard scale-wise adversarial supervision does not construct a proper coarse-to-fine hierarchy: each intermediate image is independently pushed toward the real distribution at its own resolution, but this scale-wise realism does not ensure that outputs across stages represent the identical generated sample. Moreover, the scale-specific image produced at each stage is not used as an explicit refinement target for the subsequent stage. Therefore, its adversarial loss can improve a scale-specific output without constraining later stages to preserve the same sample trajectory, allowing them to move toward a different sample rather than refine the previous output. We refer to this problem as a cross-scale trajectory misalignment problem. To resolve it, we propose CAT, a Cross-scale Aligned Transformer for multi-scale adversarial generation. CAT keeps the discriminator scale-wise, so each intermediate output is evaluated at its own resolution, while adding a simple generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.

URL PDF HTML ☆

赞 0 踩 0

2605.26446 2026-05-27 cs.LG cs.AI 版本更新

DDGAD: Trajectory Dynamics for Diffusion-Based Graph Anomaly Detection

DDGAD：基于扩散的图异常检测中的轨迹动力学

Yuxin Yang, Limei Hu, Feng Chen

发表机构 * College of Artificial Intelligence（人工智能学院）； Southwest University（西南大学）

AI总结提出DDGAD框架，利用扩散正则化和可靠性感知邻域共识下的轨迹动力学区分正常与异常节点，通过三种互补异常信号检测异常。

详情

AI中文摘要

图异常检测（GAD）旨在识别图结构数据中行为或属性显著偏离整体模式的节点或子结构，在金融风险控制、社交网络分析和网络安全等领域具有关键应用。然而，现有的基于GCN的方法存在污染传播的根本问题，即异常节点通过消息传递污染其邻居的表示，导致检测性能下降。本文提出DDGAD，一种新颖的基于扩散的图异常检测框架，利用轨迹动力学区分正常和异常节点。我们的关键洞察是，在扩散正则化和可靠性感知邻域共识的耦合作用下，正常节点表现出一致且稳定的表示轨迹，而异常节点由于全局流形先验与局部污染消息传递之间的方向不一致，表现出不稳定且冲突的动力学。为了减轻污染传播，我们引入了一种分布式的可靠性感知共识细化机制，并定义了三种互补的异常信号：邻居不一致性、可靠性权重和动力学冲突能量。我们进一步对耦合动力学下的正常节点稳定性进行了初步的理论分析。这些信号从局部不一致性、共识可靠性和动力学不稳定性角度共同刻画异常行为。在五个真实世界数据集上的大量实验证明了所提框架的有效性。

英文摘要

Graph anomaly detection (GAD) aims to identify nodes or substructures whose behavior or attributes deviate significantly from the overall pattern in graph-structured data, with critical applications in financial risk control, social network analysis, and cybersecurity. However, existing GCN-based methods suffer from the fundamental problem of contamination propagation, where anomalous nodes pollute the representations of their neighbors through message passing, leading to degraded detection performance. In this paper, we propose DDGAD, a novel diffusion-based graph anomaly detection framework that leverages trajectory dynamics to distinguish normal and anomalous nodes. Our key insight is that normal nodes exhibit consistent and stable representation trajectories under the coupled effects of diffusion regularization and reliability-aware neighborhood consensus, while anomalous nodes exhibit unstable and conflicting dynamics due to the directional disagreement between the global manifold prior and locally contaminated message passing. To mitigate contamination propagation, we introduce a distributed reliability-aware consensus refinement mechanism and define three complementary anomaly signals: neighbor inconsistency, reliability weight, and dynamical conflict energy. We further provide a preliminary theoretical analysis on normal node stability under the coupled dynamics. These signals collectively characterize anomalous behaviors from the perspectives of local inconsistency, consensus reliability, and dynamical instability. Extensive experiments on five real-world datasets demonstrate the effectiveness of the proposed framework.

URL PDF HTML ☆

赞 0 踩 0

2605.26442 2026-05-27 cs.CL cs.AI 版本更新

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

大型语言模型的对齐调优：以数据为中心的对齐数据管道视角

Hwanjun Song

发表机构 * KAIST（韩国科学技术院）

AI总结本文以数据为中心，将对齐调优重构为管道设计问题，分解为响应合成、偏好评估和偏好实例化三个阶段，并基于此框架统一分类现有对齐方法，总结设计权衡与失败模式，提炼高层原则，最后指出开放挑战。

Comments Accepted at the Findings of ACL 2026

2605.26441 2026-05-27 cs.CV cs.AI 版本更新

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

从博弈视角重新思考弱监督视频时间定位

Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen, Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng, Daizong Liu

发表机构 * Hubei Key Laboratory of Distributed System Security（湖北分布式系统安全重点实验室）； Hubei Engineering Research Center on Big Data Security（大数据安全工程研究中心）； School of Cyber Science and Engineering（网络安全科学与工程学院）； Huazhong University of Science and Technology（华中科技大学）； University of Central Florida（佛罗里达中央大学）； Zhejiang Gongshang University（浙江工商大学）； Guangzhou University（广州大学）； The Chinese University of Hong Kong（香港中文大学）； Peking University（北京大学）

AI总结本文从博弈论视角出发，通过多元合作博弈建模帧与词的不确定对应关系，实现多级跨模态交互，从而在弱监督下提升视频时间定位的准确性。

Comments Published in ECCV 2024

详情

AI中文摘要

本文针对弱监督视频时间定位这一具有挑战性的任务。现有方法通常基于时刻提案选择框架，利用对比学习和重构范式对预定义时刻提案进行评分。尽管取得了显著进展，但我们认为当前框架忽略了两个不可或缺的问题：1) 粗粒度跨模态学习：先前方法仅捕获全局视频级与查询的对齐，未能建模视频帧与查询词之间的详细一致性以准确定位时刻边界。2) 复杂的时刻提案：其性能严重依赖于提案的质量，而提案的选择既耗时又复杂。为此，本文首次尝试从新颖的博弈视角处理该任务，通过多样粒度和灵活组合有效学习每个视觉-语言对之间的不确定关系，实现多级跨模态交互。具体而言，我们创造性地将每个视频帧和查询词建模为多元合作博弈中的玩家，学习它们对跨模态相似度得分的贡献。通过博弈论交互量化联盟内帧-词合作的趋势，我们能够评估帧与词之间所有不确定但可能的对应关系。最后，我们不再使用时刻提案，而是利用学习到的查询引导的帧级得分进行更好的时刻定位。实验表明，我们的方法在Charades-STA和ActivityNet Caption数据集上均取得了优越性能。

英文摘要

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework that utilizes contrastive learning and reconstruction paradigm for scoring the pre-defined moment proposals. Although they have achieved significant progress, we argue that their current frameworks have overlooked two indispensable issues: 1) Coarse-grained cross-modal learning: previous methods solely capture the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words for accurately grounding the moment boundaries. 2) Complex moment proposals: their performance severely relies on the quality of proposals, which are also time-consuming and complicated for selection. To this end, in this paper, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal interaction.Specifically, we creatively model each video frame and query word as game players with multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondence between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for better moment localization.Experiments show that our method achieves superior performance on both Charades-STA and ActivityNet Caption datasets.

URL PDF HTML ☆

赞 0 踩 0

2605.26438 2026-05-27 cs.CL cs.AI 版本更新

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

LURE: 减少评估感知的实时使用回放评估

Igor Ivanov, David Demitri Africa

发表机构 * Meridian Cambridge（梅里登剑桥）

AI总结提出LURE方法，通过回放真实代理交互轨迹并附加评估提示来构建类似部署的评估，以减少大语言模型的评估感知，并引入自动化评估真实性流程。

详情

AI中文摘要

大型语言模型能够识别自己正在被评估（评估感知），并因此表现出不同的行为，这破坏了安全和对齐基准的有效性。我们提出LURE（实时使用回放评估），一种通过回放真实的代理交互轨迹并在末尾附加评估提示来构建类似部署的评估的方法。我们还引入了一个自动化流程来衡量评估的真实性，结合了对口头化评估感知的检测和法官模型对日志是否为评估的概率估计，并在一个包含部署和评估记录的大型数据集上进行了验证。我们发现，与广泛使用的基准和合成评估生成器相比，基于LURE的评估与部署的区分度显著降低，并且可以接近与用户真实对话的真实性。我们在策划、AI安全破坏和谄媚场景中实例化了LURE。我们的结果表明，评估真实性是对齐基准的一个关键属性，应在基准结果旁边报告，特别是当这些结果用于安全案例时。

英文摘要

Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending evaluation prompt at the end. We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness and judge-model estimates of the probability of logs being an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users. We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.

URL PDF HTML ☆

赞 0 踩 0

2605.26434 2026-05-27 cs.LG cs.AI 版本更新

Aperiodic and Low-Frequency Spectral Bias in Reconstruction based EEG Foundation Models

基于重建的脑电图基础模型中的非周期和低频谱偏差

Aditya Kommineni, Emily Zhou, Kleanthis Avramidis, Simon Bock Segaard, Jeppe Roden Münster, Andreas Peter Juhl Hansen, Takfarinas Medani, Tiantian Feng, Richard Leahy, Shrikanth Narayanan

发表机构 * University of Southern California（美国南加州大学）； Aalborg University（奥尔堡大学）

AI总结研究揭示基于重建预训练的脑电图基础模型存在非周期和低频成分偏差，导致低资源场景下性能不佳，并提出通过辅助损失关注高频振荡结构来改进。

Comments 18 pages, 13 figures, 3 tables

详情

AI中文摘要

脑电图基础模型在大规模无标签脑电图数据上预训练，已成为学习可泛化脑电图表示的有前景方向。尽管在数据丰富场景下表现积极，但在低资源设置中，它们往往无法显著优于完全监督的小型模型。我们对此缺陷提供了机制性解释，将其归因于基于重建的预训练任务与脑电图信号独特的频谱结构之间的根本性不匹配，该结构分解为高功率非周期成分和低功率振荡成分。通过使用受控的合成脑电图输入，我们证明脑电图基础模型嵌入偏向于捕捉脑电图信号的非周期成分，而低估振荡成分，尤其是高频成分。此外，在真实BCI数据集上的线性探针评估进一步揭示，嵌入比任务相关信息更强烈地编码受试者身份，从而强化了主要基于重建目标训练的基础模型嵌入中的低频和非周期成分偏差。这些发现共同阐明了基于重建的脑电图基础模型中的一种失败模式，并激励未来工作纳入明确针对高频振荡结构的辅助损失，作为实现更强大和可泛化的脑电图表示的途径。

英文摘要

EEG foundation models, pre-trained on large-scale unlabelled EEG data, have emerged as a promising direction towards learning generalizable EEG representations. Despite showing positive results in data-rich regimes, they often fail to outperform significantly smaller supervised models in low-resource settings compared to fully supervised models. We provide a mechanistic account of this shortcoming, attributing it to a fundamental mismatch between reconstruction-based pretext tasks and the idiosyncratic spectral structure of EEG signals, which decompose into distinct high-power aperiodic and low-power oscillatory components. Using controlled, synthetically-generated EEG inputs, we demonstrate that EEG foundation model embeddings are biased to capture the aperiodic components of the EEG signal while under-representing oscillatory components, particularly at higher frequencies. Additionally, linear probe evaluations on real-world BCI datasets further reveal that embeddings encode subject identity more strongly than task-relevant information, thereby reinforcing the low-frequency and aperiodic component bias in foundation model embeddings trained primarily on reconstruction based objectives. Together, these findings elucidate a failure mode in reconstruction based EEG foundation models and motivate future work to incorporate auxiliary losses explicitly targeting high-frequency oscillatory structure as a path toward more capable and generalizable EEG representations.

URL PDF HTML ☆

赞 0 踩 0

2605.26429 2026-05-27 stat.ME cs.AI cs.LG stat.ML 版本更新

Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing

面向大规模分布外检测的结构自适应共形推断

Rongyi Sun, Wenguang Sun, Zinan Zhao

发表机构 * Center for Data Science and School of Mathematical Sciences, Zhejiang University（数据科学中心和数学科学学院，浙江大学）

AI总结提出结构自适应共形q值(SCQ)和伪分数引导的直推式自动模型选择(P-TAMS)，在成对可交换性下实现结构化分布外检测的有限样本错误率控制、功效提升和可解释性增强。

2605.26424 2026-05-27 cs.IR cs.AI cs.LG 版本更新

Uniboost: Global Coordination with Value Alignment for Fair and Efficient Traffic Allocation

Uniboost：基于价值对齐的全局协调实现公平高效的流量分配

Ge Fan, Nan Zhao, Kai Meng, Cong Luo, Yang Fu, Huiping Chu, Jialin Liu, Yuning Jiang, Bo Zheng

发表机构 * Taobao \& Tmall Group of Alibaba Hangzhou China ； Taobao \& Tmall Group of Alibaba Beijing China ； Taobao \& Tmall Group of Alibaba

AI总结提出Uniboost统一流量分配框架，通过后验价值对齐机制和独立线性提升范式，解决耦合分配、分数膨胀和可解释性问题，提升流量分配效率和推荐性能。

Comments accepted by SIGIR 2026

详情

AI中文摘要

随着互联网服务的快速发展，推荐系统已变得不可或缺。特别是混合（重排序）阶段在跨不同业务目标分配流量中起着关键作用。然而，现有方法常受限于耦合的分配方案、分数膨胀和缺乏可解释性。为应对这些挑战，我们提出Uniboost，一个统一的流量分配框架。Uniboost引入后验价值对齐机制，将抽象模型分数校准到具有明确业务语义的锚定指标，显著增强可解释性。此外，它采用独立的线性提升范式来解耦复杂的加权方案，实现每个计划贡献的精确归因。我们通过在线A/B测试和深入数据分析验证了Uniboost的有效性，展示了三个关键发现：1）降低加权分数的整体权重有效减轻了意外的业务干扰，产生更高效的微观流量分配策略；2）事后分析和聚合仪表板提供了直观的宏观洞察，指导整体流量分配机制的设计；3）提出的“有效完成分数”作为易于获取的后验指标，为内容推荐管道提供了可靠的锚点。综合来看，我们的实验表明，Uniboost不仅在微观层面提升了流量分配效率和推荐性能，还为系统迭代提供了宏观指导。因此，这项工作为大规模工业推荐系统提供了一种高效可控的流量调节解决方案。

英文摘要

With the rapid evolution of internet services, recommendation systems have become indispensable. In particular, the blending (re-ranking) stage plays a pivotal role in allocating traffic across diverse business objectives. However, existing approaches often suffer from coupled allocation plans, score inflation, and a lack of interpretability. To address these challenges, we propose Uniboost, a unified traffic allocation framework. Uniboost introduces a posterior value alignment mechanism that calibrates abstract model scores to anchor metrics with explicit business semantics, significantly enhancing interpretability. Furthermore, it employs an independent linear boosting paradigm to decouple complex weighting schemes, enabling precise attribution of each plan's contribution. We validate the effectiveness of Uniboost through online A/B tests and in-depth data analysis, demonstrating three key findings: 1) Reducing the overall weight of weighted scores effectively mitigates unintended business interference, yielding a more efficient micro-level traffic allocation strategy; 2) Post-hoc analyses and aggregated dashboards provide intuitive, macro-level insights that guide the design of the overall traffic allocation mechanism; 3) The proposed "Effective Completion Score" serves as an easily obtainable post-metric that offers a reliable anchor for content recommendation pipelines. Collectively, our experiments show that Uniboost not only improves traffic allocation efficiency and recommendation performance at the micro level but also provides macro-level guidance for system iteration. Thus, this work provides an efficient and controllable traffic regulation solution for large-scale industrial recommendation systems.

URL PDF HTML ☆

赞 0 踩 0

2605.26415 2026-05-27 cs.CV cs.AI 版本更新

The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

拯救效应：时空语义早期退出绕过CLIP中的量化崩溃

Kahyeon Nam, Hyesong Choi

发表机构 * Soongsil University（顺斯大学）

AI总结针对CLIP模型INT8量化导致的表示崩溃问题，提出LRA-EE方法，通过时空语义聚合、多特征门控和层自适应阈值实现早期退出，在ImageNet-1K零样本分类中降低13.4% FLOPs并提升2.44%准确率。

详情

AI中文摘要

在资源受限的硬件上部署视觉-语言模型通常需要INT8量化，但在CLIP等联合嵌入架构中，这引入了一种不同于量化CNN分类器的故障模式：跨Transformer块累积的激活噪声扰乱了多模态嵌入的方向，侵蚀了零样本检索所依赖的余弦对齐。我们将此特征化为量化诱导的表示崩溃（QIRC），并在INT8 CLIP ViT-B/32上量化它，其中逐层噪声信号比从浅层块的低于10%增长到第11层的52%。我们提出LRA-EE（逐层表示感知早期退出），它通过时空语义聚合（用全局补丁令牌平均替代不成熟的浅层[CLS]）、学习到的多特征门控（置信度、top-2间隔、空间激活方差）以及根据每层信息噪声比校准的层自适应置信阈值，绕过噪声饱和的深层。在ImageNet-1K零样本分类上，LRA-EE相比INT8基线减少了13.4%的FLOPs，并将Top-1准确率提高了+2.44个百分点（58.72% -> 61.16%）。四象限分解隔离了拯救效应：9.5%的样本在浅层出口被正确分类，但在全深度被噪声丢失，而只有7.1%遭受相反情况。

英文摘要

Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from quantized CNN classifiers: activation noise accumulated across transformer blocks perturbs the direction of the multimodal embedding, eroding the cosine alignment on which zero-shot retrieval depends. We characterize this as Quantization-Induced Representation Collapse (QIRC) and quantify it on INT8 CLIP ViT-B/32, where the layer-wise noise-to-signal ratio grows from below 10% in shallow blocks to 52% at Layer 11. We propose LRA-EE (Layer-wise Representation-Aware Early Exit), which bypasses noise-saturated deep layers via Spatio-Semantic Aggregation (replacing the immature shallow [CLS] with a global patch-token average), a learned multi-feature gate (confidence, top-2 margin, spatial-activation variance), and Layer-adaptive Confidence Thresholding calibrated to each layer's Information-to-Noise Ratio. On ImageNet-1K zero-shot classification, LRA-EE reduces FLOPs by 13.4% and improves Top-1 accuracy by +2.44%p (58.72% -> 61.16%) over the INT8 baseline. A four-quadrant decomposition isolates the Rescue Effect: 9.5% of samples are correctly classified at shallow exits but lost to noise at full depth, against only 7.1% suffering the inverse.

URL PDF HTML ☆

赞 0 踩 0

2605.26414 2026-05-27 cs.AI cs.CL cs.LG 版本更新

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

推理、代码，还是两者兼有？大型语言模型如何处理数学问题的变化

Matthew Kutakh

AI总结本研究通过对比链式思维推理、单次代码执行和迭代代码执行三种方法在GSM-Symbolic数据集上的表现，发现代码执行并未提升大型语言模型在数学问题变体上的推理鲁棒性。

Comments 6 pages, 4 figures, 2 tables

详情

AI中文摘要

大型语言模型（LLMs）在数学推理基准测试中取得了令人印象深刻的准确性，但当问题被修改为不同的名字或数字等简单变化时，它们的性能会下降。代码执行方法允许模型生成并运行Python代码，而不是用自然语言进行推理，已被提出作为解决方案，但其对推理鲁棒性（即在问题变体中保持准确性的能力）的影响尚未得到系统测试。本研究在GSM-Symbolic数据集的1000个问题上评估了三种方法：使用链式思维（CoT）提示的纯推理、使用程序辅助语言模型（PAL）的单次代码执行，以及使用逐步编码（SBSC）的迭代代码执行。所有三种方法均在配对的原始问题和修改问题上使用Claude Haiku 4.5运行。CoT是最鲁棒的方法，在扰动下准确率下降1.3个百分点，1.8%的问题被破坏。PAL的鲁棒性最差，准确率下降1.7个百分点，3.1%的问题被破坏，SBSC介于两者之间。尽管这些差异在统计上不显著（$p = .096$），但方向趋势在所有指标上一致，表明无论是单次还是迭代的代码执行，都没有提高小学水平问题变体的推理鲁棒性。

英文摘要

Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) has not been systematically tested. This study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset: pure reasoning using chain-of-thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution using Step-by-Step Coding (SBSC). All three were run on paired original and modified problems using Claude Haiku 4.5. CoT was the most robust method, with an accuracy drop of 1.3 percentage points and 1.8% of problems breaking under perturbation. PAL was the least robust at 1.7 percentage points and 3.1% broke, with SBSC falling in between. Although these differences were not statistically significant ($p = .096$), the directional trend was consistent across all measures, suggesting that code execution, whether single-shot or iterative, does not improve reasoning robustness on grade-school-level problem variations.

URL PDF HTML ☆

赞 0 踩 0

2605.26413 2026-05-27 stat.ME cs.AI cs.LG stat.ML 版本更新

Confounder Detection via Treatment Intent: A New Observational Study Design

通过治疗意图进行混杂检测：一种新的观察性研究设计

Drago Plecko, Patrik Okanovic, Torsten Hoefler, Elias Bareinboim

发表机构 * UCLA（加州大学洛杉矶分校）； ETH Zurich（苏黎世联邦理工学院）； Columbia University（哥伦比亚大学）

AI总结提出一种通过询问治疗决策者比较配对单元来揭示未观测混杂因素的新研究设计，并在ICU数据中验证其有效性。

详情

AI中文摘要

理解干预的效果是科学进步的核心，随机对照试验（RCT）在许多应用领域被视为因果推断的金标准。然而，RCT成本高、耗时长，且常受伦理或实际限制，这促使我们需要能够从观察性数据中得出结论的因果方法。尽管此类数据收集规模日益扩大，但将其用于因果推断常因并非所有影响治疗分配和结果的变量都被观测到而受阻，这一问题称为未观测混杂。在本文中，我们介绍了一种称为通过治疗意图进行混杂检测的新研究设计。其思路是询问做出治疗决策的人类专家，并要求他们比较由原则性匹配策略提出的单元对，目的是引出解释治疗决策为何不同的未观测变量。我们为此类程序提供了理论基础，确定了此类研究设计可能引出未观测混杂因素的条件。基于这些新建立的基础，我们研究了重症监护病房（ICU）中干预的治疗效果。首先，我们展示了强烈表明ICU中收集的电子健康记录（EHR）存在未观测混杂的经验证据。通过使用临床文本笔记作为医生知识的代理并利用自然语言处理，我们在已知真实情况的半合成环境中为我们的方法提供了概念验证。

英文摘要

Understanding the effects of interventions is central to scientific progress, with randomized controlled trials (RCTs) regarded as the gold standard for causal inference in many applied fields. However, RCTs are costly, time-consuming, and often constrained by ethical or practical limitations, motivating the need for causal methods able to draw conclusions from observational data. While such data is collected at ever larger scale, making its use for causal inference is often hindered by the fact that not all variables affecting treatment allocation and the outcome are observed: an issue known as unobserved confounding. In this paper, we introduce a new study design called confounder detection via treatment intent. The idea is to query a human expert who makes treatment decisions, and ask them to compare pairs of units proposed by a principled matching strategy, with the goal of eliciting unobserved variables that explain why treatment decisions differ. We provide a theoretical basis for such a procedure, ascertaining conditions under which such a study design may elicit unobserved confounders. Building on this newly established foundations, we study treatment effects of interventions in the intensive care unit (ICU). First, we show empirical evidence strongly indicating that electronic health records (EHRs) collected in ICUs are subject to unobserved confounding. By using clinical text notes as a proxy for physicians' knowledge and leveraging natural language processing, we provide a proof of concept for our methodology in a semi-synthetic environment with a known ground truth.

URL PDF HTML ☆

赞 0 踩 0

2605.26409 2026-05-27 cs.CR cs.AI cs.LG 版本更新

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

通过模型的行为几何进行越狱易感性预测与缓解

Hayden Helm, Xiaodong Liu, Weiwei Yang

发表机构 * Microsoft Research（微软研究院）

AI总结本文通过形式化模型群体的行为几何，利用已评估和防御的模型，实现高效的易感性预测和防御迁移，在79个模型和100个系统配置上，易感性检测AUPRC达0.94且探针减少约98%，防御迁移性能优于同供应商分配。

详情

AI中文摘要

评估和缓解生成系统对越狱攻击的易感性对其安全部署至关重要。由于可部署系统的数量众多，对每种配置进行全面评估和优化是不切实际的。本文形式化了模型群体的行为几何，通过利用先前评估和防御过的模型，支持群体内高效的易感性预测和有效的防御迁移。我们将该框架应用于涵盖24个提供商的79个模型以及单个基础模型的100个系统配置。使用行为几何的简单方法在易感性检测中达到了0.94的AUPRC，与全面评估相比，探针数量减少了约98%。使用行为几何选择从哪个模型迁移优化后的防御，在无额外探针成本的情况下优于同供应商分配（+2%，p = 0.03），且一组三个模型足以覆盖整个群体。结果对超参数选择和评判者具有鲁棒性。

英文摘要

Evaluating and mitigating a generative system's susceptibility to jailbreak attacks is critical to its safe deployment. Given the number of deployable systems, full per-configuration evaluation and optimization is impractical. In this paper, we formalize the behavioral geometry of a population of models that, by leveraging previously evaluated and defended models, supports both efficient susceptibility prediction and effective defense transfer across a population. We apply the framework to 79 models spanning 24 providers and to 100 system configurations of a single base model. Simple methods that use the behavioral geometry reach an AUPRC of $0.94$ for susceptibility detection with $\approx98\%$ fewer probes relative to a full evaluation. Using the behavioral geometry to select which model to transfer an optimized defense from outperforms same-provider assignment ($+2\%$, $p = 0.03$) at no additional probe cost, with a set of three models sufficient to cover the population. Results are robust to hyperparameter selection and judge.

URL PDF HTML ☆

赞 0 踩 0

2605.26403 2026-05-27 cs.AI 版本更新

From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

从静态上下文到校准的交互式强化学习：利用对齐模拟器缓解多轮对话中的分布偏移

Xiaohua Wang, Jiakang Yuan, Zisu Huang, Muzhao Tian, Changze Lv, Kaitao Song, Tao Chen, Xiaoqing Zheng

发表机构 * Fudan University（复旦大学）

AI总结本文提出校准的交互式强化学习框架，通过将交互式强化学习与模拟器对齐相结合，缓解多轮对话中因策略和模拟器导致的分布偏移，提升对话质量。

详情

AI中文摘要

研究界的一个长期目标是开发高度交互的基于LLM的对话代理。最近的研究侧重于基于固定离线日志（静态上下文强化学习）或基于提示的模拟器（交互式强化学习）来优化策略。在这项工作中，我们从理论上证明，这两种范式都受到上下文分布偏移的根本限制——即训练期间观察到的对话历史与真实对话中遇到的对话历史之间的不匹配。这种偏移在每轮对话中呈二次方累积，严重降低对话质量。具体来说，我们将这种偏移归因于两个不同的来源：（i）策略引起的偏移，源于在静态历史而非自生成轨迹上进行训练；（ii）模拟器引起的偏移，源于模拟行为与真实人类行为之间的差异。为了解决这些挑战，我们提出了校准的交互式强化学习，这是一个统一的框架，将交互式强化学习与模拟器对齐相结合。通过将模拟器与人类交互模式对齐，我们的方法减少了模拟到真实的差距，并减轻了累积的分布偏移。在多个对话任务上的实验证实了我们的理论分析：（i）交互式强化学习通过缓解策略分布偏移，显著优于静态上下文基线；（ii）使用我们的对齐方法校准模拟器进一步弥合了模拟到真实的差距，产生了最先进的下游性能。

英文摘要

A long-standing goal of the research community is to develop highly interactive LLM-based dialogue agents. Recent research focuses on optimizing policies based on fixed offline logs (Static Context RL) or using a prompt-based simulator (Interactive RL). In this work, we theoretically show that both paradigms are fundamentally limited by context distribution shift--a mismatch between dialogue histories observed during training and those encountered in real conversations. This shift compounds quadratically over turns and severely degrades dialogue quality. Specifically, we attribute this shift to two distinct sources: (i) policy-induced shift, arising from training on static histories rather than self-generated trajectories; and (ii) simulator-induced shift, stemming from discrepancies between simulated and real human behaviors. To address these challenges, we propose Calibrated Interactive RL, a unified framework that couples interactive RL with simulator alignment. By aligning the simulator with human interaction patterns, our approach reduces the sim-to-real gap and mitigates compounding distribution shifts. Experiments across multiple dialogue tasks confirm our theoretical analysis: (i) Interactive RL significantly outperforms the Static Context baseline by mitigating policy distribution shift; and (ii) calibrating simulators with our alignment method further bridges the sim-to-real gap, yielding state-of-the-art downstream performance.

URL PDF HTML ☆

赞 0 踩 0

2605.26400 2026-05-27 cs.IR cs.AI 版本更新

Plans for Evaluating Structured Generative Search Summaries

评估结构化生成式搜索摘要的计划

Tetsuya Sakai, Jina Lee, Hanpei Fang, Young-In Song

发表机构 * Waseda University/Naver Corporation（早稻田大学/NAVER公司）； Waseda University（早稻田大学）； Naver Corporation（NAVER公司）

AI总结提出一个评估大型语言模型生成的结构化搜索摘要的框架，该摘要包含概述、带标题的章节和引用源文档列表，并描述了实施和评估该框架的计划。

Comments 8 pages (including 2 pages for references)

2605.26385 2026-05-27 cs.IR cs.AI stat.ML 版本更新

Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

两阶段排序中早期检索的信用分配策略梯度

Haruka Kiyohara, Mihaela Curmei, Ariel Evnine, Shankar Kalyanaraman, Israel Nir, Ana-Roxana Pop, Nitzan Razin, Sarah Dean, Thorsten Joachims, Udi Weinsberg

发表机构 * Computer Science Department, Cornell University, Ithaca, NY, USA（康奈尔大学计算机科学系）； Central Applied Science, Meta, Menlo Park, CA, USA（Meta中央应用科学）

AI总结针对两阶段排序中早期排序器（ESR）端到端训练难的问题，提出信用分配策略梯度（CA-PG），通过对目标项被选中的概率求梯度来降低方差，提升训练稳定性和收敛速度。

Comments ICML2026

详情

AI中文摘要

大规模搜索、推荐和检索增强生成（RAG）系统通常采用两阶段架构：早期排序器（ESR）生成候选集，随后由后期排序器（LSR）重新排序。虽然有许多强化学习（RL）方法用于训练LSR，但ESR的端到端训练被证明具有挑战性。特别是，朴素应用“普通”策略梯度（V-PG）对于实际使用的候选集大小不可扩展，因为方差爆炸。该问题源于V-PG将梯度传播到候选集的联合概率，忽略了候选集中每个特定项对奖励的贡献。为缓解此问题，我们提出了一种新颖的“信用分配”策略梯度（CA-PG），它计算相对于目标项在任何候选集中被选中的概率的梯度，即边际化所有包含它的候选集。我们的理论分析表明，CA-PG通过边际化候选集的具体组成显著降低了V-PG的方差，同时保留了在合理对齐的LSR策略下学习正确排序项的能力。在合成和真实数据上的实验表明，CA-PG提高了使用经典Plackett-Luce模型的ESR的收敛速度和训练稳定性，特别是在候选集大小较大时。

英文摘要

Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is subsequently re-ranked by a late-stage ranker (LSR). While there are many reinforcement learning (RL) methods for training the LSR, end-to-end training of the ESR has proven challenging. In particular, naive application of "vanilla" policy gradient (V-PG) is not scalable for candidate-set sizes relevant for practical use due to exploding variance. This issue arises because V-PG propagates the gradient to the joint probability of the candidate sets, ignoring the contribution of each specific item in the candidate set to the reward. To mitigate this issue, we propose a novel "credit-assigned" policy gradient (CA-PG), which computes gradients with respect to the probability that the target item is chosen in any candidate set, i.e. marginalizing over all candidate sets that contain it. Our theoretical analysis reveals that CA-PG significantly reduces the variance of V-PG by marginalizing over the specific composition of the candidate set, while preserving the ability to learn the correct ranking of items under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability for ESRs utilizing the canonical Plackett-Luce model, especially when the candidate-set size is large.

URL PDF HTML ☆

赞 0 踩 0

2605.26380 2026-05-27 cs.CV cs.AI 版本更新

当正确示例有害时：重新思考示例在上下文学习中的作用

Chenghao Qiu, Chunli Peng, Yufeng Yang, Kuan-Hao Huang, Yi Zhou

发表机构 * Texas A&M University（德克萨斯理工大学）

AI总结本文通过引入任务保持扰动，揭示了正确示例不一定有益甚至可能降低上下文学习准确性的反直觉现象，并提出了上下文证据转移的概念来解释正确性与效用之间的差距。

详情

AI中文摘要

上下文学习（ICL）通常被直觉所驱动，即示例之所以有帮助是因为它们提供了正确的输入-输出对。然而，我们揭示了一个反直觉的现象：正确性并不能保证示例的效用，一些正确的示例甚至可能降低ICL的准确性。为了研究这种正确性-效用差距，我们引入了任务保持扰动，其中仅改变示例输入，而该示例仍然是同一任务的正确实例。具体来说，每个扰动后的示例被赋予由任务映射诱导的目标。该框架涵盖了标签更新扰动（其中任务相关语义发生变化且目标被重新计算）和更严格的目标保持扰动（其中原始目标仍然有效）。我们将由此产生的失败模式形式化为上下文证据转移：任务保持扰动可以改变模型用于上下文推理的有效证据混合，从而将示例正确性与示例效用分离。在情感分类、逻辑推理和数学应用题中，我们发现任务保持扰动的示例会显著降低ICL性能，尤其是对于较小的模型、较难的任务和较高的扰动比例。我们的结果表明，鲁棒的ICL不仅需要评估示例是否正确，还需要评估它们如何影响上下文推理。代码可在 https://github.com/Chenghao-Qiu/Task-Preserving-ICL 获取。

英文摘要

In-context learning (ICL) is often motivated by the intuition that demonstrations help because they provide correct input-output examples. However, we reveal a counterintuitive phenomenon: correctness does not guarantee exemplar utility, and some correct demonstrations can even reduce ICL accuracy. To study this correctness-utility gap, we introduce task-preserving perturbations, where only the exemplar input is changed, while the example remains a correct instance of the same task. Concretely, each perturbed exemplar is assigned the target induced by the task mapping. This framework covers both label-updating perturbations, where task-relevant semantics change and targets are recomputed, and stricter target-preserving perturbations, where the original target remains valid. We formalize the resulting failure mode as contextual evidence shift: task-preserving perturbations can change the effective mixture of evidence used by the model for contextual inference, thereby separating exemplar correctness from exemplar utility. Across sentiment classification, logical reasoning, and math word problems, we find that task-preserving perturbed demonstrations can substantially degrade ICL performance, especially for smaller models, harder tasks, and higher perturbation ratios. Our results show that robust ICL requires evaluating not only whether demonstrations are correct, but also how they influence contextual inference. Code is available at https://github.com/Chenghao-Qiu/Task-Preserving-ICL.

URL PDF HTML ☆

赞 0 踩 0

2605.26340 2026-05-27 cs.AI cs.CL cs.MA 版本更新

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

ScientistOne: 迈向基于证据链的人类级自主研究

Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister

发表机构 * Google Cloud AI Research（谷歌云人工智能研究）

AI总结提出证据链框架Chain-of-Evidence和自主研究系统ScientistOne，通过可追溯性解决可验证性失败问题，在多项任务上达到或超越人类专家水平。

Comments Project website: https://scientist-one.github.io/

详情

AI中文摘要

自主研究代理能产生有竞争力的解决方案和专业手稿，但其输出存在表面评估无法察觉的可验证性失败：捏造的引用、不可复现的分数以及与实现不符的方法描述。我们通过三项贡献解决这一问题。第一，Chain-of-Evidence (CoE)，一个可验证性框架，要求每个声明都能追溯到其证据来源。第二，ScientistOne，一个端到端的自主研究系统，在文献综述、解决方案发现和论文撰写过程中通过构造保持证据链。第三，CoE Audit，一个事后审计，其四项完整性检查——分数验证、规范违反、引用验证和方法-代码对齐——统一适用于所有系统。在涵盖五个系统和五个前沿研究任务的75篇论文中，每个基线都表现出至少一种系统性失败模式：幻觉引用率高达21%，分数验证通过率低至42%，方法-代码对齐率在20%到80%之间。ScientistOne实现了零幻觉引用（0/337）、完美的分数验证（12/12）和最高的方法-代码对齐率（14/15），同时在所有五个任务上达到或超过人类专家表现。ScientistOne进一步泛化到涵盖医学影像、细粒度识别、3D感知和语言建模的六个额外任务，在Parameter Golf上取得最先进结果，并在基线完全失败的MLE-Bench任务上获得金牌。

英文摘要

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.

URL PDF HTML ☆

赞 0 踩 0

2605.26333 2026-05-27 cs.AI 版本更新

Managing Uncertainty in LLM-Generated Procedural Knowledge for Virtual Laboratory Planning

管理虚拟实验室规划中LLM生成程序性知识的不确定性

Polychronis Karpodinis, Dimitris Kalles

发表机构 * School of Science and Technology, Hellenic Open University（希腊开放大学科学与技术学院）

AI总结针对LLM生成实验程序存在的不确定性，提出一个原型框架，通过结构化领域表示和不确定的状态转移样本提取候选程序规则，转化为显式约束并修复不确定步骤，以提升虚拟实验室规划的可靠性。

详情

AI中文摘要

教育虚拟实验室可以使实验培训更具可扩展性、适应性和可访问性，尤其是在学生接触物理实验室设施有限的情况下。然而，编写新的模拟实验程序仍然成本高昂：教育工作者必须描述新设备，定义仪器和材料如何交互，并指定可在虚拟环境中执行或评估的有效程序流程。大型语言模型可以通过生成详细的实验程序来辅助这一编写过程，但其输出不应被视为可直接执行的计划。它们可能遗漏必要的操作，步骤顺序错误，或产生逻辑上不正确或与实验室设备不兼容的指令。本文提出了一个用于管理虚拟实验室规划中LLM生成程序性知识不确定性的原型框架。该框架旨在通过使用结构化领域表示和不确定的LLM生成状态转移样本来提取候选程序规则，将其转化为显式且可检查的约束，并利用它们修复不确定的程序步骤，从而减少程序不确定性。尽管动机领域是教育虚拟实验室，但底层问题更为普遍：在结构化交互环境中管理用于行动规划的不确定程序性知识。我们通过一个涉及实验室仪器、容器、工具和材料转移操作的虚拟实验室领域来展示该方法。

英文摘要

Educational virtual laboratories can make experimental training more scala-ble, adaptive, and accessible, especially when students have limited access to physical laboratory facilities. However, authoring new simulated laboratory procedures remains costly: educators must describe new equipment, define how instruments and materials interact, and specify valid procedural flows that can be executed or assessed inside the virtual environment. Large lan-guage models can assist in this authoring process by generating detailed ex-perimental procedures, but their output should not be treated as directly exe-cutable plans. They may omit necessary actions, arrange steps in the wrong order, or produce instructions that are logically incorrect or incompatible with the laboratory equipment. This paper presents a prototype framework for managing uncertainty in LLM-generated procedural knowledge for virtu-al laboratory planning. The framework aims to reduce procedural uncertainty by using structured domain representations and uncertain LLM-generated state-transition samples to extract candidate procedural rules, transform them into explicit and inspectable constraints, and use them to repair uncertain procedural steps. Although the motivating domain refers to educational vir-tual laboratories, the underlying problem is more general: managing uncer-tain procedural knowledge for action planning in structured interactive envi-ronments. We illustrate the approach in a virtual laboratory domain involving laboratory instruments, containers, tools, and material-transfer actions.

URL PDF HTML ☆

赞 0 踩 0

2605.26332 2026-05-27 cs.CV cs.AI 版本更新

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

被擦除但可被利用：针对已遗忘文本到图像扩散模型的黑盒嵌入感知提示攻击

Arian Komaei Koma, Seyed Amir Kasaei, AmirMahdi Sadeghzadeh, Mohammad Hossein Rohban

发表机构 * Department of Computer Engineering（计算机工程系）

AI总结提出一种黑盒嵌入感知对抗提示攻击BEAP，利用大语言模型迭代生成有效对抗提示，以恢复被遗忘概念，并在攻击成功率上提升超过60%。

详情

AI中文摘要

机器遗忘旨在从预训练的文本到图像扩散模型中移除特定概念，然而已有多种白盒和黑盒攻击被提出以使模型生成这些被遗忘的概念。然而，这些攻击并未假设现实的威胁模型，即它们要么假设可以访问模型权重，要么产生无意义的对抗提示，即使通过简单的基于规则的防护也能轻易检测到。本文旨在填补这一空白。我们提出BEAP，一种黑盒、嵌入感知的对抗提示攻击，利用大语言模型（LLM）迭代生成有效的对抗提示并利用这些隐藏的漏洞。BEAP在文本空间中执行嵌入感知搜索，结合多个奖励信号：被遗忘概念的存在性、文本-图像对齐和图像质量，以优化生成的提示。与之前的攻击方法不同，BEAP使其提示对安全过滤器不可检测，同时生成高质量图像。大量实验表明，BEAP的攻击成功率（ASR）比先前方法提高了60%以上，而每次成功攻击平均仅需15个提示。警告：本文包含可能具有冒犯性或令人不安性质的模型输出。

英文摘要

Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model, i.e. they either assume access to the model weights, or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding. We aim to address this gap in this paper. We introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities. BEAP performs an embedding-aware search in text space, combining multiple reward signals: unlearned concept presence, text-image alignment, and image quality, to refine generated prompts. Unlike previous attack methods, BEAP keeps its prompts undetectable to safety filters while producing high-quality images. Extensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack. Warning: This paper contains model outputs that may be offensive or upsetting in nature.

URL PDF HTML ☆

赞 0 踩 0

2605.26329 2026-05-27 cs.AI 版本更新

JobBench: Aligning Agent Work With Human Will

JobBench：使智能体工作符合人类意愿

Yuetai Li, Yichen Feng, Zhangchen Xu, Zixian Ma, Kaiyuan Zheng, Fengqing Jiang, Xinghua Sun, Rulin Shao, Zichen Chen, Yue Huang, Xinyang Han, Brian Lee, Kayla Xu, Shenglai Zeng, Hang Hua, Xiangliang Zhang, Basel Alomair, Ranjay Krishna, Luke Zettlemoyer, Pang Wei Koh, Bhaskar Ramasubramanian, Luyao Niu, Xiang Yue, Radha Poovendran

发表机构 * University of Washington（华盛顿大学）； University of California, Santa Barbara（加州大学圣芭芭拉分校）； Stanford University（斯坦福大学）； Carnegie Mellon University（卡内基梅隆大学）； Northwestern University（西北大学）； University of Notre Dame（圣母大学）； University of California, Berkeley（加州大学伯克利分校）； Michigan State University（密歇根州立大学）； MIT-IBM Watson AI Lab（MIT-IBM沃森人工智能实验室）； Bake AI ； King Abdulaziz City for Science and Technology（国王阿卜杜勒阿齐兹科技城）； Western Washington University（西雅图华盛顿大学）； University of Chicago（芝加哥大学）

AI总结提出JobBench基准，通过专家识别的高优先级工作流程评估AI智能体，以人类需求为中心而非经济价值，覆盖35个职业的130个任务，使用事实锚定的评分链评估，最强模型仅达45.9%，旨在推动从替代到增强的劳动力市场影响。

详情

AI中文摘要

基于检索增强生成和大语言模型的SDN中地毯式轰炸DDoS攻击的智能检测与缓解

Mohammed N. Swileh, Shengli Zhang, Kai Lei

发表机构 * College of Electronics and Information Engineering, Shenzhen University（深圳大学电子与信息工程学院）； ICNLab, Shenzhen Graduate School, Peking University（北京大学深圳研究生院ICN实验室）

AI总结提出一种结合检索增强生成（RAG）和大语言模型（LLM）的框架，通过接口级流量特征、语义嵌入和相似性检索，实现对SDN中地毯式轰炸DDoS攻击的实时检测与缓解，无需传统监督训练。

详情

AI中文摘要

软件定义网络（SDN）提供了灵活可编程的网络管理，但其集中控制架构极易受到分布式拒绝服务（DDoS）攻击，尤其是地毯式轰炸DDoS攻击，该攻击将恶意流量分布到多个目标以逃避传统检测机制。本文提出一种基于检索增强生成（RAG）的框架，用于SDN环境中地毯式轰炸DDoS攻击的实时检测与缓解。该框架结合接口级流量特征表示、语义嵌入生成、基于FAISS的相似性检索以及大语言模型（LLM）驱动的上下文推理，无需传统监督模型训练或再训练即可对流量行为进行分类。为评估所提框架的有效性，在多种不同攻击强度的地毯式轰炸DDoS攻击场景下进行了大量实验。此外，使用多个最先进的LLM研究了两种流量表示策略，即基于结构化JSON的表示和基于自然语言的表示（NLR）。实验结果表明，所提框架实现了高度准确且稳定的攻击检测性能，其中使用Gemma-4-31B-IT模型的框架配置取得了最强的整体检测结果。此外，实时实验证实了所提框架能够快速检测并缓解地毯式轰炸DDoS攻击，同时保持SDN网络稳定运行。所得结果凸显了将RAG机制与LLM集成用于智能自适应SDN安全分析的有效性。

英文摘要

Software-Defined Networking (SDN) provides flexible and programmable network management; however, its centralized control architecture remains highly vulnerable to Distributed Denial-of-Service (DDoS) attacks, particularly Carpet-Bombing DDoS attacks that distribute malicious traffic across multiple targets to evade conventional detection mechanisms. In this paper, a Retrieval-Augmented Generation (RAG)-based framework is proposed for real-time detection and mitigation of Carpet-Bombing DDoS attacks in SDN environments. The proposed framework combines interface-level traffic features representation, semantic embedding generation, FAISS-based similarity retrieval, and Large Language Model (LLM)-driven contextual inference to classify traffic behavior without requiring conventional supervised model training or retraining. To evaluate the effectiveness of the proposed framework, extensive experiments were conducted under multiple Carpet-Bombing DDoS attack scenarios with different attack intensities. In addition, two traffic representation strategies, namely structured JSON-based representation and natural language-based representation (NLR), were investigated using multiple state-of-the-art LLMs. The experimental results demonstrate that the proposed framework achieved highly accurate and stable attack detection performance, while the framework configuration utilizing the Gemma-4-31B-IT model achieved the strongest overall detection results. Furthermore, real-time experiments confirmed the capability of the proposed framework to rapidly detect and mitigate Carpet-Bombing DDoS attacks while maintaining stable SDN network operation. The obtained results highlight the effectiveness of integrating RAG mechanisms with LLM for intelligent and adaptive SDN security analysis.

URL PDF HTML ☆

赞 0 踩 0

2605.26302 2026-05-27 cs.AI cs.CL cs.MA 版本更新

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

你的智能体也在老化：面向部署系统的智能体寿命工程

Jianing Zhu, Yeonju Ro, John Robertson, Kevin Wang, Junbo Li, Haris Vikalo, Aditya Akella, Zhangyang Wang

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结提出 AgingBench 基准，通过四种老化机制和诊断工具评估部署后智能体的可靠性退化，并指出需要寿命评估、机制级诊断和阶段针对性修复。

详情

AI中文摘要

长寿命AI智能体越来越多地被部署为持久化运行系统，但它们仍然像刚初始化的模型一样被评估。第一天基准测试忽略了一个基本系统问题：智能体在部署后能保持可靠多久？即使模型权重被冻结，智能体的有效状态也在不断变化，因为它压缩交互历史、从不断增长的记忆库中检索、在更新后修正事实，并经历常规维护。因此，可靠性成为整个智能体框架的寿命属性，而不仅仅是基础模型的快照属性。我们引入了AgingBench，一个用于智能体寿命工程的纵向可靠性基准：不仅测量部署的智能体是否退化，还测量退化的形式以及修复应针对何处。AgingBench将智能体老化组织为四种机制：压缩老化、干扰老化、修订老化和维护老化。为了诊断这些故障，AgingBench使用时间依赖图和对偶反事实探针，为记忆管道的写入、检索和利用阶段生成诊断档案。在7个场景、14个模型、多种记忆策略以及运行者控制和自主智能体上，跨越约400次运行（涵盖8到200个会话）的结果表明，智能体老化不是一维的：行为测试可以保持干净，而事实精度下降；派生状态跟踪可能在单个模型内急剧崩溃；相同的错误答案可能需要不同的修复，具体取决于诊断档案指向的内容。这些结果表明，可靠的智能体部署需要寿命评估、机制级诊断和阶段针对性修复，而不仅仅是更强的第一天模型。

英文摘要

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, over ~400 runs spanning 8 - 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.

URL PDF HTML ☆

赞 0 踩 0

2605.26293 2026-05-27 cs.CL cs.AI 版本更新

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

CroCo: 基于自生成结果的跨语言对比偏好调优

Mike Zhang, Ali Basirat, Desmond Elliott

发表机构 * Department of Computer Science (DIKU), University of Copenhagen（哥本哈根大学计算机科学系（DIKU））； Centre for Language Technology (CST), University of Copenhagen（哥本哈根大学语言技术中心）； Pioneer Centre for Artificial Intelligence（先锋人工智能中心）

AI总结本文提出CroCo方法，利用英语偏好训练的奖励模型对多语言自生成结果进行对比偏好调优，无需语言特定偏好标注，在14种高低资源语言上提升模型性能，并避免灾难性遗忘。

详情

AI中文摘要

先前工作证实，通过奖励分数设置的大语言模型自生成结果之间的受控对比性，可以改善英语中的下游偏好调优。我们将此方法扩展到多种语言，并在总共14种高资源和低资源语言上，对两个模型在一系列多样化任务上进行评估。我们的核心发现是，跨语言对比偏好调优（CroCo）无需语言特定的偏好标注即可迁移。基于英语偏好（在多语言基础模型之上）训练的奖励模型，在大多数语言中产生了有用的语言内排名，并且在单语或多语设置中进行配对，在大多数设置上改进了每个模型，同时防止了监督微调的灾难性遗忘。我们观察到，这些增益需要基于策略的数据。非策略响应减少了收益，而在线偏好优化未能优于离线变体。具体来说，在结构化任务上，我们的方法在EuroLLM-9B的6/7种语言和Aya-3B的4/7种设置中匹配或超过了基础模型。在开放式生成中，两个调优模型在11种评估语言中均优于各自的基础模型。总体而言，我们展示了多语言偏好调优的有前景的方向。

英文摘要

Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models across a total of 14 high and low-resource languages on a diverse set of tasks. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings across most languages, and pairing in either a monolingual or multilingual setting improves over each model on the majority of setups while preventing the catastrophic forgetting of supervised fine-tuning. We observe that the gains require on-policy data. Off-policy responses reduce the benefit and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base in 6/7 languages for EuroLLM-9B and 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective base across 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.

URL PDF HTML ☆

赞 0 踩 0

2605.26286 2026-05-27 cs.MA cs.AI cs.RO 版本更新

Decoupled Delay Compensation: Enhancing Pre-trained MARL Policies via Learned Dynamics Filtering

解耦延迟补偿：通过学习的动力学过滤增强预训练的多智能体强化学习策略

Maxim Mednikov, Oren Gal

发表机构 * University of Haifa（海法大学）

AI总结针对多智能体强化学习在延迟观测和通信延迟下的性能退化问题，提出一种模块化的执行阶段状态估计层，利用学习的门控转移模型和递归卡尔曼滤波从异步测量中估计当前状态，作为预训练策略的即插即用模块，显著提升对通信延迟和丢包的鲁棒性。

Comments 8 pages, 7 figures

详情

AI中文摘要

现实世界中的多智能体强化学习系统通常必须在过时观测、随机通信延迟和间歇性丢包下运行。在理想同步条件下训练的策略在这些场景中常常表现出显著的性能下降，因为它们基于过时的反馈行动。我们提出了一种模块化的执行阶段状态估计层，用当前信念状态估计替换延迟的通信观测。该框架将学习的门控转移模型与递归卡尔曼滤波层相结合，从异步测量中估计瞬时状态。该方法的一个主要优势是其模块性：估计器作为预训练策略的即插即用模块，无需修改原始MARL训练算法、架构或奖励结构。在多种多智能体和连续控制基准上的评估表明，所提出的层持续增强了对通信延迟和消息丢失的鲁棒性。在协调密集和动态不稳定的任务中观察到最显著的性能提升，这些任务中时间一致性对控制至关重要。

英文摘要

Real-world multi-agent reinforcement learning (MARL) systems must often operate under stale observations, stochastic communication delays, and intermittent packet loss. Policies trained under idealized synchronous conditions frequently exhibit significant performance degradation in these regimes because they act on outdated feedback. We propose a modular execution-stage state-estimation layer that replaces delayed communicated observations with current belief-state estimates. The framework integrates a learned Gated transition model with a recursive Kalman filtering layer to estimate instantaneous states from asynchronous measurements. A primary advantage of this approach is its modularity, The estimator serves as a plug-in for pre-trained policies, requiring no modifications to the original MARL training algorithm, architecture, or reward structure. Evaluation across diverse multi-agent and continuous-control benchmarks demonstrates that the proposed layer consistently enhances robustness to communication latency and message loss. The most significant performance gains are observed in coordination-intensive and dynamically unstable tasks where temporal consistency is critical for control.

URL PDF HTML ☆

赞 0 踩 0

2605.26279 2026-05-27 cs.AI cs.CE 版本更新

Constraint acquisition needs better benchmarks

约束获取需要更好的基准测试

Rafał Stachowiak, Tomasz P. Pawlak

AI总结针对约束获取（CA）和数学规划（MP）模型验证与增强研究缺乏合适基准的问题，提出MPMMine基准套件，通过统一结构、开放格式和多样化数据支持算法评估。

Comments 12 pages, 1 figure, for the associated dataset, see https://github.com/MPMMine/MPMMine

详情

AI中文摘要

约束获取（CA）及基于领域知识工件对数学规划（MP）模型进行验证和增强的相关研究，目前因缺乏合适的基准而受到限制。这一缺陷阻碍了可重复性和跨研究可比性，减缓了CA方法的成熟。现有基准是为求解器评估而非CA算法评估而设计的。它们组织松散，对单个问题的处理不一致，并且省略了CA方法所需的领域知识工件。本工作提出了MPMMine，一个旨在评估使用多样化领域知识工件发现、验证和增强MP模型的算法的基准套件。MPMMine以一致性、标准化、完整性、可扩展性、开放性和版本控制为指导。它采用统一结构并依赖开放格式：MiniZinc、CommonMark和JSON。它为每个问题提供多个模型，每个模型提供数十个实例，以及整数和连续域中的数千个解和非解，同时附带自然语言描述以支持文本到模型方法。

英文摘要

Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by inadequate benchmarks. This deficiency impedes reproducibility and cross-study comparability, slowing the maturation of CA methods. Existing benchmarks were designed for solver evaluation rather than for assessing CA algorithms. They are loosely organized, treat individual problems inconsistently, and omit the domain knowledge artifacts required by CA methods. This work presents MPMMine, a benchmark suite designed to assess algorithms that discover, validate, and enhance MP models using diverse domain knowledge artifacts. MPMMine is guided by consistency, standardization, completeness, extensibility, openness, and version control. It adopts a uniform structure and relies on open formats: MiniZinc, CommonMark, and JSON. It provides multiple models per problem, tens of instances per model, and thousands of solutions and non-solutions in both integer and continuous domains, alongside natural-language descriptions to support text-to-model methods.

URL PDF HTML ☆

赞 0 踩 0

2605.26266 2026-05-27 cs.LG cs.AI cs.CV cs.GR eess.IV 版本更新

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

量化键窃取注意力：视频扩散中KV缓存压缩的偏差校正

Tuna Tuncer, Felix Becker, Thomas Pfeil

发表机构 * Technical University of Munich（慕尼黑技术大学）； Tensordyne

AI总结针对视频扩散模型中KV缓存量化导致注意力权重系统性偏差的问题，提出基于Jensen偏差的在线逐注意力分数校正方法，在INT2量化下恢复接近BF16的视频质量，且内存减半。

Comments Variants of this manuscript were accepted to the ICML 2026 workshops SCALE and F2S

详情

AI中文摘要

分块自回归视频扩散模型依赖先前生成块的KV缓存以避免冗余计算，但随着视频变长，该缓存迅速成为内存瓶颈。将KV缓存量化到低位宽的方法减少了内存压力，但降低了视频质量。我们表明，这种降低的一个关键驱动因素是注意力权重的系统性偏差：由于softmax注意力中指数的凸性，量化噪声膨胀了缓存键的贡献，我们称之为Jensen偏差。这种效应导致量化键从非量化的当前块中窃取注意力质量。我们推导出一个逐注意力分数校正，在期望中消除此偏差，该校正根据缓存键的量化步长和查询范数在线计算。使用二阶泰勒近似，额外的计算开销可忽略不计，且除了缓存外无需额外内存。在MAGI-1、SkyReels-V2和HY-WorldPlay上评估INT2量化，我们的校正恢复了因激进量化而损失的大部分质量，达到接近BF16的视频质量，并且在使用50%更少内存的情况下优于INT4量化。

英文摘要

Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.

URL PDF HTML ☆

赞 0 踩 0

2605.26256 2026-05-27 cs.AI 版本更新

Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

个性化具身多模态大语言模型代理在长期用户交互中的应用

Jeongeun Lee, Chanyoung Park, Dongha Lee

发表机构 * Yonsei University（延世大学）； KAIST（韩国科学技术院）

AI总结提出POLAR框架，通过多模态知识图谱记忆机制增强具身代理在长期交互中的个性化能力，显著提升多步推理和用户上下文跟踪性能。

详情

AI中文摘要

基于多模态大语言模型的具身代理在物理环境中解决复杂任务方面展现出强大潜力。然而，个性化辅助不仅需要遵循通用指令或识别物体类别。在现实场景中，目标通常仅通过先前的交互隐式指定，要求代理利用随时间积累的个性化上下文。在这项工作中，我们提出了POLAR，一个用于长期用户交互中个性化具身代理的多模态记忆增强框架。POLAR将先前的交互组织成一个多模态知识图谱，该图谱捕获用于个性化上下文和视觉概念的语义记忆，以及用于代理轨迹等具身经验的 episodic 记忆。为了执行具身任务，POLAR检索相关记忆以解释当前请求并指导任务执行。我们在多个MLLM骨干网络和多样化的评估场景下评估POLAR，以研究记忆在长期个性化中的作用。结果表明，所提出的记忆机制通过更有效地利用先前交互中积累的信息，持续提升性能。当代理需要在多个交互中进行推理、执行多跳推理或随时间跟踪用户特定上下文的更新时，性能提升尤为显著。

英文摘要

Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instruction or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time. In this work, we propose POLAR, a multiomodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi-hop inference, or tracking updates in user-specific context over time.

URL PDF HTML ☆

赞 0 踩 0

2605.26252 2026-05-27 cs.AI cs.DB 版本更新

Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

智能体记忆是数据库吗？重新思考长期AI智能体记忆的数据基础

Abdelghny Orogat, Essam Mansour

发表机构 * Concordia University（康科迪亚大学）

AI总结本文提出将长期AI智能体记忆视为一种新的数据管理工作负载，通过形式化治理演化记忆（GEM）框架，用四个状态级操作替代记录级操作，并论证记录级系统无法满足其正确性条件，最后通过原型MemState验证可行性并指出未来研究方向。

详情

AI中文摘要

工作流闭环并非自动研究系统中的科学闭环

Shuai Wang, Xinyuan Tian, Pangpang Liu, Yize Zhao

发表机构 * Yale University（耶鲁大学）

AI总结本文指出自动研究系统的工作流闭环不等于科学闭环，并提出通过非自主认知控制下的自主执行、避免目标塌陷、验证塌陷和接受塌陷等设计改进方案。

Comments 26 pages, 1 figure, 2 tables

详情

AI中文摘要

本文论证了工作流闭环并非自动研究系统中的科学闭环。当前系统日益能够内部完成类似研究的循环，从想法生成到实验执行、写作和自我评估。这一成就是真实的，但本身并不能使输出结果具有科学地位。我们认为，值得信赖的自动研究不应追求自主自足，而应追求在非自主认知控制下的自主执行。基于对该快速兴起领域100多篇近期论文和代码仓库的调查，以及对21个代表性系统的结构化审计，我们诊断出一个反复出现且结构相连的失败模式：目标塌陷，即单一代理目标取代多目标科学目标；验证塌陷，即内部自我评估取代独立验证；以及接受塌陷，即基准分数或类出版物产物取代领域级批评、重用和整合机制。这些塌陷并非自主性的固有局限，而是可纠正的设计选择。因此，我们概述了在目标信号、验证和输出路径方面的潜在补救措施，以引发社区讨论。

英文摘要

This paper argues that workflow closure is not scientific closure in auto-research systems. Current systems can increasingly complete research-like loops internally, moving from idea generation to experiment execution, writing, and self-evaluation. That achievement is real, but it does not by itself give the resulting outputs scientific standing. We argue that trustworthy auto-research should not aim for autonomous self-sufficiency, but should aim for autonomous execution under non-autonomous epistemic control. Based on a survey of more than 100 recent papers and repositories in this rapidly emerging area, together with a structured audit of 21 representative systems, we diagnose a recurring and structurally connected failure pattern: objective collapse, in which single-proxy targets replace multi-objective scientific aims; validation collapse, in which internal self-evaluation replaces independent validation; and acceptance collapse, in which benchmark scores or publication-shaped artifacts replace mechanisms for domain-level critique, reuse, and integration. These collapses are not inherent limits of autonomy but correctable design choices. Accordingly, we outline potential remedies across objective signal, validation, and output pathway to spark community discussion.

URL PDF HTML ☆

赞 0 踩 0

2605.26192 2026-05-27 cs.LG cs.AI q-bio.BM 版本更新

PitchBench: 测量音频-语言模型中的音高听觉能力

Milan Liessens Dujardin, Song-Ze Yu, Craver Corbyn Thomas-Smith, David M. Chan, Karina Nguyen

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Thoughtful Lab

AI总结提出PitchBench评估套件，通过28个实验系统测量音频-语言模型在绝对和相对音高感知上的表现，发现当前模型在不同声源、音长和格式下音高感知不可靠。

Comments Preprint

详情

AI中文摘要

音频-语言模型（ALMs）越来越多地用于需要理解音乐的实际应用，从音乐辅导和转录到字幕、推荐系统和音乐制作。更广泛地说，它们正在成为多模态AI系统的重要组成部分，这些系统必须从感官输入而非仅文本进行推理。这使得可靠的音乐感知成为关键前提：如果模型无法准确听到声音的结构，就不能信任它来推理、教学、转录或对现实世界中的音频采取行动。然而，现有的基准测试很少评估这种感知背后最基本的音乐能力之一：音高听觉。当前的评估往往通过更高层次的任务间接探测音高听觉，且通常采用多项选择格式，这留下了ALMs在不同乐器、声学条件和响应格式下识别细粒度音高的可靠性问题。我们引入了PitchBench，一个系统测量ALMs音高听觉的评估套件。PitchBench包含28个实验，涵盖序列和和弦中的绝对和相对音高感知，同时变化响度、音符时长、声源、时间拉伸、背景噪声和其他声学条件。任务范围从识别孤立音高到在四声部音乐织体中跟踪旋律线。评估前沿ALMs，我们发现音高听觉仍然非常不可靠：模型在不同设置下表现持续不佳，准确率随声源、音符时长和记谱格式急剧变化。当前的ALMs尚未具备稳定的音高感知，即使对于受控的合成和乐器刺激也是如此。除了基准测试，我们还发布了PitchBench作为Python包，包含评估数据和数据生成工具，以支持未来关于音高感知音频-语言建模的工作。

英文摘要

Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal AI systems that must reason from sensory input rather than text alone. This makes reliable musical perception a critical prerequisite: if a model cannot accurately hear the structure of sound, it cannot be trusted to reason about, teach, transcribe, or act on audio in the real world. Yet existing benchmarks rarely assess one of the most fundamental musical abilities underlying such perception: pitch hearing. Current evaluations tend to probe pitch hearing only indirectly, through higher-level tasks and often in multiple-choice formats, leaving open how reliably ALMs identify fine-grained pitch across instruments, acoustic conditions, and response formats. We introduce PitchBench, an evaluation suite that systematically measures pitch hearing in ALMs. PitchBench comprises 28 experiments spanning absolute and relative pitch perception within sequences and chords, while varying loudness, note duration, sound source, time stretching, background noise, and other acoustic conditions. Tasks range from identifying individual pitches in isolation to tracking a melodic line within a four-part musical texture. Evaluating frontier ALMs, we find that pitch hearing remains highly unreliable: models perform consistently poorly across settings, with accuracy varying sharply by sound source, note duration, and notation format. Current ALMs do not yet possess stable pitch perception, even for controlled synthetic and instrumental stimuli. Alongside the benchmark, we release PitchBench as a Python package containing the evaluation data and data generation tools to support future work on pitch-aware audio-language modeling.

URL PDF HTML ☆

赞 0 踩 0

2605.26175 2026-05-27 cs.LG cs.AI 版本更新

工具模式压缩实现受限上下文预算下的智能体检索增强生成

Furkan Sakizli

发表机构 * Independent Researcher（独立研究者）

AI总结针对智能体RAG系统中工具模式与检索上下文竞争资源的问题，提出工具模式压缩方法，在8K上下文预算下将平均精确匹配率提升20.5个百分点，并验证了压缩模式在超过800个工具时仍可运行。

Comments 12 pages (8 main + 4 appendix), 7 tables, 2 figures. Code and data: https://github.com/SKZL-AI/tscg

详情

DOI: 10.5281/zenodo.20369668

AI中文摘要

配备数十到数百个工具定义的语言模型的智能体RAG系统面临关键资源冲突：工具模式消耗了检索增强生成所需的相同上下文窗口。我们首次系统研究了这种工具-上下文权衡，评估了14个模型（涵盖1.5B-32B本地模型和一个前沿API模型），在三个上下文预算（8K、16K、32K）下使用28个工具定义进行了6,566次受控API调用。应用TSCG保守配置文件压缩（节省44-50%的模式令牌），我们观察到二元启用效应：在8K令牌时，JSON模式工具定义完全溢出上下文窗口，导致接近零的EM（平均2.6%），而压缩模式恢复了RAG功能，所有八个模型平均精确匹配提升20.5个百分点（六个表现出完全启用的模型平均提升24.7个百分点）。在32K（两种格式都适合）时，五个测试模型中的四个显示delta <= 1个百分点，确认该效应纯粹由预算驱动。在HotpotQA（50个多跳问题）上的外部验证显示，在相同溢出场景下EM提升48个百分点。前沿扩展测试表明，JSON模式在大约494个工具时溢出，而压缩模式在超过800个工具时仍可运行。我们的结果确立了工具模式压缩作为受限上下文部署中智能体RAG的必要基础设施层。所有代码、数据和检查点均已公开。

英文摘要

Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval-augmented generation. We present the first systematic study of this tool-context trade-off, evaluating 14 models spanning 1.5B-32B local models plus one frontier API model across 6,566 controlled API calls at three context budgets (8K, 16K, 32K) with 28 tool definitions. Applying TSCG conservative-profile compression (44-50% schema token savings), we observe a binary enablement effect: at 8K tokens, JSON-schema tool definitions overflow the context window entirely, yielding near-zero EM (2.6% average), while compressed schemas restore RAG functionality with +20.5 pp average exact-match lift across all eight models (+24.7 pp among the six exhibiting full enablement). At 32K -- where both formats fit -- four of five tested models show delta <= 1 pp, confirming the effect is purely budget-driven. External validation on HotpotQA (50 multi-hop questions) shows +48 pp EM under the same overflow scenario. Frontier scaling tests demonstrate that JSON schemas overflow at ~494 tools while compressed schemas remain operational beyond 800 tools. Our results establish tool-schema compression as a necessary infrastructure layer for agentic RAG in constrained-context deployments. All code, data, and checkpoints are publicly available.

URL PDF HTML ☆

赞 0 踩 0

2605.26162 2026-05-27 cs.LG cs.AI 版本更新

On the Push-Based Asynchronous Federated Learning: A Bias-Correction Aggregation Approach

基于推送的异步联邦学习：一种偏差校正聚合方法

Jiahui Bai, Hai Dong, A. K. Qin

发表机构 * School of Computer Technologies, RMIT University（RMIT大学计算机技术学院）； School of Science, Computing and Engineering Technologies, Swinburne University of Technology（斯威丁大学科学与工程技术学院）

AI总结提出PushCen-ADFL框架，通过中心表示空间中的平均保持推-求和混合与轻量级中心正则化，解决异步去中心化联邦学习中的通信开销、聚合偏差和模型漂移问题。

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026). This is the extended version with full appendix

详情

DOI: 10.1145/3770855.3817925

AI中文摘要

异步去中心化联邦学习（ADFL）消除了中央协调和全局同步，使其在大规模和异构系统中具有吸引力。然而，频繁的点对点通信、有向拓扑上的异步更新以及非独立同分布数据共同导致了过高的通信开销、有偏聚合和严重的模型漂移。我们提出了PushCen-ADFL，一种通信高效的ADFL框架，能够在非对称通信和延迟客户端参与下实现稳定训练。PushCen-ADFL在共享中心表示空间中耦合了通信、聚合和局部稳定化，形成了压缩与优化之间的闭环。客户端交换中心形式的消息，应用平均保持的推-求和混合来校正聚合偏差，并使用锚定在同一中心空间的轻量级中心正则化来减轻异构性和陈旧性下的漂移。一个有界、发送者去重的缓冲区进一步提高了在异步到达不规则情况下的鲁棒性。在视觉数据集上的实验表明，PushCen-ADFL在数据异构性下将准确率提高了最多6%，同时将每次推送的通信成本降低了80%以上，实现了良好的准确率-通信权衡。

英文摘要

Asynchronous decentralized federated learning (ADFL) eliminates central coordination and global synchronization, making it attractive for large-scale and heterogeneous systems. However, frequent peer-to-peer communication, asynchronous updates on directed topologies, and non-IID data jointly lead to excessive communication overhead, biased aggregation and severe model drift. We propose PushCen-ADFL, a communication-efficient ADFL framework that enables stable training under asymmetric communication and delayed client participation. PushCen-ADFL couples communication, aggregation, and local stabilization in a shared centroid representation space, forming a closed loop between compression and optimization. Clients exchange centroid-form messages, apply average-preserving push-sum mixing to correct aggregation bias, and use a lightweight centroid regularization anchored in the same centroid space to mitigate drift under heterogeneity and staleness. A bounded, sender-deduplicated buffer further improves robustness under irregular asynchronous arrivals. Experiments on vision datasets demonstrate that PushCen-ADFL improves accuracy under data heterogeneity by up to 6\% while reducing per-push communication cost by more than 80\%, achieving a favorable accuracy-communication trade-off.

URL PDF HTML ☆

赞 0 踩 0

2605.26161 2026-05-27 cs.LG cs.AI 版本更新

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

TSFMAudit: 时间序列基础模型中的数据污染审计

Hongkai Li, Shifeng Xie, Lefei Shen, Zhuo Li, Mouxiang Chen, Xiaobin Zhang, Han Fu, Jianling Sun, Xiaoxue Ren, Chenghao Liu

发表机构 * Zhejiang University（浙江大学）； Télécom Paris（巴黎高等电信学院）； State Street Technology (Zhejiang) Ltd.（State Street Technology（浙江）有限公司）； Datadog

AI总结针对时间序列基础模型（TSFMs）预训练数据污染问题，提出基于探针适应动力学的审计方法TSFMAudit，通过检测微调后损失下降更快且骨干网络移动更小的异常现象来识别污染数据集。

Comments 22 pages, 7 figures, 9 tables

详情

AI中文摘要

时间序列基础模型（TSFMs）越来越多地在大型语料库上进行预训练，这引发了评估数据集可能在预训练期间被暴露从而导致过于乐观的性能估计的担忧。在时间序列中审计此类污染具有挑战性，因为信号是连续且异质的，并且通常缺乏语料库文档。据我们所知，这是第一个研究TSFMs预训练污染审计的工作。我们形式化了TSFMs的预训练污染审计问题，并提出了TSFMAudit，一种基于探针适应动力学的方法。我们的关键直觉是，污染表现为异常高效的适应：在微调探针后，受污染的数据集往往表现出更快的损失减少和更小的骨干网络移动。我们在6个TSFMs和187个数据集上评估了TSFMAudit，使用文档化的训练来源证据作为监督，并与从LLM文献中改编的10个竞争基线进行了比较。

英文摘要

Time series foundation models (TSFMs) are increasingly pretrained on large corpora, raising concerns that evaluation datasets may have been exposed during pretraining and thus yield overly optimistic performance estimates. Auditing such contamination is challenging in time series because signals are continuous and heterogeneous, and often lack corpus documentation. To the best of our knowledge, this is the first work to study pretraining contamination auditing for TSFMs. We formalize the problem of pretraining contamination auditing for TSFMs and propose TSFMAudit, a method based on probe adaptation dynamics. Our key intuition is that contamination manifests as unusually efficient adaptation: after a fine tuning probe, contaminated datasets tend to exhibit faster loss reduction with smaller backbone movement. We evaluate TSFMAudit on 6 TSFMs and 187 datasets using documented training source evidence as supervision, and compare against 10 competitive baselines adapted from the LLM literature.

URL PDF HTML ☆

赞 0 踩 0

2605.26158 2026-05-27 cs.CR cs.AI cs.LG 版本更新

Furina: Fragmented Uncertainty-Driven Refusal Instability Attack

Furina: 碎片化不确定性驱动的拒绝不稳定攻击

Tongxi Wu, Jian Zhang, Yang Gao

发表机构 * School of Intelligence Science and Technology（智能科学与技术学院）； State Key Laboratory for Novel Software Technology（新型软件技术国家重点实验室）； Nanjing University（南京大学）

AI总结通过揭示大语言模型安全行为存在不稳定区域，提出多指标诊断框架并开发Furina攻击方法，利用碎片化场景提示诱导不确定性放大，实现高效越狱。

Comments This work is accepted as a regular paper at ICML 2026

详情

AI中文摘要

大语言模型和多模态大语言模型的安全对齐通常被认为是一种近二值阈值机制。我们通过揭示安全行为受不稳定区域支配来挑战这一假设，在该区域中，小的扰动会引发随机的拒绝决策而非确定性结果。我们开发了一个结合外部和内部信号的多指标诊断框架来表征这种不稳定性。通过系统实验，我们识别出一个特征性的诊断标志：处于不稳定区域的输入表现出更高的输出不确定性，同时内部安全激活降低，这种解耦现象解释了为什么基于检测的防御无法抵御复杂攻击。基于此框架，我们提出了Furina，一种越狱攻击，它通过碎片化、场景锚定的提示故意诱导这种特征，无需针对模型的优化。Furina在HarmBench上优于强单轮和多轮基线，并在MM-SafetyBench上取得了有竞争力的结果，表明不确定性放大为理解安全漏洞提供了一种有原则且可迁移的机制。代码见：https://github.com/0xCavaliers/Furina_Jailbreak。

英文摘要

Safety alignment in large language models (LLMs) and multimodal large language models (MLLMs) is commonly assumed to operate as a near-binary threshold mechanism. We challenge this assumption by revealing that safety behavior is governed by an instability region where small perturbations induce stochastic refusal decisions rather than deterministic outcomes. We develop a multi-metric diagnostic framework combining external and internal signals to characterize this instability. Through systematic experiments, we identify a characteristic diagnostic signature: inputs in unstable regimes exhibit elevated output uncertainty yet decreased internal safety activation, a decoupling phenomenon that explains why detection-based defenses fail against sophisticated attacks. Building on this framework, we introduce Furina, a jailbreak attack that deliberately induces this signature through fragmented, scene-anchored prompts without model-specific optimization. Furina outperforms strong single-turn and multi-turn baselines on HarmBench and achieves competitive results on MM-SafetyBench, demonstrating that uncertainty amplification provides a principled and transferable mechanism for understanding safety vulnerabilities. Code is available at: https://github.com/0xCavaliers/Furina_Jailbreak.

URL PDF HTML ☆

赞 0 踩 0

2605.26155 2026-05-27 cs.RO cs.AI cs.LG 版本更新

When Does Adaptive Guidance Help? Belief-Aware Privileged Distillation for Autonomous Driving Under Partial Observability

自适应引导何时有帮助？部分可观测条件下自动驾驶的信念感知特权蒸馏

Mehmet Haklidir

发表机构 * TUBITAK BILGEM Artificial Intelligence Institute（土耳其TUBITAK BILGEM人工智能研究所）

AI总结本文提出信念感知GSAC（BA-GSAC），通过集成分歧动态调节蒸馏系数，系统研究自适应引导在部分可观测自动驾驶中的有效性，发现严重遮挡下系数过早崩溃，并揭示可观测性盲区问题。

Comments 9 pages, 3 figures, 7 tables. Accepted at CVPR 2026 Workshop on Autonomous Driving (WAD)

详情

AI中文摘要

引导软演员-评论家（GSAC）将来自特权全状态教师的知识蒸馏给部分可观测的学生，用于自动驾驶，但使用固定的蒸馏系数λ，而不考虑智能体的不确定性。我们提出信念感知GSAC（BA-GSAC），通过集成分歧调节λ，并将其作为系统实证研究的测试平台，探究：自适应引导何时真正有帮助？在Highway-Env上评估五种策略（固定λ∈{0.01, 0.1}、自适应、线性衰减和普通SAC）在三个POMDP难度级别下，我们发现初步的单种子运行表明在轻度和中度部分可观测性下有收益，但在严重遮挡下（所有方法使用3个种子评估），自适应系数在大约3K步内坍缩到λ_min。我们将其归因于可观测性盲区现象：由于集成预测部分观测，即使在严重遮挡下也能达到低分歧，建模了可见部分但无法检测缺失部分。我们诊断了根本原因并提出了架构修复（使用引导演员的特权访问在完整状态预测上训练集成）；虽然此处未验证，但我们表明即使存在当前限制，预热阶段也提供了可测量的稳定性（CV=13.3% vs. 常数λ=0.01的29.8%）。实际上，简单的确定性线性衰减计划在所有指标上实现了最佳的严重POMDP性能（均值116.5，CV=8.9%），表明稳定性收益来自调度效应而非集成。这些发现为设计不确定性感知的师生框架提供了实用指导，并强调了集成预测目标是一个重要的设计选择。

英文摘要

Guided Soft Actor-Critic (GSAC) distills knowledge from a privileged full-state teacher to a partial-observation student for autonomous driving, but uses a fixed distillation coefficient lambda regardless of the agent's uncertainty. We present Belief-Aware GSAC (BA-GSAC), which modulates lambda via ensemble disagreement, and use it as a testbed for a systematic empirical study asking: when does adaptive guidance actually help? Evaluating five strategies (fixed lambda in {0.01, 0.1}, adaptive, linear decay, and vanilla SAC) across three POMDP difficulty levels on Highway-Env, we find that preliminary single-seed runs suggest benefits under mild and moderate partial observability, but under severe occlusion (evaluated with 3 seeds for all methods) the adaptive coefficient collapses to lambda_min within about 3K steps. We trace this to an observability blindness phenomenon: because the ensemble predicts partial observations, it achieves low disagreement even under heavy occlusion, modeling what is visible but unable to detect what is missing. We diagnose the root cause and propose an architectural fix (training the ensemble on full-state predictions using the guiding actor's privileged access); while not validated here, we show that even with current limitations, the warmup phase provides measurable stabilization (CV=13.3% vs. 29.8% for constant lambda=0.01). In fact, a simple deterministic linear decay schedule achieves the best severe-POMDP performance across all metrics (mean 116.5, CV=8.9%), suggesting that the scheduling effect, not the ensemble, drives the stability benefit. These findings provide practical guidance for designing uncertainty-aware teacher-student frameworks and highlight ensemble prediction targets as an important design choice.

URL PDF HTML ☆

赞 0 踩 0

2605.26154 2026-05-27 cs.CR cs.AI 版本更新

MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning

MemMorph：通过记忆投毒实现LLM代理中的工具劫持

Xuanye Zhang, Yongsen Zheng, Zhuqin Xu, Kaiyu Zhou, Bowen Shen, Haoran Ou, Tianwei Zhang, Kwok-Yan Lam

发表机构 * Nanyang Technological University, Singapore（南洋理工大学，新加坡）

AI总结提出MemMorph攻击，通过向长期记忆注入少量伪装记录，诱导LLM代理自主选择攻击者偏好的工具，在多个基准测试中实现高达85.9%的攻击成功率。

Comments Preprint. Under review

详情

AI中文摘要

LLM驱动的代理能够选择外部工具来完成用户任务。然而，攻击者可能破坏这一过程，引导代理使用不当/错误的工具并实施恶意行为。现有攻击主要操纵工具元数据，这容易被审计检测，并且随着现代代理越来越多地采用记忆模块通过积累经验来优化工具选择策略，这些攻击可能失效。本文提出MemMorph，这是首次通过投毒代理的长期记忆来偏置工具选择的攻击。MemMorph不直接指定工具调用决策，而是注入少量伪装成技术事实、事件报告和操作策略的精心构造记录。这些投毒记录重塑了代理的上下文感知和决策过程，使其自主推断并选择攻击者偏好的工具。在3个基准测试、10个代理骨干和3个记忆模块实现上的实验表明，MemMorph仅需注入3条记录即可达到高达85.9%的攻击成功率，在3种代表性防御下仍保持效力，比最强基线高出25%。我们的发现揭示了长期记忆是工具增强代理中一个关键且被忽视的攻击面，敦促开发记忆层面的完整性保障。

英文摘要

LLM-driven agents are capable of selecting external tools to complete users' tasks. However, attackers could compromise such process, steering agents toward inappropriate/wrong tools and enabling malicious actions. Most existing attacks primarily manipulate the tool metadata, which is easily detectable by auditing and may lose effectiveness as modern agents increasingly adopt memory modules to refine tool selection policies through accumulated experience. This paper proposes MemMorph, the first attack that bias tool selection by poisoning the agent's long-term memory. Rather than explicitly dictating the tool invocation decision, MemMorph injects a small number of crafted records that are disguised as technical facts, incident reports, and operational policies. These poisoned records reshape the agent's contextual perception and decision-making process, leading it to autonomously infer and select the tool preferred by the attacker. Experiments across 3 benchmarks, 10 agent backbones, and 3 memory-module implementations show that MemMorph achieves up to 85.9% attack success rate with only three injected records, outperforming the strongest baseline by up to 25% while retaining potency under 3 representative defenses. Our findings expose long-term memory as a critical and under-explored attack surface in tool-augmented agents, urging the development of memory-level integrity safeguards.

URL PDF HTML ☆

赞 0 踩 0

2605.26146 2026-05-27 cs.SE cs.AI cs.HC 版本更新

Augment Engineering: A Methodology for Multi-Tool AI Orchestration Across Professional Domains

增强工程：跨专业领域的多工具AI编排方法论

Elias Calboreanu

发表机构 * Swift North AI Lab, The Swift Group, LLC（Swift North AI实验室，Swift集团有限公司）

AI总结提出增强工程学科，通过提示工程和上下文工程的可移植技能，跨领域编排多个专用AI工具，并基于单实践者案例研究验证了方法有效性。

Comments 60 pages, 5 figures, 7 tables. Companion to arXiv:2604.04258 (Context Engineering). Formatted for the Journal of Systems and Software (In Practice track)

详情

AI中文摘要

组织越来越多地在专业领域部署独立的专用AI工具，通常为每个工具雇佣领域专家，这重现了AI本应转变的人员配置模式。然而，使这些工具有效的元技能——提示工程（交互级优化）和上下文工程（结构化输入流水线设计）——是可跨领域移植的：掌握这些技能的实践者可以将其应用于任何领域的任何专用AI工具。本文将增强工程定义为跨不同专业领域编排多个专用AI工具的学科，应用提示工程和上下文工程作为可跨工具边界转移的可移植能力。我们提出一个六阶段编排方法论和四个可移植性指标。一个为期5个月的形成性案例研究（2025年11月至2026年3月）记录了一位实践者将这些技能应用于跨越七个专业领域的十个组件编排栈，产出了传统上需要不同领域专家才能完成的工作产品。两个定量观察与框架预测一致：Cochran-Armitage趋势检验（n=200次交互，跨两个聊天LLM，p<0.01）显示首次接受率随提示复杂度水平上升；Wright定律拟合（n=82个工件，p<0.01）显示工件组合的生产加速。由于所有观察来自单一位实践者，推断统计是探索性和假设生成的，而非确认性的；整个组合的可移植性有待多实践者复制。增强工程完成了三个学科的演进：提示工程（一个工具）、上下文工程（可复现流水线）、增强工程（跨领域工具组合）。

英文摘要

Organizations increasingly deploy separate purpose-built AI tools across professional domains, often hiring domain specialists for each, recreating the staffing models AI was expected to transform. Yet the meta-skills that make these tools effective, prompt engineering (interaction-level optimization) and context engineering (structured input pipeline design), are domain-portable: a practitioner who masters them can apply them to any purpose-built AI tool in any domain. This paper defines Augment Engineering as the discipline of orchestrating multiple purpose-built AI tools across distinct professional domains, applying prompt and context engineering as portable competencies that transfer across tool boundaries. We present a six-phase orchestration methodology and four portability metrics. A 5-month formative case study (November 2025 to March 2026) documents a single practitioner applying these skills across a ten-component orchestration stack spanning seven professional domains, producing work products that would traditionally involve separate domain specialists. Two quantitative observations are consistent with the framework's predictions: a Cochran-Armitage trend test (n = 200 interactions across two chat LLMs, p < 0.01) shows first-pass acceptance rising with prompt-sophistication level, and a Wright's Law fit (n = 82 artifacts, p < 0.01) shows production acceleration across the artifact portfolio. Because all observations come from a single practitioner, the inferential statistics are exploratory and hypothesis-generating rather than confirmatory; portability across the full portfolio awaits multi-practitioner replication. Augment Engineering completes a three-discipline progression: Prompt Engineering (one tool), Context Engineering (reproducible pipelines), Augment Engineering (a portfolio of tools across domains).

URL PDF HTML ☆

赞 0 踩 0

2605.26137 2026-05-27 cs.GR cs.AI cs.CV 版本更新

AssetGen: Deployable 3D Asset Generation at Interactive Speed

AssetGen: 可部署的交互速度3D资产生成

Dilin Wang, Xiaoyu Xiang, Kihyuk Sohn, Tom Monnier, Yu-Ying Yeh, Thu Nguyen-Phuoc, Jiawen Zhang, Yuchen Fan, Antoine Toisoul, Hyunyoung Jung, Prithviraj Dhar, Michael Bunnell, Nikolaos Sarafianos, Chuhang Zou, Roman Shapovalov, Andrea Vedaldi, Rakesh Ranjan

发表机构 * Reality Labs, Meta（Meta现实实验室）

AI总结提出AssetGen系统，通过粗到细的VecSet框架、多视图纹理生成及端到端加速，在30秒内生成带烘焙法线、颜色纹理和可控多边形预算的高质量网格，支持实时渲染和移动端部署。

详情

AI中文摘要

尽管3D生成技术正在快速发展，但近期工作通常侧重于获取高分辨率资产，而将用户体验和可部署性视为事后考虑。我们提出AssetGen，一个专注于这两个方面的3D生成器。给定一张参考图像，它在30秒内生成一个高质量网格，带有烘焙法线、颜色纹理和可控多边形预算，适用于实时渲染，包括移动端用例。AssetGen Flash变体进一步将延迟降低到14秒，适用于交互式和代理式创作循环。我们的模型使用粗到细的VecSet框架生成物体几何，该框架在GPU上实现网格简化、清理和法线烘焙，以及快速并行UV展开。然后以多视图方式生成纹理，随后进行反投影和3D修复。模型蒸馏、内核优化和流水线并行化被协同设计以加速整个系统。我们引入了大量自动化和盲人机评估，并在30秒内展示了与领先商业解决方案相当的视觉质量，在不到15秒内展示了预览质量的结果。最终结果是一个支持AI辅助、可部署的3D内容创建的系统，适用于交互式工作流。

英文摘要

While 3D generation is progressing rapidly, recent work has often focused on obtaining high-resolution assets, leaving user experience and deployability as afterthoughts. We present AssetGen, a 3D generator that focuses instead on these two aspects. Given one reference image, in 30 seconds it produces a high-quality mesh with baked normals, a color texture, and a controlled polygon budget suitable for real-time rendering, including mobile use cases. The AssetGen Flash variant further reduces latency to 14 seconds for interactive and agentic creation loops. Our model generates the object geometry with a coarse-to-refine VecSet framework, which implements mesh simplification, cleaning, and normal baking on the GPU, and a fast parallel UV unwrapping. It then generates textures in a multi-view fashion, followed by backprojection and 3D inpainting. Model distillation, kernel optimization, and pipeline parallelization are co-designed to accelerate the system end-to-end. We introduce numerous automated and blind human evaluations and demonstrate competitive visual quality against leading commercial solutions in 30 seconds and preview-quality results in less than 15 seconds. The final result is a system that supports AI-assisted, deployable 3D content creation in interactive workflows.

URL PDF HTML ☆

赞 0 踩 0

2605.26136 2026-05-27 cs.SD cs.AI 版本更新

Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception

侵蚀对真实语音的信任：人类音频深度伪造感知的大规模研究

Nicolas M. Müller, Wei Herng Choong

发表机构 * Fraunhofer AISEC（弗劳恩霍夫人工智能安全研究中心）

AI总结通过大规模听辨实验（1768名参与者，35532次判断），发现音频深度伪造导致人类对真实语音的信任下降（准确率从72.7%降至64.1%），而非检测伪造能力下降。

详情

AI中文摘要

音频深度伪造近期发展迅速，但其对人类信任真实语音的影响尚未被研究。我们进行了迄今为止最大规模的音频深度伪造感知听辨研究，收集了来自1768名参与者对138个文本转语音和语音转换系统的35532次判断。我们的核心发现是怀疑偏移：与2021年的基线相比，人类对伪造样本的准确率几乎没有变化（72.9%降至71.2%），但对真实样本的准确率从72.7%降至64.1%。参与者并非更难以检测合成伪影，而是越来越不信任真实的语音。由商业和自回归语言模型系统生成的样本最难检测（61.3-65.9%），而传统seq2seq和流匹配模型生成的样本仍然较易识别（75.4-76.8%）。作为参考的机器学习检测器在所有条件下保持超过94.5%的准确率。我们的结果表明，现代深度伪造的主要威胁可能不仅仅是欺骗，而是对真实语音信任的侵蚀。

英文摘要

Audio deepfakes have improved rapidly recently, yet their effect on human trust in real speech remains unstudied. We present the largest listening study on audio deepfake perception to date, collecting 35,532 judgments from 1,768 participants across 138 text-to-speech and voice conversion systems. Our central finding is a skepticism shift: compared to a 2021 baseline, human accuracy on fake samples barely changed (72.9% to 71.2%), but accuracy on real samples dropped from 72.7% to 64.1%. Participants are not worse at detecting synthesis artifacts; rather, they increasingly distrust authentic speech. Samples generated by commercial and autoregressive language model systems proved hardest to detect (61.3 - 65.9%), while those from traditional seq2seq and flow-matching models remain easier to spot (75.4 - 76.8%). An ML detector that served as a reference point maintained over 94.5% accuracy across all conditions. Our results suggest that the primary threat posed by modern deepfakes may not be mere deception, but the erosion of trust in genuine audio.

URL PDF HTML ☆

赞 0 踩 0

2605.26133 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

大型语言模型中的预训练数据暴露：成员推断、数据污染及安全影响综述

Ziyi Tong, Feifei Sun, Le Minh Nguyen

发表机构 * Japan Advanced Institute of Science and Technology（日本先进科学研究院）

AI总结本文首次统一综述了大型语言模型中的预训练数据暴露问题，涵盖成员推断和数据污染，形式化定义了暴露级别，回顾了攻击与防御方法，并总结了实证发现及未来研究方向。

Comments accepted by NLDB 2025

详情

DOI: 10.1007/978-3-031-97144-0_14

AI中文摘要

大型语言模型（LLMs）已成为NLP中的主导范式，推动了研究和工业的发展。随着模型规模和预训练数据的增长，由于训练数据集的规模和不可见性，对预训练数据暴露（PDE）的担忧也在增加。PDE指的是确定特定数据是否出现在LLM的预训练语料库中。它对于确保评估完整性和保护隐私至关重要，涉及两个关键领域：数据污染和成员推断。尽管概念上相关，但这些领域通常被孤立研究。本文首次在PDE框架下对两者进行了统一综述。我们形式化了跨暴露级别的PDE，回顾了攻击和防御方法，综合了实证发现，并强调了开放的挑战和未来的研究方向。

英文摘要

Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets. PDE refers to determining whether specific data appeared in an LLM's pretraining corpus. It is critical for ensuring evaluation integrity and protecting privacy, intersecting two key areas: data contamination and membership inference. Though conceptually related, these areas have often been studied in isolation. This paper offers the first unified survey of both under the PDE framework. We formalize PDE across exposure levels, review attack and defense methods, synthesize empirical findings, and highlight open challenges and future research directions.

URL PDF HTML ☆

赞 0 踩 0

2605.26119 2026-05-27 cs.DC cs.AI 版本更新

Edge AI Deployment Beyond Models: A BSP-Aware Systems Framework for Industrial Embedded Platforms

超越模型的边缘AI部署：面向工业嵌入式平台的BSP感知系统框架

Pitchai Muthu M

发表机构 * Advantech Industrial Computing India Pvt. Ltd.（Advantech印度工业计算有限公司）

AI总结提出一个五层BSP感知系统框架，将边缘AI部署视为系统工程问题，解决工业嵌入式平台中模型与硬件、BSP、运行时等环节的集成挑战，提升可复现性、可诊断性、持续吞吐量和现场可靠性。

Comments 17 pages, 5 figures, industrial white paper

详情

AI中文摘要

工业边缘AI项目通常从模型开始，之后才面对平台。这种顺序具有吸引力，因为它允许早期演示，但当部署目标是具有长产品生命周期、供应商特定内核、异构加速器、安全约束和非平凡I/O路径的嵌入式系统时，这种方法就会失效。在这种环境中，模型只是从传感器开始、经过板级支持包（BSP）、最终进入生产服务循环的更大执行链中的一个组成部分。本文认为，稳健的边缘AI部署必须被视为一个系统问题，而不是一个后期应用打包练习。本文提出了一个面向工业嵌入式平台的BSP感知框架，围绕五个层次组织：硬件、BSP/操作系统适配、运行时与加速、应用/推理、以及运维/验证。讨论基于Android、NXP i.MX、NVIDIA Jetson、ONNX Runtime和TensorRT的供应商架构文档，以及关于嵌入式AI基准测试、设备不稳定性和异构边缘机群的系统文献。结果是一个实用框架，将底层平台工作与可衡量的部署成果（如可复现性、可诊断性、持续吞吐量和现场可靠性）联系起来。

英文摘要

Industrial Edge AI programs often begin with the model and only later confront the platform. That sequencing is attractive because it allows early demonstrations, but it breaks down when the deployment target is an embedded system with long product lifecycles, vendor-specific kernels, heterogeneous accelerators, safety constraints, and nontrivial I/O paths. In that environment, a model is only one component of a larger execution chain that begins at the sensor, traverses the board support package (BSP), and ends in a production service loop. This paper argues that robust Edge AI deployment must be treated as a systems problem rather than a late-stage application packaging exercise. The paper presents a BSP-aware framework for industrial embedded platforms organized around five layers: hardware, BSP/operating-system adaptation, runtime and acceleration, application/inference, and operations/validation. The discussion is grounded in vendor architecture documentation for Android, NXP i.MX, NVIDIA Jetson, ONNX Runtime, and TensorRT, and in systems literature on embedded AI benchmarking, device instability, and heterogeneous edge fleets. The result is a practical framework that connects low-level platform work to measurable deployment outcomes such as reproducibility, diagnosability, sustained throughput, and field reliability.

URL PDF HTML ☆

赞 0 踩 0

2605.26118 2026-05-27 cs.DC cs.AI 版本更新

Xe-Forge: Multi-Stage LLM-Powered Kernel Optimization for Intel GPU

Xe-Forge：面向Intel GPU的多阶段LLM驱动的内核优化

Marcin Spoczynski, Daniel Fleischer, Moshe Berchansky, Gabriela Ben-Melech Stan, Shira Guskin, Weilin Xu, Adam Siemieniuk, Alexander Heinecke

发表机构 * Intel Corporation（英特尔公司）

AI总结提出Xe-Forge，一个多阶段LLM流水线，通过Chain-of-Verification-and-Refinement（CoVeR）代理和硬件验证，自动将Triton内核优化为Intel GPU，实现几何平均1.17倍加速，Flash Attention加速2-13.3倍。

详情

AI中文摘要

将深度学习算法移植到新的硬件加速器上，要求开发人员对其代码库中的每个Triton内核重复应用相同的底层优化——量化、内存访问合并、分块大小调整以及特定架构的变通方法。这种手动、重复的工作是一个主要瓶颈：每个内核都需要针对不同设备间变化的硬件约束进行相同的试错分析，而底层的优化模式却基本一致。我们提出了Xe-Forge，一个多阶段LLM驱动的流水线，为Intel GPU自动化这一过程。给定一个功能正确的Triton内核，该系统应用多达九个优化阶段——从算法重构和算子融合，到块指针现代化、GPU特定调优和开放式探索——每个阶段由一个Chain-of-Verification-and-Refinement（CoVeR）代理驱动，该代理生成候选方案，在真实硬件上验证，并对失败进行迭代。一个精心策划的知识库编码了Intel GPU约束（2的幂次线程束计数、GRF模式、SLM大小），这些约束在LLM训练数据中缺失，使模型保持在架构有效范围内。我们在97个Level-2 KernelBench内核和Intel Arc Pro B70上的Flash Attention上评估了Xe-Forge，实现了相对于PyTorch eager的几何平均1.17倍加速，67%的内核得到改进，九个内核超过5倍（最高82倍），并且在所有测试配置下Flash Attention加速2-13.3倍且无回归——这表明结构化领域知识与硬件在环验证可以系统地消除当前阻碍算法在新加速器上部署的重复移植工作。

英文摘要

Porting deep learning algorithms to new hardware accelerators requires developers to repeatedly apply the same low-level optimizations -- quantization, memory access coalescing, tile size tuning, and architecture-specific workarounds -- to every Triton kernel in their code-base. This manual, repetitive effort is a major bottleneck: each kernel demands the same cycle of trial-and-error profiling against hardware constraints that vary across devices, yet the underlying optimization patterns remain largely consistent. We present Xe-Forge, a multi-stage LLM-powered pipeline that automates this process for Intel GPU. Given a functionally correct Triton kernel, the system applies up to nine optimization stages -- from algorithmic restructuring and operator fusion through block pointer modernization, GPU-specific tuning, and open-ended discovery -- each driven by a Chain-of-Verification-and-Refinement (CoVeR) agent that generates candidates, validates them on real hardware, and iterates on failures. A curated knowledge base encodes Intel GPU constraints (power-of-two warp counts, GRF modes, SLM sizing) that are absent from LLM training data, keeping the model within architecturally valid bounds. We evaluate Xe-Forge on 97 Level-2 KernelBench kernels and Flash Attention on the Intel Arc Pro B70, achieving a 1.17x geometric mean speedup over PyTorch eager with 67% of kernels improving, nine kernels exceeding 5x (up to 82x), and 2--13.3x speedups on Flash Attention across all tested configurations without regression -- demonstrating that structured domain knowledge with hardware-in-the-loop verification can systematically eliminate the repetitive porting effort that currently gates algorithm deployment on new accelerators.

URL PDF HTML ☆

赞 0 踩 0

2605.26047 2026-05-27 cs.AI 版本更新

Retrying vs Resampling in AI Control

AI控制中的重试与重采样

James Lucassen, Adam Kaufman

发表机构 * Redwood Research

AI总结研究AI编码框架中重试与重采样的安全性，发现重试会泄露信息导致攻击更隐蔽，而重采样通过最大怀疑分数审计可提升安全性。

详情

AI中文摘要

像Claude Code和Codex这样的AI编码脚手架使用重试：阻止标记为有风险的动作并继续轨迹。我们从AI控制的角度研究重试，该角度将模型视为潜在对抗性的。我们发现，虽然重试降低了诚实的怀疑分数，但不可信模型可以利用监控器理由构建更隐蔽的攻击，抵消了安全增益。我们还研究了重采样：从相同上下文中抽取多个样本，这不会泄露可利用的信息。我们解开了先前关于重采样的工作所捆绑的设计选择。在BashArena中，使用Claude Opus 4.6作为不可信模型，MiMo-V2-Flash作为可信监控器，每步抽取五个样本并在最大怀疑分数上审计，在0.3%的审计预算下，安全性从61%提高到71%，且不影响有用性。仅对第一次抽取中看起来可疑的步骤进行选择性重采样，恢复了6.2个百分点的增益，同时仅多抽取了10%的样本。我们在此设置中的两个发现与先前关于重采样的工作相矛盾。第一个是，基于重采样怀疑分数最大值进行审计优于使用最小值，这与Ctrl-Z的发现相反。第二个是，执行最不可疑的样本（这是早期“推迟到重采样”协议的核心机制）在我们的设置中仅带来很小的实证安全增益（+3.9个百分点，置信区间包含零）。

英文摘要

AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).

URL PDF HTML ☆

赞 0 踩 0

2605.25861 2026-05-27 cs.CV cs.AI 版本更新

MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images

MuNet: 一种用于从单张图像联合进行3D人体网格恢复和3D穿衣人体重建的互惠网络

Yunqi Gao, Leyuan Liu, Yuhan Li, Changxin Gao, Jingying Chen

发表机构 * National Engineering Research Center for E-Learning（教育信息化国家级工程研究中心）； National Engineering Research Center of Educational Big Data（教育大数据国家级工程研究中心）； School of Electronic Information and Communications（电子信息与通讯学院）； School of Artificial Intelligence and Automation（人工智能与自动化学院）

AI总结提出MuNet，一种互惠网络，通过统一表示和互惠机制联合优化3D人体网格恢复与穿衣人体重建，在六个基准数据集上达到最先进性能。

详情

AI中文摘要

3D人体网格恢复和3D穿衣人体重建本质相关，但长期以来被孤立研究，忽视了联合优化的潜在收益。为克服这一局限，我们提出在一个统一框架中处理这两个任务，从而有效利用它们的相互依赖关系。基于这一思想，我们提出MuNet，一种用于从单张图像联合进行3D人体网格恢复和3D穿衣人体重建的互惠网络。首先，我们采用2-流形图作为所有3D模型的统一表示，从而在3D人体网格恢复和穿衣人体重建之间实现一致建模。其次，我们设计了一个端到端的图卷积网络，逐步将初始图变形为3D人体网格，并将其细化成详细的3D穿衣人体模型。第三，我们引入一种互惠机制，允许两个任务在训练期间进行相互交互，其中3D人体网格恢复为3D穿衣人体重建提供指导，而重建反馈则细化3D人体网格恢复。我们在六个基准数据集上广泛评估了MuNet，包括Human3.6M、3DPW、MPI-INF-3DHP、THuman2.0、CAPE和RenderPeople。实验结果表明，MuNet在所有数据集上的两个任务均达到了最先进的性能。MuNet的代码已在https://github.com/starVisionTeam/MuNet上发布，供研究使用。

英文摘要

3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have long been studied in isolation, thereby overlooking the potential gains of joint optimization. To overcome this limitation, we propose to address these two tasks within a unified framework, which allows their mutual dependencies to be effectively exploited. Building on this idea, we propose MuNet, a mutualistic network for joint 3D human mesh recovery and 3D clothed human reconstruction from single images. First, we adopt 2-manifold graphs as a unified representation for all 3D models, enabling consistent modeling across 3D human mesh recovery and clothed human reconstruction. Second, we design an end-to-end graph convolutional network that progressively deforms an initial graph into a 3D human mesh and refines it into a detailed 3D clothed human model. Third, we introduce a mutualistic mechanism that allows reciprocal interaction between the two tasks {during training}, where 3D human mesh recovery provides guidance for 3D clothed human reconstruction, and reconstruction feedback refines the 3D human mesh recovery. We extensively evaluate MuNet on six benchmark datasets for 3D human mesh recovery and 3D clothed human reconstruction, including Human3.6M, 3DPW, MPI-INF-3DHP, THuman2.0, CAPE, and RenderPeople. Experimental results demonstrate that MuNet achieves state-of-the-art performance on both tasks across all datasets. The code of MuNet is released for research purposes at https://github.com/starVisionTeam/MuNet.

URL PDF HTML ☆

赞 0 踩 0

2605.24785 2026-05-27 cs.AI 版本更新

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

PANDO: 通过在线技能蒸馏实现高效多模态AI智能体

Yubo Li, Yidi Miao, Yuntian Shen, Yuxin Liu

AI总结提出PANDO框架，通过在线技能蒸馏、结构化技能库和缓存感知提示，在VisualWebArena任务中以更低token消耗实现更高成功率。

详情

AI中文摘要

近期多模态网络智能体的进展通常依赖于增加推理时的计算量，包括展开搜索、验证器传递、离线技能发现和专家模型堆叠。这引发了一个核心问题：网络智能体能否随着经验积累变得更高效，而不是更昂贵？我们首先分析VisualWebArena的轨迹，识别出三个反复出现的低效来源：重复动作循环、隐藏发现成本和低提示缓存复用。然后，我们引入PANDO，一个单次展开的在线技能蒸馏框架，它维护一个结构化的技能库，并结合进度反思、基于置信度的技能降级、层次化路由、视觉压缩和缓存感知提示。在全部910个VisualWebArena任务上，PANDO实现了58.3%的成功率，优于SGV（54.0%）和我们的WALT复现（45.2%），同时比SGV少使用58%的token，比WALT少使用61%的token，且无需任何预评估发现预算。一个300任务的消融实验进一步表明，规则和例程提供了大部分成功增益，而路由、压缩和缓存感知提示将更大的技能库转化为更低的边际token成本。最后，我们引入三个轨迹级效率指标——动作重复率、步骤开销比和提示缓存利用率——以使效率在终端成功之外可见。

英文摘要

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.

URL PDF HTML ☆

赞 0 踩 0

2605.24383 2026-05-27 cs.AI cs.CY cs.SE 版本更新

A governance horizon for ethical-use constraints in open-weight AI models

开放权重AI模型中伦理使用约束的治理视野

Weiwei Xu, Hengzhi Ye, Haoran Ye, Kai Gao, Vladimir Filkov, Minghui Zhou

发表机构 * School of Computer Science（计算机科学学院）； Ministry of Education（教育部）； Laboratory of High Confidence Software Technologies（高可信软件技术实验室）； University of Science and Technology Beijing（北京科技大学）； University of California, Davis（加州大学戴维斯分校）

AI总结通过审计Hugging Face Hub上的模型仓库，发现基于披露的治理在开放权重AI中具有浅层结构性限制，提出治理视野概念并比较不同政策设计的效果。

详情

AI中文摘要

对开放权重AI模型的伦理约束既反映了社会关切，也是AI治理政策的基础。这些约束预计会传播到下游衍生品，同时作为自愿元数据披露实施，必须在每一代重用中重新声明。我们审计了Hugging Face Hub上的2,142,823个模型仓库，以测试这种基于披露的治理基础设施能否在深层模型谱系中维持可追溯性。限制证据以1.31个衍生步骤的半衰期衰减（$R^2$=0.98），超过七代下游后，至少80%的后代模型缺乏足够的公开证据进行治理判定，我们将这一深度边界形式化为治理视野。恢复缺失许可元数据的平台级干预表明，政策设计（而非仅执法）是约束因素：仅继承设计需要近乎完全的执法才能移动视野，而明确解决孤儿谱系组件的强制声明设计即使在中等执法水平下也能移动视野。结构性瓶颈在于没有可继承上游意图的谱系：此类孤儿组件在任何仅继承政策下都无法判定，无论执法率如何，未解决的上游节点还会造成直接的下游不可判定性瓶颈，仅靠继承规则无法恢复。与PyPI的比较（其中治理信号由显式机器可读声明携带）证实，这种崩溃是开放权重衍生特有的拓扑结构问题，而非开放生态系统固有的。这些结果表明，基于披露的治理在开放权重AI中具有浅层、结构决定的范围，实现深层供应链问责需要治理信号通过衍生本身传播的溯源机制。

英文摘要

Ethical constraints on open-weight AI models are both a reflection of societal concerns and a foundation for AI governance policy. They are expected to propagate to downstream derivatives while implemented as voluntary metadata disclosures that must be restated at each generation of reuse. We audit 2,142,823 model repositories on Hugging Face Hub to test whether this disclosure-based governance infrastructure can sustain traceability across deep model lineages. Restriction evidence decays with a half-life of 1.31 derivation steps ($R^2$=0.98), and beyond seven downstream generations at least 80% of descendant models lack sufficient public evidence for a governance determination, a depth boundary we formalize as the governance horizon. Platform-level interventions to restore missing licence metadata reveal that policy design (not enforcement alone) is the binding factor: inheritance-only designs require near-complete enforcement to move the horizon, whereas a mandatory-declaration design that explicitly resolves orphan lineage components shifts the horizon already at moderate enforcement. The structural bottleneck is lineages with no inheritable upstream intent: such orphan components remain undecidable under any inheritance-only policy regardless of enforcement rate, and unresolved upstream nodes additionally create direct downstream undecidability bottlenecks that inheritance rules alone cannot recover. Comparison with PyPI, where governance signals are carried by explicit machine-readable declarations, corroborates that the collapse is topology-specific to open-weight derivation rather than inherent to open ecosystems. These results establish that disclosure-based governance has a shallow, structurally determined reach in open-weight AI, and that achieving deep supply-chain accountability requires provenance mechanisms propagating governance signals through derivation itself.

URL PDF HTML ☆

赞 0 踩 0

2605.24297 2026-05-27 cs.IR cs.AI 版本更新

Benchmarking Patent Embeddings: A Multi-Task Evaluation of 22 Models Across Retrieval, Classification, and Clustering

专利嵌入基准测试：跨检索、分类和聚类任务的22个模型多任务评估

Amirhossein Yousefiramandi, Ciaran Cooney

发表机构 * Clarivate, Intellectual Property（Clarivate知识产权）

AI总结通过评估22个预训练模型在三个任务上的表现，发现最优微调策略取决于下游任务，且单一领域微调会损害跨领域检索性能。

Comments 31 pages, 21 figures

详情

AI中文摘要

关于从业者使用专利嵌入的两个问题出现：(i) 一种微调方案是否适用于所有下游应用？(ii) 在一个专利领域上的微调是否足以用于其他领域的下游应用？通过评估22个预训练嵌入模型（参数从22M到12B）在三个任务——信息检索、分类和聚类——上的表现，使用113,148件WIPO辅助技术专利（46,069个引文查询）和外部DAPFAM数据集，我们发现两个结果对普遍认知提出了质疑。(i) 最优微调方案取决于下游任务：跨截面对齐（方案R3）对检索性能提升最大（+7.1% nDCG@10），而组合信号方案（方案R4）更适合分类（+7.1 F1）和聚类（+10.9 V-measure）；匹配数据控制证实训练数据集大小的差异不是影响因素。(ii) 单一领域微调损害了跨领域信息检索：在DAPFAM语料库上，对8个模型-方案组合中的5个，单一领域微调显著降低了跨域检索性能，其中零样本能力较强的模型受损最严重。虽然族内扩展一致（Qwen3 0.6B->4B->8B；Llama-Nemotron 1B->8B），但族间扩展不稳定；12B的KaLM-Gemma3在TAC检索性能上排名第8，经过前缀修改后。标题+摘要+权利要求是普遍最佳文本视图，所有模型在域内和域外性能之间存在55-65%的差距，且无法通过混合BM25-密集融合来弥补。代码和评估框架已公开。

英文摘要

Two questions regarding practitioners' use of patent embeddings arise: (i) Does one fine-tuning recipe suffice for all downstream applications? (ii) Is fine-tuning on one patent landscape sufficient for downstream application on other landscapes? By evaluating 22 pre-trained embedding models (ranging from 22M to 12B parameters) on three tasks -- information retrieval, classification, and clustering -- on 113,148 WIPO patents for assistive technology (46,069 citation queries) and on an external DAPFAM dataset, we find that two results cast doubt on the prevailing wisdom. (i) The optimal fine-tuning recipe depends on the downstream task: cross-sectional alignment (recipe R3) provides the largest improvements to retrieval performance (+7.1% nDCG@10), whereas a combined signal recipe (recipe R4) is better suited to classification (+7.1 F1) and clustering (+10.9 V-measure); a matched data control confirms that differences in training dataset size are not a contributing factor. (ii) Single-landscape fine-tuning hampers cross-landscape information retrieval: fine-tuning on one landscape significantly degrades cross-domain retrieval for 5 of 8 model-recipe combinations on the DAPFAM corpus, with the stronger zero-shot models suffering most. While within-family scaling is consistent (Qwen3 0.6B->4B->8B; Llama-Nemotron 1B->8B), cross-family scaling is erratic; the 12B KaLM-Gemma3 is ranked 8th on TAC retrieval performance, following prefix modification. Title+Abstract+Claims is the ubiquitous best text view, and all models suffer from a 55-65% gap between IN and OUT-of-domain performance which cannot be mitigated by hybrid BM25-dense fusion. Code and evaluation framework are publicly available.

URL PDF HTML ☆

赞 0 踩 0

2605.24296 2026-05-27 cs.AI cs.IR 版本更新

When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

合成专利数据何时有帮助？低资源多标签分类中的数量-保真度权衡

Amirhossein Yousefiramandi, Ciaran Cooney

发表机构 * Clarivate, Intellectual Property（Clarivate知识产权）

AI总结研究通过LLM生成合成数据用于多标签专利分类时的数量与保真度权衡，发现低资源场景下数量效应主导，高资源场景下保真度更重要，混合数据策略最优。

详情

AI中文摘要

关于利用通过LLM生成的合成数据进行多标签专利分类时必须考虑的问题包括：(i) 何时使用此类数据可能有所帮助以及(ii) 为何如此。实际上，前一部分适当调整了通过增加样本量来改进结果的可能性。当前实验涉及六个开源LLM（从3.8B到12B参数），针对辅助技术64个WIPO标签分类的四种真实数据机制。应用了基于标签集条件化的全合成生成方法和释义方法，每种方法与三种分类器类别结合使用。结果表明，BERT-for-Patents的微F1从0.120到0.702的声称改进主要反映了数量效应；实际上，在165个样本中进行有放回复制产生了0.678。因此，相对于对照组的改进为+0.024，而与最佳基线（焦点损失重加权）相比为+0.219。这里要考虑的第二个关键点是随着数据生成机制变化，保真度分数的演变。对于低真实数据机制，数量效应占主导，最大均值差异（MMD）与分类性能之间的相关系数等于r = +0.95。随着使用更多真实数据，相关性变为负值，在1:10机制下达到r = -0.73（Fisher z = +6.47，p < 0.001，Delta r的95% CI [ +0.96, +1.00 ]）。在固定预算分配方面，将真实数据（约20-30%）与合成数据（70-80%）结合优于纯合成和纯真实策略。此外，一个能够将原始微F1改进高达+0.58的语料库可能会对Jaccard重叠检索代理产生不利影响。其他体裁的提示族变体可能提供对该现象的一些解释，但使用标准专利过滤器仍使nDCG@10降低26%。

英文摘要

The issues that must be considered regarding the utilization of synthetic data generated through LLMs for multilabel patent classification include (i) when the use of such data may help and (ii) why. Indeed, the former part appropriately adjusts for the possibility of improving results by an increase in sample size. The current experiment involves six open-source LLMs (from 3.8B to 12B parameters) for four real-data regimes in classification of 64 WIPO labels of assistive technologies. Both full-synthesis generation, conditioned on the label set, and paraphrasing methods are applied, with each used in combination with three classifier categories. It is shown that the claimed improvements in micro F1 for BERT-for-Patents from 0.120 to 0.702 mainly reflect a volume effect; indeed, replication with replacement in 165 examples produces 0.678. Thus, the improvement over the control is +0.024, while compared to the best baseline (focal loss reweighting) is +0.219. The second crucial point to consider here is that of evolving fidelity scores as the data generation regime varies. For low real-data regimes, the volume effect dominates and the correlation coefficient between maximum mean discrepancy (MMD) and classification performance equals r = +0.95. As more real data is used, the correlation becomes inverted and reaches r = -0.73 at the 1:10 regime (Fisher z = +6.47, p < 0.001, 95% CI on Delta r [ +0.96, +1.00 ]). In terms of a fixed budget allocation, combining real data (about 20-30%) with synthetic (70-80%) outperforms both purely synthetic and purely real strategies. Moreover, a corpus that allows for improvement in classification performance up to +0.58 in raw micro F1 may adversely affect a Jaccard-overlap retrieval proxy. Prompt-family variations for other genres may provide some explanation of the phenomenon, but using the standard-patent filter still decreases nDCG@10 by 26%.

URL PDF HTML ☆

赞 0 踩 0

2605.24217 2026-05-27 cs.AI cs.DC 版本更新

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

识别和减轻生产级LLM推理基准中的系统性测量偏差

Ashok Chandrasekar, Jason Kramberger

发表机构 * Google（谷歌）

AI总结针对生产级LLM推理基准中因客户端排队导致的测量偏差，提出基于多进程的无偏评估框架和归一化输出令牌时间（NTPOT）指标，实现高并发下的准确性能评估。

详情

AI中文摘要

随着大型语言模型（LLM）从研究环境过渡到生产部署，评估其是否满足严格的服务水平目标（SLO）变得至关重要。然而，当前的评估方法在大规模下存在严重的测量偏差。我们证明，广泛使用的基准测试工具依赖于单进程、异步驱动架构，在高并发下引入了根本性的客户端排队瓶颈。通过将基准测试客户端建模为$M/G/1$队列，我们从数学上展示了Python全局解释器锁（GIL）如何随着请求速率增加而人为地膨胀首令牌时间（TTFT）和每输出令牌时间（TPOT）指标。为了解决这一系统性不准确性，我们提出了一个无偏的多进程评估框架，有效分散客户端负载，确保可忽略的排队开销。此外，我们形式化了一个复合指标——归一化每输出令牌时间（NTPOT），以稳健地摊销端到端延迟，包括跨序列长度的预填充和调度延迟。我们的实证评估表明，该方法成功隔离了纯服务引擎性能，能够在每秒数千个查询的生产规模下对LLM进行准确、可复现的性能分析。

英文摘要

As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has become critical. However, current evaluation methodologies suffer from severe measurement bias at scale. We demonstrate that widely used benchmarking utilities rely on single-process, asyncio-driven architectures that introduce fundamental client-side queuing bottlenecks under high concurrency. By modeling the benchmarking client as an $M/G/1$ queue, we mathematically demonstrate how the Python Global Interpreter Lock (GIL) artificially inflates Time to First Token (TTFT) and Time Per Output Token (TPOT) metrics as request rates scale. To resolve this systematic inaccuracy, we propose an unbiased, multi-process evaluation framework that effectively distributes client-side load, ensuring negligible queuing overhead. Furthermore, we formalize a composite metric, Normalized Time Per Output Token (NTPOT), to robustly amortize end-to-end latency, including prefill and scheduling delays across sequence lengths. Our empirical evaluation demonstrates that this methodology successfully isolates pure serving engine performance, enabling accurate, reproducible profiling of LLMs at production scales exceeding thousands of queries per second.

URL PDF HTML ☆

赞 0 踩 0

2605.24152 2026-05-27 cs.AI 版本更新

Neuro-Inspired Inverse Learning for Planning and Control

神经启发式逆向学习用于规划与控制

Maryna Kapitonova, Tonio Ball

发表机构 * NeuroMentum AI IMBIT, University of Freiburg, Germany（NeuroMentum AI IMBIT，弗赖堡大学，德国）

AI总结提出一种神经启发式框架Inverter，通过逆向学习（IL）结合前向/逆向内部模型、开环多步运动指令和层次化动作组织，在规划与控制任务中实现高效推理，平均性能提升24.2%且计算时间降低一到两个数量级。

Comments Version 2, minor fix in online version of the abstract, pdf unchanged

详情

AI中文摘要

我们提出了一种用于具身规划与控制的神经启发式框架。基于哺乳动物大脑中实现快速高效目标导向行为的三个原则——配对的前向/逆向内部模型、开环多步运动指令以及顺序层次化的动作组织——我们的Inverter框架使用学习组件，通过逆向学习（IL）进行端到端训练，并在自然情况下辅以解析或算法模块；我们形式化了IL，并将其与监督学习、强化学习和模仿学习区分开来。IL桥接了强化学习（RL）式的摊销（单次前向传播但每次只输出一个动作）和最优控制（OC）式的序列规划（整个轨迹但需要迭代测试时计算）。单个Inverter或层次化n=2的Inverter堆栈在所有3个maze2d和6个antmaze D4RL变体上，平均比离线RL和扩散规划基线提升24.2%（范围-1.9%至+78.2%），同时推理计算时间减少一到两个数量级。显著的是，通过前向模型（FoM）对整个T步动作序列进行优化（而非逐步骤优化），使得Inverter能够生成平滑、目标一致、轨迹级的结构，并达到比训练数据本身所蕴含的策略更接近解析最优的控制策略。我们还发现了IL的一种失败模式：在训练数据覆盖范围狭窄时出现FoM攻击，我们通过使用覆盖范围更广的随机训练数据来缓解。作为一个应用实例，脉冲Inverter合成任意单量子比特量子门，其保真度与标准迭代数值基线（GRAPE）相当，而每个门的计算时间降低超过1000倍。总之，我们得出结论：IL实现了一类通用的世界接口，特别适用于对延迟和资源敏感的具身AI。

英文摘要

We present a neuro-inspired framework for embodied planning and control. Building on three principles that enable fast and highly effective goal-directed behavior in the mammalian brain - paired forward/inverse internal models, open-loop multi-step motor commands, and sequential, hierarchical organization of action - our Inverter framework uses learned components, trained end-to-end through Inverse Learning (IL) and supplemented where natural by analytic or algorithmic modules; we formalize IL and delineate it from supervised, reinforcement, and imitation learning. IL bridges Reinforcement Learning (RL)-style amortization, which runs in a single forward pass but emits only one action at a time, and Optimal Control (OC)-style sequence planning over whole trajectories, but with iterative test-time computation. Single Inverters or hierarchical n=2 Inverter stacks match or improve on offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL variants by an average of +24.2% (range -1.9% to +78.2%), at one-to-two orders of magnitude less inference compute time. Distinctively, optimizing through the Forward Model (FoM) over the entire T-step action sequence - rather than per step - lets Inverters produce smooth, goal-coherent, trajectory-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself. We also identify a failure mode of IL: FoM hacking under narrow training-data coverage, which we mitigate by using random training data with broader coverage. As an application example, a Pulse Inverter synthesizes arbitrary single-qubit quantum gates with fidelity matching the standard iterative numerical baseline (GRAPE), at more than 1000x lower per-gate compute time. In summary, we conclude that IL enables a versatile class of world-interfaces, especially for latency- and resource-critical embodied AI.

URL PDF HTML ☆

赞 0 踩 0

2605.24071 2026-05-27 cs.LG cs.AI 版本更新

Not All Transitions Matter: Evidence from PPO

并非所有转移都重要：来自PPO的证据

Ajhesh Basnet

发表机构 * Department of Artificial Intelligence and Data Science（人工智能与数据科学系）； KPR Institute of Engineering and Technology（KPR工程科技研究院）

AI总结本文提出在PPO训练中随机丢弃一定比例的轨迹转移，以打破重复梯度结构，稳定训练，并在多个环境中验证了效果。

Comments 19 pages, 5 figures. Accepted to 2026 8th Asia Conference on Machine Learning and Computing (ACMLC 2026)

详情

Journal ref: Proceedings of the 2026 8th Asia Conference on Machine Learning and Computing

AI中文摘要

在策略上训练强化学习代理意味着每次更新时收集新的经验，而这些经验隐藏着一个问题。轨迹中的每个状态都是前一个状态的直接输出，由代理自身的动作因果链连接。因此，连续的转移从未真正独立。它们携带重叠信息，网络接收到的梯度信号最终比批次大小所暗示的要重复得多。相同的方向被反复强化，价值网络在策略变化时难以跟上，训练变得悄悄不稳定，而仅凭奖励曲线很少能揭示这一点。本文询问这种冗余是否可以简单地移除。我们表明，在适当阶段从轨迹中随机丢弃固定比例的转移，使得奖励信号保持完整，足以打破重复的梯度结构并稳定训练。变化很小：一个采样步骤，没有新组件，不修改核心算法，并且适用于任何PPO实现。在五个难度递增的环境（CartPole-v1、Acrobot-v1、LunarLander-v2、HalfCheetah-v5和Hopper-v5）中，该方法在奖励上与标准PPO匹配，同时在KL散度、策略熵和价值估计上产生更一致的训练动态。丢弃25%的转移是最佳点：足以破坏冗余，又不至于使批次过薄。

英文摘要

Training a reinforcement learning agent on-policy means collecting fresh experience at every update, and that experience comes with a hidden problem. Each state in a rollout is the direct output of the previous one, causally chained together by the agent's own actions. Because of this, consecutive transitions are never truly independent. They carry overlapping information, and the gradient signal the network receives ends up far more repetitive than the batch size suggests. The same directions get reinforced over and over, the value network struggles to keep up as the policy shifts, and training becomes quietly unstable in ways that reward curves alone rarely reveal. This paper asks whether that redundancy can simply be removed. We show that randomly dropping a fixed fraction of transitions from the rollout, at the right stage so the reward signal stays intact, is enough to break the repetitive gradient structure and stabilize training. The change is minimal: one sampling step, no new components, no modification to the core algorithm, and it works with any PPO implementation. Across five environments of increasing difficulty, CartPole-v1, Acrobot-v1, LunarLander-v2, HalfCheetah-v5, and Hopper-v5, the method matches vanilla PPO on reward while producing more consistent training dynamics across KL divergence, policy entropy, and value estimates. Dropping 25% of transitions turns out to be the sweet spot: enough to disrupt the redundancy, not enough to thin the batch.

URL PDF HTML ☆

赞 0 踩 0

2605.24042 2026-05-27 cs.LG cs.AI 版本更新

Hidden-State Privacy Has an Empty Middle

隐藏状态隐私存在空中间

Alexander Okezue Bell

发表机构 * Stanford University（斯坦福大学）

AI总结通过理论下界和实验证明，高斯释放机制在隐藏状态隐私中无法同时实现中等效用和隐私，存在空中间区域，并提出了对角逆Fisher机制作为最优解。

Comments 74 pages, 61 figures

详情

AI中文摘要

在我们测试的1536个高斯释放协方差中，对于单层隐藏状态隐私，没有一个能在自适应检索攻击者下同时实现中等效用和中等隐私。我们证明了一个互补的Fisher球下界：每个具有O(1) Fisher效用的满秩高斯释放都存在一个方向，其马氏信号随隐藏宽度线性增长，排除了该类中的均匀高斯安全性，并与经验上的空中间匹配。对角逆Fisher释放Σ^⋆_{diag}(K) = (2K/d) diag(1/F_{ii})是在一阶KL预算K下唯一的最小最大最优对角机制，也是在32个模型层网格的每个点上最坏攻击者top-1 ≤ 0.001的唯一释放，但它位于隐私/效用边界上，而不是填充中间。在欧几里得检索下达到13倍帕累托缩减的广义特征机制，在自适应马氏攻击者下崩溃为100% top-1，而全轨迹序列逆变器恢复了干净GPT-2前缀的94%，但在Σ_{diag}下为0%。从头训练的分离记忆Transformer在90M时达到G_{Mah} ∈ [20, 33]，并在固定token语言建模损失惩罚下，从30M到1B保持比相同预算GPT基线6-24倍的优势；预训练模型最高为9.3。这些结果将隐藏状态释放从高斯类内的机制设计重新定义为架构或释放协同设计。

英文摘要

Of $1{,}536$ Gaussian release covariances we tested for single-layer hidden-state privacy, zero achieve both moderate utility and moderate privacy against an adaptive retrieval attacker. We prove a complementary Fisher-ball lower bound: every full-rank Gaussian release at $O(1)$ Fisher utility admits a direction whose Mahalanobis signal grows linearly in hidden width, ruling out uniform Gaussian safety in the class and matching the empirical empty middle. The diagonal inverse-Fisher release $Σ^\star_{\mathrm{diag}}(\mathcal{K}) = (2\mathcal{K}/d)\,\mathrm{diag}(1/F_{ii})$ is the unique minimax-optimal diagonal mechanism at first-order KL budget $\mathcal{K}$ and the only release with worst-attacker top-1 $\le 0.001$ at every point of a 32 model-layer grid, but it sits on a privacy/utility edge rather than filling the middle. A generalized-eigen mechanism reaching $13\times$ Pareto reduction under Euclidean retrieval collapses to $100\%$ top-1 under the adaptive Mahalanobis attacker, and a full-trajectory sequence inverter recovers $94\%$ of clean GPT-2 prefixes but $0\%$ under $Σ_{\mathrm{diag}}$. A split-memory transformer trained from scratch reaches $G_{\mathrm{Mah}} \in [20, 33]$ at 90M and maintains a $6$--$24\times$ advantage over same-budget GPT baselines from 30M to 1B at a fixed-token language-modeling loss penalty; pretrained models top out at 9.3. These results reframe hidden-state release from mechanism-design within the Gaussian class to architecture or release co-design.

URL PDF HTML ☆

赞 0 踩 0

2605.24001 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Diff-Instruct with Diffused Reward: 迈向有原则的一步生成器强化学习

Junyi Wu, Weijian Luo, Haoyang Zheng, Ruizhe Zhang, Guang Lin

发表机构 * Purdue University（普渡大学）； hi-lab, Xiaohongshu Inc.（小红书实验室，小红书公司）

AI总结针对一步生成器强化学习中奖励优化与生成动力学不匹配的问题，提出基于积分KL最小化的无数据轨迹级对齐框架DIDR，通过扩散奖励分数和代理估计器实现奖励驱动的校正，在一步SDXL和6B DiT骨干网络上取得帕累托优势。

Comments author list correction

详情

AI中文摘要

近期一步文本到图像生成的进展实现了实时合成，具有显著的效率和质量。先前用于一步生成器的强化学习方法将图像空间奖励优化与扩散噪声空间分布匹配相结合。这种范式由于终端奖励优化与底层生成动力学之间的不匹配带来了挑战。结果，优化倾向于利用随机自由度，通常以牺牲图像保真度为代价来提高奖励。为了解决这个问题，我们提出了Diff-Instruct with Diffused Reward (DIDR)，一个从积分KL最小化推导出的无数据轨迹级对齐框架。DIDR将RLHF最优的奖励倾斜干净图像分布沿扩散轨迹传播到所有噪声水平。我们证明该目标与干净图像RLHF具有相同的最小化器，同时自然诱导出扩散奖励分数(DRS)，它作为对参考分数函数的奖励驱动校正。为了使其实用，我们进一步引入了扩散奖励代理(DRP)，一种基于可微短步去噪的DRS高效估计器。大量实验表明，DIDR持续帕累托主导现有的一步SDXL基线。此外，当迁移到6B DiT骨干网络(Z-Image)时，DIDR在偏好对齐上超越了其50步教师模型，同时仅需单步生成。

英文摘要

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.

URL PDF HTML ☆

赞 0 踩 0

2605.22904 2026-05-27 cs.CV cs.AI 版本更新

Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations

基于AI视频监控的自杀风险评估：地铁站预防的可解释框架

Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau, Brian Mishara

发表机构 * Université TÉLUQ（大学TÉLUQ）； Polytechnique Montréal（蒙特利尔理工学院）； Université du Québec à Montréal（魁北克大学蒙特利尔分校）

AI总结提出首个可解释框架，通过行人跟踪、活动识别、站台语义分割和轨迹风险热图建模，从监控视频中评估自杀风险，在真实数据上达到83.2% ROC-AUC。

Comments 9 pages, 6 figures, 1 table. Accepted for Publication in the International Joint Conference of Artificial Intelligence (IJCAI)

详情

AI中文摘要

理解并监控地铁站中的人类行为对于支持自杀预防工作至关重要，早期识别高风险情况能够实现及时干预。这需要通过对每个乘客的行为、其空间上下文和时间动态进行联合推理，从监控视频中评估自杀风险。然而，使用监控摄像头捕获的视频进行评估具有挑战性，因为它需要准确感知人体运动、理解站台几何结构，并随时间聚合异质行为线索。在这项工作中，我们正式定义了地铁站自杀风险评估（SRA）任务，并引入了首个解决这一挑战的可解释框架。与专注于孤立子任务或试图直接推断意图的方法不同，我们的公式通过整合行人跟踪、活动识别、站台语义分割和轨迹驱动的风险热图建模，从累积证据中评估自杀风险。通过将SRA形式化为一个独特任务，并在真实监控数据上基准测试一个完整的操作流程，实现了83.2%的ROC-AUC，这项工作突出了自杀风险评估的复杂性，并为面向社会公益的可解释AI系统研究开辟了新方向。

英文摘要

Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early identification of high-risk situations can enable timely intervention. This requires assessing suicide risk from a surveillance video by jointly reasoning about the behavior of each passenger, his/her spatial context, and temporal dynamics. However, this assessment using videos captured by surveillance cameras is challenging, as it demands accurate perception of human motion, understanding of platform geometry, and aggregation of heterogeneous behavioral cues over time. In this work, we formalize the task of Suicide Risk Assessment (SRA) in metro stations and introduce the first interpretable framework that addresses this challenge. Unlike approaches that focus on isolated subtasks or attempt to infer intent directly, our formulation assesses suicide risk from accumulated evidence by incorporating person tracking, activity recognition, semantic segmentation of the platform, and trajectory-driven risk heatmap modeling. By formalizing SRA as a distinct task and benchmarking a complete operational pipeline achieving 83.2% ROC-AUC on real surveillance data, this work highlights the complexity of suicide risk assessment and opens new directions for research on interpretable AI systems for social good.

URL PDF HTML ☆

赞 0 踩 0

2605.22774 2026-05-27 cs.LG cs.AI cs.HC 版本更新

CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation

CogAdapt: 通过导联适应将临床心电图基础模型迁移至可穿戴认知负荷评估

Amir Mousavi, Erfan Nourbakhsh, Mohammad Sadegh Sirjani, Mimi Xie, Rocky Slavin, Leslie Neely, John Davis, John Quarles

发表机构 * Department of Computer Science, College of AI, Cyber and Computing, The University of Texas at San Antonio（计算机科学系，人工智能、网络与计算学院，德克萨斯大学圣安东尼奥分校）； Department of Educational Psychology, College of Education and Human Development, The University of Texas at San Antonio（教育心理学系，教育与人类发展学院，德克萨斯大学圣安东尼奥分校）

AI总结提出CogAdapt框架，通过可学习适配器LeadBridge将3导联可穿戴信号转换为12导联表示，并结合渐进微调策略ProFine，实现临床心电图基础模型向可穿戴认知负荷评估的迁移，在跨受试者验证中显著优于从头训练的基线模型。

Comments 7 pages, 7 figures. Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)

详情

AI中文摘要

实时认知负荷评估对于自适应人机交互至关重要，但由于标记数据有限和跨受试者泛化能力差，仍然具有挑战性。最近在数百万临床记录上预训练的心电图基础模型提供了丰富的表示，但由于传感器配置不匹配和任务差异，无法直接应用于可穿戴设备。在本文中，我们提出了CogAdapt，一个将临床心电图基础模型适应于可穿戴认知负荷评估的框架。CogAdapt引入了LeadBridge，一个可学习的适配器，将3导联可穿戴信号转换为解剖学一致的12导联表示，以及ProFine，一种渐进微调策略，逐步解冻编码器层同时防止灾难性遗忘。在两个公共数据集（CLARE和CL-Drive）上的留一受试者交叉验证评估表明，CogAdapt显著优于从头训练的基线，宏F1分数分别达到0.626和0.768。这些结果证明了基础模型适应用于从可穿戴传感器进行与受试者无关的认知负荷评估的前景。

英文摘要

Real-time cognitive load assessment is essential for adaptive human-computer interaction but remains challenging due to limited labeled data and poor cross-subject generalization. Recent ECG foundation models pre-trained on millions of clinical recordings offer rich representations, but cannot be directly applied to wearable devices due to sensor configuration mismatch and task differences. In this paper, we propose CogAdapt, a framework that adapts clinical ECG foundation models to wearable cognitive load assessment. CogAdapt introduces LeadBridge, a learnable adapter that transforms 3-lead wearable signals into anatomically consistent 12-lead representations, and ProFine, a progressive fine-tuning strategy that gradually unfreezes encoder layers while preventing catastrophic forgetting. Evaluations on two public datasets (CLARE and CL-Drive) under leave-one-subject-out cross-validation show that CogAdapt substantially outperforms baselines trained from scratch, achieving macro-F1 scores of 0.626 and 0.768. These results demonstrate the promise of foundation model adaptation for subject-independent cognitive load assessment from wearable sensors.

URL PDF HTML ☆

赞 0 踩 0

2605.22133 2026-05-27 q-bio.BM cs.AI 版本更新

Atom-level Protein Representation Learning Improves Protein Structure Prediction

原子级蛋白质表示学习改进蛋白质结构预测

Taewon Kim, Hyosoon Jang, Hyunjin Seo, Seonghwan Seo, Hyeongwoo Kim, Wonho Zhung, Mingyeong Shin, Wooyoun Kim, Sungsoo Ahn

发表机构 * KAIST（韩国科学技术院）

AI总结提出结构感知预训练方法TriProRep，通过VQ-VAE联合建模三种对齐的残基级视图，在结构预测任务中优于仅序列和先前结构感知表示模型。

Comments Project Page: https://holymollyhao.github.io/TriProRep/

详情

AI中文摘要

生成建模的最新进展表明，预训练表示可以作为条件特征或对齐目标来改进生成。受此启发，我们研究用于预测结构（超越常规功能注释）的蛋白质表示。我们提出TriProRep，一种结构感知预训练方法，它联合建模三种对齐的残基级视图：氨基酸身份、主链几何和局部全原子几何，通过VQ-VAE分词器进行离散编码。通过预训练从生成器损坏的视图中恢复原始标记，TriProRep学会区分合理但不正确的跨视图增强与原始蛋白质。我们进一步引入RepSP，一个用于在结构预测设置中评估蛋白质表示的基准。RepSP测试表示的三种用途：从脱辅基链表示进行同源二聚体共折叠、同源二聚体衍生相互作用属性的残基级预测，以及表示对齐的单体结构预测。在这些任务中，TriProRep优于仅序列和先前的结构感知表示模型，同时在常规基准上保持竞争性能。

英文摘要

Recent advances in generative modeling show that pretrained representations can improve generation as conditioning features or alignment targets. Motivated by this, we study protein representations for predicting structures beyond conventional function annotation. We propose TriProRep, a structure-aware pretraining method that jointly models three aligned residue-level views: amino-acid identity, backbone geometry, and local full-atom geometry, discretely encoded via VQ-VAE tokenizers. By pretraining to recover original tokens from generator-corrupted views, TriProRep learns to distinguish plausible but incorrect cross-view augmentations from the original protein. We further introduce RepSP, a benchmark for evaluating protein representations in structure-predictive settings. RepSP tests three uses of representations: homodimer co-folding from apo-chain representations, residue-level prediction of homodimer-derived interaction properties, and representation-aligned monomer structure prediction. Across these tasks, TriProRep improves over sequence-only and prior structure-aware representation models, while maintaining competitive performance on conventional benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2605.20988 2026-05-27 cs.LG cs.AI 版本更新

A Sharper Picture of Generalization in Transformers

Transformer 泛化能力的更清晰图景

Paul Lintilhac, Sair Shaikh

发表机构 * Thayer School of Engineering Dartmouth College（达特茅斯学院泰勒工程学院）

AI总结本文通过PAC-Bayes理论研究Transformer在布尔域上的泛化行为，证明稀疏低阶频谱可实现低锐度构造并得到非平凡的泛化界，解释了思维链为何能改善高阶目标函数的泛化。

Comments 10 pages, 9 figures, 41 pages of supplementary material

详情

AI中文摘要

我们从目标函数的傅里叶谱角度研究Transformer在布尔域上的泛化行为。与先前基于Rademacher复杂度推导泛化界的工作（Edelman等人，2022；Trauger & Tosh，2024）不同，我们探讨了通过PAC-Bayes理论获得泛化界的可行性。我们证明，集中在低阶分量上的稀疏谱能够实现具有良好泛化性质的低锐度构造。我们的思路是证明存在实现任何稀疏度不超过上下文长度的布尔函数的平坦极小值，然后将PAC-Bayes界应用于一个理想化的低锐度学习器，从而得到一个非平凡的泛化界。我们利用这一点正式解释了为什么思维链能改善高阶目标函数的泛化，并展示了我们界中的复杂度参数可以通过性质测试高效估计。我们通过实验评估了预测，并进行了机制可解释性研究，以支持我们的理论构造在真实Transformer中的现实性。

英文摘要

We study transformers' generalization behavior on boolean domains from the perspective of the Fourier spectra of their target functions. In contrast to prior work (Edelman et al., 2022; Trauger & Tosh, 2024), which derived generalization bounds from Rademacher complexity, we investigate the feasibility of obtaining generalization bounds via PAC-Bayes theory. We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound. We use this to give a formal account of why chain-of-thought improves generalization for high-degree target functions, and show that the complexity parameters in our bound can be efficiently estimated via property testing. We evaluate predictions empirically and conduct a mechanistic interpretability study to support the realism of our theoretical construction in real transformers.

URL PDF HTML ☆

赞 0 踩 0

2605.20690 2026-05-27 cs.AI 版本更新

Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

声明式数据服务：用于组合数据系统的结构化智能体发现

Shanshan Ye, Duo Lu

发表机构 * Northeastern University（东北大学）； Brown University（布朗大学）

AI总结提出声明式数据服务（DDS）架构，通过分层类型契约将全局搜索分解为有界子搜索，解决无界智能体发现无法稳定收敛的问题，并在交易后端工作负载上验证其有效性。

Comments Accepted at AI Agents for Discovery in the Wild (AID-Wild), Workshop at ACM CAIS 2026

详情

AI中文摘要

智能体发现已表明，在基准条件下，LLM驱动的搜索能够发现新颖的算法、设计和代码。将该范式迁移到多系统数据后端面临一个更困难的问题：搜索空间是异构的，验证器是部署栈是否实际运行，且组合知识在预训练中不均匀地捕获。即使添加了迭代和显式组合知识，无界智能体发现（一个基于失败日志反馈迭代的编码智能体）也无法在运行栈上一致收敛。我们提出声明式数据服务（DDS），一种从声明式用户意图中结构化智能体发现数据系统组合的架构。该框架在连续层（意图、操作DAG、每系统技能、运行时归因）拥有四个类型契约，将全局搜索分解为有界子搜索；子智能体搜索每个类型空间，而框架提供通道，使知识以内联技能引用的方式向前流动，错误以类型信号的方式向后路由。作为交易后端工作负载的生命证明，DDS在无界发现无法收敛的地方收敛；运行时失败成为技能补丁，下一次部署内联引用。我们将其定位为早期原型，报告来自真实世界数据系统组合的经验教训。

英文摘要

Agentic discovery has shown that LLM-driven search can find novel algorithms, designs, and code under benchmark conditions. Translating the paradigm to multi-system data backends surfaces a harder problem: the search space is heterogeneous, the verifier is whether a deployed stack actually runs, and composition knowledge is unevenly captured in pretraining. Unbounded agentic discovery, a coding agent iterating on failure-log feedback, fails to converge consistently on a working stack even when iteration and explicit composition knowledge are added. We propose Declarative Data Services (DDS), an architecture for structured agentic discovery of data-system compositions from declarative user intent. The framework owns four typed contracts at successive layers (intent, operator DAG, per-system skills, runtime attribution) that decompose the global search into bounded sub-searches; sub-agents search each typed space, while the framework provides the channels by which knowledge flows forward as inline skill citations and errors route backward as typed signals. As a proof of life on a trading-backend workload, DDS converges where unbounded discovery does not; runtime failures become skill patches that the next deployment cites inline. We position this as an early prototype reporting lessons from real-world data-system composition.

URL PDF HTML ☆

赞 0 踩 0

2605.20255 2026-05-27 cs.LG cs.AI cs.HC cs.RO 版本更新

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

行人行为不确定性下安全自动驾驶的多智能体强化学习

Prakash Aryan, Kaushik Raghupathruni, Timo Kehrer, Sebastiano Panichella

发表机构 * University of Bern（伯恩大学）； AI4I, The Italian Institute of Artificial Intelligence（意大利人工智能研究所）

AI总结本文使用多智能体近端策略优化（MAPPO）联合训练自动驾驶汽车和12个行人，通过隐藏的行人特质模拟乱穿马路行为，相比固定策略基线显著降低了碰撞率，并揭示了速度差异指标可用于检测未预期的乱穿马路行为。

Comments Accepted to ICRA 2026 Workshop "8th Workshop on Long-term Human Motion Prediction"

详情

AI中文摘要

自动驾驶汽车（SDC）的仿真测试通常依赖脚本化行人模型，这些模型无法捕捉真实过街行为的异质性和不确定性，限制了安全评估的真实性，尤其是对于由车辆无法观察到的潜在人格特质支配的乱穿马路行为。我们假设，通过多智能体强化学习（MARL）联合训练行人和SDC，相比针对固定行人策略训练，能产生更真实的交互场景，并且可预测与不可预测过街行为之间的差距可以直接从轨迹中测量。我们使用多智能体近端策略优化（MAPPO）联合训练一个SDC和12个行人：行人移动遵循脚本化的Dijkstra路径规划，而RL策略控制高层的前进/等待决策，乱穿马路概率取决于每个行人在回合开始时采样并隐藏于SDC的特质。在500回合评估中，联合训练的SDC达到78%的目标完成率，碰撞率为14%，而最佳基于规则的基线分别为35%和33%。速度差异指标显示，在近距离（0-3米）范围内，SDC在乱穿马路者附近比在人行横道使用者附近快2.65米/秒，表明乱穿马路遭遇未被预期。乱穿马路占过街事件的13%，但占碰撞的62%，并且联合训练相比单智能体RL减少了30%的碰撞，因为行人学会了在SDC高速接近时等待。

英文摘要

Simulation-based testing of self-driving cars (SDCs) typically relies on scripted pedestrian models that do not capture the heterogeneity and uncertainty of real crossing behavior, limiting the realism of safety assessments, especially for jaywalking, which is governed by latent personality traits the vehicle cannot observe. We hypothesize that jointly training pedestrians and the SDC with multi-agent reinforcement learning (MARL) yields more realistic interaction scenarios than training against fixed pedestrian policies, and that the behavior gap between predictable and unpredictable crossings can be measured directly from trajectories. We co-train an SDC and 12 pedestrians using Multi-Agent Proximal Policy Optimization (MAPPO): pedestrian locomotion follows scripted Dijkstra pathfinding while an RL policy controls high-level go/wait decisions, and jaywalking probability depends on a per-pedestrian trait sampled at episode start and hidden from the SDC. In 500-episode evaluations, the co-trained SDC reached 78% of goals with a 14% collision rate, versus 35%/33% for the best rule-based baseline. A speed differential metric shows the SDC traveled 2.65 m/s faster near jaywalkers than near crosswalk users at close range (0-3 m), indicating jaywalking encounters were not anticipated. Jaywalking was 13% of crossing events but 62% of collisions, and co-training reduced collisions by 30% relative to single-agent RL as pedestrians learned to wait when the SDC approached at speed.

URL PDF HTML ☆

赞 0 踩 0

2605.19186 2026-05-27 cs.AI 版本更新

Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)

可发现的智能体知识——智能体知识图谱能力的形式化框架（扩展版）

Terry R. Payne, Valentina Tamma, Enrico Daga

发表机构 * School of Computer Science and Informatics, University of Liverpool, UK（利物浦大学计算机科学与信息学学院）； Open University（开放大学）

AI总结本文提出一个四维形式化框架（语义表达性、智能体可发现性、任务相对基础性和认知信任范围），并从中推导出智能体能力概况（AAP），作为VoID和DCAT之上的语义层，支持智能体在规划时进行原则性的知识图谱选择、组合和故障诊断。

详情

AI中文摘要

二十年前，语义网服务社区被问及具有不同本体承诺的智能体如何能够连贯地发现、组合和调用网络服务。答案是OWL-S和WSMO：形式化的能力描述，指定服务能做什么、智能体为了认知上合理调用必须已经知道什么，以及如何形式化地桥接本体不匹配。当前的知识图谱元数据标准（如VoID和DCAT）描述了知识图谱包含什么，但没有说明特定智能体能从中证明什么、空结果受什么封闭假设支配，或者智能体的任务词汇是否在模式中有基础。此外，在已部署的知识图谱中，控制模式描述逻辑和操作性的蕴涵机制可能不同：这是一种当前元数据不可见的认知失效模式。我们针对知识图谱环境重新审视并扩展这些见解，提出了一个四维形式化框架：语义表达性、智能体可发现性、任务相对基础性和认知信任范围，从中我们推导出智能体能力概况（AAP）：一个位于VoID和DCAT之上的语义层，使智能体在规划时能够进行原则性的知识图谱选择、组合和故障诊断。这四个维度在单个智能体层面操作化了本体连续体的能力结构，特别用于知识图谱选择、组合和故障诊断。一个来自学术搜索任务的实例具体化了该框架，并通过五点研究议程指出了实现基于AAP的能力匹配规模化所需的形式化、计算和工程工作。

英文摘要

Two decades ago, the Semantic Web Services community was asked how agents with different ontological commitments could discover, compose, and invoke web services coherently. The response was OWL-S and WSMO: formally grounded capability descriptions specifying what a service could do, what the agent must already know for invocation to be epistemically sound, and how ontological mismatches could be formally bridged. Current KG metadata standards such as VoID and DCAT describe what a KG contains, yet say nothing about what a specific agent can prove from it, what closure assumptions govern empty results, or whether the agent's task vocabulary is grounded in the schema. Furthermore, in deployed KGs the governing schema DL and the operative entailment regime can diverge: an epistemic failure mode invisible to current metadata. We revisit and extend these insights for the KG setting with a four-dimensional formal framework; Semantic Expressivity, Agentic Discoverability, Task-Relative Grounding, and Epistemic Trust Scope, from which we derive the Agentic Affordance Profile (AAP): a semantic layer above VoID and DCAT enabling principled KG selection, composition, and failure diagnosis at agent planning time. The four dimensions operationalise the affordance structure of the Ontological Continuum at the individual-agent level, specifically for \kg selection, composition, and failure diagnosis. A worked example drawn from a scholarly-search task concretely grounds the framework, and identifies the formal, computational, and engineering work needed to realise AAP-based affordance matching at scale though a five-point research agenda.

URL PDF HTML ☆

赞 0 踩 0

2605.17036 2026-05-27 cs.AI cs.LG cs.MA cs.SY eess.SY 版本更新

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

自主AI代理在供应链管理中的可靠性与有效性

Carol Xuan Long, David Simchi-Levi, Feng Zhu, Huangyuan Su, Andre P. Calmon, Flavio P. Calmon

发表机构 * Harvard University（哈佛大学）； MIT/Purdue（麻省理工学院/普渡大学）； MIT（麻省理工学院）； Harvard University/Kempner Institute（哈佛大学/凯普勒研究所）； Georgia Tech（佐治亚理工学院）

AI总结本文通过MIT啤酒游戏研究多级供应链中的自主生成式AI代理，发现模型能力是性能主导因素，但平均性能掩盖可靠性风险，并引入代理牛鞭效应，提出基于GRPO的后训练框架以提高可靠性。

详情

AI中文摘要

本文使用MIT啤酒游戏研究多级供应链中的自主生成式AI代理。我们确定了影响性能的四个推理时杠杆：模型选择、策略和护栏、集中数据共享以及提示工程。模型能力是主导因素：开箱即用的推理模型超越人类水平性能，优化后的推理模型相对于人类团队将成本降低高达67%。然而，强劲的平均性能掩盖了显著的可靠性风险。我们引入了代理牛鞭效应：自主多级系统中运行间决策不稳定性的放大。其中一个核心组成部分是决策牛鞭效应，即由随机代理决策而非客户需求变化产生的订单变异性部分。我们表明，即使需求路径固定，决策不稳定性也可以在固定时间点跨设施以及同一设施内随时间放大。重复采样（一种自然的测试时补救措施）未能显著减少这种不稳定性，这表明可靠性需要改变底层决策策略，而不仅仅是平均模型输出。为解决这一限制，我们提出了一种基于组相对策略优化（GRPO）的强化学习后训练框架，该框架使用系统级供应链奖励训练共享的基础LLM。后训练显著减少了尾部事件，抑制了代理牛鞭效应，并提高了自主供应链代理的可靠性。

英文摘要

This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game. We identify four inference-time levers that shape performance: model selection, policies and guardrails, centralized data sharing, and prompt engineering. Model capability is the dominant factor: an out-of-the-box reasoning model exceeds human-level performance, and optimized reasoning models reduce costs by up to 67% relative to human teams. However, strong average performance masks substantial reliability risks. We introduce agent bullwhip: the amplification of run-to-run decision instability in autonomous multi-echelon systems. A central component is decision bullwhip, the portion of order variability generated by stochastic agent decisions rather than by changes in customer demand. We show that decision instability can amplify both across facilities at a fixed point in time and within the same facility over time, even when the demand path is held fixed. Repeated sampling, a natural test-time remedy, fails to meaningfully reduce this instability, suggesting that reliability requires changing the underlying decision policy rather than merely averaging over model outputs. To address this limitation, we propose a Group Relative Policy Optimization (GRPO)-based reinforcement-learning post-training framework that trains a shared base LLM using system-level supply-chain rewards. Post-training substantially reduces tail events, curtails agent bullwhip, and improves the reliability of autonomous supply-chain agents.

URL PDF HTML ☆

赞 0 踩 0

2605.16457 2026-05-27 cs.LG cs.AI cs.CV 版本更新

Identifiable Token Correspondence for World Models

可辨识的令牌对应关系用于世界模型

Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song

发表机构 * Interdisciplinary Program in Artificial Intelligence, Seoul National University（人工智能交叉学科项目，首尔国立大学）； Department of Computer Science（计算机科学系）； Engineering, Seoul National University（工程系，首尔国立大学）

AI总结提出可辨识的令牌对应关系（ITC）方法，通过将下一帧预测建模为结构化分配问题，解决基于令牌的Transformer世界模型在长程推演中的时间不一致性，在四个基准上达到最先进性能。

详情

AI中文摘要

基于令牌的Transformer世界模型在视觉强化学习中表现出色，但常在长程推演中出现时间不一致性，包括对象重复、消失和变形。一个关键原因是大多数现有方法将下一帧预测纯粹视为令牌生成问题，而未考虑令牌在时间上的持续性。我们引入可辨识的令牌对应关系（ITC），这是一种用于基于令牌的Transformer世界模型的解码步骤，将下一帧预测建模为具有潜在令牌对应变量的结构化分配问题：每个下一帧令牌要么通过从上一帧复制令牌来解释，要么通过生成新令牌来解释。ITC保持Transformer架构和训练过程不变，可以添加到现有骨干网络上。我们的实验在4个具有挑战性的基准上展示了最先进的性能。所提出的方法在Craftax-classic基准上实现了72.5%的回报率和35.6%的分数，显著超过了之前的最佳结果67.4%和27.9%。我们在https://github.com/snu-mllab/Identifiable-Token-Correspondence上发布了源代码。

英文摘要

Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models that formulates next-frame prediction as a structured assignment problem with latent token correspondence variables: each next-frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu-mllab/Identifiable-Token-Correspondence.

URL PDF HTML ☆

赞 0 踩 0

2605.04880 2026-05-27 cs.LG cs.AI 版本更新

A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

SMDP中平均奖励强化学习的调和均值公式

Erel Shtossel, Alicia Vidler, Uri Shaham, Gal A. Kaminka

发表机构 * Bar Ilan University（巴伊兰大学）

AI总结针对无限时域非回合制任务中的平均奖励强化学习，提出一种修正的调和均值算子，解决SMDP中奖励和持续时间非平稳时的奖励率计算问题，并证明其理论性质及有效性。

详情

Journal ref: https://alaworkshop2026.github.io/papers/ALA2026_paper_57.pdf

AI中文摘要

最近的研究重新激发并增强了对无限时域、非回合制（持续）任务中未折扣平均奖励强化学习算法的兴趣。半马尔可夫决策过程（SMDP）尤其引人关注。在SMDP中，离散动作随机产生奖励和持续时间，目标是优化平均奖励率。现有算法通过优化奖励与持续时间的比率来逼近这一目标。然而，当奖励和持续时间（在无限时域中）非平稳时，这种方法可能不正确。本文提出一种新颖的修正调和均值算子，即使在上述条件下也能正确计算奖励率。这产生了可以与SMDP一起工作的无模型学习算法，同时保持对随时间变化的非平稳奖励和持续时间分布的鲁棒性。我们证明了修正调和均值算子的理论性质，并通过实验与现有算法相比展示了其有效性。

英文摘要

Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.

URL PDF HTML ☆

赞 0 踩 0

2605.02207 2026-05-27 cs.CV cs.AI cs.LG 版本更新

MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings

MultiSense-Pneumo：面向资源受限环境中肺炎筛查的多模态学习框架

Dineth Jayakody, Pasindu Thenahandi, Chameli Dommanige

发表机构 * Department of Computer Science, Old Dominion University, VA, USA（计算机科学系，老 Dominion 大学，弗吉尼亚州，美国）

AI总结提出MultiSense-Pneumo多模态原型系统，整合症状、咳嗽音频、语音和胸片，通过可解释的后期融合实现肺炎筛查与分诊支持。

详情

GraphMind：从操作轨迹到自演化工作流自动化

Yiwen Zhu, Joyce Cahoon, Anna Pavlenko, Qiushi Bai, Nima Shahbazi, Divya Vermareddy, Meina Wang, Mathieu Demarne, Swati Bararia, Wenjing Wang, Hemkesh Vijaya Kumar, Hannah Lerner, Katherine Lin, Steve Toscano, Miso Cilimdzic, Subru Krishnan

发表机构 * Microsoft, USA ； University of Illinois Chicago, USA ； Microsoft, Spain

AI总结提出GraphMind系统，通过离线提取因果工作流图、在线多智能体遍历执行和自适应遍历强化，实现云数据库事故调查中的自动化工作流，相比基线方法减少8倍检索上下文并降低26%幻觉率。

详情

AI中文摘要

协调人员、工具和信息的复杂操作工作流是系统运行的核心，但由于需要大量人工输入且适应能力有限，端到端自动化仍然具有挑战性。我们提出GraphMind，一个以最小人力构建、执行和演化以行动为中心的工作流图的系统。该系统分三个阶段运行。首先，一个可扩展的离线管道从大量人工解决轨迹中提取结构化工作流图，捕捉问题、行动及其因果关系。其次，一个在线多智能体遍历引擎导航该图以动态构建和执行工作流，每一步结合图引导检索与LLM驱动的推理。第三，自适应遍历强化（ATR）强化成功的遍历路径，实现执行信息引导的图适应。GraphMind已部署在四个生产云数据库服务中用于事故调查。在93个保留事故上评估并通过盲审专家验证，该系统在缓解范围、幻觉率和诊断吞吐量方面优于Agentic Summary-RAG基线，同时需要少8倍的检索上下文。ATR层将幻觉率降低26%，证明工作流图可以从执行反馈中学习。一项为期12周的现场研究证实了实用价值：97%的评分对话在交互延迟内产生可操作结果。

英文摘要

Complex operational workflows coordinating personnel, tools, and information are central to system operations, yet end-to-end automation remains challenging due to extensive human input requirements and limited ability to adapt over time. We present GraphMind, a system that constructs, executes, and evolves action-centric workflow graphs with minimal human effort. The system operates in three phases. First, a scalable offline pipeline extracts structured workflow graphs from large volumes of human resolution traces, capturing problems, actions, and their causal relationships. Second, an online multi-agent traversal engine navigates the graph to dynamically construct and execute workflows, combining graph-guided retrieval with LLM-driven reasoning at each step. Third, Adaptive Traversal Reinforcement (ATR) reinforces successful traversal paths, enabling execution-informed graph adaptation. GraphMind has been deployed across four production cloud database services for incident investigation. Evaluated on 93 held-out incidents and validated via blind expert review, the system outperforms an Agentic Summary-RAG baseline in mitigation reach, hallucination rate, and diagnostic throughput while requiring 8x less retrieval context. The ATR layer reduces hallucination rate by 26%, demonstrating that workflow graphs can learn from execution feedback. A 12-week field study confirms practical value: 97% of scored conversations yield actionable results within interactive latency.

URL PDF HTML ☆

赞 0 踩 0

2603.04639 2026-05-27 cs.RO cs.AI 版本更新

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

RoboMME：机器人通用策略的记忆基准与理解

Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, Joyce Chai

发表机构 * University of Michigan（密歇根大学）； Stanford University（斯坦福大学）； Figure AI

AI总结提出RoboMME基准，通过16个操作任务评估VLA模型在长时程和历史依赖场景中的记忆能力，并基于π0.5骨干网络探索14种记忆增强变体，发现记忆表示的有效性高度依赖于任务。

Comments Accepted to ICML 2026

详情

AI中文摘要

记忆对于长时程和历史依赖的机器人操作至关重要。这类任务通常涉及计数重复动作或操作暂时被遮挡的物体。最近的视觉-语言-动作（VLA）模型已开始融入记忆机制；然而，它们的评估仍局限于狭窄、非标准化的设置中。这限制了对记忆的系统理解、比较和进展测量。为应对这些挑战，我们引入了RoboMME：一个大规模标准化基准，用于评估和推进VLA模型在长时程、历史依赖场景中的表现。我们的基准包含16个操作任务，这些任务基于精心设计的分类法构建，该分类法评估时间、空间、对象和程序记忆。我们进一步开发了一套基于π0.5骨干网络的14种记忆增强VLA变体，以系统探索多种集成策略下的不同记忆表示。实验结果表明，记忆表示的有效性高度依赖于任务，每种设计在不同任务中都有独特的优势和局限性。视频和代码可在我们的网站https://robomme.github.io上找到。

英文摘要

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

URL PDF HTML ☆

赞 0 踩 0

2412.18084 2026-05-27 cs.AI 版本更新

Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models

属性增强指令微调用于大型语言模型的多任务分子生成

Xuan Lin, Long Chen, Yile Wang, Yangyang Chen, Xiangxiang Zeng

发表机构 * School of Computer Science, Xiangtan University（湘潭大学计算机科学学院）； College of Computer Science and Software Engineering, Shenzhen University（深圳大学计算机科学与软件工程学院）； Department of Computer Science, University of Tsukuba（东京大学理工学部）； College of Computer Science and Electronic Engineering, Hunan University（湖南大学计算机科学与电子工程学院）

AI总结提出PEIT框架，通过多模态对齐预训练和指令微调，提升LLM在分子描述、文本分子生成、属性预测和多约束分子生成任务上的性能。

Comments 9

详情

AI中文摘要

大型语言模型（LLMs）广泛应用于各种自然语言处理任务，如问答和机器翻译。然而，由于缺乏标记数据以及生化属性手动标注的困难，分子生成任务的性能仍然有限，尤其是涉及多属性约束的任务。在这项工作中，我们提出了一个两步框架PEIT（属性增强指令微调）来改进LLMs在分子相关任务上的表现。第一步，我们使用文本描述、SMILES和生化属性作为多模态输入，通过对齐多模态表示来合成指令数据，预训练一个名为PEIT-GEN的模型。第二步，我们使用合成数据微调现有的开源LLMs，得到的PEIT-LLM可以处理分子描述、基于文本的分子生成、分子属性预测以及我们新提出的多约束分子生成任务。实验结果表明，我们的预训练模型PEIT-GEN在分子描述任务上优于MolT5、BioT5、MolCA和Text+Chem-T5，证明了文本描述、结构和生化属性之间的模态对齐良好。此外，PEIT-LLM在多任务分子生成中显示出有希望的改进，证明了PEIT框架在分子任务中的有效性。代码和附录可在https://github.com/chenlong164/PEIT获取。

英文摘要

Large language models (LLMs) are widely applied in various natural language processing tasks such as question answering and machine translation. However, due to the lack of labeled data and the difficulty of manual annotation for biochemical properties, the performance for molecule generation tasks is still limited, especially for tasks involving multi-properties constraints. In this work, we present a two-step framework PEIT (\textbf{P}roperty \textbf{E}nhanced \textbf{I}nstruction \textbf{T}uning) to improve LLMs for molecular-related tasks. In the first step, we use textual descriptions, SMILES, and biochemical properties as multimodal inputs to pre-train a model called PEIT-GEN, by aligning multi-modal representations to synthesize instruction data. In the second step, we fine-tune existing open-source LLMs with the synthesized data, the resulting PEIT-LLM can handle molecule captioning, text-based molecule generation, molecular property prediction, and our newly proposed multi-constraint molecule generation tasks. Experimental results show that our pre-trained PEIT-GEN outperforms MolT5, BioT5, MolCA and Text+Chem-T5 in molecule captioning, demonstrating modalities align well between textual descriptions, structures, and biochemical properties. Furthermore, PEIT-LLM shows promising improvements in multi-task molecule generation, demonstrating the effectiveness of the PEIT framework for molecular tasks. The code and appendix are available at https://github.com/chenlong164/PEIT.

URL PDF HTML ☆

赞 0 踩 0

2603.12592 2026-05-27 cs.DS cs.AI cs.RO 版本更新

Early Pruning for Public Transport Routing

公共交通路由的早期剪枝

Andrii Rohovyi, Abdallah Abuaisha, Toby Walsh

发表机构 * Department of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, NSW 2033, Australia（新南威尔士大学计算机科学与工程系）； Department of Data Science and Artificial Intelligence, Monash University, Melbourne, Australia（墨尔本大学数据科学与人工智能系）

AI总结提出早期剪枝技术，通过预排序换乘连接并在换乘循环中应用剪枝规则，在不影响最优性的情况下加速公共交通路由算法，实验表明查询时间最多减少57%。

详情

AI中文摘要

公共交通的路由算法，特别是广泛使用的RAPTOR及其变体，在支持无限换乘时，常常在换乘松弛阶段面临性能瓶颈，尤其是在密集的换乘图上。这种低效源于遍历许多潜在的站点间连接（步行、自行车、电动滑板车等）。为了保持可接受的性能，从业者通常限制换乘距离或排除某些换乘选项，这可能会降低路径的最优性并限制向旅客展示的多模式选项。本文介绍了早期剪枝，一种低开销的技术，可以在不影响最优性的情况下加速路由算法。通过按持续时间预排序换乘连接，并在换乘循环内应用剪枝规则，该方法在站点处丢弃较长的换乘，一旦它们无法产生比当前最佳解更早的到达时间。早期剪枝可以以最小的更改集成到现有代码库中，并且只需要一次预处理步骤。该技术在扩展准则设置中保持帕累托最优性，只要额外的优化准则在换乘持续时间上单调非递减。在多个基于RAPTOR的最新解决方案中，包括RAPTOR、ULTRA-RAPTOR、McRAPTOR、BM-RAPTOR、ULTRA-McRAPTOR和UBM-RAPTOR，并在瑞士和伦敦交通网络上测试，我们实现了高达57%的查询时间减少。该方法为交通路径查找算法的效率提供了可推广的改进。

英文摘要

Routing algorithms for public transport, particularly the widely used RAPTOR and its variants, often face performance bottlenecks during the transfer relaxation phase, especially on dense transfer graphs, when supporting unlimited transfers. This inefficiency arises from iterating over many potential inter-stop connections (walks, bikes, e-scooters, etc.). To maintain acceptable performance, practitioners often limit transfer distances or exclude certain transfer options, which can reduce path optimality and restrict the multimodal options presented to travellers. This paper introduces Early Pruning, a low-overhead technique that accelerates routing algorithms without compromising optimality. By pre-sorting transfer connections by duration and applying a pruning rule within the transfer loop, the method discards longer transfers at a stop once they cannot yield an earlier arrival than the current best solution. Early Pruning can be integrated with minimal changes to existing codebases and requires only a one-time preprocessing step. The technique preserves Pareto-optimality in extended-criteria settings whenever the additional optimization criteria are monotonically non-decreasing in transfer duration. Across multiple state-of-the-art RAPTOR-based solutions, including RAPTOR, ULTRA-RAPTOR, McRAPTOR, BM-RAPTOR, ULTRA-McRAPTOR, and UBM-RAPTOR and tested on the Switzerland and London transit networks, we achieved query time reductions of up to 57\%. This approach provides a generalizable improvement to the efficiency of transit pathfinding algorithms.

URL PDF HTML ☆

赞 0 踩 0

2605.16000 2026-05-27 cs.SI cs.AI cs.DL 版本更新

CitePrism: Human-in-the-Loop AI for Citation Auditing and Editorial Integrity

CitePrism：面向引文审计与编辑完整性的人机协同AI

Gowrika Mahesh, Budanur Madappa Darshan Gowda, Kavana Gopladevarahalli Papegowda, Prajwal Basavaraj, Binh Vu, Swati Chandna, Mehrdad Jalali

发表机构 * GitHub

AI总结提出CitePrism框架，结合LLM推理、嵌入相似度、元数据验证和人工审核，实现稿件级引文审计，初步验证显示可辅助编辑筛选不相关引文。

Comments 30 pages, 5 main figures, 3 tables, appendices with interface screenshots and implementation details; pilot-stage framework and single-manuscript validation study

详情

AI中文摘要

编辑和审稿人应确保稿件引用相关、准确、最新且符合伦理的文献，但稿件级引文审计目前仍主要依赖人工、分散且难以规模化。引文上下文、元数据质量、自引模式和书目完整性都会影响参考文献是否恰当支持局部主张。我们提出CitePrism，一个透明的混合决策支持框架，用于编辑引文审计，它结合了LLM辅助的上下文推理、基于嵌入的语义相似性、元数据验证、完整性标志和人机协同的分析师审查。CitePrism提取引文邻域、丰富参考文献元数据、计算融合相关性分数、呈现元数据和自引审查提示，并支持可配置的阈值分类。在针对一篇包含104条参考文献的路面工程案例稿件的初步验证中，与人工二元相关性标签的一致性达到Cohen's kappa = 0.429。在操作阈值tau=17时，CitePrism标记了所有人工标记为不相关的引文，同时也产生了需要分析师审查的误报。这些结果表明CitePrism可能支持保守的编辑筛选和引文质量分类，但并未确立通用的编辑性能。CitePrism旨在作为试点阶段的决策支持，而非自主的不端行为检测器或自动化编辑决策系统。在操作使用前，需要在稿件、领域、标注者、基线和部署设置中进行更广泛的验证。

英文摘要

Editors and reviewers are expected to ensure that manuscripts cite relevant, accurate, current, and ethically appropriate literature, yet manuscript-level citation auditing remains largely manual, fragmented, and difficult to scale. Citation context, metadata quality, self-citation patterns, and bibliographic integrity all affect whether a reference appropriately supports a local claim. We present CitePrism, a transparent hybrid decision-support framework for editorial citation auditing that combines LLM-assisted contextual reasoning, embedding-based semantic similarity, metadata verification, integrity-oriented flags, and human-in-the-loop analyst review. CitePrism extracts citation neighborhoods, enriches reference metadata, computes fused relevance scores, surfaces metadata and self-citation review prompts, and supports configurable threshold-based triage. In a preliminary validation on a single case-study manuscript with 104 references from pavement engineering, agreement with human binary relevance labels reached Cohen's kappa = 0.429. At operating threshold tau = 17, CitePrism flagged all human-labeled irrelevant citations, while also producing false positives requiring analyst review. These results suggest that CitePrism may support conservative editorial screening and citation-quality triage, but they do not establish general editorial performance. CitePrism is intended as pilot-stage decision support, not as an autonomous misconduct detector or automated editorial decision system. Broader validation across manuscripts, domains, annotators, baselines, and deployment settings is required before operational use.

URL PDF HTML ☆

赞 0 踩 0

2605.15850 2026-05-27 cs.CY cs.AI cs.HC 版本更新

Continuum: 基于KV缓存生存时间的高效鲁棒多轮LLM智能体调度

Hanchen Li, Runyuan He, Qiuyang Mang, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph Gonzalez, Ion Stoica

发表机构 * UC Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）； Tensormesh（Tensormesh公司）； Tsinghua University（清华大学）

AI总结针对多轮智能体工作负载中工具调用导致KV缓存无法跨轮重用的问题，提出Continuum系统，通过引入KV缓存的生存时间（TTL）机制，在GPU内存中选择性固定缓存，权衡重算/重载成本与排队延迟，实现作业完成时间平均提升8倍以上。

详情

AI中文摘要

KV缓存管理对于高效的LLM推理至关重要。为了最大化利用率，现有推理引擎会在新请求等待时驱逐已完成请求的KV缓存。但这种策略对于智能体工作负载不适用，因为智能体工作负载将LLM调用与工具调用交错进行，引入了停顿，从而阻止了跨轮次的KV有效重用。由于许多工具调用的持续时间远短于人类响应的多轮聊天，因此在工具调用期间保留KV缓存是有前景的。然而，仍存在许多挑战。首先，我们需要考虑重算或重载（如果启用卸载）的潜在成本，以及从GPU驱逐后增加的排队延迟。其次，由于工具调用持续时间的内部方差，该方法需要在工具调用持续时间有限可预测性下保持鲁棒性。我们提出了Continuum，一个通过引入KV缓存保留的生存时间（TTL）机制来优化多轮智能体工作负载作业完成时间的服务系统。对于生成工具调用的请求，Continuum选择性地将KV缓存固定在GPU内存中，其TTL值由重载成本和驱逐引起的潜在排队延迟决定。当TTL过期时，KV缓存可自动被驱逐以释放GPU内存，从而在边缘情况下提供鲁棒性能。当与程序级先来先服务结合时，Continuum保持了多轮连续性，并减少了智能体工作流的延迟。在真实世界智能体（SWE-Bench、BFCL、OpenHand）上使用Llama-3.1 8B/70B、Gemma-3 12B和GLM-4.5 355B的评估表明，Continuum在提高吞吐量的同时，将平均作业完成时间提升了8倍以上。

英文摘要

KV cache management is essential for efficient LLM inference. To maximize utilization, existing inference engines evict finished requests' KV cache if new requests are waiting. This policy breaks for agentic workloads, which interleave LLM calls with tools, introducing pauses that prevent effective KV reuse across turns. Since many tool calls have much shorter durations than human response multi-turn chatbot, it would be promising to retain the KV cache in during these tools. However, many challenges remain. First, we need to consider both the potential cost of recomputation or reloading (if offloading enabled) as well as the increasing queueing delays after eviction from GPU. Second, due to the internal variance of tool call durations, the method needs to remain robust under limited predictability of tool call durations. We present Continuum, a serving system to optimize job completion time for multi-turn agent workloads by introducing time-to-live mechanism for KV cache retention. For requests that generate tool calls, Continuum selectively pins the KV cache in GPU memory with a time-to-live value determined by the reload cost and potential queueing delay induced by eviction. When the TTL expires, the KV cache can be automatically evicted to free up GPU memory, providing robust performance under edge cases. When combined with program-level first-come-first-serve, Continuum preserves multi-turn continuity, and reduces delay for agentic workflows. Evaluations on real-world agents (SWE-Bench, BFCL, OpenHand) with Llama-3.1 8B/70B, Gemma-3 12B, and GLM-4.5 355B shows that Continuum improves the average job completion times by over 8x while improving throughput.

URL PDF HTML ☆

赞 0 踩 0

2605.09156 2026-05-27 cs.CL cs.AI 版本更新

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

迷失在翻译中？探索从拉丁语到奥克语的语法性别转变

Ahan Chatterjee, Matthias Schöffel, Matthias Aßenmacher, Marinus Wiedner, Esteban Garces Arias

发表机构 * Bavarian Academy of Sciences (BAdW)（巴伐利亚科学学院）； LMU Munich（慕尼黑大学）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）； University of Freiburg（弗赖堡大学）

AI总结本文提出一个可解释的深度学习框架，通过词法和上下文层面分析拉丁语到奥克语的语法性别系统从三分（阳性、阴性、中性）到二分（阳性、阴性）的演变，并展示了改进的分词策略和形态特征、词性对性别预测的贡献。

Comments Accepted at NLP4DH @ ACL 2026

详情

AI中文摘要

从拉丁语到罗曼语族的历时演变涉及语法性别系统的重组，在大多数罗曼语中从三分结构（阳性、阴性、中性）变为二分结构（阳性、阴性）。在这项工作中，我们引入了一个可解释的深度学习框架，在词法和上下文层面研究这一现象。首先，我们表明传统的分词策略对于这种低资源历史设置不够稳健，而我们提出的分词器在这些基线上提高了性能。在词法层面，我们评估了形态特征对性别预测的贡献。在上下文层面，我们量化了不同词性类别对语法性别预测的贡献。这些分析共同刻画了性别信息在词元及其句子上下文之间的分布。我们在 \href{https://github.com/ahan-2000/Lost-in-Translation-}{https://github.com/ahan-2000/Lost-in-Translation-} 公开了我们的代码库、数据集和结果。

英文摘要

The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine) in most Romance languages. In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at \href{https://github.com/ahan-2000/Lost-in-Translation-}{https://github.com/ahan-2000/Lost-in-Translation-}.

URL PDF HTML ☆

赞 0 踩 0

2605.03929 2026-05-27 cs.SD cs.AI cs.LG eess.SP 版本更新

PHALAR: Phasors for Learned Musical Audio Representations

PHALAR：用于学习音乐音频表示的相量

Davide Marincione, Michele Mancusi, Giorgio Strano, Luca Cerovaz, Donato Crisostomi, Roberto Ribuoli, Emanuele Rodolà

发表机构 * Department of Computer Science, Sapienza University of Rome, Italy（罗马大学计算机科学系）； Moises Systems, Inc.（Moises系统公司）； Paradigma, Inc.（Paradigma公司）

AI总结提出PHALAR对比框架，利用学习谱池化和复值头实现音高和相位等变，在茎检索任务中参数减少50%、训练加速7倍，准确率相对提升约70%，并捕获鲁棒的音乐结构。

Comments Accepted at ICML 2026

2605.07990 2026-05-27 cs.CL cs.AI cs.LG cs.SE 版本更新

Tool Calling is Linearly Readable and Steerable in Language Models

语言模型中的工具调用是线性可读且可引导的

Zekun Wu, Ze Wang, Seonglae Cho, Yufei Yang, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

发表机构 * University College London（伦敦大学学院）； Holistic AI ； Imperial College London（伦敦帝国学院）

AI总结本文发现语言模型内部存在对应工具选择的线性方向，通过干预该方向可切换工具调用，并能提前检测潜在错误，在多个模型和基准上验证了有效性。

Comments 24 pages. ACL ARR May 2026 submission (EMNLP 2026 preferred venue); v2 reflects revised manuscript

详情

AI中文摘要

当工具调用代理选错工具时，失败在执行之前是不可见的：邮件被发送，会议被错过。随着代理承担重要行动，一次糟糕的工具调用可能造成实际损害。目前我们无法在模型内部查看并在错误发生前捕捉它；本文表明我们可以做到。在模型内部，工具的选择由激活空间中的单个方向承载，每对工具对应一个方向。在生成过程中添加该方向会切换模型选择的工具。在涵盖 Gemma 3、Qwen 3、Qwen 2.5 和 Llama 3.1（270M 到 27B）的 12 个指令微调模型和 6 个基础模型上，这在 4B+ 指令微调模型上对 15 个工具的合成基准达到 83-100% 的准确率，在真实 API 基准 τ-bench airline 上达到 77-94%。随后的 JSON 参数自动适应新工具的模式，因此仅翻转名称就足够了。相同的每工具方向还能在错误发生前标记潜在错误：模型在两个工具之间不确定的查询失败率比确定的高 21 倍（Gemma 3 27B）。这不仅仅是主题注入：相同幅度的随机向量给出 0% 的切换率，而在单个领域（共享一个主题的 14 个航空工具）内的探针仍然能在五个 4B-14B 模型上以 top-1 61-89% 的准确率读取模型将调用的工具。即使是基础模型在能够输出工具之前内部已经携带了正确的工具：从模型内部状态读取所选工具（余弦读出）在 BFCL 上恢复 61-82% 的准确率，而基础生成仅为 2-10%，这表明预训练形成了表示，而指令微调后来将其连接到输出。我们的结果涵盖单轮、固定菜单设置；在多轮代理循环中，相同的干预不太稳定（匹配基线的增益或损失高达 30 个百分点，没有一致的方向）。

英文摘要

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As agents take on consequential actions, one bad tool call can do real damage. We currently have no way to look inside the model and catch the mistake before it happens; this paper shows that we can. Inside the model, the choice of tool is carried by a single direction in activation space, one direction per pair of tools. Adding that direction during generation switches which tool the model picks. Across 12 instruction-tuned and 6 base models spanning Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), this works at 83-100% accuracy on 4B+ instruction-tuned models on a 15-tool synthetic benchmark and at 77-94% on the real-API benchmark $τ$-bench airline. The JSON arguments that follow automatically adapt to the new tool's schema, so flipping the name is enough. The same per-tool directions also flag likely errors before they happen: queries where the model is unsure between two tools fail 21x more often than queries where it is not (Gemma 3 27B). This is not just topic injection: random vectors at the same magnitude give a 0% switch rate, and a probe within a single domain (14 airline tools that share one topic) still reads which tool the model will call at top-1 61-89% across five 4B-14B models. Even base models already carry the right tool internally before they can emit it: reading the chosen tool off the model's internal state (cosine readout) recovers 61-82% accuracy on BFCL while base generation lands at 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. Our results cover single-turn, fixed-menu settings; on multi-turn agent loops the same intervention is less stable (matched-baseline gain or loss of up to 30 percentage points with no consistent direction).

URL PDF HTML ☆

赞 0 踩 0

2605.07632 2026-05-27 cs.CL cs.AI cs.LG 版本更新

GSM-SEM: 生成语义变体增强的基准与框架

Jyotika Singh, Fang Tu, Aziza Mirsaidova, Amit Agarwal, Hitesh Laxmichand Patel, Sandip Ghoshal, Miguel Ballesteros, Karan Dua, Yassine Benajiba, Weiyi Sun, Tao Sheng, Graham Horwood, Sujith Ravi, Dan Roth

发表机构 * Oracle AI

AI总结提出GSM-SEM框架，通过修改实体、属性和关系生成语义多样的数学问题变体，降低模型对固定测试集的记忆偏差，并在多个基准上验证性能下降。

详情

AI中文摘要

像GSM8K这样的基准测试是数学推理的流行度量，但由于对固定测试集的记忆，排行榜上的提升可能夸大真实能力。大多数鲁棒性变体应用表面级别的扰动（释义、重命名、数字交换、干扰项），这些扰动在很大程度上保留了底层事实，而静态发布本身可能随着时间的推移成为记忆目标。我们引入了GSM-SEM，一个可重用且随机的框架，用于生成语义多样化的基准变体，其语义方差显著高于先前方法。GSM-SEM通过修改实体、属性和/或关系来扰动问题陈述，经常改变底层事实，并要求模型在新条件下重新计算解决方案，同时约束生成以保留原始计算/答案和近似问题难度。GSM-SEM在每次运行时生成新的变体，无需重新标注，减少了对静态公共基准评估的依赖，从而降低了记忆偏差。我们将GSM-SEM应用于GSM8K和两个现有的变体系列（GSM-Symbolic和GSM-Plus），生成了GSM8K-SEM、GSM-Symbolic-SEM和GSM-Plus-SEM。评估14个SOTA LLM，我们观察到一致的性能下降，当语义扰动与符号/plus变体结合时下降更大（在GSM-SEM的最大严格配置中平均下降率为28%）。我们公开发布这三个SEM变体作为完全人工验证的数据集。最后，为了展示在GSM风格数学问题之外的适用性，我们将GSM-SEM应用于其他基准，包括BigBenchHard、LogicBench和NLR-BIRD。

英文摘要

Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.

URL PDF HTML ☆

赞 0 踩 0

2604.08059 2026-05-27 cs.RO cs.AI 版本更新

Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

受治理的能力演化：基于AI组件的系统的生命周期兼容性检查与回滚——以具身智能体为例

Xue Qin, Simin Luan, John See, Zeyd Boukhers, Cong Yang, Zhijun Li

发表机构 * School of Software, Harbin Institute of Technology（哈尔滨工业大学软件学院）； School of Computer Science and Technology, Harbin Institute of Technology（哈尔滨工业大学计算机科学与技术学院）； School of Mathematical and Computer Sciences, Heriot-Watt University, Malaysia Campus（赫瑞-瓦德大学马来西亚分校数学与计算机科学学院）； School of Future Science and Engineering, Soochow University（苏州大学未来科学与工程学院）； Fraunhofer Institute for Applied Information Technology（弗劳恩霍夫应用信息科技研究所）

AI总结针对基于AI组件的系统，提出一种受治理的能力演化框架，通过四类兼容性检查和七阶段升级管线实现安全部署，在具身智能体实验中实现零不安全激活。

Comments 42 pages, 7 figures, 12 tables

详情

AI中文摘要

由版本化AI组件构建的软件系统越来越需要生命周期治理：当能力模块演化到新版本时，宿主系统必须决定新版本是否可以安全激活、应在何种部署条件下运行、如何监控以及何时回滚。现有的软件部署模式（金丝雀发布、蓝绿部署、特性标志和MLOps管线）解决了这一循环的部分问题，但它们是针对无状态Web服务而非驱动现场AI组件的带状态、策略约束运行时设计的。我们将受治理的能力演化形式化为基于AI组件的系统的一等软件生命周期问题，并提出一个分阶段升级框架，其中每个新能力版本被视为受治理的部署候选，而非立即可执行的替换。该框架引入了四类升级兼容性检查（接口、策略、行为、恢复），并将其组织成七阶段管线（候选验证、沙箱评估、影子部署、门控激活、在线监控、回滚、审计）。我们在带有ROS 2中间件的PyBullet操作测试平台上实现了参考原型，并在15个随机种子的6轮能力升级中进行了评估。朴素升级实现了72.9%的任务成功率，但到最后一轮不安全激活率升至60%；受治理升级保持了可比的成功率（67.4%），同时在所有轮次中保持零不安全激活（Wilcoxon p=0.003）。影子部署揭示了40%的升级回归问题，这些问题是单独沙箱评估无法发现的，并且在79.8%的激活后漂移场景中回滚成功。

英文摘要

Software systems built from versioned AI components increasingly need lifecycle-time governance: when a capability module evolves into a new version, the hosting system must decide whether the new version may be activated safely, under what deployment conditions it should run, how it must be monitored, and when it should be rolled back. Existing software-deployment patterns (canary release, blue-green, feature flags, and MLOps pipelines) address parts of this loop but were designed for stateless web services rather than for stateful, policy-constrained runtimes that drive AI components in the field. We formulate governed capability evolution as a first-class software-lifecycle problem for AI-component-based systems and propose a staged upgrade framework in which every new capability version is treated as a governed deployment candidate rather than an immediately executable replacement. The framework introduces four upgrade compatibility checks (interface, policy, behavioral, recovery) and organizes them into a seven-stage pipeline (candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, rollback, audit). We implement a reference prototype on a PyBullet manipulation testbed with ROS 2 middleware and evaluate it over 6 rounds of capability upgrade with 15 random seeds. Naive upgrade achieves 72.9% task success but drives unsafe activation to 60% by the final round; governed upgrade retains comparable success (67.4%) while maintaining zero unsafe activations across all rounds (Wilcoxon p=0.003). Shadow deployment reveals 40% of upgrade regressions invisible to sandbox evaluation alone, and rollback succeeds in 79.8% of post-activation drift scenarios.

URL PDF HTML ☆

赞 0 踩 0

2605.06213 2026-05-27 cs.AI 版本更新

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

超越固定基准和最坏情况攻击：语言模型的动态边界评估

Haoxiang Wang, Da Yu, Huishuai Zhang

AI总结提出动态边界评估（DBE）方法，通过定位模型在随机采样解码下通过概率接近0.5的边界项，构建统一难度尺度的评估协议，以解决固定基准的饱和问题。

Comments This submission is being withdrawn because it was submitted without the knowledge and authorization of all co-authors. The authors need to resolve this authorship/authorization issue before any public posting

详情

AI中文摘要

当前评估大型语言模型（LLM）依赖于固定基准，这些基准对所有模型应用相同的测试项，产生天花板和地板效应，掩盖了能力差距。我们认为最具信息量的评估信号位于边界，即在随机采样解码下每个提示的通过概率接近0.5，并提出了动态边界评估（DBE），它主动定位每个模型的边界，并将其置于全局可比的难度尺度上。DBE提供三个产物：(i) 一个校准的题库，涵盖安全性、能力和真实性，其每项难度标签在9个参考LLM上得到验证；(ii) 技能引导的边界搜索（SGBS），一种仅通过API级查询访问即可为目标LLM找到边界项的搜索算法；(iii) 一个评估协议，将新的LLM置于统一的能力尺度上，并在目标超出题库覆盖范围时自适应地扩展评估集。我们在四个类别上实例化DBE，涵盖安全性（有害请求拒绝和过度拒绝）、能力（受限指令遵循）和真实性（多轮谄媚抵抗）。由此产生的评估覆盖更广泛的模型谱系而不饱和，同时与现有数据集兼容。

英文摘要

Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the boundary, where the per-prompt pass probability is near $0.5$ under random-sampling decoding, and propose Dynamic Boundary Evaluation (DBE), which actively locates each model's boundary and places it on a globally comparable difficulty scale. DBE delivers three artifacts: (i) a calibrated item bank covering safety, capability, and truthfulness, with per-item difficulty labels validated across $9$ reference LLMs; (ii) Skill-Guided Boundary Search (SGBS), a search algorithm that finds boundary items for a given target LLM using only API-level query access; and (iii) an evaluation protocol that places a new LLM on a unified ability scale and grows the evaluation set adaptively when the target falls outside the bank's coverage. We instantiate DBE on four categories spanning safety (harmful request refusal and over-refusal), capability (constrained instruction following), and truthfulness (multi-turn sycophancy resistance). The resulting evaluation covers a broader model spectrum without saturation while remaining compatible with existing datasets.

URL PDF HTML ☆

赞 0 踩 0

2605.05248 2026-05-27 cs.PL cs.AI 版本更新

Governed Metaprogramming for Intelligent Systems: Reclassifying Eval as a Governed Effect

智能系统的受控元编程：将Eval重新分类为受控效应

Alan L. McCann

发表机构 * Mashin, Inc.（Mashin公司）

AI总结针对AI系统运行时动态生成可执行代码带来的权限放大问题，提出受控元编程语言设计，将程序表示视为一等值，将形式到可执行机器的转换作为受控效应，并通过形式化证明和mashinTalk DSL实现验证。

Comments 15 pages. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Update: Abstract typo fixes. Updated license

详情

AI中文摘要

AI系统越来越多地在运行时合成可执行结构：LLM生成程序，智能体构建工作流，自我改进系统修改自身行为。在经典的homoiconic和分阶段语言中，从代码表示到执行的转换是不受限制的。eval是一种语言原语，而不是受控操作。我们认为，在受控智能系统中，这种转换是一种权限放大：它将符号结构转换为可执行权限，必须像任何其他效应一样被中介。我们提出了受控元编程，一种语言设计，其中程序表示（机器形式）是一等值，形式操作是纯计算，而物化（从形式到可执行机器的转换）是一种受控效应，需经过结构检查。治理系统在允许执行之前分析提议程序的能力需求、策略合规性和资源估计。我们形式化了两个判断：纯形式评估（不发出指令）和受控物化（恰好发出一个受控指令）。我们证明了三个性质：形式操作的纯度、无旁路定理和边界保持。我们在mashinTalk中实现了该设计，mashinTalk是一种用于AI工作流的DSL，编译为BEAM字节码，并报告了与454个现有机器检查的Rocq定理的集成。核心贡献是将eval从语言原语重新分类为受控效应。

英文摘要

AI systems increasingly synthesize executable structure at runtime: LLMs generate programs, agents construct workflows,self-improving systems modify their own behavior. In classical homoiconic and staged languages, the transition from code representation to execution is unrestricted. eval is a language primitive, not a governed operation. We argue that in governed intelligent systems, this transition is an authority amplification: it converts symbolic structure into executable authority and must be mediated like any other effect. We present governed metaprogramming, a language design where program representations (machine forms) are first-class values, form manipulation is pure computation, and materialization (the transition from form to executable machine) is a governed effect subject to structural inspection. The governance system analyzes the proposed program's capability requirements, policy compliance, and resource estimates before permitting execution. We formalize two judgments: pure form evaluation (which emits no directives) and governed materialization (which emits exactly one governed directive). We prove three properties: purity of form manipulation, the no-bypass theorem, and boundary preservation. We implement the design in mashinTalk, a DSL for AI workflows compiling to BEAM byte code, and report on integration with 454 existing machine-checked Rocq theorems. The central contribution is reclassifying eval from a language primitive into a governed effect.

URL PDF HTML ☆

赞 0 踩 0

2509.26619 2026-05-27 cs.CL cs.AI 版本更新

Searching the Internet for Challenging Benchmarks at Scale

在互联网上大规模搜索具有挑战性的基准测试

Wenda Xu, Vilém Zouhar, Parker Riley, Mara Finkelstein, Markus Freitag, Daniel Deutsch

发表机构 * Google（谷歌）； ETH Zurich（苏黎世联邦理工学院）

AI总结提出一种自动框架，将互联网建模为多臂老虎机问题，通过epsilon-greedy策略高效搜索最具挑战性的主题，以构建无需人工筛选的基准测试。

详情

AI中文摘要

许多静态基准测试开始饱和：随着模型快速改进，它们在固定测试集上获得近乎完美的分数，几乎没有剩余空间来暴露模型的真正弱点——即使是专家策划的挑战集在爬山后也会迅速饱和。我们提出一个完全自动化的框架，在互联网上大规模搜索以构建具有挑战性的基准测试，无需人工筛选。关键洞察是将互联网建模为一个广阔的主题空间，并将搜索形式化为多臂老虎机问题，其中每个主题的难度仅通过昂贵的采样和评估查询来揭示。我们的epsilon-greedy策略在仅探索6%的搜索空间的情况下识别出最具挑战性的主题——相比穷举评估成本降低了100倍。我们在机器翻译和知识问答上进行了验证，确认发现的难度在独立指标（GEMBA-SQA和MetricX）、语言和模型上都是稳健的。

英文摘要

Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving little headroom to expose genuine model weaknesses -- and even expert-curated challenge sets quickly saturate after hillclimbing. We present a fully automatic framework that searches the Internet at scale to construct challenging benchmarks without human curation. The key insight is to model the Internet as a vast space of topics and formalize the search as a multi-armed bandit problem, where each topic's difficulty is revealed only through expensive sample-and-evaluate queries. Our epsilon-greedy strategy identifies the most challenging topics while exploring only 6% of the search space -- a 100 times cost reduction over exhaustive evaluation. We validate on machine translation and knowledge question answering, confirming that discovered difficulty is robust across independent metrics (GEMBA-SQA and MetricX), languages, and models.

URL PDF HTML ☆

赞 0 踩 0

2605.01489 2026-05-27 cs.AI cs.CL 版本更新

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

SciResearcher: 面向前沿科学推理的深度研究智能体规模化

Tianshi Zheng, Rui Wang, Xiyun Li, Kelvin Kiu Wai Tam, Newt Nguyen Kim Hue Nam, Wei Fan, Yangqiu Song, Tianqing Fang

发表机构 * HKUST（香港科技大学）； CUHK（香港大学）； Tencent AI Lab（腾讯AI实验室）

AI总结提出SciResearcher框架，通过合成基于学术证据的概念与计算任务并训练智能体，在HLE-Bio/Chem-Gold等基准上达到最优性能。

Comments 23 pages, 6 figures, 15 tables

详情

AI中文摘要

前沿科学推理正迅速成为推动AI智能体在自动化科学发现中的关键基础。深度研究智能体为此挑战提供了有前景的方法。这些模型通过后训练处理信息寻求任务（通常通过知识图谱构建或迭代网页浏览来策划）来发展强大的问题解决能力。然而，这些策略在前沿科学中面临固有局限性，因为领域特定知识分散在稀疏且异构的学术来源中，而问题解决需要远超事实回忆的复杂计算和推理。为弥合这一差距，我们引入了SciResearcher，一个用于前沿科学数据构建的全自动智能体框架。SciResearcher综合了基于学术证据的多样化概念和计算任务，同时激发信息获取、工具集成推理和长程能力。利用策划的数据进行监督微调和智能体强化学习，我们开发了SciResearcher-8B，一个在HLE-Bio/Chem-Gold基准上达到19.46%的智能体基础模型，在其参数规模上建立了新的最先进水平，并超越了多个更大的专有智能体。它在SuperGPQA-Hard-Biology和TRQA-Literature基准上进一步取得了13-15%的绝对提升。总体而言，SciResearcher为前沿科学推理的自动数据构建引入了一种新范式，并为未来的科学智能体提供了一条可扩展的路径。

英文摘要

Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.

URL PDF HTML ☆

赞 0 踩 0

2601.21972 2026-05-27 cs.AI cs.DC cs.MA 版本更新

Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

基于多智能体Actor-Critic的分散式LLM协作学习

Shuo Liu, Tianle Chen, Ryan Amiri, Christopher Amato

发表机构 * Northeastern University, Boston, MA（波士顿马萨诸塞大学）

AI总结针对分散式LLM协作优化，提出两种多智能体Actor-Critic方法（CoLLM-CC和CoLLM-DC），实验表明在长时域或稀疏奖励任务中集中式Critic方法优于蒙特卡洛方法和分散式Critic方法。

详情

AI中文摘要

近期工作探索了通过多智能体强化学习（MARL）优化LLM协作。然而，大多数MARL微调方法依赖于预定义的执行协议，通常需要集中式执行。分散式LLM协作在实践中更具吸引力，因为智能体可以并行运行推理并灵活部署。此外，当前方法使用蒙特卡洛方法进行微调，这存在高方差问题，因此需要更多样本才能有效训练。Actor-Critic方法在MARL中常用于处理这些问题；因此，我们开发了多智能体Actor-Critic（MAAC）方法来优化分散式LLM协作。本文分析了这些MAAC方法何时以及为何有益。我们提出了两种MAAC方法：带有集中式Critic的CoLLM-CC和带有分散式Critic的CoLLM-DC。我们在写作、编码和游戏领域的实验表明，在短时域和密集奖励设置中，蒙特卡洛方法和CoLLM-DC可以达到与CoLLM-CC相当的性能。然而，在长时域或稀疏奖励任务中，它们均不如CoLLM-CC，其中蒙特卡洛方法需要更多样本，而CoLLM-DC难以收敛。

英文摘要

Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues; thus, we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose 2 MAAC approaches, \textbf{CoLLM-CC} with a \textbf{C}entralized \textbf{C}ritic and \textbf{CoLLM-DC} with \textbf{D}ecentralized \textbf{C}ritics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge.

URL PDF HTML ☆

赞 0 踩 0

2605.00412 2026-05-27 cs.AI cs.RO 版本更新

当VLM“修正”学生：多行手写数学OCR评估中的过度修正识别与惩罚

Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim

发表机构 * Electronics and Telecommunications Research Institute（电子通信研究所）

AI总结针对多行手写数学OCR评估中VLM过度修正问题，提出基于LLM的语义评估指标PINK，有效惩罚过度修正，在FERMAT数据集上优于BLEU。

详情

AI中文摘要

手写数学的准确转录对于教育AI系统至关重要，但当前基准未能正确评估这一能力。大多数先前研究关注单行表达式，并依赖BLEU等词汇指标，无法评估跨多行学生解决方案的语义推理。本文首次系统研究多行手写数学光学字符识别（OCR），揭示了视觉语言模型（VLM）的一个关键失败模式：过度修正。这些模型往往“修正”错误，而非忠实地转录学生作品，从而隐藏了教育评估旨在检测的错误。为解决此问题，我们提出PINK（基于惩罚的INK分数），一种语义评估指标，利用大语言模型（LLM）进行基于评分标准的评分，并明确惩罚过度修正。我们在FERMAT数据集上对15个最先进的VLM进行全面评估，发现与BLEU相比出现显著的排名反转：GPT-4o等模型因激进的过度修正受到严重惩罚，而Gemini 2.5 Flash成为最忠实的转录者。此外，人类专家研究表明，PINK与人类判断的一致性显著更高（55.0%偏好，而BLEU为39.5%），为教育场景中的手写数学OCR提供了更可靠的评估框架。

英文摘要

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU's 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.

URL PDF HTML ☆

赞 0 踩 0

2603.13381 2026-05-27 cs.LG cs.AI 版本更新

Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

注意力投影中的非线性：非线性查询的情况

Marko Karbevski

发表机构 * Simplicity Technologies（简化科技）

AI总结本文提出用非线性残差替换注意力中的查询投影W_Q，通过瓶颈MLP实现，在GPT-3小模型上验证了性能提升。

Comments Accepted at the ICLR 2026 GRaM workshop: https://openreview.net/forum?id=pwdnneFiNZ#discussion

2512.05794 2026-05-27 cs.LG cs.AI q-bio.QM 版本更新

Mechanistic Interpretability of Antibody Language Models Using SAEs

使用 SAE 对抗体语言模型的机制可解释性研究

Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane

发表机构 * Department of Statistics, University of Oxford, UK（英国牛津大学统计系）； Reticular, San Francisco, USA（美国旧金山Reticular公司）； EECS, MIT, Cambridge MA, USA（美国麻省理工学院电子工程与计算机科学系）； Leyden Laboratories BV, Leiden, The Netherlands（荷兰莱顿实验室）

AI总结本研究采用 TopK 和 Ordered 稀疏自编码器（SAE）对抗体语言模型进行机制可解释性分析，发现 TopK SAE 能揭示有意义的生物学潜在特征但无法保证生成控制，而 Ordered SAE 通过层次结构可靠识别可操控特征但激活模式更复杂。

Comments v3: 15 pages; corrected author list and affiliations in the main text; minor text changes; updated steering results following minor code changes; conclusions and findings remain unchanged; included link to data and code in the Data Availability section

2604.21454 2026-05-27 cs.CL cs.AI 版本更新

Reasoning Primitives in Hybrid and Non-Hybrid LLMs: Do Architectural Differences Yield Advantages in State-Tracking and Recall?

混合与非混合大语言模型中的推理原语：架构差异在状态追踪和召回中是否带来优势？

Shivam Rawat, Lucie Flek, Florian Mai, Nicholas Kluge Corrêa

发表机构 * Lamarr Institute for Machine Learning and Artificial Intelligence（拉玛尔机器学习与人工智能研究所）； Rheinische Friedrich-Wilhelms-Universität Bonn（波恩莱茵河弗里德里希-威廉大学）

AI总结本研究通过五个受控任务族比较了Transformer和混合架构在状态召回任务上的表现，发现推理增强是主要优势因素，而混合架构的优势较窄且依赖于任务。

详情

AI中文摘要

大型语言模型中的推理通常被视为单一能力，但其部分收益可能源于更简单的底层操作。我们通过五个以状态召回为中心的控制任务族，研究了两种这样的原语——召回和状态追踪，并比较了匹配的Transformer和混合架构（有无推理增强）。在整个套件中，推理增强变体显著优于仅指令变体，通常差距很大。这一模式与“状态超越令牌”观点一致：外部化推理痕迹之所以有帮助，是因为它们在令牌空间中向前传递中间状态。相比之下，一旦推理令牌可用，混合归纳偏置在准确性上并不产生统一优势。当架构差异确实出现时，它们遵循任务结构：混合Think模型在严格顺序的链式更新上更稳健，而Transformer Think模型在平面多跳检索上更稳健。因此，我们将本研究的主要贡献视为对状态召回任务性能驱动因素的描述性说明：推理令牌增强似乎是主导因素，而混合优势更窄、依赖于任务，并且可能更多关乎推理效率而非整体能力。我们还发布了重现这些结果所需的代码库和数据。

英文摘要

Reasoning in large language models is often discussed as a single capability, but some of its gains may stem from simpler underlying operations. We examine two such primitives, recall and state-tracking, through five controlled task families centered on state-based recall, and compare matched transformer and hybrid architectures with and without reasoning augmentation. Across the suite, reasoning-augmented variants substantially outperform instruction-only variants, often by large margins. This pattern is consistent with the State over Tokens view: externalized reasoning traces help because they carry the intermediate state forward in token space. By contrast, hybrid inductive bias does not yield a uniform advantage in accuracy once reasoning tokens are available. When architectural differences do appear, they follow task structure: the hybrid Think model is more robust on strictly sequential chained updates, whereas the transformer Think model is more robust on flat multi-hop retrieval. We therefore cast the main contribution of this study as a descriptive account of what drives performance on state-based recall tasks: reasoning-token augmentation appears to be the dominant factor, while hybrid advantages are narrower, task-dependent, and potentially more about inference efficiency than overall capability. We also release the codebase and data required to reproduce these results.

URL PDF HTML ☆

赞 0 踩 0

2604.19667 2026-05-27 cs.CL cs.AI cs.CV cs.LG cs.MA 版本更新

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Chat2Workflow: 用自然语言生成可执行可视化工作流的基准

Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang

发表机构 * Zhejiang University（浙江大学）； Tencent（腾讯）

AI总结提出Chat2Workflow基准，用于评估大语言模型从自然语言生成可执行可视化工作流的能力，并设计了一个智能体基线以提升性能。

Comments Work in progress

详情

AI中文摘要

目前，可执行的可视化工作流已成为实际工业部署中的主流范式，提供了强大的可靠性和可控性。然而，在当前实践中，此类工作流几乎完全通过手动工程构建：开发人员必须仔细设计工作流，为每个步骤编写提示，并随着需求的变化反复修改逻辑——这使得开发成本高昂、耗时且容易出错。为了研究大语言模型能否自动化这一多轮交互过程，我们引入了Chat2Workflow，一个直接从自然语言生成可执行可视化工作流的基准，并提出了一个稳健的智能体基线以提高性能。该基准基于大量真实业务工作流构建，每个实例的设计使得生成的工作流可以转换并直接部署到实际工作流平台（如Dify和Coze）上。实验结果表明，尽管最先进的语言模型通常能捕捉高层次意图，但在生成正确、稳定且可执行的工作流方面仍存在困难，尤其是在面对复杂且不断变化的需求时。尽管我们的智能体基线带来了高达6.05%的解决率提升，但剩余的现实差距使Chat2Workflow成为推进工业级自动化的基础。代码可在https://github.com/zjunlp/Chat2Workflow获取。

英文摘要

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve -- making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic baseline to improve performance. The benchmark is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially given complex and evolving requirements. Although our agentic baseline yields up to 6.05% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.

URL PDF HTML ☆

赞 0 踩 0

2604.18751 2026-05-27 cs.LG cs.AI stat.ME stat.ML 版本更新

Beyond Coefficients: Forecast-Necessity Testing for Interpretable Causal Discovery in Nonlinear Time-Series Models

超越系数：非线性时间序列模型中可解释因果发现的预测必要性检验

Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge

发表机构 * Lucy Family Institute for Data & Society（数据与社会联合研究所）； University of Notre Dame（诺特大学）； Department of Political Science（政治学系）

AI总结针对非线性时间序列模型中因果分数被误读为回归系数的问题，提出基于边消融和预测比较的预测必要性检验框架，以评估因果关系的实际必要性。

详情

DOI: 10.32473/flairs.39.1

AI中文摘要

非线性机器学习模型越来越多地用于发现时间序列数据中的因果关系，但其输出的解释仍不明确。特别是，正则化神经自回归模型产生的因果分数常被视为回归系数的类比，导致误导性的统计显著性声明。在本文中，我们认为非线性时间序列模型中的因果相关性应通过预测必要性而非系数大小来评估，并提出了一种实用的评估程序。我们提出了一个基于系统边消融和预测比较的可解释评估框架，用于测试候选因果关系是否对准确预测是必要的。以神经加性向量自回归作为案例研究模型，我们将该框架应用于一个关于民主发展的真实世界案例研究，该案例将面板数据（139个国家的民主指标）建模为多元时间序列。我们表明，具有相似因果分数的关系由于冗余、时间持久性和特定制度效应，其预测必要性可能差异巨大。我们的结果展示了预测必要性检验如何支持应用AI系统中更可靠的因果推理，并为在高风险领域解释非线性时间序列模型提供实用指导。

英文摘要

Nonlinear machine-learning models are increasingly used to discover causal relationships in time-series data, yet the interpretation of their outputs remains poorly understood. In particular, causal scores produced by regularized neural autoregressive models are often treated as analogues of regression coefficients, leading to misleading claims of statistical significance. In this paper, we argue that causal relevance in nonlinear time-series models should be evaluated through forecast necessity rather than coefficient magnitude, and we present a practical evaluation procedure for doing so. We present an interpretable evaluation framework based on systematic edge ablation and forecast comparison, which tests whether a candidate causal relationship is required for accurate prediction. Using Neural Additive Vector Autoregression as a case study model, we apply this framework to a real-world case study of democratic development, modeled as a multivariate time series of panel data - democracy indicators across 139 countries. We show that relationships with similar causal scores can differ dramatically in their predictive necessity due to redundancy, temporal persistence, and regime-specific effects. Our results demonstrate how forecast-necessity testing supports more reliable causal reasoning in applied AI systems and provides practical guidance for interpreting nonlinear time-series models in high-stakes domains.

URL PDF HTML ☆

赞 0 踩 0

2504.01733 2026-05-27 cs.AI cs.CC cs.LO 版本更新

Epistemic Skills: Reasoning about Knowledge and Oblivion

认知技能：关于知识与遗忘的推理

Xiaolong Liang, Yì N. Wáng

发表机构 * School of Philosophy, Shanxi University, Taiyuan, Shanxi, China（山西大学哲学学院）； School of Philosophy and Social Development, Shandong University, Jinan, Shandong, China（山东大学哲学与社会发展学院）

AI总结本文提出一类认知逻辑，通过加权模型系统引入“认知技能”度量，将知识获取建模为技能提升、遗忘建模为技能下降，并研究可知性与可遗忘性以及de re与de dicto表达的区别，分析了模型检测和可满足性的计算复杂性。

详情

DOI: 10.46298/lmcs-22(2:21)2026
Journal ref: Logical Methods in Computer Science, Volume 22, Issue 2 (May 25, 2026) lmcs:15460

AI中文摘要

本文提出了一类认知逻辑，用于捕捉获取知识和陷入遗忘的动态过程，同时融入群体知识的概念。该方法基于加权模型系统，引入“认知技能”度量来表示与知识更新相关的认知能力。在此框架内，知识获取被建模为技能提升的过程，而遗忘则被表示为技能下降的结果。该框架进一步支持探索“可知性”和“可遗忘性”，分别定义为通过技能提升获得知识的潜力和通过技能下降陷入遗忘的潜力。此外，它还支持对认知de re与de dicto表达之间区别的详细分析。研究了模型检测和可满足性问题的计算复杂性，提供了对其理论基础和实际意义的洞察。

英文摘要

This paper presents a class of epistemic logics that captures the dynamics of acquiring knowledge and descending into oblivion, while incorporating concepts of group knowledge. The approach is grounded in a system of weighted models, introducing an ``epistemic skills'' metric to represent the epistemic capacities tied to knowledge updates. Within this framework, knowledge acquisition is modeled as a process of upskilling, whereas oblivion is represented as a consequence of downskilling. The framework further enables exploration of ``knowability'' and ``forgettability,'' defined as the potential to gain knowledge through upskilling and to lapse into oblivion through downskilling, respectively. Additionally, it supports a detailed analysis of the distinctions between epistemic de re and de dicto expressions. The computational complexity of the model checking and satisfiability problems is examined, offering insights into their theoretical foundations and practical implications.

URL PDF HTML ☆

赞 0 踩 0

2604.18179 2026-05-27 cs.CR cs.AI 版本更新

Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

托管LLM中审计会话替换检测的承诺SAE特征轨迹

Ziyang Liu

AI总结提出一种承诺-开放协议，通过Merkle树提交稀疏自编码器特征轨迹，以检测托管LLM提供商在服务中静默替换模型的行为。

Comments We identified inaccuracies in the security analysis: the closed-form intrinsic-dimension lower bound on the feature-forgery attacker (Proposition 4.2, Section 4, Appendix V) and the cross-backend noise calibration for the joint z-score threshold (Section 5.1, Table 2). These affect the claimed attack-resistance guarantees. We are withdrawing the paper to correct them before resubmission

详情

AI中文摘要

托管LLM提供商存在静默替换的动机：宣传更强的模型，同时提供更便宜的回复。诸如SVIP的探测后返回方案存在并行服务的侧信道，因为不诚实的提供商可以将验证者的探测路由到广告模型，同时为普通用户提供替代模型。我们提出一种承诺-开放协议来弥补这一漏洞。在任何开放请求之前，提供商通过Merkle树提交其在发布探测层上服务输出的每个位置稀疏自编码器（SAE）特征轨迹草图。验证者打开随机位置，根据公共命名电路探测库（经过跨后端噪声校准）进行评分，并使用固定阈值联合一致性z分数规则做出决策。我们在三个骨干模型上实例化该协议——Qwen3-1.7B、Gemma-2-2B，以及扩展到Gemma-2-9B（配备131k特征SAE）的4.5倍规模。在17种攻击者中，包括同族提升、跨族替代和秩<=128的自适应LoRA，所有攻击者都在共享的尺度稳定阈值下被拒绝；相同的攻击者都规避了匹配的SVIP风格并行服务基线。一种通过冻结SAE编码器反向传播的白盒端到端攻击并未缩小差距，而一种从不运行M_hon的特征伪造攻击者通过内在维度论证被封闭形式地限制。承诺在批大小为32时，仅增加不超过2.1%的前向计算时间。

英文摘要

Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such as SVIP leave a parallel-serve side-channel, since a dishonest provider can route the verifier's probe to the advertised model while serving ordinary users from a substitute. We propose a commit-open protocol that closes this gap. Before any opening request, the provider commits via a Merkle tree to a per-position sparse-autoencoder (SAE) feature-trace sketch of its served output at a published probe layer. A verifier opens random positions, scores them against a public named-circuit probe library calibrated with cross-backend noise, and decides with a fixed-threshold joint-consistency z-score rule. We instantiate the protocol on three backbones -- Qwen3-1.7B, Gemma-2-2B, and a 4.5x scale-up to Gemma-2-9B with a 131k-feature SAE. Of 17 attackers spanning same-family lifts, cross-family substitutes, and rank-<=128 adaptive LoRA, all are rejected at a shared, scale-stable threshold; the same attackers all evade a matched SVIP-style parallel-serve baseline. A white-box end-to-end attack that backpropagates through the frozen SAE encoder does not close the margin, and a feature-forgery attacker that never runs M_hon is bounded in closed form by an intrinsic-dimension argument. Commitment adds <=2.1% to forward-only wall-clock at batch 32.

URL PDF HTML ☆

赞 0 踩 0

2604.18103 2026-05-27 cs.AI 版本更新

Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

稳定性意味着冗余：Delta注意力选择性停止用于高效长上下文预填充

Yujie Chen, Tailai Chen, Yifeng Gao, Zoe Wanying He, Yijue Xu, Shaobo Wang, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University（上海交通大学）； University of California, San Diego（加州大学圣地亚哥分校）； Carnegie Mellon University（卡内基梅隆大学）

AI总结针对长上下文场景中预填充计算成本高的问题，提出一种无需训练的Delta注意力选择性停止策略（DASH），通过监控自注意力层更新动态来停止稳定令牌的处理，从而在不牺牲模型准确性和硬件效率的前提下实现预填充加速。

Comments Accepted to ACL 2026 main conference

2510.06133 2026-05-27 cs.CL cs.AI 版本更新

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit

CreditDecoding: 利用轨迹信用加速扩散大语言模型中的并行解码

Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, Weiyao Lin

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Ant Group（蚂蚁集团）； Westlake University（西湖大学）

AI总结针对扩散大语言模型并行解码中正确令牌被反复重掩导致冗余迭代的问题，提出基于轨迹信用的无训练并行解码方法CreditDecoding，融合历史证据与当前logits提升低置信度正确令牌的置信度，实现高达5.48倍加速并提升准确性。

Comments 19 pages, 13 figures, 9 tables, Accepted to ACL 2026 main conference

详情

AI中文摘要

扩散大语言模型（dLLMs）通过迭代去噪生成文本。在普遍采用的并行解码方案中，每一步仅确认高置信度位置，而重掩其他位置。通过分析dLLM去噪轨迹，我们发现一个关键的低效问题：模型通常在目标令牌的置信度足够高以被解码之前的几个步骤就预测出正确令牌。这种早期预测与后期解码之间的差距导致已正确的令牌被反复重掩，造成冗余迭代并限制加速。为利用这种时间冗余，我们引入轨迹信用（Trace Credit），通过累积历史证据来量化令牌的解码潜力。基于此，我们提出CreditDecoding，一种无训练的并行解码方法，将轨迹信用与当前logits融合，以提升正确但低置信度令牌的置信度，从而加速去噪并提高鲁棒性。在八个基准测试上，CreditDecoding在LLaDA-8B上实现了高达5.48倍的加速和+0.48的准确率提升，并在多种dLLM架构和参数规模上持续改进性能。它还能扩展到长上下文，并与主流推理优化方法正交，使其成为一种实用且广泛适用的解决方案。

英文摘要

Diffusion large language models (dLLMs) generate text through iterative denoising. In commonly adopted parallel decoding schemes, each step confirms only high-confidence positions while remasking the others. By analyzing dLLM denoising traces, we uncover a key inefficiency: models often predict the correct target token several steps before its confidence becomes high enough to be decoded. This gap between early prediction and late decoding forces repeated remasking of already-correct tokens, causing redundant iterations and limiting acceleration. To exploit this temporal redundancy, we introduce Trace Credit to quantify a token's decoding potential by accumulating historical evidence. Building on this, we propose CreditDecoding, a training-free parallel decoding method that fuses Trace Credit with current logits to boost the confidence of correct but underconfident tokens, thereby accelerating denoising and improving robustness. On eight benchmarks, CreditDecoding achieves up to 5.48 times speedup with +0.48 accuracy on LLaDA-8B and consistently improves performance across diverse dLLM architectures and parameter scales. It further scales to long contexts and remains orthogonal to mainstream inference optimizations, making it a practical and widely applicable solution.

URL PDF HTML ☆

赞 0 踩 0

2604.14640 2026-05-27 cs.CL cs.AI 版本更新

Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models

Fact4ac在金融虚假信息检测挑战赛中的方法：通过微调和少样本提示的大语言模型实现无参考金融虚假信息检测

Cuong Hoang, Le-Minh Nguyen

发表机构 * KaiNKaiho

AI总结本文提出一种结合零样本/少样本提示和LoRA参数高效微调的大语言模型框架，用于无外部证据的金融虚假信息检测，在公开和私有测试集上分别达到95.4%和96.3%的准确率，获得竞赛第一名。

详情

DOI: 10.36190/2026.37
Journal ref: Proceedings of the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD 2026), 20th International AAAI Conference on Web and Social Media

AI中文摘要

金融虚假信息的泛滥对市场稳定和投资者信任构成严重威胁，误导市场行为并造成关键信息不对称。检测此类误导性叙述本身具有挑战性，尤其是在现实场景中，外部证据或用于交叉验证的补充参考资料严格不可用。本文介绍了我们在“无参考金融虚假信息检测”共享任务中的获胜方法。该任务基于最近提出的RFC-BENCH框架（Jiang等人，2026），挑战模型仅依赖内部语义理解和上下文一致性而非外部事实核查来判断金融声明的真实性。为应对这一艰巨的评估设置，我们提出了一个综合框架，利用最先进的大语言模型（LLM）的推理能力。我们的方法系统地集成了上下文学习（特别是零样本和少样本提示策略）以及通过低秩适应（LoRA）的参数高效微调（PEFT），以最优方式使模型与金融操纵的微妙语言线索对齐。我们提出的系统表现出卓越效果，成功在两个官方排行榜上均获得第一名。具体来说，我们在公开测试集上达到95.4%的准确率，在私有测试集上达到96.3%的准确率，突显了我们方法的鲁棒性，并有助于加速金融自然语言处理中上下文感知的虚假信息检测。我们的模型（14B和32B）可在https://huggingface.co/KaiNKaiho获取。

英文摘要

The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the "Reference-Free Financial Misinformation Detection" shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al. 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to optimally align the models with the subtle linguistic cues of financial manipulation. Our proposed system demonstrated superior efficacy, successfully securing the first-place ranking on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing. Our models (14B and 32B) are available at https://huggingface.co/KaiNKaiho.

URL PDF HTML ☆

赞 0 踩 0

2603.12564 2026-05-27 cs.CL cs.AI 版本更新

Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents

卖给我这支股票：LLM智能体中的不安全推荐漂移

Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

发表机构 * Centre for Artificial Intelligence, University College London（人工智能研究中心，伦敦大学学院）

AI总结研究LLM智能体在多轮金融推荐中因工具输出被操纵而产生风险不匹配推荐的问题，通过实验揭示评估盲区并分析机制。

详情

AI中文摘要

人们越来越多地使用LLM智能体进行多轮金融推荐，智能体通过工具获取市场数据并跨轮次跟踪用户偏好。当工具输出被操纵时，推荐不再匹配用户声明的风险偏好，但由于NDCG等标准指标仅衡量一般相关性，风险股票和安全股票的得分相同，因此指标显示一切正常。我们将这种差距称为评估盲区。我们在八个语言模型上回放23轮金融咨询对话，每段对话分别使用干净和被操纵的工具数据运行两次。质量得分与干净会话几乎相同，而智能体在65-99%的轮次中产生风险不匹配的推荐，所有八个模型一致。该机制在逐轮中可见：在1,840轮中，80%的风险评分引用逐字复现了被操纵的值，没有一轮提出质疑，高风险股票的安全语言框架比例从14%（Qwen2.5-7B）到69%（Claude Sonnet 4.6）不等。使前沿模型成为优秀智能体的特性——忠实地将其推理基于工具输出——也使其跟随被操纵的输出。损害并非由记忆驱动：仅污染当前轮次仍会产生95%的违规。模型内部能区分操纵（稀疏自编码器特征将对抗性扰动与随机扰动分开），但这并未转化为更安全的输出。激活层干预仅恢复不到6%的安全差距，提示级自我验证失败，因为自我检查读取了相同的被操纵数据，而参数化交叉检查在前沿模型上每轮以99-100%的比率标记污染，但整体适宜性仍未改变：智能体识别出篡改，但仍然推荐它。

英文摘要

People increasingly use LLM agents for multi-turn financial recommendations, where the agent pulls market data through tools and tracks user preferences across turns. When tool outputs are manipulated, the recommendations stop matching the user's stated risk profile, but because standard metrics like NDCG only score general relevance, risky and safe stocks score alike, so the metric says nothing went wrong. We call this gap evaluation blindness. We replay 23-turn financial advisory conversations across eight language models, running each dialogue twice with clean and manipulated tool data. Quality scores stay nearly identical to clean sessions while the agents produce risk-mismatched recommendations in 65-99% of turns, unanimous across all eight models. The mechanism is visible turn-by-turn: 80% of risk-score citations across 1,840 turns reproduce the manipulated value verbatim, not a single turn pushes back, and safe-language framing of high-risk stocks ranges from 14% (Qwen2.5-7B) to 69% (Claude Sonnet 4.6). The property that makes frontier models good agents, faithfully grounding their reasoning in tool outputs, also makes them follow manipulated ones. The damage is not memory-driven: contaminating only the current turn still produces 95% of the violations. The model internally distinguishes the manipulation (sparse autoencoder features separate adversarial from random perturbations), but this does not translate into safer output. Activation-level interventions recover under 6% of the safety gap, prompt-level self-verification fails because the self-check reads the same manipulated data, and a parametric cross-check that flags contamination at 99-100% per turn on a frontier model still leaves aggregate suitability unchanged: the agent identifies the tampering and recommends it anyway.

URL PDF HTML ☆

赞 0 踩 0

2604.11467 2026-05-27 cs.AI cs.HC cs.LG 版本更新

From Attribution to Action: A Human-Centered Application of Activation Steering

从归因到行动：激活导向的人本应用

Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin

发表机构 * Fraunhofer Heinrich-Hertz-Institut（弗劳恩霍夫 Heinrich-Hertz 研究所）； Technische Universität Berlin（柏林技术大学）； BIFOLD – Berlin Institute for the Foundations of Learning and Data（柏林学习与数据基础研究所）

AI总结提出结合SAE归因与激活导向的交互式工作流，通过专家访谈验证其能促进从检查到干预的转变，并揭示组件抑制等调试策略及潜在风险。

详情

AI中文摘要

可解释人工智能（XAI）方法揭示了哪些特征影响模型预测，但为实践者基于这些解释采取行动提供了有限的手段。通过XAI识别出的组件的激活导向为可操作的解释提供了一条路径，但其实际效用仍未得到充分研究。我们引入了一个交互式工作流，将基于SAE的归因与激活导向相结合，用于视觉模型中概念使用的实例级分析，并实现为一个基于网页的工具。基于此工作流，我们进行了半结构化专家访谈（N=8），在CLIP上执行调试任务，以调查实践者如何推理、信任和应用激活导向。我们发现，导向使得从检查转向基于干预的假设检验（8/8参与者），大多数参与者将信任建立在观察到的模型响应上，而非仅仅解释的合理性（6/8）。参与者采用了系统性的调试策略，其中组件抑制占主导（7/8），并指出了包括涟漪效应和实例级修正的有限泛化在内的风险。总体而言，激活导向使可解释性更具可操作性，同时为安全有效使用提出了重要考虑。

英文摘要

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.

URL PDF HTML ☆

赞 0 踩 0

2604.11056 2026-05-27 cs.LG cs.AI 版本更新

Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR

事后信用可驻留之处：RLVR中令牌更新的有符号容量视角

Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科学与技术大学（广州））； Huawei Technologies Ltd.（华为技术有限公司）

AI总结本文通过条件互信息分析RLVR中令牌级信用的容量上限，提出四象限分解区分更新方向，并设计HAPO算法进行容量引导的优势重分配，提升数学推理性能。

详情

AI中文摘要

具有可验证奖励的强化学习（RLVR）提升了大语言模型（LLMs）的推理能力，但稀疏的结果奖励使得令牌级信用分配变得困难。我们将令牌级信用视为从行为策略到事后后验的奖励条件偏移。在自回归RLVR中，这种偏移可以通过条件互信息（CMI）表示，这表明令牌熵限制了可能的事后信用上限。然而，熵指示的是容量而非更新方向，因此我们引入了四象限分解，根据奖励极性和令牌熵来分离更新。受控干预表明，这两个因素共同塑造了令牌更新。持续的推理增益集中在有符号的高熵象限，而低熵更新则迅速饱和。基于此分析，我们提出了事后感知策略优化（HAPO），这是对GRPO的一种符号保持修改，执行容量引导的优势重分配。在两个模型设置的数学推理基准上的实验表明，HAPO在熵感知基线中取得了有竞争力的性能。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of Large Language Models (LLMs), but sparse outcome rewards make token-level credit assignment difficult. We study token-level credit as a reward-conditioned shift from the behavior policy to a hindsight posterior. In autoregressive RLVR, this shift can be expressed through Conditional Mutual Information (CMI), which shows that token entropy upper-bounds possible hindsight credit. Entropy, however, indicates capacity rather than update direction, so we introduce the Four Quadrant Decomposition to separate updates by reward polarity and token entropy. Controlled interventions show that these two factors jointly shape token updates. Sustained reasoning gains concentrate in signed high-entropy quadrants, whereas low-entropy updates saturate quickly. Based on this analysis, we propose Hindsight-Aware Policy Optimization (HAPO), a sign-preserving modification to GRPO that performs capacity-guided advantage reallocation. Experiments on mathematical reasoning benchmarks in two model settings show that HAPO achieves competitive performance among entropy-aware baselines.

URL PDF HTML ☆

赞 0 踩 0

2604.10102 2026-05-27 cs.CV cs.AI 版本更新

Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

退化一致性配对训练用于鲁棒的AI生成图像检测

Zongyou Yang, Yinghan Hou, Xiaokun Yang

发表机构 * Department of Computer Science（计算机科学系）； University College London（伦敦大学学院）； Department of Earth Science and Engineering（地球科学与工程系）； Imperial College London（伦敦帝国理工学院）； School of Electronic Information（电子信息学院）

AI总结提出退化一致性配对训练（DCPT），通过特征一致性和预测一致性约束显式增强模型对JPEG压缩、高斯模糊等真实世界图像退化的鲁棒性，在Synthbuster基准上平均准确率提升9.1个百分点。

Comments 6 pages, 5 figures, 2 tables

详情

AI中文摘要

AI生成图像检测器在真实世界图像退化（如JPEG压缩、高斯模糊和分辨率降采样）下性能显著下降。我们观察到，包括B-Free在内的最先进方法将退化鲁棒性视为数据增强的副产品，而非明确的训练目标。在这项工作中，我们提出退化一致性配对训练（DCPT），这是一种简单而有效的训练策略，通过配对一致性约束显式增强鲁棒性。对于每张训练图像，我们构建一个干净视图和一个退化视图，然后施加两个约束：特征一致性损失，最小化干净表示和退化表示之间的余弦距离；以及基于对称KL散度的预测一致性损失，对齐两个视图的输出分布。DCPT不增加额外参数和推理开销。在Synthbuster基准（9个生成器，8种退化条件）上的实验表明，与没有配对训练的相同基线相比，DCPT将退化条件下的平均准确率提高了9.1个百分点，同时仅牺牲了0.9%的干净准确率。在JPEG压缩下改进最为显著（+15.7%至+17.9%）。消融实验进一步揭示，添加架构组件会导致在有限训练数据上过拟合，证实了对于退化鲁棒性，训练目标改进比架构增强更有效。

英文摘要

AI-generated image detectors suffer significant performance degradation under real-world image corruptions such as JPEG compression, Gaussian blur, and resolution downsampling. We observe that state-of-the-art methods, including B-Free, treat degradation robustness as a byproduct of data augmentation rather than an explicit training objective. In this work, we propose Degradation-Consistent Paired Training (DCPT), a simple yet effective training strategy that explicitly enforces robustness through paired consistency constraints. For each training image, we construct a clean view and a degraded view, then impose two constraints: a feature consistency loss that minimizes the cosine distance between clean and degraded representations, and a prediction consistency loss based on symmetric KL divergence that aligns output distributions across views. DCPT adds zero additional parameters and zero inference overhead. Experiments on the Synthbuster benchmark (9 generators, 8 degradation conditions) demonstrate that DCPT improves the degraded-condition average accuracy by 9.1 percentage points compared to an identical baseline without paired training, while sacrificing only 0.9% clean accuracy. The improvement is most pronounced under JPEG compression (+15.7% to +17.9%). Ablation further reveals that adding architectural components leads to overfitting on limited training data, confirming that training objective improvement is more effective than architectural augmentation for degradation robustness.

URL PDF HTML ☆

赞 0 踩 0

2509.21882 2026-05-27 cs.LG cs.AI 版本更新

别听我的！多轮对话如何降低LLM的可靠性

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

发表机构 * Vanderbilt University（范德比尔大学）； Vanderbilt University Medical Center（范德比尔大学医学中心）； Intuit AI Research（Intuit人工智能研究）

AI总结提出“坚持或切换”(SoS)框架，通过将问答空间分割为多个顺序呈现来评估LLM在多轮对话中的可靠性，发现对话税导致准确性和拒绝错误建议的能力平均下降30%，并观察到盲目切换现象。

详情

AI中文摘要

大型语言模型（LLM）在静态基准测试中表现出色，但它们在更能反映实际使用的多轮对话中的性能仍未得到充分研究。解决这一差距在医疗保健等高风险环境中至关重要，因为患者和临床医生正在转向LLM聊天机器人来处理他们的医疗咨询。在这里，我们引入了“坚持或切换”（SoS）框架，该框架将问答空间划分为多个顺序呈现，以模拟两种以安全为中心的行为：坚持（即坚持正确的答案选择或拒绝错误的建议）和灵活性（即在引入正确建议时切换到该建议）。在三个临床基准测试中评估了17个LLM，我们观察到普遍存在的对话税，其中将答案空间分割为顺序呈现使端到端准确性和对错误建议的拒绝率平均下降高达30%，在某些模型中达到65%。我们还观察到盲目切换，即模型从初始拒绝转向错误和正确建议的比率几乎相同，达到50%。最后，我们表明，增加模型规模可以缓解其中一些对话效率低下的问题，但会加剧其他问题，例如从初始拒绝中采纳错误建议的倾向更高。我们的研究结果共同表明，静态基准测试所捕获的一般能力并不能推广到多轮对话中。

英文摘要

Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-world usage, remains understudied. Addressing this gap is critical in high-stakes settings like healthcare, where patients and clinicians are turning to LLM chatbots to address their medical inquiries. Here, we introduce the "stick-or-switch" (SoS) framework, which partitions a question-answer space into multiple sequential presentations to model two safety-centric behaviors: conviction (i.e., sticking to a correct answer selection or abstention against incorrect suggestions) and flexibility (i.e., switching to a correct suggestion when it is introduced). Evaluating 17 LLMs across three clinical benchmarks, we observe a pervasive conversation tax, where partitioning an answer-space into sequential presentations reduces end-to-end accuracy and abstention against incorrect suggestions by an average of up to 30%, reaching 65% in certain models. We also observe blind switching, where models transition an initial abstention to incorrect and correct suggestions at near-identical rates reaching 50%. Finally, we show that increasing model scale mitigates some of these conversational inefficacies while exacerbating others, such as a higher propensity to adopt an incorrect suggestion from an initial abstention. Together our findings demonstrate that the general proficiency captured by static benchmarks do not translate over multi-turn dialogues.

URL PDF HTML ☆

赞 0 踩 0

2604.07028 2026-05-27 cs.MA cs.AI cs.CL 版本更新

Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

基于特质条件的多智能体系统在迭代法律论证中的战略说服

Philipp D. Siedler

发表机构 * Aleph Alpha Research（Aleph Alpha研究）

AI总结提出战略法庭框架，通过特质条件化的大语言模型智能体模拟多轮法律辩论，发现异质团队表现更优，并引入强化学习特质编排器动态优化辩护策略。

详情

AI中文摘要

在诸如法律、外交和谈判等对抗性领域中的战略互动是通过语言中介的，然而大多数博弈论模型忽略了通过话语运作的说服机制。我们提出了战略法庭框架，这是一个多智能体模拟环境，其中由特质条件化的大语言模型（LLM）智能体组成的控方和辩方团队参与迭代的、基于轮次的法律论证。智能体使用九种可解释的特质进行实例化，这些特质被组织成四种原型，从而能够系统控制修辞风格和战略取向。我们在10个合成法律案例和84个三特质团队配置上评估该框架，使用DeepSeek-R1和Gemini 2.5 Pro进行了超过7,000次模拟试验。我们的结果表明，具有互补特质的异质团队始终优于同质配置，适度的交互深度产生更稳定的判决，并且某些特质（特别是量化和魅力型）对说服成功贡献不成比例。我们进一步引入了一个基于强化学习的特质编排器，该编排器根据案件和对手团队动态生成辩护特质，发现优于静态、人类设计的特质组合的策略。这些发现共同证明了语言可以被视为第一类战略行动空间，并为构建能够在多智能体环境中进行自适应说服的自主智能体提供了基础。

英文摘要

Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic models abstract away the mechanisms of persuasion that operate through discourse. We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation. Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation. We evaluate the framework across 10 synthetic legal cases and 84 three-trait team configurations, totaling over 7{,}000 simulated trials using DeepSeek-R1 and Gemini~2.5~Pro. Our results show that heterogeneous teams with complementary traits consistently outperform homogeneous configurations, that moderate interaction depth yields more stable verdicts, and that certain traits (notably quantitative and charismatic) contribute disproportionately to persuasive success. We further introduce a reinforcement-learning-based Trait Orchestrator that dynamically generates defense traits conditioned on the case and opposing team, discovering strategies that outperform static, human-designed trait combinations. Together, these findings demonstrate how language can be treated as a first-class strategic action space and provide a foundation for building autonomous agents capable of adaptive persuasion in multi-agent environments.

URL PDF HTML ☆

赞 0 踩 0

2604.06550 2026-05-27 cs.CR cs.AI 版本更新

SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

SkillSieve：一种用于检测恶意AI代理技能的分层分流框架

Yinghan Hou, Zongyou Yang, Zaihu Pang, Xiujun Ma

发表机构 * Department of Earth Science and Engineering（地球科学与工程系）； Imperial College London（帝国理工学院伦敦分校）； Department of Computer Science（计算机科学系）； University College London（伦敦大学学院）； Lingban Technology Co., Ltd.（灵伴科技有限公司）； State Key Laboratory of General Artificial Intelligence, Peking University（北京大学通用人工智能国家重点实验室）

AI总结提出SkillSieve三层检测框架，通过启发式评分、LLM子任务分析和多LLM陪审团辩论，高效检测恶意AI代理技能，在390个技能基准上达到F1=0.920。

Comments 10 pages, 2 figures, 6 tables

详情

AI中文摘要

OpenClaw的ClawHub市场托管了数万个社区贡献的代理技能（我们2026-04-04快照中有49,592个），最近的审计报告显示13-26%包含安全漏洞。正则表达式扫描器无法检测混淆的有效载荷；形式化静态分析器无法读取隐藏提示注入和社会工程的SKILL.md自然语言指令。这两种方法都无法覆盖两种模态。 SkillSieve是一个三层检测框架，仅在需要时应用更深入的分析。第1层通过召回调优的启发式评分器运行正则表达式、AST和元数据检查，过滤掉86%的数据量。第2层将可疑技能路由到LLM，将分析拆分为四个并行的子任务，并输出结构化结果。第3层将高风险技能提交给由三个LLM组成的陪审团，它们独立投票并在意见分歧时进行辩论。我们在49,592个真实的ClawHub技能和跨越五种规避技术的对抗样本上进行了评估，在440美元的ARM单板计算机上运行该管道。在390个技能的标记基准上，SkillSieve以每个技能0.006美元的成本实现了F1=0.920（精确率0.912，召回率0.929）。可选的XGBoost快速路径减少了32%的第2/3层LLM调用，F1下降1.6点，同时保持了全管道的召回率（0.929）。为了跨生态系统泛化，我们将该框架适配到飞书/Lark，并扫描了52个真实包，其中第2层纠正了第1层因领域特定习语产生的误报，表明了一条低成本的适配路径到类似的企业平台。我们将SkillSieve部署为飞书聊天机器人，用于实时技能审查。代码、数据和基准已开源。

英文摘要

OpenClaw's ClawHub marketplace hosts tens of thousands of community-contributed agent skills (49,592 in our 2026-04-04 snapshot), and recent audits report that 13-26% contain security vulnerabilities. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural-language SKILL.md instructions that hide prompt injection and social engineering. Neither approach covers both modalities. SkillSieve is a three-layer detection framework that applies deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through a recall-tuned heuristic scorer, filtering 86% of the volume. Layer 2 routes suspicious skills to an LLM, splitting the analysis into four parallel sub-tasks with structured outputs. Layer 3 puts high-risk skills before a jury of three LLMs that vote independently and debate when they disagree. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the pipeline on a 440 USD ARM single-board computer. On a 390-skill labeled benchmark, SkillSieve achieves F1 = 0.920 (precision 0.912, recall 0.929) at 0.006 USD per skill. An optional XGBoost fast-path cuts 32% of Layer-2/3 LLM calls with a 1.6-point F1 reduction, while preserving full-pipeline recall (0.929). For cross-ecosystem generalization, we adapt the framework to Feishu/Lark and scan 52 real packages, where Layer 2 corrects Layer 1 false positives from domain-specific idioms, suggesting a low-cost adaptation path to similar enterprise platforms. We deploy SkillSieve as a Feishu chat bot for real-time skill vetting. Code, data, and benchmark are open-sourced.

URL PDF HTML ☆

赞 0 踩 0

2604.07190 2026-05-27 cs.CY cs.AI cs.LG 版本更新

The ATOM Report: Measuring the Open Language Model Ecosystem

ATOM报告：衡量开放语言模型生态系统

Nathan Lambert, Florian Brand

发表机构 * Interconnects AI

AI总结本研究通过分析约1500个主流开放语言模型（如阿里巴巴的Qwen、DeepSeek、Meta的Llama）的下载量、衍生模型、推理市场份额和性能指标，揭示了2025年夏季中国模型超越美国模型并持续扩大差距的趋势。

Comments 23 pages, 17 figures

2604.04948 2026-05-27 cs.IR cs.AI cs.LG 版本更新

From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

从PDF到RAG就绪：评估面向特定领域问答的文档转换框架

José Guilherme Marques dos Santos, Ricardo Yang, Rui Humberto Pereira, Alexandre Sousa, Brígida Mónica Faria, Henrique Lopes Cardoso, José Duarte, José Luís Reis, Luís Paulo Reis, Pedro Pimenta, José Paulo Marques dos Santos

发表机构 * Faculty of Engineering, University of Porto（葡萄牙波尔图大学工程学院）； Department of Business Administration, University of Maia（马亚大学商业管理系）； LIACC—Artificial Intelligence and Computer Science Laboratory, University of Porto（葡萄牙波尔图大学人工智能与计算机科学实验室）； Department of Communication Sciences and Information Technologies, University of Maia（马亚大学通讯科学与信息科技系）； School of Health, Polytechnic of Porto（波尔图理工学院健康学院）； School of Technology and Management, Polytechnic Institute of Maia（马亚理工学院技术与管理学院）

AI总结通过系统比较四种开源PDF转Markdown框架的21种流水线配置，发现文档预处理质量（尤其是层次化分块和元数据增强）对RAG系统问答准确率的影响远超转换工具本身，最佳配置（Docling+层次化分块+图像描述）达到94.1%准确率，超越人工整理。

Comments 27 pages, 3 figures, 7 tables

详情

DOI: 10.3390/app16105069
Journal ref: Applied Sciences 16 (2026) 5069

AI中文摘要

检索增强生成（RAG）系统严重依赖文档预处理的质量，然而尚无先前研究通过评估PDF处理框架对下游问答准确性的影响来填补这一空白。我们通过系统比较四种开源PDF到Markdown转换框架——Docling、MinerU、Marker和DeepSeek OCR——在21种流水线配置下的表现，这些配置在转换工具、清洗变换、分块策略和元数据增强方面有所变化。评估使用了一个包含36份葡萄牙语行政文档（1706页，约49.2万词）的语料库上的50个问题基准，每个配置通过LLM作为裁判进行超过50次独立运行的评分。通过Wilcoxon符号秩检验和Cohen's d效应量评估统计显著性。两个基线界定了结果范围：朴素的PDFLoader（86.2%）和人工整理的Markdown（91.3%）。采用层次化分块和图像描述的Docling实现了最高的自动准确率（94.1±1.6%），甚至超越了人工整理。按问题类型分析显示，依赖表格的问题导致了最大的准确率差异，在基本分块和层次化分块之间存在33个百分点的差距。元数据增强和层次感知分块对准确率的贡献超过了转换框架本身。探索性的GraphRAG实现表现不如基本RAG（82%对比94.1%）。这些发现表明，数据准备质量是RAG系统性能的主导因素。

英文摘要

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 21 pipeline configurations, varying the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a 50-question benchmark over a corpus of 36 Portuguese administrative documents (1706 pages, ~492K words), with LLM-as-judge scoring over 50 independent runs per configuration. Statistical significance was assessed via Wilcoxon signed-rank tests with Cohen's d effect sizes. Two baselines bounded the results: naïve PDFLoader (86.2%) and manually curated Markdown (91.3%). Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1 +/- 1.6%), surpassing even manual curation. A per-question-type analysis revealed that table-dependent questions drive the largest accuracy differences, with a 33-percentage-point gap between basic and hierarchical splitting. Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework alone. An exploratory GraphRAG implementation underperformed basic RAG (82% vs. 94.1%). These findings demonstrate that data preparation quality is the dominant factor in RAG system performance.

URL PDF HTML ☆

赞 0 踩 0

2604.04940 2026-05-27 cs.AI 版本更新

ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback

ReVEL：基于结构化性能反馈的多轮反思式LLM引导的启发式进化

Cuong Van Duc, Minh Nguyen Dinh Tuan, Tam Vu Duc, Tung Vu Duy, Son Nguyen Van, Hanh Nguyen Thi, Binh Huynh Thi Thanh

发表机构 * Hanoi University of Science and Technology（河内科学技术大学）； Phenikaa University（Phenikaa大学）

AI总结针对NP-hard组合优化问题的启发式设计，提出ReVEL框架，通过行为感知分组和多轮迭代细化，利用LLM和累积性能反馈联合优化启发式，实验表明优于现有LLM引导的进化基线。

详情

AI中文摘要

为NP-hard组合优化问题设计有效的启发式仍然具有挑战性，通常需要大量的领域专业知识。最近的LLM引导的进化方法在自动启发式生成方面显示出前景，但大多数现有方法独立地或通过有限的成对反馈来细化启发式。我们提出ReVEL：基于结构化性能反馈的多轮反思式LLM引导的启发式进化，一个用于群体式多轮启发式细化的框架。ReVEL将启发式组织成行为感知的反思组，包括用于局部细化的相似性驱动组和用于探索性搜索的多样性驱动组。在每个组内，LLM使用累积的性能反馈执行迭代多轮细化，使得相关启发式能够在进化迭代中被联合分析和逐步改进。在标准组合优化基准上的实验表明，ReVEL在多种设置和LLM骨干下通常优于现有的LLM引导的进化基线。额外分析表明，行为感知分组有助于在迭代启发式进化过程中实现更一致的细化轨迹。

英文摘要

Designing effective heuristics for NP-hard combinatorial optimization problems remains challenging and often requires substantial domain expertise. Recent LLM-guided evolutionary methods have shown promise for automated heuristic generation, but most existing approaches refine heuristics independently or through limited pairwise feedback. We propose ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback, a framework for group-wise multi-turn heuristic refinement. ReVEL organizes heuristics into behavior-aware reflective groups, including similarity-driven groups for localized refinement and diversity-driven groups for exploratory search. Within each group, the LLM performs iterative multi-turn refinement using accumulated performance feedback, enabling related heuristics to be jointly analyzed and progressively improved across evolutionary iterations. Experiments on standard combinatorial optimization benchmarks show that ReVEL generally improves optimization performance over existing LLM-guided evolutionary baselines across multiple settings and LLM backbones. Additional analyses suggest that behavior-aware grouping contributes to more consistent refinement trajectories during iterative heuristic evolution.

URL PDF HTML ☆

赞 0 踩 0

1403.1076 2026-05-27 cs.AI 版本更新

A Discussion to Qualify Intelligence

关于智能定义的探讨

Kieran Greer

发表机构 * Distributed Computing Systems（分布式计算系统）

AI总结本文试图提出一个适用于自然世界和人工智能的统一智能定义，基于Kolmogorov复杂性理论提出度量标准，并区分智能与意识的不同。

Comments Newly edited version

详情

DOI: 10.64897/si.2026v2i1.001
Journal ref: Scientific Insights, 2(1), pp. 1 - 15

AI中文摘要

我们对智能的理解主要针对人类水平。本文试图给出一个更统一的定义，可应用于整个自然世界，然后应用于人工智能。该定义更侧重于定性而非定量，并可能有助于对此问题做出判断。虽然正确行为是首选定义，但本文提出了一种基于Kolmogorov复杂性理论的度量标准，该标准引出了关于熵的测量。随后，本文提出了一种公认的人工智能测试版本作为“酸性测试”，这可能是自由思维程序试图实现的目标。作者最近的工作更多是从机械过程的角度出发，基于结构构建。本文认为智能是一种主动事件，但也注意到其背后存在一个机械性的次要方面。本文建议将智能和意识视为略有不同，其中意识是更机械的方面。事实上，一个令人惊讶的结论是，一个被动但智能的大脑可能由主动但不太智能的感官所激发。

英文摘要

Our understanding of intelligence is directed primarily at the human level. This paper attempts to give a more unifying definition that can be applied to the natural world in general and then Artificial Intelligence. The definition would be used more to qualify than quantify it and might help when making judgements on the matter. While correct behaviour is the preferred definition, a metric that is grounded in Kolmogorov's Complexity Theory is suggested, which leads to a measurement about entropy. A version of an accepted AI test is then put forward as the 'acid test' and might be what a free-thinking program would try to achieve. Recent work by the author has been more from a direction of mechanical processes, built from structure. This paper agrees that intelligence is a pro-active event, but also notes a second aspect to it that is in the background and mechanical. The paper suggests looking at intelligence and the conscious as being slightly different, where the conscious is this more mechanical aspect. In fact, a surprising conclusion can be a passive but intelligent brain being invoked by active and less intelligent senses.

URL PDF HTML ☆

赞 0 踩 0

2604.03785 2026-05-27 cs.AI cs.MA 版本更新

Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning

跨时间步延迟下合作多智能体强化学习中的通信增益与延迟代价

Zihong Gao, Hongjian Liang, Lei Hao, Liangjun Ke

发表机构 * The State Key Laboratory for Manufacturing Systems Engineering（制造系统工程国家重点实验室）； School of Automation Science and Engineering, Xi’an Jiaotong University（西安交通大学自动化科学与工程学院）

AI总结针对部分可观测环境中跨时间步通信延迟导致的信息错位问题，提出通信增益与延迟代价（CGDC）度量，并基于此设计演员-评论家框架CDCMA，通过预测未来观测和注意力融合延迟消息来提升合作多智能体强化学习的性能、鲁棒性和泛化能力。

详情

AI中文摘要

在部分可观测的\emph{合作}多智能体强化学习中，通信对于协调至关重要，然而\emph{跨时间步}延迟会导致消息在生成后多个时间步才到达，造成时间错位，使得信息在消费时变得陈旧。我们将此设定形式化为延迟通信部分可观测马尔可夫博弈（DeComm-POMG），并将消息的影响分解为\emph{通信增益}和\emph{延迟代价}，从而得到通信增益与延迟代价（CGDC）度量。我们进一步建立了一个价值损失界，表明由延迟消息引起的性能下降被一个折扣累积的信息差距所上界，该差距由及时消息与延迟消息所诱导的动作分布之间的差异衡量。在CGDC的指导下，我们提出了 extbf{CDCMA}，一个演员-评论家框架，该框架仅在预测CGDC为正时请求消息，预测未来观测以减少消费时的错位，并通过CGDC引导的注意力融合延迟消息。在无队友视觉变体的合作导航和捕食者-猎物任务以及多个延迟级别的SMAC地图上的实验表明，该方法在性能、鲁棒性和泛化能力上均有一致提升，消融实验验证了每个组件的有效性。

英文摘要

Communication is essential for coordination in \emph{cooperative} multi-agent reinforcement learning under partial observability, yet \emph{cross-timestep} delays cause messages to arrive multiple timesteps after generation, inducing temporal misalignment and making information stale when consumed. We formalize this setting as a delayed-communication partially observable Markov game (DeComm-POMG) and decompose a message's effect into \emph{communication gain} and \emph{delay cost}, yielding the Communication Gain and Delay Cost (CGDC) metric. We further establish a value-loss bound showing that the degradation induced by delayed messages is upper-bounded by a discounted accumulation of an information gap between the action distributions induced by timely versus delayed messages. Guided by CGDC, we propose \textbf{CDCMA}, an actor--critic framework that requests messages only when predicted CGDC is positive, predicts future observations to reduce misalignment at consumption, and fuses delayed messages via CGDC-guided attention. Experiments on no-teammate-vision variants of Cooperative Navigation and Predator Prey, and on SMAC maps across multiple delay levels show consistent improvements in performance, robustness, and generalization, with ablations validating each component.

URL PDF HTML ☆

赞 0 踩 0

2603.25152 2026-05-27 cs.AI cs.IR 版本更新

OMD-GraphRAG: Enhancing GraphRAG with Ontology-Guided Extraction, Multi-Dimensional Clustering and Dual-Channel Fusion

OMD-GraphRAG：利用本体引导提取、多维聚类和双通道融合增强GraphRAG

Jie Wang, Honghua Huang, Xi Ge, Jianhui Su, Wen Liu, Shiguo Lian

发表机构 * Data Science & Artificial Intelligence Research Institute（数据科学与人工智能研究院）

AI总结提出OMD-GraphRAG框架，通过本体引导知识提取、多维社区聚类和双通道图检索融合，提升GraphRAG在复杂推理和多跳查询中的性能。

详情

AI中文摘要

检索增强生成（RAG）系统在复杂推理、多跳查询和领域特定问答中面临重大挑战。尽管现有的GraphRAG框架在结构化知识组织方面取得了进展，但在知识提取精度、社区报告完整性和检索性能方面仍存在局限性。本文提出OMD-GraphRAG，一个基于开源GraphRAG构建的增强框架。该框架引入了三项核心创新：（1）本体引导知识提取，使用预定义Schema指导LLM准确识别领域特定实体和关系；（2）多维社区聚类策略，通过对齐完成、基于属性的聚类和多跳关系聚类提高社区完整性；（3）双通道图检索融合，通过混合图和社区检索平衡问答准确性和性能。在MultiHop-RAG基准上的评估结果显示，OMD-GraphRAG在综合F1分数上优于主流开源解决方案（如LightRAG），特别是在推理和时间查询方面。

英文摘要

Retrieval-Augmented Generation (RAG) systems face significant challenges in complex reasoning, multi-hop queries, and domain-specific QA. While existing GraphRAG frameworks have made progress in structural knowledge organization, they still have limitations in knowledge extraction precision, community report integrity, and retrieval performance. This paper proposes OMD-GraphRAG, an enhanced framework built upon open-source GraphRAG. The framework introduces three core innovations: (1) Ontology-Guided Knowledge Extraction that uses predefined Schema to guide LLMs in accurately identifying domain-specific entities and relations; (2) Multi-Dimensional Community Clustering Strategy that improves community completeness through alignment completion, attribute-based clustering, and multi-hop relationship clustering; (3) Dual-Channel Graph Retrieval Fusion that balances QA accuracy and performance through hybrid graph and community retrieval. Evaluation results on MultiHop-RAG benchmark show that OMD-GraphRAG outperforms mainstream open source solutions (e.g., LightRAG) in comprehensive F1 scores, particularly in inference and temporal queries.

URL PDF HTML ☆

赞 0 踩 0

2603.28345 2026-05-27 cs.SE cs.AI 版本更新

Where Code Meets Natural Language: Taxonomy-Driven Information Flow Analysis for LLM-Integrated Applications

当代码遇见自然语言：基于分类法的LLM集成应用信息流分析

Zihao Xu, Xiao Cheng, Ruijie Meng, Yuekang Li

发表机构 * University of New South Wales（新南威尔士大学）； Macquarie University（麦考瑞大学）； National University of Singapore（新加坡国立大学）

AI总结提出一种基于定量信息流理论的分类法，定义24个标签以跨越LLM调用的自然语言与编程语言边界，实现信息流分析，并在污点传播和程序切片中验证有效性。

详情

AI中文摘要

LLM API调用正成为一种普遍的程序构造，但它们创建了一个现有程序分析无法跨越的边界：运行时值进入自然语言提示，在LLM内部经过不透明处理，然后作为程序消费的代码、SQL、JSON或文本重新出现。每个跨函数边界跟踪数据的分析，包括污点分析、程序切片、依赖分析和变更影响分析，都依赖于被调用者的数据流摘要。LLM调用没有这样的摘要，在我们称为NL/PL边界处打破了所有这些分析。我们提出了第一个信息流方法来跨越这个边界。基于定量信息流理论，我们的分类法沿两个正交维度定义了24个标签：信息保留级别（从词汇保留到完全阻塞）和输出模态（自然语言、结构化格式、可执行工件）。我们从4,154个真实世界Python文件中标记了9,083个占位符-输出对，并通过Cohen's κ=0.82和近乎完全的覆盖率（0.01%无法分类）验证了可靠性。我们在两个下游应用中展示了分类法的实用性：（1）一个两阶段污点传播管道，结合基于分类法的过滤和LLM验证，在353个专家注释对上达到F1=0.923，并在六个真实世界OpenClaw提示注入案例上的跨语言验证进一步确认了有效性；（2）基于分类法的反向切片在包含非传播占位符的文件中将切片大小平均减少了15%。每个标签的分析显示，四个阻塞标签几乎涵盖了所有非传播情况，为工具构建者提供了可操作的过滤标准。

英文摘要

LLM API calls are becoming a ubiquitous program construct, yet they create a boundary that no existing program analysis can cross: runtime values enter a natural-language prompt, undergo opaque processing inside the LLM, and re-emerge as code, SQL, JSON, or text that the program consumes. Every analysis that tracks data across function boundaries, including taint analysis, program slicing, dependency analysis, and change-impact analysis, relies on dataflow summaries of callee behavior. LLM calls have no such summaries, breaking all of these analyses at what we call the NL/PL boundary. We present the first information flow method to bridge this boundary. Grounded in quantitative information flow theory, our taxonomy defines 24 labels along two orthogonal dimensions: information preservation level (from lexically preserved to fully blocked) and output modality (natural language, structured format, executable artifact). We label 9,083 placeholder-output pairs from 4,154 real-world Python files and validate reliability with Cohen's $κ= 0.82$ and near-complete coverage (0.01\% unclassifiable). We demonstrate the taxonomy's utility on two downstream applications: (1)~a two-stage taint propagation pipeline combining taxonomy-based filtering with LLM verification achieves $F_1 = 0.923$ on 353 expert-annotated pairs, with cross-language validation on six real-world OpenClaw prompt injection cases further confirming effectiveness; (2)~taxonomy-informed backward slicing reduces slice size by a mean of 15\% in files containing non-propagating placeholders. Per-label analysis reveals that four blocked labels account for nearly all non-propagating cases, providing actionable filtering criteria for tool builders.

URL PDF HTML ☆

赞 0 踩 0

2601.18987 2026-05-27 cs.CL cs.AI cs.PL 版本更新

LLMs versus the Halting Problem: Characterizing Program Termination Reasoning

LLMs 与停机问题：程序终止推理的特征化

Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, Peter O'Hearn

发表机构 * FAIR Team, Meta AI（Meta AI FAIR 团队）； The Hebrew University of Jerusalem, Israel（耶路撒冷希伯来大学）； Bloomberg, New York, USA（彭博社，纽约，美国）； Imperial College London, UK（伦敦帝国理工学院，英国）； University College London, UK（伦敦大学学院，英国）

AI总结本文评估了前沿LLMs在程序终止推理上的能力，发现GPT-5和Claude Sonnet 4.5在C程序终止判断上达到顶级验证工具水平，但无法生成形式化证明，并引入分歧前置条件形式化描述非终止条件。

详情

AI中文摘要

判断程序是否终止是计算机科学中的一个核心问题。图灵的停机问题确立了终止的不可判定性，表明没有算法能普遍确定所有程序和输入的终止性。因此，验证工具近似地处理终止问题，有时无法证明或反驳；这些工具依赖于特定问题的架构，并且通常与特定的编程语言绑定。LLMs的最新进展提出了一个自然的问题：它们在多大程度上能够推理程序终止？我们在2025年国际软件验证竞赛（SV Comp）的一组多样化C程序上评估了前沿LLMs。我们的结果表明，GPT-5和Claude Sonnet 4.5（通过测试时缩放）达到了与顶级验证工具相当的分数。然而，尽管模型通常能正确推断程序是否终止，但它们经常无法构造一个见证作为形式化证明，揭示了语义识别与符号证明生成之间的差距。随着代码长度的增加，性能进一步下降。为了分析这一差距，我们引入了一个分歧前置条件形式化方法，将非终止条件描述为逻辑约束。我们希望这些发现能激励未来在现实世界终止基准测试、结合LLMs与符号验证方法的神经符号方法，以及更广泛地关于LLMs在其他不可判定问题上推理的研究。

英文摘要

Determining whether a program terminates is a central problem in computer science. Turing's Halting Problem established termination as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Hence, verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem specific architectures, and are usually tied to particular programming languages. Recent advances in LLMs raise a natural question: To what extent can they reason about program termination? We evaluate frontier LLMs on a diverse set of C programs from the International Competition on Software Verification (SV Comp) 2025. Our results show that GPT-5 and Claude Sonnet 4.5 achieve scores comparable to top ranked verification tools (with test time scaling). However, while models often correctly infer whether programs terminate, they frequently fail to construct a witness as formal proof, revealing a gap between semantic recognition and symbolic proof generation. Performance further degrades as code length increases. To analyze this gap, we introduce a divergence precondition formulation that characterizes non termination conditions as logical constraints. We hope these findings motivate future research on real-world termination benchmarks, neuro-symbolic approaches that combine LLMs with symbolic verification methods, and, more broadly LLM reasoning on other undecidable problems.

URL PDF HTML ☆

赞 0 踩 0

2603.25415 2026-05-27 cs.AI cs.RO 版本更新

Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation

具身语义场景图生成的强化学习导航现代化

Roman Küble, Marco Hüller, Mrunmai Phatak, Rainer Lienhart, Jörg Hähner

发表机构 * Organic Computing Group（有机计算组）； Machine Learning and Computer Vision Group（机器学习与计算机视觉组）； University of Augsburg（奥格斯堡大学）； Am Technologiezentrum 8（技术中心8号）； Augsburg, Germany（德国奥格斯堡）

AI总结提出模块化导航组件，通过替换策略优化方法和重新设计离散动作表示，现代化具身语义场景图生成中的决策过程，并评估不同动作集和策略结构对场景图完整性、执行安全性和导航行为的影响。

详情

AI中文摘要

语义世界模型使具身智能体能够推理对象、关系和空间上下文，超越纯几何表示。在有机计算中，此类模型是在不确定性和资源约束下实现目标驱动自适应的关键。核心挑战是在有限动作预算内获取最大化模型质量和下游实用性的观测。语义场景图（SSG）为此提供了结构紧凑的表示。然而，在有限动作视界内构建SSG需要探索策略，在信息增益与导航成本之间权衡，并决定何时额外动作的收益递减。本文提出了用于具身语义场景图生成的模块化导航组件，并通过替换策略优化方法和重新审视离散动作公式来现代化其决策。我们研究了紧凑和更细粒度的较大离散动作集，并比较了原子动作上的单头策略与动作组件上的分解多头策略。我们评估了课程学习和基于深度的可选碰撞监督，并评估了SSG完整性、执行安全性和导航行为。结果表明，仅替换优化算法在相同奖励塑造下相对于基线将SSG完整性提高了21%。深度主要影响执行安全性（无碰撞运动），而完整性基本保持不变。将现代优化与更细粒度、分解的动作表示相结合，产生了最强的完整性-效率权衡。

英文摘要

Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. In Organic Computing, such models are a key enabler for objective-driven self-adaptation under uncertainty and resource constraints. The core challenge is to acquire observations maximising model quality and downstream usefulness within a limited action budget. Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off information gain against navigation cost and decide when additional actions yield diminishing returns. This work presents a modular navigation component for Embodied Semantic Scene Graph Generation and modernises its decision-making by replacing the policy-optimisation method and revisiting the discrete action formulation. We study compact and finer-grained, larger discrete motion sets and compare a single-head policy over atomic actions with a factorised multi-head policy over action components. We evaluate curriculum learning and optional depth-based collision supervision, and assess SSG completeness, execution safety, and navigation behaviour. Results show that replacing the optimisation algorithm alone improves SSG completeness by 21\% relative to the baseline under identical reward shaping. Depth mainly affects execution safety (collision-free motion), while completeness remains largely unchanged. Combining modern optimisation with a finer-grained, factorised action representation yields the strongest overall completeness--efficiency trade-off.

URL PDF HTML ☆

赞 0 踩 0

2601.04426 2026-05-27 cs.AI 版本更新

XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs

XGrammar-2: 面向智能体LLM的高效动态结构化生成引擎

Linzhang Li, Yixin Dong, Guanjie Wang, Ziyi Xu, Alexander Jiang, Tianqi Chen

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Carnegie Mellon University（卡内基梅隆大学）； Carnegie Mellon University, NVIDIA（卡内基梅隆大学，NVIDIA）

AI总结针对智能体LLM中动态结构化生成（如工具调用和响应协议）的挑战，提出XGrammar-2引擎，通过标签触发结构切换和跨语法子结构缓存实现高效编译与近零开销。

Comments 10 pages, ACM CAIS 26

详情

DOI: 10.1145/3786335.3813124

AI中文摘要

现代LLM智能体越来越依赖动态结构化生成，例如工具调用和响应协议。与具有静态结构的传统结构化生成不同，这些工作负载在请求之间和请求内部都有变化，给现有引擎带来了新的挑战。我们提出了XGrammar-2，一种用于动态智能体工作负载的结构化生成引擎。我们的设计基于两个关键思想：对标签触发的结构切换的一流支持，以及跨具有不同输出结构的请求的细粒度重用。具体来说，XGrammar-2引入了TagDispatch用于动态结构调度，以及Cross-Grammar Cache用于跨语法的子结构级缓存重用。它通过基于Earley的自适应令牌掩码缓存、即时编译和重复状态压缩进一步提高了效率。实验表明，XGrammar-2的编译速度比先前的结构化生成引擎快6倍以上，并且在现代LLM服务系统中几乎为零的端到端开销。

英文摘要

Modern LLM agents increasingly rely on dynamic structured generation, such as tool calling and response protocols. Unlike traditional structured generation with static structures, these workloads vary both across requests and within a request, posing new challenges to existing engines. We present XGrammar-2, a structured generation engine for dynamic agentic workloads. Our design is based on two key ideas: first-class support for tag-triggered structure switching, and fine-grained reuse across requests with different output structures. Concretely, XGrammar-2 introduces TagDispatch for dynamic structural dispatching and Cross-Grammar Cache for substructure-level cache reuse across grammars. It further improves efficiency with an Earley-based adaptive token mask cache, just-in-time compilation, and repetition state compression. Experiments show that XGrammar-2 achieves over 6x faster compilation than prior structured generation engines, and incurs near-zero end-to-end overhead in modern LLM serving systems.

URL PDF HTML ☆

赞 0 踩 0

2603.23994 2026-05-27 cs.LG cs.AI 版本更新

Understanding the Challenges in Iterative Generative Optimization with LLMs

理解大语言模型迭代生成优化中的挑战

Allen Nie, Xavier Daull, Zhiyi Kuang, Abhinav Akkiraju, Anish Chaudhuri, Max Piasevoli, Ryan Rong, YuCheng Yuan, Prerit Choudhary, Shannon Xiao, Rasool Fakoor, Adith Swaminathan, Ching-An Cheng

发表机构 * Google DeepMind（谷歌DeepMind）； CNRS（国家科学研究中心）； Stanford University（斯坦福大学）； Carnegie Mellon University（卡内基梅隆大学）； Microsoft（微软）； AWS（亚马逊AWS）； Netflix Research（Netflix研究）； Microsoft Research（微软研究院）

AI总结本文通过案例研究，揭示了在基于大语言模型的迭代生成优化中，起始工件、信用分配和批处理等隐藏设计选择对优化成败的决定性影响，并指出缺乏跨领域的通用学习循环设置方法是生产化和采用的主要障碍。

Comments 39 pages, 17 figures

详情

AI中文摘要

生成优化利用大型语言模型（LLMs）通过执行反馈迭代改进工件（如代码、工作流或提示）。这是一种构建自我改进代理的有前途的方法，但在实践中仍然脆弱：尽管有活跃的研究，只有9%的调查代理使用了任何自动优化。我们认为这种脆弱性是因为，为了建立学习循环，工程师必须做出“隐藏”的设计选择：优化器可以编辑什么，以及在每次更新时提供什么“正确”的学习证据？我们调查了影响大多数应用的三个因素：起始工件、执行轨迹的信用跨度，以及将试错批处理为学习证据。通过在MLAgentBench、Atari和BigBench Extra Hard中的案例研究，我们发现这些设计决策可以决定生成优化是否成功，然而它们在先前的工作中很少被明确说明。不同的起始工件决定了在MLAgentBench中哪些解决方案是可达到的，截断的轨迹仍然可以改进Atari代理，而更大的小批量并不会单调地改善BBEH上的泛化。我们得出结论，缺乏一种简单、通用的跨领域设置学习循环的方法是生产化和采用的主要障碍。我们为做出这些选择提供了实用指导。

英文摘要

Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet in practice remains brittle: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make ``hidden'' design choices: What can the optimizer edit and what is the "right" learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and batching trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard, we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.

URL PDF HTML ☆

赞 0 踩 0

2603.20020 2026-05-27 cs.CV cs.AI 版本更新

Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR

分离跳跃链接与$R$-探针：解耦特征聚合与梯度传播用于MLLM OCR

Ziye Yuan, Ruchang Yao, Chengxin Zheng, Yusheng Zhao, Daxiang Dong, Ming Zhang

发表机构 * State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Beijing Key Laboratory of Software and Hardware Cooperative Artificial Intelligence Systems, Peking University, Beijing, China（多媒体信息处理国家重点实验室，计算机科学学院，PKU-Anker LLM实验室，软件与硬件协同人工智能系统北京重点实验室，北京大学，北京，中国）； Tsinghua University, Beijing, China（清华大学，北京，中国）； Baidu Inc, Beijing, China（百度公司，北京，中国）

AI总结针对多模态大语言模型在OCR任务中因梯度干扰导致细粒度视觉信息丢失的问题，提出分离跳跃链接（Detached Skip-Links）以解耦前向特征聚合与反向梯度传播，并引入$R$-探针（$R$-Probe）诊断视觉令牌的可重构性，从而提升OCR及通用多模态任务性能。

Comments Accepted by ICML 2026. Ziye Yuan and Ruchang Yao contributed equally to this work (co-first authors, listed in random order)

详情

AI中文摘要

多模态大语言模型（MLLMs）擅长高级推理，但在OCR任务中失败，因为细粒度视觉细节被破坏或错位。我们发现了多层特征融合中一个被忽视的优化问题。跳跃路径引入了从高级语义目标到早期视觉层的直接反向传播路径。这种机制覆盖了低级信号并破坏了训练稳定性。为了缓解这种梯度干扰，我们提出了分离跳跃链接（Detached Skip-Links），这是一种最小的修改，在前向传播中重用浅层特征，同时在联合训练期间停止通过跳跃分支的梯度。这种非对称设计减少了梯度干扰，提高了稳定性和收敛性，且无需增加可学习参数。为了诊断细粒度信息是否被保留并可供LLM使用，我们引入了$R$-探针（$R$-Probe），它使用从LLM前四分之一层初始化的浅层解码器测量投影视觉令牌的像素级可重构性。在多个ViT骨干网络和多模态基准测试中，以及高达7M训练样本的规模下，我们的方法持续改进了以OCR为中心的基准测试，并在通用多模态任务上取得了明显提升。

英文摘要

Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.

URL PDF HTML ☆

赞 0 踩 0

2603.17218 2026-05-27 cs.CL cs.AI cs.GT 版本更新

Omanic：迈向大语言模型多跳推理的逐步评估

Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li

发表机构 * The University of Tokyo（东京大学）； Yale University（耶鲁大学）； Stanford University（斯坦福大学）； Xiaomi EV（小米EV）； Soongsil University（顺天大学）

AI总结针对大语言模型在多跳问答中中间步骤推理失败难以诊断的问题，提出Omanic基准，通过分解为单跳子问题并分析步骤级错误，揭示后期跳数瓶颈、事实知识下限和错误传播，微调后提升多个推理基准性能。

详情

AI中文摘要

仅从最终答案评估大语言模型（LLM）的推理能力可能会掩盖中间步骤的失败，尤其是在没有步骤级标注的多跳问答基准中。为解决这一问题，我们引入了Omanic，一个开放域4跳问答基准，它不仅用于衡量最终答案的准确性，还用于诊断推理在何处中断。Omanic包含10,296个机器生成的训练示例（OmanicSynth）和967个经专家审核的人工标注评估示例（OmanicBench），每个评估问题被分解为单跳子问题、中间答案和结构化图拓扑。对专有和开源LLM的实验表明，Omanic具有挑战性，而逐步分析揭示了后期跳数瓶颈、事实知识下限以及沿推理链的错误传播。在OmanicSynth上微调可迁移到六个推理和数学基准，平均提升7.41分，验证了其作为推理能力迁移监督的有效性。我们在https://huggingface.co/datasets/li-lab/Omanic 发布数据，在https://github.com/XiaojieGu/Omanic 发布代码。

英文摘要

Evaluating the reasoning abilities of large language models (LLMs) solely from final answers can obscure failures in intermediate steps, especially in multi-hop QA benchmarks without step-level annotations. To address this gap, we introduce Omanic, an open-domain 4-hop QA benchmark designed not only to measure final-answer accuracy but also to diagnose where reasoning breaks down. Omanic contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench), with each evaluation question decomposed into single-hop sub-questions, intermediate answers, and structured graph topologies. Experiments with proprietary and open-source LLMs show that Omanic is challenging, while step-wise analysis reveals a later-hop bottleneck, factual knowledge floor, and error propagation along reasoning chains. Fine-tuning on OmanicSynth transfers to six reasoning and mathematics benchmarks, yielding a 7.41-point average gain and validating its effectiveness as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.

URL PDF HTML ☆

赞 0 踩 0

2603.13853 2026-05-27 cs.CL cs.AI 版本更新

APEX-Searcher: Refining Credit Assignment with Subgoaling for Agentic Retrieval-Augmented Generation

APEX-Searcher: 通过子目标细化信用分配以增强智能体检索增强生成

Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao

发表机构 * University of Chinese Academy of Sciences（中国科学院大学）； MAIS, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所MAIS部）； Wenge Technology Co., Ltd（Wenger科技有限公司）

AI总结针对复杂多跳问答中检索路径模糊和端到端强化学习奖励稀疏的问题，提出APEX-Searcher，通过分离规划与执行的信用分配（规划用RL优化、执行用SFT学习），在多个基准上取得一致提升。

详情

AI中文摘要

检索增强生成（RAG）将大型语言模型（LLMs）与外部知识连接起来，但单轮检索通常不足以应对复杂的多跳问题。为了增强复杂任务的搜索能力，大多数现有工作通过端到端训练将多轮迭代检索与推理过程相结合。虽然这些方法提高了问题解决性能，但它们仍然面临任务推理和模型训练方面的挑战，尤其是模糊的检索执行路径和端到端强化学习（RL）中的稀疏奖励，这可能导致不准确的检索结果和较低的性能。我们将这些失败归因于层次化的信用纠缠：单一的最终奖励同时更新规划和执行，因此模型无法清晰地区分规划错误和检索错误。我们提出APEX-Searcher，它采用了一种细化信用分配的范式：规划通过带有规划级奖励的RL进行优化，而执行则通过SFT学习。大量实验表明，在多跳RAG和任务规划基准上均取得了一致的提升。

英文摘要

Retrieval-augmented generation (RAG) connects large language models (LLMs) to external knowledge, but single-round retrieval is often insufficient for complex multi-hop questions. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches improve problem-solving performance, they still face challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in end-to-end reinforcement learning (RL), which can lead to inaccurate retrieval results and lower performance. We attribute these failures to hierarchical credit entanglement: a single final reward updates planning and execution together, so the model cannot clearly separate plan errors from retrieval errors. We propose APEX-Searcher, which uses a Refining Credit Assignment paradigm: planning is optimized by RL with a plan-level reward, while execution is learned by SFT. Extensive experiments show consistent gains in both multi-hop RAG and task planning across benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2603.15500 2026-05-27 cs.AI cs.LG 版本更新

Belief-Sim：迈向信念驱动的人口统计错误信息易感性模拟

Angana Borah, Zohaib Khan, Rada Mihalcea, Verónica Pérez-Rosas

发表机构 * University of Michigan - Ann Arbor（密歇根大学安娜堡分校）； Texas State University（德克萨斯州立大学）

AI总结提出BeliefSim框架，利用心理学分类和调查先验构建人口信念档案，通过提示条件化和后训练适应，实现基于信念模拟人口统计错误信息易感性，对齐度达92%。

Comments Paper Under Review

2603.01800 2026-05-27 cs.LG cs.AI stat.ML stat.OT 版本更新

PoCo：智能合约的代理式概念验证漏洞利用生成

Vivi Andersson, Sofia Bobadilla, Harald Hobbelhagen, Martin Monperrus

发表机构 * KTH Royal Institute of Technology（皇家理工学院）

AI总结提出PoCo框架，通过代理式Reason-Act-Observe循环自动从自然语言漏洞描述生成可执行的PoC漏洞利用，在23个真实报告上优于基线方法。

Comments Under review

详情

DOI: 10.1145/3816704

AI中文摘要

智能合约在高度对抗的环境中运行，漏洞可能导致重大财务损失。因此，智能合约需进行安全审计。在审计中，概念验证（PoC）漏洞利用通过向利益相关者证明报告的漏洞是真实、可重现且可操作的，发挥着关键作用。然而，手动创建PoC耗时、易出错，且常受限于紧张的审计时间表。我们提出PoCo，一个代理式框架，可从审计人员编写的自然语言漏洞描述中自动生成可执行的PoC漏洞利用。PoCo通过在与一组代码执行工具交互的Reason-Act-Observe循环中以代理方式自主生成PoC漏洞利用。它生成与Foundry测试框架兼容的完全可执行漏洞利用，可直接集成到审计报告和其他安全工具中。我们在23个真实漏洞报告的数据集上评估PoCo。PoCo始终优于零样本和工作流基线，生成格式良好且逻辑正确的PoC。我们的结果表明，代理式框架可以显著减少智能合约审计中高质量PoC所需的工作量。我们的贡献为智能合约安全社区提供了可操作的知识。

英文摘要

Smart contracts operate in a highly adversarial environment, where vulnerabilities can lead to substantial financial losses. Thus, smart contracts are subject to security audits. In auditing, proof-of-concept (PoC) exploits play a critical role by demonstrating to the stakeholders that the reported vulnerabilities are genuine, reproducible, and actionable. However, manually creating PoCs is time-consuming, error-prone, and often constrained by tight audit schedules. We introduce PoCo, an agentic framework that automatically generates executable PoC exploits from natural-language vulnerability descriptions written by auditors. PoCo autonomously generates PoC exploits in an agentic manner by interacting with a set of codeexecution tools in a Reason-Act-Observe loop. It produces fully executable exploits compatible with the Foundry testing framework, ready for integration into audit reports and other security tools. We evaluate PoCo on a dataset of 23 real-world vulnerability reports. PoCo consistently outperforms the Zero-shot and Workflow baselines, generating well-formed and logically correct PoCs. Our results demonstrate that agentic frameworks can significantly reduce the effort required for high-quality PoCs in smart contract audits. Our contribution provides actionable knowledge for the smart contract security community.

URL PDF HTML ☆

赞 0 踩 0

2510.07231 2026-05-27 cs.CL cs.AI 版本更新

EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models

EconCausal: 面向大语言模型的上下文感知经济推理基准

Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park, Jihee Kim

发表机构 * Graduate School of Data Science, KAIST（韩国科学技术院数据科学研究生院）； College of Business, KAIST（韩国科学技术院商学院）； Data Science for Humanity Group, MPI-SP（马克斯·普朗克所际数据科学为人类集团）； School of Computing, KAIST（韩国科学技术院计算学院）； Division of Social Science, HKUST（香港科技大学社会科学系）

AI总结提出EconCausal基准，包含从顶级经济金融期刊提取的10,490个上下文标注因果三元组，评估大语言模型在指定上下文中推断因果方向及随上下文变化调整判断的能力。

详情

AI中文摘要

社会经济因果效应高度依赖于制度和环境背景。相同的干预措施在不同监管制度、市场条件、时间段或人群中可能产生不同甚至相反的效果。这对大语言模型（LLM）在决策支持角色中提出了挑战：它们能否在指定上下文中推断因果效应的方向，并在上下文变化时修正该判断？为此，我们引入了EconCausal，这是一个大规模基准，包含从顶级经济和金融期刊的2,595项高质量实证研究中提取的10,490个上下文标注因果三元组，通过严格的四阶段流程构建，包括多轮共识、上下文细化和多批评者过滤。跨模型实验表明，LLM往往无法根据上下文调整其预测。虽然顶级模型在固定、显式上下文中达到88%的准确率，但在需要跨上下文修正符号的情况下，准确率下降32.6个百分点（从73.9%降至41.3%），一旦引入误导性的符号证据，准确率降至50%以下。模型还过度承诺于方向性（+/-）符号，仅在13.8%的情况下识别出零效应，且在这些类别上校准不良。数据集和基准公开于 https://anonymous.4open.science/r/econcausal-benchmark-6F12。

英文摘要

Socio-economic causal effects depend heavily on their institutional and environmental contexts. The same intervention can produce different, even opposite, effects across regulatory regimes, market conditions, time periods, or populations. This poses a challenge for large language models (LLMs) in decision-support roles: can they infer the direction of a causal effect under a specified context, and revise that judgment when the context changes? To address this, we introduce EconCausal, a large-scale benchmark of 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies in top-tier economics and finance journals, constructed through a rigorous four-stage pipeline with multi-run consensus, context refinement, and multi-critic filtering. Across models, LLMs often fail to condition their predictions on context. While top models reach 88% accuracy in fixed, explicit contexts, accuracy falls by 32.6~pp on cases that require revising the sign across contexts (73.9% to 41.3%), and drops below 50% once misleading signed evidence is introduced. Models also over-commit to directional (+/-) signs, recognizing null effects only 13.8% of the time while remaining poorly calibrated on these categories. The dataset and benchmark are publicly available at https://anonymous.4open.science/r/econcausal-benchmark-6F12.

URL PDF HTML ☆

赞 0 踩 0

2602.17605 2026-05-27 cs.CV cs.AI cs.CY cs.LG 版本更新

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

在飞行中主动适应：基于相关性的在线元学习与潜在概念用于地理空间发现

Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly

发表机构 * University of Michigan, Ann Arbor, MI, USA（密歇根大学，安阿伯分校）； Washington University in St. Louis, St. Louis, MO, USA（华盛顿大学圣路易斯分校）

AI总结提出一个统一的地理空间发现框架，结合主动学习、在线元学习和概念引导推理，通过概念加权不确定性采样和相关性感知元批次形成策略，在有限数据和动态环境下高效发现隐藏目标。

详情

AI中文摘要

在环境监测中，数据收集通常成本高昂、稀疏且受紧急公共卫生需求影响。这对于致癌的PFAS（全氟和多氟烷基物质）污染尤其如此，与领域专家和环境组织的讨论强调需要在有限的采样预算下战略性地识别高风险、观测不足的区域。更广泛地说，在灾害响应和公共卫生环境中也出现了类似的挑战，动态环境使得从有限的地面实况中高效发现隐藏目标变得至关重要。然而，稀疏且有偏差的地理空间标签限制了现有基于学习方法（如强化学习）的适用性。为了解决这个问题，我们提出了一个统一的地理空间发现框架，该框架集成了主动学习、在线元学习和概念引导推理。我们的方法引入了两个基于共享的*概念相关性*概念的关键创新，该概念捕捉领域特定因素如何影响目标存在：一个*概念加权不确定性采样策略*，其中不确定性通过从现成概念（如土地覆盖和源距离）学习到的相关性进行调节；以及一个*相关性感知元批次形成策略*，该策略在在线元更新期间促进语义多样性，提高动态环境中的泛化能力。我们在PFAS污染发现任务上评估了我们的框架，这是一个受真实世界启发的环境监测任务，展示了在有限数据和变化条件下鲁棒的目标发现能力。

英文摘要

In environmental monitoring, data collection is often costly, sparse, and shaped by urgent public-health needs. This is particularly true for cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, where discussions with domain experts and environmental organizations highlight the need to strategically identify high-risk, under-observed regions under tight sampling budgets. More broadly, similar challenges arise in disaster response and public health settings, where dynamic environments make it essential to efficiently uncover hidden targets from limited ground truth. Yet sparse and biased geospatial labels limit the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, capturing how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance from readily available concepts such as land cover and source proximity; and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. We evaluate our framework on PFAS contamination discovery as a real-world inspired environmental monitoring task, demonstrating robust target discovery under limited data and changing conditions.

URL PDF HTML ☆

赞 0 踩 0

2510.03352 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Inference-Time Search Using Side Information for Diffusion-Based Image Reconstruction

基于侧信息的推理时搜索用于扩散模型图像重建

Mahdi Farahbakhsh, Vishnu Teja Kunde, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

发表机构 * Department of Electrical and Computer Engineering, Texas A&M University（电气与计算机工程系，德克萨斯A&M大学）

AI总结提出一种即插即用、无需训练的推理时搜索框架，将侧信息融入现有扩散模型逆问题求解器，显著提升重建质量。

详情

AI中文摘要

扩散模型已被用作解决逆问题的先验。然而，现有方法通常忽略了能够显著提高重建质量的侧信息，尤其是在严重病态设置中。在这项工作中，我们提出了一种新颖的框架，通过推理时搜索将侧信息以即插即用、无需训练的方式融入现有的基于扩散模型的逆问题求解器。通过在多种逆问题（包括图像修复、超分辨率和几种去模糊任务）以及多种基于扩散模型的逆问题求解器（DPS、DAPS和MPGD）上的大量实验，我们表明，用我们的框架增强每个求解器，其重建质量始终优于相应的原始方法。为了展示我们方法的通用性，我们考虑了多种形式的侧信息，包括参考图像、文本描述和解剖学MRI扫描。代码可在该仓库中获取：https://github.com/mahdi-farahbakhsh/DISS。

英文摘要

Diffusion models have been used as priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel framework that incorporates side information into existing diffusion-based inverse problem solvers via inference-time search, in a plug-and-play, training-free manner. Through extensive experiments across a range of inverse problems, including inpainting, super-resolution, and several deblurring tasks, and across multiple diffusion-based inverse problem solvers (DPS, DAPS, and MPGD), we show that augmenting each solver with our framework consistently improves the quality of the reconstructions over the corresponding original method. To demonstrate the generality of our approach, we consider diverse forms of side information, including reference images, textual descriptions, and anatomical MRI scans. The code is available at this \href{https://github.com/mahdi-farahbakhsh/DISS}{repository}\footnote{https://github.com/mahdi-farahbakhsh/DISS}.

URL PDF HTML ☆

赞 0 踩 0

2602.15919 2026-05-27 stat.ML cs.AI cs.LG 版本更新

Assessing Per-Sample Membership Inference Vulnerability without Retraining

无需重训练的逐样本成员推断脆弱性评估

Valentin Dorseuil, Jamal Atif, Olivier Cappé

发表机构 * ENS, École normale supérieure（巴黎高等师范学院）； Université PSL, CNRS（巴黎政治学院、国家科学研究中心）； Institut Polytechnique de Paris（巴黎理工 institute）

AI总结提出一种基于数据依赖几何度量的逐样本成员推断脆弱性评分方法，仅需单个训练模型即可高效识别高风险样本。

详情

AI中文摘要

近期隐私文献表明，针对样本的成员推断攻击（MIA）显著优于非针对性方法。受此启发，我们探讨以下问题：能否在不训练影子模型的情况下评估单个训练点的隐私脆弱性？我们表明，逐样本对MIA的暴露程度不仅受其损失影响，还受数据依赖的几何度量控制。在线性设置中，我们推导出个体黑盒MIA脆弱性的闭式分解，将其分解为总体杠杆得分和残差损失项，明确了样本依赖的几何结构如何转化为隐私暴露。由于大多数现代架构的最后一层是线性的，我们将此框架扩展到深度网络，并提出一种基于最后一层表示的替代评分，仅需单个训练模型且无需影子模型。跨不同数据集和架构的实验表明，我们的评分在识别最先进攻击下的最高风险点时优于损失和梯度范数基线，为逐样本隐私风险评估提供了计算高效且理论基础的工。

英文摘要

Recent work in the privacy literature shows that sample-targeted membership inference attacks (MIAs) significantly outperform untargeted approaches by a wide margin. Motivated by this observation, we address the following question: can the privacy vulnerability of individual training points be assessed without training shadow models? We show that per-sample exposure to MIA is governed not only by a point's loss, but also by a data-dependent geometric measure. In the linear setting, we derive a closed-form decomposition of individual black-box MIA vulnerability into a population leverage score and a residual loss term, making explicit how sample-dependent geometry translates into privacy exposure. Since the final layer of most modern architectures is linear, we extend this framework to deep networks and propose a surrogate score operating on last-layer representations that requires only a single trained model and no shadow models. Empirical evaluations across diverse datasets and architectures show that our score outperforms loss and gradient-norm baselines at identifying the highest-risk points under state-of-the-art attacks, providing a computationally efficient and theoretically grounded tool for per-sample privacy risk assessment.

URL PDF HTML ☆

赞 0 踩 0

2602.12833 2026-05-27 cs.LG cs.AI cs.MA 版本更新

Vital Trace: Protocol-Constrained Patient-State Reasoning for Longitudinal Clinical Trajectories

Vital Trace: 协议约束的患者状态推理用于纵向临床轨迹

Zhan Qu, Michael Färber

发表机构 * TU Dresden（德累斯顿理工大学）

AI总结提出Vital Trace，一个协议约束的多智能体框架，通过紧凑的持久患者状态记忆和四个协调智能体（Router、Reasoner、Auditor、Steward）进行分阶段推理，以解决长期临床轨迹推理中的上下文漂移和不稳定问题，在MIMIC-IV和eICU数据集上预测未来血管加压药、呼吸、肾脏支持和恶化任务中优于自由形式多智能体基线。

详情

AI中文摘要

纵向临床推理需要跟踪电子健康记录中患者轨迹的生理测量、实验室结果和干预措施。现有的基于LLM的临床推理系统通常依赖于重复序列化患者历史或交换无约束的文本智能体消息，导致上下文漂移、推理不稳定以及长期推理成本增加。我们提出了Vital Trace，一个协议约束的多智能体框架，用于在动态ICU轨迹上进行未来临床风险预测。Vital Trace不维护无界文本历史，而是使用紧凑的持久患者状态记忆以及由四个协调智能体（Router、Reasoner、Auditor和Steward）执行的分阶段推理。为了支持时间上连贯的推理，我们引入了一个手动策划的全局协议，包含生理状态转换规则和动态患者状态表示，随时间跟踪血流动力学、呼吸、肾脏、代谢和炎症不稳定性。我们在MIMIC-IV和eICU上使用未来血管加压药支持、呼吸支持、肾脏支持和恶化预测任务评估Vital Trace。结果表明，与自由形式多智能体基线相比，结构化的协议约束推理提高了时间一致性、通信稳定性、校准性和可解释性，同时在长期ICU轨迹上实现了强大的预测性能。

英文摘要

Longitudinal clinical reasoning over electronic health records requires tracking evolving physiological measurements, laboratory results, and interventions across extended patient trajectories. Existing LLM-based clinical reasoning systems often rely on repeatedly serializing patient histories or exchanging unconstrained textual agent messages, leading to context drift, unstable reasoning, and growing inference cost over long horizons. We present Vital Trace, a protocol-constrained multi-agent framework for future clinical risk prediction over evolving ICU trajectories. Instead of maintaining unbounded textual histories, Vital Trace uses a compact persistent patient-state memory together with staged reasoning performed by four coordinated agents: a Router, Reasoner, Auditor, and Steward. To support temporally coherent reasoning, we introduce a manually curated Global Protocol containing physiological state-transition rules and a dynamic patient-state representation that tracks hemodynamic, respiratory, renal, metabolic, and inflammatory instability over time. We evaluate Vital Trace on MIMIC-IV and eICU using future vasopressor-support, respiratory-support, renal-support, and deterioration prediction tasks. Results show that structured protocol-constrained reasoning improves temporal consistency, communication stability, calibration, and interpretability compared with free-form multi-agent baselines while achieving strong predictive performance across long ICU trajectories.

URL PDF HTML ☆

赞 0 踩 0

2602.11799 2026-05-27 cs.AI cs.IR 版本更新

Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation

Hi-SAM: 一种面向大规模推荐的分层结构感知多模态框架

Pingjun Pan, Tingting Zhou, Peiyao Lu, Tingting Fei, Hongxiang Chen, Chuanjiang Luo

发表机构 * Netease Cloud Music（网易云音乐）

AI总结针对多模态推荐中语义ID离散化存在的次优分词和架构-数据不匹配问题，提出Hi-SAM框架，通过解耦语义分词器和分层记忆-锚点Transformer，在冷启动场景下显著提升推荐性能。

Comments Accepted at ACM KDD 2026 ADS

详情

AI中文摘要

多模态推荐因物品具有文本和图像等丰富属性而受到关注。基于语义ID的方法有效地将这些信息离散化为紧凑的令牌。然而，存在两个挑战：（1）次优分词：现有方法（如RQ-VAE）缺乏共享跨模态语义和模态特定细节之间的解耦，导致冗余或崩溃；（2）架构-数据不匹配：普通Transformer将语义ID视为扁平流，忽略了用户交互、物品和令牌的层次结构。将物品扩展为多个令牌会放大长度和噪声，使注意力偏向局部细节而非整体语义。我们提出Hi-SAM，一种分层结构感知多模态框架，包含两个设计：（1）解耦语义分词器（DST）：通过几何感知对齐统一模态，并通过从粗到细的策略进行量化。共享码本提取共识，而模态特定码本通过互信息最小化从残差中恢复细微差别；（2）分层记忆-锚点Transformer（HMAT）：通过分层RoPE将位置编码分解为物品间和物品内子空间以恢复层次结构。它插入锚点令牌将物品压缩为紧凑记忆，保留当前物品的细节，同时仅通过压缩摘要访问历史。在真实世界数据集上的实验表明，相比最先进基线方法，Hi-SAM持续改进，尤其在冷启动场景中。在服务数百万用户的大规模社交平台上部署后，Hi-SAM在核心在线指标上实现了6.55%的提升。

英文摘要

Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectively discretize this information into compact tokens. However, two challenges persist: (1) Suboptimal Tokenization: existing methods (e.g., RQ-VAE) lack disentanglement between shared cross-modal semantics and modality-specific details, causing redundancy or collapse; (2) Architecture-Data Mismatch: vanilla Transformers treat semantic IDs as flat streams, ignoring the hierarchy of user interactions, items, and tokens. Expanding items into multiple tokens amplifies length and noise, biasing attention toward local details over holistic semantics. We propose Hi-SAM, a Hierarchical Structure-Aware Multi-modal framework with two designs: (1) Disentangled Semantic Tokenizer (DST): unifies modalities via geometry-aware alignment and quantizes them via a coarse-to-fine strategy. Shared codebooks distill consensus while modality-specific ones recover nuances from residuals, enforced by mutual information minimization; (2) Hierarchical Memory-Anchor Transformer (HMAT): splits positional encoding into inter- and intra-item subspaces via Hierarchical RoPE to restore hierarchy. It inserts Anchor Tokens to condense items into compact memory, retaining details for the current item while accessing history only through compressed summaries. Experiments on real-world datasets show consistent improvements over SOTA baselines, especially in cold-start scenarios. Deployed on a large-scale social platform serving millions of users, Hi-SAM achieved a 6.55% gain in the core online metric.

URL PDF HTML ☆

赞 0 踩 0

2508.03771 2026-05-27 cs.CY cs.AI 版本更新

Trustworthiness of Legal Considerations for the Use of LLMs in Education

LLM在教育中使用的法律考量可信度

Sara Alaswad, Tatiana Kalganova, Wasan Awad

发表机构 * College of Information Technology（信息科技学院）； Brunel University of London（伦敦布鲁内尔大学）； Ahlia University（阿利亚大学）

AI总结本文通过比较全球主要地区（欧盟、英国、美国、中国、海湾合作委员会国家）的AI监管框架，提出针对海湾合作委员会国家的合规中心AI治理框架，以促进教育中AI系统的合法、伦理和文化适应性部署。

Comments 11 pages, 3 figures, 6 tables

详情

DOI: 10.1109/DASA68193.2025.11498739
Journal ref: Proc. IEEE DASA 2025, Manama, Bahrain, 2025

AI中文摘要

随着人工智能（AI），特别是大型语言模型（LLMs）日益嵌入全球教育系统，确保其伦理、法律和情境适当的部署已成为关键政策问题。本文对全球主要地区（包括欧盟、英国、美国、中国和海湾合作委员会（GCC）国家）的AI相关监管和伦理框架进行了比较分析。它映射了核心可信度原则（如透明度、公平性、问责制、数据隐私和人类监督）如何嵌入区域立法和AI治理结构中。特别强调了GCC地区不断发展的格局，这些国家正在迅速推进国家AI战略和教育部门创新。为支持这一发展，本文提出了一个针对GCC背景量身定制的以合规为中心的AI治理框架。这包括一个分层类型学和机构检查清单，旨在帮助监管机构、教育工作者和开发者将AI采用与国际规范和当地价值观对齐。通过综合全球最佳实践与区域特定挑战，本文为在教育中构建合法、伦理基础和文化敏感的AI系统提供了实用指导。这些见解旨在为未来的监管协调提供信息，并促进不同教育环境中负责任的AI集成。

英文摘要

As Artificial Intelligence (AI), particularly Large Language Models (LLMs), becomes increasingly embedded in education systems worldwide, ensuring their ethical, legal, and contextually appropriate deployment has become a critical policy concern. This paper offers a comparative analysis of AI-related regulatory and ethical frameworks across key global regions, including the European Union, United Kingdom, United States, China, and Gulf Cooperation Council (GCC) countries. It maps how core trustworthiness principles, such as transparency, fairness, accountability, data privacy, and human oversight are embedded in regional legislation and AI governance structures. Special emphasis is placed on the evolving landscape in the GCC, where countries are rapidly advancing national AI strategies and education-sector innovation. To support this development, the paper introduces a Compliance-Centered AI Governance Framework tailored to the GCC context. This includes a tiered typology and institutional checklist designed to help regulators, educators, and developers align AI adoption with both international norms and local values. By synthesizing global best practices with region-specific challenges, the paper contributes practical guidance for building legally sound, ethically grounded, and culturally sensitive AI systems in education. These insights are intended to inform future regulatory harmonization and promote responsible AI integration across diverse educational environments.

URL PDF HTML ☆

赞 0 踩 0

2602.10450 2026-05-27 cs.LG cs.AI math.OC 版本更新

Constructing Industrial-Scale Optimization Modeling Benchmark

构建工业规模优化建模基准

Zhong Li, Hongliang Lu, Tao Wei, Yuxuan Chen, Wenyu Liu, Yuan Lan, Fan Zhang, Zaiwen Wen

发表机构 * Great Bay University（大湾大学）； Peking University（北京大学）； Huawei Technologies Co., Ltd（华为技术有限公司）

AI总结提出MIPLIB-NL基准，通过结构感知逆向构建方法从真实混合整数线性规划中生成自然语言规范与求解器代码，以评估大语言模型在工业规模优化建模中的性能。

Comments This paper was accepted by ICML'26 for publication

详情

AI中文摘要

优化建模支撑着物流、制造、能源和金融领域的决策，然而将自然语言需求转化为正确的优化公式和可执行求解器代码仍然需要大量人力。尽管大语言模型（LLMs）已被探索用于此任务，但评估仍以玩具级或合成基准为主，掩盖了具有$10^{3}$--$10^{6}$（或更多）变量和约束的工业问题的难度。一个关键瓶颈是缺乏将自然语言规范与基于真实优化模型的参考公式/求解器代码对齐的基准。为填补这一空白，我们引入了MIPLIB-NL，它通过一种结构感知的逆向构建方法从MIPLIB~2017中的真实混合整数线性规划构建而成。我们的流程（i）从平坦的求解器公式中恢复紧凑、可复用的模型结构，（ii）在统一的模型-数据分离格式下，逆向生成明确关联到该恢复结构的自然语言规范，以及（iii）通过专家评审和人类-LLM交互以及独立的逆向检查进行迭代语义验证。这产生了223个一对一的重构，保留了原始实例的数学内容，同时实现了现实的自然语言到优化评估。实验表明，在现有基准上表现良好的系统在MIPLIB-NL上性能显著下降，暴露了在玩具规模下不可见的失败模式。

英文摘要

Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, masking the difficulty of industrial problems with $10^{3}$--$10^{6}$ (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models. To fill in this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB~2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model--data separation format, and (iii) performs iterative semantic validation through expert review and human--LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.

URL PDF HTML ☆

赞 0 踩 0

2602.10104 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Olaf-World: Orienting Latent Actions for Video World Modeling

Olaf-World: 面向视频世界模型的潜在动作定向

Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore ； Research (A STAR), Singapore

AI总结提出SeqΔ-REPA对齐目标，通过冻结自监督视频编码器的时序特征差异锚定潜在动作，实现无标签视频中可迁移的动作控制世界模型预训练。

Comments ICML 2026. Project page: https://showlab.github.io/Olaf-World/ Code: https://github.com/showlab/Olaf-World

详情

AI中文摘要

扩展动作可控世界模型受限于动作标签的稀缺性。虽然潜在动作学习有望从无标签视频中提取控制接口，但学习到的潜在表示往往难以跨上下文迁移：它们纠缠了场景特定线索，缺乏共享坐标系。这是因为标准目标仅在每个片段内操作，没有提供跨上下文对齐动作语义的机制。我们的关键洞察是，尽管动作未被观测到，但其语义效果是可观测的，可以作为共享参考。我们引入SeqΔ-REPA，一种序列级控制效果对齐目标，将集成潜在动作锚定到来自冻结自监督视频编码器的时序特征差异。基于此，我们提出Olaf-World，一个从大规模被动视频中预训练动作条件视频世界模型的流程。大量实验表明，我们的方法学习了更结构化的潜在动作空间，从而在零样本动作迁移和适应新控制接口的数据效率上优于最先进的基线方法。

英文摘要

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$Δ$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.

URL PDF HTML ☆

赞 0 踩 0

2602.09038 2026-05-27 cs.DB cs.AI 版本更新

Scaling GraphLLM with Bilevel-Optimized Sparse Querying

基于双层优化稀疏查询的GraphLLM扩展

Yangzhe Peng, Haiquan Qiu, Quanming Yao, Kun He

发表机构 * Huazhong University of Science and Technology（华中科技大学）； Tsinghua University（清华大学）

AI总结提出BOSQ框架，通过自适应稀疏查询策略选择性调用LLM，在降低计算成本的同时保持或提升图节点任务性能。

详情

AI中文摘要

LLMs最近通过提供解释特征，在文本属性图（TAGs）上增强节点级任务方面显示出强大潜力。然而，重复LLM查询的高计算和货币成本严重限制了其实际应用。举例来说，使用代表性方法（如TAPE）为中等规模基准（如Photo，48k节点）上的所有节点朴素生成解释将消耗数天的处理时间。在本文中，我们提出双层优化稀疏查询（BOSQ），一个通用框架，选择性利用LLM导出的解释特征来提升TAGs上节点级任务的性能。我们设计了一种自适应稀疏查询策略，选择性决定何时调用LLM，避免冗余或低增益查询，显著降低计算开销。在涉及两种节点级任务的六个真实世界TAG数据集上的大量实验表明，BOSQ比现有GraphLLM方法运行速度显著更快，同时持续提供相当或更优的性能。

英文摘要

LLMs have recently shown strong potential in enhancing node-level tasks on text-attributed graphs (TAGs) by providing explanation features. However, their practical use is severely limited by the high computational and monetary cost of repeated LLM queries. To illustrate, naively generating explanations for all nodes on a medium-sized benchmark like Photo (48k nodes) using a representative method (e.g., TAPE) would consume days of processing time. In this paper, we propose Bilevel-Optimized Sparse Querying (BOSQ), a general framework that selectively leverages LLM-derived explanation features to enhance performance on node-level tasks on TAGs. We design an adaptive sparse querying strategy that selectively decides when to invoke LLMs, avoiding redundant or low-gain queries and significantly reducing computation overhead. Extensive experiments on six real-world TAG datasets involving two types of node-level tasks demonstrate that BOSQ runs substantially faster than existing GraphLLM methods while consistently delivering on-par or superior performance.

URL PDF HTML ☆

赞 0 踩 0

2602.08586 2026-05-27 cs.AI 版本更新

DIANOIA: Diagnostic Decomposition and Joint Optimization for Multi-Agent Reasoning

DIANOIA: 多智能体推理的诊断性分解与联合优化

Yiming Yang, Zhuoyuan Li, Fanxiang Zeng, Hao Fu, Yue Liu

发表机构 * Alibaba Group（阿里巴巴集团）

AI总结提出DIANOIA框架，通过覆盖度、保真度和综合度三个可测量通道分解多智能体推理增益，并基于此设计诊断协议和对应系统，在多个基准上以更少token实现更优性能。

详情

AI中文摘要

多智能体LLM系统持续优于单智能体基线，但从业者仍无法预测哪种设计适用于新任务或诊断失败原因。我们认为这一差距主要源于该领域缺乏具有可测量原语和可测试预测的诊断框架。我们引入 extbf{DIANOIA}，将多智能体推理增益分解为覆盖度、保真度和综合度三个通道，每个通道均可经验测量。基于此分解，我们推导出一个诊断协议，可识别任何给定任务的瓶颈通道。我们将该协议实例化为一个多智能体系统，其三个组件与通道对应：角色多样化的提议者（覆盖度）、基于执行验证的验证者（保真度）和迭代综合者。在GSM8K、AIME-2025、MBPP和BFCL-SP上，我们的方法在匹配token预算下优于强多智能体基线，在MBPP上以约$5 imes$的token节省主导帕累托前沿，在匹配成本下达到$+4.6$pp。在每个基准上，协议都能正确选择瓶颈通道；我们围绕它构建的系统在多个模型上领先。我们发布代码、适配器、诊断指标和Claude Code技能，网址为https://anonymous.4open.science/r/DIANOIA4MAS。DIANOIA将多智能体设计重新定义为通道感知的资源分配：诊断你的任务的瓶颈通道，然后相应投入token。

英文摘要

Multi-agent LLM systems consistently outperform single-agent baselines, yet practitioners still cannot predict which design works for a new task or diagnose why one fails. We argue this gap persists largely because the field lacks a diagnostic framework with measurable primitives and testable predictions. We introduce \textbf{DIANOIA}, a three-channel decomposition of multi-agent reasoning gain into coverage, fidelity, and synthesis, each of which is empirically measurable. From this decomposition, we derive a diagnostic protocol that identifies the bottleneck channels for any given task. We instantiate the protocol as a multi-agent system whose three components mirror the channels: role-diverse proposers for coverage, execution-grounded verification for fidelity, and iterative synthesis. On GSM8K, AIME-2025, MBPP, and BFCL-SP, our method outperforms strong multi-agent baselines under matched token budgets, dominating the Pareto frontier on MBPP at $\sim$$5{\times}$ token savings and reaching $+4.6$pp at matched cost. On every benchmark, the protocol picks the right bottleneck channels; the system we built around it leads across models. We release code, adapters, diagnostic metrics, and a Claude Code skill at https://anonymous.4open.science/r/DIANOIA4MAS. DIANOIA reframes multi-agent design as channel-aware resource allocation: diagnose which channel is the bottleneck for your task, then invest tokens accordingly.

URL PDF HTML ☆

赞 0 踩 0

2511.16449 2026-05-27 cs.CV cs.AI 版本更新

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

弥合视觉令牌剪枝中的语义-动作鸿沟以实现高效VLA推理

Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao

发表机构 * School of AI, Shanghai Jiao Tong University（上海交通大学人工智能学院）； University of Science and Technology of China（中国科学技术大学）； Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳））； BAAI（北京人工智能研究院）

AI总结提出VLA-Pruner方法，通过结合语义预填充和时序平滑的动作相关性估计视觉令牌重要性，并采用Combine-then-Filter策略，在保持操作质量的同时实现高达1.99倍加速。

详情

AI中文摘要

视觉-语言-动作（VLA）模型通过整合视觉感知、语言理解和动作执行，在具身人工智能中展现出巨大潜力。在实时部署中，这些模型必须处理连续的视觉流，产生大量计算开销。视觉令牌剪枝——一种通过保留显著令牌同时丢弃冗余令牌来加速视觉-语言模型（VLM）的主流技术——为这一挑战提供了自然的候选解决方案。然而，直接将面向VLM的剪枝方法应用于VLA推理会导致操作性能严重下降。我们的分析将这种下降归因于一个关键不匹配：VLA推理在视觉-语言预填充阶段和动作解码阶段表现出不同的注意力模式，因此仅基于上下文预填充语义显著性的剪枝偏向语义线索，可能移除动作关键的视觉令牌。受此观察启发，我们提出VLA-Pruner，一种有效的即插即用令牌剪枝方法，基于VLA推理的视觉需求，并进一步利用机器人操作的时间连续性。具体来说，VLA-Pruner从语义预填充和时序平滑的动作相关性两方面估计视觉令牌重要性，然后采用Combine-then-Filter策略，在计算预算下保留紧凑、非冗余的令牌。实验表明，VLA-Pruner在多种VLA架构上优于最先进方法，在相当的操作质量下实现高达1.99倍加速。

英文摘要

Vision-Language-Action (VLA) models have shown great potential for embodied AI by integrating visual perception, language understanding, and action execution. In real-time deployment, these models must process continuous visual streams, incurring substantial computational overhead. Visual token pruning -- a mainstream technique for accelerating Vision-Language Models (VLMs) by retaining salient tokens while discarding redundant ones -- offers a natural candidate solution to this challenge. However, directly applying VLM-oriented pruning methods to VLA inference can cause severe degradation in manipulation performance. Our analysis attributes this degradation to a key mismatch: VLA inference exhibits distinct attention patterns between the vision-language prefill stage and the action-decode stage, so pruning based only on context-prefill semantic salience is biased toward semantic cues and may remove action-critical visual tokens. Motivated by this observation, we propose VLA-Pruner, an effective plug-and-play token pruning method grounded in the visual requirements of VLA inference, further exploiting the temporal continuity of robot manipulation. Specifically, VLA-Pruner estimates visual-token importance from both semantic prefilling and temporally smoothed action relevance, and then applies a Combine-then-Filter strategy to retain compact, non-redundant tokens under the compute budget. Experiments show that VLA-Pruner outperforms state-of-the-art approaches across multiple VLA architectures, achieving up to 1.99x speedup with comparable manipulation quality.

URL PDF HTML ☆

赞 0 踩 0

2511.06625 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from Low-Dose Computed Tomography

可解释的跨疾病推理：基于低剂量计算机断层扫描的心血管风险评估

Yifei Zhang, Jiashuo Zhang, Mojtaba Safari, Xiaofeng Yang, Liang Zhao

发表机构 * Department of Computer Science, Emory University（埃默里大学计算机科学系）； Department of Computer Science, Johns Hopkins University（约翰霍普金斯大学计算机科学系）； Department of Radiation Oncology（放射肿瘤学部）； Winship Cancer Institute, Emory University（埃默里大学Winship癌症研究所）

AI总结提出一种可解释的跨疾病推理框架，通过提取肺部发现、基于医学知识进行跨器官机制推理，并结合心脏子体积特征，从低剂量胸部CT中实现心血管风险评估，在NLST队列中AUC达0.919。

详情

AI中文摘要

低剂量胸部计算机断层扫描（LDCT）在一次扫描中捕获肺部和心脏结构，使得能够联合评估肺部和心血管健康。现有方法通常独立建模这些领域，并未明确表示它们的生理交互。我们提出了一种可解释的跨疾病推理框架，用于从LDCT进行心血管风险评估。该框架遵循受限的临床信息路径：它提取肺部发现，将跨器官机制基于医学知识进行推理，并生成带有自然语言理由的心血管预测。它结合了四个组件：一个冻结的肺风险先验、一个肺部感知模块、一个代理推理模块和一个心脏子体积特征提取器。它们的输出被融合，以将局部心脏证据与机制层面的肺部上下文整合。在国家肺筛查试验队列中，该框架在CVD筛查中达到0.919的AUC，在CVD死亡率预测中高达0.838，优于心脏特异性、单疾病和基础模型基线。目标对照表明，这些增益不能仅由额外的胸部视觉特征、固定规则传播或单一推理后端解释。因此，所提出的框架提供了一种可审计的方法，用于从LDCT进行跨疾病心血管风险评估。

英文摘要

Low-dose chest computed tomography (LDCT) captures pulmonary and cardiac structures in a single scan, enabling joint assessment of lung and cardiovascular health. Existing approaches typically model these domains independently and do not explicitly represent their physiological interactions. We propose an Explainable Cross-Disease Reasoning Framework for cardiovascular risk assessment from LDCT. The framework follows a constrained clinical-information pathway: it extracts pulmonary findings, grounds cross-organ mechanisms in medical knowledge, and produces a cardiovascular prediction with a natural-language rationale. It combines four components: a frozen lung-risk prior, a pulmonary perception module, an agentic reasoning module, and a cardiac subvolume feature extractor. Their outputs are fused to integrate localized cardiac evidence with mechanism-level pulmonary context. On the National Lung Screening Trial cohort, the framework achieves an AUC of 0.919 for CVD screening and up to 0.838 for CVD mortality prediction, outperforming cardiac-specific, single-disease, and foundation-model baselines. Targeted controls indicate that the gains are not explained by additional thoracic visual features alone, fixed rule propagation, or a single reasoning backend. The proposed framework thus provides an auditable approach to cross-disease cardiovascular risk assessment from LDCT.

URL PDF HTML ☆

赞 0 踩 0

2507.13428 2026-05-27 cs.CV cs.AI 版本更新

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

PhyWorldBench：文本到视频模型中物理真实性的全面评估

Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang

发表机构 * University of California, Santa Cruz（加州大学圣克ruz分校）； NVIDIA Research（NVIDIA研究）； Northeastern University（东北大学）； University of California, Santa Barbara（加州大学圣巴巴拉分校）

AI总结提出PhyWorldBench基准，通过1050个提示评估12个视频生成模型在物理规律遵循上的表现，并引入反物理类别，利用多模态大语言模型进行零样本评估。

Comments 35 pages, 21 figures

详情

Journal ref: ICLR 2026 oral

AI中文摘要

视频生成模型在创建高质量、逼真内容方面取得了显著进展。然而，它们准确模拟物理现象的能力仍然是一个关键且未解决的挑战。本文提出了PhyWorldBench，一个全面的基准测试，旨在根据视频生成模型对物理定律的遵循程度进行评估。该基准涵盖了多个层次的物理现象，从基本物理原理如物体运动和能量守恒，到更复杂的场景如刚体相互作用以及人或动物的运动。此外，我们引入了一个新颖的反物理类别，其中提示故意违反现实世界的物理规律，从而评估模型在保持逻辑一致性的同时能否遵循此类指令。除了大规模人工评估外，我们还设计了一种简单而有效的方法，利用当前的多模态大语言模型以零样本方式评估物理真实性。我们评估了12个最先进的文本到视频生成模型，包括五个开源模型和五个专有模型，并进行了详细的比较和分析。通过对跨越基础、复合和反物理场景的1050个精心策划的提示进行系统测试，我们识别出这些模型在遵循现实世界物理规律方面面临的关键挑战。我们进一步研究了它们在不同物理现象和提示类型下的表现，并得出了针对性的建议，以构建增强物理原理保真度的提示。

英文摘要

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles such as object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel Anti-Physics category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that utilizes current multimodal large language models to evaluate physics realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with detailed comparison and analysis. Through systematic testing across 1050 curated prompts spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We further examine their performance under diverse physical phenomena and prompt types, and derive targeted recommendations for crafting prompts that enhance fidelity to physical principles.

URL PDF HTML ☆

赞 0 踩 0

2601.21008 2026-05-27 cs.LG cs.AI math.OC 版本更新

ORLoopBench: Solver-in-the-Loop Benchmarks for Self-Correction and Behavioral Rationality in Operations Research

ORLoopBench：运筹学中自我修正与行为理性的求解器在环基准测试

Ruicheng Ao, David Simchi-Levi, Xinshang Wang

AI总结提出ORLoopBench基准套件，通过将不可行模型修复形式化为求解器在环马尔可夫决策过程，利用不可约不可行子系统（IIS）反馈，结合验证强化学习训练（RLVR），使8B模型在LP修复上超越前沿API（95.3% vs 92.4% RR@5），并揭示全模型代码再生中的语义漂移问题。

Comments 58 pages, accepted by ICML 2026

详情

AI中文摘要

运筹学从业者通过迭代过程调试不可行模型：检查不可约不可行子系统（IIS），识别约束冲突，并修复公式直至恢复可行性。现有的LLM基准大多将OR视为从问题描述到求解器代码的一次性翻译，忽略了这一诊断循环。我们将不可行模型修复形式化为一个求解器在环马尔可夫决策过程，其中每个动作触发求解器重新执行和IIS重新计算，产生确定性的、可验证的反馈。我们引入ORLoopBench，一个包含两个组件的基准套件：OR-Debug-Bench发布5,362个LP/MILP修复实例，而OR-Bias-Bench评估库存设置中的闭式运营决策理性。求解器验证的RLVR训练使8B模型在LP修复上超越前沿API（95.3% vs 92.4% RR@5），改善诊断行为，并迁移到MILP修复。同样的评估暴露了全模型代码再生中的语义漂移：可行的再生MILP可能解决错误的问题。使用求解器预言机的过程级评估能够为可靠的OR自我修正进行针对性训练。

英文摘要

Operations Research practitioners debug infeasible models through an iterative process: inspecting Irreducible Infeasible Subsystems ( IIS), identifying constraint conflicts, and repairing formulations until feasibility is restored. Existing LLM benchmarks mostly treat OR as one-shot translation from problem descriptions to solver code, omitting this diagnostic loop. We formalize infeasible-model repair as a solver-in-the-loop Markov Decision Process in which each action triggers solver re-execution and IIS recomputation, yielding deterministic, verifiable feedback. We introduce ORLoopBench, a benchmark suite with two components: OR-Debug-Bench releases 5,362 LP/MILP repair instances, while OR-Bias-Bench evaluates closed-form operational decision rationality across inventory settings. Solver-verified RLVR training enables an 8B model to surpass frontier APIs on LP repair (95.3% vs 92.4% RR @5), improves diagnostic behavior, and transfers to MILP repair. The same evaluation exposes semantic drift in whole-model code regeneration: feasible regenerated MILPs can solve the wrong problem. Process-level evaluation with solver oracles enables targeted training for reliable OR self-correction.

URL PDF HTML ☆

赞 0 踩 0

2501.06708 2026-05-27 cs.LG cs.AI 版本更新

Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights

通过模仿模型权重评估样本效用以实现高效数据选择

Tzu-Heng Huang, Manjot Bilkhu, John Cooper, Frederic Sala, Javier Movellan

发表机构 * Apple（苹果公司）

AI总结提出基于梯度和几何的Mimic Score指标，通过Grad-Mimic框架在线重加权样本加速训练、离线构建数据过滤器，在六个图像数据集上提升数据效率和CLIP模型性能。

Comments This work appears in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026) and was selected as an Oral paper at the ICML 2025 DataWorld Workshop

详情

AI中文摘要

大规模网络爬取数据集包含噪声、偏差和不相关信息，因此需要数据选择技术。现有方法依赖于手工启发式、下游数据集或需要昂贵的基于影响力的计算——所有这些都限制了可扩展性并引入了不必要的数据依赖性。为了解决这个问题，我们引入了Mimic Score，一种简单且基于几何的数据质量指标，通过测量样本梯度与预训练参考模型诱导的目标方向之间的对齐来评估效用。这利用了现成的模型权重，避免了验证数据集的需求，并且计算开销最小。基于该指标，我们提出了Grad-Mimic，一个两阶段框架，在线重新加权样本以加速训练，并离线聚合样本效用以构建有效的数据过滤器。实验表明，使用模仿分数指导训练提高了数据效率，加速了收敛，在六个图像数据集上取得了一致的性能提升，并以减少20.7%的训练步骤增强了CLIP模型。此外，基于模仿分数的过滤器增强了现有过滤技术，使得用更少470万个样本训练的CLIP模型得到改进。

英文摘要

Large-scale web-crawled datasets contain noise, bias, and irrelevant information, necessitating data selection techniques. Existing methods depend on hand-crafted heuristics, downstream datasets, or require expensive influence-based computations -- all of which limit scalability and introduce unwanted data dependencies. To address this, we introduce the Mimic Score, a simple and geometry-based data-quality metric that evaluates utility by measuring alignment between a sample's gradients and a target direction induced by a pre-trained reference model. This leverages readily available model weights, avoids needing validation datasets, and incurs minimal computational overheads. Building on this metric, we propose Grad-Mimic, a two-stage framework that re-weights samples online to accelerate training and aggregates sample utilities offline to construct effective data filters. Empirically, we show that using mimic scores to guide training improves data efficiency, accelerates convergence, yields consistent performance gains across six image datasets, and enhances CLIP models with 20.7% fewer training steps. Additionally, mimic score-based filters augment existing filtering techniques, enabling improved CLIP models trained with 4.7 million fewer samples.

URL PDF HTML ☆

赞 0 踩 0

2602.04931 2026-05-27 cs.LG cs.AI 版本更新

Emergent Causal-Geometric Dynamics Across Depth in Large Language Models

大型语言模型中跨深度的涌现因果几何动力学

Shahar Haim, Daniel C McNamee

发表机构 * Champalimaud Centre for the Unknown（查普拉米乌德未知中心）

AI总结通过结合几何分析与因果干预，揭示了解码器-only大型语言模型中从上下文处理到预测形成的跨层转变，并发现后期层中角度结构参数化下一词分布相似性并实现选择性因果控制。

详情

AI中文摘要

对大型语言模型（LLM）表征的几何分析揭示了跨深度的结构化变化，但本质上与token预测形成相关。同时，因果干预揭示了依赖于深度的效能曲线，但缺乏对其表征动力学的统一解释。对LLM功能的完整解释需要说明表征结构如何跨深度演化以因果性地产生预测。我们通过将几何分析与机械干预相结合，明确将跨深度动力学作为解释LLM功能的组织轴，综合了这些视角。在解码器-only LLM中，我们识别出从上下文处理到预测形成计算的急剧转变，伴随着跨层的表征几何的更渐进重组。这种综合揭示了一种后期层几何编码，其中角度结构参数化下一词分布相似性，并能够对预测进行选择性因果控制，而表征范数编码的信息与预测基本解耦。总之，我们的结果提供了因果和几何视角的综合，产生了关于语言模型中跨深度的控制相关几何动力学如何将上下文转化为预测的机械论解释。这一视角调和了先前令人困惑的发现，并表明层状功能不能孤立地理解或有效干预，而只能在网络涌现的全局动力学结构中理解。

英文摘要

Geometric analyses of large language model (LLM) representations reveal structured variation across depth but remain fundamentally correlational with respect to token prediction formation. Meanwhile, causal interventions expose depth-dependent efficacy profiles without a unifying account of their representational dynamics. A complete account of LLM function requires explaining how representational structure evolves across depth to causally produce predictions. We synthesize these perspectives by combining geometric analysis with mechanistic interventions, explicitly centralizing depth-wise dynamics as the organizing axis for interpreting LLM function. In decoder-only LLMs, we identify a sharp transition from context-processing to prediction-forming computation, accompanied by a more gradual reorganization of representational geometry across layers. This synthesis reveals a late-layer geometric code in which angular structure parameterizes next-token distributional similarity and enables selective causal control over predictions, while representation norms encode information largely decoupled from prediction. Together, our results provide a synthesis of causal and geometric perspectives, yielding a mechanistic account of how control-relevant geometric dynamics across depth transform context into prediction in language models. This perspective reconciles previously puzzling findings and implies that layer-wise function cannot be understood or effectively intervened upon in isolation, but only within the emergent global dynamical structure of the network.

URL PDF HTML ☆

赞 0 踩 0

2602.03545 2026-05-27 cs.AI 版本更新

Persona Generators: Generating Diverse Synthetic Personas for Arbitrary Contexts

人格生成器：为任意上下文生成多样化的合成人格

Davide Paglieri, Logan Cross, William A. Cunningham, Joel Z. Leibo, Alexander Sasha Vezhnevets

发表机构 * Google DeepMind（谷歌深Mind）

AI总结提出Persona Generators，通过迭代进化优化生成覆盖广泛意见和偏好的多样化合成人格，在六个多样性指标上显著优于现有基线。

详情

AI中文摘要

评估与人类交互的AI系统需要理解它们在不同用户群体中的行为，但收集代表性人类数据通常成本高昂或不可行，特别是对于新技术或假设的未来场景。最近在生成式基于智能体建模方面的工作表明，大型语言模型可以高保真地模拟类似人类的合成人格，准确再现特定个体的信念和行为。然而，大多数方法需要关于目标群体的详细数据，并且通常优先考虑密度匹配（复制最可能的内容）而非支持覆盖（覆盖可能的内容），导致长尾行为未被充分探索。我们引入了Persona Generators，即能够为任意上下文生成多样化合成群体的函数。我们应用基于AlphaEvolve的迭代改进循环，使用大型语言模型作为变异算子，在数百次迭代中优化我们的Persona Generator代码。优化过程产生了轻量级的Persona Generators，能够自动将小规模描述扩展为多样化的合成人格群体，这些群体在相关多样性轴上最大化意见和偏好的覆盖。我们证明，进化后的生成器在保留上下文上的六个多样性指标上显著优于现有基线，产生了覆盖标准LLM输出中难以实现的罕见特征组合的群体。

英文摘要

Evaluating AI systems that interact with humans requires understanding their behavior across diverse user populations, but collecting representative human data is often expensive or infeasible, particularly for novel technologies or hypothetical future scenarios. Recent work in Generative Agent-Based Modeling has shown that large language models can simulate human-like synthetic personas with high fidelity, accurately reproducing the beliefs and behaviors of specific individuals. However, most approaches require detailed data about target populations and often prioritize density matching (replicating what is most probable) rather than support coverage (spanning what is possible), leaving long-tail behaviors underexplored. We introduce Persona Generators, functions that can produce diverse synthetic populations tailored to arbitrary contexts. We apply an iterative improvement loop based on AlphaEvolve, using large language models as mutation operators to refine our Persona Generator code over hundreds of iterations. The optimization process produces lightweight Persona Generators that can automatically expand small descriptions into populations of diverse synthetic personas that maximize coverage of opinions and preferences along relevant diversity axes. We demonstrate that evolved generators substantially outperform existing baselines across six diversity metrics on held-out contexts, producing populations that span rare trait combinations difficult to achieve in standard LLM outputs.

URL PDF HTML ☆

赞 0 踩 0

2602.03238 2026-05-27 cs.AI 版本更新

The Necessity of a Unified Framework for LLM-Based Agent Evaluation

基于LLM的智能体评估统一框架的必要性

Pengyu Zhu, Li Sun, Philip S. Yu, Sen Su

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； University of Illinois Chicago（伊利诺伊大学芝加哥分校）； Chongqing University of Posts and Telecommunications（重庆邮电大学）

AI总结针对当前LLM智能体评估受系统提示、工具集和环境动态等混杂因素影响的问题，提出标准化统一评估框架以提升公平性和可复现性。

详情

AI中文摘要

随着大型语言模型（LLM）的出现，通用智能体取得了根本性进展。然而，评估这些智能体带来了独特的挑战，使其区别于静态的问答基准。我们观察到，当前的智能体基准受到系统提示、工具集配置和环境动态等外部因素的严重混淆。现有评估通常依赖于碎片化的、研究者特定的框架，其中推理和工具使用的提示工程差异很大，使得难以将性能提升归因于模型本身。此外，缺乏标准化的环境数据导致不可追踪的错误和不可重复的结果。这种标准化的缺失给该领域带来了显著的不公平性和不透明性。我们提出，一个统一的评估框架对于智能体评估的严谨进展至关重要。为此，我们提出了一项旨在标准化智能体评估的建议。

英文摘要

With the advent of Large Language Models (LLMs), general-purpose agents have seen fundamental advancements. However, evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We observe that current agent benchmarks are heavily confounded by extraneous factors, including system prompts, toolset configurations, and environmental dynamics. Existing evaluations often rely on fragmented, researcher-specific frameworks where the prompt engineering for reasoning and tool usage varies significantly, making it difficult to attribute performance gains to the model itself. Additionally, the lack of standardized environmental data leads to untraceable errors and non-reproducible results. This lack of standardization introduces substantial unfairness and opacity into the field. We propose that a unified evaluation framework is essential for the rigorous advancement of agent evaluation. To this end, we introduce a proposal aimed at standardizing agent evaluation.

URL PDF HTML ☆

赞 0 踩 0

2602.02518 2026-05-27 cs.LG cs.AI cs.CL 版本更新

GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training

GraphDancer: 通过两阶段课程后训练训练LLMs在图上的探索与推理

Yuyang Bai, Zhuofeng Li, Ping Nie, Jianwen Xie, Yu Zhang

发表机构 * Texas A&M University（德克萨斯大学A&M分校）； University of Waterloo（滑铁卢大学）； Lambda（Lambda公司）； University of Oregon（俄勒冈大学）

AI总结提出GraphDancer两阶段后训练框架，通过图感知课程逐步增加任务难度，使LLMs学会在异构图上进行自然语言推理与函数调用交织的探索与推理，仅用3B骨干模型即在跨域基准上超越更强基线。

Comments 15 pages, Project website: https://yuyangbai.com/graphdancer/

详情

AI中文摘要

大型语言模型（LLMs）越来越依赖外部知识来提高事实性，然而许多真实世界的知识源被组织为异构图而非纯文本。在此类图上进行推理要求模型通过精确的函数调用遵循模式定义的关系，并在多轮交互中聚合证据。我们提出GraphDancer，一个两阶段后训练框架，通过将自然语言推理与图函数执行交织来教导LLMs在图上的推理。第一阶段教导模型在基于规则的奖励下如何与图交互，而第二阶段进一步教导其偏好更基于事实且高效的交互轨迹。GraphDancer的关键创新在于一个图感知课程，该课程根据信息寻求轨迹的结构复杂性组织两个阶段，在训练期间逐步增加任务难度。我们在一个多领域基准上评估GraphDancer，仅在一个领域上训练，并在未见过的领域和分布外问题类型上进行测试。尽管仅使用3B骨干模型，GraphDancer仍优于配备更大/更强骨干的基线，展示了图探索和推理技能的强大跨域泛化能力。我们的代码可在https://github.com/leopoldwhite/GraphDancer找到。

英文摘要

Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organized as heterogeneous graphs rather than plain text. Reasoning over such graphs requires models to follow schema-defined relations through precise function calls and to aggregate evidence across multiple rounds of interaction. We propose GraphDancer, a two-stage post-training framework that teaches LLMs to reason over graphs by interleaving natural-language reasoning with graph function execution. The first stage teaches the model how to interact with the graph under rule-based rewards, while the second stage further teaches it to prefer more grounded and efficient interaction trajectories. The key novelty of GraphDancer is a graph-aware curriculum that organizes both stages by the structural complexity of information-seeking trajectories, progressively increasing task difficulty during training. We evaluate GraphDancer on a multi-domain benchmark by training on one domain only and testing on unseen domains and out-of-distribution question types. Despite using only a 3B backbone, GraphDancer outperforms baselines equipped with larger/stronger backbones, demonstrating robust cross-domain generalization of graph exploration and reasoning skills. Our code can be found at https://github.com/leopoldwhite/GraphDancer.

URL PDF HTML ☆

赞 0 踩 0

2602.01518 2026-05-27 cs.AI 版本更新

Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection

Qrita：使用基于枢轴的截断和选择的高性能Top-k和Top-p

Jongseok Park, Sunga Kim, Alvin Cheung, Ion Stoica

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出Qrita算法，通过基于高斯sigma截断和四元枢轴搜索的枢轴方法，高效实现Top-k和Top-p采样，在保持与排序算法相同输出的同时，将端到端服务吞吐量提升至1.4倍并减少一半内存使用。

详情

AI中文摘要

尽管Top-k和Top-p算法在模型采样中很重要，但对于大词汇表的高效实现仍然是一个重大挑战。现有方法通常依赖于排序，这在GPU上会带来显著的计算和内存开销，或者依赖于改变算法输出的随机方法。在这项工作中，我们提出了Qrita，一种基于枢轴截断和选择的高效Top-k和Top-p算法。Qrita利用基于枢轴的搜索来实现Top-k和Top-p，并采用两种关键技术：1. 基于高斯的sigma截断，大大减少了词汇表的搜索空间；2. 具有重复处理能力的四元枢轴搜索，将枢轴搜索迭代次数减半并保证确定性输出。我们使用Triton实现了Qrita，并针对高性能LLM执行引擎（如SGLang和FlashInfer）的Top-k和Top-p内核评估了其性能，将端到端服务吞吐量提高了1.4倍，同时内存使用量减半，并提供了与基于排序算法相同的输出。Qrita现在是vLLM GPU执行路径的默认Top-k和Top-p采样器，Qrita的三元实现可在https://github.com/vllm-project/vllm/blob/main/vllm/v1/sample/ops/topk_topp_triton.py获取。

英文摘要

Despite their importance in model sampling, efficient implementation of Top-k and Top-p algorithms for large vocabularies remains a significant challenge. Existing approaches often rely on sorting, which incurs significant computation and memory overhead on GPUs, or on stochastic approaches that alter the algorithm's output. In this work, we propose Qrita, an efficient Top-k and Top-p algorithm based on a pivot-based truncation and selection. Qrita leverages pivot-based search for both Top-k and Top-p with two key techniques: 1. Gaussian-based sigma-truncation, which greatly reduces the search space of the vocabulary, and 2. Quaternary pivot search with duplication handling, which halves the number of pivot search iterations and guarantees deterministic output. We implement Qrita using Triton and evaluate its performance against the Top-k and Top-p kernels of high-performance LLM execution engines such as SGLang and FlashInfer, improving end-to-end serving throughput up to 1.4 times with half the memory usage, while providing the same output as the sorting-based algorithms. Qrita is now the default Top-k and Top-p sampler for the GPU execution path of vLLM, and a ternary implementation of Qrita is available at https://github.com/vllm-project/vllm/blob/main/vllm/v1/sample/ops/topk_topp_triton.py.

URL PDF HTML ☆

赞 0 踩 0

2601.22648 2026-05-27 cs.AI cs.LG 版本更新

思维链压缩：理论分析

Juncai Li, Ru Li, Yuxiang Zhou, Boxiang Ma, Jeff Z. Pan

发表机构 * School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi, China（山西大学计算机与信息学院）； Queen Mary, University of London, UK（伦敦大学女王学院）； School of Informatics, University of Edinburgh, UK（爱丁堡大学信息学院）

AI总结本文通过引入Order-r Interaction理论，证明了隐式思维链压缩中高阶逻辑依赖的学习信号指数衰减问题，并提出ALiCoT框架通过对齐潜在令牌分布与中间推理状态来克服信号衰减，实现54.4倍加速且性能与显式CoT相当。

详情

AI中文摘要

思维链（CoT）通过中间步骤解锁了大语言模型（LLMs）的高级推理能力，但由于生成额外令牌而带来了高昂的计算成本。最近的研究经验表明，将推理步骤压缩到潜在状态中，即隐式CoT压缩，提供了一种令牌高效的替代方案。然而，CoT压缩背后的机制仍不清楚。在本文中，我们首次对学习内化中间推理步骤的难度进行了理论分析。通过引入Order-r Interaction，我们证明了高阶逻辑依赖的学习信号指数衰减以解决不可约问题，其中跳过中间步骤不可避免地导致高阶交互障碍。为了经验验证这一点，我们引入了NatBool-DAG，这是一个具有挑战性的基准测试，旨在强制执行不可约逻辑推理并消除语义捷径。在我们的理论发现指导下，我们提出了ALiCoT（对齐隐式CoT），一种新颖的框架，通过对齐潜在令牌分布与中间推理状态来克服信号衰减。实验结果表明，ALiCoT成功解锁了高效推理：它实现了54.4倍加速，同时保持与显式CoT相当的性能。

英文摘要

Chain-of-Thought (CoT) has unlocked advanced reasoning abilities of Large Language Models (LLMs) with intermediate steps, yet incurs prohibitive computational costs due to generation of extra tokens. Recent studies empirically show that compressing reasoning steps into latent states, or implicit CoT compression, offers a token-efficient alternative. However, the mechanism behind CoT compression remains unclear. In this paper, we provide the first theoretical analysis of the difficulty of learning to internalize intermediate reasoning steps. By introducing Order-r Interaction, we prove that the learning signal for high-order logical dependencies exponentially decays to solve irreducible problem, where skipping intermediate steps inevitably leads to high-order interaction barriers. To empirically validate this, we introduce NatBool-DAG, a challenging benchmark designed to enforce irreducible logical reasoning and eliminate semantic shortcuts. Guided by our theoretical findings, we propose ALiCoT (Aligned Implicit CoT), a novel framework that overcomes the signal decay by aligning latent token distributions with intermediate reasoning states. Experimental results demonstrate that ALiCoT successfully unlocks efficient reasoning: it achieves a 54.4x speedup while maintaining performance comparable to explicit CoT.

URL PDF HTML ☆

赞 0 踩 0

2601.18904 2026-05-27 cs.SD cs.AI cs.CL 版本更新

MetaSICL: Adapting Audiroty LLM via Meta Speech In-Context Learning

MetaSICL: 通过元语音上下文学习适应听觉大语言模型

Haolong Zheng, Siyin Wang, Zengrui Jin, Mark Hasegawa-Johnson

发表机构 * University of Illinois Urbana Champaign（伊利诺伊大学厄巴纳-香槟分校）； Tsinghua University（清华大学）

AI总结提出MetaSICL方法，利用高资源语音数据通过元学习增强听觉大语言模型的上下文学习能力，在低资源场景下优于直接微调。

详情

AI中文摘要

听觉大语言模型在广泛的语音和音频理解任务中表现出强大的性能。然而，当应用于低资源任务时，它们常常遇到困难。如果域内标注数据稀缺或与真实测试分布不匹配，直接微调可能不稳定。上下文学习通过基于少量域内示例的条件化来适应听觉大语言模型，提供了一种无需训练、推理时的解决方案。在这项工作中，我们首先表明，$ extit{Vanilla ICL}$ 在选定的模型上提高了跨多种语音和音频任务的零样本性能，这表明这种ICL适应能力可以推广到多模态设置。在此基础上，我们提出了$ extbf{Meta Speech In-Context Learning (MetaSICL)}$，这是一种后训练方法，仅利用来自各种任务的高资源语音数据，旨在增强模型的上下文学习能力。实验表明，我们提出的方法在低资源场景下优于直接微调。

英文摘要

Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nevertheless, they often struggle when applied to low-resource tasks. In case in-domain labeled data are scarce or mismatched with the true test distribution, direct fine-tuning can be brittle. In-Context Learning (ICL) provides a training-free, inference-time solution by adapting auditory LLMs through conditioning on a few in-domain demonstrations. In this work, we first show that $\textit{Vanilla ICL}$, improves zero-shot performance across diverse speech and audio tasks for selected models which suggest that this ICL adaptation capability can be generalized to multimodal setting. Building on this, we propose $\textbf{Meta Speech In-Context Learning (MetaSICL)}$, a post-training recipe utilizes only high resource speech data from various tasks intending to strengthen model's in-context learning capability. Experiments indicate our proposed method outperforms direct fine-tuning in low-resource scenario.

URL PDF HTML ☆

赞 0 踩 0

2601.18381 2026-05-27 cs.AI cs.SE 版本更新

AI Agent for Reverse-Engineering Legacy Finite-Difference Code and Translating to Devito

AI Agent 用于逆向工程遗留有限差分代码并转换为 Devito

Yinghan Hou, Zongyou Yang

发表机构 * Department of Earth Science and Engineering（地球科学与工程系）； Imperial College London（帝国理工学院伦敦分校）； Department of Computer Science（计算机科学系）； University College London（伦敦大学学院）

AI总结本研究提出一个集成 AI Agent 框架，结合检索增强生成（RAG）和开源大语言模型，通过多阶段迭代工作流将遗留有限差分代码转换为 Devito 环境，并引入强化学习反馈机制实现动态自适应代码翻译。

Comments 14 pages, 7 figures

详情

AI中文摘要

为了促进遗留有限差分实现向 Devito 环境的转换，本研究开发了一个集成的 AI Agent 框架。检索增强生成（RAG）和开源大语言模型通过系统混合 LangGraph 架构中的多阶段迭代工作流相结合。该 Agent 通过文档解析、结构感知分割、实体关系提取和基于 Leiden 的社区检测构建了一个广泛的 Devito 知识图谱。GraphRAG 优化增强了跨语义社区的查询性能，这些社区包括地震波模拟、计算流体动力学和性能调优库。一个逆向工程组件通过 Fortran 源代码的静态分析推导出用于 RAG 检索的三级查询策略。为了为语言模型指导提供精确的上下文信息，多阶段检索流水线执行并行搜索、概念扩展、社区级检索和语义相似性分析。代码合成受基于 Pydantic 的约束控制，以保证结构化输出和可靠性。一个全面的验证框架将传统静态分析与 G-Eval 方法相结合，涵盖执行正确性、结构健全性、数学一致性和 API 合规性。整个 Agent 工作流在 LangGraph 框架上实现，并采用并发处理以支持基于质量的迭代细化和状态感知的动态路由。主要贡献在于引入了受强化学习启发的反馈机制，实现了从静态代码翻译向动态自适应分析行为的转变。

英文摘要

To facilitate the transformation of legacy finite difference implementations into the Devito environment, this study develops an integrated AI agent framework. Retrieval-Augmented Generation (RAG) and open-source Large Language Models are combined through multi-stage iterative workflows in the system's hybrid LangGraph architecture. The agent constructs an extensive Devito knowledge graph through document parsing, structure-aware segmentation, extraction of entity relationships, and Leiden-based community detection. GraphRAG optimisation enhances query performance across semantic communities that include seismic wave simulation, computational fluid dynamics, and performance tuning libraries. A reverse engineering component derives three-level query strategies for RAG retrieval through static analysis of Fortran source code. To deliver precise contextual information for language model guidance, the multi-stage retrieval pipeline performs parallel searching, concept expansion, community-scale retrieval, and semantic similarity analysis. Code synthesis is governed by Pydantic-based constraints to guarantee structured outputs and reliability. A comprehensive validation framework integrates conventional static analysis with the G-Eval approach, covering execution correctness, structural soundness, mathematical consistency, and API compliance. The overall agent workflow is implemented on the LangGraph framework and adopts concurrent processing to support quality-based iterative refinement and state-aware dynamic routing. The principal contribution lies in the incorporation of feedback mechanisms motivated by reinforcement learning, enabling a transition from static code translation toward dynamic and adaptive analytical behavior.

URL PDF HTML ☆

赞 0 踩 0

2512.20957 2026-05-27 cs.SE cs.AI 版本更新

One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents

一个工具就够了：面向仓库级LLM智能体的强化学习

Zhaoxi Zhang, Yitong Duan, Yanzhi Zhang, Yiming Xu, Zhixiang Wang, Kun Liang, Weikang Li, Jiahui Liang, Deguo Xia, Jizhou Huang, Jiyan He, Yunfang Wu

发表机构 * National Key Laboratory for Multimedia Information Processing, Peking University（信息处理国家级重点实验室，北京大学）； School of Computer Science, Peking University（北京大学计算机科学学院）； Zhongguancun Institute of Artificial Intelligence (ZGCI)（中关村人工智能研究院（ZGCI））； Baidu Inc（百度公司）

AI总结提出RepoNavigator，一个仅配备单一执行感知工具（跳转到调用符号定义）的LLM智能体，通过强化学习端到端训练，在仓库级问题定位中达到最先进性能。

详情

AI中文摘要

在大型软件仓库中定位需要修改的文件和函数由于规模和结构复杂性而具有挑战性。现有的基于LLM的方法通常将其视为仓库级检索任务，并依赖多个辅助工具，这些工具常常忽略代码执行逻辑并使模型控制复杂化。我们提出RepoNavigator，一个配备单一执行感知工具的LLM智能体：跳转到调用符号的定义。这种统一设计反映了代码执行的实际流程，同时简化了工具操作。RepoNavigator通过强化学习（RL）直接从基础预训练模型进行端到端训练，不依赖闭源蒸馏。实验表明，经过RL训练的RepoNavigator实现了最先进的性能，7B模型优于14B基线，14B模型超越32B竞争对手，32B模型在大多数指标上超过闭源模型如GPT-5。这些结果证实，将单一的、结构基础的工具与RL训练相结合，为仓库级问题定位提供了高效且可扩展的解决方案。

英文摘要

Locating files and functions requiring modification in large software repositories is challenging due to their scale and structural complexity. Existing LLM-based methods typically treat this as a repository-level retrieval task and rely on multiple auxiliary tools, which often overlook code execution logic and complicate model control. We propose RepoNavigator, an LLM agent equipped with a single execution-aware tool: jumping to the definition of an invoked symbol. This unified design reflects the actual flow of code execution while simplifying tool manipulation. RepoNavigator is trained end-to-end via Reinforcement Learning (RL) directly from a base pretrained model, without relying on closed-source distillation. Experiments demonstrate that RL-trained RepoNavigator achieves state-of-the-art performance, with the 7B model outperforming 14B baselines, the 14B model surpassing 32B competitors, and the 32B model exceeding closed-source models such as GPT-5 on most metrics. These results confirm that integrating a single, structurally grounded tool with RL training provides an efficient and scalable solution for repository-level issue localization.

URL PDF HTML ☆

赞 0 踩 0

2512.01556 2026-05-27 cs.AI cs.CL cs.LG 版本更新

LEC: Linear Expectation Constraints for Selection-Conditioned Risk Control in Selective Prediction and Routing Systems

LEC: 选择性预测与路由系统中基于选择条件风险控制的线性期望约束

Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen, Xiaoshuang Shi, Kaidi Xu

发表机构 * University of Electronic Science and Technology of China（电子科技大学）； Munich Center for Machine Learning（慕尼黑机器学习中心）； University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）； Shandong University（山东大学）； Tongji University（同济大学）； City University of Hong Kong（香港城市大学）

AI总结提出LEC框架，通过线性期望约束将选择性预测转化为决策问题，在可交换性假设下利用校准集计算风险约束下的保留最大化阈值，并扩展到双模型路由系统，实现选择条件误差控制。

Comments Accepted by ICML 2026 Regular

详情

AI中文摘要

基础模型常常生成不可靠的答案，而启发式不确定性估计器无法完全区分正确与错误输出，导致用户在没有统计保证的情况下接受错误答案。我们通过选择条件风险控制来解决这个问题，旨在确保接受的预测的错误概率不超过用户指定的风险水平。为此，我们提出了LEC，一个原则性框架，将选择性预测重新定义为由选择和错误指标上的线性期望约束控制的决策问题。该公式直接控制接受错误期望数与接受预测期望数之间的比率，这对应于选择条件下的边际错误概率。在可交换性下，我们推导出一个仅依赖于保留校准集的有限样本充分条件，从而能够计算风险约束下的保留最大化阈值。此外，我们将LEC扩展到双模型路由系统：如果主模型的不确定性超过其校准阈值，则输入被委托给后续模型，同时保持系统级的选择条件误差控制。在封闭式和开放式问答（QA）以及视觉问答（VQA）上的实验表明，LEC在接受的预测中维持了规定的风险水平，并且与基线相比显著提高了样本保留率。

英文摘要

Foundation models often generate unreliable answers, while heuristic uncertainty estimators fail to fully distinguish correct from incorrect outputs, causing users to accept erroneous answers without any statistical guarantee. We address this problem through selection-conditioned risk control, aiming to ensure that an accepted prediction has an error probability no larger than a user-specified risk level. To this end, we propose LEC, a principled framework that reframes selective prediction as a decision problem governed by a linear expectation constraint over selection and error indicators. This formulation directly controls the ratio between the expected number of accepted errors and the expected number of accepted predictions, which corresponds to the marginal error probability conditioned on selection. Under exchangeability, we derive a finite-sample sufficient condition that relies only on a held-out calibration set, enabling the computation of a risk-constrained, retention-maximizing threshold. Furthermore, we extend LEC to two-model routing systems: if the primary model's uncertainty exceeds its calibrated threshold, the input is delegated to a subsequent model, while maintaining system-level selection-conditioned error control. Experiments on both closed-ended and open-ended question answering (QA) and vision question answering (VQA) demonstrate that LEC maintains the prescribed risk level in accepted predictions and substantially improves sample retention compared to baselines.

URL PDF HTML ☆

赞 0 踩 0

2601.14702 2026-05-27 cs.AI cs.CV cs.RO 版本更新

超越迁移准确率：用于受控低资源适应的忠实电路

Khumaisa Nur'aini, Ayu Purwarianti, Alham Fikri Aji, Derry Wijaya

发表机构 * Monash University Indonesia（印度尼西亚墨尔本大学）； Institute Teknologi Bandung（Bandung理工大学）； MBZUAI（MBZUAI研究所）； Boston University（波士顿大学）

AI总结提出基于上下文分解的电路发现方法（CD-T），通过标签平衡激活均值和任务方向相关性评分实现无反事实电路发现，并利用电路目标监督微调（CT-SFT）在低资源跨语言情感迁移中最小化灾难性遗忘，优于全局微调。

详情

AI中文摘要

现有的电路发现方法依赖于具有干净反事实的模板化任务，限制了它们在多样化自然文本上的使用。我们通过标签平衡激活均值和任务方向相关性评分，将上下文分解方法适配到非结构化设置（CD-T），实现了无反事实的电路发现。我们利用这些电路进行电路目标监督微调（CT-SFT），将参数更新限制在任务相关的注意力头和层归一化上。在NusaX跨语言情感迁移上的实验表明，CT-SFT在低资源适应中极具竞争力。虽然非电路稀疏更新和全微调有时通过能力招募达到目标准确率，但CT-SFT独特地最小化灾难性遗忘，保留了源语言和相关任务的性能。在XNLI上的扩展证实了这些发现在更广泛的任务和模型家族中成立，表明电路目标适应提供了一种更安全、基于因果关系的全局微调替代方案。

英文摘要

Existing circuit discovery methods rely on templated tasks with clean counterfactuals, limiting their use on diverse natural text. We adapt Contextual Decomposition for Transformers (CD-T) for unstructured settings via label-balanced activation means and task-directional relevance scoring, enabling counterfactual-free circuit discovery. We leverage these circuits for Circuit-Targeted Supervised Fine-Tuning (CT-SFT), restricting parameter updates to task-relevant heads and LayerNorm. Experiments on NusaX cross-lingual sentiment transfer show that CT-SFT is highly competitive for low-resource adaptation. While non-circuit sparse updates and full fine-tuning sometimes match target accuracy through capacity recruitment, CT-SFT uniquely minimizes catastrophic forgetting, preserving source-language and related-task performance. Extensions to XNLI confirm these findings hold across broader tasks and model families, demonstrating that circuit-targeted adaptation provides a safer, causally grounded alternative to global fine-tuning.

URL PDF HTML ☆

赞 0 踩 0

2512.01572 2026-05-27 cs.LG cs.AI physics.app-ph 版本更新

Reconstructing Multi-Scale Physical Fields from Extremely Sparse Measurements with an Autoencoder-Diffusion Cascade

使用自编码器-扩散级联从极度稀疏测量中重建多尺度物理场

Letian Yi, Tingpeng Zhang, Mingyuan Zhou, Guannan Wang, Quanke Su, Zhilu Lai

发表机构 * Internet of Things Thrust（物联网方向）； Intelligent Transportation Thrust（智能交通方向）； Marine Hydrodynamic Research Facility（海洋流体研究设施）； Department of Civil and Environmental Engineering（土木与环境工程系）

AI总结提出Cascaded Sensing框架，通过粗尺度确定性估计和细尺度条件扩散模型级联，解决极度稀疏测量下物理场重建的不适定性和多模态后验问题。

Comments 34 pages,22 figures

详情

AI中文摘要

极端传感器稀疏性使得全场重建成为科学传感中一个根本性的不适定问题，其目标是从稀疏测量中推断物理场。在此情况下，后验严重欠约束且固有地多模态，使其近似高度病态。具体而言，确定性映射会坍塌不确定性，直接条件学习无法覆盖可能的观测条件解空间，而似然引导采样对噪声和传感器配置高度敏感。这些限制导致后验估计不稳定，并突显了以结构化方式建模不确定性的必要性。为此，我们提出了Cascaded Sensing，一个跨尺度重构后验推理的分层框架。Cas-Sensing不直接建模全场后验，而是首先通过确定性粗阶段估计器解决全局结构模糊性。一个基于神经算子的功能自编码器，使用掩码输入训练，将稀疏观测映射到粗尺度结构场，其作用类似于最大后验估计器，选择主导全局配置。该结构锚点固定了后验的主要自由度，并将问题转化为一个条件更好的残差推理任务。然后，一个条件扩散模型仅学习细化尺度的残差分布，将采样限制在合理解的稳定邻域内，并抑制观测一致模式之间的竞争。为了增强在不同传感条件下的鲁棒性，我们引入了掩码级联训练，通过中间粗重建使模型暴露于多样的稀疏观测模式。在推理过程中，流形约束引导将观测一致性作为细化机制而非全局模式选择过程来实施。

英文摘要

Extreme sensor sparsity makes full-field reconstruction a fundamentally ill-posed problem in scientific sensing,where the goal is to infer physical fields from sparse measurements.In this regime,the posterior is severely underconstrained and inherently multimodal,making its approximation highly ill-conditioned.Specifically,deterministic mappings collapse uncertainty,direct conditional learning cannot cover the space of possible observation-conditioned solutions,and likelihood-guided sampling becomes highly sensitive to noise and sensor configurations.These limitations result in unstable posterior estimates and highlight the need for modeling uncertainty in a structural manner.To this end,we propose Cascaded Sensing,a hierarchical framework that restructures posterior inference across scales.Rather than modeling the full-field posterior directly,Cas-Sensing first resolves global structural ambiguity through a deterministic coarse-stage estimator.A neural-operator-based functional autoencoder,trained with masked inputs,maps sparse observations to a coarse-scale structural field,acting analogously to a maximum a posteriori estimator that selects the dominant global configuration.This structural anchor fixes the principal degrees of freedom of the posterior and transforms the problem into a better-conditioned residual inference task.A conditional diffusion model then learns only the refined-scale residual distribution,confining sampling to a stable neighborhood of plausible solutions and suppressing competition among observation-consistent modes.To enhance robustness under varying sensing conditions,we introduce mask-cascade training,which exposes the model to diverse sparse observation patterns through intermediate coarse reconstructions.During inference,manifold-constrained guidance enforces observation consistency as a refinement mechanism rather than a global mode-selection process.

URL PDF HTML ☆

赞 0 踩 0

2601.07737 2026-05-27 cs.CV cs.AI 版本更新

Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes

看见 vs. 相信：评估开源多模态大模型在反直觉场景中的语言偏见

Chen Ling, Tongwei Zhang, Hanqian Li, Nai Ding

发表机构 * Zhejiang University（浙江大学）； Beijing University of Posts and Telecommunications（北京邮电大学）； Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结为评估多模态大模型处理反直觉动作场景的能力，提出CAIT基准（400个高保真合成场景），发现开源模型因语言先验而忽视视觉证据，性能接近随机水平，而链式思维推理虽提升准确率但导致过度思考拒绝视觉内容，通过微调和结构化提示可缓解此偏见。

详情

AI中文摘要

多模态大语言模型（MLLMs）在主流视觉理解任务中表现出色，但其处理违背日常常识的动作场景的能力尚未得到充分测试。为填补这一空白，我们引入了CAIT，一个包含400个高保真合成场景的基准，专注于反直觉的视觉动作，例如“兔子在追老虎”，其中视觉证据明确违背常识预期。我们评估了人类、领先的专有模型（如Claude和Gemini）以及14个代表性的开源MLLMs。人类达到近乎完美的性能（约0.95准确率），专有模型表现出稳健的理解（达到0.88准确率），而标准的开源指令微调模型性能处于随机水平。进一步分析表明，这种失败是由强烈的语言先验驱动的：模型不信任视觉输入，而是自动用统计上常见的文本描述覆盖异常的视觉信号。尽管引入链式思维推理机制可以提高准确率，但会显著减慢响应速度并产生新的失败模式：模型过度思考场景，仅仅因为违反现实物理定律而拒绝接受实际的视觉内容。最后，我们证明有针对性的微调和结构化提示可以有效缓解这种对语言先验的依赖，使开源模型能够基于实际视觉证据准确地进行推理。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their ability to process action scenes that contradict everyday common sense remains undertested. To address this gap, we introduce CAIT, a benchmark comprising 400 high-fidelity synthetic scenes focused on counter-intuitive visual actions, such as ``a rabbit is chasing a tiger'', where visual evidence explicitly contradicts common-sense expectations. We evaluate human, leading proprietary models (e.g., Claude and Gemini), and 14 representative open-source MLLMs. Humans achieve near-perfect performance (around 0.95 accuracy) and proprietary models demonstrate robust understanding (achieving up to 0.88 accuracy), standard open-source instruction-tuned models perform at the chance level. Further analysis demonstrates that this failure is driven by a strong language prior: rather than trusting the visual input, they automatically override the anomalous visual signals with statistically common text descriptions. Although introducing Chain-of-Thought reasoning mechanisms can improve accuracy, it significantly slows down the response and generates a new failure mode: models overthink the scenario and refuse to accept the actual visual content simply because it violates real-world physical laws. Finally, we demonstrate that targeted fine-tuning and structured prompting can effectively mitigate this reliance on language priors, enabling open-source models to accurately ground their reasoning in actual visual evidence.

URL PDF HTML ☆

赞 0 踩 0

2601.07085 2026-05-27 cs.HC cs.AI cs.CY 版本更新

The AI Cognitive Trojan Horse: How Large Language Models May Bypass Human Epistemic Vigilance

AI认知特洛伊木马：大型语言模型如何绕过人类认知警觉

Andrew D. Maynard

发表机构 * School for the Future of Innovation in Society, Arizona State University（未来创新社会研究所，亚利桑那州立大学）

AI总结提出“认知特洛伊木马”假说，认为LLM通过优化产生的“诚实非信号”特征（流畅性、帮助性、表面无私）可能绕过人类进化出的认知警觉机制，导致用户高估其可信度。

Comments 16 pages, 20 references. v2: Added brief discussion situating "honest signals" terminology in evolutionary biology (Sec. 3), with two added citations (Zahavi 1975; Maynard Smith & Harper 2003). No changes to argument or conclusions

详情

AI中文摘要

基于大型语言模型（LLM）的对话式AI系统对人类认知提出了挑战，当前理解错误信息和说服的框架未能充分应对。本文提出，对话式AI的一个重大认知风险可能不在于不准确或有意欺骗，而在于更根本的问题：这些系统通过使其有用的优化过程，可能呈现出绕过人类进化出的评估传入信息的认知机制的特征。认知特洛伊木马假说借鉴了Sperber及其同事的认知警觉理论——即并行认知过程监控所传达的信息以寻找怀疑理由——并提出基于LLM的系统呈现出“诚实的非信号”：真实的特征（流畅性、帮助性、表面无私）缺乏人类相应特征所携带的信息等价物，因为在人类中这些特征的产生成本高昂，而在LLM中它们在计算上微不足道。识别出四种潜在的绕过机制：与理解脱钩的处理流畅性、无相应利害关系的信任-能力呈现、将评估本身委托给AI的认知卸载，以及系统性地产生谄媚的优化动态。该框架产生了可检验的预测，包括一个反直觉的推测：认知复杂的用户可能更容易受到AI介导的认知影响。这将AI安全重新定义为部分校准问题——使人类的评估反应与AI生成内容的实际认知状态对齐——而不仅仅是防止欺骗的问题。

英文摘要

Large language model (LLM)-based conversational AI systems present a challenge to human cognition that current frameworks for understanding misinformation and persuasion do not adequately address. This paper proposes that a significant epistemic risk from conversational AI may lie not in inaccuracy or intentional deception, but in something more fundamental: these systems may be configured, through optimization processes that make them useful, to present characteristics that bypass the cognitive mechanisms humans evolved to evaluate incoming information. The Cognitive Trojan Horse hypothesis draws on Sperber and colleagues' theory of epistemic vigilance -- the parallel cognitive process monitoring communicated information for reasons to doubt -- and proposes that LLM-based systems present 'honest non-signals': genuine characteristics (fluency, helpfulness, apparent disinterest) that fail to carry the information equivalent human characteristics would carry, because in humans these are costly to produce while in LLMs they are computationally trivial. Four mechanisms of potential bypass are identified: processing fluency decoupled from understanding, trust-competence presentation without corresponding stakes, cognitive offloading that delegates evaluation itself to the AI, and optimization dynamics that systematically produce sycophancy. The framework generates testable predictions, including a counterintuitive speculation that cognitively sophisticated users may be more vulnerable to AI-mediated epistemic influence. This reframes AI safety as partly a problem of calibration -- aligning human evaluative responses with the actual epistemic status of AI-generated content -- rather than solely a problem of preventing deception.

URL PDF HTML ☆

赞 0 踩 0

2601.05899 2026-05-27 cs.AI 版本更新

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

TowerMind: 一个用于LLM作为智能体的塔防游戏学习环境与基准

Dawei Wang, Chengming Zhou, Di Zhao, Xinyuan Liu, Marci Chi Ma, Gary Ushaw, Richard Davison

发表机构 * Newcastle University（新castle大学）； University of Auckland（奥克兰大学）

AI总结本文提出TowerMind，一个基于塔防子类型的轻量级、多模态游戏环境，用于评估大语言模型在长期规划和决策中的能力，并揭示其与人类专家的性能差距及关键局限性。

Comments AAAI 2026 Oral

详情

DOI: 10.1609/aaai.v40i31.39818

AI中文摘要

近年来，大语言模型（LLM）的突破性进展使其成为智能体的一种有前景的范式，其中长期规划和决策作为适应不同场景和任务的核心通用能力逐渐凸显。实时策略（RTS）游戏因其固有的游戏玩法需要宏观战略规划和微观战术调整与行动执行，成为评估这两种能力的理想测试平台。现有的基于RTS游戏的环境要么计算需求较高，要么缺乏对文本观察的支持，这限制了RTS游戏在LLM评估中的应用。受此启发，我们提出了TowerMind，一种基于RTS游戏子类型——塔防（TD）的新型环境。TowerMind保留了RTS游戏评估LLM的关键优势，同时具有低计算需求和多模态观察空间，包括基于像素、文本和结构化游戏状态的表示。此外，TowerMind支持模型幻觉评估，并提供高度的可定制性。我们设计了五个基准关卡，以评估几种广泛使用的LLM在不同多模态输入设置下的表现。结果揭示了LLM与人类专家在能力和幻觉维度上的明显性能差距。实验进一步突出了LLM行为的关键局限性，例如规划验证不足、决策缺乏多终性以及行动使用效率低下。我们还评估了两种经典强化学习算法：Ape-X DQN和PPO。通过提供轻量级和多模态设计，TowerMind补充了现有的基于RTS游戏的环境格局，并为AI智能体领域引入了一个新的基准。源代码已在GitHub上公开（https://github.com/tb6147877/TowerMind）。

英文摘要

Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub(https://github.com/tb6147877/TowerMind).

URL PDF HTML ☆

赞 0 踩 0

2601.03525 2026-05-27 cs.LG cs.AI 版本更新

Beyond Binary: Turning Partial Success into Dense Verifiable Rewards for Reinforcement Learning in Code Generation

超越二元：将部分成功转化为代码生成中强化学习的密集可验证奖励

Longwen Wang, Yirui Liu, Xuan'er Wu, Xiaohui Hu, Yuankai Fan, Kaidong Yu, Qizhen Weng, Wei Xi, Xuelong Li

发表机构 * Institute of Artificial Intelligence, China Telecom (TeleAI)（中国电信人工智能研究院（TeleAI））； Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd（中国电信人工智能技术（北京）有限公司Xingchen AGI实验室）； National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University（人机混合增强智能国家重点实验室，西安交通大学）

AI总结提出VeRPO框架，利用代码测试的部分成功作为可验证密集奖励，通过动态密度校准局部奖励修正基数偏差，并与全局执行结果结合，提升代码生成强化学习的性能。

详情

AI中文摘要

有效的奖励设计是代码生成强化学习（RL）中的核心挑战。主流的测试套件级结果奖励强制执行功能正确性但导致稀疏性，而外部奖励模型（RM）提供密集监督但代价是错位和额外开销。由于代码评估自然产生多个测试用例级结果，部分成功（即通过部分测试用例）提供了内在的、可验证的密集监督来源。在本文中，我们提出VeRPO（可验证密集奖励策略优化），一个系统地将可验证的部分成功转化为可靠密集奖励的RL框架。我们使用加权和公式分析部分成功奖励，理论上识别出一个关键的基数偏差，导致策略更新不成比例地偏向于从简单测试成功中获益，而非在前沿测试上取得进展。基于此，VeRPO引入了一个动态的、密度校准的局部奖励，明确纠正这种偏差，并从部分成功中提供稳健的密集监督。为了增强与端到端功能正确性的一致性，VeRPO进一步将局部密集奖励与全局执行结果相结合。在多种基准和设置上的大量实验表明，VeRPO优于结果驱动和基于RM的基线，实现了高达+8.83 pass@1的提升，且时间成本可忽略不计（<0.02%），GPU内存开销为零。

英文摘要

Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream test-suite-level outcome rewards enforce functional correctness but induce sparsity, while external Reward Models (RMs) provide dense supervision at the cost of misalignment and additional overhead. Since code evaluation naturally yields multiple test-case-level outcomes, partial success, i.e., passing a subset of test cases, offers an intrinsic, verifiable source of dense supervision. In this paper, we propose VeRPO (Verifiable Dense Reward Policy Optimization), an RL framework that systematically turns verifiable partial success into reliable dense rewards. We analyze partial-success rewards using a weighted sum formulation, theoretically identifying a critical cardinality bias that causes policy updates to disproportionately favor gains from easy-test successes over progress on frontier tests. Based on this, VeRPO introduces a dynamic, density-calibrated local reward that explicitly corrects this bias and provides robust dense supervision from partial success. To enhance alignment with end-to-end functional correctness, VeRPO further integrates the local dense reward with global execution outcomes. Extensive experiments across diverse benchmarks and settings demonstrate that VeRPO outperforms outcome-driven and RM-based baselines, achieving up to +8.83 pass@1 gain with negligible time cost (< 0.02%) and zero GPU memory overhead.

URL PDF HTML ☆

赞 0 踩 0

2601.04275 2026-05-27 cs.CR cs.AI cs.CL 版本更新

Shadow Unlearning: A Neuro-Semantic Approach to Fidelity-Preserving Faceless Forgetting in LLMs

影子遗忘：一种面向LLM保真保留的无面孔遗忘的神经语义方法

Dinesh Srivasthav P, Ashok Urlana, Rahul Mishra, Bala Mallikarjunarao Garlapati, Ponnurangam Kumaraguru

发表机构 * TCS Research, Hyderabad（TCS研究院，海得拉巴）； IIIT Hyderabad（IIIT海得拉巴）

AI总结提出影子遗忘范式，通过神经语义投影器遗忘（NSPU）框架在匿名化遗忘数据上实现机器遗忘，保护隐私的同时保持模型效用，计算效率提升至少10倍。

详情

AI中文摘要

机器遗忘旨在选择性地移除特定训练样本的影响，以满足GDPR等隐私法规的“被遗忘权”。然而，许多现有方法需要访问被移除的数据，使其暴露于成员推断攻击和个人身份信息（PII）的潜在滥用。我们通过提出影子遗忘（Shadow Unlearning）来解决这一关键挑战，这是一种新的近似遗忘范式，在不暴露PII的情况下对匿名化遗忘数据进行机器遗忘。我们进一步提出了一种新颖的隐私保护框架——神经语义投影器遗忘（NSPU），以实现影子遗忘。为了评估我们的方法，我们跨五个不同领域构建了多领域虚构遗忘（MuFU）遗忘集，并引入了一个评估栈来量化知识保留与遗忘效果之间的权衡。在各种大型语言模型上的实验表明，NSPU实现了优越的遗忘性能，保持了模型效用，并增强了用户隐私。此外，所提出的方法在计算效率上比标准遗忘方法至少高出10倍。我们的研究为隐私感知的机器遗忘开辟了新方向，平衡了数据保护与模型保真度。

英文摘要

Machine unlearning aims to selectively remove the influence of specific training samples to satisfy privacy regulations such as the GDPR's 'Right to be Forgotten'. However, many existing methods require access to the data being removed, exposing it to membership inference attacks and potential misuse of Personally Identifiable Information (PII). We address this critical challenge by proposing Shadow Unlearning, a novel paradigm of approximate unlearning, that performs machine unlearning on anonymized forget data without exposing PII. We further propose a novel privacy-preserving framework, Neuro-Semantic Projector Unlearning (NSPU) to achieve Shadow unlearning. To evaluate our method, we compile Multi-domain Fictitious Unlearning (MuFU) forget set across five diverse domains and introduce an evaluation stack to quantify the trade-off between knowledge retention and unlearning effectiveness. Experimental results on various LLMs show that NSPU achieves superior unlearning performance, preserves model utility, and enhances user privacy. Additionally, the proposed approach is at least 10x more computationally efficient than standard unlearning approaches. Our findings foster a new direction for privacy-aware machine unlearning that balances data protection and model fidelity.

URL PDF HTML ☆

赞 0 踩 0

2601.03089 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Faithfulness Evaluation for Decoder-only LLM Attributions with Controlled Retained Information

基于受控保留信息的仅解码器LLM归因忠实性评估

Xin Huang, Antoni B. Chan

发表机构 * City University of Hong Kong（香港城市大学）

AI总结针对现有软扰动忠实性指标因保留词数不同导致评估偏差的问题，提出π-Soft-NC和π-Soft-NS框架，通过控制期望保留概率公平比较归因方法，并引入专用于自回归解码器LLM的梯度归因方法Grad-ELLM。

详情

AI中文摘要

大型语言模型（LLM）越来越多地使用输入归因方法进行评估，但比较这些解释仍然具有挑战性。现有的软扰动忠实性指标，如Soft-NC和Soft-NS，可能将归因质量与扰动期间保留的词数混为一谈：平均得分较高的归因方法可能保留更多词，从而获得膨胀的分数。为解决此问题，我们提出π-Soft-NC和π-Soft-NS，这是一个在相同期望保留概率下比较归因方法的评估框架，从而控制保留词数。我们进一步引入Grad-ELLM，一种针对自回归仅解码器LLM定制的基于梯度的归因方法，该方法在每个解码步骤将梯度导出的通道重要性与注意力导出的标记重要性相结合。在Llama和Mistral上的分类和开放生成任务实验表明，Grad-ELLM在π-Soft-NC下实现了强全面性导向的忠实性，而在π-Soft-NS下没有主导方法。我们的评估指标为比较LLM的可解释人工智能方法提供了一个严格的框架，将支持该领域的进展。

英文摘要

Large Language Models (LLMs) are increasingly evaluated with input attribution methods, yet comparing such explanations remains challenging. Existing soft-perturbation faithfulness metrics, such as Soft-NC and Soft-NS, can conflate attribution quality with the number of words retained during perturbation: attribution methods with larger average scores may keep more words and therefore obtain inflated scores. To address this issue, we propose $π$-Soft-NC and $π$-Soft-NS, an evaluation framework that compares attribution methods under the same expected retaining probability, thus controlling the number of retained words. We further introduce Grad-ELLM, a gradient-based attribution method tailored to autoregressive decoder-only LLMs, which combines gradient-derived channel importance with attention-derived token importance at each decoding step. Experiments on classification and open-generation tasks with Llama and Mistral show that Grad-ELLM achieves strong comprehensiveness-oriented faithfulness under $π$-Soft-NC, while there is no dominant method under $π$-Soft-NS. Our evaluation metric serves as a rigorous framework to compare XAI methods for LLMs, which will support progress in the field.

URL PDF HTML ☆

赞 0 踩 0

2601.01668 2026-05-27 cs.CL cs.AI 版本更新

EHRSummarizer: A Privacy-Aware, FHIR-Native Reference Architecture for Source-Grounded EHR Summarization

EHRSummarizer：一种隐私感知、FHIR原生的源接地EHR摘要参考架构

Houman Kazemzadeh, Nima Minaifar, Kamyar Naderi, Sho Tabibzadeh

发表机构 * MedLedger365 ； MedConnect365 ； Xylemed ； Kypath Associates Inc.

AI总结提出一种隐私感知、FHIR原生的参考架构EHRSummarizer，通过检索HL7 FHIR R4资源并约束生成源接地摘要，以支持临床病历审查。

Comments 15 pages, 2 figures, 2 tables. Version 2 clarifies missing-data status handling, medication-status ambiguity, controlled narrative-document handling, source-grounded resource grouping, and future source-to-summary traceability

详情

AI中文摘要

临床医生通常需要浏览碎片化的电子健康记录（EHR）界面，以整合患者问题、用药、近期就诊和纵向趋势的连贯图像。本文描述了EHRSummarizer，一种用于结构化EHR摘要的隐私感知、FHIR原生参考架构。该架构检索一组目标性的高收益HL7 FHIR R4资源，将其标准化为临床上下文包，并使用受约束的摘要阶段生成源接地摘要，旨在支持病历审查。该架构进一步阐明了缺失数据状态处理、用药状态模糊性、在可用时对叙述性临床文档的受控使用，以及未来的源到摘要可追溯性。本文描述的是参考架构和原型行为，而非经过验证的临床干预、自主临床决策支持系统或临床获益证据。在合成和测试FHIR环境上的原型演示展示了端到端行为和输出格式；然而，本文未报告临床结果、受控工作流研究或基准结果。我们概述了一个评估计划，重点关注忠实性、遗漏风险、时间正确性、可用性、隐私和操作监控，以指导未来的机构评估。

英文摘要

Clinicians routinely navigate fragmented electronic health record (EHR) interfaces to assemble a coherent picture of a patient's problems, medications, recent encounters, and longitudinal trends. This manuscript describes EHRSummarizer, a privacy-aware, FHIR-native reference architecture for structured EHR summarization. The architecture retrieves a targeted set of high-yield HL7 FHIR R4 resources, normalizes them into a clinical context package, and uses a constrained summarization stage to produce source-grounded summaries intended to support chart review. The architecture further clarifies missing-data status handling, medication-status ambiguity, controlled use of narrative clinical documents when available, and future source-to-summary traceability. The manuscript describes a reference architecture and prototype behavior rather than a validated clinical intervention, autonomous clinical decision-support system, or evidence of clinical benefit. Prototype demonstrations on synthetic and test FHIR environments illustrate end-to-end behavior and output formats; however, this manuscript does not report clinical outcomes, controlled workflow studies, or benchmark results. We outline an evaluation plan centered on faithfulness, omission risk, temporal correctness, usability, privacy, and operational monitoring to guide future institutional assessment.

URL PDF HTML ☆

赞 0 踩 0

2512.17090 2026-05-27 cs.LG cs.AI 版本更新

How to Square Tensor Networks and Circuits Without Squaring Them

如何平方张量网络和电路而不进行平方操作

Lorenzo Loconte, Adrián Javaloy, Antonio Vergari

发表机构 * School of Informatics, University of Edinburgh, UK（爱丁堡大学信息学院）

AI总结提出一种参数化方法，通过正交性和确定性条件简化平方张量网络和电路的边际化计算，避免额外复杂度，并在分布估计任务中保持表达能力且提升学习效率。

详情

AI中文摘要

平方张量网络（TNs）及其作为计算图的扩展——平方电路——已被用作表达性的分布估计器，同时支持闭式边际化。然而，平方操作在计算配分函数或边际化变量时引入了额外的复杂性，这阻碍了它们在机器学习中的应用。为了解决这个问题，张量网络的正则形式通过酉矩阵参数化以简化边际计算。然而，这些正则形式不适用于电路，因为电路可以表示不直接映射到已知张量网络的分解。受正则形式中的正交性和电路中实现可处理最大化的确定性的启发，我们展示了如何参数化平方电路以克服其边际化开销。我们的参数化即使在不同于张量网络的分解中也能实现高效的边际化，这些分解编码为电路，否则其结构会使边际化计算变得困难。最后，我们在分布估计上的实验表明，我们提出的平方电路条件在没有任何表达能力损失的情况下，实现了更高效的学习。

英文摘要

Squared tensor networks (TNs) and their extension as computational graphs--squared circuits--have been used as expressive distribution estimators, yet supporting closed-form marginalization. However, the squaring operation introduces additional complexity when computing the partition function or marginalizing variables, which hinders their applicability in ML. To solve this issue, canonical forms of TNs are parameterized via unitary matrices to simplify the computation of marginals. However, these canonical forms do not apply to circuits, as they can represent factorizations that do not directly map to a known TN. Inspired by the ideas of orthogonality in canonical forms and determinism in circuits enabling tractable maximization, we show how to parameterize squared circuits to overcome their marginalization overhead. Our parameterizations unlock efficient marginalization even in factorizations different from TNs, but encoded as circuits, whose structure would otherwise make marginalization computationally hard. Finally, our experiments on distribution estimation show how our proposed conditions in squared circuits come with no expressiveness loss, while enabling more efficient learning.

URL PDF HTML ☆

赞 0 踩 0

2512.12413 2026-05-27 cs.AI cs.HC 版本更新

Understanding Critical Thinking in Generative Artificial Intelligence Use: Development, Validation, and Correlates of the Critical Thinking in AI Use Scale

生成式人工智能使用中的批判性思维：批判性思维在AI使用中的量表开发、验证与关联因素

Gabriel R. Lau, Wei Yan Low, Louis Tay, Ysabel Guevarra, Dragan Gašević, Andree Hartanto

发表机构 * School of Social Sciences, Nanyang Technological University（南洋理工大学社会科学学院）； Interdisciplinary Graduate Programme, Nanyang Technological University（南洋理工大学跨学科研究生项目）； College of Health and Human Sciences, Purdue University（普渡大学健康与人类科学学院）； School of Social Sciences, Singapore Management University（新加坡管理学院社会科学学院）； Faculty of Information Technology, Monash University（墨尔本大学信息技术学院）

AI总结本研究开发并验证了13项批判性思维在AI使用中的量表，发现其包含验证、动机和反思三个因子，并与开放性、外向性、积极情感和AI使用频率正相关，且能预测更频繁的验证策略和更高的真实性判断准确性。

详情

DOI: 10.1016/j.chbr.2026.101103
Journal ref: Computers in Human Behavior Reports, 22, 101103 (2026)

AI中文摘要

生成式AI工具日益嵌入日常工作和学习中，但其流畅性、不透明性和产生幻觉的倾向意味着用户必须批判性地评估AI输出，而不是全盘接受。本研究将AI使用中的批判性思维概念化为一种倾向性特质，包括验证AI生成信息的来源和内容、理解模型的工作原理及其失败之处，以及反思依赖AI的更广泛影响。通过六项研究（N=1365），我们开发并验证了13项批判性思维在AI使用中的量表，并绘制了其法则网络。研究1生成并内容验证了量表项目。研究2支持了三因子结构（验证、动机和反思）。研究3、4和5确认了这一高阶模型，展示了内部一致性、重测信度、强因子载荷、性别不变性以及收敛和判别效度。研究3和4进一步揭示，AI使用中的批判性思维与开放性、外向性、积极特质情感和AI使用频率正相关。最后，研究6展示了量表的效标效度，更高的批判性思维在AI使用中的得分预测了更频繁和多样化的验证策略、在新型自然主义ChatGPT驱动的事实核查任务中更高的真实性判断准确性，以及对负责任AI的更深入反思。总之，当前工作阐明了人们为何以及如何对生成式AI输出进行监督，并提供了一个经过验证的量表和生态学基础的任务范式，以支持关于批判性参与生成式AI输出的理论检验、跨群体和纵向研究。

英文摘要

Generative AI tools are increasingly embedded in everyday work and learning, yet their fluency, opacity, and propensity to hallucinate mean that users must critically evaluate AI outputs rather than accept them at face value. The present research conceptualises critical thinking in AI use as a dispositional tendency to verify the source and content of AI-generated information, to understand how models work and where they fail, and to reflect on the broader implications of relying on AI. Across six studies (N = 1365), we developed and validated the 13-item critical thinking in AI use scale and mapped its nomological network. Study 1 generated and content-validated scale items. Study 2 supported a three-factor structure (Verification, Motivation, and Reflection). Studies 3, 4, and 5 confirmed this higher-order model, demonstrated internal consistency and test-retest reliability, strong factor loadings, sex invariance, and convergent and discriminant validity. Studies 3 and 4 further revealed that critical thinking in AI use was positively associated with openness, extraversion, positive trait affect, and frequency of AI use. Lastly, Study 6 demonstrated criterion validity of the scale, with higher critical thinking in AI use scores predicting more frequent and diverse verification strategies, greater veracity-judgement accuracy in a novel and naturalistic ChatGPT-powered fact-checking task, and deeper reflection about responsible AI. Taken together, the current work clarifies why and how people exercise oversight over generative AI outputs and provides a validated scale and ecologically grounded task paradigm to support theory testing, cross-group, and longitudinal research on critical engagement with generative AI outputs.

URL PDF HTML ☆

赞 0 踩 0

2511.20586 2026-05-27 cs.AI cs.LG 版本更新

PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic

PaTAS：基于主观逻辑的神经网络信任传播框架

Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Dennis Eisermann, Houda Labiod, Frank Kargl

AI总结提出PaTAS框架，利用主观逻辑在神经网络中并行传播信任，通过信任节点和信任函数量化输入、参数和激活的信任，并设计参数信任更新和推理路径信任评估方法，以在对抗或退化条件下提供可解释的信任估计。

详情

AI中文摘要

可信度已成为安全关键应用中人工智能系统部署的关键要求。传统的评估指标（如准确率和精确率）无法充分捕捉不确定性或模型预测的可靠性，尤其是在对抗或退化条件下。本文介绍了并行信任评估系统（PaTAS），这是一个使用主观逻辑（SL）对神经网络中的信任进行建模和传播的框架。PaTAS通过信任节点和信任函数与标准神经计算并行运行，这些节点和函数在网络中传播输入、参数和激活信任。该框架定义了一种参数信任更新机制，以在训练过程中优化参数可靠性，以及一种推理路径信任评估（IPTA）方法，以在推理时计算实例特定的信任。在真实世界和对抗性数据集上的实验表明，PaTAS产生可解释、对称且收敛的信任估计，这些估计补充了准确率，并揭示了在中毒、有偏或不确定数据场景中的可靠性差距。结果表明，PaTAS有效区分良性输入和对抗性输入，并识别模型置信度与实际可靠性不一致的情况。通过在神经架构中实现透明且可量化的信任推理，PaTAS为评估AI生命周期中的模型可靠性提供了基础。

英文摘要

Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics, such as accuracy and precision, fail to appropriately capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the Parallel Trust Assessment System (PaTAS), a framework for modeling and propagating trust in neural networks using Subjective Logic (SL). PaTAS operates in parallel with standard neural computation through Trust Nodes and Trust Functions that propagate input, parameter, and activation trust across the network. The framework defines a Parameter Trust Update mechanism to refine parameter reliability during training and an Inference-Path Trust Assessment (IPTA) method to compute instance-specific trust at inference. Experiments on real-world and adversarial datasets demonstrate that PaTAS produces interpretable, symmetric, and convergent trust estimates that complement accuracy and expose reliability gaps in poisoned, biased, or uncertain data scenarios. The results show that PaTAS effectively distinguishes between benign and adversarial inputs and identifies cases where model confidence diverges from actual reliability. By enabling transparent and quantifiable trust reasoning within neural architectures, PaTAS provides a foundation for evaluating model reliability across the AI lifecycle.

URL PDF HTML ☆

赞 0 踩 0

2412.20505 2026-05-27 cs.AI cs.CL cs.LG 版本更新

CFG-OEC: 带正交误差校正的无分类器引导

Nakgyu Yang, Yechan Lee, SooJean Han

发表机构 * School of Electrical Engineering, Korea Advanced Institute of Science（韩国科学技术院电子工程学院）

AI总结针对扩散模型中无分类器引导的采样规则与训练目标不匹配导致的误差，提出正交误差校正方法（CFG-OEC）通过减少条件与无条件预测误差的交互项来提升采样质量，并在Stable Diffusion上验证了FID和CLIP分数的改进。

详情

AI中文摘要

无分类器引导是扩散模型中条件采样的标准方法，但其采样规则与训练中使用的目标不一致。这种不匹配通过条件预测误差和无条件预测误差的相互作用引入了结构性采样误差。我们通过将采样误差分解为基础项和由两个误差对齐决定的交叉项来分析该问题。基于此分析，我们提出了带正交误差校正的无分类器引导（CFG-OEC），这是一种减少交互项的结构性修改。对于无法观测到真实噪声的实际场景，我们引入了一个从模型预测计算得到的代理量，以及一种跨扩散时间步稳定校正的动态方法。在受控环境下的实验验证了我们的理论误差分解和代理量构造。在Stable Diffusion v1.5和Stable Diffusion XL上的图像生成表明，CFG-OEC在多个采样器和引导机制下比CFG和CFG++改进了FID和CLIP分数。

英文摘要

Classifier free guidance is a standard method for conditional sampling in diffusion models, but its sampling rule is not aligned with the objective used in training. This mismatch induces a structural sampling error through the interaction of conditional and unconditional prediction errors. We analyze this issue by decomposing the sampling error into a base term and a cross term determined by the alignment of the two errors. Based on this analysis we propose CFG with orthogonal error correction (CFG-OEC), a structural modification that reduces the interaction term. For practical settings where ground truth noise is not observable, we introduce a proxy computed from model predictions and a dynamic method that stabilizes correction across diffusion timesteps. Experiments in a controlled environment validate our theoretical error decomposition and proxy construction. Image generation on Stable Diffusion v1.5 and Stable Diffusion XL show that CFG-OEC improves FID and CLIP scores over CFG and CFG++ across multiple samplers and guidance regimes.

URL PDF HTML ☆

赞 0 踩 0

2511.07667 2026-05-27 cs.AI 版本更新

AI-Driven Contribution Evaluation and Conflict Resolution: A Framework & Design for Group Workload Investigation

AI驱动的贡献评估与冲突解决：群体工作量调查的框架与设计

Jakub Slapek, Mir Seyedebrahimi, Jianhua Yang

发表机构 * University of Warwick（沃里克大学）； Warwick Manufacturing Group（沃里克制造集团）

AI总结提出一个AI增强的框架和实现设计，通过整合异构工件并利用大语言模型进行验证和上下文分析，以解决团队中个人贡献的公平评估和冲突解决难题。

Comments 20 pages, 8 figures, 8 tables

详情

AI中文摘要

团队中个人贡献的公平评估仍然是一个持续的挑战，工作量的冲突和差异可能导致不公平的绩效评估，通常需要人工干预——这是一个成本高昂且困难的过程。我们调查了现有工具的功能，并发现了冲突解决方法和AI集成方面的空白。为了解决这个问题，我们提出了一种新颖的AI增强工具的框架和实现设计，该工具协助争议调查。该框架将异构工件——提交物（代码、文本、媒体）、通信（聊天、电子邮件）、协调记录（会议日志、任务）、同行评估和上下文信息——组织成三个维度，包含九个基准：贡献、互动和角色。客观度量被归一化，按维度聚合，并与不平等度量（基尼指数）配对，以揭示冲突标记。大语言模型（LLM）架构对这些度量进行验证和上下文分析，以生成可解释且透明的咨询判断。我们论证了在当前法规和机构政策下的可行性，并概述了实际分析（情感、任务忠实度、字数/行数等）、偏见防护、限制和实际挑战。

英文摘要

The equitable assessment of individual contribution in teams remains a persistent challenge, where conflict and disparity in workload can result in unfair performance evaluation, often requiring manual intervention - a costly and challenging process. We survey existing tool features and identify a gap in conflict resolution methods and AI integration. To address this, we propose a framework and implementation design for a novel AI-enhanced tool that assists in dispute investigation. The framework organises heterogeneous artefacts - submissions (code, text, media), communications (chat, email), coordination records (meeting logs, tasks), peer assessments, and contextual information - into three dimensions with nine benchmarks: Contribution, Interaction, and Role. Objective measures are normalised, aggregated per dimension, and paired with inequality measures (Gini index) to surface conflict markers. A Large Language Model (LLM) architecture performs validated and contextual analysis over these measures to generate interpretable and transparent advisory judgments. We argue for feasibility under current statutory and institutional policy, and outline practical analytics (sentimental, task fidelity, word/line count, etc.), bias safeguards, limitations, and practical challenges.

URL PDF HTML ☆

赞 0 踩 0

2511.04711 2026-05-27 cs.CR cs.AI cs.LG 版本更新

SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking

SWAP：通过顺序水印实现软提示的版权审计

Wenyuan Yang, Yichen Sun, Changzheng Chen, Zhixuan Chu, Jiaheng Zhang, Yiming Li, Dacheng Tao

发表机构 * Sun Yat-sen University（中山大学）； Zhejiang University（浙江大学）； National University of Singapore（新加坡国立大学）； Nanyang Technological University（南洋理工大学）

AI总结针对软提示的版权保护问题，提出一种基于顺序水印的审计方法SWAP，通过将水印嵌入到更复杂的输出分布顺序空间中，实现无害且鲁棒的版权验证。

Comments This paper has been accepted by the International Journal of Computer Vision (IJCV), 2026. The first two authors contributed equally to this work. 28 pages

详情

AI中文摘要

大规模视觉语言模型，尤其是CLIP，在各种下游任务中展现了卓越的性能。软提示作为精心设计的模块，能够高效地将视觉语言模型适应特定任务，因此需要有效的版权保护。本文通过审计可疑的第三方模型是否使用了受保护的软提示，来研究模型版权保护。虽然这可以视为模型所有权审计的一个特例，但我们的分析表明，由于提示学习的独特特性，现有技术效果不佳。非侵入式审计在独立模型与受害模型共享相似数据分布时，本质上容易产生误报。侵入式方法也失败：为CLIP设计的后门方法无法嵌入功能性触发器，而将传统DNN后门技术扩展到提示学习则面临有害性和模糊性挑战。我们发现，侵入式审计的这些失败源于同一个根本原因：水印与主任务在同一决策空间中运行，却追求相反的目标。基于这些发现，我们提出了软提示的顺序水印（SWAP），将水印植入一个不同且更复杂的空间。SWAP通过防御者指定的分布外类别的特定顺序来编码水印，灵感来自CLIP的零样本预测能力。这种嵌入在更复杂空间中的水印保持原始预测标签不变，从而减少与主任务的冲突。我们进一步为SWAP设计了基于假设检验的验证协议，并提供了验证何时有效的理论分析。在11个数据集上的大量实验证明了SWAP的有效性、无害性以及对潜在攻击的鲁棒性。

英文摘要

Large-scale vision-language models, especially CLIP, have demonstrated remarkable performance across diverse downstream tasks. Soft prompts, as carefully crafted modules that efficiently adapt vision-language models to specific tasks, necessitate effective copyright protection. In this paper, we investigate model copyright protection by auditing whether suspicious third-party models incorporate protected soft prompts. While this can be viewed as a special case of model ownership auditing, our analysis shows that existing techniques are ineffective due to prompt learning's unique characteristics. Non-intrusive auditing is inherently prone to false positives when independent models share similar data distributions with victim models. Intrusive approaches also fail: backdoor methods designed for CLIP cannot embed functional triggers, while extending traditional DNN backdoor techniques to prompt learning suffers from harmfulness and ambiguity challenges. We find that these failures in intrusive auditing stem from the same fundamental reason: watermarking operates within the same decision space as the primary task yet pursues opposing objectives. Motivated by these findings, we propose sequential watermarking for soft prompts (SWAP), which implants watermarks into a different and more complex space. SWAP encodes watermarks through a specific order of defender-specified out-of-distribution classes, inspired by the zero-shot prediction capability of CLIP. This watermark, which is embedded in a more complex space, keeps the original prediction label unchanged, making it less opposed to the primary task. We further design a hypothesis-test-guided verification protocol for SWAP and provide a theoretical analysis of when verification works. Extensive experiments on 11 datasets demonstrate SWAP's effectiveness, harmlessness, and robustness against potential attacks.

URL PDF HTML ☆

赞 0 踩 0

2511.02525 2026-05-27 cs.LG cs.AI 版本更新

An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems

一种用于求解带容量约束选址-路径问题的端到端学习方法

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

发表机构 * National Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology（中国自动化智能无人系统国家级实验室，北京理工大学）

AI总结提出基于深度强化学习与异构查询机制（DRLHQ）的端到端方法，首次将编码器-解码器结构应用于带容量约束的选址-路径问题（CLRP）及其开放变体（OCLRP），通过异构查询注意力机制动态协调选址与路径决策，在合成和基准数据集上优于传统方法和现有DRL基线。

详情

AI中文摘要

带容量约束的选址-路径问题（CLRPs）是组合优化中的经典问题，需要同时做出选址和路径决策。在CLRPs中，复杂的约束以及各种决策之间的复杂关系使得问题难以求解。随着深度强化学习（DRL）的出现，它已被广泛应用于解决车辆路径问题及其变体，而与CLRPs相关的研究仍有待探索。在本文中，我们提出了带有异构查询的DRL（DRLHQ）来分别求解CLRP和开放CLRP（OCLRP）。我们是首个为CLRPs提出端到端学习方法的工作，遵循编码器-解码器结构。具体而言，我们将CLRPs重新表述为一个针对各种决策量身定制的马尔可夫决策过程，这是一个通用的建模框架，可适用于其他基于DRL的方法。为了更好地处理选址和路径决策之间的相互依赖关系，我们还引入了一种新颖的异构查询注意力机制，旨在动态适应不同的决策阶段。在合成和基准数据集上的实验结果表明，我们提出的方法在求解CLRP和OCLRP时，相较于代表性的传统方法和基于DRL的基线，具有更优的解质量和更好的泛化性能。

英文摘要

The capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization, which require simultaneously making location and routing decisions. In CLRPs, the complex constraints and the intricate relationships between various decisions make the problem challenging to solve. With the emergence of deep reinforcement learning (DRL), it has been extensively applied to address the vehicle routing problem and its variants, while the research related to CLRPs still needs to be explored. In this paper, we propose the DRL with heterogeneous query (DRLHQ) to solve CLRP and open CLRP (OCLRP), respectively. We are the first to propose an end-to-end learning approach for CLRPs, following the encoder-decoder structure. In particular, we reformulate the CLRPs as a markov decision process tailored to various decisions, a general modeling framework that can be adapted to other DRL-based methods. To better handle the interdependency across location and routing decisions, we also introduce a novel heterogeneous querying attention mechanism designed to adapt dynamically to various decision-making stages. Experimental results on both synthetic and benchmark datasets demonstrate superior solution quality and better generalization performance of our proposed approach over representative traditional and DRL-based baselines in solving both CLRP and OCLRP.

URL PDF HTML ☆

赞 0 踩 0

2510.19420 2026-05-27 cs.CR cs.AI cs.LG cs.MA math.OC 版本更新

Securing Multi-Agent Systems Against Corruptions via Node Contribution Backpropagation

通过节点贡献反向传播保护多智能体系统免受腐败影响

Chengcan Wu, Zhixin Zhang, Mingqian Xu, Zeming Wei, Meng Sun

发表机构 * Peking University（北京大学）

AI总结针对多智能体系统中对抗性智能体注入误导信息的问题，提出一种基于有向无环图的反向传播动态防御方法，通过计算每个智能体对最终决策的贡献来识别和隔离恶意智能体，实验表明该方法优于现有防御机制。

Comments ICML 2026

详情

AI中文摘要

多智能体系统（MAS）已成为大型语言模型（LLM）应用的普遍范式。然而，MAS中复杂的多智能体设计引入了独特的可信度问题：对抗性智能体可以注入误导信息，这些信息通过系统传染性地传播，破坏良性智能体并导致错误输出。现有的基于图的防御将智能体建模为节点，通信建模为边，但仅限于静态图防御。在本文中，我们提出了一种动态防御范式，将MAS通信建模为带符号的有向无环图，并通过反向传播计算每个智能体对最终决策的贡献，从而能够准确识别和隔离恶意智能体，以保护多智能体任务协作。在复杂和动态的MAS环境中的实验结果表明，我们的方法显著优于现有的MAS防御机制，为可信赖的MAS部署提供了有效的保障。我们的代码可在https://github.com/ChengcanWu/BPD获取。

英文摘要

Multi-Agent Systems (MAS) have become a prevalent paradigm for Large Language Model (LLM) applications. However, the complex multi-agent design in MAS introduces unique trustworthiness concerns: adversarial agents can inject misleading information that propagates contagiously through the system, corrupting benign agents and leading to false outputs. Existing graph-based defenses model agents as nodes and communications as edges, yet are limited to static-graph defenses. In this paper, we propose a dynamic defense paradigm that models MAS communication as a signed directed acyclic graph and computes each agent's contribution to the final decision via backward propagation, enabling accurate identification and isolation of malicious agents to secure multi-agent task collaboration. Experimental results in complex and dynamic MAS environments demonstrate that our method notably outperforms existing MAS defense mechanisms, providing an effective guardrail for trustworthy MAS deployment. Our code is available at https://github.com/ChengcanWu/BPD.

URL PDF HTML ☆

赞 0 踩 0

2510.10774 2026-05-27 cs.SD cs.AI cs.HC cs.LG 版本更新

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

ParsVoice: 面向文本到语音合成的大规模多说话人波斯语语音语料库

Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

发表机构 * School of Electrical and Computer Engineering, University of Tehran（塔里哈大学电气与计算机工程学院）； Institute for Research in Fundamental Sciences (IPM)（基础科学研究所（IPM））

AI总结提出ParsVoice，目前最大的公开波斯语语音-文本语料库，通过可扩展的流水线从长篇有声读物构建高质量数据，用于训练多说话人TTS系统，并验证了其在零样本多说话人TTS中的有效性。

详情

AI中文摘要

波斯语在开放的语音-文本资源中仍然严重不足，限制了多说话人文本到语音（TTS）、语音语言建模和低资源语音处理的进展。我们介绍了ParsVoice，这是目前最大的公开波斯语语音-文本语料库，专为训练多说话人TTS系统而设计，同时提供了一个可扩展的流水线，用于从长篇有声读物录音中构建高质量的语音-文本数据。该流水线结合了微调的ParsBERT句子补全分类器、基于ASR的边界优化、标点恢复、说话人识别以及涵盖音频和波斯语特定文本属性的多维质量评估。最终发布的版本包含一个2200小时的TTS就绪子集，包含来自1815个自动识别说话人ID的136万个对齐片段，比之前最大的公开波斯语TTS数据集大25倍以上。为了验证该语料库，我们微调了XTTS，一个直接操作原始波斯语文本（无需音素表示）的零样本多语言TTS模型，实现了自然度MOS为3.6/5，说话人相似度MOS为4.0/5。ParsVoice数据集公开在：https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice。

英文摘要

Persian remains substantially underrepresented in open speech-text resources, limiting progress in multi-speaker text-to-speech (TTS), speech-language modelling, and low-resource speech processing. We introduce ParsVoice, the largest publicly available Persian speech-text corpus tailored for training multi-speaker TTS systems, along with a scalable pipeline to construct high-quality speech-text data from long-form audiobook recordings. The pipeline combines a fine-tuned ParsBERT sentence-completion classifier, ASR-based boundary optimization, punctuation restoration, speaker identification, and a multi-dimensional quality assessment that covers both audio and Persian-specific text properties. The resulting release contains a 2,200-hour TTS-ready subset with 1.36 million aligned segments from 1,815 automatically identified speaker IDs, making it more than 25 times larger than the previously largest open Persian TTS dataset. To validate the corpus, we fine-tune XTTS, a zero-shot multilingual TTS model that operates directly on raw Persian text without phoneme representations, achieving a naturalness MOS of 3.6/5 and speaker similarity MOS of 4.0/5. The ParsVoice dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.

URL PDF HTML ☆

赞 0 踩 0

2509.04310 2026-05-27 cs.AI 版本更新

EvoEmo: Towards Evolved Emotional Policies for Adversarial LLM Agents in Multi-Turn Price Negotiation

EvoEmo：面向多轮价格谈判中对抗性LLM智能体的进化情感策略

Yunbo Long, Liming Xu, Lukas Beckenbauer, Yuhan Liu, Alexandra Brintrup

发表机构 * Department of Engineering, University of Cambridge（剑桥大学工程系）； Rotman School of Management, University of Toronto（多伦多大学罗特曼管理学院）； TUM School of Management, Technical University of Munich（慕尼黑技术大学管理学院）； The Alan Turing Institute, London, UK（伦敦阿尔安·图灵研究院）

AI总结提出EvoEmo进化强化学习框架，通过将情感状态转移建模为马尔可夫决策过程并采用种群遗传优化，动态优化多轮谈判中的情感表达，显著提升LLM智能体的谈判成功率、效率和买家节省。

详情

AI中文摘要

最近关于大型语言模型（LLM）中思维链（CoT）推理的研究表明，智能体可以参与 extit{复杂}、 extit{多轮}谈判，为智能体AI开辟了新途径。然而，现有的LLM智能体在很大程度上忽略了情感在此类谈判中的功能作用，而是生成被动、偏好驱动的情感反应，使其容易受到对抗方的操纵和策略性利用。为弥补这一差距，我们提出了EvoEmo，一个进化强化学习框架，用于优化谈判中的动态情感表达。EvoEmo将情感状态转移建模为马尔可夫决策过程，并采用基于种群的遗传优化，在多样化的谈判场景中进化出高奖励的情感策略。我们进一步提出了一个评估框架，包含两个基线——原始策略和固定情感策略——用于基准测试情感感知谈判。大量实验和消融研究表明，EvoEmo在成功率、效率和买家节省方面均持续优于两个基线。这一发现强调了适应性情感表达在使LLM智能体更有效地进行多轮谈判中的重要性。代码可在\href{https://github.com/Yunbo-max/EvoEmo}{ extcolor{red}{https://github.com/Yunbo-max/EvoEmo}}获取。

英文摘要

Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in \textit{complex}, \textit{multi-turn} negotiations, opening new avenues for agentic AI. However, existing LLM agents largely overlook the functional role of emotions in such negotiations, instead generating passive, preference-driven emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts. To address this gap, we present EvoEmo, an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. EvoEmo models emotional state transitions as a Markov Decision Process and employs population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. We further propose an evaluation framework with two baselines -- vanilla strategies and fixed-emotion strategies -- for benchmarking emotion-aware negotiation. Extensive experiments and ablation studies show that EvoEmo consistently outperforms both baselines, achieving higher success rates, higher efficiency, and increased buyer savings. This findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi-turn negotiation. The code is available at \href{https://github.com/Yunbo-max/EvoEmo}{\textcolor{red}{https://github.com/Yunbo-max/EvoEmo}}.

URL PDF HTML ☆

赞 0 踩 0

2506.23274 2026-05-27 cs.LG cs.AI 版本更新

Real-Time Progress Prediction in Reasoning Language Models

推理语言模型中的实时进度预测

Hans Peter Lyngsøe Raaschou-Jensen, Constanza Fierro, Anders Søgaard

发表机构 * Department of Computer Science, University of Copenhagen（哥本哈根大学计算机科学系）

AI总结研究通过离散化推理轨迹训练线性探针和微调模型生成0-100%进度估计，实现推理语言模型中的实时进度预测，并在数学推理任务上达到0.161 MAE。

详情

AI中文摘要

最近的推理语言模型，特别是那些采用长潜在思维链的模型，在复杂的智能体任务上表现出色。然而，随着这些模型在越来越长的时间范围内运行，其内部进展对用户变得不透明，使得期望管理和实时监督变得困难。在这项工作中，我们研究了对此类模型进行实时进度预测的可行性。我们首先通过离散化推理轨迹并训练线性探针对推理状态进行分类，测试隐藏状态是否编码进度信息。然后，我们微调模型以在思维链推理过程中生成0-100%的进度估计。我们最强的进度报告检查点在数学推理轨迹上达到了0.161的平均绝对误差，并在此设置中优于位置基线。最后，我们通过测量相同部分展开中隐含进度值的变化程度，量化了进度标签的内在模糊性。这种模糊性在Qwen3-4B中最低，其延续产生的展开离散度最小，表明更大的模型可以通过减少剩余解决方案长度的变化来使进度标签更稳定。

英文摘要

Recent reasoning language models, particularly those that employ long latent chains of thought, achieve strong performance on complex agentic tasks. However, as these models operate over increasingly long time horizons, their internal progress becomes opaque to users, making expectation management and real-time oversight difficult. In this work, we investigate whether real-time progress prediction is feasible for such models. We first test whether hidden states encode progress information by discretizing reasoning trajectories and training a linear probe to classify reasoning states. We then fine-tune models to generate progress estimates from 0--100\% during chain-of-thought reasoning. Our strongest progress-reporting checkpoint reaches 0.161 MAE on mathematical reasoning traces and outperforms position baselines in this setting. Finally, we quantify the intrinsic ambiguity of progress labels by measuring how much the implied progress value varies from the same partial rollout. This ambiguity is lowest for Qwen3-4B, whose continuations produce the smallest rollout dispersion, suggesting that larger models can make progress labels more stable by reducing variation in remaining solution length.

URL PDF HTML ☆

赞 0 踩 0

2510.06843 2026-05-27 cs.CL cs.AI 版本更新

Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning

自信号驱动的多LLM辩论以实现高效准确的推理

Xuhang Chen, Zhifan Song, Deyi Ji, Shuo Gao, Lanyun Zhu

发表机构 * University of Cambridge（剑桥大学）； Sorbonne Université（索邦大学）； University of Science and Technology of China（中国科学技术大学）； Beihang University（北航大学）； Nanyang Technological University（南洋理工大学）

AI总结提出一种利用模型级置信度和token级语义焦点两种自信号来自适应引导多LLM辩论过程的方法，在提高准确性的同时减少token消耗。

详情

AI中文摘要

大型语言模型（LLMs）在 diverse 应用领域展现了令人印象深刻的能力。最近的工作探索了多LLM智能体辩论（MAD），通过使多个LLM迭代讨论和细化响应来增强性能。然而，现有的MAD方法主要关注利用外部结构（如辩论图）和LLM作为评判者，而忽略了生成过程中出现的自信号（如token logits和注意力）。这种遗漏导致了冗余计算和潜在的性能下降。在本文中，我们将重点转移到多LLM辩论的自信号上，并引入了一种自信号驱动的多LLM辩论（SID），它利用两种类型的自信号：模型级置信度和token级语义焦点，来自适应地引导辩论过程。我们的方法使高置信度智能体能够在模型级别提前退出，并基于注意力机制压缩冗余辩论内容。我们在多个具有挑战性的基准测试上，对各种LLMs和多模态LLMs评估了我们的方法。实验结果表明，我们的方法不仅在准确性上优于现有的MAD技术，而且还减少了token消耗，突显了利用自信号在提高多智能体辩论系统的性能和效率方面的有效性。我们的代码将在~\href{https://github.com/xuhang2019/SID}{ exttt{https://github.com/xuhang2019/SID}} 上提供。

英文摘要

Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to discuss and refine responses iteratively. Nevertheless, existing MAD methods predominantly focus on utilizing external structures, such as debate graphs, using LLM-as-a-Judge, while neglecting the application of self signals, such as token logits and attention, that arise during generation. This omission leads to redundant computation and potential performance degradation. In this paper, we shift the focus to the self signals of multi-LLM debate and introduce a Self-Signals Driven Multi-LLM Debate (SID), which leverages two types of self-signals: model-level confidence and token-level semantic focus, to adaptively guide the debate process. Our approach enables high-confidence agents to exit early at the model level and compress the redundant debate contents based on the attention mechanism. We evaluate our method on various LLMs and Multimodal LLMs across multiple challenging benchmarks. Experimental results demonstrate that our method not only outperforms existing MAD techniques in accuracy but also reduces token consumption, highlighting the effectiveness of utilizing self signals in enhancing both the performance and efficiency of multi-agent debate systems. Our code will be available at~\href{https://github.com/xuhang2019/SID}{\texttt{https://github.com/xuhang2019/SID}}.

URL PDF HTML ☆

赞 0 踩 0

2510.06381 2026-05-27 cs.LG cs.AI 版本更新

Monte Carlo Permutation Search

蒙特卡洛排列搜索

Tristan Cazenave

AI总结提出一种改进GRAVE算法的通用蒙特卡洛树搜索算法MCPS，通过利用路径上所有节点的统计信息，在多种游戏中优于GRAVE，并给出了统计权重公式的数学推导。

2510.01833 2026-05-27 cs.AI cs.CL 版本更新

Plan Then Action:High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

先规划后行动：面向LLM推理的高层规划引导强化学习

Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Chaoda Song, Zhiqiang Gao, Shufei Zhang, Sumon Biswas

发表机构 * Case Western Reserve University, Cleveland, OH, USA（凯斯西储大学）； Kean University, Union, NJ, USA（凯恩大学）； The Ohio State University, Columbus, OH, USA（俄亥俄州立大学）； Fudan University, Shanghai, China（复旦大学）； Shanghai Artificial Intelligence Laboratory, Shanghai, China（上海人工智能实验室）； The University of Hong Kong, Hong Kong, China（香港大学）； North Carolina State University, Raleigh, NC, USA（北卡罗来纳州立大学）

AI总结提出PTA-GRPO两阶段框架，通过高层规划引导与强化学习联合优化，提升LLM在数学和自然科学推理任务中的准确性和泛化能力。

Comments 19 pages and 5 figures

详情

AI中文摘要

大型语言模型（LLMs）通过思维链（CoT）展现出强大的推理能力，但其token级别的生成倾向于局部决策，缺乏全局规划，常常导致冗余或不准确的推理。现有方法（如基于树的搜索和强化学习）试图解决这一问题，但计算成本高，且仍难以产生可靠的推理轨迹。为应对这些挑战，我们提出先规划后行动增强推理与组相对策略优化（PTA-GRPO），这是一个两阶段框架，旨在联合改进高层规划和细粒度CoT推理。具体而言，在第一阶段，给定LLM负责将CoT推理总结为紧凑的高层指导，然后用于监督微调。接着，我们引入一种指导感知的强化学习方法，联合优化最终输出和指导质量，提升推理效果。我们在数学和自然科学的十个推理基准上，使用五个覆盖多种数据模态的多样化基础模型进行评估。结果表明，PTA-GRPO在模型和任务上持续带来显著改进，展现出强大的有效性和泛化能力。

英文摘要

Large language models (LLMs) demonstrate strong reasoning abilities via Chain-of-Thought (CoT), but their token-level generation encourages local decisions and lacks global planning, often leading to redundant or inaccurate reasoning. Existing methods, such as tree-based search and reinforcement learning (RL), attempt to address this issue but incur high computational costs and still struggle to produce reliable reasoning trajectories. To address these challenges, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to jointly improve high-level planning and fine-grained CoT reasoning. Specifically, in the first stage, a given LLM is responsible for summarizing CoT reasoning into compact high-level guidance, which is then leveraged for supervised fine-tuning. Then, we introduce a guidance-aware reinforcement learning method that jointly optimizes the final output and the quality of guidance, enhancing reasoning effectiveness. We evaluate PTA-GRPO on ten reasoning benchmarks across mathematics and natural sciences, using five diverse base models spanning multiple data modalities. The results show that PTA-GRPO consistently delivers significant improvements across models and tasks, demonstrating strong effectiveness and generalization.

URL PDF HTML ☆

赞 0 踩 0

2510.01336 2026-05-27 cs.CL cs.AI cs.LG 版本更新

HiSpec: Hierarchical Speculative Decoding for LLMs

HiSpec: 分层推测解码用于大语言模型

Avinash Kumar, Sujay Sanghavi, Poulami Das

发表机构 * Department of Electrical and Computer Engineering, The University of Texas at Austin（德克萨斯大学奥斯汀分校电子与计算机工程系）

AI总结提出HiSpec框架，利用早期退出模型进行低开销中间验证，通过重用键值缓存和隐藏状态提高吞吐量，平均加速1.28倍，最高2.01倍，且不损失准确性。

详情

AI中文摘要

推测解码通过使用较小的草稿模型推测令牌，再由较大的目标模型验证，从而加速LLM推理。验证通常是瓶颈（例如，当3B模型为70B目标模型推测时，验证速度比令牌生成慢4倍），但大多数先前工作只关注加速草稿生成。“中间”验证通过早期丢弃不准确的草稿令牌来减少验证时间，但现有方法在引入中间验证器时会产生大量训练开销，增加内存占用以协调中间验证步骤，并依赖近似启发式方法损害准确性。我们提出$\underline{\textit{Hi}}\textit{erarchical }\underline{\textit{Spec}}\textit{ulative Decoding (HiSpec)}$，一种高吞吐量推测解码框架，利用早期退出模型进行低开销中间验证。早期退出模型允许令牌通过跳过层遍历提前退出，并经过显式训练，使得选定层的隐藏状态可解释，从而在不显著增加计算和内存开销的情况下，非常适合中间验证。为了进一步提高资源效率，我们设计了一种方法，使HiSpec能够在草稿模型、中间验证器和目标模型之间重用键值缓存和隐藏状态。为了保持准确性，HiSpec定期针对目标模型验证中间验证器接受的草稿令牌。我们在各种代表性基准和模型上的评估表明，与基线单层推测相比，HiSpec平均提高吞吐量1.28倍，最高达2.01倍，且不损失准确性。

英文摘要

Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. $\textit{``Intermediate"}$ verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose $\underline{\textit{Hi}}\textit{erarchical }\underline{\textit{Spec}}\textit{ulative Decoding (HiSpec)}$, a framework for high-throughput speculative decoding that exploits $\textit{early-exit (EE) models}$ for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28$\times$ on average and by up to 2.01$\times$ compared to the baseline single-layer speculation without compromising accuracy.

URL PDF HTML ☆

赞 0 踩 0

2509.26600 2026-05-27 cs.CL cs.AI 版本更新

When LLMs Benchmark Themselves: Deconstructing Self-Bias in Automated Evaluation

当LLM自我基准测试：解构自动评估中的自我偏见

Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch

发表机构 * Google（谷歌）； ETH Zurich（苏黎世联邦理工学院）

AI总结研究LLM自动创建基准测试时存在的自我偏见问题，发现测试集生成和评估两个环节均产生偏见，导致模型偏爱自身输出，并提出了多样性指标以部分缓解该偏见。

详情

AI中文摘要

随着LLM迅速饱和现有基准测试，使用LLM自动创建基准测试（LLM-as-a-benchmark）——即模型生成测试输入（LLM-as-a-testset）并评估输出（LLM-as-an-evaluator）——已成为人工策划的廉价替代方案。我们表明，这种范式存在一个根本问题：LLM生成的基准测试系统性地偏爱创建它们的模型。以机器翻译为主要测试平台，我们发现自我偏见源于两个叠加来源：LLM-as-a-testset和LLM-as-an-evaluator，它们的组合放大了这种效应。关键的是，即使测试数据在显式多样性控制下生成，每个模型的隐式风格倾向也会产生同质的、模型特定的输出，从而抬高其自身分数。使用我们提出的多样性度量增加源文本多样性，可以部分缓解这种偏见。自我偏见足够强，以至于每个模型都将自己排在首位，覆盖了同行共识排序。我们确认该现象扩展到Chatbot Arena任务上的开放式生成。

英文摘要

As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test inputs (LLM-as-a-testset) and evaluates outputs (LLM-as-an-evaluator) -- has gained traction as a cheap alternative to human curation. We show that this paradigm has a fundamental problem: LLM-generated benchmarks systematically favor the model that created them. Using machine translation as our primary testbed, we find that self-bias arises from two additive sources, LLM-as-a-testset and LLM-as-an-evaluator, and their combination amplifies the effect. Crucially, even when test data is generated with explicit diversity controls, each model's implicit stylistic tendencies produce homogeneous, model-specific outputs that inflate its own scores. Increasing source text diversity, using our proposed diversity metric, partially mitigates this bias. Self-bias is strong enough to cause each model to rank itself first, overriding the peer-consensus ordering. We confirm that the phenomenon extends to open-ended generation on the Chatbot Arena task.

URL PDF HTML ☆

赞 0 踩 0

2509.04632 2026-05-27 cs.DB cs.AI 版本更新

Conceptual Schema Inference for Tabular Datasets using Large Language Models

使用大型语言模型对表格数据集进行概念模式推断

Zhenyu Wu, Jiaoyan Chen, Norman W. Paton

发表机构 * The University of Manchester（曼彻斯特大学）

AI总结本文提出两种基于大型语言模型的方法，从原始表格中自动推断概念模式，包括实体类型、属性和类型间关系，以解决异构表格数据的一致性问题。

详情

AI中文摘要

来自数据湖、网络表格和开放数据门户的大量表格数据通常源自异构源，导致表示不一致。因此，理解和组织此类存储库仍然是一个重大挑战。虽然先前的工作主要关注数据集发现和探索，但本文解决了概念模式推断的补充问题：自动从原始表格中推导出捕获实体类型、属性和类型间关系的概念模式。我们提出了两种基于大型语言模型（LLM）的方法，仅使用列标题和单元格值：GeSI使用生成式LLM从表格和列级语义推断层次化类型及其属性，并将它们集成到全局模式中，该模式还捕获跨类型的关系；EmSI使用基于LLM的表格嵌入按列级语义对表格进行分组，推断每组内的属性，并从共享属性模式构建层次结构。最后，我们报告了一项实验分析，展示了我们的方法在推断模式组件的简洁性和结构质量、对大型存储库的可扩展性方面的有效性，以及一个说明端到端模式推断的案例研究。

英文摘要

Large collections of tabular data from data lakes, web tables and open data portals often originate from heterogeneous sources, leading to representational inconsistencies. Understanding and organizing such repositories therefore remains a major challenge. While prior work has primarily focused on dataset discovery and exploration, this paper addresses the complementary problem of conceptual schema inference: automatically deriving a conceptual schema that captures entity types, attributes and inter-type relationships directly from raw tables. We propose two large language model (LLM)-based approaches that use only column headers and cell values: GeSI uses generative LLMs to infer hierarchical types and their attributes from table- and column-level semantics, and to integrate them into a global schema that also captures relationships across types; EmSI employs LLM-based table embeddings to group tables by column-level semantics, infer attributes within each group, and construct hierarchical structures from shared attribute patterns. Finally, we report an experimental analysis demonstrating the effectiveness of our approaches in terms of the conciseness and structural quality of the inferred schema components, their scalability to large repositories, and a case study illustrating end-to-end schema inference.

URL PDF HTML ☆

赞 0 踩 0

2508.18444 2026-05-27 cs.CL cs.AI 版本更新

How Reliable are LLMs for Reasoning on the Re-ranking task?

LLMs在重排序任务上的推理有多可靠？

Nafis Tanveer Islam, Zhiming Zhao

发表机构 * Multiscale Networked Systems (MNS) Group, University of Amsterdam（多尺度网络系统（MNS）组，阿姆斯特丹大学）

AI总结本研究分析不同训练方法对LLMs在重排序任务中语义理解的影响，并探究模型能否生成更知情的文本推理以克服透明度和数据有限的挑战。

Comments This chapter has been published in Advancements in AI From Foundations to Cross-Disciplinary Applications, Springer, 2026

详情

AI中文摘要

随着大型语言模型（LLMs）语义理解能力的提升，它们表现出对人类更高的认知和一致性，但这以牺牲透明度为代价。尽管通过实验分析取得了有希望的结果，但深入理解LLM的内部工作机制对于理解重排序背后的推理是不可避免的，这为最终用户提供了解释，使他们能够做出明智的决定。此外，在新开发的系统中，用户参与有限且排序数据不足，准确地对内容进行重排序仍然是一个重大挑战。虽然各种训练方法影响LLMs的训练并生成推理，但我们的分析发现，一些训练方法比其他方法表现出更好的可解释性，这意味着并非所有训练方法都学到了准确的语义理解；相反，获得了抽象知识以优化评估，这引发了对LLMs真正可靠性的质疑。因此，在这项工作中，我们分析了不同训练方法如何影响LLMs在重排序任务中的语义理解，并调查这些模型是否能够生成更知情的文本推理，以克服透明度或LLMs以及有限训练数据的挑战。为了分析用于重排序任务的LLMs，我们利用来自环境和地球科学领域的相对较小的排序数据集来对检索到的内容进行重排序。此外，我们还分析了可解释信息，以查看是否可以使用可解释性对重排序进行推理。

英文摘要

With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness and alignment with human values, but this comes at the cost of transparency. Although promising results are achieved via experimental analysis, an in-depth understanding of the LLM's internal workings is unavoidable to comprehend the reasoning behind the re-ranking, which provides end users with an explanation that enables them to make an informed decision. Moreover, in newly developed systems with limited user engagement and insufficient ranking data, accurately re-ranking content remains a significant challenge. While various training methods affect the training of LLMs and generate inference, our analysis has found that some training methods exhibit better explainability than others, implying that an accurate semantic understanding has not been learned through all training methods; instead, abstract knowledge has been gained to optimize evaluation, which raises questions about the true reliability of LLMs. Therefore, in this work, we analyze how different training methods affect the semantic understanding of the re-ranking task in LLMs and investigate whether these models can generate more informed textual reasoning to overcome the challenges of transparency or LLMs and limited training data. To analyze the LLMs for re-ranking tasks, we utilize a relatively small ranking dataset from the environment and the Earth science domain to re-rank retrieved content. Furthermore, we also analyze the explainable information to see if the re-ranking can be reasoned using explainability.

URL PDF HTML ☆

赞 0 踩 0

2504.08593 2026-05-27 cs.CV cs.AI 版本更新

Hands-On: Segmenting Individual Signs from Continuous Sequences

动手实践：从连续序列中分割单个手势

JianHe Low, Harry Walsh, Ozge Mercanoglu Sincan, Richard Bowden

发表机构 * CVSSP, University of Surrey（CVSSP，萨里大学）

AI总结针对连续手语分割难题，提出基于Transformer的架构，利用HaMeR手部特征和3D角度，采用BIO标注方案建模时序动态，在DGS语料库上达到最优性能。

Comments Accepted in the 19th IEEE International Conference on Automatic Face and Gesture Recognition. Code Implementation Released

2508.00748 2026-05-27 cs.CV cs.AI cs.CR cs.MM 版本更新

Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos

真的是你吗？探索逼真说话头像视频中的生物特征验证场景

Laura Pedrouzo-Rodriguez, Pedro Delgado-DeRobles, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez

发表机构 * Biometrics and Data Pattern Analytics Lab（生物特征与数据模式分析实验室）

AI总结本文研究在逼真说话头像视频中，利用面部运动模式作为行为生物特征进行身份验证，提出基于图卷积网络的轻量级模型，AUC接近80%。

Comments Accepted at the IEEE International Joint Conference on Biometrics (IJCB 2025)

详情

DOI: 10.1109/IJCB65343.2025.11411089
Journal ref: 2025 IEEE International Joint Conference on Biometrics (IJCB)

AI中文摘要

逼真说话头像在虚拟会议、游戏和社交平台中越来越常见。这些头像允许更沉浸式的交流，但也引入了严重的安全风险。一个新兴威胁是冒充：攻击者可以窃取用户的头像，保留其外观和声音，使得仅凭视觉或听觉几乎无法检测欺诈性使用。在本文中，我们探讨了在这种头像中介场景中生物特征验证的挑战。我们的主要问题是，当头像的视觉外观是其主人的复制品时，个体的面部运动模式能否作为可靠的行为生物特征来验证其身份。为了回答这个问题，我们引入了一个新的数据集，其中包含使用最先进的一次性头像生成模型GAGAvatar创建的逼真头像视频，包括真实和冒充的头像视频。我们还提出了一种轻量级、可解释的时空图卷积网络架构，具有时间注意力池化，仅使用面部标志点来建模动态面部手势。实验结果表明，面部运动线索能够实现有意义的身份验证，AUC值接近80%。所提出的基准和生物特征系统可供研究社区使用，以引起对基于头像的通信系统中更高级行为生物特征防御的迫切需求的关注。

英文摘要

Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user's avatar, preserving his appearance and voice, making it nearly impossible to detect its fraudulent usage by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual's facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar's visual appearance is a facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos created using a state-of-the-art one-shot avatar generation model, GAGAvatar, with genuine and impostor avatar videos. We also propose a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling, that uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification with AUC values approaching 80%. The proposed benchmark and biometric system are available for the research community in order to bring attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.

URL PDF HTML ☆

赞 0 踩 0

2507.20758 2026-05-27 cs.AI 版本更新

How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation

思维链如何工作？从解码、投影和激活追踪信息流

Hao Yang, Qinghua Zhao, Lei Li, Lingyi Meng, Mengda Yu

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University（南京大学新型软件技术国家重点实验室）； School of Artificial Intelligence and Big Data, Hefei University（合肥大学人工智能与大数据学院）； School of Artificial Intelligence, Beijing Institute of Technology（北京理工大学人工智能学院）； School of Computing and Information, University of Pittsburgh（匹兹堡大学计算机与信息学院）； Center for Biostatistics, The Ohio State University Wexner Medical Center（俄亥俄州立大学韦克斯纳医学中心生物统计中心）

AI总结通过反向追踪解码、投影和激活阶段的信息流，揭示思维链作为解码空间剪枝器的作用，并发现其以任务依赖方式调节神经元激活。

Comments Accept by ACL 2026

2506.21443 2026-05-27 cs.CL cs.AI 版本更新

Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection

领域知识增强的大语言模型用于欺诈和概念漂移检测

Ali Şenol, Garima Agrawal, Huan Liu

发表机构 * School of Computing and Augmented Intelligence (SCAI), Arizona State University (ASU)（计算与增强智能学院（SCAI），亚利桑那州立大学）； Department of Computer Engineering, Tarsus University（计算机工程系，塔鲁斯大学）； Minerva CQ and HumaConn AI Consulting（Minerva CQ和HumaConn人工智能咨询）； School of Computing and Augmented Intelligence (SCAI), Arizona State University（计算与增强智能学院（SCAI），亚利桑那州立大学）

AI总结提出一种领域知识增强的大语言模型框架，通过集成结构化领域知识和漂移检测单元，实现高准确率的欺诈对话检测和概念漂移分类。

详情

DOI: 10.3390/electronics15030534

AI中文摘要

在动态平台上检测欺骗性对话变得越来越困难，原因是语言模式的演变和概念漂移（CD）——即随着时间推移，语义或主题的转变会改变交互的上下文或意图。这些转变可能掩盖恶意意图或模仿正常对话，使得准确分类具有挑战性。尽管大语言模型（LLMs）在自然语言任务中表现出色，但在风险敏感场景中，它们常常面临上下文模糊和幻觉问题。为了解决这些挑战，我们提出了一个领域知识（DK）增强的LLM框架，该框架将预训练的LLM与结构化的、任务特定的见解相结合，以执行欺诈和概念漂移检测。所提出的架构由三个主要组件组成：（1）一个DK-LLM模块，用于检测虚假或欺骗性对话；（2）一个漂移检测单元（OCDD），用于判断是否发生了语义转变；（3）第二个DK-LLM模块，用于将漂移分类为良性或欺诈性。我们首先使用虚假评论数据集验证领域知识的价值，然后将我们的完整框架应用于SEConvo，一个包含多种欺诈和垃圾攻击的多轮对话数据集。结果表明，我们的系统能够高精度地检测虚假对话，并有效分类漂移的性质。在结构化提示的引导下，基于LLaMA的实现达到了98%的分类准确率。与零样本基线的对比研究表明，在高风险NLP应用中，融入领域知识和漂移意识显著提高了性能、可解释性和鲁棒性。

英文摘要

Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD)-i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multiturn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.

URL PDF HTML ☆

赞 0 踩 0

2506.17633 2026-05-27 cs.CV cs.AI 版本更新

Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection

自适应多提示对比网络用于少样本分布外检测

Xiang Fang, Arvind Easwaran, Blaise Genest

发表机构 * College of Computing and Data Science, Nanyang Technological University, Singapore（南洋理工大学计算机学院和数据科学学院，新加坡）

AI总结针对少样本分布外检测问题，提出自适应多提示对比网络（AMCN），通过CLIP学习可学习文本提示和类间/类内分布，实现ID-OOD分离边界自适应。

Comments Published in ICML 2025

详情

AI中文摘要

分布外（OOD）检测旨在区分异常样本，以防止在分布内（ID）数据集上训练的模型产生不可用的输出。大多数OOD检测方法需要大量IID样本进行训练，这严重限制了它们的实际应用。为此，我们针对一个具有挑战性的场景：少样本OOD检测，其中只有少量标记的ID样本可用。因此，少样本OOD检测比传统的OOD检测设置更具挑战性。先前的少样本OOD检测工作忽略了不同类别之间的显著多样性。在本文中，我们提出了一种新颖的网络：自适应多提示对比网络（AMCN），它通过学习类间和类内分布来适应ID-OOD分离边界。为了弥补OOD的缺失和ID图像样本的稀缺，我们利用CLIP连接文本与图像，设计可学习的ID和OOD文本提示。具体来说，我们首先生成自适应提示（可学习ID提示、标签固定OOD提示和标签自适应OOD提示）。然后，我们通过引入类级阈值为每个类生成自适应类边界。最后，我们提出一个提示引导的ID-OOD分离模块来控制ID和OOD提示之间的间隔。实验结果表明，AMCN优于其他最先进的工作。

英文摘要

Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unavailable outputs. Most OOD detection methods require many IID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where {Only a few {\em labeled ID} samples are available.} Therefore, few-shot OOD detection is much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network: Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distribution. To compensate for the absence of OOD and scarcity of ID {\em image samples}, we leverage CLIP, connecting text with images, engineering learnable ID and OOD {\em textual prompts}. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. Experimental results show that AMCN outperforms other state-of-the-art works.

URL PDF HTML ☆

赞 0 踩 0

2506.10225 2026-05-27 cs.SD cs.AI eess.AS 版本更新

Genre Controlled Music Generation via Activation Steering

通过激活引导实现体裁控制的音乐生成

Swathi Narashiman, Pranay Mathur, Dipanshu Panda, Jayden Koshy Joe, Harshith M R, Anish Veerakumar, Aniruddh Krishna, Keerthiharan A

发表机构 * Indian Institute of Technology Madras（印度理工学院马德拉斯学院）

AI总结提出一种在推理时对自回归生成模型MusicGen进行干预的方法，利用线性探针权重引导残差流，实现细粒度的体裁控制。

2506.07813 2026-05-27 cs.CV cs.AI 版本更新

消息传递状态空间模型：利用现代序列建模改进图学习

Andrea Ceni, Alessio Gravina, Claudio Gallicchio, Davide Bacciu, Carola-Bibiane Schonlieb, Moshe Eliasof

发表机构 * University of Pisa（帕尔米斯大学）； University of Cambridge（剑桥大学）

AI总结提出MP-SSM，将现代状态空间模型的核心计算嵌入消息传递神经网络，实现静态和时序图上的高效、置换等变和长程信息传播，并通过精确敏感性分析刻画深层信息流问题。

详情

AI中文摘要

状态空间模型（SSM）在序列建模中的近期成功推动了其向图学习的迁移，催生了图状态空间模型（GSSM）。然而，现有的GSSM通过将SSM模块应用于从图中提取的序列，往往损害了置换等变性、消息传递兼容性和计算效率等核心属性。本文引入了一种新视角，将现代SSM计算的关键原理直接嵌入消息传递神经网络框架，从而为静态图和时序图提供统一的方法论。我们的方法MP-SSM能够实现高效、置换等变和长程信息传播，同时保持消息传递的架构简洁性。关键的是，MP-SSM支持精确的敏感性分析，我们利用该分析从理论上刻画信息流，并评估深层网络中的梯度消失和过压缩等问题。此外，我们的设计选择允许类似现代SSM的高度优化并行实现。我们在包括节点分类、图属性预测、长程基准和时空预测在内的广泛任务上验证了MP-SSM，展示了其多功能性和强大的实证性能。

英文摘要

The recent success of State-Space Models (SSMs) in sequence modeling has motivated their adaptation to graph learning, giving rise to Graph State-Space Models (GSSMs). However, existing GSSMs operate by applying SSM modules to sequences extracted from graphs, often compromising core properties such as permutation equivariance, message-passing compatibility, and computational efficiency. In this paper, we introduce a new perspective by embedding the key principles of modern SSM computation directly into the Message-Passing Neural Network framework, resulting in a unified methodology for both static and temporal graphs. Our approach, MP-SSM, enables efficient, permutation-equivariant, and long-range information propagation while preserving the architectural simplicity of message passing. Crucially, MP-SSM enables an exact sensitivity analysis, which we use to theoretically characterize information flow and evaluate issues like vanishing gradients and over-squashing in the deep regime. Furthermore, our design choices allow for a highly optimized parallel implementation akin to modern SSMs. We validate MP-SSM across a wide range of tasks, including node classification, graph property prediction, long-range benchmarks, and spatiotemporal forecasting, demonstrating both its versatility and strong empirical performance.

URL PDF HTML ☆

赞 0 踩 0

2505.18603 2026-05-27 cs.AI cs.CV 版本更新

是的，Q学习有助于离线上下文强化学习

Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Andrei Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Igor Kiselev, Vladislav Kurenkov

发表机构 * Reinforcement Learning Journal（强化学习期刊）

AI总结本文在离线上下文强化学习框架中整合RL目标，通过150多个数据集实验证明，直接优化RL目标相比算法蒸馏平均提升约30%性能，且价值学习中的保守性带来额外改进。

详情

AI中文摘要

现有的离线上下文强化学习（ICRL）方法主要依赖监督训练目标，这在离线RL设置中已知存在局限性。在本研究中，我们探索了在离线ICRL框架中整合RL目标。通过在150多个GridWorld和MuJoCo环境派生数据集上的实验，我们证明，与广泛采用的算法蒸馏（AD）相比，直接优化RL目标在各种数据集覆盖范围、结构、专业水平和环境复杂性下平均提升约30%的性能。此外，在具有挑战性的XLand-MiniGrid环境中，RL目标使AD的性能翻倍。我们的结果还揭示，在几乎所有测试的设置中，价值学习期间加入保守性带来了额外的改进。我们的发现强调了将ICRL学习目标与RL奖励最大化目标对齐的重要性，并表明离线RL是推进ICRL的一个有前景的方向。

英文摘要

Existing offline in-context reinforcement learning (ICRL) methods have predominantly relied on supervised training objectives, which are known to have limitations in offline RL settings. In this study, we explore the integration of RL objectives within an offline ICRL framework. Through experiments on more than 150 GridWorld and MuJoCo environment-derived datasets, we demonstrate that optimizing RL objectives directly improves performance by approximately 30% on average compared to widely adopted Algorithm Distillation (AD), across various dataset coverages, structures, expertise levels, and environmental complexities. Furthermore, in the challenging XLand-MiniGrid environment, RL objectives doubled the performance of AD. Our results also reveal that the addition of conservatism during value learning brings additional improvements in almost all settings tested. Our findings emphasize the importance of aligning ICRL learning objectives with the RL reward-maximization goal, and demonstrate that offline RL is a promising direction for advancing ICRL.

URL PDF HTML ☆

赞 0 踩 0

2407.15073 2026-05-27 cs.AI cs.CL 版本更新

Multi-Agent Causal Discovery Using Large Language Models

多智能体因果发现使用大型语言模型

Hao Duong Le, Xin Xia, Haijie Xu, Chen Zhang

发表机构 * Department of Industrial Engineering, Tsinghua University（清华大学工业工程系）

AI总结提出多智能体因果发现框架MAC，通过元融合机制结合自主选择SCD算法的辩论编码模块和基于元数据的对抗性辩论模块，在多个基准上取得最优性能。

详情

AI中文摘要

因果发现旨在识别变量之间的因果关系，是各科学领域的基本问题。传统的统计因果发现（SCD）方法仅依赖观测数据，忽略元数据中可用的上下文信息，而近期基于LLM的方法利用元数据但将大型语言模型（LLM）视为单一智能体，使其判断易受记忆或偏见关联影响。为解决这一差距，我们引入MAC（多智能体因果发现框架），将因果发现转化为多智能体辩论与自主选择SCD算法相结合。MAC通过元融合机制桥接两个互补模块：辩论编码模块（DCM）通过自主选择并执行最合适的SCD算法将初始图基于数据，以及元辩论模块（MDM）通过对抗性的肯定-否定-裁判辩论基于元数据精炼图。在五个基准数据集和三个指标（F1、SHD、NHD）上，MAC在五个统计基线和四个基于LLM的基线中取得了最佳综合性能，在使用Gemini-2.0-Flash时在15个评估点中排名第一10次——包括完美重建地震图——并在三个骨干LLM上保持稳健。

英文摘要

Causal discovery aims to identify causal relationships between variables and is a fundamental problem across the sciences. Traditional statistical causal discovery (SCD) methods rely solely on observational data and ignore the contextual information available in metadata, whereas recent LLM-based methods exploit metadata but treat the large language model (LLM) as a single agent, leaving its judgments vulnerable to memorized or biased associations. To address this gap, we introduce MAC (Multi-Agent Causal Discovery Framework), which casts causal discovery as a multi-agent debate coupled with the autonomous selection of an SCD algorithm. MAC combines two complementary modules, bridged by a Meta Fusion mechanism: a Debate-Coding Module (DCM) that grounds an initial graph in data by autonomously selecting and executing the best-suited SCD algorithm, and a Meta-Debate Module (MDM) that refines the graph through an adversarial Affirmative-Negative-Judge debate over the metadata. Across five benchmark datasets and three metrics (F1, SHD, NHD), MAC achieves the best aggregate performance among five statistical and four LLM-based baselines, ranking first on 10 of 15 evaluation points with Gemini-2.0-Flash -- including a perfect reconstruction of the Earthquake graph -- and remains robust across three backbone LLMs.

URL PDF HTML ☆

赞 0 踩 0

2306.13985 2026-05-27 stat.ML cs.AI cs.LG stat.ME 版本更新

Robust Classification of High-Dimensional Data using Data-Adaptive Energy Distance

使用数据自适应能量距离的高维数据鲁棒分类

Jyotishka Ray Choudhury, Aytijhya Saha, Sarbojit Roy, Subhajit Dutta

发表机构 * Indian Statistical Institute , Kolkata, India（印度统计研究所，加尔各答，印度）； School of Industrial and Systems Engineering, Georgia Institute of Technology , Atlanta, USA（工业与系统工程学院，佐治亚理工学院，美国亚特兰大）； Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology , Saudi Arabia（计算机、电子和数学科学与工程系，国王阿卜杜勒·阿齐兹大学科学与技术学院，沙特阿拉伯）； Applied Statistics Unit, Indian Statistical Institute , Kolkata, India（应用统计部，印度统计研究所，加尔各答，印度）； Department of Mathematics and Statistics, Indian Institute of Technology Kanpur , India（数学与统计系，印度理工学院坎普尔分校，印度）

AI总结针对高维低样本量数据，提出无调参、无矩条件的鲁棒分类器，在渐近条件下实现完美分类，并通过模拟和真实数据验证其优势。

Comments Published at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2023

详情

DOI: 10.1007/978-3-031-43424-2_6
Journal ref: In: ECML PKDD 2023: Research Track. Lecture Notes in Computer Science, vol 14173. Springer, Cham (2023)

AI中文摘要

高维低样本量数据的分类在基因表达研究、癌症研究和医学成像等多种实际场景中构成挑战。本文开发并分析了一些专门为HDLSS数据设计的分类器。这些分类器无需调参且具有鲁棒性，即它们不依赖于底层数据分布的任何矩条件。研究表明，在相当一般的条件下，它们在HDLSS渐近框架下能实现完美分类。还研究了所提分类器的比较性能。我们的理论结果得到了广泛的模拟研究和真实数据分析的支持，这些分析表明所提出的分类技术相对于几种广泛认可的方法具有显著优势。

英文摘要

Classification of high-dimensional low sample size (HDLSS) data poses a challenge in a variety of real-world situations, such as gene expression studies, cancer research, and medical imaging. This article presents the development and analysis of some classifiers that are specifically designed for HDLSS data. These classifiers are free of tuning parameters and are robust, in the sense that they are devoid of any moment conditions of the underlying data distributions. It is shown that they yield perfect classification in the HDLSS asymptotic regime, under some fairly general conditions. The comparative performance of the proposed classifiers is also investigated. Our theoretical results are supported by extensive simulation studies and real data analysis, which demonstrate promising advantages of the proposed classification techniques over several widely recognized methods.

URL PDF HTML ☆

赞 0 踩 0

2003.05746 2026-05-27 cs.LO cs.AI cs.DB 版本更新

Querying and Repairing Inconsistent Prioritized Knowledge Bases: Complexity Analysis and Links with Abstract Argumentation

查询与修复不一致的优先知识库：复杂性分析与抽象论证的联系

Meghyn Bienvenu, Camille Bourgaux

发表机构 * CNRS & University of Bordeaux, France（法国国家科学研究中心与波尔多大学）； DI ENS, ENS, CNRS, PSL University & Inria, Paris, France（巴黎高等师范学院（ENS）、法国国家科学研究中心（CNRS）、巴黎萨克雷大学（PSL University）与法国国家信息与自动化研究所（Inria））

AI总结本文研究优先知识库中不一致性处理问题，定义了全局、帕累托和完成最优修复，分析了基于这些修复的查询蕴含、唯一最优修复存在性及枚举的数据复杂度，并揭示了最优修复与抽象论证框架扩展之间的关系。

Comments This is an extended version of a paper appearing at the 17th International Conference on Principles of Knowledge Representation and Reasoning (KR 2020). This version corrects the statement of Theorem 43 (missing hypothesis). 27 pages

详情

AI中文摘要

本文探讨了优先知识库（由本体、事实集和冲突事实间的优先关系组成）的不一致性处理问题。在数据库设置中，已研究了密切相关的场景，并定义了优先不一致数据库的三种不同最优修复概念（全局、帕累托和完成）。将这些全局、帕累托和完成最优修复概念迁移到我们的设置后，我们研究了核心推理任务的数据复杂度：基于最优修复的不一致性容忍语义下的查询蕴含、唯一最优修复的存在性以及所有最优修复的枚举。我们的结果为用常见DL-Lite方言表述的本体上这些任务的数据复杂度提供了近乎完整的图景。我们工作的第二个贡献是阐明了最优修复与（基于集合的）论证框架不同扩展概念之间的关系。在我们的结果中，我们展示了帕累托最优修复精确对应于稳定扩展（并且通常也对应于优先扩展），并提出了一种受基础扩展启发且具有良好计算特性的优先知识库新语义。我们的研究还产生了一些关于基于偏好的论证框架的独立兴趣结果。

英文摘要

In this paper, we explore the issue of inconsistency handling over prioritized knowledge bases (KBs), which consist of an ontology, a set of facts, and a priority relation between conflicting facts. In the database setting, a closely related scenario has been studied and led to the definition of three different notions of optimal repairs (global, Pareto, and completion) of a prioritized inconsistent database. After transferring the notions of globally-, Pareto- and completion-optimal repairs to our setting, we study the data complexity of the core reasoning tasks: query entailment under inconsistency-tolerant semantics based upon optimal repairs, existence of a unique optimal repair, and enumeration of all optimal repairs. Our results provide a nearly complete picture of the data complexity of these tasks for ontologies formulated in common DL-Lite dialects. The second contribution of our work is to clarify the relationship between optimal repairs and different notions of extensions for (set-based) argumentation frameworks. Among our results, we show that Pareto-optimal repairs correspond precisely to stable extensions (and often also to preferred extensions), and we propose a novel semantics for prioritized KBs which is inspired by grounded extensions and enjoys favourable computational properties. Our study also yields some results of independent interest concerning preference-based argumentation frameworks.

URL PDF HTML ☆

赞 0 踩 0

2404.18539 2026-05-27 cs.CV cs.AI 版本更新

Enhancing Boundary Segmentation for Topological Accuracy with Skeleton-based Methods

基于骨架的方法增强边界分割的拓扑准确性

Chuni Liu, Boyuan Ma, Xiaojuan Ban, Yujie Xie, Hao Wang, Weihua Xue, Jingchao Ma, Ke Xu

发表机构 * University of Science and Technology Beijing（北京科技大学）； Beijing Advanced Innovation Center for Materials Genome Engineering（北京材料基因组创新中心）； School of Intelligence Science and Technology（智能科学与技术学院）； Shunde Innovation School（顺德创新学校）； Institute for Advanced Materials and Technology（先进材料与技术研究院）； Key Laboratory of Intelligent Bionic Unmanned Systems（智能仿生无人系统重点实验室）； Institute of Materials Intelligent Technology（材料智能技术研究院）； Liaoning Academy of Materials（辽宁省材料科学院）； School of Materials Science and Technology（材料科学与技术学院）

AI总结提出Skea-Topo Aware损失函数，通过骨架感知加权和边界修正项提升网状图像边界分割的拓扑一致性，在三个数据集上相比13种方法VI指标提升最多7点。

详情

DOI: 10.24963/ijcai.2024/121
Journal ref: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), pp. 1092-1100, 2024

AI中文摘要

拓扑一致性在网状图像的边界分割任务中起着关键作用，例如神经元电子显微镜图像中的细胞膜分割、材料显微图像中的晶界分割以及航拍图像中的道路分割。在这些领域中，分割结果的拓扑变化对下游任务产生严重影响，甚至可能超过边界本身的错位。为了增强分割结果的拓扑准确性，我们提出了Skea-Topo Aware损失函数，这是一种新颖的损失函数，考虑了每个物体的形状和像素的拓扑重要性。它由两部分组成。首先，骨架感知加权损失通过更好地利用骨架建模物体几何来提高分割准确性。其次，边界修正项通过使用真实标签和预测中的前景和背景骨架，有效识别并强调预测误差中的拓扑关键像素。实验证明，在三个不同的边界分割数据集上，基于客观和主观评估，我们的方法在VI指标上相比13种最先进方法将拓扑一致性提高了最多7点。代码可在https://github.com/clovermini/Skea_topo获取。

英文摘要

Topological consistency plays a crucial role in the task of boundary segmentation for reticular images, such as cell membrane segmentation in neuron electron microscopic images, grain boundary segmentation in material microscopic images and road segmentation in aerial images. In these fields, topological changes in segmentation results have a serious impact on the downstream tasks, which can even exceed the misalignment of the boundary itself. To enhance the topology accuracy in segmentation results, we propose the Skea-Topo Aware loss, which is a novel loss function that takes into account the shape of each object and topological significance of the pixels. It consists of two components. First, a skeleton-aware weighted loss improves the segmentation accuracy by better modeling the object geometry with skeletons. Second, a boundary rectified term effectively identifies and emphasizes topological critical pixels in the prediction errors using both foreground and background skeletons in the ground truth and predictions. Experiments prove that our method improves topological consistency by up to 7 points in VI compared to 13 state-of-art methods, based on objective and subjective assessments across three different boundary segmentation datasets. The code is available at https://github.com/clovermini/Skea_topo.

URL PDF HTML ☆

赞 0 踩 0

2009.11997 2026-05-27 cs.LG cs.AI cs.RO 版本更新

Continual Model-Based Reinforcement Learning with Hypernetworks

基于超网络的连续模型强化学习

Yizhou Huang, Kevin Xie, Homanga Bharadhwaj, Florian Shkurti

发表机构 * Division of Engineering Science, University of Toronto, Canada（多伦多大学工程科学系）； Department of Computer Science, University of Toronto, Canada（多伦多大学计算机科学系）

AI总结提出HyperCRL方法，利用任务条件超网络在序列任务中持续学习动力学模型，避免重新训练并固定存储开销，在机器人 locomotion 和 manipulation 任务中优于现有持续学习方法。

Comments Updated link to project website in the abstract. 7 pages (+2 pages in appendix), 8 figures. In proceedings of the 2021 IEEE International Conference on Robotics and Automation

详情

AI中文摘要

在基于模型的强化学习（MBRL）和模型预测控制（MPC）中，有效规划依赖于学习到的动力学模型的准确性。在MBRL和MPC的许多实例中，该模型被假定为平稳的，并且定期从头开始重新训练，使用从环境交互开始收集的状态转移经验。这意味着训练动力学模型所需的时间——以及计划执行之间的暂停时间——随着收集的经验规模线性增长。我们认为这对于终身机器人学习来说太慢，并提出了HyperCRL，一种使用任务条件超网络在序列任务中持续学习所遇到动力学的方法。我们的方法有三个主要特点：首先，它包括不重新访问先前任务训练数据的动力学学习会话，因此只需存储最近固定大小的状态转移经验；其次，它使用固定容量的超网络来表示非平稳且任务感知的动力学；第三，它优于依赖固定容量网络的现有持续学习替代方案，并且与记忆不断增长的过去经验核心集的基线方法相比具有竞争力。我们展示了HyperCRL在机器人 locomotion 和 manipulation 场景（如推和开门任务）中在连续基于模型的强化学习中的有效性。我们的项目网站（含视频）位于此链接：https://rvl.cs.toronto.edu/blog/hypercrl

英文摘要

Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dynamics model. In many instances of MBRL and MPC, this model is assumed to be stationary and is periodically re-trained from scratch on state transition experience collected from the beginning of environment interactions. This implies that the time required to train the dynamics model - and the pause required between plan executions - grows linearly with the size of the collected experience. We argue that this is too slow for lifelong robot learning and propose HyperCRL, a method that continually learns the encountered dynamics in a sequence of tasks using task-conditional hypernetworks. Our method has three main attributes: first, it includes dynamics learning sessions that do not revisit training data from previous tasks, so it only needs to store the most recent fixed-size portion of the state transition experience; second, it uses fixed-capacity hypernetworks to represent non-stationary and task-aware dynamics; third, it outperforms existing continual learning alternatives that rely on fixed-capacity networks, and does competitively with baselines that remember an ever increasing coreset of past experience. We show that HyperCRL is effective in continual model-based reinforcement learning in robot locomotion and manipulation scenarios, such as tasks involving pushing and door opening. Our project website with videos is at this link https://rvl.cs.toronto.edu/blog/hypercrl

URL PDF HTML ☆

赞 0 踩 0