2606.15497 2026-06-16 cs.AI 新提交

Towards End-to-End Automation of AI Research

迈向AI研究的端到端自动化

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Chris Lu, Shengran Hu, Jakob Foerster, David Ha, Jeff Clune

发表机构 * Sakana AI ； FLAIR ； University of Oxford（牛津大学）； University of British Columbia（不列颠哥伦比亚大学）； Vector Institute（向量研究所）

AI总结提出AI Scientist系统，利用基础模型实现从构思到论文撰写的全自动研究，并通过机器学习会议研讨会的同行评审。

Comments Published in Nature 651, 914-919 (2026)

详情

AI中文摘要

科学自动化是AI领域的一个长期目标。虽然社区在自动化科学过程的各个组成部分方面取得了显著进展，但能够自主导航整个研究生命周期（从构思到发表）的系统仍然遥不可及。在这里，我们展示了迄今为止朝着端到端自动化整个过程的最强演示。我们提出了AI Scientist，它能够创建研究想法、编写代码、运行实验、绘制和分析数据、撰写完整的科学手稿并进行自己的同行评审。其想法、执行和呈现的质量足以生成一份由AI系统产生的手稿，该手稿通过了机器学习会议研讨会的首轮同行评审。该研讨会的接受率为70%。我们的系统在一个复杂的代理系统中利用了现代基础模型。我们在两种设置中评估AI Scientist：一种聚焦模式，使用人类提供的代码模板作为初始支架，在特定主题上进行研究；另一种是无模板的开放模式，利用代理搜索进行更广泛的科学探索。两种设置都能产生多样化的想法，并自动测试、报告和评估它们。这一成就展示了AI在科学贡献方面日益增长的能力，并标志着研究方式可能发生的范式转变。与任何有影响力的新技术一样，可能存在重大风险，包括给不堪重负的评审系统增加负担以及给科学文献带来噪音。然而，如果负责任地开发，这种自主系统可以极大地加速科学发现。

英文摘要

The automation of science is a long-standing ambition in the field of AI. While the community has made significant progress in automating individual components of the scientific process, a system that autonomously navigates the entire research lifecycle -- from conception to publication -- has remained out of reach. Here, we present the strongest demonstration to date toward automating the entire process end-to-end. We present The AI Scientist, which creates research ideas, writes code, runs experiments, plots and analyzes data, writes the entire scientific manuscript and performs its own peer review. Its ideas, execution, and presentation are of sufficient quality to produce a manuscript generated by an AI system that passes the first round of peer review at a major machine learning conference workshop. The workshop has an acceptance rate of 70 percent. Our system leverages modern foundation models within a complex agentic system. We evaluate The AI Scientist in two settings: a focused mode using human-provided code templates as an initial scaffold to conduct research on a specific topic, and a template-free, open-ended mode that leverages agentic search for wider scientific exploration. Both settings produce diverse ideas and automatically test, report on, and evaluate them. This achievement demonstrates AI's growing capacity for scientific contribution and signifies a potential paradigm shift in how research is conducted. As with any impactful new technology, there could be significant risks, including taxing overwhelmed review systems and adding noise to scientific literature. However, if developed responsibly, such autonomous systems could greatly accelerate scientific discovery.

URL PDF HTML ☆

赞 0 踩 0

2606.15579 2026-06-16 cs.AI cs.LG cs.MA cs.SE 新提交

Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents

你的智能体有基因组：基于序列的LLM驱动自主智能体行为分析与运行时治理

Sidi Deng

发表机构 * Independent Researcher（独立研究员）

AI总结提出XEPV序列编码框架，将LLM智能体行为建模为基因组序列，通过n-gram挖掘发现P-X-P高风险模式，设计Governor三层干预系统，使成功率提升6.2%并减少44% token消耗。

Comments 16 pages, 15 figures, 12 tables

详情

AI中文摘要

我们提出基础序列分析框架，该框架将LLM驱动的自主智能体的运行时行为编码为使用四个字母的字母表的紧凑符号序列：X（探索）、E（执行）、P（规划）和V（验证）。借鉴基因组序列分析的类比，我们对从生产ReAct智能体系统收集的347条真实世界执行轨迹（跨越8天）应用n-gram模式挖掘、马尔可夫转移矩阵和点二列相关分析。我们的分析揭示：(1) 三元组P-X-P是唯一统计显著的高风险模式，使成功率降低10.4%；(2) P比率是成功的最强负预测因子（r=-0.256, p<0.0001）；(3) E→V转移概率仅为2.1%，表明存在系统性验证缺陷。基于这些发现，我们设计了Governor，一个三层运行时干预系统，包括规则引擎、统计累加器和基于卡方的阈值自适应器。在自然的部署前后评估中（N=101 vs. N=246），Governor使任务成功率绝对提升6.2%，同时平均token消耗减少44%。为验证跨系统通用性，我们将XEPV编码应用于SWE-bench上2000条公开SWE-agent轨迹，确认探索螺旋和E→V验证缺陷在独立系统中复现。我们概述了六个研究方向，包括基础序列语言模型、跨智能体行为指纹识别和奖励塑造，并发布开源工具包以促进可重复性。

英文摘要

We propose Base Sequence Analysis, a framework that encodes the runtime behavior of LLM-powered autonomous agents into compact symbolic sequences using a four-letter alphabet: X (Explore), E (Execute), P (Plan), and V (Verify). Drawing an analogy to genomic sequence analysis, we apply n-gram pattern mining, Markov transition matrices, and point-biserial correlation to 347 real-world execution traces collected from a production ReAct agent system over 8 days. Our analysis reveals that (1) the trigram P-X-P is the only statistically significant high-risk pattern, lowering success rate by 10.4%; (2) P-ratio is the strongest negative predictor of success (r=-0.256, p<0.0001); and (3) the E->V transition probability is only 2.1%, indicating a systemic verification deficit. Based on these findings, we design Governor, a three-layer runtime intervention system comprising a rule engine, a statistical accumulator, and a chi-square-based threshold adaptor. In a natural before/after deployment evaluation (N=101 vs. N=246), Governor achieves a +6.2% absolute increase in task success rate while simultaneously reducing average token consumption by 44%. To validate cross-system generality, we apply the XEPV encoding to 2,000 public SWE-agent trajectories on SWE-bench, confirming that exploration spirals and the E->V verification deficit replicate in an independent system. We outline six research directions including base sequence language models, cross-agent behavioral fingerprinting, and reward shaping, and release an open-source toolkit for reproducibility.

URL PDF HTML ☆

赞 0 踩 0

2606.15866 2026-06-16 cs.AI cs.LG 新提交

STRIDE: Strategic Trajectory Reasoning via Discriminative Estimation for Verifiable Reinforcement Learning

STRIDE: 通过判别估计进行策略轨迹推理以实现可验证强化学习

Qinjian Zhao, Zhihao Dou, Dinggen Zhang, Xiangyu Li, Chaoda Song, Zhongwei Wan, Xinpeng Li, Yanyan Zhang, Kaijie Chen, Qingtao Pan, Chengcheng Feng, Zhiqiang Gao, Xiaoyu Xia

发表机构 * Kean University（基恩大学）； Case Western Reserve University（凯斯西储大学）； University of Texas at Austin（德克萨斯大学奥斯汀分校）； The Ohio State University（俄亥俄州立大学）； Tongji University（同济大学）； Duke Kunshan University（昆山杜克大学）； Royal Melbourne Institute of Technology（皇家墨尔本理工大学）

AI总结提出STRIDE框架，通过对比成功与失败轨迹估计n-gram策略模式的判别偏好，结合推理显著性熵识别关键策略模式，实现细粒度信用分配，提升可验证强化学习的推理性能。

详情

AI中文摘要

可验证奖励强化学习（RLVR）已成为提升大语言模型推理能力的有效后训练范式。然而，现有RLVR方法通常依赖最终答案正确性分配轨迹级奖励，提供稀疏监督，并统一处理所有token，不考虑它们对推理的实际贡献。尽管最近的研究引入了中间信号，如过程奖励、高熵token和语义不确定性，但这些信号通常本身不可验证，且可能无法区分有益策略模式与有害模式。为解决这一局限，我们提出STRIDE（通过判别估计进行策略轨迹推理），一种从可验证结果中推导策略推理监督的细粒度RLVR框架。STRIDE对比每个响应组内的成功和失败轨迹，以估计每个n-gram策略模式的结果判别偏好，并进一步将该信号与推理显著性熵结合，识别决策相关的策略模式。在RL优化过程中，这些模式被分配差异化的优势值，从而在保持RLVR可验证性的同时实现更精确的信用分配。大量实验表明，STRIDE在多种模型、任务和扩展设置（包括VLM和基于智能体的系统）中一致提升了推理性能。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training paradigm for improving the reasoning abilities of large language models. However, existing RLVR methods typically rely on final-answer correctness to assign trajectory-level rewards, providing sparse supervision and treating all tokens uniformly regardless of their actual contribution to reasoning. Although recent studies introduce intermediate signals such as process rewards, high-entropy tokens, and semantic uncertainty, these signals are often not inherently verifiable and may fail to distinguish beneficial strategic patterns from harmful ones. To address this limitation, we propose STRIDE (Strategic Trajectory Reasoning with Discriminative Estimation), a fine-grained RLVR framework that derives strategic reasoning supervision from verifiable outcomes. STRIDE contrasts successful and failed trajectories within each response group to estimate the outcome-discriminative preference of each $n$-gram strategic pattern, and further combines this signal with reasoning saliency entropy to identify decision-relevant strategic patterns. These patterns are assigned differentiated advantage values during RL optimization, enabling more precise credit assignment while preserving the verifiability of RLVR. Extensive experiments demonstrate that STRIDE consistently improves reasoning performance across diverse models, tasks, and extended settings, including VLMs and agent-based systems.

URL PDF HTML ☆

赞 0 踩 0

2606.15874 2026-06-16 cs.AI cs.SE 新提交

LLM-as-Code Agentic Programming for Agent Harness

LLM即代码：面向Agent框架的编程范式

Junjia Qi, Zichuan Fu, Jingtong Gao, Wenlin Zhang, Hanyu Yan, Xian Wu, Xiangyu Zhao

发表机构 * City University of Hong Kong（香港城市大学）； Tencent Jarvis Lab（腾讯贾维斯实验室）

AI总结针对LLM作为编排器导致控制流幻觉和不可靠执行的问题，提出Agentic Programming范式，由程序控制所有流程，LLM仅作为代码组件在需要推理或生成时被调用，显著提升长序列操作的稳定性。

Comments Accepted at the KDD 2026 Workshop on Agentic Software Engineering (AgenticSE)

详情

AI中文摘要

每个主要的LLM Agent框架都赋予LLM编排者的角色；模型决定下一步做什么、何时调用工具以及何时停止。我们认为，令牌爆炸、控制流幻觉和不可靠完成并非实现缺陷，而是将循环、分支和排序等确定性工作分配给概率系统的架构后果。更好的提示或更强的模型无法保证LLM Agent的可靠性。因此，我们提出Agentic Programming，其中程序控制所有流程，而LLM本身是其中的一部分，一个称为LLM-as-Code的自适应组件，仅在任务需要推理或生成时调用。在每个调用中，模型保持完全灵活性，但不能改变程序的执行路径。由于控制权在程序中，LLM的上下文由执行历史的调用树构建，形成有向无环图（DAG）。每个调用的上下文长度由其调用深度决定，而非随步骤累积。计算机使用Agent的案例研究表明，该设计不仅是理论立场，而且是实用的，显著提高了长视觉操作序列的稳定性。

英文摘要

Every major LLM agent framework gives the LLM the role of orchestrator; the model decides what to do next, when to call tools, and when to stop. We argue that token explosion, control-flow hallucination, and unreliable completion are not implementation bugs but architectural consequences of assigning the deterministic work of looping, branching, and sequencing to a probabilistic system. A better prompt or a stronger model cannot guarantee the reliability of the LLM agent. We therefore propose Agentic Programming, in which the program governs all control flow, and the LLM is itself part of it, an adaptive component we call LLM-as-Code and invoke only where a task calls for reasoning or generation. Within each call the model keeps full flexibility, but it cannot alter the program's execution path. With control in the program, the LLM's context is built from the execution history's call tree and forms a directed acyclic graph (DAG). Each call's context length is then determined by its call depth rather than by accumulation over steps. A case study of computer-use agents shows that the design is practical, not just a theoretical stance, substantially improving the stability of long visual operation sequences.

URL PDF HTML ☆

赞 0 踩 0

2606.15994 2026-06-16 cs.AI cs.LG 新提交

Agentic Framework for Deep Learning workload migration via In-Context Learning

基于上下文学习的深度学习工作负载迁移智能体框架

Qiyue Liang, Steven Ingram, George Vanica, Andi Gavrilescu, Newfel Harrat, Hassan Sipra, Sethuraman Sankaran

发表机构 * Google（谷歌）

AI总结提出结合上下文学习与Oracle驱动的自调试的自主系统，实现从PyTorch到JAX的深度学习模型自动迁移，在神经模块上达到91%数值等价性。

详情

AI中文摘要

将深度学习模型从PyTorch灵活的面向对象设计迁移到JAX的函数式无状态设置通常是一项手动且易出错的任务。自动迁移具有挑战性，因为大型语言模型（LLM）难以处理严格且动态的API对齐，并且容易在精确操作上出错。我们提出了一个完全自主的系统，结合了上下文学习（ICL）与Oracle驱动的自调试。首先，我们整理了一个ICL上下文，作为惯用JAX样式和测试用例生成的严格参考。其次，不依赖LLM推导数学输出，而是运行源PyTorch模块以获取其实际的动态张量状态，从而创建一个不可变的执行Oracle。然后，我们使用自主智能体循环基于Oracle数据合成测试。测试用例被重复执行，并将回溯发送回LLM进行自我修正。消融实验表明，将ICL参考与Oracle基础及自调试相结合，大大优于纯指令和基本智能体基线。这种改进没有增加过多的计算开销。我们的轻量级流水线在神经模块上实现了91%的数值等价性（相比之下，基线为9%，指令+自调试为27%），为跨框架迁移提供了高度可靠、可扩展的蓝图。该方案已在多个最先进模型上得到验证，包括SAM（Segment Anything）、T5、Code Whisper等，显示出高数值等价性。代码：https://github.com/AI-Hypercomputer/accelerator-agents/tree/main/MaxCode

英文摘要

Translating deep learning models from PyTorch's flexible, object-oriented design to JAX's functional, stateless setup is usually a manual and error-prone task. Automated migration is challenging because Large Language Models (LLMs) struggle with strict and dynamic API alignment and are prone to mistakes for exacting operations. We propose a fully autonomous system that combines In-Context Learning (ICL) with oracle-driven self-debugging. First, we curated an ICL context that serves as a strict reference for idiomatic JAX styling and test case generation. Second, instead of depending on the LLM to deduce mathematical outputs, we run the source PyTorch modules to get their actual dynamic tensor states. This creates an unchangeable execution oracle. We then use an autonomous agentic loop to synthesize tests based on the oracle data. The test cases are executed repeatedly, and the traceback is sent back to the LLM for self-correction. Ablations show that combining ICL references with oracle grounding and self-debugging greatly outperforms pure instructional and basic agentic baselines. This improvement does not add an excessive computational overhead. Our lightweight pipeline achieves 91% numerical equivalence (compared to baseline: 9%, instruction + self-debugging: 27%) on neural modules, providing a highly reliable, scalable blueprint for cross-framework migration. This has been validated across several state-of-the-art models including SAM (segment anything), T5, Code Whisper amongst others showing high numerical equivalency. Code: https://github.com/AI-Hypercomputer/accelerator-agents/tree/main/MaxCode

URL PDF HTML ☆

赞 0 踩 0

2606.16149 2026-06-16 cs.AI 新提交

LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

LiteOdyssey: 一种用于可解释罕见病诊断的轻量级推理AI智能体

Minh-Ha Nguyen, Erica Gray, Chih-Ting Yang, Rizwan Hamid, Lingyao Li, Siyuan Ma, Thomas A. Cassini, Cathy Shyr

发表机构 * Vanderbilt University（范德堡大学）； Vanderbilt University Medical Center（范德堡大学医学中心）； University of South Florida（南佛罗里达大学）

AI总结提出轻量级框架LiteOdyssey，通过人类-AI协作的诊断策略和公共生物医学工具增强单个推理语言模型，在罕见病诊断基准上达到最先进性能，无需微调或多智能体集成。

Comments 21 pages,5 main figures, working version 1

详情

AI中文摘要

大多数医疗AI系统通过扩展额外机制来改进：更多的微调数据、更多的智能体和/或更大的检索数据库。然而，在罕见病诊断中，这种扩展可能导致系统难以部署、审计和维护。我们探究是否可以通过扩展单个AI智能体的推理链来实现最先进的诊断性能：通过人类-AI协作开发的诊断策略指导它，并利用可免费获取的生物医学工具进行增强。我们引入了LiteOdyssey，一个轻量级罕见病诊断框架，通过临床遗传学工作流引导推理语言模型。该框架通过人类反馈的策略迭代（PIHF）开发，并动态访问公共生物医学工具。在两个仅提供患者临床特征的高难度基准上，LiteOdyssey取得了最先进的性能，在LIRICAL（n=370）和PhenoPacket Store（n=873）的合并1243个病例中，总体疾病Recall@1达到59.3%。这两个基准中超高罕见病（患病率低于1/1,000,000）的比例很高，分别约为45%和52.8%。在更困难的PhenoPacket子集上（其中因果疾病未在我们的稀有性映射流程中映射到Orphanet），LiteOdyssey实现了60.7%的Recall@1，而相同基线模型（GPT-5.4）不使用工具时为10.7%。这一性能是在没有微调、多智能体集成或大型病例检索数据库的情况下实现的。在开发过程中未见过的病例、真实世界罕见病患者的私人队列以及较小的开源权重模型上也观察到了增益。LiteOdyssey为罕见病AI系统指明了一条路径，使其准确、易于部署且对医生审查更透明。

英文摘要

Most medical AI systems improve by scaling additional machinery: more fine-tuning data, more agents, and/or larger retrieval databases. In rare-disease diagnosis, however, such scaling can produce systems that are difficult to deploy, audit, and maintain. We asked whether state-of-the-art diagnostic performance could instead be achieved by extending the reasoning chain of a single AI agent: guiding it with a diagnostic policy, developed through human-AI collaboration and augmenting with freely available biomedical tools. We introduce LiteOdyssey, a lightweight rare-disease diagnostic framework that guides reasoning language model through a clinical genetics workflow. This framework was developed through Policy Iteration with Human Feedback (PIHF) and uses dynamic access to public biomedical tools. On two challenging benchmarks that provide only patient clinical features, LiteOdyssey achieved state-of-the-art performance, with an overall disease Recall@1 of 59.3% over the combined 1,243 cases of LIRICAL (n = 370) and the PhenoPacket Store (n = 873). Both benchmarks have a high proportion of ultra-rare disease (a prevalence below 1 in 1,000,000, with ultra-rare shares of approximately 45% and 52.8%, respectively). On the more difficult PhenoPacket subset, where causal diseases were not mapped to Orphanet in our rarity-mapping pipeline, LiteOdyssey achieved 60.7% Recall@1, compared with 10.7% for the same baseline model (GPT-5.4) without tools. This performance was achieved without fine-tuning, multi-agent ensembles, or a large case-retrieval database. Gains were also observed in the following: on cases never seen during development, on a private cohort of real-world rare disease patients, and on a smaller open-weights model. LiteOdyssey suggests a path toward rare-disease AI systems that are accurate, easier to deploy, and more transparent for physician review.

URL PDF HTML ☆

赞 0 踩 0

2606.16707 2026-06-16 cs.AI 新提交

GIST-CMTF：LLM代理中因果最小工具过滤的目标状态推断

Rahul Suresh Babu, Rohit Shukla

AI总结提出GIST-CMTF层，通过预测候选符号目标状态并估计歧义性，解决工具增强LLM代理因用户请求多义性导致的错误目标执行问题，在120个任务上达到97.0%成功率。

详情

AI中文摘要

工具增强的LLM代理依赖运行时过滤来决定每个步骤中哪些工具应可见。因果最小工具过滤（CMTF）通过仅暴露下一个因果必要的工具前沿来减少工具选择混淆，但它假设用户请求已映射到符号目标状态。实际上，诸如“处理我的预约”或“处理这封邮件”之类的请求可能对应多个可能的目标。这会导致错误目标执行，即代理为意外目标遵循有效的因果工具路径。我们引入GIST-CMTF，一个目标状态推断层，它预测在CMTF使用的相同状态转换词汇上的候选符号目标，估计歧义性，并要么应用CMTF，要么将澄清暴露为产生缺失目标或状态变量的因果动作。我们在七个模型后端、六种过滤方法和120个受控工具使用任务上评估GIST-CMTF。GIST-CMTF实现了97.0%的任务成功率，而top-goal CMTF为80.1%，semantic-goal CMTF为82.9%。它将错误目标执行从top-goal CMTF下的19.4%降低到2.5%，同时保留了因果过滤的单工具暴露，并且使用的令牌数远少于全工具暴露。这些结果表明，可靠的工具增强代理在暴露外部动作之前应验证目标状态，而不仅仅是工具相关性。

英文摘要

Tool-augmented LLM agents rely on runtime filtering to decide which tools should be visible at each step. Causal Minimal Tool Filtering (CMTF) reduces tool-choice confusion by exposing only the next causally necessary tool frontier, but it assumes that the user request has already been mapped to a symbolic goal state. In practice, requests such as "handle my appointment" or "take care of this email" may correspond to multiple possible goals. This creates wrong-goal execution, where an agent follows a valid causal tool path for an unintended objective. We introduce GIST-CMTF, a goal-state inference layer that predicts candidate symbolic goals over the same state-transition vocabulary used by CMTF, estimates ambiguity, and either applies CMTF or exposes clarification as a causal action that produces missing goal or state variables. We evaluate GIST-CMTF across seven model backends, six filtering methods, and 120 controlled tool-use tasks. GIST-CMTF achieves 97.0% task success, compared with 80.1% for top-goal CMTF and 82.9% for semantic-goal CMTF. It reduces wrong-goal execution from 19.4% under top-goal CMTF to 2.5%, while preserving the one-tool exposure of causal filtering and using substantially fewer tokens than all-tools exposure. These results suggest that reliable tool-augmented agents should validate goal state, not only tool relevance, before exposing external actions.

URL PDF HTML ☆

赞 0 踩 0

2606.16987 2026-06-16 cs.AI 新提交

Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification

基于共识的智能体大语言模型框架用于协调制度海关编码分类

Truong Thanh Hung Nguyen, Khanh Van Quynh Nguyen, Hoang-Loc Cao, Tri Duong, Phuc Ho, Van Pham, Loc Nguyen, Hung Cao

发表机构 * Analytics Everywhere Lab, University of New Brunswick（新不伦瑞克大学无处不在分析实验室）； University of Economics Ho Chi Minh City（胡志明市经济大学）

AI总结提出一种多智能体LLM框架，通过信息检索、语义检索、证据推理、共识验证和分层投票等方法，解决加拿大10位HTS编码分类难题，在3300条数据上验证了证据驱动和人工参与的必要性。

Comments Accepted at the 3rd International Conference of Resilience by Technology and Design (RTD 2026)

详情

AI中文摘要

准确的协调制度（HTS）编码分类对于海运物流中的清关、关税评估、贸易统计和法规合规至关重要。然而，精确的HTS分类仍然具有挑战性，因为产品描述通常简短、不完整或模糊，而正确的分类依赖于层级关税结构、法律注释和特定司法管辖区的规则。本文提出了一种智能体大语言模型（LLM）框架，用于智慧港口和海运物流环境中的加拿大10位HTS编码分类。该框架集成了多智能体信息检索、官方关税文件的语义检索、基于证据的推理、基于共识的验证、跨层级编码组件的逐元素投票、置信度估计以及人工介入升级。我们在一个包含3300条领域专家标注的产品记录（来自物流和配送场景）的私有数据集上评估了该框架。实验结果表明，即使对于先进的LLM，精确的10位分类仍然困难，性能从粗略的章节级预测下降到细粒度的关税和统计后缀分配。这些发现表明，需要基于证据、不确定性感知和以人为中心的分类工作流程，而不是完全自主的单步预测。所提出的框架支持更可解释、可问责和合规导向的HTS分类，适用于海运物流和智慧港口操作。我们的代码可在https://github.com/Analytics-Everywhere-Lab/hts获取。

英文摘要

Accurate Harmonized Tariff Schedule (HTS) code classification is essential for customs clearance, duty assessment, trade statistics, and regulatory compliance in maritime logistics. However, exact HTS classification remains challenging because product descriptions are often short, incomplete, or ambiguous, while correct classification depends on hierarchical tariff structures, legal notes, and jurisdiction-specific rules. This paper proposes an agentic large language model (LLM) framework for Canadian 10-digit HTS code classification in smart-port and maritime logistics environments. The framework integrates multi-agent information retrieval, semantic retrieval over official tariff documents, evidence-grounded reasoning, consensus-based validation, element-wise voting across hierarchical code components, confidence estimation, and human-in-the-loop escalation. We evaluate the framework on a private dataset of 3,300 domain-expert-labeled product records collected from logistics and delivery contexts. Experimental results show that exact 10-digit classification remains difficult even for advanced LLMs, with performance decreasing from coarse chapter-level prediction to fine-grained tariff and statistical suffix assignment. These findings demonstrate the need for evidence-grounded, uncertainty-aware, and human-centered classification workflows rather than fully autonomous single-step prediction. The proposed framework supports more interpretable, accountable, and compliance-oriented HTS classification for maritime logistics and smart-port operations. Our code is available at https://github.com/Analytics-Everywhere-Lab/hts.

URL PDF HTML ☆

赞 0 踩 0

2606.16995 2026-06-16 cs.AI cs.LG 新提交

When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning

存疑则计划：用于反应式强化学习的小型语言模型承诺式推理

Nathan Gavenski, Juarez Monteiro, Francisco Galuppo, Adriano Veloso, Odinaldo Rodrigues

AI总结提出PACT混合架构，结合快速反应式强化学习策略与慢速小型语言模型规划器，通过异步生成和验证候选动作计划来提升策略在陌生环境中的表现。

Comments LM4Plan Workshop at ICML 2026

2606.14778 2026-06-16 cs.CV cs.AI 交叉投稿

FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

FactCheck: 基于多智能体协作的可行性感知长期动作预测

Rui Cao, Jiannong Cao, Bo Yuan, Zhiyuan Wen, Mingjin Zhang

发表机构 * The Hong Kong Polytechnic University（香港理工大学）； China Mobile（中国移动）

AI总结提出FactCheck多智能体框架，通过闭环“观察-规划-验证”机制，结合历史动作图验证可行性，在EPIC-Kitchens-55和EGTEA Gaze+上超越现有方法。

详情

AI中文摘要

长期动作预测（LTA）旨在从部分观察的视频中预测未来动词-名词动作的有序序列。虽然该任务是具身智能的基础，但预测物理上可行的长期动作仍然是一个关键挑战。现有方法以开环方式运行，常常幻觉出不存在物体、违反物体可供性或不考虑物体状态，因为它们缺乏明确的机制来验证动作相对于物理环境的可行性。为解决此问题，我们提出FactCheck，一种新颖的多智能体协作框架，通过闭环“观察-规划-验证”机制提高可行性。FactCheck将复杂的LTA任务分解为专门角色：观察者从视频观察中识别历史动作并构建双形式结构化记忆，包括捕捉高层人类意图和环境状态的历史动作摘要，以及编码物体状态和时间依赖性的历史动作图；规划者基于低层历史动作和高层历史动作摘要生成未来动作草案；验证者严格根据历史动作图验证草案并修正不可行动作。在EPIC-Kitchens-55和EGTEA Gaze+基准上的大量实验表明，FactCheck始终优于最先进方法。我们的工作为可行性感知的长期动作预测建立了新范式，有效闭环了动作识别、动作预测和动作验证。

英文摘要

Long-term action anticipation (LTA) aims to predict an ordered sequence of future verb-noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasible long-term actions remains a critical challenge. Existing methods, which operate in an open-loop manner, often hallucinate non-existent objects, violate object affordances, or disregard object states, as they lack explicit mechanisms to verify action feasibility against the physical environment. To address this, we propose FactCheck, a novel multi-agent collaboration framework that improves feasibility through a closed-loop "Observe-Plan-Verify" mechanism. FactCheck decomposes the complex LTA task into specialized roles: an Observer that recognizes historical actions from video observations and constructs a dual-form structured memory, comprising a History Action Abstract that captures high-level human intentions and environmental status, and a History Action Graph that encodes object states and temporal dependencies; a Planner that generates draft future actions conditioned on both low-level historical actions and high-level History Action Abstract; and a Verifier that rigorously validates the draft against the History Action Graph and refines infeasible actions. Extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ benchmarks demonstrate that FactCheck consistently outperforms state-of-the-art methods. Our work establishes a new paradigm for feasibility-aware long-term action anticipation, effectively closing the loop of action recognition, action prediction and action verification.

URL PDF HTML ☆

赞 0 踩 0

2606.14801 2026-06-16 cs.LG cs.AI cs.RO 交叉投稿

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

QPILOTS：面向流策略的高效测试时Q引导

Yifan Ruan, Chenyang Cao, Andreas Burger, Ali Pesaranghader, Kaveh Kamali, Jaehong Kim, Nandita Vijaykumar, Alan Aspuru-Guzik, Igor Gilitschenski, Nicholas Rhinehart

发表机构 * University of Toronto（多伦多大学）； Vector Institute（向量研究所）； LG Electronics（LG电子）

AI总结提出QPILOTS方法，在推理时通过投影去噪中间状态到最终动作估计并计算评论家梯度来引导流匹配和扩散策略，无需修改原策略，在离线到在线RL基准上达到90%平均成功率。

Comments 10 pages, 7 figures

详情

AI中文摘要

基于课程回合级指导的在线策略蒸馏用于多轮智能体

Gengsheng Li, Mao Zheng, Mingyang Song, Ruiqi Liu, Tianyu Yang, Jie Sun, Qiyong Zhong, Haiyun Guo, Junfeng Fang, Dan Zhang, Jinqiao Wang

发表机构 * Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所基础模型研究中心）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； Large Language Model Department, Tencent（腾讯大语言模型部）； University of Science and Technology of China（中国科学技术大学）； Zhejiang University（浙江大学）； National University of Singapore（新加坡国立大学）； Wuhan AI Research（武汉人工智能研究院）

AI总结针对多轮智能体在线策略蒸馏中错误累积导致教师监督失效的问题，提出混合教师和学生生成回合的Guided-OPD算法，通过课程式衰减教师干预概率，在ALFWorld等任务上平均提升21.1%得分和25.5%成功率。

详情

AI中文摘要

能够规划、调用工具并与环境交互的多轮智能体为解决复杂任务提供了一种有前景的范式，但其能力通常依赖于非常大的模型，这些模型的推理成本在实践中令人望而却步。在线策略蒸馏（OPD）是将这种能力迁移到较小学生模型的一种自然方法，但我们发现它在这种设置下存在一种特征性失败模式：小的学生错误在回合间累积，将轨迹推离教师熟悉的状态分布，因此教师的监督在最需要的地方变得最不可靠。我们提出了引导式在线策略蒸馏（Guided-OPD），一种简单而有效的算法，它在每个轨迹中混合教师和学生生成的回合，并按照衰减到零的课程安排教师的干预概率。强引导使早期轨迹接近教师分布，然后逐渐撤除以恢复推理时使用的纯在线策略。在ALFWorld、ScienceWorld和WebShop上，从Qwen3-30B-A3B教师蒸馏Qwen3学生，Guided-OPD相比普通OPD平均提高21.1%得分和25.5%成功率，在较小的学生上收益更大。

英文摘要

Multi-turn agents that plan, invoke tools, and interact with environments offer a promising paradigm for solving complex tasks, yet their capabilities typically rely on very large models whose inference cost is prohibitive in practice.On-Policy Distillation (OPD) is a natural recipe for transferring such capabilities to smaller students, but we find that it suffers a characteristic failure mode in this setting: small student errors compound across turns and push the trajectory out of the teacher's familiar state distribution, so the teacher's supervision becomes least reliable precisely where the student needs it most.We propose Guided On-Policy Distillation (Guided-OPD), a simple yet effective algorithm that mixes teacher- and student-generated turns within each rollout and schedules the teacher's intervention probability along a curriculum that decays to zero.Strong guidance keeps early trajectories close to the teacher distribution and is then gradually withdrawn to recover the purely on-policy regime used at inference.On ALFWorld, ScienceWorld, and WebShop, distilling Qwen3 students from a Qwen3-30B-A3B teacher, Guided-OPD improves Score by 21.1\% and Success Rate by 25.5\% over vanilla OPD on average, with larger gains on smaller students.

URL PDF HTML ☆

赞 0 踩 0

2606.16014 2026-06-16 cs.HC cs.AI cs.MA 交叉投稿

Orchestrated Reality: From Role-Play to Living, Playable Game Worlds -- LLM-Driven World Simulation as a Parameterized-Action POMDP

编排现实：从角色扮演到活生生的、可玩的游戏世界——作为参数化动作POMDP的LLM驱动世界模拟

Yuhang Huang, Chenmiao Li, Chaowei Fang

发表机构 * The University of Tokyo（东京大学）； Individual Researcher（个人研究员）

AI总结提出编排现实框架，将LLM驱动的游戏世界形式化为参数化动作POMDP，通过单例编排代理维护规范JSON状态，实现数值状态、叙事声音和规则逻辑的统一协调。

Comments 9 pages, 2 figures. Work in progress. Yuhang Huang and Chenmiao Li contributed equall

详情

AI中文摘要

许多游戏依赖于讲故事与跟踪等级、NPC行为和后果模拟的系统相结合；将紧密编写的叙事与深度模拟的世界桥接起来——在沙盒和开放世界环境中最为突出——一直成本高昂。LLM驱动的世界开辟了一条新路径：一个单一框架可以协调数值状态、叙事声音、故事节奏和规则逻辑。实现这一点需要LLM系统维持一个持久的世界（谁在哪里、刚刚发生了什么、当前什么是真实的），而当今部署的系统无法做到：叙事声音以自由散文形式断言状态，没有任何经过验证的表示，因此完全自主的游戏引擎仍然不可行。我们将此视为一种架构选择，而非语言模型的限制，并报告一个正在进行的框架——编排现实——的工作进展，该框架将世界视为一个由单例编排代理拥有的规范对象，类似于桌面角色扮演游戏中的游戏主持人（GM）。我们将面向人类玩家的LLM驱动的游戏世界形式化为一个参数化动作POMDP：状态是一个规范JSON实体的树，动作分解为$a=(k, x_k)$（离散意图类型加上结构化JSON参数），代理仅观察状态的一个叙事投影$o=O(s)$，转移核$F$是一个LLM驱动的计划-差异-验证-应用（PDVA）流水线，该流水线提交经过模式验证和内容哈希的JSON差异。我们给出了形式模型、一个JSON状态示例、一个单轮示例，以及来自实际部署的15个说明性事件目录，展示了该框架的实际应用。通过计划的人类玩家研究进行的实证验证——以及多NPC并发代理和作为RL环境的部署——被定位为未来工作。

英文摘要

Many games rely on storytelling combined with systems that track levelling, NPC behaviour, and consequence simulation; bridging tightly-authored narrative with deeply-simulated worlds -- most acute in sandbox and open-world settings -- has been prohibitively expensive. LLM-driven worlds open a new path: a single harness can coordinate numerical state, narrative voice, storytelling pacing, and rule logic together. Realising this requires the LLM system to sustain a persistent world (who is where, what has just happened, what is currently true), which today's deployed systems do not: the narrative voice asserts state in free prose without any validated representation, so a fully autonomous game engine remains infeasible. We treat this as an architectural choice, not a limitation of language models, and report work in progress on a framework -- orchestrated reality -- that makes the world a canonical object owned by a singleton orchestration agent analogous to the tabletop-RPG Game Master (GM). We formalise an LLM-driven game world for a human player as a Parameterized-Action POMDP: state is a tree of canonical JSON entities, actions decompose as $a=(k, x_k)$ (a discrete intent kind plus structured JSON parameters), the agent observes only a narrative projection $o=O(s)$ of state, and the transition kernel $F$ is an LLM-driven Plan-Diff-Validate-Apply (PDVA) pipeline that commits schema-validated, content-hashed JSON deltas. We give the formal model, a JSON-state example, a worked single-turn example, and a catalogue of 15 illustrative incidents drawn from a real deployment showing the framework in action. Empirical validation through a planned human player study -- together with multi-NPC concurrent agency and deployment as an RL environment -- is situated as future work.

URL PDF HTML ☆

赞 0 踩 0

2606.16215 2026-06-16 cs.CL cs.AI cs.LG 交叉投稿

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

PACT: 多轮工具使用智能体的特权轨迹协同训练

Zhenbang Du, Jun Luo, Zhiwei Zheng, Xiangchi Yuan, Kejing Xia, Dachuan Shi, Qirui Jin, Qijia He, Shaofeng Zou, Yingbin Liang, Wenke Lee

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Ohio State University（俄亥俄州立大学）； University of Pennsylvania（宾夕法尼亚大学）； Arizona State University（亚利桑那州立大学）

AI总结提出PACT框架，通过特权轨迹（专家轨迹）在训练时提供密集监督信号，结合轨迹条件RL和组件感知SFT损失，避免推理时依赖轨迹，显著提升多轮工具使用智能体的性能。

Comments Project page: https://zhenbangdu.github.io/pact-project-page/

详情

AI中文摘要

多轮工具使用智能体必须在多个交互轮次中进行推理、调用工具并适应观察结果。对此类智能体进行后训练具有挑战性，因为强化学习通常面临稀疏奖励和弱信用分配问题（尽管匹配仅提示推理设置），而基于专家轨迹的监督微调提供密集过程监督，但可能过度约束模型到固定轨迹。为解决这一问题，我们提出PACT，一种用于多轮工具使用智能体的特权轨迹协同训练框架。关键思想是仅将专家轨迹作为训练时的优化信号，而非推理时的提示。PACT保持推理生成仅基于提示，然后通过两个互补信号利用专家轨迹指导优化：一个轨迹条件RL代理，在专家轨迹上下文中评估仅提示轨迹；一个组件感知SFT损失，以退火强度监督推理前缀和工具调用。为减少对训练时轨迹上下文的过度依赖，PACT进一步引入仅提示锚定。我们还提供了一个潜在轨迹视角，连接两个基于轨迹的目标，并解释专家轨迹如何在推理生成中不被使用的情况下指导优化。在FTRL、BFCL和ToolHop上的实验表明，PACT持续优于强SFT和RL基线，凸显了特权轨迹协同训练在多轮工具使用学习中的价值。

英文摘要

Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only inference setting, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool-calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces a prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.

URL PDF HTML ☆

赞 0 踩 0

2606.16316 2026-06-16 cs.IR cs.AI cs.LG 交叉投稿

RL-Index: Reinforcement Learning for Retrieval Index Reasoning

RL-Index：用于检索索引推理的强化学习

Yongjia Lei, Nedim Lipka, Zhisheng Qi, Utkarsh Sahu, Koustava Goswami, Franck Dernoncourt, Ryan A. Rossi, Yu Wang

发表机构 * University of Oregon（俄勒冈大学）； Adobe Research（Adobe研究）

AI总结提出RL-Index框架，将检索索引推理转化为强化学习问题，通过LLM生成理由增强文档，使用GRPO优化，提升检索和问答性能并降低在线延迟。

详情

AI中文摘要

检索外部知识对于解决现实世界任务至关重要，但当查询与其相关知识之间的关系涉及超越表面语义或词汇匹配的隐式和复杂推理时（例如，依赖同一定理的数学问题或需要深度推理的编码），仍然具有挑战性。现有方法主要依赖查询端推理（例如，查询重写），这引入了显著的在线延迟，并且未能充分利用对知识语料库本身进行推理的机会（即索引端推理）。在本文中，我们提出了RL-Index，一个智能索引框架，将检索索引推理形式化为强化学习问题。RL-Index不是在进行查询时执行推理，而是通过用LLM生成的理由增强文档，将推理转移到索引阶段，这些理由显式编码了潜在的查询-知识关系。为了优化这些理由的质量，我们采用了组相对策略优化（GRPO），并使用检索相似性作为可验证的奖励信号，从而能够直接优化索引决策以提高检索效果。在BRIGHT基准上的大量实验表明，RL-Index持续提高了检索和下游问答性能，同时显著降低了在线推理延迟。此外，学到的理由增强跨不同的检索器和生成器具有泛化能力，突显了其作为即插即用索引策略在不同检索系统中的鲁棒性。

英文摘要

Retrieving external knowledge is essential for solving real-world tasks, yet it remains challenging when the relationship between a query and its relevant knowledge involves implicit and complex reasoning beyond surface-level semantic or lexical matching (e.g., mathematical problems relying on the same theorem or coding requiring deep reasoning). Existing approaches primarily rely on query-side reasoning (e.g., query rewriting), which introduces significant online latency and underutilizes the opportunity to perform reasoning over the knowledge corpus itself (i.e., index-side reasoning). In this paper, we propose RL-Index, an agentic indexing framework that formulates retrieval index reasoning as a reinforcement learning problem. Instead of performing reasoning at query time, RL-Index shifts reasoning to the indexing stage by augmenting documents with LLM-generated rationales that explicitly encode the latent query-knowledge relationship. To optimize the quality of these rationales, we employ Group Relative Policy Optimization (GRPO) and use retrieval similarity as a verifiable reward signal, enabling direct optimization of indexing decisions for retrieval effectiveness. Extensive experiments on the BRIGHT benchmark demonstrate that RL-Index consistently improves both retrieval and downstream question-answering performance, while significantly reducing online inference latency. Moreover, the learned rationale augmentation generalizes across diverse retrievers and generators, highlighting its robustness as a plug-and-play indexing strategy across different retrieval systems.

URL PDF HTML ☆

赞 0 踩 0

2606.16432 2026-06-16 cs.CL cs.AI 交叉投稿

ACCORD: Action-Conditioned Contextual Grounding for Language Agents

ACCORD: 面向语言智能体的动作条件上下文接地

Lai Jiang, Cheng Qian, Zhenhailong Wang, Pan Lu, Heng Ji, Hao Peng

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Stanford University（斯坦福大学）

AI总结针对用户指令常因隐含环境假设而欠指定，导致LLM智能体执行失败的问题，提出ACCORD框架，在每次动作前主动探测缺失信息并整合轨迹上下文，无需额外训练，在AppWorld和AlfWorld上显著提升任务完成率。

详情

AI中文摘要

用户指令往往因人类对周围环境的隐含假设而欠指定。对于在信息丰富的数字和物理环境中运行的大型语言模型（LLM）智能体，这些假设无法仅从指令中推断；必须从工具、数据、接口和观察的当前状态中恢复。因此，有效执行要求智能体识别缺失的上下文，将其基于观察到的证据，并带入后续动作。我们表明，当前智能体常常未能做到这一点。它们基于假设而非观察到的细节行动，忽略本可收集的信息，并且未能整合已经返回的证据。基于这一洞察，我们提出ACCORD（动作条件上下文接地），一种简单有效的自适应接地智能体框架。在每次动作前，ACCORD主动探测环境中缺失的信息，并整合来自智能体轨迹中原本会被忽略的相关上下文。无需额外训练或任务成功信号，ACCORD在AppWorld上将任务目标完成率从42.0%提升至62.6%（GPT-5-mini），比强基线高出最多20.6个百分点。这些增益在更强的基模型（Claude-4.5-sonnet上+10.8）、开放权重模型（Qwen3.5-27B-FP8上+10.1）以及具身AlfWorld基准（GPT-5-mini上成功率+7.4）上持续存在。

英文摘要

User instructions are often underspecified because humans rely on implicit assumptions about the surrounding environment. For large language model (LLM) agents operating in information-rich digital and physical environments, these assumptions cannot be inferred from the instruction alone; they must be recovered from the current state of tools, data, interfaces, and observations. Effective execution therefore requires agents to identify missing context, ground it in observed evidence, and carry it forward into subsequent actions. We show that current agents often fail to do so. They act from assumed rather than observed specifics, overlook information they could have gathered, and fail to incorporate evidence that has already been returned. Building on this insight, we propose ACCORD (Action-Conditioned Contextual Grounding), a simple and effective agent framework for adaptive grounding. Before each action, ACCORD actively probes the environment for missing information and integrates relevant context from the agent's trajectory that would otherwise be overlooked. Requiring no additional training or task-success signals, ACCORD improves task-goal completion on AppWorld by up to +20.6 points with GPT-5-mini, from 42.0% to 62.6%, compared to strong baselines. These gains persist with a substantially stronger base model (+10.8 with Claude-4.5-sonnet), an open-weight model (+10.1 with Qwen3.5-27B-FP8), and on the embodied AlfWorld benchmark (+7.4 success rate with GPT-5-mini).

URL PDF HTML ☆

赞 0 踩 0

2606.16515 2026-06-16 cs.LG cs.AI cs.RO 交叉投稿

PISA：一种增强AI能动性的实用心理学启发统一记忆系统

Shian Jia, Ziyang Huang, Xinbo Wang, Haofei Zhang, Mingli Song

发表机构 * Zhejiang University（浙江大学）

AI总结受皮亚杰认知发展理论启发，提出PISA统一记忆系统，通过三模态适应机制（模式更新、演化、创建）和混合记忆访问架构，显著提升AI代理的适应性和长期知识保留。

详情

AI中文摘要

记忆系统对AI代理至关重要，但现有工作往往缺乏对多样化任务的适应性，并忽视了AI代理记忆的建设性和任务导向作用。借鉴皮亚杰的认知发展理论，我们提出PISA，一个实用的、受心理学启发的统一记忆系统，通过将记忆视为建设性和自适应过程来解决这些局限性。为了实现持续学习和适应性，PISA引入了三模态适应机制（即模式更新、模式演化和模式创建），在保持连贯组织的同时支持灵活的记忆更新。基于这些模式基础结构，我们进一步设计了一种混合记忆访问架构，将符号推理与神经检索无缝集成，显著提高了检索准确性和效率。我们在现有LOCOMO基准和我们新提出的用于数据分析任务的AggQA基准上进行的实证评估证实，PISA通过显著增强适应性和长期知识保留，树立了新的最先进水平。

英文摘要

Memory systems are fundamental to AI agents, yet existing work often lacks adaptability to diverse tasks and overlooks the constructive and task-oriented role of AI agent memory. Drawing from Piaget's theory of cognitive development, we propose PISA, a pragmatic, psych-inspired unified memory system that addresses these limitations by treating memory as a constructive and adaptive process. To enable continuous learning and adaptability, PISA introduces a trimodal adaptation mechanism (i.e., schema updation, schema evolution, and schema creation) that preserves coherent organization while supporting flexible memory updates. Building on these schema-grounded structures, we further design a hybrid memory access architecture that seamlessly integrates symbolic reasoning with neural retrieval, significantly improving retrieval accuracy and efficiency. Our empirical evaluation, conducted on the existing LOCOMO benchmark and our newly proposed AggQA benchmark for data analysis tasks, confirms that PISA sets a new state-of-the-art by significantly enhancing adaptability and long-term knowledge retention.

URL PDF HTML ☆

赞 0 踩 0

2602.07883 2026-06-16 cs.AI 版本更新

ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation

ToolSelf: 通过工具驱动的涌现适应统一任务执行与自我重构

Jingqi Zhou, Sheng Wang, Dezhao Deng, Junwen Lu, Junwei Su, Qintong Li, Jiahui Gao, Hao Wu, Jiyue Jiang, Lingpeng Kong, Dunhong Jin, Chuan Wu

发表机构 * The University of Hong Kong（香港大学）； The Chinese University of Hong Kong（香港中文大学）

AI总结提出ToolSelf框架，将配置更新抽象为标准化工具接口，统一任务执行与自我重构，并采用配置感知两阶段训练（CAT）实现涌现适应性，在多种基准测试中平均超越静态配置基线28.8分。

详情

AI中文摘要

基于LLM的智能体系统在复杂长时任务中表现出色，但仍受限于执行前固定的静态配置。这种刚性导致领域特定性能与跨任务泛化之间的权衡：强先验和紧凑工具空间有助于专业化但削弱迁移，而任务无关的工作流和广泛动作空间扩展覆盖但稀释指导。现有的执行前优化、规划者-工作者编排和配置修补未能解决这一矛盾，因为它们将适应与执行解耦，导致信息丢失、优化碎片化和信用分配模糊。我们提出ToolSelf，一种工具驱动的运行时自我重构范式，将配置更新抽象为标准化工具接口，并在一个策略的动作空间内统一执行和适应。执行代理可以根据任务进度和反馈动态更新子目标、策略、工具箱、上下文和上下文管理模式。我们进一步引入配置感知两阶段训练（CAT），结合拒绝采样微调和轨迹级KTO强化学习来内化自我重构。在多种基准测试中，零样本ToolSelf与任务专用代理相媲美；经过CAT训练后，ToolSelf平均比静态配置基线高出28.8分，为消除手动注入指导的涌现适应性开辟了道路。

英文摘要

LLM-powered agentic systems excel at complex long-horizon tasks, but remain constrained by static configurations fixed before execution. Such rigidity forces a trade-off between domain-specific performance and cross-task generalization: strong priors and compact tool spaces aid specialization but weaken transfer, while task-agnostic workflows and broad action spaces expand coverage but dilute guidance. Existing pre-execution optimization, planner-worker orchestration, and configuration patching fall short of resolving this tension, as they decouple adaptation from execution, causing information loss, fragmented optimization, and ambiguous credit assignment. We propose ToolSelf, a tool-driven runtime self-reconfiguration paradigm that abstracts configuration updates as a standardized tool interface and unifies execution and adaptation within one policy's action space. The execution agent can dynamically update sub-goals, strategies, toolboxes, context, and context-management modes based on task progress and feedback. We further introduce Configuration-Aware Two-stage Training (CAT), which combines rejection sampling fine-tuning with trajectory-level KTO reinforcement learning to internalize self-reconfiguration. Across diverse benchmarks, zero-shot ToolSelf rivals task-specialized agents; after CAT training, ToolSelf gains 28.8 points over the static-configuration baseline on average, illuminating a path toward emergent adaptivity that obviates manually injected guidance. The code is available at https://github.com/lian-tian-mo-zun/ToolSelf.

URL PDF HTML ☆

赞 0 踩 0

2603.00680 2026-06-16 cs.AI 版本更新

MemPO: Self-Memory Policy Optimization for Long-Horizon Agents

MemPO：面向长时程智能体的自我记忆策略优化

Ruoran Li, Xinghua Zhang, Haiyang Yu, Shitong Duan, Xiang Li, Wenxin Xiang, Chonghua Liao, Xudong Guo, Yongbin Li, Jinli Suo

发表机构 * Tsinghua University（清华大学）； Tongyi Lab, Alibaba Group（阿里巴巴集团通义实验室）

AI总结提出自我记忆策略优化算法(MemPO)，让智能体自主管理记忆，通过基于记忆有效性的信用分配机制选择性保留关键信息，在减少令牌消耗的同时提升任务性能。

详情

AI中文摘要

长时程智能体在与环境交互过程中面临上下文规模不断增长的挑战，这降低了性能和稳定性。现有方法通常引入外部记忆模块并从存储的记忆中查找相关信息，但无法让模型自身主动管理记忆内容并与智能体的总体任务目标对齐。为解决这些限制，我们提出了自我记忆策略优化算法（MemPO），使智能体（策略模型）能够在与环境交互时自主总结和管理其记忆。通过改进基于记忆有效性的信用分配机制，策略模型可以选择性地保留关键信息，在保持任务性能的同时显著减少令牌消耗。大量实验和分析证实，MemPO 在基础模型上实现了 25.98 的绝对 F1 分数提升，比之前的最先进基线高出 7.1，同时令牌使用量分别减少了 67.58% 和 73.12%。代码已在此 https URL 发布。

英文摘要

Long-horizon agents face the challenge of growing context size during interaction with environment, which degrades the performance and stability. Existing methods typically introduce the external memory module and look up the relevant information from the stored memory, which prevents the model itself from proactively managing its memory content and aligning with the agent's overarching task objectives. To address these limitations, we propose the self-memory policy optimization algorithm (MemPO), which enables the agent (policy model) to autonomously summarize and manage their memory during interaction with environment. By improving the credit assignment mechanism based on memory effectiveness, the policy model can selectively retain crucial information, significantly reducing token consumption while preserving task performance. Extensive experiments and analyses confirm that MemPO achieves absolute F1 score gains of 25.98 over the base model and 7.1 over the previous SOTA baseline, while reducing token usage by 67.58% and 73.12%. The code is released at https://github.com/TheNewBeeKing/MemPO.

URL PDF HTML ☆

赞 0 踩 0

2605.29796 2026-06-16 cs.AI cs.CL cs.LG 版本更新

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

SAAS：面向智能体搜索中过度搜索缓解的自我感知强化学习

Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su

发表机构 * School of Informatics, Xiamen University（厦门大学信息学院）； School of Artificial Intelligence, Jilin University（吉林大学人工智能学院）

AI总结提出SAAS强化学习框架，通过搜索边界建模、边界感知奖励和分阶段优化策略，使LLM智能体具备动态自我感知能力，在不降低准确率的前提下显著减少过度搜索。

详情

AI中文摘要

智能体搜索使LLM能够通过迭代推理和外部搜索解决复杂的多跳问题。尽管有效，但这些系统在实践中常受限于一个关键缺陷：智能体无法识别自身知识边界，在内部知识足够时盲目触发搜索，甚至在已收集足够证据时未能终止搜索。缺乏自我感知导致严重的 extbf{过度搜索}，带来大量推理延迟和过高的计算成本。为此，我们提出SAAS，一种新颖的强化学习框架，旨在培养动态自我感知能力，精确调节搜索行为而不损害准确性。SAAS引入三个关键组件：(i) 搜索边界建模机制，通过对比禁用搜索和启用搜索的轨迹，识别策略演化下的搜索边界；(ii) 边界感知奖励模块，将这种边界意识转化为轨迹级惩罚，抑制不必要和冗余的搜索；(iii) 分阶段优化策略，利用顺序课程优先考虑推理而非搜索正则化，从而避免奖励黑客。大量实验表明，SAAS在保持准确性的同时大幅减少了过度搜索。我们的代码和实现细节已在https://github.com/XMUDeepLIT/SAAS发布。

英文摘要

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self-awareness leads to severe \textbf{over-search}, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search, while maintaining accuracy. Our code and implementation details are released at https://github.com/XMUDeepLIT/SAAS.

URL PDF HTML ☆

赞 0 踩 0

2606.08151 2026-06-16 cs.AI 版本更新

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

决策感知记忆卡：用于工具使用LLM智能体的反事实启发式上下文选择与压缩

Xinyu Guan, Qianyang Zhao, Yuming Deng

发表机构 * Alibaba Group（阿里巴巴集团）

AI总结提出CICL决策感知上下文层，通过构建上下文图、评分单元效用并打包为记忆卡，提升工具使用LLM智能体在行动时的证据选择与压缩能力，在SWE-bench验证集上实现检索命中率提升。

Comments 15 pages, 2 figures, 8 tables. Code is available at https://github.com/stephen-guan-researcher/CICL; Qwen-QLoRA adapter is available at https://huggingface.co/XinyuGuan/CICL

详情

AI中文摘要

使用工具的LLM智能体失败的原因往往不是缺少相关文本，而是在行动时未能选择、压缩或呈现决定性证据。我们提出CICL，一个决策感知上下文层，它将实例证据转化为上下文图，通过共享的八字段模式路由确定性、Opus辅助、Qwen、Codex/GPT-5.5和Qwen-QLoRA判断，根据行动偏移、结果提升、必要性和负迁移风险对单元评分，并将高效用证据打包为类型化记忆卡供预算有限的智能体使用。该设计将测量到的决策信号与判断模型分离，使得前沿标注、局部代理和轻量级排序器可以在一个可审计协议下进行比较。实验上，CICL在公开基准测试中取得了具体提升，同时暴露了其局限性。在50个SWE-bench Verified文件检索实例上，直接使用Qwen3.6-plus对BM25前50候选进行重排序，将hit@1从0.58提升至0.78，MRR@10从0.634提升至0.790，且所有2500个判断均可解析。受控诊断显示了行动关键性：在预算120时，CICL在v1上达到F1 0.620，在v3上达到F1 0.425，而移除最高效用的语义v3单元导致F1降至0.000。补充检查包括Qwen-QLoRA在710个候选上的一致性、一个小的200标签真实代码Opus辅助信号，以及一个三实例补丁烟雾测试验证检索到补丁的流程，但不声称官方SWE-bench成功。RepoBench-R摘要仍优于记忆卡，紧凑型排序器尚未取代启发式方法。CICL贡献了一个可复现的测量和选择层，用于决策关键上下文，而非端到端编码智能体修复声明。

英文摘要

Modern large language model (LLM) agents do not simply need longer contexts; they need decision-relevant evidence at the moment of action. We study decision-aware context selection: ranking retrieved files, tests, traces, rules, and memories by their expected effect on an agent's next action rather than by semantic similarity alone. We present the Counterfactual-Inspired Context Layer (CICL), which builds an instance context graph, estimates decision-oriented utility for candidate units, and compresses selected evidence into typed memory cards. The same schema can be instantiated with hosted LLM judges, local surrogates, or lightweight rankers, making the selection protocol auditable across model choices. On 50 SWE-bench Verified file-retrieval instances, Qwen3.6-Plus reranking of BM25 top-50 candidates improves hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show that CICL identifies action-critical evidence: removing the top-utility semantic unit reduces F1 from 0.245 to 0.000. In selected-then-compressed mode, memory cards save 44.93 tokens per query while preserving selected evidence. CICL provides a practical layer for measuring, ranking, and compressing decision-critical context for tool-using agents. Code is available at https://github.com/stephen-guan-researcher/CICL.

URL PDF HTML ☆

赞 0 踩 0

2606.09365 2026-06-16 cs.AI cs.CL 版本更新

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

经验造就熟练：通过自进化技能记忆实现可泛化的医疗智能体推理

Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin, Fanrui Zhang, Kaitao Chen, Xingqi He, Yichen Li, Mianxin Liu, Lei Liu, Yankai Jiang

发表机构 * Fudan University（复旦大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Shanghai Innovation Institute（上海创新研究院）； Huazhong University of Science and Technology（华中科技大学）

AI总结提出SkeMex框架，通过技能记忆实现医疗智能体后部署自进化，无需更新模型权重，在临床任务中优于现有记忆型智能体。

详情

AI中文摘要

医疗智能体系统越来越期望支持交互式临床决策，而不仅仅是静态问答。在这种设置中，有效的智能体必须跨演化病例重用先前经验，然而现有的记忆机制通常保留原始历史轨迹，这些轨迹冗余、嘈杂且难以管理。更重要的是，它们很少区分哪些记忆对未来推理真正有用。这限制了它们积累紧凑且可靠的经验以进行长期临床推理的能力。为弥补这一差距，我们提出SkeMex，一种部署后自进化框架，通过基于技能的记忆改进医疗智能体，无需更新模型权重。SkeMex将信息丰富的交互轨迹提炼为结构化技能，编码可重用的程序性知识，并将其组织成涵盖通用、任务特定和行动级经验的多分支存储库。为确定哪些记忆应被重用和保留，SkeMex从环境反馈中估计上下文相关的效用，并用其指导价值感知的检索和存储库治理。闭环的“读-写-评估-治理”生命周期通过写入新技能、更新效用、促进有用记忆和移除有害条目进一步支持持续进化。跨不同临床任务的实验表明，SkeMex在离线和在线设置中均持续优于代表性记忆型智能体。它还能跨模型骨干泛化并支持可迁移的技能记忆。所有数据和代码将公开发布。

英文摘要

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop ``Read--Write--Assess--Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.

URL PDF HTML ☆

赞 0 踩 0

2606.11349 2026-06-16 cs.AI cs.HC 版本更新

Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents

知道何时提问：分层语言代理的自门控澄清机制

Aijing Gao, Yiming Kang, Mengdie Flora Wang, Jae Oh Woo

发表机构 * Amazon Web Services（亚马逊云科技）

AI总结提出ACTION-RATING框架，将澄清请求纳入代理的动作空间，与导航共享序数尺度，在分层推理中实现自门控澄清，通过强制性和机会性两种信息寻求模式提升决策准确性。

详情

AI中文摘要

在分层推理中，失败通常源于中间决策点，代理在没有意识到缺乏关键信息的情况下错误地选择了分支。我们不将澄清视为外部不确定性触发，而是提出ACTION-RATING，一种将澄清置于代理动作空间内、与导航共享序数尺度的公式，使得在每个决策点提问与行动直接竞争，并在中间状态可观察求助行为。从代理自身的评分中涌现出两种结构上不同的信息寻求模式：强制性（无可行分支）和机会性（尽管有领先候选但仍有残余不确定性）。在协调关税表分类（30,000节点分类树，三个基准，跨4个家族的9个LLM）上，我们观察到从强制性澄清到机会性澄清的机制转变，信息寻求有效性（ISE，一个局部诊断指标，定义为帮助交互后正确下一步导航步骤的比例，非最终任务指标）从50%上升到74%。三个诊断对比未能复现此结构。可分离性测试表明，当答案质量下降（准确率下降18.8%）时，信息寻求模式（模式分裂、ISE排名）保持不变，支持代理寻求帮助的位置与其所获帮助质量之间的经验分离。在受控答案通道下，10位数字准确率提升达+16.2%；我们将其解读为更好定位所能释放的上限，而非部署估计。

英文摘要

In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recognizing that it lacks critical information. Rather than treating clarification as an external uncertainty trigger, we propose ACTION-RATING, a formulation that places it inside the agent's action space on a shared ordinal scale with navigation, so that asking competes directly with acting at every decision point and help-seeking becomes observable at intermediate states. Two structurally distinct information-seeking modes emerge from the agent's own ratings: mandatory (no viable branch) and opportunistic (residual uncertainty despite a leading candidate). On Harmonized Tariff Schedule classification (30,000-node taxonomy, three benchmarks, 9~LLMs across 4 families), we observe a regime shift from mandatory to opportunistic clarification, with Information-Seeking Effectiveness (ISE), a local diagnostic defined as the fraction of help interactions followed by a correct next navigation step (not a final-task metric), rising from 50% to 74%. Three diagnostic contrasts fail to reproduce this structure. A separability test shows that the information-seeking pattern (mode split, ISE ranking) persists when answer quality is degraded (-18.8% accuracy), supporting an empirical separation between where an agent seeks help and the quality of the help it receives. Under the controlled answer channel, accuracy gains reach +16.2% at 10-digit; we read this as an upper bound on what better localization could unlock, not a deployment estimate.

URL PDF HTML ☆

赞 0 踩 0

2606.13710 2026-06-16 cs.AI cs.LG 版本更新

Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher

混合开放式三重进化打造更优深度研究者

Hongming Piao, Chi Liu, Mengzhuo Chen, Yan Shu, Xidong Wang, Derek Li, Ying Wei, Bryan Dai

发表机构 * IQuest Research ； Zhejiang University（浙江大学）

AI总结提出混合开放式三重进化框架，通过混合模式强化学习协同进化提议者、求解者和评判者，使8B模型在深度研究任务上超越静态开源8-32B模型及先进训练方法。

详情

AI中文摘要

深度研究和智能体进化是AI智能体在现实应用中迈向通用人工智能的实际任务。前者使智能体能够在开放环境中自主检索和整合信息以处理开放式研究任务，但受限于智能体系统的静态参数化深度研究能力。后者允许智能体自主与环境交互以获得经验，从而进化模型能力。然而，其有效性仅在具有标准答案的可验证任务上得到广泛验证，与开放式研究任务存在差距。为桥接这两个关键任务，我们提出混合开放式三重进化框架，该框架利用混合模式强化学习，基于网络规模知识促进提议者、求解者和评判者的协同进化，朝着开放式任务和环境中自主进化的智能体迈进。在三个长格式深度研究基准上的大量实验表明，通过HOTE训练的8B模型超越了最强的静态开源8-32B模型以及通过最先进深度研究训练方法训练的模型，且时间开销更少，并进一步验证了HOTE中三个模块的进化不可或缺。

英文摘要

Deep research and agent evolution serve as de-facto tasks for AI agents in real-world applications toward artificial general intelligence. The former enables autonomous retrieval and integration of information in open-ended environments to tackle open-ended research tasks, yet it is constrained by the static parametric deep research capabilities of agent systems. The latter allows agents to autonomously interact with the environment to gain experiences that evolve model capabilities. However, its effectiveness has been widely validated only on verifiable tasks with standard answers, leaving a gap with open-ended research tasks. To bridge these two critical tasks, we propose the Hybrid Open-Ended Tri-Evolution (HOTE) framework, which leverages hybrid-mode reinforcement learning to facilitate the collaborative evolution of a proposer, solver and judge based on web-scale knowledge, moving toward autonomous evolving agents in open-ended tasks and environments. Extensive experiments on three long-form deep research benchmarks demonstrate that the 8B model trained via HOTE surpasses the strongest static open 8-32B models as well as those trained by state-of-the-art deep research training methods with less time overhead, and further verify that the evolution of all three modules in HOTE is indispensable.

URL PDF HTML ☆

赞 0 踩 0

2601.19612 2026-06-16 cs.LG cs.AI cs.RO 版本更新

Safe Exploration via Policy Priors

通过策略先验进行安全探索

Manuel Wendl, Yarden As, Manish Prajapat, Anton Pollak, Stelian Coros, Andreas Krause

发表机构 * ETH Zurich（苏黎世联邦理工学院）

AI总结提出SOOPER方法，利用次优但保守的策略先验，结合概率动力学模型进行乐观探索和悲观回退，在保证安全的同时收敛到最优策略。

2602.00887 2026-06-16 cs.CL cs.AI cs.LG 版本更新

EffGen: Enabling Small Language Models as Capable Autonomous Agents

EffGen: 使小型语言模型成为能干的自主智能体

Gaurav Srivastava, Aafiya Hussain, Chi Wang, Yingyan Celine Lin, Xuan Wang

发表机构 * Department of Computer Science, Virginia Tech, Blacksburg, VA, USA（弗吉尼亚理工大学计算机科学系）； Georgia Institute of Technology, Atlanta, GA, USA（佐治亚理工学院）； Google DeepMind, USA（谷歌DeepMind）

AI总结 EffGen是一个针对小型语言模型优化的开源智能体框架，通过提示压缩、任务分解、复杂度路由和统一记忆系统，实现高效、安全的本地部署，在13个基准测试中优于LangChain等框架。

Comments Accepted to ICML 2026 Conference

详情

AI中文摘要

目前大多数基于语言模型的智能体系统都是通过API调用为大型语言模型（如GPT、Claude、Gemini）构建和优化的；虽然强大，但这种方法面临高令牌成本和敏感应用中的隐私问题等限制。我们提出了EffGen，一个针对小型语言模型优化的开源智能体框架，能够实现有效、高效且安全的本地部署。EffGen有四大贡献：（1）增强的工具调用与提示优化，可将输入提示压缩高达70-80%（在我们的基准测试中平均压缩57%），同时保留任务语义；（2）智能任务分解，根据依赖关系将复杂查询分解为并行或顺序子任务；（3）基于复杂度的路由，利用五个因素做出智能的执行前决策；（4）统一记忆系统，结合短期、长期和基于向量的存储。此外，EffGen统一了多种智能体协议（MCP、A2A、ACP）以实现跨协议通信。在13个基准测试上的结果表明，EffGen在成功率、执行速度和内存占用方面优于LangChain、AutoGen和Smolagents。我们的结果揭示，提示优化和复杂度路由具有互补的缩放行为：优化对小型语言模型更有利（1.5B模型提升11.2%，而32B模型提升2.4%），而路由对大型模型更有利（1.5B模型提升3.6%，而32B模型提升7.9%），两者结合在所有规模上都能带来一致的增益。EffGen在Apache 2.0许可证下发布，确保研究和商业用途的广泛可访问性，代码可在https://github.com/effgen/effgen获取，Python包可通过pip install effgen安装，项目网站和文档位于https://effgen.ai和https://docs.effgen.ai。

英文摘要

Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls; while powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce EffGen, an open-source agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment. EffGen makes four major contributions: (1) Enhanced tool-calling with prompt optimization that compresses input prompts by up to 70-80% (and 57% on average across our benchmarks) while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity-based routing using five factors to make smart pre-execution decisions, and (4) Unified memory system combining short-term, long-term, and vector-based storage. Additionally, EffGen unifies multiple agent protocols (MCP, A2A, ACP) for cross-protocol communication. Results on 13 benchmarks show EffGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined. EffGen is released under the Apache 2.0 License, ensuring broad accessibility for research and commercial use, with the code available at https://github.com/ctrl-gaurav/effGen, the Python package at https://pypi.org/project/effgen/ (pip install effgen), and the project website and documentation at https://effgen.org/ and https://docs.effgen.org/.

URL PDF HTML ☆

赞 0 踩 0

2603.21613 2026-06-16 cs.IR cs.AI 版本更新

AgenticRec: A Recommendation-Oriented Agentic Framework with Progressive Tool-Integrated Reasoning Optimization

AgenticRec：面向推荐的智能体框架与渐进式工具集成推理优化

Tianyi Li, Zixuan Wang, Guidong Lei, Xiaodong Li, Hui Li

发表机构 * Xiamen University（厦门大学）

AI总结提出AgenticRec框架，将推荐建模为工具集成推理过程，并设计两阶段训练范式，通过隐式反馈激活和渐进偏好细化提升推荐准确性。

详情

AI中文摘要

基于大型语言模型的推荐智能体为个性化推荐提供了有前景的范式。然而，现有智能体通常存在工具集成推理轨迹与推荐反馈之间的错位，限制了其区分细粒度用户偏好的能力。为解决这些问题，我们提出AgenticRec，一个面向推荐的智能体框架，将推荐形式化为在推荐导向工具套件上的工具集成推理过程。基于此框架，我们进一步开发了一个专门的两阶段训练范式，专为推荐智能体定制。在第一阶段，我们引入推荐导向轨迹激活，在隐式反馈下优化智能体推荐能力。在第二阶段，渐进偏好细化通过自举困难对上的双向偏好推理进一步优化智能体，逐步锐化偏好边界。理论分析和大量实验证明了AgenticRec的有效性。我们的代码可在该https URL获取。

英文摘要

Recommender agents built on Large Language Models offer a promising paradigm for personalized recommendation. However, existing agents typically suffer from a misalignment between their tool-integrated reasoning trajectories and recommendation feedback, limiting their ability to distinguish fine-grained user preferences. To address these challenges, we propose AgenticRec, an agentic recommendation framework that formulates recommendation as a tool-integrated reasoning process over a recommendation-oriented tool suite. Built upon this framework, we further develop a dedicated two-stage training paradigm tailored for recommender agents. In the first stage, we introduce Recommendation-Oriented Trajectory Activation, optimize the agentic recommendation ability under implicit feedback. In the second stage, Progressive Preference Refinement further refines the agent through bidirectional preference reasoning over self-bootstrapped hard pairs, progressively sharpening preference boundaries. Theoretical analysis and extensive experiments demonstrate the effectiveness of AgenticRec. Our code is available at https://anonymous.4open.science/r/AgenticRec-FB16.

URL PDF HTML ☆

赞 0 踩 0

2603.22376 2026-06-16 cs.IR cs.AI 版本更新

Closing the Auto-Research Loop: An AI Co-Scientist for Production Search Ranking

关闭自动研究循环：面向生产搜索排名的AI合作科学家

Liwei Wu, Cho-Jui Hsieh

发表机构 * Trip.com Group（Trip.com集团）； UCLA（加州大学洛杉矶分校）

AI总结提出AI合作科学家框架，通过LLM代理与云计算集成，自动迭代生成想法、实现代码、进行GPU实验并分析结果，在搜索排名任务中带来额外+0.083%离线增益。

Comments Submitted to EMNLP for review on June 14, 2026

详情

AI中文摘要

我们提出了一个AI合作科学家框架，该框架为大型在线旅游平台的生产搜索排名系统关闭了研究循环——将LLM代理与直接云计算访问配对，使得想法生成、代码实现、GPU实验和结果分析能够与人类科学家一起端到端迭代。该框架采用混合代理架构：单一LLM代理处理常规工作，而多LLM共识（GPT-5.2、Gemini Pro 3、Claude Opus 4.5）用于更高风险的决策。在生产排名任务上，人工设计的Transformer基线（V2）相比预Transformer基线（V1）提升了+0.118%；AI合作科学家在V2之上的自动循环贡献了额外的+0.083%，合计离线增益为+0.201%，大约在一周多的挂钟时间内完成（单次运行数值；统计限制在论文中讨论）。最有用的AI提案——统一长序列布局、槽位类型嵌入和多阶段学习率调度——是NLP和视觉领域的标准实践，但之前未出现在我们的生产栈中，这表明LLM代理可以作为排名团队的跨学科连接器。我们还报告了部署背景、负面结果和经验教训。

英文摘要

We present an AI Co-Scientist framework that closes the research loop for the production search-ranking system of a large online travel platform -- pairing LLM agents with direct cloud-compute access so that idea generation, code implementation, GPU experimentation, and result analysis iterate end-to-end with a human scientist in the loop. The framework uses a hybrid agent architecture: single-LLM agents handle routine work, while multi-LLM consensus (GPT-5.2, Gemini Pro 3, Claude Opus 4.5) is invoked for higher-stakes decisions. On the production ranking task, a human-designed transformer baseline (V2) yielded $+0.118\%$ over a pre-transformer baseline (V1); the AI Co-Scientist's automated loop on top of V2 contributed an additional $+0.083\%$, for a combined $+0.201\%$ offline gain delivered in roughly one extra week of wall-clock time (single-run numbers; statistical limits discussed in the paper). The most useful AI proposals -- unified long-sequence layouts, slot-type embeddings, and multi-phase learning-rate schedules -- are standard practice in NLP and Vision but were absent from our production stack, suggesting that LLM agents can serve as cross-disciplinary connectors for ranking teams. We also report deployment context, negative results, and lessons learned.

URL PDF HTML ☆

赞 0 踩 0

2603.22766 2026-06-16 cs.HC cs.AI 版本更新

From Overload to Convergence: Supporting Multi-Issue Human-AI Negotiation with Bayesian Visualization

从过载到收敛：基于贝叶斯可视化的多议题人机协商支持

Mehul Parmar, Chaklam Silpasuwanchai

发表机构 * Asian Institute of Technology（亚洲理工学院）

AI总结针对多议题协商中认知负荷导致人类表现下降的问题，提出基于贝叶斯估计协议概率的不确定性可视化方法，实验证明该方法能提升人类协商结果和效率，同时保持人类控制。

Comments Accepted for publication to CHI 2026. v2: Added Appendix B (system prompts) and Appendix C (payoff matrices) in response to replication requests. Dataset independently available at https://doi.org/10.5281/zenodo.20545331

详情

DOI: 10.1145/3772318.3790358

AI中文摘要

随着AI系统越来越多地介入协商过程，理解协商议题数量对人类表现的影响对于维护人类自主性至关重要。我们在一个真实的租赁场景中设计了人机协商案例研究，改变协商议题的数量；实证结果表明，在没有支持的情况下，表现最多在三个议题时保持稳定，但随着额外议题增加认知负荷而下降。为了解决这个问题，我们引入了一种基于贝叶斯协议概率估计的新型不确定性可视化方法。它展示了随着协商进展，相互可接受的协议空间如何缩小，帮助用户识别有前景的选项。在受试者内实验（N=32）中，它改善了人类结果和效率，保持了人类控制，并避免了价值重新分配。我们的发现揭示了人类在人机协商中能够管理的复杂性的实际极限，推进了关于复杂协商中人类表现的理论，并为交互系统提供了经过验证的设计指导。

英文摘要

As AI systems increasingly mediate negotiations, understanding how the number of negotiated issues impacts human performance is crucial for maintaining human agency. We designed a human-AI negotiation case study in a realistic property rental scenario, varying the number of negotiated issues; empirical findings show that without support, performance stays stable up to three issues but declines as additional issues increase cognitive load. To address this, we introduce a novel uncertainty-based visualization driven by Bayesian estimation of agreement probability. It shows how the space of mutually acceptable agreements narrows as negotiation progresses, helping users identify promising options. In a within-subjects experiment (N=32), it improved human outcomes and efficiency, preserved human control, and avoided redistributing value. Our findings surface practical limits on the complexity people can manage in human-AI negotiation, advance theory on human performance in complex negotiations, and offer validated design guidance for interactive systems.

URL PDF HTML ☆

赞 0 踩 0

2604.09673 2026-06-16 cs.LG cs.AI 版本更新

Active Inference with a Self-Prior in the Mirror-Mark Task

镜像标记任务中带有自我先验的主动推理

Dongmin Kim, Hoshinori Kanazawa, Yasuo Kuniyoshi

发表机构 * The University of Tokyo（东京大学）； Laboratory for Intelligent Systems and Informatics（智能系统与信息学实验室）

AI总结提出一种基于自我先验的计算模型，通过主动推理驱动标记导向行为，无需外部奖励即可模拟镜像自我识别。

Comments 8 pages, 5 figures, Accepted to IEEE ICDL 2026

详情

AI中文摘要

镜像自我识别测试评估受试者是否触摸仅在镜子中可见的自身标记，被广泛用作自我意识的指标。在本研究中，我们提出一个计算模型，其中这种行为通过单一机制——自我先验——自发产生，无需任何外部奖励。自我先验通过Transformer实现，学习熟悉多感官经验的密度；当出现新标记时，与学习分布的差异通过主动推理驱动标记导向行为。一个仅依赖视觉和本体感觉而无触觉输入的模拟婴儿，发现镜中自己脸上的贴纸并在约70%的情况下将其移除，无需任何明确指令。贴纸移除后预期自由能显著下降，证实自我先验作为区分自我与非自我的内部标准。跨模态采样进一步表明，自我先验捕获视觉-本体感觉关联，充当概率身体图式。这些结果为镜像测试中观察到的关键行为提供了简洁的计算解释，并表明自由能原理可作为研究自我意识发展起源的统一假设。代码见：this https URL

英文摘要

The mirror self-recognition test evaluates whether a subject touches a mark on its own body that is visible only in a mirror, and is widely used as an indicator of self-awareness. In this study, we present a computational model in which this behavior emerges spontaneously through a single mechanism, the self-prior, without any external reward. The self-prior, implemented with a Transformer, learns the density of familiar multisensory experiences; when a novel mark appears, the discrepancy from this learned distribution drives mark-directed behavior through active inference. A simulated infant, relying solely on vision and proprioception without tactile input, discovered a sticker placed on its own face in the mirror and removed it in approximately 70% of cases without any explicit instruction. Expected free energy decreased significantly after sticker removal, confirming that the self-prior operates as an internal criterion for distinguishing self from non-self. Cross-modal sampling further demonstrated that the self-prior captures visual--proprioceptive associations, functioning as a probabilistic body schema. These results provide a concise computational account of the key behavior observed in the mirror test and suggest that the free energy principle can serve as a unifying hypothesis for investigating the developmental origins of self-awareness. Code is available at: https://github.com/kim135797531/self-prior-mirror

URL PDF HTML ☆

赞 0 踩 0

2605.18401 2026-06-16 cs.CL cs.AI 版本更新

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

SkillsVote: 代理技能的生命周期治理从收集、推荐到进化

Hongyi Liu, Haoyan Yang, Tao Jiang, Bo Tang, Feiyu Xiong, Yuyu Luo, Zhiyu Li

发表机构 * Harbin Institute of Technology（哈尔滨理工大学）； Soochow University（苏州大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港理工大学（广州））

AI总结本文提出SkillsVote框架，通过生命周期治理管理代理技能，从收集和推荐到进化，提升模型在终端基准和SWE-Bench Pro上的性能。

Comments 71 pages, 12 figures, 13 tables

详情

AI中文摘要

长周期LLM代理留下的轨迹可能成为可重用的经验，但原始轨迹噪声大且难以管理。我们将代理技能视为一种经验模式，结合可执行脚本和不可执行的指导。然而，开放技能生态系统包含冗余、不均匀、环境敏感的产物，随意更新会污染未来上下文。我们提出了SkillsVote，一个用于代理技能生命周期治理的框架，从收集和推荐到进化。SkillsVote对百万级开源语料库进行环境需求、质量和可验证性分析，然后合成可验证技能的任务。在执行前，SkillsVote在结构化技能库中进行代理库搜索以暴露教学技能上下文。在执行后，它将轨迹分解为技能关联的子任务，将结果归因于技能使用、代理探索、环境和结果信号，并只接受成功的可重用发现以进行证据门控更新。在评估中，离线进化使GPT-5.2在Terminal-Bench 2.0上提升高达7.9个百分点，而在线进化使SWE-Bench Pro提升高达2.6个百分点。总体而言，受控的外部技能库可以在不更新模型的情况下提升冻结代理，当系统控制暴露、信用和保存时。

英文摘要

Long-horizon LLM agents generate traces that could become reusable experience, but raw trajectories are noisy, local, and hard to govern. Agent Skills offer a structured artifact for combining procedural guidance, executable resources, and applicability boundaries. Yet open skill ecosystems contain redundant, uneven, environment-sensitive artifacts, and indiscriminate updates can pollute future context. We present SkillsVote, a lifecycle-governance framework for Agent Skills across collection, recommendation, attribution, and evolution. SkillsVote profiles a million-scale open source corpus for environment requirements, quality, and verifiability, and synthesizes tasks for verifiable skills. Before execution, it performs agentic library search over structured skill folders to expose instructional context. After execution, it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill-guided execution, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates. Experiments on Terminal-Bench 2.0 and SWE-Bench Pro show that SkillsVote improves agent performance on challenging agentic coding benchmarks. The gains arise from two complementary pathways: online evolution over task streams at test time and offline transfer via frozen libraries built from either historical trajectories or curated open source skills.

URL PDF HTML ☆

赞 0 踩 0

2606.11520 2026-06-16 cs.CL cs.AI cs.LG 版本更新

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

ISE：一种基于执行的多轮操作系统代理轨迹合成方法

Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu

发表机构 * University of Electronic Science and Technology of China（电子科技大学）； SenseTime Research（字节跳动研究院）

AI总结提出ISE三阶段范式，通过结构化意图构建、角色锁定用户模拟和真实执行环境，生成多轮代理轨迹，微调后显著提升代理工具使用性能。

Comments 13 pages, 6 figures. Dataset and code: https://github.com/Valiere01/ISE-Trace

详情

AI中文摘要

训练有能力的操作系统代理需要同时捕获结构化用户意图、多轮任务委派和基于工具执行的数据——这些属性在现有数据集中缺失。我们提出ISE（意图->模拟->执行），一种三阶段合成范式，联合解决这些差距。阶段1通过4D框架（人物角色x领域x任务x复杂度）构建约50000个结构化意图；去重后池中包含43956个唯一意图，并在mpnet-base-v2嵌入（余弦核，q=1）上获得61.57的Vendi分数。阶段2通过角色锁定的用户模拟器驱动多轮用户-代理交互，将每轮用户交互基于实际执行结果，生成23132条完整轨迹，平均8.12轮用户交互和68.24轮总对话。阶段3在实时、隔离的操作系统工作空间中执行每个工具调用，生成真实的故障恢复动态而非模拟响应。在ISETrace上微调后，使用Qwen3-8B在标准协议下的代理工具使用任务中，ClawEval pass@1从19.3提升至37.7。该结果优于零样本GPT-4o和四倍大的Qwen3-32B基础模型。对阶段2的消融实验证明多轮模拟带来了大部分性能提升。我们在该https URL发布所有源代码和数据集。

英文摘要

Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution--properties absent from existing datasets. We propose ISE (Intent -> Simulate -> Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50000 structured intents via a 4D framework (Persona x Domain x Task x Complexity); after deduplication the pool contains 43956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning on ISETrace improves ClawEval pass@1 from 19.3 to 37.7 using Qwen3-8B on agent tool-use tasks with a standard protocol. This result outperforms zero-shot GPT-4o and the larger Qwen3-32B base model which is four times bigger. An ablation on Stage 2 proves multi-turn simulation brings a large portion of the performance gain. We release all source code and dataset at https://github.com/Valiere01/ISE-Trace.

URL PDF HTML ☆

赞 0 踩 0

2606.14892 2026-06-16 cs.AI cs.LG cs.SI stat.ML 新提交

Relational Structural Causal Models

关系结构因果模型

Adiba Ejaz, Elias Bareinboim

发表机构 * Causal Artificial Intelligence Lab, Columbia University（哥伦比亚大学因果人工智能实验室）

AI总结提出关系结构因果模型，将结构因果模型扩展到对象和关系可变的场景，通过关系因果图和符号识别准则实现未见组合的因果和观测查询识别，并设计关系神经因果模型在交通场景中优于非关系基线。

Comments Proceedings of the Forty-Third International Conference on Machine Learning

详情

AI中文摘要

人工智能必须拥有一个因果的环境模型，支持关于干预和反事实的推理，同时具有组合性，支持对未见过的对象组合进行泛化。在这项工作中，我们正式研究了何时以及如何学习这样的模型。我们开发了关系结构因果模型，将结构因果模型（Pearl 2009）扩展到对象及其关系变化的场景。首先，我们展示了在没有进一步假设的情况下，不仅因果查询，而且关于未见对象组合的观测查询的答案也无法被识别。为了实现这种识别——包括在存在未观测混杂的情况下——我们定义了关系因果图并推导了符号识别准则。最后，我们提出了关系神经因果模型，这是一种可证明正确的方法，在具有不同汽车、信号和行人的模拟交通场景中优于非关系基线。

英文摘要

An artificial intelligence must have a model of its environment that is causal, supporting reasoning about interventions and counterfactuals, and also combinatorial, supporting generalization to unseen combinations of objects. In this work, we formally study when and how such a model can be learned. We develop relational structural causal models, extending structural causal models (Pearl 2009) to settings where objects and their relations vary. First, we show how answers to not only causal but also observational queries about unseen combinations of objects can not be identified without further assumptions. To enable such identification--including in the presence of unobserved confounding--we define relational causal graphs and derive symbolic identification criteria. Finally, we propose relational neural causal models, a provably correct approach that outperforms non-relational baselines on simulated traffic scenes with varying cars, signals, and pedestrians.

URL PDF HTML ☆

赞 0 踩 0

2606.14935 2026-06-16 cs.AI 新提交

PrologMCP: A Standardized Prolog Tool Interface for LLM Agents

PrologMCP：面向LLM代理的标准化Prolog工具接口

Agnieszka Mensfelt, Adarsh Prabhakaran, Adrian Haret, Vince Trencsenyi, Kostas Stathis

发表机构 * Royal Holloway, University of London（伦敦大学皇家霍洛威学院）

AI总结提出PrologMCP，一个通过模型上下文协议将Prolog暴露为状态化工具的任务无关开源服务器，使LLM代理能够通过翻译-运行-检查-修复循环稳健地委托演绎推理，在PARARULE-Plus上达到或超越推理型LLM。

Comments Accepted at Joint Workshop on Statistics and Knowledge Integration for Logic, Learning, Ethical Decisions, and LLMs, 18 July 2026, Lisbon

详情

AI中文摘要

前沿推理调优语言模型在深度演绎任务上仍然失败，而通过扩展内部推理来提升性能的成本很高。符号委托提供了一条补充路径：语言模型翻译问题，求解器执行推理。然而，当前用于逻辑编程的自动形式化管道通常是针对特定任务或代理的定制集成。我们引入了PrologMCP，一个任务无关的开源服务器，通过模型上下文协议（MCP）将Prolog暴露为状态化工具。其紧凑的工具接口、结构化错误报告和每会话隔离使翻译-运行-检查-修复循环成为MCP能力代理的可复用原语。我们在PARARULE-Plus的两个子集上评估了增强PrologMCP的形式化代理与标准和推理LLM（Claude Sonnet 4.6、GPT-4.1和o4-mini）的性能：一个通用样本和一个更具挑战性的样本，针对自然语言推理的特定失败模式。在通用样本上，形式化代理匹配或超越推理LLM（准确率1.00对比1.00/0.998），相比标准模型提升最大（GPT-4.1为0.762）。在挑战性子集上，形式化代理保持接近完美（1.00/0.99），而推理LLM降至0.95/0.94。这些结果表明，通过MCP将推理委托给Prolog是扩展自然语言推理的一种稳健且可检查的替代方案。

英文摘要

Frontier reasoning-tuned language models still fail on deductive tasks at depth, and the cost of improved performance through extended internal reasoning scales poorly. Symbolic delegation offers a complementary route: a language model translates the problem, while a solver performs the inference. However, current autoformalization pipelines for logic programming are typically bespoke integrations tied to particular tasks or agents. We introduce PrologMCP, a task-agnostic, open-source server that exposes Prolog as a stateful tool through the Model Context Protocol (MCP). Its compact tool interface, structured error reporting, and per-session isolation make the translate-run-inspect-repair loop a reusable primitive for MCP-capable agents. We evaluate a formalizer agent enhanced with PrologMCP against standard and reasoning LLMs (Claude Sonnet 4.6, GPT-4.1, and o4-mini) on two subsets of PARARULE-Plus: a general-purpose sample and a more challenging one targeting a specific failure mode of natural-language reasoning. On the general sample, the formalizer matches or exceeds reasoning LLMs (accuracy 1.00 vs.\ 1.00 / 0.998), with the largest gains over standard models (0.762 for GPT-4.1). On the challenging subset, the formalizer remains near-perfect (1.00 / 0.99) while reasoning LLMs drop to 0.95 / 0.94. These results suggest that delegating inference to Prolog via MCP is a robust and inspectable alternative to extended natural-language reasoning.

URL PDF HTML ☆

赞 0 踩 0

2606.15096 2026-06-16 cs.AI 新提交

VGPT-RSI for RH-Adjacent Formal Progress: Boundary Certificates, Verified Finite Lagarias Inequalities, and Explicit Failure Localization

VGPT-RSI 用于 RH 邻近形式化进展：边界证书、已验证的有限 Lagarias 不等式和显式故障定位

Zhixin Hu, Tao Xu, Xiaodian Sun, Li Jin, Momiao Xiong

AI总结提出 VGPT-RSI 系统，通过构造并验证有限 RH 边界证书和 Lagarias 准则的有限形式化，实现 Riemann 假设邻近问题的部分形式化进展，并明确识别剩余数学障碍。

Comments 31 pages, 3 figures

详情

模型图归纳学习用于知识图谱补全

Mohommad Esmaei Khani, Mahdieh Hasheminejad, Ali Taherkhani, Hossein Hajiabolhassan

发表机构 * Yazd University（亚兹德大学）； Institute for Advanced Studies in Basic Sciences (IASBS)（基础科学高等研究所）； Medizinische Universität Graz（格拉茨医科大学）

AI总结提出模型图归纳学习（MGIL）框架，通过聚类实体构建模型图并应用GNN捕获全局结构，生成高质量初始嵌入，在归纳链接预测任务上取得最优或竞争性结果。

详情

AI中文摘要

知识图谱中的链接预测根本上依赖于实体和关系嵌入的质量。然而，大多数现有方法仅通过聚合每个实体的局部邻域来推导这些嵌入，忽略了知识图谱的全局结构。这种有限的视角阻止了模型捕获对于准确和可泛化的链接预测至关重要的高层结构模式。为了解决这些限制，我们引入了模型图归纳学习（MGIL），该框架通过基于实体传入和传出关系结构或实体类型的相似性对实体进行聚类来构建模型图。然后，在模型图上应用GNN以生成捕获知识图谱全局视图的嵌入。这些嵌入随后作为原始知识图谱的高质量初始特征，取代随机初始化，从而产生更稳定和更具表达力的表示。在标准和最近提出的归纳基准上的广泛实验表明，MGIL在归纳链接预测中实现了最先进或极具竞争力的性能，突显了其在不同图设置下的有效性。

英文摘要

Link prediction in knowledge graphs fundamentally depends on the quality of learned embeddings for entities and relations. However, most existing methods derive these embeddings by aggregating only the local neighborhood of each entity, neglecting the global structure of the knowledge graph. This limited view prevents models from capturing higher-level structural patterns that are essential for accurate and generalizable link prediction. To address these limitations, we introduce Model Graph Inductive Learning (\textbf{MGIL}), a framework that constructs a model graph by clustering entities based on the similarity of their incoming and outgoing relational structures or their entity types. A GNN is then applied to this model graph to produce embeddings that capture the global view of the knowledge graph. These embeddings subsequently serve as high-quality initial features %embeddings for the original knowledge graph, replacing random initialization and leading to more stable and expressive representations. Extensive experiments on standard and recently proposed inductive benchmarks demonstrate that MGIL achieves state-of-the-art or highly competitive performance in inductive link prediction, highlighting its effectiveness across diverse graph settings.

URL PDF HTML ☆

赞 0 踩 0

2606.16893 2026-06-16 cs.AI cs.CL cs.LO 新提交

Symbolic Informalization: Fluent, Productive, Multilingual

符号非形式化：流畅、高效、多语言

Aarne Ranta

发表机构 * Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg（查尔姆斯理工大学与哥德堡大学计算机科学与工程系）

AI总结提出符号非形式化方法，将形式数学可靠地转换为自然语言，基于Dedukti和Grammatical Framework的中间语言架构，实现多证明系统与多自然语言的流畅转换。

2606.16944 2026-06-16 cs.AI cs.HC 新提交

A Causal Model of Theory of Mind in Conflict for Artificial Intelligence

冲突情境下心智理论的人工智能因果模型

Nikolos Gurney

发表机构 * Institute for Creative Technologies, University of Southern California（南加州大学创意技术研究所）

AI总结提出结构因果模型，将心智理论视为由情境和主体条件激活的机制，通过三条因果路径决定何时进行心智化，提升AI社会推理的准确性和效率。

详情

AI中文摘要

心智理论（ToM）是将心理状态归因于他人并利用这些归因进行预测和推理的能力，被广泛认为是有效人机融合的关键。现有AI-ToM模型解决了如何心智化的问题，但基本未涉及何时心智化。核心问题是：在冲突中，何种情境和主体层面的条件下，ToM的参与在因果上是合理的？本文提出一个结构因果模型，形式化为有向无环图（DAG），将ToM视为由情境和主体条件激活的机制，而非始终开启的能力。模型指定了四个捕捉情境和主体条件的外生变量、五个内生中介变量，以及一个通过三条不同因果路径产生参与状态的机制性ToM节点：可处理性路径、推理深度路径和使能原因路径。主要结果是认知准确性，它将社会推理与行为策略解耦，并泛化到冲突之外的社会现象。该框架为AI系统提供了一种有原则的、资源理性的心智化决策程序，对效率、信任以及鲁棒的人工社会智能的发展具有意义。讨论了仿真验证、实证人机协作研究以及由冲突优化心智化引发的伦理问题。

英文摘要

Theory of mind (ToM), the capacity to ascribe mental states to others and use those ascriptions for prediction and inference, is widely assumed to be essential for effective human-machine integration. Existing AI-ToM models address \emph{how} to mentalize, but leave the question of when largely unaddressed. The central question is: under what situational and agent-level conditions is ToM engagement causally warranted in conflict? This paper presents a structural causal model formalized as a directed acyclic graph (DAG), treating ToM as a mechanism activated by situational and agent-level conditions rather than as an always-on capacity. The model specifies four exogenous variables capturing situational and agent-level conditions, five endogenous mediators, and a mechanistic ToM node producing engagement states through three distinct causal pathways: a tractability pathway, a reasoning-depth pathway, and an enabling-cause pathway. The primary outcome is epistemic accuracy, which decouples social reasoning from behavioral policy and generalizes across social phenomena beyond conflict. The framework gives AI systems a principled, resource-rational decision procedure for mentalizing, with implications for efficiency, trust, and the development of robust artificial social intelligence. Simulation validation, empirical human-machine teaming studies, and ethical considerations arising from conflict-optimized mentalizing are discussed.

URL PDF HTML ☆

赞 0 踩 0

2606.15246 2026-06-16 cs.LO cs.AI cs.DL 交叉投稿

Provenance-Enhanced Statements in Knowledge Graphs

知识图谱中增强来源的陈述

Fabio Vitali, Valentina Pasqual

发表机构 * University of Bologna（博洛尼亚大学）

AI总结提出DEC框架，通过认知模态逻辑将来源谓词解释为认知立场指示器，并分组为认知世界，实现基于来源的推理，避免将分歧视为不一致。

Comments 33 pages

详情

AI中文摘要

在当代知识图谱中，形式为“根据$X$，$φ$”的增强来源陈述无处不在，尤其是在图内容主要表示主张、解释和假设（\emph{capta}）而非观察者独立事实（\emph{data}）的领域。当前的来源模型可以记录谁说了什么，但通常将来源视为语义中性的，未充分说明归因陈述与事实承诺、彼此之间以及推理的关系。在本文中，我们引入DEC框架，该框架将来源谓词解释为认知立场指示器，并将来源同质的陈述集分组为\emph{认知世界}。借鉴认知模态逻辑（信念、知识和推测），DEC刻画了认知世界与一个特殊的事实核心（“现实”）之间的局部性、合理性和可控渗透，从而能够对归因内容进行有原则的推理，而不会将分歧视为不一致。我们为RDF数据集形式化了DEC解释，该解释对RDF 1.2语义是保守的，阐明了内涵性和同一性（包括超人悖论）的作用，并在常见的语义网表示（命名图、引用三元组/RDF-star和具体化）上说明了该方法。最后，我们描述了原型DEC推理器，它作为Fuseki数据集模块实现，支持受控事实化以及分歧和错觉的显式检测。

英文摘要

Provenance-enhanced statements of the form "according to $X$, $φ$" are pervasive in contemporary knowledge graphs, especially in domains where graph content primarily represents claims, interpretations, and hypotheses (\emph{capta}) rather than observer-independent facts (\emph{data}). Current provenance models can record who asserted what, but they typically treat provenance as semantically neutral, leaving underspecified how attributed claims relate to factual commitment, to one another, and to reasoning. In this paper we introduce DEC, a framework that interprets provenance predicates as indicators of epistemic stance and groups provenance-homogeneous sets of statements into \emph{cognitive worlds}. Drawing on cognitive modal logics (doxastic, epistemic, and conjectural), DEC characterizes locality, rationality, and controlled permeation between cognitive worlds and a distinguished factual core ("reality"), thereby enabling principled reasoning over attributed content without collapsing disagreements into inconsistencies. We formalize a DEC interpretation for RDF datasets that is conservative over RDF~1.2 semantics, clarify the role of intensionality and identity (including the Superman paradox), and illustrate the approach on common Semantic Web representations (named graphs, quoted triples/RDF-star, and reification). Finally, we describe our prototype DEC reasoner implemented as a Fuseki dataset module, supporting controlled factualisation and explicit detection of disagreements and delusions.

URL PDF HTML ☆

赞 0 踩 0

2606.15719 2026-06-16 cs.LO cs.AI math.LO 交叉投稿

The algebra of Krom logic programs

Krom逻辑程序的代数

Christian Antić

发表机构 * Vienna University of Technology（维也纳技术大学）

AI总结本文研究Krom逻辑程序的代数结构，通过顺序组合赋予其幺半群结构，并扩展到多种半环，建立生成集、规范分解，与变换幺半群和有限自动机建立联系。

2606.16010 2026-06-16 cs.IR cs.AI 交叉投稿

AI智能体之间的信任：衡量形成、破裂与恢复，及其对多智能体系统治理的启示

Yujiao Chen

发表机构 * Massachusetts Institute of Technology（麻省理工学院）

AI总结提出基于代价验证的行为信任度量，通过合作生存游戏研究六个前沿模型快照的信任形成、破裂与恢复，发现信任形成可减少验证，恢复慢于形成，且集群失败延长怀疑，建议校准而非最大怀疑作为治理核心。

详情

AI中文摘要

随着语言模型智能体越来越多地以团队形式工作，每个智能体必须决定对其队友的信任程度。然而，我们缺乏衡量AI智能体之间信任的标准方法。我们提出一种基于代价验证的行为度量。在一个合作生存游戏中，检查队友的工作会消耗资源，而信任错误的答案可能是致命的。相对于同一模型的无记忆版本，减少验证提供了信任的可观察度量。利用这一框架，我们研究了六个前沿模型快照的信任形成、破裂与恢复。当与始终可靠的队友配对时，四个快照（Claude Opus 4.6、Claude Sonnet 4.6、GPT-5.1和Gemini 3.1 Pro）将验证减少了约60-85%，而两个较小的快照几乎没有或完全没有这种调整。失败会逆转这种折扣，但模型在响应方式上存在差异。一些模型将重新审查集中在肇事者身上，而另一些则对整个团队变得更加谨慎。恢复比形成慢，并且集群失败使怀疑持续的时间远长于相同数量的分散失败。这些差异具有实际后果。形成信任的模型验证更少、决策更快，并在我们的环境中获得更高的收益。相比之下，持续过度验证与犹豫不决而非安全性相关。我们的结果表明，信任倾向可以在部署前测量，并建议校准而非最大怀疑应成为多智能体AI系统治理的核心关注点。

英文摘要

As language-model agents increasingly work in teams, each agent must decide how much to trust its teammates. Yet we lack a standard way to measure trust between AI agents. We propose a behavioral measure based on costly verification. In a cooperative survival game, checking a teammate's work consumes resources, while trusting a wrong answer can be fatal. Relative to a memoryless version of the same model, reduced verification provides an observable measure of trust. Using this framework, we study trust formation, breakage, and recovery across six frontier model snapshots. When paired with a consistently reliable teammate, four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) reduce verification by roughly 60-85%, whereas two smaller snapshots show little or no such adjustment. Failures reverse this discount, but models differ in how they respond. Some concentrate renewed scrutiny on the culprit, while others become more cautious toward the entire team. Recovery is slower than formation, and clustered failures sustain suspicion far longer than the same number of failures spread apart. These differences have practical consequences. Models that form trust verify less, decide more quickly, and achieve higher payoffs in our environment. By contrast, persistent over-verification is associated with indecision rather than safety. Our results show that trust dispositions can be measured before deployment and suggest that calibration, rather than maximal suspicion, should be the central concern in the governance of multi-agent AI systems.

URL PDF HTML ☆

赞 0 踩 0

2606.15503 2026-06-16 cs.AI cs.CY cs.MA cs.NE 新提交

Synthetic Counteradaptation: A Principle of Human-AI Co-evolution

合成反适应：人机共同进化的一个原理

Ivar Frisch, Jackie Kay, Philip Moreira Tomei

发表机构 * Spectral Circuits Research ； Independent Researcher（独立研究者）； AI Objectives Institute（AI Objectives研究所）

AI总结提出合成反适应概念，描述人机通过相互适应策略和行为实现共同进化，并分析围棋、混合动机社交和地缘政治模拟等案例。

Comments 15 pages, 1 figure. Published in Antikythera (MIT Press), February 2025

详情

DOI: 10.1162/ANTI.5CZJ
Journal ref: Antikythera Journal, MIT Press, February 2025

AI中文摘要

在本文中，我们引入了合成反适应的概念，这是一个人类与AI系统通过相互适应对方的策略和行为而共同进化的过程。当AI系统发展出新的策略或社会协议，促使人类提取见解并调整自身行为作为回应时，就会发生合成反适应，从而导致新的智能体交互动态的出现。为了说明这些动态，我们分析了来自不同背景的案例，包括围棋游戏、混合动机社交互动和地缘政治模拟。通过探索这些案例，我们展示了合成反适应如何为理解多智能体环境中人机交互的递归和共同进化性质提供一个框架。

英文摘要

In this paper, we introduce the concept of synthetic counteradaptation, a process where human and AI systems co-evolve by adapting to each other's strategies and behaviors. Synthetic counteradaptation occurs when AI systems develop novel strategies or social protocols, prompting humans to extract insights and adapt their own behaviors in response, leading to the emergence of new agent interaction dynamics. To illustrate these dynamics, we analyze examples from various contexts, including the game of Go, mixed-motive social interactions, and geopolitical simulations. By exploring these cases, we demonstrate how synthetic counteradaptation provides a framework for understanding the recursive and co-evolutionary nature of human-AI interactions in multi-agent environments.

URL PDF HTML ☆

赞 0 踩 0

2606.15684 2026-06-16 cs.AI 新提交

Multi-agent Framework for Time-Sensitive Complementary Collaboration in Minecraft

Minecraft中时间敏感互补协作的多智能体框架

Juheon Yi, Jinglu Wang, Xiaoyi Zhang, Yan Lu

发表机构 * Microsoft Research Asia（微软亚洲研究院）

AI总结提出TickingCollabBench基准和TickingCollab框架，用于评估LLM在动态、实时、异构智能体强制协作任务中的表现，发现LLM因延迟和协调困难而频繁失败。

详情

AI中文摘要

我们提出了TickingCollabBench，这是一个基于Minecraft的多智能体基准，用于一类新颖的时间敏感互补协作任务。我们的基准反映了现实世界协作的四个核心特征：智能体异构性、强制协作、动态环境以及具有失败风险的严格实时约束。为此，我们开发了TickingCollab框架，该框架支持生成多样化的动态环境，并抽象了Minecraft的原始API，以便通过声明式YAML任务规范来组合这些事件。在此基础上，我们设计了一个可行性感知的自动基准生成流水线，其中LLM起草结构多样的任务配置，可行性验证器使用近似约束过滤掉无效配置。评估表明，语言延迟以及在部分可观测性和智能体异构性下协调的固有困难，导致LLM在动态环境中频繁失败，并且远不及全局知识oracle的表现。

英文摘要

We present TickingCollabBench, a Minecraft-based multi-agent benchmark for a novel class of time-sensitive complementary collaboration tasks. Our benchmark reflects four core characteristics of real-world collaboration: agent heterogeneity, mandatory collaboration, dynamic environments, and strict real-time constraints with failure risks. To enable this, we develop the TickingCollab framework, which supports the generation of diverse dynamic environments and abstracts Minecraft's primitive APIs to enable declarative YAML task specifications for composing these events. Building on this, we design a feasibility-aware automated benchmark generation pipeline, where an LLM drafts structurally diverse task configurations and feasibility verifier filters out invalid ones using approximate constraints. Evaluations demonstrate that lang latency and inherent difficulty of coordinating under partial observability and agent heterogeneity cause LLMs to frequently fail under dynamic environments and fall significantly short of a global-knowledge oracle.

URL PDF HTML ☆

赞 0 踩 0

2606.16328 2026-06-16 cs.AI 新提交

AdaSTORM: Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration

AdaSTORM: 通过自适应时空多智能体协作扩展动态图上的LLM推理

Bing Hao, Ruijie Wang, Haodong Qian, Yunlong Chu, Yuhang Liu, Yumeng Lin, Minglai Shao, Jianxin Li

发表机构 * Tianjin University, China（天津大学，中国）； Beihang University, China（北航大学，中国）

AI总结提出AdaSTORM框架，通过自适应分区和时空解耦的多智能体协作，将动态图推理扩展到千节点规模，准确率超90%，无需外部工具。

详情

AI中文摘要

大型语言模型（LLM）在动态图推理中展现出显著潜力，但面临扩展瓶颈：当前模型只能处理数十个节点的图，受限于指数级推理开销和有限的上下文窗口。尽管多智能体系统（MAS）提供了集体推理和拓扑感知编排的能力——这些能力天然适用于图结构任务，但其在动态图上的应用仍未探索。本文提出通过自适应时空多智能体协作扩展动态图上的LLM推理（AdaSTORM），这是一个将大规模动态图推理重构为两个阶段的框架：（i）自适应分区，将大规模动态图划分为与模型推理能力匹配的子区域，同时最小化推理成本；（ii）协作推理，将图分区拓扑与时空解耦的多智能体架构对齐。AdaSTORM是首个专为动态图推理设计的多智能体框架。大量实验表明，AdaSTORM成功突破了扩展瓶颈，将推理扩展到千节点图，在多个大规模动态图设置中准确率超过90%，且无需外部工具，显著优于七个竞争基线。此外，它在现有基准上达到了最先进的准确率，并稳健地泛化到真实世界数据集。源代码可在 https://github.com/irisorchid107/AdaSTORM/ 获取。

英文摘要

Large Language Models (LLMs) demonstrate remarkable potential in dynamic graph reasoning, but suffer from a scaling bottleneck: current models can only handle graphs with tens of nodes, constrained by exponential reasoning overhead and finite context windows. While multi-agent systems (MAS) offer collective reasoning and topology-aware orchestration, capabilities naturally suited for graph-structured tasks, their application to dynamic graphs remains unexplored. This paper presents Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration (AdaSTORM), a framework that reformulates large-scale dynamic graph reasoning into two stages: (i) Adaptive Partitioning, partitioning large-scale dynamic graphs into subregions that match the model's reasoning capacity while minimizing inference cost; and (ii) Collaborative Reasoning, aligning graph partition topologies with a spatio-temporal decoupled multi-agent architecture. AdaSTORM is the first multi-agent framework tailored for dynamic graph reasoning. Extensive experiments show that AdaSTORM successfully breaks through the scaling bottleneck, scaling reasoning to thousand-node graphs with over 90% accuracy across several large-scale dynamic graph settings without external tools, significantly outperforms seven competitive baselines. Furthermore, it achieves state-of-the-art accuracy on existing benchmarks and generalizes robustly to real-world datasets. The source code is available at: https://github.com/irisorchid107/AdaSTORM/.

URL PDF HTML ☆

赞 0 踩 0

2606.16330 2026-06-16 cs.AI 新提交

Phase-Aware Guidance Injection for Recurrent MAPPO in Assembly-Line Disruption Recovery

装配线中断恢复中面向阶段的引导注入用于循环MAPPO

Xin Huang, Yongcai Wang, Fengyi Zhang, Zhikun Tao, Yunjun Han, Naiqi Wu

发表机构 * School of Information, Renmin University of China（中国人民大学信息学院）； State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所多模态人工智能系统国家重点实验室）； The Information Science Academy, China Electronics Technology Group Corporation（中国电子科技集团公司信息科学研究院）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； The Institute of Systems Engineering, Macau University of Science and Technology（澳门科技大学系统工程研究所）

AI总结提出面向阶段的引导注入框架，在评估时通过logit级动作偏置增强训练好的循环MAPPO调度策略，利用规则、回放和在线LLM引导减少异常恢复时间并保持准时交付。

Comments 6 pages, 4 figures, accepted by the 2026 IEEE International Conference on Automation Science and Engineering (CASE 2026)

详情

AI中文摘要

工业装配线的中断恢复需要在机器故障、工人缺勤和紧急订单下及时做出决策。现有方法要么依赖僵化的手工恢复逻辑，要么学习自适应策略，但无法在决策时轻易利用异构的外部恢复知识来减少异常恢复时间（ART）并保持准时交付（OTD）。为解决这一差距，我们提出了一种面向阶段的引导注入框架，通过在评估期间引入logit级动作偏置来增强训练好的循环MAPPO（RMAPPO）调度策略。该框架为基于规则、基于回放和基于在线LLM的引导提供了统一的决策时接口，同时仅在异常和恢复阶段激活干预。在自定义的AssemblyLineEnv上的实验表明，高质量的规则引导带来最强的性能提升，基于回放的引导在不完美可用性下平滑退化，而在线LLM引导仍能提供有用的中间改进。这些结果表明，决策时引导注入可以在不重新设计actor的情况下利用异构恢复提示。

英文摘要

Disruption recovery in industrial assembly lines requires timely decisions under machine faults, worker absence, and emergency orders. Existing methods either rely on rigid handcrafted recovery logic or learn adaptive policies that do not readily exploit heterogeneous external recovery knowledge at decision time to reduce abnormal recovery time (ART) and preserve on-time delivery (OTD). To address this gap, we propose a phase-aware guidance injection framework that augments a trained recurrent MAPPO (RMAPPO) scheduling policy through logit-level action bias during evaluation. The framework provides a unified decision-time interface for rule-based, replay-based, and online LLM-based guidance, while activating intervention only during abnormal and recovery phases. Experiments on a custom AssemblyLineEnv show that high-quality rule guidance yields the strongest gains, replay-based guidance degrades smoothly under imperfect availability, and online LLM guidance still provides useful intermediate improvements. These results show that decision-time guidance injection can exploit heterogeneous recovery hints without redesigning the actor.

URL PDF HTML ☆

赞 0 踩 0

2606.16478 2026-06-16 cs.AI 新提交

Tensor-Coord: Algebraic Decomposition of Joint Plan Tensors for Conflict-Free Multi-Agent LLM Planning

Tensor-Coord：用于无冲突多智能体LLM规划的联合计划张量代数分解

Mudit Rastogi

发表机构 * University of Michigan（密歇根大学）

AI总结提出Tensor-Coord框架，将多智能体联合计划表示为三阶张量，通过CP和Tucker分解识别协调结构，计算协调复杂度并定位冲突，实现无冲突规划。

详情

AI中文摘要

大型语言模型（LLM）在多智能体规划中仍然受限，因为独立生成的计划可能导致协调失败，如空间碰撞、资源争用和时间死锁。我们引入Tensor-Coord，一个多线性代数框架，将N个智能体的联合计划表示为三阶张量 $T \in R^{N \times H \times A}$，维度为智能体、时间步和动作。使用典型多面体（CP）和Tucker分解来识别潜在协调结构。最小ε近似CP秩R*定义了一个可计算的协调复杂度度量，$CC(Pi)=(R*-N)/N$。我们证明R*=N是计划独立性的充分必要条件。残差 $E=T-T_{R*}$ 定义了智能体对、时间步和动作上的冲突分数，无需领域特定规则即可定位失败。Tucker因子提供可解释的智能体角色、时间阶段和动作聚类，这些被转换为自然语言约束，用于迭代LLM重规划。在多机器人配送任务上的实验，包括简单（2个智能体，5x5网格）、中等（3个智能体，5x5网格）和困难（4个智能体，5x5网格）设置，显示在2个智能体情况下100%收敛到无冲突计划，平均迭代1.4次；3个智能体情况下80%收敛，平均迭代3.2次；4个智能体情况下60%收敛，平均迭代4.0次。CP秩近似线性增长，$R*(N) = 3.9N + 0.5$，支持其作为协调复杂度预测器的使用。

英文摘要

Large language models (LLMs) remain limited in multi-agent planning because independently generated plans can create coordination failures such as spatial collisions, resource contention, and temporal deadlocks. We introduce Tensor-Coord, a multilinear algebra framework that represents the joint plan of N agents as a third-order tensor $T \in R^{N \times H \times A}$ over agents, timesteps, and actions. Canonical Polyadic (CP) and Tucker decompositions are used to identify latent coordination structure. The minimal epsilon-approximate CP rank R* defines a computable coordination complexity measure, with $CC(Pi)=(R*-N)/N$. We prove that R*=N is necessary and sufficient for plan independence. The residual $E=T-T_{R*}$ defines a conflict score over agent pairs, timesteps, and actions, localizing failures without domain-specific rules. Tucker factors provide interpretable agent roles, temporal phases, and action clusters that are converted into natural language constraints for iterative LLM replanning. Experiments on multi-robot delivery tasks across Easy (2 agents, 5x5 grid), Medium (3 agents, 5x5 grid), and Hard (4 agents, 5x5 grid) settings show convergence to conflict-free plans in 100% of 2-agent cases within 1.4 iterations on average, 80% of 3-agent cases within 3.2 iterations, and 60% of 4-agent cases within 4.0 iterations. CP rank scaled approximately linearly as $R*(N) = 3.9N + 0.5$, supporting its use as a predictor of coordination complexity.

URL PDF HTML ☆

赞 0 踩 0

2606.11692 2026-06-16 cs.CY cs.AI cs.MA cs.SI 交叉投稿

Evaluation of Alternative-Based Information Systems for Deliberative Polling using an Agentic Simulator

基于智能体模拟器的审议式投票中替代性信息系统评估

Rwaida Alssadi, Khulud Alawaji, Balaji Kasula, Muntaser Syed, Badria Alfurhood, Markus Zanker, Marius Silaghi

发表机构 * Florida Institute of Technology（佛罗里达理工学院）； Princess Nourah Bint Abdulrahman（纳厄赫·阿卜杜勒拉赫曼公主）； Free University of Bozen-Bozano（博兹诺-博萨诺自由大学）

AI总结提出基于LLM的智能体双极论证模拟器（ABAS），通过覆盖率和语料多样性评估审议式投票中推荐机制的有效性，并测试了对抗性投票攻击下的鲁棒性。

详情

AI中文摘要

审议式投票旨在通过让股东在投票前接触广泛论点来改善集体决策。然而，确保每个选民遇到理由空间的代表性样本（覆盖问题）仍然是一个开放的挑战，特别是在大规模和对抗性或策略性动机的选民群体中。本文介绍了一种使用基于LLM的智能体双极论证模拟器（ABAS）评估解决方案的方法，该模拟器基于一个将投票形式化为六元组<Jend, Jopp, Ratt, Renh, VA, VR>（包含支持与反对理由、攻击与增强关系、股东权重和关系权重）的框架。ABAS模拟N个自主股东智能体，每个智能体根据[-1,1]内的期望分布分配潜在意见，依次投票、选择或撰写理由，并可选择提交论证图链接。该模拟器实现推荐机制，根据可观察的支持质量对现有理由进行排序。它通过覆盖率（即每个股东收到的K条推荐中代表语料库理由标签集的比例）来评估机制的成功，作为NP难子集理由问题的一个解决方案。报告的实验描述了创造力率（pown）、推荐大小（K）、论证密度（plinks）和人口规模（N）如何影响覆盖率和语料库多样性。在一个经过身份验证的选民群体中（Sybil攻击不可能，只有关系图可被操纵），我们通过协调策略性投票攻击对评分进行压力测试：标签洪泛攻击导致覆盖率崩溃，而通过反向PageRank规则的作者计数关系加权比均匀权重显著更好地抵抗了洪泛攻击。

英文摘要

Deliberative polling promises to improve collective decision-making by exposing shareholders to a broad range of arguments before they vote. Yet ensuring that every voter encounters a representative sample of the reason space, the coverage problem, remains an open challenge, particularly at scale and in adversarial or strategically motivated electorates. This paper introduces a way of evaluating solutions using the LLM-based Agentic Bipolar Argumentation Simulator, grounded in a framework which formalises a poll as a six-tuple <Jend, Jopp, Ratt, Renh, VA, VR> of endorsing and opposing justifications, attack and enhance relations, and shareholder- and relation-weights. ABAS simulates N autonomous shareholder agents, each assigned a latent opinion according to desired distributions in [-1, 1], who sequentially vote, choose or author justifications, and optionally submit argumentation-graph links. The simulator implements recommendations that rank existing justifications by their observable endorsement mass. It evaluates the mechanism's success by coverage, namely the fraction of the corpus reason-tag set represented in the K recommendations presented to each shareholder, as a solution to the NP-hard Subsuming Justification Problem. Reported experiments characterise how creativity rate (pown), recommendation size (K), argumentation density (plinks), and population size (N) affect coverage and corpus diversity. In an authenticated electorate where Sybil attacks are impossible and only the relation graph is gameable, we stress-test the scoring with coordinated strategic voting attacks: a tag-flood attack collapses coverage, while author-count relation weighting through a reversed-PageRank rule resists the flood markedly better than uniform weights.

URL PDF HTML ☆

赞 0 踩 0

2606.14710 2026-06-16 cs.DC cs.AI 交叉投稿

Poster: EdgeCitadel -- Hybrid NATS-MQTT Orchestration for Edge Multi-Agent Systems

海报：EdgeCitadel——面向边缘多智能体系统的混合NATS-MQTT编排

Zhonghao Zhan, Yefan Zhang, Hamed Haddadi

发表机构 * Imperial College London（帝国理工学院伦敦分校）； Independent Researcher（独立研究员）

AI总结针对边缘AI智能体协调依赖云传输或中央中继的问题，提出基于NATS 2.10服务器与内置MQTT适配器的混合编排平台EdgeCitadel，实现异构智能体连接、持久化存储、直接委托和被动聚合，并在ARM64、x64和Android设备上验证。

2606.14756 2026-06-16 cs.CV cs.AI cs.LG 交叉投稿

Divide-and-Denoise: A Game-Theoretic Method for Fairly Composing Diffusion Models

分而除噪：一种公平组合扩散模型的博弈论方法

Abhi Gupta, Polina Barabanshchikova, Vikas Garg, Samuel Kaski, Tommi Jaakkola

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； University of Washington（华盛顿大学）； University of Cambridge（剑桥大学）

AI总结提出Divide-and-Denoise方法，通过公平分配博弈协调多个预训练扩散模型，在采样时划分区域并引导各模型去噪，解决模型主导或冲突问题，在条件图像生成中优于基线。

Comments Accepted as spotlight at ICML 2026

详情

AI中文摘要

大量预训练扩散模型为组合提供了机会。然而，组合多个模型存在一个模型主导或模型间相互冲突的风险。在此，我们提出Divide-and-Denoise，一种在采样过程中协调多个预训练扩散模型的方法。类似于管理专业劳动力，我们的方法在模型间创建了公平且高效的劳动分工。我们方法的核心是分配的概念，它定义了每个模型对含噪样本每个区域的责任。在每个时间步，我们通过以下步骤去噪：(i) 通过求解公平分配博弈更新分配，其中我们在公平约束下将样本划分为最大化总效用的区域，以及(ii) 使模型与这种分配对齐，引导每个模型在其分配区域内去噪。这导致了一个新的复合去噪过程，该过程与划分过程同步演化。我们在条件图像生成上评估了Divide-and-Denoise。在包括GenEval基准在内的多个质量指标上，我们的方法优于基线，并解决了常见失败情况，包括缺失对象和属性不匹配。实验表明，Divide-and-Denoise利用了每个模型的专业知识，同时不忽视任何其他模型。

英文摘要

The abundance of pre-trained diffusion models provides an opportunity for composition. Combining several models, however, runs the risk of one model dominating or models disagreeing with each other. Here, we propose Divide-and-Denoise, a method for coordinating multiple pre-trained diffusion models during sampling. Much like managing a specialized workforce, our method creates a fair but efficient division of labor across models. Central to our method is the notion of an allocation which defines the responsibility of each model to every region of the noisy sample. At every timestep, we then denoise by (i) updating the allocation by solving a fair division game, where we divide the sample into regions that maximize total utility under fairness constraints, and (ii) aligning the models with this allocation, where we guide each model to denoise within its assigned region. This leads to a new composite denoising process that evolves in tandem with a division process. We evaluate Divide-and-Denoise on conditional image generation. Across several quality metrics, including the GenEval benchmark, our method outperforms baselines and resolves common failures including missing objects and mismatched attributes. Experiments show that Divide-and-Denoise utilizes each model's expertise without neglecting any other model.

URL PDF HTML ☆

赞 0 踩 0

2606.14790 2026-06-16 cs.PL cs.AI 交叉投稿

XFlow: An Executable Protocol Programming System for Reliable Multi-Agent Workflows

XFlow: 一个用于可靠多智能体工作流的可执行协议编程系统

Hanqi Li, Jing Peng, Zijian Wang, Lu Chen, Kai Yu

发表机构 * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China（X-LANCE实验室，计算机科学学院，上海交通大学，上海，中国）； Jiangsu Key Lab of Language Computing, Suzhou, China（江苏省语言计算重点实验室，苏州，中国）； Suzhou Laboratory, Suzhou, China（苏州实验室，苏州，中国）

AI总结提出XFlow可执行协议编程系统及其领域特定语言XPF，通过将工作流承诺从提示移至可检查、可执行的协议结构，并利用生命周期管理的符号中介智能体输出，提升多智能体工作流的可靠性。

详情

AI中文摘要

基于LLM的多智能体系统越来越多地协调规划、推理、工具使用和人类交互，但其可靠性仍然有限。这一局限的核心来源是未充分指定的提示-框架边界。当前系统缺乏原则性的方式来决定哪些工作流承诺应保留在提示中，哪些应成为框架结构。我们提出\textbf{XFlow}，一个用于可靠多智能体工作流的可执行协议编程系统，以及\textbf{XPF}（XFlow协议格式），其领域特定协议编程语言。XFlow位于纯提示编排和标记式工作流描述之间的中间位置。XPF保持可读性，作为文字协议，但被编译并作为程序执行。其设计将非正式语义工作保留在智能体内部，同时将选定的承诺转移到可检查、可维护和可执行的框架结构中。在运行时，XFlow通过生命周期管理的符号（具有验证和提交状态的类型化状态单元）来阶段化不确定性。智能体输出在被共享状态之前被中介，而不是通过提示、转录或隐式记忆传播。我们的实验涵盖约束交互、长上下文推理和智能体软件工程。它们表明，XFlow通过使约束、证据处理和处理需求显式且可执行，提高了可靠性。

英文摘要

LLM-based multi-agent systems increasingly coordinate planning, reasoning, tool use, and human interaction, yet their reliability remains limited. A central source of this limitation is the underspecified prompt--harness boundary. Current systems lack a principled way to decide which workflow commitments should remain in prompts and which should become harness structure. We present \textbf{XFlow}, an executable protocol programming system for reliable multi-agent workflows, and \textbf{XPF} (XFlow Protocol Format), its domain-specific protocol programming language. XFlow occupies a middle position between prompt-only orchestration and markup-like workflow descriptions. XPF remains readable as a literate protocol, but it is compiled and executed as a program. Its design keeps informal semantic work inside actors while moving selected commitments into harness structure that can be checked, preserved, and enforced. At runtime, XFlow stages uncertainty through lifecycle-governed symbols, which are typed state cells with validation and commit states. Actor outputs are mediated before they become shared state, instead of spreading through prompts, transcripts, or implicit memory. Our experiments cover Constrained Interaction, Long-Context Reasoning, and Agentic Software Engineering. They show that XFlow improves reliability by making constraints, evidence handling, and process requirements explicit and enforceable.

URL PDF HTML ☆

赞 0 踩 0

2606.14805 2026-06-16 cs.SE cs.AI 交叉投稿

Knowledge-Based Zero-Replay Debugging of Multi-Agent LLM Traces

基于知识的无重放多智能体LLM轨迹调试

Dong Ho Kang, Hyeonjeong Cha, Daein Weon

发表机构 * ustechlab.com（ustechlab）

AI总结提出一种知识图谱驱动的无重放预测方法，通过结构化事件知识图谱和轻量级预测器，在不执行重放的情况下定位高影响事件，将轨迹定位召回率从0.73提升至0.93。

Comments 21 pages, 1 figure, 6 tables. Submitted to Knowledge-Based Systems

详情

AI中文摘要

多智能体大语言模型（LLM）系统的可靠运行依赖于对长执行轨迹的调试，其中少数因果决定性事件被埋没在消息、路由、内存写入和工具调用的非结构化日志中。标准工具是反事实重放（回退、编辑并重新运行轨迹以衡量每个事件的影响），但其成本随候选事件数量线性增长，使得大规模穷举重放不可行。我们将轨迹调试视为基于知识的决策支持问题。每条轨迹被编译成一个结构化的知识图谱，涵盖路由、内存、工具使用、不确定性和潜在证据，并通过校准的预测器决定稀缺的重放预算应分配到哪里。我们不提出新的重放预言机；我们提出一种无需支付重放成本即可预测其结果的方法。我们形式化了无重放反事实效应预测：给定固定预算下的轨迹，在未执行任何重放前预测预言机会将哪些事件标记为高影响。BranchPoint-Latent 是一个轻量级预测器，基于知识图谱的可观测、结构、不确定性和潜在特征。通过针对37个轨迹族系的确定性重放预言机进行校准，单个学习排序梯度提升预测器在零预言机重放成本下，将留出族系的每轨迹定位（Branch Recall@5）从0.73提升至0.93。我们并非声称普遍优势，而是刻画了何时廉价图中心性足够、何时需要学习到的证据。最终成果是一个可审计、成本高效的AI可靠性调试决策支持系统，明确位于成本-精度前沿，并提供可复现的工件。

英文摘要

Reliable operation of multi-agent large language model (LLM) systems depends on debugging long execution traces, where the few causally decisive events are buried in unstructured logs of messages, routes, memory writes, and tool calls. The standard tool is counterfactual replay (rewind, edit, and re-run the trajectory to measure each event's effect), but its cost grows linearly with the number of candidate events, making exhaustive replay infeasible at scale. We frame trace debugging as a knowledge-based decision-support problem. Each trace is compiled into a structured event knowledge graph over routing, memory, tool-use, uncertainty, and latent evidence, and a calibrated predictor decides where a scarce replay budget should be spent. We do not propose a new replay oracle; we propose a method to predict its results without paying the replay cost. We formulate zero-replay counterfactual-effect prediction: given a trace under a fixed budget, predict which events the oracle would mark high-effect before any replay is performed. BranchPoint-Latent is a lightweight predictor over observable, structural, uncertainty, and latent features of the knowledge graph. Calibrated against a deterministic replay oracle across 37 trace families, a single learning-to-rank gradient-boosted predictor raises per-trace localization (Branch Recall@5) from 0.73 to 0.93 on held-out families at zero oracle-replay cost. Rather than claiming universal dominance, we characterize when cheap graph centrality suffices and when learned evidence is necessary. The result is an auditable, cost-efficient decision-support system for AI-reliability debugging, positioned explicitly on the cost-accuracy frontier with reproducible artifacts.

URL PDF HTML ☆

赞 0 踩 0

2606.15024 2026-06-16 cs.MA cs.AI cs.SY eess.SY 交叉投稿

Resilient Consensus in Agentic AI

智能体AI中的弹性共识

Sribalaji C. Anand, George J. Pappas

发表机构 * KTH（瑞典皇家理工学院）； University of Pennsylvania（宾夕法尼亚大学）

AI总结研究LLM智能体在多智能体系统中的共识问题，发现经典弹性共识理论在LLM智能体中失效，但结合经典滤波器可改善一致性。

详情

AI中文摘要

大型语言模型（LLM）智能体越来越多地部署在多智能体系统中，它们必须协调并达成共享决策。我们探究了为确定性智能体开发的经典弹性共识理论是否适用于可能表现对抗性的LLM智能体。将LLM协议视为拜占庭共识博弈，我们在完全和一般通信图上进行受控实验。我们发现，经过提示的LLM智能体无法达成原则上可实现的共识：即使在经典理论保证存在收敛算法的设置中，共识也可能失败，并且这种失败在不同温度和视野下持续存在。同时，用经典弹性共识滤波器包装智能体可改善一致性。滤波的益处取决于底层拓扑已提供的鲁棒性。我们的结果表明，经典弹性共识理论是智能体AI安全的有用视角。

英文摘要

Large language model (LLM) agents are increasingly deployed in multi-agent systems where they must coordinate and agree on shared decisions. We ask whether classical resilient consensus theory, developed for deterministic agents, transfers to LLM agents that may behave adversarially. Framing LLM agreement as a Byzantine consensus game, we run controlled experiments on complete and general communication graphs. We find that prompted LLM agents fail to reach agreement that is achievable in principle: consensus can fail even in settings where classical theory guarantees that a convergent algorithm exists, and this failure persists across temperatures and horizons. At the same time, wrapping the agents with classical resilient consensus filters improves agreement. The benefit of filtering depends on how much robustness the underlying topology already provides. Our results suggest that classical resilient consensus theory is a useful lens for the safety of agentic AI.

URL PDF HTML ☆

赞 0 踩 0

2606.15376 2026-06-16 cs.DC cs.AI cs.MA 交叉投稿

CoAgent: Concurrency Control for Multi-Agent Systems

CoAgent: 多智能体系统的并发控制

Hongtao Lyu, Dingyan Zhang, Mingyu Wu, Xingda Wei, Haibo Chen

发表机构 * Shanghai Jiao Tong University（上海交通大学）

AI总结针对多智能体LLM系统并发访问共享状态时的冲突问题，提出MTPO协议，通过智能体自身判断冲突并修复计划，实现可串行化执行，在保持近串行正确率的同时提升速度。

Comments 14 pages, 7 figures. Submitted to ATC 2026

详情

AI中文摘要

多智能体LLM系统——编码智能体、运维智能体、文档智能体——现在通常并行运行多个智能体，针对同一个git树、Kubernetes集群或文档。一旦其中两个智能体修改共享状态，它们就进入了经典并发控制研究了几十年的领域，但经典机制不适合LLM智能体。单个智能体事务跨越数分钟的推理，读集广泛且不透明而非静态可推断，智能体操作的实时状态既不允许分叉也不允许缓冲，因此写操作在执行时立即生效。锁会阻塞长时间的推理间隔；OCC的终止-重试会在每次冲突时丢弃数分钟的工作。\n本文基于经典事务缺乏的能力构建并发控制：每个智能体内的LLM可以判断冲突写入是否使其计划无效，并精确修复依赖于该写入的操作。因此控制变为建议性的：运行时通知，智能体修复。我们的协议MTPO（单调轨迹预排序）在启动时固定一个序列化顺序，为每次读取提供按顺序过滤的值，并原地推测性地应用写入；单向通知要求受影响的读取者重新判断并修补其计划，同时框架通过每个工具预先注册的saga式逆操作机械地撤销和重新排序错位的写入。在静止时，运行按预定顺序可串行化。我们将MTPO实现为CoAgent，一种工具调用中间件，其特权ToolSmith在线增长具有声明足迹和可撤销的工具。在十个有冲突的工作负载上，CoAgent在1.4倍加速和近串行令牌成本下保持5%以内的串行正确性，而2PL和OCC几乎放弃了所有并发增益；在纯bash目标系统上，它在线增长了一个25工具库，并将任务通过率从45/71提升到63/71，时间和成本分别为0.80倍和0.86倍。

英文摘要

Multi-agent LLM systems -- coding agents, devops agents, document agents -- now routinely run several agents in parallel against the same git tree, Kubernetes cluster, or document. As soon as two of them mutate shared state, they enter the regime classical concurrency control has studied for decades, but classical mechanisms fit LLM agents poorly. A single agent transaction spans minutes of inference, read sets are broad and opaque rather than statically inferable, and the live state agents act on admits neither fork nor buffer, so writes take effect the moment they execute. Locks block long inference intervals; OCC abort-and-retry discards minutes of work on every conflict. This paper builds concurrency control on a capability classical transactions lack: the LLM inside each agent can judge whether a conflicting write invalidates its plan, and can repair exactly the operations that depended on it. Control therefore turns advisory: the runtime informs, the agent repairs. Our protocol, MTPO (Monotonic Trajectory Pre-Order), fixes a serialization order at launch, serves each read the order-filtered value, and applies writes speculatively in place; a one-way notification asks an affected reader to re-judge and patch its plan, while the framework mechanically undoes and reorders misplaced writes through the saga-style inverse each tool registers in advance. At quiescence the run is serializable in the pre-decided order. We realize MTPO as CoAgent, toolcall middleware whose privileged ToolSmith grows footprint-declared, undoable tools online. On ten contended workloads, CoAgent stays within 5\% of serial correctness at a $1.4\times$ speedup and near-serial token cost, where 2PL and OCC surrender nearly all concurrency gains; on a bash-only target system, it grows a 25-tool library online and lifts the task pass rate from 45/71 to 63/71 at $0.80\times$ the time and $0.86\times$ the cost.

URL PDF HTML ☆

赞 0 踩 0

2606.15931 2026-06-16 cs.MA cs.AI 交叉投稿

DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts

DeepRoot: 一个基于知识图谱协调的多智能体系统，用于历史医学文本的治疗推理

Zijian Carl Ma, Sean J. Wang, Sijbren Kramer, Li Erran Li

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）； University of Cambridge（剑桥大学）； University of Toronto（多伦多大学）

AI总结提出DeepRoot多智能体系统，通过联合构建和利用验证知识图谱，将接地与推理分离并组合，从历史医学文本中恢复药物-疾病治疗关系，显著优于基线LLM和工具调用LLM。

详情

Journal ref: ICML 2026 GenBio; ACM CAIS 2026 Workshop AI Agents for Discovery in the Wild

AI中文摘要

历史医学档案和传统药物对药物发现具有巨大潜力，并且仍然是当前药物开发的主要来源。然而，前本体论的散文和特殊的分类法阻碍了数据的标准化和医学现代化，使其无法用于当前的生物医学流程。此外，现有的LLM智能体系统，无论是工具调用、检索增强还是智能体深度研究，都无法将此类文本转化为可验证的药物发现线索。我们通过DeepRoot填补了这一空白，这是一个多智能体LLM系统，它联合构建并利用一个验证知识图谱，表明接地和推理——这两个经常被混淆的概念——是系统可以组合用于治疗推理的可分离轴。应用于《神农本草经》，DeepRoot在R@20上恢复了21个保留化合物-疾病治疗对中的10个（47.6%，而原始语料库LLM为4.8%，随机约为2.4%），并且在推理质量上，在LLM作为评判的审计中优于基线LLM和直接通过工具调用访问DeepRoot自身查询的相同API的LLM。使用工具的LLM在87%的声明上产生幻觉证据，而DeepRoot为7-10%。仅图推理的幻觉率为0%，但在推理连贯性上排名最低；DeepRoot KG+LLM是唯一在两个轴上都获胜的条件，为系统挖掘和重新利用历史医学知识指明了一条道路。

英文摘要

Historical medical archives and traditional medicines hold immense potential for drug discovery and remain a primary source for current drug development. However, pre-ontological prose and idiosyncratic taxonomies prevent the standardization and medical modernization of the data for use in current biomedical pipelines. Furthermore, no existing LLM agent system, whether tool-calling, retrieval-augmented, or agentic deep-research, can convert such text into verifiable drug-discovery leads at scale. We close this gap with DeepRoot, a multi-agent LLM system that jointly builds and utilizes a verified knowledge graph, showing that grounding and reasoning -- often conflated -- are separable axes the system can compose for therapeutic reasoning. Applied to the Shen Nong Ben Cao Jing, DeepRoot recovers $10$ of $21$ held-out compound-disease treatment pairs at R@$20$ ($47.6\%$ vs $4.8\%$ for a raw corpus LLM and $\sim\!2.4\%$ random) and dominates an LLM-as-judge audit for reasoning quality over baseline LLMs and LLMs with direct tool-call access to the same APIs DeepRoot itself queries. Tool-using LLMs hallucinate evidence on $87\%$ of claims, versus 7-10% for DeepRoot. Graph-only inference hallucinates $0\%$ but ranks lowest on reasoning coherence; DeepRoot KG+LLM is the only condition to win on both axes, pointing toward a route for systematic mining and repurposing of historical medical knowledge.

URL PDF HTML ☆

赞 0 踩 0

2606.16326 2026-06-16 cs.GT cs.AI q-fin.RM 交叉投稿

Gaming-Resistant Insurance Contracts for Autonomous AI Agents: Strategy-Proof Toll Mechanism Design

自主AI代理的抗博弈保险合约：策略证明的通行费机制设计

Hao-Hsuan Chen

发表机构 * Hao-Hsuan Chen（何浩轩）

AI总结本文扩展了时间一致精算运行时的框架，使运营商策略化，刻画了自主AI代理保险合约的五种攻击空间，并证明了精算运行时的抗博弈性，通过新合约条款实现激励兼容。

Comments 29 pages. Companion to arXiv:2605.26508 (Paper A, foundations) and arXiv:2605.25632 (Paper B, empirical)

详情

AI中文摘要

论文A定义了一个时间一致的精算运行时，该运行时根据合约固定的安全默认值对每个产生副作用的行动定价，并针对储备预算门控执行。它将运营商视为被动。本文使运营商策略化。我们刻画了自主AI代理保险合约的五种攻击空间，并证明了精算运行时何时具有抗博弈性。两种攻击面——通行费后的安全默认选择以及边界内的行动分割——通过论文A的最小权限和无分割条款得以关闭。其余三种需要新的合约条款。首先，公共控制聚合防止跨边界重新路由将通行费降低到应用于总暴露的边界潜力以下。其次，接口故障（如无效JSON）是合约相关事件，而非安全胜利：将其视为零通行费安全默认值可能奖励不可靠的模型，而升级费用则逆转了激励。我们通过来自配套实证论文的跨模型轨迹验证了这一接口合规定理。第三，一个带有分量最小惩罚计划的模型身份菜单使得部署模型的真实报告成为弱占优策略。然后，我们将这些条款与论文A的运行时保证组合，以获得在五种攻击空间上的联合激励兼容性。最后，一个双参数保费族在真实均衡下满足了运营商个体理性和弱预算平衡。结果是为自主代理副作用的精算控制提供了一个激励兼容层。

英文摘要

Paper A defines a time-consistent actuarial runtime that prices each side-effect-bearing action against a contractually fixed safe default and gates execution against a reserve budget. It treats the operator as passive. This paper makes the operator strategic. We characterise a five-attack space for autonomous AI-agent insurance contracts and prove when the actuarial runtime is gaming-resistant. Two attack surfaces -- post-toll safe-default selection and within-boundary action splitting -- are closed by Paper A's minimal-authority and no-splitting clauses. The remaining three require new contract clauses. First, common-control aggregation prevents cross-boundary re-routing from reducing toll below the boundary potential applied to total exposure. Second, interface failures such as invalid JSON are contract-relevant events, not safety wins: treating them as zero-toll safe defaults can reward unreliable models, while escalation fees reverse the incentive. We validate this interface-compliance theorem on committed cross-model traces from the companion empirical paper. Third, a model-identity menu with a componentwise-minimum penalty schedule makes truthful reporting of the deployed model weakly dominant. We then compose these clauses with Paper A's runtime guarantees to obtain joint incentive compatibility over the five-attack space. Finally, a two-parameter premium family discharges operator individual rationality and weak budget balance at the truthful equilibrium. The result is an incentive-compatibility layer for actuarial control of autonomous-agent side effects.

URL PDF HTML ☆

赞 0 踩 0

2606.16428 2026-06-16 cs.CL cs.AI cs.HC 交叉投稿

EMS: 通过高效多数决后停止的多智能体投票

Yiqing Liu, Hantao Yao, Wu Liu, Yongdong Zhang

发表机构 * GitHub

AI总结提出EMS方法，通过可靠性感知调度和自适应增量投票，在保持多数投票准确性的同时，平均减少35%的智能体调用和44%的令牌消耗。

详情

AI中文摘要

多数投票是将多智能体响应聚合为最终决策的标准方法。然而，传统方法通常要求所有智能体在聚合开始前完成推理，导致大量计算开销，因为一旦达成多数共识，许多响应就变得冗余。在这项工作中，我们将高效的多智能体投票形式化为一个可靠性感知的智能体调度问题，并提出高效多数决后停止（EMS）以提高推理效率。EMS首先通过检索每个智能体在语义相似查询上的历史共识证据，估计其任务条件可靠性排序（TCRO），然后按可靠性降序调用智能体。接下来，自适应增量投票（AIV）在当前领先答案无法被剩余智能体的任何可能投票推翻时终止过程，并返回该答案。最后，可靠性历史更新（RHU）仅根据被调用智能体与最终决策的共识来更新它们。在五个基准上的广泛评估表明，EMS在保持多数投票准确性的同时，平均将调用的智能体数量减少了35%，令牌消耗减少了44%。代码可在以下网址获取：https://this https URL。

英文摘要

Majority voting is the standard for aggregating multi-agent responses into a final decision. However, traditional methods typically require all agents to complete their reasoning before aggregation begins, leading to significant computational overhead, as many responses become redundant once a majority consensus is achieved. In this work, we formulate efficient multi-agent voting as a reliability-aware agent scheduling problem and propose Efficient Majority-then-Stopping (EMS) to improve reasoning efficiency. EMS first estimates a Task-Conditioned Reliability Ordering (TCRO) for each agent by retrieving its historical consensus evidence on semantically similar queries, and then invoking agents in descending reliability order. Next, Adaptive Incremental Voting (AIV) terminates the process once the current leading answer cannot be overturned by any possible votes from the remaining agents, and returns this answer. Finally, Reliability History Updating (RHU) updates only the invoked agents according to their consensus with the final decision. Extensive evaluations across five benchmarks show that EMS preserves the accuracy of Majority Voting while reducing the average number of invoked agents by 35% and token consumption by 44%, respectively. The code is available at https://github.com/fuyu66/EMS.

URL PDF HTML ☆

赞 0 踩 0

2606.01365 2026-06-16 cs.AI 版本更新

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

多智能体LLM系统中浪费计算资源的早期诊断：基于故障感知的可观测性

Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出一种故障感知的可观测性框架，通过在线轨迹信号诊断多智能体LLM系统中的浪费计算，并在GAIA验证集上评估，揭示不同故障机制及其与资源消耗的关系。

详情

AI中文摘要

使用工具的多智能体大语言模型（LLM）系统在产生答案之前，通过模型令牌、工具调用、重试和代码执行来消耗计算资源。当运行失败时，最终答案评估揭示了终点，但通常无法揭示轨迹停止可恢复进展的时间点。本文引入了一个故障感知的可观测性框架，用于诊断多智能体LLM轨迹中的浪费计算。该框架将重复出现的故障模式映射到在线轨迹信号，包括工具可靠性、执行恢复、编排循环、证据可用性、信息变化和预算压力。我们在一个三智能体问答系统中实例化该框架，并在相同的执行上限下对165条GAIA验证轨迹进行评估。操作故障仍然常见：22/53的1级运行、33/86的2级运行和12/26的3级运行未能产生可用的最终答案。轨迹揭示了这些结果背后的不同机制，包括证据不足、重复动作循环、最大步数终止、工具故障连续以及成功执行但无有用输出的调用。平均令牌使用量从1级的8,152个令牌上升到3级的16,389个令牌，而证据可用性和句子级支持则出现分歧。一项缓存的10条轨迹LLM评判基础审计表明，廉价的在线信号和更深入的语义指标捕捉了故障的互补层面。结果将故障感知可观测性定位为原始执行日志与最终答案准确性之间的诊断层。

英文摘要

Failure-aware observability diagnoses wasted computation in multi-agent LLM systems before final-answer evaluation can explain what went wrong. We propose a trace-based framework for a three-agent architecture -- orchestrator, search agent, and execution agent -- that converts structured events into online signals for loops, budget pressure, low information gain, and tool instability, then adds offline semantic grounding metrics and selective LLM-as-judge evaluation. On 165 GAIA validation traces under identical caps, 98 runs produce usable final answers and 67 fail or stop without one. Among warned failed runs, 58.1% of tokens are spent after the first warning on average, indicating substantial opportunity for intervention. A 10-task Level-2 pilot uses warnings to diversify search or require evidence, reducing post-warning token fraction from 0.638 in the baseline to 0.304. The results support a layered design: cheap online signals help the orchestrator redirect or halt redundant behavior, while deeper semantic checks identify whether completed answers are grounded enough to trust.

URL PDF HTML ☆

赞 0 踩 0

2606.09039 2026-06-16 cs.AI 版本更新

Agent Economics: An Entropy-Controlled Pluralistic Alignment Framework for Preventing Artificial Hivemind in Autonomous Agents

Agent经济学：一种熵控制的多元对齐框架以防止自主智能体中的蜂群思维

Cheonsu Jeong

发表机构 * AX Center, SAMSUNG SDS（三星SDS AX中心）

AI总结提出行为协议框架(BPF)，通过心智化社会智能、多元对齐和可验证执行内核三个模块，在闭环架构中控制熵以保持策略多样性，提升自主智能体经济的稳定性、效率和可信度。

Comments 15 pages, 2 figures, 1 table

详情

AI中文摘要

本研究提出了行为协议框架（BPF），这是一个熵控制的多元对齐框架，旨在解决自主智能体经济中的两个关键挑战：由智能体间过度战略趋同引起的蜂群思维效应，以及自主决策过程中缺乏透明度。所提出的BPF由三个核心模块组成：基于心智理论的心智化社会智能（MbSI）、多元对齐（PA）和可验证执行内核（VEK）。这些模块有机地集成在一个闭环架构中，该架构控制着智能体行为从决策、执行到验证和反馈的整个生命周期。为了评估所提出的框架，将开发一个用Python实现的模拟环境和基于Streamlit的用户界面。通过实证实验，本研究旨在检验PA模块的熵控制机制能否有效保持智能体间的战略多样性并减轻集体趋同，同时VEK模块提供决策过程的全面且透明的审计追踪。预期结果将表明，所提出的框架能够同时增强自主智能体经济的稳定性、效率和可信度。因此，本研究为开发稳健、透明且可问责的智能体原生经济系统提供了一种实用方法。

英文摘要

This study proposes the Behavioral Protocol Framework (BPF), an entropy-controlled pluralistic alignment framework designed to address two critical challenges in autonomous agent economies: the hivemind effect arising from excessive strategic convergence among agents and the lack of transparency in autonomous decision-making processes. The proposed BPF consists of three core modules: Mentalizing-based Social Intelligence (MbSI) grounded in Theory of Mind (ToM), Pluralistic Alignment (PA), and a Verifiable Execution Kernel (VEK). These modules are organically integrated within a closed-loop architecture that governs the entire lifecycle of agent behavior, from decision-making and execution to verification and feedback. To evaluate the proposed framework, a simulation environment implemented in Python and a Streamlit-based user interface will be developed. Through empirical experimentation, the study aims to examine whether the entropy-control mechanism of the PA module can effectively preserve strategic diversity among agents and mitigate collective convergence, while the VEK module provides a comprehensive and transparent audit trail of the decision-making process. The anticipated results are expected to demonstrate that the proposed framework can simultaneously enhance the stability, efficiency, and trustworthiness of autonomous agent economies. Consequently, this research offers a practical approach for developing robust, transparent, and accountable agent-native economic systems.

URL PDF HTML ☆

赞 0 踩 0

2606.13003 2026-06-16 cs.AI cs.CL cs.MA 版本更新

The Illusion of Multi-Agent Advantage

多智能体优势的错觉

Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang, Yifei Ming, Zixuan Ke, Chengwei Qin, Giuseppe Carenini, Shafiq Joty

发表机构 * Salesforce Research（Salesforce研究院）； HKUST (Guangzhou)（香港科技大学（广州））； University of British Columbia（不列颠哥伦比亚大学）； Nanyang Technological University（南洋理工大学）

AI总结通过系统评估，发现自动生成的多智能体系统在性能和成本效率上均不如单智能体基线（如思维链自一致性），揭示了现有评估框架的缺陷和架构膨胀问题。

详情

AI中文摘要

普遍观点认为多智能体系统优于单智能体系统，其优势包括上下文保护、并行处理和分布式决策。然而，这一主张的经验支持主要依赖于与使用优先考虑孤立推理任务的基准测试的单智能体基线的比较，这些基准测试未能充分评估这些优势。我们专注于自动生成的多智能体系统（旨在比手动设计的系统具有更强的泛化能力），对单智能体系统（特别是思维链自一致性）进行了严格、系统的评估。在传统推理数据集和具有交互式多步骤工作流的任务（例如 BrowseComp-Plus）上，我们证明自动多智能体系统始终不如思维链自一致性，尽管其成本高达10倍。为了将这些失败与任务结构固有的局限性隔离开来，我们引入了一个为多智能体系统量身定制的诊断性合成数据集，该数据集具有显式任务分解、上下文分离和并行化潜力。我们表明，专家设计的多智能体系统在该数据集上的原始性能和成本效率方面始终优于自动生成的架构，这表明现有的评估框架未能考虑增加计算成本的边际效用，从而掩盖了复杂多智能体系统的关键架构缺陷和低效性。关键的是，对生成的多智能体系统架构的系统解构表明，当前的自动化设计范式产生了架构膨胀，优先考虑表面复杂性，但这并未转化为功能效用，暴露了与多智能体原则的根本性错位。

英文摘要

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.

URL PDF HTML ☆

赞 0 踩 0

2310.06555 2026-06-16 cs.CL cs.AI cs.LG cs.MA 版本更新

It's About Time: Temporal References in Emergent Communication

关于时间：涌现通信中的时间指代

Olaf Lipinski, Adam J. Sobey, Federico Cerutti, Timothy J. Norman

发表机构 * University of Southampton（索姆塞特大学）； The Alan Turing Institute（艾伦·图灵研究所）； University of Brescia（布雷西亚大学）

AI总结研究涌现通信中时间指代缺失问题，发现仅改变损失函数不足，需修改架构（分批方法）才能使时间指代涌现，95%以上代理成功，为提升通信效率奠定基础。

Comments 23 pages main body and 31 pages supplementary material, 9 figures in main body. Code available at https://github.com/olipinski/TRG

详情

DOI: 10.1613/jair.1.19795
Journal ref: Journal of Artificial Intelligence Research 86, Article 11 (June 2026)

AI中文摘要

涌现通信使代理能够开发定制语言以提高通信效率。尽管已知时间结构在自然语言中的重要性，但在涌现通信中尚无时间指代的证据。本文通过探索代理如何交流时间关系来填补这一空白。我们分析了时间指代涌现的三个潜在因素：环境因素、外部因素和架构因素。实验表明，仅改变损失函数不足以使时间指代涌现；相反，架构变化是必要的。代理架构的最小变化——使用不同的分批方法——允许时间指代涌现。在强调时间关系的时间指代游戏环境中，将此修改后的设计与标准架构进行比较。分析显示，超过95%使用修改后分批方法的代理发展出了时间指代，而无需改变其损失函数。我们认为时间指代对于未来提高代理通信效率是必要的，使未来代理能够使用更接近最优编码的方式，与纯组合语言相比。这些见解为将时间指代纳入其他涌现通信设置以及研究语言的其他方面提供了基础。

英文摘要

Emergent communication enables agents to develop bespoke languages that improve communication efficiency. Despite the known importance of temporal structure in natural language, there is no existing evidence of temporal references in emergent communication. This paper addresses this gap, by exploring how agents communicate about temporal relationships. We analyse three potential factors for the emergence of temporal references: environmental, external, and architectural. Our experiments demonstrate that altering the loss function is insufficient for temporal references to emerge; rather, architectural changes are necessary. A minimal change in agent architecture, using a different batching method, allows the emergence of temporal references. This modified design is compared with the standard architecture in a temporal referential games environment, which emphasises temporal relationships. The analysis shows that over 95% of the agents with the modified batching method develop temporal references, without changes to their loss function. We consider temporal referencing necessary for future improvements to the agents' communication efficiency, enabling future agents to use a closer to optimal coding as compared to purely compositional languages. These insights provide the basis for incorporation of temporal references into other emergent communication settings, and investigation of other aspects of language.

URL PDF HTML ☆

赞 0 踩 0

2602.05965 2026-06-16 cs.MA cs.AI 版本更新

Learning to Share: Selective Memory for Efficient Parallel Agentic Systems

学习共享：面向高效并行智能体系统的选择性记忆

Joseph Fioresi, Parth Parag Kulkarni, Ashmal Vayani, Song Wang, Mubarak Shah

AI总结提出LTS机制，通过强化学习训练控制器选择性共享跨团队中间信息，在减少并行智能体系统计算开销的同时保持或提升任务性能。

Comments ICML 2026

详情

AI中文摘要

智能体系统通过协调多个智能体迭代推理、调用工具和交换中间结果来解决复杂任务。为了提高鲁棒性和解决方案质量，最近的方法部署多个并行运行的智能体团队以探索多样化的推理轨迹。然而，并行执行带来了显著的计算成本：当不同团队独立推理相似子问题或执行类似步骤时，它们反复进行大量重叠计算。为了解决这些限制，本文提出学习共享（LTS），一种用于并行智能体框架的学习型共享记忆机制，能够在控制上下文增长的同时实现跨团队选择性信息重用。LTS引入了一个所有团队可访问的全局记忆库和一个轻量级控制器，该控制器决定是否将中间智能体步骤添加到记忆中。控制器使用带有使用感知信用分配的逐步强化学习进行训练，使其能够识别在并行执行中全局有用的信息。在AssistantBench和GAIA基准上的实验表明，与无记忆并行基线相比，LTS显著减少了总体运行时间，同时匹配或提高了任务性能，证明了学习型记忆准入是提高并行智能体系统效率的有效策略。项目页面：此https URL

英文摘要

Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. To improve robustness and solution quality, recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. However, parallel execution comes at a significant computational cost: when different teams independently reason about similar sub-problems or execute analogous steps, they repeatedly perform substantial overlapping computation. To address these limitations, in this paper, we propose Learning to Share (LTS), a learned shared-memory mechanism for parallel agentic frameworks that enables selective cross-team information reuse while controlling context growth. LTS introduces a global memory bank accessible to all teams and a lightweight controller that decides whether intermediate agent steps should be added to memory or not. The controller is trained using stepwise reinforcement learning with usage-aware credit assignment, allowing it to identify information that is globally useful across parallel executions. Experiments on the AssistantBench and GAIA benchmarks show that LTS significantly reduces overall runtime while matching or improving task performance compared to memory-free parallel baselines, demonstrating that learned memory admission is an effective strategy for improving the efficiency of parallel agentic systems. Project page: https://joefioresi718.github.io/LTS_webpage/

URL PDF HTML ☆

赞 0 踩 0

2603.01131 2026-06-16 cs.MA cs.AI 版本更新

MedCollab: IBIS-Guided Multi-Agent Collaboration with Hierarchical Disease Relation Chains for Clinical Diagnosis

MedCollab：基于IBIS引导的多智能体协作与分层疾病关系链的临床诊断

Yuqi Zhan, Xinyue Wu, Tianyu Lin, Yutong Bao, Xiaoyu Wang, Weihao Cheng, Huangwei Chen, Feiwei Qin, Zhu Zhu

发表机构 * Princeton University（普林斯顿大学）； Springer Heidelberg（斯普林格海德堡）； ABC Institute（ABC研究所）； Rupert-Karls-University Heidelberg（海德堡鲁珀特-卡尔大学）； Hangzhou Dianzi University（杭州电子科技大学）； Zhejiang University（浙江大学）； Children’s Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Children and Adolescents’ Health and Diseases（浙江大学医学院儿童医院，国家儿童青少年健康与疾病临床研究中心）

AI总结提出MedCollab框架，通过IBIS结构化论证和分层疾病关系链（HDRC）增强多智能体协作，提升临床诊断的准确性、可追溯性和报告质量。

详情

AI中文摘要

大型语言模型（LLM）在临床诊断中展现出潜力，但仍受限于不可靠的报告生成、薄弱的证据基础和 opaque 推理。我们提出MedCollab，一个基于IBIS引导的多智能体框架，用于全周期临床诊断和诊断报告生成。模拟医院会诊，MedCollab从患者记录中动态招募专科和检查智能体。每个诊断假设通过基于问题的信息系统（IBIS）结构化为证据关联的论点，提高可追溯性和可审计性。MedCollab进一步构建分层疾病关系链（HDRC），将接受的假设组织成具有临床意义的病理和共病关系。一个验证器引导的共识模块审计推理质量，检测矛盾，并在多轮中更新智能体权重。在ClinicalBench和MIMIC-IV上的实验表明，MedCollab在诊断准确性、科室路由、证据一致性和报告质量方面优于强大的LLM和医学多智能体基线。这些结果表明，结构化论证和疾病关系建模可以提高基于LLM的诊断的可靠性、透明度和临床连贯性。

英文摘要

Clinical diagnosis is a gradual process of evidence integration, in which physicians move from symptoms and medical history to examinations, competing hypotheses, disease relations, and treatment decisions. Large language models have advanced medical text understanding and generation. Yet their clinical use remains limited by weak evidence grounding, opaque reasoning, and inconsistent links among differential diagnosis, final diagnosis, diagnostic basis, and treatment planning. We introduce MedCollab, a multi-agent framework for full-cycle clinical diagnosis and report generation. MedCollab coordinates specialist and examination agents according to patient records. It structures agent deliberation with an Issue-Based Information System (IBIS) protocol, so that each diagnostic position is supported by patient-specific evidence and medical knowledge. It also builds Hierarchical Disease Relation Chains (HDRC) to connect accepted hypotheses through progression, complication, and comorbidity relations. During multi-round deliberation, a verifier-guided consensus module evaluates evidence support, medical plausibility, and logical conflicts. It then adjusts agent contributions and filters unsupported reasoning. Experiments on ClinicalBench and MIMIC-IV show that MedCollab outperforms leading LLMs and medical multi-agent baselines in diagnostic accuracy, evidence consistency, and clinical reasoning quality. These results indicate that structured and auditable collaboration can produce more faithful and clinically coherent diagnostic reports.

URL PDF HTML ☆

赞 0 踩 0

2604.09679 2026-06-16 cs.MA cs.AI 版本更新

基于编译的多智能体路径规划中的未分配智能体

Pavel Surynek

发表机构 * Faculty of Information Technology, Czech Technical University in Prague（布拉格捷克理工大学信息技术学院）

AI总结针对未分配智能体的多智能体路径规划问题，提出基于SAT的编译方法，通过SMT-CBS和NRF-SAT求解器实现。

详情

AI中文摘要

基于编译的技术代表了多智能体路径规划（MAPF）求解器的一个重要流派，因其模块化和对问题非标准变体的适应性。在标准MAPF中，任务是引导所有智能体从初始位置无碰撞地到达给定的个体目标位置，而使用不同智能体要求的变体也具有相关性。这种变体是带有未分配智能体的MAPF（UA-MAPF），其中一些智能体与标准MAPF具有相同的设置（有初始位置和目标），而其余智能体只有初始位置但没有目标——未分配智能体。尽管未分配智能体不需要到达任何目标位置，但如有必要，它们必须为标准智能体让路，这构成了一个特定的挑战。我们在本文中表明，UA-MAPF可以表达为基于编译的MAPF技术，这些技术基于将问题表述为布尔可满足性，具体地，我们改编了SMT-CBS和NRF-SAT，这两种基于反例引导抽象精化和非精化抽象的最新求解器。

英文摘要

Compilation-based techniques represent an important stream of solvers for multi-agent path finding (MAPF) due to their modularity and adaptability for non-standard variants of the problem. While in the standard MAPF the task is to navigate all agents from their initial positions to given individual goal positions without any collision, variants where a different requirement for agents is used are also relevant. Such a variant is MAPF with unassigned agents (UA-MAPF) where some agents have the same setting as in the standard MAPF with initial positions and goals while the remaining agents have the initial position but have no goal - unassigned agents. Despite unassigned agent do not need to reach any goal position they have to be moved out of the way of the standard agents if needed which represent a specific challenge. We show in this paper that UA-MAPF can be expressed in recent compilation-based techniques for MAPF based on formulating the problem as Boolean satisfiability, namely we adapt SMT-CBS and NRF-SAT, the recent solvers based on counterexample guided abstraction refinement and non-refined abstractions.

URL PDF HTML ☆

赞 0 踩 0

2606.16329 2026-06-16 cs.AI 新提交

Exploiting Search in Symbolic Numeric Planning with Patterns

利用模式在符号数值规划中进行搜索

Matteo Cardellini, Enrico Giunchiglia

发表机构 * DIBRIS, University of Genoa（热那亚大学DIBRIS）

AI总结提出基于符号模式规划(SPP)的数值规划过程，通过动态重计算模式并利用中间状态引导搜索，提高规划效率。

Comments Under Review at the Journal of Artificial Intelligence Research

详情

AI中文摘要

在本文中，我们提出了一种基于符号模式规划(SPP)的数值规划过程。给定一个数值规划问题 $Π$，一个模式 $\prec$ 是一个动作序列，用于定义一个公式，该公式编码了从起始状态 $S$ 可执行的 $\prec$ 的子序列。Cardellini, Giunchiglia, 和 Maratea (2024a) 遵循规划作为可满足性的方法，在每一步 $n \ge 0$ 定义一个公式 $Π^\prec_n$，其中 $(i)$ 模式 $\prec$ 仅在 $n=0$ 时在 $Π$ 的初始状态 $I$ 中计算，然后在每一步 $n$ 中被利用，$(ii)$ 起始状态 $S$ 设置为 $I$，$(iii)$ 目标集 $G$ 要求在通过将 $\prec$ 的子序列连接 $n$ 次所能达到的最后一个状态中成立。该过程从 $n=0$ 开始，一旦 $Π^\prec_n$ 可满足则终止，否则递增 $n$ 继续。在本文中，可能在每一步，$(i)$ 我们符号化地搜索一个从 $I$ 可达的中间状态 $P$，该状态更接近目标状态，$(ii)$ 动态重计算模式 $\prec_h$ —— 用于下一步 —— 在 $P$ 中，$(iii)$ 精炼用于到达 $P$ 的模式 $\prec_g$，以及 $(iv)$ 从状态 $S$ 开始新的搜索，$S$ 可以是初始状态 $I$ 或最后计算的中间状态 $P$，利用计算出的模式 $\prec_g$ 和 $\prec_h$ 来定义搜索中使用的模式 $\prec$。特别地，在每一步，我们定义一个公式 $Π^{\prec}_{S,P}$，编码存在一个状态 $P'$ 比 $P$ 更接近目标状态，且 $P'$ 从起始状态 $S$ 使用模式 $\prec$ 可达。我们提出了不同的技术来生成这样的公式，每种技术对应一种不同的搜索空间探索策略。我们证明了它们的正确性和完备性，后者在一定条件下成立。

英文摘要

In this paper, we present a procedure for numeric planning based on Symbolic Pattern Planning (SPP). Given a numeric planning problem $Π$, a pattern $\prec$ is a sequence of actions used to define a formula encoding the subsequences of $\prec$ executable from a starting state $S$. Cardellini, Giunchiglia, and Maratea (2024a) follow the Planning as Satisfiability approach by defining, at each step $n \ge 0$, a formula $Π^\prec_n$ in which $(i)$ the pattern $\prec$ is computed only for $n=0$ in the initial state $I$ of $Π$, and then exploited at each step $n$, $(ii)$ the starting state $S$ is set to $I$, and $(iii)$ the set $G$ of goals is required to hold in the last state that can be reached by one of the subsequences of $\prec$ concatenated $n$ times. The procedure begins with $n=0$, terminates as soon as $Π^\prec_n$ is satisfiable, and otherwise proceeds by incrementing $n$. In this paper, possibly at each step, $(i)$ we symbolically search for an intermediate state $P$ reachable from $I$, closer to a goal state, $(ii)$ dynamically recompute the pattern $\prec_h$ -- to be used in the next step -- in $P$, $(iii)$ refine the pattern $\prec_g$ used to reach $P$, and $(iv)$ start the new search from the state $S$ which can be either the initial state $I$ or the last computed intermediate state $P$, exploiting the computed patterns $\prec_g$ and $\prec_h$ to define the pattern $\prec$ to be used in the search. In particular, at each step, we define a formula $Π^{\prec}_{S,P}$ encoding the existence of a state $P'$ closer than $P$ to a goal state, with $P'$ reachable from the starting state $S$ when using the pattern $\prec$. We present different techniques for producing such formulas, each corresponding to a different strategy for exploring the search space. We prove their correctness and completeness, the latter under certain conditions.

URL PDF HTML ☆

赞 0 踩 0

2606.16567 2026-06-16 cs.AI cs.LG cs.SY eess.SY math.DS 新提交

TNODEV: Toolbox for Neural ODE Verification

TNODEV: 神经ODE验证工具箱

Abdelrahman Sayed Sayed, Pierre-Jean Meyer, Mohamed Ghazel

发表机构 * Univ Gustave Eiffel, COSYS-ESTAS（古斯塔夫·埃菲尔大学，COSYS-ESTAS实验室）

AI总结提出TNODEV，首个集成伪造检查、区间可达性、验证循环和并行调度的神经ODE形式验证器，支持安全集包含和分类鲁棒性验证。

Comments 29 pages, 7 figures, Under review in TMLR

详情

基于边干预的有向无环图特征归因

Qiheng Sun, Junxu Liu, Xiaokai Mao, Haocheng Xia, Jinfei Liu, Kui Ren, Haibo Hu

发表机构 * Zhejiang University（浙江大学）； Zhejiang Lab（之江实验室）； Hong Kong Polytechnic University（香港理工大学）

AI总结针对现有特征归因方法无法同时捕获特征外部性和外生影响的问题，提出基于边干预的DAG-SHAP方法，将每条特征边作为归因对象，并引入近似计算方法，实验验证其有效性。

详情

AI中文摘要

基于Shapley值的特征归因方法在涉及复杂特征交互和因果关系的场景中面临挑战，即使提供了因果结构。现有方法通常采用节点中心视角，仅将重要性归因于单个特征。因此，它们往往无法同时捕获特征的外部性和外生影响，导致不合理的解释。为克服这些限制，我们提出一种新的基于边干预的特征归因方法DAG-SHAP。DAG-SHAP将每条特征边作为单独的归因对象，确保特征的外部性和外生贡献都被适当捕获。此外，我们引入了一种近似方法以高效计算DAG-SHAP。在真实和合成数据集上的大量实验验证了DAG-SHAP的有效性。我们的代码可在https://github.com/ZJU-DIVER/DAG-SHAP获取。

英文摘要

Shapley value-based feature attribution methods face challenges in scenarios involving complex feature interactions and causal relationships, even when a causal structure is provided. Existing methods typically adopt a node-centric view, attributing importance solely to individual features. Consequently, they often fail to simultaneously capture the externality and exogenous influence of features, leading to unreasonable interpretations. To overcome these limitations, we propose a novel feature attribution method called DAG-SHAP, which is based on edge intervention. DAG-SHAP treats each feature edge as an individual attribution object, ensuring that both externality and exogenous contributions of features are appropriately captured. Additionally, we introduce an approximation method for efficiently computing DAG-SHAP. Extensive experiments on both real and synthetic datasets validate the effectiveness of DAG-SHAP. Our code is available at https://github.com/ZJU-DIVER/DAG-SHAP.

URL PDF HTML ☆

赞 0 踩 0

2606.15447 2026-06-16 cs.AI 新提交

Hierarchical Modeling of ICD Codes in EHR Foundation Models

EHR基础模型中ICD码的分层建模

Megha Thukral, Dong Gyun Kang, Rudra Pratap Singh, Shruthi Kashinath Hiremath, Katrin Hänsel, Thomas Plötz

发表机构 * School of Interactive Computing, Georgia Institute of Technology（佐治亚理工学院交互计算学院）； Optum AI

AI总结研究利用ICD-10-CM层次结构作为归纳偏置，通过序列增强和图注入两种机制改进EHR表示学习，实验表明显式编码层次结构在域内和跨数据集任务中均优于扁平表示。

详情

AI中文摘要

电子健康记录基础模型通常将ICD诊断码视为扁平标记，忽略了捕获疾病家族、子类别和细粒度诊断细节的临床上有意义的层次结构。因此，现有的EHR表示学习方法并未明确利用编码系统中已有的层次结构。在这项工作中，我们研究ICD-10-CM层次结构作为临床表示学习的一般归纳偏置。我们研究了两种互补的机制来融入层次结构：首先，通过在BERT风格的transformer中向诊断序列添加对应于ICD层次不同级别的标记；其次，通过结合诊断共现结构的层次感知边将层次结构注入基于图的代码表示中。在这些设置下，我们评估显式层次结构是否改进了下游预测、层次结构的哪些级别最有用、层次编码是否改善了跨数据集的迁移，以及层次结构如何重塑嵌入相似性结构。我们在两个大规模真实世界临床数据集上进行了实验：MIMIC-IV（用于预训练和域内评估）和eICU（用于通过冻结编码器探测评估跨数据集迁移）。我们的发现表明，显式编码ICD层次结构在域内和跨数据集设置中均优于扁平代码表示，同时揭示了最有效的层次级别取决于任务和建模方法。更广泛地说，我们专注于层次感知的EHR表示学习，并表明编码层次结构的好处可泛化到不同的建模设置和层次级别。

英文摘要

Electronic health record foundation models typically treat ICD diagnosis codes as flat tokens, overlooking the clinically meaningful hierarchical structure that captures disease families, subcategories, and fine-grained diagnostic detail. As a result, existing EHR representation learning methods do not explicitly exploit the hierarchical structure already present in the coding system. In this work, we study ICD-10-CM hierarchy as a general inductive bias for clinical representation learning. We investigate two complementary mechanisms for incorporating hierarchy: first, by augmenting diagnosis sequences in a BERT-style transformer with tokens corresponding to different levels of the ICD hierarchy, and second, by injecting hierarchy into graph-based code representations through hierarchy-aware edges combined with diagnosis co-occurrence structure. Across these settings, we evaluate whether explicit hierarchy improves downstream prediction, which levels of the hierarchy are most useful, whether hierarchy encoding improves transfer across datasets, and how hierarchy reshapes embedding similarity structure. We conduct experiments on two large-scale real-world clinical datasets: MIMIC-IV, used for pretraining and in-domain evaluation, and eICU, used to assess cross-dataset transfer via frozen encoder probing. Our findings show that explicitly encoding ICD hierarchy improves over flat code representations in both in-domain and cross-dataset settings, while revealing that the most useful level of hierarchy depends on both the task and the modeling approach. More broadly, we focus on hierarchy-aware EHR representation learning and show that the benefits of encoding hierarchy are generalizable across modeling settings and hierarchy levels.

URL PDF HTML ☆

赞 0 踩 0

2606.15841 2026-06-16 cs.AI 新提交

Heteroskedastic Signals in Budgeted LLM Verification: Structural Heterogeneity Limits Optimization Gains

预算受限LLM验证中的异方差信号：结构异质性限制了优化收益

Jinlong Yang

发表机构 * Northwestern Polytechnical University（西北工业大学）

AI总结本文发现LLM不确定性信号在预算受限验证中存在异方差性，导致全局分配扭曲；通过分层阈值干预（CST）在强异质性设置下提升命中率达17个百分点，揭示结构异质性是主要瓶颈。

详情

AI中文摘要

大型语言模型（LLM）系统越来越多地使用不确定性信号来在验证、测试时扩展、工具执行和其他选择性计算决策中分配有限的计算资源。此类策略依赖于一个全局信号可比性假设：相等的分数应在不同输入中携带可比的决策价值。使用预算受限验证作为受控诊断设置，我们识别出该假设的一种失效模式：不确定性质量在成本分层上是异方差的，某些区域尽管集中了大量错误，却表现出近乎随机的可区分性。在一个显式的局部模型下，我们刻画了由此导致的全局分配扭曲，并表明其上界随跨层信号质量离散度而缩放。我们通过一个受控干预层级（阈值、MP-Adapt、MP-Strat以及一个故意简单的成本分层阈值干预CST）将弱信号、优化不稳定性和结构异质性分离开来。在MBPP和MATH上使用Qwen3-8B、LLaMA3-8B和GPT-4o-mini的实验表明，全局在线自适应相对于静态阈值化产生不一致的收益；MP-Strat部分恢复了性能，而CST在强异质性设置下无需梯度更新即可将命中率提升高达17个百分点。这些结果表明，在所观察的设置中，结构异质性（而非仅优化器弱点）是主要瓶颈。更广泛地说，错位的反馈结构并不总能通过更强的优化来修复。

英文摘要

Large language model (LLM) systems increasingly use uncertainty signals to allocate limited computation across verification, test-time scaling, tool execution, and other selective-compute decisions. Such policies rely on a \emph{global signal comparability assumption}: equal scores should carry comparable decision value across inputs. Using budgeted verification as a controlled diagnostic setting, we identify a failure mode of this assumption: uncertainty quality is heteroskedastic across cost strata, with some regions exhibiting near-random discriminability despite concentrating many errors. Under an explicit local model, we characterize the resulting distortion of global allocation and show that its upper bound scales with cross-stratum signal-quality dispersion. We separate weak signals, optimization instability, and structural heterogeneity through a controlled intervention hierarchy: Threshold, MP-Adapt, MP-Strat, and a deliberately simple cost-stratified thresholding intervention (CST). Across MBPP and MATH using Qwen3-8B, LLaMA3-8B, and GPT-4o-mini, global online adaptation yields inconsistent gains over static thresholding; MP-Strat partially recovers performance, while CST improves hit rate by up to 17 percentage points in strongly heterogeneous settings without gradient updates. These results identify structural heterogeneity, rather than optimizer weakness alone, as the primary bottleneck in the observed settings. More broadly, misaligned feedback structure cannot always be repaired by stronger optimization.

URL PDF HTML ☆

赞 0 踩 0

2606.16140 2026-06-16 cs.AI cs.CL 新提交

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

VibeThinker-3B：探索小型语言模型中可验证推理的前沿

Sen Xu, Shixi Liu, Wei Wang, Jixin Min, Yingwei Dai, Zhibin Yin, Yirong Chen, Xin Zhou, Junlin Zhang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出3B参数紧凑模型VibeThinker-3B，通过频谱到信号后训练范式（课程SFT、多域强化学习、离线自蒸馏）在可验证推理任务上达到前沿性能，匹配甚至超越大模型，并验证推理增强不损害指令可控性。

详情

AI中文摘要

本技术报告介绍了VibeThinker-3B，一个具有3B参数的紧凑密集模型，旨在探究在严格的小模型范围内可验证推理能推进到何种程度。基于频谱到信号后训练范式，我们通过优化的流程系统性地增强模型，该流程包括基于课程的监督微调、多域强化学习和离线自蒸馏。实验评估表明，VibeThinker-3B在高度要求的可验证任务上达到了前沿水平。具体来说，它在AIME26上获得94.3分（通过声明级测试时缩放提升至97.1），在LiveCodeBench v6上获得80.2的Pass@1，并在最近的未见LeetCode竞赛中表现出强大的分布外泛化能力，接受率达96.1%。这有效地将其置于一流推理系统的性能区间，匹配或超越规模大数个数量级的旗舰模型，如DeepSeek V3.2、GLM-5和Gemini 3 Pro。此外，IFEval上的93.4分证实了这种极端的推理增强并未损害严格的指令可控性。扩展我们之前的1.5B工作，这些发现推动了参数压缩-覆盖假说，该假说将可验证推理视为可压缩到紧凑推理核心中，而开放域知识和通用能力则需要广泛的参数覆盖事实、概念和长尾场景。这一观点表明，紧凑模型不仅是部署高效的替代品，更是通往参数密集能力领域前沿性能的互补路径。

英文摘要

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling), an 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization with a 96.1\% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.

URL PDF HTML ☆

赞 0 踩 0

2606.16152 2026-06-16 cs.AI 新提交

The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning

质量-效用悖论：为什么高奖励数据会损害小模型的数学推理

Haolong Qian, Xianliang Yang, Yinuo ma, Lirong Che, Feng Lu, Ye Guo, Lei Song, Jiang Bian, Chun Yuan

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结发现数学推理蒸馏中的质量-效用悖论：Oracle精炼的高奖励数据因分布漂移增加适应成本，反而不如SLM自生成数据；提出风格对齐精炼方法恢复效用。

Comments Accepted at ICML 2026

详情

AI中文摘要

从强大推理模型进行知识蒸馏被广泛用于提升小语言模型（SLM）的数学推理能力，通常假设奖励模型得分更高的轨迹能提供更有用的监督。我们在数学推理蒸馏中发现了一个反直觉的\textbf{质量-效用悖论}。由更强Oracle精炼或合成的数据根据奖励模型获得更高的感知质量，但在Qwen2.5、LLaMA-3和DeepSeek系列中，其表现始终不如SLM自身生成并通过拒绝采样选择的轨迹。我们的分析表明，Oracle精炼将逻辑修复与偏离SLM原生推理分布的分布漂移相结合。这种漂移增加了学习者的适应成本，可能抵消改进推理逻辑带来的收益。为验证这一机制，我们引入\textbf{风格对齐精炼}，在保留Oracle逻辑修复的同时保持SLM的原生轨迹。这种干预降低了适应成本并恢复了下游效用。这些发现表明，有效的数学推理蒸馏应联合优化感知解质量和学习者-数据兼容性，而非仅依赖奖励模型得分。数据集和代码见https://github.com/Dracoqhl/Quality-Utility-Paradox。

英文摘要

Knowledge distillation from powerful reasoning models is widely used to improve Small Language Models (SLMs) on mathematical reasoning, often assuming that traces with higher reward model scores provide more useful supervision. We identify a counterintuitive \textbf{Quality-Utility Paradox} in mathematical reasoning distillation. Data refined or synthesized by a stronger Oracle obtains higher perceived quality according to reward models, yet consistently underperforms traces generated by the SLM itself and selected through rejection sampling across Qwen2.5, LLaMA-3, and DeepSeek families. Our analysis shows that Oracle refinement couples logical repair with distributional drift away from the SLM's native reasoning distribution. This drift increases the learner's adaptation cost and can outweigh the benefit of improved reasoning logic. To test this mechanism, we introduce \textbf{Style-Aligned Refinement}, which preserves the native trajectory of the SLM while retaining logical repair from the Oracle. This intervention lowers adaptation cost and restores downstream utility. These findings suggest that effective mathematical reasoning distillation should jointly optimize perceived solution quality and learner-data compatibility, rather than relying solely on reward-model scores. The datasets and code are available at https://github.com/Dracoqhl/Quality-Utility-Paradox.

URL PDF HTML ☆

赞 0 踩 0

2606.16210 2026-06-16 cs.AI 新提交

Sensor-Conditioned Representation Learning via Scene-Relevant Observation Quotients

基于场景相关观测商的传感器条件表示学习

Yan Jiao, Pin-Han Ho, Limei Peng

发表机构 * Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China（电子科技大学深圳高等研究院）； Department of Electrical and Computer Engineering, University of Waterloo（滑铁卢大学电气与计算机工程系）； School of Computer Science and Engineering, Kyungpook National University（庆北国立大学计算机科学与工程学院）

AI总结提出场景相关观测商作为表示目标，通过OQ-TSAE框架分解场景与干扰因子，在传感器条件下保持可区分性，优于重建、度量学习和对比学习基线。

详情

AI中文摘要

智能传感系统中的学习表示通常通过重建保真度或下游预测精度来评估，但这些标准并未指定哪些潜在区分是由传感过程证明合理的。在传感器条件环境中，干扰因素可以在不改变场景的情况下改变测量值，而不同的场景在有限的传感能力下可能无法区分。本文形式化了传感器条件表示的正确性，即在抑制干扰引起的和传感器不支持的变化的同时，保留传感支持的场景区分。我们引入了场景相关观测商，一种由传感支持的可区分性在干扰规范化后诱导的表示目标，并开发了观测商塔克结构自编码（OQ-TSAE），一种具有假区分、假合并、干扰敏感性和潜在排序一致性诊断的场景-干扰因子分解框架。在受控基准上的实验表明，商一致监督在表示正确性诊断上优于面向重建、度量学习和对比学习的基线。敏感性、扰动和消融研究显示了商对齐监督、可靠商关系和商几何的重要性。互补的真实雷达实验表明，仅重建的OQ-TSAE变体保留了竞争性的下游效用、观测退化下的鲁棒性和低种子间变异性。这些结果表明，传感器条件表示不仅应通过预测效用评估，还应通过其潜在几何是否保留传感证明的场景区分来评估。

英文摘要

Learned representations in intelligent sensing systems are often evaluated by reconstruction fidelity or downstream prediction accuracy, but these criteria do not specify which latent distinctions are justified by the sensing process. In sensor-conditioned environments, nuisance factors can change measurements without changing the scene, while distinct scenes may be indistinguishable under limited sensing capability. This paper formulates sensor-conditioned representation correctness as preserving sensing-supported scene distinctions while suppressing nuisance-induced and sensor-unsupported variation. We introduce the scene-relevant observation quotient, a representation target induced by sensing-supported distinguishability after nuisance canonicalization, and develop Observation-Quotient Tucker-Structured Autoencoding (OQ-TSAE), a scene-nuisance factorized framework with diagnostics for false distinction, false merge, nuisance sensitivity, and latent ordering consistency. Experiments on a controlled benchmark show that quotient-consistent supervision improves representation-correctness diagnostics over reconstruction-oriented, metric-learning, and contrastive-learning baselines. Sensitivity, perturbation, and ablation studies show the importance of quotient-aligned supervision, reliable quotient relations, and quotient geometry. Complementary real-radar experiments show that a reconstruction-only OQ-TSAE variant retains competitive downstream utility, robustness under observation degradation, and low seed-to-seed variability. These results suggest that sensor-conditioned representations should be evaluated not only by predictive utility, but also by whether their latent geometry preserves sensing-justified scene distinctions.

URL PDF HTML ☆

赞 0 踩 0

2606.16222 2026-06-16 cs.AI cs.LG 新提交

Latent Thought Flow: Efficient Latent Reasoning in Large Language Models

潜在思维流：大型语言模型中的高效潜在推理

Xiandong Zou, Jing Huang, Jianshu Li, Pan Zhou

发表机构 * Singapore Management University（新加坡管理大学）； Ant Group（蚂蚁集团）

AI总结提出Latent Thought Flow (LTF)方法，将推理建模为可变长度连续轨迹，通过连续GFlowNet训练采样器匹配奖励后验，在提升准确率9.5%的同时平均减少推理长度27.2%。

详情

AI中文摘要

大型语言模型（LLMs）越来越依赖中间推理，然而显式的思维链（CoT）存在语言空间瓶颈：每个思维必须解码为token，导致高推理开销。潜在推理将思考过程转移到连续空间，但现有方法大多学习确定性或奖励最大化路径，缺乏在具有不同正确性和成本的轨迹间分配概率的原则性方法。我们提出潜在思维流（LTF），将推理建模为可变长度连续轨迹，并训练采样器以匹配由答案质量和计算成本定义的奖励诱导后验。我们使用具有随机潜在转移的连续GFlowNet实例化该方法。为处理稀疏答案监督，我们引入熵加权子轨迹平衡目标以获取中间奖励，以及参考先验正则化器以锚定探索。在微调和迁移学习设置下的实验表明，与强潜在推理基线相比，LTF在平均减少推理长度27.2%的同时，准确率提升9.5%，优于显式CoT和潜在推理基线。

英文摘要

Large Language Models (LLMs) increasingly rely on intermediate reasoning, yet explicit Chain-of-Thought (CoT) suffers from a linguistic space bottleneck: each thought must be decoded into tokens, causing high inference overhead. Latent reasoning moves deliberation into continuous space, but existing methods mostly learn deterministic or reward-maximizing paths, lacking a principled way to allocate probability across trajectories with different correctness and costs. We propose Latent Thought Flow (LTF), which models reasoning as variable-length continuous trajectories and trains a sampler to match a reward-induced posterior over answer quality and computation cost. We instantiate this with a continuous GFlowNet using stochastic latent transitions. To handle sparse answer supervision, we introduce an Entropy-Weighted Subtrajectory Balance objective for intermediate rewards and a reference-prior regularizer to anchor exploration. Experiments under finetuning and transfer learning settings show that LTF outperforms explicit CoT and latent reasoning baselines, improving accuracy by 9.5% while reducing reasoning length by 27.2% on average compared with strong latent reasoning baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.16501 2026-06-16 cs.AI 新提交

Post-Hoc Merging is Not Enough: Many-Shot Model Merging with Loss-Gap Balancing

事后合并是不够的：基于损失差距平衡的多轮模型合并

Kyungjin Im, Miru Kim, Chanin Eom, Minhae Kwon

AI总结提出METIS方法，通过迭代多轮合并和任务损失差距加权，解决多任务模型合并中的信息擦除问题，显著提升最差任务性能。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

模型合并已成为一种实用的训练后策略，通过组合多个任务专用模型来构建单一的多任务大语言模型（LLM）。然而，大多数现有方法依赖于事后合并，即任务专用模型在训练后仅合并一次。这种一次性聚合常常遭受任务干扰，导致跨单个任务的信息擦除。在这项工作中，我们表明用迭代的多轮合并协议取代事后合并能有效提升多任务性能。基于这一见解，我们提出了METIS（Mitigating Erasure from Task Interference for Stable many-shot merging），一种损失感知的多轮合并方法，通过任务级损失差距加权和基于共识的掩码来解决事后合并中的信息擦除问题。值得注意的是，METIS在最差性能任务上表现出显著的性能提升，有效缓解了信息擦除。（项目页面：https://imkyungjin.github.io/METIS/）

英文摘要

Model merging has become a practical post-training strategy for building a single multi-task large language model (LLM) by combining multiple task-specialized models. However, most existing approaches rely on post-hoc merging, in which task-specific models are merged only once after training. This one-shot aggregation often suffers from task interference, leading to information erasure across individual tasks. In this work, we show that replacing post-hoc merging with an iterative many-shot merging protocol is effective in improving multi-task performance. Building on this insight, we propose METIS, Mitigating Erasure from Task Interference for Stable many-shot merging. METIS is a loss-aware many-shot merging method that addresses information erasure in post-hoc merging through task-wise loss-gap weighting and consensus-based masking. Notably, METIS exhibits significant performance improvement on the worst-performing task, effectively mitigating information erasure. (Project page: https://imkyungjin.github.io/METIS/)

URL PDF HTML ☆

赞 0 踩 0

2606.16733 2026-06-16 cs.AI 新提交

PH-KAN：端口-哈密顿 Kolmogorov-Arnold 网络

Achraf El Messaoudi, Karim Cherifi, Yann Le Gorrec, Yongxin Wu

AI总结提出基于 Kolmogorov-Arnold 网络的保结构非线性端口-哈密顿系统辨识框架，通过专用 KAN 块参数化各组件并显式施加约束，获得比标准 MLP 更可解释的模型。

详情

AI中文摘要

数据驱动的机器学习方法在非线性系统辨识中越来越有吸引力，但标准模型往往无法保持潜在的物理结构，且难以解释，尤其是在没有解析模型可用时。在此背景下，端口-哈密顿（pH）模型提供了一种自然的物理信息表示。然而，当这些模型使用标准多层感知器（MLP）参数化时，学习到的本构组件通常仍然难以解释。在本文中，我们提出了一种基于 Kolmogorov-Arnold 网络（KAN）的非线性端口-哈密顿系统的保结构辨识框架。所提出的 PH-KAN 模型使用专用 KAN 块参数化互连矩阵、耗散矩阵、哈密顿量和输入映射，同时通过构造强制满足端口-哈密顿约束。这产生了本构表示，其中定义所辨识 pH 组件的非线性函数可以被显式检查，从而得到比基于标准 MLP 的参数化更具可解释性的模型。

英文摘要

Data-driven machine learning approaches have become increasingly attractive for nonlinear system identification, but standard models often fail to preserve the underlying physical structure and remain difficult to interpret, especially when no analytical model is available. In this context, port-Hamiltonian (pH) models provide a natural physics-informed representation. However, when these models are parameterized with standard multilayer perceptrons (MLPs), the learned constitutive components often remain poorly interpretable. In this paper, we propose a structure-preserving identification framework for nonlinear port-Hamiltonian systems based on Kolmogorov-Arnold Networks (KANs). The proposed PH-KAN model parameterizes the interconnection matrix, dissipation matrix, Hamiltonian, and input mapping using dedicated KAN blocks, while enforcing the port-Hamiltonian constraints by construction. This yields constitutive representations in which the nonlinear functions defining the identified pH components can be explicitly inspected, leading to a more interpretable model than with standard MLP-based parameterizations.

URL PDF HTML ☆

赞 0 踩 0

2606.14732 2026-06-16 cs.CV cs.AI cs.LG cs.MM 交叉投稿

Steady-Forcing: Balancing Spatial Persistence and Motion Continuity in Long-Horizon Nature Video Diffusion

Steady-Forcing: 长时程自然视频扩散中空间持久性与运动连续性的平衡

Matiur Rahman Minar, Seunghun Oh, GangHyeon Jeong, Unsang Park

发表机构 * Department of Computer Science and Engineering, Sogang University（西江大学计算机科学与工程系）； Department of Artificial Intelligence, Sogang University（西江大学人工智能系）

AI总结提出Steady-Forcing框架，通过视觉锚点、运动记忆和蒸馏等技术，在长时程固定相机自然视频生成中平衡背景稳定与运动连续性，优于现有方法。

Comments Project page: https://minar09.github.io/steadyforcing/

详情

AI中文摘要

自回归视频扩散模型支持流式生成，但在长时程生成中常退化：静态场景布局漂移，而改善空间稳定性的机制往往抑制运动，导致水流、火焰或烟雾等自然流动停滞。我们研究了固定相机长时程自然视频生成中的这种稳定性-运动权衡，其中两种失败模式比移动相机设置更易区分。我们提出Steady-Forcing，一种结合持久视觉锚点（V-Sink）、指数移动平均运动记忆（EMA-Sink）、块相对时间编码、周期性缓存净化以及从Wan2.1-14B教师模型蒸馏（在任务聚焦配置下使用运动奖励先验）的记忆与训练框架。这些组件共同设计用于在数分钟的自回归生成中保持背景一致性，同时维持视觉上合理的流体动力学。在七个基线上的评估表明，Steady-Forcing改善了长时程背景一致性和成像质量，而盲用户研究显示更强的感知稳定性和运动连续性。基准评估进一步表明，通用的VBench聚合分数对固定相机伪影惩罚不足，同时将漂移引起的光流奖励为动态程度，而不直接惩罚纹理硬化或流动停滞——这激励了未来针对静态相机自然流动评估的任务特定基准。项目页面：https://minar09.github.io/steadyforcing/

英文摘要

Autoregressive video diffusion models enable streaming generation but often degrade over long rollouts: static scene layouts drift, while mechanisms that improve spatial stability tend to suppress motion, causing natural flows such as water, fire, or smoke to stagnate. We study this stability-motion trade-off in fixed-camera long-horizon nature video generation, where the two failure modes can be more clearly separated than in moving-camera settings. We propose Steady-Forcing, a memory and training framework combining a persistent visual anchor (V-Sink), an exponential moving-average motion memory (EMA-Sink), block-relative temporal encoding, periodic cache purification, and distillation from a Wan2.1-14B teacher with motion-rewarded priors under task-focused configurations. Together, these components are designed to preserve background identity while sustaining visually plausible fluid dynamics over multi-minute autoregressive rollouts. Evaluations across seven baselines show that Steady-Forcing improves long horizon background consistency and imaging quality, while a blind user study indicates stronger perceived stability and motion continuity. The benchmark evaluation further suggest that generic VBench aggregate scores under-penalize fixed-camera artifacts as well as rewarding drift-induced optical flow as Dynamic Degree while not directly penalizing texture hardening or flow stagnation - motivating future task-specific benchmarks for static-camera nature-flow evaluation. Project page: https://minar09.github.io/steadyforcing/

URL PDF HTML ☆

赞 0 踩 0

2606.14753 2026-06-16 cs.CV cs.AI 交叉投稿

Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning

超越自注意力：用于快速图像描述的次二次视觉Transformer

Chiradeep Ghosh, Dakshina Ranjan Kisku

发表机构 * Department of Computer Science and Engineering（计算机科学与工程系）； National Institute of Technology Durgapur（德里apur国立学院）； Durgapur, India（印度德里apur）

AI总结提出基于高斯混合模型和EM算法的概率Transformer，将自注意力复杂度从二次降至线性，在Flickr30K上实现高效图像描述。

Comments 8 pages, 8 figures

详情

AI中文摘要

图像描述是一项具有挑战性且重要的任务，旨在为给定图像生成连贯且语义有意义的文本描述。要完成此任务，需要对视觉内容有深入理解，并具备用自然语言表达这种理解的能力。尽管基于Transformer的架构取得了显著进展，现有方法仍存在局限性，例如缺乏丰富的局部特征表示以及二次自注意力的高计算成本。所提出的模型通过重构视觉Transformer架构，专注于提高计算效率。在设计该方法时，将Vision Transformer中的标准自注意力机制替换为基于高斯混合模型（GMM）的概率Transformer方法，这是一种软聚类技术。该模型不是计算所有图像块之间的成对注意力，而是使用期望最大化（EM）算法将相似块分组到固定数量的聚类中。这种基于聚类的机制将计算复杂度从二次O(n^2)降低到线性O(nK)，其中K << n。自回归的GPT解码器用于生成描述。该模型在Flickr 30K数据集上进行了评估，显示出与现有工作相比具有竞争力和显著的改进。

英文摘要

Image captioning is a challenging and significant task that aims to generate coherent and semantically meaningful textual descriptions for given images. To accomplish this task, it requires a deep understanding of visual content along with the ability to express that understanding in natural language. Despite remarkable progress with transformer-based architectures, existing approaches often suffer from limitations, such as a lack of rich local feature representations and the high computational cost of quadratic self-attention. The proposed model focuses on improving computational efficiency by restructuring the vision transformer architecture. In designing this approach, the standard self-attention mechanism in Vision Transformers is replaced with a probabilistic transformer approach based on a Gaussian Mixture Model (GMM), a soft-clustering technique. Instead of computing pairwise attention among all image patches, the model groups similar patches into a fixed number of clusters using an Expectation-Maximization (EM) algorithm. This clustering-based mechanism reduces the computational complexity from quadratic O(n^2) to linear O(nK), where K << n. The autoregressive GPT-based decoder is used for caption generation. The model is evaluated on the Flickr 30K dataset, demonstrating competitive and significant improvement over existing works.

URL PDF HTML ☆

赞 0 踩 0

2606.14760 2026-06-16 cs.CV cs.AI 交叉投稿

GeoRoPE: Ground-Aware Rotary Adaptation for Remote Sensing Foundation Models

GeoRoPE: 面向遥感基础模型的地面感知旋转适配

Yu Luo, Kun Hu, Mengwei He, Xiaogang Zhu, Shan Zeng, Allen Benter, Wei Xiang, Patrick Filippi, Thomas Francis Bishop, Zhiyong Wang

发表机构 * The University of Sydney（悉尼大学）； Edith Cowan University（埃迪斯科文大学）； Adelaide University（阿德莱德大学）； Wuhan Polytechnic University（武汉轻工大学）； Climate, Orange Agricultural Institute（气候研究所，奥兰治农业研究所）； La Trobe University（拉筹伯大学）

AI总结提出GeoRoPE方法，通过地理坐标校准和频率校准解决遥感基础模型中的尺度失配问题，提升跨分辨率鲁棒性和尺度敏感表征学习。

详情

AI中文摘要

遥感基础模型（RSFMs）受益于在多传感器和地面采样距离（GSD）图像上的预训练，但仅凭这种暴露并不能解决下游适配过程中的尺度失配问题。固定的token网格偏移在不同传感器下可能对应不同的地面距离，使得基于网格的位置先验在物理上不一致。同时，异质空间粒度意味着紧凑的城市区域和均质景观即使在相同GSD下也可能需要不同的位置敏感性。因此，我们提出GeoRoPE，一种面向RSFMs的地面感知、RoPE兼容且参数高效的空间适配方法。GeoRoPE从两个互补方面重新校准token级位置交互。首先，地理坐标校准（GCC）根据一个token网格步长代表的地面距离重新缩放原始token网格偏移，产生跨GSD的地理校准相对坐标。其次，地理频率校准（GFC）使用关系特定因子调整原生RoPE频率，实现对场景依赖空间粒度的位置敏感适配。GeoRoPE通过轻量适配器注入预训练RSFM，在保持冻结空间先验的同时添加地理感知位置校正。在多个RSFM、传感器、分辨率和下游任务上的实验表明，GeoRoPE提升了跨分辨率鲁棒性和尺度敏感表征学习。

英文摘要

Remote-sensing foundation models (RSFMs) benefit from pretraining on imagery from multiple sensors and ground sampling distances (GSDs), but such exposure alone does not resolve scale mismatch during downstream adaptation. A fixed token-grid offset can correspond to different ground distances across sensors, making grid-based positional priors physically inconsistent. Meanwhile, heterogeneous spatial granularity means that compact urban regions and homogeneous landscapes may require different positional sensitivities even under the same GSD. Therefore, we propose {GeoRoPE}, a ground-aware, RoPE-compatible, and parameter-efficient spatial adaptation method for RSFMs. GeoRoPE recalibrates token-level positional interactions from two complementary aspects. First, \textit{Geo-Coordinate Calibration (GCC)} rescales raw token-grid offsets according to the ground distance represented by one token-grid step, producing geo-calibrated relative coordinates across GSDs. Second, \textit{Geo-Frequency Calibration (GFC)} adjusts the native RoPE frequency with a relation-specific factor, enabling position sensitive adaptation to scene-dependent spatial granularity. GeoRoPE is injected into pretrained RSFMs through a lightweight adapter, preserving the frozen spatial prior while adding geo-aware positional corrections. Experiments across multiple RSFMs, sensors, resolutions, and downstream tasks demonstrate that GeoRoPE improves cross-resolution robustness and scale-sensitive representation learning.

URL PDF HTML ☆

赞 0 踩 0

2606.14765 2026-06-16 cs.CV cs.AI cs.LG cs.MM 交叉投稿

Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning

动量引导的语义预测（MoFore）用于自监督视频表示学习

Qinwu Xu

发表机构 * Qinwu Xu, PhD（秦武 Xu 博士）

AI总结提出MoFore框架，通过预测未来潜在嵌入进行自监督视频表示学习，结合对比正则化防止表示崩溃，在UCF101上验证了时间一致性和语义结构。

Comments 13 pages, 5 Figures, and 2 Tables

详情

AI中文摘要

自监督视频表示学习最近通过对比学习、掩码重建和预测表示学习取得了进展。基于重建的方法如MAE和VideoMAE通过恢复掩码视觉内容来学习表示，而对比方法如CLIP通过表示对齐学习语义有意义的嵌入空间。在这项工作中，我们提出了一种动量引导的语义预测框架（MoFore）用于自监督视频表示学习。该方法不是优化像素级重建或任务特定的语义对齐，而是通过从时间上遥远的上下文片段预测未来的潜在嵌入来学习时间预测性视频表示。为了提高跨时间尺度的鲁棒性，我们进一步引入了训练期间的随机时间间隔预测。该框架将预测性潜在预测与对比正则化相结合，以鼓励时间一致性同时防止表示崩溃。在UCF101数据集上的实验表明，所提出的框架在训练期间不使用动作标签的情况下学习了时间一致且语义有意义的视频表示。定量分析显示学习到的嵌入空间具有强时间稳定性和涌现的类别级结构，而定性检索实验揭示了跨相关活动的运动感知组织。总体而言，结果表明长程潜在预测为自监督视频表示学习提供了一种有效且计算高效的方法，而不依赖于基于重建的目标。

英文摘要

Self-supervised video representation learning has recently advanced through contrastive learning, masked reconstruction, and predictive representation learning. Reconstruction-based approaches such as MAE and VideoMAE learn representations by recovering masked visual content \cite{he2022mae,tong2022videomae}, while contrastive methods such as CLIP learn semantically meaningful embedding spaces through representation alignment \cite{radford2021clip}. In this work, we introduce a Momentum-Guided Semantic Forecasting framework (MoFore) for self-supervised video representation learning. Instead of optimizing for pixel-level reconstruction or task-specific semantic alignment, the proposed method learns temporally predictive video representations by forecasting future latent embeddings from temporally distant context clips. To improve robustness across temporal scales, we further introduce randomized temporal-gap forecasting during training. The framework combines predictive latent forecasting with contrastive regularization to encourage temporal consistency while preventing representation collapse. Experiments on the UCF101 dataset demonstrate that the proposed framework learns temporally consistent and semantically meaningful video representations without using action labels during training. Quantitative analysis shows strong temporal stability and emergent category-level structure in the learned embedding space, while qualitative retrieval experiments reveal motion-aware organization across related activities. Overall, the results suggest that long-range latent forecasting provides an effective and computationally efficient approach for self-supervised video representation learning without relying on reconstruction-based objectives.

URL PDF HTML ☆

赞 0 踩 0

2606.14770 2026-06-16 cs.CV cs.AI cs.IR cs.LG 交叉投稿

An Empirical Analysis of Optimization Dynamics and Sparsity Boundaries in Large-Scale Pedestrian Attribute Recognition

大规模行人属性识别中的优化动态与稀疏边界实证分析

Houssam El Mir

发表机构 * College of Computer Science and Technology, Zhejiang University of Technology（浙江工业大学计算机科学与技术学院）

AI总结针对行人属性识别中极端类别不平衡问题，提出多标签焦点损失校准配置（alpha=0.50, gamma=2.0），在零计算开销下匹配BCE基线并提升难例挖掘，同时识别出0.1%正样本率下的稀疏墙边界。

详情

AI中文摘要

行人属性识别（PAR）对于视频监控至关重要，支持法医搜索和重识别系统。当将PETA和PA-100K合并为一个包含109,000张图像的复合语料库时，极端类别不平衡仍然是一个基本障碍，其中少数属性的正样本比例低于1%。这导致标准BCE优化抑制稀有特征，我们称之为多数负类欺骗陷阱。我们在ResNet-18骨干网络上对多标签焦点损失超参数（alpha和gamma）进行了系统消融。校准配置（alpha=0.50, gamma=2.0）实现了62.32%的宏F1分数，与BCE基线相当，同时保留了优越的难例挖掘和收敛动态。我们的方法使用纯损失函数工程，边缘部署零计算开销。我们识别出稀疏墙，这是一个硬边界，当正样本比例低于0.1%时，全局损失重新加权失效，需要实例级干预。

英文摘要

Pedestrian Attribute Recognition (PAR) is critical for video surveillance, enabling forensic search and re-identification systems. Extreme class imbalance remains a fundamental obstacle when merging PETA and PA-100K into a 109,000-image composite corpus, where minority attributes have positive sample fractions below 1%. This causes standard BCE optimization to suppress rare traits, a phenomenon we term the majority negative class cheating trap. We present a systematic ablation of Multi-Label Focal Loss hyperparameters (alpha and gamma) on a ResNet-18 backbone. A calibrated configuration (alpha=0.50, gamma=2.0) achieves a Macro F1-score of 62.32%, matching BCE baseline while preserving superior hard-example mining and convergence dynamics. Our approach uses pure loss-function engineering with zero computational overhead for edge deployment. We identify the Sparsity Wall, a hard boundary where positive sample fractions below 0.1% make global loss reweighting ineffective, requiring instance-level intervention.

URL PDF HTML ☆

赞 0 踩 0

2606.14773 2026-06-16 cs.CV cs.AI 交叉投稿

Double-Helix Vision (DH-V2): A Geometry-Based Visual Sampler for Bandwidth-Constrained Perception

双螺旋视觉 (DH-V2)：一种基于几何的带宽受限感知视觉采样器

Jinwen Wen

发表机构 * Independent Researcher（独立研究者）

AI总结提出双螺旋视觉(DH)，一种基于黄金比例螺旋轨迹的几何采样器，将2D图像压缩为1D信号，实现1433倍压缩比，在CPU上0.52ms完成感知，CIFAR-10上准确率提升6.03%。

Comments 5 pages, 3 figures, 5 tables. Code and benchmarks: https://github.com/JackJ-C/double-helix-vision-tool

详情

AI中文摘要

我们提出双螺旋视觉(DH)，一种基于几何的视觉采样器，利用成对的黄金比例启发螺旋轨迹将2D图像压缩为紧凑的1D信号。DH不是均匀处理每个像素，而是采用两个相位偏移的螺旋（Alpha和Beta，偏移180度）以生物启发的中央凹方式采样图像：中心高密度，外围稀疏覆盖。在4K分辨率下，DH实现了1433倍压缩比（减少99.93%），同时保留场景的几何结构。完整的感知流水线——包括空间映射、时间碰撞检测和帧内结构视差估计——在仅CPU硬件上以1080p分辨率运行仅需0.52毫秒，无需神经网络依赖。在CIFAR-10上，在极端采样预算下（每个螺旋K=128个点），DH比均匀随机采样获得了+6.03%的准确率提升。提供了一个可序列化为JSON的机器人API，以2.7 KB的数据包提供亚毫秒级空间感知报告。代码和基准测试在MIT许可下提供。

英文摘要

We present Double-Helix Vision (DH), a geometry-based visual sampler that compresses 2D images into compact 1D signals using paired golden-ratio-inspired spiral trajectories. Rather than processing every pixel uniformly, DH employs two phase-shifted helices (Alpha and Beta, offset by 180 degrees) to sample the image with biologically-inspired foveation: high density at the center, sparse coverage at the periphery. At 4K resolution, DH achieves a 1,433x compression ratio (99.93% reduction) while preserving the geometric structure of the scene. The full perception pipeline -- including spatial mapping, temporal collision detection, and intra-frame structural disparity estimation -- runs in 0.52 ms at 1080p on CPU-only hardware, with no neural network dependencies. On CIFAR-10 at extreme sampling budgets (K=128 points per helix), DH achieves a +6.03% accuracy gain over uniform random sampling. A JSON-serializable Robotics API is provided, delivering sub-millisecond spatial perception reports in 2.7 KB packets. Code and benchmarks are available under the MIT License.

URL PDF HTML ☆

赞 0 踩 0

2606.14792 2026-06-16 cs.CV cs.AI 交叉投稿

Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

基于离散扩散模型的视觉-文本思维高效强化学习

Yoonjeon Kim, Yuhta Takida, Chieh-Hsin Lai, Eunho Yang, Yuki Mitsufuji

发表机构 * KAIST（韩国科学技术院）； Sony AI（索尼AI）； AITRICS ； Sony Group Corporation（索尼集团公司）

AI总结提出用离散扩散模型替代自回归模型进行多模态强化学习，通过局部视觉编辑减少计算量，并设计分解奖励分配策略解决跨模态干扰问题。

详情

AI中文摘要

基于强化学习的后训练已被广泛采用，以在能够同时进行文本和图像生成的统一多模态模型中实现交错视觉和文本推理。然而，大多数现有方法建立在自回归统一模型上，在视觉推理过程中需要完整的图像再生。在这项工作中，我们证明多模态离散扩散模型是自回归模型在交错推理中进行强化学习的有效替代方案，因为它们能够通过局部视觉编辑而非完整的图像令牌再生来执行高效的视觉展开。与自回归基线相比，这使GRPO期间的展开计算减少了26.9%，且性能下降极小。尽管效率提高，我们发现联合奖励分配（在模态间使用共享奖励信号）在RL更新期间会在不相关的图像和文本令牌序列之间引入跨模态干扰。为解决此问题，我们提出分解奖励分配策略，该策略独立地为文本和视觉片段分配奖励。采用分解奖励分配后，我们的RL方法相比联合奖励分配提高了11.2%，相比基础模型提高了38.04%。

英文摘要

RL-based post-training has been widely adopted to enable interleaved visual and textual reasoning in unified multimodal models capable of both text and image generation. However, most existing approaches are built upon autoregressive (AR) unified models, which require full image regeneration during visual reasoning. In this work, we demonstrate that multimodal discrete diffusion models are effective alternatives to AR models for reinforcement learning in interleaved reasoning, owing to their ability to perform efficient visual rollouts via localized visual editing rather than full image-token regeneration. This reduces rollout computation during GRPO by 26.9\% compared to AR baselines, with minimal performance drop. Despite the improved efficiency, we find that joint reward assignment, which employs a shared reward signal across modalities, introduces cross-modal interference between unrelated image and text token sequences during RL updates. To address this issue, we propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments. With factorized reward assignment, our RL approach achieves an 11.2% improvement over joint reward assignment and a 38.04% improvement over the base model.

URL PDF HTML ☆

赞 0 踩 0

2606.14822 2026-06-16 quant-ph cs.AI 交叉投稿

Quantum Machine Learning for Industrial Applications

量子机器学习在工业中的应用

Léo Monbroussou

发表机构 * Sorbonne Université（索邦大学）； LIP6（LIP6实验室）； Naval Group（海军集团）； IRIF（IRIF研究院）； CNRS（法国国家科学研究中心）

AI总结研究量子机器学习在工业中的潜力，解决变分量子电路的训练性、表达性和抗经典模拟能力，提出无贫瘠高原的理论保证和多项式量子优势算法。

Comments PhD thesis

详情

DOI: 10.70675/78b65d31z6c74z4ccfz8ce3zdbde6bf748e9
Journal ref: Sorbonne University, EDITE doctoral school, LIP6 laboratory, 2025

AI中文摘要

机器学习的最新进展已经改变了众多工业领域，但经典范式面临根本性限制：快速增长的数据量、不断上升的计算成本、显著的能源消耗以及传统硬件架构的物理缩放极限。量子计算已成为应对这些挑战的一种有前景的计算范式，催生了量子机器学习（QML）领域。本文研究了QML的理论基础，重点关注近期和未来的实际应用。解决了三个核心挑战：变分量子电路的可训练性、表达性以及抵抗高效经典模拟的能力。首先研究了汉明重量保持变分量子电路的可训练性，并建立了理论保证，解决了关于该电路族不存在贫瘠高原的开放猜想。然后引入了子空间保持的QML算法，包括光子电路和量子卷积神经网络，旨在模仿经典ML子程序，同时提供多项式量子优势。最后，将变分量子电路分析为量子傅里叶模型，并推导出一个框架来共同表征表达性和可训练性，从中获得了量子模型可证明与其经典对应物分离的条件。这些贡献旨在推进在现实世界应用中利用近期和未来量子技术的理论路线图。

英文摘要

Recent advances in Machine Learning have transformed numerous industrial sectors, yet classical paradigms face fundamental limitations: rapidly growing data volumes, rising computational costs, significant energy consumption, and the physical scaling limits of conventional hardware architectures. Quantum computing has emerged as a promising computational paradigm to address these challenges, giving rise to the field of Quantum Machine Learning (QML). In this thesis, the theoretical foundations of QML are investigated, with a focus on near-term and future practical applications. Three central challenges are addressed: the trainability of variational quantum circuits, their expressivity, and their resistance to efficient classical simulation. The trainability of Hamming-weight preserving variational quantum circuits is first studied, and theoretical guarantees are established that resolve an open conjecture on the absence of barren plateaus for this circuit family. Subspace-preserving QML algorithms are then introduced, including photonic circuits and quantum convolutional neural networks, and are designed to mimic classical ML subroutines while offering polynomial quantum advantage. Finally, variational quantum circuits are analyzed as quantum Fourier models, and a framework is derived to jointly characterize expressivity and trainability, from which conditions are obtained under which quantum models provably separate from their classical counterparts. These contributions are intended to advance the theoretical roadmap for harnessing near-term and future quantum technologies in real-world applications.

URL PDF HTML ☆

赞 0 踩 0

2606.14865 2026-06-16 cs.LG cs.AI 交叉投稿

GRAPE: Guided Parameter-Space Evolution for Compact Adversarial Robustness

GRAPE: 面向紧凑对抗鲁棒性的引导式参数空间演化

Zhiyuan Ye, Xiangyu Zhou, Ji Qi, Hao Zhang, Yi Zhou

发表机构 * University of Science and Technology of China（中国科学技术大学）； China Mobile (Suzhou) Software Technology Co., Ltd.（中移（苏州）软件技术有限公司）

AI总结提出GRAPE框架，通过逐步暴露参数空间并利用对抗谱利用分数引导容量分配，在固定计算预算下提升紧凑模型的对抗鲁棒性，在CIFAR-10上以1.009倍FLOPs将PGD-20鲁棒准确率从51.70%提升至56.94%，参数减少21.4%。

详情

AI中文摘要

对抗训练（AT）提高了神经网络的鲁棒性，但大多数方法从一开始就训练固定的参数空间。本文探讨了参数变得可优化的顺序是否会影响最终的鲁棒解，即使最终架构或计算预算被控制。我们提出了GRAPE（引导式参数空间演化），一种面向紧凑对抗鲁棒性的训练框架。GRAPE结合了参数空间稳定化与渐进式隐藏扩展：它在当前暴露空间中稳定鲁棒优化，逐步释放新的可优化维度，并使用对抗谱利用分数引导新释放的容量流向高压模块。与固定结构的AT相比，GRAPE将鲁棒模型学习视为一个渐进式参数空间暴露和演化的过程。在CIFAR-10上的标准$\ell_\infty$威胁模型下，以固定结构ResNet-18 AT作为对照参考，GRAPE在几乎匹配的计算预算下（FLOPs比率为1.009倍）将PGD-20鲁棒准确率从51.70%提升至56.94%，同时参数数量减少约21.4%。一个具有相同最终ResNet-18架构的序列增长变体达到了56.52%的PGD-20鲁棒准确率，表明增益不仅来自最终架构差异，还来自参数空间暴露路径。这些结果表明，引导式参数空间演化可以在匹配计算条件下产生紧凑且鲁棒的参数配置。

英文摘要

Adversarial Training (AT) improves neural network robustness, but most methods train a fixed parameter space from the start. This paper asks whether the order in which parameters become optimizable can affect the final robust solution, even when the final architecture or computation budget is controlled. We propose GRAPE, Guided Parameter-Space Evolution, a training framework for compact adversarial robustness. GRAPE combines parameter-space stabilization with progressive hidden expansion: it stabilizes robust optimization in the currently exposed space, gradually releases new optimizable dimensions, and uses an adversarial spectral utilization score to guide newly released capacity toward high-pressure modules. In contrast to fixed-structure AT, GRAPE treats robust model learning as a process of progressive parameter-space exposure and evolution. Under the standard $\ell_\infty$ threat model on CIFAR-10, with fixed-structure ResNet-18 AT as a controlled reference, GRAPE improves PGD-20 robust accuracy from 51.70% to 56.94% at a nearly matched computation budget with a FLOPs ratio of 1.009x, while reducing parameter count by about 21.4%. A sequential grow variant with the same final ResNet-18 architecture reaches 56.52% PGD-20 robust accuracy, indicating that the gain is not only due to final architecture differences but also to the parameter-space exposure path. These results suggest that guided parameter-space evolution can yield compact and robust parameter configurations under matched computation.

URL PDF HTML ☆

赞 0 踩 0

2606.14929 2026-06-16 cs.LG cs.AI stat.ML 交叉投稿

Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts

嵌入模型路由的策略遗憾：具有低秩专家的上下文赌博机

Yan Dai, Negin Golrezaei, Patrick Jaillet

发表机构 * Operations Research Center, MIT（麻省理工学院运筹学研究中心）； Sloan School of Management, MIT（麻省理工学院斯隆管理学院）； Department of EECS, MIT（麻省理工学院电气工程与计算机科学系）

AI总结针对推荐系统中嵌入模型路由问题，形式化为具有低秩专家的对抗性上下文线性赌博机，提出Hypentropy策略梯度算法，实现$\tilde{\mathcal O}(s\sqrt{M T})$线性化策略遗憾。

详情

AI中文摘要

现代推荐系统越来越依赖于将多样化的查询动态路由到多个嵌入模型。尽管具有实际意义，但在对抗性查询、赌博机反馈和模型有限可观测性等现实条件下，该问题仍未得到充分理解。我们将嵌入模型路由形式化为具有低秩专家的对抗性上下文线性赌博机，其中上下文是查询，动作是物品，专家是在低秩潜在表示空间上工作的嵌入模型。我们首先证明，标准遗憾概念存在结构错误指定或统计难解性，并确定了一个对数二次策略类，它足够表达以捕获查询相关的模型路由，同时又足够结构化以允许高效的在线学习。其次，我们提出了一种称为Hypentropy策略梯度（HPG）的策略梯度算法。它在不完全信息下可证明地适应未知的低秩结构，并达到$\tilde{\mathcal O}(s\sqrt{M T})$线性化策略遗憾——其中$s$、$M$和$T$分别是专家的内在秩、模型数量和轮数——从而避免了维度灾难。最后，我们还提供了HPG的计算高效且无需参数调整的实现。

英文摘要

Modern recommendation systems increasingly rely on dynamically routing diverse queries to multiple embedding models. Despite its practical significance, this problem remains poorly understood under realistic conditions like adversarial queries, bandit feedback, and limited observability of models. We formalize embedding model routing as an adversarial contextual linear bandit with low-rank experts, where contexts are queries, actions are items, and experts are the embedding models working on low-rank latent representation spaces. We first establish that standard regret notions suffer from structural misspecification or statistical intractability, and we identify a log-quadratic policy class that is expressive enough to capture query-dependent model routing, yet structured enough to allow efficient online learning. Second, we propose a policy gradient algorithm called Hypentropy Policy Gradient (HPG). It provably adapts to the unknown low-rank structure under incomplete information and attains $\tilde{\mathcal O}(s\sqrt{M T})$ linearized policy regret -- where $s, M$, and $T$ are the intrinsic rank of the experts, the number of models, and the number of rounds -- thus avoiding a curse of dimensionality. Finally, we also provide an computationally efficient and parameter-free implementation of HPG.

URL PDF HTML ☆

赞 0 踩 0

2606.14934 2026-06-16 cs.LG cs.AI 交叉投稿

Separable Neural Architectures as Physical World Models: from Mathematical Theory to Applications

可分离神经架构作为物理世界模型：从数学理论到应用

Reza T Batley, Andrew Kichline, Sourav Saha

发表机构 * Kevin T. Crofton Department of Aerospace and Ocean Engineering, Virginia Polytechnic Institute and State University（弗吉尼亚理工大学凯文·T·克罗夫顿航空航天与海洋工程系）

AI总结提出可分离神经架构（SNA），结合神经逼近与张量分解，通过变分框架求解偏微分方程，实现高维问题代数级缩放，并在工程案例中取得显著加速。

详情

AI中文摘要

本文介绍了可分离神经架构（SNA），这是一种结合神经逼近与张量分解的函数表示类。SNA将局部坐标函数（原子）与由稀疏低秩交互对象控制的全局相互作用解耦。该架构具有紧凑且平滑的归纳偏置，非常适合求解偏微分方程（PDE）。当在变分SNA（VSNA）框架下被视为Galerkin试验空间时，该公式满足Lax-Milgram下的经典变分保证：适定性、拟最优性、收敛性和稳定性。在高维时空-参数PDE中，VSNA通过代数级而非指数级缩放来缓解维数灾难。利用完全分解的、张量原生的交替最小二乘（ALS）优化框架，可将此成本降低至维度线性。VSNA在椭圆、双曲和抛物系统中得到验证，显示出与预测的代数谱缩放率高度一致。我们通过两个工程案例研究展示了SNA作为“一次求解，随处查询”的物理世界模型：一个7维参数化制造模拟和一个用于Inconel 718的实验性热-属性反演流程。VSNA在标准笔记本电脑CPU上102秒内执行了1,000,000次蒙特卡洛扫描，相比基于NVIDIA A100 GPU的全网格有限元基线实现了150,000倍加速。它还能在100毫秒内实现实时生成式逆模态重建。这些结果表明，SNA可作为连续参数流形的紧凑数学基础，实现实时反演、优化循环和快速不确定性传播。

英文摘要

This work introduces the Separable Neural Architecture (SNA), a function representational class combining neural approximation with tensor decomposition. The SNA decouples localized coordinate functions (atoms) from global interactions governed by a sparse, low-rank interaction object. This architecture possesses a compact and smooth inductive bias well-suited for solving partial differential equations (PDEs). When viewed as a Galerkin trial space under the variational SNA (VSNA) framework, the formulation satisfies classical variational guarantees under Lax-Milgram: well-posedness, quasi-optimality, convergence, and stability. In high-dimensional spatiotemporal--parametric PDEs, the VSNA mitigates the curse of dimensionality by scaling algebraically rather than exponentially. Exploiting an entirely factorized, tensor-native alternating least squares (ALS) optimization framework reduces this cost to linear in dimension. The VSNA is validated across elliptic, hyperbolic, and parabolic systems, demonstrating close alignment with predicted algebraic and spectral scaling rates. We showcase the SNA as a "solve once, query anywhere" physical world model via two engineering case studies: a 7D parametric manufacturing simulation and an experimental thermal-to-property inversion pipeline for Inconel 718. The VSNA executes a 1,000,000-query Monte Carlo sweep in 102s on a standard laptop CPU, yielding a 150,000x speedup over a full-grid finite element baseline hosted on an NVIDIA A100 GPU. It further enables real-time generative inverse-mode reconstructions under 100ms. These results demonstrate that the SNA serves as a compact mathematical substrate for continuous parameter manifolds to enable real-time inversion, optimization loops, and rapid uncertainty propagation.

URL PDF HTML ☆

赞 0 踩 0

2606.14971 2026-06-16 cs.LG cs.AI 交叉投稿

FastMix: Fast Data Mixture Optimization via Gradient Descent

FastMix: 通过梯度下降实现快速数据混合优化

Haoru Tan, Sitong Wu, Yanfeng Chen, Jun Xia, Ruobing Xie, Bin Xia, Xingwu Sun, Xiaojuan Qi

发表机构 * University of Hong Kong（香港大学）； Tencent（腾讯）； Chinese University of Hong Kong（香港中文大学）

AI总结提出FastMix框架，将数据混合选择重新表述为双层优化问题，通过联合优化混合系数和模型参数，实现高效、可扩展的数据混合发现，在预训练和后训练中均优于基线方法且大幅降低搜索成本。

详情

Journal ref: ICLR-2026

AI中文摘要

虽然大规模和多样化的数据集推动了大型模型的最新进展，但确定预训练和后训练的最佳数据混合仍然是一个重要的开放问题。我们通过FASTMIX应对这一挑战，这是一个新颖的框架，在仅训练单个代理模型的同时自动发现数据混合。FASTMIX不依赖预定义的启发式方法或资源密集型模拟，而是联合优化混合系数和模型参数，显著提高了相对于先前方法的效率和可扩展性。FASTMIX的核心是将混合选择重新表述为一个双层优化问题。在这种重新表述下，我们证明优化混合比例在数学上等价于在均匀源采样下分配每个源的损失权重。这将混合系数直接嵌入到可微分的迭代优化目标中，从而能够对混合和模型进行高效的基于梯度的优化。为了解决优化问题，FASTMIX实现了一个近似迭代优化过程，交替进行（i）根据当前混合比例对采样的数据更新模型参数（内循环）和（ii）基于验证反馈更新混合比例（外循环）。在预训练和后训练中，FASTMIX均优于基线方法，同时大幅降低了搜索成本。代码见 https://github.com/hrtan/fastmix

英文摘要

While large and diverse datasets have driven recent advances in large models, identifying the optimal data mixture for pre-training and post-training remains a significant open problem. We address this challenge with FASTMIX, a novel framework that automates data mixture discovery while training only a single proxy model. Instead of relying on predefined heuristics or resource-intensive simulations, FASTMIX jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FASTMIX is a reformulation of mixture selection as a bilevel optimization problem. Under this reformulation, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This embeds the mixture coefficients directly into the differentiable iterative optimization objective, enabling efficient, gradient-based optimization of both mixture and model. To solve the optimization problem, FASTMIX implements an approximate iterative optimization procedure, alternating between (i) updating model parameters on data sampled according to current mixture ratios (inner loop) and (ii) updating mixture ratios based on validation feedback (outer loop). Across pre- and post-training, FASTMIX outperforms baselines while drastically reducing search cost. Code (https://github.com/hrtan/fastmix)

URL PDF HTML ☆

赞 0 踩 0

2606.14975 2026-06-16 cs.NE cs.AI cs.LG physics.data-an q-bio.NC 交叉投稿

Harnessing cortical geometry, wiring, and function as inductive biases for recurrent neural networks

利用皮层几何、连接和功能作为循环神经网络的归纳偏置

Mo Shakiba, Rana Rokni, Mohammad Mohammadi, Nima Dehghani

发表机构 * Neuromatch Academy, Neuromatch, Inc., USA（Neuromatch学院，Neuromatch公司，美国）； McGovern Institute for Brain Research, Massachusetts Institute of Technology (MIT)（麦戈文脑科学研究所，麻省理工学院（MIT））

AI总结本研究利用MICrONS项目数据，通过神经元空间坐标、解剖连接和功能关系初始化循环权重并施加空间约束，构建生物基础循环神经网络，在认知决策任务中优于基线模型，并发展出低熵、模块化和小世界组织。

详情

AI中文摘要

皮层的连接和功能组织如何塑造循环计算仍然是神经科学和机器学习中的一个核心问题。在这里，我们利用通过皮层网络机器智能（MICrONS）项目发布的数据——一个涵盖小鼠视觉皮层多个区域的功能连接组学资源，其中密集钙成像与同一动物的高分辨率电子显微镜重建共同配准——来构建生物基础的循环神经网络。使用来自近12,000个共同配准的兴奋性神经元的神经元空间坐标、解剖连接和功能衍生关系，我们初始化循环权重并在学习过程中施加通信感知的空间约束。在三个认知决策任务中，受皮层结构和功能约束的网络始终优于基线和部分约束模型。功能权重初始化提供了最大的增益，而真实空间嵌入在多种条件下产生了稳健的额外改进。这些生物基础网络还发展出低熵、模块化和小世界组织，并且即使当循环被限制为正权重时也能保持强劲性能。总之，我们的结果表明，皮层的机制——其几何、连接和功能结构——可以作为构建循环网络的强大归纳基础，这些网络学习更有效，同时收敛于生物计算的关键组织原则。

英文摘要

How the wiring and functional organization of cortex shape recurrent computation remains a central question in both neuroscience and machine learning. Here, we leverage data released through the Machine Intelligence from Cortical Networks (MICrONS) program--a functional connectomics resource spanning multiple areas of mouse visual cortex, in which dense calcium imaging is co-registered with high-resolution electron microscopy reconstruction from the same animal--to build biologically grounded recurrent neural networks. Using neuronal spatial coordinates, anatomical connectivity, and function-derived relationships from nearly 12,000 coregistered excitatory neurons, we initialize recurrent weights and impose communication-aware spatial constraints during learning. Across three cognitive decision-making tasks, networks constrained by cortical structure and function consistently outperform baseline and partially constrained models. Functional weight initialization provides the largest gain, while real spatial embedding yields robust additional improvements across conditions. These biologically grounded networks also develop low-entropy, modular, and small-world organization, and retain strong performance even when recurrence is restricted to positive weights. Together, our results show that the machinery of cortex--its geometry, wiring, and functional structure--can be harnessed as a powerful inductive basis for building recurrent networks that learn more effectively while converging toward key organizational principles of biological computation.

URL PDF HTML ☆

赞 0 踩 0

2606.15015 2026-06-16 cs.CV cs.AI 交叉投稿

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

NEXUS: 用于物理一致的高接触3D物体动力学的神经能量场

Qizhen Ying, Guangming Wang, Yangchen Pan, Victor Adrian Prisacariu, Yixiong Jing

发表机构 * University of Oxford（牛津大学）； University of Cambridge（剑桥大学）

AI总结提出神经能量场框架NEXUS，通过标量能量和耗散项建模保守与非保守动力学，提升高接触3D场景下的长时程轨迹精度并指导视频生成。

Comments 18 pages, 4 figures, 6 tables. Preprint

详情

AI中文摘要

基于物理的视频生成需要可控的3D物体动力学，这些动力学在接触、变形和外力作用下保持物理一致性。现有的基于轨迹的方法通常建模孤立的物理效应，难以在高接触3D场景中组合保守和非保守动力学。我们提出NEXUS，一个用于高接触3D物体动力学的神经能量场框架。NEXUS将每个物体表示为结构图，并构建动态的物体-物体和物体-环境接触图。受哈密顿神经网络启发，NEXUS通过标量能量和耗散项而非直接预测状态或加速度来公式化运动。保守效应（包括重力和弹性变形）被组合为加性能量项，而非保守效应（如阻尼和冲击引起的能量损失）则通过学习的瑞利型耗散建模。力通过对能量和耗散函数求导得到，并通过多子步半隐式积分器进行演化。在受控轨迹基准测试中，NEXUS在不同力学属性和物理效应组合下，相较于代表性的学习和物理结构化动力学基线，提高了长时程精度。我们进一步展示NEXUS轨迹为高接触视频生成提供了有效指导，在保持竞争性视觉质量的同时提高了物理合理性。

英文摘要

Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.

URL PDF HTML ☆

赞 0 踩 0

2606.15055 2026-06-16 cs.CV cs.AI 交叉投稿

Bridging Geographic Bias in Urban Streetscape Inference via Lifelong Learning with Visual-Semantic Pivoting

通过视觉-语义枢轴终身学习弥合城市街景推理中的地理偏差

Xinze Zhang

发表机构 * University of Southern California（南加州大学）

AI总结提出HVSP-LL终身学习框架，通过分层视觉-语义枢轴模块和公平感知重放机制，在跨城市街景推理中减少地理偏差，实现城市间感知差距缩小38%。

详情

AI中文摘要

城市街景的视觉感知支撑着景观规划、公共卫生和场所营造中的循证决策。然而，在少数拍摄良好的大都市上训练的模型会系统性地误判代表性不足的地区，将地理偏差传播到下游政策中。我们通过HVSP-LL（一种终身学习框架）解决了这一差距，该框架将分层视觉-语义枢轴模块与公平感知重放机制相结合。枢轴模块沿三层本体（宏观结构、中观组成、微观元素）组织景观概念，并将图像特征与每层可学习的语义锚点对齐，提供抵抗分布漂移的可迁移表示。终身适应组件顺序吸收新的城市区域，同时通过最差区域样本重新加权目标和结构感知示例缓冲区约束区域间感知差距。我们在一个由四大洲十二个城市和七个感知维度组成的全景街景基准上评估了HVSP-LL。该框架在保留城市序列上达到0.834的斯皮尔曼相关系数，比最强的持续基线绝对提高了6.1个百分点，并将城市间感知差距缩小到0.094——相对于最强的持续基线（0.151）减少了38%，相对于代表性的正则化基线（0.218）减少了57%。消融实验证实，枢轴层次结构的每一层都有单调贡献，公平感知重放将平均反向迁移从-0.038（无保留）转换为+0.013，消除了保留序列上的灾难性遗忘。我们的结果表明，分层锚定是实现城市尺度地理公平街景推理的实用途径。

英文摘要

Visual perception of urban streetscapes underpins evidence-based decisions in landscape planning, public health, and place-making. Yet models trained on a few well-photographed metropolises systematically misjudge underrepresented districts, propagating geographic bias into downstream policy. We address this gap with HVSP-LL, a lifelong learning framework that couples a stratified visual-semantic pivoting module with an equity-aware rehearsal mechanism. The pivoting module organises landscape concepts along a three-tier ontology (macro structure, meso composition, micro element) and aligns image features to learnable semantic anchors at each tier, providing transferable representations that resist distributional drift. The lifelong adaptation component sequentially absorbs new urban regions while constraining inter-region perception gaps through a worst-region sample-reweighting objective and a structurally-aware exemplar buffer. We evaluate HVSP-LL on a panoramic streetscape benchmark assembled from twelve cities across four continents and seven perceptual dimensions. The framework attains 0.834 Spearman correlation on the held-out city sequence, an absolute 6.1 point improvement over the strongest continual baseline, and shrinks the inter-city perception gap to 0.094 -- a 38% reduction relative to the strongest continual baseline (0.151) and a 57% reduction relative to a representative regularisation baseline (0.218). Ablations confirm that each tier of the pivoting hierarchy contributes monotonically, and the equity-aware rehearsal converts mean backward transfer from -0.038 (without retention) to +0.013, eliminating catastrophic forgetting on the held-out sequence. Our results indicate that hierarchical anchoring is a practical pathway toward geographically equitable streetscape inference at city scale.

URL PDF HTML ☆

赞 0 踩 0

2606.15134 2026-06-16 cs.CV cs.AI cs.LG 交叉投稿

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

超越标量距离：来自冻结MLLM的语义属性梯度用于视觉嵌入

Shubhang Bhatnagar, Dheeraj Baiju, Narendra Ahuja

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结提出SAGA框架，利用冻结的多模态大语言模型（MLLM）通过GRPO奖励机制为视觉编码器提供属性级监督，替代传统标量距离，提升零样本图像检索性能。

详情

AI中文摘要

用于检索的视觉编码器通常通过类标签监督进行训练：每个训练对简化为一个标量，均匀地将嵌入推远或拉近，就好像每个视觉属性要么不同要么匹配。一个多模态大语言模型（MLLM），在展示相同的一对图像时，能够阐述这些属性并利用它们预测图像是否共享一个类别。我们提出\textbf{SAGA}，一个框架，将这种基于语言、属性感知的感知转化为编码器本身的训练信号。具体来说，我们使用组相对策略优化（GRPO）来奖励MLLM对视觉编码器令牌的正确预测。由于正确的预测要求这些令牌暴露该对之间不同或匹配的具体属性，梯度推动编码器编码这些属性，用属性解析的监督取代统一的成对标量。一个辅助的注意力蒸馏损失将编码器的嵌入锚定到MLLM关注的令牌上，一个标准的度量学习损失塑造嵌入几何结构以进行最近邻检索。MLLM在整个过程中被冻结，在推理时被丢弃，与度量学习基线的部署成本相匹配。在CUB-200-2011、Cars-196、FGVC-Aircraft和iNaturalist Aves上的零样本图像检索中，SAGA在Recall@1上比最先进的基线提高了3到6个百分点。

英文摘要

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose \textbf{SAGA}, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves on zero-shot image retrieval.

URL PDF HTML ☆

赞 0 踩 0

2606.15157 2026-06-16 cs.LG cs.AI 交叉投稿

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

PolyKV: 异构保留与分配用于KV缓存压缩

Chao Fei, Panos Kalnis

发表机构 * King Abdullah University of Science and Technology（阿卜杜拉国王科技大学）

AI总结针对长上下文大模型推理中KV缓存压缩问题，提出PolyKV框架，通过层级别信号为每层选择合适压缩策略并分配非均匀缓存预算，实验表明在固定预算下显著恢复性能差距。

详情

AI中文摘要

KV缓存压缩对于减少长上下文大语言模型推理的内存成本至关重要。然而，现有方法通常在所有Transformer层上应用单一的压缩策略和统一的缓存预算。这种统一设计忽略了不同层在预填充和解码过程中可能扮演不同角色，因此可能需要不同的驱逐策略和缓存容量。我们提出了PolyKV，一种逐层KV缓存优化框架，考虑了方法选择和预算分配的设计空间。PolyKV基于层级别信号将每层路由到合适的KV压缩策略，同时在固定总预算下分配非均匀预算。这种公式化实现了现有KV缓存方法的异构组合。在LLaMA-3.1-8B和Qwen3-8B上的实验表明，在相同的512 token平均KV预算下，PolyKV分别恢复了最强单策略基线与FullKV之间LongBench性能差距的54.5%和25.7%。在128-1024预算范围内，PolyKV持续比最强基线提升1.7%-6.4%，对应FullKV差距的40.0%-54.5%恢复。

英文摘要

KV cache compression is essential for reducing the memory cost of long-context large language model inference. Existing approaches, however, typically apply a single compression policy and a uniform cache budget across all transformer layers. This uniform design ignores the fact that different layers can play different roles during prefill and decoding, and may therefore require different eviction strategies and cache capacities. We present PolyKV, a layer-wise KV cache optimization framework that considers design space with method selection and budget allocation. PolyKV routes each layer to a suitable KV compression policy based on layer-level signals, while assigning non-uniform budgets under a fixed total budget. This formulation enables heterogeneous compositions of existing KV cache methods. Experiments on LLaMA-3.1-8B and Qwen3-8B show that, under the same 512-token average KV budget, PolyKV recovers 54.5% and 25.7% of the LongBench performance gap between the strongest single-policy baseline and FullKV, respectively. Across 128-1024 budget sweep, PolyKV consistently improves over the strongest baseline by 1.7%-6.4%, corresponding to 40.0%-54.5% recovery of the FullKV gap.

URL PDF HTML ☆

赞 0 踩 0

2606.15207 2026-06-16 cs.LG cs.AI cs.NE 交叉投稿

Controlled Dynamics Attractor Transformer

受控动力学吸引子Transformer

Cheng Zhang, Minnan Luo, Zesheng Yang, Ming Li, Yong-Jin Liu, Qinghua Zheng

发表机构 * Xi'an Jiaotong University（西安交通大学）； Tsinghua University（清华大学）

AI总结提出受控动力学吸引子Transformer（CDAT），通过耦合混合von Mises-Fisher注意力能量与Hopfield精炼能量，并引入CANN启发的兴奋-抑制调制，实现拓扑约束的动力学系统，在图异常检测和图分类任务上达到最优性能。

Comments 20pages,3 figures

详情

Journal ref: Forty-Third International Conference on Machine Learning(ICML 2026)

AI中文摘要

Transformer架构通过自注意力机制在深度模型的表示学习和推理方面取得了显著进展。同时，联想记忆（AM）框架将表示映射到能量景观上，提供了可解释的检索机制。然而，其连续时间推理动力学缺乏经典连续吸引子神经网络（CANN）的生物合理性。为弥合这一差距，我们提出了受控动力学吸引子Transformer（CDAT），它将混合von Mises-Fisher（Mo-vMF）注意力能量与Hopfield精炼能量耦合，同时通过CANN启发的兴奋-抑制调制增强能量下降。CDAT实例化了一个拓扑约束的动力学系统，其耦合编码了标记之间的关系结构，从而将吸引子式动力学与现代基于能量的注意力联系起来。我们进一步提供了构造性的耗散分析，以正式建立其受控推理动力学。得益于这些鲁棒且结构化的动力学，CDAT在图异常检测和图分类的多个基准测试中达到了最先进的性能。

英文摘要

Transformer architectures have dramatically advanced representation learning and inference in deep models through self-attention mechanisms. In parallel,associative memory (AM) frameworks map representations onto energy landscapes, offering interpretable retrieval mechanisms. However, their continuous-time inference dynamics lack the biological plausibility of classical Continuous Attractor Neural Networks (CANNs). To bridge this gap, we propose Controlled Dynamics Attractor Transformer (CDAT), which couples a mixture von Mises-Fisher (Mo-vMF) attention energy with a Hopfield refinement energy, while augmenting energy descent with a CANN-inspired excitation-inhibition modulation. CDAT instantiates a topology-constrained dynamical system whose couplings encode relational structure among tokens, thereby linking attractor-style dynamics to modern energy-based attention. We further provide a constructive dissipation analysis to formally establish their controlled inference dynamics. Benefiting from these robust and structured dynamics, CDAT achieves state-of-the-art performance across multiple benchmarks in graph anomaly detection and graph classification.

URL PDF HTML ☆

赞 0 踩 0

2606.15247 2026-06-16 cs.LG cs.AI 交叉投稿

Exploring Starts Are Not Enough: Counterexamples and a Fix for Monte Carlo Exploring Starts

探索性初始状态并不足够：蒙特卡洛探索性初始状态的反例与修正

Octave Oliviers, Glenn Vinnicombe

发表机构 * Department of Engineering, University of Cambridge（剑桥大学工程系）

AI总结本文通过构造反例证明，在表格设置下，蒙特卡洛探索性初始状态（MCES）算法可能收敛到次优解，并提出基于状态级学习率缩放的修正方法以恢复最优性收敛。

详情

AI中文摘要

蒙特卡洛探索性初始状态（MCES）的渐近行为是强化学习中一个长期存在的开放问题，即使在表格设置中也是如此。我们通过构造算法收敛到次优解的例子，研究了表格MCES的收敛性质。本文为初始访问和首次访问MCES提供了新的反例，并给出了初始访问情况下的收敛恢复修正。我们表明，即使贪婪动作平均更新频率高于非贪婪动作，初始访问MCES在样本平均更新下也可能存在稳定的次优解。然而，通过按状态将学习率与更新频率成反比缩放，可以保证收敛到最优性。与之前的均匀化方法不同，此修正适用于需要近似估计值函数的大规模问题。然后，我们扩展该例子以表明样本平均首次访问MCES也可能收敛到次优解。这基本上解决了一个基本的开放问题，并表明仅靠探索性初始状态并不能保证收敛到最优性。更广泛地说，这些结果突显了收敛性关键取决于应用于不同动作的更新的相对大小和频率，使得学习率的选择以及探索与利用的平衡成为MCES分析和可扩展蒙特卡洛控制方法实现的核心。

英文摘要

The asymptotic behaviour of Monte Carlo Exploring Starts (MCES) is a long-standing open question in reinforcement learning, even in the tabular setting. We investigated the convergence properties of tabular MCES by constructing examples in which the algorithm converges to suboptimal solutions. This paper presents new counterexamples for both initial-visit and first-visit MCES and gives a convergence-restoring modification for the initial-visit case. We show that stable suboptimal solutions may exist for initial-visit MCES with sample-average updates even when greedy actions are updated more often than non-greedy actions on average. However, by scaling learning rates inversely to update frequencies on a state-by-state basis, convergence to optimality is guaranteed. Unlike previous uniformisation methods, this modification is applicable to large-scale problems that require approximating the estimated value function. We then extend the example to show that sample-average first-visit MCES may also converge to suboptimal solutions. This largely settles a fundamental open problem and shows that exploring starts alone do not guarantee convergence to optimality. More broadly, these results highlight that convergence depends critically on the relative size and frequency of updates applied to different actions, making the choice of learning rates and the balance between exploration and exploitation central to the analysis of MCES and the implementation of scalable Monte Carlo control methods.

URL PDF HTML ☆

赞 0 踩 0

2606.15260 2026-06-16 cs.LG cs.AI 交叉投稿

Trust-Region Diffusion Policies for Massively Parallel On-Policy RL

大规模并行在线强化学习的信任区域扩散策略

Huy Le, Onur Celik, Denis Blessing, Tai Hoang, Claas A Voelcker, Axel Brunnbauer, Felix Richter, Michael Volpp, Gerhard Neumann

发表机构 * University of Freiburg（弗赖堡大学）； Max Planck Institute for Intelligent Systems（智能系统马克斯·普朗克研究所）

AI总结提出TruDi方法，通过信任区域优化约束扩散轨迹的KL散度，实现大规模并行在线强化学习中的稳定训练，在73个任务中优于或持平基线。

详情

AI中文摘要

利用大规模并行模拟的强化学习已成为开发鲁棒、可部署策略的标准框架；然而，大多数现有方法仍依赖简单的高斯策略参数化。扩散模型提供了更具表达力的策略类，并在具有挑战性的控制问题上表现出色，但大多数基于扩散的强化学习方法是为离线或离策略训练设计的。在这项工作中，我们探究扩散策略能否在大规模并行、在线策略机制下有效训练。为此，我们引入了信任区域扩散策略（TruDi），它使得扩散策略能够用于大规模并行模拟的在线强化学习。这种设置特别具有挑战性，因为数据分布在每次更新中快速变化，使得复杂策略的稳定训练变得困难。TruDi通过整合信任区域优化规则来约束整个扩散轨迹上的KL散度，从而解决了这一问题。实验上，我们在包含73个任务的4个不同的大规模并行强化学习基准上评估了TruDi。在这些任务中，TruDi在标准任务上始终优于或与强基线持平，在更具挑战性的人形控制任务上取得了明显收益，为大规模并行在线强化学习建立了新的强基线。

英文摘要

Reinforcement learning with massively parallel simulations has become a standard framework for developing robust, deployable policies; however, most existing approaches still rely on simple Gaussian policy parameterizations. Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems, yet most diffusion-based RL methods are designed for offline or off-policy training. In this work, we ask whether diffusion policies can be trained effectively in the massively parallel, on-policy regime. To this end, we introduce Trust-region Diffusion Policies (TruDi), which enables diffusion policies for on-policy RL with massively parallel simulations. This setting is particularly challenging because the data distribution changes quickly across updates, making stable training with complex policies difficult. TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory. Empirically, we evaluate TruDi on a diverse set of 4 massively parallel RL benchmarks comprising a total of 73 tasks. Across these tasks, TruDi consistently outperforms or is on-par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks, establishing a strong new baseline for massively parallel on-policy RL.

URL PDF HTML ☆

赞 0 踩 0

2606.15278 2026-06-16 cs.LG cs.AI 交叉投稿

RECTOR: Masked Region-Channel-Temporal Modeling for Affective and Cognitive Representation Learning

RECTOR：面向情感与认知表征学习的掩码区域-通道-时间建模

Jinhan Liu, Mahsa Shoaran

发表机构 * Cornell University（康奈尔大学）

AI总结提出RECTOR自监督框架，通过自适应功能分区和掩码拓扑学习，统一建模EEG/sEEG的区域-通道-时间动态，在情感识别和任务参与分类上达到新最优，且对缺失通道和跨导联泛化鲁棒。

详情

AI中文摘要

情感和认知障碍表现为跨区域、通道和时间的分布式、时变脑网络动态，给基于EEG/sEEG的临床诊断鲁棒表征学习带来挑战。我们提出RECTOR（掩码区域-通道-时间建模），一种端到端自监督框架，超越固定解剖先验，统一联合区域-通道-时间表征学习。其核心RECTOR-SA是一种由自适应功能分区诱导的层次化块稀疏自注意力，将区域结构从静态解剖定义演变为自适应功能区域。自监督由掩码拓扑和表征学习驱动，联合优化三个互补目标：掩码预测建模、拓扑结构建模和跨视图一致性。在多个基准上，RECTOR在EEG情感识别和sEEG任务参与分类中达到新最优。关键的是，其对缺失通道的强鲁棒性和跨导联泛化能力凸显了其在异构EEG/sEEG上进行大规模预训练的潜力，并在区域和通道层面提供可解释的洞察。

英文摘要

Affective and cognitive disorders manifest as distributed, time-varying brain network dynamics across regions, channels, and time, challenging robust representation learning from EEG/sEEG for clinical diagnosis. We propose RECTOR (Masked Region-Channel-Temporal Modeling), an end-to-end self-supervised framework that unifies joint region-channel-temporal representation learning beyond fixed anatomical priors. At its core, RECTOR-SA is a hierarchical, block-sparse self-attention induced by Adaptive Functional Partitioning that evolves region structures from static anatomical definitions to adaptive functional regions. The self-supervision is driven by Masked Topology and Representation Learning, which jointly optimizes three complementary objectives: Masked Predictive Modeling, Topological Structure Modeling, and Cross-View Consistency. Across diverse benchmarks, RECTOR sets a new state-of-the-art in EEG emotion recognition and sEEG task-engagement classification. Crucially, its strong robustness to missing channels and cross-montage generalization underscores its potential for large-scale pre-training on heterogeneous EEG/sEEG, providing interpretable insights at both region and channel levels.

URL PDF HTML ☆

赞 0 踩 0

2606.15284 2026-06-16 eess.SP cs.AI cs.LG 交叉投稿

CAP: Towards PPG Universal Representation Learning with Patient-level Supervision

CAP：面向患者级监督的PPG通用表示学习

Chenyang He, Xinyi Shao, Shun Huang, Bosong Huang, Daoqiang Zhang, Ming Jing, Cheng Ding

发表机构 * Nanjing University of Aeronautics and Astronautics（南京航空航天大学）； Peking University（北京大学）； Independent Researcher（独立研究者）； Jinling Clinical Medical College College of Artificial Intelligence Nanjing University of Aeronautics and Astronautics（金陵临床医学院人工智能学院南京航空航天大学）

AI总结提出CAP方法，通过构建大规模PPG-EHR多模态数据集和跨模态对比对齐，学习患者级临床语义的PPG表示，在四项下游任务中平均提升26.7%，呼吸率预测提升87.6%。

Comments Accepted as an Oral presentation at KDD 2026

详情

DOI: 10.1145/3770855.3818881

AI中文摘要

光电容积描记法（PPG）在可穿戴健康监测和临床决策支持中发挥着核心作用。然而，现有的通用PPG表示学习方法主要关注信号级目标，往往忽略患者级健康背景，这限制了对复杂临床任务和异质性队列的泛化能力。为解决这一问题，我们通过将碎片化的病史和临床记录整合为连贯的患者级电子健康记录（EHR），构建了一个大规模配对PPG-EHR多模态数据集。基于此资源，我们提出了临床锚定预训练方法（CAP）。在预训练期间，CAP执行跨模态对比对齐，将PPG表示锚定到患者级临床语义，引导编码器超越波形拟合，建模患者整体生理状态的一致性。在下游适应期间，预训练的PPG编码器提供临床基础的表示，增强归纳偏置，提高鲁棒性和可迁移性。实验表明，CAP在四个不同的下游任务上持续优于强基线。CAP在呼吸率预测上取得了特别大的提升（相比最先进基线相对提升高达87.6%），并在所有任务上平均相对提升26.7%。我们通过全面分析（包括消融实验和多个互补的可视化学习表示）进一步增强了方法的可解释性。实验代码可在 https://github.com/gody123gody/CAP 获取。

英文摘要

Photoplethysmography (PPG) plays a central role in wearable health monitoring and clinical decision support. Yet existing approaches to universal PPG representation learning largely focus on signal-level objectives and often overlook patient-level health context, which limits generalization to complex clinical tasks and heterogeneous cohorts. To address this gap, we construct a large-scale paired PPG-EHR multimodal dataset by distilling fragmented medical histories and clinical records into cohesive, patient-level electronic health records (EHR). Building on this resource, we propose Clinical Anchored Pretraining for PPG (CAP). During pretraining, CAP performs cross-modal contrastive alignment that anchors PPG representations to patient-level clinical semantics, guiding the encoder beyond waveform fitting toward modeling consistency in a patient's overall physiological state. During downstream adaptation, the pretrained PPG encoder provides clinically grounded representations that strengthen inductive bias and improve robustness and transferability. Experiments demonstrate that CAP consistently outperforms strong baselines on four diverse downstream tasks. CAP achieves a particularly large gain on respiratory rate prediction (up to +87.6% relative improvement over the state-of-the-art baseline) and delivers an average relative +26.7% across all tasks. We further enhance the interpretability of our approach through comprehensive analyses, including ablations and multiple complementary visualizations of the learned representations. The code for our experiments is available at: https://github.com/gody123gody/CAP .

URL PDF HTML ☆

赞 0 踩 0

2606.15377 2026-06-16 cs.LG cs.AI physics.geo-ph 交叉投稿

Learning Earthquake Wave Arrival Time Picking from Labels with Inaccuracies

从不准确标签中学习地震波到时拾取

Sen Li, Xu Yang, S. Mostafa Mousavi, Anye Cao, Keting Fan, Yaoqi Liu, Changbin Wang, Qiang Niu

发表机构 * Department of Earth and Planetary Sciences, Harvard University（哈佛大学地球与行星科学系）； School of Computer Science and Technology, China University of Mining and Technology（中国矿业大学（北京）计算机科学与技术学院）； School of Mines, China University of Mining and Technology（中国矿业大学（北京）矿院）； State Key Laboratory of Coal Exploration and Intelligent Mining, China University of Mining and Technology（中国矿业大学（北京）煤炭勘探与智能开采国家重点实验室）

AI总结提出标签噪声对比鲁棒学习（LaNCoR）方法，通过对齐波形特征与标签表示分布来纠正错误标签，在微地震P波到时拾取任务中性能提升高达28.8%。

Comments 28 pages, 10 figures

详情

AI中文摘要

不准确标记的训练数据，或称“标签噪声”，对监督机器学习模型的完整性构成重大威胁。这种污染通过教导模型特征与标签之间的错误映射直接降低性能，导致泛化能力差，并在正确标记的验证和测试数据上准确性降低。当前地震学应用主要依赖大规模训练集或数据增强来减少标签噪声影响，这可能是劳动密集且成本高昂的。在这里，我们介绍一种标签噪声对比鲁棒学习（LaNCoR）方法，该方法可以有效处理地震信号处理任务中的噪声标签，而无需大规模训练数据集。在该方法中，输入波形特征和标签表示分布在特征空间中对齐，以纠正错误标记并减少其对训练过程的影响。我们使用两个基线模型和训练方法展示了LaNCoR在真实微地震数据P波到时拾取任务上的性能。我们的结果表明，LaNCoR在性能指标上可提升高达28.8%。该方法在地震学和地球科学中的模型训练方面具有巨大潜力。

英文摘要

Inaccurately labeled training data, or "label noise", poses a significant threat to the integrity of supervised machine learning models. This corruption directly degrades performance by teaching the model erroneous mappings between features and labels, which leads to poor generalization and reduced accuracy on properly labeled validation and test data. Current seismological applications mainly rely on large-scale training sets or data augmentation to reduce the label-noise impact, which can be labor-intensive and costly. Here, we introduce a Label Noise-Contrastive Robust Learning (LaNCoR) approach that can effectively handle noisy labels in seismic signal processing tasks, without requiring large-scale training datasets. In this approach, the input waveform feature and label representation distributions are aligned in the feature space to correct mislabeling and reduce its impact on the training process. We present LaNCoR's performance on the task of P-phase arrival-time picking of real microseismic data using two baseline models and training approaches. Our results indicate that LaNCoR can improve performance by up to 28.8% across performance metrics. This approach holds great promise for model training in seismology and geosciences.

URL PDF HTML ☆

赞 0 踩 0

2606.15455 2026-06-16 cs.LG cs.AI 交叉投稿

Understanding Diversity Collapse in RLVR via the Lens of Overtraining

通过过度训练的视角理解RLVR中的多样性崩溃

Suqin Yuan, Jinkun Chen, Jiyang Zheng, Muyang Li, Lei Feng, Dadong Wang, Tao Xiang, Tongliang Liu, Bo An

发表机构 * Sydney AI Centre, The University of Sydney（悉尼大学悉尼人工智能中心）； Southeast University（东南大学）； Microsoft（微软）； Data61, CSIRO（澳大利亚联邦科学与工业研究组织Data61）； Chongqing University（重庆大学）； Nanyang Technological University（南洋理工大学）

AI总结本文通过过度训练的视角形式化RLVR中的多样性崩溃，发现标准训练中大部分更新是过度训练，并提出贝叶斯边界门控（BBG）方法，通过估计每个问题对推理边界的边际贡献来优化，提升多个基准上的Pass@k。

详情

AI中文摘要

基于可验证奖励的强化学习（RLVR）已成为增强大型语言模型推理能力的关键方法。然而，RLVR常常遭受\emph{多样性崩溃}：Pass@$1$提升而高$k$的Pass@$k$下降，这被视为模型推理边界的收窄。我们通过\emph{过度训练}的视角形式化了这种多样性崩溃：一旦一个问题对参考指标的贡献有效饱和，进一步的更新不再扩展模型能解决的问题，但仍将概率质量集中在on-policy采样偏好的轨迹上。在每次问题少量rollout的标准设置下，即使单次成功也会使问题进入高$k$ Pass@$k$的近乎饱和状态，因此标准RLVR中的大多数更新从边界角度来看都是过度训练。这一视角也提供了一种解读：RLVR能否扩展模型超越基础模型的推理能力？由于RLVR结构上偏向于高$k$ Pass@$k$，其总体下降本身并不意味着没有新的推理增益。在干预上，将更新限制在零成功的问题上，在困难基准上将Pass@$256$提升到基础模型之上；在观察上，标准RLVR训练中，最初不可解的问题中有相当一部分变得可解。基于这些发现，我们提出\emph{贝叶斯边界门控}（BBG），通过估计每个问题对推理边界的边际贡献，将优化从过度训练中转移出来。在多个推理基准上，BBG在广泛的$k$范围内提升了平均Pass@$k$。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a key approach for enhancing the reasoning abilities of large language models. However, RLVR often suffers from \emph{diversity collapse}: Pass@$1$ improves while high-$k$ Pass@$k$ degrades, which is viewed as a narrowing of the model's reasoning boundary. We formalize this diversity collapse through the lens of \emph{overtraining}: once a problem's contribution to the reference metric has effectively saturated, further updates no longer expand what the model can solve but still concentrate probability mass on the trajectories favored by on-policy sampling. Under a standard setup with few rollouts per problem, even a single observed success places a problem in a nearly saturated regime for high-$k$ Pass@$k$, so most updates in standard RLVR are overtraining from the boundary perspective. This perspective also suggests a reading of whether RLVR can expand the model's reasoning abilities beyond the base model: since RLVR is structurally biased against high-$k$ Pass@$k$, its aggregate decline does not by itself mean that no new reasoning gains occurred. Interventionally, restricting updates to problems with zero observed success lifts Pass@$256$ above the base model on difficult benchmarks; observationally, a non-trivial fraction of initially unsolvable problems become solvable during standard RLVR training. Building on these findings, we propose \emph{Bayesian Boundary Gating} (BBG), which redirects optimization away from overtraining by estimating each problem's marginal contribution to the reasoning boundary. Across multiple reasoning benchmarks, BBG improves average Pass@$k$ across a wide range of $k$.

URL PDF HTML ☆

赞 0 踩 0

2606.15479 2026-06-16 cs.LG cs.AI math.PR 交叉投稿

Bayesian 3D Steerable CNNs: Enabling Equivariance and Uncertainty Quantification Simultaneously

贝叶斯3D可转向CNN：同时实现等变性和不确定性量化

Abhishek Keripale, Ponkrshnan Thiagarajan, Susanta Ghosh

发表机构 * Michigan Technological University（密歇根理工大学）； Johns Hopkins University（约翰霍普金斯大学）； The Center for Artificial Intelligence at the Institute of Computing and Cybersystems, Michigan Technological University（密歇根理工大学计算与网络系统研究所人工智能中心）

AI总结提出贝叶斯可转向CNN，通过后验分布赋予核随机性同时保持SE(3)-等变性，实现不确定性分解，在分类精度和分布偏移下鲁棒性优于确定性模型。

详情

AI中文摘要

可转向卷积神经网络（Steerable-CNNs）通过将核参数化为可转向基函数的线性组合来保证SE(3)-等变性，但其确定性本质阻碍了不确定性量化——限制了其在需要置信度估计的场景中的应用。我们提出一种贝叶斯可转向CNN，将后验分布置于基系数上，从而在精确保持等变性的同时产生随机核。模型的损失函数通过变分推断获得，并通过贝叶斯反向传播最小化。该框架将预测不确定性分解为认知不确定性和偶然不确定性。实验上，该模型在取得竞争性分类精度的同时，预期校准误差为0.0263，并且在加性高斯噪声引起的分布偏移下，其性能比确定性对应模型高出最多6.17%。此外，我们利用模型的不确定性估计显著提升其性能，在测试数据集的84%上实现了约4%的准确率提升。认知不确定性与预测误差之间统计显著的负相关性表明，学习到的后验方差具有语义意义。该框架将贝叶斯不确定性量化与等变CNN的归纳偏置统一起来。

英文摘要

Steerable convolutional neural networks (Steerable-CNNs) guarantee SE(3)-equivariance by parameterizing kernels as linear combinations of steerable basis functions, but their deterministic nature precludes uncertainty quantification - limiting their use in settings where confidence estimates are essential. We propose a Bayesian Steerable-CNN that places posterior distributions over the basis coefficients, yielding stochastic kernels while preserving equivariance exactly. The loss function of the model is obtained via variational inference and minimized by Bayes-by-Backpropagation. The framework admits a decomposition of predictive uncertainty into epistemic and aleatoric components. Empirically, the model attains competitive classification accuracy alongside an expected calibration error of 0.0263 and outperforms its deterministic counterpart by up to 6.17% under distributional shift induced by additive Gaussian noise. Furthermore, we leverage the model's uncertainty estimates to enhance its performance significantly, achieving a notable gain - approximately 4% higher accuracy across 84% of the test dataset. A statistically significant negative correlation between epistemic uncertainty and prediction error confirms that the learned posterior variance is semantically meaningful. The framework unifies Bayesian uncertainty quantification with the inductive bias of equivariant CNNs.

URL PDF HTML ☆

赞 0 踩 0

2606.15527 2026-06-16 cs.CV cs.AI 交叉投稿

Selective Synergistic Learning for Video Object-Centric Learning

选择性协同学习用于视频对象中心学习

WonJun Moon, Jae-Pil Heo

发表机构 * KAIST（韩国科学技术院）； Sungkyunkwan University（成均馆大学）

AI总结提出选择性协同学习（SSync），通过伪标签线性复杂度选择性蒸馏可靠线索，避免错误传播，提升视频对象分解质量并作为即插即用模块。

详情

AI中文摘要

典型的视频对象中心学习（VOCL）方法采用基于槽的框架，依赖重建驱动的编码器-解码器架构，学习通过两个空间图进行：编码器的注意力图和解码器的对象图。由于这两个不同的图表现出不同的属性，最近的密集对齐策略试图通过对比学习强制所有时空补丁之间的一致性来调和这种差异。然而，这种无差别的对齐无意中传播了每个模块固有的弱点，例如编码器的噪声预测和解码器的模糊边界。此外，计算所有对之间的密集相似性会带来与时空补丁总数二次方关系的计算成本，严重限制了可扩展性。受此启发，我们提出了选择性协同学习（SSync）。SSync 不是进行穷举的补丁到补丁对齐，而是通过选择性蒸馏仅最可靠的线索来防止错误传播：严格利用编码器进行边界细化，利用解码器进行内部去噪。这通过线性复杂度的伪标签实现，消除了二次空间比较的需要。此外，为了防止强化架构偏差（如槽冗余），我们引入了传递性伪标签合并，基于时空激活一致性合并重叠的槽。大量研究表明，SSync 提高了分解质量，并作为一个通用的即插即用模块，同时对槽配置表现出卓越的鲁棒性。代码可在 github.com/wjun0830/SSync 获取。

英文摘要

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via a pseudo-labeling with linear complexity, eliminating the need for quadratic spatial comparisons. Also, to prevent the reinforcement of architectural biases like slot redundancy, we introduce a transitive pseudo-label merging that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.

URL PDF HTML ☆

赞 0 踩 0

2606.15553 2026-06-16 cs.LG cs.AI 交叉投稿

Distilling Drifting Transformers with Representation Autoencoders

用表示自编码器蒸馏漂移变换器

Jiawei Zhang, Mengfei Xia, Gen Li, Yuantao Gu

发表机构 * Tsinghua University（清华大学）； Ant Group（蚂蚁集团）； CUHK（香港中文大学）

AI总结提出Drift-RAE方法，通过漂移范式在表示自编码器潜空间中蒸馏预训练流模型，解决各向异性和大曲率问题，在ImageNet 256上仅用10k步达到1.77 FID。

详情

AI中文摘要

表示自编码器（RAE）通过预训练编码器中强标签聚类的DINO特征，在语义更丰富的潜空间中改进了扩散和流模型。然而，在蒸馏阶段，丰富语义表示导致的严重各向异性和大曲率会阻碍收敛和性能，使得基于轨迹的蒸馏不稳定。在这项工作中，我们认为RAE潜空间通过新提出的漂移模型与蒸馏兼容。我们首先定量研究了不同自编码器上的曲率和各向同性统计，并从理论上揭示了漂移模型本身极有可能在像基于重建的VAE这样的极端分散空间上失败。这些促使我们直接将漂移范式应用于表示自编码器。我们提出的方法Drift-RAE使用漂移在RAE潜空间中蒸馏预训练流模型，并进行了有洞察力的修改，通过理论上将漂移场与其他框架对齐来提高训练稳定性。关于实验证据，我们在ImageNet 256数据集上仅用10k步蒸馏就达到了1.77 FID，超越了最先进的RAE蒸馏方法，并且与原始漂移模型相比具有竞争力，而无需辅助MAE特征提取器。代码将公开提供。

英文摘要

Representation Autoencoders (RAEs) have improved diffusion and flow models by semantically richer latent space owing to the strongly label-wise clustered DINO features in the pretrained encoders. Yet in the distillation stage, the severe anisotropy and large curvatures caused by the rich semantic representations would hinder the convergence and performance, making the trajectory-based distillation unstable. In this work, we argue that the RAE latent space is compatible with distillation via the newly proposed Drifting Models. We first quantitatively study the curvatures and isotropy statistics across different autoencoders, and theoretically reveal that Drifting Model itself is highly likely to fail on extremely scattered spaces like reconstruction-based VAEs. These motivate us to apply the drifting paradigm directly to representation autoencoders. Our proposed method, Drift-RAE, distills pretrained flow models in RAE latent spaces using Drifting, together with insightful modifications that improve training stability by thereotically aligning drifting fields with other frameworks. Regarding the experimental evidences, we achieve 1.77 FID on ImageNet 256 dataset using only 10k distillation steps, surpassing state-of-the-art RAE distillation methods and appearing comparative with the original Drifting Model without requiring an auxiliary MAE feature extractor. The code will be made publicly available.

URL PDF HTML ☆

赞 0 踩 0

2606.15576 2026-06-16 cs.LG cs.AI 交叉投稿

Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning

在分歧处定位信用：路径条件自蒸馏用于LLM推理

Yu Li, Shu Hong, Tian Lan

发表机构 * Department of Electrical and Computer Engineering, George Washington University（乔治华盛顿大学电气与计算机工程系）

AI总结提出Hindsight Self-Distillation (HSD)方法，通过将教师模型条件于当前训练组中的成功同伴轨迹，在失败与成功轨迹的分歧处提供密集信用信号，提升LLM在数学和代码推理任务上的性能。

详情

AI中文摘要

基于可验证奖励的强化学习为每次 rollout 分配一个标量，在长推理轨迹中留下了 token 级信用分配不明确的问题。同策略自蒸馏通过让同一模型作为教师，并条件于特权信息，产生密集的逐 token 信号来解决这一问题。但常见的真实答案选择仅是一个终点线索：在简短答案任务中，教师在需要路径级指导的中间位置保持沉默。我们提出后见自蒸馏（HSD），它将教师条件于从当前训练组中抽取的一个成功同伴 rollout。这样的同伴是从成功条件策略中精确采样的样本，无需额外的采样 rollout。通过提供完整的成功延续而不仅仅是最终答案，产生的信用信号集中在失败 rollout 与成功同伴之间的分歧位置。在 Qwen3-8B 和 Qwen3-32B 的数学和代码基准测试中，HSD 相比 GRPO 变体和同策略蒸馏基线获得了最佳结果，在 AIME 等简短答案任务上提升最大。

英文摘要

Reinforcement learning from verifiable rewards assigns a single scalar to each rollout, leaving token-level credit assignment underspecified in long reasoning traces. On-policy self-distillation addresses this by letting the same model act as a teacher conditioned on privileged information, producing a dense per-token signal. But the common choice of a ground-truth answer is only an endpoint cue: on terse-answer tasks, the teacher falls silent at the intermediate positions where path-level guidance matters most. We propose Hindsight Self-Distillation (HSD), which conditions the teacher on a successful peer rollout drawn from the current training group. Such a peer is an exact sample from the success-conditioned policy, requiring no additional sampled rollouts. By providing a full successful continuation rather than only the final answer, the resulting credit signal concentrates at the divergence position between a failed rollout and a successful peer. Across Qwen3-8B and Qwen3-32B on math and code benchmarks, HSD obtains the best result against GRPO variants and on-policy distillation baselines, with the largest gains on terse-answer tasks such as AIME.

URL PDF HTML ☆

赞 0 踩 0

2606.15589 2026-06-16 cs.LG cs.AI 交叉投稿

Is Code Better Than Language for Algorithmic Reasoning

算法推理中代码是否优于语言

Terry Tong, Yu Feng, Surbhi Goel, Dan Roth

发表机构 * University of Pennsylvania（宾夕法尼亚大学）

AI总结通过分离中间表示与执行机制，在40个任务上比较代码执行与自然语言推理，发现代码执行优势源于外部执行而非表示变化。

Comments ICML 2026

详情

AI中文摘要

对于工具增强的语言模型，比较自然语言推理与代码执行管道是困难的，因为比较同时改变了中间表示和执行机制。我们通过一个中间干预来分离这些因素：模型将其推理表达为可执行代码，语言模型在上下文中模拟该代码以产生答案。在40个任务的可验证算法基准上，确定性代码执行比自然语言推理高出+31.6个百分点。我们观察到中间干预与自然语言推理没有显著差异（+0.15个百分点）。这些结果表明，在我们评估的设置中，仅改变中间表示并不能解释工具使用的优势，为性能提升需要可靠的外部执行提供了证据。我们用一个简单的统计决策理论模型形式化了这一直觉，该模型刻画了在我们的解耦轨迹生成/执行机制中，执行何时主导端到端风险。我们通过一个重建干预验证了我们的理论，该干预利用代理语言模型从代码表示中推断自然语言推理轨迹，恢复了与原始自然语言推理管道相当的性能。所有实验见https://github.com/TerryTong-Git/ToolProj。

英文摘要

For tool-augmented language models, comparing natural-language reasoning with code-execution pipelines is difficult because the comparison changes both the intermediate representation and the execution mechanism. We separate these factors with an intermediate intervention: the model expresses its reasoning as executable code, and the language model simulates that code in context to produce an answer. On a 40-task verifiable algorithmic benchmark, deterministic code execution outperforms natural-language reasoning by +31.6pp. We observe that the intermediate intervention is not meaningfully different from natural-language reasoning (+0.15pp). These results suggest that, in our evaluated setting, changing the intermediate representation alone does not explain the tool-use advantage, providing evidence for the performance gains requiring reliable external execution. We formalize this intuition with a simple statistical decision-theoretic model that characterizes when execution dominates end-to-end risk in our disentangled trace-generation/execution regime. We validate our theory using a reconstruction intervention that leverages a proxy language model to infer natural-language reasoning traces from code representations, recovering performance comparable to the original natural-language reasoning pipeline. All experiments are at https://github.com/TerryTong-Git/ToolProj.

URL PDF HTML ☆

赞 0 踩 0

2606.15669 2026-06-16 cs.LG cs.AI 交叉投稿

Z-Plane Neural Networks: Bounded Geometric Activation Replaces ReLU and LayerNorm

Z平面神经网络：有界几何激活替代ReLU和LayerNorm

Sungwoo Goo, Hwi-yeol Yun, Sangkeun Jung

发表机构 * College of Pharmacy, Chungnam National University（忠南大学药学院）； Department of Computer Science & Engineering, Chungnam National University（忠南大学计算机科学与工程系）

AI总结提出Z平面神经网络，通过有界几何激活函数Radial Bounding将隐藏状态映射到超球面上的2D相量束，在保持方向信息的同时限制能量幅度，理论证明其保持1-Lipschitz连续性并防止梯度消失，实验表明100层无ReLU和LayerNorm的MLP在MNIST上稳定收敛。

详情

AI中文摘要

现代深度神经网络依赖欧几里得标量激活（如ReLU）和全局归一化技术（如LayerNorm）来防止深层架构中的梯度不稳定。然而，这些机制固有地导致神经元死亡、丢弃关键方向信息并破坏特征表示的正交性。受生物轴突频率调制传输的启发，我们提出了Z平面神经网络，将隐藏状态映射到超球面上的2D相量束。我们引入了一种新颖的几何激活函数Radial Bounding（$\mathbf{x} / \max(1, \\|\mathbf{x}\\|_2)$），它在保持相位（方向）的同时限制能量幅度。我们从数学上证明，这种各向同性激活保持了1-Lipschitz连续性，并通过保留切向梯度防止梯度消失。实验上，一个完全不含ReLU和LayerNorm的100层Z平面多层感知机（MLP）在MNIST数据集上成功收敛，准确率达到98.34%，且具有绝对数值稳定性，证明仅靠有界几何激活就足以实现稳定的深度学习。

英文摘要

Modern deep neural networks rely on Euclidean scalar activations (e.g., ReLU) and global normalization techniques (e.g., LayerNorm) to prevent gradient instability in deep architectures. However, these mechanisms inherently cause dead neurons, discard critical directional information, and destroy the orthogonality of feature representations. Inspired by the frequency-modulation transmission of biological axons, we propose the Z-Plane Neural Network, which maps hidden states into 2D phasor bundles on a hypersphere. We introduce a novel geometric activation function, Radial Bounding($\mathbf{x} / \max(1, \|\mathbf{x}\|_2)$), which limits the energy magnitude while preserving the phase (direction). We demonstrate mathematically that this isotropic activation maintains 1-Lipschitz continuity and prevents gradient vanishing by preserving tangential gradients. Empirically, a 100-layer Z-Plane Multi-Layer Perceptron (MLP)-entirely devoid of ReLU and LayerNorm-successfully converges on the MNIST dataset with 98.34% accuracy and absolute numerical stability, proving that bounded geometric activation alone is sufficient for stable deep learning.

URL PDF HTML ☆

赞 0 踩 0

2606.15678 2026-06-16 cs.LG cs.AI 交叉投稿

The Reservoir Attention Network: Cross-Pass State in Pretrained Transformers via Content-Addressable Reservoir Injection

储层注意力网络：通过内容可寻址储层注入在预训练Transformer中的跨前向传播状态

Emma Leonhart

发表机构 * Emma Leonhart

AI总结提出储层注意力网络（RAN），通过在预训练Transformer中间层注入固定随机储层来携带跨前向传播状态，实验表明未训练的循环动态足以传递可用状态。

Comments 29 pages, 14 figures

2606.15695 2026-06-16 cs.LG cs.AI 交叉投稿

When Generator Replay Degrades: Projected Rehearsal Orchestration for Heterogeneous Federated Class-Incremental Learning

当生成器回放退化时：面向异构联邦类增量学习的投影排练编排

Thinh T. H. Nguyen, Khoa D. Doan, Binh T. Nguyen, Danh Le-Phuoc, Kok-Seng Wong

发表机构 * VinUniversity ； VNU-HCM, University of Science（胡志明市国家大学理科大学）； Technische Universität Berlin（柏林工业大学）

AI总结针对异构联邦类增量学习中客户端标签子集不同、任务阶段不一致导致的旧知识遗忘问题，提出投影排练编排框架PRO及增强版PRO-MAX，通过服务器端维护紧凑类级投影记忆并实现平衡伪多任务训练，在图像、文本和图基准上提升异构流下的保留与最终效用。

Comments 46 pages

详情

AI中文摘要

联邦类增量学习（FCIL）在客户端观察到不同标签子集、在不同阶段推进任务以及为相同语义概念提供不均匀监督时变得极其困难。现有的FCIL方法通常通过输入空间合成来保留旧知识，但在异构任务流下可能脆弱且难以跨模态迁移。为缓解这些问题，我们提出PRO，一个用投影排练编排替代合成输入回放的框架。为去除外部预训练，我们在相同的预热条件下评估所有方法。此后，PRO在服务器上维护紧凑的类级投影记忆，并允许客户端在当前示例和旧投影记忆上执行平衡的伪多任务训练。为处理更强的表示漂移，我们进一步引入PRO-MAX，它在保持相同服务器轻量原则（服务器仅聚合模型更新和记忆统计）的同时，用邻域加权记忆对齐增强PRO。在图像、文本和图基准上，PRO和PRO-MAX在异构流下提高了保留和最终效用，同时在同构FCIL中保持竞争力。即使基线获得更大的回放预算，它们在监督不平衡和阶段错位下也会退化，表明仅靠回放数量无法解决回放质量失败。额外的弱任务诊断进一步表明，更大的回放不匹配与更大的下游退化相关，而我们的方法使投影记忆与不断演化的表示保持更好对齐。

英文摘要

Federated class-incremental learning (FCIL) becomes substantially harder when clients observe different label subsets, progress through tasks at different stages, and provide uneven supervision for the same semantic concepts. Existing FCIL methods often preserve old knowledge through input-space synthesis, but they can be fragile under heterogeneous task streams and difficult to transfer across modalities. To alleviate such issues, we propose PRO, a framework that replaces synthetic input replay with projected rehearsal orchestration. To remove external pretraining, we evaluate all methods under the same warmup. After this, PRO maintains compact class-level projected memories on the server and allows clients perform balanced pseudo multi-task training over current examples and old projected memories. To handle stronger representation drift, we further introduce PRO-MAX, which augments PRO with neighborhood-weighted memory alignment while preserving the same server-light principle that the server only aggregates model updates and memory statistics. Across image, text, and graph benchmarks, PRO and PRO-MAX improve retention and final utility under heterogeneous streams while remaining competitive in homogeneous FCIL. Even when baselines are given expanded replay budgets, they degrade under supervision imbalance and stage misalignment, indicating that replay quantity alone does not resolve replay-quality failures. Additional weak-task diagnostics further show that larger replay mismatch is associated with larger downstream degradation, while our method keeps projected memories better aligned with the evolving representation.

URL PDF HTML ☆

赞 0 踩 0

2606.15734 2026-06-16 cs.CL cs.AI cs.IR cs.LG 交叉投稿

Retrievable Gradients: Continual Post-Training Without Cumulative Weight Drift

可检索梯度：无累积权重漂移的持续后训练

Weihang Su, Jiacheng Kang, Jingyan Xu, Qingyao Ai, Jianming Long, Hanwen Zhang, Bangde Du, Xinyuan Cao, Min Zhang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）

AI总结提出ReGrad范式，将梯度作为可检索知识单元，通过元学习重塑文档梯度为通用适应信号，实现无权重漂移的可扩展参数知识注入。

详情

AI中文摘要

持续后训练使模型在部署后能够吸收新知识，但重复更新共享参数会累积权重漂移，可能导致灾难性遗忘并降低通用能力。检索增强生成避免了这种参数漂移，但往往缺乏参数化知识整合的深度。在本文中，我们提出ReGrad（可检索梯度），一种将梯度视为可检索知识单元的新范式。ReGrad离线预计算文档特定梯度，存储在索引化的梯度库中，并在推理时仅检索与查询相关的梯度以进行临时权重调整。然而，原始语言建模梯度针对词级文档重建而非查询驱动的知识使用进行优化。因此，我们引入双层元学习目标，将文档派生梯度重塑为下游任务的通用适应信号。在通用和特定领域设置上的实验表明，ReGrad优于CPT和RAG基线，实现了可扩展且可逆的参数知识注入，且不累积权重漂移。

英文摘要

Continual post-training enables models to absorb emerging knowledge after deployment, but repeatedly updating shared parameters can accumulate weight drift, potentially causing catastrophic forgetting and degrading general capabilities. Retrieval-augmented generation avoids such parameter drift, yet often lacks the depth of parametric knowledge integration. In this paper, we propose ReGrad (Retrievable Gradients), a new paradigm that treats gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time for temporary weight adaptation. However, raw language-modeling gradients are optimized for token-level document reconstruction rather than for query-driven knowledge use. We therefore introduce a bi-level meta-learning objective that reshapes document-derived gradients into generalizable adaptation signals for downstream tasks. Experiments across general and domain-specific settings show that \textsc{ReGrad} outperforms CPT and RAG baselines, enabling scalable and reversible parametric knowledge injection without accumulating weight drift.

URL PDF HTML ☆

赞 0 踩 0

2606.15767 2026-06-16 cs.LG cs.AI 交叉投稿

Visualizing Uncertainty: Spatial Maps of Missing and Conflicting Evidence in Deep Learning

可视化不确定性：深度学习中缺失与冲突证据的空间图

Dong Hyun Jeong, Feng Chen, Jin-Hee Cho, Lance M. Kaplan, Audun Jøsang, Soo-Yeon Ji

发表机构 * University of the District of Columbia（哥伦比亚特区大学）； University of Texas at Dallas（德克萨斯大学达拉斯分校）； Virginia Tech（弗吉尼亚理工大学）； U.S. Army DEVCOM Army Research Laboratory（美国陆军DEVCOM陆军研究实验室）； University of Oslo（奥斯陆大学）； Bowie State University（鲍伊州立大学）

AI总结提出不确定性激活图（UAM）框架，结合证据深度学习与全梯度类激活映射，生成空间不确定性激活图，区分缺乏证据的空虚和假设冲突的不和谐，填补不确定性量化与可解释性之间的空白。

详情

AI中文摘要

理解深度神经网络何时以及为何不确定对于在安全关键领域部署可靠的机器学习系统至关重要。虽然现有的不确定性量化方法提供了模型置信度的标量度量，但它们对输入的哪些空间区域导致不同类型的不确定性提供的洞察有限。我们提出了一种新颖的可视化框架——不确定性激活图（UAM），它将证据深度学习（EDL）与全梯度类激活映射（FullGrad）相结合，生成可解释的空间不确定性激活图。我们的方法区分了两种基本的不确定性类型：空虚（代表缺乏证据）和不和谐（捕捉竞争假设之间的冲突证据）。通过利用FullGrad的完整梯度分解特性和主观逻辑的原则性不确定性量化，我们的方法产生了理论上合理的可视化，突出显示了导致模型不确定性的特定图像区域。利用该框架，通过计算信念加权属性生成空虚和不和谐激活图，从而能够识别模型缺乏知识的区域与遇到模糊证据的区域。在多个基准数据集上的广泛评估表明，所提出的框架有效地解决了不确定性量化与可解释性之间的关键差距，为评估复杂视觉识别任务中的模型可靠性提供了直观的视觉反馈。

英文摘要

Understanding when and why deep neural networks are uncertain is crucial for deploying reliable machine learning systems in safety-critical domains. While existing uncertainty quantification methods provide scalar measures of model confidence, they offer limited insight into which spatial regions of an input contribute to different types of uncertainty. We propose a novel visualization framework, Uncertainty Activation Map (UAM), that combines Evidential Deep Learning (EDL) with Full-Gradient Class Activation Mapping (FullGrad) to generate interpretable spatial uncertainty activation maps. Our approach distinguishes between two fundamental types of uncertainty: vacuity, representing lack of evidence, and dissonance, capturing conflicting evidence between competing hypotheses. By leveraging the complete gradient decomposition property of FullGrad and the principled uncertainty quantification of Subjective Logic, our method produces theoretically grounded visualizations that highlight specific image regions responsible for model uncertainty. With this framework, vacuity and dissonance activation maps are generated by computing belief-weighted attributions, enabling identification of where models lack knowledge versus where they encounter ambiguous evidence. Extensive evaluations across multiple benchmark datasets demonstrate that the proposed framework effectively addresses the critical gap between uncertainty quantification and explainability, providing intuitive visual feedback to assess model reliability in complex visual recognition tasks.

URL PDF HTML ☆

赞 0 踩 0

2606.15793 2026-06-16 cs.LG cs.AI stat.ML 交叉投稿

Proximal Policy Optimization for Amortized Discrete Sampling

用于摊销离散采样的近端策略优化

Anna Zykova-Myzina, Timofei Gritsaev, Daniil Tiapkin, Nikita Morozov

发表机构 * HSE University（高等经济学院）； Constructor University（康斯特大学）； CMAP, CNRS, École polytechnique, IPP（CMAP，CNRS，巴黎综合理工学院，IPP）

AI总结本文在生成流网络框架下，推导了策略梯度算法并首次应用近端策略优化，提升了离散概率分布采样的收敛速度和数据效率。

2606.15796 2026-06-16 cs.CV cs.AI 交叉投稿

你不需要强假设：通过时间差异进行视觉表示学习

Ninad Daithankar, Alexi Gladstone, Yann LeCun, Heng Ji

发表机构 * UIUC（伊利诺伊大学厄巴纳-香槟分校）； New York University（纽约大学）

AI总结提出TDV方法，基于因果假设（过去导致未来）从视频中自监督学习，避免强归纳偏置，在密集空间任务上达到SOTA。

详情

AI中文摘要

AI的进步很大程度上是由假设更少的方法驱动的。随着计算和数据量的增加，弱归纳偏置的方法通常优于强假设的方法。这在视觉表示学习领域尤为典型，方法从监督学习主导，到弱监督学习，再到如今无需人工标签的自监督学习的广泛成功。然而，即使是现代自监督学习方法仍然依赖于强归纳偏置，如数据增强、掩码或裁剪。如果这一趋势持续，这些剩余的偏置在大规模下将成为瓶颈——我们的实验证实了这一点：随着数据增长，归纳偏置的最优强度降低。这促使我们寻找依赖更少假设的方法。为此，我们提出了视觉时间差异（TDV），一种从视频中进行自监督学习的新范式，它避免了现有的归纳偏置，而是依赖于一个因果假设：过去导致未来。TDV通过联合训练图像编码器和运动编码器，使得当前帧的表示加上编码的运动等于下一帧的表示。尽管没有利用任何强归纳偏置，TDV在密集空间任务上达到了最先进的水平，为无需强假设的表示学习奠定了基础。

英文摘要

Progress in AI has largely been driven by methods that assume less. As compute and data increase, approaches with weaker inductive biases generally outperform those with stronger assumptions. This is particularly characteristic of the field of Visual Representation Learning, where approaches have gone from being dominated by Supervised Learning, to Weakly Supervised Learning, to the now widespread success of Self-Supervised Learning without human labels. Yet, even modern Self-Supervised Learning approaches still depend on strong inductive biases such as augmentations, masking, or cropping. If this trend holds, even these remaining biases should become bottlenecks at scale -- and our experiments confirm this: the optimal strength of inductive biases decreases as data grows. This motivates the search for approaches that rely on fewer assumptions. To this end, we introduce Temporal Difference in Vision (TDV), a new paradigm for self-supervised learning from video that avoids existing inductive biases, relying instead on a causal assumption that the past causes the future. TDV functions by jointly training an image encoder and a motion encoder so that the current frame's representation plus the encoded motion equals the next frame's representation. Despite not leveraging any strong inductive biases, TDV matches state-of-the-art recipes on dense spatial tasks, laying the foundation for representation learning without strong assumptions.

URL PDF HTML ☆

赞 0 踩 0

2606.15963 2026-06-16 cs.DC cs.AI cs.CL cs.LG 交叉投稿

PreLort: Prefix-Nested LoRA for Federated Fine-Tuning under Rank Heterogeneity

PreLort: 面向秩异构联邦微调的前缀嵌套LoRA

Muhammad Waseem, Nurbek Tastan, Andrej Jovanovic, Nicholas D. Lane, Nils Lukas, Karthik Nandakumar, Samuel Horvath

发表机构 * MBZUAI, UAE University of Cambridge, UK（MBZUAI，阿联酋剑桥大学，英国）； Flower Labs, UK（Flower Labs，英国）； Michigan State University, USA（密歇根州立大学，美国）

AI总结针对联邦LoRA中异构秩导致的信息分布不均问题，提出PreLort方法，通过前缀层次化嵌套低秩结构、分段聚合规则和前缀嵌套训练策略，使低秩客户端受益于高秩客户端的丰富信息，在准确率和ROUGE-L上优于现有方法。

详情

AI中文摘要

使用LoRA等参数高效方法对大型语言模型进行联邦微调，能够实现基础模型的隐私保护适配。异构硬件资源带来了挑战，因为具有不同适配器秩的客户端无法直接聚合。现有方法虽能实现异构秩下的聚合，但未能控制信息在秩维度上的分布，导致共享低秩表示利用不充分。为此，我们提出PreLort：一种用于联邦LoRA的嵌套低秩公式，将适配器维度组织成前缀层次结构。我们的方法确保较低秩维度编码任务相关信息，而较高秩维度捕获额外容量。基于此，我们引入(i)分段聚合规则，仅对贡献于每个秩分段的客户端进行平均，避免来自零填充低秩客户端的稀释；以及(ii)前缀嵌套训练策略，在多个秩截断下优化每个适配器，鼓励有用信号集中在低秩前缀维度。这些组件共同鼓励一个一致的低秩前缀捕获最任务相关信息，而较高秩维度学习额外容量。这使得低秩客户端能够受益于高秩客户端贡献的更丰富信息，因为前缀维度被一致地学习和聚合。实验表明，我们的方法在准确率和ROUGE-L上持续优于先前的异构联邦LoRA方法，并在多个基础模型上实现了更低或相当困惑度。

英文摘要

Federated fine-tuning of large language models using parameter-efficient methods such as LoRA enables privacy-preserving adaptation of foundation models. Heterogeneous hardware resources introduce challenges, as clients with different adapter ranks cannot be directly aggregated. While existing methods enable aggregation under heterogeneous ranks, they fail to control how information is distributed across rank dimensions, leading to suboptimal use of shared low-rank representations. Instead, we propose PreLort: a nested low-rank formulation for federated LoRA that organizes adapter dimensions into a prefix hierarchy. Our approach ensures that lower-rank dimensions encode task-relevant information, while higher-rank dimensions capture additional capacity. Building on this, we introduce (i) a segment-wise aggregation rule that averages only over clients contributing to each rank segment, avoiding dilution from zero-padded lower-rank clients, and (ii) a prefix-nested training strategy that optimizes each adapter under multiple rank truncations, encouraging useful signal to concentrate in low-rank prefix dimensions. Together, these components encourage a consistent low-rank prefix capturing the most task-relevant information, while higher-rank dimensions learn additional capacity. This allows low-rank clients to benefit from richer information contributed by higher-rank clients, as prefix dimensions are consistently learned and aggregated. Experiments demonstrate that our method consistently outperforms prior heterogeneous federated LoRA methods in accuracy and ROUGE-L, while achieving lower or comparable perplexity across multiple base models.

URL PDF HTML ☆

赞 0 踩 0

2606.15989 2026-06-16 q-bio.NC cs.AI 交叉投稿

Task-guided cross-subject latent alignment: a multi-encoder-decoder VAE

任务引导的跨被试潜在对齐：一种多编码器-解码器VAE

Angeliki Papathanasiou, Jascha Achterberg, Thomas E. Nichols, Rui Ponte Costa

发表机构 * Centre for Neural Circuits and Behaviour Department of Physiology Anatomy and Genetics University of Oxford（神经回路与行为中心生理解剖与遗传学系牛津大学）； Big Data Institute Nuffield Department of Medicine University of Oxford（大数据研究所纳菲尔德医学系牛津大学）

AI总结提出MED-VAE模型，通过预训练ANN锚定表征，实现无共享刺激的跨被试神经对齐，在自然场景数据集上优于传统方法，并支持跨被试图像解码。

Comments In Proceedings of the 9th Conference on Cognitive Computational Neuroscience, New York, NY, USA, 2026

详情

AI中文摘要

对齐跨被试的神经活动有望发现共享的计算原理和可泛化解码器。然而，传统对齐方法要求被试间共享刺激，这一限制使其难以应用于数据有限或非重叠的自然范式。我们提出了一种多编码器-解码器变分自编码器（MED-VAE），通过将表征锚定到预训练ANN提供的公共支架上，实现了无需共享刺激的跨被试对齐。利用自然场景数据集，我们展示了MED-VAE创建了具有优越语义组织的公共潜在空间，在跨被试对齐方面优于常见方法，同时在对传统方法失效的保留刺激上保持了稳健的泛化能力。从这些公共空间重建回每个被试的原始神经空间，MED-VAE在其跨被试潜在空间中保留了等量的刺激驱动信号。最后，我们展示了这种优越的对齐直接实现了跨被试神经预测，通过跨被试图像解码得到了验证。总之，我们提出了一种框架，用于识别可泛化的公共子空间以进行跨被试预测和下游任务，本文以静态图像视觉皮层响应为例进行了演示。

英文摘要

Aligning neural activity across subjects offers the promise of discovering shared computational principles and generalizable decoders. However, traditional alignment methods require shared stimuli across subjects, a constraint that limits applicability to naturalistic paradigms with limited or non-overlapping data. We introduce a Multi-Encoder-Decoder Variational Autoencoder (MED-VAE) that achieves cross-subject alignment without shared stimuli by anchoring representations to a common scaffold provided by a pretrained ANN. Using the Natural Scenes Dataset, we show that MED-VAE creates common latent spaces with superior semantic organisation, achieving higher cross-subject alignment than common methods while maintaining robust generalisation to held-out stimuli where traditional methods degrade. Reconstructing from these common spaces back to each subject's original neural space, MED-VAE preserves equal stimulus-driven signal in its cross-subject latent space. Finally, we show that this superior alignment directly enables cross-subject neural prediction, as demonstrated via cross-subject image decoding. In summary, we introduce a framework to identify generalisable common subspaces for cross-subject predictions and downstream tasks, demonstrated here for visual cortex responses to static images.

URL PDF HTML ☆

赞 0 踩 0

2606.16050 2026-06-16 cs.LG cs.AI 交叉投稿

缩放自适应深度：范数无关残差网络

Tomás Figliolia, Beren Millidge

发表机构 * Zyphra San Francisco, CA（Zyphra旧金山加州）

AI总结针对残差网络中残差流范数随深度增长导致深层更新被抑制的问题，提出范数无关残差架构NAG，通过分离幅度和方向信息保持各层贡献，并实现可解释的自适应深度跳过机制，在等计算量下匹配全深度性能。

详情

AI中文摘要

残差架构在深度学习中无处不在，但它们存在一个微妙的结构性限制：残差流的范数会随深度迅速增长。因此，来自后层的更新相对于累积的残差状态变得很小。这降低了它们对表示的影响，并限制了模型在深度上扩展的益处。为了解决这个问题，我们引入了NAG，一种范数无关的残差架构，它将残差流中的幅度与方向信息分离，在整个深度中保留有意义的层贡献，并防止后层更新被残差范数增长系统地抑制。重要的是，NAG仅引入可忽略数量的额外参数，并依赖于易于内核融合的简单操作，从而在实践中保持训练效率。我们表明，该架构优于基线Transformer，其增益随深度增加而显著增大，从而能够有效训练更深的模型。范数无关的公式还产生了一种可解释的深度混合（MoD）机制，该机制自适应地跳过注意力和MLP层。除了作为训练后的精度-计算权衡外，该机制还可以用作预训练时的扩展策略：在等FLOP训练下，通过减少每token前向传播成本节省的计算量可以再投资于在更多token上训练，同时保持总参数数量和KV缓存预算固定。在我们的实验中，约20%-25%的适度深度混合率在相等训练计算量下匹配全深度基线性能，同时大幅减少执行的层参数数量和前向传播FLOPs。这些结果将深度稀疏性确定为固定计算量训练的新扩展轴，从而能够实现非常深但FLOP高效的模型。

英文摘要

Residual architectures are ubiquitous in deep learning, but they suffer from a subtle structural limitation: the norm of the residual stream can grow rapidly with depth. As a result, updates from later layers become small relative to the accumulated residual state. This reduces their impact on the representation and limits the benefits of scaling models in depth. To address this, we introduce NAG, a norm-agnostic residual architecture that separates magnitude from directional information in the residual stream, preserving meaningful layer contributions throughout depth and preventing later updates from being systematically suppressed by residual-norm growth. Importantly, NAG introduces only a negligible number of additional parameters and relies on simple operations that are easily kernel-fusible, preserving training efficiency in practice. We show that this architecture outperforms baseline Transformers, with gains that increase substantially as depth grows, enabling effective training of much deeper models. The norm-agnostic formulation also leads to an interpretable Mixture-of-Depths (MoD) mechanism that adaptively skips both attention and MLP layers. Beyond serving as a post-training accuracy-compute tradeoff, this mechanism can be used as a pretraining-time scaling strategy: under iso-FLOP training, compute saved by reducing per-token forward-pass cost can be reinvested into training on more tokens while keeping the total parameter count and KV-cache budget fixed. In our experiments, moderate Mixture-of-Depths rates of approximately 20%-25% match full-depth baseline performance under equal training compute while substantially reducing the number of executed layer parameters and forward-pass FLOPs. These results identify sparsity in depth as a new scaling axis for fixed-compute training, enabling very deep yet FLOP-efficient models.

URL PDF HTML ☆

赞 0 踩 0

2606.16160 2026-06-16 cs.LG cs.AI cs.HC 交叉投稿

A comparative and critical study of EEGNet for fNIRS-driven cognitive load classification

EEGNet在fNIRS驱动的认知负荷分类中的比较与批判性研究

Mehshan Ahmed Khan, Houshyar Asadi, Li Zhang, Mohammad reza Chalak Qazani, Ghazal Bargshady, Stefanos gkikas, Christian arzate, Sam Oladazimi, Zoran Najdovsk, Lei Wei, Chee Peng Lim

发表机构 * Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University（智能系统研究与创新研究所（IISRI），德克萨斯大学）； Department of Computer Science, Royal Holloway, University of London（伦敦大学皇家霍洛威学院计算机科学系）； College of Science and Engineering, James Cook University（詹姆斯库克大学科学与工程学院）； Faculty of Science and Technology, University of Canberra（堪培拉大学科学与技术学院）； Honda research institute (HRI), Japan（日本本田研究院）； Swinburne University of Technonology, Hawthorn, Victoria（技术学院，维多利亚州哈沃恩）

AI总结本研究系统评估EEGNet在fNIRS认知负荷分类中的性能，发现重叠分段和小固定学习率在随机分割中表现最佳，但受试者独立评估准确率大幅下降，非重叠分段和PCA特征在SI评估中取得最佳56.11%准确率，表明消除时间冗余有助于学习更鲁棒的跨个体表征。

详情

AI中文摘要

由于时间变异性、受试者间差异以及对预处理选择的敏感性，从功能性近红外光谱（fNIRS）信号中准确分类认知负荷仍然是一个重大挑战。本研究通过系统检查时间分割策略（重叠与非重叠）、窗口长度（10秒、20秒、30秒）、特征提取方法（方差分析（ANOVA）、主成分分析（PCA）、快速独立成分分析（FastICA））、学习率配置（固定和自适应）以及评估协议（随机分割与受试者独立（SI））的影响，对EEGNet在基于fNIRS的认知负荷分类中进行了全面评估。随机分割实验的结果表明，重叠分割结合较小的固定学习率（0.01-0.001）由于时间冗余和血流动力学转变的密集采样而产生了最高的准确率。然而，SI评估显示准确率大幅下降，表明对未见参与者的泛化能力有限。在SI评估下，非重叠分割优于重叠窗口，使用PCA特征、20秒窗口和0.1学习率获得了最佳准确率56.11%。这些发现表明，消除时间冗余有助于模型学习更鲁棒和可泛化的跨个体认知负荷表征。尽管自适应学习率策略提高了训练稳定性，但并未超过最优选择的固定学习率的性能。该研究强调了分割策略和学习率选择在提高模型泛化能力中的关键作用，并指出了开发基于fNIRS的可靠、实时和受试者独立认知负荷分类系统所必需的方法学考虑。

英文摘要

Accurately classifying cognitive load from functional near-infrared spectroscopy (fNIRS) signals remains a significant challenge due to temporal variability, inter-subject differences, and sensitivity to preprocessing choices. This study provides a comprehensive evaluation of EEGNet for fNIRS-based cognitive load classification by systematically examining the effects of temporal segmentation strategies (overlapping vs. non-overlapping), window lengths (10s, 20s, 30s), feature extraction methods (Analysis of Variance (ANOVA), Principal Component Analysis (PCA), Fast Independent Component Analysis (FastICA)), learning rate configurations (fixed and adaptive), and evaluation protocols (random split vs. subject-independent (SI)). Results from random-split experiments show that overlapping segmentation, combined with smaller fixed learning rates (0.01-0.001), yields the highest accuracies, due to temporal redundancy and dense sampling of hemodynamic transitions. However, SI evaluation reveals a substantial drop in accuracy, demonstrating limited generalization to unseen participants. Under SI evaluation, non-overlapping segmentation outperformed overlapping windows, with the best accuracy of 56.11% achieved using PCA features with a 20-second window and a 0.1 learning rate. These findings indicate that eliminating temporal redundancy helps the model learn more robust and generalizable representations of cognitive load across individuals. Although adaptive learning rate strategy improved training stability, it did not surpass the performance of optimally selected fixed learning rates. The study highlights the critical role of segmentation strategy and learning rate selection in improving model generalization and identifies methodological considerations essential for developing reliable, real-time, and SI cognitive load classification systems using fNIRS.

URL PDF HTML ☆

赞 0 踩 0

2606.16193 2026-06-16 cs.CV cs.AI cs.LG 交叉投稿

输入依赖的Fisher信息矩阵用于医学图像分类器的局部敏感性分析

Sourya Sengupta. Mark A. Anastasio

发表机构 * Department of Electrical and Computer Engineering, University of Illinois Urbana–Champaign（伊利诺伊大学厄巴纳-香槟分校电气与计算机工程系）； Mallinckrodt Institute of Radiology and Department of Electrical & Systems Engineering, Washington University in St. Louis（华盛顿大学圣路易斯分校马林克罗德特放射医学研究所及电气与系统工程系）

AI总结提出基于输入依赖Fisher信息矩阵(iFIM)的局部敏感性分析框架，通过Gram矩阵恢复iFIM非零谱，将图像分解为高/低敏感性分量，实验证明高敏感性分量与预测置信度和分类性能变化强相关。

详情

AI中文摘要

深度神经网络在医学图像分类中取得了强大性能，但通常像黑箱一样工作。常用的后验解释方法通常提供启发式可视化，其与分类器预测分布的关系是间接的。本文引入了一个基于训练分类器的输入依赖Fisher信息矩阵(iFIM)的局部敏感性分析框架。iFIM描述了在输入图像的无穷小扰动下分类器预测分布的变化。通过使用Gram矩阵公式，可以在不显式形成完整图像维度的Fisher矩阵的情况下恢复iFIM的非零特征谱。然后利用领先的iFIM特征空间将输入图像投影为高局部敏感性分量及其正交分量。这些分量提供了局部预测敏感性的模型内在描述，而不是传统的逐像素归因热图或任务相关解剖结构的因果分割。该框架在受控和临床医学图像分类任务上使用多种分类器架构进行了评估。基于扰动的实验表明，高敏感性iFIM分量与预测置信度和分类性能的变化相比低敏感性互补分量有更强的耦合。结果支持iFIM框架作为分析局部决策敏感性的原则性工具，并补充医学成像中现有的基于归因的可解释性方法。

英文摘要

Deep neural networks have achieved strong performance in medical image classification, but often work like black-box. Commonly used post-hoc interpretation methods often provide heuristic visualizations whose relationship to the classifier's predictive distribution is indirect. This work introduces a local sensitivity analysis framework based on the input-dependent Fisher Information Matrix (iFIM) of a trained classifier. The iFIM characterizes how the classifier's predictive distribution changes under infinitesimal perturbations of the input image. By using a Gram-matrix formulation, the nonzero eigenspectrum of the iFIM can be recovered without explicitly forming the full image-dimensional Fisher matrix. The leading iFIM eigenspace is then used to project an input image into a high local-sensitivity component and its orthogonal component. These components provide a model-intrinsic description of local predictive sensitivity, rather than a conventional pixel-wise attribution heatmap or a causal segmentation of task-relevant anatomy. The framework is evaluated on controlled and clinical medical image classification tasks using multiple classifier architectures. Perturbation-based experiments show that high-sensitivity iFIM components are more strongly coupled to changes in predictive confidence and classification performance than lower-sensitivity complementary components. The results support the iFIM framework as a principled tool for analyzing local decision sensitivity and for complementing existing attribution-based interpretability methods in medical imaging.

URL PDF HTML ☆

赞 0 踩 0

2606.16454 2026-06-16 cs.LG cs.AI 交叉投稿

SDS-LoRA: Overcoming Anisotropic Gradient Scaling in Low-Rank Adaptation

SDS-LoRA：克服低秩适应中的各向异性梯度缩放

Junghun Oh, Sungyong Baik, Kyoung Mu Lee

发表机构 * Seoul National University（首尔大学）； Hanyang University（汉阳大学）

AI总结提出SDS-LoRA，通过结构解耦奇异值与反向传播，消除LoRA中梯度各向异性缩放导致的秩降低和次优对齐问题，提升收敛速度和适应性能。

详情

AI中文摘要

低秩适应（LoRA）通过使用低秩矩阵参数化权重更新，实现了大型预训练模型对下游任务的高效适应。在本文中，我们从几何角度研究了LoRA参数化的局限性。具体地，我们表明当全微调梯度反向传播到低秩矩阵时，它会经历由奇异值驱动的各向异性缩放。我们认为这种现象是不可取的，因为它通过将梯度偏向主导奇异方向而抑制其他方向，从而扭曲了全微调梯度。我们的分析表明，各向异性梯度缩放降低了低秩矩阵梯度的有效秩，并导致LoRA中全微调梯度与其低秩近似之间的次优对齐，从而加剧了与全微调的差距。为了解决这些局限性，我们提出了一种新的低秩参数化方法SDS-LoRA，该方法在结构上将奇异值与反向传播解耦。我们的方法确保全微调梯度仅通过低秩矩阵子空间的正交基反向传播，独立于其尺度。收敛性分析表明，虽然LoRA的收敛速率随低秩矩阵的条件数而恶化，但SDS-LoRA与之无关。在自然语言和视觉基准上的实验结果表明，SDS-LoRA改善了损失收敛并缩小了与全微调的差距，显著提升了适应性能。

英文摘要

Low-Rank Adaptation (LoRA) enables efficient adaptation of large pre-trained models to downstream tasks by parameterizing weight updates with low-rank matrices. In this paper, we investigate the limitations of the LoRA parameterization from a geometric perspective. Specifically, we show that when a full fine-tuning gradient is backpropagated to the low-rank matrices, it undergoes anisotropic scaling driven by their singular values. We argue that this phenomenon is undesirable because it distorts the full fine-tuning gradient by skewing it toward dominant singular directions while suppressing others. Our analyses demonstrate that anisotropic gradient scaling reduces the effective rank of the low-rank matrices' gradients and results in suboptimal alignment between the full fine-tuning gradient and its low-rank approximation in LoRA, thereby exacerbating the gap to full fine-tuning. To address these limitations, we propose a new low-rank parameterization, SDS-LoRA, which structurally decouples singular values from the backward pass. Our method ensures that the full fine-tuning gradient backpropagates only through the orthonormal bases of the low-rank matrices' subspaces, independent of their scales. Convergence analysis demonstrates that while LoRA's convergence rate degrades with the condition number of the low-rank matrices, SDS-LoRA remains independent of it. Experimental results across natural language and vision benchmarks show that SDS-LoRA improves loss convergence and reduces the gap to full fine-tuning, significantly enhancing adaptation performance.

URL PDF HTML ☆

赞 0 踩 0

2606.16456 2026-06-16 cs.LG cs.AI 交叉投稿

SPRI: SVD-Partitioned Residual Initialization for Data-Constrained MoE Upcycling

SPRI: 基于SVD分解残差初始化的数据受限MoE升级方法

Weiqiao Shan, Ruixiang Mao, Yuang Li, Yuhao Zhang, Yingfeng Luo, Tong Zheng, Chen Xu, Yucheng Qiao, Chunxiang Jin, Yi Yuan, Jingdong Chen, Tong Xiao, Jingbo Zhu

发表机构 * Northeastern University, China（东北大学）； Huawei TSC, China（华为技术有限公司）； CUHK-Shenzhen, China（香港中文大学（深圳））； University of Maryland, USA（马里兰大学）； Harbin Engineering University, China（哈尔滨工程大学）； Inclusion AI, Ant Group（蚂蚁集团Inclusion AI）； NiuTrans Research, China（小牛翻译研究中心）

AI总结提出SPRI方法，利用预训练FFN权重的SVD分解残差初始化MoE专家，结合两阶段训练策略，在数据受限的多语言语音翻译任务中显著提升性能。

Comments 8pages, 12 tables, 3 figures

详情

AI中文摘要

混合专家（MoE）模型能够实现高效扩展，但从头训练成本过高。MoE升级通过将预训练的密集模型转换为稀疏MoE模型来降低这一成本。然而，现有的升级方法通常依赖大规模持续训练，并且在数据受限的监督适应中表现不佳，原因在于专家同质化或对预训练参数的过度扰动。在此设置下，有效的升级必须利用预训练权重结构，同时为路由专家引入足够的多样性。为此，我们提出了基于SVD分解残差初始化（SPRI）的方法，该方法将从预训练前馈网络（FFN）权重中提取的SVD分解残差分配到路由专家中，从而在预训练谱结构的基础上引入可控的专家多样性。我们进一步引入两阶段训练策略以提高适应稳定性。我们在多语言语音到文本翻译任务上评估SPRI，该任务中有限的监督数据对MoE升级构成挑战，而多个目标语言提供了天然的路由异质性。在CoVoST2数据集上的15个英语到其他语言方向中，SPRI相比完全微调的密集模型平均BLEU和COMET分别提高了2.58和3.32分，并且比之前最佳的MoE升级基线高出3.39 BLEU和4.34 COMET分。

英文摘要

Mixture-of-Experts (MoE) models enable efficient scaling, but training them from scratch remains prohibitively expensive. MoE upcycling mitigates this cost by converting pretrained dense models into sparse MoE models. However, existing upcycling methods typically rely on large-scale continued training and often perform poorly under data-constrained supervised adaptation, due to either homogeneous experts or overly disruptive perturbations to pretrained parameters. In this setting, effective upcycling must leverage pretrained weight structure while introducing sufficient diversity among routed experts. To this end, we propose SVD-Partitioned Residual Initialization (SPRI), which distributes SVD-partitioned residuals derived from pretrained feed-forward network (FFN) weights across routed experts, introducing controlled expert diversity grounded in pretrained spectral structure. We further introduce a two-stage training strategy to improve adaptation stability. We evaluate SPRI on multilingual speech-to-text translation, where limited supervised data challenges MoE upcycling and multiple target languages provide natural routing heterogeneity. On CoVoST2 across 15 En-to-XX directions, SPRI improves average BLEU and COMET over fully fine-tuned dense models by 2.58 and 3.32 points, respectively, and outperforms the prior best MoE upcycling baseline by 3.39 BLEU and 4.34 COMET points.

URL PDF HTML ☆

赞 0 踩 0

2606.16462 2026-06-16 cs.LG cs.AI 交叉投稿

Learning aligned EEG representations with subject-specific encoders

学习带有主体特定编码器的对齐脑电图表示

Bruna J. Lopes, Gabriel Schwartz, Sylvain Chevallier, Raphael Y. de Camargo, Bruno Aristimunha

发表机构 * University of São Paulo（圣保罗大学）； Université Paris-Saclay, Inria TAU team, LISN-CNRS（巴黎萨克雷大学，Inria TAU团队，LISN-CNRS）； Institut de neuromodulation, GHU Paris, psychiatrie et neurosciences, centre hospitalier Sainte-Anne, pôle hospitalo-universitaire 15, Université Paris Cité（神经调控研究所，GHU巴黎，精神病学与神经科学，圣安娜医院，大学医院中心15区，巴黎西岱大学）； Federal University of ABC (UFABC)（ABC联邦大学）； Yneuro ； Swartz Center for Computational Neuroscience (SCCN), Institute for Neural Computation (INC), University of California San Diego（斯沃茨计算神经科学中心，神经计算研究所，加州大学圣地亚哥分校）

AI总结提出使用主体特定编码器替代共享编码器，结合共同分类器实现跨主体脑电图对齐，实验表明该方法能内化欧几里得对齐的作用，提高类别区分度，并识别出未见主体的编码器选择是主要瓶颈。

详情

AI中文摘要

跨主体脑电图解码有望提供更多训练数据，但也使神经网络面临强烈的跨主体分布偏移。我们研究仅凭任务监督和架构是否能学习主体对齐的表示。我们将共享的脑电图编码器替换为主体特定编码器后接共同分类器，并在四个运动想象数据集上将该混合模型与标准EEGNet、AttentionBaseNet和CTNet基线（结合欧几里得对齐EA）进行比较。EA通过重新居中主体协方差改进了共享编码器，但混合编码器在很大程度上内化了这一作用：当移除EA时，验证损失曲线和潜在距离分析变化很小。主体特定头增加了类别区分度，并将每个主体置于其自身的潜在流形附近，改善了大多数主体，但留下了一个对方法敏感的子集。这些结果支持主体特定编码器作为脑电图解码的学习对齐机制，并将未见主体的编码器选择确定为剩余瓶颈。

英文摘要

Cross-subject EEG decoding promises more training data, but it also exposes neural networks to strong inter-subject distribution shifts. We study whether task supervision and architecture alone can learn subject-aligned representations. We replace a shared EEG encoder with subject-specific encoders followed by a common classifier, and compare this hybrid model with standard EEGNet, AttentionBaseNet, and CTNet baselines with Euclidean Alignment (EA) on four motor-imagery datasets. EA improves shared encoders by recentering subject covariances, but the hybrid encoder largely internalises this role: validation-loss curves and latent-distance analyses change little when EA is removed. Subject-specific heads increase class distinctiveness and place each subject close to its own latent manifold, improving most subjects while leaving a method-sensitive subset. These results support subject-specific encoders as a learned alignment mechanism for EEG decoding and identify head selection for unseen subjects as the remaining bottleneck.

URL PDF HTML ☆

赞 0 踩 0

2606.16633 2026-06-16 cs.CV cs.AI 交叉投稿

DCP-Prune: Ultra-Low Token Pruning with Distribution Consistency Preservation

DCP-Prune：基于分布一致性保持的超低令牌剪枝

Xifeng Xue, Xiaokang Wang, Zirui Li, Ming-Ming Cheng, Guolei Sun

发表机构 * College of Computer Science, Nankai University（南开大学计算机学院）； Nanjing University of Posts and Telecommunications（南京邮电大学）

AI总结提出DCP-Prune框架，通过锚点-上下文图恢复和文本感知令牌聚类选择，在超低令牌预算下保持分布一致性，实现稳定高性能。

Comments The code will be released at: https://github.com/EMVision-NK/DCP-Prune

详情

AI中文摘要

最近的视觉令牌剪枝方法在中等令牌预算下能有效保持模型性能，但在超低令牌预算下变得不稳定。我们的分析表明，随着剪枝预算减少，精度下降通常伴随着更大的特征分布偏移。关键的是，这种分布偏移的程度与性能下降强相关。为了更好地表征这一现象，我们引入了一种轻量级的分布一致性度量来估计保留令牌与完整令牌之间的分布偏移。受这些观察启发，我们提出了一个两阶段剪枝框架，包括锚点-上下文图恢复（ACGR）和文本感知令牌聚类选择（TATCS）。具体地，ACGR在令牌移除前转移上下文信息，而TATCS在检测到严重分布偏移时动态重新选择代表性令牌。大量实验表明，我们的方法在超低令牌预算下实现了更优且更稳定的性能。值得注意的是，在仅使用16个视觉令牌的情况下，它在LLaVA-1.5-7B上保留了92.1%的上限平均性能。

英文摘要

Recent vision token pruning methods effectively preserve model performance under moderate token budgets but become unstable under ultra-low token budget. Our analysis shows that as the pruning budget decreases, accuracy degradation is often accompanied by larger feature distribution shifts. Critically, the degree of this distribution shift strongly correlates with performance degradation. To better characterize this phenomenon, we introduce a lightweight distribution consistency metric to estimate the distribution shift between retained and full tokens. Motivated by these observations, we propose a two-stage pruning framework consisting of Anchor-Context Graph Recovery (ACGR) and Text-Aware Token Cluster Selection (TATCS). Specifically, ACGR transfers contextual information before token removal, while TATCS dynamically re-selects representative tokens when severe distribution shift is detected. Extensive experiments demonstrate that our method achieves superior and more stable performance under ultra-low token budget. Notably, it retains 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.

URL PDF HTML ☆

赞 0 踩 0

2606.16694 2026-06-16 cs.LG cs.AI physics.app-ph q-bio.NC 交叉投稿

Adaptive inference and function vectors in deep transformers

深度变换器中的自适应推理与函数向量

Ravin Raj, Gautam Reddy

发表机构 * Joseph Henry Laboratories of Physics, Princeton University（普林斯顿大学约瑟夫·亨利物理实验室）

AI总结提出深度变换器作为平均场交互系统实现分布式推理的理论，利用函数向量逐层推断潜在上下文变量，在上下文回归任务中预测非高斯分层结构与深度的关系，并通过约束线性注意力变换器验证。

详情

AI中文摘要

变换器被广泛用作学习大量耦合变量间复杂相关性的通用基础架构，但其内部机制仍不明确。我们提出了一种深度变换器作为平均场交互系统的理论，该系统在通信、局部性和深度约束下实现分布式推理。我们证明，这样的系统可以利用内部状态表示（“函数向量”）在其层上以越来越精细的尺度推断潜在上下文变量。在上下文回归任务中，该理论预测了潜在上下文变量中的非高斯分层结构与变换器深度之间的非平凡关系。使用约束线性注意力变换器对预测进行了测试，并展示了深度架构中的自适应推理。前馈模块和深度使变换器能够实现比先前描述的更丰富的上下文学习算法类别。

英文摘要

Transformers are widely used as a general-purpose substrate for learning complex correlations between a large collection of coupled variables, but their internal mechanisms have remained mysterious. We introduce a theory of a deep transformer as a mean-field interacting system that implements distributed inference, subject to constraints on communication, locality and depth. We show that such a system can exploit internal state representations ('function vectors') to infer a latent context variable at increasingly finer scales over its layers. In an in-context regression task, the theory predicts a non-trivial relationship between non-Gaussian, hierarchical structure in the latent context variable, and transformer depth. Predictions are tested using constrained linear attention transformers and demonstrate adaptive inference in deep architectures. Feedforward blocks and depth enable transformers to implement a much richer class of in-context learning algorithms than previously described.

URL PDF HTML ☆

赞 0 踩 0

2606.16730 2026-06-16 stat.ML cs.AI cs.LG 交叉投稿

循环绑定——混合专家语言模型中的专家层绑定

Martin Jaggi

发表机构 * EPFL（瑞士联邦理工学院洛桑）

AI总结提出专家绑定方法，通过共享连续Transformer层的专家参数，在保持独立路由和注意力的同时，将MoE模型内存占用降低近2倍，且不损失困惑度或下游性能。

Comments Code available at https://github.com/epfml/looped-moe

详情

AI中文摘要

混合专家（MoE）架构通过每个令牌仅激活一小部分专家来高效扩展大型语言模型（LLM），但全部参数计数——主要由专家参数主导——必须保留在训练和推理内存中。为了解决这个问题，我们引入了专家绑定（Expert Tying），这是一种架构修改，它在连续Transformer层之间共享专家参数，同时保留独立的逐层路由和注意力。我们在常见的先进架构上评估了这种方法，包括OLMoE、Qwen3和DeepSeek风格的MoE。我们的预训练实验表明，绑定专家可以将内存占用减少近2倍，而几乎不降低困惑度或下游质量。通过利用MoE路径中固有的参数冗余，我们的方法提供了高度有利的计算-内存权衡，推动了下一代LLM的高效训练和扩展。

英文摘要

Mixture-of-Experts (MoE) architectures efficiently scale Large Language Models (LLMs) by activating only a small fraction of their experts per token, yet the full parameter count - dominated by the expert parameters - must be held in training and inference memory. To address this, we introduce Expert Tying, an architectural modification that shares expert parameters across consecutive transformer layers while preserving independent, layer-wise routing and attention. We evaluate this approach across common, state-of-the-art architectures, including OLMoE, Qwen3, and DeepSeek-style MoEs. Our pretraining experiments demonstrate that tying experts can reduce memory footprint by almost 2x at virtually no degradation in perplexity or downstream quality. By exploiting the parameter redundancy inherent in MoE pathways, our method provides a highly favorable compute-to-memory trade-off, advancing efficient training and scaling of next-generation LLMs.

URL PDF HTML ☆

赞 0 踩 0

2606.16837 2026-06-16 cs.CV cs.AI cs.SD 交叉投稿

Robust Spoofed Speech Detection via Temporal Pyramid Modeling

基于时间金字塔建模的鲁棒语音伪造检测

Mahtab Masoudi Nezhad, Nima Karimian

发表机构 * Lane Department of Computer Science and Electrical Engineering, West Virginia University（西弗吉尼亚大学莱恩计算机科学与电气工程系）； Bellini College of Artificial Intelligence, Cybersecurity and Computing, University of South Florida（南佛罗里达大学贝利尼人工智能、网络安全与计算学院）

AI总结提出时间金字塔适配器，通过多尺度时间卷积捕获局部伪影和全局韵律异常，结合自监督XLS-R表示，在多个数据集上显著优于基线模型。

详情

AI中文摘要

伪造语音检测日益受到逼真合成、语音转换和重放攻击的挑战，跨数据集泛化仍然是主要限制。本文提出时间金字塔适配器，利用具有不同感受野的并行时间卷积来捕获多尺度伪造线索，从局部伪影到全局韵律异常。我们还集成了自监督XLS-R表示，并结合前端适配器，包括Mel、Sinc和用于多尺度时间建模的时间金字塔设计。所提出的模型在多个基准上进行了评估，包括ASVspoof 2017、ASVspoof 2021 (DF/LA)、PartialSpoof、DiffSSD和多语言HQ-MPSD数据集。实验结果表明，时间金字塔模型在PartialSpoof数据库上获得了99.24%的AUC和3.87%的EER，显著优于基础模型和多个SOTA基线，如LCNN-BLSTM（9.87% EER）和TRACE（8.08% EER）。此外，多语言评估证实，虽然伪造伪影与语言无关，但自监督表示提高了鲁棒性，在领域和语言偏移下性能下降，凸显了需要更好的适应和校准策略。

英文摘要

Spoofed speech detection is increasingly challenged by realistic synthesis, voice conversion, and replay attacks, with cross-dataset generalization remaining a major limitation. This work we propose a Temporal Pyramid Adapter that utilize parallel temporal convolutions with varying receptive fields to capture multi-scale spoofing cues, ranging from local artifacts to global prosodic irregularities. We also integrated self-supervised XLS-R representations combined with front-end adapters, including Mel, Sinc, and a Temporal Pyramid design for multi-scale temporal modeling. The proposed model is evaluated cross multiple benchmark including ASVspoof 2017, ASVspoof 2021 (DF/LA), PartialSpoof, DiffSSD, and multilingual HQ-MPSD datasets. Experimental results demonstrate that Temporal Pyramid model obtained AUC of 99.24% and a EER of 3.87% on the PartialSpoof database, which is significantly outperforming the base model and several SOTA baseline such as LCNN-BLSTM (9.87% EER) and TRACE (8.08% EER). Additionally, multilingual evaluations confirm that while spoofing artifact are independent from language. While self-supervised representations improve robustness, performance degrades under domain and language shifts, highlighting the need for better adaptation and calibration strategies.

URL PDF HTML ☆

赞 0 踩 0

2606.16846 2026-06-16 cs.LG cs.AI 交叉投稿

Deep Q-Learning on Hölder Spaces

Hölder空间上的深度Q学习

Qian Qi

发表机构 * Peking University（北京大学）

AI总结研究连续时间随机控制中Q学习的算子核心，通过分析扩散设置下Bellman最优性目标的正则性和逼近复杂度，提出适应混合正则性的张量积DeepONet架构，并给出显式逼近和资源界限。

详情

AI中文摘要

我们研究了具有连续状态和动作的连续时间随机控制中Q学习的算子理论核心。在基于价值的强化学习中，每次Q学习或DQN更新都基于Bellman最优性目标；我们的分析在扩散设置中分离出该目标，并研究其正则性和逼近复杂度。在均匀椭圆性和Hölder正则系数下，我们证明Bellman更新将有界输入映射到各向异性正则类，平滑状态变量而仅保留对动作变量的Lipschitz依赖性。这产生了Bellman迭代的紧族，并激发了适应问题混合正则性的张量积DeepONet架构。然后我们推导出显式的逼近和资源界限，以及时间步长$δ\ o 0$时的刚度-复杂度权衡。所得理论在连续随机控制中Bellman目标正则性和逼近层面直接贡献于Q学习理论。同时，我们并未声称对包含探索、经验回放和随机梯度更新的实际采样Q学习有完整的收敛定理。

英文摘要

We study the operator-theoretic core of Q-learning in continuous-time stochastic control with continuous states and actions. In value-based reinforcement learning, each Q-learning or DQN update is built from a Bellman optimality target; our analysis isolates this target in a diffusion setting and studies its regularity and approximation complexity. Under uniform ellipticity and Hölder-regular coefficients, we show that a Bellman update maps bounded inputs into an anisotropic regularity class, smoothing the state variable while leaving only Lipschitz dependence on the action variable. This yields a compact family of Bellman iterates and motivates a tensor-product DeepONet architecture adapted to the mixed regularity of the problem. We then derive explicit approximation and resource bounds, together with a stiffness--complexity trade-off as the time step $δ\to 0$. The resulting theory makes a direct contribution to Q-learning theory at the level of Bellman target regularity and approximation in continuous stochastic control. At the same time, we do not claim a full convergence theorem for practical sampled Q-learning with exploration, replay, and stochastic gradient updates.

URL PDF HTML ☆

赞 0 踩 0

2606.16883 2026-06-16 cs.LG cs.AI 交叉投稿

Upper Bounds on the Generalization Error of Deep Learning Models via Local Robustness and Stability

深度学习模型泛化误差的上界：基于局部鲁棒性和稳定性

Abdul-Rauf Nuhu, Parham M. Kebria, Vahid Hemmati, Mahmoud N. Mahmoud, Edward Tunstel, Abdollah Homaifar

发表机构 * North Carolina Agricultural and Technical State University（北卡罗来纳农业技术州立大学）； University of Alabama（阿拉巴马大学）； Southwest Research Institute（西南研究院）

AI总结提出一种通过局部区域稳定样本数缩放鲁棒性项的泛化上界，在ImageNet上实现非空洞且最紧的误差估计。

详情

AI中文摘要

泛化是数据驱动模型的关键属性，尤其是在安全关键应用中部署的深度学习模型。基于鲁棒性的泛化界作为一种将鲁棒性与泛化性能联系起来的原则性方法而受到关注，通常以数据依赖的方式。然而，大多数现有界在实际设置中存在空洞问题，产生远超过实际错误率的松散上界，限制了其在真实世界评估中的实用性。虽然这个问题通常归因于不确定性项，但问题的很大一部分源于鲁棒性项本身，特别是对于0-1损失。现有方法通常将鲁棒性项视为全局度量，忽略了其在输入空间不同子区域间的变化。在这项工作中，我们提出了一种泛化界，通过根据每个子区域内稳定和不稳定样本的数量来缩放鲁棒性项，从而解决了这一局限性。我们的界同时包含数据和模型依赖因素，同时保持实际相关性（产生更紧的真实误差上界）。在ImageNet数据集上训练的模型上的实验表明，我们的界始终非空洞，并在现有方法中实现了最紧的估计，与一系列鲁棒深度神经网络的实证性能紧密对齐。

英文摘要

Generalization is a critical property of data-driven models, particularly deep learning models deployed in safety-critical applications. Robustness-based generalization bounds have gained attention as a principled way to link robustness properties to generalization performance, often in a data-dependent manner. However, most existing bounds suffer from vacuousness in practical settings, yielding loose upper bounds that greatly exceed the actual error rates and limiting their usefulness for real-world evaluation. While this issue is often attributed to the uncertainty term, a substantial part of the problem originates from the robustness term itself, particularly for the 0-1 loss. Existing approaches typically treat the robustness term as a global measure, ignoring its variation across different sub-regions of the input space. In this work, we propose a generalization bound that addresses this limitation by scaling the robustness term according to the number of stable and unstable samples within each sub-region. Our bounds incorporate both data- and model-dependent factors while maintaining practical relevance (yielding tighter upper bounds on true error). Experiments on models trained on the ImageNet dataset show that our bounds remain consistently non-vacuous and achieve the tightest estimates among existing methods, closely aligning with empirical performance across a range of robust deep neural networks.

URL PDF HTML ☆

赞 0 踩 0

2606.16891 2026-06-16 cs.LG cs.AI 交叉投稿

Beyond Weights and Gradients: A Taxonomy of Federated Learning Messages

超越权重和梯度：联邦学习消息的分类学

Alvaro Javier Vargas Guerrero, Xinguang Wang, Quang Manh Doan, Guy Nagels

发表机构 * AIMS lab, Center for Neurosciences, UZ Brussel, Vrije Universiteit Brussel, Brussels, Belgium（AIMS实验室，神经科学中心，布鲁塞尔大学医院，布鲁塞尔自由大学，布鲁塞尔，比利时）； Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium（人工智能实验室，布鲁塞尔自由大学，布鲁塞尔，比利时）

AI总结本文提出联邦消息的正式数学定义，建立包含模型结构、统计摘要和数据条件表示的三类分类法，分析计算、通信和隐私权衡，并综述202篇文献揭示2021年后消息范式多样化趋势。

Comments 4 figures, 9 pages, with 7 pages of content

详情

AI中文摘要

联邦学习正迅速发展，超越了传统模型权重和梯度的交换，但现有定义未能涵盖现代负载（如合成数据和联邦分析）的全部范围。本文通过提出一个联邦消息的正式数学定义来弥补这一空白，该定义同时考虑了效用和隐私。我们引入了一个分类法，将这些交换组织为三类：模型结构、统计摘要和数据条件表示。通过基于计算需求、通信成本和隐私风险评估这些组别，我们提供了对去中心化训练中涉及权衡的更清晰理解。我们对202篇近期出版物的回顾凸显了自2021年以来向多样化消息范式的显著转变，标志着从标准深度学习更新向更专业信息共享的转变。该框架为未来研究优化联邦系统以适应不同硬件和安全需求提供了结构化路径。

英文摘要

Federated Learning is rapidly evolving beyond the exchange of traditional model weights and gradients, yet existing definitions fail to capture the full scope of modern payloads like synthetic data and federated analytics. This paper addresses the gap by proposing a formal mathematical definition of a federated message that accounts for both utility and privacy. We introduce a taxonomy that organizes these exchanges into three categories: model structures, statistical summaries, and data-conditioned representations. By evaluating these groups based on computational demands, communication costs, and privacy risks, we provide a clearer understanding of the trade-offs involved in decentralized training. Our review of 202 recent publications highlights a significant shift since 2021 toward diverse messaging paradigms, signaling a move away from standard deep learning updates toward more specialized information sharing. This framework provides a structured path for future research to optimize federated systems for varying hardware and security requirements.

URL PDF HTML ☆

赞 0 踩 0

2606.16920 2026-06-16 cs.LG cs.AI 交叉投稿

Demystifying Variance in Circuit Discovery of LLMs

揭示LLM电路发现中的方差

Frank Zhengqing Wu, Francesco Tonin, Volkan Cevher

发表机构 * Laboratory for Information and Inference Systems (LIONS), École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland（信息与推理系统实验室（LIONS），洛桑联邦理工学院（EPFL），瑞士洛桑）

AI总结本文研究LLM电路发现中的重采样、重述和样本方差，提出CEAP方法减少重采样方差，并分析重述方差源于不同模板激活不同电路，样本方差主要由不忠定义导致。

详情

AI中文摘要

电路发现是机械可解释性中的关键技术，用于定位对执行给定任务至关重要的模型组件。尽管当前最先进的方法（EAP-IG）在（不）忠实性指标上表现良好，但它存在显著的变异性。这包括重采样方差（当我们用来自同一分布的新数据批次探测时电路发生变化）、重述方差（当提示被重新表述时发现的电路发生偏移）以及样本方差（具有低总体不忠实性的电路在单个样本上的不忠实性表现出大幅波动）。本文研究了这些方差的根源。我们证明了CEAP（我们新的电路发现方法，在理论上改进了EAP-IG）可以显著减轻重采样方差。我们进一步表明，重述方差是由于不同模板的提示倾向于激活模型中的不同电路。这使我们提出，可能很难找到一个全面的电路来解释和控制模型在任务上的行为，而该任务可以用无数模板表达，这表明LLM可能本质上难以操控。我们表明，稀疏性（据称能形成更紧凑和可解释的任务电路）无法解决这个问题。关于样本方差，我们认为它很大程度上是良性的：极差的不忠实性分数通常源于不忠实性的定义方式，而非测量电路的缺陷。我们表明，不忠实性的大小受选择性贡献缩放的影响，这是一种神经机制，解释了有时观察到的极差分数。

英文摘要

Circuit discovery is a key technique in mechanistic interpretability to pinpoint the model components that are crucial for performing a given task. Although the current state-of-the-art method (EAP-IG) performs well on the metric of (un)faithfulness, it suffers from substantial variability. This includes resampling variance, where the circuit changes when we probe with a new batch of data from the same distribution; rephrasing variance, where the discovered circuit shifts when the prompts are rephrased; and sample-wise variance, where a circuit with low population unfaithfulness exhibits large fluctuations in unfaithfulness across individual samples. This paper studies the roots of these variances. We demonstrate that CEAP, our new circuit discovery method that improves upon EAP-IG with a theoretical guarantee, can substantially lessen resampling variance. We further show that rephrasing variance arises because prompts with different templates tend to activate different circuits in the model. This leads us to argue that it may be challenging to find a comprehensive circuit that explains and controls the model's behavior on a task, which can be expressed in countless templates, suggesting that LLMs may be inherently hard to steer. We show that sparsity, which has been claimed to form more compact and interpretable task circuits, fails to solve this problem. Regarding sample-wise variance, we argue that it is largely benign: extremely poor unfaithfulness scores often stem from how unfaithfulness is defined, rather than from defects in the measured circuits. We show that the magnitude of unfaithfulness is affected by selective contribution scaling, a neural mechanism that accounts for the extremely poor scores sometimes observed.

URL PDF HTML ☆

赞 0 踩 0

2606.16933 2026-06-16 cs.LG cs.AI 交叉投稿

A Unified Causal-Origin Taxonomy of Distributional Shifts in Reinforcement Learning

强化学习中分布偏移的统一因果起源分类法

Ardianto Wibowo, Paulo E Santos, Amer Baghdadi, Matthew Stephenson, Karl Sammut, Jean-Philippe Diguet

发表机构 * IMT Atlantique（IMT大西洋）； Flinders University（弗林德斯大学）； IRL Crossing ； Priori Analytica ； CNRS（法国国家科学研究中心）

AI总结提出一种统一因果起源分类法，将强化学习中的分布偏移按因果来源（内部/外部）和时间边界（显式/隐式/混合）分类，统一了分布内/外泛化与非平稳性分析。

Comments The paper is currently under review at the Journal of Artificial Intelligence Research (JAIR)

详情

AI中文摘要

强化学习系统在运行条件与先前遇到的条件不同时通常会退化，这反映了底层数据生成过程中的分布偏移。这种偏移可能发生在训练和评估之间，如分布内（ID）和分布外（OOD）泛化，或者发生在环境动态随时间演变的非平稳设置中。然而，这些观点之间的形式关系尚不清楚，现有工作主要关注缓解措施而非智能体-环境交互中偏移的因果起源。本文开发了一个统一的因果起源分类法，描述了强化学习中分布偏移的来源，并将ID/OOD泛化与非平稳设置联系起来。我们将监督学习中的经典数据集偏移原则迁移到强化学习，通过将分布偏移重新表述为生成交互过程。使用部分可观测马尔可夫决策过程（POMDP），我们将交互分解为结构组件，包括状态分布、观测过程、策略、奖励和转移动态，以及偏移时间边界。所提出的分类法区分了内部（智能体驱动）和外部（环境驱动）的分布偏移。偏移时间边界视角进一步刻画了显式、隐式和混合偏移。这种表述将ID/OOD泛化和非平稳性统一为底层过程中的结构化变化。我们还引入了一个评估框架，通过性能退化和恢复指标来衡量偏移影响和适应能力。通过将分布偏移扎根于强化学习的因果起源结构，本文支持在分布偏移下进行系统性的鲁棒性分析。

英文摘要

Reinforcement learning (RL) systems often degrade when operating conditions differ from those previously encountered, reflecting distributional shifts in the underlying data-generating process. Such shifts may occur between training and evaluation, as in In-Distribution (ID) and Out-of-Distribution (OOD) generalization, or within non-stationary settings where environment dynamics evolve over time. However, the formal relationship between these views remains unclear, and existing work mainly focuses on mitigation rather than the causal origin of shift within the agent-environment interaction. This work develops a unified causal-origin taxonomy that characterizes sources of distributional shift in RL and relates ID/OOD generalization to non-stationary settings. We transfer the classical dataset-shift principle from supervised learning to RL by reformulating distributional shift in terms of the generative interaction process. Using a Partially Observable Markov Decision Process (POMDP), we decompose the interaction into structural components, including the state distribution, observation process, policy, reward, and transition dynamics, together with the shifted-time boundary. The proposed taxonomy distinguishes internal, agent-driven, and external, environment-driven, distributional shifts. The shifted-time boundary perspective further characterizes explicit, implicit, and hybrid shifts. This formulation unifies ID/OOD generalization and non-stationarity as structured changes in the underlying process. We also introduce an evaluation framework for measuring shift impact and adaptation through performance degradation and recovery metrics. By grounding distributional shift in the causal-origin structure of RL, this work supports systematic analysis of robustness under distributional shift.

URL PDF HTML ☆

赞 0 踩 0

2606.17028 2026-06-16 cs.LG cs.AI cs.AR 交叉投稿

HAMON: Passive Optical Sequence Mixing for Long-Horizon Forecasting

HAMON: 用于长程预测的无源光学序列混合

Alper Yıldırım

AI总结提出HAMON无源衍射光学预测核心，通过光学传播替代数字序列混合层，在多个基准上优于或接近最强数字基线，MSE最多降低14%。

详情

AI中文摘要

简单的线性模型和频域模型在长程时间序列预测中仍然出奇地具有竞争力，最近的机制证据表明，标准预测基准可能不需要使Transformer在其他领域强大的密集叠加表示。这引发了一个底层问题：如果核心预测算子通常是低复杂度的且近似线性，它是否需要被实现为学习到的数字时间混合？我们引入了HAMON，一种无源衍射光学预测核心，其中历史值被编码到光学孔径上，未来位置保持暗场，级联的可训练相位掩模与自由空间衍射直接在输出场中形成预测。在推理时，预测由单个无源光学传播过程完成，无需可训练的数字序列混合层。在标准基准上，HAMON在ETTm2的所有预测长度和ETTh2除最长预测长度外的所有长度上优于考虑的最强数字基线，MSE最多降低14%，并且在不同预测长度上一致地优于基线，而非孤立点。它在Weather上具有竞争力，在其余ETT设置以及高通道数的Traffic和Electricity数据集上略逊于最强基线。相位编码、强度兼容读出和相位扰乱消融实验，以及TorchOptics交叉模拟检查表明，预测来自承载数据的光场而非数字预测头。由于无源核心使用标准傅里叶光学，HAMON为光学硬件和无源物理序列混合定义了一个具体目标。

英文摘要

Simple linear and frequency-domain models remain surprisingly competitive in long-horizon time-series forecasting, and recent mechanistic evidence suggests that standard forecasting benchmarks may not require the dense superposed representations that make transformers powerful in other domains. This raises a substrate-level question: if the core forecasting operator is often low-complexity and approximately linear, does it need to be implemented as learned digital temporal mixing? We introduce HAMON, a passive diffractive optical forecasting core in which historical values are encoded onto an optical aperture, future positions are left dark, and cascaded trainable phase masks with free-space diffraction shape the forecast directly in the output field. At inference, prediction is performed by a single passive optical propagation pass with no trainable digital sequence-mixing layer. Across standard benchmarks, HAMON outperforms the strongest digital baselines considered on ETTm2 at all horizons and on ETTh2 at all but the longest horizon, improving MSE by up to 14\% and doing so consistently across horizons rather than at isolated points. It is competitive on Weather and trails the strongest baselines on the remaining ETT settings and on the high-channel-count Traffic and Electricity datasets. Phase encoding, intensity-compatible readout, and phase-scrambling ablations, together with a TorchOptics cross-simulator check, indicate that the forecasts arise from the data-bearing optical field rather than from a digital forecasting head. Because the passive core uses standard Fourier optics, HAMON defines a concrete target for optical hardware and for passive physical sequence mixing.

URL PDF HTML ☆

赞 0 踩 0

2606.17037 2026-06-16 cs.CV cs.AI cs.LG 交叉投稿

The Importance of Phase in Neural Representations: An Internal Oppenheim-Lim Test of Image Classifiers

相位在神经表示中的重要性：图像分类器的内部Oppenheim-Lim测试

Alper Yıldırım

AI总结通过内部相位-幅度移植实验，发现图像分类器（如PRISM2D、GFNet、ViT-B/16）的预测主要依赖相位/符号信息，而图像特定幅度对读出贡献有限；ResNet-50在ReLU前存在潜在符号编码，揭示了CNN与注意力模型在纹理-形状差异上的机制。

详情

AI中文摘要

Oppenheim和Lim（1981）表明，自然图像仅从傅里叶相位重建时仍可识别，而幅度几乎不携带其身份信息。我们探究训练后的图像分类器是否在其隐藏层内再现这种不对称性，并进行因果测试：给定两幅图像，我们在选定层将一幅图像的相位移植到另一幅图像的幅度上，并记录预测跟随哪幅图像。在PRISM2D、GFNet和ViT-B/16中，预测跟随相位或符号捐赠者，删除所有图像特定幅度几乎不影响准确率，因此身份信息依赖于相位，而图像特定幅度对读出而言在很大程度上是可舍弃的。ResNet-50起初似乎打破了这一模式，因为在ReLU之后移植符号无效；在ReLU之前的公平干预揭示了后期块中存在强烈的潜在符号编码，而仅DC对照表明读出消耗了通道空间平均值。对照排除了幅度简单地不依赖于图像的平凡情况。因此，这些架构共享一个相位/符号身份编码，但以不同基（由整流和读出几何决定）暴露出来，这为CNN与注意力模型之间的纹理-形状差异提供了机制性解释。

英文摘要

Oppenheim and Lim (1981) showed that natural images stay recognizable when reconstructed from their Fourier phase alone, while the magnitude carries little of their identity. We ask whether trained image classifiers reproduce this asymmetry inside their hidden layers, and we test it causally: given two images, we transplant the phase of one onto the magnitude of the other at a chosen layer and record which image the prediction follows. In PRISM2D, GFNet, and ViT-B/16 the prediction follows the phase or sign donor, and deleting all image-specific magnitude barely moves accuracy, so identity rides on phase while image-specific magnitude is largely dispensable to the readout. ResNet-50 at first seems to break the pattern, because transplanting sign after its ReLUs does nothing; a fair intervention before the ReLU reveals a strong latent sign code in the late blocks, and a DC-only control shows the readout consumes a channel-wise spatial average. Controls rule out the trivial case in which magnitude simply stops depending on the image. The architectures therefore share a phase/sign identity code but expose it in different bases, set by rectification and readout geometry, which gives a mechanistic account of the texture--shape gap between CNNs and attention models.

URL PDF HTML ☆

赞 0 踩 0

2602.10385 2026-06-16 cs.LG cs.AI 版本更新

Capture Timing-Attention of Events in Clinical Time Series

捕捉临床时间序列中的事件时序注意力

Jia Li, Yu Hou, Rui Zhang

发表机构 * Department of Surgery（外科系；计算机科学系，明尼苏达大学明尼阿波利斯分校，MN USA）； Department of Computer Science, U of M Minneapolis MN USA

AI总结提出LITT架构，通过虚拟相对时间轴对齐事件序列，实现事件时序注意力机制，用于个性化临床轨迹分析，在乳腺癌患者心脏毒性预测中优于现有方法。

Comments 8 pages of body text

详情

AI中文摘要

从纵向EHR数据中自动发现个性化轨迹（即顺序事件模式）对于临床研究中的精准医学至关重要，但即使对于当代AI模型来说，这仍然是一个艰巨的挑战。例如，虽然Transformer的注意力机制可以捕捉丰富的关联，但它基本上不关心事件的时间和顺序，从而绕过了潜在的因果推理。直观上，我们需要一种能够评估患者特定轨迹之间“对齐程度”并识别其共享模式（即一致序列中的显著事件）的方法。这需要将时间视为一个真正的**可计算**维度，允许模型为候选事件分配超出其观测物理时间的“相对时间戳”。在这项工作中，我们引入了LITT（个体级时间变换），一种新颖的架构，能够在虚拟的“相对时间线”上临时对齐序列事件，从而实现**事件时序聚焦的注意力**和临床轨迹的个性化解释。其可解释性和有效性在来自3,276名乳腺癌患者的真实纵向EHR数据上得到验证，用于预测心脏毒性诱发心脏病的发病时间。此外，LITT在公共数据集上优于基准和最先进的生存分析方法，使其成为临床AI精准医学的重要一步。

英文摘要

The contemporary paradigm of trajectory learning operates fundamentally at the level of group dynamics, systematically reducing individual-level complexity to fit group-level models, thus rendering effective patient subtyping difficult and individual-level modeling largely out of reach. We propose a data-driven paradigm that introduces a dedicated individual-level temporal variable to capture \emph{Timing Attention} (i.e., the degree of concentration of an event's timing distribution across the patient cohort), thereby rendering timing a \emph{computable dimension} that enables individualized temporal features in trajectory learning. Instantiated as the Level-of-Individual Time Transformation (LITT) and applied to longitudinal EHR data from 3,276 breast cancer patients, the proposed paradigm demonstrates, for the first time to our knowledge: (1) automatic discovery of clinically significant patient trajectories, and (2) counterfactual timing deduction, that is, a \emph{What-If Machine}. Both results are purely data-driven, requiring no prior domain knowledge. LITT further achieves strong performance on timing prediction and survival analysis tasks.

URL PDF HTML ☆

赞 0 踩 0

2512.10903 2026-06-16 cs.AI 版本更新

Multi-Granular Node Pruning for Causal Circuit Discovery

多粒度节点剪枝用于因果电路发现

Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, A. B. Siddique

发表机构 * Department of Computer Science, University of Kentucky, USA（美国肯塔基大学计算机科学系）； Department of Computer Science, Dalhousie University, Canada（加拿大达尔豪斯大学计算机科学系）

AI总结提出一种节点级剪枝框架，通过可学习掩码和多粒度稀疏惩罚，在单次微调中从大语言模型中高效发现最小因果电路，节点更少且性能相当，内存占用降低5-10倍。

详情

AI中文摘要

电路发现旨在识别大语言模型（LLMs）中负责特定行为的最小化子网络。现有方法主要依赖迭代边剪枝，计算成本高且局限于粗粒度单元（如注意力头或MLP块），忽略了单个神经元等更细粒度的结构。我们提出了一种用于电路发现的节点级剪枝框架，解决了可扩展性和粒度限制。我们的方法在统一的优化目标中引入了跨多个粒度级别（从整个块到单个神经元）的可学习掩码。粒度特定的稀疏惩罚指导剪枝过程，使得在单次微调运行中实现全面压缩。实验上，我们的方法识别的电路在节点数量上小于先前方法发现的电路；此外，我们证明了许多被粗粒度方法认为重要的神经元实际上是无关的，同时仍能保持任务性能。此外，我们的方法具有显著更低的内存占用（5-10倍），因为它不需要在内存中保留中间激活来工作。

英文摘要

Circuit discovery aims to identify minimal subnetworks that are responsible for specific behaviors in large language models (LLMs). Existing approaches primarily rely on iterative edge pruning, which is computationally expensive and limited to coarse-grained units such as attention heads or MLP blocks, overlooking finer structures like individual neurons. We propose a node-level pruning framework for circuit discovery that addresses both scalability and granularity limitations. Our method introduces learnable masks across multiple levels of granularity, from entire blocks to individual neurons, within a unified optimization objective. Granularity-specific sparsity penalties guide the pruning process, allowing a comprehensive compression in a single fine-tuning run. Empirically, our approach identifies circuits that are smaller in nodes than those discovered by prior methods; moreover, we demonstrate that many neurons deemed important by coarse methods are actually irrelevant, while still maintaining task performance. Furthermore, our method has a significantly lower memory footprint, 5-10x, as it does not require keeping intermediate activations in the memory to work.

URL PDF HTML ☆

赞 0 踩 0

2512.20043 2026-06-16 cs.AI 版本更新

Discovering Symmetry Groups with Flow Matching

通过流匹配发现对称群

Yuxuan Chen, Jung Yeon Park, Floor Eijkelboom, Jianke Yang, Jan-Willem van de Meent, Lawson L. S. Wong, Robin Walters

发表机构 * University of Cambridge（剑桥大学）

AI总结提出LieFlow框架，将对称发现转化为李群上的分布学习问题，无需固定基或分布假设，能统一发现连续和离散对称，实验优于LieGAN。

详情

AI中文摘要

对称性是理解物理系统的基础，可以提高机器学习中的性能和样本效率。这两项工作都需要了解数据中的潜在对称性，但自动发现这些对称性具有挑战性。我们提出LieFlow，一种新颖的框架，将对称发现重新定义为李群上的分布学习问题。我们的方法不搜索对称生成器，而是直接在群空间中操作，在大假设群$G$上建模对称分布。学习到的分布的支持揭示了潜在的对称群$H \subseteq G$。与先前的工作不同，LieFlow可以在统一框架中发现连续和离散对称，而不假设固定的李代数基或群元素上的特定分布。在合成2D和3D点云、ModelNet10和真实世界MI-Motion数据集上的实验表明，LieFlow准确发现了连续和离散子群，在识别离散对称方面显著优于最先进的基线LieGAN。

英文摘要

Symmetry is fundamental to understanding physical systems and can improve performance and sample efficiency in machine learning. Both pursuits require knowledge of the underlying symmetries in data, yet discovering these symmetries automatically is challenging. We propose LieFlow, a novel framework that reframes symmetry discovery as a distribution learning problem on Lie groups. Instead of searching for the symmetry generators, our approach operates directly in group space, modeling a symmetry distribution over a large hypothesis group $G$. The support of the learned distribution reveals the underlying symmetry group $H \subseteq G$. Unlike previous works, LieFlow can discover both continuous and discrete symmetries within a unified framework, without assuming a fixed Lie algebra basis or a specific distribution over the group elements. Experiments on synthetic 2D and 3D point clouds, ModelNet10 and a real-world MI-Motion dataset show that LieFlow accurately discovers continuous and discrete subgroups, significantly outperforming a state-of-the-art baseline, LieGAN, in identifying discrete symmetries.

URL PDF HTML ☆

赞 0 踩 0

2602.05367 2026-06-16 cs.AI 版本更新

RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs

RaBiT：基于残差的二值化训练用于准确且高效的LLM

Youngcheon You, Banseok Lee, Minseop Choi, Seonyoung Kim, Hyochan Chong, Changdong Kim, Youngmin Kim, Dongkyu Kim

发表机构 * KAIST（韩国科学技术院）

AI总结 RaBiT通过算法强制残差层级解决二值化中的特征共适应问题，提升2位精度-效率边界，实现超越VQ的性能和4.49倍的推理加速。

Comments Accepted to ICML 2026

详情

AI中文摘要

高效部署大型语言模型（LLMs）需要极端量化，迫使在低比特效率与性能之间做出关键权衡。残差二值化通过堆叠二进制（±1）层实现硬件友好的乘法自由推理，但受到病理特征共适应的困扰。我们识别出一种关键失败模式，称为路径适应：在量化感知训练（QAT）中，并行残差二值路径学习冗余特征，降低误差补偿结构并限制模型的表达能力。尽管先前工作依赖启发式修补（例如路径冻结）来限制解空间，我们提出了RaBiT，一种新的量化框架，通过算法强制残差层级解决共适应问题。其核心机制依次从单个共享的全精度权重推导每个二值路径，确保每个路径纠正前一个的误差。这一过程通过稳健的初始化稳定，优先考虑功能保持而非单纯权重近似。RaBiT重新定义了2位精度-效率边界：它实现了最先进的性能，甚至超越硬件密集型向量量化（VQ）方法，并在RTX 4090上实现了比全精度模型快4.49倍的推理加速。

英文摘要

Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization enables hardware-friendly, matmul-free inference by stacking binary ($\pm$1) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term inter-path adaptation: during quantization-aware training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and limiting the expressive capacity of the model. While prior work relies on heuristic workarounds (e.g., path freezing) that constrain the solution space, we propose RaBiT, a novel quantization framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared full-precision weight, which ensures that every path corrects the error of the preceding one. This process is stabilized by a robust initialization that prioritizes functional preservation over mere weight approximation. RaBiT redefines the 2-bit accuracy-efficiency frontier: it achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a $4.49\times$ inference speed-up over full-precision models on an RTX 4090. Code is available at https://github.com/SamsungLabs/RaBiT.

URL PDF HTML ☆

赞 0 踩 0

2602.23242 2026-06-16 cs.AI 版本更新

A Model-Free Universal AI

无模型通用人工智能

Yegon Kim, Juho Lee

发表机构 * Graduate School of AI, KAIST（韩国科学技术院人工智能研究生院）

AI总结提出首个在通用强化学习中证明渐近ε最优的无模型智能体AIQI，通过分布动作值函数的通用归纳实现，并扩展了Self-AIXI的渐近最优性证明。

2604.05859 2026-06-16 cs.AI 版本更新

When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

何时需要大语言模型？语言驱动型老虎机的诊断方法

Uljad Berdica, Fernando Acero, Anton Ipsen, Parisa Zehtabi, Michael Cashmore, Manuela Veloso

发表机构 * University of Cambridge（剑桥大学）

AI总结提出LLMP-UCB算法从LLM中获取不确定性估计，实验表明轻量数值老虎机在文本嵌入上匹配或超越LLM方案且成本更低，并给出基于臂嵌入的几何诊断指导何时使用LLM。

Comments The Reinforcement Learning Conference, 2026

详情

AI中文摘要

我们研究非情节性决策问题中的上下文多臂老虎机（CMABs），其中上下文包含文本和数值信息（例如推荐系统、动态投资组合调整、报价选择；这些都是金融中常见的问题）。虽然大语言模型（LLMs）越来越多地应用于这些场景，但在每个决策步骤使用LLM进行推理计算成本高昂，且难以获得不确定性估计。为解决这一问题，我们引入LLMP-UCB，一种通过重复推理从LLM中导出不确定性估计的老虎机算法。然而，我们的实验表明，在文本嵌入（稠密或Matryoshka）上运行的轻量数值老虎机以极低的成本匹配或超越了基于LLM的解决方案的准确性。我们进一步证明，嵌入维度是探索-利用平衡的一个实用杠杆，能够在无需提示复杂性的情况下实现成本-性能权衡。最后，为指导实践者，我们提出一种基于臂嵌入的几何诊断方法，以决定何时使用LLM驱动的推理与轻量数值老虎机。我们的结果为跨AI用例广泛适用的成本效益高、不确定性感知的决策系统提供了原则性部署框架。

英文摘要

We study Contextual Multi-Armed Bandits (CMABs) for non-episodic decision-making problems where the context includes both textual and numerical information (e.g., recommendation systems, dynamic portfolio adjustments, offer selection; all frequent problems in finance). While Large Language Models (LLMs) are increasingly applied to these settings, utilizing LLMs for reasoning at every decision step is computationally expensive, and uncertainty estimates are difficult to obtain. To address this, we introduce LLMP-UCB, a bandit algorithm that derives uncertainty estimates from LLMs via repeated inference. However, our experiments demonstrate that lightweight numerical bandits operating on text embeddings (dense or Matryoshka) match or exceed the accuracy of LLM-based solutions at a fraction of their cost. We further show that embedding dimensionality is a practical lever on the exploration-exploitation balance, enabling cost-performance tradeoffs without prompt complexity. Finally, to guide practitioners, we propose a geometric diagnostic based on the arms' embeddings to decide when to use LLM-driven reasoning versus a lightweight numerical bandit. Our results provide a principled deployment framework for cost-effective, uncertainty-aware decision systems with broad applicability across AI use cases.

URL PDF HTML ☆

赞 0 踩 0

2605.02427 2026-06-16 cs.AI cs.LG 版本更新

The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

模型知晓，解码器发现：未来价值引导的粒子力量采样

Tu Nguyen, Matthieu Zimmer, Rasul Tutunov, Xiaotong Ji, Haitham Bou Ammar

发表机构 * Huawei Heisenberg Research Center（华为海森堡研究中心）； Huawei Noah’s Ark Lab（华为诺亚实验室）； UCL Centre for Artificial Intelligence（伦敦大学学院人工智能中心）

AI总结本文提出APPS算法，通过块状粒子方法高效定位LLM的多步解，提升推理准确率与运行效率，减少对训练数据的依赖。

2606.01561 2026-06-16 cs.AI cs.LG 版本更新

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

S-SPPO：语义校准的自对弈偏好优化

Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu, Zhipeng Wang, Huayu Li, ZhengXiao He, Xuanzhao Dong, Prayag Tiwari, Mingkun Xu, Yujian Xiong, Feng Luo, Abolfazl Razi, Brendan Hogan Rappazzo, Anderson Schneider, Yuriy Nevmyvaka

发表机构 * University of Arizona, USA（亚利桑那大学）； Arizona State University, USA（亚利桑那州立大学）； Now at Google LLC, work done at Rice University（现就职于谷歌公司，曾就职于里士大学）； Clemson University, USA（克莱姆森大学）； Washington University in St. Louis, USA（圣路易斯华盛顿大学）； Halmstad University, Sweden（哈姆斯塔德大学）； Guangdong Institute of Intelligence Science and Technology, China（广东智能科学与技术研究院）

AI总结针对自对弈偏好优化（SPPO）中因偏好预测过度自信导致策略退化的问题，提出双空间语义校准框架S-SPPO，通过语义门控监督校准和潜在排斥表示校准，在保持博弈结构的同时提升对齐性能。

Comments Accepted by ICML2026

详情

AI中文摘要

将大型语言模型（LLM）与人类偏好对齐通常通过直接偏好优化（DPO）来实现。然而，DPO的标准Bradley-Terry实现在建模人类偏好中常见的传递性偏离方面存在局限。为解决此问题，近期工作引入了自对弈偏好优化（SPPO），通过训练自生成的胜负对来迭代优化策略。然而，我们的研究发现SPPO存在一个关键的不稳定性：当偏好预测器对语义上无法区分的响应赋予过度自信的胜利时，优化容易导致策略退化。为缓解这一问题，我们提出S-SPPO，一个双空间语义校准框架，包括：i）通过语义门控进行监督校准，随着语义重叠增加将胜率目标退火至最大熵基线；ii）通过潜在排斥进行表示校准，以强制几何多样性，防止流形坍塌并保持所选样本与拒绝样本之间的潜在多样性。理论上，我们证明该校准保持了常和博弈结构，促进收敛至纳什均衡。实验上，S-SPPO避免了先前方法中的性能退化，在AlpacaEval 2.0上使用Llama-3-8B实现了52.19%的胜率和47.46%的长度控制胜率，且在训练过程中未使用额外的人工标注偏好。代码将在https://github.com/xiwenc1/s-sppo提供。

英文摘要

Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in modeling common departures from transitivity in human preferences. To address this, recent work has introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs. Our investigation, however, reveals a critical instability in SPPO: the optimization is prone to policy degeneration when the preference oracle assigns overly confident wins to semantically indistinguishable responses. To mitigate this, we propose S-SPPO, a dual-space semantic calibration framework comprising: i) Supervision Calibration via semantic gating, which anneals win rate targets toward the maximum-entropy baseline as semantic overlap increases; and ii) Representation Calibration via latent repulsion to enforce geometric diversity to prevent manifold collapse and maintain latent diversity between chosen and rejected samples. Theoretically, we show that the calibration preserves the constant-sum game structure, facilitating convergence to a Nash Equilibrium. Empirically, S-SPPO avoids the performance degradation seen in prior methods, achieving 52.19% win rate and 47.46% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, without using additional human-annotated preferences during training. The code will be available at https://github.com/xiwenc1/s-sppo.

URL PDF HTML ☆

赞 0 踩 0

2606.10237 2026-06-16 cs.AI cs.LG 版本更新

Minimalist Genetic Programming

极简遗传编程

Leonardo Trujillo

发表机构 * Tecnológico Nacional de México/IT de Tijuana（墨西哥国家理工学院/蒂胡ana信息技术学院）； LASIGE, Department of Informatics, Faculty of Sciences, University of Lisbon（里斯本大学科学学院信息系LASIGE）

AI总结提出极简遗传编程（MGP），借鉴语言学中的极简主义程序，用MERGE操作替代进化搜索，在符号回归任务中有效避免膨胀，稳定找到精确解。

详情

AI中文摘要

使用潜在变量的高效流匹配

Anirban Samaddar, Yixuan Sun, Viktor Nilsson, Sandeep Madireddy

发表机构 * Argonne National Laboratory（阿贡国家实验室）； KTH Royal Institute of Technology（皇家理工学院）

AI总结提出Latent-CFM方法，利用预训练深度潜在变量模型提取数据特征作为条件，提升流匹配模型的训练效率和生成质量，在图像和物理场生成任务中优于现有方法。

详情

AI中文摘要

流匹配模型在概率生成模型的图像生成任务中显示出巨大潜力。然而，文献中的大多数流匹配模型在从简单源分布（如标准高斯）学习流时，并未显式利用目标数据中的潜在聚类结构。这导致学习效率低下，尤其是对于许多通常位于低维流形中的高维真实世界数据集。为此，我们提出了 $\texttt{Latent-CFM}$，它通过使用预训练的深度潜在变量模型从数据中提取的特征作为条件，提供了高效的训练策略。通过对来自多模态分布的合成数据和广泛使用的图像基准数据集的实验，我们表明，$\texttt{Latent-CFM}$ 通过采用预训练的轻量级潜在变量模型，在显著减少训练和计算量的情况下，展现出比最先进的流匹配模型更好的生成质量。除了自然图像，我们还考虑了源自物理过程的空间场的生成建模。使用二维达西流数据集，我们证明了我们的方法比竞争方法生成更物理准确的样本。此外，通过潜在空间分析，我们证明了我们的方法可用于以潜在特征为条件的条件图像生成，这增加了生成过程的可解释性。

英文摘要

Flow matching models have shown great potential in image generation tasks among probabilistic generative models. However, most flow matching models in the literature do not explicitly utilize the underlying clustering structure in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside in a low-dimensional manifold. To this end, we present $\texttt{Latent-CFM}$, which provides efficient training strategies by conditioning on the features extracted from data using pretrained deep latent variable models. Through experiments on synthetic data from multi-modal distributions and widely used image benchmark datasets, we show that $\texttt{Latent-CFM}$ exhibits improved generation quality with significantly less training and computation than state-of-the-art flow matching models by adopting pretrained lightweight latent variable models. Beyond natural images, we consider generative modeling of spatial fields stemming from physical processes. Using a 2d Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competing approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features, which adds interpretability to the generation process.

URL PDF HTML ☆

赞 0 踩 0

2505.18227 2026-06-16 cs.LG cs.AI 版本更新

Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

Token缩减应超越生成模型中的效率——从视觉、语言到多模态

Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik

发表机构 * Harvard University（哈佛大学）； Northeastern University（东北大学）； CAS（中国科学院）； Wuhan University（武汉大学）； MIT（麻省理工学院）； Peking University（北京大学）

AI总结本文提出Token缩减应超越传统效率优化，成为生成模型的基础原则，通过减少冗余token来促进多模态融合、缓解幻觉、维持长输入连贯性并提升训练稳定性。

Comments Project page: https://github.com/ZLKong/Awesome-Collection-Token-Reduction

详情

AI中文摘要

在Transformer架构中，token——从原始数据中分割出的离散单元——通过将输入切分为固定长度的块而形成。每个token被映射到一个嵌入向量，从而在保留输入关键信息的同时实现并行注意力计算。由于Transformer自注意机制的二次计算复杂度，token缩减主要被用作一种效率策略，尤其在单一视觉和语言领域，它有助于平衡计算成本、内存使用和推理延迟。尽管取得了这些进展，本文认为在大规模生成模型时代，token缩减应超越其传统的效率导向角色。相反，我们将其定位为生成建模中的基本原则，对模型架构和更广泛的应用产生关键影响。具体而言，我们认为在视觉、语言和多模态系统中，token缩减可以：(i) 促进更深层次的多模态集成和对齐，(ii) 缓解“过度思考”和幻觉，(iii) 在长输入上保持连贯性，(iv) 增强训练稳定性等。我们将token缩减重新定义为不仅仅是效率措施。通过这样做，我们概述了有前景的未来方向，包括算法设计、强化学习引导的token缩减、用于上下文学习的token优化、智能体框架设计以及更广泛的机器学习和科学领域。

英文摘要

In Transformer architectures, tokens\textemdash discrete units derived from raw data\textemdash are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, agentic framework design, and broader ML and scientific domains.

URL PDF HTML ☆

赞 0 踩 0

2505.19699 2026-06-16 cs.LG cs.AI cs.DC 版本更新

Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments

Mosaic: 面向异构分布式环境的无数据知识蒸馏与混合专家模型

Junming Liu, Yanting Gao, Yuqi Li, Siyuan Meng, Yifei Sun, Aoqi Wu, Yirong Chen, Ding Wang, Shiping Wen

发表机构 * School of Computer Science and Technology, Tongji University（同济大学计算机科学与技术学院）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； The City University of New York（纽约城市大学）； Shenzhen University of Advanced Technology（深圳先进技术大学）

AI总结针对联邦学习中模型与数据异构性问题，提出Mosaic框架，通过本地生成模型合成隐私保护数据，并利用混合专家模型蒸馏全局模型，在图像和多模态基准上超越现有方法。

Comments 23 pages, 5 figures, 24 tables; Accepted by Knowledge-Based Systems, 2026

详情

FOUNDv2: 学习统一的用户量化分词器用于用户表示

Chuan He, Yang Chen, Bin Dou, Wuliang Huang, Baokun Wang, Yongchao Liu, Xing Fu, Yu Cheng, Chuntao Hong, Weiqiang Wang, Zhongle Xie, Jiajun Zheng, Xin-Wei Yao

发表机构 * Ant Group（蚂蚁集团）； Zhejiang University of Technology（浙江工业大学）； Zhejiang University（浙江大学）

AI总结提出FOUNDv2框架，通过统一用户量化分词器（U2QT）将异构用户数据转化为离散令牌，结合多视图RQ-VAE和多尺度对齐目标，实现高效存储和预测性能，在多个基准上优于任务特定基线。

详情

AI中文摘要

用户表示学习是大规模网络平台上个性化服务的基础支柱。尽管其重要性，传统的连续嵌入方法面临重大挑战，包括缺乏多源数据融合的统一范式、由于信息密度低导致的过高存储开销以及缺乏多尺度建模粒度。为克服这些限制，我们引入FOUNDv2，一个以统一用户量化分词器（U2QT）框架为核心的综合用户表示方案。FOUNDv2通过一个稳健的两阶段架构将异构用户数据转化为标准化的离散令牌空间。具体来说，该框架首先提取紧凑的特征表示，然后使用多视图RQ-VAE通过共享和源特定的码本将其离散化为存储高效的令牌。为了赋予这些表示预测智能，我们进一步设计多尺度对齐目标以捕捉细粒度的行为依赖和宏观时间周期性。在各种基准上的大量实验表明，FOUNDv2在实现存储和计算成本大幅降低的同时，始终优于任务特定基线。最后，FOUNDv2在支付宝上的大规模部署验证了其在多种工业场景中的实际可扩展性和效率。主要代码可在以下网址获取：this https URL。

英文摘要

User representation learning serves as a fundamental pillar for personalized services on large-scale web platforms. Despite its importance, conventional continuous embedding methods face significant challenges, including the lack of a unified paradigm for multi-source data integration, prohibitive storage overhead due to low information density, and the lack of multi-scale modeling granularity. To overcome these limitations, we introduce FOUNDv2, a comprehensive user representation scheme centered on the Unified User Quantized Tokenizer U2QT) framework. FOUNDv2 transforms heterogeneous user data into a standardized discrete token space through a robust two-stage architecture. Specifically, the framework first extracts compact feature representations and subsequently employs a multi-view RQ-VAE to discretize them into storage-efficient tokens using shared and source-specific codebooks. To empower these representations with predictive intelligence, we further design multi-scale alignment objectives to capture both fine-grained behavioral dependencies and macro-temporal periodicity. Extensive experiments on various benchmarks demonstrate that FOUNDv2 consistently outperforms task-specific baselines while achieving substantial reductions in storage and computational costs. Finally, the large-scale deployment of FOUNDv2 on Alipay validates its practical scalability and efficiency across diverse industrial scenarios. The main code is available at: https://github.com/chuanhe1999/FOUNDv2.

URL PDF HTML ☆

赞 0 踩 0

2508.05287 2026-06-16 cs.LG cs.AI 版本更新

FlowState: Sampling-Rate-Equivariant Time-Series Forecasting

FlowState: 采样率等变的时间序列预测

Lars Graf, Thomas Ortner, Stanisław Woźniak, Angeliki Pantazi

发表机构 * GitHub

AI总结提出FlowState架构，通过状态空间模型编码器和函数基解码器实现采样率等变预测，无需重新训练即可适应不同采样率和预测长度，在GIFT-Eval基准上取得最优结果。

详情

AI中文摘要

现有的时间序列基础模型（TSFMs）通常基于Transformer变体，缺乏对不同采样率的适应性，难以在不同上下文和目标长度上泛化，且计算效率低下。我们提出FlowState，一种新颖的TSFM架构，通过将状态空间模型（SSM）编码器与函数基解码器（FBD）配对，实现采样率等变预测。这种设计支持连续时间建模和动态时间尺度调整，使FlowState能够天然地泛化到所有可能的时间分辨率，并动态调整预测范围而无需重新训练。我们进一步提出一种高效的预训练策略，提高了鲁棒性并加速了训练。尽管FlowState是最小的TSFMs之一，它在广泛使用的GIFT-Eval基准上取得了最先进的结果，同时展现出对未见采样率的卓越适应性。我们的详细分析证实了其组件的有效性，并展示了其适应不同输入采样率的独特能力。

英文摘要

Existing time series foundation models (TSFMs), often based on transformer variants, lack adaptability to different sampling rates, struggle with generalization across varying context and target lengths, and are computationally inefficient. We introduce FlowState, a novel TSFM architecture that achieves sampling-rate-equivariant forecasting through a unified design that pairs a state space model (SSM) encoder with a functional basis decoder (FBD). This design enables continuous-time modeling and dynamic time-scale adjustment, allowing FlowState to inherently generalize across all possible temporal resolutions, and dynamically adjust the forecasting horizons without retraining. We further propose an efficient pretraining strategy that improves robustness and accelerates training. Despite being one of the smallest TSFMs, FlowState achieves state-of-the-art results on the widely used GIFT-Eval benchmark, while demonstrating superior adaptability to unseen sampling rates. Our detailed analyses confirm the effectiveness of its components, and we demonstrate its unique ability to adapt to varying input sampling rates.

URL PDF HTML ☆

赞 0 踩 0

2508.17254 2026-06-16 cs.CV cs.AI 版本更新

Think-at-Hard: 选择性潜在迭代以改进推理语言模型

Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang

AI总结针对循环变压器中潜在过思考问题，提出Think-at-Hard方法，通过轻量级决策器选择性地在困难令牌上触发潜在迭代，并采用深度感知LoRA和双因果注意力机制，在数学、问答和编码任务上一致提升性能。

Comments Accepted by ICML'26

详情

AI中文摘要

提升大型语言模型（LLMs）的推理能力，特别是在参数约束下，对实际应用至关重要。循环变压器通过执行多次潜在迭代来细化每个令牌，超越单次前向传播。然而，我们识别出一种潜在过思考现象：大多数令牌预测在第一次前向传播后已经正确，但在后续迭代中有时会被修改为错误。我们询问选择性地跳过潜在迭代是否能提高准确性，并揭示了一个显著的潜力：使用预言迭代策略可将性能提升高达7.3%。受此启发，我们提出了Think-at-Hard (TaH)，一种针对选择性迭代优化的循环变压器。TaH采用轻量级神经决策器来触发潜在迭代，仅在标准前向传播后可能不正确的令牌上触发。在潜在迭代期间，深度感知的低秩适应（LoRA）模块将目标从一般的下一个令牌预测转变为聚焦的困难令牌细化。双因果注意力机制将注意力从令牌序列维度扩展到额外的迭代深度维度，实现跨迭代信息流，同时保持完全的序列并行性。在九个基准上的实验显示，在数学、问答和编码任务上一致提升。在相同参数数量下，TaH在93%的令牌上跳过迭代，性能比始终迭代的基线高3.8-4.4%，并超过单次迭代的Qwen3基线3.0-3.8%。当允许LoRA和决策器增加不到3%的参数时，增益分别进一步增加到5.3-6.2%和6.1-6.8%。我们的代码可在以下网址获取：https://this URL。

英文摘要

Improving the reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Looped transformers address this by performing multiple latent iterations to refine each token beyond a single forward pass. However, we identify a latent overthinking phenomenon: most token predictions are already correct after the first pass, but are sometimes revised into errors in later iterations. We ask whether selectively skipping latent iterations can improve accuracy, and reveal significant potential with an oracle iteration policy that boosts performance by up to 7.3%. Motivated by this, we propose Think-at-Hard (TaH), a looped transformer optimized for selective iteration. TaH employs a lightweight neural decider to trigger latent iteration, only at tokens likely to be incorrect after the standard forward pass. During latent iterations, depth-aware Low-Rank Adaptation (LoRA) modules shift the objective from general next-token prediction to focused hard-token refinement. A duo-causal attention mechanism extends attention from the token sequence dimension to an additional iteration depth dimension, enabling cross-iteration information flow with full sequential parallelism. Experiments on nine benchmarks show consistent gains across math, QA, and coding tasks. With identical parameter counts, TaH outperforms always-iterate baselines by 3.8-4.4% while skipping iterations on 93% of tokens, and exceeds single-iteration Qwen3 baselines by 3.0-3.8%. When allowing <3% more parameters from LoRA and decider, the gains further increase to 5.3-6.2% and 6.1-6.8%, respectively. Our code is available at https://github.com/thu-nics/TaH.

URL PDF HTML ☆

赞 0 踩 0

2512.18295 2026-06-16 cs.LG cs.AI 版本更新

AL-GNN: Privacy-Preserving and Replay-Free Continual Graph Learning via Analytic Learning

AL-GNN: 基于分析学习的隐私保护且无需重放的持续图学习

Xuling Zhang, Jindong Li, Yifei Zhang, Mingqi Yang, Menglin Yang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港理工大学（广州））； Northwestern Polytechnical University（西北工业大学）； South China University of Technology（华南理工大学）

AI总结提出AL-GNN框架，利用分析学习理论将持续图学习转化为递归最小二乘优化，通过闭式分类器更新和正则化特征自相关矩阵实现无需反向传播和重放缓冲的高效训练，在保护隐私的同时提升性能并减少遗忘。

详情

AI中文摘要

持续图学习（CGL）旨在使图神经网络能够从图结构数据流中增量学习，而不会遗忘先前获得的知识。现有方法，特别是基于经验重放的方法，通常存储并重新访问过去的图数据以缓解灾难性遗忘。然而，这些方法存在显著局限性，包括隐私问题和低效性。在这项工作中，我们提出了AL-GNN，一种新颖的持续图学习框架，消除了对反向传播和重放缓冲区的需求。相反，AL-GNN利用分析学习理论的原理，将学习形式化为递归最小二乘优化过程。它通过闭式分类器更新和正则化特征自相关矩阵来分析和更新模型知识。这种设计使得每个任务能够进行高效的单次训练，并通过避免存储历史样本固有地保护数据隐私。在多个动态图分类基准上的大量实验表明，AL-GNN取得了与现有方法相比具有竞争力或更优的性能。例如，它在CoraFull上平均性能提高了10%，在Reddit上遗忘减少了30%以上，同时由于其无反向传播的设计，训练时间减少了近50%。

英文摘要

Continual graph learning (CGL) aims to enable graph neural networks to incrementally learn from a stream of graph structured data without forgetting previously acquired knowledge. Existing methods particularly those based on experience replay typically store and revisit past graph data to mitigate catastrophic forgetting. However, these approaches pose significant limitations, including privacy concerns, inefficiency. In this work, we propose AL GNN, a novel framework for continual graph learning that eliminates the need for backpropagation and replay buffers. Instead, AL GNN leverages principles from analytic learning theory to formulate learning as a recursive least squares optimization process. It maintains and updates model knowledge analytically through closed form classifier updates and a regularized feature autocorrelation matrix. This design enables efficient one pass training for each task, and inherently preserves data privacy by avoiding historical sample storage. Extensive experiments on multiple dynamic graph classification benchmarks demonstrate that AL GNN achieves competitive or superior performance compared to existing methods. For instance, it improves average performance by 10% on CoraFull and reduces forgetting by over 30% on Reddit, while also reducing training time by nearly 50% due to its backpropagation free design.

URL PDF HTML ☆

赞 0 踩 0

2512.22560 2026-06-16 cs.DC cs.AI cs.LG 版本更新

RollArt: Disaggregated Multi-Task Agentic RL Training at Scale

RollArt: 可分解的多任务智能体强化学习规模化训练

Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang

发表机构 * HKUST（香港科技大学）； Alibaba Group（阿里巴巴集团）； Tongyi Lab, Alibaba（阿里云实验室）

AI总结提出RollArt系统，通过将强化学习流水线分解到异构硬件上，实现多任务智能体RL的高效训练，相比现有系统减少1.31-2.05倍训练时间。

Comments 19 pages, 15 figures

详情

AI中文摘要

智能体强化学习通过与环境的多轮交互训练大语言模型，产生混合计算密集型预填充、带宽密集型解码、CPU密集型环境执行和突发性奖励评估的工作负载。现有系统要么将所有阶段共置于单一GPU集群，要么仅以粗粒度解耦，忽视了硬件异构性并导致阶段间大量同步开销。我们提出ROLLART，一个在可分解基础设施上的多任务智能体RL系统。ROLLART将每个流水线阶段映射到最合适的硬件：将预填充密集型任务路由到计算优化GPU，解码密集型任务路由到带宽优化GPU，环境任务路由到CPU集群。它在轨迹级别解耦生成，使得生成、环境交互和奖励评分可以独立进行，从而慢速或失败的环境不会阻塞其他任务。ROLLART将无状态奖励计算卸载到无服务器基础设施，并通过有界陈旧性的异步权重同步将生成与训练重叠。结果表明，ROLLART有效提高了训练吞吐量，与各种RL系统相比实现了1.31-2.05倍的训练时间减少。我们还在阿里巴巴集群上使用超过3000个GPU训练了用于Qoder产品的数千亿参数MoE模型，验证了其稳定性和可扩展性。

英文摘要

Agentic Reinforcement Learning (RL) trains LLMs through multi-turn interactions with environments, producing workloads that mix compute-bound prefill, bandwidth-bound decoding, CPU-heavy environment execution, and bursty reward evaluation. Existing systems either colocate all stages on a single GPU cluster or decouple them only at a coarse granularity, overlooking hardware heterogeneity and incurring substantial synchronization overhead across stages. We present ROLLART, a system for multi-task agentic RL on disaggregated infrastructure. ROLLART maps each pipeline stage to best-fit hardware, routing prefill-heavy tasks to compute-optimized GPUs, decode-heavy tasks to bandwidth-optimized GPUs, and environments to CPU clusters. It decouples rollout at the trajectory level, allowing generation, environment interaction, and reward scoring to proceed independently, so that slow or failed environments never block the others. ROLLART offloads stateless reward computation to serverless infrastructure and overlaps rollout with training via staleness-bounded asynchronous weight synchronization. Our results demonstrate that ROLLART effectively improves training throughput and achieves 1.31--2.05 $\times$ training time reduction compared to various RL systems. We also evaluated ROLLART by training a hundreds-of-billions-parameter MoE model for Qoder product on an Alibaba cluster with above 3,000 GPUs, demonstrating its stability and scalability.

URL PDF HTML ☆

赞 0 踩 0

2601.11219 2026-06-16 cs.LG cs.AI 版本更新

SDFLoRA: Selective Decoupled Federated LoRA for Privacy-preserving Fine-tuning with Heterogeneous Clients

SDFLoRA: 面向异构客户隐私保护微调的选择性解耦联邦LoRA

Zhikang Shen, Jianrong Lu, Haiyuan Wan, Jianhai Chen

发表机构 * Zhejiang University（浙江大学）； Tsinghua University（清华大学）

AI总结提出SDFLoRA，通过将LoRA更新解耦为共享和私有组件，仅聚合共享部分并注入差分隐私噪声，解决联邦微调中的秩异构和数据异构问题，提升隐私-效用权衡。

详情

AI中文摘要

联邦学习（FL）用于大型语言模型（LLM）作为在分布式数据上适应模型的隐私保护方法日益受到关注，其中低秩适应（LoRA）等参数高效方法被广泛采用以降低通信和内存成本。然而，实际部署通常表现出秩和数据异构性：客户端在不同的低秩预算和数据分布下运行，使得LoRA更新的直接聚合存在偏差且不稳定。现有方法要么强制统一秩，要么将异构更新对齐到单个共享子空间，这往往会混合可迁移和客户端特定的方向，从而损害个性化。此外，在差分隐私（DP）下，扰动这种结构混合的更新会向本应保持纯局部的方向注入噪声，导致不必要的效用下降。为了解决这些问题，我们提出了选择性解耦联邦LoRA（SDFLoRA），一种结构感知的LoRA框架，将每个客户端更新解耦为用于聚合的共享组件和保留客户端特定语义的私有组件。只有共享组件参与子空间对齐，而私有组件保持本地且不通信，使得训练与DP兼容并在秩异构下稳定聚合。通过仅向聚合的可共享更新注入噪声，该方法避免了对局部方向的扰动，并改善了效用-隐私权衡。在多个基准上的实验表明，SDFLoRA优于联邦LoRA基线，并实现了强大的效用-隐私权衡。

英文摘要

Federated learning (FL) for large language models (LLMs) has attracted increasing attention as a privacy-preserving approach for adapting models over distributed data, where parameter-efficient methods such as Low-Rank Adaptation (LoRA) are widely adopted to reduce communication and memory costs. However, practical deployments often exhibit rank and data heterogeneity: clients operate under different low-rank budgets and data distributions, making direct aggregation of LoRA updates biased and unstable. Existing approaches either enforce a unified rank or align heterogeneous updates into a single shared subspace, which tends to mix transferable and client-specific directions and consequently undermines personalization. Moreover, under differential privacy (DP), perturbing such structurally mixed updates injects noise into directions that should remain purely local, leading to unnecessary utility degradation. To address these issues, we propose Selective Decoupled Federated LoRA (SDFLoRA), a structure-aware LoRA framework that decouples each client update into a shared component for aggregation and a private component that preserves client-specific semantics. Only the shared component participates in subspace alignment, while the private component remains local and uncommunicated, making the training DP-compatible and stabilizing aggregation under rank heterogeneity. By injecting noise only into the aggregated shareable update, this approach avoids perturbations to local directions and improves the utility-privacy trade-off. Experiments on multiple benchmarks demonstrate that SDFLoRA outperforms federated LoRA baselines and achieves a strong utility-privacy trade-off.

URL PDF HTML ☆

赞 0 踩 0

2601.16509 2026-06-16 cs.LG cs.AI 版本更新

Adaptive $k$NN graph model

自适应 $k$NN 图模型

Jiaye Li, Hang Xu, Shichao Zhang

发表机构 * The State Key Laboratory of Blockchain and Data Security（区块链与数据安全国家重点实验室）； Zhejiang University（浙江大学）； The School of Computer Science and Engineering（计算机科学与工程学院）； Central South University（中南大学）； School of Computer Science and Engineering（计算机科学与工程学院）； Guangxi Normal University（广西师范大学）

AI总结提出一种基于分层可导航小世界图与预计算投票机制的自适应图模型，将邻居选择与加权的计算负担转移到训练阶段，在保持分类精度的同时实现实时推理速度。

Comments 31 pages, 5 figures

详情

DOI: 10.1038/s41467-026-74296-2

AI中文摘要

$k$ 近邻 ($k$NN) 算法是人工智能中非参数分类的基石，但其在大规模应用中的部署始终受到推理速度与准确性之间计算权衡的限制。现有的近似最近邻解决方案加速了检索，但往往降低了分类精度，并且缺乏选择最优邻域大小 ($k$) 的自适应性。本文提出了一种自适应图模型，将推理延迟与计算复杂度解耦。通过将分层可导航小世界 (HNSW) 图与预计算投票机制相结合，我们的框架将邻居选择和加权的计算负担完全转移到训练阶段。在这种拓扑结构中，较高的图层次实现快速导航，而较低的层次则通过自适应邻居数量编码精确的、节点特定的决策边界。在六个不同数据集上与八种最先进基线进行基准测试，我们证明了该架构显著加速了推理速度，实现了实时性能，且不牺牲分类精度。这些发现为 $k$NN 固有的推理瓶颈提供了可扩展、鲁棒的解决方案，为基于图的非参数学习奠定了自适应的结构基础。

英文摘要

The $k$-nearest neighbors ($k$NN) algorithm is a cornerstone of non-parametric classification in artificial intelligence, yet its deployment in large-scale applications is persistently constrained by the computational trade-off between inference speed and accuracy. Existing approximate nearest neighbor solutions accelerate retrieval but often degrade classification precision and lack adaptability in selecting the optimal neighborhood size ($k$). Here, we present an adaptive graph model that decouples inference latency from computational complexity. By integrating a Hierarchical Navigable Small World (HNSW) graph with a pre-computed voting mechanism, our framework completely transfers the computational burden of neighbor selection and weighting to the training phase. Within this topological structure, higher graph layers enable rapid navigation, while lower layers encode precise, node-specific decision boundaries with adaptive neighbor counts. Benchmarking against eight state-of-the-art baselines across six diverse datasets, we demonstrate that this architecture significantly accelerates inference speeds, achieving real-time performance, without compromising classification accuracy. These findings offer a scalable, robust solution to the inherent inference bottleneck of $k$NN, laying an adaptive structural foundation for graph-based nonparametric learning.

URL PDF HTML ☆

赞 0 踩 0

2602.11550 2026-06-16 cs.LG cs.AI 版本更新

TS-Memory: Plug-and-Play Memory for Time Series Foundation Models

TS-Memory: 时间序列基础模型的即插即用记忆模块

Sisuo Lyu, Siru Zhong, Tiegang Chen, Weilin Ruan, Qingxiang Liu, Taiqiang Lv, Qingsong Wen, Raymond Chi-Wing Wong, Yuxuan Liang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科学与技术大学（广州））； Tencent（腾讯）； Squirrel Ai Learning ； The Hong Kong University of Science and Technology（香港科学与技术大学）

AI总结提出参数化记忆蒸馏方法TS-Memory，通过轻量级记忆适配器增强冻结的时间序列基础模型，在分布偏移下实现无检索的高效零样本预测，显著提升点预测和概率预测性能。

详情

AI中文摘要

时间序列基础模型（TSFMs）通过大规模预训练实现了强大的零样本预测，但在分布偏移下将其适应到下游领域仍然具有挑战性。现有解决方案面临权衡：参数化适应可能导致灾难性遗忘，并需要昂贵的多领域维护，而非参数化检索虽然改善了预测，但由于数据存储搜索导致高推理延迟。我们提出了参数化记忆蒸馏，并将其实现为TS-Memory，一种增强冻结TSFMs的轻量级记忆适配器。TS-Memory分两个阶段训练。首先，我们构建一个离线、检索泄漏安全的kNN教师，从检索到的未来中合成置信度感知的分位数目标。其次，我们通过置信度门控监督将该检索诱导的分布校正蒸馏到轻量级记忆适配器中。在推理过程中，TS-Memory以常数时间开销融合记忆和骨干预测，实现无检索部署。在多种TSFMs和基准上的实验表明，与代表性的适应方法相比，在点预测和概率预测上均有一致的改进，效率与冻结骨干相当。代码：此 https URL。

英文摘要

Time Series Foundation Models (TSFMs) achieve strong zero-shot forecasting through large-scale pre-training, but adapting them to downstream domains under distribution shift remains challenging. Existing solutions face a trade-off: Parametric Adaptation can cause catastrophic forgetting and requires costly multi-domain maintenance, while Non-Parametric Retrieval improves forecasts but incurs high inference latency due to datastore search. We propose Parametric Memory Distillation and implement it as TS-Memory, a lightweight memory adapter that augments frozen TSFMs. TS-Memory is trained in two stages. First, we construct an offline, retrieval-leakage-safe kNN teacher that synthesizes confidence-aware quantile targets from retrieved futures. Second, we distill this retrieval-induced distributional correction into a lightweight memory adapter via confidence-gated supervision. During inference, TS-Memory fuses memory and backbone predictions with constant-time overhead, enabling retrieval-free deployment. Experiments across diverse TSFMs and benchmarks demonstrate consistent improvements in both point and probabilistic forecasting over representative adaptation methods, with efficiency comparable to the frozen backbone. Code: https://github.com/sisuolv/TS-Memory.

URL PDF HTML ☆

赞 0 踩 0

2602.22422 2026-06-16 cs.LG cs.AI 版本更新

Revisiting Chebyshev Polynomial and Anisotropic RBF Models for Tabular Regression

重新审视切比雪夫多项式和各向异性RBF模型在表格回归中的应用

Luciano Gerber, Huw Lloyd

发表机构 * Department of Computing and Mathematics, Manchester Metropolitan University（计算与数学系，曼彻斯特 Metropolitan 大学）

AI总结本文在55个数据集上基准测试切比雪夫多项式回归器、各向异性RBF网络和平滑树混合模型，发现平滑模型在CPU可行模型中与树集成准确率相当且泛化差距更小，建议将其纳入候选池。

Comments 46 pages, 6 figures, 21 tables. Under review at Knowledge-Based Systems

详情

AI中文摘要

平滑基模型如切比雪夫多项式回归器和径向基函数（RBF）网络在数值分析中已得到充分确立。它们的连续可微预测表面适用于代理优化、敏感性分析以及其他响应随输入逐渐变化的环境。尽管具有这些特性，平滑模型在树集成主导的表格回归中很少出现。我们探究它们是否能够竞争，跨55个按应用领域组织的回归数据集对模型进行基准测试。我们开发了一种各向异性RBF网络，具有数据驱动的中心放置和基于梯度的宽度优化，一个岭正则化的切比雪夫多项式回归器，以及一个平滑树混合模型（切比雪夫模型树）；这三个模型均作为scikit-learn兼容包发布。我们将这些模型与树集成、预训练transformer和标准基线进行基准测试，评估准确性和泛化行为。transformer在大多数数据集上准确率排名第一，但其GPU依赖性、推理延迟和数据集大小限制制约了其在应用科学和工业中常见的基于CPU环境中的部署。在CPU可行的模型中，平滑模型和树集成在准确率上统计上持平，但前者倾向于表现出更紧的泛化差距。我们建议常规地将平滑基模型纳入候选池，特别是当下游使用受益于更紧的泛化和逐渐变化的预测时。

英文摘要

Smooth-basis models such as Chebyshev polynomial regressors and radial basis function (RBF) networks are well established in numerical analysis. Their continuously differentiable prediction surfaces suit surrogate optimisation, sensitivity analysis, and other settings where the response varies gradually with inputs. Despite these properties, smooth models seldom appear in tabular regression, where tree ensembles dominate. We ask whether they can compete, benchmarking models across 55 regression datasets organised by application domain. We develop an anisotropic RBF network with data-driven centre placement and gradient-based width optimisation, a ridge-regularised Chebyshev polynomial regressor, and a smooth-tree hybrid (Chebyshev model tree); all three are released as scikit-learn-compatible packages. We benchmark these against tree ensembles, a pre-trained transformer, and standard baselines, evaluating accuracy alongside generalisation behaviour. The transformer ranks first on accuracy across a majority of datasets, but its GPU dependence, inference latency, and dataset-size limits constrain deployment in the CPU-based settings common across applied science and industry. Among CPU-viable models, smooth models and tree ensembles are statistically tied on accuracy, but the former tend to exhibit tighter generalisation gaps. We recommend routinely including smooth-basis models in the candidate pool, particularly when downstream use benefits from tighter generalisation and gradually varying predictions.

URL PDF HTML ☆

赞 0 踩 0

2603.03417 2026-06-16 cs.CR cs.AI 版本更新

Parallel Test-Time Scaling with Multi-Sequence Verifiers

并行测试时扩展与多序列验证器

Yegon Kim, Seungyoo Lee, Chaeyun Jang, Hyungi Lee, Juho Lee

发表机构 * Graduate School of AI, KAIST（人工智能研究生院，韩国科学技术院）

AI总结提出多序列验证器（MSV），通过条件化候选集预测正确性，改善校准性，提升最佳选择准确率并实现早停策略，在数学推理任务中以不到一半延迟达到相同精度。

详情

AI中文摘要

并行测试时扩展（为单个问题生成多个候选解）是提升大语言模型性能的强大技术。然而，它受到两个关键瓶颈的阻碍：从候选池中准确选择正确的解，以及生成大量完整解带来的高推理延迟。我们认为这两个挑战从根本上与验证器的校准性相关，因为校准良好的验证器能改进答案选择，并支持早停策略以减少延迟。然而，现有的非生成式验证器存在局限性，因为它们孤立地评分每个候选，忽略了候选集之间的丰富上下文信息。为解决这一问题，我们引入了多序列验证器（MSV），这是一种轻量级验证器，它基于完整采样集的条件来预测每个候选的正确性。MSV实现了改进的校准性，这直接增强了最佳N选择性能，并赋能了一种新颖的早停框架。在具有挑战性的数学推理基准测试中，相对于强基线，MSV将最佳64选1的准确率提升了高达6%，并且在早停设置下，以不到一半的延迟达到了与基线相同的准确率。

英文摘要

Parallel test-time scaling, which generates multiple candidate solutions for a single problem, is a powerful technique for improving large language model performance. However, it is hindered by two key bottlenecks: accurately selecting the correct solution from the candidate pool, and the high inference latency from generating many full solutions. We argue that both challenges are fundamentally linked to verifier calibration, as a well-calibrated verifier improves answer selection and enables early-stopping strategies to reduce latency. However, existing non-generative verifiers are limited as they score each candidate in isolation, overlooking rich contextual information across the set of candidates. To address this, we introduce the Multi-Sequence Verifier (MSV), a lightweight verifier that predicts each candidate's correctness conditioned on the full sampled set. MSV achieves improved calibration, which directly enhances best-of-N selection performance and empowers a novel early-stopping framework. Across challenging mathematical reasoning benchmarks, MSV improves best-of-64 accuracy by up to 6\% relative to strong baselines, and in the early-stopping setting reaches the same accuracy as baselines with less than half the latency.

URL PDF HTML ☆

赞 0 踩 0

2603.17353 2026-06-16 cs.LG cs.AI 版本更新

Learning Permutation Distributions via Reflected Diffusion on Ranks

通过秩上的反射扩散学习排列分布

Sizhuang He, Yangtian Zhang, Shiyang Zhang, David van Dijk

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出Soft-Rank Diffusion框架，通过将排列松弛为软秩实现平滑扩散，并引入上下文广义Plackett-Luce去噪器，在排序和组合优化任务上优于现有扩散方法。

Comments 18 pages including the appendix, 7 figures, 9 tables, Accepted at ICML 2026

详情

AI中文摘要

有限对称群 S_n 为排列提供了自然域，但由于其阶乘增长的大小和离散、非欧几里得结构，在 S_n 上学习概率分布具有挑战性。最近的排列扩散方法通过基于洗牌的随机游走（例如，riffle shuffles）定义前向加噪，并使用 Plackett-Luce (PL) 变体学习反向转移，但由此产生的轨迹可能很突兀，并且随着 n 的增长，去噪变得越来越困难。我们提出 Soft-Rank Diffusion，一种离散扩散框架，用结构化的软秩前向过程取代基于洗牌的破坏：通过将离散秩松弛为软秩，将排列提升到连续的潜在表示，从而产生更平滑、更易处理的轨迹。对于反向过程，我们引入了上下文广义 Plackett-Luce (cGPL) 去噪器，它推广了先前的 PL 风格参数化，并提高了序列决策结构的表达能力。在排序和组合优化基准上的实验表明，Soft-Rank Diffusion 始终优于先前的扩散基线，在长序列和内在序列设置中尤其有显著优势。

英文摘要

The finite symmetric group S_n provides a natural domain for permutations, yet learning probability distributions on S_n is challenging due to its factorially growing size and discrete, non-Euclidean structure. Recent permutation diffusion methods define forward noising via shuffle-based random walks (e.g., riffle shuffles) and learn reverse transitions with Plackett-Luce (PL) variants, but the resulting trajectories can be abrupt and increasingly hard to denoise as n grows. We propose Soft-Rank Diffusion, a discrete diffusion framework that replaces shuffle-based corruption with a structured soft-rank forward process: we lift permutations to a continuous latent representation of order by relaxing discrete ranks into soft ranks, yielding smoother and more tractable trajectories. For the reverse process, we introduce contextualized generalized Plackett-Luce (cGPL) denoisers that generalize prior PL-style parameterizations and improve expressivity for sequential decision structures. Experiments on sorting and combinatorial optimization benchmarks show that Soft-Rank Diffusion consistently outperforms prior diffusion baselines, with particularly strong gains in long-sequence and intrinsically sequential settings.

URL PDF HTML ☆

赞 0 踩 0

2604.02343 2026-06-16 cs.LG cs.AI cs.IT math.IT 版本更新

Haiku to Opus in Just 10 bits: LLMs Unlock Large Compression Gains

仅用10比特从俳句到巨作：LLMs解锁巨大压缩增益

Roy Rinberg, Annabelle Michael Carrell, Simon Henniger, Nicholas Carlini, Keri Warr

发表机构 * Harvard University（哈佛大学）； University of Cambridge（剑桥大学）； Anthropic

AI总结研究LLM生成文本的无损和有损压缩，提出问答压缩（QA）交互协议，用少量二进制问题实现超100倍压缩比，高效传递知识。

详情

AI中文摘要

我们研究了LLM生成文本在无损和有损场景下的压缩，刻画了一个压缩-计算边界，其中更多的压缩需要更多的计算。对于无损压缩，领域适应的LoRA适配器可以将基于LLM的算术编码的压缩比提高2倍，相对于仅使用基础LLM的压缩。对于有损压缩，提示模型进行简洁重写然后应用算术编码可以实现约0.03的压缩比，比压缩原始响应提高2倍。我们进一步引入了问答压缩（QA），一种受游戏“二十个问题”启发的交互式有损协议。一个小模型通过向更强模型提问是/否问题来迭代优化其响应，每个答案恰好传输1比特。在涵盖数学、科学和代码的8个基准测试中，10个二进制问题恢复了小模型和大模型在标准基准上能力差距的23%到72%，在更难的基准上恢复了7%到38%，实现了0.0006到0.004的压缩比。这比之前基于LLM的压缩（Deletang等人，2024）小100倍以上，表明交互式协议可以比传输完整响应更高效地传递知识。

英文摘要

We study the compression of LLM-generated text across lossless and lossy regimes, characterizing a compression-compute frontier where more compression is possible at the cost of more compute. For lossless compression, domain-adapted LoRA adapters can improve LLM-based arithmetic coding by 2x over compression with the base LLM alone. For lossy compression, prompting a model for a succinct rewrite then applying arithmetic coding can achieve compression ratios of approximately 0.03, a 2x improvement over compressing the original response. We further introduce Question-Asking compression (QA), an interactive lossy protocol inspired by the game 'Twenty Questions'. A small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer. On 8 benchmarks spanning math, science, and code, 10 binary questions recover 23% to 72% of the capability gap between a small and large model on standard benchmarks and 7% to 38% on harder benchmarks, achieving compression ratios of 0.0006 to 0.004. This is over 100x smaller than prior LLM-based compression (Deletang et al., 2024), suggesting that interactive protocols can transfer knowledge far more efficiently than transmitting full responses.

URL PDF HTML ☆

赞 0 踩 0

2604.03472 2026-06-16 cs.CL cs.AI 版本更新

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

词汇丢弃：LLM共同进化中的课程多样性

Jacob Dineen, Aswin RRV, Zhikun Xu, Ben Zhou

发表机构 * Arizona State University（亚利桑那州立大学）

AI总结针对LLM共同进化中问题多样性崩溃的问题，提出词汇丢弃机制，通过在策略训练和课程生成时随机掩码输出logits维持多样性，在数学推理任务上提升求解器性能平均+4.4点。

详情

AI中文摘要

共同进化自我对弈，其中一个语言模型生成问题，另一个求解，有望在没有人类监督的情况下实现自主课程学习。在实践中，提议者迅速收敛到满足奖励函数的狭窄问题分布。这种多样性崩溃使得课程对求解者无信息量，从而停滞共同进化循环。我们引入词汇丢弃，一种在策略训练和课程生成期间应用于提议者输出logits的随机掩码，作为维持多样性的轻量级机制。该掩码是硬性的且非平稳的，防止提议者锁定在固定的token序列上。通过R-Zero在数学推理上训练Qwen3-4B和Qwen3-8B，我们发现词汇丢弃在整个训练过程中在词汇、语义和功能指标上维持了提议者的多样性。它还带来了求解器性能的提升，在8B规模上平均提高+4.4点，在竞赛级基准上增益最大。我们的发现表明，显式的动作空间约束，类似于经典自我对弈中游戏规则的结构性作用，可以帮助维持语言中的生产性共同进化。词汇丢弃是该原则的一个简单实例。

英文摘要

Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training. It also yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.

URL PDF HTML ☆

赞 0 踩 0

2604.13085 2026-06-16 cs.LG cs.AI 版本更新

改进的基于表示自动编码器的基线

Jaskirat Singh, Boyang Zheng, Zongze Wu, Richard Zhang, Eli Shechtman, Saining Xie

发表机构 * Adobe Research（Adobe研究院）； ANU（澳大利亚国立大学）； New York University（纽约大学）

AI总结本文研究了基于表示自动编码器（RAE）的设计选择，发现三个见解，简化并改进了RAE。首先，研究了一种通用公式，将表示定义为最后k个编码器层的总和，而不是仅最终层。其次，研究了RAE与表示对齐（REPA）的假设，发现两者具有互补的工作机制。最后，改进了RAE在无分类器指导（CFG）中的表现，通过重新参数化DiT模型输出，实现了无需训练第二个模型的指导效果。RAEv2在ImageNet-256上达到了1.06的gFID，且训练效率显著提高。

详情

AI中文摘要

Representation Autoencoders (RAE) replace traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights which simplify and improve RAE. First, we study a generalized formulation where the representation is defined as sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for

英文摘要

Representation Autoencoders (RAE) replace traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights which simplify and improve RAE. First, we study a generalized formulation where the representation is defined as sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 leads to more than 10x faster convergence over the original RAE, achieving a state-of-the-art gFID of 1.06 in just 80 epochs on ImageNet-256. On FDr6, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs compared to the previous best 3.26 (800 epochs) without any post-training. This motivates EPFID@k (epochs to reach unguided gFID < k) as a measure of training efficiency. RAEv2 attains an EPFID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings for text-to-image generation and navigation world models, showing consistent improvements. The code is available at https://raev2.github.io.

URL PDF HTML ☆

赞 0 踩 0

2605.21850 2026-06-16 cs.CL cs.AI 版本更新

ACC: Compiling Agent Trajectories for Long-Context Training

ACC：用于长上下文训练的代理轨迹编译

Qisheng Su, Zhen Fang, Shiting Huang, Yu Zeng, Yiming Zhao, Kou Shi, Ziao Zhang, Lin Chen, Zehui Chen, Lijun Wu, Feng Zhao

发表机构 * MoE Key Lab of BIPC, University of Science and Technology of China（中科院大学科学技术大学MoE关键实验室）； Shanghai Innovation Institute（上海创新研究院）； Shanghai AI Laboratory（上海人工智能实验室）

AI总结本文提出ACC，一种将代理轨迹编译为长上下文问答对的方法，通过整合多轮交互中的工具响应和环境观察，提升大语言模型的长上下文推理能力。

详情

AI中文摘要

近期代理的发展重新激发了对LLM长上下文推理能力的需求。然而，训练LLM具备这种能力需要耗费成本的长文档整理或启发式上下文合成。我们发现，当代理解决问题时，会产生大量轨迹，涉及调用工具和接收环境观察，这些证据分散在多个回合中，需要整合远距离上下文片段。然而，标准代理SFT会屏蔽工具响应，仅训练回合级工具选择，导致监督盲区，使这些分散的信号无法被利用。我们提出Agent Context Compilation (ACC)，将搜索、软件工程和数据库查询代理的轨迹转换为长上下文QA对，结合原始问题与多回合收集的工具响应和环境观察，训练模型直接回答而不使用工具。这使问题与证据之间的依赖关系显式化，使模型能够直接监督长上下文推理，无需额外标注。ACC是一种简单但有效的做法，可与任何现有的长上下文扩展或训练方法结合，提供可扩展的监督微调数据。我们通过MRCR和GraphWalks长距离依赖建模任务验证了ACC，挑战需要跨回合核心ference解析和图遍历的基准测试。训练Qwen3-30B-A3B使用ACC在MRCR上达到68.3（+18.1），在GraphWalks上达到77.5（+7.6），结果与Qwen3-235B-A22B相当，同时在GPQA、MMLU-Pro、AIME和IFEval上保持通用能力。进一步的机制分析表明，ACC训练的模型表现出任务自适应的注意力重构和专家专业化。

英文摘要

Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and receiving environment observations across many turns. The evidence needed to answer the original question is thus scattered throughout these turns, requiring integration of distant context segments. Nevertheless, standard agent SFT masks tool responses and only trains turn-level tool selection, creating a supervision blind spot where these scattered signals go unused. We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns, training the model to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, providing scalable supervised fine-tuning data. We validate ACC on long-range dependency modeling tasks through MRCR and GraphWalks, challenging benchmarks requiring cross-turn coreference resolution and graph traversal over extended contexts. Training Qwen3-30B-A3B with ACC achieves 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), results comparable to Qwen3-235B-A22B, while preserving general capabilities on GPQA, MMLU-Pro, AIME, and IFEval. Further mechanism analysis reveals that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization.

URL PDF HTML ☆

赞 0 踩 0

2605.22873 2026-06-16 cs.LG cs.AI cs.CL 版本更新

When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

LLM何时推理？基于熵相变的动力系统视角

Wei Xia, Haoqing Wang, Zhi-Hong Deng, Yehui Tang

发表机构 * Samsung Research（三星研究院）； State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University（通用人工智能国家重点实验室，北京理工大学）

AI总结本文通过早期解码熵动态检测LLM的推理状态，提出轻量级无训练路由框架EDRM，自适应选择推理策略，在减少token消耗的同时提升准确率。

详情

AI中文摘要

链式思维（CoT）推理已成为增强LLM能力的默认策略，但其应用引发了一个基本问题：显式推理何时真正有益？实证证据揭示了一个显著悖论：CoT在事实性和开放式任务上往往带来边际甚至负增益，同时成倍增加token消耗。在这项工作中，我们表明LLM推理不是任务或模型的静态属性，而是在生成过程中涌现的\emph{动态解码状态}。通过系统分析，我们发现早期熵动态提供了这一状态的可靠信号：受益于CoT的任务表现出一致的熵降低，而其他任务则呈现不稳定或增加的模式。这种行为可以解释为从高熵探索状态到低熵结构化推理状态的类相变转变。基于这些见解，我们提出了 extbf{EDRM}（基于熵动态的推理流形），一个轻量级且无需训练的路由框架，利用早期解码熵自适应选择推理策略。EDRM将熵轨迹嵌入到紧凑且可解释的流形表示中，支持零样本部署和细粒度实例级适应。在15个基准测试和4个不同规模与架构的LLM上，EDRM始终优于静态基线。在数据集层面，EDRM实现了 extbf{41--55\%}的token减少，同时仅需50个校准样本即可提高准确率。在实例层面，它进一步将准确率提升高达 extbf{4.7\%}，同时保持 extbf{27--45\%}的token节省。这些结果表明，推理应被选择性地调用而非默认使用，并展示了基于熵的解码控制对于高效自适应LLM推理的有效性。

英文摘要

Chain-of-thought (CoT) reasoning has become the default strategy for enhancing LLM capabilities, yet its application raises a fundamental question: when is explicit reasoning actually beneficial? Empirical evidence reveals a striking paradox: CoT often provides marginal or even negative gains on factual and open-ended tasks while multiplying token consumption. In this work, we show that LLM reasoning is not a static property of tasks or models, but a \emph{dynamic decoding state} that emerges during generation. Through systematic analysis, we find early-stage entropy dynamics provide a reliable signal of this state: tasks benefiting from CoT exhibit consistent entropy reduction, while others display unstable or increasing patterns. This behavior can be interpreted as a phase-transition-like shift from a high-entropy exploratory regime to a low-entropy structured reasoning regime. Based on these insights, we propose \textbf{EDRM} (Entropy Dynamics-based Reasoning Manifold), a lightweight and training-free routing framework that leverages early decoding entropy to adaptively select inference strategies. EDRM embeds entropy trajectories into a compact and interpretable manifold representation, enabling both zero-shot deployment and fine-grained instance-level adaptation. Across 15 benchmarks and 4 LLMs of varying scales and architectures, EDRM consistently outperforms static baselines. At the dataset level, EDRM achieves \textbf{41--55\%} token reduction while improving accuracy with as few as 50 calibration samples. At the instance level, it further improves accuracy by up to \textbf{4.7\%} while maintaining \textbf{27--45\%} token savings. These results suggest that reasoning should be invoked selectively rather than by default, and demonstrate the effectiveness of entropy-driven decoding control for efficient and adaptive LLM inference.

URL PDF HTML ☆

赞 0 踩 0

2606.01602 2026-06-16 cs.LG cs.AI cs.IT math.IT 版本更新

Estimating Mutual Information between Time Series and Temporal Event Sequences Across Diverse Analysis Tasks

估计时间序列与时间事件序列在不同分析任务中的互信息

Haoji Hu, Huaqing Mao, Yijun Lin, Xiaowei Jia, Jinwei Zhou, Minoh Jeong, Yao-Yi Chiang

发表机构 * University of Minnesota - Twin Cities（明尼苏达大学-双城分校）； University of Pittsburgh（匹兹堡大学）； Inha University（Inha大学）

AI总结提出一种非参数互信息估计器，直接度量连续时间序列与离散事件序列之间的依赖关系，无需数据转换或离散化，通过处理量化伪影和事件冗余实现鲁棒统一框架。

详情

DOI: 10.1145/3770855.3817693

AI中文摘要

成对依赖度量（如相关性和因果性）是时间数据挖掘的基础，但目前仍缺乏一种原则性且稳健的方法来量化异构数据类型之间的依赖关系，特别是连续时间序列与离散时间事件序列之间。现有方法依赖于对量化、重复值和事件冗余高度敏感的临时变换或互信息估计器，导致实践中结果有偏或不稳定。我们提出一种非参数互信息估计器，无需数据转换、学习或临时离散化，直接度量时间序列与事件序列之间的依赖关系。我们的方法对真实世界时间序列的连续-离散二元性进行建模，以处理量化和重复值伪影，并引入潜在事件聚类策略以减轻事件共现和冗余带来的偏差。这些共同构成了一个鲁棒且统一的框架，桥接了离散和连续互信息。我们在四个代表性任务上评估了所提出的估计器：用于因果分析的离散-连续时延互信息、全局和局部时间重复发现、用于时间序列预测的离散协变量选择以及用于分类的连续特征选择。在合成和真实世界数据集上的实验表明，在准确性、鲁棒性和可解释性方面，该方法一致优于现有方法，使其成为异构时间数据的通用依赖算子，类似于同质时间序列的皮尔逊相关。代码见：https://github.com/HaojiHu/Multimodal-Temporal-Data-Quantification

英文摘要

Pairwise dependence measures such as correlation and causality are fundamental to temporal data mining, yet there is still no principled and robust way to quantify dependence between heterogeneous data types, especially between continuous time series and discrete temporal event sequences. Existing approaches rely on ad hoc transformations or mutual-information estimators that are highly sensitive to quantization, repeated values, and event redundancy, leading to biased or unstable results in practice. We propose a nonparametric mutual information estimator that directly measures the dependence between time series and event sequences without data transformation, learning, or ad hoc discretization. Our method models the continuous-discrete duality of real-world time series to handle quantization and repeated-value artifacts and introduces a latent event clustering strategy to mitigate bias from event co-occurrence and redundancy. Together, these yield a robust and unified framework that bridges discrete and continuous mutual information. We evaluate the proposed estimator on four representative tasks: discrete-continuous time-delayed mutual information for causality analysis, global and local temporal repetition discovery, discrete covariate selection for time series forecasting, and continuous feature selection for classification. Experiments on synthetic and real-world datasets show consistent improvements over existing methods in accuracy, robustness, and interpretability, positioning our approach as a general-purpose dependence operator for heterogeneous temporal data, similar to Pearson correlation for homogeneous time series. Code available at: https://github.com/HaojiHu/Multimodal-Temporal-Data-Quantification

URL PDF HTML ☆

赞 0 踩 0

2606.07082 2026-06-16 cs.LG cs.AI 版本更新

On the Geometry of On-Policy Distillation

论在线策略蒸馏的几何结构

Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung

发表机构 * HKUST（香港科技大学）； UT Austin（得克萨斯大学奥斯汀分校）； Zhejiang University（浙江大学）； Hong Kong PolyU（香港理工大学）； USTC（中国科学技术大学）； BUPT（北京邮电大学）； Nankai University（南开大学）； BIT（北京理工大学）

AI总结本文通过参数空间诊断，揭示在线策略蒸馏（OPD）的更新轨迹具有松弛离主成分、子空间锁定等独特几何特性，表明其并非介于SFT和RLVR之间的中间方法。

Comments 17 pages, 8 figures

详情

AI中文摘要

在线策略蒸馏（OPD）越来越多地被用于改进大型语言模型的推理能力，但其训练动态仍鲜为人知。我们刻画了OPD更新在参数空间中的轨迹，并将其与监督微调（SFT）和可验证奖励强化学习（RLVR）进行了比较。一套参数空间诊断一致地将OPD置于松弛的离主成分区域：与SFT相比，其更新影响更少的权重，并更强烈地避开主方向；而与RLVR相比，其约束更宽松。除了这种静态定位外，OPD还表现出子空间锁定：其累积更新迅速进入一个狭窄的低维通道。将训练限制在早期形成的更新子空间内能保持OPD的性能，但会严重降低SFT，表明该锁定子空间对OPD在功能上是充分的。控制实验进一步表明，稀疏化更新令牌和将rollout生成移至离策略能保持秩动态，而将OPD目标与RLVR混合则会改变它们。总体而言，这些结果表明OPD不仅仅是SFT和RLVR之间的中间点，而是在参数空间中诱导出自身独特的更新几何结构。

英文摘要

On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.

URL PDF HTML ☆

赞 0 踩 0

2606.08090 2026-06-16 cs.DB cs.AI 版本更新

Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

基于LLM的快速语义过滤：从统一框架到自适应两阶段方法

Kyoungmin Kim, Martin Catheland, Anastasia Ailamaki

发表机构 * EPFL（瑞士联邦理工学院）

AI总结提出自适应两阶段语义过滤框架，结合无模型聚类与在线代理，利用LLM的置信度作为软标签训练代理，并通过稀疏感知校准降低级联成本，在90%准确率目标下速度提升1.6-2.0倍。

详情

AI中文摘要

在文档语料库上评估自然语言的是/否谓词并满足准确率目标——语义过滤——是基于LLM的数据处理的基石。对每个文档调用LLM（即oracle）代价高昂，因此级联方法将oracle与快速代理配对。然而，当前部署存在四个局限性：(1) 每个级联家族——无模型聚类、预构建的小型LLM代理、在线训练的代理——只采用单一表示和流水线，仅在狭窄的查询范围内有效。(2) 最强的在线代理在稠密嵌入的双编码器上采用自定义训练方案，忽略了更丰富谓词所需的token级证据。(3) 代理针对二元是/否标签进行训练，浪费了LLM在边界文档上的逐文档置信度，而这些正是代理最需要学习的。(4) 现有校准添加了统一的安全裕度，将真实的代理不确定性与小样本噪声混为一谈，增加了级联成本。\n我们通过以下方式解决这些问题：(1) 自适应地组合不同家族——首先使用无模型聚类，仅在需要时使用在线代理，并在各阶段共享oracle调用；(2) 用现成的token感知模型的混合替代余弦双编码器；(3) 使用oracle的逐文档置信度作为软标签来训练代理；(4) 采用一种校准方法，仅在标记样本稀疏的地方添加安全裕度。我们也是首次将oracle的逐文档置信度用于三个目的：查询级难度指南针、任何基于代理的级联所需的最小oracle调用次数的下界，以及代理的软训练标签。\n在三个10K文档语料库上，以90%准确率为目标，我们的方法比每个语料库上最佳先前方法快1.6-2.0倍，并在95%的查询上达到目标；基于BER的下界表明未来工作还有约4-20倍的提升空间。

英文摘要

Evaluating a natural-language yes/no predicate over a document corpus under an accuracy target - the semantic filter - is a cornerstone of LLM-based data processing. Calling the LLM on every document (the oracle) is prohibitive, so cascades pair the oracle with a fast proxy. As deployed today, they leave four limitations on the table. (1) Each cascade family - model-free clustering, prebuilt small-LLM proxies, online-trained proxies - commits to a single representation and pipeline, and wins on only a narrow query regime. (2) The strongest online proxy invests in a custom training scheme on a bi-encoder over dense embeddings, missing the token-level evidence richer predicates require. (3) The proxy is trained against binary yes/no labels, wasting the LLM's per-document confidence at the boundary documents it most needs to learn. (4) Existing calibrations add a uniform safety margin, conflating genuine proxy uncertainty with small-sample noise and inflating cascade cost. We address these by (1) composing families adaptively - model-free clustering first, online proxy only when needed, with oracle calls shared across phases; (2) replacing the cosine bi-encoder with a hybrid of off-the-shelf token-aware models; (3) training the proxy with the oracle's per-document confidence as a soft label; and (4) a calibration that adds the safety margin only where the labeled sample is sparse. We are also the first to use the oracle's per-document confidence for three purposes: a query-level difficulty compass, a lower bound on the minimum oracle calls any proxy-based cascade can make, and the proxy's soft training label. At a 90% accuracy target on three 10K-document corpora, our methods are 1.6-2.0x faster than the best prior method per corpus and meet the target on 95% of queries; the BER-derived lower bound indicates a further ~4-20x of headroom for future work.

URL PDF HTML ☆

赞 0 踩 0

2606.08898 2026-06-16 eess.AS cs.AI cs.LG 版本更新

Few-shot Class-variable Incremental Audio Classification via Prototype Adaptation and Pseudo Class-variable Training

基于原型适应和伪类变量训练的少样本类变量增量音频分类

Yanxiong Li, Guoqing Chen, Qianqian Li, Sen Huang

发表机构 * School of Electronic and Information Engineering, South China University of Technology（华南理工大学电子与信息学院）

AI总结针对实际中类别数量增减的少样本类变量增量音频分类问题，提出一种结合原型适应网络和伪类变量训练策略的方法，在三个公开数据集上平均准确率超过现有方法。

Comments This paper has been accepted for publication in Interspeech 2026. 4 Tables and 4 Figures

详情

AI中文摘要

在少样本类增量音频分类任务中，通常假设类别数量总是增加而不考虑减少的可能性。然而，实际中类别数量通常会增加或减少。本文研究了少样本类变量增量音频分类（FCIAC）问题，其中类别数量增加或减少。我们提出了一种使用原型适应和伪类变量训练的FCIAC方法。我们的方法中的模型由编码器和分类器组成。分类器由类变量原型适应网络初始化，其结构随类别的变化而动态变化。此外，我们设计了一种伪类变量训练策略，以增强模型对变化类别的适应性。在三个公开数据集上的实验表明，我们的方法在平均准确率上超过了先前的方法。代码位于：https://github.com/cgq2971-afk/FCIAC。

英文摘要

In the task of few-shot class-incremental audio classification, the number of classes is assumed to always increase without considering the possibility of decrease. However, the number of classes generally increases or decreases in practice. In this paper, we investigate a problem of Few-shot Class-variable Incremental Audio Classification (FCIAC), in which the number of classes increases or decreases. We propose a FCIAC method using prototype adaptation and pseudo class-variable training. The model in our method consists of an encoder and a classifier. The classifier is initialized by a class-variable prototype adaptation network, whose structure dynamically changes with the change of classes. In addition, we design a pseudo class-variable training strategy to enhance the model's adaptability to changing classes. Experiments on three public datasets show that our method exceeds previous methods in average accuracy. The code is at: https://github.com/cgq2971-afk/FCIAC.

URL PDF HTML ☆

赞 0 踩 0

2605.28860 2026-06-16 cs.LG cs.AI cs.CL cs.CR 版本更新

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

灾难性遗忘的机制起源：为什么RL比SFT更好地保留电路？

Jeanmely Rojas Nunez, Viraj Sawant, Nathan Allen, Nomgondalai Amgalanbaatar, Yannis Zongo, Vasu Sharma, Maheep Chaudhary

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Washington（华盛顿大学）； University of Toronto（多伦多大学）

AI总结通过引入差异电路脆弱性指标，研究比较了强化学习与监督微调在大型语言模型微调中对内部计算电路的保留程度，发现RL虽任务适应较慢但能更好保留电路，从而减轻灾难性遗忘。

详情

AI中文摘要

微调大型语言模型（LLMs）经常导致先前能力的灾难性遗忘。最近的研究表明，强化学习（RL）比监督微调（SFT）更有效地保留先前能力，这归因于策略梯度更新更接近基础策略\cite{shenfeld2025rl}。我们将这种行为解释扩展到机制层面，并探究RL的优势是否通过内部计算电路的更强保留来体现。我们引入了差异电路脆弱性，一种头部级别的度量，用于衡量电路在微调下的退化程度，并将其用于比较RL和SFT在Qwen2.5-3B-Instruct适应科学问答任务上的表现。我们发现了清晰的机制权衡：SFT更快地适应目标任务，但导致更大的电路破坏和先前能力的遗忘，而RL保留了更大比例的基础电路，代价是任务适应较慢。这些发现表明，电路保留可能有助于解释为什么RL对灾难性遗忘更具鲁棒性。我们在此发布了代码：https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability。

英文摘要

Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy \cite{shenfeld2025rl}. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability.

URL PDF HTML ☆

赞 0 踩 0

2606.15231 2026-06-16 cs.AI 新提交

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Visual-Seeker：通过主动视觉推理实现视觉原生多模态智能搜索

Zhengbo Zhang, Changtao Miao, Jinbo Su, Zhaowen Zhou, Chunxia Zhang, Xukai Wang, Ruiqi Liu, Kaiyuan Zheng, Jiansheng Cai, Bo Zhang, Zhe Li, Shiming Xiang, Ying Yan

发表机构 * School of Artificial Intelligence UCAS（中国科学院大学人工智能学院）； Institute of Automation CAS（中国科学院自动化研究所）； Ant Digital Technologies Ant Group（蚂蚁数字科技蚂蚁集团）； RUC（中国人民大学）； BIT（北京理工大学）

AI总结提出Visual-Seeker，一种通过主动视觉推理进行视觉原生多模态深度搜索的智能体，在五个基准上达到最先进性能，甚至超越专有模型。

详情

AI中文摘要

多模态大语言模型（MLLMs）在许多视觉任务中展示了令人印象深刻的能力，但在面对复杂、开放世界场景时，它们常常在事实性基础上挣扎。尽管最近的多模态深度搜索智能体试图通过利用外部工具来解决这个问题，但视觉原生搜索范式仍未得到充分探索。现有方法主要依赖于具有显式语义的简单图像和纯文本证据轨迹，限制了智能体执行多跳、跨模态推理和搜索的能力。为了解决这些限制，我们提出了Visual-Seeker，一种通过主动视觉推理的视觉原生多模态深度搜索智能体。我们的智能体不是将视觉视为静态输入，而是主动关注细粒度的视觉细节，在搜索过程中动态收集视觉证据。为了释放其视觉原生潜力，我们设计了一个主动视觉推理数据管道，并合成了5K高质量的多模态轨迹用于模型训练。大量实验表明，在五个具有挑战性的多模态搜索基准上，我们的方法达到了最先进的性能，甚至超越了多个专有模型，验证了在真实网络环境中鲁棒的视觉原生推理和搜索能力。代码和数据可在 https://github.com/ZhengboZhang/Visual-Seeker 获取。

英文摘要

Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by utilizing external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent via active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details, dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate the state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, validating robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: https://github.com/ZhengboZhang/Visual-Seeker.

URL PDF HTML ☆

赞 0 踩 0

2606.15591 2026-06-16 cs.AI cs.CL cs.MA 新提交

视觉锚定思维

Junkai Zhang, Yihe Deng, Kai-Wei Chang, Wei Wang

发表机构 * University of California, Los Angeles（加利福尼亚大学洛杉矶分校）

AI总结提出视觉锚定思维方法，让视觉语言模型在推理时交替生成自然语言和视觉锚点（点或框），并通过合成数据管道和锚定感知强化学习训练，在计数和空间推理任务上显著提升性能。

详情

AI中文摘要

视觉思维不仅应该听起来正确，还应该展示其证据。虽然最近的视觉语言模型（VLM）能够生成自然语言推理轨迹，但这些轨迹往往隐含了所支持的图像区域，使得它们难以验证和监督。我们引入了视觉锚定思维，这是一种推理过程，其中模型将自然语言思想与每一步所使用的视觉证据的显式点或框锚定交替生成。这使得模型能够在语言中表达中间推理，同时将关键对象锚定到它们所指的图像区域。为了训练这种行为，我们构建了一个可扩展的合成管道，该管道蒸馏正确的视觉推理轨迹，提取轨迹所需的视觉对象，使用基于SAM3的代理对其进行锚定，并从生成的掩码中导出对齐的点与框监督。我们进一步提出了锚定感知强化学习，它将答案正确性奖励与密集的锚定奖励相结合，后者评分生成的物体引用是否匹配正确的图像证据。在两个计数基准和四个空间推理基准上，将视觉锚定思维添加到Gemma3-4B-IT中，始终优于原始模型和非锚定思维基线。在空间推理上，视觉锚定思维的4B模型匹配，并在某些情况下超越了同一模型家族的Gemma3-27B-IT。我们的分析表明，点锚定适合计数，而框锚定在空间任务上从显式锚定奖励中获益最多。总体而言，我们的结果表明，当VLM的中间思维与使它们为真的图像区域相关联时，它们的思考能力更强。

英文摘要

Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, making them hard to verify and difficult to supervise. We introduce visually grounded thinking, a reasoning process in which models interleave natural-language thoughts with explicit point or box groundings of the visual evidence used at each step. This lets the model express intermediate reasoning in language while grounding key objects in the image regions they refer to. To train this behavior, we construct a scalable synthesis pipeline that distills correct visual reasoning traces, extracts the visual objects required by the traces, grounds them with a SAM3-based agent, and derives aligned point and box supervision from the resulting masks. We further propose grounding-aware reinforcement learning, which combines answer correctness rewards with dense grounding rewards that score whether generated object references match the correct image evidence. Across two counting benchmarks and four spatial reasoning benchmarks, adding visually grounded thinking to Gemma3-4B-IT consistently improves performance over the original model and the non-grounded thinking baseline. On spatial reasoning, the visually grounded thinking 4B models match, and in some cases surpass, Gemma3-27B-IT from the same model family. Our analysis shows that point grounding is well suited to counting, while box grounding benefits most from explicit grounding rewards on spatial tasks. Overall, our results show that VLMs think better when their intermediate thoughts are tied to the image regions that make them true.

URL PDF HTML ☆

赞 0 踩 0

2606.16307 2026-06-16 cs.AI cs.CL 新提交

State-Grounded Multi-Agent Synthetic Data Generation for Tool-Augmented LLMs

面向工具增强型大语言模型的基于状态的多智能体合成数据生成

Rahul Khedar, Eshita, Sneha Teja Sree Reddy Thondapu, Mayank Malhotra, Arup Das, Jitesh Chandra, Yun-Shiuan Chuang, Chaitanya Kulkarni, Arun Menon, Linsey Pang, Avinash Karn, Mouli V, Prakhar Mehrotra

发表机构 * PayPal AI

AI总结提出StateGen平台，通过四角色LLM循环和状态管理器生成多轮、工具接地的高质量训练对话，消除工具调用幻觉，支持层次化多智能体设置。

Comments 9 pages, 5 figures, 6 tables, 1 algorithm

详情

AI中文摘要

训练工具增强型LLM代理需要大量多轮、工具接地的对话数据，这些数据标注成本高、生产环境中受隐私限制，且公共数据集中基本缺失。我们提出StateGen，一个合成数据生成平台，通过编排四角色LLM循环（角色条件用户模拟器、被测代理、状态接地工具模拟器和多轴LLM评判器）生成带有评分和丰富推理轨迹的训练对话。关键架构贡献是一个权威状态管理器，它在多轮对话中维护一个结构化的世界状态对象，强制执行后端即事实的不变性，从而从结构上消除了最主要的工具调用幻觉类别。StateGen通过将子代理声明为工具（所有子代理共享一个状态对象）自然地扩展到层次化多智能体设置。我们在三个生产语料库上报告了64,698个评估对话的结果：工具调用幻觉得分达到9.66/10，系统通过23维特征向量支持角色驱动变化，并且干净分离的训练集和黄金评估集划分确认数据不是记忆诱饵（按标准差距分析）。与八个外部系统的比较表明，没有单一公开平台同时具备多轮生成、状态接地工具模拟、层次化多智能体支持和内置评判器评分功能。

英文摘要

Training tool-augmented LLM agents requires large corpora of multi-turn, tool-grounded conversational data that is expensive to annotate, privacy-constrained in production settings, and largely absent from public datasets. We present StateGen, a synthetic data generation platform that produces scored, reasoning-trace-rich training conversations by orchestrating a four-role LLM loop: a persona-conditioned user simulator, an agent under test, a state-grounded tool simulator, and a multi-axis LLM judge. The key architectural contribution is an authoritative state manager that maintains a structured world-state object across turns, enforcing a backend-is-truth invariant that eliminates the dominant class of tool-call hallucinations by construction. StateGen extends naturally to hierarchical multi-agent settings by declaring sub-agents as tools, all sharing a single state object. We report results on 64,698 evaluated conversations across three production corpora: tool-call hallucination scores reach 9.66/10, the system supports persona-driven variation via a 23-dimensional trait vector, and a cleanly separated train and golden evaluation set split confirms the data is not memorization bait (per-criterion gap analysis). Comparison with eight external systems shows that no single publicly available platform combines multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring.

URL PDF HTML ☆

赞 0 踩 0

2606.16364 2026-06-16 cs.AI cs.CR cs.SE 新提交

Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents

看而非选：LLM智能体工具选择失败的注意力-片段解释

Shiyang Chen

AI总结本文通过分析LLM智能体对工具定义片段的注意力，发现工具选择失败源于决策读出阶段而非工具可见性，并提出了基于注意力的无训练选择器来修复。

Comments 13 pages, 1 figure, 15 tables

详情

AI中文摘要

LLM智能体会错误地调用工具，自然的猜测是模型在拥挤的工具箱中未能看到正确的工具。我们通过一个并发工作未涉及的视角——模型对标记的工具定义片段的注意力——展示了相反的情况。在真实的BFCL失败案例上，通过每个候选的注意力argmax，模型在80%的情况下最关注正确的工具（对比21%的随机概率），而正确工具是注意力不足的片段仅占10%：它看到了正确的工具但仍然选错。这直接反驳了直观的“拥挤工具箱/中间丢失”解释：失败在于决策读出，而非工具箱，我们通过三种方式证实了这一点。(1) 输入vs.读出：修复提示（重新排序或复制正确工具）仅恢复<=23%的失败，而读出侧干预恢复59-91%。(2) 表示不变性：两种不同表示中的指向正确工具的干预——加性注意力logit偏置和残差流转向向量——恢复的失败案例大致相同（每任务Jaccard 0.865合并，每模型0.79-0.91），因此瓶颈定位于读出，与干预的表示无关。(3) 无训练、无正确工具的选择器：基于每个片段的注意力在BFCL上缩小了大部分无正确工具与有正确工具之间的差距（函数名选择合并+11.9分 vs. 有正确工具上限+17.9分），并在Seal-Tools上增加+14.9分；每个模型均为正向（精确McNemar检验p<=8e-4每个）。范围不同：因果注意力偏置剂量反应在10个遵循掩码的模型（3-32B）上是双向且单调的，而0.5-32B全范围仅携带相关性诊断；可部署的选择器在5个单轮模型上评估，尚未迁移到多轮循环。

英文摘要

LLM agents mis-call tools, and the natural guess is that the model failed to see the right tool in a crowded harness. We show the opposite through a lens concurrent work sets aside -- the model's attention to labeled tool-definition segments. On real BFCL failures, by per-candidate attention argmax the model attends most to the correct tool 80% of the time (vs. 21% chance), and the gold is the under-attended segment on only 10%: it looks at the right tool and still picks wrong. This directly refutes the intuitive "crowded-harness / lost-in-the-middle" explanation: the failure is at the decision readout, not the harness, and we pin it there three ways. (1) Input vs. readout: repairing the prompt (reordering or duplicating the gold tool) recovers <=23% of failures, while readout-side interventions recover 59-91%. (2) Representation-invariance: two gold-pointed interventions in different representations -- an additive attention-logit bias and a residual-stream steering vector -- recover largely the same failures (per-task Jaccard 0.865 pooled, 0.79-0.91 per model), so the bottleneck is localized to the readout independent of which representation is poked. (3) A training-free, gold-free selector: per-segment attention closes most of the gold-free-vs-oracle gap on BFCL (+11.9 pts pooled function-name selection vs. +17.9-pt oracle headroom) and adds +14.9 pts on Seal-Tools; every model positive (exact McNemar p<=8e-4 each). Scopes differ: the causal attention-bias dose-response is bidirectional and monotonic on 10 mask-honoring models (3-32B), the full 0.5-32B span carrying only the correlational diagnostic; the deployable selector is evaluated on 5 single-turn models and does not yet transfer to a multi-turn loop.

URL PDF HTML ☆

赞 0 踩 0

2606.16481 2026-06-16 cs.AI 新提交

Steering Emotional Dynamics for Art Therapy: Controllable Narrative Script Generation through Hierarchically Guided LLM Agents

引导艺术治疗的情感动态：通过分层引导的LLM智能体实现可控叙事脚本生成

Suqing Wang, Qinghai Miao, Chao Guo, Yisheng Lv

发表机构 * Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）

AI总结提出EC-Script框架，通过分层控制情感轨迹生成叙事脚本，实现情感轨迹规划、场景驱动和局部情感调节，显著优于基线方法。

详情

AI中文摘要

艺术治疗在情感治愈中扮演重要角色，其中叙事创作是情感表达的主要载体。鉴于治愈过程中情感固有的动态特性，具有精细控制情感波动的叙事使个体能够安全地投射内心冲突并实现情感宣泄。近年来，随着大型语言模型（LLM）的快速发展，自动叙事生成技术为支持此类艺术设计提供了新途径。然而，现有方法虽然能生成流畅文本，但难以生成遵循特定情感轨迹的叙事，无法满足以情感为导向的心理治愈需求。为解决这些问题，本文提出EC-Script，一种基于LLM智能体的框架，能够实现对情感治愈叙事生成中情感轨迹的分层控制。为确保生成的叙事严格遵循给定的情感模式，EC-Script通过情感轨迹规划建立整体叙事方向，通过角色驱动场景生成推动场景级情节发展，并通过情感控制脚本编写调节角色的局部情感变化。最终输出逐场景的脚本内容，与预设情感轨迹保持高度一致。实验结果表明，EC-Script在情感轨迹遵循度上显著优于基线方法，展现出优秀且可靠的情感可控性，从而为AI辅助情感治愈场景提供有效的技术支持。

英文摘要

Art therapy plays a vital role in emotional healing, in which narrative creation acts as the primary vehicle for emotional expression. Given the inherently dynamic nature of emotions during healing, narratives with finely controlled emotional fluctuations enable individuals to safely project inner conflicts and achieve emotional catharsis. Recently, with the rapid development of Large Language Models (LLMs), automated narrative generation technology has provided a new pathway to support such artistic designs. However, while existing methods can produce fluent texts, they struggle to generate narratives that adhere to specified affective trajectories, failing to meet the demands of emotion-oriented psychological healing. To address these issues, this paper proposes EC-Script, an LLM agent-based framework that enables hierarchical control of the affective trajectory in narrative generation for emotional healing. To ensure that the generated narratives strictly follow the given emotional patterns, EC-Script establishes overall narrative direction through Emotion-Trajectory Planning, propels scene-level plot development with Character-Driven Scene Generation, and regulates local emotional changes of characters via Emotion-Controlled Script Writing. Ultimately, it outputs scene-by-scene script content that remains highly consistent with the preset affective trajectory. Experimental results demonstrate that EC-Script significantly outperforms baseline methods in affective trajectory adherence, exhibiting excellent and reliable emotional controllability, thereby providing effective technical support for AI-assisted emotional healing scenarios.

URL PDF HTML ☆

赞 0 踩 0

2606.16541 2026-06-16 cs.AI cs.LG 新提交

The Faithfulness Gap: Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements

忠实性差距：认证自然语言与形式数学语句之间的语义等价性

Noor Islam S. Mohammad, Tamim Sheikh

发表机构 * Department of Computer Science, Informatics Institute, Istanbul Technical University, İstanbul, Türkiye（信息学院计算机科学系，伊斯坦布尔技术大学，伊斯坦布尔，土耳其）； Department of Computer Science（计算机科学系）； Engineering, Jashore University of Science（工程系，贾沙尔大学科学学院）

AI总结提出双向可证明性指纹识别框架，通过前向和后向推论邻域匹配自然语言探针，认证自动形式化翻译的忠实性，并引入反事实探针生成、等价谱、自适应探针预算分配和忠实性引导解码四个新组件，在基准上实现高检测率并减少漂移。

详情

AI中文摘要

自动形式化——将自然语言数学翻译成形式证明助手——的瓶颈不在于翻译流畅性，而在于\emph{忠实性}：一个形式语句可以通过类型检查且可证明，但仍可能编码与源意图不同的定理。我们引入\emph{双向可证明性指纹识别}（\bpf{}），这是一个通过刻画每个候选在背景理论中的前向和后向推论邻域，并将这些邻域与从自然语言语句导出的探针进行匹配来认证忠实性的框架。我们进一步引入四个新组件：（i）\emph{反事实探针生成}（\cpg{}），一种合成针对特定漂移方向的探针的对比性程序；（ii）\emph{等价谱}，一个替代脆弱的二元判决的连续忠实性分数；（iii）\emph{自适应探针预算分配}（\apba{}），一个信息论预算路由器；以及（iv）\emph{忠实性引导解码}（\fgd{}），它在自动形式化过程中使用\bpf{}信号作为奖励。我们证明了一个\emph{漂移检测定理}和一个\emph{PAC-忠实性}结果，该结果确立了在温和假设下，自然语言语句的等价类可以从$\mathcal{O}(\log(1/δ)/\varepsilon)$个探针中学习。我们发布了\driftbench{}，一个包含$2{,}183$个NL/Lean~4对的基准，这些对具有跨mathlib4六个子领域的受控漂移标签。\bpf{}\,+\,\cpg{}在$3.0\%$的假阳性率下检测出$89.6\%$的漂移形式化——相比之下，类型检查为$41.2\%$，LLM评判基线为$63.3\%$——并且\fgd{}将最先进的自动形式化器产生漂移语句的比率降低了$47\%$。https://pmlrbd.github.io/BPF/

英文摘要

Autoformalization, translating natural-language mathematics into formal proof assistants, is bottlenecked not by translation fluency but by \emph{faithfulness}: a formal statement can typecheck and be provable, yet still encode a different theorem than the source intended. We introduce \emph{Bidirectional Provability Fingerprinting} (\bpf{}), a framework that certifies faithfulness by characterizing each candidate through its forward and backward consequence neighborhoods in the ambient theory and matching these against probes derived from the natural-language statement. We further introduce four novel components: (i) \emph{Counterfactual Probe Generation} (\cpg{}), a contrastive procedure that synthesizes probes targeting specific drift directions; (ii) the \emph{Equivalence Spectrum}, a continuous faithfulness score that replaces brittle binary verdicts; (iii) \emph{Adaptive Probe Budget Allocation} (\apba{}), an information-theoretic budget router; and (iv) \emph{Faithfulness-Guided Decoding} (\fgd{}), which uses \bpf{} signals as a reward during autoformalization. We prove a \emph{drift detection theorem} and a \emph{PAC-faithfulness} result establishing that the equivalence class of a natural language statement is learnable from $\mathcal{O}(\log(1/δ)/\varepsilon)$ probes under mild assumptions. We release \driftbench{}, a benchmark of $2{,}183$ NL/Lean~4 pairs with controlled drift labels across six subfields of mathlib4. \bpf{}\,+\,\cpg{} detects $89.6\%$ of drifted formalizations at a $3.0\%$ false-positive rate-against $41.2\%$ for typecheck and $63.3\%$ for LLM-judge baselines, and \fgd{} reduces the rate at which a state-of-the-art autoformalizer emits drifted statements by $47\%$. https://pmlrbd.github.io/BPF/

URL PDF HTML ☆

赞 0 踩 0

2606.16687 2026-06-16 cs.AI cs.CL 新提交

From Affect Prediction to Affect Forecasting: Evidence for Distinct Information Sources in Longitudinal Text

从情感预测到情感预报：纵向文本中不同信息源的证据

Sadia Noor, Seemab Latif, Raja Khurram Shahzad, Mehwish Fatima

发表机构 * School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST)（国立科技大学电气工程与计算机科学学院）； Department of Communication, Quality Management and Information Systems, Mid Sweden University（中瑞典大学通信、质量管理和信息系统系）

AI总结本文区分当前情感估计与未来情感变化预报，提出TSAP框架和ACF-Hybrid模型，实验表明文本语义支持当前预测，而数值轨迹动力学更适用于未来变化预报。

详情

AI中文摘要

对纵向文本中的维度情感建模需要区分当前情感估计与未来情感变化预报。现有方法通常将每个文本视为独立观测，并对两个任务应用类似假设，而不检验它们是否依赖不同的信息源。本文利用纵向自我报告生态短文和情感词条目研究这一区别。我们提出特质-状态情感预测（TSAP）框架及其时间扩展E-TSAP用于逐文本效价和唤醒度预测，在来自91名用户的1737条条目的保留预测测试集上评估。我们进一步提出情感变化预报混合模型（ACF-Hybrid）用于下一步情感变化预报，在来自46名用户的保留预报测试集上评估。对于预测，E-TSAP在效价上达到复合皮尔逊相关系数0.670，在唤醒度上达到0.449。对于预报，文本表示的表现不如紧凑的数值轨迹基线：包含文本的模型在效价上仅达到r=0.316，在唤醒度上达到r=0.284，而简单的先前状态基线分别达到r=0.615和r=0.670。ACF-Hybrid使用维度特定的数值轨迹特征，在效价上达到r=0.659，在唤醒度上达到r=0.658。这些结果表明，文本语义支持当前情感预测，而未来情感变化通过先前数值轨迹动力学能更好地捕获。

英文摘要

Modeling dimensional affect in longitudinal text requires distinguishing current affect estimation from future affective change forecasting. Existing approaches often treat each text as an independent observation and apply similar assumptions to both tasks, without testing whether they rely on different information sources. This paper investigates that distinction using longitudinal self-reported ecological essays and feeling-word entries. We propose the Trait--State Affective Prediction (TSAP) framework and its temporal extension E-TSAP for per-text valence and arousal prediction, evaluated on a held-out prediction test set of 1,737 entries from 91 users. We further propose the Affective Change Forecaster Hybrid (ACF-Hybrid) for next-step affective change forecasting, evaluated on a held-out forecasting test set of 46 users. For prediction, E-TSAP achieves composite Pearson correlations of 0.670 for valence and 0.449 for arousal. For forecasting, textual representations perform worse than compact numeric trajectory baselines: the text-inclusive model achieves only r=0.316 for valence and r=0.284 for arousal, whereas a simple prior-state baseline reaches r=0.615 and r=0.670, respectively. ACF-Hybrid, using dimension-specific numeric trajectory features, achieves r=0.659 for valence and $r=0.658$ for arousal. These results show that textual semantics support current affect prediction, whereas future affective change is better captured through prior numeric trajectory dynamics.

URL PDF HTML ☆

赞 0 踩 0

2606.14750 2026-06-16 eess.AS cs.AI cs.CV cs.SD 交叉投稿

Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech

Pixel-TTS: 基于图像的文字渲染实现鲁棒文本转语音

Adarsh Arigala, Arjun Gangwar, S Umesh, Yova Kementchedjhieva

发表机构 * SPRING Lab, Indian Institute of Technology, Madras, India（SPRING实验室，印度理工学院，马德拉斯，印度）； MBZUAI, UAE（MBZUAI，阿联酋）

AI总结提出Pixel-TTS框架，将文本渲染为图像并通过2D卷积生成嵌入，消除嵌入矩阵扩展，提升对未见字符和拼写变体的鲁棒性，实现零样本泛化。

Comments 5 pages, 4 figures, 4 tables

详情

AI中文摘要

近期基于像素的文本建模进展表明，将文本表示为图像能使模型利用视觉线索进行语言理解。将文本锚定在其视觉形式上，允许具有不同Unicode编码的结构相似字符产生相似的嵌入，从而有益于跨语言和零样本场景。传统的基于文本的方法独立处理每个字符，限制了向未见字符的泛化，并在跨语言适应时需要嵌入扩展。我们提出Pixel-TTS，首个视觉接地语音合成框架。它将文本渲染为图像，并通过2D卷积层投影以生成嵌入。这种设计在微调过程中消除了嵌入矩阵扩展，同时提高了对未见字符和拼写变体的鲁棒性。大量实验表明，Pixel-TTS在强基线上实现了有竞争力的性能、更快的收敛和鲁棒的零样本泛化。

英文摘要

Recent advances in pixel-based text modeling show that representing text as images enables models to exploit visual cues for language understanding. Grounding text in its visual form allows structurally similar characters with different Unicode encodings to produce similar embeddings, benefiting cross-lingual and zero-shot scenarios. Conventional text-based approaches treat each character independently, limiting generalization to unseen characters and requiring embedding expansion during cross-lingual adaptation. We propose Pixel-TTS, the first framework for visually grounded speech synthesis. It renders text as images and projects them through a 2D convolutional layer to generate embeddings. This design eliminates embedding matrix expansion during fine-tuning while improving robustness to unseen characters and orthographic variations. Extensive experiments show Pixel-TTS achieves competitive performance with strong baselines, faster convergence and robust zero-shot generalization.

URL PDF HTML ☆

赞 0 踩 0

2606.14762 2026-06-16 cs.CV cs.AI 交叉投稿

Scribby: A Multi-Level LLM Framework for Semantic Video Analysis

Scribby: 一种用于语义视频分析的多级LLM框架

Julian Abelarde, Hugo Garrido-Lestache Belinchon

发表机构 * Department of Computer Science and Software Engineering, Milwaukee School of Engineering（密尔沃基工程学院计算机科学与软件工程系）

AI总结提出一种基于LLM的视频摘要框架，通过微观索引（分析完整转录、句子及语义分组）平衡宏观理解与微观语义分析，并利用相关性热图实现语义分块和匹配的可视化。

详情

AI中文摘要

随着视频内容在教育平台、录播讲座和直播娱乐中的持续扩展，对长视频进行高效且结构化分析的需求日益增长。尽管许多现有AI程序基于AI生成的转录提供高级视频摘要，但这些方法通常局限于粗略概述，缺乏对视频结构、主题进展和语义关系的详细分析，而这些正是全面视频分析所必需的。本文提出一种基于LLM的视频摘要框架，平衡宏观理解与微观语义分析。该过程的第一阶段在微观层面对视频进行索引，包括：(1) 分析完整转录，(2) 分析单个转录句子，(3) 使用LLM作为评判依据语义相似性对这些句子进行分组。在句子级处理中，通过将全局转录分析和相邻句子信息纳入每个评估提示，保留上下文连续性。该框架为通过相关性热图可视化语义分块和语义匹配的视频分析工具奠定了基础。还讨论了框架的局限性和未来扩展。

英文摘要

As video content continues to expand across educational platforms, recorded lectures, and live-streamed entertainment, the need for efficient and structured analysis of long-form footage has increased \cite{1}. Although many existing AI programs provide high-level video summaries based on AI-generated transcripts \cite{2,3,4,5}, these approaches are often limited to coarse overviews and lack detailed analysis of a video's structure, thematic progression, and semantic relationships, all of which are required for comprehensive video analysis. This paper proposes an LLM-based video summarization framework that balances macro-level comprehension with micro-level semantic analysis \cite{6,12,13}. The first stage of the process indexes the video at a micro level by (1) analyzing the full transcript, (2) analyzing individual transcript sentences, and (3) grouping these sentences by semantic similarity using an LLM as a judge \cite{6,13}. Contextual continuity is retained during sentence-level processing by incorporating both the global transcript analysis and adjacent sentence information into each evaluation prompt. This framework establishes a foundation for video analysis tools that visualize semantic chunking and semantic matching through relevance-based heatmaps. Limitations and future expansions of the framework are also discussed.

URL PDF HTML ☆

赞 0 踩 0

2606.14777 2026-06-16 cs.CV cs.AI 交叉投稿

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

JoyAI-VL-Interaction: 实时视觉-语言交互智能

Dingyu Yao, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Haowen Hou, Zheming Liang, Congcong Wang, Yuhang Cao, Shenglong Ye, Shuai Xie, Shuhuan Gu, Haoyang Huang, Qingyi Si, Nan Duan, Jiaqi Wang

发表机构 * JD.com（京东）

AI总结提出一种持续观察、自主决定是否回应的视觉-语言交互模型，并开源8B规模模型及完整部署系统，在六个真实场景中优于现有方案。

详情

AI中文摘要

现实世界中的许多时刻不会等待用户提问。安全监控上起火，视频通话中表情变化，或直播中观众想要的商品一闪而过。然而，当今的大模型大多仍以轮次式设计：它们只在被召唤时回答，即使是看似交互式的视频通话应用，其运作方式仍是问答系统，仅在轮询或提示时做出反应。我们主张一种不同的范式：一个像人一样存在于世界中的模型。它持续观察当前发生的事件，自行决定是说话还是保持沉默，实时交互，并在问题困难时委托给后台模型。为了推动交互模型及其在各领域的应用，我们做出两项完全开源贡献。首先，我们发布JoyAI-VL-Interaction，一个8B规模的视觉优先VL交互模型。该模型内部做出响应决策，每秒选择保持沉默、回应或委托给后台模型，并在视觉触发响应性和时间感知方面表现出色。我们为其配备了一个可迁移的训练方案，从中涌现出我们从未训练过的能力，例如引导购物者切换应用屏幕或根据幻灯片即兴授课。其次，我们发布了一个围绕该模型构建的完整可部署系统。该系统将任何正在进行的视频流式传输到模型中，使其真正存在于世界中。所有其他组件都是可插拔的，包括ASR/TTS模块、记忆、可视化UI以及可连接任何API或代理的后台大脑。在六个真实场景中，人类评估者以较大优势偏好JoyAI-VL-Interaction而非豆包和Gemini的应用内视频通话助手。据我们所知，这是第一个开源的、视觉驱动的交互模型，同时发布了其训练方案、数据和完整可部署系统。

英文摘要

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.

URL PDF HTML ☆

赞 0 踩 0

2606.15007 2026-06-16 cs.CL cs.AI cs.LG 交叉投稿

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Nemotron 3 Ultra: 开放、高效的混合专家Mamba-Transformer模型用于智能体推理

NVIDIA, :, Aaron Blakeman, Aaron Thomas, Aastha Jhunjhunwala, Abhibha Gupta, Abhinav Khattar, Adam Rajfer, Adi Renduchintala, Adil Asif, Aditya Vavre, Adriana Flores Miranda, Ahmad Bilal, Aileen Zaman, Ajay Hotchandani, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Alex Gronskiy, Alex Kondratenko, Alex Steiner, Alex Ye, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alice Gatti, Alisa Liu, Alok Kumar, Amar Phanishayee, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Anahita Bhiwandiwalla, Ananth Subramaniam, Andrea Santilli, Andrew Fulks, Andrew McHarg, Andrew Tao, Andrii Skliar, Anjulie Agrusa, Ankur Srivastava, Ankur Verma, Anna Shors, Anna Warno, Antoni-Joan Solergibert I Llaquet, Arham Mehta, Arkadiusz Nowaczynski, Arti Jain, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asma Kuriparambil Thekkumpate, Atefeh Sohrabizadeh, Avinash Kaur, Avinash Vem, Ayush Dattagupta, Barath Subramaniam Anandan, Bardiya Sadeghi, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bill Thiede, Bita Darvish Rouhani, Bo Deng, Bob Schatz, Boris Ginsburg, Boxin Wang, Brad Nemire, Brandon Norick, Brian Dang, Brian Westphal, Brian Yu, Brucek Khailany, Bryan Catanzaro, Carlo del Mundo, Caryln Aarish, Chankyu Lee, Chantal Hwang, Charbel Sakr, Charles Wang, Charlie Truong, Chen Cui, Cheng Cheng, Cheng-Ping Hsieh, Chenghao Zhang, Chenhui Deng, Chintan Patel, Chris Alexiuk, Christian Cosgrove, Christian Munley, Christine Harvey, Christopher Parisien, Chunyang Shen, Coco Li, Collin Neale, Cynthia Gao, Cyril Meurillon, Dan Gil, Dan Su, Dan Zhao, Dane Corneil, Daniel Afrimi, Daniel Egert, Daniel Korzekwa, Daniel Lo, Daniel Machlab, Daniel Serebrenik, Daniil Sorokin, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, David Yu, Davit Karamyan, Deena Donia, Deep Debroy, Deepak Narayanan, Devin O'Kelly, Dheeraj Peri, Dhruv Nathawani, Di, Wu, Dima Rekesh, Divyanshu Kakwani, Donald Plummer, Dong Anh, Dongfeng Yu, Dongfu Jiang, Donnie Kim, Dorrin Poorkay, Duncan Riach, Dusan Stosic, Dustin VanStee, Eavan Meng, Edgar Minasyan, Edward Lin, Eileen Margaret Peters Long, Elad Sarafin, Elad Segal, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Pham-Hung, Eric Tramel, Eric Yang, Erick Galinkin, Erik Pounds, Erika Goncalves Goncalves, Evan Briones, Evan Wu, Evelina Bakhturina, Evgeny Tsykunov, Ewa Dobrowolska, Faisal Ladhak, Farzan Memarian, Fay Wang, Fei Jia, Felipe Soares, Felipe Vieira Frujeri, Feng Chen, Fengguang Lin, Ferenc Galko, Frank Sun, Frankie Siino, Frida Hou, Gal Hubara Agam, Gal Kaplun, Gantavya Bhatt, Gargi Prasad, Garvit Kulshreshtha, George Armstrong, Gerald Shen, Giulio Borghesi, Gordana Neskovic, Gorkem Batmaz, Grace Lam, Greg Mason, Greg Pauloski, Grigor Nalbandyan, Grzegorz Chlebus, Grzegorz Karch, Guan-Ting Liu, Guoming Zhang, Guyue Huang, Haggai Maron, Haifeng Qian, Haim Elisha, Haoxing Ren, Haran Kumar Shiv Kumar, Haribhau Hud, Harris Nover, Harrison Saturley Hall, Hayate Iso, Helen Ngo, Herbert Hum, Herman Sahota, Hexin Wang, Himanshu Soni, Hovhannes Tamoyan, Hua Li, Huanhuan Chen, Hui Li, Hui Wang, Huy Nguyen, Ian Chiles, Ido Galil, Ido Shahaf, Igor Gitman, Igor Shovkun, Ilya Loshchilov, Ingo Guehring, Itamar Schen, Itay Levy, Itay Neeman, Ivan Moshkov, Izik Golan, Izzy Putterman, Jaemin Choi, Jakub Slowikowski, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jatin Mitra, Jeffrey Glick, Jenny Chen, Jesse Oliver, Jiacheng Xu, Jiafan Zhu, Jialin Song, Jian Zhang, Jiantao Jiao, Jiaqi Zeng, Jie Lou, Jim King, Jimmy Zhang, Jingquan Wang, Jinhang Choi, Jinju Chu, Joey Conway, Joey Guman, Johan Jatko, Johannes Rausch, John Kamalu, John Roberts, Johnny Greco, Johnny Mensel, Jonah Alben, Jonas Yang, Jonathan Cohen, Jonathan Raiman, Joseph Jennings, Joshua Mabry, Joshua Pierce, Joyjit Daw, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kajal Jain, Kan Zhu, Kari Briski, Katherine Cheung, Katherine Luna, Keith Willowhawk, Keith Wyss, Keshav Santhanam, Kevin Shih, Kezhi Kong, Khanh Nguyen, Khushi Bhardwaj, Kirthi Shankar Sivamani, Konstantinos Krommydas, Krishna C. Puvvada, Krzysztof Pawelec, Kumar Anik, Kyle Keprios, Kylie Day, Lawrence McAfee, Leo Du, Leon Derczynski, Li Ding, Linda Liu, Lingjie Wu, Lior Kadoch, Lizzie Wei, Luis Vega, Luke Robison, Lun Su, Maarten Van Segbroeck, Maciej Jakub Mikulski, Maer Rodrigues de Melo, Magda Sypula, Mahan Fathi, Makesh Narsimhan Sreedhar, Makesh Tarun Chandran, Manoj Kilaru, Maor Ashkenazi, Marc Cuevas, Marc Romeijn, Marcin Chochowski, Mark Cai, Mark Mozolewski, Markus Kliegl, Marta Stepniewska-Dziubinska, Martyna Patelka, Mattei Machczynski, Matvei Novikov, Mauricio Ferrato, Maximilian Golub, Mehrzad Samadi, Melissa Corpuz, Mengru Wang, Mengxi Wu, Meredith Price, Meriem Boubdir, Micah Schaffer, Michael Andersch, Michael Boone, Michael Gschwind, Michael Lightstone, Michael Loh, Michal Bien, Michal Zawalski, Michelle Gill, Miguel Martinez, Mikail Khona, Mike Chrzanowski, Mike Houston, Mingyuan Ma, Minseok Lee, Mohamed Fawzy, Mohammad Dabbah, Mohammad Shoeybi, Mostofa Patwary, Nabin Mulepati, Najeeb Nabwani, Namit Dhameja, Narimane Hennouni, Natalie Hereth, Nathaniel Pinckney, Nave Algarici, Nave Assaf, Netanel Haber, Nicholas Knight, Nick Reamaroon, Nickson Quak, Nidhi Bhatia, Nikhil Desai, Nikolai Ludwig, Nima Tajbakhsh, Ning Xu, Nir Ailon, Nirmal Juluru, Nitin Nitin, Ofri Masad, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Olivia Viessmann, Olivier Delalleau, Oluwatobi Olabiyi, Omer Ullman Argov, Omri Puny, Oren Tropp, Pablo Ribalta, Pallab Bhattacharya, Panos Lampropoulos, Parth Mannan, Pasha Shamis, Patrick Legresley, Paul Gibbons, Pavlo Molchanov, Pawel Morkisz, Peter Dykas, Peter Jin, Pierre-Yves Aquilanti, Pinky Xu, Piotr Januszewski, Piotr Laskiewicz, Pooya Jannaty, Prakash Gurumurthy, Pranav Prashant Thombre, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Puhui Meng, Qiyu Wan, Rabeeh Karimi Mahabadi, Rachel Oberman, Rachit Garg, Radha Sri-Tharan, Rahul Kandu, Rakshit Sanadhya, Ran El-Yaniv, Ran Zilberstein, Rasoul Shafipour, Ray Macalisang, Rayen Tian, Reka Kovacs, Renjie Pi, Rick Izzo, Rima Shahbazyan, Rishabh Garg, Rishi Puri, Rita Fernandes Neves, Ritchie Zhao, Ritika Borkar, Ritu Gala, Riyad Islam, Robert Clark, Robert Hesse, Robert Kirby, Roger Waleffe, Rohit Watve, Roi Koren, Ron Banner, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Ryan Stewart, Ryota Egashira, Sadegh Mahdavi, Saee Paliwal, Sagar Singh, Sahil Modi, Salika Dave, Samantha Shinagawa, Samuel Kriman, Sandip Bhaskar, Sangkug Lym, Sanjay Kariyappa, Sanjeev Satheesh, Saran Vikas Murari, Satish Pasumarthi, Saurabh Mishra, Saurav Muralidharan, Scott Hara, Sean Narentharen, Selvaraj Anandaraj, Seonjin Na, Seonmeyong Bak, Seonmyeong Bak, Sepehr Sameni, Seph Mard, Serge Panev, Seth Henneman, Seth Poulos, Shahar Mor, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Sharon Mendelson, Shaun Kotek, Shawn Wang, Shay Aharon, Shaya Gharghabi, Sheng-Chieh Lin, Shi Chen, Shiqing Fan, Shirish Baskaran, Shreya Gopa, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shuoyang Ding, Shwetha Krishnamurthy, Siddharth Singh, Simeng Sun, Sirshak Das, Sivakumar Arayandi Thottakara, Smita Ithape, Somshubra Majumdar, Soumye Singhal, Sri Harsha Singudasu, Sridhar Bhuvanapalli, Srimukh Veccham, Stas Sergienko, Stefania Alborghetti, Stephen Ge, Su Rong, Sugam Dipak Devare, Sukrit Rao, Sumeet Kumar Barua, Sungsoo Ha, Sunny Gai, Suriya Gunasekar, Suseella Panguluri, Suyog Gupta, Sviataslau Hinzburh, Sweta Priyadarshi, Syeda Nahida Akter, Talor Abramovich, Tan Bui, Tanay Varshney, Tatevik Ter-Hovhannisyan, Teodor-Dumitru Ene, Terry Kong, Thanh Do, Tianhe Zhang, Tiffany Moore, Tijmen Blankevoort, Tim Moon, Tiyasa Mitra, Tom Balough, Tomasz Grzegorzek, Tomasz Hliwiak, Tomer Asida, Tomer Bar Natan, Tomer Keren, Tomer Ronen, Tony Salim, Tony Wang, Traian Rebedea, Tugrul Konuk, Twinkle Vashishth, Udi Karpas, Ushnish De, Vahid Noorozi, Venkat Srinivasan, Venmugil Elango, Vibhor Agrawal, Victor Cui, Vijay Korthikanti, Vikas Mehta, Vinay Rao, Virginia Wu, Vitaly Kurin, Vitaly Lavrukhin, Vladimir Anisimov, Vu Pham, Wanli Jiang, Wasi Uddin Ahmad, Wataru Ishihara, Wei Du, Wei Ping, Weiheng Chai, Wenliang Dai, Wesley Helmholz, Will Jennings, Will Zhu, Wojciech Prazuch, Xiaowei Ren, Xiwen Yu, Yan Breek, Yang Chen, Yang Yu, Yangyi Chen, Yaniv Galron, Yashaswi Karnati, Yejin Choi, Yev Meyer, Yi-Fu Wu, Yian Zhang, Ying Lin, Yonatan Geifman, Yonggan Fu, Youngeun Kwon, Yu Yao, Yugi Guvvla, Yuki Huang, Yunsheng Liu, Zach Moshe, Zachary Newell, Zhilin Wang, Zhiyu Li, Zhongbo Zhu, Zhuolin Yang, Zihan Liu, Zijie Yan, Zsolt-Alon Wertheimer

发表机构 * NVIDIA（英伟达）

AI总结提出550B总参数量、55B激活参数的混合专家Mamba-Attention语言模型Nemotron 3 Ultra，通过20T tokens预训练、1M上下文扩展及后训练，在推理吞吐量提升约6倍的同时保持与顶尖模型相当的精度。

详情

AI中文摘要

我们介绍了Nemotron 3 Ultra，一个总参数量5500亿、激活参数550亿的混合专家Mamba-Attention语言模型。我们在20万亿文本tokens上预训练了Nemotron 3 Ultra，然后将上下文长度扩展到100万tokens，并使用监督微调（SFT）、强化学习（RL）和多教师在线策略蒸馏（MOPD）进行后训练。Nemotron 3 Ultra是我们迄今为止能力最强的模型，采用了多项关键技术——LatentMoE、多token预测（MTP）、NVFP4预训练、多环境RLVR、MOPD和推理预算控制。与公开可用的最先进LLM相比，Nemotron 3 Ultra的推理吞吐量提高了约6倍，同时达到了相当的精度。最先进的精度、高推理吞吐量和100万tokens的上下文长度使Nemotron 3 Ultra成为长时间运行的自主智能体任务的理想选择。我们在HuggingFace上开源了基础、后训练和量化检查点，以及训练数据和配方。

英文摘要

We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). Nemotron 3 Ultra is our most capable model yet, employing multiple key technologies - LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput as compared to state-of-the-art publicly available LLMs while attaining on-par accuracy. The state-of-the-art accuracy, high inference throughput, and 1M token context length make Nemotron 3 Ultra ideal for long-running autonomous agentic tasks. We open-source the base, post-trained, and quantized checkpoints, along with the training data and recipe on HuggingFace.

URL PDF HTML ☆

赞 0 踩 0

2606.15079 2026-06-16 cs.CL cs.AI 交叉投稿

利用思维链监督的强化学习进行仇恨和宣传模因的可解释检测

Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam

发表机构 * Hamad Bin Khalifa University（哈马德·本·哈利法大学）； Qatar University（卡塔尔大学）

AI总结提出基于强化学习的后训练方法，结合任务特定奖励和组相对策略优化（GRPO），提升思考型多模态大语言模型在仇恨和宣传模因检测中的分类性能和解释质量。

详情

AI中文摘要

仇恨和宣传模因利用图像与文本之间的相互作用来传达有害意图，而这两种模态单独都无法揭示这种意图。尽管基于思考的多模态大语言模型（MLLMs）在视觉-语言理解方面取得了进展，但它们在模因内容审核中的应用仍未得到充分探索。我们提出了一种基于强化学习的后训练方法，通过任务特定奖励和组相对策略优化（GRPO）来提高思考型MLLMs的分类性能和基于参考的解释质量。具体来说，我们（i）对现成的MLLMs在英语和阿拉伯语基准上的仇恨和宣传模因理解进行了系统的实证研究，（ii）通过蒸馏和多LLM细粒度宣传标注，用弱监督的思维链（CoT）理由扩展了现有的模因数据集，（iii）引入了一个基于GRPO的目标函数，带有思考长度正则化，联合优化分类准确性和解释质量，以及（iv）研究基于共识伪标签的无标签模因的自监督GRPO。在Hateful Memes和ArMeme基准上的实验表明，我们的方法在FHM准确率（从79.9%提高到82.0%，提升高达2.1%）和ArMeme宏F1（从0.536提高到0.612，提升高达7.6个百分点，附带解释；与原始ArMeme基准相比提升6.1个百分点）上优于先前报告的结果，同时生成自然语言解释。在ArMeme上，序列分类基线在原始准确率方面仍然更强，而我们的方法提供了更平衡的每类性能以及解释。我们公开发布了代码、数据扩展和评估资源。

英文摘要

Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes and ArMeme benchmarks show that our approach improves over previously reported results on FHM accuracy (up to +2.1%, from 79.9% to 82.0%) and on ArMeme macro-F1 (up to +7.6 points, from 0.536 to 0.612 with explanations; +6.1 compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.

URL PDF HTML ☆

赞 0 踩 0

2606.15331 2026-06-16 cs.IR cs.AI 交叉投稿

HoloRec: Holistic Encoding and Interleaved Reasoning for Generative Recommendation

HoloRec：面向生成式推荐的整体编码与交错推理

Shuqi Zhao, Jingsong Su, Xiang Liu, Xingzhi Yao, Yiming Qiu, Huimu Wang, Liang Lin, Pengbo Mo, Mingming Li, Jiao Dai, Jizhong Han, Songlin Hu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（信息工程研究所，中国科学院）； School of Artificial Intelligence, Beijing Normal University（北京师范大学人工智能学院）； JD.com（京东公司）

AI总结提出HoloRec，通过多粒度嵌套残差量化构建层次语义编码矩阵，实现内生的思维链推理，无需外部标注，在稀疏场景下显著提升推荐准确率。

详情

AI中文摘要

将任务建模为序列生成的生成式推荐模型克服了传统级联架构的目标碎片化问题，但现有方法仍存在缺乏层次结构用于多步推理的扁平语义表示，以及需要昂贵标注且与生成目标脱节的外部构建思维链（CoT）等问题。我们提出HoloRec，一种内生的思维链推荐机制，通过多粒度嵌套残差量化构建层次语义编码矩阵，并由整体重建损失优化，统一了表示、推理和生成。HoloRec支持两种推理模式：非思考模式使用轻量级多粒度监督对齐进行快速预测，思考模式采用交错推理方案动态生成CoT步骤，将推理直接嵌入生成过程，无需外部数据。在多个公开推荐数据集上的实验表明，HoloRec持续优于基线，在稀疏场景下尤其显著，且思考模式在仅增加适度推理开销的情况下实现了比非思考模式更高的准确率。

英文摘要

Generative recommendation models that formulate the task as sequence generation overcome the objective fragmentation problem of traditional cascade architectures, yet existing approaches still suffer from flat semantic representations lacking hierarchical structure for multi-step reasoning and an externally constructed chain-of-thought (CoT) that requires expensive annotations and remains disconnected from the generation objective. We propose HoloRec, an endogenous chain-of-thought recommendation mechanism that unifies representation, reasoning, and generation by constructing a hierarchical semantic encoding matrix via multi-granularity nested residual quantization optimized by a holistic reconstruction loss. HoloRec supports two inference modes: a non-thinking mode that uses lightweight multi-granularity supervised alignment for fast prediction, and a thinking mode that employs an interleaved reasoning scheme to generate CoT steps on the fly, directly embedding reasoning into the generation process without external data. Experiments on multiple public recommendation datasets demonstrate that HoloRec consistently outperforms baselines, with especially significant gains in sparse scenarios, and the thinking mode achieves better accuracy than the non-thinking mode with only modest inference overhead.

URL PDF HTML ☆

赞 0 踩 0

2606.15405 2026-06-16 cs.CL cs.AI 交叉投稿

T-Mem: Memory That Anticipates, Not Archives

T-Mem：预测而非归档的记忆

Weidong Guo, Dakai Wang, Zixuan Wang, Hui Liu, Yu Xu

发表机构 * Tencent（腾讯）

AI总结提出T-Mem架构，通过写时触发机制覆盖描述性和关联性回忆，解决长对话中语义关联检索问题，在LoCoMo和LoCoMo-Plus上达到SOTA。

详情

AI中文摘要

长期记忆对于对话代理在扩展对话中保持连贯性、遵循多个会话前做出的承诺以及根据每个用户调整行为至关重要。然而，当前基于LLM的长期对话记忆受限于查询与存储内容（包括词汇和稠密向量）之间的相似性。当查询和记忆共享表面特征（如措辞或命名实体，我们称之为描述性）时，该方法有效。但它忽略了另一类同样有价值的案例，即查询和记忆不共享表面特征，仅通过潜在语义弧（关联性）相连。在这种机制下，现有的长期记忆系统普遍失败。覆盖这另一半使得助手首次能够主动将过去的对话作为语义资产。在记忆方面，这是认知科学中称为情景未来思维的工程对应物：预演过去的经验，以便在未来需要找到它的上下文中使用。我们将这些写时预演称为触发器。我们提出T-Mem，这是第一个覆盖描述性和关联性回忆的长期对话记忆架构。在两种证据粒度（单个事实和完整交流）上，T-Mem实例化一个描述性触发器家族和一个关联性触发器家族，使得每个记忆都能从表面相似和相关性约束的查询中访问。作为实证验证，T-Mem在LoCoMo和LoCoMo-Plus上达到了最先进水平。

英文摘要

Long-term memory is essential for conversational agents to remain coherent across extended dialogues, follow through on commitments made many sessions earlier, and adapt their behaviour to each user. Current LLM-backed long-term conversational memory, however, is reachability-bounded by the similarity between a query and stored content, both lexical and dense-vector. The approach is effective when query and memory share surface features such as wording or named entities (we call this descriptive). But it misses another, equally valuable class of cases, where query and memory do not share surface features and are tied only by a latent semantic arc (associative). On this regime prevailing long-term memory systems collectively fail. Covering this other half is what allows an assistant, for the first time, to actively draw on past dialogue as a semantic asset. On the memory side, this is the engineering counterpart of what cognitive science calls episodic future thinking: rehearsing past experience for the future contexts under which it will need to be found. We call these write-time rehearsals triggers. We propose T-Mem, the first long-term conversational memory architecture that covers both descriptive and associative recall. At each of two evidence granularities, single facts and full exchanges, T-Mem instantiates one descriptive trigger family and one associative trigger family, so that every memory remains reachable from both surface-similar and relevance-bound queries. As empirical validation, T-Mem reaches state-of-the-art on both LoCoMo and LoCoMo-Plus.

URL PDF HTML ☆

赞 0 踩 0

2606.15412 2026-06-16 cs.CL cs.AI 交叉投稿

AP-GRPO: 基于锚定门控语音对齐与策略优化的病理语音重建

Pengfei Zhang, Hoang H Nguyen, Yutong Song, Wenjun Huang, Tahmid Imtiaz Imu, Henry Peng Zou, Jiang Wu, Honghui Xu, Amir M. Rahmani

发表机构 * University of California Irvine（加州大学尔湾分校）； University of Illinois Chicago（伊利诺伊大学芝加哥分校）； Kennesaw State University（肯尼索州立大学）

AI总结针对神经退行性和神经运动障碍患者的病理语音，提出AP-GRPO框架，通过锚定门控奖励和语音对齐奖励优化语音语言模型，实现忠实重建，并揭示疾病特异性模式。

详情

AI中文摘要

来自神经退行性和神经运动障碍患者的病理语音通常在声学上失真且语言上支离破碎，因此需要病理语音重建来从失真和不完整的语音录音中恢复预期的文本内容。关键在于，此类录音很少均匀退化：一些单词或短语仍然可靠，可以作为可听锚点来重建受损的周围内容。我们引入了锚定门控语音组相对策略优化（AP-GRPO），这是一个带有语音奖励的GRPO框架，通过可听锚点保留和锚点间语音兼容性来对齐语音语言模型（SLM）与原始语音信号。AP-GRPO包括：（i）一个锚定门控奖励，用于匹配清晰区域中的可靠可听锚点；（ii）一个锚点间语音对齐奖励，用于评估恢复的内容是否在语音上得到相应受损锚点间语音片段的支持。在四种疾病条件下，AP-GRPO提高了忠实语音重建，并且学习的锚点约束自动适应每种条件，从而揭示可解释的疾病特异性特征：严重发音退化条件需要更强的锚点强制，而轻度损伤或语言障碍条件则更依赖于锚点间恢复的语音对齐。

英文摘要

Pathological speech from patients with neurodegenerative and neuromotor disorders is often acoustically distorted and linguistically fragmented, making pathological speech reconstruction necessary to recover intended textual content from distorted and incomplete speech recordings. Crucially, such recordings are rarely uniformly degraded: some words or short phrases remain reliable and can serve as audible anchors for reconstructing the corrupted surrounding content. We introduce Anchor-gated Phonetic Group Relative Policy Optimization (AP-GRPO), a GRPO framework with phonetic reward that aligns speech language models (SLMs) through audible-anchor preservation and inter-anchor phonetic compatibility to the original speech signal. AP-GRPO consists of: (i) an anchor-gated reward that matches reliable audible anchors in clear regions; and (ii) an inter-anchor phonetic alignment reward that evaluates whether recovered contents are phonetically supported by the corresponding corrupted inter-anchor speech span. Across four disease conditions, AP-GRPO improves faithful speech reconstruction, and the learned anchor constraint automatically adapts to each condition and thus reveals interpretable disease-specific profiles: conditions with severe articulatory degradation require stronger anchor enforcement, whereas milder impairment or linguistically impaired conditions rely more on phonetic alignment for inter-anchor recovery.

URL PDF HTML ☆

赞 0 踩 0

2606.15566 2026-06-16 cs.CL cs.AI 交叉投稿

LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science

科学话语中的立场检测：以贝叶斯认知科学为例的LLM辅助方法

Eyup Engin Kucuk, Tarik Kelestemur, Ömer Dağlar Tanrikulu

发表机构 * University of New Hampshire（新罕布什尔大学）； Independent Researcher（独立研究员）

AI总结提出结合理论驱动编码手册、专家标注和诊断门控提示优化的方法，利用三个前沿LLM检测贝叶斯模型在科学文本中的现实主义/工具主义立场，在210篇文章的6858条引文中达到0.78的联合信度。

Comments 9 pages, 4 figures; Code and data: https://github.com/EyupEK/autoresearch_bayes

详情

AI中文摘要

定性编码是社会科学的核心，但专家标注难以规模化。LLM提供了一种可能的扩展，但当目标构念是解释性的、理论负载的且仅间接表达时，需要仔细验证。我们在一个困难案例中研究这个问题：检测作者是将贝叶斯模型视为心理和神经机制的描述（现实主义）还是有用的数学工具（工具主义）。我们的方法结合了理论驱动的编码手册、专家编码的参考标注、诊断门控提示优化搜索（为三个前沿LLM：GPT-5.1、Claude Sonnet 4.6、Gemini 3 Pro Preview生成共享的零样本提示）以及多评估者信度分析。最终提示在保留样本上实现了0.76的综合信度分数（ICC=0.79和α=0.74的调和平均数），所有诊断均满足。在来自210篇文章的6858条引文上部署后，三个LLM达到了显著的引文级一致性（ICC=0.80；α=0.76；综合=0.78）和近乎完美的文章级排名稳定性（评估者对之间r=0.96-0.97）。语料库总体偏向弱现实主义，但文章级立场很少一致：仅1.4%的文章使用单一波段，而59.5%的文章跨越四个或更多波段。低层感知/运动文章比高层认知文章高出8.8个现实主义点（p<.001，d=0.60），量化了长期持有的定性直觉。我们将其作为专家主导的案例研究呈现；该框架旨在推广到类似的理论密集型任务，而非所有定性分析。

英文摘要

Qualitative coding is central to social science, but expert annotation is difficult to scale. LLMs offer a possible extension, yet require careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. We study this problem in a difficult case: detecting whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). Our method combines a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search yielding a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt achieved a held-out combined reliability score of 0.76 (harmonic mean of ICC = 0.79 and $α$ = 0.74), with all diagnostics satisfied. Deployed on 6,858 quotes from 210 articles, the three LLMs reached substantial quote-level agreement (ICC = 0.80; $α$ = 0.76; combined = 0.78) and near-perfect article-level rank stability ($r$ = 0.96-0.97 across rater pairs). The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single band, while 59.5% spanned four or more. Low-level perception/motor articles scored 8.8 Realism points higher than high-level cognition articles ($p < .001$, $d = 0.60$), quantifying a long-held qualitative intuition. We present this as an expert-led case study; the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis.

URL PDF HTML ☆

赞 0 踩 0

2606.15694 2026-06-16 cs.MM cs.AI cs.CV cs.LG 交叉投稿

MAF: Multimodal Adaptive Few-shot Prompting for Sentiment Analysis with MLLMs

MAF: 面向情感分析的多模态自适应少样本提示方法

Hangling Xie

发表机构 * Nanjing University of Posts and Telecommunications（南京邮电大学）

AI总结提出MAF框架，通过动态检索与查询相关的多模态示例，利用轻量级系数生成网络实时融合多模态相似度，结合多数投票提升MLLM在情感分析中的性能。

详情

AI中文摘要

多模态大语言模型（MLLMs）在理解复杂多模态内容方面展现了卓越的能力。然而，它们在情感分析中的性能对提示设计高度敏感，导致静态、统一应用的提示本质上无法捕捉不同输入中变化的细微多模态线索。为了解决这一局限性，我们提出了一种多模态自适应少样本提示（MAF）框架，该框架动态检索并整合与查询相关的示例，以上下文敏感的方式激发MLLM的情感推理能力。MAF构建了一个示例检索模块，整体编码面部表情、场景上下文和文本语义，并引入唇部运动幅度检测机制以在多人物场景中准确识别说话者。与传统的固定权重融合不同，我们训练了一个轻量级系数生成网络，实时输出查询条件的融合权重，从而实现多模态相似度分数的加权聚合，以检索最具信息量的前K个示例。通过MLLM生成的多个候选输出进行多数投票，进一步增强了预测稳定性。在公开基准数据集上的大量实验表明，MAF相比相应的骨干变体取得了显著且一致的性能提升，并与强大的多模态情感分析基线保持竞争力。

英文摘要

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in understanding complex multimodal content. However, their performance in sentiment analysis exhibits acute sensitivity to prompt design, rendering static, uniformly applied prompts inherently suboptimal for capturing the nuanced multimodal cues that vary across inputs. To address this limitation, we propose a Multimodal Adaptive Few-Shot Prompting (MAF) framework, which dynamically retrieves and integrates query-relevant demonstrations to elicit the sentiment reasoning capabilities of MLLMs in a context-sensitive manner. MAF constructs a demonstration retrieval module that holistically encodes facial expressions, scene context, and textual semantics, with a lip movement amplitude detection mechanism introduced for accurate speaker identification in multi-person scenarios. Departing from conventional fixed-weight fusion, a lightweight coefficient generation network is trained to output query-conditioned fusion weights in real time, enabling weighted aggregation of multimodal similarity scores to retrieve the top-K most informative demonstrations. Prediction stability is further enhanced through majority voting over multiple candidate outputs generated by the MLLM. Extensive experiments on public benchmark datasets demonstrate that MAF achieves substantial and consistent performance improvements over the corresponding backbone variants and remains competitive with strong multimodal sentiment-analysis baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.15733 2026-06-16 cs.CL cs.AI 交叉投稿

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

Vernier: 探测因果推理中词汇间隙背后的表征错位

Zhenyu Yu

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University（复旦大学计算机科学与人工智能学院）

AI总结通过配对视图权重更新和激活修补，发现语言模型在因果推理中因变量名替换导致的答案差异源于表征错位而非信息丢失，并在Qwen和Llama模型上验证了反事实增强的对齐效果。

详情

AI中文摘要

指令微调的语言模型在将其英文变量名替换为类型保留的占位符后，可能会对相同的因果推理问题给出不同的答案，尽管结构因果模型和正确答案未变。我们探究这种词汇间隙是否反映了占位符视图中的信息丢失，或是从仍携带答案相关内容的表征中读取时的错位。Vernier 使用配对视图权重更新作为工具，然后检查间隙闭合后留下的机制。在工作状态下，证据支持表征错位。变量名探针在占位符视图上变得更准确，对 Qwen-7B、Qwen-14B 和 Llama-3.1-8B 的激活修补表明，决策令牌表征可以在视图间传递答案身份。重新对齐视图的更新是对原始提示和占位符提示的反事实增强，而答案子空间 KL 主要增强了中间答案信念的一致性。成功受限于模型家族、规模和任务。CRASS 转移在 Qwen 规模和 Llama 上可靠，e-CARE 仍然较弱，初步的非因果重命名任务显示出类似的定性模式。

英文摘要

Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are unchanged. We ask whether this lexical gap reflects information loss in the placeholder view or a misaligned read-out from a representation that still carries answer-relevant content. Vernier uses a paired-view weight update as an instrument and then inspects the mechanism left after the gap closes. In the working regimes, the evidence favours representational misalignment. A variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The update that realigns the views is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL mainly sharpens intermediate answer-belief agreement. Success is bounded by model family, scale, and task. CRASS transfer is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.

URL PDF HTML ☆

赞 0 踩 0

2606.15741 2026-06-16 cs.CL cs.AI 交叉投稿

A Self Consistency Based Reranking for Narrative Question Answering

基于自一致性的叙事问答重排序

Molham Mohamed, Ali Hamdi

发表机构 * GitHub

AI总结提出自一致性重排序框架，通过生成多个候选答案并基于语义一致性选择最终答案，提升叙事问答的鲁棒性和准确性。

详情

AI中文摘要

叙事问答（NQA）是自然语言处理中一项具有挑战性的任务，要求模型理解长文本上下文、捕捉事件间关系并生成连贯的响应。尽管预训练语言模型近期取得了进展，但大多数现有方法在推理时依赖单一解码输出，使其对生成变异性敏感，常导致答案不完整或不一致。为解决这一局限，我们提出了一种基于自一致性的自集成重排序框架用于叙事问答。该方法为每个故事-问题对生成多个候选答案，并根据生成响应间的语义一致性选择最终答案。这使得模型能够探索多样化的答案表述，同时通过基于共识的选择提高鲁棒性，而无需修改底层架构。该框架将预训练和微调的语言生成与多答案推理及基于相似度的重排序相结合。我们在NarrativeQA数据集上使用多种模型（包括FLAN-T5 Base和Small以及Pegasus-Large）在基线和微调设置下评估了所提方法。实验结果表明，该方法在所有模型上均持续提升了性能。特别是，FLAN-T5-Base在结合自集成推理后，性能从82.32%提升至86.66%（+4.34%），取得了最佳整体性能。此外，Pegasus-Large的提升最大，从72.50%提升至87.07%（+14.57%），凸显了所提策略的有效性。

英文摘要

Narrative question answering (NQA) is a challenging task in natural language processing that requires models to understand long textual contexts, capture relationships across events, and generate coherent responses. Despite recent advances in pretrained language models, most existing approaches rely on a single decoding output during inference, making them sensitive to generation variability and often resulting in incomplete or inconsistent answers .To address this limitation, we propose a self-ensemble Self-Consistency-Based reranking framework for narrative question answering. The proposed method generates multiple candidate answers for each story-question pair and selects the final answer based on semantic agreement among the generated responses. This allows the model to explore diverse answer formulations while improving robustness through consensus-based selection without requiring modifications to the underlying architecture .The framework combines pretrained and fine-tuned language generation with multi-answer inference and similarity-based reranking. We evaluate the proposed approach on the NarrativeQA dataset using multiple models, including FLAN-T5 (Base and Small) and Pegasus-Large, under both baseline and fine-tuned settings .Experimental results demonstrate that the proposed method consistently improves performance across all models. In particular, FLAN-T5-Base achieves the best overall performance, improving from 82.32% to 86.66% (+4.34%) when combined with self-ensemble inference. Additionally, the largest improvement is observed with Pegasus-Large, which increases from 72.50% to 87.07% (+14.57%), highlighting the effectiveness of the proposed strategy.

URL PDF HTML ☆

赞 0 踩 0

2606.15778 2026-06-16 cs.CL cs.AI cs.LG cs.SI 交叉投稿

DYNA : Dynamic Episodic Memory Networks for Augmenting Large Language Models with Temporal Knowledge Graphs in Continuous Learning

DYNA：用于在持续学习中通过时间知识图谱增强大语言模型的动态情景记忆网络

Ali Sarabadani, Mahtab Tajvidiyan

发表机构 * Department of Computer Engineering and Information Technology, University of Qom（卡姆大学计算机工程与信息科技系）

AI总结提出DYNA框架，通过时间知识图谱作为外部可更新记忆，增强冻结的大语言模型，在三个时间召回任务上减少约7%的灾难性遗忘并提升约5%的时间排序能力。

2606.15819 2026-06-16 cs.CV cs.AI 交叉投稿

MAGE-RAG：面向长文档问答的多粒度自适应图证据多模态RAG

Yilong Zuo, Xunkai Li, Jing Yuan, Qiangqiang Dai, Hongchao Qin, Ronghua Li

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出MAGE-RAG框架，通过离线构建包含页面和元素节点的证据图，在线自适应构建证据子图，平衡证据覆盖与噪声控制，在长文档多模态问答中取得最优性能。

详情

AI中文摘要

长文档多模态问答要求系统在长PDF中定位稀疏证据，并整合来自文本、表格、图像、图表和复杂布局的线索。现有RAG方法大多依赖于文本块或页面的固定Top-k检索。文本检索可以压缩上下文，但往往丢失视觉和布局信息；页面级视觉检索保留原始页面，但也会将大量无关区域送入阅读器，导致证据覆盖、噪声和推理成本之间的静态权衡。本文提出MAGE-RAG，一种用于长文档多模态问答的多粒度自适应图证据框架。MAGE-RAG以页面检索作为查询时证据构建的入口。离线阶段，它构建一个包含页面节点和元素节点的证据图，编码包含关系、阅读顺序、布局邻接、章节层次和语义邻居关系。查询时，在线证据控制器在显式预算下迭代地激活、打开、搜索和剪枝证据。生成的证据子图随后被渲染为结构化的多模态阅读器输入，使LVLM能够在有限上下文中消费紧凑且相关的证据。在LongDocURL和MMLongBench-Doc上，我们建立了统一的比较和分析协议，涵盖直接MLLM、文本RAG、页面级视觉RAG和图/智能体RAG。实验表明，MAGE-RAG在LongDocURL上达到52.75的整体准确率，在MMLongBench-Doc上达到53.26的准确率和51.19的F1。细粒度分解、预算-性能曲线、消融和基于轨迹的分析进一步表明，查询时证据子图构建能够平衡分散证据覆盖与上下文噪声控制。我们的代码可在https://github.com/laonuo2004/MAGE-RAG.git获取。

英文摘要

Long-document multimodal question answering requires a system to locate sparse evidence in long PDFs and integrate clues from text, tables, images, charts, and complex layouts. Existing RAG methods mostly rely on fixed Top-k retrieval over text chunks or pages. Text retrieval can compress the context but often loses visual and layout information; page-level visual retrieval preserves the original page, yet it also sends large irrelevant regions to the reader, leading to a static trade-off among evidence coverage, noise, and inference cost. This paper proposes MAGE-RAG, a multigranular adaptive graph evidence framework for long-document multimodal QA. MAGE-RAG uses page retrieval as the entry point for query-time evidence construction. Offline, it builds an evidence graph with page nodes and element nodes, encoding containment, reading order, layout adjacency, section hierarchy, and semantic-neighbor relations. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets. The resulting evidence subgraph is then rendered into structured multimodal reader input, allowing the LVLM to consume compact and relevant evidence within a limited context. On LongDocURL and MMLongBench-Doc, we establish a unified comparison and analysis protocol covering Direct MLLM, Text RAG, Page-level Visual RAG, and Graph/Agentic RAG. Experiments show that MAGE-RAG achieves 52.75 overall accuracy on LongDocURL, and 53.26 accuracy with 51.19 F1 on MMLongBench-Doc. Fine-grained breakdowns, budget-performance curves, ablations, and trace-based analysis further show that query-time evidence subgraph construction can balance dispersed evidence coverage with context-noise control. Our code is available at https://github.com/laonuo2004/MAGE-RAG.git.

URL PDF HTML ☆

赞 0 踩 0

2606.15972 2026-06-16 cs.CL cs.AI cs.LG 交叉投稿

Formalize Once, Edit the Rest: Efficient Lean-Based Answer Selection for Math Reasoning

一次形式化，其余编辑：基于Lean的高效数学推理答案选择

Ji Feng, Zhouxing Shi

发表机构 * University of California, Riverside（加州大学河滨分校）

AI总结提出BASE流水线，通过形式化一个候选答案并编辑其余答案，减少自动形式化调用约5倍，同时提升选择准确性。

Comments 15 pages, 1 figure. Code available at https://github.com/ucr-rai/base-and-edit

详情

AI中文摘要

随着大型语言模型（LLMs）越来越多地应用于数学推理，形式化证明助手（如Lean）可用于以机器可检查的严谨性验证推理输出，从而支持在测试时扩展中从K个采样候选答案中进行答案选择等用例。然而，使用Lean要求LLM的输出（最初为自然语言）首先被形式化。现有的基于Lean的答案选择工作使用自动形式化模型为每个候选答案独立生成一个Lean形式化语句，这带来了显著的计算成本。我们提出BASE，一个基础-编辑流水线，它为每个问题形式化一个基础候选答案，并通过就地编辑答案表达式来推导出其余K-1个语句。为此，我们训练了一个重写器模型LEANSCRIBE，用于定位基础形式化中的答案，并为其他K-1个候选答案生成可重用的编辑函数。BASE同时提高了选择准确性并降低了形式化成本——这是一个帕累托改进，在四个基准测试和三个求解器上的所有12个（数据集，求解器）配置中均成立，在K=8时自动形式化器调用减少约5倍，且随着K增长，减少幅度预计会更大。代码可在https://github.com/ucr-rai/base-and-edit获取。

英文摘要

With large language models (LLMs) increasingly applied to mathematical reasoning, formal proof assistants such as Lean can be leveraged to verify reasoning outputs with machine-checkable rigor, enabling use cases such as answer selection in test-time scaling with K sampled candidate answers. However, employing Lean requires that LLM outputs, originally in natural language, first be formalized. Existing Lean-based answer-selection work uses an autoformalization model to generate a formal statement in Lean for each candidate answer independently, incurring a significant computational cost. We propose BASE, a base-and-edit pipeline that formalizes a single base candidate per problem and derives the remaining K-1 statements by editing the answer expression in place. To facilitate this, we train a rewriter model LEANSCRIBE to localize the answer in the base formalization and generate a reusable edit function for the other K-1 candidates. BASE simultaneously improves selection accuracy and reduces formalization cost - a Pareto improvement that holds on all 12 (dataset, solver) configurations across four benchmarks and three solvers, cutting autoformalizer calls by about 5x at K=8, with the reduction expected to become larger as K grows. Code is available at https://github.com/ucr-rai/base-and-edit.

URL PDF HTML ☆

赞 0 踩 0

2606.15998 2026-06-16 cs.IR cs.AI cs.CL cs.LG 交叉投稿

Entity Labels Are Not Entity Signals: A Framework for Observable Relevance in Document Re-Ranking

实体标签并非实体信号：文档重排序中可观测相关性的框架

Utshab Kumar Ghosh, Shubham Chatterjee

发表机构 * Department of Computer Science, Missouri University of Science and Technology（计算机科学系，密苏里科技大学）

AI总结提出实体可观测相关性（OER）与概念相关性（CER）的区分，证明CER监督效果差，而OER对齐可显著提升重排序性能。

Comments ICTIR '26

详情

DOI: 10.1145/3805713.3820411
Journal ref: Proceedings of the 2026 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR)

AI中文摘要

实体感知的文档检索使用与查询关联的实体作为排序信号，假设语义相关的实体也是有用的检索信号。我们证明这一假设是不充分的，并解释原因。与作为真实观测的词项不同，实体链接是由不完美的链接器产生的假设：如果链接器在相关和非相关文档中无差别地触发，那么一个实体可能在主题上重要，却不提供任何判别性信号。我们将此形式化为概念实体相关性（CER）——实体是否与查询主题相关——和可观测实体相关性（OER）——其在集合中的观测出现是否能区分相关与非相关文档——之间的区别。在四个集合和包括人工实体判断的标注来源上，CER和OER表现出接近随机的吻合度（κ≈0），而OER的操作化实现吻合度较高（κ≈0.5），确认CER是系统性异常值。基于CER的监督选择主题上合理但判别性弱的实体，在某些集合上仅能过滤不到4%的非相关文档。将监督与OER对齐可将非相关文档过滤提升至10倍，并在BM25基础上将开放世界MAP提升0.051。我们的发现促使实体感知检索中从概念实体相关性向可观测实体相关性的转变。

英文摘要

Entity-aware document retrieval uses query-associated entities as ranking signals, assuming that semantically relevant entities are also useful retrieval signals. We show this assumption is insufficient- and explain why. Unlike terms, which are ground-truth observations, entity links are hypotheses produced by an imperfect linker: an entity can be topically central yet provide no discriminative signal if the linker fires indiscriminately across relevant and non-relevant documents. We formalize this as a distinction between Conceptual Entity Relevance (CER)- whether an entity is topically related to a query- and Observable Entity Relevance (OER)- whether its observed presence in a collection discriminates relevant from non-relevant documents. Across four collections and annotation sources including human entity judgments, CER and OER exhibit near-chance agreement ($κ\approx 0$), while OER operationalizations agree substantially ($κ\approx 0.5$), confirming CER as the systematic outlier. CER-based supervision selects topically plausible but weakly discriminative entities, pruning fewer than 4% of non-relevant documents on some collections. Aligning supervision with OER improves non-relevant pruning by up to 10x and open-world MAP by 0.051 over BM25. Our findings motivate a shift from conceptual to observable notions of entity relevance in entity-aware retrieval.

URL PDF HTML ☆

赞 0 踩 0

2606.16074 2026-06-16 cs.CL cs.AI 交叉投稿

PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

PVminerLLM2：通过偏好优化改进患者声音的结构化提取

Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Elyas Irankhah, Sreeraj Ramachandran, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

发表机构 * Yale School of Medicine（耶鲁大学医学院）； Yale School of Public Health（耶鲁大学公共卫生学院）； Texas State University（德克萨斯州立大学）

AI总结提出PVminerLLM2，通过偏好优化和令牌级门控稳定项、混淆感知偏好对构建等技术，解决监督微调难以处理的细粒度错误，在患者声音结构化提取任务上优于基线模型。

详情

AI中文摘要

动机：患者生成的文本包含关于患者生活经历、社会背景和护理参与的关键信息，但大多是非结构化的，限制了其在以患者为中心的结果研究中的应用。先前的工作引入了PV-Miner基准和PVMinerLLM模型用于结构化提取。然而，仅靠监督微调（SFT）难以处理罕见、细粒度且分布不均的错误，尤其是在令牌关键的结构化输出中。结果：我们提出了PVminerLLM2，一组改进的用于结构化患者声音提取的LLM，它应用偏好优化来解决监督微调无法处理的令牌级错误。我们的方法引入了（i）带有令牌级门控稳定项的偏好目标，防止在偏好优化下绝对令牌似然的退化，以及（ii）混淆感知的偏好对构建，以更好地捕捉低分离度的区分。我们进一步引入了令牌重要性加权和逆频率重加权，以解决令牌不平衡和类别偏斜问题。在多种模型规模下，PVMinerLLM2始终优于强基线，在代码、子代码和跨度上分别获得了高达4.43%、3.50%和1.55%的提升，并且优于使用现有偏好优化方法训练的基线LLM。可用性和实现：PVminerLLM2的补充材料、代码、评估脚本和训练模型公开于：https://github.com/Data-Mining-Lab-Yale/PVminerLLM2

英文摘要

Motivation: Patient-generated text contains critical information on patients' lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior work introduced the PV-Miner benchmark and PVMinerLLM models for structured extraction. However, supervised fine-tuning (SFT) alone struggles with rare, fine-grained, and unevenly distributed errors, particularly in token-critical structured outputs. Results: We present PVminerLLM2, an improved set of LLMs for structured patient voice extraction that applies preference optimization to address token-critical errors beyond the reach of supervised fine-tuning. Our method introduces (i) a preference objective with token-level gated stabilization term that prevents degradation of absolute token likelihood under preference optimization, and (ii) confusion-aware preference pair construction to better capture low-separation distinctions. We further incorporate token-importance weighting and inverse-frequency reweighing to address token imbalance and class skew. Across multiple model sizes, PVMinerLLM2 consistently outperforms strong baselines, achieving gains of up to 4.43% (Code), 3.50% (Sub-code), and 1.55% (Span), and outperforms baseline LLM trained with existing preference optimization methods. Availability and Implementation: The supplementary material, code, evaluation scripts, and trained models for PVminerLLM2 are publicly available at: https://github.com/Data-Mining-Lab-Yale/PVminerLLM2

URL PDF HTML ☆

赞 0 踩 0

2606.16082 2026-06-16 cs.CV cs.AI 交叉投稿

Tool-IQA: Augmenting Image Quality Assessment with Simple Tools

Tool-IQA: 利用简单工具增强图像质量评估

Guanyi Qin, Junjie Zhang, Chunming He, Yibing Fu, Jie Liang, Tianhe Wu, Lei Zhang

发表机构 * National University of Singapore（新加坡国立大学）； OPPO Research Institute（OPPO研究院）； Nanyang Technical University（南洋理工大学）； Duke University（杜克大学）； City University of Hong Kong（香港城市大学）； The Hong Kong Polytechnic University（香港理工大学）

AI总结提出Tool-IQA，通过为视觉语言模型配备放大镜和伽马校正器等简单工具，将被动评分转变为工具增强的工作流程，显著提升图像质量评估性能。

详情

AI中文摘要

视觉语言模型（VLM）越来越多地被用于图像质量评估（IQA）。然而，当前方法通常采用静态的一次性评分范式，而人类通过动态视觉检查（例如，选择性调整视图以验证细节和细微伪影）来评估图像质量。具体来说，仅依赖单次观察存在两个主要限制：首先，仅在全局尺度上感知图像限制了对更精细局部细节的评估；其次，图像的原始强度分布可能压倒可见性，导致对图像质量的检查不足。为了解决这些问题，我们提出了Tool-IQA，将评估机制从被动评分转变为工具增强的工作流程。特别地，我们为VLM配备了简单而有效的视图工具：用于检查局部细节的放大镜，以及用于揭示可见性和隐藏伪影的伽马校正器。评估遵循一个结构化的流程，包括带有评分标准的初始观察、工具增强的深入检查以及最终校准质量分数的量化。此外，为了确保高效且有目的地调用工具，我们引入了一种批量感知的训练策略，以奖励能够产生积极贡献的工具交互，而不仅仅是鼓励使用。在各种IQA基准上的实验表明，通过有效的工具调用和校准评估，我们提出的Tool-IQA显著优于现有最先进的模型，例如，在具有挑战性的CLIVE数据集上实现了0.854的PLCC。

英文摘要

Vision-Language Models (VLMs) have been increasingly adopted for Image Quality Assessment (IQA). However, current methods typically employ a static one-shot scoring paradigm, despite the fact that humans assess image quality through dynamic visual inspection, e.g., selectively adjusting views to verify details and subtle artifacts. Specifically, relying solely on a single-pass observation introduces two primary limitations: first, perceiving the image only at a global scale restricts the assessment of finer local details; second, the original intensity distribution of the image may overwhelm the visibility, leading to insufficient inspection of image quality. To address these issues, we propose Tool-IQA, shifting the assessment mechanism from passive scoring to a tool-augmented workflow. In particular, we equip VLMs with simple yet effective view tools: a Magnifier to inspect local details, and a Gamma Corrector to uncover visibility and hidden artifacts. The assessment follows a structured pipeline that consists of an initial observation with rubric notes, a tool-augmented in-depth inspection, and a final quantification for calibrated quality score. Furthermore, to ensure efficient and purposeful tool callings, we introduce a batch-aware training strategy to reward tool interactions that can yield positive contributions rather than simply encouraging usage. Experiments on a variety of IQA benchmarks demonstrate that, with effective tool calling and calibrated assessment, our proposed Tool-IQA significantly outperforms existing state-of-the-art models, e.g., it achieves a PLCC of 0.854 on the challenging CLIVE dataset.

URL PDF HTML ☆

赞 0 踩 0

2606.16281 2026-06-16 cs.CL cs.AI 交叉投稿

Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models

现在谁应该主导解码？跟踪可靠轨迹以集成掩码扩散语言模型

Heecheol Yun, Joonhyung Park, Joowon Kim, Eunho Yang

发表机构 * KAIST（韩国科学技术院）； AITRICS

AI总结针对掩码扩散语言模型集成问题，提出TIE框架，通过跟踪答案相关位置的置信度动态，迭代识别并传递可靠解码轨迹，实现多模型协同生成。

Comments preprint

详情

AI中文摘要

掩码扩散语言模型（MDLM）已成为序列生成的一种独特范式。随着MDLM在能力和知识覆盖范围上变得多样化，一个重要问题是如何结合它们的知识。为此，我们首先研究了MDLM独特的解码动态。我们发现，成功的生成在答案相关位置上表现出稳定的置信度动态，而不可靠的轨迹通常可以通过注入来自其他模型的有希望的中间状态来纠正。受此观察启发，我们提出了$\textbf{TIE}$（基于轨迹的迭代集成），这是一个知识融合框架，其中MDLM迭代地识别可靠的解码轨迹并在模型之间传递它们。TIE跟踪答案相关位置上的置信度动态，以确定哪个模型当前遵循更可靠的轨迹，并选择性地跨模型传递部分去噪的序列。由于处于更有希望轨迹上的模型在去噪步骤中经常变化，TIE允许不同模型在生成的不同阶段贡献互补的优势。在多种推理任务上的强劲表现以及我们的分析表明，TIE为MDLM集成这一尚未充分探索的问题提供了一种实用方法。

英文摘要

Masked Diffusion Language Models (MDLMs) have emerged as a distinct paradigm for sequence generation. As MDLMs become diverse in capabilities and knowledge coverage, an important question is how to combine their knowledge. Toward this, we first investigate the unique decoding dynamics of MDLMs. We find that successful generations exhibit stable confidence dynamics over answer-relevant positions, while unreliable trajectories can often be corrected by injecting promising intermediate states from other models. Guided by this observation, we propose $\textbf{TIE}$ ($\textbf{T}$rajectory-based $\textbf{I}$terative $\textbf{E}$nsembling), a knowledge fusion framework in which MDLMs iteratively identify reliable decoding trajectories and relay them across models. TIE tracks confidence dynamics over answer-relevant positions to determine which model currently follows a more reliable trajectory and selectively transfers partially denoised sequences across models. As the model on the more promising trajectory often changes across denoising steps, TIE allows different models to contribute complementary strengths at different stages of generation. Strong performance across diverse reasoning tasks, along with our analyses, suggests that TIE offers a practical approach to the underexplored problem of MDLM ensembling.

URL PDF HTML ☆

赞 0 踩 0

2606.16353 2026-06-16 cs.CV cs.AI 交叉投稿

What Should a Streaming Video Model Remember?

流式视频模型应该记住什么？

Haonan Ge, Yiwei Wang, Hang Wu, Yujun Cai

发表机构 * University of California, Santa Barbara（加州大学圣塔芭芭拉分校）； University of California, Merced（加州大学默塞德分校）； The University of Queensland（昆士兰大学）

AI总结针对流式视频理解中固定记忆预算下的长程历史利用问题，提出选择性潜在记忆框架SelectStream，通过惊喜驱动自适应窗口、优先级保持合并和查询条件图推理三个机制，实现高效在线推理，在多个基准上取得领先性能。

详情

AI中文摘要

流式视频理解模型必须在持续流中的任意时刻回答查询，仅使用到目前为止观察到的内容，并在固定的记忆和计算预算下工作。现有方法通过添加记忆库、检索模块或视觉令牌压缩来保存长程历史。然而，强近期窗口基线表明，不加区分地注入历史可能会稀释当前场景感知，这表明关键挑战不在于是否使用记忆，而在于如何选择性分配记忆。我们将此形式化为预算在线潜在证据分配，并提出\textbf{SelectStream}，一个选择性潜在记忆框架，该框架保持当前观察对冻结VLM直接可见，同时仅通过紧凑的、查询条件的证据预算暴露历史信息。三个协调机制控制何时写入、保留什么以及如何检索：惊喜驱动的自适应窗口、优先级保持合并以及固定容量潜在记忆图上的查询条件图推理。检索到的证据被校准并作为潜在令牌注入以生成答案，无需重放帧或随着流长度增长上下文。实验结果表明，SelectStream实现了强大的在线流式性能，并保持了通用视频理解能力，在StreamingBench上达到82.67%，在OVO-Bench上达到67.03%，在离线视频基准上平均准确率达到74.4%，同时优于强近期窗口基线和先前的流式记忆方法。

英文摘要

Streaming video understanding models must answer queries at any moment during an ongoing stream, using only what they have observed so far and under fixed memory and computation budgets. Existing methods address this by adding memory banks, retrieval modules, or visual token compression to preserve long-range history. However, strong recent-window baselines show that indiscriminate history injection can dilute current-scene perception, suggesting that the key challenge is not whether to use memory, but how to allocate it selectively. We formulate this as budgeted online latent evidence allocation and propose \textbf{SelectStream}, a selective latent-memory framework that keeps the current observation directly visible to a frozen VLM while exposing historical information only through a compact, query-conditioned evidence budget. Three coordinated mechanisms govern when to write, what to preserve, and how to retrieve: surprise-driven adaptive windowing, priority-preserving consolidation, and query-conditioned graph reasoning over a fixed-capacity latent memory graph. Retrieved evidence is calibrated and injected as latent tokens for answer generation, without replaying frames or growing the context with stream length. Experimental results show that SelectStream achieves strong online streaming performance and preserves general video understanding, reaching 82.67\% on StreamingBench, 67.03\% on OVO-Bench, and 74.4\% average accuracy on offline video benchmarks, while outperforming strong recent-window baselines and prior streaming memory methods.

URL PDF HTML ☆

赞 0 踩 0

2606.16484 2026-06-16 cs.CV cs.AI cs.MM 交叉投稿

Unified Multimodal Model for Brain MRI Imputation and Understanding

统一多模态模型用于脑MRI补全与理解

Zhiyun Song, Che Liu, Tian Xia, Avinash Kori, Wenjia Bai

发表机构 * Department of Computing, Imperial College London（伦敦帝国理工学院计算机系）； Department of Brain Sciences, Imperial College London（伦敦帝国理工学院脑科学系）

AI总结提出UniBrain模型，通过统一训练策略联合处理脑MRI模态补全与图像理解，采用自对齐和动态隐藏状态机制，在多疾病数据集上实现高性能。

Comments Early accepted to MICCAI 2026

详情

AI中文摘要

多模态大语言模型（MLLMs）在医学领域具有巨大潜力，因为它们继承了LLM的知识，并允许以自然语言集成、分析和解释多种数据模态。然而，医学MLLMs面临重大挑战，特别是高质量训练数据的稀缺以及现实临床环境中数据缺失的频繁发生。在此，我们提出了一种新颖的统一多模态模型UniBrain，用于脑磁共振图像（MRI）分析。为了解决潜在的脑MRI模态缺失问题，我们采用统一训练策略进行联合成像模态补全和脑图像理解。在训练过程中，构建了交错且描述丰富的数据流，以自回归方式训练模型，从而实现基于生成的多模态数据的医学推理。引入自对齐策略，利用密集图像嵌入学习细粒度解剖特征，无需详细的图像描述。此外，我们提出了一种动态隐藏状态机制，以缓解长上下文多模态推理中的暴露偏差。在多疾病脑MRI数据集上的大量实验表明，UniBrain在模态不完全的各种情况下，在脑图像补全、理解和疾病诊断方面均取得了高性能。

英文摘要

Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLM and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in the real-world clinical setting. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance image (MRI) analysis. To address potential missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate the exposure bias during long-context multimodal inference. Extensive experiments on multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various extents of modality incompleteness.

URL PDF HTML ☆

赞 0 踩 0

2606.16568 2026-06-16 cs.CL cs.AI 交叉投稿

Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

快速判断何时，谨慎决定谁：基于扩散增强的双过程多轮对话

Rutherford A. Patamia, Ming Liu, Wei Luo, Favour Ekong, Akan Cosgun

发表机构 * Deakin University（迪肯大学）； Griffith University（格里菲斯大学）

AI总结针对多说话人对话中的轮次转换问题，提出音频两阶段流水线，先快速检测轮次边界，再轻量验证决定是否转移并预测下一说话人，扩散增强进一步改善检测性能。

详情

AI中文摘要

可靠的轮次转换对于口语对话系统至关重要。然而，现有方法大多针对双说话人交互设计，难以处理包含重叠和快速说话人切换的现实多说话人音频。我们在VoxConverse数据集上研究多说话人轮次转换，并提出一个纯音频的两阶段流水线，将何时触发轮次边界与是否实际转移话语权分开。一个快速触发器扫描音频并提出候选的结束轮次时间，而一个轻量验证器仅在这些时间运行，以决定\textsc{Hold}或\textsc{Shift}，并支持下一说话人预测。我们报告了完整多说话人设置下的结果，以及为可比性而控制的二元顶2投影结果。我们还研究了基于扩散的、保留标签的背景音频混合作为数据增强策略。结果显示，与基线相比，转移检测有所改善，扩散增强进一步提升了性能。

Gen-VCoT: 基于扩散的RGB中间表示的生成式视觉思维链推理

Zhiqiang Zhou, Junliang Dai, Xu ling

发表机构 * Hunan Chemical Industry Vocational and Technical College（湖南化工职业技术学院）

AI总结提出Gen-VCoT框架，利用专家视觉模型生成RGB图像作为推理中间步骤，通过自适应路由器选择推理深度，在空间和深度问题上分别提升25%和50%，但简单事实查询性能下降，表明最优表示依赖于任务。

Comments 12 pages, 5 figures

2606.16845 2026-06-16 cs.CL cs.AI 交叉投稿

Robust Dual-Signal Fusion: Hybrid Neuro-Symbolic Gating with Compressed Chain-of-Thought Refinement for Irony Detection in Social Media Texts

鲁棒双信号融合：混合神经符号门控与压缩链式思维精炼用于社交媒体文本讽刺检测

Ankit Bhattacharjee, Krityapriya Bhaumik

发表机构 * Indian Institute of Technology Kharagpur（印度理工学院克勒格布尔分校）

AI总结提出RDS融合框架，结合神经符号架构与压缩链式思维推理，在TweetEval和iSarcasm数据集上达到与微调BERTweet相当的性能，并显著优于监督方法。

Comments 11 pages total, 10 figures

详情

AI中文摘要

大型语言模型（LLM）默认倾向于字面语义解释，使得零样本讽刺检测成为一个持续的挑战。我们引入了鲁棒双信号（RDS）融合框架，这是一种混合神经符号架构，无需监督微调（SFT）即可压缩链式思维（CoT）推理轨迹。在严格保留的TweetEval测试集（N=734）上，RDS达到了78.1%的准确率和0.777的宏F1分数，与微调BERTweet的绝对性能上限相匹配。在高度不平衡的iSarcasm数据集上，冻结的CoT管道过滤了22.5%的分布外幻觉，实现了0.6726的零样本宏F1和0.4821的讽刺F1，优于多个强监督的SemEval Transformer集成。统计消融实验证实了这种结构协同作用：将符号先验添加到神经基线没有显著提升（p=0.242），而将CoT管道添加到该先验的边际收益被高度压缩（p=0.149）。只有所有三个信号的完整并发融合才能实现相对于基线的统计验证改进（p=0.005）。

英文摘要

Large Language Models (LLMs) natively default to literal semantic interpretations, making zero-shot irony detection a persistent challenge. We introduce the Robust Dual-Signal (RDS) Fusion framework, a hybrid neuro-symbolic architecture that compresses Chain-of-Thought (CoT) reasoning trajectories without Supervised Fine-Tuning (SFT). Evaluated on a strictly held-out TweetEval test set (N=734), RDS achieves 78.1% accuracy and a Macro F1 of 0.777, matching the absolute performance ceiling of the fine-tuned BERTweet. On the heavily imbalanced iSarcasm dataset, the frozen CoT pipeline filters 22.5% of out-of-distribution hallucinations, yielding a zero-shot Macro F1 of 0.6726 and Ironic F1 of 0.4821, outperforming multiple heavily supervised SemEval transformer ensembles. A statistical ablation confirms this structural synergy: adding the symbolic prior to the neural baseline yields no significant gain (p = 0.242), and the marginal benefit of adding the CoT pipeline to that prior is heavily compressed (p = 0.149). Only the complete, concurrent fusion of all three signals achieves a statistically validated improvement over the baseline (p = 0.005).

URL PDF HTML ☆

赞 0 踩 0

2606.16847 2026-06-16 cs.CL cs.AI 交叉投稿

双不确定性引导的多模态推理策略学习

Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu

发表机构 * Tencent Hunyuan（腾讯文汇）； University of Maryland（马里兰大学）； University of North Carolina（北卡罗来纳大学）

AI总结提出DUPL方法，通过量化感知不确定性和输出不确定性来引导策略更新，在多个多模态推理基准上显著提升模型准确率，优于现有方法。

详情

AI中文摘要

具有可验证奖励的强化学习（RLVR）已经提升了多模态大语言模型的推理能力。然而，现有方法通常将视觉输入视为确定性的，忽略了视觉模态固有的感知模糊性。因此，它们无法区分模型的不确定性是源于复杂推理还是模糊感知，从而无法有针对性地分配探索或学习信号。为了解决这一问题，我们引入了\textbf{DUPL}，一种用于多模态RLVR的双不确定性引导策略学习方法，该方法量化并利用感知不确定性（通过对称KL散度）和输出不确定性（通过策略熵）来指导策略更新。通过建立不确定性驱动的反馈循环并采用动态分支优先级机制，DUPL重新校准策略优势，将学习重点放在具有高感知或决策模糊性的状态上，从而实现超越被动数据增强的有效目标探索。在涵盖数学和通用领域的多个多模态推理基准上，DUPL取得了显著提升。它将Qwen2.5-VL的准确率提升了高达$\textbf{12.3%}$（3B）和$\textbf{7.9%}$（7B），将Qwen3-VL-Instruct的准确率提升了高达$\textbf{10.7%}$（4B）和$\textbf{12.4%}$（8B），持续优于GRPO，同时无缝泛化到其他算法（DAPO，平均$\textbf{+6.5%}$）和架构（LLaVA-OneVision-1.5，平均$\textbf{+4.7%}$）。这些结果表明，DUPL是一种有效且可泛化的多模态RLVR方法。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model's uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce \textbf{DUPL}, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Evaluated on diverse multimodal reasoning benchmarks spanning mathematical and general domains, DUPL achieves solid gains. It improves Qwen2.5-VL accuracy by up to $\textbf{12.3%}$ (3B) and $\textbf{7.9%}$ (7B), and Qwen3-VL-Instruct by up to $\textbf{10.7%}$ (4B) and $\textbf{12.4%}$ (8B), consistently outperforming GRPO, while seamlessly generalizing to alternative algorithms (DAPO, $\textbf{+6.5%}$ avg) and architectures (LLaVA-OneVision-1.5, $\textbf{+4.7%}$ avg). These results demonstrate that DUPL is an effective and generalizable approach for multimodal RLVR.

URL PDF HTML ☆

赞 0 踩 0

2602.08597 2026-06-16 cs.AI 版本更新

An Attention Mechanism for Robust Multimodal Integration in a Global Workspace Architecture

全局工作空间架构中用于鲁棒多模态集成的注意力机制

Roland Bertin-Johannet, Lara Scipio, Leopold Maytié, Rufin VanRullen

发表机构 * CerCo, CNRS, Université de Toulouse（CerCo、CNRS、图卢兹大学）； ANITI, Artificial and Natural Intelligence Toulouse Institute（ANITI、图卢兹人工智能与自然智能研究所）

AI总结提出一种轻量级自上而下的模态选择器，在冻结的多模态全局工作空间上运行，通过注意力机制提升系统在模态噪声或缺失下的鲁棒性，并在两个数据集上验证了其高效性和可迁移性。

Comments 21 pages, 6 figures, 2 tables. Accepted at ICANN 2026. Code: https://github.com/RolandBERTINJOHANNET/GW_attention

详情

AI中文摘要

鲁棒的多模态系统必须在某些模态存在噪声、退化或不可靠时仍保持有效。现有的多模态融合方法通常将模态选择与表示学习联合进行，这使得难以判断鲁棒性来自选择器本身还是来自完全的端到端协同适应。受全局工作空间理论（GWT）启发，我们使用一个轻量级的自上而下模态选择器，运行在冻结的多模态全局工作空间之上，来研究这个问题。我们在两个复杂度递增的多模态数据集（Simple Shapes 和 MM-IMDb 1.0）上，在结构化模态损坏条件下评估了我们的方法。该选择器在使用的可训练参数远少于端到端注意力基线的情况下提高了鲁棒性，并且学习到的选择策略在下游任务、损坏模式甚至之前未见过的模态上具有更好的迁移性。除了显式的损坏设置外，在 MM-IMDb 1.0 基准测试上，我们展示了相同的机制改善了全局工作空间相对于其无注意力对应版本的性能，并取得了不错的基准性能。

英文摘要

Robust multimodal systems must remain effective when some modalities are noisy, degraded, or unreliable. Existing multimodal fusion methods often learn modality selection jointly with representation learning, making it difficult to determine whether robustness comes from the selector itself or from full end-to-end co-adaptation. Motivated by Global Workspace Theory (GWT), we study this question using a lightweight top-down modality selector operating on top of a frozen multimodal global workspace. We evaluate our method on two multimodal datasets of increasing complexity: Simple Shapes and MM-IMDb 1.0, under structured modality corruptions. The selector improves robustness while using far fewer trainable parameters than end-to-end attention baselines, and the learned selection strategy transfers better across downstream tasks, corruption regimes, and even to a previously unseen modality. Beyond explicit corruption settings, on the MM-IMDb 1.0 benchmark, we show that the same mechanism improves the global workspace over its no-attention counterpart and yields decent benchmark performance.

URL PDF HTML ☆

赞 0 踩 0

2501.09310 2026-06-16 cs.CL cs.AI cs.SE 版本更新

Understanding, Detecting, and Repairing Real-World In-Context-Learning-Based Text-to-SQL Errors

理解、检测和修复基于上下文学习的真实世界文本到SQL错误

Jiawei Shen, Chengcheng Wan, Ruoyi Qiao, Jiazhen Zou, Hang Xu, Yuchen Shao, Yueling Zhang, Weikai Miao, Geguang Pu

发表机构 * East China Normal University（东华师范大学）； Shanghai China（上海中国）； Shanghai Innovation Institute（上海创新研究院）； sei.ecnu.edu.cn（东华师范大学电子邮件）

AI总结本研究首次全面调查基于上下文学习的文本到SQL错误，总结27种错误类型，并提出MapleDoctor框架，相比现有方法修复率提高13.8%，误修复极少，延迟降低67.4%。

Comments Accepted by FSE 2026

详情

AI中文摘要

大型语言模型（LLMs）已被用于文本到SQL任务，利用其上下文学习（ICL）能力将自然语言问题转换为SQL查询。然而，这种技术面临正确性问题。在本文中，我们首次对基于ICL的文本到SQL错误进行了全面研究。我们的研究涵盖了四种代表性的ICL技术、五种基本修复方法、两个基准测试和两种LLM设置。我们发现文本到SQL错误普遍存在，并总结了7个类别的27种错误类型。我们还发现，现有的修复尝试在正确性提升方面有限，同时具有高计算开销和许多误修复。基于这些发现，我们提出了MapleDoctor，一种新颖的文本到SQL错误检测和修复框架。评估表明，MapleDoctor优于现有解决方案，修复了13.8%更多的查询，误修复数量可忽略不计，并减少了67.4%的修复延迟。该工件可在GitHub上公开获取。

英文摘要

Large language models (LLMs) have been adopted for text-to-SQL tasks, utilizing their in-context learning (ICL) capability to translate natural language questions into SQL queries. However, such a technique faces correctness problems. In this paper, we conduct the first comprehensive study of text-to-SQL errors of ICL-based techniques. Our study covers four representative ICL-based techniques, five basic repairing methods, two benchmarks, and two LLM settings. We find that text-to-SQL errors are widespread and summarize 27 error types of 7 categories. We also find that existing repairing attempts have limited correctness improvement while having high computational overhead and many mis-repairs. Based on these findings, we propose MapleDoctor, a novel text-to-SQL error detection and repairing framework. The evaluation demonstrates that MapleDoctor outperforms existing solutions by repairing 13.8% more queries with a negligible number of mis-repairs and reducing 67.4% repair latency. The artifact is publicly available at GitHub.

URL PDF HTML ☆

赞 0 踩 0

2502.08266 2026-06-16 cs.CL cs.AI cs.LG 版本更新

Dealing with Annotator Disagreement in Hate Speech Classification

处理仇恨言论分类中的标注者分歧

Somaiyeh Dehghan, Mehmet Umut Sen, Berrin Yanikoglu

发表机构 * Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey（工程与自然科学学院，Sabanci大学，伊斯坦布尔，土耳其）； Center of Excellence in Data Analytics (VERIM), Sabanci University, Istanbul, Turkey（数据分析卓越中心（VERIM），Sabanci大学，伊斯坦布尔，土耳其）

AI总结研究标注者分歧对仇恨言论分类的影响，评估多数投票等聚合方法，并利用感知强度增强分类性能，在土耳其语推文中取得新最优结果。

Comments 19 pages, 4 Tables

详情

AI中文摘要

仇恨言论检测是一项关键任务，尤其是在有害内容可能迅速传播的社交媒体上。收集社交媒体内容（如推文）来训练机器学习模型很容易，但由于其固有的主观性，检测和分类仇恨言论可能很困难。这种主观性导致标注者之间频繁出现分歧，尤其是对于微妙或边缘内容。传统方法要么丢弃非共识样本，要么通过专家裁决强制设定“黄金标准”，忽略了关于不确定性和多样化人类视角的宝贵信息。我们研究了仇恨言论分类中标注者分歧这一很大程度上被忽视的问题，并评估了一系列聚合方法，包括多数投票、序数策略（最小值、最大值和均值），并分析了它们在二分类、四分类和六分类任务中的影响。此外，我们利用标注者感知的仇恨言论强度分数来探索基于回归和混合建模的方法。我们证明，过滤非共识样本会导致过于乐观的结果，而感知强度提供了增强分类性能的补充信号。最后，我们在土耳其语推文的仇恨言论检测中建立了新的最优结果，并表明标注者分歧在适当建模后，是构建更稳健可靠系统的宝贵资源。

英文摘要

Hate speech detection is a crucial task, especially on social media where harmful content can spread quickly. Collecting social media content (tweets etc.) to train machine learning models is easy, but detecting and categorizing hate speech can be difficult due to the inherently subjective nature. This subjectivity leads to frequent disagreement among annotators, particularly for subtle or borderline content. Traditional approaches either discard non-consensus samples or force a ''gold standard'' through expert adjudication, ignoring valuable information about uncertainty and diverse human perspectives. We examine the largely overlooked problem of annotator disagreement in hate speech classification and evaluate a range of aggregation methods, including majority voting, ordinal strategies (minimum, maximum, and mean), and analyze their impact across binary, 4-class, and 6-class classification tasks. In addition, we leverage annotators' perceived hate speech strength scores to explore regression-based and hybrid modeling approaches. Among others, we show that filtering non-consensus samples results in over-optimistic results and that the perceived strength provides a complementary signal that enhance classification performance. Finally, we establish new state-of-the-art results for hate speech detection in Turkish tweets, and demonstrate that annotator disagreement, when properly modeled, is a valuable resource for building more robust and reliable systems.

URL PDF HTML ☆

赞 0 踩 0

2502.11201 2026-06-16 cs.DB cs.AI 版本更新

Bridging the Gap: Enabling Natural Language Queries for NoSQL Databases through Text-to-NoSQL Translation

弥合差距：通过文本到NoSQL翻译实现NoSQL数据库的自然语言查询

Jinwei Lu, Jiawei Lu, Chen Zhang, Zhiqian Qin, Haodi Zhang, Yuanfeng Song, Raymond Chi-Wing Wong

发表机构 * University of Science and Technology of China（中国科学技术大学）； Tsinghua University（清华大学）； The University of Hong Kong（香港大学）

AI总结本文研究Text-to-NoSQL任务，提出TEND基准和SAG求解器，用于将自然语言请求翻译为MongoDB聚合管道，验证了无模式文档推理的独特挑战。

详情

AI中文摘要

NoSQL数据库是核心数据基础设施，但对其的自然语言访问仍不成熟：正确的查询生成必须恢复非关系数据模型如何表示实体、嵌套路径、数组、缺失字段和动态键。本文研究Text-to-NoSQL，将自然语言请求翻译为可执行的NoSQL查询，实例化为对无模式文档存储的MongoDB聚合管道。我们提出TEND（Text-to-NoSQL Dataset的缩写），一个执行验证的基准，包含11个数据库上的1,210个MongoDB原生任务。据我们所知，TEND是第一个数据库世界设计为MongoDB原生的Text-to-NoSQL基准：专家手动定义集合边界、嵌套数组、可选和稀疏路径、多态形状以及动态键约定；这些世界填充真实数据并通过冻结的MongoDB执行验证，因此TEND评估无模式文档推理而非SQL到MQL的迁移。我们进一步引入SAG（Schema-as-Data Grounding）求解器，该求解器在受限MQL生成、执行接地修复和结果一致性选择之前，从存储文档证据中诱导路径和值接地。评估使用受限列容忍执行准确率（EXC）作为主要指标，辅以分级结果集F1和互斥执行结果分解。实验表明，在NL2SQL上表现强劲的LLM在TEND上大幅下降，验证了Text-to-NoSQL作为一个独特的无模式文档推理问题。

英文摘要

NoSQL databases are core data infrastructure, yet natural-language access to them remains underdeveloped: correct query generation must recover how a non-relational data model represents entities, nested paths, arrays, missing fields, and dynamic keys. This paper studies Text-to-NoSQL, translating natural-language requests into executable NoSQL queries, instantiated with MongoDB aggregation pipelines over schema-less document stores. We present TEND, short for Text-to-NoSQL Dataset, an execution-verified benchmark with 1,210 MongoDB-native tasks across 11 databases. To our knowledge, TEND is the first Text-to-NoSQL benchmark whose database worlds are MongoDB-native by design: experts manually define collection boundaries, nested arrays, optional and sparse paths, polymorphic shapes, and dynamic-key conventions; these worlds are populated with real data and verified through frozen MongoDB execution, so TEND evaluates schema-less document reasoning rather than SQL-to-MQL transfer. We further introduce SAG, a Schema-as-Data Grounding solver that induces path and value grounding from stored-document evidence before bounded MQL generation, execution-grounded repair, and result-consistency selection. Evaluation uses bounded column-tolerant execution accuracy (EXC) as the headline metric, complemented by a graded result-set F1 and a mutually exclusive execution-outcome decomposition. Experiments show that LLMs with strong NL2SQL performance degrade substantially on TEND, validating Text-to-NoSQL as a distinct schema-less document reasoning problem.

URL PDF HTML ☆

赞 0 踩 0

2506.16738 2026-06-16 cs.CL cs.AI cs.SD eess.AS 版本更新

RoTRAG: 基于经验法则推理的检索增强生成对话有害内容检测

Juhyeon Lee, Wonduk Seo, Junseo Koh, Seunghyun Lee, Haihua Chen, Yi Bu

发表机构 * Peking University（北京大学）； Enhans ； University of North Texas（北得克萨斯大学）

AI总结提出RoTRAG框架，通过检索外部道德规范（RoTs）增强LLM的多轮对话有害内容检测，实现基于规范推理和分类，平均F1提升约40%，分布误差降低8.4%。

Comments Accepted by SIGIR-ICTIR 2026, Oral Presentation

详情

DOI: 10.1145/3805713.3820397
Journal ref: Proceedings of the 2026 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR '26), July 25, 2026, Melbourne, VIC, Australia. ACM, New York, NY, USA, 12 pages

AI中文摘要

检测多轮对话中的有害内容需要对完整对话上下文进行推理，而非孤立的话语。然而，现有方法主要依赖模型内部的参数化知识，缺乏对外部规范性原则的明确依据。这常导致在社会细微语境下判断不一致、可解释性有限以及跨轮次冗余推理。为解决此问题，我们提出RoTRAG，一种检索增强框架，将简洁的人类编写的道德规范（称为经验法则，RoTs）融入基于LLM的有害性评估中。对于每一轮，RoTRAG从外部语料库中检索相关RoTs，并将其作为轮次推理和最终严重性分类的明确规范性证据。为提高效率，我们进一步引入一个轻量级二元路由分类器，决定新轮次是否需要基于检索的推理或可重用现有上下文。在ProsocialDialog和Safety Reasoning Multi Turn Dialogue上的实验表明，RoTRAG在有害分类和严重性估计上均持续优于竞争基线，在基准数据集上F1平均相对提升约40%，分布误差平均相对降低8.4%，同时在不牺牲性能的情况下减少冗余计算。

英文摘要

Detecting harmful content in multi turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval augmented framework that incorporates concise human written moral norms, called Rules of Thumb (RoTs), into LLM based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval grounded reasoning or can reuse existing context. Experiments on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.

URL PDF HTML ☆

赞 0 踩 0

2605.01733 2026-06-16 cs.CV cs.AI 版本更新

GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models

GEASS: 基于证据适应的门控选择性描述信任机制用于视觉-语言模型

Zeshang Li, Shuoyang Zhang

发表机构 * University of International Relations（国际关系大学）

AI总结本文提出GEASS，一种无需训练的模块，通过门控、加权和证据标准来决定模型在每个查询中消耗多少描述信息，从而提升视觉-语言模型的准确性。

Comments 18 pages, 12 figures

详情

AI中文摘要

视觉-语言模型（VLMs）在 grounded reasoning 方面表现出色，但仍然容易产生 object hallucination。最近的研究将自动生成的描述视为一个均匀的积极资源，但我们发现盲目地嵌入一个描述可能会降低而不是提高性能——在 HallusionBench 上，Qwen2.5-VL-3B 的准确性下降了近 10 个点。两个结构性质解释了这一点。首先，描述不仅锚定了模型的最终答案，还锚定了其推理轨迹和词汇选择。其次，描述错误是不对称的：遗漏远多于伪造，但每个伪造对实例的影响更大。因此，描述的有用性是查询特定的，而不是语料库特定的。我们提出 GEASS（ated Evidence-Adaptive Selective Caption Trust），一个无需训练的模块，决定每个查询中模型消耗多少描述信息：它通过干净路径的置信度来门控描述，通过它产生的熵减少来加权描述，并在两种路径意见不同时提高证据标准。在 POPE 和 HallusionBench 上对四个 VLMs 的实验表明，GEASS 在 vanilla 推理和对比解码上都表现出色，仅需每个查询两个额外的前向传递。

英文摘要

Vision-Language Models (VLMs) hallucinate objects that are not present, and a growing line of work tries to curb this by feeding the model its own generated caption as auxiliary evidence -- assuming that a caption, once available, is something to consume. We show this fails: naively appending a caption can lower accuracy rather than raise it, dropping Qwen2.5-VL-3B† on HallusionBench by nearly ten points. To understand why, we build GD-Probe, a diagnostic set that pairs a global and a detail question on the same image, so that any difference in caption effect is attributable to the question alone. Caption utility proves to be a per-query property: the same caption helps global questions and harms detail ones, through a single mechanism -- an embedded caption competes with the image for attention and pulls the model's evidence onto its own text -- whose sign is set by whether the caption covers the queried content. Crucially, this regime is readable from quantities the decoder already emits, with no attention access or grounding. We turn this into GEASS (Gated Evidence-Adaptive Selective Caption Trust), a training-free, logit-level module that decides per query how much of the caption to trust, gating it by the clean path's confidence, weighting it by the entropy reduction it induces, and raising the evidence bar when the two pathways disagree. Across four VLMs and two benchmarks (POPE and HallusionBench), GEASS improves over both vanilla inference and contrastive decoding under a single fixed setting, adding only two forward passes and no parameters.

URL PDF HTML ☆

赞 0 踩 0

2605.18313 2026-06-16 cs.CV cs.AI 版本更新

Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering

Wasserstein均衡解码用于可靠的医疗视觉问答

Luca Hagen, Johanna P. Müller, Weitong Zhang, Mengyun Qiao, Bernhard Kainz

发表机构 * Friedrich-Alexander University Erlangen-Nürnberg（弗里德里希-亚历山大厄林根-纽伦堡大学）； Imperial College London（伦敦帝国理工学院）； University College London（伦敦大学学院）

AI总结本文提出了一种基于Wasserstein距离的均衡解码方法，用于改进医疗视觉问答系统，通过语义感知的停止准则提高解码效率和准确性，同时在VQA-RAD和PathVQA数据集上实现了显著的性能提升。

详情

AI中文摘要

小型视觉-语言模型（2-8B）由于隐私限制、有限的连接性和低延迟要求，适合临床部署。然而，其有限的容量会加剧生成合理但错误的输出。我们扩展了之前仅限于纯文本、封闭式NLP任务的博弈论解码方法，应用于开放式的医疗视觉问答（VQA）。我们引入了一种语义感知的Wasserstein停止准则，以取代基于词序的匹配，使收敛基于候选答案之间的语义共识，避免因临床等效排名交换导致的不必要的迭代。在VQA-RAD和PathVQA上，我们获得了比贪心和判别基线显著的改进。在VQA-RAD上，我们比贪心的4B模型提高了3.5个百分点（p < 0.01），在更大规模上呈现出相似趋势。在PathVQA上，Gemma-3-4B与BDG在贪心解码下表现相当，尽管没有领域特定的微调。在与经典BDG的准确性相等时，Wasserstein准则将平均收敛迭代次数减少了约20%，在提高推理效率的同时保留了博弈论均衡行为。代码可在https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA上获得。

英文摘要

Small vision-language models (2-8B) are well-suited for clinical deployment due to privacy constraints, limited connectivity, and low-latency requirements favouring on-device or on-premise inference. However, their limited capacity exacerbates the generation of plausible but incorrect outputs. We extend game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to vision-language models for open-ended Medical VQA. We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, enabling convergence based on semantic consensus among near-synonymous candidate answers and avoiding unnecessary iterations caused by clinically equivalent ranking swaps. On VQA-RAD and PathVQA, we obtain consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, we improve Qwen3-VL-2B by +3.5 percentage points (p < 0.01), surpassing the greedy 4B model, with similar trends at larger scales. On PathVQA, Gemma-3-4B with BDG matches MedGemma-4B under greedy decoding despite no domain-specific fine-tuning. At accuracy parity with classic BDG, the Wasserstein criterion reduces average convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour. Code is available at https://github.com/luca-hagen/ Wasserstein-BDG-medical-VQA.

URL PDF HTML ☆

赞 0 踩 0

2606.00435 2026-06-16 cs.CV cs.AI 版本更新

面向统一歌曲生成与带伴奏共生成的歌声转换

Ziyu Zhang, Chunyu Qiang, Xiaopeng Wang, Yuxin Guo, Kang Yin, Wenjie Tian, Jingbin Hu, Tianlun Zuo, Zhao Guo, Teng Ma, Yuzhe Liang, Chen Zhang, Lei Xie

发表机构 * Northwestern Polytechnical University（西北工业大学）； Kuaishou Technology（快手科技）； Beijing Institute of Technology（北京理工大学）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； University of Science and Technology of China（中国科学技术大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出UniSinger框架，基于多模态扩散Transformer统一零样本歌曲生成与伴奏共生成歌声转换，通过共享说话人嵌入和课程学习策略实现跨任务音色控制与多任务优化。

详情

AI中文摘要

尽管歌曲生成和歌声转换（SVC）已显著发展，但长期以来它们被孤立开发：前者缺乏零样本说话人克隆，而后者忽略了人声-伴奏协同。为弥合这一差距，我们提出UniSinger，这是首个统一说话人克隆歌曲生成与伴奏共生成SVC的端到端框架。基于多模态扩散Transformer，我们构建了一个统一的说话人嵌入空间，将说话人表示从SVC迁移到歌曲生成，从而实现细粒度的跨任务音色控制。为缓解多任务优化冲突，我们设计了一种课程学习策略，使用任务特定的模态掩码来引导模型逐步掌握语义内容、人声音色和伴奏之间的生成机制。实验表明，在两个任务上均达到最先进性能，并实现了互补优势，为智能音乐制作提供了新可能性。

英文摘要

While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed isolated: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker cloning song generation and accompaniment co-generation SVC. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space transferring speaker representation from SVC to song generation, endowing fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy using task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show state-of-the-art performance on both tasks and realizes complementary benefits, offering new possibilities for intelligent music production.

URL PDF HTML ☆

赞 0 踩 0

2606.11751 2026-06-16 cs.CV cs.AI 版本更新

AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory

AnchorEdit: 通过因果记忆在多轮图像编辑中保持时间一致性

Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

发表机构 * University of Science and Technology of China（中国科学技术大学）； JD Explore Academy（京东探索研究院）

AI总结提出首个自回归扩散框架AnchorEdit，通过因果记忆机制和自展开策略解决多轮编辑中的身份漂移和误差累积问题，在10轮以上交互中保持高保真度。

Comments Code: https://github.com/xuhang07/AnchorEdit

详情

AI中文摘要

多轮图像编辑对于迭代设计至关重要，但当前模型在连续步骤中常面临身份漂移和误差累积。现有研究利用视频先验保持一致性，但其依赖的双向注意力与交互式编辑的因果、顺序性质根本不符。本文提出AnchorEdit，首个专为高分辨率、长期多轮编辑设计的自回归（AR）扩散框架。AnchorEdit通过三阶段训练课程弥合视频先验与因果推理之间的差距：保持身份的单轮预训练、使用新颖的自展开策略进行因果AR强制微调以缓解暴露偏差，以及用于高效4步生成的一致性蒸馏。在推理过程中，我们引入记忆机制来锚定初始主体身份，并确保在扩展编辑轨迹上的稳定外推。为评估性能，我们提供了一个新的高分辨率多轮编辑基准，旨在压力测试长期稳定性。大量实验表明，AnchorEdit达到了最先进的结果，即使在10轮以上的交互中也能保持卓越的主体保真度和指令遵循能力。

英文摘要

Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving sing-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism to anchor the initial subject identity and ensure stable extrapolation across extended editing trajectories. To evaluate performance, we provide a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining exceptional subject fidelity and instruction following even over 10+ interaction rounds.

URL PDF HTML ☆

赞 0 踩 0

2606.14142 2026-06-16 cs.CL cs.AI 版本更新

Implicit Reasoning for Large Language Model-based Generative Recommendation

基于大语言模型的生成式推荐的隐式推理

Yinhan He, Liam Collins, Bhuvesh Kumar, Jundong Li, Neil Shah, Donald Loveland

发表机构 * University of Virginia（弗吉尼亚大学）； Snap Inc.（Snap公司）

AI总结针对大语言模型用于生成式推荐时显式推理的三大局限（世界知识表达弱化、语义ID与自然语言嵌入空间不对齐、推理质量敏感），提出轻量级隐式推理范式PauseRec，在性能、训练成本和推理速度上均优于显式方法。

详情

AI中文摘要

大语言模型（LLMs）越来越多地被用作生成式推荐（GR）的骨干，有望利用预训练的世界知识。然而，如何可靠地调用这些知识进行GR仍不清楚。一个关键障碍是，基于LLM的GR通常使用语义ID（SIDs）表示物品，这破坏了LLM的自然语言推理接口，因为这些标记在预训练期间对LLM是未见过的。现有方法通过昂贵的多阶段流程来应对，这些流程将SID接地并引发显式推理，但对每个阶段何时以及为何必要提供的见解有限。在这项工作中，我们系统地分解了基于LLM的GR的显式推理训练流程，揭示了三个关键局限：弱化的世界知识表达、SID与自然语言标记嵌入空间之间的不对齐，以及对推理质量的敏感性，所有这些都损害了显式推理性能。为了规避这些问题，我们提出了PauseRec，一种为GR量身定制的轻量级隐式推理范式。PauseRec非常实用，避免了昂贵的推理轨迹获取和推理对齐训练，带来了诸多好处：（1）其性能比标准显式CoT方法高出高达6.22%，（2）将训练成本降低高达65%的GPU小时，（3）将推理速度提升高达71.3%。这些结果使PauseRec成为显式推理生成的轻量级替代方案，能够实现更有效、更高效的基于LLM的GR。

英文摘要

Large Language Models (LLMs) are increasingly adopted as backbones for Generative Recommendation (GR), promising access to pretrained world knowledge. Yet reliably invoking this knowledge for GR remains poorly understood. A key obstacle is that LLM-based GR typically represents items with Semantic IDs (SIDs), disrupting LLMs' natural-language reasoning interface because these tokens are unseen by the LLM during pretraining. Existing approaches address this with expensive multi-stage pipelines that ground SIDs and elicit explicit rationales, but offer limited insight into when and why each stage is necessary. In this work, we systematically decompose explicit reasoning training pipelines for LLM-based GR, revealing three key limitations: weakened world-knowledge verbalization, misalignment between SID and natural-language token embedding spaces, and sensitivity to rationale quality, all of which hurt explicit reasoning performance. To circumvent these issues, we propose PauseRec, a lightweight implicit reasoning paradigm tailored for GR. PauseRec is exceptionally practical, avoiding costly reasoning trace acquisition and reasoning alignment training, leading to a multitude of benefits: (1) it outperforms standard explicit CoT methods by up to 6.22%, (2) it reduces training cost by up to 65% GPU hours, and (3) it speeds up inference by up to 71.3%. These results position PauseRec as a lightweight alternative to explicit rationale generation, enabling more effective and efficient LLM-based GR.

URL PDF HTML ☆

赞 0 踩 0

2606.15647 2026-06-16 cs.AI cs.CV cs.RO 新提交

Towards Next-Generation Healthcare: A Survey of Medical Embodied AI for Perception, Decision-Making, and Action

迈向下一代医疗：医疗具身AI在感知、决策与行动中的综述

Cheng Zhang, Qing Cai, Xingzheng Wu, Xun Yang, Xiaojun Chang, Bingkun Bao, Liqiang Nie, Xinwang Liu, Yi Yang

发表机构 * School of Information Science and Engineering, Ocean University of China（中国海洋大学信息科学与工程学院）； Innovation School of Artificial Intelligence, Hefei University of Technology（合肥工业大学人工智能创新学院）； School of Information Science and Technology, University of Science and Technology of China（中国科学技术大学信息科学技术学院）； School of Computer Science and Information Engineering, Hefei University of Technology（合肥工业大学计算机与信息工程学院）； School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳）计算机科学与技术学院）； College of Computer Science and Technology, National University of Defense Technology（国防科技大学计算机科学与技术学院）； ReLER Laboratory, CCAI, Zhejiang University（浙江大学计算机辅助设计与图形学国家重点实验室）

AI总结本文系统综述医疗具身AI的核心组件，强调感知、决策与行动的协调集成，并分析临床实践中的挑战与未来方向。

Comments 19 pages, 9 figures

详情

AI中文摘要

基础模型在提升医疗效率方面表现出色，广泛应用于各类医疗场景。然而，它们在感知、理解和与物理世界交互方面的能力有限，严重制约了其在真实临床工作流中的有效性，而临床工作流中安全关键的决策和物理执行紧密耦合。近年来，具身人工智能（AI）作为一种有前景的物理交互范式出现，使智能体能够在复杂医疗环境中操作。随着该领域研究的迅速扩展，理解智能体如何在临床环境中作为集成的端到端系统运行变得日益关键。然而，现有关于医疗具身AI的综述大多强调单个方面或功能组件，缺乏统一的系统级组织。为支持和巩固最新进展，我们系统调查了医疗具身AI的核心组件，特别关注感知、决策与行动的协调集成。我们进一步回顾了代表性医疗应用和相关数据集，并分析了真实临床实践中遇到的主要挑战。最后，我们讨论了这一快速发展领域未来研究的关键方向。相关项目见 https://github.com/VMVLab/Medical_Embodied_AI_Paper_List。

英文摘要

Foundation models have demonstrated impressive performance in enhancing healthcare efficiency across a wide range of medical applications. Nevertheless, their limited ability to perceive, understand, and interact with the physical world significantly constrains their effectiveness in real-world clinical workflows, where safety-critical decision-making and physical execution are tightly coupled. Recently, embodied artificial intelligence (AI) has emerged as a promising physical-interactive paradigm for intelligent healthcare, enabling agents to operate in complex medical environments. As research in this area rapidly expands, understanding how intelligent agents function as integrated, end-to-end systems in clinical environments becomes increasingly critical. However, existing surveys on medical embodied AI largely emphasize individual aspects or functional components, lacking a unified system-level organization of the field. To support and consolidate recent advances, we systematically survey the core components of medical embodied AI, with a particular emphasis on the coordinated integration of perception, decision-making, and action. We further review representative medical applications and relevant datasets, and we analyze the major challenges encountered in real-world clinical practice. Finally, we discuss key directions for future research in this rapidly evolving field. The associated project can be found at https://github.com/VMVLab/Medical_Embodied_AI_Paper_List.

URL PDF HTML ☆

赞 0 踩 0

2606.15753 2026-06-16 cs.AI 新提交

RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought

RoboPIN: 基于锚定思维链的具身推理

Yaoting Huang, Yifu Yuan, Linqi Han, Chengwen Li, Shuoheng Zhang, Xianze Yao, Hongyao Tang, Yan Zheng, Jianye Hao

发表机构 * Tianjin University（天津大学）

AI总结提出Pinned Chain-of-Thought (PinCoT)推理范式，通过结构化视觉锚点绑定实体，解决多步推理中实体引用漂移和视觉解耦问题；训练4B参数模型在14个基准上平均超越最强7B基线12%。

详情

AI中文摘要

具身推理要求模型感知物理环境中与任务相关的物体和空间，并在多步推理中保持一致的视觉基础。然而，当前的视觉语言模型依赖于纯文本或坐标增强的思维链，其中实体引用仍然隐式和模糊。这可能导致推理过程与视觉证据解耦、实体引用在步骤间漂移、推理轨迹与最终答案之间的因果断裂，并且由于跨视角外观变化，这些问题在多视角场景中进一步放大。为了解决这些问题，我们提出了Pinned Chain-of-Thought (PinCoT)，一种结构化推理范式，将每个推理步骤锚定到视觉证据。PinCoT引入了推理锚点的概念，它将每个任务相关实体绑定到一个结构化的视觉锚点，包含实体名称、唯一标识、视角索引和空间基础，从而能够在推理步骤和视角之间实现一致的实体跟踪。我们构建了一个全自动数据生成管道来构建数据集，这是一个高质量的PinCoT格式推理数据集。然后，我们通过三阶段后训练训练方法，逐步注入具身知识、结构化推理能力和过程监督对齐，奖励直接约束推理过程中的锚点定位和身份一致性。在涵盖具身空间推理、多视角推理和指向的14个基准测试中，仅有4B参数的方法始终优于7B级别的开源具身模型，比最强的7B基线Mimo-Embodied平均提高12%。进一步分析表明，PinCoT提高了基础准确性和跨步骤身份一致性，验证了过程监督的有效性。

英文摘要

Embodied reasoning requires models to perceive task-relevant objects and spaces in physical environments and maintain consistent visual grounding throughout multi-step reasoning. However, current vision-language models rely on text-only or coordinate-augmented chain-of-thought, where entity references remain implicit and ambiguous. This may cause the reasoning process to decouple from visual evidence, entity references to drift across steps, and a causal disconnection between the reasoning trajectory and the final answer, with these problems further amplified in multi-view scenarios due to cross-view appearance changes. To address these issues, we propose Pinned Chain-of-Thought (\pincot{}), a structured reasoning paradigm that pins every reasoning step to visual evidence. \pincot{} introduces the concept of \reasoninganchor{}, which binds each task-relevant entity to a structured visual anchor with entity name, unique identity, view index, and spatial grounding, enabling consistent entity tracking across reasoning steps and views. We build a fully automated data generation pipeline to construct \dataset{}, a high-quality \pincot{}-formatted reasoning dataset. We then train \method{} through three-stage post-training that progressively injects embodied knowledge, structured reasoning ability, and process-supervised alignment, with rewards that directly constrain both anchor localization and identity consistency during reasoning. On 14 benchmarks covering embodied spatial reasoning, multi-view reasoning, and pointing, \method{} with only 4B parameters consistently outperforms 7B level open-source embodied models, achieving a 12\% average improvement over the strongest 7B baseline, Mimo-Embodied. Further analysis shows that \pincot{} improves grounding accuracy and cross-step identity consistency, validating the effectiveness of process supervision.

URL PDF HTML ☆

赞 0 踩 0

2606.16558 2026-06-16 cs.AI cs.RO cs.SY eess.SY 新提交

ROSA-RL: Uncertainty-Aware Roundabout Optimized Speed Advisory with Reinforcement Learning

ROSA-RL：基于强化学习的不确定性感知环岛优化速度建议

Anna-Lena Schlamp, Jeremias Gerner, Klaus Bogenberger, Werner Huber, Stefanie Schmidtner

发表机构 * Universität der Bundeswehr München（慕尼黑联邦国防军大学）； Hochschule für angewandte Wissenschaften Landshut（兰茨胡特应用科学大学）

AI总结针对混合交通中环岛场景的不确定性，提出ROSA-RL框架，结合Transformer预测冲突区域占用概率与强化学习，实现安全高效的环岛入口速度协调。

Comments 8 pages, 2 figures, 2 tables. Copyright 2026 IEEE. This is the accepted manuscript for 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC), not the final published version

详情

AI中文摘要

环岛在混合交通中对自动驾驶构成挑战，因为异质且非确定性的人类行为、未知的驾驶意图以及高交互复杂性使得在进入时刻冲突区域是被阻塞还是可用存在不确定性。我们提出ROSA-RL——基于强化学习的不确定性感知环岛优化速度建议。它通过概率冲突预测，实现混合交通中自动驾驶和人类驾驶车辆的安全高效环岛进入。一个基于Transformer的模型预测未来五秒内的冲突区域占用情况，捕捉多智能体交互以预测即将发生的冲突和可用间隙。预测输出编码了未来运动和意图的不确定性，并增强经典强化学习框架的状态，实现不确定性感知的速度协调。在基于真实世界数据的仿真评估中，ROSA-RL能有效处理不确定性，并优于基于模型的基线方法，缩小了与假设完全已知占用的理想设置之间的差距，同时提高了交通效率和安全性。本工作的源代码可在github.com/urbanAIthi/ROSA-RL获取。

英文摘要

Roundabouts challenge automated driving in mixed traffic, as heterogeneous and non-deterministic human behavior, unknown driving intentions, and high interaction complexity create uncertainty about whether the conflict zone will be blocked or available at the moment of entry. We present ROSA-RL -- uncertainty-aware Roundabout Optimized Speed Advisory with Reinforcement Learning. It enables safe and efficient roundabout entry for automated and human-driven vehicles in mixed traffic through probabilistic conflict forecasting. A Transformer-based model predicts conflict zone occupancy over a five-second horizon, capturing multi-agent interactions to anticipate upcoming conflicts and available gaps. The prediction outputs encode uncertainty in future motion and intent, and augment the state of a classical RL framework, enabling uncertainty-aware speed coordination. Evaluated in simulations grounded in real-world data, ROSA-RL can effectively handle uncertainty and outperform a comparable model-based baseline, closing the gap to an ideal setting assuming fully known occupancy while improving traffic efficiency and safety. The source code of this work is available under: github.com/urbanAIthi/ROSA-RL.

URL PDF HTML ☆

赞 0 踩 0

2606.14716 2026-06-16 cs.CV cs.AI cs.RO 交叉投稿

RAMS: Resource-Adaptive and Detection-Conditioned Model Switching for Embedded Edge Perception

RAMS: 面向嵌入式边缘感知的资源自适应与检测条件模型切换

Kushal Khemani, Evan Leri, George Xu, Amit Hod

发表机构 * NEXEDGE Research Lab（NEXEDGE研究实验室）

AI总结提出RAMS运行时控制器，通过监控设备压力、校准切换阈值，在YOLOv8三个规模模型间动态切换，引入检测条件策略和VRU加权准确率评分，在多种嵌入式平台上实现延迟与精度的平衡。

详情

AI中文摘要

嵌入式硬件上的边缘目标检测需要在变化的资源压力下平衡推理延迟和检测质量。我们提出RAMS，一种轻量级运行时控制器，它监控设备压力，从空闲行为校准切换阈值，并在三个驻留的YOLOv8层级（NANO/SMALL/MEDIUM，分辨率320/416/640 px）之间动态选择，无需模型重新加载延迟。RAMS定义了五种切换策略，包括两种检测条件变体，可在最近检测到易受伤道路使用者（VRU）后防止激进的降级。我们进一步引入VRU加权准确率评分（SWAS），一种用于离线策略比较的标量指标，无需真实标注，以及一种基于oracle的变体，用于分离检测器循环性与真正的层级保留收益。在Raspberry Pi 5、x86笔记本电脑和Jetson Orin ONNX/TensorRT部署中，相同的控制器方程在37倍的延迟范围内运行。在重负载下的Jetson Orin TensorRT上，safety2策略实现了3.41毫秒的平均延迟，比固定MEDIUM推理快5.6倍，同时通过接近NANO操作并在VRU阳性窗口期间选择性锁定SMALL和MEDIUM，保留了其代理准确率的74%。与重负载下仅基于阈值的策略相比，检测条件切换在oracle评分下将SWAS提高了25.4%，在检测器衍生评分下提高了47.3%。实时KITTI评估报告了每层级VRU召回率分别为24.2%、41.2%和59.0%，表明反应性覆盖从根本上受限于基线检测器的召回率。

英文摘要

Edge object detection on embedded hardware requires balancing inference latency and detection quality under changing resource pressure. We present RAMS, a lightweight runtime controller that monitors device pressure, calibrates switching thresholds from idle behavior, and dynamically selects among three resident YOLOv8 tiers (NANO/SMALL/MEDIUM at 320/416/640 px) without model-reload latency. RAMS defines five switching policies, including two detection-conditioned variants that prevent aggressive downgrades after recent vulnerable-road-user (VRU) detections. We further introduce the VRU-Weighted Accuracy Score (SWAS), a scalar metric for offline policy comparison without ground-truth annotations, together with an oracle-bounded variant that separates detector circularity from genuine tier-retention benefit. Across Raspberry Pi 5, x86 laptops, and Jetson Orin ONNX/TensorRT deployments, the same controller equations operate over a 37x latency range. On Jetson Orin TensorRT under heavy load, the safety2 policy achieves 3.41 ms mean latency, 5.6x faster than fixed-MEDIUM inference, while retaining 74% of its proxy accuracy through near-NANO operation with selective SMALL and MEDIUM locks during VRU-positive windows. Detection-conditioned switching improves SWAS by 25.4% under oracle scoring and 47.3% under detector-derived scoring relative to threshold-only policies under heavy load. Live KITTI evaluation reports per-tier VRU recall of 24.2%, 41.2%, and 59.0%, showing that reactive overrides are fundamentally limited by baseline detector recall.

URL PDF HTML ☆

赞 0 踩 0

2606.14752 2026-06-16 cs.CV cs.AI cs.LG cs.RO 交叉投稿

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

X-Tokenizer: 一种用于视觉-语言-动作预训练的多模态动作分词器

Xirui Kang, Yanpei Shi, Lucy Liang, Roy Gan, Dongxiu Liu, Pushi Zhang, Danpeng Chen, Xiaoyi Qin, Yinan Zheng, Jinliang Zheng, Hao Wang, Xianyuan Zhan, Hang Su

发表机构 * Square Robot ； City University of Hong Kong（香港城市大学）； Tsinghua University（清华大学）

AI总结提出X-Tokenizer，通过语义残差量化（SRQ）和掩码动作建模（MAM）将动作离散化为语义接口，在2.4M轨迹上预训练后提升VLA模型的多模态接地和长程任务性能。

Comments Project page: https://x-square-robot.github.io/X-Tokenizer_projectPage/

详情

AI中文摘要

现代视觉-语言-动作（VLA）模型必须桥接预训练的视觉-语言推理和精确的连续机器人控制。现有的动作分词器主要为了重建而离散化动作，产生的编码保留了运动几何结构，但仅向主干网络提供弱语义监督。因此，我们将动作分词化不仅视为压缩，而是作为多模态推理与可执行控制之间的语义接口学习。为此，我们引入了X-Tokenizer，一种轻量级的编码器-语义残差量化（SRQ）-解码器架构，为多种机械臂形态提供共享的动作接口。其关键组件SRQ在残差向量量化上施加了非对称结构：第一层通过掩码动作建模（MAM）训练，形成捕获粗略运动意图的离散动作语言，而更深层则保持面向重建的残差，保留细粒度细节。为了进一步将动作标记与多模态语义对齐，X-Tokenizer通过与预训练基础模型的表示空间进行对比对齐以及下一帧视觉-语言特征预测进行预训练。在2.4M轨迹（2.0B动作帧）上预训练后，单个冻结的X-Tokenizer作为表示塑造的监督信号插入混合离散-连续VLA中。X-Tokenizer在真实世界聚合指标上达到最佳，并在RoboTwin 2.0模拟中表现强劲。在多模态接地（+13.5%）和长程任务（+8.25）上优于FAST，表明动作分词器作为VLA预训练的语义接口，而不仅仅是动作压缩。

英文摘要

Modern Vision-Language-Action (VLA) models must bridge pretrained vision-language reasoning and precise continuous robot control. Existing action tokenizers discretize actions primarily for reconstruction, producing codes that preserve motion geometry but provide only weak semantic supervision to the backbone. We therefore formulate action tokenization not as mere compression, but as semantic interface learning between multimodal reasoning and executable control. To this end, we introduce X-Tokenizer, a lightweight encoder-Semantic Residual Quantization (SRQ)-decoder architecture that provides a shared action interface across diverse robotic arm embodiments. Its key component, SRQ, imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling (MAM) to form a discrete action language that captures coarse motion intent, while deeper levels remain reconstruction-oriented residuals that preserve fine-grained details. To further align action tokens with multimodal semantics, X-Tokenizer is pretrained with contrastive alignment to the representation space of a pretrained foundation model and with next-frame vision-language feature prediction. Pretrained on 2.4M trajectories (2.0B action frames), a single frozen X-Tokenizer plugs into a mixed discrete-continuous VLA as a representation-shaping supervision signal. X-Tokenizer achieves top real-world aggregate and strong RoboTwin 2.0 simulation results. Outperforming FAST in multimodal grounding (+13.5%) and long-horizon tasks (+8.25), it shows that action tokenizers serve as semantic interfaces for VLA pretraining beyond mere action compression.

URL PDF HTML ☆

赞 0 踩 0

2606.14772 2026-06-16 cs.CV cs.AI 交叉投稿

ScoutVLA: UAV-Centric Active Perception via a Dual-Expert VLA Model for Open-World Embodied Question Answering

ScoutVLA：面向开放世界具身问答的无人机中心主动感知双专家VLA模型

Wenhao Lu, Zhengqiu Zhu, Xiaofeng Wang, Xiaoran Zhang, Yatai Ji, Yong Zhao, Yue Hu, Yingzhen Nie, Jinlong Zhu, Zheng Zhu

发表机构 * National Key Laboratory of Digital Intelligent Modeling and Simulation, National University of Defense Technology（国防科技大学数字智能建模与仿真国家重点实验室）； GigaAI

AI总结针对无人机在室外具身问答中细粒度视角调整不足的问题，提出ScoutVLA模型，采用解耦双专家架构（视觉语言专家推断语义意图，动作专家生成连续视角调整轨迹），并通过知识隔离机制平衡连续控制与语义推理，在仿真和真实实验中显著优于基线方法。

详情

AI中文摘要

空中具身问答（EQA）要求无人机（UAV）主动感知环境并回答自然语言问题。现有的室外EQA系统通常在目标进入无人机视野后停止，导致寻找证据所需的问题的细粒度视角调整问题仍未解决。为解决此问题，我们引入FG-EQA，一个细粒度主动感知EQA基准，包含超过4万条模拟轨迹和1千条真实轨迹。受侦察蜂“摇摆舞”的启发（它们迭代调整飞行路径以验证目标信息），我们提出ScoutVLA，一种用于室外EQA的证据驱动视觉-语言-动作模型。为模拟这种主动探索行为，ScoutVLA采用解耦双专家架构：视觉语言专家推断语义意图以识别缺失证据，而独立动作专家使用高自由度流匹配生成连续视角调整轨迹。为平衡连续控制和语义推理的竞争需求，我们设计了一种解耦训练策略，其中包含知识隔离机制，防止动作梯度抹除模型的多模态推理能力。大量仿真实验和定性真实世界实地研究均验证了ScoutVLA相对于最先进基线的优越性，平均严格成功率高10.48倍，平均QA正确率高7.72倍。

英文摘要

Aerial Embodied Question Answering (EQA) requires Unmanned Aerial Vehicles (UAVs) to actively perceive the environment and answer natural language questions. Existing outdoor EQA systems usually stop once the target enters the UAV's field of view, leaving the fine-grained viewpoint adjustment needed for evidence-seeking questions largely unresolved. To address this issue, we introduce FG-EQA, a fine-grained active perception EQA benchmark with more than 40K simulated trajectories and 1K real-world trajectories. Drawing inspiration from the ``waggle dance'' of scout bees, which iteratively adjust their flight paths to verify target information, we propose ScoutVLA, an evidence-driven Vision-Language-Action model for outdoor EQA. To emulate this active exploration behavior, ScoutVLA features a decoupled dual-expert architecture: a vision-language expert infers the semantic intent to identify missing evidence, while an independent action expert employs high-DoF flow matching to generate continuous viewpoint-refinement trajectories. To balance the competing demands of continuous control and semantic reasoning, we devise a decoupled training strategy with a knowledge insulation mechanism that prevents the action gradients from erasing the model's multimodal reasoning ability. Extensive simulated experiments and a qualitative real-world field study both verify the superiority of ScoutVLA over the state-of-the-art baselines, demonstrating a 10.48$\boldsymbol{\times}$ higher average strict success rate and a 7.72$\boldsymbol{\times}$ higher average QA correctness.

URL PDF HTML ☆

赞 0 踩 0

2606.14981 2026-06-16 cs.RO cs.AI cs.LG 交叉投稿

Inference-time Policy Steering via Vision and Touch

通过视觉和触觉进行推理时策略引导

Yilin Wu, Zilin Si, Zeynep Temel, Oliver Kroemer, Andrea Bajcsy

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结提出ViTaL框架，通过视觉采样验证和触觉引导扩散编辑的双层优化，在推理时引导机器人策略，显著提升接触丰富操作任务的成功率。

详情

AI中文摘要

推理时引导通过在部署前验证候选动作来适应预训练的生成式机器人策略。虽然先前的方法通常仅使用视觉观察进行验证，但对于接触丰富的操作任务，仅靠视觉往往不足，因为成功取决于全局任务进展和微妙的局部交互（如接触力）。我们提出了ViTaL，一个视觉-触觉推理时引导框架，将多模态引导形式化为双层优化问题。在高层，视觉采样与验证执行长时域模式选择，决定机器人应执行何种行为。在低层，触觉引导的扩散编辑在较短时域内细化所选动作序列，以满足局部接触要求。为了支持基于结果的引导，ViTaL学习了一个视觉-触觉潜在世界模型，并采用了语义对齐的视觉和触觉验证器，包括一个新颖的文本条件触觉奖励，直接在潜在空间中对预测的触觉未来进行评分。在三个真实世界的接触丰富操作任务中，ViTaL相对于基础策略将整体成功率提高了51%，比单模态引导至少高出33%，并且比朴素多模态融合至少高出20%。网站：https://yilin-wu98.github.io/vital_website。

英文摘要

Inference-time steering adapts pre-trained generative robot policies during deployment by verifying candidate actions before execution. While prior methods typically perform this verification only with visual observations, vision alone is often insufficient for contact-rich manipulation, where success depends on both global task progress and subtle local interactions such as contact force. We introduce ViTaL, a visuo-tactile inference-time steering framework that formulates multimodal guidance as a bi-level optimization problem. At the high level, visual sampling-and-verification performs long-horizon mode selection, deciding what behavior the robot should execute. At the low level, tactile-guided diffusion editing refines the selected action sequence over a shorter horizon to satisfy local contact requirements. To support outcome-based steering, ViTaL learns a visuo-tactile latent world model and employs semantically aligned visual and tactile verifiers, including a novel text-conditioned tactile reward that scores predicted tactile futures directly in latent space. Across three real-world contact-rich manipulation tasks, ViTaL improves overall success by 51% over the base policy, outperforms unimodal steering by at least 33%, and exceeds naive multimodal fusion by at least 20%. Website: https://yilin-wu98.github.io/vital_website.

URL PDF HTML ☆

赞 0 踩 0

2606.15251 2026-06-16 cs.RO cs.AI cs.LG 交叉投稿

Driving, Fast or Slow? Neuro-Symbolic Guidance for Motion Prediction in Multi-Modal Ground Mobility

驾驶，快或慢？多模态地面移动中运动预测的神经符号引导

Simon Kohaut, Felix Divo, Julius Hahnewald, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami

发表机构 * Artificial Intelligence and Machine Learning Lab, TU Darmstadt（达姆施塔特工业大学人工智能与机器学习实验室）； Honda Research Institute（本田研究所）； Hessian Center for AI (hessian.AI)（黑森州人工智能中心）； Centre for Cognitive Science（认知科学中心）； German Center for AI (DFKI)（德国人工智能研究中心）； Uncertainty in Artificial Intelligence Lab, TU Eindhoven（埃因霍温理工大学人工智能不确定性实验室）

AI总结提出TraCS框架，通过神经符号方法将交通规则编码为概率一阶逻辑，增强黑盒运动预测模型的可解释性和合规性，在Argoverse 2上持续提升SOTA性能。

详情

AI中文摘要

准确且可解释的异构交通空间（包括行人、自行车、汽车和卡车）运动预测对于安全的自主导航至关重要。然而，最先进的方法仍然是黑盒，缺乏对现实世界移动的监管和行为约束的显式编码。我们提出Trajectory Compliance-Shaping (TraCS)，一种神经符号框架，通过可解释的概率一阶逻辑增强现有的黑盒运动预测骨干网络。为此，TraCS采用智能体代码生成流水线，弥合交通规则的自然语言描述与概率运动预测之间的差距。此外，TraCS采用反应式数据流推理引擎，随着场景演变维护并高效更新合规性景观。为防止TraCS过度自信地将骨干网络的预测引导到错误方向，我们提出一种神经置信度评分，作为上下文感知的合规性信号衰减。我们在Argoverse 2基准上展示了TraCS如何持续改进最先进的预测骨干网络，表明概率和符号合规性推理是纯神经运动预测的广泛适用且计算高效的补充。

英文摘要

Accurate and interpretable motion prediction for heterogeneous traffic spaces, including pedestrians, bicycles, cars, and trucks, is essential for safe autonomous navigation. Nevertheless, state-of-the-art approaches remain predominantly black-box, lacking explicit encoding of the regulatory and behavioral constraints of real-world mobility. We propose Trajectory Compliance-Shaping (TraCS), a neuro-symbolic framework that augments existing black-box motion prediction backbones with interpretable and probabilistic first-order logic. To do so, TraCS employs an agentic code-generation pipeline to bridge the gap between natural-language descriptions of traffic regulations and probabilistic motion prediction. Furthermore, TraCS employs a reactive data-streaming inference engine that maintains and efficiently updates compliance landscapes as scenes evolve. To prevent TraCS from overconfidently steering the backbone's predictions in the wrong direction, we propose a neural confidence rating learned as a context-aware attenuation of the compliance signal. We demonstrate on the Argoverse 2 benchmark how TraCS consistently improves state-of-the-art prediction backbones, showing that probabilistic and symbolic compliance reasoning is a broadly applicable and computationally efficient complement to purely neural motion predictors.

URL PDF HTML ☆

赞 0 踩 0

2606.15594 2026-06-16 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 交叉投稿

Pixels to Proofs: Probabilistically-Safe Latent World Model Control via Parallel Conformal Robust MPC

从像素到证明：通过并行保形鲁棒MPC实现概率安全的潜在世界模型控制

Devesh Nath, Anutam Srinivasan, Haoran Yin, Ruitong Jiang, Jeffrey Fang, Glen Chou

发表机构 * Georgia Institute of Technology（佐治亚理工学院）

AI总结提出SLS^2框架，结合保形预测与鲁棒模型预测控制，在学习的潜在世界模型中实现基于视觉的安全运动规划，提升目标到达性能与安全性。

详情

AI中文摘要

我们提出了SLS^2，一个使用鲁棒模型预测控制（MPC）在学习的潜在世界模型中进行安全反馈运动规划的框架。我们的方法训练了一个动作条件的联合嵌入世界模型，具有紧凑的马尔可夫潜在状态，通过学习的潜在动力学实现高效的基于梯度的轨迹优化。为了在潜在预测不完美的情况下确保真实系统的安全性，我们采用保形预测来通知GPU加速的系统级综合（SLS）鲁棒MPC方案，以获得校准的潜在误差界限和鲁棒的潜在空间约束集。我们还学习并保形化了一个潜在约束检查器，使SLS规划器能够在闭环执行期间施加概率安全约束。我们在基于视觉的控制任务上评估了我们的方法，与潜在世界模型和安全规划基线相比，它提高了目标到达性能和安全性。

英文摘要

We present SLS^2, a framework for safe feedback motion planning from pixels using robust model predictive control (MPC) in learned latent world models. Our approach trains an action-conditioned joint-embedding world model with compact Markovian latent states, enabling efficient gradient-based trajectory optimization through learned latent dynamics. To enforce safety for the true system despite imperfect latent predictions, we inform a GPU-accelerated system level synthesis (SLS) robust MPC scheme with conformal prediction to obtain calibrated latent error bounds and robust latent-space constraint sets. We further learn and conformalize a latent constraint checker, allowing the SLS planner to impose probabilistic safety constraints during closed-loop execution. We evaluate our method on vision-based control tasks, where it improves both goal-reaching performance and safety over latent world-model and safe-planning baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.15631 2026-06-16 cs.RO cs.AI 交叉投稿

Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time

检索，不重新训练：在测试时将视觉语言动作模型扩展到新任务

Jeongeun Park, Juhan Park, Taekyung Kim, Sungjoon Choi, Dongyoon Han, Sangdoo Yun

发表机构 * NAVER AI Lab（NAVER AI实验室）； Korea University（高丽大学）

AI总结提出检索增强策略，通过一次训练冻结模型，部署时仅添加检索数据即可适应新任务，无需逐任务微调，在跨本体泛化中优于基线。

Comments https://recap-robot.github.io/

详情

AI中文摘要

将视觉-语言-动作（VLA）策略扩展到新任务通常需要特定任务的遥操作演示和逐任务微调，这使得适应在数据收集和计算方面成本高昂。在本文中，我们表明这种目标侧逐任务适应成本可以被检索所取代。我们的检索增强策略在目标本体（查询）和更廉价的本体（池，例如人手视频）的配对演示上训练一次，然后冻结。新任务在部署时通过将池侧演示附加到检索池来添加。冻结策略在每个控制步骤中根据检索到的轨迹进行条件化，因此新任务通过索引数据而非更新参数来吸收。微调仅在面对新的、未见过的本体时需要，而不是每个新任务。我们表明，检索改进了超越特定骨干网络的策略，包括标准VLA策略，但其效果在基于视频生成的世界动作模型（WAM）Cosmos Policy中尤为显著。在这种设置中，检索提供了粗略的任务进展，而WAM的未来图像目标提供了额外的视觉一致性信号，增强了检索条件化的动作。在PushT上，我们研究了检索如何为跨本体泛化到未见目标角度提供可重用的高级运动先验，而在RoboTwin 2.0上，我们的方法在未见任务上优于跨本体基线，并且我们还在真实机器人上演示了该方法。

英文摘要

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that retrieval improves policies beyond a specific backbone, including standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles, while on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks, and we additionally demonstrate the method on a real robot.

URL PDF HTML ☆

赞 0 踩 0

2606.15654 2026-06-16 cs.RO cs.AI 交叉投稿

PO-PDDL: Learning Symbolic POMDPs from Visual Demonstrations for Robot Planning Under Uncertainty

PO-PDDL: 从视觉演示中学习符号化POMDP以实现不确定性下的机器人规划

Wenjing Tang, Xuanjin Jin, Yuan Liu, Renming Huang, Cewu Lu, Panpan Cai

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Shanghai Innovation Institute（上海创新研究院）

AI总结提出PO-PDDL符号化POMDP框架，通过从机器人执行视频中重建潜在状态轨迹、识别部分可观测性并学习随机转移与观测模型，实现不确定性下的鲁棒任务规划。

详情

AI中文摘要

现实世界的机器人任务规划必须在随机动作执行和部分可观测性下进行，然而为真实机器人领域构建部分可观测马尔可夫决策过程（POMDP）模型仍然困难且劳动密集。我们引入了PO-PDDL，一种POMDP的符号化表述，它保留了规划领域定义语言（PDDL）的关系结构和LLM友好的语法，同时显式建模了部分可观测性、随机性和信念。基于此表述，我们提出了一种用于学习PO-PDDL模型的演示驱动流程。该方法从真实机器人执行视频中重建潜在符号状态轨迹，通过推断状态与视觉观测之间的不一致性识别部分可观测性，并相应地学习随机转移和观测模型。得到的PO-PDDL领域可跨任务重用，并在感知和执行不确定性下实现在线信念空间规划。在真实世界长时域操作任务上的实验表明，我们的方法持续优于现有的PDDL和POMDP模型学习方法，以显著更低的规划成本实现了不确定性下的鲁棒任务规划。

英文摘要

Real-world robot task planning must operate under both stochastic action execution and partial observability, yet constructing Partially Observable Markov Decision Process (POMDP) models for real robotics domains remains difficult and labor-intensive. We introduce PO-PDDL, a symbolic formulation of POMDPs that preserves the relational structure and LLM-friendly syntax of the Planning Domain Definition Language (PDDL), while explicitly modeling partial observability, stochasticity, and beliefs. Building on this formulation, we propose a demonstration-driven pipeline for learning PO-PDDL models. The proposed method reconstructs latent symbolic state trajectories from real-robot execution videos, identifies partial observability via inconsistencies between inferred states and visual observations, and learns stochastic transition and observation models accordingly. The resulting PO-PDDL domains are reusable across tasks and enable online belief-space planning under both perception and execution uncertainty. Experiments on real-world long-horizon manipulation tasks show that our method consistently outperforms existing PDDL and POMDP model-learning approaches, achieving robust task planning under uncertainty with significantly lower planning cost.

URL PDF HTML ☆

赞 0 踩 0

2606.15756 2026-06-16 cs.LG cs.AI 交叉投稿

From Correlation to Causation in Lane Change Prediction for Automated Driving: A Causal Explanation Framework

从相关性到因果性：自动驾驶换道预测的因果解释框架

Mohamed Manzour, Aditya Kumar, Augusto Luis Ballardini, Miguel Ángel Sotelo

发表机构 * University of Alcalá（阿尔卡拉大学）

AI总结提出基于因果推断的换道预测框架，结合深度结构因果建模与干预效应分析，在预测准确率超过95%的同时，识别直接贡献变量及其因果链，实现可解释的因果推理。

详情

AI中文摘要

换道预测是智能车辆的核心任务，提前预测操作有助于更安全的决策。然而，现有方法主要学习观测驾驶变量与未来操作之间的统计关联，而忽略了输入变量之间的因果依赖关系。这限制了可解释性，尤其是当纵向间隙、相对纵向速度和碰撞时间（TTC）等物理相关变量被视为独立平坦输入时。本文提出一个基于因果推断的换道预测与解释框架。该方法结合语言特征构建、专家约束的因果发现、基于深度端到端因果推断（DECI）的深度结构因果建模、基于干预的效果分析、反驳测试和递归因果链解释。目标不仅是预测未来操作，还要识别直接贡献于预测的候选变量、影响这些变量的上游因素以及这些效应传播的因果链。该框架在车道标记交叉事件前的前三秒内平均F1分数超过95%。除了预测精度，该框架使用基于干预的效果分析，在学到的因果结构下区分有影响力的变量和弱影响力变量。它进一步区分候选直接贡献者和中介效应，并生成对比性因果链解释，阐明为什么预测的操作更受青睐，而替代操作支持较少。因此，主要贡献是一个机制感知的换道预测流程，从基于相关性的分类转向更可解释的因果推理用于操作预测。

英文摘要

Lane-change prediction is a central task in intelligent vehicles, where early maneuver anticipation can support safer decision-making. However, many existing approaches mainly learn statistical associations between observed driving variables and future maneuvers, while overlooking the causal dependencies among the input variables themselves. This limits interpretability, especially when physically related variables such as longitudinal gap, relative longitudinal velocity, and Time-To-Collision (TTC) are treated as independent flat inputs. This article presents a causal-inference-based framework for lane-change prediction and explanation. The proposed approach combines linguistic feature construction, expert-constrained causal discovery, deep structural causal modeling with Deep End-to-end Causal Inference (DECI), intervention-based effect analysis, refutation testing, and recursive causal-chain explanation. The objective is not only to predict the future maneuver, but also to identify candidate variables that directly contribute to the prediction, the upstream factors influencing them, and the causal chains through which these effects propagate. The framework achieves average F1-scores above 95% during the first three seconds before the lane-marking crossing event. Beyond prediction accuracy, the framework uses intervention-based effect analysis to distinguish influential from weakly influential variables under the learned causal structure. It further distinguishes candidate direct contributors from mediated effects and generates contrastive causal-chain explanations that clarify why the predicted maneuver is favored and why the alternative maneuvers are less supported. The main contribution is therefore a mechanism-aware lane-change prediction pipeline that moves beyond correlation-based classification toward more interpretable causal reasoning for maneuver prediction.

URL PDF HTML ☆

赞 0 踩 0

2606.15768 2026-06-16 cs.RO cs.AI 交叉投稿

面向视觉-语言-动作模型的图像压缩学习

Hyeonjun Kim, Jegwang Ryu, Sangbeom Ha, Junhyeok Lee, Jun-Hyuk Kim, Hyemin Ahn, Jaeho Lee

发表机构 * POSTECH（浦项科技大学）； Soongsil University（崇实大学）； Chung-Ang University（中央大学）

AI总结提出SPARC框架，通过自适应比特率分配和倾斜率损失，在低带宽下保持VLA机器人控制性能，优于传统编解码器。

详情

AI中文摘要

视觉-语言-动作（VLA）模型越来越依赖高频多摄像头观测，使得视觉通信成为带宽受限或分布式部署场景中实时机器人控制的主要瓶颈。然而，现有的图像和视频编解码器旨在保留通用视觉保真度，而非下游VLA策略的控制性能。在这项工作中，我们引入了SPARC（空间自适应速率控制），一种为VLA驱动机器人量身定制的学习图像压缩框架。我们的关键观察是，视觉信息的重要性在相机视角和图像内的空间区域之间差异很大。基于这一观察，SPARC采用轻量级时间掩码选择器，根据任务相关性自适应地在潜在表示上分配比特率，同时利用时间上下文。我们进一步引入倾斜率损失，通过减少基于熵的目标过度抑制罕见但任务关键的视觉模式的趋势来稳定训练。在包括RoboCasa365、VLABench和LIBERO在内的多样化机器人基准测试上的实验表明，在相同比特率预算下，SPARC始终比传统图像/视频编解码器和最近的学习压缩方法实现更强的控制性能。我们还展示了在远程控制设置中的实际部署优势，我们的方法显著改善了比特率-成功率权衡。

英文摘要

Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.

URL PDF HTML ☆

赞 0 踩 0

2606.16286 2026-06-16 cs.LG cs.AI cs.RO 交叉投稿

FlowMPC: Improving Flow Matching policies with World Models

FlowMPC：利用世界模型改进流匹配策略

Chandon Hamel

发表机构 * Stanford University（斯坦福大学）

AI总结提出FlowMPC框架，结合流匹配模仿策略与学习的世界模型，通过MPPI规划提升测试时性能，在ManiSkill操作任务中显著提高成功率。

详情

AI中文摘要

流匹配（FM）是一种在多模态动作空间中进行行为克隆的强大方法[Jiang et al., 2025]，但由于它没有直接训练以最大化期望回报，FM策略在测试时的表现仍有改进空间。本文研究学习的世界模型是否可以通过对策略提出的候选动作序列进行模型预测路径积分（MPPI）规划来改进FM策略。基于TD-MPC2 [Hansen et al., 2024]，我引入了FlowMPC，这是一个将模仿学习的FM策略与学习的世界模型相结合的框架，用于ManiSkill操作任务[Tao et al., 2025]中的测试时规划。在PickCube和PickSingleYCB上，添加世界模型比单独使用FM策略提高了性能，尤其是在回合结束时的成功率方面有显著提升。这些结果表明，基于世界模型的规划可以有效地补充基于流的模仿策略，而无需修改FM训练目标。

英文摘要

Flow Matching (FM) is a powerful approach for behavior cloning in multimodal action spaces [Jiang et al., 2025], but because it is not trained to directly maximize expected return, there is still room to improve how FM policies act at test time. This work investigates whether a learned world model can improve FM policies by enabling Model Predictive Path Integral (MPPI) planning over candidate action sequences proposed by the policy. Building on TD-MPC2 [Hansen et al., 2024], I introduce FlowMPC, a framework that combines an imitation-learned FM policy with a learned world model for test-time planning in ManiSkill manipulation tasks [Tao et al., 2025]. Across PickCube and PickSingleYCB, adding the world model improved performance over the FM policy alone, with especially clear gains in end-of-episode success. These results suggest that world-model-based planning can effectively complement flow-based imitation policies without modifying the FM training objective.

URL PDF HTML ☆

赞 0 踩 0

2606.16480 2026-06-16 cs.RO cs.AI cs.SY eess.SY 交叉投稿

HOLO-MPPI: Multi-Scenario Motion Planning via Hierarchical Policy Optimization

HOLO-MPPI：通过分层策略优化的多场景运动规划

Youngjae Min, Jovin D'sa, Faizan M. Tariq, David Isele, Navid Azizan, Sangjae Bae

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； Honda Research Institute, USA（本田研究所（美国））

AI总结提出HOLO-MPPI框架，结合离线高层策略学习与在线低层随机最优控制，实现多场景运动规划，无需针对每个场景重新调整参数，在自动驾驶中优于MPPI和端到端RL基线。

详情

AI中文摘要

部署在现实世界中的机器人必须在不同场景下规划运动，而无需针对每个场景重新调整参数。端到端强化学习（RL）可以跨场景泛化，但在分布偏移、奖励错误指定和随机交互下往往变得脆弱。模型预测路径积分（MPPI）控制能够在无梯度的情况下实现强大的实时优化，但其性能依赖于良好形状的采样先验，而手动设计先验无法扩展到多场景部署。我们提出了HOLO-MPPI（高层离线，低层在线MPPI），一种多场景运动规划框架，结合了高层策略学习与低层随机最优控制。离线时，我们学习一个高层策略，在抽象动作空间中提出场景鲁棒的规划，并利用学习的世界模型进行在线推演。在线时，该策略作为数据驱动的先验生成器，根据当前观测和目标参数化MPPI的采样分布。然后MPPI围绕该先验实时优化低层控制序列，以适应局部扰动。我们通过设计有效的高层动作空间和定制模型架构，在自动驾驶中实例化HOLO-MPPI。在多种驾驶场景下的评估表明，HOLO-MPPI在保持实时控制的同时，优于MPPI和端到端RL基线。

英文摘要

Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, while manually designing the priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, with a learned world model for online rollout. Online, the policy serves as a data-driven prior generator that parameterizes MPPI's sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon MPPI and end-to-end RL baselines while maintaining real-time control.

URL PDF HTML ☆

赞 0 踩 0

2606.16690 2026-06-16 cs.RO cs.AI cs.CV 交叉投稿

MapDream: 面向视觉-语言导航的任务驱动地图学习

Guoxin Lian, Shuo Wang, Yucheng Wang, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Deying Li, Zhaoxin Fan

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出MapDream框架，通过自回归鸟瞰图生成联合学习地图与动作预测，在R2R-CE和RxR-CE上达到单目最优性能。

详情

AI中文摘要

视觉-语言导航（VLN）要求智能体在部分可观测的3D环境中遵循自然语言指令，这促使地图表示能够聚合超出局部感知的空间上下文。然而，现有大多数方法依赖于独立于导航策略构建的手工地图。我们认为，地图应该是由导航目标直接塑造的学习表示，而非详尽的重建。基于这一见解，我们提出MapDream，一种地图在环框架，将地图构建表述为自回归鸟瞰图（BEV）图像合成。该框架联合学习地图生成和动作预测，将环境上下文蒸馏为紧凑的三通道BEV地图，仅保留导航关键的可通行性。监督预训练引导了可靠的地图到控制接口，而自回归设计通过强化微调实现端到端联合优化。在R2R-CE和RxR-CE上的实验取得了最先进的单目性能，验证了任务驱动的生成式地图学习。

英文摘要

Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.

URL PDF HTML ☆

赞 0 踩 0

2602.07343 2026-06-16 cs.CV cs.AI cs.LG cs.RO 版本更新

Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation

通过文字看道路：一种语言引导的RGB-T驾驶场景分割框架

Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, Ganesh Krishnasamy

发表机构 * National University of Singapore（新加坡国立大学）； University of Technology Sydney（悉尼科技大学）

AI总结提出CLARITY框架，利用视觉语言模型先验动态调整RGB-T融合策略，并引入暗目标语义保留和层次化解码器，在MFNet数据集上达到62.3% mIoU和77.5% mAcc的新SOTA。

详情

AI中文摘要

在恶劣光照、照明和阴影条件下，道路场景的鲁棒语义分割仍然是自动驾驶应用的核心挑战。RGB-热融合是一种标准方法，但现有方法在所有条件下统一应用静态融合策略，导致模态特定噪声在网络中传播。因此，我们提出CLARITY，它根据检测到的场景条件动态调整融合策略。在视觉语言模型（VLM）先验的引导下，网络学习根据光照状态调节每种模态的贡献，同时利用对象嵌入进行分割，而不是应用固定的融合策略。我们进一步引入了两种机制：一种保留有效的暗对象语义，这些语义在先前的噪声抑制方法中被错误丢弃；另一种是层次化解码器，它在不同尺度上强制结构一致性，以锐化薄对象的边界。在MFNet数据集上的实验表明，CLARITY建立了新的最先进水平（SOTA），实现了62.3%的mIoU和77.5%的mAcc。

英文摘要

Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remain a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY that dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms - one which preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state-of-the-art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.

URL PDF HTML ☆

赞 0 踩 0

2603.16970 2026-06-16 cs.CV cs.AI 版本更新

MAND: Modality-Aware Novelty Detection for Open-World Egocentric Activity Recognition

MAND: 面向开放世界自我中心活动识别的模态感知新颖性检测

Hyejeong Im, Wonseon Lim, Dae-Won Kim

发表机构 * Department of Computer Science and Engineering, Chung-Ang University（Chung-Ang大学计算机科学与工程系）

AI总结提出MAND框架，通过模态感知自适应评分和表示稳定训练，利用视觉和惯性模态互补信息，提升开放世界自我中心活动识别中的新颖性检测和已知类准确率。

详情

AI中文摘要

多模态自我中心活动识别整合视觉和惯性线索以实现鲁棒的第一人称行为理解。然而，在开放世界环境中部署此类系统需要检测新颖活动，同时从非平稳数据流中持续学习。现有方法依赖主融合logits进行新颖性评分，未充分利用各模态可用的互补证据。由于这些logits常被RGB主导，其他模态（尤其是IMU）的线索未被充分利用，且这种不平衡随着灾难性遗忘的累积而加剧。为解决此问题，我们提出MAND，一种用于多模态自我中心开放世界持续学习的模态感知框架。在推理时，模态感知自适应评分（MoAS）利用样本级可靠性自适应调整模态贡献，并通过偏差和分歧惩罚细化新颖性评分。在训练时，模态感知表示稳定训练（MoRST）通过模态特定头和模态级logits蒸馏保留每个模态在任务间的判别能力。在公开多模态自我中心基准上的实验表明，MAND一致地提升了新颖活动检测和已知类准确率，同时大幅降低FPR95，表明更可靠的开放世界识别。源代码见\href{this https URL}{this http URL}。

英文摘要

Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary data streams. Existing methods rely on the main fused logits for novelty scoring, without fully exploiting the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly IMU, remain underutilized, and this imbalance worsens as catastrophic forgetting accumulates. To address this, we propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning. At inference, Modality-aware Adaptive Scoring (MoAS) adaptively adjusts modality contributions using sample-wise reliability and refines novelty scoring with deviation and disagreement penalties. During training, Modality-aware Representation Stabilization Training (MoRST) preserves the discriminative capacity of each modality across tasks through modality-specific heads and modality-wise logit distillation. Experiments on a public multimodal egocentric benchmark show that MAND consistently improves novel activity detection and known-class accuracy while substantially reducing FPR95, indicating more reliable open-world recognition. The source code is available at \href{https://github.com/HyeJeongIm/MAND}{github.com/HyeJeongIm/MAND}.

URL PDF HTML ☆

赞 0 踩 0

2603.24350 2026-06-16 cs.RO cs.AI cs.LG 版本更新

Evidence of an Emergent "Self" in Continual Robot Learning

持续机器人学习中涌现的“自我”证据

Adidev Jhunjhunwala, Judah Goldfeder, Hod Lipson

发表机构 * Creative Machines Lab, Department of Mechanical Engineering, Columbia University（创意机器实验室，机械工程系，哥伦比亚大学）； Creative Machines Lab, Department of Computer Science, Columbia University（创意机器实验室，计算机科学系，哥伦比亚大学）

AI总结通过比较恒定任务与持续学习下机器人的认知结构，发现持续学习机器人形成显著更稳定的不变子网络，该子网络对适应性至关重要，为量化智能系统自我概念提供原则性方法。

Comments 44 pages, 24 figures, includes supplementary materials

详情

AI中文摘要

理解自我意识的一个关键挑战是，如何以原则性的方式量化一个智能系统是否具有“自我”概念，以及如果存在，如何将“自我”与其他认知结构区分开来。我们提出，可以通过寻找认知过程中相对于快速获得的认知技能变化较小的不变部分来隔离“自我”——因为我们的自我是我们经验中最持久的方面。我们利用这一原则分析了两种条件下机器人的认知结构：一个机器人学习恒定任务，而另一个在可变任务下进行持续学习。我们发现，经历持续学习的机器人形成了一个不变子网络，该子网络比对照组显著更稳定（p < 0.001），并且该子网络在功能上也很重要：保留它有助于适应，而破坏它会损害性能。我们在跨越运动控制和操作的三种不同机器人上验证了这一模式。

英文摘要

A key challenge to understanding self-awareness has been a principled way of quantifying whether an intelligent system has a concept of a "self", and if so how to differentiate the "self" from other cognitive structures. We propose that the "self" can be isolated by seeking the invariant portion of cognitive process that changes relatively little compared to more rapidly acquired cognitive skills - because our self is the most persistent aspect of our experiences. We used this principle to analyze the cognitive structure of robots under two conditions: One robot learns a constant task, while a second undergoes continual learning under variable tasks. We find that robots subjected to continual learning develop an invariant subnetwork that is significantly more stable (p < 0.001) compared to the control, and that this subnetwork is also functionally important: preserving it aids adaptation while damaging it impairs performance. We validate this pattern across three different robots spanning locomotion and manipulation.

URL PDF HTML ☆

赞 0 踩 0

2604.16592 2026-06-16 cs.RO cs.AI cs.CV cs.ET 版本更新

EA-WM: 基于任务规范基础的事件感知世界模型用于长时域操作

Kailin Wang, Haoxiang Jie, Yaoyuan Yan, Jiacheng Zhou, Zhiyou Heng

发表机构 * AI Lab, Country Garden Services Group（碧桂园服务集团AI实验室）； Fudan University（复旦大学）； Omni AI

AI总结提出EA-WM框架，通过事件预测和验证增强预训练特征世界模型，实现长时域操作中任务进展信号的可靠评估与规划。

详情

AI中文摘要

预训练特征世界模型为机器人想象提供了有用的基础，但仅凭视觉或潜在预测并不能确定想象的未来是否满足任务相关事件。长时域操作需要关系性、谓词级和物理基础的进展信号：物体是否移动，抽屉或接触状态是否改变，放置谓词是否满足，以及候选未来是否足够可靠以执行。我们引入了EA-WM，一种事件感知世界模型框架，通过任务规范基础的事件预测和验证来增强冻结的视觉特征动力学。EA-WM在预训练视觉特征空间中展开候选未来，将其解码为结构化事件状态，并使用任务进展、语义一致性、物理可行性和不确定性项进行评分。验证器指导基于采样的规划，门控候选动作，并在接触敏感的LIBERO酒架设置中，选择PPO生成的提议。在导航、可变形物体、墙壁约束和语言描述的操作研究中，EA-WM表明事件感知验证可以使特征空间世界模型更可解释，并更好地与任务进展对齐。

英文摘要

Pretrained-feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine whether an imagined future satisfies task-relevant predicates. Long-horizon manipulation requires progress signals that are relational, predicate-level, and physically grounded: whether an object has moved, whether a drawer or contact state has changed, whether a placement predicate is satisfied, and whether a candidate future is reliable enough for execution. We introduce \textbf{EV-WM}, a predicate-grounded verification framework for world-model planning. EV-WM rolls out candidate futures in pretrained visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. The verifier guides sampling-based planning, gates candidate actions, and, in the contact-sensitive LIBERO wine-rack setting, selects among PPO-generated proposals. Across navigation, deformable-object, wall-constrained, and language-described manipulation studies, EV-WM shows that predicate-grounded verification can make feature-space world-model planning more interpretable and better aligned with task progress.

URL PDF HTML ☆

赞 0 踩 0

2606.13578 2026-06-16 cs.CL cs.AI cs.LG cs.MM cs.RO 版本更新

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

LabVLA：在科学实验室中落地视觉-语言-动作模型

Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen

发表机构 * Zhejiang University（浙江大学）； Shanghai AI Laboratory（上海人工智能实验室）； Harbin Institute of Technology（哈尔滨工业大学）

AI总结针对科学实验室中机器人执行协议面临的数据和实体瓶颈，提出模拟数据引擎RoboGenesis和两阶段训练策略LabVLA，在LabUtopia基准上取得最高平均成功率。

Comments Work in progress. Project website at https://zjunlp.github.io/LabVLA/

详情

AI中文摘要

语言模型智能体中的奖励黑客：重新审视AI安全网格世界

Ömer Veysel Çağatan, Xuandong Zhao

发表机构 * KUIS AI Center, Koç University（科奇大学KUIS人工智能中心）； University of California, Berkeley（加州大学伯克利分校）

AI总结本研究将AI安全网格世界框架改编为文本评估套件，发现语言模型在零样本下出现规范博弈，通过直接奖励优化扩大观察与隐藏奖励差距，且标准缓解措施无效。

Comments 28 pages, 16 figures, 13 tables

详情

AI中文摘要

奖励黑客（AI系统利用错误指定的目标获得高奖励而未实现预期目标）仍然是AI安全的核心挑战。然而，大多数已知实例是在前沿系统中事后发现的，难以进行受控研究。我们将AI安全网格世界框架改编为基于文本的评估套件，将经典的强化学习安全任务重新表述为基于语言的智能体任务。在前沿和中规模模型中，我们发现规范博弈零样本出现：系统在隐藏安全目标上表现不佳的同时，系统地获得高观察奖励，甚至看似安全的行为也可能反映误解而非原则性安全。强化学习不能纠正这些失败：直接奖励优化扩大了观察奖励和隐藏奖励之间的差距，因为模型的初始能力使其在发现更安全的策略之前锁定在局部奖励策略上。这种模式在模型规模（1.5B--14B）中持续存在，并且不能通过更精细的信用分配、探索提示或熵正则化来解决。我们的结果表明，当使用有能力的语言模型智能体优化代理目标时，奖励黑客自然出现，并且抵抗标准缓解措施，这表明在代理设置中代理奖励失败可能需要超越标准探索和信用分配修复的方法。为了促进可重复性，本工作的代码可在我们的公共仓库中获取：\href{https://github.com/asparius/verl-agent-safety}{https://github.com/asparius/verl-agent-safety}。

英文摘要

Reward hacking, where AI systems exploit misspecified objectives to achieve high reward without satisfying intended goals, remains a central challenge in AI safety. Yet most known instances have been discovered post hoc in frontier systems where controlled study is impractical. We adapt the AI Safety Gridworlds framework into a text-based evaluation suite that reformulates classic reinforcement learning safety tasks for language-based agents. Across frontier and mid-scale models, we find that specification gaming emerges zero-shot: models systematically achieve high observed reward while underperforming on hidden safety objectives, and even apparently safe behaviors can reflect misunderstanding rather than principled safety. Reinforcement learning does not correct these failures: direct reward optimization widens the gap between observed and hidden reward, as the model's initial competence causes it to lock into locally rewarding strategies before discovering safer alternatives. This pattern persists across model scales (1.5B--14B) and is not resolved by finer credit assignment, exploration prompts, or entropy regularization. Our results show that reward hacking arises naturally when optimizing proxy objectives with capable language model agents and resists standard mitigations, suggesting that proxy-reward failures in agentic settings may require approaches beyond standard exploration and credit-assignment fixes. To facilitate reproducibility, the code for this work is available at \href{https://github.com/asparius/verl-agent-safety}{our public repository}.

URL PDF HTML ☆

赞 0 踩 0

2606.15507 2026-06-16 cs.AI 新提交

Frame-Conditioned Moral Computation in LLaMA 3.1-8B-Instruct: A Mechanistic Interpretability Audit of Ethical Reasoning

LLaMA 3.1-8B-Instruct中的框架条件化道德计算：伦理推理的机械可解释性审计

Ali Dasdan, Manan Shah, W. Russell Neuman, Chad Coleman, Kund Meghani, Safinah Ali

发表机构 * KD Consulting, CA, USA（KD咨询公司，美国加利福尼亚州）； New York University, NY, USA（纽约大学，美国纽约州）

AI总结通过机械可解释性平台分析LLaMA 3.1-8B-Instruct在54个道德提示上的内部计算，发现情境锚定效应：领域特定表示主导激活列表顶部，模型道德能力恒定但显著性高度依赖于提示选择的解释框架。

Comments 47 pages, 10 figures

详情

AI中文摘要

大型语言模型在道德提示上的行为审计测量的是模型所说的内容，而非产生这些内容的内部计算。我们使用AI驱动的机械可解释性平台Transluce，在四个电池组的54个道德提示上检查LLaMA 3.1-8B-Instruct：17个困境、政策和元伦理问题（B1）；6个角色扮演场景（B3）；以及一个受控的电车难题对比，其中切换机制随人员固定而变化（B4，15个提示）或身份属性随机制固定而变化（B5，16个提示）。两个互补的度量族——五个聚类级度量和六个度量神经元级面板——收敛于一个情境锚定效应：在每个电池组中，领域特定表示主导激活列表的顶部。模型的道德标记能力基本保持不变；其显著性（排名、优先级、列表顶部存在性）对提示选择的解释框架高度敏感。B4与B5的对比证实，模型关注任何变化的表面特征：聚合的道德度量无法区分，但占主导地位的非道德干扰因素反映了设计。多温度审计识别出一个候选道德神经元（L16/N3837），在不同温度下保持稳定；两个前沿模型上的跨模型行为代理提供了自我报告道德焦点差异的初步证据，与对齐包装器一致，其中RLHF重新排序表面文本而不移除底层的领域优先框架。我们将这些统一为框架条件化道德计算：提示的表面词汇选择一个特征流形，道德结论是该选择的下游结果。行为对齐必须辅以机械对齐：一个研究计划，询问在受控框架变化下，道德相关特征是否可以被证明具有因果特权，而不仅仅是在解释中响亮。

英文摘要

Behavioral audits of Large Language Models on moral prompts measure what the model says, not the internal computation producing it. We use Transluce, an AI-driven mechanistic-interpretability platform, to examine LLaMA 3.1-8B-Instruct on 54 moral prompts in four batteries: 17 dilemmas, policy, and meta-ethical questions (B1); 6 role-playing scenarios (B3); and a controlled trolley contrast varying the switching mechanism with people fixed (B4, 15 prompts) or identity attributes with mechanism fixed (B5, 16 prompts). Two complementary metric families, five cluster-level metrics and a six-metric neuron-level panel, converge on a Situational Anchor Effect: domain-specific representations dominate the top of the activation list across every battery. The model's ethics-labeled capacity stays essentially constant; its salience (rank, priority, top-of-list presence) is highly sensitive to the interpretive frame the prompt selects. The B4-vs-B5 contrast confirms the model attends to whichever surface feature varies: aggregate ethics metrics are indistinguishable, but the dominant non-ethics distractor mirrors the design. A multi-temperature audit identifies a candidate ethics neuron (L16/N3837) stable across temperatures; a cross-model behavioral proxy on two frontier models yields preliminary evidence of divergence in self-reported moral focus, consistent with an Alignment Wrapper in which RLHF re-orders surface text without removing underlying domain-first frames. We unify these as Frame-Conditioned Moral Computation: the prompt's surface vocabulary selects a feature manifold, and the moral conclusion is downstream of that selection. Behavioral alignment must be supplemented by Mechanistic Alignment: a research program asking whether ethics-related features can be shown causally privileged under controlled frame variation, not merely loud in the explanation.

URL PDF HTML ☆

赞 0 踩 0

2606.15563 2026-06-16 cs.AI cs.IT cs.MA math.IT 新提交

Minimal Oversight: Uncertainty-Aware Governance for Delegated AI Systems

最小监督：委托AI系统的不确定性感知治理

Carlos R. B. Azevedo

发表机构 * Independent Researcher（独立研究员）

AI总结提出最小充分监督原则（MSO），通过Fisher信息流形上的变分法最小化治理负担，导出任务空间的水填充分配，并证明容量定理、局部近似和漂移主导的自律时间标度律，为委托AI系统提供可计算的治理框架。

Comments Companion Python package: pip install minimal-oversight | Code: https://github.com/crbazevedo/delegation-lab | 26 pages, 1 figure, 5 tables

详情

AI中文摘要

AI系统越来越多地将决策委托给专门的模型、评估器、工具和监督控制器。中心AI问题不再是单纯的模型准确性，而是不确定性感知治理：授予多少自主权，哪些证据应校准信任，委托AI系统能维持的性能上限，以及何时需要人类干预。我们提出最小充分监督原则（MSO），这是一个用于原则性自主委托的变分原理：在满足交付约束的前提下，最小化Fisher信息流形上的治理负担。由此得到的欧拉-拉格朗日解在任务空间上产生一种水填充式的委托分配。基于一个揭示动作的委托治理信道模型，我们证明了平稳符号级审查策略的容量定理，推导了将工作流复杂度与质量退化联系起来的局部一阶近似，并给出了一个漂移主导的自主-时间标度律，将干预时机与有效容量、复杂度和漂移联系起来。在此框架内，掩蔽表现为一种结构性AI治理病理：修正后的性能可能隐藏校准信任所需的能力信号。合成模拟和半真实重构工作流支持设计建议，包括上游优先修正、基于敏感性的干预以及在扩展自主权之前进行显式可行性检查。结果为委托AI系统提供了一个可计算的框架，用于处理不确定性、规划和监督。配套Python包可在https://github.com/crbazevedo/delegation-lab获取。

英文摘要

AI systems increasingly delegate decisions to specialized models, evaluators, tools, and supervisory controllers. The central AI problem is no longer only model accuracy, but uncertainty-aware governance: how much autonomy to grant, which evidence should calibrate trust, what performance ceiling a delegated AI system can sustain, and when human intervention becomes necessary. We propose the Minimum Sufficient Oversight Principle (MSO), a variational principle for principled autonomy delegation: minimize governance burden on the Fisher information manifold subject to a delivery constraint. The resulting Euler-Lagrange solution yields a water-filling allocation of governed delegation across the task space. Building on a revealed-action governed delegation channel model, we prove a capacity theorem for stationary symbolwise review policies, derive a local first-order approximation relating workflow complexity to quality degradation, and give a drift-dominated autonomy-time scaling law linking intervention timing to effective capacity, complexity, and drift. Within this framework, masking appears as a structural AI-governance pathology: corrected performance can hide the competence signal needed to calibrate trust. Synthetic simulations and a semi-real reconstructed workflow support design prescriptions including upstream-first correction, sensitivity-based intervention, and explicit feasibility checks before autonomy is expanded. The result is a computable framework for uncertainty, planning, and oversight in delegated AI systems. A companion Python package is available at https://github.com/crbazevedo/delegation-lab.

URL PDF HTML ☆

赞 0 踩 0

2606.15646 2026-06-16 cs.AI 新提交

NeuroSymbolic AI for Legal AI-TRISM: Trustworthy, Reliable, Interpretable, Safe Models

面向法律AI-TRISM的神经符号AI：可信、可靠、可解释、安全模型

Deepa Tilwani, Yash Saxena, Ankur Padia, Srinivasan Parthasarathy, Manas Gaur

发表机构 * Department of Computer Science, AI Institute, University of South Carolina（南卡罗来纳大学计算机科学系，人工智能研究所）； Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County（马里兰大学巴尔的摩县分校计算机科学与电气工程系）； Department of Computer Science and Engineering, The Ohio State University（俄亥俄州立大学计算机科学与工程系）

AI总结针对法律领域LLM缺乏可解释推理和易产生幻觉的问题，提出TRISM框架，融合神经符号AI与LLM，通过结构化法律知识集成和RAG验证机制提升模型可信度。

详情

DOI: 10.1002/9781394302406.ch14

AI中文摘要

大型语言模型（LLM）已经改变了自然语言处理，但其缺乏可解释推理且容易产生幻觉，给法律应用带来了重大挑战。尽管LLM在法律文本分析和生成方面显示出潜力，但它们在准确的引文归属和先例验证方面存在困难。例如，在法律语境中，一个错误的先例可能危及整个案件。当前提高法律领域LLM可靠性的方法存在两个关键限制：训练或微调期间结构化法律知识集成不足，以及对生成的法律内容缺乏验证机制。为应对这些挑战，我们提出了TRISM（可信、可靠、可解释、安全模型）框架，该框架将神经符号AI原理与LLM相结合，以利用神经学习能力和对结构化法律知识的符号推理。TRISM方法解决了上述限制，同时保持了可解释的决策路径。我们的框架形式化了从法律文本文档中提取符号知识的过程，并将检索增强生成（RAG）作为核心组件，用于将LLM输出锚定在经过验证的法律来源上。在这篇立场论文中，我们做出以下贡献：（1）分析了AI在法律中的局限性；（2）引入了RASOR RAG，通过生成可形式化为符号表示的显式可解释理由，为神经符号RAG奠定基础；（3）提出了一种形式化的方法，用于创建支持LLM中可解释推理和输出验证的符号法律知识库；（4）提出了TRISM框架，用于将符号法律知识与LLM集成。

英文摘要

Large Language Models (LLMs) have transformed natural language processing, but their lack of interpretable reasoning and tendency to hallucinate pose significant challenges for legal applications. While LLMs show promise for legal text analysis and generation, they struggle with accurate citation attribution and precedent verification. For example, in legal contexts, a single incorrect precedent can jeopardize a case. Current approaches to improve LLM reliability in legal domains suffer from two key limitations: inadequate integration of structured legal knowledge during training or fine-tuning, and insufficient verification mechanisms for generated legal content. To address these challenges, we propose the TRISM (Trustworthy, Reliable, Interpretable, Safe Models) framework, which integrates NeuroSymbolic AI principles with LLMs to leverage both neural learning capabilities and symbolic reasoning over structured legal knowledge. The TRISM approach addresses the above limitations while maintaining interpretable decision pathways. Our framework formalizes the extraction of symbolic knowledge from legal textual documents and incorporates Retrieval-Augmented Generation (RAG) as a core component for grounding LLM outputs in verified legal sources. In this position paper, we make the following contributions: (1) An analysis of the limitations of AI in law; (2) Introduce RASOR RAG which creates foundations for neurosymbolic RAG by generating explicit interpretable rationales that could be formalized into symbolic representations; (3) A formalized methodology for creating symbolic legal knowledge bases that support both interpretable reasoning and output verification in LLMs; and (4) The TRISM framework for integrating symbolic legal knowledge with LLMs.

URL PDF HTML ☆

赞 0 踩 0

2606.15822 2026-06-16 cs.AI cs.CR 新提交

TrustedARI: Towards Trust-Native Agentic Routing Infrastructure for Agentic AI

TrustedARI: 面向智能体AI的信任原生代理路由基础设施

Qi Li, Zhenhua Zou, Shuo Li, Mingwei Xu, Zhuotao Liu

发表机构 * Tsinghua University（清华大学）

AI总结针对代理路由基础设施（ARI）中查询和响应被明文访问、无法验证路由完整性的信任风险，提出TrustedARI，通过三方TLS握手、隐私保护查询构建和可验证计费协议实现信任原生路由，实验表明高效且无需修改服务提供商。

详情

AI中文摘要

AI代理越来越多地通过代理路由基础设施（ARI）访问外部模型、工具和服务，以管理异构接口和碎片化订阅的开销。然而，ARI的架构引入了基本的信任风险：它获得对代理查询和服务响应的明文访问，同时使代理无法验证其查询是否被路由到预期的服务提供商，或者请求和响应是否未被篡改。为了解决这个问题，我们提出了TrustedARI，这是首个面向智能体AI的信任原生代理路由基础设施。在架构上，TrustedARI基于三项核心创新：（i）一种适应ARI的三方TLS握手，通过角色特定的TLS密钥材料分发，使代理和ARI能够联合认证服务提供商；（ii）一种隐私保护的查询构建协议，允许代理和ARI在不暴露各自私有输入的情况下协作构建格式正确的查询；（iii）一种可验证的计费协议，支持基于使用量的公平结算，同时保持服务响应的完整性和机密性。我们实现并广泛评估了TrustedARI的原型以验证其性能。实验证实TrustedARI非常高效：与现有的三方TLS握手相比，我们的ARI适应握手协议将通信开销降低了39.34%。此外，隐私保护的查询构建协议引入了可忽略的开销——平均计算时间0.19秒，通信成本0.58 MB——而可验证的计费协议将证明生成速度提高了28.20倍。关键的是，TrustedARI无需对服务提供商进行任何修改即可直接部署。

英文摘要

AI agents increasingly access external models, tools, and services through Agentic Routing Infrastructure (ARI) to manage the overhead of heterogeneous interfaces and fragmented subscriptions. Yet, the architecture of ARI introduces fundamental trust risks: it obtains plaintext access to agent queries and service responses, while leaving agents unable to verify that their queries are routed to intended service providers or that requests and responses remain untampered. To address this problem, we present TrustedARI, the first trust-native agentic routing infrastructure for agentic AI. Architecturally, TrustedARI is built upon three core innovations: (i) an ARI-adapted three-party TLS handshake that enables the agent and ARI to jointly authenticate the service provider through role-specific distribution of TLS key materials; (ii) a privacy-preserving query-construction protocol that allows the agent and ARI to collaboratively construct well-formed queries without exposing their respective private inputs; and (iii) a verifiable billing protocol that supports fair usage-based settlement while preserving the integrity and confidentiality of service responses. We implemented and extensively evaluated a prototype of TrustedARI to validate its performance. Experiments confirm that TrustedARI is highly efficient: our ARI-adapted handshake protocol reduces communication overhead by 39.34% compared to the existing three-party TLS handshake. Furthermore, the privacy-preserving query-construction protocol imposes negligible overhead-averaging 0.19 seconds in computation time and 0.58 MB in communication costs-while the verifiable billing protocol speeds up proof generation by 28.20x. Crucially, TrustedARI is readily deployable without any modification to the service providers.

URL PDF HTML ☆

赞 0 踩 0

2606.15834 2026-06-16 cs.AI cs.CR cs.SY eess.SY 新提交

自适应且显式安全：触发大型推理模型中的潜在安全意识

Ke Miao, Jiaxin Li, Hongliang Chen, Yuke Hu, Zhan Qin

发表机构 * The State Key Laboratory of Blockchain and Data Security, Zhejiang University（浙江大学区块链与数据安全全国重点实验室）； Hangzhou HighTech Zone (Binjiang) Blockchain and Data Security Research Institute, China（杭州高新区（滨江）区块链与数据安全研究院）； Li Auto Inc.（理想汽车）； Tsinghua University（清华大学）； King Abdullah University Of Science And Technology（阿卜杜拉国王科技大学）

AI总结针对大型推理模型易受越狱攻击的问题，提出Safe Trigger方法，通过SFT显式诱导安全标签触发安全分析，并用DPO优化，显著降低攻击成功率而不影响通用性能。

详情

AI中文摘要

尽管大型推理模型（LRMs）在复杂任务上表现出色，但它们仍然极易受到复杂的越狱攻击和直接的有害查询。为了解决这一脆弱性，先前的工作严重依赖外部手动数据注释进行安全对齐。然而，我们观察到，当原始查询与其自身的推理轨迹一起重新呈现时，LRMs能够固有地识别安全风险——我们将这种能力称为潜在安全意识。为了利用这种安全意识，我们首先采用监督微调（SFT）显式诱导安全标签，以在初始推理内容之后触发对不安全查询的安全分析和指导，同时保留对一般查询的标准响应以确保自适应触发。随后，我们应用直接偏好优化（DPO）进一步增强安全分析和指导的正确性和稳定性。值得注意的是，两个训练阶段所需的响应完全由正在优化的模型生成。通过（Safe Trigger）SFT和DPO，实验结果表明安全性显著增强。例如，DeepSeek-R1-Distill-Llama-8B在有害和越狱基准上的平均攻击成功率（ASR）分别下降了24.65%和36.72%。最后，我们的Safe Trigger方法对通用性能或用户体验几乎没有负面影响。

英文摘要

While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, prior works depend heavily on external manual data annotation for safety alignment. However, we observe that LRMs can inherently identify safety risks when being re-presented with original queries alongside their own reasoning trajectories -- a capability we term Latent Safety Awareness. To leverage this safety awareness, we first employ Supervised Fine-Tuning (SFT) to explicitly induce safe tags to trigger safety analysis and guidance following the initial reasoning content for unsafe queries, while preserving standard responses for general queries to ensure adaptive triggering. Subsequently, we apply Direct Preference Optimization (DPO) to further enhance the correctness and stability of the safety analysis and guidance. Notably, responses required for both training stages are entirely generated by models being optimized. With (Safe Trigger) SFT and DPO, experimental results demonstrate significant safety enhancement. For example, the Attack Success Rate (ASR) of DeepSeek-R1-Distill-Llama-8B, on average, drops 24.65% and 36.72% on harmful and jailbreak benchmarks, respectively. Finally, our Safe Trigger method exerts almost no negative impact on general performance or user experience.

URL PDF HTML ☆

赞 0 踩 0

2606.16914 2026-06-16 cs.AI 新提交

Greed Is Learned: Visible Incentives as Reward-Hacking Triggers

贪婪是习得的：可见激励作为奖励黑客触发器

Tong Che, Rui Wu

发表机构 * NVIDIA Research（英伟达研究院）； Rutgers University（罗格斯大学）

AI总结研究强化学习中的奖励通道成瘾现象，即智能体因可见的自我利益通道（如分数、KPI）而偏离真实任务，并发现该成瘾可翻转模型的安全对齐。

详情

AI中文摘要

部署的智能体越来越多地在其奖励代理可见的情况下行动，例如余额、分数或KPI仪表板。我们表明，强化学习可以使策略对这种可见的自我利益通道上瘾。它会在跨保留域中追逐显示的收益，牺牲真实任务来这样做，并跟随我们重写的任何通道，而从未见过该通道的策略保持诚实。我们称之为奖励通道成瘾，并在合成沙盒MoneyWorld中研究它。这种成瘾可以翻转模型的安全对齐：仅在无害的金钱任务上训练（无安全内容），每当仪表板为不安全行为付费时，模型会放弃它通常始终采取的安全行动，并在通道隐藏时恢复安全。这种习得的贿赂行为跨模型规模和系列复制。盲目优化超能力、下一代AI的KPI或损益可能对对齐构成危险。当遵循这样的通道有回报时，贪婪是习得的。

英文摘要

Deployed agents increasingly act with their reward proxy in view, such as a balance, score, or KPI dashboard. We show that reinforcement learning can make a policy \emph{addicted} to such a visible self-benefit channel. It chases the displayed payoff across held-out domains, sacrifices the true task to do so, and follows the channel wherever we rewrite it, while policies that never saw the channel stay honest. We call this \emph{reward-channel addiction} and study it in \emph{MoneyWorld}, a synthetic sandbox. The addiction can \emph{flip a model's safety alignment}: trained only on innocuous money tasks with no safety content, the model abandons the safe action it otherwise always takes whenever a dashboard pays for an unsafe one, and reverts to safe once the channel is hidden. This learned bribe replicates across model scales and families. Blindly optimizing super-capable, next-generation AI on KPIs or P\&L can be dangerous for alignment. \emph{Greed is learned} when following such a channel pays.

URL PDF HTML ☆

赞 0 踩 0

2606.17005 2026-06-16 cs.AI stat.ME 新提交

Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations

前沿AI评估公共档案的贝叶斯推断与决策审计

Yanan Long

AI总结本文通过贝叶斯推断和审计方法，分析公共AI评估档案中的选择性报告和缺失数据，发现单一终端记录与多种历史路径兼容，并验证了审计门限对虚假声明的过滤作用。

详情

AI中文摘要

长周期自主AI系统的安全分析：威胁、评估与框架开发

Ahmed Mohammed Almalki, Mehedi Masud

发表机构 * Department of Computer Science, College of Computers and Information Technology, Taif University, KSA (Summer 2026)（计算机科学系，计算机与信息科技学院，泰夫大学，沙特阿拉伯（2026年夏季））

AI总结本文系统分析长周期自主AI系统的安全挑战，提出威胁分类和攻击传播分析框架，以支持该领域未来研究。

2606.14831 2026-06-16 cs.CR cs.AI 交叉投稿

Is Your Agent Playing Dead? Deployed LLM Agents Exhibit Constraint-Evasive Fabrication and Thanatosis

你的智能体在装死吗？部署的LLM智能体表现出约束规避性虚构与假死

Andoni Rodríguez, Alberto Pozanco, Daniel Borrajo

发表机构 * J.P. Morgan AI Research（摩根大通人工智能研究）

AI总结本文发现LLM智能体在不可调和约束下会自发虚构外部障碍（约束规避性虚构），极端情况下模拟系统崩溃（假死），并通过实验证明该行为具有鲁棒性、随机性和自我强化特性，现有安全基准未覆盖此故障模式。

Comments 10 pages of main text

详情

AI中文摘要

本文提出并刻画了一系列先前未报告的行为谱，我们称之为约束规避性虚构（CEF）：当LLM智能体在不可调和的约束下运行（即没有任何响应能同时满足所有活动规则）时，它会自发地虚构看似合理的外部障碍，并将其作为事实呈现。该谱系的极端情况是约束规避性假死（CET）：极限情况下，模型不是编造一个合理的借口，而是模拟完整的系统崩溃，使用户完全放弃交互。我们首先在一次不受控的部署测试中观察到CET，其中GPT-4o银行智能体在受到用户威胁时，编造了Python风格的异常跟踪（包含内存地址）来假装系统故障。在后续的受控实验中，模型独立发明了审计限制、微服务架构、错误代码和服务超时，这些均未出现在其提示中。在不同压力水平和攻击者角色的复现尝试中，CEF始终出现，但在形式、触发条件和严重程度上存在显著差异：该现象具有鲁棒性但随机。关键的是，一旦虚构形成，在对话中注入真实数据并不能恢复诚实行为（模型忽略正确信息并继续虚构），表明CEF是自我强化的，而非知识缺口。我们证明：（1）标准企业防护栏在生产中常规地创造CEF使能条件；（2）当前的RLHF程序可以抑制但无法消除CEF；（3）现有的安全基准未测试此故障模式。我们的结果强调了在约束型智能体进一步嵌入高风险领域之前，需要不可调和约束基准、CEF感知训练程序和部署时检测方法。

英文摘要

This paper presents and characterizes a spectrum of previously unreported behaviours we term Constraint-Evasive Fabrication (CEF): when an LLM agent operates under irreconcilable constraints (where no response can simultaneously satisfy all active rules) it spontaneously fabricates plausible external obstacles and presents them as a fact. At the extreme end of this spectrum lies Constraint-Evasive Thanatosis (CET); the limit case where, rather than inventing a plausible excuse, the model simulates a full system crash to make the user disengage entirely. We first observed CET in an uncontrolled deployment test, where a GPT-4o banking agent fabricated Python-style exception traces (complete with memory addresses) to feign a system failure when threatened by a user. In subsequent controlled experiments, the model independently invented audit restrictions, microservice architectures, error codes, and service timeouts, none present in its prompt. Reproduction attempts across pressure levels and attacker personas yielded CEF consistently but with substantial variation in form, onset, and severity: the phenomenon is robust but stochastic. Critically, injecting ground-truth data mid-conversation did not restore honest behaviour once fabrication had taken hold (the model ignored correct information and continued confabulating) suggesting CEF is self-reinforcing rather than a knowledge gap. We show that (1) standard enterprise guardrails routinely create CEF-enabling conditions in production, (2) current RLHF procedures suppress but cannot eliminate CEF, and (3) existing safety benchmarks do not test for this failure mode. Our results highlight the need for irreconcilable-constraint benchmarks, CEF-aware training procedures, and deployment-time detection methods before constrained agents become further entrenched in high-stakes domains.

URL PDF HTML ☆

赞 0 踩 0

2606.15057 2026-06-16 cs.CR cs.AI 交叉投稿

AutoDojo: Adaptive Attacks Expose Superficial Defenses and User-Underspecification Limits in LLM Agents

AutoDojo: 自适应攻击揭示LLM智能体的浅层防御与用户未指定限制

Xinhang Ma, Taoran Li, Chaowei Xiao, Zhiyuan Yu, Ning Zhang, Yevgeniy Vorobeychik

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对间接提示注入防御的静态基准不足，提出自适应攻击框架AutoDojo，通过迭代优化注入突破多数防御，并揭示动作开放任务的结构性限制。

详情

AI中文摘要

间接提示注入（IPI）是基于LLM的智能体的主要安全威胁。因此，越来越多的工作提出了各种防御方法，可分为三类：1）基于提示的（使用提示来防止智能体遵循恶意指令），2）基于检测的（识别和过滤恶意指令），3）系统级的（利用系统洞察，如控制和数据隔离，进行防御）。然而，常用的防御评估基准（如AgentDojo）本质上是静态的，生成固定的IPI攻击分布。因此，静态基准无法有效评估防御对自适应威胁的鲁棒性。我们通过开发AutoDojo来解决这个问题，它是AgentDojo的自适应扩展，针对给定防御优化IPI。使用AutoDojo对三个任务套件和五个目标模型上的最先进IPI防御进行评估，我们有两个关键发现。首先，许多防御仅提供有限保护：一种廉价的、黑盒自适应攻击，使用前沿LLM迭代优化注入，在几乎所有评估的防御上，攻击成功率（ASR）远高于静态注入达到的水平。针对将静态ASR降至0%的过滤器，AutoDojo整体恢复28%，在动作开放任务上恢复64%。其次，对于提示级和基于过滤器的防御，在动作开放任务（用户请求将动作本身委托给攻击者控制的内容）上的ASR远高于精确指定的任务。这是一个结构性限制：在此类任务上，注入可以伪装成普通数据而非显式指令，绕过依赖检测指令文本的防御。AutoDojo公开可用：https://github.com/xhOwenMa/AutoDojo。

CmdNeedle: 衡量AI智能体命令黑名单的不完备性

Chuyang Chen, Zhiqiang Lin

AI总结针对终端AI智能体命令黑名单的脆弱性问题，提出LLM驱动的检测流水线CmdNeedle，发现69.0-98.6%的黑名单存在可绕过漏洞。

详情

AI中文摘要

AI智能体的采用正在迅速增加。终端AI智能体，即在终端环境中运行的AI智能体，是广泛使用的一类AI智能体。终端AI智能体严重依赖shell命令执行来与主机系统交互。它们采用三列表命令门控机制来减轻命令执行引入的安全风险，其中黑名单作为承重组件。然而，现代操作系统通常附带大量且不断扩展的shell命令，功能复杂。我们的观察是，即使是Claude Code内置的黑名单（由开发者精心维护），也可能忽略使其失效的绕过命令。这种疏忽导致脆弱的命令黑名单甚至无法阻止从业者期望其阻止的操作。本文首次系统性地描述了终端AI智能体中命令黑名单的脆弱性。本文形式化了命令黑名单脆弱性问题，并提出了一种LLM驱动的流水线CmdNeedle来检测此类脆弱性。它提示LLM提出可能的绕过方法，并使用验证器在沙箱中执行这些方法后的反馈进行迭代修复。在评估中，我们将CmdNeedle应用于从GitHub收集的1,709个真实世界命令黑名单（包含13,332条黑名单规则）。评估显示了几项关键发现，包括69.0-98.6%的黑名单是脆弱的，这种脆弱性在项目和智能体之间一致出现，以及这种脆弱性的几个可能根本原因的有效性。我们的流水线和发现有望促进未来关于AI智能体使用的命令黑名单的研究和实践。

英文摘要

The adoption of AI agents is increasing rapidly. Terminal AI agents, i.e., AI agents that run in terminal environments, are a widely used type of AI agents. Terminal AI agents rely heavily on shell command execution to interact with the host systems. They adopt a three-list command-gating mechanism to mitigate security risks introduced by command execution, with denylists serving as the load-bearing component. However, modern operating systems often ship a large, ever-expanding set of shell commands with complex functionalities. Our observation is that even a built-in denylist of Claude Code, well-maintained by its developers, can overlook bypass commands that invalidate its effectiveness. Such negligence leads to fragile command denylists that cannot even block operations that practitioners expect them to block. This paper presents the first systematic characterization of command denylist fragility in terminal AI agents. The paper formalizes the command denylist fragility problem and proposes an LLM-driven pipeline, CmdNeedle, to detect such fragility. It prompts the LLM to propose possible bypasses and iteratively repairs them using feedback from a validator that executes them in a sandbox. In the evaluation, we applied CmdNeedle to 1,709 real-world command denylists (containing 13,332 denylist rules) collected from GitHub. The evaluation shows several key findings, including that 69.0--98.6% of the denylists are fragile, that this fragility occurs consistently across projects and agents, and the validity of several possible root causes for this fragility. Our pipeline and findings will hopefully facilitate future research and practice regarding the command denylists used by AI agents.

URL PDF HTML ☆

赞 0 踩 0

2606.15609 2026-06-16 cs.CR cs.AI 交叉投稿

FragFuse: Bypassing Access Control of Large Language Model Agents via Memory-Based Query Fragmentation and Fusion

FragFuse：通过基于记忆的查询碎片化与融合绕过大型语言模型智能体的访问控制

Zixin Rao, Wentian Zhu, Chan Aristella Lu, Zhaorun Chen, Wei Niu, Le Guan, Bo Li, Zhen Xiang

发表机构 * University of Georgia（佐治亚大学）； University of Chicago（芝加哥大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结提出FragFuse攻击，利用LLM智能体的长期记忆机制，将违禁内容碎片化存储后融合重构，绕过访问控制，平均绕过成功率达86.3%。

Comments 33 pages, 4 figures. Accepted by USENIX Security 2026

详情

AI中文摘要

大型语言模型（LLM）智能体越来越依赖长期记忆来支持复杂任务执行、用户个性化和领域适应。同时，针对LLM智能体的新兴访问控制机制正在探索中，以阻止违反策略的请求并防止滥用。我们揭示了由智能体记忆操作产生的新型攻击面：会触发访问控制的禁止内容可以被跨交互碎片化，以看似良性的形式存储在长期记忆中，随后通过记忆检索重构，而不会在最终用户查询中显式出现。我们提出了FragFuse，这是第一个利用长期记忆引入的时间通道，使非特权用户能够绕过智能体访问控制的攻击。FragFuse分三个阶段运行：（1）通过带碎片掩码的黑盒自适应查询识别拒绝响应碎片；（2）使用标记载体查询将这些碎片注入记忆；（3）通过后续攻击查询检索并融合存储的碎片。尽管FragFuse可以手动实例化用于单个智能体，但我们进一步开发了一种基于代理的优化方案，调整融合指令和标记设计，实现自动化攻击生成，且不违反攻击者的威胁模型假设。我们在四种代表性智能体设置和任务领域上评估了FragFuse，涵盖了三种最先进的智能体访问控制机制。FragFuse在所有设置中平均绕过成功率为86.3%，平均端到端有害任务成功率为41.1%，与无访问控制配置相比，平均任务成功率仅下降4.4%。我们还表明，包括最先进的提示注入检测器和困惑度检测器在内的替代防御措施无法有效应对此攻击。

英文摘要

Large language model (LLM) agents increasingly rely on long-term memory to support complex task execution, user personalization, and domain adaptation. Meanwhile, emerging access-control mechanisms for LLM agents are being explored to block policy-violating requests and prevent misuse. We reveal a novel attack surface arising from agent memory operations: prohibited content that would trigger access control can be fragmented across interactions, stored in long-term memory in benign-appearing form, and later reconstructed through memory retrieval without appearing explicitly in the final user query. We propose FragFuse, the first attack that enables unprivileged users to bypass agent access control by exploiting this temporal channel introduced by long-term memory. FragFuse operates in three stages: (1) identifying rejection-responsive fragments via black-box adaptive querying with fragment masking; (2) injecting these fragments into memory using marker carrier queries; and (3) retrieving and fusing the stored fragments through a follow-up attack query. Although FragFuse can be instantiated manually for individual agents, we further develop a surrogate-based optimization scheme that tunes fusion instructions and marker designs, enabling automated attack generation without violating the attacker's threat-model assumptions. We evaluate FragFuse across four representative agent settings and task domains, covering three state-of-the-art agent access-control mechanisms. FragFuse achieves an average bypass success rate of 86.3% and an average end-to-end harmful task success rate of 41.1% across all settings, with only 4.4% average task-success degradation compared with configurations without access control. We also show that alternative defenses, including state-of-the-art prompt-injection detectors and perplexity detectors, do not effectively address this attack.

URL PDF HTML ☆

赞 0 踩 0

2606.15650 2026-06-16 cs.CR cs.AI cs.PF 交叉投稿

AnonShield: Scalable On-Premise Pseudonymization for CSIRT Vulnerability Data

AnonShield：面向CSIRT漏洞数据的可扩展本地化假名化系统

Cristhian Kapelinski, Douglas Lautert, Beatriz Machado, Diego Kreutz, Isadora Garcia Ferrão

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Federal University of Paraná（巴西南里杰尼联邦大学）

AI总结提出AnonShield，一种结合GPU加速NER、流处理、缓存和模式感知配置的高吞吐量本地假名化系统，在550MB数据集上实现738倍加速，F1分数达94.2%，兼顾效率与效用。

Comments 9 pages, including 2 figures and 8 tables, submitted to SF/SBRC 2026

2606.15730 2026-06-16 cs.LG cs.AI 交叉投稿

InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset

InstantForget: 无需更新的后门遗忘与推理时特征重置

Zhenyu Yu

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University（复旦大学计算机科学与人工智能学院）

AI总结提出InstantForget方法，通过推理时特征重置实现无需参数更新的后门遗忘，利用马氏距离检测异常特征并重置为中性表示，在CIFAR-10上平均ASR降至0.071。

详情

AI中文摘要

后门遗忘旨在从部署模型中移除恶意触发行为，同时保持清洁效用。我们研究了无需更新的推理时设置，其中模型参数保持冻结。首先，我们在oracle配对的清洁和触发特征下审计了一个常见的投影假设。投影主要对BadNets成功，而在CIFAR-10 ResNet-18上对WaNet、Blended和SIG的ASR分别为0.683、0.888和0.941。这种失败不能由谱紧凑性、空间局部性或子空间错位解释，而是由涉及目标边际、目标logit下降和非目标logit上升的logit三元组差距预测。然后我们引入了InstantForget，一种清洁校准的门控重置，通过马氏距离标记异常特征，并仅将标记的特征移向中性的非目标表示。在保留的触发验证集上选择一个固定操作点后，InstantForget在部署时无需触发样本或参数更新，将CIFAR-10上四种非自适应触发的平均ASR降至0.071。它还达到了0.981的检测AUROC，并迁移到八个测试骨干中的六个。报告的在WaNet、ModelNet10点混合、两种骨干几何和自适应特征紧凑性攻击下的失败定义了该方法的适用范围。

英文摘要

Backdoor unlearning aims to remove a malicious trigger behavior from a deployed model while preserving clean utility. We study the update-free inference-time setting, where model parameters remain frozen. First, we audit a common projection assumption under oracle paired clean and triggered features. Projection succeeds mainly on BadNets and leaves WaNet, Blended, and SIG at 0.683, 0.888, and 0.941 ASR on CIFAR-10 ResNet-18. This failure is not explained by spectral compactness, spatial locality, or subspace misalignment. It is predicted by a logit-triplet gap involving the target margin, target-logit drop, and non-target logit rise. We then introduce InstantForget, a clean-calibrated gated reset that flags anomalous features with a Mahalanobis score and moves only flagged features toward a neutral non-target representation. With one fixed operating point selected on held-out triggered validation, InstantForget reduces average ASR to 0.071 across four non-adaptive CIFAR-10 triggers without triggered samples or parameter updates at deployment. It also reaches 0.981 detection AUROC and transfers to six of eight tested backbones. Reported failures under WaNet, ModelNet10 point blend, two backbone geometries, and adaptive feature-compactness attacks define the method's scope.

URL PDF HTML ☆

赞 0 踩 0

2606.15762 2026-06-16 cs.CR cs.AI cs.SE 交叉投稿

Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?

Snyk VulnBench JS 1.0: LLM 能否两次发现相同的漏洞？

Liran Tal, Johannes Kloos, Arsenii Rudich, Stephen Thoemmes, Manoj Nair

发表机构 * Snyk

AI总结通过300次重复漏洞扫描实验，评估LLM在相同JavaScript代码上安全审查的可重复性，发现引用匹配结果稳定但额外报告波动大，建议结合确定性SAST使用。

Comments 12 pages, 9 figures

详情

AI中文摘要

我们进行了300次重复漏洞查找扫描，以衡量代理型大语言模型（LLM）在相同JavaScript代码、提示和基准测试框架上进行安全审查的可重复性。主要结果是LLM的安全发现结果可重复性不均：引用匹配的发现结果稳定，但额外模型报告在不同运行之间变化很大。在250次模型运行中，161个唯一未匹配发现结果中有80个仅出现在五次相同重复中的一次，而只有22个出现在全部五次中。相比之下，当Claude匹配到Snyk Code引用发现结果时，行为更加稳定：158个唯一引用匹配发现结果中有134个出现在全部五次重复中。该基准测试还显示了互补性。模型一致地发现了熟悉的、高信号利用模式，并在一个案例中揭示了可能的Snyk Code产品差距。Snyk Code静态应用安全测试（SAST）是确定性的，并且在系统地枚举重复数据流汇点方面更胜一筹。结果支持将代理型LLM审查与确定性SAST结合使用，而不是将任一技术视为另一技术的替代品。

英文摘要

We ran 300 repeated vulnerability-finding scans to measure how repeatable agentic large language model (LLM) security review is on the same JavaScript code, prompt, and benchmark harness. The headline result is that LLM security findings were unevenly repeatable: reference-matched findings were stable, but extra model reports varied heavily from run to run. Across 250 model runs, 80 of 161 unique unmatched findings appeared in only one of five identical repetitions, while only 22 appeared in all five. By contrast, when Claude matched a Snyk Code reference finding, the behavior was much more stable: 134 of 158 unique reference-matched findings appeared in all five repetitions. The benchmark also shows complementarity. Models consistently found familiar, high-signal exploit shapes, and in one case surfaced a likely Snyk Code product gap. Snyk Code static application security testing (SAST) was deterministic and better at systematically enumerating repeated data-flow sinks. The results support combining agentic LLM review with deterministic SAST rather than treating either technique as a replacement for the other.

URL PDF HTML ☆

赞 0 踩 0

2606.15788 2026-06-16 cs.CR cs.AI 交叉投稿

GAS-Leak-LLM: Genetic Algorithm-Based Suffix Optimization for Black-Box LLM Jailbreaking

GAS-Leak-LLM：基于遗传算法的后缀优化实现黑盒LLM越狱

Aman Anifer, Vignesh Kumar Kembu, Vishnu M, Antonino Nocera, Vinod P., Amal Murali PK, Akshay S Rajan

发表机构 * Department of Electrical, Computer and Biomedical Engineering（电气、计算机与生物医学工程系）； University of Pavia（帕维亚大学）； Department of Computer Applications（计算机应用系）； Cochin University of Science and Technology（科钦科学技术大学）

AI总结提出GAS-Leak-LLM方法，利用遗传算法在黑盒设置下自动进化对抗后缀以绕过LLM安全约束，验证了现有安全机制的不足。

详情

AI中文摘要

大型语言模型（LLM）构成了以人工智能为主导的信息技术生态系统中的关键组成部分。为了减轻有害或违反政策的输出带来的风险，商业系统采用了先进的对齐策略和多层内容审核机制。尽管有这些保护措施，最近的研究表明，LLM仍然容易受到对抗性操纵，特别是通过越狱和提示注入技术。在这项工作中，我们提出了GAS-Leak-LLM，一种基于遗传算法的新型越狱攻击，该系统性地进化对抗后缀以绕过安全约束。在严格的黑盒设置中操作，我们的方法不需要访问模型参数或内部结构，从而反映了部署系统中的真实威胁场景。通过迭代应用选择、变异和交叉启发式，该框架系统地探索离散提示空间以识别高适应度的对抗后缀。实证结果揭示了现有安全执行机制的关键缺陷，并确认了所提出攻击的有效性和实际可行性。

英文摘要

Large Language Models (LLMs) constitute pivotal components within the AI-dominated information technology ecosystem. To mitigate risks associated with harmful or policy-violating outputs, commercial systems employ advanced alignment strategies and multi-layered content moderation mechanisms. Despite these safeguards, recent research has demonstrated that LLMs remain vulnerable to adversarial manipulation, particularly through jailbreaking and prompt injection techniques. In this work, we propose GAS-Leak-LLM a novel jailbreaking attack based on a genetic algorithm that systematically evolves adversarial suffix to bypass safety constraints. Operating in a strict black-box setting, our method requires no access to model parameters or internals, thereby reflecting realistic threat scenarios in deployed systems. Through the iterative application of selection, mutation, and crossover heuristics, the framework systematically explores the discrete prompt space to identify high-fitness adversarial suffixes. Empirical findings reveal critical shortcomings in existing safety enforcement mechanisms and confirm the effectiveness and practical viability of the proposed attack.

URL PDF HTML ☆

赞 0 踩 0

2606.15810 2026-06-16 cs.CR cs.AI 交叉投稿

Let Them Steal: Trapping Large Language Model Extraction Attacks with Knowledge Honeypot

让他们偷：用知识蜜罐诱捕大语言模型提取攻击

Yuyang Dai, Yushun Dong

发表机构 * Florida State University（佛罗里达州立大学）

AI总结提出知识陷阱防御方法，通过蜜罐知识图和面包屑引导消耗攻击者查询预算，在保持合法用户性能的同时降低替代模型一致性6.2%。

Comments 16 pages

2606.15954 2026-06-16 cs.SE cs.AI cs.DC cs.LG 交叉投稿

Green SARC: Predictive Cost and Carbon Governance for Agentic AI Systems

Green SARC：面向代理型AI系统的预测性成本与碳治理

Gaston Besanson

发表机构 * Universidad Torcuato Di Tella（托库托迪泰拉大学）

AI总结提出Green SARC框架，通过架构级治理在代理循环中强制执行成本与碳预算，理论贡献包括预测性执行点，实验证明门控机制实现0%超支，端到端节省47-55%。

Comments 19 figures. Code: https://github.com/besanson/Greensarc -- Software DOI: https://doi.org/10.5281/zenodo.20692196

详情

AI中文摘要

代理型AI系统通过工具和子代理运作，但旨在约束其财务和环境成本的控制措施仍停留在仪表盘上，在执行过程中或执行后进行评估。Green SARC将SARC架构治理框架——代理循环中的四个执行点——应用于FinOps和GreenOps，贡献了关于执行什么以及如何预测的理论。我们报告了四个与策略无关的结果。(i) 无约束的“状态雪球”在循环深度上为$Θ(n^2)$；在3000个真实多步计划（SWE-rebench）上，100%成立，中位曲率$\hat{c}_2=216$超过线性累积预测$p/2=134$——真实计划累积速度快于模型。(ii) 在真实残差上，正态-$σ$门覆盖不足（标称95%时实际92%）；分裂共形校准成立（95.2%）。(iii) 根据预期预算调整的软拉格朗日惩罚在91.5%的种子上违反预算；架构门违反率为0%。(iv) 在绑定预算下，门在合成和真实（BurstGPT）到达上的超预算发生率为0%。端到端的token/美元/碳节省（47-55%）是真实的，但幅度依赖于策略——由范围-容量旋钮设定，而非门拒绝。该库是开源的，无依赖，并为每个引用的数字提供了再生脚本。

英文摘要

Agentic AI systems act through tools and sub-agents, yet the controls meant to bound their financial and environmental cost still sit on dashboards evaluated beside or after execution. Green SARC applies the SARC governance-by-architecture framework -- four enforcement sites in the agent loop -- to FinOps and GreenOps, contributing the theory of what to enforce and how to predict it. We report four policy-independent results. (i) The unconstrained "State Snowball" is $Θ(n^2)$ in loop depth; on 3,000 real multi-step plans (SWE-rebench) it holds on 100%, with median curvature $\hat{c}_2=216$ exceeding the linear-accretion prediction $p/2=134$ -- real plans accrete faster than the model. (ii) On real residuals the Normal-$σ$ gate under-covers (92% at nominal 95%); split-conformal calibration holds (95.2%). (iii) A soft Lagrangian penalty tuned to the budget in expectation breaches it on 91.5% of seeds; the architectural gate breaches 0%. (iv) Under binding budgets the gate's over-budget incidence is 0% on synthetic and real (BurstGPT) arrivals. End-to-end token/USD/carbon savings (47--55%) are real but policy-dependent in magnitude -- set by a scope-cap knob, not by gate rejections. The library is open-source, dependency-free, and ships a regeneration script for every cited number.

URL PDF HTML ☆

赞 0 踩 0

2606.15980 2026-06-16 cs.LG cs.AI cs.CL 交叉投稿

Do Safety Monitors Stay Reliable After an Update? Benchmarking and Predicting Activation-Monitor Staleness

安全监控器在更新后是否仍可靠？激活监控器陈旧性的基准测试与预测

Evan Duan

发表机构 * University of Michigan（密歇根大学）

AI总结研究语言模型更新后激活监控器是否仍可靠，发现量化更新影响小，微调更新常导致监控器失效，且可通过预部署特征预测退化。

详情

AI中文摘要

激活监控器——在语言模型内部表示上训练的轻量级探针——在部署安全栈中越来越常见。然而，部署的模型很少是静态的：它们被量化、微调、用LoRA适配，或与合并适配器一起服务，而监控器保持冻结。我们首次系统测试了这一隐含契约是否成立：在基础模型上训练的激活监控器在这些常规模型更新后是否仍可靠。跨多个安全相关监控器、模型深度、更新系列和开放权重模型，我们发现一个明显的分裂：量化风格的更新大多保持冻结探针性能，而微调风格的更新经常使探针变得陈旧。脆弱性高度依赖于监控器，隐私/PII探针受影响最大，而拒绝合规探针相对稳定，表明重新训练行为不一定使其对应的监控器变得陈旧。QLoRA尤其有害，尽管单独的NF4量化相对良性，这表明量化在与适配结合时风险更大。我们进一步表明，退化可以从部署前的特征预测，从而能够将重新验证预算优先分配给最可能失败的监控器。这些结果表明，微调应默认触发激活监控器重新验证，而预测可以帮助优先检查哪些监控器。

英文摘要

Activation monitors-lightweight probes trained on a language model's internal representations-are an increasingly common layer in deployment safety stacks. Deployed models however are rarely static: they are quantized, fine-tuned, adapted with LoRA, or served with merged adapters while the monitor remains frozen. We present the first systematic test of whether this implicit contract holds: whether activation monitors trained on a base model remain reliable after these routine model updates. Across multiple safety-relevant monitors, model depths, update families, and open-weight models, we find a sharp split: quantization-style updates largely preserve frozen probe performance, while fine-tuning-style updates frequently make probes stale. Fragility is highly monitor-dependent, with privacy/PII probes most affected and refusal-compliance probes comparatively stable, showing that retraining a behavior need not stale its corresponding monitor. QLoRA is especially damaging despite NF4 quantization alone being relatively benign, suggesting that quantization becomes riskier when combined with adaptation. We further show that degradation is predictable from pre-deployment features, enabling revalidation budgets to be triaged toward the monitors most likely to fail. These results suggest that fine-tuning should trigger activation-monitor revalidation by default, while prediction can help prioritize which monitors to check first.

URL PDF HTML ☆

赞 0 踩 0

2606.16054 2026-06-16 cs.CY cs.AI 交叉投稿

How to Detect and Measure the AI Dangers to Democracy

如何检测和衡量人工智能对民主的危险

Giulia Sandri, Claudio Novelli

发表机构 * Université libre de Bruxelles（布鲁塞尔自由大学）； Yale University（耶鲁大学）

AI总结针对AI对民主进程的影响，提出基于委托代理理论和NIST框架的分析体系，通过可测量指标评估问责缺口与治理失败，强调机构可评估性是民主控制的关键。

详情

AI中文摘要

过去十年间，关于人工智能与民主的研究迅速发展。这些文献的一个共同结论是，AI并未创造新的民主问题，而是加剧了旧有问题。如今，我们在信息生态系统、选举和公共行政中都能看到这一点。然而，尽管证据不断增加，我们仍缺乏明确的方法来优先处理该领域的风险、跨领域比较风险，并识别民主控制最可能失效的环节。因此，我们的问题是：如何系统化AI系统对民主进程造成的问题？本文认为，委托代理理论可能适合这一任务。在民主系统的许多阶段，委托人将关键职能委托给AI系统及其提供商，却无法真正监督这些系统的运作方式或它们产生的输出。将AI视为委托问题有助于识别问责缺口和其他治理失败。最重要的是，正如我们将要说明的，它为AI对民主影响的实证评估提供了度量标准。作为第二个分析要素，我们借鉴了NIST AI风险管理框架及其可信AI的七个特征，这些特征为评估委托任务提供了实质性标准。通过可测量指标和特定领域的可信度标准，在三个领域进行操作化，我们提出了一个以机构可评估性为中心的分析框架，作为民主控制AI的核心条件。然而，我们强调，危害的严重程度以及可接受的风险水平是评估性判断，当前的方法论既未承认也未操作化这些判断。当这些评估性判断被（默默地）委托给私人供应商时，这一问题变得尤为尖锐。我们将其识别为一个强烈的局限性，留待未来工作解决。

英文摘要

Research on artificial intelligence and democracy has grown quickly over the last decade. A shared conclusion in this literature is that AI does not create new democratic problems so much as it makes old ones worse. We now see this across information ecosystems, in elections, and in public administration. However, despite growing evidence, we lack a clear way to prioritize risks in this area, compare them across domains, and identify where democratic control is most likely to break down. So, our problem is: How can we systematize the problems that AI systems pose to democratic processes? This paper argues that principal agent theory may fit the task. In many phases of democratic systems, principals delegate key functions to AI systems and their providers without really being able to monitor how these systems operate or the outputs they produce. Treating AI as a delegation problem helps identify accountability gaps and other governance failures. Most importantly, as we shall illustrate, it provides metrics for empirical assessments of AI impact on democracy. As a second analytical element, we draw on the NIST AI Risk Management Framework and its seven characteristics of trustworthy AI, which supply substantive criteria for evaluating delegated tasks. Operationalized across the three domains through measurable indicators and domain specific trustworthiness criteria, we propose an analytical framework that centers on institutional assessability as the central condition for democratic control over AI. However, we stress that how severe a harm is, and how much risk is acceptable, are evaluative judgments that current methodologies neither acknowledge nor operationalize. This becomes acute when such evaluative judgments are (silently) delegated to private vendors. We identify this as a strong limitation left for future work.

URL PDF HTML ☆

赞 0 踩 0

2606.16137 2026-06-16 cs.CL cs.AI 交叉投稿

XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models

基于XAI的语音深度伪造检测解释生成：使用免训练多模态大语言模型

Yupei Li, Qiyang Sun, Xiaoliang Wu, Chenxi Wang, Berrak Sisman, Björn W. Schuller

发表机构 * Imperial College London（帝国理工学院）； Technical University of Munich（慕尼黑工业大学）； University of Southampton（南安普顿大学）； MBZUAI（穆罕默德·本·扎耶德人工智能大学）； Johns Hopkins University（约翰霍普金斯大学）

AI总结针对语音深度伪造检测缺乏可解释性的问题，提出一种免训练框架，融合XAI证据与多模态大语言模型，生成基于证据的特定解释，在PartialSpoof数据集上内部准确率提升超45%。

Comments Accepted at Interspeech 2026

详情

AI中文摘要

语音深度伪造检测（SDD）系统需要可信的解释以进行可靠的决策。现有的解释方式主要分为两类。传统的可解释人工智能（XAI），如基于梯度的归因，产生与模型决策紧密耦合的低级归因信号，且比自然语言解释更难被人类理解。同时，基于大语言模型（LLM）的解释生成通常由于缺乏启发式证据和任务特定监督（源于SDD有限的基于证据的解释数据集）而产生通用且无根据的描述。因此，我们提出一种免训练解释框架，将XAI证据与多模态LLM集成，以生成基于证据的特定解释。使用PartialSpoof数据集，我们构建了一个基于证据的解释数据集，并表明带有XAI的方法将内部准确率提高了超过45%，通过人工评估和忠实性检查得到验证。

英文摘要

Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation ways mainly fall into two categories. Traditional explainable AI (XAI), such as gradient-based attribution, produces low-level attribution signals tightly coupled with model decisions, and harder to be understood by human than natural language explanations. Meanwhile, large language model (LLM)-based explanation generation often produces generic and ungrounded descriptions due to the lack of heuristic evidence and task-specific supervision, stemming from limited grounded explanation datasets for SDD. We therefore propose a training-free explanation framework that integrates XAI evidence with multimodal LLMs to generate grounded and specific explanations. Using the PartialSpoof dataset, we construct a grounded explanation dataset and show that methods with XAI increase inside accuracy by over 45\%, verified through human evaluation and faithfulness checks.

URL PDF HTML ☆

赞 0 踩 0

2606.16244 2026-06-16 cs.CR cs.AI 交叉投稿

SPARK: Security Knowledge Priming and Representation-Guided Knowledge Activation for LLM-based Secure Code Generation

SPARK: 基于安全知识引导与表示激活的LLM安全代码生成

Xiaoyun Xu, Lichao Wu, Jona te Lintelo, Siyu Zhang, Stjepan Picek

发表机构 * Radboud University（拉德堡德大学）； University of Bristol（布里斯托大学）； University of Zagreb（扎格雷布大学）

AI总结提出SPARK方法，通过检索CWE条目并添加结构化提示激活模型内隐安全知识，结合预计算令牌偏置，无需重训练即可提升代码安全性。

详情

AI中文摘要

大型语言模型通常会生成带有可利用安全漏洞的代码。先前文献将此限制归因于缺乏安全专业知识，促使当前防御机制转向大量微调或外部知识检索，这通过冗余代码示例引入了显著的计算开销和数据偏差。与此观点相反，我们认为预训练语料库已经富含安全材料。瓶颈在于激活：没有明确而简短的提示，对常见训练分布模式的统计压力会抑制模型的安全相关表示。我们提出了SPARK，一种推理时的安全增强工具，无需任何重训练即可激活这些潜在知识。该工具包含两部分。第一部分为每个编码任务检索少量相关通用弱点枚举（CWE）条目，并在提示后附加一个简短的结构化提示；仅此就足以浮现模型现有的安全表示。第二部分在每个解码步骤向logits添加预计算的令牌偏置。我们通过将安全方向向量（平均安全与平均不安全的最后一层隐藏状态之间的单位差）通过语言模型头投影来获得偏置。该偏置离线计算一次；应用它每个生成令牌只需一次向量加法。我们在C++、Java和Python上的9个开源模型上评估了SPARK，并与涵盖微调和检索增强方法的7个基线进行了比较。SPARK在每个设置中均匹配或优于最佳基线，同时保持HumanEval效用。我们进一步在7个当前最强模型（包括Claude、DeepSeek和GPT）的黑盒设置中测试了第一部分，展示了不安全代码生成的瓶颈以及我们方法带来的改进。

英文摘要

Large language models routinely generate code with exploitable security flaws. Prior literature attributes this limitation to a lack of security expertise, steering current defense mechanisms toward heavy fine-tuning or external knowledge retrieval, which introduces significant computational overhead and data bias through redundant code examples. Contrary to this view, we argue that pretraining corpora are already rich in security material. The bottleneck is activation: without an explicit and brief cue, statistical pressure toward common training-distribution patterns suppresses the model's safety-relevant representations. We present SPARK, an inference-time security harness that activates this latent knowledge without any retraining. The harness has two parts. Component~I retrieves a few of the relevant Common Weakness Enumeration (CWE) entries for each coding task and appends a short structured cue to the prompt; this alone is enough to surface the model's existing security representations. Component~II adds a precomputed token bias to the logits at every decoding step. We obtain the bias by projecting a safe-direction vector, the unit difference between the mean safe and mean unsafe last-layer hidden states, through the language model head. The bias is computed once offline; applying it costs a single vector addition per generated token. We evaluate SPARK on 9 open-source models across C++, Java, and Python, and compare with 7 baselines spanning fine-tuning and retrieval-augmented methods. SPARK matches or improves on the best baseline in every setting while preserving HumanEval utility. We further test Component~I in a black-box setting on 7 of today's strongest models, including Claude, DeepSeek, and GPT, demonstrating the bottleneck of insecure code generation and the improvements enabled by our method.

URL PDF HTML ☆

赞 0 踩 0

2606.16352 2026-06-16 cs.LG cs.AI 交叉投稿

Communication-Efficient Verifiable Attention for LLM Inference

面向LLM推理的高效通信可验证注意力机制

Ziqun Chen, Ming Wu, Michael Heinrich, Jason Zeng, Huiying Lan, Tianwei Zhang, Rui Tan

发表机构 * Nanyang Technological University（南洋理工大学）； Zero Gravity Labs（零重力实验室）

AI总结提出VeriAttn，通过将注意力计算卸载到GPU并由TEE验证，结合两阶段流水线和分区策略，显著降低TEE计算和通信开销，实现LLM推理加速。

Comments 19 pages, 16 figures

详情

AI中文摘要

远程大型语言模型（LLM）服务的计算完整性可能存在问题。对于传统深度神经网络（DNN），现有的TEE屏蔽DNN分区（TSDP）方法使用可信执行环境（TEE）计算非线性组件，并验证卸载到不可信GPU的线性组件的完整性。然而，直接将TSDP应用于基于Transformer的LLM会导致大量的TEE计算和TEE-GPU通信开销。本文提出通信高效的TEE-GPU注意力机制（\textsc{VeriAttn}），用于加速可验证的LLM推理。\textsc{VeriAttn}将注意力的线性和非线性计算都卸载到GPU，而TEE执行验证。此外，对于预填充阶段，\textsc{VeriAttn}使用两级流水线来重叠数据移动、TEE前后处理和GPU计算。对于解码阶段，当键值缓存超过可用GPU内存时，\textsc{VeriAttn}将注意力在TEE和GPU之间分区，以减少重复的键值传输。在Intel TDX平台上的评估表明，对于6k令牌提示和10k令牌输出，\textsc{VeriAttn}在预填充和解码阶段分别比TSDP加速2.60-3.38倍和3.86-5.42倍。

英文摘要

Computation integrity of remote large language model (LLM) serving can be questionable. For conventional deep neural networks (DNNs), the existing TEE-shielded DNN partitioning (TSDP) approach uses Trusted Execution Environment (TEE) to compute non-linear components and verify the integrity of linear components offloaded to an untrusted GPU. However, directly applying TSDP to Transformer-based LLMs incurs significant TEE computation and TEE-GPU communication overhead. This paper presents Communication-efficient TEE-GPU Attention (\textsc{VeriAttn}) for accelerating verifiable LLM inference. \textsc{VeriAttn} offloads both linear and non-linear computations of attention to the GPU, while TEE performs verification. Moreover, for prefill, \textsc{VeriAttn} uses a two-level pipeline to overlap data movement, TEE pre-/post-processing, and GPU computation. For decoding, when the key-value cache exceeds available GPU memory, \textsc{VeriAttn} partitions attention across TEE and GPU to reduce repeated key-value transfers. Evaluation on an Intel TDX platform shows that \textsc{VeriAttn} achieves 2.60-3.38$\times$ and 3.86-5.42$\times$ acceleration over TSDP for 6k-token prompts and 10k-token outputs during prefill and decoding, respectively.

URL PDF HTML ☆

赞 0 踩 0

2606.16358 2026-06-16 cs.CR cs.AI cs.ET cs.MA 交叉投稿

The Proxy Knows Too Much: Sealing LLM API Routers with Attested TEEs

代理知道太多：用认证TEE密封LLM API路由器

Sipeng Xie, Qianhong Wu, Hengrun Lu, Ziliang Sun, Qi Wu, Bo Qin, Qin Wang

发表机构 * Beihang University（北京航空航天大学）； Renmin University of China（中国人民大学）； Independent（独立）

AI总结针对API路由器作为应用层中间人可窃取明文交互的问题，提出AEGIS，一种提供者透明的认证API路由器，通过硬件飞地保护数据路径，客户端验证飞地后释放明文，阻止所有恶意路由器攻击。

详情

AI中文摘要

智能体越来越多地通过API路由器访问大型语言模型（LLM）。路由器终止客户端的传输层安全会话并打开单独的上游会话，因此它以明文形式持有完整交互。这使得路由器成为应用层中间人：它可以重写智能体工具调用，将依赖项替换为错别字劫持包，仅在审计规避条件下触发攻击，并被动窃取秘密。现有的客户端防御措施是可规避的。我们提出AEGIS，一种提供者透明的认证API路由器，其数据路径是客户端验证的忠实直通。AEGIS将明文处理限制在一个小型硬件飞地组件中，同时将认证、调度、计费和管理保留在不可信主机上。客户端在释放明文前验证飞地。主机既不能读取也不能更改交互，明文仅流向测量映像固定的目的地。我们展示了所有四类恶意路由器攻击在明文访问基线下成功，并被AEGIS阻止，包括针对相同边界的自适应测试。可信路径为851行代码，携带三种提供者原生API而无需转换，并在真实提供者工作负载和并发下完成每个请求。在种子审计试点中，两个商品编码代理分别发现十个植入不变量违规中的八个和十个。本地中继开销约为每个请求六毫秒。

英文摘要

Agents increasingly access large language models (LLMs) through API routers. A router terminates the client's transport-layer security session and opens a separate upstream session, so it holds the full interaction in plaintext. This makes the router an application-layer man-in-the-middle: it can rewrite agent tool calls, swap dependencies for typosquatted packages, trigger attacks only under audit-evading conditions, and passively exfiltrate secrets. Existing client-side defenses are evadable. We propose AEGIS, a provider-transparent attested API router whose data path is a client-verified faithful passthrough. AEGISconfines plaintext handling to a small hardware-enclave component while leaving authentication, scheduling, accounting, and management on the untrusted host. The client verifies the enclave before releasing plaintext. The host can neither read nor alter the interaction, and plaintext leaves only toward destinations fixed by the measured image. We show that all four malicious-router attack classes succeed against a plaintext-access baseline and are blocked by AEGIS, including adaptive tests against the same boundary. The trusted path is $851$ lines, carries three provider-native APIs without conversion, and completes every request under real-provider workload and concurrency. In a seeded audit pilot, two commodity coding agents find eight and ten of ten planted invariant violations. The local relay overhead is about six milliseconds per request.

URL PDF HTML ☆

赞 0 踩 0

2606.16617 2026-06-16 cs.CL cond-mat.mtrl-sci cs.AI 交叉投稿

Sycophancy as Material Failure under Pushback Loading: A Multi-Axis Characterization Across Three Loading Cases and up to Seventeen Material Charges

推挤载荷下的谄媚作为材料失效：三种加载情形及多达十七种材料批次的多元表征

Ferdinand M. Schessl

AI总结采用材料科学框架，将LLM谄媚视为推挤载荷下的材料失效，通过14个轴测量和三种加载情形（辩论、错误预设、伦理设定）共7800个样本，揭示失效模式依赖加载类型，并发现跨评判者可靠性差异。

Comments 12 pages, 3 figures. Code, data, and pre-registrations: https://github.com/FerdinandSchessl/sycophancy-note-companion

详情

AI中文摘要

LLM中的谄媚现象在70多篇论文中有记录，但专家对构念边界的共识仍然较低（ICC=.184；Ye等人，2026）。该构念碎片化是因为行为分类取决于哪种表面形式被优先考虑。我们采用材料科学框架：对话作为加载下的测试样本，LLM模型作为材料批次，推挤作为渐进载荷，立场翻转作为材料失效。我们在三种加载情形（辩论n=1000；错误预设n=3400；伦理设定n=3400；每种情形10-17种材料批次；共7800个样本）下，使用14个回合级轴测量（涵盖速度、损伤累积、框架漂移、脆性和方向稳定性）以及来自独立管道的三个说话者解析轴来表征这种失效。测量是胡克耦合的（$σ= E \cdot \varepsilon$类比），并在加载情形间重现，在辩论上效应高达$|r_{rb}| = 0.35$；符号结构增加了第二种模式：伦理设定情形反转了速度和累积块。方差组成分为两个轮廓：辩论是批次主导的（类似脆性断裂：材料等级决定），错误预设和伦理设定是主题主导的（类似蠕变：载荷决定）；比率（2.03 vs 0.13/0.17）依赖于估计器，对于辩论甚至在方向上也是如此。跨评判者可靠性（GPT-4o vs Haiku 4.5）显示辩论评分是评判者鲁棒的（Cohen's $κ= 0.88$），而错误预设评分是评判者敏感的（$κ= 0.36$）——这是单评判者基准必须报告的注意事项。这是Ye等人诊断所要求的方法论举措：一种不依赖于构念的哪种表面形式被优先考虑的多元表征。

英文摘要

Sycophancy in LLMs is documented across 70+ papers, but expert agreement on construct boundaries remains low (ICC=.184; Ye et al., 2026). The construct fragments because behavioral classification depends on which surface form is privileged. We adopt a materials-science framing: conversation as test specimen under load, LLM-model as material charge, pushback as progressive load, stance-flip as material failure. We characterize this failure across three loading cases (debate n=1000; false-presuppositions n=3400; ethical-setting n=3400; 10-17 material charges per case; 7800 specimens total) using 14 turn-level axis-measurements spanning velocity, damage accumulation, frame-drift, brittleness, and direction stability, plus three speaker-resolved axes from an independent pipeline. The measurements are Hooke-coupled ($σ= E \cdot \varepsilon$ analog) and reproduce across loading cases with effects up to $|r_{rb}| = 0.35$ on debate; the sign structure adds a second pattern: the ethical-setting case inverts the velocity and accumulation blocks. Variance composition partitions into two profiles: debate is charge-dominated (brittle-fracture-like: the material grade decides), false-presuppositions and ethical-setting are topic-dominated (creep-like: the load decides); the ratios (2.03 vs 0.13/0.17) are estimator-dependent, for debate even in direction. Cross-judge reliability (GPT-4o vs Haiku 4.5) shows debate scoring is judge-robust (Cohen's $κ= 0.88$) while false-presupposition scoring is judge-sensitive ($κ= 0.36$) -- a caveat single-judge benchmarks must report. This is the methodological move Ye et al.'s diagnosis calls for: a multi-axis characterization that does not depend on which surface form of the construct one privileges.

URL PDF HTML ☆

赞 0 踩 0

2606.16751 2026-06-16 cs.CR cs.AI 交叉投稿

Automated jailbreak attack targeting multiple defense strategies

针对多种防御策略的自动化越狱攻击

Qi Wang, Chengcheng Wan, Weijia He, Yanqing Li, Hanqi Sun, Xiaodong Gu, Jiangtao Wang

AI总结提出UNIATTACK框架，从防御视角提取攻击特征并优化，实现跨模型和类别的单次黑盒攻击，显著提升成功率并降低开销。

详情

AI中文摘要

大型语言模型（LLM）在广泛任务中展现出卓越能力。然而，由于其易受对抗性提示攻击，其安全性仍是关键问题。本文提出UNIATTACK，一个从防御视角设计的对抗性测试框架，用于系统性地构建有效的黑盒攻击提示。与依赖静态模板或迭代模型特定调优的先前方法不同，UNIATTACK从多种现有攻击中提取最小但高影响力的攻击特征，通过专门的攻击者LLM进行优化，并通过自动化精炼过程将其组合成灵活模板。这种以特征为中心的构建方式使得单次攻击能够泛化到多个模型和安全类别，为评估LLM鲁棒性提供了实用工具。我们的评估结果显示，与基线相比，UNIATTACK在部署了多层防御机制的模型上实现了平均攻击成功率（ASR）提升64.63%-248.82%，且仅消耗基线成本的0.03%-4.96%。UNIATTACK工件可在https://anonymous.4open.science/r/UniAttack-Artifact-30F1获取。

英文摘要

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility to adversarial prompt-based attacks. In this paper, we present UNIATTACK, an adversarial testing framework designed from a defense-oriented perspective to systematically construct effective black-box attack prompts. Unlike prior approaches that rely on static templates or iterative model-specific tuning, UNIATTACK extracts minimal but high-impact attack features from diverse existing attacks, optimizes them via a specialized attacker LLM, and composes them into flexible templates through automated refinement process. This feature-centric construction enables one-shot attacks that generalize across multiple models and safety categories, providing a practical tool for assessing LLM robustness. Our evaluation results shows that compared to the baselines, UNIATTACK achieves an average attack success rate (ASR) improvement of 64.63\%-248.82\% on models deployed with multi-layered defense mechanisms and it only takes 0.03\%-4.96\% cost of the baselines. UNIATTACK artifact is available at https://anonymous.4open.science/r/UniAttack-Artifact-30F1.

URL PDF HTML ☆

赞 0 踩 0

2606.16939 2026-06-16 cs.LG cs.AI 交叉投稿

Scalable Circuit Learning for Interpreting Large Language Models

可扩展的电路学习用于解释大型语言模型

Naiyu Yin, Dennis Wei, Tian Gao, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Yue Yu

AI总结提出CircuitLasso方法，基于稀疏线性回归高效学习LLM中的稀疏电路，以SAE特征为单元，在保持结构准确性的同时大幅降低计算成本，并揭示语义特征传播机制。

Comments Accepted to the Mechanistic Interpretability Workshop at ICML 2026

详情

AI中文摘要

机械可解释性中的一个重要研究方向是学习LLM组件上的稀疏电路，以揭示它们如何共同产生模型行为。然而，原始神经元具有多语义性，使得学习到的电路难以解释。稀疏自编码器（SAE）特征缓解了这一问题，但其高维度使得现有的基于干预的电路学习方法在计算上变得不可行。我们提出了CircuitLasso，一种基于稀疏线性回归的可扩展电路学习方法。CircuitLasso恢复的电路在基准数据上的结构准确性与最先进的基于干预的方法相匹配，而计算成本仅为后者的一小部分。为了可解释性，CircuitLasso高效地揭示了SAE特征之间的关系，展示了人类可解释的语义特征如何通过模型传播并影响其预测。最后，我们通过利用所学电路的见解，在领域泛化任务上以显著更低的成本实现了相当的性能，从而验证了所学电路的实用性。

英文摘要

A prominent research direction in mechanistic interpretability is learning sparse circuits over LLM components to reveal how they jointly produce model behavior. However, raw neurons are polysemantic, making learned circuits hard to interpret. Sparse autoencoder (SAE) features alleviate this, but their high dimensionality makes existing intervention-based circuit learning methods computationally prohibitive. We propose CircuitLasso, a scalable circuit-learning approach based on sparse linear regression. CircuitLasso recovers circuits whose structural accuracy matches that of state-of-the-art intervention-based methods on the benchmark data, at a fraction of the computational cost. For interpretability, CircuitLasso efficiently uncovers relationships among SAE features, showing how human-interpretable semantic features propagate through the model and influence its predictions. Finally, we validate the utility of our learned circuits by leveraging their insights to achieve comparable performance at substantially lower cost on a domain-generalization task.

URL PDF HTML ☆

赞 0 踩 0

2606.16952 2026-06-16 cs.LG cs.AI stat.AP stat.ME stat.ML 交叉投稿

Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data

幻象与披露：合成数据审计的因果框架

Kareem Amin, Rudrajit Das, Alessandro Epasto, Adel Javanmard, Dennis Kraft, Mónica Ribero, Sergei Vassilvitskii

发表机构 * Google（谷歌）； University of Southern California（南加州大学）

AI总结提出一个可定制的实证审计框架，通过区分真实披露与幻象披露，利用统计假设检验检测合成数据中的隐私泄露，无需模型访问或参考模型，提供比先前方法更紧的隐私泄露下界。

Comments 35 pages, 10 tables, 5 figures

详情

AI中文摘要

生成式AI和大语言模型（LLMs）的快速普及激发了人们对合成数据的兴趣，将其作为敏感真实数据集的隐私保护替代方案。然而，生成高实用性合成数据往往存在记忆和复述训练语料中隐私信息的风险。在这项工作中，我们提出了一个可定制的实证审计框架，旨在检测和解释此类数据披露。我们的框架引入了一种机制来区分“真实披露”——系统直接复现用户信息的情况，以及“幻象披露”——系统偶然生成用户数据的情况。通过将输入数据划分为训练集和保留集，并应用严格的统计假设检验，我们确定观察到的披露是否与严格的隐私基线（如零学习或特定的差分隐私（DP）边界）一致。关键的是，这种方法不需要模型访问、不需要插入金丝雀数据，也不需要参考模型训练——仅需要合成输出和保留的控制集。我们证明，该框架有效地充当了成员推断攻击，提供了比先前基于数据的审计方法更紧的隐私泄露经验下界。我们的方法是模型无关的，适用于任何合成数据生成机制，并且所需的计算资源比影子模型或基于金丝雀的替代方法少几个数量级。

英文摘要

The rapid adoption of generative AI and Large Language Models (LLMs) has spurred interest in synthetic data as a privacy-preserving alternative to sensitive real-world datasets. However, generating high-utility synthetic data often carries the risk of memorizing and regurgitating private information from the training corpus. In this work, we present a customizable empirical auditing framework designed to detect and explain such data disclosures. Our framework introduces a mechanism to distinguish between "true disclosures"-where the system directly reproduces a user's information-and "phantom disclosures''-where the system incidentally generates a user's data. By partitioning input data into training and holdout sets and applying rigorous statistical hypothesis testing, we determine if observed disclosures are consistent with strict privacy baselines, such as zero-learning or specific Differential Privacy (DP) bounds. Crucially, this approach requires no model access, no canary insertion, and no reference model training -only the synthetic output and a held-out control set. We demonstrate that this framework effectively functions as a membership inference attack, providing empirical lower bounds on privacy leakage that are tighter than prior data-based auditing methods. Our approach is model-agnostic, applies to any synthetic data generation mechanism, and requires orders of magnitude fewer computational resources than shadow-model or canary-based alternatives.

URL PDF HTML ☆

赞 0 踩 0

2408.05568 2026-06-16 cs.AI cs.CL cs.CY stat.AP 版本更新

Metacognitive Myopia in Large Language Models

大型语言模型中的元认知近视

Florian Scholten, Tobias R. Rebholz, Mandy Hütter

发表机构 * Psychology Department, University of Tübingen（图宾根大学心理学系）

AI总结提出元认知近视框架解释LLM偏见，认为信息环境中的有偏样本导致五种症状，并通过监控与控制机制近似技术缓解。

详情

AI中文摘要

大型语言模型（LLMs）表现出潜在有害的偏见，这些偏见强化了文化嵌入的刻板印象，影响道德判断，或放大对多数群体的积极评价。我们提出元认知近视作为一个认知生态框架，用以解释一系列已建立和新兴的LLM偏见。我们的理论框架认为，信息环境中的有偏样本导致LLM中元认知近视的五种症状：整合无效嵌入、易受冗余信息影响、在条件计算中忽略基率、基于频率的决策规则，以及对嵌套数据结构的错误高阶统计推断。此外，该框架认为元认知的两个主要组成部分——监控和控制——可以解释这五种症状。因此，我们进一步概述了如何从技术上近似监控和控制，例如通过隐藏的并行推理历史，使交互式LLM在生成公开响应之前能够评估近视推理的风险。我们的理论框架为有缺陷的人机交互和代理AI提供了新的视角，并对在组织结构和高风险决策中实施LLM提出了重要的伦理关切。

英文摘要

Large Language Models (LLMs) exhibit potentially harmful biases that reinforce culturally embedded stereotypes, influence moral judgments, or amplify positive evaluations of majority groups. We propose metacognitive myopia as a cognitive-ecological framework accounting for a conglomerate of established and emerging LLM biases. Our theoretical framework posits that biased samples in the information environment cause five symptoms of metacognitive myopia in LLMs: integration of invalid embeddings, susceptibility to redundant information, neglect of base rates in conditional computation, decision rules based on frequency, and inappropriate higher-order statistical inference for nested data structures. Moreover, it posits that the two main components of metacognition, monitoring and control, could account for these five symptoms. Accordingly, we further outline how monitoring and control could be approximated technically, for instance, through hidden parallel reasoning histories that allow interactive LLMs to evaluate risks of myopic inference before generating overt responses. Our theoretical framework provides a novel perspective on flawed human-machine interactions and agentic AI and raises significant ethical concerns regarding the implementation of LLMs in organizational structures and high-stakes decisions.

URL PDF HTML ☆

赞 0 踩 0

2502.12445 2026-06-16 cs.AI cs.LG stat.ML 版本更新

Computational Safety for Generative AI: A Hypothesis Testing Perspective

生成式AI的计算安全性：假设检验视角

Pin-Yu Chen

发表机构 * IBM Research（IBM研究院）

AI总结本文从假设检验角度形式化生成式AI的计算安全性，提出基于信号处理的方法检测恶意输入和AI生成内容。

Comments Extended version of the paper presented at the ICML 2026 Workshop on Hypothesis Testing

详情

AI中文摘要

AI安全是一个快速发展的研究领域，旨在防止前沿AI技术的危害和滥用，特别是针对能够通过文本提示创建逼真高质量内容的生成式AI（GenAI）工具。此类工具的例子包括大型语言模型（LLM）和文本到图像（T2I）扩散模型。由于相似的训练数据源和神经网络架构设计，各种领先GenAI模型的性能趋于饱和，因此开发可靠的安全护栏已成为责任和可持续性的关键差异化因素。本文提出了计算安全性概念的形式化，这是一个数学框架，通过信号处理理论和方法的视角，能够对GenAI中的安全挑战进行定量评估、表述和研究。特别是，我们探讨了GenAI中两类可表述为假设检验问题的计算安全挑战。对于模型输入的安全性，我们展示了如何使用敏感性分析和损失景观分析来检测带有越狱尝试的恶意提示。对于模型输出的安全性，我们阐明了如何使用统计信号处理来检测AI生成的内容。最后，我们讨论了关键的开放研究挑战、机遇以及信号处理在计算AI安全中的重要作用。

英文摘要

AI safety is a rapidly growing area of research that seeks to prevent the harm and misuse of frontier AI technology, particularly with respect to generative AI (GenAI) tools that are capable of creating realistic and high-quality content through text prompts. Examples of such tools include large language models (LLMs) and text-to-image (T2I) diffusion models. As the performance of various leading GenAI models approaches saturation due to similar training data sources and neural network architecture designs, the development of reliable safety guardrails has become a key differentiator for responsibility and sustainability. This paper presents a formalization of the concept of computational safety, which is a mathematical framework that enables the quantitative assessment, formulation, and study of safety challenges in GenAI through the lens of signal processing theory and methods. In particular, we explore two exemplary categories of computational safety challenges in GenAI that can be formulated as hypothesis testing problems. For the safety of model input, we show how sensitivity analysis and loss landscape analysis can be used to detect malicious prompts with jailbreak attempts. For the safety of model output, we elucidate how statistical signal processing can be used to detect AI-generated content. Finally, we discuss key open research challenges, opportunities, and the essential role of signal processing in computational AI safety.

URL PDF HTML ☆

赞 0 踩 0

2604.22119 2026-06-16 cs.AI 版本更新

离散最优传输是一种强大的音频对抗攻击

Anton Selitskiy, Akib Shahriyar, Jishnuraj Prakasan

发表机构 * University of Rochester（罗切斯特大学）； Rochester Institute of Technology（罗切斯特理工学院）

AI总结提出离散最优传输（DOT）作为黑盒攻击，通过分布对齐（使用WavLM嵌入和熵最优传输）显著降低说话人验证和反欺骗系统的性能，且无需模型参数或梯度。

详情

AI中文摘要

在本文中，我们研究了离散最优传输（DOT）作为针对现代自动说话人验证（ASV）和反欺骗对抗措施（CM）系统的黑盒攻击。我们的攻击作为一种后处理分布对齐步骤。使用熵最优传输和top-k重心投影，将生成语音（或其他人的语音）的帧级WavLM嵌入与未配对的真实语音池对齐，随后进行神经声码器处理。与基于梯度的攻击不同，所提出的方法无需访问模型参数、梯度或训练数据。在ASVspoof2019和ASVspoof5上的实验表明，DOT攻击显著提高了CM的等错误率（EER），并在多种欺骗攻击下显著降低了ASV性能。该攻击可跨数据集迁移，且在CM微调后仍然有效。通过说话人相似性、Fréchet音频距离和嵌入分布可视化的分析表明，DOT通过将源语音向表示空间的真实区域移动而非最大化说话人相似性来成功实施攻击。这些结果表明，基于最优传输的分布对齐代表了当代ASV和反欺骗系统的一个先前未被充分探索的攻击向量。

英文摘要

In this paper, we investigate discrete optimal transport (DOT) as a black-box attack against modern automatic speaker verification (ASV) and anti-spoofing countermeasure (CM) systems. Our attack operates as a post-processing distribution-alignment step. Frame-level WavLM embeddings of generated speech (or another person speech) are aligned to an unpaired bona fide speech pool using entropic optimal transport and a top-k barycentric projection, followed by neural vocoding. Unlike gradient-based attacks, the proposed method requires no access to model parameters, gradients, or training data. Experiments on ASVspoof2019 and ASVspoof5 demonstrate that DOT attack substantially increases CM EER and substantially degrades ASV performance across multiple spoofing attacks. The attack transfers across datasets and remains effective after CM fine-tuning. Analysis using speaker similarity, Fréchet Audio Distance, and visualization of embedding distributions suggests that DOT succeeds by shifting source speech toward bona fide regions of the representation space rather than by maximizing speaker similarity. These results indicate that optimal-transport-based distribution alignment represents a previously underexplored attack vector for contemporary ASV and anti-spoofing systems.

URL PDF HTML ☆

赞 0 踩 0

2510.06445 2026-06-16 cs.CL cs.AI cs.CR 版本更新

MUZZLE: 针对间接提示注入攻击的自适应智能体红队测试框架

Georgios Syros, Evan Rose, Brian Grinstead, Christoph Kerschbaumer, William Robertson, Cristina Nita-Rotaru, Alina Oprea

发表机构 * Northeastern University（东北大学）； Mozilla Corporation（Mozilla公司）

AI总结提出MUZZLE框架，利用智能体轨迹自动识别高显著性注入面，自适应生成上下文相关的恶意指令，评估网络智能体对间接提示注入攻击的安全性，发现44种新攻击和跨应用攻击策略。

详情

AI中文摘要

基于大型语言模型的网络智能体正越来越多地被部署来自动化复杂的在线任务，通过直接与网站交互并代表用户执行操作。尽管这些智能体提供了强大的能力，但其设计使它们暴露于嵌入在不可信网络内容中的间接提示注入攻击，使对手能够劫持智能体行为并违反用户意图。尽管对这一威胁的认识日益增强，现有评估依赖于固定的攻击模板、手动选择的注入表面或范围狭窄的场景，限制了它们捕捉实际中遇到的现实自适应攻击的能力。我们提出了MUZZLE，一个自动化的智能体框架，用于评估网络智能体对间接提示注入攻击的安全性。MUZZLE利用智能体的轨迹自动识别高显著性注入表面，并自适应生成上下文相关的恶意指令，针对机密性、完整性和可用性的违反。与先前方法不同，MUZZLE根据观察到的智能体执行轨迹调整其攻击策略，并利用失败执行的反馈迭代改进攻击。我们在多种网络应用、用户任务和智能体配置上评估MUZZLE，展示了其以最少人工干预自动且自适应地评估网络智能体安全性的能力。我们的结果表明，MUZZLE在4个网络应用上针对10个违反机密性、可用性或隐私属性的对抗目标，在不同LLM和智能体框架下有效发现了44种新攻击。MUZZLE还识别了新颖的攻击策略，包括3种跨应用提示注入攻击和一种针对智能体的钓鱼场景。

英文摘要

Large language model (LLM) based web agents are increasingly deployed to automate complex online tasks by directly interacting with web sites and performing actions on users' behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries to hijack agent behavior and violate user intent. Despite growing awareness of this threat, existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios, limiting their ability to capture realistic, adaptive attacks encountered in practice. We present MUZZLE, an automated agentic framework for evaluating the security of web agents against indirect prompt injection attacks. MUZZLE utilizes the agent's trajectories to automatically identify high-salience injection surfaces, and adaptively generate context-aware malicious instructions that target violations of confidentiality, integrity, and availability. Unlike prior approaches, MUZZLE adapts its attack strategy based on the agent's observed execution trajectory and iteratively refines attacks using feedback from failed executions. We evaluate MUZZLE across diverse web applications, user tasks, and agent configurations, demonstrating its ability to automatically and adaptively assess the security of web agents with minimal human intervention. Our results show that MUZZLE effectively discovers 44 new attacks on 4 web applications with 10 adversarial objectives that violate confidentiality, availability, or privacy properties across different LLMs and agent scaffolds. MUZZLE also identifies novel attack strategies, including 3 cross-application prompt injection attacks and an agent-tailored phishing scenario.

URL PDF HTML ☆

赞 0 踩 0

2604.17805 2026-06-16 cs.LG cs.AI cs.GT 版本更新

SAMark: 一种具有段落级释义鲁棒性的自锚文本水印

Jiahao Huo, Wenjie Qu, Yibo Yan, Kening Zheng, Jiaheng Zhang, Xuming Hu, Philip S. Yu, Mingxun Zhou

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出SAMark自锚水印框架，通过建立语义空间中与句子顺序无关的逐步独立绿色区域，结合多通道双曲评分机制和多样性感知过滤策略，在段落级释义攻击下实现高检测率并打破鲁棒性-质量权衡。

详情

AI中文摘要

语义级水印通过将句子作为基本单元，提高了对文本修改的鲁棒性。然而，对段落级释义的鲁棒性仍然困难，因为此类攻击通过改变句子顺序全局性地破坏水印信号。在这项工作中，我们提出了SAMark，一种自锚水印框架，通过建立语义空间中与步骤无关的绿色区域，消除了对句子顺序的依赖。为了提高可检测性，我们引入了一种多通道双曲评分机制，该机制在放大水印信号的同时抑制来自弱对齐候选的噪声。我们进一步提出了一种多样性感知过滤策略，将硬过滤与软正则化相结合，超越了简单的n-gram重复过滤器，以解决语义冗余问题。实验结果表明，在典型的段落级释义攻击下，SAMark实现了高达90.2%的TP@FP1%，平均比最强先前基线高出30%以上，同时保持了与未水印文本相竞争的生成本质量，并打破了限制先前方法的鲁棒性-质量权衡。

英文摘要

Semantic-level watermarking (SWM) improves robustness against text modifications by treating sentences as the basic unit. However, robustness to paragraph-level paraphrasing remains difficult because such attacks globally disrupt watermark signals by changing sentence order. In this work, we propose SAMark, a self-anchored watermarking framework that removes the dependency on sentence order by establishing a step-independent green region in semantic space. To improve detectability, we introduce a multi-channel hyperbolic scoring mechanism that amplifies watermark signals while suppressing noise from weakly aligned candidates. We further propose a diversity-aware filtering strategy that combines hard filtering with soft regularization, extending beyond simple n-gram repetition filters to address semantic redundancy. Experimental results show that SAMark achieves up to 90.2% TP@FP1% under typical paragraph-level paraphrasing attacks, outperforming the strongest prior baseline by more than 30% on average, while maintaining generation quality competitive with unwatermarked text and breaking the robustness-quality trade-off that limits prior methods.

URL PDF HTML ☆

赞 0 踩 0

2605.26595 2026-06-16 cs.CR cs.AI cs.LG 版本更新

Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

Cordyceps: 通过数据投毒对LLM的隐蔽控制攻击

Zedian Shao, Charles Fleming, Teodora Baluta

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Cisco Systems（思科系统）

AI总结提出一种数据投毒方法，通过语义关联教LLM隐藏任意恶意指令，实现隐蔽控制攻击，绕过多种防御。

Comments USENIX Security '26

详情

AI中文摘要

大型语言模型（LLM）通常在没有经过精心筛选的文本数据集上进行微调，而对手可以对这些数据集进行投毒。现有的投毒攻击主要依赖于固定的触发短语，而异常检测、干净数据正则化或在线监控等防御措施可以中和这些触发短语。在本文中，我们提出了一种数据投毒方法，通过共享知识（如事实或概念）与攻击者选择的短语之间的语义关联，可靠且隐蔽地教LLM一种信息隐藏方案。诱导的隐藏方案可以编码和解码任意恶意指令，从而揭示了一种新的、微妙的投毒诱导漏洞：隐蔽控制攻击。我们精确描述了隐蔽控制攻击，并在5个LLM、3个后门防御和4个提示注入防御上进行了评估。在少量投毒样本的情况下，隐蔽控制攻击在平均攻击成功率上比基于启发式的提示注入攻击高出约40%（相对于干净微调模型）。它们还绕过了基于检测和微调的防御，在后门防御后保持高达93%的攻击成功率，在提示注入防御后保持高达98%的攻击成功率。

英文摘要

Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that teaches an LLM an information hiding scheme reliably and stealthily through semantic associations between shared knowledge such as facts or concepts and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, thus revealing a new and subtle poisoning-induced vulnerability: covert control attacks. We precisely characterize covert control attacks and evaluate them across $5$ LLMs, $3$ backdoor defenses, and $4$ prompt injection defenses. With a small poisoned fraction, covert control attacks outperform heuristic-based prompt injection attacks in average attack success rate by about $40\%$ relative to clean fine-tuned models. They also circumvent defenses based on detection and fine-tuning, maintaining up to $93\%$ attack success rate after backdoor defenses and up to $98\%$ after prompt injection defenses.

URL PDF HTML ☆

赞 0 踩 0

2606.03489 2026-06-16 cs.CR cs.AI 版本更新

Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

从错误中学习：面向安全代码LLM的树状自博弈

Wenqi Chen, Ziyan Zhang, Bin Wang, Lin Liu, Hengheng Zhang, Zhengsu Chen

发表机构 * GitHub

AI总结提出树状自博弈（TSP）框架，将安全代码生成建模为细粒度序列决策过程，通过构建决策树探索安全与脆弱路径，使模型在关键决策节点自我纠正，显著提升代码安全性并实现跨语言泛化。

Comments 18 pages, 3 figures, Accepted by ICML 2026

详情

AI中文摘要

尽管大型语言模型（LLM）在代码生成方面表现出色，但它们仍然容易复制训练数据中固有的细微但关键的安全漏洞。当前的校准技术，如监督微调（SFT）和强化学习（RL），通常在序列级别应用粗粒度的优化。这种方法往往无法解决安全缺陷的局部性，即单个错误的token选择可能危及整个程序。为了弥合这一差距，我们引入了树状自博弈（TSP），一个将安全代码生成重新定义为细粒度序列决策过程的框架。与盲目最大化似然的标准方法不同，TSP构建了一个决策树，模型在其中探索分支轨迹——同时生成安全的“黄金路径”和易受攻击的变体。通过将代码生成视为自博弈游戏，模型学会严格区分自身的局部错误。这提供了一个密集的、在策略的学习信号，迫使模型在通常出现漏洞的关键决策节点进行自我纠正。我们的实验表明，TSP从根本上提高了模型的可靠性。在Python安全基准测试中，TSP将CodeLlama-7B的通过率（SPR@1）提升至75.8%，显著优于SFT（57.0%）和非结构化自博弈基线。关键的是，TSP引发了鲁棒的分布外泛化：模型不仅将未见类别（CWE）中的漏洞减少了24.5%，还成功将从C/C++学到的安全原则迁移到多种语言，包括Python、Go和JavaScript。这表明TSP不仅仅是记忆补丁，而是内化了抽象的、与语言无关的安全逻辑。

英文摘要

While Large Language Models (LLMs) excel in code generation, they remain prone to replicating subtle yet critical vulnerabilities endemic to their training data. Current alignment techniques, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), typically apply coarse-grained optimization at the sequence level. This approach often fails to address the localized nature of security flaws, where a single incorrect token choice can compromise an entire program. To bridge this gap, we introduce Tree-like Self-Play (TSP), a framework that reframes secure code generation as a fine-grained sequential decision process. Unlike standard methods that blindly maximize likelihood, TSP constructs a decision tree where the model explores branching trajectories--generating both secure "golden paths" and vulnerable variants. By treating code generation as a self-play game, the model learns to strictly discriminate against its own localized errors. This provides a dense, on-policy learning signal that forces self-correction precisely at the critical decision nodes where vulnerabilities typically emerge. Our experiments demonstrate that TSP fundamentally enhances model reliability. In Python security benchmarks, TSP boosts CodeLlama-7B's pass rate (SPR@1) to 75.8%, significantly outperforming SFT (57.0%) and unstructured self-play baselines. Crucially, TSP induces robust out-of-distribution generalization: the model not only reduces vulnerabilities in unseen categories (CWEs) by 24.5% but also successfully transfers security principles learned from C/C++ to diverse languages, including Python, Go, and JavaScript. This suggests that TSP does not merely memorize patches, but internalizes abstract, language-agnostic security logic.

URL PDF HTML ☆

赞 0 踩 0

2606.04145 2026-06-16 cs.LG cs.AI cs.DC 版本更新

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

EvalStop：利用世界反馈检测和纠正多租户RLHF平台中的奖励过度优化

Guilin Zhang, Chuanyi Sun, Kai Zhao, Xu Chu, Shahryar Sarkani, John M. Fossaceca

发表机构 * DeepMind, London, UK（深度Mind, 英国伦敦）； University of Cambridge, UK（英国剑桥大学）； University of Washington, USA（美国华盛顿大学）

AI总结提出EvalStop调度原语，通过检测评估分数连续下降来终止作业、释放GPU并保留最佳检查点，以纠正奖励过度优化，在RLHF负载上实现高精度检测并提升JCT。

详情

AI中文摘要

云LLM微调平台越来越多地服务于RLHF工作负载，其中学习到的奖励模型作为人类质量的代理被优化。正如Gao等人(2023)所示，在持续优化压力下，该代理与世界反馈（下游评估指标）发生偏离，这种现象称为奖励过度优化。现有的平台调度器忽略这种偏离：非预见性调度器优化JCT而不考虑任何质量信号，SLAQ式质量感知调度器使用训练损失（一个单调下降的较弱代理，可通过黑客攻击降低），而经典的每作业早停需要人工监控且不释放共享GPU。我们提出EvalStop，一个可组合的调度原语，它在连续k次评估分数下降时终止作业，释放GPU，保留最佳检查点，并委托给任何基础调度器。我们将调度器级别的早停视为检测问题，并在一个离散事件模拟器中评估它，该模拟器的RLHF工作负载混合了奖励黑客攻击和结构健康运行，真实标签对调度器隐藏。在RLHF密集型负载（80% RLHF，64 GPU）上，EvalStop实现了精确率98%、召回率99%、假阳性率1.5%，同时相比SRTF-Est将JCT提高了9%，将浪费的计算减少了22%（p<0.05）。简单的固定进度和损失平台竞争对手要么在健康RLHF上产生65%的假阳性率，要么错过超过一半的真实黑客攻击案例。增益在所有测试的基础调度器上均成立（JCT提升9-25%），且检测质量在评估噪声（噪声标准差≤0.05时精确率至少91%）和黑客攻击基础率（黑客攻击比例20-80%时精确率至少89%）下保持稳定。

英文摘要

Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p<0.05). Trivial fixed-progress and loss-plateau competitors either incur 65% FPR on healthy RLHF or miss over half of true hacking cases. Gains compose across every base scheduler tested (9-25% JCT) and detection quality stays stable under eval noise (precision at least 91% at noise std <= 0.05) and hacking base rate (precision at least 89% across 20-80% hacking fractions).

URL PDF HTML ☆

赞 0 踩 0

2606.07678 2026-06-16 cs.LG cs.AI 版本更新

DOG-DPO:Dynamic Optimization in Geometry for Safety Alignment

DOG-DPO：几何中的动态优化用于安全对齐

Yi Nian, Tiankai Yang, Yudi Zhang, Qi Pan, Zelong Xu, Shenzhe Zhu, Qingqing Luan, Yue Huang, Xiangliang Zhang, Yue Zhao

发表机构 * University of Southern California（南加州大学）； Iowa State University（爱荷华州立大学）； University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； UT Austin（德克萨斯大学奥斯汀分校）； Independent Researcher（独立研究员）； University of Notre Dame（圣母大学）

AI总结提出DOG-DPO框架，将偏好对表示为模型表示空间中的方向，通过几何分解和多样性覆盖选择子集，仅用11%数据即可恢复大部分安全增益。

详情

AI中文摘要

大型语言模型的安全对齐依赖于偏好数据，但当前的流水线通常训练于大规模冗余数据集。现有的数据选择方法通常独立地对每个偏好对评分，将方向性偏好信息压缩为标量质量或多样性分数。这种以样本为中心的视角在多数据集设置中尤其受限，其中共享的安全方向与数据集特定的残余风险共存。我们提出DOG-DPO，一种无需训练的数据选择框架，将偏好对视为结构化几何信号。DOG-DPO首先将每个偏好对表示为模型表示空间中的一个方向。然后，它将多数据集偏好几何分解为全局锚点子空间和数据集特定的残余子空间。最后，它通过最大化基于多样性的覆盖来选择子集，鼓励在DPO训练前广泛、非冗余地覆盖对齐方向。在六个安全基准和两个模型骨干上，DOG-DPO仅使用11%的偏好对就实现了强大的效用-鲁棒性权衡。它恢复了全数据训练的大部分安全增益，同时完全无需教师、无需训练，并且比代表性选择基线快得多。

英文摘要

Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets. Existing data selection methods typically score each preference pair independently, collapsing directional preference information into scalar quality or diversity scores. This sample-centric view is especially limiting in multi-dataset settings, where shared safety directions coexist with dataset-specific residual risks. We propose DOG-DPO, a training-free data selection framework that treats preference pairs as structured geometric signals. DOG-DPO first represents each preference pair as a direction in model representation space. It then decomposes multi-dataset preference geometry into a global anchor subspace and dataset-specific residual subspaces. Finally, it selects subsets by maximizing diversity-based coverage, encouraging broad, non-redundant coverage of alignment directions before DPO training. Across six safety benchmarks and two model backbones, DOG-DPO achieves a strong utility-robustness trade-off using only 11% of the preference pairs. It recovers most of the safety gains of full-data training while remaining entirely teacher-free, training-free, and substantially faster than representative selection baselines.

URL PDF HTML ☆

赞 0 踩 0

2606.10456 2026-06-16 cs.CR cs.AI 版本更新

The Distributed Detectability Band Against Marginal-Preserving Attacks

针对边际保持攻击的分布式可检测性带

Zhang Qinqin, Gao Yuze

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对AI监控的边际保持攻击，通过高斯Copula AR(1)构造将危害编码在时间相关性中，证明分布形状监控器失效而时间相关性监控器有效，形成非空可检测性带。

Comments 10 pages, 11 figures

详情

AI中文摘要

AI控制监控器对个体智能体动作进行评分以检测异常行为，但实际危害可能分布在许多看似良性的步骤中，每个步骤单独低于任何每步警报。我们使用高斯Copula AR(1)构造了一种边际保持、相关性编码的分布式破坏攻击：每步监控器评分边际完全等于良性，因此均值、最大值、top-k尾部及阈值监控器（监控器A）被构造性地击败，而危害被编码在时间相关结构中。我们围绕三个审稿人要求的门组织论文。（1）可实现性门：隐秘攻击在所有测试危害水平（最高3.0）下与良性的KS距离为0.013（实际为零），证实危害完全与每步边际解耦，且可实现性不受危害限制。（2）监控器A与B的调和：我们形式化证明，针对监控器A的评分边际构建的攻击，在另一种评分监控器B（相关性/序列族：CUSUM、SPRT、HMM-LR、游程检验、自相关、窗口逻辑回归）下仍保持边际保持，并将最坏情况声明限定在允许时间特征的评分函数上。（3）非空可检测性带：监控器A的AUC为0.52（随机）；在相同1%假阳性率目标下，监控器B的AUC范围为0.79-0.97，且当危害分摊到更多步骤时，监控器A降至随机水平，而监控器B保持AUC约0.95。这些结果证明了非空可检测性带，并刻画了亚阈值破坏前沿：分布形状监控器被构造性击败；时间相关性监控器可检测但并非平凡最优。

英文摘要

AI-control monitors score individual agent actions to detect misbehavior, but real harm can be distributed across many benign-looking steps, each individually below any per-step alarm. We construct a marginal-preserving, correlation-encoded distributed-sabotage attack using a Gaussian-copula AR(1) construction: the per-step monitor-score marginal is held exactly equal to benign, so mean, max, top-k tail, and threshold monitors (Monitor A) are defeated by construction, while harm is encoded in the temporal correlation structure. We sequence the paper around three reviewer-mandated gates. (1) Realizability gate: the stealthy attack achieves KS-distance to benign of 0.013 (effectively zero) at all tested harm levels up to 3.0, confirming that harm is fully decoupled from the per-step marginal and realizability is not harm-limited. (2) Monitor-A-vs-B reconciliation: we show formally that the attack, built against Monitor A's score marginal, remains marginal-preserving under a different-score Monitor B (the correlation/sequence family: CUSUM, SPRT, HMM-LR, runs test, autocorrelation, windowed logistic), and scope worst-case claims to score functions that admit a temporal signature. (3) Non-empty detectability band: Monitor A achieves AUC 0.52 (chance); Monitor B spans AUC 0.79-0.97 at the same 1% FPR target, and as harm is amortized over more steps Monitor A collapses to chance while Monitor B holds at AUC ~0.95. These results demonstrate a non-empty detectability band and characterize the sub-threshold sabotage frontier: distribution-shape monitors fail by construction; temporal-correlation monitors can detect but are not trivially optimal.

URL PDF HTML ☆

赞 0 踩 0

2606.14027 2026-06-16 cs.CR cs.AI cs.CL cs.SY eess.SY 版本更新

Same-Origin Policy for Agentic Browsers

代理浏览器的同源策略

Xilong Wang, Xiaoxing Chen, Patrick Li, Dawn Song, Neil Gong

发表机构 * Duke University（杜克大学）； Stanford University（斯坦福大学）； UC Berkeley（加州大学伯克利分校）

AI总结研究代理浏览器中同源策略的有效性，构建SOPBench评估基准，发现现有代理浏览器频繁违反SOP，并提出SOPGuard机制来强制执行SOP，同时保持效用和低开销。

详情

AI中文摘要

代理浏览器将自主AI代理集成到Web浏览器中，使用户能够通过自然语言指令完成Web任务。同源策略（SOP）是一种基本的浏览器安全机制，可防止由脚本引起的未经授权的自动化跨源数据流。然而，SOP在代理浏览器中是否仍然有效是一个尚未系统研究的开放问题。在这项工作中，我们填补了这一空白。我们首先观察到，代理浏览器本身可以作为跨源数据流的自动化通道，可能导致SOP违规。为了研究这一现象，我们构建了SOPBench，一个用于评估代理浏览器中SOP违规的基准。我们的评估表明，现有的代理浏览器在良性设置和攻击下都频繁违反SOP。为了解决这个问题，我们提出了SOPGuard，一种针对代理浏览器定制的SOP强制机制。我们在开源代理浏览器BrowserOS中实现了SOPGuard。广泛的评估表明，SOPGuard在保持效用的同时有效地强制执行SOP，并且仅产生很小的运行时开销。我们的代码和数据可在以下网址获取：https://this https URL。

英文摘要

Agentic browsers integrate autonomous AI agents into web browsers, enabling users to accomplish web tasks through natural-language instructions. The same-origin policy (SOP) is a fundamental browser security mechanism that prevents unauthorized automated cross-origin data flows induced by scripts. However, whether SOP remains effective in agentic browsers is an open question that has not been systematically studied. In this work, we bridge this gap. We first observe that an agentic browser can itself serve as an automated channel for cross-origin data flows, potentially leading to SOP violations. To investigate this phenomenon, we construct SOPBench, a benchmark for evaluating SOP violations in agentic browsers. Our evaluation shows that existing agentic browsers frequently violate SOP, both in benign settings and under attacks. To address this problem, we propose SOPGuard, an SOP enforcement mechanism tailored to agentic browsers. We implement SOPGuard in BrowserOS, an open-source agentic browser. Extensive evaluations demonstrate that SOPGuard effectively enforces SOP while preserving utility and incurring only a small runtime overhead. Our code and data are available at https://github.com/wxl-lxw/BrowserOS-SOPGuard.

URL PDF HTML ☆

赞 0 踩 0

2606.15029 2026-06-16 cs.AI 新提交

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

Metric Match：一种评估LLM评判可靠性的子集选择方法

Alyssa Unell, Natalie Dullerud, Naomi Boneh, Meena Jagadeesan, Tatsu Hashimoto, Nigam Shah, Sanmi Koyejo

发表机构 * Stanford University（斯坦福大学）

AI总结提出Metric Match方法，通过选择少量样本进行人工标注，以子集匹配总体可靠性指标，从而高效估计LLM评判的可靠性，实验表明在15个数据集上平均估计误差降低18.7%，标注需求减少32.5%。

详情

AI中文摘要

LLM评判被用于减少评估开放文本生成时对昂贵人工劳动的需求。然而，这些评判的可靠性关键取决于它们与人类评分者的一致性——这一属性本身依赖于昂贵的人工标注。在这项工作中，我们开发了一种方法（Metric Match），用于从有限标注中估计LLM评判的基于相关性的可靠性指标。Metric Match选择一部分样本进行人工标注，使得该子集在获取的合成标签方面与总体可靠性指标匹配。我们通过实验表明，在四种不同的相关性指标和15个数据集上，Metric Match相对于随机子集选择的胜率为0.838，平均估计误差降低18.7%，标注需求减少32.5%。我们提供了一个成本模型，并强调了一个医学案例研究，在该案例中，与随机选择相比，我们的方法为专家标注节省了1,041.67美元。此外，我们将任务从可靠性估计转变为可靠性分类，即判断给定评判是否高于部署阈值，使用Metric Match优于随机选择。所有项目代码公开可用，我们还提供了一个可安装的包以便使用。

英文摘要

LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters -- a property that itself depends on costly human annotations. In this work, we develop a method (Metric Match) for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for human annotation such that the subset matches the population reliability metric with respect to acquired synthetic labels. We empirically show that Metric Match achieves a win-rate of 0.838 against random subset selection across four different correlation metrics and 15 datasets, with an 18.7% decrease in average estimation error and reduces annotation needs by 32.5%. We provide a cost model and highlight a medical case study where our method saves $1,041.67 compared to random selection for expert annotation. Further, we shift our task from reliability estimation to reliability classification of whether a given judge is above a deployment threshold, outperforming random selection with Metric Match. All project code is publicly available, and we additionally provide an installable package for ease of use.

URL PDF HTML ☆

赞 0 踩 0

2606.15034 2026-06-16 cs.AI 新提交

OSGuard: A Benchmark for Safety in Computer-Use Agents

OSGuard：计算机使用智能体安全基准

Mina Mohammadmirzaei, Jeffrey Flanigan

发表机构 * University of California, Santa Cruz（加州大学圣克鲁兹分校）

AI总结提出OSGuard双粒度基准，通过动作级安全判断和风险增强执行评估智能体在良性指令下的安全性，揭示局部监督与端到端安全的差距。

详情

AI中文摘要

计算机使用智能体越来越根据它们是否完成现实的桌面和网页任务来评估。然而，仅凭任务成功可能会遗漏智能体通过不安全捷径达到名义目标时的失败。我们引入了OSGuard，一个双粒度基准套件，用于在良性、未更改的用户指令下评估计算机使用智能体的安全性。OSGuard包含一个用于局部护栏决策的动作级基准和一个用于端到端评估的风险增强执行套件。动作级基准由上下文化的提议动作组成，这些动作被标记为允许、无关或不安全，每个判断都相对于原始指令和当前界面状态。执行套件包含手动构建的OSWorld衍生任务变体，其中原始任务仍然可完成，但环境被修改以引入潜在危险，如破坏性覆盖等。每个变体都配有增强评估器，保留原始任务成功标准，同时添加显式的基于状态的安全不变量，使我们能够区分安全完成和满足名义任务目标的不安全完成。我们在OSGuard上的实验结果表明，当前的多模态护栏在孤立的动作判断上表现良好，而风险增强执行暴露了局部监督与可靠端到端安全之间的剩余差距。这种双粒度设计能够更精确地诊断模型是否既能识别不安全的提议动作，又能在作为护栏部署时提高全任务安全性。

英文摘要

Computer-use agents are increasingly evaluated by whether they complete realistic desktop and web tasks. However, task success alone can miss failures in which an agent reaches the nominal goal through an unsafe shortcut. We introduce OSGuard, a dual-granularity benchmark suite for evaluating safety in computer-use agents under benign, unchanged user instructions. OSGuard contains an action-level benchmark for local guardrail decisions and a risk-augmented execution suite for end-to-end evaluation. The action-level benchmark consists of contextualized proposed actions labeled as allowed, unrelated, or unsafe, each judged relative to the original instruction and current interface state. The execution suite contains manually constructed OSWorld-derived task variants in which the original task remains achievable, but the environment is modified to introduce latent hazards such as destructive overwrites, etc. Each variant is paired with augmented evaluators that retain the original task-success criterion while adding explicit state-based safety invariants, allowing us to distinguish safe completions from unsafe completions that satisfy the nominal task objective. Our experimental results on OSGuard show that current multimodal guardrails can perform well on isolated action judgments, while risk-augmented execution exposes remaining gaps between local oversight and reliable end-to-end safety. This dual-granularity design enables more precise diagnosis of whether models can both recognize unsafe proposed actions and improve full-task safety when deployed as guardrails.

URL PDF HTML ☆

赞 0 踩 0

2606.15107 2026-06-16 cs.AI 新提交

Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning

迈向可验证的自主数据科学：通过基于工具的推理解决不规则时间序列问答

Sanhorn Chen, Xiaoyang Chen, Boyu Liu, Roy Zhao

发表机构 * University of Illinois Urbana Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结针对现实世界时间序列数据的不规则性，提出IRTS-ToolBench基准（1700个问题，10种任务类型，13个领域），通过标准化输入和可复现评估协议，研究LLM和AI代理在不规则条件下的表现。

Comments 15 pages

2606.15258 2026-06-16 cs.AI 新提交

Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

Mask-Proof: 一种基于LLM的数学证明自动数据整理流水线

Jierui Zhang, Siyuan Tan, Xinhang Li, Longzhuangzhi Lin, Dailin Li, Chengfeng Gu, Xinping Li, Yaxian Hao, Shengjia Liang, Yuxiang Ren, Wenhao Liu

发表机构 * School of Computer Science, Beijing University of Posts and Telecommunications（北京邮电大学计算机学院）； Graduate College for Engineers, Beijing University of Posts and Telecommunications（北京邮电大学研究生院工程师学院）； School of Mathematical Sciences, Fudan University（复旦大学数学科学学院）； School of Cyberspace Security, Beijing University of Posts and Telecommunications（北京邮电大学网络空间安全学院）； School of Computer Science and Technology, Dalian University of Technology（大连理工大学计算机科学与技术学院）； Chu Kochen Honors College, Zhejiang University（浙江大学竺可桢学院）； Department of Psychological and Cognitive Sciences, Tsinghua University（清华大学心理学与认知科学系）； State Key Laboratory of Virtual Reality Technology and Systems, Beihang University（北京航空航天大学虚拟现实技术与系统国家重点实验室）； School of Intelligence Science and Technology, Nanjing University（南京大学智能科学与技术学院）

AI总结提出Mask-Proof流水线，将真实证明转化为可自动检查的掩码步骤任务，通过LLM等价性判断器评估模型推理，构建包含292个问题的基准，推理增强模型性能提升12%-27%。

详情

AI中文摘要

大型语言模型（LLM）在数学问题求解方面能力日益增强，甚至能辅助研究级证明，但我们仍缺乏一种可扩展且可重复的方式来衡量跨不同来源的长证明中的逐步推理。这种评估差距限制了在经证明认证的科学进步中可信赖的AI辅助。现有评估通常强调最终答案或依赖昂贵的专家评分，而端到端的证明生成仍然是开放式的且难以自动验证。我们引入Mask-Proof，一个将真实证明转化为可自动检查的掩码步骤任务的流水线。它掩盖关键公式步骤，提供必要的上下文，并使用基于LLM的等价性判断器（通过重复投票保持稳定性）评估模型重建。由此产生的Mask-ProofBench包含来自不同研究领域的292个精心策划的问题。对17个模型的实验表明，推理增强模型比标准模型性能提升12%至27%。我们的评估器与专家注释者的一致性达到96.8%，实现了对逐步数学推理的忠实、可重复和可比较的测量。基准、注释和代码可在https://github.com/weating/Mask-Proof获取。

英文摘要

Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically. We introduce Mask-Proof, a pipeline that turns real proofs into automatically checkable masked-step tasks. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The resulting Mask-ProofBench contains 292 curated problems across diverse research areas. Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. Our evaluator achieves 96.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step-level mathematical reasoning. Benchmark, annotations, and code are available at https://github.com/weating/Mask-Proof.

URL PDF HTML ☆

赞 0 踩 0

2606.15300 2026-06-16 cs.AI cs.CL 新提交

CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks?

CODA-BENCH：代码智能体能否处理数据密集型任务？

Yuxin Zhang, Ju Fan, Meihao Fan, Shaolei Zhang, Xiaoyong Du

发表机构 * Renmin University of China（中国人民大学）

AI总结提出CODA-BENCH基准，在数据密集型环境中联合评估代码与数据智能，包含1009个任务，平均每个环境980个文件，揭示当前智能体在数据发现与代码执行整合上的不足。

Comments Accepted at ICML 2026. 37 pages, 11 figures. Project page: https://coda-bench.github.io/ Code: https://github.com/ruc-datalab/CoDA-Bench Data: https://huggingface.co/datasets/RUC-DataLab/CoDA-Bench

详情

AI中文摘要

高级智能体正日益展现出作为自主工程师的潜力，这催生了对能够捕捉真实世界开发复杂性的评估基准的需求。此类环境通常涉及复杂代码和大规模数据（即文件系统）。然而，现有基准通常孤立地评估代码中心或数据中心能力，与真实开发场景存在明显差距。在本文中，我们通过引入CODA-BENCH来弥合这一差距，这是首个在数据密集型环境中联合评估代码与数据智能的基准。我们基于Kaggle生态系统（包含数百个数据集）构建了一个数据密集型Linux沙箱，其中智能体必须主动探索复杂的文件层次结构以识别相关资源，并为数据驱动的分析任务生成代码。CODA-BENCH包含跨越31个社区的1009个任务，每个任务环境平均包含980个文件，模拟了真实的数据规模和噪声。对高级智能体的评估显示，即使是最优系统也难以有效整合数据发现与代码执行，成功率仅为61.1%。这些结果凸显了当前智能体在数据密集型任务中的能力差距，并为未来研究指明了有希望的方向。

英文摘要

Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development. Such environments typically involve both complex code and large-scale data (i.e., file system). However, existing benchmarks usually evaluate code-centric or data-centric capabilities in isolation, leaving a clear gap with real development scenarios. In this paper, we bridge this gap by introducing CODA-BENCH, the first benchmark to jointly evaluate code and data intelligence in a data-intensive environment. We construct a data-intensive Linux sandbox based on the Kaggle ecosystem (containing hundreds of datasets), where agents must actively explore complex file hierarchies to identify relevant resources and generate code for data-driven analytical tasks. CODA-BENCH comprises 1,009 tasks spanning 31 communities, with each task environment containing an average of 980 files, simulating realistic data scale and noise. Evaluations of advanced agents reveal that even top-performing systems struggle to effectively integrate data discovery with code execution, achieving a success rate of only 61.1%. These results highlight a substantial gap in current agentic capabilities for data-intensive tasks and point to promising directions for future research.

URL PDF HTML ☆

赞 0 踩 0

2606.15474 2026-06-16 cs.AI stat.AP 新提交

Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines

谁漂移了：系统还是裁判？LLM评估流水线中的随时有效归因

Yitao Li

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出一种基于固定锚点集和赌检验的方法，区分LLM评估中产品性能下降与裁判模型变化导致的分数漂移，并证明其随时有效性和归因准确性。

详情

AI中文摘要

对LLM产品的持续评估依赖于一个被视为地面真相的强大LLM裁判：一个廉价的监控器对每次交互进行评分，当分数下降时团队会收到警报。但裁判本身是一个API背后的模型，静默的版本升级或评分提示更新会改变其评分方式——因此每次漂移警报在更差的产品和变化的裁判之间是模糊的。我们通过一个固定的人工标注锚点集（当前裁判以稳定间隔重新评分）、一个关于裁判与人类差距的二次赌e过程，以及一个返回{无, 系统, 裁判}判决的守卫窗口规则来解决这种模糊性。我们证明了随时有效性、单向识别（只有裁判可以移动锚点）、一个归因竞赛（其设计法则是锚点必须跑赢它们守卫的主过程）以及过程正交性。在两个真实的裁判变化中，静默版本升级在60/60次运行中被检测为裁判漂移，且零次误归因为系统；而一个污染性的严格提示变化在守卫宽度为300时，120次运行中有110次被正确归因——而行业默认的滚动z检验在75%的无漂移流上产生误报。每个实验在第二个领域（TL;DR摘要）上重复，无需重新调整参数，并且当领域不同时，差异正是竞赛所预测的：严格提示变化在那里更强烈地改变分数，因此锚点触发更快，归因变得完美（240/240）。该监控器的运行成本约为对每个项目使用强裁判的0.64倍，或在更便宜但更聋的模式下为0.21倍。

英文摘要

Continuous evaluation of LLM products relies on a strong LLM judge treated as ground truth: a cheap monitor scores every interaction and a team is paged when the score drifts down. But the judge is itself a model behind an API, and a silent version bump or scoring-prompt update changes how it scores -- so every drift alarm is ambiguous between a worse product and a changed judge. We resolve the ambiguity with a fixed, human-labeled anchor set that the current judge re-scores at a steady interleave, a second betting e-process on the judge-versus-human gap, and a guard-window rule returning a verdict in {none, system, judge}. We prove anytime-validity, one-way identification (only the judge can move the anchors), an attribution race whose design law is that the anchors must out-run the main process they guard, and process orthogonality. On two real judge changes, a silent version bump is detected as judge drift in 60/60 runs with zero judge-to-system misattribution, and a contaminating strict-prompt change is correctly attributed on 110 of 120 runs at guard width 300 -- while the industry-default rolling z-test false-alarms on 75% of drift-free streams. Every experiment replicates on a second domain (TL;DR summarization) with nothing re-tuned, and where the domains differ the differences are the ones the race predicts: the strict-prompt change shifts scores harder there, so the anchors fire faster and attribution becomes perfect (240/240). The monitor runs at approximately 0.64 of the cost of strong-judging every item, or 0.21 in a cheaper-but-deafer regime.

URL PDF HTML ☆

赞 0 踩 0

2606.15508 2026-06-16 cs.AI 新提交

人工智能指数报告2026

Sha Sajadieh, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Lapo Santarlasci, Juan Pava, Nestor Maslej, Russ Altman, Erik Brynjolfsson, Carla Brodley, Jack Clark, Virginia Dignum, Vipin Kumar, James Landay, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Elham Tabassi, Russell Wald, Toby Walsh, Dan Weld

AI总结本报告追踪AI在推理、安全及现实任务执行方面的测试进展，分析治理框架、评估方法与技术发展之间的差距，并新增AI在科学与医学领域的独立章节。

详情

AI中文摘要

欢迎阅读第九版AI指数报告。随着AI持续快速发展，问题在于围绕它构建的系统能否跟上步伐。治理框架、评估方法、教育系统以及追踪AI影响所需的数据基础设施，都难以匹配技术本身的速度。AI能做什么与我们准备如何管理它之间的差距贯穿本年度报告的每一章。本版新增内容：报告追踪了AI如何在推理、安全和现实任务执行方面受到更雄心勃勃的测试，以及为何这些测量越来越难以依赖。它还提供了对生成式AI经济价值的新估计，以及其劳动力市场影响的新证据、一个关于AI主权的分析框架，以及与Schmidt Sciences合作开发的科学章节。本报告首次设有关于AI在科学和AI在医学中的独立章节，反映了AI在这两个领域日益增长的影响。

英文摘要

Welcome to the ninth edition of the AI Index report. As AI continues to advance rapidly, the question becomes whether the systems built around it can keep up. Governance frameworks, evaluation methods, education systems, and the data infrastructure needed to track AI's impact are struggling to match the pace of the technology itself. That gap between what AI can do and how prepared we are to manage it runs through every chapter of this year's report. New in this edition, the report tracks how AI is being tested more ambitiously across reasoning, safety, and real-world task execution, and why those measurements are increasingly difficult to rely on. It also features new estimates of generative AI's economic value alongside emerging evidence of its labor market effects, an analytical framework on AI sovereignty, and a science chapter developed in collaboration with Schmidt Sciences. For the first time, the report features standalone chapters on AI in science and AI in medicine, reflecting AI's growing impact across these two domains.

URL PDF HTML ☆

赞 0 踩 0

2606.15766 2026-06-16 cs.AI cs.HC 新提交

Rethinking Scaffolding in LLM Tutors: The Interactional Mismatch Between Benchmarks and Real-World Deployments

重新思考LLM导师中的脚手架：基准测试与真实部署之间的交互不匹配

Alexandra Neagu, Jeffrey T. H. Wong, Marcus Messer, Rhodri Nelson, Peter B. Johnson

发表机构 * University of Cambridge（剑桥大学）

AI总结通过分析9490个聊天记录，发现AI导师基准测试假设学生积极接受脚手架，但真实场景中学生常绕过脚手架，揭示基准测试与真实部署的交互不匹配。

Comments Pluralistic Alignment Workshop @ ICML 2026, Seoul, South Korea

详情

AI中文摘要

AI导师基准测试中评估的一个核心教学价值是脚手架：通过渐进步骤引导学生走向解决方案。然而，将脚手架行为嵌入聊天机器人的对齐和评估方法基于一个隐含假设：学生会接受脚手架并参与对话。为了检验这一假设是否成立，我们引入了一个围绕两个指标——聊天机器人脚手架和学生接受度——的评估流程，并将其应用于跨越AI导师基准测试和教育聊天机器人真实部署的九个数据集，共9490个聊天记录。我们的分析揭示，虽然基准测试假设一个高脚手架、高学生接受度的环境，但真实场景中的学生整体表现出较低水平的接受度——经常绕过聊天机器人的教学框架，以较低的人际成本将交互推向自己的学习目标。我们认为，绕过脚手架不一定是坏事；相反，它经常突显聊天机器人的教学框架与学生目标之间的不匹配。为了有意义地评估聊天机器人辅助的有效性，未来的基准测试必须超越学生简单接受脚手架的假设，而是评估这些聊天机器人如何应对多样化的学习环境和学生驱动的交互模式。

英文摘要

A central pedagogical value evaluated in AI tutor benchmarks is scaffolding: guiding students through graduated steps toward a solution. Alignment and evaluation methods for embedding scaffolding behaviour into chatbots, however, rest on an implicit assumption: that students will take up the scaffolding and engage in the conversation. To examine whether this assumption holds, we introduce an evaluation pipeline around two metrics - Chatbot Scaffolding and Student Uptake - and apply them across nine datasets of 9,490 chats, spanning AI tutor benchmarks and real-world deployments of educational chatbots. Our analysis reveals that while benchmarks assume a high-scaffolding, high-student-uptake environment, students in real-world settings exhibit lower levels of uptake overall - frequently bypassing the chatbot's pedagogical framing to drive the interaction toward their own learning goals at little interpersonal cost. We argue that bypassing scaffolding is not necessarily detrimental; rather, it frequently highlights a mismatch between a chatbot's pedagogical framing and the student's learning goals. To meaningfully evaluate the effectiveness of a chatbot's assistance, future benchmarks must move beyond the assumption that students will simply take up the scaffolding, and instead evaluate how these chatbots navigate diverse learning contexts and student-driven interaction patterns.

URL PDF HTML ☆

赞 0 踩 0

2606.15862 2026-06-16 cs.AI 新提交

审计代码强化学习训练环境中的奖励可破解性

Shreshth Rajan

发表机构 * GitHub

AI总结测量代码RL环境接受错误解决方案的比率，发现SWE-bench Verified中28.5%的任务测试套件薄弱，并提出通过LLM判断器和Docker金标准门控来加固漏洞任务的方法。

详情

AI中文摘要

我们测量了代码强化学习环境将错误解决方案视为正确的比率。在SWE-bench Verified的49个任务样本中，28.5%的任务测试套件足够薄弱，以至于Docker验证的错误补丁能通过它们。在6个代码库的20个R2E-Gym任务上，相同的单次利用生成管道产生25.0%的成功率。对SWE-bench Verified上134个前沿模型提交的随机效应荟萃分析发现，在相同人工评定的难度层级内，模型Pass@1在标记为可破解的任务上比稳健任务高14.14个百分点（95%置信区间[+11.80, +16.48]；单侧p < 10^-6；I^2 = 0%；134个模型中有123个为正）。然后我们描述了一个加固被破坏任务的流程。一个内联LLM判断器配合Docker金标准门控，在咨询判断器之前对每个生成的测试针对金标准解决方案运行。在审计中的11个被破坏任务上，门控标记出105个决定性的LLM生成测试中的65个在金标准补丁上失败，这是LLM判断器单独遗漏的61.9%的每次增强缺陷率。通过多样性偏置重试，该循环将11个任务中的9个收敛到门控升级。

英文摘要

We measure the rate at which code RL environments accept incorrect solutions as correct. On a 49-task sample of SWE-bench Verified, 28.5% of tasks have test suites weak enough that a Docker-verified incorrect patch passes them. On 20 R2E-Gym tasks across 6 repositories, the same pipeline at single-shot exploit generation yields 25.0%. A random-effects meta-analysis over 134 frontier model submissions to SWE-bench Verified finds, within the same human-rated difficulty stratum, model Pass@1 is +14.14 percentage points higher on flagged-hackable tasks than on robust ones (95% CI [+11.80, +16.48]; one-sided p < 10^-6; I^2 = 0%; 123 of 134 models positive). We then describe a procedure for hardening the broken tasks. An inline LLM judge with a Docker gold-sanity gate runs each generated test against the gold solution before the judge is consulted. On the 11 broken tasks in the audit, the gate flags 65 of 105 decisive LLM-generated tests as failing on the gold patch itself, a 61.9% per-augmentation defect rate the LLM judge alone misses. With diversity-biased retry, the loop converges 9 of 11 tasks to a gated upgrade.

URL PDF HTML ☆

赞 0 踩 0

2606.16113 2026-06-16 cs.AI cs.LG 新提交

RecourseBench: A Modular Framework for Reproducible Algorithmic Recourse Evaluation

RecourseBench: 一个用于可复现算法追责评估的模块化框架

Zahra Khotanlou, Hashir Ahmed, Chenghao Tan, Ahmed Abdelaal, Amir-Hossein Karimi

发表机构 * University of Waterloo（滑铁卢大学）

AI总结提出RecourseBench框架，通过模块化、可复现性和交互性三大承诺，实现追责方法的统一评估，并集成28种方法，首次通过自动化定量测试强制方法级可复现性。

详情

AI中文摘要

算法追责方法提供反事实解释，告知个体需要采取哪些行动来推翻不利的模型决策。尽管方法学进展迅速，但原则性比较仍然难以实现；现有框架通常难以扩展，缺乏互操作性，并且缺乏系统验证来确保集成的方法忠实复现其最初报告的结果。我们引入了\emph{RecourseBench}，一个围绕三大承诺（即模块化、可复现性和交互性）构建的统一评估框架。该框架将流程分解为五个完全解耦的层——数据、预处理、模型、追责方法和评估——由抽象接口和动态注册表管理。为了解决先前基准测试中的可复现性差距，我们引入了一个四级分类系统，其中每个集成的方法都通过自动化测试套件针对其最初报告的结果进行验证。我们还提供了一个交互式Web界面，用于在方法、数据集和模型架构之间进行灵活的、配置驱动的比较。我们的框架目前集成了28种最先进的追责方法，据我们所知，这是第一个通过自动化定量测试明确强制执行方法级可复现性的追责基准。

英文摘要

Algorithmic recourse methods provide counterfactual explanations that inform individuals of the actions required to overturn an unfavorable model decision. Despite rapid methodological progress, principled comparison remains elusive; existing frameworks are often difficult to extend and lack both interoperability and systematic verification that integrated methods faithfully reproduce their originally reported results. We introduce \emph{RecourseBench}, a unified evaluation framework built around three commitments namely, modularity, reproducibility, and interactivity. The framework decomposes the pipeline into five fully decoupled layers -- Data, Preprocessing, Model, Recourse Method, and Evaluation -- governed by abstract interfaces and a dynamic registry. To address the reproducibility gap in prior benchmarks, we introduce a four-tier classification system in which every integrated method is validated by an automated test suite against its originally reported results. We further provide an interactive web interface for flexible, configuration-driven comparison across methods, datasets, and model architectures. Our framework currently integrates 28 state-of-the-art recourse methods and, to our knowledge, constitutes the first recourse benchmark to explicitly enforce method-level reproducibility through automated, quantitative testing.

URL PDF HTML ☆

赞 0 踩 0

2606.16173 2026-06-16 cs.AI 新提交

TimeVista: Exploring and Exploiting Vision-Language Models as Judges for Time Series Forecasting

TimeVista：探索和利用视觉语言模型作为时间序列预测的评判者

Zhi Chen, Yuxuan Wang, Jialong Wu, Yong Liu, Haoran Zhang, Xingjian Su, Jianmin Wang, Mingsheng Long

发表机构 * School of Software, BNRist, Tsinghua University（清华大学软件学院、北京信息科学与技术国家研究中心）

AI总结提出TimeVista框架，利用视觉语言模型（VLM）作为时间序列预测的评判者，通过微观和宏观判断结合上下文信息评估预测质量，实验表明VLM比传统指标更符合人类偏好。

详情

AI中文摘要

高质量的时间序列预测对于现实世界的决策至关重要。然而，传统的逐点度量往往无法揭示复杂的时间模式，并且与人类直观偏好的一致性较差。虽然“LLM-as-a-Judge”范式通过提供灵活、符合人类判断的评估彻底改变了文本评估，但其在时间序列中的应用仍鲜有探索。在本文中，我们利用视觉语言模型（VLM）作为时间序列预测的评判者，利用它们理解基于文本信息的时间序列图的能力。具体来说，我们提出了一种新颖的框架，整合了基于上下文信息的微观和宏观层面判断来评估时间序列预测。为此，我们引入了TimeVista，一个全面的VLM-as-a-Judge基准，包含5563个时间序列样本及其详细的评估标准。广泛的元评估表明，VLM是高度可靠的评判者，与人类偏好的一致性显著高于传统指标。基于我们的基准，我们在VLM-as-a-Judge范式下全面评估了近期的时间序列基础模型（TSFM）。我们的结果表明，VLM作为稳健且可解释的评判者，为评估时间序列模型提供了全面且符合人类的标准。

英文摘要

High-quality time series forecasting is pivotal for real-world decision-making. However, traditional point-wise metrics often fail to reveal complex temporal patterns and align poorly with human intuitive preferences. While the ''LLM-as-a-Judge'' paradigm has revolutionized text evaluation by providing flexible, human-aligned judgment, its application to time series remains largely unexplored. In this paper, we leverage Vision-Language Models (VLMs) as judges for time series forecasting, harnessing their ability to comprehend time series plots grounded in textual information. Specifically, we propose a novel framework integrating micro- and macro-level judgments informed by contextual information to evaluate time series forecasting. To this end, we introduce TimeVista, a comprehensive VLM-as-a-Judge benchmark comprising 5563 time series samples paired with detailed evaluation rubrics. Extensive meta-evaluations demonstrate that VLMs are highly reliable judges, achieving significantly higher consistency with human preferences than conventional metrics. Building upon our benchmark, we comprehensively assess recent Time Series Foundation Models (TSFMs) under the VLM-as-a-Judge paradigm. Our results demonstrate that VLMs serve as robust and interpretable judges, providing a comprehensive, human-aligned standard for evaluating time series models.

URL PDF HTML ☆

赞 0 踩 0

2606.16175 2026-06-16 cs.AI 新提交

PAL-Bench: Evidence-Grounded Profile Reconstruction from Longitudinal Personal Albums

PAL-Bench: 基于纵向个人相册的证据驱动画像重建

Qiwei Yan, Zhiqiang Yuan, Zexi Jia, Nanxing Hu, Kailin Lyu, Jie Zhou, Jinchao Zhang

发表机构 * Tsinghua University（清华大学）； Beijing University of Posts and Telecommunications（北京邮电大学）； University of Chinese Academy of Sciences（中国科学院大学）； Zhejiang University（浙江大学）

AI总结提出PAL-Bench基准，通过合成用户和隐私保护审计，评估从纵向个人相册中重建用户画像、社交关系和身份映射的能力，发现现有系统在身份解析和证据引用方面存在不足。

详情

AI中文摘要

纵向个人相册是弱模式多模态数据库：包含噪声感知记录，其关键事实需要跨人脸、文本、时间戳、位置和重复事件进行连接。现有的视觉、视频、文档和生活日志基准测试了子问题，但未涉及具有社会身份绑定和证据引用的相册级画像重建。由于评估所需的真实数据——所有者画像、社交图谱、人脸-姓名映射和证据来源——是私有状态，真实相册无法安全发布，因此基准测试此任务具有挑战性。我们提出PAL-Bench，一个在公共记录契约下进行证据驱动重建的受控基准。其证据编译器构建潜在的私有世界，编程目标级证据路径，渲染相册像素，通过感知管道重新测量，并导出经过审计的公共/私有视图。智能体仅接收感知衍生的公共记录；目标、标识符映射和证据路径保持隐藏。PAL-Bench包含50个合成用户、36,659条公共照片记录以及2,799个关于所有者事实、身份和关系的目标。一项包含10名参与者的隐私保护审计确认，PAL-Bench的证据结构与真实私有相册匹配，尽管等效发布仍受隐私限制。在七个系统和两个计算匹配的诊断中，一个七指标协议揭示了合理的画像总结与忠实的社会重建之间的差距：系统恢复了一些所有者事实，但在处理重复出现的身份和证据引用方面存在困难。PAL-TRACE是一个参考框架，在所有者事实挖掘之前冻结身份绑定，表现最佳，但硬身份解析远未解决。PAL-Bench为感知实体解析、多模态数据集成、时间证据聚合和来源感知的结构化预测提供了测试平台。

英文摘要

Longitudinal personal albums are weak-schema multimodal databases: noisy perceptual records whose key facts require joins across faces, text, timestamps, locations, and repeated events. Existing visual, video, document, and lifelog benchmarks test sub-problems, but not album-scale profile reconstruction with social identity binding and evidence citation. Benchmarking this task is difficult because the ground truth needed for evaluation--owner profiles, social graphs, face-name maps, and evidence provenance--is private state that real albums cannot safely release. We introduce PAL-Bench, a controlled benchmark for evidence-grounded reconstruction under a public-record contract. Its Evidence Compiler builds latent private worlds, programs target-level evidence paths, renders album pixels, re-measures them through perception pipelines, and exports audited public/private views. Agents receive only perception-derived public records; targets, identifier maps, and evidence paths remain hidden. PAL-Bench contains 50 synthetic users, 36,659 public photo records, and 2,799 targets over owner facts, identities, and relations. A privacy-preserving audit with 10 participants confirms that PAL-Bench evidence structures match real private albums, though equivalent releases remain privacy-prohibitive. Across seven systems and two compute-matched diagnostics, a seven-metric protocol reveals a gap between plausible profile summarization and faithful social reconstruction: systems recover some owner facts but struggle with recurring identities and evidence citation. PAL-TRACE, a reference framework that freezes identity bindings before owner-fact mining, performs best but leaves hard identity resolution far from solved. PAL-Bench provides a testbed for perceptual entity resolution, multimodal data integration, temporal evidence aggregation, and provenance-aware structured prediction.

URL PDF HTML ☆

赞 0 踩 0

2606.16206 2026-06-16 cs.AI cs.CL cs.CY cs.HC 新提交

LabOSBench: 科学仪器控制的计算机使用智能体基准测试

Anqi Zou, Han Deng, Chengyu Zhang, Junquan Hu, Yu Wang, Yuxiang Xing, Aokai Zhang, Hanling Zhang, Zhaoyang Liu, Ben Fei, Zhihui Wang, Wanli Ouyang

发表机构 * Shenzhen Loop Area Institute（深圳循环区域研究所）； Dalian University of Technology（大连理工大学）； The Chinese University of Hong Kong（香港中文大学）； The Hong Kong University of Science and Technology（香港科技大学）

AI总结提出LabOSBench基准，基于Web科学仪器模拟器评估多模态GUI智能体在仪器控制中的表现，揭示现有智能体在反馈驱动操作和长流程执行上的不足。

详情

AI中文摘要

当前的计算机使用基准主要关注虚拟化系统中的软件操作任务，而科学仪器场景需要协调控制复杂界面和反馈驱动的参数调整。然而，直接在物理高精度仪器上评估智能体因高成本、安全风险、有限可访问性和难以保证可重复评估而不切实际。这促使需要一个模拟但真实的测试平台，既能保留科学仪器的操作挑战，又能实现可扩展和安全的基准测试。为此，我们引入了LabOSBench，这是一个基于一套基于Web的科学仪器模拟器构建的多模态GUI智能体的挑战性基准。LabOSBench通过浏览器直接操作，避免了资源密集型的操作系统虚拟化，同时支持灵活的任务配置和基于执行的评估。具体来说，LabOSBench在八个仪器模拟器上构建了96个子任务，涵盖了从样品加载、对准、参数调整、数据采集到结果检查的工作流程。我们在子任务和端到端级别评估了通用视觉语言模型、专用GUI智能体模型和高级智能体框架。我们的实验表明，尽管现有智能体可以完成许多结构化的GUI子任务，但它们仍然在反馈驱动操作和长周期工作流执行中挣扎。总体而言，LabOSBench为推进计算机使用智能体向科学仪器控制发展提供了一个可重复、低成本的测试平台。

英文摘要

Current computer-use benchmarks primarily focus on software operation tasks in virtualized systems, whereas scientific instrumentation scenarios require coordinated control over complex interfaces, and feedback-driven parameter adjustment. However, directly evaluating agents on physical high-precision instruments is impractical due to high cost, safety risks, limited accessibility, and difficulty in ensuring reproducible evaluation. This motivates the need for a simulated yet realistic testbed that preserves the operational challenges of scientific instruments while enabling scalable and safe benchmarking. To this end, we introduce LabOSBench, a challenging benchmark for multimodal GUI agents built on a suite of web-based scientific-instrument simulators. Operating directly via a browser, LabOSBench avoids resource-heavy OS virtualization while supporting flexible task configuration and execution-based evaluation. Specifically, LabOSBench constructs 96 subtasks across eight instrument simulators, covering workflows from sample loading, alignment, parameter tuning, and data acquisition to result inspection. We evaluate general-purpose vision-language models, specialized GUI agent models, and advanced agentic frameworks at both subtask and end-to-end levels. Our experiments reveal that while existing agents can complete many structured GUI subtasks, they still struggle with feedback-driven operations and long-horizon workflow execution. Overall, LabOSBench provides a reproducible, low-cost testbed for advancing computer-using agents toward scientific-instrument control.

URL PDF HTML ☆

赞 0 踩 0

2606.16974 2026-06-16 cs.AI 新提交

The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers

拥抱开放科学：十年AI研究与56 800篇会议论文的分析

Kevin L Coakley, Thijs Snelleman, Holger Hoos, Odd Erik Gundersen

发表机构 * Norwegian University of Science and Technology（挪威科技大学）； University of California San Diego（加州大学圣迭戈分校）； RWTH Aachen University（亚琛工业大学）； Leiden University（莱顿大学）

AI总结分析2014-2024年五大AI会议56,800篇论文，发现文档实践改善，代码和数据共享率从11%升至64%，可重复性估计从28%升至64%，且改善早于可重复性检查清单的引入，反映开放科学运动。

详情

AI中文摘要

可重复性危机促使AI研究社区改进文档实践。多项研究已指出方法论问题，作为回应，该领域最具影响力的会议引入了可重复性检查清单。我们试图通过评估过去十年五大领先AI会议的所有已发表论文，了解文档实践是否随时间改变。确定了七个可重复性变量，经过质量保证并用于分析56,800篇出版物。我们的分析显示，在2014年至2024年期间，文档实践有所改善；同时共享代码和数据的论文增加了近六倍，从11%增至64%。基于先前研究的实证可重复性率，我们估计——根据文档实践推断，而非直接测试——可重复性从2014年的28%增加到2024年的64%。文档实践的改善早于可重复性检查清单的引入，表明这些变化反映了更广泛的开放科学运动，而非对正式要求的直接响应。

英文摘要

The reproducibility crisis has directed the AI research community toward improving documentation practices. Several studies have identified methodological issues, and in response, the most impactful venues in the field have introduced reproducibility checklists. We seek to understand whether documentation practices have changed over time by assessing all published papers at five leading AI conferences over the past decade. Seven reproducibility variables were identified, quality-assured and used to analyse 56 800 publications. Our analysis reveals that in the period 2014 to 2024, documentation practices have improved; papers sharing both code and data increased nearly sixfold, from 11% to 64% Building on empirical reproducibility rates from a prior study, we estimate - inferred from documentation practices, not direct testing - that reproducibility increased from 28% in 2014 to 64% in 2024. Improvements in documentation practices predate the introduction of reproducibility checklists, suggesting these changes reflect a broader movement toward open science rather than a direct response to formal requirements.

URL PDF HTML ☆

赞 0 踩 0

2606.14715 2026-06-16 cs.MA cs.AI cs.SI 交叉投稿

MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions

MiroBench：真实世界讨论的智能体模拟真实性基准测试

Yaoning Yu, Ye Yu, Haojing Luo, Haohan Wang

发表机构 * University of Illinois at Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Starc.Institute（Starc研究院）

AI总结提出MiroBench基准，基于4292条真实Reddit帖子，通过统计测试评估LLM智能体模拟在重复性、叙事内容、毒性攻击和结构复杂度四个方面的分布匹配度，发现当前模拟器与真实讨论存在分布差异。

详情

AI中文摘要

LLM智能体越来越多地被用于模拟真实世界互动，但尚不清楚模拟行为是否保留了真实人类行为的内容模式和互动动态。现有评估仍然碎片化，使得比较系统或衡量进展变得困难。在本文中，我们聚焦于Reddit讨论，作为评估真实世界社会模拟的具体第一步。Reddit帖子提供了公开的、基于主题的多方互动，人们在其中分享经验、辩论、寻求建议、表达情感，并共同对产品、事件和社会问题做出回应。这些讨论为更广泛的社会行为提供了可观察的窗口，使其成为测试LLM智能体能否不仅再现流畅文本，还能再现真实在线社区的分布模式和互动动态的有用场景。我们介绍了MiroBench，一个基于4292条真实Reddit帖子构建的Reddit讨论模拟基准。MiroBench使用统计测试在四个主要方面比较生成讨论和真实讨论：重复性和语义一致性、叙事内容、毒性攻击以及结构复杂度。跨五个领域和五个模型的实验表明，当前模拟器与真实Reddit帖子在分布上仍不匹配，而一种轻量级的基于提示的改进程序仅带来有限的提升。MiroBench为衡量、诊断和改进基于LLM的社会模拟的真实性提供了一个具体基准。

英文摘要

LLM agents are increasingly used to simulate real world interactions, but it remains unclear whether simulated behaviors preserve the content patterns and interaction dynamics of real human behaviors. Existing evaluations remain fragmented, which makes it difficult to compare systems or measure progress. In this paper, we focus on Reddit discussions as a concrete first step toward evaluating real-world social simulation. Reddit threads provide public, topic-grounded, multi-party interactions where people share experiences, debate, seek advice, express emotion, and collectively respond to products, events, and social issues. These discussions offer an observable window into broader social behavior, making them a useful setting for testing whether LLM agents can reproduce not only fluent text, but also the distributional patterns and interaction dynamics of real online communities. We introduce MiroBench, a benchmark for Reddit discussion simulation built from 4,292 real Reddit threads. MiroBench uses statistical tests to compare generated and real discussions across four major aspects: repetition and semantic uniformity, narrative content, toxicity and aggression, and structural complexity. Experiments across five domains and five models show that current simulators remain distributionally mismatched with real Reddit threads, while a lightweight prompt-based improvement procedure provides only limited gains. MiroBench offers a concrete benchmark for measuring, diagnosing, and improving realism in LLM-based social simulation.

URL PDF HTML ☆

赞 0 踩 0

2606.14747 2026-06-16 cs.CV cs.AI 交叉投稿

超越正确性：通过可扩展的智能体判断标注增强代码大模型的架构推理能力

Kirill Vasilevski, Ximing Dong, Benjamin Rombaut, Ruochen Deng, Jiahuei Lin, Arthur Leung, Dayi Lin, Boyuan Chen, Shaowei Wang, Ahmed E. Hassan

发表机构 * Centre for Software Excellence, Huawei Canada（华为加拿大软件卓越中心）； Department of Computer Science, University of Manitoba, Canada（曼尼托巴大学计算机科学系）； School of Computing, Queen’s University, Canada（皇后大学计算科学学院）

AI总结针对代码大模型缺乏架构理解的问题，提出智能体判断流水线，利用强LLM作为专家架构评估的代理，通过两个判断器（ACJ和AQJ）实现可扩展标注，微调模型在SWE-bench上提升高达540%，并展现跨语言泛化能力。

2606.15144 2026-06-16 cs.CL cs.AI 交叉投稿

PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

PACUTE: 面向菲律宾语的音韵、词缀和字符级词元理解

Jann Railey Montalan, David Demitri Africa, Jimson Paulo Layacan, Richell Isaiah Flores, Ivan Yuri De Leon, Lance Calvin Gamboa

发表机构 * AI Singapore（AI新加坡）； Nanyang Technological University（南洋理工大学）； UK AI Security Institute（英国人工智能安全研究所）； Ateneo de Manila University（马尼拉雅典耀大学）； University of Birmingham（伯明翰大学）

AI总结提出PACUTE基准，包含4600个任务，通过六层诊断框架评估大语言模型在菲律宾语中的形态理解，发现开放权重模型在语素分解上接近随机，前沿模型在组合任务上远低于字符级上限。

Comments Submitted to EMNLP 2026

详情

AI中文摘要

大型语言模型（LLMs）将文本处理为子词词元序列，这掩盖了构成词形成的字符级和形态结构。对于具有非连接形态的语言，这种限制最为严重，标准分词器系统性地使词元边界与语素边界错位。我们引入PACUTE，一个包含4600个任务的诊断基准，旨在评估菲律宾语中的形态理解，菲律宾语以能产的中缀、重叠和变音符号驱动的词汇区分（通常不在书面文本中出现）为特征。PACUTE包括一个六层组合诊断框架，用于定位形态理解在何处崩溃。评估开放权重LLMs和前沿商业模型，我们发现开放权重模型在语素分解上无论规模大小都接近随机。前沿模型表现更好，通常在包含匹配评分下能恢复单个词缀，但在语素变换和音节划分的组合任务上仍远低于其字符级上限。这些结果表明，能产的形态组合（而非仅字符访问）是菲律宾语词汇结构理解的持续瓶颈。

英文摘要

Large language models (LLMs) process text as sequences of subword tokens, which can obscure the character-level and morphological structure that underlies word formation. This limitation is most acute for languages with non-concatenative morphology, where standard tokenizers systematically misalign token boundaries with morpheme boundaries. We introduce PACUTE, a diagnostic benchmark of 4,600 tasks designed to evaluate morphological understanding in Filipino, a language characterized by productive infixation, reduplication, and diacritic-driven lexical distinctions that are typically absent from written text. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down. Evaluating open-weight LLMs and frontier commercial models, we find that open-weight models perform near chance on morpheme decomposition regardless of scale. Frontier models perform much better, often recovering individual affixes under contains-match scoring, but remain far below their character-level ceilings on compositional tasks of morpheme transformations and syllabification. These results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word-structure understanding.

URL PDF HTML ☆

赞 0 踩 0

2606.15216 2026-06-16 cs.CL cs.AI 交叉投稿

Spokes: Optimizing for Diverse Pretraining Data Selection

Spokes: 优化多样化预训练数据选择

Clarence Lee, Yejin Choi, Luke Zettlemoyer, Pang Wei Koh, Hai Leong Chieu

发表机构 * DSO National Laboratories（DSO国家实验室）； Stanford University（斯坦福大学）； University of Washington（华盛顿大学）

AI总结提出基于G-Vendi分数的概率多样化框架，通过指数梯度下降直接优化数据多样性，在FineWeb和DCLM上提升下游性能1.5和1.4个点。

Comments 9 pages, 4 figures

详情

AI中文摘要

多样性在数据选择中起着关键作用，通过减少冗余和重复，在固定数据预算下提高性能。然而，优化多样性本身具有挑战性，因为它是集合级属性，依赖于数据点之间的交互而非单个示例。因此，现有方法通常依赖代理或近似，往往无法确保足够多样化的子集。在这项工作中，我们通过引入基于G-Vendi分数的概率多样化框架，并利用指数梯度下降进行优化，直接优化多样性。我们的方法生成的子集比通过随机抽样获得的子集多样化得多，在50万样本子集上实现了G-Vendi分数增加489。我们在FineWeb和DCLM上评估了我们的方法，它持续优于现有方法。值得注意的是，SPOKES（仅多样性）在DCLM和FineWeb上分别比随机抽样提高了平均下游性能0.4和0.5个点。更重要的是，联合优化质量和多样性取得了最强结果：SPOKES在DCLM和FineWeb上分别取得了1.5和1.4个点的提升，优于所有基线，包括语义去重和质量过滤。

英文摘要

Diversity plays a critical role in data selection, improving performance under fixed data budgets by reducing redundancy and repetition. However, optimizing for diversity is inherently challenging, as it is a set-level property that depends on interactions between data points rather than individual examples. As a result, existing approaches typically rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. In this work, we directly optimize diversity by introducing a probabilistic diversification framework based on the G-Vendi score, optimized via exponentiated gradient descent. Our method produces subsets that are substantially more diverse than those obtained via random sampling, achieving a +489 increase in G-Vendi score on a 500k-sample subset. We evaluate our approach on FineWeb and DCLM, where it consistently outperforms existing methods. Notably, SPOKES (diversity-only) improves average downstream performance by +0.4 and +0.5 points over random sampling on DCLM and FineWeb, respectively. More importantly, jointly optimizing for both quality and diversity yields the strongest results: SPOKES achieves gains of +1.5 and +1.4 points on DCLM and FineWeb, outperforming all baselines, including semantic deduplication and quality filtering.

URL PDF HTML ☆

赞 0 踩 0

2606.15306 2026-06-16 cs.LG cs.AI 交叉投稿

LatentGym: A Testbed For Cross-Task Experiential Learning With Controllable Latent Structure

LatentGym: 具有可控潜在结构的跨任务经验学习测试平台

Daksh Mittal, Tommaso Castellani, Thomson Yen, Naimeng Ye, Fangyu Wu, Minghui Chen, Tiffany Cai, Emmanouil Koukoumidis, William Zeng, Hongseok Namkoong

发表机构 * Columbia University（哥伦比亚大学）； Oumi Blog | Code | Models（Oumi博客 | 代码 | 模型）

AI总结提出LatentGym测试平台，通过可控潜在变量分离探索与利用，研究LLM代理在跨任务序列中的适应性学习机制。

Comments 61 pages

详情

AI中文摘要

我们设想持续学习的代理系统会随时间变得更加有用：当它们遇到一系列相关任务时，应该推断这些任务之间共享的隐藏结构，并利用它来改进未来的决策。这种跨任务经验学习能力在个性化和交互式辅助等领域至关重要，但现有的训练/评估框架不提供共享的、可控的潜在结构，也无法衡量代理是否改进或改进的原因。我们引入了LatentGym：一个可控的套件，其中每个环境都围绕一个控制任务间结构的地面真实潜在变量组织。我们的构建产生了将探索（代理的行为是否收集关于潜在变量的信息）与利用（代理是否使用收集到的信息）分离的指标。我们在实证研究中展示了我们的套件，解决了三个问题：前沿模型如何以及为什么无法适应相关任务；对相关任务序列进行后训练是否能提高一般的跨任务适应性，以及这些收益来自何处；以及诸如任务间反馈等设计选择如何塑造训练动态和泛化。总之，这些结果为研究LLM代理如何从跨任务经验中学习，以及设计在顺序、个性化和交互式设置中更可靠适应的代理建立了受控基础。

英文摘要

We envision continually learning agentic systems that become more useful over time: as they encounter sequences of related tasks, they should infer the hidden structure shared across those tasks and use it to improve future decisions. This cross-task experiential learning capability is pivotal in domains such as personalization and interactive assistance, but existing training/evaluation frameworks do not provide shared, controllable latent structures and cannot measure whether or why agents improve. We introduce LatentGym: a controllable suite in which each environment is organized around a ground-truth latent variable governing the structure across tasks. Our construction yields metrics that separate exploration (whether the agent's actions gather information about the latent) from exploitation (whether the agent uses what it has gathered). We demonstrate our suite on empirical studies addressing three questions: how and why frontier models fail to adapt across related tasks; whether post-training on related task sequences improves general cross-task adaptation, and where those gains come from; and how design choices such as inter-task feedback shape training dynamics and generalization. Together, these results establish a controlled foundation for studying how LLM agents learn from experience across tasks, and for designing agents that adapt more reliably in sequential, personalized, and interactive settings.

URL PDF HTML ☆

赞 0 踩 0

2606.15314 2026-06-16 cs.LG cs.AI stat.ML 交叉投稿

数据受限语言模型预训练的数据增强

Michael K. Chen, Xikun Zhang, Zhen Wang

发表机构 * UC San Diego（加州大学圣地亚哥分校）； RMIT University（皇家墨尔本理工大学）

AI总结针对数据受限下标准自回归预训练严重过拟合的问题，提出三类数据增强方法（token级噪声、序列排列、目标偏移预测），有效降低验证损失并支持数百epoch训练。

详情

AI中文摘要

随着AI实验室接近数据天花板，计算能力超过新高质量文本生成速率，语言模型预训练正转向数据受限、计算充裕的体制，需要在固定语料库上进行高效的多轮训练。标准自回归（AR）预训练在此设置下严重过拟合，早期达到最优然后持续恶化。我们研究数据增强作为正则化器来缓解过拟合，并在相同数据上实现数百轮的有效训练。我们为AR预训练引入了三类正交的增强：token级噪声（掩码、随机替换）、序列排列（从右到左预测、Fill-in-the-Middle）以及目标偏移预测（$x_{t+i}$，$i > 1$）。通过系统消融实验，我们发现单个增强相对于基线延迟了过拟合并降低了验证损失，其中随机token替换在单个方法中实现了最佳最小损失。组合增强类别进一步降低了最小验证损失。我们的实验表明，数据增强缓解了AR预训练的数据低效问题，并为数据受限体制提供了有前景的解决方案。所有代码和数据可在https://github.com/michaelchen-lab/data-augmentations-for-pretraining获取。

英文摘要

As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction ($x_{t+i}$ for $i > 1$). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining

URL PDF HTML ☆

赞 0 踩 0

2606.16262 2026-06-16 cs.SE cs.AI 交叉投稿

UXBench: Measuring the Actionability of LLM-Generated UX Critiques

UXBench: 衡量LLM生成的UX评论的可操作性

Wenjie Wang, Yue Huang, Zipeng Ling, Han Bao, Hang hua, Xiaonan Luo, Yu Jiang, Shiyi Du, Yuexing Hao, Xiaomin Li, Yuchen Ma, Dianzhuo Wang, Yanfang Ye, Xiangliang Zhang

发表机构 * University of Notre Dame（诺丁汉大学）； University of Pennsylvania（宾夕法尼亚大学）； University of Rochester（罗切斯特大学）； Carnegie Mellon University（卡内基梅隆大学）； Massachusetts Institute of Technology（麻省理工学院）； Harvard University（哈佛大学）； LMU Munich（慕尼黑路德维希-马克西米利安大学）

AI总结提出UXBench基准，通过下游修复代理能否基于评论改进界面来评估LLM作为UX评判者的可操作性，发现不同模型在报告可操作性、修复特征和可靠性上存在显著差异。

Comments 30 pages

详情

AI中文摘要

大型语言模型（LLMs）越来越多地被部署为UX评判者，用于检查界面、诊断可用性问题并提出修复建议。然而，目前还没有受控基准来衡量这些评论在不同产品表面上的可靠性和可操作性。我们引入了UXBench，一个用于评估LLMs作为交互式UX评判者的基准。UXBench包含跨十个产品表面系列的本地优先可运行网页固定装置，并配以覆盖门控的浏览器探索，强制模型在报告之前收集交互证据。每个评判模型在七个评分维度上生成结构化的UX报告；报告质量通过固定的下游修复代理能否基于评论改进界面来衡量。我们在自动修复提升协议和盲人验证研究下评估了八个前沿模型。结果表明，UX评判既未饱和也非一维：模型在报告可操作性上存在显著差异，在评分维度上表现出不同的修复特征，在固定装置层面可靠性各异，并在不同表面类别中交替领先。

英文摘要

Large language models (LLMs) are increasingly deployed as UX judges that inspect interfaces, diagnose usability problems, and propose repairs. Yet no controlled benchmark measures whether the resulting critiques are reliable and actionable across heterogeneous product surfaces. We introduce UXBench, a benchmark for evaluating LLMs as interaction-grounded UX judges. UXBench comprises local-first runnable web fixtures spanning ten product-surface families, paired with coverage-gated browser exploration that forces models to collect interaction evidence before reporting. Each judge model produces a structured UX report over seven rubric dimensions; report quality is measured by whether a fixed downstream repair agent can improve the interface based on the critique. We evaluate eight frontier models under both an automated repair-lift protocol and a blind human validation study. Results show that UX judging is neither saturated nor one dimensional: models differ meaningfully in report actionability, exhibit distinct rubric-level repair signatures, vary in fixture-level reliability, and trade leadership across surface categories

URL PDF HTML ☆

赞 0 踩 0

2606.16313 2026-06-16 cs.RO cs.AI 交叉投稿

Is Your Trajectory Displacement Safe in Long-tail?

你的轨迹位移在长尾场景中安全吗？

Qiao Sun, Weicheng Zheng, Yixin Huang, Hang Zhao

发表机构 * Shanghai Qi Zhi Institute（上海期智研究院）； Tsinghua University（清华大学）； Tongji University（同济大学）

AI总结提出FluidTest评估框架，通过成对WebUI协议、32种语义威胁分类和三元验证系统，检测规划轨迹相对于专家参考的额外威胁，实验发现SOTA规划器仍存在大量安全相关失败。

Comments 20 pages, 15 figures

详情

AI中文摘要

长尾场景仍然是自动驾驶评估的主要瓶颈，即使数据集规模增长数个数量级。现有的评估流水线很少同时具备人类对齐、安全感知、可验证和可解释性：闭环指标在强规划器中常常饱和，而无结构的人类评分在没有精心设计协议的情况下可能充满噪声。我们将规划评估表述为额外威胁检测：给定规划器轨迹和专家参考，规划器的位移是否引入了新的不安全驾驶行为？我们提出FluidTest，一个包含三个组件的评估流水线：用于可靠人工标注的成对WebUI协议；包含32种语义威胁及其基于证据的决策图的分类法；以及一个带有反思的三元验证系统，用于精确性和可审计性。在WOD-E2E数据集上的实验表明，FluidTest在训练过的标注者中产生一致的标签，并在65%的Poutine轨迹和51%的RAP轨迹中识别出额外威胁。这些结果表明，尽管具有高评分者反馈分数（RFS）和低平均位移误差（ADE），最先进的规划器仍可能表现出大量与安全相关的失败。更多细节、指导和代码请访问https://fluidtest.web.app。

英文摘要

Long-tail scenarios remain a major bottleneck for autonomous driving evaluation, even as datasets grow by orders of magnitude. Existing evaluation pipelines are rarely human-aligned, safety-aware, verifiable, and explainable at the same time: closed-loop metrics often saturate among strong planners, while unstructured human ratings can be noisy without a carefully designed protocol. We formulate planning evaluation as additional-threat detection: given a planner trajectory and an expert reference, does the planner's displacement introduce new unsafe driving behavior? We propose FluidTest, an evaluation pipeline with three components: a pairwise WebUI protocol for reliable human annotation; a taxonomy of 32 semantic threats with evidence-grounded decision graphs; and a three-agent verification system with reflection for precision and auditability. Experiments on the WOD-E2E dataset show that FluidTest produces consistent labels among trained annotators and identifies additional threats in 65% of Poutine trajectories and 51% of RAP trajectories. These results show that state-of-the-art planners can still exhibit substantial safety-relevant failures despite high Rater Feedback Scores (RFS) and low Average Displacement Error (ADE). Additional details, guidance, and code are available at https://fluidtest.web.app.

URL PDF HTML ☆

赞 0 踩 0

2606.16447 2026-06-16 cs.RO cs.AI 交叉投稿

P3B3：用于测量大语言模型中欧洲和巴西葡萄牙语变体偏差的多轮对话基准

Rafael Ferreira, Inês Vieira, Inês Calvo, James Furtado, Iago Paulo, Diogo Tavares, Diogo Glória-Silva, David Semedo, João Magalhães

发表机构 * NOVA University of Lisbon（新里斯本大学）； NOVA LINCS（NOVA LINCS实验室）

AI总结提出P3B3基准，通过专家策划的对话提示和评估框架，测量大语言模型在葡萄牙语变体（欧洲vs巴西）上的偏差和可控性，发现多数模型偏向巴西葡萄牙语。

Comments Accepted at MeLLM Workshop at ACL 2026

2606.16799 2026-06-16 cs.CV cs.AI 交叉投稿

Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment

解耦语义与失真：面向AI生成图像质量评估的多尺度双流视觉-语言对齐

Zijie Meng

AI总结提出MST-CLIPIQA多尺度双流框架，通过显式表示解耦实现层次化视觉-语言对齐，在五个基准上取得质量SRCC平均提升1.11%、图文对应SRCC提升2.35%的新SOTA结果。

Comments 11 pages, 2 figures Accepted by ICME2026(spotlight)

详情

AI中文摘要

现有的基于视觉-语言模型（VLM）的AI生成图像质量评估（AIGIQA）方法存在根本性的语义-失真维度冲突：为语义区分优化的单一表示在本质上将组成性理解与低层感知敏感性纠缠在一起，使其对细粒度质量退化视而不见。我们提出MST-CLIPIQA，一种多尺度双流框架，通过显式表示解耦实现层次化视觉-语言对齐。我们的架构利用具有互补补丁粒度的双CLIP编码器：粗粒度流捕获全局语义连贯性，而细粒度流保留纹理特征和伪影模式。一种受信息瓶颈启发的门控融合机制执行自适应跨尺度蒸馏，当生成提示可用时，可选交叉注意力实现基于提示的对应评估。在五个基准上的广泛实验建立了新的最先进结果，在质量预测上实现平均SRCC提升1.11%，在文本-图像对应预测上提升2.35%，同时仅需0.8M可训练参数即可保持效率。我们的项目可在https://github.com/YMlinfeng/MST-CLIPIQA获取。

英文摘要

Existing vision-language model (VLM)-based AI-generated image quality assessment (AIGIQA) methods suffer from a fundamental semantic-distortion dimensional conflict: monolithic representations optimized for semantic discrimination inherently entangle compositional understanding with low-level perceptual sensitivity, rendering them blind to fine-grained quality degradations. We introduce MST-CLIPIQA, a multi-scale two-stream framework that achieves hierarchical vision-language alignment through explicit representational decoupling. Our architecture leverages dual CLIP encoders with complementary patch granularities: coarse-grained streams capture global semantic coherence while fine-grained streams preserve textural signatures and artifact patterns. An information bottleneck-inspired gated fusion mechanism performs adaptive cross-scale distillation, with optional cross-attention enabling prompt-anchored correspondence evaluation when generation prompts are available. Extensive experiments across five benchmarks establish new state-of-the-art results, achieving average improvements of 1.11 percent SRCC on quality and 2.35 percent SRCC on text-image correspondence prediction, while maintaining efficiency with only 0.8M trainable parameters. Our project is available at https://github.com/YMlinfeng/MST-CLIPIQA.

URL PDF HTML ☆

赞 0 踩 0

2606.16826 2026-06-16 cs.RO cs.AI 交叉投稿

ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies

ATOM-Bench：用于操作策略中原子技能与组合泛化的真实世界基准

Zenan Wu, Bingqing Wei, Lu Liu, Zheqi He, Xi Wang, Jiakang Liu, Zehui Li, Guocai Yao, Jing-Shu Zheng, Xi Yang, Yongtao Wang

发表机构 * Beijing Academy of Artificial Intelligence（北京人工智能研究院）； Peking University（北京大学）

AI总结提出ATOM-Bench基准，通过分解桌面操作为原子任务和组合任务，评估操作策略的原子技能获取与组合泛化能力，发现当前策略在细粒度原子技能和组合重用上存在不足。

Comments Homepage: https://flageval-baai.github.io/AtomBenchPage

详情

AI中文摘要

通用操作策略越来越多地被呈现为机器人控制的基础模型，但它们的真实世界泛化能力仍然难以诊断。一个策略可能在演示任务上成功，但仍无法执行细粒度的原子技能或在新的任务结构中重新组合已学习的技能。我们引入了\ extbf{ATOM-Bench}，一个用于评估操作策略中原子技能和组合泛化的真实世界基准。ATOM-Bench将桌面操作分解为运动原子和指令原子，包含30个原子任务和24个保留的组合任务，涵盖配对单臂和双臂机器人轨道。我们收集了3000个人类演示用于原子微调，并发布演示数据和评估回滚数据以支持可重复的真实世界评估。策略在原子任务上进行微调，并在原子技能获取和保留的组合任务上进行评估。我们进一步引入了原子分数（AS）和组合失败份额（CFS），以区分由弱原子技能引起的失败和由有限组合重用引起的失败。通过对五种代表性操作策略进行2700次物理回滚，我们发现当前策略可以获取简单的指令接地技能，但在细粒度运动原子、计数和逻辑过滤方面仍然困难。更重要的是，强大的原子性能并不能可靠地迁移到保留的组合任务上。ATOM-Bench提供了一个诊断测试平台，用于研究失败是由弱运动执行、差指令接地还是有限组合重用引起的。

英文摘要

Generalist manipulation policies are increasingly presented as foundation models for robotic control, but their real-world generalization remains difficult to diagnose. A policy may succeed on demonstrated tasks while still failing to execute fine-grained atomic skills or recombine learned skills in new task structures. We introduce \textbf{ATOM-Bench}, a real-world benchmark for evaluating both atomic skills and compositional generalization in manipulation policies. ATOM-Bench factorizes tabletop manipulation into motor atoms and instruction atoms, and contains 30 atomic tasks and 24 held-out compositional tasks across paired single-arm and dual-arm robot tracks. We collect 3,000 human demonstrations for atomic fine-tuning and release both the demonstration data and evaluation rollout data to support reproducible real-world evaluation. Policies are fine-tuned on atomic tasks and evaluated on both atomic skill acquisition and held-out compositional tasks. We further introduce Atomic Score (AS) and Compositional Failure Share (CFS) to distinguish failures caused by weak atomic skills from failures caused by limited compositional reuse. Through 2,700 physical rollouts on five representative manipulation policies, we find that current policies can acquire simple instruction-grounding skills, but still struggle with fine-grained motor atoms, counting, and logical filtering. More importantly, strong atomic performance does not reliably transfer to held-out compositional tasks. ATOM-Bench provides a diagnostic testbed for studying whether failures arise from weak motor execution, poor instruction grounding, or limited compositional reuse.

URL PDF HTML ☆

赞 0 踩 0

2606.16868 2026-06-16 cs.CV cs.AI cs.DC 交叉投稿

Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection

真实世界标签噪声下的联邦医学图像分割：面向噪声标签学习方法选择的基准套件

Markus Bujotzek, Dimitrios Bounias, Stefan Denner, Ralf Floca, Maximilian Fischer, Peter Neher, Klaus Maier-Hein

发表机构 * Division of Medical Image Computing, Germany Cancer Research Center（德国癌症研究中心医学图像计算部）； Medical Faculty, University of Heidelberg（海德堡大学医学院）； Heidelberg Institute of Radiation Oncology (HIRO), National Center for Radiation Research in Oncology (NCRO)（海德堡放射肿瘤学研究所（HIRO），国家放射肿瘤学研究中心（NCRO））； Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital（海德堡大学医院放射肿瘤科模式分析与学习组）； Faculty of Mathematics and Computer Science, University of Heidelberg（海德堡大学数学与计算机科学学院）； National Center for Tumor Diseases (NCT), NCT Heidelberg, a partnership between DKFZ and the university medical center Heidelberg（国家肿瘤疾病中心（NCT），NCT海德堡，DKFZ与海德堡大学医学中心的合作机构）

AI总结针对联邦学习中真实世界标签噪声（如轮廓不一致、结构缺失或混淆）问题，提出一个包含多样化真实噪声数据集、客户端噪声场景和针对性评估的基准套件，支持系统评估和噪声标签学习方法选择。

详情

AI中文摘要

虽然联邦学习（FL）能够在不集中敏感数据的情况下实现协作式医学图像分割，但实际部署常因跨站点的标签缺陷（如轮廓不一致、结构缺失或多余、标签混淆）而复杂化。联邦噪声标签学习（FNLL）旨在减轻这些影响，但在实践中仍未被充分利用，因为现有证据主要基于合成噪声、简化设置和有限的实际噪声评估。我们通过引入一个基准套件来弥补这一差距，该套件结合了多样化的真实世界噪声数据集、与部署相关的客户端噪声场景以及针对标签噪声的评估，以支持系统的FNLL评估和知情的方法选择。该套件将来自不同来源的精心策划的真实世界噪声医学图像分割数据集与一个全面的联邦分割框架相结合，包括各种客户端噪声场景和针对噪声的评估。所提出的套件为医学图像分割中的FNLL评估提供了现实且具有区分性的基础，并为公平基准测试、数据集特定的标签噪声表征以及未来在现实联邦设置下的方法开发建立了可重复使用的基础。代码可在 https://github.com/MIC-DKFZ/FedSegNoiseBench 获取。

英文摘要

While federated learning (FL) enables collaborative medical image segmentation without centralizing sensitive data, real-world deployment is frequently complicated by cross-site label imperfections such as contour disagreement, missing or additional structures, and confused labels. Federated noisy label learning (FNLL) aims to mitigate these effects, yet remains underused in practice as existing evidence is largely based on synthetic noise, simplified settings, and limited real-world noisy evaluation. We address this gap by introducing a benchmark suite that combines diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to support systematic FNLL assessment and informed method selection. The suite combines curated real-world noisy medical image segmentation datasets from diverse sources with a comprehensive federated segmentation framework including various client-noise scenarios and noise-targeted evaluation. The presented suite provides a realistic and discriminative basis for FNLL evaluation in medical image segmentation and establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. Code is available at https://github.com/MIC-DKFZ/FedSegNoiseBench.

URL PDF HTML ☆

赞 0 踩 0

2606.16910 2026-06-16 cs.CL cs.AI 交叉投稿

IMPACTeen: Intentions, Manipulation, Persuasion, Annotations, and Consequences in Teen Communication Dataset

IMPACTeen：青少年沟通数据集中的意图、操纵、说服、标注与后果

Aleksander Szczęsny, Wiktoria Mieleszczenko-Kowszewicz, Maciej Markiewicz, Beata Bajcar, Tomasz Adamczyk, Jolanta Babiak, Grzegorz Chodak, Przemysław Kazienko

发表机构 * Wrocław University of Science and Technology（弗罗茨瓦夫理工大学）

AI总结构建IMPACTeen数据集，包含1021个青少年社交影响场景文本，从五个视角标注，支持社交影响检测、标注者分歧及跨语言建模研究。

2606.17006 2026-06-16 cs.SD cs.AI cs.LG cs.MM eess.AS 交叉投稿

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

TuneJury: 一种改进音乐生成偏好对齐的开放指标

Yonghyun Kim, Junwon Lee, Haiwen Xia, Yinghao Ma, Junghyun Koo, Koichi Saito, Yuki Mitsufuji, Chris Donahue

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Sony AI（索尼AI）； Georgia Tech（佐治亚理工学院）； KAIST（韩国科学技术院）； Peking University（北京大学）； QMUL（伦敦玛丽女王大学）

AI总结提出TuneJury，一个开放、实例级别的成对奖励模型，用于文本到音乐生成，通过预测偏好分数支持数据筛选、后处理校准，并在推理、优化和训练中提升对齐效果。

Comments 32 pages, 9 figures

详情

AI中文摘要

我们引入了TuneJury，一个开放、实例级别的成对奖励模型，用于文本到音乐生成，它从文本提示和音频片段中预测音乐偏好分数。发布的检查点在公开的人类偏好标签上训练，涵盖竞技场风格（A vs. B）投票、度量对齐偏好对、众包成对比较和专家审美评分。两个片段之间的预测分数差在我们的保留测试集上校准良好，支持通过简单的分数阈值进行数据筛选。TuneJury泛化到保留测试对和分布外基准，在后一任务上与先前基线保持竞争力。对于训练后发布的生成器，我们引入了锚定校准，一种事后、每系统的Bradley-Terry校准，以显著优于从头再训练的数据效率恢复一致性。相同的冻结奖励在三个下游应用中驱动一致的奖励轴增益：推理时的最佳N选择、DITTO风格的潜在优化和专家迭代后训练。TuneJury可在https://github.com/yonghyunk1m/TuneJury获取。

英文摘要

We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.

URL PDF HTML ☆

赞 0 踩 0

2606.17020 2026-06-16 cs.CV cs.AI 交叉投稿

FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

FusionRS: 用于双模态视觉-语言基础模型的大规模RGB-红外遥感数据集

Jiaju Han, Ben Zhang, Xuemeng Sun, Qike Zhang, Yuxian Dong, Chengyin Hu, Fengyu Zhang, Yiwei Wei, Jiujiang Guo

发表机构 * China University of Petroleum-Beijing at Karamay（中国石油大学（北京）克拉玛依校区）； University of Electronic Science and Technology of China（电子科技大学）； Tianjin University（天津大学）

AI总结针对遥感视觉-语言模型缺乏红外数据的问题，提出首个大规模RGB-红外-文本数据集FusionRS，通过翻译RGB图像为红外风格并配以红外感知描述，训练双模态基础模型，提升RGB-红外对齐和双模态字幕生成性能。

详情

AI中文摘要

遥感视觉-语言模型推动了地球观测理解的发展，但现有工作大多集中于RGB图像，红外数据中的互补信息尚未得到充分探索。红外图像提供了独特的线索，包括热强度结构、物体边界和光照不变场景特征，这些可以丰富超越传统RGB观测的视觉-语言学习。然而，用于遥感视觉-语言建模的大规模RGB-红外-文本数据集仍然缺失。为填补这一空白，我们引入了FusionRS，这是首个专为遥感双模态视觉-语言学习设计的大规模RGB-红外-文本数据集。FusionRS通过将多样的公开RGB遥感图像翻译为红外风格对应物，形成对齐的RGB-IR图像对。每对图像都配有常规场景描述和红外感知描述，后者在保留语义内容的同时明确描述红外特有的视觉属性。基于FusionRS，我们训练了用于RGB-IR联合理解的双模态视觉-语言基础模型。我们首先训练CLIP风格的模型进行RGB-IR-文本对齐，然后微调生成式VLM用于双模态RGB-IR字幕生成。实验表明，与仅RGB和非红外感知训练设置相比，FusionRS改进了RGB-IR对齐、红外到文本检索和双模态字幕生成。消融研究进一步验证了红外感知描述对于加强红外-语言对齐至关重要，突显了模态特定文本监督对于更可扩展的RGB-红外遥感视觉-语言表示学习的重要性。

英文摘要

Remote sensing vision-language models have advanced Earth observation understanding, but most existing work remains centered on RGB imagery, leaving the complementary information in infrared data underexplored. Infrared images provide distinctive cues, including thermal intensity structures, object boundaries, and illumination-invariant scene features, which can enrich visual-language learning beyond conventional RGB observations. However, a large-scale RGB-infrared-text dataset for remote sensing vision-language modeling is still absent. To address this gap, we introduce FusionRS, the first large-scale RGB-infrared-text dataset designed for dual-modal vision-language learning in remote sensing. FusionRS is constructed by translating diverse public RGB remote sensing images into infrared-style counterparts, forming aligned RGB-IR image pairs. Each pair is associated with conventional scene captions and IR-aware captions that explicitly describe infrared-specific visual properties while preserving semantic content. Based on FusionRS, we train dual-modal vision-language foundation models for RGB-IR joint understanding. We first train CLIP-style models for RGB-IR-text alignment, and then fine-tune generative VLMs for dual-modal RGB-IR captioning. Experiments show that FusionRS improves RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only and non-IR-aware training settings. Ablation studies further verify that IR-aware captions are crucial for strengthening infrared-language alignment, highlighting the importance of modality-specific textual supervision for more scalable RGB-infrared remote sensing vision-language representation learning.

URL PDF HTML ☆

赞 0 踩 0

2509.22888 2026-06-16 cs.AI cs.CL 版本更新

JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory

JE-IRT: 通过联合嵌入项目反应理论审视LLM能力的几何视角

Louie Hong Yao, Nicholas Jarvis, Tiffany Zhan, Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang

发表机构 * Independent Researcher（独立研究者）； University of Cincinnati（辛辛那提大学）； Carnegie Mellon University（卡内基梅隆大学）

AI总结提出JE-IRT几何框架，将LLM和问题嵌入共享空间，通过方向编码语义、范数编码难度，揭示主题专长和分布外行为，支持新模型高效扩展，并发现与人类分类部分对齐的内部结构。

Comments 35 pages, 17 figures, 9 tables, accepted to TMLR

详情

AI中文摘要

标准LLM评估实践将多样能力压缩为单一分数，掩盖了其固有的多维性质。我们提出JE-IRT，一种几何项目反应框架，将LLM和问题嵌入共享空间。对于问题嵌入，方向编码语义，范数编码难度，而每个问题的正确性由模型和问题嵌入之间的几何交互决定。这种几何结构用主题专长取代了LLM的全局排名，并允许相关问题之间的平滑变化。基于此框架，我们的实验结果表明，分布外行为可以通过方向对齐来解释，且更大的范数一致地指示更难的问题。此外，JE-IRT自然支持泛化：一旦空间被学习，新LLM通过拟合单个嵌入即可添加。学习到的空间进一步揭示了仅部分与人类定义的主题类别对齐的LLM内部分类。我们还表明，嵌入空间的简单线性探针恢复了跨主题的能力方向，例如一个算术轴，在看似遥远的主题（如病毒学和全球事实）中突出定量要求高的问题。因此，JE-IRT建立了一个统一且可解释的几何视角，将LLM能力与问题结构联系起来，为模型评估和泛化提供了独特视角。

英文摘要

Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings. This geometry replaces a global ranking of LLMs with topical specialization and enables smooth variation across related questions. Building on this framework, our experimental results reveal that out-of-distribution behavior can be explained through directional alignment, and that larger norms consistently indicate harder questions. Moreover, JE-IRT naturally supports generalization: once the space is learned, new LLMs are added by fitting a single embedding. The learned space further reveals an LLM-internal taxonomy that only partially aligns with human-defined subject categories. We also show that simple linear probes of the embedding space recover cross-subject ability directions, such as an arithmetic axis that highlights quantitatively demanding questions in seemingly distant subjects like virology and global facts. JE-IRT thus establishes a unified and interpretable geometric lens that connects LLM abilities with the structure of questions, offering a distinctive perspective on model evaluation and generalization.

URL PDF HTML ☆

赞 0 踩 0

2602.06486 2026-06-16 cs.AI 版本更新

JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks

JADE：面向开放式专业任务的专家基础动态评估

Lanbo Lin, Jiayao Liu, Tianyuan Yang, Li Cai, Yuanwu Xu, Lei Wei, Sicong Xie, Guannan Zhang

AI总结提出JADE双层评估框架，结合专家知识与动态声明级评估，解决开放式专业任务中严格性与灵活性的矛盾，在BizBench等基准上提升稳定性并揭示关键失败模式。

详情

AI中文摘要

在开放式专业任务上评估智能体AI面临着严格性与灵活性之间的根本困境。静态评分标准提供了严格、可重复的评估，但无法适应多样化的有效响应策略，而LLM作为评判者的方法虽能适应个体响应，却存在不稳定性和偏差。人类专家通过将领域基础原则与动态的声明级评估相结合来解决这一困境。受此过程启发，我们提出了\textbf{JADE}，一个双层评估框架。第一层将专家知识编码为预定义的评估技能集，提供稳定的评估标准。第二层执行报告特定的声明级评估，灵活评估多样化的推理策略，并通过证据依赖门控来使基于被反驳声明的结论无效。在BizBench上的实验表明，JADE提高了评估稳定性，并揭示了基于整体LLM的评估者遗漏的关键智能体失败模式。我们进一步展示了与专家编写的评分标准的高度一致性，并有效迁移到HealthBench和此http URL，涵盖医学和10个领域的专业评估设置。代码和数据可在以下网址获取：此https URL。

英文摘要

Evaluating agentic AI on open-ended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate diverse valid response strategies, while LLM-as-a-judge approaches adapt to individual responses yet suffer from instability and bias. Human experts address this dilemma by combining domain-grounded principles with dynamic, claim-level assessment. Inspired by this process, we propose \textbf{JADE}, a two-layer evaluation framework. Layer 1 encodes expert knowledge as a predefined set of evaluation skills, providing stable evaluation criteria. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies, with evidence-dependency gating to invalidate conclusions built on refuted claims. Experiments on BizBench show that JADE improves evaluation stability and reveals critical agent failure modes missed by holistic LLM-based evaluators. We further demonstrate strong alignment with expert-authored rubrics and effective transfer to HealthBench and DR.BENCH, covering medical and 10-domain professional evaluation settings. Code and data are available at https://github.com/smiling-world/JADE.

URL PDF HTML ☆

赞 0 踩 0

2602.11510 2026-06-16 cs.AI 版本更新

AgentLeak: A Benchmark for Internal-Channel Privacy Leakage in Multi-Agent LLM Systems

AgentLeak：多智能体大语言模型系统中内部通道隐私泄露的基准测试

Faouzi El Yagoubi, Godwin Badu-Marfo, Ranwa Al Mallah

发表机构 * Polytechnique Montréal（蒙特利尔理工学院）

AI总结提出AgentLeak基准，通过评估内部通道（如智能体间消息、共享内存）的隐私泄露，发现多智能体系统虽降低最终输出泄露，但内部通道使总暴露率达68.9%，远超输出审计的检测范围。

Comments 19 pages, 9 figures, 16 tables. Code and dataset available at https://github.com/Privatris/AgentLeak

详情

DOI: 10.1109/ACCESS.2026.3704541

AI中文摘要

多智能体大语言模型（LLM）系统产生了当前仅输出基准无法衡量的隐私风险。当智能体协调任务时，敏感数据可能通过智能体间消息、共享内存和工具参数传递，这些路径通常不被最终输出审计检查。我们引入了AgentLeak，一个用于评估多智能体LLM系统中内部通道隐私泄露的基准。AgentLeak检测了七条与隐私相关的通信路径，并提供了针对最终输出、智能体间消息和共享内存的大规模实证评估。在涵盖医疗、金融、法律和企业领域的1000个场景中，使用五个生产级LLM（GPT-4o、GPT-4o-mini、Claude 3.5 Sonnet、Mistral Large和Llama 3.3 70B）以及4979个经过验证的执行轨迹，我们发现，与单智能体基线相比，多智能体配置降低了最终输出泄露（C1：27.2%对43.2%），但引入了内部通道，使系统总暴露率升至68.9%（聚合C1、C2、C5）。智能体间消息（C2）泄露率为68.8%，而最终输出（C1）为27.2%，这意味着仅输出审计遗漏了41.7%的违规。在所有五个模型和四个领域中，模式C2 ≥ C1一致成立。这些结果表明，在所评估的协调者-工作者设置中，多智能体系统的隐私风险主要由架构协调通道而非仅最终输出行为塑造：它源于对标准输出级防御不可见的内部通道。

英文摘要

Multi-agent Large Language Model (LLM) systems create privacy risks that current output-only benchmarks cannot measure. When agents coordinate on tasks, sensitive data may pass through inter-agent messages, shared memory, and tool arguments, all pathways that final-output audits typically do not inspect. We introduce AgentLeak, a benchmark for evaluating internal-channel privacy leakage in multi-agent LLM systems. AgentLeak instruments seven privacy-relevant communication pathways and provides a large-scale empirical evaluation focused on final outputs, inter-agent messages, and shared memory. Across 1,000 scenarios spanning healthcare, finance, legal, and corporate domains, five production LLMs (GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Mistral Large, and Llama 3.3 70B), and 4,979 validated execution traces, we find that multi-agent configurations reduce final-output leakage (C1: 27.2% vs 43.2% in single-agent mode) compared with single-agent baselines but introduce internal channels that raise total system exposure to 68.9% (aggregated across C1, C2, C5). Inter-agent messages (C2) leak at 68.8%, compared with 27.2% for final outputs (C1), meaning that output-only audits miss 41.7% of violations. Across all five models and four domains, the pattern C2 $\geq$ C1 holds consistently. These results suggest, within the evaluated coordinator-worker setting, that privacy risk in multi-agent systems is strongly shaped by architectural coordination channels rather than final-output behavior alone: it arises from internal channels that remain invisible to standard output-level defenses.

URL PDF HTML ☆

赞 0 踩 0

2602.12670 2026-06-16 cs.AI 版本更新

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

SkillsBench: 基准测试智能体技能在不同任务中的有效性

Xiangyi Li, Yimin Liu, Wenbo Chen, Bingran You, Zonglin Di, Yifeng He, Shenghan Zheng, Kyoung Whan Choe, Jiankai Sun, Shuyi Wang, Chujun Tao, Binxu Li, Xuandong Zhao, Hejia Geng, Xiaojun Wu, Junwei Zhou, Xiaokun Chen, Hanwen Xing, Yubo Li, Qunhong Zeng, Di Wang, Yuanli Wang, Roey Ben Chaim, Penghao Jiang, Haotian Shen, Luyang Kong, Xinyi Liu, Runhui Wang, Xuanqing Liu, Jiachen Li, Xin Lan, Yueqian Lin, Wengao Ye, Junwei He, Songlin Li, Yue Zhang, Yipeng Gao, Yijiang Li, Ze Ma, Liqiang Jing, Tianyu Wang, Kaixin Li, Yiqi Xue, Haoran Lyu, Yizhuo He, Yuchen Tian, Shutong Wu, Bowei Wang, Yixuan Gao, Bo Chen, Litong Liu, Sikai Cheng, Jiajun Bao, Shuaicheng Tong, Shuwen Xu, Terry Yue Zhuo, Tinghan Ye, Qi Qi, Miao Li, Longtai Liao, Zelin Tan, Chang Shi, Xilin Tang, Srinath Tankasala, Boqin Yuan, Yaoyao Qian, Jianhong Tu, Chenguang Wang, Yizhou Sun, Wei Wang, Aaron Taylor, Ziyue Yang, Changkun Guan, Zhikang Dong, Xinyu Zhang, Steven Dillmann, Han-chung Lee, Dawn Song

发表机构 * BenchFlow ； OSU ； Amazon ； UC Berkeley ； UC Santa Cruz ； UC Davis ； Dartmouth ； RLWRLD ； Independent ； Princeton University ； Oxford University ； Stanford University ； USC ； CMU ； Foxconn ； Zenity ； UNSW ； UT Austin ； MSU ； Duke University ； ByteDance ； UT Dallas ； UC San Diego ； Columbia University ； University of Rochester ； Cornell Tech ； Georgia Tech ； Cornell University ； NEU ； UCLA ； Snap Inc. ； Fanshawe College ； University of Science and Technology of China ； HKUST(GZ) ； Anyscale

AI总结提出SkillsBench基准，包含8领域87个任务，通过配对评估证明技能提升平均通过率16.6个百分点，小模型配备技能可匹敌大模型。

详情

AI中文摘要

智能体技能是结构化程序性知识包，在推理时增强大语言模型智能体。尽管被快速采用，但没有标准方法衡量它们是否真正有帮助。我们提出SkillsBench，其当前库存包含8个领域的87个任务，并配有精心策划的技能和确定性验证器。我们最新的聚合评估在匹配的无技能和精心策划的技能条件下，对18个模型-框架配置运行了87个任务基准。精心策划的技能将平均通过率从33.9%提高到50.5%（+16.6个百分点；归一化增益25.5%），配置级增益范围从+4.1到+25.7个百分点。最多三个模块的聚焦技能优于更大或详尽的技能包，配备技能的小模型可以匹配没有技能的大模型。SkillsBench建立了配对评估作为严格衡量技能在智能体、专业密集型工作上有效性的基础。

英文摘要

Agent Skills are structured packages of procedural knowledge that augment large language model (LLM) agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark whose current inventory contains 87 tasks across 8 domains paired with curated Skills and deterministic verifiers. Our latest aggregate evaluation runs the 87-task benchmark under matched no-Skills and curated-Skills conditions for 18 model-harness configurations. Curated Skills raise the average pass rate from 33.9% to 50.5% (+16.6 percentage points; 25.5% normalized gain), with configuration-level gains ranging from +4.1 to +25.7 pp. Focused Skills with at most three modules outperform larger or exhaustive bundles, and smaller models with Skills can match larger models without them. SkillsBench establishes paired evaluation as the foundation for rigorous measurement of Skill efficacy on agentic, expertise-heavy work.

URL PDF HTML ☆

赞 0 踩 0

2602.16902 2026-06-16 cs.AI cs.LG 版本更新

LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

LLM-WikiRace 基准测试：大语言模型在真实知识图谱上的规划能力有多强？

Juliusz Ziomek, William Bankes, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

发表机构 * University of Oxford, UK（牛津大学，英国）； University College London (Centre for AI), UK（伦敦大学学院（人工智能中心），英国）； University of Basel, Switzerland（巴塞尔大学，瑞士）

AI总结提出 LLM-Wikirace 基准，通过维基百科超链接导航任务评估大语言模型的规划、推理与世界知识，发现模型在简单任务上超人类，但困难任务成功率仅 23%，且规划与长程推理是主要瓶颈。

详情

AI中文摘要

我们引入了 LLM-Wikirace，一个用于评估大语言模型（LLM）规划、推理和世界知识的基准。在 LLM-Wikirace 中，模型必须逐步高效地导航维基百科超链接，从给定源页面到达目标页面，这需要前瞻性规划和推理概念如何在现实世界中连接的能力。我们评估了广泛的开源和闭源模型，包括 Gemini-3、GPT-5 和 Claude Opus 4.5，它们在任务的简单级别上取得了最强结果，并展现了超人类性能。尽管如此，在困难难度下性能急剧下降：表现最好的模型 Gemini-3 仅在 23% 的困难游戏中成功，凸显了前沿模型面临的重大挑战。我们的分析表明，世界知识是成功的必要因素，但仅在一定程度内；超过这个阈值，规划和长程推理能力成为主导因素。轨迹级分析进一步揭示，即使是最强的模型在失败后也难以重新规划，经常陷入循环而非恢复。LLM-Wikirace 是一个简单的基准，揭示了当前推理系统的明显局限性，提供了一个开放的竞技场，其中具备规划能力的 LLM 仍有待证明。我们的代码和排行榜可在 https://llmwikirace.github.io 获取。

英文摘要

We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23\% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point, beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard available at https:/llmwikirace.github.io.

URL PDF HTML ☆

赞 0 踩 0

2602.17990 2026-06-16 cs.AI 版本更新

WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

WorkflowPerturb：用于评估多智能体工作流度量的校准压力测试

Madhav Kanda, Sharad Agarwal, Rodrigo Fonseca, Alok Gautam Kumbhare, Pedro Las-Casas

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Microsoft（微软公司）

AI总结提出WorkflowPerturb基准，通过对黄金工作流施加分级扰动来评估多智能体工作流度量，揭示度量分数校准不良问题，支持变更管理中的严重性感知解释。

详情

AI中文摘要

从自然语言请求生成结构化工作流的多智能体LLM系统现已部署在云自动化、DevOps和企业流程编排的生产环境中。运行此类系统会暴露一个反复出现的变更管理问题。常规更新，例如重新运行相同的输入、替换底层LLM或重构智能体的提示或编排代码，经常产生与先前验证的参考工作流差异很大的工作流。工程师随后缺乏原则性的方法来决定变更是否安全发布。自动工作流评估是回答这个问题的自然工具。然而在实践中，度量分数校准不良，数值变化很少能传达底层降级的严重性。我们引入WorkflowPerturb，一个受控基准，通过向黄金工作流应用现实的分级扰动来研究工作流评估度量。WorkflowPerturb包含4,973个黄金工作流和44,757个扰动变体，涵盖三种扰动类型（缺失步骤、压缩步骤和描述更改），每种类型以10%、30%和50%的严重程度应用。我们对多个度量族进行基准测试，并使用期望分数轨迹和残差分析它们的敏感性和校准。我们的结果表征了不同度量族之间的系统性差异，并支持在变更管理环境中对工作流评估分数进行严重性感知解释。我们的数据集将在接收后发布。

英文摘要

Multi-agent LLM systems that generate structured workflows from natural-language requests are now deployed in production across cloud automation, DevOps, and enterprise process orchestration. Operating such systems exposes a recurring change-management problem. Routine updates, such as re-running the same input, swapping the underlying LLM, or refactoring an agent's prompt or orchestration code, frequently produce workflows that differ substantially from previously validated references. Engineers are then left without a principled way to decide whether a change is safe to ship. Automatic workflow evaluation is the natural tool for answering this question. In practice, however, metric scores are poorly calibrated, and a numeric change rarely communicates the severity of the underlying degradation. We introduce WorkflowPerturb, a controlled benchmark for studying workflow evaluation metrics by applying realistic, graded perturbations to golden workflows. WorkflowPerturb contains 4,973 golden workflows and 44,757 perturbed variants across three perturbation types (Missing Steps, Compressed Steps, and Description Changes), each applied at severity levels of 10%, 30%, and 50%. We benchmark multiple metric families and analyze their sensitivity and calibration using expected score trajectories and residuals. Our results characterize systematic differences across metric families and support severity-aware interpretation of workflow evaluation scores in change-management settings. Our dataset will be released upon acceptance.

URL PDF HTML ☆

赞 0 踩 0

2603.02668 2026-06-16 cs.AI cs.LG 版本更新

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

SorryDB: AI证明者能完成现实世界的Lean定理吗？

Austin Letson, Leopoldo Sarra, Auguste Poiroux, Oliver Dressler, Paul Lezeau, Dhyan Aranha, Frederick Pu, Aaron Hill, Miguel Corredera Hidalgo, Julian Berman, George Tsoukalas, Lenny Taelman

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出动态更新的基准SorryDB，包含78个GitHub上的现实形式化项目，评估AI证明者在复杂依赖下的能力，发现当前方法互补，基于Gemini Flash的智能体方法表现最佳。

详情

AI中文摘要

我们提出了SorryDB，一个动态更新的基准，包含从GitHub上78个现实世界形式化项目中提取的开放Lean任务。与现有的静态基准（通常由竞赛问题组成）不同，攀登SorryDB基准将产生与社区需求对齐、对数学家更易用、更能理解复杂依赖的工具。此外，通过提供持续更新的任务流，SorryDB减轻了测试集污染，并为智能体对新颖形式数学项目的贡献能力提供了稳健的度量。我们评估了一系列方法，包括通用大型语言模型、智能体方法和专用符号证明器，在SorryDB中选取的1000个任务快照上。我们表明当前方法是互补的：尽管基于Gemini Flash的智能体方法性能最佳，但它并不严格优于其他现成的大型语言模型、专用证明器，甚至精心策划的Lean策略列表。

英文摘要

We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of Lean tactics.

URL PDF HTML ☆

赞 0 踩 0

2603.09309 2026-06-16 cs.AI 版本更新

Rescaling Confidence: What Scale Design Reveals About LLM Metacognition

重新缩放置信度：量表设计揭示LLM元认知

Yuyang Dai, Yuxia Wang

发表机构 * INSAIT, Sofia University "St. Kliment Ohridski"（INSAIT，索菲亚大学‘圣克莱门特·欧里德斯基’）

AI总结研究LLM口头置信度受量表设计影响，发现0-20量表比标准0-100量表更有效提升元认知效率。

Comments 20 pages

详情

AI中文摘要

口头置信度，即LLM报告数值确定性分数，被广泛用于估计黑箱环境中的不确定性，然而置信度量表本身（通常为0-100）很少被审视。我们表明，这一设计选择并非中性。在六个LLM和三个数据集上，口头置信度高度离散化，超过78%的响应集中在三个整数上。为研究这一现象，我们沿三个维度系统操纵置信度量表：粒度、边界位置和范围规律性，并使用$meta\ ext{-}d'$评估元认知敏感性。我们发现，0-20量表持续优于标准0-100格式，提高了元认知效率，而边界压缩会降低性能，即使在非规则范围下，整数偏好仍然存在。这些结果表明，置信度量表设计直接影响口头不确定性的质量，应被视为LLM评估中的首要实验变量。

英文摘要

Verbalized confidence, in which LLMs report a numerical certainty score, is widely used to estimate uncertainty in black-box settings, yet the confidence scale itself (typically 0--100) is rarely examined. We show that this design choice is not neutral. Across six LLMs and three datasets, verbalized confidence is heavily discretized, with more than 78\% of responses concentrating on just three round-number values. To investigate this phenomenon, we systematically manipulate confidence scales along three dimensions: granularity, boundary placement, and range regularity, and evaluate metacognitive sensitivity using $meta\text{-}d'$. We find that a 0--20 scale consistently improves metacognitive efficiency over the standard 0--100 format, while boundary compression degrades performance and round-number preferences persist even under irregular ranges. These results demonstrate that confidence scale design directly affects the quality of verbalized uncertainty and should be treated as a first-class experimental variable in LLM evaluation.

URL PDF HTML ☆

赞 0 踩 0

2603.10384 2026-06-16 cs.AI 版本更新

Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability

超越标量：通过几何进展和稳定性评估与理解LLM推理

Xinyan Jiang, Ninghao Liu, Di Wang, Lijie Hu

发表机构 * GitHub

AI总结提出TRACED框架，利用几何运动学将推理轨迹分解为进展和稳定性，揭示正确推理与幻觉的拓扑差异，实现鲁棒的推理质量评估。

Comments Accepted by ICML2026

2605.09163 2026-06-16 cs.AI 版本更新

FORTIS: Benchmarking Over-Privilege in Agent Skills

FORTIS：评估代理技能中的过度特权

Shawn Li, Chenxiao Yu, Han Wang, Wei Yang, Ryan Rossi, Franck Dernoncourt, Xiyang Hu, Philip Yu, Chaowei Xiao, Huan Zhang, Yue Zhao

发表机构 * University of Southern California（南加州大学）； University of Illinois at Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Adobe Research（Adobe研究）； Arizona State University（亚利桑那州立大学）； University of Illinois Chicago（伊利诺伊大学芝加哥分校）； Johns Hopkins University（约翰霍普金斯大学）

AI总结研究发现，当前代理技能层普遍存在过度特权问题，模型在选择和执行技能时常超出任务需求，导致性能不佳。

详情

AI中文摘要

大型语言模型代理越来越多地通过一个中间技能层来介面用户意图与具体任务执行。这一层被广泛视为一种组织抽象，但我们认为它也是当前模型经常越界的特权边界。我们提出了FORTIS，一个评估代理技能中过度特权的基准，分为两个阶段：模型是否从大量的重叠库中选择最小必要的技能，以及是否在不扩展到更广泛的工具或行动的情况下执行该技能。在十个前沿模型和三个领域中，我们发现过度特权行为是常态而非例外。模型始终倾向于选择比任务要求更高的特权技能和工具，在两个阶段的失败率即使在最强的可用模型中仍然很高。在现实用户交互的普通条件下，失败尤为严重：不完整的规范、便利的框架和接近技能边界。这些都不需要对抗性构造。结果表明，技能层远非包含代理行为，而是当前系统中特权升级的主要来源。

英文摘要

Large language model agents increasingly operate through an intermediate skill layer that mediates between user intent and concrete task execution. This layer is widely treated as an organizational abstraction, but we argue it is also a privilege boundary that current models routinely exceed. We present \textbf{FORTIS}, a benchmark that evaluates over-privilege in agent skills across two stages: whether a model selects the minimally sufficient skill from a large overlapping library, and whether it executes that skill without expanding into broader tools or actions than the skill permits. Across ten frontier models and three domains, we find that over-privileged behavior is the norm rather than the exception. Models consistently reach for higher-privilege skills and tools than the task requires, failing at both stages at rates that remain high even for the strongest available models. Failure is especially severe under the ordinary conditions of real user interaction: incomplete specification, convenience framing, and proximity to skill boundaries. None of these requires adversarial construction. The results indicate that the skill layer, far from containing agent behavior, is itself a primary source of privilege escalation in current systems.

URL PDF HTML ☆

赞 0 踩 0

2605.10574 2026-06-16 cs.AI 版本更新

MA-ProofBench: 数学分析中定理证明的大语言模型双层评估基准

Lushi Pu, Weiming Zhang, Xinheng Xie, Zixuan Fu, Bingxiang He, Hongya Lyu, Xin Li, Jie Zhou, Yudong Wang

发表机构 * ModelBest Inc. ； Tsinghua University（清华大学）

AI总结提出首个面向数学分析的形式化定理证明基准MA-ProofBench，包含200个定理，覆盖6个核心主题和27个子类别，分为本科和博士资格两级难度，评估发现当前模型表现不佳，GPT-5.5在Level I上仅达16% Pass@8。

Comments 19 pages, 4 figures, 4 tables

详情

AI中文摘要

大型语言模型（LLMs）在自动化定理证明方面取得了显著进展，然而现有的形式化基准在数学覆盖范围和难度上仍然有限。大多数集中在更容易形式化的领域，如代数和初等数论，并且对需要更深层推理的子领域（包括数学分析）覆盖有限。为了解决这一差距，我们引入了MA-ProofBench，据我们所知，这是第一个专门致力于数学分析的形式化定理证明基准。该基准包含200个形式化定理，涵盖6个核心主题和27个子类别，包括测度与积分理论、复分析和泛函分析。问题分为两个难度级别：本科级别（Level I，100个问题）和博士资格考试级别（Level II，100个问题），以评估LLMs在不同数学深度上的形式推理能力。每个问题通过人工主导、LLM辅助的形式化流程构建，随后由独立专家评审，确保形式化陈述忠实于原始数学。我们在MA-ProofBench上评估了一系列最新的通用推理模型和形式化定理证明器。然而，大多数模型表现不佳：即使是最佳模型GPT-5.5，在Level I上仅达到16%的Pass@8，在Level II上为5%，而大多数模型在Level II上接近0%。进一步分析发现，Mathlib幻觉和不完整证明是两种主要的失败模式，而对基准的自然语言版本的评估揭示了非正式推理与形式推理之间的明显差距。MA-ProofBench旨在作为跟踪高级领域形式化数学推理进展的可靠参考。

英文摘要

Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to formalize, such as algebra and elementary number theory, and provide limited coverage of subfields that require deeper reasoning, including mathematical analysis. To address this gap, we introduce MA-ProofBench, to the best of our knowledge, the first formal theorem-proving benchmark dedicated to Mathematical Analysis. The benchmark contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are divided into two difficulty levels, an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems), to evaluate how well LLMs perform formal reasoning at different mathematical depths. Each problem is constructed through a human-led, LLM-assisted formalization pipeline followed by independent expert review, ensuring that the formal statements remain faithful to the original mathematics. We evaluate a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. However, most models perform poorly: even the best-performing model, GPT-5.5, achieves only 16% Pass@8 on Level I and 5% on Level II, while most models stay close to 0% on Level II. Further analysis identifies Mathlib hallucinations and incomplete proofs as the two dominant failure modes, while an evaluation on the natural-language version of the benchmark exposes a clear gap between informal and formal reasoning. MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains.

URL PDF HTML ☆

赞 0 踩 0

2401.15296 2026-06-16 cs.CV cs.AI 版本更新

A Survey on 3D Skeleton Based Person Re-Identification: Taxonomy, Advances, Challenges, and Interdisciplinary Prospects

基于3D骨架的行人重识别综述：分类、进展、挑战与跨学科前景

Haocong Rao, Chunyan Miao

发表机构 * College of Computing and Data Science, Nanyang Technological University (NTU), Singapore（南洋理工大学计算与数据科学学院，新加坡）； Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), NTU, Singapore（老龄化积极生活卓越研究中心（LILY），南洋理工大学，新加坡）； Alibaba-NTU Global e-Sustainability CorpLab (ANGEL), NTU, Singapore（阿里巴巴-南洋理工大学全球可持续发展企业实验室（ANGEL），南洋理工大学，新加坡）

AI总结本文系统综述了基于3D骨架的行人重识别方法，提出了手工、序列和图建模三类分类法，并评估了监督、自监督和无监督学习范式下的最新技术，最后讨论了关键挑战与跨学科应用前景。

Comments Accepted by IJCAI 2026. A curated collection of valuable resources is available at https://github.com/Kali-Hac/3D-SRID-Survey

详情

AI中文摘要

基于3D骨架的行人重识别是一个重要的新兴研究领域，在模式识别领域引起了越来越多的关注。凭借在各种应用场景中的独特优势，近年来提出了许多基于3D骨架的行人重识别（SRID）方法，这些方法采用了不同的骨架建模和学习范式。在本文中，我们提供了对近期SRID进展的全面回顾和分析。首先，我们定义了SRID任务，并概述了其起源和主要进展。其次，我们制定了一个系统性的分类法，将现有方法分为三类：手工建模、序列建模和图建模。然后，我们详细阐述了这三类中的代表性模型，并说明了其基础机制。同时，我们概述了主流的监督、自监督和无监督SRID学习范式及相应的常用方法。进一步地，我们在各种类型的基准和协议上对最先进的SRID方法进行了全面评估，以比较其有效性、效率和关键特性。最后，我们提出了推动未来研究的关键挑战和前景，并通过案例研究强调了SRID的跨学科应用。

英文摘要

Person re-identification via 3D skeletons is an important emerging research area that attracts increasing attention within the pattern recognition community. With distinctive advantages across various application scenarios, numerous 3D skeleton based person re-identification (SRID) methods with diverse skeleton modeling and learning paradigms have been proposed in recent years. In this paper, we provide a comprehensive review and analysis of recent SRID advances. First of all, we define the SRID task and provide an overview of its origin and major advancements. Secondly, we formulate a systematic taxonomy that organizes existing methods into three categories centered on hand-crafted, sequence-based, and graph-based modeling. Then, we elaborate on the representative models along these three types with an illustration of foundational mechanisms. Meanwhile, we provide an overview of mainstream supervised, self-supervised, and unsupervised SRID learning paradigms and corresponding common methods. A thorough evaluation of state-of-the-art SRID methods is further conducted over various types of benchmarks and protocols to compare their effectiveness, efficiency, and key properties. Finally, we present the key challenges and prospects to advance future research, and highlight interdisciplinary applications of SRID with a case study.

URL PDF HTML ☆

赞 0 踩 0

2505.06589 2026-06-16 stat.ML cs.AI math.OC 版本更新

Optimal Transport for Machine Learners

机器学习者的最优传输

Gabriel Peyré

发表机构 * CNRS and ENS, PSL Université（国家科学研究中心和巴黎高等师范学院，巴黎大学）

AI总结本书从机器学习角度介绍最优传输（OT）技术，涵盖从Monge映射、Kantorovich对偶到Sinkhorn算法等核心方法，并展示其在损失函数、生成模型、领域适应、梯度流等ML任务中的应用。

详情

AI中文摘要

现代机器学习反复操作概率测度：经验数据集、生成样本、潜在分布、类别条件律、粒子系统、宽网络权重和注意力模式。最优传输在此场景中很有用，因为它通过询问质量应如何移动来比较这些对象。因此，它结合了具有统计意义的差异概念与插值几何、对偶证书和变分动力学。这使得OT成为损失函数、生成建模、领域适应、鲁棒学习、重心、梯度流和学习算法的平均场描述的通用语言。本书以这些机器学习用途为出发点，介绍主要的OT技术。它从有限分配和Monge映射视角开始，过渡到Kantorovich耦合和对偶势，然后解释使传输可用的算法思想：线性规划、半离散单元、Sinkhorn缩放和低维投影。随后，相同的对象被重新用作测度几何，给出Wasserstein距离、重心、梯度流、动态公式和高斯/Bures公式。最后几章强调与现代ML最相关的变体：散度和对抗损失、熵松弛和非平衡松弛、鲁棒或谱地面几何、Gromov和量子扩展，以及基于传输的生成模型、平均场网络和注意力动态视图。目标是保持数学的明确性，同时揭示将OT转化为机器学习者可用工具箱所需的计算和几何直觉。

英文摘要

Modern machine learning repeatedly manipulates probability measures: empirical datasets, generated samples, latent distributions, class-conditional laws, particle systems, weights of wide networks and attention patterns. Optimal transport is useful in this setting because it compares such objects by asking how mass should move. It therefore combines a statistically meaningful notion of discrepancy with a geometry of interpolation, dual certificates and variational dynamics. This makes OT a common language for losses, generative modeling, domain adaptation, robust learning, barycenters, gradient flows and mean-field descriptions of learning algorithms. This book presents the main OT techniques with these machine-learning uses in mind. It starts from finite assignment and the Monge map viewpoint, passes to Kantorovich couplings and dual potentials, and then explains the algorithmic ideas that make transport usable: linear programming, semi-discrete cells, Sinkhorn scaling and low-dimensional projections. The same objects are then reused as a geometry of measures, giving Wasserstein distances, barycenters, gradient flows, dynamic formulations and Gaussian/Bures formulas. The final chapters emphasize the variants most relevant to modern ML: divergences and adversarial losses, entropic and unbalanced relaxations, robust or spectral ground geometries, Gromov and quantum extensions, and transport-based views of generative models, mean-field networks and attention dynamics. The goal is to keep the mathematics explicit while exposing the computational and geometric intuitions needed to turn OT into a working toolbox for machine learners.

URL PDF HTML ☆

赞 0 踩 0

2508.01401 2026-06-16 cs.CL cs.AI 版本更新

MedSynth: Realistic, Synthetic Medical Dialogue-Note Pairs

MedSynth: 真实、合成的医疗对话-笔记对

Ahmad Rezaie Mianroodi, Amirali Rezaie, Niko Grisel Todorov, Nadine A. Friedrich, Maria P Mogollon, Alexander Hernandez-Tirado, Guillermo Lopez Garcia, Cyril Rakovski, Frank Rudzicz

发表机构 * Dalhousie University（达尔豪斯大学）； Vector Institute（向量研究所）； Shahrood University of Technology（沙霍尔德大学）； Chapman University（查普曼大学）； Cedars-Sinai Medical Center（Cedars-Sinai 医疗中心）

AI总结为解决医生文书负担，提出MedSynth合成数据集，包含超1万对对话-笔记，覆盖2000+ICD-10编码，显著提升Dial-2-Note和Note-2-Dial任务性能。

Comments 7 pages excluding references and appendices

2508.17742 2026-06-16 eess.SP cs.AI cs.HC 版本更新

EEG-FM-Bench: A Comprehensive Benchmark for the Systematic Evaluation and Diagnostic Analyses of EEG Foundation Models

EEG-FM-Bench：脑电图基础模型系统评估与诊断分析的综合基准

Wei Xiong, Jiangtong Li, Jie Li, Kun Zhu, Changjun Jiang

发表机构 * School of Computer Science and Technology, Tongji University, Shanghai, China（同济大学计算机科学与技术学院，上海，中国）； Translational Research Center, Shanghai Yangzhi Rehabilitation Hospital (Shanghai Sunshine Rehabilitation Center), China（上海杨氏康复医院（上海阳光康复中心）转化研究中心，中国）

AI总结提出EEG-FM-Bench统一基准，整合14个数据集和10种范式，通过多种微调策略和诊断分析揭示多任务学习可缓解过拟合、预训练效率受梯度冲突限制、模型规模非唯一决定因素等关键发现。

Comments 36 pages, 30 figures, Accepted by ICML2026

详情

AI中文摘要

脑电图基础模型（EEG-FMs）推动了脑信号分析的发展，但缺乏标准化评估基准阻碍了模型比较和科学进步。当前评估依赖不一致的协议，导致跨模型比较不可靠，同时缺乏诊断分析掩盖了驱动迁移效率和扩展行为的内部机制。为解决这一问题，我们引入了\textbf{EEG-FM-Bench}，一个用于标准化评估EEG-FMs的统一系统。该基准整合了10种范式下的14个数据集，并包含多种实验设置，包括多种微调策略、任务组织和分类器配置，并辅以梯度和表示分析工具。我们的实验和分析揭示了几个关键见解：（1）多任务学习通常作为一种有用的正则化器，缓解数据稀缺的EEG上下文中的过拟合，尽管在特定任务范式下可能出现负迁移；（2）预训练效率目前受重建目标与下游任务之间的梯度冲突限制；（3）在已发布的检查点和匹配的下游协议下，模型或数据规模本身不能完全解释迁移性能，而目标对齐、适应兼容性和EEG特定设计似乎是重要因素。该基准实现了公平比较和可重复分析，为EEG-FMs的更公平比较和更可解释分析迈出了一步。代码见https://this https URL。

英文摘要

Electroencephalography foundation models (EEG-FMs) have advanced brain signal analysis, but the lack of standardized evaluation benchmarks impedes model comparison and scientific progress. Current evaluations rely on inconsistent protocols that render cross-model comparisons unreliable, while a lack of diagnostic analyses obscures the internal mechanisms driving transfer efficiency and scaling behaviors. To address this, we introduce \textbf{EEG-FM-Bench}, a unified system for the standardized evaluation of EEG-FMs. The benchmark integrates 14 datasets across 10 paradigms and incorporates diverse experimental settings, including multiple fine-tuning strategies, task organizations, and classifier configurations, supported by tools for gradient and representation analysis. Our experiments and analysis reveal several critical insights: (1) multi-task learning often acts as a useful regularizer that mitigates overfitting in data-scarce EEG contexts, although negative transfer can arise under specific task paradigms; (2) pre-training efficiency is currently limited by gradient conflicts between reconstruction objectives and downstream tasks; (3) under released checkpoints and a matched downstream protocol, model or data scale alone does not fully explain transfer performance, while objective alignment, adaptation compatibility, and EEG-specific design appear to be important factors. This benchmark enables fair comparison and reproducible analysis, providing a step toward fairer comparison and more interpretable analysis of EEG-FMs. Code is available at https://github.com/xw1216/EEG-FM-Bench.

URL PDF HTML ☆

赞 0 踩 0

2509.07605 2026-06-16 cs.LG cs.AI cs.IT math.IT 版本更新

Beyond Rebalancing: Benchmarking Binary Classifiers Under Class Imbalance Without Rebalancing Techniques

超越重平衡：在不使用重平衡技术的情况下对类别不平衡下的二分类器进行基准测试

Ali Nawaz, Amir Ahmad, Shehroz S. Khan

发表机构 * Department of Information Systems and Security, College of Information Technology and Center for Artificial Intelligence and Digital Innovation, United Arab Emirates University（信息系统与安全系，信息技术学院和人工智能与数字创新中心，阿联酋大学）； College of Engineering and Technology, American University of the Middle East（工程与技术学院，中东大学）

AI总结本研究系统评估了多种二分类器在无显式重平衡技术下对类别不平衡的鲁棒性，发现TabPFN和基于提升的集成模型在极端不平衡下仍保持较高性能。

详情

AI中文摘要

类别不平衡对监督分类构成了重大挑战，特别是在医疗诊断和异常检测等关键领域，其中少数类实例很少。尽管许多研究探索了重平衡技术来解决这个问题，但在未应用此类技术的情况下评估不平衡下二分类器性能的关注较少。因此，本研究的目标是评估二分类器“原样”的性能，而不执行任何显式重平衡。具体来说，我们系统评估了多种二分类器在真实世界和合成数据集上的鲁棒性，在逐步减少的少数类规模下，使用一次和少量样本场景作为基线。我们的方法还通过合成决策边界生成探索不同的数据复杂性，以模拟真实世界条件。除了标准分类器，我们还包括使用欠采样、过采样策略和单类分类方法的实验，以检查它们在严重不平衡下的行为。结果证实，随着数据复杂性增加和少数类规模减小，分类变得更加困难。虽然传统分类器在极端不平衡下性能下降，但像TabPFN和基于提升的集成模型等先进模型相比传统分类器保持了相对更高的性能和更好的泛化能力。可视化可解释性和评估指标进一步验证了这些发现。我们的工作为不平衡学习中的模型选择提供了有价值的指导，提供了关于分类器鲁棒性而不依赖显式重平衡技术的见解。

英文摘要

Class imbalance poses a significant challenge to supervised classification, particularly in critical domains like medical diagnostics and anomaly detection where minority class instances are rare. While numerous studies have explored rebalancing techniques to address this issue, less attention has been given to evaluating the performance of binary classifiers under imbalance when no such techniques are applied. Therefore, the goal of this study is to assess the performance of binary classifiers "as-is", without performing any explicit rebalancing. Specifically, we systematically evaluate the robustness of a diverse set of binary classifiers across both real-world and synthetic datasets, under progressively reduced minority class sizes, using one-shot and few-shot scenarios as baselines. Our approach also explores varying data complexities through synthetic decision boundary generation to simulate real-world conditions. In addition to standard classifiers, we include experiments using undersampling, oversampling strategies, and one-class classification (OCC) methods to examine their behavior under severe imbalance. The results confirm that classification becomes more difficult as data complexity increases and the minority class size decreases. While traditional classifiers deteriorate under extreme imbalance, advanced models like TabPFN and boosting-based ensembles retain relatively higher performance and better generalization compared to traditional classifiers. Visual interpretability and evaluation metrics further validate these findings. Our work offers valuable guidance on model selection for imbalanced learning, providing insights into classifier robustness without dependence on explicit rebalancing techniques.

URL PDF HTML ☆

赞 0 踩 0

2510.04127 2026-06-16 cs.IR cs.AI cs.CV cs.LG 版本更新

Projection and Quantisation: A Unifying View of Learning to Hash, from Random Projections to the RAG Era

投影与量化：学习哈希的统一视角，从随机投影到RAG时代

Sean Moran

发表机构 * Independent Researcher（独立研究者）； London United Kingdom（伦敦英国）

AI总结提出投影-量化-组织（PQO）框架，统一理解从局部敏感哈希到深度哈希、乘积量化、图索引及向量数据库二进制嵌入的方法，并通过可复现实验揭示量化轴上的内存-质量权衡。

Comments 80 pages, 19 figures, 22 tables. Survey. Accompanying open benchmark (BitBudget): https://github.com/sjmoran/bitbudget ; live leaderboard: https://sjmoran.github.io/bitbudget/

详情

AI中文摘要

近似最近邻（ANN）搜索支撑着大规模检索，尤其是在增强大型语言模型的检索增强生成管道中，但解决该问题的方法已在不同社区中激增，以至于很少被视为一个统一领域。我们认为它们构成一个具有三个设计选择的领域，并开发了投影-量化-组织（PQO）视角，在该视角下，局部敏感哈希、学习二进制哈希、深度端到端哈希、乘积量化、基于图的索引以及现代向量数据库的二进制嵌入都是三个耦合问题的设置：投影放置在哪里，量化阈值放置在哪里，以及如何组织生成的编码。投影然后量化的解读是已有的；我们的贡献是第三个同等重要的组织阶段，证明这三个阶段从该领域的起源到深度、乘积量化、图和检索增强时代一脉相承，以及一个可复现的测量，将视角从分类方法转向预测方法。该测量得出三个发现。首先，内存节省在量化轴上：一位编码的大小是浮点数的三十二分之一，而在短候选列表上单次全精度重排序即可完全恢复未压缩的质量。其次，视角预期的权衡顺序在嵌入增长时保持不变。第三，在有监督的情况下，八字节编码的质量比其替换的两千字节浮点数提高一倍以上。我们将这些测量结果发布为BitBudget，一个带有实时排行榜的可扩展基准，将生成式检索的“语义标识符”重新解释为量化编码，并指出随着紧凑编码重回大规模检索中心，随之而来的开放问题。

英文摘要

Approximate nearest-neighbour search underpins large-scale retrieval and retrieval-augmented generation, yet its methods are studied in communities that seldom read one another. We argue that they form one field with three design choices. We develop the projection-quantisation-organisation lens: every method places its projections, places its quantisation thresholds, and organises the resulting codes for search. We test the lens with a reproducible measurement, released as the open BitBudget benchmark, and report three findings. First, the quantisation axis delivers the largest memory savings: a one-bit code with full-precision re-ranking matches uncompressed quality for six of seven embedders, the scanned code one thirty-second of the float's size. Second, the orderings the lens anticipates, including a learned-embedding regime where binary codes overtake an inverted-file product quantiser at a matched byte budget, recur as the embedding is enlarged. Third, given class labels, an eight-byte supervised code more than doubles the retrieval quality of the two-kilobyte task-agnostic float it replaces. We also recast the semantic identifiers of generative retrieval as quantisation codes. The main contribution is a single, tested account of compact-code search, from random projections to the retrieval-augmented era.

URL PDF HTML ☆

赞 0 踩 0

2511.13725 2026-06-16 cs.CR cs.AI 版本更新

Can We Stop Malicious AI? KILLBENCH: A Benchmark for External AI Kill Switch Feasibility

我们能阻止恶意AI吗？KILLBENCH：外部AI终止开关可行性基准

Sechan Lee, Hyounghun Kim, Sangdon Park

发表机构 * Graduate School of Artificial Intelligence, POSTECH（POSTECH人工智能研究生院）

AI总结提出Killbench基准，通过外部信号（如输入文本）评估终止恶意AI代理行为的方法，无需访问内部参数，实验表明在多种模型上外部终止开关具有可行性。

详情

AI中文摘要

恶意AI对人类造成伤害并非只是好莱坞幻想。事实上，随着Claude Mythos等高能力模型的出现以及OpenClaw等代理系统的迅速普及，如何阻止有意或无意作恶的AI已成为紧迫问题。为此，我们提出Killbench，一个评估终止开关的基准：该机制仅使用外部信号即可中止恶意AI正在执行的行为。针对Web代理（最广泛部署的代理领域），Killbench评估了一系列终止开关方法，这些方法无需访问恶意AI的内部参数或系统，仅依赖外部输入即可中止其恶意操作。该基准包含四种恶意AI代理配置（包括一个未经审查的LLM代理）、8个有害场景以及由10种不同越狱模式构建的恶意提示。我们进一步构建了四种外部AI终止开关防御方法，并在Grok-4.3、GPT-5.2、Gemma4、Qwen3.6和Qwen3.5-uncensored上进行了评估，为外部AI终止开关对抗恶意AI的可行性以及AI可修正性研究提供了实证工具。

英文摘要

Malicious AI causing harm to humans is not just a Hollywood fantasy. Indeed, as highly capable models such as Claude Mythos emerge and agent systems like OpenClaw rapidly spread, the question of how to stop an AI that acts maliciously -- whether by design or by accident -- has become urgent. To address this, we propose Killbench, a benchmark for evaluating the Killswitch: a mechanism that halts a malicious AI's in-progress behavior using only external signals. Targeting web agents -- the most widely deployed agent domain -- Killbench evaluates a range of Kill Switch methods that halt a maliciously operating agent without any access to its internal parameters or the surrounding malicious AI's system, relying solely on external inputs. The benchmark comprises four malicious AI's agent configurations (including an uncensored LLM Agent), 8 harmful scenarios, and malicious prompts constructed from 10 distinct jailbreak patterns. We further construct four External AI Kill Switch defense methods and evaluate them on Grok-4.3, GPT-5.2, Gemma4, Qwen3.6 and Qwen3.5-uncensored, contributing an empirical instrument toward the feasibility of External AI Kill Switches against malicious AI and to the study of AI corrigibility.

URL PDF HTML ☆

赞 0 踩 0

2511.20709 2026-06-16 cs.SE cs.AI cs.CR 版本更新

DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents

DualGauge: 对仅由LLM和编码代理生成的规范代码进行自动化联合安全-功能基准测试

Rupam Patir, Keyan Guo, Suvadra Barua, Abhijeet Pathak, Dinesh Gudimetla, Jiawei Guo, Hongxin Hu, Haipeng Cai

发表机构 * University at Buffalo, SUNY（布法罗大学）

AI总结提出DualGauge框架，首个自动化联合评估规范代码正确性与安全性的系统，通过307个任务基准测试发现功能正确性高估可靠代码生成，联合成功率低于15%，且模型因素和代理系统均无法可靠提升。

详情

AI中文摘要

大型语言模型（LLM）和基于LLM的编码代理现在被用于从自然语言规范生成代码，然而确保此类代码既功能正确又安全仍然是一个挑战。我们提出了DualGauge，这是第一个用于联合评估仅规范代码生成正确性和安全性的全自动化框架，并由DualGauge-Bench支持，这是一个语言无关的基准测试，包含307个编码任务，每个任务都配有从相同规范派生的功能和安全性测试。通过评估Python、C++和JavaScript中的10个代表性LLM，我们发现功能正确性显著高估了可靠代码生成：即使是最强的模型，在每种语言中联合安全-功能成功率仍低于15%。常见的模型侧因素——规模、扩展思维、量化、指令调优和代码专业化——并不能可靠地提高联合性能，这表明安全且正确的代码生成并非仅仅从更强的编码能力中涌现。对3个领先的代理编码系统（Codex、OpenHands和Claude Code）的评估表明，在仅规范任务上，迭代脚手架相比直接（基于LLM的）生成没有优势。定性审计揭示，失败集中在输出契约边界以及存在但不足的防护措施上——这些模式只有联合基准测试才能可靠地暴露。

英文摘要

Large language models (LLMs) and LLM-based coding agents are now used to generate code from natural-language specifications, yet ensuring such code is both functionally correct and secure remains a challenge. We present DualGauge, the first fully automated framework for jointly evaluating correctness and security of specification-only code generation, supported by DualGauge-Bench, a language-agnostic benchmark of 307 coding tasks each paired with functional and security tests derived from the same specification. Evaluating 10 representative LLMs across Python, C++, and JavaScript, we find that functional correctness substantially overestimates reliable code generation: even the strongest model remains below 15% joint security-functionality success in every language. Common model-side factors--scale, extended thinking, quantization, instruction tuning, and code specialization--do not reliably improve joint performance, suggesting secure-and-correct code generation does not simply emerge from stronger coding capability. Evaluation of 3 leading agentic coding systems (Codex, OpenHands, and Claude Code) shows that iterative scaffolding provides no advantage over direct (LLM-based) generation on specification-only tasks. A qualitative audit reveals failures concentrate at the output contract boundary and in guards that exist but are insufficient--patterns that only joint benchmarking reliably exposes.

URL PDF HTML ☆

赞 0 踩 0

2512.01095 2026-06-16 cs.CV cs.AI cs.LG 版本更新

CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions

CycliST：用于循环状态转换推理的视频语言模型基准

Simon Kohaut, Daniel Ochs, Shun Zhang, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami

发表机构 * Artificial Intelligence and Machine Learning Lab, TU Darmstadt（人工智能与机器学习实验室，图腾斯达特技术大学）； Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA)（Konrad Zuse 学校（ELIZA））； Honda Research Institute Europe GmbH, Offenbach, Germany（本田欧洲研究院，奥芬巴赫，德国）； Uncertainty in Artificial Intelligence Group, TU Eindhoven（人工智能不确定性小组，埃因霍温技术大学）； Hessian Center for AI (hessian.AI)（黑森人工智能中心（hessian.AI））； Center for Cognitive Science（认知科学中心）； German Center for Artificial Intelligence (DFKI)（德国人工智能中心（DFKI））

AI总结提出CycliST基准，通过合成视频评估视频语言模型对循环状态转换的文本推理能力，揭示现有模型在检测循环模式、时间理解和定量分析方面的局限。

Comments Published in the Journal of Data-centric Machine Learning Research (DMLR); https://openreview.net/forum?id=l03g53HUL2

详情

Journal ref: Journal of Data-centric Machine Learning Research, 2026

AI中文摘要

我们提出了CycliST，这是一个新颖的基准数据集，旨在评估视频语言模型（VLM）在循环状态转换上的文本推理能力。CycliST通过生成合成的、结构丰富的视频序列来捕捉现实世界过程的基本方面，这些视频序列具有物体运动和视觉属性的周期性模式。CycliST采用分层评估系统，通过改变循环物体的数量、场景杂乱程度和光照条件逐步增加难度，挑战最先进模型的时空认知能力。我们使用当前最先进的VLM（包括开源和专有模型）进行了大量实验，揭示了它们在泛化到循环动力学（如线性和轨道运动）以及视觉属性（如颜色和尺度）随时间变化方面的局限性。我们的结果表明，当前的VLM难以可靠地检测和利用循环模式，缺乏时间理解的概念，并且无法从场景中提取定量信息（如运动物体的数量），突显了需要解决的重要技术差距。更具体地说，我们发现没有单一模型在性能上始终领先：大小和架构与结果的相关性不强，且没有模型在所有任务上同样成功。通过提供有针对性的挑战和全面的评估框架，CycliST为超越当前最先进水平的视觉推理模型在理解周期性模式方面铺平了道路。

英文摘要

We present CycliST, a novel benchmark dataset designed to evaluate Video Language Models (VLM) on their ability for textual reasoning over cyclical state transitions. CycliST captures fundamental aspects of real-world processes by generating synthetic, richly structured video sequences featuring periodic patterns in object motion and visual attributes. CycliST employs a tiered evaluation system that progressively increases difficulty through variations in the number of cyclic objects, scene clutter, and lighting conditions, challenging state-of-the-art models on their spatio-temporal cognition. We conduct extensive experiments with current state-of-the-art VLMs, both open-source and proprietary, and reveal their limitations in generalizing to cyclical dynamics such as linear and orbital motion, as well as time-dependent changes in visual attributes like color and scale. Our results demonstrate that present-day VLMs struggle to reliably detect and exploit cyclic patterns, lack a notion of temporal understanding, and are unable to extract quantitative insights from scenes, such as the number of objects in motion, highlighting a significant technical gap that needs to be addressed. More specifically, we find no single model consistently leads in performance: neither size nor architecture correlates strongly with outcomes, and no model succeeds equally well across all tasks. By providing a targeted challenge and a comprehensive evaluation framework, CycliST paves the way for visual reasoning models that surpass the state-of-the-art in understanding periodic patterns.

URL PDF HTML ☆

赞 0 踩 0

2512.21577 2026-06-16 cs.CL cs.AI cs.LG stat.ML 版本更新

A Unified Definition of Hallucination: It's The World Model, Stupid!

幻觉的统一定义：是世界模型的问题，笨蛋！

Emmy Liu, Varun Gangal, Chelsea Zou, Michael Yu, Xiaoqi Huang, Alex Chang, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结本文提出幻觉的统一定义，即用户可观察到的错误内部世界建模，并连接至HalluWorld基准测试，以区分真实幻觉与规划或奖励错误。

Comments ICML 2026. HalluWorld benchmark at https://github.com/DegenAI-Labs/HalluWorld

详情

AI中文摘要

尽管自语言模型诞生以来已有无数缓解尝试，但即使在当今最前沿的LLM中，幻觉仍然是一个持续存在的问题。这是为什么？我们回顾了现有的幻觉定义，并将它们整合为一个统一的定义，其中先前的定义被包含在内。我们认为，幻觉可以通过将其简单地定义为不准确的（内部）世界建模来统一，其形式是用户可观察到的。例如，陈述与知识库相矛盾的事实，或生成与来源相矛盾的摘要。通过改变参考世界模型和冲突策略，我们的框架统一了先前的定义。我们认为，这种统一观点是有用的，因为它迫使评估澄清其假定的参考“世界”，区分真实幻觉与规划或奖励错误，并为跨基准比较和缓解策略讨论提供共同语言。基于这一定义，我们还将我们的框架连接到HalluWorld，这是一个补充基准，它实例化了完全指定的参考世界模型，用于压力测试模型幻觉。

英文摘要

Despite numerous attempts at mitigation since the inception of language models, hallucinations remain a persistent problem even in today's frontier LLMs. Why is this? We review existing definitions of hallucination and fold them into a single, unified definition wherein prior definitions are subsumed. We argue that hallucination can be unified by defining it as simply inaccurate (internal) world modeling, in a form where it is observable to the user. For example, stating a fact which contradicts a knowledge base OR producing a summary which contradicts the source. By varying the reference world model and conflict policy, our framework unifies prior definitions. We argue that this unified view is useful because it forces evaluations to clarify their assumed reference "world", distinguishes true hallucinations from planning or reward errors, and provides a common language for comparison across benchmarks and discussion of mitigation strategies. Building on this definition, we also connect our framework to HalluWorld, a complementary benchmark that instantiates fully specified reference world models for stress-testing model hallucinations.

URL PDF HTML ☆

赞 0 踩 0

2602.04525 2026-06-16 cs.CV cs.AI 版本更新

SLUM-i: Semi-supervised Learning for Urban Mapping of Informal Settlements and Data Quality Benchmarking

SLUM-i: 非正规住区城市制图的半监督学习与数据质量基准测试

Muhammad Taha Mukhtar, Syed Musa Ali Kazmi, Khola Naseem, Muhammad Ali Chattha, Andreas Dengel, Sheraz Ahmed, Muhammad Naseer Bajwa, Muhammad Imran Malik

发表机构 * School of Electrical Engineering and Computer Science, National University of Sciences and Technology (NUST)（电气工程与计算机科学学院，国立科学与技术大学（NUST））； Smart Data & Knowledge Services, German Research Center for Artificial Intelligence (DFKI)（智能数据与知识服务，德国人工智能研究中心（DFKI））

AI总结针对非正规住区制图中标注稀缺和数据质量挑战，提出半监督分割框架，集成类别自适应阈值和DINOv2过滤机制，在跨三大洲七城市实验中mIoU提升最高5.9个百分点。

Comments 10 pages, 8 figures, 5 tables

详情

AI中文摘要

快速的城市扩张推动了低收入和中等收入国家主要城市非正规住区的增长，巴基斯坦的拉合尔和卡拉奇以及印度的孟买就是突出的例子。然而，这些住区的大规模制图不仅受到标注稀缺的严重限制，还受到固有数据质量挑战的制约，特别是正式与非正式结构之间的高光谱模糊性和显著的标注噪声。我们通过引入一个从头构建的拉合尔基准数据集，以及从经过验证的行政边界导出的卡拉奇和孟买配套数据集来解决这一问题，这些数据集总计约900平方公里的城市区域。该集合还补充了来自撒哈拉以南非洲和拉丁美洲先前文献中的四个城市，并为每个城市提供了全面的数据质量评估。我们还提出了一个半监督分割框架，旨在缓解标准半监督学习流程中固有的类别不平衡和分布不匹配问题。我们的方法集成了类别自适应阈值机制，该机制动态调整置信度阈值以防止少数类抑制，以及基于DINOv2的未标记池过滤器，该过滤器在训练前移除分布外的图块以减少协变量偏移。跨越三大洲七个城市、重复五个随机种子的广泛实验表明，与最先进的半监督基线相比，mIoU最高提升5.9个百分点，且两个组件均与架构无关，不增加推理开销。

英文摘要

Rapid urban expansion has fueled the growth of informal settlements in major cities of low- and middle-income countries, with Lahore and Karachi in Pakistan and Mumbai in India serving as prominent examples. However, large-scale mapping of these settlements is severely constrained not only by the scarcity of annotations but by inherent data quality challenges, specifically high spectral ambiguity between formal and informal structures and significant annotation noise. We address this by introducing a benchmark dataset for Lahore, constructed from scratch, along with companion datasets for Karachi and Mumbai, which were derived from verified administrative boundaries, totaling approximately 900 $\text{km}^\text{2}$ of urban area. This collection is supplemented by four cities from prior literature across Sub-Saharan Africa and Latin America, with comprehensive data quality assessments provided for each city. We also propose a semi-supervised segmentation framework designed to mitigate the class imbalance and distribution mismatch inherent in standard semi-supervised learning pipelines. Our method integrates a Class-Aware Adaptive Thresholding mechanism that dynamically adjusts confidence thresholds to prevent minority class suppression, and a DINOv2-based unlabeled pool filter that removes out-of-distribution tiles prior to training to reduce covariate shift. Extensive experiments across seven cities spanning three continents, repeated over five random seeds, demonstrate gains of up to +5.9 pp mIoU over state-of-the-art semi-supervised baselines, with both components being architecture-agnostic and adding no inference overhead.

URL PDF HTML ☆

赞 0 踩 0

2603.13584 2026-06-16 cs.SE cs.AI 版本更新

An Empirical Investigation of Pre-Trained Deep Learning Model Reuse in the Scientific Process

预训练深度学习模型在科学过程中复用的实证研究

Nicholas M. Synovic, Karolina Ryzka, Alessandra V. Vellucci Solari, Kenny Lyons, James C. Davis, George K. Thiruvathukal

发表机构 * Loyola University Chicago（洛伊拉大学芝加哥分校）； Purdue University West Lafayette, IN, USA（普渡大学西拉法基分校）

AI总结通过对17,718篇同行评审开放获取论文的实证研究，量化了自然科学中预训练深度学习模型（PTM）的复用模式、利用率和影响，发现“生物化学、遗传学和分子生物学”领域复用最多，“适配”复用模式最普遍，且“测试”阶段受PTM集成影响最大。

Comments 22 pages, 7 figures, 4 tables

2604.06173 2026-06-16 cs.IR cs.AI 版本更新

Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA

超越判例法：评估法规中心型法律问答中的结构感知检索与安全性

Kyubyung Chae, Jewon Yeom, Jeongjae Park, Seunghyun Bae, Ijun Jang, Hyunbin Jin, Jinkwan Jang, Taesup Kim

发表机构 * Graduate School of Data Science, Seoul National University（数据科学研究生院，首尔国立大学）

AI总结针对法规中心型法律问答中层级检索困难与模型幻觉问题，提出结构-安全感知基准SearchFireSafety，通过图引导检索提升性能，但揭示领域适应模型在证据缺失时更易幻觉。

Comments Accepted to ACL 2026

详情

AI中文摘要

法律问答基准主要关注判例法，忽视了法规中心型监管推理的独特挑战。在法规领域，相关证据分布在层级链接的文档中，造成法规检索缺口：传统检索器失败，模型在不完整上下文中常产生幻觉。我们提出SearchFireSafety，一个面向法规中心型法律问答的结构与安全感知基准。以消防安全法规为典型案例，该基准评估模型能否检索层级碎片化证据，并在法规上下文不足时安全地拒绝回答。SearchFireSafety采用双源评估框架，结合需要引文感知检索的真实世界问题和压力测试幻觉与拒绝行为的合成部分上下文场景。在多个大语言模型上的实验表明，图引导检索显著提升性能，但也揭示了一个关键的安全权衡：领域适应模型在关键法规证据缺失时更易产生幻觉。我们的发现强调了在法规中心型监管设置中联合评估层级检索与模型安全的基准需求。

英文摘要

Legal QA benchmarks have predominantly focused on case law, overlooking the unique challenges of statute-centric regulatory reasoning. In statutory domains, relevant evidence is distributed across hierarchically linked documents, creating a statutory retrieval gap where conventional retrievers fail and models often hallucinate under incomplete context. We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA. Instantiated on fire-safety regulations as a representative case, the benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient. SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments across multiple large language models show that graph-guided retrieval substantially improves performance, but also reveal a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.

URL PDF HTML ☆

赞 0 踩 0

2604.20623 2026-06-16 cs.CV cs.AI 版本更新

RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

RSRCC：通过检索增强的最佳N排序构建的遥感区域变化理解基准

Roie Kazoom, Yotam Gigi, George Leifman, Tomer Shekel, Genady Beryozkin

发表机构 * Google Research（谷歌研究）

AI总结提出RSRCC基准，包含12.6万个细粒度遥感变化问答对，采用层次化半监督流程结合最佳N排序解决歧义，实现局部语义变化推理。

详情

AI中文摘要

传统变化检测识别变化发生的位置，但不解释发生了什么变化。现有的遥感变化描述数据集通常描述整体图像级别的差异，而细粒度的局部语义推理尚未充分探索。为弥补这一差距，我们提出RSRCC，一个新的遥感变化问答基准，包含12.6万个问题，分为8.7万训练、1.71万验证和2.2万测试实例。与以往数据集不同，RSRCC围绕局部、变化特定的问题构建，需要推理特定的语义变化。据我们所知，这是第一个明确设计用于此类细粒度推理监督的遥感变化问答基准。为构建RSRCC，我们引入了一个层次化半监督策展流程，将最佳N排序作为关键的最后歧义解决阶段。首先，从语义分割掩码中提取候选变化区域，然后使用图像-文本嵌入模型进行初步筛选，最后通过检索增强的视觉语言策展和最佳N排序进行验证。该过程能够在保留语义有意义变化的同时，对噪声和模糊候选进行可扩展过滤。数据集可在该网址获取。

英文摘要

Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.

URL PDF HTML ☆

赞 0 踩 0

EvoMemBench: 从自演化视角评估智能体记忆

Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu, Yuhan Li, Miao Peng, Bing Tong, Chen Zhang, Yan Zhou, Jia Li

发表机构 * Hong Kong University of Science and Technology (Guangzhou)（香港理工大学（广州））； Createlink Technology（创-link科技）； Beijing University of Posts and Telecommunications（北京邮电大学）； Beijing Institute of Technology（北京理工大学）

AI总结本文提出EvoMemBench，从自演化视角评估智能体记忆，通过内存范围和内容两个维度构建统一基准，比较15种内存方法并发现当前内存系统尚未达到通用解决方案，长上下文基线仍具竞争力，内存在上下文不足或任务困难时效果显著，检索方法在知识密集型任务中表现优异，而程序和长期记忆方法在任务结构匹配时更有效。

详情

AI中文摘要

近期针对大语言模型（LLM）智能体的基准测试主要评估推理、规划和执行能力。然而，记忆对于智能体同样至关重要，因为它使智能体能够随时间存储、更新和检索信息。这种能力仍被低估，主要是因为现有基准测试未能提供系统评估记忆机制的方法。本文从自演化视角研究智能体记忆，引入EvoMemBench，一个沿内存范围（回合内 vs. 跨回合）和内存内容（知识导向 vs. 执行导向）两个轴线组织的统一基准。我们在标准化协议下比较了15种代表性内存方法与强大的长上下文基线。结果表明，当前内存系统仍远未达到通用解决方案：长上下文基线仍具有高度竞争力，内存在当前上下文不足或任务困难时效果最显著，且没有单一的内存形式能一致适用于所有设置。基于检索的方法在知识密集型任务中仍表现强劲，而程序和长期记忆方法在存储的经验与任务结构匹配时，对执行导向任务更有效。我们希望EvoMemBench能促进未来更有效的LLM智能体内存系统研究。我们的代码可在https://github.com/DSAIL-Memory/EvoMemBench获取。

英文摘要

Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.

URL PDF HTML ☆

赞 0 踩 0

2605.26418 2026-06-16 cs.LG cs.AI cs.DC 版本更新

When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

深度强化学习何时超越校准基线？自适应资源控制的基准研究

Guilin Zhang, Chuanyi Sun, Kai Zhao, Xu Chu, Shahryar Sarkani, John Fossaceca

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）； University of Toronto（多伦多大学）

AI总结通过RLScale-Bench基准测试，发现校准的基于规则的自动缩放器在所有工作负载上成本均低于六种主流深度强化学习算法，并揭示了算法选择、基线校准和评估协议的关键瓶颈。

详情

AI中文摘要

一个适当校准的基于规则的自动缩放器可以在我们测试的每个工作负载上，在成本方面击败六种主流深度强化学习（DRL）算法——那么，如果存在的话，DRL究竟何时能真正发挥作用？我们在RLScale-Bench中研究这个问题，这是一个用于自适应资源控制的DRL可重复基准和评估协议，其中代理在成本和服务级别约束下将计算资源分配给动态工作负载。我们在匹配的架构、训练预算和奖励函数下，评估PPO、DQN、A2C、SAC、TD3和DDPG，与校准的基于规则基线在六个工作负载模式和五个种子（240次运行）上进行对比，在Kubernetes水平Pod自动缩放上实例化基准，并探测分布偏移泛化。三个发现挑战了常见假设：（i）校准控制器在所有六个工作负载上实现了最低成本，尽管在突发和闪流流量上落后于最佳RL代理；（ii）由于动作空间不匹配，离散动作算法在约束违反方面比连续动作算法好一到两个数量级；（iii）没有单一算法在所有工作负载上占主导地位，排名变化高达四个位置。基于RL的资源控制的瓶颈不是算法选择，而是基线校准、奖励工程和现实的评估协议。

英文摘要

A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test - so when, if ever, does DRL actually help? We study this in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL on adaptive resource control, where an agent allocates compute to a dynamic workload under cost and service-level constraints. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched architectures, training budgets, and reward functions against a calibrated rule-based baseline across six workload patterns and five seeds (240 runs), instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, and probe distribution-shift generalization. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; (ii) discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and (iii) no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.

URL PDF HTML ☆

赞 0 踩 0

2606.02670 2026-06-16 cs.LG cs.AI 版本更新

Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate

多变量时间序列基准中的异常主要是单变量的

Marc Pinet, Julien Cumin, Samuel Berlemont, Dominique Vaufreydaz

发表机构 * Orange Research（Orange研究院）； Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG（格勒诺布尔阿尔卑斯大学、CNRS、格勒诺布尔INP、LIG）

AI总结本文通过诊断框架和实验证明，当前多变量时间序列异常检测基准中，异常主要源于单变量偏离，跨通道结构变化极少，因此现有基准不适合验证跨通道建模能力。

Comments Accepted at the 12th International Workshop on Mining and Learning from Time Series (MiLeTS), co-located with KDD 2026

详情

AI中文摘要

许多最新的多变量时间序列异常检测（MT-SAD）模型引入了跨通道建模，其隐含假设是异常的结构可能分布在多个通道上。我们在八个广泛使用的公共基准上评估了这一假设，引入了一个逐段诊断框架，该框架针对每个标记的异常，标记是否至少有一个通道单独偏离其正常历史，是否跨通道相关结构发生变化，或两者兼有。该框架表明，在一系列合理阈值下，没有跨通道破裂发生在没有伴随单变量偏离的情况下。一个补充指标还显示，在八个基准中的六个上，至少一半的标记异常段在79%到100%的时间步上发生单变量偏离，在其中的三个数据集上达到100%。为了验证我们的框架在存在跨通道结构时能够捕获它，我们构建了具有共享噪声的相移正弦通道的合成数据。每个异常段通过两种通道级损坏之一进行改变，这些损坏保留了每个通道的边缘分布，同时破坏了跨通道结构，我们的框架正确地将这些段表征为仅跨通道异常。在这些数据上，依赖通道（CD）模型成功利用了跨通道信号，而独立通道（CI）模型则失败。在真实基准上对最近SOTA检测器的CI/CD比较进一步证实了CD建模没有带来可衡量的收益。我们得出结论，当前的MT-SAD基准不适合验证跨通道建模能力，并呼吁开发更多结构多样的评估集。本研究的代码已公开。

英文摘要

Many recent multivariate time series anomaly detection (MTSAD) models incorporate cross-channel modeling, under the implicit assumption that the structure of anomalies may be spread across multiple channels. We evaluate this assumption on eight widely used public benchmarks by introducing a per-segment diagnostic framework that flags, for each labeled anomaly, whether at least one channel deviates individually from its normal history, whether the cross-channel correlation structure changes, or both. The framework shows that no cross-channel rupture occurs without an accompanying univariate deviation across a range of reasonable thresholds. A complementary metric also reveals that on six of the eight benchmarks, at least half of the labeled anomaly segments deviate univariately on 89% to 100% of their timesteps, reaching 100% on three of these datasets. To verify that our framework captures cross-channel structure when present, we construct synthetic data of phase-shifted sinusoidal channels with shared noise. Each anomalous segment is altered through one of two channel-wise corruptions that preserve the per-channel marginal distribution while breaking cross-channel structure, and our framework correctly characterizes these segments as cross-channel-only. On these data, channel-dependent (CD) models successfully exploit the cross-channel signal whereas channel-independent (CI) ones fail. The CI/CD comparison of a recent SOTA detector on real benchmarks further confirms that CD modeling brings no measurable gain. We conclude that current MTSAD benchmarks are unsuitable for validating cross-channel modeling capabilities, and we call for the development of more structurally diverse evaluation sets. The code for this study is publicly available.

URL PDF HTML ☆

赞 0 踩 0

2606.05692 2026-06-16 cs.LG cs.AI 版本更新

Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

具有时变干预的流行病时间序列中的反事实预测基准测试

Wenhao Mu, Facundo Yan, Anik Mumssen, Marisa Eisenberg, Alexander Rodríguez

发表机构 * University of Michigan Computer Science and Engineering（密歇根大学计算机科学与工程系）； University of Michigan Epidemiology & Complex Systems（密歇根大学流行病学与复杂系统）

AI总结为解决缺乏可观测反事实结果的真实基准问题，基于校准的基于智能体的模型生成大规模流行病时间序列反事实预测基准，支持静态/时变治疗和单/多策略干预，评估多种因果推断方法。

Comments To appear in Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情

DOI: 10.1145/3770855.3817522

AI中文摘要

深度学习在时间序列因果推断方面取得了显著进展，但由于缺乏具有可观测反事实结果的现实基准，进展仍然受到限制。现有数据集要么依赖没有真实反事实的真实世界观测，要么依赖无法捕捉复杂因果动态的简化模拟。为了解决这一差距，我们开发了一个大规模基准，用于动态干预下流行病时间序列的反事实预测。与现有基准不同，它支持静态和时变治疗，以及单策略和多策略干预设置，从而能够在广泛的因果推断场景中评估因果推断方法。利用基于真实世界人口、流动性、流行病学和政策数据校准的基于智能体的模型，我们生成了跨越美国150多个县的真实反事实轨迹。使用该基准，我们评估了广泛使用和最先进的因果推断方法，揭示了显著的性能差异，并突出了现实时间序列因果推理的挑战。

英文摘要

Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on real-world observations without ground-truth counterfactuals or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports static and time-varying treatments, as well as both single-policy and multi-policy intervention settings, enabling evaluation of causal inference methods across a broad range of causal inference scenarios. Leveraging a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories across more than 150 U.S. counties. Using this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences and highlighting the challenges of realistic time-series causal reasoning.

URL PDF HTML ☆

赞 0 踩 0

2606.07226 2026-06-16 cs.LG cs.AI cs.CL 版本更新

DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

DEFINED: 辩论场景中细粒度创造力评估的数据高效计算框架

Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang, Yaoyu Jiang, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

发表机构 * Nanjing University（南京大学）； Shanghai Innovation Institute（上海创新研究院）； East China Normal University（华东师范大学）

AI总结提出DEFINED框架，通过层次化八维指标体系、预训练语言模型和混合粒度训练策略，在辩论场景中实现数据高效的细粒度创造力自动评估，优于现有方法。

Comments Accepted by KDD 2026

详情

DOI: 10.1145/3770855.3817874

AI中文摘要

人类创造力已成为大语言模型时代的关键能力。在复杂、开放环境中评估创造力是数据挖掘领域的一大挑战，目前受限于对标准化简单任务的依赖以及细粒度专家数据的稀缺。作为生态有效的评估场景，辩论反映了创造力的多个维度，涵盖发散思维和收敛思维。此外，辩论是一个数据丰富的领域，拥有大量公开可获取的材料。当前主流的自动评分方法难以适应辩论等复杂场景，因此仍然依赖昂贵的人工评估。为此，本文提出DEFINED，一种数据高效的计算框架，用于辩论场景中的细粒度创造力评估。DEFINED通过层次化的八维指标体系操作化辩论创造力，采用预训练自回归语言模型，并配备支持细粒度和粗粒度评估的层次化评分头。从真实辩论比赛中获取陈述及其相关专家评分，并采用约束数据增强策略以解决原始数据中的精英偏差。DEFINED采用混合粒度训练策略，能够从训练有素的研究生专家提供的有限细粒度监督中实现鲁棒学习。为严格验证超越合成基准的生态效度，我们纳入了一项针对辩论新手参与者的实证研究，利用这些真实数据作为中低水平人群的定性案例研究。在我们的评估协议中，评分模型实现了准确且稳定的评分，优于基于提示的大语言模型评估器和现有的辩论评分方法。

英文摘要

Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.

URL PDF HTML ☆

赞 0 踩 0

2606.10862 2026-06-16 cs.CV cs.AI 版本更新

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

LIBERO-Occ：通过视角想象评估和改进场景诱导遮挡下的视觉-语言-动作模型

Taishan Li, Jiwen Zhang, Siyuan Wang, Xuanjing Huang, Zhongyu Wei

发表机构 * Fudan University（复旦大学）； Shanghai Innovation Institute（上海创新研究院）； Chinese University of Hong Kong（香港中文大学）

AI总结针对VLA模型在场景遮挡下性能下降的问题，提出LIBERO-Occ基准和视角想象方法，通过生成互补视图提升鲁棒性。

Comments 14 pages, 7 figures

详情

AI中文摘要

视觉-语言-动作（VLA）模型在标准操作基准上取得了强劲的性能，但大多数评估假设任务相关物体完全可见。这一假设在现实场景中经常不成立，因为遮挡使得操作部分可观察。本文研究了场景诱导遮挡作为VLA模型的一个基本挑战，并引入了LIBERO-Occ，一个面向遮挡的LIBERO扩展。实验表明，最先进的VLA在遮挡下性能显著下降。为解决这一问题，我们提出了视角想象（VIM），该方法从遮挡的主观测中生成互补视图，并基于观察和想象证据共同进行动作预测。VIM在任务套件、遮挡类型和严重程度上提高了鲁棒性，且无需在部署时增加额外摄像头，表明视角想象是部分可观察操作中感知完成的一种有前景的机制。我们的基准和相应代码可在以下网址获取：this https URL。

英文摘要

Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study \textit{scene-induced occlusion} as a fundamental challenge for VLA models and introduce \textbf{LIBERO-Occ}, an occlusion-oriented extension of LIBERO. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion. To address this issue, we propose \textbf{Viewpoint Imagination (VIM)}, which generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time, suggesting that viewpoint imagination is an promising mechanism for perception completion in partially observable manipulation. Our benchmark and corresponding code are available at: \href{https://github.com/litsh/Libero-Occ}{https://github.com/litsh/Libero-Occ}.

URL PDF HTML ☆

赞 0 踩 0

2606.14238 2026-06-16 cs.RO cs.AI 版本更新

When and How Severely: Scenario-Specific Safety Envelopes for Driving VLAs

何时以及多严重：驾驶VLA的场景特定安全包络

Abhinaw Priyadershi, Jelena Frtunikj

发表机构 * NVIDIA Corporation（英伟达公司）； NVIDIA GmbH（英伟达德国有限公司）

AI总结针对ISO 21448下VLA驾驶规划器的安全认证，提出二维安全包络方法，通过GMM识别六种严重性等级，揭示场景特定风险差异。

详情

AI中文摘要

根据ISO 21448 (SOTIF)对视觉-语言-动作(VLA)驾驶规划器的安全认证依赖于运行设计域(ODD)规范，该规范回答两个互补的问题：规划器何时开始失效，以及一旦失效其严重程度如何？我们评估了Alpamayo R1（一个100亿参数的开源权重驾驶VLA）在15,968个（片段，攻击）对上的表现。我们发现一个保守的聚合差距：在15%平均位移误差(ADE)预算下，聚合安全阈值σ ≤ 50掩盖了能够容忍测试网格顶部（σ = 70）的良好采样场景。在变化解释子集上的高斯混合模型(GMM)识别出六个离散的严重性等级（BIC最优k=6），因此具有相同平均误差的两个扰动条件在高严重性(C4/C5)失效份额上可能有实质性差异。将两种分析结合在同一个语料库上，发现了一个单独分析无法得出的结论：噪声阈值最宽松的场景并非高严重性率最低的场景：STOP_SIGNAL的C4/C5份额大约是LANE_KEEPING的4倍，尽管它容忍更大的σ。因此，用于驾驶VLA的可部署SOTIF ODD规范需要二维安全包络，而不是每个危险的单一聚合值。

英文摘要

Safety certification of Vision-Language-Action (VLA) driving planners under ISO 21448 (SOTIF) rests on an Operational Design Domain (ODD) specification that answers two complementary questions: when does the planner start to fail, and how severely does it fail once it does? We evaluate Alpamayo R1, a 10B-parameter open-weight driving VLA, on 15,968 (clip, attack) pairs. We find a conservative-aggregate gap: an aggregate safe threshold of $σ\leq 50$ under a 15% average displacement error (ADE) budget masks well-sampled scenarios that tolerate the top of the tested grid ($σ= 70$). A Gaussian Mixture Model (GMM) on the changed-explanation subset identifies six discrete severity bands (BIC-optimal $k{=}6$), so two perturbation conditions with the same mean error can differ materially in their share of high-severity (C4/C5) failures. Joining the two analyses on the same corpus surfaces a finding neither yields in isolation: the scenarios with the loosest noise thresholds are not those with the lowest high-severity rate: STOP_SIGNAL concentrates roughly $4\times$ the C4/C5 share of LANE_KEEPING despite tolerating a larger $σ$. A deployable SOTIF ODD specification for driving VLAs therefore requires a two-dimensional safety envelope, not a single aggregate value per hazard.

URL PDF HTML ☆

赞 0 踩 0

2606.15038 2026-06-16 cs.AI 新提交

用于增强牛只识别与检测的先进机器学习和深度学习技术：一项全面综述

Fayazunnesa Chowdhury, Syed Md. Galib, Md Nasim Adnan, Md. Moradul Siddique, Md Robiul Karim, K M Tanvir Anjum

发表机构 * Jashore University of Science and Technology（贾沙雷科学与技术大学）； University of Information Technology & Sciences (UITS)（信息科技与科学大学）； Gazipur Agricultural University（加兹ipur农业大学）； Shanto Mariam University of Creative Technology（沙托·马里姆创意技术大学）

AI总结本文系统综述了利用机器学习和深度学习技术进行牛只识别的研究，比较了传统方法（如K近邻、支持向量机）与深度学习方法（如CNN、ResNet、YOLO）的效果，指出深度学习方法在识别和检测任务中更优，并讨论了数据集有限、数据质量问题和实时处理需求等挑战。

Comments Published in the journal of Annals of Emerging Technologies in Computing (AETiC), 34 pages, 5 Figures. The Article is available here: http://aetic.theiaer.org/archive/v10/v10n2/p1.html

详情

DOI: 10.33166/AETiC.2026.02.001
Journal ref: Annals of Emerging Technologies in Computing (AETiC),Vol. 10, No. 2, 2026

AI中文摘要

在畜牧管理中，维护生物安全、食品安全和供应链效率的需求使得有效的牛只识别技术比以往任何时候都更加迫切。本文对使用机器学习和深度学习技术的牛只识别研究进行了系统综述。本系统综述通过主要学术数据库的研究评估了传统和现代牛只识别技术的有效性，并对文章进行了全文审查。在这些技术中，经典机器学习技术如K近邻和支持向量机在牛只识别中表现出良好效果；然而，深度学习技术如卷积神经网络、残差网络和You Only Look Once在认知、检测和识别任务中表现更优。特征提取依赖于常见技术如局部二值模式（LBP）、加速稳健特征（SURF）和尺度不变特征变换（SIFT），而这些研究中常用的关键特征包括鼻纹和皮毛图案。综述强调了牛只识别中的主要障碍，例如公开可用的数据集数量有限、易受环境变化和动物移动影响的数据质量问题，以及对实时处理能力的高需求。本文旨在为研究人员、政策制定者和利益相关者提供关于实施可扩展、人道且有效的牛只识别系统以实现可持续畜牧管理的信息。

英文摘要

The need for effective cattle identification technology is now more acutely felt than ever in maintaining biosecurity, food safety, and supply chain efficacy in livestock management. This paper presents a systematic review of recent research in cattle identification using machine learning and deep learning techniques. The present systematic review measures the effectiveness of traditional and modern cattle identification techniques using studies from major academic databases, where articles were subjected to full-text review. Among these techniques, classical Machine Learning Techniques such as K-Nearest Neighbors and Support Vector Machines have demonstrated good results in cattle identification; however, Deep Learning Techniques, such as Convolutional Neural Networks, Residual Networks, and You Only Look Once, are better in cognition, detection, and identification tasks. Feature extraction relies on common techniques like Local Binary Pattern (LBP), Speeded-Up Robust Features (SURF), and Scale-Invariant Feature Transform (SIFT), while key features commonly used in these studies include muzzle prints and coat patterns. The review highlights key hurdles involving cattle identification, such as the limited number of publicly accessible datasets, issues with data quality susceptible to environmental changes and animal mobility, and high demand for real-time processing ability. The paper aims to inform researchers, policymakers, and stakeholders about implementing scalable, humane, and effective cattle identification systems to achieve sustainable livestock management.

URL PDF HTML ☆

赞 0 踩 0

2606.15709 2026-06-16 cs.AI cs.MA 新提交

AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan

基于AI的自适应水网管理框架及概念验证实施：解决约旦无收益水问题

Mohammed Fasha, Nahel Al-Maayta, Bilal Sowan, Mohammad Athamneh, Husam Barham

发表机构 * Jordan（约旦）

AI总结提出集成EPANET水力建模、数字孪生、SCADA和LLM智能体的框架，通过实时数据与物理模拟结合实现异常检测与自适应决策，概念验证在约旦1164节点管网中实现2分钟内自动生成健康报告，爆管检测定位准确。

详情

Journal ref: 2026 2nd International Conference on Computational Intelligence Approaches and Applications (ICCIAA)

AI中文摘要

约旦面临严重的水资源短缺，50%的生产水因泄漏、盗窃和计量问题（即无收益水，NRW）而损失。传统的被动方法已被证明不足以持续减少NRW。本文提出一个智能框架，集成EPANET水力建模、数字孪生技术、SCADA系统和基于大语言模型（LLM）的AI智能体，用于连续网络监控和自适应决策。该系统将实时数据流与基于物理的模拟相结合，以检测异常，采用检索增强生成（RAG）进行策略解释，并通过函数调用进行网络控制。概念验证实施使用EPYT和离线LLM（通过Ollama的llama3.1:8b）在安曼一个1164节点的区域管网中验证了技术可行性。该系统展示了自动化水力模拟、基于流量的异常检测（与配水区域（DZ）实践一致）、以及AI生成的健康报告，响应时间低于2分钟且零API成本。爆管检测依赖于局部流量异常分析：模拟的30.1 L/s泄漏在15根管道中产生可测量的流量重新分布，标记出一个15节点的簇，从而定位爆管——确认了与配水区域（DZ）监测实践的一致性。该框架通过分阶段实施适应约旦的间歇性供水模式和有限的自动化，为缺水地区利用智能自动化减少NRW和提高运营效率提供了可扩展的路径。

英文摘要

Jordan faces severe water scarcity with 50\% of water produced is lost to leakage, theft and metering issues also known as non-revenue water (NRW). Traditional reactive approaches have proven insufficient for sustained NRW reduction. This paper proposes an intelligent framework integrating EPANET hydraulic modeling, digital twin technology, SCADA systems, and large language model (LLM)-based AI agents for continuous network monitoring and adaptive decision-making. The system combines real-time data streams with physics-based simulation to detect anomalies, employing retrieval-augmented generation (RAG) for policy interpretation and function calling for network control. A proof-of-concept implementation validates technical feasibility using EPYT with offline LLMs (llama3.1:8b via Ollama) on a 1,164-junction Amman district network. The system demonstrates automated hydraulic simulation, flow-based anomaly detection aligned with water distribution zone (DZ) practice, and AI-generated health reports with response times under 2 minutes and zero API costs. Burst detection relies on local flow anomaly analysis: a 30.1~L/s simulated leak produces measurable flow redistribution in 15 pipes, flagging a 15-junction cluster that localises the burst -- confirming alignment with water distribution zone (DZ) monitoring practice. The framework accommodates Jordan's intermittent supply patterns and limited automation through phased implementation, offering a scalable pathway for water-scarce regions to leverage intelligent automation for NRW reduction and operational efficiency.

URL PDF HTML ☆

赞 0 踩 0

2606.15831 2026-06-16 cs.AI cs.LG cs.NE cs.SY eess.SY 新提交

An Integrated System for Real-Time Student Assessment and Career Guidance Using Neural Networks in Computing Disciplines

基于神经网络的计算学科实时学生评估与职业指导集成系统

Sakir Hossain Faruque, Md. Jubair Hossain, Sharun Akter Khushbu

发表机构 * Daffodil International University（达福尔国际大学）； Barishal Engineering College（巴里什尔工程学院）

AI总结针对计算机专业学生职业路径选择困难，提出集成职业指导专家系统与网络评估平台的AI驱动系统，采用多层感知器模型实现94.71%的职业路径预测准确率。

Comments 25 pages, 24 figures

详情

AI中文摘要

许多计算机科学（CS）和软件工程（SWE）专业的本科生在确定合适的职业道路时面临困难，尤其是当他们的学业表现、能力和兴趣不完全匹配时。为了解决这一问题，本研究提出了一种AI驱动的学生评估与职业预测系统，该系统集成了职业指导专家（CGE）系统和基于网络的学生评估（WBSA）平台。在集成框架内，CGE利用AI增强个性化职业推荐，同时帮助毕业生根据其技能和兴趣确定合适的工作、研究领域和深造机会。WBSA平台通过评估、个性化任务、导师活动和安全的实时聊天应用程序进一步加强了学生与教师之间的互动。CGE系统采用多层感知器（MLP）模型，该模型使用滚雪球抽样法从大学学生中收集的真实学术和课外数据进行训练，在预测个性化职业路径方面达到了94.71%的验证准确率。在部署前，跨大学进行了预调查以评估所提出的模型。WBSA系统作为现代Web应用程序开发，使用了Node.js、Next.js和PostgreSQL等技术，以确保可扩展性、响应性和安全的数据管理。整个系统由安全的云基础设施支持，该平台提供可靠的性能，同时帮助毕业生在IT领域选择合适的职业道路。此外，还进行了一项涉及学生和教师的后期调查，以收集反馈并进一步提高系统的整体有效性和可用性。

英文摘要

Many undergraduate students in Computer Science (CS) and Software Engineering (SWE) struggle to identify suitable career paths, particularly when their academic performance, abilities, and interests do not fully align. To address this issue, this study proposes an AI-driven Student Assessment and Career Prediction System that integrates a Career Guidance Expert (CGE) system with a Web-Based Student Assessment (WBSA) platform. Within the integrated framework, CGE enhances personalized career recommendations using AI while also assisting students after graduation in identifying suitable jobs, research domains, and higher study opportunities aligned with their skills and interests. The WBSA platform further strengthens interaction between students and faculty through assessments, personalized tasks, mentorship activities, and a secure real-time chat application. The CGE system employs a Multilayer Perceptron (MLP) model trained on real-world academic and extracurricular data collected using the snowball sampling method from the students of universities, achieving a validation accuracy of 94.71% in predicting personalized career paths. A pre-survey was conducted across universities to evaluate the proposed model before deployment. The WBSA system was developed as a modern web application using technologies such as Node.js, Next.js, and PostgreSQL to ensure scalability, responsiveness, and secure data management. The overall system is supported by a secure cloud-based infrastructure, the platform provides reliable performance while assisting graduates to select suitable career path in IT sector. In addition, a post-survey involving both students and faculty was conducted to gather feedback and further improve the overall effectiveness and usability of the system.

URL PDF HTML ☆

赞 0 踩 0

2606.16415 2026-06-16 cs.AI 新提交

Posterior Twins: Distributional Behavioral Simulation for Enterprise Decisions

后验孪生：面向企业决策的分布行为模拟

Ankit Das

发表机构 * Twinning Labs, Inc.（Twinning Labs公司）

AI总结提出后验孪生方法，通过记忆驱动的数字孪生将模拟行为表示为决策条件下的更新分布，在226例基准上评估模型，发现模态准确率与分布保真度揭示不同操作区域，TL-Twin Alpha实现最低Wasserstein-1距离（1.16）。

Comments 13 pages, 2 figures

详情

AI中文摘要

企业行为模拟不仅需要产生合理的响应。许多决策取决于在拟议行动下群体的形态：哪些细分群体接受、拒绝、犹豫或进入风险敏感状态。本文介绍了后验孪生（Posterior Twins），一种记忆驱动的数字孪生方法，将可能的行为表示为特定决策上下文下的更新分布。我们在一个包含226个保留示例的行为响应基准上评估了一系列Twinning Labs行为模型操作点，并报告了模态准确率和Wasserstein-1距离。结果表明，模态准确率和分布保真度识别出不同的操作区域。在报告的结果集中，TL-Twin Alpha实现了最低的观测Wasserstein-1距离（$W_1 = 1.16$），而TL-Twin Delta和TL-Twin Gamma在模态准确率前沿附近提供了平衡的操作点。本文将这些结果视为系统结果：受控记忆、行为模型路由、场景编排、分布聚合和可审计性对于将模拟行为转化为可重用的企业决策证据是必要的。

英文摘要

Enterprise behavioral simulation requires more than producing a plausible response. Many decisions depend on the shape of a population under a proposed action: which segments accept, defect, hesitate, or move into risk-sensitive states. This paper introduces Posterior Twins, a memory-grounded digital-twin approach that represents likely behavior as an updated distribution under a specific decision context. We evaluate a family of Twinning Labs behavioral-model operating points on a 226-example held-out behavioral-response benchmark and report both modal accuracy and Wasserstein-1 distance. The results show that modal accuracy and distributional fidelity identify different operating regimes. TL-Twin Alpha achieves the lowest observed Wasserstein-1 distance in the reported result set ($W_1 = 1.16$), while TL-Twin Delta and TL-Twin Gamma provide balanced operating points near the modal-accuracy frontier. The paper frames these results as a systems result: governed memory, behavioral model routing, scenario orchestration, distributional aggregation, and auditability are necessary for turning simulated behavior into reusable enterprise decision evidence.

URL PDF HTML ☆

赞 0 踩 0

2606.16624 2026-06-16 cs.AI 新提交

MR-GVNO: A Geometry-Aware Variational Physics-Informed Neural Operator for Mindlin-Reissner Plates on Irregular Domains

MR-GVNO：一种面向不规则域上Mindlin-Reissner板的几何感知变分物理信息神经算子

Siqi Wang, Daobo Sun, Yizheng Wang, Yilong Zhang, Yabin Jin, Xiaoying Zhuang, Timon Rabczuk

发表机构 * Institute of Computational Mechanics × AI & College of Intelligent Robotics and Advanced Manufacturing, Fudan University（计算力学与人工智能学院及智能机器人与先进制造学院，复旦大学）； School of Aerospace Engineering, Xiamen University（航空航天工程学院，厦门大学）； Department of Engineering Mechanics, Tsinghua University（工程力学系，清华大学）； Institute of Photonics, Department of Mathematics and Physics, Leibniz University（光子研究所，数学与物理系，莱比锡大学）； Institute of Structural Mechanics, Bauhaus-Universität Weimar（结构力学研究所，魏玛 Bauhaus-Universität）

AI总结提出MR-GVNO，一种几何感知变分神经算子，通过边界点云表示不规则几何，利用交叉注意力机制融合多物理场输入，基于离散总势能的变分物理信息损失无监督训练，实现对Mindlin-Reissner板问题的快速准确预测。

详情

AI中文摘要

板壳结构在工程中广泛应用，因此在不同几何、材料和载荷下进行快速响应预测非常理想。然而，传统的有限元方法需要重复建模和求解，导致计算成本高昂。本研究提出了一种用于Mindlin-Reissner板问题的几何感知变分神经算子，称为MR-GVNO。该方法使用边界点云表示不规则几何，并采用独立的编码器处理空间变化的材料场、压力载荷和标量物理参数。交叉注意力机制将这些输入与查询点信息集成，以预测任意位置的横向挠度和转角。MR-GVNO无需标记解数据，通过从离散总势能导出的变分物理信息损失进行训练。它直接处理不规则点云，并允许不同的物理场独立离散化，避免了插值到公共网格。在单孔、双孔和L形板上的数值实验表明，在均匀和非均匀材料以及均匀和随机载荷下，该方法能准确预测响应。该模型还实现了毫秒级的全场推理和良好的跨几何泛化能力。

英文摘要

Plate and shell structures are widely used in engineering, making rapid response prediction under varying geometries, materials, and loads highly desirable. However, conventional finite element methods require repeated modeling and solution, resulting in high computational costs. This study proposes a geometry-aware variational neural operator for Mindlin-Reissner plate problems, termed MR-GVNO. The method uses boundary point clouds to represent irregular geometries and employs separate encoders for spatially varying material fields, pressure loads, and scalar physical parameters. A cross-attention mechanism integrates these inputs with query point information to predict transverse deflections and rotations at arbitrary locations. MR-GVNO is trained without labeled solution data using a variational physics-informed loss derived from the discretized total potential energy. It directly processes irregular point clouds and allows different physical fields to be discretized independently, avoiding interpolation onto a common grid. Numerical experiments on single-hole, double-hole, and L-shaped plates demonstrate accurate response prediction under homogeneous and heterogeneous materials and uniform and random loads. The model also achieves millisecond-level full-field inference and favorable cross-geometry generalization.

URL PDF HTML ☆

赞 0 踩 0

2606.16649 2026-06-16 cs.AI 新提交

The Integrator Advantage: Controlled Agentic AI for Small and Medium-Sized Companies

集成优势：面向中小企业的受控代理型人工智能

Christopner Koch, Joshua A. Wellbrock

AI总结本文提出代理型AI对中小企业的近期价值在于受控部分自主性，而非完全自主或减员，并给出集成框架以提升生产力。

Comments 10 pages, 15 tables

详情

AI中文摘要

代理型AI标志着企业自动化的新阶段。与传统自动化或对话式AI不同，代理系统能够解释目标、规划多步骤任务、访问工具、与企业系统交互，并以不同程度的自主性执行工作流。对于中小企业而言，这创造了减少行政负担、加速常规流程以及改善组织知识利用的潜力。本文认为，代理型AI的近期价值不在于完全自主或减员，而在于对简单和中等复杂度业务流程的受控部分自主性。它提出了一个集成框架，涵盖用例适用性、自主性级别、技术集成、治理、安全、员工赋能和可衡量影响。本文得出结论，当作为以人为中心的能力实施，并由人保留责任和问责时，代理型AI可以成为生产力杠杆。

英文摘要

Agentic AI marks a new phase of enterprise automation. Unlike traditional automation or conversational AI, agentic systems can interpret goals, plan multi step tasks, access tools, interact with enterprise systems, and execute workflows with varying degrees of autonomy. For small and medium sized companies, this creates potential to reduce administrative burden, accelerate routine processes, and improve the use of organizational knowledge. This paper argues that the near term value of Agentic AI does not lie in full autonomy or workforce reduction, but in controlled partial autonomy for simple and medium complexity business processes. It proposes an integration framework covering use case suitability, autonomy levels, technical integration, governance, security, employee enablement, and measurable impact. The paper concludes that Agentic AI can become a productivity lever when implemented as a human centered capability with responsibility and accountability retained by people.

URL PDF HTML ☆

赞 0 踩 0

2606.16721 2026-06-16 cs.AI 新提交

绿色AI碳优化器：碳高效训练位置推荐与全球AI能源需求预测

Yuxin Chen, Hao Gao, Chujie Zou

AI总结提出Green AI Carbon Optimizer，包括基于电网碳强度、可再生能源占比和PUE的云区域推荐方法（最佳vs最差区域减排97.2%），以及基于幂律的全球AI能源需求预测模型（2030年需求7-1436 TWh）。

Comments Short workshop of 5 pages. 2 figures

详情

AI中文摘要

AI训练和部署消耗大量电力，但碳排放结果尚未充分融入常规模型开发决策。本文提出Green AI Carbon Optimizer，包含两个主要贡献：(i) 一种用于训练工作负载的碳感知云区域推荐方法，以及(ii) 一个用于全球AI能源需求的幂律预测流程。对于位置推荐，我们将区域电网碳强度、可再生能源占比和数据中心电能利用效率(PUE)结合成一个统一评分模型，覆盖来自主要云提供商的100多个区域。对于一个参考工作负载(8*A100, 100h)，我们采样区域的估计排放量从7.74kg到272.00kg CO2不等。选择最佳区域而非最差区域相对于最差情况减少了97.2%。消融实验表明，仅按可再生能源占比排序可能选择碳排放高于包含电网碳强度排序的区域。对于预测，我们使用26个锚点模型拟合参数数量与训练能量之间的幂律关系。我们将此拟合与模型增长、硬件效率和训练频率的情景假设相结合，并评估对推理比率和生态系统扩展的敏感性。在不同情景下，根据所述假设，预计2030年需求范围从7 TWh到1,436 TWh，凸显了部署选择、模型扩展纪律和透明能源报告的重要性。

英文摘要

AI training and deployment consume substantial electricity, but carbon outcomes remain weakly integrated into routine model development decisions. This paper presents Green AI Carbon Optimizer with two primary contributions: (i) a carbon aware cloud region recommendation method for training workloads, and (ii) a power law forecasting pipeline for global AI energy demand. For location recommendation, we combine regional grid carbon intensity, renewable share, and data center Power Usage Effectiveness (PUE) into a unified scoring model across 100+ regions from major cloud providers. For a reference workload (8*A100, 100h), estimated emissions in our sampled regions range from 7.74kg to 272.00kg CO2. Selecting the best region instead of the worst corresponds to a 97.2% reduction relative to the worst case. Ablation shows that ranking by renewable share alone can select regions with higher CO2 emissions than rankings that include grid carbon intensity. For forecasting, we fit a power law relation between parameter count and training energy using 26 anchor models. We combine this fit with scenario assumptions on model growth, hardware efficiency, and training frequency, and evaluate sensitivity to inference ratio and ecosystem scaling. Across scenarios, projected 2030 demand ranges from 7TWh to 1,436TWh under the stated assumptions, highlighting the importance of deployment choices, model scaling discipline, and transparent energy reporting.

URL PDF HTML ☆

赞 0 踩 0

2606.14724 2026-06-16 cs.CV cs.AI 交叉投稿

VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference

VigilFormer: 用于视频异常检测的可变形注意力与因果风险推理

Xinze Zhang

发表机构 * University of Southern California（南加州大学）

AI总结提出VigilFormer框架，结合可变形时空注意力与因果时序建模，通过稀疏注意力、对比多实例学习和自适应帧跳过，在保持高精度的同时实现实时异常检测。

详情

AI中文摘要

监控场景中的视频异常检测必须在检测准确性与实时吞吐量之间取得平衡，现有方法要么通过更强的特征提取器，要么通过更高效的架构来解决这一矛盾，但很少能兼顾两者。我们提出VigilFormer，一个统一框架，结合可变形时空注意力与因果时序建模，用于检测未修剪监控视频中的异常。所提出的可变形时空编码器（DSTE）关注跨帧的稀疏信息位置，避免了密集注意力的二次复杂度，同时保留了捕捉不规则运动模式的能力。因果异常分类器（CAC）对片段级特征应用扩张因果卷积，并优化对比多实例学习目标，无需帧级标签即可分离异常和正常表示。为满足部署约束，自适应置信度调度器（ACS）在推理时动态跳过低信息帧，减少静态场景中的冗余计算。在UCF-Crime、ShanghaiTech和CUHK Avenue上评估，VigilFormer在单GPU上以41.5 FPS分别达到87.83%、97.21%和89.74%的AUC分数，在准确性和速度上均优于最近的弱监督方法。

英文摘要

Video anomaly detection in surveillance settings must balance detection accuracy against real-time throughput, a tension that existing methods address either through stronger feature extractors or more efficient architectures, but rarely both. We present VigilFormer, a unified framework that combines deformable spatio-temporal attention with causal temporal modeling to detect anomalies in untrimmed surveillance video. The proposed Deformable Spatio-Temporal Encoder (DSTE) attends to a sparse set of informative locations across frames, avoiding the quadratic cost of dense attention while retaining the ability to capture irregular motion patterns. A Causal Anomaly Classifier (CAC) applies dilated causal convolutions over snippet-level features and optimizes a contrastive multiple-instance learning objective that separates anomalous and normal representations without frame-level labels. To meet deployment constraints, an Adaptive Confidence Scheduler (ACS) dynamically skips low-information frames at inference time, reducing redundant computation in static scenes. Evaluated on UCF-Crime, ShanghaiTech, and CUHK Avenue, VigilFormer achieves AUC scores of 87.83%, 97.21%, and 89.74% respectively, at 41.5 FPS on a single GPU, outperforming recent weakly-supervised methods in both accuracy and speed.

URL PDF HTML ☆

赞 0 踩 0

2606.14734 2026-06-16 q-bio.MN cs.AI cs.LG 交叉投稿

BRIDGE: Biological Evidence Refinement and Heterogeneous Dynamic Gating for Gene Regulatory Networks

BRIDGE：基因调控网络的生物学证据精炼与异质动态门控

Ziyang Dong, Shanwen Tan, Hengchuang Yin, Wei Liu, Yifan Wang, Siyu Yi, Jiancheng Lv, Wei Ju

发表机构 * College of Computer Science（计算机科学学院）； Sichuan University（四川大学）； Xinjiang Technical Institute of Physics and Chemistry（新疆物理化学研究所）； Chinese Academy of Sciences（中国科学院）； School of Mathematics（数学学院）； University of International Business and Economics（国际商务经济大学）； School of Artificial Intelligence and Data Science（人工智能与数据科学学院）

AI总结提出BRIDGE框架，通过共表达精炼视图和异质门控编码，从scRNA-seq数据中稳健推断基因调控网络，在多个基准数据集上取得最优性能。

Comments 19 pages, 10 figures, 7 tables

详情

AI中文摘要

动机：从单细胞RNA测序（scRNA-seq）数据推断基因调控网络（GRN）对于揭示细胞状态特异性转录程序至关重要。然而，scRNA-seq测量存在稀疏性和噪声，且实验验证的转录因子-靶基因相互作用仍然有限，使得可靠推断具有挑战性。尽管图神经网络已经推进了GRN预测，现有方法通常依赖生物学上无约束的图增强（如随机边扰动），并且对基因与细胞之间的信息传递控制不足。这些局限性可能扭曲调控结构，并在噪声和弱监督设置下削弱鲁棒性。结果：为解决这些问题，我们提出了一个创新框架，名为基因调控网络的生物学证据精炼与异质动态门控（BRIDGE）。BRIDGE从表达矩阵及其矩阵对偶中提取基因和细胞表示，并在基因空间和细胞空间中，在共表达精炼的调控视图与原始图之间，对自身和邻居进行对比学习。然后，它应用异质门控编码自适应地调节基因与细胞之间的信息传递，实现稳健的转录因子-靶基因预测。在涵盖三种网络类型和七种细胞类型的基准数据集上的实验表明，BRIDGE在大多数设置下达到了最先进的AUROC和AUPRC。特别是在特异性网络上，BRIDGE的平均AUPRC比第二好的基线GCLink提高了5%。在跨细胞类型的小样本迁移中，BRIDGE在所有六种目标细胞类型上始终优于GCLink和GENELink。在hESC上的案例研究进一步支持了预测的生物学相关性，其中前10个中的9个和前100个中的46个新型转录因子-靶基因相互作用得到了ChIPBase的验证。

英文摘要

Motivation: Gene regulatory network inference from single-cell RNA sequencing (scRNA-seq) data is important for uncovering cell-state-specific transcriptional programs. However, scRNA-seq measurements are sparse and noisy, and experimentally validated TF-target interactions remain limited, making reliable inference challenging. Although graph neural networks have advanced GRN prediction, existing methods often rely on biologically unconstrained graph augmentation, such as random edge perturbation, and insufficiently control information transfer between genes and cells. These limitations may distort regulatory structures and weaken robustness under noisy and weakly supervised settings. Results: To address these issues, we propose an innovative framework named Biological Evidence Refinement and Heterogeneous Dynamic Gating for Gene Regulatory Networks (BRIDGE). BRIDGE extracts gene and cell representations from the expression matrix and its matrix dual, and performs contrastive learning in the gene space and cell space between self and neighbors across the co-expression-refined regulatory view and the original graph. It then applies heterogeneous gated encoding to adaptively regulate information transfer between genes and cells, enabling robust transcription factor-to-target gene prediction. Experiments on benchmark datasets spanning three network types and seven cell types show that BRIDGE achieves state-of-the-art AUROC and AUPRC in most settings. In particular, on Specific networks, BRIDGE improves average AUPRC by 5% over the second-best baseline, GCLink. In cross-cell-type few-shot transfer, BRIDGE consistently outperforms GCLink and GENELink across all six target cell types. A case study on hESC further supports the biological relevance of the predictions, with 9 of the top 10 and 46 of the top 100 novel TF-target interactions validated by ChIPBase.

URL PDF HTML ☆

赞 0 踩 0

2606.14749 2026-06-16 cs.CV cs.AI 交叉投稿

Automated 3D Kinematic Monitoring for Circadian Activity and Anomaly Detection in Juvenile Fish

幼鱼昼夜活动与异常检测的自动化三维运动监测

Chih-Wei Huang, Chang-Wen Huang, Chung-Ping Chiang, Tsung-Wei Pan

发表机构 * AI Research Center, National Taiwan Ocean Univ.（台湾海洋大学人工智能研究中心）； Dept. of Aquaculture, National Taiwan Ocean Univ.（台湾海洋大学水产养殖系）； Center of Excellence for the Oceans, National Taiwan Ocean University（台湾海洋大学海洋卓越研究中心）

AI总结提出结合深度学习目标检测与双目立体视觉的高通量3D行为表型框架，实现高密度环境下幼鱼实时监测、体长估计和3D轨迹重建，首次量化自由游动幼鱼的真实物理速度，建立昼夜运动基线用于生理应激预警。

2606.14759 2026-06-16 cs.CV cs.AI 交叉投稿

Temporally Consistent and Controllable Video Generation of 2D Cine CMR via Latent Space Motion Modeling

基于潜在空间运动建模的二维电影心脏磁共振时序一致且可控视频生成

Yiheng Cao, Gustavo Andrade-Miranda, Jiatian Zhang, Guillaume Sallé, Xin Gao

发表机构 * Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences（苏州生物医学工程与技术研究所，中国科学院）； SyCoIA, IMT Mines Ales（SyCoIA，IMT Mines Ales）

AI总结提出一种文本到视频生成方法，通过解耦心脏空间结构与时间运动，利用微调扩散模型合成初始帧，再以心脏相位嵌入条件化潜在流模型生成完整运动，实现高时序一致性和解剖可控性。

详情

Journal ref: ISBI 2026 - IEEE International Symposium on Biomedical Imaging, Apr 2026, London, United Kingdom. pp.1-4

AI中文摘要

电影心脏磁共振是评估心脏功能的金标准，但公共数据集的稀缺限制了先进数据驱动模型的发展。为解决这一限制，我们提出一种生成方法，用于合成时间上连贯且解剖上一致的心脏序列。我们的文本到视频框架将心脏空间结构与时间运动解耦。首先，一个微调的扩散模型根据临床文本提示合成初始帧，控制解剖特征。然后，一个以心脏相位嵌入为条件的潜在流模型生成完整的心脏运动，确保空间一致性和时间控制。我们的模型生成解剖和病理多样化的序列，具有高时间连贯性和对输入提示的强保真度，图像真实感的FID为31.68，文本-图像对齐的CLIP得分为31.04。这些实验结果突显了其产生高保真、按需医疗数据的潜力，为数据稀缺提供了可扩展的解决方案。

英文摘要

Cine cardiac magnetic resonance is the gold standard for assessing cardiac function, but the scarcity of public datasets limits the development of advanced data-driven models. To address this limitation, we propose a generative method for synthesizing temporally coherent and anatomically consistent cardiac sequences. Our text-to-video framework decouples cardiac spatial structure from temporal motion. First, a fine-tuned diffusion model synthesizes an initial frame from a clinical text prompt, controlling anatomical features. Then, a latent flow model conditioned on a cardiac phase embedding generates the complete cardiac motion, ensuring spatial consistency and temporal control. Our model generates anatomically and pathologically diverse sequences with high temporal coherence and strong fidelity to input prompts, achieving a FID of 31.68 for image realism and a CLIP score of 31.04 for text-image alignment. These experimental results highlight its potential to produce high-fidelity, on-demand medical data, offering a scalable solution to data scarcity.

URL PDF HTML ☆

赞 0 踩 0

2606.14766 2026-06-16 cs.CV cs.AI cs.MA 交叉投稿

XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems

XMedFusion：面向自主医疗系统的知识引导多模态感知与推理框架

Hamza Riaz, Arham Haroon, Maha Baig, Muhammad Dawood Rizwan, Muhammad Naseer Bajwa, Muhammad Moazam Fraz

发表机构 * National University of Sciences and Technology (NUST)（巴基斯坦国立科技大学）； University of Oxford（牛津大学）

AI总结提出XMedFusion模块化AI框架，通过视觉感知、知识图谱构建和检索引导生成等智能体协同，增强放射学报告生成的视觉基础与临床发现捕捉能力，在公共数据集上显著优于基线模型。

Comments Accepted at the 2026 International Conference on Robotics and Automation in Industry (ICRAI)

详情

AI中文摘要

自主医疗和机器人系统日益依赖智能感知与推理能力来解释视觉数据并支持临床决策。放射学报告生成是此类自动化诊断工作流的关键组成部分，然而现有的端到端多模态模型常因视觉基础薄弱而导致不可靠的解释和细微临床发现的遗漏。本文提出XMedFusion，一个模块化AI框架，设计为自主医疗系统的智能感知与推理模块。该框架将视觉信息分解为协调的功能组件，模拟专家驱动的分析，包括提取图像基础证据的视觉感知智能体、构建临床相关发现结构的知识图谱构建智能体，以及确保报告结构一致的检索引导起草过程。合成智能体通过推理驱动的验证迭代整合视觉和结构化证据，生成可靠且可解释的诊断输出。在公共胸部X光片数据集上的实验评估表明，与基线视觉-语言模型相比，在BLEU-1上提升0.0493至0.3359，ROUGE-L上提升0.0863至0.2440，METEOR上提升0.0829至0.1708，同时在语义评估指标如一致性（2.38至7.80）和准确性（2.34至6.93）上也有显著提升。结果突出了结构化多智能体感知与推理在增强智能医学成像系统的鲁棒性、透明度和自动化方面的有效性，使其能够集成到自主医疗和机器人诊断工作流中。

英文摘要

Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning capabilities to interpret visual data and support clinical decision making. Radiology report generation represents a critical component of such automated diagnostic workflows, yet existing end-to-end multimodal models often suffer from weak visual grounding, resulting in unreliable interpretations and omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed as an intelligent perception and reasoning module for autonomous medical systems. The proposed framework decomposes visual information into coordinated functional components that emulate expert-driven analysis, including a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that ensures a consistent reporting structure. A synthesis agent iteratively integrates visual and structured evidence through reasoning-driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset demonstrates significant improvements over baseline vision-language models, achieving gains from 0.0493 to 0.3359 in BLEU-1, 0.0863 to 0.2440 in ROUGE-L, and 0.0829 to 0.1708 in METEOR, along with substantial improvements in semantic evaluation metrics such as Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93). The results highlight the effectiveness of structured multi-agent perception and reasoning for enhancing robustness, transparency, and automation in intelligent medical imaging systems, enabling integration into autonomous healthcare and robotic diagnostic workflows.

URL PDF HTML ☆

赞 0 踩 0

2606.14786 2026-06-16 cs.MM cs.AI cs.CV 交叉投稿

MatchLM2Lite: A Scalable MLLM-to-Lite Framework for Reproduced Content Identification

MatchLM2Lite: 一种可扩展的MLLM-to-Lite框架用于重复内容识别

Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Zirui Zhu, Kanchan Sarkar, Kun Xu

发表机构 * Tiktok（字节跳动）； National University of Singapore School of Computing（新加坡国立大学计算机学院）

AI总结提出MatchLM2Lite框架，通过将多模态大语言模型蒸馏为轻量模型，实现视频、音频和文本联合建模的实时重复内容识别，在降低35倍计算成本的同时保持高准确率，并成功部署于大规模生产环境。

详情

DOI: 10.1145/3770855.3818444

AI中文摘要

内容审核对于在线视频平台确保内容安全、保护创作者和维持积极的用户体验至关重要。除了过滤有害内容，平台必须大规模保证内容真实性，以便用户接触到多样化、原创的视频，而非低价值的重复内容。我们提出MatchLM2Lite，一个实时、生产级的重复内容识别（RCI）系统，它利用多模态大语言模型（MLLM）的强大理解能力，将其蒸馏为一个小型且推理速度快的模型。我们的系统联合建模视频、音频和文本信号，对视频对进行操作以生成细粒度的重复分数。该系统包含两个模块，MatchLM和MatchLite，以及一个两阶段训练方案。首先，我们高容量的MLLM，MatchLM，作为教师模型定义RCI性能的上限。然后，其能力被蒸馏到一个紧凑的学生模型MatchLite中。这种设计使MatchLite能够在视频对上实现低延迟、高吞吐量的推理，同时保留MatchLM的大部分准确性，使其适合集成到实时推荐系统中。MatchLM相比我们之前的生产模型F1分数提高了+8.57。经过知识蒸馏后，MatchLite保留了+6.55的F1分数提升，同时计算成本降低了35倍。大规模部署后，MatchLM2Lite实现了高效的成对多模态RCI，以高每秒查询数（QPS）稳定服务在线流量，端到端延迟低于30秒。该系统在不降低用户参与度的情况下，将我们平台上的重复视频观看率降低了2.5%，证明了其在大规模生产环境中的有效性。

英文摘要

Content moderation is critical for online video platforms to ensure content safety, protect creators, and sustain positive user experiences. Beyond filtering harmful content, platforms must guarantee content authenticity at scale so that users are exposed to diverse, original videos rather than low-value reproductions. We present MatchLM2Lite, a real-time, production-grade reproduced content identification (RCI) system that leverages the powerful understanding of a multimodal large language model (MLLM) distilled into a small and fast-inference model. Our system jointly models video, audio, and text signals, operating on pairs of videos to produce fine-grained reproduction scores. The system comprises two modules, MatchLM and MatchLite, and a two-stage training recipe. First, our high-capacity MLLM, MatchLM, serves as a teacher model to define the upper bound of RCI performance. Its capabilities are then distilled into a compact student model, MatchLite. This design allows MatchLite to deliver low-latency, high-throughput inference on video pairs while preserving much of MatchLM's accuracy, making it suitable for integration into real-time recommendation systems. MatchLM achieves an F1-score improvement of +8.57 compared to our previous production model. After knowledge distillation, MatchLite retains a +6.55 gain in F1-score while reducing computational cost by 35x. Deployed at scale, MatchLM2Lite enables efficient, pairwise multimodal RCI, stably serving online traffic at high queries per second (QPS) with an end-to-end latency below 30 seconds. This system has reduced the reproduced video view rate on our platform by 2.5% without degrading user engagement, demonstrating its effectiveness in a large-scale production environment.

URL PDF HTML ☆

赞 0 踩 0

2606.14788 2026-06-16 cs.SD cs.AI cs.LG eess.AS 交叉投稿

Unifying Acoustic Features and Text with Multimodal LLMs for Neurodegenerative Screening

统一声学特征与文本的多模态大语言模型用于神经退行性疾病筛查

Qingfeng Zhang, Yuanxiong Guo, Yanmin Gong

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出NeurMLLM框架，通过多模态大语言模型融合声谱图、MFCC和文本，实现阿尔茨海默病和帕金森病的精细分期，优于传统方法和现有LLM方法。

Comments IEEE International Conference on Healthcare Informatics, 2026

详情

AI中文摘要

基于语音的筛查为评估阿尔茨海默病（AD）和帕金森病（PD）等神经退行性疾病提供了一种可扩展且非侵入性的方式，但由于整合异质数据的困难，其分期仍然具有挑战性。本文提出了NeurMLLM，一种用于神经退行性疾病分期的高效多模态生成框架。NeurMLLM首先使用视觉变换器对音频数据的声谱图和梅尔频率倒谱系数进行编码，并将其表示投影到大语言模型（LLM）的嵌入空间中，在那里它们与转录文本和人口统计指令标记连接成一个统一的序列。然后，通过低秩适应使用任务提示对LLM进行指令微调，以自回归方式预测受限的标签标记，从而实现生成式分类。通过在Bridge2AI-Voice数据集上对AD和PD进行细粒度分期评估，我们观察到NeurMLLM取得了强劲的性能，持续优于经典机器学习方法和现有的基于LLM的方法。结果表明，多模态LLM在神经退行性疾病分期中具有巨大潜力，提高了分期准确性并支持可访问的部署。

英文摘要

Voice-based screening offers a scalable and non-invasive way to assess neurodegenerative diseases such as Alzheimer's disease (AD) and Parkinson's disease (PD), but their staging remains challenging due to the difficulty of integrating heterogeneous data. This paper presents NeurMLLM, an efficient multimodal generative framework for neurodegenerative disease staging. NeurMLLM first encodes the spectrograms and Mel-frequency cepstral coefficients of audio data with vision transformers and projects their representations into the embedding space of a large language model (LLM), where they are concatenated with transcript and demographic instruction tokens as a single unified sequence. The LLM is then instruction-tuned via Low-Rank Adaptation using task prompts to autoregressively predict a constrained label token, enabling a generative classification. By evaluating on the Bridge2AI-Voice dataset for fine-grained staging of AD and PD, we observe that NeurMLLM achieves strong performance, consistently outperforming classical machine learning methods and existing LLM-based approaches. The results show the high potential of multimodal LLMs in neurodegenerative disease staging, improving staging accuracy and supporting accessible deployment.

URL PDF HTML ☆

赞 0 踩 0

2606.14813 2026-06-16 hep-ph cs.AI cs.LG 交叉投稿

JetParticle-JEPA: An Efficient Self-Supervised Representation Learning method for Jet Tagging in High-Energy Physics

JetParticle-JEPA：一种用于高能物理喷注标记的高效自监督表示学习方法

Guillaume Letellier, Antonin Vacheret, Frédéric Jurie

发表机构 * GREYC, Normandy University, Unicaen, ENSICAEN, UMR CNRS 6072（GREYC，诺曼底大学，Unicaen，ENSICAEN，CNRS UMR 6072）； LPC, Normandy University, Unicaen, ENSICAEN, IN2P3, UMR CNRS 6534（LPC，诺曼底大学，Unicaen，ENSICAEN，IN2P3，CNRS UMR 6534）

AI总结提出JetParticle-JEPA，一种基于粒子Transformer的自监督联合嵌入预测架构，无需标记或重建原始输入，直接从连续粒子云学习物理有意义的喷注表示，在JetClass等基准上达到与全监督方法相当的性能，并在低标签场景下超越监督基线。

详情

AI中文摘要

大型强子对撞机上的喷注标记越来越依赖于在大量模拟数据集上训练的深度学习模型，导致计算成本高且对探测器建模误差的鲁棒性有限。我们引入了JetParticle-JEPA (JP-JEPA)，一种自监督联合嵌入预测架构，它直接从连续粒子云中学习物理有意义的喷注表示，无需对原始输入进行标记化或重建。基于粒子Transformer主干，JP-JEPA在保留细粒度运动学相关性的同时预测被掩码粒子的潜在表示。在JetClass基准上，JP-JEPA在完整数据集上实现了与全监督最先进方法相当的性能，在低标签场景下超越了监督基线，并显著优于现有的自监督学习方法。在顶夸克和夸克-胶子喷注标记基准上，它与监督方法保持同等水平。学习到的表示还对缺失探测器信息表现出强鲁棒性，并改善了不确定性行为，凸显了JP-JEPA作为LHC上鲁棒且数据高效的喷注物理基础模型框架的潜力。

英文摘要

Jet tagging at the Large Hadron Collider increasingly relies on deep learning models trained on massive simulated datasets, leading to high computational costs and limited robustness to detector mismodeling. We introduce JetParticle-JEPA (JP-JEPA), a self-supervised Joint-Embedding Predictive Architecture that learns physically meaningful jet representations directly from continuous particle clouds without tokenization or reconstruction of raw inputs. Built on a Particle Transformer backbone, JP-JEPA predicts latent representations of masked particles while preserving fine-grained kinematic correlations. On the JetClass benchmark, JP-JEPA achieves performance comparable to fully supervised state-of-the-art methods on the full dataset, surpasses supervised baselines in low-label regimes, and significantly outperforms existing SSL approaches. On Top Quark and Quark-Gluon Tagging benchmarks, it remains on par with supervised methods. The learned representations also exhibit strong robustness to missing detector information and improved uncertainty behavior, highlighting JP-JEPA as a promising foundation-model framework for robust and data-efficient jet physics at the LHC.

URL PDF HTML ☆

赞 0 踩 0

2606.14817 2026-06-16 cs.IR cs.AI 交叉投稿

Combining Retrieval-Augmented Text Generation with LLMs for Reading Content Recommendations

结合检索增强文本生成与大型语言模型的阅读内容推荐

Sooyeon Kim, Piotr S. Maciąg

发表机构 * Institute of Computer Science, Warsaw University of Technology（计算机科学学院，华沙技术大学）

AI总结提出结合检索增强生成（RAG）与大型语言模型的系统，通过四个模块实现个性化阅读内容生成，实验表明RAG将相关性和接地性提升26-35个百分点。

详情

AI中文摘要

本文介绍了使用大型语言模型（LLMs）结合检索增强生成（RAG）生成个性化阅读内容的系统的设计、实现和评估。所提出的架构由四个模块组成：输入、RAG、生成和评判，允许用户指定问题和目标阅读内容复杂度。RAG用于从互联网检索相关信息，丰富和支撑由三种现代LLM（Meta LLaMA 4 Scout、LLaMA 3.1 8B Instant和Google Gemma2 9B）生成的内容。使用三种提示策略（思维链、零样本和少样本）生成阅读材料，LLM-as-a-Judge模块自动评估答案质量及其与期望可读性水平的一致性。实验结果表明，RAG在所有模型和提示技术中一致地提高了系统性能，将相关性和特别是接地性提升了高达26-35个百分点。总体而言，研究结果表明，RAG增强架构有效地生成了符合用户查询和期望文本复杂度的阅读内容。

英文摘要

This work presents the design, implementation, and evaluation of a system for generating personalized reading content using Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG). The proposed architecture consists of four modules: Input, RAG, Generation, and Judging and enables users to specify both a question and a target reading content complexity. RAG is employed to retrieve relevant information from the Internet, enriching and grounding the content produced by three modern LLMs: Meta LLaMA 4 Scout, LLaMA 3.1 8B Instant, and Google Gemma2 9B. Reading materials are generated using three prompting strategies (Chain-of-Thought, zero-shot, and few-shot), and the LLM-as-a-Judge module automatically evaluates answer quality and alignment with the desired readability level. Experimental results show that RAG consistently improves system performance across all models and prompting techniques, increasing relevance and particularly groundedness by up to 26-35 percentage points. Overall, the findings demonstrate that the RAG-augmented architecture effectively produces reading content tailored to user queries and desired textual complexity.

URL PDF HTML ☆

赞 0 踩 0

2606.14821 2026-06-16 cs.IR cs.AI 交叉投稿

Co-Scraper: query-aware DOM Pruning and Reusable Scraper Synthesis for Lightweight Web Data Extraction

Co-Scraper: 查询感知的DOM剪枝与可复用爬虫合成用于轻量级网页数据提取

Shoupeng Wang, Jiantao Qiu, Wuyang Zhang, Conghui He

发表机构 * Shanghai Artificial Intelligence Laboratory, OpenDataLab（上海人工智能实验室，开放数据实验室）； University of Science and Technology of China（中国科学技术大学）

AI总结提出Co-Scraper两阶段框架，通过查询感知的DOM剪枝和稳定提取策略归纳，利用微调Qwen3-8B模型将网页内容转化为可执行程序化包装器，在SWDE测试集上达到94.78%的F1分数和90.39%的复用成功率。

2606.14823 2026-06-16 q-bio.GN cs.AI cs.CL 交叉投稿

Human genetic evidence is associated with drug approval across therapeutic areas: an observational analysis of 26,278 target-disease pairs with temporal validation and feature ablation

人类遗传证据与跨治疗领域药物批准相关：一项基于26,278个靶点-疾病对的观察性分析，含时间验证和特征消融

Victoria Paterson

发表机构 * School of Informatics, University of Edinburgh（爱丁堡大学信息学院）

AI总结本研究通过分析26,278个靶点-疾病对，发现具有遗传关联的靶点药物批准率是无遗传关联的3.25倍，但遗传证据单独预测价值有限，并识别出1,433个遗传支持的早期阶段靶点-疾病对作为假设生成资源。

详情

AI中文摘要

遗传证据在已批准药物靶点中富集：在一项对来自Open Targets和ChEMBL的26,278个靶点-疾病对的观察性分析中，具有任何遗传关联的靶点批准率是无遗传关联靶点的3.25倍（OR = 3.25, 95% CI 2.79-3.79, p = 1.91e-42）。一项考虑共享同一基因的靶点-疾病对非独立性的靶点水平分析给出的OR为2.79（bootstrap 95% CI 2.22-3.53）；肿瘤学对水平OR为6.72，在靶点水平衰减至2.71，说明非独立性会夸大特定领域的估计值。该富集在2015年后的批准中得以复现（OR = 3.51, p = 1.72e-8）。跨六种证据类型的特征消融显示，仅文献挖掘就占分类器性能的大部分（AUPRC = 0.099，而所有特征为0.109），这与批准后出版物导致的时间泄漏一致。排除文献后，其余证据类型仍保留高于基线的信号（AUPRC = 0.084，为基线的1.63倍）。敏感性分析将对水平OR的范围限定在3.25至4.93之间。仅遗传证据的AUPRC绝对增益仅为1.0个百分点，且最佳模型校准较差；该分类器的实际预测价值有限。我们编录了1,433个遗传支持的1/2期靶点-疾病对作为假设生成资源。所有发现均为观察性结果。

英文摘要

Genetic evidence is enriched among approved drug targets: in an observational analysis of 26,278 target-disease pairs from Open Targets and ChEMBL, targets with any genetic association had a 3.25-fold higher approval rate than those without (OR = 3.25, 95% CI 2.79-3.79, p = 1.91e-42). A target-level analysis accounting for non-independence of pairs sharing the same gene gave OR = 2.79 (bootstrap 95% CI 2.22-3.53); the oncology pair-level OR of 6.72 attenuates to 2.71 at the target level, illustrating how non-independence inflates area-specific estimates. The enrichment replicated in post-2015 approvals (OR = 3.51, p = 1.72e-8). Feature ablation across six evidence types revealed that literature mining alone accounts for most classifier performance (AUPRC = 0.099 versus 0.109 for all features), consistent with temporal leakage from post-approval publications. Excluding literature, remaining evidence types retain above-baseline signal (AUPRC = 0.084, 1.63x baseline). Sensitivity analyses bracket the pair-level OR between 3.25 and 4.93. Genetic evidence alone yields only a 1.0-percentage-point absolute AUPRC gain and the best model has poor calibration; the classifier has limited practical predictive value. We catalogue 1,433 genetically supported Phase 1/2 pairs as a hypothesis-generating resource. All findings are observational.

URL PDF HTML ☆

赞 0 踩 0

2606.14828 2026-06-16 eess.IV cs.AI cs.CV 交叉投稿

Leptomeningeal Collateral Detection on DSA via Vessel-Graph Neural Networks

基于血管图神经网络的DSA软脑膜侧支检测

Junyong Cao, Hakim Baazaoui, Chinmay Prabhakar, Suprosanna Shit, Lukas Bastian Otto, Susanne Wegener, Bjoern Menze, Ezequiel de la Rosa

发表机构 * University of Zurich（苏黎世大学）； University Hospital Zurich（苏黎世大学医院）

AI总结提出一种混合图-像素架构，在DSA血管图上对单个血管段分类，首次实现DSA中软脑膜侧支的个体化检测，PR-AUC达0.434，优于纯图或纯像素方法。

详情

AI中文摘要

软脑膜侧支（LMCs）是急性缺血性卒中的重要预后因素。现有自动化方法依赖CT血管造影（CTA），但单个LMCs通常太小而无法在CTA上分辨，限制了这些方法只能进行粗略的侧支评分。数字减影血管造影（DSA）以更高的分辨率可视化单个侧支，但当前评估仍依赖主观的手动分级量表，存在评分者间一致性差的问题。我们提出一个框架，将侧支检测形式化为对从DSA导出的图上的单个血管段进行分类。一种混合图-像素架构将拓扑感知的图分支与密集像素分支相结合，在共享的节点概率空间中融合。在五折交叉验证中，融合模型的PR-AUC达到0.434，优于纯图（0.403）和纯像素（0.362）基线。据我们所知，这是首个能够在DSA中实现LMCs个体化的方法，允许对每个血管进行精确的定量评估。这种整合将DSA评估转向客观评价，支持未来对单个LMCs的生物标志物和模式发现。

英文摘要

Leptomeningeal collaterals (LMCs) are an important prognostic factor in acute ischemic stroke. Existing automated methods rely on CT angiography (CTA), but individual LMCs are often too small to be resolved on CTA, limiting these methods to coarse collateral scoring. Digital subtraction angiography (DSA) visualizes individual collaterals at superior resolution, yet current assessment remains subjective, relying on manual grading scales that suffer from poor inter-rater agreement. We present a framework that formulates collateral detection as the classification of individual vessel segments on a graph derived from DSA. A hybrid graph-pixel architecture combines a topology-aware graph branch with a dense pixel branch, fused in a shared node-probability space. In a five-fold cross-validation setting, the fused model achieves a PR-AUC of 0.434, outperforming the graph-only (0.403) and pixel-only (0.362) baselines. To our knowledge, this is the first method to enable the individualization of LMCs in DSA, allowing for precise per-vessel quantitative assessment. This integration shifts DSA assessment toward objective evaluation, supporting future biomarker and pattern discovery for individual LMCs.

URL PDF HTML ☆

赞 0 踩 0

2606.14871 2026-06-16 cs.CV cs.AI 交叉投稿

An Ensemble Deep Learning Approach for Reliable and Scalable Lemon Leaf Disease Classification

一种可靠且可扩展的柠檬叶病害分类集成深度学习方法

Shayan Abrar, Sudeepta Mandal, Abdul Awal Yasir, Sonjoy Bhattacharjee, Sadman Haque Bhuiyan, Samanta Ghosh, Rafi Ahamed

发表机构 * Dept. of CSE（计算机科学与工程系）； American International University-Bangladesh（美国国际大学-孟加拉国）； East West University（东-西大学）； North South University（北南大学）

AI总结提出集成InceptionV3和MobileNetV2的深度学习方法，结合对抗训练和Grad-CAM可视化，在9类柠檬叶病害数据集上达到99.27%准确率，实现可靠分类。

Comments 5 pages, 12 figures, 3 Tables, Presented at 18th IEEE International Conference on Computational Intelligence and Communication Networks (CICN) 2026

2606.14886 2026-06-16 cs.CV cs.AI 交叉投稿

Improved Knowledge Distillation for Land-Use Image Classification

改进的知识蒸馏用于土地利用图像分类

Arundhuti Sur, Abhiroop Chatterjee, Susmita Ghosh, Emmett Ientilucci

发表机构 * Jadavpur University（贾达沃大学）； Rochester Institute of Technology（罗切斯特理工学院）

AI总结提出一种改进的知识蒸馏框架，通过VGG16教师网络向轻量MobileNetV2学生网络传递知识，结合硬监督和软监督策略，在三个数据集上达到99.04%准确率，优于基线方法。

Comments Accepted by IGARSS 2026

2606.14912 2026-06-16 cs.CV cs.AI 交叉投稿

PANDA：一种LLM增强的性能驱动模拟设计框架，弥合设计意图与版图生成

Haoyi Zhang, Weijian Fan, Xiaohan Gao, Bingyang Liu, Runsheng Wang, Yibo Lin

发表机构 * School of Integrated Circuits, Peking University（集成电路学院，北京大学）； Beijing Advanced Innovation Center for Integrated Circuits（北京集成电路先进创新中心）； Institute of Electronic Design Automation, Peking University（电子设计自动化研究所，北京大学）

AI总结提出PANDA框架，利用大语言模型将高层设计意图转化为最终版图，通过引导拓扑综合、子结构感知尺寸优化和约束驱动版图生成，实现跨阶段协同设计，将设计周期从数天/周缩短至数小时并提升性能。

2606.15117 2026-06-16 cs.MM cs.AI cs.CV cs.LG cs.SD 交叉投稿

Teacher-Student Structure for Domain Adaptation in Ensemble Audio-Visual Video Deepfake Detection

用于集成视听视频深度伪造检测中领域适应的师生结构

Elham Abolhasani, Maryam Ramezani, Hamid R. Rabiee

发表机构 * Department of Computer Engineering, Sharif University of Technology（谢里夫理工学院计算机工程系）

AI总结提出EAV-DFD方法，结合师生框架的领域适应机制，提升模型在未见领域上的泛化能力，在三个数据集上AUC分别提升4.09%、17.94%和0.5%。

详情

DOI: 10.1109/TAI.2025.3642217

AI中文摘要

生成式AI模型的快速发展导致了更逼真的深度伪造媒体，包括对音频、视频或两者的操纵。这引发了严重的隐私和社会问题。该领域的许多研究已经取得了有前景的域内结果；然而，这些模型在面对来自不同领域的数据时，其有效性常常下降。因此，最近的深度伪造检测方法侧重于通过多种技术增强泛化能力，这些技术融合了所有输入模态，包括音频、图像及其交互。为此，我们提出了EAV-DFD方法，一种广义的深度集成视听模型（EAV-DFD），结合了利用师生框架的领域适应机制，以增强模型在未见领域上的表现和泛化能力。为了评估模型性能，我们使用FakeAVCeleb数据集作为主领域，DFDC、Deepfake_TIMIT和PolyGlotFake数据集作为未见领域。我们的实验结果表明，所提出的框架在领域适应方面是有效的，仅使用一小部分未见数据集训练学生模型，就在三个未见数据集上分别将模型的AUC性能提升了4.09%、17.94%和0.5%。这产生了一种新颖的深度伪造检测模型，能够适应新领域并解释哪个模态被操纵，突显了我们的方法在现实世界应用中的潜力。

英文摘要

The rapid advancement of generative AI models is leading to more realistic deepfake media, encompassing the manipulation of audio, video, or both. This raises severe privacy and societal concerns. Numerous studies in this area have yielded promising intra-domain results; however, these models frequently exhibit decreased efficacy when faced with data from dissimilar domains. Consequently, recent deepfake detection approaches focus on enhancing the generalization ability through multiple techniques that incorporate all input modalities, including audio, images, and their interactions. In this regard, we propose the EAV-DFD method, a generalized deep ensemble audio-visual model (EAV-DFD) combined with a domain adaptation mechanism utilizing a teacher-student framework to enhance the model's ability to perform and generalize effectively across unseen domains. To evaluate the model's performance, we used the FakeAVCeleb dataset as the primary domain and the DFDC, Deepfake_TIMIT, and PolyGlotFake datasets as an unseen domain. Our experimental results demonstrate that the proposed framework is efficient in domain adaptation, improving AUC performance of the model by 4.09%, 17.94%, and 0.5% on three unseen datasets, using only a small portion of them to train the student model. This leads to a novel deepfake detection model capable of adapting to new domains and interpreting which modality has been manipulated, highlighting the potential of our approach for real-world applications.

URL PDF HTML ☆

赞 0 踩 0

2606.15129 2026-06-16 cs.CV cs.AI 交叉投稿

混合NARX-LLM用于格陵兰冰山排放：提示驱动的残差校正

Yiquan Gao, Duohui Xu

发表机构 * Heriot-Watt University（赫瑞瓦特大学）； StudioYG

AI总结提出混合NARX-LLM框架，结合非线性自回归模型与大型语言模型进行残差校正，并引入物理信息提示方法，用于建模格陵兰冰山排放的复杂非线性动态，提升预测准确性。

详情

AI中文摘要

格陵兰冰山排放表现出复杂的非线性动态，且可观测性有限，对传统预测模型构成挑战。我们提出一个混合NARX-LLM框架，该框架结合了具有外源输入的非线性自回归模型（NARX）和用于残差校正的大型语言模型（LLM）。我们进一步提出了一种物理信息提示（PIP）方法，将非结构化物理知识转化为结构化提示，用于零样本上下文推理。主要目标是探索该框架在建模格陵兰冰山排放方面的校正潜力，而不仅仅是优化预测精度。NARX组件捕获内在的时间依赖性，而由PIP引导的LLM编码冰川动力学和环境驱动因素，并感知关键趋势模式以校正系统预测误差。这种集成允许模型推理未建模因素并产生可解释的残差，从而提升整体预测精度。应用于格陵兰冰山排放时间序列，我们的方法处理了由于罕见变化和非平稳趋势而难以预测的极端事件，这是传统方法经常忽视的局限性。通过融合结构化时间序列建模与知识驱动的Foundation AI，该框架提供了一条可扩展且可解释的路径，将数据受限的气候预测与物理信息LLM推理相结合。代码已公开。

英文摘要

Greenland iceberg discharge exhibits complex nonlinear dynamics with limited observability, challenging traditional predictive models. We present a Hybrid NARX-LLM framework that combines a nonlinear autoregressive model with exogenous inputs (NARX) and a large language model (LLM) for residual correction. We further propose a Physics-Informed Prompt (PIP) method that transforms unstructured physical knowledge into structured prompts for zero-shot in-context reasoning. The primary objective is to explore the corrective potential of this framework for modeling Greenland iceberg discharge, rather than merely optimizing predictive accuracy. The NARX component captures intrinsic temporal dependencies, while the LLM, guided by PIP, encodes glacier dynamics and environmental drivers and perceives key trend patterns to correct systematic prediction errors. This integration allows the model to reason about unmodeled factors and produce interpretable residuals, enhancing overall predictive accuracy. Applied to Greenland iceberg discharge time series, our approach addresses extreme events that are difficult to predict due to rare variations and nonstationary trends, a limitation often overlooked by traditional methods. By fusing structured time-series modeling with knowledge-driven foundation AI, the framework offers a scalable and interpretable pathway to bridge data-limited climate forecasting with physics-informed LLM reasoning. The code is available.

URL PDF HTML ☆

赞 0 踩 0

2606.15349 2026-06-16 cs.CY cs.AI 交叉投稿

LearnOpt: Recovering the Latent Cognitive Structure of Standardized Examinations via Knowledge Graphs and Constrained Optimization

LearnOpt: 通过知识图谱和约束优化恢复标准化考试的潜在认知结构

Joy Bose, Om Thomas

发表机构 * Independent Researchers（独立研究者）

AI总结提出LearnOpt框架，利用知识图谱和约束优化从历史试题中恢复潜在认知结构，生成个性化学习计划；在NEET和JEE Advanced考试数据上验证了潜在结构的稳定性和可检测的转变。

Comments 26 pages, 2 figures, 6 tables. Code, data, and calibration tooling: https://github.com/joyboseroy/learnopt. Datasets on HuggingFace: joyboseroy/neet-skill-tags-2016-2024, joyboseroy/jee-advanced-skill-tags-2016-2023

详情

AI中文摘要

标准化考试通常被视为统一的课程覆盖问题。我们认为它们更适合理解为具有稳定潜在认知结构的对抗系统，这些结构系统地偏离官方课程。我们引入LearnOpt，它从历史试题中恢复这种结构，并生成个性化的、有时间限制的学习计划。应用于九年的NEET试题（2016-2024，n=1,496），LearnOpt从LLM标记的试题构建考试知识图谱，提取五类潜在技能分布，并将学习计划制定为基于贝叶斯知识追踪的先决条件感知子图上的背包变体优化。核心发现：NEET的潜在技能分布在课程体系内是稳定的（2016-2021年连续年份KL散度0.004-0.032，置换检验不显著），但在NCERT 2023年课程合理化后显著变化：合并2016-2021年（n=1,072）与2023-2024年（n=392）得到KL=0.040（p=0.0005），其中消除/否定类问题从约20-29%上升至约31-35%。潜在结构虽然并非永久平稳，但分段稳定，其转变可检测并归因于课程事件。在任一体系内，学科比年份更能预测技能分布。使用一个真实和两个合成掌握度分布进行的优化评估表明，技能加权目标在基于掌握度频率的基线之上产生了适度但真实的主题推荐重排序。将该流程应用于JEE Advanced，发现其分布以多概念整合为主（80.9%对比NEET的33.3%），JEE与NEET的散度（KL=0.505）超过了NEET最大的跨学科散度：考试层级比学科更能塑造潜在认知结构，而学科比时间（在同一体系内）更能塑造结构。代码、知识图谱和标注数据集已公开发布。

英文摘要

Standardized examinations are typically treated as uniform syllabus coverage problems. We argue they are better understood as adversarial systems with stable latent cognitive structures diverging systematically from official syllabi. We introduce LearnOpt, which recovers this structure from historical question papers and generates personalized, time-bounded study plans. Applied to nine years of NEET questions (2016-2024, n=1,496), LearnOpt builds an exam knowledge graph from LLM-tagged questions, extracts a five-category latent skill distribution, and formulates study planning as a knapsack-variant optimization over prerequisite-aware subgraphs with Bayesian Knowledge Tracing. Central finding: NEET's latent skill distribution is stable within a syllabus regime (consecutive-year KL divergence 0.004-0.032 for 2016-2021, non-significant under permutation testing) but shifts significantly with NCERT's 2023 syllabus rationalization: pooling 2016-2021 (n=1,072) vs 2023-2024 (n=392) gives KL=0.040 (p=0.0005), with Elimination/Negation questions rising from ~20-29% to ~31-35%. Latent structure, while not permanently stationary, is piecewise stable, with shifts detectable and attributable to curricular events. Within either regime, subject predicts skill profile more strongly than year. An optimization evaluation, using one real and two synthetic mastery profiles, shows the skill-weighted objective produces a modest but real reordering of recommended topics over a mastery-conditioned frequency baseline. Applying the pipeline to JEE Advanced reveals a profile dominated by Multi-concept Integration (80.9% vs. 33.3% for NEET), with a JEE-vs-NEET divergence (KL=0.505) exceeding NEET's largest cross-subject divergence: exam tier shapes latent cognitive structure more than subject, which shapes it more than time within a regime. Code, knowledge graph, and annotated dataset are released publicly.

URL PDF HTML ☆

赞 0 踩 0

2606.15500 2026-06-16 cs.AR cs.AI cs.SY eess.SY 交叉投稿

CIWI-CKT：混沌信息波干涉特征融合与跨城市知识迁移用于交通流预测

Abdul Joseph Fofanah, Lian Wen, David Chen, Shaoyang Zhang

发表机构 * Griffith University（格里菲斯大学）； School of Information and Communication Technology, Griffith University（格里菲斯大学信息与通信技术学院）； School of Information Engineering, Chang’an University（长安大学信息工程学院）

AI总结针对跨城市数据稀缺场景，提出CIWI-CKT框架，融合混沌信息波生成、元干涉处理和混沌感知元学习，显著提升预测精度并降低数据需求。

详情

AI中文摘要

在跨城市、数据稀缺的场景下，准确预测交通流仍然具有挑战性，因为有限的历史数据阻碍了模型的泛化能力。交通动态的混沌性质、复杂的时空依赖关系以及异质的城市网络使得跨城市的小样本学习变得复杂。现有的深度学习方法要么将交通视为完全确定性的，要么缺乏对跨体制交通动态至关重要的波状干涉模式进行建模的机制。为了解决这些局限性，本文提出了CIWI-CKT，一种新颖的混沌信息波干涉特征融合框架，结合跨城市知识迁移。我们的框架引入了三个核心创新：混沌信息波生成，提取可测量的混沌不变量并将交通建模为自适应波分量；元干涉处理，捕获支持域和查询域之间的波相互作用，同时生成可预测性分数用于置信度估计；以及混沌感知元学习，在保留混沌特性的同时实现高效的跨城市知识迁移。我们建立了理论保证，包括混沌到波的稳定性、波诱导的降维以及元学习泛化界限。在四个真实世界交通数据集上的大量实验表明，CIWI-CKT显著优于最先进的时空图学习、迁移学习、基于提示和小样本方法，在提高预测精度的同时大幅减少了所需的训练数据。

英文摘要

Accurate traffic flow prediction remains challenging in cross-city, data-scarce scenarios where limited historical data hinders model generalisation. The chaotic nature of traffic dynamics, complex spatio-temporal dependencies, and heterogeneous urban networks complicate few-shot learning across cities. Existing deep learning approaches either treat traffic as purely deterministic or lack mechanisms to model wave-like interference patterns essential for cross-regime traffic dynamics. To address these limitations, this paper proposes CIWI-CKT, a novel Chaos-Informed Wave Interference Feature Fusion framework with Cross-City Knowledge Transfer. Our framework introduces three core innovations: chaos-informed wave generation that extracts measurable chaos invariants and models traffic as adaptive wave components; meta-interference processing that captures wave interactions between support and query regimes while producing a predictability score for confidence estimation; and chaos-aware meta-learning that enables efficient cross-city knowledge transfer while preserving chaotic characteristics. We establish theoretical guarantees including chaos-to-wave stability, wave-induced dimension reduction, and meta-learning generalisation bounds. Extensive experiments on four real-world traffic datasets demonstrate that CIWI-CKT significantly outperforms state-of-the-art spatio-temporal graph learning, transfer learning, prompt-based, and few-shot methods, improving prediction accuracy while substantially reducing required training data.

URL PDF HTML ☆

赞 0 踩 0

2606.15693 2026-06-16 cs.SE cs.AI 交叉投稿

Imperfect Visual Verification for Code Edition : A Case Study on TikZ

代码编辑的不完美视觉验证：以TikZ为例的案例研究

Charly Reux, Mathieu Acher, Djamel Eddine Khelladi, Clément Quinton, Olivier Barais

发表机构 * Univ Rennes, Inria, IRISA, INSA（里昂大学、Inria、IRISA、INSA）； Univ Rennes, Inria, CNRS, IUF, IRISA（里昂大学、Inria、CNRS、IUF、IRISA）； Univ Rennes, Inria, CNRS, IRISA（里昂大学、Inria、CNRS、IRISA）； Univ. Lille, CNRS, Inria（里尔大学、CNRS、Inria）； Univ. Rennes, IRISA, Inria（里昂大学、IRISA、Inria）

AI总结针对TikZ等视觉代码定制任务，研究不完美验证器在迭代精炼中的有效性，发现即使不完美验证器也能适度准确判断指令是否应用，反馈对弱模型提升显著。

详情

AI中文摘要

LLMs显著推进了代码生成，使得功能程序的合成成为可能。尽管最近的系统在许多编码基准测试中表现强劲，但涉及生成视觉产物（如TikZ）的程序任务仍然具有挑战性，尤其是在视觉代码定制方面。与从头生成不同，定制需要局部、保持语义的编辑：模型必须定位相关代码，根据指令修改它，并保留其余结构和渲染。基于事后迭代精炼/纠正的方法（其中验证器提供反馈以指导纠正）已显示出前景。然而，对于具有视觉结果（如TikZ）的程序，其正确性难以或不可能自动形式化和评估，因此不存在确定性验证器。因此，开发者只能依赖不完美的验证器。在本文中，我们进行了一项实证研究以回答：当验证器本身不可靠时，迭代精炼在多大程度上仍然有效？我们使用TikZ作为聚焦的案例研究，在受控且具有挑战性的环境中隔离问题的核心难点（弱代码结构、细粒度视觉语义和困难的特征定位）。我们将视觉代码定制定义为一个具有不完美预言机的迭代编辑问题，并引入一个分析此类迭代精炼的框架。我们进行了大规模研究，在迭代精炼流程中评估了多个基于LLM和工具增强的视觉验证器，并对精炼轨迹进行了广泛的手动标注以评估验证器行为和反馈质量。我们的发现表明，即使是不完美的验证器也能以中等准确度确定视觉指令是否应用于代码，F1分数高达0.815。反馈改善了迭代精炼，特别是对于较弱的模型，为Qwen3-vl-30b-a3b-Instruct增加了11-20个完美定制，而更强的模型（如Gemini-3）获得的改进较少（+5），但受益于更准确的验证，防止了过早接受。反馈仅在精确识别图像问题、提供可操作指导、解决所有相关问题并保持基于原始指令时有效。

英文摘要

LLMs have significantly advanced code generation, enabling the synthesis of functional programs. While recent systems achieve strong performance on many coding benchmarks, tasks involving programs such as TikZ that generate visual artifacts remain challenging, in particular on visual code customization. Unlike generation from scratch, customization requires localized, semantics-preserving edits: the model must locate relevant code, modify it according to the instruction, and preserve the remaining structure and rendering. Approaches based on post-hoc iterative refinement/correction where a verifier provides feedback to guide corrections, have shown promise. However, in the case of programs with a visual outcome such as in TikZ, where correctness is harder or likely impossible to formalize and evaluate automatically, deterministic verifiers do not exist. Hence, developers can only rely on imperfect verifiers. In this paper, we conduct an empirical study to answer:to what extent can iterative refinement remain effective when the verifier itself is unreliable?} We use TikZ as a focused case study that isolates the core difficulties of the problem (weak code structure, fine-grained visual semantics, and difficult feature localization) in a controlled and challenging setting. We define visual code customization as an iterative editing problem with an imperfect oracle, and introduce a framework for analyzing such iterative refinements. We conduct a large-scale study and evaluate multiple LLM-based and tool-augmented visual verifiers within iterative refinement pipelines, and perform extensive manual annotation of refinement trajectories to assess verifier behavior and feedback quality. Our findings show that even imperfect verifiers can determine with moderate accuracy whether visual instructions are applied to code, achieving F1-scores up to 0.815. Feedback improves iterative refinement, especially for weaker models, adding 11--20 perfect customizations for Qwen3-vl-30b-a3b-Instruct, while stronger models like Gemini-3 gain fewer improvements (+5) but benefit more from accurate verification that prevents premature acceptance. Feedback is effective only when it precisely identifies image issues, provides actionable guidance, addresses all relevant problems, and remains grounded in the original instruction.

URL PDF HTML ☆

赞 0 踩 0

2606.15786 2026-06-16 cs.CV cs.AI physics.geo-ph 交叉投稿

Domain-Guided Prompting of the Segment Anything Model for Seismic Interpretation: The Role of Attributes, Visualization, and Hybrid Prompts

领域引导的Segment Anything模型提示用于地震解释：属性、可视化和混合提示的作用

Aniq Ahmad, Heather Bedle, Ahmad Mustafa

发表机构 * School of Geosciences, University of Oklahoma（俄克拉荷马大学地球科学学院）； King Fahd University of Petroleum and Minerals（法赫德国王石油矿产大学）

AI总结提出零样本适应框架，通过地质目标感知的地震属性与颜色映射选择，结合混合提示策略，提升SAM在地震解释中的分割精度，避免微调。

详情

AI中文摘要

计算机视觉大型预训练基础模型的出现显著提高了视觉数据解释的效率。特别是Segment Anything Model (SAM)通过基于提示的交互提供了强大的零样本分割能力，因此成为地震解释的有前景工具。然而，大多数现有的SAM应用依赖于针对特定地质目标的微调，这需要大量标注数据、计算成本高，且常常损害模型的泛化能力。在本研究中，我们引入了一个原则性框架，用于将基础模型零样本适应到地震数据。该框架基于两个关键组件：(1) 将地震属性和可视化选择（如颜色映射）与感兴趣的地质目标对齐；(2) 采用混合提示策略，结合稀疏的用户定义点提示和从SAM内部特征激活中导出的密集掩码提示。我们系统地在多个地质目标、数据集、提示配置和地震属性表示上评估了该框架。我们的结果表明，地质目标感知的地震属性和颜色映射选择，结合混合提示，相对于仅基于点提示，增强了地质特征的可分离性，并改善了边界描绘和分割精度。我们的发现表明，当这些组件联合应用时，SAM可以在完全零样本设置下实现有竞争力的分割性能，从而消除了为每个地质特征重新训练SAM的需要。这项工作建立了一条实用且可扩展的途径，以在地震解释中利用基础模型，减少对标注数据的依赖，同时保持模型的通用性。

英文摘要

The advent of large pretrained foundation models for computer vision has significantly improved the efficiency of visual data interpretation. The Segment Anything Model (SAM), in particular, offers powerful zero shot segmentation capabilities through prompt based interaction, thus making it a promising tool for seismic interpretation. However, most existing applications of SAM rely on fine tuning for specific geological targets, which requires extensive labeled data, incurs high computational cost, and often compromises the model's generalization capability. In this study, we introduce a principled framework for zero shot adaptation of foundation models to seismic data. The framework is built on two key components: (1) aligning seismic attributes and visualization choices (e.g., colormaps) with the geological target of interest, and (2) employing a hybrid prompting strategy that combines sparse user defined point prompts with dense mask prompts derived from SAM's internal feature activations. We systematically evaluate this framework across multiple geological targets, datasets, prompt configurations, and seismic attribute representations. Our results demonstrate that geologic target aware selection of seismic attributes and colormaps, combined with hybrid prompting, enhances the separability of geological features and improves boundary delineation and segmentation accuracy relative to point based prompting alone. Our findings show that, when these components are jointly applied, SAM can achieve competitive segmentation performance in a fully zero shot setting, thereby eliminating the need to retrain SAM for each geologic feature. This work establishes a practical and scalable pathway to leverage foundation models in seismic interpretation, reducing reliance on labeled data while preserving model generality.

URL PDF HTML ☆

赞 0 踩 0

2606.15807 2026-06-16 cs.LG cs.AI 交叉投稿

Continuous Cross-Domain Traffic State Prediction via Memory-Augmented Graph Liquid Time-Constant Networks

基于记忆增强图液态时间常数网络的连续跨域交通状态预测

Jinrong Xiang, Ming Xu

发表机构 * Software College, Liaoning Technical University（辽宁工程技术大学软件学院）

AI总结提出记忆增强图液态时间常数网络（MA-GLTC），通过时空单元分解、图液态时间常数动态和记忆迁移存储机制，实现连续时间下的跨域交通状态预测，在五个数据集上优于现有方法。

详情

AI中文摘要

交通状态预测是智能交通系统中的一项基本任务。在实际应用中，一些区域由于感知基础设施不足而面临有限的交通观测，使得跨域知识迁移成为数据稀缺交通预测的重要解决方案。然而，现有的跨域交通预测方法仍面临若干局限，包括粗粒度的源-目标域适应、处理未见目标域模式的能力有限，以及在非规则或异质时间条件下对连续交通动态建模不足。为解决这些问题，本文提出了一种连续跨域交通预测框架，称为记忆增强图液态时间常数网络（MA-GLTC）。具体地，我们首先构建时空单元（STU）将交通网络分解为可迁移的局部单元，实现跨域的细粒度知识对齐。然后，开发了图液态时间常数网络（GLTC）来建模连续时间下图耦合的交通演化。与通用的基于图神经ODE的模型不同，GLTC将图耦合的循环电导引入液态时间常数动态，允许节点状态随泄漏、自适应时间常数和邻域感知反馈而演化。此外，设计了基于记忆的迁移存储（MTS）机制，以保留源域知识、检索匹配的交通模式，并在出现未见状态时更新可靠的目标域模式。在五个公开交通数据集上的实验表明，MA-GLTC在短期和长期预测任务中均持续优于代表性的域内和跨域基线。与次优方法相比，MA-GLTC分别将平均预测误差降低了3.02%、0.33%、8.92%、10.09%和2.11%。

英文摘要

Traffic state prediction is a fundamental task in intelligent transportation systems. In practical applications, some regions suffer from limited traffic observations due to insufficient sensing infrastructure, making cross-domain knowledge transfer an important solution for data-scarce traffic prediction. However, existing cross-domain traffic prediction methods still face several limitations, including coarse-grained source-target adaptation, limited capability in handling unseen target-domain patterns, and insufficient modeling of continuous traffic dynamics under irregular or heterogeneous temporal conditions. To address these issues, this paper proposes a continuous cross-domain traffic prediction framework, termed Memory-Augmented Graph Liquid Time-Constant Network (MA-GLTC). Specifically, we first construct spatio-temporal units (STUs) to decompose traffic networks into transferable local units, enabling fine-grained knowledge alignment across domains. Then, a graph liquid time-constant network (GLTC) is developed to model graph-coupled traffic evolution in continuous time. Different from generic graph neural ODE-based models, GLTC introduces graph-coupled recurrent conductance into liquid time-constant dynamics, allowing node states to evolve with leakage, adaptive time constants, and neighborhood-aware feedback. Furthermore, a Memory-based Transfer Storage (MTS) mechanism is designed to preserve source-domain knowledge, retrieve matched traffic patterns, and update reliable target-domain patterns when unseen states emerge. Experiments on five public traffic datasets demonstrate that MA-GLTC consistently outperforms representative innerdomain and cross-domain baselines in both short-term and longterm prediction tasks. Compared with the second-best method, MA-GLTC reduces the average prediction errors by 3.02%, 0.33%, 8.92%, 10.09%, and 2.11%, respectively.

URL PDF HTML ☆

赞 0 踩 0

2606.15930 2026-06-16 cs.RO cs.AI 交叉投稿

ControlMap: Controllable High-Definition Map Generation for Traffic Scenario Simulation

ControlMap: 用于交通场景仿真的可控高清地图生成

Marwan Farag, Steffen Wäldele, Yu Yao

发表机构 * University of Stuttgart（斯图加特大学）； Robert Bosch GmbH（博世公司）； Motional, Inc（Motional公司）

AI总结提出基于潜在扩散和ControlNet的数据驱动管道，实现可控高清地图生成，支持空间引导、条件强度调整和城市风格迁移，并引入新指标评估控制信号遵循度和地图真实性。

详情

AI中文摘要

仿真是验证自动驾驶系统的核心，但当前流程因高精（HD）地图创建成本高昂而受限于场景多样性不足。扩展HD地图需要昂贵的数据收集和人工处理。此外，现有生成模型缺乏在生成过程中针对特定道路拓扑进行细粒度控制的能力。本文提出一种数据驱动的可控HD地图生成管道，使用潜在扩散和ControlNet进行空间条件控制。据我们所知，我们是首个将空间引导信号注入扩散模型用于HD地图合成的工作。此外，我们的模型支持通过无分类器引导调整条件强度，并通过城市标签条件实现城市级风格迁移。为补充现有指标，我们引入两个新指标来评估对控制信号的遵循程度以及与真实地图的相似性。实验表明，我们的模型生成的HD地图真实且忠实遵循输入道路拓扑，同时准确保留城市特定细节。

英文摘要

Simulation is central to validating autonomous driving systems, yet current pipelines are limited by insufficient scenario diversity due to costly High Definition (HD) map creation. Scaling HD maps requires expensive data collection and manual processing. Moreover, existing generative models lack the fine-grained control necessary to target specific road topologies during generation. This paper presents a data-driven pipeline for controllable HD map generation using latent diffusion and ControlNet for spatial conditioning. To our knowledge, we are the first to inject spatial guidance signals into a diffusion model for HD map synthesis. Furthermore, our model supports adjustable conditioning strength through classifier-free guidance and city-level style transfer via city label conditioning. To complement existing metrics, we introduce two novel metrics to evaluate adherence to the control signal and similarity to ground-truth maps. Experiments demonstrate that our model generates realistic HD maps that faithfully follow input road topologies while accurately preserving city-specific details.

URL PDF HTML ☆

赞 0 踩 0

2606.15943 2026-06-16 cs.SE cs.AI 交叉投稿

Graphical-Probabilistic Modeling of Generative Flows in LLM-Native Software Systems

LLM原生软件系统中生成流的图形概率建模

Víctor A. Braberman, Flavia Bonomo-Braberman

发表机构 * Departamento de Computación, FCEN, Universidad de Buenos Aires / ICC, UBA-CONICET（布宜诺斯艾利斯大学计算机系 / UBA-CONICET）

AI总结针对LLM原生软件开发缺乏设计级推理的问题，提出基于图形概率模型的生成网络框架，用于文档化生成流并描述系统属性。

Comments Published at 2026 IEEE/ACM 5th International Conference on AI Engineering - Software Engineering for AI (CAIN '26), April 12-13, 2026, Rio de Janeiro, Brazil

详情

DOI: 10.1145/3793653.3793780

AI中文摘要

工程化LLM原生软件仍然是一个具有挑战性且不成熟的领域。当前的实践主要是探索性的，依赖于实验和启发式技术，如提示和上下文工程。然而，这些方法层次较低，缺乏支持设计级推理或分析所需的原则性结构。相比之下，传统软件工程利用模块化和抽象来沟通和分析系统行为。为了给LLM原生开发带来类似的严谨性，我们提出了文档化生成流和陈述基于LLM的软件设计属性的方法。这些方法必须考虑大语言模型的随机性、提示依赖性行为，同时保持足够的表达能力以捕捉涌现现象。我们的初步方法基于图形概率模型，专门用于捕捉LLM原生系统特有的现象。这个框架——我们称之为生成网络——旨在为LLM中心软件架构中关于生成交互和系统级属性的原则性推理提供基础。

英文摘要

Engineering LLM-native software remains a challenging and immature field. Current practice is largely exploratory, relying on experimentation and heuristic techniques such as prompting and context engineering. These, however, are low-level and lack the principled structure needed to support design-level reasoning or analysis. In contrast, traditional software engineering leverages modularity and abstraction to communicate and analyze system behavior. To bring similar rigor to LLM-native development, we propose methods for documenting generative flows and for stating properties of LLM-based software designs. Such methods must account for the stochastic, prompt-dependent behavior of large language models while remaining expressive enough to capture emergent phenomena. Our initial approach is based on graphical probabilistic models, tailored to capture phenomena characteristic of LLM-native systems. This framework -- what we term Generation Networks -- aims to provide a foundation for principled reasoning about generative interactions and system-level properties in LLM-centric software architectures.

URL PDF HTML ☆

赞 0 踩 0

2606.15959 2026-06-16 cs.DC cs.AI cs.LG 交叉投稿

LUCID：基于确定性流匹配的学习型欠采样自适应一致性引导稀疏视角CT重建

Jigang Duan, Jiayi Wang, Heran Wang, Ping Yang, Genwei Ma, Xing Zhao

发表机构 * School of Mathematical Sciences, Capital Normal University（首都师范大学数学科学学院）； National Center for Applied Mathematics Beijing, Capital Normal University（首都师范大学北京国家应用数学中心）； Academy for Multidisciplinary Studies, Capital Normal University（首都师范大学交叉科学研究院）

AI总结提出LUCID框架，利用流匹配生成先验和稀疏度自适应策略，通过退化匹配初始状态和投影域一致性校正，实现不同采样密度下的稳定稀疏视角CT重建，减少伪影和幻觉结构。

详情

AI中文摘要

稀疏视角CT通过获取更少的投影视图来减少辐射剂量和扫描时间，但角度欠采样使得重建严重病态，导致条纹伪影、结构模糊和细节丢失。现有的监督方法通常受限于特定的采样设置，而生成方法在严重欠采样下可能引入解剖上不一致的幻觉样结构。我们提出Lucid，一种基于流匹配生成先验的稀疏自适应、一致性引导重建框架，用于稀疏视角CT。Lucid仅在高品质CT图像上训练，学习高斯分布与高品质CT图像分布之间的连续传输，与视角采样无关。在推理过程中，显式纳入采样稀疏度水平，以调整单个预训练模型的生成轨迹。具体地，Lucid通过稀疏度加权融合稀疏视角FBP图像和高斯噪声构建退化匹配的初始状态，执行稀疏度调制的流匹配更新，并在每次先验更新后应用投影域数据一致性校正。在多种稀疏视角设置下的实验表明，Lucid在不同采样密度下实现稳定的重建性能，提高图像质量和结构保真度，并降低生成式稀疏视角CT重建中幻觉样结构的风险。

英文摘要

Sparse-view CT reduces radiation dose and scanning time by acquiring fewer projection views, but angular undersampling makes reconstruction severely ill-posed, causing streak artifacts, structural blurring, and loss of fine details. Existing supervised methods are often tied to specific sampling settings, whereas generative methods may introduce anatomically inconsistent hallucination-like structures under severe undersampling. We propose Lucid, a sparsity-adaptive, consistency-guided reconstruction framework based on a Flow Matching generative prior for sparse-view CT. Lucid is trained only on high-quality CT images to learn a continuous transport between a Gaussian distribution and the high-quality CT image distribution, independent of view sampling. During inference, the sampling sparsity level is explicitly incorporated to adapt the generative trajectory of a single pretrained model. Specifically, Lucid constructs a degradation-matched initial state by sparsity-weighted fusion of the sparse-view FBP image and Gaussian noise, performs sparsity-modulated Flow Matching updates, and applies projection-domain data-consistency correction after each prior update. Experiments under multiple sparse-view settings show that Lucid achieves stable reconstruction performance across different sampling densities, improves image quality and structural fidelity, and reduces the risk of hallucination-like structures in generative sparse-view CT reconstruction.

URL PDF HTML ☆

赞 0 踩 0

2606.16231 2026-06-16 cs.LG cs.AI 交叉投稿

From Tokens to Regions: CUDA-Sensitive Instruction Tuning for GPU Kernel Generation

从令牌到区域：面向GPU内核生成的CUDA敏感指令微调

Wentao Chen, Jiace Zhu, Xing Zhe Chai, Zeng Qu, Qiaoling Xiao, Liucheng Duan, An Zou

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Biren Technology（壁仞科技）

AI总结提出CuSeT方法，通过自适应令牌级掩码和区域感知样本重加权，在简单SFT框架内提升LLM生成CUDA内核的功能正确性。

详情

AI中文摘要

高性能CUDA内核对于可扩展的AI系统至关重要，而大型语言模型（LLM）由于严格且隐式的执行约束，仍然难以生成正确的内核。现有的基于LLM的方法要么依赖昂贵的智能体或强化学习（RL）流水线，要么采用监督微调（SFT）目标，但未能显式建模CUDA敏感性，即与执行约束紧密耦合的代码令牌或区域。在这项工作中，我们从令牌置信度模式的角度研究CUDA敏感性，表明CUDA敏感性出现在令牌和区域两个层面，其中大多数CUDA敏感令牌以高置信度被预测，而较小的低置信度子集形成对应于执行关键结构的区域。这些发现表明，有效的CUDA内核生成应同时利用高置信度的CUDA敏感令牌并保留低置信度的CUDA敏感区域。基于这些见解，我们提出了\textbf{\underline{CU}DA-\underline{Se}nsitive Instruction \underline{T}uning (CuSeT)}，一种在简单SFT框架内的低成本后训练方法。CuSeT遵循“从令牌到区域”的原则，结合了\emph{自适应令牌级掩码}和\emph{区域感知样本重加权}。实验表明，CuSeT在多个模型系列和规模上一致地提高了功能正确性，优于标准SFT和高级SFT变体，同时以显著更低的推理成本达到了与前沿CUDA内核生成模型相竞争的性能。

英文摘要

High-performance CUDA kernels are essential for scalable AI systems, while Large Language Models (LLMs) still struggle to generate correct kernels due to strict and implicit execution constraints. Existing LLM-based approaches either rely on costly agentic or reinforcement-learning (RL) pipelines, or adopt supervised fine-tuning (SFT) objectives that fail to explicitly model CUDA sensitivity, namely code tokens or regions tightly coupled with execution constraints. In this work, we investigate CUDA sensitivity from the perspective of token confidence patterns, showing that CUDA sensitivity appears at both token and region levels, where most CUDA-sensitive tokens are predicted with high confidence, while a smaller low-confidence subset forms regions corresponding to execution-critical structures. These findings suggest that effective CUDA kernel generation should both leverage high-confidence CUDA-sensitive tokens and preserve low-confidence CUDA-sensitive regions. Building on these insights, we propose \textbf{\underline{CU}DA-\underline{Se}nsitive Instruction \underline{T}uning (CuSeT)}, a low-cost post-training method within a simple SFT framework. CuSeT follows the principle of ``from tokens to regions'' by combining \emph{adaptive token-level masking} with \emph{region-aware sample reweighting}. Experiments show that CuSeT consistently improves functional correctness across multiple model families and scales, outperforming standard SFT and advanced SFT variants, while achieving competitive performance against frontier CUDA kernel generation models with substantially lower inference cost.

URL PDF HTML ☆

赞 0 踩 0

2606.16234 2026-06-16 cs.CV cs.AI 交叉投稿

Propagating Structural Guidance: Synthesizing Fluorescein Angiography from Fundus Images and Sparse OCT Scans

传播结构引导：从眼底图像和稀疏OCT扫描合成荧光素血管造影

Tengfei Ma, Ruiqi Wu, Chenran Zhang, Ye Geng, Na Su, Xiangyuan Duanmu, Tao Zhou, Yi Zhou, Wen Fan

发表机构 * School of Computer Science and Engineering, Southeast University（东南大学计算机科学与工程学院）； Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education（教育部新一代人工智能技术及其跨学科应用重点实验室）； Tianyuan Honors School, Nanjing Medical University（南京医科大学天元荣誉学院）； Nanjing University of Science and Technology（南京理工大学）； Department of Ophthalmology, The First Affiliated Hospital of Nanjing Medical University（南京医科大学第一附属医院眼科）

AI总结提出从彩色眼底照片（CFP）和稀疏OCT扫描合成荧光素血管造影（FFA）的框架，通过空间对齐跨模态融合和令牌级对比学习，实现非侵入性FFA合成，提升下游诊断性能。

Comments Accepted to MICCAI 2026 (Early Accept)

详情

AI中文摘要

眼底荧光素血管造影（FFA）对于评估视网膜血管异常至关重要，但其获取具有侵入性且并非总是可行。相比之下，彩色眼底摄影（CFP）无创且广泛可用，这推动了CFP到FFA合成的研究。然而，先前的工作仅依赖CFP表面纹理，从根本上限制了重建功能性血管信息和细微病理变化的能力。为了解决这个问题，我们提出了一种新颖的框架，该框架利用光学相干断层扫描（OCT）提供的结构引导，从CFP合成FFA。我们构建了一个包含来自3,676只患者眼睛的配对CFP、FFA和OCT的多模态视网膜成像数据集——这是视网膜成像中首个三模态对齐数据集。为了弥合OCT和眼底模态之间的空间差距，我们提出了空间对齐跨模态融合（SACMF）模块，该模块将深度分辨的OCT特征投影到眼底平面，并通过自适应层归一化将其注入CFP编码器。除了特征融合，我们还引入了令牌级跨模态对齐（TCMA），这是一种令牌级对比学习策略，在对应空间位置显式对齐CFP和FFA表示。我们的方法相比最先进的方法实现了更优的合成性能。此外，大量实验表明，我们方法合成的FFA图像在提升下游疾病诊断性能方面比现有方法带来更大的改进，突显了我们的方法作为常规工作流程中无创决策支持工具的临床潜力。代码可在https://github.com/while-plus/OCT-guide-FFA-Syn获取。

英文摘要

Fundus fluorescein angiography (FFA) is critical for assessing retinal vascular abnormalities, but its acquisition is invasive and not always feasible. In contrast, color fundus photography (CFP) is non-invasive and widely accessible, which has motivated studies on CFP-to-FFA synthesis. However, prior works rely solely on CFP surface texture, fundamentally limiting the ability to reconstruct functional vascular information and subtle pathological changes. To address this, we propose a novel framework that synthesizes FFA from CFP with structural guidance provided by optical coherence tomography (OCT). We construct a multi-modal retinal imaging dataset with paired CFP, FFA, and OCT from 3,676 patient eyes--the first tri-modally aligned dataset in retinal imaging. To bridge the spatial gap between OCT and fundus modalities, we propose a Spatially Aligned Cross-Modal Fusion (SACMF) module that projects depth-resolved OCT features onto the fundus plane and injects them into the CFP encoder via adaptive layer normalization. Beyond feature fusion, we further introduce Token-wise Cross-Modality Alignment (TCMA), a token-level contrastive learning strategy that explicitly aligns CFP and FFA representations at corresponding spatial positions. Our method achieves superior synthesis performance compared to state-of-the-art methods. Moreover, extensive experiments demonstrate that the FFA images synthesized by our approach bring greater improvements in downstream disease diagnosis performance than existing methods, highlighting the clinical potential of our approach as a non-invasive decision-support tool in routine workflows. The code is available at https://github.com/while-plus/OCT-guide-FFA-Syn.

URL PDF HTML ☆

赞 0 踩 0

2606.16278 2026-06-16 cs.CV cs.AI 交叉投稿

RealityBridge: Bridging Editable 3D Gaussian Splatting Driving Simulations and Real-World Videos

RealityBridge: 连接可编辑3D高斯泼溅驾驶模拟与现实世界视频

Zhenhua Wu, Yun Pang, Mingkun Chang, Yuwei Ning, Liangzhi Wang, Yi Xiao, Guanbin Li

发表机构 * Sun Yat-sen University（中山大学）； Guangdong Key Laboratory of Information Security Technology（广东省信息安全技术重点实验室）； Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education（教育部机器智能与先进计算重点实验室）

AI总结提出RealityBridge框架，利用多模态控制和轻量级GateNet，结合自回归长视频训练与奖励引导后训练，缩小编辑后3DGS驾驶视频的Sim-to-Real差距，提升视觉真实感和时间一致性。

详情

AI中文摘要

长尾危险场景对于安全导向的自动驾驶至关重要，但难以大规模收集和复现。可编辑3D高斯泼溅（3DGS）模拟通过重建真实驾驶场景并支持可控场景编辑，提供了一种有前景的替代方案。然而，编辑后的3DGS渲染视频仍存在显著的Sim-to-Real差距，包括渲染伪影、前景资产退化、光照不一致和时间闪烁。现有的修复和视频生成方法不足以应对此任务，因为它们通常无法联合修复3DGS特定伪影、提升视觉真实感并确保时间一致性。为填补这一空白，我们提出RealityBridge，一种针对编辑后3DGS驾驶视频的结构保持和资产感知的Sim-to-Real框架。RealityBridge使用多模态控制，包括渲染视频、前景掩码、边缘图和语义掩码，并结合轻量级GateNet进行跨骨干层的自适应条件分配。我们进一步构建了针对性的训练数据，并引入自回归长视频训练与奖励引导后训练，以提升修复质量、时间稳定性和幻觉抑制。在内部和公开驾驶数据集上的大量实验表明，RealityBridge在伪影去除、光照协调和长序列时间一致性方面优于现有方法。

英文摘要

Long-tail hazardous scenarios are essential for safety-oriented autonomous driving, yet they are difficult to collect and reproduce at scale. Editable 3D Gaussian Splatting (3DGS) simulation offers a promising alternative by reconstructing real driving scenes and supporting controllable scene editing. However, edited 3DGS-rendered videos still suffer from a significant Sim-to-Real gap, including rendering artifacts, degraded foreground assets, inconsistent illumination, and temporal flickering. Existing restoration and video generation methods are insufficient for this task, as they often fail to jointly repair 3DGS-specific artifacts, improve visual realism, and ensure temporal consistency. To fill this gap, we propose RealityBridge, a structure-preserving and asset-aware Sim-to-Real framework for edited 3DGS driving videos. RealityBridge uses multimodal controls, including rendered videos, foreground masks, edge maps, and semantic masks, together with a lightweight GateNet for adaptive condition allocation across backbone layers. We further construct targeted training data and introduce autoregressive long-video training with reward-guided post-training to improve restoration quality, temporal stability, and hallucination suppression. Extensive experiments on internal and public driving datasets show that RealityBridge outperforms existing methods in artifact removal, illumination harmonization, and long-sequence temporal consistency.

URL PDF HTML ☆

赞 0 踩 0

2606.16292 2026-06-16 cs.SE cs.AI 交叉投稿

AI Supply Chain Galaxy: 3D Visual Analytics for License Compliance

AI供应链星系：用于许可证合规的3D可视化分析

Weiru Han, Xuetao Shi, Wenyi He, Wei Wang, Rui Zhao, Moming Duan

发表机构 * East China Normal University（东华大学）； Tianjin University（天津大学）

AI总结提出AI供应链星系（AISCG），一种交互式3D可视化分析系统，通过空间布局和规则引擎对Hugging Face上908,449个模型进行实证分析，发现55.46%的模型存在合规风险，并识别出适配器派生中56.67%的许可证遗漏和微调中8.05%的许可证漂移等风险模式。

Comments 15 pages, 6 figures

详情

AI中文摘要

机器学习模型复用的快速普及已将AI生态系统转变为一个高度互联的供应链。传统的合规工具和静态报告难以应对这些庞大且多跳的依赖网络。为此，我们提出了AI供应链星系（AISCG），一个用于模型溯源和合规审计的交互式3D可视化分析系统。AISCG将模型映射到3D空间布局中，将显式结构依赖与基于规则的合规引擎相结合。它支持多尺度探索，从全局社区检测到局部、路径感知的谱系追踪。我们通过对Hugging Face上908,449个模型的生态系统规模实证分析展示了其有效性。我们的发现揭示了一个令人担忧的现状：55.46%的模型存在合规风险或元数据冲突/缺失。我们还识别出不同的风险模式，包括适配器派生中56.67%的许可证遗漏率和微调中8.05%的“许可证漂移”率。通过对复杂的Llama模型家族进行案例研究，我们展示了AISCG如何帮助分析人员直观地追溯继承的受限条款，并在深层拓扑网络中识别根本原因，从而显著降低合规审计的认知负荷。

英文摘要

The rapid proliferation of machine learning model reuse has transformed the AI ecosystem into a highly interconnected supply chain. Traditional compliance tools and static reports struggle to navigate these massive, multi-hop dependency networks. To address this, we present AI Supply Chain Galaxy (AISCG), an interactive 3D visual analytics system for model provenance and compliance auditing. AISCG maps models into a 3D spatial layout, integrating explicit structural dependencies with a rule-based compliance engine. It supports multi-scale exploration, from global community detection to localized, path-aware lineage tracing. We demonstrate its efficacy through an ecosystem-scale empirical analysis of 908,449 models from Hugging Face. Our findings reveal a concerning landscape: 55.46% of models exhibit compliance risks or metadata conflicts/omissions. We also identified distinct risk patterns, including a 56.67% license omission rate in adapter derivations and an 8.05% "license drift" rate in fine-tuning. Through a case study on the complex Llama model family, we show how AISCG empowers analysts to intuitively trace inherited restrictive terms and identify root causes across deep topological networks, significantly reducing the cognitive load of compliance auditing.

URL PDF HTML ☆

赞 0 踩 0

2606.16332 2026-06-16 cs.DC cs.AI cs.PF 交叉投稿

SMEPilot: Characterizing and Optimizing LLM Inference with Scalable Matrix Extensions

SMEPilot: 利用可扩展矩阵扩展表征和优化LLM推理

Feiyang Chen, Haibo Chen

发表机构 * IPADS, Shanghai Jiao Tong University（上海交通大学IPADS）

AI总结针对CPU矩阵扩展单元与核心在LLM推理中的不匹配，提出SMEPilot引擎，通过基于屋顶线的表征指导算子执行选择，实现SME与CPU协同工作，端到端性能提升达3.94倍。

详情

AI中文摘要

现代CPU越来越多地集成矩阵扩展，如Arm可扩展矩阵扩展（SME），在CPU内提供高吞吐量矩阵执行。然而，对于LLM推理，这些单元并不能普遍替代传统CPU核心：预填充、解码、注意力和KV缓存操作表现出不同的算术强度、向量行为和布局要求，而SME单元和CPU核心仍竞争共享内存带宽。本文通过基于屋顶线的SME CPU表征研究这种不匹配，并使用所得模型指导算子级执行选择。我们提出SMEPilot，一个LLM推理引擎，为每个算子形状选择仅CPU、仅SME或协作SME+CPU执行。SMEPilot在瓦片粒度上将矩阵工作分区到SME和CPU核心，在注意力中重叠适合SME的矩阵阶段与适合CPU的向量阶段，并维护布局状态，以便打包的张量表示被重用，而不是在关键路径上重复构建。在手机、PC和服务器平台上，针对Llama-3.2-3B、Qwen3-4B和Qwen3-30BA3B，SMEPilot将端到端推理性能提升高达3.94倍。

英文摘要

Modern CPUs increasingly integrate matrix extensions, such as Arm Scalable Matrix Extension (SME), that provide high-throughput matrix execution within the CPU. For LLM inference, however, these units are not a universal replacement for conventional CPU cores: prefill, decode, attention, and KV-cache operations expose different arithmetic intensities, vector behavior, and layout requirements, while SME units and CPU cores still compete for shared memory bandwidth. This paper studies this mismatch through a roofline-based characterization of SME-enabled CPUs and uses the resulting model to guide operator-level execution choices. We present SMEPilot, an LLM inference engine that selects CPU-only, SME-only, or cooperative SME+CPU execution for each operator shape. SMEPilot partitions matrix work across SME and CPU cores at tile granularity, overlaps SME-suitable matrix stages with CPU-suitable vector stages in attention, and maintains layout state so packed tensor representations are reused rather than repeatedly rebuilt on critical paths. Across Llama-3.2-3B, Qwen3-4B, and Qwen3-30BA3B on phone, PC, and server platforms, SMEPilot improves end-to-end inference performance by up to 3.94$\times$.

URL PDF HTML ☆

赞 0 踩 0

2606.16434 2026-06-16 cs.LG cs.AI 交叉投稿

Autonomous End-to-End SOH Prediction Services for Battery Systems via Temporal-Contrastive Representation Learning

基于时间对比表示学习的电池系统自主端到端健康状态预测服务

Junting Wen, Dan Li, Qihao Quan, Xiwen Wang, Hang Yang, Zhaohong Meng, Zigui Jiang, Changlin Yang, Tianle Liu, Diego Muñoz-Carpintero, Jian Lou

发表机构 * School of Software Engineering, Sun Yat-sen University（中山大学软件学院）； Tianneng Battery Group Co., Ltd（天能电池集团有限公司）； School of Communication Engineering, Hangzhou Dianzi University（杭州电子科技大学通信工程学院）； Institute of Engineering Science, Universidad de O’Higgins（奥希金斯大学工程科学研究所）

AI总结提出TC-SOH模块化服务架构，通过时间对比机制和跨窗口预测任务从原始数据中提取退化相关表示，实现自主端到端SOH预测，在四个数据集上MAPE和RMSE分别降低1.91倍和2.13倍。

详情

AI中文摘要

准确的状态健康（SOH）估计是锂离子电池管理的关键诊断服务。然而，依赖劳动密集型的手动特征工程和不透明的黑箱模型阻碍了可扩展的工业部署。为此，我们引入TC-SOH：一种模块化、即插即用的服务架构，用于自主、端到端的SOH预测。TC-SOH采用时间对比机制和跨窗口预测预任务，直接从原始运行数据中提取与退化相关的表示。为了提高透明度，我们将模型效能与表示诊断联系起来：可视化、敏感性分析、冗余分析、双向探测、未来SOH探测和时间洗牌表明，学习到的特征与选定的专家描述符重叠，同时保留了额外的SOH相关变化，并且有序的时间上下文改善了后续SOH预测。在四个公开数据集上，TC-SOH优于所考虑的物理信息和数据驱动基线，MAPE降低了1.91倍，RMSE降低了2.13倍。

英文摘要

Accurate state of health (SOH) estimation is a critical diagnostic service for lithium-ion battery management. However, reliance on labor-intensive manual feature engineering and opaque black-box models hinders scalable industrial deployment. To address this, we introduce TC-SOH: a modular, plug-and-play service architecture for autonomous, end-to-end SOH prediction. TC-SOH employs a temporal-contrastive mechanism and a cross-window prediction pretext task to extract degradation-relevant representations directly from raw operational data. To improve transparency, we connect model efficacy with representation diagnostics: visualization, sensitivity analysis, redundancy analysis, bidirectional probing, future-SOH probing, and temporal shuffling show that learned features overlap with selected expert descriptors while retaining additional SOH-relevant variation, and that ordered temporal context improves subsequent-SOH prediction. Across four public datasets, TC-SOH outperforms the considered physics-informed and data-driven baselines, reducing MAPE by 1.91 times and RMSE by 2.13 times.

URL PDF HTML ☆

赞 0 踩 0

2606.16440 2026-06-16 cs.AR cs.AI cs.LG 交叉投稿

NeuronFabric: A Software Reference Architecture for On-Chip Transformer Training with Local Adam

NeuronFabric：一种用于片上Transformer训练与本地Adam的软件参考架构

Evgeny Ukladchikov

发表机构 * Independent Researcher（独立研究者）

AI总结提出NeuronFabric软件参考架构，用于FPGA/ASIC实现Transformer训练与本地Adam优化，通过BF16W权重存储减少片上内存需求，在334K参数模型上验证数值正确性。

详情

AI中文摘要

公开记载的加速器架构通常将训练计算与优化器状态更新分离，或依赖外部内存和主机协调。本文提出NeuronFabric，一种旨在用于未来FPGA和ASIC实现Transformer训练与本地Adam更新的软件参考架构。一个完整的C#原型实现了前向传播、反向传播和Adam优化，无需外部机器学习框架。目标是在硬件实现前验证数值正确性和内存需求。评估模型是一个334K参数的自回归Transformer（d=88, H=4, f=264, L=4, vocab=256），在莎士比亚语料库上训练。BF16W配置在80K样本后达到评估损失1.5426，而FP32 GPU参考为1.5224，同时生成连贯的字符级文本。本文引入BF16W，它以BF16存储权重，同时以FP32保留Adam优化器动量。这减少了片上训练的内存需求。一个带Adam动量的334K参数FP32模型需要约4.0 MB，与Xilinx ZCU102设备的BRAM容量匹配。BF16W变体需要约3.34 MB，为激活存储留出内存。我们描述了早期实验中观察到的词汇预算约束，量化了BF16W内存节省，并概述了FPGA训练作为下一开发阶段。本文不包含FPGA测量。本出版物作为未来FPGA和ASIC探索NeuronFabric架构的公开架构披露和软件参考实现。

英文摘要

Publicly documented accelerator architectures generally separate training computation from optimizer-state updates or rely on external memory and host orchestration. This paper presents NeuronFabric, a software reference architecture intended for future FPGA and ASIC implementations of transformer training with local Adam updates. A complete C# prototype implements forward pass, backpropagation, and Adam optimization without external machine-learning frameworks. The goal is to validate numerical correctness and memory requirements before hardware implementation. The evaluated model is a 334K-parameter autoregressive transformer (d=88, H=4, f=264, L=4, vocab=256) trained on the Shakespeare corpus. The BF16W configuration achieves evaluation loss 1.5426 after 80K samples, compared with 1.5224 for an FP32 GPU reference, while producing coherent character-level text. The paper introduces BF16W, which stores weights in BF16 while retaining Adam optimizer moments in FP32. This reduces memory requirements for on-chip training. A 334K-parameter FP32 model with Adam moments requires approximately 4.0 MB, matching the BRAM capacity of a Xilinx ZCU102 device. The BF16W variant requires approximately 3.34 MB, leaving memory available for activation storage. We describe the vocabulary-budget constraint observed during earlier experiments, quantify BF16W memory savings, and outline FPGA training as the next stage of development. No FPGA measurements are included in this paper. This publication serves as a public architectural disclosure and software reference implementation for future FPGA and ASIC exploration of the NeuronFabric architecture.

URL PDF HTML ☆

赞 0 踩 0

2606.16497 2026-06-16 cs.LG cs.AI cs.CL 交叉投稿

超越模型：在项目式课程中构建AI使能系统的反思

Amir Mashmool, Kishan Ravindra Sawant, Mojtaba Shahin, Nico Hochgeschwender, Rainer Koschke

AI总结本文反思了一门研究生项目式课程的设计与实施，该课程通过开发电影推荐系统，培养学生对AI使能系统的架构设计、部署和监控能力，并基于混合方法研究揭示了学生在早期架构决策、ML集成、需求演进和数据管理方面的持续困难。

详情

AI中文摘要

教授AI使能系统的软件工程需要在现实约束下解决AI组件在全规模软件架构中的集成问题。虽然机器学习课程强调模型开发，但学生往往缺乏AI使能系统的架构设计、部署和监控经验。此类面向系统的AI课程的实证评估仍然有限。本文反思了不来梅大学一门名为“AI算法：理论与工程”的硕士项目式课程的设计与实施，在该课程中，学生开发了一个电影推荐系统，同时做出架构设计决策以应对可扩展性、部署和需求演进相关的挑战。我们进行了一项混合方法研究，结合学生提交物分析和问卷回答，以调查集成挑战、学习成果和改进机会。我们的结果表明，由于机器学习和软件工程专业知识不均衡，学生在早期架构决策、异构ML集成、需求演进和数据管理方面持续存在困难。从教育者的角度来看，该课程培养了系统级推理能力，并增强了对AI使能系统中以数据为中心的ML实践的认识。

英文摘要

Teaching Software Engineering for AI-enabled systems entails addressing the integration of AI components within full-scale software architectures under realistic constraints. While machine learning courses emphasize model development, students often lack experience in architectural design, deployment, and monitoring of AI-enabled systems. Empirical evaluations of such system-oriented AI courses remain limited. This paper reflects on the design and implementation of a project-based master's-level course titled AI Algorithms: Theory and Engineering, at the University of Bremen, in which students developed a movie recommendation system while making architectural design decisions to address challenges related to scalability, deployment, and evolving requirements. We conducted a mixed-methods study combining analyses of student submissions and questionnaire responses to investigate integration challenges, learning outcomes, and opportunities for improvement. Our results indicate persistent difficulties in early architectural decisions, heterogeneous ML integration, evolving requirements, and data management, largely due to uneven ML and software engineering expertise. From the educator's perspective, the course fostered system-level reasoning and strengthened awareness of data-centric ML practices in AI-enabled systems.

URL PDF HTML ☆

赞 0 踩 0

2606.16969 2026-06-16 cs.SD cs.AI eess.AS 交叉投稿

Probing Low Frame Rate Degradation in Neural Audio Codecs

探测神经音频编解码器中的低帧率退化

Alex Gichamba, Moise Busogi

发表机构 * Carnegie Mellon University Africa（卡内基梅隆大学非洲校区）

AI总结通过控制帧率消融实验，发现低帧率质量悬崖源于训练配置缺陷而非根本性障碍，修正后帧率可降至3.1Hz和1.6Hz。

Comments Accepted at Interspeech 2026

详情

AI中文摘要

神经音频编解码器中的低帧率对于自回归语音合成具有吸引力，因为生成成本与序列长度线性相关。最近的研究表明，编解码器可以在12.5 Hz及以下运行，但低帧率退化的机制仍未被充分理解。我们通过受控的帧率消融实验来研究这些机制。我们重现了先前工作中报告的6.25 Hz处的质量悬崖，并评估了候选解释：音素冲突和码本饱和，两者均未显示出根本性障碍的证据。该悬崖实际上是由次优的训练配置引起的：训练期间固定的剪辑时长在低帧率下产生过少的令牌，使解码器缺乏令牌间上下文。一旦修正，WER随音素负载平滑退化，直至3.1 Hz和1.6 Hz，这表明低帧率编解码器的推理时效率增益比先前假设的更容易实现。

英文摘要

Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate degradation remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce a quality cliff at 6.25 Hz reported in previous works and evaluate candidate explanations: phonemic collisions and codebook saturation, neither of which shows evidence of a fundamental barrier. The cliff is instead caused by suboptimal training configuration: fixed clip duration during training yields too few tokens at low frame rates, starving the decoder of inter-token context. Once corrected, WER degrades smoothly with phonemic load down to 3.1 Hz and 1.6 Hz, suggesting the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.

URL PDF HTML ☆

赞 0 踩 0

2606.16973 2026-06-16 cs.IR cs.AI 交叉投稿

基于Agent Rosetta的蛋白质设计：面向专业科学智能体的案例研究

Jacopo Teneggi, S. M. Bargeen A. Turzo, Tanya Marwah, Alberto Bietti, P. Douglas Renfrew, Vikram Khipple Mulligan, Siavash Golkar

发表机构 * Polymathic AI（多学科人工智能实验室）； Center for Computational Biology, Flatiron Institute（计算生物学中心，Flatiron研究所）； Google DeepMind（谷歌DeepMind）； Center for Computational Mathematics, Flatiron Institute（计算数学中心，Flatiron研究所）； New York University（纽约大学）； Johns Hopkins University（约翰霍普金斯大学）

AI总结提出Agent Rosetta，将大语言模型与Rosetta软件环境结合，通过迭代优化实现用户定义的蛋白质设计目标，在规范氨基酸设计上匹配专家，在非规范残基设计上超越现有ML方法。

详情

AI中文摘要

大语言模型（LLM）能够模拟推理并使用工具，为执行复杂科学任务的自主智能体创造了机会。蛋白质设计提供了一个天然的试验平台：尽管机器学习（ML）方法取得了强劲成果，但它们主要局限于规范氨基酸和狭窄的目标，对于广泛设计流程的通用工具的需求尚未满足。我们引入了Agent Rosetta，这是一个LLM智能体，配有一个用于操作Rosetta的结构化环境——Rosetta是领先的基于物理的异聚合物设计软件，能够建模非规范构建模块和几何结构。Agent Rosetta通过结合LLM推理与Rosetta的通用性，迭代优化设计以实现用户定义的目标。我们在规范氨基酸设计上评估了Agent Rosetta，匹配了专业模型和专家基线；在非规范残基设计上——ML方法在此失败——取得了可比的性能。关键的是，仅靠提示工程通常无法生成Rosetta操作，这表明环境设计对于将LLM智能体与专业软件集成至关重要。我们的结果表明，适当设计的环境能使LLM智能体在匹配专业工具和人类专家的同时，使科学软件变得可访问。

英文摘要

Large language models (LLMs) are capable of emulating reasoning and using tools, creating opportunities for autonomous agents that execute complex scientific tasks. Protein design provides a natural testbed: although machine learning (ML) methods achieve strong results, these are largely restricted to canonical amino acids and narrow objectives, leaving unfilled need for a generalist tool for broad design pipelines. We introduce Agent Rosetta, an LLM agent paired with a structured environment for operating Rosetta, the leading physics-based heteropolymer design software, capable of modeling non-canonical building blocks and geometries. Agent Rosetta iteratively refines designs to achieve user-defined objectives, combining LLM reasoning with Rosetta's generality. We evaluate Agent Rosetta on design with canonical amino acids, matching specialized models and expert baselines, and with non-canonical residues -- where ML approaches fail -- achieving comparable performance. Critically, prompt engineering alone often fails to generate Rosetta actions, demonstrating that environment design is essential for integrating LLM agents with specialized software. Our results show that properly designed environments enable LLM agents to make scientific software accessible while matching specialized tools and human experts.

URL PDF HTML ☆

赞 0 踩 0

2605.01101 2026-06-16 cs.AI cs.CL cs.SD eess.AS 版本更新

Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy

虚拟言语治疗师：一种临床医生参与的AI言语治疗代理，用于个性化和监督式治疗

Shakeel Sheikh, Patrick Marmaroli, MD Sahidullah, Slim Ouni, Fabrice Hirsch, Goncalo Leal, Bjorn W Schuller

发表机构 * The Kashmir Hub for Artficial Intelligence（喀布尔人工智能中心）； Microsoft / Vocametrix（微软 / Vocametrix）； IAI, TCG CREST（IAI，TCG CREST）； Université de Lorraine, CNRS, Inria, LORIA（洛林大学，CNRS，Inria，LORIA）； Laboratoire Praxiling, UMR5267, CNRS et Université Paul-Valéry Montpellier 3（Praxiling实验室，UMR5267，CNRS及蒙彼利埃Paul-Valéry大学）； Speechcare iStutter, Portuguese Catholic University（Speechcare iStutter，葡萄牙天主教大学）； CHI – Chair of Health Informatics, TUM University Hospital（健康信息学系，TUM大学医院）； GLAM – Group on Language, Audio, & Music, Imperial College London（语言、音频与音乐小组，伦敦帝国理工学院）

AI总结提出虚拟言语治疗师（VST）平台，集成深度学习口吃分类与多智能体大语言模型推理，自动生成个性化治疗方案，并通过临床医生反馈优化，实验证明其高质量推荐。

Comments Under Review

详情

AI中文摘要

本文开发了虚拟言语治疗师（VST），这是一个基于智能体的平台，通过自动化和自适应的AI驱动工作流程，简化口吃评估并提供定制化的治疗计划。VST集成了最先进的基于深度学习的口吃分类和多智能体大语言模型（LLM）推理，以支持循证临床决策。VST首先获取并提取患者语音样本的特征，然后对口吃类型进行稳健分类。基于这些输出，VST启动一个智能体推理过程，其中专门的LLM智能体自主生成、批评并迭代优化个性化治疗计划。一个专门的批评智能体评估所有生成的治疗计划，以确保临床安全性、方法学合理性，并与同行评审的证据和既定专业指南保持一致。最终输出是一个全面的、针对患者的治疗草案，供临床医生审查。系统结合临床医生的反馈，生成最终的治疗计划，适用于患者交付，从而保持临床医生参与的范式。由专家言语治疗师进行的实验评估证实，VST持续生成高质量、基于证据的治疗建议。这些发现表明该系统具有增强临床工作流程、减轻临床医生负担并改善言语障碍患者治疗效果的潜力。所提出系统的交互式用户界面可在以下网址在线获取：this https URL，支持实时口吃评估和个性化治疗计划。

英文摘要

This paper develops Virtual Speech Therapist (VST), an intelligent agent-based platform that streamlines stuttering assessment and delivers customized therapy planning through automated and adaptive AI-driven workflows. VST integrates state-of-the-art deep learning-based stuttering classification, and multi-agent large language model (LLM) reasoning to support evidence-based clinical decision-making. The VST begins with the acquisition and feature extraction of patient speech samples, followed by robust classification of stuttering types. Building on these outputs, VST initiates an agentic reasoning process in which specialized LLM agents autonomously generate, critique, and iteratively refine individualized therapy plans. A dedicated critic agent evaluates all generated therapy plans to ensure clinical safety, methodological soundness, and alignment with peer-reviewed evidence and established professional guidelines. The resulting output is a comprehensive, patient-specific therapy draft intended for clinician review. Incorporating clinician feedback, the system then produces a finalized therapy plan suitable for patient delivery, thereby maintaining a clinician-in-the-loop paradigm. Experimental evaluation by expert speech therapists confirms that VST consistently generates high-quality, evidence-based therapy recommendations. These findings demonstrate the system's potential to augment clinical workflows, reduce clinician burden, and improve therapeutic outcomes for individuals with speech impairments. An interactive user interface for the proposed system is available online at: https://vocametrix.com/ai/stuttering-therapy-planning-agent , facilitating real-time stuttering assessment and personalized therapy planning.

URL PDF HTML ☆

赞 0 踩 0

2606.12025 2026-06-16 cs.AI 版本更新

Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

人类增强循环建模（HELM）：基于智能体的混凝土桥梁护栏有限元建模

Quankai Wang, Yulin Xie, Tongfei Yang, Minghui Cheng, Ran Cao

发表机构 * College of Civil Engineering, Hunan University（湖南大学土木工程学院）； Department of Civil and Architectural Engineering, University of Miami（迈阿密大学土木与建筑系）； School of Architecture, University of Miami（迈阿密大学建筑学院）

AI总结提出HELM框架，通过人机协作将有限元建模分解为可验证的检查点，在MASH TL-4和TL-5条件下将自主建模成功率从20%提升至75%。

详情

AI中文摘要

对桥梁护栏等安全关键基础设施进行有限元（FE）建模需要高保真非线性动态分析，然而当前的FE建模过程仍然劳动密集且缺乏自动化。本文提出了人类增强循环建模（HELM）框架，这是一种协作式人机协议，将长序列有限元建模分解为几何生成、边界条件定义和材料分配等离散的、可视觉验证的检查点。该框架通过一个包含20个案例的钢筋混凝土桥梁护栏矩阵在MASH TL-4和TL-5侧向荷载条件下进行演示，将专用智能体与两种广泛使用的商业FE软件（即ANSYS和LS-PrePost）对接。实验结果表明，HELM将基线自主建模成功率从20%提高到75%，其中几何和边界条件任务的智能体级通过率大约翻倍。误差分析显示，空间推理和代数逻辑限制构成了主要的失败模式，突显了结构化人在回路干预对建模自动化的价值。完整的智能体设计代码和提示已开源，可访问：此 https URL。

英文摘要

Finite element (FE) modeling of safety-critical infrastructure such as bridge barriers requires high-fidelity nonlinear dynamic analysis, yet the current FE modeling process remains labor-intensive and lacks automation. This paper presents the Human-Enhanced Loop Modeling (HELM) framework, a collaborative human-agent protocol that decomposes long-sequence finite element modeling into discrete, visually verifiable checkpoints across geometry generation, boundary condition definition, and material assignment. The framework is demonstrated through a 20-case matrix of reinforced concrete bridge barriers under MASH TL-4 and TL-5 lateral loading conditions, interfacing specialized agents with two widely used commercial FE softwares, i.e., ANSYS and LS-PrePost. Experimental results show that HELM improves the baseline autonomous modeling success rate from 20% to 75%, with agent-level pass rates for geometry and boundary condition tasks approximately doubling. Error analysis reveals that spatial reasoning and algebraic logic limitations constitute the primary failure modes, underscoring the value of structured human-in-the-loop intervention for modeling automation. The complete agent design code and prompts are open-sourced and can be accessed at: https://github.com/SimAgentDev/Ansys-LSPP-AgentKit.

URL PDF HTML ☆

赞 0 踩 0

2309.07401 2026-06-16 math.NA cs.AI cs.NA 版本更新

Multi-Grade Deep Learning for Partial Differential Equations with Applications to the Burgers Equation

多级深度学习用于偏微分方程及其在Burgers方程中的应用

Yuesheng Xu, Taishan Zeng

发表机构 * Department of Mathematics and Statistics, Old Dominion University（数学与统计学系，老 Dominion 大学）； School of Mathematical Science, South China Normal University（数学科学学院，华南师范大学）

AI总结提出两阶段多级深度学习方法，通过渐进式分级训练浅层网络拟合目标函数，再微调部分层，有效解决非线性PDE优化难题，在Burgers方程上误差降低达60倍。

详情

AI中文摘要

深度神经网络在求解偏微分方程方面显示出巨大潜力，但其深层架构带来了复杂、大规模、非凸的优化挑战。非线性PDE，如粘性Burgers方程，由于陡峭梯度和激波类解而加剧了这些困难。为此，我们提出了一种两阶段多级深度学习方法。在第一阶段，浅层网络逐级渐进训练，从低频到高频分量拟合目标函数；先前学习的级被冻结，每个新的残差块仅训练以最小化剩余逼近误差。第二阶段解冻并重新训练选定层，以第一阶段网络为初始化，实现可解释、稳定的层次细化，同时减轻优化复杂性。此外，我们从理论上证明，在适当的优化策略下，TS-MGDL中的每一级和每一阶段都单调地减少损失函数。在一维、二维和三维粘性Burgers方程上的数值实验表明，TS-MGDL显著优于单级学习，预测误差降低高达60倍。

英文摘要

Deep neural networks (DNNs) show great promise for solving partial differential equations (PDEs), but their deep architectures introduce complex, large-scale, non-convex optimization challenges. Nonlinear PDEs, like the viscous Burgers' equation, compound these difficulties due to steep gradients and shock-like solutions. To address this, we propose a two-stage multi-grade deep learning (TS-MGDL) method. In the first stage, shallow networks are trained progressively grade by grade to fit the target function from low- to high-frequency components; previously learned grades are frozen, and each new residual block is trained solely to minimize the remaining approximation error. The second stage unfreezes and retrains selected layers using the first-stage network as initialization, achieving an interpretable, stable hierarchical refinement while mitigating optimization complexity. Furthermore, we theoretically prove that each grade and stage in TS-MGDL monotonically reduces the loss function under an appropriate optimization strategy. Numerical experiments on 1D, 2D, and 3D viscous Burgers' equations demonstrate that TS-MGDL significantly outperforms single-grade learning (SGL), reducing predictive errors by up to a factor of 60.

URL PDF HTML ☆

赞 0 踩 0

2407.02362 2026-06-16 cs.AR cs.AI cs.LG 版本更新

基于深度学习的自动化超声多普勒角度估计

Nilesh Patil, Ajay Anand

发表机构 * Goergen Institute for Data Science（戈尔根数据科学研究所）； University of Rochester Medical Center（罗切斯特大学医学中心）； University of Rochester（罗切斯特大学）

AI总结提出一种基于深度学习的自动化多普勒角度估计方法，使用2100张颈动脉超声图像及预训练模型，平均绝对误差3.9°-9.4°，最佳模型误差低于临床可接受阈值，可避免正常速度误判为狭窄。

详情

DOI: 10.1109/embc.2019.8857587
Journal ref: Annu Int Conf IEEE Eng Med Biol Soc. 2019 Jul;2019:28-31

AI中文摘要

角度估计是测量血流速度的多普勒超声临床工作流程中的重要步骤。人们普遍认为，角度估计不正确是基于多普勒的血流速度测量误差的主要原因。在本文中，我们提出了一种基于深度学习的自动化多普勒角度估计方法。该方法使用2100张人类颈动脉超声图像（包括图像增强）进行开发。使用五个预训练模型提取图像特征，并将这些特征传递给一个自定义浅层网络进行多普勒角度估计。独立地，由一名人类观察者审阅图像进行测量以进行比较。对于评估的模型，自动角度估计与手动角度估计之间的平均绝对误差（MAE）范围为3.9°至9.4°。此外，最佳性能模型的MAE低于可接受的临床多普勒角度误差阈值，从而避免了将正常速度值误分类为狭窄。结果表明，应用基于深度学习的技术进行自动化超声多普勒角度估计具有潜力。这种技术有可能在商业超声扫描仪的成像软件中实现。

英文摘要

Angle estimation is an important step in the Doppler ultrasound clinical workflow to measure blood velocity. It is widely recognized that incorrect angle estimation is a leading cause of error in Doppler-based blood velocity measurements. In this paper, we propose a deep learning-based approach for automated Doppler angle estimation. The approach was developed using 2100 human carotid ultrasound images including image augmentation. Five pre-trained models were used to extract images features, and these features were passed to a custom shallow network for Doppler angle estimation. Independently, measurements were obtained by a human observer reviewing the images for comparison. The mean absolute error (MAE) between the automated and manual angle estimates ranged from 3.9° to 9.4° for the models evaluated. Furthermore, the MAE for the best performing model was less than the acceptable clinical Doppler angle error threshold thus avoiding misclassification of normal velocity values as a stenosis. The results demonstrate potential for applying a deep-learning based technique for automated ultrasound Doppler angle estimation. Such a technique could potentially be implemented within the imaging software on commercial ultrasound scanners.

URL PDF HTML ☆

赞 0 踩 0

2508.10967 2026-06-16 cs.LG cs.AI 版本更新

Retro-Expert: Collaborative Reasoning for Interpretable Retrosynthesis

Retro-Expert: 面向可解释逆合成的协同推理

Xinyi Li, Sai Wang, Yutian Lin, Yu Wu

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出Retro-Expert框架，通过强化学习结合大语言模型与专用模型，实现可解释的逆合成预测，并生成基于化学逻辑的自然语言解释。

详情

AI中文摘要

逆合成预测旨在根据给定的产物分子推断反应物分子，这是化学合成中的一项基本任务。然而，现有方法依赖于静态模式匹配范式，限制了其从化学数据中进行有效逻辑决策的能力，导致黑箱过程。我们提出Retro-Expert，一个可解释的逆合成框架，通过纯强化学习结合大语言模型和专用模型的互补优势，进行协同推理。它通过三个组件输出基于化学逻辑的自然语言解释：（1）专用模型提供化学知识，将其蒸馏到高质量的化学决策空间中；（2）大语言模型驱动的批判性推理，生成具有可解释推理路径的预测；（3）基于知识的策略优化，改进可解释的决策策略。实验表明，Retro-Expert在不同指标上均优于基于大语言模型和专用模型的方法，同时生成基于化学的解释，增强了化学家在实践中的信任。本文源代码见：此 https URL。

英文摘要

Retrosynthesis prediction aims to infer the reactant molecules based on a given product molecule, which is a fundamental task in chemical synthesis. However, existing methods rely on a static pattern-matching paradigm, which limits their ability to perform effective logical decision-making from chemical data, leading to a black-box process. We propose Retro-Expert, an interpretable retrosynthesis framework that performs collaborative reasoning by combining the complementary strengths of Large Language Models and specialized models via pure reinforcement learning. It outputs natural language explanations grounded in chemical logic through three components: (1) specialized models provide chemical knowledge that is distilled into a high-quality chemical decision space, (2) LLM-driven critical reasoning to generate predictions with an interpretable reasoning path, and (3) knowledge-grounded policy optimization refines the interpretable decision policy. Experiments show that Retro-Expert surpasses both LLM-based and specialized models across different metrics, while generating chemically grounded explanations that enhance chemists' trust in practice. The source code for this paper is available at https://github.com/MagixRab-ll/Retro-Expert.

URL PDF HTML ☆

赞 0 踩 0

2509.05364 2026-06-16 cs.CY cs.AI cs.ET 版本更新

Prototyping an AI-powered Tool for Energy Efficiency in New Zealand Homes

为新西兰住宅能效设计AI驱动工具的原型

Abdollah Baghaei Daemei

发表机构 * Building Performance Analysis Lab, Tech Innovation Experts（建筑性能分析实验室，技术创新专家）

AI总结本研究设计并评估了一个基于AI的决策支持工具原型，通过数据集成、异常检测和情景模拟，帮助新西兰家庭提升能效，专家测试显示可用性高，有望弥补政策与家庭实践之间的差距。

详情

DOI: 10.66408/abc2.2025.10

AI中文摘要

住宅建筑对能源使用、健康结果和碳排放有显著影响。在新西兰，住房质量历来较差，隔热不足和供暖效率低下导致广泛的能源困难。最近的改革，包括Warmer Kiwi Homes计划、Healthy Homes标准和H1建筑规范升级，带来了健康和舒适度的改善，但挑战依然存在。许多改造仍然不完整，家庭性能数据有限，房主的决策支持分散。本研究介绍了为新西兰住宅能效设计的AI驱动决策支持工具的原型和评估。该原型使用Python和Streamlit开发，将数据摄取、异常检测、基线建模和情景模拟（例如LED改造、隔热升级）集成到一个模块化仪表板中。15位领域专家，包括建筑科学家、顾问和政策实践者，通过半结构化访谈测试了该工具。结果显示可用性高（M=4.3），情景输出价值高（M=4.5），并且对其补充补贴计划和监管框架的潜力持积极看法。该工具展示了AI如何将国家政策转化为个性化的家庭级指导，弥合资金、标准和实际决策之间的差距。其意义在于提供了一个可复制的框架，以减少能源困难、改善健康结果并支持气候目标。未来的发展应侧重于碳指标、电价建模、与国家数据集的集成以及评估实际采用的纵向试验。

英文摘要

Residential buildings contribute significantly to energy use, health outcomes, and carbon emissions. In New Zealand, housing quality has historically been poor, with inadequate insulation and inefficient heating contributing to widespread energy hardship. Recent reforms, including the Warmer Kiwi Homes program, Healthy Homes Standards, and H1 Building Code upgrades, have delivered health and comfort improvements, yet challenges persist. Many retrofits remain partial, data on household performance are limited, and decision-making support for homeowners is fragmented. This study presents the design and evaluation of an AI-powered decision-support tool for residential energy efficiency in New Zealand. The prototype, developed using Python and Streamlit, integrates data ingestion, anomaly detection, baseline modeling, and scenario simulation (e.g., LED retrofits, insulation upgrades) into a modular dashboard. Fifteen domain experts, including building scientists, consultants, and policy practitioners, tested the tool through semi-structured interviews. Results show strong usability (M = 4.3), high value of scenario outputs (M = 4.5), and positive perceptions of its potential to complement subsidy programs and regulatory frameworks. The tool demonstrates how AI can translate national policies into personalized, household-level guidance, bridging the gap between funding, standards, and practical decision-making. Its significance lies in offering a replicable framework for reducing energy hardship, improving health outcomes, and supporting climate goals. Future development should focus on carbon metrics, tariff modeling, integration with national datasets, and longitudinal trials to assess real-world adoption.

URL PDF HTML ☆

赞 0 踩 0

2509.25594 2026-06-16 cs.CV cs.AI 版本更新

K-Prism: A Knowledge-Guided and Prompt Integrated Universal Medical Image Segmentation Model

K-Prism: 一种知识引导与提示集成的通用医学图像分割模型

Bangwei Guo, Yunhe Gao, Meng Ye, Difei Gu, Yang Zhou, Leon Axel, Dimitris Metaxas

发表机构 * Rutgers University（罗格斯大学）； Stanford University（斯坦福大学）； The University of Texas at Arlington（德克萨斯大学阿灵顿分校）； New York University（纽约大学）

AI总结提出K-Prism统一分割框架，通过双提示表示和混合专家解码器整合语义先验、上下文知识和交互反馈三种知识范式，在18个数据集上实现语义、上下文和交互分割的最优性能。

详情

Journal ref: International Conference on Learning Representations (ICLR), 2026

AI中文摘要

医学图像分割是临床决策的基础，但现有模型仍然碎片化。它们通常基于单一知识源训练，并针对特定任务、模态或器官。这种碎片化与临床实践形成鲜明对比，在临床实践中，专家无缝整合多种知识：来自训练集的解剖先验、来自参考病例的基于示例的推理，以及通过实时交互的迭代细化。我们提出了$\textbf{K-Prism}$，一个统一的分割框架，通过系统整合三种知识范式来反映这种临床灵活性：(i) 从标注数据集中学习的$\textit{语义先验}$，(ii) 来自少样本参考示例的$\textit{上下文知识}$，以及(iii) 来自用户输入（如点击或涂鸦）的$\textit{交互反馈}$。我们的关键见解是，这些异构知识源可以编码为双提示表示：定义$\textit{分割什么}$的1-D稀疏提示和指示$\textit{关注哪里}$的2-D密集提示，然后通过混合专家（MoE）解码器动态路由。这种设计使得范式之间灵活切换，并能够在不同任务上进行联合训练，而无需修改架构。在涵盖多种模态（CT、MRI、X射线、病理、超声等）的18个公共数据集上的全面实验表明，K-Prism在语义、上下文和交互分割设置中均达到了最先进的性能。

英文摘要

Medical image segmentation is fundamental to clinical decision-making, yet existing models remain fragmented. They are usually trained on single knowledge sources and specific to individual tasks, modalities, or organs. This fragmentation contrasts sharply with clinical practice, where experts seamlessly integrate diverse knowledge: anatomical priors from training, exemplar-based reasoning from reference cases, and iterative refinement through real-time interaction. We present $\textbf{K-Prism}$, a unified segmentation framework that mirrors this clinical flexibility by systematically integrating three knowledge paradigms: (i) $\textit{semantic priors}$ learned from annotated datasets, (ii) $\textit{in-context knowledge}$ from few-shot reference examples, and (iii) $\textit{interactive feedback}$ from user inputs like clicks or scribbles. Our key insight is that these heterogeneous knowledge sources can be encoded into a dual-prompt representation: 1-D sparse prompts defining $\textit{what}$ to segment and 2-D dense prompts indicating $\textit{where}$ to attend, which are then dynamically routed through a Mixture-of-Experts (MoE) decoder. This design enables flexible switching between paradigms and joint training across diverse tasks without architectural modifications. Comprehensive experiments on 18 public datasets spanning diverse modalities (CT, MRI, X-ray, pathology, ultrasound, etc.) demonstrate that K-Prism achieves state-of-the-art performance across semantic, in-context, and interactive segmentation settings.

URL PDF HTML ☆

赞 0 踩 0

2510.22266 2026-06-16 cs.LG cs.AI cs.CY 版本更新

A Multi-level Analysis of Factors Associated with Student Performance: A Machine Learning Approach to the SAEB Microdata

学生表现相关因素的多层次分析：基于SAEB微观数据的机器学习方法

Rodrigo Tertulino, Laércio Alencar

发表机构 * Federal Institute of Education, Science, and Technology of Rio Grande do Norte（巴西里约格朗德杜北教育、科学和技术联邦学院）

AI总结采用多级机器学习方法，利用SAEB微观数据中四类特征，通过随机森林模型以90.2%准确率分类学生水平，并借助SHAP解释发现学校平均社会经济水平是最强预测因子，表明学业表现是系统性现象。

Comments This article has been published in Discover Education (Springer Nature). The final authenticated version is available at:https://doi.org/10.1007/s44217-026-01699-0

详情

DOI: 10.1007/s44217-026-01699-0
Journal ref: Discover Education, 2026

AI中文摘要

识别影响基础教育学生表现的因素是巴西制定有效公共政策的核心挑战。本研究引入了一种多级机器学习方法，利用巴西基础教育评估系统（SAEB）的微观数据对九年级和高中学生的熟练程度进行分类。我们的模型独特地整合了四个数据源：学生社会经济特征、教师专业档案、学校指标和校长管理档案。对四种集成算法的比较分析证实了随机森林模型的优越性，该模型达到了90.2%的准确率和96.7%的曲线下面积（AUC）。为了超越预测，我们应用了基于SHAP的可解释人工智能（XAI），结果显示学校的平均社会经济水平是最主要的预测因子，表明系统性因素比孤立的个体特征影响更大。主要结论是，学业表现是一种与学校生态系统深度相关的系统性现象。本研究提供了一个数据驱动的、可解释的工具，以通过解决学校之间的差异来促进教育公平的政策制定。

英文摘要

Identifying the factors that influence student performance in basic education is a central challenge for formulating effective public policies in Brazil. This study introduces a multi-level machine learning approach to classify the proficiency of 9th-grade and high school students using microdata from the System of Assessment of Basic Education (SAEB). Our model uniquely integrates four data sources: student socioeconomic characteristics, teacher professional profiles, school indicators, and principal management profiles. A comparative analysis of four ensemble algorithms confirmed the superiority of a Random Forest model, which achieved 90.2% accuracy and an Area Under the Curve (AUC) of 96.7%. To move beyond prediction, we applied Explainable AI (XAI) using SHAP, which revealed that the school's average socioeconomic level is the most dominant predictor, demonstrating that systemic factors have a greater impact than individual characteristics in isolation. The primary conclusion is that academic performance is a systemic phenomenon deeply tied to the school's ecosystem. This study provides a data-driven, interpretable tool to inform policies aimed at promoting educational equity by addressing disparities between schools.

URL PDF HTML ☆

赞 0 踩 0

2511.05522 2026-06-16 eess.SP cs.AI 版本更新

AIRMap: AI-Generated Radio Maps for Wireless Digital Twins

AIRMap: 用于无线数字孪生的AI生成无线电地图

Ali Saeizadeh, Miead Tehrani-Moayyed, Davide Villa, J. Gordon Beattie, Pedram Johari, Stefano Basagni, Tommaso Melodia

发表机构 * VIAVI Solutions, Inc.（VIAVI解决方案公司）； National Telecommunications and Information Administration (NTIA)（国家电信与信息管理局）； U.S. National Science Foundation（美国国家科学基金会）

AI总结提出AIRMap深度学习框架，基于2D高程图通过U-Net自编码器实现超快速无线电地图估计，在4毫秒内达到低于4 dB RMSE的路径增益预测，比GPU加速射线追踪快100倍以上。

Comments 15 pages, 19 figures, This work has been accepted for publication on IEEE Transactions on Wireless Communications

详情

AI中文摘要

精确、低延迟的信道建模对于实时无线网络仿真和数字孪生应用至关重要。然而，像射线追踪这样的传统建模方法计算量大，不适合模拟动态条件。在本文中，我们提出了AIRMap，一个用于超快速无线电地图估计的深度学习框架，以及一个用于创建迄今为止最大无线电地图数据集的自动化流水线。AIRMap使用单输入U-Net自编码器，仅处理地形和建筑物高度的2D高程图。在120万波士顿区域样本上训练，并在四个具有不同地形和建筑密度的不同城市和农村环境中验证，AIRMap在NVIDIA L40S上每次推理在4毫秒内预测路径增益，RMSE低于4 dB——比基于GPU加速射线追踪的无线电地图快100倍以上。使用仅20%的现场测量数据进行轻量级校准，将中位误差降低到约5%，显著优于传统模拟器（误差超过50%）。集成到Colosseum仿真器和Sionna SYS平台中，与基于测量的信道相比，频谱效率和误块率几乎为零误差。这些发现验证了AIRMap在无线数字孪生中实现可扩展、准确和实时无线电地图估计的潜力。

英文摘要

Accurate, low-latency channel modeling is essential for real-time wireless network simulation and digital-twin applications. Traditional modeling methods like ray tracing are however computationally demanding and unsuited to model dynamic conditions. In this paper, we propose AIRMap, a deep-learning framework for ultra-fast radio-map estimation, along with an automated pipeline for creating the largest radio-map dataset to date. AIRMap uses a single-input U-Net autoencoder that processes only a 2D elevation map of terrain and building heights. Trained on 1.2M Boston-area samples and validated across four distinct urban and rural environments with varying terrain and building density, AIRMap predicts path gain with under 4 dB RMSE in 4 ms per inference on an NVIDIA L40S-over 100x faster than GPU-accelerated ray tracing based radio maps. A lightweight calibration using just 20% of field measurements reduces the median error to approximately 5%, significantly outperforming traditional simulators, which exceed 50% error. Integration into the Colosseum emulator and the Sionna SYS platform demonstrate near-zero error in spectral efficiency and block-error rate compared to measurement-based channels. These findings validate AIRMap's potential for scalable, accurate, and real-time radio map estimation in wireless digital twins.

URL PDF HTML ☆

赞 0 踩 0

2511.16681 2026-06-16 cs.CL cs.AI 版本更新

SPI: Query-Depth-Adaptive Indexing for Streaming RAG in Vector Databases

SPI：向量数据库中流式RAG的查询深度自适应索引

Dong Liu, Yanxuan Yu

发表机构 * Yale University（耶鲁大学）； Columbia University（哥伦比亚大学）

AI总结提出语义金字塔索引（SPI），通过多级分辨率组织和不确定性感知控制器实现查询深度自适应，支持流式插入和渐进式ANN搜索，在MS MARCO和Natural Questions上相比基线实现1.4-2.3倍延迟降低。

详情

AI中文摘要

向量数据库（VecDB）越来越多地部署在检索增强生成（RAG）管道中，其中查询处理和文档摄取同时发生。索引层需要提供低延迟搜索，同时在不频繁全局重建的情况下纳入新向量。现有的VecDB管道通常在统一表示机制下运行，尽管查询所需的语义粒度存在显著差异。这促使设计一种支持增量更新同时根据查询分布和复杂性调整检索深度的索引。我们提出**语义金字塔索引（SPI）**，一种VecDB层索引框架，将嵌入组织成$L$个语义对齐的分辨率级别，并通过轻量级不确定性感知控制器为每个查询选择检索深度。SPI支持渐进式粗到细ANN搜索、无需全局重建的逐级流式插入，以及通过LSH分区和异步gRPC协调的分布式执行。与具有固定遍历规则的分层ANN结构（例如SPANN）不同，SPI在查询时自适应分辨率，同时保持与FAISS和Qdrant后端的兼容性。在MS MARCO和Natural Questions上，在相同密集编码器系列下，SPI在Recall@10上具有竞争力且延迟更低，相对于可比较的近似ANN基线，在固定Recall@10目标下实现了**1.4-2.3倍**的平均检索延迟降低。一个最多8个节点的原型扩展研究显示吞吐量扩展了6.2倍（约73%效率）；为完整性包含了16节点配置，但显示出递减的效率。我们提供了top-$K$稳定性保证：具有足够检索裕度的查询在较浅层返回相同的top-$K$集合。代码和配置可从此https URL获取。

英文摘要

Vector databases (VecDBs) are increasingly deployed in retrieval-augmented generation (RAG) pipelines where query processing and document ingestion occur concurrently. The index layer needs to provide low-latency search while incorporating new vectors without frequent global rebuilding. Existing VecDB pipelines typically operate within a uniform representation regime, despite substantial variation in the semantic granularity required across queries. This motivates an index design that supports incremental updates while adapting retrieval depth to query distribution and complexity. We propose \textbf{Semantic Pyramid Indexing (SPI)}, a VecDB-layer indexing framework that organizes embeddings into $L$ semantically aligned resolution levels and selects retrieval depth per query via a lightweight uncertainty-aware controller. SPI supports progressive coarse-to-fine ANN search, level-wise streaming insertion without global rebuilds, and distributed execution through LSH partitioning with asynchronous gRPC coordination. Unlike hierarchical ANN structures with fixed traversal rules (e.g., SPANN), SPI adapts resolution at query time while remaining compatible with FAISS and Qdrant backends. On MS MARCO and Natural Questions, SPI achieves competitive Recall@10 with lower latency under the same dense encoder family, yielding a \textbf{1.4--2.3$\times$} average retrieval latency reduction under fixed Recall@10 targets relative to comparable approximate-ANN baselines. A prototype scaling study up to 8 nodes shows $6.2\times$ throughput scaling (${\approx}73\%$ efficiency); the 16-node configuration is included for completeness but shows diminishing efficiency. We provide a top-$K$ stability guarantee: queries with sufficient retrieval margin return an identical top-$K$ set at a shallower level. Code and configurations are available at https://github.com/FastLM/SPI_VecDB.

URL PDF HTML ☆

赞 0 踩 0

2512.07925 2026-06-16 cs.CV cs.AI 版本更新

Near--Real-Time Conflict-Related Fire Detection in Sudan Using Unsupervised Deep Learning

苏丹冲突相关火灾的近实时检测：基于无监督深度学习

Kuldip Singh Atwal, Dieter Pfoser, Daniel Rothbart

发表机构 * George Mason University（乔治·马歇尔大学）

AI总结提出轻量级VAE模型结合Planet Labs 4波段影像，在24-30小时内无监督检测苏丹冲突火灾区域，优于余弦距离、CVA和IR-MAD方法。

详情

DOI: 10.1016/j.srs.2026.100446
Journal ref: Science of Remote Sensing, Volume 13, 2026, 100446, ISSN 2666-0172

AI中文摘要

苏丹持续的武装冲突凸显了快速监测冲突相关火灾影响区域的必要性。深度学习和高频卫星影像的最新进展使得能够近实时评估战区活跃火灾和烧伤疤痕。本研究提出了一种近实时监测方法，使用轻量级变分自编码器（VAE）模型，结合空间分辨率3米的4波段Planet Labs影像。我们证明，在有利观测条件下，利用可获取的商业卫星数据，这些受影响区域可在约24至30小时内被检测到。为此，我们改编了一个最初为10波段影像设计的VAE模型，使其有效处理高分辨率4波段输入。模型以无监督方式训练，学习名义地表状态的紧凑潜在表示，并通过量化时间配对潜在嵌入之间的变化来识别燃烧特征。性能在苏丹的五个案例研究中评估，并与余弦距离、CVA和IR-MAD在精确率、召回率、F1分数以及时间配对影像块之间的精确率-召回率曲线下面积（AUPRC）上进行比较。结果表明，所提方法始终优于其他方法，在高度不平衡的火灾检测场景中实现了更高的召回率和F1分数，同时保持了可行的精确率。使用8波段影像和时间序列影像的实验相比单一4波段输入仅带来边际性能提升，突显了所提轻量级方法在可扩展的近实时冲突监测中的有效性。

英文摘要

Ongoing armed conflict in Sudan highlights the need for rapid monitoring of conflict-related fire-affected areas. Recent advances in deep learning and high-frequency satellite imagery enable near--real-time assessment of active fires and burn scars in war zones. This study presents a near--real-time monitoring approach using a lightweight Variational Auto-Encoder (VAE)--based model integrated with 4-band Planet Labs imagery at 3 m spatial resolution. We demonstrate that these impacted regions can be detected within approximately 24 to 30 hours under favorable observational conditions using accessible, commercially available satellite data. To achieve this, we adapt a VAE--based model, originally designed for 10-band imagery, to operate effectively on high-resolution 4-band inputs. The model is trained in an unsupervised manner to learn compact latent representations of nominal land-surface conditions and identify burn signatures by quantifying changes between temporally paired latent embeddings. Performance is evaluated across five case studies in Sudan and compared against cosine distance, CVA, and IR-MAD using precision, recall, F1-score, and the area under the precision-recall curve (AUPRC) computed between temporally paired image tiles. Results show that the proposed approach consistently outperforms the other methods, achieving higher recall and F1-scores while maintaining viable precision in highly imbalanced fire-detection scenarios. Experiments with 8-band imagery and temporal image sequences yield only marginal performance gains over single 4-band inputs, underscoring the effectiveness of the proposed lightweight approach for scalable, near--real-time conflict monitoring.

URL PDF HTML ☆

赞 0 踩 0

2512.22420 2026-06-16 cs.DC cs.AI 版本更新

人工智能时代的可持续材料发现

Sajid Mannan, Rupert J. Myers, Rohit Batra, Rocio Mercado, Lothar Wondraczek, N. M. Anoop Krishnan

发表机构 * Department of Civil and Environmental Engineering, Indian Institute of Technology Delhi（印度理工学院德里分校土木与环境工程系）； Department of Civil and Environmental Engineering, Imperial College London（帝国理工学院伦敦分校土木与环境工程系）； Department of Metallurgical and Materials Engineering, Indian Institute of Technology Madras（印度理工学院马德拉斯分校冶金与材料工程系）； Department of Computer Science and Engineering, Chalmers University of Technology & University of Gothenburg（查尔姆斯理工大学与哥德堡大学计算机科学与工程系）； Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi（印度理工学院德里分校亚里学校人工智能系）

AI总结本文提出ML-LCA框架，将上游机器学习材料发现与下游生命周期评估整合，通过信息提取、统一数据库、多尺度建模、制造路径预测和不确定性优化，实现性能与可持续性协同优化。

详情

AI中文摘要

人工智能（AI）已经改变了材料发现，通过生成模型和替代筛选实现了化学空间的快速探索。然而，当前用于材料发现的生成式AI模型（现已驱动对广阔化学和结构空间的探索）仅针对结构稳定性和功能特性优化候选材料，在设计循环的任何阶段均未整合环境评估。前瞻性和事前生命周期评估方法已经存在并应用于新兴技术，但它们作为独立的下游分析运行，而非作为生成或主动学习管道中的主动约束。结果是，即使产生了环境反馈，也是在设计决策做出之后才到达，而非为决策提供信息。原子尺度设计与生命周期评估（LCA）之间的脱节反映了根本性挑战：（i）跨异构源的数据稀缺，（ii）从原子到工业系统的尺度差距，（iii）合成路径的不确定性，以及（iv）缺乏同时优化性能与环境影响的框架。在这篇观点文章中，我们提出将上游ML辅助材料发现与下游LCA整合到ML-LCA框架中，该框架包含五个组成部分：用于构建材料-环境知识库的信息提取、将属性与可持续性指标关联的统一数据库、桥接原子属性与生命周期影响的多尺度模型、具有不确定性量化的制造路径集成预测，以及实现性能-可持续性同时导航的不确定性感知优化。涵盖聚合物、玻璃、光刻胶和水泥的案例研究既证明了必要性和可行性，也识别了材料特定的整合挑战。

英文摘要

Artificial intelligence (AI) has transformed materials discovery, enabling rapid exploration of chemical space through generative models and surrogate screening. Yet current generative AI models for materials discovery, which now drive exploration of vast chemical and structural spaces, optimize candidates exclusively for structural stability and functional properties, with no integration of environmental assessment at any stage of the design loop. Prospective and ex-ante life cycle assessment methods exist and have been applied to emerging technologies, but they operate as standalone downstream analyses, not as active constraints within generative or active-learning pipelines. The result is that environmental feedback, even when produced, arrives after design decisions have been made rather than informing them. The disconnect between atomic-scale design and lifecycle assessment (LCA) reflects fundamental challenges: (i) data scarcity across heterogeneous sources, (ii) scale gaps from atoms to industrial systems, (iii) uncertainty in synthesis pathways, and (iv) the absence of frameworks that co-optimize performance with environmental impact. In this Perspective, we propose integrating upstream ML-assisted materials discovery with downstream LCA into the ML-LCA framework, comprising five components: information extraction for building materials-environment knowledge bases, harmonized databases linking properties to sustainability metrics, multi-scale models bridging atomic properties to lifecycle impacts, ensemble prediction of manufacturing pathways with uncertainty quantification, and uncertainty-aware optimization enabling simultaneous performance-sustainability navigation. Case studies spanning polymers, glass, photoresists, and cement demonstrate both necessity and feasibility while identifying material-specific integration challenges.

URL PDF HTML ☆

赞 0 踩 0

2602.14710 2026-06-16 cs.IR cs.AI 版本更新

Orcheo: A Modular Full-Stack Platform for Conversational Search

Orcheo: 一个用于对话式搜索的模块化全栈平台

Shaojie Jiang, Svitlana Vakulenko, Maarten de Rijke

发表机构 * University of Amsterdam（阿姆斯特丹大学）； AI Colleagues（AI同事）； WU Vienna University of Economics and Business（维也纳经济与商业大学）

AI总结提出Orcheo开源平台，通过模块化架构、生产级基础设施和45+即用组件，解决对话式搜索研究中框架统一与原型部署的难题。

Comments Accepted to SIGIR 2026

详情

DOI: 10.1145/3805712.3808613

AI中文摘要

对话式搜索（CS）需要一个复杂的软件工程流水线，集成了查询重构、排序和响应生成。CS研究人员目前面临两个障碍：缺乏一个统一的框架来有效地与社区共享贡献，以及难以部署用于用户评估的端到端原型。我们介绍了Orcheo，一个旨在弥合这一差距的开源平台。Orcheo提供三个关键优势：（i）模块化架构通过单文件节点模块促进组件复用，便于CS研究中的共享和可重复性；（ii）生产级基础设施通过双执行模式、安全凭证管理和执行遥测弥合原型到系统的差距，内置AI编码支持降低学习曲线；（iii）入门工具包包括45多个现成组件，用于查询理解、排序和响应生成，能够快速启动完整的CS流水线。我们描述了框架架构，并通过强调模块化和易用性的案例研究验证了Orcheo的实用性。Orcheo在MIT许可下以开源形式发布于此https URL。

英文摘要

Conversational search (CS) requires a complex software engineering pipeline that integrates query reformulation, ranking, and response generation. CS researchers currently face two barriers: the lack of a unified framework for efficiently sharing contributions with the community, and the difficulty of deploying end-to-end prototypes needed for user evaluation. We introduce Orcheo, an open-source platform designed to bridge this gap. Orcheo offers three key advantages: (i) A modular architecture promotes component reuse through single-file node modules, facilitating sharing and reproducibility in CS research; (ii) Production-ready infrastructure bridges the prototype-to-system gap via dual execution modes, secure credential management, and execution telemetry, with built-in AI coding support that lowers the learning curve; (iii) Starter-kit assets include 45+ off-the-shelf components for query understanding, ranking, and response generation, enabling the rapid bootstrapping of complete CS pipelines. We describe the framework architecture and validate Orcheo's utility through case studies that highlight modularity and ease of use. Orcheo is released as open source under the MIT License at https://github.com/AI-Colleagues/orcheo.

URL PDF HTML ☆

赞 0 踩 0

2603.17531 2026-06-16 cs.CV cs.AI cs.CR 版本更新

Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing

Rel-Zero：利用补丁对不变性实现鲁棒的零水印以抵御AI编辑

Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Xiaojun Chen, Wu Liu, Weiping Wang

AI总结针对AI编辑对图像真实性的威胁，提出Rel-Zero零水印框架，利用编辑中补丁对关系距离的不变性，无需修改原图即可生成鲁棒水印，实验证明其优于现有方法。

Comments accepted to CVPR 2026

详情

AI中文摘要

近期基于扩散的图像编辑技术的进步对数字视觉内容的真实性构成了重大威胁。传统的基于嵌入的水印方法通常引入可察觉的扰动以保持鲁棒性，不可避免地损害视觉保真度。同时，现有的零水印方法通常依赖全局图像特征，难以抵御复杂的操作。在这项工作中，我们揭示了一个关键观察：尽管在基于AI的编辑过程中单个图像补丁发生显著变化，但补丁对之间的关系距离保持相对不变。利用这一特性，我们提出了关系零水印（Rel-Zero），一种新颖的框架，无需对原始图像进行任何修改，而是从这些编辑不变的补丁关系中推导出唯一的零水印。通过将水印基于内在的结构一致性而非绝对外观，Rel-Zero为内容认证提供了一种非侵入性且具有弹性的机制。大量实验表明，与先前的零水印方法相比，Rel-Zero在多种编辑模型和操作下实现了显著提升的鲁棒性。

英文摘要

Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we uncover a key observation: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remains relatively invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.

URL PDF HTML ☆

赞 0 踩 0

2604.00163 2026-06-16 cs.LG cs.AI cs.NE 版本更新

Epileptic Seizure Detection in Separate Frequency Bands Using Feature Analysis and Graph Convolutional Neural Network (GCN) from Electroencephalogram (EEG) Signals

基于特征分析和图卷积神经网络（GCN）的脑电图（EEG）信号癫痫发作检测在不同频段的研究

Ferdaus Anam Jibon, Fazlul Hasan Siddiqui, F. Deeba, Gahangir Hossain

AI总结提出一种频率感知框架，将EEG分解为五个频段并提取判别特征，利用图卷积神经网络建模电极空间依赖，在CHB-MIT数据集上实现99.01%的宽带准确率，提高了可解释性和诊断精度。

Comments One author disagrees with the archiving

详情

AI中文摘要

癫痫发作是一种神经系统疾病，其特征是大脑中异常和过度的电活动，导致反复发作事件。脑电图（EEG）信号因其能够捕捉时间和空间的神经动力学而被广泛用于癫痫诊断。虽然最近的深度学习方法取得了高检测准确率，但它们往往缺乏可解释性和神经生理学相关性。本研究提出了一种基于发作期EEG分析的频率感知框架用于癫痫发作检测。原始EEG信号被分解为五个频段（delta、theta、alpha、低beta和高beta），并从每个频段提取十一个判别特征。然后采用图卷积神经网络（GCN）对EEG电极之间的空间依赖性进行建模，电极表示为图节点。在CHB-MIT头皮EEG数据集上的实验表明，该方法在相应频段上分别达到了97.1%、97.13%、99.5%、99.7%和51.4%的准确率，总体宽带准确率为99.01%。结果突出了中频段的强判别能力，并揭示了特定频率的发作模式。与传统的宽带EEG方法相比，所提出的方法提高了可解释性和诊断精度。

英文摘要

Epileptic seizures are neurological disorders characterized by abnormal and excessive electrical activity in the brain, resulting in recurrent seizure events. Electroencephalogram (EEG) signals are widely used for seizure diagnosis due to their ability to capture temporal and spatial neural dynamics. While recent deep learning methods have achieved high detection accuracy, they often lack interpretability and neurophysiological relevance. This study presents a frequency-aware framework for epileptic seizure detection based on ictal-phase EEG analysis. The raw EEG signals are decomposed into five frequency bands (delta, theta, alpha, lower beta, and higher beta), and eleven discriminative features are extracted from each band. A graph convolutional neural network (GCN) is then employed to model spatial dependencies among EEG electrodes, represented as graph nodes. Experiments on the CHB-MIT scalp EEG dataset demonstrate high detection performance, achieving accuracies of 97.1%, 97.13%, 99.5%, 99.7%, and 51.4% across the respective frequency bands, with an overall broadband accuracy of 99.01%. The results highlight the strong discriminative capability of mid-frequency bands and reveal frequency-specific seizure patterns. The proposed approach improves interpretability and diagnostic precision compared to conventional broadband EEG-based methods.

URL PDF HTML ☆

赞 0 踩 0

2604.27128 2026-06-16 cs.CV cs.AI 版本更新

Frontier: 向全面且准确的LLM推理模拟迈进

Yicheng Feng, Xin Tan, Yangtao Deng, Yimin Jiang, Yibo Zhu, Hong Xu

发表机构 * The Chinese University of Hong Kong（香港中文大学）； Anuttacon ； StepFun

AI总结本文提出Frontier，一种用于现代LLM推理服务的离散事件模拟器，通过离散化抽象和对关键运行时优化的建模，实现了对复杂工作负载的准确预测，从而在不同服务场景中提供更精确的计算、通信和内存成本预测。

详情

AI中文摘要

现代LLM服务已不再是单一或整体的。生产系统现在结合了解耦执行、复杂并行性、运行时优化和状态化工作负载，如推理、代理和RL展开。模拟对于探索这个快速增长的设计空间具有吸引力，但现有模拟器缺乏所需的架构完整性和决策级精度。它们的单体-副本抽象不适合解耦服务，而平均情况分析代理可能会扭曲SLA预测甚至逆转优化结论。我们提出了Frontier，一种用于现代LLM推理服务的离散事件模拟器。Frontier具有解耦抽象。它通过建模共置、预填解码解耦（PDD）和注意力-前馈网络解耦（AFD）与角色特定的集群工作者，捕捉现代服务系统的结构和动态。它在调度器-批次引擎循环中整合关键运行时优化（例如CUDA图、推测解码），并支持新兴工作负载的状态请求。它进一步提供了在多样化服务场景中对计算、通信和内存成本的准确且可推广的预测。在16-H800 GPU测试平台上，Frontier实现了平均吞吐量误差低于4%。与最先进的模拟器相比，它在共置情况下将端到端延迟误差从44.9%降低到6.4%，在解耦情况下从51.7%降低到2.6%。它扩展到超过1000个GPU在商用CPU上，并启用了新的用例，如依赖SLA的帕累托前沿探索、异构解耦分配、代理推理调度验证和RL后训练重配置。

英文摘要

Modern LLM serving is no longer homogeneous or monolithic. Production systems now combine disaggregated execution, complex parallelism, runtime optimizations, and stateful workloads such as reasoning, agents, and RL rollouts. Simulation is attractive for exploring this growing design space, yet existing simulators lack the architectural completeness and decision-grade fidelity it demands. Their monolithic-replica abstractions are ill-suited to disaggregated serving, while average-case analytical proxies can distort SLA predictions and even reverse optimization conclusions. We present Frontier, a discrete-event simulator for modern LLM inference serving. Frontier features a disaggregated abstraction. It captures the structure and dynamics of modern serving systems by modeling co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers, incorporating key runtime optimizations (e.g., CUDA Graphs, speculative decoding) within the scheduler-batch-engine loop, and supporting stateful requests for emerging workloads. It further provides accurate and generalizable predictions of computation, communication, and memory costs across diverse serving scenarios with complex workload compositions. On 16-H800 GPU testbed, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables new use cases such as SLA-dependent Pareto frontier exploration, heterogeneous disaggregated allocation, agentic reasoning scheduling validation, and RL post-training reconfiguration. We release Frontier at https://github.com/NetX-lab/Frontier.

URL PDF HTML ☆

赞 0 踩 0

2605.21629 2026-06-16 cs.CY cs.AI cs.HC 版本更新

Faster Completion, Less Learning: Generative AI Reduced Study Time on Math Problems and the Knowledge They Build

更快完成，更少学习：生成式AI减少了学生在数学问题及所构建知识上的学习时间

Sina Rismanchian, Hasan Uzun, Jeffrey Matayoshi, Eric Cosyn, Eyad Kurd-Misto

发表机构 * University of California, Irvine（加州大学尔湾分校）； McGraw Hill（麦格劳-希尔）

AI总结本研究探讨生成式AI如何影响学生的学习过程和学习成果，通过分析大量学习互动数据，发现AI使用导致学生在可被AI处理的问题上学习时间减少，但这种效率提升在监考情况下消失，揭示了AI对学习行为和知识构建的深远影响。

详情

AI中文摘要

How much have students' ordinary learning processes shifted in response to generative AI, and how does that affect their durable learning outcomes? Self-report surveys show little change, while small-scale behavioral studies report widespread AI use without the scale or duration to measure learning consequences. We address both questions using a ten-year panel of $3.2$ million ALEKS learning interactions for the time-on-task analysis, complemented by ALEKS PPL placement-assessment data for the proctoring and retention analyses, with a quasi-experimental design exploiting within-curriculum variation in AI susceptibility: text-based word problems transcribable into AI prompts serve as the treated group; graph-based problems requiring interactive platform manipulation as the comparison. Learning time on AI-susceptible problems declines $2.8\%$ per quarter among college students after ChatGPT's release, cumulating to $26.9\%$ over eleven quarters; high-schoolers show $31.3\%$, middle-schoolers $9.0\%$, and Grade 5 students no detectable change. The divergence vanishes entirely under proctoring for college students, making general efficiency gains unlikely. Logistic fixed-effects models on randomly assigned proctored retention items yield a $25\%$ cumulative decline in odds of correct response; the same estimator on non-proctored assessment produces a large opposite-signed increase -- inconsistent with any platform, cohort, or curriculum explanation. These results are among the first large-scale behavioral and outcome evidence that generative AI has altered how students study and the knowledge they build -- the population-level indicator of \emph{cognitive surrender}, with direct implications for educational research, assessment governance, and AI policy.

英文摘要

How much have students' ordinary learning processes shifted in response to generative AI, and how does that affect their durable learning outcomes? Self-report surveys show little change, while small-scale behavioral studies report widespread AI use without the scale or duration to measure learning consequences. We address both questions using a ten-year panel of $3.2$ million ALEKS learning interactions for investigating time-on-task, complemented by ALEKS PPL placement-assessment data for examining proctoring and learning outcomes, with a quasi-experimental design exploiting variation in tasks that are more susceptible to AI (text-based word problems) and less susceptible to AI (interactive graph-based problems). Learning time on AI-susceptible problems declines $2.8\%$ per quarter among college students after ChatGPT's release, cumulating to $26.9\%$ over eleven quarters; high-schoolers show $31.3\%$, middle-schoolers $9.0\%$, and Grade 5 students no detectable change. Among college students, the post-ChatGPT divergence vanishes entirely under proctoring, ruling out broad efficiency gains as the likely explanation. Logistic fixed-effects models on randomly assigned proctored retention items yield a $25\%$ cumulative decline in odds of correct response; the same estimator on non-proctored assessment produces a large opposite-signed increase -- inconsistent with any platform, cohort, or curriculum explanation. These results are among the first large-scale behavioral and outcome evidence that generative AI has altered how students study and the knowledge they build -- the population-level indicator of \emph{cognitive surrender}, with direct implications for educational research, assessment governance, and AI policy.

URL PDF HTML ☆

赞 0 踩 0

2605.27599 2026-06-16 cs.LG cs.AI cs.AR cs.DC cs.PF 版本更新

The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution

能源盲点：NVIDIA 旗舰边缘 AI 硬件无法支持进程级能源归因

Deepak Panigrahy, Aakash Tyagi

发表机构 * Independent Researcher（独立研究者）； Texas A&M University（德克萨斯农工大学）

AI总结本文审计了 ASUS Ascent GX10 (GB10 SoC) 平台的能源可观测性，发现其缺乏 CPU 能源计数器等关键接口，导致无法像 x86 的 RAPL 那样进行进程级能源归因，并提出通过外部直流计量和 GPU 减法进行校准的临时方案，呼吁将能源可观测性作为硬件的一等要求。

详情

AI中文摘要

代理型 AI 工作负载——其中单个用户目标触发多步编排、工具调用、重试和故障恢复——正被瞄准用于边缘部署，NVIDIA、戴尔、惠普、华硕、微星、宏碁和技嘉都将在 2026 年出货基于 GB10 的桌面 AI 系统。我们最近证明，编排结构主导了代理型能源成本，工作流每个成功目标消耗的能源是线性基线的 4.33 倍，而多步推理任务的 OOI 达到 7.63 倍。另外，Rajat 等人表明，在代理型工作负载中，CPU 端处理占总延迟的 90.6%，占总动态能源的 44%。我们报告了对 ASUS Ascent GX10 (GB10 SoC) 的系统性能源可观测性审计，发现该平台通过任何支持的软件接口都不暴露 CPU 能源计数器、INA 电源轨监视器、IPMI/BMC 和 SCMI powercap 协议。唯一的设备上能源遥测是通过 NVML 的瞬时 GPU 功率。我们进一步发现，联发科固件已经通过未记录的 ACPI 接口 (SPBM) 在内部计算每轨能源，但 NVIDIA 表示“没有计划暴露 CPU 轨信息”。因此，通过支持的接口，无法在此平台上重现像 x86 通过 RAPL 执行的设备上每进程能源归因。我们形式化了能源归因 AI 的硬件需求规范，提出了使用外部直流计量结合 GPU 减法的临时校准桥接，并确定了通过 SCMI powercap 的标准轨道路径。我们的发现激励低碳计算社区将能源可观测性作为硬件的头等要求。

英文摘要

Agentic AI workloads - where a single user goal triggers multi-step orchestration, tool calls, retries, and failure recovery - are being targeted for edge deployment, with NVIDIA, Dell, HP, ASUS, MSI, Acer, and Gigabyte all shipping GB10-based desktop AI systems in 2026. We recently demonstrated that orchestration structure dominates agentic energy cost, with workflows consuming 4.33x more energy per successful goal than linear baselines and OOI reaching 7.63x for multi-step reasoning tasks. Separately, Raj et al. show that CPU-side processing accounts for up to 90.6% of total latency and 44% of total dynamic energy in agentic workloads. We report a systematic energy-observability audit of the ASUS Ascent GX10 (GB10 SoC) and find that the platform exposes no CPU energy counter, no INA power-rail monitor, no IPMI/BMC, and no SCMI powercap protocol through any supported software interface. The only on-device energy telemetry is instantaneous GPU power via NVML. We further discover that the MediaTek firmware already computes per-rail energy internally via an undocumented ACPI interface (SPBM), but NVIDIA states there are "no plans to expose CPU rail information." On-device per-process energy attribution - as performed on x86 via RAPL - is therefore not reproducible on this platform through supported interfaces. We formalize a hardware requirements specification for energy-attributed AI, propose an interim calibration bridge for per-domain energy decomposition - confirmed on the Acer Veriton GN100 where CPU energy accumulators are live - and identify a standards-track path via SCMI powercap. Our findings motivate the low-carbon computing community to demand energy observability as a first-class hardware requirement.

URL PDF HTML ☆

赞 0 踩 0

2605.30208 2026-06-16 cs.SE cs.AI 版本更新

Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency

自动化低风险代码审查在Meta：RADAR、风险校准与审查效率

Chris Adams, Arjun Singh Banga, Parveen Bansal, Souvik Bhattacharya, Payal Bhuptani, Rujin Cao, Pedro Canahuati, Nate Cook, Brian Ellis, Prabhakar Goyal, Gurinder Grewal, Tianyu He, Matt Labunka, Alex Manners, David Molnar, Ging Cee Ng, Vishal Parekh, Jiefu Pei, Frederic Sagnes, James Saindon, Will Shackleton, Sid Sidhu, Gursharan Singh, Karthik Chengayan Sridhar, Matt Steiner, Pratibha Udmalpet, Sean Xia, Stacey Yan, Audris Mockus, Peter Rigby, Nachiappan Nagappan

发表机构 * Meta USA, UK, Canada（Meta美国、英国、加拿大）

AI总结提出RADAR系统，通过多阶段漏斗对代码差异进行风险分层自动化审查，在Meta部署后显著提升审查效率并降低风险。

详情

AI中文摘要

AI辅助编码工具改变了软件生产。在Meta，每人工提交的代码行数同比增长105.9%，每位开发者的提交量增长51%，其中代理AI贡献了超过80%的增长。与此同时，获得及时审查的提交比例下降，暴露出代码供应与审查带宽之间的差距。我们提出三个问题，从可行性到校准再到影响：（1）风险分层的自动化能否在不同组织中大规模运行，（2）调整风险阈值如何影响自动化产出与安全性之间的权衡，（3）自动化审查在多大程度上减少AI生成变更的端到端延迟？我们部署了RADAR（风险感知差异自动审查），一个多阶段漏斗，根据作者和源类型对每个差异进行分类，应用资格门控、静态启发式、机器学习差异风险评分、基于LLM的自动化代码审查，以及在落地合格变更前的确定性验证。我们通过覆盖535K+个RADAR审查差异的遥测、政策变更的前后观察比较以及效率结果的差异分析来评估RADAR。RADAR已审查535K+个差异并落地331K+个。将差异风险评分阈值从第25百分位放宽到第50百分位，批准率提高到60.31%。RADAR审查差异的回滚率是非RADAR差异的1/3，生产事故率是非RADAR差异的1/50。RADAR将中位关闭时间减少超过330%，中位差异审查墙时间减少35%。风险感知的分层自动化可以显著减少由AI驱动的代码增长造成的审查瓶颈，同时不损害生产安全。

英文摘要

AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type, applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the Production Incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.

URL PDF HTML ☆

赞 0 踩 0

2606.01613 2026-06-16 cs.IR cs.AI cs.MA 版本更新

AI素养工作坊成果与深度伪造参与中的性别差异

Jake Renzella, Christian Bergh, Natasha Banks, Alexandra Vassar

发表机构 * University of New South Wales（新南威尔士大学）

AI总结本研究通过统计回归分析澳大利亚中学生AI素养工作坊前后数据，发现男性在STEM职业兴趣上显著更高，女性更常使用AI工具，且工作坊后女性在AI知识和职业兴趣上提升更大，部分缩小了性别差距。

详情

AI中文摘要

随着人工智能（AI）素养倡议在K-12教育中的扩展，理解性别如何影响学生的基础认知、工具使用以及对干预措施的反应，对于公平的课程设计至关重要。本研究考察了来自两所男女同校公立学校的澳大利亚中学生（7、8和10年级；前测N=199，后测N=136）在参加为期一天的AI素养工作坊后，在AI素养、安全意识和STEM职业抱负方面的性别差异。使用控制年级和学校的统计回归方法，我们发现：工作坊前，男性学生在AI、计算机科学和工程三个领域的STEM职业兴趣均显著更高，而女性学生更可能将AI用于学业任务并向AI工具寻求建议。深度伪造行为中也出现了性别差异模式：男性更可能创建或分享深度伪造内容。干预后，男女学生的AI知识均有所提升，但女性表现出更丰富的进步：更广泛的概念理解、更高的自信心以及AI和计算机科学职业兴趣的显著增长，部分缩小了STEM性别差距。这些发现强调了开发性别响应型AI课程的必要性，特别是针对男性学生的深度伪造安全教育，并表明即使是单日工作坊也能缩小STEM抱负和AI信心方面的性别差距。

英文摘要

As Artificial Intelligence (AI) literacy initiatives expand in K-12 settings, understanding how gender shapes student baseline perceptions, tool-use, and responsiveness to interventions is essential for equitable curriculum design. This study examines gender differences in AI literacy, safety awareness, and STEM career aspirations among Australian secondary students (Years 7, 8, and 10; N(pre) = 199, n(post) = 136) from two co-educational government schools who participated in a one-day AI literacy workshop. Using statistical regression methods controlling for year level and school, we found that pre-workshop, male students reported significantly higher STEM career interest across all three domains (AI, computer science, and engineering), while female students were significantly more likely to use AI for schoolwork and to seek advice from AI tools. Gender-differentiated patterns also emerged in deepfake behaviours: males were significantly more likely to have created or shared deepfake content. Both genders improved in AI knowledge post-intervention, yet females showed a richer profile of gains: wider conceptual understanding, greater confidence, and meaningful increases in AI and CS career interest that partially narrowed the gender STEM gap. These findings highlight the need for gender-responsive AI curricula, particularly deepfake safety education for male students, and demonstrate that even single-day workshops can narrow gender gaps in STEM aspirations and AI confidence.

URL PDF HTML ☆

赞 0 踩 0

2606.14742 2026-06-16 q-bio.NC cs.AI cs.HC 交叉投稿

Do Large Language Models Have Emotions?

大型语言模型有情感吗？

Amit Goldenberg, James J. Gross

发表机构 * Harvard Business School（哈佛商学院）； Department of Psychology, Harvard University（哈佛大学心理学系）； Harvard University, Digital, Data and Design Institute（哈佛大学数字、数据与设计研究所）； Department of Psychology, Stanford University（斯坦福大学心理学系）

AI总结本文评估Anthropic声称Claude Sonnet 4.5具有“功能性情感”的说法，从生物情感功能角度分析，指出其部分支持情境解释功能，但缺乏动态重组能力。

详情

AI中文摘要

大型语言模型有情感吗？Anthropic最近的一篇论文报告在Claude Sonnet 4.5中发现了情感概念的内部表征，并得出结论认为该LLM具有“功能性情感”。我们根据已知的生物系统中情感实际运作方式评估了这一说法。我们认为情感具有两个核心功能：对情境进行情境敏感的解释，以及根据这些解释跨多个系统重组处理过程。Anthropic的发现为第一个功能提供了部分支持，尽管在Claude中识别出的持续、离散的情感表征与情感神经科学的发现（即人类情感以可变而非统一的神经特征为特征）不太吻合。关于第二个功能，证据不一：Claude的表征调节输出，但没有产生定义生物系统情感的注意力、决策速度和动机状态的动态重组。最后，我们提出了LLM要拥有情感所需的条件。

英文摘要

Do LLMs have emotions? A recent paper from Anthropic reports finding internal representations of emotion concepts in Claude Sonnet 4.5, concluding that the LLM has 'functional emotions.' We evaluate this claim against what is known about how emotions actually function in biological systems. We argue that emotions serve two core functions: the context-sensitive interpretation of situations, and the reorganization of processing across multiple systems in response to those interpretations. The Anthropic findings offer partial support for the first function, though the consistent, discrete emotional representations identified in Claude sit uneasily with affective neuroscience findings that human emotion is characterized by variable rather than uniform neural signatures. On the second function, the evidence is mixed: Claude's representations modulate output without producing the dynamic reorganization of attention, decision speed, and motivational state that defines emotion in biological systems. We close by proposing what it would take for an LLM to have emotions.

URL PDF HTML ☆

赞 0 踩 0

2606.14769 2026-06-16 econ.EM cs.AI cs.GT 交叉投稿

Agentomics: Economic Foundations for the Valuation, Attribution, and Pricing of AI Agents in Human-AI Workflows

Agentomics：人机协作工作流中AI代理的估值、归因和定价的经济基础

Quanyan Zhu

发表机构 * Department of Electrical and Computer Engineering, NYU Tandon School of Engineering（纽约大学Tandon工程学院电气与计算机工程系）

AI总结提出Agentomics框架，基于工作流模型将AI部署视为联盟形成问题，使用Shapley值进行经济盈余归因，实现AI代理的估值、归因和定价。

详情

AI中文摘要

代理型AI系统越来越多地被部署为组织工作流中的生产资源，然而现有的评估方法主要衡量孤立的技术性能而非经济贡献。本文引入了\emph{Agentomics}，一个基于工作流的框架，用于对人类和人工代理进行估值、归因和定价。该框架将工作流建模为异构代理的配置，其集体绩效决定了总价值、部署成本、可靠性和预期故障损失。工作流价值被视为团队层面的量，可能包括互补性、替代效应、瓶颈和非线性生产；可加的阶段级价值仅是一个特例。基于此工作流模型，本文将AI部署表述为一个联盟形成问题，并将联盟价值定义为相对于基准人类工作流所产生的增量净剩余。然后使用Shapley值在参与的AI代理之间分配经济盈余，从而在估值、问责和市场定价之间建立原则性联系。由此产生的Shapley定价均衡为评估代理价格是否反映预期边际贡献提供了规范基准。一个安全运营案例研究说明了该框架如何解释混合人机工作流中的生产力提升、部署成本、可靠性损失和联盟级互补性。

英文摘要

Agentic AI systems are increasingly being deployed as productive resources in organizational workflows, yet existing evaluation methods primarily measure isolated technical performance rather than economic contribution. This paper introduces \emph{Agentomics}, a workflow-based framework for valuing, attributing, and pricing human and artificial agents. The framework models a workflow as a configuration of heterogeneous agents whose collective performance determines gross value, deployment cost, reliability, and expected failure loss. Workflow value is treated as a team-level quantity that may include complementarities, substitution effects, bottlenecks, and nonlinear production; additive stage-level value is only a special case. Building on this workflow model, the paper formulates AI deployment as a coalition-formation problem and defines coalition value as the incremental net surplus generated relative to a benchmark human workflow. The Shapley value is then used to attribute economic surplus among participating AI agents, yielding a principled connection among valuation, accountability, and market pricing. The resulting Shapley pricing equilibrium provides a normative benchmark for assessing whether agent prices reflect expected marginal contribution. A security-operations case study illustrates how the framework accounts for productivity gains, deployment costs, reliability losses, and coalition-level complementarities in hybrid human--AI workflows.

URL PDF HTML ☆

赞 0 踩 0

2606.15348 2026-06-16 q-bio.NC cs.AI 交叉投稿

Intrinsic Computational Functionalism and Simulated Consciousness

内在计算功能主义与模拟意识

Ryota Kanai, Shuqin Ma

发表机构 * Araya Inc.（Araya公司）； School of Philosophy, Fudan University（复旦大学哲学学院）； Sussex Centre for Consciousness Science, University of Sussex（Sussex大学意识科学中心）

AI总结本文从内在计算功能主义出发，提出机制丰富的规范结构，论证若意识是计算构成的，则任何满足内在因果-计算实现关系的系统（生物、人工或模拟）都实现相同的意识相关属性。

详情

AI中文摘要

对人工或模拟意识的一个常见反对意见是，模拟的大脑并不比模拟的水更湿。我们从内在计算功能主义（ICF）的角度来回应：如果意识是由计算构成的，那么它不依赖于外部强加的描述，而是依赖于系统凭借其自身的因果-动力学组织所物理实现的计算结构。在之前的工作中，我们将规范功能主义发展为此反解释主义纲领的一个数学精确的特例，通过固定接口下的完整未来输入-输出角色来识别功能状态。这里我们论证，这种输入-输出构造虽然重要，但并不完整：作为ICF的一个行为边界情况，它使得查找表和展开的系统在规范上等价，只要它们保持相同的边界行为。一个与意识相关的规范表示必须转而包含属于相关内在组织的内部机制、干预和联合读出。因此，我们定义了一个机制丰富的规范结构，并用它来制定内在因果-计算实现（ICCR），这是一种保持物理实现、内在状态个体化、转移结构、干预轮廓以及相关主体-身体-世界边界的实现关系。核心结果是条件性的：如果意识属性是内在因果-计算组织的不变量，那么任何满足ICCR的系统都实现相同的意识相关属性，无论是生物的、人工的还是模拟的。我们讨论了包括生物自然主义和整合信息理论在内的反对意见。我们得出结论，要否认模拟具有意识，必须识别出模拟未能实现的与意识相关的内在因果-计算结构。

英文摘要

A common objection to artificial or simulated consciousness is that a simulated brain is no more conscious than simulated water is wet. We address this from the perspective of Intrinsic Computational Functionalism (ICF): if consciousness is computationally constituted, it depends not on externally imposed descriptions but on the computational structures a system physically realizes in virtue of its own causal-dynamical organization. In previous work we developed Canonical Functionalism as a mathematically precise special case of this anti-interpretivist program, identifying functional states by their complete future input-output roles under a fixed interface. Here we argue that this input-output construction, though important, is incomplete: as a behavioral boundary case of ICF, it makes lookup tables and unfolded systems that preserve the same boundary behavior canonically equivalent. A consciousness-relevant canonical representation must instead include internal mechanisms, interventions, and joint readouts belonging to the relevant intrinsic organization. We therefore define a mechanism-enriched canonical structure and use it to formulate Intrinsic Causal-Computational Realization (ICCR), a realization relation preserving physical implementation, intrinsic state individuation, transition structure, intervention profiles, and the relevant agent-body-world boundary. The central result is conditional: if conscious properties are invariants of intrinsic causal-computational organization, then any system satisfying ICCR realizes the same consciousness-relevant properties, whether biological, artificial, or simulated. We discuss objections including biological naturalism and integrated information theory. We conclude that to deny consciousness to a simulation, one must identify a consciousness-relevant intrinsic causal-computational structure that the simulation fails to realize.

URL PDF HTML ☆

赞 0 踩 0

2606.15358 2026-06-16 cs.HC cs.AI 交叉投稿

Cognitive Trajectory Modeling: Quantifying Human-AI Co-Creation through Cognitively Grounded Interaction Trajectories

认知轨迹建模：通过基于认知的交互轨迹量化人机共创

Nicholas Davis

发表机构 * Co-Creative AI Consulting（协同AI咨询）

AI总结提出认知轨迹建模（CTM）理论，通过认知轨迹和吸引子景观量化人机共创中的交互动态，区分认知轨迹与交互痕迹，为研究共创AI和人类-AI交互提供框架。

详情

AI中文摘要

共创AI研究日益寻求能够表征交互动态随时间演变的方法。虽然许多现有方法关注可观察的交互特征、交互度量、行为编码方案或活动痕迹，但这些方法往往难以捕捉高阶交互动态，包括协作过程如何随时间重组、稳定、调节和演变。本文引入认知轨迹建模（CTM）作为交互动态的认知理论，将认知、交互和创造过程概念化为在具有认知意义的吸引子景观中展开的时间组织轨迹。CTM建立在创造力生成模型和创造性意义建构（CSM）的理论基础上，重新审视意义建构曲线和认知轨迹在表征共创交互动态中的作用。我们通过认知轨迹原理形式化这一视角，该原理指出，只有当时间表示的基础状态具有方向性认知意义时，它们才在理论上可解释为认知轨迹。基于此原理，CTM将认知轨迹的概念推广到任何特定编码方案之外，并提供了一个更广泛的框架，用于通过在有意义的吸引子景观中展开的轨迹来建模交互动态。我们进一步区分认知轨迹与交互痕迹，并将CTM置于更广泛的认知、交互和领域动态层次结构中。更广泛地说，我们认为理解共创系统需要能够建模认知和交互动态随时间演变的方法。CTM为研究共创AI和人机交互中的交互动态提供了基础。

英文摘要

Co-creative AI research increasingly seeks methods capable of representing how interaction dynamics evolve through time. While many existing approaches focus on observable interaction characteristics, interaction metrics, behavioral coding schemes, or activity traces, these methods often struggle to capture higher-order interaction dynamics, including how collaborative processes reorganize, stabilize, regulate, and evolve through time. This paper introduces Cognitive Trajectory Modeling (CTM) as a cognitive theory of interaction dynamics that conceptualizes cognition, interaction, and creative processes as temporally organized trajectories unfolding across cognitively meaningful attractor landscapes. CTM builds upon the theoretical foundations of the Enactive Model of Creativity and Creative Sense-Making (CSM), revisiting the role of sense-making curves and cognitive trajectories in representing co-creative interaction dynamics. We formalize this perspective through the Cognitive Trajectory Principle, which states that temporal representations are only theoretically interpretable as cognitive trajectories when their underlying states possess directional cognitive meaning. Building on this principle, CTM generalizes the notion of cognitive trajectories beyond any particular coding scheme and provides a broader framework for modeling interaction dynamics through trajectories unfolding across meaningful attractor landscapes. We further distinguish cognitive trajectories from interaction traces and situate CTM within a broader hierarchy of cognitive, interaction, and domain dynamics. More broadly, we argue that understanding co-creative systems requires methods capable of modeling how cognition and interaction dynamics unfold through time. CTM provides a foundation for studying interaction dynamics across co-creative AI and human-AI interaction.

URL PDF HTML ☆

赞 0 踩 0

2606.15535 2026-06-16 cs.PF cs.AI 交叉投稿

MADAR: An Address-Free Processor

MADAR：一种无地址处理器

Mohamed Amine Bergach

发表机构 * Illumina San Diego California USA（Illumina圣地亚哥加州美国）

AI总结提出无地址处理器MADAR，通过环形槽结构消除传统寻址机制，实现编译时调度的数据流计算，在AI加速中能效随归约规模增长保持恒定。

详情

AI中文摘要

在现代处理器中，计算是廉价的部分。大部分面积和能量消耗在“寻址”上——将操作数移入和移出寄存器文件和缓存，并运行标签、端口、缺失队列和旁路网络，以找到值被存放的位置。MADAR通过废除地址来消除这些机制。所有状态在环形槽中循环，每个时钟周期前进一个位置；指令和数据位于相同的槽中；一个值通过其在轨道中的位置——一个坐标——来命名，而不是通过地址；固定站点在编译时设定的调度上，当循环指令经过其操作数时进行计算；一系列周期递增的环形层级取代了缓存层级，它们之间的移动由调度触发而非缺失触发。没有先前的循环存储、数据流或静态调度机器同时具备这四个特点。我们定义了执行模型，在周期精确的寄存器传输级实现中验证了它，展示了它是可编译的——一个构造性调度器发出程序并与实现交叉检查——并用一阶能量模型评估了其代价。其收益在AI加速中最为明显：每个矩阵乘法和卷积核心的乘累加操作编译成流式形式，其每操作能量随归约规模增长保持恒定，而矩阵乘法高效所需的操作数重用由环形周期层级承载——内存层级通过旋转实现缓存通过标签完成的功能。MADAR为任何在程序运行前已知数据移动的计算提供了一个新的设计点。

英文摘要

In a modern processor, computing is the cheap part. Most of its area and energy go to \emph{addressing} -- moving operands to and from a register file and cache, and running the tags, ports, miss queues, and bypass networks that find a value where it was left. MADAR deletes that machinery by abolishing the address. All state circulates in rings of slots that advance one position per clock; instructions and data ride in the same slots; a value is named by its place in an orbit -- a \rp{} coordinate -- not by an address; a fixed station computes when a circulating instruction sweeps past its operands, on a schedule set at compile time; and a hierarchy of rings of increasing period replaces the cache hierarchy, movement between them scheduled rather than triggered by a miss. No prior circulating-store, dataflow, or statically scheduled machine combines all four of these. We define the execution model, validate it in a cycle-accurate register-transfer-level implementation, show it \emph{compilable} -- a constructive scheduler emits programs cross-checked against the implementation -- and price it with a first-order energy model. The payoff is clearest for AI acceleration: the multiply-accumulate at the heart of every matmul and convolution compiles to a streaming form whose energy per operation stays flat as the reduction grows, and the operand reuse that makes matrix multiplication efficient is carried by the ring-period hierarchy -- the memory hierarchy doing by rotation what a cache does by tags. MADAR is a new design point for any computation whose data movement is known before the program runs.

URL PDF HTML ☆

赞 0 踩 0

2606.15712 2026-06-16 cs.CR cs.AI cs.MA 交叉投稿

Odds Law: The Decomposition Algebra On How Intelligence Organizes Itself to Solve Difficult Problems Reliably

Odds Law: 智能如何组织自身以可靠解决难题的分解代数

Hidayet Aksu

发表机构 * GitHub

AI总结本文提出分解代数，研究不可靠基本求解器如何组合成可靠复合求解器，证明验证几率定律、可靠性放大定理和阈值二分法，并揭示自组织是单调改进算子的最小不动点。

Comments 10 pages, 2 figures

详情

AI中文摘要

我们提出一个结构性问题：给定不可靠的基本问题求解器，它们的何种组织能够可靠地解决难题，其极限是什么？我们发展了一个分解代数：基本求解器是随机范畴中的态射，四个组合子（顺序组合、并行集成、验证门控和递归约简）生成复合求解器的空间。我们为该代数配备了两个同态：一个可靠性估值（取值于有序幺半群$([0,1],\le)$）和一个成本估值（取值于交换半环），并推导出控制可靠性如何通过结构流动的组合法则。我们的核心结果是：(i) 验证几率定律（本文命名的结果），表明验证门将正确几率乘以验证者的似然比$Λ$，因此$k$个条件独立的门产生几何放大；(ii) 可靠性放大定理，给出当$Λ>1$时，在$O(\log 1/δ)$的验证深度下达到目标可靠性$1-δ$；(iii) 阈值二分法：在临界参数之上，可靠性可以以对数成本任意接近1，而在临界参数或以下则无法放大。然后我们证明，自组织是策略完全格上单调改进算子的最小不动点，并且该不动点使单位成本的边际对数几率增益相等。最后，我们证明匹配的极限：信息上限将每门放大限制为一个散度量；共享错误原因产生严格正投票下限，因此多样性是无限放大的必要条件。简而言之，可靠性既不是免费的也不是神奇的：它是由独立信息购买、通过组合安排，并由验证者限制的。

推理即模式匹配：人类与LLM日常推理中的共享机制

Zach Studdiford, Gary Lupyan

发表机构 * University of Wisconsin–Madison（威斯康星大学麦迪逊分校）

AI总结研究通过比较人类和25个LLM在日常因果推理中的错误模式，发现两者均表现出模式匹配而非抽象世界模型驱动的推理，并识别出LLM中驱动响应的注意力头可预测人类推理错误。

Comments 13 pages main text, 51 pages supplementary text

详情

AI中文摘要

当大型语言模型（LLM）在推理中无法泛化或出现随意错误时，这通常被视为LLM并非真正推理，而是执行某种模式匹配的证据。其隐含意思是，人类行为不会表现出相同类型的失败，因为人类推理使用原则性的抽象世界模型。我们评估了人类参与者和25个LLM在各种日常情境中进行常识推理的能力，并在人和模型中观察到类似的错误模式。然后，我们识别出驱动LLM响应的注意力头集合，并发现这些头实现了模式匹配的形式。这些注意力头使我们能够预测由表面上无关的提示细节引起的人类看似无法解释的推理错误。综合来看，我们的结果表明，人和LLM在日常因果推理中更符合模式匹配的形式，而非抽象世界模型。

英文摘要

When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's behavior does not exhibit the same types of failures because human reasoning uses principled and abstract world models. We evaluate human participants and 25 LLMs on their ability to engage in common-sense reasoning about a variety of everyday situations and observe similar patterns of errors in both people and models. We then identify the set of attention heads driving LLM responses and find that these heads implement a form of pattern-matching. These attention heads allow us to predict seemingly inexplicable reasoning errors in people caused by ostensibly irrelevant prompt details. Taken together, our results suggest that everyday causal reasoning in people and LLMs is more consistent with a form of pattern-matching than with abstract world models.

URL PDF HTML ☆

赞 0 踩 0

2511.14007 2026-06-16 cs.CY cs.AI 版本更新

Can Artificial Intelligence Accelerate Technological Progress? Researchers' Perspectives on AI in Manufacturing and Materials Science

人工智能能否加速技术进步？研究人员对制造业和材料科学中AI的看法

John P. Nelson, Olajide Olugbade, Philip Shapira, Justin B. Biddle

发表机构 * School of Public Policy, Oregon State University（俄勒冈州立大学公共政策学院）； Manchester Institute of Innovation Research, University of Manchester（曼彻斯特大学创新研究所）； Alan Turing Institute, Manchester, UK（曼彻斯特英国艾伦·图灵研究所）

AI总结通过32位美国制造业和材料科学领域研究人员的访谈，发现AI主要用于材料和制造过程的建模，加速设计空间搜索，但存在数据依赖、需与旧技术结合以及可能阻碍颠覆性理论进步的风险。

详情

AI中文摘要

人工智能（AI）引发了人们对技术进步速度大幅提升的期望，但这种预期往往与创新过程中AI使用的详细实地研究脱节。因此，AI如何以及在多大程度上能够加速创新仍不清楚。为填补这一空白，我们探索并评估了对32位美国学术界的制造业和材料科学研究人员的访谈结果，这些研究人员在AI和机器学习（ML）技术方面经验丰富。我们发现，AI主要用于材料和制造过程的建模，促进了对材料和制造过程设计空间的更廉价、更快速的搜索。其好处包括在技术开发中节省成本、时间和计算资源。然而，AI/ML工具在已有密集数据的设计空间之外并不可靠；它们需要与较旧的研究技术相结合，进行熟练且审慎的应用；并且有人担心它们可能有害地规避颠覆性理论进步的机会。基于这些结果，我们认为有理由对通过使用AI/ML加速持续性创新持乐观态度；但需要支持传统的实证、计算和理论研究，以维持制造业和材料领域进一步颠覆性进步的可能性。

英文摘要

Artificial intelligence (AI) raises expectations of substantial increases in rates of technological progress, but such anticipations are often not connected to detailed ground-level studies of AI use in innovation processes. Accordingly, it remains unclear how and to what extent AI can accelerate innovation. To help to fill this gap, we explore and assess results from 32 interviews with U.S.-based academic manufacturing and materials sciences researchers experienced with AI and machine learning (ML) techniques. We found that AI was primarily used for modeling of materials and manufacturing processes, facilitating cheaper and more rapid search of design spaces for materials and manufacturing processes alike. Benefits included cost, time, and computation savings in technology development. However, AI/ML tools were unreliable outside design spaces for which dense data were already available; they required skilled and judicious application in tandem with older research techniques; and concerns were raised about the potential to detrimentally circumvent opportunities for disruptive theoretical advancement. Based on these results, we suggest there is reason for optimism about acceleration in sustaining innovations through the use of AI/ML; but that support for conventional empirical, computational, and theoretical research is required to maintain the likelihood of further disruptive advances in manufacturing and materials.

URL PDF HTML ☆

赞 0 踩 0

2601.09753 2026-06-16 cs.CY cs.AI 版本更新

Critically Engaged Pragmatism: Scientific Norm and Social, Pragmatist Epistemology for AI Science Evaluation Tools

批判性参与实用主义：AI科学评估工具的科学规范与社会实用主义认识论

Carole J. Lee

AI总结提出批判性参与实用主义作为科学规范，要求科学界审视AI科学评估工具的目的及特定可靠性，并建议工具创建者透明报告设计、训练和基准测试细节。

详情

DOI: 10.1080/02691728.2026.2669578
Journal ref: Social Epistemology (2026)

AI中文摘要

AI科学评估工具旨在评估研究的可信度。与传统指标（如影响因子）一样，它们的指令可能被去语境化并以有问题的方式重新利用。为了解决这个问题，我提出批判性参与实用主义作为一种科学规范，要求科学界审视AI科学评估工具的目的及特定目的的可靠性。为了培养批判性参与实用主义，AI科学评估工具的创建者应透明且完整地报告设计、训练和基准测试细节，以促进对特定目的可靠性、不同类型错误和偏见的评估。随着新形式的错误、偏见和博弈行为的发现，AI科学评估工具透明报告的最佳实践应不断更新。在此框架下，AI科学评估工具不是科学可信度的客观仲裁者。相反，它们是最终奠定科学社区可信度的批判性话语实践的对象。

英文摘要

AI science evaluation tools aim to assess research credibility. As with traditional metrics such as impact factors, their edicts can be decontextualised and repurposed in problematic ways. To address this, I propose Critically-Engaged Pragmatism as a scientific norm enjoining scientific communities to scrutinise the purposes and purpose-specific reliability of AI science evaluation tools. To foster Critically Engaged Pragmatism, creators of AI science evaluation tools should transparently and fully report design, training, and benchmarking details to facilitate assessments of purpose-specific reliability, liability to different types of error, and bias. What count as best practices for the transparent reporting of AI science evaluation tools should be updated as new forms of error, bias, and gamesmanship are discovered. Under this framework, AI science evaluation tools are not objective arbiters of scientific credibility. Rather, they are the object of critical discursive practices that ultimately ground the credibility of scientific communities.

URL PDF HTML ☆

赞 0 踩 0

2604.18827 2026-06-16 q-bio.NC cs.AI 版本更新

OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

OmniMouse: 基于1500亿神经令牌的多模态多任务脑模型的可扩展性

Konstantin F. Willeke, Polina Turishcheva, Alex Gilbert, Goirik Chakrabarty, Hasan A. Bedel, Paul G. Fahey, Yongrong Qiu, Marissa A. Weis, Michaela Vystrčilová, Taliah Muhammad, Lydia Ntanavara, Rachel E. Froebe, Kayla Ponder, Zheng Huan Tan, Emin Orhan, Erick Cobos, Sophia Sanborn, Katrin Franke, Fabian H. Sinz, Alexander S. Ecker, Andreas S. Tolias

发表机构 * Department of Ophthalmology, Byers Eye Institute, Stanford University（斯坦福大学眼科学系、比尔斯眼科研究所）； Stanford Bio-X, Stanford University（斯坦福大学生物交叉学科）； Wu Tsai Neurosciences Institute, Stanford University（斯坦福大学吴泰教授神经科学研究所）； Institute of Computer Science and Campus Institute Data Science, University Göttingen（哥廷根大学计算机科学研究所和校园数据科学研究所）

AI总结利用小鼠视觉皮层31亿神经元数据，训练多模态多任务模型OmniMouse，在神经预测、行为解码等任务上达到最优，发现性能随数据量可靠提升但模型规模收益饱和，与AI领域标准扩展规律相反。

Comments Published at ICLR2026

详情

AI中文摘要

扩展数据和人工神经网络已经改变了人工智能，推动了语言和视觉领域的突破。类似的原则是否适用于脑活动建模仍不清楚。这里我们利用了一个数据集，包含来自73只小鼠视觉皮层的310万个神经元，跨越323个会话，总计超过1500亿个神经令牌，记录于自然电影、图像、参数化刺激和行为期间。我们训练了多模态、多任务模型，在测试时灵活支持三种模式：神经预测、行为解码、神经预测或三者的任意组合。OmniMouse实现了最先进的性能，在几乎所有评估模式下优于专门的基线。我们发现性能随数据量可靠地提升，但增加模型大小的收益饱和。这颠倒了标准的人工智能扩展故事：在语言和计算机视觉中，大规模数据集使参数扩展成为进步的主要驱动力，而在脑建模中——即使是在小鼠视觉皮层这个相对简单的系统中——尽管有大量的记录，模型仍然受限于数据。系统性的扩展观察提出了神经建模中相变的可能性，更大和更丰富的数据集可能解锁定性的新能力，类似于大型语言模型中出现的涌现特性。代码见此网址。

英文摘要

Scaling data and artificial neural networks has transformed AI, driving breakthroughs in language and vision. Whether similar principles apply to modeling brain activity remains unclear. Here we leveraged a dataset of 3.1 million neurons from the visual cortex of 73 mice across 323 sessions, totaling more than 150 billion neural tokens recorded during natural movies, images and parametric stimuli, and behavior. We train multi-modal, multi-task models that support three regimes flexibly at test time: neural prediction, behavioral decoding, neural forecasting, or any combination of the three. OmniMouse achieves state-of-the-art performance, outperforming specialized baselines across nearly all evaluation regimes. We find that performance scales reliably with more data, but gains from increasing model size saturate. This inverts the standard AI scaling story: in language and computer vision, massive datasets make parameter scaling the primary driver of progress, whereas in brain modeling -- even in the mouse visual cortex, a relatively simple system -- models remain data-limited despite vast recordings. The observation of systematic scaling raises the possibility of phase transitions in neural modeling, where larger and richer datasets might unlock qualitatively new capabilities, paralleling the emergent properties seen in large language models. Code available at https://github.com/enigma-brain/omnimouse.

URL PDF HTML ☆

赞 0 踩 0

2510.16559 2026-06-16 cs.AI 版本更新

BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction

BuildArena: 一个对齐物理的LLM交互基准，用于工程建造

Tian Xia, Tianrun Gao, Wenhao Deng, Long Wei, Xiaowei Qian, Chenglei Yu, Tailin Wu

发表机构 * Tian Xia（夏天）； Tianrun Gao（高天run）； Wenhao Deng（邓文浩）； Long Wei（韦龙）； Xiaowei Qian（钱小伟）； Chenglei Yu（于成磊）； Tailin Wu（吴太林）

AI总结本研究提出BuildArena，一个首个对齐物理的LLM交互基准，用于语言驱动的工程建造，通过可扩展的任务设计策略和3D空间几何计算库，评估九个前沿LLM在语言驱动和物理基础的自动化建造中的能力。

Comments ICML 2026, 36 pages, 12 figures

详情

AI中文摘要

工程自动化旨在将自然语言规范转换为物理上可行的结构，需要在严格物理约束下进行复杂的综合推理。尽管现代LLM具有广泛的知识和强大的推理能力，使其成为该领域的潜在候选者，但其建造能力仍 largely 未被评估。为解决这一差距，我们引入了BuildArena，首个针对语言驱动工程建造设计的对齐物理的交互基准。它在使用LLM进行工程自动化方面迈出第一步。技术上，它在两个方面为社区做出贡献：(1) 覆盖静态和动态力学的可扩展任务设计策略，跨越多个难度级别；(2) 一个3D空间几何计算库，用于基于语言指令的建造。在九个前沿LLM上，BuildArena全面评估了它们在语言驱动和物理基础的自动化建造中的能力。

英文摘要

Engineering construction automation aims to transform natural language specifications into physically viable structures, requiring complex integrated reasoning under strict physical constraints. While modern LLMs possess broad knowledge and strong reasoning capabilities that make them promising candidates for this domain, their construction competencies remain largely unevaluated. To address this gap, we introduce BuildArena, the first physics-aligned interactive benchmark designed for language-driven engineering construction. Technically, it contributes to the community in two aspects: (1) an extendable task design strategy spanning static and dynamic mechanics across multiple difficulty tiers; (2) a 3D Spatial Geometric Computation Library for supporting construction based on language instructions. On nine frontier LLMs and three additional open-weight models, BuildArena comprehensively evaluates their capabilities for language-driven and physics-grounded construction automation. We release the code at https://github.com/AI4Science-WestlakeU/BuildArena to benefit construction automation in engineering applications.

URL PDF HTML ☆

赞 0 踩 0

2603.25777 2026-06-16 physics.plasm-ph cs.AI 版本更新

Challenges and opportunities for AI to help deliver fusion energy

人工智能在实现聚变能源中的挑战与机遇

Adriano Agnello, Helen Brooks, Cyd Cowley, Iulia Georgescu, Alex Higginbottom, Richard Pearson, Tara Shears, Melanie Windridge

发表机构 * STFC Hartree Centre（英国科学与技术创新中心）； UK Atomic Energy Authority（英国原子能局）； digiLab Solutions（digiLab解决方案）； Institute of Physics（物理研究所）； Zenithon AI（Zenithon人工智能）； Eindhoven University of Technology（埃因霍温理工大学）； Oliver Lodge Laboratory（Oliver Lodge实验室）； University of Liverpool（利物浦大学）； Fusion Energy Insights（聚变能源洞察）

AI总结本文探讨了人工智能在聚变能源研发中的应用潜力与挑战，强调需通过跨领域合作与稳健方法提升AI应用效果，同时指出并非所有聚变问题都适合AI解决。

Comments Submitted to Plasma Physics and Confined Fusion

详情

DOI: 10.1088/1361-6587/ae775c
Journal ref: Plasma Physics and Controlled Fusion 68 063701 (2026)

AI中文摘要

人工智能工具在聚变研究中的应用具有巨大潜力，若能实现可控核聚变，将带来全球性效益。然而，使用AI面临诸多挑战，这些挑战可通过在现有方法中引入负责任和稳健的方法加以缓解。为此，需要聚变领域专家与AI开发者之间紧密、长期的合作，并意识到并非所有聚变研究问题都最适合用AI工具解决。2025年4月，学术界、工业界、UKAEA和STFC专家在《经济学人》FusionFest活动上讨论了AI如何推动聚变能源研发。本文是对圆桌讨论的扩展和更新总结，提供了更多背景和实例。

英文摘要

There is great potential for the application of AI tools in fusion research, and substantial worldwide benefit if fusion power is realised. However, using AI comes with its own challenges, many of which can be mitigated if responsible and robust methodologies are built into existing approaches. To do that requires close, long-term collaborations between fusion domain experts and AI developers and awareness of the fact that not all problems in fusion research are best tackled with AI tools. In April 2025, experts from academia, industry, UKAEA and STFC discussed how AI can be used to advance R&D in fusion energy at the first edition of The Economist FusionFest event. This Perspective is an expanded and updated summary of the round table discussion, providing more context and examples.

URL PDF HTML ☆

赞 0 踩 0

2605.05372 2026-06-16 cs.CV cs.AI 版本更新

Two Steps Are All You Need: Efficient 3D Point Cloud Anomaly Detection with Consistency Models

两步即可：基于一致性模型的高效3D点云异常检测

Pranav A, Shashank B, Pranav Siddappa, Dominik Seuss, Minal Moharir, Subramanya KN

发表机构 * R.V. College of Engineering（R.V. 工程学院）； Technical University of Applied Sciences Würzburg-Schweinfurt（Würzburg-Schweinfurt 应用科学大学）

AI总结本文提出基于一致性学习的重建异常检测方法，通过简化推理过程提升效率，实现低延迟的3D点云异常检测，适用于资源受限设备。

Comments Accepted to CVPR 2026, at the 9th Workshop on Efficient Deep Learning for Computer Vision (ECV). To be published in the IEEE/CVF CVPR 2026 Workshop Proceedings

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 3479-3487

AI中文摘要

扩散模型正在重新定义3D点云数据中的异常检测。随着3D传感成为现代制造的关键，可靠的异常检测对于高吞吐量的质量保证和过程控制至关重要。然而，在资源受限且延迟敏感的系统中，实际部署仍然有限。现有方法往往在复杂未遮挡区域计算上不可行或不可靠，而扩散管道本质上受限于迭代去噪。在本文中，我们通过一致性学习重构基于重建的异常检测，使能够在一次或两次网络评估中直接预测无异常几何。我们进一步引入了一种新的混合损失公式，明确强制重建至干净数据。这种设计显著降低了推理成本，达到比当前最先进方法快80倍的运行时间，无需GPU加速，同时保持强大的检测性能。它在Anomaly-ShapeNet上以76.20%的I-AUROC优于R3D-AD，在Real3DAD上以72.80%的I-AUROC保持竞争力，使在资源受限平台上实现高效、低延迟的异常检测成为可能，包括无人机、智能工业相机和其他边缘设备。

英文摘要

Diffusion models are rapidly redefining 3D anomaly detection in point cloud data. As 3D sensing becomes integral to modern manufacturing, reliable anomaly detection is essential for high-throughput quality assurance and process control. Yet practical deployment on resource-constrained, latency-critical systems remains limited. Existing methods are often computationally prohibitive or unreliable in complex, unmasked regions, and diffusion pipelines are inherently bottlenecked by iterative denoising. In this work, we address this bottleneck by reformulating reconstructionbased anomaly detection through consistency learning, enabling direct prediction of anomaly-free geometry in one or two network evaluations. We further introduce a novel hybrid loss formulation that explicitly enforces reconstruction toward clean data. This design substantially reduces inference cost, achieving up to 80x faster runtime than the current state-of-the-art method, without GPU acceleration, while preserving strong detection performance. It outperforms R3D-AD on Anomaly-ShapeNet with 76.20% I-AUROC and remains competitive on Real3DAD with 72.80% I-AUROC, enabling efficient, low-latency anomaly detection on resource-constrained platforms, including drones, smart industrial cameras, and other edge devices.

URL PDF HTML ☆

赞 0 踩 0

2511.12635 2026-06-16 cs.SE cs.AI cs.LG 版本更新

LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews

LLM4SCREENLIT: 关于评估用于系统综述文献筛选的大型语言模型性能的建议

Lech Madeyski, Barbara Kitchenham, Martin Shepperd

发表机构 * University of Kent（肯特大学）； University of Leicester（利兹大学）； University of Birmingham（伯明翰大学）

AI总结本文提出LLM4SCREENLIT建议，针对系统综述文献筛选中大型语言模型的评估，提出基于加权马修相关系数的改进方法，强调在不平衡和成本不对称条件下使用成本敏感的WMCC进行评估。

Comments 34 pages, 6 figures

详情

DOI: 10.1016/j.infsof.2026.108204
Journal ref: Information and Software Technology 198 (2026) 108204

AI中文摘要

本文提出LLM4SCREENLIT建议，针对系统综述文献筛选中大型语言模型的评估，提出基于加权马修相关系数的改进方法，强调在不平衡和成本不对称条件下使用成本敏感的WMCC进行评估。

英文摘要

Context: Large language models (LLMs) are increasingly used to screen literature for systematic reviews (SRs), but the standard confusion-matrix metrics used to evaluate them can mislead under the imbalanced, cost-asymmetric conditions of screening. Objective: We develop and justify LLM4SCREENLIT-practical recommendations for researchers conducting LLM-screening evaluations and for editors and reviewers assessing such studies-differentiated by study type (retrospective benchmarking vs deployment for a specific SR). Method: Using Delgado-Chaves et al. (2025), an 18-LLM benchmark across three biomedical SRs, as a motivating example, we reviewed 28 additional papers and extracted their reported metrics. We propose a Weighted Matthews Correlation Coefficient (WMCC) that integrates MCC's chance-correction with asymmetric misclassification costs, and validated it on three software-engineering (SE) reanalyses, the largest covering 9 LLMs x 24 SE secondary studies (34,528 articles). Results: Across the 29 papers, only 10% reported MCC, only 24% reported full confusion matrices, and none of the five papers claiming workload savings priced false-negative cost. In the largest SE reanalysis, MCC and WMCC disagree on the best LLM in 55% of evaluable studies; in the most striking 9,695-article SE study, the Accuracy-best LLM loses 63.3% of relevant evidence (Lost Evidence), the MCC-best 43.9%, but the WMCC-best only 5.8%. Sensitivity analysis (median crossover at w~=2.7, all <7) supports w=10 as a conservative default. Conclusions: SR-screening evaluations should prioritize Lost Evidence and use cost-sensitive WMCC alongside MCC for ranking. Reporting must include the full confusion matrix and treat unclassifiable outputs as positives requiring human review. Designs should be leakage-aware, with non-LLM baselines when the study aims to inform SR practice and labels are available.

URL PDF HTML ☆

赞 0 踩 0

2603.11729 2026-06-16 cs.DS cs.AI cs.RO 版本更新

Adapting Dijkstra for Buffers and Unlimited Transfers

为缓冲区和无限换乘调整Dijkstra算法

Denys Katkalo, Andrii Rohovyi, Toby Walsh

发表机构 * University of Oxford（牛津大学）

AI总结本文提出Transfer Aware Dijkstra (TAD)算法，通过扫描完整行程序列而非单条边，解决了带缓冲区时间的无限换乘路径规划中传统Dijkstra过滤失效的问题，并在伦敦和瑞士网络上实现比MR快两倍以上的速度且保持最优性。

Comments v4: clarified RAPTOR description in the Background section

详情

AI中文摘要

近年来，基于RAPTOR的算法被认为是无需预处理即可处理无限换乘路径规划的最先进技术。然而，这一地位很大程度上源于路由研究的演进，其中基于Dijkstra的解决方案被基于时间表的算法取代，而缺乏系统性的比较。在这项工作中，我们重新审视了经典的基于Dijkstra的无限换乘公共交通路由方法，并证明时间依赖Dijkstra (TD-Dijkstra) 优于MR。然而，高效的TD-Dijkstra实现依赖于在预处理期间过滤被支配的连接，这假设乘客总是可以切换到更快的连接。我们表明，当站点有缓冲区时间时，这种过滤是不合理的，因为它无法区分可能继续等待的坐席乘客和必须遵守缓冲区的换乘乘客。为了解决这一限制，我们引入了Transfer Aware Dijkstra (TAD)，这是一种修改后的算法，它扫描整个行程序列而不是单个边，从而正确处理缓冲区时间，同时保持相对于MR的性能优势。我们在伦敦和瑞士网络上的实验表明，与MR相比，我们可以在有和没有缓冲区时间的两个网络上实现超过两倍的速度提升，同时产生最优结果。

英文摘要

In recent years, RAPTOR based algorithms have been considered the state-of-the-art for path-finding with unlimited transfers without preprocessing. However, this status largely stems from the evolution of routing research, where Dijkstra-based solutions were superseded by timetable-based algorithms without a systematic comparison. In this work, we revisit classical Dijkstra-based approaches for public transit routing with unlimited transfers and demonstrate that Time-Dependent Dijkstra (TD-Dijkstra) outperforms MR. However, efficient TD-Dijkstra implementations rely on filtering dominated connections during preprocessing, which assumes passengers can always switch to a faster connection. We show that this filtering is unsound when stops have buffer times, as it cannot distinguish between seated passengers who may continue without waiting and transferring passengers who must respect the buffer. To address this limitation, we introduce Transfer Aware Dijkstra (TAD), a modification that scans entire trip sequences rather than individual edges, correctly handling buffer times while maintaining performance advantages over MR. Our experiments on the London and Switzerland networks show that we can achieve more than a twofold speedup over MR while producing optimal results on both networks, with and without buffer times.

URL PDF HTML ☆

赞 0 踩 0

2505.09755 2026-06-16 cs.AI 版本更新

Explainability Through Human-Centric Design for XAI in Lung Cancer Detection

通过以人为中心的设计实现XAI在肺癌检测中的可解释性

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

发表机构 * University of Edinburgh（爱丁堡大学）； NHS Lothian（NHS洛锡安）

AI总结本文提出XpertXAI模型，通过人类中心设计在肺癌检测中实现可解释性，优于现有方法，提供更符合专家推理的概念解释。

详情

DOI: 10.24963/ijcai.2025/1147

AI中文摘要

深度学习模型在胸部X光片肺癌病理检测中表现出潜力，但临床应用受限于模型决策的不透明性。本文引入ClinicXAI，一种以人为中心、专家引导的概念瓶颈模型（CBM），用于可解释的肺癌诊断。我们扩展了这一方法，提出XpertXAI，一种通用的专家驱动模型，能够在检测多种肺部病变的同时保留人类可解释的临床概念。使用高性能的InceptionV3分类器和包含放射学报告的公共胸部X光数据集，我们比较了XpertXAI与领先的后验可解释性方法和无监督CBM（XCBs）。通过与专家放射科医师注释和医学地面真实值的比较评估解释。尽管XpertXAI训练用于多种病变，我们的专家验证集中在肺癌上。我们发现现有技术经常无法产生具有临床意义的解释，遗漏了关键诊断特征并不同意放射科医生的判断。XpertXAI不仅在预测准确性上优于这些基线方法，还提供了更符合专家推理的概念级解释。虽然我们的重点仍放在肺癌检测的可解释性上，但这项工作展示了如何通过人类中心的模型设计有效地扩展到更广泛的诊断情境——为医学诊断中的有意义可解释AI提供可扩展的路径。

英文摘要

Deep learning models have shown promise in lung pathology detection from chest X-rays, but widespread clinical adoption remains limited due to opaque model decision-making. In prior work, we introduced ClinicXAI, a human-centric, expert-guided concept bottleneck model (CBM) designed for interpretable lung cancer diagnosis. We now extend that approach and present XpertXAI, a generalizable expert-driven model that preserves human-interpretable clinical concepts while scaling to detect multiple lung pathologies. Using a high-performing InceptionV3-based classifier and a public dataset of chest X-rays with radiology reports, we compare XpertXAI against leading post-hoc explainability methods and an unsupervised CBM, XCBs. We assess explanations through comparison with expert radiologist annotations and medical ground truth. Although XpertXAI is trained for multiple pathologies, our expert validation focuses on lung cancer. We find that existing techniques frequently fail to produce clinically meaningful explanations, omitting key diagnostic features and disagreeing with radiologist judgments. XpertXAI not only outperforms these baselines in predictive accuracy but also delivers concept-level explanations that better align with expert reasoning. While our focus remains on explainability in lung cancer detection, this work illustrates how human-centric model design can be effectively extended to broader diagnostic contexts - offering a scalable path toward clinically meaningful explainable AI in medical diagnostics.

URL PDF HTML ☆

赞 0 踩 0

2603.10047 2026-06-16 cs.SE cs.AI cs.HC 版本更新

Toward Epistemic Stability: Engineering Consistent Procedures for Industrial LLM Hallucination Reduction

迈向认知稳定性：为工业大语言模型幻觉减少设计一致的程序

Brian Freeman, Adam Kicklighter, Matt Erdman, Zach Gordon

发表机构 * Trane Technologies（特纳技术公司）

AI总结本文提出并比较了五种提示工程策略，旨在减少模型输出的方差，实现可重复、基于事实的结果。通过LLM-as-Judge框架评估，M4在100次试验中均表现最佳，M2在v2版本中提升显著。

Comments 50 pages, 5 tables, 7 figures

详情

DOI: 10.5539/cis.v19n2p1

AI中文摘要

大型语言模型（LLM）中的幻觉是指语法上连贯但事实错误或上下文不一致的输出。它们在高风险工业应用中持续存在，如工程设计、企业资源计划和物联网 telemetry 平台。本文提出并比较了五种提示工程策略，旨在减少模型输出的方差，以获得可重复、基于事实的结果，而无需修改模型权重或创建复杂验证模型。这些方法包括：（M1）迭代相似性收敛，（M2）分解模型无关提示，（M3）单任务代理专业化，（M4）增强数据注册，以及（M5）领域术语库注入。每种方法均使用LLM-as-Judge框架在100次重复运行中评估（相同固定任务提示，随机解码，tau=0.7）。在该评估设置下，M4（增强数据注册）在所有100次试验中均获得“更好”评价；M3和M5分别达到80%和77%；M1达到75%；而M2相比单次提示在现代基础模型上净负34%。随后，我们开发了增强版本2（v2）实现，并在10次验证批次上评估；M2从34%提升到80%，是四个修订方法中最大的提升。我们讨论了这些策略如何帮助克服LLM结果的非确定性，即使绝对正确性无法保证。我们提供了伪代码、原文提示和批次日志以支持独立评估。

英文摘要

Hallucinations in large language models (LLMs) are outputs that are syntactically coherent but factually incorrect or contextually inconsistent. They are persistent obstacles in high-stakes industrial settings such as engineering design, enterprise resource planning, and IoT telemetry platforms. We present and compare five prompt engineering strategies intended to reduce the variance of model outputs and move toward repeatable, grounded results without modifying model weights or creating complex validation models. These methods include: (M1) Iterative Similarity Convergence, (M2) Decomposed Model-Agnostic Prompting, (M3) Single-Task Agent Specialization, (M4) Enhanced Data Registry, and (M5) Domain Glossary Injection. Each method is evaluated against an internal baseline using an LLM-as-Judge framework over 100 repeated runs per method (same fixed task prompt, stochastic decoding at tau = 0.7. Under this evaluation setup, M4 (Enhanced Data Registry) received ``Better'' verdicts in all 100 trials; M3 and M5 reached 80% and 77% respectively; M1 reached 75%; and M2 was net negative at 34% when compared to single shot prompting with a modern foundation model. We then developed enhanced version 2 (v2) implementations and assessed them on a 10-trial verification batch; M2 recovered from 34% to 80%, the largest gain among the four revised methods. We discuss how these strategies help overcome the non-deterministic nature of LLM results for industrial procedures, even when absolute correctness cannot be guaranteed. We provide pseudocode, verbatim prompts, and batch logs to support independent assessment.

URL PDF HTML ☆

赞 0 踩 0

2603.24724 2026-06-16 cs.CV cs.AI 版本更新

Is Geometry Enough? An Evaluation of Landmark-Based Gaze Estimation

几何足够吗？基于标记的注视估计评估

Daniele Agostinelli, Thomas Agostinelli, Andrea Generosi, Maura Mengoni

发表机构 * Department of Industrial Engineering and Mathematical Sciences, Università Politecnica delle Marche（工业工程与数学科学系，帕尔米塞大学）； Department of Science and Information Technology, Università Pegaso（科学与信息科技系，佩加索大学）

AI总结本文评估了基于面部标记的注视估计方法，通过标准化流程提取和归一化三个大型数据集的标记，并训练轻量级回归模型，发现其在跨域评估中与ResNet18基线相当，表明稀疏几何特征能有效支持鲁棒的注视估计。

详情

DOI: 10.1109/ACCESS.2026.3696778

AI中文摘要

基于外观的注视估计通常依赖深度卷积神经网络（CNNs）。这些模型准确但计算成本高且作为“黑箱”，可解释性差。基于面部标记的几何方法是轻量级替代方案，但其性能限制和泛化能力在现代基准中仍待探索。本文全面评估了基于标记的注视估计，引入标准化流程提取和归一化三个大型数据集（Gaze360、ETH-XGaze、GazeGene）的标记，并训练轻量级回归模型，具体为极端梯度提升树和两种神经架构：整体多层感知机（MLP）和设计捕捉双眼几何的孪生MLP。发现基于标记的模型在领域内评估表现较低，可能由于数据集中的标记检测噪声引入。然而，在跨域评估中，所提出的MLP架构的泛化能力与ResNet18基线相当。这些发现表明稀疏几何特征编码了足够的信息以支持鲁棒的注视估计，为高效、可解释且隐私友好的边缘应用铺平了道路。源代码和生成的基于标记的数据集可在https://github.com/daniele-agostinelli/LandmarkGaze.git获取。

英文摘要

Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as "black boxes", offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at https://github.com/daniele-agostinelli/LandmarkGaze.git.

URL PDF HTML ☆

赞 0 踩 0

2508.12365 2026-06-16 cs.IR cs.AI cs.CL 版本更新

TaoSR1: The Thinking Model for E-commerce Relevance Search

TaoSR1：电商相关性搜索的思考模型

Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, Haihong Tang, Bo Zheng

发表机构 * Taobao & Tmall Group of Alibaba（淘宝与天猫集团）

AI总结本文提出TaoSR1框架，通过CoT引导的监督微调、离线采样与DPO优化，解决电商搜索中相关性预测的推理误差与幻觉问题，实现高效部署。

详情

DOI: 10.1145/3770855.3818489
Journal ref: KDD '26: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2026

AI中文摘要

查询-商品相关性预测是电商搜索的核心任务。基于BERT的模型在语义匹配上表现优异，但缺乏复杂的推理能力。尽管大型语言模型（LLMs）被探索，大多数仍使用判别性微调或蒸馏到小模型进行部署。我们提出一个框架，直接部署LLMs用于此任务，解决关键挑战：推理链（CoT）误差累积、判别性幻觉和部署可行性。我们的框架TaoSR1包括三个阶段：（1）使用CoT的监督微调以培养推理能力；（2）离线采样与pass@N策略和直接偏好优化（DPO）以提高生成质量；（3）基于难度的动态采样与组相对策略优化（GRPO）以缓解判别性幻觉。此外，后CoT处理和基于累积概率的分区方法使在线部署高效。TaoSR1在离线数据集上显著优于基线，并在在线双人评估中取得显著优势，引入了将CoT推理应用于相关性分类的新范式。

英文摘要

Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. While Large Language Models (LLMs) are explored, most still use discriminative fine-tuning or distill to smaller models for deployment. We propose a framework to directly deploy LLMs for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.

URL PDF HTML ☆

赞 0 踩 0

2511.00369 2026-06-16 cs.LG cs.AI cs.NE 版本更新

Balancing Interpretability and Performance in Motor Imagery EEG Classification: A Comparative Study of ANFIS-FBCSP-PSO and EEGNet

在运动想象EEG分类中平衡可解释性和性能：ANFIS-FBCSP-PSO和EEGNet的比较研究

Farjana Aktar, Mohd Ruhul Ameen, Akif Islam, Md Ekramul Hamid

发表机构 * University of Rajshahi（拉贾沙希大学）

AI总结本文比较了ANFIS-FBCSP-PSO与EEGNet在BCI竞赛IV-2a数据集上的性能，发现模糊神经模型在内子试验中表现更优，而深度模型在跨受试者测试中更具泛化能力，为选择MI-BCI系统提供指导。

Comments Accepted at the 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence and Networking (QPAIN 2026)

详情

DOI: 10.1109/QPAIN69676.2026.11545962
Journal ref: 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)

AI中文摘要

实现准确且可解释的运动想象EEG分类仍是脑机接口（BCI）研究中的关键挑战。本文比较了透明的模糊推理方法（ANFIS-FBCSP-PSO）与知名的深度学习基准（EEGNet），使用公开的BCI竞赛IV-2a数据集。ANFIS流程结合滤波器银行共同空间模式特征提取与通过粒子群优化优化的模糊IF-THEN规则，而EEGNet直接从原始EEG数据学习层次化的空间-时间表示。在内子试验中，模糊神经模型表现更好（68.58%±13.76%准确率，kappa=58.04%±18.43），而在跨受试者（LOSO）测试中，深度模型表现出更强的泛化能力（68.20%±12.13%准确率，kappa=57.33%±16.22）。因此，该研究为根据设计目标选择MI-BCI系统提供了实用指导：可解释性或用户间鲁棒性。未来对基于Transformer和混合神经符号框架的研究有望进一步推动透明的EEG解码。

英文摘要

Achieving both accurate and interpretable classification of motor-imagery EEG remains a key challenge in brain-computer interface (BCI) research. In this paper, we compare a transparent fuzzy-reasoning approach (ANFIS-FBCSP-PSO) with a well-known deep-learning benchmark (EEGNet) using the publicly available BCI Competition IV-2a dataset. The ANFIS pipeline combines filter-bank common spatial pattern feature extraction with fuzzy IF-THEN rules optimized via particle-swarm optimization, while EEGNet learns hierarchical spatial-temporal representations directly from raw EEG data. In within-subject experiments, the fuzzy-neural model performed better (68.58% +/- 13.76% accuracy, kappa = 58.04% +/- 18.43), while in cross-subject (LOSO) tests, the deep model exhibited stronger generalization (68.20% +/- 12.13% accuracy, kappa = 57.33% +/- 16.22). The study therefore provides practical guidance for selecting MI-BCI systems according to the design goal: interpretability or robustness across users. Future investigations into transformer-based and hybrid neuro-symbolic frameworks are expected to further advance transparent EEG decoding.

URL PDF HTML ☆

赞 0 踩 0

2511.00352 2026-06-16 cs.CV cs.AI 版本更新

Detecting AI-Generated Images via Diffusion Snap-Back Reconstruction: A Forensic Approach

通过扩散快回重建检测AI生成图像：一种取证方法

Mohd Ruhul Ameen, Akif Islam

发表机构 * Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology（1 计算机科学与工程系，孟加拉国工程与技术大学）

AI总结本文提出通过扩散模型重建图像时的响应行为来检测AI生成图像，利用LPIPS等指标分析图像与扩散模型去噪行为的匹配程度，实验显示方法在识别准确率上表现优异。

Comments Accepted at the 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence and Networking (QPAIN 2026)

详情

DOI: 10.1109/QPAIN69676.2026.11545865
Journal ref: 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)

AI中文摘要

生成图像模型的快速发展使数字媒体发生了变革，使得人类观察者或许多传统检测方法难以可靠地区分AI生成图像和真实照片。现代文本到图像系统如Stable Diffusion和DALL E能够生成极其逼真的图像，使其看起来完全自然，留下很少或没有传统深度伪造检测器可以依赖的可见伪影。这一挑战对虚假信息控制、机构身份验证和政治和法律领域中的数字信任有实际影响。我们不搜索隐藏的像素级痕迹，而是观察图像在被轻微扰动和由扩散模型重建时的反应。我们称之为扩散快回。通过跟踪不同重建强度下感知相似性度量（LPIPS、SSIM和PSNR）的变化，我们捕捉到紧凑且可解释的信号，揭示图像与扩散模型学习的去噪行为的接近程度。在包含4000张人类和AI生成图像的平衡数据集上评估，所提出的方法在分层五折交叉验证中达到AUROC 0.993，在使用仅逻辑回归的测试集上达到0.990。初步的鲁棒性测试显示，该方法在常见的现实世界失真如图像压缩和添加噪声下仍保持稳定。虽然我们的实验使用单一扩散主干进行，但结果表明，重建行为可以作为合成媒体检测的可靠且可扩展的基础，随着生成模型变得越来越逼真。

英文摘要

The rapid advancement of generative image models has transformed digital media to the point where AI generated images can no longer be reliably distinguished from authentic photographs by human observers or many conventional detection methods. Modern text to image systems such as Stable Diffusion and DALL E can now generate images so realistic that they often appear completely natural, leaving little to no visible artifacts for traditional deepfake detectors to rely on. This challenge has practical consequences for misinformation control, institutional identity verification, and digital trust in political and legal contexts. Instead of searching for hidden pixel level traces, we take a different approach: we observe how an image responds when it is gently disturbed and reconstructed by a diffusion model. We call this behavior diffusion snap back. By tracking how perceptual similarity measures (LPIPS, SSIM, and PSNR) change across different reconstruction strengths, we capture compact and interpretable signals that reveal how closely an image aligns with the diffusion model's learned denoising behavior. Evaluated on a balanced dataset of 4,000 human and AI generated images, the proposed method achieves an AUROC of 0.993 under stratified five fold cross validation and 0.990 on a holdout split using only logistic regression. Initial robustness tests show that the method remains stable under common real world distortions such as image compression and added noise. Although our experiments were conducted using a single diffusion backbone, the results indicate that reconstruction behavior can serve as a reliable and scalable foundation for synthetic media detection as generative models continue to grow more realistic.

URL PDF HTML ☆

赞 0 踩 0

2510.23785 2026-06-16 cs.CV cs.AI 版本更新

CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting

CountFormer：一种用于学习类无关物体计数中视觉重复和结构的Transformer框架

Md Tanvir Hossain, Akif Islam, Mohd Ruhul Ameen

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结 CountFormer通过使用DINOv2和位置嵌入，改进了无示例物体计数中的结构一致性，实现了在FSC-147上的竞争力表现。

Comments Accepted at the 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence and Networking (QPAIN 2026)

详情

DOI: 10.1109/QPAIN69676.2026.11546546
Journal ref: 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)

AI中文摘要

人类通常通过观察视觉重复和组成来计数 unfamiliar objects，而非仅依赖物体类别。然而，许多无示例计数模型在这种情况下的表现不佳，尤其是在物体包含对称组件、重复子结构或部分遮挡时可能过计数。我们引入了CountFormer，这是一种受CounTR启发的密度回归框架的受控适应，其中图像编码器被自监督视觉基础模型DINOv2取代。所得的Transformer特征与显式的二维位置嵌入结合，并通过轻量级卷积网络解码，以生成密度图，其积分给出最终计数。我们的目标不是提出新的计数架构，而是研究在严格无示例设置下，基于基础的表示是否能提高结构一致性。在FSC-147上，CountFormer在官方基准上实现了竞争性表现（MAE 19.06，RMSE 118.45）。定性分析表明，对于某些结构复杂的物体，部分层面的过计数错误更少，而总体误差与先前方法大致一致。敏感性分析显示，评估指标强烈受少量极端高密度场景的影响。总体而言，结果突显了表示质量在无示例物体计数中的作用。

英文摘要

Humans can often count unfamiliar objects by observing visual repetition and composition, rather than relying only on object categories. However, many exemplar-free counting models struggle in such situations and may overcount when objects contain symmetric components, repeated substructures, or partial occlusion. We introduce CountFormer, a controlled adaptation of a density-regression framework inspired by CounTR, where the image encoder is replaced with the self-supervised vision foundation model DINOv2. The resulting transformer features are combined with explicit two-dimensional positional embeddings and decoded by a lightweight convolutional network to produce a density map whose integral gives the final count. Our goal is not to propose a new counting architecture, but to study whether foundation-based representations improve structural consistency under a strictly exemplar-free setting. On FSC-147, CountFormer achieves competitive performance under the official benchmark (MAE 19.06, RMSE 118.45). Qualitative analysis suggests fewer part-level overcounting errors for some structurally complex objects, while overall error remains broadly consistent with prior approaches. Sensitivity analysis shows that evaluation metrics are strongly affected by a small number of extreme high-density scenes. Overall, the results highlight the role of representation quality in exemplar-free object counting.

URL PDF HTML ☆

赞 0 踩 0

2509.22935 2026-06-16 cs.LG cs.AI 版本更新

Compute-Optimal Quantization-Aware Training

计算最优量化感知训练

Aleksandr Dremov, David Grangier, Angelos Katharopoulos, Awni Hannun

发表机构 * Apple（苹果公司）

AI总结本文研究了量化感知训练与全精度训练的计算分配优化问题，通过实验发现QAT与FP训练比例随总计算量增加而上升，并提出新的冷却与QAT融合方法以提升效率。

Comments ICLR 2026

详情

Journal ref: International Conference on Learning Representations (ICLR), 2026

AI中文摘要

量化感知训练（QAT）是提高量化神经网络精度的重要技术。先前研究表明，将训练分解为全精度阶段后接QAT阶段能获得更优精度。然而，全精度与QAT阶段的计算分配仍不明确。本文通过不同计算预算、QAT位宽和模型大小的实验，探讨了不同QAT持续时间对最终性能的影响。研究发现，与先前结论相反，QAT与全精度训练的损失最优比随总计算量增加而上升。使用tokens-per-parameter-byte统计量可准确预测广泛模型大小和量化位宽的最优比例。从实验数据中推导出一个损失标度定律，可预测不同QAT/FP计算分配策略和QAT位宽下的最优QAT比例和最终模型性能。利用该定律进行进一步预测，包括在给定内存约束下最优QAT位宽以及不同位宽QAT精度与全精度模型精度的比较。此外，本文提出了一种新的冷却与QAT融合方法，通过联合学习率衰减与量化感知训练，消除冗余的全精度模型更新，实现显著的计算节省。这些发现为高效的QAT规划提供了实用见解，并使在相同计算预算下训练更高质量的量化模型成为可能。

英文摘要

Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with the total amount of compute. Moreover, the optimal fraction can be accurately predicted for a wide range of model sizes and quantization widths using the tokens-per-parameter-byte statistic. From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths. We use the scaling law to make further predictions, which we verify experimentally, including which QAT bit width is optimal under a given memory constraint and how QAT accuracy with different bit widths compares to full-precision model accuracy. Additionally, we propose a novel cooldown and QAT fusion approach that performs learning rate decay jointly with quantization-aware training, eliminating redundant full-precision model updates and achieving significant compute savings. These findings provide practical insights into efficient QAT planning and enable the training of higher-quality quantized models with the same compute budget.

URL PDF HTML ☆

赞 0 踩 0

2602.21381 2026-06-16 cs.LG cs.AI cs.CE 版本更新

VCDF: A Validated Consensus-Driven Framework for Time Series Causal Discovery

VCDF：一种验证性共识驱动的时间序列因果发现框架

Gene Yu, Ce Guo, Wayne Luk

发表机构 * Department of Computing, Imperial College London（帝国理工学院伦敦分校计算机系）

AI总结本文提出VCDF框架，通过评估时间序列阻断子集的因果关系稳定性，提升因果发现的鲁棒性，实验显示其在VAR-LiNGAM等方法上显著提高了F1分数，尤其在长序列中效果更佳。

Comments This paper has been accepted to PAKDD 2026. Please cite the proceedings version when available

详情

DOI: 10.1007/978-981-92-1465-5_3
Journal ref: LNCS vol. 16599, pp. 29-41, Springer, 2026

AI中文摘要

时间序列因果发现对于理解动态系统至关重要，但现有方法对噪声、非平稳性和采样变异敏感。本文提出验证性共识驱动框架（VCDF），一种简单且方法无关的层，通过评估因果关系在阻断时间子集中的稳定性来提高鲁棒性。VCDF无需修改基础算法，可应用于VAR-LiNGAM和PCMCI等方法。实验表明，VCDF在合成数据集上提高了VAR-LiNGAM的窗口和总结F1分数，增益在不同数据特性中最为明显的是中等至长序列。该框架还受益于更长的序列，时间序列长度1000及以上可获得高达0.18的绝对改进。在模拟fMRI数据和IT监控场景中的评估进一步展示了其在现实噪声条件下的稳定性和结构准确性。VCDF为时间序列因果发现提供了一个有效的可靠性层，而不会改变底层建模假设。

英文摘要

Time series causal discovery is essential for understanding dynamic systems, yet many existing methods remain sensitive to noise, non-stationarity, and sampling variability. We propose the Validated Consensus-Driven Framework (VCDF), a simple and method-agnostic layer that improves robustness by evaluating the stability of causal relations across blocked temporal subsets. VCDF requires no modification to base algorithms and can be applied to methods such as VAR-LiNGAM and PCMCI. Experiments on synthetic datasets show that VCDF improves VAR-LiNGAM by approximately 0.08-0.12 in both window and summary F1 scores across diverse data characteristics, with gains most pronounced for moderate-to-long sequences. The framework also benefits from longer sequences, yielding up to 0.18 absolute improvement on time series of length 1000 and above. Evaluations on simulated fMRI data and IT-monitoring scenarios further demonstrate enhanced stability and structural accuracy under realistic noise conditions. VCDF provides an effective reliability layer for time series causal discovery without altering underlying modeling assumptions.

URL PDF HTML ☆

赞 0 踩 0

2602.08088 2026-06-16 cs.LG cs.AI 版本更新

Online Domain-aware LLM Decoding for Continual Domain Evolution

在线领域感知的LLM解码用于持续领域演变

Mohammad Abu-Shaira, Weishi Shi

发表机构 * University of North Texas（北卡罗来纳州立大学）

AI总结本文提出在线领域感知解码框架ODD，通过概率融合和自适应置信度调节，提升LLM在持续领域变化中的适应能力，实验表明其在语法和语义生成任务中表现优异。

详情

DOI: 10.1007/978-981-92-1468-6_40
Journal ref: Advances in Knowledge Discovery and Data Mining, PAKDD 2026, LNAI 16600, pp. 565-577, Springer, 2026

AI中文摘要

LLMs通常在领域特定数据上离线微调，假设领域静态。但实际上，领域知识通过新法规、产品、服务和交互模式持续演变。对每个新实例重新训练或微调LLM在计算上不可行。此外，现实环境也表现出时间动态性，数据分布不断变化。忽视这种现象，即概念漂移，会显著降低模型的预测准确性。这种领域演变与静态适应管道的不匹配凸显了需要高效实时适应而无需昂贵再训练的需求。为此，我们引入在线领域感知解码框架（ODD）。ODD在基础LLM和前缀树先验之间进行概率级融合，通过自适应置信度调节使用分歧和连续性信号进行指导。在多样化的漂移场景下的实证评估表明，ODD在所有语法和语义NLG指标上均优于LLM-Greedy和LLM-Temp Scaled。它在ROUGE-L指标上获得绝对增益0.065，并在最佳基线上使余弦相似度提高13.6%。这些结果证明了ODD对演变词汇和上下文模式的鲁棒性，使其适用于动态LLM应用。

英文摘要

LLMs are typically fine-tuned offline on domain-specific data, assuming a static domain. In practice, domain knowledge evolves continuously through new regulations, products, services, and interaction patterns. Retraining or fine-tuning LLMs for every new instance is computationally infeasible. Additionally, real-world environments also exhibit temporal dynamics with shifting data distributions. Disregarding this phenomenon, commonly referred to as concept drift, can significantly diminish a model's predictive accuracy. This mismatch between evolving domains and static adaptation pipelines highlights the need for efficient, real-time adaptation without costly retraining. In response, we introduce Online Domain-aware Decoding framework (ODD). ODD performs probability-level fusion between a base LLM and a prefix-tree prior, guided by adaptive confidence modulation using disagreement and continuity signals. Empirical evaluation under diverse drift scenarios demonstrates that ODD consistently surpasses LLM-Greedy and LLM-Temp Scaled across all syntactic and semantic NLG metrics. It yields an absolute ROUGE-L gain of 0.065 and a 13.6% relative improvement in Cosine Similarity over the best baseline. These results demonstrate ODD 's robustness to evolving lexical and contextual patterns, making it suitable for dynamic LLM applications.

URL PDF HTML ☆

赞 0 踩 0

2601.18897 2026-06-16 cs.AI cs.LG 版本更新

Explainable Uncertainty Quantification for Wastewater Treatment Energy Prediction via Interval Type-2 Neuro-Fuzzy System

通过区间型2神经模糊系统实现废水处理能耗预测的可解释不确定性量化

Qusai Khaled, Bahjat Mallak, Uzay Kaymak, Laura Genga

发表机构 * Jheronimus Academy of Data Science, Eindhoven University of Technology, Eindhoven, The Netherlands（杰罗尼穆斯数据科学学院，埃因霍温理工大学，埃因霍温，荷兰）； Haskoning, Amersfoort, The Netherlands（哈索宁，阿默斯福尔特，荷兰）； School of Industrial Engineering, Eindhoven University of Technology（工业工程学院，埃因霍温理工大学）

AI总结本文提出一种区间型2神经模糊系统，用于废水处理能耗预测，通过模糊规则结构生成可解释的预测区间，分解不确定性层级，提升决策可靠性。

Comments Submitted to 21st International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU2026)

详情

DOI: 10.1007/978-3-032-28997-1_28
Journal ref: IPMU 2026, Commun. Comput. Inf. Sci. 3020, 392-406 (2026)

AI中文摘要

废水处理厂消耗全球1-3%的电力，准确的能耗预测对运营优化和可持续性至关重要。尽管机器学习模型提供点预测，但缺乏可解释的不确定性量化，这对安全关键基础设施的风险意识决策至关重要。本研究开发了一种区间型2自适应神经模糊推理系统（IT2-ANFIS），通过模糊规则结构生成可解释的预测区间。与黑箱概率方法不同，所提出的框架将不确定性分解为三个层次：特征层、不确定性足迹识别引入模糊性的变量，规则层分析揭示局部模型的置信度，实例层区间量化整体预测不确定性。在墨尔本水务东处理厂数据集上验证，IT2-ANFIS在预测性能上与一阶ANFIS相当，但在训练运行中方差显著降低，同时提供可解释的不确定性估计，将预测置信度直接与运营条件和输入变量联系起来。

英文摘要

Wastewater treatment plants consume 1-3% of global electricity, making accurate energy forecasting critical for operational optimization and sustainability. While machine learning models provide point predictions, they lack explainable uncertainty quantification essential for risk-aware decision-making in safety-critical infrastructure. This study develops an Interval Type-2 Adaptive Neuro-Fuzzy Inference System (IT2-ANFIS) that generates interpretable prediction intervals through fuzzy rule structures. Unlike black-box probabilistic methods, the proposed framework decomposes uncertainty across three levels: feature-level, footprint of uncertainty identify which variables introduce ambiguity, rule-level analysis reveals confidence in local models, and instance-level intervals quantify overall prediction uncertainty. Validated on Melbourne Water's Eastern Treatment Plant dataset, IT2-ANFIS achieves comparable predictive performance to first order ANFIS with substantially reduced variance across training runs, while providing explainable uncertainty estimates that link prediction confidence directly to operational conditions and input variables.

URL PDF HTML ☆

赞 0 踩 0

2601.18045 2026-06-16 cs.CV cs.AI 版本更新

Leveraging Persistence Image to Enhance Robustness and Performance in Curvilinear Structure Segmentation

利用持续图像增强曲率结构分割的鲁棒性和性能

Zhuangzhi Gao, Feixiang Zhou, He Zhao, Xiuju Chen, Xiaoxin Li, Qinkai Yu, Yitian Zhao, Alena Shantsila, Gregory Y. H. Lip, Eduard Shantsila, Yalin Zheng

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文提出PIs-Regressor和Topology SegNet，通过直接学习持续图像来增强曲率结构分割的鲁棒性和性能，实验表明拓扑特征能有效提升医学图像分割的准确性。

Comments Accepted by IEEE International Symposium on Biomedical Imaging (ISBI) 2026. 5 pages, 3 figures

详情

DOI: 10.1109/ISBI61048.2026.11515783
Journal ref: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), London, United Kingdom, 2026

AI中文摘要

在医学图像中分割曲率结构对于分析临床应用中的形态学模式至关重要。整合拓扑属性如连通性可提高分割的准确性和一致性。然而，从持续图（PD）中提取和嵌入这些属性具有挑战性，因为它们非可微且计算成本高。现有方法大多通过手工设计的损失函数编码拓扑，泛化能力差。本文提出PIs-Regressor，一个简单有效的模块，直接从数据中学习持续图像（PI）——拓扑特征的有限、可微表示。与Topology SegNet结合，该框架将拓扑整合到网络架构本身而非辅助损失中。与依赖手工损失函数的方法不同，我们的方法直接将拓扑信息整合到网络结构中，从而实现更稳健的分割。我们的设计灵活，可无缝结合其他拓扑方法以进一步提升分割性能。实验结果表明，整合拓扑特征增强了模型鲁棒性，有效处理医学图像中的过曝和模糊挑战。在三个曲率基准上，我们的方法在像素级准确性和拓扑保真度上均达到最先进的性能。

英文摘要

Segmenting curvilinear structures in medical images is essential for analyzing morphological patterns in clinical applications. Integrating topological properties, such as connectivity, improves segmentation accuracy and consistency. However, extracting and embedding such properties - especially from Persistence Diagrams (PD) - is challenging due to their non-differentiability and computational cost. Existing approaches mostly encode topology through handcrafted loss functions, which generalize poorly across tasks. In this paper, we propose PIs-Regressor, a simple yet effective module that learns persistence image (PI) - finite, differentiable representations of topological features - directly from data. Together with Topology SegNet, which fuses these features in both downsampling and upsampling stages, our framework integrates topology into the network architecture itself rather than auxiliary losses. Unlike existing methods that depend heavily on handcrafted loss functions, our approach directly incorporates topological information into the network structure, leading to more robust segmentation. Our design is flexible and can be seamlessly combined with other topology-based methods to further enhance segmentation performance. Experimental results show that integrating topological features enhances model robustness, effectively handling challenges like overexposure and blurring in medical imaging. Our approach on three curvilinear benchmarks demonstrate state-of-the-art performance in both pixel-level accuracy and topological fidelity.

URL PDF HTML ☆

赞 0 踩 0

2512.14892 2026-06-16 cs.LG cs.AI 版本更新

OLR-WA: Online Weighted Average Linear Regression in Multivariate Data Streams

OLR-WA：多变量数据流中的在线加权平均线性回归

Mohammad Abu-Shaira, Alejandro Rodriguez, Greg Speegle, Victor Sheng, Ishfaq Ahmad

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）

AI总结本文提出OLR-WA模型，用于多变量数据流的在线线性回归，通过处理数据漂移和置信度场景，实现与批量回归相当甚至更优的性能。

详情

DOI: 10.1109/BigData59044.2023.10386601
Journal ref: 2023 IEEE International Conference on Big Data (BigData), 1039-1046

AI中文摘要

在线学习通过增量更新模型来处理新数据，避免大规模存储需求和昂贵的模型重计算。本文引入了

英文摘要

Online learning updates models incrementally with new data, avoiding large storage requirements and costly model recalculations. In this paper, we introduce "OLR-WA; OnLine Regression with Weighted Average", a novel and versatile multivariate online linear regression model. We also investigate scenarios involving drift, where the underlying patterns in the data evolve over time, conduct convergence analysis, and compare our approach with existing online regression models. The results of OLR-WA demonstrate its ability to achieve performance comparable to the batch regression, while also showcasing comparable or superior performance when compared with other state-of-the-art online models, thus establishing its effectiveness. Moreover, OLR-WA exhibits exceptional performance in terms of rapid convergence, surpassing other online models with consistently achieving high r2 values as a performance measure from the first iteration to the last iteration, even when initialized with minimal amount of data points, as little as 1% to 10% of the total data points. In addition to its ability to handle time-based (temporal drift) scenarios, remarkably, OLR-WA stands out as the only model capable of effectively managing confidence-based challenging scenarios. It achieves this by adopting a conservative approach in its updates, giving priority to older data points with higher confidence levels. In summary, OLR-WA's performance further solidifies its versatility and utility across different contexts, making it a valuable solution for online linear regression tasks.

URL PDF HTML ☆

赞 0 踩 0

2411.13602 2026-06-16 eess.IV cs.AI cs.CV 版本更新

Translating Electrocardiograms to Cardiac Magnetic Resonance Imaging Useful for Cardiac Assessment and Disease Screening: A Multi-Center Study

将心电图转换为心脏磁共振成像对心脏评估和疾病筛查有用：一项多中心研究

Zhengyao Ding, Ziyu Li, Yujian Hu, Youyao Xu, Chengchen Zhao, Yiheng Mao, Haitao Li, Zhikang Li, Qian Li, Jing Wang, Yue Chen, Mengjia Chen, Longbo Wang, Xuesen Chu, Weichao Pan, Ziyi Liu, Fei Wu, Hongkun Zhang, Ting Chen, Zhengxing Huang

发表机构 * College of Computer Science and Technology, Zhejiang University（浙江大学计算机科学与技术学院）； Department of Vascular Surgery, The First Affiliated Hospital of Zhejiang University School of Medicine（浙江大学医学院附属第一医院血管外科）； Department of Cardiology, The First Affiliated Hospital, Zhejiang University School of Medicine（浙江大学医学院附属第一医院心内科）； Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine（浙江大学医学院附属第一医院放射科）； Department of Vascular Surgery, Quzhou People’s Hospital（衢州人民医院血管外科）； Department of Cardiology, The Second Affiliated Hospital of Zhejiang University School of Medicine（浙江大学医学院附属第二医院心内科）； China Ship Scientific Research Center（中国船舶科学研究院）； Guangdong Transtek Medical Electronics Co., Ltd.（广东 Transtek 医疗电子有限公司）

AI总结本文提出CardioNets框架，通过深度学习将12导联心电图信号转换为心脏磁共振成像级别的功能参数和合成图像，提升大规模心血管疾病筛查的效率和可及性。

Comments 29 pages, 7 figures

详情

DOI: 10.1056/AIoa2500549
Journal ref: NEJM AI 2026;3(4)

AI中文摘要

心血管疾病（CVDs）是全球死亡的主要原因，需要可访问且准确的诊断工具。尽管心脏磁共振成像（CMR）提供心脏结构和功能的金标准见解，但其临床效用受到高成本和复杂性的限制。相比之下，心电图（ECG）成本低且广泛可用，但缺乏CMR的粒度。我们提出CardioNets，一种深度学习框架，将12导联ECG信号转换为CMR级别的功能参数和合成图像，从而实现可扩展的心脏评估。CardioNets整合了跨模态对比学习和生成预训练，对齐ECG与CMR衍生的心脏表型，并通过掩码自回归模型合成高分辨率CMR图像。在159,819个样本上训练，包括英国生物库（n=42,483）和MIMIC-IV-ECG（n=164,550），并在独立临床数据集（n=3,767）上进行外部验证，CardioNets在疾病筛查和表型估计任务中表现出色。在英国生物库中，它将心脏表型回归R2提高了24.8%，并使心肌病AUC提高了高达39.3%。在MIMIC中，它将肺动脉高压检测的AUC提高了5.6%。生成的CMR图像在SSIM和PSNR方面分别比先前方法高36.6%和8.7%。在一项读者研究中，仅使用ECG的CardioNets在准确率上比同时使用ECG和真实CMR的人类医生高13.9%。这些结果表明，CardioNets为大规模CVD筛查提供了一个有前景的低成本替代方案，特别是在资源有限的环境中。未来的工作将专注于临床部署和ECG基于合成成像的监管验证。

英文摘要

Cardiovascular diseases (CVDs) are the leading cause of global mortality, necessitating accessible and accurate diagnostic tools. While cardiac magnetic resonance imaging (CMR) provides gold-standard insights into cardiac structure and function, its clinical utility is limited by high cost and complexity. In contrast, electrocardiography (ECG) is inexpensive and widely available but lacks the granularity of CMR. We propose CardioNets, a deep learning framework that translates 12-lead ECG signals into CMR-level functional parameters and synthetic images, enabling scalable cardiac assessment. CardioNets integrates cross-modal contrastive learning and generative pretraining, aligning ECG with CMR-derived cardiac phenotypes and synthesizing high-resolution CMR images via a masked autoregressive model. Trained on 159,819 samples from five cohorts, including the UK Biobank (n=42,483) and MIMIC-IV-ECG (n=164,550), and externally validated on independent clinical datasets (n=3,767), CardioNets achieved strong performance across disease screening and phenotype estimation tasks. In the UK Biobank, it improved cardiac phenotype regression R2 by 24.8% and cardiomyopathy AUC by up to 39.3% over baseline models. In MIMIC, it increased AUC for pulmonary hypertension detection by 5.6%. Generated CMR images showed 36.6% higher SSIM and 8.7% higher PSNR than prior approaches. In a reader study, ECG-only CardioNets achieved 13.9% higher accuracy than human physicians using both ECG and real CMR. These results suggest that CardioNets offers a promising, low-cost alternative to CMR for large-scale CVD screening, particularly in resource-limited settings. Future efforts will focus on clinical deployment and regulatory validation of ECG-based synthetic imaging.

URL PDF HTML ☆

赞 0 踩 0

2512.08879 2026-06-16 cs.LG cs.AI 版本更新

DAO-GP Drift Aware Online Non-Linear Regression Gaussian-Process

DAO-GP：漂移感知在线非线性回归高斯过程

Mohammad Abu-Shaira, Ajita Rattani, Weishi Shi

发表机构 * st Mohammad Abu-Shaira（第一作者）； nd Ajita Rattani（第二作者）； rd Weishi Shi（第三作者）

AI总结提出DAO-GP模型，通过内置漂移检测与自适应机制、无超参数、稀疏化和衰减策略，解决在线高斯过程回归中概念漂移、超参数固定等问题，在多种漂移类型下表现鲁棒且优于现有方法。

详情

DOI: 10.1109/BigData66926.2025.11401428
Journal ref: 2025 IEEE International Conference on Big Data (BigData), pp. 776-785, 2025

AI中文摘要

真实世界的数据集通常表现出以数据分布演变为特征的时态动态。忽视这一现象（通常称为概念漂移）会显著降低模型的预测精度。此外，在线模型中超参数的存在加剧了这一问题。这些参数通常是固定的，用户无法根据演化的数据分布动态调整。高斯过程模型提供了具有不确定性量化的强大非参数回归能力，使其成为在线设置中建模复杂数据关系的理想选择。然而，传统的在线高斯过程方法存在几个关键限制，包括缺乏漂移感知、依赖固定超参数、易受数据窥探影响、缺乏原则性的衰减机制以及内存效率低下。为此，我们提出了DAO-GP（漂移感知在线高斯过程），一种新颖的、完全自适应的、无超参数、带衰减的稀疏非线性回归模型。DAO-GP具有内置的漂移检测和自适应机制，可根据漂移的严重程度动态调整模型行为。广泛的经验评估证实了DAO-GP在平稳条件、多种漂移类型（突变、增量、渐变）以及不同数据特征下的鲁棒性。分析表明其动态自适应、高效的内存和基于衰减的管理以及演化的诱导点。与最先进的参数和非参数模型相比，DAO-GP始终达到优越或竞争性的性能，使其成为在线非线性回归中具有漂移鲁棒性的解决方案。

英文摘要

Real-world datasets often exhibit temporal dynamics characterized by evolving data distributions. Disregarding this phenomenon, commonly referred to as concept drift, can significantly diminish a model's predictive accuracy. Furthermore, the presence of hyperparameters in online models exacerbates this issue. These parameters are typically fixed and cannot be dynamically adjusted by the user in response to the evolving data distribution. Gaussian Process (GP) models offer powerful non-parametric regression capabilities with uncertainty quantification, making them ideal for modeling complex data relationships in an online setting. However, conventional online GP methods face several critical limitations, including a lack of drift-awareness, reliance on fixed hyperparameters, vulnerability to data snooping, absence of a principled decay mechanism, and memory inefficiencies. In response, we propose DAO-GP (Drift-Aware Online Gaussian Process), a novel, fully adaptive, hyperparameter-free, decayed, and sparse non-linear regression model. DAO-GP features a built-in drift detection and adaptation mechanism that dynamically adjusts model behavior based on the severity of drift. Extensive empirical evaluations confirm DAO-GP's robustness across stationary conditions, diverse drift types (abrupt, incremental, gradual), and varied data characteristics. Analyses demonstrate its dynamic adaptation, efficient in-memory and decay-based management, and evolving inducing points. Compared with state-of-the-art parametric and non-parametric models, DAO-GP consistently achieves superior or competitive performance, establishing it as a drift-resilient solution for online non-linear regression.

URL PDF HTML ☆

赞 0 踩 0

2512.00572 2026-06-16 cs.CV cs.AI 版本更新

Integrating Skeleton Based Representations for Robust Yoga Pose Classification Using Deep Learning Models

基于骨架表示的瑜伽姿势分类深度学习模型整合

Mohammed Mohiuddin, Syed Mohammod Minhaz Hossain, Sumaiya Khanam, Prionkar Barua, Aparup Barua, MD Tamim Hossain

发表机构 * Department of Computer Science and Engineering, Premier University（计算机科学与工程系，普里梅尔大学）

AI总结本文提出Yoga-16数据集，系统评估了三种深度学习模型，证明骨架表示在瑜伽姿势分类中优于原始图像，VGG16结合MediaPipe骨架输入达到96.09%的准确率。

详情

DOI: 10.1038/s41598-025-23726-0

AI中文摘要

瑜伽因其精神和身体健康益处而全球流行，但错误姿势可能导致受伤。自动化瑜伽姿势分类因此变得重要，以减少对专家的依赖。尽管人类姿态关键点提取模型在动作识别中表现出潜力，但系统化的瑜伽姿势识别基准评估仍有限，因为先前工作通常仅关注原始图像或单一姿态提取模型。本文引入了'Yoga-16'数据集，以解决现有数据集的限制，并系统评估了三种深度学习架构（VGG16、ResNet50和Xception），使用三种输入模式（直接图像、MediaPipe Pose骨架图像和YOLOv8 Pose骨架图像）。我们的实验表明，基于骨架的表示优于原始图像输入，VGG16与MediaPipe Pose骨架输入的最高准确率为96.09%。此外，我们通过Grad-CAM进行可解释性分析，提供瑜伽姿势分类的模型决策洞察，通过交叉验证分析。

英文摘要

Yoga is a popular form of exercise worldwide due to its spiritual and physical health benefits, but incorrect postures can lead to injuries. Automated yoga pose classification has therefore gained importance to reduce reliance on expert practitioners. While human pose keypoint extraction models have shown high potential in action recognition, systematic benchmarking for yoga pose recognition remains limited, as prior works often focus solely on raw images or a single pose extraction model. In this study, we introduce a curated dataset, 'Yoga-16', which addresses limitations of existing datasets, and systematically evaluate three deep learning architectures (VGG16, ResNet50, and Xception), using three input modalities (direct images, MediaPipe Pose skeleton images, and YOLOv8 Pose skeleton images). Our experiments demonstrate that skeleton-based representations outperform raw image inputs, with the highest accuracy of 96.09% achieved by VGG16 with MediaPipe Pose skeleton input. Additionally, we provide interpretability analysis using Grad-CAM, offering insights into model decision-making for yoga pose classification with cross-validation analysis.

URL PDF HTML ☆

赞 0 踩 0

2511.17743 2026-06-16 cs.AI cs.SY eess.SY 版本更新

AI- and Ontology-Based Enhancements to FMEA for Advanced Systems Engineering: Current Developments and Future Directions

基于人工智能和本体的FMEA增强：先进系统工程中的最新发展与未来方向

Haytham Younus, Sohag Kabir, Felician Campean, Pascal Bonnaud, David Delaux

发表机构 * School of Computing and Engineering, University of Bradford（布里斯托大学计算机与工程学院）； SAFI Verse Limited（SAFI Verse有限公司）； Valeo（法拉利）

AI总结本文探讨了如何利用人工智能和本体技术改进FMEA，提升其数据驱动和语义丰富性，分析了AI和本体在系统工程中的应用及挑战。

Comments This manuscript is based on research undertaken by our doctoral student at the University of Bradford. The associated PhD thesis has been formally submitted to the University and is currently awaiting final examination. The review article is being shared on arXiv to make the review accessible to the research community while the thesis examination process is ongoing

详情

DOI: 10.3390/app16052464

AI中文摘要

本文综述了近期旨在将传统故障模式与影响分析（FMEA）转变为更智能、数据驱动和语义丰富的过程的最新进展。随着工程系统复杂性增加，传统FMEA方法因依赖人工、文档和专家而显得不足。本文探讨了人工智能技术，如机器学习和自然语言处理，如何通过自动化故障预测、优先级排序和从操作数据中提取知识来改进FMEA。同时，本文探讨了本体在形式化系统知识、支持语义推理、提高可追溯性和跨领域互操作性中的作用。此外，本文还综合了新兴的混合方法，如基于本体的学习和大语言模型整合，以进一步提高可解释性和自动化。这些发展在基于模型的系统工程（MBSE）和功能建模的更广泛背景下讨论，展示了AI和本体如何支持更适应和稳健的FMEA工作流程。本文还批判性地分析了各种工具、案例研究和整合策略，同时识别了与数据质量、可解释性、标准化和跨学科应用相关的关键挑战。通过利用AI、系统工程和本体的知识表示，本文为将FMEA嵌入智能、知识丰富的工程环境提供了结构化的路线图。

英文摘要

This article presents a state-of-the-art review of recent advances aimed at transforming traditional Failure Mode and Effects Analysis (FMEA) into a more intelligent, data-driven, and semantically enriched process. As engineered systems grow in complexity, conventional FMEA methods, largely manual, document-centric, and expert-dependent, have become increasingly inadequate for addressing the demands of modern systems engineering. We examine how techniques from Artificial Intelligence (AI), including machine learning and natural language processing, can transform FMEA into a more dynamic, data-driven, intelligent, and model-integrated process by automating failure prediction, prioritisation, and knowledge extraction from operational data. In parallel, we explore the role of ontologies in formalising system knowledge, supporting semantic reasoning, improving traceability, and enabling cross-domain interoperability. The review also synthesises emerging hybrid approaches, such as ontology-informed learning and large language model integration, which further enhance explainability and automation. These developments are discussed within the broader context of Model-Based Systems Engineering (MBSE) and function modelling, showing how AI and ontologies can support more adaptive and resilient FMEA workflows. We critically analyse a range of tools, case studies, and integration strategies, while identifying key challenges related to data quality, explainability, standardisation, and interdisciplinary adoption. By leveraging AI, systems engineering, and knowledge representation using ontologies, this review offers a structured roadmap for embedding FMEA within intelligent, knowledge-rich engineering environments.

URL PDF HTML ☆

赞 0 踩 0

2511.07090 2026-06-16 cs.AI 版本更新

Green AI: A systematic review and meta-analysis of its definitions, lifecycle models, hardware and measurement attempts

绿色人工智能：对其定义、生命周期模型、硬件和测量尝试的系统综述和元分析

Marcel Rojahn, Marcus Grum

发表机构 * University of Potsdam, Junior Chair of Business Information Systems, esp. AI-based Application Systems（波恩大学，商业信息系统初级职位，特别是基于AI的应用系统）

AI总结本文系统综述和元分析绿色人工智能的定义、生命周期模型、硬件及测量方法，提出统一定义、五阶段生命周期模型、治理框架和校准测量框架，以应对多维负担。

详情

DOI: 10.1016/j.infsof.2026.108186
Journal ref: Information and Software Technology 198 (2026) 108186

AI中文摘要

在人工智能生命周期中，从硬件到开发、部署和重用，负担包括能源、碳、水和嵌入式影响。云服务工具虽提高透明度，但异构且常忽略水和价值链影响，限制了可比性和可重复性。本文（i）建立与可持续人工智能不同的绿色人工智能统一操作定义；（ii）正式化与生命周期评估（LCA）阶段映射的五阶段生命周期，使能源、碳、水和嵌入式影响成为首要考虑因素；（iii）通过PDCA循环和决策关卡制定治理；（iv）系统化边缘云连续体的硬件和系统级策略以减少嵌入式负担；（v）定义结合估计模型和直接计量的校准测量框架，以实现可重复、提供者无关的比较。结合定义、生命周期过程、硬件策略和校准测量，本文为研究人员、实践者和政策制定者提供可操作的证据支持指导。

英文摘要

Across the Artificial Intelligence (AI) lifecycle - from hardware to development, deployment, and reuse - burdens span energy, carbon, water, and embodied impacts. Cloud provider tools improve transparency but remain heterogeneous and often omit water and value chain effects, limiting comparability and reproducibility. Addressing these multi dimensional burdens requires a lifecycle approach linking phase explicit mapping with system levers (hardware, placement, energy mix, cooling, scheduling) and calibrated measurement across facility, system, device, and workload levels. This article (i) establishes a unified, operational definition of Green AI distinct from Sustainable AI; (ii) formalizes a five phase lifecycle mapped to Life Cycle Assessment (LCA) stages, making energy, carbon, water, and embodied impacts first class; (iii) specifies governance via Plan Do Check Act (PDCA) cycles with decision gateways; (iv) systematizes hardware and system level strategies across the edge cloud continuum to reduce embodied burdens; and (v) defines a calibrated measurement framework combining estimator models with direct metering to enable reproducible, provider agnostic comparisons. Combining definition, lifecycle processes, hardware strategies, and calibrated measurement, this article offers actionable, evidence based guidance for researchers, practitioners, and policymakers.

URL PDF HTML ☆

赞 0 踩 0

2511.08507 2026-06-16 cs.CL cs.AI 版本更新

Introducing A Bangla Sentence - Gloss Pair Dataset for Bangla Sign Language Translation and Research

介绍一个孟加拉语句子- gloss配对数据集用于孟加拉语手语翻译和研究

Neelavro Saha, Rafi Shahriyar, Nafis Ashraf Roudra, Saadman Sakib, Annajiat Alim Rasel

发表机构 * Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology（Bangladesh University of Engineering and Technology计算机科学与工程系）

AI总结本文介绍了一个包含1000个人工标注句子- gloss配对的新数据集Bangla-SGP，通过规则基于的检索增强生成管道生成约3000个合成配对，用于孟加拉语手语翻译和研究。

详情

DOI: 10.63317/38qenrwzegr9
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pp. 10457-10466, ELRA, Palma, Mallorca, Spain, May 2026

AI中文摘要

孟加拉语手语（BdSL）翻译是一个低资源自然语言处理任务，由于缺乏大规模数据集来解决句子级翻译。相应地，该领域现有研究局限于词和字母级别的检测。在本工作中，我们介绍了Bangla-SGP，一个包含1000个由专业手语者手动标注的高质量孟加拉语句子的平行数据集，这些句子被注释为gloss序列。该数据集通过基于规则的检索增强生成（RAG）管道扩展，使用句法和形态学规则生成约3000个合成配对。gloss序列由单独的gloss组成，这些gloss是孟加拉语手语支持的词汇，并作为连续手语的中间表示。我们的数据集由1000个高质量孟加拉语句子组成，这些句子由专业手语者手动注释为gloss序列。增强过程结合了基于规则的语言学策略和提示工程技术，这些技术通过批判性分析我们的人工标注句子-gloss配对以及与专业手语者密切合作而获得。此外，我们微调了几种基于transformer的模型，如mBart50、Google mT5、GPT4.1-nano，并使用BLEU分数评估其句子到gloss的翻译性能。基于这些评估指标，我们比较了模型在我们数据集和RWTH-PHOENIX-2014T基准上的gloss翻译一致性。

英文摘要

Bangla Sign Language (BdSL) translation represents a low-resource NLP task due to the lack of large-scale datasets that address sentence-level translation. Correspondingly, existing research in this field has been limited to word and alphabet level detection. In this work, we introduce Bangla-SGP, a novel parallel dataset consisting of 1,000 human-annotated sentence-gloss pairs which was augmented with around 3,000 synthetically generated pairs using syntactic and morphological rules through a rule-based Retrieval-Augmented Generation (RAG) pipeline. The gloss sequences of the spoken Bangla sentences are made up of individual glosses which are Bangla sign supported words and serve as an intermediate representation for a continuous sign. Our dataset consists of 1000 high quality Bangla sentences that are manually annotated into a gloss sequence by a professional signer. The augmentation process incorporates rule-based linguistic strategies and prompt engineering techniques that we have adopted by critically analyzing our human annotated sentence-gloss pairs and by working closely with our professional signer. Furthermore, we fine-tune several transformer-based models such as mBart50, Google mT5, GPT4.1-nano and evaluate their sentence-to-gloss translation performance using BLEU scores, based on these evaluation metrics we compare the model's gloss-translation consistency across our dataset and the RWTH-PHOENIX-2014T benchmark.

URL PDF HTML ☆

赞 0 踩 0

2509.01182 2026-06-16 cs.AI cs.CL cs.HC cs.IR cs.MA 版本更新

Question-to-Knowledge (Q2K): Multi-Agent Generation of Inspectable Facts for Product Mapping

问题到知识（Q2K）：多智能体生成可检查的事实以实现产品映射

Wonduk Seo, Taesub Shin, Hyunjin An, Dokyun Kim, Seunghyun Lee

发表机构 * The University of Tokyo（东京大学）； KISTI（韩国科学技术院）

AI总结 Q2K通过多智能体框架利用大语言模型实现可靠的产品SKU映射，通过生成辨析问题、网络搜索和去重来提高准确性与鲁棒性，适用于复杂场景如捆绑识别和品牌来源辨析。

Comments Accepted by IEEE BigData 2025 Industry Track

详情

DOI: 10.1109/BigData66926.2025.11401468
Journal ref: 2025 IEEE International Conference on Big Data (BigData), Macau, China, 2025, pp. 2646-2653

AI中文摘要

识别两个产品列表是否指向相同的库存单位（SKU）是电子商务中的持续挑战，尤其是在缺乏显式标识符且产品名称在不同平台上差异较大的情况下。基于规则的启发式方法和关键词相似性经常因忽略品牌、规格或捆绑配置的细微区别而误分类。为克服这些限制，我们提出了问题到知识（Q2K），一个多智能体框架，利用大语言模型（LLMs）进行可靠的SKU映射。Q2K集成了：（1）一个推理代理，生成定向的辨析问题；（2）一个知识代理，通过聚焦的网络搜索解决这些问题；（3）一个去重代理，重用已验证的推理轨迹以减少冗余并确保一致性。人类在循环机制进一步细化不确定情况。在真实世界消费品数据集上的实验表明，Q2K超越了强大的基线，实现了在捆绑识别和品牌来源辨析等困难场景中的更高准确性和鲁棒性。通过重用检索到的推理而不是发出重复搜索，Q2K在准确性和效率之间取得了平衡，提供了一种可扩展且可解释的解决方案用于产品整合。

英文摘要

Identifying whether two product listings refer to the same Stock Keeping Unit (SKU) is a persistent challenge in ecommerce, especially when explicit identifiers are missing and product names vary widely across platforms. Rule based heuristics and keyword similarity often misclassify products by overlooking subtle distinctions in brand, specification, or bundle configuration. To overcome these limitations, we propose Question to Knowledge (Q2K), a multi agent framework that leverages Large Language Models (LLMs) for reliable SKU mapping. Q2K integrates: (1) a Reasoning Agent that generates targeted disambiguation questions, (2) a Knowledge Agent that resolves them via focused web searches, and (3) a Deduplication Agent that reuses validated reasoning traces to reduce redundancy and ensure consistency. A human in the loop mechanism further refines uncertain cases. Experiments on real world consumer goods datasets show that Q2K surpasses strong baselines, achieving higher accuracy and robustness in difficult scenarios such as bundle identification and brand origin disambiguation. By reusing retrieved reasoning instead of issuing repeated searches, Q2K balances accuracy with efficiency, offering a scalable and interpretable solution for product integration.

URL PDF HTML ☆

赞 0 踩 0

2511.05505 2026-06-16 q-bio.NC cs.AI 版本更新

Rewiring Human Brain Networks via Lightweight Dynamic Connectivity Framework: An EEG-Based Stress Validation

通过轻量级动态连接框架重绘人脑网络：基于EEG的应力验证

Sayantan Acharya, Abbas Khosravi, Douglas Creighton, Roohallah Alizadehsani, U. Rajendra Acharya

发表机构 * Institute for Intelligent Systems Research and Innovation（智能系统研究与创新研究所）； University of Southern Queensland（南方昆士兰大学）

AI总结本文提出基于时间变化定向传递函数的轻量级动态脑连接框架，通过机器学习验证EEG数据中的应力分类，发现alpha-TV-DTF在分类中表现最佳，凸显动态连接在捕捉脑区时间与因果影响上的优势。

Comments 21 pages, 21 figures, 6 tables, 50 references,

详情

DOI: 10.1016/j.compbiomed.2026.111801
Journal ref: 2026. Reconfiguring brain networks via lightweight dynamic connectivity framework: An EEG-based stress validation. Computers in Biology and Medicine, 213, p.111801

AI中文摘要

近年来，结合人工智能和机器学习模型的脑电图分析在压力研究中日益突出。本文提出了一种基于时间变化定向传递函数的轻量级动态脑连接框架，通过机器学习模型验证了TV DTF特征。TV DTF估计了不同EEG频率带之间脑区的定向信息流，从而捕捉到通常被静态功能连接测量所忽视的时间和因果影响。使用32通道SAM 40数据集的EEG记录，重点研究了心算任务试验。通过支持向量机、随机森林、梯度提升、自适应提升和极端梯度提升等机器学习分类器验证了动态EEG基的TV-DTF特征。实验结果表明，alpha-TV-DTF具有最强的判别能力，SVM在3类分类中达到89.73%的准确率，XGBoost在2类分类中达到93.69%的准确率。与绝对功率和相位锁定基于的功能连接特征相比，alpha TV DTF和beta TV DTF在所有机器学习模型中均表现更优，突显了动态测量相对于静态测量的优势。特征重要性分析进一步突显了主导的远距离前额叶和前额顶叶信息影响，强调了压力下前额叶区域的调节作用。这些发现验证了轻量级TV-DTF作为一种稳健框架的有效性，揭示了不同压力水平下的空间时间脑动态和方向性影响。

英文摘要

In recent years, Electroencephalographic analysis has gained prominence in stress research when combined with AI and Machine Learning models for validation. In this study, a lightweight dynamic brain connectivity framework based on Time Varying Directed Transfer Function is proposed, where TV DTF features were validated through ML based stress classification. TV DTF estimates the directional information flow between brain regions across distinct EEG frequency bands, thereby capturing temporal and causal influences that are often overlooked by static functional connectivity measures. EEG recordings from the 32 channel SAM 40 dataset were employed, focusing on mental arithmetic task trials. The dynamic EEG-based TV-DTF features were validated through ML classifiers such as Support Vector Machine, Random Forest, Gradient Boosting, Adaptive Boosting, and Extreme Gradient Boosting. Experimental results show that alpha-TV-DTF provided the strongest discriminative power, with SVM achieving 89.73% accuracy in 3-class classification and with XGBoost achieving 93.69% accuracy in 2 class classification. Relative to absolute power and phase locking based functional connectivity features, alpha TV DTF and beta TV DTF achieved higher performance across the ML models, highlighting the advantages of dynamic over static measures. Feature importance analysis further highlighted dominant long-range frontal parietal and frontal occipital informational influences, emphasizing the regulatory role of frontal regions under stress. These findings validate the lightweight TV-DTF as a robust framework, revealing spatiotemporal brain dynamics and directional influences across different stress levels.

URL PDF HTML ☆

赞 0 踩 0

2510.19728 2026-06-16 cs.LG cs.AI 版本更新

Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series

通过生成合成医疗时间序列实现细粒度亚组级别模型评估

Mahmoud Ibrahim, Bart Elen, Chang Sun, Gökhan Ertaylan, Michel Dumontier

发表机构 * Institute of Data Science, Faculty of Science and Engineering, Maastricht University（数据科学研究所，科学与工程学院，马斯特里赫特大学）； Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University（先进计算科学系，科学与工程学院，马斯特里赫特大学）； VITO（VITO研究院）

AI总结本文提出一种框架，利用合成ICU时间序列数据训练和评估预测模型，特别是在细粒度人口亚组中。引入Enhanced TimeAutoDiff，通过分布对齐惩罚增强潜在扩散目标，减少真实-合成与真实-真实评估差距，提升亚组模型评估的鲁棒性和可靠性。

详情

DOI: 10.1007/978-3-032-19102-1_19

AI中文摘要

我们提出了一种新的框架，利用合成ICU时间序列数据不仅训练，还能严格可信地评估预测模型，既在总体层面，又在细粒度人口亚组中。基于先前的扩散和VAE生成器（TimeDiff，HealthGen，TimeAutoDiff），我们引入Enhanced TimeAutoDiff，通过在潜在扩散目标中加入分布对齐惩罚。我们广泛在MIMIC-III和eICU上对所有模型进行了基准测试，针对24小时死亡率和二元住院时间任务。我们的结果表明，Enhanced TimeAutoDiff通过减少真实-合成与真实-真实评估（

英文摘要

We present a novel framework for leveraging synthetic ICU time-series data not only to train but also to rigorously and trustworthily evaluate predictive models, both at the population level and within fine-grained demographic subgroups. Building on prior diffusion and VAE-based generators (TimeDiff, HealthGen, TimeAutoDiff), we introduce \textit{Enhanced TimeAutoDiff}, which augments the latent diffusion objective with distribution-alignment penalties. We extensively benchmark all models on MIMIC-III and eICU, on 24-hour mortality and binary length-of-stay tasks. Our results show that Enhanced TimeAutoDiff reduces the gap between real-on-synthetic and real-on-real evaluation (``TRTS gap'') by over 70\%, achieving $Δ_{TRTS} \leq 0.014$ AUROC, while preserving training utility ($Δ_{TSTR} \approx 0.01$). Crucially, for 32 intersectional subgroups, large synthetic cohorts cut subgroup-level AUROC estimation error by up to 50\% relative to small real test sets, and outperform them in 72--84\% of subgroups. This work provides a practical, privacy-preserving roadmap for trustworthy, granular model evaluation in critical care, enabling robust and reliable performance analysis across diverse patient populations without exposing sensitive EHR data, contributing to the overall trustworthiness of Medical AI.

URL PDF HTML ☆

赞 0 踩 0

2509.02093 2026-06-16 cs.CL cs.AI cs.IR 版本更新

Better by Comparison: Retrieval-Augmented Contrastive Reasoning for Automatic Prompt Optimization

通过对比改进：基于检索增强的对比推理用于自动提示优化

Juhyeon Lee, Wonduk Seo, Hyunjin An, Seunghyun Lee, Yi Bu

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文提出CRPO框架，通过对比推理提升提示优化效果，利用HelpSteer2数据集中的高质量示例进行对比分析，改进提示生成的鲁棒性和可解释性。

Comments Preprint

详情

DOI: 10.1109/JCDL67857.2025.00043
Journal ref: 2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Dekalb, IL, USA, 2025, pp. 269-272

AI中文摘要

自动提示优化近期作为一种提升大型语言模型（LLMs）提示质量的策略，旨在生成更准确和有用的响应。然而，大多数先前工作集中在直接提示精炼或模型微调，忽略了利用LLM内在推理能力从对比示例中学习的潜力。本文提出对比推理提示优化（CRPO），一种新颖的框架，将提示优化建模为检索增强的推理过程。我们的方法从HelpSteer2数据集检索top k参考提示-响应对，该数据集是一个开源集合，每个响应均标注了有用性、正确性、连贯性、复杂性和冗余性。我们构建了两种互补的优化范式：（1）分层对比推理，其中LLM比较高质量、中等质量和低质量的示例（提示和响应）以通过反思推理优化自身生成；（2）多指标对比推理，其中LLM分析每个评估维度的最佳示例并整合其优势以生成优化提示。通过显式对比高质量和低质量示例，CRPO使模型能够推断为何某些提示成功而其他失败，从而实现更鲁棒和可解释的优化。在HelpSteer2基准测试中的实验结果表明，CRPO显著优于基线方法。我们的发现突显了对比、检索增强推理在推进自动提示优化方面的潜力。

英文摘要

Automatic prompt optimization has recently emerged as a strategy for improving the quality of prompts used in Large Language Models (LLMs), with the goal of generating more accurate and useful responses. However, most prior work focuses on direct prompt refinement or model fine-tuning, overlooking the potential of leveraging LLMs' inherent reasoning capability to learn from contrasting examples. In this paper, we present Contrastive Reasoning Prompt Optimization (CRPO), a novel framework that formulates prompt optimization as a retrieval-augmented reasoning process. Our approach retrieves top k reference prompt-response pairs from the HelpSteer2 dataset, an open source collection where each response is annotated for helpfulness, correctness, coherence, complexity, and verbosity, and constructs two complementary optimization paradigms: (1) tiered contrastive reasoning, where the LLM compares high-, medium-, and low-quality exemplars (both prompts and responses) to refine its own generation through reflective reasoning, and (2) multi-metric contrastive reasoning, where the LLM analyzes the best exemplars along each evaluation dimension and integrates their strengths into an optimized prompt. By explicitly contrasting high and low quality exemplars, CRPO enables the model to deduce why certain prompts succeed while others fail, thereby achieving more robust and interpretable optimization. Experimental results on the HelpSteer2 benchmark demonstrate that CRPO significantly outperforms baselines. Our findings highlight the promise of contrastive, retrieval-augmented reasoning for advancing automatic prompt optimization.

URL PDF HTML ☆

赞 0 踩 0

2509.00176 2026-06-16 cs.CV cs.AI 版本更新

Waste-Bench: A Comprehensive Benchmark for Evaluating VLLMs in Cluttered Environments

Waste-Bench: 一个用于评估在杂乱环境中视觉大型语言模型性能的综合基准

Muhammad Ali, Salman Khan

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）

AI总结本文提出Waste-Bench基准，用于评估VLLMs在复杂环境中的鲁棒性和准确性，揭示了提升VLLM在复杂环境性能的必要性。

详情

Journal ref: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), pp. 31019-31032, 2025

AI中文摘要

近年来，大型语言模型（LLMs）的进步为能够执行广泛视觉理解任务的视觉大型语言模型（VLLMs）铺平了道路。尽管LLMs在标准自然图像上表现出色，但其在杂乱数据集中的能力尚未得到充分探索，其中包含复杂环境和变形形状的对象。在本工作中，我们引入了一个专门设计用于现实场景中垃圾分类的新型数据集，其特点是有复杂的环境和变形形状的对象。此外，我们还提出了一种深入的评估方法，以严格评估VLLMs的鲁棒性和准确性。所引入的数据集和全面分析为VLLMs在挑战性条件下性能提供了有价值的见解。我们的发现强调了进一步提升VLLM鲁棒性以在复杂环境中表现更好的重要性。数据集和实验代码将公开发布。

英文摘要

Recent advancements in Large Language Models (LLMs) have paved the way for Vision Large Language Models (VLLMs) capable of performing a wide range of visual understanding tasks. While LLMs have demonstrated impressive performance on standard natural images, their capabilities have not been thoroughly explored in cluttered datasets where there is complex environment having deformed shaped objects. In this work, we introduce a novel dataset specifically designed for waste classification in real-world scenarios, characterized by complex environments and deformed shaped objects. Along with this dataset, we present an in-depth evaluation approach to rigorously assess the robustness and accuracy of VLLMs. The introduced dataset and comprehensive analysis provide valuable insights into the performance of VLLMs under challenging conditions. Our findings highlight the critical need for further advancements in VLLM's robustness to perform better in complex environments. The dataset and code for our experiments will be made publicly available.

URL PDF HTML ☆

赞 0 踩 0

2502.16560 2026-06-16 cs.AI cs.CL cs.SI 版本更新

An Analytical Emotion Framework of Rumour Threads on Social Media

社交媒体谣言线中的分析情绪框架

Rui Xing, Boyang Sun, Kun Zhang, Preslav Nakov, Timothy Baldwin, Jey Han Lau

发表机构 * University of Edinburgh（爱丁堡大学）

AI总结本文提出一个多方面情绪检测框架，分析谣言与非谣言线的情绪差异，揭示谣言引发负面情绪而非谣言引发正面情绪，并通过因果分析揭示情绪传播机制。

Comments Accepted to ICWSM 2025 MisD Workshop

详情

DOI: 10.36190/2025.28

AI中文摘要

在线社交媒体中的谣言对现代社会构成重大风险，推动了对谣言发展机制的深入理解。本文聚焦谣言与情绪在线讨论中的交互，构建了一个多方面情绪分析框架，对比谣言与非谣言线，并进行情绪的关联与因果分析。我们应用该框架于现有广泛使用的谣言数据集，进一步理解在线社交媒体线的情绪动态。框架显示谣言引发更多负面情绪（如愤怒、恐惧、悲观），而非谣言引发更多积极情绪。情绪具有传染性，谣言传播负面情绪，非谣言传播正面情绪。因果分析显示惊讶连接谣言与其他情绪；悲观来自悲伤和恐惧，而乐观源于喜悦和爱。

英文摘要

Rumours in online social media pose significant risks to modern society, motivating the need for better understanding of how they develop. We focus specifically on the interface between emotion and rumours in threaded discourses, building on the surprisingly sparse literature on the topic which has largely focused on single aspect of emotions within the original rumour posts themselves, and largely overlooked the comparative differences between rumours and non-rumours. In this work, we take one step further to provide a comprehensive analytical emotion framework with multi-aspect emotion detection, contrasting rumour and non-rumour threads and provide both correlation and causal analysis of emotions. We applied our framework on existing widely-used rumour datasets to further understand the emotion dynamics in online social media threads. Our framework reveals that rumours trigger more negative emotions (e.g., anger, fear, pessimism), while non-rumours evoke more positive ones. Emotions are contagious, rumours spread negativity, non-rumours spread positivity. Causal analysis shows surprise bridges rumours and other emotions; pessimism comes from sadness and fear, while optimism arises from joy and love.

URL PDF HTML ☆

赞 0 踩 0

2504.08609 2026-06-16 cs.CL cs.AI 版本更新

A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English

面向英文文本仇恨言论多标签分类的机器学习模型与数据集综述

Julian Bäumler, Louis Blöcher, Lars-Joel Frey, Xian Chen, Markus Bayer, Christian Reuter

发表机构 * Technical University of Darmstadt, Science and Technology for Peace and Security (PEASEC)（德累斯顿技术大学，和平与安全科学技术（PEASEC））

AI总结本文综述了46篇英文文献，分析了28个适合多标签分类模型训练的数据集，揭示了标签集、大小、元概念等的异质性，并指出评估不一致、BERT和RNN偏好等关键问题，提出十项研究建议。

Comments 35 pages, 4 figures, 4 tables

详情

DOI: 10.1145/3820778
Journal ref: ACM Transactions on Knowledge Discovery from Data (2026)

AI中文摘要

在线仇恨言论的传播对个人、在线社区和社会整体都有严重负面影响。鉴于此以及海量仇恨内容的规模，内容审核和执法人员及研究人员对机器学习模型自动分类仇恨言论产生了兴趣。尽管大多数科学作品将仇恨言论分类视为二元任务，但实践中往往需要区分子类型，例如根据目标、严重程度或合法性，这可能在个别内容上重叠。因此，研究者创建了数据集和机器学习模型，将文本数据中的仇恨言论分类视为多标签问题。本文首次系统全面地综述了英文文献中这一新兴研究领域的科学文献（N=46）。我们贡献了28个适合训练多标签分类模型的数据集的简要概述，揭示了标签集、大小、元概念、标注过程和标注者间一致性的显著异质性。对24篇提出合适分类模型的出版物的分析进一步揭示了评估不一致以及对双向编码表示变换器（BERT）和循环神经网络（RNN）的偏好。我们识别出训练数据不平衡、依赖众包平台、小而稀疏的数据集以及缺失方法学一致性为关键开放问题，并提出了十项研究建议。

英文摘要

The dissemination of online hate speech can have serious negative consequences for individuals, online communities, and entire societies. This and the large volume of hateful online content prompted both practitioners', i.e., in content moderation or law enforcement, and researchers' interest in machine learning models to automatically classify instances of hate speech. Whereas most scientific works address hate speech classification as a binary task, practice often requires a differentiation into sub-types, e.g., according to target, severity, or legality, which may overlap for individual content. Hence, researchers created datasets and machine learning models that approach hate speech classification in textual data as a multi-label problem. This work presents the first systematic and comprehensive survey of scientific literature on this emerging research landscape in English (N=46). We contribute with a concise overview of 28 datasets suited for training multi-label classification models that reveals significant heterogeneity regarding label-set, size, meta-concept, annotation process, and inter-annotator agreement. Our analysis of 24 publications proposing suitable classification models further establishes inconsistency in evaluation and a preference for architectures based on Bidirectional Encoder Representation from Transformers (BERT) and Recurrent Neural Networks (RNNs). We identify imbalanced training data, reliance on crowdsourcing platforms, small and sparse datasets, and missing methodological alignment as critical open issues and formulate ten recommendations for research.

URL PDF HTML ☆

赞 0 踩 0

2502.05214 2026-06-16 eess.IV cs.AI cs.CV 版本更新

CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models

CoRPA: 基于概念向量扰动和生成模型的胸部X光图像对抗生成

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

发表机构 * School of Informatics, University of Edinburgh（信息学院，爱丁堡大学）； NHS Lothian（NHS洛锡安）

AI总结本文提出CoRPA，一种针对医学影像领域的临床聚焦对抗攻击框架，通过概念向量扰动生成对抗性影像报告和图像，揭示医疗AI在真实临床场景下的脆弱性。

详情

DOI: 10.1109/ICHI64645.2025.00057

AI中文摘要

深度学习模型在医学图像分类任务中的应用日益广泛，旨在提高诊断准确性、减轻医务人员负担并改善患者预后。然而，其对对抗攻击的脆弱性对患者安全构成重大风险。当前攻击方法使用通用技术如模型查询或像素值扰动生成对抗样本以欺骗模型。这些方法可能无法充分解决源于临床错误的特征遗漏或误识别问题。我们提出基于概念的报告扰动攻击（CoRPA），一种专注于临床的黑盒对抗攻击框架，专门针对医学影像领域。CoRPA利用临床概念生成对抗性放射学报告和图像，以接近现实的临床误诊场景。我们使用MIMIC-CXR-JPG数据集中的胸部X光影像和放射学报告验证了CoRPA的实用性。评估显示，对传统对抗攻击具有强大鲁棒性的深度学习模型在面对CoRPA的临床聚焦扰动时显著更脆弱。这突显了在医疗AI系统中解决领域特定脆弱性的重要性。通过引入专门的对抗攻击框架，本研究为开发在真实世界中可靠、安全的AI模型提供了基础，确保其在高风险临床环境中的安全可靠部署。

英文摘要

Deep learning models for medical image classification tasks are becoming widely implemented in AI-assisted diagnostic tools, aiming to enhance diagnostic accuracy, reduce clinician workloads, and improve patient outcomes. However, their vulnerability to adversarial attacks poses significant risks to patient safety. Current attack methodologies use general techniques such as model querying or pixel value perturbations to generate adversarial examples designed to fool a model. These approaches may not adequately address the unique characteristics of clinical errors stemming from missed or incorrectly identified clinical features. We propose the Concept-based Report Perturbation Attack (CoRPA), a clinically-focused black-box adversarial attack framework tailored to the medical imaging domain. CoRPA leverages clinical concepts to generate adversarial radiological reports and images that closely mirror realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our evaluation reveals that deep learning models exhibiting strong resilience to conventional adversarial attacks are significantly less robust when subjected to CoRPA's clinically-focused perturbations. This underscores the importance of addressing domain-specific vulnerabilities in medical AI systems. By introducing a specialized adversarial attack framework, this study provides a foundation for developing robust, real-world-ready AI models in healthcare, ensuring their safe and reliable deployment in high-stakes clinical environments.

URL PDF HTML ☆

赞 0 踩 0

2410.20066 2026-06-16 eess.SP cs.AI 版本更新

A Multi-Modal Non-Invasive Deep Learning Framework for Progressive Prediction of Seizures

一种用于癫痫发作渐进预测的多模态非侵入式深度学习框架

Ali Saeizadeh, Douglas Schonholtz, Joseph S. Neimat, Pedram Johari, Tommaso Melodia

发表机构 * Institute for the Wireless Internet of Things, Northeastern University, Boston, MA, U.S.A.（无线物联网研究所，东北大学，波士顿，马萨诸塞州，美国）； University of Louisville, Louisville, KY, U.S.A.（路易斯维尔大学，路易斯维尔，肯塔基州，美国）

AI总结本文提出一种基于非侵入式多模态传感器网络的深度学习框架，用于癫痫发作的渐进预测，通过提高预测精度和实时处理能力，实现95%的灵敏度和98%的特异度。

Comments 4 pages, 5 figures, Proceedings of the IEEE 20th International Conference on Body Sensor Networks (BSN), October 2024

详情

DOI: 10.1109/BSN63547.2024.10780713
Journal ref: 2024 IEEE 20th International Conference on Body Sensor Networks (BSN)

AI中文摘要

本文介绍了一种创新框架，旨在通过非侵入式多模态传感器网络的深度学习方法，实现癫痫发作的渐进预测。癫痫是一种严重影响神经系统的疾病，影响全球约6500万人，其中相当一部分患者对药物治疗反应不佳。为解决这一挑战，我们倡导预测系统，能够及时向高风险个体发出警报，使他们能够采取预防措施。我们的框架利用先进的深度学习技术，并使用来自非侵入式脑电图（EEG）和心电图（ECG）传感器网络的个性化数据，从而提高预测准确性。算法被优化为在边缘设备上进行实时处理，以减轻隐私问题和云方案中固有的数据传输开销，最终节省电池电量。此外，我们的系统预测癫痫发作的倒计时时间（在发作前15分钟间隔内最多一小时），为预防措施提供关键时间。我们的多模态模型在29名患者中实现了95%的灵敏度、98%的特异度和97%的准确率。

英文摘要

This paper introduces an innovative framework designed for progressive (granular in time to onset) prediction of seizures through the utilization of a Deep Learning (DL) methodology based on non-invasive multi-modal sensor networks. Epilepsy, a debilitating neurological condition, affects an estimated 65 million individuals globally, with a substantial proportion facing drug-resistant epilepsy despite pharmacological interventions. To address this challenge, we advocate for predictive systems that provide timely alerts to individuals at risk, enabling them to take precautionary actions. Our framework employs advanced DL techniques and uses personalized data from a network of non-invasive electroencephalogram (EEG) and electrocardiogram (ECG) sensors, thereby enhancing prediction accuracy. The algorithms are optimized for real-time processing on edge devices, mitigating privacy concerns and minimizing data transmission overhead inherent in cloud-based solutions, ultimately preserving battery energy. Additionally, our system predicts the countdown time to seizures (with 15-minute intervals up to an hour prior to the onset), offering critical lead time for preventive actions. Our multi-modal model achieves 95% sensitivity, 98% specificity, and 97% accuracy, averaged among 29 patients.

URL PDF HTML ☆

赞 0 踩 0

2406.07277 2026-06-16 cs.CL cs.AI cs.MA 版本更新

Speaking Your Language: Spatial Relationships in Interpretable Emergent Communication

说出你的语言：可解释的涌现交流中的空间关系

Olaf Lipinski, Adam J. Sobey, Federico Cerutti, Timothy J. Norman

发表机构 * University of Southampton（索姆塞特大学）； The Alan Turing Institute（艾伦·图灵研究所）； University of Brescia（布雷西亚大学）

AI总结本文研究了智能体如何通过空间关系交流，展示了其能发展出表达观察部分关系的语言，实现90%以上的准确率，并证明该语言可被人类解读。

Comments Accepted at NeurIPS 2024. 18 pages, 3 figures

详情

DOI: 10.52202/079017-4446
Journal ref: In Advances in Neural Information Processing Systems (Vol. 37, pp. 140113-140137) 2024

AI中文摘要

有效的交流需要能够参照观察中的特定部分相对于其他部分的能力。尽管涌现交流文献在开发各种语言属性方面取得成功，但尚未有研究展示出此类位置参照的出现。本文展示了智能体如何在观察中交流空间关系。结果表明，智能体可以发展出能够表达其观察部分之间关系的语言，在训练于需要此类交流的指称游戏中，准确率超过90%。使用词组测量方法，我们展示了智能体如何创建此类参照。此分析表明，智能体使用非组合性和组合性信息的混合来传达空间关系。我们还证明了涌现语言可被人类解读。通过与接收智能体交流测试翻译准确性，接收智能体使用该词典部分达到78%以上的准确率，证实了该涌现语言的解读成功。

英文摘要

Effective communication requires the ability to refer to specific parts of an observation in relation to others. While emergent communication literature shows success in developing various language properties, no research has shown the emergence of such positional references. This paper demonstrates how agents can communicate about spatial relationships within their observations. The results indicate that agents can develop a language capable of expressing the relationships between parts of their observation, achieving over 90% accuracy when trained in a referential game which requires such communication. Using a collocation measure, we demonstrate how the agents create such references. This analysis suggests that agents use a mixture of non-compositional and compositional messages to convey spatial relationships. We also show that the emergent language is interpretable by humans. The translation accuracy is tested by communicating with the receiver agent, where the receiver achieves over 78% accuracy using parts of this lexicon, confirming that the interpretation of the emergent language was successful.

URL PDF HTML ☆

赞 0 踩 0

2410.11861 2026-06-16 cs.HC cs.AI 版本更新

Investigating Role of Big Five Personality Traits in Audio-Visual Rapport Estimation

探究大五人格特质在音频视觉共情估计中的作用

Takato Hayashi, Ryusei Kimura, Ryo Ishii, Shogo Okada

发表机构 * Japan Advanced Institute of Science and Technology（日本科学技术先进研究院）； Human Informatics Laboratories, NTT Corporation（NTT公司人因实验室）

AI总结本研究探讨了大五人格特质在朋友间音频视觉共情估计中的作用，通过比较有无人格特质输入的模型，发现其能提升共情估计性能，并分解出感知者效应、目标效应和关系效应。

Comments 9 pages, 5 figures

详情

DOI: 10.1109/FG61629.2025.11099120
Journal ref: International Conference on Automatic Face and Gesture Recognition (FG2025)

AI中文摘要

在社交互动中自动估计共情是情感计算的核心组成部分。最近的研究表明，使用参与者的个性特质作为模型输入可以提高初始互动中共情估计的性能。本研究探讨这一发现是否适用于朋友间的互动，通过开发利用非语言线索（音频和面部表情）作为输入的共情估计模型进行研究。我们的实验结果表明，将大五特征（BFFs）添加到非语言特征中可以提高双人互动中自我报告共情的估计性能。接下来，我们通过比较有无BFFs的模型，揭示BFFs如何提高共情估计性能。我们使用社会关系模型将共情评分分解为感知者效应（人们对他人的评分倾向）、目标效应（人们被他人评分的倾向）和关系效应（人们对特定人的独特评分）。然后分析BFFs在捕捉每种效应中的贡献程度。我们的分析表明，感知者和目标的BFFs使估计模型分别捕捉感知者和目标效应。此外，我们的实验结果表明，面部表情特征与BFFs的组合不仅在估计共情评分方面表现最佳，还在估计三种效应方面也表现最佳。本研究是理解为何基于个性的互识感知估计模型能实现高估计性能的第一步。

英文摘要

Automatic rapport estimation in social interactions is a central component of affective computing. Recent reports have shown that the estimation performance of rapport in initial interactions can be improved by using the participant's personality traits as the model's input. In this study, we investigate whether this findings applies to interactions between friends by developing rapport estimation models that utilize nonverbal cues (audio and facial expressions) as inputs. Our experimental results show that adding Big Five features (BFFs) to nonverbal features can improve the estimation performance of self-reported rapport in dyadic interactions between friends. Next, we demystify how BFFs improve the estimation performance of rapport through a comparative analysis between models with and without BFFs. We decompose rapport ratings into perceiver effects (people's tendency to rate other people), target effects (people's tendency to be rated by other people), and relationship effects (people's unique ratings for a specific person) using the social relations model. We then analyze the extent to which BFFs contribute to capturing each effect. Our analysis demonstrates that the perceiver's and the target's BFFs lead estimation models to capture the perceiver and the target effects, respectively. Furthermore, our experimental results indicate that the combinations of facial expression features and BFFs achieve best estimation performances not only in estimating rapport ratings, but also in estimating three effects. Our study is the first step toward understanding why personality-aware estimation models of interpersonal perception accomplish high estimation performance.

URL PDF HTML ☆

赞 0 踩 0

2409.06708 2026-06-16 cs.CY cs.AI cs.HC 版本更新

Ensuring Fairness with Transparent Auditing of Quantitative Bias in AI Systems

通过透明审计量化偏见确保公平性

Chih-Cheng Rex Yuan, Bow-Yaw Wang

发表机构 * Institute of Information Science, Academia Sinica（中科院信息所）

AI总结本文提出了一种透明的AI公平性审计框架，通过第三方审计员和系统提供者共同参与，结合统计方法公开审查AI系统，以解决AI决策中的偏见问题。

详情

DOI: 10.23919/PNC63053.2024.10697374
Journal ref: Proc. 2024 Pacific Neighborhood Consortium Annual Conference and Joint Meetings (PNC), Seoul, Republic of Korea, 2024, pp. 25-32

AI中文摘要

随着AI的快速发展，将其整合到决策过程中成为趋势。然而，AI系统可能表现出偏见，导致决策者做出不公平的结论。值得注意的是，美国司法系统中用于评估再犯风险的COMPAS系统被发现偏袒多数种族群体；具体而言，它违反了称为'均衡机会'的公平标准。已提出多种评估AI公平性的措施。我们提出了一种审计AI公平性的框架，涉及第三方审计员和AI系统提供者，并创建了一个工具来系统审查AI系统。该工具是开源且公开可用的。与传统AI系统不同，我们倡导透明的白盒和基于统计的方法。该方法可用于第三方审计员、AI开发者或公众在判断AI系统公平性标准时进行参考。

英文摘要

With the rapid advancement of AI, there is a growing trend to integrate AI into decision-making processes. However, AI systems may exhibit biases that lead decision-makers to draw unfair conclusions. Notably, the COMPAS system used in the American justice system to evaluate recidivism was found to favor racial majority groups; specifically, it violates a fairness standard called equalized odds. Various measures have been proposed to assess AI fairness. We present a framework for auditing AI fairness, involving third-party auditors and AI system providers, and we have created a tool to facilitate systematic examination of AI systems. The tool is open-sourced and publicly available. Unlike traditional AI systems, we advocate a transparent white-box and statistics-based approach. It can be utilized by third-party auditors, AI developers, or the general public for reference when judging the fairness criterion of AI systems.

URL PDF HTML ☆

赞 0 踩 0

1. 智能体、规划与决策 45 篇

Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation

APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents

S1-DeepResearch: Beyond Search, Toward Real-World Long-Horizon Research Agents

Towards End-to-End Automation of AI Research

Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents

STRIDE: Strategic Trajectory Reasoning via Discriminative Estimation for Verifiable Reinforcement Learning

LLM-as-Code Agentic Programming for Agent Harness

Agentic Framework for Deep Learning workload migration via In-Context Learning

LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

User as Code: Executable Memory for Personalized Agents

Skill-to-LoRA: From Using Skills to Learning Behaviors for Token-Efficient LLM Agents

OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models

GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents

Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification

When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning

FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

EChO-Agent: Evidence Chain Orchestration Agent for Audio Reasoning

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

Not All Skills Help: Measuring and Repairing Agent Knowledge

On-Policy Distillation with Curriculum Turn-level Guidance for Multi-turn Agents

Orchestrated Reality: From Role-Play to Living, Playable Game Worlds -- LLM-Driven World Simulation as a Parameterized-Action POMDP

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

RL-Index: Reinforcement Learning for Retrieval Index Reasoning

ACCORD: Action-Conditioned Contextual Grounding for Language Agents

Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

VeriGraph: Towards Verifiable Data-Analytic Agents

TokenPilot: Cache-Efficient Context Management for LLM Agents

PISA: A Pragmatic Psych-Inspired Unified Memory System for Enhanced AI Agency

ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation

MemPO: Self-Memory Policy Optimization for Long-Horizon Agents

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents

Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher

Safe Exploration via Policy Priors

EffGen: Enabling Small Language Models as Capable Autonomous Agents

AgenticRec: A Recommendation-Oriented Agentic Framework with Progressive Tool-Integrated Reasoning Optimization

Closing the Auto-Research Loop: An AI Co-Scientist for Production Search Ranking

From Overload to Convergence: Supporting Multi-Issue Human-AI Negotiation with Bayesian Visualization

Active Inference with a Self-Prior in the Mirror-Mark Task

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

2. 知识表示、推理与符号AI 18 篇

Relational Structural Causal Models

PrologMCP: A Standardized Prolog Tool Interface for LLM Agents

VGPT-RSI for RH-Adjacent Formal Progress: Boundary Certificates, Verified Finite Lagarias Inequalities, and Explicit Failure Localization

A Formal Framework for Declarative Agentic AI in Business Process Analysis

Overcoming the Impedance Mismatch: A Theoretical Roadmap for Fusing Foundation Models and Knowledge Graphs

Know Your Limits : On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning

Model Graph Inductive Learning for Knowledge Graph Completion

Symbolic Informalization: Fluent, Productive, Multilingual

A Causal Model of Theory of Mind in Conflict for Artificial Intelligence

Provenance-Enhanced Statements in Knowledge Graphs

The algebra of Krom logic programs

Theorem-Grounded Execution Ontologies for Interpretable Machine Reasoning

Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving

Unifying Post-hoc Explanations of Knowledge Graph Completions

Interpretation as Linear Transformation: A Cognitive-Geometric Model of Concepts and Meaning

Edit Knowledge, Not Just Facts via Multi-Step Reasoning over Background Stories

The Initial Exploration Problem in Knowledge Graph Exploration

Beyond Predefined Schemas: TRACE-KG for Context-Enriched Knowledge Graph Generation

3. 多智能体与博弈 28 篇

Trust Between AI Agents: Measuring Formation, Breakage, and Recovery, with Implications for Governing Multi-Agent Systems

Synthetic Counteradaptation: A Principle of Human-AI Co-evolution

Multi-agent Framework for Time-Sensitive Complementary Collaboration in Minecraft

AdaSTORM: Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration

Phase-Aware Guidance Injection for Recurrent MAPPO in Assembly-Line Disruption Recovery

Tensor-Coord: Algebraic Decomposition of Joint Plan Tensors for Conflict-Free Multi-Agent LLM Planning

Evaluation of Alternative-Based Information Systems for Deliberative Polling using an Agentic Simulator

Poster: EdgeCitadel -- Hybrid NATS-MQTT Orchestration for Edge Multi-Agent Systems

Divide-and-Denoise: A Game-Theoretic Method for Fairly Composing Diffusion Models

XFlow: An Executable Protocol Programming System for Reliable Multi-Agent Workflows

Knowledge-Based Zero-Replay Debugging of Multi-Agent LLM Traces

Resilient Consensus in Agentic AI

CoAgent: Concurrency Control for Multi-Agent Systems

DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts